Title

Does GNN Pretraining Help Molecular Representation?

Authors

Ruoxi Sun, Hanjun Dai, Adams Wei Yu

Abstract

Extracting informative representations of molecules using graph neural networks (GNNs) is crucial in AI-driven drug discovery. Recently, the graph research community has been trying to replicate the success of self-supervised pretraining in natural language processing, with several successes claimed. However, we find that the benefit brought by self-supervised pretraining on small molecular data can be negligible in many cases. We conduct thorough ablation studies on the key components of GNN pretraining, including pretraining objectives, data splitting methods, input features, pretraining dataset scales, and GNN architectures, to see how they affect the accuracy of downstream tasks. Our first important finding is that, in many settings, self-supervised graph pretraining does not offer statistically significant advantages over non-pretrained methods. Secondly, although noticeable improvement can be observed with additional supervised pretraining, the improvement may diminish with richer features or more balanced data splits. Thirdly, hyper-parameters could have a larger impact on the accuracy of downstream tasks than the choice of pretraining tasks, especially when the scale of the downstream task is small. Finally, we offer our conjecture that the complexity of some pretraining methods on small molecules may be insufficient, followed by empirical evidence on different pretraining datasets.
