Paper Title

Protein Representation Learning by Geometric Structure Pretraining

Authors

Zuobai Zhang, Minghao Xu, Arian Jamasb, Vijil Chenthamarakshan, Aurelie Lozano, Payel Das, Jian Tang

Abstract

Learning effective protein representations is critical in a variety of tasks in biology such as predicting protein function or structure. Existing approaches usually pretrain protein language models on a large number of unlabeled amino acid sequences and then finetune the models with some labeled data in downstream tasks. Despite the effectiveness of sequence-based approaches, the power of pretraining on known protein structures, which are available in smaller numbers only, has not been explored for protein property prediction, though protein structures are known to be determinants of protein function. In this paper, we propose to pretrain protein representations according to their 3D structures. We first present a simple yet effective encoder to learn the geometric features of a protein. We pretrain the protein graph encoder by leveraging multiview contrastive learning and different self-prediction tasks. Experimental results on both function prediction and fold classification tasks show that our proposed pretraining methods outperform or are on par with the state-of-the-art sequence-based methods, while using much less pretraining data. Our implementation is available at https://github.com/DeepGraphLearning/GearNet.
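
As a rough illustration of the multiview contrastive pretraining described in the abstract, here is a minimal sketch. This is not the authors' GearNet implementation: `info_nce_loss`, `pretraining_step`, `encoder`, `augment`, `graphs`, `optimizer`, and the temperature `tau` are all assumed names, with `augment` standing in for whatever random substructure sampling produces the two views of each protein.

```python
# Minimal sketch of multiview contrastive pretraining (not the authors'
# GearNet code). `encoder`, `augment`, `graphs`, and `optimizer` are
# assumed placeholders supplied by the caller.
import torch
import torch.nn.functional as F

def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE: embeddings of two views of the same protein are positives;
    all other proteins in the batch serve as negatives."""
    z1 = F.normalize(z1, dim=-1)               # (B, D) view-1 embeddings
    z2 = F.normalize(z2, dim=-1)               # (B, D) view-2 embeddings
    logits = z1 @ z2.t() / tau                 # (B, B) scaled cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)  # diagonal = positive pairs
    return F.cross_entropy(logits, targets)

def pretraining_step(encoder, graphs, augment, optimizer):
    """One update: encode two random views of each protein graph and
    pull matching pairs together while pushing other proteins apart."""
    z1 = encoder(augment(graphs))              # embedding of view 1, shape (B, D)
    z2 = encoder(augment(graphs))              # embedding of view 2, shape (B, D)
    loss = info_nce_loss(z1, z2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The sketch leaves the view-sampling strategy abstract and omits the paper's self-prediction pretraining tasks (such as masked residue-type prediction), which would be formulated as separate objectives over the same structure encoder.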
