Paper Title

ChemBERTa-2: Towards Chemical Foundation Models

Paper Authors

Walid Ahmad, Elana Simon, Seyone Chithrananda, Gabriel Grand, Bharath Ramsundar

Paper Abstract

Large pretrained models such as GPT-3 have had tremendous impact on modern natural language processing by leveraging self-supervised learning to learn salient representations that can be used to readily finetune on a wide variety of downstream tasks. We investigate the possibility of transferring such advances to molecular machine learning by building a chemical foundation model, ChemBERTa-2, using the language of SMILES. While labeled data for molecular prediction tasks is typically scarce, libraries of SMILES strings are readily available. In this work, we build upon ChemBERTa by optimizing the pretraining process. We compare multi-task and self-supervised pretraining by varying hyperparameters and pretraining dataset size, up to 77M compounds from PubChem. To our knowledge, the 77M set constitutes one of the largest datasets used for molecular pretraining to date. We find that with these pretraining improvements, we are competitive with existing state-of-the-art architectures on the MoleculeNet benchmark suite. We analyze the degree to which improvements in pretraining translate to improvement on downstream tasks.
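
The abstract describes the standard foundation-model recipe: pretrain a transformer on a large unlabeled SMILES corpus, then fine-tune it on small labeled molecular property datasets such as those in MoleculeNet. The sketch below illustrates that fine-tuning step with Hugging Face transformers. It is a minimal sketch, assuming the ChemBERTa-2 checkpoints released under the DeepChem organization on the Hugging Face Hub; the SMILES strings and regression targets are hypothetical placeholders, not data from the paper.

```python
# Minimal sketch: fine-tune a SMILES-pretrained transformer on a toy
# regression task, mirroring the pretrain-then-finetune workflow the
# abstract describes.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumption: checkpoint pretrained on ~77M PubChem SMILES with the
# masked-language-model objective, published on the Hugging Face Hub.
checkpoint = "DeepChem/ChemBERTa-77M-MLM"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=1, problem_type="regression"
)

# Hypothetical labeled molecules standing in for a MoleculeNet-style task.
smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]
labels = torch.tensor([[-0.77], [-0.90], [-1.20]])  # placeholder targets

# Tokenize SMILES strings and compute the regression (MSE) loss.
batch = tokenizer(smiles, padding=True, return_tensors="pt")
outputs = model(**batch, labels=labels)

# One standard fine-tuning step: backpropagate the loss through the
# pretrained encoder plus the newly initialized regression head.
outputs.loss.backward()
print(float(outputs.loss))
```

In practice the same pattern applies whether the encoder was pretrained with the self-supervised MLM objective or the multi-task regression objective compared in the paper; only the choice of checkpoint changes.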
