Paper Title
The Open Catalyst 2020 (OC20) Dataset and Community Challenges
Paper Authors
Paper Abstract
Catalyst discovery and optimization is key to solving many societal and energy challenges, including solar fuels synthesis, long-term energy storage, and renewable fertilizer production. Despite considerable effort by the catalysis community to apply machine learning models to the computational catalyst discovery process, it remains an open challenge to build models that can generalize across both elemental compositions of surfaces and adsorbate identity/configurations, perhaps because datasets have been smaller in catalysis than in related fields. To address this, we developed the OC20 dataset, consisting of 1,281,040 Density Functional Theory (DFT) relaxations (~264,890,000 single-point evaluations) across a wide swath of materials, surfaces, and adsorbates (nitrogen, carbon, and oxygen chemistries). We supplemented this dataset with randomly perturbed structures, short-timescale molecular dynamics, and electronic structure analyses. The dataset comprises three central tasks indicative of day-to-day catalyst modeling and comes with pre-defined train/validation/test splits to facilitate direct comparisons with future model development efforts. We applied three state-of-the-art graph neural network models (CGCNN, SchNet, DimeNet++) to each of these tasks as baseline demonstrations for the community to build on. In almost every task, no upper limit on model size was identified, suggesting that even larger models are likely to improve on initial results. The dataset and baseline models are both provided as open resources, along with a public leaderboard to encourage community contributions to solve these important tasks.