Paper Title
Targeting SARS-CoV-2 with AI- and HPC-enabled Lead Generation: A First Data Release
Authors
Abstract
Researchers across the globe are seeking to rapidly repurpose existing drugs or discover new drugs to counter the novel coronavirus disease (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). One promising approach is to train machine learning (ML) and artificial intelligence (AI) tools to screen large numbers of small molecules. As a contribution to that effort, we are aggregating numerous small molecules from a variety of sources, using high-performance computing (HPC) to compute diverse properties of those molecules, using the computed properties to train ML/AI models, and then using the resulting models for screening. In this first data release, we make available 23 datasets collected from community sources, representing over 4.2 B molecules, enriched with pre-computed: 1) molecular fingerprints to aid similarity searches, 2) 2D images of molecules to enable exploration and application of image-based deep learning methods, and 3) 2D and 3D molecular descriptors to speed development of machine learning models. This data release encompasses structural information on the 4.2 B molecules and 60 TB of pre-computed data. Future releases will expand the data to include more detailed molecular simulations, computed models, and other products.
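To illustrate how pre-computed fingerprints can aid similarity searches, the following is a minimal sketch only. It assumes each fingerprint is represented as a set of on-bit indices and uses the Tanimoto coefficient, a common choice for comparing fingerprints; the abstract does not state which fingerprint type or similarity measure the release uses. The function names (`tanimoto`, `top_k_similar`) and the toy molecules are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def top_k_similar(
    query: FrozenSet[int],
    library: Dict[str, FrozenSet[int]],
    k: int = 3,
) -> List[Tuple[float, str]]:
    """Return the k library entries most similar to the query, best first."""
    scored = [(tanimoto(query, fp), name) for name, fp in library.items()]
    # Sort by descending similarity, breaking ties by name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]


# Hypothetical toy fingerprints; real fingerprints have far more bits.
library = {
    "mol_a": frozenset({1, 4, 7, 9}),
    "mol_b": frozenset({1, 4, 8}),
    "mol_c": frozenset({2, 3, 5}),
}
query = frozenset({1, 4, 7})
hits = top_k_similar(query, library, k=2)
print(hits)
```

Because the fingerprints are computed once and stored, a search like this only compares bit sets and never needs to re-derive features from molecular structures, which is what makes screening billions of molecules tractable.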