A general-purpose material property data extraction pipeline from large polymer corpora using Natural Language Processing

Paper title

A general-purpose material property data extraction pipeline from large polymer corpora using Natural Language Processing

Authors

Shetty, Pranav; Rajan, Arunkumar Chitteth; Kuenneth, Christopher; Gupta, Sonakshi; Panchumarti, Lakshmi Prerana; Holm, Lauren; Zhang, Chao; Ramprasad, Rampi

Abstract

The ever-increasing number of materials science articles makes it hard to infer chemistry-structure-property relations from published literature. We used natural language processing (NLP) methods to automatically extract material property data from the abstracts of polymer literature. As a component of our pipeline, we trained MaterialsBERT, a language model, using 2.4 million materials science abstracts, which outperforms other baseline models in three out of five named entity recognition datasets when used as the encoder for text. Using this pipeline, we obtained ~300,000 material property records from ~130,000 abstracts in 60 hours. The extracted data was analyzed for a diverse range of applications such as fuel cells, supercapacitors, and polymer solar cells to recover non-trivial insights. The data extracted through our pipeline is made available through a web platform at https://polymerscholar.org which can be used to locate material property data recorded in abstracts conveniently. This work demonstrates the feasibility of an automatic pipeline that starts from published literature and ends with a complete set of extracted material property information.
