Paper Title

On the Effectiveness of Pretrained Models for API Learning

Paper Authors

Mohammad Abdul Hadi, Imam Nur Bani Yusuf, Ferdian Thung, Kien Gia Luong, Lingxiao Jiang, Fatemeh H. Fard, David Lo

Paper Abstract

Developers frequently use APIs to implement certain functionalities, such as parsing Excel files, reading and writing text files line by line, etc. Developers can greatly benefit from automatic API usage sequence generation based on natural language queries for building applications in a faster and cleaner manner. Existing approaches utilize information retrieval models to search for matching API sequences given a query, or use an RNN-based encoder-decoder to generate API sequences. As it stands, the first approach treats queries and API names as bags of words. It lacks deep comprehension of the semantics of the queries. The latter approach adapts a neural language model to encode a user query into a fixed-length context vector and generate API sequences from the context vector. We want to understand the effectiveness of recent Pre-trained Transformer-based Models (PTMs) for the API learning task. These PTMs are trained on large natural language corpora in an unsupervised manner to retain contextual knowledge about the language and have found success in solving similar Natural Language Processing (NLP) problems. However, the applicability of PTMs has not yet been explored for the API sequence generation task. We use a dataset that contains 7 million annotations collected from GitHub to evaluate the PTMs empirically. This dataset was also used to assess previous approaches. Based on our results, PTMs generate more accurate API sequences and outperform other related methods by around 11%. We have also identified two different tokenization approaches that can contribute to a significant boost in PTMs' performance for the API sequence generation task.
