Paper Title

Multi-task Learning based Pre-trained Language Model for Code Completion

Paper Authors

Fang Liu, Ge Li, Yunfei Zhao, Zhi Jin

Paper Abstract


Code completion is one of the most useful features in integrated development environments (IDEs); it can accelerate software development by suggesting the next probable token based on the contextual code in real time. Recent studies have shown that statistical language modeling techniques can improve the performance of code completion tools by learning from large-scale software repositories. However, these models suffer from two major drawbacks: a) existing research uses static embeddings, which map a word to the same vector regardless of its context. The differences in the meaning of a token in varying contexts are lost when each token is associated with a single representation; b) existing language-model-based code completion models perform poorly on completing identifiers, and the type information of the identifiers is ignored in most of these models. To address these challenges, in this paper we develop a multi-task learning based pre-trained language model for code understanding and code generation with a Transformer-based neural architecture. We pre-train it with hybrid objective functions that incorporate both code understanding and code generation tasks. Then we fine-tune the pre-trained model on code completion. During completion, our model does not directly predict the next token. Instead, we adopt multi-task learning to predict the token and its type jointly and utilize the predicted type to assist the token prediction. Experimental results on two real-world datasets demonstrate the effectiveness of our model compared with state-of-the-art methods.
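The completion mechanism described in the abstract (jointly predicting the next token and its type, with the type prediction assisting token prediction) can be illustrated with a minimal sketch. The sketch below assumes PyTorch, and all module names, layer shapes, and the loss weight `alpha` are illustrative assumptions; it is not the authors' implementation, only one plausible way to realize a multi-task head over a Transformer decoder's hidden states.

```python
# A minimal sketch (not the paper's released code) of a multi-task completion
# head: predict the next token's *type*, then use that prediction to assist
# next-*token* prediction. Names and sizes are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskCompletionHead(nn.Module):
    def __init__(self, d_model: int, vocab_size: int, num_types: int):
        super().__init__()
        self.type_head = nn.Linear(d_model, num_types)       # predicts token type (e.g. identifier, keyword)
        self.type_embed = nn.Linear(num_types, d_model)      # projects the type distribution back into hidden space
        self.token_head = nn.Linear(2 * d_model, vocab_size) # predicts the token, conditioned on the type signal

    def forward(self, hidden: torch.Tensor):
        # hidden: (batch, seq_len, d_model), e.g. the output of a Transformer decoder
        type_logits = self.type_head(hidden)
        type_probs = F.softmax(type_logits, dim=-1)
        type_signal = self.type_embed(type_probs)            # soft type embedding assists token prediction
        token_logits = self.token_head(torch.cat([hidden, type_signal], dim=-1))
        return token_logits, type_logits

def joint_loss(token_logits, type_logits, token_targets, type_targets, alpha: float = 0.5):
    # Multi-task objective: weighted sum of token and type cross-entropy.
    # alpha is an assumed weighting hyperparameter, not taken from the paper.
    token_loss = F.cross_entropy(token_logits.view(-1, token_logits.size(-1)), token_targets.view(-1))
    type_loss = F.cross_entropy(type_logits.view(-1, type_logits.size(-1)), type_targets.view(-1))
    return alpha * token_loss + (1 - alpha) * type_loss
```

One design note: feeding the soft type distribution (rather than a hard argmax) into the token head keeps the head fully differentiable, so both tasks can be trained end to end with the weighted joint loss.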
