A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code

Paper title

A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code

Authors

Nadezhda Chirkova, Sergey Troshin

Abstract

There is an emerging interest in the application of natural language processing models to source code processing tasks. One of the major problems in applying deep learning to software engineering is that source code often contains a lot of rare identifiers, resulting in huge vocabularies. We propose a simple, yet effective method, based on identifier anonymization, to handle out-of-vocabulary (OOV) identifiers. Our method can be treated as a preprocessing step and, therefore, allows for easy implementation. We show that the proposed OOV anonymization method significantly improves the performance of the Transformer in two code processing tasks: code completion and bug fixing.
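
The abstract does not spell out the preprocessing details, so the following is only a minimal sketch of the idea. It assumes that every token outside a fixed vocabulary is replaced by a numbered placeholder (VAR1, VAR2, ...) that is reused consistently within one snippet. The function names build_vocab and anonymize_oov, the VAR prefix, and the frequency-based vocabulary are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter


def build_vocab(token_sequences, max_size):
    """Keep the `max_size` most frequent tokens as the in-vocabulary set.

    Assumption: the vocabulary is chosen by frequency over a training corpus.
    """
    counts = Counter(tok for seq in token_sequences for tok in seq)
    return {tok for tok, _ in counts.most_common(max_size)}


def anonymize_oov(tokens, vocab, prefix="VAR"):
    """Replace each distinct out-of-vocabulary token with a placeholder.

    The same OOV token always maps to the same placeholder within one
    snippet, so the model can still see where an identifier is repeated.
    Returns the rewritten token list and the token-to-placeholder mapping.
    """
    mapping = {}
    anonymized = []
    for tok in tokens:
        if tok in vocab:
            anonymized.append(tok)
            continue
        if tok not in mapping:
            # Placeholders are numbered in order of first appearance.
            mapping[tok] = f"{prefix}{len(mapping) + 1}"
        anonymized.append(mapping[tok])
    return anonymized, mapping


if __name__ == "__main__":
    # Hypothetical vocabulary and snippet, for illustration only.
    known = {"def", "(", ")", ":", ",", "return", "+", "x"}
    snippet = ["def", "my_sum", "(", "x", ",", "yy", ")", ":",
               "return", "x", "+", "yy"]
    tokens, mapping = anonymize_oov(snippet, known)
    print(tokens)
    print(mapping)
```

Because the mapping is returned alongside the tokens, a prediction such as VAR2 can be translated back to the original identifier after the model runs. A real pipeline would also restrict the replacement to identifier tokens, which this sketch does not attempt.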
