CoreGen: Contextualized Code Representation Learning for Commit Message Generation

Paper Title

CoreGen: Contextualized Code Representation Learning for Commit Message Generation

Authors

Nie, Lun Yiu; Gao, Cuiyun; Zhong, Zhicong; Lam, Wai; Liu, Yang; Xu, Zenglin

Abstract

Automatic generation of high-quality commit messages for code commits can substantially facilitate software developers' work and coordination. However, the semantic gap between source code and natural language poses a major challenge for the task. Several approaches have been proposed to alleviate the challenge, but none explicitly involves code contextual information during commit message generation. Specifically, existing research adopts static embeddings for code tokens, which map a token to the same vector regardless of its context. In this paper, we propose a novel Contextualized code representation learning strategy for commit message Generation (CoreGen). CoreGen first learns contextualized code representations that exploit the contextual information behind code commit sequences. The learned representations of code commits, built upon the Transformer, are then fine-tuned for downstream commit message generation. Experiments on the benchmark dataset demonstrate the superior effectiveness of our model over the baseline models, with at least 28.18% improvement in terms of BLEU-4 score. Furthermore, we highlight future opportunities in training contextualized code representations on larger code corpora as a solution to low-resource tasks, and in adapting the contextualized code representation framework to other code-to-text generation tasks.
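
The distinction the abstract draws between static and contextualized token embeddings can be illustrated with a toy sketch. This is not CoreGen's model, which builds on the Transformer; the embedding table, the `static_embed` and `contextual_embed` helpers, and the neighbour-averaging scheme are all assumptions made up for illustration. The only point is that a static lookup gives the token `x` the same vector in every commit, while a context-dependent function does not.

```python
# Toy illustration only: NOT CoreGen's model, which builds on the Transformer.
# The table values and both helper functions are hypothetical.
STATIC_TABLE = {
    "return": [1.0, 0.0],
    "x": [0.0, 1.0],
    "y": [0.5, 0.5],
    "+": [0.2, 0.8],
}


def static_embed(tokens):
    """Look each token up in a fixed table, ignoring its neighbours."""
    return [STATIC_TABLE[t] for t in tokens]


def contextual_embed(tokens, window=1):
    """Average each token's static vector with those of its neighbours.

    A crude stand-in for contextualization: the output for a token
    depends on the tokens around it.
    """
    static = static_embed(tokens)
    out = []
    for i in range(len(tokens)):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        neighbourhood = static[lo:hi]
        dim = len(static[i])
        out.append(
            [sum(v[d] for v in neighbourhood) / len(neighbourhood) for d in range(dim)]
        )
    return out


seq_a = ["return", "x"]
seq_b = ["y", "+", "x"]

# Static: "x" maps to the same vector in both sequences.
same_static = static_embed(seq_a)[1] == static_embed(seq_b)[2]
# Contextual: the vector for "x" differs because its neighbours differ.
same_contextual = contextual_embed(seq_a)[1] == contextual_embed(seq_b)[2]
print(same_static, same_contextual)
```

In a learned model the mixing of neighbouring tokens would be done by trained attention layers rather than a fixed average, but the contrast between the two lookups is the same.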
