HELoC: Hierarchical Contrastive Learning of Source Code Representation

Paper Title

HELoC: Hierarchical Contrastive Learning of Source Code Representation

Authors

Wang, Xiao; Wu, Qiong; Zhang, Hongyu; Lyu, Chen; Jiang, Xue; Zheng, Zhuoran; Lyu, Lei; Hu, Songlin

Abstract

Abstract syntax trees (ASTs) play a crucial role in source code representation. However, due to the large number of nodes in an AST and the typically deep AST hierarchy, it is challenging to learn the hierarchical structure of an AST effectively. In this paper, we propose HELoC, a hierarchical contrastive learning model for source code representation. To effectively learn the AST hierarchy, we use contrastive learning to allow the network to predict the AST node level and learn the hierarchical relationships between nodes in a self-supervised manner, which makes the representation vectors of nodes with greater differences in AST levels farther apart in the embedding space. By using such vectors, the structural similarities between code snippets can be measured more precisely. In the learning process, a novel GNN (called Residual Self-attention Graph Neural Network, RSGNN) is designed, which enables HELoC to focus on embedding the local structure of an AST while capturing its overall structure. HELoC is self-supervised and can be applied to many source code related downstream tasks such as code classification, code clone detection, and code clustering after pre-training. Our extensive experiments demonstrate that HELoC outperforms the state-of-the-art source code representation models.
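
The abstract describes predicting the level of each AST node as a self-supervised signal. As a rough illustration of where such level labels could come from, the sketch below computes the level (depth from the root) of every node in a small snippet. It uses Python's standard `ast` module as a stand-in parser; the parser, the exact definition of "level", and the helper name `node_levels` are assumptions made for this example and are not taken from the paper. It does not implement the contrastive objective or RSGNN.

```python
import ast
from collections import Counter


def node_levels(source: str) -> list[tuple[str, int]]:
    """Return (node type name, level) pairs, with the root at level 0.

    Hypothetical helper: the paper's own level definition may differ.
    """
    tree = ast.parse(source)
    levels = []
    stack = [(tree, 0)]  # iterative depth-first traversal
    while stack:
        node, level = stack.pop()
        levels.append((type(node).__name__, level))
        for child in ast.iter_child_nodes(node):
            stack.append((child, level + 1))
    return levels


snippet = "def add(a, b):\n    return a + b\n"
levels = node_levels(snippet)

# Number of nodes found at each level; in a level-prediction setup,
# these levels would serve as labels obtained without manual annotation.
histogram = Counter(level for _, level in levels)

for name, level in sorted(levels, key=lambda pair: pair[1]):
    print(f"{'  ' * level}{name} (level {level})")
```

Because the labels come directly from the tree structure, no manual annotation is needed, which is consistent with the self-supervised setting the abstract describes.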
