Paper Title
Contrastive Code Representation Learning
Paper Authors
Paper Abstract
Recent work learns contextual representations of source code by reconstructing tokens from their context. For downstream semantic understanding tasks like summarizing code in English, these representations should ideally capture program functionality. However, we show that the popular reconstruction-based BERT model is sensitive to source code edits, even when the edits preserve semantics. We propose ContraCode: a contrastive pre-training task that learns code functionality, not form. ContraCode pre-trains a neural network to identify functionally similar variants of a program among many non-equivalent distractors. We scalably generate these variants using an automated source-to-source compiler as a form of data augmentation. Contrastive pre-training improves JavaScript summarization and TypeScript type inference accuracy by 2% to 13%. We also propose a new zero-shot JavaScript code clone detection dataset, showing that ContraCode is both more robust and semantically meaningful. On it, we outperform RoBERTa by 39% AUROC in an adversarial setting and up to 5% on natural code.
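The abstract describes contrastive pre-training over compiler-generated program variants. Below is a minimal sketch, not the authors' implementation, of how such a step could look with an InfoNCE-style objective: two semantics-preserving variants of each program serve as a positive pair, and the other programs in the batch act as non-equivalent distractors. The `encoder`, `tokenize`, and `augment` functions are hypothetical placeholders standing in for the paper's model and source-to-source compiler passes.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor: torch.Tensor, positive: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """anchor, positive: [batch, dim] embeddings of two variants of the same programs."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature          # [batch, batch] similarities
    labels = torch.arange(anchor.size(0), device=anchor.device)  # diagonal = true pairs
    return F.cross_entropy(logits, labels)

def pretrain_step(encoder, programs, augment, tokenize, optimizer):
    # augment(): a semantics-preserving source-to-source transform (e.g. identifier
    # renaming or dead-code insertion); here it stands in for compiler-based augmentation.
    view_a = [augment(p) for p in programs]
    view_b = [augment(p) for p in programs]
    z_a = encoder(tokenize(view_a))   # [batch, dim] embeddings of first variants
    z_b = encoder(tokenize(view_b))   # [batch, dim] embeddings of second variants
    loss = info_nce_loss(z_a, z_b)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this formulation the network is rewarded for mapping functionally equivalent programs to nearby embeddings regardless of surface form, which is the intuition behind learning "functionality, not form."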