Paper Title
Tackling Long Code Search with Splitting, Encoding, and Aggregating
Paper Authors
Paper Abstract
Code search with natural language helps us reuse existing code snippets. Thanks to Transformer-based pre-trained models, the performance of code search has improved significantly. However, due to the quadratic complexity of multi-head self-attention, there is a limit on the input token length. For efficient training on standard GPUs such as the V100, existing pre-trained code models, including GraphCodeBERT, CodeBERT, and RoBERTa (code), take only the first 256 tokens by default, which makes them unable to represent the complete information of long code exceeding 256 tokens. To tackle the long code problem, we propose a new baseline SEA (Split, Encode and Aggregate), which splits long code into code blocks, encodes these blocks into embeddings, and aggregates them to obtain a comprehensive long code representation. With SEA, we can directly use Transformer-based pre-trained models to model long code without changing their internal structure or re-pretraining. We also compare SEA with sparse Transformer methods. With GraphCodeBERT as the encoder, SEA achieves an overall mean reciprocal rank (MRR) score of 0.785, which is 10.1% higher than GraphCodeBERT on the CodeSearchNet benchmark, justifying SEA as a strong baseline for long code search. Our source code and experimental data are available at: https://github.com/fly-dragon211/SEA.
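To make the split-encode-aggregate idea concrete, below is a minimal sketch of how a long code snippet could be turned into a single embedding with an off-the-shelf encoder. The block size of 256, the use of the [CLS]-position embedding per block, mean pooling as the aggregation step, and the function name `encode_long_code` are illustrative assumptions, not the paper's exact configuration; see the authors' repository for the actual implementation.

```python
# Illustrative sketch of split -> encode -> aggregate (not the paper's exact method).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
encoder = AutoModel.from_pretrained("microsoft/graphcodebert-base")
encoder.eval()

def encode_long_code(code: str, block_size: int = 256) -> torch.Tensor:
    """Return a single embedding for `code`, regardless of its length."""
    # 1) Split: tokenize the whole snippet, then cut it into fixed-size blocks
    #    so each block fits within the encoder's input budget.
    ids = tokenizer(code, add_special_tokens=False)["input_ids"]
    blocks = [ids[i:i + block_size] for i in range(0, len(ids), block_size)] or [[]]

    # 2) Encode: run each block through the pre-trained encoder and keep the
    #    embedding at the [CLS] position as the block representation.
    cls_id, sep_id = tokenizer.cls_token_id, tokenizer.sep_token_id
    block_embs = []
    with torch.no_grad():
        for block in blocks:
            input_ids = torch.tensor([[cls_id] + block + [sep_id]])
            out = encoder(input_ids=input_ids)
            block_embs.append(out.last_hidden_state[:, 0, :])  # shape [1, hidden]

    # 3) Aggregate: mean-pool the block embeddings into one code vector
    #    (mean pooling is just one possible aggregation choice).
    return torch.cat(block_embs, dim=0).mean(dim=0)  # shape [hidden]
```

At retrieval time, a natural-language query would be embedded with the same encoder (queries are typically short enough not to need splitting) and matched against the aggregated code embeddings with a similarity measure such as cosine similarity.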