Paper Title
Programming Language Agnostic Mining of Code and Language Pairs with Sequence Labeling Based Question Answering
Paper Authors
Paper Abstract
Mining aligned natural language (NL) and programming language (PL) pairs is a critical task for NL-PL understanding. Existing methods apply specialized hand-crafted features or separately trained models for each PL. However, they usually suffer from low transferability across PLs, especially for niche PLs with less annotated data. Fortunately, a Stack Overflow answer post is essentially a sequence of text and code blocks, and its global textual context can provide PL-agnostic supplementary information. In this paper, we propose a Sequence Labeling based Question Answering (SLQA) method to mine NL-PL pairs in a PL-agnostic manner. In particular, we propose to apply the BIO tagging scheme instead of the conventional binary scheme to mine code solutions, which are often composed of multiple blocks of a post. Experiments on current single-PL single-block benchmarks and a manually labeled cross-PL multi-block benchmark demonstrate the effectiveness and transferability of SLQA. We further present a parallel NL-PL corpus named Lang2Code, automatically mined with SLQA, which contains about 1.4M pairs across 6 PLs. Through statistical analysis and downstream evaluation, we demonstrate that Lang2Code is a large-scale, high-quality data resource for further NL-PL research.
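To make the BIO idea concrete, the following is a minimal sketch of how block-level BIO tags could be decoded into multi-block code solutions. It is not the authors' implementation: the function name `decode_bio`, the choice to tag only the code blocks of a post, and the toy tags are all assumptions made for illustration. Here `B` marks the first block of a solution, `I` a block that continues it, and `O` a block that belongs to no solution.

```python
from typing import List


def decode_bio(blocks: List[str], tags: List[str]) -> List[List[str]]:
    """Group blocks into solutions according to per-block BIO tags.

    Hypothetical helper for illustration only: `blocks` holds the code
    blocks of one answer post in order, and `tags` holds one of
    "B", "I", "O" for each block.
    """
    if len(blocks) != len(tags):
        raise ValueError("one tag per block is required")

    solutions: List[List[str]] = []
    current: List[str] = []
    for block, tag in zip(blocks, tags):
        if tag == "B":
            # A new solution starts; close the previous one if any.
            if current:
                solutions.append(current)
            current = [block]
        elif tag == "I" and current:
            # Continue the solution that is currently open.
            current.append(block)
        else:
            # "O", or an "I" with no open solution, ends any open span.
            if current:
                solutions.append(current)
            current = []
    if current:
        solutions.append(current)
    return solutions


# Toy post with five code blocks (made-up content and tags).
code_blocks = [
    "import requests",
    "r = requests.get(url)",
    "print(r.status_code)",
    "import urllib.request",
    "print('debug')",
]
tags = ["B", "I", "I", "B", "O"]
for solution in decode_bio(code_blocks, tags):
    print(solution)
```

In this toy example the first three blocks form one solution and the fourth block starts a second one. A plain binary tag per block (solution or not) would label blocks three and four identically, so the boundary between the two adjacent solutions would not be marked.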