Paper title
Towards Using Data-Influence Methods to Detect Noisy Samples in Source Code Corpora
Paper authors
Abstract
Despite the recent trend of developing and applying neural source code models to software engineering tasks, the quality of such models remains insufficient for real-world use. This is because the source code corpora used to train such models may contain noise. In this paper, we adapt data-influence methods to detect such noise. Data-influence methods are used in machine learning to evaluate how similar a target sample is to correct samples, in order to determine whether the target sample is noisy. Our evaluation results show that data-influence methods can identify noisy samples for neural code models in classification-based tasks. This approach contributes to the larger vision of developing better neural source code models from a data-centric perspective, which is a key driver for developing source code models that are useful in practice.
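
The idea of scoring a target sample by its similarity to correct samples can be illustrated with a small sketch. This is not the method evaluated in the paper: it is a simplified stand-in that assumes each code sample has already been mapped to a feature vector (for example, by a neural code model) and that a small set of trusted, correctly labeled samples is available. The function names, the cosine-similarity score, the toy vectors, and the threshold of 0.0 are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def label_agreement_score(vec, label, trusted):
    """Hypothetical score: mean similarity to trusted samples that share the
    target's label, minus mean similarity to trusted samples with other labels.
    A low or negative score suggests the label disagrees with similar samples."""
    same = [cosine(vec, t) for t, l in trusted if l == label]
    other = [cosine(vec, t) for t, l in trusted if l != label]
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return mean(same) - mean(other)

def find_noisy(samples, trusted, threshold=0.0):
    """Return indices of samples whose score falls below the (assumed) threshold."""
    return [i for i, (vec, label) in enumerate(samples)
            if label_agreement_score(vec, label, trusted) < threshold]

# Toy data: 2-D vectors standing in for learned code representations.
trusted = [([1.0, 0.0], "sort"), ([0.9, 0.1], "sort"),
           ([0.0, 1.0], "search"), ([0.1, 0.9], "search")]
samples = [([0.95, 0.05], "sort"),    # consistent with trusted "sort" samples
           ([0.05, 0.95], "sort"),    # resembles "search" samples: likely mislabeled
           ([0.2, 0.8], "search")]    # consistent with trusted "search" samples

noisy = find_noisy(samples, trusted)
print(noisy)
```

In this toy setup only the second sample is flagged, because its vector lies close to the trusted "search" samples while carrying the "sort" label. Actual data-influence methods typically rely on model-dependent signals rather than raw vector similarity, so the sketch should be read only as an intuition for the classification setting described in the abstract.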