Methods2Test: A dataset of focal methods mapped to test cases

Paper title

Methods2Test: A dataset of focal methods mapped to test cases

Authors

Michele Tufano, Shao Kun Deng, Neel Sundaresan, Alexey Svyatkovskiy

Abstract

Unit testing is an essential part of the software development process: it helps identify issues with source code in the early stages of development and prevents regressions. Machine learning has emerged as a viable approach to help software developers generate automated unit tests. However, using machine learning to generate reliable unit test cases that are semantically correct and capable of catching software bugs or unintended behavior requires large, metadata-rich datasets. In this paper we present Methods2Test, a large, supervised dataset of test cases mapped to the corresponding methods under test (i.e., focal methods). The dataset contains 780,944 pairs of JUnit tests and focal methods, extracted from a total of 91,385 Java open source projects hosted on GitHub with licenses permitting re-distribution. The main challenge in creating Methods2Test was to establish a reliable mapping between a test case and the relevant focal method. To this end, we designed a set of heuristics, based on developers' best practices in software testing, which identify the likely focal method for a given test case. To facilitate further analysis, we store a rich set of metadata for each method-test pair in JSON-formatted files. Additionally, we extract a textual corpus from the dataset at different context levels, which we provide in both raw and tokenized forms, to enable researchers to train and evaluate machine learning models for Automated Test Generation. Methods2Test is publicly available at: https://github.com/microsoft/methods2test
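
The abstract does not spell out the mapping heuristics, so the following is only a minimal sketch of one plausible naming-convention heuristic, not the authors' implementation. It assumes a test method is named after the method it exercises with a "test" prefix or suffix, and that the candidate methods of the class under test are already known. The function names `strip_test_affix` and `guess_focal_method` are hypothetical.

```python
from typing import List, Optional


def strip_test_affix(name: str) -> str:
    """Remove a leading or trailing 'test' (case-insensitive) from a name."""
    lowered = name.lower()
    if lowered.startswith("test"):
        return name[4:].lstrip("_")
    if lowered.endswith("test"):
        return name[:-4].rstrip("_")
    return name


def guess_focal_method(test_name: str, candidate_methods: List[str]) -> Optional[str]:
    """Return the single candidate whose name matches the test name, else None.

    Assumption for illustration: a match is a case-insensitive equality
    between the candidate name and the test name with 'test' stripped.
    Ambiguous cases (zero or several matches) yield None.
    """
    target = strip_test_affix(test_name).lower()
    if not target:
        return None
    matches = [m for m in candidate_methods if m.lower() == target]
    return matches[0] if len(matches) == 1 else None


if __name__ == "__main__":
    methods = ["addItem", "removeItem", "size"]
    print(guess_focal_method("testAddItem", methods))
    print(guess_focal_method("testClearAll", methods))
```

Returning `None` on ambiguity, rather than picking an arbitrary candidate, reflects the abstract's emphasis on a reliable mapping: a pair is kept only when the heuristic points at exactly one method.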
