Paper title
Summarising Big Data: Common GitHub Dataset for Software Engineering Challenges
Authors
Abstract
In open-source software development environments, the textual, numerical, and relationship-based data that are generated are of interest to researchers. Various datasets collect this data, which is frequently used in fields such as software engineering and natural language processing. However, because these datasets contain all the data in the environment, working with them means processing terabytes of data. For this reason, almost all studies that use GitHub data work on data filtered according to certain criteria. Because each study then uses a different dataset, comparing the accuracy of the studies is quite difficult. To solve this problem, a common dataset was created and shared with researchers, making it possible to work on many software engineering problems.