Paper Title
A Dataset of Dockerfiles
Paper Authors
Paper Abstract
Dockerfiles are one of the most prevalent kinds of DevOps artifacts used in industry. Despite their prevalence, there is a lack of sophisticated semantics-aware static analysis of Dockerfiles. In this paper, we introduce a dataset of approximately 178,000 unique Dockerfiles collected from GitHub. To enhance the usability of this data, we describe five representations we have devised for working with, mining from, and analyzing these Dockerfiles. Each Dockerfile representation builds upon the previous ones, and the final representation, created by three levels of nested parsing and abstraction, makes tasks such as mining and static checking tractable. The Dockerfiles in each of the five representations, along with metadata and the tools used to shepherd the data from one representation to the next, are all available at: https://doi.org/10.5281/zenodo.3628771.
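To give a rough sense of what a first level of parsing might involve, the following Python sketch splits a Dockerfile into (instruction, arguments) pairs. It is an illustration only and is not the tooling distributed with the dataset: the function name `parse_dockerfile` and the sample Dockerfile are hypothetical, and the sketch ignores parser directives such as `# escape=`, here-documents, and the deeper levels of parsing (for example, parsing the shell code inside `RUN`) that the abstract alludes to.

```python
def parse_dockerfile(text):
    """Split Dockerfile text into (INSTRUCTION, arguments) pairs.

    Hypothetical helper for illustration; not part of the dataset's tools.
    """
    logical_lines = []
    buffer = ""
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines (also inside a continuation).
        if not line or line.startswith("#"):
            continue
        # A trailing backslash continues the instruction on the next line.
        if line.endswith("\\"):
            buffer += line[:-1].rstrip() + " "
            continue
        logical_lines.append(buffer + line)
        buffer = ""
    if buffer:
        # The file ended in the middle of a continuation.
        logical_lines.append(buffer.strip())

    directives = []
    for line in logical_lines:
        parts = line.split(None, 1)
        keyword = parts[0].upper()
        arguments = parts[1].strip() if len(parts) > 1 else ""
        directives.append((keyword, arguments))
    return directives


# Made-up sample input, used only to exercise the sketch.
example = """\
# build image
FROM ubuntu:18.04
RUN apt-get update && \\
    apt-get install -y curl
CMD ["curl", "--version"]
"""

for keyword, arguments in parse_dockerfile(example):
    print(keyword, "|", arguments)
```

A flat list like this only captures the top-level structure. The arguments of an instruction such as `RUN` would still need their own parser before the kind of mining and static checking described in the abstract becomes practical.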