Paper Title
A Complete Set of Related Git Repositories Identified via Community Detection Approaches Based on Shared Commits
Paper Authors
Paper Abstract
In order to understand the state and evolution of the entirety of open source software, we need to get a handle on the set of distinct software projects. Most open source projects presently use Git, a distributed version control system that makes it easy to create clones and has resulted in numerous repositories that are almost entirely based on the parent repository from which they were cloned. Git commits are based on a Merkle tree, so two identical commits are highly unlikely to be produced independently. Shared commits therefore appear to be an excellent way to group cloned repositories and obtain an accurate map of such repositories. We use the World of Code infrastructure, which contains approximately 2B commits and 100M repositories, to create and share such a map. We discover that the largest group contains almost 14M repositories, most of which are unrelated to each other. As it turns out, developers can push Git objects to an arbitrary repository or pull objects from unrelated repositories, thus linking unrelated repositories. To address this, we apply the Louvain community detection algorithm to this very large graph consisting of links between commits and projects. The approach successfully reduces the size of the megacluster: the largest group of highly interconnected projects contains under 100K repositories. We expect that the resulting map of related projects, as well as the tools and methods for handling the very large graph, will serve as a reference set for mining software projects and other applications. Further work is needed to determine the different types of relationships among projects induced by shared commits and by other relationships, for example, shared source code or similar filenames.
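The grouping step described in the abstract can be illustrated with a minimal sketch. It assumes the data is available as a mapping from repository name to a set of commit hashes, treats repositories and commits as nodes of a bipartite graph, and finds connected components with a union-find structure. The function name `group_by_shared_commits`, the repository names, and the commit identifiers are hypothetical and chosen only for illustration; this is not the World of Code tooling, and it does not implement the Louvain step.

```python
from collections import defaultdict


def group_by_shared_commits(repo_commits):
    """Group repositories linked, directly or transitively, by a shared commit.

    repo_commits: dict mapping repository name -> set of commit hashes
    (a hypothetical input format assumed for this sketch).
    Returns a sorted list of groups, each a sorted list of repository names.
    """
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then walk up to the root.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Bipartite graph: each repository node is linked to its commit nodes.
    for repo, commits in repo_commits.items():
        find(("repo", repo))
        for commit in commits:
            union(("repo", repo), ("commit", commit))

    # Collect repositories by the root of their component.
    groups = defaultdict(set)
    for repo in repo_commits:
        groups[find(("repo", repo))].add(repo)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical example: two projects, each with one fork.
repos = {
    "alice/app": {"c1", "c2"},
    "bob/app-fork": {"c1", "c2", "c3"},
    "carol/lib": {"c7", "c8"},
    "dave/lib-fork": {"c8", "c9"},
}
print(group_by_shared_commits(repos))

# One repository that pulled objects from both projects merges the groups.
repos["eve/misc"] = {"c3", "c9"}
print(group_by_shared_commits(repos))
```

In this toy input, the first call yields two groups of two repositories each, while adding the single repository that holds commits from both projects collapses everything into one group of five. This is a small-scale analogue of the megacluster effect, and it suggests why a community detection method such as Louvain, which looks at link density rather than mere reachability, is better suited to separating such loosely connected projects.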