Paper title
Topic Recommendation for Software Repositories using Multi-label Classification Algorithms
Paper authors
Abstract
Many platforms exploit collaborative tagging to provide their users with faster and more accurate results while searching or navigating. Tags can communicate different concepts such as the main features, technologies, functionality, and goal of a software repository. Recently, GitHub has enabled users to annotate repositories with topic tags. It also provides a set of featured topics, along with their possible aliases, carefully curated with the help of the community. This creates an opportunity to use this initial seed of topics to automatically annotate all remaining repositories by training models that recommend high-quality topic tags to developers. In this work, we study the application of multi-label classification techniques to predicting software repositories' topics. First, we map the large space of user-defined topics to those featured by GitHub. The core idea is to derive more information from projects' available documentation. Our data contains about $152$K GitHub repositories and $228$ featured topics. Then, we apply supervised models to repositories' textual information such as descriptions, README files, wiki pages, and file names. We assess the performance of our approach both quantitatively and qualitatively. Our proposed model achieves Recall@5 and LRAP scores of $0.890$ and $0.805$, respectively. Moreover, based on users' assessment, our approach is highly capable of recommending a correct and complete set of topics. Finally, we use our models to develop an online tool named \texttt{Repository Catalogue}, which automatically predicts topics for GitHub repositories and is publicly available.
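The two reported metrics, Recall@5 and LRAP (label ranking average precision), can be illustrated with a minimal sketch. It assumes the standard definitions of both metrics, which may differ in detail from the authors' evaluation code. The function names `recall_at_k` and `lrap` and the toy data are hypothetical, and repositories with no ground-truth topics are simply skipped, which is a simplification.

```python
def recall_at_k(y_true, y_score, k=5):
    """Mean over samples of |top-k predicted topics ∩ true topics| / |true topics|.

    y_true:  list of 0/1 indicator lists, one per repository (assumed layout).
    y_score: list of per-topic score lists with the same shape.
    """
    per_sample = []
    for truth, scores in zip(y_true, y_score):
        relevant = {i for i, t in enumerate(truth) if t}
        if not relevant:
            continue  # simplification: skip samples without any true topic
        # Indices of topics sorted by descending score; keep the first k.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        top_k = set(ranked[:k])
        per_sample.append(len(relevant & top_k) / len(relevant))
    return sum(per_sample) / len(per_sample) if per_sample else 0.0


def lrap(y_true, y_score):
    """Label ranking average precision under its standard definition.

    For each true topic j: (number of true topics scored >= score_j) divided by
    (rank of j, i.e. number of all topics scored >= score_j). The result is
    averaged over true topics, then over samples.
    """
    per_sample = []
    for truth, scores in zip(y_true, y_score):
        relevant = [i for i, t in enumerate(truth) if t]
        if not relevant:
            continue  # simplification: skip samples without any true topic
        total = 0.0
        for j in relevant:
            rank = sum(1 for s in scores if s >= scores[j])
            hits = sum(1 for i in relevant if scores[i] >= scores[j])
            total += hits / rank
        per_sample.append(total / len(relevant))
    return sum(per_sample) / len(per_sample) if per_sample else 0.0


# Toy data (hypothetical): two repositories, three candidate topics.
toy_true = [[1, 0, 0], [0, 0, 1]]
toy_score = [[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]]

print(recall_at_k(toy_true, toy_score, k=2))
print(lrap(toy_true, toy_score))
```

With $228$ candidate topics, Recall@5 asks how many of a repository's true topics appear among the five highest-scored ones, while LRAP rewards rankings that place true topics ahead of irrelevant ones across the whole list.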