Paper Title

The Rise of GitHub in Scholarly Publications

Authors

Emily Escamilla, Martin Klein, Talya Cooper, Vicky Rampin, Michele C. Weigle, Michael L. Nelson

Abstract

The definition of scholarly content has expanded to include the data and source code that contribute to a publication. While major archiving efforts to preserve conventional scholarly content, typically in PDFs (e.g., LOCKSS, CLOCKSS, Portico), are underway, no analogous effort has yet emerged to preserve the data and code referenced in those PDFs, particularly the scholarly code hosted online on Git Hosting Platforms (GHPs). Similarly, the Software Heritage Foundation is working to archive public source code, but there is value in archiving the issue threads, pull requests, and wikis that provide important context to the code while maintaining their original URLs. In current implementations, source code and its ephemera are not preserved, which presents a problem for scholarly projects where reproducibility matters. To understand and quantify the scope of this issue, we analyzed the use of GHP URIs in the arXiv and PMC corpora from January 2007 to December 2021. In total, there were 253,590 URIs to GitHub, SourceForge, Bitbucket, and GitLab repositories across the 2.66 million publications in the corpora. We found that GitHub, GitLab, SourceForge, and Bitbucket were collectively linked to 160 times in 2007 and 76,746 times in 2021. In 2021, one out of five publications in the arXiv corpus included a URI to GitHub. The complexity of GHPs like GitHub is not amenable to conventional Web archiving techniques. Therefore, the growing use of GHPs in scholarly publications points to an urgent and growing need for dedicated efforts to archive their holdings in order to preserve research code and its scholarly ephemera.
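The counting step described in the abstract (finding URIs to Git Hosting Platforms in publication text and tallying them per platform) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the host list `GHP_HOSTS`, the regular expression `URI_PATTERN`, and the helper names `classify_uri` and `count_ghp_uris` are assumptions made for illustration, and the matching rules the study actually used may differ.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical host-to-platform map (an assumption; the study's exact
# matching rules are not stated in the abstract).
GHP_HOSTS = {
    "github.com": "GitHub",
    "gitlab.com": "GitLab",
    "bitbucket.org": "Bitbucket",
    "sourceforge.net": "SourceForge",
}

# Rough pattern for http(s) URIs in plain text; stops at whitespace,
# quotes, and brackets. Real extraction from PDFs is messier than this.
URI_PATTERN = re.compile(r"https?://[^\s<>\"'()\[\]]+")


def classify_uri(uri):
    """Return the platform name for a URI, or None if it is not a GHP."""
    try:
        host = (urlparse(uri).hostname or "").lower()
    except ValueError:
        # urlparse rejects some malformed URIs; treat them as non-matches.
        return None
    for domain, platform in GHP_HOSTS.items():
        # Match the bare domain or any of its subdomains.
        if host == domain or host.endswith("." + domain):
            return platform
    return None


def count_ghp_uris(text):
    """Count URIs per Git Hosting Platform found in a block of text."""
    counts = Counter()
    for match in URI_PATTERN.finditer(text):
        # Drop sentence punctuation that the pattern may have swallowed.
        uri = match.group(0).rstrip(".,;:")
        platform = classify_uri(uri)
        if platform is not None:
            counts[platform] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "Code is at https://github.com/user/repo. See also "
        "http://gitlab.com/a/b, https://example.org/x and "
        "https://sourceforge.net/projects/foo/."
    )
    print(count_ghp_uris(sample))
```

Running `count_ghp_uris` over each publication's text and summing the counters by publication year would give per-year totals of the kind the abstract reports. Note that this sketch also counts subdomains of each host, which is a design choice of the example rather than something stated in the source.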
