Paper Title
The Stack: 3 TB of permissively licensed source code
Paper Authors
Paper Abstract
Large Language Models (LLMs) play an ever-increasing role in the field of Artificial Intelligence (AI), not only for natural language processing but also for code understanding and generation. To stimulate open and responsible research on LLMs for code, we introduce The Stack, a 3.1 TB dataset consisting of permissively licensed source code in 30 programming languages. We describe how we collect the full dataset, construct a permissively licensed subset, present a data governance plan, discuss limitations, and show promising results on text2code benchmarks by training 350M-parameter decoders on different Python subsets. We find that (1) near-deduplicating the data significantly boosts performance across all experiments, and (2) it is possible to match previously reported HumanEval and MBPP performance using only permissively licensed data. We make the dataset available at https://hf.co/BigCode, provide a tool called "Am I in The Stack" (https://hf.co/spaces/bigcode/in-the-stack) for developers to search The Stack for copies of their code, and provide a process for code to be removed from the dataset by following the instructions at https://www.bigcode-project.org/docs/about/the-stack/.
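Finding (1), that near-deduplication boosts performance, refers to removing files that are almost but not exactly identical (e.g., copies differing only in headers or whitespace), which exact-hash deduplication misses. The abstract does not reproduce the pipeline; the following is a minimal sketch of one common approach, MinHash signatures with locality-sensitive hashing via the datasketch library. The 256 permutations, the 0.85 Jaccard threshold, and whitespace tokenization are illustrative assumptions, not the paper's exact configuration.

    # Minimal near-deduplication sketch: MinHash + LSH (datasketch library).
    # Parameters below are illustrative assumptions, not the paper's settings.
    from datasketch import MinHash, MinHashLSH

    def minhash_signature(code: str, num_perm: int = 256) -> MinHash:
        """Build a MinHash signature from the file's distinct whitespace tokens."""
        m = MinHash(num_perm=num_perm)
        for token in set(code.split()):
            m.update(token.encode("utf8"))
        return m

    def near_deduplicate(files: dict[str, str], threshold: float = 0.85) -> list[str]:
        """Return keys of files kept after dropping approximate duplicates."""
        lsh = MinHashLSH(threshold=threshold, num_perm=256)
        kept = []
        for key, code in files.items():
            sig = minhash_signature(code)
            if not lsh.query(sig):  # no sufficiently similar file indexed yet
                lsh.insert(key, sig)
                kept.append(key)
        return kept

Querying the LSH index before inserting keeps the first file seen from each near-duplicate cluster, and the index lookup avoids the pairwise O(n^2) Jaccard comparisons that would be intractable at the 3 TB scale described above.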