GitHub Considered Harmful? Analyzing Open-Source Projects for the Automatic Generation of Cryptographic API Call Sequences

Paper Title

GitHub Considered Harmful? Analyzing Open-Source Projects for the Automatic Generation of Cryptographic API Call Sequences

Authors

Catherine Tony, Nicolás E. Díaz Ferreyra, Riccardo Scandariato

Abstract

GitHub is a popular data repository for code examples. It is being continuously used to train several AI-based tools to automatically generate code. However, the effectiveness of such tools in correctly demonstrating the usage of cryptographic APIs has not been thoroughly assessed. In this paper, we investigate the extent and severity of misuses, specifically caused by incorrect cryptographic API call sequences in GitHub. We also analyze the suitability of GitHub data to train a learning-based model to generate correct cryptographic API call sequences. For this, we manually extracted and analyzed the call sequences from GitHub. Using this data, we augmented an existing learning-based model called DeepAPI to create two security-specific models that generate cryptographic API call sequences for a given natural language (NL) description. Our results indicate that it is imperative to not neglect the misuses in API call sequences while using data sources like GitHub, to train models that generate code.
