Paper Title

jTrans: Jump-Aware Transformer for Binary Code Similarity

Authors

Hao Wang, Wenjie Qu, Gilad Katz, Wenyu Zhu, Zeyu Gao, Han Qiu, Jianwei Zhuge, Chao Zhang

Abstract

Binary code similarity detection (BCSD) has important applications in various fields such as vulnerability detection, software component analysis, and reverse engineering. Recent studies have shown that deep neural networks (DNNs) can comprehend instructions or control-flow graphs (CFGs) of binary code and support BCSD. In this study, we propose a novel Transformer-based approach, namely jTrans, to learn representations of binary code. It is the first solution that embeds control-flow information of binary code into Transformer-based language models, by using a novel jump-aware representation of the analyzed binaries and a newly designed pre-training task. Additionally, we release to the community a newly created large dataset of binaries, BinaryCorp, which is the most diverse to date. Evaluation results show that jTrans outperforms state-of-the-art (SOTA) approaches on this more challenging dataset by 30.5 percentage points (i.e., from 32.0% to 62.5%). In a real-world task of known vulnerability searching, jTrans achieves a recall that is 2x higher than existing SOTA baselines.
