Paper Title

SourceFinder: Finding Malware Source-Code from Publicly Available Repositories

Authors

Md Omar Faruk Rokon, Risul Islam, Ahmad Darki, Vagelis E. Papalexakis, Michalis Faloutsos

Abstract

Where can we find malware source code? This question is motivated by a real need: there is a dearth of malware source code, which impedes various types of security research. Our work is driven by the following insight: public archives, like GitHub, have a surprising number of malware repositories. Capitalizing on this opportunity, we propose SourceFinder, a supervised-learning approach to identify repositories of malware source code efficiently. We evaluate and apply our approach using 97K repositories from GitHub. First, we show that our approach identifies malware repositories with 89% precision and 86% recall using a labeled dataset. Second, we use SourceFinder to identify 7504 malware source code repositories, which arguably constitutes the largest malware source code database. Finally, we study the fundamental properties and trends of the malware repositories and their authors. The number of such repositories appears to be growing by an order of magnitude every 4 years, and 18 malware authors seem to be "professionals" with well-established online reputations. We argue that our approach and our large repository of malware source code can be a catalyst for research studies that are currently not possible.
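
The abstract reports precision and recall on a labeled dataset. As a reminder of what those two metrics measure, here is a minimal Python sketch. The function name `precision_recall` and the toy labels are hypothetical, chosen only for illustration; they are not the paper's data or code.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 = malware repository."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign flagged as malware
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malware that was missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy labels (assumption: not taken from the paper).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision is the share of repositories flagged as malware that really are malware; recall is the share of true malware repositories that the classifier finds.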
