SOREL-20M: A Large Scale Benchmark Dataset for Malicious PE Detection

Paper Title

SOREL-20M: A Large Scale Benchmark Dataset for Malicious PE Detection

Authors

Richard Harang, Ethan M. Rudd

Abstract

In this paper we describe the SOREL-20M (Sophos/ReversingLabs-20 Million) dataset: a large-scale dataset consisting of nearly 20 million files with pre-extracted features and metadata, high-quality labels derived from multiple sources, information about vendor detections of the malware samples at the time of collection, and additional "tags" related to each malware sample to serve as additional targets. In addition to features and metadata, we also provide approximately 10 million "disarmed" malware samples (samples with both the optional_headers.subsystem and file_header.machine flags set to zero) that may be used for further exploration of features and detection strategies. We also provide Python code to interact with the data and features, as well as baseline neural network and gradient boosted decision tree models and their results, with full training and evaluation code, to serve as a starting point for further experimentation.
