Paper Title
Transformers for End-to-End InfoSec Tasks: A Feasibility Study
Paper Authors
Paper Abstract
In this paper, we assess the viability of transformer models in end-to-end InfoSec settings, in which no intermediate feature representations or processing steps occur outside the model. We implement transformer models for two distinct InfoSec data formats - specifically URLs and PE files - in a novel end-to-end approach, and explore a variety of architectural designs, training regimes, and experimental settings to determine the ingredients necessary for performant detection models. We show that, in contrast to conventional transformers trained on more standard NLP-related tasks, our URL transformer model requires a different training approach to reach high performance levels. Specifically, we show that 1) pre-training on a massive corpus of unlabeled URL data for an auto-regressive task does not readily transfer to binary classification of malicious or benign URLs, but 2) using an auxiliary auto-regressive loss improves performance when training from scratch. We introduce a method for mixed objective optimization, which dynamically balances contributions from both loss terms so that neither one dominates. We show that this method yields quantitative evaluation metrics comparable to those of several top-performing benchmark classifiers. Unlike URLs, binary executables contain longer and more distributed sequences of information-rich bytes. To accommodate such lengthy byte sequences, we introduce additional context length into the transformer by providing its self-attention layers with an adaptive span similar to that of Sukhbaatar et al. We demonstrate that this approach performs comparably to well-established malware detection models on benchmark PE file datasets, but also point out the need for further exploration into model improvements in scalability and compute efficiency.
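The abstract mentions two techniques in passing: a mixed objective that dynamically balances a classification loss against an auxiliary auto-regressive loss, and an adaptive attention span for long PE byte sequences. The abstract does not specify the balancing rule, so the magnitude-based weighting below is only an illustrative assumption; the function and variable names are hypothetical, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of a mixed-objective loss in which the binary classification
# term and the auxiliary auto-regressive (next-token) term are reweighted each
# step so that neither dominates. The specific magnitude-based weighting is an
# assumption for illustration only.
def mixed_objective_loss(cls_logits, labels, lm_logits, token_targets, eps=1e-8):
    cls_loss = nn.functional.binary_cross_entropy_with_logits(
        cls_logits.squeeze(-1), labels.float())
    lm_loss = nn.functional.cross_entropy(
        lm_logits.view(-1, lm_logits.size(-1)), token_targets.view(-1))

    # Detach the magnitudes so the weights act as non-differentiable scaling
    # factors that keep both terms at a comparable scale.
    total = cls_loss.detach() + lm_loss.detach() + eps
    w_cls = lm_loss.detach() / total
    w_lm = cls_loss.detach() / total
    return w_cls * cls_loss + w_lm * lm_loss
```

For the long-sequence side, the following is a rough sketch (again, not the paper's code) of the adaptive-span soft mask in the spirit of Sukhbaatar et al. (2019): each attention head learns how far back it may look, with a linear ramp so the span limit remains differentiable.

```python
# Illustrative adaptive-span mask; class name and hyperparameters are assumptions.
class AdaptiveSpanMask(nn.Module):
    def __init__(self, max_span: int, ramp: int = 32):
        super().__init__()
        self.max_span, self.ramp = max_span, ramp
        self.span_frac = nn.Parameter(torch.zeros(1))  # learnable fraction of max_span

    def forward(self, attn_weights: torch.Tensor) -> torch.Tensor:
        # attn_weights: (..., key_len) post-softmax attention over past positions,
        # ordered oldest to most recent key.
        key_len = attn_weights.size(-1)
        distance = torch.arange(key_len - 1, -1, -1, device=attn_weights.device).float()
        span = torch.clamp(self.span_frac, 0, 1) * self.max_span
        # Soft mask: 1 inside the learned span, ramping linearly to 0 beyond it.
        mask = torch.clamp((span + self.ramp - distance) / self.ramp, 0, 1)
        masked = attn_weights * mask
        return masked / masked.sum(dim=-1, keepdim=True).clamp_min(1e-8)
```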