Reducing the Cost of Training Security Classifier (via Optimized Semi-Supervised Learning)

Paper Title

Reducing the Cost of Training Security Classifier (via Optimized Semi-Supervised Learning)

Authors

Rui Shu, Tianpei Xia, Huy Tu, Laurie Williams, Tim Menzies

Abstract

Background: Most existing machine learning models for security tasks, such as spam detection, malware detection, or network intrusion detection, are built on supervised machine learning algorithms. In this paradigm, models need a large amount of labeled data to learn useful relationships between the selected features and the target class. However, such labeled data can be scarce and expensive to acquire.

Goal: To help security practitioners train useful security classification models when few labeled training data and many unlabeled training data are available.

Method: We propose an adaptive framework called Dapper, which optimizes 1) semi-supervised learning algorithms that assign pseudo-labels to unlabeled data in a propagation paradigm and 2) the machine learning classifier (i.e., random forest). When the classes in a dataset are highly imbalanced, Dapper also adaptively integrates and optimizes a data oversampling method called SMOTE. We use a novel Bayesian optimization approach to search the large hyperparameter space of these tuning targets.

Results: We evaluate Dapper on three security datasets: the Twitter spam dataset, the malware URLs dataset, and the CIC-IDS-2017 dataset. Experimental results indicate that we can use as little as 10% of the original labeled data and still achieve comparable or even better classification performance than using 100% of the labeled data in a supervised way.

Conclusion: Based on these results, we recommend using hyperparameter optimization with semi-supervised learning when dealing with shortages of labeled security data.
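To make the propagation idea in the Method concrete, here is a minimal sketch of how pseudo-labels can spread from a few labeled points to many unlabeled ones. It is not Dapper's implementation: the function names (`rbf_weight`, `propagate_labels`), the RBF similarity, the `gamma` value, and the two-cluster toy data are all assumptions chosen for illustration, and the sketch omits the random forest, SMOTE, and Bayesian optimization stages. `gamma` stands in for the kind of hyperparameter a tuner would search over.

```python
import math
import random


def rbf_weight(p, q, gamma):
    """Similarity of two points: exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * math.dist(p, q) ** 2)


def propagate_labels(points, labels, gamma=1.0, max_iter=200, tol=1e-6):
    """Assign a pseudo-label (0 or 1) to every entry whose label is None.

    Labeled points stay clamped to their known label. Each unlabeled point
    repeatedly takes the similarity-weighted average score of all other points.
    """
    n = len(points)
    # Pairwise similarities; a point does not vote for itself.
    weights = [[0.0 if i == j else rbf_weight(points[i], points[j], gamma)
                for j in range(n)] for i in range(n)]
    # Score = current belief that the point belongs to class 1.
    scores = [0.5 if y is None else float(y) for y in labels]
    for _ in range(max_iter):
        updated = []
        for i in range(n):
            if labels[i] is not None:
                updated.append(float(labels[i]))  # clamp known labels
            else:
                total = sum(weights[i])
                updated.append(
                    sum(w * s for w, s in zip(weights[i], scores)) / total)
        change = max(abs(a - b) for a, b in zip(scores, updated))
        scores = updated
        if change < tol:
            break
    return [int(s >= 0.5) for s in scores]


# Hypothetical toy data: two well-separated clusters of 50 points each.
rng = random.Random(0)
class0 = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(50)]
class1 = [(rng.uniform(4, 5), rng.uniform(4, 5)) for _ in range(50)]
points = class0 + class1
true_labels = [0] * 50 + [1] * 50

# Keep only 10% of the labels (the first five of each class); hide the rest.
labels = [y if (i % 50) < 5 else None for i, y in enumerate(true_labels)]

pseudo_labels = propagate_labels(points, labels, gamma=1.0)
accuracy = sum(p == t for p, t in zip(pseudo_labels, true_labels)) / len(points)
print(f"labeled points: {sum(y is not None for y in labels)}, "
      f"pseudo-label accuracy: {accuracy:.2f}")
```

In a fuller pipeline, the pseudo-labeled points would be added to the training set of a supervised classifier. Library implementations such as scikit-learn's `LabelPropagation` and `LabelSpreading`, or imbalanced-learn's `SMOTE`, are the more likely choice in practice than a hand-rolled loop like this one.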
