Anomaly Detection in Emails using Machine Learning and Header Information

Paper Title

Anomaly Detection in Emails using Machine Learning and Header Information

Authors

Craig Beaman, Haruna Isah

Abstract

Anomalies in emails, such as phishing and spam, present major security risks to both individuals and organizations, including the loss of privacy, money, and brand reputation. Previous studies on email anomaly detection relied on a single type of anomaly and on analysis of the email body and subject content. A drawback of this approach is that it depends on the written language of the email content. To overcome this limitation, this study conducted feature extraction and selection on email header datasets and leveraged both multi-class and one-class anomaly detection approaches. The experimental results demonstrate that email header information alone is enough to reliably detect spam and phishing emails. Supervised learning algorithms such as Random Forest, SVM, MLP, KNN, and their stacked ensembles were found to be very successful, achieving high accuracy scores of 97% for phishing and 99% for spam emails. One-class classification with One-Class SVM achieved accuracy scores of 87% and 89% with spam and phishing emails, respectively. Real-world email filtering applications will benefit from using only the header information in terms of resource utilization and efficiency.
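To make the idea of header-only feature extraction concrete, the following is a minimal sketch using only the Python standard library. The function name `extract_header_features` and every feature in it are hypothetical illustrations chosen for this example; the abstract does not list the features the authors actually extracted or selected.

```python
from email import message_from_string
from email.utils import parseaddr


def extract_header_features(raw_email: str) -> dict:
    """Turn the headers of one raw email into a small numeric feature dict.

    Assumption: these features are illustrative only and are not the
    feature set used in the paper.
    """
    msg = message_from_string(raw_email)

    # parseaddr returns (display_name, address); only the address is needed.
    _, from_addr = parseaddr(msg.get("From", ""))
    _, reply_addr = parseaddr(msg.get("Reply-To", ""))
    _, return_addr = parseaddr(msg.get("Return-Path", ""))

    def domain(addr: str) -> str:
        # Lower-cased part after the last "@", or "" if there is none.
        return addr.rsplit("@", 1)[-1].lower() if "@" in addr else ""

    received = msg.get_all("Received") or []
    return {
        "num_received_hops": len(received),
        "has_reply_to": int(bool(reply_addr)),
        "reply_to_domain_mismatch": int(
            bool(reply_addr) and domain(reply_addr) != domain(from_addr)
        ),
        "return_path_domain_mismatch": int(
            bool(return_addr) and domain(return_addr) != domain(from_addr)
        ),
        "missing_message_id": int(msg.get("Message-ID") is None),
        "missing_date": int(msg.get("Date") is None),
    }


# A made-up message; the body is never inspected.
sample = (
    "Return-Path: <bounce@mailer.example.net>\n"
    "Received: from a.example.net by b.example.org; Mon, 1 Jan 2024 10:00:00 +0000\n"
    "Received: from c.example.net by a.example.net; Mon, 1 Jan 2024 09:59:58 +0000\n"
    'From: "Support" <support@example.com>\n'
    "Reply-To: attacker@example.net\n"
    "Subject: Account notice\n"
    "\n"
    "Body is ignored.\n"
)
features = extract_header_features(sample)
print(features)
```

Feature dictionaries of this kind could then be converted to vectors and passed to any of the classifiers named in the abstract (for example Random Forest or One-Class SVM). Because nothing here reads the body or subject text, the features do not depend on the language the email is written in.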
