Weaponizing Unicodes with Deep Learning -- Identifying Homoglyphs with Weakly Labeled Data

Paper title

Weaponizing Unicodes with Deep Learning -- Identifying Homoglyphs with Weakly Labeled Data

Authors

Deng, Perry; Linsky, Cooper; Wright, Matthew

Abstract

Visually similar characters, or homoglyphs, can be used to perform social engineering attacks or to evade spam and plagiarism detectors. It is thus important to understand the capabilities of an attacker to identify homoglyphs -- particularly ones that have not been previously spotted -- and leverage them in attacks. We investigate a deep-learning model using embedding learning, transfer learning, and augmentation to determine the visual similarity of characters and thereby identify potential homoglyphs. Our approach uniquely takes advantage of weak labels that arise from the fact that most characters are not homoglyphs. Our model drastically outperforms the Normalized Compression Distance approach on pairwise homoglyph identification, for which we achieve an average precision of 0.97. We also present the first attempt at clustering homoglyphs into sets of equivalence classes, which is more efficient than pairwise information for security practitioners to quickly look up homoglyphs or to normalize confusable string encodings. To measure clustering performance, we propose a metric (mBIOU) building on the classic Intersection-Over-Union (IOU) metric. Our clustering method achieves 0.592 mBIOU, compared to 0.430 for the naive baseline. We also use our model to predict over 8,000 previously unknown homoglyphs, and find good early indications that many of these may be true positives. Source code and the list of predicted homoglyphs are uploaded to GitHub: https://github.com/PerryXDeng/weaponizing_unicode
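
The abstract states that mBIOU builds on the classic Intersection-Over-Union metric but does not define mBIOU itself, so the sketch below illustrates only plain IOU (the Jaccard index) between one predicted cluster and one reference equivalence class. The function name `iou`, the convention for two empty sets, and the example characters are assumptions made for illustration; they are not taken from the paper or its repository.

```python
def iou(predicted: set, reference: set) -> float:
    """Classic Intersection-over-Union (Jaccard index) of two sets."""
    union = predicted | reference
    if not union:
        # Convention assumed here: two empty sets count as identical.
        return 1.0
    return len(predicted & reference) / len(union)


# Hypothetical example: Latin "a", Cyrillic "а" (U+0430), and Latin "o".
predicted_cluster = {"a", "\u0430", "o"}   # cluster produced by some model
reference_class = {"a", "\u0430"}          # known homoglyph equivalence class

score = iou(predicted_cluster, reference_class)
print(round(score, 3))  # 2 shared characters out of 3 in the union
```

A score of 1.0 means the predicted cluster matches the reference class exactly, and 0.0 means the two share no characters. How the paper aggregates such scores across clusters into mBIOU is not stated in the abstract and is not reproduced here.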
