Paper Title
Explain2Attack: Text Adversarial Attacks via Cross-Domain Interpretability
Paper Authors
Paper Abstract
Training robust deep learning models for downstream tasks is a critical challenge. Research has shown that downstream models can be easily fooled by adversarial inputs that resemble the training data but are slightly perturbed in ways imperceptible to humans. Understanding the behavior of natural language models under these attacks is crucial to better defending them. In the black-box attack setting, where no access to model parameters is available, the attacker can only query the target model's outputs to craft a successful attack. Current state-of-the-art black-box attacks are costly in both computational complexity and the number of queries needed to craft successful adversarial examples. The number of queries is critical in real-world scenarios, where fewer queries are desired to avoid raising suspicion toward an attacking agent. In this paper, we propose Explain2Attack, a black-box adversarial attack on text classification tasks. Instead of searching for important words to perturb by querying the target model, Explain2Attack employs an interpretable substitute model from a similar domain to learn word importance scores. We show that our framework matches or outperforms the attack rates of state-of-the-art models, yet with lower query cost and higher efficiency.
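The core idea can be illustrated with a minimal sketch: score each word's importance using a substitute model (here a toy keyword classifier and leave-one-out scoring, both hypothetical stand-ins; the paper itself learns importances with an interpretable substitute model), then greedily replace the highest-scoring words with candidate substitutes. The functions, synonym table, and scoring scheme below are illustrative assumptions, not the paper's actual method.

```python
# Hedged sketch of word-importance-guided text perturbation.
# substitute_confidence is a toy stand-in for a substitute classifier;
# a real attack would use a trained model and query the target model
# only to verify the final adversarial example.

def substitute_confidence(words):
    """Toy 'positive sentiment' confidence from keyword weights
    (hypothetical scoring, not the paper's substitute model)."""
    positive = {"great": 0.5, "excellent": 0.6, "good": 0.3}
    return min(1.0, sum(positive.get(w.lower(), 0.0) for w in words))

def word_importance(words):
    """Leave-one-out importance: confidence drop when a word is removed."""
    base = substitute_confidence(words)
    return [base - substitute_confidence(words[:i] + words[i + 1:])
            for i in range(len(words))]

def craft_adversarial(words, synonyms, budget=2):
    """Replace the top-`budget` most important words with substitutes.
    Real attacks draw candidates from embedding neighborhoods instead
    of a fixed synonym table."""
    scores = word_importance(words)
    order = sorted(range(len(words)), key=lambda i: -scores[i])
    out = list(words)
    for i in order[:budget]:
        if out[i].lower() in synonyms:
            out[i] = synonyms[out[i].lower()]
    return out

sentence = "The movie was great and the acting excellent".split()
subs = {"great": "fine", "excellent": "decent"}
print(craft_adversarial(sentence, subs))
```

Because the importance scores come from the substitute model, the target model is never queried during word ranking, which is the source of the query savings the abstract claims.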