Paper Title
Visualizing Automatic Speech Recognition -- Means for a Better Understanding?
Paper Authors
Paper Abstract
Automatic speech recognition (ASR) is becoming ever better at mimicking human speech processing. The inner workings of ASR systems, however, remain largely obscured by the complex structure of the deep neural networks (DNNs) they are based on. In this paper, we show how so-called attribution methods, which we import from image recognition and suitably adapt to handle audio data, can help to clarify the workings of ASR. Taking DeepSpeech, an end-to-end model for ASR, as a case study, we show how these techniques help to visualize which features of the input are most influential in determining the output. We focus on three visualization techniques: Layer-wise Relevance Propagation (LRP), Saliency Maps, and Shapley Additive Explanations (SHAP). We compare these methods and discuss potential further applications, such as the detection of adversarial examples.
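To make the gradient-based Saliency Maps mentioned in the abstract concrete, the following minimal PyTorch sketch computes an input attribution for one output character. It is an illustrative assumption, not the paper's implementation: the model interface, feature shapes, and the function name saliency_map are hypothetical stand-ins for a DeepSpeech-like network that maps a sequence of audio features to per-frame character logits.

```python
import torch

def saliency_map(model, features, frame, char_idx):
    """Gradient-based saliency sketch (illustrative, not the paper's code).

    Assumes `model` maps a (time, n_features) float tensor of audio
    features (e.g. MFCC frames) to (time, n_chars) character logits,
    roughly the output shape of a DeepSpeech-style ASR model.
    """
    x = features.clone().detach().requires_grad_(True)
    logits = model(x)                    # (time, n_chars) character scores
    logits[frame, char_idx].backward()   # gradient of one output score
    # The absolute gradient measures how sensitive the chosen character's
    # score is to each time-feature input cell -- the saliency map.
    return x.grad.abs()
```

Plotting the returned (time, n_features) tensor as a heatmap over the input features then shows which regions of the audio drive a particular character prediction; LRP and SHAP yield analogous relevance maps, but via layer-wise relevance backpropagation rules and Shapley value estimation, respectively.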