Paper Title

Graph Attention Networks for Speaker Verification

Authors

Jee-weon Jung, Hee-Soo Heo, Ha-Jin Yu, Joon Son Chung

Abstract

This work presents a novel back-end framework for speaker verification using graph attention networks. Segment-wise speaker embeddings extracted from multiple crops within an utterance are interpreted as node representations of a graph. The proposed framework inputs segment-wise speaker embeddings from an enrollment and a test utterance and directly outputs a similarity score. We first construct a graph using segment-wise speaker embeddings and then input these to graph attention networks. After a few graph attention layers with residual connections, each node is projected into a one-dimensional space using affine transform, followed by a readout operation resulting in a scalar similarity score. To enable successful adaptation for speaker verification, we propose techniques such as separating trainable weights for attention map calculations between segment-wise speaker embeddings from different utterances. The effectiveness of the proposed framework is validated using three different speaker embedding extractors trained with different architectures and objective functions. Experimental results demonstrate consistent improvement over various baseline back-end classifiers, with an average equal error rate improvement of 20% over the cosine similarity back-end without test time augmentation.
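
To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of such a back-end: segment-wise embeddings from the enrollment and test utterances become nodes of a fully connected graph, a few graph attention layers with residual connections refine them, an affine layer maps each node to a scalar, and a mean readout yields the similarity score. All class names, dimensions, and the choice of mean readout are illustrative assumptions rather than the authors' implementation; in particular, the paper's technique of separate trainable attention weights for within- versus cross-utterance node pairs is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphAttentionLayer(nn.Module):
    """One simplified GAT-style layer over a fully connected graph:
    attention logits come from an affine map of concatenated node
    pairs, and a residual connection is added."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.att = nn.Linear(2 * dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_nodes, dim) node representations.
        h = self.proj(x)
        n = h.size(0)
        # All (i, j) pair features, shape (n, n, 2 * dim).
        pairs = torch.cat(
            (h.unsqueeze(1).expand(n, n, -1), h.unsqueeze(0).expand(n, n, -1)),
            dim=-1,
        )
        logits = F.leaky_relu(self.att(pairs).squeeze(-1))  # (n, n) attention map
        alpha = torch.softmax(logits, dim=-1)
        return F.relu(alpha @ h) + x  # aggregate neighbours, residual connection


class GATBackEnd(nn.Module):
    """Scores an (enrollment, test) pair of segment-wise embedding sets."""

    def __init__(self, dim: int, num_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(
            GraphAttentionLayer(dim) for _ in range(num_layers)
        )
        self.node_proj = nn.Linear(dim, 1)  # affine map of each node to 1-D

    def forward(self, enroll: torch.Tensor, test: torch.Tensor) -> torch.Tensor:
        # enroll: (E, dim), test: (T, dim); all segments form one graph.
        x = torch.cat((enroll, test), dim=0)
        for layer in self.layers:
            x = layer(x)
        # Readout: average the per-node scalars into one similarity score.
        return self.node_proj(x).squeeze(-1).mean()


# Example: score two utterances, each cropped into four segments of
# 192-dimensional speaker embeddings (both numbers chosen arbitrarily).
backend = GATBackEnd(dim=192)
score = backend(torch.randn(4, 192), torch.randn(4, 192))
print(score.item())
```

The mean readout is the simplest aggregation that turns per-node scalars into one trial score; the paper's actual readout and attention details go beyond this sketch.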
