Paper Title
Deep Speaker Vector Normalization with Maximum Gaussianality Training
Paper Authors
Paper Abstract
Deep speaker embedding represents the state-of-the-art technique for speaker recognition. A key problem with this approach is that the resulting deep speaker vectors tend to be irregularly distributed. In previous research, we proposed a deep normalization approach based on a new discriminative normalization flow (DNF) model, by which the distributions of individual speakers are arguably transformed to homogeneous Gaussians. This normalization was demonstrated to be effective, but despite this remarkable success, we empirically found that the latent codes produced by the DNF model are generally neither homogeneous nor Gaussian, although the model assumes so. In this paper, we argue that this problem is largely attributable to the maximum-likelihood (ML) training criterion of the DNF model, which aims to maximize the likelihood of the observations but does not necessarily improve the Gaussianality of the latent codes. We therefore propose a new Maximum Gaussianality (MG) training approach that directly maximizes the Gaussianality of the latent codes. Our experiments on two datasets, SITW and CNCeleb, demonstrate that the new MG training approach delivers much better performance than the previous ML training and exhibits improved domain generalizability, particularly with cosine scoring.
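The core idea, directly penalizing how far the latent codes deviate from a Gaussian rather than maximizing observation likelihood, can be illustrated with a minimal moment-based sketch. Note this is an illustrative assumption, not the paper's exact MG objective: the function name `gaussianality_loss` and the choice of penalizing skewness and excess kurtosis are ours for demonstration only.

```python
import numpy as np

def gaussianality_loss(z):
    """Illustrative Gaussianality penalty (an assumption, not the paper's
    exact objective). For latent codes z of shape [n_samples, dim],
    standardize each dimension and penalize its skewness and excess
    kurtosis, both of which are 0 for a standard Gaussian. Smaller loss
    means the codes look more Gaussian per dimension."""
    z = np.asarray(z, dtype=float)
    mu = z.mean(axis=0)
    sigma = z.std(axis=0) + 1e-8           # avoid division by zero
    s = (z - mu) / sigma                   # standardized codes
    skew = (s ** 3).mean(axis=0)           # 3rd standardized moment
    ex_kurt = (s ** 4).mean(axis=0) - 3.0  # 4th moment minus Gaussian value
    return float((skew ** 2 + ex_kurt ** 2).mean())
```

As a sanity check, samples drawn from a standard Gaussian should yield a near-zero loss, while clearly non-Gaussian samples (e.g. uniform, whose excess kurtosis is -1.2) should score higher; in a real model such a term would be minimized jointly with the discriminative objective.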