Paper Title
Effects of language mismatch in automatic forensic voice comparison using deep learning embeddings

Authors

Dávid Sztahó, Attila Fejes

Abstract

In forensic voice comparison, speaker embeddings have become widely popular over the last 10 years. Most pre-trained speaker embeddings are trained on English corpora because such data are easily accessible. Thus, language dependency can be an important factor in automatic forensic voice comparison, especially when the target language is linguistically very different. Numerous commercial systems are available, but their models are mainly trained on a language (mostly English) different from the target language. In the case of a low-resource language, developing a corpus for forensic purposes that contains enough speakers to train deep learning models is costly. This study investigates whether a model pre-trained on an English corpus can be used on a target low-resource language (here, Hungarian) that differs from the language the model was trained on. In addition, multiple samples are often not available from the offender (the unknown speaker); therefore, samples are compared pairwise, both with and without speaker enrollment for the suspect (known) speakers. Two corpora developed specifically for forensic purposes are used, along with a third intended for traditional speaker verification. Two deep-learning-based speaker embedding extraction methods are applied: the x-vector and ECAPA-TDNN. Speaker verification is evaluated in the likelihood-ratio framework, and comparisons are made between language combinations (modeling, LR calibration, evaluation). The results are assessed with the minCllr and EER metrics. It was found that a model pre-trained on a different language, but on a corpus with a large number of speakers, performs well on samples with a language mismatch. The effects of sample duration and speaking style were also examined: the longer the duration of the sample in question, the better the performance, and there is no real difference when various speaking styles are applied.
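The two evaluation metrics named in the abstract, EER and Cllr, can be computed directly from same-speaker (target) and different-speaker (non-target) comparison scores. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, the EER is found by a simple threshold sweep, and `cllr` assumes the scores are already calibrated likelihood ratios (not log-LRs); minCllr would additionally require an optimal calibration step (e.g. PAV), which is omitted here.

```python
import math

def eer(tar, non):
    """Equal error rate from target and non-target scores.

    Sweeps every observed score as a threshold and returns the
    average of FRR and FAR at the point where they are closest.
    """
    candidates = sorted(set(tar + non))
    best = min(candidates, key=lambda t: abs(
        sum(s < t for s in tar) / len(tar)      # false rejections
        - sum(s >= t for s in non) / len(non))) # false acceptances
    frr = sum(s < best for s in tar) / len(tar)
    far = sum(s >= best for s in non) / len(non)
    return (frr + far) / 2

def cllr(tar_lrs, non_lrs):
    """Log-likelihood-ratio cost for calibrated likelihood ratios.

    Cllr = 0.5 * ( mean log2(1 + 1/LR) over target trials
                 + mean log2(1 + LR)   over non-target trials ).
    A non-informative system (LR = 1 everywhere) scores Cllr = 1.
    """
    a = sum(math.log2(1 + 1 / lr) for lr in tar_lrs) / len(tar_lrs)
    b = sum(math.log2(1 + lr) for lr in non_lrs) / len(non_lrs)
    return 0.5 * (a + b)
```

A perfectly separating system (all target scores above all non-target scores) yields an EER of 0, and a system that always outputs LR = 1 yields the reference Cllr of 1, which is why values below 1 indicate that the system delivers useful evidential information.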
