Paper Title
Evince the artifacts of Spoof Speech by blending Vocal Tract and Voice Source Features
Paper Authors
Paper Abstract
With the rapid advancement of synthetic speech generation technologies, great interest in differentiating spoof speech from natural speech is emerging in the research community. Identifying these synthetic signals is a difficult task not only for cutting-edge classification models but also for humans themselves. To prevent potential adverse effects, it is crucial to detect spoof signals. From a forensics perspective, it is also important to predict the algorithm that generated them, in order to identify the forger. This requires an understanding of the underlying attributes of spoof signals that serve as a signature of the synthesizer. This study emphasizes the segments of speech signals that are critical for identifying their authenticity, by utilizing Vocal Tract System (\textit{VTS}) and Voice Source (\textit{VS}) features. In this paper, we propose a system that detects spoof signals and identifies the corresponding speech-generating algorithm. We achieve 99.58\% accuracy in algorithm classification. From our experiments, we found that a VS feature-based system attends more to phoneme transitions, while a VTS feature-based system attends more to stationary segments of the speech signal. We apply model fusion techniques to the VS-based and VTS-based systems to exploit their complementary information and develop a robust classifier. Upon analyzing the confusion plots, we found that WaveRNN is poorly classified, indicating greater naturalness, whereas synthesizers such as Waveform Concatenation and Neural Source Filter are classified with the highest accuracy. The practical implications of this work can aid researchers in both the forensics community (leveraging artifacts) and the speech community (mitigating artifacts).
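The abstract mentions fusing the VS-based and VTS-based systems to exploit complementary information, but does not specify the fusion scheme. A common choice is score-level fusion, i.e. a weighted combination of the two classifiers' class probabilities; the sketch below illustrates that idea only. The function name `fuse_scores` and the `weight` parameter are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def fuse_scores(vs_probs, vts_probs, weight=0.5):
    """Score-level fusion of two classifiers (illustrative sketch).

    vs_probs, vts_probs: arrays of shape (n_samples, n_classes) holding
    per-class probabilities from the VS-based and VTS-based systems.
    weight: contribution of the VS-based system (hypothetical parameter;
    the paper does not state how its fusion is weighted).
    Returns the fused class prediction for each sample.
    """
    vs = np.asarray(vs_probs, dtype=float)
    vts = np.asarray(vts_probs, dtype=float)
    fused = weight * vs + (1.0 - weight) * vts  # weighted average of scores
    return fused.argmax(axis=1)                 # predicted algorithm label
```

For example, if the VS system is confident a sample came from one synthesizer while the VTS system mildly prefers another, the fused score lets the stronger evidence dominate, which is the sense in which the two feature streams are complementary.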