Paper Title

Token-level Speaker Change Detection Using Speaker Difference and Speech Content via Continuous Integrate-and-fire

Authors

Zhiyun Fan, Zhenlin Liang, Linhao Dong, Yi Liu, Shiyu Zhou, Meng Cai, Jun Zhang, Zejun Ma, Bo Xu

Abstract

In multi-talker scenarios such as meetings and conversations, speech processing systems are usually required to segment the audio and then transcribe each segment. These two stages are addressed separately by speaker change detection (SCD) and automatic speech recognition (ASR). Most previous SCD systems rely solely on speaker information and ignore the importance of speech content. In this paper, we propose a novel SCD system that considers two cues: speaker difference and speech content. These two cues are converted into token-level representations by the continuous integrate-and-fire (CIF) mechanism and then combined to detect speaker changes at the token acoustic boundaries. We evaluate our approach on AISHELL-4, a public dataset of real recorded meetings. The experimental results show that our method outperforms a competitive frame-level baseline system by 2.45% equal coverage-purity (ECP). In addition, we demonstrate the importance of speech content and speaker difference to the SCD task, and the advantages of conducting SCD at the token acoustic boundaries rather than frame by frame.
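
The abstract names the continuous integrate-and-fire (CIF) mechanism as the step that turns frame-level features into token-level representations. The following is a minimal sketch of the general CIF idea, not the authors' implementation. It assumes the per-frame weights are already given (in a real system they would be predicted by a network), that each weight lies in [0, 1], and that the firing threshold is 1.0. The function name `cif` and its parameters are hypothetical.

```python
import numpy as np


def cif(frames, alphas, threshold=1.0):
    """Integrate frame vectors into token vectors (illustrative sketch).

    frames: (T, D) frame-level features.
    alphas: (T,) per-frame weights, assumed to lie in [0, 1].
    Returns (tokens, boundaries): an (N, D) array of token vectors and
    the frame index at which each token fired.
    """
    frames = np.asarray(frames, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    dim = frames.shape[1]

    acc_weight = 0.0              # weight accumulated for the current token
    acc_state = np.zeros(dim)     # weighted sum of frames for the current token
    tokens, boundaries = [], []

    for t, (h, a) in enumerate(zip(frames, alphas)):
        if acc_weight + a < threshold:
            # Not enough weight yet: keep integrating.
            acc_weight += a
            acc_state += a * h
        else:
            # Fire: spend just enough of this frame's weight to reach the
            # threshold, then start the next token with the remainder.
            used = threshold - acc_weight
            tokens.append(acc_state + used * h)
            boundaries.append(t)
            acc_weight = a - used
            acc_state = acc_weight * h

    return np.array(tokens).reshape(-1, dim), boundaries


if __name__ == "__main__":
    toy_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
    toy_alphas = [0.5, 0.75, 0.25, 0.5]
    toy_tokens, toy_boundaries = cif(toy_frames, toy_alphas)
    print(toy_tokens)       # one row per fired token
    print(toy_boundaries)   # frame indices of the token acoustic boundaries
```

In this sketch the returned boundaries play the role of the token acoustic boundaries mentioned in the abstract. A speaker-change classifier would then operate on one vector per token instead of one per frame.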
