Mining Word Boundaries in Speech as Naturally Annotated Word Segmentation Data

Paper title

Mining Word Boundaries in Speech as Naturally Annotated Word Segmentation Data

Authors

Zhang, Lei, Li, Zhenghua, Zhou, Shilin, Gong, Chen, Wang, Zhefeng, Huai, Baoxing, Zhang, Min

Abstract

Inspired by early research on exploring naturally annotated data for Chinese word segmentation (CWS), and also by recent research on the integration of speech and text processing, this work proposes, for the first time, to mine word boundaries from parallel speech/text data. First, we collect parallel speech/text data from two Internet sources that are related to the CWS data used in our experiments. Then, we obtain character-level alignments and design simple heuristic rules for determining word boundaries according to the pause duration between adjacent characters. Finally, we present an effective complete-then-train strategy that can better utilize the extra naturally annotated data for model training. Experiments demonstrate that our approach can significantly boost CWS performance in both cross-domain and low-resource scenarios.
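
To illustrate the pause-based heuristic mentioned in the abstract, here is a minimal Python sketch. It is not the authors' implementation: the alignment format (character, start time, end time in seconds), the function names `mine_boundaries` and `mark_boundaries`, and the `min_pause` threshold are all assumptions chosen for illustration, and the paper's actual rules may differ. The sketch only marks a boundary where a pause is observed and leaves every other position undecided.

```python
from typing import List, Tuple

# Hypothetical alignment record: (character, start_sec, end_sec),
# of the kind a forced aligner could produce for parallel speech/text.
Alignment = Tuple[str, float, float]


def mine_boundaries(chars: List[Alignment], min_pause: float = 0.05) -> List[int]:
    """Return indices i where a boundary is placed between chars[i] and chars[i+1].

    min_pause is an assumed threshold in seconds, not a value from the paper.
    """
    boundaries = []
    for i in range(len(chars) - 1):
        # Pause = start of the next character minus end of the current one.
        pause = chars[i + 1][1] - chars[i][2]
        if pause >= min_pause:
            boundaries.append(i)
    return boundaries


def mark_boundaries(chars: List[Alignment], min_pause: float = 0.05) -> str:
    """Render the text with '/' at mined boundaries; other positions stay unmarked."""
    cut_after = set(mine_boundaries(chars, min_pause))
    pieces = []
    for i, (ch, _start, _end) in enumerate(chars):
        pieces.append(ch)
        if i in cut_after:
            pieces.append("/")
    return "".join(pieces)


# Toy alignment with a 0.1 s pause between the second and third characters.
example = [
    ("我", 0.00, 0.20),
    ("爱", 0.20, 0.40),
    ("北", 0.50, 0.70),
    ("京", 0.70, 0.90),
]
print(mark_boundaries(example))
```

Because only some boundaries are recovered this way, the output is a partial annotation; a separate step would be needed to fill in the remaining positions before the data is used for training.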
