Paper Title
Mining Word Boundaries in Speech as Naturally Annotated Word Segmentation Data
Paper Authors
Paper Abstract
Inspired by early research on exploring naturally annotated data for Chinese word segmentation (CWS), and by recent research on integrating speech and text processing, this work is the first to propose mining word boundaries from parallel speech/text data. First, we collect parallel speech/text data from two Internet sources that are related to the CWS data used in our experiments. Then, we obtain character-level alignments and design simple heuristic rules that determine word boundaries according to the pause duration between adjacent characters. Finally, we present an effective complete-then-train strategy that makes better use of the extra naturally annotated data for model training. Experiments demonstrate that our approach can significantly boost CWS performance in both cross-domain and low-resource scenarios.
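The pause-based heuristic mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the alignment format (character, start time, end time in seconds), the function names `mine_boundaries` and `render`, and the `min_pause` threshold are all assumptions made for illustration. The sketch marks a boundary only where the silence between two adjacent characters is long enough, and treats every other position as unknown rather than as "no boundary".

```python
from typing import List, Tuple

# Hypothetical alignment format: (character, start_sec, end_sec),
# e.g. as produced by some forced aligner. Not taken from the paper.
Alignment = List[Tuple[str, float, float]]


def mine_boundaries(alignment: Alignment, min_pause: float = 0.05) -> List[int]:
    """Return every index i where a boundary is placed between
    character i and character i + 1.

    min_pause is an assumed threshold in seconds, not a value from the paper.
    """
    boundaries = []
    for i in range(len(alignment) - 1):
        _, _, prev_end = alignment[i]
        _, next_start, _ = alignment[i + 1]
        # Pause duration = gap between the end of one character
        # and the start of the next.
        if next_start - prev_end >= min_pause:
            boundaries.append(i)
    return boundaries


def render(alignment: Alignment, boundaries: List[int], sep: str = "/") -> str:
    """Join the characters, inserting sep at each mined boundary."""
    marked = set(boundaries)
    pieces = []
    for i, (char, _, _) in enumerate(alignment):
        pieces.append(char)
        if i in marked:
            pieces.append(sep)
    return "".join(pieces)
```

With made-up timings in which pauses of about 0.1 s follow the first and third characters of 我喜欢音乐, `mine_boundaries` returns the positions after those two characters, and `render` shows them as separators. Positions without a detected pause stay unmarked, so data mined this way is only partially annotated.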