Paper title

Automated speech tools for helping communities process restricted-access corpora for language revival efforts

Authors

San, Nay; Bartelds, Martijn; Ògúnrèmí, Tolúlopé; Mount, Alison; Thompson, Ruben; Higgins, Michael; Barker, Roy; Simpson, Jane; Jurafsky, Dan

Abstract

Many archival recordings of speech from endangered languages remain unannotated and inaccessible to community members and language learning programs. One bottleneck is the time-intensive nature of annotation. An even narrower bottleneck occurs for recordings with access constraints, such as language that must be vetted or filtered by authorised community members before annotation can begin. We propose a privacy-preserving workflow to widen both bottlenecks for recordings where speech in the endangered language is intermixed with a more widely used language such as English for metalinguistic commentary and questions (e.g. "What is the word for 'tree'?"). We integrate voice activity detection (VAD), spoken language identification (SLI), and automatic speech recognition (ASR) to transcribe the metalinguistic content, which an authorised person can quickly scan to triage recordings that can be annotated by people with lower levels of access. We report work in progress on processing 136 hours of archival audio containing a mix of English and Muruwari. Our collaborative work with the Muruwari custodian of the archival materials shows that this workflow reduces metalanguage transcription time by 20% even given only minimal amounts of annotated training data: 10 utterances per language for SLI and, for ASR, at most 39 minutes and possibly as little as 39 seconds.
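The VAD, SLI and ASR chain described in the abstract can be pictured roughly as follows. This is only a minimal sketch and not the authors' implementation: the `Segment` class, the `triage` function, the three callables passed into it, and the language label `"eng"` are all hypothetical names introduced for illustration. The one idea taken from the abstract is that only the metalanguage (English) segments are transcribed, so restricted-language speech is never turned into text automatically.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Segment:
    """One detected speech region (hypothetical structure, for illustration)."""
    start: float                      # seconds from the start of the recording
    end: float                        # seconds from the start of the recording
    language: str                     # label assigned by the SLI step
    transcript: Optional[str] = None  # filled in only for metalanguage speech


def triage(
    audio: object,
    detect_speech: Callable[[object], List[Tuple[float, float]]],
    identify_language: Callable[[object, float, float], str],
    transcribe: Callable[[object, float, float], str],
    metalanguage: str = "eng",        # assumed label for English
) -> List[Segment]:
    """Chain VAD -> SLI -> ASR, transcribing only metalanguage segments."""
    segments: List[Segment] = []
    for start, end in detect_speech(audio):            # VAD: find speech regions
        lang = identify_language(audio, start, end)    # SLI: label each region
        text = None
        if lang == metalanguage:                       # ASR: English only
            text = transcribe(audio, start, end)
        segments.append(Segment(start, end, lang, text))
    return segments
```

An authorised reviewer would then scan only the entries whose `transcript` is not `None`, while the remaining segments stay untranscribed until access has been cleared.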
