Huqariq: A Multilingual Speech Corpus of Native Languages of Peru for Speech Recognition

Paper Title

Huqariq: A Multilingual Speech Corpus of Native Languages of Peru for Speech Recognition

Authors

Rodolfo Zevallos, Luis Camacho, Nelsi Melgarejo

Abstract

The Huqariq corpus is a multilingual collection of speech from native Peruvian languages. The transcribed corpus is intended for the research and development of speech technologies to preserve endangered languages in Peru. Huqariq is primarily designed for the development of automatic speech recognition, language identification, and text-to-speech tools. To make corpus collection sustainable, we use crowdsourcing. Huqariq currently includes four native languages of Peru and is expected to cover up to 20 of the country's 48 native languages by the end of 2022. The corpus has 220 hours of transcribed audio recorded by more than 500 volunteers, making it the largest speech corpus for native languages in Peru. To verify the quality of the corpus, we present speech recognition experiments using the 220 hours of fully transcribed audio.
