Deep Speech Based End-to-End Automated Speech Recognition (ASR) for Indian-English Accents

Paper title

Deep Speech Based End-to-End Automated Speech Recognition (ASR) for Indian-English Accents

Paper authors

Dubey, Priyank; Shah, Bilal

Abstract

Automated Speech Recognition (ASR) is an interdisciplinary application of computer science and linguistics that enables us to derive a transcription from an uttered speech waveform. It has several military applications, such as high-performance fighter aircraft, helicopters, and air-traffic control. Beyond the military, speech recognition is used in healthcare, by persons with disabilities, and in many other areas. ASR has long been an active research area, and several models and algorithms for speech-to-text (STT) have been proposed. One of the most recent is Mozilla Deep Speech, which is based on the Deep Speech research paper by Baidu. Deep Speech is a state-of-the-art speech recognition system developed using end-to-end deep learning; it is trained with a well-optimized Recurrent Neural Network (RNN) training system that utilizes multiple Graphics Processing Units (GPUs). This training is mostly done on American-English accent datasets, which results in poor generalizability to other English accents. India is a land of vast diversity, and this is visible even in speech: there are several English accents, which vary from state to state. In this work, we use a transfer learning approach with the most recent Deep Speech model, deepspeech-0.9.3, to develop an end-to-end speech recognition system for Indian-English accents. This work utilizes fine-tuning and data augmentation to further optimize and improve the Deep Speech ASR system. Indic TTS data of Indian-English accents is used for transfer learning and fine-tuning of the pre-trained Deep Speech model. A general comparison is made among the untrained model, our trained model, and other available speech recognition services for Indian-English accents.
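The abstract does not say which metric the comparison uses. A common choice for comparing ASR systems is word error rate (WER), and the sketch below assumes that metric purely for illustration. The function name `word_error_rate` and the example transcripts are hypothetical and do not come from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: WER is used here only as an illustrative metric; the
    abstract does not state how the models were actually compared.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts, not data from the paper.
reference = "turn on the light"
print(word_error_rate(reference, "turn on light"))      # one deleted word out of four
print(word_error_rate(reference, "turn on the light"))  # exact match
```

Averaging this score over a held-out set of Indian-English utterances for each system would give one number per system to compare, with lower being better.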
