Assessing ASR Model Quality on Disordered Speech using BERTScore

Paper title

Assessing ASR Model Quality on Disordered Speech using BERTScore

Authors

Jimmy Tobin, Qisheng Li, Subhashini Venugopalan, Katie Seaver, Richard Cave, Katrin Tomanek

Abstract

Word Error Rate (WER) is the primary metric used to assess automatic speech recognition (ASR) model quality. It has been shown that ASR models tend to have much higher WER on speakers with speech impairments than on typical English speakers. It is hard to determine whether models can be useful at such high error rates. This study investigates the use of BERTScore, an evaluation metric for text generation, to provide a more informative measure of ASR model quality and usefulness. Both BERTScore and WER were compared to prediction errors manually annotated by Speech Language Pathologists for error type and assessment. BERTScore was found to be more correlated with human assessment of error type and assessment. BERTScore was specifically more robust to orthographic changes (contraction and normalization errors) where meaning was preserved. Furthermore, BERTScore was a better fit of error assessment than WER, as measured using an ordinal logistic regression and the Akaike Information Criterion (AIC). Overall, our findings suggest that BERTScore can complement WER when assessing ASR model performance from a practical perspective, especially for accessibility applications where models are useful even at lower accuracy than for typical speech.
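To illustrate why WER can penalize meaning-preserving orthographic changes such as contractions, the following is a minimal sketch of a word-level WER computation. It is not the authors' evaluation code: the function name `word_error_rate`, the lowercase-and-split-on-whitespace normalization, and the example sentences are all assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: text is normalized only by lowercasing and splitting on
    whitespace; real ASR evaluations usually apply richer normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words (standard Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts, not taken from the paper's data.
reference = "i am going home"
contraction = "i'm going home"      # meaning preserved
wrong_word = "i am going hill"      # meaning changed

print(word_error_rate(reference, contraction))  # 0.5
print(word_error_rate(reference, wrong_word))   # 0.25
```

In this sketch the contraction costs one substitution plus one deletion, so it scores a higher WER than the transcript that actually changes the meaning. This is the kind of case where the abstract reports that an embedding-based metric such as BERTScore is more robust.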
