Paper Title

DPCSpell: A Transformer-based Detector-Purificator-Corrector Framework for Spelling Error Correction of Bangla and Resource Scarce Indic Languages

Authors

Mehedi Hasan Bijoy, Nahid Hossain, Salekul Islam, Swakkhar Shatabda

Abstract

Spelling error correction is the task of identifying and rectifying misspelled words in texts. It is an active research topic in Natural Language Processing because of its numerous applications in human language understanding. Phonetically or visually similar yet semantically distinct characters make it an arduous task in any language. Earlier efforts on spelling error correction in Bangla and resource-scarce Indic languages focused on rule-based, statistical, and machine learning-based methods, which we found rather inefficient. In particular, machine learning-based approaches, which outperform rule-based and statistical methods, are ineffective because they correct each character regardless of its appropriateness. In this paper, we propose DPCSpell, a novel detector-purificator-corrector framework based on denoising transformers, which addresses these issues. In addition, we present a method for large-scale corpus creation from scratch, which in turn resolves the resource limitation problem of any left-to-right scripted language. The empirical outcomes demonstrate the effectiveness of our approach, which outperforms previous state-of-the-art methods by attaining an exact match (EM) score of 94.78%, a precision score of 0.9487, a recall score of 0.9478, an F1 score of 0.948, an F0.5 score of 0.9483, and a modified accuracy (MA) score of 95.16% for Bangla spelling error correction. The models and corpus are publicly available at https://tinyurl.com/DPCSpell.
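
For readers unfamiliar with the metrics named in the abstract, the following is a minimal sketch of how an F-beta score and an exact match rate are commonly computed. It is not the authors' evaluation code: the function names `f_beta` and `exact_match` are hypothetical, and the word-level exact match definition is an assumption. The reported scores may be averaged across classes or samples, so plugging the aggregate precision and recall into this formula need not reproduce them exactly.

```python
from typing import Sequence


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (standard F-beta).

    beta < 1 weights precision more heavily (e.g. F0.5);
    beta = 1 gives the usual F1 score.
    """
    if precision < 0 or recall < 0 or beta <= 0:
        raise ValueError("precision and recall must be >= 0 and beta > 0")
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0.0:
        # Both precision and recall are zero; the score is defined as 0.
        return 0.0
    return (1.0 + b2) * precision * recall / denom


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions identical to their reference (assumed definition)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)
```

Calling `f_beta(p, r, beta=0.5)` gives the precision-weighted variant, which suits spelling correction when a wrong "fix" is considered more costly than a missed error.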
