Paper Title

uChecker: Masked Pretrained Language Models as Unsupervised Chinese Spelling Checkers

Paper Authors

Li, Piji

Paper Abstract

The task of Chinese Spelling Check (CSC) aims to detect and correct spelling errors in text. Because manually annotating a high-quality dataset is expensive and time-consuming, training datasets are usually very small (e.g., SIGHAN15 contains only 2,339 samples for training), so supervised-learning-based models typically suffer from data sparsity and over-fitting, especially in the era of big language models. In this paper, we investigate the unsupervised paradigm for the CSC problem and propose a framework named uChecker to conduct unsupervised spelling error detection and correction. Masked pretrained language models such as BERT are introduced as the backbone model, considering their powerful language diagnosis capability. Benefiting from the various and flexible MASKing operations, we propose a confusionset-guided masking strategy to fine-train the masked language model and further improve the performance of unsupervised detection and correction. Experimental results on standard datasets demonstrate the effectiveness of our proposed model uChecker in terms of character-level and sentence-level accuracy, precision, recall, and F1-measure on the spelling error detection and correction tasks, respectively.
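As a rough illustration (not the authors' released code), the Python sketch below shows the two ideas named in the abstract: (a) using a masked language model such as BERT as an unsupervised checker by masking each position and flagging characters the model finds improbable, and (b) a confusionset-guided masking routine that substitutes confusable characters instead of [MASK] when building fine-training inputs. The model name, probability threshold, and toy confusionset are illustrative assumptions, not the paper's exact algorithm.

```python
# A minimal sketch, assuming a mask-each-position probing scheme; the model
# name, threshold, and toy confusionset are illustrative assumptions.
import random
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
model.eval()

def check_sentence(text: str, threshold: float = 1e-4):
    """Mask each character in turn; if the model assigns the original
    character a very low probability, flag it (detection) and propose
    the model's top-1 prediction as the fix (correction)."""
    chars = list(text)
    corrections = []
    for i, ch in enumerate(chars):
        tokens = ["[CLS]"] + chars[:i] + ["[MASK]"] + chars[i + 1:] + ["[SEP]"]
        input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
        with torch.no_grad():
            logits = model(input_ids).logits
        probs = torch.softmax(logits[0, i + 1], dim=-1)  # i+1 skips [CLS]
        if probs[tokenizer.convert_tokens_to_ids(ch)] < threshold:
            best = tokenizer.convert_ids_to_tokens(int(probs.argmax()))
            corrections.append((i, ch, best))
    return corrections

# Hypothetical toy confusionset; a real one maps each character to its
# phonetically or visually confusable characters.
CONFUSIONSET = {"天": ["添", "田"], "气": ["汽", "器"]}

def confusion_mask(chars, mask_prob: float = 0.15):
    """Instead of always substituting [MASK], replace sampled characters
    with confusable ones, so the model learns to recover the correct
    character from realistic spelling errors during fine-training."""
    out = list(chars)
    for i, ch in enumerate(out):
        if ch in CONFUSIONSET and random.random() < mask_prob:
            out[i] = random.choice(CONFUSIONSET[ch])
    return out
```

A practical implementation would batch the masked variants for speed and might restrict correction candidates to the confusionset; both refinements are omitted here for brevity.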
