Paper Title
Evaluation of HTR models without Ground Truth Material
Paper Authors
Paper Abstract
The evaluation of Handwritten Text Recognition (HTR) models during their development is straightforward: because HTR is a supervised problem, the usual data split into training, validation, and test data sets allows the evaluation of models in terms of accuracy or error rates. However, the evaluation process becomes tricky as soon as we switch from development to application. Compiling a new (and necessarily smaller) ground truth (GT) from a sample of the data that we want to apply the model on, and then evaluating the models thereon, only provides hints about the quality of the recognised text, as do confidence scores that the models return (if available). Moreover, if we have several models at hand, we face a model selection problem, since we want to obtain the best possible result during the application phase. This calls for GT-free metrics to select the best model, which is why we (re-)introduce and compare different metrics, from simple, lexicon-based ones to more elaborate ones using standard language models and masked language models (MLM). We show that MLM-based evaluation can compete with lexicon-based methods, with the advantage that large and multilingual transformers are readily available, thus making the compilation of lexical resources for the other metrics superfluous.
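To make the two families of GT-free metrics concrete, the following is a minimal sketch (not the authors' implementation) of how a lexicon-based word-accuracy proxy and an MLM pseudo-log-likelihood score could be computed for recognised lines with HuggingFace transformers; the model name, the toy lexicon, and the example HTR outputs are illustrative assumptions.

```python
# Sketch of two GT-free quality metrics for HTR output:
#  (1) lexicon_score: fraction of recognised tokens found in a reference lexicon
#  (2) mlm_pseudo_log_likelihood: average pseudo-log-likelihood under a masked LM
# Model name, lexicon, and example outputs are placeholders, not from the paper.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM


def lexicon_score(lines, lexicon):
    """Fraction of recognised tokens that appear in the reference lexicon."""
    tokens = [tok.lower().strip(".,;:!?") for line in lines for tok in line.split()]
    if not tokens:
        return 0.0
    return sum(tok in lexicon for tok in tokens) / len(tokens)


def mlm_pseudo_log_likelihood(line, tokenizer, model):
    """Mask each token in turn and average the log-probability the MLM
    assigns to the original token (higher = more plausible text)."""
    input_ids = tokenizer(line, return_tensors="pt")["input_ids"][0]
    total, n = 0.0, 0
    for i in range(1, input_ids.size(0) - 1):  # skip [CLS] and [SEP]
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        total += log_probs[input_ids[i]].item()
        n += 1
    return total / max(n, 1)


if __name__ == "__main__":
    # Hypothetical recognised lines from two HTR models applied to the same page.
    outputs = {
        "model_a": ["the quick brown fox jumps over the lazy dog"],
        "model_b": ["the quiek brovvn fox jumps ovr the lazy dog"],
    }
    lexicon = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")
    model.eval()

    for name, lines in outputs.items():
        lex = lexicon_score(lines, lexicon)
        mlm = sum(mlm_pseudo_log_likelihood(l, tokenizer, model) for l in lines) / len(lines)
        print(f"{name}: lexicon={lex:.2f}  mlm_pll={mlm:.2f}")
```

In this reading, model selection without GT amounts to ranking candidate models by such scores on the application data; the multilingual MLM replaces the curated lexicon that the simpler metric requires.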