Title
Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation
Authors
Abstract
Common designs of model evaluation typically focus on monolingual settings, where different models are compared according to their performance on a single data set that is assumed to be representative of all possible data for the task at hand. While this may be reasonable for a large data set, the assumption is difficult to maintain in low-resource scenarios, where artifacts of data collection can yield data sets that are outliers, potentially making conclusions about model performance coincidental. To address these concerns, we investigate model generalizability in crosslinguistic low-resource scenarios. Using morphological segmentation as the test case, we compare three broad classes of models with different parameterizations, taking data from 11 languages across 6 language families. In each experimental setting, we evaluate all models on a first data set, then examine their performance consistency when introducing new randomly sampled data sets of the same size and when applying the trained models to unseen test sets of varying sizes. The results demonstrate that the extent of model generalization depends on the characteristics of the data set and does not necessarily rely heavily on the data set size. Among the characteristics we studied, the ratio of morpheme overlap between the training and test sets and the ratio of their average numbers of morphemes per word are the two most prominent factors. Our findings suggest that future work should adopt random sampling to construct data sets of varying sizes in order to make more responsible claims about model evaluation.
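
To make the two dataset characteristics and the sampling protocol concrete, below is a minimal Python sketch; it is our own illustration, not code from the paper. It assumes each word's segmentation is given as a list of gold morphemes, and the function names (morpheme_overlap, morphemes_per_word, random_split) and the toy corpus are hypothetical.

import random

def morphemes_per_word(dataset):
    # Average number of morphemes per word, where each entry
    # in the dataset is one word's list of morphemes.
    return sum(len(morphs) for morphs in dataset) / len(dataset)

def morpheme_overlap(train, test):
    # Fraction of morpheme types in the test set that also
    # appear in the training set.
    train_types = {m for morphs in train for m in morphs}
    test_types = {m for morphs in test for m in morphs}
    return len(test_types & train_types) / len(test_types)

def random_split(corpus, train_size, test_size, seed=0):
    # Randomly sample disjoint train/test sets of fixed sizes;
    # repeating this with different seeds mirrors the repeated
    # random-sampling protocol described in the abstract.
    rng = random.Random(seed)
    shuffled = corpus[:]
    rng.shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:train_size + test_size]

# Hypothetical toy data: each word is a list of its gold morphemes.
corpus = [["un", "break", "able"], ["walk", "ed"], ["cat", "s"], ["walk", "ing"]]
train, test = random_split(corpus, train_size=3, test_size=1, seed=42)
print(morpheme_overlap(train, test))
print(morphemes_per_word(train) / morphemes_per_word(test))

Computing these statistics over many random resamples of the same corpus gives a distribution of dataset characteristics against which the stability of model scores can be checked, rather than relying on a single, possibly outlying split.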