Paper Title

A Part-of-Speech Tagger for Yiddish

Authors

Seth Kulick, Neville Ryant, Beatrice Santorini, Joel Wallenberg, Assaf Urieli

Abstract

We describe the construction and evaluation of a part-of-speech tagger for Yiddish. This is the first step in a larger project of automatically assigning part-of-speech tags and syntactic structure to Yiddish text for purposes of linguistic research. We combine two resources for the current work: an 80K-word subset of the Penn Parsed Corpus of Historical Yiddish (PPCHY) and 650 million words of OCR'd Yiddish text from the Yiddish Book Center (YBC). Yiddish orthography in the YBC corpus has many spelling inconsistencies, and we present some evidence that even simple non-contextualized embeddings trained on YBC are able to capture the relationships among spelling variants without the need to first "standardize" the corpus. We also use YBC for continued pretraining of contextualized embeddings, which are then integrated into a tagger model trained and evaluated on the PPCHY. We evaluate the tagger performance on a 10-fold cross-validation split, showing that the use of the YBC text for the contextualized embeddings improves tagger performance. We conclude by discussing some next steps, including the need for additional annotated training and test data.
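
The abstract mentions evaluation on a 10-fold cross-validation split but does not say how the folds were constructed. As a rough illustration only, the following sketch shows one generic way to partition a list of sentences into 10 folds. The function name `k_fold_splits`, the shuffling seed, and the placeholder sentences are all hypothetical and are not taken from the paper.

```python
import random


def k_fold_splits(items, k=10, seed=0):
    """Yield (train, test) lists for each of k folds over items.

    Hypothetical helper: the actual fold construction used for the
    PPCHY evaluation is not described in the abstract.
    """
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    # Fold j takes every k-th shuffled index starting at offset j.
    folds = [indices[j::k] for j in range(k)]
    for held_out in range(k):
        test_idx = set(folds[held_out])
        train = [items[i] for i in indices if i not in test_idx]
        test = [items[i] for i in folds[held_out]]
        yield train, test


# Placeholder data standing in for annotated sentences.
sentences = [f"sent_{i}" for i in range(25)]
splits = list(k_fold_splits(sentences, k=10))
```

Each sentence lands in exactly one test fold, so averaging accuracy over the 10 folds uses every annotated sentence for evaluation once.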
