Cross-Lingual Word Embeddings for Turkic Languages

Paper Title

Cross-Lingual Word Embeddings for Turkic Languages

Authors

Elmurod Kuriyozov, Yerai Doval, Carlos Gómez-Rodríguez

Abstract

There has been increasing interest in learning cross-lingual word embeddings to transfer knowledge obtained from a resource-rich language, such as English, to lower-resource languages for which annotated data is scarce, such as Turkish, Russian, and many others. In this paper, we present the first viability study of established techniques to align monolingual embedding spaces for Turkish, Uzbek, Azeri, Kazakh and Kyrgyz, members of the Turkic family, which is heavily affected by the low-resource constraint. These techniques are known to require little explicit supervision, mainly in the form of bilingual dictionaries, and are therefore easily adaptable to different domains, including low-resource ones. We obtain new bilingual dictionaries and new word embeddings for these languages and show the steps for obtaining cross-lingual word embeddings using state-of-the-art techniques. Then, we evaluate the results using the bilingual dictionary induction task. Our experiments confirm that the obtained bilingual dictionaries outperform previously available ones, and that word embeddings from a low-resource language can benefit from resource-rich, closely related languages when they are aligned together. Furthermore, evaluation on an extrinsic task (sentiment analysis on Uzbek) proves that monolingual word embeddings can, although slightly, benefit from cross-lingual alignments.
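As a rough illustration of the bilingual dictionary induction evaluation mentioned in the abstract, the sketch below retrieves, for each source word, the nearest target word by cosine similarity and reports precision@1 against a gold dictionary. It assumes the two embedding sets have already been mapped into one shared space. The function names (`cosine`, `induce_translation`, `precision_at_1`) and the tiny 2-dimensional vectors are hypothetical and chosen only for readability; this is not the authors' evaluation code, and published systems often use more refined retrieval criteria than plain nearest neighbour.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def induce_translation(word, src_emb, tgt_emb):
    """Return the target word whose vector is closest to the source word's vector."""
    query = src_emb[word]
    return max(tgt_emb, key=lambda t: cosine(query, tgt_emb[t]))


def precision_at_1(test_pairs, src_emb, tgt_emb):
    """Fraction of gold pairs whose nearest target neighbour is the gold translation."""
    hits = sum(
        1 for src, tgt in test_pairs
        if induce_translation(src, src_emb, tgt_emb) == tgt
    )
    return hits / len(test_pairs)


# Made-up 2-d vectors, assumed to be already aligned in a shared space.
# Uzbek -> Turkish: water, book, house.
uz_emb = {"suv": [0.9, 0.1], "kitob": [0.1, 0.9], "uy": [0.7, 0.7]}
tr_emb = {"su": [1.0, 0.0], "kitap": [0.0, 1.0], "ev": [0.6, 0.8]}
gold = [("suv", "su"), ("kitob", "kitap"), ("uy", "ev")]

score = precision_at_1(gold, uz_emb, tr_emb)
print(f"precision@1 = {score:.2f}")
```

With real embeddings, the dictionaries would hold vectors of a few hundred dimensions and the gold pairs would come from a held-out portion of the bilingual dictionary, so that the alignment is not scored on the pairs it was trained on.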
