Paper Title
Comparison of Turkish Word Representations Trained on Different Morphological Forms
Paper Authors
Paper Abstract
The increased popularity of different text representations has brought many improvements to Natural Language Processing (NLP) tasks. Without the need for supervised data, embeddings trained on large corpora provide us with meaningful relations that can be used in different NLP tasks. Even though training these vectors is relatively easy with recent methods, the information gained from the data heavily depends on the structure of the corpus language. Since the most widely studied languages share a similar morphological structure, problems that arise for morphologically rich languages are largely disregarded in the literature. For morphologically rich languages, context-free word vectors ignore the morphological structure of the language. In this study, we prepared texts in morphologically different forms in a morphologically rich language, Turkish, and compared the results on different intrinsic and extrinsic tasks. To see the effect of morphological structure, we trained the word2vec model on texts in which lemmas and suffixes are treated differently. We also trained the subword model fastText and compared the embeddings on word analogy, text classification, sentiment analysis, and language modeling tasks.
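A minimal sketch of the two training setups the abstract contrasts, written with the gensim library. This is not the authors' code: the library choice, the corpus file name corpus_lemma_suffix.txt, the lemma/suffix token format, and all hyperparameters are illustrative assumptions, not details from the paper.

# Sketch: context-free (word2vec) vs. subword (fastText) embeddings on
# Turkish text preprocessed so lemmas and suffixes are separate tokens.
from gensim.models import Word2Vec, FastText

# Hypothetical preprocessed corpus: one sentence per line, with each
# suffix split off as its own token, e.g.
#   "ev +de kal +di"   (lemma "ev" + locative suffix, lemma "kal" + past suffix)
with open("corpus_lemma_suffix.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

# Context-free vectors: word2vec treats every token (lemma or suffix)
# as an atomic unit and learns one vector per token.
w2v = Word2Vec(sentences, vector_size=300, window=5, min_count=5, workers=4)

# Subword-aware vectors: fastText additionally represents each token as a
# bag of character n-grams, letting it capture suffix morphology even on
# unsegmented surface forms.
ft = FastText(sentences, vector_size=300, window=5, min_count=5, workers=4)

# Intrinsic sanity check with illustrative queries: nearest neighbours of a
# lemma, and a surface form that fastText can embed via its n-grams even if
# the exact word never occurred in training.
print(w2v.wv.most_similar("ev"))
print(ft.wv.most_similar("evlerimizde"))

The design difference this exposes is the one the study measures: word2vec's vocabulary is fixed at training time, so unseen inflected forms have no vector, while fastText composes a vector for any string from its character n-grams.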