Idiomify -- Building a Collocation-supplemented Reverse Dictionary of English Idioms with Word2Vec for non-native learners

Paper title

Idiomify -- Building a Collocation-supplemented Reverse Dictionary of English Idioms with Word2Vec for non-native learners

Author

Kim, Eu-Bin

Abstract

The aim of Idiomify is to build a collocation-supplemented reverse dictionary of idioms for non-native learners of English. A reverse dictionary lets learners look up idioms on demand by describing a meaning, and collocations guide them toward using those idioms appropriately. The cornerstone of the project is a reliable way of mining idioms from corpora, which is a challenge because idioms vary extensively in form. We tackle this by automatically deriving matching rules from the idioms' base forms. We use Pointwise Mutual Information (PMI) and Term Frequency-Inverse Document Frequency (TF-IDF) to model collocations, since both are popular metrics of pairwise significance. We also try Term Frequency (TF) as a baseline model. The reverse dictionary itself could be implemented with any of three approaches: an inverted index, graphs, or distributional semantics. We take the last approach and implement the reverse dictionary with Word2Vec, because it is the most flexible of the three and Word2Vec is a simple yet strong baseline. Evaluating the methods has revealed room for improvement. We learn that we can identify idioms better with the help of slop, wildcard, and reordering techniques. We also learn that we can get the best of both PMI and TF-IDF if we use machine learning to find the sweet spot between them. Lastly, we learn that Idiomify could be further improved by combining the inverted-index and distributional-semantics approaches. These limits aside, the proposed methods are feasible and their benefits to non-native learners are apparent, so they can be used to help such learners acquire English idioms.
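To make the collocation metric concrete, the following is a minimal sketch of PMI over (idiom, collocate) pairs, using the standard definition PMI(x, y) = log2(p(x, y) / (p(x) p(y))). It is not the paper's implementation: the function name `pmi_scores`, the pair-list input format, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def pmi_scores(pairs):
    """Compute PMI for each observed (idiom, collocate) pair.

    `pairs` is a hypothetical input format: one tuple per co-occurrence
    of an idiom with a collocate found in a corpus.
    """
    pair_counts = Counter(pairs)
    idiom_counts = Counter(idiom for idiom, _ in pairs)
    colloc_counts = Counter(colloc for _, colloc in pairs)
    total = len(pairs)

    scores = {}
    for (idiom, colloc), n in pair_counts.items():
        p_xy = n / total                      # joint probability
        p_x = idiom_counts[idiom] / total     # marginal of the idiom
        p_y = colloc_counts[colloc] / total   # marginal of the collocate
        scores[(idiom, colloc)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy co-occurrence data (invented for illustration only).
pairs = [
    ("beat around the bush", "stop"),
    ("beat around the bush", "stop"),
    ("beat around the bush", "the"),
    ("piece of cake", "exam"),
    ("piece of cake", "the"),
    ("piece of cake", "the"),
]
scores = pmi_scores(pairs)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

In this toy data, "the" co-occurs with both idioms, so its PMI with either one is lower than that of a collocate tied to a single idiom such as "stop" or "exam". A raw TF baseline would rank by count alone and so would not apply this penalty.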
