Paper title
LEXpander: applying colexification networks to automated lexicon expansion
Paper authors
Paper abstract
Recent approaches to the analysis of text from social media and other corpora rely on word lists to detect topics, measure meaning, or select relevant documents. These lists are often generated by applying computational lexicon expansion methods to small, manually curated sets of root words. Despite the wide use of this approach, we still lack an exhaustive comparative analysis of the performance of lexicon expansion methods and of how they can be improved with additional linguistic data. In this work, we present LEXpander, a method for lexicon expansion that leverages novel data on colexification, i.e., semantic networks connecting words based on shared concepts and translations to other languages. We evaluate LEXpander in a benchmark that includes widely used methods for lexicon expansion based on various word embedding models and synonym networks. We find that LEXpander outperforms existing approaches in terms of both precision and the trade-off between precision and recall of generated word lists in a variety of tests. Our benchmark includes several linguistic categories and sentiment variables in English and German. We also show that the expanded word lists constitute a high-performing text analysis method when applied to various corpora. In this way, LEXpander offers a systematic, automated solution for expanding short lists of words into exhaustive and accurate word lists that closely approximate those generated by experts in psychology and linguistics.
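To make the general idea concrete, the following is a minimal sketch of network-based lexicon expansion and its evaluation against an expert word list. It is not the LEXpander implementation: the abstract does not specify the algorithm, so the neighbor-expansion rule, the `max_hops` parameter, the function names, and the toy graph are all illustrative assumptions. The F1 score is used here only as one common way to summarize the trade-off between precision and recall.

```python
from collections import deque


def expand_lexicon(graph, seeds, max_hops=1):
    """Return the seeds plus every word within `max_hops` edges of a seed.

    `graph` maps each word to the set of words it is linked to. In a
    colexification network a link would mean that two words share a
    concept or a translation; here the links are invented toy data.
    """
    expanded = set(seeds)
    frontier = deque((seed, 0) for seed in seeds)
    while frontier:
        word, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(word, ()):
            if neighbor not in expanded:
                expanded.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return expanded


def precision_recall_f1(predicted, reference):
    """Score a generated word list against a reference (expert) list."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy network and word lists (assumptions, not real data).
TOY_GRAPH = {
    "happy": {"glad", "lucky"},
    "glad": {"happy", "cheerful"},
    "lucky": {"happy", "fortunate"},
    "cheerful": {"glad"},
    "fortunate": {"lucky"},
    "sad": {"unhappy"},
    "unhappy": {"sad"},
}
SEEDS = {"happy"}
EXPERT_LIST = {"happy", "glad", "cheerful", "joyful"}

one_hop = expand_lexicon(TOY_GRAPH, SEEDS, max_hops=1)
two_hops = expand_lexicon(TOY_GRAPH, SEEDS, max_hops=2)

print(sorted(one_hop), precision_recall_f1(one_hop, EXPERT_LIST))
print(sorted(two_hops), precision_recall_f1(two_hops, EXPERT_LIST))
```

In this toy setting, widening the expansion from one hop to two adds both a correct word ("cheerful") and an incorrect one ("fortunate"), which illustrates why a benchmark needs to report precision together with recall rather than either measure alone.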