Paper Title
Method for Determining the Similarity of Text Documents for the Kazakh language, Taking Into Account Synonyms: Extension to TF-IDF
Authors
Abstract
The task of determining the similarity of text documents has received considerable attention in many areas, such as Information Retrieval, Text Mining, Natural Language Processing (NLP) and Computational Linguistics. Converting text into numeric vectors is a complex task that involves algorithms such as tokenization, stopword filtering, stemming, and term weighting. Term frequency-inverse document frequency (TF-IDF) is the most widely used term weighting method for facilitating the search for relevant documents. To improve term weighting, a large number of TF-IDF extensions have been proposed. This paper proposes another extension of the TF-IDF method that takes synonyms into account. The effectiveness of the method is confirmed by experiments with similarity measures such as Cosine, Dice and Jaccard, used to measure the similarity of text documents in the Kazakh language.
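The abstract does not state the exact weighting formula, so the following is only a minimal sketch of one way synonyms could be folded into TF-IDF, not the authors' method. It assumes that every synonym is mapped to a single canonical term before counting, uses a smoothed IDF variant, and uses the vector-space forms of Cosine, Dice and Jaccard. The names `SYNONYMS`, `normalize`, `tfidf_vectors`, `cosine`, `dice` and `jaccard`, as well as the toy English tokens, are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter

# Hypothetical synonym table (an assumption for illustration):
# each synonym maps to one canonical representative term.
SYNONYMS = {"automobile": "car", "auto": "car"}


def normalize(tokens, synonyms):
    """Replace every token with its canonical synonym, if it has one."""
    return [synonyms.get(t, t) for t in tokens]


def tfidf_vectors(docs, synonyms=SYNONYMS):
    """Build sparse TF-IDF vectors (dicts) after merging synonyms.

    Uses relative term frequency and a smoothed IDF,
    log((1 + N) / (1 + df)) + 1, which is one common variant.
    """
    docs = [normalize(d, synonyms) for d in docs]
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append({
            t: (c / len(d)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in tf.items()
        })
    return vectors


def _dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * b[t] for t, w in a.items() if t in b)


def cosine(a, b):
    denom = math.sqrt(_dot(a, a)) * math.sqrt(_dot(b, b))
    return _dot(a, b) / denom if denom else 0.0


def dice(a, b):
    # Vector-space form of the Dice coefficient.
    denom = _dot(a, a) + _dot(b, b)
    return 2 * _dot(a, b) / denom if denom else 0.0


def jaccard(a, b):
    # Vector-space (Tanimoto) form of the Jaccard coefficient.
    ab = _dot(a, b)
    denom = _dot(a, a) + _dot(b, b) - ab
    return ab / denom if denom else 0.0
```

Under these assumptions, two documents that differ only in their choice of synonym, such as `["car", "fast"]` and `["automobile", "fast"]`, receive identical vectors and therefore the maximum similarity under all three measures, whereas passing an empty synonym table to `tfidf_vectors` leaves them only partially similar.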