Paper Title

The Effect of the Multi-Layer Text Summarization Model on the Efficiency and Relevancy of the Vector Space-based Information Retrieval

Authors

Ababneh, Ahmad Hussein; Lu, Joan; Xu, Qiang

Abstract

The massive volume of text uploaded to the internet produces huge inverted indexes in information retrieval (IR) systems, which hurts their efficiency. The purpose of this research is to measure the effect of the Multi-Layer Similarity model of automatic text summarization on building an informative and condensed inverted index in IR systems. To this end, we summarized a considerable number of documents using the Multi-Layer Similarity model and built the inverted index from the automatic summaries that the model generated. A series of experiments was conducted to test performance in terms of efficiency and relevancy. The experiments include comparisons with three existing text summarization models: the Jaccard Coefficient Model, the Vector Space Model, and the Latent Semantic Analysis model. The experiments examined three groups of queries with both manual and automatic relevancy assessment. The Multi-Layer Similarity model had a clear positive effect on the efficiency of the IR system, with no noticeable loss in relevancy. However, the evaluation showed that the traditional statistical models without semantic investigation failed to improve information retrieval efficiency. Compared with previous publications that used summaries as the source of the index, the relevancy assessment of our work was higher, and the Multi-Layer Similarity retrieval constructed an inverted index that was 58% smaller than the inverted index of the main corpus.
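As a rough illustration of the idea described in the abstract, the sketch below builds one inverted index from full documents and another from their summaries, then compares the number of postings. It is a minimal sketch under stated assumptions, not the authors' implementation: the toy corpus, the hand-written summaries (which stand in for the output of any summarizer, not the Multi-Layer Similarity model), and the helper names `tokenize`, `build_inverted_index`, `index_size`, and `jaccard` are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def index_size(index):
    """Count postings, i.e. (term, document) pairs, as a simple size measure."""
    return sum(len(postings) for postings in index.values())


def jaccard(a, b):
    """Jaccard coefficient between the term sets of two texts."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical toy corpus; the summaries are hand-written stand-ins.
docs = {
    1: "Inverted indexes map terms to documents. "
       "Large corpora make the index huge and slow.",
    2: "Text summarization keeps the most informative sentences. "
       "Summaries are much shorter than the source.",
}
summaries = {
    1: "Inverted indexes map terms to documents.",
    2: "Text summarization keeps the most informative sentences.",
}

full_index = build_inverted_index(docs)
summary_index = build_inverted_index(summaries)

# Relative reduction in postings when indexing summaries instead of full text.
reduction = 1 - index_size(summary_index) / index_size(full_index)
print(f"full postings:    {index_size(full_index)}")
print(f"summary postings: {index_size(summary_index)}")
print(f"reduction:        {reduction:.0%}")
```

The reduction printed for this toy data says nothing about the 58% figure reported in the abstract; it only shows how such a comparison between a corpus index and a summary index could be computed. The `jaccard` helper corresponds to the kind of term-overlap measure behind one of the baseline models named above.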
