Paper title

An efficient algorithm for three-component key index construction

Author

Veretennikov, Alexander B.

Abstract

This paper considers proximity full-text search in large text collections. A search query consists of several words, and the search result is a list of documents containing these words. In a modern search system, documents in which the query words occur near each other are more relevant than documents in which they do not. To support this, the index must store a record for each occurrence of each word in each indexed document, so query time is proportional to the number of occurrences of the queried words in the indexed documents. Consequently, search systems commonly evaluate queries that contain frequently occurring words much more slowly than queries that contain less frequent, ordinary words. For each word in the text, we use additional indexes to store information about nearby words, namely those at a distance from the given word less than or equal to a parameter MaxDistance. This parameter can take a value of 5, 7, or even more. Three-component key indexes can be created for faster query execution. Previously, we presented experimental results showing that when queries contain very frequently occurring words, the average query execution time with three-component key indexes is 94.7 times less than with ordinary inverted indexes. In the current work, we describe a new algorithm for building three-component key indexes and demonstrate its correctness. We present experimental results on building such an index for different values of MaxDistance.
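To make the idea of a three-component key concrete, the following is a minimal Python sketch. It is not the construction algorithm described in the paper. It assumes a deliberately naive in-memory scheme: every word occurrence serves as an anchor f, the other two components s and t are any two words within MaxDistance positions of it, and each posting stores the document ID, the anchor position, and the two offsets. The function names, the posting layout, and the whitespace tokenization are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_three_component_index(docs, max_distance=5):
    """Map (f, s, t) word triples to postings (doc_id, pos_f, offset_s, offset_t).

    Naive illustration only (an assumption, not the paper's algorithm):
    every word is treated as an anchor f, and s, t are any two words
    located within max_distance positions of that anchor.
    """
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        words = text.lower().split()  # hypothetical tokenizer: split on whitespace
        for pos, f in enumerate(words):
            lo = max(0, pos - max_distance)
            hi = min(len(words), pos + max_distance + 1)
            neighbours = [p for p in range(lo, hi) if p != pos]
            for p2, p3 in combinations(neighbours, 2):
                # Order the two neighbours by word so each key has one canonical form.
                (s, ps), (t, pt) = sorted([(words[p2], p2), (words[p3], p3)])
                index[(f, s, t)].append((doc_id, pos, ps - pos, pt - pos))
    return index


def search(index, f, s, t):
    """Return sorted IDs of documents in which s and t both occur near f."""
    s, t = sorted([s, t])
    return sorted({doc_id for doc_id, *_ in index.get((f, s, t), [])})


docs = ["to be or not to be", "who are you and who am i"]
index = build_three_component_index(docs, max_distance=2)
print(search(index, "be", "to", "or"))    # → [0]
print(search(index, "who", "are", "you"))  # → [1]
```

With this layout, a three-word proximity query becomes a single key lookup instead of an intersection of three long posting lists, which is the intuition behind the speed-up for frequent words. The cost is that the number of postings grows quickly with MaxDistance, which is why the time and space needed to build the index for different MaxDistance values matter.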
