Paper Title

The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes

Authors

Nils Reimers, Iryna Gurevych

Abstract

Information retrieval using dense low-dimensional representations has recently become popular and has been shown to outperform traditional sparse representations such as BM25. However, no previous work has investigated how dense representations perform with large index sizes. We show theoretically and empirically that the performance of dense representations decreases more quickly than that of sparse representations as the index size increases. In extreme cases, this can even lead to a tipping point: at a certain index size, sparse representations outperform dense representations. We show that this behavior is tightly connected to the number of dimensions of the representations: the lower the dimension, the higher the chance of false positives, i.e., of returning irrelevant documents.
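The link between dimensionality, index size, and false positives can be illustrated with a toy simulation. The sketch below is not the authors' experiment: it assumes that irrelevant documents behave like random points on the unit sphere, and the helper names `random_unit_vector` and `max_random_similarity` are hypothetical. It measures the highest cosine similarity that any purely random "document" reaches against a random query. Under this assumption, the best-scoring irrelevant document scores higher when the dimension is lower and when the index is larger, which is the intuition behind the false-positive claim.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a vector uniformly from the unit sphere in `dim` dimensions."""
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def max_random_similarity(dim, index_size, seed=0):
    """Highest cosine similarity between one random query and
    `index_size` random (hence irrelevant) document vectors."""
    rng = random.Random(seed)
    query = random_unit_vector(dim, rng)
    best = -1.0
    for _ in range(index_size):
        doc = random_unit_vector(dim, rng)
        # Both vectors have unit length, so the dot product is the cosine.
        best = max(best, sum(q * d for q, d in zip(query, doc)))
    return best


if __name__ == "__main__":
    # Assumed toy settings: two dimensions, two index sizes.
    for dim in (8, 128):
        for index_size in (100, 1000):
            score = max_random_similarity(dim, index_size)
            print(f"dim={dim:4d} index_size={index_size:5d} max_sim={score:.3f}")
```

Because the random stream is seeded, a larger index contains the smaller one as a prefix, so the reported maximum can only stay the same or grow with the index size. Real embeddings are not uniformly distributed, so the absolute numbers here say nothing about any actual retrieval model.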
