在压缩尝试中定位路径

论文标题

在压缩尝试中定位路径

On Locating Paths in Compressed Tries

论文作者

Prezza, Nicola

论文摘要

在本文中，我们考虑了在支持强大的\ emph {locate}查询的同时压缩TRIE的问题：要返回带有给定查询模式的路径所能达到的所有节点的预订标识符。我们的结果建立在Ferragina等人的XBWT树转化之上。 [FOCS 2005]并概括了\ emph {r-index}定位Gagie等人的机械。 [SODA 2018，JACM 2020]基于运行长度编码的洞穴 - 轮毂变换（BWT）。我们的第一个贡献是提出对运行长度BWT的合适概括。我们表明，这种自然概括享有其在字符串上的几种有用的特性：尤其是，这种转换本身支持了Trie路径上查询模式的计数发生，并且其大小$ r $ $ r $捕获了Trie的重复性，并降低了Trie Trie entrapy的自然概念。我们的主要贡献是对该对象的组合结构的深入了解。详细说明，我们表明$ O（r \ log n） + 2n + o（n）$位的数据结构，其中$ n $是节点的数量，允许查找$ occ $ occ $ occ $ ock $ m $的$ m $ ock $ m $中的$ m $（m \logσ + occ）$ time，$σ$是$σ$的尺寸。我们的解决方案包括在定位过程中对可以用作“锚点”的$ o（r）$节点。一旦获得了第一种模式出现的预订标识符（以共同摄影顺序），我们表明，在这些锚点之间，恒定数量的恒定时间跳跃导致下一个模式出现的标识符，从而以最佳$ o（1）$ $ O（1）$的时间启用。

In this paper, we consider the problem of compressing a trie while supporting the powerful \emph{locate} queries: to return the pre-order identifiers of all nodes reached by a path labeled with a given query pattern. Our result builds on top of the XBWT tree transform of Ferragina et al. [FOCS 2005] and generalizes the \emph{r-index} locate machinery of Gagie et al. [SODA 2018, JACM 2020] based on the run-length encoded Burrows-Wheeler transform (BWT). Our first contribution is to propose a suitable generalization of the run-length BWT to tries. We show that this natural generalization enjoys several of the useful properties of its counterpart on strings: in particular, the transform natively supports counting occurrences of a query pattern on the trie's paths and its size $r$ captures the trie's repetitiveness and lower-bounds a natural notion of trie entropy. Our main contribution is a much deeper insight into the combinatorial structure of this object. In detail, we show that a data structure of $O(r\log n) + 2n + o(n)$ bits, where $n$ is the number of nodes, allows locating the $occ$ occurrences of a pattern of length $m$ in nearly-optimal $O(m\logσ+ occ)$ time, where $σ$ is the alphabet's size. Our solution consists in sampling $O(r)$ nodes that can be used as "anchor points" during the locate process. Once obtained the pre-order identifier of the first pattern occurrence (in co-lexicographic order), we show that a constant number of constant-time jumps between those anchor points lead to the identifier of the next pattern occurrence, thus enabling locating in optimal $O(1)$ time per occurrence.

下载PDF全文

下载文献需遵守相关版权规定

论文标题