论文标题

在压缩尝试中定位路径

On Locating Paths in Compressed Tries

论文作者

Prezza, Nicola

论文摘要

在本文中,我们考虑了在支持强大的\ emph {locate}查询的同时压缩TRIE的问题:要返回带有给定查询模式的路径所能达到的所有节点的预订标识符。我们的结果建立在Ferragina等人的XBWT树转化之上。 [FOCS 2005]并概括了\ emph {r-index}定位Gagie等人的机械。 [SODA 2018,JACM 2020]基于运行长度编码的洞穴 - 轮毂变换(BWT)。我们的第一个贡献是提出对运行长度BWT的合适概括。我们表明,这种自然概括享有其在字符串上的几种有用的特性:尤其是,这种转换本身支持了Trie路径上查询模式的计数发生,并且其大小$ r $ $ r $捕获了Trie的重复性,并降低了Trie Trie entrapy的自然概念。我们的主要贡献是对该对象的组合结构的深入了解。详细说明,我们表明$ O(r \ log n) + 2n + o(n)$位的数据结构,其中$ n $是节点的数量,允许查找$ occ $ occ $ occ $ ock $ m $的$ m $ ock $ m $中的$ m $(m \logσ + occ)$ time,$σ$是$σ$的尺寸。我们的解决方案包括在定位过程中对可以用作“锚点”的$ o(r)$节点。一旦获得了第一种模式出现的预订标识符(以共同摄影顺序),我们表明,在这些锚点之间,恒定数量的恒定时间跳跃导致下一个模式出现的标识符,从而以最佳$ o(1)$ $ O(1)$的时间启用。

In this paper, we consider the problem of compressing a trie while supporting the powerful \emph{locate} queries: to return the pre-order identifiers of all nodes reached by a path labeled with a given query pattern. Our result builds on top of the XBWT tree transform of Ferragina et al. [FOCS 2005] and generalizes the \emph{r-index} locate machinery of Gagie et al. [SODA 2018, JACM 2020] based on the run-length encoded Burrows-Wheeler transform (BWT). Our first contribution is to propose a suitable generalization of the run-length BWT to tries. We show that this natural generalization enjoys several of the useful properties of its counterpart on strings: in particular, the transform natively supports counting occurrences of a query pattern on the trie's paths and its size $r$ captures the trie's repetitiveness and lower-bounds a natural notion of trie entropy. Our main contribution is a much deeper insight into the combinatorial structure of this object. In detail, we show that a data structure of $O(r\log n) + 2n + o(n)$ bits, where $n$ is the number of nodes, allows locating the $occ$ occurrences of a pattern of length $m$ in nearly-optimal $O(m\logσ+ occ)$ time, where $σ$ is the alphabet's size. Our solution consists in sampling $O(r)$ nodes that can be used as "anchor points" during the locate process. Once obtained the pre-order identifier of the first pattern occurrence (in co-lexicographic order), we show that a constant number of constant-time jumps between those anchor points lead to the identifier of the next pattern occurrence, thus enabling locating in optimal $O(1)$ time per occurrence.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源