编辑距离的局部敏感态度函数

论文标题

编辑距离的局部敏感态度函数

Locality-sensitive bucketing functions for the edit distance

论文作者

Chen, Ke, Shao, Mingfu

论文摘要

许多生物信息学应用程序涉及将一组序列存储在其中，其中允许将每个序列分配到多个存储桶中。为了达到高灵敏度和精确度，需要使用铲斗方法将相似的序列分配到同一桶中，同时将不同的序列分配给不同的桶。现有的$ k $ - 基于基于错误率较低的测序数据的测序数据已经有效地降低了误差率的数据敏感性。局部敏感的哈希（LSH）方案能够通过耐受相似序列的编辑来减轻此问题，但是最新的方法仍然存在较大的差距。在这里，我们通过允许将一个序列放入多个存储桶来概括LSH函数。 Formally, a bucketing function, which maps a sequence (of fixed length) into a subset of buckets, is defined to be $(d_1, d_2)$-sensitive if any two sequences within an edit distance of $d_1$ are mapped into at least one shared bucket, and any two sequences with distance at least $d_2$ are mapped into disjoint subsets of buckets.我们构建对局部敏感的铲斗（LSB）函数，具有$（d_1，d_2）$的各种值，并分析其相对于所需的桶总数以及特定序列映射到的桶数的效率。我们还证明了这两个参数在不同的设置中的下限，并表明我们构造的LSB功能是最佳的。这些结果为其在分析误差率高的序列中实际使用的理论基础提供了基础，同时还为设计未封闭的LSH函数的硬度提供了见解。

Many bioinformatics applications involve bucketing a set of sequences where each sequence is allowed to be assigned into multiple buckets. To achieve both high sensitivity and precision, bucketing methods are desired to assign similar sequences into the same bucket while assigning dissimilar sequences into distinct buckets. Existing $k$-mer-based bucketing methods have been efficient in processing sequencing data with low error rate, but encounter much reduced sensitivity on data with high error rate. Locality-sensitive hashing (LSH) schemes are able to mitigate this issue through tolerating the edits in similar sequences, but state-of-the-art methods still have large gaps. Here we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) into a subset of buckets, is defined to be $(d_1, d_2)$-sensitive if any two sequences within an edit distance of $d_1$ are mapped into at least one shared bucket, and any two sequences with distance at least $d_2$ are mapped into disjoint subsets of buckets. We construct locality-sensitive bucketing (LSB) functions with a variety of values of $(d_1,d_2)$ and analyze their efficiency with respect to the total number of buckets needed as well as the number of buckets that a specific sequence is mapped to. We also prove lower bounds of these two parameters in different settings and show that some of our constructed LSB functions are optimal. These results provide theoretical foundations for their practical use in analyzing sequences with high error rate while also providing insights for the hardness of designing ungapped LSH functions.

下载PDF全文

下载文献需遵守相关版权规定

论文标题