Paper title
SOPanG 2: online searching over a pan-genome without false positives
Paper authors
Paper abstract
Motivation: A pan-genome can be stored as an elastic-degenerate (ED) string, a recently introduced compact representation of multiple overlapping sequences. However, a search over an ED string does not indicate which individuals (if any) match the entire query. Results: We augment the ED string with sources (the indexes of individuals) and propose an extension of the SOPanG (Shift-Or for Pan-Genome) tool that reports only true positive matches, omitting those that do not occur in any of the haplotypes. The additional stage for checking the matches slows the search by less than 3.5% in practice, which means that SOPanG 2 can report pattern matches in a pan-genome, mapping them onto individuals, at a single-thread throughput above 430 MB/s on real data. Availability and implementation: SOPanG 2 can be downloaded here: github.com/MrAlexSee/sopang