Paper title
TargetCall: Eliminating the Wasted Computation in Basecalling via Pre-Basecalling Filtering
Paper authors
Paper abstract
Basecalling is an essential step in nanopore sequencing analysis, in which the raw signals of nanopore sequencers are converted into nucleotide sequences, i.e., reads. State-of-the-art basecallers employ complex deep learning models to achieve high basecalling accuracy. This makes basecalling computationally inefficient and memory-hungry, bottlenecking the entire genome analysis pipeline. However, for many applications, the majority of reads do not match the reference genome of interest (i.e., target reference) and thus are discarded in later steps of the genomics pipeline, wasting the basecalling computation. To overcome this issue, we propose TargetCall, the first pre-basecalling filter to eliminate the wasted computation in basecalling. TargetCall's key idea is to discard reads that will not match the target reference (i.e., off-target reads) prior to basecalling. TargetCall consists of two main components: (1) LightCall, a lightweight neural network basecaller that produces noisy reads; and (2) Similarity Check, which labels each of these noisy reads as on-target or off-target by matching them to the target reference. Our thorough experimental evaluations show that TargetCall (1) improves the end-to-end basecalling runtime performance of the state-of-the-art basecaller by 3.31x while maintaining high (98.88%) recall in keeping on-target reads, (2) maintains high accuracy in downstream analysis, and (3) achieves better runtime performance, throughput, recall, precision, and generality compared to prior works. TargetCall is available at https://github.com/CMU-SAFARI/TargetCall.
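The two-stage idea in the abstract (a cheap noisy basecall, then a similarity check against the target reference, with only the surviving signals passed to the expensive basecaller) can be sketched roughly as follows. This is an illustrative toy, not the TargetCall implementation: the names `kmers`, `is_on_target`, `prefilter`, and `toy_basecall` are hypothetical, the k-mer overlap test is a simple stand-in for whatever matching method Similarity Check actually uses, and the parameters `k` and `min_fraction` are arbitrary assumptions.

```python
from typing import Callable, List, Sequence, Set

# Toy representation of one read's raw signal (assumption for this sketch).
Signal = Sequence[int]


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def is_on_target(noisy_read: str, ref_kmers: Set[str], k: int,
                 min_fraction: float) -> bool:
    """Stand-in similarity check: share of the read's k-mers seen in the reference."""
    read_kmers = kmers(noisy_read, k)
    if not read_kmers:
        return False  # read shorter than k: nothing to match
    shared = len(read_kmers & ref_kmers)
    return shared / len(read_kmers) >= min_fraction


def prefilter(signals: Sequence[Signal],
              light_basecall: Callable[[Signal], str],
              reference: str,
              k: int = 4,
              min_fraction: float = 0.5) -> List[Signal]:
    """Keep only the raw signals whose noisy reads look on-target.

    light_basecall plays the role of the lightweight basecaller; the
    returned signals are the ones a full basecaller would then process.
    """
    ref_kmers = kmers(reference, k)
    kept = []
    for signal in signals:
        noisy_read = light_basecall(signal)
        if is_on_target(noisy_read, ref_kmers, k, min_fraction):
            kept.append(signal)
    return kept


def toy_basecall(signal: Signal) -> str:
    """Hypothetical 'basecaller': maps each integer sample to a base."""
    return "".join("ACGT"[x % 4] for x in signal)


if __name__ == "__main__":
    reference = "ACGTACGTTGCA"
    on_target = [0, 1, 2, 3, 0, 1, 2, 3]   # decodes to ACGTACGT
    off_target = [0, 0, 0, 0, 0, 0, 0, 0]  # decodes to AAAAAAAA
    kept = prefilter([on_target, off_target], toy_basecall, reference)
    print(len(kept), "of 2 signals kept for full basecalling")
```

The point of the structure is that the filter operates on raw signals and returns raw signals, so the expensive basecaller runs only on reads that passed the cheap check; the recall and speedup figures quoted in the abstract depend on the real components and are not reproduced by this toy.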