Title
DNA data storage, sequencing data-carrying DNA
Authors
Abstract
DNA is a leading candidate for the next archival storage medium due to its density, durability and sustainability. To read (and write) data, DNA storage exploits technology that has been developed over decades to sequence naturally occurring DNA in the life sciences. To achieve higher accuracy on previously unseen biological DNA, sequencing relies on extending and training deep machine learning models known as basecallers. This growth in model complexity requires substantial resources, in both computation and data sets. It also eliminates the possibility of a compact read head for DNA as a storage medium.

We argue that we need to depart from blindly using sequencing models from the life sciences for DNA data storage. The difference is striking: in life science applications we have no control over the DNA, whereas in DNA data storage we control how it is written, as well as the particular write head. More specifically, data-carrying DNA can be modulated and embedded with alignment markers and error-correcting codes to guarantee higher fidelity and to carry out some of the work that the machine learning models perform.

In this paper, we study the accuracy trade-offs between deep model size and error-correcting codes. We show that, starting with a model size of 107 MB, the accuracy lost to model compression can be compensated for by using simple error-correcting codes in the DNA sequences. In our experiments, we show that a substantial reduction in the size of the model does not incur an undue penalty for the error-correcting codes used, paving the way for a portable read head for data-carrying DNA. Crucially, we show that through the joint use of model compression and error-correcting codes, we achieve a higher read accuracy than without compression and error-correcting codes.