Paper Title
SARS-CoV-2 Coronavirus Data Compression Benchmark
Paper Authors
Paper Abstract
This paper introduces a lossless data compression competition that benchmarks solutions (computer programs) by the compressed size of 44,981 concatenated SARS-CoV-2 sequences, with a total uncompressed size of 1,339,868,341 bytes. The data, downloaded on 13 December 2020 from the severe acute respiratory syndrome coronavirus 2 data hub of ncbi.nlm.nih.gov, are presented in FASTA and 2bit formats. The aim of the competition is to encourage multidisciplinary research into finding the shortest lossless description of the sequences, and to demonstrate that data compression can serve as an objective and repeatable measure for aligning scientific breakthroughs across disciplines. The shortest description of the data is the best model; further reducing the size of this description therefore requires a fundamental understanding of the underlying context and data. This paper presents preliminary results from multiple well-known compression algorithms as baseline measurements, along with insights into promising research avenues. The competition's progress will be reported at https://coronavirus.innar.com, and the benchmark is open for all to participate and contribute.
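As a rough illustration of what a baseline measurement of this kind might look like, the sketch below packs a nucleotide string into two bits per base and reports the compressed size under three general-purpose compressors from the Python standard library. It rests on several assumptions that do not come from the paper: the helper names `pack_2bit` and `baseline_sizes` are hypothetical, the packing is a simplified stand-in rather than the actual 2bit file layout, the input is a short repeated toy string rather than the benchmark data, and the choice of zlib, bz2, and lzma is not necessarily the set of algorithms the paper evaluates.

```python
import bz2
import lzma
import zlib

# Hypothetical 2-bit code for illustration; not the real 2bit file layout.
BASE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack_2bit(sequence: str) -> bytes:
    """Pack an A/C/G/T string into 2 bits per base, 4 bases per byte."""
    packed = bytearray()
    for start in range(0, len(sequence), 4):
        chunk = sequence[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_CODES[base]
        # Left-align a final partial chunk so the padding sits in the low bits.
        byte <<= 2 * (4 - len(chunk))
        packed.append(byte)
    return bytes(packed)


def baseline_sizes(data: bytes) -> dict:
    """Return the size in bytes of data, raw and under each compressor."""
    return {
        "raw": len(data),
        "zlib": len(zlib.compress(data, 9)),
        "bz2": len(bz2.compress(data, 9)),
        "lzma": len(lzma.compress(data, preset=9)),
    }


if __name__ == "__main__":
    # Toy stand-in for concatenated genomes; not real SARS-CoV-2 data.
    toy_sequence = "ATTAAAGGTTTATACCTTCCCAGGTAACAAACC" * 200
    for label, payload in (
        ("text", toy_sequence.encode("ascii")),
        ("2bit", pack_2bit(toy_sequence)),
    ):
        print(label, baseline_sizes(payload))
```

Because the benchmark ranks solutions by compressed size alone, a comparison like this only sets a floor: general-purpose compressors know nothing about the biology of the sequences, which is where a model with a deeper understanding of the data would be expected to gain ground.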