“GenoDedup: Similarity-Based Deduplication and Delta-Encoding for Genome Sequencing Data”

From Navigators

URL: https://ieeexplore.ieee.org/document/9094002

Latest revision as of 14:59, 22 September 2021

Vinicius Vielmo Cogo, João Paulo, Alysson Bessani

IEEE Transactions on Computers, vol. 70, no. 5, pp. 669–681, May 2021.

DOI: 10.1109/TC.2020.2994774.
Abstract: The vast datasets produced in human genomics must be efficiently stored, transferred, and processed while prioritizing storage space and restore performance. Balancing these two properties becomes challenging when resorting to traditional data compression techniques. In fact, specialized algorithms for compressing sequencing data favor the former, while large genome repositories widely resort to generic compressors (e.g., GZIP) to benefit from the latter. Notably, human beings have approximately 99.9% of DNA sequence similarity, vouching for an excellent opportunity for deduplication and its assets: leveraging inter-file similarity and achieving higher read performance. However, identity-based deduplication fails to provide a satisfactory reduction in the storage requirements of genomes. In this work, we balance space savings and restore performance by proposing GenoDedup, the first method that integrates efficient similarity-based deduplication and specialized delta-encoding for genome sequencing data. Our solution currently achieves 67.8% of the reduction gains of SPRING (i.e., the best specialized tool in this metric) and restores data 1.62x faster than SeqDB (i.e., the fastest competitor). Additionally, GenoDedup restores data 9.96x faster than SPRING and compresses files 2.05x more than SeqDB.
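
The contrast the abstract draws between identity-based and similarity-based deduplication can be illustrated with a minimal sketch. This is not GenoDedup's actual algorithm: the function names (`identity_key`, `hamming_delta`, `apply_delta`, `encode`) are hypothetical, the brute-force search stands in for whatever efficient similarity index the paper uses, and the delta covers substitutions between equal-length sequences only.

```python
import hashlib
from typing import List, Tuple

Delta = List[Tuple[int, str]]  # (position, replacement character) pairs


def identity_key(seq: str) -> str:
    """Identity-based dedup key: only byte-identical sequences share it."""
    return hashlib.sha256(seq.encode("ascii")).hexdigest()


def hamming_delta(base: str, seq: str) -> Delta:
    """Substitution-only delta: positions where seq differs from base."""
    if len(base) != len(seq):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, s) for i, (b, s) in enumerate(zip(base, seq)) if b != s]


def apply_delta(base: str, delta: Delta) -> str:
    """Restore the original sequence from a base and its delta."""
    chars = list(base)
    for i, ch in delta:
        chars[i] = ch
    return "".join(chars)


def encode(seq: str, index: List[str]) -> Tuple[int, Delta]:
    """Pick the most similar base (brute force) and return (base_id, delta)."""
    best_id, best_delta = -1, None
    for base_id, base in enumerate(index):
        if len(base) != len(seq):
            continue  # assumption: only equal-length bases are candidates
        delta = hamming_delta(base, seq)
        if best_delta is None or len(delta) < len(best_delta):
            best_id, best_delta = base_id, delta
    if best_delta is None:
        raise ValueError("no candidate base of matching length")
    return best_id, best_delta


if __name__ == "__main__":
    index = ["ACGTACGTAC", "TTTTGGGGCC"]   # toy base sequences
    read = "ACGTACCTAC"                    # one substitution vs. index[0]
    base_id, delta = encode(read, index)
    # Identity-based dedup finds no match: the hashes differ.
    print(identity_key(read) == identity_key(index[base_id]))
    # Similarity-based dedup stores a base pointer plus a small delta.
    print(base_id, delta, apply_delta(index[base_id], delta) == read)
```

In this toy case a single differing base defeats the hash comparison, while the similarity-based path stores only a pointer to the closest base and a one-entry delta, and restoring the read is a simple patch of the base.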

Project(s): Project:SUPERCLOUD, Project:IRCoC

Research line(s): Fault and Intrusion Tolerance in Open Distributed Systems (FIT)
