New algorithms for accurate and efficient de novo genome assembly from long DNA sequencing reads

doi:10.1101/2022.08.30.505891

Open Access•Posted Content•10.1101/2022.08.30.505891

Laura Natalia González-García, +15 more

- 01 Sep 2022

- Life science alliance

- Vol. 6

4

TL;DR: New algorithms for assembling long DNA sequencing reads from haploid and diploid organisms showed competitive efficiency and assembly contiguity, and in some cases superior accuracy, compared with other currently used software.

Abstract: Innovative algorithmic approaches were used to perform assembly of complex genomes across the tree of life, from long DNA sequencing data. Building de novo genome assemblies for complex genomes is possible thanks to long-read DNA sequencing technologies. However, maximizing the quality of assemblies based on long reads is a challenging task that requires the development of specialized data analysis techniques. We present new algorithms for assembling long DNA sequencing reads from haploid and diploid organisms. The assembly algorithm builds an undirected graph with two vertices for each read based on minimizers selected by a hash function derived from the k-mer distribution. Statistics collected during the graph construction are used as features to build layout paths by selecting edges, ranked by a likelihood function. For diploid samples, we integrated a reimplementation of the ReFHap algorithm to perform molecular phasing. We ran the implemented algorithms on PacBio HiFi and Nanopore sequencing data taken from haploid and diploid samples of different species. Our algorithms showed competitive accuracy and computational efficiency, compared with other currently used software. We expect that this new development will be useful for researchers building genome assemblies for different species.
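
The graph construction described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation. It assumes that the minimizer ranking prefers k-mers with lower counts across all reads (one possible hash "derived from the k-mer distribution"), that reads come from a single strand, and that overlap direction can be decided by a vote on minimizer position offsets. The function names, the parameters k and w, and the toy sequence are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return all k-mers of seq, in order of start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k):
    """Count every k-mer over all reads (the k-mer distribution)."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def minimizers(seq, k, w, counts):
    """Pick one k-mer per window of w consecutive k-mers.

    Assumed ranking: lower count first, then lexicographic order.
    Returns a list of (position, k-mer) without repeated positions.
    """
    kms = kmers(seq, k)
    selected = []
    for start in range(len(kms) - w + 1):
        best = min(range(start, start + w),
                   key=lambda i: (counts[kms[i]], kms[i], i))
        if not selected or selected[-1][0] != best:
            selected.append((best, kms[best]))
    return selected


def build_graph(reads, k, w):
    """Build an undirected graph with two vertices per read.

    Vertices are (read_id, "start") and (read_id, "end"). An edge joins
    the end of one read to the start of another when their shared
    minimizers suggest a suffix-prefix overlap (same strand only).
    """
    counts = count_kmers(reads, k)
    index = defaultdict(list)  # minimizer k-mer -> [(read_id, position)]
    for rid, read in enumerate(reads):
        for pos, kmer in minimizers(read, k, w, counts):
            index[kmer].append((rid, pos))

    votes = defaultdict(Counter)  # (a, b) -> votes for offset pos_a - pos_b
    for hits in index.values():
        for (a, pa), (b, pb) in combinations(hits, 2):
            if a != b:
                votes[(a, b)][pa - pb] += 1

    edges = set()
    for (a, b), offsets in votes.items():
        offset, _ = offsets.most_common(1)[0]
        if offset > 0:    # b begins inside a: end of a meets start of b
            edges.add(frozenset({(a, "end"), (b, "start")}))
        elif offset < 0:  # a begins inside b: end of b meets start of a
            edges.add(frozenset({(b, "end"), (a, "start")}))
    return edges


# Toy usage: two reads sampled from one made-up sequence, overlapping by 12 bases.
genome = "ACGTTGCATGGCTAAGCTTACGGATCCA"
reads = [genome[0:20], genome[8:28]]
print(build_graph(reads, k=5, w=3))
```

A real assembler would also handle reverse-complement matches, contained reads, and sequencing errors, and would rank candidate edges with the likelihood function mentioned in the abstract. None of that is modeled here.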

Citations

Journal Article•10.1038/s41598-024-52449-x

A phased genome assembly of a Colombian Trypanosoma cruzi TcI strain and the evolution of gene families

Maria Camila Hoyos Sanchez, +8 more

- 24 Jan 2024

- Scientific Reports

TL;DR: This new chromosome-level phased assembly of a TcI T. cruzi strain is expected to be a valuable resource for further studies on evolution and functional genomics of Trypanosomatids.

3

Journal Article•10.1128/spectrum.02928-23

Unveiling potential virulence determinants in Vibrio isolates from Anadara tuberculosa through whole genome analyses

Mariana Restrepo-Benavides, +10 more

- 08 Jan 2024

- Microbiology spectrum

TL;DR: This is the first comprehensive whole genome analysis of Vibrio isolates obtained from Anadara tuberculosa, a bivalve species of great social and economic significance on the Pacific coast of Colombia, and the first genomic analysis of bacteria within A. tuberculosa.

1

10.1101/2023.07.17.549441

A nearly complete and phased genome assembly of a Colombian Trypanosoma cruzi TcI strain and the evolution of gene families

C. Ospina, +2 more

- 19 Jul 2023

- bioRxiv

TL;DR: This paper presents a chromosome-level phased assembly of a T. cruzi strain (Dm25) belonging to the TcI DTU, isolated from a reservoir of the species Didelphis marsupialis in the Tolima department of Colombia.

1

10.1016/j.celrep.2024.113699

Efficient data reconstruction: The bottleneck of large-scale application of DNA storage.

Ben Cao, +7 more

- 21 Mar 2024

- Cell Reports

TL;DR: DNA storage has overcome capacity and persistence bottlenecks, but high data reconstruction costs and latency hinder practical implementation beyond laboratory settings, necessitating efficient data reconstruction methods for large-scale applications.

References

•Journal Article•10.1038/s41587-022-01261-x

Haplotype-resolved assembly of diploid genomes without parental data

Haoyu Cheng, +6 more

- 24 Mar 2022

- Nature Biotechnology

TL;DR: The authors describe an algorithm that combines PacBio HiFi reads and Hi-C chromatin interaction data to produce a haplotype-resolved assembly without sequencing the parents.

371

•Posted Content•10.1101/2020.03.14.992248

HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads

Sergey Nurk, +10 more

- 17 Mar 2020

- bioRxiv

TL;DR: This HiCanu assembly correctly resolves 337 out of 341 validation BACs sampled from known segmental duplications and provides the first preliminary assemblies of 9 complete human centromeric regions, a significant advance towards the complete assembly of human genomes.

369

•Journal Article•10.1007/S00251-007-0262-2

Variation analysis and gene annotation of eight MHC haplotypes: The MHC Haplotype Project

Roger Horton, +27 more

- 10 Jan 2008

- Immunogenetics

TL;DR: The aim of the MHC Haplotype Project was to provide a comprehensively annotated reference sequence of a single, human leukocyte antigen-homozygous MHC haplotype and to use it as a basis against which variations could be assessed from seven other similarly homozygous cell lines, representative of the most common MHC haplotypes in the European population.

342

•Journal Article•10.1038/S41467-020-20236-7

Efficient assembly of nanopore reads via highly accurate and intact error correction.

Ying Chen, +15 more

- 04 Jan 2021

- Nature Communications

TL;DR: NECAT is an error correction and de novo assembly tool designed to overcome complex errors in nanopore reads. It uses adaptive read selection and a two-step progressive method to quickly correct the reads to high accuracy.

289

...

Related Papers (5)

Improvements in the sequencing and assembly of plant genomes

Priyanka Sharma, +14 more

- 10 Jun 2021

Improvements in the Sequencing and Assembly of Plant Genomes

Priyanka Sharma, +14 more

- 22 Jan 2021

- bioRxiv

Idba-ud

Yu Peng, +3 more

- 01 Jun 2012

- Bioinformatics

Optimizing Information in Next-Generation-Sequencing (NGS) Reads for Improving De Novo Genome Assembly

Tsunglin Liu, +3 more

- 29 Jul 2013

- PLOS ONE

An ensemble strategy that significantly improves de novo assembly of microbial genomes from metagenomic next-generation sequencing data

Xutao Deng, +6 more

- 20 Apr 2015

- Nucleic Acids Research
