New algorithms for accurate and efficient de novo genome assembly from long DNA sequencing reads
Laura Natalia González-García, David Guevara-Barrientos, Daniela Lozano-Arce, Juan Martínez Gil, Jorge Díaz-Riaño, Erick Duarte, Germán Andrade, Juan Camilo Bojacá, María Camila Hoyos-Sánchez, Christian Chavarro, Natalia Guayazan, Luis Alberto Chica, Maria Camila Buitrago Acosta, Edwin Bautista, Miller Trujillo, Jorge Duitama
4
TL;DR: New algorithms for assembling long DNA sequencing reads from haploid and diploid organisms showed competitive efficiency and assembly contiguity, and in some cases superior accuracy, compared with other currently used software.
Abstract: Innovative algorithmic approaches were used to perform assembly of complex genomes across the tree of life, from long DNA sequencing data. Building de novo genome assemblies for complex genomes is possible thanks to long-read DNA sequencing technologies. However, maximizing the quality of assemblies based on long reads is a challenging task that requires the development of specialized data analysis techniques. We present new algorithms for assembling long DNA sequencing reads from haploid and diploid organisms. The assembly algorithm builds an undirected graph with two vertices for each read based on minimizers selected by a hash function derived from the k-mer distribution. Statistics collected during the graph construction are used as features to build layout paths by selecting edges, ranked by a likelihood function. For diploid samples, we integrated a reimplementation of the ReFHap algorithm to perform molecular phasing. We ran the implemented algorithms on PacBio HiFi and Nanopore sequencing data taken from haploid and diploid samples of different species. Our algorithms showed competitive accuracy and computational efficiency, compared with other currently used software. We expect that this new development will be useful for researchers building genome assemblies for different species.
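The minimizer selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the ranking used here (rarer k-mers first, ties broken lexicographically) is only an assumed stand-in for the hash function derived from the k-mer distribution, and the function names (count_kmers, minimizers, read_vertices) and parameters k and w are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return all k-mers of seq in order of their start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k):
    """Count k-mer occurrences over all reads (the k-mer distribution)."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def minimizers(seq, k, w, counts):
    """Select one k-mer per window of w consecutive k-mers.

    ASSUMPTION: a k-mer ranks lower (is preferred) when it is rarer in
    `counts`; ties are broken by the k-mer string and then by position.
    The paper's actual hash function is not specified in the abstract.
    Returns a sorted list of (position, kmer) pairs without duplicates.
    """
    kms = kmers(seq, k)

    def rank(i):
        return (counts.get(kms[i], 0), kms[i], i)

    selected = set()
    for start in range(len(kms) - w + 1):
        best = min(range(start, start + w), key=rank)
        selected.add((best, kms[best]))
    return sorted(selected)


def read_vertices(read_id):
    """Two vertices per read, one for each end (hypothetical labels)."""
    return (read_id, "start"), (read_id, "end")


if __name__ == "__main__":
    reads = ["ACGTTGCA", "GTTGCATT"]
    counts = count_kmers(reads, 3)
    for read_id, read in enumerate(reads):
        print(read_vertices(read_id), minimizers(read, 3, 2, counts))
```

Because both reads are ranked with the same counts, overlapping reads tend to select the same k-mers in their shared region, which is what makes minimizers usable as evidence for candidate edges between read ends.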
Citations
A phased genome assembly of a Colombian Trypanosoma cruzi TcI strain and the evolution of gene families
Maria Camila Hoyos Sanchez, Hader Sebastian Ospina Zapata, Brayhan Dario Suarez, Carlos Ospina, Hamilton J. Barbosa, Julio Cesar Carranza Martinez, Gustavo Adolfo Vallejo, Daniel Urrea Montes, Jorge Duitama
TL;DR: This chromosome-level phased assembly of a TcI T. cruzi strain is expected to be a valuable resource for further studies on the evolution and functional genomics of trypanosomatids.
3
Unveiling potential virulence determinants in Vibrio isolates from Anadara tuberculosa through whole genome analyses
Mariana Restrepo-Benavides, Daniela Lozano-Arce, Laura Natalia González-García, Felipe Báez-Aguirre, Gabriela Ariza‐Aranguren, Daniel Faccini, María Mercedes Zambrano, Pedro Jiménez, Ana Fernández-Bravo, Silvia Restrepo, Marcela Guevara-Suarez
TL;DR: This first comprehensive whole-genome analysis of Vibrio isolates obtained from Anadara tuberculosa, a bivalve species of great social and economic significance on the Pacific coast of Colombia, is also the first genomic analysis of bacteria within A. tuberculosa.
1
A nearly complete and phased genome assembly of a Colombian Trypanosoma cruzi TcI strain and the evolution of gene families
TL;DR: This paper presents a chromosome-level phased assembly of a T. cruzi strain (Dm25) belonging to the TcI DTU, isolated from a Didelphis marsupialis reservoir in the Tolima department of Colombia.
1
Efficient data reconstruction: The bottleneck of large-scale application of DNA storage.
Ben Cao, Yan Zheng, Qi Shao, Zhenlu Liu, Yunzhu Zhao, Bin Wang, Qiang Zhang, Xiaopeng Wei
TL;DR: DNA storage has overcome capacity and persistence bottlenecks, but high data reconstruction costs and latency hinder practical implementation beyond laboratory settings, necessitating efficient data reconstruction methods for large-scale applications.
References
Haplotype-resolved assembly of diploid genomes without parental data
Haoyu Cheng, Erich D. Jarvis, Olivier Fedrigo, Klaus-Peter Koepfli, Lara Urban, Neil J. Gemmell, Heng Li
TL;DR: This article describes an algorithm that combines PacBio HiFi reads and Hi-C chromatin interaction data to produce a haplotype-resolved assembly without sequencing the parents.
371
HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads
Sergey Nurk, Brian P. Walenz, Arang Rhie, Mitchell R. Vollger, Glennis A. Logsdon, Robert Grothe, Karen H. Miga, Evan E. Eichler, Adam M. Phillippy, Sergey Koren
TL;DR: This HiCanu assembly correctly resolves 337 out of 341 validation BACs sampled from known segmental duplications and provides the first preliminary assemblies of 9 complete human centromeric regions, a significant advance towards the complete assembly of human genomes.
Variation analysis and gene annotation of eight MHC haplotypes: The MHC Haplotype Project
Roger Horton, Richard Gibson, Penny Coggill, Marcos Mateo Miretti, Richard J.N. Allcock, J P Almeida, Simon A. Forbes, James G. R. Gilbert, Karen Halls, Jennifer Harrow, E. Hart, Kevin L. Howe, David K. Jackson, Sophie Palmer, Anne N. Roberts, Sarah Sims, C. Andrew Stewart, James A. Traherne, Steve Trevanion, Laurens G. Wilming, Jane Rogers, Pieter J. de Jong, John F. Elliott, Stephen Sawcer, John A. Todd, John Trowsdale, Stephan Beck
TL;DR: The aim of the MHC Haplotype Project was to provide a comprehensively annotated reference sequence of a single, human leukocyte antigen-homozygous MHC haplotype and to use it as a basis against which variations could be assessed from seven other similarly homozygous cell lines, representative of the most common MHC haplotypes in the European population.
342
Efficient assembly of nanopore reads via highly accurate and intact error correction.
Ying Chen, Fan Nie, Shang-Qian Xie, Yingfeng Zheng, Qi Dai, Thomas Bray, Yao-Xin Wang, Jian-Feng Xing, Zhi-Jian Huang, Depeng Wang, Li-Juan He, Feng Luo, Jianxin Wang, Yizhi Liu, Chuan-Le Xiao
TL;DR: NECAT is an error correction and de novo assembly tool designed to overcome complex errors in nanopore reads. It uses adaptive read selection and a two-step progressive method to quickly correct the reads to high accuracy.
Highly accurate long-read HiFi sequencing data for five complex genomes
Ting Hon, Kristin Mars, Greg Young, Yu-Chih Tsai, Joseph W. Karalius, Jane M. Landolin, Nicholas Maurer, David Kudrna, Michael A. Hardigan, Cynthia C. Steiner, Steven J. Knapp, Doreen Ware, Beth Shapiro, Paul Peluso, David R. Rank
TL;DR: Deep coverage HiFi datasets for five complex samples including the two inbred model genomes Mus musculus and Zea mays, as well as two complex genomes, octoploid Fragaria × ananassa and the diploid anuran Rana muscosa are presented.