Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing
Jason O'Rawe, Tao Jiang, Guangqing Sun, Yiyang Wu, Wei Min Wang, Jingchu Hu, Paul Bodily, Lifeng Tian, Hakon Hakonarson, W. Evan Johnson, Zhi Wei, Kai Wang, Gholson J. Lyon
TL;DR: The results suggest that more caution should be exercised in genomic medicine settings when analyzing individual genomes, including interpreting positive and negative findings with scrutiny, especially for indels.
Abstract:

Background: To facilitate the clinical implementation of genomic medicine by next-generation sequencing, it will be critically important to obtain accurate and consistent variant calls on personal genomes. Multiple software tools for variant calling are available, but it is unclear how comparable these tools are or what their relative merits in real-world scenarios might be.

Methods: We sequenced 15 exomes from four families using commercial kits (Illumina HiSeq 2000 platform and Agilent SureSelect version 2 capture kit), with approximately 120X mean coverage. We analyzed the raw data using near-default parameters with five different alignment and variant-calling pipelines (SOAP, BWA-GATK, BWA-SNVer, GNUMAP, and BWA-SAMtools). We additionally sequenced a single whole genome using the sequencing and analysis pipeline from Complete Genomics (CG), with 95% of the exome region being covered by 20 or more reads per base. Finally, we validated 919 single-nucleotide variations (SNVs) and 841 insertions and deletions (indels), including similar fractions of GATK-only, SOAP-only, and shared calls, on the MiSeq platform by amplicon sequencing with approximately 5000X mean coverage.

Results: SNV concordance between five Illumina pipelines across all 15 exomes was 57.4%, while 0.5 to 5.1% of variants were called as unique to each pipeline. Indel concordance was only 26.8% between three indel-calling pipelines, even after left-normalizing and intervalizing genomic coordinates by 20 base pairs. Of the CG variants falling within regions targeted in exome sequencing, 11% were not called by any of the Illumina-based exome analysis pipelines. Based on targeted amplicon sequencing on the MiSeq platform, 97.1%, 60.2%, and 99.1% of the GATK-only, SOAP-only, and shared SNVs could be validated, but only 54.0%, 44.6%, and 78.1% of the GATK-only, SOAP-only, and shared indels could be validated. Additionally, our analysis of two families (one with four individuals and the other with seven) demonstrated additional accuracy gained in variant discovery by having access to genetic data from a multi-generational family.

Conclusions: Our results suggest that more caution should be exercised in genomic medicine settings when analyzing individual genomes, including interpreting positive and negative findings with scrutiny, especially for indels. We advocate for renewed collection and sequencing of multi-generational families to increase the overall accuracy of whole genomes.
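The concordance figures reported in the Results can be illustrated with a short sketch. It is not the authors' code. It assumes that concordance means the fraction of the union of all calls that every pipeline reports, that a pipeline-unique call is one no other pipeline reports, and that "intervalizing by 20 base pairs" can be approximated by treating two indels on the same chromosome as matching when their positions differ by at most 20 bp. The abstract does not spell out any of these definitions, and the function names and tuple layouts are hypothetical. Coordinates are assumed to be left-normalized already.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical representations chosen for this sketch:
#   SNV   = (chrom, pos, ref, alt)
#   indel = (chrom, pos), with pos already left-normalized
Variant = Tuple[str, int, str, str]
Indel = Tuple[str, int]


def concordance(calls: Dict[str, Set[Variant]]) -> float:
    """Fraction of the union of all calls that every pipeline reports."""
    sets = list(calls.values())
    union = set().union(*sets)
    if not union:
        return 0.0
    shared = set.intersection(*sets)
    return len(shared) / len(union)


def unique_fraction(calls: Dict[str, Set[Variant]], name: str) -> float:
    """Fraction of the union that only the pipeline `name` reports."""
    union = set().union(*calls.values())
    if not union:
        return 0.0
    others = set().union(*(s for n, s in calls.items() if n != name))
    return len(calls[name] - others) / len(union)


def indel_matches(a: Indel, b: Indel, window: int = 20) -> bool:
    """Assumed matching rule: same chromosome, positions within `window` bp."""
    return a[0] == b[0] and abs(a[1] - b[1]) <= window


def windowed_shared(query: Set[Indel],
                    others: Iterable[Set[Indel]],
                    window: int = 20) -> Set[Indel]:
    """Indels in `query` that have a partner within `window` bp in every
    other call set. Quadratic scan: fine for a sketch, slow for real data."""
    others = list(others)
    return {
        q for q in query
        if all(any(indel_matches(q, o, window) for o in s) for s in others)
    }
```

Exact matching is used for SNVs because their coordinates are unambiguous. Indels get the windowed comparison because different callers can place the same event at slightly different positions, which is the situation that normalization and intervalizing are meant to absorb.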
Citations
Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications
Andrew J. Rimmer, H. Phan, Iain Mathieson, Zamin Iqbal, Stephen R. F. Twigg, Andrew O. M. Wilkie, Gil McVean, Gerton Lunter
TL;DR: The performance of Platypus is demonstrated by comparing with SAMtools and GATK on whole-genome and exome-capture data, by identifying de novo variation in 15 parent-offspring trios with high sensitivity and specificity, and by estimating human leukocyte antigen genotypes directly from variant calls.
Toward better understanding of artifacts in variant calling from high-coverage samples
TL;DR: By investigating false heterozygous calls in the haploid genome, the erroneous realignment in low-complexity regions and the incomplete reference genome with respect to the sample are identified as the two major sources of errors, which press for continued improvements in these two areas.
Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls
Justin M. Zook, Brad Chapman, Jason Wang, David Mittelman, Oliver Hofmann, Winston Hide, Marc L. Salit
TL;DR: Methods to make high-confidence, single-nucleotide polymorphism (SNP), indel and homozygous reference genotype calls for NA12878, the pilot genome for the Genome in a Bottle Consortium are presented.
A spectral approach integrating functional genomic annotations for coding and noncoding variants
TL;DR: Across varied scenarios, the Eigen score performs generally better than any single individual annotation, representing a powerful single functional score that can be incorporated in fine-mapping studies.
Detection of Clinically Relevant Genetic Variants in Autism Spectrum Disorder by Whole-Genome Sequencing
Yong-hui Jiang, Ryan K. C. Yuen, Xin Jin, Mingbang Wang, Nong Chen, Xueli Wu, Jia Ju, Junpu Mei, Yujian Shi, Mingze He, Guangbiao Wang, Jieqin Liang, Zhe Wang, Dandan Cao, Melissa T. Carter, Christina Chrysler, Irene Drmic, Jennifer L. Howe, Lynette Lau, Christian R. Marshall, Daniele Merico, Thomas Nalpathamkalam, Bhooma Thiruvahindrapuram, Ann Thompson, Mohammed Uddin, Susan Walker, J. Luo, Evdokia Anagnostou, Lonnie Zwaigenbaum, Robert H. Ring, Jian Wang, Clara Lajonchere, Jun Wang, Andy Shih, Peter Szatmari, Huanming Yang, Geraldine Dawson, Yingrui Li, Stephen W. Scherer
TL;DR: Results suggest that WGS and thorough bioinformatic analyses for de novo and rare inherited mutations will improve the detection of genetic variants likely to be associated with ASD or its accompanying clinical symptoms.