TL;DR: The ANNOVAR tool to annotate single nucleotide variants and insertions/deletions, such as examining their functional consequence on genes, inferring cytogenetic bands, reporting functional importance scores, finding variants in conserved regions, or identifying variants reported in the 1000 Genomes Project and dbSNP is developed.
Abstract: High-throughput sequencing platforms are generating massive amounts of genetic variation data for diverse genomes, but it remains a challenge to pinpoint a small subset of functionally important variants. To fill these unmet needs, we developed the ANNOVAR tool to annotate single nucleotide variants (SNVs) and insertions/deletions, such as examining their functional consequence on genes, inferring cytogenetic bands, reporting functional importance scores, finding variants in conserved regions, or identifying variants reported in the 1000 Genomes Project and dbSNP. ANNOVAR can utilize annotation databases from the UCSC Genome Browser or any annotation data set conforming to Generic Feature Format version 3 (GFF3). We also illustrate a 'variants reduction' protocol on 4.7 million SNVs and indels from a human genome, including two causal mutations for Miller syndrome, a rare recessive disease. Through a stepwise procedure, we excluded variants that are unlikely to be causal, and identified 20 candidate genes including the causal gene. Using a desktop computer, ANNOVAR requires ∼4 min to perform gene-based annotation and ∼15 min to perform variants reduction on 4.7 million variants, making it practical to handle hundreds of human genomes in a day. ANNOVAR is freely available at http://www.openbioinformatics.org/annovar/.
TL;DR: This work describes how members of the biomedical research community can use GDV and the related NCBI Sequence Viewer (SV) to access, analyze, and disseminate NCBI and custom biomedical sequence data and reports how users can add SV to their own web pages to create a custom graphical sequence display without the need for infrastructure investments or back-end deployments.
Abstract: The National Center for Biotechnology Information (NCBI) is an archive providing free access to a wide range and large volume of biological sequence data and literature. Staff scientists at NCBI analyze user-submitted data in the archive, producing gene and SNP annotation and generating sequence alignment tools. NCBI's flagship genome browser, Genome Data Viewer (GDV), displays our in-house RefSeq annotation; is integrated with other NCBI resources such as Gene, dbGaP, and BLAST; and provides a platform for customized analysis and visualization. Here, we describe how members of the biomedical research community can use GDV and the related NCBI Sequence Viewer (SV) to access, analyze, and disseminate NCBI and custom biomedical sequence data. In addition, we report how users can add SV to their own web pages to create a custom graphical sequence display without the need for infrastructure investments or back-end deployments.
TL;DR: It is concluded that individual cultivars carry hundreds of deleterious SNPs on average, and that nonsense variants make up a minority of deleters in the protein-coding regions of the genomes of two crops.
Abstract: Populations continually incur new mutations with fitness effects ranging from lethal to adaptive. While the distribution of fitness effects of new mutations is not directly observable, many mutations likely either have no effect on organismal fitness or are deleterious. Historically, it has been hypothesized that a population may carry many mildly deleterious variants as segregating variation, which reduces the mean absolute fitness of the population. Recent advances in sequencing technology and sequence conservation-based metrics for inferring the functional effect of a variant permit examination of the persistence of deleterious variants in populations. The issue of segregating deleterious variation is particularly important for crop improvement, because the demographic history of domestication and breeding allows deleterious variants to persist and reach moderate frequency, potentially reducing crop productivity. In this study, we use exome resequencing of 15 barley accessions and genome resequencing of 8 soybean accessions to investigate the prevalence of deleterious single nucleotide polymorphisms (SNPs) in the protein-coding regions of the genomes of two crops. We conclude that individual cultivars carry hundreds of deleterious SNPs on average, and that nonsense variants make up a minority of deleterious SNPs. Our approach identifies known phenotype-altering variants as deleterious more frequently than the genome-wide average, suggesting that putatively deleterious variants are likely to affect phenotypic variation. We also report the implementation of a SNP annotation tool BAD_Mutations that makes use of a likelihood ratio test based on alignment of all currently publicly available Angiosperm genomes.
TL;DR: A new tool called AASsites is designed which searches for SNPs which modify splicing and identified 301 “likely” and 985 “probable” classified SNPs with such characteristics.
Abstract: Background
Some single nucleotide polymorphisms (SNPs) are known to modify the risk of developing certain diseases or the reaction to drugs. Due to next generation sequencing methods the number of known human SNPs has grown. Not all SNPs lead to a modified protein, which may be the origin of a disease. Therefore, the recognition of functional SNPs is needed. Because most SNP annotation tools look for SNPs which lead to an amino acid exchange or a premature stop, we designed a new tool called AASsites which searches for SNPs which modify splicing.
TL;DR: In this article, the authors compared SNP calling results using GATK suggested filters, on two chicken populations for which both RNA-seq and DNA-seq data were available for the same samples of the same tissue.
Abstract: In addition to their common usages to study gene expression, RNA-seq data accumulated over the last 10 years are a yet-unexploited resource of SNPs in numerous individuals from different populations. SNP detection by RNA-seq is particularly interesting for livestock species since whole genome sequencing is expensive and exome sequencing tools are unavailable. These SNPs detected in expressed regions can be used to characterize variants affecting protein functions, and to study cis-regulated genes by analyzing allele-specific expression (ASE) in the tissue of interest. However, gene expression can be highly variable, and filters for SNP detection using the popular GATK toolkit are not yet standardized, making SNP detection and genotype calling by RNA-seq a challenging endeavor. We compared SNP calling results using GATK suggested filters, on two chicken populations for which both RNA-seq and DNA-seq data were available for the same samples of the same tissue. We showed, in expressed regions, a RNA-seq precision of 91% (SNPs detected by RNA-seq and shared by DNA-seq) and we characterized the remaining 9% of SNPs. We then studied the genotype (GT) obtained by RNA-seq and the impact of two factors (GT call-rate and read number per GT) on the concordance of GT with DNA-seq; we proposed thresholds for them leading to a 95% concordance. Applying these thresholds to 767 multi-tissue RNA-seq of 382 birds of 11 chicken populations, we found 9.5 M SNPs in total, of which ∼550,000 SNPs per tissue and population with a reliable GT (call rate ≥ 50%) and among them, ∼340,000 with a MAF ≥ 10%. We showed that such RNA-seq data from one tissue can be used to (i) detect SNPs with a strong predicted impact on proteins, despite their scarcity in each population (16,307 SIFT deleterious missenses and 590 stop-gained), (ii) study, on a large scale, cis-regulations of gene expression, with ∼81% of protein-coding and 68% of long non-coding genes (TPM ≥ 1) that can be analyzed for ASE, and with ∼29% of them that were cis-regulated, and (iii) analyze population genetic using such SNPs located in expressed regions. This work shows that RNA-seq data can be used with good confidence to detect SNPs and associated GT within various populations and used them for different analyses as GTEx studies.