TL;DR: The immediate challenge now is to comprehensively and systematically mine this dataset to link genotypic variation to functional variation with the ultimate goal of creating new and sustainable rice varieties that can support a future world population that will approach 9.6 billion by 2050.
Abstract: Rice is the world’s most important staple grown by millions of small-holder farmers. Sustaining rice production relies on the intelligent use of rice diversity. The 3,000 Rice Genomes Project is a giga-dataset of publicly available genome sequences (averaging 14× depth of coverage) derived from 3,000 accessions of rice with global representation of genetic and functional diversity. The seed of these accessions is available from the International Rice Genebank Collection. Together, they are an unprecedented resource for advancing rice science and breeding technology. Our immediate challenge now is to comprehensively and systematically mine this dataset to link genotypic variation to functional variation with the ultimate goal of creating new and sustainable rice varieties that can support a future world population that will approach 9.6 billion by 2050.
TL;DR: A read dataset from whole-genome shotgun sequencing of the model organism Escherichia coli K-12 substr. MG1655 is presented to demonstrate the nature of data produced by the MinION™ platform and to encourage the development of customised methods for alignment, consensus and variant calling, de novo assembly and scaffolding.
Abstract: The MinION™ is a new, portable single-molecule sequencer developed by Oxford Nanopore Technologies. It measures four inches in length and is powered from the USB 3.0 port of a laptop computer. The MinION™ measures the change in current resulting from DNA strands interacting with a charged protein nanopore. These measurements can then be used to deduce the underlying nucleotide sequence. We present a read dataset from whole-genome shotgun sequencing of the model organism Escherichia coli K-12 substr. MG1655 generated on a MinION™ device during the early-access MinION™ Access Program (MAP). Sequencing runs of the MinION™ are presented, one generated using R7 chemistry (released in July 2014) and one using R7.3 (released in September 2014). Base-called sequence data are provided to demonstrate the nature of data produced by the MinION™ platform and to encourage the development of customised methods for alignment, consensus and variant calling, de novo assembly and scaffolding. FAST5 files containing event data within the HDF5 container format are provided to assist with the development of improved base-calling methods.
TL;DR: This study highlights genome mapping technology as a comprehensive and cost-effective method for detecting structural variation and studying complex regions in the human genome, as well as deciphering viral integration into the host genome.
Abstract: Structural variants (SVs) are less common than single nucleotide polymorphisms and indels in the population, but collectively account for a significant fraction of genetic polymorphism and diseases. Base pair differences arising from SVs are on a much higher order (>100 fold) than point mutations; however, none of the current detection methods are comprehensive, and currently available methodologies are incapable of providing sufficient resolution and unambiguous information across complex regions in the human genome. To address these challenges, we applied a high-throughput, cost-effective genome mapping technology to comprehensively discover genome-wide SVs and characterize complex regions of the YH genome using long single molecules (>150 kb) in a global fashion. Utilizing nanochannel-based genome mapping technology, we obtained 708 insertions/deletions and 17 inversions larger than 1 kb. Excluding the 59 SVs (54 insertions/deletions, 5 inversions) that overlap with N-base gaps in the reference assembly hg19, 666 non-gap SVs remained, and 396 of them (60%) were verified by paired-end data from whole-genome sequencing-based re-sequencing or de novo assembly sequence from fosmid data. Of the remaining 270 SVs, 260 are insertions and 213 overlap known SVs in the Database of Genomic Variants. Overall, 609 out of 666 (90%) variants were supported by experimental orthogonal methods or historical evidence in public databases. At the same time, genome mapping also provides valuable information for complex regions with haplotypes in a straightforward fashion. In addition, with long single-molecule labeling patterns, exogenous viral sequences were mapped on a whole-genome scale, and sample heterogeneity was analyzed at a new level. 
Our study highlights genome mapping technology as a comprehensive and cost-effective method for detecting structural variation and studying complex regions in the human genome, as well as deciphering viral integration into the host genome.
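The counts reported in this abstract can be tallied as a quick consistency check. The sketch below uses only the numbers stated above; the variable names are illustrative. Note that the two ratios work out to roughly 59.5% and 91.4%, which the abstract reports in rounded form as 60% and 90%.

```python
# Counts as reported in the abstract; variable names are illustrative.
indels = 708                  # insertions/deletions larger than 1 kb
inversions = 17               # inversions larger than 1 kb
gap_overlapping = 59          # 54 insertions/deletions + 5 inversions in hg19 N-base gaps
verified_by_sequencing = 396  # supported by paired-end or fosmid assembly data
known_in_dgv = 213            # remaining SVs overlapping the Database of Genomic Variants

total = indels + inversions                        # all SVs detected
non_gap = total - gap_overlapping                  # SVs outside reference gaps
remaining = non_gap - verified_by_sequencing       # SVs without sequencing support
supported = verified_by_sequencing + known_in_dgv  # SVs with any supporting evidence


def pct(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


print(f"non-gap SVs: {non_gap}")
print(f"verified by sequencing: {pct(verified_by_sequencing, non_gap):.1f}%")
print(f"supported overall: {pct(supported, non_gap):.1f}%")
```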
TL;DR: This work has implemented the open-source Complete Genomics tool set, CGATools, in Galaxy, and implemented some of the most popular command-line annotation and visualisation tools to allow research scientists to select candidate pathological mutations.
Abstract: Background: Complete Genomics provides an open-source suite of command-line tools for the analysis of their CG-formatted mapped sequencing files. Determining, for example, the functional impact of detected variants requires annotation with various databases that often demand command-line and/or programming experience, which limits their use by the average research scientist. We have therefore implemented this CG toolkit, together with a number of annotation, visualisation and file manipulation tools, in Galaxy as CGtag (Complete Genomics Toolkit and Annotation in a Cloud-based Galaxy). Findings: In order to provide research scientists with web-based, simple and accurate analytical and visualisation applications for the selection of candidate mutations from Complete Genomics data, we have implemented the open-source Complete Genomics tool set, CGATools, in Galaxy. In addition, we implemented some of the most popular command-line annotation and visualisation tools to allow research scientists to select candidate pathological mutations (SNVs and indels). Furthermore, we have developed a cloud-based public Galaxy instance to host the CGtag toolkit and other associated modules. Conclusions: CGtag provides a user-friendly interface to all research scientists wishing to select candidate variants from CG or other next-generation sequencing platforms' data. By using a cloud-based infrastructure, we can also assure sufficient and on-demand computation and storage resources to handle the analysis tasks. The tools are freely available for use from an NBIC/CTMM-TraIT (The Netherlands Bioinformatics Center/Center for Translational Molecular Medicine) cloud-based Galaxy instance, or can be installed to a local (production) Galaxy via the NBIC Galaxy tool shed.
TL;DR: An open dataset is presented – the first of its kind – to the radiation oncology community, which will allow researchers to compare methods for optimizing radiation dose delivery.
Abstract: We provide common datasets (which we call the CORT dataset: common optimization for radiation therapy) that researchers can use when developing and contrasting radiation treatment planning optimization algorithms. The datasets allow researchers to make one-to-one comparisons of algorithms in order to solve various instances of the radiation therapy treatment planning problem in intensity modulated radiation therapy (IMRT), including beam angle optimization, volumetric modulated arc therapy and direct aperture optimization. We provide datasets for a prostate case, a liver case, a head and neck case, and a standard IMRT phantom. We provide the dose-influence matrix from a variety of beam/couch angle pairs for each dataset. The dose-influence matrix is the main entity needed to perform optimizations: it contains the dose to each patient voxel from each pencil beam. In addition, the original Digital Imaging and Communications in Medicine (DICOM) computed tomography (CT) scan, as well as the DICOM structure file, are provided for each case. Here we present an open dataset – the first of its kind – to the radiation oncology community, which will allow researchers to compare methods for optimizing radiation dose delivery.
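As a minimal sketch of why the dose-influence matrix is the main entity for optimization: given a matrix D (one row per voxel, one column per pencil beam) and a vector x of pencil-beam weights, the dose in each voxel is the matrix-vector product D·x. The function name and the toy numbers below are illustrative assumptions, not values or file formats from the CORT dataset.

```python
def dose_per_voxel(dose_influence, beamlet_weights):
    """Return the dose in each voxel as the matrix-vector product D @ x.

    dose_influence: list of rows, one per voxel; each row holds the dose
        contributed to that voxel by each pencil beam at unit weight.
    beamlet_weights: one non-negative weight per pencil beam.
    """
    if any(len(row) != len(beamlet_weights) for row in dose_influence):
        raise ValueError("each row needs one entry per pencil beam")
    return [
        sum(d * x for d, x in zip(row, beamlet_weights))
        for row in dose_influence
    ]


# Toy example (made-up numbers): 3 voxels, 2 pencil beams.
D = [
    [1.0, 0.0],  # voxel 0 is reached only by beam 0
    [0.5, 0.5],  # voxel 1 is reached by both beams
    [0.0, 2.0],  # voxel 2 is reached only by beam 1
]
x = [2.0, 1.0]
dose = dose_per_voxel(D, x)
print(dose)
```

An optimization algorithm would then search over x so that the resulting dose meets the planning objectives; comparing algorithms on the same D is what makes the comparison one-to-one.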
TL;DR: Across several quality metrics, these budgerigar assemblies are comparable to or better than the chicken and zebra finch genome assemblies built from traditional Sanger sequencing reads, and are sufficient to analyze regions that are difficult to sequence and assemble.
Abstract: Background: Parrots belong to a group of behaviorally advanced vertebrates and have an advanced ability of vocal learning relative to other vocal-learning birds. They can imitate human speech, synchronize their body movements to a rhythmic beat, and understand complex concepts of referential meaning to sounds. However, little is known about the genetics of these traits. Elucidating the genetic bases would require whole genome sequencing and a robust assembly of a parrot genome. Findings: We present a genomic resource for the budgerigar, an Australian Parakeet (Melopsittacus undulatus) – the most widely studied parrot species in neuroscience and behavior. We present genomic sequence data that includes over 300× raw read coverage from multiple sequencing technologies and chromosome optical maps from a single male animal. The reads and optical maps were used to create three hybrid assemblies representing some of the largest genomic scaffolds to date for a bird; two of which were annotated based on similarities to reference sets of non-redundant human, zebra finch and chicken proteins, and budgerigar transcriptome sequence assemblies. The sequence reads for this project were in part generated and used for both the Assemblathon 2 competition and the first de novo assembly of a giga-scale vertebrate genome utilizing PacBio single-molecule sequencing. Conclusions: Across several quality metrics, these budgerigar assemblies are comparable to or better than the chicken and zebra finch genome assemblies built from traditional Sanger sequencing reads, and are sufficient to analyze regions that are difficult to sequence and assemble, including those not yet assembled in prior bird genomes, and promoter regions of genes differentially regulated in vocal learning brain regions. This work provides valuable data and material for genome technology development and for investigating the genomics of complex behavioral traits.
TL;DR: Analysis of effective population sizes reveals that the two penguin species experienced population expansions from ~1 million years ago to ~100 thousand years ago, but responded differently to the climatic cooling of the last glacial period.
Abstract: Background: Penguins are flightless aquatic birds widely distributed in the Southern Hemisphere. The distinctive morphological and physiological features of penguins allow them to live an aquatic life, and some of them have successfully adapted to the hostile environments in Antarctica. To study the phylogenetic and population history of penguins and the molecular basis of their adaptations to Antarctica, we sequenced the genomes of the two Antarctic dwelling penguin species, the Adelie penguin [Pygoscelis adeliae] and emperor penguin [Aptenodytes forsteri]. Results: Phylogenetic dating suggests that early penguins arose ~60 million years ago, coinciding with a period of global warming. Analysis of effective population sizes reveals that the two penguin species experienced population expansions from ~1 million years ago to ~100 thousand years ago, but responded differently to the climatic cooling of the last glacial period. Comparative genomic analyses with other available avian genomes identified molecular changes in genes related to epidermal structure, phototransduction, lipid metabolism, and forelimb morphology. Conclusions: Our sequencing and initial analyses of the first two penguin genomes provide insights into the timing of penguin origin, fluctuations in effective population sizes of the two penguin species over the past 10 million years, and the potential associations between these biological patterns and global climate change. The molecular changes compared with other avian genomes reflect both shared and diverse adaptations of the two penguin species to the Antarctic environment.
TL;DR: Some of the ways in which GO can change should be carefully considered by all users of GO as they may have a significant impact on the resulting gene product annotations, and therefore the functional description of the gene product, or the interpretation of analyses performed on GO datasets.
Abstract: The Gene Ontology Consortium (GOC) is a major bioinformatics project that provides structured controlled vocabularies to classify gene product function and location. GOC members create annotations to gene products using the Gene Ontology (GO) vocabularies, thus providing an extensive, publicly available resource. The GO and its annotations to gene products are now an integral part of functional analysis, and statistical tests using GO data are becoming routine for researchers to include when publishing functional information. While many helpful articles about the GOC are available, there are certain updates to the ontology and annotation sets that sometimes go unobserved. Here we describe some of the ways in which GO can change that should be carefully considered by all users of GO as they may have a significant impact on the resulting gene product annotations, and therefore the functional description of the gene product, or the interpretation of analyses performed on GO datasets. GO annotations for gene products change for many reasons, and while these changes generally improve the accuracy of the representation of the underlying biology, they do not necessarily imply that previous annotations were incorrect. We additionally describe the quality assurance mechanisms we employ to improve the accuracy of annotations, which necessarily changes the composition of the annotation sets we provide. We use the Universal Protein Resource (UniProt) for illustrative purposes of how the GO Consortium, as a whole, manages these changes.
TL;DR: This paper presents a curated repository of multielectrode array recordings of spontaneous activity in developing mouse and ferret retina, and describes the structure of the data, along with examples of reproducible research using these data files.
Abstract: During early development, neural circuits fire spontaneously, generating activity episodes with complex spatiotemporal patterns. Recordings of spontaneous activity have been made in many parts of the nervous system over the last 25 years, reporting developmental changes in activity patterns and the effects of various genetic perturbations. We present a curated repository of multielectrode array recordings of spontaneous activity in developing mouse and ferret retina. The data have been annotated with minimal metadata and converted into HDF5. This paper describes the structure of the data, along with examples of reproducible research using these data files. We also demonstrate how these data can be analysed in the CARMEN workflow system. This article is written as a literate programming document; all programs and data described here are freely available. 1. We hope this repository will lead to novel analysis of spontaneous activity recorded in different laboratories. 2. We encourage published data to be added to the repository. 3. This repository serves as an example of how multielectrode array recordings can be stored for long-term reuse.
TL;DR: The co-authors of this paper state their intention to work together to launch the Genomic Observatories Network (GOs Network) for which this document will serve as its Founding Charter, and to describe their shared vision for its future.
Abstract: The co-authors of this paper hereby state their intention to work together to launch the Genomic Observatories Network (GOs Network) for which this document will serve as its Founding Charter. We define a Genomic Observatory as an ecosystem and/or site subject to long-term scientific research, including (but not limited to) the sustained study of genomic biodiversity from single-celled microbes to multicellular organisms. An international group of 64 scientists first published the call for a global network of Genomic Observatories in January 2012. The vision for such a network was expanded in a subsequent paper and developed over a series of meetings in Bremen (Germany), Shenzhen (China), Moorea (French Polynesia), Oxford (UK), Pacific Grove (California, USA), Washington (DC, USA), and London (UK). While this community-building process continues, here we express our mutual intent to establish the GOs Network formally, and to describe our shared vision for its future. The views expressed here are ours alone as individual scientists, and do not necessarily represent those of the institutions with which we are affiliated.
TL;DR: The presented genome annotation extends beyond earlier ones by closing gaps of sequence that were unavoidable with previous low-coverage shotgun genome sequencing, and offers an important resource for connecting the rich veterinary and natural history of cats to genome discovery.
Abstract: Domestic cats enjoy an extensive veterinary medical surveillance which has described nearly 250 genetic diseases analogous to human disorders. Feline infectious agents offer powerful natural models of deadly human diseases, which include feline immunodeficiency virus, feline sarcoma virus and feline leukemia virus. A rich veterinary literature of feline disease pathogenesis and the demonstration of a highly conserved ancestral mammal genome organization make the cat genome annotation a highly informative resource that facilitates multifaceted research endeavors. Here we report a preliminary annotation of the whole genome sequence of Cinnamon, a domestic cat living in Columbia (MO, USA), bisulfite sequencing of Boris, a male cat from St. Petersburg (Russia), and light 30× sequencing of Sylvester, a European wildcat progenitor of cat domestication. The annotation includes 21,865 protein-coding genes identified by a comparative approach, 217 loci of endogenous retrovirus-like elements, repetitive elements which comprise about 55.7% of the whole genome, 99,494 new SNVs, 8,355 new indels, 743,326 evolutionary constrained elements, and 3,182 microRNA homologues. The methylation sites study shows that 10.5% of cat genome cytosines are methylated. An assisted assembly of a European wildcat, Felis silvestris silvestris, was performed; variants between F. silvestris and F. catus genomes were derived and compared to F. catus. The presented genome annotation extends beyond earlier ones by closing gaps of sequence that were unavoidable with previous low-coverage shotgun genome sequencing. The assembly and its annotation offer an important resource for connecting the rich veterinary and natural history of cats to genome discovery.
TL;DR: Recent crowdfunding efforts to sequence the Azolla genome, a little fern with massive green potential, are described, showing that crowdfunding is a worthy platform not only for obtaining seed money for exploratory research, but also for engaging directly with the general public as a rewarding form of outreach.
Abstract: Much of science progresses within the tight boundaries of what is often seen as a “black box”. Though familiar to funding agencies, researchers and the academic journals they publish in, it is an entity that outsiders rarely get to peek into. Crowdfunding is a novel means that allows the public to participate in, as well as to support and witness advancements in science. Here we describe our recent crowdfunding efforts to sequence the Azolla genome, a little fern with massive green potential. Crowdfunding is a worthy platform not only for obtaining seed money for exploratory research, but also for engaging directly with the general public as a rewarding form of outreach.
TL;DR: The dataset presented here shows that earthworms constitute suitable candidates for μCT scanning in combination with soft tissue staining; the data are comparable to results derived from traditional dissection techniques and, being digital, also permit computer-based interactive exploration of earthworm morphology and anatomy.
Abstract: Background: Although molecular tools are increasingly employed to decipher invertebrate systematics, earthworm (Annelida: Clitellata: ‘Oligochaeta’) taxonomy is still largely based on conventional dissection, resulting in data that are mostly unsuitable for dissemination through online databases. In order to evaluate if micro-computed tomography (μCT) in combination with soft tissue staining techniques could be used to expand the existing set of tools available for studying internal and external structures of earthworms, μCT scans of freshly fixed and museum specimens were gathered. Findings: Scout images revealed full penetration of tissues by the staining agent. The attained isotropic voxel resolutions permit identification of internal and external structures conventionally used in earthworm taxonomy. The μCT projection and reconstruction images have been deposited in the online data repository GigaDB and are publicly available for download. Conclusions: The dataset presented here shows that earthworms constitute suitable candidates for μCT scanning in combination with soft tissue staining. Not only are the data comparable to results derived from traditional dissection techniques, but due to their digital nature the data also permit computer-based interactive exploration of earthworm morphology and anatomy. The approach pursued here can be applied to freshly fixed as well as museum specimens, which is of particular importance when considering the use of rare or valuable material. Finally, a number of aspects related to the deposition of digital morphological data are briefly discussed.
TL;DR: Ten recommendations to ensure the usability, sustainability and practicality of research software are addressed, in particular for young researchers new to programming.
Abstract: Research in the context of data-driven science requires a backbone of well-written software, but scientific researchers are typically not trained at length in software engineering, the principles for creating better software products. To address this gap, in particular for young researchers new to programming, we give ten recommendations to ensure the usability, sustainability and practicality of research software.
TL;DR: This paper presents the successful beginnings of an international interdisciplinary venture, the Avian Phylogenomics Project, which lets us view, through a genomics lens, modern bird species and the evolutionary events that produced them.
Abstract: Everyone loves the birds of the world. From their haunting songs and majesty of flight to dazzling plumage and mating rituals, bird watchers – both amateurs and professionals – have marveled for centuries at their considerable adaptations. Now, we are offered a special treat with the publication of a series of papers in dedicated issues of Science, Genome Biology and GigaScience (which also included pre-publication data release). These present the successful beginnings of an international interdisciplinary venture, the Avian Phylogenomics Project, which lets us view, through a genomics lens, modern bird species and the evolutionary events that produced them.
TL;DR: The presented datasets together with their metadata provide researchers with an opportunity to study the P300 component from different perspectives and can be used for BCI research.
Abstract: The event-related potentials technique is widely used in cognitive neuroscience research. The P300 waveform has been explored in many research articles because of its wide applications, such as lie detection or brain-computer interfaces (BCI). However, very few datasets are publicly available. Therefore, most researchers use only their private datasets for their analysis. This leads to minimally comparable results, particularly in brain-computer interface research. Here we present electroencephalography/event-related potentials (EEG/ERP) data. The data were obtained from 20 healthy subjects and were acquired using an odd-ball hardware stimulator. The visual stimulation was based on a three-stimulus paradigm and included target, non-target and distracter stimuli. The data and collected metadata are shared in the EEG/ERP Portal. The paper also describes the process and validation results of the presented data. The data were validated using two different methods. The first method evaluated the data by measuring the percentage of artifacts. The second method tested whether the expectation of the experimental results was fulfilled (i.e., whether the target trials contained the P300 component). The validation showed that most datasets were suitable for subsequent analysis. The presented datasets together with their metadata provide researchers with an opportunity to study the P300 component from different perspectives. Furthermore, they can be used for BCI research.
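The first validation method, measuring the percentage of artifacts, could be sketched as follows. The abstract does not say how artifacts were detected, so the amplitude criterion, the 100 µV threshold, the function name, and the sample values here are all assumptions chosen for illustration.

```python
def artifact_percentage(epochs, threshold_uv=100.0):
    """Return the percentage of epochs whose peak absolute amplitude
    exceeds threshold_uv (an assumed rejection criterion, in microvolts)."""
    if not epochs:
        raise ValueError("at least one epoch is required")
    flagged = sum(
        1 for epoch in epochs
        if max(abs(sample) for sample in epoch) > threshold_uv
    )
    return 100.0 * flagged / len(epochs)


# Toy single-channel epochs (made-up amplitudes in microvolts).
epochs = [
    [5.0, -12.0, 30.0],    # clean
    [8.0, 150.0, -20.0],   # exceeds the threshold
    [-3.0, 4.0, 9.0],      # clean
    [-110.0, 10.0, 2.0],   # exceeds the threshold (negative deflection)
]
share = artifact_percentage(epochs)
print(f"{share:.1f}% of epochs flagged as artifacts")
```

A recording with a low flagged share would then be considered usable for the second check, which looks for the P300 component in target trials.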
TL;DR: The availability of elephant genome sequence data from all three elephant species will complement studies of behaviour, genetic diversity, evolution and disease resistance, and are an important addition to the available genetic and genomic information on Asian and African elephants.
Abstract: Background: Three species of elephant exist: the Asian elephant (Elephas maximus) and two species of African elephant (Loxodonta africana and Loxodonta cyclotis). The populations of all three species are dwindling and are under threat from factors such as habitat destruction and ivory hunting. The species differ in many respects, including in their morphology and response to disease. The availability of elephant genome sequence data from all three elephant species will complement studies of behaviour, genetic diversity, evolution and disease resistance. Findings: We present low-coverage Illumina sequence data from two Asian elephants, representing approximately 5X and 2.5X coverage respectively. Both raw and aligned data are available, using the African elephant (L. africana) genome as a reference. Conclusions: The data presented here are an important addition to the available genetic and genomic information on Asian and African elephants.
TL;DR: This work proposes a novel framework for scalable network-based learning of multi-species protein functions based on both a local implementation of existing algorithms and the adoption of innovative technologies that allow the efficient use of the large memory available on disks, thus overcoming the main memory limitations of modern off-the-shelf computers.
Abstract: Network-based learning algorithms for automated function prediction (AFP) are negatively affected by the limited coverage of experimental data and limited a priori known functional annotations. As a consequence their application to model organisms is often restricted to well characterized biological processes and pathways, and their effectiveness with poorly annotated species is relatively limited. A possible solution to this problem might consist in the construction of big networks including multiple species, but this in turn poses challenging computational problems, due to the scalability limitations of existing algorithms and the main memory requirements induced by the construction of big networks. Distributed computation or the usage of big computers could in principle respond to these issues, but raise further algorithmic problems and require resources not satisfiable with simple off-the-shelf computers. We propose a novel framework for scalable network-based learning of multi-species protein functions based on both a local implementation of existing algorithms and the adoption of innovative technologies: we solve “locally” the AFP problem, by designing “vertex-centric” implementations of network-based algorithms, but we do not give up thinking “globally” by exploiting the overall topology of the network. This is made possible by the adoption of secondary memory-based technologies that allow the efficient use of the large memory available on disks, thus overcoming the main memory limitations of modern off-the-shelf computers. This approach has been applied to the analysis of a large multi-species network including more than 300 species of bacteria and to a network with more than 200,000 proteins belonging to 13 Eukaryotic species. To our knowledge this is the first work where secondary-memory based network analysis has been applied to multi-species function prediction using biological networks with hundreds of thousands of proteins.
The combination of these algorithmic and technological approaches makes feasible the analysis of large multi-species networks using ordinary computers with limited speed and primary memory, and in perspective could enable the analysis of huge networks (e.g. the whole proteomes available in SwissProt), using well-equipped stand-alone machines.
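To illustrate what a "vertex-centric" formulation looks like, here is a minimal sketch of a generic score-propagation scheme in which each vertex is updated from its own edge list only. This is not the authors' algorithm: the update rule, the `alpha` parameter, the function names and the toy network are assumptions, and everything is held in memory, whereas the framework described above relies on secondary (disk-based) memory.

```python
def vertex_update(vertex, neighbours, scores, labels, alpha=0.5):
    """Compute one vertex's new score from its own edges only (the 'local' view).

    neighbours: list of (neighbour_id, edge_weight) pairs for this vertex.
    labels: known annotations (1.0 = annotated with the function of interest).
    """
    total_weight = sum(weight for _, weight in neighbours)
    if total_weight == 0:
        neighbour_avg = 0.0
    else:
        neighbour_avg = sum(w * scores[u] for u, w in neighbours) / total_weight
    # Blend the vertex's prior label with the weighted average of its neighbours.
    return alpha * labels.get(vertex, 0.0) + (1 - alpha) * neighbour_avg


def propagate(graph, labels, iterations=10, alpha=0.5):
    """Apply the per-vertex update to every vertex for a fixed number of rounds."""
    scores = {v: labels.get(v, 0.0) for v in graph}
    for _ in range(iterations):
        scores = {
            v: vertex_update(v, graph[v], scores, labels, alpha) for v in graph
        }
    return scores


# Toy chain network of three hypothetical proteins: p1 - p2 - p3.
graph = {
    "p1": [("p2", 1.0)],
    "p2": [("p1", 1.0), ("p3", 1.0)],
    "p3": [("p2", 1.0)],
}
labels = {"p1": 1.0}  # only p1 has a known annotation
scores = propagate(graph, labels)
print(scores)
```

Because each update touches only one vertex and its adjacency list, the vertices can in principle be streamed from disk one at a time, which is the property that makes the approach compatible with secondary-memory processing.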
TL;DR: Evidence derived from comparative methylome and transcriptome analyses indicates that a non-exhaustive and partly reversible methylation process operates in truffles, which can be of interest for evolutionary genomics studies of multicellular (filamentous) fungi, in particular Ascomycetes belonging to the subphylum, Pezizomycotina.
Abstract: Tuber melanosporum, also known in the gastronomic community as “truffle”, features one of the largest fungal genomes (125 Mb) with an exceptionally high transposable element (TE) and repetitive DNA content (>58%). The main purpose of DNA methylation in fungi is TE silencing. As obligate outcrossing organisms, truffles are bound to a sexual mode of propagation, which together with TEs is thought to represent a major force driving the evolution of DNA methylation. Thus, it was of interest to examine if and how T. melanosporum exploits DNA methylation to maintain genome integrity. We performed whole-genome DNA bisulfite sequencing and mRNA sequencing on different developmental stages of T. melanosporum; namely, fruitbody (“truffle”), free-living mycelium and ectomycorrhiza. The data revealed a high rate of cytosine methylation (>44%), selectively targeting TEs rather than genes with a strong preference for CpG sites. Whole genome DNA sequencing uncovered multiple TE-enriched, copy number variant regions bearing a significant fraction of hypomethylated and expressed TEs, almost exclusively in free-living mycelium propagated in vitro. Treatment of mycelia with 5-azacytidine partially reduced DNA methylation and increased TE transcription. Our transcriptome assembly also resulted in the identification of a set of novel transcripts from 614 genes. The datasets presented here provide valuable and comprehensive (epi)genomic information that can be of interest for evolutionary genomics studies of multicellular (filamentous) fungi, in particular Ascomycetes belonging to the subphylum, Pezizomycotina. Evidence derived from comparative methylome and transcriptome analyses indicates that a non-exhaustive and partly reversible methylation process operates in truffles.
TL;DR: It is argued that publicly available digital anatomical and morphological data gathered during experiments involving non-invasive imaging techniques constitute one of the prerequisites for future large-scale genotype-phenotype correlations.
Abstract: Apart from its application in human diagnostics, magnetic resonance imaging (MRI) can also be used to study the internal anatomy of zoological specimens. As a non-invasive imaging technique, MRI has several advantages, such as rapid data acquisition, output of true three-dimensional imagery, and provision of digital data right from the onset of a study. Of particular importance for comparative zoological studies is the capacity of MRI to conduct high-throughput analyses of multiple specimens. In this study, MRI was applied to systematically document the internal anatomy of 98 representative species of sea urchins (Echinodermata: Echinoidea). The dataset includes raw and derived image data from 141 MRI scans. Most of the whole sea urchin specimens analyzed were obtained from museum collections. The attained scan resolutions permit differentiation of various internal organs, including the digestive tract, reproductive system, coelomic compartments, and lantern musculature. All data deposited in the GigaDB repository can be accessed using open source software. Potential uses of the dataset include interactive exploration of sea urchin anatomy, morphometric and volumetric analyses of internal organs observed in their natural context, as well as correlation of hard and soft tissue structures. The dataset covers a broad taxonomical and morphological spectrum of the Echinoidea, focusing on ‘regular’ sea urchin taxa. The deposited files significantly expand the amount of morphological data on echinoids that are electronically available. The approach chosen here can be extended to various other vertebrate and invertebrate taxa. We argue that publicly available digital anatomical and morphological data gathered during experiments involving non-invasive imaging techniques constitute one of the prerequisites for future large-scale genotype-phenotype correlations.
TL;DR: A generic and flexible reporting tool for Galaxy, iReport, that allows users to create interactive HTML reports directly from the Galaxy UI, with the ability to combine an arbitrary number of outputs from any number of different tools.
Abstract: Galaxy offers a number of visualisation options with components, such as Trackster, Circster and Galaxy Charts, but currently lacks the ability to easily combine outputs from different tools into a single view or report. A number of tools produce HTML reports as output in order to combine the various output files from a single tool; however, this requires programming and knowledge of HTML, and the reports must be custom-made for each new tool. We have developed a generic and flexible reporting tool for Galaxy, iReport, that allows users to create interactive HTML reports directly from the Galaxy UI, with the ability to combine an arbitrary number of outputs from any number of different tools. Content can be organised into different tabs, and interactivity can be added to components. To demonstrate the capability of iReport we provide two publicly available examples. The first is an iReport explaining iReports, created for, and using content from, the recent Galaxy Community Conference 2014. The second is a genetic report based on a trio analysis to determine candidate pathogenic variants, which uses our previously developed Galaxy toolset for whole-genome NGS analysis, CGtag. These reports may be adapted for outputs from any sequencing platform and any results, such as omics data, non-high throughput results and clinical variables. iReport provides a secure, collaborative, and flexible web-based reporting system that is compatible with Galaxy (and non-Galaxy) generated content. We demonstrate its value with a real-life example of reporting genetic trio-analysis.
TL;DR: The data repository alongside the evaluation test bed provides the option to reliably compare motion compensation algorithms for myocardial perfusion MRI, and researchers are encouraged to add their own annotations to the data set to make other applications possible, for example, the validation of segmentation algorithms.
Abstract: Perfusion quantification by using first-pass gadolinium-enhanced myocardial perfusion magnetic resonance imaging (MRI) has proved to be a reliable tool for the diagnosis of coronary artery disease that leads to reduced blood flow to the myocardium. The image series resulting from such acquisition usually exhibits a breathing motion that needs to be compensated for if a further automatic analysis of the perfusion is to be executed. Various algorithms have been presented to facilitate such a motion compensation, but the lack of publicly available data sets hinders a proper, reproducible comparison of these algorithms. Free breathing perfusion MRI series of ten patients considered clinically to have a stress perfusion defect were acquired; for each patient a rest and a stress study was executed. Manual segmentations of the left ventricle myocardium and the right-left ventricle insertion point are provided for all images in order to make a unified validation of the motion compensation algorithms and the perfusion analysis possible. In addition, all the scripts and the software required to run the experiments are provided alongside the data, and to enable interested parties to directly run the experiments themselves, the test bed is also provided as a virtual hard disk. To illustrate the utility of the data set two motion compensation algorithms with publicly available implementations were applied to the data and earlier reported results about the performance of these algorithms could be confirmed. The data repository alongside the evaluation test bed provides the option to reliably compare motion compensation algorithms for myocardial perfusion MRI. In addition, we encourage researchers to add their own annotations to the data set, either to provide inter-observer comparisons of segmentations, or to make other applications possible, for example, the validation of segmentation algorithms.
TL;DR: The utility of GWATCH is illustrated with three large genome-wide association studies for HIV-AIDS resistance genes screened in large multicenter cohorts; however, association datasets from any study can be uploaded and analyzed by GWATCH.
Abstract: Background: As genome-wide sequence analyses for complex human disease determinants are expanding, it is increasingly necessary to develop strategies to promote discovery and validation of potential disease-gene associations. Findings: Here we present a dynamic web-based platform – GWATCH – that automates and facilitates four steps in genetic epidemiological discovery: 1) Rapid gene association search and discovery analysis of large genome-wide datasets; 2) Expanded visual display of gene associations for genome-wide variants (SNPs, indels, CNVs), including Manhattan plots, 2D and 3D snapshots of any gene region, and a dynamic genome browser illustrating gene association chromosomal regions; 3) Real-time validation/replication of candidate or putative genes suggested from other sources, limiting Bonferroni genome-wide association study (GWAS) penalties; 4) Open data release and sharing by eliminating privacy constraints (The National Human Genome Research Institute (NHGRI) Institutional Review Board (IRB), informed consent, The Health Insurance Portability and Accountability Act (HIPAA) of 1996 etc.) on unabridged results, which allows for open access comparative and meta-analysis. Conclusions: GWATCH is suitable for both GWAS and whole genome sequence association datasets. We illustrate the utility of GWATCH with three large genome-wide association studies for HIV-AIDS resistance genes screened in large multicenter cohorts; however, association datasets from any study can be uploaded and analyzed by GWATCH.
TL;DR: How to access the data used in a phylogenomics analysis of the first 85 species, and how to visualize the gene and species trees of the 1KP project.
Abstract: The 1,000 plants (1KP) project is an international multi-disciplinary consortium that has generated transcriptome data from over 1,000 plant species, with exemplars for all of the major lineages across the Viridiplantae (green plants) clade. Here, we describe how to access the data used in a phylogenomics analysis of the first 85 species, and how to visualize our gene and species trees. Users can develop computational pipelines to analyse these data, in conjunction with data of their own that they can upload. Computationally estimated protein-protein interactions and biochemical pathways can be visualized at another site. Finally, we comment on our future plans and how they fit within this scalable system for the dissemination, visualization, and analysis of large multi-species data sets.
TL;DR: It is explained why the fern clade is pivotal for understanding genome evolution across land plants, and a rationale for how knowledge of fern genomes will enable progress in research beyond the ferns themselves is provided.
Abstract: Ferns are the only major lineage of vascular plants not represented by a sequenced nuclear genome. This lack of genome sequence information significantly impedes our ability to understand and reconstruct genome evolution not only in ferns, but across all land plants. Azolla and Ceratopteris are ideal and complementary candidates to be the first ferns to have their nuclear genomes sequenced. They differ dramatically in genome size, life history, and habit, and thus represent the immense diversity of extant ferns. Together, this pair of genomes will facilitate myriad large-scale comparative analyses across ferns and all land plants. Here we review the unique biological characteristics of ferns and describe a number of outstanding questions in plant biology that will benefit from the addition of ferns to the set of taxa with sequenced nuclear genomes. We explain why the fern clade is pivotal for understanding genome evolution across land plants, and we provide a rationale for how knowledge of fern genomes will enable progress in research beyond the ferns themselves.
TL;DR: This study provides the first evaluation of the clinical outcomes of NGS-based preimplantation genetic diagnosis/screening compared with single nucleotide polymorphism (SNP) array-based PGD/PGS and shows the reliability of this method in a clinical and array-based laboratory setting.
Abstract: Background: Next generation sequencing (NGS) is now being used for detecting chromosomal abnormalities in blastocyst trophectoderm (TE) cells from in vitro fertilized embryos. However, few data are available regarding the clinical outcome, which provides vital reference for further application of the methodology. Here, we present a clinical evaluation of NGS-based preimplantation genetic diagnosis/screening (PGD/PGS) compared with single nucleotide polymorphism (SNP) array-based PGD/PGS as a control. Results: A total of 395 couples participated. They were carriers of either translocation or inversion mutations, or were patients with recurrent miscarriage and/or advanced maternal age. A total of 1,512 blastocysts were biopsied on D5 after fertilization, with 1,058 blastocysts set aside for SNP array testing and 454 blastocysts for NGS testing. In the NGS cycles group, the implantation, clinical pregnancy and miscarriage rates were 52.6% (60/114), 61.3% (49/80) and 14.3% (7/49), respectively. In the SNP array cycles group, the implantation, clinical pregnancy and miscarriage rates were 47.6% (139/292), 56.7% (115/203) and 14.8% (17/115), respectively. The outcome measures of the NGS and SNP array cycles did not differ significantly. There were 150 blastocysts that underwent both NGS and SNP array analysis, of which seven blastocysts were found with inconsistent signals. All other signals obtained from NGS analysis were confirmed to be accurate by validation with qPCR. The relative copy number of mitochondrial DNA (mtDNA) for each blastocyst that underwent NGS testing was evaluated, and a significant difference was found between the copy number of mtDNA for the euploid and the chromosomally abnormal blastocysts. So far, out of 42 ongoing pregnancies, 24 babies were born in NGS cycles; all of these babies are healthy and free of any developmental problems.
Conclusions: This study provides the first evaluation of the clinical outcomes of NGS-based preimplantation genetic diagnosis/screening, and shows the reliability of this method in a clinical and array-based laboratory setting. NGS provides an accurate approach to detecting embryonic imbalanced segmental rearrangements, avoiding the potential risks of false signals from the SNP array in this study.
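The rates quoted in the Results can be recomputed from the counts given in parentheses. The sketch below does that and, purely as an illustration, runs a pooled two-proportion z-test on each outcome. The abstract does not say which statistical test the authors used, so the z-test and the helper names `percent` and `two_proportion_p_value` are assumptions made for this example.

```python
from decimal import Decimal, ROUND_HALF_UP
from math import erfc, sqrt

# Counts copied from the abstract: (events, denominator) for each outcome.
NGS = {"implantation": (60, 114), "clinical pregnancy": (49, 80), "miscarriage": (7, 49)}
SNP_ARRAY = {"implantation": (139, 292), "clinical pregnancy": (115, 203), "miscarriage": (17, 115)}


def percent(events, total):
    """Percentage rounded half-up to one decimal place (61.25 becomes 61.3)."""
    value = Decimal(100 * events) / Decimal(total)
    return value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def two_proportion_p_value(events_a, total_a, events_b, total_b):
    """Two-sided p-value of a pooled two-proportion z-test (an assumed, illustrative test)."""
    pooled = (events_a + events_b) / (total_a + total_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (events_a / total_a - events_b / total_b) / standard_error
    return erfc(abs(z) / sqrt(2))


if __name__ == "__main__":
    for outcome in NGS:
        p_value = two_proportion_p_value(*NGS[outcome], *SNP_ARRAY[outcome])
        print(f"{outcome}: NGS {percent(*NGS[outcome])}%, "
              f"SNP array {percent(*SNP_ARRAY[outcome])}%, p = {p_value:.2f}")
```

Half-up decimal rounding is used because 49/80 is exactly 61.25%, which Python's built-in `round` would report as 61.2 rather than the 61.3 given in the abstract.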
TL;DR: In this article, the authors have implemented a complete genomics toolkit and annotation in a cloud-based Galaxy, called CGtag (Complete Genomics Toolkit and Annotation in a Cloud-based Galaxy), for the selection of candidate mutations from Complete Genomics data.
Abstract: Complete Genomics provides an open-source suite of command-line tools for the analysis of their CG-formatted mapped sequencing files. Determining, for example, the functional impact of detected variants requires annotation with various databases that often require command-line and/or programming experience, which limits their use by the average research scientist. We have therefore implemented this CG toolkit, together with a number of annotation, visualisation and file manipulation tools in Galaxy called CGtag (Complete Genomics Toolkit and Annotation in a Cloud-based Galaxy). In order to provide research scientists with web-based, simple and accurate analytical and visualisation applications for the selection of candidate mutations from Complete Genomics data, we have implemented the open-source Complete Genomics tool set, CGATools, in Galaxy. In addition we implemented some of the most popular command-line annotation and visualisation tools to allow research scientists to select candidate pathological mutations (SNVs and indels). Furthermore, we have developed a cloud-based public Galaxy instance to host the CGtag toolkit and other associated modules. CGtag provides a user-friendly interface to all research scientists wishing to select candidate variants from CG or other next-generation sequencing platforms’ data. By using a cloud-based infrastructure, we can also assure sufficient and on-demand computation and storage resources to handle the analysis tasks. The tools are freely available for use from an NBIC/CTMM-TraIT (The Netherlands Bioinformatics Center/Center for Translational Molecular Medicine) cloud-based Galaxy instance, or can be installed to a local (production) Galaxy via the NBIC Galaxy tool shed.
TL;DR: The Avian Phylogenomics Project is the biggest vertebrate comparative genomics project to date and the genomic data presented here is expected to accelerate further analyses in many fields, including phylogenetics, comparative genomics, evolution, neurobiology, development biology, and other related areas.
Abstract: Background: The evolutionary relationships of modern birds are among the most challenging to understand in systematic biology and have been debated for centuries. To address this challenge, we assembled or collected the genomes of 48 avian species spanning most orders of birds, including all Neognathae and two of the five Palaeognathae orders, and used the genomes to construct a genome-scale avian phylogenetic tree and perform comparative genomics analyses (Jarvis et al. in press; Zhang et al. in press). Here we release assemblies and datasets associated with the comparative genome analyses, which include 38 newly sequenced avian genomes plus previously released or simultaneously released genomes of Chicken, Zebra finch, Turkey, Pigeon, Peregrine falcon, Duck, Budgerigar, Adelie penguin, Emperor penguin and the Medium Ground Finch. We hope that this resource will serve future efforts in phylogenomics and comparative genomics. Findings: The 38 bird genomes were sequenced using the Illumina HiSeq 2000 platform and assembled using a whole genome shotgun strategy. The 48 genomes were categorized into two groups according to the N50 scaffold size of the assemblies: a high depth group comprising 23 species sequenced at high coverage (>50X) with multiple insert size libraries resulting in N50 scaffold sizes greater than 1 Mb (except the White-throated Tinamou and Bald Eagle); and a low depth group comprising 25 species sequenced at a low coverage (~30X) with two insert size libraries resulting in an average N50 scaffold size of about 50 kb. Repetitive elements comprised 4%-22% of the bird genomes. The assembled scaffolds allowed the homology-based annotation of 13,000-17,000 protein coding genes in each avian genome relative to chicken, zebra finch and human, as well as comparative and sequence conservation analyses.
Conclusions: Here we release full genome assemblies of 38 newly sequenced avian species, link genome assembly downloads for 7 of the remaining 10 species, and provide a guideline of genomic data that has been generated and used in our Avian Phylogenomics Project. To the best of our knowledge, the Avian Phylogenomics Project is the biggest vertebrate comparative genomics project to date. The genomic data presented here is expected to accelerate further analyses in many fields, including phylogenetics, comparative genomics, evolution, neurobiology, development biology, and other related areas.
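The Findings group the assemblies by N50 scaffold size. As a reminder of what that statistic measures, here is a minimal sketch using the standard definition: the length L such that scaffolds of length L or longer contain at least half of all assembled bases. The function name `n50` and the toy scaffold lengths are hypothetical and are not taken from the project's data.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    Scaffolds are visited from longest to shortest; the N50 is the length of
    the scaffold at which the running total first reaches half of the assembly.
    """
    if not lengths:
        raise ValueError("at least one scaffold length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running_total = 0
    for length in ordered:
        running_total += length
        if running_total >= half_total:
            return length


if __name__ == "__main__":
    toy_scaffolds = [2, 2, 2, 3, 3, 4, 8, 8]  # hypothetical lengths, arbitrary units
    print("N50 =", n50(toy_scaffolds))
```

Because a few long scaffolds dominate the running total, N50 rewards contiguity more than a simple mean length would. This fits the reported contrast between the high depth group (above 1 Mb) and the low depth group (about 50 kb).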
TL;DR: An international resequencing effort of 3,000 rice genomes serves as a foundation for large-scale discovery of novel alleles for important rice phenotypes using various bioinformatics and/or genetic approaches and to understand the genomic diversity within O. sativa at a higher level of detail.
Abstract: Background: Rice, Oryza sativa L., is the staple food for half the world’s population. By 2030, the production of rice must increase by at least 25% in order to keep up with global population growth and demand. Accelerated genetic gains in rice improvement are needed to mitigate the effects of climate change and loss of arable land, as well as to ensure a stable global food supply.
TL;DR: Practical measures of signal recovery are robust to linkage disequilibrium between a true causal variant and markers residing in the same genomic region, and this approach is applied to the GWAS analysis of height.
Abstract: The aim of a genome-wide association study (GWAS) is to isolate DNA markers for variants affecting phenotypes of interest. This is constrained by the fact that the number of markers often far exceeds the number of samples. Compressed sensing (CS) is a body of theory regarding signal recovery when the number of predictor variables (i.e., genotyped markers) exceeds the sample size. Its applicability to GWAS has not been investigated. Using CS theory, we show that all markers with nonzero coefficients can be identified (selected) using an efficient algorithm, provided that they are sufficiently few in number (sparse) relative to sample size. For heritability equal to one (h² = 1), there is a sharp phase transition from poor performance to complete selection as the sample size is increased. For heritability below one, complete selection still occurs, but the transition is smoothed. We find for h² ∼ 0.5 that a sample size of approximately thirty times the number of markers with nonzero coefficients is sufficient for full selection. This boundary is only weakly dependent on the number of genotyped markers. Practical measures of signal recovery are robust to linkage disequilibrium between a true causal variant and markers residing in the same genomic region. Given a limited sample size, it is possible to discover a phase transition by increasing the penalization; in this case a subset of the support may be recovered. Applying this approach to the GWAS analysis of height, we show that 70-100% of the selected markers are strongly correlated with height-associated markers identified by the GIANT Consortium.
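The selection idea can be illustrated on simulated data. The abstract does not name the "efficient algorithm", so this minimal sketch assumes L1-penalized regression (the lasso) solved by cyclic coordinate descent. All names, the simulated genotypes, the effect sizes, the penalty and the reporting threshold are hypothetical choices for the example. It uses 200 markers, 100 samples and 3 causal markers with no noise (the h² = 1 case), so markers outnumber samples.

```python
import random


def standardize(column):
    """Centre a marker column and scale it to unit mean square."""
    n = len(column)
    mean = sum(column) / n
    centred = [value - mean for value in column]
    scale = (sum(v * v for v in centred) / n) ** 0.5
    if scale == 0.0:  # monomorphic marker: carries no signal
        return [0.0] * n
    return [v / scale for v in centred]


def simulate(n_samples, n_markers, n_causal, noise_sd=0.0, seed=1):
    """Simulate 0/1/2 genotypes and a phenotype driven by a few causal markers."""
    rng = random.Random(seed)
    columns = [
        standardize([(rng.random() < 0.5) + (rng.random() < 0.5)
                     for _ in range(n_samples)])
        for _ in range(n_markers)
    ]
    causal = sorted(rng.sample(range(n_markers), n_causal))
    effects = {j: (1.0 if k % 2 == 0 else -1.0) for k, j in enumerate(causal)}
    phenotype = [
        sum(effect * columns[j][i] for j, effect in effects.items())
        + rng.gauss(0.0, noise_sd)
        for i in range(n_samples)
    ]
    mean = sum(phenotype) / n_samples
    return columns, [value - mean for value in phenotype], causal


def soft_threshold(value, penalty):
    """Shrink value towards zero by penalty; values within the penalty become zero."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso(columns, phenotype, penalty, sweeps=50):
    """Minimise (1/2n)*||y - Xb||^2 + penalty*||b||_1 by cyclic coordinate descent."""
    n = len(phenotype)
    coefficients = [0.0] * len(columns)
    residual = list(phenotype)
    for _ in range(sweeps):
        for j, column in enumerate(columns):
            old = coefficients[j]
            # Correlation of marker j with the residual, adding back its own fit.
            rho = sum(x * r for x, r in zip(column, residual)) / n + old
            new = soft_threshold(rho, penalty)
            if new != old:
                residual = [r + (old - new) * x for r, x in zip(residual, column)]
                coefficients[j] = new
    return coefficients


def support(coefficients, threshold):
    """Indices of markers whose absolute coefficient exceeds threshold."""
    return [j for j, b in enumerate(coefficients) if abs(b) > threshold]


if __name__ == "__main__":
    columns, phenotype, causal = simulate(n_samples=100, n_markers=200, n_causal=3)
    coefficients = lasso(columns, phenotype, penalty=0.1)
    print("causal markers:  ", causal)
    print("selected markers:", support(coefficients, threshold=0.5))
```

Raising `noise_sd` above zero lowers the simulated heritability, and varying `n_samples` then gives a rough feel for the smoothed transition the abstract describes. This toy setup is not meant to reproduce the paper's reported boundary.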