TL;DR: GDA acknowledges financial support from the “ARISTEIA II” Action, which is co-funded by the European Social Fund and National Resources (code 4288 to GDA).
Abstract: GDA acknowledges financial support from the “ARISTEIA II” Action
of the “OPERATIONAL PROGRAMME EDUCATION AND LIFELONG
LEARNING” that is co-funded by the European Social
Fund and National Resources (code 4288 to GDA). GDA acknowledges
additional support by research grants from the Postgraduate
Programme ‘Toxicology’ of the Dept. of Biochemistry and
Biotechnology, School of Health Sciences, University of Thessaly,
Greece. YVdP acknowledges the Multidisciplinary Research Partnership
“Bioinformatics: from nucleotides to networks” Project
(no. 01MR0310W) of Ghent University. SGO acknowledges the
University of Cambridge for granting him Sabbatical Leave to
permit him to work with GDA in the University of Thessaly,
Greece.
TL;DR: These MED/Q genomic resources lay a foundation for future ‘pan-genomic’ comparisons, such as invasive vs. noninvasive.
Abstract: National Natural Science Foundation of China [31420103919, 31672032]; Chinese Academy of Agricultural Sciences (CAAS-ASTIP-IVFCAAS); the China Agriculture Research System [CARS-26-10]; Beijing Training Project for the Leading Talents in S&T [LJRC201412]; Beijing Key Laboratory for Pest Control and Sustainable Cultivation of Vegetables; Beijing Nova Program [Z171100001117039]
TL;DR: Deep learning–based phenotyping is shown to have very good detection and localization accuracy in validation and testing image sets and to derive meaningful biological traits, which in turn can be used in quantitative trait loci discovery pipelines.
Abstract: Deep learning is an emerging field that promises unparalleled results on many data analysis problems. We show the success offered by such techniques when applied to the challenging problem of image-based plant phenotyping, and demonstrate state-of-the-art results for root and shoot feature identification and localisation. We predict a paradigm shift in image-based phenotyping thanks to deep learning approaches.
TL;DR: The observations suggest that the BGISEQ-500 holds the potential to represent a valid and potentially valuable alternative platform for palaeogenomic data generation that is worthy of future exploration by those interested in the sequencing and analysis of degraded DNA.
Abstract: Ancient DNA research has been revolutionized following development of next-generation sequencing platforms. Although a number of such platforms have been applied to ancient DNA samples, the Illumina series are the dominant choice today, mainly because of high production capacities and short read production. Recently a potentially attractive alternative platform for palaeogenomic data generation has been developed, the BGISEQ-500, whose sequence output is comparable with that of the Illumina series. In this study, we modified the standard BGISEQ-500 library preparation specifically for use on degraded DNA, then directly compared the sequencing performance and data quality of the BGISEQ-500 to the Illumina HiSeq2500 platform on DNA extracted from 8 historic and ancient dog and wolf samples. The data generated were largely comparable between sequencing platforms, with no statistically significant difference observed for parameters including level (P = 0.371) and average sequence length (P = 0.718) of endogenous nuclear DNA, sequence GC content (P = 0.311), double-stranded DNA damage rate (P = 0.309), and sequence clonality (P = 0.093). Small significant differences were found in single-strand DNA damage rate (δS; slightly lower for the BGISEQ-500, P = 0.011) and the background rate of difference from the reference genome (θ; slightly higher for BGISEQ-500, P = 0.012). This may result from the differences in amplification cycles used to polymerase chain reaction-amplify the libraries. A significant difference was also observed in the mitochondrial DNA percentages recovered (P = 0.018), although we believe this is likely a stochastic effect relating to the extremely low levels of mitochondria that were sequenced from 3 of the samples with overall very low levels of endogenous DNA.
Although we acknowledge that our analyses were limited to animal material, our observations suggest that the BGISEQ-500 holds the potential to represent a valid and potentially valuable alternative platform for palaeogenomic data generation that is worthy of future exploration by those interested in the sequencing and analysis of degraded DNA.
TL;DR: The first human whole-genome sequencing dataset of BGISEQ-500, generated by sequencing the widely used cell line HG001, can serve as the reference dataset, providing basic information not just for future development, but also for all research and applications based on the new sequencing platform.
Abstract: Background BGISEQ-500 is a new desktop sequencer developed by BGI. Using DNA nanoball and combinational probe anchor synthesis developed from Complete Genomics™ sequencing technologies, it generates short reads at a large scale. Here, we present the first human whole-genome sequencing dataset of BGISEQ-500. The dataset was generated by sequencing the widely used cell line HG001 (NA12878) in two sequencing runs of paired-end 50 bp (PE50) and two sequencing runs of paired-end 100 bp (PE100). We also include examples of the raw images from the sequencer for reference. Finally, we identified variations using this dataset, estimated the accuracy of the variations, and compared it to that of the variations identified from similar amounts of publicly available HiSeq2500 data. We found similar single nucleotide polymorphism (SNP) detection accuracy for the BGISEQ-500 PE100 data (false positive rate [FPR] = 0.00020%, sensitivity = 96.20%) compared to the PE150 HiSeq2500 data (FPR = 0.00017%, sensitivity = 96.60%), and better SNP detection accuracy than the PE50 data (FPR = 0.0006%, sensitivity = 94.15%). But for insertions and deletions (indels), we found lower accuracy for BGISEQ-500 data (FPR = 0.00069% and 0.00067% for PE100 and PE50 respectively, sensitivity = 88.52% and 70.93%) than the HiSeq2500 data (FPR = 0.00032%, sensitivity = 96.28%). Our dataset can serve as the reference dataset, providing basic information not just for future development, but also for all research and applications based on the new sequencing platform.
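The FPR and sensitivity figures quoted above can be read with the usual confusion-matrix definitions. The sketch below assumes those textbook definitions (sensitivity = TP / (TP + FN), FPR = FP / (FP + TN)); the abstract does not state exactly how the authors counted true negatives, and the counts used here are hypothetical, chosen only to illustrate the arithmetic.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real variants that were called: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def false_positive_rate(false_pos, true_neg):
    """Fraction of non-variant sites wrongly called: FP / (FP + TN)."""
    return false_pos / (false_pos + true_neg)


# Hypothetical counts, not taken from the paper.
tp, fn = 962, 38          # variants in the truth set: found vs. missed
fp, tn = 2, 999_998       # non-variant sites: wrongly called vs. correctly left alone

print(f"sensitivity = {100 * sensitivity(tp, fn):.2f}%")
print(f"FPR         = {100 * false_positive_rate(fp, tn):.5f}%")
```

Because non-variant sites vastly outnumber variants in a genome, an FPR that looks tiny as a percentage can still correspond to many false calls in absolute terms.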
TL;DR: The authors' EEG datasets for MI BCI may provide researchers with opportunities to investigate human factors related to MI BCI performance variation, and may also achieve subject-to-subject transfer by using metadata, including a questionnaire, EEG coordinates, and EEGs for non-task-related states.
Abstract: Background Most investigators of brain-computer interface (BCI) research believe that BCI can be achieved through induced neuronal activity from the cortex, but not by evoked neuronal activity. Motor imagery (MI)-based BCI is one of the standard concepts of BCI, in that the user can generate induced activity by imagining motor movements. However, variations in performance over sessions and subjects are too severe to overcome easily; therefore, a basic understanding and investigation of BCI performance variation is necessary to find critical evidence of performance variation. Here we present not only EEG datasets for MI BCI from 52 subjects, but also the results of a psychological and physiological questionnaire, EMG datasets, the locations of 3D EEG electrodes, and EEGs for non-task-related states. Findings We validated our EEG datasets by using the percentage of bad trials, event-related desynchronization/synchronization (ERD/ERS) analysis, and classification analysis. After conventional rejection of bad trials, we showed contralateral ERD and ipsilateral ERS in the somatosensory area, which are well-known patterns of MI. Finally, we showed that 73.08% of datasets (38 subjects) included reasonably discriminative information. Conclusions Our EEG datasets included the information necessary to determine statistical significance; they consisted of well-discriminated datasets (38 subjects) and less-discriminative datasets. These may provide researchers with opportunities to investigate human factors related to MI BCI performance variation, and may also achieve subject-to-subject transfer by using metadata, including a questionnaire, EEG coordinates, and EEGs for non-task-related states.
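As a rough illustration of the ERD/ERS analysis mentioned above, the sketch below uses the common percentage definition, ERD/ERS% = (A − R) / R × 100, where R is band power in a reference (rest) interval and A is band power during the task. This formula is an assumption here: the abstract does not say which variant the authors used, and the power values are hypothetical.

```python
def erd_ers_percent(reference_power, activity_power):
    """Relative band-power change in percent.

    Negative values indicate desynchronization (ERD, power decrease);
    positive values indicate synchronization (ERS, power increase).
    """
    if reference_power <= 0:
        raise ValueError("reference power must be positive")
    return 100.0 * (activity_power - reference_power) / reference_power


# Hypothetical band-power values (arbitrary units) for two channels.
contralateral = erd_ers_percent(reference_power=10.0, activity_power=6.0)
ipsilateral = erd_ers_percent(reference_power=10.0, activity_power=12.0)

print(contralateral)  # negative: ERD
print(ipsilateral)    # positive: ERS
```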
TL;DR: The results demonstrated that host plants enrich specific bacteria and functions in the rhizoplane in foxtail millet root bacterial community, and may serve as a valuable knowledge foundation for bio-fertilizer development in agriculture.
Abstract: Root microbes play pivotal roles in plant productivity, nutrient uptake, and disease resistance. The root microbial community structure has been extensively investigated by 16S/18S/ITS amplicons and metagenomic sequencing in crops and model plants. However, the functional associations between root microbes and host plant growth are poorly understood. This work investigates the root bacterial community of foxtail millet (Setaria italica) and its potential effects on host plant productivity. We determined the bacterial composition of 2882 samples from foxtail millet rhizoplane, rhizosphere and corresponding bulk soils from 2 well-separated geographic locations by 16S rRNA gene amplicon sequencing. We identified 16 109 operational taxonomic units (OTUs), and defined 187 OTUs as shared rhizoplane core OTUs. The β-diversity analysis revealed that microhabitat was the major factor shaping the foxtail millet root bacterial community, followed by geographic location. Large-scale association analysis identified potentially beneficial bacteria correlated with high plant productivity. In addition, the functional prediction revealed specific pathways enriched in the foxtail millet rhizoplane bacterial community. We systematically described the root bacterial community structure of foxtail millet and found its core rhizoplane bacterial members. Our results demonstrated that host plants enrich specific bacteria and functions in the rhizoplane. The potentially beneficial bacteria may serve as a valuable knowledge foundation for bio-fertilizer development in agriculture.
TL;DR: This work reports the first near-complete assembly of T. aestivum, using deep sequencing coverage from a combination of short Illumina reads and very long Pacific Biosciences reads, providing a strong foundation for future genetic studies of this important food crop.
Abstract: Common bread wheat, Triticum aestivum, has one of the most complex genomes known to science, with 6 copies of each chromosome, enormous numbers of near-identical sequences scattered throughout, and an overall haploid size of more than 15 billion bases. Multiple past attempts to assemble the genome have produced assemblies that were well short of the estimated genome size. Here we report the first near-complete assembly of T. aestivum, using deep sequencing coverage from a combination of short Illumina reads and very long Pacific Biosciences reads. The final assembly contains 15 344 693 583 bases and has a weighted average (N50) contig size of 232 659 bases. This represents by far the most complete and contiguous assembly of the wheat genome to date, providing a strong foundation for future genetic studies of this important food crop. We also report how we used the recently published genome of Aegilops tauschii, the diploid ancestor of the wheat D genome, to identify 4 179 762 575 bp of T. aestivum that correspond to its D genome components.
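For readers unfamiliar with the N50 statistic quoted above, here is a minimal sketch of how it is usually computed: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size. The function name and the toy contig lengths are illustrative assumptions, not data from the paper.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Compare with integers to avoid floating-point rounding on total / 2.
        if 2 * running >= total:
            return length


# Toy example: the total is 300, and half (150) is reached after the 80 and 70 contigs.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 70
```

A higher N50 means more of the assembly sits in long contigs, which is why it is reported alongside total assembly size as a contiguity measure.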
TL;DR: Novel relationships between the gut microbiome and GDM status are discovered and it is suggested that changes in microbial composition may potentially be used to identify individuals at risk for GDM.
Abstract: The human gut microbiome can modulate metabolic health and affect insulin resistance, and it may play an important role in the etiology of gestational diabetes mellitus (GDM). Here, we compared the gut microbial composition of 43 GDM patients and 81 healthy pregnant women via whole-metagenome shotgun sequencing of their fecal samples, collected at 21-29 weeks, to explore associations between GDM and the composition of microbial taxonomic units and functional genes. A metagenome-wide association study identified 154 837 genes, which clustered into 129 metagenome linkage groups (MLGs) for species description, with significant relative abundance differences between the 2 cohorts. Parabacteroides distasonis, Klebsiella variicola, etc., were enriched in GDM patients, whereas Methanobrevibacter smithii, Alistipes spp., Bifidobacterium spp., and Eubacterium spp. were enriched in controls. The ratios of the gross abundances of GDM-enriched MLGs to control-enriched MLGs were positively correlated with blood glucose levels. A random forest model shows that fecal MLGs have excellent discriminatory power to predict GDM status. Our study discovered novel relationships between the gut microbiome and GDM status and suggests that changes in microbial composition may potentially be used to identify individuals at risk for GDM.
TL;DR: It is found that record-wise CV often massively overestimates the prediction accuracy of the algorithms, and this overly optimistic method was used by almost half of the retrieved studies that used accelerometers, wearable sensors, or smartphones to predict clinical outcomes.
Abstract: The availability of smartphone and wearable sensor technology is leading to a rapid accumulation of human subject data, and machine learning is emerging as a technique to map those data into clinical predictions. As machine learning algorithms are increasingly used to support clinical decision making, it is vital to reliably quantify their prediction accuracy. Cross-validation (CV) is the standard approach where the accuracy of such algorithms is evaluated on part of the data the algorithm has not seen during training. However, for this procedure to be meaningful, the relationship between the training and the validation set should mimic the relationship between the training set and the dataset expected for the clinical use. Here we compared two popular CV methods: record-wise and subject-wise. While the subject-wise method mirrors the clinically relevant use-case scenario of diagnosis in newly recruited subjects, the record-wise strategy has no such interpretation. Using both a publicly available dataset and a simulation, we found that record-wise CV often massively overestimates the prediction accuracy of the algorithms. We also conducted a systematic review of the relevant literature, and found that this overly optimistic method was used by almost half of the retrieved studies that used accelerometers, wearable sensors, or smartphones to predict clinical outcomes. As we move towards an era of machine learning-based diagnosis and treatment, using proper methods to evaluate their accuracy is crucial, as inaccurate results can mislead both clinicians and data scientists.
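The difference between the two splitting strategies can be sketched in a few lines. This is a simplified single train/test split rather than full k-fold cross-validation, using only the standard library; the function names and the toy data (10 subjects with 20 records each) are assumptions made for illustration.

```python
import random


def record_wise_split(records, test_fraction, seed=0):
    """Shuffle individual records, so one subject can land in both sets."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def subject_wise_split(records, test_fraction, seed=0):
    """Hold out whole subjects, so test subjects are never seen in training."""
    rng = random.Random(seed)
    subjects = sorted({subject for subject, _ in records})
    rng.shuffle(subjects)
    n_test = max(1, int(len(subjects) * test_fraction))
    held_out = set(subjects[:n_test])
    train = [r for r in records if r[0] not in held_out]
    test = [r for r in records if r[0] in held_out]
    return train, test


def shared_subjects(train, test):
    """Subjects that appear on both sides of a split."""
    return {s for s, _ in train} & {s for s, _ in test}


# Toy data: (subject_id, record_index) pairs.
records = [(subject, i) for subject in range(10) for i in range(20)]

rw_train, rw_test = record_wise_split(records, 0.2)
sw_train, sw_test = subject_wise_split(records, 0.2)

print(len(shared_subjects(rw_train, rw_test)))  # subjects leaked across the split
print(len(shared_subjects(sw_train, sw_test)))  # 0 by construction
```

When records from the same subject are correlated, the overlap in the record-wise split lets a model recognise the subject rather than the clinical state, which is the mechanism behind the overestimated accuracy described above.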
TL;DR: An easily operated “how to” guide for new potential users and describes the various steps required for successful planning of research projects that involve micro-CT, a fast-growing method in scientific research applications that allows for non-destructive imaging of morphological structures.
Abstract: Laboratory x-ray micro-computed tomography (micro-CT) is a fast-growing method in scientific research applications that allows for non-destructive imaging of morphological structures. This paper provides an easily operated "how to" guide for new potential users and describes the various steps required for successful planning of research projects that involve micro-CT. Background information on micro-CT is provided, followed by relevant setup, scanning, reconstructing, and visualization methods and considerations. Throughout the guide, a Jackson's chameleon specimen, which was scanned at different settings, is used as an interactive example. The ultimate aim of this paper is to make new users familiar with the concepts and applications of micro-CT in an attempt to promote its use in future scientific studies.
TL;DR: This dataset represents a comprehensive resource of sponge-associated microbial communities based on 16S rRNA gene sequences that can be used to address overarching hypotheses regarding host-associated prokaryotes, including host specificity, convergent evolution, environmental drivers of microbiome structure, and the sponge-associated rare biosphere.
Abstract: Marine sponges (phylum Porifera) are a diverse, phylogenetically deep-branching clade known for forming intimate partnerships with complex communities of microorganisms. To date, 16S rRNA gene sequencing studies have largely utilised different extraction and amplification methodologies to target the microbial communities of a limited number of sponge species, severely limiting comparative analyses of sponge microbial diversity and structure. Here, we provide an extensive and standardised dataset that will facilitate sponge microbiome comparisons across large spatial, temporal, and environmental scales. Samples from marine sponges (n = 3569 specimens), seawater (n = 370), marine sediments (n = 65) and other environments (n = 29) were collected from different locations across the globe. This dataset incorporates at least 268 different sponge species, including several yet unidentified taxa. The V4 region of the 16S rRNA gene was amplified and sequenced from extracted DNA using standardised procedures. Raw sequences (total of 1.1 billion sequences) were processed and clustered with (i) a standard protocol using QIIME closed-reference picking resulting in 39 543 operational taxonomic units (OTU) at 97% sequence identity, (ii) a de novo clustering using Mothur resulting in 518 246 OTUs, and (iii) a new high-resolution Deblur protocol resulting in 83 908 unique bacterial sequences. Abundance tables, representative sequences, taxonomic classifications, and metadata are provided. This dataset represents a comprehensive resource of sponge-associated microbial communities based on 16S rRNA gene sequences that can be used to address overarching hypotheses regarding host-associated prokaryotes, including host specificity, convergent evolution, environmental drivers of microbiome structure, and the sponge-associated rare biosphere.
TL;DR: It was found that elements of glycerophospholipid metabolism were significantly altered in the plasma of psoriatic patients and provides novel insight into the role of lipids in psoriasis.
Abstract: Psoriasis is a common and chronic inflammatory skin disease that is complicated by gene-environment interactions. Although genomic, transcriptomic, and proteomic analyses have been performed to investigate the pathogenesis of psoriasis, the role of metabolites in psoriasis, particularly of lipids, remains unclear. Lipids not only comprise the bulk of the cellular membrane bilayers but also regulate a variety of biological processes such as cell proliferation, apoptosis, immunity, angiogenesis, and inflammation. In this study, an untargeted lipidomics approach was used to study the lipid profiles in psoriasis and to identify lipid metabolite signatures for psoriasis through ultra-performance liquid chromatography-tandem quadrupole mass spectrometry. Plasma samples from 90 participants (45 healthy and 45 psoriasis patients) were collected and analyzed. Statistical analysis was applied to find different metabolites between the disease and healthy groups. In addition, enzyme-linked immunosorbent assay was performed to validate differentially expressed lipids in psoriatic patient plasma. Finally, we identified differential expression of several lipids including lysophosphatidic acid (LPA), lysophosphatidylcholine (LysoPC), phosphatidylinositol (PI), phosphatidylcholine (PC), and phosphatidic acid (PA); among these metabolites, LPA, LysoPC, and PA were significantly increased, while PC and PI were down-regulated in psoriasis patients. We found that elements of glycerophospholipid metabolism such as LPA, LysoPC, PA, PI, and PC were significantly altered in the plasma of psoriatic patients; this study characterizes the circulating lipids in psoriatic patients and provides novel insight into the role of lipids in psoriasis.
TL;DR: The impact of long reads, sequencing of previously difficult-to-sequence regions, and phasing of haplotypes on generating the high-quality assemblies necessary for understanding gene structure, function, and evolution are demonstrated.
Abstract: Reference-quality genomes are expected to provide a resource for studying gene structure, function, and evolution. However, often genes of interest are not completely or accurately assembled, leading to unknown errors in analyses or additional cloning efforts for the correct sequences. A promising solution is long-read sequencing. Here we tested PacBio-based long-read sequencing and diploid assembly for potential improvements to the Sanger-based intermediate-read zebra finch reference and Illumina-based short-read Anna's hummingbird reference, 2 vocal learning avian species widely studied in neuroscience and genomics. With DNA of the same individuals used to generate the reference genomes, we generated diploid assemblies with the FALCON-Unzip assembler, resulting in contigs with no gaps in the megabase range, representing 150-fold and 200-fold improvements over the current zebra finch and hummingbird references, respectively. These long-read and phased assemblies corrected and resolved what we discovered to be numerous misassemblies in the references, including missing sequences in gaps, erroneous sequences flanking gaps, base call errors in difficult-to-sequence regions, complex repeat structure errors, and allelic differences between the 2 haplotypes. These improvements were validated by single long-genome and transcriptome reads and resulted for the first time in completely resolved protein-coding genes widely studied in neuroscience and specialized in vocal learning species. These findings demonstrate the impact of long reads, sequencing of previously difficult-to-sequence regions, and phasing of haplotypes on generating the high-quality assemblies necessary for understanding gene structure, function, and evolution.
TL;DR: The ginseng genome represents a valuable resource for understanding and improving the breeding, cultivation, and synthesis biology of this key herb.
Abstract: Ginseng, which contains ginsenosides as bioactive compounds, has been regarded as an important traditional medicine for several millennia. However, the genetic background of ginseng remains poorly understood, partly because of the plant's large and complex genome composition. We report the entire genome sequence of Panax ginseng using next-generation sequencing. The 3.5-Gb nucleotide sequence contains more than 60% repeats and encodes 42 006 predicted genes. Twenty-two transcriptome datasets and mass spectrometry images of ginseng roots were adopted to precisely quantify the functional genes. Thirty-one genes were identified to be involved in the mevalonic acid pathway. Eight of these genes were annotated as 3-hydroxy-3-methylglutaryl-CoA reductases, which displayed diverse structures and expression characteristics. A total of 225 UDP-glycosyltransferases (UGTs) were identified, and these UGTs accounted for one of the largest gene families of ginseng. Tandem repeats contributed to the duplication and divergence of UGTs. Molecular modeling of UGTs in the 71st, 74th, and 94th families revealed a regiospecific conserved motif located at the N-terminus. Molecular docking predicted that this motif captures ginsenoside precursors. The ginseng genome represents a valuable resource for understanding and improving the breeding, cultivation, and synthesis biology of this key herb.
TL;DR: This work will review the main achievements obtained from interdisciplinary research based on magnetic resonance imaging and establish, de facto, the birth of multilayer network analysis and modeling of the human brain.
Abstract: Understanding how the human brain is structured, and how its architecture is related to function, is of paramount importance for a variety of applications, including but not limited to new ways to prevent, deal with, and cure brain diseases, such as Alzheimer's or Parkinson's, and psychiatric disorders, such as schizophrenia. The recent advances in structural and functional neuroimaging, together with the growing inclination toward interdisciplinary approaches involving computer science, mathematics, and physics, are fostering interesting results from computational neuroscience that are quite often based on the analysis of complex network representation of the human brain. In recent years, this representation experienced a theoretical and computational revolution that is breaching neuroscience, allowing us to cope with the increasing complexity of the human brain across multiple scales and in multiple dimensions and to model structural and functional connectivity from new perspectives, often combined with each other. In this work, we will review the main achievements obtained from interdisciplinary research based on magnetic resonance imaging and establish, de facto, the birth of multilayer network analysis and modeling of the human brain.
TL;DR: The highly polymorphic genome of the pearl oyster is sequenced and a large set of novel proteins participating in matrix-framework formation are identified, including components similar to those found in vertebrate bones such as collagen-related VWA-containing proteins, chondroitin sulfotransferases, and regulatory elements.
Abstract: Nacre, the iridescent material found in pearls and shells of molluscs, is formed through an extraordinary process of matrix-assisted biomineralization. Despite recent advances, many aspects of the biomineralization process and its evolutionary origin remain unknown. The pearl oyster Pinctada fucata martensii is a well-known master of biomineralization, but the molecular mechanisms that underlie its production of shells and pearls are not fully understood. We sequenced the highly polymorphic genome of the pearl oyster and conducted multi-omic and biochemical studies to probe nacre formation. We identified a large set of novel proteins participating in matrix-framework formation, many in expanded families, including components similar to those found in vertebrate bones such as collagen-related VWA-containing proteins, chondroitin sulfotransferases, and regulatory elements. Considering that there are only collagen-based matrices in vertebrate bones and chitin-based matrices in most invertebrate skeletons, the presence of both chitin and elements of collagen-based matrices in nacre suggests that elements of chitin- and collagen-based matrices have deep roots and might be part of an ancient biomineralizing matrix. Our results expand the current shell matrix-framework model and provide new insights into the evolution of diverse biomineralization systems.
TL;DR: This microscopy dataset includes 919 265 five-channel fields of view, representing 30 616 tested compounds, available at “The Cell Image Library” (CIL) repository, and includes data files containing morphological features derived from each cell in each image, both at the single-cell level and population-averaged level.
Abstract: Background Large-scale image sets acquired by automated microscopy of perturbed samples enable a detailed comparison of cell states induced by each perturbation, such as a small molecule from a diverse library. Highly multiplexed measurements of cellular morphology can be extracted from each image and subsequently mined for a number of applications. Findings This microscopy dataset includes 919 265 five-channel fields of view, representing 30 616 tested compounds, available at "The Cell Image Library" (CIL) repository. It also includes data files containing morphological features derived from each cell in each image, both at the single-cell level and population-averaged (i.e., per-well) level; the image analysis workflows that generated the morphological features are also provided. Quality-control metrics are provided as metadata, indicating fields of view that are out-of-focus or containing highly fluorescent material or debris. Lastly, chemical annotations are supplied for the compound treatments applied. Conclusions Because computational algorithms and methods for handling single-cell morphological measurements are not yet routine, the dataset serves as a useful resource for the wider scientific community applying morphological (image-based) profiling. The dataset can be mined for many purposes, including small-molecule library enrichment and chemical mechanism-of-action studies, such as target identification. Integration with genetically perturbed datasets could enable identification of small-molecule mimetics of particular disease- or gene-related phenotypes that could be useful as probes or potential starting points for development of future therapeutics.
TL;DR: A detailed look at the complexities of cross-validation, fostered by the peer review of Saeb et al.'s paper entitled "The need to approximate the use-case in clinical machine learning."
Abstract: This three-part review takes a detailed look at the complexities of cross-validation, fostered by the peer review of Saeb et al.'s paper entitled "The need to approximate the use-case in clinical machine learning." It contains perspectives by reviewers and by the original authors that touch upon cross-validation: the suitability of different strategies and their interpretation.
TL;DR: This database is one of the largest and most comprehensive databases of its type because it includes both in situ measurements and ecological context data and can be used as the foundation for other studies of freshwaters at broad spatial and ecological scales.
Abstract: Understanding the factors that affect water quality and the ecological services provided by freshwater ecosystems is an urgent global environmental issue. Predicting how water quality will respond ...
TL;DR: Comparative transcriptome studies combined with genome-wide analysis revealed polyphenol-rich and pathogen resistance characteristics of longan fruit and suggested a genomic basis for resistance to insects, fungus, and bacteria in this fruit tree.
Abstract: Longan (Dimocarpus longan Lour.), an important subtropical fruit in the family Sapindaceae, is grown in more than 10 countries. Longan is an edible drupe fruit and a source of traditional medicine with polyphenol-rich traits. Tree size, alternate bearing, and witches' broom disease still pose serious problems. To gain insights into the genomic basis of longan traits, a draft genome sequence was assembled. The draft genome (about 471.88 Mb) of a Chinese longan cultivar, “Honghezi,” was estimated to contain 31 007 genes and 261.88 Mb of repetitive sequences. No recent whole-genome-wide duplication event was detected in the genome. Whole-genome resequencing and analysis of 13 cultivated D. longan accessions revealed the extent of genetic diversity. Comparative transcriptome studies combined with genome-wide analysis revealed polyphenol-rich and pathogen resistance characteristics. Genes involved in secondary metabolism, especially those from significantly expanded (DHS, SDH, F3′H, ANR, and UFGT) and contracted (PAL, CHS, and F3′5′H) gene families with tissue-specific expression, may be important contributors to the high accumulation levels of polyphenolic compounds observed in longan fruit. The high number of genes encoding nucleotide-binding site leucine-rich repeat (NBS-LRR) and leucine-rich repeat receptor-like kinase proteins, as well as the recent expansion and contraction of the NBS-LRR family, suggested a genomic basis for resistance to insects, fungus, and bacteria in this fruit tree. These data provide insights into the evolution and diversity of the longan genome. The comparative genomic and transcriptome analyses provided information about longan-specific traits, particularly genes involved in its polyphenol-rich and pathogen resistance characteristics.
TL;DR: The ability of sequence data produced by MinION to correctly assign taxonomy in single bacterial species runs and in three types of low-complexity synthetic communities was tested, suggesting the platform has the potential to provide rapid and accurate metagenomic analysis where the consortium is comprised of a limited number of taxa.
Abstract: Environmental metagenomic analysis is typically accomplished by assigning taxonomy and/or function from whole genome sequencing or 16S amplicon sequences. Both of these approaches are limited, however, by read length, among other technical and biological factors. A nanopore-based sequencing platform, MinION™, produces reads that are ≥1 × 10^4 bp in length, potentially providing for more precise assignment, thereby alleviating some of the limitations inherent in determining metagenome composition from short reads. We tested the ability of sequence data produced by MinION (R7.3 flow cells) to correctly assign taxonomy in single bacterial species runs and in three types of low-complexity synthetic communities: a mixture of DNA using equal mass from four species, a community with one relatively rare (1%) and three abundant (33% each) components, and a mixture of genomic DNA from 20 bacterial strains of staggered representation. Taxonomic composition of the low-complexity communities was assessed by analyzing the MinION sequence data with three different bioinformatic approaches: Kraken, MG-RAST, and One Codex. Results: Long read sequences generated from libraries prepared from single strains using the version 5 kit and chemistry, run on the original MinION device, yielded as few as 224 to as many as 3497 bidirectional high-quality (2D) reads with an average overall study length of 6000 bp. For the single-strain analyses, assignment of reads to the correct genus by different methods ranged from 53.1% to 99.5%, assignment to the correct species ranged from 23.9% to 99.5%, and the majority of misassigned reads were to closely related organisms. A synthetic metagenome sequenced with the same setup yielded 714 high quality 2D reads of approximately 5500 bp that were up to 98% correctly assigned to the species level. 
Synthetic metagenome MinION libraries generated using version 6 kit and chemistry yielded from 899 to 3497 2D reads with lengths averaging 5700 bp with up to 98% assignment accuracy at the species level. The observed community proportions for “equal” and “rare” synthetic libraries were close to the known proportions, deviating from 0.1% to 10% across all tests. For a 20-species mock community with staggered contributions, a sequencing run detected all but 3 species (each included at 99% of reads were assigned to the correct family. Conclusions: At the current level of output and sequence quality (just under 4 × 10^3 2D reads for a synthetic metagenome), MinION sequencing followed by Kraken or One Codex analysis has the potential to provide rapid and accurate metagenomic analysis where the consortium is comprised of a limited number of taxa. Important considerations noted in this study included: high sensitivity of the MinION platform to the quality of input DNA, high variability of sequencing results across libraries and flow cells, and relatively small numbers of 2D reads per analysis. Together, these limited detection of very rare components of the microbial consortia, and would likely limit the utility of MinION for the sequencing of high-complexity metagenomic communities where thousands of taxa are expected. Furthermore, the limitations of the currently available data analysis tools suggest there is considerable room for improvement in the analytical approaches for the characterization of microbial communities using long reads. Nevertheless, the fact that the accurate taxonomic assignment of high-quality reads generated by MinION is approaching 99.5% and, in most cases, the inferred community structure mirrors the known proportions of a synthetic mixture warrants further exploration of practical application to environmental metagenomics as the platform continues to develop and improve. 
With further improvement in sequence throughput and error rate reduction, this platform shows great promise for precise real-time analysis of the composition and structure of more complex microbial communities.
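The comparison of observed and known community proportions described in the MinION abstract above can be illustrated with a minimal Python sketch. It assumes each read has already been assigned a taxon label by some classifier; the taxon names, read counts, and function names are hypothetical and are not taken from the study.

```python
from collections import Counter


def observed_proportions(assignments):
    """Return the fraction of reads assigned to each taxon."""
    counts = Counter(assignments)
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def max_abs_deviation(observed, expected):
    """Largest absolute difference between observed and expected fractions."""
    taxa = set(observed) | set(expected)
    return max(abs(observed.get(t, 0.0) - expected.get(t, 0.0)) for t in taxa)


# Hypothetical "rare" design: three abundant taxa (33% each) and one rare taxon (1%).
expected = {"taxon_A": 0.33, "taxon_B": 0.33, "taxon_C": 0.33, "taxon_D": 0.01}

# Hypothetical per-read assignments (100 reads in total).
reads = ["taxon_A"] * 30 + ["taxon_B"] * 35 + ["taxon_C"] * 34 + ["taxon_D"] * 1

observed = observed_proportions(reads)
print(observed)
print(round(max_abs_deviation(observed, expected), 4))
```

With only a few thousand reads per run, a component present at 1% is represented by tens of reads at most, which is one way to see why small read counts limit detection of very rare taxa.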
TL;DR: The fast generalized detection algorithms included in CNVcaller overcome prior computational barriers for detecting CNVs in large-scale sequencing data with complex genomic structures and promote population genetic analyses of functional CNVs in more species.
Abstract: Background The increasing amount of sequencing data available for a wide variety of species can be theoretically used for detecting copy number variations (CNVs) at the population level. However, the growing sample sizes and the divergent complexity of nonhuman genomes challenge the efficiency and robustness of current human-oriented CNV detection methods. Results Here, we present CNVcaller, a read-depth method for discovering CNVs in population sequencing data. The computational speed of CNVcaller was 1-2 orders of magnitude faster than CNVnator and Genome STRiP for complex genomes with thousands of unmapped scaffolds. CNV detection of 232 goats required only 1.4 days on a single compute node. Additionally, the Mendelian consistency of sheep trios indicated that CNVcaller mitigated the influence of high proportions of gaps and misassembled duplications in the nonhuman reference genome assembly. Furthermore, multiple evaluations using real sheep and human data indicated that CNVcaller achieved the best accuracy and sensitivity for detecting duplications. Conclusions The fast generalized detection algorithms included in CNVcaller overcome prior computational barriers for detecting CNVs in large-scale sequencing data with complex genomic structures. Therefore, CNVcaller promotes population genetic analyses of functional CNVs in more species.
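The general idea behind a read-depth method can be sketched in a few lines of Python. This is not CNVcaller's algorithm, which the abstract does not detail; it is a simplified illustration that assumes reads have already been counted in fixed-size windows, that the median window reflects the normal diploid state, and that copy number scales linearly with depth. The function name and the counts are hypothetical.

```python
from statistics import median


def estimate_copy_number(window_counts, ploidy=2):
    """Estimate an integer copy number per window from raw read counts.

    Each count is divided by the median count (assumed to represent the
    normal copy state) and scaled by the ploidy, then rounded.
    """
    baseline = median(window_counts)
    if baseline <= 0:
        raise ValueError("median read count must be positive")
    return [round(ploidy * count / baseline) for count in window_counts]


# Hypothetical read counts for eight consecutive windows.
counts = [100, 98, 102, 151, 149, 100, 52, 99]
print(estimate_copy_number(counts))  # [2, 2, 2, 3, 3, 2, 1, 2]
```

Windows with about 1.5 times the baseline depth come out as a duplication (3 copies) and the window at about half the baseline as a heterozygous deletion (1 copy). Real tools additionally correct for GC content, mappability, and assembly gaps.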
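TL;DR: Coconut palm (Cocos nucifera), a member of genus Cocos and family Arecaceae (Palmaceae), is an important tropical fruit and oil crop; this report presents a draft genome intended to facilitate functional genomics and molecular-assisted breeding in the species.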
Abstract: Coconut palm (Cocos nucifera, 2n = 32), a member of genus Cocos and family Arecaceae (Palmaceae), is an important tropical fruit and oil crop. Currently, coconut palm is cultivated in 93 countries, including Central and South America, East and West Africa, Southeast Asia and the Pacific Islands, with a total growth area of more than 12 million hectares [1]. Coconut palm is generally classified into 2 main categories: "Tall" (flowering 8-10 years after planting) and "Dwarf" (flowering 4-6 years after planting), based on morphological characteristics and breeding habits. This Palmae species has a long growth period before reproductive years, which hinders conventional breeding progress. In spite of initial successes, improvements made by conventional breeding have been very slow. In the present study, we obtained de novo sequences of the Cocos nucifera genome: a major genomic resource that could be used to facilitate molecular breeding in Cocos nucifera and accelerate the breeding process in this important crop. A total of 419.67 gigabases (Gb) of raw reads were generated by the Illumina HiSeq 2000 platform using a series of paired-end and mate-pair libraries, covering the predicted Cocos nucifera genome length (2.42 Gb, variety "Hainan Tall") to an estimated 173.32× read depth. A total scaffold length of 2.20 Gb was generated (N50 = 418 Kb), representing 90.91% of the genome. The coconut genome was predicted to harbor 28 039 protein-coding genes, which is less than in Phoenix dactylifera (PDK30: 28 889), Phoenix dactylifera (DPV01: 41 660), and Elaeis guineensis (EG5: 34 802). BUSCO evaluation demonstrated that the obtained scaffold sequences covered 90.8% of the coconut genome and that the genome annotation was 74.1% complete. Genome annotation results revealed that 72.75% of the coconut genome consisted of transposable elements, of which long terminal repeat retrotransposon elements (LTRs) accounted for the largest proportion (92.23%). 
Comparative analysis of the antiporter gene family and ion channel gene families between C. nucifera and Arabidopsis thaliana indicated that significant gene expansion may have occurred in the coconut involving Na+/H+ antiporter, carnitine/acylcarnitine translocase, potassium-dependent sodium-calcium exchanger, and potassium channel genes. Despite its agronomic importance, C. nucifera is still under-studied. In this report, we present a draft genome of C. nucifera and provide genomic information that will facilitate future functional genomics and molecular-assisted breeding in this crop species.
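The read-depth and genome-fraction figures in the coconut abstract follow from simple ratios, shown here as a small Python sketch. The function names are illustrative; the inputs are the rounded values quoted in the abstract, so the computed depth differs slightly from the reported 173.32×, presumably because the authors used an unrounded genome size.

```python
def fold_coverage(total_bases_gb, genome_size_gb):
    """Average read depth: total sequenced bases divided by genome size."""
    return total_bases_gb / genome_size_gb


def percent_of_genome(assembled_gb, genome_size_gb):
    """Assembled length as a percentage of the predicted genome size."""
    return 100.0 * assembled_gb / genome_size_gb


# Values quoted in the abstract: 419.67 Gb raw reads, 2.42 Gb predicted genome,
# 2.20 Gb total scaffold length.
print(round(fold_coverage(419.67, 2.42), 2))     # 173.42
print(round(percent_of_genome(2.20, 2.42), 2))   # 90.91
```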
TL;DR: Metagenomic shotgun sequencing was employed to provide a detailed characterization of the compositional and functional features of the CD microbiota, comprising also unannotated bacteria, and investigated its modulation by exclusive enteral nutrition.
Abstract: The inflammatory intestinal disorder Crohn's disease (CD) has become a health challenge worldwide. The gut microbiota closely interacts with the host immune system, but its functional impact in CD is unclear. Except for studies on a small number of CD patients, analyses of the gut microbiota in CD have used 16S rDNA amplicon sequencing. Here we employed metagenomic shotgun sequencing to provide a detailed characterization of the compositional and functional features of the CD microbiota, comprising also unannotated bacteria, and investigated its modulation by exclusive enteral nutrition. Based on signature taxa, CD microbiotas clustered into 2 distinct metacommunities, indicating individual variability in CD microbiome structure. Metacommunity-specific functional shifts in CD showed enrichment in producers of the pro-inflammatory hexa-acylated lipopolysaccharide variant and a reduction in the potential to synthesize short-chain fatty acids. Disruption of ecological networks was evident in CD, coupled with reduction in growth rates of many bacterial species. Short-term exclusive enteral nutrition elicited limited impact on the overall composition of the CD microbiota, although functional changes occurred following treatment. The microbiotas in CD patients can be stratified into 2 distinct metacommunities, with the most severely perturbed metacommunity exhibiting functional potentials that deviate markedly from that of the healthy individuals, with possible implication in relation to CD pathogenesis.
TL;DR: The 22-gigabase genome of loblolly pine was reassembled by combining approximately 12-fold coverage in long reads, generated with the Single Molecule Real Time sequencing technology developed at Pacific Biosciences, with short Illumina reads using the MaSuRCA mega-reads assembly algorithm, producing a substantially better assembly.
Abstract: The 22-gigabase genome of loblolly pine (Pinus taeda) is one of the largest ever sequenced. The draft assembly published in 2014 was built entirely from short Illumina reads, with lengths ranging from 100 to 250 base pairs (bp). The assembly was quite fragmented, containing over 11 million contigs whose weighted average (N50) size was 8206 bp. To improve this result, we generated approximately 12-fold coverage in long reads using the Single Molecule Real Time sequencing technology developed at Pacific Biosciences. We assembled the long and short reads together using the MaSuRCA mega-reads assembly algorithm, which produced a substantially better assembly, P. taeda version 2.0. The new assembly has an N50 contig size of 25 361 bp, more than three times as large as achieved in the original assembly, and an N50 scaffold size of 107 821 bp, 61% larger than the previous assembly.
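The N50 statistic quoted above can be computed as in the following minimal Python sketch. It uses the common definition (the length L such that contigs of length at least L contain at least half of the assembled bases); the contig lengths in the example are made up for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Contigs are sorted from longest to shortest and their lengths are
    accumulated until the running total reaches half of the assembly size;
    the length of the contig that crosses that point is the N50.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths in bp (total 300, half 150).
print(n50([100, 80, 60, 40, 20]))  # 80
```

Because N50 is weighted by length, a handful of long contigs can raise it substantially even when millions of short contigs remain, which is why it is reported alongside contig counts.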
TL;DR: Gene family expansion and transcriptomic analyses provided hints to the genomic basis of the differences in important traits such as host range, migratory habit, and plant virus transmission between L. striatellus and the other 2 planthoppers.
Abstract: Background Laodelphax striatellus Fallen (Hemiptera: Delphacidae) is one of the most destructive rice pests. L. striatellus is different from 2 other rice planthoppers with a released genome sequence, Sogatella furcifera and Nilaparvata lugens, in many biological characteristics, such as host range, dispersal capacity, and vectoring plant viruses. Deciphering the genome of L. striatellus will further the understanding of the genetic basis of the biological differences among the 3 rice planthoppers. Findings A total of 190 Gb of Illumina data and 32.4 Gb of PacBio data were generated and used to assemble a high-quality L. striatellus genome sequence, which is 541 Mb in length and has a contig N50 of 118 Kb and a scaffold N50 of 1.08 Mb. Annotated repetitive elements account for 25.7% of the genome. A total of 17 736 protein-coding genes were annotated, capturing 97.6% and 98% of the BUSCO eukaryote and arthropoda genes, respectively. Compared with N. lugens and S. furcifera, L. striatellus has the smallest genome and the lowest gene number. Gene family expansion and transcriptomic analyses provided hints to the genomic basis of the differences in important traits such as host range, migratory habit, and plant virus transmission between L. striatellus and the other 2 planthoppers. Conclusions We report a high-quality genome assembly of L. striatellus, which is an important genomic resource not only for the study of the biology of L. striatellus and its interactions with plant hosts and plant viruses, but also for comparison with other planthoppers.
TL;DR: This resource provides a test-bed for quantifying the reliability of connectivity indices across subjects, conditions and time and can be used to compare and optimize different frameworks for measuring connectivity and data collection parameters such as scan length.
Abstract: Background Although typically measured during the resting state, a growing literature is illustrating the ability to map intrinsic connectivity with functional MRI during task and naturalistic viewing conditions. These paradigms are drawing excitement due to their greater tolerability in clinical and developing populations and because they enable a wider range of analyses (e.g., inter-subject correlations). To be clinically useful, the test-retest reliability of connectivity measured during these paradigms needs to be established. This resource provides data for evaluating test-retest reliability for full-brain connectivity patterns detected during each of four scan conditions that differ with respect to level of engagement (rest, abstract animations, movie clips, flanker task). Data are provided for 13 participants, each scanned in 12 sessions with 10 minutes for each scan of the four conditions. Diffusion kurtosis imaging data was also obtained at each session. Findings Technical validation and demonstrative reliability analyses were carried out at the connection-level using the Intraclass Correlation Coefficient and at network-level representations of the data using the Image Intraclass Correlation Coefficient. Variation in intrinsic functional connectivity across sessions was generally found to be greater than that attributable to scan condition. Between-condition reliability was generally high, particularly for the frontoparietal and default networks. Between-session reliabilities obtained separately for the different scan conditions were comparable, though notably lower than between-condition reliabilities. Conclusions This resource provides a test-bed for quantifying the reliability of connectivity indices across subjects, conditions and time. The resource can be used to compare and optimize different frameworks for measuring connectivity and data collection parameters such as scan length. 
Additionally, investigators can explore the unique perspectives of the brain's functional architecture offered by each of the scan conditions.
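As an illustration of the connection-level reliability measure mentioned above, here is a minimal Python sketch of an intraclass correlation coefficient. The abstract does not state which ICC form was used, so the one-way random-effects form, ICC(1,1), is assumed here; the function name and the toy data are hypothetical.

```python
def icc_one_way(scores):
    """One-way random-effects ICC(1,1).

    `scores` is a list of rows, one per subject, each holding the same
    number of repeated measurements (for example, one value per session).
    Raises ZeroDivisionError if every value in `scores` is identical.
    """
    n = len(scores)
    k = len(scores[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need at least 2 subjects with equal numbers (>= 2) of measurements")

    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]

    # Mean square between subjects and mean square within subjects.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, row_means) for x in row
    ) / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical connectivity values for 3 subjects measured in 2 sessions.
print(round(icc_one_way([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]), 4))  # 0.8824
```

Values near 1 indicate that most of the variance lies between subjects rather than between repeated measurements of the same subject.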
TL;DR: The data obtained during sequencing of the long amplicon in the MinION™ device using R9 and R9.4 chemistries were sufficient to study 2 mock microbial communities in a multiplex manner and to almost completely reconstruct the microbial diversity contained in the HM782D and D6305 mock communities.
Abstract: The miniaturized and portable DNA sequencer MinION™ has demonstrated great potential in different analyses such as genome-wide sequencing, pathogen outbreak detection and surveillance, human genome variability, and microbial diversity. In this study, we tested the ability of the MinION™ platform to perform long amplicon sequencing in order to design new approaches to study microbial diversity using a multi-locus approach. After compiling a robust database by parsing and extracting the rrn bacterial region from more than 67 000 complete or draft bacterial genomes, we demonstrated that the data obtained during sequencing of the long amplicon in the MinION™ device using R9 and R9.4 chemistries were sufficient to study 2 mock microbial communities in a multiplex manner and to almost completely reconstruct the microbial diversity contained in the HM782D and D6305 mock communities. Although nanopore-based sequencing produces reads with lower per-base accuracy compared with other platforms, we presented a novel approach consisting of multi-locus and long amplicon sequencing using the MinION™ MkIb DNA sequencer and R9 and R9.4 chemistries that helps to overcome the main disadvantage of this portable sequencing platform. Furthermore, the nanopore sequencing library, constructed with the latest releases of pore chemistry (R9.4) and sequencing kit (SQK-LSK108), permitted the retrieval of a higher level of 1D read accuracy, sufficient to characterize the microbial species present in each mock community analysed. Improvements in nanopore chemistry, such as minimizing base-calling errors and new library protocols able to produce rapid 1D libraries, will provide more reliable information in the near future. Such data will be useful for more comprehensive and faster specific detection of microbial species and strains in complex ecosystems.