TL;DR: In this paper, a computational model is constructed in which individuals are able to self-organize both their strategy and their social ties throughout evolution, based exclusively on their self-interest.
Abstract: Conventional evolutionary game theory predicts that natural selection favours the selfish and strong even though cooperative interactions thrive at all levels of organization in living systems. Recent investigations demonstrated that a limiting factor for the evolution of cooperative interactions is the way in which they are organized, cooperators becoming evolutionarily competitive whenever individuals are constrained to interact with few others along the edges of networks with low average connectivity. Despite this insight, the conundrum of cooperation remains since recent empirical data shows that real networks exhibit typically high average connectivity and associated single-to-broad–scale heterogeneity. Here, a computational model is constructed in which individuals are able to self-organize both their strategy and their social ties throughout evolution, based exclusively on their self-interest. We show that the entangled evolution of individual strategy and network structure constitutes a key mechanism for the sustainability of cooperation in social networks. For a given average connectivity of the population, there is a critical value for the ratio W between the time scales associated with the evolution of strategy and of structure above which cooperators wipe out defectors. Moreover, the emerging social networks exhibit an overall heterogeneity that accounts very well for the diversity of patterns recently found in acquired data on social networks. Finally, heterogeneity is found to become maximal when W reaches its critical value. These results show that simple topological dynamics reflecting the individual capacity for self-organization of social ties can produce realistic networks of high average connectivity with associated single-to-broad–scale heterogeneity. 
On the other hand, they show that cooperation cannot evolve as a result of “social viscosity” alone in heterogeneous networks with high average connectivity, requiring the additional mechanism of topological co-evolution to ensure the survival of cooperative behaviour.
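The role of the time-scale ratio W can be illustrated with a minimal sketch. It assumes, purely for illustration, that at each step a strategy update is chosen with probability 1/(1+W) and a structural (link) update otherwise. This selection rule and the names `choose_event` and `event_fractions` are hypothetical and are not stated in the abstract.

```python
import random


def choose_event(W, rng):
    """Pick the kind of update for one step.

    Assumption (not from the abstract): a strategy update happens with
    probability 1 / (1 + W); otherwise a social tie is updated.
    """
    p_strategy = 1.0 / (1.0 + W)
    return "strategy" if rng.random() < p_strategy else "structure"


def event_fractions(W, steps=10000, seed=0):
    """Estimate how often each kind of update occurs for a given W."""
    rng = random.Random(seed)
    counts = {"strategy": 0, "structure": 0}
    for _ in range(steps):
        counts[choose_event(W, rng)] += 1
    return {kind: n / steps for kind, n in counts.items()}
```

Under this assumed rule, W = 0 gives pure strategy evolution on a static network, and larger W makes rewiring of social ties increasingly frequent relative to strategy change.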
TL;DR: The results suggest a fundamental link between physical embeddedness and information, highlighting the effects of embodied interactions on internal (neural) information processing, and illuminating the role of various system components on the generation of behavior.
Abstract: Biological organisms continuously select and sample information used by their neural structures for perception and action, and for creating coherent cognitive states guiding their autonomous behavior. Information processing, however, is not solely an internal function of the nervous system. Here we show, instead, how sensorimotor interaction and body morphology can induce statistical regularities and information structure in sensory inputs and within the neural control architecture, and how the flow of information between sensors, neural units, and effectors is actively shaped by the interaction with the environment. We analyze sensory and motor data collected from real and simulated robots and reveal the presence of information structure and directed information flow induced by dynamically coupled sensorimotor activity, including effects of motor outputs on sensory inputs. We find that information structure and information flow in sensorimotor networks (a) are spatially and temporally specific, (b) can be affected by learning, and (c) can be affected by changes in body morphology. Our results suggest a fundamental link between physical embeddedness and information, highlighting the effects of embodied interactions on internal (neural) information processing, and illuminating the role of various system components on the generation of behavior.
TL;DR: Structural clustering of the predicted models shows that GPCRs with similar structures tend to belong to a similar functional class even when their sequences are diverse, which demonstrates the usefulness and robustness of the in silico models for GPCR functional analysis.
Abstract: G protein–coupled receptors (GPCRs), encoded by about 5% of human genes, comprise the largest family of integral membrane proteins and act as cell surface receptors responsible for the transduction of endogenous signal into a cellular response. Although tertiary structural information is crucial for function annotation and drug design, there are few experimentally determined GPCR structures. To address this issue, we employ the recently developed threading assembly refinement (TASSER) method to generate structure predictions for all 907 putative GPCRs in the human genome. Unlike traditional homology modeling approaches, TASSER modeling does not require solved homologous template structures; moreover, it often refines the structures closer to native. These features are essential for the comprehensive modeling of all human GPCRs when close homologous templates are absent. Based on a benchmarked confidence score, approximately 820 predicted models should have the correct folds. The majority of GPCR models share the characteristic seven-transmembrane helix topology, but 45 ORFs are predicted to have different structures. This is due to GPCR fragments that are predominantly from extracellular or intracellular domains as well as database annotation errors. Our preliminary validation includes the automated modeling of bovine rhodopsin, the only solved GPCR in the Protein Data Bank. With homologous templates excluded, the final model built by TASSER has a global Cα root-mean-squared deviation from native of 4.6 Å, with a root-mean-squared deviation in the transmembrane helix region of 2.1 Å. Models of several representative GPCRs are compared with mutagenesis and affinity labeling data, and consistent agreement is demonstrated. Structure clustering of the predicted models shows that GPCRs with similar structures tend to belong to a similar functional class even when their sequences are diverse. 
These results demonstrate the usefulness and robustness of the in silico models for GPCR functional analysis. All predicted GPCR models are freely available for noncommercial users on our Web site (http://www.bioinformatics.buffalo.edu/GPCR).
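For readers unfamiliar with the Cα root-mean-squared deviation quoted above, the following is a minimal sketch of the quantity itself. It assumes the two coordinate sets are already optimally superposed and list the same atoms in the same order; the superposition step is deliberately omitted, and the function name `rmsd` is illustrative rather than part of TASSER.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-squared deviation between two equal-length lists of
    (x, y, z) coordinates, e.g. C-alpha atoms in angstroms.

    Assumes the structures are already superposed; no fitting is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

Restricting the input to the transmembrane helix residues would give a region-specific value analogous to the one reported for the helix region.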
TL;DR: This article analyzed the phylogenetic distribution of nearly 5,000 histidine protein kinases from 207 sequenced prokaryotic genomes and found that many genomes carry a large repertoire of recently evolved signaling genes, which may reflect selective pressure to adapt to new environmental conditions.
Abstract: Two-component systems including histidine protein kinases represent the primary signal transduction paradigm in prokaryotic organisms. To understand how these systems adapt to allow organisms to detect niche-specific signals, we analyzed the phylogenetic distribution of nearly 5,000 histidine protein kinases from 207 sequenced prokaryotic genomes. We found that many genomes carry a large repertoire of recently evolved signaling genes, which may reflect selective pressure to adapt to new environmental conditions. Both lineage-specific gene family expansion and horizontal gene transfer play major roles in the introduction of new histidine kinases into genomes; however, there are differences in how these two evolutionary forces act. Genes imported via horizontal transfer are more likely to retain their original functionality as inferred from a similar complement of signaling domains, while gene family expansion accompanied by domain shuffling appears to be a major source of novel genetic diversity. Family expansion is the dominant source of new histidine kinase genes in the genomes most enriched in signaling proteins, and detailed analysis reveals that divergence in domain structure and changes in expression patterns are hallmarks of recent expansions. Finally, while these two modes of gene acquisition are widespread across bacterial taxa, there are clear species-specific preferences for which mode is used.
TL;DR: This tutorial reviews computational techniques, termed “motif discovery,” to learn representations of regulatory motifs from sequence data, and discusses the main challenges associated with motif discovery in detail.
Abstract: Many functionally important regions of the genome can be recognized by searching for sequence patterns, or “motifs.” Aside from the genes themselves, examples include CpG islands, often present in promoter regions, and splice sites that denote intron/exon boundaries. Other motifs of great interest correspond to sites bound by regulatory proteins. Differential expression of genes in response to environmental and developmental cues depends on the action of these proteins, which are also known as transcription factors. Identifying the regulatory motifs bound by transcription factors can provide crucial insight into the mechanisms of transcriptional regulation. However, the search for these sites is challenging because a single regulatory protein will often recognize a variety of similar sequences. In this tutorial, we review computational techniques, termed “motif discovery,” to learn representations of regulatory motifs from sequence data. In Figure 1, we present an overview of the basic workflow in a motif discovery analysis and some practical strategies for successfully mining sequence data for biologically important regulatory motifs. In the remainder of this tutorial, we discuss the main challenges associated with motif discovery in detail, and we review recent developments for addressing these challenges.
Figure 1. Motif Discovery Workflow
TL;DR: It is shown that native conformations of proteins have statistically fewer knots than random compact loops, and that the local geometrical properties, such as the crumpled character of the conformations at a certain range of scales, are consistent with the rarity of knots.
Abstract: Like shoelaces, the backbones of proteins may get entangled and form knots. However, only a few knots in native proteins have been identified so far. To more quantitatively assess the rarity of knots in proteins, we make an explicit comparison between the knotting probabilities in native proteins and in random compact loops. We identify knots in proteins statistically, applying the mathematics of knot invariants to the loops obtained by complementing the protein backbone with an ensemble of random closures, and assigning a certain knot type to a given protein if and only if this knot dominates the closure statistics (which tells us that the knot is determined by the protein and not by a particular method of closure). We also examine the local fractal or geometrical properties of proteins via computational measurements of the end-to-end distance and the degree of interpenetration of its subchains. Although we did identify some rather complex knots, we show that native conformations of proteins have statistically fewer knots than random compact loops, and that the local geometrical properties, such as the crumpled character of the conformations at a certain range of scales, are consistent with the rarity of knots. From these, we may conclude that the known “protein universe” (set of native conformations) avoids knots. However, the precise reason for this is unknown—for instance, if knots were removed by evolution due to their unfavorable effect on protein folding or function or due to some other unidentified property of protein evolution.
TL;DR: PhyOP is a fast and robust approach to orthology prediction that will be applicable to whole genomes from multiple closely related species, and will be particularly useful in predicting orthology for mammalian genomes that have been incompletely sequenced, and for large families of rapidly duplicating genes.
Abstract: Accurate predictions of orthology and paralogy relationships are necessary to infer human molecular function from experiments in model organisms. Previous genome-scale approaches to predicting these relationships have been limited by their use of protein similarity and their failure to take into account multiple splicing events and gene prediction errors. We have developed PhyOP, a new phylogenetic orthology prediction pipeline based on synonymous rate estimates, which accurately predicts orthology and paralogy relationships for transcripts, genes, exons, or genomic segments between closely related genomes. We were able to identify orthologue relationships to human genes for 93% of all dog genes from Ensembl. Among 1:1 orthologues, the alignments covered a median of 97.4% of protein sequences, and 92% of orthologues shared essentially identical gene structures. PhyOP accurately recapitulated genomic maps of conserved synteny. Benchmarking against predictions from Ensembl and Inparanoid showed that PhyOP is more accurate, especially in its predictions of paralogy. Nearly half (46%) of PhyOP paralogy predictions are unique. Using PhyOP to investigate orthologues and paralogues in the human and dog genomes, we found that the human assembly contains 3-fold more gene duplications than the dog. Species-specific duplicate genes, or “in-paralogues,” are generally shorter and have fewer exons than 1:1 orthologues, which is consistent with selective constraints and mutation biases based on the sizes of duplicated genes. In-paralogues have experienced elevated amino acid and synonymous nucleotide substitution rates. Duplicates possess similar biological functions for either the dog or human lineages. Having accounted for 2,954 likely pseudogenes and gene fragments, and after separating 346 erroneously merged genes, we estimated that the human genome encodes a minimum of 19,700 protein-coding genes, similar to the gene count of nematode worms. 
PhyOP is a fast and robust approach to orthology prediction that will be applicable to whole genomes from multiple closely related species. PhyOP will be particularly useful in predicting orthology for mammalian genomes that have been incompletely sequenced, and for large families of rapidly duplicating genes.
TL;DR: A software program is developed that weights and integrates specific properties of the genes in a pathogen so that they may be ranked as drug targets, and it is shown that targets can be prioritized by using evolutionary programming to optimize the weights of each desired property.
Abstract: We have developed a software program that weights and integrates specific properties on the genes in a pathogen so that they may be ranked as drug targets. We applied this software to produce three prioritized drug target lists for Mycobacterium tuberculosis, the causative agent of tuberculosis, a disease for which a new drug is desperately needed. Each list is based on an individual criterion. The first list prioritizes metabolic drug targets by the uniqueness of their roles in the M. tuberculosis metabolome (“metabolic chokepoints”) and their similarity to known “druggable” protein classes (i.e., classes whose activity has previously been shown to be modulated by binding a small molecule). The second list prioritizes targets that would specifically impair M. tuberculosis, by weighting heavily those that are closely conserved within the Actinobacteria class but lack close homology to the host and gut flora. M. tuberculosis can survive asymptomatically in its host for many years by adapting to a dormant state referred to as “persistence.” The final list aims to prioritize potential targets involved in maintaining persistence in M. tuberculosis. The rankings of current, candidate, and proposed drug targets are highlighted with respect to these lists. Some features were found to be more accurate than others in prioritizing studied targets. It can also be shown that targets can be prioritized by using evolutionary programming to optimize the weights of each desired property. We demonstrate this approach in prioritizing persistence targets.
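The idea of weighting and integrating gene properties into a ranked list can be sketched as a weighted sum. This is a minimal illustration only: the scoring rule, the function name `rank_targets`, and the property names used in the usage note are assumptions, not the program described in the abstract, and the evolutionary optimization of the weights is not shown.

```python
def rank_targets(genes, weights):
    """Rank genes by a weighted sum of their property scores.

    genes:   dict mapping gene name -> dict of property name -> score
    weights: dict mapping property name -> weight (hypothetical values)
    Properties missing for a gene count as 0.0.
    Returns gene names ordered from highest to lowest combined score.
    """
    def combined_score(props):
        return sum(w * props.get(name, 0.0) for name, w in weights.items())

    return sorted(genes, key=lambda g: combined_score(genes[g]), reverse=True)
```

For example, with placeholder genes `geneA` (chokepoint 1.0, druggable 0.2), `geneB` (chokepoint 0.0, druggable 1.0), and `geneC` (0.5, 0.5), emphasizing the chokepoint property ranks `geneA` first, whereas emphasizing druggability ranks `geneB` first. Changing the weights is what produces a different prioritized list for each criterion.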
TL;DR: This dataset of chicken GPCRs is the largest curated dataset from a single gene family from a non-mammalian vertebrate, and has high proportions of orthologous pairs, although the percentage of amino acid identity varies.
Abstract: G protein-coupled receptors (GPCRs) are one of the largest families of proteins, and here we scan the recently sequenced chicken genome for GPCRs. We use a homology-based approach, utilizing comparisons with all human GPCRs, to detect and verify chicken GPCRs from translated genomic alignments and Genscan predictions. We present 557 manually curated sequences for GPCRs from the chicken genome, of which 455 were previously not annotated. More than 60% of the chicken Genscan gene predictions with a human ortholog needed curation, which drastically changed the average percentage identity between the human-chicken orthologous pairs (from 56.3% to 72.9%). Of the non-olfactory chicken GPCRs, 79% had a one-to-one orthologous relationship to a human GPCR. The Frizzled, Secretin, and subgroups of the Rhodopsin families have high proportions of orthologous pairs, although the percentage of amino acid identity varies. Other groups show large differences, such as the Adhesion family and GPCRs that bind exogenous ligands. The chicken has only three bitter Taste 2 receptors, and it also lacks an ortholog to human TAS1R2 (one of three GPCRs in the human genome in the Taste 1 receptor family [TAS1R]), implying that the chicken's ability and mode of detecting both bitter and sweet taste may differ from the human's. The chicken genome contains at least 229 olfactory receptors, and the majority of these (218) originate from a chicken-specific expansion. To our knowledge, this dataset of chicken GPCRs is the largest curated dataset from a single gene family from a non-mammalian vertebrate. Both the updated human GPCR dataset, as well as the chicken GPCR dataset, are available for download.
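The percentage identity between an orthologous pair can be computed from a pairwise alignment, as in the minimal sketch below. The abstract does not say how identity was defined, so the choice here (identical residues divided by the number of columns where neither sequence has a gap) is an assumption, as is the function name `percent_identity`.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues between two aligned sequences.

    Assumption: columns where either sequence has a gap are ignored,
    so the denominator is the number of gap-free aligned columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * matches / compared
```

Other conventions, such as dividing by the full alignment length or by the length of the shorter sequence, give different values, which is one reason curated gene models can shift reported identities substantially.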
TL;DR: The results indicate that even at the tissue and organism levels, proliferation and differentiation modules may correspond to two alternative states of the molecular network and may reflect a universal symbiotic relationship in a multicellular organism.
Abstract: The protein–protein interaction networks, or interactome networks, have been shown to have dynamic modular structures, yet the functional connections between and among the modules are less well understood. Here, using a new pipeline to integrate the interactome and the transcriptome, we identified a pair of transcriptionally anticorrelated modules, each consisting of hundreds of genes in multicellular interactome networks across different individuals and populations. The two modules are associated with cellular proliferation and differentiation, respectively. The proliferation module is conserved among eukaryotic organisms, whereas the differentiation module is specific to multicellular organisms. Upon differentiation of various tissues and cell lines from different organisms, the expression of the proliferation module is more uniformly suppressed, while the differentiation module is upregulated in a tissue- and species-specific manner. Our results indicate that even at the tissue and organism levels, proliferation and differentiation modules may correspond to two alternative states of the molecular network and may reflect a universal symbiotic relationship in a multicellular organism. Our analyses further predict that the proteins mediating the interactions between these modules may serve as modulators at the proliferation/differentiation switch.
TL;DR: This work simulated 133 peptide 8-mer fragments from six different proteins, sampled by replica-exchange molecular dynamics using Amber7 with a GB/SA (generalized-Born/solvent-accessible electrostatic approximation to water) implicit solvent, and found that 85 of the peptides have no preferred structure, while 48 of them converge to a preferred structure.
Abstract: Peptides often have conformational preferences. We simulated 133 peptide 8-mer fragments from six different proteins, sampled by replica-exchange molecular dynamics using Amber7 with a GB/SA (generalized-Born/solvent-accessible electrostatic approximation to water) implicit solvent. We found that 85 of the peptides have no preferred structure, while 48 of them converge to a preferred structure. In 85% of the converged cases (41 peptides), the structures found by the simulations bear some resemblance to their native structures, based on a coarse-grained backbone description. In particular, all seven of the β hairpins in the native structures contain a fragment in the turn that is highly structured. In the eight cases where the bioinformatics-based I-sites library picks out native-like structures, the present simulations are largely in agreement. Such physics-based modeling may be useful for identifying early nuclei in folding kinetics and for assisting in protein-structure prediction methods that utilize the assembly of peptide fragments.
TL;DR: Recent progress in analyzing the architecture and dynamics of cellular networks is surveyed, and mammalian cell signaling is used as a case study to discuss how computational analyses of networks shed light on specific biological processes.
Abstract: Understanding how the phenotypes and behaviors of cells are controlled is one of the major challenges in biological research. Traditionally, focus has been given to the characterization of individual genes/proteins or individual interactions during cellular events. However, many phenotypes and behaviors cannot be attributed to isolated components. Rather, they arise from characteristics of cellular networks, which represent connections between molecules in cells. We review the recent progress on analyzing the architecture and dynamics of cellular networks. We also summarize how computational modeling yields insight about cell signaling pathways.
The responses of cells to genetic perturbations or environmental cues are controlled by complex networks, including interconnected signaling pathways and cascades of transcriptional programs. The advance of genome technologies has made it possible to analyze cellular events on a global scale. A number of high-throughput techniques, such as DNA microarrays, chromatin immunoprecipitations, and yeast two-hybrid and mass-spectrometry analyses have been applied to cellular systems [1–10]. These experiments have provided first-draft catalogs of essential components, transcriptional regulatory diagrams, and molecular interaction maps for a number of organisms.
In addition to providing a candidate list of biomolecules involved in biological processes, the high-throughput technologies offer unprecedented opportunities to derive underlying principles of how complex cellular networks are built and how network architectures contribute to phenotypes. A series of important questions in this area have been addressed recently (Figure 1). For example, what are the characteristics of cellular network structures that distinguish them from randomly generated networks? Are the network structures relevant for biological functions? If so, are they evolutionarily conserved and how do they evolve? Are some topological patterns preferred at certain times or conditions? These questions are analogous to those asked in the field of genome sequence analysis, such as identifying biologically relevant sequence motifs and domains, investigating the evolutionary conservation between sequences from different species, and understanding temporal or spatial specificities of regulatory sites. In this paper, we survey recent progress on addressing these questions and use mammalian cell signaling as case studies to discuss how computational analyses of networks shed light on specific biological processes.
Figure 1. An Overview of Biological Network Analyses Based on “Omic” Data
TL;DR: The sequence identities of the redesigned proteins using the flexible-backbone design simulation are presented as the function of the backbone-RMSD from the reference protein.
Abstract: In PLoS Computational Biology, volume 2, issue 7: DOI: 10.1371/journal.pcbi.0020085
The references to the figure parts in the legend of figure 3 were incorrect. The correct caption is as follows:
Figure 3. The Sequence Identity for the Constructed Homologous Structures
Three different protein folds are studied: HPR domain (A,D), Rossmann fold (B,E), and SH3 domain (C,F). (A,B,C) The sequence identities of the redesigned proteins using the flexible-backbone design simulation are presented as the function of the backbone-RMSD from the reference protein. (D,E,F) The sequence identity of the core is also plotted against the overall sequence identity. The “twilight zone” of sequence identity (20%–30%) corresponds to regions between horizontal (A,B,C) or vertical (D,E,F) lines.
TL;DR: The present analyses suggest that a modified form of this cis regulatory code applies to only a subset of founder cell genes, those whose gene expression responds to specific genetic perturbations in a similar manner to the gene on which the original model was based.
Abstract: While combinatorial models of transcriptional regulation can be inferred for metazoan systems from a priori biological knowledge, validation requires extensive and time-consuming experimental work. Thus, there is a need for computational methods that can evaluate hypothesized cis regulatory codes before the difficult task of experimental verification is undertaken. We have developed a novel computational framework (termed "CodeFinder") that integrates transcription factor binding site and gene expression information to evaluate whether a hypothesized transcriptional regulatory model (TRM; i.e., a set of co-regulating transcription factors) is likely to target a given set of co-expressed genes. Our basic approach is to simultaneously predict cis regulatory modules (CRMs) associated with a given gene set and quantify the enrichment for combinatorial subsets of transcription factor binding site motifs comprising the hypothesized TRM within these predicted CRMs. As a model system, we have examined a TRM experimentally demonstrated to drive the expression of two genes in a sub-population of cells in the developing Drosophila mesoderm, the somatic muscle founder cells. This TRM was previously hypothesized to be a general mode of regulation for genes expressed in this cell population. In contrast, the present analyses suggest that a modified form of this cis regulatory code applies to only a subset of founder cell genes, those whose gene expression responds to specific genetic perturbations in a similar manner to the gene on which the original model was based. We have confirmed this hypothesis by experimentally discovering six (out of 12 tested) new CRMs driving expression in the embryonic mesoderm, four of which drive expression in founder cells.
TL;DR: Phylogenomic inference of protein (or gene) function attempts to address the question, “What function does this protein perform?” in an evolutionary context by using annotated subfamily groupings to infer function.
Abstract: Phylogenomic inference of protein (or gene) function attempts to address the question, “What function does this protein perform?” in an evolutionary context. As originally outlined by Jonathan Eisen [1–3], phylogenomic inference of protein function is a multistep process involving selection of homologs, multiple sequence alignment (MSA), and phylogenetic tree construction; overlaying annotations on the tree topology; discriminating between orthologs and paralogs; and—finally—inferring the function of a protein based on the orthologs identified by this process and the annotations retrieved. Figure 1 shows an example of using annotated subfamily groupings to infer function, in a manner similar to [1]. One of us, while at Celera Genomics, separately came up with a similar approach for the functional classification of the human genome [4], based on the automated identification of functional subfamilies using the SCI-PHY algorithm and the use of subfamily hidden Markov models (HMMs) to classify novel sequences [5,6]. Our experiences over the past several years in developing computational pipelines for automating phylogenomic inference at the genome scale [7]—and the challenges we have faced in this effort—motivate this paper.
Figure 1. Phylogenomic Analysis of Protein Function Using Subfamily Annotation
In practice, phylogenomic inference of gene function is not often used. Far from it. The majority of novel sequences are assigned a putative function through the use of annotation transfer from the top hits in a database search. In our analysis of over 300,000 proteins in the UniProt database, only 3% of proteins with informative annotations (i.e., those not labelled as “hypothetical” or “unknown”) had experimental support for their annotations; 97% were annotated using electronic evidence alone. These annotations are uploaded to GenBank, where they persist even if they are eventually determined to be in error.
The systematic errors associated with this annotation protocol have been pointed out by numerous investigators over the years [8–10]. The root causes of these errors are these:
Gene duplication. This enables protein superfamilies to innovate novel functions on the same structural template, so that the top database hit may have a function distinct from the query.
Domain shuffling. Domain fusion and fission events add an additional layer of complexity, as a query and database hit may share only a local region of homology and thus have entirely different molecular functions and structures.
Propagation of existing errors in database annotations. This is particularly pernicious, as existing annotation errors are seldom detected and, even if detected, are not necessarily corrected.
Evolutionary distance. Two proteins can share a common ancestor and domain structure, yet have very different functions simply because they occur in distantly related species.
Phylogenomic analysis, properly applied, avoids these errors and provides a mechanism for detecting existing database annotation errors [3,7]. Why then is phylogenomic inference not used more widely? We believe this is due to four reasons. First, the actual frequency of annotation error is not known, so the gravity of the situation is not recognized. Second, phylogenomic inference is a much more complicated endeavor than a simple database search and requires significantly more expertise and computing resources. It is therefore not easily applied at the genome scale. Third, millions of dollars and years of effort have been poured into developing computational annotation systems that depend on annotation transfer from top database hits, perhaps overlaid with domain prediction methods such as PFAM or the NCBI CDD [11,12]. Fourth, phylogenomic approaches to protein function prediction have arisen only in the last few years, while database search methods have been available for much longer. Revolutions do not normally take place overnight. These four reasons result in phylogenomic inference being applied on a one-off basis, for a few protein superfamilies here and there.
This may be about to change. A variety of software tools and algorithms enabling phylogenomic inference have been developed in recent years (see Table 1). Some of these methods have based annotation transfer on the identification of orthologs [13–15] or of functional subfamilies [6,16–21]. Other groups have used whole-tree analyses [22–24]. Still other groups employ expert knowledge to define functional subtypes and then develop statistical models to allow users to classify novel sequences [25,26]; these expert system-based approaches are unfortunately limited by the scarcity of experimental data for most protein families.
Table 1. Resources for Phylogenomic Analysis
It is worth examining the assumptions underlying these phylogenomic resources, and phylogenomic inference as a whole.
TL;DR: A biophysically and molecularly detailed computational model is constructed to study microenvironmental transport of two isoforms of VEGF in rat extensor digitorum longus skeletal muscle under in vivo conditions and results in a platform for the design and evaluation of therapeutic approaches.
Abstract: Members of the vascular endothelial growth factor (VEGF) family of proteins are critical regulators of angiogenesis. VEGF concentration gradients are important for activation and chemotactic guidance of capillary sprouting, but measurement of these gradients in vivo is not currently possible. We have constructed a biophysically and molecularly detailed computational model to study microenvironmental transport of two isoforms of VEGF in rat extensor digitorum longus skeletal muscle under in vivo conditions. Using parameters based on experimental measurements, the model includes: VEGF secretion from muscle fibers; binding to the extracellular matrix; binding to and activation of endothelial cell surface VEGF receptors; and internalization. For 2-D cross sections of tissue, we analyzed predicted VEGF distributions, gradients, and receptor binding. Significant VEGF gradients (up to 12% change in VEGF concentration over 10 μm) were predicted in resting skeletal muscle with uniform VEGF secretion, due to non-uniform capillary distribution. These relative VEGF gradients were not sensitive to extracellular matrix composition, or to the overall VEGF expression level, but were dependent on VEGF receptor density and affinity, and internalization rate parameters. VEGF upregulation in a subset of fibers increased VEGF gradients, simulating transplantation of proangiogenic myoblasts, a possible therapy for ischemic diseases. The number and relative position of overexpressing fibers determined the VEGF gradients and distribution of VEGF receptor activation. With total VEGF expression level in the tissue unchanged, concentrating overexpression into a small number of adjacent fibers can increase the number of capillaries activated. The VEGF concentration gradients predicted for resting muscle (average 3% VEGF/10 μm) are sufficient for cellular sensing; the tip cell of a vessel sprout is approximately 50 μm long.
The VEGF gradients also result in heterogeneity in the activation of blood vessel VEGF receptors. This first model of VEGF tissue transport and heterogeneity provides a platform for the design and evaluation of therapeutic approaches.
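The sensing argument in the abstract above can be checked with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that the relative gradient is linear over the length of a tip cell; the function name is hypothetical and is not taken from the paper's model.

```python
def relative_difference_across_cell(gradient_per_10um, cell_length_um):
    """Front-to-back relative concentration difference across one cell.

    Assumes a linear gradient (an illustrative simplification):
    gradient_per_10um is the fractional change per 10 micrometers.
    """
    return gradient_per_10um * cell_length_um / 10.0


# Values quoted in the abstract: an average gradient of 3% per 10 micrometers,
# a maximum of 12% per 10 micrometers, and a tip cell about 50 micrometers long.
average_case = relative_difference_across_cell(0.03, 50.0)
maximum_case = relative_difference_across_cell(0.12, 50.0)
print(f"average gradient: {average_case:.0%} difference across the tip cell")
print(f"maximum gradient: {maximum_case:.0%} difference across the tip cell")
```

Under this linear assumption, the average resting-muscle gradient corresponds to roughly a 15% concentration difference between the two ends of a tip cell.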
TL;DR: A full probabilistic model for fossil data that can be used to answer many different questions about the data, including seriation (finding the best ordering of the sites) and outlier detection is described.
Abstract: Given a collection of fossil sites with data about the taxa that occur in each site, the task in biochronology is to find good estimates for the ages or ordering of sites. We describe a full probabilistic model for fossil data. The parameters of the model are natural: the ordering of the sites, the origination and extinction times for each taxon, and the probabilities of different types of errors. We show that the posterior distributions of these parameters can be estimated reliably by using Markov chain Monte Carlo techniques. The posterior distributions of the model parameters can be used to answer many different questions about the data, including seriation (finding the best ordering of the sites) and outlier detection. We demonstrate the usefulness of the model and estimation method on synthetic data and on real data on large late Cenozoic mammals. As an example, for the sites with a large number of occurrences of common genera, our methods give orderings whose correlation with geochronologic ages is 0.95.
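To make the idea of seriation by Markov chain Monte Carlo concrete, here is a minimal sketch. It is not the probabilistic model described in the abstract: as a simplified stand-in, it scores an ordering by the number of gaps (sites inside a taxon's observed range where the taxon is missing) and explores orderings with Metropolis swaps. The function names, the gap-count score, and the `temperature` parameter are all assumptions made for illustration.

```python
import math
import random


def lazarus_count(order, occ):
    """Count gaps: positions inside a taxon's range (under `order`) where it is absent.

    occ[site][taxon] is 1 if the taxon was found at the site, else 0.
    """
    gaps = 0
    for taxon in range(len(occ[0])):
        present = [pos for pos, site in enumerate(order) if occ[site][taxon]]
        if present:
            span = present[-1] - present[0] + 1
            gaps += span - len(present)
    return gaps


def metropolis_seriation(occ, steps=2000, temperature=0.5, seed=0):
    """Search site orderings with Metropolis swap moves; return the best one seen."""
    rng = random.Random(seed)
    order = list(range(len(occ)))
    cost = lazarus_count(order, occ)
    best_order, best_cost = order[:], cost
    for _ in range(steps):
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]  # propose swapping two sites
        new_cost = lazarus_count(order, occ)
        accept = new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temperature)
        if accept:
            cost = new_cost
            if cost < best_cost:
                best_order, best_cost = order[:], cost
        else:
            order[i], order[j] = order[j], order[i]  # reject: undo the swap
    return best_order, best_cost
```

A full treatment along the lines of the abstract would sample origination and extinction times and error probabilities jointly with the ordering, and would summarize the posterior rather than report a single best ordering.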
TL;DR: Light is shed on the daily challenges faced by annotators at the RCSB and the reader is given a glimpse at the juggling act that defines the job of a biocurator.
Abstract: Like most scientists, annotators at the Research Collaboratory for Structural Bioinformatics (RCSB) (http://www.pdb.org) dread the immortal cocktail party question “So, what do you do?” Unlike for some jobs, however, their answer can leave other scientists at the party with no response. Even within the structural biology community, our job is not well-understood. Throughout this perspective, we will shed light on the daily challenges faced by annotators at the RCSB and give the reader a glimpse at the juggling act that defines the job of a biocurator.
TL;DR: Following an earlier Editorial, “Ten Simple Rules for Getting Published”, this piece presents ten simple rules for getting grants, which the authors believe are generic, transcending funding institutions and national boundaries.
Abstract: This piece follows an earlier Editorial, “Ten Simple Rules for Getting Published” [1], which has generated significant interest, is well read, and continues to generate a variety of positive comments. That Editorial was aimed at students in the early stages of a life of scientific paper writing. This interest has prompted us to try to help scientists in making the next academic career step: becoming a young principal investigator. Leo Chalupa has joined us in putting together ten simple rules for getting grants, based on our many collective years of writing both successful and unsuccessful grants. While our grant writing efforts have been aimed mainly at United States government funding agencies, we believe the rules presented here are generic, transcending funding institutions and national boundaries. At the present time, US funding is frequently below 10% for a given grant program. Today, more than ever, we need all the help we can get in writing successful grant proposals. We hope you find these rules useful in reaching your research career goals.
TL;DR: In a simulation model of an evolving population of asexually replicating RNA molecules, initially deleterious mutations accumulated at rates nearly equal to that of initially beneficial mutations, without impeding evolutionary progress.
Abstract: Deleterious mutations are considered a major impediment to adaptation, and there are straightforward expectations for the rate at which they accumulate as a function of population size and mutation rate. In a simulation model of an evolving population of asexually replicating RNA molecules, initially deleterious mutations accumulated at rates nearly equal to that of initially beneficial mutations, without impeding evolutionary progress. As the mutation rate was increased within a moderate range, deleterious mutation accumulation and mean fitness improvement both increased. The fixation rates were higher than predicted by many population-genetic models. This seemingly paradoxical result was resolved in part by the observation that, during the time to fixation, the selection coefficient (s) of initially deleterious mutations reversed to confer a selective advantage. Significantly, more than half of the fixations of initially deleterious mutations involved fitness reversals. These fitness reversals had a substantial effect on the total fitness of the genome and thus contributed to its success in the population. Despite the relative importance of fitness reversals, however, the probabilities of fixation for both initially beneficial and initially deleterious mutations were exceedingly small (on the order of 10⁻⁵ of all mutations).
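For context on the "straightforward expectations" mentioned above, one commonly used baseline is Kimura's diffusion approximation for the fixation probability of a single new mutant. The sketch below assumes a haploid Wright-Fisher population of constant size and a selection coefficient that stays fixed over time; it is an illustrative textbook formula, not the simulation model of the paper, and the function name is hypothetical.

```python
import math


def fixation_probability(s, population_size):
    """Kimura's approximation for a single new mutant in a haploid population.

    u = (1 - exp(-2s)) / (1 - exp(-2Ns)), with the neutral limit u = 1/N.
    Assumes a constant selection coefficient s (illustrative simplification).
    Very large values of -N*s would overflow and are not handled here.
    """
    if abs(s) < 1e-12:
        return 1.0 / population_size
    return math.expm1(-2.0 * s) / math.expm1(-2.0 * population_size * s)


for s in (0.01, 0.0, -0.01):
    print(f"s = {s:+.2f}: u = {fixation_probability(s, 1000):.3g}")
```

Because this baseline holds s constant, it has no way to represent the fitness reversals reported in the abstract, which is one reason observed fixation rates for initially deleterious mutations can exceed such predictions.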
TL;DR: It is demonstrated that genes containing intragenic S/MARs are prone to pronounced spatiotemporal expression regulation and this characteristic is found to be even more pronounced for transcription factor genes.
Abstract: Scaffold/matrix attachment regions (S/MARs) are essential for structural organization of the chromatin within the nucleus and serve as anchors of chromatin loop domains. A significant fraction of genes in Arabidopsis thaliana contains intragenic S/MAR elements and a significant correlation of S/MAR presence and overall expression strength has been demonstrated. In this study, we undertook a genome scale analysis of expression level and spatiotemporal expression differences in correlation with the presence or absence of genic S/MAR elements. We demonstrate that genes containing intragenic S/MARs are prone to pronounced spatiotemporal expression regulation. This characteristic is found to be even more pronounced for transcription factor genes. Our observations illustrate the importance of S/MARs in transcriptional regulation and the role of chromatin structural characteristics for gene regulation. Our findings open new perspectives for the understanding of tissue- and organ-specific regulation of gene expression.
TL;DR: A novel meta-analysis methodology applied to multiple gene expression datasets from three mouse embryonic stem cell lines obtained at specific time points during the course of their differentiation into various lineages identifies a small set of genes whose expression is useful for identifying changes in stem cell frequencies in cultures of mouse ESC.
Abstract: Stem cell differentiation involves critical changes in gene expression. Identification of these should provide endpoints useful for optimizing stem cell propagation as well as potential clues about mechanisms governing stem cell maintenance. Here we describe the results of a new meta-analysis methodology applied to multiple gene expression datasets from three mouse embryonic stem cell (ESC) lines obtained at specific time points during the course of their differentiation into various lineages. We developed methods to identify genes with expression changes that correlated with the altered frequency of functionally defined, undifferentiated ESC in culture. In each dataset, we computed a novel statistical confidence measure for every gene which captured the certainty that a particular gene exhibited an expression pattern of interest within that dataset. This permitted a joint analysis of the datasets, despite the different experimental designs. Using a ranking scheme that favored genes exhibiting patterns of interest, we focused on the top 88 genes whose expression was consistently changed when ESC were induced to differentiate. Seven of these (103728_at, 8430410A17Rik, Klf2, Nr0b1, Sox2, Tcl1, and Zfp42) showed a rapid decrease in expression concurrent with a decrease in frequency of undifferentiated cells and remained predictive when evaluated in additional maintenance and differentiating protocols. Through a novel meta-analysis, this study identifies a small set of genes whose expression is useful for identifying changes in stem cell frequencies in cultures of mouse ESC. The methods and findings have broader applicability to understanding the regulation of self-renewal of other stem cell types.
TL;DR: The present emphasis on expanding computational resources, capable of managing and analyzing complex biological data, presents an ever-growing demand for biocurators capable of interpreting the increasingly complex scientific literature and extracting relevant data in an efficient, yet consistent, manner.
Abstract: From Impressionism and Pop Art to phosphorylation sites and interacting atom pairs, the realm of curation has been expanded. The recent growth of bioinformatics, driven by exponentially growing data, advanced computing techniques, and increased funding from private and governmental organizations, has created the need for novel strategies to adequately capture, store, and analyze the multitude of data present in the scientific literature. To meet this challenge, the number and scope of scientific databases has soared in recent years, creating a new profession, the biocurator. Indeed, the present emphasis on expanding computational resources, capable of managing and analyzing complex biological data, presents an ever-growing demand for biocurators capable of interpreting the increasingly complex scientific literature and extracting relevant data in an efficient, yet consistent, manner.
TL;DR: This October issue pays homage to biocurators through two Perspectives written by biocurators working with different types of biological data: one from the Protein Data Bank (PDB) and one from the Immune Epitope Database and Analysis Resource (IEDB), a new resource detailing known epitopes and their immunological outcomes.
Abstract: Computational biology is a discipline built upon data (mostly free access), found in biological databases, and knowledge (mostly not free access), found in the literature. So important are these online sources of data that the discipline, and indeed this Journal, simply would not exist without them. Whether we are using the data in “browse mode”—doing a PubMed search, looking up a reaction in an enzymatic pathway, or in “compute mode”—analysis of a large dataset, we usually visit Web sites and download information without a second thought. Since our discipline is so dependent on the availability, extent, and quality of biological data, it is worth taking some time to think about the processes of data accessibility, annotation, and validation. These processes depend very much on biocurators—trained staff who ensure the information you are receiving is as complete and accurate as possible.
Biocurators can be considered the museum catalogers of the Internet age: they turn inert and unidentifiable objects (now virtual) into a powerful exhibit from which we can all marvel and learn. That would be a decent enough contribution to the world of science, but the task of the biocurator is even more extensive. Computational biologists do not expect to merely walk through the door, cast a casual eye over the exhibit, and exit wiser (although we frequently do); we also want to add our own data to the exhibit, plus pick and choose pieces of it to take home and create new exhibits of our own. Oh, and we would like to do all these things with minimal effort, please. We can be a pretty exacting bunch of customers, and it takes skills over and above a knowledge of biology to juggle the different needs of data submitters, information seekers, and power players.
In this October issue, we pay homage to these special individuals who are dedicated to making our research endeavors a success. We do so through two Perspectives written by biocurators working with different types of biological data. The first is by biocurators from the Research Collaboratory for Structural Bioinformatics Protein Data Bank (PDB), a well-established biological resource of macromolecular structure data used by more than 10,000 individual scientists per day, and the second by biocurators of the Immune Epitope Database and Analysis Resource (IEDB), a new resource detailing known epitopes and their immunological outcomes. The PDB validates the quality and consistency of primary data submitted by structural biologists as a prerequisite to publication. The IEDB curates the published literature, extracting relevant facts about the epitopes discussed therein. As you read these two Perspectives, similarities and differences concerning the approaches will emerge. But more than anything, we hope you are struck by the level of professionalism and dedication that goes into helping to make the quality research articles that you read in this Journal and elsewhere.
These two articles are told from the perspective of the biocurators themselves. They offer only two perspectives; we certainly encourage you to send eLetters with your own perspective on biocuration, either as a curator of a different type of information, or as a person whose information has been curated, or as a consumer of information that has been curated. If you are not moved to comment, at least give a thought to the person upon whose efforts your research may well depend.
TL;DR: A computational study of cell sorting caused by a combination of cell adhesion and chemotaxis, in which all cells respond equally to the chemotactic signal, is presented, and the occurrence of “absolute negative mobility” is demonstrated.
Abstract: Differential movement of individual cells within tissues is an important yet poorly understood process in biological development. Here we present a computational study of cell sorting caused by a combination of cell adhesion and chemotaxis, where we assume that all cells respond equally to the chemotactic signal. To capture in our model mesoscopic properties of biological cells, such as their size and deformability, we use the Cellular Potts Model, a multiscale, cell-based Monte Carlo model. We demonstrate a rich array of cell-sorting phenomena, which depend on a combination of mesoscopic cell properties and tissue level constraints. Under the conditions studied, cell sorting is a fast process, which scales linearly with tissue size. We demonstrate the occurrence of “absolute negative mobility”, which means that cells may move in the direction opposite to the applied force (here chemotaxis). Moreover, during the sorting, cells may even reverse the direction of motion. Another interesting phenomenon is “minority sorting”, where the direction of movement does not depend on cell type, but on the frequency of the cell type in the tissue. A special case is the cAMP-wave-driven chemotaxis of Dictyostelium cells, which generates pressure waves that guide the sorting. The mechanisms we describe can easily be overlooked in studies of differential cell movement, hence certain experimental observations may be misinterpreted.
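As a rough illustration of how a Cellular Potts Model update works, the sketch below implements one Metropolis copy attempt on a small 2-D grid with three ingredients: type-dependent adhesion energies, a volume constraint, and a chemotactic bias. All names and parameters (`J`, `lam`, `mu`, `chem`, `temperature`) are assumptions chosen for this example, the energy is recomputed in full for clarity rather than speed, and none of it is taken from the study's actual implementation.

```python
import math
import random


def cpm_energy(grid, cell_type, J, target_volume, lam):
    """Adhesion energy over unlike neighbor pairs plus a quadratic volume constraint.

    grid[r][c] holds a cell id (0 = medium); cell_type maps id -> type index;
    J[a][b] is the (symmetric) adhesion energy between types a and b.
    """
    rows, cols = len(grid), len(grid[0])
    energy, volume = 0.0, {}
    for r in range(rows):
        for c in range(cols):
            s = grid[r][c]
            volume[s] = volume.get(s, 0) + 1
            for dr, dc in ((0, 1), (1, 0)):  # count each neighbor pair once
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols and grid[rr][cc] != s:
                    energy += J[cell_type[s]][cell_type[grid[rr][cc]]]
    for s, v in volume.items():
        if s != 0:  # the medium has no volume constraint
            energy += lam * (v - target_volume) ** 2
    return energy


def copy_attempt(grid, cell_type, J, target_volume, lam, mu, chem, temperature, rng):
    """Try to copy a random neighbor's cell id into a random site (Metropolis rule)."""
    rows, cols = len(grid), len(grid[0])
    r, c = rng.randrange(rows), rng.randrange(cols)
    dr, dc = rng.choice(((0, 1), (0, -1), (1, 0), (-1, 0)))
    rr, cc = r + dr, c + dc
    if not (0 <= rr < rows and 0 <= cc < cols) or grid[rr][cc] == grid[r][c]:
        return False
    old = grid[r][c]
    before = cpm_energy(grid, cell_type, J, target_volume, lam)
    grid[r][c] = grid[rr][cc]
    delta = cpm_energy(grid, cell_type, J, target_volume, lam) - before
    # Chemotactic bias: copies toward higher concentration lower the effective energy.
    delta -= mu * (chem[r][c] - chem[rr][cc])
    if delta <= 0 or rng.random() < math.exp(-delta / temperature):
        return True
    grid[r][c] = old  # reject: restore the previous cell id
    return False
```

Applying the same `mu` to every copy attempt, regardless of cell type, mirrors the abstract's assumption that all cells respond equally to the chemotactic signal; sorting then has to arise from the interplay with adhesion and volume constraints.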
TL;DR: The URL provided for the GPCR model database in the published article is no longer active; the database is now located at http://cssb.biology.gatech.edu/skolnick/files/gpcr/gpcr.html.
Abstract: Correction to “Structure Modeling of All Identified G Protein–Coupled Receptors in the Human Genome” by Yang Zhang, Mark E. DeVries, and Jeffrey Skolnick (DOI: 10.1371/journal.pcbi.0020013), published in PLoS Computational Biology, volume 2, issue 2: The URL provided for the GPCR model database in the published article is no longer active. The database is now located at http://cssb.biology.gatech.edu/skolnick/files/gpcr/gpcr.html.
TL;DR: The current state of channel modeling is reviewed and the developments needed for its conclusions to be integrated into whole-cell modeling are explored.
Abstract: Ion channels are the building blocks of the information processing capability of neurons: any realistic computational model of a neuron must include reliable and effective ion channel components. Sophisticated statistical and computational tools have been developed to study the ion channel structure–function relationship, but this work is rarely incorporated into the models used for single neurons or small networks. The disjunction is partly a matter of convention. Structure–function studies typically use a single Markov model for the whole channel whereas until recently whole-cell modeling software has focused on serial, independent, two-state subunits that can be represented by the Hodgkin–Huxley equations. More fundamentally, there is a difference in purpose that prevents models being easily reused. Biophysical models are typically developed to study one particular aspect of channel gating in detail, whereas neural modelers require broad coverage of the entire range of channel behavior that is often best achieved with approximate representations that omit structural features that cannot be adequately constrained. To bridge the gap so that more recent channel data can be used in neural models requires new computational infrastructure for bringing together diverse sources of data to arrive at best-fit models for whole-cell modeling. We review the current state of channel modeling and explore the developments needed for its conclusions to be integrated into whole-cell modeling.
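The relationship between the two conventions mentioned above can be illustrated in the simplest case. The sketch below assumes a channel made of n identical, independent two-state subunits with fixed opening rate `alpha` and closing rate `beta` (hypothetical values; voltage dependence is omitted). It compares the Hodgkin-Huxley style steady state with the steady state of the equivalent (n+1)-state Markov chain, whose states count the activated subunits.

```python
def hh_steady_state(alpha, beta):
    """Steady-state activation of one two-state subunit: alpha / (alpha + beta)."""
    return alpha / (alpha + beta)


def markov_steady_state(alpha, beta, n_subunits):
    """Steady-state occupancy of the chain whose state k = number of active subunits.

    Transitions: k -> k+1 at rate (n-k)*alpha, k+1 -> k at rate (k+1)*beta.
    Detailed balance gives p[k+1] / p[k] = (n-k)*alpha / ((k+1)*beta).
    """
    weights = [1.0]
    for k in range(n_subunits):
        weights.append(weights[-1] * (n_subunits - k) * alpha / ((k + 1) * beta))
    total = sum(weights)
    return [w / total for w in weights]


alpha, beta, n = 2.0, 3.0, 4  # illustrative rates (per ms) and subunit count
m_inf = hh_steady_state(alpha, beta)
p_open_markov = markov_steady_state(alpha, beta, n)[-1]
print(f"Hodgkin-Huxley m_inf^n: {m_inf ** n:.4f}")
print(f"Markov fully-open state: {p_open_markov:.4f}")
```

The two agree here only because the subunits are assumed independent and identical. A general single-channel Markov model can have arbitrary states and rates, which is the case that does not reduce to the Hodgkin-Huxley form.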
TL;DR: This work proposes a strategy for the very first step in protein nanotube design: map the candidate building blocks onto a planar sheet and wrap the sheet around a cylinder with the target dimensions.
Abstract: Here our goal is to carry out nanotube design using naturally occurring protein building blocks. Inspection of the protein structural database reveals the richness of the conformations of proteins, their parts, and their chemistry. Given target functional protein nanotube geometry, our strategy involves scanning a library of candidate building blocks, combinatorially assembling them into the shape and testing its stability. Since self-assembly takes place on time scales not affordable for computations, here we propose a strategy for the very first step in protein nanotube design: we map the candidate building blocks onto a planar sheet and wrap the sheet around a cylinder with the target dimensions. We provide examples of three nanotubes, two peptide and one protein, in atomistic model detail for which there are experimental data. The nanotube models can be used to verify a nanostructure observed by low-resolution experiments, and to study the mechanism of tube formation.
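The geometric step described above, wrapping a planar sheet around a cylinder, can be sketched as a simple coordinate mapping. The example below assumes the sheet's x-axis runs around the circumference and its y-axis runs along the tube axis; the function name and the choice of axes are illustrative assumptions, not details from the paper.

```python
import math


def wrap_sheet_on_cylinder(points_2d, radius):
    """Map planar (x, y) points onto a cylinder of the given radius.

    The arc length x becomes the angle x / radius; y becomes the height along
    the cylinder axis. A sheet of width 2*pi*radius closes exactly on itself.
    """
    wrapped = []
    for x, y in points_2d:
        theta = x / radius
        wrapped.append((radius * math.cos(theta), radius * math.sin(theta), y))
    return wrapped


# Four building-block positions evenly spaced across a sheet that is one
# circumference wide, for a hypothetical target radius of 10 (arbitrary units).
radius = 10.0
width = 2.0 * math.pi * radius
sheet = [(i * width / 4.0, 0.0) for i in range(4)]
for point in wrap_sheet_on_cylinder(sheet, radius):
    print(tuple(round(coordinate, 3) for coordinate in point))
```

In a real design the sheet width (number of building blocks per turn times their spacing) would have to match the target circumference, and the wrapped model would still need its stability tested, as the abstract notes.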
TL;DR: Here are ten simple rules to help you make the best decisions on a research project and the laboratory in which to carry it out.
Abstract: You are a PhD candidate and your thesis defense is already in sight. You have decided you would like to continue with a postdoctoral position rather than moving into industry as the next step in your career (that decision should be the subject of another “Ten Simple Rules”). Further, you already have ideas for the type of research you wish to pursue and perhaps some ideas for specific projects. Here are ten simple rules to help you make the best decisions on a research project and the laboratory in which to carry it out.
TL;DR: Profiling normalized abundances of basic regulatory patterns of individual THubs in the yeast Saccharomyces cerevisiae transcriptional regulation network under five different cellular states and environmental conditions reveals a switching of regulatory pattern preferences, suggesting that a change in conditions elicits not only a change in response by the regulatory network, but also a change in the mechanisms by which the response is mediated.
Abstract: Transcription factors with a large number of target genes, termed transcription hubs (THubs), are usually crucial components of the regulatory system of a cell, and the different patterns through which they transfer the transcriptional signal to downstream cascades are of great interest. By profiling normalized abundances (AN) of basic regulatory patterns of individual THubs in the yeast Saccharomyces cerevisiae transcriptional regulation network under five different cellular states and environmental conditions, we have investigated their preferences for different basic regulatory patterns. Subgraph-normalized abundances downstream of individual THubs often differ significantly from those of the network as a whole, and conversely, certain over-represented subgraphs are not preferred by any THub. The THub preferences changed substantially when the cellular or environmental conditions changed. This switching of regulatory pattern preferences suggests that a change in conditions elicits not only a change in response by the regulatory network, but also a change in the mechanisms by which the response is mediated. The THub subgraph preference profile thus provides a novel tool for description of the structure and organization between the large-scale exponents and local regulatory patterns.