TL;DR: Several case studies of Cytoscape plug-ins are surveyed, including a search for interaction pathways correlating with changes in gene expression, a study of protein complexes involved in cellular recovery to DNA damage, inference of a combined physical/functional interaction network for Halobacterium, and an interface to detailed stochastic/kinetic gene regulatory models.
Abstract: Cytoscape is an open source software project for integrating biomolecular interaction networks with high-throughput expression data and other molecular states into a unified conceptual framework. Although applicable to any system of molecular components and interactions, Cytoscape is most powerful when used in conjunction with large databases of protein-protein, protein-DNA, and genetic interactions that are increasingly available for humans and model organisms. Cytoscape's software Core provides basic functionality to layout and query the network; to visually integrate the network with expression profiles, phenotypes, and other molecular states; and to link the network to databases of functional annotations. The Core is extensible through a straightforward plug-in architecture, allowing rapid development of additional computational analyses and features. Several case studies of Cytoscape plug-ins are surveyed, including a search for interaction pathways correlating with changes in gene expression, a study of protein complexes involved in cellular recovery to DNA damage, inference of a combined physical/functional interaction network for Halobacterium, and an interface to detailed stochastic/kinetic gene regulatory models.
TL;DR: OrthoMCL provides a scalable method for constructing orthologous groups across multiple eukaryotic taxa, using a Markov Cluster algorithm to group (putative) orthologs and paralogs.
Abstract: The identification of orthologous groups is useful for genome annotation, studies on gene/protein evolution, comparative genomics, and the identification of taxonomically restricted sequences. Methods successfully exploited for prokaryotic genome analysis have proved difficult to apply to eukaryotes, however, as larger genomes may contain multiple paralogous genes, and sequence information is often incomplete. OrthoMCL provides a scalable method for constructing orthologous groups across multiple eukaryotic taxa, using a Markov Cluster algorithm to group (putative) orthologs and paralogs. This method performs similarly to the INPARANOID algorithm when applied to two genomes, but can be extended to cluster orthologs from multiple species. OrthoMCL clusters are coherent with groups identified by EGO, but improved recognition of "recent" paralogs permits overlapping EGO groups representing the same gene to be merged. Comparison with previously assigned EC annotations suggests a high degree of reliability, implying utility for automated eukaryotic genome annotation. OrthoMCL has been applied to the proteome data set from seven publicly available genomes (human, fly, worm, yeast, Arabidopsis, the malaria parasite Plasmodium falciparum, and Escherichia coli). A Web interface allows queries based on individual genes or user-defined phylogenetic patterns (http://www.cbil.upenn.edu/gene-family). Analysis of clusters incorporating P. falciparum genes identifies numerous enzymes that were incompletely annotated in first-pass annotation of the parasite genome.
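The Markov Cluster step mentioned above can be illustrated with a minimal sketch. It is not OrthoMCL's implementation: the similarity matrix, the `inflation` value, the fixed iteration count, and the function names are all assumptions chosen for illustration. The sketch alternates expansion (matrix squaring) with inflation (elementwise power followed by column normalization) and then reads clusters from the non-zero rows.

```python
def normalize_columns(m):
    """Scale each column of square matrix m in place so that it sums to 1."""
    n = len(m)
    for j in range(n):
        col_sum = sum(m[i][j] for i in range(n))
        for i in range(n):
            m[i][j] /= col_sum
    return m


def mcl(similarity, inflation=2.0, iterations=50):
    """Cluster nodes of a weighted graph with a basic Markov Cluster loop.

    similarity is a symmetric list-of-lists of non-negative edge weights
    (hypothetical similarity scores). Returns a sorted list of clusters.
    """
    n = len(similarity)
    # Add self-loops so every column has a non-zero sum.
    m = [[float(similarity[i][j]) + (1.0 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    normalize_columns(m)
    for _ in range(iterations):
        # Expansion: square the matrix (two-step random-walk flow).
        m = [[sum(m[i][k] * m[k][j] for k in range(n))
              for j in range(n)] for i in range(n)]
        # Inflation: strengthen strong flows, weaken weak ones.
        m = [[v ** inflation for v in row] for row in m]
        normalize_columns(m)
    # Each row with surviving flow lists the members of one cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, v in enumerate(row) if v > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Toy graph: a triangle (0, 1, 2), a pair (3, 4), and an isolated node 5.
toy = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
print(mcl(toy))
```

A fixed iteration count keeps the sketch short; a practical version would stop when the matrix no longer changes.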
TL;DR: The PANTHER/X ontology is used to give a high-level representation of gene function across the human and mouse genomes, and the family HMMs are used to rank missense single nucleotide polymorphisms (SNPs) according to their likelihood of affecting protein function.
Abstract: In the genomic era, one of the fundamental goals is to characterize the function of proteins on a large scale. We describe a method, PANTHER, for relating protein sequence relationships to function relationships in a robust and accurate way. PANTHER is composed of two main components: the PANTHER library (PANTHER/LIB) and the PANTHER index (PANTHER/X). PANTHER/LIB is a collection of "books," each representing a protein family as a multiple sequence alignment, a Hidden Markov Model (HMM), and a family tree. Functional divergence within the family is represented by dividing the tree into subtrees based on shared function, and by subtree HMMs. PANTHER/X is an abbreviated ontology for summarizing and navigating molecular functions and biological processes associated with the families and subfamilies. We apply PANTHER to three areas of active research. First, we report the size and sequence diversity of the families and subfamilies, characterizing the relationship between sequence divergence and functional divergence across a wide range of protein families. Second, we use the PANTHER/X ontology to give a high-level representation of gene function across the human and mouse genomes. Third, we use the family HMMs to rank missense single nucleotide polymorphisms (SNPs), on a database-wide scale, according to their likelihood of affecting protein function.
TL;DR: This work describes BLASTZ, an independent implementation of the Gapped BLAST algorithm specifically designed for aligning two long genomic sequences, and its modifications, the hardware environment on which it is run, and several empirical studies to validate its results.
Abstract: The Mouse Genome Analysis Consortium aligned the human and mouse genome sequences for a variety of purposes, using alignment programs that suited the various needs. For investigating issues regarding genome evolution, a particularly sensitive method was needed to permit alignment of a large proportion of the neutrally evolving regions. We selected a program called BLASTZ, an independent implementation of the Gapped BLAST algorithm specifically designed for aligning two long genomic sequences. BLASTZ was subsequently modified, both to attain efficiency adequate for aligning entire mammalian genomes and to increase its sensitivity. This work describes BLASTZ, its modifications, the hardware environment on which we run it, and several empirical studies to validate its results.
TL;DR: Both LAGAN and Multi-LAGAN compare favorably with other leading alignment methods in correctly aligning protein-coding exons, especially between distant homologs such as human and chicken, or human and fugu.
Abstract: Comparing genomic sequences across related species is a fruitful source of biological insight, because functional elements such as exons tend to exhibit significant sequence similarity, whereas regions that are not functional tend to be less conserved. The first step in comparing genomic sequences is to align them—that is, to map the letters of one sequence to those of the others. There are several categories of alignments: local alignments that identify local similarities between regions of each sequence, and global alignments that find a monotonically increasing map between the letters of each sequence; pairwise alignments that compare two sequences, and multiple alignments that compare several sequences.
Local pairwise alignment methods such as Smith-Waterman (1981), BLAST (Altschul et al. 1990, 1997), BLASTZ (Schwartz et al. 2000), SSAHA (Ning et al. 2001), and BLAT (Kent 2002) are able to pinpoint locations of rearrangements between two sequences, and are suitable for aligning draft sequences or individual reads. Global alignments are important because they reveal the shared order of biological features in the compared species, and produce a more accurate alignment at the base-pair level when the features are in the same order. The best-known global alignment algorithm is Needleman-Wunsch (1970), which requires time proportional to the product of the lengths of the aligned sequences. Unfortunately this algorithm is too inefficient for comparing long genomic sequences. Faster methods have been developed recently: DIALIGN (Morgenstern et al. 1998, Brudno and Morgenstern 2002), MUMmer (Delcher et al. 1999, 2002), GLASS (Batzoglou et al. 2000), WABA (Kent and Zahler 2000), and AVID (Bray et al. 2003). Most of these methods have proven effective in aligning genomic sequences from two closely related organisms, such as human and mouse or Caenorhabditis elegans and C. briggsae, but have not been tested in alignments between distant relatives such as human and fugu.
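To make the cost of Needleman-Wunsch concrete, the following minimal sketch computes only the optimal global alignment score. The scoring values (`match`, `mismatch`, `gap`) and the function name are assumptions chosen for illustration. The nested loops visit every pair of positions once, which is why the running time is proportional to the product of the two sequence lengths.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of strings a and b.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic-programming table at a time.
    """
    # Row 0: aligning the empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against the empty prefix costs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("ACGT", "AGT"))
```

For two sequences of one megabase each, the inner statement would run about 10^12 times, which illustrates why full dynamic programming is impractical for long genomic sequences.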
Multiple alignments, a natural extension of two-sequence comparisons, are a powerful way to study biological sequences. Even weak similarity across several sequences usually reveals an important conserved biological feature (Dubchak et al. 2000; Gottgens et al. 2002). Moreover, multiple alignments enable the computation of local rates of evolution, giving a quantitative measure of the strength of evolutionary constraints and the functional importance of local regions (Simon et al. 2002). Multiple alignments are considerably more difficult to compute than are pairwise alignments: the running time scales as the product of the lengths of all the sequences. Formally, the problem is NP-complete (Wang and Jiang 1994; Bonizzoni and Vedova 2001). For this reason heuristic approaches are usually applied, of which the most widely used is progressive alignment, which constructs a multiple alignment by successive applications of a pairwise alignment algorithm. The best-known system based on progressive alignment is perhaps CLUSTALW (Thompson et al. 1994). Some other systems include MULTALIGN (Barton and Sternberg 1987), MULTAL (Taylor 1988), YAMA (Hardison et al. 1993, 1994), and PRRP (Gotoh 1996). DIALIGN (Morgenstern 1999) does not use progressive alignment; instead it uses another heuristic approach to chain local conserved blocks between several sequences into a multiple alignment. These systems can effectively align proteins and relatively short genomic regions, but are not efficient enough to align entire genomes. MGA (Hohl et al. 2002) is a rapid multiple aligner suitable for comparing very close homologs, such as different strains of a bacterium, but is not designed to align distant homologs.
Here we describe novel systems for pairwise and multiple alignment of genomic sequences: LAGAN (Limited Area Global Alignment of Nucleotides), an efficient and reliable pairwise aligner that is suitable for genomic comparison of distantly related organisms, and MLAGAN (Multi-LAGAN), a multiple aligner based on progressive alignment with LAGAN. We tested our systems on sequence from 12 species generated for the genomic segment harboring the cystic fibrosis transmembrane conductance regulator (CFTR) gene (J.W. Thomas, J.W. Touchman, R.W. Blakesley, G.G. Bouffard, S.M. Beckstrom-Sternberg, E.H. Margulies, M. Blanchette, A.C. Siepel, P.J. Thomas, J.C. McDowell, B. Maskeri, N.F. Hansen, M.S. Schwartz, R.J. Weber, W.J. Kent, D. Karolchik, T.C. Bruen, R. Bevan, D.J. Cutler, S. Schwartz, L. Elnitski, J.R. Idol, A.B. Prasad, S.-Q. Lee-Lin, V.V.B. Maduro, M.E. Portnoy, N.L. Dietrich, N. Akhter, K. Ayele, B. Benjamin, K. Cariaga, C.P. Brinkley, S.Y. Brooks, S. Granite, X. Guan, J. Gupta, P. Haghighi, S-L. Ho, M.C. Huang, E. Karlins, P.L. Laric, R. Legaspi, M.J. Lim, Q.L. Maduro, C.A. Masiello, S.D. Mastrian, J.C. McCloskey, R. Pearson, S. Stantripop, E.E. Tiongson, J.T. Tran, C. Tsurgeon, J.L. Vogt, M.A. Walker, K.D. Wetherby, L.S. Wiggins, A.C. Young, L-H. Zhang, K. Osoegawa, B. Zhu, B. Zhao, C.L. Shu, P.J. De Jong, C.E. Lawrence, A.F. Smit, A. Chakravarti, D. Haussler, P. Green, W. Miller, and E.D. Green, in prep.). Based on comparisons with other available alignment programs and benchmarking on standard desktop computer systems, we conclude that LAGAN and MLAGAN are practical and reliable methods for large-scale pairwise and multiple genomic alignment that should prove useful for obtaining alignments of the entire human, mouse, fugu, rat, and other genomes in the context of a whole-genome alignment pipeline.
TL;DR: The Human Protein Reference Database (HPRD) is an object database that integrates a wealth of information relevant to the function of human proteins in health and disease, including protein-protein interactions, posttranslational modifications, enzyme/substrate relationships, disease associations, tissue expression, and subcellular localization.
Abstract: Human Protein Reference Database (HPRD) is an object database that integrates a wealth of information relevant to the function of human proteins in health and disease. Data pertaining to thousands of protein-protein interactions, posttranslational modifications, enzyme/substrate relationships, disease associations, tissue expression, and subcellular localization were extracted from the literature for a nonredundant set of 2750 human proteins. Almost all the information was obtained manually by biologists who read and interpreted >300,000 published articles during the annotation process. This database, which has an intuitive query interface allowing easy access to all the features of proteins, was built by using open source technologies and will be freely available at http://www.hprd.org to the academic community. This unified bioinformatics platform will be useful in cataloging and mining the large number of proteomic interactions and alterations that will be discovered in the postgenomic era.
TL;DR: This recombineering-based method is fast, efficient, and reliable, making it possible to generate conditional knockout (cko) targeting vectors in less than 2 wk; it should also facilitate the generation of knock-in mutations and transgene constructs and expedite the analysis of regulatory elements and functional domains in or near genes.
Abstract: Phage-based Escherichia coli homologous recombination systems have recently been developed that now make it possible to subclone or modify DNA cloned into plasmids, BACs, or PACs without the need for restriction enzymes or DNA ligases. This new form of chromosome engineering, termed recombineering, has many different uses for functional genomic studies. Here we describe a new recombineering-based method for generating conditional mouse knockout (cko) mutations. This method uses homologous recombination mediated by the lambda phage Red proteins, to subclone DNA from BACs into high-copy plasmids by gap repair, and together with Cre or Flpe recombinases, to introduce loxP or FRT sites into the subcloned DNA. Unlike other methods that use short 45-55-bp regions of homology for recombineering, our method uses much longer regions of homology. We also make use of several new E. coli strains, in which the proteins required for recombination are expressed from a defective temperature-sensitive lambda prophage, and the Cre or Flpe recombinases from an arabinose-inducible promoter. We also describe two new Neo selection cassettes that work well in both E. coli and mouse ES cells. Our method is fast, efficient, and reliable and makes it possible to generate cko-targeting vectors in less than 2 wk. This method should also facilitate the generation of knock-in mutations and transgene constructs, as well as expedite the analysis of regulatory elements and functional domains in or near genes.
TL;DR: Although these stochastic methods cannot guarantee global optimality with certainty, their robustness, plus the fact that in inverse problems they have a known lower bound for the cost function, make them the best available candidates.
Abstract: Here we address the problem of parameter estimation (inverse problem) of nonlinear dynamic biochemical pathways. This problem is stated as a nonlinear programming (NLP) problem subject to nonlinear differential-algebraic constraints. These problems are known to be frequently ill-conditioned and multimodal. Thus, traditional (gradient-based) local optimization methods fail to arrive at satisfactory solutions. To surmount this limitation, the use of several state-of-the-art deterministic and stochastic global optimization methods is explored. A case study considering the estimation of 36 parameters of a nonlinear biochemical dynamic model is taken as a benchmark. Only a certain type of stochastic algorithm, evolution strategies (ES), is able to solve this problem successfully. Although these stochastic methods cannot guarantee global optimality with certainty, their robustness, plus the fact that in inverse problems they have a known lower bound for the cost function, make them the best available candidates.
TL;DR: The phylogeny and synteny data suggest that the common ancestor of zebrafish and pufferfish, a fish that gave rise to approximately 22000 species, experienced a large-scale gene or complete genome duplication event and that the pufferfish has lost many duplicates that the zebrafish has retained.
Abstract: Through phylogeny reconstruction we identified 49 genes with a single copy in man, mouse, and chicken, one or two copies in the tetraploid frog Xenopus laevis, and two copies in zebrafish (Danio rerio). For 22 of these genes, both zebrafish duplicates had orthologs in the pufferfish (Takifugu rubripes). For another 20 of these genes, we found only one pufferfish ortholog but in each case it was more closely related to one of the zebrafish duplicates than to the other. Forty-three pairs of duplicated genes map to 24 of the 25 zebrafish linkage groups but they are not randomly distributed; we identified 10 duplicated regions of the zebrafish genome that each contain between two and five sets of paralogous genes. These phylogeny and synteny data suggest that the common ancestor of zebrafish and pufferfish, a fish that gave rise to approximately 22000 species, experienced a large-scale gene or complete genome duplication event and that the pufferfish has lost many duplicates that the zebrafish has retained.
TL;DR: This work develops a method that simultaneously clusters genes and conditions, finding distinctive "checkerboard" patterns in matrices of gene expression data, if they exist, and applies it to a selection of publicly available cancer expression data sets.
Abstract: Global analyses of RNA expression levels are useful for classifying genes and overall phenotypes. Often these classification problems are linked, and one wants to find “marker genes” that are differentially expressed in particular sets of “conditions.” We have developed a method that simultaneously clusters genes and conditions, finding distinctive “checkerboard” patterns in matrices of gene expression data, if they exist. In a cancer context, these checkerboards correspond to genes that are markedly up- or downregulated in patients with particular types of tumors. Our method, spectral biclustering, is based on the observation that checkerboard structures in matrices of expression data can be found in eigenvectors corresponding to characteristic expression patterns across genes or conditions. In addition, these eigenvectors can be readily identified by commonly used linear algebra approaches, in particular the singular value decomposition (SVD), coupled with closely integrated normalization steps. We present a number of variants of the approach, depending on whether the normalization over genes and conditions is done independently or in a coupled fashion. We then apply spectral biclustering to a selection of publicly available cancer expression data sets, and examine the degree to which the approach is able to identify checkerboard structures. Furthermore, we compare the performance of our biclustering methods against a number of reasonable benchmarks (e.g., direct application of SVD or normalized cuts to raw data).
TL;DR: It is concluded that the Arabidopsis lineage underwent at least two distinct episodes of duplication, one of which was a polyploidy that occurred much more recently than estimated previously and probably during the early emergence of the crucifer family.
Abstract: The Arabidopsis genome contains numerous large duplicated chromosomal segments, but the different approaches used in previous analyses led to different interpretations regarding the number and timing of ancestral large-scale duplication events. Here, using more appropriate methodology and a more recent version of the genome sequence annotation, we investigate the scale and timing of segmental duplications in Arabidopsis. We used protein sequence similarity searches to detect duplicated blocks in the genome, used the level of synonymous substitution between duplicated genes to estimate the relative ages of the blocks containing them, and analyzed the degree of overlap between adjacent duplicated blocks. We conclude that the Arabidopsis lineage underwent at least two distinct episodes of duplication. One was a polyploidy that occurred much more recently than estimated previously, before the Arabidopsis/Brassica rapa split and probably during the early emergence of the crucifer family (24-40 Mya). An older set of duplicated blocks was formed after the monocot/dicot divergence, and the relatively low level of overlap among these blocks indicates that at least some of them are remnants of a larger duplication such as a polyploidy or aneuploidy.
TL;DR: The goal is to rapidly deliver allelic series of ethylmethanesulfonate-induced mutations in target 1-kb loci requested by the international research community.
Abstract: TILLING (Targeting Induced Local Lesions in Genomes) is a general reverse-genetic strategy that provides an allelic series of induced point mutations in genes of interest. High-throughput TILLING allows the rapid and low-cost discovery of induced point mutations in populations of chemically mutagenized individuals. As chemical mutagenesis is widely applicable and mutation detection for TILLING is dependent only on sufficient yield of PCR products, TILLING can be applied to most organisms. We have developed TILLING as a service to the Arabidopsis community known as the Arabidopsis TILLING Project (ATP). Our goal is to rapidly deliver allelic series of ethylmethanesulfonate-induced mutations in target 1-kb loci requested by the international research community. In the first year of public operation, ATP has discovered, sequenced, and delivered >1000 mutations in >100 genes ordered by Arabidopsis researchers. The tools and methodologies described here can be adapted to create similar facilities for other organisms.
TL;DR: This work measures mRNA decay rates in two human cell lines with high-density oligonucleotide arrays and investigates the dependence of decay rates on sequence composition, that is, the presence or absence of short mRNA motifs in various regions of the mRNA transcript.
Abstract: Although mRNA decay rates are a key determinant of the steady-state concentration for any given mRNA species, relatively little is known, on a population level, about what factors influence turnover rates and how these rates are integrated into cellular decisions. We decided to measure mRNA decay rates in two human cell lines with high-density oligonucleotide arrays that enable the measurement of decay rates simultaneously for thousands of mRNA species. Using existing annotation and the Gene Ontology hierarchy of biological processes, we assign mRNAs to functional classes at various levels of resolution and compare the decay rate statistics between these classes. The results show statistically significant organizational principles in the variation of decay rates among functional classes. In particular, transcription factor mRNAs have increased average decay rates compared with other transcripts and are enriched in "fast-decaying" mRNAs with half-lives <2 h. In contrast, we find that mRNAs for biosynthetic proteins have decreased average decay rates and are deficient in fast-decaying mRNAs. Our analysis of data from a previously published study of Saccharomyces cerevisiae mRNA decay shows the same functional organization of decay rates, implying that it is a general organizational scheme for eukaryotes. Additionally, we investigated the dependence of decay rates on sequence composition, that is, the presence or absence of short mRNA motifs in various regions of the mRNA transcript. Our analysis recovers the positive correlation of mRNA decay with known AU-rich mRNA motifs, but we also uncover further short mRNA motifs that show statistically significant correlation with decay. However, we also note that none of these motifs are strong predictors of mRNA decay rate, indicating that the regulation of mRNA decay is more complex and may involve the cooperative binding of several RNA-binding proteins at different sites.
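The relation between a decay rate and the "half-lives <2 h" threshold quoted above can be sketched as follows, assuming simple first-order decay, N(t) = N0 * exp(-k * t), so that the half-life is ln(2) / k. The abstract does not state how the rates were estimated; the log-linear least-squares fit, the function names, and the example time course below are assumptions for illustration only.

```python
import math


def fit_decay_rate(times_h, abundances):
    """Estimate a first-order decay rate k (per hour) by least squares
    on ln(abundance) versus time; k is minus the fitted slope."""
    logs = [math.log(x) for x in abundances]
    n = len(times_h)
    mean_t = sum(times_h) / n
    mean_y = sum(logs) / n
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times_h, logs))
    sxx = sum((t - mean_t) ** 2 for t in times_h)
    return -sxy / sxx


def half_life(decay_rate_per_hour):
    """Half-life in hours for first-order decay: ln(2) / k."""
    return math.log(2) / decay_rate_per_hour


def is_fast_decaying(decay_rate_per_hour, threshold_hours=2.0):
    """True if the half-life is below the threshold (2 h in the abstract)."""
    return half_life(decay_rate_per_hour) < threshold_hours


# Hypothetical noise-free time course with k = 0.5 per hour.
times = [0.0, 1.0, 2.0, 4.0]
signal = [100.0 * math.exp(-0.5 * t) for t in times]
k = fit_decay_rate(times, signal)
print(round(k, 3), round(half_life(k), 3), is_fast_decaying(k))
```

With real array data the fit would be noisy, so a confidence interval on k would be needed before classifying a transcript as fast-decaying.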
TL;DR: A new global alignment method called AVID is described, designed to be fast, memory efficient, and practical for sequence alignments of large genomic regions up to megabases long, and a format is established for the representation of alignments and methods for their comparison.
Abstract: In this paper we describe a new global alignment method called AVID. The method is designed to be fast, memory efficient, and practical for sequence alignments of large genomic regions up to megabases long. We present numerous applications of the method, ranging from the comparison of assemblies to alignment of large syntenic genomic regions and whole genome human/mouse alignments. We have also performed a quantitative comparison of AVID with other popular alignment tools. To this end, we have established a format for the representation of alignments and methods for their comparison. These formats and methods should be useful for future studies. The tools we have developed for the alignment comparisons, as well as the AVID program, are publicly available. See Web Site References section for AVID Web address and Web addresses for other programs discussed in this paper.
TL;DR: The method complements bifurcation studies of the system's parameter dependence by providing estimates of sizes, correlations, and time scales of stochastic fluctuations; with suitable variable changes and elimination of fast variables, the linear noise approximation can also give precise descriptions near macroscopic instabilities.
Abstract: Biochemical networks in single cells can display large fluctuations in molecule numbers, making mesoscopic approaches necessary for correct system descriptions. We present a general method that allows rapid characterization of the stochastic properties of intracellular networks. The starting point is a macroscopic description that identifies the system's elementary reactions in terms of rate laws and stoichiometries. From this formulation follows directly the stationary solution of the linear noise approximation (LNA) of the Master equation for all the components in the network. The method complements bifurcation studies of the system's parameter dependence by providing estimates of sizes, correlations, and time scales of stochastic fluctuations. We describe how the LNA can give precise system descriptions also near macroscopic instabilities by suitable variable changes and elimination of fast variables.
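As a hedged illustration of what the stationary LNA solution looks like, the block below writes it in one common notation. The symbols (S, f, \phi, A, D, \Xi, \Omega) are assumptions introduced here, not taken from the abstract, and the volume scaling depends on whether fluctuations are expressed in concentrations or in copy numbers.

```latex
% Macroscopic description: stoichiometric matrix S, rate-law vector f(\phi)
\frac{d\phi}{dt} = S\, f(\phi), \qquad S\, f(\phi^{*}) = 0 .

% Jacobian and diffusion matrix evaluated at the stationary state \phi^{*}
A_{ij} = \left.\frac{\partial\,[S f(\phi)]_i}{\partial \phi_j}\right|_{\phi^{*}},
\qquad
D = S\,\mathrm{diag}\!\big(f(\phi^{*})\big)\,S^{T} .

% Stationary covariance \Xi of the scaled fluctuations \xi,
% where the copy numbers are n = \Omega\phi + \Omega^{1/2}\xi
A\,\Xi + \Xi\,A^{T} + D = 0 .

% Worked check: production at rate k, degradation at rate \gamma\phi
% S = (1,\,-1),\quad f = (k,\ \gamma\phi)^{T},\quad \phi^{*} = k/\gamma,\quad A = -\gamma,\quad D = 2k
-2\gamma\,\Xi + 2k = 0 \;\Rightarrow\; \Xi = k/\gamma = \phi^{*} .
```

In the birth-death check, the copy-number variance \Omega\Xi equals the mean \Omega\phi^{*}, which is the expected Poisson result.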
TL;DR: Variation of gene expression between alleles is shown to be common, with the allelic variation confirmed by real-time quantitative PCR experiments, and this variation may contribute to human variability.
Abstract: Variations in gene sequence and expression underlie much of human variability. Despite the known biological roles of differential allelic gene expression resulting from X-chromosome inactivation and genomic imprinting, a large-scale analysis of allelic gene expression in human is lacking. We examined allele-specific gene expression of 1063 transcribed single-nucleotide polymorphisms (SNPs) by using Affymetrix HuSNP oligo arrays. Among the 602 genes that were heterozygous and expressed in kidney or liver tissues from seven individuals, 326 (54%) showed preferential expression of one allele in at least one individual, and 170 of those showed greater than fourfold difference between the two alleles. The allelic variation has been confirmed by real-time quantitative PCR experiments. Some of these 170 genes are known to be imprinted, such as SNRPN, IPW, HTR2A, and PEG3. Most of the differentially expressed genes are not in known imprinting domains but instead are distributed throughout the genome. Our studies demonstrate that variation of gene expression between alleles is common, and this variation may contribute to human variability.
TL;DR: TILLING can be used to detect the full spectrum of ENU-induced mutations in a vertebrate genome with the presence of many naturally occurring polymorphisms and is shown to be a highly efficient and easy method to do target-selected mutagenesis in zebrafish.
Abstract: One of the most powerful methods available to assign function to a gene is to inactivate or knock out the gene. Recently, we described the first target-selected knockout in zebrafish. Here, we report on the further improvements of this procedure, resulting in a highly efficient and easy method to do target-selected mutagenesis in zebrafish. A library of 4608 ENU-mutagenized F1 animals was generated and kept as a living stock. The DNA of these animals was screened for mutations in 16 genes by use of CEL-I-mediated heteroduplex cleavage (TILLING) and subsequent resequencing. In total, 255 mutations were identified, of which 14 resulted in a premature stop codon, 7 in a splice donor/acceptor site mutation, and 119 in an amino acid change. By this method, we potentially knocked out 13 different genes in a few months' time. Furthermore, we show that TILLING can be used to detect the full spectrum of ENU-induced mutations in a vertebrate genome with the presence of many naturally occurring polymorphisms.
TL;DR: Overall, processed pseudogenes are very similar to their closest corresponding human gene, being 94% complete in coding regions, with sequence similarity of 75% for amino acids and 86% for nucleotides. Their chromosomal distribution varies with GC-content: processed pseudogenes occur mostly in intermediate GC-content regions.
Abstract: Processed pseudogenes were created by reverse-transcription of mRNAs; they provide snapshots of ancient genes existing millions of years ago in the genome. To find them in the present-day human, we developed a pipeline using features such as intron-absence, frame-disruption, polyadenylation, and truncation. This has enabled us to identify in recent genome drafts approximately 8000 processed pseudogenes (distributed from http://pseudogene.org). Overall, processed pseudogenes are very similar to their closest corresponding human gene, being 94% complete in coding regions, with sequence similarity of 75% for amino acids and 86% for nucleotides. Their chromosomal distribution appears random and dispersed, with the numbers on chromosomes proportional to length, suggesting sustained "bombardment" over evolution. However, it does vary with GC-content: Processed pseudogenes occur mostly in intermediate GC-content regions. This is similar to Alus but contrasts with functional genes and L1-repeats. Pseudogenes, moreover, have age profiles similar to Alus. The number of pseudogenes associated with a given gene follows a power-law relationship, with a few genes giving rise to many pseudogenes and most giving rise to few. The prevalence of processed pseudogenes agrees well with germ-line gene expression. Highly expressed ribosomal proteins account for approximately 20% of the total. Other notables include cyclophilin-A, keratin, GAPDH, and cytochrome c.
TL;DR: A high-throughput genotyping platform is developed by hybridizing genomic DNA from Arabidopsis thaliana accessions to an RNA expression GeneChip (AtGenome1), and it is demonstrated that array hybridization can be combined with bulk segregant analysis to quickly map mutations.
Abstract: We have developed a high-throughput genotyping platform by hybridizing genomic DNA from Arabidopsis thaliana accessions to an RNA expression GeneChip (AtGenome1). Using newly developed analytical tools, a large number of single-feature polymorphisms (SFPs) were identified. A comparison of two accessions, the reference strain Columbia (Col) and the strain Landsberg erecta (Ler), identified nearly 4000 SFPs, which could be reliably scored at a 5% error rate. Ler sequence was used to confirm 117 of 121 SFPs and to determine the sensitivity of array hybridization. Features containing sequence repeats, as well as those from high copy genes, showed greater polymorphism rates. A linear clustering algorithm was developed to identify clusters of SFPs representing potential deletions in 111 genes at a 5% false discovery rate (FDR). Among the potential deletions were transposons, disease resistance genes, and genes involved in secondary metabolism. The applicability of this technique was demonstrated by genotyping a recombinant inbred line. Recombination break points could be clearly defined, and in one case delimited to an interval of 29 kb. We further demonstrate that array hybridization can be combined with bulk segregant analysis to quickly map mutations. The extension of these tools to organisms with complex genomes, such as Arabidopsis, will greatly increase our ability to map and clone quantitative trait loci (QTL).
TL;DR: In this article, the authors derived a parsimonious scenario of gene losses for eukaryotic orthologous groups (KOGs) from seven complete eukaryotic genomes and introduced a numerical measure, the propensity for gene loss (PGL).
Abstract: Lineage-specific gene loss, to a large extent, accounts for the differences in gene repertoires between genomes, particularly among eukaryotes. We derived a parsimonious scenario of gene losses for eukaryotic orthologous groups (KOGs) from seven complete eukaryotic genomes. The scenario involves substantial gene loss in fungi, nematodes, and insects. Based on this evolutionary scenario and estimates of the divergence times between major eukaryotic phyla, we introduce a numerical measure, the propensity for gene loss (PGL). We explore the connection among the propensity of a gene to be lost in evolution (PGL value), protein sequence divergence, the effect of gene knockout on fitness, the number of protein-protein interactions, and expression level for the genes in KOGs. Significant correlations between PGL and each of these variables were detected. Genes that have a lower propensity to be lost in eukaryotic evolution accumulate fewer substitutions in their protein sequences and tend to be essential for the organism viability, tend to be highly expressed, and have many interaction partners. The dependence between PGL and gene dispensability and interactivity is much stronger than that for sequence evolution rate. Thus, propensity of a gene to be lost during evolution seems to be a direct reflection of its biological importance.
TL;DR: Algorithmic adaptations to the whole-genome assembly program Arachne are described, allowing for assembly of mammalian-size genomes, and also improving the assembly of smaller genomes.
Abstract: We previously described the whole-genome assembly program Arachne, presenting assemblies of simulated data for small to mid-sized genomes. Here we describe algorithmic adaptations to the program, allowing for assembly of mammalian-size genomes, and also improving the assembly of smaller genomes. Three principal changes were simultaneously made and applied to the assembly of the mouse genome, during a six-month period of development: (1) Supercontigs (scaffolds) were iteratively broken and rejoined using several criteria, yielding a 64-fold increase in length (N50), and apparent elimination of all global misjoins; (2) gaps between contigs in supercontigs were filled (partially or completely) by insertion of reads, as suggested by pairing within the supercontig, increasing the N50 contig length by 50%; (3) memory usage was reduced fourfold. The outcome of this mouse assembly and its analysis are described in (Mouse Genome Sequencing Consortium 2002).
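The N50 statistic quoted above for supercontig and contig lengths can be illustrated with a minimal sketch. It assumes the common definition (the length L such that contigs of length at least L contain at least half of the total assembled bases); the function name `n50` and the example lengths are hypothetical and are not taken from Arachne.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Assumed definition: the largest length L such that sequences of
    length >= L together cover at least half of the total bases.
    """
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths, for illustration only.
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Because N50 depends only on the length distribution, joining scaffolds correctly raises it while leaving the total assembled length unchanged.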
TL;DR: Ridges are found to be very gene-dense domains with a high GC content, a high SINE repeat density, and a low LINE repeat density and are an integral part of a higher order structure in the genome related to transcriptional regulation.
Abstract: The chromosomal gene expression profiles established by the Human Transcriptome Map (HTM) revealed a clustering of highly expressed genes in about 30 domains, called ridges. To physically characterize ridges, we constructed a new HTM based on the draft human genome sequence (HTMseq). Expression of 25,003 genes can be analyzed online in a multitude of tissues (http://bioinfo.amc.uva.nl/HTMseq). Ridges are found to be very gene-dense domains with a high GC content, a high SINE repeat density, and a low LINE repeat density. Genes in ridges have significantly shorter introns than genes outside of ridges. The HTMseq also identifies a significant clustering of weakly expressed genes in domains with fully opposite characteristics (antiridges). Both types of domains are open to tissue-specific expression regulation, but the maximal expression levels in ridges are considerably higher than in antiridges. Ridges are therefore an integral part of a higher order structure in the genome related to transcriptional regulation.
TL;DR: Two strategies for MCS identification are reported, demonstrating their ability to detect virtually all known actively conserved sequences but very little neutrally evolving sequence (specifically, ancestral repeats).
Abstract: A key component of genomics research beyond the Human Genome Project will be the rigorous interpretation of the recently finished human genome sequence (Collins et al. 2003). Central to these efforts will be the identification of all functional elements in the human genome. Recent comparative analyses of the human and mouse genome sequences suggest that ∼5% of the mammalian genome is under active selection and thus likely serves a functional role (International Mouse Genome Sequencing Consortium 2002; Roskin et al. 2003). Within this functional subset is an estimated 1% to 2% of the genome that encodes protein (International Mouse Genome Sequencing Consortium 2002). The prospects for comprehensive identification of these coding sequences are quite good, especially in light of the availability of data sets that are complementary to the genomic sequence (e.g., ESTs [Boguski et al. 1994; also see http://www.ncbi.nlm.nih.gov/dbEST] and full-length cDNA sequences [Strausberg et al. 2002; also see http://mgc.nci.nih.gov]) and ever-improving computational methods for gene prediction (Kulp et al. 1996; Burge and Karlin 1997; Rogic et al. 2001; Solovyev 2001; Flicek et al. 2003). The complete identification and characterization of the remaining 3% to 4% of the mammalian genome that likely corresponds to functional non-coding sequence will be profoundly more challenging, due to the lack of complementary data sets, the absence of robust tools for computational predictions, and the incomplete insight about the nature of such sequence. In short, the generation of a comprehensive “parts list” of functional elements in the human genome remains an immense and important challenge.
The comparison of orthologous genomic sequences has emerged as a powerful approach for identifying functional elements in the genome (Dermitzakis et al. 2002; DeSilva et al. 2002). The premise of this approach is that sequences conserved across millions of years of evolution are likely to have a functional role (Pennacchio and Rubin 2001). Comparative sequence analyses have been shown to facilitate the identification of both coding (Batzoglou et al. 2000; Korf et al. 2001; Pennacchio et al. 2001; Alexandersson et al. 2003; Flicek et al. 2003) and functional non-coding (Stojanovic et al. 1999; Dubchak et al. 2000; Gottgens et al. 2000; Loots et al. 2000, 2002; Wasserman et al. 2000; Dehal et al. 2001; Elnitski et al. 2003; Kellis et al. 2003) sequences. Among the latter are elements that regulate the spatial and temporal patterns of gene expression (Hardison 2000). When the generation of alignments between related sequences is not possible, motif-finding techniques have also been used to identify functional sequences, in particular for detecting transcription factor–binding sites (Bailey and Elkan 1995; Roth et al. 1998; Hertz and Stormo 1999; McCue et al. 2001; Blanchette and Tompa 2002).
Recent efforts have produced whole-genome sequences for several vertebrates, including human (International Human Genome Sequencing Consortium 2001), mouse (International Mouse Genome Sequencing Consortium 2002), rat (http://genome.ucsc.edu/cgi-bin/hgGateway?org=rat), and pufferfish (Aparicio et al. 2002), with the sequencing of additional vertebrate genomes well underway. Increasingly, methods for visualizing (Kent et al. 2002; Clamp et al. 2003; Karolchik et al. 2003) and comparing (Stojanovic et al. 1999; Mayor et al. 2000; Blanchette and Tompa 2002; Loots et al. 2002; Giardine et al. 2003; Schwartz et al. 2003a) genomic sequences from multiple species are emerging. As a complement to these efforts, we are generating the sequence of targeted genomic regions in multiple, phylogenetically diverse vertebrates (Thomas et al. 2003) and developing computational approaches for identifying the subset of sequences that confers function. In particular, we have focused on developing algorithms for detecting sequences that are highly conserved across multiple species, which we call Multi-species Conserved Sequences (or MCSs); such sequences represent candidates for being functionally important. Here we report the development and testing of methods for MCS detection, including analyses of MCSs identified using a recently generated set of orthologous sequences from 11 non-human vertebrates (Thomas et al. 2003).
TL;DR: A small set of genes can be traced back to the universal ancestor and have coevolved since that time, suggesting innovations that may have been essential to the divergence of the three domains of life.
Abstract: Phylogenetic studies of ribosomal RNA (rRNA) revolutionized our understanding of biological diversity by revealing that modern organisms fall into three phylogenetic domains: Archaea, Bacteria, and Eucarya (Woese and Fox 1977; Woese et al. 1990). rRNA sequence information in principle is well suited for determining deep phylogenetic relationships for several reasons. The rRNA sequences occur in all organisms, they have evolved at a sufficiently slow rate to retain phylogenetic information between distantly related organisms, and the rRNA genes have undergone limited or no horizontal transfer (i.e., transfer between distantly related organisms; Asai et al. 1999). Since the original description of the three-domain phylogeny, correlations of biochemical properties between organisms and data from genomic sequences have lent support to this classification of life (Woese et al. 1990; Wettach et al. 1995; Brown et al. 2001b).
At the same time, it also has become evident that many genes do not exhibit the same phylogenetic pattern as rRNA genes. Data from complete genomic sequences and phylogenetic studies of particular genes have revealed that genomes contain many genes that have undergone horizontal as well as vertical evolutionary change (Brown and Doolittle 1997). Moreover, a large number of genes appear to have been lost from, or never acquired by, various lineages over evolutionary time (Snel et al. 2002). Although gene loss or gain and horizontal transfer are common themes in evolution, phylogenetic analyses nonetheless have identified a number of genes in the nucleic acid-based information-processing pathway that have phylogenetic histories congruent with that of rRNA. For instance, the phylogenetic relationships among the core subunits of the DNA-dependent RNA polymerases, or most ribosomal protein genes, are the same as those seen in phylogenetic analyses of rRNA sequences (Iwabe et al. 1991; Klenk et al. 1993; Liao and Dennis 1994). Additionally, recent studies of concatenated datasets recovered the three-domain topology even when component members analyzed separately clearly demonstrated lateral transfers between organisms (Brown et al. 2001a). Collectively, the results indicate that the phylogenetic pattern of rRNA is representative of the evolutionary history of some portion of cellular components, which we term the ‘genetic core.’
Although it is known that some cellular genes show the same phylogenetic patterns as rRNA, the purpose of the present study was to determine the entire set of universal genes with this property; this set constitutes a ‘genetic core’ of the known cellular lines of descent. Abundant new sequence information from a rapidly expanding database of genome sequences allows a more complete assessment of the genes that comprise such a genetic core that traces its ancestry back to the last common ancestor (LCA) of life. We used the Clusters of Orthologous Groups of proteins (COG) database (Tatusov et al. 2001) to search for constituents of the genetic core by identifying the universally conserved set of related genes that have the same phylogenetic history as rRNA. If a gene that is universally present in cells shares the same phylogenetic history as rRNA, two important properties of the gene can be inferred: (1) The gene occurred in the LCA, and its presence in all organisms is not the result of subsequent horizontal transfer between lineages; and (2) the gene has resisted both nonorthologous displacement and extensive amino acid substitution since the time of the LCA. We note that this analysis will not yield a minimal genome for the LCA, because it should focus primarily on the mechanisms of the universal function of transfer of genetic information.
The analyses presented here were based exclusively on fully sequenced genomes and have two primary advantages over single-gene surveys. First, the complete set of genes from the organisms being examined is known, which allowed for a comprehensive analysis of gene coevolution. Second, the absence of a gene in an analysis of complete genome sequences is not a negative result; rather, it is a finding that the gene is truly not present in the organism. This contrasts with PCR- or homology-based analyses of particular genes, where a negative result is ambiguous.
TL;DR: The unsupervised neural network algorithm, a self-organizing map (SOM), used to analyze di-, tri-, and tetranucleotide frequencies in a wide variety of prokaryotic and eukaryotic genomes was shown to be an excellent tool for analyzing global characteristics of genome sequences and for revealing key combinations of oligonucleotides representing individual genomes.
Abstract: With the increasing amount of available genome sequences, novel tools are needed for comprehensive analysis of species-specific sequence characteristics for a wide variety of genomes. We used an unsupervised neural network algorithm, a self-organizing map (SOM), to analyze di-, tri-, and tetranucleotide frequencies in a wide variety of prokaryotic and eukaryotic genomes. The SOM, which can cluster complex data efficiently, was shown to be an excellent tool for analyzing global characteristics of genome sequences and for revealing key combinations of oligonucleotides representing individual genomes. From analysis of 1- and 10-kb genomic sequences derived from 65 bacteria (a total of 170 Mb) and from 6 eukaryotes (460 Mb), clear species-specific separations of major portions of the sequences were obtained with the di-, tri-, and tetranucleotide SOMs. The unsupervised algorithm could recognize, in most 10-kb sequences, the species-specific characteristics (key combinations of oligonucleotide frequencies) that are signature features of each genome. We were able to classify DNA sequences within one and between many species into subgroups that corresponded generally to biological categories. Because the classification power is very high, the SOM is an efficient and fundamental bioinformatic strategy for extracting a wide range of genomic information from a vast amount of sequences.
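As an illustration of the input such a SOM would receive, the following minimal sketch computes a normalized oligonucleotide frequency vector for one sequence fragment. The function name `oligo_frequencies`, the skipping of windows with ambiguous bases, and the ordering of the vector are assumptions for illustration; the abstract does not specify the preprocessing actually used (for example, whether complementary oligonucleotides were merged).

```python
from collections import Counter
from itertools import product


def oligo_frequencies(seq, k):
    """Return the relative frequencies of all 4**k oligonucleotides in seq.

    Overlapping windows are counted; windows containing characters
    other than A, C, G, T are skipped (an assumption of this sketch).
    The vector follows the lexicographic order AA..., AC..., ..., TT...
    """
    seq = seq.upper()
    alphabet = set("ACGT")
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= alphabet:
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in words]


if __name__ == "__main__":
    # Dinucleotide vector (16 values) for a toy fragment.
    print(oligo_frequencies("ACGTACGT", 2))
```

With k = 2, 3, or 4 this yields 16-, 64-, or 256-dimensional vectors, one per 1-kb or 10-kb fragment, which could then be passed to any SOM implementation.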
TL;DR: All intergenic regions in the human genome are screened with a combination of homology searches and a functionality test using the ratio of replacement to silent nucleotide substitutions (KA/KS), and nonprocessed pseudogenes appear to be enriched in regions with high gene density.
Abstract: We screened all intergenic regions in the human genome to identify pseudogenes with a combination of homology searches and a functionality test using the ratio of replacement to silent nucleotide substitutions (KA/KS). We identified 19,724 regions of which 95% ± 3% are estimated to evolve neutrally and thus are likely to encode pseudogenes. Half of these have no detectable truncation in their pseudocoding regions and therefore are not identifiable by methods that require the presence of truncations to prove nonfunctionality. A comparative analysis with the mouse genome showed that 70% of these pseudogenes have a retrotranspositional origin (processed), and the rest arose by segmental duplication (nonprocessed). Although the spread of both types of pseudogenes correlates with chromosome size, nonprocessed pseudogenes appear to be enriched in regions with high gene density. It is likely that the human pseudogenes identified here represent only a small fraction of the total, which probably exceeds the number of genes.
TL;DR: The results demonstrate that the combination of microarray technology with the zebrafish model system can provide useful information on how genes are coordinated in a genetic network to control zebrafish embryogenesis and can help to identify novel genes that are important for organogenesis.
Abstract: A total of 15590 unique zebrafish EST clusters from two cDNA libraries have been identified. Most significantly, only 22% (3437) of the 15590 unique clusters matched 2805 (of 15200) clusters in the Danio rerio UniGene database, indicating that our EST set is complementary to the existing ESTs in the public database and will be invaluable in assisting the annotation of genes based on the upcoming zebrafish genome sequence. Blast search showed that 7824 of our unique clusters matched 6710 known or predicted proteins in the nonredundant database. A cDNA microarray representing approximately 3100 unique zebrafish cDNA clusters has been generated and used to profile the gene expression patterns across six different embryonic stages (cleavage, blastula, gastrula, segmentation, pharyngula, and hatching). Analysis of expression data using K-means clustering revealed that genes coding for muscle-specific proteins displayed similar expression patterns, confirming that the coordinate gene expression is important for myogenesis. Our results demonstrate that the combination of microarray technology with the zebrafish model system can provide useful information on how genes are coordinated in a genetic network to control zebrafish embryogenesis and can help to identify novel genes that are important for organogenesis.
TL;DR: Comparative genomics revealed trends in amino acid and tRNA composition, and structural features of proteins from cold-adapted Archaea, and indicated that GC content is the major factor influencing tRNA stability in hyperthermophiles, but not in the psychrophiles, mesophiles, or moderate thermophiles.
Abstract: We generated draft genome sequences for two cold-adapted Archaea, Methanogenium frigidum and Methanococcoides burtonii, to identify genotypic characteristics that distinguish them from Archaea with a higher optimal growth temperature (OGT). Comparative genomics revealed trends in amino acid and tRNA composition, and structural features of proteins. Proteins from the cold-adapted Archaea are characterized by a higher content of non-charged polar amino acids, particularly Gln and Thr, and a lower content of hydrophobic amino acids, particularly Leu. Sequence data from nine methanogen genomes (OGT 15-98°C) were used to generate 1111 modeled protein structures. Analysis of the models from the cold-adapted Archaea showed a strong tendency in the solvent-accessible area for more Gln, Thr, and hydrophobic residues and fewer charged residues. A cold shock domain (CSD) protein (CspA homolog) was identified in M. frigidum, two hypothetical proteins with CSD-folds in M. burtonii, and a unique winged helix DNA-binding domain protein in M. burtonii. This suggests that these types of nucleic acid binding proteins have a critical role in cold-adapted Archaea. Structural analysis of tRNA sequences from the Archaea indicated that GC content is the major factor influencing tRNA stability in hyperthermophiles, but not in the psychrophiles, mesophiles, or moderate thermophiles. Below an OGT of 60°C, the GC content in tRNA was largely unchanged, indicating that any requirement for flexibility of tRNA in psychrophiles is mediated by other means. This is the first time that comparisons have been performed with genome data from Archaea spanning the growth temperature extremes from psychrophiles to hyperthermophiles.
TL;DR: Using the FANTOM2 mouse cDNA set, public mRNA data, and mouse genome sequence data, the analysis greatly expands the number of known examples of sense-antisense transcript and nonantisense bidirectional transcription pairs in mammals and implies that the regulation of gene expression by antisense transcripts is more common than previously recognized.
Abstract: We have used the FANTOM2 mouse cDNA set (60,770 clones), public mRNA data, and mouse genome sequence data to identify 2481 pairs of sense-antisense transcripts and 899 further pairs of nonantisense bidirectional transcription based upon genomic mapping. The analysis greatly expands the number of known examples of sense-antisense transcript and nonantisense bidirectional transcription pairs in mammals. The FANTOM2 cDNA set appears to contain substantially large numbers of noncoding transcripts suitable for antisense transcript analysis. The average proportion of loci encoding sense-antisense transcript and nonantisense bidirectional transcription pairs on autosomes was 15.1 and 5.4%, respectively. Those on the X chromosome were 6.3 and 4.2%, respectively. Sense-antisense transcript pairs, rather than nonantisense bidirectional transcription pairs, may be less prevalent on the X chromosome, possibly due to X chromosome inactivation. Sense and antisense transcripts tended to be isolated from the same libraries, whereas nonantisense bidirectional transcription pairs were not apparently coregulated. The existence of large numbers of natural antisense transcripts implies that the regulation of gene expression by antisense transcripts is more common than previously recognized. The viewer showing mapping patterns of sense-antisense transcript pairs and nonantisense bidirectional transcription pairs on the genome and other related statistical data is available on our Web site.
TL;DR: A novel bioinformatics method is described that bases classification of potential binding sites explicitly on the estimate of sequence-specific binding energy of a given transcription factor, resulting in a significant improvement in the number of expected false positives.
Abstract: Identification of transcription factor binding sites within regulatory segments of genomic DNA is an important step toward understanding of the regulatory circuits that control expression of genes. Here, we describe a novel bioinformatics method that bases classification of potential binding sites explicitly on the estimate of sequence-specific binding energy of a given transcription factor. The method also estimates the chemical potential of the factor that defines the threshold of binding. In contrast with the widely used information-theoretic weight matrix method, the new approach correctly describes saturation in the transcription factor/DNA binding probability. This results in a significant improvement in the number of expected false positives, particularly in the ubiquitous case of low-specificity factors. In the strong binding limit, the algorithm is related to the "support vector machine" approach to pattern recognition. The new method is used to identify likely genomic binding sites for the E. coli transcription factors collected in the DPInteract database. In addition, for CRP (a global regulatory factor), the likely regulatory modality (i.e., repressor or activator) of predicted binding sites is determined.
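The saturation behavior mentioned above can be sketched with a simple two-state model. This is an illustrative assumption, not the paper's algorithm: the sketch uses the standard statistical-mechanics occupancy 1 / (1 + exp((E - mu) / kT)), where E is the sequence-specific binding energy and mu is the chemical potential acting as the binding threshold. The additive per-position energy matrix and all names (`site_energy`, `binding_probability`, the toy matrix) are hypothetical.

```python
import math


def site_energy(site, energy_matrix):
    """Sum per-position energy contributions for a candidate site.

    energy_matrix is a list of dicts, one per position, mapping each
    base to its energy contribution (additivity is assumed here).
    """
    if len(site) != len(energy_matrix):
        raise ValueError("site length must match the energy matrix")
    return sum(energy_matrix[i][base] for i, base in enumerate(site))


def binding_probability(energy, mu, kT=1.0):
    """Two-state occupancy: saturates at 1 for energies well below mu."""
    return 1.0 / (1.0 + math.exp((energy - mu) / kT))


if __name__ == "__main__":
    # Toy 2-bp energy matrix in units of kT (hypothetical values).
    toy = [
        {"A": -2.0, "C": 0.0, "G": 0.5, "T": 1.0},
        {"A": 1.0, "C": -1.5, "G": 0.0, "T": 0.5},
    ]
    for site in ("AC", "GG", "TA"):
        e = site_energy(site, toy)
        print(site, e, binding_probability(e, mu=-1.0))
```

Sites with energy far below mu all approach a probability of 1, so strong sites are not distinguished further. An unbounded additive score, by contrast, continues to grow with site strength.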