Top 21 Biodata Mining papers published in 2012

Visualising associations between paired 'omics' data sets.

[...]

Ignacio González¹, Kim-Anh Lê Cao², Melissa J. Davis², Sébastien Déjean¹•Institutions (2)

University of Toulouse¹, University of Queensland²

13 Nov 2012-Biodata Mining

TL;DR: This paper revisits a few graphical outputs to better understand the relationships between two ‘omics’ data sets and to better visualise the correlation structure between the different biological entities, and demonstrates the usefulness of these graphical outputs on several biological data sets.

Abstract: Background Each omics platform is now able to generate a large amount of data. Genomics, proteomics, metabolomics and interactomics data are compiled at an ever increasing pace and now form a core part of the fundamental systems biology framework. Recently, several integrative approaches have been proposed to extract meaningful information. However, these approaches lack visualisation outputs to fully unravel the complex associations between different biological entities.

317 citations

Journal Article•10.1186/1756-0381-5-16•

GAMETES: a fast, direct algorithm for generating pure, strict, epistatic models with random architectures

[...]

Ryan J. Urbanowicz¹, Jeff Kiralis¹, Nicholas A. Sinnott-Armstrong¹, Tamra Heberling¹, Jonathan M. Fisher¹, Jason H. Moore¹•Institutions (1)

Dartmouth College¹

01 Oct 2012-Biodata Mining

TL;DR: GAMETES is a fast, flexible, and precise tool for generating complex n-locus models with random architectures, and is proficient at generating the lower heritability models typically used in simulation studies evaluating new algorithms.

Abstract: Geneticists who look beyond single locus disease associations require additional strategies for the detection of complex multi-locus effects. Epistasis, a multi-locus masking effect, presents a particular challenge, and has been the target of bioinformatic development. Thorough evaluation of new algorithms calls for simulation studies in which known disease models are sought. To date, the best methods for generating simulated multi-locus epistatic models rely on genetic algorithms. However, such methods are computationally expensive, difficult to adapt to multiple objectives, and unlikely to yield models with a precise form of epistasis which we refer to as pure and strict. Purely and strictly epistatic models constitute the worst-case in terms of detecting disease associations, since such associations may only be observed if all n-loci are included in the disease model. This makes them an attractive gold standard for simulation studies considering complex multi-locus effects. We introduce GAMETES, a user-friendly software package and algorithm which generates complex biallelic single nucleotide polymorphism (SNP) disease models for simulation studies. GAMETES rapidly and precisely generates random, pure, strict n-locus models with specified genetic constraints. These constraints include heritability, minor allele frequencies of the SNPs, and population prevalence. GAMETES also includes a simple dataset simulation strategy which may be utilized to rapidly generate an archive of simulated datasets for given genetic models. We highlight the utility and limitations of GAMETES with an example simulation study using MDR, an algorithm designed to detect epistasis. GAMETES is a fast, flexible, and precise tool for generating complex n-locus models with random architectures. 
While GAMETES has a limited ability to generate models with higher heritabilities, it is proficient at generating the lower heritability models typically used in simulation studies evaluating new algorithms. In addition, the GAMETES modeling strategy may be flexibly combined with any dataset simulation strategy. Beyond dataset simulation, GAMETES could be employed to pursue theoretical characterization of genetic models and epistasis.
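
As a hedged illustration of what "pure" epistasis and the heritability constraint mean for a two-locus penetrance table, the sketch below checks a small hand-written model. It is not the GAMETES algorithm. The function names, the example table and the use of Hardy-Weinberg genotype frequencies are assumptions made for illustration, and heritability is computed with a commonly used definition (frequency-weighted variance of penetrance divided by K(1 − K), where K is the population prevalence), which the abstract itself does not spell out.

```python
from itertools import product


def hwe_genotype_freqs(maf):
    """Hardy-Weinberg frequencies for genotypes carrying 0, 1, 2 minor alleles."""
    q = maf
    p = 1.0 - q
    return [p * p, 2 * p * q, q * q]


def prevalence(penetrance, freqs_a, freqs_b):
    """Population prevalence K: frequency-weighted mean penetrance."""
    return sum(freqs_a[i] * freqs_b[j] * penetrance[i][j]
               for i, j in product(range(3), repeat=2))


def marginal_penetrances(penetrance, freqs_a, freqs_b):
    """Penetrance of each single-locus genotype, averaged over the other locus."""
    by_a = [sum(freqs_b[j] * penetrance[i][j] for j in range(3)) for i in range(3)]
    by_b = [sum(freqs_a[i] * penetrance[i][j] for i in range(3)) for j in range(3)]
    return by_a, by_b


def heritability(penetrance, freqs_a, freqs_b):
    """Assumed definition: sum f_ij * (p_ij - K)^2 / (K * (1 - K))."""
    k = prevalence(penetrance, freqs_a, freqs_b)
    variance = sum(freqs_a[i] * freqs_b[j] * (penetrance[i][j] - k) ** 2
                   for i, j in product(range(3), repeat=2))
    return variance / (k * (1.0 - k))


def is_pure(penetrance, freqs_a, freqs_b, tol=1e-12):
    """Pure epistasis: every single-locus marginal penetrance equals K."""
    k = prevalence(penetrance, freqs_a, freqs_b)
    by_a, by_b = marginal_penetrances(penetrance, freqs_a, freqs_b)
    return all(abs(m - k) <= tol for m in by_a + by_b)


# Hypothetical example table (rows: locus A genotype, columns: locus B genotype).
EXAMPLE_PENETRANCE = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]
FREQS = hwe_genotype_freqs(0.5)  # both SNPs assumed to have MAF 0.5

if __name__ == "__main__":
    print("prevalence:", prevalence(EXAMPLE_PENETRANCE, FREQS, FREQS))
    print("pure:", is_pure(EXAMPLE_PENETRANCE, FREQS, FREQS))
    print("heritability:", heritability(EXAMPLE_PENETRANCE, FREQS, FREQS))
```

For a two-locus model, the only proper subsets of loci are the single loci, so the marginal check above also covers strictness. Models with three or more loci would additionally need every lower-order subset checked.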

252 citations

Journal Article•10.1186/1756-0381-5-18•

Mining SOM expression portraits: feature selection and integrating concepts of molecular function

[...]

Henry Wirth¹,², Martin von Bergen², Hans Binder¹•Institutions (2)

Leipzig University¹, Helmholtz Centre for Environmental Research - UFZ²

08 Oct 2012-Biodata Mining

TL;DR: Self organizing maps typically contain enriched populations of gene sets well corresponding to molecular processes in the respective tissues, which allows the comprehensive downstream analysis of SOM-transformed expression data in terms of cluster-related gene lists and enriched gene sets for functional interpretation.

Abstract: Background Self organizing maps (SOM) enable the straightforward portraying of high-dimensional data of large sample collections in terms of sample-specific images. The analysis of their texture provides so-called spot-clusters of co-expressed genes which require subsequent significance filtering and functional interpretation. We address feature selection in terms of the gene ranking problem and the interpretation of the obtained spot-related lists using concepts of molecular function.

61 citations

Journal Article•10.1186/1756-0381-5-1•

Caipirini: using gene sets to rank literature

[...]

01 Feb 2012-Biodata Mining

TL;DR: Caipirini is the first service enabling literature search directly based on biological relevance to gene sets; thus, it gives the research community a new way to unlock hidden knowledge from gene sets derived via high-throughput experiments.

Abstract: Background Keeping up-to-date with bioscience literature is becoming increasingly challenging. Several recent methods help meet this challenge by allowing literature search to be launched based on lists of abstracts that the user judges to be 'interesting'. Some methods go further by allowing the user to provide a second input set of 'uninteresting' abstracts; these two input sets are then used to search and rank literature by relevance. In this work we present the service 'Caipirini' (http://caipirini.org) that also allows two input sets, but takes the novel approach of allowing ranking of literature based on one or more sets of genes.

60 citations

Journal Article•10.1186/1756-0381-5-5•

Visually integrating and exploring high throughput Phenome-Wide Association Study (PheWAS) results using PheWAS-View

[...]

Sarah A. Pendergrass¹, Scott M. Dudek¹, Dana C. Crawford², Marylyn D. Ritchie¹•Institutions (2)

Pennsylvania State University¹, Vanderbilt University Medical Center²

08 Jun 2012-Biodata Mining

TL;DR: Phenome-Wide Association Studies can be used to discover novel relationships between SNPs, phenotypes, and networks of interrelated phenotypes; identify pleiotropy; provide novel mechanistic insights; and foster hypothesis generation – and these results can be both explored and presented with PheWAS-View.

Abstract: Background Phenome-Wide Association Studies (PheWAS) can be used to investigate the association between single nucleotide polymorphisms (SNPs) and a wide spectrum of phenotypes. This is a complementary approach to Genome Wide Association studies (GWAS) that calculate the association between hundreds of thousands of SNPs and one or a limited range of phenotypes. The extensive exploration of the association between phenotypic structure and genotypic variation through PheWAS produces a set of complex and comprehensive results. Integral to fully inspecting, analysing, and interpreting PheWAS results is visualization of the data.

56 citations

Journal Article•10.1186/1756-0381-5-15•

Predicting the difficulty of pure, strict, epistatic models: metrics for simulated model selection

[...]

Ryan J. Urbanowicz¹, Jeff Kiralis¹, Jonathan M. Fisher¹, Jason H. Moore¹•Institutions (1)

Dartmouth College¹

26 Sep 2012-Biodata Mining

TL;DR: This study formally identifies and evaluates metrics which quantify model detection difficulty and utilizes these metrics to intelligently select models from a population of potential architectures for improved simulation study design which accounts for differences in detection difficulty attributed to model architecture.

Abstract: Background Algorithms designed to detect complex genetic disease associations are initially evaluated using simulated datasets. Typical evaluations vary constraints that influence the correct detection of underlying models (i.e. number of loci, heritability, and minor allele frequency). Such studies neglect to account for model architecture (i.e. the unique specification and arrangement of penetrance values comprising the genetic model), which alone can influence the detectability of a model. In order to design a simulation study which efficiently takes architecture into account, a reliable metric is needed for model selection.

43 citations

Journal Article•10.1186/1756-0381-5-11•

Murine colon proteome and characterization of the protein pathways

[...]

Sameh Magdeldin¹,², Yutaka Yoshida¹, Huiping Li¹,³, Yoshitaka Maeda¹, Munesuke Yokoyama¹, Shymaa Enany¹,², Ying Zhang¹, Bo Xu¹, Hidehiko Fujinaka¹, Eishin Yaoita¹, Sei Sasaki⁴, Tadashi Yamamoto¹•Institutions (4)

Niigata University¹, Suez Canal University², Sichuan University³, Tokyo Medical and Dental University⁴

28 Aug 2012-Biodata Mining

TL;DR: This high-confidence colon proteome catalogue will not only serve as a useful reference for further experiments characterizing differentially expressed proteins induced by diseased conditions, but will also aid in better understanding the ontology and functional absorptive mechanism of the colon.

Abstract: Most current proteomic research on the colon focuses on proteome alterations due to pathological disorders (e.g. colorectal cancer) rather than on the normal healthy state. As a result, information on the normal whole-tissue colon proteome is lacking. We report here a detailed murine (mouse) whole-tissue colon protein reference dataset composed of 1237 confident proteins (FDR < 2) with comprehensive insight into their peptide properties, cellular and subcellular localization, functional network GO annotation analysis, and relative abundances. The presented dataset covers a wide range of pI and Mw values, from 3–12 and 4–600 kDa, respectively. GRAVY index scoring predicted 19.5% membranous and 80.5% globularly located proteins. GO hierarchies and functional network analysis illustrated protein functions together with the relevance and implication of several candidates in malignancy, such as mitogen-activated protein kinase (Mapk8, 9) in colorectal cancer, fibroblast growth factor receptor (Fgfr 2), glutathione S-transferase (Gstp1) in prostate cancer, and cell division control protein (Cdc42), Ras-related protein (Rac1,2) in pancreatic cancer. Protein abundances calculated with 3 different algorithms (NSAF, PAF and emPAI) provide a relative quantification under normal conditions as guidance. This high-confidence colon proteome catalogue will not only serve as a useful reference for further experiments characterizing differentially expressed proteins induced by diseased conditions, but will also aid in better understanding the ontology and functional absorptive mechanism of the colon.
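
Two of the abundance measures named in the abstract have simple closed forms, sketched below under stated assumptions. NSAF is taken as each protein's spectral count divided by its length, normalised so that the values sum to 1 across the sample, and emPAI as 10 raised to the ratio of observed to observable peptides, minus 1. The function names and input numbers are hypothetical and are not taken from the paper. PAF is left out because the abstract gives no definition to anchor it to.

```python
def nsaf(spectral_counts, lengths):
    """Normalized spectral abundance factor for each protein in one sample.

    SAF_i = spectral_count_i / length_i; NSAF_i = SAF_i / sum of all SAF.
    """
    if len(spectral_counts) != len(lengths):
        raise ValueError("spectral_counts and lengths must have the same size")
    saf = [count / length for count, length in zip(spectral_counts, lengths)]
    total = sum(saf)
    return [value / total for value in saf]


def empai(observed_peptides, observable_peptides):
    """Exponentially modified protein abundance index: 10^(obs/observable) - 1."""
    if observable_peptides <= 0:
        raise ValueError("observable_peptides must be positive")
    return 10 ** (observed_peptides / observable_peptides) - 1


if __name__ == "__main__":
    # Hypothetical inputs: two proteins with 10 and 20 spectra,
    # 100 and 400 residues long.
    print(nsaf([10, 20], [100, 400]))
    print(empai(5, 10))
```

Because NSAF is normalised within a sample, its values are only comparable between proteins of the same run, which matches the abstract's framing of these numbers as relative quantification.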

42 citations

Journal Article•10.1186/1756-0381-5-7•

‘MicroRNA Targets’, a new AthaMap web-tool for genome-wide identification of miRNA targets in Arabidopsis thaliana

[...]

Lorenz Bülow¹, Julio C. Bolívar¹, Jonas Ruhe¹, Yuri Brill¹, Reinhard Hehl¹•Institutions (1)

Braunschweig University of Technology¹

16 Jul 2012-Biodata Mining

TL;DR: A novel web-tool, ‘MicroRNA Targets’, was integrated into AthaMap which permits the identification of genes predicted to be regulated by selected miRNAs; putative target sites of small RNAs from selected tissue datasets can be identified with the new ‘Small RNA Target’ web-tool.

Abstract: The AthaMap database generates a genome-wide map for putative transcription factor binding sites for A. thaliana. When analyzing transcriptional regulation using AthaMap it may be important to learn which genes are also post-transcriptionally regulated by inhibitory RNAs. Therefore, a unified database for transcriptional and post-transcriptional regulation will be highly useful for the analysis of gene expression regulation.

26 citations

Journal Article•10.1186/1756-0381-5-21•

Multivariate methods and software for association mapping in dose-response genome-wide association studies.

[...]

Chad Brown¹, Tammy M. Havener², Marisa W. Medina³, Ronald M. Krauss³, Howard L. McLeod², Alison A. Motsinger-Reif¹•Institutions (3)

North Carolina State University¹, University of North Carolina at Chapel Hill², Children's Hospital Oakland Research Institute³

12 Dec 2012-Biodata Mining

TL;DR: Overall, MANOVA was found to be the most powerful method for detecting real signals, and was also the most robust method for detection using alternatives generated with the previous simulation study.

Abstract: Background The large sample sizes, freedom from ethical restrictions and ease of repeated measurements make cytotoxicity assays of immortalized lymphoblastoid cell lines a powerful new in vitro method in pharmacogenomics research. However, previous studies may have over‐simplified the complex differences in dose‐response profiles between genotypes, resulting in a loss of power.

21 citations

Journal Article•10.1186/1756-0381-5-13•

An automated framework for hypotheses generation using literature

[...]

Vida Abedi¹,², Ramin Zand³, Mohammed Yeasin¹,², Fazle E. Faisal¹,²•Institutions (3)

University of Memphis¹, Florida State University College of Arts and Sciences², University of Tennessee Health Science Center³

29 Aug 2012-Biodata Mining

TL;DR: The proposed HGF is able to capture “crisp” direct and indirect associations, and provide knowledge discovery on demand, and is fast, efficient, and robust in generating new hypotheses to identify factors associated with a disease.

Abstract: Background In bio-medicine, exploratory studies and hypothesis generation often begin with researching existing literature to identify a set of factors and their association with diseases, phenotypes, or biological processes. Many scientists are overwhelmed by the sheer volume of literature on a disease when they plan to generate a new hypothesis or study a biological phenomenon. The situation is even worse for junior investigators, who often find it difficult to formulate new hypotheses or, more importantly, to check whether their hypothesis is consistent with existing literature. It is a daunting task to keep abreast of so much published work and also to remember all combinations of direct and indirect associations. Fortunately, there is a growing trend of using literature mining and knowledge discovery tools in biomedical research. However, there is still a large gap between the huge amount of effort and resources invested in disease research and the little effort spent on harvesting the published knowledge. The proposed hypothesis generation framework (HGF) finds “crisp semantic associations” among entities of interest, which is a step towards bridging such gaps.

21 citations

Journal Article•10.1186/1756-0381-5-2•

A multilevel layout algorithm for visualizing physical and genetic interaction networks, with emphasis on their modular organization

[...]

Johannes Tuikkala¹, Heidi Vähämaa¹, Pekka Salmela¹, Olli S. Nevalainen¹, Tero Aittokallio²,³•Institutions (3)

Information Technology University¹, University of Turku², University of Helsinki³

26 Mar 2012-Biodata Mining

TL;DR: A modified layout plug-in is implemented, named Multilevel Layout, which applies the conventional layout algorithms within a multilevel optimization framework to better capture the hierarchical modularity of many biological networks.

Abstract: Background Graph drawing is an integral part of many systems biology studies, enabling visual exploration and mining of large-scale biological networks. While a number of layout algorithms are available in popular network analysis platforms, such as Cytoscape, it remains poorly understood how well their solutions reflect the underlying biological processes that give rise to the network connectivity structure. Moreover, visualizations obtained using conventional layout algorithms, such as those based on the force-directed drawing approach, may become uninformative when applied to larger networks with dense or clustered connectivity structure.

Journal Article•10.1186/1756-0381-5-9•

Gene ontology analysis of pairwise genetic associations in two genome-wide studies of sporadic ALS.

[...]

Nora Chung Kim¹, Peter C. Andrews¹, Folkert W. Asselbergs², H. Robert Frost¹, Scott M. Williams¹, Brent T. Harris³, Cynthia Read⁴, Kathleen D. Askland⁴, Jason H. Moore¹,⁴•Institutions (4)

Dartmouth College¹, Utrecht University², Georgetown University Medical Center³, Brown University⁴

28 Jul 2012-Biodata Mining

TL;DR: Pathway analysis of pairwise genetic associations in two GWAS of sporadic ALS revealed a set of genes involved in cellular component organization and actin cytoskeleton that were not reported by prior GWAS, suggesting that pathway-level analysis of GWAS data may discover important associations not revealed using conventional one-SNP-at-a-time approaches.

Abstract: Background It is increasingly clear that common human diseases have a complex genetic architecture characterized by both additive and nonadditive genetic effects. The goal of the present study was to determine whether patterns of both additive and nonadditive genetic associations aggregate in specific functional groups as defined by the Gene Ontology (GO).

Journal Article•10.1186/1756-0381-5-12•

Finding biomarkers in non-model species: literature mining of transcription factors involved in bovine embryo development.

[...]

Nicolas Turenne¹, E. S. Tiys, Vladimir A. Ivanisenko, N. S. Yudin, Elena V. Ignatieva, Damien Valour¹, Séverine A. Degrelle¹, Isabelle Hue¹•Institutions (1)

Institut national de la recherche agronomique¹

29 Aug 2012-Biodata Mining

TL;DR: A workflow that allowed the pipeline processing of literature data and biological data, extracted from Web of Science (WoS) or PubMed but also from Gene Expression Omnibus (GEO), Gene Ontology (GO), Uniprot, HomoloGene, TcoF-DB and TFe (TF encyclopedia) is created.

Abstract: Since processes in well-known model organisms have specific features different from those in Bos taurus, the organism under study, a good way to describe gene regulation in ruminant embryos would be a species-specific consideration of closely related species to cattle, sheep and pig. However, as highlighted by a recent report, gene dictionaries in pig are smaller than in cattle, bringing a risk to reduce the gene resources to be mined (and so for sheep dictionaries). Bioinformatics approaches that allow an integration of available information on gene function in model organisms, taking into account their specificity, are thus needed. Besides these closely related and biologically relevant species, there is indeed much more knowledge of (i) trophoblast proliferation and differentiation or (ii) embryogenesis in human and mouse species, which provides opportunities for reconstructing proliferation and/or differentiation processes in other mammalian embryos, including ruminants. The necessary knowledge can be obtained partly from (i) stem cell or cancer research to supply useful information on molecular agents or molecular interactions at work in cell proliferation and (ii) mouse embryogenesis to supply useful information on embryo differentiation. However, the total number of publications for all these topics and species is great and their manual processing would be tedious and time consuming. This is why we used text mining for automated text analysis and automated knowledge extraction. To evaluate the quality of this “mining”, we took advantage of studies that reported gene expression profiles during the elongation of bovine embryos and defined a list of transcription factors (or TF, n = 64) that we used as biological “gold standard”. When successful, the “mining” approach would identify them all, as well as novel ones. 
To gain knowledge on molecular-genetic regulations in a non-model organism, we offer an approach based on literature-mining and score arrangement of data from model organisms. This approach was applied to identify novel transcription factors during bovine blastocyst elongation, a process that is not observed in rodents and primates. As a result, searching through human and mouse corpuses, we identified numerous bovine homologs, among which 11 to 14% of transcription factors including the gold standard TF as well as novel TF potentially important to gene regulation in ruminant embryo development. The scripts of the workflow are written in Perl and available on demand. They require data input coming from all various databases for any kind of biological issue once the data has been prepared according to keywords for the studied topic and species; we can provide a data sample to illustrate the use and functionality of the workflow. To do so, we created a workflow that allowed the pipeline processing of literature data and biological data, extracted from Web of Science (WoS) or PubMed but also from Gene Expression Omnibus (GEO), Gene Ontology (GO), Uniprot, HomoloGene, TcoF-DB and TFe (TF encyclopedia). First, the human and mouse homologs of the bovine proteins were selected, filtered by text corpora and arranged by score functions. The score functions were based on the gene name frequencies in corpora. Then, transcription factors were identified using TcoF-DB and double-checked using TFe to characterise TF groups and families. Thus, among a search space of 18,670 bovine homologs, 489 were identified as transcription factors. Among them, 243 were absent from the high-throughput data available at the time of the study. They thus stand so far for putative TF acting during bovine embryo elongation, but might be retrieved from a recent RNA sequencing dataset (Mamo et al., 2012).
Beyond the 246 TF that appeared expressed in bovine elongating tissues, we restricted our interpretation to those occurring within a list of 50 top-ranked genes. Among the transcription factors identified therein, half belonged to the gold standard (ASCL2, c-FOS, ETS2, GATA3, HAND1) and half did not (ESR1, HES1, ID2, NANOG, PHB2, TP53, STAT3). A workflow providing search for transcription factors acting in bovine elongation was developed. The model assumed that proteins sharing the same protein domains in closely related species had the same protein functionalities, even if they were differently regulated among species or involved in somewhat different pathways. Under this assumption, we merged the information on different mammalian species from different databases (literature and biology) and proposed 489 TF as potential participants of embryo proliferation and differentiation, with (i) a recall of 95% with regard to a biological gold standard defined in 2011 and (ii) an extension of more than 3 times the gold standard of TF detected so far in elongating tissues. The working capacity of the workflow was supported by the manual expertise of the biologists on the results. The workflow can serve as a new kind of bioinformatics tool to work on fused data sources and can thus be useful in studies of a wide range of biological processes.

Journal Article•10.1186/1756-0381-5-3•

Global tests of P-values for multifactor dimensionality reduction models in selection of optimal number of target genes

[...]

Hongying Dai¹, Madhusudan Bhandary², Mara L. Becker¹, J. Steven Leeder¹, Roger Gaedigk¹, Alison A. Motsinger-Reif³•Institutions (3)

Children's Mercy Hospital¹, Columbus State University², North Carolina State University³

22 May 2012-Biodata Mining

TL;DR: The proposed global tests can serve as a screening approach prior to individual tests to prevent false discovery; strong power in small sample sizes and well-controlled Type I error in the absence of GxG interactions make global tests highly recommended in epistasis studies.

Abstract: Multifactor Dimensionality Reduction (MDR) is a popular and successful data mining method developed to characterize and detect nonlinear complex gene-gene interactions (epistasis) that are associated with disease susceptibility. Because MDR uses a combinatorial search strategy to detect interaction, several filtration techniques have been developed to remove genes (SNPs) that have no interactive effects prior to analysis. However, the cutoff values implemented for these filtration methods are arbitrary, so different choices of cutoff values lead to different selections of genes (SNPs). We suggest incorporating a global test of p-values into filtration procedures to identify the optimal number of genes/SNPs for further MDR analysis, and demonstrate this approach using a ReliefF filter technique. We compare the performance of different global testing procedures in this context, including the Kolmogorov-Smirnov test, the inverse chi-square test, the inverse normal test, the logit test, the Wilcoxon test and Tippett’s test. Additionally, we demonstrate the approach on a real data application with a candidate gene study of drug response in Juvenile Idiopathic Arthritis. Extensive simulation of correlated p-values shows that the inverse chi-square test is the most appropriate approach to incorporate into the screening procedure to determine the optimal number of SNPs for the final MDR analysis. The Kolmogorov-Smirnov test has high inflation of Type I errors when p-values are highly correlated or when p-values peak near the center of the histogram. Tippett’s test has very low power when the effect size of GxG interactions is small. The proposed global tests can serve as a screening approach prior to individual tests to prevent false discovery. Strong power in small sample sizes and well-controlled Type I error in the absence of GxG interactions make global tests highly recommended in epistasis studies.
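
To make the idea of a global test of p-values concrete, here is a minimal sketch of two of the named procedures in their textbook form: the inverse chi-square test (commonly identified with Fisher's method) and Tippett's minimum-p test. Both formulas assume independent p-values, which is a simplification, since the abstract specifically studies correlated p-values. The function names are hypothetical, and this is not the authors' implementation or their ReliefF screening pipeline.

```python
import math


def fisher_combined_p(p_values):
    """Inverse chi-square (Fisher) global p-value, assuming independence.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom under the global null. For even degrees of freedom
    the upper tail has the closed form exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    Every p-value must lie in (0, 1].
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


def tippett_combined_p(p_values):
    """Tippett's global p-value from the smallest p, assuming independence."""
    k = len(p_values)
    return 1.0 - (1.0 - min(p_values)) ** k


if __name__ == "__main__":
    # Hypothetical p-values for a small set of SNPs kept by a filter.
    example = [0.04, 0.20, 0.01, 0.65]
    print("Fisher :", fisher_combined_p(example))
    print("Tippett:", tippett_combined_p(example))
```

Tippett's test looks only at the single smallest p-value, while the inverse chi-square test pools evidence from all of them, which is consistent with the abstract's remark that Tippett's test loses power when individual interaction effects are small.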

Journal Article•10.1186/1756-0381-5-10•

Logic Minimization and Rule Extraction for Identification of Functional Sites in Molecular Sequences

[...]

Raul Cruz-Cano¹, Mei-Ling Ting Lee¹, Ming-Ying Leung²•Institutions (2)

University of Maryland, College Park¹, University of Texas at El Paso²

16 Aug 2012-Biodata Mining

TL;DR: A method based on logic minimization to extract predictive rules for two bioinformatics problems involving the identification of functional sites in molecular sequences: transcription factor binding sites (TFBS) in DNA and O-glycosylation sites in proteins is proposed.

Abstract: Background Logic minimization is the application of algebraic axioms to a binary dataset with the purpose of reducing the number of digital variables and/or rules needed to express it. Although logic minimization techniques have been applied to bioinformatics datasets before, they have not been used in classification and rule discovery problems. In this paper, we propose a method based on logic minimization to extract predictive rules for two bioinformatics problems involving the identification of functional sites in molecular sequences: transcription factor binding sites (TFBS) in DNA and O-glycosylation sites in proteins. TFBS are important in various developmental processes and glycosylation is a posttranslational modification critical to protein functions.

Journal Article•10.1186/1756-0381-5-14•

Peer2ref: a peer-reviewer finding web tool that uses author disambiguation

[...]

Miguel A. Andrade-Navarro¹, Gareth A. Palidwor², Carolina Perez-Iratxeta²•Institutions (2)

Max Delbrück Center for Molecular Medicine¹, Ottawa Hospital Research Institute²

07 Sep 2012-Biodata Mining

TL;DR: A method called peer2ref, implemented in a web server that automatically suggests experts for peer-review among scientists that have authored manuscripts published during the last decade in more than 3,800 journals indexed in MEDLINE, is developed.

Abstract: Background Reviewer and editor selection for peer review is getting harder for authors and publishers because research is specializing into ever narrower areas as the body of knowledge grows. Examination of the literature facilitates finding appropriate reviewers but is time consuming and complicated by author name ambiguities.

Journal Article•

A robustness study to investigate the performance of parametric and non-parametric tests used in Model-Based Multifactor Dimensionality Reduction Epistasis Detection

[...]

Jestinah M. Mahachie John¹, Elena S. Gusareva¹, François Van Lishout¹, Kristel Van Steen¹•Institutions (1)

University of Liège¹

01 Jun 2012-Biodata Mining

Journal Article•10.1186/1756-0381-5-8•

A comparison and evaluation of five biclustering algorithms by quantifying goodness of biclusters for gene expression data

[...]

Li Li¹, Yang Guo¹, Wenwu Wu¹, Youyi Shi¹, Jian Cheng¹, Shiheng Tao¹•Institutions (1)

Northwest A&F University¹

23 Jul 2012-Biodata Mining

TL;DR: Two scoring methods, namely weighted enrichment (WE) scoring and PPI scoring, proved effective for validating the biological significance of the biclusters, and a significantly positive correlation between the two sets of scores demonstrated the consistency of the two methods.

Abstract: Several biclustering algorithms have been proposed to identify biclusters, in which genes share similar expression patterns across a number of conditions. However, different algorithms yield different biclusters and can therefore lead to distinct conclusions, so testing and comparison of these algorithms are needed. In this study, five biclustering algorithms (BIMAX, FABIA, ISA, QUBIC and SAMBA) were compared on two Arabidopsis thaliana (A. thaliana) expression datasets of different dimensions (GDS1620 and pathway). GO (gene ontology) annotation and a PPI (protein-protein interaction) network were used to verify the biological significance of the biclusters from the five algorithms. To compare the algorithms' performance and evaluate the quality of the identified biclusters, two scoring methods, weighted enrichment (WE) scoring and PPI scoring, were proposed. For each dataset, combining the scores of all biclusters into one unified ranking allowed a better evaluation of the performance and behavior of the five algorithms. Both the WE and PPI scoring methods proved effective for validating the biological significance of the biclusters, and a significantly positive correlation between the two sets of scores demonstrates the consistency of the two methods. The comparative study of the five algorithms revealed that: (1) ISA is the most effective of the five algorithms on the GDS1620 dataset, and BIMAX outperforms the other algorithms on the pathway dataset. (2) Both ISA and BIMAX are data-dependent. The former does not work well on datasets with few genes, while the latter holds up well on datasets with more conditions. (3) FABIA and QUBIC perform poorly in this study and may be better suited to large datasets with more genes and more conditions. (4) SAMBA is data-independent, as it performs well on both datasets. The comparison results provide useful information for researchers choosing a suitable algorithm for a given dataset.
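The abstract reports a positive correlation between the WE and PPI scores but does not say which correlation coefficient was used. As a minimal sketch, assuming a Spearman rank correlation and using hypothetical function names and made-up scores, the consistency check could look like this:

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the tie group while the next sorted value is equal.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical per-bicluster scores, for illustration only.
we_scores = [0.9, 0.4, 0.7, 0.1]
ppi_scores = [12.0, 5.0, 9.0, 6.0]
rho = spearman(we_scores, ppi_scores)
```

A rank-based coefficient is used here because the two scores are on different scales; whether the study did the same is an assumption.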


Journal Article•10.1186/1756-0381-5-20•

Application of a spatially-weighted Relief algorithm for ranking genetic predictors of disease

[...]

Matthew E. Stokes¹, Shyam Visweswaran¹•Institutions (1)

University of Pittsburgh¹

03 Dec 2012-Biodata Mining

TL;DR: A new Relief algorithm called SWRF* was developed that had greater ability to identify interacting genetic variants in synthetic data compared to existing Relief algorithms.


Abstract: Background: Identification of genetic variants that are associated with disease is an important goal in elucidating the genetic causes of diseases. The genetic patterns that are associated with common diseases are complex and may involve multiple interacting genetic variants. The Relief family of algorithms is a powerful tool for efficiently identifying genetic variants that are associated with disease, even if the variants have nonlinear interactions without significant main effects. Many variations of Relief have been developed over the past two decades and several of them have been applied to single nucleotide polymorphism (SNP) data.
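To make the Relief idea concrete, here is a minimal sketch of the original, unweighted Relief for discrete genotypes and a binary phenotype. It is not SWRF*, whose spatial weighting is not described in the abstract shown here; the function name, the Hamming distance, and the toy data are assumptions for illustration.

```python
import itertools
import random


def relief_scores(X, y, n_iter=None, seed=0):
    """Basic Relief: reward features that differ at the nearest miss
    (other class) and penalise features that differ at the nearest hit
    (same class)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    n_iter = n if n_iter is None else n_iter
    weights = [0.0] * p

    def distance(a, b):
        # Hamming distance: number of features whose values differ.
        return sum(ai != bi for ai, bi in zip(a, b))

    for _ in range(n_iter):
        i = rng.randrange(n)
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        if not hits or not misses:
            continue  # cannot score without both neighbour types
        hit = min(hits, key=lambda j: distance(X[i], X[j]))
        miss = min(misses, key=lambda j: distance(X[i], X[j]))
        for f in range(p):
            weights[f] -= (X[i][f] != X[hit][f]) / n_iter
            weights[f] += (X[i][f] != X[miss][f]) / n_iter
    return weights


# Toy data: the class is the XOR of features 0 and 1, so neither has a
# main effect on its own; feature 2 is irrelevant noise.
X = [list(row) for row in itertools.product([0, 1], repeat=3)]
y = [row[0] ^ row[1] for row in X]
scores = relief_scores(X, y)
```

On this toy XOR data the noise feature ends up with the lowest score, which illustrates why Relief-style methods can pick up interacting variants that lack main effects.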


Journal Article•10.1186/1756-0381-5-4•

Weighted multiple testing procedures for genomic studies

[...]

Jiang Gui¹,², Tor D. Tosteson², Mark E. Borsuk²•Institutions (2)

Dartmouth–Hitchcock Medical Center¹, Dartmouth College²

07 Jun 2012-Biodata Mining

TL;DR: Recent developments in three distinct but closely related methods involving p-value weighting to improve statistical power while also controlling the false discovery rate or the family-wise error rate are reviewed.


Abstract: With the rapid development of biological technology, thousands of genes or SNPs can be measured simultaneously. Improved procedures for multiple hypothesis testing when the number of tests is very large are critical for interpreting genomic data. In this paper, we review recent developments in three distinct but closely related methods involving p-value weighting to improve statistical power while also controlling the false discovery rate or the family-wise error rate.
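The abstract does not name the three methods it reviews. As one illustration of p-value weighting, the sketch below assumes a weighted Benjamini-Hochberg procedure: each p-value is divided by a non-negative weight (weights rescaled to average 1), and the usual step-up rule is applied to the weighted p-values. The function name and the example numbers are hypothetical.

```python
def weighted_bh(pvalues, weights, alpha=0.05):
    """Weighted Benjamini-Hochberg: returns a list of booleans marking
    which hypotheses are rejected at FDR level alpha."""
    m = len(pvalues)
    if len(weights) != m:
        raise ValueError("pvalues and weights must have the same length")
    if any(w < 0 for w in weights) or sum(weights) == 0:
        raise ValueError("weights must be non-negative and not all zero")

    # Rescale so the weights average to 1, as the procedure requires.
    mean_w = sum(weights) / m
    w = [wi / mean_w for wi in weights]

    # Weighted p-values; a zero weight means the test cannot be rejected.
    q = [min(p / wi, 1.0) if wi > 0 else 1.0 for p, wi in zip(pvalues, w)]

    # Step-up rule: find the largest rank k with q_(k) <= alpha * k / m.
    order = sorted(range(m), key=lambda i: q[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if q[i] <= alpha * rank / m:
            k = rank

    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected


# Hypothetical example: up-weighting the third test lets it be rejected.
pvals = [0.001, 0.02, 0.04, 0.5]
unweighted = weighted_bh(pvals, [1, 1, 1, 1])
weighted = weighted_bh(pvals, [1, 1, 2, 0])
```

With equal weights this reduces to the ordinary Benjamini-Hochberg procedure; prior information enters only through the weights, which must be chosen without looking at the p-values themselves.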


Journal Article•10.1186/1756-0381-5-6•

How do alignment programs perform on sequencing data with varying qualities and from repetitive regions?

[...]

Xiaoqing Yu¹, Kishore Guda¹, Joseph Willis¹, Marty L Veigl¹, Zhenghe John Wang², Sanford D. Markowitz¹, Mark Raymond Adams², Shuying Sun¹•Institutions (2)

Case Western Reserve University¹, J. Craig Venter Institute²

18 Jun 2012-Biodata Mining

TL;DR: This study provides a systematic comparison of commonly used alignment algorithms in the context of sequencing data with varying qualities and from repetitive regions and shows that Novoalign is more sensitive to the improvement of data quality than other alignment programs.


Abstract: Background: Next-generation sequencing technologies generate a large number of short reads that are used to address a variety of biological questions. However, sequencing reads often have low quality at the 3’ end or are generated from repetitive regions of a genome. It is unclear how different alignment programs perform in these cases. To investigate this question, we use both real data and simulated data with the above issues to evaluate the performance of four commonly used algorithms: SOAP2, Bowtie, BWA, and Novoalign.
