1. What is the impact of data preprocessing steps on differential expression analysis (DEA) performance in proteomics data?
The impact of data preprocessing steps on DEA performance in proteomics data is poorly understood. However, studies have shown that the choice of normalization method and of DEA statistical method influences performance more than the other steps do. High-performing workflows tend to use no normalization and MinProb for imputation, and to avoid simple statistical tools for DEA. Studying how the choices at each step of a workflow affect DEA performance provides a resource to guide workflow selection on new datasets. Additionally, an ensemble inference strategy that integrates DEA results from individual top-performing workflows can further improve DEA performance by 1 to 5% for FragPipe-based and 1 to 4% for MaxQuant-based DDA data, and by 2 to 4% for DIA-NN-based DIA data. Combining multiple proteomic data layers provides complementary information that enhances DEA outcomes.
2. What are the performance metrics used to evaluate candidate workflows for proteomic datasets?
The performance metrics used to evaluate candidate workflows for proteomic datasets are the partial area under the receiver operating characteristic curve (pAUC) at false-positive rate (FPR) thresholds of 0.01, 0.05, and 0.1, the normalized Matthews correlation coefficient (nMCC), and the geometric mean of specificity and recall (G-mean). These metrics assess how well a workflow identifies differentially expressed proteins. The mean or median performance of a workflow across multiple datasets is then used to establish its overall performance ranking. Leave-one-project-out cross-validation (LOPOCV) is also employed to test whether the workflow ranks obtained by benchmarking are consistent with the true performance on new datasets.
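The three metrics can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the benchmark's own implementation: nMCC is taken to be (MCC + 1) / 2, pAUC is the unnormalized trapezoidal area under the ROC curve for FPR between 0 and the threshold, and the function names are hypothetical.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def nmcc(y_true, y_pred):
    """Normalized MCC, assumed here to be (MCC + 1) / 2, so it lies in [0, 1]."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return (mcc + 1.0) / 2.0


def g_mean(y_true, y_pred):
    """Geometric mean of specificity and recall."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(recall * specificity)


def partial_auc(y_true, scores, max_fpr):
    """Unnormalized area under the ROC curve for FPR in [0, max_fpr].

    Higher scores mean stronger evidence of differential expression.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]  # ROC points as (FPR, TPR)
    tp = fp = 0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Tied scores are processed together so they yield one ROC point.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Cut the last segment at max_fpr by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

For the thresholds named above, `partial_auc` would be called with `max_fpr` set to 0.01, 0.05, and 0.1 in turn; here `y_true` marks the truly differentially expressed proteins and `scores` could be, for example, one minus the adjusted p-value.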
3. What are the common frequent selection patterns identified for 'H' workflows across three quantification platforms?
The frequent selection patterns shared by 'H' workflows across the three quantification platforms are the normalization methods 'center.mean', 'center.median', and 'None', each with a support ratio greater than 10%, and the MVI algorithms 'MinProb' and 'MinDet' for imputing missing values. Additionally, protein MaxLFQ intensity is frequently coupled with limma in FragPipe and DIA-NN 'H' workflows, and protein peak intensity is coupled with limma and ROTS in DIA-NN 'H' workflows. These latter patterns are platform-specific and highly recommended for improving workflow performance.
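As an illustration of what a support ratio measures, the sketch below computes the fraction of workflows that contain a given pattern of choices. It assumes support is simply that fraction; the function name, the dictionary keys, and the toy workflow list are hypothetical and are not results from the benchmark.

```python
def pattern_support(workflows, pattern):
    """Fraction of workflows whose settings include every item in `pattern`."""
    if not workflows:
        return 0.0
    hits = sum(
        1
        for wf in workflows
        if all(wf.get(step) == option for step, option in pattern.items())
    )
    return hits / len(workflows)


# Toy, made-up list of 'H' workflows (one dict of choices per workflow).
toy_workflows = [
    {"matrix": "protein MaxLFQ", "normalization": "None", "mvi": "MinProb", "dea": "limma"},
    {"matrix": "protein MaxLFQ", "normalization": "center.median", "mvi": "MinProb", "dea": "limma"},
    {"matrix": "protein peak", "normalization": "None", "mvi": "MinDet", "dea": "ROTS"},
    {"matrix": "protein peak", "normalization": "center.mean", "mvi": "MinProb", "dea": "limma"},
]

# Support of a single choice, and of a coupled pair of choices.
none_support = pattern_support(toy_workflows, {"normalization": "None"})
maxlfq_limma_support = pattern_support(
    toy_workflows, {"matrix": "protein MaxLFQ", "dea": "limma"}
)
```

A pattern would count as frequent in this sketch when its support exceeds the chosen cutoff, such as the 10% mentioned above.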
4. What are the preferred expression matrix types and normalization methods for high-performing workflows in DEA?
For high-performing workflows in DEA, the preferred expression matrix types are MaxLFQ intensity at the protein level and peptide MaxLFQ intensity for DIA data. At the peptide level, peak intensities work better in MaxQuant-related workflows. For normalization, no normalization (F:None), center.mean, and center.median are frequently found in high-performing workflows. These choices are recommended based on benchmark results; normalization may still be warranted in some situations, in which case it should be tailored to the specific data and research question.
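For readers unfamiliar with the centering methods, the following is a minimal sketch of what center.median and center.mean are assumed to do here: subtract each sample's median (or mean) of observed log-intensities from that sample's values. The function name and the matrix layout (rows are proteins, columns are samples, `None` marks a missing value) are illustrative assumptions, not the benchmark's code.

```python
from statistics import mean, median


def center_columns(matrix, stat=median):
    """Subtract each column's `stat` (median or mean) of observed values.

    `matrix` is a list of rows (proteins) of log-intensities, one column per
    sample. Missing values (None) are ignored when computing the center and
    stay missing in the output.
    """
    n_cols = len(matrix[0])
    centers = []
    for j in range(n_cols):
        observed = [row[j] for row in matrix if row[j] is not None]
        centers.append(stat(observed))
    return [
        [None if value is None else value - centers[j] for j, value in enumerate(row)]
        for row in matrix
    ]


# Toy log2 intensities: 3 proteins x 2 samples, one missing value.
log_intensities = [
    [10.0, 12.0],
    [11.0, None],
    [12.0, 14.0],
]
median_centered = center_columns(log_intensities, stat=median)
mean_centered = center_columns(log_intensities, stat=mean)
```

Choosing 'None' corresponds to skipping this step and passing the matrix on unchanged.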