1. What are the key features extracted from protein sequences for predicting protein-protein interactions?
Key sequence-derived features for predicting protein-protein interactions include Conjoint Triad (CT), Auto Covariance (AC), Amino Acid Composition (AAC), Sequence-Order (SO), and Dipeptide Composition (DPC). These descriptors let a classifier work from primary-structure information alone, that is, from the amino acid sequence itself. In addition, embedding techniques such as word2vec and doc2vec produce distributed vector representations of protein sequences that machine learning models can consume directly. Because the resulting feature vectors are high-dimensional, linear dimensionality reduction methods such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are used to map them into a lower-dimensional space before prediction models are built.
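To make two of these descriptors concrete, the following is a minimal sketch of AAC and DPC under their usual textbook definitions: AAC is the fraction of each of the 20 standard amino acids, and DPC is the fraction of each of the 400 ordered residue pairs. The function names are hypothetical, and the way non-standard residues are handled is an assumption; the tool used in the study may differ in both respects.

```python
from itertools import product

# The 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered dipeptides: "AA", "AC", ..., "YY".
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(sequence):
    """Return a 20-dimensional vector of residue frequencies."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Return a 400-dimensional vector of adjacent-pair frequencies."""
    seq = sequence.upper()
    if len(seq) < 2:
        raise ValueError("sequence needs at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        # Assumption: pairs containing non-standard letters (e.g. "X") are skipped.
        if pair in counts:
            counts[pair] += 1
    total = len(seq) - 1
    return [counts[pair] / total for pair in DIPEPTIDES]


aac = amino_acid_composition("ACCA")
dpc = dipeptide_composition("ACCA")
print(len(aac), len(dpc))
```

Even these two simple descriptors together give 420 dimensions per protein, which illustrates why dimensionality reduction becomes relevant once more descriptor groups are added.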
2. How are features extracted for interacting and non-interacting proteins?
Features for interacting and non-interacting proteins are extracted using iLearnPlus, a comprehensive automated feature extraction tool based on machine learning. The extracted features are categorized into various groups with different descriptors of varied dimensions. Due to the high dimensionality of these features, linear dimensionality reduction techniques like PCA and LDA are applied to map them into lower dimensions. However, this process may eliminate certain relevant features, affecting the efficiency of classifying interacting and non-interacting proteins. To address this, three datasets were created: Dataset 1 with PCA, Dataset 2 with LDA, and Dataset 3 without dimensionality reduction. These datasets were then divided into training and testing data at a ratio of 8:2. The datasets were classified using a random forest classifier, and the results were analyzed to determine the classification accuracy. The detailed architecture of this process is provided in Figure 1.
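The three-way comparison described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the study's code: the data is synthetic, the function name `compare_reductions` and the PCA component count are hypothetical, and the sketch splits the data 8:2 first and fits each reducer on the training portion only, whereas the text describes building the reduced datasets before splitting.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def compare_reductions(X, y, n_pca_components=10, seed=0):
    """Score a random forest with PCA, with LDA, and with no reduction."""
    # 8:2 train/test split, stratified so both classes stay balanced.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    reducers = {
        "pca": PCA(n_components=n_pca_components),
        # With two classes, LDA can produce at most one component.
        "lda": LinearDiscriminantAnalysis(n_components=1),
        "none": None,
    }
    scores = {}
    for name, reducer in reducers.items():
        forest = RandomForestClassifier(n_estimators=100, random_state=seed)
        model = forest if reducer is None else make_pipeline(reducer, forest)
        model.fit(X_train, y_train)
        scores[name] = model.score(X_test, y_test)  # test-set accuracy
    return scores


# Synthetic stand-in for extracted features: 200 samples, 40 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
y = np.repeat([0, 1], 100)
X[y == 1, :5] += 1.0  # shift a few columns so the classes are separable
scores = compare_reductions(X, y)
print(scores)
```

Fitting the reducer inside a pipeline keeps test samples out of the projection step, which is a design choice of this sketch rather than something the text specifies.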
3. How is the dataset balanced for the study?
To balance the dataset, all 22,383 positive samples are kept as interacting proteins, and only 22,383 of the 223,821 negative samples are selected as non-interacting proteins. Matching the class sizes helps mitigate the imbalance and the bias it could introduce when classifying protein-protein interactions. The dataset consists of human and virus protein sequences, which are handled separately during feature extraction so that the two can be distinguished. The balanced dataset is intended to support a more accurate classification model for the study.
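A minimal sketch of this kind of undersampling is shown below. The text does not say how the negative samples were chosen, so uniform random selection with a fixed seed is an assumption, as are the function name and the toy identifiers.

```python
import random


def balance_by_undersampling(positives, negatives, seed=42):
    """Keep every positive sample and draw an equal number of negatives."""
    if len(negatives) < len(positives):
        raise ValueError("not enough negative samples to match the positives")
    rng = random.Random(seed)
    # Assumption: negatives are drawn uniformly at random without replacement.
    sampled_negatives = rng.sample(negatives, k=len(positives))
    labelled = [(pair, 1) for pair in positives]
    labelled += [(pair, 0) for pair in sampled_negatives]
    rng.shuffle(labelled)  # avoid a block of positives followed by negatives
    return labelled


# Toy (human, virus) pairs standing in for the real dataset.
positives = [(f"H{i}", f"V{i}") for i in range(5)]
negatives = [(f"H{i}", f"V{i + 100}") for i in range(50)]
balanced = balance_by_undersampling(positives, negatives)
print(len(balanced), sum(label for _, label in balanced))
```

With the counts given in the text, the same call would return 44,766 labelled samples, half of them positive.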
4. What is the purpose of grouping amino acids in protein sequences?
Grouping amino acids in protein sequences helps characterize the proteins involved in protein-protein interactions. By categorizing amino acids according to their chemical characteristics, researchers can identify which residues and residue classes contribute to an interaction. This grouping allows a more detailed analysis of protein sequences and their functional roles in biological processes. Grouping is also central to feature extraction, as it underlies feature groups such as Amino Acid Composition (AAC), Grouped Amino Acid Composition (GAAC), Autocorrelation (AC), Quasi-Sequence-Order (QSO), and Pseudo-Amino Acid Composition (PAAC). These feature groups capture structural and functional properties of proteins and so aid in understanding protein-protein interactions and their implications in biological systems.
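As an illustration of grouping by chemical characteristics, the sketch below computes a grouped composition in the style of GAAC. The five-class partition (aliphatic, aromatic, positively charged, negatively charged, uncharged) is one commonly used choice and is an assumption here, since the text does not list the groups used in the study; the function name is hypothetical.

```python
# Assumed partition of the 20 standard amino acids into five chemical classes.
GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive": "KRH",
    "negative": "DE",
    "uncharged": "STCPNQ",
}


def grouped_composition(sequence):
    """Return the fraction of residues that fall into each chemical class."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return {
        name: sum(seq.count(aa) for aa in members) / len(seq)
        for name, members in GROUPS.items()
    }


gaac = grouped_composition("GAFK")
print(gaac)
```

Collapsing 20 residue types into five classes yields a much shorter vector than AAC, at the cost of no longer distinguishing residues within a class.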