Data Resources for Structural Bioinformatics

Question

1. What is the Protein Data Bank (PDB) and what is its purpose?

2. What are the limitations of experimental structure determination techniques?

3. What information does an ATOM line in a PDB file contain?

4. What additional descriptions and data can be accessed through the PDB's browser-based user interface?

Accepted Answer

The Protein Data Bank (PDB), established in 1971, is a freely accessible, single global archive of experimentally determined structures of biological macromolecules. It gives researchers a comprehensive resource for studying the three-dimensional structures of proteins, nucleic acids, and complex assemblies. These structures offer insight into the molecular mechanisms of biological processes and support the development of new drugs and therapies. The PDB also serves as the platform through which scientists deposit, share, and analyze structural data.

Accepted Answer

Each experimental structure determination technique, whether X-ray crystallography, nuclear magnetic resonance (NMR), or cryo-electron microscopy (cryo-EM), has its own limitations. X-ray crystallography provides high-resolution structures but requires the protein to crystallize, which is not always possible. NMR is suited to smaller proteins but generally yields lower resolution than X-ray crystallography. Cryo-EM is better suited to large complexes but may yield lower resolution for smaller proteins. Being aware of these limitations matters when experimental data are used to build computational models, and it helps researchers choose the most appropriate technique for a given protein.

Accepted Answer

An ATOM line in a PDB file describes one resolved atom in the structure, with each field occupying fixed columns. The record name indicates the type of line, ATOM being the most common. The atom serial number is the atom's index within the whole structural complex; numbering does not restart for the next molecule in the complex. The atom name is an abbreviation, such as CA for the alpha carbon. The alternate location indicator is a character (A, B, C, etc.) that distinguishes alternative positions observed for the same atom. The residue name gives the amino-acid residue to which the atom belongs, in three-letter notation. The chain identifier indicates the molecule to which the atom belongs. The residue sequence number gives the position of the residue in the chain; occasional jumps occur where residues could not be resolved. The insertion code is rarely used and helps keep important residues at matching numbers across different versions of a structure. The X, Y, and Z fields give the spatial coordinates of the atom in angstroms (Å).
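
Because the format is column-based, the fields listed in this answer can be read by slicing fixed character positions. The following is a minimal Python sketch, assuming the column ranges of the legacy PDB file format; the function name `parse_atom_line` and the sample record are illustrative and not taken from a real PDB entry.

```python
def parse_atom_line(line):
    """Split one fixed-width ATOM record into the fields described above."""
    return {
        "record": line[0:6].strip(),        # columns 1-6: record name
        "serial": int(line[6:11]),          # columns 7-11: atom serial number
        "atom_name": line[12:16].strip(),   # columns 13-16: atom name
        "alt_loc": line[16].strip(),        # column 17: alternate location
        "res_name": line[17:20].strip(),    # columns 18-20: residue name
        "chain": line[21].strip(),          # column 22: chain identifier
        "res_seq": int(line[22:26]),        # columns 23-26: residue number
        "ins_code": line[26].strip(),       # column 27: insertion code
        "x": float(line[30:38]),            # columns 31-38: X in angstroms
        "y": float(line[38:46]),            # columns 39-46: Y in angstroms
        "z": float(line[46:54]),            # columns 47-54: Z in angstroms
    }


# Illustrative record, assembled field by field so the columns are explicit.
SAMPLE = (
    "ATOM  "      # record name
    "    2"       # serial
    " "           # unused column 12
    " CA "        # atom name
    " "           # alternate location (blank)
    "MET"         # residue name
    " "           # unused column 21
    "A"           # chain
    "   1"        # residue sequence number
    " "           # insertion code (blank)
    "   "         # unused columns 28-30
    "  26.266"    # X
    "  25.413"    # Y
    "   2.842"    # Z
)

atom = parse_atom_line(SAMPLE)
print(atom["atom_name"], atom["res_name"], atom["chain"], atom["res_seq"])
```

Slicing by position rather than splitting on whitespace is deliberate: blank fields such as the alternate location indicator would otherwise shift every later field.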

Accepted Answer

The PDB's browser-based user interface provides additional descriptions, derived data, and relevant cross references. Users can inspect a structure's Ramachandran plot, structure validation reports, and detailed descriptions of experimental methods used to determine a structure. The feature viewer displays data derived by the PDB, such as secondary structure, disorder calculations, and hydrophobicity, alongside information from other databases like UniProt, Pfam, and Phosphosite. It also offers structural information on PDB entries from the PDBsum web server and access to homology models from the Structural Biology Knowledgebase (SBKB) and Protein Model Portal.

Accepted Answer

The FAIR data principles state that data should be Findable, Accessible, Interoperable, and Reusable. Introduced by Wilkinson et al. in 2016, they aim to enhance data reuse by ensuring that data can be easily located, understood, and used by others. Structural biology adopted these ideas early on: the PDB provides a consolidated source of experimental structure data in standardized formats, accompanied by extensive provenance and metadata. This adherence to FAIR principles has contributed to the PDB's reputation as a reliable source of structure data.

Accepted Answer

In structure analysis and annotation, topics such as structure validation, secondary structure calling, structure classification, and domain definition should be discussed. These topics involve further analysing and annotating protein structures obtained from experiments or computational models. Resources often integrate precomputed analyses and annotations, making secondary structure, domains, and validation reports readily available. These analyses are crucial for a comprehensive understanding of protein structures and their functional implications.

Accepted Answer

The three aspects of validating atomic models are confirming the validity of the actual measurements, confirming that the model is consistent with the measurements, and confirming that the model adheres to physical and chemical constraints. Users of experimentally determined models often lack access to the raw measurements and can therefore usually check only the third aspect. Structural features should follow the distributions observed in known structures, and outliers require strong experimental evidence. Chemical checks include chirality, hydrogen bonding, and side-chain packing. Tools such as WHAT_CHECK, PROCHECK, and MolProbity support automated quality checking. Secondary structure assignment can be done manually or with programs like DSSP and HBplus.
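
One structural feature that is commonly compared against distributions from known structures is the backbone dihedral angle, the quantity plotted in a Ramachandran plot. The following is a minimal sketch of a dihedral calculation from four atom positions using a standard atan2 formulation. The function names and the toy coordinates are illustrative, and no validation tool's actual implementation is implied.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def _reject(v, unit_axis):
    """Return the component of v perpendicular to unit_axis."""
    k = _dot(v, unit_axis)
    return tuple(x - k * u for x, u in zip(v, unit_axis))


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees around the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    length = math.sqrt(_dot(b1, b1))
    axis = tuple(c / length for c in b1)
    # Project the two outer bonds onto the plane perpendicular to the axis.
    v = _reject(b0, axis)
    w = _reject(b2, axis)
    return math.degrees(math.atan2(_dot(_cross(axis, v), w), _dot(v, w)))


# Toy coordinates: the outer bonds are perpendicular when viewed along z.
angle = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(round(angle, 1))
```

For a residue i, phi would be computed from the atoms C(i-1), N(i), CA(i), C(i), and psi from N(i), CA(i), C(i), N(i+1).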

Accepted Answer

Structural classification schemes, such as SCOP and CATH, provide a curated gold standard of homologous relations. Because structure is generally more conserved than sequence, the relationships recorded in these schemes can be used to validate sequence-based homology search methods like (PSI-)BLAST, HMMER, and HHblits. The classification considers features such as sequence similarity, shared function, conserved secondary structure elements, topology, and structural alignment scores. The hierarchical grouping and ordering of protein structures in these schemes correspond to biologically relevant features, so agreement with them indicates that a sequence-based method identifies homologous relationships accurately.

Accepted Answer

The SCOP database classifies protein structures on four hierarchical levels: family, superfamily, fold, and class. Families are assigned based on significant sequence similarity together with similar function and structure. Superfamilies group families whose structural and functional features suggest a common evolutionary origin even when sequence similarity is low. Folds group superfamilies that share the same major secondary structure elements with similar packing arrangements. Classes are divided by secondary structure content: all-α, all-β, α/β, and α+β. The lower levels, family and superfamily, aim to group homologous proteins with shared evolutionary ancestry.

Accepted Answer

CATH, like SCOP, is a hierarchical structural classification database based on structure, function, and sequence similarity. Unlike SCOP, however, CATH aims to automate its classification process while maintaining biological relevance. The hierarchy in CATH consists of the Class, Architecture, Topology, and Homology levels. The Architecture level is unique to CATH and represents the overall shape defined by the secondary structures without considering their connectivity. Challenges in fully automatic assignment include recognizing domain boundaries, distinguishing between the Homology and Topology levels, and grouping families into homologous groups. Manual curation is therefore still necessary in CATH.

Accepted Answer

Domain shuffling refers to the evolutionary phenomenon in which entire domains are duplicated, deleted, or inserted next to other domains, much like building blocks. Domains are conserved protein regions, and rearranging them within a protein sequence can create novel proteins with new functions. The process is commonly observed in protein evolution across various organisms and contributes to the diversity and complexity of protein structures. It is therefore an important concept for understanding protein structure and function, as well as the evolutionary history of proteins. Several databases, such as CATH, SCOP, and PDB, provide domain definitions that help in identifying and studying domain shuffling events.

Accepted Answer

A protein's sequence contains a great deal of information about its structure and function. Even as predicted structures become more accurate, sequence data and sequence analysis remain integral to bioinformatics research. Sequence databases, although not as consolidated as structure databases, offer protein, RNA, and DNA sequences with annotations and cross-references. UniProt is a crucial sequence database in structural bioinformatics.

Accepted Answer

The core data in the UniProt database comprise the protein's amino acid sequence, its name or description, taxonomic data, and citation information. These data are enriched with annotations spanning common biological ontologies, classifications, and cross-references, and each annotation carries an indication of its quality through attribution of experimental or computational evidence. The database consists of two main parts, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL: Swiss-Prot provides human-curated annotations, while TrEMBL contains protein sequences annotated by computational techniques.
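
The Swiss-Prot/TrEMBL distinction is visible in UniProtKB FASTA headers, where the prefix `sp` marks a Swiss-Prot entry and `tr` a TrEMBL entry. The following is a minimal sketch that splits such a header into its leading fields. The function name `parse_uniprot_header` is hypothetical, and the example header is written here for illustration in the UniProtKB header layout.

```python
def parse_uniprot_header(header):
    """Extract section, accession, entry name and protein name from a header."""
    # Layout assumed: >db|ACCESSION|ENTRY_NAME Protein name OS=... OX=... ...
    db, accession, rest = header.lstrip(">").split("|", 2)
    entry_name, _, description = rest.partition(" ")
    # The protein name runs up to the first " OS=" (organism) tag.
    protein_name = description.split(" OS=", 1)[0]
    section = {"sp": "Swiss-Prot", "tr": "TrEMBL"}.get(db, db)
    return {
        "section": section,
        "accession": accession,
        "entry_name": entry_name,
        "protein_name": protein_name,
    }


EXAMPLE_HEADER = (
    ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha "
    "OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2"
)

info = parse_uniprot_header(EXAMPLE_HEADER)
print(info["section"], info["accession"], info["protein_name"])
```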

Accepted Answer

Databases containing modeled structures include the Swiss-Model Repository (SMR) and ModBase. The Swiss-Model Repository offers access to structures produced by the Swiss-Model homology modeling pipeline, while ModBase provides access to structures generated by the ModPipe pipeline. Additionally, the Protein Model Portal (PMP) provides a single interface to query SMR, ModBase, and models generated by several partners of the Protein Structure Initiative (PSI). Since the release of AlphaFold v2.0 by DeepMind, protein sequences in UniProt have also been accompanied by predicted structural models, collected in AlphaFold DB. Not all predicted structures are equally good, however: the reliability of a model should be assessed using the per-residue model-confidence estimates and the pairwise predicted aligned errors provided with it.
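
In PDB-format files from AlphaFold DB, the per-residue confidence score (pLDDT, on a scale from 0 to 100) is stored in the B-factor column (columns 61-66) of each ATOM record. The following is a minimal sketch that collects one score per residue from the CA atoms and flags low-confidence residues. The cutoff of 70 follows the commonly quoted confidence bands and is an assumption here, and the function names and toy records are illustrative.

```python
def make_atom(serial, name, res, chain, res_seq, x, y, z, b_factor):
    """Format a toy ATOM record in fixed columns (occupancy fixed at 1.00)."""
    return (
        f"ATOM  {serial:5d} {name:<4s} {res:>3s} {chain}{res_seq:4d}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{b_factor:6.2f}"
    )


def plddt_per_residue(lines):
    """Map (chain, residue number) to the value in the B-factor column."""
    scores = {}
    for line in lines:
        # Use the CA atom as the representative atom of each residue.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            key = (line[21], int(line[22:26]))
            scores[key] = float(line[60:66])
    return scores


def low_confidence(scores, cutoff=70.0):
    """Return residues whose score falls below the chosen cutoff."""
    return [key for key, value in scores.items() if value < cutoff]


# Toy records with made-up coordinates and confidence values.
toy_model = [
    make_atom(1, " N", "MET", "A", 1, 10.0, 11.0, 12.0, 95.2),
    make_atom(2, " CA", "MET", "A", 1, 11.0, 12.0, 13.0, 95.2),
    make_atom(3, " CA", "GLY", "A", 2, 14.0, 12.5, 13.5, 42.1),
]

scores = plddt_per_residue(toy_model)
print(low_confidence(scores))
```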

Accepted Answer

UniRef and UniParc are sequence resources maintained alongside UniProt by the UniProt consortium, which includes the European Bioinformatics Institute (EMBL-EBI). UniRef is a collection of clusters of similar protein sequences, provided at several sequence-identity levels (UniRef100, UniRef90, and UniRef50), while UniParc is a non-redundant archive of protein sequences. Both resources are closely integrated with UniProt. Clustered sets such as UniRef reduce redundancy and speed up sequence searches, which makes these resources valuable for bioinformatics research on protein sequences.

Accepted Answer

STRING utilizes five sources of information to detect or predict protein interactions: curated databases, experimental data, textmining, co-expression, and homology. Curated databases provide pre-existing knowledge about protein interactions. Experimental data, such as protein-protein interaction assays, contribute to the accuracy of predictions. Textmining involves extracting information from scientific literature to identify potential interactions. Co-expression analysis examines the correlation between gene expression patterns to infer interactions. Homology-based methods compare protein sequences to identify similarities that may indicate interactions. STRING integrates these sources to generate a comprehensive list of protein interactions, assigning confidence scores to each interaction based on the available data.
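
To illustrate how several evidence channels can yield a combined confidence higher than any single channel, a simple probabilistic (noisy-OR) integration can be sketched as follows. This is an illustrative assumption and not STRING's exact scoring scheme, which involves further corrections not shown here. The function name and the channel scores are made-up values.

```python
def combine_scores(channel_scores):
    """Combine per-channel confidences, treating channels as independent."""
    remaining = 1.0
    for name, score in channel_scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {name!r} must lie in [0, 1]")
        # Probability that no channel so far supports the interaction.
        remaining *= 1.0 - score
    return 1.0 - remaining


# Hypothetical confidences for one protein pair, one per evidence source.
evidence = {
    "curated_databases": 0.0,
    "experimental": 0.6,
    "textmining": 0.5,
    "coexpression": 0.2,
    "homology": 0.0,
}

print(round(combine_scores(evidence), 2))
```

Under this assumption, adding a further supporting channel can only raise the combined score, and channels with a score of zero leave it unchanged.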

Accepted Answer

When working with protein structures, it is crucial to understand the data sources involved, in particular the experimental protein structures available in the PDB. Structure validation is essential to ensure that a model is coherent with the experimental data and with physicochemical constraints. Structural classification databases and sequence databases help in finding homologous proteins and the information associated with them. Additional protein features can be retrieved from resources like STRING and GO. Together, these points support a comprehensive understanding of protein structures and their applications in research.

Most frequently asked questions

1. What is the Protein Data Bank (PDB) and what is its purpose?

2. What are the limitations of experimental structure determination techniques?

3. What information does an ATOM line in a PDB file contain?

4. What additional descriptions and data can be accessed through the PDB's browser-based user interface?

5. What are the FAIR data principles?

6. What topics should be discussed in structure analysis and annotation?

7. What are the three aspects of validating atomic models?

8. How can structural classification schemes validate sequence-based homology search methods?

9. What are the four levels of hierarchy in the SCOP database?

10. What are the significant differences between CATH and SCOP?

11. What is domain shuffling?

12. What information does a protein's sequence contain?

13. What is the core data in the UniProt database?

14. What databases contain modeled structures?

15. What are UniRef and UniParc?

16. What are the five sources of information used by STRING to detect or predict protein interactions?

17. What are the key points to consider when working with protein structures?
