- To study proteins and their structure.
- To introduce the PDB and its importance.
- To learn how to retrieve structural data for a protein from the PDB database.
- To describe the PDB file format.
Proteins are fundamental components of all living cells and perform a wide variety of biological tasks. Protein structures are highly conserved, and the structure of a protein determines its function. The primary structure of a protein is its linear sequence of amino acids. This sequence is assembled during translation, in which the mRNA transcribed from DNA is decoded into a polypeptide chain.
DNA (deoxyribonucleic acid) is the genetic material that carries the information for developing and maintaining the functions of living organisms. The information is stored as a genetic code written with four bases: adenine (A), guanine (G), cytosine (C) and thymine (T). Across the two strands of DNA, adenine always pairs with thymine and guanine with cytosine. Each base is bonded to a sugar and a phosphate group to form a nucleotide. Base pairing gives the two strands a ladder-like arrangement that twists into the double helix.
Hydrogen bonding between the amide groups of the polypeptide backbone folds the primary structure into secondary structure. A hydrogen bond is the attraction between a hydrogen atom covalently bound to an electronegative atom and another electronegative atom (N, O, F, etc.); it is intramolecular when both lie within the same molecule and intermolecular when it forms between two different molecules. Alpha helices and beta sheets are the two most important secondary structures in proteins.
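The base-pairing rule described above (A with T, G with C) can be illustrated with a small sketch. The function name `complement_strand` is hypothetical and chosen only for this example; the code simply maps each base to its partner and does not model strand direction.

```python
# Watson-Crick pairing for DNA: A pairs with T, G pairs with C.
BASE_PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_strand(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand.

    Hypothetical helper for illustration: it does not reverse the
    strand, so the result is read in the same left-to-right order.
    """
    result = []
    for base in strand.upper():
        if base not in BASE_PAIRS:
            raise ValueError(f"not a DNA base: {base!r}")
        result.append(BASE_PAIRS[base])
    return "".join(result)
```

For example, `complement_strand("ATGC")` pairs each base in turn and yields `"TACG"`.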
The Protein Data Bank (PDB) is a database of 3D structures of proteins and other biological macromolecules. PDB files contain experimentally determined 3D structures. The structure of a protein can be determined by X-ray crystallography or by nuclear magnetic resonance (NMR) spectroscopy. In X-ray crystallography, X-rays, whose wavelength is comparable to the distances between atoms, are diffracted by the electrons of the atoms in a protein crystal, producing a pattern of small spots on an X-ray film. These patterns are used to calculate the coordinates of the atoms in the protein. NMR spectroscopy relies on the fact that an atomic nucleus placed in a strong magnetic field can absorb electromagnetic radiation of a particular frequency. Electromagnetic radiation is a form of energy made up of electric and magnetic fields; it includes X-rays, gamma rays, radio waves, visible light and so on. PDB entries also include information on data collection, the molecule name, primary and secondary structure, ligands, atomic coordinates, crystallographic structure factors, NMR experimental data and so on. The data are submitted by scientists from all over the world. The PDB is maintained by the Worldwide Protein Data Bank (wwPDB). Each entry in the PDB has a unique identifier called the PDB ID, a four-character code made up of alphanumeric characters.
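As a rough illustration of the PDB ID format described above, the following sketch checks whether a string looks like a four-character alphanumeric ID. The function name `is_valid_pdb_id` and the `require_leading_digit` option are hypothetical; the option reflects the convention, assumed here, that classic PDB IDs such as 4HHB begin with a digit. The check is purely syntactic and says nothing about whether an entry actually exists.

```python
import re

# Four ASCII letters or digits, as described in the text.
_FOUR_ALNUM = re.compile(r"[A-Za-z0-9]{4}")


def is_valid_pdb_id(pdb_id: str, require_leading_digit: bool = True) -> bool:
    """Return True if pdb_id looks like a four-character PDB ID.

    require_leading_digit is an assumption of this sketch: it enforces
    the convention that the first character is a digit.
    """
    if not _FOUR_ALNUM.fullmatch(pdb_id):
        return False
    if require_leading_digit and not pdb_id[0].isdigit():
        return False
    return True
```

With the default setting, `is_valid_pdb_id("4HHB")` is accepted while `is_valid_pdb_id("HHB4")` and `is_valid_pdb_id("4HH")` are rejected.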
All data in the PDB are publicly accessible. Several databases contain data derived from the PDB, for example: Structural Classification of Proteins (SCOP), which groups different protein structures; HSSP (Homology-Derived Secondary Structure of Proteins), which relates the 3D structure of a protein to its 1D sequence; and CATH, which classifies protein structures according to their evolutionary relationships. The PDB lets users search for information on structure, sequence and function, and visualize, download and assess molecules.
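Retrieving an entry programmatically might look like the sketch below. It assumes that PDB-format files are served at the URL pattern `https://files.rcsb.org/download/<ID>.pdb`; that pattern, the function names `pdb_download_url` and `fetch_pdb_text`, and the example ID 4HHB are assumptions made for illustration and are not taken from the text above.

```python
import urllib.request

# Assumed download location for PDB-format files (see the lead-in).
DOWNLOAD_URL = "https://files.rcsb.org/download/{pdb_id}.pdb"


def pdb_download_url(pdb_id: str) -> str:
    """Build the assumed download URL for a four-character PDB ID."""
    pdb_id = pdb_id.strip().upper()
    if len(pdb_id) != 4 or not (pdb_id.isascii() and pdb_id.isalnum()):
        raise ValueError(f"not a four-character PDB ID: {pdb_id!r}")
    return DOWNLOAD_URL.format(pdb_id=pdb_id)


def fetch_pdb_text(pdb_id: str, timeout: float = 30.0) -> str:
    """Download the PDB file for pdb_id and return it as text.

    Requires network access; raises urllib.error.URLError on failure.
    """
    with urllib.request.urlopen(pdb_download_url(pdb_id), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

A call such as `fetch_pdb_text("4HHB")` would then return the file contents as one string, which can be split into lines for further processing.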
PDB File Format
The PDB file format is the standard file format for protein structure files. It describes how the atoms of a molecule are arranged in the 3D structure of a protein. A file contains hundreds or thousands of lines called records, which describe the protein. Figure 1 shows parts of a PDB-formatted file for deoxyhemoglobin.
Figure 1. Certain parts of a PDB formatted file for deoxyhemoglobin
Each record provides a different set of information like:
- The HEADER record contains, in order, the classification of the molecule, the date on which the data were deposited at the PDB, and the ID code (the unique PDB identifier).
- The TITLE record contains the title of the PDB entry.
- The COMPND record gives the protein name; its specification list describes the molecular components of the entry.
- The SOURCE record contains the name of the organism from which the protein was obtained.
- The KEYWDS record contains keywords that describe the protein. These cover functional classification, metabolic role, biological or chemical activity, and structural classification.
- The EXPDTA record contains the experimental method used to determine the protein structure, e.g. X-ray diffraction or electron crystallography.
- The AUTHOR record contains the names of the contributors who deposited the data in the PDB.
- The REVDAT record contains the revision history of the entry. It includes the date and the type of each modification.
- The JRNL record contains the citation details of the publication that reports the protein structure.
- The REMARK record contains references to journal articles about the protein and other remarks on the protein structure.
- The DBREF record contains the cross-reference to the protein in the sequence databases. It contains the ID code of the entry, chain identifier, initial sequence number and initial insertion code of the PDB sequence segment, ending sequence number and ending insertion code of the PDB sequence segment, sequence database name, sequence database accession code, sequence database identification code, initial sequence number of the database segment, insertion code of the initial residue of the segment (if the PDB is the reference), ending sequence number of the database segment, and insertion code of the ending residue of the segment (if the PDB is the reference).
- The SEQADV record lists the differences between the named sequence database and the PDB. It includes the ID code of the entry, name of the PDB residue in conflict, PDB chain identifier, PDB sequence number, PDB insertion code, sequence database accession number, sequence database residue name, sequence database sequence number and a conflict comment.
- The SEQRES record contains the amino acid sequence of the protein. It includes the serial number of the SEQRES record for the current chain, the chain identifier, the number of residues in the chain and the residue names.
- The HET record contains details about the non-protein groups (hetero groups) in the structure. It contains the HET identifier, chain identifier, sequence number, insertion code, the number of HETATM records for the group present in the entry and text describing the HET group.
- The HETNAM record contains the compound names of non-standard residues. It contains the HET identifier and the chemical name.
- The HETSYN record contains synonyms for the compound names of non-standard residues.
- The FORMUL record contains the chemical formulas of non-standard residues.
- The HELIX record identifies helical substructures. It includes the serial number of the helix, helix identifier, name of the initial residue, chain identifier, sequence number and insertion code of the initial residue, name of the terminal residue, chain identifier, sequence number and insertion code of the terminal residue, a comment about the helix and the length of the helix.
- The LINK record identifies inter-residue bonds. For each of the two linked atoms it gives the atom name, alternate location indicator, residue name, chain identifier, residue sequence number and insertion code, followed by the symmetry operator for atom 1, the symmetry operator for atom 2 and the link distance.
- The SITE record lists the residues that make up important sites in the entry. It gives the sequence number, site name and number of residues that compose the site, followed by the residue name, chain identifier, residue sequence number and insertion code for each of up to four residues of the site per record.
- The ORIGXn records give the transformation from orthogonal coordinates to the submitted coordinates.
- The SCALEn records give the transformation from orthogonal coordinates to fractional crystallographic coordinates.
- The ATOM record contains the atomic coordinates for standard residues. It contains the atom serial number, atom name, alternate location indicator, residue name, chain identifier, residue sequence number, code for insertion of residues, orthogonal coordinates for X, Y and Z respectively in angstroms, occupancy, temperature factor, element symbol and charge on the atom.
- The TER record indicates the end of a chain of residues.
- The HETATM record contains the atomic coordinates for non-standard residues. It includes the atom serial number, atom name, alternate location indicator, residue name, chain identifier, residue sequence number, code for insertion of residues, orthogonal coordinates for X, Y and Z respectively, occupancy, temperature factor, element symbol and charge on the atom.
- The CONECT record contains details of the bonds involving non-protein atoms.
- The MASTER record is a bookkeeping record. It contains the number of REMARK records, number of HET records, number of HELIX records, number of SHEET records, a deprecated field, number of SITE records, number of coordinate transformation records, number of atomic coordinate records, number of TER records, number of CONECT records and number of SEQRES records.
- The END record marks the end of the file.
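
Because each record starts with a fixed-width record name and stores its fields at fixed column positions, a PDB file can be read with plain string slicing. The sketch below parses ATOM and HETATM lines into the fields listed above and tallies record types, loosely mirroring the counts kept in MASTER. The names `AtomRecord`, `parse_atom_line` and `count_record_types` are hypothetical, the column positions are those of the fixed-column PDB layout as assumed here, and the code assumes the numeric fields are filled in.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional


class AtomRecord(NamedTuple):
    record: str         # "ATOM" or "HETATM"
    serial: int         # atom serial number
    name: str           # atom name
    alt_loc: str        # alternate location indicator
    res_name: str       # residue name
    chain_id: str       # chain identifier
    res_seq: int        # residue sequence number
    i_code: str         # code for insertion of residues
    x: float            # orthogonal coordinates in angstroms
    y: float
    z: float
    occupancy: float
    temp_factor: float
    element: str        # element symbol
    charge: str         # charge on the atom, kept as text


def parse_atom_line(line: str) -> Optional[AtomRecord]:
    """Parse one ATOM/HETATM line; return None for any other record."""
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        return None
    line = line.rstrip("\r\n").ljust(80)  # pad so every slice is in range
    return AtomRecord(
        record=record,
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        alt_loc=line[16].strip(),
        res_name=line[17:20].strip(),
        chain_id=line[21].strip(),
        res_seq=int(line[22:26]),
        i_code=line[26].strip(),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        occupancy=float(line[54:60]),
        temp_factor=float(line[60:66]),
        element=line[76:78].strip(),
        charge=line[78:80].strip(),
    )


def count_record_types(lines: Iterable[str]) -> Counter:
    """Count records by the name stored in the first six columns."""
    return Counter(line[0:6].strip() for line in lines if line.strip())
```

Applied to every line of a file, `parse_atom_line` returns one `AtomRecord` per coordinate line and `None` for all other records, so the results can be filtered to obtain the full list of atoms.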