To predict the structure of a protein from its amino acid sequence using the experimentally resolved structures of related proteins.
Homology modeling is the process of predicting a protein's structure from its sequence with an accuracy that should be comparable to experimental results. The structure of a protein is determined by its amino acid sequence. Protein structures can be obtained experimentally, by X-ray crystallography or NMR spectroscopy, or by theoretical methods such as homology modeling. However, experiments have failed to provide high-resolution information for the majority of proteins, and NMR analysis in particular is limited by protein size. During evolution, structure is more conserved than sequence: proteins that share similar sequences adopt practically identical structures, and even distantly related proteins fold into similar structures.
In this process, a 3D structure is constructed by aligning a target protein sequence with known template structures. The protein sequence can be obtained from NCBI or UniProt. The quality of the model depends on the degree of similarity between the target sequence and the template; the database entry sharing the highest similarity is the one that is aligned. Sequence identity can be classified into three zones: the safe zone (above 30%), the twilight zone (between 10% and 30%), and the midnight zone (below 10%). Homology modeling is generally not reliable for targets that share less than 30% similarity with the template, although in rare cases templates with less than 20% are also selected. Homology modeling is a multi-step process that includes sequence alignment, structural modification, database searches, energy minimization, and structure evaluation. It is conventionally described in seven steps; here, alignment correction is covered together with template recognition under step 1. The steps are as follows.
1. Template recognition and initial alignment
Similar sequences can be found using BLAST, PSI-BLAST, or fold recognition methods, and aligned with the known structures in the PDB. The PDB, the largest structure database, contains only experimentally resolved structures. BLAST compares a query sequence with a database such as the PDB and identifies the sequences that share a high degree of similarity with it. Each hit is summarised by its E-value (expect value); the closer the E-value is to zero, the more significant the similarity. The E-value describes the number of hits one can "expect" to see by chance when searching a database of a particular size. Sequences that fall in the safe zone are expected to give better models than those in the twilight or midnight zones. After one or more possible templates have been identified, alignment correction is performed. It is sometimes difficult to align two sequences whose percentage identity is low. In such cases, one can use other sequences from homologous proteins to guide the alignment. Multiple sequence alignment programs such as CLUSTALW align sequences by introducing insertions and deletions (gaps). Alignment correction is a critical step in homology modeling, because an incorrect alignment leads to a defective model.
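Two quantities used in this step can be computed directly. The sketch below assumes the standard BLAST relation between bit score and E-value, E = m · n · 2^(−S'), where S' is the bit score and m and n are the (effective) query and database lengths, and it computes percentage identity over the aligned, non-gap columns of a pairwise alignment. The function names are illustrative, the choice of denominator for percentage identity is only one of several conventions, and the zone thresholds are simply the 30% and 10% values quoted in the text.

```python
def expect_value(bit_score: float, query_length: int, database_length: int) -> float:
    """E-value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percentage of identical residues over columns where neither sequence has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == "-" or b == "-":
            continue  # skip insertion/deletion columns
        aligned += 1
        if a == b:
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0


def identity_zone(identity: float, safe: float = 30.0, midnight: float = 10.0) -> str:
    """Classify a percentage identity using the thresholds quoted in the text."""
    if identity > safe:
        return "safe zone"
    if identity < midnight:
        return "midnight zone"
    return "twilight zone"


if __name__ == "__main__":
    # Hypothetical aligned sequences, for illustration only.
    identity = percent_identity("MKT-AYIAK", "MKTQAYLAK")
    print(identity, identity_zone(identity))
    # A larger database makes the same bit score less significant.
    print(expect_value(50.0, 300, 10**6), expect_value(50.0, 300, 10**8))
```

Because the E-value scales with the size of the search space, the same bit score is less significant in a larger database, which is why E-values rather than raw scores are compared when ranking templates.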
2. Backbone generation
The backbone for the aligned regions can be generated using modelling tools such as Modeller; the accuracy of such tools is assessed in the CASP experiments. Experimentally determined structures can themselves contain errors, for example because of poor electron density in the map. A template structure with as few errors as possible should therefore be chosen.
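As a minimal sketch of what copying the backbone from the aligned regions means, the function below walks along a pairwise alignment and copies the template's backbone coordinates to every target residue that has an aligned template residue. Target residues that face a gap in the template are left as None, to be built later by loop modeling. The data layout (one dictionary of atom name to coordinates per residue) and the function name are assumptions made for illustration; this is not how Modeller or any other tool represents structures.

```python
from typing import Dict, List, Optional, Tuple

Coord = Tuple[float, float, float]
Backbone = Dict[str, Coord]  # atom name ("N", "CA", "C", "O") -> coordinates


def copy_backbone(aligned_target: str, aligned_template: str,
                  template_backbone: List[Backbone]) -> List[Optional[Backbone]]:
    """Return one entry per target residue: copied backbone, or None if unaligned."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    if len(template_backbone) != len(aligned_template.replace("-", "")):
        raise ValueError("template coordinates do not match the template sequence")

    model: List[Optional[Backbone]] = []
    template_index = 0  # position in the ungapped template
    for target_res, template_res in zip(aligned_target, aligned_template):
        residue = None
        if template_res != "-":
            residue = template_backbone[template_index]
            template_index += 1
        if target_res == "-":
            continue  # deletion in the target: the template residue is dropped
        # Insertion in the target (gap in the template) stays None for loop modeling.
        model.append(dict(residue) if residue is not None else None)
    return model


if __name__ == "__main__":
    # Hypothetical three-residue template with CA atoms only.
    template = [{"CA": (0.0, 0.0, 0.0)}, {"CA": (3.8, 0.0, 0.0)}, {"CA": (7.6, 0.0, 0.0)}]
    print(copy_backbone("AC-D", "A-GD", template))
```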
3. Loop Modeling
In most cases, the alignment between the target and template sequences contains gaps, that is, insertions and deletions. These can only be modelled with some conformational change to the backbone, and such changes rarely occur within regular secondary structure elements. It is therefore safe to shift the insertions and deletions in the alignment out of helices and strands and place them in loops or coils. However, the resulting changes in loop conformation are difficult to predict, for several reasons:
1. Surface loops tend to be involved in crystal contacts, leading to a significant conformational change between template and target.
2. The exchange of side chains can change the orientation and spatial arrangement of the loop, especially when a small group is exchanged for a bulky one or vice versa.
3. Proline and glycine are exceptions on the Ramachandran plot. Proline is restricted in the plot because of its five-membered ring, whereas glycine, whose side chain is a single hydrogen atom, can adopt a much wider range of backbone conformations and is therefore hard to predict from the plot. This makes it difficult to account for mutations of a loop residue from or to either glycine or proline.
There are two main ways to overcome this and model the loop region:
1. Knowledge based:
One can search the PDB for known loops whose endpoints match the residues between which the loop has to be inserted, and simply copy the loop conformation.
2. Energy based:
The quality of a loop is judged with an energy function, and this function is minimized using Monte Carlo or molecular dynamics techniques to find the best loop conformation.
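The energy-based approach can be illustrated with a toy Metropolis Monte Carlo search. In this sketch a loop conformation is reduced to a list of torsion angles, and the energy function is a made-up periodic penalty around some preferred angles; both are assumptions chosen only to keep the example self-contained, and a real loop modeling program would use a full force field and check loop closure. The temperature is in arbitrary units.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def toy_loop_energy(torsions: Sequence[float], preferred: Sequence[float]) -> float:
    """Hypothetical energy: zero when every torsion equals its preferred value."""
    return sum(1.0 - math.cos(math.radians(t - p)) for t, p in zip(torsions, preferred))


def monte_carlo_minimize(torsions: Sequence[float],
                         energy_fn: Callable[[Sequence[float]], float],
                         steps: int = 5000,
                         max_step: float = 15.0,
                         temperature: float = 0.1,
                         seed: int = 0) -> Tuple[List[float], float]:
    """Metropolis Monte Carlo: perturb one torsion at a time, keep the best state seen."""
    rng = random.Random(seed)
    current = list(torsions)
    current_energy = energy_fn(current)
    best, best_energy = list(current), current_energy
    for _ in range(steps):
        trial = list(current)
        index = rng.randrange(len(trial))
        trial[index] += rng.uniform(-max_step, max_step)
        trial_energy = energy_fn(trial)
        delta = trial_energy - current_energy
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if delta <= 0.0 or rng.random() < math.exp(-delta / temperature):
            current, current_energy = trial, trial_energy
            if current_energy < best_energy:
                best, best_energy = list(current), current_energy
    return best, best_energy


if __name__ == "__main__":
    preferred = [-60.0, 140.0, 60.0]  # arbitrary illustrative target angles
    start = [0.0, 0.0, 0.0]
    best, energy = monte_carlo_minimize(start, lambda t: toy_loop_energy(t, preferred))
    print(best, energy)
```

Accepting some uphill moves is what lets the search escape local minima, which a plain downhill minimizer cannot do.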
4. Side Chain Modeling
Proteins that are structurally similar tend to have similar side-chain conformations, with similar torsion angles about the Cα-Cβ bond (the χ1 angle). In such cases, copying conserved residues entirely from the template to the model gives higher accuracy than copying only the backbone and re-predicting the side chains. Side-chain prediction is otherwise partly knowledge based: it uses libraries of rotamers extracted from high-resolution X-ray structures. To build a position-specific rotamer library, one takes high-resolution protein structures and collects all stretches of three to seven residues (depending on the method) with a given amino acid at the center. Prediction accuracy is usually quite high for residues in the hydrophobic core, where more than 90% of all χ1 angles fall within 20° of the experimental values. It is much lower for surface residues, where the percentage is often below 50%.
There are two reasons for this:
1. Flexible side chains on the surface tend to adopt multiple conformations, which are additionally influenced by crystal contacts.
2. Energy functions used to score rotamers can easily handle hydrophobic packing in the core (van der Waals interactions), but are not accurate enough to capture the complicated electrostatic interactions on the surface.
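The side-chain torsion angle about the Cα-Cβ bond (usually written χ1, defined by the atoms N, Cα, Cβ and the γ atom) can be computed from four atomic positions, and a predicted angle can then be compared with an experimental one using the 20° tolerance mentioned above. The sketch below uses plain Python tuples for coordinates; the three idealized staggered values (−60°, 60°, 180°) stand in for a real rotamer library and are an assumption made for illustration.

```python
import math
from typing import Sequence, Tuple

Vec = Tuple[float, float, float]


def _sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a: Vec, b: Vec) -> Vec:
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def _dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def dihedral(p0: Vec, p1: Vec, p2: Vec, p3: Vec) -> float:
    """Signed torsion angle p0-p1-p2-p3 in degrees, in the range (-180, 180]."""
    b0, b1, b2 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b0, b1), _cross(b1, b2)
    y = math.sqrt(_dot(b1, b1)) * _dot(b0, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def angle_difference(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    d = (a - b) % 360.0
    return min(d, 360.0 - d)


def nearest_rotamer(chi1: float, rotamers: Sequence[float] = (-60.0, 60.0, 180.0)) -> float:
    """Closest idealized chi1 value (illustrative stand-in for a rotamer library)."""
    return min(rotamers, key=lambda r: angle_difference(chi1, r))


def within_tolerance(predicted: float, experimental: float, tolerance: float = 20.0) -> bool:
    """True if the predicted angle lies within the tolerance of the experimental one."""
    return angle_difference(predicted, experimental) <= tolerance


if __name__ == "__main__":
    # Four points arranged so that the torsion angle is a right angle.
    print(dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)))
    print(nearest_rotamer(-170.0), within_tolerance(-170.0, 175.0))
```

Angles are compared on a circle, so −170° and 175° differ by 15°, not by 345°.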
5. Model Optimization
Sometimes the rotamers are predicted on the basis of an incorrect backbone, or the rotamer prediction itself is incorrect. In such cases, modeling programs restrain the atom positions and/or apply only a few hundred steps of energy minimization. Greater accuracy requires more precise force fields, which can be pursued in two ways.
1. Quantum force fields: Protein force fields must be fast in order to handle large molecules efficiently, so energies are normally expressed as a function of the positions of the atomic nuclei only. Quantum chemical methods describe the charge distribution more accurately, but they involve approximations of their own. Van der Waals forces, for example, are so difficult to treat that they must often be omitted completely. While quantum force fields provide more accurate electrostatics, the overall precision achieved is still about the same as with classical force fields.
2. Self-parametrizing force fields: The precision of a force field depends to a large extent on its parameters (e.g., van der Waals radii, atomic charges). These parameters are usually obtained from quantum chemical calculations on small molecules and by fitting to experimental data, following elaborate rules (Wang, Cieplak, and Kollman, 2000). By applying the force field to proteins, one implicitly assumes that a peptide chain is just the sum of its individual small-molecule building blocks, the amino acids. To increase the precision of the force field, the following procedure can be used: take initial parameters (for example, from an existing force field), change a parameter randomly, energy minimize the models, and check whether the result improved. If it did, keep the new force field; otherwise, go back to the previous one.
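The self-parametrization procedure is a simple random search over force field parameters, and its control flow can be sketched in a few lines. In the example below, score_fn is a hypothetical stand-in for the expensive step (energy minimizing a set of models with the trial parameters and measuring how far they deviate from experimental structures); here it is replaced by a toy function, and the parameter names and numbers are arbitrary illustrative values, not real force field parameters.

```python
import random
from typing import Callable, Dict, Tuple

Params = Dict[str, float]


def self_parametrize(params: Params,
                     score_fn: Callable[[Params], float],
                     iterations: int = 2000,
                     step: float = 0.05,
                     seed: int = 0) -> Tuple[Params, float]:
    """Randomly perturb one parameter at a time; keep the change only if the score improves."""
    rng = random.Random(seed)
    best = dict(params)
    best_score = score_fn(best)
    for _ in range(iterations):
        trial = dict(best)
        name = rng.choice(sorted(trial))      # pick a parameter at random
        trial[name] += rng.gauss(0.0, step)   # change it randomly
        trial_score = score_fn(trial)         # stands in for "minimize models and evaluate"
        if trial_score < best_score:          # improved: keep the new force field
            best, best_score = trial, trial_score
        # otherwise: fall back to the previous force field (best is unchanged)
    return best, best_score


if __name__ == "__main__":
    # Toy score: squared distance from hidden "ideal" values (purely illustrative).
    ideal = {"vdw_radius": 1.9, "partial_charge": -0.5}

    def toy_score(p: Params) -> float:
        return sum((p[k] - ideal[k]) ** 2 for k in ideal)

    start = {"vdw_radius": 1.7, "partial_charge": -0.4}
    tuned, score = self_parametrize(start, toy_score)
    print(tuned, score)
```

Because only improvements are kept, the score can never get worse, but the search can stall in a local optimum; that limitation belongs to this simple sketch and says nothing about published methods.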
6. Model Validation
The models we obtain may contain errors. The number of errors mainly depends on two factors.
1. The percentage identity between the template and the target.
If the identity is greater than 90%, the accuracy of the model can be compared to that of crystallographically determined structures, except for a few individual side chains. Between 50% and 90%, the r.m.s.d. error in the modeled coordinates can be as large as 1.5 Å, with considerably larger local errors. Below 25%, the alignment becomes the main difficulty for homology modeling, often leading to very large errors.
2. The number of errors in the template.
Errors in a model become less of a problem if they can be localized. Therefore, an essential step in the homology modeling process is verification of the model. Errors can be estimated by calculating the model's energy with a force field; this checks whether the bond lengths and angles are within their normal ranges. However, this method cannot judge whether the model is correctly folded. 3D distribution functions can also easily identify misfolded proteins and are good indicators of local model-building problems.
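Two of the checks mentioned in this step are easy to sketch. The first function computes the r.m.s.d. between two sets of equivalent atoms, assuming the structures have already been superimposed (no fitting is done here). The second flags consecutive Cα atoms whose distance deviates from the roughly 3.8 Å expected for a trans peptide bond, as a simple example of checking that geometry is in a normal range. The function names and the 0.2 Å tolerance are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(coords_a: Sequence[Coord], coords_b: Sequence[Coord]) -> float:
    """Root-mean-square deviation between equivalent atoms of pre-superimposed structures."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate sets of equal length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def unusual_ca_distances(ca_coords: Sequence[Coord],
                         ideal: float = 3.8,
                         tolerance: float = 0.2) -> List[int]:
    """Indices i where the CA(i) to CA(i+1) distance is outside ideal +/- tolerance."""
    flagged = []
    for i in range(len(ca_coords) - 1):
        distance = math.dist(ca_coords[i], ca_coords[i + 1])
        if abs(distance - ideal) > tolerance:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Hypothetical coordinates in angstroms, for illustration only.
    model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 6.0, 0.0)]
    reference = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0), (3.8, 6.0, 1.0)]
    print(rmsd(model, reference))
    print(unusual_ca_distances(model))
```

A check of this kind can localize a geometric problem to specific residues, but, as noted above, normal local geometry does not show that the overall fold is correct.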
Modeller (Sali and Blundell 1993)
Modeller is a program for comparative protein structure modelling by satisfaction of spatial restraints. In this approach, a set of restraints is derived from an alignment, and the model is obtained by minimizing the violations of these restraints. The restraints can come from related protein structures or from NMR experiments. The user provides an alignment of the sequence to be modelled with known structures, and Modeller calculates a model containing all non-hydrogen atoms. It can also compare protein structures or sequences, cluster proteins, and search sequence databases.
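The idea of obtaining a model by minimizing restraint violations can be illustrated with a toy example. The sketch below is not Modeller's objective function or API; it assumes the simplest possible case, in which each restraint is a target distance between two points and the violation is the squared difference between the current and target distance, minimized by plain gradient descent. All names and numbers are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]
Restraint = Tuple[int, int, float]  # (atom index i, atom index j, target distance)


def restraint_violation(coords: Sequence[Sequence[float]],
                        restraints: Sequence[Restraint]) -> float:
    """Sum of squared deviations from the target distances."""
    return sum((math.dist(coords[i], coords[j]) - d0) ** 2 for i, j, d0 in restraints)


def satisfy_restraints(coords: Sequence[Coord],
                       restraints: Sequence[Restraint],
                       steps: int = 2000,
                       rate: float = 0.05) -> List[Coord]:
    """Move the points by gradient descent to reduce the total restraint violation."""
    points = [list(p) for p in coords]
    for _ in range(steps):
        grads = [[0.0, 0.0, 0.0] for _ in points]
        for i, j, d0 in restraints:
            d = math.dist(points[i], points[j])
            if d == 0.0:
                continue  # direction undefined for coincident points
            factor = 2.0 * (d - d0) / d
            for k in range(3):
                g = factor * (points[i][k] - points[j][k])
                grads[i][k] += g
                grads[j][k] -= g
        for p, g in zip(points, grads):
            for k in range(3):
                p[k] -= rate * g[k]
    return [(p[0], p[1], p[2]) for p in points]


if __name__ == "__main__":
    # Three points and three target distances (hypothetical values in angstroms).
    start = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    restraints = [(0, 1, 3.8), (1, 2, 3.8), (0, 2, 5.0)]
    model = satisfy_restraints(start, restraints)
    print(restraint_violation(start, restraints), restraint_violation(model, restraints))
```

In a real comparative modelling run the restraints are far more numerous and varied, but the principle is the same: the model is the set of coordinates that violates the input restraints as little as possible.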