Objective
- To align three or more sequences in order to identify structural and functional relationships among them.
Key terms
Conserved regions: Over evolutionary time, some stretches of a sequence (groups of nucleotides in DNA, or of amino acids in a protein) are preserved largely unchanged. Stretches that remain the same across generations or across species are called conserved regions.
Consensus sequence: When several nucleotide or amino acid sequences are aligned, a particular residue (a nucleotide or an amino acid) often occurs more frequently than the others at a given position. The sequence formed by taking the most frequent residue at each position is called the consensus sequence.
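As an illustration of the consensus idea, the following minimal sketch takes sequences that are assumed to be already aligned (equal length) and picks the most frequent residue in each column. The function name `consensus`, the example alignment and the alphabetical tie-breaking rule are assumptions made for this sketch, not part of any particular tool.

```python
from collections import Counter


def consensus(aligned):
    """Return the most frequent residue in each column of an alignment."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    result = []
    for column in zip(*aligned):
        counts = Counter(column)
        # Assumption: ties are broken by character order so output is deterministic.
        best = min(counts, key=lambda residue: (-counts[residue], residue))
        result.append(best)
    return "".join(result)


# Hypothetical toy alignment of four DNA sequences.
alignment = ["ATGCA", "ATGCT", "ACGCA", "ATGGA"]
print(consensus(alignment))
```

Real tools often report an ambiguity code or a threshold-based symbol where no residue clearly dominates; this sketch simply takes the majority.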
Theory
A sequence is an ordered chain of nucleotides or amino acid residues linked to one another. Biologically, a typical DNA or RNA sequence consists of nucleotides, while a protein sequence consists of amino acids.
Sequencing is the process of determining the order of nucleotides in a DNA fragment or of amino acids in a protein. Several experimental methods exist for sequencing, and the resulting sequences are submitted to public databases such as GenBank, which is maintained by NCBI.
Methods of Sequencing:
The sequences stored in databases were obtained by different experimental methods. Classical methods for DNA sequencing are the Sanger method and the Maxam-Gilbert method. Similarly, the Edman degradation method and mass spectrometry are used for protein sequencing.
Sanger method (dideoxy chain termination method): Four test tubes are labelled A, T, G and C. Denatured (single-stranded) DNA is added to each tube, followed by a primer that anneals to one strand of the DNA template. DNA polymerase extends the primer from its 3' end, randomly incorporating either ordinary deoxynucleotides (dNTPs) or the dideoxynucleotide (ddNTP) specific to that tube. When a ddNTP is incorporated into the growing chain, the chain terminates, because a ddNTP lacks the 3'-OH group needed to form a phosphodiester bond with the next nucleotide. Each tube therefore accumulates short DNA strands of different lengths, all ending in the same base. The fragments are separated by electrophoresis, and the sequence is read by analysing the bands in the gel in order of fragment size. The primer or one of the nucleotides can be radioactively or fluorescently labelled so that the products can be detected easily in the gel.
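The logic of reading a Sanger gel can be sketched in a few lines. This is a deliberately simplified model: it ignores the primer length and assumes every possible termination position is observed in its tube. The function names `sanger_fragments` and `read_gel` and the example strand are hypothetical.

```python
def sanger_fragments(new_strand):
    """For each ddNTP tube, list the fragment lengths at which synthesis stops."""
    tubes = {base: [] for base in "ATGC"}
    for length, base in enumerate(new_strand, start=1):
        # A fragment of this length ends in `base`, so it appears in that tube.
        tubes[base].append(length)
    return tubes


def read_gel(tubes):
    """Read bands from the shortest to the longest fragment to recover the sequence."""
    bands = [(length, base) for base, lengths in tubes.items() for length in lengths]
    return "".join(base for _, base in sorted(bands))


# Hypothetical newly synthesized strand (complementary to the template).
synthesized = "GATTACA"
tubes = sanger_fragments(synthesized)
print(tubes)
print(read_gel(tubes))
```

Sorting the bands by length mirrors reading the gel from the fastest-migrating (shortest) fragment upwards.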
Maxam-Gilbert (chemical degradation method): This method requires a denatured DNA fragment whose 5' end is radioactively labelled. The fragment is purified and then subjected to base-specific chemical cleavage, which produces a series of labelled fragments. Electrophoresis separates the fragments according to their size. To view the fragments, the gel is exposed to X-ray film for autoradiography. A series of dark bands appears, each corresponding to a radiolabelled DNA fragment, from which the sequence can be inferred.
Edman degradation: This reaction determines the order of amino acids in a protein by cleaving one amino acid at a time from the N-terminus without disturbing the remaining peptide bonds. After each cleavage, chromatography or electrophoresis is used to identify the released amino acid.
Mass spectrometry: This technique is used to determine the mass of particles, the composition of molecules, and the chemical structure of molecules such as peptides and other compounds. The amino acids in a protein can be identified from the mass-to-charge ratios of its fragments.
Sequence Alignment:
When a new sequence is found, its structure and function can often be predicted by sequence alignment, because sequences that share a common ancestor are expected to have similar structure or function. The greater the sequence similarity, the greater the chance that the sequences share a similar structure or function.
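One simple way to put a number on sequence similarity is percent identity over an existing pairwise alignment. The sketch below assumes two already-aligned strings with `-` as the gap symbol; the function name `percent_identity` and the choice of denominator (all columns that are not gap-versus-gap) are assumptions, since several conventions exist.

```python
def percent_identity(a, b):
    """Percentage of aligned columns in which both sequences carry the same residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    # Ignore columns where both sequences have a gap.
    columns = [(x, y) for x, y in zip(a, b) if x != "-" or y != "-"]
    if not columns:
        raise ValueError("alignment has no residue columns")
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


# Hypothetical aligned pair: one mismatch and one gap in six columns.
print(round(percent_identity("ACGT-A", "ACCTTA"), 1))
```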
Sequence alignment is of two types: pairwise, which compares two sequences, and multiple, which compares more than two, in each case looking for shared series of characters or patterns. Multiple sequence alignment is the alignment of three or more nucleotide or protein sequences. Similar genes may be conserved among different species.
Such sets of identical or similar genes are the input to a multiple sequence alignment. The alignment makes it easy to identify the most evolutionarily conserved regions, which play a critical role in the function of the gene.
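Once a multiple alignment is available, fully conserved regions can be located by scanning its columns. The following sketch assumes equal-length aligned sequences and treats a column as conserved only if every sequence has the same non-gap residue there; the function name `conserved_regions` and the toy protein fragments are hypothetical.

```python
def conserved_regions(aligned):
    """Return (start, end) index pairs of runs of fully conserved columns."""
    columns = list(zip(*aligned))
    regions = []
    start = None
    for i, column in enumerate(columns):
        conserved = len(set(column)) == 1 and column[0] != "-"
        if conserved and start is None:
            start = i                       # a new conserved run begins
        elif not conserved and start is not None:
            regions.append((start, i - 1))  # the current run ends
            start = None
    if start is not None:
        regions.append((start, len(columns) - 1))
    return regions


# Hypothetical aligned protein fragments (0-based column indices are reported).
msa = ["MKTAYIAK", "MKSAYLAK", "MKTAYVAK"]
print(conserved_regions(msa))
```

Alignment viewers usually relax this strict definition, for example by also marking columns whose residues have similar chemical properties.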
Over evolutionary time, genes may be altered at the sequence level, which can alter their function. Multiple sequence alignment can help identify such functional changes and the sequence-level changes that cause them.
At the protein level, multiple sequence alignment provides information about the structure and function of proteins. It can help discover new domains or motifs of biological significance.
Many tools are available for multiple sequence alignment, such as MSA, DIALIGN, the CLUSTAL series, MAFFT, MUSCLE, T-Coffee and BlastAlign.
Table 1: Summary of multiple sequence alignment programs
*Adapted from Current Opinion in Structural Biology 2006, 16:368–373.

CLUSTALW / CLUSTAL Omega
Pairwise alignment of nucleotide or amino acid sequences is solved with dynamic programming. In principle the same approach extends to the alignment of n sequences, but in practice it is limited to pairs, because the memory and the number of steps required grow exponentially with the number of sequences. Because of this limitation, researchers developed the 'progressive method' for aligning three or more sequences.
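To make the dynamic programming step concrete, here is a minimal sketch of global pairwise alignment in the style of Needleman and Wunsch. The scoring values (match +1, mismatch -1, gap -2) and the function name `needleman_wunsch` are assumptions chosen for illustration; real tools use substitution matrices and more refined gap penalties. The table has (len(a)+1) x (len(b)+1) cells, and the analogous table for N sequences of length L would need on the order of L^N cells, which is why the approach does not scale beyond pairs.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of two sequences; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def substitution(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell is the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + substitution(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


total, row_a, row_b = needleman_wunsch("ACGT", "AGT")
print(total)
print(row_a)
print(row_b)
```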
The progressive method was first suggested by Feng and Doolittle in 1987. It compares only one pair of sequences at a time, using the following steps:
- Using the standard dynamic programming algorithm on each pair, calculate the N*(N-1)/2 distances between the sequence pairs, where N is the total number of sequences.
- Apply a clustering algorithm to the resulting distance matrix to construct a guide tree.
- Following the tree, align the first node to the second node. Once that alignment is fixed, add the next sequence or the third node, and repeat until all the sequences are aligned. When a sequence is aligned to a group, or two groups of sequences are aligned to each other, the pairwise alignment with the highest score is the one used. Gap symbols in a fixed alignment are replaced with a neutral character, which helps guide the subsequent sequence-to-alignment and alignment-to-alignment steps.
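The first step above, computing all N*(N-1)/2 pairwise distances, can be sketched as follows. As a stand-in for an alignment-derived distance, this sketch uses the Levenshtein edit distance, which is itself computed by dynamic programming; that substitution, the function names and the four example sequences are assumptions made for illustration.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two strings, by dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (x != y)))  # match or substitution
        previous = current
    return previous[-1]


def distance_matrix(seqs):
    """Map each unordered pair of sequence names to its distance."""
    return {(p, q): edit_distance(seqs[p], seqs[q])
            for p, q in combinations(sorted(seqs), 2)}


# Hypothetical input sequences.
sequences = {"s1": "ACGTAC", "s2": "ACGTTC", "s3": "TCGAAC", "s4": "ACGT"}
distances = distance_matrix(sequences)
n = len(sequences)
print(len(distances), "pairs for", n, "sequences")
for pair, d in distances.items():
    print(pair, d)
```

The resulting dictionary plays the role of the distance matrix from which a guide tree is then built.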
Working of Algorithm
Multiple sequence alignment can be done with different tools, and CLUSTALW is one of the most widely accepted. Recently a newer version, CLUSTAL Omega, was released; it is available from the European Bioinformatics Institute: www.ebi.ac.uk/Tools/clustalw2/
D. Higgins wrote the first CLUSTAL program. To reduce memory use and running time, a series of CLUSTAL programs followed; the widely used CLUSTALW combines dynamic programming with the progressive alignment method.
CLUSTALW uses the progressive algorithm, adding sequences one at a time until all of them are aligned.
Steps of the CLUSTAL algorithm
1. Calculate all possible pairwise alignments and record the score for each pair.
2. Calculate a guide tree based on the pairwise distances (algorithm: Neighbor Joining).
3. Find the two most closely related sequences.
4. Align the sequences by the progressive method:
i. Calculate a consensus of this alignment.
ii. Replace the two sequences with the consensus.
iii. Find the two next-most closely related sequences (one of these could be a previously determined consensus sequence).
iv. Iterate until all sequences have been aligned.
5. Expand the consensus sequences with the (gapped) original sequences.
6. Report the multiple sequence alignment.
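The order in which sequences are joined can be illustrated with a small clustering sketch. Note that this is not Neighbor Joining: for brevity it uses greedy average-linkage clustering, which repeatedly merges the two closest clusters. The function name `build_guide_tree`, the labels A to D and the distance values are all hypothetical.

```python
from itertools import combinations


def build_guide_tree(names, dist):
    """Greedy average-linkage clustering; returns (tree, list of merges in order)."""
    # Each cluster key is a name or a nested tuple; the value lists its leaves.
    clusters = {name: [name] for name in names}

    def leaf_distance(a, b):
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

    def cluster_distance(c1, c2):
        pairs = [(a, b) for a in clusters[c1] for b in clusters[c2]]
        return sum(leaf_distance(a, b) for a, b in pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        # Find the two most closely related clusters and join them.
        c1, c2 = min(combinations(list(clusters), 2),
                     key=lambda pair: cluster_distance(*pair))
        merged = (c1, c2)
        clusters[merged] = clusters.pop(c1) + clusters.pop(c2)
        merges.append(merged)
    return merges[-1], merges


# Hypothetical pairwise distances between four sequences.
example_dist = {("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 7,
                ("B", "C"): 5, ("B", "D"): 8, ("C", "D"): 3}
tree, order = build_guide_tree(["A", "B", "C", "D"], example_dist)
print(tree)
for step, merge in enumerate(order, start=1):
    print(step, merge)
```

Each merge corresponds to one alignment step in the progressive method: the first merges align single sequences, and later merges align previously built groups.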