. .
.
Smith-Waterman Algorithm - Local Alignment of Sequences
.
.

 

 

Objective

 

  • To perform local sequence alignment between two nucleotide or amino acid sequences and find out structural or functional similarity.

 

Theory

 

The most commonly asked question in molecular biology is whether two given sequences are related or not, in order to identify their structure or function. The most simpler way to answer this question is to compare their sequences.

 

Sequence is a collection of nucleotides or amino acid residues which are connected with each other. Speaking biologically, a typical DNA/RNA sequence consist of nucleotides while a protein sequence consist of amino acids.

 

Sequencing is the process to determine the nucleotide or amino acid sequence of a DNA fragment or a protein. There are different experimental methods for sequencing, and the obtained sequence  is submitted to  different databases like NCBI, Genbank etc. 

 

Methods of Sequencing:

 

Sequences stored in the database were obtained from different experimental methods. Most commonly used methods for DNA sequencing are Sanger Method and Maxam-Gilbert Method. Similarly Edman Degradation method and Mass Spectrometry technique are used for protein sequencing.

 

Sanger Method (dideoxy chain termination method): Here 4 test tubes are taken labelled with A, T, G and C. Into each of the test tubes, DNA has to be added in denatured form (single strands). Next a primer is to be added which anneals to one of the strand in template. The 3' end of the primer accomodates the dideoxy nucleotides [ddNTPs] (specific to each tube) as well as deoxy nucleotides randomly. When the ddNTP's gets attached to the growing chain, the chain terminates due to lack of 3'OH which forms the phospho diester bond with the next nucleotide. Thus small strands of DNA are formed. Electrophoresis is done and the sequence order can be obtained by analysing the bands in the gel based on the molecular weight.  The primer or one of the nucleotides can be radioactively or fluorescently labeled also, so that the final product can be detected from the gel easily and the sequence can be inferred.

 

Maxam-Gilbert (Chemical degradation method): This method requires denatured DNA fragment whose 5' end is radioactively labeled. This fragment is then subjected to purification before proceeding for chemical treatment which results in a series of labeled fragments. Electrophoresis technique helps in arranging the fragments based on their molecular weight. To view the fragments, gel is exposed to X-ray film for autoradiography. A series of dark bands will appear, each corresponding to a radio labeled DNA fragment, from which the sequence can be inferred.

 

Edman Degradation reaction: The reaction finds the order of amino acids in a protein by cleaving each amino acid from the N-terminal without distrubing the bonds in the protein. After each clevage, chromatography or electrophoresis is done to identify the amino acid.

 

 Mass Spectrometry:  It is used to determine the mass of particle, composition of molecule and for finding the chemical structures of molecules like peptides and other chemical compounds. Based on the mass to charge ratio, one can identify the amino acids in a protein.

 

Sequence Alignment and importance:

 

Sequence Alignment or sequence comparison lies at heart of the bioinformatics, which describes the way of arrangement of DNA/RNA or protein sequences, in order to identify the regions of similarity among them. It is used to infer structural, functional and evolutionary relationship between the sequences. Alignment finds similarity level between query sequence and different database sequences. The algorithm works by dynamic programming approach which divides the problem into smaller independent sub problems. It finds the alignment more quantitatively by assigning scores.

 

When a new sequence is found, the structure and function can be easily predicted by doing sequence alignment. Since it is believed that, a sequence sharing common ancestor would exhibit similar structure or function. Greater the sequence similarity, greater is the chance that they share similar structure or function.

 

Methods of Sequence Alignment:

 

There are mainly two methods of Sequence Alignment:

 

Global Alignment: Closely related sequences which are of same length are very much appropriate for global alignment. Here, the alignment is carried out from beginning till end of the sequence to find out the best possible alignment.

 

Local Alignment: Sequences which are suspected to have similarity or even dissimilar sequences can be compared with local alignment method. It finds the local regions with high level of similarity.

 

These two methods of alignments are defined by different algorithms, which use scoring matrices to align the two different series of characters or patterns (sequences). The two different alignment methods are mostly defined by Dynamic programming approach for aligning two different sequences.

 

Dynamic programming

 

Dynamic programming is used for optimal alignment of two sequences. It finds the alignment in a more quantitative way by giving some scores for matches and mismatches (Scoring matrices), rather than only applying dots. By searching the highest scores in the matrix, alignment can be accurately obtained. The Dynamic Programming solves the original problem by dividing the problem into smaller independent sub problems. These techniques are used in many different aspects of computer science. Needleman-Wunsch and Smith-Waterman algorithms for sequence alignment are defined by dynamic programming approach.

 

Smith Waterman algorithm was first proposed by Temple F. Smith and Michael S. Waterman in 1981. The algorithm explains the local sequence alignment, it gives conserved regions between the two sequences, and one can align two partially overlapping sequences, also it’s possible to align the subsequence of the sequence to itself. These are the main advantages of Local Sequence Alignment.

 

This algorithm mainly differs within two aspects from the Needleman-Wunsch algorithm. First aspect being, local alignment differs by having only a negative score for the mismatch and when the matrix value becomes negative it has to be set to zero (need to take maximum value of the scorings compared with zero). They are also predefined scoring matrices for nucleotide or protein sequences.

 

Scoring matrices: 

 

 

In optimal alignment procedures, mostly Needleman-Wunsch and Smith-Waterman algorithms use scoring system. For nucleotide sequence alignment, the scoring matrices used are relatively simpler since the frequency of mutation for all the bases are equal. Positive or higher value is assigned for a match and a negative or a lower value is assigned for mismatch. These assumption based scores can be used for scoring the matrices. There are other scoring matrices which are predefined mostly, used in the case of amino acid substitutions.

 

Mainly used predefined matrices are PAM and BLOSUM.

 

PAM Matrices: Margaret Dayhoff was the first one to develop the PAM matrix, PAM stands for Point Accepted Mutations. PAM matrices are calculated by observing the differences in closely related proteins. One PAM unit (PAM1) specifies one accepted point mutation per 100 amino acid residues, i.e. 1% change and 99% remains as such.

 

BLOSUM: BLOcks SUbstitution Matrix, developed by Henikoff and Henikoff in 1992, used conserved regions. These matrices are actual percentage identity values. Simply to say, they depend on similarity. Blosum 62 means there is 62 % similarity.

 

Gap score or gap penalty: Dynamic programming algorithms use gap penalties to maximize the biological meaning. Gap penalty is subtracted for each gap that has been introduced. There are different gap penalties such as gap open and gap extension. The gap score defines a penalty given to alignment when we have insertion or deletion. During the evolution, there may be a case where we can see continuous gaps all along the sequence, so the linear gap penalty would not be appropriate for the alignment. Thus gap open and gap extension has been introduced when there are continuous gaps (five or more). The open penalty is always applied at the start of the gap, and then the other gaps following it is given with a gap extension penalty which will be less compared to the open penalty. Typical values are –12 for gap opening, and –4 for gap extension.

 

Assumed scoring schemas: If the residues (nucleotide or amino acids) are same in both the sequences the match score is assumed (Si,j) as +5 which is added to the diagonally positioned cell of the current cell (i, j position). If the residues are not same, the mismatch score is assumed as -3. This score should be added to the diagonally positioned cell of the current cell. The gap penalty score is assumed as -4 which is added to left and above positioned cells of the current cell. These scores are not unique, they can be user defined also, but the mismatch and gap penalty should be the negative values.

 

Working of Smith-Waterman Algorithm :

 

Intialization of Matrix 

 

The basic steps for the algorithm are similar to that of Needleman-Wunsch algorithm. The steps are:

1. Initialization of a matrix.
2. Matrix Filling with the appropriate scores.
3. Trace back the sequences for a suitable alignment.

To study the Local sequence alignment consider the given below sequences.

CGTGAATTCAT (sequence#1 or A)
GACTTAC (sequence #2 or B)

 

The two sequences are arranged in a matrix form with A+1columns and B+1rows. The values in the first row and first column are set to zero as shown in Figure 1.

 

 Figure 1: Initialization of Matrix

 

Variables used:
          i,j describes row and columns.
          M is the matrix value of the required cell (stated as Mi,j)
          S is the score of the required cell (Si, j)
          W is the gap alignment
 

 

Matrix Filling

 

The second and crucial step of the algorithm is filling the entire matrix, so it is more important to know the neighbor values (diagonal, upper and left) of the current cell to fill each and every cell.

  

 «math xmlns=¨http://www.w3.org/1998/Math/MathML¨»«msub»«mi»M«/mi»«mrow»«mi»i«/mi»«mo»,«/mo»«mi»j«/mi»«/mrow»«/msub»«mo»§nbsp;«/mo»«mo»=«/mo»«mo»§nbsp;«/mo»«mi»M«/mi»«mi»a«/mi»«mi»x«/mi»«mi»i«/mi»«mi»m«/mi»«mi»u«/mi»«mi»m«/mi»«mo»§nbsp;«/mo»«mfenced close=¨]¨ open=¨[¨»«mrow»«msub»«mi»M«/mi»«mrow»«mi»i«/mi»«mo»-«/mo»«mn»1«/mn»«mo»,«/mo»«mi»j«/mi»«mo»-«/mo»«mn»1«/mn»«/mrow»«/msub»«mo»+«/mo»«mo»§nbsp;«/mo»«msub»«mi»S«/mi»«mrow»«mi»i«/mi»«mo»,«/mo»«mi»j«/mi»«mo»,«/mo»«/mrow»«/msub»«mo»§nbsp;«/mo»«msub»«mi»M«/mi»«mrow»«mi»i«/mi»«mo»,«/mo»«mi»j«/mi»«mo»-«/mo»«mn»1«/mn»«/mrow»«/msub»«mo»+«/mo»«mi»W«/mi»«mo»,«/mo»«mo»§nbsp;«/mo»«msub»«mi»M«/mi»«mrow»«mi»i«/mi»«mo»-«/mo»«mn»1«/mn»«mo»,«/mo»«mi»j«/mi»«/mrow»«/msub»«mo»+«/mo»«mi»W«/mi»«mo»,«/mo»«mn»0«/mn»«/mrow»«/mfenced»«/math»

As per the assumptions stated earlier, fill the entire matrix using the assumed scoring schema and initial values.  One can fill the 1st row and 1st column with the scoring matrix as follows.

 

The first residue (nucleotides or amino acids) in both sequences is ‘C’ and ‘G’, the matching score or the mismatching score is going to be added the neighboring value which is diagonally located i.e. 0. The upper and left values are added to the gap penalty score from the matrix. So the scoring schema equation can be shown as follows.

 

 

 «math xmlns=¨http://www.w3.org/1998/Math/MathML¨»«mtable columnalign=¨left¨ rowspacing=¨0¨»«mtr»«mtd»«msub»«mi»M«/mi»«mrow»«mn»1«/mn»«mo»,«/mo»«mn»1«/mn»«/mrow»«/msub»«mo»§nbsp;«/mo»«mo»§nbsp;«/mo»«mo»§nbsp;«/mo»«mo»=«/mo»«mo»§nbsp;«/mo»«mi»M«/mi»«mi»a«/mi»«mi»x«/mi»«mi»i«/mi»«mi»m«/mi»«mi»u«/mi»«mi»m«/mi»«mo»§nbsp;«/mo»«mo»[«/mo»«msub»«mi»M«/mi»«mrow»«mn»0«/mn»«mo»,«/mo»«mn»0«/mn»«/mrow»«/msub»«mo»§nbsp;«/mo»«mo»+«/mo»«msub»«mi»S«/mi»«mrow»«mn»1«/mn»«mo»,«/mo»«mn»1«/mn»«/mrow»«/msub»«mo»,«/mo»«mo»§nbsp;«/mo»«msub»«mi»M«/mi»«mrow»«mn»1«/mn»«mo»,«/mo»«mn»0«/mn»«/mrow»«/msub»«mo»+«/mo»«mo»§nbsp;«/mo»«mi»W«/mi»«mo»,«/mo»«mo»§nbsp;«/mo»«msub»«mi»M«/mi»«mrow»«mn»0«/mn»«mo»,«/mo»«mn»1«/mn»«/mrow»«/msub»«mo»+«/mo»«mi»W«/mi»«mo»,«/mo»«mo»§nbsp;«/mo»«mn»0«/mn»«mo»]«/mo»«/mtd»«/mtr»«mtr»«mtd»«mo»§nbsp;«/mo»«mo»§nbsp;«/mo»«mo»§nbsp;«/mo»«mo»§nbsp;«/mo»«mo»§nbsp;«/mo»«mo»§nbsp;«/mo»«mo»§nbsp;«/mo»«mo»=«/mo»«mo»§nbsp;«/mo»«mi»M«/mi»«mi»a«/mi»«mi»x«/mi»«mi»i«/mi»«mi»m«/mi»«mi»u«/mi»«mi»m«/mi»«mo»§nbsp;«/mo»«mfenced close=¨]¨ open=¨[¨»«mrow»«mn»0«/mn»«mfenced»«mrow»«mo»-«/mo»«mn»3«/mn»«/mrow»«/mfenced»«mo»,«/mo»«mn»0«/mn»«mo»§nbsp;«/mo»«mo»+«/mo»«mo»§nbsp;«/mo»«mfenced»«mrow»«mo»-«/mo»«mn»4«/mn»«/mrow»«/mfenced»«mo»,«/mo»«mn»0«/mn»«mo»+«/mo»«mo»§nbsp;«/mo»«mfenced»«mrow»«mo»-«/mo»«mn»4«/mn»«/mrow»«/mfenced»«mo»,«/mo»«mn»0«/mn»«/mrow»«/mfenced»«/mtd»«/mtr»«mtr»«mtd»«mo»§nbsp;«/mo»«mo»§nbsp;«/mo»«mo»§nbsp;«/mo»«mo»§nbsp;«/mo»«mo»§nbsp;«/mo»«mo»§nbsp;«/mo»«mo»§nbsp;«/mo»«mo»=«/mo»«mo»§nbsp;«/mo»«mi»M«/mi»«mi»a«/mi»«mi»x«/mi»«mi»i«/mi»«mi»m«/mi»«mi»u«/mi»«mi»m«/mi»«mo»§nbsp;«/mo»«mfenced close=¨]¨ open=¨[¨»«mrow»«mo»-«/mo»«mn»3«/mn»«mo»,«/mo»«mo»-«/mo»«mn»4«/mn»«mo»,«/mo»«mo»-«/mo»«mn»4«/mn»«mo»,«/mo»«mn»0«/mn»«/mrow»«/mfenced»«/mtd»«/mtr»«mtr»«mtd»«mo»§nbsp;«/mo»«mo»§nbsp;«/mo»«mo»§nbsp;«/mo»«mo»§nbsp;«/mo»«mo»§nbsp;«/mo»«mo»§nbsp;«/mo»«mo»§nbsp;«/mo»«mo»=«/mo»«mo»§nbsp;«/mo»«mn»0«/mn»«/mtd»«/mtr»«/mtable»«/math»

 

From the above calculations the maximum value obtained is 0. Finding the maximum value for Mi,j position, one can notice that there is no chance to see any negative values in the matrix, since we are taking 0 as lowest value.

 

After filling the matrix, keep the pointer back to the cell from where the maximum score has been determined. In the similar fashion fill all the values of the matrix of the cell.

 

For the example the matrix can be filled is shown in Figure 2.

 

Figure 2: Matrix filling with back pointers

 

Each cell is back pointed by one or more pointers from where the maximum score has been obtained.

 

Trace backing the sequences for an optimal alignment:

 

The final step for the appropriate alignment is trace backing, prior to that one needs to find out the maximum score obtained in the entire matrix for the local alignment of the sequences.It is possible that the maximum scores can be present in more than one cell, in that case there may be possibility of two or more alignments, and the best alignment by scoring it.

 

In this example we can see the maximum score in the matrix as 18, which is found in two positions that lead to multiple alignments, so the best alignment has to be found.

 

So the trace back begins from the position which has the highest value, pointing back with the pointers,  thus find out the possible predecessor, then move to next predecessor and continue until we reach the score 0 (Figure 3).

 

 

Figure 3: Trace back of first possible alignment

 

It is possible to find two pointers pointing out from one cell, where both ways(alignments) can be considered,  best one is found by scoring and finding maximum score among them.

 

Figure 4: Trace back of second possible alignment

 

Thus a local alignment is obtained and one can see the possible alignments as in Figure 5.

 

 

 Figure 5: Scoring for best alignment

 

The two alignments can be given with a score, for matching as +5 , mismatch as -3 and gap penalty as -4, sum up all the individual scores and the alignment which has maximum score after this can be taken as the best alignment.

 

By summing up the scores both of the alignments are giving the same as 18, so one can predict both alignments are the best.

 

 

 

 

Cite this Simulator:

.....
..... .....

Copyright @ 2017 Under the NME ICT initiative of MHRD

 Powered by AmritaVirtual Lab Collaborative Platform [ Ver 00.12. ]