- To study the pairwise sequence similarity search using BLAST algorithm.
- To study the functional and evolutionary relationships between different sequences.
BLAST program was designed by Stephen Altschul, Warren Gish, Webb Miller, Eugene Myers, and David J. Lipmann at National Institutes of Health (NIH) and was published in Journal of Molecular Biology in 1990. BLAST (Basic local alignment search tool) is a heuristic search algorithm, it finds the solutions from the all possibilities ,which takes input as nucleotide or protein sequence and compare it with existing databases like NCBI, GenBank etc. It finds the local similarity between different sequences and calculates the statistical significance of matches. It can also be used to find functional and evolutionary relationship between different sequences. Search is done by taking the sequence of a certain word size, comparing it with the database sequence and scores are assigned for each comparison. Based on the threshold, a suitable match of that query word is taken and the alignment is extended to both sides. After the alignment is complete, the total score is calculated and alignment is displayed on the blast result page only if the total scores exceed the threshold value.
Sequence, Sequence Alignment and importance:
A biological sequence refers to a sequence of characters which belong to DNA/RNA/protein. Two types of biological sequences are most commonly known namely, Nucleotide Sequence and Protein Sequence. Nucleotide sequence is mainly formed of four different nucleotides namely, adenine (A), guanine (G), cytosine (C) and tyrosine (T). While protein sequence is formed of 20 different amino acids which are commonly found. The nucleotides arrange themselves in the form of triplet code (triplet code refers to a group of three nucleotides) to code for an amino acid. These sequences are properly indexed in the already existing databases and it is possible to retrieve these sequences from their corresponding databases. The sequences are obtained by the following methods explained below.
DNA sequencing methods:
Sanger Method (dideoxy chain termination method): Here 4 test tubes are taken labelled with A, T, G and C. Into each of the test tubes DNA has to be added in denatured form (single strands). Next a primer is to be added which anneals to one of the strand in template. The 3' end of the primer accomadates the dideoxy nucleotides [ddNTPs] (specific to each tube) as well as the deoxy nucleotides randomly. When the ddNTP's gets attached to the growing chain, the chain terminatesdue to lack of 3'OH which forms the phospho diester bond with the next nucleotide. Thus small strands of DNA are formed. Electrophoresis is done and the sequence order can be obtained by analysing the bands in the gel based on the molecular weight. The primer or one of the nucleotides can be radioactively or fluorescently labeled also, so that the final product can be detected from the gel easily and the sequence can be inferred.
Maxam-Gilbert (Chemical degradation method): This method requires denature DNA fragment whose 5' end is radioactively labeled. This fragment is then subjected to purification before proceeding for chemical treatment which results in a series of labeled fragments. Electrophoresis technique helps in arranging the fragments based on their molecular weight. To view the fragments, gel is exposed to X-ray film for autoradiography. A series of dark bands will appear, each corresponding to a radio labeled DNA fragment, from which the sequence can be inferred.
Protein sequencing methods:
Edman Degradation reaction: The reaction finds the order of amino acids in a protein from the N-terminal, by cleaving each amino acid from the N-terminal without distrubing the bonds in the protein. After each clevage, chromatography or electrophoresis is done to identify the amino acid
Mass Spectrometry: It is used to determine the mass of particle, composition of molecule and for finding the chemical structures of molecules like peptides and other chemical compounds. Based on the mass to charge ratio, one can identify the amino acids in a protein
Sequence Alignment or sequence comparison lies at heart of the bioinformatics, which describes the way of arrangement of DNA/RNA/Protein sequences and identify the regions of similarity among them. It is used to infer structural, functional and evolutionary relationship between the sequences. Alignment finds similarity level between the query sequence and different available database sequences. The algorithm works by dynamic programming approach which divides the problem into smaller independent sub problems. It finds the alignment more quantitatively by assigning scores.
Methods of Sequence Alignment:
There are mainly two methods of Sequence Alignment:
Global Alignment :Sequences having same length and quite similar are very much appropriate for global alignment. Here the alignment is carried out from beginning of the sequence to end of the sequences to find out the best possible alignment.
Local Alignment:Sequences which are suspected to have similarity or even dissimilar sequences can be compared with local alignment method. It finds the local regions with high level of similarity
BLAST is one of the pairwise sequence alignment tool used to compare different sequences. There are different BLAST programs for different comparisons as shown in Table 1.
Nucleotide BLAST Programs:
BLASTN : The initial search is done for a word of length ‘w’ and threshold score ‘T’. Whole sequence is divided into words with a length of 11 for nucleic acids (maximum number of words can be calculated by L-w+1= max.word no (L=sequence length, w=words)). The BLASTn algorithm parses nucleotide sequences into 11 letter “words” the same is done for every sequence in the query database, word matches are being identified from the database sequence. This searches for somewhat similar sequences.
Mega BLAST: Searches for highly similar sequences.
Discontiguous Mega BLAST: Searches for more dissimilar sequences.
Protein BLAST Programs :
BLASTp: Finds the similarity between the query protein sequences to a protein sequences available in the protein database. BLASTp also reports for global alignment, which is the preferred result for protein identification. The BLASTp algorithm parses protein sequences into 3 letter “words” the same is done for every sequence in the query database, word matches are being identified from the database.
PSI-BLAST: Position-Specific Iterated -BLAST is the most sensitive BLAST program. It is used to find very distantly related proteins or new members of protein family. Algorithm builds a position-specific scoring matrix (PSSM or profile) from an iterative alignment of sequences, returns with E-values and threshold (default=0.005). E-value It decreases exponentially with the score that is assigned to a match between two sequences.
PHI-BLAST: Pattern-Hit Initiated BLAST is used to find protein sequences which contains a pattern, specified by the user and are similar to the query sequence. This requirement was proposed to reduce the number of hits which contains only the pattern, but is likely to have no true homology to the query. To run PHI-BLAST, enter the query into the Search box, and enter the pattern into the PHI pattern box. Only one pattern can be used in one search.
Scoring matrices :
Scoring matrices are used to assign score for comparision of pairs of characters. There are different types of scoring matrices like:
Identity Matrices: In this type of matrix, the score would either 1's or 0's. 1's will lie along the diagonal. Basically the scoring scheme is based on matches and mismatches.
Unitary scoring matrices: This matrix also have either 0's or 1's as their scores . The difference is that it takes into the idea of transitions(change among purines or pyramidines) and transversions(change between purine and a pyramidine).
PAM Matrices: Margaret Dayhoff was the first one to develop the PAM matrix, PAM stands for Point Accepted Mutations. PAM matrices are calculated by observing the differences in closely related proteins. One PAM unit (PAM1) specifies one accepted point mutation per 100 amino acid residues, i.e. 1% change and 99% remains as such.
BLOSUM: BLOcks SUbstitution Matrix, developed by Henikoff and Henikoff in 1992, using conserved regions, these matrices are actual percentage identity values. Simply to say they depend on similarity. Blosum 62 means there is a 62 % similarity.
Parameters used in BLAST algorithm :
Threshold: It is a boundary of minimum or maximum value which can be used to filter out words during comparison.
True Homology: In BLAST true homology refers how much the sequence is similar to the query sequence.
E-value: It decreases exponentially with the score that is assigned to an alignment between two sequences.
Word size: Whole Search is done by taking the sequence of a certain word size and compares it with the database sequence and scores are assigned for each comparison. Word size is given as 11 for nucleic acids and 3 for proteins.
Putative conserved domains: These are the domains that have different functionalities.
Gap score or gap penalty: Dynamic programming algorithms uses gap penalties to maximize the biological meaning. Gap penalty is subtracted for each gap that has been introduced. There are different gap penalties such as gap open and gap extension. The gap score defines, a penalty given to alignment when we have insertion or deletion. During the evolution, there may be a case where we can see continuous gaps all along the sequence, so the linear gap penalty would not be appropriate for the alignment. Thus gap open and gap extension has been introduced when there are continuous gaps (five or more). The open penalty is always applied at the start of the gap, and then the other gaps following it is given with a gap extension penalty which will be lesser compared to the open penalty. Typical values are –12 for gap opening, and –4 for gap extension.
Working of BLAST Algorithm:
- Query sequence is taken and analyzed for low complex regions. Low complexity regions are regions which contain less information or variations like AAAAAAAA or ATATATAT etc.
- These low complex regions are masked with alphabet s like X or N
- List of words of certain word size is made. Usually the word size is 3 for proteins and 11 for DNA
- Scores are calculated for each pair of words(query sequence word and database word) using substitution scoring matrixes (like PAM or BLOSUM),and only the high scoring words i.e. above a threshold value or a cutoff score is taken for further alignment. A cutoff score is selected to reduce number the number of matches so as to decrease the computation time.
- This scoring and checking is repeated for all the words in the query sequence.
- The remaining high-scoring words are organised into efficient search tree and rapidly compared to the database sequence. This is done to find out the exact matches.
- If an exact or good match is found then an alignment is extended in both directions from the position where the exact match occurred
Image source : petang.cgu.edu.tw/Bioinfomatics/MANUALS/NCBIblast/BLAST_algorithm.html
- High scoring pairs (HSP) which have score greater than a threshold are taken for consideration.
- Significance of the HSP score are calculated. The probability p of observing a score S equal to or greater than x is given by the equation:
- Statistical assessments are made in the case if two or more HSP regions are found and certain matching pairs are put in descending order in the output file as far as their similarity/ score is concerned.