Pairwise sequence alignment using FASTA (Theory) : Bioinformatics Virtual Lab II : Biotechnology and Biomedical Engineering : Amrita Vishwa Vidyapeetham Virtual Lab

Objective

To use FASTA to carry out a sequence similarity search of protein and nucleotide query sequences against sequence databases.

Theory

FASTA is a pairwise sequence alignment tool that takes a nucleotide or protein sequence as input and compares it with the sequences in existing databases. The name also refers to the FASTA file format, a text-based format that can be read and written with any text editor or word processor. A FASTA record begins with a description line that starts with the '>' symbol, followed by the gi and accession number and then the description, all on a single line. The sequence starts on the next line, and each row holds a fixed number of nucleotides or amino acids (commonly 60). DNA and protein sequences are written in the one-letter IUPAC nucleotide and amino acid codes. The FASTA program finds local similarity between sequences and calculates the statistical significance of the matches. It can also be used to infer functional and evolutionary relationships between sequences.

The FASTA program uses word hits to identify potential matches before attempting the more time-consuming optimised search. Speed and sensitivity are controlled by the parameter ktup, which specifies the word size. Increasing ktup decreases the number of background hits, which makes the search faster but less sensitive. The program first looks for segments that contain several nearby hits. It is generally considered more sensitive than the BLAST programs, which is reflected in the longer time it needs to produce results. FASTA produces local alignment scores for the comparison of the query sequence with every sequence in the database. Because significance is judged against the scores obtained from these real sequences, with their natural correlations, the approach avoids the artificiality of a random sequence model. The sequences themselves are obtained by the following methods.

DNA sequencing methods:

Sanger Method (dideoxy chain termination method): Four test tubes are labelled A, T, G and C. Denatured (single-stranded) DNA is added to each tube. A primer is then added, which anneals to one strand of the template. DNA polymerase extends the 3' end of the primer, incorporating at random either ordinary deoxynucleotides (dNTPs) or the dideoxynucleotide (ddNTP) specific to that tube. When a ddNTP is incorporated, the growing chain terminates, because the ddNTP lacks the 3'-OH group needed to form a phosphodiester bond with the next nucleotide. This produces DNA strands of different lengths. The fragments are separated by electrophoresis, and the sequence is read from the order of the bands in the gel, which reflects fragment size. The primer or one of the nucleotides can be radioactively or fluorescently labelled so that the products are easy to detect in the gel.

Maxam-Gilbert (Chemical degradation method): This method also requires denatured DNA. The 5' end of the strand is radioactively labelled and the DNA fragment is purified. Chemical treatment then generates a series of labelled fragments, which are separated by gel electrophoresis. To view the fragments, the gel is exposed to X-ray film for autoradiography. A series of dark bands appears, each corresponding to a radiolabelled DNA fragment, from which the sequence can be inferred.
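
The gel read-out of the Sanger method described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: fragment lengths are taken as known integers, termination is assumed to occur at every position, and the function names and input format are hypothetical.

```python
def simulate_tubes(new_strand):
    """Fragment lengths per tube, assuming termination occurs at every position."""
    lanes = {base: [] for base in "ACGT"}
    for position, base in enumerate(new_strand, start=1):
        # A fragment ending at this position was terminated by the ddNTP of this base.
        lanes[base].append(position)
    return lanes


def read_gel(lanes):
    """Infer the synthesised sequence from fragment lengths in the four tubes."""
    # Pair every fragment length with the base of the tube it came from.
    bands = [(length, base) for base, lengths in lanes.items() for length in lengths]
    # Sorting by fragment length gives the bases in 5' to 3' order.
    bands.sort()
    return "".join(base for _, base in bands)


lanes = simulate_tubes("GATTACA")
print(lanes)
print(read_gel(lanes))
```

Note that the sequence read this way is that of the newly synthesised strand, which is complementary to the template.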

Protein sequencing methods:

Edman Degradation reaction: This reaction determines the order of amino acids in a protein starting from the N-terminus. It cleaves one amino acid at a time from the N-terminus without disturbing the other peptide bonds in the protein. After each cleavage, chromatography or electrophoresis is used to identify the released amino acid.

Mass Spectrometry: It is used to determine the mass of particles, the composition of a molecule, and the chemical structure of molecules such as peptides and other chemical compounds. Based on the mass-to-charge ratio, one can identify the amino acids in a protein.

Sequence Alignment and importance:

Sequence alignment, or sequence comparison, lies at the heart of bioinformatics. It is the arrangement of DNA, RNA or protein sequences so as to identify regions of similarity among them. It is used to infer structural, functional and evolutionary relationships between the sequences. Alignment measures the level of similarity between the query sequence and the different database sequences. The algorithm uses dynamic programming, which divides the problem into smaller subproblems, solves each one once, and quantifies the alignment by assigning scores.

Methods of Sequence Alignment:

There are mainly two methods of sequence alignment:

Global Alignment: Sequences that are of about the same length and quite similar are well suited to global alignment. The alignment is carried out from the beginning to the end of both sequences to find the best possible alignment.

Local Alignment: Sequences that are suspected to share similarity, or even dissimilar sequences, can be compared by local alignment. It finds the local regions with a high level of similarity.
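
The difference between the two methods can be illustrated with a small dynamic programming sketch. It assumes a simple match/mismatch/linear gap scoring scheme with made-up values, and the function name `alignment_score` is hypothetical. The global variant follows the Needleman-Wunsch recurrence and the local variant follows the Smith-Waterman recurrence; only the score is computed, not the alignment itself.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of a and b by dynamic programming.

    local=False: global alignment, from the start to the end of both sequences.
    local=True:  local alignment, the best-scoring pair of subregions.
    The scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment must pay for gaps at the start of either sequence.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so scores never drop below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


x, y = "TTTGATTACATTT", "CCGATTACACC"
print("global:", alignment_score(x, y))
print("local: ", alignment_score(x, y, local=True))
```

For these two sequences the local score is higher than the global score, because the local method only scores the shared region and ignores the dissimilar flanks.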

A FASTA file looks like this:

>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN(OVALBUMINRELATED)

QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE

KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS

Source: FASTA sequence of an ovalbumin-related protein from chicken, retrieved from NCBI.
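
A record in this format can be read with a few lines of code. The following is a minimal sketch, not a full parser: the function name `parse_fasta` is hypothetical, and it assumes that every record starts with a '>' line and that blank lines can be ignored.

```python
def parse_fasta(text):
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between rows
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence rows are joined into one string
    if header is not None:
        yield header, "".join(chunks)


example = """>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN(OVALBUMINRELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
"""

for description, sequence in parse_fasta(example):
    fields = description.split("|")  # gi, number, database, accession, name
    print(fields[1], fields[3], len(sequence))
```

Splitting the description line on '|' separates the gi number and the accession number from the free-text description.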

FASTA Programs

FASTA: Compares a protein sequence to the protein sequences in a database, or a nucleotide sequence to the nucleotide sequences in a database.

FASTX, FASTY: Compare a nucleotide query sequence to a protein sequence database by translating the nucleotide sequence, allowing for frameshifts.

SSEARCH: Performs a Smith-Waterman alignment between a protein sequence and another protein sequence, or between a nucleotide sequence and another nucleotide sequence. The alignment is local.

GGSEARCH: Compares a protein or DNA sequence to a sequence database using global alignment. It only compares the query to library sequences whose length is between 80% and 125% of the length of the query.

GLSEARCH: Compares a protein or DNA sequence to a sequence database. The alignments are global in the query and local in the database sequence.

Parameters used in the FASTA algorithm:

Threshold: A minimum or maximum boundary value that is used to filter out words during comparison.

True Homology: In FASTA, true homology refers to how similar a database sequence is to the query sequence.

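E-value: The expected number of alignments with a score at least as high that would occur by chance in a database search. It decreases exponentially as the score assigned to an alignment between two sequences increases.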

Putative conserved domains: These are regions of the sequence that are predicted to match known conserved domains, each of which may have a different function.
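
The exponential relationship between score and E-value can be made concrete with a short sketch. It uses the Karlin-Altschul form E = K * m * n * exp(-lambda * S) purely as an illustration; this formula is an assumption that does not come from the text above. The values of `k` and `lam` are made up, and the function name `expected_hits` is hypothetical. FASTA itself derives its statistics from the scores of real database sequences, as described in the Theory section.

```python
import math


def expected_hits(score, query_len, db_len, k=0.1, lam=0.3):
    """Illustrative E-value: K * m * n * exp(-lambda * S).

    k and lam are made-up parameters chosen only to show the trend.
    """
    return k * query_len * db_len * math.exp(-lam * score)


# Raising the score by a fixed amount divides the E-value by a fixed factor.
for s in (40, 60, 80):
    print(s, expected_hits(s, query_len=300, db_len=1_000_000))
```

In this sketch every increase of 20 in the score divides the E-value by the same factor, exp(0.3 * 20), which is what an exponential decrease means in practice.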

Working of the FASTA algorithm:

A nucleotide or protein sequence is taken as input.

Speed and sensitivity are controlled by the parameter ktup, which specifies the word size. The program uses word hits to identify potential matches between the query sequence and a database sequence. The smaller the ktup value, the more sensitive the search. By default, ktup is 2 for proteins and 4 or 6 for nucleotides. The program first looks for segments that contain several nearby hits.

image source : upload.wikimedia.org/wikipedia/en/thumb/c/cd/Document_html_47f1ed1b.gif/432px-Document_html_47f1ed1b.gif

It then finds similar local regions by scoring matches and mismatches, and isolates the highest-scoring regions from the background hits. The scoring matrices used are BLOSUM50 for protein sequences and an identity matrix for nucleotide sequences. Local regions appear as diagonal lines in a dot plot of the two sequences.

It finds the best local regions and saves them.

It rescans and scores the local regions with a suitable scoring matrix.

It takes the highest-scoring subregions from the local regions. The highest subregion score is referred to as init1.

The subregions are compared with the library sequences to determine similarity. Subregions that score below the cutoff value are eliminated.

It checks whether the remaining subregions can be joined into a single alignment that includes gaps. The resulting initial similarity score (initn) is used to rank the library sequences.

Finally, it uses the Smith-Waterman algorithm to calculate an optimal score for the whole alignment.
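
The first stages of the procedure above (word hits, diagonals and init1) can be sketched as follows. This is a simplified illustration and not the FASTA program's actual code. It assumes a plain match/mismatch score in place of BLOSUM50 or the identity matrix, and the function names and the `top` parameter are hypothetical. The joining of regions (initn) and the final Smith-Waterman step are not shown.

```python
from collections import defaultdict


def word_hits(query, target, ktup=2):
    """Group exact ktup-word matches by diagonal (target offset - query offset)."""
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)  # lookup table of query words
    diagonals = defaultdict(list)
    for j in range(len(target) - ktup + 1):
        for i in index.get(target[j:j + ktup], ()):
            diagonals[j - i].append(i)  # hits on one diagonal share j - i
    return diagonals


def rescore_diagonal(query, target, diag, match=1, mismatch=-1):
    """Best ungapped segment score along one diagonal."""
    best = running = 0
    i = max(0, -diag)  # first query position that has a partner in target
    while i < len(query) and i + diag < len(target):
        running += match if query[i] == target[i + diag] else mismatch
        running = max(running, 0)  # drop a segment once its score falls below 0
        best = max(best, running)
        i += 1
    return best


def init1_score(query, target, ktup=2, top=10):
    """Highest rescored region among the diagonals with the most word hits."""
    diagonals = word_hits(query, target, ktup)
    candidates = sorted(diagonals, key=lambda d: len(diagonals[d]), reverse=True)[:top]
    return max((rescore_diagonal(query, target, d) for d in candidates), default=0)


query, target = "GATTACA", "CCGATTACACC"
print(dict(word_hits(query, target)))
print(init1_score(query, target))
```

Hits that fall on the same diagonal correspond to the diagonal lines in the dot plot. A larger ktup yields fewer hits and therefore fewer diagonals to rescore, which is why it makes the search faster but less sensitive.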
