
Objective

To study proteins, their structure and motifs.
To introduce the importance of the Prosite database.
To learn how to find the motif patterns of a protein using the Prosite database.

Theory

Proteins:

Proteins are fundamental components of all living cells, and each protein has a specific function in the living organism. All proteins are made up of long chains of amino acids that fold into a 3-D shape. The primary structure of a protein is its linear sequence of amino acids. The conserved regions in a protein are called motifs. A protein can be characterized by more than one motif, and it can be classified using certain specific motifs. Some examples of motifs are helix-turn-helix, helix-loop-helix and the omega loop. The DNA-binding protein lac repressor is an example of a protein with a helix-turn-helix motif. Since motifs regulate and perform different functions in a protein, motif detection in proteins is very significant. The description of a motif using a regular expression is referred to as a pattern. Patterns can be used to functionally annotate and classify proteins. The Prosite database covers a large number of known protein families and domains. It uses patterns and profiles that help to identify the possible functions of a new sequence from existing sequences.

Proteins are fundamental components of all living cells and play a vital role in various cellular functions. Each protein has a specific function in our body. For example, hemoglobin is a protein found in red blood cells that carries oxygen from the lungs to the cells and carries carbon dioxide back to the lungs. The structure of a protein determines its function. To carry out its function properly, a protein must bind other molecules very specifically, and for this reason every protein has a particular structure. Protein structure is classified into primary, secondary, tertiary and quaternary levels. Proteins are synthesized as primary sequences and then fold to form secondary, tertiary and quaternary structures.

All proteins are made up of long chains of amino acids that fold into a 3-D shape. Amino acids are organic compounds in which a central alpha (α) carbon is bonded to a hydrogen atom, two functional groups and a side chain (R group). The α carbon is the first carbon atom attached to a functional group. The two functional groups in an amino acid are an amino group and a carboxyl group. The side chain identifies the particular amino acid. There are 20 standard amino acids found in the human body, and they differ in their R groups. An R group can be hydrophobic or hydrophilic. Hydrophobic side chains tend to move away from a watery environment, while hydrophilic side chains are attracted towards it. Some hydrophilic side chains are acidic and some are basic, so the basic groups are attracted towards the acidic groups. These interactions help the protein to reach its native conformation. The native conformation is the state of a protein in which it is correctly folded and functional.

Amino acids are linked to each other by peptide bonds. A peptide bond is formed when the carboxyl group of one amino acid is linked to the amino group of another amino acid through a covalent bond. During this reaction a molecule of water is released. A short sequence of amino acids held together by peptide bonds is called a peptide. Each amino acid in a peptide is called a residue. The N-terminus is the start of a protein, where the amino acid has a free amino group (-NH₂), and the C-terminus is the end of a protein, where the amino acid has a free carboxyl group (-COOH).

Structure of protein:

The primary structure of a protein is its linear sequence of amino acids. It is synthesized during the translation of mRNA (messenger RNA). mRNA is formed from DNA during the process of transcription. DNA (deoxyribonucleic acid) is the genetic material that contains the information for developing and maintaining all functions in living organisms. The information is stored as a genetic code using four types of bases: adenine (A), guanine (G), cytosine (C) and thymine (T). In the two strands of DNA, adenine always pairs with thymine and guanine pairs with cytosine. Each base is bonded to a sugar and a phosphate group to form a nucleotide. The base pairing of DNA results in a twisted, ladder-shaped structure of the two strands, which is called a double helix. Among the bases, RNA differs from DNA in only one: RNA has uracil (U) instead of thymine (T). During transcription, DNA is transcribed to mRNA, which contains uracil wherever the DNA template has adenine.
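The base-pairing rule used in transcription can be sketched in a few lines of Python. This is a minimal illustration only: the function name `transcribe` and the short template sequence are assumptions made for the example, and strand direction and the rest of the transcription machinery are ignored.

```python
# Pairing of each DNA template base with its mRNA base.
# In RNA, uracil (U) takes the place of thymine (T).
TEMPLATE_TO_MRNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_strand):
    """Return the mRNA bases that pair with a DNA template strand.

    Simplified sketch: the template is read left to right and
    strand direction (5'/3') is not modelled.
    """
    return "".join(TEMPLATE_TO_MRNA[base] for base in template_strand.upper())

# Made-up six-base template, for illustration only.
print(transcribe("TACGGT"))  # AUGCCA
```

Every adenine in the template gives a uracil in the mRNA, which is the only place where the RNA alphabet differs from the DNA alphabet.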

Hydrogen bonding between the amide groups in the primary structure of a protein gives rise to the secondary structure. A hydrogen bond is the attraction of a hydrogen atom towards an electronegative atom (N, O, F, etc.). When it forms within the same molecule it is called intramolecular hydrogen bonding, and when it forms between two different molecules it is called intermolecular hydrogen bonding. Alpha helices and beta sheets are the two important secondary structures in proteins.

Figure 1: Structure of proteins

(image src: http://en.wikipedia.org/wiki/File:Protein_structure.png)

Database:

A database is a structured collection of data that is updated regularly. It is organized in such a way that a computer can search it for a user-defined query and return the related information. A query is a request used to retrieve information from a database. Biological sequence databases contain large numbers of nucleic acid sequences and protein (amino acid) sequences from living organisms. Prosite is a protein database. It contains biological information that describes protein families, protein domains and functional sites. A domain is an independently folded part of the sequence or structure of a protein. Each protein has one or more domains, each with a specific function. For example, most of the proteins involved in intracellular signaling contain a PH domain (Pleckstrin homology domain). This domain can bind to certain lipid molecules within the membrane and so recruit the protein to the membrane.

This database helps to identify the possible functions of a new sequence from existing sequences. Even if the new sequence is not closely related to the existing proteins and a complete alignment cannot be found, one can still identify the presence of a pattern or motif using the database.

Motifs:

Motifs are the conserved regions of a protein. A protein can be characterized by more than one motif, and it can be classified using certain specific motifs. Some examples of motifs are:

Helix-turn-helix:

It is composed of two α-helices connected by a few amino acids. One helix is towards the N-terminal end and the other towards the C-terminal end. Helix-turn-helix has an important role in DNA recognition and DNA binding. The second (C-terminal) helix is usually the DNA-recognizing helix. It binds to the DNA using hydrogen bonds and van der Waals interactions. The other helix helps to stabilize the protein-DNA interaction. This motif has an important role in the regulation of gene expression. Gene expression is the process of producing a functional protein or RNA from a gene.

Helix-loop-helix:

It is composed of two α-helices that are joined by a loop. A loop is the region between two secondary structure elements. This motif is characteristic of a family of transcription factors. A transcription factor is a DNA-binding protein that controls the transcription of DNA. The two helices help in DNA binding, so in transcription factors this motif helps the protein to bind to specific DNA sequences.

Omega loop:

It is a loop-shaped stretch of the polypeptide chain. The motif contains a large number of internal hydrogen bonds. Because of this, it has an important role in protein stability and folding.

The DNA-binding protein lac repressor is an example of a protein with a helix-turn-helix motif. It controls the expression of the genes for lactose metabolism in bacteria by stopping the transcription of the mRNA that is translated to form the Lac proteins. One of these proteins is β-galactosidase (a lactase), an enzyme that facilitates the breakdown of lactose into glucose and galactose. So in the absence of lactose, the lac repressor prevents the cell from producing the enzymes that break down lactose. Since motifs regulate and perform different functions in a protein, motif detection in proteins is very significant.

Prosite:

The Prosite database uses patterns and profiles that help to identify the possible functions of a new sequence from existing sequences. This determines the protein family to which the new sequence belongs. The description of a motif using regular expression syntax is called a pattern. The amino acids described by a pattern make up a motif in a protein. Patterns can be used to functionally annotate and classify proteins. Functional annotation describes the important features of a protein, whereas classification discriminates between members and non-members of a particular protein family.

The Prosite database covers a large number of known protein families and domains. Prosite profiles carry information about the entire length of the sequence. Prosite was created by Amos Bairoch and is maintained at the SIB Swiss Institute of Bioinformatics in Geneva, Switzerland. A pattern file obtained from the Prosite database has a layout similar to the EMBL format (Figure 2).

Figure 2: The pattern for PS00107 (Protein kinase)

The representation of each line is listed below:

‘ID’ line represents identification for each entry.
‘AC’ line indicates Accession number.
‘DT’ line shows the date.
‘DE’ line shows the short description.
‘PA’ line shows the pattern.
‘MA’ line represents the Matrix or profile.
‘RU’ line represents the Rule.
‘NR’ line is for the numerical results.
‘CC’ line is for Comments.
‘DR’ line shows the cross-reference to Swiss-Prot.
‘3D’ line shows the cross-reference to PDB.
‘DO’ line represents the documentation file.
‘//’ line is the termination line.
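As a minimal sketch of how such a file could be read, the Python below groups the content of each line under its two-character line code and stops at the ‘//’ termination line. The sample entry is made up purely to show the layout (it is not a real Prosite record), and the function name `parse_entry` is an assumption for this example.

```python
from collections import defaultdict

# Hypothetical entry, invented only to illustrate the line layout.
SAMPLE_ENTRY = """\
ID   EXAMPLE_MOTIF; PATTERN.
AC   PS99999;
DE   Illustrative motif signature (made up).
PA   [AV]-D-{CE}-x(5).
CC   Made-up comment line.
//
"""

def parse_entry(text):
    """Group line contents by their two-character line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):      # termination line ends the entry
            break
        code, content = line[:2], line[2:].strip()
        if code.strip():               # skip blank lines
            fields[code].append(content)
    return dict(fields)

entry = parse_entry(SAMPLE_ENTRY)
print(entry["AC"])             # accession number line(s)
print(" ".join(entry["PA"]))   # pattern, joined if it spans several lines
```

Each code maps to a list because a line code such as ‘PA’ or ‘CC’ may occur on more than one line of the same entry.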

In the file, the 'PA' line gives the pattern for a motif. Prosite patterns follow a set of syntax rules:

The amino acids are represented by using standard IUPAC one-letter code.
The symbol ‘x’ indicates any amino acid can occur at that position.
The symbol ‘[ ]’ indicates the possible amino acids at a particular position.
The symbol ‘{ }’ indicates the prohibited amino acids at a particular position.
In a pattern, x(3) represents x-x-x.
In a pattern, x(2,4) represents any sequence that matches x-x, x-x-x or x-x-x-x.
The symbol ‘-’ separates the elements of a pattern.
The symbol ‘<’ indicates N-terminal restriction of the pattern.
The symbol ‘>’ indicates C-terminal restriction of the pattern.
The symbol ‘.’ indicates the end of the pattern.

E.g. the pattern [AV]-D-{CE}-x(5) can be read as [Ala or Val]-Asp-{any amino acid except Cys or Glu}-any amino acid-any amino acid-any amino acid-any amino acid-any amino acid.
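The syntax rules above map closely onto ordinary regular expressions. The following Python sketch translates a Prosite pattern into a regular expression and scans a sequence with it. It is a simplified illustration: the function name `prosite_to_regex` and the test sequences are assumptions made for the example, and only the rules listed in this section are handled.

```python
import re

# One pattern element: [..], {..} or a single letter, optionally followed
# by a repeat count such as (3) or (2,4).
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a Prosite pattern string into a Python regular expression."""
    pattern = pattern.strip().rstrip(".").replace(" ", "")  # '.' ends the pattern
    regex = ""
    for element in pattern.split("-"):        # '-' separates the elements
        at_start = element.startswith("<")    # N-terminal restriction
        at_end = element.endswith(">")        # C-terminal restriction
        match = ELEMENT.fullmatch(element.strip("<>"))
        if match is None:
            raise ValueError("unsupported pattern element: " + element)
        core, low, high = match.groups()
        if core in ("x", "X"):
            part = "."                        # any amino acid
        elif core.startswith("{"):
            part = "[^" + core[1:-1] + "]"    # prohibited amino acids
        else:
            part = core                       # [allowed set] or a single residue
        if low:
            part += "{" + low + ("," + high if high else "") + "}"
        regex += ("^" if at_start else "") + part + ("$" if at_end else "")
    return regex

regex = prosite_to_regex("[AV]-D-{CE}-x(5).")
print(regex)                                  # [AV]D[^CE].{5}

# Made-up sequences, for illustration only.
hit = re.search(regex, "MKVDGAAAAAK")
print(hit.group(), hit.span())                # VDGAAAAA (2, 10)
print(re.search(regex, "ADCAAAAA"))           # None: Cys is prohibited at position 3
```

Note that `re.search` reports only the first match and uses 0-based positions. To list every non-overlapping occurrence in a sequence, `re.finditer` can be used with the same regular expression.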
