Objective
- To introduce Entrez as a biological data retrieval system.
- To learn how to use the Entrez search engine to retrieve nucleotide and protein sequence data.
Theory
Entrez is an integrated search engine that lets users search and retrieve data from the databases of the National Center for Biotechnology Information (NCBI). It can be accessed at www.ncbi.nlm.nih.gov/Entrez/. Entrez is NCBI's major text search and retrieval system. It integrates the PubMed database with 39 other databases covering scientific literature, nucleotide and protein sequences, protein domain data, population study datasets, expression data, pathways and systems of interacting molecules, complete genome details and taxonomic information into a tightly interlinked system. These component databases can be searched with a single query.
The major functions of NCBI are:
- Create public databases for storing, retrieving, and analyzing knowledge about molecular biology, biochemistry, and genetics.
- Conduct research in computational biology, for analyzing the structure and function of biological molecules.
- Develop software tools for analyzing genomic data.
- Disseminate biomedical information.
- Gather biotechnology information worldwide.
Entrez thereby acts as the search engine for the NCBI databases. Searches can be made more precise by combining search terms with the Boolean operators AND, OR and NOT. Limits allow users to filter a search according to their needs, and an Advanced Search interface supports more detailed queries.
Queries can be restricted to particular fields by adding a tag to each search term. The syntax of a search query is shown below.
Search term [tag] Boolean operators [AND, OR, NOT] Search term [tag].
Table 1: Entrez Boolean Search Statements
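The search syntax above can be assembled programmatically. The following is a minimal Python sketch, not an official NCBI tool: the helper names `field_term` and `combine` are invented for illustration, the tags `[WORD]` and `[AUTH]` are the Text Word and Author fields used with GenBank records, and the author name "Smith J" is a placeholder.

```python
def field_term(term, tag=None):
    """Return one search term, optionally restricted to a field tag."""
    term = term.strip()
    # Quote multi-word terms so they are searched as a single phrase.
    if " " in term:
        term = f'"{term}"'
    return f"{term}[{tag}]" if tag else term


def combine(operator, *terms):
    """Join already formatted terms with AND, OR or NOT."""
    operator = operator.upper()
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported Boolean operator: {operator}")
    return "(" + f" {operator} ".join(terms) + ")"


# Hypothetical query: records mentioning "neurexin 1" by a placeholder author.
query = combine(
    "AND",
    field_term("neurexin 1", "WORD"),
    field_term("Smith J", "AUTH"),
)
print(query)
```

The resulting string can be pasted into the Entrez search box. Wrapping each combination in parentheses keeps the grouping explicit when several operators are nested.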
Users can perform a global search by selecting the default option “All Databases”, which displays the number of matching records found in each database. The databases are arranged in three main sections: the top section contains the literature databases, the middle section contains the molecular databases, and the bottom section contains the accessory literature databases (Journals, NLM Catalog and MeSH).
The databases included in Entrez are as follows.

- Books: Bookshelf provides free access to search, retrieve and read books and journals in the life sciences. It can be accessed from the site http://www.ncbi.nlm.nih.gov/books
- CDD: The Conserved Domain Database is a collection of annotations of functional units in proteins. It contains manually annotated domain models, which use 3D structure information to define sequence/structure/function relationships. It can be accessed from the site www.ncbi.nlm.nih.gov/sites/entrez
- Gene: The Gene database comprises gene information for various species, including nomenclature, associated pathways, RefSeq records, phenotypes and links to genomes. It can be accessed from the site http://www.ncbi.nlm.nih.gov/gene/
- EST: The Expressed Sequence Tag database is a collection of data from GenBank. ESTs are short sequence reads derived from cDNA, which act as a resource to evaluate gene expression, find potential variation and annotate genes. It can be accessed from the site http://www.ncbi.nlm.nih.gov/nucest
- Genome: The Genome database is a collection of genome information, which includes sequences, maps, chromosomes and annotations. It can be accessed from the site http://www.ncbi.nlm.nih.gov/genome
- dbGaP: The database of Genotypes and Phenotypes is an archive of results from studies of the interaction of genotype and phenotype. It can be accessed from the site http://www.ncbi.nlm.nih.gov/gap
- GEO Datasets: The Gene Expression Omnibus (GEO) offers information on gene expression datasets, their original series and Platform records. It also provides additional information such as experimental details, cluster tools and differential expression queries. It can be accessed from the site www.ncbi.nlm.nih.gov/gds
- GEO Profiles: It stores individual gene expression profiles, which can be browsed by gene annotation or by pre-computed profile characteristics. It can be accessed from the site http://www.ncbi.nlm.nih.gov/geoprofiles
- GSS: The GSS nucleotide database contains the Genome Survey Sequence records from GenBank. It can be accessed from the site www.ncbi.nlm.nih.gov/nucgss
- HomoloGene: It is a collection of homologs from the annotated genes of completely sequenced eukaryotic organisms. It can be accessed from the site www.ncbi.nlm.nih.gov/homologene
- MeSH: MeSH (Medical Subject Headings) is the NLM (National Library of Medicine) controlled vocabulary used for indexing and browsing articles. It also acts as a thesaurus in biomedical sciences for PubMed and MEDLINE. It can be accessed from the site www.ncbi.nlm.nih.gov/mesh
- NLM Catalog: NLM (United States National Library of Medicine) is the largest medical library. Its catalog offers access to books, journals, technical information, audiovisuals, software and other resources. It can be accessed from the site http://www.ncbi.nlm.nih.gov/nlmcatalog
- OMIM: It is a comprehensive resource database for human genes and genetic disorders. It contains information about human genes and genetic phenotypes, which is updated daily. It can be accessed from the site www.ncbi.nlm.nih.gov/omim
- OMIA: Online Mendelian Inheritance in Animals is a resource for genes, inherited disorders and traits in more than 135 animal species, authored by Professor Frank Nicholas. It covers animal species other than human and mouse, for which species-specific resources are offered. It can be accessed from the site http://www.ncbi.nlm.nih.gov/omia
- PopSet: A population study dataset is a set of DNA sequences collected to study the evolutionary relatedness of a population. It can be accessed from the site http://www.ncbi.nlm.nih.gov/popset
- Probe: It is a collection of nucleic acid reagents. It also contains information on reagent distributors, probe effectiveness and computed sequence similarities. It can be accessed from the site http://www.ncbi.nlm.nih.gov/probe
- Protein Sequence Database: It is a collection of sequences from GenBank, RefSeq, TPA, SwissProt, PIR, PRF and PDB. It can be accessed from the site www.ncbi.nlm.nih.gov/protein
- PubChem BioAssay: It contains information on bioactivity screens of chemical substances from PubChem. It can be accessed from the site www.ncbi.nlm.nih.gov/pcassay
- PubChem Compound: It contains compounds with their unique structures and biological information from PubChem substances. It can be accessed from the site www.ncbi.nlm.nih.gov/pccompound
- PubChem Substance: It is a collection of records of substances from depositors into the system, descriptions of samples, and links to biological screening results which are available in PubChem BioAssay. It can be accessed from the site www.ncbi.nlm.nih.gov/pcsubstance
- PubMed: PubMed is a freely accessible database search system for health information which is developed and maintained by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM). It contains articles from MEDLINE and other biomedical articles. It can be accessed from the site www.ncbi.nlm.nih.gov/pubmed
- PubMed Central: PubMed Central is a freely accessible digital archive of full-text articles from biomedical and life science journals, which is linked to the PubMed database. It can be accessed from the site www.ncbi.nlm.nih.gov/pmc/
- SNP: The SNP database contains information on single nucleotide polymorphisms and short insertion and deletion polymorphisms. It can be accessed from the site www.ncbi.nlm.nih.gov/snp
- Structure: The Structure database contains information on the three-dimensional structures of proteins and polynucleotides. It can be accessed from the site www.ncbi.nlm.nih.gov/structure
- UniGene: It identifies transcripts from the same locus, analyses expression by tissue, age and health status, and reports related proteins (ProtEST) and clone resources. It can be accessed from the site www.ncbi.nlm.nih.gov/unigene
- UniSTS: It contains information about Sequence Tagged Sites (STS) defined by PCR primer pairs, together with their genomic positions, genes and sequence information from STS-based maps and other experiments. It can be accessed from the site www.ncbi.nlm.nih.gov/unists
- BioSample: It is a collection of descriptions of the different biological source materials used in experimental assays. It can be accessed from the site www.ncbi.nlm.nih.gov/biosample
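Besides the web pages listed above, the Entrez databases can also be queried by programs. The sketch below rests on an assumption that goes beyond the text: it uses NCBI's E-utilities ESearch service, which takes the database name and the search statement as the URL parameters `db` and `term`. The function name `esearch_url` is invented for illustration, and the code only builds the address without contacting NCBI.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service (assumption: not named in the text above).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Build an ESearch address for one Entrez database and one query string."""
    params = {"db": db, "term": term, "retmax": retmax}
    # urlencode escapes spaces and square brackets so the tag syntax survives in a URL.
    return ESEARCH + "?" + urlencode(params)


# Hypothetical example: search the protein database for a text word.
url = esearch_url("protein", "neurexin[WORD]")
print(url)
```

Opening the printed address in a browser should return a list of matching record identifiers. The identifiers and their count depend on the current state of the database.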
Search results can be displayed in different data formats, such as GenBank and FASTA.
GenBank: GenBank, the NIH genetic sequence database, is a collection of annotated DNA sequences. The fields of a GenBank record are explained below.
- Locus name helps to group entries with similar sequences. The first three characters denote the organism, the fourth and fifth characters give other group designations, such as gene product, and the last character is one of a series of sequential integers.
- Sequence Length gives the number of nucleotide base pairs (or amino acid residues) in the sequence record.
- Molecule Type shows the type of molecule that was sequenced.
- GenBank Division shows the GenBank division to which a record belongs, indicated by a three-letter abbreviation.
1. PRI - primate sequences
2. ROD - rodent sequences
3. MAM - other mammalian sequences
4. VRT - other vertebrate sequences
5. INV - invertebrate sequences
6. PLN - plant, fungal, and algal sequences
7. BCT - bacterial sequences
8. VRL - viral sequences
9. PHG - bacteriophage sequences
10. SYN - synthetic sequences
11. UNA - unannotated sequences
12. EST - EST sequences (expressed sequence tags)
13. PAT - patent sequences
14. STS - STS sequences (sequence tagged sites)
15. GSS - GSS sequences (genome survey sequences)
16. HTG - HTG sequences (high-throughput genomic seq)
17. HTC - unfinished high-throughput cDNA sequencing
18. ENV - environmental sampling sequences
- Modification Date shows the last date of modification.
- Definition is a brief description of the sequence that includes information such as the source organism, the gene or protein name, or some description of the sequence's function.
- Accession number indicates the unique identifier for a sequence record.
NT_123456 constructed genomic contigs
NM_123456 mRNAs
NP_123456 proteins
NC_123456 chromosomes
- Version shows a nucleotide sequence identification number that represents a single, specific sequence in the GenBank database.
- GI "GenInfo Identifier" is a sequence identification number for the nucleotide sequence.
- Keywords lists words or phrases that describe the sequence.
- Source indicates free-format information including an abbreviated form of the organism name, sometimes followed by a molecule type.
- Organism describes the formal scientific name for the source organism and its lineage.
- Reference includes publications by the authors of the sequence that discuss the data reported in the record.
- Authors contains the list of authors in the order in which they appear in the cited article.
Entrez Search Field: Author [AUTH]
- Title represents the title of the published work or the tentative title of an unpublished work.
Entrez Search Field: Text Word [WORD]
- Journal: MEDLINE abbreviation of the journal name.
Entrez Search Field: Journal Name [JOUR]
- Pubmed: PubMed Identifier (PMID)
- Features shows information about genes and gene products, as well as regions of biological significance reported in the sequence.
- Source is a mandatory feature in each record that summarizes the length of the sequence, the scientific name of the source organism, and the Taxon ID number. It can also include other information such as map location, strain, clone and tissue type, if provided by the submitter.
- Taxon is a stable unique identification number for the taxon of the source organism.
- CDS (Coding sequence) represents the region of nucleotides that corresponds to the sequence of amino acids in a protein.
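The accession prefixes listed under the Accession number field can be recognised automatically. The sketch below is illustrative only: the function name `describe_accession` is invented, the prefix table holds just the four prefixes listed in this section, and it assumes the usual "accession.version" notation in which the version follows a dot.

```python
import re

# The four prefixes listed in this section; real records use further prefixes.
PREFIXES = {
    "NT": "constructed genomic contig",
    "NM": "mRNA",
    "NP": "protein",
    "NC": "chromosome",
}

# Two capital letters, an underscore, digits, and an optional ".version" suffix.
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def describe_accession(identifier):
    """Split an identifier into accession, version and molecule type."""
    match = ACCESSION_RE.match(identifier.strip())
    if match is None:
        return None  # not in the prefix_number[.version] form
    prefix, number, version = match.groups()
    return {
        "accession": f"{prefix}_{number}",
        "version": int(version) if version else None,
        "molecule": PREFIXES.get(prefix, "prefix not in this table"),
    }


# NM_123456 is the placeholder accession from the list above, not a real record.
print(describe_accession("NM_123456.2"))
```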
Figure 1 : GenBank file obtained from NCBI database for the entry Homo sapiens Neurexin1
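The first line of a GenBank record carries the locus name, sequence length, molecule type, division and modification date described above. The following sketch splits such a line into named fields. The sample line is invented for illustration (it is not the Neurexin 1 record), the function name `parse_locus` is hypothetical, and the code assumes the eight-column layout of nucleotide records, which also includes a topology column (linear or circular) not discussed in this section.

```python
# A subset of the division codes listed in this section.
DIVISIONS = {
    "PRI": "primate sequences",
    "ROD": "rodent sequences",
    "MAM": "other mammalian sequences",
    "VRT": "other vertebrate sequences",
    "INV": "invertebrate sequences",
    "PLN": "plant, fungal, and algal sequences",
    "BCT": "bacterial sequences",
    "VRL": "viral sequences",
}

# Invented sample line, for illustration only.
SAMPLE = "LOCUS       EXAMPLE01               1200 bp    mRNA    linear   PRI 01-JAN-2020"


def parse_locus(line):
    """Split a nucleotide LOCUS line into its whitespace-separated fields."""
    fields = line.split()
    if len(fields) != 8 or fields[0] != "LOCUS":
        raise ValueError("expected an 8-field nucleotide LOCUS line")
    _, name, length, unit, molecule, topology, division, date = fields
    return {
        "locus_name": name,
        "length": int(length),
        "unit": unit,                     # "bp" for nucleotide records
        "molecule_type": molecule,
        "topology": topology,
        "division": division,
        "division_description": DIVISIONS.get(division, "not in this table"),
        "modification_date": date,        # kept as text, e.g. 01-JAN-2020
    }


info = parse_locus(SAMPLE)
for key, value in info.items():
    print(f"{key:22}{value}")
```

Protein records use a slightly different LOCUS layout, so this sketch rejects them rather than guessing at the columns.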
FASTA: It is a file format that represents a nucleotide or protein sequence as a string preceded by a basic tag or identifier, with the nucleotides or amino acids written as single-letter codes. A FASTA record starts with a greater-than symbol (>), which marks the beginning of a new sequence record. This first line is called the definition line ("def line"). It holds an accession number or version number followed by a description of the entry. The sequence itself, in either uppercase or lowercase letters, starts on the next line. Sequence lines are wrapped at a fixed width, typically 60 characters per line.
Figure 2: FASTA file format obtained from NCBI database for the entry Homo sapiens Neurexin1
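The FASTA layout described above is simple enough to read and write with a few lines of code. The following is a minimal sketch. The function names `parse_fasta` and `format_fasta` and the sample record are invented for illustration, and the 60-character width is the value quoted in the text.

```python
def parse_fasta(text):
    """Return a list of (definition_line, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new def line closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(header, sequence, width=60):
    """Write one record with the sequence wrapped at a fixed line width."""
    lines = [">" + header]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


# Hypothetical record: a made-up identifier and a repetitive 160-base sequence.
example = format_fasta("EXAMPLE01 hypothetical demo sequence", "ACGT" * 40)
print(example)
records = parse_fasta(example)
```

Parsing the formatted text returns the original def line and the unwrapped sequence, so the two functions are inverses of each other for well-formed input.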
The sequences stored in the database were obtained by different experimental methods. The most commonly used methods for DNA sequencing are the Sanger method and the Maxam-Gilbert method. Similarly, the Edman degradation method and mass spectrometry are used for protein sequencing.
Sanger Method (dideoxy chain termination method): Four test tubes are labelled A, T, G and C. Denatured (single-stranded) DNA is added to each tube, followed by a primer that anneals to one strand of the template. The primer is extended from its 3' end with deoxynucleotides and, at random, with the dideoxynucleotide (ddNTP) specific to that tube. When a ddNTP is attached to the growing chain, the chain terminates, because the ddNTP lacks the 3'-OH group needed to form a phosphodiester bond with the next nucleotide. Each tube therefore accumulates short DNA strands of different lengths that all end in the same base. The strands are separated by electrophoresis, and the sequence is obtained by reading the bands in the gel in order of fragment size. The primer or one of the nucleotides can be radioactively or fluorescently labeled so that the products are easily detected in the gel and the sequence can be inferred.
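The logic of reading a Sanger gel can be mimicked in a few lines. The sketch below is an idealised model, not a simulation of the chemistry: it assumes that every possible terminated fragment is produced in each tube, and the function names and the short template are invented for illustration.

```python
COMPLEMENT = str.maketrans("ATGC", "TACG")


def reverse_complement(strand):
    """Return the complementary strand, also written 5' to 3'."""
    return strand.translate(COMPLEMENT)[::-1]


def sanger_fragments(new_strand):
    """Return {ddNTP base: fragment lengths} for the four reaction tubes.

    A fragment of length n appears in tube X when base n of the newly
    synthesised strand is X, because a ddX was incorporated there.
    """
    tubes = {base: [] for base in "ATGC"}
    for position, base in enumerate(new_strand, start=1):
        tubes[base].append(position)
    return tubes


def read_gel(tubes):
    """Read the bands from the shortest fragment to the longest."""
    bands = sorted(
        (length, base) for base, lengths in tubes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


template = "GATTACA"                          # hypothetical template, 5' to 3'
new_strand = reverse_complement(template)     # the strand the polymerase builds
tubes = sanger_fragments(new_strand)
read = read_gel(tubes)
print(tubes)
print(read, reverse_complement(read))
```

The gel yields the sequence of the newly synthesised strand. Taking its reverse complement recovers the template.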
Maxam-Gilbert (chemical degradation method): This method requires a denatured DNA fragment whose 5' end is radioactively labeled. The fragment is purified and then treated with chemicals that cleave it at specific bases, which results in a series of labeled fragments. Electrophoresis separates the fragments by size. To view the fragments, the gel is exposed to X-ray film for autoradiography. A series of dark bands appears, each corresponding to a radiolabeled DNA fragment, from which the sequence can be inferred.
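How a sequence is inferred from the dark bands can be sketched as follows. The text above does not name the individual chemical treatments. This example assumes the four classic base-specific reactions (G, A+G, C+T and C), one per gel lane. It is an idealised model in which every target base is cleaved in some copy of the fragment, and the function names and the short strand are invented for illustration.

```python
# Lane name -> bases at which that chemical treatment cleaves (assumed reaction set).
REACTIONS = {"G": "G", "A+G": "AG", "C+T": "CT", "C": "C"}


def cleave(strand):
    """Return {lane: lengths of the 5'-labeled fragments seen in that lane}.

    Cleavage at the base with 0-based index i destroys that base and leaves
    i bases attached to the labeled 5' end. Cleavage at the first base leaves
    no labeled fragment, so that base cannot be read.
    """
    lanes = {lane: [] for lane in REACTIONS}
    for index, base in enumerate(strand):
        if index == 0:
            continue
        for lane, targets in REACTIONS.items():
            if base in targets:
                lanes[lane].append(index)
    return lanes


def read_lanes(lanes, strand_length):
    """Infer bases 2..n from which lanes show a band at each fragment size."""
    calls = []
    for size in range(1, strand_length):
        present = {lane for lane, sizes in lanes.items() if size in sizes}
        if "G" in present:
            calls.append("G")      # band in both the G and A+G lanes
        elif "A+G" in present:
            calls.append("A")      # band in A+G only
        elif "C" in present:
            calls.append("C")      # band in both the C and C+T lanes
        elif "C+T" in present:
            calls.append("T")      # band in C+T only
        else:
            calls.append("N")      # no band: base undetermined
    return "".join(calls)


strand = "GATTACA"                 # hypothetical labeled strand, 5' to 3'
lanes = cleave(strand)
read = read_lanes(lanes, len(strand))
print(lanes)
print(read)
```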
Edman Degradation reaction: This reaction determines the order of amino acids in a protein starting from the N-terminus. In each cycle the N-terminal amino acid is cleaved off without disturbing the other bonds in the protein. After each cleavage, chromatography or electrophoresis is used to identify the released amino acid.
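The cycle-by-cycle nature of the reaction can be illustrated with a short sketch. It models only the bookkeeping (one residue released per cycle, the rest of the chain left intact), not the chemistry. The function name and the five-residue peptide are invented for illustration.

```python
def edman_degradation(peptide, cycles=None):
    """Release residues one at a time from the N-terminus.

    Returns (residues identified so far, peptide that remains).
    """
    remaining = peptide
    released = []
    cycles = len(peptide) if cycles is None else min(cycles, len(peptide))
    for _ in range(cycles):
        released.append(remaining[0])   # N-terminal residue is cleaved and identified
        remaining = remaining[1:]       # the rest of the chain is left intact
    return "".join(released), remaining


# Hypothetical peptide in one-letter code, N-terminus on the left.
identified, rest = edman_degradation("MKTAY", cycles=3)
print(identified, rest)
```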
Mass Spectrometry: It is used to determine the mass of particles and the composition of molecules, and to find the chemical structures of molecules such as peptides and other chemical compounds. The amino acids in a protein can be identified from the mass-to-charge ratios of the ions it produces.
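The relationship between a peptide's composition and its mass-to-charge ratio can be made concrete with a small calculation. The sketch below is an illustration under stated assumptions: it uses standard monoisotopic residue masses (rounded to five decimals), treats the peptide as unmodified, and assumes ions formed by adding protons. The function names are invented for illustration.

```python
# Monoisotopic masses (Da) of amino acid residues, i.e. the amino acid minus water.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056    # one water is added for the free N- and C-termini
PROTON = 1.00728    # mass of the charge carrier


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified, neutral peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def mass_to_charge(peptide, charge=1):
    """m/z of the peptide ion carrying `charge` extra protons."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (peptide_mass(peptide) + charge * PROTON) / charge


for z in (1, 2):
    print(z, round(mass_to_charge("GA", z), 4))
```

Leucine (L) and isoleucine (I) have identical masses in the table, which is why mass alone cannot tell these two residues apart.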