- To gain a basic understanding of the importance of NCBI in biological sciences.
- To understand how to retrieve sequence data from NCBI using R programming.
Demand has grown in the biological sciences for a deeper understanding of the data generated by sequencing projects, chiefly the Human Genome Project. Bioinformatics, an interdisciplinary field, provides tools for analyzing biological data computationally and for interpreting biological sequence data, particularly in modern medicine and biology. Molecular biologists use specialized software to retrieve and sort the DNA and protein sequence data of specific organisms for analysis and prediction. Pharmaceutical companies also rely on bioinformatics experts to acquire and analyze biological data on a large scale. The volume of data from genome projects created a need for algorithms and software that store biological data in a usable format. Many databases are now available over the internet for retrieving information such as the DNA sequences of species, genetic mutations, and polymorphisms. Searchable biological databases thus serve as repositories of gene data that can be identified and analyzed for biomedical research.
National Center for Biotechnology Information (NCBI)
NCBI (https://www.ncbi.nlm.nih.gov/) was established in 1988 as a division of the National Library of Medicine (NLM) at the National Institutes of Health, with the purpose of developing information systems for molecular biology. Its website hosts a comprehensive set of databases relevant to biotechnology and biomedicine. It provides access to GenBank, a comprehensive database of publicly available nucleotide sequences from many species, along with other biological data. GenBank exchanges data with the European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database (EMBL-Bank) and the DNA Data Bank of Japan (DDBJ). Examples of NCBI data resources include Entrez, PubMed, BLAST, COBALT, the Map Viewer, Gene Expression Omnibus, and the Molecular Modeling Database, all of which can be accessed through the NCBI homepage.
The stated roles of NCBI in multidisciplinary research include:
- Serving as one of the largest repositories of biological research data by creating public databases for the storage, retrieval, and analysis of data related to molecular biology, genetics, and biochemistry.
- Integrating scientific and medical data with computer-based information processing systems, with an emphasis on computational biology research into the structure and function of biological molecules.
- Designing algorithms for biological data analysis and cross-referencing mechanisms for genome analysis.
- Developing software tools for genome sequence analysis and for disseminating biomedical information.
A major goal of database development is to provide user-friendly access to stored biological data. Many biological retrieval systems exist for genome analysis. The most popular of these is Entrez, an integrated search engine developed and maintained by NCBI for retrieving biological data. It allows users to run text-based searches for nucleotide and protein sequence data, structural information, abstracts, citations, and taxonomic data related to a specific query. It integrates information by cross-referencing the NCBI databases on the basis of preexisting, logical relationships between individual records.
NCBI database in biological sciences
In the NCBI sequence databases, each sequence is stored as a separate record and assigned a unique identifier that can be used to refer to that record. This identifier, called an accession, consists of letters and numbers. For example, the accessions for the DNA sequences of the four dengue virus types are DEN-1 (NC_001477), DEN-2 (NC_001474), DEN-3 (NC_001475), and DEN-4 (NC_002640). To retrieve the DNA sequence of the DEN-2 virus, which can cause dengue hemorrhagic fever, go to the NCBI website (https://www.ncbi.nlm.nih.gov/), type “NC_001474” in the search box, and click “Search” to view the results (Fig. 1).
Fig.1. Querying sequence data for Dengue virus, type 2
The results page lists matches in several databases, including “PubMed” (abstracts of scientific papers), “Nucleotide” (DNA and RNA sequences), and “Protein” (protein sequences). Users can explore the different features of the query from there. Instead of an accession number, users can also enter a keyword such as insulin or keratin and explore the matching records in depth.
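The accession lookup described above can also be written as a URL instead of being typed into the search box. The sketch below, in R, builds NCBI E-utilities `efetch` URLs for the four dengue accessions listed earlier. The helper name `efetch_fasta_url` is hypothetical, and the choice of the `nuccore` database with FASTA output is an assumption about what a reader would want. Nothing is downloaded when the code runs.

```r
# Accessions for the four dengue virus types, as listed in the text.
dengue_accessions <- c(
  "DEN-1" = "NC_001477",
  "DEN-2" = "NC_001474",
  "DEN-3" = "NC_001475",
  "DEN-4" = "NC_002640"
)

# Hypothetical helper: build an E-utilities efetch URL that requests the
# nucleotide record for one accession as plain-text FASTA.
efetch_fasta_url <- function(accession) {
  base <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
  paste0(base, "?db=nuccore&id=", accession, "&rettype=fasta&retmode=text")
}

# One URL per accession; the names (DEN-1 ... DEN-4) are preserved.
dengue_urls <- vapply(dengue_accessions, efetch_fasta_url, character(1))
print(dengue_urls[["DEN-2"]])
```

Opening the printed URL in a browser, or passing it to `download.file()`, should return the same FASTA text that the web interface offers, provided the service is reachable.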
To retrieve the DEN-2 dengue virus genome in FASTA format, click the “Send” button at the top right of the NC_001474 sequence record page. Then choose “File” as the destination and FASTA as the format, and save the file as den2.fasta in the “My Documents” folder of your computer (Fig. 2).
Fig.2. Retrieving FASTA file of query sequence
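Once a FASTA file has been saved, it can be read back into R for inspection. The following sketch uses only base R so that it runs without extra packages. `read_fasta_simple` is a hypothetical helper written for illustration, and the short record it reads is an invented toy sequence, not the real DEN-2 genome. In practice, the SeqinR function `read.fasta()` handles the same task more robustly, including files with several records.

```r
# Write a toy FASTA record to a temporary file. This stands in for
# den2.fasta; the header and bases are invented for illustration only.
fasta_path <- tempfile(fileext = ".fasta")
writeLines(
  c(">toy_record example sequence", "AGTTGTTAGT", "CTACGTGGAC"),
  fasta_path
)

# Hypothetical helper: read a single-record FASTA file and return the
# header plus the sequence as a vector of single lowercase characters.
read_fasta_simple <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]               # drop empty lines
  header <- sub("^>", "", lines[1])           # header without the ">"
  bases <- paste(lines[-1], collapse = "")    # join the sequence lines
  list(header = header, seq = strsplit(tolower(bases), "")[[1]])
}

record <- read_fasta_simple(fasta_path)
seq_length <- length(record$seq)   # number of bases in the record
base_counts <- table(record$seq)   # how often each base occurs

print(record$header)
print(seq_length)
print(base_counts)
```

To apply this to the downloaded file, replace `fasta_path` with the location where den2.fasta was saved.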
For detailed information, see http://vlab.amrita.edu/index.php?sub=3&brch=273&sim=1437&cnt=1
Querying genome sequence data using SeqinR
Besides downloading a genome sequence from the NCBI website, sequence data can be retrieved from NCBI directly within R. A standard R installation does not include the required functions, so add-on packages must be installed and loaded first. The SeqinR package retrieves sequences from DNA sequence databases and supports simple DNA sequence analyses.
Bioconductor (http://www.bioconductor.org) is a collection of R packages designed for analyzing biological data sets such as microarray data.
The SeqinR package (http://pbil.univ-lyon1.fr/software/seqinr/home.php?lang=eng) retrieves sequences from DNA and protein sequence databases and provides functions for analyzing DNA and protein sequences.
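A minimal sketch of retrieving a sequence by accession with SeqinR is given below. It uses the package's `choosebank()`, `query()`, `getSequence()`, and `closebank()` functions, which talk to a remote ACNUC server. The helper names `acnuc_accession_query` and `get_ncbi_seq` are hypothetical. The sketch assumes that the seqinr package is installed, that the server is reachable, and that a bank named "refseqViruses" is currently offered; the list returned by `choosebank()` with no arguments shows which banks are actually available. The network call is left commented out so that the block can be run offline.

```r
# One-time setup (uncomment to install the package from CRAN):
# install.packages("seqinr")

# Build the query string for an accession, e.g. "AC=NC_001477".
acnuc_accession_query <- function(accession) {
  paste0("AC=", accession)
}

# Hypothetical helper: fetch one sequence by accession through SeqinR.
# Assumptions: seqinr is installed, the ACNUC server is reachable, and
# the bank name passed in `bank` exists on that server.
get_ncbi_seq <- function(accession, bank = "refseqViruses") {
  seqinr::choosebank(bank)                    # open the remote database
  on.exit(seqinr::closebank(), add = TRUE)    # close it even on error
  hits <- seqinr::query("hits", acnuc_accession_query(accession))
  seqinr::getSequence(hits$req[[1]])          # vector of single characters
}

# Example call (needs network access, so it is left commented out):
# den1 <- get_ncbi_seq("NC_001477")
# length(den1)
```

Wrapping the steps in one function keeps the open and close calls paired, so a failed query does not leave a database connection open.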