Objective
- To introduce the GEO database and its features.
- To learn how to retrieve gene expression data from GEO.
Theory
The Gene Expression Omnibus (GEO) is an online database for microarray, next-generation sequencing and other forms of high-throughput functional genomic data. It can be accessed at www.ncbi.nlm.nih.gov/geo/. GEO also provides options to retrieve and study the gene expression data submitted by different scientists.
Data intended for journal publication need to be submitted to a MIAME (Minimum Information About a Microarray Experiment) compliant public repository such as GEO, whose curators assist submitters in preparing the data. MIAME describes the minimum information needed for microarray data to be easily interpreted and analyzed. It was developed by the FGED Society as a standard for reporting microarray experiments.
A microarray is a collection of microscopic DNA spots attached to a solid surface such as glass or silicon. Each spot contains millions of copies of a specific DNA sequence, called a probe, which is a short section of a gene or other DNA element. Microarray technology combines wet-lab techniques, statistical analysis and the application of informatics to the data. The resulting data are submitted to GEO.
Users with questions or complaints about submitting data to GEO can email geo@ncbi.nlm.nih.gov. The three main aspects of GEO are the Submission Guide, Data Organization, and Query & Analysis.
Submission Guide:
GEO accepts high-throughput functional genomic data, including all array-based applications and some high-throughput sequencing data. The curators provide as much assistance as possible with submitting data to GEO. Data can be submitted as spreadsheets (such as Excel), plain text or XML documents; of these, Excel spreadsheets are the recommended method. Submissions may come from common commercial arrays such as Agilent, Illumina, NimbleGen or Affymetrix, each of which has its own properties and file types. Microarray submissions must include the raw data, which helps others understand and verify the data as described in the MIAME guidelines. To delete a record from the database, the submitter must send a request giving the accession number of that record; only GEO staff have permission to delete records.
Data Organization:
Data from expression studies are organised into Platform records, Sample records, Series records, Datasets and Profiles.
A Platform record is submitted by a scientist and contains information about the array or sequencer. A single Platform may be referenced by many Samples from multiple submitters. For array-based platforms, a data table defining the array template is provided. Each Platform record is assigned a unique and stable GEO accession number (GPLxxx).
Figure 1: Example of a Platform record: www.ncbi.nlm.nih.gov/geo/query/acc.cgi
A Sample record describes the sample, the changes and manipulations it has undergone, and the abundance measurement of each element derived from it. Each Sample record is assigned a unique and stable GEO accession number (GSMxxx).
Figure 2: Example of a Sample record: www.ncbi.nlm.nih.gov/geo/query/acc.cgi
A Series record is an original record from the submitter that summarizes an experiment. It contains a group of related Samples together with their description, tables, extracted data, summary, conclusions and analysis. Each Series record is assigned a unique and stable GEO accession number (GSExxx). The data submitted by the scientist are examined by GEO curators before being used for publication or further analysis.
Figure 3: Example of a Series record: www.ncbi.nlm.nih.gov/geo/query/acc.cgi
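The accession prefixes described above (GPL for Platforms, GSM for Samples, GSE for Series) can be recognised programmatically. The following is a minimal Python sketch, not part of GEO itself: the function names are hypothetical, and it assumes that the accession viewer at acc.cgi accepts the accession through an "acc" query parameter.

```python
import re

# Prefix -> record type, following the descriptions in the text.
GEO_RECORD_TYPES = {
    "GPL": "Platform",
    "GSM": "Sample",
    "GSE": "Series",
}

# Assumption: the GEO accession viewer takes the accession as "?acc=...".
GEO_ACC_URL = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"

# A recognised accession is one of the three prefixes followed by digits only.
_ACCESSION_RE = re.compile(r"^(GPL|GSM|GSE)(\d+)$")


def classify_accession(accession):
    """Return 'Platform', 'Sample' or 'Series', or None if not recognised."""
    match = _ACCESSION_RE.match(accession.strip().upper())
    if match is None:
        return None
    return GEO_RECORD_TYPES[match.group(1)]


def accession_url(accession):
    """Build the accession viewer URL for a GPL, GSM or GSE accession."""
    if classify_accession(accession) is None:
        raise ValueError(f"not a GPL/GSM/GSE accession: {accession!r}")
    return f"{GEO_ACC_URL}?acc={accession.strip().upper()}"
```

For example, classify_accession("gsm123") returns "Sample" (the number 123 is only a placeholder), while a string with an unknown prefix returns None.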
GEO Dataset: Data selected after curation are refined again before being transferred into GEO Dataset and gene Profile records. GEO Datasets is a study-based database that contains Platforms, Samples, Series and the curated data. These curated datasets form the basis of GEO’s advanced data display and analysis features, such as tools for identifying gene expression levels and cluster heat maps. Each dataset is assigned a unique and stable GEO accession number (GDSxxx). Datasets can be searched by typing keywords, organism names, authors or dataset types in the respective boxes. A large number of results can be narrowed down to a more specific search by using the GEO Limits Query builder and Advanced Search. A GEO Dataset contains GEO Samples that refer to the same Platform, with the same set of array elements. The values are calculated using background processing and normalization that are consistent across the dataset. Both GEO Datasets and GEO Series can be searched using the same dataset interface. Not all submitted data are represented as a dataset, because the curators have a backlog in dataset creation.
Figure 4: Example of a dataset: www.ncbi.nlm.nih.gov/sites/GDSbrowser
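The keyword and organism search described above can also be performed programmatically through the NCBI E-utilities, where GEO Datasets is exposed as the "gds" database. The sketch below only builds the ESearch request URL and does not contact NCBI; the function name and the default values are hypothetical choices for illustration.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gds_search_url(keyword, organism=None, retmax=20):
    """Build an ESearch URL that queries the GEO Datasets database (db=gds)."""
    term = keyword
    if organism:
        # [Organism] is an Entrez field tag; AND narrows the search to it.
        term = f"{keyword} AND {organism}[Organism]"
    params = {"db": "gds", "term": term, "retmax": retmax, "retmode": "json"}
    # urlencode percent-encodes the brackets and turns spaces into "+".
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Example query terms are placeholders; paste the URL into a browser
    # or fetch it with urllib.request to see the matching record IDs.
    print(build_gds_search_url("breast cancer", organism="Homo sapiens"))
```

Restricting by organism here plays the same role as filling in the Organism box on the web interface: it narrows a broad keyword search to a more specific set of records.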
GEO Profiles are derived from curated GEO Datasets. GEO Profiles is a gene-based database where the user can search for gene expression profiles. Each profile is presented as a chart displaying the expression level of one gene across all samples within a dataset. Experimental details are represented as bars along the bottom of the chart. It is also possible to view similarly expressed genes that lie in close proximity on the chromosome. Profiles have internal links to similarly behaving genes and external links to relevant records in other NCBI databases. Profiles can be searched by keyword, gene name, gene symbol or GenBank accession number, or restricted to profiles flagged as differentially expressed.
Figure 5: Example of a profile: www.ncbi.nlm.nih.gov/geoprofiles/32391646
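To make the idea of a profile concrete, the sketch below extracts one gene's values across all samples of a dataset and prints them as a simple text bar chart, with the experimental group shown next to each sample. It is an illustration only: the gene names, sample identifiers, group labels and values are all invented, and the table layout is an assumption rather than GEO's actual file format.

```python
# Hypothetical dataset value table: gene -> {sample: expression value}.
# Every identifier and number below is made up for illustration.
dataset_values = {
    "GENE_A": {"GSM_1": 7.2, "GSM_2": 7.9, "GSM_3": 3.1, "GSM_4": 2.8},
    "GENE_B": {"GSM_1": 5.0, "GSM_2": 5.1, "GSM_3": 5.2, "GSM_4": 4.9},
}

# Hypothetical experimental details: sample -> group.
sample_groups = {
    "GSM_1": "control",
    "GSM_2": "control",
    "GSM_3": "treated",
    "GSM_4": "treated",
}


def gene_profile(values, groups, gene):
    """Return [(sample, group, value), ...] for one gene across all samples."""
    row = values[gene]
    return [(sample, groups[sample], row[sample]) for sample in sorted(row)]


def print_profile(profile, width=40):
    """Print one bar per sample, scaled to the largest value in the profile."""
    top = max(value for _, _, value in profile)
    for sample, group, value in profile:
        bar = "#" * round(width * value / top)
        print(f"{sample:<6} {group:<8} {bar} {value}")


if __name__ == "__main__":
    print_profile(gene_profile(dataset_values, sample_groups, "GENE_A"))
```

In this invented example, GENE_A has clearly lower values in the treated samples than in the controls, which is the kind of pattern a profile chart is meant to reveal at a glance.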