- Understand the basics of RNA sequencing.
- Determine how gene expression levels differ between conditions.
- Detect and quantify non-protein-coding transcripts, splice isoforms, novel transcripts and sites of protein-RNA interactions.
RNA-seq is a recently developed approach to transcriptome profiling that uses deep-sequencing technologies. Studies using this method have already altered our view of the extent and complexity of eukaryotic transcriptomes. Differential expression analysis means normalising the read count data and applying statistical tests to discover quantitative changes in expression levels between experimental groups. Detecting changes in gene expression (i.e., mRNA levels) between cell populations and/or experimental conditions remains the most common application of RNA-seq. Yet even for this popular application, widely accepted standards are lacking, and the field is only slowly converging on best practices for differential gene expression analysis amid a myriad of available software. In bioinformatics, a differential gene expression analysis includes the following steps:
- Processing of sequencing reads (including alignment)
- Estimation of individual gene expression levels
- Identification of differentially expressed (DE) genes
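The last step compares normalised counts between groups. As a minimal sketch of that idea only, the plain-Python snippet below scales raw counts to counts per million (CPM) and computes a log2 fold change between two groups. The gene names, the counts, the pseudocount of 1 and the choice of CPM as the normalisation are all illustrative assumptions. Dedicated statistical packages use more robust normalisation, model variance and test for significance, none of which is attempted here.

```python
import math


def counts_per_million(counts):
    """Scale one sample's raw gene counts so that they sum to one million."""
    total = sum(counts.values())
    return {gene: c * 1_000_000 / total for gene, c in counts.items()}


def log2_fold_change(group_a, group_b, gene, pseudocount=1.0):
    """log2 ratio of the mean CPM in group_b over the mean CPM in group_a."""
    mean_a = sum(sample[gene] for sample in group_a) / len(group_a)
    mean_b = sum(sample[gene] for sample in group_b) / len(group_b)
    # The pseudocount (an assumption) avoids division by zero for silent genes.
    return math.log2((mean_b + pseudocount) / (mean_a + pseudocount))


# Hypothetical raw counts: two control and two treated samples.
control = [{"geneA": 100, "geneB": 900}, {"geneA": 200, "geneB": 1800}]
treated = [{"geneA": 400, "geneB": 600}, {"geneA": 800, "geneB": 1200}]

control_cpm = [counts_per_million(s) for s in control]
treated_cpm = [counts_per_million(s) for s in treated]

for gene in ("geneA", "geneB"):
    lfc = log2_fold_change(control_cpm, treated_cpm, gene)
    print(gene, round(lfc, 3))
```

Note that the second sample in each group has twice the sequencing depth of the first. After CPM scaling the two replicates agree, which is the reason counts are normalised before they are compared.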
Recently, the development of novel high-throughput DNA sequencing methods has provided a new method for both mapping and quantifying transcriptomes. This method, termed RNA-seq (RNA sequencing), has clear advantages over existing approaches and is expected to revolutionize the manner in which eukaryotic transcriptomes are analysed. In general, a population of RNA (total or fractionated, such as poly(A)+) is converted to a library of cDNA fragments with adaptors attached to one or both ends. Each molecule, with or without amplification, is then sequenced in a high-throughput manner to obtain short sequences from one end (single-end sequencing) or both ends (paired-end sequencing). The reads are typically 30–400 bp, depending on the DNA-sequencing technology used. Unlike hybridization-based approaches, RNA-seq is not limited to detecting transcripts that correspond to existing genomic sequence. It can reveal the precise location of transcription boundaries to single-base resolution. Furthermore, 30-bp short reads give information about how two exons are connected, whereas longer reads or paired-end short reads should reveal connectivity between multiple exons. These factors make RNA-seq useful for studying complex transcriptomes. In addition, RNA-seq can reveal sequence variations (for example, SNPs) in the transcribed regions.
Bioinformatics challenges in differential expression analyses of RNA-seq
Like other high-throughput sequencing technologies, RNA-seq faces several informatics challenges. These include developing efficient methods to store, retrieve and process large amounts of data, which is needed to reduce errors in image analysis and base-calling and to remove low-quality reads. The first task of data analysis is to map the short reads to the reference genome, or to assemble them into contigs before aligning them to the genomic sequence, in order to reveal transcription structure. Transcriptomic reads also include reads that span exon junctions or that contain poly(A) ends; these cannot be analysed in the same way as reads that align contiguously to the genome. For genomes in which splicing is rare, special attention only needs to be given to poly(A) tails and to a small number of exon–exon junctions. Poly(A) tails can be identified simply by the presence of multiple As or Ts at the end of some reads. Exon–exon junctions can be identified by the presence of a specific sequence context (the GT–AG dinucleotides that flank splice sites) and confirmed by the low expression of intronic sequences, which are removed during splicing. For large transcriptomes, alignment is also complicated by the fact that a significant portion of sequence reads match multiple locations in the genome. One solution is to assign these multi-matched reads proportionally, based on the number of reads mapped to their neighbouring unique sequences. This method has been successful for low-copy repetitive sequences. Short reads that have high copy numbers (>100) and long stretches of repetitive regions present a greater challenge.
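The three heuristics described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the behaviour of any particular aligner. The function names, the minimum run length of 8, the locus names and the even-split fallback are assumptions made for the example, and "reads mapped to neighbouring unique sequences" is reduced here to a single unique-read count per locus.

```python
def has_polya_signature(read, min_run=8):
    """True if the read starts or ends with a run of A's or T's."""
    read = read.upper()
    runs = ("A" * min_run, "T" * min_run)
    return read.endswith(runs) or read.startswith(runs)


def is_canonical_intron(genome, intron_start, intron_end):
    """True if genome[intron_start:intron_end] begins with GT and ends with AG."""
    intron = genome[intron_start:intron_end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def allocate_multireads(unique_counts, multiread_hits):
    """Split each multi-mapped read across its candidate loci.

    Each read is divided in proportion to the number of uniquely mapped
    reads at each candidate locus.
    """
    totals = {locus: float(n) for locus, n in unique_counts.items()}
    for loci in multiread_hits:
        weights = [unique_counts.get(locus, 0) for locus in loci]
        weight_sum = sum(weights)
        for locus, weight in zip(loci, weights):
            # Assumed fallback: split evenly when no candidate has unique reads.
            share = weight / weight_sum if weight_sum else 1 / len(loci)
            totals[locus] = totals.get(locus, 0.0) + share
    return totals


# Hypothetical inputs for illustration.
print(has_polya_signature("ACGTTGCAGGAAAAAAAAAA"))
print(is_canonical_intron("CCAGGTAAGTTTTCAGGC", 4, 16))

unique_counts = {"locus1": 30, "locus2": 10}
multiread_hits = [("locus1", "locus2")] * 4  # four reads matching both loci
print(allocate_multireads(unique_counts, multiread_hits))
```

In the last call, each ambiguous read contributes three quarters of a count to `locus1` and one quarter to `locus2`, mirroring the 30:10 ratio of their unique reads.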
Differential expression analyses of RNA-seq using R
Bioconductor has many packages that support analysis of high-throughput sequence data, including RNA sequencing (RNA-seq). Count-based statistical methods such as DESeq2, edgeR, limma with the voom method, DSS, EBSeq and baySeq expect input data, obtained e.g. from RNA-seq or another high-throughput sequencing experiment, in the form of a matrix of un-normalized counts. The value in the i-th row and the j-th column of the matrix tells how many reads (or fragments, for paired-end RNA-seq) can be assigned to gene i in sample j. Analogously, for other types of assays, the rows of the matrix might correspond e.g. to binding regions (with ChIP-Seq) or peptide sequences (with quantitative mass spectrometry). The values in the matrix should be counts or estimated counts of sequencing reads/fragments. Transcript quantification methods such as Salmon, kallisto or RSEM perform mapping or alignment of reads to reference transcripts, outputting estimated counts per transcript as well as effective transcript lengths, which summarize bias effects. After running one of these tools, the tximport or tximeta packages can be used to assemble estimated count and offset matrices for use with Bioconductor differential gene expression packages. Heat maps, PCA plots and MDS plots based on the Poisson distance help in visualizing the differential expression of genes.
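To make the count matrix layout concrete, here is a small sketch in plain Python rather than with the Bioconductor packages named above. It builds a matrix whose entry in row i and column j is the number of reads assigned to gene i in sample j, and it sums estimated transcript-level counts to gene level. The helper names, sample names and counts are hypothetical. The second helper only illustrates the idea of gene-level summarization and does not reproduce the tximport API or its offset calculation.

```python
def build_count_matrix(assignments, genes, samples):
    """Return matrix[i][j] = number of reads assigned to genes[i] in samples[j]."""
    row = {gene: i for i, gene in enumerate(genes)}
    col = {sample: j for j, sample in enumerate(samples)}
    matrix = [[0] * len(samples) for _ in genes]
    for sample, gene in assignments:
        matrix[row[gene]][col[sample]] += 1
    return matrix


def summarize_to_gene(transcript_counts, tx2gene):
    """Sum estimated transcript-level counts into gene-level counts."""
    gene_counts = {}
    for transcript, count in transcript_counts.items():
        gene = tx2gene[transcript]
        gene_counts[gene] = gene_counts.get(gene, 0.0) + count
    return gene_counts


# Hypothetical read assignments as (sample, gene) pairs.
assignments = [
    ("s1", "geneA"), ("s1", "geneA"), ("s2", "geneA"),
    ("s1", "geneB"), ("s2", "geneB"), ("s2", "geneB"),
]
genes = ["geneA", "geneB"]
samples = ["s1", "s2"]
matrix = build_count_matrix(assignments, genes, samples)
for gene, counts in zip(genes, matrix):
    print(gene, counts)

# Hypothetical estimated counts for one sample and a transcript-to-gene map.
transcript_counts = {"tx1": 10.5, "tx2": 4.5, "tx3": 7.0}
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
print(summarize_to_gene(transcript_counts, tx2gene))
```

The matrix holds raw integer counts, which is the form the count-based methods expect. Estimated counts from transcript quantifiers may be fractional, as in the second example.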