
Learning Objectives

To teach R as a general programming language rather than focus on issues specific to computational biology.
To teach users how to read and write sequence data using R.
To enable users to read and write sequence files in CSV format.
To briefly introduce the use of R in data manipulation, statistics and graphical representation.
To explore how statisticians and data miners use the R software environment to develop statistical software and analyse data.

Theory

Introduction to R

With the technological innovations that revolutionized biology, bioinformatics has become a scientific discipline in the last few decades. Popular programming languages such as Java and Python were the choice of programmers working in bioinformatics and computational biology. R is a simple programming language and a free software environment for statistical analyses such as linear and nonlinear modelling, time-series analysis, classification and clustering, and for statistical computing and graphical representation.

R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is similar in nature to the “S” statistical environment developed at Bell Laboratories (http://www.r-project.org/about.html). It is currently developed and maintained by the R-core developers. It has been widely used by bioinformaticians, software programmers, statisticians and data miners to explore computational biology, functional genomics, dynamical systems, statistical genetics and network biology. It is free software, released under the GNU General Public License, and is now used in many areas of scientific computation.

A main feature of R is that it allows rapid development of ideas, with object-oriented features for software development. The built-in functions are well suited to statistical simulations, and code written by others is easily shared. High-quality graphical output for modelling and analysis is also a core feature of R in biological applications. R can interoperate with many other languages, which encourages code reuse. Its flexibility, data handling and modelling capabilities have made R a widely used software tool for bioinformatics. It also supports the creation and use of self-describing data structures.

There are a few principal functions for reading data into R.

read.table, read.csv, for reading tabular data
readLines, for reading lines of a text file
source, for reading in R code files (inverse of dump)
dget, for reading in R code files (inverse of dput)
load, for reading in saved workspaces
unserialize, for reading single R objects in binary form

There are analogous functions for writing data to files.

write.table
writeLines
dump
dput
save
serialize
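
The reading and writing functions listed above can be paired in a short round trip. The following is a minimal sketch, not a prescribed workflow: the data frame, its column names and the temporary file names are all made up for illustration.

```r
# Round-trip a small data frame and a character vector through temporary files.
# All object names, column names and file names are illustrative assumptions.
seq_table <- data.frame(
  id     = c("seq1", "seq2", "seq3"),
  length = c(12L, 8L, 15L),
  stringsAsFactors = FALSE
)

# Tabular data: write.csv() is a convenience wrapper around write.table()
csv_file <- tempfile(fileext = ".csv")
write.csv(seq_table, csv_file, row.names = FALSE)
table_back <- read.csv(csv_file, stringsAsFactors = FALSE)

# Plain text, one element per line
txt_file <- tempfile(fileext = ".txt")
writeLines(c("ATGC", "GGCA"), txt_file)
lines_back <- readLines(txt_file)

# Textual representation of a single R object
dput_file <- tempfile(fileext = ".R")
dput(seq_table, file = dput_file)
table_dget <- dget(dput_file)

# Binary workspace file; load() restores objects under their saved names,
# here into a separate environment so nothing is overwritten
rda_file <- tempfile(fileext = ".RData")
save(seq_table, file = rda_file)
restored <- new.env()
load(rda_file, envir = restored)

# Binary form of a single object, held in memory as a raw vector
raw_bytes <- serialize(seq_table, connection = NULL)
table_unser <- unserialize(raw_bytes)
```

Using tempfile() keeps the sketch from touching the working directory; in practice the paths would point to real data files.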

Basic concepts of R programming

R Data Structures

Vectors: The basic data structure of R, holding elements of a single type, such as numeric, integer or character.
Matrices: Two-dimensional arrays of numbers or other mathematical objects. Basic mathematical operations such as addition, subtraction and multiplication can be performed on R matrices.
Lists: Ordered collections of objects. Unlike vectors and matrices, the elements of a list may differ in type and length.
Arrays: Multi-dimensional data structures. To create an array, provide a vector of values as input, together with the dimensions.
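
As a minimal sketch of the four structures described above, the following creates one of each. The variable names and values are illustrative only.

```r
# Vector: every element has the same type
gc_content <- c(0.42, 0.51, 0.38)
class(gc_content)

# Matrix: two-dimensional, single type, filled column by column.
# + and - work element-wise; %*% is matrix multiplication.
m <- matrix(1:6, nrow = 2, ncol = 3)
m_sum  <- m + m
m_diff <- m - m
m_prod <- m %*% t(m)   # 2 x 3 times 3 x 2 gives a 2 x 2 matrix

# List: elements may differ in type and length
record <- list(id = "seq1", bases = c("A", "T", "G", "C"), length = 4L)
record$bases

# Array: a vector of values plus a dim attribute,
# here 2 rows x 3 columns x 2 layers
a <- array(1:12, dim = c(2, 3, 2))
dim(a)
a[2, 3, 1]   # row 2, column 3, first layer
```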

Application of R programming in biology

It is a free and open-source tool available on all major operating systems (cross-platform support).
It is easy to adopt and is supported by a large community of users.
It offers robust visualization libraries and can perform complex statistical operations.
It is a standard tool for machine learning, statistics and biological data analysis.
In academia, R helps students, teachers and the research community develop statistical models for analysing large datasets.

Reading biological sequence data in R

A genome is the complete set of DNA of an organism. Exploratory data analysis and data visualization are increasingly used to gain insight from biological sequence data. The emerging fields of computational biology and bioinformatics have led to significant advances, primarily in automated data analysis. In R, the seqinr package provides access to biological sequence databases.
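
To show what reading sequence data involves, here is a minimal base-R sketch that parses a small FASTA file with readLines. The helper read_fasta_simple and the example sequences are hypothetical, and the sketch assumes every record has a header line followed by at least one sequence line. The seqinr package provides read.fasta() for the same task and is likely the better choice for real data.

```r
# Write a tiny FASTA file so the example is self-contained (made-up sequences)
fasta_file <- tempfile(fileext = ".fasta")
writeLines(c(">seq1 example record",
             "ATGCGT",
             "AACC",
             ">seq2",
             "GGGTTTAA"), fasta_file)

# Hypothetical helper: returns a character vector of sequences named by ID.
# Assumes each header is followed by at least one sequence line.
read_fasta_simple <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]          # drop blank lines
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)            # record number for every line
  # ID = text after ">" up to the first whitespace
  ids <- sub("^>([^[:space:]]+).*$", "\\1", lines[is_header])
  # Join the sequence lines belonging to each record
  fa <- tapply(lines[!is_header], record[!is_header], paste, collapse = "")
  setNames(as.character(fa), ids)
}

sequences <- read_fasta_simple(fasta_file)
nchar(sequences)                               # sequence lengths
table(strsplit(sequences[["seq1"]], "")[[1]])  # base composition of seq1
```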
