Introduction the R Programming Language
Introduction to R
R is a programming language and environment for doing computations that involve statistics. The language offers a wide variety of tools that making statistical computations easy to do and takes out much of the boiler-plate code needed to implement various statistical models. It has better support for object orientated programming techniques. There are also strong tools for visualizing data built into R as well as a vast amount of community made packages that extend R’s functionality. I’m going to begin by going over some of the basic features of R and the syntax for some common statistical tasks.
To begin you will need to get an installation of R running on your computer. Most installations come bundled with a GUI editor that highlights R syntax, but most things I will go over can be done using the interactive prompt.
One great feature of R is the ability to quickly install packages created by the community from the command line. R’s Comprehensive R Archive Network (CRAN) hosts a wide array of packages that extend the functionality of R. You can find a list of available packages here. Installing a package is a simple one line command. Later on in this series, we will be plotting graphs of data using an extension of R’s graphing features called ggplot2. It simplifies some of the steps required for graphing data and allows you to make complex layered graphs. To install the package run
# select a mirror when prompted > install.packages(“ggplot2”)
R has a good deal of documentation available on the internet as well as a few free books on learning the basics. You can find them here An other helpful feature that R provides the ability to look up documentation on the command line. To look at the documentation on the install.packages() function prepend a ? to the function name and the documentation will be shown. You can also search through the help files for certain terms by using help.search().
> ?install.packages > help.search("install.packages")
Basics in R
R provides a nice selection of data types made for use in statistical computations.
In R you can assign numeric values that consist of any real number to a variable. An arrow pointing to the left (<-) is used to assign a value to variable. There is no need to declare the datatype of a variable before assigning it a value.
# assign numbers to two variables > t <- 45 > u <- 45.543543 # from the interpreter you can print out a variable by typing in the variable name # print out t and u > t  45 > u  45.54354
Define a string using single or double quotes.
> y <- "hello world" > y  "hello world"
Factors contain values of categorical data that can’t be interpreted as a number. Storing data as a factor makes sure R interprets certain data correctly when creating statistical models. They are an efficient way to store strings, because their values can be stored once in memory and then referenced by a numeric value. More about factors will come up when you start analyzing data.
There are three types of vectors in R
They are defined using c().You can refer to an element of a vector using it’s index value, a vector of index values, or a range of index values.
> x <- c(54,43,43,76,21,32) > y <- c("one", "two", "three", "four") > z <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE) > x(c(2,5)) # 2nd and 5th index  43 21 > z[2:5] #index 2 through 5  FALSE TRUE TRUE FALSE
Matrices are a large part of computing statistical data in R especially when dealing with regressions and other computations on tabular data. Create a matrix using the matrix() function. Each column in the matrix must be of the same mode type of mode(numeric, character, logical).
#create an 8 x8 matrix of numbers of numbers from 1 to 64 > x <-matrix(1:64, nrow=8, ncol=4) # create a 4x4 matrix with the columns and rows labels # byrow=TRUE makes sure the matrix is filled by columns instead of rows # dimnames is set to a list of of two vectors that will be set as the row and column labels > y <-matrix(1:16, nrow=2, ncol=2,byrow=TRUE, dimnames=list(c("a", "b"), c("1", "2")))
Create a list using the list() function. In R lists are like associative arrays with key value pairs. The default key values are integer index values, but they can also be set to strings. Lists can contain any data type and can have mixed types as well.
> a <- c("one", "two", "three", "four") > b <- "Hello World" > c <- matrix(1:64, nrow=8, ncol=4) # create list with variables of mixed data types > thelist <- list(avector=a, astring=b, amatrix=c)
Data frames are very similar to matrices except the columns of a data frame can have different modes. You can create a data frame by combining multiple vectors of the same length. They can later on be turned into plots and other visualizations. Create one using the data.frame() function. YOu can also uses the names() function to set column labels.
>x <- c(1,2,3,4) >y <- c("one", "two", "three", "four") >z <- c(TRUE,FALSE,FALSE,FALSE) >thedata <- data.frame(x,y,z) # set the column names of the data frame by passing names() a vector of labels >names(thedata) <- c("integers","strings","logical") # variable names integers strings logical 1 1 one TRUE 2 2 two FALSE 3 3 three FALSE 4 4 four FALSE
The next section of this tutorial will go over importing and graphing data as well as some linear regressions using R.