Basic Importing and Graphing Data In R

Importing and Analyzing Data

One of the most common ways of importing data into R is from a text file that is delimited by some character, such as a comma or a space. There are also packages that let you import data from a database or from other formats. Once the data is imported as a data frame, you can start creating models and graphs.

To go through the examples of this post you can download the data file about baby births.

babies.txt

The file is broken down into columns with a header given at the beginning of the file.

Reading CSV Files

To read a CSV file into a data frame, use the read.csv() function. It returns a data frame containing the data from the file. If the file has a header row, you can import it as well and use it for the column labels. R will try to convert the columns of the file into relevant data types. This works most of the time for numbers, booleans, and strings. To convert dates you will have to use the date functions that R provides. You can read about the date functions here.


# read in the babies data file
# header=TRUE imports the header values from the file and sets them as the column names
# set the separator using the sep argument
> baby.data <- read.csv(file="data/babies.csv", header=TRUE, sep=" ")

# print out the new data's columns
> colnames(baby.data)
[1] "birth_weight" "gestation"    "parity"       "age"          "height"      
[6] "weight"       "smoke"

# if you want to change the column labels you can pass the colnames() function a vector of labels
> colnames(baby.data) <- c("BirthWeight", "Gestation", "Parity", "Age", "Height", "Weight", "Smoke")
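
The import step can also be tried without a file on disk. The following is a minimal sketch, assuming a tiny made-up data set passed in through read.csv()'s text argument. The values and the visit_date column are hypothetical and only illustrate how header, sep, and as.Date() fit together.

```r
# hypothetical space-delimited data, inlined so the example runs without a file
raw <- "birth_weight gestation smoke visit_date
120 284 0 2012-01-15
113 282 0 2012-02-03
128 279 1 2012-02-20"

# header=TRUE uses the first line as column names; sep=" " splits on spaces
sample.data <- read.csv(text=raw, header=TRUE, sep=" ", stringsAsFactors=FALSE)

# the numeric columns are detected automatically, but the date is still text
str(sample.data)

# convert the text column to R's Date class
sample.data$visit_date <- as.Date(sample.data$visit_date, format="%Y-%m-%d")
class(sample.data$visit_date)
```

Once the column is a Date, it can be sorted, subtracted, and plotted like any other ordered value.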

The baby birth data set is pretty large, so printing out the entire data frame isn't practical. R provides a few functions that give you a good idea of what the imported data looks like. You can use head() and tail() to view the first and last 6 rows of the data frame.


# you can use head() to print out the first 6 rows of the data frame (shown here with the original column names)
> head(baby.data)
  birth_weight gestation parity age height weight smoke
1          120       284      0  27     62    100     0
2          113       282      0  33     64    135     0
3          128       279      0  28     64    115     1
4          123        NA      0  36     69    190     0
5          108       282      0  23     67    125     1
6          136       286      0  25     62     93     0
# tail returns the last 6
> tail(baby.data)

# new column names
> head(baby.data)
  BirthWeight Gestation Parity Age Height Weight Smoke
1         120       284      0  27     62    100     0
2         113       282      0  33     64    135     0
3         128       279      0  28     64    115     1
4         123        NA      0  36     69    190     0
5         108       282      0  23     67    125     1
6         136       286      0  25     62     93     0

# you can also view a single column of the data frame by using the $ symbol, which returns a vector containing that column's data
# this is useful when formatting data
> head(baby.data$Weight)

The summary() function gives you some details about the information in a data frame: the minimum, 1st quartile, median, mean, 3rd quartile, and maximum of each column. It also tells you how many missing values (NA's) each column has. You can then use the na.omit() function to drop the rows that contain missing values, so there isn't any bad data left that can cause bugs later on.


# print out a summary of the imported data
> summary(baby.data)
  BirthWeight      Gestation         Parity            Age       
 Min.   : 55.0   Min.   :148.0   Min.   :0.0000   Min.   :15.00  
 1st Qu.:108.8   1st Qu.:272.0   1st Qu.:0.0000   1st Qu.:23.00  
 Median :120.0   Median :280.0   Median :0.0000   Median :26.00  
 Mean   :119.6   Mean   :279.3   Mean   :0.2549   Mean   :27.26  
 3rd Qu.:131.0   3rd Qu.:288.0   3rd Qu.:1.0000   3rd Qu.:31.00  
 Max.   :176.0   Max.   :353.0   Max.   :1.0000   Max.   :45.00  
                 NA's   :13                       NA's   :2      
     Height          Weight          Smoke       
 Min.   :53.00   Min.   : 87.0   Min.   :0.0000  
 1st Qu.:62.00   1st Qu.:114.8   1st Qu.:0.0000  
 Median :64.00   Median :125.0   Median :0.0000  
 Mean   :64.05   Mean   :128.6   Mean   :0.3948  
 3rd Qu.:66.00   3rd Qu.:139.0   3rd Qu.:1.0000  
 Max.   :72.00   Max.   :250.0   Max.   :1.0000  
 NA's   :22      NA's   :36      NA's   :10

# it looks like there are some missing values, so we'll use na.omit() to remove those rows
> baby.data <- na.omit(baby.data)

# str() prints out the data type of each column in the data frame
> str(baby.data)
'data.frame':  1174 obs. of  7 variables:
 $ BirthWeight: int  120 113 128 108 136 138 132 120 143 140 ...
 $ Gestation  : int  284 282 279 282 286 244 245 289 299 351 ...
 $ Parity     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Age        : int  27 33 28 23 25 33 23 25 30 27 ...
 $ Height     : int  62 64 64 67 62 62 65 62 66 68 ...
 $ Weight     : int  100 135 115 125 93 178 140 125 136 120 ...
 $ Smoke      : int  0 0 1 1 0 0 0 0 1 0 ...
 - attr(*, "na.action")=Class 'omit'  Named int [1:62] 4 40 43 86 90 94 99 103 111 114 ...
  .. ..- attr(*, "names")= chr [1:62] "4" "40" "43" "86" ...

Because this data set is relatively large, taking out a few rows because of missing values is fine and shouldn't noticeably affect the results we might try to get from the data.
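
To see what na.omit() actually removes, here is a small sketch on a made-up data frame. The column names and values are hypothetical and are not taken from the babies file.

```r
# hypothetical data frame with a few missing values
df <- data.frame(
  weight = c(100, NA, 115, 190),
  height = c(62, 64, NA, 69),
  smoke  = c(0, 0, 1, 0)
)

# count the missing values in each column
colSums(is.na(df))

# complete.cases() flags the rows that have no missing values
complete.cases(df)

# na.omit() keeps only those complete rows
clean <- na.omit(df)
nrow(df)     # 4 rows before
nrow(clean)  # 2 rows after
```

Comparing nrow() before and after is a quick way to check how much data was dropped before deciding whether that loss is acceptable.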

Graphing

Now that you have some data points loaded into R you can start to create graphs of the data and see if there are any visible trends. In the last tutorial you should have installed the ggplot2 package for plotting data. ggplot2 extends R’s graphing functions and lets you build graphs layer by layer, with each layer represented by an R function. It also gives you finer control over the visual aspects of the graphs.

The example below starts off with a ggplot() object whose x and y aesthetics are set to columns in the baby.data data frame. Next, geom_point() creates a scatter plot of the data, with the color of each point determined by the Smoke column. The theme is set to black and white with theme_bw(). xlab() and ylab() set the axis labels and ggtitle() sets the title (older ggplot2 releases used opts(title=...), which has since been removed). stat_smooth() fits a line through the points using a linear model, and ggsave() saves the plot as a PDF to a directory of your choice.


# load ggplot2, then start off with a ggplot() object and add on the other layers to build the graph
> library(ggplot2)
> baby.plot <- ggplot(baby.data, aes(x=BirthWeight, y=Gestation)) +
    geom_point(aes(colour=Smoke)) +
    theme_bw() +
    ylab("Gestation Period In Days") +
    xlab("Birth Weight in Ounces") +
    # ggtitle() replaces opts(title=...), which was removed from ggplot2
    ggtitle("Baby Weight Compared to Gestation Period") +
    # fit a line using a linear model (lm)
    stat_smooth(method="lm")
> ggsave(plot=baby.plot, filename="./graphs/babies.pdf", width=8, height=9)

To create separate graphs for the baby data based on parity, you can use the facet_wrap() function to break the data down into multiple graphs based on the value of a third column in the data frame.

> baby.plot <- ggplot(baby.data, aes(x=BirthWeight, y=Gestation)) +
    geom_point(aes(colour=Smoke)) +
    theme_bw() +
    # one panel per value of Parity
    facet_wrap(~Parity, nrow = 6, ncol = 1) +
    ylab("Gestation Period In Days") +
    xlab("Birth Weight in Ounces") +
    # ggtitle() replaces opts(title=...), which was removed from ggplot2
    ggtitle("Baby Weight Compared to Gestation Period") +
    # fit a line using a linear model (lm)
    stat_smooth(method="lm")
> ggsave(plot=baby.plot, filename="./graphs/babies-by-parity.pdf", width=8, height=9)

Graphs made with ggplot2 can be customized in almost any way to fit the type of graph. The docs can be found at the bottom of the page. In the next section of the tutorial I will go over graphing a bit more and take on some simple regressions and clustering.

Part 3: Linear Models In R

Helpful Links

Importing Data in R
PostgreSQL interface for R scripts
ggplot2 Documentation
R Cook Book: Graphing

Introduction to the R Programming Language

Introduction to R

R is a programming language and environment for doing computations that involve statistics. The language offers a wide variety of tools that make statistical computations easy to do and take out much of the boilerplate code needed to implement various statistical models. It also supports object-oriented programming techniques. There are strong tools for visualizing data built into R, as well as a vast number of community-made packages that extend R’s functionality. I’m going to begin by going over some of the basic features of R and the syntax for some common statistical tasks.

To begin you will need to get an installation of R running on your computer. Most installations come bundled with a GUI editor that highlights R syntax, but most things I will go over can be done using the interactive prompt.

Packages

One great feature of R is the ability to quickly install packages created by the community from the command line. The Comprehensive R Archive Network (CRAN) hosts a wide array of packages that extend the functionality of R. You can find a list of available packages here. Installing a package is a simple one-line command. Later on in this series, we will be plotting graphs of data using an extension of R’s graphing features called ggplot2. It simplifies some of the steps required for graphing data and allows you to make complex layered graphs. To install the package run


# select a mirror when prompted
> install.packages("ggplot2")

R has a good deal of documentation available on the internet, as well as a few free books on learning the basics. You can find them here. Another helpful feature of R is the ability to look up documentation from the command line. To look at the documentation for the install.packages() function, prepend a ? to the function name and the documentation will be shown. You can also search through the help files for certain terms by using help.search().


> ?install.packages
> help.search("install.packages")

Basics in R

Data Types

R provides a nice selection of data types made for use in statistical computations.

Numbers

In R you can assign any real number to a variable. An arrow pointing to the left (<-) is used to assign a value to a variable. There is no need to declare the data type of a variable before assigning it a value.


# assign numbers to two variables
> t <- 45
> u <- 45.543543
# from the interpreter you can print out a variable by typing in the variable name
# print out t and u
> t
[1] 45
> u
[1] 45.54354
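
Numeric variables can be used directly in arithmetic. A short sketch follows, with values chosen only for illustration.

```r
# assign two numbers (values chosen only for illustration)
a <- 7
b <- 2

a + b    # addition
a / b    # division returns a decimal value
a %/% b  # integer division
a %% b   # remainder (modulo)
a ^ b    # exponentiation

# numbers are stored as "numeric" (double precision) by default
class(a)
is.numeric(a)
```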

Strings

Define a string using single or double quotes.


> y <- "hello world"
> y
[1] "hello world"
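
A few of base R's string functions, sketched on made-up values, show what can be done once a string is stored in a variable.

```r
first <- "hello"
second <- 'world'   # single quotes work the same way

# paste() joins strings, separated by a space by default
greeting <- paste(first, second)
greeting

nchar(greeting)          # number of characters
toupper(greeting)        # convert to upper case
substr(greeting, 1, 5)   # extract characters 1 through 5
```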

Factors

Factors contain values of categorical data that can’t be interpreted as a number. Storing data as a factor makes sure R interprets certain data correctly when creating statistical models. They are an efficient way to store strings, because their values can be stored once in memory and then referenced by a numeric value. More about factors will come up when you start analyzing data.
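
As a minimal sketch of the idea, the example below converts a hypothetical character vector into a factor and shows the levels and the integer codes that reference them. The values are made up for illustration.

```r
# hypothetical categorical data stored as plain strings
status <- c("smoker", "non-smoker", "non-smoker", "smoker", "non-smoker")

# convert the character vector to a factor
status.factor <- factor(status)

levels(status.factor)      # the distinct categories, stored once
as.integer(status.factor)  # the integer codes that reference those levels
table(status.factor)       # counts per category
```

By default the levels are sorted alphabetically, so "non-smoker" gets code 1 and "smoker" gets code 2.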

Vectors

The three most common types of vectors in R are
-numerical
-character
-logical
They are defined using c(). You can refer to elements of a vector using an index value, a vector of index values, or a range of index values. Indexes start at 1.


> x <- c(54,43,43,76,21,32)
> y <- c("one", "two", "three", "four")
> z <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)
# 2nd and 5th index
> x[c(2,5)]
[1] 43 21
# index 2 through 5
> z[2:5]
[1] FALSE  TRUE  TRUE FALSE
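
Beyond indexing by position, vectors support element-wise arithmetic and filtering with a logical condition. A short sketch, reusing the numeric values from above:

```r
x <- c(54, 43, 43, 76, 21, 32)

# arithmetic is applied to every element at once
x * 2

# a logical vector can be used as an index to filter values
big <- x[x > 40]
big

# negative indexes drop elements instead of selecting them
x[-1]

length(x)  # number of elements
mean(x)    # average of the elements
```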

Matrices

Matrices are a large part of computing statistical data in R, especially when dealing with regressions and other computations on tabular data. Create a matrix using the matrix() function. Every element of a matrix must have the same mode (numeric, character, or logical).


# create an 8x8 matrix of the numbers from 1 to 64
> x <- matrix(1:64, nrow=8, ncol=8)
# create a 4x4 matrix with row and column labels
# byrow=TRUE makes sure the matrix is filled by rows instead of columns
# dimnames is set to a list of two vectors that will be used as the row and column labels
> y <- matrix(1:16, nrow=4, ncol=4, byrow=TRUE, dimnames=list(c("a", "b", "c", "d"), c("1", "2", "3", "4")))
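
Once a matrix exists, its elements can be read with [row, column] indexing, and it can be used in matrix arithmetic. The sketch below uses a small matrix with arbitrary values.

```r
# a 2x3 matrix filled row by row (values are arbitrary)
m <- matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
m

dim(m)    # number of rows and columns
m[1, 2]   # element in row 1, column 2
m[, 3]    # the entire third column
t(m)      # transpose: a 3x2 matrix

# matrix multiplication uses %*%; a 2x3 times a 3x2 gives a 2x2 result
product <- m %*% t(m)
product
```

Note that * multiplies two matrices element by element, while %*% performs true matrix multiplication.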

Lists

Create a list using the list() function. In R, lists are like associative arrays with key-value pairs. The default keys are integer index values, but they can also be set to strings. Lists can contain any data type, including a mix of types.


> a <- c("one", "two", "three", "four")
> b <- "Hello World"
# named m rather than c to avoid confusion with the c() function
> m <- matrix(1:64, nrow=8, ncol=8)
# create a list with variables of mixed data types
> thelist <- list(avector=a, astring=b, amatrix=m)
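
Elements of a list can be read back by key or by position. The following sketch builds a smaller, self-contained list with illustrative values. The anumber element is a hypothetical addition used only to show assignment.

```r
thelist <- list(avector=c("one", "two", "three"),
                astring="Hello World",
                amatrix=matrix(1:4, nrow=2, ncol=2))

# access an element by its name with $ or [[ ]]
thelist$astring
thelist[["avector"]]

# or by its position in the list
thelist[[3]]

# names() returns the keys
names(thelist)

# new elements can be added by assigning to a new name
thelist$anumber <- 42
length(thelist)
```

Single brackets, as in thelist[1], return a shorter list rather than the element itself, which is why [[ ]] is used here.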

Data Frames

Data frames are very similar to matrices, except that the columns of a data frame can have different modes. You can create a data frame by combining multiple vectors of the same length using the data.frame() function. Data frames can later be turned into plots and other visualizations. You can also use the names() function to set column labels.


> x <- c(1,2,3,4)
> y <- c("one", "two", "three", "four")
> z <- c(TRUE,FALSE,FALSE,FALSE)
> thedata <- data.frame(x,y,z)
# set the column names of the data frame by passing names() a vector of labels
> names(thedata) <- c("integers","strings","logical") # variable names
# print out the data frame
> thedata
  integers strings logical
1        1     one    TRUE
2        2     two   FALSE
3        3   three   FALSE
4        4    four   FALSE
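
Rows and columns of a data frame can be selected in several ways. The sketch below rebuilds the same small data frame so it runs on its own. The name subset.data is only illustrative.

```r
thedata <- data.frame(integers=c(1, 2, 3, 4),
                      strings=c("one", "two", "three", "four"),
                      logical=c(TRUE, FALSE, FALSE, FALSE))

# a single column, returned as a vector
thedata$strings

# rows and columns can be selected with [row, column]
thedata[2, ]              # the second row
thedata[, c(1, 3)]        # the first and third columns

# keep only the rows where a condition holds
subset.data <- thedata[thedata$integers > 2, ]
subset.data

nrow(thedata)
ncol(thedata)
```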

The next section of this tutorial will go over importing and graphing data as well as some linear regressions using R.

Part 2: Importing and Graphing Data In R