Basic Importing and Graphing Data In R

Importing and Analyzing Data

One of the most common ways of importing data into R is from a text file delimited by some character, such as a comma, tab, or space. There are also packages that let you read data from a database or from other formats. Once the data is imported as a data frame you can start creating models and graphs.

To go through the examples of this post you can download the data file about baby births.

babies.txt

The file is broken down into columns with a header given at the beginning of the file.

Reading CSV Files

To read a CSV file as a data frame, use the read.csv() function. If the file has a header row, you can import it as the column labels. R will try to convert each column of the file to a suitable data type. This works most of the time for numbers, booleans, and strings (before R 4.0, strings are converted to factors by default). To convert dates you will have to use the date functions that R provides. You can read about the date functions here


# read in babies.csv
# header=TRUE takes the first line of the file as the column names
# set the separator using the sep argument (this file is space-delimited)
> baby.data <- read.csv(file="data/babies.csv", header=TRUE, sep=" ")

# print out the new data's columns
> colnames(baby.data)
[1] "birth_weight" "gestation"    "parity"       "age"          "height"      
[6] "weight"       "smoke"

# if you want to change the column labels you can pass the colnames() function a vector of labels
> colnames(baby.data) <- c("BirthWeight", "Gestation", "Parity", "Age", "Height", "Weight", "Smoke")
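Date columns are the case read.csv() does not handle on its own: they arrive as plain text. The sketch below is a minimal illustration that parses a few made-up rows through the text argument instead of a file. The birth_date column and all of its values are hypothetical and are not part of the babies data.

```r
# Made-up, space-delimited sample; birth_date is a hypothetical column
raw <- "birth_weight gestation birth_date
120 284 1961-03-14
113 282 1961-04-02
128 279 1961-05-21"

# text= lets read.csv() parse a string instead of a file on disk
sample.data <- read.csv(text = raw, header = TRUE, sep = " ")

# numeric columns are detected automatically, the date is still text
class(sample.data$birth_weight)   # "integer"
class(sample.data$birth_date)     # "character" (a factor before R 4.0)

# convert explicitly; the format string must match the text in the file
sample.data$birth_date <- as.Date(sample.data$birth_date, format = "%Y-%m-%d")
class(sample.data$birth_date)     # "Date"
```

Once the column is a Date, comparisons and differences between dates behave as expected.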

The baby birth data set is fairly large, so printing the entire data frame isn't practical. R provides a few functions that give you a good idea of what the imported data looks like. You can use head() and tail() to view the first and last 6 rows of the data frame.


# you can use head() to print out the first 6 rows of the data frame (shown here with the original column names)
> head(baby.data)
  birth_weight gestation parity age height weight smoke
1          120       284      0  27     62    100     0
2          113       282      0  33     64    135     0
3          128       279      0  28     64    115     1
4          123        NA      0  36     69    190     0
5          108       282      0  23     67    125     1
6          136       286      0  25     62     93     0
# tail returns the last 6
> tail(baby.data)

# new column names
> head(baby.data)
  BirthWeight Gestation Parity Age Height Weight Smoke
1          120       284      0  27     62    100     0
2          113       282      0  33     64    135     0
3          128       279      0  28     64    115     1
4          123        NA      0  36     69    190     0
5          108       282      0  23     67    125     1
6          136       286      0  25     62     93     0

# you can also view a single column of the data frame by using the $ symbol. It returns a vector containing that column's data
# this is useful when formatting data
> head(baby.data$Weight)
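Both head() and tail() take an n argument when six rows is not the right amount, and dim() reports the size of a data frame without printing it. A small sketch on a made-up data frame (the toy name and its values are invented for illustration):

```r
# Hypothetical stand-in for the imported data
toy <- data.frame(
  Weight = c(100, 135, 115, 190, 125, 93, 178, 140),
  Smoke  = c(0, 0, 1, 0, 1, 0, 0, 0)
)

dim(toy)            # number of rows, then number of columns
head(toy, n = 3)    # first 3 rows instead of the default 6
tail(toy, n = 2)    # last 2 rows

# $ returns the column as a plain vector, so vector functions apply directly
first.weights <- head(toy$Weight, n = 3)
mean(toy$Weight)
```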

The summary() function gives you some details about each column of a data frame: the minimum, 1st quartile, median, mean, 3rd quartile, and maximum. It also tells you how many missing values (NA's) there are. You can then use the na.omit() function to drop the rows that contain missing values, so there isn't any incomplete data left that can cause bugs later on.


# print out a summary of the imported data
> summary(baby.data)
  BirthWeight      Gestation         Parity            Age       
 Min.   : 55.0   Min.   :148.0   Min.   :0.0000   Min.   :15.00  
 1st Qu.:108.8   1st Qu.:272.0   1st Qu.:0.0000   1st Qu.:23.00  
 Median :120.0   Median :280.0   Median :0.0000   Median :26.00  
 Mean   :119.6   Mean   :279.3   Mean   :0.2549   Mean   :27.26  
 3rd Qu.:131.0   3rd Qu.:288.0   3rd Qu.:1.0000   3rd Qu.:31.00  
 Max.   :176.0   Max.   :353.0   Max.   :1.0000   Max.   :45.00  
                 NA's   :13                       NA's   :2      
     Height          Weight          Smoke       
 Min.   :53.00   Min.   : 87.0   Min.   :0.0000  
 1st Qu.:62.00   1st Qu.:114.8   1st Qu.:0.0000  
 Median :64.00   Median :125.0   Median :0.0000  
 Mean   :64.05   Mean   :128.6   Mean   :0.3948  
 3rd Qu.:66.00   3rd Qu.:139.0   3rd Qu.:1.0000  
 Max.   :72.00   Max.   :250.0   Max.   :1.0000  
 NA's   :22      NA's   :36      NA's   :10

# the summary shows some missing values, so use na.omit() to drop the rows that contain them
> baby.data <- na.omit(baby.data)

# str() prints the data type of each column in the data frame
# the na.action attribute at the end records the 62 rows that were dropped
> str(baby.data)
'data.frame':  1174 obs. of  7 variables:
 $ BirthWeight: int  120 113 128 108 136 138 132 120 143 140 ...
 $ Gestation  : int  284 282 279 282 286 244 245 289 299 351 ...
 $ Parity     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Age        : int  27 33 28 23 25 33 23 25 30 27 ...
 $ Height     : int  62 64 64 67 62 62 65 62 66 68 ...
 $ Weight     : int  100 135 115 125 93 178 140 125 136 120 ...
 $ Smoke      : int  0 0 1 1 0 0 0 0 1 0 ...
 - attr(*, "na.action")=Class 'omit'  Named int [1:62] 4 40 43 86 90 94 99 103 111 114 ...
  .. ..- attr(*, "names")= chr [1:62] "4" "40" "43" "86" ...

Because this data set is relatively large (62 rows were dropped, leaving 1,174), taking out a few rows because of missing values is fine and shouldn't noticeably affect the results we might try to get from the data.
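Before dropping rows it can help to count how much is being lost. A minimal base R sketch on invented data; the toy data frame and its values are hypothetical:

```r
# Invented data with one missing value in each column
toy <- data.frame(
  Gestation = c(284, 282, NA, 279, 286),
  Age       = c(27, 33, 28, NA, 25),
  Smoke     = c(0, 0, 1, 0, NA)
)

# number of missing values in each column
na.per.column <- colSums(is.na(toy))

# complete.cases() flags the rows that have no NA in any column
rows.kept <- sum(complete.cases(toy))

# na.omit() keeps exactly those rows
clean <- na.omit(toy)
rows.dropped <- nrow(toy) - nrow(clean)
```

If losing whole rows is too costly, many summary functions accept na.rm=TRUE instead, for example mean(toy$Age, na.rm=TRUE), which skips the missing values for that one calculation.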

Graphing

Now that you have some data points loaded into R you can start to create graphs of the data and see if there are any visible trends. In the last tutorial you should have installed the ggplot2 package for plotting data. ggplot2 extends R's graphing functions and builds graphs layer by layer, with each layer represented by an R function. It also gives you finer control over the visual aspects of the graphs.

The example below starts off with a ggplot() object whose x and y aesthetics are mapped to the BirthWeight and Gestation columns of the baby.data data frame. Next, geom_point() creates a scatter plot in which each point is coloured according to the Smoke column. Because Smoke is stored as an integer, ggplot2 uses a continuous colour gradient; wrap it in factor() if you want two discrete colours. The theme is set to black and white with theme_bw(). xlab() and ylab() set the axis labels and ggtitle() sets the title (opts(), which older ggplot2 tutorials use for this, is no longer available in current versions of ggplot2). stat_smooth(method="lm") adds a fitted straight line, and ggsave() saves the plot as a PDF to a directory of your choice.


# load ggplot2, then start with a ggplot() object and add the other layers to build the graph
> library(ggplot2)
> baby.plot <- ggplot(baby.data, aes(x=BirthWeight, y=Gestation))+
> geom_point(aes(colour=Smoke))+
> theme_bw()+
> ylab("Gestation Period In Days")+
> xlab("Birth Weight in Ounces")+
> ggtitle("Baby Weight Compared to Gestation Period")+
# fit a line using a linear model (lm)
> stat_smooth(method="lm")
# the ./graphs directory should already exist
> ggsave(plot=baby.plot, filename="./graphs/babies.pdf", width=8, height=9)
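The line drawn by stat_smooth(method="lm") is an ordinary least-squares fit of y on x, and the same kind of fit can be inspected numerically with base R's lm(). This is a sketch using a handful of invented points rather than the real data; the toy data frame is hypothetical and only borrows the column names from the plot.

```r
# Invented points; column names mirror the ones used in the plot
toy <- data.frame(
  BirthWeight = c(100, 110, 120, 130, 140),
  Gestation   = c(268, 277, 279, 286, 290)
)

# Gestation modelled as a linear function of BirthWeight (y ~ x)
fit <- lm(Gestation ~ BirthWeight, data = toy)

# intercept and slope of the fitted line
coef(fit)

# predicted gestation for a hypothetical 125-ounce baby
predicted <- predict(fit, newdata = data.frame(BirthWeight = 125))
```

Running the same formula on baby.data would give the coefficients behind the line in the plot.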

To create separate graphs for the baby data based on parity, you can use the facet_wrap() function, which breaks the data into multiple panels based on the value of a third column in the data frame.

> baby.plot <- ggplot(baby.data, aes(x=BirthWeight, y=Gestation))+
> geom_point(aes(colour=Smoke))+
> theme_bw()+
# Parity only takes the values 0 and 1 in this data, so this gives two stacked panels
> facet_wrap(~Parity, ncol = 1)+
> ylab("Gestation Period In Days")+
> xlab("Birth Weight in Ounces")+
# ggtitle() replaces opts(title=...), which is no longer available in ggplot2
> ggtitle("Baby Weight Compared to Gestation Period")+
# fit a line using a linear model (lm)
> stat_smooth(method="lm")
> ggsave(plot=baby.plot, filename="./graphs/babies-by-parity.pdf", width=8, height=9)
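facet_wrap() draws one panel per distinct value of the faceting column, so it is worth checking how many values there are, and how many rows fall under each, before plotting. A minimal base R sketch on invented data (the toy data frame is hypothetical):

```r
# Invented rows; Parity plays the role of the faceting column
toy <- data.frame(
  Parity    = c(0, 0, 1, 0, 1, 0),
  Gestation = c(284, 282, 279, 286, 244, 289)
)

# one panel per distinct value
n.panels <- length(unique(toy$Parity))

# number of rows that would land in each panel
panel.sizes <- table(toy$Parity)

# mean gestation within each panel's subset of the data
group.means <- tapply(toy$Gestation, toy$Parity, mean)
```

A panel with very few rows will still get its own fitted line, so the counts are a quick way to judge how much weight to give each panel.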

Graphs made with ggplot2 can be customized in almost any way to fit the type of graph. The docs can be found at the bottom of the page. In the next section of the tutorial I will go over graphing a bit more and take on some simple regressions and clustering.

Part 3: Linear Models In R

Helpful Links

Importing Data in R
PostgreSQL interface for R scripts
ggplot2 Documentation
R Cook Book: Graphing
