Linear Models in R (Part 2) – Data Analysis and Multiple Regressions

Multiple linear regression follows the same ideas as univariate regression, but instead of using one variable to predict an outcome, we will be using several. So there will be one dependent variable and multiple independent ones. By adding more independent variables you can make more precise predictions and model more complex systems.

The general formula looks like this.

y = Θ0 + Θ1(x1) + Θ2(x2) + Θ3(x3) + Θ4(x4)

Θ0 is the intercept and Θ1, Θ2, and so on are the slopes, just as in the univariate regression. Together they are known as the regression coefficients.

Investigating Data Before Creating Models

Looking at a summary of the data frame you want to run a multiple regression on is a good starting point. You’ll get a better idea of what the data is like and the range of each input variable. You can also look at the correlations in the data using R’s cor() function and see if there are any strong correlations between variables. This establishes some idea of the bivariate relationships in the data set, but unfortunately those pairwise correlations don’t take into account all of the other variables we will be using. The great thing about a multiple regression is that it gives a purer picture of each variable’s relationship with the outcome by taking all of them into account.

I’ll start off with some code that gets all of the data into a data frame, and then we can start looking into the data itself using the techniques I’ve described.


# load ggplot to use later on to make some simple plots
> library(ggplot2)

# read in the data and drop rows with missing values (same steps as in the earlier posts)
> baby.data <- read.csv(file="data/babies.csv", head=TRUE, sep=" ")
> baby.data <- na.omit(baby.data)
> colnames(baby.data)

Now we’ve got a data frame with the data about baby births and their mothers, and we can begin poking around to see if we can find any interesting patterns. R’s summary() function is a good place to start.


> summary(baby.data)
  birth_weight    gestation         parity            age       
 Min.   : 55.0   Min.   :148.0   Min.   :0.0000   Min.   :15.00  
 1st Qu.:108.0   1st Qu.:272.0   1st Qu.:0.0000   1st Qu.:23.00  
 Median :120.0   Median :280.0   Median :0.0000   Median :26.00  
 Mean   :119.5   Mean   :279.1   Mean   :0.2624   Mean   :27.23  
 3rd Qu.:131.0   3rd Qu.:288.0   3rd Qu.:1.0000   3rd Qu.:31.00  
 Max.   :176.0   Max.   :353.0   Max.   :1.0000   Max.   :45.00  
     height          weight          smoke       
 Min.   :53.00   Min.   : 87.0   Min.   :0.000  
 1st Qu.:62.00   1st Qu.:114.2   1st Qu.:0.000  
 Median :64.00   Median :125.0   Median :0.000  
 Mean   :64.05   Mean   :128.5   Mean   :0.391  
 3rd Qu.:66.00   3rd Qu.:139.0   3rd Qu.:1.000  
 Max.   :72.00   Max.   :250.0   Max.   :1.000  

From this we see a few interesting aspects of the data. The smoke and parity values are binary variables, so it may be interesting to split the data set into two subsets at some point. There isn’t much variance in height, but the birth_weight of the baby, the weight of the mother, and the gestation period all have somewhat large ranges.
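
If you do want to look at the two groups separately later on, a quick sketch of splitting on the binary smoke column might look like this (subset() is base R):

# split the data frame on the binary smoke column
> smokers <- subset(baby.data, smoke == 1)
> non.smokers <- subset(baby.data, smoke == 0)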

Next, we can take a look at the correlations and see if there are any interesting relationships between the variables. You can use R’s cor() function for this.


> cor(baby.data)
             birth_weight   gestation       parity          age       height
birth_weight   1.00000000  0.40754279 -0.043908173  0.026982911  0.203704177
gestation      0.40754279  1.00000000  0.080916029 -0.053424774  0.070469902
parity        -0.04390817  0.08091603  1.000000000 -0.351040648  0.043543487
age            0.02698291 -0.05342477 -0.351040648  1.000000000 -0.006452846
height         0.20370418  0.07046990  0.043543487 -0.006452846  1.000000000
weight         0.15592327  0.02365494 -0.096362092  0.147322111  0.435287428
smoke         -0.24679951 -0.06026684 -0.009598971 -0.067771942  0.017506595
                  weight        smoke
birth_weight  0.15592327 -0.246799515
gestation     0.02365494 -0.060266842
parity       -0.09636209 -0.009598971
age           0.14732211 -0.067771942
height        0.43528743  0.017506595
weight        1.00000000 -0.060281396
smoke        -0.06028140  1.000000000

There aren’t too many strong correlations in this data set. A few that stick out are the baby’s birth weight and smoking, the mother’s height and her weight, and the gestation period and birth weight. Most of these correlations make sense, and nothing looks out of the ordinary. The only negative correlations that stand out are between smoking and birth weight, and between a mother’s parity and her age.
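
If you’d rather rank these programmatically than eyeball the matrix, one small sketch is to sort the birth_weight column of the correlation matrix:

# rank how strongly each variable correlates with birth weight
> sort(cor(baby.data)[, "birth_weight"])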

The next step to get an even more in-depth understanding of the data is to start plotting some of the variables and looking for linear relationships. Visualizing the data definitely gives a better perspective of what you are working with and can expose trends that numbers alone can’t show. Since smoking and birth weight had a decent correlation, we’ll start by plotting them.


> ggplot(baby.data, aes(x = birth_weight, y = smoke)) + geom_point()

The only thing that sticks out from this graph is that when the mother doesn’t smoke, the birth weights run a little higher. Plotting gestation and birth weight should be more interesting because there is a stronger correlation.


> ggplot(baby.data, aes(x = gestation, y = birth_weight)) + geom_point()


This plot is much more revealing, and you can clearly see a linear relationship between the two variables. You can use ggplot2’s geom_smooth() function to add a fitted line to the data and see the relationship a little better. The method argument is set to lm, so it will use a linear model to make the fitted line; there are many other options you can use as well. The se argument decides whether to display the standard error band around the fitted line.


> ggplot(baby.data, aes(x = gestation, y = birth_weight)) +
+   geom_point() +
+   geom_smooth(method = 'lm', se = FALSE)


The fitted line isn’t too revealing, mostly because the correlation between the two variables isn’t incredibly strong, but there is still clearly a linear relationship between them. By incorporating more input variables into the regression, we will be able to make better predictions.
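
As a quick sketch of one of the other method options mentioned above, the same plot with a loess curve and its standard error band would look like this:

> ggplot(baby.data, aes(x = gestation, y = birth_weight)) +
+   geom_point() +
+   geom_smooth(method = 'loess', se = TRUE)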

Another technique to get more information about the data is to make some density plots of the variables you are working with. Here is a simple one for birth weight.


> ggplot(baby.data, aes(x = birth_weight)) + geom_density()


This is a nice way to visualize the data. From this graph you can see that birth weight roughly follows a normal distribution. If that wasn’t the case, it might be helpful to take the log of the variable with R’s log() function and see how that turns out. That is another topic I will write about later, when I deal with scaling and normalizing data.
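
Just for illustration, the log-transformed version of the same density plot (not actually needed here, since the distribution already looks roughly normal) would be:

> ggplot(baby.data, aes(x = log(birth_weight))) + geom_density()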

These are just a few of the techniques you can use to explore the data you are working with before trying to build models with it. Getting to know the data and its relationships well can make it much easier to create models that accurately predict values later on.

Multiple Regressions

Now that you have analyzed the data and understand some of the relationships within the data set, it’s time to create a model with the data and begin to make predictions. You will be using the same lm() function that is used for univariate regressions, except the formula will be modified to include the rest of the input variables. Let’s start off by trying to predict the birth weight of a baby given all of the other input variables.


# rename the columns as in the earlier post so they match the model output
> colnames(baby.data) <- c("BirthWeight", "Gestation", "Parity", "Age",
+                          "Height", "Weight", "Smoke")
> baby.model <- lm(formula = BirthWeight ~ Gestation + Parity + Age + Height +
+                  Weight + Smoke, data = baby.data)
> summary(baby.model)

Call:
lm(formula = BirthWeight ~ Gestation + Parity + Age + Height +
Weight + Smoke, data = baby.data)

Residuals:
    Min      1Q  Median      3Q     Max 
-57.613 -10.189  -0.135   9.683  51.713 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -80.41085   14.34657  -5.605 2.60e-08 ***
Gestation     0.44398    0.02910  15.258  < 2e-16 ***
Parity       -3.32720    1.12895  -2.947  0.00327 ** 
Age          -0.00895    0.08582  -0.104  0.91696    
Height        1.15402    0.20502   5.629 2.27e-08 ***
Weight        0.05017    0.02524   1.987  0.04711 *  
Smoke        -8.40073    0.95382  -8.807  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.83 on 1167 degrees of freedom
Multiple R-squared:  0.258, Adjusted R-squared:  0.2541 
F-statistic: 67.61 on 6 and 1167 DF,  p-value: < 2.2e-16

Now you should have a linear model that incorporates multiple input variables. In a later post I will go over what all of this output means, but I’ll also put some links at the bottom that explain some of the topics. The next post will cover how to measure the model’s performance and the amount of error, along with some ways to minimize that error.
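
In the meantime, here is a sketch of using the model to predict the birth weight of a single new observation with predict(); the values below are made up purely for illustration:

# hypothetical new observation; predict() returns the model's estimated birth weight
> new.baby <- data.frame(Gestation = 280, Parity = 0, Age = 28,
+                        Height = 64, Weight = 125, Smoke = 0)
> predict(baby.model, newdata = new.baby)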

Helpful Reading

Linear Models In R

Now that you have some data imported into R and understand some of the basic features for summarizing and viewing that data, you can start to dig into some of the features that make R truly stand out as a programming language for statistics. The nature of the babies data makes it an excellent choice for utilizing some of the regression features that R offers.

Basically, we want to predict numbers using a model created from a training set of data. You have one variable that you would like to predict, and multiple explanatory variables you are using to predict that final value. This post will go over simple univariate regression, but R is perfectly capable of multiple regression as well.

Basics with Regression

  • The main goals of regression analysis:
      • Predicting values (for example, predicting housing prices using characteristics of the houses)
      • Assessing the effect of a variable on other variables (for example, assessing the impact on an industry of raw material price increases)
      • Giving a description to a set of data

When using a linear regression, the number one assumption is that the data roughly follows a linear relationship between the input variables/predictors (x) and the response (y). If there is more than one predictor it is a multiple regression; otherwise it is a simple regression. The response variable must be continuous, but the predictors can be discrete, continuous, or categorical.

You may have had to plot a fitted line on a scatter plot at some point in your elementary education, as a way to predict the value of something given that the line goes through the majority of the points plotted. Simply drawing a line by eye is good enough for most day-to-day things, but if you want accuracy in your predictions and the ability to handle more complex models, some math is going to be used. The general form of a regression line is

y = Θ0 + Θ1(x1)

Θ0 represents the y-intercept of the line and Θ1 the slope.
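
As an aside, lm() handles categorical predictors if you wrap them in factor(); here is a small sketch using the babies data that gets loaded later in this post:

# sketch: treat the binary Smoke column as a category instead of a number
> smoke.model <- lm(BirthWeight ~ Gestation + factor(Smoke), data = baby.data)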

If there were more variables you wanted to keep track of, the equation would get much larger and more complex.

To fit a line to a collection of points, most linear models minimize the sum of the squares of the model’s residuals (this is the least squares method). Residuals are the vertical distances between the data points and the fitted line, and they can be seen as a way to measure the accuracy of a linear regression. If the sum is too big, the line isn’t being fit very well, and if it is enormous, the data may not be right for a linear regression at all.
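
To make that concrete, here is a minimal sketch with made-up points showing the quantity that least squares minimizes:

# toy data, purely illustrative
> x <- c(1, 2, 3, 4, 5)
> y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
> fit <- lm(y ~ x)
# residuals are the vertical gaps between each point and the fitted line
> resid(fit)
# the residual sum of squares is the quantity the fit minimizes
> sum(resid(fit)^2)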

For now I will go over how to make a simple one-variable regression in R using the babies.txt data, and afterwards another model using all of the provided data.

If you have moved away from the initial R prompt here is a script with all of the code we went through in the past post.

You can load it into R by running source() while R is running in the same directory as the source file. All of the variables created in the script will be available for your use.


>source("Babies.R")

All of the variables should be loaded into the interpreter.

Creating Linear Models

R makes it easy to create linear models for data sets.

First, let’s see what the data we have looks like.


> head(baby.data)
  BirthWeight Gestation Parity Age Height Weight Smoke
1         120       284      0  27     62    100     0
2         113       282      0  33     64    135     0
3         128       279      0  28     64    115     1
5         108       282      0  23     67    125     1
6         136       286      0  25     62     93     0
7         138       244      0  33     62    178     0

Next, we will split the baby.data data frame from the previous post into a training set and a test set.

# input data frame and seed for random value to split up data frame
# this is the basic structure for creating a function in R
# function to split data frame into two parts for training and testing
> splitdf <- function(dataframe, seed=NULL) {
+     if (!is.null(seed)) set.seed(seed)
+     index <- 1:nrow(dataframe)
+     trainindex <- sample(index, trunc(length(index)/2))
+     trainset <- dataframe[trainindex, ]
+     testset <- dataframe[-trainindex, ]
+     list(trainset=trainset, testset=testset)
+ }
> two.sets <- splitdf(baby.data, seed=NULL)
> summary(two.sets[1])
> summary(two.sets[2])
> train.data <- two.sets$trainset
> test.data <- two.sets$testset

Making a linear model for birth weight and gestation period would be interesting. We’ll need the lm() function that R provides to do that.


# create a linear model to predict birth weight given the gestation period
# the formula gives the variable to predict, then a ~, then the predictors being used
# set data= to the training set created above
> baby.model <- lm(formula = BirthWeight ~ Gestation, data = train.data)
> coefficients(baby.model)
# summary info about the new model
> summary(baby.model)

Call:
lm(formula = BirthWeight ~ Gestation, data = train.data)

Residuals:
    Min      1Q  Median      3Q     Max 
-49.348 -11.065   0.218  10.101  57.704 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -10.75414    8.53693   -1.26    0.208    
Gestation     0.46656    0.03054   15.28   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 16.74 on 1172 degrees of freedom
Multiple R-squared:  0.1661, Adjusted R-squared:  0.1654 
F-statistic: 233.4 on 1 and 1172 DF,  p-value: < 2.2e-16

# to display the coefficients used in the model
> coefficients(baby.model)
(Intercept)   Gestation 
-10.7541389   0.4665569 

Next we will generate predictions from the linear model we’ve created. This is done using R’s predict() function, which generates the predicted values of the model for whatever data you pass it.


# generate predicted values for the test data using the model
> lm.fit <- predict(baby.model, newdata = test.data)

A helpful set of numbers when dealing with linear models are the residuals.


# the residuals are the differences between the original labels and the model's fitted values
> summary((train.data$BirthWeight - fitted(baby.model)) - baby.model$residuals)

To see how well the model fits the data, you can find the root mean squared error. You do this by taking the mean of the squared residuals (the gaps between the data points and the regression line) and then taking the square root of that.


# find the root mean squared error to see the sample standard deviation
> rmsError <- sqrt(sum(baby.model$residuals * baby.model$residuals)/length(baby.model$residuals))
> print("rmsError = ")
[1] "rmsError = "
> rmsError
[1] 17.15356
# the RMS error is around 17, so on average the predicted birth weight will be off by around 17 ounces, or within a pound or so

As you can see, R makes it very easy to create basic linear models for data and then use them to see how predictions can be made. In the next part of this series I will go over incorporating more variables into the linear model, so predictions can be even better.

Here is the entire script created in this post. Make sure the Babies.R source file is in the same directory as this one if you want it to run correctly, or otherwise change the path in the argument to source().


source("Babies.R")
head(baby.data)
# input data frame and seed for random value to split up data frame
splitdf <- function(dataframe, seed=NULL) {
    if (!is.null(seed)) set.seed(seed)
  #take index from 1 to the length of the data frame
    index <- 1:nrow(dataframe)
  #find training set index
    trainindex <- sample(index, trunc(length(index)/2))
  # assign training and test set
    trainset <- dataframe[trainindex, ]
    testset <- dataframe[-trainindex, ]
    list(trainset=trainset,testset=testset)
}

two.sets <- splitdf(baby.data, seed=NULL)

train.data <- two.sets$trainset
test.data <- two.sets$testset

# create a linear model of the baby data to predict birth weight given the gestation period
baby.model <- lm(formula = BirthWeight ~ Gestation, data = train.data)

#show coefficients for the model
coefficients(baby.model)

# create a linear fit line for the model using the test data
lm.fit <- predict(baby.model, newdata = test.data)

# the residuals are the differences between the original labels and the fitted values
summary((train.data$BirthWeight - fitted(baby.model)) - baby.model$residuals)

# find the root mean squared error to see the sample standard deviation
rmsError <- sqrt(sum(baby.model$residuals * baby.model$residuals)/length(baby.model$residuals))

print("rmsError = ")
rmsError

# the RMS error is around 17 so this means on average the prediction will be off by around 17 ounces of the birth weight

Helpful Links

Basic Importing and Graphing Data In R

Importing and Analyzing Data

One of the most common ways of importing data into R is using a text file that is delimited with some sort of character. There are also packages that allow you to read in data from a database or other formats. Once the data is imported as a data frame you can start creating models and graphs.
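
For example, with the RPostgreSQL package (linked at the bottom of this page) you can pull a data frame straight out of a database. This is a hypothetical sketch that assumes a local database named babies with a births table:

> library(RPostgreSQL)
# connect to a hypothetical local database and read a table into a data frame
> con <- dbConnect(PostgreSQL(), dbname = "babies", user = "postgres")
> births <- dbGetQuery(con, "SELECT * FROM births")
> dbDisconnect(con)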

To go through the examples of this post you can download the data file about baby births.

babies.txt

The file is broken down into columns with a header given at the beginning of the file.

Reading CSV Files

To read in a CSV file as a data frame, use the read.csv() function. This will return a data frame with the data from the file. If there is a header in the file then you can import that as well for the column labels. R will try to convert the columns of the file into relevant datatypes; this works most of the time for numbers, booleans, and strings. To convert dates you will have to use the date functions that R provides. You can read about the date functions here
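
As a quick illustration of those date functions, as.Date() converts strings into R’s Date type using a format string that describes the input:

# parse a date string; %m/%d/%Y describes the layout of the input
> as.Date("03/14/2015", format = "%m/%d/%Y")
[1] "2015-03-14"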


# read in data.csv
# head=TRUE imports header values from the csv file and sets them to column names
# set the separator using the sep argument
> baby.data <- read.csv(file="data/babies.csv", head=TRUE, sep=" ")

# print out the new data's columns
> colnames(baby.data)
[1] "birth_weight" "gestation"    "parity"       "age"          "height"      
[6] "weight"       "smoke"

# if you want to change the column labels you can pass the colnames() function a vector of labels
> colnames(baby.data) <- c("BirthWeight", "Gestation", "Parity", "Age", "Height", "Weight", "Smoke")

The baby birth data set is pretty large, so it’s not practical to print out the entire data frame. There are a few functions that R provides that give you a good idea of what the imported data looks like. You can use head() and tail() to view the first and last six rows of the data frame.


# you can use head() to print out the first 6 entries in the data frame
> head(baby.data)
  birth_weight gestation parity age height weight smoke
1          120       284      0  27     62    100     0
2          113       282      0  33     64    135     0
3          128       279      0  28     64    115     1
4          123        NA      0  36     69    190     0
5          108       282      0  23     67    125     1
6          136       286      0  25     62     93     0
# tail returns the last 6
> tail(baby.data)

# new column names
> head(baby.data)
  BirthWeight Gestation Parity Age Height Weight Smoke
1          120       284      0  27     62    100     0
2          113       282      0  33     64    135     0
3          128       279      0  28     64    115     1
4          123        NA      0  36     69    190     0
5          108       282      0  23     67    125     1
6          136       286      0  25     62     93     0

# you can also view one single column of the data frame by using the $ symbol. It will return a vector containing that column's data
# this is good to use when formatting data
> head(baby.data$Weight)

The summary() function gives you some details about the information in a data frame: the minimum, 1st quartile, median, mean, 3rd quartile, and maximum. It also tells you how many null values (NA’s) there are. You can then use the na.omit() function to remove the null values from the data frame, so there isn’t any bad data left to cause bugs later on.


# print out a summary of the imported data
> summary(baby.data)
  BirthWeight      Gestation         Parity            Age       
 Min.   : 55.0   Min.   :148.0   Min.   :0.0000   Min.   :15.00  
 1st Qu.:108.8   1st Qu.:272.0   1st Qu.:0.0000   1st Qu.:23.00  
 Median :120.0   Median :280.0   Median :0.0000   Median :26.00  
 Mean   :119.6   Mean   :279.3   Mean   :0.2549   Mean   :27.26  
 3rd Qu.:131.0   3rd Qu.:288.0   3rd Qu.:1.0000   3rd Qu.:31.00  
 Max.   :176.0   Max.   :353.0   Max.   :1.0000   Max.   :45.00  
                 NA's   :13                       NA's   :2      
     Height          Weight          Smoke       
 Min.   :53.00   Min.   : 87.0   Min.   :0.0000  
 1st Qu.:62.00   1st Qu.:114.8   1st Qu.:0.0000  
 Median :64.00   Median :125.0   Median :0.0000  
 Mean   :64.05   Mean   :128.6   Mean   :0.3948  
 3rd Qu.:66.00   3rd Qu.:139.0   3rd Qu.:1.0000  
 Max.   :72.00   Max.   :250.0   Max.   :1.0000  
 NA's   :22      NA's   :36      NA's   :10

# str will print out the data type for each column in the data frame
# it looks like there are some nulls, so we'll use na.omit() to remove them
> str(baby.data)
'data.frame':  1174 obs. of  7 variables:
 $ BirthWeight: int  120 113 128 108 136 138 132 120 143 140 ...
 $ Gestation  : int  284 282 279 282 286 244 245 289 299 351 ...
 $ Parity     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Age        : int  27 33 28 23 25 33 23 25 30 27 ...
 $ Height     : int  62 64 64 67 62 62 65 62 66 68 ...
 $ Weight     : int  100 135 115 125 93 178 140 125 136 120 ...
 $ Smoke      : int  0 0 1 1 0 0 0 0 1 0 ...
 - attr(*, "na.action")=Class 'omit'  Named int [1:62] 4 40 43 86 90 94 99 103 111 114 ...
  .. ..- attr(*, "names")= chr [1:62] "4" "40" "43" "86" ...
> baby.data <- na.omit(baby.data) 

Because this data set is relatively large, taking out a few rows because of null values is fine and shouldn’t affect any results we might try to get from the data.
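
If you want to check how many rows you are dropping before calling na.omit(), complete.cases() flags the rows that contain no NAs (a small sketch):

# count the rows that contain at least one NA
> sum(!complete.cases(baby.data))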

Graphing

Now that you have some data points loaded into R, you can start to create graphs of the data and see if there are any visible trends. In the last tutorial you should have installed the ggplot2 package for plotting data. ggplot2 extends R’s graphing functions and allows graphs to be built up layer by layer, with each layer represented by an R function. It also gives better control over the visual aspects of the graphs.

The example below starts off with a ggplot() object with the aesthetics x and y set to columns in the baby.data data frame. Next, geom_point() creates a scatter plot of the data, with the color of each point determined by the Smoke column. The theme is set to black and white with theme_bw(), xlab() and ylab() set the labels for the plot, and ggtitle() sets the title. ggsave() saves the file as a pdf to a directory of your choice.


# start off with a ggplot() object then add the other layers to build up the graph
> baby.plot <- ggplot(baby.data, aes(x=BirthWeight, y=Gestation)) +
+   geom_point(aes(colour=Smoke)) +
+   theme_bw() +
+   ylab("Gestation Period In Days") +
+   xlab("Birth Weight in Ounces") +
+   ggtitle("Baby Weight Compared to Gestation Period") +
+   stat_smooth(method="lm")  # fit a line using a linear model (lm)
> ggsave(plot=baby.plot, filename="./graphs/babies.pdf", width=8, height=9)

To create separate graphs for the baby data based on the parity of the birth, you can use the facet_wrap() function to break the data down into multiple panels based on the value of a third column in the data frame.

> baby.plot <- ggplot(baby.data, aes(x=BirthWeight, y=Gestation)) +
+   geom_point(aes(colour=Smoke)) +
+   theme_bw() +
+   facet_wrap(~Parity, nrow = 6, ncol = 1) +
+   ylab("Gestation Period In Days") +
+   xlab("Birth Weight in Ounces") +
+   ggtitle("Baby Weight Compared to Gestation Period") +
+   stat_smooth(method="lm")  # fit a line using a linear model (lm)
> ggsave(plot=baby.plot, filename="./graphs/babies-by-parity.pdf", width=8, height=9)

Graphs made with ggplot2 can be customized in almost any way to fit the type of graph. The docs can be found at the bottom of the page. In the next section of the tutorial I will go over graphing a bit more and take on some simple regressions and clustering.

Part 3: Linear Models In R

Helpful Links

Importing Data in R
PostgreSQL interface for R scripts
ggplot2 Documentation
R Cook Book: Graphing