Linear Models in R (Part 2) – Data Analysis and Multiple Regressions

Multiple linear regression follows the same ideas as univariate regression, but instead of using one variable to predict an outcome, we will be using several. There will still be one dependent variable, but multiple independent ones. By adding more independent variables you can make more precise predictions and model more complex systems.

The general formula looks like this.

y = Θ0 + Θ1(x1) + Θ2(x2) + Θ3(x3) + Θ4(x4)

Θ0 is the intercept and Θ1, Θ2, and so on are the slopes, just as in the univariate regression. They are also known as the regression coefficients.

Investigating Data Before Creating Models

Looking at a summary of the data frame you want to run a multiple regression on is a good starting point. You'll get a better idea of what the data is like and the range of each of the input variables. You can also look at the correlations in the data using R's cor() function and see if there are any strong correlations between any of the variables. This establishes some idea of the bivariate relationships in the data set, but unfortunately those correlations don't take into account all of the other variables we will be using. The nice thing about a multiple regression is that it gives you a cleaner picture of how the variables relate by taking them all into account.

I'll start off with some starter code that gets all of the data into a data frame, and then we can look into the data itself using the techniques I've described.


# load ggplot to use later on to make some simple plots
> library(ggplot2)

# read the babies data into a data frame and give the columns simpler names
# (adjust the read.table call to match wherever your copy of babies.txt lives)
> baby.data <- read.table("babies.txt", header = TRUE)
> colnames(baby.data) <- c("birth_weight", "gestation", "parity", "age", "height", "weight", "smoke")

Now we've got a data frame with the data about baby births and their mothers, and we can begin poking around at the data to see if we can find any interesting patterns. R's summary() function is a good place to start.


> summary(baby.data)
birth_weight gestation parity age
Min. : 55.0 Min. :148.0 Min. :0.0000 Min. :15.00
1st Qu.:108.0 1st Qu.:272.0 1st Qu.:0.0000 1st Qu.:23.00
Median :120.0 Median :280.0 Median :0.0000 Median :26.00
Mean :119.5 Mean :279.1 Mean :0.2624 Mean :27.23
3rd Qu.:131.0 3rd Qu.:288.0 3rd Qu.:1.0000 3rd Qu.:31.00
Max. :176.0 Max. :353.0 Max. :1.0000 Max. :45.00
height weight smoke
Min. :53.00 Min. : 87.0 Min. :0.000
1st Qu.:62.00 1st Qu.:114.2 1st Qu.:0.000
Median :64.00 Median :125.0 Median :0.000
Mean :64.05 Mean :128.5 Mean :0.391
3rd Qu.:66.00 3rd Qu.:139.0 3rd Qu.:1.000
Max. :72.00 Max. :250.0 Max. :1.000

From this we see a few interesting aspects of the data. The smoke and parity values are binary variables, so it may be interesting to split the data set into two subsets at some point (a quick sketch of that is below). There isn't much variance in height, but the birth_weight of the baby, the weight of the mother, and the gestation period all have somewhat large ranges.
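
One simple way to do that split is with base R's subset() function. This is just a minimal sketch; it assumes nothing beyond the columns shown in the summary above.

# split the data into smokers and non-smokers using the binary smoke column
> smokers <- subset(baby.data, smoke == 1)
> non.smokers <- subset(baby.data, smoke == 0)

# compare the average birth weight between the two groups
> mean(smokers$birth_weight)
> mean(non.smokers$birth_weight)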

Next, we can take a look at the correlations between the variables and see if there are any interesting relationships. You can use R's cor() function for this.


> cor(baby.data)
birth_weight gestation parity age height
birth_weight 1.00000000 0.40754279 -0.043908173 0.026982911 0.203704177
gestation 0.40754279 1.00000000 0.080916029 -0.053424774 0.070469902
parity -0.04390817 0.08091603 1.000000000 -0.351040648 0.043543487
age 0.02698291 -0.05342477 -0.351040648 1.000000000 -0.006452846
height 0.20370418 0.07046990 0.043543487 -0.006452846 1.000000000
weight 0.15592327 0.02365494 -0.096362092 0.147322111 0.435287428
smoke -0.24679951 -0.06026684 -0.009598971 -0.067771942 0.017506595
weight smoke
birth_weight 0.15592327 -0.246799515
gestation 0.02365494 -0.060266842
parity -0.09636209 -0.009598971
age 0.14732211 -0.067771942
height 0.43528743 0.017506595
weight 1.00000000 -0.060281396
smoke -0.06028140 1.000000000

There aren't too many strong correlations in this data set. A few that stick out are the baby's birth weight and smoking, the mother's height and weight, and the gestation period and birth weight. Nothing really seems out of the ordinary here, and most of these correlations make sense. The only negative correlation that stands out is between smoking and birth weight.
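
If you mainly care about how everything relates to the variable you want to predict, you can also pull out just the birth_weight column of the correlation matrix and sort it. A small sketch:

# correlations with birth_weight only, ordered from lowest to highest
> sort(cor(baby.data)[, "birth_weight"])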

The next step to get an even more in-depth understanding of the data is to start plotting some of the variables and looking for linear relationships. Visualizing the data definitely gives a better perspective of what you are working with and can expose trends that numbers alone can't show. Since smoking and birth weight had a decent correlation we'll start by plotting them.


> ggplot(baby.data, aes(x = birth_weight, y = smoke)) + geom_point()
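
Because smoke only takes the values 0 and 1, the points in this scatter plot pile up into two horizontal bands, which can make it a little hard to read. A box plot of birth weight grouped by the smoke value is one alternative view; here is a minimal sketch using the same columns.

# compare the birth weight distributions for smokers and non-smokers
> ggplot(baby.data, aes(x = factor(smoke), y = birth_weight)) + geom_boxplot()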

Either way, the main thing that sticks out is that when the mother isn't smoking the birth weight tends to be a little bit higher. Plotting gestation and birth weight might be more interesting because there is a stronger correlation.


> ggplot(baby.data, aes(x = gestation, y = birth_weight)) + geom_point()


This plot is much more revealing and you can clearly see a linear relationship between the two variables. You can use ggplot2's geom_smooth() function to add a fitted line to the data and see the relationship a little better. The method argument is set to 'lm', so a linear model is used to make the fitted line; there are several other smoothing methods available as well. The se argument controls whether the standard error band around the fitted line is drawn.


> ggplot(baby.data, aes(x = gestation, y = birth_weight)) +
    geom_point() +
    geom_smooth(method = 'lm', se = FALSE)


The fitted line doesn't reveal a great deal, mostly because the correlation between the two variables isn't incredibly strong, but there is still clearly a linear relationship between them. By incorporating more input variables into the regression we will be able to make better predictions.
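
If you do want to see the standard error band mentioned above, the only change needed is flipping the se argument to TRUE.

# the same plot, but with the error band around the fitted line drawn
> ggplot(baby.data, aes(x = gestation, y = birth_weight)) +
    geom_point() +
    geom_smooth(method = 'lm', se = TRUE)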

Another technique to get more information about the data is to make some density plots of the variables you are working with. Here is a simple one for birth weight.


> ggplot(baby.data, aes(x = birth_weight)) + geom_density()


This is a nice way to visualize the data. From this graph you can see that birth weight roughly follows a normal distribution. If that wasn't the case it might be helpful to take the log of the variable, using R's log() function, and see how that turns out. That is another topic I will write about later, when I deal with scaling and normalizing data.
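
For example, if the distribution had been heavily skewed, a quick check is to plot the density of the logged variable and see whether it looks more normal. A small sketch:

# density plot of the log of birth weight, useful when the raw variable is skewed
> ggplot(baby.data, aes(x = log(birth_weight))) + geom_density()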

These are just a few of the techniques you can use to look into the data you are working with before trying to build models with it. Getting to know the data and the relationships within it can make it much easier to create models that accurately predict values later on.

Multiple Regressions

Now that you have analyzed the data and understand some of the relationships within the data set, it's time to create a model and begin making predictions. You will be using the same lm() function that is used for univariate regressions, except the formula will be modified to include the rest of the input variables. Let's start off by trying to predict the birth weight of a baby given all of the other input variables.


> baby.model <- lm(BirthWeight ~ Gestation + Parity + Age + Height + Weight + Smoke, data = baby.data)
> summary(baby.model)

Call:
lm(formula = BirthWeight ~ Gestation + Parity + Age + Height +
Weight + Smoke, data = baby.data)

Residuals:
Min 1Q Median 3Q Max
-57.613 -10.189 -0.135 9.683 51.713

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -80.41085 14.34657 -5.605 2.60e-08 ***
Gestation 0.44398 0.02910 15.258 < 2e-16 ***
Parity -3.32720 1.12895 -2.947 0.00327 **
Age -0.00895 0.08582 -0.104 0.91696
Height 1.15402 0.20502 5.629 2.27e-08 ***
Weight 0.05017 0.02524 1.987 0.04711 *
Smoke -8.40073 0.95382 -8.807 < 2e-16 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.83 on 1167 degrees of freedom
Multiple R-squared: 0.258, Adjusted R-squared: 0.2541
F-statistic: 67.61 on 6 and 1167 DF, p-value: < 2.2e-16

Now you should have a linear model that incorporates multiple input variables. In a later post I will go over what all of this output means, but I'll also put some links at the bottom that explain some of the topics. The next post will go over how well the model performs and how much error it has, along with some ways to minimize that error.
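
As a small preview of measuring that error, here is a sketch that computes the root mean squared error of this model straight from its residuals (the mean of the squared residuals, then the square root), the same way it was done at the end of the previous post.

# root mean squared error of the multiple regression model
> rmsError <- sqrt(mean(baby.model$residuals^2))
> rmsError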

Helpful Reading

Resources for Statistics/Machine Learning That Are Good For Programmers

Statistics is coming up more and more in programming situations across all fields of study. Knowing a bit of stats can really help you solve problems with big data sets and enable you to create some interesting features for products. Especially with all the large data sets being created by social media and other online services, knowing some statistics can help immensely when trying to create innovative products.

The field of machine learning has been taking off as well, and it is being used across all academic disciplines as a way to learn more about complex systems. This past semester I have been doing some work with machine learning, and here are some of the resources I've found helpful in my studies.

Books

Free Courses

Blogs/Websites


There are loads more resources out there, but these are the ones that have helped me out the most over the past few months.


Linear Models In R

Now that you have some data imported into R and understand some of the basic features for summarizing and viewing that data, you can start to dig into some of the features that make R truly stand out as a programming language for statistics. The nature of the babies data makes it an excellent choice for trying out some of the regression features that R offers.

Basically we want to predict numbers using a model we have created from a training set of data. You have one variable that you would like to predict, and one or more explanatory variables you use to predict that final value. This post will go over simple univariate regression, but R is perfectly capable of multivariate linear regression as well.

Basics with Regression

The main goals of regression analysis are:

  • Prediction, e.g. predicting housing prices using characteristics of the houses
  • Assessing the effect of a variable on other variables, e.g. assessing the impact on an industry of raw material price increases
  • Describing a set of data

When using a linear regression the number one assumption is that the data roughly follows a linear relationship between the input variables/predictors (x) and the response (y). If there is more than one predictor it is a multivariate regression; otherwise it is a simple regression. The response variable must be continuous, but the predictors can be discrete, continuous, or categorical.

You may have had to plot a fitted line on a scatter plot at some point in your early education, as a way to predict the value of something with a line that passes through the bulk of the points. Drawing a line by eye is good enough for most day-to-day things, but if you want accuracy in your predictions and the ability to handle more complex models, some math is going to be needed. The general form of a regression line is

y = Θ0 + Θ1(x1)

Θ0 represents the y-intercept of the line and Θ1 the slope.

For simple data it isn't that hard to just draw a line where it appears to fit best, but for accurate predictions you'll want to turn to some math. And if there are more variables you want to keep track of, the equation gets much larger and more complex.

To fit a line to a collection of points, most linear models minimize the sum of the squares of the model's residuals. A residual is the vertical distance between an observed point and the fitted line, so the residuals can be seen as a way to measure how well a regression fits the data. If the sum of the squared residuals is large, the line is not fitting the data very well, and if it is enormous it may mean that the data isn't suited to linear regression at all.
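
To make that concrete, here is a tiny sketch with some made-up points and a hypothetical candidate line y = 1 + 2x. The sum of squared residuals computed at the end is exactly the quantity a least squares fit tries to minimize.

# made-up data and a candidate line y = 1 + 2*x
> x <- c(1, 2, 3, 4)
> y <- c(3.2, 4.8, 7.1, 9.2)
> predicted <- 1 + 2 * x
# residuals: the vertical distances between the observed points and the line
> resid.values <- y - predicted
# the sum of squared residuals that least squares tries to minimize
> sum(resid.values^2)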

For now I will go over how to make a simple one variable regression in R using the babies.txt data, and afterwards another model using all of the provided data.

If you have moved away from the initial R prompt, here is a script with all of the code we went through in the past post.

You can load it into R by calling source() while R is running in the same directory as the source file. All of the variables created in the script will then be available for your use.


> source("Babies.R")

All of the variables should be loaded into the interpreter.

Creating Linear Models

R makes it easy to create linear models for data sets.

First let's see what the data we have looks like.


> head(baby.data)
BirthWeight Gestation Parity Age Height Weight Smoke
1 120 284 0 27 62 100 0
2 113 282 0 33 64 135 0
3 128 279 0 28 64 115 1
5 108 282 0 23 67 125 1
6 136 286 0 25 62 93 0
7 138 244 0 33 62 178 0

Next we will split up the baby.data data frame from the previous post into a training set and a test set.

# function to split a data frame into two parts for training and testing
# arguments are the input data frame and a seed for the random split
# this is also the basic structure for creating a function in R
> splitdf <- function(dataframe, seed=NULL) {
    if (!is.null(seed)) set.seed(seed)
    # take an index from 1 to the number of rows in the data frame
    index <- 1:nrow(dataframe)
    # randomly pick half of the index values for the training set
    trainindex <- sample(index, trunc(length(index)/2))
    # assign the training and test sets
    trainset <- dataframe[trainindex, ]
    testset <- dataframe[-trainindex, ]
    list(trainset=trainset, testset=testset)
  }
> two.sets <- splitdf(baby.data, seed=NULL)
> summary(two.sets[1])
> summary(two.sets[2])
> train.data <- two.sets$trainset
> test.data <- two.sets$testset

Making a linear model for birth weight and gestation period would be interesting. We'll need the lm() function that R provides to do that.


# create a linear model to predict birth weight given the gestation period
# the first argument is a formula: the variable to predict, then a ~, then the variable(s) used to predict it
# set data = to the training set created earlier
> baby.model <- lm(formula = BirthWeight ~ Gestation, data = train.data)
# summary info about the new model
> summary(baby.model)

Residuals:

Min 1Q Median 3Q Max
-49.348 -11.065 0.218 10.101 57.704

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -10.75414 8.53693 -1.26 0.208
Gestation 0.46656 0.03054 15.28 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 16.74 on 1172 degrees of freedom
Multiple R-squared: 0.1661, Adjusted R-squared: 0.1654
F-statistic: 233.4 on 1 and 1172 DF, p-value: < 2.2e-16

# to display the coefficients used in the model created
> coefficients(baby.model)
(Intercept) Gestation
-10.7541389 0.4665569

Next we will want to generate predictions from the linear model we've created, this time for the test set. This is done using R's predict() function, which generates predicted values from the model for whatever new data you pass it.


# create predictions (the fitted line) for the test data using the model
> lm.fit <- predict(baby.model, newdata = test.data)

A helpful set of numbers when dealing with linear models is the residuals.


# the residuals are the differences between the original labels and the model's fitted values
# (so this difference should be essentially zero on the training data)
> summary((train.data$BirthWeight - fitted(baby.model)) - baby.model$residuals)

To see how well the linear model fits the data you can find the root mean squared error. You do this by taking the mean of the squared residuals (the distances of the data points above or below the regression line) and then taking its square root.


# find the root mean squared error to see the sample standard deviation of the residuals
> rmsError <- sqrt(sum(baby.model$residuals * baby.model$residuals)/length(baby.model$residuals))
> print("rmsError = ")
[1] "rmsError = "
> rmsError
[1] 17.15356
# the RMS error is around 17, so on average the prediction will be off by around 17 ounces of birth weight, i.e. within a pound or so

As you can see, R makes it very easy to create basic linear models for data and then use them to see how predictions can be made. In the next part of this series I will go over incorporating more variables into the linear model, so the predictions can be even better.

Here is the entire script created in this post. Make sure the Babies.R source file is in the same directory as this one if you want it to run correctly, or otherwise change the path in the argument to source().


source("Babies.R")
head(baby.data)
# input data frame and seed for random value to split up data frame
splitdf <- function(dataframe, seed=NULL) {
    if (!is.null(seed)) set.seed(seed)
  #take index from 1 to the length of the data frame
    index <- 1:nrow(dataframe)
  #find training set index
    trainindex <- sample(index, trunc(length(index)/2))
  # assign training and test set
    trainset <- dataframe[trainindex, ]
    testset <- dataframe[-trainindex, ]
    list(trainset=trainset,testset=testset)
}

two.sets <- splitdf(baby.data, seed=NULL)

train.data <- two.sets$trainset
test.data <- two.sets$testset

#create a linear model of the baby data trying to predict birth weight given other factors
baby.model <- lm(formula = BirthWeight ~ Gestation, data = train.data)

#show coefficients for the model
coefficients(baby.model)

# create a linear fit line for the model using the test data
lm.fit <- predict(baby.model, newdata = test.data)

# the residuals are the differences between the original labels and the model's fitted values
summary((train.data$BirthWeight - fitted(baby.model)) - baby.model$residuals)

# find the root mean squared error to see the sample standard deviation
rmsError <- sqrt(sum(baby.model$residuals * baby.model$residuals)/length(baby.model$residuals))

print("rmsError = ")
rmsError

# the RMS error comes out to around 17 in the run above, so on average the prediction will be off by around 17 ounces of birth weight

Helpful Links