Building an Environment Monitor for my Apartment

We moved to a new apartment this fall and I wanted to investigate ways to monitor air quality and humidity. Our new place has radiator heat that we don’t have much control over. As a result, the air is incredibly dry and goes through intense temperature swings daily. The plants and my skin weren’t doing the best in the desert-like environment. In short, the air temperature/humidity levels are unpredictable.

Because I’m crazy, I decided to assemble some hardware to detect various air quality metrics. Then I built a website to track this data:

I just bought a handful of I2C sensors and thought logging a bunch of different air quality metrics would be fun. One of the goals was to try and figure out whether the radiators are set on some sort of timer/schedule or whether they respond to temperature changes in some sort of PID system. Additionally, I wanted to see what effect a humidifier might have on air quality and whether my plants decrease CO2 levels throughout the day.

Here it is during construction

Monitor Setup

The main monitor was built using a Raspberry Pi. It has a bunch of sensors connected via I2C and an analog-to-digital converter. I set up a cron job to run a Python script that reads the sensor values for 10-15 seconds and calculates an average value for each sensor. The script then sends this data to an API I wrote using Laravel. The Laravel site shows the current sensor readings and a chart of CO2 levels over time.
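The core of that cron script is just sampling and averaging. Here's a rough Python sketch of the idea; the sensor read function, field names, and timings are stand-ins for illustration, not the actual script:

```python
import statistics
import time

def average_reading(read_fn, duration_s=10, interval_s=0.5):
    # poll one sensor for duration_s seconds, then average the samples
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append(read_fn())
        time.sleep(interval_s)
    return statistics.mean(samples)

def build_payload(readings):
    # shape the averaged readings for the logging API
    # (field names here are made up, not the real endpoint's schema)
    return {"recorded_at": int(time.time()), **readings}
```

The payload dict would then be POSTed to the Laravel endpoint with an HTTP client.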

There is a tiny OLED display on the main board that gives you a readout of all the levels.

Basic diagram showing all system parts

Air Quality Metrics

I got a little carried away on Adafruit/eBay and ended up with a ton of different sensors. Right now the system can sense:

  • CO2 Levels
  • Volatile Organic Compounds (plastics, combustible stuff, certain household cleaners, other nasty stuff)
  • Dust levels
  • Temperature
  • Humidity (this is a tricky one to get accurately for various reasons)
  • Relative light levels


I built a simple Laravel site that has an API endpoint for logging sensor values. There is a single page that loads a basic listing of current values and a chart showing CO2 levels and other gathered data.

Future Plans

Some future plans:

  • Move the project off the breadboard to some perfboard
  • Add more charts to the website
  • Run some sensors outside for humidity/temp
  • Set up alerts for when levels reach a certain threshold

Building an 8-bit Computer From Scratch: Part 1 of ?

It was really cold this winter in Chicago. For a few days, the wind chill was around -50F, which is pretty crazy. To pass the time, I decided to start a new electronics hardware project. The past year or two I’ve mostly been doing hardware projects in my spare time because I never have the chance to do any at work. This project is going to take a long time.

Inspiration for this project came from an amazing YouTube channel by Ben Eater. One of the big series on his channel is a step-by-step guide to building a computer using logic integrated circuits. He goes through all the steps needed to build the CPU clock, registers, arithmetic logic unit, system bus, RAM, ROM, logic for displaying numbers, and loads of other neat stuff. Ben’s personal website has a full list of parts you can get online and circuit schematics for each of the modules.

Computer Design

The basic structure of the planned computer is pretty simple, so it isn’t able to do much. In the end, I want it to add/subtract 8-bit numbers and use two’s complement to work with negative integers. It will have about 5 or 6 assembly instructions and be able to run programs that are around 14 instructions long.

Saying it is an 8-bit computer is somewhat misleading because the memory address space will only be 4-6 bits depending on how things are built. The clock speed maxes out at a few hertz so you’re not going to be calculating too fast.

At this point, I have a pretty basic binary adder/subtractor built. It has 5 modules and can add two 8-bit numbers, and that is about it:

  • System Clock
  • A Register
  • B Register
  • Instruction Register
  • Arithmetic Logic Unit

System Clock

The system clock is made up of a few 555 timer chips set up in their various operating states (astable, monostable, and bistable). They blink an LED at a rate that is adjustable with a potentiometer. A simple latch built with some logic chips lets the clock run in two modes:

  • Auto-mode (clock signal runs over and over again at a set rate)
  • Manual mode (clock signal sent every time you click a button)

It runs at a few hertz. You can overclock it by turning the pot to adjust the timer circuit…
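For reference, an astable 555's rate follows the standard approximation f = 1.44 / ((R1 + 2·R2) · C), so you can estimate what turning the pot does. A quick sketch; the component values here are made up for illustration, not the ones on my board:

```python
def astable_frequency_hz(r1_ohms, r2_ohms, c_farads):
    # standard 555 astable approximation: f = 1.44 / ((R1 + 2*R2) * C)
    return 1.44 / ((r1_ohms + 2.0 * r2_ohms) * c_farads)

# e.g. R1 = 1 kOhm, a 100 kOhm pot as R2, and a 10 uF timing cap:
# pot at max resistance gives well under a hertz
slow = astable_frequency_hz(1_000, 100_000, 10e-6)
# dialing the pot down speeds the clock up dramatically
fast = astable_frequency_hz(1_000, 1_000, 10e-6)
```

Swapping the timing capacitor shifts the whole range, which is the usual way to "overclock" this kind of circuit beyond what the pot allows.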

Auto/manual system clock built using 555 timer chips

Here’s the schematic from Ben’s website:

System clock circuit diagram

The Registers

The registers are built using two 4-bit flip-flop chips to hold the values loaded onto them. A load signal and a clock signal are fed in to store values, and you can read them back by toggling a read/write pin. These are SN74LS173 chips from Texas Instruments, one of many TI logic chips used in the build.

Each register holds an 8-bit value you can load onto it.

I built A and B registers to load up values to add/subtract, and an instruction register to store instructions/memory addresses. The instruction register hasn’t been used yet because I don’t have anything to control CPU logic yet.

Register circuit diagram
This is one of the registers hooked up to the system clock. It can read/write values to the yellow LEDs

Arithmetic Logic Unit

The ALU can add/subtract two 8-bit numbers. It makes use of two’s complement to handle negative values. Normally a computer’s ALU would handle some bitwise operations too, but this one is only going to add/subtract.
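If you want to see why the add/subtract trick works, here's a little Python sketch of the concept: subtracting is just adding the bitwise inverse of B plus one, with everything wrapped to 8 bits. This mirrors the idea, not the exact chip wiring:

```python
def to_byte(n):
    # wrap an integer into an unsigned 8-bit value, like an 8-bit bus would
    return n & 0xFF

def to_signed(byte):
    # interpret an 8-bit value as a two's complement signed number
    return byte - 256 if byte >= 128 else byte

def alu(a, b, subtract=False):
    # subtraction is addition with B inverted plus a carry-in of 1,
    # which is the two's complement trick the hardware uses
    if subtract:
        b = to_byte(~b) + 1
    return to_byte(a + b)

print(to_signed(alu(5, 12, subtract=True)))  # -7
```

In hardware the inversion is typically done by XORing each bit of B with the subtract line, so the same adder chips handle both operations.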

ALU circuit diagram
ALU connected to the A and B registers


There is still a ton of work left to do to get this working decently. I need to work on the system bus, hook up RAM/ROM, and build something to keep track of CPU instructions. Memory management will also be kind of difficult. I’m planning out the next steps now. I want to try some stuff outside of Ben’s plans but am not sure what yet.

Here is everything hooked up in all its messy glory:

Clock/ALU/A and B register all hooked up writing to the bus. Not pictured is the instruction register

Finding Plaid Shirts with the Amazon Rekognition API

Is that flannel you are wearing?

I’m pretty sure I’ve been bitten by the machine learning bug! The past few weeks, I’ve had the opportunity to work with Amazon Rekognition. It’s a newfangled deep-learning image recognition API that is part of AWS, and it’s been fun to play around with. You feed it images and it will send back attempts to detect objects, faces, text and other things you’d want to detect. No need to train your own model and run all sorts of specialized software. Just sign up for AWS, set up a client on your machine and start sending the API images to analyze. It’ll take you around 30 minutes to set up a simple proof of concept and get an idea of the API’s features.

What is Rekognition?

First, let’s go over a little bit about what Rekognition is for those who aren’t familiar. Rekognition is an API for deep-learning based image and video analysis. You send it photos or video and it can identify objects, people, faces, scenes, text and other stuff. Rekognition’s deep-learning algorithm will attempt to label objects in the image.

There are four types of labelling currently supported.

I was blown away by how many objects it could label and the granularity of its classifications. My expectations of the API’s accuracy were low initially, but I was quickly proven wrong. For instance, the API is able to distinguish different breeds of dogs. It knows there’s a difference between a dung beetle and a cockroach. It is also great at finding faces and labeling the parts of a face: nose, eye, eyebrow and pupil location are just a few. There was a bit of uncertainty when trying to label emotions. For some reason, it always set my emotion as ‘confused’? As time goes by it will only get better at identification. One thing it never fails to label is flannel/plaid. If there is plaid in an image, Rekognition will label it like there is no tomorrow.

It can also analyze streaming video for faces in real time. I haven’t tried video yet but at work we have an AWS DeepLens preordered. It has specialized hardware for deep learning and will be able to use custom detection models.

Let’s Start Tinkering

It is easy to start tinkering with Rekognition. We’ll use the AWS CLI and an S3 bucket to get started. We will upload images to the S3 bucket and pass them to the API via CLI. When the API is done processing the image it will return a string of JSON.

To begin, we will setup a simple environment to send images to the API.

  1. Setup the AWS CLI
  2. Create a S3 bucket for the images to be labeled
  3. Upload those images
  4. Use CLI to run Rekognition on bucket images

If you don’t have the CLI set up, here are some third-party guides; the AWS docs aren’t known for their quality.

Next, you’ll need to create an S3 bucket with public read permissions. I used the GUI on the AWS console to make one in a region close to me, in this case us-east-1. Take note of your bucket’s region because Rekognition needs it to find the right image. Once the bucket was ready, I uploaded a few images to the bucket and made sure they were publicly accessible.

Look at that tasty pic

woman eating chicken wings with face covered in hot sauce

We will use cutting edge technology to analyze this image

# The base CLI command for Rekognition is: aws rekognition
# To detect labels in the image use: aws rekognition detect-labels
# We need to specify an S3 bucket and the proper region
# The bucket is described with escaped JSON
# The region uses the same shortnames used across AWS (e.g. us-east-1)
# This will return JSON describing the image
aws rekognition detect-labels --image "{\"S3Object\":{\"Bucket\":\"tinker-bucket\",\"Name\":\"wing-woman.jpg\"}}" --region us-east-1

Here’s a sample of the JSON that gets returned. Some of the results are pretty funny:

"Labels": [
    { "Name": "Human", "Confidence": 99.27762603759766 },
    { "Name": "Corn", "Confidence": 94.26850891113281 },
    { "Name": "Flora", "Confidence": 94.26850891113281 },
    { "Name": "Grain", "Confidence": 94.26850891113281 }
]

Here’s a command to use face detection instead of object detection. Face detection mode returns an array with entries for each face in the image. It can detect up to 100 faces per image. For each face it returns an array of potential emotions too. It seems like the emotion detection is hit or miss at times but it is still really good.
# Note the --attributes ALL argument at the end
# Without this the array of emotions wouldn't be returned
aws rekognition detect-faces --image "{\"S3Object\":{\"Bucket\":\"secrete-bucket\",\"Name\":\"wing-woman.jpg\"}}" --region us-east-1 --attributes ALL

// the result has been shortened
"FaceDetails": [
    {
        "BoundingBox": {
            "Width": 0.3016826808452606,
            "Height": 0.46822741627693176,
            "Left": 0.359375,
            "Top": 0.15793386101722717
        },
        "AgeRange": { "Low": 23, "High": 38 },
        "Smile": { "Value": false, "Confidence": 74.72581481933594 },
        "Eyeglasses": { "Value": false, "Confidence": 50.551666259765625 },
        "Emotions": [
            { "Type": "HAPPY", "Confidence": 38.40011215209961 },
            { "Type": "SAD", "Confidence": 3.1377792358398438 },
            { "Type": "DISGUSTED", "Confidence": 1.5140950679779053 }
        ],
        "Landmarks": [
            { "Type": "eyeLeft", "X": 0.4536619782447815, "Y": 0.3465670645236969 },
            { "Type": "eyeRight", "X": 0.5664145946502686, "Y": 0.3220127522945404 }
        ]
    }
]

It returned a bounding box for the person’s face, a potential age range, whether they have glasses or not, an array of potential emotions and coordinates for facial landmarks like the person’s eyes.

One thing to note is the coordinates for landmarks are formatted as decimals between 0.0 and 1.0. To get values in pixels, multiply ‘X’ coordinates by the source image’s width or ‘Y’ coordinates by the height.
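For example, a small helper to do that conversion; the image dimensions here are made up:

```python
def landmark_to_pixels(landmark, image_width, image_height):
    # Rekognition landmark coordinates are ratios of the image dimensions,
    # so scaling by width/height gives pixel positions
    return (round(landmark["X"] * image_width),
            round(landmark["Y"] * image_height))

# the left eye from the response above, on a hypothetical 1200x800 image
eye_left = {"Type": "eyeLeft", "X": 0.4536619782447815, "Y": 0.3465670645236969}
print(landmark_to_pixels(eye_left, 1200, 800))  # (544, 277)
```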

Pretty Neat

Rekognition is pretty impressive considering how simple it is to set up and start using. The CLI makes experimenting easy, although there are some parts of the API you can’t use with it. For instance, you are stuck uploading images to S3 whenever you want them processed. Using an SDK gives you more control over the API and lets you integrate it seamlessly with applications you are writing. I have been using the Python SDK Boto3 and have been very pleased. It has methods for pretty much any AWS product. At some point I’ll post about using it to alter S3 buckets.
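For a taste of what that looks like, here's a rough Boto3 sketch of the same detect-labels call we made with the CLI. The bucket/key are the example names from earlier, you'd need AWS credentials configured for the actual call, and the helper at the top works without any account:

```python
def s3_image(bucket, key):
    # build the Image argument Rekognition expects for an S3-hosted photo
    return {"S3Object": {"Bucket": bucket, "Name": key}}

def detect_labels(bucket, key, region="us-east-1", min_confidence=80):
    # requires AWS credentials; boto3 is imported lazily so the helper
    # above stays usable without an AWS account
    import boto3
    client = boto3.client("rekognition", region_name=region)
    resp = client.detect_labels(Image=s3_image(bucket, key),
                                MinConfidence=min_confidence)
    return [(label["Name"], label["Confidence"]) for label in resp["Labels"]]

# e.g. detect_labels("tinker-bucket", "wing-woman.jpg")
```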

The Apollo Guidance Computer Part 1

Recently, I have been getting interested in old computers, especially ones with low amounts of computing power. The Apollo Guidance Computer is the one I’ve been reading about the most. It is pretty amazing how it was designed and how it was able to guide a rocket all the way to the moon and back. It was one of the first computers to make heavy use of integrated circuits. Some of the hardware used was really unique and embodies some great methods in computer design.


Linear Models in R (Part 2) – Data Analysis and Multiple Regressions

Multiple linear regression follows the same ideas as univariate regression, but instead of using one variable to predict an outcome, we will be using several. So there will be one dependent variable and multiple independent ones. By adding more independent variables you can make more precise predictions and model more complex systems.

The general formula looks like this:

y = Θ0 + Θ1(x1) + Θ2(x2) + Θ3(x3) + Θ4(x4)

Θ0 is the intercept and Θ1, Θ2 and so on are the slopes, similar to the way the univariate regression worked. They are also known as the regression coefficients.
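Plugging numbers into that formula is straightforward: the prediction is just the intercept plus each coefficient times its variable. A tiny Python sketch with made-up coefficients:

```python
def predict(thetas, xs):
    # y = theta0 + theta1*x1 + theta2*x2 + ...
    # thetas[0] is the intercept, the rest pair up with the xs
    return thetas[0] + sum(t * x for t, x in zip(thetas[1:], xs))

print(predict([2.0, 0.5, -1.0], [4.0, 3.0]))  # 2 + 2 - 3 = 1.0
```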

Investigating Data Before Creating Models

Initially, looking at a summary of the data frame you want to run a multiple regression on is a good starting point. You’ll get a better idea of what the data is like and the range of each of the input variables. You can also look into the correlations in the data using R’s cor() function and see if there are any strong correlations between any of the variables. This establishes some idea of the bivariate relationships in the data set, but unfortunately those don’t take into account all of the other variables we will be using. The great thing about a multiple regression is that it gives a purer representation of the correlations between the variables by taking all of them into account.

I’ll just start off with some starter code that will get all of the data into a data frame and then we can start looking into the data itself some more using the techniques I’ve described.

# load ggplot to use later on to make some simple plots
> library(ggplot2)

> baby.data <- read.table("babies.txt", header = TRUE)
> colnames(baby.data) <- c("birth_weight", "gestation", "parity", "age", "height", "weight", "smoke")

Now we’ve got a data frame with the data about baby births and their mothers, and we can begin poking around at the data to see if we can find any interesting patterns. R’s summary() function is a good place to start.

> summary(baby.data)
birth_weight gestation parity age
Min. : 55.0 Min. :148.0 Min. :0.0000 Min. :15.00
1st Qu.:108.0 1st Qu.:272.0 1st Qu.:0.0000 1st Qu.:23.00
Median :120.0 Median :280.0 Median :0.0000 Median :26.00
Mean :119.5 Mean :279.1 Mean :0.2624 Mean :27.23
3rd Qu.:131.0 3rd Qu.:288.0 3rd Qu.:1.0000 3rd Qu.:31.00
Max. :176.0 Max. :353.0 Max. :1.0000 Max. :45.00
height weight smoke
Min. :53.00 Min. : 87.0 Min. :0.000
1st Qu.:62.00 1st Qu.:114.2 1st Qu.:0.000
Median :64.00 Median :125.0 Median :0.000
Mean :64.05 Mean :128.5 Mean :0.391
3rd Qu.:66.00 3rd Qu.:139.0 3rd Qu.:1.000
Max. :72.00 Max. :250.0 Max. :1.000

From this we see a few interesting aspects of the data. The smoking and parity values are binary variables, so it may be interesting to split the data set into two subsets at some point. There isn’t much variance in height, but the birth_weight of the baby, the weight of the mother, and the gestation period all have somewhat large ranges.

Next, we can take a look at the correlations between the variables and see if there are any interesting relationships. You can use R’s cor() function for this.

> cor(baby.data)
birth_weight gestation parity age height
birth_weight 1.00000000 0.40754279 -0.043908173 0.026982911 0.203704177
gestation 0.40754279 1.00000000 0.080916029 -0.053424774 0.070469902
parity -0.04390817 0.08091603 1.000000000 -0.351040648 0.043543487
age 0.02698291 -0.05342477 -0.351040648 1.000000000 -0.006452846
height 0.20370418 0.07046990 0.043543487 -0.006452846 1.000000000
weight 0.15592327 0.02365494 -0.096362092 0.147322111 0.435287428
smoke -0.24679951 -0.06026684 -0.009598971 -0.067771942 0.017506595
weight smoke
birth_weight 0.15592327 -0.246799515
gestation 0.02365494 -0.060266842
parity -0.09636209 -0.009598971
age 0.14732211 -0.067771942
height 0.43528743 0.017506595
weight 1.00000000 -0.060281396
smoke -0.06028140 1.000000000

There aren’t too many strong correlations in this data set; a few that stick out are the baby’s birth weight and smoking, the mother’s height and weight, and the gestation period and birth weight. Nothing really seems out of the ordinary, and most of these correlations make sense. The only negative correlation that sticks out is between smoking and birth weight.

The next step to get an even deeper understanding of the data is to start plotting some of the variables and looking for linear relationships. Visualizing the data definitely gives a better perspective of what you are working with and can expose trends that numbers alone can’t show. Since smoking and birth weight had a decent correlation, we’ll start by plotting them.

> ggplot(baby.data, aes(x = birth_weight, y = smoke)) + geom_point()

The only thing that sticks out from this graph is that if the mother isn’t smoking the birth weight will be a little bit higher. Plotting gestation and birth weight might be more interesting because there is a stronger correlation.

> ggplot(baby.data, aes(x = gestation, y = birth_weight)) + geom_point()

This plot is much more revealing, and you can clearly see a linear relationship between the two variables. You can use ggplot’s geom_smooth() function to plot a fitted line to the data and see the relationship a little bit better. The method argument is set to lm, so it will use a linear model to make the fitted line; there are many other options you can use as well. The se argument decides whether you want to see the standard error band around the fitted line.

> ggplot(baby.data, aes(x = gestation, y = birth_weight)) +
+   geom_point() +
+   geom_smooth(method = 'lm', se = FALSE)

The fitted line doesn’t seem to be too revealing, but that is mostly because the correlation between the two variables wasn’t incredibly strong. There is still clearly a linear relationship between them though. By incorporating more input variables into the regression, better predictions can be made.

Another technique to get more information about the data is to make some density plots of the variables you are working with. Here is a simple one for birth weight.

> ggplot(baby.data, aes(x = birth_weight)) + geom_density()

This is a nice way to visualize the data. From this graph you can see that birth weight follows a normal distribution. If this wasn’t the case, it might be helpful to take the log of the variable and see how that turns out; just use R’s log() function. That is another topic I will write about later, when I deal with scaling and normalizing data.

These are just a few of the techniques you can use to look into the data you are working with before trying to build models with it. Getting to know the data well and any relationships can make it much easier to create models that will help you accurately predict values later on.

Multiple Regressions

Now that you have analyzed the data and understand some of the relationships within the data set, it’s time to create a model and begin to make predictions. You will use the same lm() function that is used for univariate regressions, except the formula will be modified to include the rest of the input variables. Let’s start off by trying to predict the birth weight of a baby given all of the other input variables.

> baby.model <- lm(BirthWeight ~ Gestation + Parity + Age + Height + Weight + Smoke, data = baby.data)
> summary(baby.model)

Call:
lm(formula = BirthWeight ~ Gestation + Parity + Age + Height +
Weight + Smoke, data = baby.data)

Residuals:
Min 1Q Median 3Q Max
-57.613 -10.189 -0.135 9.683 51.713

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -80.41085 14.34657 -5.605 2.60e-08 ***
Gestation 0.44398 0.02910 15.258 < 2e-16 ***
Parity -3.32720 1.12895 -2.947 0.00327 **
Age -0.00895 0.08582 -0.104 0.91696
Height 1.15402 0.20502 5.629 2.27e-08 ***
Weight 0.05017 0.02524 1.987 0.04711 *
Smoke -8.40073 0.95382 -8.807 < 2e-16 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.83 on 1167 degrees of freedom
Multiple R-squared: 0.258, Adjusted R-squared: 0.2541
F-statistic: 67.61 on 6 and 1167 DF, p-value: < 2.2e-16

Now you should have a linear model that incorporates multiple input variables. In a later post I will go over what all of this output means, but I’ll also put some links at the bottom to explain some of the topics. The next post will go over seeing how the model performs, the amount of error, and some ways to minimize that error.

Helpful Reading

Resources for Statistics/Machine Learning That Are Good For Programmers

Statistics are coming up more and more in programming situations across all fields of study. Knowing a bit of stats can really help you solve problems with big data sets and enable you to create some interesting features for products, especially with all the large data sets being created by social media and other online services.

The field of machine learning has been taking off as well, and it is being used across all academic disciplines as a way to learn more about complex systems. The past semester I have been doing some work with machine learning, and here are some of the resources I’ve found helpful in my studies.


Free Courses

Blogs/Websites


There are loads of more resources out there, but these are the ones that have helped me out the most the past few months.


Finding new Ideas from Data Visualization

I came across a nice article that goes over some things to keep in mind when dealing with scientific data and the importance of visualization when programming with large data sets. It got me thinking about how visualizing data can really change your perspective on a problem you are trying to solve. With large data sets it can be daunting trying to find some sort of form within them while also trying to draw some sort of conclusion about the data.

Some of the things that Vince Buffalo stressed seemed kind of obvious at first, but the ideas he was presenting got me thinking about how important it is for humans to have some sort of visualization to better understand a concept. By visualizing the data you get a much better perspective on the shape of the data, and it is much easier to interpret than pure numbers or tables. You can also view multiple variables in one visualization and see how they interact with each other. When working with data the key thing is to understand it, and visualizations can help you learn how the variables in your data are interacting and give insight into the relationships between them.

Right now I am collecting data on the Chicago Public Transit system’s L trains using their API, and I found some interesting things by visualizing the data. Using R, I imported the CTA’s predicted arrival times for each of the stations on one of their lines and then found the average wait time at each stop. The results surprised me and gave me some insight into what types of factors affect train arrival times in the city. For instance, at the downtown stops, where the stations are very close together compared to most other stops, there were some weird patterns: certain stops had much higher average wait times even though the stops were right next to each other.

Later on, as I try to make predictions on arrival and wait times for the CTA, I will definitely take into account population density and the factors that may affect scheduling for trains downtown.

The Monroe stop stood out in the data because the two stops adjacent had much shorter wait times, and the distance between the stops is maybe 1 or 2 minutes while riding. There was also a large difference in wait times depending on which direction the trains were going. This got me thinking about how population density would affect how efficient trains would run, and also the ways the CTA may plan out the frequency of certain train lines depending on population density. For most of the other stops the wait times reflected the distance between the stops, but this trend seemed to change when the stops were in the downtown area. I will write more on this later, but it just goes to show how visualizing data can help you understand patterns in data, and lead you onto new ideas about why the data is taking a certain type of form.

Linear Models In R

Now that you have some data imported into R and understand some of the basic features for summarizing and viewing that data, you can start to dig into some of the features that make R truly stand out as a programming language for statistics. The nature of the babies data makes it an excellent choice for utilizing some of the regression features that R offers.

Basically, we will be predicting numbers using a model created from a training set of data. You have one variable that you would like to predict, and multiple explanatory variables you are using to predict that final value. This post will go over simple univariate regression, but R is perfectly capable of multivariate linear regression as well.

Basics with Regression

The main goals of regression analysis:

  • Predicting (e.g. predicting housing prices using characteristics of houses)
  • Assessing the effect of a variable on other variables (e.g. assessing the impact on an industry of raw material price increases)
  • Giving a description of a set of data

When using a linear regression, the number one assumption is that the data somehow follows a linear relationship between the input variables/predictors (x) and the response (y). If there is more than one predictor, it is a multivariate regression; otherwise it is merely a simple regression. The response variable must be continuous, but the predictors can be discrete, continuous, or categorical. You may have had to plot a fitted line on a scatter plot at some point in your elementary education, as a way to predict the value of something given that the line goes through the majority of the points plotted. Simply drawing a line by eye is good enough for most day to day things, but if you want accuracy in your predictions, and the ability to handle more complex models, some math is going to be used. The general form of a regression line is

y = Θ0 + Θ1(x1)

Θ0 represents the y-intercept of the line and Θ1 the slope.

For simple data it isn’t that hard to just draw a line where it appears to fit best, but for accurate predictions you’ll want to turn towards some math. And if there were more variables you wanted to keep track of, the equation would get much larger and more complex.

To fit a line to a collection of points, most linear models minimize the sum of the squared residuals. Residuals are the distances between the data points and the line (points sit either above or below it), and they can be seen as a way to measure the accuracy of the fit. If the sum is too big, the line is not fitting the data very well; if it is enormous, it may mean the data isn’t right for linear regression at all.
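If you're curious what that math looks like, here is the closed-form least squares fit for a single predictor, sketched in Python since the formula itself is language-agnostic (R's lm() does this, and much more, for you):

```python
def fit_line(xs, ys):
    # closed-form least squares for one predictor:
    # slope = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
    # intercept = mean_y - slope * mean_x
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return intercept, slope

# points lying exactly on y = 1 + 2x recover those coefficients
print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))  # (1.0, 2.0)
```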

For now I will go over how to make a simple one-variable regression in R using the babies.txt data, and then afterwards another model using all of the provided data.

If you have moved away from the initial R prompt, here is a script with all of the code we went through in the past post. You can load it by running source("Babies.R") while R is running in the same directory as the source file. All of the variables created in the script will be available for your use.

Creating Linear Models

R makes it easy to create linear models for data sets.

First, let’s see what the data we have looks like:

BirthWeight Gestation Parity Age Height Weight Smoke
1 120 284 0 27 62 100 0
2 113 282 0 33 64 135 0
3 128 279 0 28 64 115 1
5 108 282 0 23 67 125 1
6 136 286 0 25 62 93 0
7 138 244 0 33 62 178 0

First we will split up the data frame from the previous post into a training and a test set.

# function to split a data frame into two parts for training and testing
# input: a data frame and a seed for the random values used to split it
# this is the basic structure for creating a function in R
> splitdf <- function(dataframe, seed=NULL) {
+     if (!is.null(seed)) set.seed(seed)
+     index <- 1:nrow(dataframe)
+     trainindex <- sample(index, trunc(length(index)/2))
+     trainset <- dataframe[trainindex, ]
+     testset <- dataframe[-trainindex, ]
+     list(trainset=trainset, testset=testset)
+ }
> two.sets <- splitdf(baby.data, seed=NULL)
> train.set <- two.sets$trainset
> test.set <- two.sets$testset
> summary(two.sets[1])
> summary(two.sets[2])

Making a linear model for birth weight and gestation period would be interesting. We’ll need the lm() function that R provides to do that.

# create a linear model to predict birth weight given the gestation period
# arguments are: the variable to predict, then a ~ to denote the variables being used to predict,
# then data= set to the training set from earlier
> baby.model <- lm(BirthWeight ~ Gestation, data = train.set)
# show the coefficients used in the model
> coefficients(baby.model)
# summary info about the new model
> summary(baby.model)


Residuals:
Min 1Q Median 3Q Max
-49.348 -11.065 0.218 10.101 57.704

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -10.75414 8.53693 -1.26 0.208
Gestation 0.46656 0.03054 15.28 <2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 16.74 on 1172 degrees of freedom
Multiple R-squared: 0.1661, Adjusted R-squared: 0.1654
F-statistic: 233.4 on 1 and 1172 DF, p-value: < 2.2e-16

# to display the coefficients used in the model
> coefficients(baby.model)
(Intercept) Gestation
-10.7541389 0.4665569

Next we will want to create a fit line on the linear model we’ve created. This is done using R’s predict() function. It will generate the predicted values for the linear model we are passing it.

# create a linear fit line for the model using the test data
> birth.predict <- predict(baby.model, newdata = test.set)
A helpful set of numbers when dealing with linear models are the residuals.

# Difference between original labels and predictions = residuals
> summary(test.set$BirthWeight - birth.predict)

To see how well the data fits the linear model you can find the root mean squared error. You do this by taking the mean of the squared residuals (the distances of the data points above or below the regression line), then taking its square root.

# find the root mean squared error to see the sample standard deviation
> rmsError <- sqrt(sum(baby.model$residuals^2) / length(baby.model$residuals))
> print("rmsError = ")
[1] "rmsError = "
> rmsError
[1] 17.15356
# the RMS error is around 17, so on average the prediction will be off by around 17 ounces of birth weight. So it will be within a pound or so.

As you can see, R makes it very easy to create basic linear models for data and then use them to make predictions. In the next part of this series I will go over incorporating more variables into the linear model, so predictions can be even better.

Here is the entire script created in this post. Make sure the Babies.R source file is in the same directory as this one if you want to see it run correctly, or otherwise change the path of the argument for source().

# input data frame and seed for random value to split up data frame
splitdf <- function(dataframe, seed=NULL) {
    if (!is.null(seed)) set.seed(seed)
  #take index from 1 to the length of the data frame
    index <- 1:nrow(dataframe)
  #find training set index
    trainindex <- sample(index, trunc(length(index)/2))
  # assign training and test set
    trainset <- dataframe[trainindex, ]
    testset <- dataframe[-trainindex, ]
  # return both halves in a named list
    list(trainset=trainset, testset=testset)
}

two.sets <- splitdf(baby.data, seed=NULL)
train.set <- two.sets$trainset
test.set <- two.sets$testset

#create a linear model of the baby data trying to predict birth weight given other factors
baby.model <- lm(formula = BirthWeight ~ Gestation, data = train.set)

#show coefficients for the model
coefficients(baby.model)

# create a linear fit line for the model using the test data
birth.predict <- predict(baby.model, newdata = test.set)

# Difference between original labels and predictions = residuals
summary(test.set$BirthWeight - birth.predict)

# find the root mean squared error to see the sample standard deviation
rmsError <- sqrt(sum(baby.model$residuals * baby.model$residuals)/length(baby.model$residuals))

print("rmsError = ")
print(rmsError)

# the RMS error is around 17, so on average the prediction will be off by around 17 ounces of birth weight

Helpful Links