I came across a nice article that goes over some things to keep in mind when dealing with scientific data, and the importance of visualization when programming with large data sets. It got me thinking about how visualizing data can really change your perspective on a problem you are trying to solve. With large data sets it can be daunting trying to find some sort of form within them, while at the same time trying to draw some sort of conclusion from the data.
Some of the things that Vince Buffalo stressed seemed kind of obvious at first, but the ideas he was presenting got me thinking about how important it is for humans to have some sort of visualization to better understand a concept. By visualizing the data you get a much better perspective on the shape of the data, and it is much easier to interpret than raw numbers or tables. You can also view multiple variables in one visualization and see how they interact with each other. When working with data the key thing is to understand it, and visualizations can help you learn how the variables in your data are interacting and give insight into the relationships between them.
Right now I am collecting data on the Chicago Transit Authority's L trains using their API, and I found some interesting things by visualizing the data. Using R, I imported the CTA's predicted arrival times for each of the stations on one of their lines, then found the average wait time at each stop. The results surprised me and gave me some insight into what kinds of factors affect train arrival times in the city. For instance, at the downtown stops there were some odd patterns in the wait times. Those stops are very close together compared to most other stops, but for some reason certain stops had much higher average wait times, even though they were right next to each other.
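To give a rough idea of the kind of aggregation I did, here is a minimal sketch. I actually did this in R against the CTA's data, but the same averaging step looks like this in Python, with made-up station names and wait times standing in for the real predictions:

```python
from collections import defaultdict

# Hypothetical sample of (station, predicted wait in minutes) records,
# standing in for predictions pulled from the CTA's API.
predictions = [
    ("Clark/Lake", 3.0), ("Clark/Lake", 5.0),
    ("Monroe", 9.0), ("Monroe", 11.0),
    ("Jackson", 4.0), ("Jackson", 6.0),
]

def average_wait(records):
    """Group the predicted waits by station and average each group."""
    by_station = defaultdict(list)
    for station, wait in records:
        by_station[station].append(wait)
    return {station: sum(waits) / len(waits)
            for station, waits in by_station.items()}

averages = average_wait(predictions)
# With these made-up numbers, Monroe averages out noticeably
# higher than its neighbors, like the pattern I saw downtown.
```

It is a simple group-and-average, but once you plot the result per station the outliers jump out in a way a table of numbers never showed me.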
Later on, as I try to make predictions about arrival and wait times for the CTA, I will definitely take into account population density and other factors that may affect scheduling for trains running downtown.
The Monroe stop stood out in the data because the two adjacent stops had much shorter wait times, even though the distance between the stops is maybe one or two minutes of riding. There was also a large difference in wait times depending on which direction the trains were going. This got me thinking about how population density might affect how efficiently trains run, and also how the CTA may plan the frequency of certain train lines around population density. For most of the other stops the wait times reflected the distance between them, but this trend seemed to break down in the downtown area. I will write more on this later, but it just goes to show how visualizing data can help you spot patterns and lead you to new ideas about why the data takes a certain form.
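The direction split is the same kind of grouping, just keyed on station and direction together. A small Python sketch (again with invented numbers, not the real CTA figures) of how I compared the two directions at a stop:

```python
from collections import defaultdict

# Hypothetical (station, direction, wait in minutes) records.
records = [
    ("Monroe", "northbound", 4.0), ("Monroe", "northbound", 6.0),
    ("Monroe", "southbound", 10.0), ("Monroe", "southbound", 12.0),
]

def average_by_direction(rows):
    """Average the waits for each (station, direction) pair."""
    groups = defaultdict(list)
    for station, direction, wait in rows:
        groups[(station, direction)].append(wait)
    return {key: sum(waits) / len(waits)
            for key, waits in groups.items()}

by_direction = average_by_direction(records)
# A gap between the two directions at the same stop, like the
# one I saw at Monroe, shows up immediately in this breakdown.
```

Plotting the two directions side by side per station made the asymmetry obvious, which is what sent me down the population-density line of thinking.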