For the Module 3 final project, our assignment was to create a classification model from a data set of our choice. Being a trader, I have always been in the business of trying to predict “random” events, and I wanted to test my skills at a new type of prediction: Major League Baseball games. To be honest, I thought that gathering the data would be the easiest part, given that baseball is such a quantifiable sport and there is so much data publicly available. I was very wrong. I admit that the data set I envisioned wasn’t exactly like the ones readily available on ESPN.com, but everything I was looking for was a perfectly ordinary baseball statistic.
Northwind Traders is a fictional company database created by Microsoft to demonstrate their software related to business intelligence. For this project I was tasked with exploring the database and extracting useful business intelligence from it. The questions that I chose to answer are as follows:
I started the King County housing project the way most aspiring data scientists would. I explored and prepared the data before running it through a simple OLS model to get a feel for what I was working with. The resulting model had an r² of about 0.6, and I began thinking of ways to improve it. I thought of all the things that influence home prices but were not included in the data set: school quality, crime rates, and proximity to attractions such as parks, grocery stores, shopping centers, business centers and transportation routes. After longer than I care to admit, it dawned on me that much of that information was captured by a seemingly useless column: the home’s zipcode. You can’t include zipcode in a model as a continuous variable because the numeric value of a zipcode has no meaningful order or magnitude, but by turning the zipcode into dummy variables that group homes together by small geographical regions, I had acquired much of the data I was searching for. For the most part, homes in the same zipcode share the same schools, criminals and attractions, and thus the coefficients associated with the zipcode dummies can be extremely valuable in accurately predicting a home’s value. Making this change alone improved my model’s r² to 0.81. That was before any transforms, normalizations or converting other continuous variables into dummy ones. The main lesson I have taken away from this project is that being a little bit creative with seemingly useless data can yield great results.
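To make the zipcode idea concrete, here is a minimal sketch of the technique. It does not use the King County data or reproduce the r² values above: the three zipcodes, their price offsets, the `sqft_living` column and the `r_squared` helper are all made up for illustration, and the fit uses plain NumPy least squares rather than whatever library the original project used. It compares treating zipcode as a number against one-hot encoding it with `pandas.get_dummies`.

```python
import numpy as np
import pandas as pd


def r_squared(X, y):
    """Fit ordinary least squares with an intercept and return R^2."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return 1.0 - residuals.var() / y.var()


# Synthetic stand-in data (hypothetical, NOT the King County data set).
rng = np.random.default_rng(0)
n = 500
zip_effect = {"98001": 0.0, "98004": 400_000.0, "98103": 150_000.0}  # invented offsets
homes = pd.DataFrame({
    "zipcode": rng.choice(list(zip_effect), size=n),
    "sqft_living": rng.uniform(800, 4000, size=n),
})
price = (
    200 * homes["sqft_living"]
    + homes["zipcode"].map(zip_effect)
    + rng.normal(0, 50_000, size=n)
).to_numpy()

# Option 1: treat zipcode as a continuous number (its magnitude is meaningless).
numeric = homes.assign(zipcode=homes["zipcode"].astype(int))
r2_numeric = r_squared(numeric[["sqft_living", "zipcode"]], price)

# Option 2: one dummy column per zipcode; drop_first avoids perfect
# collinearity with the intercept.
dummies = pd.get_dummies(homes, columns=["zipcode"], drop_first=True, dtype=float)
r2_dummies = r_squared(dummies, price)

print(f"zipcode as a number: r^2 = {r2_numeric:.3f}")
print(f"zipcode as dummies:  r^2 = {r2_dummies:.3f}")
```

Each dummy coefficient acts as a per-zipcode price offset relative to the dropped baseline zipcode, which is how the model picks up neighborhood-level effects such as schools and amenities without those columns being in the data.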