When I started the King County housing project, I began the way most aspiring data scientists would. I explored and prepared the data, then ran it through a simple OLS model to get a feel for what I was working with. The resulting model had an R² of about 0.6, and I began thinking of ways to improve it. I thought of all the things that influence home prices but were not included in the data set: school quality, crime rates, and proximity to attractions such as parks, grocery stores, shopping centers, business centers and transportation routes. After longer than I care to admit, it dawned on me that much of that information was captured by a seemingly useless column: the home’s zipcode. You can’t include zipcode in a model as a continuous variable because the numeric value of a zipcode has no quantitative meaning, but by turning the zipcode into dummy variables that group homes by small geographical regions, I had acquired much of the information I was searching for. For the most part, homes in the same zipcode share the same schools, crime rates and attractions, so the coefficients on those dummy variables can be extremely valuable in accurately predicting a home’s value. Making this change alone improved my model’s R² to 0.81. That was before any transforms, normalization, or conversion of other continuous variables into dummy variables. The main lesson I have taken away from this project is that being a little bit creative with seemingly useless data can yield great results.
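To make the zipcode idea concrete, here is a minimal sketch of the comparison. It is not my project code: the data is synthetic, the column names `sqft_living` and `price` and the per-zipcode price premiums are assumptions made up for illustration, and the OLS fit is done with plain NumPy least squares rather than a modeling library.

```python
import numpy as np
import pandas as pd


def r_squared(X, y):
    """Fit OLS with an intercept by least squares and return R^2."""
    X = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.dot(resid) / ((y - y.mean()) ** 2).sum()


# Synthetic toy data (NOT the King County data set).
rng = np.random.default_rng(0)
n = 300
zipcodes = rng.choice([98001, 98004, 98039], size=n)
# Hypothetical location premium per zipcode, invented for this example.
zip_premium = {98001: 0.0, 98004: 300_000.0, 98039: 700_000.0}
sqft = rng.uniform(800, 4000, size=n)
price = (
    150 * sqft
    + np.array([zip_premium[z] for z in zipcodes])
    + rng.normal(0, 50_000, size=n)
)
df = pd.DataFrame({"sqft_living": sqft, "zipcode": zipcodes, "price": price})
y = df["price"].to_numpy()

# Baseline: square footage only.
r2_base = r_squared(df[["sqft_living"]].to_numpy(), y)

# Treat zipcode as categorical: one dummy column per zipcode.
# drop_first=True drops one level so the dummies are not perfectly
# collinear with the intercept.
dummies = pd.get_dummies(df["zipcode"], prefix="zip", drop_first=True, dtype=float)
X_zip = pd.concat([df[["sqft_living"]], dummies], axis=1).to_numpy()
r2_zip = r_squared(X_zip, y)

print(f"R^2 without zipcode dummies: {r2_base:.2f}")
print(f"R^2 with zipcode dummies:    {r2_zip:.2f}")
```

Because the toy prices were generated with a strong location effect, the jump in R² here is larger than anything you should expect from real data. The point is only the mechanics: the dummy columns let the model learn one price offset per zipcode.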
In this data set, the “id” column can be similarly useful. It cannot be used as a continuous variable for obvious reasons, but it can be used to identify homes that were sold more than once. The data set covers only about a year, so most of the homes sold twice were likely flipped. That could give us great insight into the value of home improvements and help decide whether a project is likely to turn a profit. Unfortunately, the homes that were sold twice show no changed characteristics, even though the sale date and price differ. This tells me that these are not duplicate entries of the same event but two separate sales, with the second record seemingly missing the updated characteristics. If the second entry did reflect changes such as added square footage, an extra bathroom or an improved condition, it would be incredibly useful in determining which projects a home flipper should take on. All variables other than the improvements would be held constant by definition, which would give us a very accurate picture of what a bathroom or condition upgrade is worth.
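The repeat-sale check described above can be sketched in a few lines of pandas. This is an illustration under assumptions, not my original code: the rows are made up, and apart from “id” the column names (`date`, `price`, `sqft_living`, `bathrooms`) are assumed. One of the toy homes has a changed characteristic on purpose, to show what a useful record would look like.

```python
import pandas as pd

# Made-up sales records for illustration only.
sales = pd.DataFrame(
    {
        "id": [101, 102, 101, 103, 102],
        "date": pd.to_datetime(
            ["2014-06-01", "2014-07-15", "2015-02-01", "2014-09-20", "2015-03-10"]
        ),
        "price": [300_000, 450_000, 380_000, 275_000, 520_000],
        "sqft_living": [1500, 2000, 1500, 1200, 2300],
        "bathrooms": [2.0, 2.5, 2.0, 1.0, 2.5],
    }
)

# Keep every row whose id appears more than once, ordered by sale date.
repeat = sales[sales.duplicated("id", keep=False)].sort_values(["id", "date"])

# For each resold home, flag whether any recorded characteristic
# differs between its sales.
feature_cols = ["sqft_living", "bathrooms"]
changed = repeat.groupby("id")[feature_cols].nunique().gt(1).any(axis=1)

# Price difference between the first and last sale of each home.
first_last = repeat.groupby("id")["price"].agg(["first", "last"])
gain = first_last["last"] - first_last["first"]

print(pd.DataFrame({"characteristics_changed": changed, "price_gain": gain}))
```

In the toy data, home 102 gained square footage between sales, so its price gain could be tied to that improvement. Home 101 looks like what I found in the real data: a different date and price, but identical characteristics, which leaves nothing to attribute the gain to.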
Since I am not yet an accomplished data scientist, I can only assume this is a common problem in the industry. Missing or incomplete data is a major hurdle when modeling real-world phenomena. The old sayings “you are what you eat” and “garbage in, garbage out” hold true in data science: our models and predictions are only as good as the data used to create them. In summary, our data can be far more useful than we initially imagine, but it can also be the reason we fail to gain insight into the problem we are trying to solve. It is our job to know the difference and to deal with it appropriately.