A Multiple Linear Regression Study
For my second project on the Flatiron Data Science program, I was tasked with performing inferential modelling on a King County, WA housing data set. The aim was to establish which home improvements, if any, could potentially boost the value of a home.
First I looked more generally at house prices in the County. It is expensive: the most expensive County in the State, and home to some of the most expensive homes in the United States.
See those dark red shapes… yeah, Bill Gates and Jeff Bezos live there (Medina, to be specific). The Medina, Bellevue and Mercer Island districts are the most expensive in the County, with the average house price in 2019 coming in at over $1.5 million. So it is no real surprise that location is an important factor when building a model to understand house prices.
Other Factors Impacting House Price
So, sure, we know that location is important, but what else? Some exploratory data analysis confirmed the initial perceptions we all have.
No surprises here: everyone wants a bit more space. Perhaps I expected the gradient to be a little steeper, but the relationship between house price and space is almost always a trade-off against location (i.e. small properties in the middle of London or Manhattan can often cost a small fortune!).
Nice view? Add a zero.
You can’t really control your view, but having a nice one will cost you. The difference in price between homes that utilized a view and those that did not was stark.
Maybe something less obvious was the balance between bedrooms and bathrooms. When plotting the number of bedrooms minus the number of bathrooms, it is clear that a value of zero or less (at least as many bathrooms as bedrooms) is desirable. This feature is a decimal because bathrooms can be counted in steps of 0.5 (WC and sink), 0.75 (WC, sink and shower) or 1 (bath, WC and sink).
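To make the feature concrete, here is a minimal sketch of how the bedrooms-minus-bathrooms difference could be computed. The function name and the four example homes are made up for illustration and are not taken from the King County data.

```python
import numpy as np

def bed_bath_difference(bedrooms, bathrooms):
    """Number of bedrooms minus number of bathrooms, per home."""
    bedrooms = np.asarray(bedrooms, dtype=float)
    bathrooms = np.asarray(bathrooms, dtype=float)
    return bedrooms - bathrooms

# Hypothetical homes; bathrooms use the 0.5 / 0.75 / 1 counting described above.
bedrooms = [3, 4, 2, 5]
bathrooms = [1.75, 4.5, 2.0, 2.5]

diff = bed_bath_difference(bedrooms, bathrooms)

# Zero or less means the home has at least as many bathrooms as bedrooms.
at_least_as_many_baths = diff <= 0
```

Because the quarter and half steps are exactly representable as floats, the difference can be compared against zero directly.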
Beginning with a very simple model, with Sale Price as the dependent variable and Square Feet Total Living space as the independent variable, I then added more and more features, all whilst trying to honour the assumptions of linear regression modelling: no multicollinearity, a linear relationship between the dependent and independent variables, homoscedasticity, and multivariate normality.
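One common way to screen for multicollinearity is the variance inflation factor (VIF). The sketch below is not necessarily the check used in this project. It runs on synthetic data, and the column names (`sqft_living`, `sqft_above`, `sqft_lot`) are assumptions chosen for illustration. Each feature is regressed on the others, and VIF = 1 / (1 - R²).

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an OLS fit of y on X, with an intercept added."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

def vif(X):
    """Variance inflation factor for each column of X."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)      # every feature except column j
        r2 = r_squared(others, X[:, j])       # how well the others explain column j
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic features: sqft_above is deliberately almost a multiple of sqft_living.
rng = np.random.default_rng(0)
sqft_living = rng.normal(2000, 500, 500)
sqft_above = 0.8 * sqft_living + rng.normal(0, 50, 500)
sqft_lot = rng.normal(8000, 2000, 500)

vifs = vif(np.column_stack([sqft_living, sqft_above, sqft_lot]))
```

A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a sign that a feature is largely explained by the others. In this toy data the two living-space columns would be flagged and the lot size would not.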
It can be seen that as I increase the number of features, the R-squared value increases, i.e. the model is able to explain more and more of the variance in sale price as features are added. This is to be expected. It does, however, become difficult to honour all the assumptions as rigidly when more features are introduced, as can be seen from the Jarque-Bera (JB) statistic, which we want to be as low as possible. This normality assumption can also be checked visually using a Q-Q plot.
The data should ideally overlay the straight line, indicating multivariate normality. A Q-Q plot like the one above indicates that the model suffers from excess kurtosis: the residuals resemble a fat-tailed distribution (more data in the tails than a normal distribution would have).
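Both effects can be reproduced on synthetic data. The sketch below fits nested OLS models with plain NumPy and computes the Jarque-Bera statistic from its textbook definition, JB = n/6 · (S² + (K − 3)²/4), where S is the sample skewness and K the sample kurtosis of the residuals. The prices, coefficients and fat-tailed noise are invented for illustration and are not the project's data.

```python
import numpy as np

def fit_ols(X, y):
    """OLS with an intercept; returns residuals and R-squared."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    return resid, r2

def jarque_bera(resid):
    """Jarque-Bera statistic: n/6 * (S^2 + (K - 3)^2 / 4)."""
    n = len(resid)
    centred = resid - resid.mean()
    variance = (centred ** 2).mean()
    skew = (centred ** 3).mean() / variance ** 1.5
    kurt = (centred ** 4).mean() / variance ** 2
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)

rng = np.random.default_rng(42)
n = 2000
sqft = rng.normal(2000, 600, n)
baths = rng.normal(2, 0.7, n)
condition = rng.integers(1, 6, n).astype(float)
noise = rng.standard_t(df=3, size=n) * 40_000        # deliberately fat-tailed
price = 50_000 + 200 * sqft + 30_000 * baths + 20_000 * condition + noise

features = np.column_stack([sqft, baths, condition])
r2_by_size = []
for k in range(1, features.shape[1] + 1):
    resid, r2 = fit_ols(features[:, :k], price)      # first k features only
    r2_by_size.append(r2)

jb_full_model = jarque_bera(resid)                   # residuals of the 3-feature model
jb_normal_reference = jarque_bera(rng.normal(size=n))
```

R-squared can never fall when a feature is added to a nested OLS model, which is why it is not a good guide on its own to whether a feature belongs. The JB value for the fat-tailed residuals comes out far larger than the value for a genuinely normal sample of the same size.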
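For readers who want to see what sits behind a Q-Q plot, the sketch below computes the plotted points by hand: sorted, standardised residuals against the matching quantiles of a standard normal. The "residuals" here are synthetic draws from a fat-tailed t-distribution, an assumption made purely to mimic the shape described above.

```python
import numpy as np
from statistics import NormalDist

def qq_points(resid):
    """Theoretical normal quantiles and sorted standardised residuals."""
    resid = np.asarray(resid, dtype=float)
    sample = np.sort((resid - resid.mean()) / resid.std())
    n = len(sample)
    probs = (np.arange(1, n + 1) - 0.5) / n          # plotting positions in (0, 1)
    normal = NormalDist()
    theoretical = np.array([normal.inv_cdf(p) for p in probs])
    return theoretical, sample

rng = np.random.default_rng(1)
fat_tailed_resid = rng.standard_t(df=3, size=5000)   # stand-in for model residuals

theoretical, sample = qq_points(fat_tailed_resid)

# On the straight line y = x these gaps would be close to zero.
upper_gap = sample[-1] - theoretical[-1]             # most positive residual
lower_gap = sample[0] - theoretical[0]               # most negative residual
```

Plotting `sample` against `theoretical` with any plotting library gives the Q-Q plot. With fat tails the right-hand end bends above the line and the left-hand end bends below it.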
With regard to homoscedasticity, the model wasn’t perfect either.
The difference between the modelled and actual house prices appeared to be greater at higher predicted house prices. This requires thorough investigation, but it indicates there’s something the model hasn’t quite captured.
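A residual plot can be backed up with a number. The sketch below computes a simplified Breusch-Pagan-style Lagrange multiplier statistic: the squared residuals are regressed on the fitted values, and n · R² from that auxiliary regression is compared with a chi-squared distribution with one degree of freedom. This is a swapped-in illustration on synthetic data, not a check reported in the project, and the noise scales are invented.

```python
import math
import numpy as np

def ols(X, y):
    """Fitted values and residuals from OLS with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    fitted = A @ coef
    return fitted, y - fitted

def lm_heteroscedasticity(fitted, resid):
    """n * R-squared from regressing squared residuals on fitted values."""
    squared = resid ** 2
    _, aux_resid = ols(fitted, squared)
    r2 = 1.0 - (aux_resid @ aux_resid) / ((squared - squared.mean()) ** 2).sum()
    lm = max(len(squared) * r2, 0.0)
    p_value = math.erfc(math.sqrt(lm / 2.0))   # chi-squared tail probability, 1 d.o.f.
    return lm, p_value

rng = np.random.default_rng(3)
n = 2000
sqft = rng.uniform(800, 5000, n)

# Case 1: error spread grows with house size (heteroscedastic).
price_het = 50_000 + 200 * sqft + rng.normal(0, 1, n) * 40 * sqft
# Case 2: constant error spread (homoscedastic).
price_hom = 50_000 + 200 * sqft + rng.normal(0, 100_000, n)

lm_het, p_het = lm_heteroscedasticity(*ols(sqft, price_het))
lm_hom, p_hom = lm_heteroscedasticity(*ols(sqft, price_hom))
```

A very small p-value, as in the first case, says the residual spread changes systematically with the predicted price. That is the same fan shape a residual plot shows.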
When I mapped the 200 entries where Actual minus Predicted house price was the highest, a pattern emerged.
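Picking out the worst under-predictions is a short sorting step. The sketch below uses five made-up prices and `k=2`; on the real data `k` would be 200. The function name is hypothetical.

```python
import numpy as np

def largest_underpredictions(actual, predicted, k):
    """Indices and errors of the k largest (actual - predicted) values."""
    errors = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    order = np.argsort(errors)[::-1][:k]     # biggest positive error first
    return order, errors[order]

# Invented sale prices and model predictions, in USD.
actual = np.array([500_000, 1_200_000, 750_000, 2_300_000, 640_000])
predicted = np.array([520_000, 900_000, 760_000, 1_500_000, 630_000])

idx, err = largest_underpredictions(actual, predicted, k=2)
```

The returned indices can then be used to pull out the latitude and longitude of those homes for mapping.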
Most of the homes whose sale price was poorly predicted were extremely close to water. Whilst waterfront location and view utilization data were accounted for, I suspect a more granular feature needs to be engineered, one that accounts for distance to a waterbody, as opposed to a binary yes/no feature for waterfront location.
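One possible shape for such a feature, sketched under assumptions: if a set of shoreline points were available as latitude and longitude pairs, the distance from each home to its nearest point could be computed with the haversine formula. The coordinates below are made up for illustration and do not trace any real shoreline. A real implementation would need actual waterbody geometry.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0   # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def distance_to_water_km(home_lat, home_lon, water_lat, water_lon):
    """Distance from each home to its nearest shoreline point."""
    home_lat = np.asarray(home_lat, dtype=float)[:, None]    # shape (homes, 1)
    home_lon = np.asarray(home_lon, dtype=float)[:, None]
    water_lat = np.asarray(water_lat, dtype=float)[None, :]  # shape (1, points)
    water_lon = np.asarray(water_lon, dtype=float)[None, :]
    return haversine_km(home_lat, home_lon, water_lat, water_lon).min(axis=1)

# Hypothetical shoreline points and two hypothetical homes.
water_lat = [47.60, 47.62, 47.64]
water_lon = [-122.25, -122.25, -122.25]
home_lat = [47.62, 47.62]
home_lon = [-122.25, -122.20]

dist_km = distance_to_water_km(home_lat, home_lon, water_lat, water_lon)
one_degree_latitude_km = haversine_km(0.0, 0.0, 1.0, 0.0)
```

The first home sits exactly on a shoreline point and gets a distance of zero. The second is a few kilometres inland. A continuous feature like this would let the model price proximity to water gradually rather than as a yes/no switch.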
Recommended Home Improvements
Whilst the model is not perfect, the coefficients can still be interpreted to provide some recommendations on home improvements.
Add bathroom: provided all other variables remain the same, reducing the difference between the number of bedrooms and bathrooms by one would increase property price by $30,000 USD.
Improve Condition: a slightly vague feature, but improving the condition of your home from ‘Average’ to ‘Good’ can improve its value by $34,000 USD.
Fix that nagging issue: is it a surprise that fixing that nagging water problem or other similar issue will help? A home with such an issue will be worth $19,000 USD less, assuming all other variables are kept constant.
Renovate: renovations can be expensive, but worth it. Assuming all other variables are kept constant, a renovated home will sell for $58,000 USD more.
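To show what "all other variables kept constant" means for a coefficient, here is a small sketch on synthetic data. A renovation premium of $58,000 is built into invented prices, echoing the figure above. An OLS fit then recovers it as the coefficient on a 0/1 `renovated` column. Every other number and name in the sketch is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000

# Synthetic homes: all values are invented.
sqft = rng.normal(2000, 500, n)
renovated = (rng.random(n) < 0.3).astype(float)     # 1 = renovated, 0 = not
TRUE_PREMIUM = 58_000                               # effect built into the fake prices
price = (100_000 + 250 * sqft + TRUE_PREMIUM * renovated
         + rng.normal(0, 50_000, n))

# OLS with an intercept, living space and the renovation dummy.
A = np.column_stack([np.ones(n), sqft, renovated])
coef, *_ = np.linalg.lstsq(A, price, rcond=None)
intercept, per_sqft, renovation_premium = coef

# Two otherwise identical homes: same square footage, only 'renovated' differs.
home_plain = np.array([1.0, 2200.0, 0.0])
home_renovated = np.array([1.0, 2200.0, 1.0])
predicted_gap = home_renovated @ coef - home_plain @ coef
```

The gap between the two predictions equals the fitted coefficient on `renovated`, whatever square footage is plugged in. That is the sense in which each dollar figure in the list is read.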