Machine Learning Algorithm for Tucson Housing Prices
Introduction
The goal of this project has three main components: 1) to scrape a bunch of web data of house information in Tucson (prices, beds, baths, some other stuff), 2) to build a test a series of machine learning models that do a good job of accurately predicting the price a house will sell at and 3) taking this model and building a web interface that folks could use to plug in information on a house in Tucson and get an output.
I’m not going to cover the first part of the project because of the murkiness of web-scraping, but so we’ll start with part 2 here and hopefully get to part 3 in the upcoming weeks!
Loading and cleaning the data
First, we’ll need to load up some packages (there will be quite a few for this).
#Packages
library(skimr) #I like this better than glimpse
library(tidyverse) #The usual suspects
library(caret) #Machine Learning
library(recipes) #PreProcessing Recipes
library(lubridate) #We'll do some date stuff
library(rsample) #Sampling for ML
library(caretEnsemble) #Ensemble models
library(doParallel) #Parallel Processing to speed things up a bit
We’ll set up the parallel processing and set the seed.
#parallel
detectCores()
## [1] 4
registerDoParallel(cores = 4)
#seed
set.seed(42)
Now we’ll get into the nitty gritty of some data cleaning before we use the recipes
package to pre-process the data.
#Reading in the data
housing_df = read_csv("/Users/KeatonWilson/Documents/Projects/Tucson_Housing_Machine_Learning/data/housing_df.csv") #I'm pulling this from my local directory - email me if you want to play with the data.
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_integer(),
## address = col_character(),
## price = col_double(),
## dens = col_integer(),
## beds = col_integer(),
## baths = col_double(),
## zip = col_integer(),
## sqft = col_integer(),
## date_sold = col_date(format = ""),
## lat = col_double(),
## lon = col_double()
## )
housing_df
## # A tibble: 30,170 x 11
## X1 address price dens beds baths zip sqft date_sold lat
## <int> <chr> <dbl> <int> <int> <dbl> <int> <int> <date> <dbl>
## 1 1 878 Ha… 907660 811 NA NA 85624 1118 2018-01-17 31.5
## 2 2 Casita… 415000 272 NA NA 85624 1524 2018-01-17 31.5
## 3 3 10 Aco… 430000 229 NA 2 85624 1872 2017-01-31 31.5
## 4 4 33 Kim… 240000 93 NA 3 85624 2560 2016-12-14 31.5
## 5 5 6 Acor… 249000 186 3 1 85624 1337 2016-05-24 31.5
## 6 6 5920 E… 340000 136 4 3 85739 2490 2018-09-19 32.5
## 7 7 39779 … 418989 107 4 5 85739 3896 2018-06-14 32.5
## 8 8 6080 E… 310000 132 5 3 85739 2343 2018-03-05 32.5
## 9 9 12640 … 275000 147 2 2 85619 1870 2016-05-04 32.4
## 10 10 6234 N… 649000 305 2 2 85750 2121 2018-12-20 32.3
## # ... with 30,160 more rows, and 1 more variable: lon <dbl>
#Doing a bit of cleaning after looking at the data
housing_df = housing_df %>%
select(-X1, -address) %>%
mutate(zip = as.factor(zip),
month = month(date_sold),
day = day(date_sold)) %>%
select(-date_sold)
#Take a look at the data using skim
skim(housing_df)
## Skim summary statistics
## n obs: 30170
## n variables: 10
##
## ── Variable type:factor ─────────────────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n n_unique
## zip 0 30170 30170 45
## top_counts ordered
## 857: 3195, 857: 2713, 857: 2250, 857: 1541 FALSE
##
## ── Variable type:integer ────────────────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100
## beds 6288 23882 30170 3.16 0.91 2 3 3 4 22
## day 244 29926 30170 16.66 8.93 1 9 16 25 31
## dens 4431 25739 30170 121.84 51.77 1 92 119 147 986
## sqft 4263 25907 30170 1934.16 4443.84 1 1296 1636 2140 275000
## hist
## ▇▁▁▁▁▁▁▁
## ▃▆▆▇▅▅▅▇
## ▇▆▁▁▁▁▁▁
## ▇▁▁▁▁▁▁▁
##
## ── Variable type:numeric ────────────────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25
## baths 4655 25515 30170 2.2 0.91 0.5 2
## lat 31 30139 30170 32.16 0.31 31.33 32.1
## lon 31 30139 30170 -111.04 0.51 -119.42 -111.13
## month 244 29926 30170 8.68 3.12 1 6
## price 0 30170 30170 2e+05 150552.48 1 98000
## p50 p75 p100 hist
## 2 2 37 ▇▁▁▁▁▁▁▁
## 32.2 32.32 41.6 ▇▁▁▁▁▁▁▁
## -111.01 -110.92 -73.09 ▁▇▁▁▁▁▁▁
## 10 11 12 ▁▁▂▁▂▃▂▇
## 170000 245000 992000 ▆▇▂▁▁▁▁▁
#Looking at the distribution of prices to see if anything is wonky
ggplot(housing_df, aes(x = price)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Removing houses that cost very little (probably errors in price generation)
housing_df = housing_df %>%
filter(price > 50000)
#removing dens at it has price built in
housing_df = housing_df %>%
select(-dens)
#Removing entries with really wrong lats and lons
housing_df = housing_df %>%
filter(lat < 33)
Great! Now we have a semi-clean data set that contains a lot of useful data to predict price with.
Specifically:
1. Price
2. Bedrooms
3. Bathrooms
4. Zip Code
5. Square feet
6. Date Sold
7. Latitude and Longitude
We need to split into training and test sets and also pre-process (transform, impute) before we start building algorithms.
#Split into train and test using some tools from the rsample package
train_test_split = initial_split(housing_df)
housing_train = training(train_test_split)
housing_test = testing(train_test_split)
#Check for missing
sum(is.na(housing_df))
## [1] 6989
#A LOT of missing, but we can try and impute
tucson_rec = recipe(price ~ ., data = housing_train) %>%
#step_log(price, base = 10) %>%
step_bagimpute(all_numeric()) %>% #imputation
step_YeoJohnson(beds, baths, sqft, -lat, -lon) %>% #YeoJohnson Transformation
step_center(beds, baths, sqft, -lat, -lon) %>% #Centering
step_scale(beds, baths, sqft, -lat, -lon) %>% #Scaling
step_dummy(all_nominal()) %>% #One Hot Encoding
step_bs(lat, lon) #Splines
#Rational for step_bs - Makes sense to include splines above given the complex relationship between lat, lon and price.
housing_train %>%
filter(lat < 33) %>%
ggplot(aes(y = price, x = lat)) +
geom_point(alpha = 0.2) +
geom_smooth() +
theme_classic()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
housing_train %>%
filter(lat < 33) %>%
ggplot(aes(y = price, x = lon)) +
geom_point(alpha = 0.2) +
geom_smooth() +
theme_classic()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
#Prepping the data
prepped_tucson = prep(tucson_rec, training = housing_train)
#Generating preprocessed training and test data
train_data = bake(prepped_tucson, new_data = housing_train)
test_data = bake(prepped_tucson, new_data = housing_test)
## Warning in bs(x = c(32.5148575, 32.4391429, 32.3137296, 32.312586,
## 32.308355, : some 'x' values beyond boundary knots may cause ill-
## conditioned bases
## Warning in bs(x = c(-110.8679814, -110.7563179, -110.8415315,
## -110.846994, : some 'x' values beyond boundary knots may cause ill-
## conditioned bases
Single-Algorithm Testing and Tuning
Now the fun begins! We can start to build and test different algorithms to try and predict price. We’ll assess models using repeated cross-validation, and do a bit of automatic tuning along the way. Some of these models take a long time - the Random Forest and XGBoost in particular. I’ll be loading saved models (RDS objects) to save computational time and not re-build models I’ve already done, but I’ll also provide code on exactly how to work through each model.
Linear Regression
#Setting up the train control object telling caret we want to use repeated cross validation (3 folds with 3 repeats).
train_control = trainControl(method = "repeatedcv", number = 3, repeats = 3)
#Tune Length is longer here because linear models are fast - not wise to do this on a random forest.
lm_mod = train(price ~ ., data = train_data,
method = "lm", trControl = train_control, tuneLength = 10)
#loading the RDS object to save time - this won't work for you.
lm_mod = readRDS("/Users/KeatonWilson/Documents/Projects/Tucson_Housing_Machine_Learning/output/lm_model.rds")
summary(lm_mod)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1423216 -47399 -5435 32239 901181
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 275192.41 30820.42 8.929 < 2e-16 ***
## beds 1844.70 861.20 2.142 0.032206 *
## baths 20874.43 884.24 23.607 < 2e-16 ***
## sqft 69615.57 952.37 73.097 < 2e-16 ***
## month 1043.49 231.88 4.500 6.83e-06 ***
## day 430.90 69.61 6.190 6.13e-10 ***
## zip_X85614 -4622.17 15530.31 -0.298 0.765995
## zip_X85619 -15950.36 87216.22 -0.183 0.854892
## zip_X85621 -91163.42 19481.17 -4.680 2.89e-06 ***
## zip_X85622 -6739.64 15269.37 -0.441 0.658940
## zip_X85624 -33174.41 26654.68 -1.245 0.213294
## zip_X85629 -65768.04 16826.67 -3.909 9.32e-05 ***
## zip_X85633 -268274.94 49995.77 -5.366 8.14e-08 ***
## zip_X85640 53236.99 16161.42 3.294 0.000989 ***
## zip_X85641 -65240.66 18315.56 -3.562 0.000369 ***
## zip_X85645 -58722.04 17041.09 -3.446 0.000570 ***
## zip_X85646 27011.78 14439.96 1.871 0.061412 .
## zip_X85648 -86821.34 15013.71 -5.783 7.46e-09 ***
## zip_X85653 57858.23 21869.09 2.646 0.008160 **
## zip_X85658 106463.62 21392.85 4.977 6.53e-07 ***
## zip_X85701 156153.08 25615.38 6.096 1.11e-09 ***
## zip_X85704 45478.75 21463.61 2.119 0.034114 *
## zip_X85705 4988.97 21446.12 0.233 0.816052
## zip_X85706 -47141.13 20093.17 -2.346 0.018980 *
## zip_X85710 -55254.99 20786.11 -2.658 0.007861 **
## zip_X85711 -30346.24 20812.62 -1.458 0.144838
## zip_X85712 -14266.39 21623.63 -0.660 0.509416
## zip_X85713 9274.52 20355.29 0.456 0.648660
## zip_X85714 -30350.01 20938.32 -1.449 0.147215
## zip_X85715 -26257.42 21546.45 -1.219 0.222995
## zip_X85716 47039.45 21509.89 2.187 0.028764 *
## zip_X85718 130025.95 21387.41 6.080 1.23e-09 ***
## zip_X85719 45907.71 21411.45 2.144 0.032039 *
## zip_X85730 -48219.96 20598.84 -2.341 0.019247 *
## zip_X85735 32866.35 19569.10 1.680 0.093070 .
## zip_X85736 64079.38 18796.04 3.409 0.000653 ***
## zip_X85737 29387.69 21975.65 1.337 0.181146
## zip_X85739 -92526.22 21690.61 -4.266 2.00e-05 ***
## zip_X85741 25025.45 21539.80 1.162 0.245321
## zip_X85742 32274.35 21413.90 1.507 0.131784
## zip_X85743 102482.56 21046.66 4.869 1.13e-06 ***
## zip_X85745 103941.51 20847.57 4.986 6.22e-07 ***
## zip_X85746 -10226.14 19399.21 -0.527 0.598101
## zip_X85747 -52230.52 19830.82 -2.634 0.008450 **
## zip_X85748 -10056.59 24971.36 -0.403 0.687155
## zip_X85749 15518.49 24600.31 0.631 0.528162
## zip_X85750 105045.52 21408.38 4.907 9.33e-07 ***
## zip_X85755 61343.44 21604.20 2.839 0.004524 **
## zip_X85756 -76636.94 18799.55 -4.077 4.59e-05 ***
## zip_X85757 41664.35 19252.77 2.164 0.030471 *
## lat_bs_1 236931.29 39642.08 5.977 2.32e-09 ***
## lat_bs_2 -150042.37 34513.19 -4.347 1.38e-05 ***
## lat_bs_3 106493.36 27308.84 3.900 9.67e-05 ***
## lon_bs_1 -651865.80 53363.70 -12.216 < 2e-16 ***
## lon_bs_2 286116.55 22027.58 12.989 < 2e-16 ***
## lon_bs_3 -195462.93 42451.28 -4.604 4.16e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 84410 on 19443 degrees of freedom
## Multiple R-squared: 0.6552, Adjusted R-squared: 0.6542
## F-statistic: 671.8 on 55 and 19443 DF, p-value: < 2.2e-16
Not too bad - a tuned linear regression is giving us an Rsquared of 0.65, meaning the model is predicting 65% of the variation in price. Pretty good for a first pass! We’ll test all of our models on out-of-sample test data at the end. For now, let’s build the rest of the models.
#Random Forest
rf_mod = train(price ~ ., data = train_data,
method = "ranger", trControl = train_control,
tuneLength = 3, verbose = TRUE)
#XGBoost
xgboost_mod = train(price ~ ., data = train_data,
method = "xgbTree", trControl = train_control, tuneLength = 3)
#Ridge and Lasso - again, longer tune-length here because Ridge and Lasso is relatively fast.
ridge_lasso_mod = train(price ~ ., data = train_data,
method = "glmnet", trControl = train_control, tuneLength = 10)
Now let’s compare how good our models are on
#Loading previously saved models
rf_mod = readRDS("/Users/KeatonWilson/Documents/Projects/Tucson_Housing_Machine_Learning/output/rf_model.rds")
ridge_lasso_mod = readRDS("/Users/KeatonWilson/Documents/Projects/Tucson_Housing_Machine_Learning/output/ridge_lasso_mod.rds")
xgboost_mod = readRDS("/Users/KeatonWilson/Documents/Projects/Tucson_Housing_Machine_Learning/output/xgboost_model.rds")
lm_mod = readRDS("/Users/KeatonWilson/Documents/Projects/Tucson_Housing_Machine_Learning/output/lm_model.rds")
#Let's compare models on within sample data
results <- resamples(list(RandomForest=rf_mod, linearreg=lm_mod, xgboost = xgboost_mod, ridge_lass = ridge_lasso_mod))
# summarize the distributions
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: RandomForest, linearreg, xgboost, ridge_lass
## Number of resamples: 9
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RandomForest 18631.00 18820.02 19317.43 19203.06 19418.66 19801.08 0
## linearreg 56063.50 56727.51 56989.97 56934.75 57169.85 57706.18 0
## xgboost 38244.10 39308.75 39860.49 39704.32 40212.84 40497.75 0
## ridge_lass 55940.15 56556.56 56801.23 56897.43 57355.76 57994.55 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RandomForest 42213.37 43354.33 44146.76 44599.34 46605.39 47614.94 0
## linearreg 82909.83 83760.47 84945.09 84698.78 85288.79 86537.02 0
## xgboost 59863.35 61700.21 62754.91 62351.03 63246.07 64074.02 0
## ridge_lass 83074.97 83897.61 84264.01 84890.64 85703.49 87897.91 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## RandomForest 0.8920291 0.8973384 0.9067016 0.9037629 0.9071052 0.9131115
## linearreg 0.6464394 0.6480837 0.6514192 0.6522050 0.6555898 0.6593292
## xgboost 0.7976722 0.8060560 0.8133984 0.8115616 0.8179571 0.8214243
## ridge_lass 0.6181736 0.6384160 0.6588827 0.6512886 0.6633868 0.6751351
## NA's
## RandomForest 0
## linearreg 0
## xgboost 0
## ridge_lass 0
# boxplot of results
bwplot(results, metric="RMSE", xlim = c(20000, 150000))
Great! We can see that the random forest model is doing significantly better than the rest, with a Root-Mean-Squared-Error of about $44,000. Let’s also check out this model’s performance on our test data.
rf_mod_fit = predict(rf_mod, test_data)
postResample(pred = rf_mod_fit, obs = test_data$price)
## RMSE Rsquared MAE
## 6.927496e+04 8.108359e-01 4.872507e+04
The RMSE here is a lot higher than in our test data, and the Rsquared is certainly better than the linear regression (0.81 versus 0.6), it still leaves quite a bit to be desired. Let’s see if we can use some ensemble methods to improve performance.
Ensemble Models
First, we need to see if our models are good candidates for building an ensemble.
#Let's see how much correlation there is between model predictions - ideally, we want models that are fairly uncorrelated.
modelCor(results)
## RandomForest linearreg xgboost ridge_lass
## RandomForest 1.00000000 0.15752534 -0.09086445 0.53485131
## linearreg 0.15752534 1.00000000 0.61396724 -0.08224185
## xgboost -0.09086445 0.61396724 1.00000000 -0.34638214
## ridge_lass 0.53485131 -0.08224185 -0.34638214 1.00000000
This looks promising! There is some moderate correlation between the outputs of the Ridge and Lasso model and the Random Forest model, but it’s not extremely high. This seems like a good candidate to build and test some ensemble models. Let’s use the caretList and caretEnsemble tools to buiild an ensemble model. We’re going to build a caretList object first, and then build three different ensemble types: a greedy ensemble, a glm ensemble and a random forest ensemble. A warning - these take significant time to build on a normal laptop. I’ll be loading previously built models for assessment below
#SEtting up the train control object
train_control = trainControl(method = "repeatedcv", number = 3, repeats = 3)
#Building the caret list
model_list <- caretList(
price ~ ., data=train_data,
trControl=train_control,
metric="RMSE",
methodList=c("lm", "ranger", "xgbTree", "glmnet"),
tuneLength = 3
)
#building a greedy ensemble
greedy_ensemble <- caretEnsemble(
model_list,
metric="RMSE"
)
#glm ensemble
glm_ensemble <- caretStack(
model_list,
method="glm",
metric="RMSE",
trControl=trainControl(
method="repeatedcv",
number=3,
repeats = 3
)
)
#Random Forest meta-model
rf_ensemble <- caretStack(
model_list,
method="rf",
metric="RMSE",
trControl=trainControl(
method="repeatedcv",
number=3,
repeats = 3
)
)
model_list = readRDS("/Users/KeatonWilson/Documents/Projects/Tucson_Housing_Machine_Learning/output/model_list.rds")
greedy_ensemble = readRDS("/Users/KeatonWilson/Documents/Projects/Tucson_Housing_Machine_Learning/output/greedy_ensemble.rds")
glm_ensemble = readRDS("/Users/KeatonWilson/Documents/Projects/Tucson_Housing_Machine_Learning/output/glm_ensemble.rds")
rf_ensemble = readRDS("/Users/KeatonWilson/Documents/Projects/Tucson_Housing_Machine_Learning/output/rf_ensemble.rds")
greedy_fit = predict(greedy_ensemble, test_data)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
glm_fit = predict(glm_ensemble, test_data)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
rf_ens_fit = predict(rf_ensemble, test_data)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
postResample(pred = greedy_fit, obs = test_data$price)
## RMSE Rsquared MAE
## 4.290985e+04 9.172482e-01 1.918544e+04
postResample(pred = glm_fit, obs = test_data$price)
## RMSE Rsquared MAE
## 4.290985e+04 9.172482e-01 1.918544e+04
postResample(pred = rf_ens_fit, obs = test_data$price)
## RMSE Rsquared MAE
## 4.521194e+04 9.078247e-01 2.047251e+04
And there we have it - the best model on out-of-sample test data appears to be either the greedy model or the glm model, which have identical scores on test data, and not the more complex meta-models. We end up with an algorithm that predicts 91.7% of the variation in price, with an RMSE of $42,909 and an MAE of $19185. Not bad! We could go through with more complex tuning - but the computational time is getting a bit hairy. For now, this seems like a good predictive tool. Next steps will be implementing this model into a web-based tool that folks can use to assess the price of a house in Tucson!