Preprocessing Bean Data (on the road to Machine Learning)
The gist
Let’s dive a bit deeper into the bean project - this post is the first in a series that will hopefully get at the meat of the project. One of the main questions of this endeavor is: Can we build a model that does a good job of predicting future market prices?
More generally: If I know something about the price of garbanzo beans today, and some of the market characteristics, can I predict with a good degree of accuracy what the price will be 6 months from now?
Today we’ll dive into a few processes we need to go through before the data are ready for analysis:
- Adding information about future prices
- Preprocessing and cleaning
- Splitting the data into training and test sets
- A few of the algorithms we can try out
Data things - loading in and initial examinations
#Packages we'll need
library(caret)
library(tidyverse)
library(lubridate)
#Snagging the sort-of-cleaned data (remember last week's post) from github
bean_master_import = read_csv(file = "https://raw.githubusercontent.com/keatonwilson/beans/master/data/bean_master_joined.csv?token=AefUVKlj3UobUpNDMJMDlZC9l72c2Gssks5bq6nCwA%3D%3D")
#Lots of warnings, but it works. Just ignore them - let's look at the data
glimpse(bean_master_import)
## Observations: 8,407
## Variables: 19
## $ date <date> 1987-01-13, 1987-01-21, 1987-01-27, 1...
## $ class <chr> "Pinto", "Pinto", "Pinto", "Pinto", "P...
## $ price <dbl> 18.62, 18.25, 18.12, 18.12, 18.49, 18....
## $ weekly_avg_price <dbl> 18.62, 18.25, 18.12, 18.12, 18.49, 18....
## $ future_date <date> 1987-07-13, 1987-07-21, 1987-07-27, 1...
## $ future_weekly_avg_price <dbl> 20.25, 20.25, 20.00, 19.25, 18.75, 18....
## $ whole_market_avg <dbl> 29.18700, 28.85000, 28.89900, 28.81200...
## $ whole_market_sum <dbl> 291.87, 288.50, 288.99, 288.12, 239.99...
## $ class_market_share <dbl> 0.06379553, 0.06325823, 0.06270113, 0....
## $ planted <dbl> 637.6, 637.6, 637.6, 637.6, 637.6, 637...
## $ harvested <dbl> 617.1, 617.1, 617.1, 617.1, 617.1, 617...
## $ yield <int> 1549, 1549, 1549, 1549, 1549, 1549, 15...
## $ production <int> 9560, 9560, 9560, 9560, 9560, 9560, 95...
## $ season_avg_price <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ production_value <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ year <dbl> 1987, 1987, 1987, 1987, 1987, 1987, 19...
## $ month <chr> "January", "January", "January", "Febr...
## $ imports <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ season_total <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
Back to the future
So, now that we have our data in our workspace as a nice, tidy tibble, we want to do a few things before we start doing some prediction.
The first is that we want to generate future data - so that for a given date, we know what the average monthly price is 6 months in the future.
Why 6 months in the future, you might ask? Well, the time interval can be changed to anything we want, but this represents a typical turn-around time for brokers in the market. You’ll see in the code below that it’s fairly easy to build models that would predict for different time periods into the future.
Why average monthly price? Some of our data down the line (and in some pieces of this project that I’m working on now) only come at monthly scales, not the nice weekly scales present in these data. Yes, it’s a bit of a bigger temporal grain size, but, as I’ll show you, the models still do an excellent job.
If you look at the data summary above, you can see that future_date and future_weekly_avg_price are already in this data frame… here is how to generate the future date using the absolutely fantastic lubridate package.
#Generating the date 6 months in the future and a week-of-year column
bean_master_import$future_date = bean_master_import$date + months(6)
bean_master_import$week = week(bean_master_import$date)
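Remember the claim above about other time horizons? If you wanted, say, a 3-month or 12-month horizon instead, you could pull the lead time out into its own variable. Here's a minimal sketch, where lead_months is just a name I'm using for illustration (it isn't in the original script):
#Hypothetical: parameterize the prediction horizon
lead_months = 6
bean_master_import$future_date = bean_master_import$date + months(lead_months)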
Next, we add the future month and year, and calculate monthly average prices. In the first step, I’m adding the future month and year to the original dataframe, and in the second, I’m making a new dataframe that we’re going to join back to the first to get our future average monthly prices associated with dates.
#Adding future month and year
bean_master_import = bean_master_import %>%
mutate(future_month = month(future_date, label = TRUE, abbr = FALSE),
future_year = year(future_date))
#Making the dataframe we're going to join
bean_master_future_to_join = bean_master_import %>%
group_by(class, year, month) %>%
summarize(monthly_average_price = mean(price))
#joining it back on, but with the key column set to future date instead of date
bean_master_import = bean_master_import %>%
inner_join(bean_master_future_to_join, by = c("future_year" = "year", "future_month" = "month", "class" = "class")) %>%
rename(future_monthly_average_price = monthly_average_price)
## Warning: Column `future_month`/`month` joining factor and character vector,
## coercing into character vector
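This warning is harmless: dplyr is just telling us that future_month is a factor (because we built it with label = TRUE) while month is a character column, so both get coerced to character before matching. If you wanted to silence it, you could make the key types match before the join - something like the sketch below, which isn't part of the original pipeline:
#Hypothetical: convert the factor key to character before joining
bean_master_import = bean_master_import %>%
  mutate(future_month = as.character(future_month))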
#Looks reasonable
head(bean_master_import$future_monthly_average_price, n = 20)
## [1] 20.23250 20.23250 20.23250 18.59500 18.59500 18.59500 18.59500
## [8] 18.03000 18.03000 18.03000 18.03000 17.62667 17.62667 17.62667
## [15] 17.62667 16.71750 16.71750 16.71750 16.71750 16.34250
#This looks reasonable too
select(bean_master_import, future_monthly_average_price, price)
## # A tibble: 7,625 x 2
## future_monthly_average_price price
## <dbl> <dbl>
## 1 20.2 18.6
## 2 20.2 18.2
## 3 20.2 18.1
## 4 18.6 18.1
## 5 18.6 18.5
## 6 18.6 18.2
## 7 18.6 18.2
## 8 18.0 18
## 9 18.0 17.4
## 10 18.0 17.7
## # ... with 7,615 more rows
Just a bit more cleaning - getting rid of some columns we won’t need, adding a day column, and converting a few columns to more useful types (future_month, day, and class become factors; imports becomes numeric).
#Cleaning up the dataframe a bit
bean_master_import = bean_master_import %>%
select(-future_date, -future_weekly_avg_price, -year) %>%
mutate(day = day(date))
bean_master_import$future_month = factor(bean_master_import$future_month)
bean_master_import$day = factor(bean_master_import$day)
bean_master_import$class = factor(bean_master_import$class)
bean_master_import$imports = as.numeric(bean_master_import$imports)
Preprocessing
Ok, now we can start preprocessing the data before we feed it into our ML algorithms. Preprocessing is essential because:
- It gets rid of NAs, which are the scourge of ML algorithms
- It scales and centers your data, which means that all the variables carry more equal weight in the models (see the quick illustration after this list)
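For a bit of intuition on that second point: centering subtracts the column mean from every value and scaling divides by the column’s standard deviation, so each numeric variable ends up with mean 0 and standard deviation 1. Done by hand on a single column it would look something like this toy illustration (caret handles all of this for us in a moment):
#Illustration only - what "center" and "scale" do to one column
scaled_price = (bean_master_import$price - mean(bean_master_import$price, na.rm = TRUE)) /
  sd(bean_master_import$price, na.rm = TRUE)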
It’s also super easy with the caret package.
#Setting seed for reproducibility
set.seed(42)
#First, let's cut it down to the explanatory variables we want
bean_master_import_slim = bean_master_import %>%
select(date, class, price, whole_market_avg, whole_market_sum,
class_market_share, planted, harvested, yield, production,
month, imports)
#Preprocessing
preProc2 = preProcess(bean_master_import_slim, method = c("center", "scale", "knnImpute", "zv"))
preProc2
## Created from 2467 samples and 12 variables
##
## Pre-processing:
## - centered (9)
## - ignored (3)
## - 5 nearest neighbor imputation (9)
## - scaled (9)
It’s worth discussing what’s going on here.
The preProcess() function from caret takes a dataframe, and then in the method argument you tell it what kind of preprocessing you want it to do. Here, I’ve centered and scaled the variables (“center”, “scale”), used k-nearest neighbors (“knnImpute”) to impute missing values, and gotten rid of any columns that have zero variance (“zv”) - those aren’t helpful.
You might expect that when we call preProc2, we get the data, but actually… we don’t. caret has just built a framework for preprocessing the data, so we’ll need to apply it in the next step. However, calling the object does give us nice information about the number of variables that were ignored, centered etc.
#Doing the preprocessing
bean_ML_import_pp = predict(preProc2, bean_master_import_slim)
#glimpsing results
glimpse(bean_ML_import_pp)
## Observations: 7,625
## Variables: 12
## $ date <date> 1987-01-13, 1987-01-21, 1987-01-27, 1987-0...
## $ class <fct> Pinto, Pinto, Pinto, Pinto, Pinto, Pinto, P...
## $ price <dbl> -1.454833, -1.501566, -1.517986, -1.517986,...
## $ whole_market_avg <dbl> -0.4745942, -0.5321818, -0.5238085, -0.5386...
## $ whole_market_sum <dbl> -0.7998774, -0.8469529, -0.8401081, -0.8522...
## $ class_market_share <dbl> -0.59721573, -0.60937574, -0.62198404, -0.6...
## $ planted <dbl> 1.748686, 1.748686, 1.748686, 1.748686, 1.7...
## $ harvested <dbl> 1.860342, 1.860342, 1.860342, 1.860342, 1.8...
## $ yield <dbl> -0.5854438, -0.5854438, -0.5854438, -0.5854...
## $ production <dbl> 1.827526, 1.827526, 1.827526, 1.827526, 1.8...
## $ month <chr> "January", "January", "January", "February"...
## $ imports <dbl> 1.923680, 1.923680, 1.923680, 1.923680, 1.9...
bean_ML_import_pp$month = factor(bean_ML_import_pp$month)
#binding the response variable back on
bean_ML_import_pp$future_monthly_avg_price = bean_master_import$future_monthly_average_price
And finally, we get to split our data into training and test sets (check out the last post for info on this as a tenet of machine learning).
caret to the rescue again - it makes this super easy.
#Training and test sets
index1 = createDataPartition(bean_ML_import_pp$future_monthly_avg_price, p = 0.80, list = FALSE)
bean_ML_import_train = bean_ML_import_pp[index1,]
bean_ML_import_test = bean_ML_import_pp[-index1,]
The createDataPartition() function is doing the work here. We’re telling it what the response variable is (here, future monthly average price), what we want our data split to be (here, 80% training and 20% test), and that we don’t want the output in the form of a list. This gives us a random sample index, which we can then apply to our data frame to break it into training and test sets.
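A quick sanity check never hurts - the two pieces should come out to roughly 80% and 20% of the rows (just a check I like to run; it isn't in the original script):
#Checking the split proportions
nrow(bean_ML_import_train) / nrow(bean_ML_import_pp)
nrow(bean_ML_import_test) / nrow(bean_ML_import_pp)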
So, now that we have fully preprocessed training and test sets, we’re ready for the next step - jumping into some algorithms. I’ll save this for the next post (this is already super-long), but as a quick preview, we’ll run a variety of models to figure out which one does the best job of predicting future prices.
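Just to give a flavor of where we're headed, fitting a single candidate model with caret's train() might look something like the sketch below - method = "lm" is only a placeholder here, not one of the actual models we'll be comparing:
#Sketch only - a plain linear model as a stand-in for the real candidates
lm_fit = train(future_monthly_avg_price ~ . - date,
               data = bean_ML_import_train,
               method = "lm")
lm_fit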
Cheers!