Hi there!

It’s been a while! Sorry for the delay on posting, but things have been a bit crazy with the transition between the end of the semester at UA and moving into Summer research projects! I’ve also become a little obsessed with the Tidy Tuesday challenge on Twitter - it’s a cool project for the #r4ds (R for Data Science) community that sends out weekly datasets that Twitterfolks can munge and visualize with the fantastic tidyverse packages. If you’re interested, you can check out my submissions on Twitter or the Tidy Tuesday repo on GitHub.

Enough of a preamble. Where were we? Oh right. Beans.

What happens if we had our bean model to make investments in real life?

This post comes out of discussion with friends and family after building the predictive bean model - what sort of real world application does this have? Sure, it seems like a reasonable financially beneficial thing to be able to predict markets with a high degree of accuracy, but what does that actually mean to someone interested in throwing some money at buying and selling dry beans?

Well, I thought it might be useful to try and simulate an investment scenario. The rules are fairly simple:

  • You start with $10k to invest
  • You can’t spend more than 33% of your capital on a given week
  • Of that 33%, you spend as much as you can if the model predicts that the price will be higher in 6 months than the current price, otherwise you don’t buy anything
  • You sell all of those beans 6 months down the line at whatever the market price is
  • You can’t spend more than 100k on one purchase (optimistic that you’ll have this much bank eventually)

Let’s build the simulation (it’s also worth noting here that we’re preforming this simulation on the out-of-sample test data).

The simulation


Reading in the data and prepping it (code from two posts ago)

#loading appropriate packages
library(caret)
library(tidyverse)
library(lubridate)
#install.packages("scales")
library(scales)
#reading in the bean_master file

bean_master_import = read.csv(file = "https://raw.githubusercontent.com/keatonwilson/beans/master/data/bean_master_joined.csv?token=AefUVEvB38PXHx8ExMoim-IlgrhsKthtks5bq7KuwA%3D%3D")

#Let's cut down the data frame to only the important bits

bean_ML_imports = bean_master_import %>%
  mutate(month = month(date),
         day = day(date)) %>%
 select(future_weekly_avg_price, price, class, whole_market_avg, class_market_share,
        planted, harvested, yield, production, imports, date, month, day) %>%
  filter(!is.na(future_weekly_avg_price))
bean_ML_imports$class = factor(bean_ML_imports$class)
bean_ML_imports$month = factor(bean_ML_imports$month)
bean_ML_imports$day = factor(bean_ML_imports$day)

#Setting seed for reproducibility
set.seed(42)

#Preprocessing
preProc2 = preProcess(bean_ML_imports[,-c(1,11)], method = c("center", "scale", "knnImpute", "zv"))
bean_ML_import_pp = predict(preProc2, bean_ML_imports[,-c(1,11)])

#binding the response and dates variable back on
bean_ML_import_pp$future_weekly_avg_price = bean_ML_imports[,1]
bean_ML_import_pp$date = bean_ML_imports[,11]

#Training and test sets
index1 = createDataPartition(bean_ML_import_pp$future_weekly_avg_price, p = 0.80, list = FALSE)
bean_ML_import_train = bean_ML_import_pp[index1,]
bean_ML_import_test = bean_ML_import_pp[-index1,]

The output of this chunk of code above is what we’re going to work with - specifically, the test set - which means that the simulation we’ll run will be completely on out-of-sample data, which should provide a more realistic set of predictions.

We need to do a few things before we get into the nitty-gritty of building the model though: first, we need to get rid of duplicate rows - which simplifies our simulation a lot. Instead of making decisions among all of the varietals for a given week that might give the best return on investment, we’re going to set up a simple scenario where we only look at weeks where there is one option - and then decide whether to buy or not. This next code chunk also a few other details - we’ve loaded our previously-built model as an RDS object, and then use it to bind predictions back onto the data frame.

#Renaming
bean_portfolio = bean_ML_import_test
#binding non-transformed prices back on.
bean_portfolio$nontran_price = bean_ML_imports[-index1,]$price
#RF Model is on final_model.rd - this will be locally in your machine, it's the model object we built a few posts back
#
super_model = readRDS(file = "/Users/KeatonWilson/Documents/Projects/beans2/final_model.rds")

#binding predictions onto the test data frame
bean_portfolio$pred = predict(super_model, bean_portfolio)

#Generating a column to say whether or not the date is a duplicate
bean_portfolio$dup = duplicated(bean_portfolio$date)

glimpse(bean_portfolio)
## Observations: 1,339
## Variables: 16
## $ price                   <dbl> -1.59998660, -1.24329275, -1.43785303,...
## $ class                   <fct> Pinto, Pinto, Pinto, Pinto, Pinto, Pin...
## $ whole_market_avg        <dbl> -1.4308377, -1.2062523, -1.2553069, -1...
## $ class_market_share      <dbl> -0.16479201, 0.27002425, 0.12468812, 0...
## $ planted                 <dbl> 1.690830, 1.690830, 1.690830, 1.690830...
## $ harvested               <dbl> 1.802003, 1.802003, 1.802003, 1.802003...
## $ yield                   <dbl> -0.5926519, -0.5926519, -0.5926519, -0...
## $ production              <dbl> 1.771990, 1.771990, 1.771990, 1.771990...
## $ imports                 <dbl> 1.8751095, 1.8751095, 1.8751095, 1.875...
## $ month                   <fct> 4, 6, 8, 9, 11, 1, 1, 2, 3, 6, 7, 8, 1...
## $ day                     <fct> 28, 16, 11, 15, 3, 20, 26, 2, 15, 14, ...
## $ future_weekly_avg_price <dbl> 17.13, 16.37, 16.00, 17.88, 19.00, 32....
## $ date                    <fct> 1987-04-28, 1987-06-16, 1987-08-11, 19...
## $ nontran_price           <dbl> 17.50, 20.25, 18.75, 18.38, 17.00, 16....
## $ pred                    <dbl> 17.50913, 16.58924, 16.44968, 17.55849...
## $ dup                     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
#Paring it down

small_bean_portfolio = bean_portfolio %>%
  filter(dup == FALSE) %>%
  arrange(date)

Building the model

bank = 10000 # initializing our starting value
df_port = list() #initializing the list that we'll output to
unique_dates = unique(small_bean_portfolio$date)
'%w/o%' <- function(x,y)!('%in%'(x,y)) #making a 'without' funcion

for(i in 1:nrow(small_bean_portfolio)) {
  subset = small_bean_portfolio[i,] # pulling out a row, indexed by row number
  df_port$Date[[i]] = as.character(subset$date) # feeding in the iterated date to the output data frame
  max_spend = 0.33*bank         #figuring out how much we can spend on beans
  buy = ifelse(subset$nontran_price < subset$pred, TRUE, FALSE) #Deciding whether to buy or not buy, based on whether the predicted price is more than the current price
  if (buy == TRUE) { 
    num_100wt =  ifelse(floor(max_spend/subset$nontran_price) > 10^5, 10^5, floor(max_spend/subset$nontran_price)) #Calculating the number of hundred-weights we can afford (capping at 10^5 hundredweights based on domain knowledge)
    df_port$num_100wt[[i]] = num_100wt            #binding this to the output dataframe
    amount_spent = num_100wt*subset$nontran_price #calculating the amount of money we spent on beans
    df_port$amount_spent[[i]] = amount_spent          #binding this to the output dataframe
    amount_made = subset$future_weekly_avg_price * num_100wt #Calculating amount made
    df_port$amount_made[[i]] = amount_made                      #binding to data frame
    bank = bank - amount_spent + amount_made                 #subtracting our costs and adding our profits to update bank
    df_port$bank[i] = bank                     #binding to the output dataframe
  }
 else {
  df_port$num_100wt[[i]] = NA
  df_port$amount_spent[[i]] = NA
  df_port$bank[[i]] = bank
  df_port$amount_made[[i]] = NA
 }
}

#binding the output list into a dataframe
df_port_df = dplyr::bind_cols(df_port)
dplyr::glimpse(df_port_df)
## Observations: 763
## Variables: 5
## $ Date         <chr> "1987-01-13", "1987-01-21", "1987-01-27", "1987-0...
## $ num_100wt    <dbl> NA, NA, NA, NA, NA, NA, 160, NA, 198, NA, NA, NA,...
## $ amount_spent <dbl> NA, NA, NA, NA, NA, NA, 3280.00, NA, 3465.00, NA,...
## $ bank         <dbl> 10000.00, 10000.00, 10000.00, 10000.00, 10000.00,...
## $ amount_made  <dbl> NA, NA, NA, NA, NA, NA, 3800.00, NA, 3391.74, NA,...

Great! Now let’s visualize how our bank size grows over time!

ggplot(df_port_df, aes(x = as.Date(Date), y = bank)) +
  geom_path() +
  ylab("Bank Size ($)") +
  xlab("Date") +
  theme_classic() +
  scale_y_continuous(labels = comma)

Whoa! So if we had conservatively purchased based on the rules of this simulation and had our predictive model in hand since the 80s, we could have turned our $10k into 150,000,000. Not too bad. We can also look and see how many times our model steered us wrong and we lost money.

df_port_df %>%
  filter(amount_spent > amount_made) %>%
  mutate(amount_lost = amount_spent - amount_made) %>%
  arrange(desc(amount_lost))
## # A tibble: 29 x 6
##    Date       num_100wt amount_spent       bank amount_made amount_lost
##    <chr>          <dbl>        <dbl>      <dbl>       <dbl>       <dbl>
##  1 2010-03-02    100000      4075000 153231547.    3750000      325000 
##  2 2004-09-08    100000      2900000 113561547.    2750000      150000 
##  3 2000-10-24    100000      2350000  64749547.    2225000      125000 
##  4 1991-08-20     94490      1889800   5620416.    1783499.     106301.
##  5 1995-03-14    100000      2600000  34536547.    2500000      100000 
##  6 1995-05-30    100000      2175000  35311547.    2075000      100000 
##  7 1994-03-22    100000      3300000  28511547.    3213000.      87000.
##  8 2001-03-27    100000      2650000  67049547.    2575000       75000 
##  9 1995-08-29    100000      2600000  35661547.    2550000       50000 
## 10 1998-09-29    100000      2525000  60899547.    2475000       50000 
## # ... with 19 more rows

Yikes, some heavy losses, particularly in the 90s and early 00s, with the biggest of $325k in March of 2010 - but, when you’ve got 153 million in the bank, it’s not as drastic as it seems.

Conclusions

Hopefully this was an interesting example of the power of leveraging the ML model to predict real-world data. The simulation clearly illustrates the predictive power of the model, but also has a few built-in assumptions that are worth mentioning:

  1. We aren’t making choices among varietals each week (i.e. do I buy pintos or garbanzos, even when the model predicts both will turn a profit) like a real broker would. I initially went down the road to try and simulate this, but got stuck in the code - something I’d like to revisit some day soon.
  2. One BIG assumption is that you’re able to sell all of the beans you buy. It may not always be so easy to totally liquidate all of your beans at a certain price.
  3. There aren’t any costs associated with holding beans here - in the real world, you’d have to warehouse them, and transport them somewhere, all which come at a price.

Regardless, an impressive display of prediction!

Cheers!