Bean munging and Excel Wrangling
Messy Excel Files
So, as I discussed last time, the first big hurdle in starting to explore the domestic dry bean market data was overcoming the terror of working with a bunch of really messy, really gnarly excel files.
The main one looks like this:
Lots of problems, right? The data are in multiple sheets in a single workbook, they’re not uniform, etc. It’s an R-user’s nightmare, but the reality is that data often look like this. So let’s get to work.
A brief point - here is the type of data we can expect from this workbook:
- year
- month
- day
- class (type of bean or variety)
- price
Step 1. Reading in the data
First things first, let’s load up some useful packages
library(tidyverse)
# Warning: package 'dplyr' was built under R version 3.5.1
library(rJava)
library(readxl)
library(XLConnect)
We’ll use the handy functions in readxl and XLConnect to read in the the excel sheets and then use tidyverse to do a bunch of stuff to them to clean them up.
All files can be found at my github repo for this project.
Download the appropriate files to your local working directory - and you’re ready to go.
Let’s read in the data with readxl’s loadWorkbook function (if you were better at making the RCurl package work than I am, you could pull it directly from my GitHub too):
#Loading Workbooks
dealer_price = loadWorkbook(file =
"/Users/KeatonWilson/Documents/Projects/beans2/data/Weekly_Dealer_price.xls")
#Remember your directory will be different
#Turning workbook into a list
dealer_price_list = readWorksheet(dealer_price, sheet = getSheets(dealer_price))
#Lots of errors with this, but don't worry.
#Looks good...lots of NAs, but there is data in there.
head(dealer_price_list[[1]], n = 20)
Great! Now we have a list of the data from each year (where each item in the list is a dataframe of market data for different types of beans for each year).
If we look through all the dataframes in our list - there is a bunch of junk at the top of each one we need to get rid of. Let’s use the handy {r} lapply
function for this.
#Bunch of garbage on the top of every data frame - need to iterate through the list and delete the first 11 rows
dealer_price_list = lapply(dealer_price_list, function(x) x[-c(1:11),])
head(dealer_price_list[[1]], n = 20)
## Contents..Weekly.Dry.Bean.Prices..by.Class..and.Annual.Summary Col2
## 12 1 6
## 13 1 13
## 14 1 20
## 15 1 27
## 16 2 3
## 17 2 10
## 18 2 18
## 19 2 24
## 20 3 3
## 21 3 10
## 22 3 17
## 23 3 24
## 24 3 31
## 25 4 7
## 26 4 14
## 27 4 21
## 28 4 28
## 29 5 5
## 30 5 12
## 31 5 19
## Col3 Col4 Col5 Col6 Col7 Col8 Col9 Col10 Col11 Col12 Col13 Col14
## 12 32.00 31.75 31.75 36.25 27.25 27.75 38.50 26.75 33.25 31.50 37.00 <NA>
## 13 32.75 32.25 31.50 36.50 29.25 34.00 38.00 27.25 33.25 32.00 37.00 <NA>
## 14 35.50 33.25 34.00 36.50 30.00 36.00 37.50 28.50 33.50 32.75 37.00 <NA>
## 15 35.25 33.75 33.88 35.75 30.50 38.50 37.50 30.50 33.50 34.00 37.00 <NA>
## 16 35.25 34.25 34.50 36.00 30.75 39.50 38.00 31.00 33.50 34.25 37.50 <NA>
## 17 34.75 34.25 34.50 36.50 31.25 39.50 39.50 31.50 34.00 34.25 37.50 <NA>
## 18 35.00 35.50 34.50 36.75 31.75 40.50 40.50 31.25 34.25 34.50 37.50 <NA>
## 19 35.00 35.50 34.50 36.75 31.75 41.50 40.50 31.75 34.75 34.50 40.00 <NA>
## 20 34.25 35.25 34.25 36.25 32.88 44.50 40.50 32.75 34.75 34.25 41.00 <NA>
## 21 34.25 35.50 34.25 36.12 33.00 49.50 39.75 33.50 35.50 34.25 41.00 <NA>
## 22 34.25 35.75 34.25 36.75 34.50 50.50 39.50 33.50 36.75 34.25 41.00 <NA>
## 23 34.50 36.00 35.50 36.00 34.75 50.50 39.50 33.50 37.50 34.25 41.50 <NA>
## 24 34.50 37.00 35.50 36.25 34.75 51.75 39.00 33.25 37.75 34.25 41.50 <NA>
## 25 34.88 38.00 36.25 35.75 34.50 52.00 39.00 33.25 38.00 34.25 41.00 <NA>
## 26 36.25 38.25 36.25 36.00 34.75 51.75 39.25 33.25 38.00 34.25 42.00 <NA>
## 27 36.25 38.25 36.50 36.00 34.75 51.75 39.00 33.25 38.00 34.25 43.50 <NA>
## 28 36.25 38.50 36.50 36.00 36.25 51.75 38.50 33.00 38.50 34.00 44.00 <NA>
## 29 37.00 38.50 36.75 36.00 40.50 51.50 38.50 32.75 42.50 34.00 44.00 <NA>
## 30 38.75 38.75 37.50 36.00 44.50 51.50 38.50 32.75 52.25 34.25 44.50 <NA>
## 31 43.50 39.25 38.50 37.50 46.25 51.50 38.50 32.75 52.50 35.00 44.50 <NA>
Ok, now it gets gnarly.
Here is the problem - not all data frames have the same variables, and it looks like variable names are hidden somewhere down in each dataframe.
I ended up going through each data frame manually - obviously this doesn’t scale super well - and is laborious, but I couldn’t come up with a clever way to automate this, but here is an example of the type of code I used:
#1981
glimpse(dealer_price_list[[1]])
# Observations: 80
# Variables: 14
# $ Contents..Weekly.Dry.Bean.Prices..by.Class..and.Annual.Summary <chr> ...
# $ Col2 <chr> ...
# $ Col3 <chr> ...
# $ Col4 <chr> ...
# $ Col5 <chr> ...
# $ Col6 <chr> ...
# $ Col7 <chr> ...
# $ Col8 <chr> ...
# $ Col9 <chr> ...
# $ Col10 <chr> ...
# $ Col11 <chr> ...
# $ Col12 <chr> ...
# $ Col13 <chr> ...
# $ Col14 <chr> ...
dealer_price_list[[1]]
# Contents..Weekly.Dry.Bean.Prices..by.Class..and.Annual.Summary Col2
# 12 1 6
# 13 1 13
# 14 1 20
# 15 1 27
# 16 2 3
# 17 2 10
# 18 2 18
# 19 2 24
# 20 3 3
# 21 3 10
# 22 3 17
# 23 3 24
# 24 3 31
# 25 4 7
# 26 4 14
# 27 4 21
# 28 4 28
# 29 5 5
# 30 5 12
# 31 5 19
# 32 5 27
# 33 6 2
# 34 6 9
# 35 6 16
# 36 6 23
# 37 6 30
# 38 7 7
# 39 7 14
# 40 7 21
# 41 7 28
# 42 8 4
# 43 8 11
# 44 8 18
# 45 8 25
# 46 9 1
# 47 9 9
# 48 9 15
# 49 9 22
# 50 9 29
# 51 10 6
# 52 10 14
# 53 10 20
# 54 10 27
# 55 11 3
# 56 11 10
# 57 11 17
# 58 11 24
# 59 12 1
# 60 12 8
# 61 12 16
# 62 12 22
# 63 12 29
# 64 <NA>
# 65 = =
# 66 <NA> <NA>
# 67 <NA> <NA>
# 68 <NA> <NA>
# 69 Monthly Average Dry Bean Prices, $/cwt 1981 <NA>
# 70 - -
# 71 Month <NA>
# 72 - -
# 73 <NA> <NA>
# 74 <NA> <NA>
# 75 Jan <NA>
# 76 Feb <NA>
# 77 Mar <NA>
# 78 Apr <NA>
# 79 May <NA>
# 80 Jun <NA>
# 81 Jul <NA>
# 82 Aug <NA>
# 83 Sep <NA>
# 84 Oct <NA>
# 85 Nov <NA>
# 86 Dec <NA>
# 87 <NA> <NA>
# 88 Ave * <NA>
# 89 = =
# 90 * = Calendar year simple average. <NA>
# 91 N/A = Not available. <NA>
# Col3 Col4 Col5 Col6 Col7 Col8 Col9
# 12 32.00 31.75 31.75 36.25 27.25 27.75 38.50
# 13 32.75 32.25 31.50 36.50 29.25 34.00 38.00
# 14 35.50 33.25 34.00 36.50 30.00 36.00 37.50
# 15 35.25 33.75 33.88 35.75 30.50 38.50 37.50
# 16 35.25 34.25 34.50 36.00 30.75 39.50 38.00
# 17 34.75 34.25 34.50 36.50 31.25 39.50 39.50
# 18 35.00 35.50 34.50 36.75 31.75 40.50 40.50
# 19 35.00 35.50 34.50 36.75 31.75 41.50 40.50
# 20 34.25 35.25 34.25 36.25 32.88 44.50 40.50
# 21 34.25 35.50 34.25 36.12 33.00 49.50 39.75
# 22 34.25 35.75 34.25 36.75 34.50 50.50 39.50
# 23 34.50 36.00 35.50 36.00 34.75 50.50 39.50
# 24 34.50 37.00 35.50 36.25 34.75 51.75 39.00
# 25 34.88 38.00 36.25 35.75 34.50 52.00 39.00
# 26 36.25 38.25 36.25 36.00 34.75 51.75 39.25
# 27 36.25 38.25 36.50 36.00 34.75 51.75 39.00
# 28 36.25 38.50 36.50 36.00 36.25 51.75 38.50
# 29 37.00 38.50 36.75 36.00 40.50 51.50 38.50
# 30 38.75 38.75 37.50 36.00 44.50 51.50 38.50
# 31 43.50 39.25 38.50 37.50 46.25 51.50 38.50
# 32 45.50 39.25 39.00 37.00 45.75 52.00 38.50
# 33 45.00 40.00 39.50 37.50 45.75 51.50 38.37
# 34 45.25 40.25 39.50 37.50 45.50 51.50 38.50
# 35 44.50 40.25 69.00 36.50 45.50 51.50 38.50
# 36 44.50 40.00 38.50 37.00 44.50 51.00 38.50
# 37 43.25 40.00 36.00 36.00 43.75 51.50 38.50
# 38 42.50 40.00 35.50 35.50 43.50 51.00 39.00
# 39 41.00 39.25 <NA> <NA> 42.75 50.25 38.50
# 40 36.50 37.50 <NA> <NA> 38.50 48.50 39.00
# 41 <NA> <NA> <NA> <NA> 36.00 47.50 39.00
# 42 32.25 35.50 <NA> <NA> 33.00 46.00 49.50
# 43 31.25 35.25 30.00 32.00 33.00 44.00 39.50
# 44 29.50 34.75 <NA> 29.00 31.00 43.00 39.50
# 45 28.25 31.00 <NA> <NA> 29.50 37.00 39.50
# 46 25.25 30.50 <NA> <NA> 28.50 33.00 39.50
# 47 23.50 28.88 <NA> <NA> 29.00 <NA> <NA>
# 48 23.75 28.75 24.00 25.00 29.00 31.00 40.50
# 49 22.75 27.75 23.00 26.50 31.00 30.50 40.50
# 50 22.00 27.50 23.75 26.75 34.50 30.50 40.50
# 51 22.25 31.00 23.00 26.50 42.00 30.50 40.50
# 52 22.38 32.00 22.75 26.50 39.50 31.00 41.25
# 53 22.25 31.00 23.00 26.25 39.50 30.50 41.00
# 54 22.00 30.75 22.75 26.00 38.75 30.50 41.00
# 55 21.75 30.75 22.75 26.00 39.50 30.50 41.00
# 56 21.75 30.75 22.25 25.25 38.00 30.50 41.00
# 57 21.75 30.50 22.75 25.00 38.00 30.25 40.75
# 58 21.50 30.25 22.75 25.00 36.25 30.25 40.25
# 59 20.88 30.25 <NA> 23.00 36.50 30.00 40.00
# 60 20.25 30.00 <NA> 22.00 36.25 29.75 39.75
# 61 18.75 30.25 <NA> 22.00 35.50 29.00 39.25
# 62 18.25 28.75 <NA> 22.00 36.00 29.00 39.25
# 63 17.62 28.25 <NA> 22.00 35.50 28.25 38.25
# 64 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 65 = = = = = = =
# 66 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 67 <NA> <NA> <NA> <NA> <NA> <NA>
# 68 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 69 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 70 - - - - - - -
# 71 Pinto Grt.Nort. Pink S_Red Navy B_Lima L_lima
# 72 - - - - - - -
# 73 <NA> <NA> <NA> <NA> <NA> $/cwt <NA>
# 74 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 75 33.88 32.75 32.78 36.25 29.25 34.06 37.88
# 76 35.00 34.88 34.50 36.50 31.38 40.25 39.63
# 77 34.35 35.90 34.75 36.27 33.98 49.35 39.65
# 78 35.91 38.25 36.38 35.94 35.06 51.81 38.94
# 79 41.19 38.94 37.94 36.63 44.25 51.63 38.50
# 80 44.50 40.10 44.50 36.90 45.00 51.40 38.47
# 81 40.00 38.92 35.50 35.50 41.58 49.92 38.83
# 82 30.31 34.13 30.00 30.50 31.63 42.50 42.00
# 83 23.45 28.68 23.58 26.08 30.40 31.25 40.25
# 84 22.22 31.19 22.88 26.31 39.94 30.63 40.94
# 85 21.69 30.56 22.63 25.31 37.94 30.38 40.75
# 86 19.15 29.50 -- 22.20 35.95 29.20 39.30
# 87 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 88 $31.80 $34.48 $32.31 $32.03 $36.36 $41.03 $39.59
# 89 = = = = = = =
# 90 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 91 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# Col10 Col11 Col12 Col13 Col14
# 12 26.75 33.25 31.50 37.00 <NA>
# 13 27.25 33.25 32.00 37.00 <NA>
# 14 28.50 33.50 32.75 37.00 <NA>
# 15 30.50 33.50 34.00 37.00 <NA>
# 16 31.00 33.50 34.25 37.50 <NA>
# 17 31.50 34.00 34.25 37.50 <NA>
# 18 31.25 34.25 34.50 37.50 <NA>
# 19 31.75 34.75 34.50 40.00 <NA>
# 20 32.75 34.75 34.25 41.00 <NA>
# 21 33.50 35.50 34.25 41.00 <NA>
# 22 33.50 36.75 34.25 41.00 <NA>
# 23 33.50 37.50 34.25 41.50 <NA>
# 24 33.25 37.75 34.25 41.50 <NA>
# 25 33.25 38.00 34.25 41.00 <NA>
# 26 33.25 38.00 34.25 42.00 <NA>
# 27 33.25 38.00 34.25 43.50 <NA>
# 28 33.00 38.50 34.00 44.00 <NA>
# 29 32.75 42.50 34.00 44.00 <NA>
# 30 32.75 52.25 34.25 44.50 <NA>
# 31 32.75 52.50 35.00 44.50 <NA>
# 32 32.75 52.50 35.50 44.50 <NA>
# 33 33.50 52.50 38.00 44.50 <NA>
# 34 35.50 52.50 39.00 44.50 <NA>
# 35 37.50 50.50 39.50 44.50 <NA>
# 36 40.62 50.50 39.50 44.50 <NA>
# 37 40.50 50.00 39.50 44.50 <NA>
# 38 39.50 49.00 39.50 46.00 <NA>
# 39 39.00 49.50 38.50 45.00 <NA>
# 40 37.75 <NA> 38.25 44.75 <NA>
# 41 37.50 47.00 38.00 44.50 <NA>
# 42 36.25 <NA> 37.50 44.50 <NA>
# 43 34.50 <NA> 36.00 46.00 <NA>
# 44 32.50 <NA> 34.00 46.00 <NA>
# 45 32.00 <NA> 32.50 46.00 <NA>
# 46 31.50 <NA> 33.50 46.00 <NA>
# 47 <NA> <NA> <NA> <NA> <NA>
# 48 29.50 <NA> 31.25 46.00 <NA>
# 49 28.50 <NA> 30.25 46.00 <NA>
# 50 28.00 36.75 31.50 46.00 <NA>
# 51 28.25 44.50 32.50 46.00 <NA>
# 52 28.25 43.00 33.50 46.00 <NA>
# 53 28.25 42.75 32.50 47.50 <NA>
# 54 28.25 42.25 32.00 47.50 <NA>
# 55 28.50 42.25 31.50 48.00 <NA>
# 56 29.00 42.00 31.25 50.00 <NA>
# 57 29.50 41.75 32.00 50.00 <NA>
# 58 29.50 41.75 32.50 50.00 <NA>
# 59 29.75 41.75 32.50 50.00 <NA>
# 60 29.75 41.00 32.00 50.00 <NA>
# 61 29.75 39.50 32.00 50.00 <NA>
# 62 39.75 39.50 31.50 50.00 <NA>
# 63 39.75 39.50 31.50 50.00 <NA>
# 64 <NA> <NA> <NA> <NA> <NA>
# 65 = = = =
# 66 <NA> <NA> <NA> <NA> <NA>
# 67 <NA> <NA> <NA> <NA> <NA>
# 68 <NA> <NA> <NA> <NA> <NA>
# 69 <NA> <NA> <NA> <NA> <NA>
# 70 - - - - <NA>
# 71 Blackeye S_White Kidney Garbanzo <NA>
# 72 - - - - <NA>
# 73 <NA> <NA> <NA> <NA> <NA>
# 74 <NA> <NA> <NA> <NA> <NA>
# 75 28.25 33.38 32.56 37.00 <NA>
# 76 31.38 34.13 34.38 38.13 <NA>
# 77 33.30 36.45 34.25 41.20 <NA>
# 78 33.19 38.13 34.19 42.63 <NA>
# 79 32.75 49.94 34.69 44.38 <NA>
# 80 37.52 51.20 39.10 44.50 <NA>
# 81 38.75 49.25 38.75 45.25 <NA>
# 82 33.81 -- 35.00 45.63 <NA>
# 83 29.38 36.75 31.63 46.00 <NA>
# 84 28.25 43.13 32.63 46.75 <NA>
# 85 29.13 41.94 31.81 49.50 <NA>
# 86 33.75 40.25 31.90 50.00 <NA>
# 87 <NA> <NA> <NA> <NA> <NA>
# 88 $32.45 $41.32 $34.24 $44.25 <NA>
# 89 = = = = <NA>
# 90 <NA> <NA> <NA> <NA> <NA>
# 91 <NA> <NA> <NA> <NA> <NA>
dealer_price_list[[1]] = dealer_price_list[[1]][-c(64:100),-14]
colnames(dealer_price_list[[1]]) = c("Month", "Day", "Pinto", "Grt_Northern", "Pink", "Sm_Red", "Navy", "B_Lima", "L_Lima", "Blackeye",
"Small_White", "Kidney", "Garbanzo")
glimpse(dealer_price_list[[1]])
Look at how nice that looks! Now it’s time to iterate through all the dataframes. You may be thinking to yourself, “Hey, why don’t you do that with a loop, or with lapply?”. That’s a great idea… except that dataframes vary in their content. Manual brute-force it is.
Let’s skip ahead a bit - with a bit more munging and cleaning, we end up with a very nice long-format dataframe, that you can find here.
Or alternatively:
dealer_price_long = read_csv(file = "https://raw.githubusercontent.com/keatonwilson/beans/master/data/dealer_price_long.csv?token=AefUVKUxTssySEILhpmU2TOfE32UocJfks5bq6gMwA%3D%3D")
# Parsed with column specification:
# cols(
# date = col_date(format = ""),
# Class = col_character(),
# Price = col_double()
# )
dealer_price_long
# # A tibble: 23,984 x 3
# date Class Price
# <date> <chr> <dbl>
# 1 1981-01-06 Pinto 32
# 2 1981-01-13 Pinto 32.8
# 3 1981-01-20 Pinto 35.5
# 4 1981-01-27 Pinto 35.2
# 5 1981-02-03 Pinto 35.2
# 6 1981-02-10 Pinto 34.8
# 7 1981-02-18 Pinto 35
# 8 1981-02-24 Pinto 35
# 9 1981-03-03 Pinto 34.2
# 10 1981-03-10 Pinto 34.2
# # ... with 23,974 more rows
Also note that this is in tibble format now. Thanks, Hadley. :)
Conclusions
This is a great start. We went from an awful Excel Workbook to a slim and trim tidy dataframe with ~24,000 entries of bean prices from 1981-2010 - this is going to be a big chunk of the data we end up splitting into training and test sets down the road for Machine Learning.
A small aside - there was a bunch of munging and cleaning involved between some of the steps above. If you’re interested in a deeper dive into what was entailed, check out the full source code - it’s pretty well annotated and can give you a nice look at things.
Cheers!