Bean munging and Excel Wrangling
Messy Excel Files
So, as I discussed last time, the first big hurdle in starting to explore the domestic dry bean market data was overcoming the terror of working with a bunch of really messy, really gnarly excel files.
The main one looks like this:
Lots of problems, right? The data are in multiple sheets in a single workbook, they’re not uniform, etc. It’s an R-user’s nightmare, but the reality is that data often look like this. So let’s get to work.
A brief point - here is the type of data we can expect from this workbook:
- year
- month
- day
- class (type of bean or variety)
- price
Step 1. Reading in the data
First things first, let’s load up some useful packages
library(tidyverse)
## Warning: package 'dplyr' was built under R version 3.5.1
library(rJava)
library(readxl)
library(XLConnect)
We’ll use the handy functions in readxl and XLConnect to read in the the excel sheets and then use tidyverse to do a bunch of stuff to them to clean them up.
All files can be found at my github repo for this project.
Download the appropriate files to your local working directory - and you’re ready to go.
Let’s read in the data with readxl’s loadWorkbook function (if you were better at making the RCurl package work than I am, you could pull it directly from my GitHub too):
#Loading Workbooks
dealer_price = loadWorkbook(file =
"/Users/KeatonWilson/Documents/Projects/beans2/data/Weekly_Dealer_price.xls")
#Remember your directory will be different
#Turning workbook into a list
dealer_price_list = readWorksheet(dealer_price, sheet = getSheets(dealer_price))
#Lots of errors with this, but don't worry.
#Looks good...lots of NAs, but there is data in there.
head(dealer_price_list[[1]], n = 20)
## Contents..Weekly.Dry.Bean.Prices..by.Class..and.Annual.Summary Col2
## 1 Source: USDA, Agricultural Marketing Service, Bean Market News <NA>
## 2 Last Updated: <NA>
## 3 <NA> <NA>
## 4 <NA> <NA>
## 5 Table 42--U.S. dry bean f.ob. dealer prices, 1981 <NA>
## 6 <NA> <NA>
## 7 Month Day
## 8 <NA> <NA>
## 9 <NA> <NA>
## 10 <NA>
## 11 <NA> <NA>
## 12 1 6
## 13 1 13
## 14 1 20
## 15 1 27
## 16 2 3
## 17 2 10
## 18 2 18
## 19 2 24
## 20 3 3
## Col3 Col4 Col5 Col6 Col7 Col8 Col9
## 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 June 23, 1989 <NA> <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 6 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 7 Pinto Great Pink Small Navy Baby Large
## 8 <NA> Northern <NA> Red <NA> Lima Lima
## 9 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 10 <NA> <NA> <NA> <NA> <NA> Dollars per Cwt <NA>
## 11 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 12 32.00 31.75 31.75 36.25 27.25 27.75 38.50
## 13 32.75 32.25 31.50 36.50 29.25 34.00 38.00
## 14 35.50 33.25 34.00 36.50 30.00 36.00 37.50
## 15 35.25 33.75 33.88 35.75 30.50 38.50 37.50
## 16 35.25 34.25 34.50 36.00 30.75 39.50 38.00
## 17 34.75 34.25 34.50 36.50 31.25 39.50 39.50
## 18 35.00 35.50 34.50 36.75 31.75 40.50 40.50
## 19 35.00 35.50 34.50 36.75 31.75 41.50 40.50
## 20 34.25 35.25 34.25 36.25 32.88 44.50 40.50
## Col10 Col11 Col12 Col13 Col14
## 1 <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA> <NA>
## 6 <NA> <NA> <NA> <NA> <NA>
## 7 Blackeye Small Kidney Garbanzo <NA>
## 8 <NA> White <NA> <NA> <NA>
## 9 <NA> <NA> <NA> <NA> <NA>
## 10 <NA> <NA> <NA> <NA> <NA>
## 11 <NA> <NA> <NA> <NA> <NA>
## 12 26.75 33.25 31.50 37.00 <NA>
## 13 27.25 33.25 32.00 37.00 <NA>
## 14 28.50 33.50 32.75 37.00 <NA>
## 15 30.50 33.50 34.00 37.00 <NA>
## 16 31.00 33.50 34.25 37.50 <NA>
## 17 31.50 34.00 34.25 37.50 <NA>
## 18 31.25 34.25 34.50 37.50 <NA>
## 19 31.75 34.75 34.50 40.00 <NA>
## 20 32.75 34.75 34.25 41.00 <NA>
Great! Now we have a list of the data from each year (where each item in the list is a dataframe of market data for different types of beans for each year).
If we look through all the dataframes in our list - there is a bunch of junk at the top of each one we need to get rid of. Let’s use the handy {r} lapply
function for this.
#Bunch of garbage on the top of every data frame - need to iterate through the list and delete the first 11 rows
dealer_price_list = lapply(dealer_price_list, function(x) x[-c(1:11),])
head(dealer_price_list[[1]], n = 20)
## Contents..Weekly.Dry.Bean.Prices..by.Class..and.Annual.Summary Col2
## 12 1 6
## 13 1 13
## 14 1 20
## 15 1 27
## 16 2 3
## 17 2 10
## 18 2 18
## 19 2 24
## 20 3 3
## 21 3 10
## 22 3 17
## 23 3 24
## 24 3 31
## 25 4 7
## 26 4 14
## 27 4 21
## 28 4 28
## 29 5 5
## 30 5 12
## 31 5 19
## Col3 Col4 Col5 Col6 Col7 Col8 Col9 Col10 Col11 Col12 Col13 Col14
## 12 32.00 31.75 31.75 36.25 27.25 27.75 38.50 26.75 33.25 31.50 37.00 <NA>
## 13 32.75 32.25 31.50 36.50 29.25 34.00 38.00 27.25 33.25 32.00 37.00 <NA>
## 14 35.50 33.25 34.00 36.50 30.00 36.00 37.50 28.50 33.50 32.75 37.00 <NA>
## 15 35.25 33.75 33.88 35.75 30.50 38.50 37.50 30.50 33.50 34.00 37.00 <NA>
## 16 35.25 34.25 34.50 36.00 30.75 39.50 38.00 31.00 33.50 34.25 37.50 <NA>
## 17 34.75 34.25 34.50 36.50 31.25 39.50 39.50 31.50 34.00 34.25 37.50 <NA>
## 18 35.00 35.50 34.50 36.75 31.75 40.50 40.50 31.25 34.25 34.50 37.50 <NA>
## 19 35.00 35.50 34.50 36.75 31.75 41.50 40.50 31.75 34.75 34.50 40.00 <NA>
## 20 34.25 35.25 34.25 36.25 32.88 44.50 40.50 32.75 34.75 34.25 41.00 <NA>
## 21 34.25 35.50 34.25 36.12 33.00 49.50 39.75 33.50 35.50 34.25 41.00 <NA>
## 22 34.25 35.75 34.25 36.75 34.50 50.50 39.50 33.50 36.75 34.25 41.00 <NA>
## 23 34.50 36.00 35.50 36.00 34.75 50.50 39.50 33.50 37.50 34.25 41.50 <NA>
## 24 34.50 37.00 35.50 36.25 34.75 51.75 39.00 33.25 37.75 34.25 41.50 <NA>
## 25 34.88 38.00 36.25 35.75 34.50 52.00 39.00 33.25 38.00 34.25 41.00 <NA>
## 26 36.25 38.25 36.25 36.00 34.75 51.75 39.25 33.25 38.00 34.25 42.00 <NA>
## 27 36.25 38.25 36.50 36.00 34.75 51.75 39.00 33.25 38.00 34.25 43.50 <NA>
## 28 36.25 38.50 36.50 36.00 36.25 51.75 38.50 33.00 38.50 34.00 44.00 <NA>
## 29 37.00 38.50 36.75 36.00 40.50 51.50 38.50 32.75 42.50 34.00 44.00 <NA>
## 30 38.75 38.75 37.50 36.00 44.50 51.50 38.50 32.75 52.25 34.25 44.50 <NA>
## 31 43.50 39.25 38.50 37.50 46.25 51.50 38.50 32.75 52.50 35.00 44.50 <NA>
Ok, now it gets gnarly.
Here is the problem - not all data frames have the same variables, and it looks like variable names are hidden somewhere down in each dataframe.
I ended up going through each data frame manually - obviously this doesn’t scale super well - and is laborious, but I couldn’t come up with a clever way to automate this, but here is an example of the type of code I used:
#1981
glimpse(dealer_price_list[[1]])
## Observations: 80
## Variables: 14
## $ Contents..Weekly.Dry.Bean.Prices..by.Class..and.Annual.Summary <chr> ...
## $ Col2 <chr> ...
## $ Col3 <chr> ...
## $ Col4 <chr> ...
## $ Col5 <chr> ...
## $ Col6 <chr> ...
## $ Col7 <chr> ...
## $ Col8 <chr> ...
## $ Col9 <chr> ...
## $ Col10 <chr> ...
## $ Col11 <chr> ...
## $ Col12 <chr> ...
## $ Col13 <chr> ...
## $ Col14 <chr> ...
dealer_price_list[[1]]
## Contents..Weekly.Dry.Bean.Prices..by.Class..and.Annual.Summary Col2
## 12 1 6
## 13 1 13
## 14 1 20
## 15 1 27
## 16 2 3
## 17 2 10
## 18 2 18
## 19 2 24
## 20 3 3
## 21 3 10
## 22 3 17
## 23 3 24
## 24 3 31
## 25 4 7
## 26 4 14
## 27 4 21
## 28 4 28
## 29 5 5
## 30 5 12
## 31 5 19
## 32 5 27
## 33 6 2
## 34 6 9
## 35 6 16
## 36 6 23
## 37 6 30
## 38 7 7
## 39 7 14
## 40 7 21
## 41 7 28
## 42 8 4
## 43 8 11
## 44 8 18
## 45 8 25
## 46 9 1
## 47 9 9
## 48 9 15
## 49 9 22
## 50 9 29
## 51 10 6
## 52 10 14
## 53 10 20
## 54 10 27
## 55 11 3
## 56 11 10
## 57 11 17
## 58 11 24
## 59 12 1
## 60 12 8
## 61 12 16
## 62 12 22
## 63 12 29
## 64 <NA>
## 65 = =
## 66 <NA> <NA>
## 67 <NA> <NA>
## 68 <NA> <NA>
## 69 Monthly Average Dry Bean Prices, $/cwt 1981 <NA>
## 70 - -
## 71 Month <NA>
## 72 - -
## 73 <NA> <NA>
## 74 <NA> <NA>
## 75 Jan <NA>
## 76 Feb <NA>
## 77 Mar <NA>
## 78 Apr <NA>
## 79 May <NA>
## 80 Jun <NA>
## 81 Jul <NA>
## 82 Aug <NA>
## 83 Sep <NA>
## 84 Oct <NA>
## 85 Nov <NA>
## 86 Dec <NA>
## 87 <NA> <NA>
## 88 Ave * <NA>
## 89 = =
## 90 * = Calendar year simple average. <NA>
## 91 N/A = Not available. <NA>
## Col3 Col4 Col5 Col6 Col7 Col8 Col9
## 12 32.00 31.75 31.75 36.25 27.25 27.75 38.50
## 13 32.75 32.25 31.50 36.50 29.25 34.00 38.00
## 14 35.50 33.25 34.00 36.50 30.00 36.00 37.50
## 15 35.25 33.75 33.88 35.75 30.50 38.50 37.50
## 16 35.25 34.25 34.50 36.00 30.75 39.50 38.00
## 17 34.75 34.25 34.50 36.50 31.25 39.50 39.50
## 18 35.00 35.50 34.50 36.75 31.75 40.50 40.50
## 19 35.00 35.50 34.50 36.75 31.75 41.50 40.50
## 20 34.25 35.25 34.25 36.25 32.88 44.50 40.50
## 21 34.25 35.50 34.25 36.12 33.00 49.50 39.75
## 22 34.25 35.75 34.25 36.75 34.50 50.50 39.50
## 23 34.50 36.00 35.50 36.00 34.75 50.50 39.50
## 24 34.50 37.00 35.50 36.25 34.75 51.75 39.00
## 25 34.88 38.00 36.25 35.75 34.50 52.00 39.00
## 26 36.25 38.25 36.25 36.00 34.75 51.75 39.25
## 27 36.25 38.25 36.50 36.00 34.75 51.75 39.00
## 28 36.25 38.50 36.50 36.00 36.25 51.75 38.50
## 29 37.00 38.50 36.75 36.00 40.50 51.50 38.50
## 30 38.75 38.75 37.50 36.00 44.50 51.50 38.50
## 31 43.50 39.25 38.50 37.50 46.25 51.50 38.50
## 32 45.50 39.25 39.00 37.00 45.75 52.00 38.50
## 33 45.00 40.00 39.50 37.50 45.75 51.50 38.37
## 34 45.25 40.25 39.50 37.50 45.50 51.50 38.50
## 35 44.50 40.25 69.00 36.50 45.50 51.50 38.50
## 36 44.50 40.00 38.50 37.00 44.50 51.00 38.50
## 37 43.25 40.00 36.00 36.00 43.75 51.50 38.50
## 38 42.50 40.00 35.50 35.50 43.50 51.00 39.00
## 39 41.00 39.25 <NA> <NA> 42.75 50.25 38.50
## 40 36.50 37.50 <NA> <NA> 38.50 48.50 39.00
## 41 <NA> <NA> <NA> <NA> 36.00 47.50 39.00
## 42 32.25 35.50 <NA> <NA> 33.00 46.00 49.50
## 43 31.25 35.25 30.00 32.00 33.00 44.00 39.50
## 44 29.50 34.75 <NA> 29.00 31.00 43.00 39.50
## 45 28.25 31.00 <NA> <NA> 29.50 37.00 39.50
## 46 25.25 30.50 <NA> <NA> 28.50 33.00 39.50
## 47 23.50 28.88 <NA> <NA> 29.00 <NA> <NA>
## 48 23.75 28.75 24.00 25.00 29.00 31.00 40.50
## 49 22.75 27.75 23.00 26.50 31.00 30.50 40.50
## 50 22.00 27.50 23.75 26.75 34.50 30.50 40.50
## 51 22.25 31.00 23.00 26.50 42.00 30.50 40.50
## 52 22.38 32.00 22.75 26.50 39.50 31.00 41.25
## 53 22.25 31.00 23.00 26.25 39.50 30.50 41.00
## 54 22.00 30.75 22.75 26.00 38.75 30.50 41.00
## 55 21.75 30.75 22.75 26.00 39.50 30.50 41.00
## 56 21.75 30.75 22.25 25.25 38.00 30.50 41.00
## 57 21.75 30.50 22.75 25.00 38.00 30.25 40.75
## 58 21.50 30.25 22.75 25.00 36.25 30.25 40.25
## 59 20.88 30.25 <NA> 23.00 36.50 30.00 40.00
## 60 20.25 30.00 <NA> 22.00 36.25 29.75 39.75
## 61 18.75 30.25 <NA> 22.00 35.50 29.00 39.25
## 62 18.25 28.75 <NA> 22.00 36.00 29.00 39.25
## 63 17.62 28.25 <NA> 22.00 35.50 28.25 38.25
## 64 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 65 = = = = = = =
## 66 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 67 <NA> <NA> <NA> <NA> <NA> <NA>
## 68 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 69 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 70 - - - - - - -
## 71 Pinto Grt.Nort. Pink S_Red Navy B_Lima L_lima
## 72 - - - - - - -
## 73 <NA> <NA> <NA> <NA> <NA> $/cwt <NA>
## 74 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 75 33.88 32.75 32.78 36.25 29.25 34.06 37.88
## 76 35.00 34.88 34.50 36.50 31.38 40.25 39.63
## 77 34.35 35.90 34.75 36.27 33.98 49.35 39.65
## 78 35.91 38.25 36.38 35.94 35.06 51.81 38.94
## 79 41.19 38.94 37.94 36.63 44.25 51.63 38.50
## 80 44.50 40.10 44.50 36.90 45.00 51.40 38.47
## 81 40.00 38.92 35.50 35.50 41.58 49.92 38.83
## 82 30.31 34.13 30.00 30.50 31.63 42.50 42.00
## 83 23.45 28.68 23.58 26.08 30.40 31.25 40.25
## 84 22.22 31.19 22.88 26.31 39.94 30.63 40.94
## 85 21.69 30.56 22.63 25.31 37.94 30.38 40.75
## 86 19.15 29.50 -- 22.20 35.95 29.20 39.30
## 87 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 88 $31.80 $34.48 $32.31 $32.03 $36.36 $41.03 $39.59
## 89 = = = = = = =
## 90 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 91 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## Col10 Col11 Col12 Col13 Col14
## 12 26.75 33.25 31.50 37.00 <NA>
## 13 27.25 33.25 32.00 37.00 <NA>
## 14 28.50 33.50 32.75 37.00 <NA>
## 15 30.50 33.50 34.00 37.00 <NA>
## 16 31.00 33.50 34.25 37.50 <NA>
## 17 31.50 34.00 34.25 37.50 <NA>
## 18 31.25 34.25 34.50 37.50 <NA>
## 19 31.75 34.75 34.50 40.00 <NA>
## 20 32.75 34.75 34.25 41.00 <NA>
## 21 33.50 35.50 34.25 41.00 <NA>
## 22 33.50 36.75 34.25 41.00 <NA>
## 23 33.50 37.50 34.25 41.50 <NA>
## 24 33.25 37.75 34.25 41.50 <NA>
## 25 33.25 38.00 34.25 41.00 <NA>
## 26 33.25 38.00 34.25 42.00 <NA>
## 27 33.25 38.00 34.25 43.50 <NA>
## 28 33.00 38.50 34.00 44.00 <NA>
## 29 32.75 42.50 34.00 44.00 <NA>
## 30 32.75 52.25 34.25 44.50 <NA>
## 31 32.75 52.50 35.00 44.50 <NA>
## 32 32.75 52.50 35.50 44.50 <NA>
## 33 33.50 52.50 38.00 44.50 <NA>
## 34 35.50 52.50 39.00 44.50 <NA>
## 35 37.50 50.50 39.50 44.50 <NA>
## 36 40.62 50.50 39.50 44.50 <NA>
## 37 40.50 50.00 39.50 44.50 <NA>
## 38 39.50 49.00 39.50 46.00 <NA>
## 39 39.00 49.50 38.50 45.00 <NA>
## 40 37.75 <NA> 38.25 44.75 <NA>
## 41 37.50 47.00 38.00 44.50 <NA>
## 42 36.25 <NA> 37.50 44.50 <NA>
## 43 34.50 <NA> 36.00 46.00 <NA>
## 44 32.50 <NA> 34.00 46.00 <NA>
## 45 32.00 <NA> 32.50 46.00 <NA>
## 46 31.50 <NA> 33.50 46.00 <NA>
## 47 <NA> <NA> <NA> <NA> <NA>
## 48 29.50 <NA> 31.25 46.00 <NA>
## 49 28.50 <NA> 30.25 46.00 <NA>
## 50 28.00 36.75 31.50 46.00 <NA>
## 51 28.25 44.50 32.50 46.00 <NA>
## 52 28.25 43.00 33.50 46.00 <NA>
## 53 28.25 42.75 32.50 47.50 <NA>
## 54 28.25 42.25 32.00 47.50 <NA>
## 55 28.50 42.25 31.50 48.00 <NA>
## 56 29.00 42.00 31.25 50.00 <NA>
## 57 29.50 41.75 32.00 50.00 <NA>
## 58 29.50 41.75 32.50 50.00 <NA>
## 59 29.75 41.75 32.50 50.00 <NA>
## 60 29.75 41.00 32.00 50.00 <NA>
## 61 29.75 39.50 32.00 50.00 <NA>
## 62 39.75 39.50 31.50 50.00 <NA>
## 63 39.75 39.50 31.50 50.00 <NA>
## 64 <NA> <NA> <NA> <NA> <NA>
## 65 = = = =
## 66 <NA> <NA> <NA> <NA> <NA>
## 67 <NA> <NA> <NA> <NA> <NA>
## 68 <NA> <NA> <NA> <NA> <NA>
## 69 <NA> <NA> <NA> <NA> <NA>
## 70 - - - - <NA>
## 71 Blackeye S_White Kidney Garbanzo <NA>
## 72 - - - - <NA>
## 73 <NA> <NA> <NA> <NA> <NA>
## 74 <NA> <NA> <NA> <NA> <NA>
## 75 28.25 33.38 32.56 37.00 <NA>
## 76 31.38 34.13 34.38 38.13 <NA>
## 77 33.30 36.45 34.25 41.20 <NA>
## 78 33.19 38.13 34.19 42.63 <NA>
## 79 32.75 49.94 34.69 44.38 <NA>
## 80 37.52 51.20 39.10 44.50 <NA>
## 81 38.75 49.25 38.75 45.25 <NA>
## 82 33.81 -- 35.00 45.63 <NA>
## 83 29.38 36.75 31.63 46.00 <NA>
## 84 28.25 43.13 32.63 46.75 <NA>
## 85 29.13 41.94 31.81 49.50 <NA>
## 86 33.75 40.25 31.90 50.00 <NA>
## 87 <NA> <NA> <NA> <NA> <NA>
## 88 $32.45 $41.32 $34.24 $44.25 <NA>
## 89 = = = = <NA>
## 90 <NA> <NA> <NA> <NA> <NA>
## 91 <NA> <NA> <NA> <NA> <NA>
dealer_price_list[[1]] = dealer_price_list[[1]][-c(64:100),-14]
colnames(dealer_price_list[[1]]) = c("Month", "Day", "Pinto", "Grt_Northern", "Pink", "Sm_Red", "Navy", "B_Lima", "L_Lima", "Blackeye",
"Small_White", "Kidney", "Garbanzo")
glimpse(dealer_price_list[[1]])
## Observations: 63
## Variables: 13
## $ Month <chr> "1", "1", "1", "1", "2", "2", "2", "2", "3", "3",...
## $ Day <chr> "6", "13", "20", "27", "3", "10", "18", "24", "3"...
## $ Pinto <chr> "32.00", "32.75", "35.50", "35.25", "35.25", "34....
## $ Grt_Northern <chr> "31.75", "32.25", "33.25", "33.75", "34.25", "34....
## $ Pink <chr> "31.75", "31.50", "34.00", "33.88", "34.50", "34....
## $ Sm_Red <chr> "36.25", "36.50", "36.50", "35.75", "36.00", "36....
## $ Navy <chr> "27.25", "29.25", "30.00", "30.50", "30.75", "31....
## $ B_Lima <chr> "27.75", "34.00", "36.00", "38.50", "39.50", "39....
## $ L_Lima <chr> "38.50", "38.00", "37.50", "37.50", "38.00", "39....
## $ Blackeye <chr> "26.75", "27.25", "28.50", "30.50", "31.00", "31....
## $ Small_White <chr> "33.25", "33.25", "33.50", "33.50", "33.50", "34....
## $ Kidney <chr> "31.50", "32.00", "32.75", "34.00", "34.25", "34....
## $ Garbanzo <chr> "37.00", "37.00", "37.00", "37.00", "37.50", "37....
Look at how nice that looks! Now it’s time to iterate through all the dataframes. You may be thinking to yourself, “Hey, why don’t you do that with a loop, or with lapply?”. That’s a great idea… except that dataframes vary in their content. Manual brute-force it is.
Let’s skip ahead a bit - with a bit more munging and cleaning, we end up with a very nice long-format dataframe, that you can find here.
Or alternatively:
dealer_price_long = read_csv(file = "https://raw.githubusercontent.com/keatonwilson/beans/master/data/dealer_price_long.csv?token=AefUVKUxTssySEILhpmU2TOfE32UocJfks5bq6gMwA%3D%3D")
## Parsed with column specification:
## cols(
## date = col_date(format = ""),
## Class = col_character(),
## Price = col_double()
## )
dealer_price_long
## # A tibble: 23,984 x 3
## date Class Price
## <date> <chr> <dbl>
## 1 1981-01-06 Pinto 32
## 2 1981-01-13 Pinto 32.8
## 3 1981-01-20 Pinto 35.5
## 4 1981-01-27 Pinto 35.2
## 5 1981-02-03 Pinto 35.2
## 6 1981-02-10 Pinto 34.8
## 7 1981-02-18 Pinto 35
## 8 1981-02-24 Pinto 35
## 9 1981-03-03 Pinto 34.2
## 10 1981-03-10 Pinto 34.2
## # ... with 23,974 more rows
Also note that this is in tibble format now. Thanks, Hadley. :)
Conclusions
This is a great start. We went from an awful Excel Workbook to a slim and trim tidy dataframe with ~24,000 entries of bean prices from 1981-2010 - this is going to be a big chunk of the data we end up splitting into training and test sets down the road for Machine Learning.
A small aside - there was a bunch of munging and cleaning involved between some of the steps above. If you’re interested in a deeper dive into what was entailed, check out the full source code - it’s pretty well annotated and can give you a nice look at things.
Cheers!