Munging

Machine Learning Algorithm for Tucson Housing Prices

Jan 1, 2019

Introduction This project has three main components: 1) to scrape web data on houses in Tucson (prices, beds, baths, and a few other features), 2) to build and test a series of machine learning models that do a good job of accurately predicting the price a house will sell at, and 3) to take this model and build a web interface that folks could use to plug in information on a house in Tucson and get a predicted price.

Classification Workshop: Introduction to Machine Learning in R with caret

Sep 9, 2018

Introduction to Machine Learning in R with caret
Part 1 - What is machine learning? What are the tenets, what is the basic workflow?
Discussion - two questions (5 minutes with the person sitting next to you, then we’ll come together and discuss as a group): What is machine learning? How is it different than statistics?
Some important things to know and think about: Prediction is usually more important than explanation

Updated Bean Model and some additional analytics

Jun 6, 2018

Updating the predictive bean model with new data The predictive model we’ve been exploring so far is based on data from the USDA Economic Research Service, whose database only goes to ~2011. That leaves us missing a lot of data from 2011 to the present - something we would want to incorporate into the model to improve accuracy. I recently corresponded with someone at USDA and was able to track down the rest of the data, so I’m excited to present an updated model, some predictions, and some additional insights we can gain from visualizing the model output in a couple of ways.

A broker simulation leveraging the predictive bean model

May 5, 2018

Hi there! It’s been a while! Sorry for the delay on posting, but things have been a bit crazy with the transition between the end of the semester at UA and moving into Summer research projects! I’ve also become a little obsessed with the Tidy Tuesday challenge on Twitter - it’s a cool project for the #r4ds (R for Data Science) community that sends out weekly datasets that Twitterfolks can munge and visualize with the fantastic tidyverse packages.

Class Market Share - A better visualization

May 5, 2018

Better viz Anna Cates, a blog reader and soil ecologist working at the University of Wisconsin-Madison, had a great suggestion for visualizing market-share change over time for different varietals - the stacked bar chart! I thought I’d write a brief post with some code!

#packages
library(tidyverse)
library(lubridate)

#importing the master data set we've been working with
bean_master_import = read_csv(file = "https://raw.githubusercontent.com/keatonwilson/beans/master/data/bean_master_joined.csv?token=AefUVJns3Rn5W9UiDzbkOhHnKJFGyqHNks5bq6oTwA%3D%3D")

bean_master_import_bar = bean_master_import %>%
  mutate(month = month(date), class = factor(class)) %>%
  group_by(year, month, class) %>%
  summarize(monthly_mean_market_share = mean(class_market_share))

#Making a new column of just the beginning of each month, since we're binning by month
beg_month_date = paste(bean_master_import_bar$year, bean_master_import_bar$month, rep(1, length(bean_master_import_bar$year)), sep = "-")
bean_master_import_bar$beg_month_date = ymd(beg_month_date)

#Filtering to get rid of some of the noise
bean_master_import_bar %>%
  filter(monthly_mean_market_share < 0.
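
The excerpt cuts off before the plotting step, so here is a minimal sketch of what a stacked bar chart of market share over time can look like in ggplot2. It is not the code from the post: the data frame `toy_share` and all of its values are made up for illustration, and only the column names (`beg_month_date`, `class`, `monthly_mean_market_share`) are borrowed from the snippet above.

```r
library(ggplot2)

# Toy data, invented for illustration (not the USDA bean data):
# three months, three bean classes, shares summing to 1 within each month.
toy_share <- data.frame(
  beg_month_date = rep(as.Date(c("2018-01-01", "2018-02-01", "2018-03-01")), each = 3),
  class = factor(rep(c("Pinto", "Black", "Garbanzo"), times = 3)),
  monthly_mean_market_share = c(0.5, 0.3, 0.2,
                                0.4, 0.4, 0.2,
                                0.3, 0.4, 0.3)
)

# One bar per month, with each bar split (stacked) by bean class.
share_plot <- ggplot(toy_share,
                     aes(x = beg_month_date,
                         y = monthly_mean_market_share,
                         fill = class)) +
  geom_col(position = "stack") +
  labs(x = "Month", y = "Mean market share", fill = "Class")

print(share_plot)
```

Because each month's shares sum to one, every bar has the same height and the eye is drawn to how the colored segments grow and shrink over time, which is the point of the stacked layout.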

Data Exploration and Visualization on Bean Market Data

May 5, 2018

Data Exploration I wanted to take a break this week from machine learning and prediction algorithms on the bean data and do a bit of data exploration and visualization of what is a pretty rich data set. The idea here is a bit of a conceptual switch from what we’ve been exploring - here, I’m interested in picking apart trends in the market and examining relationships between the variables. First, let’s import the data from GitHub and get the appropriate packages loaded:

Bean Market Predictions using Machine Learning Algorithms

Apr 4, 2018

Goals As I’ve discussed in earlier posts, the basic premise of this project was to use a nice (but messy) dataset from the USDA on domestic bean markets to explore a variety of different avenues of analysis, visualization and data exploration. One of the main goals of this project was to see if I could build some machine learning models that do a good job of predicting future prices of different classes of beans.

Preprocessing Bean Data (on the road to Machine Learning)

Apr 4, 2018

The gist Let’s dive a bit deeper into the bean project - this post is the first in a series that will hopefully get at the meat of the project. One of the main questions of this endeavor is: Can we build a model that does a good job of predicting future market prices? More generally: If I know something about the price of garbanzo beans today, and some of the market characteristics, can I predict with a good degree of accuracy what the price will be 6 months from now?
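
One way to frame the "price 6 months from now" question for a model is to pair each month's row with the price observed six months later, which then becomes the target. The sketch below shows that idea with `dplyr::lead()`. It is an illustration under assumptions, not the preprocessing used in the post: `toy_prices`, `month_index`, `future_price` and all of the numbers are hypothetical.

```r
library(dplyr)

# Toy monthly prices for two bean classes (values invented for illustration).
toy_prices <- tibble(
  class = rep(c("Garbanzo", "Pinto"), each = 8),
  month_index = rep(1:8, times = 2),
  price = c(30, 31, 29, 32, 33, 35, 34, 36,
            20, 21, 22, 21, 23, 24, 25, 24)
)

# Within each class, attach the price observed 6 months later as the target.
toy_lagged <- toy_prices %>%
  group_by(class) %>%
  arrange(month_index, .by_group = TRUE) %>%
  mutate(future_price = lead(price, n = 6)) %>%
  ungroup() %>%
  # The last 6 months of each class have no future price yet, so drop them.
  filter(!is.na(future_price))

print(toy_lagged)
```

Grouping by class before calling `lead()` matters here: without it, the last Garbanzo rows would pick up Pinto prices as their "future" values.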

The ecologist jumps into the deep, scary waters of machine learning...

Mar 3, 2018

Misconceptions About two years ago, when I started thinking more deeply about the possibility of a career in data science, I made a concerted effort to figure out where my gaps in knowledge were (a phrase we often overuse in science, in my opinion - it feels a bit tired) and what my weaknesses were. I felt strong in R, felt that I had experience munging messy data, and felt strong in my statistical background, but I had virtually no experience with machine learning (ML), which seemed to be all anyone was talking about when I got on the web and started looking at data science positions.

Bean munging and Excel Wrangling

Mar 3, 2018

Messy Excel Files So, as I discussed last time, the first big hurdle in starting to explore the domestic dry bean market data was overcoming the terror of working with a bunch of really messy, really gnarly Excel files. The main one looks like this: Lots of problems, right? The data are in multiple sheets in a single workbook, they’re not uniform, etc. It’s an R-user’s nightmare, but the reality is that data often look like this.

Keaton Wilson