Moneyball Capstone Project

This was the capstone project for the R ‘bootcamp’ I’m in the process of doing. The idea behind this (and apparently the original Moneyball problem) was to use statistics on players to find the best replacements for 3 departing players for a baseball team.

LOADING DATA, PACKAGES, SUMMARIZING ETC.

Set working directory and load packages first

setwd("~/xyz)

library(dplyr)

library(tidyr)

library(ggplot2)

Import the CSV file:

bat <- read.csv('Batting.csv')

head(bat)

Structure of the dataset:

str(bat)

Head of the 'At Bats' column (which is when the batter is facing the pitcher):

head(bat$AB)

Make a batting average column, which is 'Hits' divided by 'At Bats':

bat$BA <- bat$H/bat$AB

New column that is 'On Base Percentage' (how frequently a batter reaches base):

bat$OBP <- (bat$H + bat$BB + bat$HBP)/(bat$AB + bat$BB + bat$HBP + bat$SF)

New column that is 'Slugging Percentage':

bat$X1B <- bat$H - (bat$X2B + bat$X3B + bat$HR) bat$SLG <- (bat$X1B + (2bat$X2B) + (3bat$X3B) + (4*bat$HR))/bat$AB

Load Salaries CSV file: sal <- read.csv('Salaries.csv')


CLEANING DATA

Limit batting data to after 1985:

bat <- subset(bat,yearID >= 1985)

Merge salary and batting data:

combo <- data.frame(merge(bat,sal,by = c('playerID','yearID')))

Getting data for 'lost players' i.e. Jason Giambi (giambja01), Johnny Damon (damonjo01), Rainer Gustavo Olmedo (saenzol01):

lost.id <- c('giambja01','damonjo01','saenzol01') lost_players <- data.frame(subset(combo,playerID %in% lost.id)) lost_players

Only from year 2001:

lost_players <- subset(lost_players,yearID == 2001) lost_players

Only certain columns:

lostcol <- c("playerID","H","X2B","X3B","HR","OBP","SLG","BA","AB") lost_players <- subset(lost_players[,lostcol]) lost_players


We have the final list of players to replace. Some constraints to find their replacements: Year is 2001 Total combined salary of the 3 cannot exceed 15 million The combined AB needs to be equal or greater than the lost players Mean OBP has to be equal or greater than the lost players

References for the conditions:

totAB <- sum(lost_players.01$AB)

meanOBP <- mean(lost_players.01$OBP)

Year is 2001

combo <- subset(combo,yearID==2001) pl <- ggplot(combo,aes(x=OBP,y=salary)) + geom_point() print(pl)

Make a backup just in case

combo.copy <- combo

Exclude those whose earnings are over 15 million: Since we have an average OBP to meet, restrict that too to over 0.3 at least Exclude with no batting average Have to restrict AB too, we have to reach at least 1469

combo <- subset(combo,salary < 15000000 & OBP >= 0.3 & BA != "NaN" & AB > 450)

head(combo,10)

Restrict to relevant columns

combocol <- c("playerID","teamID.x","AB","HR","BA","OBP","SLG","salary")

Final data frame com <- combo

com <- subset(com[,combocol]) head(com)

One way is to just eyeball it. But maybe we could make an indicator that shows the AB or OBP per dollar spent?

com$sal <- com$salary/1000000 com$OBP.sal <- round(com$OBP/com$sal,3) com$AB.sal <- round(com$AB/com$sal,3)

head(com) head(com %>% arrange(desc(AB.sal,OBP.sal)),10)

So maybe: pujola01, mientdo01 and eckstda01 are the best value for money. But I imagine there are many possible solutions, we could spend a lot more money for better players to try and use up the budget.


Next
Next

Infographic on the energy systems of the future