Moneyball Capstone Project
This was the capstone project for the R ‘bootcamp’ I’m in the process of doing. The idea behind this (and apparently the original Moneyball problem) was to use statistics on players to find the best replacements for 3 departing players for a baseball team.
LOADING DATA, PACKAGES, SUMMARIZING ETC.
Set working directory and load packages first
setwd("~/xyz)
library(dplyr)
library(tidyr)
library(ggplot2)
Import the CSV file:
bat <- read.csv('Batting.csv')
head(bat)
Structure of the dataset:
str(bat)
Head of the 'At Bats' column (which is when the batter is facing the pitcher):
head(bat$AB)
Make a batting average column, which is 'Hits' divided by 'At Bats':
bat$BA <- bat$H/bat$AB
New column that is 'On Base Percentage' (how frequently a batter reaches base):
bat$OBP <- (bat$H + bat$BB + bat$HBP)/(bat$AB + bat$BB + bat$HBP + bat$SF)
New column that is 'Slugging Percentage':
bat$X1B <- bat$H - (bat$X2B + bat$X3B + bat$HR)
bat$SLG <- (bat$X1B + (2bat$X2B) + (3bat$X3B) + (4*bat$HR))/bat$AB
Load Salaries CSV file:
sal <- read.csv('Salaries.csv')
CLEANING DATA
Limit batting data to after 1985:
bat <- subset(bat,yearID >= 1985)
Merge salary and batting data:
combo <- data.frame(merge(bat,sal,by = c('playerID','yearID')))
Getting data for 'lost players' i.e. Jason Giambi (giambja01), Johnny Damon (damonjo01), Rainer Gustavo Olmedo (saenzol01):
lost.id <- c('giambja01','damonjo01','saenzol01')
lost_players <- data.frame(subset(combo,playerID %in% lost.id))
lost_players
Only from year 2001:
lost_players <- subset(lost_players,yearID == 2001)
lost_players
Only certain columns:
lostcol <- c("playerID","H","X2B","X3B","HR","OBP","SLG","BA","AB")
lost_players <- subset(lost_players[,lostcol])
lost_players
We have the final list of players to replace. Some constraints to find their replacements: Year is 2001 Total combined salary of the 3 cannot exceed 15 million The combined AB needs to be equal or greater than the lost players Mean OBP has to be equal or greater than the lost players
References for the conditions:
totAB <- sum(lost_players.01$AB)
meanOBP <- mean(lost_players.01$OBP)
Year is 2001
combo <- subset(combo,yearID==2001)
pl <- ggplot(combo,aes(x=OBP,y=salary)) + geom_point()
print(pl)
Make a backup just in case
combo.copy <- combo
Exclude those whose earnings are over 15 million: Since we have an average OBP to meet, restrict that too to over 0.3 at least Exclude with no batting average Have to restrict AB too, we have to reach at least 1469
combo <- subset(combo,salary < 15000000 & OBP >= 0.3 & BA != "NaN" & AB > 450)
head(combo,10)
Restrict to relevant columns
combocol <- c("playerID","teamID.x","AB","HR","BA","OBP","SLG","salary")
Final data frame
com <- combo
com <- subset(com[,combocol])
head(com)
One way is to just eyeball it. But maybe we could make an indicator that shows the AB or OBP per dollar spent?
com$sal <- com$salary/1000000
com$OBP.sal <- round(com$OBP/com$sal,3)
com$AB.sal <- round(com$AB/com$sal,3)
head(com)
head(com %>% arrange(desc(AB.sal,OBP.sal)),10)
So maybe: pujola01, mientdo01 and eckstda01 are the best value for money. But I imagine there are many possible solutions, we could spend a lot more money for better players to try and use up the budget.