Data Analyst and Lover of Baseball and Beer
By Doug Duffy | 22/12/2016
Major League Baseball (MLB) has a long history of player evaluation or scouting, with teams operating Minor League Baseball (MiLB) facilities around the US, and abroad, with players born all over the world. Increasingly, MLB teams are using analytics and machine learning to augment the decisions and opinions of the scouts’ eyes on the ground. The analytics movement has become so established there’s even been one well-publicized cyber-spying incident between MLB clubs.
In this project, I’ve used machine learning to analyze the statistics of every Minor League Baseball player dating back to 1992 in an effort to predict how successful a future MLB career each player will have, as measured by Wins Above Replacement (WAR), with the full code available on GitHub. In this portion, I’ll summarize in plain terms the prediction process and what the final model says about the top minor leaguers of 2016 and why. You can cruise the full player projections yourself, not only for this year’s players, but also for how the model would’ve projected players dating back to 1992. I’ve tried to keep the nerdy math stuff to the Model Guts section, leaving the Model Building and Final Model sections to talk in broader terms about the process and results.
In order to build a model, you need data, and previously I’ve compiled a SQL database (GitHub) of all the Major and Minor League statistics for every baseball player from the information on Baseball-Reference. Undoubtedly, MLB teams have a larger variety of data on minor leaguers at their disposal, everything from Statcast data to something as simple as velocity readings for pitchers or prospect rankings/ratings. Lacking these other proprietary sources of data, I’ve settled for the publicly available data, comprising the player’s Age, League, Level and standard stat categories (H, R, RBI, Avg., etc.). I’ve added to these stats some additional simple metrics favored by sabermetricians such as BB%, BABIP and wOBA.
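For readers unfamiliar with the added metrics, here is a minimal Python sketch of two of them using their standard textbook definitions: BB% as walks per plate appearance, and BABIP as (H - HR) / (AB - SO - HR + SF). It is not taken from the project's code; the function names, the dictionary keys and the sample season line are all made up for illustration. wOBA is left out because its weights change from season to season.

```python
def bb_rate(bb, pa):
    """Walk rate (BB%): walks divided by plate appearances."""
    return bb / pa if pa else None


def babip(h, hr, ab, so, sf):
    """Batting average on balls in play: (H - HR) / (AB - SO - HR + SF)."""
    balls_in_play = ab - so - hr + sf
    return (h - hr) / balls_in_play if balls_in_play else None


# Hypothetical season line, invented purely for this example.
season = {"PA": 560, "AB": 500, "H": 150, "HR": 20, "BB": 50, "SO": 100, "SF": 5}

season_bb_rate = bb_rate(season["BB"], season["PA"])
season_babip = babip(season["H"], season["HR"], season["AB"], season["SO"], season["SF"])

print(f"BB%:   {season_bb_rate:.3f}")
print(f"BABIP: {season_babip:.3f}")
```

Both functions return None when the denominator is zero, so a player with no plate appearances or no balls in play is flagged as undefined rather than given a misleading 0.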
Predictive modeling is the process of using computer algorithms to examine a historical dataset in order to predict a future outcome based on the same set of input data. In our case, that means looking at the historical Minor and Major League statistical data and using it to predict the future Major League success of current Minor Leaguers based on their stats. There are a variety of algorithms for doing this and no way of knowing a priori which model will work best for a given dataset. The process is therefore trial and error in nature: fit a particular model and then see how accurately it performs on a randomly selected portion of the historical data. The accuracy of the models, as judged by root-mean-square error (RMSE), for both hitters and pitchers on five random sections of the historical data can be seen below. The “gbm” or Gradient Boosting Machine model slightly outperforms the other models, producing the lowest RMSE for both hitters and pitchers.
Figure 1 : RMSE of Base Models (a) Hitters (b) Pitchers
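To make the fit-then-score loop concrete, here is a small Python sketch of the general idea: split the data into five random sections, fit on four, and compute RMSE on the one held out. It is an illustration only and not the project's code. The two toy models (a constant baseline and a one-variable line) stand in for the real candidates, and the data is synthetic, generated just for this example.

```python
import math
import random


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def k_fold_rmse(fit, xs, ys, k=5, seed=0):
    """Fit on k-1 folds, score on the held-out fold; returns one RMSE per fold."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(rmse([ys[i] for i in held_out], [predict(xs[i]) for i in held_out]))
    return scores


def fit_mean(xs, ys):
    """Toy baseline: always predict the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_line(xs, ys):
    """Toy model: one-variable least-squares line."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return lambda x: my + slope * (x - mx)


# Synthetic data, invented for illustration: a wOBA-like input and a WAR-like response.
rng = random.Random(42)
xs = [rng.uniform(0.250, 0.450) for _ in range(200)]
ys = [30 * (x - 0.300) + rng.gauss(0, 1) for x in xs]

mean_scores = k_fold_rmse(fit_mean, xs, ys)
line_scores = k_fold_rmse(fit_line, xs, ys)
print("baseline RMSE per fold:", [round(s, 2) for s in mean_scores])
print("line RMSE per fold:    ", [round(s, 2) for s in line_scores])
```

Because both models are scored on exactly the same folds, the per-fold RMSE values are directly comparable, which is the kind of comparison a plot like Figure 1 summarizes.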
One can use the GBM model that has been built to examine which variables the model views as more useful or important in its process of predicting future MLB success. Due to the complexity of the algorithm used to produce the model, the variable importance is not as simple as saying if a player’s batting average is X points higher in the minors then the player is expected to produce Y more WAR as a major leaguer. Put mathematically, the response (future success) is not linearly related to the inputs (minor league stats). The more “advanced” regression techniques allow for non-linear relationships between inputs and response or are based on tree-based decision-making; these particulars are discussed further in the other articles.
Due to these intricacies, it is typical to normalize the importance of the most important variable to a value of 100, with less important variables having values ranging between 0 and 100. Plots of the importance of the variables in both the hitter and pitcher GBM models are shown below. The most important variable for all minor leaguers is their age relative to their league, “AgeDif”, followed by simply their league. All stats a player accomplishes are less important than these two simple pieces of information, age and league, particularly so for hitters. In terms of stats, for hitters the most important are wOBA, BB:SO ratio, strikeout %, On-base Percentage, and On-base Plus Slugging, in that order. For pitchers, strikeouts per 9 IP, raw age, WHIP, games and Run Average are the most important. It’s interesting to note that Earned Run Average, in theory accounting for defensive errors, is actually less predictive than Run Average. Explanation regarding the remainder of the variables investigated can be found in the Model Building section.
Figure 2 : Variable Importance GBM Models for (a) Hitters (b) Pitchers
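The normalization to 100 is simple arithmetic: divide every raw importance by the largest one and multiply by 100. A short Python sketch follows. The raw numbers are invented for the example and are not the model's actual importances; only the variable names come from the discussion above.

```python
def scale_importance(raw):
    """Rescale raw importances so the most important variable scores 100."""
    top = max(raw.values())
    scaled = {name: 100.0 * value / top for name, value in raw.items()}
    # Order from most to least important, as in a typical importance plot.
    return dict(sorted(scaled.items(), key=lambda item: item[1], reverse=True))


# Made-up raw importances in arbitrary units, for illustration only.
raw_importance = {"wOBA": 90.0, "AgeDif": 400.0, "League": 250.0, "BB:SO": 60.0}

scaled = scale_importance(raw_importance)
for name, value in scaled.items():
    print(f"{name:8s} {value:6.1f}")
```

The scaled values only rank variables against each other within one model. A score of 50 means "half as important as the top variable in this model", not a fixed amount of WAR.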
Model ensembling is another technique that can be used to slightly improve the accuracy of the model. This process relies on a simple fact: every model is correct some of the time, but incorrect at other times. By examining the predictions of a variety of models for a single data point, in this case a player, the errors of an individual model can sometimes be corrected. At a mathematical level this ensembling can be as simple as taking an average projection or as complicated as running a “stacked” GBM model, sometimes called a “meta-model”, that uses the individual “base” model projections as inputs. For the dataset of minor league baseball statistics, it turns out taking a weighted average of the “base” predictions yields the lowest RMSE, though this only represents a small improvement, from an RMSE of 2.263 to 2.235 (~1%).
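Here is a toy Python sketch of the weighted-average idea, again as an illustration rather than the project's code. Two hypothetical base models make errors that partly offset each other, and a coarse grid search picks the blend weight with the lowest RMSE. All of the numbers are invented, and the improvement is exaggerated compared with the roughly 1% seen on the real data.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def weighted_average(predictions, weights):
    """Blend several models' predictions, one weight per model."""
    total = sum(weights)
    n = len(predictions[0])
    return [sum(w * p[i] for w, p in zip(weights, predictions)) / total for i in range(n)]


# Invented WAR-like values and two hypothetical base models' projections.
actual = [2.0, 0.0, 5.0, 1.0, 3.0]
model_a = [3.0, 0.5, 4.0, 1.5, 2.0]
model_b = [1.5, -0.5, 6.5, 0.0, 3.5]

# Try weights 0.00, 0.05, ..., 1.00 on model A (the rest goes to model B).
candidates = [i / 20 for i in range(21)]
best_w = min(candidates,
             key=lambda w: rmse(actual, weighted_average([model_a, model_b], [w, 1 - w])))
best_rmse = rmse(actual, weighted_average([model_a, model_b], [best_w, 1 - best_w]))

print(f"model A RMSE: {rmse(actual, model_a):.3f}")
print(f"model B RMSE: {rmse(actual, model_b):.3f}")
print(f"blend RMSE:   {best_rmse:.3f} (weight on A = {best_w:.2f})")
```

In practice the weights should be chosen on data the base models were not trained on. Otherwise the blend can look better than it really is.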
Figure 3 : 2016 Top 10 Projections (a) Hitters (b) Pitchers
The top projected hitters and pitchers for 2016 are shown above. For hitters, only offensive WAR (oWAR) is projected, not accounting for defensive value. For both hitters and pitchers, this value represents their projected WAR over the first six years of their MLB career, for reasons elaborated on later. After getting drained of talent in 2015, it’s a relatively weak field of 2016 hitter projections: going back to 1992, it’s the first year without a projection for a 10 oWAR hitter. A number of these top players have already seen action in MLB during 2016 (Conforto and Benintendi), but the stats say to also keep your eye on Ozzie Albies and Austin Meadows in 2017. Pitchers are consistently projected for roughly half the WAR of hitters, maxing out around 10 WAR rather than 20 oWAR historically. Most of the top projected pitchers have appeared in MLB; others have been up and down, some numerous times. Triston McKenzie and Sean Reid-Foley are both lower in the minors and far from guaranteed to be seen on a 2017 MLB roster, but are projected for success when they get there.