Data Analyst and Lover of Baseball and Beer
By Doug Duffy | 21/02/2016
This short “How To” section explains in plain terms how the data for the MLB Player Birth Series was acquired, organized and plotted. I will NOT walk through the project's actual code here, but for those interested the GitHub repository is here, and the analysis should be fully reproducible.
Most of the data for this project came from Baseball Reference, though the Steamer Projections for 2016 Wins Above Replacement (WAR) were taken from Fangraphs. WAR is a catchall statistic that attempts to summarize a player’s total contribution to his team (hitting, pitching, fielding, running, etc.). This type of statistic is required in this exercise, as we must have a single statistic that can be applied to both pitchers and hitters.
The biggest step in grabbing all this data from Baseball Reference is compiling a list of every player’s unique ID. This list can then be used to programmatically visit every player’s page and take whatever information we require. Luckily, this task is common to many projects in baseball research, and I had already compiled the list when putting together a database of all Minor League and MLB historical stats. The GitHub repository for that project is here, and it should also be reproducible if you want to build your own SQL database. This database provided not only each player’s ID, but also the seasons he played and his WAR in each one. With every MLB player’s unique Baseball Reference ID in hand, we can grab the remaining data we require: name, birth, death, picture link, handedness, height and weight.
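For readers who want to see the idea in code, here is a minimal Python sketch of turning a list of player IDs into page addresses. It is not the project's actual code (that was written in R). The URL layout and the helper name `player_url` are assumptions for illustration, and the two IDs are just examples.

```python
def player_url(player_id: str) -> str:
    """Build a player-page URL from a Baseball Reference style ID.

    Assumed layout (not confirmed by the article):
    /players/<first letter of the ID>/<ID>.shtml
    """
    player_id = player_id.strip().lower()
    if not player_id:
        raise ValueError("player_id must not be empty")
    return (
        "https://www.baseball-reference.com/players/"
        f"{player_id[0]}/{player_id}.shtml"
    )


# Example IDs; the real list came from the author's database.
example_ids = ["ruthba01", "aaronha01"]
example_urls = [player_url(pid) for pid in example_ids]
```

Each address in `example_urls` could then be fetched and parsed one at a time, ideally with a pause between requests to stay polite to the site.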
Unfortunately, Baseball Reference does not list the latitude and longitude that will later be required to make the interactive map. However, it does list a birthplace for all recent players (only a handful of players born in the mid-1800s lack birthplace info), and this allows us to look up the latitude and longitude using Google’s Geocoding API. This is just a fancy Google Maps-like tool that lets you programmatically ask Google for the latitude and longitude of any input, like “Philadelphia, PA”. It worked remarkably well, failing for only a handful of players out of 17,500 player birthplaces, and most of these errors were improperly translated place names in Cuba or South Korea. Dear Baseball Reference, I still for the life of me can’t figure out where Choon Chung Do, South Korea is (listed birthplace for Jae Kuk Ryu).
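As a rough illustration of the lookup, the Python sketch below builds a Geocoding API request address and pulls the coordinates out of a JSON response. The function names are hypothetical, `YOUR_KEY` is a placeholder, and the sample response is hand-written with approximate coordinates rather than real API output. No network call is made here.

```python
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def geocode_url(birthplace: str, api_key: str) -> str:
    """Return the request URL for one birthplace string."""
    query = urlencode({"address": birthplace, "key": api_key})
    return f"{GEOCODE_ENDPOINT}?{query}"


def extract_lat_lng(response: dict):
    """Return (lat, lng) from a parsed response, or None if nothing was found."""
    if response.get("status") != "OK" or not response.get("results"):
        return None
    location = response["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


# Hand-written stand-in for a successful response (values are approximate).
sample_response = {
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 39.95, "lng": -75.17}}}],
}

request_url = geocode_url("Philadelphia, PA", "YOUR_KEY")
coordinates = extract_lat_lng(sample_response)
```

Returning `None` for a failed lookup makes it easy to collect the handful of birthplaces that need a manual fix later.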
The final piece of data, the Steamer 2016 WAR Projections, was by far the easiest. Fangraphs provides an export option that saves all of the hitter or pitcher projections as a .csv file, a table-like format that can be opened in Excel or read straight into R (or the programming language of your choice).
The first “cleaning” step involved taking the raw information from the website and splitting the label of each variable from its value, i.e. taking “Position : Pitcher” and trashing the “Position : ” part, while storing “Pitcher” in a column with all the other position info. With birthplace information this was somewhat complicated by the varying presentations of the city, state and/or country information, but with care all information could be properly sorted.
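The label-splitting step can be sketched in a few lines of Python. This is an illustration only, assuming each raw field looks like “Label : Value”; the helper name `split_field` and the second example string are made up.

```python
def split_field(raw: str):
    """Split 'Label : Value' into (label, value), trimming whitespace.

    Only the first colon is treated as the separator, so values that
    contain colons themselves are kept intact.
    """
    label, separator, value = raw.partition(":")
    if not separator:
        # No label present: treat the whole string as the value.
        return "", raw.strip()
    return label.strip(), value.strip()


position = split_field("Position : Pitcher")
birthplace = split_field("Born : Baltimore, MD")
```

The value half of each pair is what would be stored in the matching column; birthplace values would still need a second pass to separate city, state and country.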
As we are valuing players on both hitting and pitching and taking this data from separate sources, care must be exercised when dealing with historical baseball stats, because many players were both hitters and pitchers. To properly account for these players, such as the Babe himself, players with seasons including both hitting and pitching WAR (for the same team and season only) had their WAR summed.
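To make the summing rule concrete, here is a small Python sketch that adds up WAR for rows sharing the same player, season and team. The row layout, the function name `combine_war`, the player ID `twoway01` and all the numbers are invented placeholders, not real stats.

```python
from collections import defaultdict


def combine_war(rows):
    """Sum WAR over rows that share the same (player, season, team)."""
    totals = defaultdict(float)
    for player_id, season, team, war in rows:
        totals[(player_id, season, team)] += war
    return dict(totals)


# Placeholder rows: (player_id, season, team, war)
rows = [
    ("twoway01", 1918, "BOS", 2.0),  # hitting line
    ("twoway01", 1918, "BOS", 1.5),  # pitching line, same team and season
    ("twoway01", 1919, "BOS", 4.0),  # a season with only one line
]

combined = combine_war(rows)
```

Because the team is part of the key, a player traded mid-season keeps a separate total for each club, which matches the “same team and season only” rule described above.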
Adding the Steamer projections to the existing Baseball Reference data was, in short, a nightmare, but eventually all of the accents were removed and the various spellings of Zack and Jon were matched.
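One way to tame this kind of name matching is to reduce every name to a simplified key before comparing the two sources. The Python sketch below is an assumption about how that could be done, not the method actually used in the project; the nickname table and function names are hypothetical and deliberately tiny.

```python
import unicodedata

# Hypothetical nickname table; a real one would need many more entries.
NICKNAMES = {"zach": "zack", "zachary": "zack", "jonathan": "jon"}


def strip_accents(text: str) -> str:
    """Remove accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def name_key(full_name: str) -> str:
    """Return a lowercase, accent-free key with the first name normalized."""
    parts = strip_accents(full_name).lower().replace(".", "").split()
    if parts:
        parts[0] = NICKNAMES.get(parts[0], parts[0])
    return " ".join(parts)


same_player = name_key("Zach Greinke") == name_key("Zack Greinke")
```

Names that still fail to match after this step would need the same kind of manual fix as the stubborn birthplaces.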
With some final manual fixes for the players whose latitudes and longitudes Google failed to find, we’re ready to define the color and point size schemes. I tried to match a color for each team to their logo, but soon found that nearly every team is red or blue. So, I decided to take some liberties: teams with funky historical logos got those colors to switch it up. The Arizona Diamondbacks got that weird purple instead of mundane maroon, for instance, but I’ve always loved that retro Phillies maroon, so they got that. Grinding through all of MLB’s historical franchises was somewhat tedious but fun. Eventually even the Worcester Ruby Legs got a color (hint: it’s ruby), and I discovered the Seattle Pilots had the sweetest logo ever.
For point size, every player with a WAR that season at or below zero was given the same radius (3 pixels), while all players with greater than zero WAR had their radius increase linearly with WAR (2 pixels radius for every additional WAR).
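The sizing rule is simple enough to write down as a short Python function. This sketch assumes the linear growth starts from the same 3 pixel baseline (so 1 WAR gives 5 pixels); the text does not state the starting point explicitly, and the names here are hypothetical.

```python
BASE_RADIUS_PX = 3  # radius for any season at or below zero WAR
PX_PER_WAR = 2      # extra radius per additional WAR above zero


def point_radius(war: float) -> float:
    """Return the circle radius in pixels for a single-season WAR value.

    Assumption: the line starts at the 3 px baseline when WAR is zero.
    """
    if war <= 0:
        return float(BASE_RADIUS_PX)
    return BASE_RADIUS_PX + PX_PER_WAR * war


radii = [point_radius(w) for w in (-1.0, 0.0, 1.0, 5.0)]
```

Flooring the radius at the baseline keeps below-replacement seasons visible on the map instead of shrinking them to nothing.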
The interactive map was programmed in R, with the help of two packages, leaflet and shiny (non-coders: packages are just code/functions someone else has written for people to use). Shiny is a web application framework for R that makes all types of graphs and maps interactive; that is, they respond to user inputs. All drop-down menus, slider bars, buttons and check boxes are created by Shiny. In short, Shiny is the backbone on which all of the interactives on this site have been made.
Plotting circles of varying color and size in Leaflet is standard fare. Where Leaflet excels is in allowing full customization of the pop-up box. This pop-up can be constructed simply from HTML in R, giving the programmer nearly endless options for how the pop-up can look.
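Since the pop-up is just an HTML string, the idea carries over to any language. The Python sketch below assembles one such string; the layout, field choices and function name are assumptions for illustration and do not reproduce the site's actual pop-up.

```python
from html import escape


def popup_html(name: str, birthplace: str, war: float, image_url: str) -> str:
    """Build a small HTML snippet for a map pop-up.

    Text fields are escaped so characters like '&' or '<' in the data
    cannot break the markup.
    """
    return (
        f'<img src="{escape(image_url, quote=True)}" height="100"><br>'
        f"<b>{escape(name)}</b><br>"
        f"Born: {escape(birthplace)}<br>"
        f"WAR: {war:.1f}"
    )


# Placeholder values only.
snippet = popup_html("Example Player", "Philadelphia, PA", 3.14, "http://example.com/p.jpg")
```

A string like `snippet` is the kind of content that gets attached to each circle, so any styling HTML supports (bold names, images, line breaks) is available in the pop-up.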
After properly sorting all the state and country information, the graphs are fairly straightforward to construct. The top five players and their WAR totals for each state/country were determined, and the number of players and total WAR from each state/country in a given year were calculated using the plyr package. This package makes it simple to break up a data set into chunks, e.g. by state or country, calculate something for each chunk (sum players and WAR), and then combine the calculated results back together.
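The same split, calculate and combine pattern can be sketched in plain Python. This is only an analogue of what plyr does in R, using made-up rows and a hypothetical function name, with one row per player rather than per season to keep it short.

```python
from collections import defaultdict


def summarize_by_place(rows, top_n=5):
    """Group (place, player, war) rows by place, then summarize each group."""
    # Split: collect the rows for each state or country.
    groups = defaultdict(list)
    for place, player, war in rows:
        groups[place].append((player, war))

    # Apply and combine: compute per-group numbers and gather them in one dict.
    summary = {}
    for place, players in groups.items():
        ranked = sorted(players, key=lambda pw: pw[1], reverse=True)
        summary[place] = {
            "n_players": len(players),
            "total_war": sum(war for _, war in players),
            "top": ranked[:top_n],
        }
    return summary


# Placeholder rows: (birthplace, player, war)
rows = [
    ("PA", "Player A", 1.5),
    ("PA", "Player B", 2.5),
    ("Cuba", "Player C", 0.5),
]

summary = summarize_by_place(rows, top_n=1)
```

Adding the year to the grouping key would give the per-year totals used for the line graphs.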
Again, the interactive graphs were constructed with the help of Shiny, but instead of Leaflet, ggvis was used for the graphs themselves. Each state or country was given a random color, which only appears when a given line is hovered over with a mouse. Again, the pop-up is simply HTML, allowing for customization by the programmer.