How To : MLB Player Birth Series

This short “How To” section explains in plain terms how the data for the MLB Player Birth Series was acquired, organized, and plotted. I will not walk through the project's own code here, but for those interested, it is available on GitHub and should be fully reproducible.

Getting The Data

Most of the data for this project came from Baseball Reference, though the Steamer projections for 2016 Wins Above Replacement (WAR) were taken from FanGraphs. WAR is a catchall statistic that attempts to summarize a player’s total contribution to his team (hitting, pitching, fielding, running, etc.). A statistic of this type is required here because we need a single measure that applies to both pitchers and hitters.

The biggest step in grabbing all this data from Baseball Reference is compiling a list of every player’s unique ID. That ID can then be used to programmatically visit every player’s page and pull whatever information we require. Luckily, this task is common to many baseball research projects, and I had already compiled the list when putting together a database of all historical Minor League and MLB stats. The code for that project is also on GitHub, and it should be reproducible if you want to build your own SQL database. That database provided not only each player’s ID, but also the seasons he played and his WAR in each one. With every MLB player’s Baseball Reference ID in hand, we can grab the remaining data we require: name, birth, death, picture link, handedness, height, and weight.
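
As a rough illustration of the ID-to-page step, here is a minimal sketch in Python (the project itself was written in R, so this is not the original code). The URL pattern and the helper name `player_url` are assumptions made for illustration, and the site's actual layout may differ.

```python
# Assumed URL pattern for player pages; not taken from the original project.
BASE_URL = "https://www.baseball-reference.com/players"


def player_url(player_id: str) -> str:
    """Build a player-page URL from a Baseball Reference player ID."""
    player_id = player_id.strip().lower()
    if not player_id:
        raise ValueError("player_id must not be empty")
    # Assumption: pages are grouped into folders by the first letter of the ID.
    return f"{BASE_URL}/{player_id[0]}/{player_id}.shtml"


print(player_url("ruthba01"))
```

Looping a function like this over the full ID list gives one address per player, which is all a scraper needs to visit each page in turn.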

Unfortunately, Baseball Reference does not list the latitude and longitude needed later for the interactive map. It does, however, list a birthplace for all recent players (only a handful of players born in the mid-1800s lack birthplace info), which lets us look up the latitude and longitude using Google’s Geocoding API. This is just a Google Maps-like tool that lets you programmatically ask Google for the latitude and longitude of any input, like “Philadelphia, PA”. It worked remarkably well, failing for only a handful of players out of 17,500 player birthplaces, and most of those failures were poorly transliterated place names in Cuba or South Korea. Dear Baseball Reference, I still can’t for the life of me figure out where Choon Chung Do, South Korea is (the listed birthplace of Jae Kuk Ryu).
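
To make the geocoding step concrete, here is a hedged Python sketch (again, not the project's R code). It builds a request URL for the Geocoding API's JSON endpoint and pulls the coordinates out of a response. The function names are made up, the API key is a placeholder, and the sample response is hand-written with approximate coordinates rather than real API output. No network call is made.

```python
import json
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def geocode_url(place: str, api_key: str) -> str:
    """Build the request URL for one birthplace string."""
    return f"{GEOCODE_ENDPOINT}?{urlencode({'address': place, 'key': api_key})}"


def extract_lat_lng(response_text: str):
    """Return (lat, lng) from a geocoding JSON response, or None on failure."""
    payload = json.loads(response_text)
    if payload.get("status") != "OK" or not payload.get("results"):
        return None  # e.g. a place name Google could not resolve
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


# Hand-written sample response (approximate values, for illustration only).
SAMPLE_RESPONSE = json.dumps({
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 39.95, "lng": -75.17}}}],
})

print(geocode_url("Philadelphia, PA", "YOUR_API_KEY"))
print(extract_lat_lng(SAMPLE_RESPONSE))
```

Returning `None` for a failed lookup makes it easy to collect the handful of unresolved birthplaces for manual fixing later.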

The final piece of data, the Steamer 2016 WAR projections, was by far the easiest. FanGraphs provides an export function that saves all of the hitter or pitcher projections as a .csv file, a table-like format that can be opened in Excel or read straight into R (or the programming language of your choice).
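
For readers who have not worked with .csv files, here is a small Python sketch of reading such an export. The column names `Name` and `WAR`, the helper name, and the sample rows are all assumptions for illustration; a real export may use different headers and will have many more columns.

```python
import csv
import io


def load_projections(csv_text: str, role: str) -> list:
    """Parse projection rows from CSV text, tagging each with hitter/pitcher."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {"name": row["Name"], "war": float(row["WAR"]), "role": role}
        for row in reader
    ]


# Made-up sample data standing in for an exported file.
SAMPLE_HITTERS = "Name,WAR\nPlayer A,4.5\nPlayer B,-0.5\n"

for row in load_projections(SAMPLE_HITTERS, "hitter"):
    print(row)
```

With a real file, the same function could be fed the result of `open(path, newline="").read()` once for hitters and once for pitchers.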

Cleaning and Organizing

The first “cleaning” step involved taking the raw information from each page and splitting each field’s label from its value, i.e. taking “Position : Pitcher”, discarding the “Position : ” part, and storing “Pitcher” in a column with all the other position info. With birthplaces this was somewhat complicated by the varying presentations of city, state, and/or country, but with care everything could be sorted properly.
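
The label/value split might look something like the Python sketch below. Both helper names are hypothetical, and the birthplace rule (first piece is the city, last piece is the state or country) is a simplifying assumption; the real pages needed more careful handling.

```python
def split_field(raw: str) -> tuple:
    """Split 'Label : Value' into a (label, value) pair."""
    label, sep, value = raw.partition(":")
    if not sep:
        raise ValueError(f"no ':' separator in {raw!r}")
    return label.strip(), value.strip()


def split_birthplace(value: str) -> dict:
    """Split a birthplace string into city and region (state or country)."""
    parts = [p.strip() for p in value.split(",") if p.strip()]
    if not parts:
        return {"city": None, "region": None}
    if len(parts) == 1:
        # Assumption: a lone value is a state or country, not a city.
        return {"city": None, "region": parts[0]}
    return {"city": parts[0], "region": parts[-1]}


print(split_field("Position : Pitcher"))
print(split_birthplace("Philadelphia, PA"))
```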

Because we are valuing players on both hitting and pitching, and taking that data from separate sources, care is needed with historical baseball stats: many players were both hitters and pitchers. To properly account for these players, such as the Babe himself, any player with both hitting and pitching WAR in a season (for the same team and season only) had the two WAR values summed.
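
A minimal Python sketch of that summing rule is shown here. The row layout, the function name, and every value in the sample are made up for illustration; they are not real player stats.

```python
from collections import defaultdict


def combine_war(rows) -> dict:
    """Sum WAR over rows sharing the same (player, team, season) key."""
    totals = defaultdict(float)
    for player_id, team, season, war in rows:
        totals[(player_id, team, season)] += war
    return dict(totals)


# Made-up rows: one hitting line and one pitching line for the same
# player, team, and season, plus an unrelated row.
SAMPLE_ROWS = [
    ("player01", "AAA", 1918, 2.0),  # hitting WAR
    ("player01", "AAA", 1918, 1.5),  # pitching WAR
    ("player02", "BBB", 1918, 0.5),
]

print(combine_war(SAMPLE_ROWS))
```

Keying on team as well as player and season keeps a mid-season trade from merging two separate stints into one total.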

Adding the Steamer projections to the existing Baseball Reference data was, in short, a nightmare, but eventually all of the accents were removed and the various spellings of Zack and Jon were matched.
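
To give a flavor of the name matching, here is a Python sketch that strips accents and maps a few first-name spellings onto one form before comparing. The alias table and function name are hypothetical, and the project's actual fixes were more extensive than this.

```python
import unicodedata

# Hypothetical first-name aliases; the real list of fixes was longer.
ALIASES = {"zach": "zack", "zac": "zack", "jonathan": "jon"}


def normalize_name(name: str) -> str:
    """Lowercase a name, drop accents and periods, and unify first-name spellings."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Combining marks are the accents split off by the NFKD decomposition.
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = unaccented.lower().replace(".", "").split()
    if tokens:
        tokens[0] = ALIASES.get(tokens[0], tokens[0])
    return " ".join(tokens)


print(normalize_name("José Bautista"))
print(normalize_name("Zach Greinke") == normalize_name("Zack Greinke"))
```

Normalizing both data sets with the same function and joining on the result leaves only the true oddballs to fix by hand.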

Making the Map

With some final manual fixes for the players whose latitudes and longitudes Google failed to find, we’re ready to define the color and point size schemes. I tried to match each team’s color to its logo, but soon found that nearly every team is red or blue. So I decided to take some liberties: teams with funky historical logos got those colors to switch it up. The Arizona Diamondbacks got that weird purple instead of mundane maroon, for instance, but I’ve always loved that retro Phillies maroon, so they got that. Grinding through all of MLB’s historical franchises was somewhat tedious but fun. Eventually even the Worcester Ruby Legs got a color (hint: it’s ruby), and I discovered the Seattle Pilots had the sweetest logo ever.

For point size, every player with a WAR at or below zero in a given season was given the same radius (3 pixels), while the radius for players with WAR above zero increased linearly with WAR (2 pixels of radius for every additional WAR).
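
The sizing rule can be written as a tiny function. This Python sketch assumes the linear growth starts from the 3-pixel base radius, which is my reading of the rule rather than something confirmed from the project's code; the names are made up.

```python
BASE_RADIUS_PX = 3.0  # radius for WAR at or below zero
PX_PER_WAR = 2.0      # extra radius per unit of WAR above zero


def point_radius(war: float) -> float:
    """Map a season WAR to a circle radius in pixels."""
    if war <= 0:
        return BASE_RADIUS_PX
    # Assumption: the line starts at the base radius when WAR is zero.
    return BASE_RADIUS_PX + PX_PER_WAR * war


for war in (-1.2, 0.0, 2.5):
    print(war, point_radius(war))
```

Flooring the radius at 3 pixels keeps replacement-level and negative-WAR players visible on the map instead of shrinking them to nothing.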

The interactive map was programmed in R, with the help of two packages, leaflet and shiny (non-coders: packages are just code/functions someone else has written for people to use). Shiny is a web application framework for R that makes all types of graphs and maps interactive; that is, they respond to user inputs. All drop-down menus, slider bars, buttons, and check boxes are created by Shiny. In short, Shiny is the backbone on which all of the interactives on this site have been built.

Leaflet is a popular open-source JavaScript library, designed exclusively to make interactive maps. The leaflet package in R is just a set of functions allowing use of this JavaScript library in R. Leaflet also works with a wide variety of provider tiles, which serve as a base layer on top of which the data from R is plotted. I simply chose the layer I thought looked best for this map.

Plotting circles of varying color and size in Leaflet is standard fare; where Leaflet excels is in allowing full customization of the pop-up box. The pop-up can be built from plain HTML in R, giving the programmer nearly endless options for how it looks.
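
Since the pop-up is just an HTML string, building one is mostly string formatting. The sketch below uses Python instead of R purely for illustration; the field choices, layout, and function name are assumptions, not the pop-up actually used on the map.

```python
from html import escape


def popup_html(name: str, birthplace: str, team: str, war: float) -> str:
    """Build a small HTML snippet for a map pop-up."""
    # escape() guards against characters like '&' or '<' in the data.
    return (
        f"<b>{escape(name)}</b><br/>"
        f"Born: {escape(birthplace)}<br/>"
        f"Team: {escape(team)}<br/>"
        f"WAR: {war:.1f}"
    )


print(popup_html("Player A", "Philadelphia, PA", "Team & Co", 4.0))
```

A string like this is what gets handed to the mapping library as the pop-up content, so images, links, and styling can be added with ordinary HTML tags.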

Making the Graphs

After properly sorting all the state and country information, the graphs are fairly straightforward to construct. The top five players and their WAR totals for each state/country were determined, and the number of players and total WAR from each state/country in a given year were calculated using the plyr package. This package makes it simple to break a data set into chunks (e.g. by state or country), calculate something for each chunk (here, the number of players and total WAR), and then combine the results.
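
The split, calculate, and combine pattern can be sketched without plyr. This Python version is an illustrative stand-in for the R code: the row layout and names are assumptions, the sample rows are made up, and ranking the top five by total WAR across all seasons is my interpretation of "WAR totals".

```python
from collections import defaultdict


def summarize(rows):
    """Return per-(region, year) player counts and WAR, plus top five per region."""
    by_year = defaultdict(lambda: {"players": set(), "war": 0.0})
    totals = defaultdict(lambda: defaultdict(float))
    for r in rows:
        chunk = by_year[(r["region"], r["year"])]
        chunk["players"].add(r["player"])
        chunk["war"] += r["war"]
        totals[r["region"]][r["player"]] += r["war"]
    yearly = {k: (len(v["players"]), v["war"]) for k, v in by_year.items()}
    top_five = {
        region: sorted(players.items(), key=lambda kv: kv[1], reverse=True)[:5]
        for region, players in totals.items()
    }
    return yearly, top_five


# Made-up rows for illustration only.
SAMPLE = [
    {"region": "PA", "year": 2015, "player": "A", "war": 2.0},
    {"region": "PA", "year": 2015, "player": "B", "war": 1.0},
    {"region": "PA", "year": 2016, "player": "A", "war": 0.5},
    {"region": "Cuba", "year": 2015, "player": "C", "war": 3.0},
]

yearly, top_five = summarize(SAMPLE)
print(yearly)
print(top_five)
```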

Again, the interactive graphs were constructed with the help of Shiny, but ggvis, rather than Leaflet, was used for the graphs themselves. Each state or country was given a random color, which only appears when the mouse hovers over its line. Again, the pop-up is simply HTML, allowing for customization by the programmer.

Sources :

  1. Baseball Reference, all player data
  2. Steamer, 2016 WAR projections
  3. Google, Geocode API service for latitudes and longitudes

Doug Duffy | Author