Understanding Sabermetrics

Authors: Gabriel B. Costa,Michael R. Huber,John T. Saccoma

Figure 12.6. Scatterplot of OPS versus runs for the 2006 season

Fitting a trend line to the data gives encouraging results. In Figure 12.7, we again show the data, now with the trend line obtained by the least squares method. The equation of the line is

Runs = 2159.7 × OPS - 872.41

The coefficient of correlation is over 87 percent. This is a valid example of applying a linear regression to data using a single independent variable (OPS).
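As a minimal sketch of the least squares method behind such a trend line, the Python below fits a line to paired values and also reports R², the squared correlation. The function name `fit_line` and the OPS values are illustrative assumptions, not taken from the book; the runs are generated exactly from the 2006 trend line quoted above, so the fit should simply recover that line.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-deviations about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical OPS values (NOT real team data). Runs are computed from the
# 2006 trend line in the text, so the points lie exactly on that line.
ops_values = [0.700, 0.725, 0.750, 0.775, 0.800]
runs_values = [2159.7 * ops - 872.41 for ops in ops_values]

slope, intercept, r_squared = fit_line(ops_values, runs_values)
print(f"Runs = {slope:.1f} x OPS {intercept:+.2f}, R^2 = {r_squared:.4f}")
```

With real team data in place of the synthetic points, the same function would give a trend line and an R² below 1, since actual teams scatter around the line.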

Figure 12.7 Scatterplot of OPS versus runs with trend line for the 2006 season

As mentioned earlier, we would like to apply this technique to more data. In Figure 12.8, we show a scatterplot of OPS versus runs scored for five seasons, 2002 through 2006. The data appear to exhibit a strong trend. The equation of the regression line is now

Runs = 2171.3 × OPS - 877.37

This equation is very similar to the one developed for just the 2006 data. In addition, the R² value has increased to 89.66 percent, or almost 90 percent.

Next we try to model the runs-scored data with multiple variables. We will use a simple linear model, where we attempt to predict runs as a function of both on-base average (OBA) and slugging percentage (SLG), given by

Runs = β₀ + β₁ × OBA + β₂ × SLG

Using just the 2006 data (30 data points), we find that the regression equation is given by Runs = - 924.30 + 2585.19 × OBA + 1948.31 × SLG, which yields a correlation coefficient of 87.78 percent, or almost half a percent better than using single-variable regression. Using all 150 data points from the 2002 through 2006 seasons, we develop a regression equation of Runs = - 948.02 + 2696.44 × OBA + 1925.21 × SLG, which yields a correlation coefficient of 90.00 percent, again slightly better than using single-variable regression. The coefficients for each independent variable do not change significantly when more data is added.
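To see how close the two five-season models are in practice, the sketch below evaluates both for one team. The coefficients come from the equations in the text; the function names and the team's OBA and SLG values are hypothetical, chosen only for illustration, and the code relies on the standard definition OPS = OBA + SLG.

```python
def runs_from_ops(oba, slg):
    """Single-variable model (2002-2006 data): Runs = 2171.3 x OPS - 877.37."""
    ops = oba + slg  # OPS is on-base average plus slugging percentage
    return 2171.3 * ops - 877.37


def runs_from_oba_slg(oba, slg):
    """Two-variable model (2002-2006 data) with separate OBA and SLG weights."""
    return -948.02 + 2696.44 * oba + 1925.21 * slg


# Hypothetical team line, not taken from any actual season.
team_oba, team_slg = 0.340, 0.430

single = runs_from_ops(team_oba, team_slg)
multiple = runs_from_oba_slg(team_oba, team_slg)
print(f"OPS model:       {single:.1f} runs")
print(f"OBA + SLG model: {multiple:.1f} runs")
```

The single-variable model weights OBA and SLG equally, while the two-variable fit gives a point of OBA more weight than a point of SLG (2696.44 versus 1925.21), which is where the small gain in fit comes from.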

Figure 12.8 Scatterplot of OPS versus runs for the 2002 to 2006 seasons

We hope that we have provided an introduction to simulation and regression that will allow the reader to get started in analyzing baseball data. It is not a trivial process, but it can offer insights that might not be available using commonly accepted sabermetrical measures.

Easy Tosses

1. Create a simulation in which a batter has an equal chance of getting 2, 3, 4, or 5 at-bats in a game (assume that he will get only one of those outcomes). Use a batting average of .300 and simulate a 150-game season (our batter sits out a few games during the season). After the simulation, how does the simulated batting average compare to the input batting average?

2. Several studies have been done to predict runs scored using offensive measures such as RBIs, OPS, and batting average. Select thirty players with a similar number of at-bats from a given season and try to predict the runs scored.
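One possible starting point for the first problem is sketched below in Python. It assumes that each at-bat is an independent trial with a .300 chance of a hit and that the number of at-bats per game is drawn uniformly from 2, 3, 4, and 5; the function name `simulate_season` and the `seed` parameter are illustrative choices, not part of the problem statement.

```python
import random


def simulate_season(batting_average=0.300, games=150, seed=None):
    """Simulate one season and return (hits, at_bats, simulated_average)."""
    rng = random.Random(seed)  # private generator so runs can be repeated
    hits = 0
    at_bats = 0
    for _ in range(games):
        # Equal chance of 2, 3, 4, or 5 at-bats in this game.
        game_at_bats = rng.choice([2, 3, 4, 5])
        at_bats += game_at_bats
        # Each at-bat is a hit with probability equal to the batting average.
        hits += sum(1 for _ in range(game_at_bats)
                    if rng.random() < batting_average)
    return hits, at_bats, hits / at_bats


hits, at_bats, simulated_average = simulate_season(seed=2006)
print(f"{hits} hits in {at_bats} at-bats: {simulated_average:.3f}")
```

Running the function with several different seeds shows how far a single 150-game season can drift from the input average purely by chance.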

Clubhouse: Answers to Problems

Infield Practice: Sabermetrical Reasoning

Fast Ball Down the Middle

Before the 1990s, Pirate Hall of Famer Ralph Kiner had the second-best career home run percentage behind Babe Ruth (with Harmon Killebrew a tad behind Kiner). Ruth was the first player in history to hit 30, 40, 50 and 60 home runs in a season. After 1961, and for some years afterward, many people still argued that Ruth, not Roger Maris, held the seasonal home run record, due to the extended 1961 season (162 games versus 154 games in Ruth’s time). Ruth held the season home run percentage mark as well.

Over the past ten years or so, however, sluggers like Sammy Sosa, Mark McGwire and Barry Bonds have surpassed many of Ruth’s accomplishments. Apart from the questions and controversies which have been raised, one fact seems to endure. No other player in history has out-homered entire teams 90 times, or pairs of teams 18 times, as Ruth did. It would seem that Ruth is still mighty and still prevails.

Inning 1: Simple Additive Formulas

Easy Tosses

(1)

(2)

(3)

Sample calculation: Pujols: HEQ-O = TB + R + RBI + SB + 0.5 × BB = 359 + 119 + 137 + 7 + 0.5 × 92 = 668
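The sample calculation can be checked with a few lines of Python. This is only a sketch of the HEQ-O formula as written above; the function and parameter names are illustrative.

```python
def heq_offense(tb, r, rbi, sb, bb):
    """HEQ-O = TB + R + RBI + SB + 0.5 x BB, as in the sample calculation."""
    return tb + r + rbi + sb + 0.5 * bb


# Pujols's totals from the sample calculation above.
pujols_heq_o = heq_offense(tb=359, r=119, rbi=137, sb=7, bb=92)
print(pujols_heq_o)
```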

(4)

Sample calculation: Molina (C): HEQ-D = (PO + 3A + 2DP - 2E) × 0.445 = (736 + 3(77) + 2(6) - 2(4)) × 0.445 = 432.095.

If Molina’s putouts were greater than 800, we would have assigned him 800 putouts.
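The catcher formula and the 800-putout ceiling can be expressed together, as in the sketch below. The constant and function names are illustrative; the 0.445 factor and the cap of 800 putouts are the values used in the sample calculation and the note above.

```python
CATCHER_FACTOR = 0.445      # positional multiplier used for catchers above
CATCHER_PUTOUT_CAP = 800    # putouts beyond this are not credited


def heq_defense_catcher(po, a, dp, e):
    """HEQ-D for a catcher: (PO + 3A + 2DP - 2E) x 0.445, with PO capped."""
    capped_po = min(po, CATCHER_PUTOUT_CAP)
    return (capped_po + 3 * a + 2 * dp - 2 * e) * CATCHER_FACTOR


# Molina's totals from the sample calculation above.
molina_heq_d = heq_defense_catcher(po=736, a=77, dp=6, e=4)
print(round(molina_heq_d, 3))
```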

(5)

(6)
Total Average = (TB + BB + HBP + SB) / (AB - H + SH + SF + CS + GIDP)

Pujols: (359 + 92 + 4 + 7) / (535 - 177 + 0 + 3 + 2 + 20) = 1.2063

Howard: (383 + 108 + 9 + 0) / (581 - 182 + 0 + 6 + 0 + 7) = 1.2136
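Both Total Average figures can be reproduced with a short function. The sketch below follows the formula as stated; the function and variable names are illustrative, and the arguments are the totals shown in the two calculations above.

```python
def total_average(tb, bb, hbp, sb, ab, h, sh, sf, cs, gidp):
    """Total Average = (TB + BB + HBP + SB) / (AB - H + SH + SF + CS + GIDP)."""
    numerator = tb + bb + hbp + sb
    denominator = (ab - h) + sh + sf + cs + gidp
    return numerator / denominator


pujols_ta = total_average(tb=359, bb=92, hbp=4, sb=7,
                          ab=535, h=177, sh=0, sf=3, cs=2, gidp=20)
howard_ta = total_average(tb=383, bb=108, hbp=9, sb=0,
                          ab=581, h=182, sh=0, sf=6, cs=0, gidp=7)
print(f"Pujols: {pujols_ta:.4f}  Howard: {howard_ta:.4f}")
```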

Hard Sliders

(1)

Thus, in 1966, the AL had a POP of .915 and Robinson’s was 1.363. His relative POP was therefore 1.363 / 0.915 = 1.490, meaning that Robinson’s POP was 49 percent better than the league average.
