Why We Need to Be Cautious Using FIP

Peanut butter and jelly, biscuits and gravy, burgers and fries… ERA and FIP. The once little-known sabermetric can now almost always be found right alongside ERA, the most traditional way of evaluating a pitcher. In fact, some baseball analysts are taking it a step further and citing FIP to the exclusion of ERA. Clearly we are witnessing a transformation of the “go-to” pitching stat from ERA to FIP and, because that seems to be the trend, we should all hop on board, right? Wrong. While the original idea behind FIP was a good one and FIP is certainly very useful, the statistic has been widely misused among the general baseball community. It’s time to pump the brakes on the FIP train and re-examine exactly what FIP is and how it should properly be used.

So what is FIP? Fielding Independent Pitching (FIP), as FanGraphs defines it, is “a statistic that estimates a pitcher’s run prevention independent of the performance of the defense.” Essentially, the idea behind FIP is to measure only the outcomes of a plate appearance that the pitcher alone controls: strikeouts, walks, hit batters, and home runs. Under the FIP formula, pitchers with higher strikeout rates are rewarded and pitchers who get more of their outs on balls in play are punished. Compared to ERA, FIP does a much better job of evaluating how well a pitcher performed, rather than simply counting how many earned runs scored while he was on the mound. Because of variation in defense, luck, and the order and situation in which events occur, one pitcher could perform relatively well and give up three earned runs, while another could perform very poorly and escape without giving up any, all because of the vast number of potential outcomes from a ball in play.

This is why FIP serves as a measure of expected regression, positive or negative, when compared to ERA. For example, a pitcher with an ERA of 2.50 and a FIP of 4.50 has probably benefited from terrific defense, given up a majority of his hits when there were no runners on base, and been the beneficiary of a couple of very close misses (e.g. a sharply hit ball down the line that landed foul by inches, or a shot to straightaway center that the center fielder caught with his back against the wall). On the other hand, a pitcher with an ERA of 4.50 and a FIP of 2.50 has probably been harmed by defensive miscues and some unlucky bounces here and there, and has tended to give up his most damaging hits when runners happened to be on base.
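For reference, the commonly published form of the FIP calculation can be sketched in a few lines of Python. This is a minimal sketch, not an official implementation: the function name `fip`, the default constant of 3.10, and both stat lines are assumptions for illustration. The real constant is recalculated every season so that league-average FIP matches league-average ERA.

```python
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """Return FIP from raw counting stats.

    ip is innings pitched as a true decimal (180 1/3 innings is 180.333,
    not the box-score notation 180.1). constant is an assumed league
    constant; the real value changes from season to season.
    """
    if ip <= 0:
        raise ValueError("ip must be positive")
    # Home runs, walks, and hit batters add to FIP; strikeouts subtract.
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


# Two hypothetical 200-inning seasons that differ only in strikeouts.
strikeout_pitcher = fip(hr=20, bb=50, hbp=5, k=250, ip=200)
contact_pitcher = fip(hr=20, bb=50, hbp=5, k=150, ip=200)

print(f"strikeout pitcher FIP: {strikeout_pitcher:.2f}")
print(f"contact pitcher FIP:   {contact_pitcher:.2f}")
```

Notice that non-home-run hits and outs on balls in play never enter the calculation. With everything else held equal, the made-up pitcher with 100 fewer strikeouts ends up a full run worse by FIP, no matter how those 100 extra balls in play turned out.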

Since its invention, FIP has proven quite reliable at predicting improvement or decline for a pitcher, usually from a common evaluation point during the season, such as the end of the first month or the All-Star break. Recently, however, FIP has been used regardless of the time of season, regardless of the pitcher being discussed, and regardless of the situation. Here are just some of the ways FIP has been commonly misused: to compare pitchers, to evaluate pitchers’ careers, to project performance in future seasons, to replace BABIP, to defend pitchers who do not belong in the majors, and to evaluate relievers. Let me elaborate…

Why FIP Cannot Be Used to Compare Pitchers

The Logic: Pitcher A has a lower FIP than Pitcher B so he is the better pitcher.

Why This Logic Is Flawed: This faulty use of FIP is by far the most common, and the most aggravating to those who understand what FIP truly measures. Because FIP attempts to compensate for the variation in defense, luck, and sequencing that an individual pitcher experiences, it is irrational to use it to compare two pitchers who are not pitching with the same defense, luck, and sequencing. Even two pitchers on the same team may pitch on different days with different defensive alignments behind them and different infield and outfield shifts based on that day’s opposing lineup. Hell, at a ballpark like Wrigley Field, one pitcher could take the mound on a hot, sunny day with a 20 MPH wind blowing straight out, and his teammate could toe the rubber the next day in the rain with a 20 MPH wind blowing straight in.

When comparing pitchers from different teams, it becomes even more obvious why FIP cannot come into play. The difference in Defensive Runs Saved (DRS) between the Arizona Diamondbacks and Philadelphia Phillies last season was 303, showing the massive impact a team’s defense can have. When comparing a pitcher from the Diamondbacks to a pitcher of nearly equal caliber from the Phillies, the pitcher on the Phillies will almost certainly have the lower FIP, but that does not mean he is the better pitcher.

Most importantly, FIP cannot be used to compare two pitchers because every pitcher finds success in a different way. For example, pitchers like Max Scherzer and Clayton Kershaw are going to have very low FIPs because they are strikeout pitchers who will set down 8-10 guys on strikes on an average day on the bump. However, pitchers like Kyle Hendricks and Zack Godley, who generate nearly half their outs on the ground, will never have great FIPs because FIP only rewards pitchers who strike out lots of hitters. The FIP model would argue that Hendricks and Godley are the beneficiaries of good defense behind them. That may be partially true, but it also takes skill and outstanding control to induce weak contact. FIP does not account for that element of what a pitcher can control, and therefore it cannot be used to compare pitchers with different skill sets and different styles of getting batters out.

Why FIP Cannot Be Used to Evaluate Pitchers’ Careers

The Logic: Pitcher A had a career FIP of 3.50 therefore he had a very good career.

Why This Logic Is Flawed: When we think of what FIP is designed to predict (positive or negative regression), it is silly to apply this measurement to a pitcher who has had a long career of, say, 15 to 20 years. Because no pitcher will pitch on a team with the exact same roster for the entire span of his career, and most pitchers will pitch for at least two different teams, FIP becomes meaningless when applied to a pitcher’s career. If we were to apply the meaning of FIP to a pitcher who finished his 15-year career with a 3.30 ERA and a 3.50 FIP, it would be absurd to say something along the lines of, “Well, this pitcher clearly benefited from great defense throughout his career and, if he would’ve kept pitching, he would very likely have declined.” Over a sample as large as a career, the elements of luck, defense, and sequencing are bound to balance out. The odds that a pitcher goes 15 years without ever getting a favorable bounce, without ever getting help from a fabulous defensive play behind him, or without ever getting out of a jam are essentially zero. This is why, very often, pitchers will have a career FIP that is very similar to their career ERA. When that is not the case, however, it is irrational to suggest that the pitcher’s career ERA was somehow “skewed” just because it differs from his FIP.

Why FIP Cannot Be Used to Project Performance in Future Seasons

The Logic: Pitcher A had an ERA of 2.75 but a FIP of 3.40 last season so we should expect his ERA to be around 3.40 next season.

Why This Logic Is Flawed: One of the common misconceptions about FIP in general is that it is a projection, but FIP is, on the contrary, a predictive statistic: it signals the direction of likely regression, not the number an ERA will land on. In essence, if a pitcher’s FIP is significantly higher than his ERA, we are not projecting his ERA to increase to around where his FIP is; we are simply predicting that he will negatively regress. To evaluate a pitcher after a full season and use FIP to project his performance next season, however, does not make sense. A pitcher who pitched a full season to the tune of a 2.75 ERA had an outstanding season, regardless of what his FIP or any other statistic says.

Because the elements that the pitcher could not control vary from season to season and will normally balance out over a long period of time, any significant gap between FIP and ERA over a full season is almost certainly due to the share of outs the pitcher generated on balls in play versus via strikeouts. For example, in 2016 Kyle Hendricks of the Cubs finished the season with a Major League-leading 2.13 ERA, but a FIP of 3.20. Hendricks’s K/9 was only 8.05 and his HR/9 was only 0.71, which would explain why his FIP was so much higher than his ERA. However, that does not mean Hendricks got “lucky” in 2016. It just means that he was able to induce weak contact and keep the ball in the yard. While the fact that his ERA the next season was 3.03 might seem to contradict my argument, a look at some of his other statistics shows that the jump in ERA had nothing to do with his defense, luck, or sequencing. With his BB/9 increasing by a whopping 0.5, his HR/9 increasing by 0.39, and his HR/FB% increasing by 5.5 percentage points, we can reasonably conclude that his ERA rose because of worse control and worse command of his pitches, which, may I point out, is to be expected any time a pitcher has a dominant season like Kyle Hendricks did in 2016.

That is exactly why not every pitcher who has one season with a sub-3.00 ERA is enshrined in the Hall of Fame: maintaining that kind of success as a pitcher is hard. As the league starts to “figure you out,” you are almost certainly bound to regress as a pitcher, not because you got lucky, but because you are pitching against professional ballplayers whose job it is to make adjustments when they are not succeeding. When a pitcher has the kind of ERA-FIP discrepancy that Hendricks had in 2016, it is not reasonable to use FIP as a projection of how he will perform in future seasons. In fact, Kyle Hendricks has a career FIP of 3.52 at this point but a career ERA of 3.07, which further supports my point.

Why FIP Cannot Be Used to Replace BABIP

The Logic: Because FIP can tell us if a pitcher is predicted to improve or decline, there is no need for BABIP, which serves the same purpose.

Why This Logic Is Flawed: Unlike FIP, a career BABIP can be quite indicative of a pitcher’s performance because of the statistically demonstrated idea that good pitching can lead to weak contact, which, in turn, will systematically decrease a pitcher’s BABIP. Just as the best pitchers of all time tend to have a career BABIP of .300 or lower, the best hitters of all time tend to have a career BABIP of .300 or higher. Take Miguel Cabrera, for example. His career BABIP is .345, and no one who has watched Miguel Cabrera play could argue that he was the beneficiary of poor defense. You certainly can’t make the now-common argument that he beats out a lot of ground balls with his speed because, obviously, Cabrera is one of the slowest players in the league. The reason for Cabrera’s high BABIP is simply that he is one of the greatest hitters of all time and has hit the ball hard throughout his career, which makes it much harder for the defense to field. The same reasoning can be applied to pitchers who consistently execute their pitches to induce weak contact from opposing hitters. The fact that BABIP looks solely at what happens to the balls put in play against a pitcher, while FIP treats every ball in play the same, shows how different these two statistics are, and they should be treated as such.
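To make the contrast concrete, here is a minimal sketch of the standard BABIP calculation. The function name `babip` and the stat line are hypothetical and only there for illustration. The formula itself is the usual one: hits minus home runs, divided by at-bats minus strikeouts minus home runs plus sacrifice flies.

```python
def babip(h, hr, ab, k, sf):
    """Return batting average on balls in play.

    Home runs and strikeouts are removed because the ball never reaches
    a fielder; sacrifice flies are added back because they were in play.
    """
    balls_in_play = ab - k - hr + sf
    if balls_in_play <= 0:
        raise ValueError("no balls in play")
    return (h - hr) / balls_in_play


# Hypothetical season allowed by a pitcher (not a real stat line).
example = babip(h=180, hr=20, ab=700, k=200, sf=5)
print(f"BABIP: {example:.3f}")
```

The inputs here are almost exactly the events the FIP formula leaves out, which is why one statistic cannot stand in for the other.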

Why FIP Cannot Be Used to Defend Pitchers Who Don’t Belong in the Majors

The Logic: Pitcher A just got called up and through his first five starts he has a 10.85 ERA. However, his FIP is 6.55 so he’s clearly just been the recipient of bad luck, poor defense, etc.

Why This Logic Is Flawed: Every statistical model has flaws. Unfortunately, in the case of pitchers who are struggling to an epic degree, FIP’s usefulness becomes very limited. If a pitcher is giving up over ten earned runs per nine innings, it is all but guaranteed that he is giving up lots of hits on balls in play (unless he’s walking in all those runs, which is almost impossible considering pitch counts and a manager’s patience). Any time a pitcher is giving up hits at a substantially higher rate than he is getting outs on balls in play, the FIP model automatically assumes that bad defense, bad luck, and bad sequencing are affecting the results to some extent, when this is very rarely the case. Essentially, if a pitcher is struggling to the point of such a high ERA, he’s struggling because of some combination of poor command and poor control, not because of the “factors a pitcher cannot control.” One demotion that sparked criticism last season was that of Jon Gray, who was sent to AAA with a 5.97 RA9 but only a 3.10 FIP. While many people were using FIP to excuse his poor results, his 2.72 BB/9, 36.1 Hard%, and 1.41 HR/9 were much more indicative of why he was struggling, although, to be fair, pitching at Coors Field will almost always spike your HR/9. A pitcher giving up just shy of six runs per nine innings is not useful to a Major League team and clearly is unable to execute at a Major League level. While Gray showed flashes of brilliance last season, it was his inconsistency that got him demoted, not any type of “bad luck” that those who cited his FIP would suggest.

Why FIP Cannot Be Used to Evaluate Relievers

The Logic: Reliever A’s FIP is significantly higher than his ERA so he is due for negative regression.

Why This Logic Is Flawed: With traditional starting pitchers, there is always some form of consistency; every starter will start the game without the opposing team having any runs, every starter will start the game against the opposing team’s leadoff hitter, every starter will start the game without runners on base, etc. Because this is the case, any results that occur once a starter has begun to pitch the game are largely due to that starting pitcher’s performance.

However, with relievers, that type of consistency does not exist. When a reliever comes into the ball game, he could come in to face the bottom of the order with a 15-run lead and the bases empty, or he could come in to face the opposing team’s cleanup hitter with the game tied and the bases loaded. Clearly, those are two tremendously different scenarios for a pitcher to pitch in, yet FIP continues to treat every outcome the same way regardless of context. A strikeout in the first scenario wouldn’t really matter at all, but FIP counts it exactly the same as a strikeout in the second. Likewise, a double in the first scenario would barely do harm, but a double in the second would have a much greater impact on the team’s chances of winning.

Because relievers pitch far fewer innings than starting pitchers, balls in play tend to have far greater effects on their success. But because FIP considers all balls in play the same way, this statistic can be heavily skewed for relievers. Mariano Rivera and Trevor Hoffman, two of the greatest relievers of all time, consistently had seasons in which their FIP was much higher than their ERA. In fact, the average difference between ERA and FIP throughout Rivera’s career was 0.58. Surely, no one is going to argue that Rivera’s success can be attributed to luck and great defense. Rivera was one of the most dominant pitchers the world has ever seen. In a more current example, Blake Treinen, arguably the best reliever last season, had a FIP of 1.82 despite an ERA of 0.78. Given that Treinen’s K/9 was 11.20, his Hard% was sub-30, and his BABIP was .230, the suggestion from FIP that “luck” kept a full run per nine innings off the board is absurd. Because the situation in which a reliever enters the game can vary so much, and because he pitches significantly fewer innings per season than a starter, it is irrational to apply the same FIP measurement to relievers that we do to starters. In fact, one of the best relievers of this generation, Aroldis Chapman, notably “broke” FIP in July of 2012 when he posted a FIP of -0.99, which obviously makes no practical sense in context.
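A negative FIP falls straight out of the arithmetic when a small sample is packed with strikeouts. The sketch below uses the commonly published FIP formula with an assumed constant of 3.10 (the real constant varies by season) and a made-up month of relief work. It is not Chapman’s actual stat line.

```python
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """Return FIP from raw counting stats (ip as a true decimal)."""
    if ip <= 0:
        raise ValueError("ip must be positive")
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


# Hypothetical reliever month: 14 innings, no homers, 2 walks, 31 strikeouts.
dominant_month = fip(hr=0, bb=2, hbp=0, k=31, ip=14)
print(f"FIP for the month: {dominant_month:.2f}")
```

Because the strikeout term is subtracted and divided by a tiny innings total, it can swamp the constant entirely, and the result is an “expected” run total below zero.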

If there’s one thing you take away from all of that, it is that we must be extremely cautious about how we use FIP, because several flaws in the statistical model make it impractical to use in certain contexts. Fortunately, there have been several alterations to FIP, such as xFIP, which uses a constant HR/FB% equal to the league average, and FIP-, which is adjusted for park and league. FIP can be extremely useful for predicting regression and for getting a better sense of whether a pitcher’s production is sustainable. However, like every baseball statistic, it does not tell the whole story and should never be used by itself.
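As a rough illustration of how xFIP differs from FIP, the sketch below swaps a pitcher’s actual home runs for the number he would be expected to allow at a league-average HR/FB rate. The function names, the 10% league rate, the 3.10 constant, and the stat line are all assumptions for the example; real league values change from season to season.

```python
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """Return FIP from raw counting stats (ip as a true decimal)."""
    if ip <= 0:
        raise ValueError("ip must be positive")
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


def xfip(fb, bb, hbp, k, ip, lg_hr_per_fb=0.10, constant=3.10):
    """Return xFIP: FIP with home runs replaced by fly balls allowed
    times an assumed league-average HR/FB rate."""
    expected_hr = fb * lg_hr_per_fb
    return fip(expected_hr, bb, hbp, k, ip, constant)


# Hypothetical pitcher who allowed 30 homers on 200 fly balls (15% HR/FB).
actual = fip(hr=30, bb=50, hbp=5, k=200, ip=200)
normalized = xfip(fb=200, bb=50, hbp=5, k=200, ip=200)

print(f"FIP:  {actual:.2f}")
print(f"xFIP: {normalized:.2f}")
```

For this made-up line, xFIP comes out well below FIP because the pitcher gave up homers on fly balls at a rate higher than the assumed league average.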

If you liked this article or have any questions or comments, be sure to follow me on Twitter (@zorianbaseball).
