Baseball is all about timing and the disruption of it, as someone said, and that’s true from the diamond up to front offices and writers alike.
How many times has a GM regretted a move he didn’t make, when one of the pieces coming back turned out to be a perennial All-Star while he held on to his now below-average starter or fallen bullpen ace? And how does it feel when you have something good on your hands, stall on it, and, when you finally get the itch to write about it, well, you’re just too late because someone else got there first?
It’s damn irritating, and it’s why I get angry with myself every time I’m about to read a Framber Valdez piece. I could have jotted down a write-up about him after his first starts in 2020: there were so many positive signals, from his usual penchant for grounders to a newfound control of his 93–95 mph sinker to its location against RHH, and they all pointed to what we all know now.
Framber Valdez is what the Astros desperately need now that one window is closing and another cycle is starting: Verlander is injured and an upcoming FA, Greinke is also headed for the market and, to be fair, both are in their last seasons before a comfy retirement and a place in Cooperstown.
Yes, Lance McCullers Jr. is still in H-Town, and he wants to stay long term, but, with all due respect and admiration, I’m more bullish on Valdez as the one to hand a Game 7 ball to.
In the previous post you may remember Valdez’s unique act in 2020: an average launch angle allowed (avgLA) of -1.0 degrees, one of the lowest ever in the Statcast era. You won’t be surprised by his groundball rate, a 60.3% that only Randy Dobnak (62.1%) surpassed in the pandemic season among SP with at least 40 IP.
What you may forget about Valdez is that he also strikes out a lot of guys: his 9.5 K/9 in 2020 would scream “high cheese and low hammers”, yet it’s a rare sight when paired with a groundout machine like Framber. That has a lot to do with a curveball that Mike Petriello already crowned as one of the best in the business, one that has ridiculous Savant stats.
Valdez threw his curve 33.5% of the time in 2020 and this is what hitters could do with it: a paltry .136 xBA together with the disaster of a .212 xSLG and a .194 xwOBA means that they couldn’t muster a hit, but maybe some balls in play at least?
Nope, have yourself a good trip back to the dugout, thanks to a 42% Whiff and an absurd 37% PutAway rate: when Valdez threw his 3000 rpm demon in a 2-strike count, the hitter had a “Will I strike out?” dilemma…the answer? Let TWICE ask you: “Choose one, yes or yes?”.
So Valdez is not your 2020 base set starter, rather something in between a fireballing K-monster, Gerrit Cole-style, and an old school sinkerballer, and that is what makes him so unique.
Let’s indulge in some bad statistics and pick some arbitrary figures to prove my point: can you guess how many starters, on the usual 40 IP threshold, were able to produce a GB% of 60% or more, a K/9 of 9+ AND a BB/9 less than 3 in the whole Statcast era? You got it, it’s the one and only 2020 Framber Valdez, although once again my apologies to 2017 Lance McCurves, who misses the list by a mere .03 BB/9.
Have you enjoyed everything I told you about Framber? Sorry, bad news. In a Spring Training game against the Cubs he fractured his left ring finger on a Javier Baez comebacker and he could potentially lose the whole upcoming season, leaving the Astros rotation in quite a pickle, considering also the latest Forrest Whitley injury. That is why the Astros signed the still-unsigned free agent Jake Odorizzi to a 2-year deal at around $30M total with a player option for a third.
That said, Framber’s ability to keep the game on the ground prompted me to reconsider one of my innermost problems: batted ball definitions. Maybe it’s just me, but I always found the threefold distinction of GB/LD/FB, as in groundballs, line drives and flyballs, somewhat lacking. Not every grounder is the same: some of them are spinners to the catcher while others are rockets down a corner. Same for flyballs, as a Texas Leaguer is not a bomb by any means.
Sure, Statcast made it better, introducing Barrels as a subset of HardHit FBs, yet I’m not satisfied, and therefore I’ll bring you to one of the riskiest yet most fun branches of statistics: clustering.
What does clustering mean? In layman’s terms, clustering means grouping observations by their values on certain variables according to a set criterion, which could be homogeneity among the observations within a group or heterogeneity between groups.
In this case, our observations are all the balls in play allowed by Framber in the 2020 season, so no foul balls, strikeouts, walks, HBPs, interferences, and also no sac bunts and caught stealings because they stink. Savant makes for an easy search and returns a dataset that has so much more than what we need, since for this clustering I’m only going to use four variables: Launch Angle (LA), Exit Velocity (EV), Distance and xwOBA.
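If you would rather do that filtering yourself on a raw export, the step could be sketched like this in Python. Take it as an illustration only: the column names are what I understand the Savant CSV export to use and should be treated as assumptions, the three rows are invented, and `balls_in_play` is a hypothetical helper, not part of any library.

```python
# Assumed Savant CSV column names mapped to the short names used in this post.
KEEP = {
    "launch_angle": "la",
    "launch_speed": "ev",
    "hit_distance_sc": "dist",
    "estimated_woba_using_speedangle": "xwoba",
}

def balls_in_play(rows):
    """Keep batted balls only and reduce each one to the four clustering variables."""
    kept = []
    for row in rows:
        if row.get("type") != "X":            # assumption: "X" marks a pitch put in play
            continue
        if row.get("events") == "sac_bunt":   # sac bunts are dropped, as in the text
            continue
        if any(row.get(col) in (None, "") for col in KEEP):
            continue                          # skip rows with missing tracking data
        kept.append({short: float(row[col]) for col, short in KEEP.items()})
    return kept

# Invented rows, for illustration only (not Valdez's actual pitches).
rows = [
    {"type": "S", "events": "strikeout"},
    {"type": "X", "events": "field_out", "launch_angle": "-12",
     "launch_speed": "88.4", "hit_distance_sc": "15",
     "estimated_woba_using_speedangle": "0.110"},
    {"type": "X", "events": "sac_bunt", "launch_angle": "-40",
     "launch_speed": "45.0", "hit_distance_sc": "4",
     "estimated_woba_using_speedangle": "0.050"},
]
valdez_bip = balls_in_play(rows)
print(valdez_bip)
```

Only the middle row survives: the strikeout never reached the field and the sac bunt is excluded by choice.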
Subjective Choice N°1: Why only those four? Well, for the sake of simplicity but also because I’m more focused on what Framber allowed rather than what he did in terms of pitch type, spin axis, movement and all those shiny little things.
Good, now we have all the Valdez allowed BIP (Balls in Play), so we can get onto the clustering…but wait, what kind of clustering?
Subjective Choice N°2: It’s going to be the most common one, k-means clustering. For those not in the business, here’s an ELI5: k stands for the number of clusters (groups) we want to get, while the target is to minimize the distance between observations within the same cluster. That is done through an iterative process that starts by setting random points as centroids, the “centers” of our k clusters. Because the starting points are random, results can change from run to run, so the whole thing is repeated 100+ times to get solid results.
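For the curious, the ELI5 above could be sketched in a few lines of plain Python. This is a bare-bones version of the textbook algorithm, not the R routine used for this post: `kmeans` and `nearest` are names made up for the example, and the six (LA, EV) pairs are invented to form two obvious groups.

```python
import math
import random

def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))

def kmeans(points, k, n_starts=100, max_iter=100, seed=0):
    """k-means with several random starts; returns (labels, centroids, wss) of the best run."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        centroids = rng.sample(points, k)      # random observations as starting centers
        for _ in range(max_iter):
            labels = [nearest(p, centroids) for p in points]      # assignment step
            new_centroids = []
            for j in range(k):                                    # update step
                members = [p for p, lab in zip(points, labels) if lab == j]
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                    if members else centroids[j]                  # keep an empty cluster's center
                )
            if new_centroids == centroids:     # nothing moved: converged
                break
            centroids = new_centroids
        labels = [nearest(p, centroids) for p in points]
        # within-cluster sum of squares: the quantity k-means tries to minimize
        wss = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
        if best is None or wss < best[2]:
            best = (labels, centroids, wss)
    return best

# Invented (launch angle, exit velocity) pairs, for illustration only.
points = [(-20.0, 70.0), (-18.0, 72.0), (-22.0, 68.0),
          (25.0, 100.0), (27.0, 102.0), (23.0, 98.0)]
labels, centroids, wss = kmeans(points, k=2)
print(labels, centroids, wss)
```

With real data the variables live on very different scales (degrees, mph, feet), so they would normally be standardized before computing distances; the toy example skips that for brevity.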
Great, now we know the endgame but…how do we choose the k number of groups we want? This is not a classic case study, the geyser or iris data for those who went through it, where you already know beforehand. We need something that tells us what a decent k could be for our dataset.
Enter the so-called gap statistic: you may think of it as the difference between the distance function we want to minimize on our dataset and the same function applied to a version of our dataset that is uniformly distributed (flat distribution, clusters as rectangles), a sort of benchmark.
When the improvement given by an additional k is no greater than that for the uniformly distributed data, we have found our optimal number of clusters. Sounds good, but how is it done in real life? Thankfully R has a gazillion packages and the one that comes in handy is factoextra, a library dedicated to clustering.
To check out the optimal number of clusters k for our Framber dataset we “only” need to run the eclust clustering function up to a certain max k (say 10) and then plot the gap statistic for k from 1 to max k.
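To make the idea less abstract, here is a rough Python sketch of how a gap statistic can be computed by hand. It is not what factoextra does internally line by line: `best_wss`, `gap_statistic` and `pick_k` are hypothetical names, the data are two synthetic blobs rather than Valdez’s batted balls, and the selection rule is the usual "smallest k whose gap is at least the next gap minus its error" criterion.

```python
import math
import random

def best_wss(points, k, rng, n_starts=10, max_iter=50):
    """Smallest within-cluster sum of squares found by k-means over several random starts."""
    best = math.inf
    for _ in range(n_starts):
        centers = rng.sample(points, k)
        for _ in range(max_iter):
            groups = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda i: math.dist(p, centers[i]))
                groups[j].append(p)
            new_centers = [
                tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
                for j, g in enumerate(groups)
            ]
            if new_centers == centers:
                break
            centers = new_centers
        wss = sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)
        best = min(best, wss)
    return best

def gap_statistic(points, k_max, n_ref=20, seed=0):
    """Gap(k) = mean log(WSS) on uniform reference data minus log(WSS) on the real data."""
    rng = random.Random(seed)
    lows = [min(c) for c in zip(*points)]
    highs = [max(c) for c in zip(*points)]
    # reference datasets: same size, drawn uniformly inside the data's bounding box
    refs = [[tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs)) for _ in points]
            for _ in range(n_ref)]
    gaps, errors = [], []
    for k in range(1, k_max + 1):
        log_w = math.log(best_wss(points, k, rng))
        ref_logs = [math.log(best_wss(ref, k, rng)) for ref in refs]
        mean_ref = sum(ref_logs) / n_ref
        sd = math.sqrt(sum((x - mean_ref) ** 2 for x in ref_logs) / n_ref)
        gaps.append(mean_ref - log_w)
        errors.append(sd * math.sqrt(1 + 1 / n_ref))
    return gaps, errors

def pick_k(gaps, errors):
    """Smallest k with Gap(k) >= Gap(k+1) - error(k+1); falls back to the largest k tried."""
    for i in range(len(gaps) - 1):
        if gaps[i] >= gaps[i + 1] - errors[i + 1]:
            return i + 1
    return len(gaps)

# Two synthetic, well-separated blobs (not real batted-ball data).
data_rng = random.Random(1)
points = ([(data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)) for _ in range(20)]
          + [(data_rng.gauss(10, 0.5), data_rng.gauss(10, 0.5)) for _ in range(20)])
gaps, errors = gap_statistic(points, k_max=4)
chosen_k = pick_k(gaps, errors)
print(gaps, chosen_k)
```

On data this clean the gap jumps when going from one cluster to two; real batted balls are far messier, which is why the choice of k stays a judgement call.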
Here the gap statistic is increasing (as 1- difference) although that’s not important: what is key is that 8 seems like the way to go. This agrees with a precious teaching of modern statistics: “Look for the elbow”.
Subjective Choice N°3: let’s go with k = 8 then! Don’t take this as a given, you could go 7 or 9, even 10+ and it wouldn’t be a mistake per se. Clustering has no stone-cold right or wrong; rather, it relies a lot on your personal knowledge of the data at hand and some conscious judgement of all the results and advice spat out by your R-powered machine.
Now that k is set we just rerun the clustering function for that sole k = 8 and we get the grouping of Valdez’s BIP, with clusters labeled 1–8 in no particular order.
What to do with it? 1–8 tells you nothing of what we want to know about all those pesky kinds of balls in play! We need to properly label the clusters, but how?
Subjective Choice N°4: look at the average LA, EV, distance and xwOBA for each cluster and label it using some baseball lingo. Note that this is just my preference, you could label considering other variables (pitch type, infield positioning or combinations of 2+ variables).
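In code, that "look at the averages" step is just a group-by. A small Python sketch, with invented rows and a hypothetical `cluster_profiles` helper (the numbers are not Valdez’s actual cluster means):

```python
from collections import defaultdict
from statistics import mean

# Invented rows, for illustration only: cluster number plus the four variables.
bip = [
    {"cluster": 1, "la": -30.0, "ev": 75.0, "dist": 5.0, "xwoba": 0.10},
    {"cluster": 1, "la": -26.0, "ev": 79.0, "dist": 7.0, "xwoba": 0.14},
    {"cluster": 2, "la": 28.0, "ev": 104.0, "dist": 390.0, "xwoba": 1.20},
    {"cluster": 2, "la": 24.0, "ev": 106.0, "dist": 410.0, "xwoba": 1.40},
]

def cluster_profiles(rows, variables=("la", "ev", "dist", "xwoba")):
    """Average of each variable within each cluster, keyed by cluster number."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["cluster"]].append(row)
    return {
        cluster: {v: mean(r[v] for r in members) for v in variables}
        for cluster, members in sorted(groups.items())
    }

profiles = cluster_profiles(bip)
for cluster, averages in profiles.items():
    print(cluster, averages)
```

Once the averages are on screen, naming is up to you: a cluster with a steep negative avgLA and low avgEV reads very differently from one with a high avgEV and a big avgDist.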
This is the most fun part if you ask me: time to name your clusters!
- 1 has balls hit on average at a steep downward angle with below average EV, not going far nor doing any damage, I called them slow rollers;
- 2 is similar, just less overswung, these are my grounders;
- 3 was a no-brainer, high avgEV and avgLA near 0, these are hard liners;
- 4 was the trickiest, as those balls are not hit hard yet at a decent angle and with an avgDist that goes beyond the infield but not far to the outfield, what I would call bloops;
- 5 is just a powered-up version of 2, so hard grounders;
- 6 is the “close but no cigar” area, lacking some EV and with a little too much LA, those fall in the warning tracks;
- 7 is where the damage happens, with long bombs and extra-base hits galore, the scary barrel zone;
- 8 is a worse rendition of 1, so tappers.
Look how much the avgxwOBA tells you when combined with Launch Angle and Exit Velocity: the gap between a BIP in the barrel zone cluster and one in the tappers is the same as the gap between the barrel zone and the warning tracks cluster, so hitting, on average, either a spinner to the catcher or a long flyout made no difference in terms of average production.
Time to end the whole process: here is a messy yet astounding, if I can say so, graph of our clustered Valdez allowed BIP, with labels and a nice extra:
Watch out for the order: here it’s alphabetical by label name, not the same as the initial 1–8 numbering. As you can see I’ve added the result of the BIP, when not an out, in the graph (as usual 1b = single, 2b = double and so on). If you want a deeper look at it, you can group BIP by cluster and at-bat (AB) result:
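If you want to build that kind of cluster-by-result breakdown yourself, a counting sketch in Python could look like the following. The five (cluster, result) pairs are invented and the result strings are placeholders, not the real 2020 tallies.

```python
from collections import Counter

# Invented (cluster label, at-bat result) pairs, for illustration only.
bip = [
    ("barrel zone", "home_run"),
    ("barrel zone", "field_out"),
    ("grounders", "field_out"),
    ("grounders", "single"),
    ("grounders", "field_out"),
]

# Count how many balls in play fall in each (cluster, result) cell.
table = Counter(bip)
for (cluster, result), count in sorted(table.items()):
    print(f"{cluster:12s} {result:10s} {count}")
```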
Lastly, you may ask me why the LA-EV graph has values ranging from 0 to about ±3 or 4. That’s because the original variable values go through my personal nightmare, PCA (Principal Component Analysis). I’m not even trying to explain it, just take my word for it, you’d rather not know.
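Without opening the PCA can of worms, one ingredient is easy to show: plots like this usually start from standardized variables, meaning each value is expressed as a number of standard deviations from the mean. A tiny Python sketch, assuming that is what happens behind the plot and using five invented launch angles:

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-scores: distance from the mean in standard deviations.

    Uses the population standard deviation; R's scale() divides by the
    n-1 version instead, so its numbers would differ slightly.
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

launch_angles = [-30.0, -10.0, 0.0, 10.0, 30.0]   # invented, in degrees
z = standardize(launch_angles)
print(z)
```

A value of ±3 on that scale is already an extreme batted ball. PCA then rotates the standardized columns into new axes, which is why the plot is no longer in degrees and mph.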
To make you feel more at home, here’s a basic edition of the previous cluster with “correct” LA and EV:
What have we learned from this long and painful journey? To be fair, not a lot. Within the scope of a single season for just one pitcher we can’t make any bold statement, yet seeing how Valdez’s BIP cluster into partitions we can name with a bit of baseball terminology made for a fun ride.
An interesting point could be made with respect to the avgxwOBA of our clusters: for 2020 Framber, coaxing a grounder (either a roller, a tapper or even a hard one) led to a sub-.200 avgxwOBA, good news for a guy running a 60% GB rate; on the other hand it’s fair to wonder whether a sub-1.000 avgxwOBA in the barrel zone is somewhat of a lucky occurrence, given that the majority of those balls are actual Barrels, with all the damage that comes with them.
What I’ll do next time is a comparison with Framber’s polar opposite in the Astros rotation, flyball-inducing master Cristian Javier. It’s going to be the same path, one made of clusters, choices and labels.
Don’t ever forget the beauty, and the danger itself, of clustering: it is so subjective that getting hard truths back is not even the point of it.
Sometimes visualizing data is done not to get answers back, but to raise new questions. It also leaves you, reader, a lot of room to go about it on your own: do you like spin and release points? Use them to cluster! Do you want to label differently? Go ahead! My pattern is not a must, you can do whatever you want and you’ll find other results and avenues to explore.
Until next time, hit it in the air and don’t lose to Framber, last of the Ground Lords.