Archive for the ‘Box Score Analysis’ Category

Box Score Analysis: Wins by Differential, Part 2

First order of business: congratulations, Boston Celtics, on your 17th title. Consolations to the Lakers as well. And a preliminary prediction of a Spurs championship next year if Manu stays healthy – it’s an odd year, after all.

In case you can’t tell, I’m a huge fan of differentials. It’s fairly well-documented that the best predictor of playoff success over the past several years hasn’t been regular season record, but regular season differential.

Want proof? Just look at this year. In 2008, almost every team was eliminated by a team with a higher differential. Celtics over Pistons (#1 and #2), Lakers over Jazz (#3 and #4), Pistons over Magic (#2 and #5), Jazz over Rockets (#4 and #9), Hornets over Mavs (#6 and #10), and the list goes on. The only exceptions were the Spurs over the Suns and Hornets (and as we all know, the Spurs don’t really care about the regular season) and the Cavs over the Raptors. Differential matters.

Want more proof? The Spurs owned the league’s highest differential in both 2005 and 2007, despite not owning the NBA’s best record in either season. In fact, half of the last 10 NBA champions have won the point differential crown, with only the 2002 and 2004 Lakers winning the title without ranking at least top-5.

So, all this considered, it’s a pretty powerful statistic. It correlates ridiculously well with total regular season wins, but it’s an even better predictor of playoff success, considering how rarely regular season victories predict championships (as documented over at Ball Don’t Lie).

It’s also an extremely versatile statistic. You don’t have to consider things like pace of the game, the team’s focus (offensive or defensive), efficiency ratings or anything else. It’s very straight-forward, and everything else that’s significant in the game comes through in the differential.

That’s why I’m utilizing (ok, abusing) it so much in this on-going analysis; it’s one of the more powerful statistics, and we’re reaffirming that here. We’ve already made a few quite notable discoveries, but the ones in this entry are better than all the rest. I think in this entry, we’re actually going to bridge the gap between ‘interesting to statistics nerds’ (like myself) and ‘interesting to the casual NBA fan’.

Today we’ll be answering the question, is it more important for a team to play well in the first half or second, and is there a particular quarter that it’s more important for a team to perform well in?

Before we continue, let me define real quick what I’m referring to when I say one differential was ‘more beneficial’ than the other. Essentially, if a positive differential (for the home team, meaning they outscored the away team) during quarter (or half) A led to more wins than the same differential over quarter B, then quarter A is defined as ‘more beneficial’ (meaning performing better in that quarter more closely led to more wins). If, on the other hand, a negative differential during quarter A led to more wins than the same differential in quarter B, quarter A is considered less beneficial because being outscored during that portion of the game was less likely to result in a loss.
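For the code-inclined, that definition can be sketched in a few lines of Python – note the win/loss counts below are invented for illustration, not taken from the real dataset:

```python
# A minimal sketch of the 'more beneficial' comparison defined above.
# Each period maps a home-team differential to (home wins, home losses);
# the counts here are hypothetical.

def win_pct(wins, losses):
    """Home-team winning percentage for one (differential, period) cell."""
    return wins / (wins + losses)

def more_beneficial(period_a, period_b):
    """For each differential seen in both periods, report which period's
    identical differential led to more home-team wins."""
    result = {}
    for diff in period_a.keys() & period_b.keys():
        pa = win_pct(*period_a[diff])
        pb = win_pct(*period_b[diff])
        result[diff] = "A" if pa > pb else ("B" if pb > pa else "tie")
    return result

# Hypothetical counts: differential -> (home wins, home losses)
first_half  = {8: (16, 9), -3: (10, 12)}
second_half = {8: (19, 3), -3: (14, 8)}

print(more_beneficial(first_half, second_half))
```

With these made-up numbers, both the +8 and the -3 differential pay off more in the second half, so both map to "B".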

This approach isn’t without faults, though, and I’d be a propagandist (woah, that’s actually a word?) instead of a statistician if I didn’t mention them. This study assumes that winning a quarter is good, and losing a quarter is bad. The last portion of the study didn’t entirely support this, showing that losing a quarter by two or less may still be good – but this would only confound a tiny percentage of our data. That’s the only obvious confound I can identify; if anyone reading notices something I’ve overlooked, e-mail me – I’m not defending this idea simply because I believe it, but because the statistics appear to back it up.

Fortunately, most other typical confounds don’t really apply here: usually a study like this would need to ensure the four quarters were played out under equal conditions, but this study does not seek to attribute causation: we’re not trying to find out why a particular quarter is more important, only if a particular quarter is more important. So with that, on to the results:

Difference Between Halves

Is it more important to perform well in the first half, or the second half? There are several ways to approach this, but one of the best would be to look at identical performances in the first and second half (here, identical differentials) and find if the same differential leads to more wins in one half than the other. And, in this case, it does – well, sort of.

For this portion of the study, I actually calculated the statistical significance of the difference in win percentages between the first and second halves for every possible differential. OpenOffice.org Calc is my friend.

What we find is that there are 39 differentials which occurred at some point in the season in each half. For 21 of these differentials, it was better to win the second half by that score than the first (for example, winning the second half by 8 led to an 86% winning percentage, whereas winning the first half by 8 led to only a 64% winning percentage). For the other 18, it was better to win the first half; so right here we see no strong evidence that the second half is more important.

But, this goes to another level when taking statistical significance into account. Of those 39 differentials, the difference in winning percentage between the first and second halves is only statistically significant (at the 90% confidence level) for 14 of them – meaning that only 14 of the differences are actually indicative of a relevant discrepancy. And of those 14, 11 come when the second half yields more wins.
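Under the hood, each of those per-differential comparisons is just a two-proportion test. Here’s a rough Python version of what the spreadsheet is doing, using the winning-by-8 example from above with hypothetical sample sizes (the real counts live in the data sheet):

```python
# A sketch of the per-differential significance test: a two-proportion
# z-test on the first- vs second-half winning percentages for one
# identical differential. Sample sizes below are made up.
from math import sqrt, erf

def two_prop_z(w1, n1, w2, n2):
    """z statistic for the difference between two winning percentages."""
    p1, p2 = w1 / n1, w2 / n2
    pooled = (w1 + w2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

def confidence(z):
    """Two-tailed confidence level implied by a z statistic."""
    return erf(abs(z) / sqrt(2))

# Hypothetical: winning the half by 8 points, first half (64%) vs
# second half (~86%)
z = two_prop_z(16, 25, 19, 22)
print(confidence(z) > 0.90)   # significant at the 90% level?
```

With these invented counts, the gap clears the 90% bar but not the 95% bar, which is exactly the kind of borderline case this study is full of.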

This is about the strongest statistical evidence I’ve yet come across that strong second half performance results in more wins than identical first half performance. When the statistics show that a certain differential is more beneficial when achieved in a certain half, it shows that the second half is the more beneficial one 11 out of the 14 times. I can’t really re-state that in any other way; the stats strongly suggest that the second half is more important. I’m sure the Lakers will agree, having seen both sides of just how much the second half matters (Game 1 vs. the Spurs, Game 4 vs. the Celtics).

The study does raise several questions, though. One oddity is that the 14 statistically significant differences don’t all favor the same half. Why do three appear to benefit the first half, but the other eleven the second half? Is there a pattern to which lie on which side?

Let’s find out: the differentials that are more beneficial in the second half are -20, -17, -16, -14, -9, -3, 1, 7, 8, 9 and 18. The differentials that are more beneficial in the first half are -13, 10 and 19. Now, those results are rather unusual – the beneficial first-half differentials all lie right next to a differential that’s more beneficial encountered in the second half. For example, every home team (all 20) carrying a 19-point advantage into halftime won, whereas only eight out of eleven home teams pulling in a 19-point advantage in the second half won. On the flip side, every home team that achieved an 18-point advantage in the second half won, whereas only eleven out of fourteen won after doing the same in the first half. The same sorts of trends (though not as drastic) can be observed for 9 vs. 10-point differentials and -13 vs. -14-point differentials.

That’s a very strange observation indeed, and requires some explanation. There are two high-level possibilities:

  • There are certain differentials that are more beneficial when encountered in the first half, and certain ones that are more beneficial in the second half.
  • There is an overall trend to which half it is better to encounter every differential in, and the differentials pointing towards the first half being more beneficial are a result of sampling error.

The first of those options would be more plausible if there was a pattern to which differentials were more beneficial when – for example, if large differentials were better when encountered in the second half and narrow ones better in the first half. But the results don’t suggest that – the beneficial first-half differentials are right next to the beneficial second-half differentials.

So, that leaves us with the second option. The second option seems a bit like a cop-out – just throw out the results that we don’t like? And I admit, it is – but if neither half truly mattered, a split as lopsided as 11-to-3 in favor of the second half would come up by chance only around 3% of the time, and there remains no logical explanation for why such close differentials would yield completely opposite results.
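That chance figure is easy to verify: if neither half truly mattered, each of the 14 significant differentials would favor either half like a coin flip, and we can compute how often 11 or more would land on the second-half side:

```python
# Verifying the coin-flip claim: P(X >= 11) for X ~ Binomial(14, 0.5),
# i.e. the chance that 11+ of 14 fair coin flips favor one specified half.
from math import comb

def binom_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

chance = binom_tail(14, 11)
print(round(chance, 3))   # roughly 0.029
```

So an 11–3 split in favor of the second half is unlikely, but not impossibly so – hence the need for the bigger sample.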

Fortunately, we can drop this here – why? Because like I mentioned in the last post, we can re-hash this analysis when we multiply our sample size. With a sample size of 12,300, every difference will be statistically relevant – so if these results persist into that study, we can conclude that somehow, for some inexplicable reason, an 18-point differential is better in the second half, but a 19-point one is better in the first.

Difference between Quarters

Like the halves study, this was done by finding the statistical difference between every pair of quarters – all six pairs. And since we already set up the framework for how we’re doing these analyses, let’s jump right into the results – all ‘statistically significant’ stats are at least at the 90% confidence level.

Rather than get unnecessarily wordy, here’s the format for the stats: (Quarter): (number of differentials with advantage); (Quarter): (number of differentials with advantage); (Quarter) SS: (number of statistically significant advantages); (Quarter) SS: (number of statistically significant advantages) – followed by a brief summary of what I’m taking away from it. Like above, below we’re assuming that one of the quarters is absolutely better (over all differentials) than the other – this is not a conclusively proven assumption, nor can it be at this stage, but there’s no evidence (logical or statistical) to the contrary.

1st Quarter vs. 2nd Quarter: 1st: 19; 2nd: 15; 1st SS: 3; 2nd SS: 2
No real conclusive evidence that the first quarter is more important than the second, though this is fairly conclusive that the second quarter is not more important than the first.

1st Quarter vs. 3rd Quarter: 1st: 19; 3rd: 17; 1st SS: 2; 3rd SS: 8
Now, this is interesting. The 1st quarter holds a slight advantage over the third in terms of the winning percentage resulting from different differentials; but, only 2 of the 19 first-quarter-favoring differentials are statistically significant. On the other hand, almost half of the third-quarter-favoring differentials are statistically significant. This suggests a notable lean towards the third quarter, but not a completely conclusive one.

1st Quarter vs. 4th Quarter: 1st: 16; 4th: 19; 1st SS: 2; 4th SS: 4
Like the 1st vs. 2nd data, this data is too close to suggest that the fourth quarter is conclusively more important than the first; however, as in that comparison, this is evidence that the first quarter almost certainly is not more important than the fourth.

2nd Quarter vs. 3rd Quarter: 2nd: 13; 3rd: 21; 2nd SS: 3; 3rd SS: 4
The closeness of the statistically significant counts matters more than the wide gap in the raw counts, but that raw gap (13 vs. 21) is itself statistically significant (once again, statistics analyzing statistics – and don’t you dare try to read that sentence 3 times fast). So what does that mean in non-statistics-ese? Basically, there’s evidence that the third might be a bit more critical than the second, but more notably, the second certainly isn’t more critical than the third.

2nd Quarter vs. 4th Quarter: 2nd: 16; 4th: 18; 2nd SS: 2; 4th SS: 7
Fourth is more important than second, basically – the large discrepancy in statistically-significant differentials shows that.

3rd Quarter vs. 4th Quarter: 3rd: 18; 4th: 19; 3rd SS: 8; 4th SS: 9
And the third and fourth are about as equal as they can be from the data available – either could actually be better than the other, or they could be functionally the same.

Now, pardon me, but I’m going to be a nerd for a moment and break these into nonsensical looking equations to try to come up with a Unified Theory of Quarter Differentials.

So, we have 1st >= 2nd, 3rd > 1st, 4th >= 1st, 3rd > 2nd, 4th > 2nd, 3rd = 4th.

Well, right away that’s good news (well, ‘good’ if you want this study to have a conclusion): there are no blatant contradictions there. That’s actually better news than it may appear – if there truly was no pattern to which quarter was better than which (and all the observed results were simply random), there would almost certainly be a contradiction.

The conclusion is not, however, what I had anticipated. Judging from earlier data, I was fairly sure that the third quarter would prove to be conclusively most important. However, according to this data, it isn’t: the unified formula suggested by the data is 4th = 3rd > 1st >= 2nd – that is, the third and fourth quarters are definitely more important than the first and second, but are themselves even, and the first and second may also be even.

Now, these statistics allow a certain degree of interpretation – for example, how do you resolve 4th >= 1st and 1st >= 2nd with 4th > 2nd? The nature of this discrepancy – two comparisons suggesting a logical order but not a logical degree – is not an uncommon characteristic of the random error in this type of analysis. There is likely a resolution (or, alternatively, there is no pattern and these results are, in fact, by chance), but unfortunately the only way to establish that resolution is to increase the sample size – that, again, is a task for later in the summer.

I could milk this some more, but let’s cut this portion off here – we have one more thing to touch on before we turn the page on this portion of the analysis.

Overall Differences

Before ending this portion of the analysis, let’s look at one last approach. This approach is not as specific or thorough as the above ideas, which actually grants an advantage: while the following approach would not be able to pick up on subtle differences between different quarters and halves, it can be assumed that any notable difference in the following approach is indicative of a true difference.

This approach is simple: what was the overall winning percentage of teams that “won” each quarter and half? A simple compilation of the gargantuan dataset we already had revealed these results (note, the numbers do not add to 1230 due to tied quarters that hypothetically favor neither team):

  • 1st Quarter Winners: 756 wins, 405 losses, 65.1%
  • 2nd Quarter Winners: 745 wins, 429 losses, 63.4%
  • 3rd Quarter Winners: 786 wins, 387 losses, 67.0%
  • 4th Quarter Winners: 766 wins, 396 losses, 65.9%
  • 1st Half Winners: 855 wins, 323 losses, 72.5%
  • 2nd Half Winners: 889 wins, 304 losses, 74.5%
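For anyone who wants to double-check the arithmetic, the percentages above follow directly from the win/loss counts (note the quoted figures appear to be truncated, not rounded, to one decimal):

```python
# Re-deriving the winning percentages in the list above from the
# win/loss counts. The post's figures look truncated to one decimal
# place, so we truncate rather than round.
records = {
    "1st Quarter": (756, 405, 65.1),
    "2nd Quarter": (745, 429, 63.4),
    "3rd Quarter": (786, 387, 67.0),
    "4th Quarter": (766, 396, 65.9),
    "1st Half":    (855, 323, 72.5),
    "2nd Half":    (889, 304, 74.5),
}

def truncated_pct(wins, losses):
    """Winning percentage, truncated to one decimal like the quoted figures."""
    return int(1000 * wins / (wins + losses)) / 10

for period, (w, l, quoted) in records.items():
    assert truncated_pct(w, l) == quoted, period
print("all six winning percentages check out")
```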

And lastly comes that epic question of statistical significance. Which of these differences are statistically significant? I’ll spare the math and just jump to the conclusion; factoring everything in, these statistics show a statistically significant advantage in the comparisons of 1st > 2nd, 3rd > 1st, 3rd > 2nd, and 4th > 2nd (it’s notable that though all are significant at nearly the 90% level, the 3rd > 2nd and 4th > 2nd are the most significant). Additionally, the data suggests that 4th > 1st and 3rd > 4th, though not at nearly certain confidence levels (66% and 72%, respectively).

So, considering only these conclusions, we would come up with this demented equation: 3rd >= 4th > 1st > 2nd. Don’t get too excited by the absence of a contradiction with our earlier formula – while usually a conclusion is nearly-certainly proven when two separate approaches lead to the same result, these two approaches aren’t entirely different – they’re based on the same data, so it’s natural for them to be closely related.

What’s notable, though, is that there is a larger degree of certainty of the relationships when using this method. Whereas the earlier method failed to make a determination between the 3rd and 4th quarter, this one suggests a possible benefit for the 3rd quarter (and certainly shows no benefit for the 4th quarter – equality is still possible, however). To a greater degree, this approach marks a more certain definition for the comparison of the 1st and 2nd – namely, that the 1st is, indeed, more important.

So which do we accept? That’s a surprisingly subjective question for a statistical issue. Statistically, we could take either of the two studies, or a combination of the two – but they do show slightly different things (‘Little White Statistics’), so the question becomes where we attribute the random error. This effect can be minimized by expanding the study size, but in the meantime it’s a matter of judgment – I, personally, feel random error will affect the first method more due to its more segmented nature – the sample sizes for each individual comparison are smaller (giving random error a larger impact), and there are many, many more samples to be affected. Therefore, I believe the conclusion of this latter portion to be more accurate.

Statistically, the proper way to say this would be something along the lines of “we can be 90% confident of the latter conclusion (3rd >= 4th > 1st > 2nd) and 95% confident of the former conclusion (3rd = 4th > 1st >= 2nd)”. Looking carefully, we see that the latter conclusion is really a specific case of the former conclusion (noting that ‘3rd = 4th’ means that we can’t make a judgment, not that we’re judging them to be equal) – or, in other words, both the former and latter conclusion can be correct, but the latter can’t be correct without the former. One of those ‘all squares are rectangles but not all rectangles are squares’ types of situations.

So, I’m accepting the latter for the time being, but we’ll definitely revisit this portion of the study when we analyze the past 10 years – at a sample size of 12,300, nearly everything is statistically significant, allowing us to more definitively define these trends.

But wait! What about a comparison of the halves? Comparing the halves allows us to say with 86% confidence that the second half is more important (under the definition we’ve repeated several times) – but, it’s a commonly accepted notion that at least a 90% confidence level is required (many places a 95% confidence level) to make a conclusion, so alas, we have no conclusion. If the same ratio holds up when looking at the last 10 years, we’ll have an incredibly statistically significant conclusion, so we’ll see then. But verily this vichyssoise of verbiage veers most verbose, so let me move on to the Little White Takeaways.

LITTLE WHITE TAKEAWAYS


In this installment, we looked at whether or not certain quarters (or halves) were more beneficial to perform well in – or, in other words, which quarters correlated best with a team’s success – or, in even simpler terms, which quarters are most important. And in our analysis, we arrived at the following conclusions:

  • The 2nd half is likely to be more important than the 1st half, but the statistical evidence is not yet definitive (though it’s about as close as it can be without being considered definitive).
  • There is a “pecking order” of which quarter is most important to perform well in. Generally, this order is of the form 3rd = 4th > 1st >= 2nd (or, the 1st is better than or as good as the 2nd, and the 3rd and 4th are indistinguishable from each other but both better than the 1st and 2nd).
  • Specifically, that same order can be narrowed down a bit, down to 3rd >= 4th > 1st > 2nd (or, the 1st is better than the 2nd, the 3rd and 4th are better than the 1st, and the 3rd may be better than the 4th). This doesn’t contradict the above idea, it’s just a special case of it – we can be very confident that the above is true, and slightly less confident that this one is true.
  • All these conclusions will become much more set-in-stone when we perform this analysis on every box score over the past 10 years (rather than just this season). Some of you may wonder about the rule changes over the past 10 years affecting the results – but the good news is that the differential statistic should not be affected with regards to wins. There’s no reason to believe that the rule changes resulted in a certain quarter becoming more important (although never fear, just in case we’ll run some basic tests to see if somehow they did).
  • So what’s next? Well, there’s three items of business on tap for the near-future. First, there are still many angles of this analysis to consider – most notably, do certain teams perform better in certain quarters, and do the elite teams perform better in a particular quarter? Second of all, like I said in the first post, we want to conduct that analysis on Kobe Bryant and test if his team’s success really does go down as his shot volume increases, as well as whether that’s due to his choices or other causes (teammates’ off-nights forcing him to shoot more). And thirdly, I’d like to run a few of these tests we’ve been doing the past few days on just NBA playoff or NBA finals games, to see if the statistics change.

    So what comes first? Probably the first (on teams and differentials), but after that we’re likely headed for a break from this study for a few weeks so we can focus on Kobe Bryant. Yes, I know it would’ve been smarter to analyze Kobe during the Finals when all eyes were on him. I’m a blogger, not a businessman.

Wednesday, June 18th, 2008

Box Score Analysis: Wins by Differential, Part 1

Good afternoon, sports fans – I’m going to start out by saying that the whole ‘entry every two days’ thing isn’t going to continue all summer, so if you’re getting tired of reading a novel every couple days, never fear, this rate of posting will only continue through shortly after the season ends. I’ll probably settle into a once- or twice-a-week schedule over the summer, depending on how long my ideas for analysis hold out.

Speaking of which, does anyone know the plural for ‘analysis’? Analysises? Analyses? Analysi?

Today we’re going to look at one of the two things I previewed in the last Box Score Analysis entry – how quarter differentials correlate to wins. Essentially, we’re asking the question “how often did a team winning by X points after the first quarter go on to win the game?” for every possible value of X (and for every quarter and half).

In this case, I’ll be splitting the analysis in half. Unknown to me when I set out on this part of this research, there are a lot of conclusions, some far more important than others. Putting them all in one entry would dilute the impact of the more meaningful ones, so in this entry we’ll be covering the less impactful (though still interesting) ones. Next entry we’ll cover the real heavy-hitters. So today we want to see if there’s a certain time when the probability of winning drastically increases – for example, how much more likely to win is a team leading by 7 at halftime compared to a team leading by 5? Is it significant at all?

Unlike the last entry, I’m going to spend a good bit less time covering the statistical reasoning behind the conclusions and more time covering the conclusions themselves. If you want to see the proof behind the numbers, by all means let me know and I’d be glad to send it to you; or, you can run the numbers yourself: I’m posting the data sheet that’s being used to derive all this information right here.

Statistical Significance Overview

But let me start by going back to that pesky ‘statistical significance’ idea (which, if you understand already, jump ahead three paragraphs). Again, the upcoming ‘Stats Primer for a Sports Fan’ will detail what statistical significance is, but basically if something doesn’t have it, it’s not proven. A stat is ‘statistically significant’, by definition, if it is very unlikely to have simply happened by chance. For example, if a player is listed as a 60% free-throw shooter and misses three times out of three free-throw attempts, that’s not statistically significant enough to make us doubt that he’s really a 60% shooter (because statistically there was about a 1-in-16 chance he’d miss all three). But, if a player is listed as a 95% free-throw shooter and misses three straight, that’s pretty significant because it’s unlikely that a shooter who was really that good would miss three out of three (statistically, it’s about a 1-in-8,000 chance). (Important note: we’re saying this as if we only observed the shooter taking three free-throws. The best free-throw shooter in the world will miss three straight at some point in his career – but what are the odds that he misses all three during the one specific set of three we happen to observe?)
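As a quick sanity check on those odds, the chance of missing three straight free throws is just the miss rate cubed:

```python
# The free-throw example, computed directly: assuming independent attempts,
# P(miss three straight) = (miss rate) ** 3.
def miss_three_straight(ft_pct):
    """Probability that a shooter with the given free-throw percentage
    misses three consecutive attempts."""
    return (1 - ft_pct) ** 3

print(round(miss_three_straight(0.60), 3))   # 0.064, about 1 in 16
print(round(miss_three_straight(0.95), 6))   # 0.000125, about 1 in 8,000
```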

Statistical significance is thrown around a lot because it’s a pretty general term, but here we’re going to mainly use it when talking about comparing two statistics. For example, Peja Stojakovic shot 92.9% from the free-throw line this year, and Dirk Nowitzki shot 87.9%. Is that difference statistically significant? If so, we can say that there’s statistical proof that Stojakovic was a better free-throw shooter than Nowitzki this year; but if not, we can’t conclusively assert that (incidentally, it’s not statistically significant, although the difference between Chauncey Billups shooting 91.8% and Dirk is significant even though Chauncey shot worse than Stojakovic. See why we call it ‘Little White Statistics’?).

And a final note: when we refer to ‘confidence’ in terms of statistical significance, it means something pretty simple: it’s how confident we can be that the observed results come from an actual difference, rather than just random sampling error. So basically, when we say “we can conclude this at 95% confidence”, it means we’re 95% sure what we’re concluding is true.

Study Background

Alright, enough fluff. The reason I bring up statistical significance is because this analysis really depends on it to make any kind of conclusions. But before we get to the takeaways, a brief background:

This portion of the study was completed by taking all the box scores from the 2007-2008 NBA regular season, computing the quarter/half differentials for each quarter (with respect to the home team, so a negative differential means the away team outscored the home team), and then looking at how many wins and losses each quarter/half differential led to. Then, we did our correlation voo-doo magic to see what increase in win percentage each point added to the differential gave. And finally, we looked to see if any of that crap was statistically significant. And if you really want to see the numbers, I can show them to you – but I’d recommend taking my word on it. If I was making stuff up, I’d make up far more conclusions than this.
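For the curious, the tallying step might look roughly like this in Python – the two line scores below are invented, standing in for the 1,230 real box scores:

```python
# A sketch of the tallying step: from per-quarter line scores, compute each
# period's differential with respect to the home team and tally how many
# wins and losses each differential led to. Shown here for Q1 only.
from collections import defaultdict

def tally(games):
    """Map first-quarter differential -> [home wins, home losses]."""
    table = defaultdict(lambda: [0, 0])
    for home_qs, away_qs in games:
        diff_q1 = home_qs[0] - away_qs[0]        # home-team Q1 differential
        home_won = sum(home_qs) > sum(away_qs)   # final score decides it
        table[diff_q1][0 if home_won else 1] += 1
    return dict(table)

# Two invented games: (home quarters, away quarters)
games = [
    ((28, 25, 22, 26), (24, 27, 20, 25)),  # home wins, +4 in Q1
    ((20, 30, 24, 22), (26, 25, 28, 24)),  # home loses, -6 in Q1
]
print(tally(games))
```

The real study repeats this for all four quarters, both halves, and the through-three differential before running the significance tests.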

And with that, on to the results, subdivided into topics for your reading convenience:

The Halftime Differential

Let’s lead off with something bizarre. In the 2007-08 season, what halftime differential from leading-by-5 to trailing-by-5 was most likely to lead to a home team victory? Leading-by-5? No – within that range, the home team won most often (over that margin) when they were trailing by three points at halftime. This season, the home team trailing at halftime by 3 points won a bizarre 75.7% of their games (28 out of 37), compared to about 65% from margins +1 to +5, and around 55% from -1 to -2. That’s statistically significant at 95% confidence compared to differentials -2 through 1, but not statistically significant compared to 2 and higher.

Similarly bizarre, in games that were tied at halftime, the home team actually lost more often than they won – the home team won only 46% of games that were tied at halftime (24 out of 52). That’s not statistically significant compared to most negative differentials, but it is compared to that -3 halftime differential (at an excessively high confidence level, too).

So is the home team really more likely to win when they’re down by 3 at halftime than if they’re tied? I’m taking this conclusion with a grain of salt. 95% confidence is a high level, but statistically that means that for every 20 conclusions you make at 95% confidence, one will likely be wrong. I have a feeling this might be that one – but fortunately, this topic is very easy for further research (which I’ll mention later). And yes, in case you’re keeping score at home, we just used statistics to analyze statistics. To be specific, we statistically proved that statistics aren’t always reliable. But is that a reliable conclusion? And with that, this blog disappeared in a puff of logic.

But by that same token, we’re not talking 95% confidence in this statistic. According to the numbers, we can (apparently, note I’m still as skeptical as you) assume a 3-point halftime deficit leads to more home team wins (than a halftime tie) with a remarkable 99.7% confidence. So either I completely screwed up the math somewhere, or we’re on to something (if anyone’s skeptical enough to check my math, we have a proportion of .757 with 37 samples and a proportion of .462 with 52 samples). But I’m still skeptical, so this will definitely be one of the items touched on when we re-do certain parts of this analysis for all the games over the past ten years (oops, gave away the ending).
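Since I invited skeptics to check the math, here it is as a two-proportion z-test on the numbers given above (28-of-37 down three at the half, vs. 24-of-52 tied at the half):

```python
# Checking the 99.7% claim: two-proportion z-test on the home team's record
# when down 3 at halftime (28 of 37) vs. tied at halftime (24 of 52).
from math import sqrt, erf

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

w1, n1 = 28, 37   # home team down 3 at halftime
w2, n2 = 24, 52   # tied at halftime
p1, p2 = w1 / n1, w2 / n2
pooled = (w1 + w2) / (n1 + n2)
se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se

print(round(z, 2))            # about 2.79
print(round(norm_cdf(z), 3))  # one-tailed confidence, about 0.997
```

So the 99.7% figure holds up (one-tailed), which is exactly why this oddity deserves the bigger-sample follow-up.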

I should also note I’m not implying any causation here – I’m certainly not saying it’s wise for a home team to drop down 3 points before halftime. What we’re looking at here are measures that predict what would happen anyway. We aren’t saying that trailing by three at halftime leads to a win – what we’re saying is that the conditions that lead to a 3-point halftime deficit also lead to a victory by the end of the game.

Through-Three Differential

The team leading at the end of three quarters was always more likely to win this season, regardless of whether they were home or away, and regardless of the differential. Away teams leading by as little as one point after three quarters won 61.5% of the time, while home teams leading by as little as one point won 54.7% of the time. The difference between those two is statistically significant (at 94% confidence).

Also interesting (and touched on more in the next analysis) is that once you get to a meager 4-point lead going into the fourth quarter, your victory percentage is sky-high – 75% for the home team, 71% for the away at a 4-point differential, and the percentages only get higher from there.

Critical Points

There’s absolutely no way to phrase this section title that completely prevents any possible puns.

At the beginning, we said we wanted to see if there’s a certain differential in each quarter/half that signifies greatly increased odds of a win. And, as it turns out, one does appear. Analyzing statistical significance here is difficult (because we’d have to compare every pair of differentials’ winning percentages over a large range, for each of the seven time periods), but just some random sampling (yes, now we’re randomly sampling our statistics) for statistical significance revealed these are likely significant at the 90% confidence level, at the least.

  • 1st Quarter: Home: 2; Away: 6
  • 2nd Quarter: Home: 4; Away: 6
  • 3rd Quarter: Home: 5; Away: 5
  • 4th Quarter: Home: 3; Away: 7
  • First Half: Home: -3; Away: 5
  • Second Half: Home: 1; Away: 4
  • Through-3: Home: 2; Away: 1

There’s some pretty interesting stuff in there, believe it or not. In most cases, those point differentials correspond to a point at which teams become around 20% more likely to win the game, and sustain that increased win percentage over higher differentials. There’s a couple notable items in this:

  • First of all, it’s pretty notable how much less the home team needs to do to raise their win percentage. In most cases, a differential of -2 (the away team leading by 2) is what corresponds to an even winning percentage between the two teams.
  • Even more notable is that the home team still has a strong chance of winning as long as they’re trailing by 3 points or fewer at the end of the first half. We covered at great length the fact that a 3-point halftime deficit this season still resulted in a winning record for the home team – but beyond 3, the drop is significant: trailing by four brings victory only 41% of the time, and the ratio decreases steadily after that. And, conveniently, the difference between -3 and -4 is statistically significant, adding to the intrigue of the -3 differential.
  • We mentioned this earlier, but it’s also notable how delicate the through-three differential is – one 3-pointer drastically changes the odds of victory from the home team’s favor (70% when winning by 2 entering the fourth) to the away team’s (62%), a pretty ridiculous 32-percentage-point swing.

As I said above, no causation is implied here; I’m not trying to say that the act of winning the first quarter by 2 points causes the home team to be substantially more likely to win. Instead, I’m suggesting that whatever causes the home team to be up by 2 or more also causes the home team to eventually win the game. Leading by those differentials is a sign that they stand a good chance of winning the game – not the reason they do.
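For the curious, the critical-point idea can be sketched as a simple scan over the per-differential table: find the smallest lead whose win percentage jumps roughly 20 points above the tied-game baseline and stays there for all larger leads. This is a hypothetical reconstruction of the method, not the exact code behind the table:

```python
def critical_point(win_pct, jump=0.20):
    """Find the smallest positive differential at which win percentage
    exceeds the tied-game baseline by at least `jump` and stays at or
    above that level for every larger differential in the table.

    `win_pct` maps differential -> winning percentage; an illustrative
    stand-in for the per-differential table built from the box scores.
    """
    baseline = win_pct.get(0, 0.5)
    diffs = sorted(d for d in win_pct if d > 0)
    for i, d in enumerate(diffs):
        if all(win_pct[x] >= baseline + jump for x in diffs[i:]):
            return d
    return None

# Toy curve: win% first clears the +20-point threshold at a lead of 4
# and stays above it, so 4 is the critical point.
curve = {0: 0.50, 1: 0.55, 2: 0.62, 3: 0.68, 4: 0.75, 5: 0.78}
print(critical_point(curve))  # 4
```

The "stays there" condition is what keeps a single noisy differential from registering as a critical point.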

Regression Analysis

Like last time, I ran a regression analysis, seeking a correlation between differential (for each quarter and half) and winning percentage.

There is one – an incredibly strong one. The second, third and fourth quarter differentials each correlate incredibly strongly to winning percentage (the first quarter differential correlates as well, though not quite as strongly – R = .9 for the first quarter versus R = .94 for the second, third and fourth). Basically, this means that outscoring your opponent by more points during a given period of the game raises your chance of winning. We’re really uncovering deep, hidden secrets now, aren’t we? I think we just statistically proved that you win a game by outscoring your opponent. Groundbreaking, absolutely groundbreaking.

The slopes of these regression lines border on relevant, though. The quarter regressions all hold slopes of roughly .023, implying that for every point added to the differential, winning percentage increases by .023. To put that in terms that make sense, it means statistically if a team outscores its opponent by 5 points in the second quarter in every game, they’ll likely win two more games (over a season) than if they outscored their opponent by only 4 points in those quarters.

More relevantly, that means that if a team raises its average differential in one quarter by 1 point, it’ll average 2 more wins over an 82-game season. For a long-term coach, that’s a great goal. Raise it by 1 point per quarter and that’s possibly an 8-game improvement. That might sound drastic, but consider what a difference a 4-point average differential makes in the league – in 2007-08, a 4-point difference is what separated the Jazz and the Raptors.
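That back-of-the-envelope arithmetic fits in a few lines. A sketch, assuming the roughly .023 slope from the quarter regressions above:

```python
SLOPE = 0.023  # change in win probability per point of quarter differential
GAMES = 82     # regular-season games

def extra_wins(diff_improvement, quarters=1):
    """Expected additional wins over a season from raising the average
    differential by `diff_improvement` points in `quarters` quarters."""
    return SLOPE * diff_improvement * quarters * GAMES

print(round(extra_wins(1)))     # ~2 wins from one extra point in one quarter
print(round(extra_wins(1, 4)))  # ~8 wins from one extra point per quarter
```

The one-point-per-quarter case is exactly the 8-game improvement described above.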

Miscellaneous

And beyond all of the above, there are a few things in this analysis that I just find flat-out interesting. There’s no statistical relevance to any of them, but they’re interesting observations.

  • No home team recovered from being down 16, 17 or 19 points after one quarter (a total of eight occurrences), but two of the three home teams down 20 after the first recovered: Minnesota against Indiana and Phoenix against Seattle. Minnesota completely erased the 20-point deficit and led at halftime by 1, whereas Phoenix trailed by only 2.
  • The home team actually held a winning record when being outscored by 10 in the second quarter, or by 6, 7 or 10 in the third. They did not, however, hold a winning record when being outscored by anything more than 4 in the first quarter, or 5 in the fourth.
  • The lowest quarter differential to yield a 100% winning percentage was 13, when scored in the fourth quarter by the home team. The away team required a 16-point quarter-differential, but could have it occur in either the first or third quarters.
  • The away team won 3 times despite being outscored by 19 points in the second half, but never won at any other second-half deficit greater than 16.

One of the things I plan to do later in the summer is re-hash the more ‘controversial’ or ‘fuzzy’ conclusions from this analysis by expanding the sample pool ten-fold and looking at the statistics for every game over the past ten years. If the conclusion on halftime differential holds up then, there’ll be only a one-in-a-trillion (in other words, effectively impossible) chance that it’s coincidence.

I think that’s about all the information I can beat out of this data without stepping into the second half of our analysis. If anyone has any other questions that might be answered by this data, feel free to e-mail me at the heavily disguised e-mail address on the left. Wait until tomorrow though, since I’m only half-done with this portion of the analysis. Now, on to the takeaways.

LITTLE WHITE TAKEAWAYS

So, in this analysis, we looked at how often each point differential led to a win (or, more specifically, the winning percentage associated with each point differential). As always, teams were separated by location, since it’s been thoroughly discovered that differential trends are very different between home and away teams.

  • Halftime Differential: Crazy stuff – I recommend reading this part regardless of your knowledge or interest in statistics. Basically, there’s evidence that the home team wins more often when trailing by 3 points at halftime than if the game is tied at halftime. It sounds bizarre, but the statistics behind it are extremely straightforward. Later this summer I’ll look at this again with data from the past 10 years (or 7, depending on how far back Yahoo!’s box scores go) and see if it still holds true.
  • Through-Three Differential: The team leading at the end of three quarters, regardless of home or away and regardless of the amount they lead by, is always statistically more likely to win the game (though not extensively – 82% of games are won by the team leading after three, but only around 60% are won by the team leading if they lead by less than 3).
  • Critical Points: There are critical points in the differential for each quarter and half, meaning that there is a certain differential that begins to lead to a much larger chance of winning. For example, a 1-3 point advantage for the home team in the second quarter (only the second quarter, not first and second) yields about a 57% chance of victory – however, 4 and above yields a 70%+ chance.
  • Regression Analysis: A regression analysis showed there’s a very strong correlation between quarter differentials (in every quarter) and final result. Especially interesting from this part is the impact that a small improvement in differential can have – this part is also interesting reading even for those not interested in statistics.
  • Miscellaneous: Weird stuff happens.

Don’t miss next entry, though. In the next entry we unveil some very interesting statistics about the power of individual quarters, and what periods of the game are most important to perform well in. It’s definitely interesting even to the casual fan, so come back tomorrow when it’s finished and posted. Until then, wish me luck on my last week as a college student.

-DJ

Friday, June 13th, 2008

Box Score Analysis: The Basics

Like I mentioned last entry, there’s lots and lots of ways to approach this – and some of them are really, really interesting. But there’s a basic foundation that should be laid that encompasses the results in the most general sense.

How well do in-game (within and after each quarter) differentials correlate to the actual differential in the final score? While next time we’ll look at how often each differential leads to a win, this time we’re just looking at how well the periodic differentials predict the final result. If you’re unfamiliar with the idea of correlation, don’t worry – it’s pretty easy to understand what’s going on below.

We’re going to check the correlations between seven different Percent Differentials and the final differential: each quarter’s Percent Differential (for example, the differential for JUST the second quarter, not the first two quarters), each half’s Percent Differential, and the differential after the first three quarters combined (just for curiosity sake). So without giving myself any opportunity to be more wordy, on to the analysis:

(If the items like ‘R’, ‘Slope’ and ‘Standard Error’ don’t make any sense to you, come back in a few days when I have the Statistics Primer posted – it’ll give you an overview of what these things mean. In the meantime, just know that ‘slope’ means that, on average, the quarter differential is the slope multiplied by the final differential and R represents how strong the correlation is)
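For the code-inclined, all three quantities come out of an ordinary least-squares fit. A minimal sketch – not the actual analysis code, which presumably runs over the full box-score database:

```python
import math
from statistics import mean

def regression(x, y):
    """Least-squares fit of y on x; returns (slope, R, standard error).
    Here x would hold one differential series and y the other."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / math.sqrt(sxx * syy)
    # Standard error of the residuals (n - 2 degrees of freedom).
    residuals = [b - (slope * a + intercept) for a, b in zip(x, y)]
    se = math.sqrt(sum(e * e for e in residuals) / (n - 2))
    return slope, r, se

# Perfectly linear toy data: slope 2, R = 1, zero standard error.
slope, r, se = regression([1, 2, 3, 4], [2, 4, 6, 8])
print(slope, r, se)  # 2.0 1.0 0.0
```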

Correlation #1: First Quarter Differential vs. Final Differential

R: .44
Slope: .2462
Standard Error: 6.9

Correlating First Quarter differential with Final Differential yields a very loose correlation, as suggested by the low correlation coefficient (R). So, next time you’re tempted to say “a 4-point lead after one quarter? Why, that’s a double-digit win!”, come back and look at that chart, because unfortunately it really doesn’t work that way very often; unless your First Quarter differential is up in the high teens or low 20s, it’s probably best not to try to draw any conclusions.

Correlation #2: Second Quarter Differential vs. Final Differential

R: .44
Slope: .2411
Standard Error: 6.7

And the correlation between the Second Quarter differential and the Final differential is… well, essentially identical to the one with the first quarter. Don’t be misled by the graph, however – it may appear that the Second Quarter is even more jumbled and random than the first, but this is really a result of a few outliers in the top right (total blowouts) changing the appearance of the graph.

Correlation #3: Third Quarter Differential vs. Final Differential

R: .48
Slope: .2700
Standard Error: 6.8

Now, the observant members of our audience will notice that there is a slight difference between these third quarter measurements and the previous two quarters: namely, R is .04 higher, and the slope is .03 higher. Is this statistically relevant (that is, do these statistics conclusively demonstrate something absolute, or could they be a result of random error)? That… is a question for the end of this analysis.

Correlation #4: Fourth Quarter Differential vs. Final Differential

R: .43
Slope: .2355
Standard Error: 6.8

And in the fourth quarter, we return to the results from the first two quarters – actually, even a tiny bit lower. While this small discrepancy isn’t statistically significant (basically, it doesn’t conclusively prove anything), I believe (with no statistical grounds) that it is still accurate, due to one type of game: blowouts. A notable portion (which I can calculate if anyone is interested) of NBA games are decided by 15 points or more. These games usually see bench players entering the game and playing the final minutes, resulting in the fourth quarter differential being completely different from the rest of the game. This would result in a lower R value, as we see here (which, again, statistically isn’t proven to actually be lower – I’m just speculating).

I’m going to pause here before moving on to the first-half and through-three correlations to analyze this a bit, given that these four studies can be directly compared (all are 12 minute periods). Above I mentioned that the third quarter yields higher values for R and slope than the other three quarters. These measurements, if accurate, would suggest two things: (a) a higher third-quarter differential means a higher final differential, compared to that of the other three quarters, and (b) third-quarter differential is a better predictor of final differential. But, are these measurements statistically significant?

There’s good news and bad news on that. First, the bad news: we can’t conclude from this data that a higher third-quarter differential leads to a higher final differential compared to the other quarters; the standard error (basically, how much the data varies) is too high to really draw any statistical conclusions on the slopes of any of the quarters, other than they’re somewhere in the .22-.28 range.

There is good news, though. According to the data, we can say (with 90% confidence) that the R value for the third quarter really is higher than the R value for the others; the 90% confidence interval for the third quarter R value lies just barely outside the 90% confidence interval for the other quarters.

So what does that mean? The statistics show that the third quarter differential – that is, the point differential in only the third quarter (not quarters one through three) – is a stronger predictor of the final differential than the point differentials of the other quarters. Or, in simpler terms, you’ll find the third quarter predicts the final outcome more often than any of the other quarters. This, to me, is early evidence of something I think will be statistically proven by the time we’re done with this analysis – that is, that the third quarter is the most important quarter in the game. Obviously this hasn’t been conclusively shown here yet, but the early indicators are there.

Now let’s take a look at the halves:

Correlation #5: First Half Differential vs. Final Differential

R: .6466
Slope: .4873
Standard Error: 7.84

As could be expected, a half serves as a much better predictor of the game’s final differential than just a quarter, which is shown here by the higher R value. Interestingly though, this R value is still relatively low (given the corresponding R-square value of .42, which symbolizes a present but weak correlation). Also interesting is that the slope – .4873 – is lower than .5. Given that these data are computed from the actual regular-season results, it’s necessary for all the slopes to add to about 1 (you’ll notice the four quarters’ slopes add to roughly 1 as well), which means…

Correlation #6: Second Half Differential vs. Final Differential

R: .6592
Slope: .5054
Standard Error: 7.86

…that the slope for the second half should be higher. And, indeed, it is. Unfortunately, the discrepancy between the slopes is nowhere near statistically significant (thanks again to that high standard error), but that doesn’t mean it isn’t notable anyway. Lacking statistical significance means we haven’t proven anything, but it doesn’t mean we haven’t found evidence possibly suggesting something. There is also a difference here in the R-values between the two halves – this difference isn’t statistically significant either (at a 90% confidence level), but it does reinforce the early idea that the third quarter may be the most significant quarter in the game (though its effects may be diluted by the comparably weakest fourth quarter, both of which factor into the second half).
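Incidentally, the slopes-add-to-about-1 observation isn’t a coincidence: because the per-period differentials sum to the final differential, the slopes of regressing each period on the final must sum to exactly 1 (their covariances with the final sum to the final’s variance). A quick demonstration on synthetic data – hypothetical quarter differentials, not the real box scores:

```python
import random

# Synthetic season: four independent per-quarter differentials per game.
random.seed(0)
quarters = [[random.gauss(0, 6) for _ in range(1000)] for _ in range(4)]
finals = [sum(game) for game in zip(*quarters)]  # quarters sum to the final

def slope(x, y):
    """Slope of the least-squares regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Regress each quarter's differential on the final differential.
slopes = [slope(finals, q) for q in quarters]
print(sum(slopes))  # 1.0 (up to floating-point rounding)
```

The same identity forces the two half slopes (.4873 and .5054) to land on either side of .5.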

And now, one last analysis, just for kicks and giggles…

Correlation #7: Through-Three Differential vs. Final Differential

R: .83
Slope: .7573
Standard Error: 6.90

This correlation isn’t as useful as the others, since there’s no comparable time period to measure it against (except the final three quarters, which wouldn’t be too useful); additionally, it’s really just the inverse of the quarter analysis. But it’s useful for keeping our sanity while actually watching games, because the differential entering the fourth quarter is strongly correlated (far more strongly than anything else we’ve looked at) with the final differential. This is likely an effect of 36 minutes having an (obviously) stronger impact on the game than any 12-minute period, but it’s still interesting to see just how close the correlation is. While even a 10-point lead after one quarter failed to correlate with a double-digit win, a 7-point lead entering the fourth strongly relates to an easy win (obviously not EVERY time, but a substantial proportion).

So, that’s about all the information I can milk from this portion of the analysis. I’ll sum everything up below in the Takeaways section, but this analysis provides us with a great jumping-off point for the next two portions of this study.

First of all, while the high standard error made it difficult to draw any conclusions about the final differential, it shows that there is a high degree of variability in the differentials after each quarter (as opposed to the majority of games having only a 4 or 5 point swing per quarter). From there, we can examine the question, do particular teams find more success in different quarters, and if so, is there a particular trend among the more successful teams?

Secondly, while we’ve shown that most differentials do a poor job of predicting the final differential, we haven’t examined whether they predict the final outcome at all. A team with a 10-point halftime lead may ease up in the second half, causing the differential to fail to correlate but preserving the win. What differentials at what milestones most often correlate with a victory?

These are our next two topics (not necessarily in that order) – they should be up within the next week at most.

LITTLE WHITE TAKEAWAYS

In this portion of the analysis, we’ve uncovered one fact that is actually backed up by statistics, and a handful of ideas that are suggested by the statistics, though far from being explicitly proven.

The notably demonstrated fact is that the differential within the third quarter (that is, only in the third quarter, not through the first three quarters) is statistically the most accurate (of the four quarters) in predicting the final game differential. This serves as possible early evidence that the third quarter may be the most important quarter in an NBA game.

Also notable, though, was the fact that none of the quarters, nor either half, demonstrated the ability to reliably predict the final differential. There are correlations, but they are very weak; this means, for example, that an 8-point halftime lead typically predicts anywhere from a 4-point loss to a 20-point win. So next time you’re tempted to say “oh, a 5-point lead, we’re going to win by double digits!”, remember this entry.

Statistically we couldn’t actually demonstrate anything else, but the statistics did suggest a couple other ideas that should be explored further. Note that these are most certainly not proven truths, just possibilities:

  • That the second half differential is a better predictor than the first half.
  • That the fourth quarter is the least effective predictor, though it is likely diluted by blowouts (in which the fourth quarter is played very differently from the first three).

Coming up next are two extremely interesting (in my opinion) parts of the analysis: first, which teams typically do better in which quarters, and is there a trend to which quarter elite teams outperform their competition in? And second, how often do particular leads after each quarter correspond to victories, even if the margin of victory is lower?

So that’s the end of this marathon analysis. This one is likely longer than the others will be, due to its role as the jumping-off point, so if you’re out of breath after reading this epic of an analysis, don’t worry – so am I.

-DJ

Tuesday, June 10th, 2008

Introducing the Box Score Analysis

It’s a trap we’ve all fallen into from time to time. Clinging to a 3-point lead after one quarter, we desperately tell ourselves, “it’s ok, it’s ok! At this rate, that’s a 12-point victory!” Fortunately most of us don’t delude ourselves into thinking that 4-point lead after one minute of play automatically translates into the most lopsided victory of all time, but I think all of us have tried to draw certainty from a first-quarter score at least a few times.

Does it ever work? Of course not. (Drumroll please – incoming is the first statistical fact ever stated on Little White Statistics.) In the 2007-08 NBA season, only 47 of 1,230 games (a whopping 3.8%) had a final differential that was the corresponding multiple of the differential at the end of the first quarter, at halftime or through three quarters (meaning a 12-point win after a 3-point first-quarter lead, a 6-point halftime lead or a 9-point through-three lead). In case you’re interested, 15 were multiples of the first-quarter differential, 20 of the halftime differential, and 14 of the through-three differential (2 games matched more than one).
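For reference, the “corresponding multiple” check is just integer arithmetic. A sketch with illustrative argument names (not the database’s actual fields):

```python
def on_pace_exact(q1, half, thru3, final):
    """Return which checkpoints the final differential is an exact
    'on pace' multiple of: x4 for the first quarter, x2 for halftime,
    x4/3 for through-three."""
    hits = []
    if final == 4 * q1:
        hits.append('q1')
    if final == 2 * half:
        hits.append('half')
    if 3 * final == 4 * thru3:  # integer form of final == (4/3) * thru3
        hits.append('thru3')
    return hits

# A game that stays perfectly 'on pace' at every checkpoint...
print(on_pace_exact(3, 6, 9, 12))  # ['q1', 'half', 'thru3']
# ...and a far more typical one that matches nowhere.
print(on_pace_exact(5, 2, 7, 10))  # []
```

Counting the non-empty results across all 1,230 games is what yields the 3.8% figure.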

Now, I know what you’re thinking – well of course it doesn’t match exactly, but it’s pretty close, right? Well, dear friend, that is the first question we’re going to answer. How well does the score at a particular point in the game correlate to the game’s final outcome? Not just how often does the team leading at halftime win the game – but how accurate is the assumption that a 5-point lead after one might lead to a double-digit victory?

There’s about three hundred eighty-two and a half different ways to examine this – and fortunately, we have all summer! So, after parsing out the box scores for every regular-season game (and discovering some interesting things in Yahoo!’s box scores…), I have a database of every by-quarter box score of every game for the regular season (in related news, if you need any text-parsing applications written, I have some experience).

To make this analysis easier, I’m going to introduce a couple terms to avoid elaborately explaining the same concept over and over. Actually, right now I can only think of one term: I’m calling it Percent Differential (PD). Percent Differential refers to the lead a team holds, with respect to how much of the game has been played.

For example, a team that leads by 3 points after one quarter, 7 points at halftime and 10 points at the end of the third quarter would be said to have relatively the same Percent Differential throughout the game (that is, they outscore their opponent by about the same amount during every quarter). A team that leads by 5 after one quarter, trails by 2 at halftime and ends up winning by 12 would have a very different Percent Differential throughout the game (and in case it’s not obvious, ‘differential’ just refers to how many points a team leads/trails by).
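Converting the cumulative differentials at each checkpoint into per-quarter differentials – the raw material for Percent Differential – is nearly a one-liner. A quick sketch using the examples above:

```python
def per_quarter(cumulative):
    """Convert cumulative differentials at each checkpoint into
    per-period differentials (the building blocks of PD)."""
    out, prev = [], 0
    for c in cumulative:
        out.append(c - prev)
        prev = c
    return out

# The 'steady' team above: +3 after one, +7 at half, +10 through three.
print(per_quarter([3, 7, 10]))   # [3, 4, 3]
# The volatile team: +5 after one, -2 at half, wins by 12.
print(per_quarter([5, -2, 12]))  # [5, -7, 14]
```

The steady team outscores its opponent by 3-4 points every quarter; the volatile team swings by double digits between checkpoints.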

So, next entry we’ll get the ball rolling on what’ll be an ongoing analysis of the predictive power of the game score at different points in the game. Like I said, there’s dozens of things to take away from this sort of analysis – some of the things we’ll look at include:

  • Correlation between quarter-differentials and the final differential.
  • The critical points where a certain lead begins to strongly correlate to winning percentage.
  • Which specific teams are more consistent with their Percent Differentials.
  • Whether home teams have a better chance of maintaining positive Percent Differentials (whether home teams are more likely to increase their leads and decrease their deficits).
  • Whether the statistics change as the season goes along.
  • How quarter differentials impact half differentials, and how half differentials impact games.
  • What teams are historically stronger in certain quarters, and whether that translates into real success (for example, is the best third-quarter team better overall than the best first-quarter team?).

Fortunately, as I’ve gone along with this study, I’ve already started to observe some interesting trends – so unlike some studies where the eventual usefulness of the results is up in the air until they’re obtained, in this particular instance I can already say there will be something notable coming out of this. So, if you’re as interested in this statistical orgy as I am, join us next time – the next post should be up in a couple days or so.

LITTLE WHITE TAKEAWAYS

Told you I was having a simplified ‘takeaways’ section. Most entries will have one of these sections, so that if you want to skip the statistical crap you can jump straight to what they’re supposedly proving. Or, alternatively, you can look here and check what I’m claiming – then, if you agree you can just smile and move along, and if you disagree you can thrash my reasoning to try to find a counter-argument. Point being, here we’ll sum up the results.

Well, except for this post, since we haven’t proved anything yet. The takeaway here is we’re going to do cool stuff, so you should come back. We have cake (the cake is a lie).

-DJ

Monday, June 9th, 2008