I hate to sound fatalistic, but there’s actually quite a bit about the world that we just don’t know. It’s far too opaque, interconnected, random, and complex. That’s not to say we can’t know anything, but rather that we need to proceed with caution any time we’re trying to figure out anything with more than a minimal degree of complexity to it.
Take, as an obvious example, the current COVID-19 world. A certain portion of the population considers the virus to be an existential threat that could kill virtually anyone who comes into contact with it. Another portion of the population sees CV as a minor inconvenience, no more remarkable than a particularly bad flu season.
And here’s the rub: they can both cite data, statistics, experts, and so on to support their side! How is that possible?
One place to start looking is the way we’ve pursued research and data gathering around the novel coronavirus. In many situations less is more, while more is decidedly less. This is true of data because, somewhat paradoxically, the more data you have, the less you know. Got it?
Consider, for example, the randomness that’s inherent in the daily CV death numbers. Any time you’re evaluating data there will be a certain amount of randomness mixed in, making it difficult to distinguish signal from noise. By reporting daily numbers we have more data to reference, which actually makes the picture less clear. The world is not a spreadsheet we can look at and objectively extract facts and figures. There is always bias, noise, and randomness mixed in. Always.
Consider the following real-world data. Yesterday (Sept. 9, 2020) Nebraska (where I live) had 15 new COVID-attributed deaths, the highest single day total since June 12th.
THAT’S TERRIBLE NEWS!!! … right?

Well, as Lee Corso would say, “Not so fast, my friend!” In the days preceding the 9th we saw just two new deaths on the 8th, and zero per day on the 4th, 5th, 6th, and 7th.
So, it’s possible that all of a sudden CV switched gears and has become way more deadly in a 24-48 hour timespan. Or maybe we’re being fooled by randomness, as our friend Nassim Nicholas Taleb likes to say.
If you’re not familiar with the US calendar you may not realize that Monday, September 7th, was Labor Day. What if the people in charge of tabulating these numbers had the day off on Monday and spent most of Tuesday catching up on an overflowing email inbox?
Maybe one county out there got their numbers in (2 new deaths) on Tuesday but the rest of the state entered deaths from Friday, Saturday, Sunday, Monday, and Tuesday (5 days) into the system to be counted on Wednesday. Wouldn’t you expect there to be a spike?
If you do the math on deaths from the 4th through the 9th, you have 17 deaths over 6 days, or 2.83 per day. That makes a lot more sense when you realize the 7-day moving average (next screenshot) has been hovering between one and three deaths per day since June 25th, except for a single day (Jul. 20) when it hopped up to 4 and then dropped back to 3 the next day.
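If you want to check that arithmetic yourself, here’s a quick sketch in Python. It uses only the daily figures quoted above (zero on the 4th through the 7th, two on the 8th, fifteen on the 9th); the variable names are mine and purely illustrative.

```python
# Daily COVID-attributed deaths in Nebraska, Sept. 4-9, 2020,
# exactly as quoted in this post (not pulled from any external source).
daily_deaths = {
    "Sep 4": 0,
    "Sep 5": 0,
    "Sep 6": 0,
    "Sep 7": 0,  # Labor Day
    "Sep 8": 2,
    "Sep 9": 15,
}

total = sum(daily_deaths.values())
days = len(daily_deaths)
per_day = total / days

print(f"{total} deaths over {days} days = {per_day:.2f} per day")
```

Spread across the six days, the scary-looking 15 lands right in the one-to-three range the moving average has shown all summer.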

As this example illustrates, it’s often helpful to have less data when you’re trying to figure out what’s really going on. By lowering the resolution of the data, so to speak, to look at a 7 day moving average rather than daily numbers we’re able to ascertain that the primary factor influencing the spike in CV deaths was a federal holiday, not the actual reality of COVID’s spread and severity in Nebraska.
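To make the “lower the resolution” idea concrete, here’s a minimal sketch of a trailing moving average in Python. A few assumptions to flag: the function name `trailing_average` is my own, and I’m using a 3-day window only because this post quotes six days of numbers, which isn’t enough to fill more than one 7-day window. It illustrates the technique; it does not reproduce the chart.

```python
def trailing_average(values, window):
    """Return the mean of each full run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Nebraska daily deaths for Sept. 4-9, 2020, as quoted in this post.
daily = [0, 0, 0, 0, 2, 15]

# Assumption: a 3-day window, chosen for illustration only.
# The chart discussed in this post uses a 7-day window.
smoothed = trailing_average(daily, window=3)

print("largest daily count:", max(daily))
print("largest 3-day average:", round(max(smoothed), 2))
```

Even a short window pulls the peak well below the raw daily number, and a longer window flattens a reporting backlog further because the catch-up day gets averaged with the empty days that caused it.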
The graph above also points out another interesting reality. In terms of the most concrete impact of COVID-19 (deaths), Nebraska was actually in decidedly better shape through July and August than in May and June, when the 7-day moving average hovered between three and seven deaths per day.
One last thought. As my friend Dusty once observed:
He’s right and the reason, again, has to do with too much data. There are Twitter accounts like Matt’s below that seem to love posting daily figures and then telling us how bad things are. If the number of daily positive cases is high the commentary is something like, “Man… super high number… super bad… super scary…” and if the number is lower than expected, like in the following tweet, then it’s discounted and explained away as being somehow out of line with reality.
This blend of data selection bias and confirmation bias is particularly dangerous when politicians and bureaucrats use this approach to guide their policy decisions.
One way to counter this is to zoom out, look at larger sample sizes (7 day moving average vs. daily numbers, for example) and to consider multiple types of data (e.g. case counts, hospitalizations, deaths) rather than looking at them as individual, canonical metrics.
Another way to counter this is to drastically limit the power and authority that politicians and bureaucrats have over their fellow citizens, but we’ll leave that point for another day.
Solid article Mike
Jumped up and cheered when I saw the data selection and confirmation bias. Has been near and dear to my heart when poring over the SARS-CoV-2 data. Those two biases are strong when independent and dependent variable selection is off the mark!
Still remember Don Clifton (SRI – Gallup) telling me you manage what you can measure, and if your measurements are off you can’t manage.