I spent the last two weeks in a “summer” school on Machine Learning. That, combined with poor recovery from jetlag, has resulted in my waking up at night after nightmares involving probability distributions. This ill state of mind was the inspiration for this post.
I had previously written that one of the most misused concepts in all of statistics and data mining is the average. I decided to broaden that argument.
For most of the statistics involved in our daily business problems, we love to think they follow the normal distribution (shown in the figure). This is what we have been conditioned to believe as engineers and business people of the 21st century. “The curve”, we say, bell-shaped. And we love the mean and the standard deviation.
Well, we are mostly wrong. And some of us even go as far as to justify our fallacious arguments on the topic by throwing out arcane terms like the “law of large numbers” or the “central limit theorem”.
For many of us, the statistical distribution closest to our hearts was the distribution of grades people got on an exam. Where I went to college, your life basically depended on the “curve”. Despite now working actively as a data scientist, I was deep below the mean in my first stats course – and ended up going into the industry with only a fair understanding of stats (and a lot of ill-justified self-confidence typical in fresh graduates of my college).
Not a lot of people interrupted me when I summarized almost all data items in means and standard deviations, and I even fast tracked a promotion. (Then I got curious and went to grad school.)
Long endeavors in educating humans have shown that marks on an exam do indeed follow a quasi-symmetric bell-shaped curve. So characterizing that variable as normally distributed is reasonably accurate. This is probably true for people’s heights too, but not for the things that matter in business: credit default rates, monthly incomes, telecoms spending, and basket value. These variables most probably follow the exponential distribution. (See shape above.)
As you can kind of infer from the shape, there is no intuitive average to such a curve. The mean of an exponentially distributed variable tells you very little about what a typical case looks like, since most of the population sits below it. The standard deviation, even less so. So what do we say about this data? How do we describe it? In my experience, the best way to throw around intuition about such data in a boardroom is by using the so-called “n-tiles”.
What are those? Well, let’s say you know the telecoms spend of 100 people. To give a good snapshot, your best bet is to rank those people and start counting: the highest-spend guy spends X dollars, the top 10% of people spend, say, 70% of the money, the top 20% spend 90% of the total, and so on. What you do essentially captures the shape, ahem, parameterized by the “lambda” of your exponential distribution. But you get to do that without saying “lambda”. (My advice: never say “lambda” to co-workers in a boardroom. People have been known to react unpredictably to letters of the Greek alphabet. To most, it just brings back painful memories.)
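For the curious, the rank-and-count exercise above can be sketched in a few lines of Python. This is only an illustration under made-up assumptions: the 100 customers, the 50-dollar mean spend, the random seed and the helper name `top_share` are all hypothetical, not taken from any real data set.

```python
import random


def top_share(values, top_fraction):
    """Share of the total accounted for by the highest `top_fraction` of values."""
    ranked = sorted(values, reverse=True)
    k = max(1, round(len(ranked) * top_fraction))  # how many people are in the top group
    return sum(ranked[:k]) / sum(ranked)


# Hypothetical data: 100 customers whose spend is exponentially distributed
# with a mean of 50 dollars (lambda = 1/50). Both numbers are invented.
rng = random.Random(42)
spend = [rng.expovariate(1 / 50) for _ in range(100)]

print(f"highest spender: {max(spend):.0f} dollars")
for fraction in (0.10, 0.20, 0.50):
    share = top_share(spend, fraction)
    print(f"top {fraction:.0%} of customers account for {share:.0%} of total spend")
```

With a purely exponential sample like this one, the printed shares will typically come out lower than the illustrative 70% and 90% quoted above (in theory, roughly a third of the total for the top 10%). The point is the form of the summary, not the exact figures.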
Reader beware, the next two paragraphs get even scarier.
Now. But what about the “central limit theorem”, some might say. This nice theorem only states that whatever the distribution (as long as its variance is finite), if you have large enough numbers, means of subgroups of your population will tend towards a normal distribution. Say income in your society is exponentially distributed. If you randomly select large enough groups of people in your society and start noting down the “averages” of these people’s incomes within each group, these “averages” will start following a bell curve. The incomes themselves will not.
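If you would rather see this than take it on faith, here is a small simulation sketch. It assumes a hypothetical society with exponentially distributed incomes averaging 3000 (an invented figure), draws random groups of different sizes, and uses skewness as a rough yardstick of “bell-shapedness”: a symmetric bell has skewness near 0, while a raw exponential has skewness around 2. The function names are mine, chosen for illustration.

```python
import random
import statistics

rng = random.Random(0)
MEAN_INCOME = 3000.0  # hypothetical average income


def group_means(group_size, n_groups=2000):
    """Average income within each of `n_groups` randomly drawn groups."""
    return [
        statistics.fmean([rng.expovariate(1 / MEAN_INCOME) for _ in range(group_size)])
        for _ in range(n_groups)
    ]


def skewness(xs):
    """Sample skewness: about 0 for a symmetric shape, positive for a long right tail."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean([((x - m) / s) ** 3 for x in xs])


for size in (1, 10, 100):
    means = group_means(size)
    print(f"group size {size:>3}: skewness of the group averages = {skewness(means):.2f}")
```

Groups of size 1 are just the raw incomes, so their skewness stays high. As the groups grow, the skewness of the averages shrinks towards zero, which is the central limit theorem at work on the averages and only on the averages.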
The “law of large numbers”? Here’s what that says: Your income may be exponentially distributed, but you can still calculate average incomes for people (no matter how counter-intuitive that is). As you keep measuring more and more subgroups of society and start averaging them out, the “average of averages” you calculate will tend more and more towards the true average of the population. However, this does not mean the average will make sense, or that income does not follow an exponential distribution.
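Again as a sketch only, with an invented true average of 3000 and hypothetical helper names, the snippet below shows both halves of that statement: the sample average settles on the true average as the sample grows, and yet it remains a poor description of the typical person, because well over half of the people earn less than it.

```python
import random

rng = random.Random(1)
TRUE_MEAN = 3000.0  # hypothetical true average income


def draw_incomes(n):
    """Draw n exponentially distributed incomes with mean TRUE_MEAN."""
    return [rng.expovariate(1 / TRUE_MEAN) for _ in range(n)]


def share_below_mean(incomes):
    """Fraction of people whose income is below the sample average."""
    mean = sum(incomes) / len(incomes)
    return sum(1 for x in incomes if x < mean) / len(incomes)


for n in (10, 1_000, 100_000):
    incomes = draw_incomes(n)
    mean = sum(incomes) / n
    below = share_below_mean(incomes)
    print(f"n = {n:>7}: sample average = {mean:7.0f}, share of people below it = {below:.0%}")
```

For large samples the average lands close to 3000, while the share of people below it hovers around 63%, which is what the exponential distribution predicts.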
In sum, there’s no law that says “every statistic out there follows a bell-shaped curve”.