In the data analysis process, I find that many of us overlook the value of understanding the data distribution. Once the data are gathered, analysts tend to jump straight to hypothesis testing and to exploring the relationships among the variables. Not that I am against this practice, but sometimes we do not have many variables at hand and yet we still need to make sense of the data. In that case, a basic examination of the data helps.
Data distribution is about shape. The most basic one is the Gaussian, or normal, distribution, with its famous bell-shaped density function. A bit of a warning: if you do not know what I am talking about, I suggest you just close this page. I’m sure I have posted many other entries on this blog that you can enjoy :D.
There are three basic properties of the normal distribution. First, the density function reaches its highest value at the mean. In other words, the peak of the density function always falls in the middle. Second, it is symmetric. And third, it has an S-shape before it reaches the peak. What does this tell us?
These properties indicate that the normal distribution represents the good nature of the universe: the blind eyes of justice and fairness. Surely, we cannot expect everything to be uniform. There must be some variation, but there will be a cancellation effect. For example, there will be geniuses. But then again, we also have idiots. The universe dictates that the percentages of geniuses and idiots are equal. Most of us cluster around the middle, with roughly half having an IQ between 90 and 110. So, we have a normal distribution for people's intelligence. The same is true for many other aspects in which there is no resource allocation problem.
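For those who like to see it in numbers, here is a minimal sketch that checks the three properties of the normal density using only Python's standard library. The mean of 100 and standard deviation of 15 are my assumptions for an IQ-like scale, and the helper `second_difference` is a hypothetical name I made up for this illustration.

```python
from statistics import NormalDist

# Assumed parameters for illustration only (an IQ-like scale).
MU, SIGMA = 100.0, 15.0
dist = NormalDist(mu=MU, sigma=SIGMA)


def second_difference(x, h=0.01):
    """Numerical curvature of the density at x (positive means bending upward)."""
    return dist.pdf(x + h) - 2 * dist.pdf(x) + dist.pdf(x - h)


# 1. The peak sits at the mean: the density there beats every other point tried.
peak_at_mean = all(dist.pdf(MU) > dist.pdf(MU + d) for d in (-30, -10, -1, 1, 10, 30))

# 2. Symmetry: equal density at equal distances on either side of the mean.
symmetric = all(abs(dist.pdf(MU - d) - dist.pdf(MU + d)) < 1e-12 for d in (1, 5, 20, 40))

# 3. S-shape on the way up: the curve bends upward far from the mean
#    and bends downward as it approaches the peak.
curves_up_in_tail = second_difference(MU - 2 * SIGMA) > 0
curves_down_near_peak = second_difference(MU - 0.5 * SIGMA) < 0

# Share of the distribution between 90 and 110 under the assumed parameters.
share_90_to_110 = dist.cdf(110) - dist.cdf(90)

print(peak_at_mean, symmetric, curves_up_in_tail, curves_down_near_peak)
print(round(share_90_to_110, 3))
```

The bend changes direction one standard deviation away from the mean, which is where the S-shape comes from.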
However, in the presence of resource allocation, the universe teaches us that the normal distribution is not applicable. If we examine the distribution of data on mountain heights, we will find that the shape is different. More of the observations fall to the left of the mean than to the right. It is not symmetric, so it is not bell-shaped anymore. Why does this happen? Because there is a limited supply of rocks and other substances to form the mountains.
Statisticians often describe this kind of lopsided shape with a beta distribution.
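To make the lopsided shape concrete, here is a small sketch that draws a sample from a beta distribution and checks that more observations fall to the left of the mean than to the right. The shape parameters (2 and 5), the sample size and the seed are arbitrary choices of mine for illustration. They are not estimates from any real mountain data.

```python
import random
from statistics import mean, median

random.seed(0)  # fixed seed so the sketch is repeatable

# Assumed shape parameters: alpha=2, beta=5 gives a long tail on the right.
sample = [random.betavariate(2, 5) for _ in range(10_000)]

avg = mean(sample)
mid = median(sample)

# Fraction of observations that fall to the left of the mean.
share_below_mean = sum(value < avg for value in sample) / len(sample)

print(f"mean={avg:.3f} median={mid:.3f} share below mean={share_below_mean:.3f}")
```

When the median sits below the mean and more than half of the sample lies to the left of the mean, the bell shape is gone.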
So, it is easy to understand why data that result from competition tend to have a beta distribution. The distribution of income, land ownership, market size or any similar thing is not normal. Why?
Because people or companies compete for limited resources such as money, land or consumers.
Why is this important for us?
First, the data distribution tells us what is happening in terms of resource allocation. We can spot a peculiarity when it happens. For example, suppose we have a fairly large number of employees working on the same task. We collect data on how fast they complete the task. Naturally, the distribution should be normal, because they are supposed to work independently and do not have to compete for any resource. When the data show the beta distribution shape, as managers we have to look at what resource they are competing for. Maybe it is something that we set ourselves, for example an incentive system. In that case, we can accept the beta distribution. But if we did not set any, we have to dig deeper.
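As a rough illustration of the manager's check, the sketch below labels a set of completion times as roughly symmetric or skewed using the sample skewness. Everything here is an assumption on my side: the simulated teams, the helper names `skewness` and `shape_of`, and the cut-off of 0.5, which is only a rule of thumb. Real data would also deserve a histogram and a proper test.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is repeatable


def skewness(values):
    """Sample skewness: third central moment over variance to the power 1.5."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def shape_of(values, threshold=0.5):
    """Label the shape of the data; the threshold is an assumed rule of thumb."""
    s = skewness(values)
    if s > threshold:
        return "right-skewed"
    if s < -threshold:
        return "left-skewed"
    return "roughly symmetric"


# Hypothetical completion times in minutes for two simulated teams.
# Team A: independent workers, simulated with a normal distribution.
independent_team = [random.gauss(30, 5) for _ in range(2_000)]
# Team B: workers competing for something, simulated with a scaled beta.
competing_team = [10 + 60 * random.betavariate(1.5, 10) for _ in range(2_000)]

print("independent:", shape_of(independent_team))
print("competing:  ", shape_of(competing_team))
```

If a team that is supposed to work independently comes back skewed, that is the cue to go looking for the resource they are competing for.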
Second, the data distribution can also direct us on how to improve performance. An unexpected distribution (whether it is normal or beta) points to span-of-control problems. On the other hand, an expected distribution calls for more resources for further capacity building.
Say that, after understanding the process, we expect sales performance to follow a beta distribution with a certain average, but it turns out to be normal. Maybe we do not provide good supervision. Or the incentive is targeted at the wrong salespeople.
On the other hand, we might already have the expected distribution. Then it is a matter of increasing performance through training, innovation or other capacity building.
In essence, data distribution is too important to ignore.