Sunday, May 10, 2020

Benford Distribution

In a statistics class I teach for the University of Redlands, School of Business, we spend a lot of time graphing data and reviewing distributions. It is important for students, or anyone analyzing data, to understand the underlying distribution of their data. Sometimes the distribution is known or given. When the distribution is not known, graphing the data can provide clues to the underlying distribution. Once the distribution is known, one can make calculations to further describe the data (descriptive statistics) or to test hypotheses about a population (inferential statistics).

Here are some typical data sets and the type of distributions that describe them:

Heights of people - Normal Distribution
Number of arrivals per unit of time at an emergency room - Poisson Distribution
Time between successive patients at an ER - Exponential Distribution
Number of "Heads" when flipping 5 coins at a time - Binomial Distribution
Random numbers produced by Excel function RAND() - Uniform Distribution
First digit of entries in a large collection of numerical data - Benford Distribution

The last distribution listed above is one of my favorites because it is not intuitive. For example, consider a forensic accountant reviewing a large number of expense receipts submitted with expense reports. Such expenses vary over several orders of magnitude, from a cup of coffee for $2.45 to an airline ticket for $2,500. When this is presented to students, the immediate response is that the first digits of all these entries should be uniformly distributed between 1 and 9 (that is, approximately 1/9 of the entries should begin with a 1, 1/9 with a 2, etc.). However, these entries will follow the Benford Distribution, which says more entries will begin with a lower digit than a higher digit.
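To make "first digit" concrete, here is a minimal Python sketch of how one might pull the leading digit from amounts that span several orders of magnitude and tally the results. The helper name first_digit is my own choice for illustration, and apart from the $2.45 coffee and the $2,500 ticket, the amounts are hypothetical values made up for the example.

```python
from collections import Counter
from decimal import Decimal


def first_digit(x):
    """Return the first significant digit of a number (0 if the number is zero)."""
    # Decimal keeps only the significant digits, so 0.0073 -> (7, 3) and 2500 -> (2, 5, 0, 0)
    return Decimal(str(abs(x))).as_tuple().digits[0]


# 2.45 and 2500 come from the example above; the rest are hypothetical amounts
amounts = [2.45, 2500, 18.30, 0.99, 147.00, 1020.50]

tally = Counter(first_digit(a) for a in amounts)
for d in range(1, 10):
    print(d, tally[d])
```

With a real collection of receipts, the same tally could be compared against the Benford percentages.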

The Benford Distribution is given by:
 P(d) = ln(1 + 1/d) / ln(10), for d = 1, 2, ..., 9
 where P(d) is the probability of d as first digit and ln is the natural logarithm.

First Digit      Distribution
 1               30.1%
 2               17.6%
 3               12.5%
 4                9.7%
 5                7.9%
 6                6.7%
 7                5.8%
 8                5.1%
 9                4.6%
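The percentages in this table can be reproduced with a few lines of Python. This is only a sketch: it uses math.log10, which is equivalent to dividing the natural logarithm by ln(10), and the function name benford is my own.

```python
import math


def benford(d):
    """Probability that d (1 through 9) is the first digit under the Benford Distribution."""
    return math.log10(1 + 1 / d)


probs = {d: benford(d) for d in range(1, 10)}
for d, p in probs.items():
    print(f"{d}  {p:.1%}")
```

The nine probabilities sum to 1, as they should for a distribution over the digits 1 through 9.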

I looked at the distribution of the first digit of the lengths of the longest rivers in the world. This list of rivers, with over 180 entries, starts with the Nile (6650 km) and ends with the Flinders in Australia (1004 km). The distribution of the first digit of the lengths is:

First Digit      Frequency
 1               125
 2                35
 3                14
 4
 5
 6
 7
 8
 9

While this example doesn't follow the full distribution, it does follow the trend that low-digit entries far exceed higher-digit entries. Re-examining the list, note that the last entry is just over 1000 km; if the list continued, the next entries would probably be in the 900s, 800s and 700s, filling in the higher digits. This is not a perfect example, as the range of the underlying data does not extend over more than one order of magnitude.

I tried another list: largest city populations. The list given starts with Tokyo (37 million) and ends with Guadalajara (5 million). Again, we barely have one order of magnitude; however, if we continued the list we would see many orders of magnitude as small town populations of a few hundred are reached.

First Digit      Frequency
 1               27
 2                5
 3                1
 4                0
 5               16
 6               14
 7                8
 8                6
 9                4
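To see how far this list sits from the Benford Distribution, one can compare the observed counts with the counts the distribution would predict for a list of the same size. The sketch below uses the frequencies from the table above; the variable names are my own.

```python
import math

# Observed first-digit counts for the city population list (digits 1 through 9)
observed = [27, 5, 1, 0, 16, 14, 8, 6, 4]
n = sum(observed)

# Expected counts if the first digits followed the Benford Distribution
expected = [n * math.log10(1 + 1 / d) for d in range(1, 10)]

for d, (obs, exp) in enumerate(zip(observed, expected), start=1):
    print(f"{d}  observed {obs:3d}  expected {exp:5.1f}")
```

The excess of entries starting with 5 and 6 reflects where the list was cut off (5 million) rather than anything about city sizes in general.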

Wikipedia's entry on the Benford Distribution gives a better example using the heights of the 60 tallest structures in the world.

The distribution is given for the first digit of the heights as listed in meters or feet.
Leading digit    Meters            Feet              In Benford's law
                 Count     %       Count     %
 1               26      43.3%     18      30.0%     30.1%
 2                7      11.7%      8      13.3%     17.6%
 3                9      15.0%      8      13.3%     12.5%
 4                6      10.0%      6      10.0%      9.7%
 5                4       6.7%     10      16.7%      7.9%
 6                1       1.7%      5       8.3%      6.7%
 7                2       3.3%      2       3.3%      5.8%
 8                5       8.3%      1       1.7%      5.1%
 9                0       0.0%      2       3.3%      4.6%

I read that the first digits of the powers of 2 give a good Benford Distribution. In the sheet linked here, I confirmed this by raising 2 to the powers 0 through 100:
First Digit      Count
 1               31
 2               17
 3               13
 4               10
 5                7
 6                7
 7                6
 8                5
 9                5
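The same tally can be sketched in Python as a cross-check. This is not the sheet itself, just an equivalent calculation; it relies on Python's exact integer arithmetic, so the first digit of each power can be read straight off its decimal string.

```python
from collections import Counter

# First digit of 2**n for n = 0 through 100 (101 values in total)
counts = Counter(int(str(2 ** n)[0]) for n in range(101))

for d in range(1, 10):
    print(d, counts[d])
```

With 101 values, the counts are close to 101 times the Benford probabilities (for example, 101 x 30.1% is about 30.4 for the digit 1).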



Update: a reasonable explanation for the distribution is presented in another post.





