Origin, Alternative Expressions of Newcomb-Benford Law and Deviations of Digit Frequencies
1. Introduction
The surprising fact of the uneven distribution of decimal digits over decimal places was noticed by Newcomb [1] in 1881 and then rediscovered in 1938 by Benford [2], who gave the corresponding mathematical expression
$$F(n) = \log\left(1 + \frac{1}{n}\right), \tag{1}$$
where F(n) is the frequency of numbers having the first digit n = 1, 2, …, 9 (the base of the logarithm is 10). Evidently, the sum of all 9 frequencies equals 1.
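As a quick numerical illustration of Equation (1) in its standard form, F(n) = log10(1 + 1/n), the following minimal sketch tabulates the nine frequencies and their sum. The function and variable names are illustrative choices, not taken from the paper.

```python
import math


def benford_first_digit(n: int) -> float:
    """Benford frequency of the leading decimal digit n (1..9)."""
    return math.log10(1 + 1 / n)


# Frequencies for the digits 1..9 and their sum (the sum telescopes to log10(10)).
frequencies = {n: benford_first_digit(n) for n in range(1, 10)}
total = sum(frequencies.values())

if __name__ == "__main__":
    for n, f in frequencies.items():
        print(n, round(f, 5))
    print("sum:", total)
```

The sum equals 1 because the product of the factors (n + 1)/n for n = 1, …, 9 is 10.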
Since then, the law has been repeatedly tested and applied in a wide variety of areas [3] - [10] and continues to attract the attention of researchers [11] - [22]. Meanwhile, efforts have been made to justify or derive the above equation [23] [24] [25] [26]. In Ref. [26] it was connected with the scaling invariance of physical laws. It was shown that the Benford law is valid for numbers distributed exponentially [27]. There is also a geometrical explanation of the Benford law [28]. Such a variety of explanations for the same law is somewhat unusual in physics and mathematics where, as a rule, a law has a single main reason explaining its origin.
The simplest explanation of the Newcomb-Benford law has been given by the author of the present communication in cooperation with Ed. Bormashenko and E. Shulzinger [29]. We have shown that Benford’s law follows as a consequence of the “positionality” of numeral systems like the decimal one.
People unacquainted with the literature on the subject refuse to believe that in arrays of unfalsified data almost a third of decimal numbers begin with the digit 1. At the same time, no one is surprised by the fact that in the binary system all the numbers begin with the digit 1.
The main assumptions of the present work are as follows. 1) Any natural numerical array under analysis is bounded from above, either essentially or by the number of digits presented (the position of the decimal point is irrelevant). 2) Within this array, the probability of encountering any number is the same for all numbers, but not for digits in each position.
First, I will give a brief overview of the cited work [29], presenting its results in a more convenient form. An expression alternative to Equation (1) will be deduced in a new way; its relation to Equation (1) will be clarified. The importance of inequalities for frequencies of digits will be emphasized and illustrated by the distribution of the population in Israeli cities and by the results by state of the 2020 presidential election in the United States. Finally, the same method of extremal frequencies will be applied to the determination of digit frequencies at the second decimal place.
2. Frequencies of Digits as a Consequence of the Structure of the Positional Numeral System
Benford’s law is often used to check the reliability of various arrays of numerical data. In any case, these arrays are bounded above and below, and the role of these restrictions can be played simply by the given number of digits. For example, the number of votes cast for a particular candidate in some place is limited by the number of those who have the right to vote; the distribution of the population by city is limited by the population of the country, etc. As will be clear from what follows, only the upper limit is significant, while the lower one is unimportant.
Now let us consider the frequency F(n) of numbers with a certain digit $n = 1, 2, \ldots, 9$ at first place in the set of natural numbers $\{1, 2, \ldots, m\}$. How does this frequency change with increasing m? In Ref. [29] it has been shown that the frequency F(n) passes through a sequence of alternating local minimums, $F_{\min}(n)$, and local maximums, $F_{\max}(n)$. For example, for n = 1, the minimums of F(1) are achieved for values of m equal to 9, 99, 999, … due to the maximum of the denominator in the frequency definition while the numerator remains constant. Starting from these values, the frequency begins to rise since the growth of the numerator in percentage is greater than the growth of the denominator. Maximums are attained at values of m equal to 19, 199, 1999, and so on. Another example is n = 7, where local minimums and maximums of the frequency are at m values of 69, 699, 6999, …, and 79, 799, 7999, …, respectively. In general form, these statements are expressed by the formulas for $k = 1, 2, \ldots$
$$m_{\min} = n \cdot 10^{k} - 1, \tag{2}$$
$$m_{\max} = (n+1) \cdot 10^{k} - 1. \tag{3}$$
The number of numbers starting with the digit n up to $m_{\min}$ equals $(10^{k} - 1)/9$; analogously, the number of numbers starting with the digit n up to $m_{\max}$ equals $(10^{k+1} - 1)/9$. Once more taking into account Equations ((2), (3)), one has for the frequencies at local minimums and maximums:
$$F_{\min}(n) = \frac{10^{k} - 1}{9\,(n \cdot 10^{k} - 1)}, \tag{4}$$
$$F_{\max}(n) = \frac{10^{k+1} - 1}{9\,[(n+1) \cdot 10^{k} - 1]}. \tag{5}$$
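The positions of the extrema quoted above (9, 99, … and 19, 199, … for n = 1; 69, 699, … and 79, 799, … for n = 7) can be checked by direct counting. The sketch below is a brute-force check, assuming the closed forms F_min = (10^k − 1)/(9(n·10^k − 1)) at m = n·10^k − 1 and F_max = (10^(k+1) − 1)/(9((n+1)·10^k − 1)) at m = (n+1)·10^k − 1, which follow from the examples in the text. All function names are illustrative.

```python
def leading_digit(x: int) -> int:
    """Return the first (most significant) decimal digit of a positive integer."""
    while x >= 10:
        x //= 10
    return x


def first_digit_frequency(n: int, m: int) -> float:
    """Fraction of the numbers 1..m whose first digit is n (brute-force count)."""
    count = sum(1 for x in range(1, m + 1) if leading_digit(x) == n)
    return count / m


def f_min(n: int, k: int) -> float:
    """Assumed closed form of the local minimum at m = n*10**k - 1."""
    return (10**k - 1) / (9 * (n * 10**k - 1))


def f_max(n: int, k: int) -> float:
    """Assumed closed form of the local maximum at m = (n + 1)*10**k - 1."""
    return (10 ** (k + 1) - 1) / (9 * ((n + 1) * 10**k - 1))


if __name__ == "__main__":
    for n in (1, 7):
        for k in (1, 2, 3):
            m_lo, m_hi = n * 10**k - 1, (n + 1) * 10**k - 1
            print(n, k, first_digit_frequency(n, m_lo), f_min(n, k),
                  first_digit_frequency(n, m_hi), f_max(n, k))
```

Counting by brute force is slow for large m, but it makes no assumption about where the extrema lie, so it serves as an independent check of the closed forms.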
The dependences of these quantities on m and the digit value n are shown in Figure 1. With the increase of k, both quantities in Equations ((4), (5)) converge very rapidly to
$$F_{\min}(n) = \frac{1}{9n}, \tag{6}$$
$$F_{\max}(n) = \frac{10}{9(n+1)}. \tag{7}$$
Figure 1. Maximal and minimal frequencies of decimal digits at the first place in a set of natural numbers restricted by m (logarithmic scale).
It is seen that the frequencies vary inversely with the digit value, which is the qualitative formulation of the Newcomb-Benford law. The limiting values of frequencies (6) and (7), also presented in Table 1, are very useful in analyzing real numerical arrays because in reality the exact upper bound, m, is unknown. The lower bound is also unknown, but this is unessential since the contribution of the first terms of the set $\{1, 2, \ldots, m\}$ to both the numerator and the denominator of the frequency definition is negligible compared to the contribution of the last terms.
From the minimal and maximal asymptotic frequencies, some mean values may be constructed, such as the arithmetic, geometric, logarithmic, or harmonic mean. The geometric mean turned out to be closest to the original Benford distribution in Equation (1). One has from Equations ((6), (7))
$$F(n) = \frac{A}{\sqrt{n(n+1)}} = \frac{0.43077}{\sqrt{n(n+1)}}, \tag{8}$$
where the numerical multiplier is the normalizing factor $A = \left[\sum_{n=1}^{9} \frac{1}{\sqrt{n(n+1)}}\right]^{-1}$. The quantity F(n) is interpreted as the probability for the digit n to occupy the first place in a number of the decimal numeral system. As is seen from Table 1, the differences between the results of Benford’s law (1) and Equation (8) are negligible, at least from a practical point of view.
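The normalizing factor and the closeness to Benford's law can be reproduced in a few lines. This is a minimal sketch, assuming Equation (8) is the geometric mean of the limits 1/(9n) and 10/(9(n + 1)) rescaled so that the nine values sum to 1; the names `A`, `geometric_mean_law` and `benford` are illustrative.

```python
import math

# Normalizing factor: inverse of the sum of 1/sqrt(n(n+1)) over the digits 1..9.
A = 1 / sum(1 / math.sqrt(n * (n + 1)) for n in range(1, 10))


def geometric_mean_law(n: int) -> float:
    """Normalized geometric mean of the limiting minimal and maximal frequencies."""
    return A / math.sqrt(n * (n + 1))


def benford(n: int) -> float:
    """Benford's law, Equation (1)."""
    return math.log10(1 + 1 / n)


# Largest absolute difference between the two expressions over the digits 1..9.
max_abs_diff = max(abs(geometric_mean_law(n) - benford(n)) for n in range(1, 10))

if __name__ == "__main__":
    print("A =", round(A, 5))
    for n in range(1, 10):
        print(n, round(geometric_mean_law(n), 4), round(benford(n), 4))
    print("largest difference:", round(max_abs_diff, 4))
```

The computed factor agrees with the value 0.43077 quoted in the text, and the largest difference between the two laws occurs at n = 1.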
3. Equivalence of Two Formulations of Benford’s Law
It is possible to prove the equivalence of Equation (8) to the Benford law (1). After the change of variable n
(9)
Equation (8) turns into
. (10)
Passing to approximate formulas we have
Table 1. Frequencies of digits in numerical arrays.
. (11)
The inverse to (9) transformation is
(12)
that gives after substitution of A =0.43077 from Equation (8) and transition to logarithms with a base of 10
. (13)
The error
associated with neglecting higher-order terms reaches a maximum of +0.006 for n = 1, but for n = 2 it is already equal to +0.001. The validity of the proof of the equivalence of Benford’s law and Equation (8) is confirmed by numerical results in the table. In conclusion, the question arises whether to consider the formula (8) as an excellent approximation to Benford’s law (1), or, conversely, to consider Benford’s formula as an excellent approximation to the law (8) deduced from the properties of the decimal numeral system?
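The closeness of Equation (8) to Benford's law (1) can also be seen from a short worked derivation. The sketch below uses an auxiliary variable y, which is an assumption made for illustration and need not coincide with the substitution (9); A = 0.43077 is the normalizing factor quoted from Equation (8).

```latex
% Assumed substitution (illustrative; not necessarily the paper's Eq. (9)):
%   1 + 1/n = e^{2y},  i.e.  y = (1/2) \ln(1 + 1/n),  0 < y \le (\ln 2)/2 \approx 0.347
\begin{align*}
n &= \frac{1}{e^{2y}-1}, \qquad n+1 = \frac{e^{2y}}{e^{2y}-1}, \\
\frac{1}{\sqrt{n(n+1)}} &= \frac{e^{2y}-1}{e^{y}} = 2\sinh y = 2y + \frac{y^{3}}{3} + \dots, \\
\frac{A}{\sqrt{n(n+1)}} &\approx 2Ay = A \ln\!\left(1+\frac{1}{n}\right)
   = A \ln 10 \,\log_{10}\!\left(1+\frac{1}{n}\right)
   \approx 0.992\,\log_{10}\!\left(1+\frac{1}{n}\right).
\end{align*}
% Neglected term: A (2\sinh y - 2y) \approx A y^{3}/3,
% about 0.006 for n = 1 and 0.001 for n = 2.
```

In this form the two expressions differ only in the third order of y, and the neglected term reproduces the error values quoted in the text (+0.006 for n = 1, +0.001 for n = 2).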
4. Population of Israeli Cities and by State Results of the 2020 Presidential Elections in the United States
From Equations ((6) and (7)) the inequalities for F(n) follow
$$\frac{1}{9n} \le F(n) \le \frac{10}{9(n+1)}, \tag{14}$$
which are more useful than Equations ((1), (8)) in applications because the exact upper bound of the array considered is unknown. Figure 2 shows an analysis using (8) and (14) of the population distribution over 72 cities of Israel in accordance with the results of the 2008 census [30]. Lacking the right software and programming skills, I chose data with a small number of items and performed the calculation manually. For the same reason, as another example, a sample of 51 items was considered, which represents the results by state of the 2020 US presidential election [31] (Figure 3).
Figure 2. The frequencies of digits at the first place of the numbers from data on population of 72 Israeli cities [30]. Solid line corresponds to Equation (8). Deviation intervals according to inequality (14).
Figure 3. The frequencies of digits at the first place of the numbers from the results by state of the 2020 US presidential election [31].
In both figures, there are only a few cases of going beyond the deviation intervals (14), and these excursions are small. It should not be forgotten that the whole consideration is probabilistic. In addition, for the above reason, small samples were chosen. The Benford law is often used to detect violations and fraud in datasets. From this point of view, the census and the vote count in the 2020 presidential election successfully pass this test.
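The manual check described above is straightforward to automate. The following sketch counts first digits in a list of positive integers and reports the digits whose observed frequency falls outside the interval 1/(9n) ≤ F(n) ≤ 10/(9(n + 1)). The function names are hypothetical, and no data from refs. [30] or [31] are reproduced here.

```python
from collections import Counter


def leading_digit(x: int) -> int:
    """Return the first decimal digit of a positive integer."""
    while x >= 10:
        x //= 10
    return x


def first_digit_frequencies(values):
    """Observed frequency of each leading digit 1..9 among the positive values."""
    counts = Counter(leading_digit(v) for v in values if v > 0)
    total = sum(counts.values())
    return {n: counts.get(n, 0) / total for n in range(1, 10)}


def deviation_interval(n: int):
    """Limiting minimal and maximal frequencies for the leading digit n."""
    return 1 / (9 * n), 10 / (9 * (n + 1))


def digits_outside_interval(values):
    """Digits whose observed frequency lies outside the deviation interval."""
    outside = []
    for n, f in first_digit_frequencies(values).items():
        low, high = deviation_interval(n)
        if not (low <= f <= high):
            outside.append(n)
    return outside
```

For a small sample, a digit that never occurs is flagged because its frequency 0 lies below the lower limit, so a few flagged digits in a short list are not by themselves evidence of manipulation.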
5. Other Positional Numeral Systems
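Generalization to numeral systems other than the decimal one is straightforward. Instead of Equations ((8) and (14)) one has Equations ((15) and (16)) [29]
$$F_N(n) = \frac{A_N}{\sqrt{n(n+1)}}, \tag{15}$$
where N is the base of a numeral system, $n = 1, 2, \ldots, N-1$, and $A_N = \left[\sum_{n=1}^{N-1} \frac{1}{\sqrt{n(n+1)}}\right]^{-1}$ is the normalizing factor and
$$\frac{1}{(N-1)\,n} \le F_N(n) \le \frac{N}{(N-1)(n+1)}. \tag{16}$$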
In particular, in the binary system (N = 2), Equations (15) and (16) all reduce to 1 for n = 1 (every number in the binary system begins with 1).
Normalizing factors for the most popular numeral systems are as follows:
,
,
,
. For systems with a low base, the probability of finding the digit 1 at the first place of a number is high; for example,
. This makes them undesirable for digital encryption since the chances are high that the encoded number starts with one. In this regard, coding using high base numeral systems is preferred; for example, in hexadecimal numeral system this probability is two times lower,
.
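The dependence on the base can be explored numerically. The sketch below assumes that Equation (15) has the form A_N/√(n(n + 1)) with A_N normalizing the sum over n = 1, …, N − 1 to unity, and that inequality (16) reads 1/((N − 1)n) ≤ F ≤ N/((N − 1)(n + 1)); both forms are the decimal expressions with 10 replaced by N. The function names are illustrative.

```python
import math


def normalizing_factor(base: int) -> float:
    """Inverse of the sum of 1/sqrt(n(n+1)) over the leading digits 1..base-1."""
    return 1 / sum(1 / math.sqrt(n * (n + 1)) for n in range(1, base))


def first_digit_probability(n: int, base: int) -> float:
    """Assumed base-N analogue of Equation (8) for the leading digit n."""
    return normalizing_factor(base) / math.sqrt(n * (n + 1))


def deviation_interval(n: int, base: int):
    """Assumed base-N limits of the minimal and maximal frequencies."""
    return 1 / ((base - 1) * n), base / ((base - 1) * (n + 1))


if __name__ == "__main__":
    for base in (2, 3, 4, 8, 10, 16):
        print(base, round(normalizing_factor(base), 5),
              round(first_digit_probability(1, base), 4))
```

Printing the table for several bases shows the probability of a leading 1 falling steadily as the base grows, which is the trend the paragraph above relies on.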
6. Frequencies of Digits at the Second Decimal Place
The alternation of local minima and maxima of digit frequencies when expanding a limited set of natural numbers {m} can also be used for the second decimal position. Let us show this by the example of the digits 0 and 2. For n = 0, maxima are attained for the following values of m: 10, 20, …, 90, 109, 209, …, 909, 1099, 2099, …, 9099, 10,999, 20,999, …, 90,999, …. The number of numbers in {m} with 0 at the second place changes, respectively, as follows: 1, 2, …, 9, 19, 29, …, 99, 199, 299, …, 999, 1999, 2999, …, 9999, …. In general form, this is expressed with the help of two indices: $m_{\max} = (10i + 1) \cdot 10^{k-1} - 1$ and $(i + 1) \cdot 10^{k-1} - 1$, where i runs from 1 to 9 and k = 1, 2, …. Thus, the corresponding frequency is
$$F_{\max}(0) = \frac{(i+1) \cdot 10^{k-1} - 1}{(10i+1) \cdot 10^{k-1} - 1}. \tag{17}$$
Analogously, for the minimal frequencies the values of m and the number of the required numbers in {m} are, respectively, 9, 19, 29, …, 99, 199, 299, …, 999, 1999, 2999, …, 9999, 19,999, 29,999, …, 99,999, … and 0, 1, 2, …, 9, 19, 29, …, 99, 199, 299, …, 999, 1999, 2999, …, 9999, …. The minimal frequencies are
$$F_{\min}(0) = \frac{i \cdot 10^{k-1} - 1}{i \cdot 10^{k} - 1}. \tag{18}$$
For n = 2, the maximal frequencies are attained for m values: 12, 22, …, 92, 129, 229, …, 929, 1299, 2299, …, 9299, 12,999, 22,999, …, 92,999, …. The number of numbers with 2 at the second place in {m} is as follows: 1, 2, …, 9, 19, 29, …, 99, 199, 299, …, 999, 1999, 2999, …, 9999, …. The formula for the frequency:
$$F_{\max}(2) = \frac{(i+1) \cdot 10^{k-1} - 1}{(10i+3) \cdot 10^{k-1} - 1}. \tag{19}$$
The minimal frequencies are attained for m: 11, 21, …, 91, 119, 219, …, 919, 1199, 2199, …, 9199, 11,999, 21,999, …, 91,999, … with the same number of numbers in the numerator as in Equation (18):
$$F_{\min}(2) = \frac{i \cdot 10^{k-1} - 1}{(10i+2) \cdot 10^{k-1} - 1}. \tag{20}$$
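For $n = 0, 1, \ldots, 9$:
$$F_{\max}(n) = \frac{(i+1) \cdot 10^{k-1} - 1}{(10i+n+1) \cdot 10^{k-1} - 1}, \tag{21}$$
$$F_{\min}(n) = \frac{i \cdot 10^{k-1} - 1}{(10i+n) \cdot 10^{k-1} - 1}. \tag{22}$$
Both the minimum of Equation (22) (minimum minimorum) and the maximum of Equation (21) are attained at $i = 1$. Further, the limits for k going to infinity:
$$F_{\max}(n) = \frac{2}{n+11}, \tag{23}$$
$$F_{\min}(n) = \frac{1}{n+10}. \tag{24}$$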
One possible estimate of the probability of finding the digit n at the second place of a number is the normalized geometric mean of the maximal and minimal frequencies (23) and (24):
$$F(n) = \frac{B}{\sqrt{(n+10)(n+11)}}, \qquad B = \left[\sum_{n=0}^{9} \frac{1}{\sqrt{(n+10)(n+11)}}\right]^{-1} \tag{25}$$
(compare to Equation (15)). Corresponding numerical data are presented in Table 2.
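The second-digit extrema can be verified by direct counting in the same way as for the first digit. The sketch below assumes, from the sequences listed above, that with i = 1 the maxima lie at m = (11 + n)·10^(k−1) − 1 and the minima at m = (10 + n)·10^(k−1) − 1, with limiting frequencies 2/(n + 11) and 1/(n + 10). The frequency is taken relative to all m numbers of the set, as in the examples for n = 0. Function names are illustrative.

```python
def second_digit(x: int) -> int:
    """Second most significant decimal digit of an integer x >= 10."""
    while x >= 100:
        x //= 10
    return x % 10


def second_digit_frequency(n: int, m: int) -> float:
    """Fraction of the numbers 1..m having digit n at the second place."""
    count = sum(1 for x in range(10, m + 1) if second_digit(x) == n)
    return count / m


def f_max_limit(n: int) -> float:
    """Assumed limiting maximal frequency for the second digit n."""
    return 2 / (n + 11)


def f_min_limit(n: int) -> float:
    """Assumed limiting minimal frequency for the second digit n."""
    return 1 / (n + 10)


if __name__ == "__main__":
    k = 4  # five-digit upper bounds; larger k brings the counts closer to the limits
    for n in range(10):
        m_hi = (11 + n) * 10 ** (k - 1) - 1
        m_lo = (10 + n) * 10 ** (k - 1) - 1
        print(n, round(second_digit_frequency(n, m_lo), 5), round(f_min_limit(n), 5),
              round(second_digit_frequency(n, m_hi), 5), round(f_max_limit(n), 5))
```

The intervals for neighbouring digits overlap almost entirely, since the lower limit changes only from 1/10 to 1/19 and the upper limit from 2/11 to 1/10 across the ten digits.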
It is seen from Table 2 that the confidence intervals overlap strongly, which may prevent second-digit statistics from being used to analyze arrays of numbers. Also, the distribution of probabilities is too smooth.
For numeral systems with an arbitrary base N, one has from Equation (25):
$$F_N(n) = \frac{B_N}{\sqrt{(n+N)(n+N+1)}}, \tag{26}$$
where $n = 0, 1, \ldots, N-1$ and $B_N$ is the normalizing factor (compare to Equation (25)). For low bases, the probabilities may differ more sharply. Thus, for the binary system (N = 2):
while
.
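The contrast between low and high bases can be made concrete with a short calculation. This sketch assumes that the second-digit probability in base N is proportional to 1/√((n + N)(n + N + 1)) for n = 0, …, N − 1, that is, the decimal geometric-mean expression with 10 replaced by N; this form and the function name are assumptions made for illustration.

```python
import math


def second_digit_probability(n: int, base: int) -> float:
    """Normalized geometric-mean estimate for digit n at the second place (assumed form)."""
    weights = [1 / math.sqrt((d + base) * (d + base + 1)) for d in range(base)]
    return weights[n] / sum(weights)


if __name__ == "__main__":
    for base in (2, 10):
        print(base, [round(second_digit_probability(n, base), 4) for n in range(base)])
```

Under this assumption the two binary probabilities work out to 2 − √2 and √2 − 1, a much larger spread than between any two decimal second digits.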
7. Conclusions
With the expansion of a bounded set of natural numbers, the density of numbers starting with a certain digit experiences quasiperiodic oscillation (see Figure 1). The maxima and minima of this oscillation quickly stabilize and determine possible deviations from Benford’s law. Formulas for these deviations are useful when analyzing numeric arrays for fraud.
The geometric mean of the above minimum and maximum decreases with the increase in values of the initial digits of numbers, giving an alternative quantitative expression (8) for the Newcomb-Benford law. This expression approximately coincides with Benford’s formula (1) up to the second order in some small parameter, which is less than 1.
Table 2. Probabilities and deviation intervals for digits at the second decimal place.
The results are generalized to the arbitrary base of positional numeral systems. The systems with the higher base are preferable in digital encryption because for them digit frequencies are close to each other and maximal possible deviations overlap.
The elaborated method of extremal digit frequencies also applies to the second decimal place. In this case, the smooth dependence on the digit value may prevent the method from being applied to checking numerical arrays for fraud.
Perhaps, taking into account information about the boundaries of the numerical arrays under consideration will narrow the confidence intervals and improve the correspondence of the calculated and measured frequencies in the case of truthful data. Work in this direction is expected to be carried out in the near future.
Acknowledgements
The author is grateful to Professor Edward Bormashenko who brought his attention to this field.