Measures of Variability for Qualitative Variables Using the R Software ()
1. Introduction
Many measures of variation have been defined for qualitative variables [1] [2] [3] ; however, they are little known and utilized, even though qualitative variables are frequent and highly important in social and health research [4] [5] . In fact, many basic or applied statistics manuals do not even mention them, and commercial or open-access statistical software does not include them. It is important to note that their development is relatively recent, beginning in the mid-20th century with the publication of the seminal article by the American mathematician Claude Elwood Shannon (1916-2001) on the mathematical theory of communication [6] , and with the diversity index proposed by Simpson [7] . These measures proliferated between the 1960s, and 1980s [2] , and continue to be developed today [3] [8] [9] .
In this article, six measures are considered: Freeman’s [10] Variation Ratio (FVR), Wilcox’s [1] Variation Ratio (WVR), Moral’s [3] Universal Variation Ratio (UVR), Kvalseth’s [11] Standard Deviation from Mode (SDM), Gibbs-Poston’s [12] Index of Qualitative Variation (IQV), and Shannon’s [13] Relative Entropy (RelE). The selected measures are simple to calculate, practical to use, clear to interpret, and have been applied in social research [3] [14] [15] .
A characteristic shared by all these measures of variability is to have a range from 0 to 1. In this range, 0 corresponds to the distribution of a constant random variable, in which one value concentrates all the probability at the population level or all the frequency at the sample level. The value of 1 corresponds to a uniform distribution, in which all its values have the same probability at the population level or the same frequency at the sample level.
The aim of this article is twofold. On the one hand, the six measures of variation are presented in a simple way for better understanding among social researchers, since they are not included in commonly used statistical software, whether commercial or freely available. On the other hand, an additional objective is to facilitate their use through the R software, which is freely available. Two calculation situations are considered: qualitative data random samples with a single mode and with two or more modes.
2. Freeman’s Variation Ratio
The variation ratio was developed by the American psychology Larry C. Freeman for unimodal distributions and can be denoted by FVR [10] . It starts from a formula of variation around the mode, and its expression is simplified to the complement of the frequency of the single mode. Refer to Formula (1), where n_{i} represents absolute frequency of each value (i = 1, 2, …, k), k is the number of qualitative categories, n_{mo} denotes absolute frequency of the single mode (mo), n signifies sample size, f_{i} stands for relative frequency, and f_{mo} represents relative frequency of the single mode.
$\begin{array}{c}FVR=\frac{{\displaystyle {\sum}_{i=1}^{k}\left({n}_{i}-\frac{{n}_{mo}}{k}\right)}}{{\displaystyle {\sum}_{i=1}^{k}{n}_{i}}}=\frac{{\displaystyle {\sum}_{i=1}^{k}\left({n}_{i}-\frac{{n}_{mo}}{k}\right)}}{n}={\displaystyle \underset{i=1}{\overset{k}{\sum}}\left(\frac{{n}_{i}}{n}-\frac{{n}_{mo}}{k\times n}\right)}\\ ={\displaystyle \underset{i=1}{\overset{k}{\sum}}\left({f}_{i}-\frac{{f}_{mo}}{k}\right)}={\displaystyle \underset{i=1}{\overset{k}{\sum}}{f}_{i}}-{\displaystyle \underset{i=1}{\overset{k}{\sum}}\frac{{f}_{mo}}{k}}=1-\frac{k\times {f}_{mo}}{k}=1-{f}_{mo}\end{array}$ (1)
It only applies to a unimodal distribution. In the case of a constant random variable, where a single value a concentrates all the frequency (f_{a} = 1), the variation ratio is 0, since the value a is the mode and its relative frequency is one (FVR = 1 −f_{mo} = 1 − 1 = 0). This represents the minimum variability condition. In the case of a uniform distribution, it can be argued that its value is 1 because this distribution has no mode, and consequently, the relative frequency of its mode is 0 (FVR = 1 − f_{mo} = 1 − 0 = 1). This represents the situation of maximum variability. Its advantage is the simplicity of calculation and its disadvantage is the scarce information on the distribution used for its computation.
3. Wilcox’s Variation Ratio
The American political scientist Allen R. Wilcox published a standardized variation ratio based on the mode, which can be denoted by WVR [1] . Refer to Formula (2).
$\begin{array}{c}WVR=1-\frac{{\displaystyle {\sum}_{i=1}^{k}\left({n}_{mo}-{n}_{i}\right)}}{n\left(k-1\right)}=1-\frac{{\displaystyle {\sum}_{i=1}^{k}{n}_{mo}}-{\displaystyle {\sum}_{i=1}^{k}{n}_{i}}}{n\left(k-1\right)}=1-\frac{k{n}_{mo}-n}{n\left(k-1\right)}\\ =1-\frac{\frac{1}{n}}{\frac{1}{n}}\times \frac{k{n}_{mo}-n}{n\left(k-1\right)}=1-\frac{k\frac{{n}_{mo}}{n}-\frac{n}{n}}{k-1}=1-\frac{k{f}_{mo}-1}{k-1}\end{array}$ (2)
It is straightforward to establish the relationship between WVR and FVR, as shown in Formula (3).
$WVR=1-\frac{k{f}_{mo}-1}{k-1}=\frac{k-1-k{f}_{mo}+1}{k-1}=\frac{k}{k-1}\left(1-{f}_{mo}\right)=\frac{k}{k-1}FVR$ (3)
Based on this equality, it can be stated that WVR is greater than or equal to FVR. This measure of variability necessarily requires that the distribution be unimodal. When there are two or more modes, it cannot be calculated. In the case of a uniform distribution, the modal frequency cannot be given a value of 0, as was argued with FVR, since the WVR statistic falls outside the range of 0 to 1 stipulated for standardized indices of variation of qualitative variables: WVR = k/(k − 1) × FVR = k/(k − 1) × 1 = k/(k − 1) > 1. Consequently, it has more limitations than FVR; however, it includes additional information on the number of qualitative categories of the variable.
4. Moral’s Universal Variation Ratio
The Spanish-Mexican psychologist José Moral proposed a modification of Freeman’s formula [3] . This new proposal allows its application in cases of multiple modes and considers the number of categories (k), as does the WRV. The author called this modified statistic the Universal Variation Ratio (UVR), because it can be applied to any type of qualitative variable distribution. Refer to Formula (4), where k represents number of qualitative categories of the variable, c stands for number of values with maximum (absolute or relative) frequency and this value can vary from 1 to k, f_{max} denotes maximum relative frequency corresponding to the mode (f_{mo}), except in a uniform distribution (c = k), in which it is considered that there is no mode and the value of all frequencies is 1/k.
$UVR=\frac{1-\frac{1}{c}\times {f}_{\mathrm{max}}}{1-\mathrm{max}\left(\frac{1}{c}\times {f}_{\mathrm{max}}\right)}=\frac{1-\frac{{f}_{\mathrm{max}}}{c}}{1-\frac{1}{{k}^{2}}}=\frac{1-\frac{{f}_{\mathrm{max}}}{c}}{\frac{{k}^{2}-1}{{k}^{2}}}=\frac{{k}^{2}}{{k}^{2}-1}\times \left(1-\frac{{f}_{\mathrm{max}}}{c}\right)$ (4)
Starting from Freeman’s formula, 1 − f_{mo}, the proposed formula weights the modal relative frequency by the inverse of the number of modes (1/c) and divides the expression by its maximum value. This maximum is reached with the uniform distribution, when c = k, f_{max} = 1/k y 1 – 1/c × f_{max} = 1 – 1/k × 1/k = 1 – 1/k^{2}.
When a value monopolizes the entire frequency (constant random variable), where the modal frequency is unique and has a unit value (c = 1 and f_{max} = 1), the Universal Variation Ratio (UVR) reaches its minimum value of 0 (Formula (5)).
$UVR=\frac{{k}^{2}}{{k}^{2}-1}\times \left(1-\frac{{f}_{\mathrm{max}}}{c}\right)=\frac{{k}^{2}}{{k}^{2}-1}\times \left(1-\frac{1}{1}\right)=\frac{{k}^{2}}{{k}^{2}-1}\times 0=0$ (5)
If there is no mode (uniform distribution), all categories have the same frequency and this frequency is the maximum (1/k), the value of c is k and the Universal Variation Ratio (UVR) reaches its maximum value of 1 (Formula (6)).
$UVR=\frac{1-{f}_{\mathrm{max}}/c}{\left({k}^{2}-1\right)/{k}^{2}}=\frac{1-\left(1/k\right)/k}{\left({k}^{2}-1\right)/{k}^{2}}=\frac{1-1/{k}^{2}}{1-1/{k}^{2}}=1$ (6)
As c approaches k, where k is the number of categories (uniform distribution), the result of the proposed modification approaches 1, as shown in Formula (7).
$UVR=\underset{c\to k}{\mathrm{lim}}\frac{1-\frac{1}{c}\times {f}_{\mathrm{max}}}{1-1/{k}^{2}}=\frac{1-\frac{1}{k}\times \frac{1}{k}}{1-1/{k}^{2}}=\frac{1-1/{k}^{2}}{1-1/{k}^{2}}=1$ (7)
This is because, to the extent that the sample of qualitative variable A presents more categories with maximum frequency (c), the subtractive effect of the maximum frequency decreases in the modified variation ratio (1 − f_{max}/c), and consequently, the value of this measure of variability increases (UVR). The lower the number of categories (k), the greater the increase in the Universal Variation Ratio (UVR), since the variability is more evenly distributed, moving the distribution of variable A away from that of a constant random variable (minimum value) and closer to that of a uniform distribution (maximum value).
In the case of a mode (c = 1), which is the situation in which FVR and Moral’s UVR are comparable, UVR yields a value greater than or equal to FVR, as does WRV. Refer to Formula (8).
$UVR=\frac{1-{f}_{Max}/c}{\left({k}^{2}-1\right)/{k}^{2}}=\frac{1-{f}_{Mod}/1}{\left({k}^{2}-1\right)/{k}^{2}}=\frac{{k}^{2}\left(1-{f}_{Mod}\right)}{{k}^{2}-1}=\frac{{k}^{2}}{{k}^{2}-1}FVR\ge FVR$ (8)
When the number of qualitative categories (k) is very small, there is a greater difference between UVR and FVR. With two categories, the difference or increase is one third: UVR-FVR = k^{2}/(k^{2} − 1) = 0.333. With three categories, the difference or increase is one-eighth: UVR-FVR = k^{2}/(k^{2} − 1) = 0.125. However, as the number of categories increases, the difference becomes smaller. With four categories, the difference or increment is 0.067, and with five, it is 0.042. When the number of categories tends to infinity, UVR converges to FVR, as shown in Formula (9), where c = 1.
$UVR=\underset{k\to \infty}{\mathrm{lim}}\frac{{k}^{2}}{{k}^{2}-1}\left(1-\frac{{f}_{\mathrm{max}}}{c}\right)=\underset{k\to \infty}{\mathrm{lim}}\frac{{k}^{2}}{{k}^{2}-1}\left(1-{f}_{mo}\right)=1-{f}_{mo}=FVR$ (9)
In contrast to FVR, the new measure of variation proposed by Moral [3] is less than or equal to Wilcox’s WVR [1] , when there is a single qualitative category with maximum frequency (c = 1). Consequently, FVR always takes a value less than or equal to UVR, and UVR always takes a value less than or equal to WVR. Refer to Formula (10).
$\begin{array}{l}UVR=\frac{{k}^{2}}{{k}^{2}-1}\times FVR\to FVR=\frac{{k}^{2}-1}{{k}^{2}}\times UVR=\left(1-\frac{1}{{k}^{2}}\right)\times UVR\\ WVR=\frac{k}{k-1}\times FVR\to FVR=\frac{k-1}{k}\times WVR=\left(1-\frac{1}{k}\right)\times WVR\\ FVR=\left(1-\frac{1}{{k}^{2}}\right)\times UVR=\left(1-\frac{1}{k}\right)\times WVR\\ FVR\le UVR\le WVR\end{array}$ (10)
5. Kvalseth’s Standard Deviation from Mode
The Norwegian-born American engineer Tarald O. Kvalseth proposed a new standardized index named Standard Deviation from Mode (SDM) [11] . Its advantage is that it utilizes all frequencies. Although it can be calculated with multiple modes (two or more of the k values with modal frequency), or even with a uniform distribution (the k values with maximum frequency), the author proposed the measure of variation for a unimodal distribution (Formula (11)). This restriction allows him to determine an asymptotic standard error, define an asymptotic confidence interval, and specify the use of inferential statistics to compare two or more SDM statistics.
$\stackrel{^}{SDM}=1-\sqrt{\frac{{\displaystyle {\sum}_{i=1}^{k}{\left({f}_{mo}-{f}_{i}\right)}^{2}}}{k-1}}$ (11)
Algebraically, it can be demonstrated that the value of Kvalseth’s SDM estimator is less than or equal to WVR. Refer to Formula (12).
$\begin{array}{l}\stackrel{^}{SDM}\le WVR\\ 1-\sqrt{\frac{{\displaystyle {\sum}_{i=1}^{k}{\left({f}_{mo}-{f}_{i}\right)}^{2}}}{k-1}}\le 1-\frac{k\times {f}_{mo}-1}{k-1}\end{array}$ (12)
As the sample size increases (
$n\to \infty $ ), the sampling distribution of the SDM statistic converges to a normal distribution with mean SDM and variance
${\sigma}_{DEM}^{2}$ . Refer to Formula (13).
$\begin{array}{l}\stackrel{^}{SDM}\stackrel{d}{\to}N\left(SDM,{\sigma}_{SDM}^{2}\right)\\ {\sigma}_{SDM}^{2}=\frac{{p}_{Mo}\times {\left(1-k{p}_{Mo}\right)}^{2}+{\displaystyle {\sum}_{i=1}^{k}{p}_{i}}\times {\left({p}_{Mo}-{p}_{i}\right)}^{2}}{n\times {\left(k-1\right)}^{2}\times {\left(1-SDM\right)}^{2}}-\frac{{\left(1-SDM\right)}^{2}}{n}\end{array}$ (13)
The square root of the above expression gives the standard deviation of the sampling distribution of SDM or standard error of SDM, as shown in Formula (14).
${\sigma}_{SDM}=\sqrt{\frac{{p}_{Mo}\times {\left(1-k{p}_{Mo}\right)}^{2}+{\displaystyle {\sum}_{i=1}^{k}{p}_{i}\times {\left({p}_{Mo}-{p}_{i}\right)}^{2}}}{n\times {\left(k-1\right)}^{2}\times {\left(1-SDM\right)}^{2}}-\frac{{\left(1-SDM\right)}^{2}}{n}}$ (14)
Substituting the probability with its estimator
${\stackrel{^}{p}}_{i}={n}_{i}/n={f}_{i}$ and using the sample mode, we obtain the estimator of the standard deviation of the sampling distribution of SDM or standard error of SDM (Formula (15))
${\stackrel{^}{\sigma}}_{SDM}=\sqrt{\frac{{f}_{Mo}{\left(1-k{f}_{mo}\right)}^{2}+{\displaystyle {\sum}_{i=1}^{k}{f}_{i}{\left({f}_{mo}-{f}_{i}\right)}^{2}}}{n{\left(k-1\right)}^{2}{\left(1-\stackrel{^}{SDM}\right)}^{2}}-\frac{{\left(1-\stackrel{^}{SDM}\right)}^{2}}{n}}$ (15)
When the sample size is reasonably large (n ≥ 30), the standard error allows for interval estimates of SDM. See Formula (16), where z_{1-α/2} represents quantile of order 1 − α/2 in a standard normal distribution and 1 − α = level of confidence.
$P\left(\stackrel{^}{SDM}-{z}_{1-\frac{\alpha}{2}}\times {\stackrel{^}{\sigma}}_{SDM}\le SDM\le \stackrel{^}{SDM}+{z}_{1-\frac{\alpha}{2}}\times {\stackrel{^}{\sigma}}_{SDM}\right)=1-\alpha $ (16)
6. Gibbs-Poston Index of Qualitative Variation
The American sociologists Jack P. Gibbs and Dudley L. Poston [12] took up the M_{1} index (Formula (17)), which had been defined as a diversity index by the English statistician Edward Hugh Simpson [7] , who derived it from the Italian statistician and sociologist Corrado Gini [16] .
${M}_{1}=1-{\displaystyle \underset{i=1}{\overset{k}{\sum}}{f}_{{x}_{i}}^{2}}$ (17)
This index can be interpreted as the complement of the probability that a random pair of samples belongs to the same category; in other words, it estimates the probability that the pair does not belong to the same category. This index has also been referred to as the differentiation index, livelihood differentiation index, and geographical differentiation index, depending on the context in which it has been used [17] .
Gibbs and Poston [12] proposed a second index (M_{2}), which is the standardized version of the previous one, called the Index of Qualitative Variation (IQV). Refer to Formula (18). It is the most widely used measure of qualitative variation, especially in the social sciences [11] .
${M}_{2}=IQV=\frac{k\left(1-{\displaystyle {\sum}_{i=1}^{k}{f}_{{x}_{i}}^{2}}\right)}{k-1}=\frac{k\times {M}_{1}}{k-1}=\frac{k}{k-1}\times {M}_{1}$ (18)
In the case of a constant random variable or random sample in which the n data points correspond to a single value a, the value of IQV is 0 (Formula (19)).
$IQV=\frac{k\left(1-{\displaystyle {\sum}_{i=1}^{k}{f}_{i}^{2}}\right)}{k-1}=\frac{2\times \left(1-1\right)}{2-1}=\frac{0}{1}=0$ (19)
In the case of a uniform distribution, the value of IQV is 1 (Formula (20)).
$IQV=\frac{k\left(1-{\displaystyle {\sum}_{i=1}^{k}{f}_{{x}_{i}}^{2}}\right)}{k-1}=\frac{k\left(1-1/k\right)}{k-1}=\frac{k\left(k-1\right)/k}{k-1}=\frac{k-1}{k-1}=1$ (20)
This measure of variation applies to any type of distribution. It uses all the information of the distribution, but usually gives high values, especially as the frequency is more distributed (proximity to a uniform distribution).
7. Shannon’s Relative Entropy
Entropy is the measure of disorder in a system of elements [18] . In probability theory, entropy is at its maximum when all elements are equiprobable and the presence of some elements does not allow predicting the appearance of others. The concept originates from thermodynamics and was applied to information theory by the American mathematician and electrical engineer Claude Elwood Shannon [13] . From there, it transitioned to statistics as a property to characterize both discrete and continuous distributions and to measure variability in qualitative variables (classification systems). At the population level, it is denoted by the capital Greek letter eta (Η), and at the sample level, by the capital Latin letter E. Entropy is the mathematical expectation of Shannon’s information or logarithm of the probability mass function [13] . Refer to Formula (21).
$H\left(X\right)=E\left(I\right)=E\left[-{\mathrm{log}}_{b}\left(P\left(X=x\right)\right)\right]=E\left[-{\mathrm{log}}_{b}\left({f}_{X}\left(x\right)\right)\right]$ (21)
When the information (I_{X}) is in base 2, we speak of bits or binary units of information. When the information is in decimal base, we speak of dits or decimal units of information. When the information is in the natural base, we speak of nats or natural units of information. This last option is the most used.
For an empirical or sample frequency distribution of a qualitative variable A (with k values), the entropy is calculated as shown in Formula (22), where f_{n}(a_{i}) represents relative frequency of the value a_{i} of the qualitative variable A, n(a_{i}) denotes absolute frequency of the value a_{i} of the qualitative variable A, and n stands for sample size.
$E=-{\displaystyle \underset{i=1}{\overset{k}{\sum}}{f}_{n}\left({a}_{i}\right)\mathrm{ln}\left({f}_{n}\left({a}_{i}\right)\right)}=-{\displaystyle \underset{i=1}{\overset{k}{\sum}}\frac{{n}_{{a}_{i}}}{n}\mathrm{ln}\left(\frac{{n}_{{a}_{i}}}{n}\right)}$ (22)
Any null frequency must be omitted in the entropy calculation. Proceeding in this manner, the statistic Η_{A} can take values in the interval [0, ln(k)], where k is the number of categories, including the categories omitted due to null frequency.
Knowing the maximum value of entropy in a discrete distribution, which corresponds to the discrete uniform distribution U{a, b}, the Shannon’s normalized or relative entropy can be calculated. This maximum value is the natural logarithm of the cardinality or number of values (k) in the bounded interval from a (minimum) to b (maximum) of the support of a discrete uniform distribution: ln(k), where k = #{a, b}. The relative entropy is denoted by RelΗ at the population level and RelE at the sample level, and it is the average entropy or information divided by its maximum value [13] . For an empirical or sample frequency distribution, the relative entropy is calculated using Formula (23). In the case of a constant random variable, the number of categories (k) is given a value of 2.
$RelE=\frac{E}{\mathrm{max}\left(E\right)}=\frac{-{\displaystyle {\sum}_{i=1}^{k}{f}_{n}\left({a}_{i}\right)\mathrm{ln}\left({f}_{n}\left({a}_{i}\right)\right)}}{\mathrm{ln}\left(k\right)}=\frac{-{\displaystyle {\sum}_{i=1}^{k}\frac{{n}_{{x}_{i}}}{n}\mathrm{ln}\left(\frac{{n}_{{x}_{i}}}{n}\right)}}{\mathrm{ln}\left(k\right)}$ (23)
Like the Gibbs-Poston measure [12] , it applies to any type of distribution. It utilizes all the information of the distribution, but usually gives high values, particularly when the frequency is more evenly distributed (approaching a uniform distribution).
8. Pattern of Behavior of the Six Defined Variation Measures
Moral [3] studied the behavior of these six measures in relation to different distributions: the distribution of a constant random variable and eight discrete distributions with five nominal categories: the first with a distribution close to that of a constant random variable; the second with a distribution close to symmetry around a single mode; the third with strict symmetry around a single mode; the fourth close to a uniform distribution, although with a single mode; the fifth with a bimodal distribution; the sixth with a trimodal distribution; the seventh with a quadrimodal distribution; and the eighth with a uniform distribution.
After analyzing this data, Moral [3] notes that FVR, UVR, WVR, and SDM are more sensitive to the proximity to a distribution in which a category concentrates all the probability than IQV and RelE, as their value is closer to the expected value for the distribution of a constant random variable, which is 0. On the contrary, the last two indices are more sensitive to the proximity to a uniform distribution, albeit with a poorly defined mode, i.e., a unimodal distribution with very similar frequencies, so their value is closer to the expected value for a uniform distribution, which is 1. When the distribution is uniform, implying that all categories have exactly the same probability or frequency, the value of all indices is 1, except for Wilcox’s variation ratio, which cannot be calculated. It should be noted that the universal variation ratio closely resembles Freeman’s variation ratio when the distribution is unimodal. The more defined the single mode is, the closer the latter two indices are to the value of 0, indicating the presence of a constant random variable.
In the case of more than one mode, Freeman’s and Wilcox’s variation ratios cannot be calculated. With the presence of more than one mode, the increase in the index value is experienced more strongly in IQV and RelE. The universal variation ratio shows the most moderate increase with two modes, while the standard deviation from the mode is the most moderate with more than two modes.
9. Confidence Intervals for Qualitative Measures of Variability
There are essentially five methods to define the confidence interval of a statistic:
1) Asymptotic normal method: Also known as the Wald-type confidence interval [19] . This method relies on the central limit theorem. It is based on the convergence in distribution of the sampling distribution of the standardized statistic to a standard normal distribution. It necessitates a large random sample and that the sampling distribution has finite mean and variance. The delta method is typically employed to determine such mean and variance [20] . If the sampling distribution is normal, a large but random sample is not necessary.
2) Student’s t-distribution method: It is a robust variant of the asymptotic normal method, utilized when the finite population standard deviation is unknown and the sample size is small [21] .
3) Resampling methods: This approach involves obtaining the sampling distribution of the statistic by generating numerous samples from the original random sample and calculating the statistic in each of them. The point estimator of the parameter is the mean of this generated distribution, and the standard error is the sample standard deviation of the generated distribution. The first technique developed was the permutation of the data from the original sample, followed by the jackknife technique, in which one element is removed, and n samples of n − 1 data are generated. Finally, the technique of sampling with replacement improved the jackknife procedure. Confidence intervals, such as Wald-type or t-type, or using the percentiles of the simulated sampling distribution, are defined [22] .
4) Bayesian method: In Bayesian statistics, confidence intervals are replaced by credible intervals. These intervals represent the range within which a parameter value falls with a certain probability, given the observed data and prior knowledge [23] .
5) Exact method: The confidence interval is constructed using the probability distribution of the sample statistic. Unlike approximate methods, exact confidence intervals provide precise coverage probabilities for the true parameter of interest, regardless of the sample size or distributional assumptions. They have primarily been developed for statistics that follow a discrete distribution, such as binomial, multinomial, hypergeometric, or Poisson distribution [24] .
Each method has its strengths and weaknesses. The choice of method depends on the sample size and the population distribution or sampling distribution of the statistic. For example, when the finite population standard deviation is unknown and the sample size is small, the t-distribution method is preferred over the Wald-type method. Bootstrap methods are useful when the underlying distribution is unknown or non-normal. Bayesian methods are powerful when prior knowledge about the parameter of interest is available. When the probability distribution of the statistic is known, the exact confidence interval is preferred, especially with small samples.
Since the sampling distributions of the six measures of variation considered are unknown, repetitive sampling with replacement (bootstrap) can be used to obtain these distributions. Through nonparametric methods: percentile (PERC) and Corrected-Bias and Accelerated (BCa) percentile, confidence intervals can be defined [25] . In cases of bias and skewness, the latter method is preferred over the former [26] . When the bias-corrected percentile has an extreme value, the confidence interval cannot be calculated by the BCa method. In this case, the bias-corrected percentile can be given a value of 0, which yields a result equivalent to the percentile method, as suggested by Efron [27] . The calculation algorithms to obtain the confidence interval for both methods are shown below [28] .
1) A random sample, denoted by x, of size n from the variable X is utilized as the starting point:
$x=\left\{{x}_{1},{x}_{2},\cdots ,{x}_{n}\right\}\subseteq X$ . The variable X can be qualitative, ordinal, or quantitative. In this work, our focus is on qualitative variables.
2) The measure of variability for qualitative variable is calculated:
$\stackrel{^}{\theta}=t\left(x\right)$ , which is the estimate of the parameter θ in the original sample of size n.
3) B independent samples of size n are drawn with replacement from the original random sample x of size n. It is recommended that there be at least 1000 draws (B ≥ 1000), that the sample size be at least 30 (n ≥ 30), and that the random sample be representative of the population from which it was collected.
4) In each of the B samples, the measure of variability
${\theta}_{i}^{*}\left(i=1,2,\cdots ,B\right)$ is calculated, which generates the bootstrap sampling distribution of the statistic or estimator. This distribution can be represented by means of a histogram, using the Freedman-Diaconis rule to determine the uniform width of class intervals [29] . This method searches for an optimal interval width without making assumptions about the distribution of the statistic.
5) The bootstrap estimate of the parameter θ is the arithmetic mean of the bootstrap sampling distribution of the statistic or estimator and is denoted by
${\stackrel{^}{\theta}}_{bootstrap}$ (Formula (24)). Bootstrap standard error is the sample or bias-corrected standard deviation of the bootstrap sampling distribution of the statistic or estimator and is denoted by se_{boostrap} (Formula (25)). Bootstrap bias is the difference between the bootstrap estimate and the estimate in the original sample (
$\stackrel{^}{\theta}$ ) and is denoted by bias_{boostrap} (Formula (26)).
${\stackrel{^}{\theta}}_{bootstrap}={\stackrel{\xaf}{\stackrel{^}{\theta}}}^{\ast}=\frac{{\displaystyle {\sum}_{i=1}^{B}{\stackrel{^}{\theta}}_{i}^{\ast}}}{B}$ (24)
$s{e}_{bootstrap}=\sqrt{\frac{{\displaystyle {\sum}_{i=1}^{B}{\left({\stackrel{^}{\theta}}_{i}^{\ast}-{\stackrel{^}{\theta}}_{bootstrap}\right)}^{2}}}{B-1}}$ (25)
$bia{s}_{boostrap}={\stackrel{^}{\theta}}_{bootstrap}-\stackrel{^}{\theta}$ (26)
6) The α/2 order quantile of the bootstrap sampling distribution of the statistic or estimator defines the lower limit, and the 1 − α/2 order quantile constitutes the upper limit of the confidence interval at (1 − α) × 100 by the percentile method (Formula (27)).
$P\left({q}_{\alpha /2}\left({\left\{{\stackrel{^}{\theta}}_{i}^{*}\right\}}_{i=1}^{B}\right)\le \theta \le {q}_{1-\alpha /2}\left({\left\{{\stackrel{^}{\theta}}_{i}^{*}\right\}}_{i=1}^{B}\right)\right)=1-\alpha $ (27)
The quantile can be obtained using Rule 8 of the R software, as recommended by Hyndman and Fan [30] for calculating sample quantiles (Formula (28)). This rule expresses the order of the quantile (p) as the median of the order statistic i of a standard continuous uniform distribution U [0, 1], which serves as the non-informative prior when estimating a probability in Bayesian inference and is the distribution followed by randomly chosen probability values. If many random samples of size n are drawn from a standard continuous uniform distribution and the data of each sample are sorted in ascending order, the statistic of order i follows the beta distribution of shape parameters: α = i and β = n + 1 − i. Since α and β are not less than 1 and when they are equivalent are greater than 1, its median is approximately: (α − 1/3)/(α + β − 2/3).
$\begin{array}{l}p=\frac{\alpha -1/3}{\alpha +\beta -2/3}=\frac{i-1/3}{i+n+1-i-2/3}=\frac{i-1/3}{n+1/3}\\ i=\frac{1}{3}+p\left(n+\frac{1}{3}\right)=\lfloor i\rfloor +\left(i-\lfloor i\rfloor \right)\\ {x}_{\left(p\right)}={x}_{\left(i\right)}+\left(i-\lfloor i\rfloor \right)\left({x}_{\left(i+1\right)}-{x}_{\left(i\right)}\right)\end{array}$ (28)
7) To obtain the confidence interval using the second method, we start by calculating the bias-corrected percentile, as shown in Formula (29), where I is the indicator function (0 when the condition is not met and 1 when it is met), and Φ^{−1} represents the probit function or quantile function of the standard normal distribution.
${z}_{0}^{*}={\Phi}^{-1}\left(\frac{{\displaystyle {\sum}_{i=1}^{B}I\left({\stackrel{^}{\theta}}_{i}^{*}\le \stackrel{^}{\theta}\right)}}{B}\right)$ (29)
For extreme values, when the argument of the probit function in the Formula (29) approaches 0 or 1,
${z}_{0}^{*}$ becomes undefined. In this case,
${z}_{0}^{*}$ can be assigned a value of 0, so that the bootstrap Bias-Corrected and accelerated (BCa) confidence interval corresponds to the percentile bootstrap confidence interval [27] .
8) The asymmetry correction factor (acceleration) is calculated using Formula (30) based on the jackknife method. Following this method, n samples are generated from the random sample x of size n, removing one sample datum in each sample, and the statistic or estimator
${\stackrel{^}{\theta}}_{\left(-i\right)}$ is calculated in each of the n samples.
$a=\frac{{\displaystyle {\sum}_{i=1}^{n}{\left({\displaystyle {\sum}_{i=1}^{n}{\stackrel{^}{\theta}}_{\left(-i\right)}/n}-{\stackrel{^}{\theta}}_{\left(-i\right)}\right)}^{3}}}{6{\left[{\displaystyle {\sum}_{i=1}^{n}{\left({\displaystyle {\sum}_{i=1}^{n}{\stackrel{^}{\theta}}_{\left(-i\right)}/n}-{\stackrel{^}{\theta}}_{\left(-i\right)}\right)}^{2}}\right]}^{3/2}}$ (30)
9) The orders of the bias-corrected and accelerated quantiles corresponding to the lower and upper limits of the confidence interval at (1 − α) × 100 are determined. Refer to Formula (31), where Φ is probit function, and z_{α}_{/2} and z_{1-α/2} are quantiles of a standard normal distribution.
${p}_{LL}=\Phi \left({z}_{0}^{*}+\frac{{z}_{0}^{*}+{z}_{\alpha /2}}{1-a\left({z}_{0}^{*}+{z}_{\alpha /2}\right)}\right)\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{and}\text{\hspace{0.17em}}\text{\hspace{0.17em}}{p}_{UL}=\Phi \left({z}_{0}^{*}+\frac{{z}_{0}^{*}+{z}_{1-\alpha /2}}{1-a\left({z}_{0}^{*}+{z}_{1-\alpha /2}\right)}\right)$ (31)
10) The p_{LL} order quantile of the bootstrap sampling distribution of the statistic or estimator defines the lower limit, and the p_{UL} order quantile defines the upper limit of the confidence interval at (1 − α) × 100 by the bias-corrected and accelerated percentile method, as shown in Formula (32). These quantiles can be computed using rule 8 of the R software [30] .
$P\left({q}_{{p}_{LL}}\left({\left\{{\stackrel{^}{\theta}}_{i}^{*}\right\}}_{i=1}^{B}\right)\le \theta \le {q}_{{p}_{UL}}\left({\left\{{\stackrel{^}{\theta}}_{i}^{*}\right\}}_{i=1}^{B}\right)\right)=1-\alpha $ (32)
10. Computation of Measures of Variability for Qualitative Variables with the R Software
One of the reasons for the underutilization of these measures of variability is their unavailability in statistical software. Below are two scripts for the R software, which can be adjusted to accommodate sample data other than those provided. The elements requiring change for such adjustment are highlighted in blue.
The first script pertains to a random sample with one mode, whereas the second script is designed for a random sample with two modes. In the first sample, the script calculates the six measures of variation and the asymptotic confidence interval for Kvalseth’s SDM, given the unique mode.
For the two-mode sample, commands for FVR and WRV are excluded, along with the standard error and asymptotic confidence interval for the SDM, as they cannot be computed. In the first script, bootstrap confidence intervals are computed using the percentile method. In the second script, these confidence intervals are calculated using the bias-corrected and accelerated percentile method. All results are rounded to four decimal places. The first script provides low-definition graphics output generated by the R program, while the second script displays the commands used to save these graphics as high-definition image files (*.jpeg).
10.1. Example 1
From the qualitative variable A = “marital status in a population of adult men aged 20 to 60 years living in a northern Mexican border city”, a random sample of 72 participants was drawn. It is required to represent its distribution by means of a frequency table and a bar chart. Additionally, calculating the mode of the 72 sample data as a measure of central tendency, along with Freeman’s Variation Ratio (FVR), Wilcox’s Variation Ratio (WVR), Moral’s Universal Variation Ratio (UVR), Kvalseth’s Standard Deviation from Mode (SDM) with its asymptotic standard error, Gibbs-Poston Index of Qualitative Variation (IQV), and Shannon’s Relative Entropy (RelE) as measures of variation is desired.
Furthermore, obtaining the 95% confidence interval for the six measures of variation is necessary. For this purpose, it is advisable to utilize bootstrap using the percentile method.
As additional information, including the bootstrap point estimate, standard error, and bias is necessary, along with representing the bootstrap distribution using a histogram with the density curve superimposed, following the Freedman-Diaconis rule to determine the width of class intervals [29] .
---R script for Example 1---
# Data definition (common for all six measures of variation)
A<- c(“Single”, “Married”, “Living together”, “Separated”, “Divorced”, “Widowed”)
a<- c(2, 1, 3, 3, 6, 5, 3, 2, 1, 1, 3, 2, 5, 6, 1, 1, 2, 3, 4, 1, 2, 1, 5, 2, 3, 2, 2, 3, 2, 2, 1, 2, 2, 1, 2, 2, 1, 2, 1, 1, 2, 2, 1, 4, 2, 2, 4, 1, 3, 2, 2, 1, 5, 1, 1, 3, 2, 1, 1, 2, 2, 2, 1, 4, 2, 1, 4, 2, 1, 3, 2, 1)
# Frequency table and bar plot (common for all six measures)
frequency_table<- table(factor(a, levels = 1:6, labels = A))
Marital_status<- names(frequency_table)
abs_freq<- as.vector(frequency_table)
N <- sum(abs_freq)
rel_fre<- abs_freq/N
rel_freq = round(rel_fre, 4)
pct <- rel_freq * 100
complete_table<- data.frame(Marital_status, abs_freq, rel_freq, pct)
print_table<- complete_table[, c(“Marital_status”, “abs_freq”, “rel_freq”, “pct”)]
cat(“Frequency distribution of marital status among men\n”)
print(print_table)
barplot(rel_freq, names.arg = A, xlab = “MS”, ylab = “Relative frequency”, main = “Bar chart: Marital status”, las = 2, col = “lightgray”)
# Mode calculation (common for all six measures)
modes <- Marital_status[abs_freq == max(abs_freq)]
mode_frequency<- max(rel_fre)
n <- length(a)
k <- length(Marital_status)
c <- length(modes)
cat(“Sample size: n =”, n, “\n”)
cat(“Number of nominal categories: k =”, k, “\n”)
cat(“Modal categories: mo =”, modes, “\n”)
cat(“Number of modal values: c =”, c, “\n”)
cat(“Relative frequency of the mode: fmo =”, round(mode_frequency, 4), “\n”)
# Calculation of FVR and its 95% PERC bootstrap confidence interval
FVR <- 1 - mode_frequency
cat(“Freeman’s (1965) Variation Ratio: FVR =”, round(FVR, 4), “\n”)
set.seed(123)
B <- 1000
boot_FVR<- numeric(B)
for (i in 1:B) {
boot_sample<- sample(a, replace = TRUE)
boot_freq_table<- table(factor(boot_sample, levels = 1:6, labels = A))
boot_mode<- max(boot_freq_table)/sum(boot_freq_table)
boot_FVR[i] <- 1 - boot_mode
}
BE_FVR <- mean(boot_FVR)
bias_FVR<- mean(boot_FVR) - FVR
se_FVR<- sd(boot_FVR)
cat(“Bootstrap estimation for FVR:”, round(BE_FVR, 4), “\n”)
cat(“Bootstrap bias for FVR:”, round(bias_FVR, 4), “\n”)
cat(“Bootstrap standard error for FVR:”, round(se_FVR, 4), “\n”)
PERC_CI_FVR <- quantile(boot_FVR, c(0.025, 0.975), type = 8)
cat(“The 95% PERC bootstrap confidence interval for FVR: [”, round (PERC_CI_FVR[1], 4), “,”, round(PERC_CI_FVR[2], 4), “]\n”)
# Freedman-Diaconis histogram with overlaid density curve for FVR
hist(boot_FVR, breaks = “fd”, col = “lightgrey”, border = “black”, freq = FALSE, main = “Freedman-Diaconis histogram with overlaid density curve”, xlab = “Bootstrap Freeman Variation Ratio”, ylab = “Density”, xlim = c(0, 1))
lines(density(boot_FVR), col=“black”, lwd=4)
# Calculation of WVR and its 95% PERC bootstrap confidence interval
WVR <- k/(k - 1) * (1 - mode_frequency)
cat(“Wilcox’s (1973) Variation ratio: WVR =”, round(WVR, 4), “\n”)
set.seed(123)
boot_WVR<- numeric(B)
for (i in 1:B) {
boot_sample<- sample(a, replace = TRUE)
boot_freq_table<- table(factor(boot_sample, levels = 1:6, labels = A))
boot_mode<- max(boot_freq_table)/sum(boot_freq_table)
boot_WVR[i] <- k/(k - 1) * (1 - boot_mode)
}
BE_WVR <- mean(boot_WVR)
bias_WVR<- mean(boot_WVR) - WVR
se_WVR<- sd(boot_WVR)
cat(“Bootstrap estimation for WVR:”, round(BE_WVR, 4), “\n”)
cat(“Bootstrap bias for WVR:”, round(bias_WVR, 4), “\n”)
cat(“Bootstrap standard error for WVR:”, round(se_WVR, 4), “\n”)
percentile_conf_interval_WVR<- quantile(boot_WVR, c(0.025, 0.975), type = 8)
cat(“The 95% PERC bootstrap confidence interval for WVR: [”, round (percentile_conf_interval_WVR[1], 4), “,”, round(percentile_conf_interval_WVR[2], 4), “]\n”)
# Freedman-Diaconis histogram with overlaid density curve for WVR
hist(boot_WVR, breaks = “fd”, col = “lightgrey”, border = “black”, freq = FALSE,
main = “Freedman-Diaconis histogram with overlaid density curve”,
xlab = “Bootstrap Wilcox Variation Ratio”, ylab = “Density”, xlim = c(0, 1))
lines(density(boot_WVR), col = “black”, lwd = 4)
# Calculation of UVR and its 95% PERC bootstrap confidence interval
UVR <- k^2 /( k^2-1) * (1 - mode_frequency/c)
cat(“Moral’s (2022) Universal Variation Ratio: UVR =”, round(UVR, 4), “\n”)
set.seed(123)
boot_UVR<- numeric(B)
for (i in 1:B) {
boot_sample<- sample(a, replace = TRUE)
boot_freq_table<- table(factor(boot_sample, levels = 1:6, labels = A))
boot_mode<- max(boot_freq_table)/sum(boot_freq_table)
boot_mode_num<- sum(boot_freq_table == max(boot_freq_table))
boot_UVR[i] <- k^2/(k^2 - 1) * (1 - boot_mode/boot_mode_num)
}
BE_UVR <- mean(boot_UVR)
bias_UVR<- mean(boot_UVR) - UVR
se_UVR<- sd(boot_UVR)
cat(“Bootstrap estimation for UVR:”, round(BE_UVR, 4), “\n”)
cat(“Bootstrap bias for UVR:”, round(bias_UVR, 4), “\n”)
cat(“Bootstrap standard error for UVR:”, round(se_UVR, 4), “\n”)
PERC_CI_UVR <- quantile(boot_UVR, c(0.025, 0.975), type = 8)
cat(“The 95% PERC bootstrap confidence interval for UVR: [”, round (PERC_CI_UVR[1], 4), “,”, round(PERC_CI_UVR[2], 4), “]\n”)
# Freedman-Diaconis histogram with overlaid density curve for UVR
hist(boot_UVR, breaks = “fd”, col = “lightgrey”, border = “black”, freq = FALSE, main = “Freedman-Diaconis histogram with overlaid density curve”, xlab = “Bootstrap Moral Universal Variation Ratio”, ylab = “Density”, xlim = c(0, 1))
lines(density(boot_UVR), col=“black”, lwd=4)
# Calculation of Tvalseth’s Standard Deviation from Mode (SDM)
f_mo<- mode_frequency
SDM <- 1 - sqrt(sum((f_mo - rel_fre)^2)/(k - 1))
cat(“Tvalseth’s (1988) Standard Deviation from Mode: SDM =”, round(SDM, 4), “\n”)
# Calculating the asymptotic standard error and 95% confidence interval for SDM
ase <- sqrt((f_mo * (1 - k * f_mo)^2 + sum(rel_fre * (f_mo - rel_fre)^2))/(N * (k - 1)^2 * (1 - SDM)^2) - (1 - SDM)^2/N)
cat(“Asymptotic Standard Error of SDM: ASE(SDM) =”, round(ase, 4), “\n”)
alpha <- 0.05
z_crit<- qnorm(1 - alpha/2)
ACI<- c(SDM - z_crit * ase, SDM + z_crit * ase)
cat(“The 95% asymptotic confidence interval for SDM: [”, round(ACI[1], 4), “,”, round(ACI[2], 4), “]\n”)
# Calculation of the 95% PERC bootstrap confidence interval for SDM
set.seed(123)
boot_SDM<- numeric(B)
for (i in 1:B) {
boot_sample<- sample(a, replace = TRUE)
boot_freq_table<- table(factor(boot_sample, levels = 1:6, labels = A))
boot_mode_freq<- max(boot_freq_table)/sum(boot_freq_table)
boot_SDM[i] <- 1 - sqrt(sum((boot_mode_freq - rel_fre)^2)/(k - 1))
}
BE_SDM <- mean(boot_SDM)
bias_SDM<- mean(boot_SDM) - SDM
se_SDM<- sd(boot_SDM)
cat(“Bootstrap estimation for SDM:”, round(BE_SDM, 4), “\n”)
cat(“Bootstrap bias for SDM:”, round(bias_SDM, 4), “\n”)
cat(“Bootstrap standard error for SDM:”, round(se_SDM, 4), “\n”)
PERC_CI_SDM <- quantile(boot_SDM, c(0.025, 0.975), type= 8)
cat(“The 95% PERC bootstrap confidence interval for SDM: [”, round (PERC_CI_SDM[1], 4), “,”, round(PERC_CI_SDM[2], 4), “]\n”)
# Freedman-Diaconis histogram with overlaid density curve for SDM
hist(boot_SDM, breaks = “fd”, col = “lightgrey”, border = “black”, freq = FALSE, main = “Freedman-Diaconis histogram with overlaid density curve”, xlab = “Bootstrap Tvalseth Standard Deviation from Mode”, ylab = “Density”, xlim = c(0, 1))
lines(density(boot_SDM), col=“black”, lwd=4)
# Calculation of IQVand its 95% PERC bootstrap confidence interval
IQV <- k/(k - 1) * (1 - sum(rel_fre^2))
cat(“Gibbs-Poston (1975) Index of Qualitative Variation: IQV =”, round(IQV, 4), “\n”)
boot_IQV<- numeric(B)
set.seed(123)
for (i in 1:B) {
boot_sample<- sample(a, replace = TRUE)
boot_frequency_table<- table(factor(boot_sample, levels = 1:6, labels = A))
boot_rel_fre<- as.vector(boot_frequency_table)/sum(boot_frequency_table)
boot_IQV[i] <- k/(k - 1) * (1 - sum(boot_rel_fre^2))
}
BE_IQV <- mean(boot_IQV)
bias_IQV<- mean(boot_IQV) - IQV
se_IQV<- sd(boot_IQV)
cat(“Bootstrap estimation for IQV:”, round(BE_IQV, 4), “\n”)
cat(“Bootstrap bias for IQV:”, round(bias_IQV, 4), “\n”)
cat(“Bootstrap standard error for IQV:”, round(se_IQV, 4), “\n”)
PERC_CI_IQV <- quantile(boot_IQV, c(0.025, 0.975), type = 8)
cat(“The 95% PERC bootstrap confidence interval for IQV: [”, round (PERC_CI_IQV[1], 4), “,”, round(PERC_CI_IQV[2], 4), “]\n”)
# Freedman-Diaconis histogram with overlaid density curve for IQV
hist(boot_IQV, breaks = “fd”, col = “lightgrey”, border = “black”, freq = FALSE, main = “Freedman-Diaconis histogram with overlaid density curve”, xlab = “Bootstrap Gibbs-Poston Index of Qualitative Variation”, ylab = “Density”, xlim = c(0, 1))
lines(density(boot_IQV), col=“black”, lwd=4)
# Calculation of RelE and its 95% PERC bootstrap confidence interval
entropy <- -sum(rel_fre * log(rel_fre + (rel_fre == 0)))
RelE <- entropy/log(k)
cat(“Shannon’s (1948) Entropy: E =”, round(entropy, 4), “\n”)
cat(“Shannon’s (1948) Relative Entropy: RelE =”, round(RelE, 4), “\n”)
set.seed(123)
boot_RelE<- numeric(B)
for (i in 1:B) {
boot_sample<- sample(a, replace = TRUE)
boot_freq_table<- table(factor(boot_sample, levels = 1:6, labels = A))
boot_rel_fre<- as.vector(boot_freq_table)/sum(boot_freq_table)
boot_entropy<- -sum(boot_rel_fre* log(boot_rel_fre + (boot_rel_fre== 0)))
boot_RelE[i] <- boot_entropy/log(k)
}
BE_RelE<- mean(boot_RelE)
bias_RelE<- mean(boot_RelE) - RelE
se_RelE<- sd(boot_RelE)
cat(“Bootstrap estimation for RelE:”, round(BE_RelE, 4), “\n”)
cat(“Bootstrap bias for RelE:”, round(bias_RelE, 4), “\n”)
cat(“Bootstrap standard error for RelE:”, round(se_RelE, 4), “\n”)
PERC_CI_RelE<- quantile(boot_RelE, c(0.025, 0.975), type = 8)
cat(“The 95% PERC bootstrap confidence interval for RelE: [”, round (PERC_CI_RelE[1], 4), “,”, round(PERC_CI_RelE[2], 4), “]\n”)
# Freedman-Diaconis histogram with overlaid density curve for RelE
hist(boot_RelE, breaks = “fd”, col = “lightgrey”, border = “black”, freq = FALSE, main = “Freedman-Diaconis histogram with overlaid density curve”, xlab = “Bootstrap Shannon Relative Entropy”, ylab = “Density”, xlim = c(0, 1))
lines(density(boot_RelE), col=“black”, lwd=4)
This script can be executed online as the R software is accessible at https://rdrr.io/snippets/ with over 19000 pre-installed packages available for free. Table 1 along with the requested statistics and confidence intervals are displayed as output. The graphs are excluded because they are low-definition. It should be noted that the ggplot2 library of the R software can be utilized to save these graphics as high-definition image files (*.jpeg), as shown the second script.
Sample size: n = 72
Number of nominal categories: k = 6
Modal categories: mo = Married
Number of modal values: c = 1
Relative frequency of the mode: fmo = 0.3889
Freeman’s (1965) Variation Ratio: FVR = 0.6111
Bootstrap estimation for FVR: 0.597
Bootstrap bias for FVR: −0.0142
Bootstrap standard error for FVR: 0.0459
The 95% PERC bootstrap confidence interval for FVR: [0.5, 0.6806]
Table 1. Frequency distribution of marital status among men.
Wilcox’s (1973) Variation ratio: WVR = 0.7333
Bootstrap estimation for WVR: 0.7163
Bootstrap bias for WVR: −0.017
Bootstrap standard error for WVR: 0.055
The 95% PERC bootstrap confidence interval for WRV: [0.6, 0.8167]
Moral’s (2022) Universal Variation Ratio:UVR = 0.6286
Bootstrap estimation for UVR: 0.6229
Bootstrap bias for UVR: −0.0057
Bootstrap standard error for UVR: 0.0682
The 95% PERC bootstrap confidence interval for UVR: [0.5143, 0.85]
Tvalseth’s (1988) Standard Deviation from Mode: SDM = 0.7133
Asymptotic Standard Error of SDM: ASE(SDM) = 0.061
The 95% asymptotic confidence interval for SDM: [0.5939, 0.8328]
Bootstrap estimation for SDM: 0.6990
Bootstrap bias for SDM: −0.0143
Bootstrap standard error for SDM: 0.0434
The 95% PERC bootstrap confidence interval for SDM: [0.6047, 0.7743]
Gibbs-Poston (1975) Index of Qualitative Variation: IQV = 0.8625
Bootstrap estimation for IQV: 0.8505
Bootstrap bias for IQV: −0.012
Bootstrap standard error for IQV: 0.0359
The 95% PERC bootstrap confidence interval for IQV: [0.775, 0.9126]
Shannon’s (1948) Entropy: E = 1.4514
Shannon’s (1948) Relative Entropy: RelE = 0.81
Bootstrap estimation for RelE: 0.7896
Bootstrap bias for RelE: −0.0204
Bootstrap standard error for RelE: 0.0486
The 95% PERC bootstrap confidence interval for RelE: [0.6894, 0.8771]
10.2. Example 2
From the qualitative variable C = marital status of the population of adult women aged 18 to 60 years living in a border city in northern Mexico, a random sample was drawn. It is necessary to represent these data with a frequency table and a bar chart, calculate the mode as a measure of central tendency, as well as Moral’s Universal Variation Ratio (UVR), Kvalseth’s Standard Deviation from Mode (SDM), Gibbs-Poston Index of Qualitative Variation (IQV), and Shannon’s Relative Entropy (RelE) as measures of variation. Additionally, obtaining the 95% confidence interval for the four qualitative measures of variation is desired. For this purpose, it is advisable to utilize bootstrap using the bias-corrected and accelerated percentile method. Furthermore, information on the bootstrap point estimate, standard error, and bias is needed, along with displaying the bootstrap distribution using a histogram, determining the width of class intervals by the Freedman-Diaconis rule [29] .
---R script for Example 2---
# Data definition
C<- c(“Single”, “Married”, “Free union”, “Separated”, “Divorced”, “Widowed”)
c<- c(2, 2, 1, 2, 2, 2, 2, 1, 2, 3, 1, 2, 5, 1, 2, 2, 1, 3, 2, 1, 2, 1, 5, 2, 2, 2, 2, 1, 2, 2, 4, 5, 3, 3, 3, 2, 1, 1, 5, 1, 1, 1, 2, 2, 3, 1, 2, 3, 4, 1, 4, 4, 1, 1, 3, 3, 1, 1, 1, 1, 6, 1, 4, 1, 1, 2, 2, 6, 2, 1, 1, 1, 1, 2, 3, 2, 2)
# Frequency table (common for all four measures of variation)
frequency_table<- table(factor(c, levels = 1:6, labels = C))
Marital_status<- names(frequency_table)
abs_freq<- as.vector(frequency_table)
N <- sum(abs_freq)
rel_fre<- abs_freq/N
rel_freq = round(rel_fre, 4)
pct <- rel_freq * 100
complete_table<- data.frame(Marital_status, abs_freq, rel_freq, pct)
print_table<- complete_table[, c(“Marital_status”, “abs_freq”, “rel_freq”, “pct”)]
cat(“Frequency distribution of marital status among women\n”)
print(print_table)
# Bar plot using ggplot2 and saved as a JPEG file (common for all four measures)
library(ggplot2)
df<- data.frame(Marital_Status = factor(C, levels = C), Relative_Frequency = rel_fre)
plot <- ggplot(df, aes(x = Marital_Status, y = Relative_Frequency)) +
geom_bar(stat = “identity”, fill = “lightgray”, color = “black”) +
labs(x = “Marital status”, y = “Relative frequency”) +
theme(axis.text.x.bottom = element_text(angle = 10, hjust = 0.5, size = 7), axis.text.y = element_text(size = 7), axis.title.x = element_text(size = 9), axis.title.y = element_text(size = 9), panel.background = element_rect(fill = “white”), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.line = element_line(color = “black”))
jpeg(“bar_chart.jpeg”, width = 800, height = 600, units = “px”, res = 300)
print(plot)
dev.off()
plot
# Mode calculation (common for all four measures)
modes <- Marital_status[abs_freq == max(abs_freq)]
mode_frequency<- max(rel_fre)
n <- length(c)
k <- length(Marital_status)
nmo<- length(modes)
cat(“Sample size: n =”, n, “\n”)
cat(“Number of nominal categories: k =”, k, “\n”)
cat(“Modal categories: mo =”, modes, “\n”)
cat(“Number of modal values: c =”, nmo, “\n”)
cat(“Relative frequency of the mode: fmo =”, round(mode_frequency, 4), “\n”)
# Calculation of UVR and its bootstrap sampling distribution
UVR <- k^2 /( k^2-1) * (1 - mode_frequency/nmo)
cat(“Moral’s (2022) Universal Variation Ratio: UVR =”, round(UVR, 4), “\n”)
set.seed(123)
B <- 1000
boot_UVR<- numeric(B)
for (i in 1:B) {
boot_sample<- sample(c, replace = TRUE)
boot_freq_table<- table(factor(boot_sample, levels = 1:6, labels = C))
boot_mode_freq<- max(boot_freq_table)/sum(boot_freq_table)
boot_mode_num<- sum(boot_freq_table == max(boot_freq_table))
boot_UVR[i] <- k^2/(k^2 - 1) * (1 - boot_mode_freq/boot_mode_num)
}
BE_UVR <- mean(boot_UVR)
bias_UVR<- mean(boot_UVR) - UVR
se_UVR<- sd(boot_UVR)
cat(“Bootstrap estimation for UVR:”, round(BE_UVR, 4), “\n”)
cat(“Bootstrap bias for UVR:”, round(bias_UVR, 4), “\n”)
cat(“Bootstrap standard error for UVR:”, round(se_UVR, 4), “\n”)
# Histogram with overlaid density curve for UVR (save as JPEG file)
hist_data1 <- data.frame(boot_UVR)
q25 <- quantile(hist_data1$boot_UVR, 0.25, type = 8)
q75 <- quantile(hist_data1$boot_UVR, 0.75, type = 8)
iqr<- q75 - q25
FD <- 2 * iqr/(length(hist_data1$boot_UVR)^(1/3))
hist_plot1 <- ggplot(hist_data1, aes(x = boot_UVR)) +
geom_histogram(binwidth = FD, fill = “lightgrey”, color = “black”, aes(y = ..density..)) + geom_density(color = “black”, size = 1.5) + labs(x = “Bootstrap Moral’s UVR”, y = “Density”) + theme(panel.background = element_rect(fill = “white”), axis.text.x.bottom = element_text(size = 8), axis.text.y = element_text (size = 8), axis.title.x = element_text(size = 9), axis.title.y = element_text(size = 9), axis.line = element_line(color = “black”))
jpeg(“histogram_UVR.jpeg”, width = 800, height = 600, units = “px”, res = 300)
print(hist_plot1)
dev.off()
hist_plot1
# Bias-corrected percentile for UVR
z_0_UVR<- qnorm(sum(boot_UVR<= UVR)/B)
if (is.infinite(z_0_UVR)) {z_0_UVR<- 0} else {z_0_UVR<- z_0_UVR}
cat(“Bias-corrected percentile for UVR:”, round(z_0_UVR, 4), “\n”)
# Skewness correction factor (acceleration) using jackknife estimation for UVR
jackknife_UVR<- numeric(n)
for (i in 1:n) {
jackknife_sample<- c[-i]
jackknife_freq_table<- table(factor(jackknife_sample, levels = 1:6, labels = C))
jackknife_mode_freq<- max(jackknife_freq_table)/sum(jackknife_freq_table)
jackknife_mode_num<- sum(jackknife_freq_table == max(jackknife_freq_ table))
jackknife_UVR[i] <- k^2/(k^2 - 1) * (1 - jackknife_mode_freq/jackknife_ mode_num)
}
jackknife_UVR_mean<- sum(jackknife_UVR)/n
a_UVR<- sum((jackknife_UVR_mean - jackknife_UVR)^3)/(6 * sum ((jackknife_UVR_mean - jackknife_UVR)^2)^(3/2))
cat(“Skewness correction factor (acceleration):”, round(a_UVR, 4), “\n”)
# BCa bootstrap confidence interval for UVR
z_LL<- qnorm(0.025)
z_UL<- qnorm(0.975)
LL_BCa_UVR<- pnorm(z_0_UVR + (z_0_UVR + z_LL)/(1 - a_UVR* (z_0_UVR + z_LL)))
UL_BCa_UVR<-pnorm(z_0_UVR + (z_0_UVR + z_UL)/(1 - a_UVR* (z_0_UVR +z_UL)))
BCa_confidence_interval_UVR<- quantile(boot_UVR, probs = c(LL_BCa_ UVR, UL_BCa_UVR), type = 8)
cat(“The 95% BCabootstrapconfidenceintervalfor UVR: [”, round(BCa_ confidence_interval_UVR[1], 4), “,”, round(BCa_confidence_interval_UVR[2], 4), “]\n”)
# Calculation of SDM and its bootstrap sampling distribution
f_mo<- mode_frequency
SDM <- 1 - sqrt(sum((f_mo - rel_fre)^2)/(k - 1))
cat(“Tvalseth’s (1988) Standard Deviation from Mode: SDM =”, round(SDM, 4), “\n”)
set.seed(123)
boot_SDM<- numeric(B)
for (i in 1:B) {
boot_sample<- sample(c, replace = TRUE)
boot_frequency_table<- table(factor(boot_sample, levels = 1:6, labels = C))
boot_mode_frequency<- max(boot_frequency_table)/sum(boot_frequency_table)
boot_rel_fre<- as.vector(boot_frequency_table)/sum(boot_frequency_table)
boot_SDM[i] <- 1 - sqrt(sum((boot_mode_frequency-boot_rel_fre)^2)/(k - 1))
}
BE_SDM <- mean(boot_SDM)
bias_SDM<- mean(boot_SDM) - SDM
se_SDM<- sd(boot_SDM)
cat(“Bootstrap estimation for SDM:”, round(BE_SDM, 4), “\n”)
cat(“Bootstrap bias for SDM:”, round(bias_SDM, 4), “\n”)
cat(“Bootstrap standard error for SDM:”, round(se_SDM, 4), “\n”)
# Histogram with overlaid density curve for SDM (save as JPEG file)
hist_data2 <- data.frame(boot_SDM)
q25 <- quantile(hist_data2$boot_SDM, 0.25, type = 8)
q75 <- quantile(hist_data2$boot_SDM, 0.75, type = 8)
iqr<- q75 - q25
FD <- 2 * iqr/(length(hist_data2$boot_SDM)^(1/3))
hist_plot2 <- ggplot(hist_data2, aes(x = boot_SDM)) + geom_histogram (binwidth = FD, fill = “lightgrey”, color = “black”, aes(y = ..density..)) + geom_density(color = “black”, size = 1.5) + labs(x = “Bootstrap Tvalseth’s SDM”, y = “Density”) + theme(panel.background = element_rect(fill = “white”), axis.text.x.bottom = element_text(size = 8), axis.text.y = element_text(size = 8), axis.title.x = element_text(size = 9), axis.title.y = element_text(size = 9), axis.line = element_line(color = “black”))
jpeg(“histogram_SDM.jpeg”, width = 800, height = 600, units = “px”, res = 300)
print(hist_plot2)
dev.off()
hist_plot2
# Bias-corrected percentile for SDM
z_0_SDM <- qnorm(sum(boot_SDM<= SDM)/B)
if (is.infinite(z_0_SDM)) {z_0_SDM <- 0} else {z_0_SDM <- z_0_SDM}
cat(“Bias-corrected percentile for SDM:”, round(z_0_SDM, 4), “\n”)
# Skewness correction factor (acceleration) using jackknife estimation for SDM
jackknife_SDM<- numeric(n)
for (i in 1:n) {
jackknife_sample<- c[-i]
jackknife_freq_table<- table(factor(jackknife_sample, levels = 1:6, labels = C))
jackknife_mode_freq<- max(jackknife_freq_table)/sum(jackknife_freq_table)
jackknife_rel_fre<- as.vector(jackknife_freq_table)/sum(jackknife_freq_table)
jackknife_SDM[i] <- 1 - sqrt(sum((jackknife_mode_freq - jackknife_rel_fre)^2)/(k - 1))
}
jackknife_SDM_mean<- sum(jackknife_SDM)/n
a_SDM<- sum((jackknife_SDM_mean - jackknife_SDM)^3)/(6 * sum((jackknife_SDM_mean - jackknife_SDM)^2)^(3/2))
cat(“Skewness correction factor (acceleration):”, round(a_SDM, 4), “\n”)
# BCa bootstrap confidence interval for SDM
LL_BCa_SDM<- pnorm(z_0_SDM + (z_0_SDM + z_LL)/(1 - a_SDM* (z_0_SDM + z_LL)))
UL_BCa_SDM<-pnorm(z_0_SDM + (z_0_SDM + z_UL)/(1- a_SDM* (z_0_SDM + z_UL)))
BCa_confidence_interval_SDM<- quantile(boot_SDM, probs = c(LL_BCa_ SDM, UL_BCa_SDM), type = 8)
cat(“The 95% BCa bootstrap confidence interval for SDM: [”, round (BCa_confidence_interval_SDM[1], 4), “,”, round(BCa_confidence_ interval_SDM[2], 4), “]\n”)
# Calculation of IQV and its bootstrap sampling distribution
IQV <- k/(k - 1) * (1 - sum(rel_fre^2))
cat(“Gibbs-Poston (1975) Index of Qualitative Variation: IQV =”, round(IQV, 4), “\n”)
set.seed(123)
boot_IQV<- numeric(B)
for (i in 1:B) {
boot_sample<- sample(c, replace = TRUE)
boot_frequency_table<- table(factor(boot_sample, levels = 1:6, labels = C))
boot_rel_fre<- as.vector(boot_frequency_table)/sum(boot_frequency_table)
boot_IQV[i] <- k/(k - 1) * (1 - sum(boot_rel_fre^2))
}
BE_IQV <- mean(boot_IQV)
bias_IQV<- mean(boot_IQV) - IQV
se_IQV<- sd(boot_IQV)
cat(“Bootstrap estimation for IQV:”, round(BE_IQV, 4), “\n”)
cat(“Bootstrap bias for IQV:”, round(bias_IQV, 4), “\n”)
cat(“Bootstrap standard error for IQV:”, round(se_IQV, 4), “\n”)
# Histogram with overlaid density curve for IQV (save as JPEG file)
hist_data3 <- data.frame(boot_IQV)
q25 <- quantile(hist_data3$boot_IQV, 0.25, type = 8)
q75 <- quantile(hist_data3$boot_IQV, 0.75, type = 8)
iqr<- q75 - q25
FD <- 2 * iqr/(length(hist_data3$boot_IQV)^(1/3))
hist_plot3 <- ggplot(hist_data3, aes(x = boot_IQV)) + geom_histogram (binwidth = FD, fill = “lightgrey”, color = “black”, aes(y = ..density..)) + geom_density(color = “black”, size = 1.5) + labs(x = “Bootstrap Gibbs-Poston IQV”, y = “Density”) + theme(panel.background = element_rect(fill = “white”), axis.text.x.bottom = element_text(size = 8), axis.text.y = element_text(size = 8), axis.title.x = element_text(size = 9), axis.title.y = element_text(size = 9), axis.line = element_line(color = “black”))
jpeg(“histogram_IQV.jpeg”, width = 800, height = 600, units = “px”, res = 300)
print(hist_plot3)
dev.off()
hist_plot3
# Bias-corrected percentile for IQV
z_0_IQV <- qnorm(sum(boot_IQV<= IQV)/B)
if (is.infinite(z_0_IQV)) {z_0_IQV <- 0} else {z_0_IQV <- z_0_IQV}
cat(“Bias-corrected percentile for IQV:”, round(z_0_IQV, 4), “\n”)
# Skewness correction factor (acceleration) using jackknife estimation for IQV
jackknife_IQV<- numeric(n)
for (i in 1:n) {
jackknife_sample<- c[-i]
jackknife_freq_table<- table(factor(jackknife_sample, levels = 1:6, labels = C))
jackknife_mode_freq<- max(jackknife_freq_table)/sum(jackknife_freq_table)
jackknife_rel_fre<- as.vector(jackknife_freq_table)/sum(jackknife_freq_table)
jackknife_IQV[i] <- k/(k - 1) * (1 - sum(jackknife_rel_fre^2))
}
jackknife_IQV_mean<- sum(jackknife_IQV)/n
a_IQV<- sum((jackknife_IQV_mean - jackknife_IQV)^3)/(6 * sum((jackknife_IQV_mean - jackknife_IQV)^2)^(3/2))
cat(“Skewness correction factor (acceleration):”, round(a_IQV, 4), “\n”)
# BCa bootstrap confidence interval for IQV
LL_BCa_IQV<- pnorm(z_0_IQV + (z_0_IQV + z_LL)/(1 - a_IQV* (z_0_IQV + z_LL)))
UL_BCa_IQV<- pnorm(z_0_IQV + (z_0_IQV + z_UL)/(1 - a_IQV* (z_0_IQV + z_UL)))
BCa_confidence_interval_IQV<- quantile(boot_IQV, probs = c(LL_BCa_IQV, UL_BCa_IQV), type = 8)
cat(“The 95% BCa bootstrap confidence interval for IQV: [”, round (BCa_confidence_interval_IQV[1], 4), “,”, round(BCa_confidence_ interval_IQV[2], 4), “]\n”)
# Calculation of RelE and its bootstrap sampling distribution
entropy <- -sum(rel_fre * log(rel_fre + (rel_fre == 0)))
RelE <- entropy/log(k)
cat(“Shannon’s (1948) Entropy: E =”, round(entropy, 4), “\n”)
cat(“Shannon’s (1948) Relative Entropy: RelE =”, round(RelE, 4), “\n”)
set.seed(123)
boot_RelE<- numeric(B)
for (i in 1:B) {
boot_sample<- sample(c, replace = TRUE)
boot_freq_table<- table(factor(boot_sample, levels = 1:6, labels = C))
boot_rel_fre<- as.vector(boot_freq_table)/sum(boot_freq_table)
boot_entropy<- -sum(boot_rel_fre * log(boot_rel_fre + (boot_rel_fre == 0)))
boot_RelE[i] <- boot_entropy/log(k)
}
BE_RelE<- mean(boot_RelE)
bias_RelE<- mean(boot_RelE) - RelE
se_RelE<- sd(boot_RelE)
cat(“Bootstrap estimation for RelE:”, round(BE_RelE, 4), “\n”)
cat(“Bootstrap bias for RelE:”, round(bias_RelE, 4), “\n”)
cat(“Bootstrap standard error for RelE:”, round(se_RelE, 4), “\n”)
# Histogram with overlaid density curve for RelE (save as JPEG file)
hist_data4 <- data.frame(boot_RelE)
q25 <- quantile(hist_data4$boot_RelE, 0.25, type = 8)
q75 <- quantile(hist_data4$boot_RelE, 0.75, type = 8)
iqr<- q75 - q25
FD <- 2 * iqr/(length(hist_data4$boot_RelE)^(1/3))
hist_plot4 <- ggplot(hist_data4, aes(x = boot_RelE)) + geom_histogram (binwidth = FD, fill = “lightgrey”, color = “black”, aes(y = ..density..)) + geom_density(color = “black”, size = 1.5) + labs(x = “Bootstrap Shannon’s RelE”, y = “Density”) + theme(panel.background = element_rect(fill = “white”), axis.text.x.bottom = element_text(size = 8), axis.text.y = element_text(size = 8), axis.title.x = element_text(size = 9), axis.title.y = element_text(size = 9), axis.line = element_line(color = “black”))
jpeg(“histogram_RelE.jpeg”, width = 800, height = 600, units = “px”, res = 300)
print(hist_plot4)
dev.off()
hist_plot4
# Bias-corrected percentile for RelE
z_0_RelE <- qnorm(sum(boot_RelE<=RelE)/B)
if (is.infinite(z_0_RelE)) {z_0_RelE <- 0} else {z_0_RelE <- z_0_RelE}
cat(“Bias-correctedpercentile for RelE:”, round(z_0_RelE, 4), “\n”)
# Skewness correction factor (acceleration) using jackknife estimation for RelE
jackknife_RelE<- numeric(n)
for (i in 1:n) {
jackknife_sample<- c[-i]
jackknife_freq_table<- table(factor(jackknife_sample, levels = 1:6, labels = C))
jackknife_mode_freq<- max(jackknife_freq_table)/sum(jackknife_freq_table)
jackknife_rel_fre<- as.vector(jackknife_freq_table)/sum(jackknife_freq_table)
jackknife_entropy<- -sum(jackknife_rel_fre * log(jackknife_rel_fre + (jackknife_rel_fre == 0)))
jackknife_RelE[i] <- jackknife_entropy/log(k)
}
jackknife_RelE_mean<- sum(jackknife_RelE)/n
a_RelE<- sum((jackknife_RelE_mean - jackknife_RelE)^3)/(6 * sum((jackknife_RelE_mean - jackknife_RelE)^2)^(3/2))
cat(“Skewness correction factor (acceleration):”, round(a_RelE, 4), “\n”)
# BCa bootstrap confidence interval for RelE
LL_BCa_RelE<- pnorm(z_0_RelE + (z_0_RelE + z_LL)/(1 - a_RelE* (z_0_RelE + z_LL)))
UL_BCa_RelE<- pnorm(z_0_RelE + (z_0_RelE + z_UL)/(1 - a_RelE* (z_0_RelE + z_UL)))
BCa_confidence_interval_RelE<- quantile(boot_RelE, probs = c(LL_BCa_RelE, UL_BCa_RelE), type = 8)
cat(“The 95% BCa bootstrap confidence interval for RelE: [”, round (BCa_confidence_interval_RelE[1], 4), “,”, round(BCa_confidence_interval_ RelE[2], 4), “]\n”)
When the script is executed with the R software installed on the personal computer, Table 2, along with the requested statistics and confidence intervals, is displayed as output alongside Figures 1-5. If the script is run online, the table, statistics, and low-definition graphs appear, but the *.jpeg files are not created.
Table 2. Sample frequency distribution of marital status in women.
Figure 1. Bar chart of marital status in women.
Sample size: n = 77
Number of nominal categories: k = 6
Modal categories: mo = Single Married
Number of modal values: c = 2
Relative frequency of the mode: fmo = 0.3636
Moral’s (2022) Universal Variation Ratio:UVR = 0.8416
Bootstrap estimation for UVR: 0.6257
Bootstrap bias for UVR: −0.2158
Bootstrap standard error for UVR: 0.064
Bias-corrected percentile for UVR: 2.1201
Skewness correction factor (acceleration): −0.0194
The 95% BCa bootstrap confidence interval for UVR: [0.8482, 0.875]
Figure 2. Freedman-Diaconis histogram and density curve of the bootstrap UVR distribution.
Tvalseth’s (1988) Standard Deviation from Mode: SDM = 0.7335
Bootstrap estimation for SDM: 0.6963
Bootstrap bias for SDM: −0.0372
Bootstrap standard error for SDM: 0.0438
Bias-corrected percentile for SDM: 0.8134
Skewness correction factor (acceleration): 0.0211
The 95% BCa bootstrap confidence interval for SDM: [0.6853, 0.8162]
Figure 3. Freedman-Diaconis histogram and density curve of the bootstrap SDM distribution.
Gibbs-Poston (1975) Index of Qualitative Variation: IQV = 0.8533
Bootstrap estimation for IQV: 0.8435
Bootstrap bias for IQV: −0.0098
Bootstrap standard error for IQV: 0.0337
Bias-corrected percentile for IQV: 0.2378
Skewness correction factor (acceleration): 0.0212
The 95% BCa bootstrap confidence interval for IQV: [0.7942, 0.9229]
Figure 4. Freedman-Diaconis histogram and density curve of the bootstrap IQV distribution.
Shannon’s (1948) Entropy: E = 1.4268
Shannon’s (1948) Relative Entropy: RelE = 0.7963
Bootstrap estimation for RelE: 0.779
Bootstrap bias for RelE: −0.0173
Bootstrap standard error for RelE: 0.047
Bias-corrected percentile for RelE: 0.3319
Skewness correction factor (acceleration): 0.0303
The 95% BCa bootstrap confidence interval for RelE: [0.7207, 0.903]
Figure 5. Freedman-Diaconis histogram and density curve of the bootstrap RelE distribution.
11. Conclusions
The present article focuses on nominal category qualitative variables but can also be applied to ordered category variables with a discrete set of categories. For these variables, Kvalseth’s SDM is highly recommended as it utilizes more information than the variation ratios and is less affected by the ceiling effect induced by the proximity to the upper bounding distribution (uniformitarian distribution) than IQV and RelE [3] . Although Kvalseth [11] initially restricted the use of SDM to unimodal distributions, this measure is applicable to multimodal distributions and even to a uniform distribution, just like Moral’s UVR, which is the best option among the variation ratios that are very easy to calculate. It should be noted that the most commonly used measure of variability is the Gibbs-Poston IQV [12] [31] , and another important measure is Shannon’s relative entropy [32] . Nevertheless, both measures are significantly influenced by proximity to the uniform distribution [3] .
By utilizing bootstrap, it is feasible to derive the confidence interval and even the standard error. When the distribution is unknown, the preferable option is a non-parametric bootstrap method. Among these, the percentile method and the bias-corrected and accelerated percentile method stand out [26] . Both methods are perfectly valid when the bias and acceleration are minimal. However, when one of these two indices becomes moderate (|bias| ≥ 0.1 or |a| ≥ 0.025), the bias-corrected and accelerated percentile method outperforms [33] . The primary issue with the latter method arises when the bias-corrected percentile (z_{0}) exhibits an extreme value. In such instances, assigning a value of 0 to z_{0} yields a result equivalent to the percentile method, as proposed by Efron [27] .
In each of the two scripts presented, its bootstrap sample is the same for all calculated measurements by using one seed and staying constant when generating the six or four sampling distributions. The selection of the seed (123) is arbitrary and could be any number. However, it is customary to employ fixed numbers or straightforward patterns to facilitate code reproducibility [28] . At the same time, the assessment of asymmetry (acceleration) is conducted using the jackknife method, ensuring complete stability and consistency in the confidence intervals across various measures of variation. It should be noted that these scripts should be applied with random, representative, and sufficiently large samples, with a minimum of 30 data points for the estimation to be deemed valid [26] [27] .
The application of these six measures of variability is feasible with discrete and continuous variables, particularly when their distribution exhibits a defined peak. Nonetheless, it is discouraged due to the underutilization of the information inherent in the data compared to absolute or relative measures based on the average or median of differential scores concerning the arithmetic mean or median [34] . With discrete quantitative variables, the mode is still estimated using the maximum frequency value method, akin to qualitative variables [35] . However, for continuous quantitative variables, it necessitates employing a density estimation method to identify the value with the highest density [36] .
An additional issue with samples of continuous variables is the necessity to discretize the distribution into k class intervals, which further exacerbates the loss of information and disregards the quantitative nature of the sample data [37] . To determine the number of class intervals, a viable option is Freedman-Diaconis rule [29] . Among all measures of variation for quantitative variables, Kvalseth’s SDM and estimation of the mode via the maximum density value would be the optimal choice for continuous quantitative variables with unimodal distribution [36] . It’s worth noting that, from the density function of a continuous distribution, Shannon’s information can be directly calculated using integrals. Additionally, with certain continuous functions, one can ascertain the maximum entropy and obtain the relative entropy through the quotient between the entropy and its maximum [38] .
It is recommended to utilize measures of variability, such as Kvalseth’s SDM [11] and Moral’s UVR [3] , alongside well-known measures like the Gibbs-Poston IQV [12] and Shannon’s relative entropy [13] . Additionally, measures of shape, such as Moral’s skewness and peakedness, [39] should be used in conjunction with frequency tables, bar charts, and the measure of central tendency, the mode, when describing qualitative variables. Statistical tests to check whether one or more point data represent outliers are well known with quantitative variables [40] [41] [42] . However, there are new methods for detecting outliers with qualitative variables, such as k-mode based cluster analysis [43] .
Persisting on the development of descriptive measures for these variables, which are highly relevant in the social and health sciences, is important, as this area remains scarcely addressed in mathematical statistics [4] [8] . What might be the directions of future research in qualitative variation? Apart from confidence intervals, inferential tests for comparing measures of variation are an important future direction. Kvalseth’s work with standard deviation from the mode [11] and a later-introduced measure of the odds measure of qualitative variation [44] are situated in this line. A related approach to qualitative variation measures that has been developed for some time in social research is segregation indices [45] , among which the Hutchens index [46] [47] can be highlighted. Interval estimation and inferential comparisons with these indices may also be interesting lines of development.
Acknowledgements
The author expresses gratitude to the reviewers and editor for their helpful comments.