Area Skewness for Random Samples Drawn from an Unknown or Specified Distribution

Abstract

Singh, Gewali, and Khatiwada proposed a skewness measure for probability distributions called Area Skewness (AS), which has desirable properties but has not been widely applied in practice. One reason for this may be the lack of dissemination and further exploration of this new measure. Additionally, the authors focused mainly on specific probability distributions. One of the advantages of AS is that it considers the entire shape of the distribution, rather than focusing only on moments or the linear distance between central tendency statistics or quantiles. This holistic approach makes AS particularly robust in cases where the distribution deviates from normality or contains outliers. This paper aims to generalize its use to random samples with either known or unknown distributions. The study has three objectives: 1) to develop an R script for point and interval estimation of AS; 2) to provide interpretive norms of normality by examining normality in bootstrap sampling distributions; and 3) to compare asymptotic and bootstrap standard errors. Interval estimation is approached asymptotically and through bootstrap. The script was illustrated using two examples: one with generated data and another with real-world data. Interpretive norms of normality are derived from 40 samples of various sizes, created by inverse transform sampling to follow a standard normal distribution. Bootstrap intervals at three confidence levels (0.9, 0.95, and 0.99) were obtained using the normal method, with two exceptions: the bias-corrected and accelerated percentile method for the 60-data sample and the percentile method for the 600-data sample, as these deviated from normality. Asymptotic 95% confidence intervals are also provided. The asymptotic standard error was larger than the bootstrap one, with the difference decreasing as the sample size increased. The script is concluded to have practical and educational utility for estimating AS, whose asymptotic sampling distribution is normal.


1. Introduction

Singh, Gewali, and Khatiwada [1] developed a measure to evaluate symmetry in a specified continuous probability distribution with a finite mean, called Area Skewness (AS). The AS statistic ranges from −1 to 1, where −1 indicates left-tailed skewness, 1 indicates right-tailed skewness, and 0 indicates symmetry. They highlighted that its advantage lies in being bounded, unlike classical measures such as the standardized distance from the median to the mode (Sk2) [2], the standardized third central moment (√β₁), and the standardized third cumulant (γ1) [3].

The importance of the AS statistic lies in its bounded nature and its ability to provide an intuitive interpretation of skewness. Unlike traditional measures such as Sk2, √β₁, or γ1, which can yield unbounded values, AS offers a standardized range that facilitates comparisons across datasets. Furthermore, AS accounts for the entire distribution shape, rather than focusing solely on specific moments or points like the mean, median, or mode. This holistic approach makes AS particularly robust in cases where the distribution deviates from normality or contains outliers. By addressing limitations in existing measures, AS enhances the accuracy and interpretability of skewness analysis, making it a valuable tool in fields such as psychology, economics, sociology, and other empirical sciences.

Given a continuous probability distribution with density function fX(x) and cumulative distribution function FX(x), and with finite mean or mathematical expectation μ(X), AS is computed as the probability of randomly drawing a value in the interval between the arithmetic mean μ(X) and the median Mdn(X) of the distribution (Equation (1)).

$$AS(X) = P\big(x \in [\,Mdn(X),\, \mu(X)\,]\big) = \int_{Mdn(X)}^{\mu(X)} f_X(x)\,dx = F_X(\mu(X)) - F_X(Mdn(X)) = P(X \le \mu(X)) - P(X \le Mdn(X)) \quad (1)$$

Let us try to understand intuitively what is expressed in this formula. The AS statistic measures the distance between the cumulative probability values (area under the curve) of the mean and the median. In a continuous distribution, the probability of the median is always 0.5. If the distribution is symmetrical, the mean and median coincide. In cases of asymmetry, they differ: if the mean accumulates more probability than the median, the distribution exhibits positive skewness, influenced by the elongation of the right tail; conversely, if it accumulates less probability, the distribution exhibits negative skewness, influenced by the elongation of the left tail. This measure is analogous to Pearson’s approach [2], which uses the linear distance between the median and mode (in terms of X values). However, the AS statistic offers a more holistic approach, reflecting the overall distribution of the data.

The authors considered the normal distribution with a position parameter (μ) of 100 and a scale parameter (σ) of 20, the gamma distribution with a shape parameter (α) of 2 and a scale parameter (β) of 1, and the continuous triangular distribution with threshold parameters a = 0 and c = 1, and a mode parameter b with three values: 0.95 (negative skewness), 0.5 (symmetry), and 0.05 (positive skewness). They obtained bootstrap confidence intervals at 90%, 95%, and 99% for random samples of 25, 50, 75, and 100 data points drawn from these distributions. They calculated not only AS but also Pearson’s (Sk2 and √β₁) and Fisher’s (γ1) measures of skewness. Among all these measures, AS proved to be the most accurate. Additionally, the histograms that graphically represent the bootstrap sampling distribution of AS displayed a bell-shaped and symmetrical profile corresponding to a normal distribution across different sample sizes and distributions. However, the authors did not emphasize this regularity, nor did they argue that the asymptotic sampling distribution of AS is normal, as it is based on the difference of two cumulative probabilities.

Consequently, AS seems to be an excellent measure of asymmetry, but it is not being used in methodological or applied research. It does not appear in any current statistical package, and its formulation for applied research was implied but not developed. One reason for this may be the lack of dissemination and further research on this new measure. In addition, the authors focused on specific probability distributions and did not consider real-world data with unknown distributions.

The purpose of the present study is to extend the application of the skewness measure proposed by Singh et al. [1] to samples of size n drawn from a quantitative variable X with either a specified distribution with a finite mean or an unknown distribution assumed to have a finite mean. Although the original study, using simulated data, provided graphical evidence of a good fit to the normality of the sampling distribution of the AS statistic (or skewness estimator), the authors did not emphasize this, nor did they specifically analyze it, even though the finding can be supported by the central limit theorem. In this study, it is adopted as a hypothesis and tested using simulation methodology.

The present study has three objectives. The first objective is to develop a script for the R program to obtain the point and interval estimation of AS. The latter is proposed using an asymptotic or Wald-type confidence interval [4] [5] and bootstrap confidence intervals computed through three methods: normal, percentile, and bias-corrected and accelerated (BCa) percentile [6]. The R script is illustrated using two examples: one with generated data from a hyperbolic secant distribution, which is symmetrical, and another with real-world data on the weaning age of Mexican infants, which exhibits positive skewness.

The second objective is to provide interpretive norms through bootstrap confidence intervals at 90%, 95%, and 99%, and an asymptotic confidence interval at 95%. Additionally, the bootstrap sampling distributions of AS generated from normally distributed data samples are tested for normality. The expectation is that the null hypothesis of normality holds since AS is based on the difference of the quantile orders of two central tendency statistics, i.e., the difference of two cumulative probabilities [7] [8]. These prescriptive confidence intervals of normality may be useful for interpretation when only the point estimate of AS is available.

The third objective is to compare the asymptotic and bootstrap standard errors. The expectation is that there will be convergence as the sample size increases [9]. If the asymptotic approximation is less efficient, the asymptotic standard error will be larger than the bootstrap one, although they should eventually converge [10].

2. Calculation of the Area Skewness of a Sample with Specified or Unknown Distribution

In case the distribution of the variable X is unknown and X is a continuous or discrete quantitative variable with a wide range and low-frequency values, from which a random sample x of size n has been drawn (Equation (2)), the quantile orders (Equation (3)) corresponding to the arithmetic mean m(x) (Equation (4)) and the sample median mdn(x) (Equation 5) can be calculated using a linear interpolation rule for the calculation of sample quantiles [11].

Equation (2) represents the n sample data in their random order of collection. Equation (3) expresses the sample mean and median as pth quantiles, illustrating that the order p of the quantile corresponds to a probability value that accumulates the statistic. Equation (4) gives the formula for the sample mean, which is calculated as the sum of the n sample data divided by the number of data points, while Equation (5) gives the formula for the median, which is determined by arranging the n data in ascending order. If the sample size is odd, the median is the central value; if it is even, the median is the average of the two central values.

$$x = \{x_i\}_{i=1}^{n} = \{x_1, x_2, \ldots, x_n\} \subseteq X \quad (2)$$

$$q_{p_{\bar{x}}} = m(x) = \bar{x}; \quad \hat{F}_X(m(x)) = P(X \le \bar{x}) = p_{\bar{x}}$$

$$q_{p_{\tilde{x}}} = mdn(x) = \tilde{x}; \quad \hat{F}_X(mdn(x)) = P(X \le mdn(x)) = p_{\tilde{x}} \quad (3)$$

$$m(x) = \frac{\sum_{i=1}^{n} x_i}{n} \quad (4)$$

$$x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$$

$$mdn(x) = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\[6pt] \dfrac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2} & \text{if } n \text{ is even} \end{cases} \quad (5)$$

The linear interpolation rule identified as 6 in R [12], which is the most widely used [13], can be employed. This rule is based on the mean or expected value of the i-th order statistic from a sample of size n drawn randomly from a standard continuous uniform distribution, where i represents the order of the quantile, q(p) = x(i), among the n sample data sorted in ascending order. To obtain this order, we proceed as follows:

1) Given n sample data of variable X, xi (i = 1, 2, ..., n), they are ordered in ascending order and assigned an order number, from 1 to n, regardless of repeated values: x(1) ≤ x(2) ≤ ... ≤ x(n).

2) The values of the sample arithmetic mean, m(x), and sample median, mdn(x), are sought within this sequence of X scores.

3) If the statistic m(x) or mdn(x) coincides with a singular score in the sequence x(i), the sequential order of the score (i) is taken and this order is divided by n + 1, thus obtaining the order of the quantile or cumulative probability up to the point, including it. If this score is not singular, but repeated, the average of orders corresponding to the repeated score is taken and this average order is divided by n + 1. If m(x) or mdn(x) corresponds to the sample minimum, the quantile order is 0; whereas, if it coincides with the sample maximum, the quantile order is 1. See Equation (6).

$$x_{(1)} \le \cdots \le x_{(i)} \le \cdots \le x_{(n)}, \quad 1 \le i \le n$$

$$\bar{x} \text{ or } mdn(x) = q_p; \quad \hat{F}_X(q_p) = p$$

$$p_{x_{(i)}} = \begin{cases} 0 & \bar{x} \text{ or } mdn(x) = x_{(1)} \\[4pt] \dfrac{i}{n+1} & 1 < i < n;\; x_{(1)} < \bar{x} \text{ or } mdn(x) < x_{(n)} \\[4pt] 1 & \bar{x} \text{ or } mdn(x) = x_{(n)} \end{cases} \quad (6)$$

4) If no value is found in the ordered sequence x(i) with which the statistic m(x) or mdn(x) matches, the two scores between which the statistic can be linearly interpolated are sought (Equation (7)). The quotient between the difference of the statistic and the prior score x(⌊i⌋) (numerator) and the difference between the posterior score x(⌊i⌋+1) and the prior score x(⌊i⌋) (denominator) is added to the order of the prior score ⌊i⌋ (Equation (8) for the mean and Equation (10) for the median). This decimal quantity i is then divided by n + 1 to obtain the quantile order, that is, the cumulative probability up to and including the point (Equation (9) for the mean and Equation (11) for the median).

$$x_{(1)} \le \cdots \le x_{(\lfloor i \rfloor)} \le x_{(\lfloor i \rfloor + 1)} \le \cdots \le x_{(n)}, \quad 1 \le \lfloor i \rfloor \le \lfloor i \rfloor + 1 \le n \quad (7)$$

For the mean:

$$\bar{x} = m(x) \ne x_{(\lfloor i \rfloor)}, \quad x_{(\lfloor i \rfloor)} < \bar{x} < x_{(\lfloor i \rfloor + 1)}$$

$$i = \lfloor i \rfloor + \frac{\bar{x} - x_{(\lfloor i \rfloor)}}{x_{(\lfloor i \rfloor + 1)} - x_{(\lfloor i \rfloor)}} \quad (8)$$

$$p_{\bar{x}} = \frac{i}{n+1}; \quad \hat{F}_X(\bar{x}) = P(X \le \bar{x}) = p_{\bar{x}} \quad (9)$$

For the median:

$$mdn(x) \ne x_{(\lfloor i \rfloor)}, \quad x_{(\lfloor i \rfloor)} < mdn(x) < x_{(\lfloor i \rfloor + 1)}$$

$$i = \lfloor i \rfloor + \frac{mdn(x) - x_{(\lfloor i \rfloor)}}{x_{(\lfloor i \rfloor + 1)} - x_{(\lfloor i \rfloor)}} \quad (10)$$

$$p_{\tilde{x}} = \frac{i}{n+1}; \quad \hat{F}_X(mdn(x)) = P(X \le mdn(x)) = p_{\tilde{x}} \quad (11)$$

5) Finally, the difference between the quantile order or cumulative probability corresponding to the arithmetic mean of the sample (minuend) and the quantile order or cumulative probability corresponding to the median of the sample (subtrahend) is calculated. The resulting difference is the area skewness, or the probability that a value of X lies in the area bounded by the mean and the median (Equation (12)).

$$AS = \hat{F}_X(\bar{x}) - \hat{F}_X(mdn(x)) = p_{\bar{x}} - p_{\tilde{x}} \quad (12)$$
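To make steps 1 to 5 concrete, the following minimal R sketch computes AS with rule 6; the helper quantile_order() and the toy sample are illustrative and not part of the script presented later.

# Illustrative sketch of steps 1-5 with rule 6: order i divided by n + 1.
quantile_order <- function(value, x) {
  xs <- sort(x)
  n <- length(xs)
  if (value <= xs[1]) return(0) # statistic equals the sample minimum
  if (value >= xs[n]) return(1) # statistic equals the sample maximum
  if (value %in% xs) {
    i <- mean(which(xs == value)) # average order of a repeated score
  } else {
    lo <- max(which(xs < value)) # prior score for linear interpolation
    i <- lo + (value - xs[lo]) / (xs[lo + 1] - xs[lo])
  }
  i / (n + 1) # rule-6 quantile order
}
x_toy <- c(2.1, 3.5, 3.5, 4.0, 7.2, 9.8) # toy sample
AS_toy <- quantile_order(mean(x_toy), x_toy) - quantile_order(median(x_toy), x_toy)
AS_toy # about 0.117: the mean accumulates more probability than the median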

Another option is to use rule 8 in R [12]. This rule is based on the median of the i-th order statistic in a sample of n data drawn randomly from a standard continuous uniform distribution. The procedure is similar, except that instead of dividing i (the order of the central tendency statistic among the n sample data) by n + 1 (the sample size increased by one unit), 1/3 is first subtracted from the order i and then the difference is divided by n + 1/3 (Equation (13)). This is the procedure for calculating sample quantiles recommended by the Hyndman-Fan simulation study [14] and had previously been suggested by Tukey [15] for exploratory data analysis. It is now considered the best option for samples from variables with an unknown distribution [16]-[18].

$$p_{x_{(i)}} = \begin{cases} 0 & \bar{x} \text{ or } mdn(x) = x_{(1)} \\[4pt] \dfrac{i - 1/3}{n + 1/3} & 1 < i < n;\; x_{(1)} < \bar{x} \text{ or } mdn(x) < x_{(n)} \\[4pt] 1 & \bar{x} \text{ or } mdn(x) = x_{(n)} \end{cases} \quad (13)$$
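Under rule 8, only the final scaling of the sketch above changes; a variant, again for illustration only:

# Rule-8 variant of quantile_order(): same interpolated order i, different scaling.
quantile_order8 <- function(value, x) {
  xs <- sort(x)
  n <- length(xs)
  if (value <= xs[1]) return(0)
  if (value >= xs[n]) return(1)
  i <- if (value %in% xs) mean(which(xs == value)) else {
    lo <- max(which(xs < value))
    lo + (value - xs[lo]) / (xs[lo + 1] - xs[lo])
  }
  (i - 1/3) / (n + 1/3) # Equation (13); rule 6 uses i / (n + 1)
}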

In case the sample has been drawn from a normal distribution, one can use rule 9 in R [12], which provides approximately unbiased estimates for various sample sizes [19]. See Equation (14).

$$p_{x_{(i)}} = \begin{cases} 0 & \bar{x} \text{ or } mdn(x) = x_{(1)} \\[4pt] \dfrac{i - 0.375}{n + 0.25} & 1 < i < n;\; x_{(1)} < \bar{x} \text{ or } mdn(x) < x_{(n)} \\[4pt] 1 & \bar{x} \text{ or } mdn(x) = x_{(n)} \end{cases} \quad (14)$$

Another, simpler procedure is to use the empirical cumulative distribution function (Fn). We count how many values in the sample are less than or equal to the arithmetic mean and how many are less than or equal to the median, then divide both frequencies by the sample size. The difference between the empirical cumulative probabilities of the arithmetic mean (minuend) and the median (subtrahend) yields the area skewness. See Equation (15), where I denotes the indicator function that assigns each sample datum x a value of 1 if the condition expressed as an argument is satisfied and 0 if it is not. The summation gives the number of values that satisfy the condition in the sample of size n.

$$AS = F_n(\bar{x}) - F_n(mdn(x)) = \frac{\sum_{i=1}^{n} I(x_i \le \bar{x})}{n} - \frac{\sum_{i=1}^{n} I(x_i \le mdn(x))}{n} \quad (15)$$
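For illustration only, Equation (15) reduces to one line of R; the caveat in the next paragraph explains why this version is nevertheless discouraged.

# Empirical-CDF version of AS (Equation (15)); see the caveat below.
AS_ecdf <- mean(x <= mean(x)) - mean(x <= median(x))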

However, this procedure alters the sampling distribution of AS by discretizing it, thus deviating significantly from normality. Gaps appear between values in the histogram, with some values occurring too frequently, and the scatter plot shows a step-like profile in the normal Q-Q plot. Consequently, this approach is not recommended and is not used in the development of the script.

If variable X is ordinal or quantitative with few discrete values, this measure does not apply, as it was proposed for variables with a continuous probability distribution and finite mean. However, it could be computed using the latter procedure, which is not addressed in this study. Nonetheless, a script for the possible calculation and analysis of AS with ordinal or discrete variables with few values is included in the appendix, following this procedure for calculating quantile orders.

To interpret AS, we can use the bootstrap confidence intervals provided by Singh et al. [1] for samples of sizes 25, 50, 75, and 100 drawn from normal or symmetric triangular distributions. If the point estimate falls within the interval, symmetry can be inferred. One might wonder about larger samples. For these cases, a more appropriate approach is to obtain the confidence interval. Since the asymptotic sampling distribution of the statistic can be assumed to be normal, a Wald-type confidence interval can be used for large samples [4] [5], and the calculation of this interval is presented below.

It is also possible to use a 90% or 95% bootstrap confidence interval that does not require assumptions about the distribution of the statistic. However, since the bootstrap sampling distribution of AS is normal, the Gaussian method is a good choice. This method provides symmetric intervals and assumes normality in the bootstrap sampling distribution of the statistic [20]. If the distribution is not normal, the BCa percentile method would be the best option [6]. If the (asymptotic or bootstrap) confidence interval includes 0, symmetry cannot be rejected; if it does not, the distribution is judged asymmetric.

3. Asymptotic Standard Error and Confidence Interval of AS

The quantile order of a statistic, such as the mean or median, represents the probability accumulated by that statistic in the distribution of the variable. Recall that a probability can be estimated by a binomial proportion. In this case, it would be the number of values less than or equal to the statistic (mean or median) divided by the sample size (Equation 15). The exact sampling distribution of a proportion is the binomial distribution, which converges to a normal distribution as the number of trials (n) tends to infinity, provided the probability of success p is not extreme (0 or 1) but approaches 0.5 [21]. Thus, an asymptotic standard error (ase) and a Wald-type confidence interval can be derived for the difference of two quantile orders based on the difference of two matched probabilities from the same sample [7] [8]. See Equation 16, where z1-α/2 is the quantile of order 1 − α/2 of a standard normal distribution.

However, instead of using the inverse of the empirical cumulative distribution function (Equation (15)), the inverse of R’s rule 8 is used to calculate the sample quantiles (Equation (13)). The reason for not using the inverse of the empirical cumulative distribution function, which corresponds to rounding rule 1 in R, is to avoid the discretization of the sampling distribution of the statistic that this rule generates. The inverse of rule 8 is preferred because it provides the most appropriate estimate of the quantiles [16]-[18] and yields an asymptotically normal sampling distribution, matching the distributional convergence observed for other skewness measures [22].

$$\bar{p} = \frac{p_{\bar{x}} + p_{\tilde{x}}}{2}, \quad ase = \sqrt{\frac{\bar{p}(1 - \bar{p})}{n}}, \quad \widehat{AS} = p_{\bar{x}} - p_{\tilde{x}}$$

$$P\left( \widehat{AS} - z_{1-\alpha/2} \sqrt{\frac{\bar{p}(1-\bar{p})}{n}} \le AS \le \widehat{AS} + z_{1-\alpha/2} \sqrt{\frac{\bar{p}(1-\bar{p})}{n}} \right) = 1 - \alpha \quad (16)$$
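A minimal R sketch of Equation (16), assuming p_m and p_mdn already hold the rule-8 quantile orders of the mean and the median (as computed by the script in Section 5.1):

# Wald-type confidence interval for AS (Equation (16)).
AS_hat <- p_m - p_mdn # point estimate
p_bar <- (p_m + p_mdn) / 2 # pooled probability
ase <- sqrt(p_bar * (1 - p_bar) / n) # asymptotic standard error
alpha <- 0.05
ci <- AS_hat + c(-1, 1) * qnorm(1 - alpha / 2) * ase # 95% Wald-type CI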

4. Method

For the first objective, a script was developed that begins with the verification of sample randomness using the Wald-Wolfowitz runs test [23] to ensure compliance with this assumption necessary for statistical inference [24]. In the script, the shape of the distribution is described by testing the absence of outlier cases (null hypothesis) using Grubbs’ test [25] and Tukey’s criterion, which flags as outliers any scores lying more than 1.5 times the interquartile range below the first quartile or above the third quartile [15]. The null hypothesis of symmetry is checked by D’Agostino’s test [26], and that of mesokurtosis by the Anscombe-Glynn test [27]. Normality is tested by the Shapiro-Francia test [28] with Royston’s standardization [29] and the D’Agostino-Belanger-D’Agostino test [30].

In addition, the distribution is represented by a box plot [15] [31] and a histogram [2] [32] with overlaid density and normal curves [33]. The number of class intervals is determined by the Freedman-Diaconis rule, which does not require distributional assumptions and establishes the optimal bin width [34]. The density is estimated using Epanechnikov’s kernel function [35], with the bandwidth set by the Sheather-Jones method [36], which minimizes the asymptotic mean integrated square error. Both methods perform best in minimizing error across various distributions [37].

In the script, the inverse of R’s method 8 for calculating sample quantiles [12] is used to determine the quantile orders (p) corresponding to the arithmetic mean, p_m such that q_{p_m} = m(x), and to the median, p_mdn such that q_{p_mdn} = mdn(x), when calculating AS (p_m − p_mdn). This method is chosen as it is the most recommended for calculating sample quantiles [16]-[18].

In the script, the bootstrap confidence interval is calculated using three methods: normal, percentile, and BCa percentile. Symmetry in the bootstrap sampling distribution of AS is checked by D’Agostino’s test [26]. Normality is verified by the Shapiro-Francia test [28] [29], which predicts the empirical quantiles (regressand) from the normal theoretical quantiles (regressor), making it one of the most powerful normality tests [38] and suitable for samples of 5 to 5000 data [29]. For larger samples of more than 5000 data, the D’Agostino-Belanger-D’Agostino normality test [30] is performed. This test uses the sum of squares of the standardized D’Agostino skewness test statistic and the standardized Anscombe-Glynn kurtosis test statistic.

The script also computes the bootstrap estimation bias, bootstrap standard error, and jackknife acceleration [39]. If the bootstrap standard error is large (greater than a quarter of the range of the statistic), then the procedure is inadequate [40]. If there is normality, which is expected, the best method is the Gaussian or normal method. If there is no normality but there is symmetry, and the bias of the bootstrap estimation and jackknife acceleration are small (|bias| ≤ 0.05 and |a| ≤ 0.025, respectively), the percentile method is a good choice. Otherwise, the BCa percentile method is preferred [41]. Finally, the two-tailed bootstrap probability is calculated for the null hypothesis of symmetry measured by area skewness: H0: AS(X) = 0 [42]. The significance level in the tests is set at the conventional level of 0.05, although this can be modified [43].
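The decision rule just described can be sketched as follows; the argument names (p_sf for the Shapiro-Francia p-value, symmetric for the D’Agostino verdict, bias and accel for the bootstrap bias and jackknife acceleration) are hypothetical and not part of the script.

# Illustrative sketch of the rule for choosing the bootstrap interval method.
choose_method <- function(p_sf, symmetric, bias, accel, alpha = 0.05) {
  if (p_sf >= alpha) return("norm") # normality holds: Gaussian method
  if (symmetric && abs(bias) <= 0.05 && abs(accel) <= 0.025) return("perc")
  "bca" # otherwise: bias-corrected and accelerated percentile
}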

The script was demonstrated with two examples. In the first example, it was applied to a random sample generated from a hyperbolic secant distribution. To provide the quantitative data with specific relevance to the social and health sciences, the data were assumed to represent an instrumental variable, following Ding’s work on the occurrence of this exponential family distribution in statistical and applied research [44]. In the second example, the script was applied to real-world data on the weaning age of 200 Mexican infants.

For the second objective of providing interpretive norms regarding the normal distribution, 40 samples were generated by inverse transform sampling to follow a standard normal distribution, with sizes ranging from 20 to 90 in increments of 10, from 100 to 1000 in increments of 50, from 1500 to 5000 in increments of 500, and from 6000 to 10,000 in increments of 1000. From these random samples, 1000 samples were drawn by resampling with replacement, and AS was calculated for each of them, resulting in 40 bootstrap sampling distributions of AS. Bootstrap estimation, bootstrap estimation bias, bootstrap standard error, and jackknife acceleration were computed. Additionally, skewness was checked using the D’Agostino test [26] and kurtosis using the Anscombe-Glynn test [27].
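A minimal sketch of this generation step, assuming the seed shown (the paper does not report the one actually used):

# 40 sample sizes and inverse transform sampling from a standard normal.
set.seed(123) # assumed seed, for reproducibility of the sketch only
sizes <- c(seq(20, 90, by = 10), seq(100, 1000, by = 50),
           seq(1500, 5000, by = 500), seq(6000, 10000, by = 1000))
samples <- lapply(sizes, function(n) qnorm(runif(n))) # u ~ U(0,1), x = qnorm(u)
length(sizes) # 40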

Although normality was tested by the D’Agostino-Belanger-D’Agostino test [30] in the 40 bootstrap sampling distributions of AS, it was handled as a complementary test in the first 35 distributions (from 20 to 5000 data), since primacy was given to the Shapiro-Francia test [28] [29] for its higher power [38]. It was the sole normality test in the last five distributions (from 6000 to 10,000 data), as it is an asymptotic test that is powerful with large samples, and the Shapiro-Francia test cannot be applied [29]. The bootstrap confidence intervals at 90%, 95%, and 99% were computed using the normal method, as the assumption of normality was met in the bootstrap sampling distribution of AS [6], with the exception of two samples, for which the BCa (n = 60) and percentile (n = 600) methods were used. The calculation of the Wald-type confidence interval was based on Equation (16).

For the third objective, the difference between the asymptotic and bootstrap standard errors was calculated, along with the ratio of these two errors, to report the minimum, maximum, and mean of these two random variables, which did not follow a normal distribution. The equivalence of the bootstrap standard error to the asymptotic standard error was tested using the Snedecor-Cochran F-test for the equality of variances between two independent populations [45]. On one hand, we have the asymptotic standard error of the original sample of size n, and on the other hand, we have the bootstrap standard error, obtained from 1000 bootstrap samples. Consequently, the degrees of freedom (df) for the F-test are as follows: df1 = n − 1 and df2 = 999 if the asymptotic standard error is greater than the bootstrap standard error; or the reverse (df1 = 999 and df2 = n − 1) if the asymptotic standard error is less than or equal to the bootstrap standard error.
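A sketch of this comparison, assuming ase and bse hold the asymptotic and bootstrap standard errors for a sample of size n:

# Snedecor-Cochran F-test for the equality of the two squared standard errors.
F_stat <- max(ase, bse)^2 / min(ase, bse)^2
df1 <- if (ase > bse) n - 1 else 999
df2 <- if (ase > bse) 999 else n - 1
p_val <- min(2 * pf(F_stat, df1, df2, lower.tail = FALSE), 1) # two-tailed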

Additionally, regarding the third objective, the linear correlation between the difference in the errors and the sample size was computed to test for a significant inverse correlation, which would indicate that as the sample size increases, the difference between the errors decreases. The bootstrap BCa confidence interval for Pearson’s product-moment correlation coefficient was computed using a 1000-sample simulation.
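A sketch of this correlation analysis with the boot package, assuming vectors err_diff (bse − ase per sample) and sizes of length 40:

# BCa bootstrap confidence interval for Pearson's r (1000 replicates).
library(boot)
r_fun <- function(d, idx) cor(d$err_diff[idx], d$sizes[idx])
b <- boot(data.frame(err_diff = err_diff, sizes = sizes), r_fun, R = 1000)
boot.ci(b, conf = 0.95, type = "bca")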

5. Results

5.1. The R Script for Point and Interval Estimation of Area Skewness

To run the script, six R libraries are required: randtests [46], ggplot2 [47], outliers [48], moments [49], nortest [50], and boot [51]. The histogram and box plot of the variable X, as well as the histogram and normal Q-Q plot of the bootstrap sampling distribution of AS, are saved as high-definition JPEG files. Options highlighted in blue can be modified to adjust the script to the researcher’s needs, such as the vector of scores, significance level, confidence level, and number of decimal places.

# Vector of scores or records as input data (x).

x <- c(0.676775472, 2.463186427, …, 0.559120572)

# Load libraries of R.

library(randtests)

library(outliers)

library(moments)

library(nortest)

library(boot)

library(ggplot2)

# Random test for the original sample of size n.

runse <- runs.test(x, alternative = "two.sided", threshold = median(x), pvalue = 'exact')

runsa <- runs.test(x, alternative = "two.sided", threshold = median(x), pvalue = 'normal')

cat("Wald-Wolfowitz runs test (criterion: median):", "\n")

cat("Sample median: mdn(x) =", median(x), "\n")

cat("Sample size: n =", length(x), "\n")

cat("Number of runs: r =", runse$runs, "\n")

cat("n_0 = #(x_i < mdn(x)) =", runse$parameter["n1"], ", n_1 = #(x_i > mdn(x)) =", runse$parameter["n2"], ", and n = n_0 + n_1 =", runse$parameter["n"], "\n")

cat("Two-tailed exact probability value: p =", round(runse$p.value, 4), "\n")

cat("Mean: M(R|n_0, n_1) =", runse$mu, "and", "Standard deviation: DE(R|n_0, n_1) =", round(sqrt(runse$var), 4), "\n")

cat("Standardized number of runs: z_r =", round(runse$statistic, 4), "\n")

cat("Two-tailed asymptotic probability value: p-value =", round(runsa$p.value, 4), "\n")

# Identification of outliers based on the Tukey criterion [15].

cat("Outliers among n sample data:", "\n")

min <- min(x)

q1 <- quantile(x, probs = 0.25)

q3 <- quantile(x, probs = 0.75)

IQR <- q3 - q1

LL <- q1 - 1.5 * IQR

max <- max(x)

UL <- q3 + 1.5 * IQR

cat("Sample minimum =", min,"\n")

cat("Lower limit for outlier case =", LL,"\n")

cat("Upper limit for outlier case =", UL,"\n")

cat("Sample maximum =", max,"\n")

if (min < LL) {cat("There is at least one outlier in the left tail.\n")

} else {cat("There are no outliers in the left tail.\n")}

if (max > UL) {cat("There is at least one outlier in the right tail.\n")

} else {cat("There are no outliers in the right tail.\n")}

grubbs.test(x, type = 10, opposite = FALSE, two.sided = FALSE)

# Box plot displayed in the console and saved as a JPEG file.

data <- data.frame(x = x)

plot1 <- ggplot(data, aes(x = "", y = x)) +

geom_boxplot(fill = "lightyellow", color = "black", notch = FALSE) +

coord_flip() + labs(y = "X values", x = "") + theme(axis.text.x.bottom = element_text(size = 8), axis.title.x = element_text(size = 10), panel.background = element_rect(fill = "white"), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.line = element_line(color = "black"))

print(plot1)

ggsave(filename = "boxplot.jpeg", plot = plot1, device = "jpeg", dpi = 300, width = 8, height = 6, units = "cm")

# Assessment of symmetry, mesokurtosis, and normality for the vector x.

cat("Symmetry, mesokurtosis, and normality in the distribution from which the original sample was drawn:", "\n")

agostino.test(x)

anscombe.test(x)

sf.test(x)

agostino <- agostino.test(x)

ag <- anscombe.test(x)

K2 <- (agostino$statistic["z"]^2 + ag$statistic["z"]^2)

p <- pchisq(K2, df = 2, lower.tail = FALSE)

cat("D’Agostino-Berlanger-D’Agostino Normality test: x", "\n")

cat("k-squared(x) =", round(K2, 4),",","p_value =", round(p, 4), "\n")

# Histogram with overlaid density and normal curves displayed in the console and saved as a JPEG file.

cat("Number of bins for the histogram according to four rules (k):", "\n")

fd <- ceiling(((max(x) - min(x)) * (length(x))^(1/3)) / (2 * (quantile(x, probs = 0.75, type = 8) - quantile(x, probs = 0.25, type = 8))))

cat("Friedman-Diaconis rule as default option: k =", fd, "\n")

scott <- ceiling(((max(x) - min(x)) * (length(x))^(1/3)) / (3.49 * sd(x)))

cat("Scott’s rule when the variable follows a normal distribution: k =", scott, "\n")

sqrt <- ceiling(sqrt(length(x)))

cat("Square Root rule when the variable follows an arcsine distribution: k =", sqrt, "\n")

rice <- ceiling(2* (length(x))^(1/3))

cat("Rice university rule when the variable follows an arcsine distribution: k =", rice, "\n")

plot2 <- ggplot(data, aes(x = x)) + geom_histogram(aes(y = ..density..), bins = fd, fill = "lightyellow", color = "black") + geom_density(kernel = "epanechnikov", bw = "sj", color = "darkblue", size = 1) + stat_function(fun = dnorm, args = list(mean = mean(x), sd = sd(x)), color = "red", size = 1) + labs(x = "X values", y = "Density") + theme(axis.text.x.bottom = element_text(size = 8), axis.text.y = element_text(size = 8), axis.title.x = element_text(size = 10), axis.title.y = element_text(size = 10), panel.background = element_rect(fill = "white"), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.line = element_line(color = "black"))

print(plot2)

ggsave(filename = "histogram.jpeg", plot = plot2, device = "jpeg", dpi = 300, width = 8, height = 6, units = "cm")

# Function to find the quantile order of the arithmetic mean and median.

find_order <- function(value, x_sorted) {if (value %in% x_sorted) {

return(mean(which(x_sorted == value)))} else {

lower_index <- max(which(x_sorted < value))

upper_index <- min(which(x_sorted > value))

return(lower_index + (value - x_sorted[lower_index]) / (x_sorted[upper_index] - x_sorted[lower_index]))}}

# Function to calculate Area Skewness (AS) statistic.

n <- length(x)

calc_asymmetry <- function(x) {n <- length(x)

x_sorted <- sort(x)

m <- mean(x)

mdn <- median(x)

i_m <- find_order(m, x_sorted)

i_mdn <- find_order(mdn, x_sorted)

p_m <- (i_m - 1/3) / (n + 1/3)

p_mdn <- (i_mdn - 1/3) / (n + 1/3)

return(list(asymmetry = p_m - p_mdn, p_m = p_m, p_mdn = p_mdn))}

result <- calc_asymmetry(x)

p_m <- result$p_m

p_mdn <- result$p_mdn

AS <- result$asymmetry

cat("Point estimate of area skewness:", "\n")

cat("Quantile order or cumulative probability of the arithmetic mean =", round(p_m, 4), "\n")

cat("Quantile order or cumulative probability of the median =", round(p_mdn, 4), "\n")

cat("Area Skewness: AS =", round(AS, 4), "\n")

# Bootstrap sampling distribution of AS statistic (bsd$t).

set.seed(123)

boot_func <- function(data, indices) {sample_data <- data[indices]

return(calc_asymmetry(sample_data)$asymmetry)}

bsd <- boot(data = x, statistic = boot_func, R = 1000)

print(bsd)

cat("Bootstrap point estimate: boot_AS =", mean(bsd$t), "\n")

# Jackknife acceleration.

AS_jack <- numeric(n)

for (i in 1:n) {x_jack <- x[-i]

AS_jack[i] <- calc_asymmetry(x_jack)$asymmetry}

accel <- sum((mean(AS_jack) - AS_jack)^3) / (6 * sum((mean(AS_jack) - AS_jack)^2)^(3/2))

cat("Jackknife acceleration: a =", round(accel, 6), "\n")

# Assessment of symmetry, mesokurtosis, and normality for bsd$t (the bootstrap sampling distribution of the AS statistic).

cat("Symmetry, mesokurtosis, and normality of the bootstrap sampling distribution of the AS statistic:", "\n")

agostino.test(bsd$t, alternative = "two.sided")

anscombe.test(bsd$t, alternative = "two.sided")

agostino <- agostino.test(bsd$t, alternative = "two.sided")

ag <- anscombe.test(bsd$t, alternative = "two.sided")

sf.test(bsd$t)

K2 <- (agostino$statistic["z"]^2 + ag$statistic["z"]^2)

p <- pchisq(K2, df = 2, lower.tail = FALSE)

cat("D’Agostino-Berlanger-D’Agostino Normality test: bsd$t", "\n")

cat("k^2(bsd$t) =", round(K2, 4),",","p_value =", round(p, 4), "\n")

# Histogram and normal Q-Q plot for bsd$t displayed in the console and saved as a JPEG file.

plot(bsd)

# plot(bsd) returns no reusable plot object, so it is called again inside the JPEG device.

jpeg("bootstrap_plot.jpeg", width = 1600, height = 1200, units = "px", res = 300)

par(mar = c(5, 4, 4, 2) + 0.1)

plot(bsd)

dev.off()

# 95% bootstrap confidence intervals for the AS using three methods.

boot.ci(bsd, conf = 0.95, type = c("norm", "bca", "perc"))

# Two-tailed bootstrap test for symmetry assessed using the AS statistic.

p_boot <- min(2 * min(sum(bsd$t <= 0) / length(bsd$t), sum(bsd$t >= 0) / length(bsd$t)), 1)

cat("Two-tailed bootstrap probability value for the null hypothesis of symmetry (H_0: AS = 0): p-value =", round(p_boot, 4), "\n")

# Decision regarding the null hypothesis based on a significance level of 0.05.

if (p_boot < 0.05) {cat("Decision: Reject the null hypothesis of symmetry at the 5% significance level.\n")

} else {cat("Decision: Accept the null hypothesis of symmetry at the 5% significance level.\n")}

5.2. Example 1 of Script Application: Generated Data

5.2.1. Data Generation

For the example, a random sample of 800 data points was generated from a hyperbolic secant distribution with a mean of 1.4 and a standard deviation of π/2 using the inverse transform method (Equation (17)).

$$u \sim U[0,1] \quad \text{and} \quad l \sim HS\left(\mu = 1.4,\; \sigma = \frac{\pi}{2}\right)$$

$$l_i = 1.4 + \ln\left[\tan\left(\frac{\pi}{2}\, u_i\right)\right]; \quad i = 1, 2, \ldots, n = 800 \quad (17)$$

The data can be considered to originate from the following experimental situation. A population-based study was conducted with overweight and obese individuals who had recently been diagnosed with type II diabetes at their outpatient health center. A sample of 800 individuals was randomly selected. Of these, 400 people participated in a motivational program to change health habits, which they entered immediately after providing their informed consent. The program lasted two weeks and was delivered in a small group format, either virtually or in person, by health psychologists. One week after providing consent, all 800 participants were given a personal physical exercise and diet program to follow at home and at work. This program was customized for each individual over one to two weeks and included monthly follow-ups for one year. The follow-ups were conducted through either face-to-face or virtual consultations, overseen by family medicine residents and nurses.

After one year, weight loss was measured at the health centers where the initial measurements were recorded. For data analysis, a consistent estimator of the average causal effect of the motivational program on weight loss was used, denoted by c [52]. The definition of the estimator is shown in Equation (18), where ȳ₁ represents the mean weight loss (in kilograms) in the group that received the motivational program, ȳ₀ represents the mean weight loss in the group that did not receive it, p₁ is the proportion of individuals who followed the physical exercise and diet program in the group that received the motivational program, and p₀ is the proportion of individuals who followed the physical exercise and diet program among those who did not receive the motivational program.

$$c = \frac{\bar{y}_1 - \bar{y}_0}{p_1 - p_0} \quad (18)$$
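A sketch of Equation (18), with hypothetical vectors y (weight loss), z (received the motivational program, 0/1), and d (followed the exercise and diet program, 0/1):

# Consistent (Wald-type) estimator of the average causal effect.
c_hat <- (mean(y[z == 1]) - mean(y[z == 0])) / (mean(d[z == 1]) - mean(d[z == 0]))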

On the other hand, the quantitative variable of weight loss in kilograms (regressand) was predicted using the dichotomous variable of whether or not the physical exercise and diet program was followed (regressor) through simple linear regression with the ordinary least squares method (Equation (19)).

$$\hat{y}_i = a + b x_i; \quad i = 1, 2, \ldots, n \quad (19)$$

Finally, the variable L was created as the natural logarithm of the absolute difference between the consistent estimator (c) and the linear regression model predictions (ŷᵢ). This instrumental variable was denoted as the natural logarithm of the loss attributable to the motivational program per participant (Equation (20)).

$$l_i = \ln\left(\left| c - \hat{y}_i \right|\right); \quad i = 1, 2, \ldots, n \quad (20)$$
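Continuing the sketch, Equations (19) and (20) with the same hypothetical vectors:

# OLS prediction of weight loss from program adherence (Equation (19)).
fit <- lm(y ~ d)
# Instrumental variable L (Equation (20)).
l <- log(abs(c_hat - fitted(fit)))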

The variable L follows a hyperbolic secant distribution with mean ln(η) and standard deviation π/2. Equation (21) shows the determination of η, where Y is the quantitative variable of weight loss in kilograms, X represents the dichotomous variable of following or not following the physical exercise and diet program, σ denotes the standard deviation (of X or Y), and ρXY symbolizes the Pearson product-moment correlation coefficient between X and Y [44].

$$L \sim HS\left(\mu_L = \ln(\eta),\; \sigma_L = \pi/2\right)$$

$$\eta = \frac{\sigma_Y \sqrt{1 - \rho_{XY}^2}}{\sigma_X} \quad (21)$$

The following R script was used to obtain the sample, which is reproducible because a seed was set. The three scores shown in the second line (“Vector of scores or records as input data”) of the estimation script correspond to the first two and the last of the 800 generated scores. To obtain these results, replace the vector definition with the first four lines of the script below. The fifth line, which prints the 800 scores to 9 decimal places, is not necessary for the estimation of AS but can be used if desired. If you choose to print the scores, ensure they are separated by commas, filling in the ellipses in the second line of the estimation script with the actual scores.

# Script generating the 800-data sample, which is used for the example.

set.seed(123)

quantile_hyperbolic_secant <- function(p) {return(1.4 + log(tan(pi / 2 * p)))}

u <- runif(800)

x <- quantile_hyperbolic_secant(u)

print(x)

5.2.2. Script Results with Generated Data

After running the script, the 800-data sample is random according to the Wald-Wolfowitz test (runs = 397, z = −0.283, p-value = 0.7772 > α = 0.05). It presents outliers in the left and right tails by Tukey’s criterion and Grubbs’ test (G = 4.6268, U = 0.9732, p-value = 0.0013 < α = 0.05), as seen in the box plot (Figure 1). Consequently, the profile of the distribution is leptokurtic by the Anscombe-Glynn test (b2 = 5.0691 > 3, z = 6.4831, p-value < 0.001 < α = 0.05) and deviates from normality by the Shapiro-Francia (w = 0.9767, p-value < 0.001 < α = 0.05) and D’Agostino-Belanger-D’Agostino (k2 = 42.4392, p-value < 0.001 < α = 0.05) tests. However, it does have a symmetric contour according to the D’Agostino test (√b₁ = −0.0549, z = −0.6392, p-value = 0.5227), as observed in the histogram with the overlaid density and normal curves (Figure 2). The contours of the histogram and density curve are sharper and have heavier tails than the Gaussian bell. However, like the Gaussian bell, these two profiles are symmetrical.

Figure 1. Box-and-whisker plot.

Figure 2. Histogram (yellow bins) with overlaid density (blue line) and normal (red line) curves.

The quantile order or cumulative probability of the arithmetic mean is 0.5135, and the quantile order of the median is 0.5. Therefore, AS, or the difference between these orders, is 0.0135, which is a small value close to 0.

The difference between the bootstrap estimate (AS = 0.0121) and the original estimate (AS = 0.0135), or bias, is −0.0014. Consequently, the bootstrap estimate has minimal bias. The bootstrap standard error is small (bse = 0.0124); hence, the bootstrap estimates are adequate. The acceleration, or rate of change of the standard deviation of the estimates as scores change in the jackknife sampling distributions, is minimal (jackknife a = −0.0003). The bootstrap sampling distribution of AS is symmetric (√b₁ = −0.0816, z = −1.0595, p-value = 0.2894 > α = 0.05) and conforms to normality (w = 0.9984, p-value = 0.4496 > α = 0.05). The histogram illustrates the bell-shaped profile of the normal distribution, and the points align at 45 degrees in the normal Q-Q plot (Figure 3). Therefore, the Gaussian confidence interval is the most appropriate, 95% CI (−0.0094, 0.0392), although the other two methods are not inadequate either. Zero is included in the confidence interval for all three methods, so the null hypothesis of symmetry (H0: AS = 0) is maintained, corroborated by the two-tailed bootstrap probability: p-value = 0.334, which is greater than α = 0.05. With a normal sampling distribution, the asymptotic confidence interval is valid and also includes 0.

Figure 3. Histogram and normal Q-Q plot of the bootstrap sampling distribution of the AS statistic used to assess asymmetry in the distribution of instrumental variable L.

Wald-Wolfowitz runs test (criterion: median):

Sample median: mdn(x) = 1.3579.

Sample size: n = 800.

Number of runs: r = 397.

n_0 = #(x_i < mdn(x)) = 400, n_1 = #(x_i > mdn(x)) = 400, and n = n_0 + n_1 = 800.

Two-tailed exact probability value: p = 0.8044.

Mean: M(R|n_0, n_1) = 401 and Standard deviation: SD(R|n_0, n_1) = 14.1333.

Standardized number of runs: z_r = -0.283.

Two-tailed asymptotic probability value: p-value = 0.7772.

Outliers among n sample data:

Sample minimum = −5.82114.

Lower limit for outlier case = −2.111538.

Upper limit for outlier case = 4.93838.

Sample maximum = 8.374573.

There is at least one outlier in the left tail.

There is at least one outlier in the right tail.

Grubbs test for one outlier:

G(x) = 4.6268, U = 0.9732, p-value = 0.0013.

Alternative hypothesis: lowest value −5.821139881 is an outlier.

Symmetry, mesokurtosis, and normality in the distribution from which the original sample was drawn:

D’Agostino skewness test:

skew(x) = −0.0549, z = −0.6392, p-value = 0.5227.

Anscombe-Glynn kurtosis test:

kurt(x) = 5.0691, z = 6.4831, p-value = 8.986 × 10^−11.

Shapiro-Francia normality test:

W(x) = 0.9767, p-value = 4.143 × 10^−9.

D’Agostino-Belanger-D’Agostino Normality test:

k-squared(x) = 42.4392, p_value = 0.

Number of bins for the histogram according to four rules (k):

Freedman-Diaconis rule as default option: k = 38.

Scott’s rule when the variable follows a normal distribution: k = 25.

Square Root rule when the variable follows an arcsine distribution: k = 29.

Rice University rule when the variable follows an arcsine distribution: k = 19.

Point estimate of area skewness:

Quantile order or cumulative probability of the arithmetic mean = 0.5135.

Quantile order or cumulative probability of the median = 0.5.

Area Skewness: AS = 0.0135.

Ordinary nonparametric bootstrap

Bootstrap Statistics:

Statistic    original       bias    std. error
t1*            0.0135    −0.0014       0.01242

Bootstrap point estimate: boot_AS = 0.0121.

Jackknife acceleration: a = −0.0003.

Symmetry, mesokurtosis, and normality of the bootstrap sampling distribution of the AS statistic:

D’Agostino skewness test:

skew(bsd$t) = −0.0816, z = −1.0595, p-value = 0.2894.

Anscombe-Glynn kurtosis test:

kurt(bsd$t) = 3.1131, z = 0.8135, p-value = 0.4159.

Shapiro-Francia normality test:

W(bsd$t) = 0.99842, p-value = 0.4496.

D’Agostino-Belanger-D’Agostino Normality test:

k^2(bsd$t) = 1.7844, p_value = 0.4097.

Bootstrap confidence intervals:

Level    Normal               Percentile           BCa
95%      (−0.0094, 0.0392)    (−0.0127, 0.0361)    (−0.0102, 0.0396)

Calculations and Intervals on Original Scale. Based on 1000 bootstrap replicates.

Two-tailed bootstrap probability value for the null hypothesis of symmetry (H_0: AS = 0): p-value = 0.334.

Decision: Accept the null hypothesis of symmetry at the 5% significance level.

5.3. Example 2 of Script Application: Real-World Data

The vector of scores representing the weaning age (in months) reported by 200 Mexican mothers (aged 18-44 years) is presented below. These data are unpublished and were obtained from a graduate thesis.

x <- c(1.22, 7.83, 11.28, 4.74, 4.92, 6.29, 4.12, 2.58, 6.99, 3.71, 4.28, 1.47, 2.78, 2.41, 4.83, 3.08, 1.08, 1.64, 4.86, 4.72, 4.46, 3.34, 3.30, 2.73, 2.10, 1.07, 5.89, 5.15, 5.64, 2.94, 1.91, 3.42, 8.72, 2.27, 4.89, 2.53, 7.92, 3.17, 2.01, 5.67, 3.89, 4.14, 2.51, 1.08, 2, 2.35, 4.65, 2.84, 2.63, 1.83, 3.15, 1.86, 1.84, 1.09, 1.68, 1.50, 5.82, 3.01, 5.39, 4.39, 4.10, 6.31, 9.21, 5.20, 2.39, 4.38, 3.74, 4.47, 3.83, 1.77, 4.37, 1.53, 5.89, 2.69, 7.95, 1.77, 3.35, 1.61, 5.03, 2.52, 6.59, 8.06, 1.10, 3.58, 1.13, 5.93, 3.41, 1.92, 4.63, 2.68, 8.78, 1.25, 3.52, 5.83, 5.81, 1.98, 9.54, 4.51, 4.05, 2.99, 1.45, 4.36, 2.07, 2.61, 1.31, 4.53, 4.13, 1.96, 2.82, 6.79, 4.77, 1.59, 2.72, 1.84, 2.37, 1.56, 6.62, 1.30, 1.29, 3.66, 3, 2.16, 5.77, 2.64, 1.70, 2.98, 2.91, 2.73, 1.42, 3.08, 5.98, 5.74, 1.19, 1.85, 6.21, 5.15, 5.77, 5.02, 2.08, 1.91, 7.09, 3.06, 1.17, 6.94, 6.51, 5.13, 5.77, 7.84, 4.17, 2.11, 1.43, 1.47, 1.26, 4.81, 1.10, 2.39, 2.47, 2.87, 1.87, 2.66, 3.92, 1.38, 9.52, 7.79, 3.02, 1.79, 3.66, 2.93, 4.01, 3.36, 3.47, 1.36, 4.88, 1.27, 5.57, 1.65, 8.44, 1.83, 2.59, 5.25, 5.22, 2.68, 1.18, 4.20, 5.34, 2.14, 6.51, 5.29, 1.34, 3.35, 4.56, 2.11, 2.35, 3.94, 2.42, 6.34, 1.13, 1.65, 5.50, 4.05)

The null hypothesis of randomness in the sequence of sample data on weaning age was not rejected (Wald-Wolfowitz runs test using the median as criterion (mdn(x) = 3.16), number of runs = 100, z(runs) = −0.1418, asymptotic two-tailed p-value = 0.8873). One outlier was identified in the right tail (Grubbs’ test: G = 3.6358, U = 0.9332, two-tailed p-value = 0.0221), corresponding to the maximum value (max(x) = 11.28 months > upper limit for outlier case = 9.5938). The distribution of weaning age was mesokurtic (Anscombe-Glynn kurtosis test: b2 = 3.4474, z(b2) = 1.3697, two-tailed p-value = 0.1708), but positively skewed (D’Agostino skewness test: √b₁(x) = 0.9091, z(√b₁(x)) = 4.7339, two-tailed p-value < 0.0001). Consequently, it deviated from normality as assessed by both the Shapiro-Francia test with Royston’s standardization (w = 0.927, right-tailed p-value < 0.0001) and the D’Agostino-Belanger-D’Agostino test (k2 = 24.2859, p-value < 0.0001). See Figure 4 & Figure 5, where the colors have been modified: “lightyellow” was changed to “wheat”, “darkblue” to “chocolate4”, and “red” to “darkolivegreen”. To view all the color options available in R, use the following command: colors().

Figure 4. Box-and-whisker plot of weaning age.

Figure 5. Histogram (wheat bins) with overlaid density (chocolate line) and normal (dark olive‑green line) curves of weaning age.

The value of Area Skewness (AS) was 0.067. Its confidence interval was computed using the bootstrap method. A thousand random samples were drawn with replacement, and the AS statistic was calculated for each, forming the bootstrap sampling distribution of AS (bsd$t). The bootstrap bias (−0.0024), bootstrap standard error (0.0236), and jackknife acceleration (a = −0.0033) were low. Furthermore, the bootstrap sampling distribution of the AS statistic was symmetric (D’Agostino skewness test: √b₁(bsd$t) = −0.0626, z = −0.8130, two-tailed p-value = 0.4162), mesokurtic (Anscombe-Glynn kurtosis test: b2(bsd$t) = 3.0260, z = 0.2812, two-tailed p-value = 0.7786), and followed a normal distribution (Shapiro-Francia normality test with Royston’s standardization: w = 0.9990, right-tailed p-value = 0.8077; D’Agostino-Belanger-D’Agostino test: k2(bsd$t) = 0.7401, p-value = 0.6907). See Figure 6. The bootstrap confidence interval using the normal method did not include 0 (95% normal CI: 0.0230 to 0.1157), and the null hypothesis of symmetry was rejected at the 5% significance level (two-tailed bootstrap p-value = 0.006). Consequently, the distribution of weaning age among these Mexican women is positively asymmetric, with a long tail to the right, according to the AS statistic and consistent with D’Agostino’s test (Figure 4 & Figure 5).

Figure 6. Histogram and normal Q-Q plot of the bootstrap sampling distribution of the AS statistic used to assess asymmetry in the distribution of weaning age.

5.4. Norms for Samples Drawn from a Normal Distribution

Table 1 contains the point estimate of the AS, as well as the normative bootstrap confidence intervals with confidence levels of 90%, 95%, and 99%. These were calculated by the Gaussian method due to the fit to normality according to the Shapiro-Francia and D’Agostino-Belanger-D’Agostino tests (Table 2), except for the confidence intervals from the bootstrap sampling distributions of the AS generated from samples of 60 and 600 normally distributed data points. In the first case, there was a deviation from normality due to negative skewness (Table 2), so the BCa percentile method was used. In the second case, there was a deviation from normality due to leptokurtosis (Table 2), so the percentile method was used. Additionally, Table 1 contains the asymptotic 95% confidence interval.

Jackknife acceleration was only slightly greater than 0.025 in the bootstrap sampling distribution generated from the sample of 20 normally distributed data points. It ranged from 0.000001 to 0.0266, with an average of 0.0025. The absolute bias ranged from 0.00002 to 0.0244, with a mean of 0.0036. Consequently, the bias was minimal. The bootstrap standard error ranged from a minimum of 0.0030 to a maximum of 0.0734, with an average of 0.0180. One quarter of the potential range (from −1 to 1) corresponds to 0.5, and one eighth to 0.25. Therefore, the bootstrap standard error is small and the estimates are adequate.

Table 1. Original point estimate and bootstrap confidence intervals for area skewness in random samples of 40 different sizes drawn from a standard normal distribution.

Sample size   Original estimate   M      90%                   95%                   99%                   95% ACI
20            0.0272              N      (−0.0692, 0.1724)     (−0.0923, 0.1955)     (−0.1376, 0.2407)     (−0.1918, 0.2463)
30            0.0030              N      (−0.1077, 0.0893)     (−0.1266, 0.1082)     (−0.1635, 0.1450)     (−0.1759, 0.1820)
40            −0.0171             N      (−0.1174, 0.0464)     (−0.1331, 0.0621)     (−0.1637, 0.0927)     (−0.1721, 0.1378)
50            0.0244              N      (−0.0533, 0.0855)     (−0.0666, 0.0988)     (−0.0925, 0.1247)     (−0.1142, 0.1629)
60            0.0075              BCa    (−0.0710, 0.0583)     (−0.0945, 0.0689)     (−0.1283, 0.0891)     (−0.1190, 0.1340)
70            0.0083              N      (−0.0596, 0.0626)     (−0.0713, 0.0743)     (−0.0941, 0.0972)     (−0.1088, 0.1254)
80            0.0135              N      (−0.0433, 0.0674)     (−0.0539, 0.0780)     (−0.0746, 0.0987)     (−0.0960, 0.1231)
90            0.0053              N      (−0.0523, 0.0517)     (−0.0622, 0.0616)     (−0.0817, 0.0811)     (−0.0980, 0.1086)
100           0.0099              N      (−0.0451, 0.0540)     (−0.0546, 0.0635)     (−0.0732, 0.0821)     (−0.0881, 0.1079)
150           0.0241              N      (−0.0090, 0.0704)     (−0.0166, 0.0781)     (−0.0314, 0.0929)     (−0.0559, 0.1041)
200           0.0254              N      (−0.0078, 0.0644)     (−0.0147, 0.0713)     (−0.0282, 0.0848)     (−0.0439, 0.0946)
250           0.0403              N      (0.0100, 0.0776)      (0.0035, 0.0841)      (−0.0091, 0.0967)     (−0.0217, 0.1022)
300           0.0208              N      (−0.0176, 0.0433)     (−0.0234, 0.0491)     (−0.0348, 0.0605)     (−0.0358, 0.0774)
350           0.0197              N      (−0.0146, 0.0433)     (−0.0201, 0.0489)     (−0.0309, 0.0597)     (−0.0327, 0.0720)
400           0.0130              N      (−0.0157, 0.0354)     (−0.0206, 0.0403)     (−0.0302, 0.0499)     (−0.0360, 0.0620)
450           0.0080              N      (−0.0171, 0.0293)     (−0.0216, 0.0337)     (−0.0302, 0.0424)     (−0.0382, 0.0542)
500           0.0050              N      (−0.0199, 0.0248)     (−0.0242, 0.0291)     (−0.0326, 0.0375)     (−0.0388, 0.0488)
550           0.0062              N      (−0.0174, 0.0261)     (−0.0215, 0.0303)     (−0.0297, 0.0384)     (−0.0355, 0.0480)
600           0.0045              Perc   (−0.0151, 0.0272)     (−0.0179, 0.0332)     (−0.0306, 0.0452)     (−0.0355, 0.0445)
650           0.0040              N      (−0.0181, 0.0221)     (−0.0219, 0.0259)     (−0.0295, 0.0334)     (−0.0345, 0.0424)
700           0.0044              N      (−0.0155, 0.0237)     (−0.0192, 0.0274)     (−0.0266, 0.0348)     (−0.0326, 0.0414)
750           0.0007              N      (−0.0192, 0.0182)     (−0.0228, 0.0218)     (−0.0297, 0.0288)     (−0.0350, 0.0365)
800           0.0005              N      (−0.0172, 0.0182)     (−0.0206, 0.0216)     (−0.0273, 0.0283)     (−0.0341, 0.0352)
850           0.0014              N      (−0.0154, 0.0186)     (−0.0187, 0.0219)     (−0.0251, 0.0282)     (−0.0322, 0.0350)
900           0.0002              N      (−0.0172, 0.0158)     (−0.0204, 0.0189)     (−0.0265, 0.0251)     (−0.0325, 0.0329)
950           0.0036              N      (−0.0124, 0.0204)     (−0.0155, 0.0235)     (−0.0216, 0.0297)     (−0.0282, 0.0354)
1000          0.0011              N      (−0.0153, 0.0166)     (−0.0183, 0.0197)     (−0.0243, 0.0257)     (−0.0299, 0.0321)
1500          −0.0015             N      (−0.0148, 0.0123)     (−0.0174, 0.0149)     (−0.0225, 0.0200)     (−0.0268, 0.0238)
2000          0.00004             N      (−0.0113, 0.0122)     (−0.0136, 0.0145)     (−0.0180, 0.0189)     (−0.0219, 0.0220)
2500          0.0050              N      (−0.0048, 0.0168)     (−0.0069, 0.0189)     (−0.0109, 0.0229)     (−0.0146, 0.0246)
3000          0.0057              N      (−0.0029, 0.0160)     (−0.0048, 0.0178)     (−0.0083, 0.0213)     (−0.0122, 0.0236)
3500          0.0073              N      (−0.0006, 0.0174)     (−0.0023, 0.0191)     (−0.0057, 0.0225)     (−0.0092, 0.0239)
4000          0.0041              N      (−0.0033, 0.0134)     (−0.0049, 0.0150)     (−0.0080, 0.0182)     (−0.0114, 0.0196)
4500          0.0026              N      (−0.0051, 0.0110)     (−0.0067, 0.0126)     (−0.0097, 0.0156)     (−0.0120, 0.0172)
5000          0.0023              N      (−0.0050, 0.0101)     (−0.0065, 0.0116)     (−0.0093, 0.0144)     (−0.0116, 0.0161)
6000          0.0014              N      (−0.0050, 0.0085)     (−0.0063, 0.0098)     (−0.0088, 0.0123)     (−0.0113, 0.0140)
7000          0.0023              N      (−0.0036, 0.0087)     (−0.0048, 0.0099)     (−0.0071, 0.0122)     (−0.0094, 0.0140)
8000          0.0020              N      (−0.0034, 0.0078)     (−0.0044, 0.0089)     (−0.0065, 0.0110)     (−0.0089, 0.0130)
9000          0.0026              N      (−0.0021, 0.0082)     (−0.0031, 0.0092)     (−0.0050, 0.0112)     (−0.0077, 0.0129)
10000         0.0036              N      (−0.0010, 0.0088)     (−0.0020, 0.0098)     (−0.0038, 0.0116)     (−0.0062, 0.0134)

Note. Based on 1000 bootstrap replicates. Calculations and intervals on original scale. Original estimate represents the point estimate of AS in the original sample of size n (20, 30, …, 10,000) generated from a standard normal distribution using inverse transform sampling. Method (M) for calculating bootstrap confidence intervals: normal or Gaussian (N), percentile (Perc), and bias-corrected and accelerated (BCa) percentile; the 90%, 95%, and 99% columns are the bootstrap intervals, and the 95% ACI column gives the 95% asymptotic confidence interval for each of the 40 original samples.

5.5. Normality in the Bootstrap Sampling Distribution of AS

Table 2 shows four statistical hypothesis tests: the D’Agostino symmetry test [26], the Anscombe-Glynn kurtosis test [27], and the Shapiro-Francia [28] [29] and D’Agostino-Belanger-D’Agostino [30] tests for normality. At a 5% significance level in a two-tailed test, the null hypothesis of symmetry held in 38 of the 40 bootstrap sampling distributions of AS. It was rejected for distributions generated from samples of 40 and 60 normally distributed data points. At the 1% significance level, the null hypothesis of symmetry would hold for the first distribution but not for the second.

At the 5% significance level in a two-tailed test, the null hypothesis of mesokurtosis held for 35 of the 40 bootstrap sampling distributions of AS. It was rejected due to platykurtosis (β2 < 3) in the distributions generated from samples of 70, 500, and 2000 normally distributed data, and due to leptokurtosis (β2 > 3) in the distributions generated from samples of 600 and 4500 normally distributed data. At a 1% significance level, the null hypothesis of mesokurtosis would only be rejected for the distribution generated from the 500 normally distributed data sample, which departs from normality according to the D’Agostino-Belanger-D’Agostino test [30], but not by the Shapiro-Francia test [28] [29], which is given primacy (Table 2).

At the 5% significance level, the null hypothesis of normality held for 38 of the 40 bootstrap sampling distributions of AS. Normality was tested with the Shapiro-Francia test [28] [29] for the distributions generated from samples of 20 to 5000 normally distributed data points and with the D’Agostino-Belanger-D’Agostino test [30] for those generated from samples of 6000 to 10,000. Normality was rejected due to skewness in the distribution generated from the sample of 60 data points and due to leptokurtosis in the distribution generated from the sample of 600 data points. The results of the Shapiro-Francia test [28] [29] were corroborated by the D’Agostino-Belanger-D’Agostino test [30] for these two distributions. At the 1% significance level, the null hypothesis of normality would be rejected only for the distribution generated from the sample of 60 data points (Table 2).

Table 2. Symmetry, mesokurtosis, and normality tests for the bootstrap sampling distribution of AS.

Sample size | D’Agostino z | p | AG z | p | SF w’ | p | DBD k2 | p | bse − ase | F-statistic
20 | −0.2167 | 0.8285 | 0.2544 | 0.7992 | 0.9983 | 0.3826 | 0.1117 | 0.9457 | −0.0384 | 2.320*
30 | 0.9731 | 0.3305 | −0.1062 | 0.9154 | 0.9980 | 0.2408 | 0.9582 | 0.6193 | −0.0314 | 2.323*
40 | −2.0437 | 0.0410 | 0.6539 | 0.5132 | 0.9973 | 0.0880 | 4.6043 | 0.1000 | −0.0292 | 2.516*
50 | 0.5585 | 0.5765 | 1.8021 | 0.0715 | 0.9974 | 0.1078 | 3.5593 | 0.1687 | −0.0285 | 2.807*
60 | −2.7566 | 0.0058 | −0.3428 | 0.7318 | 0.9953 | 0.0043 | 7.7164 | 0.0211 | −0.0253 | 2.707*
70 | −0.5400 | 0.5892 | −2.3629 | 0.0181 | 0.9982 | 0.3242 | 5.8749 | 0.0530 | −0.0227 | 2.598*
80 | −0.5372 | 0.5912 | −0.4608 | 0.6449 | 0.9980 | 0.2480 | 0.5009 | 0.7785 | −0.0222 | 2.751*
90 | −1.5622 | 0.1183 | −0.5128 | 0.6081 | 0.9986 | 0.5578 | 2.7033 | 0.2588 | −0.0211 | 2.781*
100 | 0.5972 | 0.5504 | 0.5647 | 0.5723 | 0.9981 | 0.2987 | 0.6755 | 0.7134 | −0.0199 | 2.759*
150 | −0.5966 | 0.5508 | 0.0911 | 0.9274 | 0.9990 | 0.7975 | 0.3643 | 0.8335 | −0.0167 | 2.866*
200 | 0.1819 | 0.8557 | −1.1984 | 0.2307 | 0.9989 | 0.7807 | 1.4693 | 0.4797 | −0.0134 | 2.598*
250 | −1.8064 | 0.0709 | −0.1753 | 0.8609 | 0.9982 | 0.3432 | 3.2937 | 0.1927 | −0.0111 | 2.376*
300 | 0.2686 | 0.7882 | −0.2772 | 0.7816 | 0.9990 | 0.8436 | 0.1490 | 0.9282 | −0.0104 | 2.440*
350 | 0.8356 | 0.4034 | 0.7049 | 0.4809 | 0.9979 | 0.2110 | 1.1951 | 0.5501 | −0.0091 | 2.301*
400 | 0.3223 | 0.7472 | 0.7754 | 0.4381 | 0.9978 | 0.1893 | 0.7052 | 0.7029 | −0.0095 | 2.601*
450 | 0.2362 | 0.8133 | −0.3323 | 0.7397 | 0.9995 | 0.9951 | 0.1662 | 0.9203 | −0.0095 | 2.801*
500 | 0.5020 | 0.6157 | 2.8699 | 0.0041 | 0.9977 | 0.1529 | 8.4884 | 0.0143 | −0.0088 | 2.713*
550 | 0.2182 | 0.8272 | 1.4699 | 0.1416 | 0.9976 | 0.1446 | 2.2081 | 0.3315 | −0.0081 | 2.604*
600 | 0.7111 | 0.4770 | 2.4948 | 0.0126 | 0.9961 | 0.0142 | 6.7296 | 0.0346 | −0.0072 | 2.388*
650 | 0.6914 | 0.4893 | −0.0566 | 0.9549 | 0.9989 | 0.7640 | 0.4812 | 0.7861 | −0.0074 | 2.581*
700 | 0.8023 | 0.4224 | −1.2675 | 0.2050 | 0.9989 | 0.7395 | 2.2502 | 0.3246 | −0.0070 | 2.522*
750 | 0.4813 | 0.6303 | −0.2599 | 0.7949 | 0.9989 | 0.7817 | 0.2993 | 0.8610 | −0.0069 | 2.577*
800 | −1.0040 | 0.3154 | 0.7729 | 0.4396 | 0.9986 | 0.5532 | 1.6054 | 0.4481 | −0.0069 | 2.686*
850 | 1.0836 | 0.2786 | 1.0102 | 0.3124 | 0.9983 | 0.3775 | 2.1947 | 0.3338 | −0.0068 | 2.756*
900 | 0.3740 | 0.7084 | −1.5287 | 0.1263 | 0.9988 | 0.6923 | 2.4769 | 0.2898 | −0.0067 | 2.789*
950 | 1.0635 | 0.2876 | 1.5955 | 0.1106 | 0.9977 | 0.1710 | 3.6767 | 0.1591 | −0.0062 | 2.624*
1000 | 0.9280 | 0.3534 | 1.4645 | 0.1431 | 0.9979 | 0.2022 | 3.006 | 0.2225 | −0.0061 | 2.653*
1500 | −0.4093 | 0.6823 | −0.5216 | 0.6020 | 0.9983 | 0.3873 | 0.4395 | 0.8027 | −0.0047 | 2.475*
2000 | −0.3903 | 0.6963 | −2.0182 | 0.0436 | 0.9981 | 0.2823 | 4.2253 | 0.1209 | −0.0040 | 2.420*
2500 | −1.011 | 0.3121 | −0.6014 | 0.5476 | 0.9990 | 0.8095 | 1.3836 | 0.5007 | −0.0034 | 2.296*
3000 | −0.6585 | 0.5102 | −0.0291 | 0.9768 | 0.9995 | 0.9927 | 0.4345 | 0.8047 | −0.0034 | 2.549*
3500 | −1.2224 | 0.2216 | −0.2848 | 0.7758 | 0.9989 | 0.7822 | 1.5754 | 0.4549 | −0.0030 | 2.388*
4000 | 0.3740 | 0.7084 | −1.1335 | 0.2570 | 0.9984 | 0.4120 | 1.4246 | 0.4905 | −0.0028 | 2.399*
4500 | 0.1575 | 0.8748 | 2.0402 | 0.0413 | 0.9974 | 0.1040 | 4.1874 | 0.1232 | −0.0026 | 2.343*
5000 | −1.1361 | 0.2559 | 0.0340 | 0.9729 | 0.9987 | 0.5948 | 1.2919 | 0.5242 | −0.0025 | 2.382*
6000 | −1.7225 | 0.0850 | 1.2889 | 0.1974 | — | — | 4.6287 | 0.0988 | −0.0024 | 2.513*
7000 | 0.7618 | 0.4462 | −1.2176 | 0.2234 | — | — | 2.0627 | 0.3565 | −0.0023 | 2.630*
8000 | 0.1212 | 0.9035 | 1.0603 | 0.2890 | — | — | 1.1390 | 0.5658 | −0.0022 | 2.713*
9000 | −1.3128 | 0.1893 | 0.6348 | 0.5255 | — | — | 2.1263 | 0.3454 | −0.0022 | 2.923*
10000 | 1.2987 | 0.1941 | −1.3994 | 0.1617 | — | — | 3.6448 | 0.1616 | −0.0020 | 2.778*

Note. Based on 1000 bootstrap replicates. Samples generated from a standard normal distribution using inverse transform sampling. D’Agostino skewness test: √b1 = Pearson’s coefficient of skewness based on the standardized third central moment, z = D’Agostino’s standardized test statistic, and p = probability value for the null hypothesis of symmetry in the bootstrap sampling distribution of the AS statistic. Anscombe-Glynn kurtosis test: b2 = Pearson’s coefficient of kurtosis based on the standardized fourth central moment, z = Anscombe-Glynn standardized test statistic, and p = probability value for the null hypothesis of mesokurtosis in the bootstrap sampling distribution of AS. Shapiro-Francia normality test: w’ = Royston’s test statistic, and p = probability value for the null hypothesis of a normal distribution in the bootstrap sampling distribution of AS. D’Agostino-Belanger-D’Agostino (DBD) normality test: k2 = test statistic, and p = probability value for the null hypothesis of normality in the bootstrap sampling distribution of AS. Probability values less than 0.05 (significant) are highlighted in italics. Standard errors: bse = bootstrap and ase = asymptotic. Snedecor-Cochran F-test for the equality of variances between two independent populations: F = max(ase, bse)2/min(ase, bse)2 = test statistic; two-tailed probability value: * p < 0.001.


The histogram and normal Q-Q plot of the bootstrap sampling distribution of AS for the smallest sample (n = 20) are shown in Figure 7. A good fit to normality is indicated by the bell-shaped profile of the histogram and the alignment of the point cloud along the 45-degree line in the normal Q-Q plot, as in the other distributions. Thus, convergence to normality occurs early.
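For replication purposes, the following minimal sketch, which is not the study’s script, shows how the four tests reported in Table 2 could be run in R on one bootstrap sampling distribution of AS. The vector as_boot is a placeholder for the 1000 bootstrap replicates, and the DBD K-squared statistic is assembled here from the two z statistics, consistent with its definition.

```r
# Sketch: the four shape tests of Table 2 applied to one bootstrap
# sampling distribution of AS. `as_boot` is a placeholder for the
# 1000 bootstrap AS replicates of a given sample.
library(moments)  # agostino.test(), anscombe.test()
library(nortest)  # sf.test()

set.seed(123)
as_boot <- rnorm(1000)                 # placeholder replicates

sym  <- agostino.test(as_boot)         # D'Agostino symmetry test (z, p)
kurt <- anscombe.test(as_boot)         # Anscombe-Glynn kurtosis test (z, p)
norm <- sf.test(as_boot)               # Shapiro-Francia test (n <= 5000)

# D'Agostino-Belanger-D'Agostino omnibus K^2: the sum of the squared
# skewness and kurtosis z statistics, chi-squared with 2 df.
k2   <- unname(sym$statistic["z"]^2 + kurt$statistic["z"]^2)
p_k2 <- pchisq(k2, df = 2, lower.tail = FALSE)
c(K2 = k2, p = p_k2)
```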

5.6. Comparison of Asymptotic and Bootstrap Standard Errors

The asymptotic standard error is larger than the bootstrap standard error, although the difference between the two is small: bse − ase ranges from −0.0384 to −0.0020, with an average of −0.0109. In relative terms, however, the asymptotic standard error is 1.5162 to 1.7097 times the bootstrap standard error, with an average ratio of 1.6057 (Table 2). Consequently, the null hypothesis that the bootstrap standard error is equivalent to the asymptotic standard error is rejected in all cases using a two-tailed Snedecor-Cochran F-test at the 5% significance level.
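The following sketch illustrates the F-test from the note to Table 2. The degrees of freedom are an assumption here (B − 1 = 999 for both errors, with B = 1000 bootstrap replicates), as the paper does not state them, and the error values passed in are illustrative only.

```r
# Sketch: Snedecor-Cochran F test comparing the asymptotic (ase) and
# bootstrap (bse) standard errors of AS, F = max(ase, bse)^2 / min(ase, bse)^2.
# Degrees of freedom are an assumption (B - 1 = 999), not taken from the paper.
f_test_se <- function(ase, bse, df1 = 999, df2 = 999) {
  f <- (max(ase, bse) / min(ase, bse))^2
  p <- 2 * pf(f, df1, df2, lower.tail = FALSE)  # two-tailed p-value
  c(F = f, p = min(p, 1))
}
f_test_se(ase = 0.064, bse = 0.040)  # illustrative values only
```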

The correlation between the (negative) difference bse − ase and sample size is positive and significant: r[40] = 0.5723, 95% BCa CI (0.4670, 0.6957), bias = 0.0128, bse = 0.0424. Thus, as the sample size increases, the discrepancy between the asymptotic and bootstrap errors decreases. Converting Pearson’s correlation to Cohen’s d shows that the effect size of sample size on the reduction in the difference is large: Cohen’s d = 1.3959, which is greater than 0.8 (Equation (22)) [53].

Figure 7. Histogram and normal Q-Q plot of the bootstrap sampling distribution of AS in the sample of size 20 generated from a standard normal distribution using inverse transform sampling.


$$d = \frac{2r}{\sqrt{1 - r^{2}}} = \frac{2 \times 0.5723}{\sqrt{1 - 0.5723^{2}}} = 1.3959 > 0.8 \quad (22)$$
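A one-line check of Equation (22) in R:

```r
# Worked check of Equation (22): converting Pearson's r to Cohen's d.
r <- 0.5723
d <- 2 * r / sqrt(1 - r^2)
d  # 1.3959 > 0.8, a large effect by Cohen's benchmarks [53]
```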

6. Discussion

In relation to the first objective of the study, an R script was developed that not only allows point and interval estimation of AS [1] but also facilitates checking certain assumptions and measuring skewness with the Pearson measure based on the standardized third central moment. Interval estimation is approached both asymptotically and through resampling with replacement (bootstrap). The significance level for the statistical tests was set at 5%, although it can be adjusted, for example, to 10% for small samples of 20 to 29 data points [45]-[54]. The script is recommended for use with samples of at least 30 data points [40] and never fewer than 20.

To compute AS, R’s rule 8 for calculating sample quantiles is used, as it shows the best performance across various types of distributions [14]-[18]. Other options include rule 6 (default in SPSS), rule 7 (default in R), and rule 9 (highly recommended for normal distributions) [14]. Using the inverse of the empirical cumulative distribution function (R’s rule 1) is discouraged because it discretizes the distribution, producing a stepped, ladder-like profile, typical of a discrete distribution, in the normal Q-Q plot of the bootstrap sampling distribution of AS. A comparison of the rules is sketched below.
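The sketch below, with an arbitrary illustrative sample, shows how the choice of rule affects a quantile estimate:

```r
# Sketch: the same quartile estimated under sample-quantile rules
# 1, 6, 7, 8 (the rule used for AS), and 9.
set.seed(123)
x <- rexp(50)  # an arbitrary skewed sample for illustration
sapply(c(1, 6, 7, 8, 9), function(type)
  quantile(x, probs = 0.25, type = type, names = FALSE))
```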

In the first example, the script is applied to a large random sample of 800 data points generated by inverse transform sampling from a hyperbolic secant distribution, which is symmetric but leptokurtic. The data are considered to correspond to an instrumental variable as described by Ding [44]. This statistician studied the hyperbolic secant distribution, a member of the exponential family with a quadratic variance function. He emphasized the importance of this distribution, which appears in many widely used statistical models, and illustrated it with three examples: Fisher’s analysis of similarity between twins, Jeffreys’ prior for contingency tables, and invalid instrumental variables, including the variable L used in the first example.
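A minimal sketch of this generation step, assuming the standard hyperbolic secant quantile function Q(p) = (2/π)·log(tan(πp/2)); the seed and sample size mirror the example, but the script’s exact generation code is not reproduced here.

```r
# Sketch: inverse transform sampling from the standard hyperbolic
# secant distribution, whose quantile function is
# Q(p) = (2 / pi) * log(tan(pi * p / 2)).
set.seed(123)
u <- runif(800)
L <- (2 / pi) * log(tan(pi * u / 2))  # symmetric but leptokurtic
# Shape checks: skewness near 0, kurtosis (b2) well above 3.
c(skewness = moments::skewness(L), kurtosis = moments::kurtosis(L))
```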

In this health and social sciences example, the instrumental variable L is the natural logarithm of the loss attributable to the motivational program per participant. The bootstrap sampling distribution of AS fits the normal distribution well, making the asymptotic interval perfectly valid. Both the asymptotic and bootstrap normal confidence intervals, as well as the two-tailed bootstrap probability, correctly detect symmetry, in agreement with D’Agostino’s test [26].

A second example was run with data from an empirical study on the weaning age of 200 Mexican infants; only the color of the graphs was changed, and the script worked without modification. The distribution of this continuous quantitative variable (months with two decimal places, corresponding to days) is unknown. When represented by a histogram and a density curve, it resembles a PERT distribution (a four-parameter beta distribution in which the shape parameter β is greater than the shape parameter α) or a triangular distribution in which the mode b is closer to the minimum a than to the maximum c, with a downward-sloping profile [55]. In this example, with a medium sample size, the bootstrap sampling distribution of the AS statistic was normal, and the asymptotic and bootstrap standard errors were very similar, as were the corresponding confidence intervals. Consistently, the AS statistic and the other inferential and graphical tests showed positive skewness.

Regarding the second objective of providing interpretive norms for AS, the standard normal distribution was chosen as the symmetry model. This distribution is not only symmetric but also mesokurtic, making it central to parametric statistics [56]. The relevance of shape measures lies in their role in assessing the normality of the distribution of sample data [57]. Confidence intervals at the 0.9, 0.95, and 0.99 levels are provided for 40 sample sizes. These 40 samples were generated to follow a standard normal distribution via inverse transform sampling, and 1000 bootstrap replicates were drawn from each, which is the recommended and commonly used number [6] [40] [41]; the overall pipeline is sketched below. These norms are particularly useful when only the point estimate of area skewness in a sample is available. Finally, the asymptotic and bootstrap confidence intervals, along with the bootstrap probability, guide the decision regarding symmetry in the distribution of the sample data.
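The pipeline can be sketched as follows under stated assumptions: as_stat is a simplified stand-in for the script’s AS estimator (twice the cumulative-probability difference between the mean and the median, computed here from the ECDF rather than R’s rule 8), and n = 500 stands in for any of the 40 sample sizes.

```r
# Sketch of the norm-generation pipeline, under stated assumptions.
library(boot)
set.seed(123)  # the seed reported in the study

# Simplified stand-in for the AS estimator: 2 x the difference in
# cumulative probability between mean and median, via the ECDF
# (the script itself uses R's rule 8 instead).
as_stat <- function(x, idx = seq_along(x)) {
  s  <- x[idx]
  Fn <- ecdf(s)
  2 * (Fn(mean(s)) - Fn(median(s)))
}

x  <- qnorm(runif(500))            # inverse transform: N(0, 1) sample
bt <- boot(x, as_stat, R = 1000)   # 1000 bootstrap replicates
boot.ci(bt, conf = c(0.90, 0.95, 0.99), type = c("norm", "perc", "bca"))
```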

With the exception of the bootstrap sampling distributions of AS generated from the samples of 60 and 600 normally distributed data points, the remaining 38 distributions conformed to normality according to the Shapiro-Francia test [28] [29] for sample sizes from 20 to 5000 and the D’Agostino-Belanger-D’Agostino test [30] for sample sizes from 6000 to 10,000. The Shapiro-Francia test was chosen for its superiority in predicting empirical quantiles from theoretical quantiles [38] and its applicability to large samples of up to 5000 data points [29] [47]. The D’Agostino-Belanger-D’Agostino test was used for samples exceeding 5000 data points due to its greater power in such situations [30] [58].

Can the claim that the bootstrap sampling distribution of the AS statistic is generally normal be reconciled with the exceptions observed for the 60- and 600-point samples? The two source samples of sizes 60 and 600 show a good fit to normality according to the Shapiro-Wilk W test [28] [29] and the D’Agostino-Belanger-D’Agostino K-squared test [30]. Both samples are symmetric per the D’Agostino skewness test [26], mesokurtic according to the Anscombe-Glynn kurtosis test [27], and free of outliers based on the Grubbs test [25] or Tukey’s criterion [15], and their histograms with overlaid density curves show the bell-shaped profile characteristic of a normal distribution. In addition, when the seed is changed from 123 to 12, the bootstrap sampling distribution of the AS statistic does achieve a good fit to normality.

It is important to note that changing the seed affects the random selection of bootstrap samples. A seed of 123 may generate bootstrap samples that inadvertently amplify specific patterns or features of the original data, resulting in a poor fit to normality. In contrast, a seed of 12 may yield a more representative resampling, producing a bootstrap distribution that aligns more closely with normality [42]. These exceptions at sample sizes of 60 and 600 suggest that, although the source samples conform to normality, the bootstrap distribution of the AS statistic is influenced by more than the marginal normality of the data. Factors such as the AS statistic’s sensitivity to sample idiosyncrasies and the stochastic nature of the resampling process likely play a significant role [39].

It should be noted that the AS bootstrap sampling distribution conforms to normality not only for samples drawn from normally distributed data but also in the first example provided, where the data follow a hyperbolic secant distribution [44]. Similarly, in the second example concerning weaning age, where the data appear to follow a triangular or four-parameter beta distribution bounded between 0 and 12 months, with the shape parameter β larger than the shape parameter α, the AS bootstrap sampling distribution also exhibits normality [57]. Thus, the sampling distribution of this statistic is normal, as previously observed in Singh et al.’s study [1] with distributions exhibiting positive skewness (gamma and triangular). Consider that a quantile order represents a cumulative probability, that a probability can be estimated by a proportion (whose exact sampling distribution is binomial and whose asymptotic distribution is normal), and that area skewness is based on the difference between two quantile orders, that is, two probabilities; it follows that its asymptotic sampling distribution is normal [7] [8]. Hence, an asymptotic confidence interval can be used.
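A sketch of the resulting Wald-type interval, with a hypothetical point estimate and asymptotic standard error (the script computes both internally):

```r
# Sketch: Wald-type asymptotic interval for AS, AS_hat +/- z * ase.
# `as_hat` and `ase` are hypothetical placeholder values.
wald_ci <- function(as_hat, ase, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  c(lower = as_hat - z * ase, upper = as_hat + z * ase)
}
wald_ci(as_hat = 0.0014, ase = 0.0172)  # illustrative values only
```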

When should this interval be used? The script provides both the fit to normality and the representation of the bootstrap sampling distribution of AS. If normality holds, the Wald-type interval is perfectly valid. If there is a deviation from normality, as in the bootstrap sampling distribution of AS generated from the sample of 60 normally distributed data points, it is advisable to use the bootstrap confidence interval calculated with the BCa percentile method. In cases where normality is absent but symmetry with minimal bias and acceleration is observed, the percentile method is a good choice, as used for the bootstrap sampling distribution of AS generated from the sample of 600 data points. This highlights the practicality of the script; a sketch of this decision rule follows. Finally, because the confidence interval is asymptotic, a large sample is required, especially when the distribution deviates from normality. The distribution of the variable must have finite moments; thus, the method does not apply to the Cauchy distribution, whose bell-shaped histogram is flatter in the center and has much longer tails than the normal curve.
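A sketch of this decision rule, assuming bt is a boot object such as the one in the earlier sketch:

```r
# Sketch: choose the interval type from the shape of the bootstrap
# sampling distribution of AS, as described above.
library(boot)
library(nortest)   # sf.test()
library(moments)   # agostino.test()

choose_ci <- function(bt, level = 0.95, alpha = 0.05) {
  reps <- bt$t[, 1]                           # bootstrap AS replicates
  if (sf.test(reps)$p.value >= alpha) {
    boot.ci(bt, conf = level, type = "norm")  # normal: Wald-type valid
  } else if (agostino.test(reps)$p.value >= alpha) {
    boot.ci(bt, conf = level, type = "perc")  # non-normal but symmetric
  } else {
    boot.ci(bt, conf = level, type = "bca")   # skewed: bias-corrected
  }
}
```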

Regarding the third objective, the asymptotic standard error was, on average, 1.6 times the bootstrap standard error, and the null hypothesis of equality between the two errors was not supported. However, the average difference in arithmetic terms is small, at 0.011; hence, the asymptotic and bootstrap confidence intervals are very similar. Furthermore, a significant relationship was confirmed between increasing sample size and a decreasing difference between the errors [9], with a large effect size [53], indicating that both errors should eventually converge [10]. This result may occur because:

1) The asymptotic standard error rests on theory that assumes a large sample size. For small or moderate sample sizes, this approximation may be less accurate because it relies on limiting distributions, such as the normal, which may not fit the actual distribution of the data well [39].

2) The bootstrap method estimates the standard error empirically from the variability observed in the replicate samples, which is an advantage especially when the distribution is skewed and/or heavy-tailed [40].

As a limitation of the study, symmetric distributions other than the normal, such as the leptokurtic Laplace and logistic distributions or the platykurtic uniform and semicircular distributions, were not considered to provide a broader range of interpretive norms. However, this was deemed unnecessary, as the asymptotic and bootstrap confidence intervals, along with the bootstrap probability, are sufficient for making decisions [42]. Additionally, asymmetric distributions with finite moments that are more asymmetric than those considered in Singh et al.’s study [1], such as the lognormal or log-logistic distributions, were not included in testing the convergence to normality of the sampling distribution of AS. A discretized variant of this measure, based on the inverse of the empirical distribution function (R’s rule 1) for ordinal or discrete variables, could be suggested for further research (see Appendix). However, it is expected to be a very conservative method with respect to the null hypothesis of symmetry.

It is worth noting that applying the AS statistic can be particularly useful for describing the distribution of sample data for a given quantitative variable, especially if it is continuous in nature. In the measurement of attitudes and behaviors, skewness may reflect the adaptive nature of behavior, highlighting the long tail of maladaptive or pathological cases, whereas symmetry may indicate an expressive nature of behavior in a society that is open and tolerant of diversity [59]. Pronounced asymmetry, such as that observed in the Pareto distribution, indicates significant inequality in the distribution, particularly when measuring income [60]. Skewness is also valuable for assessing normality, as symmetry is one of its key conditions [61]. It helps assess symmetry before applying parametric tests for mean comparison (e.g., t-test, ANOVA), particularly when the assumption of normality is not fully met, as these tests are more robust under conditions of symmetry [62].

An advantage of this new measure over the traditional ones introduced by Pearson [2] and Fisher [3] is that it is bounded between −1 and 1 and is based on the cumulative probability difference between the mean and the median, rather than on the standardized linear distance between two measures of central tendency [2] or the standardized third central moment [3], making it useful for non-normal distributions such as the gamma, Pareto, and lognormal [1].

7. Conclusion

The script can be practically and didactically useful for promoting the use and study of a measure of skewness applicable not only to known probability distributions but also to sample data from unknown distributions. The AS statistic is very simple to calculate, and its sampling distribution is normal. Interpretive norms derived from the standard normal distribution are provided for area skewness when only the point estimate is available. However, the R script developed does not require these norms, as it offers both asymptotic and bootstrap confidence intervals and the bootstrap probability to assess symmetry based on AS. Additionally, it includes the analysis of symmetry using the D’Agostino test and provides graphical representations of the empirical distribution through a histogram with overlaid density and normal curves, as well as a box plot. These features collectively offer sufficient tools to evaluate the symmetry of the sample and of the distribution from which it is drawn.

Acknowledgements

The author expresses his gratitude to the reviewers and editors for the suggestions and corrections received for the improvement of the manuscript.

Conflicts of Interest

The author declares no conflicts of interest regarding the publication of this paper.

References

[1] Singh, A.K., Gewali, L.P. and Khatiwada, J. (2019) New Measures of Skewness of a Probability Distribution. Open Journal of Statistics, 9, 601-621.
https://doi.org/10.4236/ojs.2019.95039
[2] Pearson, K. (1895) Contributions to the Mathematical Theory of Evolution. II. Skew Variation in Homogeneous Material. Philosophical Transactions of the Royal Society of London, Series A (Mathematical, Physical and Engineering Sciences), 186, 343-414.
[3] Fisher, R.A. (1930) The Moments of the Distribution for Normal Samples of Measures of Departure from Normality. Proceedings of the Royal Society of London, Series A Mathematical and Physical Sciences, 130, 16-28.
[4] Wald, A. (1939) Contributions to the Theory of Statistical Estimation and Testing Hypotheses. The Annals of Mathematical Statistics, 10, 299-326.
https://doi.org/10.1214/aoms/1177732144
[5] Zepeda-Tello, R., Schomaker, M., Maringe, C., Smith, M.J., Belot, A., Rachet, B., Schnitzer, M.E. and Luque-Fernandez, M.A. (2022) The Delta-Method and Influence Function in Medical Statistics: A Reproducible Tutorial.
[6] Efron, B. and Narasimhan, B. (2020) The Automatic Construction of Bootstrap Confidence Intervals. Journal of Computational and Graphical Statistics, 29, 608-619.
https://doi.org/10.1080/10618600.2020.1714633
[7] Wang, N. (2023) Conducting Meta-Analyses of Proportions in R. Journal of Behavioral Data Science, 3, 64-126.
https://doi.org/10.35566/jbds/v3n2/wang
[8] Waudby-Smith, I., Arbour, D., Sinha, R., Kennedy, E.H. and Ramdas, A. (2021) Time-Uniform Central Limit Theory and Asymptotic Confidence Sequences.
[9] Hahn, J. and Liao, Z. (2021) Bootstrap Standard Error Estimates and Inference. Econometrica, 89, 1963-1977.
https://doi.org/10.3982/ecta17912
[10] Chu, B.M., Jacho-Chávez, D.T. and Linton, O.B. (2020) Standard Errors for Nonparametric Regression. Econometric Reviews, 39, 674-690.
https://doi.org/10.1080/07474938.2020.1772563
[11] Stapor, K. (2020) Descriptive and Inferential Statistics. In: Intelligent Systems Reference Library, Springer, 63-131.
https://doi.org/10.1007/978-3-030-45799-0_2
[12] R Core Team and Contributors Worldwide (2024) R Documentation. Sample Quantiles.
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html
[13] McGrath, S., Zhao, X., Steele, R., Thombs, B.D., Benedetti, A., Levis, B., et al. (2020) Estimating the Sample Mean and Standard Deviation from Commonly Reported Quantiles in Meta-Analysis. Statistical Methods in Medical Research, 29, 2520-2537.
https://doi.org/10.1177/0962280219889080
[14] Hyndman, R.J. and Fan, Y. (1996) Sample Quantiles in Statistical Packages. The American Statistician, 50, 361-365.
https://doi.org/10.1080/00031305.1996.10473566
[15] Tukey, J.W. (1977) Exploratory Data Analysis. Addison-Wesley.
[16] Bruce, P., Bruce, A. and Gedeck, P. (2020) Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python. 2nd Edition, O’Reilly Media.
[17] Linden, A. (2023) CENTILE2: Stata Module to Enhance Centile Command and Provide Additional Definitions for Computing Sample Quantiles. Statistical Software Components. S459262. Boston College Department of Economics.
[18] Sukhoplyuev, D.I. and Nazarov, A.N. (2024) Methods of Descriptive Statistics in Telemetry Tasks. 2024 Systems of Signals Generating and Processing in the Field of on Board Communications, Moscow, 12-14 March 2024, 1-5.
https://doi.org/10.1109/ieeeconf60226.2024.10496798
[19] Ramachandran, K.M. and Tsokos, C.P. (2020) Mathematical Statistics with Applications in R. Academic Press.
[20] Chihara, L.M. and Hesterberg, T.C. (2022) Mathematical Statistics with Resampling and R. 3rd Edition, Wiley.
[21] Schwarzer, G. and Rücker, G. (2021) Meta-Analysis of Proportions. In: Methods in Molecular Biology, Springer, 159-172.
https://doi.org/10.1007/978-1-0716-1566-9_10
[22] Eberl, A. and Klar, B. (2020) Asymptotic Distributions and Performance of Empirical Skewness Measures. Computational Statistics & Data Analysis, 146, Article 106939.
https://doi.org/10.1016/j.csda.2020.106939
[23] Wald, A. and Wolfowitz, J. (1943) An Exact Test for Randomness in the Non-Parametric Case Based on Serial Correlation. The Annals of Mathematical Statistics, 14, 378-388.
https://doi.org/10.1214/aoms/1177731358
[24] Coker, B., Rudin, C. and King, G. (2021) A Theory of Statistical Inference for Ensuring the Robustness of Scientific Results. Management Science, 67, 6174-6197.
https://doi.org/10.1287/mnsc.2020.3818
[25] Grubbs, F.E. (1969) Procedures for Detecting Outlying Observations in Samples. Technometrics, 11, 1-21.
https://doi.org/10.1080/00401706.1969.10490657
[26] D’Agostino, R.B. (1970) Transformation to Normality of the Null Distribution of G1. Biometrika, 57, 679-681.
https://doi.org/10.1093/biomet/57.3.679
[27] Anscombe, F.J. and Glynn, W.J. (1983) Distribution of the Kurtosis Statistic B2 for Normal Samples. Biometrika, 70, 227-234.
https://doi.org/10.1093/biomet/70.1.227
[28] Shapiro, S.S. and Francia, R.S. (1972) An Approximate Analysis of Variance Test for Normality. Journal of the American Statistical Association, 67, 215-216.
https://doi.org/10.1080/01621459.1972.10481232
[29] Royston, P. (1993) A Toolkit for Testing for Non-Normality in Complete and Censored Samples. The Statistician, 42, 37-43.
https://doi.org/10.2307/2348109
[30] D’Agostino, R.B., Belanger, A. and D’Agostino, R.B. (1990) A Suggestion for Using Powerful and Informative Tests of Normality. The American Statistician, 44, 316-321.
https://doi.org/10.2307/2684359
[31] Hu, K. (2020) Become Competent within One Day in Generating Boxplots and Violin Plots for a Novice without Prior R Experience. Methods and Protocols, 3, Article 64.
https://doi.org/10.3390/mps3040064
[32] José Moral, D.L.R. (2024) Determination of the Number and Width of Class Intervals Using R. Annals of Environmental Science and Toxicology, 8, 22-42.
https://doi.org/10.17352/aest.000077
[33] Scott, D.W. (2015) Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley.
https://doi.org/10.1002/9781118575574
[34] Freedman, D. and Diaconis, P. (1981) On the Histogram as a Density Estimator: L2 Theory. Probability Theory and Related Fields, 57, 453-476.
https://doi.org/10.1007/BF01025868
[35] Epanechnikov, V.A. (1969) Non-Parametric Estimation of a Multivariate Probability Density. Theory of Probability & Its Applications, 14, 153-158.
https://doi.org/10.1137/1114019
[36] Sheather, S.J. and Jones, M.C. (1991) A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation. Journal of the Royal Statistical Society Series B: Statistical Methodology, 53, 683-690.
https://doi.org/10.1111/j.2517-6161.1991.tb01857.x
[37] Kvam, P., Vidakovic, B. and Kim, S.J. (2022) Density Estimation. In: Nonparametric Statistics with Applications to Science and Engineering with R, Wiley, 223-234.
https://doi.org/10.1002/9781119268178.ch11
[38] Hernández, H. (2021) Testing for Normality: What is the Best Method. ForsChem Research Reports, 6, 1-38.
https://www.forschem.org/
[39] Efron, B. and Tibshirani, R.J. (1993) An Introduction to the Bootstrap. Chapman & Hall/CRC Press.
https://doi.org/10.1007/978-1-4899-4541-9
[40] Efron, B. (2022) Exponential Families in Theory and Practice. Cambridge University Press.
https://doi.org/10.1017/9781108773157
[41] Efron, B. and Narasimhan, B. (2022) Package ‘Bcaboot’. Bias Corrected Bootstrap Confidence Intervals.
https://cran.r-project.org/web/packages/bcaboot/bcaboot.pdf
[42] Rousselet, G., Pernet, C.R. and Wilcox, R.R. (2023) An Introduction to the Bootstrap: A Versatile Method to Make Inferences by Using Data-Driven Simulations. Meta-Psychology, 7, 1-24.
https://doi.org/10.15626/mp.2019.2058
[43] Di Leo, G. and Sardanelli, F. (2020) Statistical Significance: P Value, 0.05 Threshold, and Applications to Radiomics—Reasons for a Conservative Approach. European Radiology Experimental, 4, Article 18.
https://doi.org/10.1186/s41747-020-0145-y
[44] Ding, P. (2014) Three Occurrences of the Hyperbolic-Secant Distribution. The American Statistician, 68, 32-35.
https://doi.org/10.1080/00031305.2013.867902
[45] Snedecor, G.W. and Cochran, W.G. (1989) Statistical Methods. 8th Edition, Iowa State University Press.
[46] Caeiro, F. and Mateus, A. (2024) “Randtests”: Testing Randomness in R.
https://doi.org/10.32614/CRAN.package.randtests
[47] Wickham, H., Chang, W., Henry, L., Pedersen, T.L., Takahashi, K., Wilke, C., Woo, K., Yutani, H., Dunnington, D. and van den Brand, T. (2024) “Ggplot2”: Create Elegant Data Visualisations Using the Grammar of Graphics.
https://doi.org/10.32614/CRAN.package.ggplot2
[48] Komsta, L. (2022) “Outliers”: Tests for Outliers.
https://doi.org/10.32614/CRAN.package.outliers
[49] Komsta, L. and Novomestky, F. (2022) “Moments”: Moments, Cumulants, Skewness, Kurtosis and Related Tests.
https://doi.org/10.32614/CRAN.package.moments
[50] Gross, J. and Ligges, U. (2022) Package ‘Nortest’.
https://cran.r-project.org/web/packages/nortest/nortest.pdf
[51] Canty, A., Ripley, B. and Brazzale, A.R. (2024) Package ‘Boot’.
https://cran.r-project.org/web/packages/boot/boot.pdf
[52] Angrist, J.D., Imbens, G.W. and Rubin, D.B. (1996) Identification of Causal Effects Using Instrumental Variables. Journal of the American Statistical Association, 91, 444-455.
https://doi.org/10.2307/2291629
[53] Cohen, J. (1988) Statistical Power Analysis for the Behavioral Sciences. 2nd Edition, Erlbaum.
[54] Dul, J., van der Laan, E. and Kuik, R. (2018) A Statistical Significance Test for Necessary Condition Analysis. Organizational Research Methods, 23, 385-395.
https://doi.org/10.1177/1094428118795272
[55] Khan, I.A., Bickel, J.E. and Hammond, R.K. (2023) Determining the Accuracy of the Triangular and PERT Distributions. Decision Analysis, 20, 151-165.
https://doi.org/10.1287/deca.2022.0464
[56] Khakifirooz, M., Tercero-Gómez, V.G. and Woodall, W.H. (2021) The Role of the Normal Distribution in Statistical Process Monitoring. Quality Engineering, 33, 497-510.
https://doi.org/10.1080/08982112.2021.1909731
[57] Habibzadeh, F. (2024) Data Distribution: Normal or Abnormal? Journal of Korean Medical Science, 39, e35.
https://doi.org/10.3346/jkms.2024.39.e35
[58] Demir, S. (2022) Comparison of Normality Tests in Terms of Sample Sizes under Different Skewness and Kurtosis Coefficients. International Journal of Assessment Tools in Education, 9, 397-409.
https://doi.org/10.21449/ijate.1101295
[59] Moral de la Rubia, J. (2020) Propiedades métricas de la Escala de Autoritarismo de Ala Derecha en estudiantes de medicina mexicanos. Revista de Psicología y Ciencias del Comportamiento de la Unidad Académica de Ciencias Jurídicas y Sociales, 11, 52-76.
https://doi.org/10.29059/rpcc.20200617-103
[60] Charpentier, A. and Flachaire, E. (2022) Pareto Models for Top Incomes and Wealth. The Journal of Economic Inequality, 20, 1-25.
https://doi.org/10.1007/s10888-021-09514-6
[61] Vaclavik, M., Sikorova, Z. and Barot, T. (2020) Skewness in Applied Analysis of Normality. In: Advances in Intelligent Systems and Computing, Springer, 927-937.
https://doi.org/10.1007/978-3-030-63319-6_86
[62] Orcan, F. (2020) Parametric or Non-Parametric: Skewness to Test Normality for Mean Comparison. International Journal of Assessment Tools in Education, 7, 255-265.
https://doi.org/10.21449/ijate.656077

Copyright © 2025 by authors and Scientific Research Publishing Inc.


This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.