1. Introduction
Many common statistical inference methods rely on the approximate normality of the sample mean, which the Central Limit Theorem (CLT) provides for a sufficiently large sample size n. A rule of thumb says that the CLT can be used for n > 30 [1] [2] [3] [4]. Singh, Lucas, Dalpatadu, & Murphy [5] showed that this rule of thumb may be inaccurate for highly skewed distributions. Veluchamy [6] developed a bootstrap-based graphical approach for verifying the normality of the sample mean.
Skewness plays an important role in statistical analyses in almost all disciplines, and especially in finance. Johnson, Sen and Balyeat [7] applied a skewness adjusted binomial model to futures options pricing and derived the asymptotic skewness model properties. Their results showed that the futures options price, in the presence of skewness, depends not only on mean and standard deviation (sd), but other parameters as well. Kun [8] investigated daily time series of four Shanghai Stock market indices and found inclusion of skewness in models to yield higher investor utility. Chateau [9] investigated the effects of skewness and kurtosis by starting with the Black’s normal model for the European put values, replacing the Gaussian distribution by the Gram-Charlier and the Johnson distribution, and showed that both skewness and kurtosis have significant impact on the model results. The effects of skewness on stochastic frontier models are discussed in [10].
Several measures of skewness are available in the statistical literature [11], but most of these are based on sample moments or quantiles, and as such are adversely affected by the presence of a few outliers. Robust skewness measures such as the medcouple have been proposed and investigated in the literature [12]; the medcouple measures of skewness are functions of sample quantiles and order statistics. A comparison of skewness and kurtosis measures is provided in [13]; a comparison of the standard t-test and a modified t-test for skewed distributions is available in [14].
Skewness of a probability distribution refers to the departure of the distribution from symmetry. A symmetric distribution has no skewness, a distribution with a longer tail on the left is negatively skewed, and a distribution with a longer tail on the right is positively skewed [15].
There are mainly three types of skewness measures available in the literature: Fisher-Pearson skewness, adjusted Fisher-Pearson skewness, and Pearson Type 2 skewness. Fisher-Pearson skewness measures are functions of the second and third central sample moments:
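$$m_2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad m_3 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3 \tag{1}$$
where $\bar{x}$ is the sample mean of the observations $x_1, \ldots, x_n$.
The formulas for calculating Fisher-Pearson sample skewness used by popular statistical software packages [16] are shown below; the statistical software environment R [17] can be used to compute all of the three types.
Fisher-Pearson Skewness (Type 1):
$$g_1 = \frac{m_3}{m_2^{3/2}} \tag{2}$$
Adjusted Fisher-Pearson Skewness (Type 2):
$$G_1 = \frac{\sqrt{n(n-1)}}{n-2}\, g_1 \tag{3}$$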
Pearson Type 2 skewness is a simple measure that is calculated from the sample mean $\bar{x}$, the sample standard deviation $s$, and the sample median $m$:
$$SK_2 = \frac{3(\bar{x} - m)}{s} \tag{4}$$
Hotelling and Solomon [18] have shown that the Pearson Type 2 skewness $SK_2$ satisfies $-3 \le SK_2 \le 3$; a close look at the proof shows that the “proof” is actually an intuitive argument for the population value of the Pearson Type 2 skewness, and not for the sample estimate, and hence the sample estimate may fall outside the range [−3, +3]. In this article, alternative measures of skewness are proposed that are based on nonparametric density estimates, and are compared to some of the commonly used skewness measures. A computational geometric measure of skewness is also introduced.
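The moment-based and median-based measures described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, assuming the usual definitions g1 = m3 / m2^(3/2), G1 = g1 · sqrt(n(n − 1)) / (n − 2), and 3(mean − median)/s with s computed using divisor n − 1; the function names and the sample data are hypothetical and not taken from the article.

```python
import math
import statistics


def central_moment(x, k):
    """k-th central sample moment, using divisor n."""
    mean = statistics.fmean(x)
    return sum((xi - mean) ** k for xi in x) / len(x)


def skewness_type1(x):
    """Fisher-Pearson skewness: g1 = m3 / m2**1.5."""
    return central_moment(x, 3) / central_moment(x, 2) ** 1.5


def skewness_type2(x):
    """Adjusted Fisher-Pearson skewness: G1 = g1 * sqrt(n(n-1)) / (n-2)."""
    n = len(x)
    return skewness_type1(x) * math.sqrt(n * (n - 1)) / (n - 2)


def pearson_type2(x):
    """Pearson Type 2 skewness: 3 * (mean - median) / s (s uses divisor n-1)."""
    return 3 * (statistics.fmean(x) - statistics.median(x)) / statistics.stdev(x)


sample = [1, 2, 2, 3, 3, 3, 4, 10]  # hypothetical right-skewed data
print(skewness_type1(sample), skewness_type2(sample), pearson_type2(sample))
```

All three functions return 0 for a perfectly symmetric sample and positive values for the right-skewed sample above; a single large observation moves the moment-based values much more than the median-based one.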
2. Proposed Measure of Skewness
Many introductory statistics textbooks include a rule of thumb regarding the relative positions of the mean, the median, and the mode: for a positively skewed distribution, mean > median > mode, and for a negatively skewed distribution, mean < median < mode [19] [20] [21]. It was pointed out by von Hippel [22] that many violations of this rule exist, especially in the case of discrete probability distributions (see Figure 1(b), Figure 1(c)).
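The violation for the binomial cases with n = 7 can be checked numerically. The sketch below is illustrative only: it assumes the median is taken as the smallest value whose cumulative probability reaches 0.5, uses the standard binomial skewness formula (1 − 2p)/sqrt(np(1 − p)), and the function name is hypothetical.

```python
import math


def binomial_summary(n, p):
    """Return (mean, median, skewness) of the binomial distribution BIN(n, p)."""
    pmf = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = n * p
    # Median taken as the smallest k whose cumulative probability reaches 0.5.
    cumulative, median = 0.0, n
    for k, prob in enumerate(pmf):
        cumulative += prob
        if cumulative >= 0.5:
            median = k
            break
    skewness = (1 - 2 * p) / math.sqrt(n * p * (1 - p))
    return mean, median, skewness


for p in (0.25, 0.75):
    mean, median, skewness = binomial_summary(7, p)
    print(f"p={p}: mean={mean}, median={median}, skewness={skewness:.3f}")
```

For p = 0.25 the skewness is positive although the mean (1.75) lies below the median (2), and for p = 0.75 the skewness is negative although the mean (5.25) lies above the median (5), so the textbook ordering fails in both cases.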
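Letting f (x) and F (x) denote the population probability density and cumulative distribution functions of the random variable, with mean μ and median Q2, the proposed skewness measure, area skewness (AS), is defined as the signed area under f (x) between the median Q2 and the mean μ (Figure 2):
$$AS = \int_{Q_2}^{\mu} f(x)\,dx = F(\mu) - F(Q_2),$$
so that AS is positive when μ > Q2 and negative when μ < Q2.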
Figure 1. Plots of the binomial distribution with (a) BIN, n = 7 and p = 0.5; (b) BIN, n = 7 and p = 0.25 and (c) BIN, n = 7 and p = 0.75.
Figure 2. Examples showing area skewness computations.
Area skewness, the probability that the random variable falls between the true mean μ and the median Q2, can be estimated from a sample in two steps:
Step 1: The probability density is estimated from the sample; in this article, a nonparametric density estimate [23] [24] is used, but a parametric density estimate can also be used.
Step 2: A numerical integration method is then used to compute the area under the estimated density between the sample median and the sample mean; the trapezoid rule is used in this article for computing area skewness.
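The two steps can be sketched as follows. This is not the authors' implementation: it assumes a Gaussian kernel density estimate with the normal-reference bandwidth h = 1.06 · s · n^(−1/5) (the article does not state its kernel or bandwidth), and the function names and grid size are hypothetical.

```python
import math
import random
import statistics


def gaussian_kde(sample, bandwidth=None):
    """Step 1: return a Gaussian kernel density estimate as a callable."""
    n = len(sample)
    if bandwidth is None:
        # Normal-reference rule of thumb; an assumed choice, not from the article.
        bandwidth = 1.06 * statistics.stdev(sample) * n ** (-0.2)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample) / norm

    return density


def area_skewness(sample, grid_points=200):
    """Step 2: trapezoid-rule area under the density from the median to the mean."""
    density = gaussian_kde(sample)
    mean = statistics.fmean(sample)
    median = statistics.median(sample)
    step = (mean - median) / grid_points  # negative step when mean < median
    ys = [density(median + i * step) for i in range(grid_points + 1)]
    return step * (sum(ys) - 0.5 * (ys[0] + ys[-1]))


rng = random.Random(1)
demo = [rng.lognormvariate(5, 1) for _ in range(200)]  # simulated, right-skewed
print(area_skewness(demo))
```

Because the integration runs from the median to the mean, the result carries a sign: it is positive when the sample mean exceeds the sample median and negative otherwise.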
Figure 2 shows two simulated examples of area skewness computation. Data for the first example (top graph) are simulated from a normal distribution with mean μ = 100 and standard deviation σ = 10; the true area skewness, in this case, equals 0, and the area skewness computed for the sample is −0.004. The second example in Figure 2 (bottom graph) is generated from the log-normal (LN) distribution, which is defined as follows: Y is LN with parameters μ and σ if log(Y) is normally distributed with mean μ and standard deviation σ; here the log function is the natural log, i.e., the base is e. The LN (μ, σ) distribution has population mean, standard deviation, and skewness given by [25]:
$$E(Y) = e^{\mu + \sigma^2/2}, \qquad SD(Y) = \sqrt{\left(e^{\sigma^2} - 1\right) e^{2\mu + \sigma^2}}, \qquad \text{skewness}(Y) = \left(e^{\sigma^2} + 2\right)\sqrt{e^{\sigma^2} - 1}.$$
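True population mean, median and area skewness for the LN (μ = 5, σ = 1) distribution are:
$$E(Y) = e^{5.5} \approx 244.69, \qquad Q_2 = e^{5} \approx 148.41, \qquad \text{area skewness} = F(E(Y)) - F(Q_2) = \Phi(0.5) - 0.5 \approx 0.1915,$$
where Φ denotes the standard normal cumulative distribution function.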
The sample area skewness value for the generated sample is 0.2047, and the standard skewness estimate is 4.3192.
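The population quantities for LN (μ = 5, σ = 1) can be reproduced with the standard log-normal formulas. The sketch below assumes that area skewness is read as F(mean) − F(median), which for the log-normal reduces to Φ(σ/2) − 0.5; the variable names are illustrative.

```python
import math
from statistics import NormalDist

mu, sigma = 5.0, 1.0

ln_mean = math.exp(mu + sigma**2 / 2)
ln_median = math.exp(mu)
ln_sd = math.sqrt((math.exp(sigma**2) - 1) * math.exp(2 * mu + sigma**2))
ln_skewness = (math.exp(sigma**2) + 2) * math.sqrt(math.exp(sigma**2) - 1)

# F(y) = Phi((log(y) - mu) / sigma) for the log-normal, and F(median) = 0.5.
area_skewness = NormalDist().cdf((math.log(ln_mean) - mu) / sigma) - 0.5

print(f"mean={ln_mean:.2f} median={ln_median:.2f} sd={ln_sd:.2f}")
print(f"standard skewness={ln_skewness:.3f} area skewness={area_skewness:.4f}")
```

Under these assumptions the population area skewness is about 0.19 and the population standard skewness about 6.18, which gives a reference point for the sample values 0.2047 and 4.3192 quoted above.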
3. Monte Carlo Simulation for Comparison of Skewness Measures
Three probability distributions with varying degrees of skewness are used in simulation in this study:
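N (μ, σ): normal distribution with mean μ and standard deviation σ.
GAM (α, β): gamma distribution with shape = α and scale = β, skewness = $2/\sqrt{\alpha}$.
Tr (a, b, c): triangular distribution with minimum a, mode b, and maximum c [26] [27], with probability density and cumulative distribution given by
$$f(x) = \begin{cases} \dfrac{2(x-a)}{(c-a)(b-a)}, & a \le x \le b, \\ \dfrac{2(c-x)}{(c-a)(c-b)}, & b < x \le c, \end{cases} \qquad F(x) = \begin{cases} \dfrac{(x-a)^2}{(c-a)(b-a)}, & a \le x \le b, \\ 1 - \dfrac{(c-x)^2}{(c-a)(c-b)}, & b < x \le c. \end{cases}$$
The skewness of the triangular distribution Tr (a, b, c) is given by
$$\frac{\sqrt{2}\,(a + c - 2b)(2a - b - c)(a - 2c + b)}{5\,(a^2 + b^2 + c^2 - ab - ac - bc)^{3/2}}.$$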
The triangular distribution is selected for this study because it can be used to model both positively skewed and negatively skewed distributions.
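The population values for a triangular distribution can be evaluated directly. The sketch below assumes the parameterization Tr(a, b, c) with minimum a, mode b, and maximum c (a < b < c), the standard closed forms for the mean, median, and skewness, and area skewness read as F(mean) − 0.5; all function names are hypothetical.

```python
import math


def tri_cdf(x, a, b, c):
    """CDF of Tr(a, b, c) with minimum a, mode b, maximum c (a < b < c)."""
    if x <= a:
        return 0.0
    if x <= b:
        return (x - a) ** 2 / ((c - a) * (b - a))
    if x < c:
        return 1.0 - (c - x) ** 2 / ((c - a) * (c - b))
    return 1.0


def tri_mean(a, b, c):
    return (a + b + c) / 3


def tri_median(a, b, c):
    """Solve F(x) = 0.5 on whichever branch of the CDF contains the median."""
    if b >= (a + c) / 2:
        return a + math.sqrt((c - a) * (b - a) / 2)
    return c - math.sqrt((c - a) * (c - b) / 2)


def tri_skewness(a, b, c):
    q = a * a + b * b + c * c - a * b - a * c - b * c
    return math.sqrt(2) * (a + c - 2 * b) * (2 * a - b - c) * (a - 2 * c + b) / (5 * q**1.5)


def tri_area_skewness(a, b, c):
    """F(mean) - F(median), where F(median) = 0.5 by definition."""
    return tri_cdf(tri_mean(a, b, c), a, b, c) - 0.5


for params in [(0, 0.5, 1), (0, 0.95, 1), (0, 0.05, 1)]:
    print(params, round(tri_skewness(*params), 4), round(tri_area_skewness(*params), 4))
```

Moving the mode toward the maximum produces negative values of both measures, moving it toward the minimum produces positive values, and the symmetric case gives zero for both.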
Table 1 shows the specific distributions and their skewness values used in this simulation, and Figure 3 shows plots of the two triangular distributions used in the simulations.
The simulation experiment used in this study is carried out in the following steps:
1) A random sample of size n is generated from the selected probability distribution.
2) Each of the five skewness coefficients (the proposed area skewness, the Pearson skewness, and the sample-moment-based Types 1-3 skewness coefficients) is computed.
Steps (1) and (2) are repeated 10,000 times, and the 90%, 95%, and 99% confidence intervals for the true skewness are calculated from the 10,000 skewness values.
Figure 3. Plots of the two triangular distributions used in the simulations.
Table 1. Probability distributions used in this simulation.
The simulation experiment was run for n = 25, 50, 75, 100, for each of the three probability models, for each of the two sets of parameter values. The sample sizes chosen range from moderate to large, and the true skewness values selected cover a wide range of skewness. Figures 4-23 show the histograms of the 10,000 skewness estimates and the confidence intervals.
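The simulation loop described above can be sketched as follows. This is an illustration under stated assumptions: the intervals are taken as central percentile intervals of the simulated estimates (the article does not say how its intervals were formed), only the Type 1 coefficient is computed, and the number of repetitions is reduced to keep the example fast. The function names are hypothetical.

```python
import math
import random
import statistics


def skewness_type1(x):
    """Fisher-Pearson skewness g1 = m3 / m2**1.5 (divisor n)."""
    mean = statistics.fmean(x)
    m2 = sum((xi - mean) ** 2 for xi in x) / len(x)
    m3 = sum((xi - mean) ** 3 for xi in x) / len(x)
    return m3 / m2**1.5


def simulate(draw, statistic, n, reps, seed=0):
    """Repeat steps (1) and (2): draw a sample of size n, compute the statistic."""
    rng = random.Random(seed)
    return [statistic([draw(rng) for _ in range(n)]) for _ in range(reps)]


def percentile_interval(values, level):
    """Central interval covering roughly `level` of the simulated values."""
    ordered = sorted(values)
    tail = (1 - level) / 2
    last = len(ordered) - 1
    return ordered[math.floor(tail * last)], ordered[math.ceil((1 - tail) * last)]


# GAM(2, 1): random.gammavariate takes (shape, scale).
estimates = simulate(lambda rng: rng.gammavariate(2.0, 1.0), skewness_type1, n=50, reps=2000)
for level in (0.90, 0.95, 0.99):
    low, high = percentile_interval(estimates, level)
    print(f"{level:.0%} interval: ({low:.3f}, {high:.3f})")
```

Passing a different `draw` function or a different `statistic` reuses the same loop for the other distributions and skewness coefficients in the study.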
Figure 4. Histograms and confidence intervals of skewness coefficients from 10,000 simulations of n = 25 samples from N (100, 20).
Figure 5. Histograms and confidence intervals of skewness coefficients from 10,000 simulations of n = 50 samples from N (100, 20).
Figure 6. Histograms and confidence intervals of skewness coefficients from 10,000 simulations of n = 75 samples from N (100, 20).
Figure 7. Histograms and confidence intervals of skewness coefficients from 10,000 simulations of n = 100 samples from N (100, 20).
Figure 8. Histograms and confidence intervals of skewness coefficients from 10,000 simulations of n = 25 samples from GAM (2, 1); standard skewness = 1.41, Pearson skewness = 0.68, area skewness = 0.09.
Figure 9. Histograms and confidence intervals of skewness coefficients from 10,000 simulations of n = 50 samples from GAM (2, 1); standard skewness = 1.41, Pearson skewness = 0.68, area skewness = 0.09.
Figure 10. Histograms and confidence intervals of skewness coefficients from 10,000 simulations of n = 75 samples from GAM (2, 1); standard skewness = 1.41, Pearson skewness = 0.68, area skewness = 0.09.
Figure 11. Histograms and confidence intervals of skewness coefficients from 10,000 simulations of n = 100 samples from GAM (2, 1); standard skewness = 1.41, Pearson skewness = 0.68, area skewness = 0.09.
Figure 12. Histograms and confidence intervals of skewness coefficients from 10,000 simulations of n = 25 samples from Tr (0, 0.5, 1).
Figure 13. Histograms and confidence intervals of skewness coefficients from 10,000 simulations of n = 50 samples from Tr (0, 0.5, 1).
Figure 14. Histograms and confidence intervals of skewness coefficients from 10,000 simulations of n = 75 samples from Tr (0, 0.5, 1).
Figure 15. Histograms and confidence intervals of skewness coefficients from 10,000 simulations of n = 100 samples from Tr (0, 0.5, 1).
Figure 16. Histograms and confidence intervals of skewness coefficients from 10,000 simulations of n = 25 samples from Tr (0, 0.95, 1); standard skewness = −0.56, Pearson skewness = −0.52, area skewness = −0.06.
Figure 17. Histograms and confidence intervals of skewness coefficients from 10,000 simulations of n = 50 samples from Tr (0, 0.95, 1); standard skewness = −0.56, Pearson skewness = −0.52, area skewness = −0.06.
Figure 18. Histograms and confidence intervals of skewness coefficients from 10,000 simulations of n = 75 samples from Tr (0, 0.95, 1); standard skewness = −0.56, Pearson skewness = −0.52, area skewness = −0.06.
Figure 19. Histograms and confidence intervals of skewness coefficients from 10,000 simulations of n = 100 samples from Tr (0, 0.95, 1); standard skewness = −0.56, Pearson skewness = −0.52, area skewness = −0.06.
Figure 20. Histograms and confidence intervals of skewness coefficients from 10,000 simulations of n = 25 samples from Tr (0, 0.05, 1); standard skewness = 0.56, Pearson skewness = 0.52, area skewness = 0.06.
Figure 21. Histograms and confidence intervals of skewness coefficients from 10,000 simulations of n = 50 samples from Tr (0, 0.05, 1); standard skewness = 0.56, Pearson skewness = 0.52, area skewness = 0.06.
Figure 22. Histograms and confidence intervals of skewness coefficients from 10,000 simulations of n = 75 samples from Tr (0, 0.05, 1); standard skewness = 0.56, Pearson skewness = 0.52, area skewness = 0.06.
Figure 23. Histograms and confidence intervals of skewness coefficients from 10,000 simulations of n = 100 samples from Tr (0, 0.05, 1); standard skewness = 0.56, Pearson skewness = 0.52, area skewness = 0.06.
4. A Computational Geometric Measure of Skewness
The probability density function estimated from the data can be modeled by a simple polygon P, as shown in Figure 24 (thin solid line). Let lm be the vertical line segment at the sample mean (thick vertical line), and let Ch1 and Ch2 denote the polygonal chains of P to the right and left of lm. Taking lm as a mirror, we consider the reflected images of Ch1 and Ch2, denoted by I1 and I2, respectively; these are drawn as dashed lines in Figure 24. Chains I1 and I2 form a simple polygon P*, which we call the image polygon. The overlay of P and P* results in two types of areas: (i) the Overlap Area OA, and (ii) the Spilled Area SA. In the figure, the spilled area components are labeled A, B, C, and D. For a symmetric distribution, the spilled area will be small; for an asymmetric distribution, the proportion of spilled area will be large. This motivates us to use the proportion of spilled area as a measure of skewness.
An algorithm for computing the spilled area can be developed by using the data structures for representing simple polygons from computational geometry. A sketch of the algorithm is shown below. Efficient implementation of Step 5 and Step 6 requires techniques from computational geometry: the input polygon is represented in a doubly connected edge list data structure as reported in [28]. By navigating through this data structure, the intersection points corresponding to the overlay of P and P* can be computed in linear time.
Algorithm 1: Computing Spilled Area.
Input: A simple polygon P constructed from samples points.
Output: Spilled Area SA.
Step 1: Find the mean vertical line segment lm.
Step 2: Find polygonal chains Ch1 and Ch2 implied by lm from input polygon P.
Step 3: Determine corresponding image chains I1 and I2.
Step 4: Construct image polygon P* by combining I1 and I2.
Step 5: Compute Overlap Area $OA = \text{area}(P \cap P^*)$.
Step 6: Compute Union Area $UA = \text{area}(P \cup P^*)$.
Step 7: Spilled Area SA = UA − OA.
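The steps of Algorithm 1 can be approximated without polygon data structures when P is taken to be the region under a density curve. The sketch below is a simplification and not the doubly connected edge list overlay described above: it evaluates the density on a grid symmetric about the mean, obtains the mirror image by reversing the grid values, and integrates the pointwise minimum and maximum with the trapezoid rule. The function names, grid size, and test densities are hypothetical.

```python
import math
from statistics import NormalDist


def trapezoid(ys, step):
    """Trapezoid-rule integral of equally spaced values."""
    return step * (sum(ys) - 0.5 * (ys[0] + ys[-1]))


def overlay_areas(density, mean, half_width, grid_points=2001):
    """Return (overlap_area, union_area, spilled_area) for a density curve
    overlaid with its mirror image about the vertical line x = mean."""
    step = 2 * half_width / (grid_points - 1)
    xs = [mean - half_width + i * step for i in range(grid_points)]
    f = [density(x) for x in xs]
    mirrored = f[::-1]  # f(2*mean - x) on a grid symmetric about the mean
    overlap = trapezoid([min(u, v) for u, v in zip(f, mirrored)], step)  # Step 5
    union = trapezoid([max(u, v) for u, v in zip(f, mirrored)], step)    # Step 6
    return overlap, union, union - overlap                               # Step 7


def exponential_density(x):
    """Exponential density with rate 1 (mean 1), used as an asymmetric example."""
    return math.exp(-x) if x >= 0 else 0.0


print(overlay_areas(NormalDist(100, 10).pdf, 100.0, 60.0))
print(overlay_areas(exponential_density, 1.0, 10.0))
```

For the symmetric normal curve the spilled area is essentially zero, while the exponential curve leaves a large spilled area. Dividing the spilled area by the union area or by the polygon area to obtain a proportion is a further choice left to the caller here.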
We implemented the algorithm in the Python programming environment. For illustration purposes, two samples were generated from different normal distributions. The true geometric skewness measure for any normal distribution is 0, since the normal distribution is symmetric. The results for the two samples are presented below.
The input polygon computed from the first sample is shown in Figure 25, and the overlap area is shown in Figure 26.
For sample 1, node count = 188, overlap area = 0.46, polygon area = 2.91, and the geometric measure of skewness = overlap area/polygon area = 0.1581.
For the second simulated example, Figure 27 and Figure 28 show the input polygon and the overlap area, respectively. For sample 2, node count = 40, overlap area = 0.41, polygon area = 2.93, and the geometric measure of skewness = overlap area/polygon area = 0.1387.
Figure 28. Overlap area for the second sample.
5. Discussion and Results
We have proposed two different skewness measures: area skewness and geometric skewness. The standard skewness measures suffer from one drawback: they do not have known lower and upper bounds. The absolute values of both of the proposed skewness estimates fall in the range (0, 1). We have used Monte Carlo simulations to compute confidence intervals from the area skewness estimate, and we intend to do the same for the geometric skewness estimate in the near future.