Effects of Multicollinearity on Type I Error of Some Methods of Detecting Heteroscedasticity in Linear Regression Model
1. Introduction
The violation of the assumption of constant variance of the error term in a linear regression model results in the problem of heteroscedasticity. In practice, the nature of heteroscedasticity is usually unknown [1]. In the presence of heteroscedasticity, the OLS estimator is no longer BLUE: it is inefficient, and hypothesis tests based on it are invalid. Given this fact, heteroscedasticity in a linear regression model needs to be detected. In reality, multicollinearity may co-exist with the problem of heteroscedasticity. The condition of severe non-orthogonality is referred to as the problem of multicollinearity. Multicollinearity exists when there are high linear relationships between two or more explanatory variables. According to [2] and [3], one should be very cautious about any conclusion drawn from regression analysis when there is multicollinearity in the model, although [4] opined that the effect of multicollinearity on the type I error rates of the ordinary least squares estimator is trivial, in that the error rates exhibit little or no significant difference from the pre-selected level of significance. This paper attempts to determine the effects of multicollinearity on the type I error rates of some heteroscedasticity detection methods in the linear regression model. The heteroscedasticity detection methods chosen for this study are: Breusch-Pagan test (BPG), Park test (PT), Spearman's Rank Correlation test (ST), Non-Constant Variation Score test (NVST), Glejser test (GLJ), Goldfeld-Quandt test (GFQ), Breusch-Godfrey test (BG), Harrison-McCabe test (HM) and White test (WT).
2. Background
Regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables. The simple linear regression model postulates the relationship between the dependent variable and one exogenous variable, while multiple linear regression examines the relationship between the dependent variable and a set of explanatory variables by fitting a linear equation to observed data. One of the assumptions of the classical linear regression model is that the variance of the error term is constant across observations (homoscedasticity). When the homoscedasticity assumption is violated, heteroscedasticity results. Heteroscedasticity is a major concern in the application of regression analysis; it often occurs in cross-sectional data, when the variances of the error terms are no longer constant. It is often investigated through the relationship between the error terms and the exogenous variables. According to [5] and [1], the consequences of using the Ordinary Least Squares (OLS) estimator to obtain estimates of the population parameters when there is heteroscedasticity include inefficient parameter estimates and biased variance estimates, which make standard hypothesis tests inappropriate. In practice, the nature of heteroscedasticity is usually unknown [1]. There are test procedures for establishing specific structures of heteroscedasticity. Brief literature reviews on some of these heteroscedasticity tests follow.
2.1. Breusch-Pagan Test (BP)
[6] developed a test for examining the presence of heteroscedasticity in a linear regression model. It tests whether the variance of the error term from a regression depends on the values of the independent variables. [3] illustrates this test by considering the following:
Given the regression model

Y_i = β_0 + β_1X_1i + ⋯ + β_kX_ki + μ_i (1)

where Y is the dependent variable, the X's are the exogenous or explanatory variables, μ is the error term and the β's are the regression coefficients.

[3] suggests that to determine the existence of heteroscedasticity in a given data set the following procedure must be followed:

Apply OLS to the model and compute the regression residuals û_i.

Perform the auxiliary regression

û_i² = γ_0 + γ_1z_1i + ⋯ + γ_pz_pi + v_i (2)

where the z's could be partly or wholly replaced by the independent variables X.

The test statistic is formed from the coefficient of determination R² of the auxiliary regression in (2) and the sample size n as LM = nR². The test statistic is asymptotically distributed as χ²(p) under the null hypothesis of homoscedasticity.
2.2. Park Test (PT)
[5] proposes an LM test which assumes proportionality between the error variance and the square of the regressor. According to [1] and [5], the test formalizes the graphical method by suggesting that σ_i² is a particular function of the explanatory variable. Park illustrates this by regressing the natural log of the squared residuals against the independent variable; if the independent variable has a significant coefficient, the data are likely to be heteroscedastic in nature. Given the model below

σ_i² = σ²X_i^β e^{v_i} (3)

we take logs to obtain

ln σ_i² = ln σ² + β ln X_i + v_i (4)

where v_i is the stochastic disturbance term. Since σ_i² is not known, Park suggests using the squared OLS residual û_i² as a proxy and running the regression

ln û_i² = ln σ² + β ln X_i + v_i (5)

If β turns out to be statistically significant, we say that heteroscedasticity is present in the data; if it turns out to be insignificant, we may accept the assumption of homoscedasticity.
2.3. Spearman’s Rank Correlation Test (ST)
Spearman's Rank correlation [7] assumes that the variance of the disturbance term is either increasing or decreasing as X increases, so that there will be a correlation between the absolute size of the residuals and the size of X in an OLS regression. The data on X and the residuals are both ranked. The rank correlation coefficient is defined as

r_s = 1 − 6Σd_i² / [n(n² − 1)] (6)

where d_i is the difference between the rank of X and the rank of the residual for observation i, and n is the number of individuals ranked.
2.4. Glejser Test (GLJ)
[8] developed a test similar to the Park test. After obtaining the residuals û_i from the OLS regression, [8] suggests regressing the absolute values of the estimated residuals on the explanatory variables thought to be closely associated with the heteroscedastic variance, in order to determine whether the variance of the observed dependent variable increases as the independent variable increases in size. This is done by regressing the absolute residuals of the fitted model against the independent variables. A high t-statistic (or low p-value) for the estimated coefficient of the independent variable(s) would indicate the presence of heteroscedasticity.
2.5. Goldfeld-Quandt Test (GFQ)
[9] developed an alternative to the LM test; applying it requires a sequence of intermediate steps. The first step is to arrange the observations in either ascending or descending order of X. The next step is to divide the ordered sequence into two equal sub-sequences by omitting an arbitrary number P of central observations, so that each of the two sub-sequences contains (n − P)/2 observations. We then compute two separate OLS regressions, the first for the lowest values of X and the second for the highest values of X, and obtain the residual sum of squares (RSS) for each regression equation: RSS1 for the lowest values of X and RSS2 for the highest values of X. An F-statistic is calculated based on the following formula:

F = RSS2 / RSS1 (7)

The F-statistic is distributed with (n − P − 2k)/2 degrees of freedom for both numerator and denominator, where k is the number of estimated parameters. Subsequently, the value obtained for the F-statistic is compared with the tabulated F-critical value for the specified degrees of freedom and a certain confidence level. If the F-statistic is higher than F-critical, the null hypothesis of homoscedasticity is rejected and the presence of heteroscedasticity is confirmed.
2.6. Breusch-Godfrey Test (BG)
[10] developed an LM test of the null hypothesis of no heteroscedasticity against heteroscedasticity of the form σ_t² = σ²h(z_t′γ), where z_t is a vector of independent variables. This vector contains the regressors from the original least squares regression. The test is performed by running an auxiliary regression of the squared residuals from the original equation on (1, z_t). Under the null hypothesis of no heteroscedasticity, the test statistic follows a Chi-square distribution with degrees of freedom equal to the number of variables in z.
2.7. White’s Test (WT)
[11] proposed a statistical test that establishes whether the variance of the error in a regression model is constant. This test is general, unrestricted and widely used for detecting heteroscedasticity in the residuals from a least squares regression. In particular, the White test is a test of heteroscedasticity in the OLS residuals. The null hypothesis is that there is no heteroscedasticity. The procedure for running the test is as follows:

Given the model

Y_i = b_0 + b_1X_1i + b_2X_2i + μ_i (8)

estimate Equation (8) and obtain the residuals û_i. We then run the following auxiliary regression:

û_i² = b_0 + b_1X_1i + b_2X_2i + b_3X_1i² + b_4X_2i² + b_5X_1iX_2i + v_i (9)

The null hypothesis of homoscedasticity is H0: b_1 = b_2 = ⋯ = b_5 = 0, which expresses the fact that the variance of the residuals is homoscedastic, i.e., σ_i² = σ². The alternative hypothesis H1, that at least one of the b_i's is different from zero, expresses the fact that the variance of the residuals is heteroscedastic; in that case the null hypothesis is rejected. The LM-statistic = nR² follows a χ² distribution with m − 1 degrees of freedom, where n is the number of observations used in the auxiliary regression, m is the number of parameters in the auxiliary regression and R² is its coefficient of determination. Finally, we reject the null hypothesis and conclude the presence of heteroscedasticity when the LM-statistic is higher than the critical value.
2.8. Harrison McCabe Test (HM)
[12] proposes a test to check the heteroscedasticity of the residuals. The breakpoint in the variances is set by default to half of the sample. The p-value is estimated using simulation. If the binary quality measure is false, then the homoscedasticity hypothesis can be rejected with respect to the given level.
2.9. Non-Constant Variation Score Test (NVST)
[5] [13] and [14] develop a test of the null hypothesis H0: σ_i² = σ² against an alternative hypothesis (H1) with a general functional form. Recall that the central issue is whether σ_i² is related to X. A simple strategy is then to use the OLS residuals to estimate the disturbances and check the relationship between û_i² and X (and X²). Suppose that the relationship between σ_i² and X is linear:

σ_i² = σ² + γX_i (10)

Then we test H0: γ = 0 against H1: γ ≠ 0, and base the test on how the squared OLS residuals û_i² correlate with X.
3. Materials and Method
Consider the regression model of the form:

Y_i = β_0 + β_1X_1i + β_2X_2i + β_3X_3i + μ_i, i = 1, 2, ⋯, n (11)

where μ_i is the error term with the heteroscedastic variance σ_i² under consideration, Y is the dependent variable, the X's are the explanatory variables that contain multicollinearity and the β's are the regression coefficients of the model. A Monte Carlo experiment was performed 1000 times to generate the data for the simulation study. The error terms with different heteroscedasticity structures, the explanatory variables and the dependent variable were generated. The procedure used by [15] [16] and [17] was adopted to generate the explanatory variables in this study. This is given as:
X_ij = (1 − ρ²)^{1/2} z_ij + ρ z_i(p+1), i = 1, 2, ⋯, n; j = 1, 2, ⋯, p (12)

where the z_ij are independent standard normal variates with mean zero and unit variance, rho (ρ) governs the correlation between any two explanatory variables and p is the number of explanatory variables. In this study, seven (7) error variances containing heteroscedasticity structures were considered, which are:
(13)
(14)
(15)
(16)
(17)
, where
and 0.2 (18)
(19)
These tests were investigated and their type I error rates observed under the hypothesized values; to achieve this, Monte Carlo experiments were employed.
Moreover, in order to determine the dependent variable, Equation (11) was used in conducting the Monte Carlo experiments. The true values of the model parameters β_0, β_1, β_2 and β_3 were fixed. The sample sizes were varied over 15, 20, 30, 40, 50, 100 and 250. At a specified value of the sample size and multicollinearity level, the fixed X's were first generated, followed by the error terms μ_i, and the values of Y were then determined. The Y and X's were then treated as a real-life data set to which the methods were applied.
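The regressor-generation scheme in Equation (12) can be sketched as below. The shared z_i(p+1) column is what induces the multicollinearity; the function name is illustrative, and the construction assumes the usual form of this scheme in which the resulting pairwise correlation between regressors equals ρ²:

```python
# Sketch: generate p regressors with controllable multicollinearity
# via a shared standard-normal component, as in Equation (12).
import numpy as np

def make_regressors(n, p, rho, rng):
    # z holds p + 1 independent N(0, 1) columns; the last column is
    # shared across regressors, inducing pairwise correlation rho**2.
    z = rng.normal(size=(n, p + 1))
    return np.sqrt(1.0 - rho**2) * z[:, :p] + rho * z[:, [p]]

rng = np.random.default_rng(7)
X = make_regressors(20000, 3, 0.95, rng)
print(np.corrcoef(X, rowvar=False).round(3))  # off-diagonals near 0.95**2
```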
The hypothesis about the methods of detecting heteroscedasticity under the different heteroscedasticity structures was tested at the 10%, 5% and 1% levels of significance to examine the type I error rate in each case. Intervals were set around each significance level and referred to as the estimated significance level; they were used to count the number of times the estimated error rate falls within the range set for each method of detecting heteroscedasticity, in order to decide whether or not to reject the hypothesis. The interval set for α = 0.1 is (0.09 to 0.14), the interval set for α = 0.05 is (0.045 to 0.054), and the interval set for α = 0.01 is (0.009 to 0.014).
Sample sizes were classified as small, medium and large.
Multicollinearity levels were classified, with the least value considered as low (0.8), then high (0.9), very high (0.95), severe (0.99) and very severe (0.999).
At a particular α level, a confidence interval was set for 10 percent, 5 percent and 1 percent, and the number of times the estimated type I error rate α̂ fell within the set confidence interval was counted over the sample sizes, multicollinearity levels and heteroscedasticity structures. The heteroscedasticity test with the highest count is chosen as the best.

α̂ = r/R (20)

where r is the number of times the null hypothesis was rejected at a particular significance level and R is the number of times the experiment was carried out. At a given α, the number of times α̂ falls within the set confidence interval at particular sample sizes, multicollinearity levels and heteroscedasticity structures was counted for each heteroscedasticity test, and the method with the highest count is the best.
Procedure to Determine the Best Method of Detecting Heteroscedasticity When Multicollinearity Exists
1) α, the probability of committing a type I error, was chosen to be 10%, 5% and 1%.
2) Calculate α̂ = r/R, where r is the number of times H0 was rejected by a particular heteroscedasticity test at a particular sample size and level of multicollinearity with a given heteroscedasticity form, and R is the number of replications.
3) Set a confidence interval for each of the chosen levels of significance as follows: for α = 0.1 it is (0.09 to 0.14), for α = 0.05 it is (0.045 to 0.054), and for α = 0.01 it is (0.009 to 0.014).
4) At a given α, count the number of times α̂ falls within the set confidence interval at particular sample sizes, multicollinearity levels and heteroscedasticity forms for each heteroscedasticity detection test.
5) The heteroscedasticity detection method with the highest count in (4) is the best.
4. Results and Discussion
Results obtained from the simulation study show the number of times the estimated probability of type I error (α̂) fell within the set confidence interval; this was counted over the sample sizes and heteroscedasticity structures for each heteroscedasticity detection method, for α = 10%, 5% and 1% and at different levels of multicollinearity, as presented in Table 1.
Table 1. The number of times the estimated probability of type I error α̂ fell within the set confidence interval over the sample sizes, levels of multicollinearity and heteroscedasticity structures for the various heteroscedasticity detection methods investigated.
Source: Simulated data.
From Table 1, the performances of the heteroscedasticity detection methods over the levels of multicollinearity are presented for α = 0.1, α = 0.05 and α = 0.01 in Figure 1, Figure 2 and Figure 3 respectively.
From Table 1 and Figure 1, when alpha = 0.1, it was generally observed that the BG test is the best-performing method over all the structural forms of heteroscedasticity and sample sizes when multicollinearity exists in the model.
Figure 1. Figure showing the performances of the heteroscedasticity detection methods over the levels of multicollinearity when Alpha = 0.1.
Also, it was observed from Table 1 that:
1) When the multicollinearity level is 0.8 and the sample size is 15, the BPG and ST methods of heteroscedasticity detection outperformed the BG method. When the multicollinearity level is 0.8 and the sample size is 20, the BPG method outperformed the BG method. When the multicollinearity level is 0.8 and the sample size is greater than 20, the BG method outperformed all other methods.
2) When the multicollinearity level is 0.9, the BG method performed better than all other methods except at sample size 15; at this instance, the BPG method performed equivalently well to the BG method.
3) When the multicollinearity level is greater than or equal to 0.95, the BG method outperformed all other methods.
Hence, the performance of the BG method of heteroscedasticity detection improves as the level of multicollinearity and the sample size increase.
From Table 1 and Figure 2, when alpha = 0.05, it was generally observed that the BG test is the best-performing method over all the structural forms of heteroscedasticity and sample sizes when multicollinearity exists in the model.
Also, it was observed from Table 1 that:
1) When the multicollinearity level is 0.8, the BG method of heteroscedasticity detection outperformed all other methods except at sample sizes 20 and 100; at these instances, the BPG and WT methods respectively outperformed the BG method.
2) When the multicollinearity level is 0.9, the BG method outperformed all other methods except when the sample size is 15; at this instance the ST method outperformed the BG method.
Figure 2. Figure showing the performances of the heteroscedasticity detection methods over the levels of multicollinearity when Alpha = 0.05.
3) When the multicollinearity level is 0.95, the BG method outperformed all other methods except at sample sizes 40 and 100; at these instances, the WT method outperformed the BG method.
4) When the multicollinearity level is 0.99, the BG method performed better than all other methods except at sample size 10; at this instance, the ST method outperformed the BG method.
5) When the multicollinearity level is greater than or equal to 0.999, the BG method outperformed all other methods except at sample size 15; at this instance, the HM method competed well with the BG method and outperformed it.
Hence, the performance of the BG method of heteroscedasticity detection improves as the level of multicollinearity and the sample size increase.
From Table 1 and Figure 3, when α = 0.01, it was generally observed that the BG test is the best-performing method of heteroscedasticity detection over all the structural forms of heteroscedasticity and sample sizes when multicollinearity exists in the model.
Also, it was observed from Table 1 that:
1) When the multicollinearity level is 0.8, the BG method of heteroscedasticity detection outperformed all other methods except at sample sizes 15 and 100; at these instances, the GFQ and ST methods respectively outperformed the BG method. Also, the BPG method competed favorably with the BG method at sample size 40.
2) When the multicollinearity level is 0.9, the BG method outperformed all other methods except when the sample size is 40; at this instance, the GFQ and HM methods performed equivalently well and outperformed the BG method.
Figure 3. Figure showing the performances of the heteroscedasticity detection methods over the levels of multicollinearity when Alpha = 0.01.
3) When the multicollinearity level is 0.95, the BG method outperformed all other methods except at sample sizes 15 and 40; at these instances, the GFQ and HM methods respectively competed well and outperformed the BG method.
4) When the multicollinearity level is 0.99, the BG method performed better than all other methods at all sample sizes except 15 and 20; at these sample sizes, the BPG and ST methods performed well and outperformed the BG method.
5) When the multicollinearity level is greater than or equal to 0.999, the BG method outperformed all other methods at all sample sizes except 15 and 20; at these instances, the HM method competed well with the BG method and outperformed it.
Hence, the performance of the BG method of heteroscedasticity detection improves as the level of multicollinearity and the sample size increase.
5. Conclusions
In spite of the level of multicollinearity, heteroscedasticity structures and sample size, we are able to conclude from the study on effects of multicollinearity on type I error rates of some methods of detecting heteroscedasticity when there exist multicollinearity in the model that;
The perfomances of BG’s method of heteroscedasticity detection increases as the multicollinearity level increases at all the levels of significance.
The perfomances of BG method of heteroscedasticity detection increases as the sample size increases at all the levels of significance.
Whenever multicollinearity presents in the model with any heteroscedasticity structure, BG’s test is the best method for heteroscedasticity detection in the model at different levels of significance in all sample size categories.