On Identifying Influential Observations in the Presence of Multicollinearity

Abstract

An influential observation is one that, either individually or together with several other observations, has a demonstrably large impact on the estimates of the regression coefficients. Some authors have suggested that multicollinearity should be controlled before the influence of a data point is measured. When ridge regression is used to mitigate the effect of multicollinearity, the problem arises of choosing a value of the ridge parameter that guarantees stable regression coefficients. This paper examines whether the choice of ridge parameter estimator affects which data points are identified as influential.

Share and Cite:

Uzuke, C. and Ezeilo, I. (2021) On Identifying Influential Observations in the Presence of Multicollinearity. Open Journal of Statistics, 11, 290-302. doi: 10.4236/ojs.2021.112016.

1. Introduction

It is well understood that not all observations in a data set play an equal role when a regression model is fitted. Occasionally a single observation or a small subset of the data exerts a disproportionate influence on the fitted model; that is, the parameter estimates or predictions may depend more on the influential subset than on the majority of the data. Belsley et al. [1] defined an influential observation as one which, either individually or together with several other observations, has a demonstrably larger impact on the calculated values of various estimates than is the case for most of the other observations. An influential observation in either the dependent or an independent variable can result from data error or from other problems. For example, influential data points in the dependent variable can arise from skewness in an independent variable or from differences in the data generation process for a small subset of the sample. Outliers, which are observations that appear to be inconsistent with the remainder of the data set [2], need not be influential in affecting the regression equation [3]. Andrews and Pregibon [4] highlighted the need to find the outliers that matter: not all outliers are harmful in the sense of having undue influence on, for instance, the estimation of the parameters of the regression model. If not all outliers matter, examining residuals alone might not lead to the detection of influential observations. Thus, other ways of detecting influential observations are needed.

Regression diagnostics comprise a collection of methods used to identify influential points and multicollinearity [1]. They include methods of exploratory data analysis for influential points and for identifying violations of the least squares assumptions. When the Ordinary Least Squares (OLS) assumption that the explanatory variables are not linearly correlated is violated, the result is multicollinearity, which should be controlled before attempting to measure influence [1]. One of the most popular methods of controlling multicollinearity is Ridge Regression (RR), suggested by Hoerl and Kennard [5]. The idea of RR is to add a small positive number ($k > 0$) to the diagonal elements of the matrix $X'X$ in order to obtain the ridge regression estimator

$$\hat{\beta}_R = (X'X + kI)^{-1} X'Y \tag{1}$$

Although the resulting estimator is biased, for a suitable choice of $k$ it yields a smaller Mean Squared Error (MSE) than the OLS estimator. If $k = 0$, $\hat{\beta}_R$ reduces to the unbiased OLS estimator $\hat{\beta}$. The choice of the ridge parameter $k$ has always been a problem when RR is used to address multicollinearity; hence several authors have suggested methods of estimating $k$, including Hoerl and Kennard [5], Hoerl et al. [6], Lawless and Wang [7], Nomura [8], Khalaf and Shukur [9], Dorugade [10], Al-Hassan [11], Dorugade and Kashid [12], Saleh and Kibria [13], Kibria [14], Zang and Ibrahim [15], Alkhamisi et al. [16], Al-Hassan [17], Muniz and Kibria [18], Khalaf and Iguernane [19], and Uzuke et al. [20].
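As a minimal sketch of Equation (1), the following R code computes the ridge estimator on synthetic, nearly collinear data. The data, the value $k = 0.5$ and the helper name `ridge_beta` are illustrative assumptions and are not taken from the paper.

```r
# Minimal sketch of the ridge estimator in Equation (1) on synthetic data.
set.seed(1)
n  <- 30
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)            # nearly collinear with x1
X  <- scale(cbind(x1, x2))                # centred and scaled predictors
y  <- c(scale(1 + 2 * x1 - x2 + rnorm(n), scale = FALSE))  # centred response

# ridge_beta is a hypothetical helper: (X'X + kI)^{-1} X'Y
ridge_beta <- function(X, y, k) {
  drop(solve(crossprod(X) + k * diag(ncol(X)), crossprod(X, y)))
}

beta_ols   <- ridge_beta(X, y, k = 0)     # k = 0 reproduces OLS
beta_ridge <- ridge_beta(X, y, k = 0.5)   # k > 0 shrinks the coefficients
print(rbind(OLS = beta_ols, ridge = beta_ridge))
```

With $k = 0$ the helper returns the OLS coefficients, and any $k > 0$ gives a coefficient vector with a smaller norm.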

Several diagnostic methods have been developed to detect influential observations. First, Cook [21] introduced Cook's distance ($D_i$), which is based on deleting the observations one at a time and measuring the effect of each deletion on the linear regression model. Other measures developed from the idea of Cook's distance include the modified Cook's distance ($D_i^{*}$), DFFITS, Hadi's measure, the Pena statistic, DFBETAS and COVRATIO.

Multicollinearity and influential observations therefore both affect regression estimates remarkably. When Ridge Regression is used to mitigate multicollinearity, there is always the problem of which method to use to estimate the ridge parameter $k$ so that the reduction in variance is larger than the increase in bias. Furthermore, one may want to know whether multicollinearity affects the identification of influential observations.

2. Methodology

The influence of an observation is measured by the effect its deletion has on the fit. Deletion is always done one point at a time. Let $\hat{\beta}_{0(i)}, \hat{\beta}_{1(i)}, \ldots, \hat{\beta}_{p(i)}$ denote the regression coefficients obtained when the $i$th observation is deleted ($i = 1, 2, \ldots, n$). Similarly, let $\hat{y}_{1(i)}, \hat{y}_{2(i)}, \ldots, \hat{y}_{n(i)}$ and $\hat{\sigma}^2_{(i)}$ be the predicted values and the residual mean square, respectively, when the $i$th observation is dropped. Note that

$$\hat{y}_{m(i)} = \hat{\beta}_{0(i)} + \hat{\beta}_{1(i)} x_{m1} + \cdots + \hat{\beta}_{p(i)} x_{mp} \tag{2}$$

is the fitted value for observation $m$ when the fitted equation is obtained with the $i$th observation deleted. Influence measures look at the differences produced in quantities such as $(\hat{\beta}_j - \hat{\beta}_{j(i)})$ or $(\hat{y}_j - \hat{y}_{j(i)})$. This work adopted the following influence measures:

1) Cook’s Distance

Cook [21] proposed this measure and it is widely used. Cook’s distance measures the difference between the fitted values obtained from the full data and the fitted values obtained by deleting the ith observation. Cook’s distance measure is defined as,

$$C_i = \frac{\sum_{j=1}^{n} \left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{\hat{\sigma}^2 (p+1)} \tag{3}$$

which can also be expressed as

$$C_i = \frac{r_i^2}{p+1} \times \frac{h_{ii}}{1-h_{ii}} \tag{4}$$

Thus, Cook's distance is a multiplicative function of two quantities. The first term in Equation (4) involves the square of the standardized residual $r_i$, given by $r_i = \dfrac{e_i}{\hat{\sigma}\sqrt{1-h_{ii}}}$, and the second term is the potential function $\dfrac{h_{ii}}{1-h_{ii}}$, where $h_{ii}$ is the leverage of the $i$th observation, that is, the $i$th diagonal element of the hat matrix $X(X'X + kI)^{-1}X'$.

If a point is influential, its deletion causes large changes and the value of $C_i$ will be large. Therefore, a large value of $C_i$ indicates that the point is influential. It has also been suggested that points with a $C_i$ value greater than the 50% point of the $F$ distribution with $p + 1$ and $n - p - 1$ degrees of freedom be classified as influential.
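The following R sketch evaluates Equation (4) for an ordinary least squares fit ($k = 0$) on synthetic data and applies the 50% point of the $F$ distribution as the cutoff. The data and the planted unusual response are assumptions made only for illustration; `hatvalues()`, `rstandard()` and `cooks.distance()` are base R functions.

```r
# Cook's distance from Equation (4), for an OLS fit on synthetic data.
set.seed(2)
n <- 30; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- drop(1 + X %*% c(2, -1, 0.5) + rnorm(n))
y[5] <- y[5] + 8                         # plant one unusual response (assumption)

fit <- lm(y ~ X)
h   <- hatvalues(fit)                    # leverages h_ii
r   <- rstandard(fit)                    # standardized residuals r_i
C   <- r^2 / (p + 1) * h / (1 - h)       # Equation (4)

cutoff <- qf(0.5, p + 1, n - p - 1)      # 50% point of F(p + 1, n - p - 1)
print(which(C > cutoff))                 # indices classified as influential, if any
```

For an OLS fit the vector `C` agrees with `cooks.distance(fit)`, which offers a quick check of the formula.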

2) Welsch and Kuh Measure

Welsch and Kuh [22] developed a similar measure to Cook’s Distance named DFFITs, defined as

$$\mathrm{DFFITs}_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{\hat{\sigma}_{(i)}\sqrt{h_{ii}}} \tag{5}$$

$\mathrm{DFFITs}_i$ is the scaled difference between the $i$th fitted value obtained from the full data and the $i$th fitted value obtained by deleting the $i$th observation. It can also be written as

$$\mathrm{DFFITs}_i = r_i^{*}\sqrt{\frac{h_{ii}}{1-h_{ii}}}, \quad i = 1, 2, \ldots, n \tag{6}$$

where $r_i^{*}$ is the standardized residual computed with the deleted-case estimate $\hat{\sigma}_{(i)}$, defined as $r_i^{*} = \dfrac{e_i}{\hat{\sigma}_{(i)}\sqrt{1-h_{ii}}}$.

Points with $|\mathrm{DFFITs}_i| > 2\sqrt{(p+1)/(n-p-1)}$ are usually classified as influential points.
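As an illustration, the R sketch below computes DFFITs for an OLS fit in two ways: from Equation (6), and directly from Equation (5) by refitting the model without each observation in turn. The synthetic data are an assumption; `rstudent()` and `dffits()` are base R functions.

```r
# DFFITs via Equation (6) and via explicit leave-one-out refits (Equation (5)).
set.seed(3)
n <- 30; p <- 3
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
d <- data.frame(y = drop(1 + X %*% c(2, -1, 0.5) + rnorm(n)), X)

fit    <- lm(y ~ ., data = d)
h      <- hatvalues(fit)
r_star <- rstudent(fit)                    # e_i / (sigma_(i) * sqrt(1 - h_ii))
dffits_eq6 <- r_star * sqrt(h / (1 - h))   # Equation (6)

dffits_eq5 <- sapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ ., data = d[-i, ])       # fit without observation i
  y_i   <- predict(fit_i, newdata = d[i, ])
  (fitted(fit)[i] - y_i) / (summary(fit_i)$sigma * sqrt(h[i]))
})

cutoff <- 2 * sqrt((p + 1) / (n - p - 1))
print(which(abs(dffits_eq6) > cutoff))     # indices classified as influential, if any
```

The two computations coincide for OLS, so the refitting loop is only needed when no closed form is available.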

3) Hadi’s Influence Measure

Hadi [23] proposed a measure of the influence of ith observation based on the fact that influential observations are outliers in the response variable or in the predictors or both. Accordingly, the influence of the ith observation can be measured by

$$H_i = \frac{h_{ii}}{1-h_{ii}} + \frac{p+1}{1-h_{ii}} \cdot \frac{d_i^2}{1-d_i^2}, \quad i = 1, 2, \ldots, n \tag{7}$$

where $d_i = e_i/\sqrt{SSE}$ is the normalized residual. $H_i$ is an additive function. The first term of the equation is the potential function, which measures outlyingness in the X-space, and the second term is a function of the residual, which measures outlyingness in the response variable. Observations with large $H_i$ are influential in the response variable, the predictor variables, or both. Although $H_i$ does not focus on a specific regression result, it can be thought of as an overall measure of influence that flags observations that are influential on at least one regression result.
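Base R has no built-in function for Hadi's measure, so the sketch below evaluates Equation (7) term by term for an OLS fit. The synthetic data and the variable names `potential` and `residual_part` are illustrative assumptions.

```r
# Hadi's influence measure from Equation (7), for an OLS fit on synthetic data.
set.seed(4)
n <- 30; p <- 3
X   <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
dat <- data.frame(y = drop(1 + X %*% c(2, -1, 0.5) + rnorm(n)), X)

fit <- lm(y ~ ., data = dat)
h   <- hatvalues(fit)                      # leverages h_ii
e   <- residuals(fit)
d2  <- e^2 / sum(e^2)                      # squared normalized residuals d_i^2

potential     <- h / (1 - h)               # outlyingness in the X-space
residual_part <- (p + 1) / (1 - h) * d2 / (1 - d2)   # outlyingness in the response
H <- potential + residual_part

print(head(sort(H, decreasing = TRUE), 3)) # the three largest values
```

Keeping the two terms separate shows whether a large $H_i$ comes from the predictors, the response, or both.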

4) DFBETAS [1]

DFBETAS measures the difference in each parameter estimate with and without a given data point. It is an influence measure used to ascertain which observations influence a specific regression coefficient:

$$\mathrm{DFBETAS}_{ij} = \frac{b_j - b_{j(i)}}{\sqrt{s_{(i)}^2 \,(X'X)^{-1}_{jj}}} \tag{8}$$

where $b_{j(i)}$ denotes the $j$th regression coefficient obtained when the $i$th observation is deleted from the fitting process ($i = 1, 2, \ldots, n$), $b_j$ is the corresponding coefficient estimated from the full data, $s_{(i)}^2$ is the residual mean square with the $i$th observation deleted, and $(X'X)^{-1}_{jj}$ is the $j$th diagonal element of $(X'X)^{-1}$.
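The deletion in Equation (8) can be sketched directly in R by refitting an OLS model without each observation. The synthetic data and the choice of coefficient `j` are assumptions; the cutoff $2/\sqrt{n}$ is the one listed in Appendix I.

```r
# DFBETAS for one coefficient via leave-one-out refits (Equation (8)), OLS case.
set.seed(5)
n <- 30; p <- 3
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
d <- data.frame(y = drop(1 + X %*% c(2, -1, 0.5) + rnorm(n)), X)

fit     <- lm(y ~ ., data = d)
XtX_inv <- solve(crossprod(model.matrix(fit)))   # (X'X)^{-1}, intercept included
j <- 2                                           # design column 2 = first predictor

dfb <- sapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ ., data = d[-i, ])             # fit without observation i
  s_i   <- summary(fit_i)$sigma                  # s_(i)
  (coef(fit)[j] - coef(fit_i)[j]) / (s_i * sqrt(XtX_inv[j, j]))
})

print(which(abs(dfb) > 2 / sqrt(n)))             # indices classified as influential, if any
```

For OLS the result matches column `j` of `dfbetas(fit)` from base R.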

5) Kuh and Welsch Ratio (COVRATIO)

The COVRATIO statistic measures the change in the determinant of the covariance matrix of the estimates by deleting the ith observation. This influential measure is given as

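$$\mathrm{COVRATIO}_i = \frac{\det\left(s_{(i)}^2 \left(X_{(i)}'X_{(i)}\right)^{-1}\right)}{\det\left(s^2 \left(X'X\right)^{-1}\right)} \tag{9}$$

which can also be expressed as below

$$\mathrm{COVRATIO}_i = \left(\frac{n - p' - r_i^2}{n - p' - 1}\right)\frac{1}{1-h_{ii}} \tag{10}$$

where $n$ is the sample size, $p'$ is the number of independent variables and $h_{ii}$ is the $i$th diagonal element of the hat matrix.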
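As a sketch of the determinant ratio in Equation (9), the R code below recomputes the covariance matrix of the OLS estimates with each observation deleted and compares the result with base R's `covratio()`. The synthetic data are an assumption, and the cutoff is applied as it is written in Appendix I.

```r
# COVRATIO as the determinant ratio in Equation (9), for an OLS fit.
set.seed(6)
n <- 30; p <- 3
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
d <- data.frame(y = drop(1 + X %*% c(2, -1, 0.5) + rnorm(n)), X)

fit      <- lm(y ~ ., data = d)
M        <- model.matrix(fit)                      # design matrix with intercept
det_full <- det(summary(fit)$sigma^2 * solve(crossprod(M)))

cov_ratio <- sapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ ., data = d[-i, ])               # fit without observation i
  M_i   <- M[-i, , drop = FALSE]
  det(summary(fit_i)$sigma^2 * solve(crossprod(M_i))) / det_full
})

print(which(abs(cov_ratio - 1) > 3 * p / n))       # criterion as written in Appendix I
```

Values far from 1 in either direction indicate that deleting the observation changes the precision of the estimates.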

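The ridge parameter estimators selected to control multicollinearity are

a) $\hat{k}_1 = \dfrac{\hat{\sigma}^2}{\hat{\alpha}_i^2}$, Hoerl and Kennard [5]

b) $\hat{k}_2 = \dfrac{\hat{\sigma}^2}{\left(\prod_{i=1}^{p} \hat{\alpha}_i^2\right)^{1/p}}$, Kibria [14]

c) $\hat{k}_3 = \max_{1 \le i \le p}\left(\dfrac{\lambda_i \hat{\sigma}^2}{\lambda_i \hat{\alpha}_i^2 + (n-p)\hat{\sigma}^2}\right)$, Alkhamisi et al. [16]

d) $\hat{k}_4 = \left(\prod_{i=1}^{p} \dfrac{\lambda_i \hat{\sigma}^2}{(n-p)\hat{\sigma}^2 + \lambda_i \hat{\alpha}_i^2}\right)^{1/p}$, Muniz and Kibria [18]

e) $\hat{k}_5 = \left(\prod_{i=1}^{p} \dfrac{1}{m_i}\right)^{1/p}$, Muniz and Kibria [18]

f) $\hat{k}_6 = \left(\prod_{i=1}^{p} m_i\right)^{1/p}$, Muniz and Kibria [18]

g) $\hat{k}_7 = \operatorname{median}\left(\dfrac{1}{m_i}\right)$, Muniz and Kibria [18]

where $m_i = \sqrt{\dfrac{\hat{\sigma}^2}{\hat{\alpha}_i^2}}$

h) $\hat{k}_8 = \dfrac{2p}{\lambda_{\max}} \sum_{i=1}^{p} \dfrac{\hat{\sigma}^2}{\hat{\alpha}_i^2}$, Dorugade [10]

i) $\hat{k}_9 = \left(\prod_{j=1}^{p} w_j\right)^{1/p}$, Uzuke et al. [20]

where $w_j = \dfrac{\ln^2(\hat{\sigma}^2)}{(n-p)\hat{\sigma}^2 + \ln^2(\hat{\alpha}_j^2)}$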

j) OLS ($k = 0$): $\hat{\beta} = (X'X)^{-1} X'Y$
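A few of the estimators above can be sketched in R as follows. This is only an illustration on synthetic data: it assumes the geometric-mean (product) reading of estimators b) and d) to f), and the square-root form $m_i = \sqrt{\hat{\sigma}^2/\hat{\alpha}_i^2}$; the variable names are hypothetical.

```r
# Sketch of ridge parameter estimators k2, k4, k5, k6 and k7 on synthetic data.
set.seed(7)
n <- 30; p <- 4
z <- rnorm(n)
X <- sapply(1:p, function(j) z + rnorm(n, sd = 0.3))     # correlated predictors
y <- drop(X %*% c(1, 0.5, -1, 2) + rnorm(n))

# Unit-length scaling, so that X'X is in correlation form
unit_length <- function(v) (v - mean(v)) / sqrt(sum((v - mean(v))^2))
Xs <- apply(X, 2, unit_length)
ys <- unit_length(y)

eig    <- eigen(crossprod(Xs), symmetric = TRUE)
lambda <- eig$values                                     # eigenvalues of X'X
D      <- eig$vectors                                    # eigenvectors of X'X
beta   <- drop(solve(crossprod(Xs), crossprod(Xs, ys)))  # OLS on the scaled data
alpha  <- drop(crossprod(D, beta))                       # alpha = D' beta
sigma2 <- sum((ys - Xs %*% beta)^2) / (n - p)            # residual mean square

m  <- sqrt(sigma2 / alpha^2)                             # assumed square-root form of m_i
k2 <- sigma2 / prod(alpha^2)^(1 / p)
k4 <- prod(lambda * sigma2 / ((n - p) * sigma2 + lambda * alpha^2))^(1 / p)
k5 <- prod(1 / m)^(1 / p)
k6 <- prod(m)^(1 / p)
k7 <- median(1 / m)
print(c(k2 = k2, k4 = k4, k5 = k5, k6 = k6, k7 = k7))
```

Under these assumptions $\hat{k}_5$ and $\hat{k}_6$ are reciprocals of each other and $\hat{k}_6^2 = \hat{k}_2$, which gives a simple consistency check on an implementation.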

3. Illustration

The illustration uses the Nigerian economic indicator data (1980-2010) from the Central Bank of Nigeria (CBN) Statistical Bulletin 2010. The data consist of Gross Domestic Product as the dependent variable ($y$) and ten (10) independent variables, namely Money Supply ($x_1$), Credit to Private Sector ($x_2$), Exchange Rate ($x_3$), External Reserve ($x_4$), Agricultural Loan ($x_5$), Foreign Reserve ($x_6$), Oil Import ($x_7$), Non-oil Export ($x_8$), Oil Export ($x_9$), and Non-oil Export ($x_{10}$), shown in Appendix III.

Table 1 shows that multicollinearity is present in the data, since most of the independent variables have a variance inflation factor VIF > 10, eigenvalues close to zero, tolerance T < 0.1 and condition number CN > 5. The correlation matrix of the data set also shows the presence of multicollinearity.

$$
\begin{array}{c|cccccccccc}
 & x_1 & x_2 & x_3 & x_4 & x_5 & x_6 & x_7 & x_8 & x_9 & x_{10} \\ \hline
x_1 & 1 & 0.7952 & 0.7218 & 0.7309 & 0.7838 & 0.7757 & 0.7789 & 0.8146 & 0.7532 & 0.7768 \\
x_2 & 0.7952 & 1 & 0.6813 & 0.8586 & 0.9702 & 0.9168 & 0.9420 & 0.9517 & 0.8851 & 0.9693 \\
x_3 & 0.7218 & 0.6813 & 1 & 0.7277 & 0.7507 & 0.8270 & 0.7650 & 0.8234 & 0.8350 & 0.7810 \\
x_4 & 0.7309 & 0.8586 & 0.7277 & 1 & 0.9372 & 0.9317 & 0.8657 & 0.8891 & 0.9438 & 0.8781 \\
x_5 & 0.7838 & 0.9702 & 0.7507 & 0.9372 & 1 & 0.9580 & 0.9365 & 0.9596 & 0.9505 & 0.9675 \\
x_6 & 0.7757 & 0.9168 & 0.8270 & 0.9317 & 0.9580 & 1 & 0.9660 & 0.9785 & 0.9877 & 0.9631 \\
x_7 & 0.7789 & 0.9420 & 0.7650 & 0.8657 & 0.9365 & 0.9660 & 1 & 0.9801 & 0.9455 & 0.9705 \\
x_8 & 0.8146 & 0.9517 & 0.8234 & 0.8891 & 0.9596 & 0.9785 & 0.9801 & 1 & 0.9612 & 0.9905 \\
x_9 & 0.7532 & 0.8851 & 0.8350 & 0.9438 & 0.9505 & 0.9877 & 0.9455 & 0.9612 & 1 & 0.9406 \\
x_{10} & 0.7768 & 0.9693 & 0.7810 & 0.8781 & 0.9675 & 0.9631 & 0.9705 & 0.9905 & 0.9406 & 1
\end{array}
$$
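The diagnostics referred to above (VIF, tolerance, eigenvalues and condition number) can be obtained from a correlation matrix as in the following R sketch. It uses synthetic data rather than the CBN series, and it takes the condition number to be $\sqrt{\lambda_{\max}/\lambda_{\min}}$, which is one common definition and an assumption here, since the paper does not state which definition was used.

```r
# Multicollinearity diagnostics from a correlation matrix, on synthetic data.
set.seed(8)
n  <- 31
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)        # strongly related to x1
x3 <- rnorm(n)
X  <- cbind(x1, x2, x3)

R   <- cor(X)                        # correlation matrix of the predictors
vif <- diag(solve(R))                # VIF_j = 1 / (1 - R_j^2)
tol <- 1 / vif                       # tolerance T_j
ev  <- eigen(R, symmetric = TRUE)$values
cn  <- sqrt(max(ev) / min(ev))       # condition number (assumed definition)

print(round(rbind(VIF = vif, T = tol), 3))
print(round(c(min_eigenvalue = min(ev), CN = cn), 3))
```

The diagonal of the inverse correlation matrix equals $1/(1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on the other predictors.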

Identification of Influential Observations

Using five influence measures (Cook's distance, DFFITs, Hadi's influence measure, DFBETAS and COVRATIO), influential observations in the real data are identified using the criteria of Table 2, both when multicollinearity is not controlled (OLS: $k = 0$) and when it is controlled using the selected ridge parameter estimators. The values of the measure criteria are presented in Table 2.

The influential observations identified by the five influence measures in the presence of multicollinearity, and when it is controlled using the selected ridge parameters ($k$), are presented in Table 3. When compared with the values in Table 2, any observation whose calculated influence measure is greater than the criterion value is identified as an influential observation or data point. Cook's distance and Hadi's influence measure performed alike: they failed to identify influential data points when ridge estimators were used to control multicollinearity. DFFITs and COVRATIO identified the single observation 25, both under OLS and when multicollinearity was controlled, while DFBETAS identified data point 29 as well.

Table 1. Result of test for multicollinearity.

Table 2. Influential measures, calculated measure criteria and values obtained.

Table 3. Influential observations identified.

4. Summary and Conclusion

The ridge estimator affects which influential observations are identified. Cook's distance and Hadi's influence measure identified several influential data points in the presence of multicollinearity but failed to identify any once the multicollinearity had been controlled. DFFITs, DFBETAS and COVRATIO identified the same single data point both in the presence of multicollinearity and when it had been controlled. Cook's distance and Hadi's influence measure are very sensitive in the presence of multicollinearity, which led them to identify several influential data points, but they are less sensitive when multicollinearity is controlled, where they failed to identify any data point. DFFITs, DFBETAS and COVRATIO performed better than these two measures and should be used when multicollinearity is controlled.

Appendix I

Algorithm for the R Programme

The model

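$$Y = X\beta + \varepsilon$$

$$Y = \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon$$

Using the unit length scaling shown below:

$$\tilde{Y} = \frac{Y - \bar{y}}{L_y},$$

$$\tilde{X}_j = \frac{X_j - \bar{x}_j}{L_j}, \quad j = 1, 2, \ldots, p$$

where $\bar{y}$ is the mean of $Y$, $\bar{x}_j$ is the mean of $X_j$, and

$$L_y = \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \quad L_j = \sqrt{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}$$

such that $\sum_{i=1}^{n} \tilde{x}_{ij}^2 = 1$, $j = 1, 2, \ldots, p$

We obtain the following model

$$\tilde{Y} = \beta_1 \tilde{X}_1 + \beta_2 \tilde{X}_2 + \cdots + \beta_p \tilde{X}_p + \varepsilon$$

Obtain $A = \tilde{X}'\tilde{X}$

Eigenvalues of $A$: $t_j$

Eigenvectors of $A$: $D$

Confirm that $D'D = I$

Confirm that $D'\tilde{X}'\tilde{X}D = \operatorname{diag}(t_j)$

Obtain $\alpha = D'\beta$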

Obtain $\hat{\sigma}^2 = \dfrac{\sum_{i=1}^{n} e_i^2}{n-p}$, where $e_i$ is the $i$th residual
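The scaling and canonical-form steps listed above can be sketched in R as follows. The data are synthetic and the object names (`unit_length`, `Xs`, `ys`) are hypothetical; the comments mark which line of the algorithm each statement mirrors.

```r
# Unit-length scaling and canonical form, following the steps of the algorithm.
set.seed(9)
n <- 30; p <- 3
z <- rnorm(n)
X <- sapply(1:p, function(j) z + rnorm(n, sd = 0.5))
y <- drop(X %*% c(1, -1, 2) + rnorm(n))

unit_length <- function(v) (v - mean(v)) / sqrt(sum((v - mean(v))^2))
Xs <- apply(X, 2, unit_length)             # scaled predictors, column sums of squares = 1
ys <- unit_length(y)                       # scaled response

A   <- crossprod(Xs)                       # A = X'X in correlation form
eig <- eigen(A, symmetric = TRUE)
t_j <- eig$values                          # eigenvalues of A
D   <- eig$vectors                         # eigenvectors of A

beta   <- drop(solve(A, crossprod(Xs, ys)))   # OLS coefficients on the scaled data
alpha  <- drop(crossprod(D, beta))            # alpha = D' beta
e      <- ys - drop(Xs %*% beta)              # residuals
sigma2 <- sum(e^2) / (n - p)                  # residual mean square

print(round(t_j, 4))
print(round(alpha, 4))
```

With this scaling $A$ is the correlation matrix of the predictors, and the two "confirm" steps reduce to checking that `crossprod(D)` is the identity and that `t(D) %*% A %*% D` is diagonal.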

Methods of estimating ridge parameter k

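1) $\hat{k}_1 = \dfrac{\hat{\sigma}^2}{\hat{\alpha}_i^2}$, Hoerl and Kennard (1970)

where $\hat{\sigma}^2 = \sum_{i=1}^{n} e_i^2/(n-p)$ is the residual mean square estimate of $\sigma^2$ and $\hat{\alpha}_i$ is the $i$th element of $\hat{\alpha}$, which is an unbiased estimator of $\alpha = D'\beta$, where $D$ is the matrix of eigenvectors of $X'X$

2) $\hat{k}_2 = \dfrac{\hat{\sigma}^2}{\left(\prod_{i=1}^{p} \hat{\alpha}_i^2\right)^{1/p}}$, $i = 1, 2, \ldots, p$, Kibria (2003)

3) $\hat{k}_3 = \max\left(\dfrac{\lambda_i \hat{\sigma}^2}{\lambda_i \hat{\alpha}_i^2 + (n-p)\hat{\sigma}^2}\right)$, Alkhamisi et al. (2006)

where $\lambda_i$ is the $i$th eigenvalue of the matrix $X'X$ and $\hat{\sigma}^2 = S^2 = \dfrac{\sum_{i=1}^{n} e_i^2}{n-p}$

4) $\hat{k}_4 = \left(\prod_{i=1}^{p} \dfrac{\lambda_i \hat{\sigma}^2}{(n-p)\hat{\sigma}^2 + \lambda_i \hat{\alpha}_i^2}\right)^{1/p}$, Muniz and Kibria [18]

5) $\hat{k}_5 = \left(\prod_{i=1}^{p} \dfrac{1}{m_i}\right)^{1/p}$

6) $\hat{k}_6 = \left(\prod_{i=1}^{p} m_i\right)^{1/p}$

7) $\hat{k}_7 = \operatorname{median}\left(\dfrac{1}{m_i}\right)$

where $m_i = \sqrt{\dfrac{\hat{\sigma}^2}{\hat{\alpha}_i^2}}$

8) $\hat{k}_8 = \dfrac{2p}{\lambda_{\max}} \sum_{i=1}^{p} \dfrac{\hat{\sigma}^2}{\hat{\alpha}_i^2}$, $i = 1, 2, \ldots, p$, Dorugade [10]

9) $\hat{k}_9 = \left(\prod_{j=1}^{p} w_j\right)^{1/p}$, Uzuke et al. [20]

where the weight $w_j = \dfrac{\ln^2(\hat{\sigma}^2)}{(n-p)\hat{\sigma}^2 + \ln^2(\hat{\alpha}_j^2)}$

10) OLS: $\hat{\beta} = (X'X)^{-1} X'Y$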

Methods of detecting influential observation

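Method 1 (Cook's distance)

$$C_i = \frac{t_i^2}{p+1} \times \frac{h_{ii}}{1-h_{ii}}$$

The criterion is

$$C_i > F_{0.05}(p+1,\, n-p-1)$$

where $h_{ii}$ is the $i$th diagonal element of $X(X'X + kI)^{-1}X'$ and $t_i = \dfrac{e_i}{\hat{\sigma}\sqrt{1-h_{ii}}}$

Method 2 (DFFITs)

$$\mathrm{DFFITs}_i = t_i^{*}\sqrt{\frac{h_{ii}}{1-h_{ii}}}, \quad i = 1, 2, \ldots, n$$

The criterion is

$$|\mathrm{DFFITs}_i| > 2\sqrt{\frac{p+1}{n-p-1}}$$

where $t_i^{*}$ is the R-residual defined as $t_i^{*} = t_i\sqrt{\dfrac{n-p-1}{n-p-t_i^2}}$ and $h_{ii}$ is the $i$th diagonal element of $X(X'X + kI)^{-1}X'$

Method 3 (Hadi measure)

$$H_i = \frac{h_{ii}}{1-h_{ii}} + \frac{p+1}{1-h_{ii}} \cdot \frac{d_i^2}{1-d_i^2}, \quad i = 1, 2, \ldots, n$$

where $d_i = e_i/\sqrt{SSE}$ is called the normalized residual.

Method 4 (DFBETAS)

$$\mathrm{DFBETAS}_{ij} = \frac{b_j - b_{j(i)}}{\sqrt{s_{(i)}^2 \,(X'X + kI)^{-1}_{jj}}}$$

The criterion is $|\mathrm{DFBETAS}_{ij}| > \dfrac{2}{\sqrt{n}}$

Method 5 (COVRATIO)

$$\mathrm{COVRATIO}_i = \left(\frac{n - p - t_i^2}{n - p - 1}\right)\frac{1}{1-h_{ii}}$$

The criterion is

$$|\mathrm{COVRATIO}_i - 1| > \frac{3p}{n}$$

where $h_{ii}$ is the $i$th diagonal element of $X(X'X + kI)^{-1}X'$ and $t_i = \dfrac{e_i}{\hat{\sigma}\sqrt{1-h_{ii}}}$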
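All five methods above depend on the ridge leverage $h_{ii}$, so it may help to see how it responds to $k$. The R sketch below computes the diagonal of $X(X'X + kI)^{-1}X'$ for $k = 0$ and for an arbitrary $k = 0.5$ on synthetic data; the helper name `ridge_leverage` and the value of $k$ are assumptions.

```r
# Ridge leverages h_ii = diag(X (X'X + kI)^{-1} X') for two values of k.
set.seed(10)
n <- 30; p <- 4
z <- rnorm(n)
X <- scale(sapply(1:p, function(j) z + rnorm(n, sd = 0.3)))  # correlated, standardized

ridge_leverage <- function(X, k) {
  diag(X %*% solve(crossprod(X) + k * diag(ncol(X)), t(X)))
}

h0 <- ridge_leverage(X, 0)      # OLS leverages, which sum to p
h1 <- ridge_leverage(X, 0.5)    # ridge leverages for k = 0.5

print(c(trace_k0 = sum(h0), trace_k0.5 = sum(h1)))
print(summary(h0 / (1 - h0) - h1 / (1 - h1)))   # drop in the potential function
```

Every ridge leverage is no larger than its OLS counterpart, so the potential term $h_{ii}/(1-h_{ii})$ used by Cook's distance and Hadi's measure shrinks as $k$ grows. This sketch shows only the direction of that effect, not its size on the CBN data.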

Appendix II

R Codes for Detecting Influential Observation for Different k Values

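# Assumptions not stated in the original listing: rr is a data frame holding
# the response V1 and the predictors, already scaled as in Appendix I (so no
# intercept is fitted), and k is the vector of the nine ridge parameters.
# The ridge fit is computed directly from Equation (1) instead of lmridge().
y  <- rr$V1
xx <- as.matrix(rr[, names(rr) != "V1"])
n  <- nrow(xx)                                    # 30 in the original listing
p  <- ncol(xx)                                    # 10, so p + 1 = 11
ridge_coef <- function(x, y, kk) solve(crossprod(x) + kk * diag(ncol(x)), crossprod(x, y))
out <- vector("list", length(k))
for (m in seq_along(k)) {
  G   <- solve(crossprod(xx) + k[m] * diag(p))    # (X'X + kI)^{-1}
  h   <- diag(xx %*% G %*% t(xx))                 # leverages h_ii
  b   <- ridge_coef(xx, y, k[m])                  # coefficients from the full data
  r   <- c(y - xx %*% b)                          # residuals e_i from the full data
  sig <- sum(r^2) / (n - p)                       # residual mean square
  ti  <- r / sqrt(sig * (1 - h))                  # t_i
  ts  <- ti * sqrt((n - p - 1) / (n - p - ti^2))  # R-residual t_i*
  d2  <- r^2 / sum(r^2)                           # squared normalized residual d_i^2
  DFB <- numeric(n)
  for (i in seq_len(n)) {                         # delete one observation at a time
    b1   <- ridge_coef(xx[-i, ], y[-i], k[m])
    r1   <- c(y[-i] - xx[-i, ] %*% b1)
    sig1 <- sum(r1^2) / (n - 1 - p)
    DFB[i] <- (b[3] - b1[3]) / sqrt(sig1 * G[3, 3])   # third coefficient, as in the original
  }
  out[[m]] <- data.frame(
    C   = ti^2 / (p + 1) * h / (1 - h),
    DF9 = ts * sqrt(h / (1 - h)),
    H   = h / (1 - h) + (p + 1) / (1 - h) * d2 / (1 - d2),
    DFB = DFB,
    COV = ((n - p - ti^2) / (n - p - 1)) / (1 - h)
  )
}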

Appendix III

Table A1. Nigerian economic indicator (1980-2010) data.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] Belsley, A., Kuh, E. and Welsch, R. (1989) Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley and Sons, New York.
[2] Johnson, R.A. and Wichern, D.W. (2002) Applied Multivariate Statistical Analysis. Pearson Education, Delhi.
[3] Mickey, M.R., Dunn, O.J. and Clark, V. (1976) Note on the Use of Stepwise Regression in Detecting Outliers. Computers and Biomedical Research, 1, 105-111.
https://doi.org/10.1016/0010-4809(67)90009-2
[4] Andrews, D.F. and Pregibon, D. (1978) Finding the Outliers That Matter. Journal of the Royal Statistical Society, Series B, 40, 85-93.
https://doi.org/10.1111/j.2517-6161.1978.tb01652.x
[5] Hoerl, A.E. and Kennard, R.W. (1970) Ridge Regression: Biased Estimation for Non-Orthogonal Problems. Technometrics, 12, 55-67.
https://doi.org/10.1080/00401706.1970.10488634
[6] Hoerl, A.E., Kennard, R.W. and Baldwin, K.F. (1975) Ridge Regression: Some Simulations. Communications in Statistics, 4, 105-123.
https://doi.org/10.1080/03610917508548342
[7] Lawless, J.F. and Wang, P. (1976) A Simulation Study of Ridge and Other Regression Estimators. Communications in Statistics A, 5, 307-323.
https://doi.org/10.1080/03610927608827353
[8] Nomura, M. (1988) On the Almost Unbiased Ridge Regression Estimation. Communications in Statistics—Simulation and Computation, 17, 729-743.
https://doi.org/10.1080/03610918808812690
[9] Khalaf, G. and Shukur, G. (2005) Choosing Ridge Regression Parameters for Regression Problems. Communications in Statistics—Simulation and Computations, 32, 419-435.
[10] Dorugade, A. (2014) New Ridge Parameters for Ridge Regression. Journal of the Association of Arab Universities for Basic and Applied Sciences, 15, 94-99.
https://doi.org/10.1016/j.jaubas.2013.03.005
[11] Al-Hassan, Y.M. (2010) Performance of a New Ridge Regression Estimator. Journal of the Association of Arab Universities for Basic and Applied Science, 9, 23-26.
https://doi.org/10.1016/j.jaubas.2010.12.006
[12] Dorugade, A.V. and Kashid, D.N. (2010) Alternative Methods for Choosing Ridge Parameter for Regression. Applied Mathematical Science, 4, 447-456.
[13] Saleh, A.K.Md. and Kibria, B.M.G. (1993) Performances of Some New Preliminary Test Ridge Regression Estimators and Their Properties. Communication in Statistics—Theory and Methods, 22, 2747-2764.
https://doi.org/10.1080/03610929308831183
[14] Kibria, B.M.G. (2003) Performance of Some New Ridge Regression Estimators. Communications in Statistics—Simulation and Computation, 32, 417-435.
https://doi.org/10.1081/SAC-120017499
[15] Zang, J. and Ibrahim, M. (2005) A Simulation Study on SPSS Ridge Regression and Ordinary Least Square Regression Procedures for Multicollinearity Data. Journal of Applied Statistics, 32, 571-588.
https://doi.org/10.1080/02664760500078946
[16] Alkhamisi, M., Khalaf, S. and Shukur, G. (2006) Some Modifications for Choosing Ridge Parameters. Communications in Statistics—Theory and Methods, 37, 544-564.
https://doi.org/10.1080/03610920701469152
[17] Al-Hassan, Y.M. (2008) A Monte Carlo Evaluation of Some Ridge Estimators. Japan Journal of Applied Science: Natural Science Series, 10, 101-110.
[18] Muniz, G. and Kibria, B.M.G. (2009) On Some Ridge Regression Estimators: An Empirical Comparison. Communications in Statistics—Simulations and Computation, 38, 621-630.
https://doi.org/10.1080/03610910802592838
[19] Khalaf, G. and Iguernane, M. (2014) Ridge Regression and Ill-Conditioning. Journal of Modern Applied Statistical Methods, 13, 355-363.
https://doi.org/10.22237/jmasm/1414815420
[20] Uzuke, C.A., Mbegbu, J.I. and Nwosu, C.R. (2017) Performance of Kibria, Khalaf and Shukur’s Methods When the Eigenvalues Are Skewed. Communications in Statistics—Simulation and Computation, 46, 2071-2102.
https://doi.org/10.1080/03610918.2015.1035444
[21] Cook, R. (1977) Detection of Influential Observations in Linear Regression. Technometrics, 19, 15-18.
https://doi.org/10.1080/00401706.1977.10489493
[22] Welsch, R. and Kuh, E. (1977) Linear Regression Diagnostics. Technical Report, Sloan School of Management, Massachusetts Institute of Technology, Cambridge, 923-977.
https://doi.org/10.3386/w0173
[23] Hadi, A. (1992) A New Measure of Overall Potential Influence in Linear Regression. Computational Statistics and Data Analysis, 14, 1-27.
https://doi.org/10.1016/0167-9473(92)90078-T

Copyright © 2024 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.