Revisiting Akaike’s Final Prediction Error and the Generalized Cross Validation Criteria in Regression from the Same Perspective: From Least Squares to Ridge Regression and Smoothing Splines

Abstract

In regression, despite both being aimed at estimating the Mean Squared Prediction Error (MSPE), Akaike’s Final Prediction Error (FPE) and the Generalized Cross Validation (GCV) selection criteria are usually derived from two quite different perspectives. Here, settling on the most commonly accepted definition of the MSPE as the expectation of the squared prediction error loss, we provide theoretical expressions for it, valid for any linear model (LM) fitter, be it under a random or a non-random design. Specializing these MSPE expressions, we derive closed formulas of the MSPE for some of the most popular LM fitters: Ordinary Least Squares (OLS), with or without a full column rank design matrix; Ordinary and Generalized Ridge Regression, the latter embedding smoothing splines fitting. For each of these LM fitters, we then deduce a computable estimate of the MSPE which turns out to coincide with Akaike’s FPE. Using a slight variation, we similarly get a class of MSPE estimates coinciding with the classical GCV formula for those same LM fitters.

1. Introduction

In many branches of activity, the data analyst is confronted with the need to model a continuous numeric variable Y (the response) in terms of one or more other explanatory variables (called covariates, predictors or regressors) $X_1,\ldots,X_p$ in a population Ω through a model

$$Y=\Phi(X)+\varepsilon, \tag{1}$$

where¹: $X=(X_1,\ldots,X_p)^{T}$; $\Phi(\cdot)$ is a generally unknown function from $\mathbb{R}^p$ to $\mathbb{R}$, called the regression function; $\varepsilon$ is an unobserved error term, also called the residual error in the model (1).

In our developments in this paper, and contrary to popular tradition, we will not need that the variables X and ε necessarily be stochastically independent. However, the usual minimal assumption for (1) is that the variables X and ε satisfy:

($\mathcal{A}_0$). $E(\varepsilon\mid X)=0$.

Though sometimes far more debatable, we will also admit that the postulated regression model (1) satisfies the homoscedasticity assumption for the residual error variance:

($\mathcal{A}_1$). $\mathrm{Var}(\varepsilon\mid X)=\sigma^2>0$.

Now, as is well known, Assumption ( A 0 ) implies that

$$\Phi(X)=E(Y\mid X), \quad\text{i.e.}\quad \forall x\in\mathbb{R}^p,\ \Phi(x)=E(Y\mid X=x). \tag{2}$$

However, that conditional expectation function can almost never be analytically computed in practical situations. The aim of the regression analysis of Y | X (i.e. “Y given X”) is rather to estimate the unknown regression function $\Phi(\cdot)$ in (1) based on some observed data

$$(x_1,y_1),\ldots,(x_n,y_n)\in\mathbb{R}^p\times\mathbb{R}, \quad\text{i.i.d.}\sim\mathcal{D}(X,Y), \tag{3}$$

collected on a sample of size n drawn from Ω. If achieved, this will result in a computable function $\hat\Phi:\mathbb{R}^p\to\mathbb{R}$, so that the final practical model used to express the response variable Y in terms of the vector of predictors X will be:

$$Y=\hat\Phi(X)+\tilde\varepsilon, \tag{4}$$

where $\tilde\varepsilon$ is the residual error in the modeling. But once we get such a fit for the regression model (1), an obvious question arises: how can one measure the accuracy of that computed fit (its so-called goodness-of-fit)?

For some specific regression models (generally parametric ones and, most notably, the LM fitted by OLS with a full column rank design matrix), various measures of accuracy of their fit to given data, with closed form formulas, have been developed. But such specific and easily computable measures of accuracy are not universally applicable to all regression models, so they cannot be used to compare two arbitrarily fitted such models. In contrast, in the late 1960s, Akaike decisively introduced an approach (in a much broader context including time series modeling) which has that desirable universal feature [1], and is thus now generally recommended to compare different regression fits for the same data, albeit, sometimes, at some computational cost. It is based on estimating the prediction error of the fit. But various definitions and measures of that error have been used. By far the most popular is the Mean Squared Prediction Error (MSPE), but there appears to be a myriad of ways of defining it and/or constructing theoretical estimates of it. A detailed lexicon on the matter is given in [2] and the references therein.

These various definitions and theoretical estimates of the MSPE are, undoubtedly, insightful in their respective motivations and aims, but they ultimately make the subject of prediction error assessment utterly complicated and confusing for most non-expert users. This is compounded by the fact that many of these definitions and discussions of prediction error liberally mix sample-based estimates (i.e. finite averages over sample items) with intrinsic numeric characteristics of the whole population (expressed through expectations of some random variable). In that respect, while both are aimed at estimating the MSPE, the respective classical derivations of Akaike’s Final Prediction Error (FPE) and Craven and Wahba’s Generalized Cross Validation (GCV) stem from two quite different perspectives [1] [3]. This makes it hard for the non-expert user to grasp that these two selection criteria might even be related in the first place.

The first purpose of this paper is to settle on the definition of the MSPE most commonly known by users to assess the prediction power of any fitted model, be it for regression or not. Then, in that framework, we provide a conceptually simpler derivation of the FPE as an estimate of that MSPE in an LM fitted by OLS when the design matrix has full column rank. Secondly, we build on that to derive generalizations of the FPE for the LM fitted by other well-known methods under various scenarios, generalizations seldom accessible from the traditional derivation of the FPE. Finally, we show that, in that same unified framework, a minor variation in the derivation of the MSPE estimates yields the well-known formula of the GCV for all these various LM fitters. For the latter selection criterion, previous attempts [4] [5] have been made to provide a derivation of it different from the classical one as an approximation of the leave-one-out cross validation (LOO-CV) estimate of the MSPE. We view our approach as a more straightforward and inclusive derivation of the GCV score.

To achieve that, we start, in Section 2, by reviewing the prediction error viewpoint for assessing how a fitted regression model performs and by settling on the definition of the MSPE most commonly known by users to measure that performance (while briefly recalling the alternative one most often given in regression textbooks). Then, in Section 3, focusing specifically on the LM, we provide theoretical expressions of that MSPE measure valid for any arbitrary LM fitting method, be it under a random or a non-random design. In the next sections, these expressions are successively specialized to some of the best known LM fitters to deduce, for each, computable estimates of the MSPE, a class of which yields the FPE while a slight variation yields the GCV. Those LM fitters include: OLS, with or without a full column rank design matrix (Section 4), and Ridge Regression, both Ordinary and Generalized (Section 5), the latter embedding smoothing splines fitting. Finally, Section 6 draws a conclusion and suggests some needed future work.

As customary, we summarize the data (3) through the design matrix and the response vector:

$$\mathbf{X}=(x_1,\ldots,x_n)^{T}\in\mathcal{M}_{n,p}(\mathbb{R}) \quad\text{and}\quad \mathbf{y}=(y_1,\ldots,y_n)^{T}\in\mathbb{R}^n. \tag{5}$$

It is important to emphasize that technically, each observed response y i is a realization of a random variable. Consequently, the same will be true of the vector y . On the other hand, the matrix X is considered as fixed or random, depending on whether the covariates X 1 , , X p are viewed as fixed or random variables. According to which one holds, one talks of fixed design or random design. Our presentation will encompass both scenarios. For convenience, we will use the same symbol to denote a random variable and its realization. Nonetheless, the distinction between the latter two will be apparent from context.

2. Prediction Error of a Fitted Regression Model

2.1. The Prediction Error Viewpoint for Assessing a Fitted Regression Model

Using the data (3), assume we got an estimate $\hat\Phi(\cdot)$ of the unknown regression function $\Phi(\cdot)$ in the regression model (1). Thus, we fitted the model (1) to express the response Y in terms of the vector of covariates $X=(X_1,\ldots,X_p)$ in the population under study Ω. From a predictive perspective, assessing the goodness-of-fit of that fitted model amounts to answering the question: how well is the fitted model likely to predict the response variable Y on a future individual based on its covariate values $X=(X_1,\ldots,X_p)$?

To try to formalize an answer, consider a new member of Ω, drawn independently from the sample of Ω which produced the data (3), and assume we have observed its covariate vector $X=X_0\in\mathbb{R}^p$, but not its response value $Y=Y_0$. Nonetheless, from the former, we want to get an idea of the unobserved value of the latter. This is the prediction problem of Y | X (“Y given X”) on that individual. With model (4) fitted to the response variable Y, it seems natural to predict the unknown value $Y=Y_0$ on that individual, given $X=X_0$, by:

$$\hat Y_0=\hat\Phi(X_0). \tag{6}$$

But there is, obviously, an error attached to such a prediction of the value $Y_0$ by $\hat Y_0$, called the prediction error or predictive risk of $\hat Y_0$ w.r.t. $Y_0$.

2.2. The Mean Squared Prediction Error

The prediction error of Y ^ 0 w.r.t. Y 0 needs to be assessed beforehand to get an idea of the quality of the fit provided by the estimated regression model (4) for Y | X in the population under study. To that aim, the global measure mostly used is the Mean Squared Prediction Error (MSPE), but one meets a bewildering number of ways of defining it in the literature [2] , creating a bit of confusion in the mindset of the daily user of Statistics. Yet, most users would accept as definition of the MSPE:

$$\mathrm{MSPE}(\hat Y_0,Y_0)=E(\hat Y_0-Y_0)^2, \tag{7}$$

an intrinsic characteristic of the population which we need to estimate from the available data (3).

In this paper, an unconditional expectation like the r.h.s. of (7) is meant to integrate over all the random variables involved in the integrand. A conditional expectation, on the other hand, acts the same, except for the conditioning random variable(s). As such, en route to getting MSPE ( Y ^ 0 , Y 0 ) , we will pass successively through:

$$\mathrm{MSPE}(\hat Y_0,Y_0\mid\mathbf{X},\mathbf{y},X_0)=E[(\hat Y_0-Y_0)^2\mid\mathbf{X},\mathbf{y},X_0], \tag{8a}$$

$$\mathrm{MSPE}(\hat Y_0,Y_0\mid\mathbf{X},\mathbf{y})=E[(\hat Y_0-Y_0)^2\mid\mathbf{X},\mathbf{y}], \tag{8b}$$

$$\mathrm{MSPE}(\hat Y_0,Y_0\mid\mathbf{X})=E[(\hat Y_0-Y_0)^2\mid\mathbf{X}], \tag{8c}$$

representing, each, the MSPE conditioned on some information, which might be relevant in its own right. In particular, MSPE ( Y ^ 0 , Y 0 | X ) is the relevant definition of the MSPE in the case of a fixed design. But, as a consequence of moving successively through (8a)-(8c) to get (7), handling the fixed design case will not need a special treatment because computing (8c) will be a necessary intermediate step.

Obviously, for most fitted regression models, trying to compute $\mathrm{MSPE}(\hat Y_0,Y_0)$ or $\mathrm{MSPE}(\hat Y_0,Y_0\mid\mathbf{X})$ exactly is a hopeless task. However, it is known that K-fold cross validation (CV) can be used to estimate these quantities in a nearly universal manner, imposing no distributional assumption and using a quasi-automatic algorithm [6]. Nonetheless, cross validation has its own defects, such as a high extra computational cost², the impact of the choice of K and, generally, an upward bias. As for the latter defect, Borra and Di Ciaccio [2] showed in an extensive simulation study that Repeated Corrected CV, a little-publicized correction to K-fold cross validation developed by Burman [7], performed quite well and outperformed, on some well-known nonlinear regression fitters and under a variety of scenarios, all the other, technically more involved, measures of the MSPE alluded to above.
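
As an illustration of how such a CV estimate is obtained in practice, here is a minimal Python/NumPy sketch of the K-fold CV estimate of the MSPE (7) for an arbitrary fitter. The `fit_predict` interface and the OLS example fitter are illustrative assumptions of ours, not part of the methods discussed in this paper.

```python
import numpy as np

def kfold_cv_mspe(X, y, fit_predict, K=10, seed=0):
    """K-fold cross-validation estimate of the MSPE (7) for an arbitrary fitter.

    `fit_predict(X_train, y_train, X_test)` must return predictions on X_test;
    it stands for any regression fitter (an assumed interface for this sketch)."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, K)
    sq_err = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        y_hat = fit_predict(X[train], y[train], X[test])
        sq_err.append((y[test] - y_hat) ** 2)
    return float(np.mean(np.concatenate(sq_err)))

# Example plug-in fitter (OLS), purely for illustration:
def ols_fit_predict(X_tr, y_tr, X_te):
    beta_hat, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    return X_te @ beta_hat
```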

Fortunately, after fitting a Linear Model to given data by a chosen method, there is, at least for the most common methods, no need to shoulder the computational cost attached to CV to estimate the MSPE of the fitted model. It is the purpose of this article to show that for the best-known LM fitters, one can transform (7)-(8c) into a form that allows one to deduce, in a straightforward manner, two estimates of the MSPE computable from the data, which turn out to coincide, respectively, with the FPE and the GCV. This, therefore, yields a new way to obtain these two selection criteria in linear regression, completely different from how they are usually motivated and derived.

2.3. Prediction Error and Sources of Randomness in the Regression Process

The rationale behind settling on (7) or (8c) as the definition of the MSPE lies in the fact that, from our standpoint, any global measure of the prediction error in the population computed from a sample collected in it must account both for the randomness in drawing that sample and for that in drawing a future individual. Consequently, each of the expectations in (7)-(8c) should be understood as integrating over all the possible sources of randomness entering in the process of computing the prediction $\hat Y_0$ of $Y_0$, save the conditioning variables, if any:

1) the realized, but unobserved random n-vector ε = ( ε 1 , , ε n ) Τ of sample residual errors in the model (1) for the data (3). This vector results from the fact that for the observed sample of n individuals, the model (1) implies that

$$y_i=\Phi(x_i)+\varepsilon_i,\quad i=1,\ldots,n, \quad\text{with } \varepsilon_1,\ldots,\varepsilon_n\ \text{i.i.d.}\sim\mathcal{D}_\varepsilon; \tag{9}$$

2) $X_0\in\mathbb{R}^p$, the vector of covariates for the potential newly and independently drawn individual for which the unknown value $Y_0$ of the response Y is to be predicted;

3) $\varepsilon_0$, the error of the model (1) for that new individual: $Y_0=\Phi(X_0)+\varepsilon_0$, with $\varepsilon_0\sim\mathcal{D}_\varepsilon$;

4) and, in case of a random design, the entries in the design matrix X .

The key assumption to assess the prediction error of a regression model is then:

($\mathcal{A}_2$). The random couple $(\varepsilon_0,X_0)\sim\mathcal{D}(\varepsilon,X)$ and is independent of $(\boldsymbol\varepsilon,\mathbf{X})$.

2.4. Measuring the Prediction Error in Regression: Sample Based Definition

While the GCV score is classically derived, indeed, as an estimate of the MSPE as given by (7), through a two-stage process whose intermediate step is the well-known LOO-CV estimate of that MSPE, the traditional derivation of the FPE criterion stems from a completely different angle. Actually, the latter angle is the one presented in most textbooks on regression [8] [9]. In it, with the regression function Φ in (1) estimated through $\hat\Phi$, a function computed from the data (3), the prediction error of that fit is rather measured by how well the vector of predicted responses on sample items, $\hat{\mathbf{y}}=(\hat y_1,\ldots,\hat y_n)^{T}$, with $\hat y_i=\hat\Phi(x_i)$, $i=1,\ldots,n$, estimates the vector of exact responses on those items, $\boldsymbol\Phi=(\Phi_1,\ldots,\Phi_n)^{T}$, with $\Phi_i=\Phi(x_i)$, $i=1,\ldots,n$. In that respect, one considers the mean average squared error, also called risk,

$$\mathrm{MASE}(\hat{\mathbf{y}}\mid\mathbf{X})=\frac1n\sum_{i=1}^n E[(\hat y_i-\Phi_i)^2\mid\mathbf{X}], \tag{10}$$

with bias-variance decomposition

$$\mathrm{MASE}(\hat{\mathbf{y}}\mid\mathbf{X})=\frac1n\sum_{i=1}^n \mathrm{Var}(\hat y_i\mid\mathbf{X})+\frac1n\sum_{i=1}^n b_i^2, \tag{11}$$

where $b_i=E(\hat y_i\mid\mathbf{X})-\Phi_i$ is the conditional bias, given $\mathbf{X}$, in estimating $\Phi_i$ by $\hat y_i$.

But the relation to the more obvious definition (8c) of the conditional MSPE is better seen when one considers, instead, the average predicted squared error ([8], Chapter 3, page 42) or prediction risk ([9], Chapter 2, page 29):

$$\mathrm{PSE}(\hat{\mathbf{y}}\mid\mathbf{X})=\frac1n\sum_{i=1}^n E[(y_i^*-\hat y_i)^2\mid\mathbf{X}], \tag{12}$$

where $y_1^*,\ldots,y_n^*$ are putative responses assumed generated at the respective predictor values $x_1,\ldots,x_n$ through model (1), but with respective errors $\varepsilon_1^*,\ldots,\varepsilon_n^*$ independent of the initial ones $\varepsilon_1,\ldots,\varepsilon_n$. Nonetheless, there is a simple relation between (10) and (12):

$$\mathrm{PSE}(\hat{\mathbf{y}}\mid\mathbf{X})=\sigma^2+\mathrm{MASE}(\hat{\mathbf{y}}\mid\mathbf{X}), \tag{13}$$

hence minimizing $\mathrm{PSE}(\hat{\mathbf{y}}\mid\mathbf{X})$ w.r.t. $\hat\Phi$ is the same as doing so for $\mathrm{MASE}(\hat{\mathbf{y}}\mid\mathbf{X})$.

In its classical derivation for linear regression (see, e.g., [10], pages 19-20), the FPE selection criterion is an estimate of $\mathrm{PSE}(\hat{\mathbf{y}}\mid\mathbf{X})$. Now, with the terminology elaborated in [5], measuring the prediction error by the latter amounts to using a Fixed-X viewpoint, as opposed to the Random-X one when measuring it instead through $\mathrm{MSPE}(\hat Y_0,Y_0)$. But an even more common terminology to distinguish these two approaches to estimating the predictive performance of a regression method qualifies the first as in-sample prediction and the second as out-of-sample prediction. It should be said that while in the past the prediction error was mostly evaluated using the in-sample paradigm, facing the complexity of data met in modern statistics, notably high-dimensional data, many researchers in regression have advocated or used the out-of-sample viewpoint, though this might be through either (7), (8b), or (8c), depending on the author(s). In that respect, in addition to the aforementioned paper of Rosset and Tibshirani, we may cite, e.g., Breiman and Spector [11], Leeb [12], Dicker [13], Dobriban and Wager [14].

Note, however, that the prediction error viewpoint in assessing the quality of a regression fit is not without its own demerits. Indeed, in [15] and [16], it is highlighted that in the specific case of smoothing splines regression, one can find a fit which is optimal from the prediction error viewpoint, but which clearly undersmooths the data, resulting in a wiggly curve. But the argument appears to be more a matter of visual aesthetics, because the analysis in those papers targets the regression function Φ, which is, indeed, probably the main objective of many users of univariate nonparametric regression. Nonetheless, when measuring the prediction error through $\mathrm{MSPE}(\hat Y_0,Y_0)$, the target is rather the response Y, i.e. Φ + error, in which the wiggliness is inherently embedded. Do not forget that when formulating his Final Prediction Error, Akaike was working on real-world engineering problems [17]; hence his interest in targeting the actual output in his predictions.

3. MSPE of a Linear Model Fitter

From now on, we focus attention on the generic LM, with $\beta=(\beta_1,\ldots,\beta_p)^{T}$:

$$Y=\beta_1X_1+\cdots+\beta_pX_p+\varepsilon=X^{T}\beta+\varepsilon, \tag{14}$$

to be fitted to the data (3). Because of its tractability and ease of manipulation, the LM is the most popular approach to estimating the regression function $\Phi(\cdot)$ in (1). It is mostly implemented by estimating β through the Ordinary Least Squares criterion. However, several other approaches have been designed to estimate β on various grounds, such as Weighted Least Squares, Total Least Squares, Least Absolute Deviations, the LASSO [18] and Ridge Regression [19]. Furthermore, some generally more adaptive nonparametric regression methods proceed by first nonlinearly transforming the data to a scale where they can be fitted by a LM. Due to its ever more central role in statistical modeling and practice, several books have been, and continue to be, fully devoted to the presentation of the LM and its many facets, such as [20] [21] [22] and [23].

3.1. Some Preliminaries

For the sample of n individuals with recorded data (3), the general regression Equation (9) becomes, in the case of the LM (14):

$$y_i=x_i^{T}\beta+\varepsilon_i,\quad i=1,\ldots,n, \tag{15}$$

or, better globally summarized in matrix form,

$$\mathbf{y}=\mathbf{X}\beta+\boldsymbol\varepsilon, \quad\text{with } \boldsymbol\varepsilon=(\varepsilon_1,\ldots,\varepsilon_n)^{T}. \tag{16}$$

Then Assumptions ( A 0 ) and ( A 1 ) respectively imply here:

$$E(\boldsymbol\varepsilon\mid\mathbf{X})=\mathbf{0} \quad\text{and}\quad \mathrm{Mcov}(\boldsymbol\varepsilon\mid\mathbf{X})=\sigma^2 I_n, \tag{17}$$

with I n the n-by-n identity matrix. Consequently,

$$E(\mathbf{y}\mid\mathbf{X})=\mathbf{X}\beta \quad\text{and}\quad \mathrm{Mcov}(\mathbf{y}\mid\mathbf{X})=\sigma^2 I_n. \tag{18}$$

Fitting the LM (14) boils down to estimating the p-vector of parameters β. We call a Linear Model fitter, or LM fitter, any method that achieves this. First, we consider an arbitrarily chosen such method. It uses the data (3) to compute $\hat\beta\in\mathbb{R}^p$, an estimate of β. The precision of that estimate³ can be assessed through its Mean Squared Error matrix:

$$\mathrm{Mse}(\hat\beta)=E[(\hat\beta-\beta)(\hat\beta-\beta)^{T}]. \tag{19a}$$

But to reach M se ( β ^ ) generally requires passing through the conditional Mean Squared Error matrix of β ^ given the design matrix X :

$$\mathrm{Mse}(\hat\beta\mid\mathbf{X})=E[(\hat\beta-\beta)(\hat\beta-\beta)^{T}\mid\mathbf{X}]. \tag{19b}$$

The relationship between the two is:

$$\mathrm{Mse}(\hat\beta)=E_{\mathbf{X}}[\mathrm{Mse}(\hat\beta\mid\mathbf{X})]. \tag{19c}$$

Those two matrices will play a key role in our MSPE derivations to come.

The precision of an estimate of a vector parameter like β is easier to assess when its Mean Squared Error matrix coincides with its covariance matrix. Hence, our interest in:

Definition 1. In the LM (14), $\hat\beta$ is an unbiased estimate of β, conditionally on $\mathbf{X}$, if $E(\hat\beta\mid\mathbf{X})=\beta$.

Then one has $\mathrm{Mse}(\hat\beta\mid\mathbf{X})=\mathrm{Mcov}(\hat\beta\mid\mathbf{X})$, the conditional covariance matrix of $\hat\beta$ given $\mathbf{X}$.

Since E ( β ^ ) = E X [ E ( β ^ | X ) ] , it is immediate that if β ^ is an unbiased estimate of β , conditionally on X , then E ( β ^ ) = β , i.e. β ^ is an unbiased estimate of β (unconditionally). So the former property is stronger than the latter, but is more useful in this context. Note also that it implies: M se ( β ^ ) = M cov ( β ^ ) , the covariance matrix of β ^ . On the other hand, when β ^ is a biased estimate of β , the bias-variance decomposition of M se ( β ^ ) might be of interest:

$$\mathrm{Mse}(\hat\beta)=\mathrm{Bias}(\hat\beta)\,\mathrm{Bias}(\hat\beta)^{T}+\mathrm{Mcov}(\hat\beta), \tag{20}$$

where $\mathrm{Bias}(\hat\beta)=E(\hat\beta)-\beta$.

3.2. MSPE of a Linear Model Fitter: Base Results

In the prediction setting of Section 2.1 applied to the LM (14), one has, for the new individual:

$$Y_0=X_0^{T}\beta+\varepsilon_0, \tag{21}$$

with ε 0 D ε , but unknown. It would then be natural to predict the response value Y 0 by

$$\tilde Y_0=X_0^{T}\beta=\Phi(X_0), \tag{22}$$

were the exact value of the parameter vector β available. Since that is not typically the case, one rather predicts Y 0 by the computable quantity

$$\hat Y_0=X_0^{T}\hat\beta=\hat\Phi(X_0). \tag{23}$$

The goal here is to find expressions, in this context, for the MSPE of Y ^ 0 w.r.t. Y 0 as given by (7)-(8c), manageable enough to allow the derivation of computable estimates of that MSPE for the most common LM fitters. The starting point to get such MSPE expressions is the base result:

Theorem 1 For any β ^ estimating β in the LM (14), one has, under ( A 0 ), ( A 1 ), ( A 2 ):

$$\mathrm{MSPE}(\hat Y_0,Y_0\mid\mathbf{X},\mathbf{y},X_0)=\sigma^2+\mathrm{tr}[(\hat\beta-\beta)(\hat\beta-\beta)^{T}X_0X_0^{T}], \tag{24a}$$

$$\mathrm{MSPE}(\hat Y_0,Y_0\mid\mathbf{X},\mathbf{y})=\sigma^2+\mathrm{tr}[(\hat\beta-\beta)(\hat\beta-\beta)^{T}E(X_0X_0^{T})], \tag{24b}$$

$$\mathrm{MSPE}(\hat Y_0,Y_0\mid\mathbf{X})=\sigma^2+\mathrm{tr}[\mathrm{Mse}(\hat\beta\mid\mathbf{X})\,E(X_0X_0^{T})], \tag{24c}$$

$$\mathrm{MSPE}(\hat Y_0,Y_0)=\sigma^2+\mathrm{tr}[\mathrm{Mse}(\hat\beta)\,E(X_0X_0^{T})]. \tag{24d}$$

Proof. From $\hat Y_0-Y_0=X_0^{T}(\hat\beta-\beta)-\varepsilon_0$, we first get:

$$(\hat Y_0-Y_0)^2=\varepsilon_0^2-2\varepsilon_0X_0^{T}(\hat\beta-\beta)+[X_0^{T}(\hat\beta-\beta)]^2. \tag{25}$$

Now, from (8a) and using Assumption ( A 2 ),

$$\begin{aligned}\mathrm{MSPE}(\hat Y_0,Y_0\mid\mathbf{X},\mathbf{y},X_0)&=E(\varepsilon_0^2\mid X_0)-2E(\varepsilon_0\mid X_0)\,X_0^{T}(\hat\beta-\beta)+[X_0^{T}(\hat\beta-\beta)]^2\\&=\mathrm{Var}(\varepsilon_0\mid X_0)+[X_0^{T}(\hat\beta-\beta)]^2\\&=\sigma^2+[X_0^{T}(\hat\beta-\beta)]^2,\end{aligned} \tag{26}$$

the latter because $E(\varepsilon_0\mid X_0)=E(\varepsilon\mid X)=0$, and so $E(\varepsilon_0^2\mid X_0)=\mathrm{Var}(\varepsilon_0\mid X_0)=\mathrm{Var}(\varepsilon\mid X)=\sigma^2$. On the other hand, $[X_0^{T}(\hat\beta-\beta)]^2=\mathrm{tr}\{[X_0^{T}(\hat\beta-\beta)]^2\}$ because $[X_0^{T}(\hat\beta-\beta)]^2$ is a scalar. Thus

$$[X_0^{T}(\hat\beta-\beta)]^2=\mathrm{tr}[X_0^{T}(\hat\beta-\beta)(\hat\beta-\beta)^{T}X_0]=\mathrm{tr}[(\hat\beta-\beta)(\hat\beta-\beta)^{T}X_0X_0^{T}],$$

which, inserted in (26), yields (24a).

Thanks to (24a) and identity (8b), we have

$$\begin{aligned}\mathrm{MSPE}(\hat Y_0,Y_0\mid\mathbf{X},\mathbf{y})&=E[(\hat Y_0-Y_0)^2\mid\mathbf{X},\mathbf{y}]=E_{X_0}\{E[(\hat Y_0-Y_0)^2\mid\mathbf{X},\mathbf{y},X_0]\},\ \text{since $X_0$ is independent of $(\mathbf{X},\mathbf{y})$,}\\&=E_{X_0}\{\sigma^2+\mathrm{tr}[(\hat\beta-\beta)(\hat\beta-\beta)^{T}X_0X_0^{T}]\}\\&=\sigma^2+\mathrm{tr}[(\hat\beta-\beta)(\hat\beta-\beta)^{T}E_{X_0}(X_0X_0^{T})]=\sigma^2+\mathrm{tr}[(\hat\beta-\beta)(\hat\beta-\beta)^{T}E(X_0X_0^{T})].\end{aligned}$$

Likewise, (24b) and (8c) give:

$$\begin{aligned}\mathrm{MSPE}(\hat Y_0,Y_0\mid\mathbf{X})&=E[(\hat Y_0-Y_0)^2\mid\mathbf{X}]=E_{\mathbf{y}\mid\mathbf{X}}\,E[(\hat Y_0-Y_0)^2\mid\mathbf{X},\mathbf{y}]\\&=E_{\mathbf{y}\mid\mathbf{X}}\{\sigma^2+\mathrm{tr}[(\hat\beta-\beta)(\hat\beta-\beta)^{T}E(X_0X_0^{T})]\}\\&=\sigma^2+\mathrm{tr}\{E_{\mathbf{y}\mid\mathbf{X}}[(\hat\beta-\beta)(\hat\beta-\beta)^{T}]\,E(X_0X_0^{T})\}=\sigma^2+\mathrm{tr}[\mathrm{Mse}(\hat\beta\mid\mathbf{X})\,E(X_0X_0^{T})].\end{aligned}$$

From relations (24c) and (7), one gets:

$$\begin{aligned}\mathrm{MSPE}(\hat Y_0,Y_0)&=E(\hat Y_0-Y_0)^2=E_{\mathbf{X}}\,E[(\hat Y_0-Y_0)^2\mid\mathbf{X}]=E_{\mathbf{X}}\{\sigma^2+\mathrm{tr}[\mathrm{Mse}(\hat\beta\mid\mathbf{X})\,E(X_0X_0^{T})]\}\\&=\sigma^2+\mathrm{tr}\{E_{\mathbf{X}}[\mathrm{Mse}(\hat\beta\mid\mathbf{X})]\,E(X_0X_0^{T})\}=\sigma^2+\mathrm{tr}[\mathrm{Mse}(\hat\beta)\,E(X_0X_0^{T})].\end{aligned}$$

The above result is interesting in that it imposes no assumption on β ^ , hence it is valid for any LM fitter. But an immediate important subcase is provided in:

Corollary 2 If, conditional on X , β ^ estimates β unbiasedly in the LM (14), then, under Assumptions ( A 0 ), ( A 1 ) and ( A 2 ):

$$\mathrm{MSPE}(\hat Y_0,Y_0\mid\mathbf{X})=\sigma^2+\mathrm{tr}[\mathrm{Mcov}(\hat\beta\mid\mathbf{X})\,E(X_0X_0^{T})], \tag{27a}$$

$$\mathrm{MSPE}(\hat Y_0,Y_0)=\sigma^2+\mathrm{tr}[\mathrm{Mcov}(\hat\beta)\,E(X_0X_0^{T})]. \tag{27b}$$

4. MSPE When Fitting the LM by Ordinary Least Squares

By far, the most popular approach to estimating the parameter vector β in the LM (14) is through minimizing the Ordinary Least Squares (OLS) criterion using the observed data (3):

$$\hat\beta=\hat\beta_{\mathrm{OLS}}=\arg\min_{\beta\in\mathbb{R}^p}\sum_{i=1}^n(y_i-x_i^{T}\beta)^2. \tag{28}$$

The properties of β ^ OLS as an estimate of β depend on whether the design matrix X has full column rank or not. This remains true when studying the corresponding MSPE as well.

4.1. MSPE in the LM Fitted by OLS with X of Full Column Rank

4.1.1. LM Fitted by OLS with X Full Column Rank

Here, we consider the LM (14) fitted under:

($\mathcal{A}_3$). $\mathrm{rank}(\mathbf{X})=p$, i.e. $\mathbf{X}$ is a full column rank matrix.

That assumption is known to be equivalent to saying that the square matrix X Τ X is nonsingular, thus implying that

$$\hat\beta_{\mathrm{OLS}}=(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}. \tag{29}$$

Furthermore, given Assumptions ( A 0 )-( A 1 ), (18) holds, so

$$E(\hat\beta_{\mathrm{OLS}}\mid\mathbf{X})=\beta \quad\text{and}\quad \mathrm{Mcov}(\hat\beta_{\mathrm{OLS}}\mid\mathbf{X})=\sigma^2(\mathbf{X}^{T}\mathbf{X})^{-1}. \tag{30}$$

We also recall that under these assumptions, with $\hat{\mathbf{e}}=\mathbf{y}-\mathbf{X}\hat\beta_{\mathrm{OLS}}$ the residual response vector and $\|\cdot\|$ the Euclidean norm, a computable unbiased estimate of the residual variance $\sigma^2$ is:

$$\hat\sigma^2=\frac{\mathrm{SSR}}{n-p}, \quad\text{with } \mathrm{SSR}=\|\hat{\mathbf{e}}\|^2=\sum_{i=1}^n(y_i-x_i^{T}\hat\beta_{\mathrm{OLS}})^2, \tag{31}$$

the latter being the sum of squared residuals in the OLS fit to the data.

Now, from the first identity in (30), we deduce that when X is full column rank, β ^ OLS is an unbiased estimate of β , conditionally on X . Hence, combining the second identity in (30) with Corollary 2 yields:

Theorem 3 In the LM (14) fitted by OLS with Assumptions ( A 0 ), ( A 1 ), ( A 2 ) and ( A 3 ),

$$\mathrm{MSPE}(\hat Y_0,Y_0\mid\mathbf{X})=\sigma^2\big(1+\mathrm{tr}[(\mathbf{X}^{T}\mathbf{X})^{-1}E(X_0X_0^{T})]\big), \tag{32a}$$

$$\mathrm{MSPE}(\hat Y_0,Y_0)=\sigma^2\big(1+\mathrm{tr}\{E[(\mathbf{X}^{T}\mathbf{X})^{-1}]\,E(X_0X_0^{T})\}\big). \tag{32b}$$

Proof. From (24c),

$$\begin{aligned}\mathrm{MSPE}(\hat Y_0,Y_0\mid\mathbf{X})&=\sigma^2+\mathrm{tr}[\mathrm{Mse}(\hat\beta_{\mathrm{OLS}}\mid\mathbf{X})\,E(X_0X_0^{T})]=\sigma^2+\mathrm{tr}[\mathrm{Mcov}(\hat\beta_{\mathrm{OLS}}\mid\mathbf{X})\,E(X_0X_0^{T})]\\&=\sigma^2+\sigma^2\,\mathrm{tr}[(\mathbf{X}^{T}\mathbf{X})^{-1}E(X_0X_0^{T})]=\sigma^2\big(1+\mathrm{tr}[(\mathbf{X}^{T}\mathbf{X})^{-1}E(X_0X_0^{T})]\big).\end{aligned}$$

Now, using (32a), one gets:

$$\begin{aligned}\mathrm{MSPE}(\hat Y_0,Y_0)&=E_{\mathbf{X}}\,E[(\hat Y_0-Y_0)^2\mid\mathbf{X}]=E_{\mathbf{X}}\big\{\sigma^2\big(1+\mathrm{tr}[(\mathbf{X}^{T}\mathbf{X})^{-1}E(X_0X_0^{T})]\big)\big\}\\&=\sigma^2\big(1+E_{\mathbf{X}}\,\mathrm{tr}[(\mathbf{X}^{T}\mathbf{X})^{-1}E(X_0X_0^{T})]\big)=\sigma^2\big(1+\mathrm{tr}\{E[(\mathbf{X}^{T}\mathbf{X})^{-1}]\,E(X_0X_0^{T})\}\big),\end{aligned}$$

from which the result follows.
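
As a quick sanity check of (32a), one can run a small Monte Carlo experiment; the Gaussian design and error distributions below are assumptions made purely for this illustration (so that $E(X_0X_0^{T})=I_p$), not requirements of Theorem 3.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma2 = 50, 3, 1.0
beta = np.array([1.0, -2.0, 0.5])

# Fixed design X; future covariate vectors X0 drawn i.i.d. N(0, I_p),
# so E(X0 X0^T) = I_p (a distributional assumption for this check only).
X = rng.normal(size=(n, p))
rhs_32a = sigma2 * (1 + np.trace(np.linalg.inv(X.T @ X)))  # r.h.s. of (32a)

mc, R = 0.0, 20000
for _ in range(R):
    y = X @ beta + np.sqrt(sigma2) * rng.normal(size=n)   # sample responses (16)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)          # OLS estimate (29)
    x0 = rng.normal(size=p)                               # new covariate vector X0
    y0 = x0 @ beta + np.sqrt(sigma2) * rng.normal()       # unobserved response Y0 (21)
    mc += (x0 @ beta_hat - y0) ** 2
print(mc / R, "vs", rhs_32a)  # the Monte Carlo average should be close to (32a)
```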

4.1.2. The FPE and the GCV in the LM Fitted by OLS with X of Full Column Rank

From (32b) in Theorem 3, we deduce a closed form computable estimate of MSPE ( Y ^ 0 , Y 0 ) , using data (3), by estimating, respectively:

• the residual variance $\sigma^2$ by $\hat\sigma^2$ given by (31);

• the $p\times p$ expectation matrix $E[(\mathbf{X}^{T}\mathbf{X})^{-1}]$ by the observed $(\mathbf{X}^{T}\mathbf{X})^{-1}$;

• the $p\times p$ expectation matrix $E(X_0X_0^{T})$ by (given that $x_1,\ldots,x_n$ are i.i.d. $\sim\mathcal{D}_{X_0}$):

$$\frac1n\sum_{i=1}^n x_ix_i^{T}=\frac1n\mathbf{X}^{T}\mathbf{X}. \tag{33}$$

Therefore, one estimates MSPE ( Y ^ 0 , Y 0 ) by

$$\widehat{\mathrm{MSPE}}(\hat Y_0,Y_0)=\hat\sigma^2\Big\{1+\frac1n\mathrm{tr}[(\mathbf{X}^{T}\mathbf{X})^{-1}(\mathbf{X}^{T}\mathbf{X})]\Big\}=\hat\sigma^2\Big[1+\frac{\mathrm{tr}(I_p)}{n}\Big]=\hat\sigma^2\Big(1+\frac pn\Big)=\frac{n+p}{n-p}\cdot\frac{\mathrm{SSR}}{n}=\hat S^2\,\frac{n+p}{n-p}=\mathrm{FPE}, \tag{34}$$

with

$$\hat S^2=\mathrm{SSR}/n=\|\hat{\mathbf{e}}\|^2/n=\|\mathbf{y}-\mathbf{X}\hat\beta_{\mathrm{OLS}}\|^2/n, \tag{35}$$

the usual Maximum Likelihood Estimator of the residual variance σ 2 in the LM (14) when the residual error ε is assumed to follow a N ( 0, σ 2 ) Gaussian distribution.

We see that the final estimate $\widehat{\mathrm{MSPE}}(\hat Y_0,Y_0)$ obtained for $\mathrm{MSPE}(\hat Y_0,Y_0)$ coincides with Akaike’s Final Prediction Error (FPE) goodness-of-fit criterion for the LM (14) fitted by OLS [1]. The main difference between the derivation above and the traditional one is that the latter uses the sample viewpoint of the prediction error reviewed in Section 2.4. That viewpoint excludes the possibility that the covariate values of a future individual might be completely unrelated to the observed $x_i$’s in the sample (3). In particular, it does not account for any potential random origin of the design matrix $\mathbf{X}$, a situation often encountered in certain areas of application of the LM, such as econometrics.

On the other hand, estimating, instead, $E(X_0X_0^{T})$ by $\mathbf{X}^{T}\mathbf{X}/(n-p)$ yields as estimate of $\mathrm{MSPE}(\hat Y_0,Y_0)$:

$$\hat\sigma^2\Big(1+\frac{p}{n-p}\Big)=\frac{n\,\mathrm{SSR}}{(n-p)^2}=\frac{\hat S^2}{(1-p/n)^2}=\mathrm{GCV}, \tag{36}$$

the traditional GCV estimate of MSPE ( Y ^ 0 , Y 0 ) in the LM fitted by OLS with X full column rank.
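
For concreteness, here is a minimal Python/NumPy sketch of formulas (34) and (36) for an OLS fit under Assumption ($\mathcal{A}_3$); it is only an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def fpe_gcv_ols(X, y):
    """FPE (34) and GCV (36) for the LM fitted by OLS, X of full column rank."""
    n, p = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # OLS estimate (29)
    ssr = np.sum((y - X @ beta_hat) ** 2)          # SSR in (31)
    s2 = ssr / n                                   # ML variance estimate (35)
    fpe = s2 * (n + p) / (n - p)                   # (34)
    gcv = s2 / (1 - p / n) ** 2                    # (36)
    return fpe, gcv
```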

Remark 1 Note that the very way the two estimates (34) and (36) of MSPE ( Y ^ 0 , Y 0 ) were derived above implies that they can also validly serve, each, as an estimate of the conditional MSPE ( Y ^ 0 , Y 0 | X ) given by (32a). This will remain true for all the estimates derived for the MSPE under the other scenarios examined in this paper.

4.2. MSPE in the LM Fitted by OLS with X Not of Full Column Rank

4.2.1. LM Fitted by OLS with X Column Rank Deficient

Although Assumption ( A 3 ) is routinely admitted by most people when handling the LM, one actually meets many concrete instances of data sets where it does not hold. Fortunately, with the formalization by Moore [24] , Penrose [25] and, especially, Rao [26] of the notion of generalized inverse (short: g-inverse) of an arbitrary matrix, it became possible to handle least squares estimation in the LM without having to assume the design matrix X necessarily of full column rank.

To begin with, it is shown in most textbooks on the LM that whatever the rank of the design matrix $\mathbf{X}$, a vector $\hat\beta\in\mathbb{R}^p$ is a solution to the OLS minimization problem (28) if, and only if, $\hat\beta$ is a solution to the so-called normal equations:

$$\mathbf{X}^{T}\mathbf{X}\hat\beta=\mathbf{X}^{T}\mathbf{y}. \tag{37}$$

When Assumption ($\mathcal{A}_3$) holds, the unique solution to the normal equations is clearly $\hat\beta=\hat\beta_{\mathrm{OLS}}$ given by (29). When that is not the case, the square matrix $\mathbf{X}^{T}\mathbf{X}$ is singular, hence does not have a regular inverse $(\mathbf{X}^{T}\mathbf{X})^{-1}$. Nevertheless, even then it can be shown that the normal equations (37) are always consistent. The apparent drawback is that they then have infinitely many solution vectors $\hat\beta$, namely all vectors $\hat\beta\in\mathbb{R}^p$ of the form:

$$\hat\beta=\hat\beta_{\mathrm{OLS}}=(\mathbf{X}^{T}\mathbf{X})^{-}\mathbf{X}^{T}\mathbf{y}, \tag{38}$$

where $(\mathbf{X}^{T}\mathbf{X})^{-}$ is any g-inverse of $\mathbf{X}^{T}\mathbf{X}$ in the sense of Rao. Given that multitude of possible OLS estimates of β in this case, one may worry that this may hinder any attempt to get a meaningful estimate of the MSPE in the fitted LM. But we are going to show that such a worry is not warranted.

When Assumption ($\mathcal{A}_3$) does not hold, in spite of there being as many solutions $\hat\beta_{\mathrm{OLS}}$ to the normal equations (37) as there are g-inverses $(\mathbf{X}^{T}\mathbf{X})^{-}$ of $\mathbf{X}^{T}\mathbf{X}$, i.e. infinitely many, it is a remarkable and well-known fact that the fitted response vector

$$\hat{\mathbf{y}}=\mathbf{X}\hat\beta_{\mathrm{OLS}}=H_{\mathbf{X}}\mathbf{y}, \quad\text{with } H_{\mathbf{X}}=\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-}\mathbf{X}^{T}, \tag{39}$$

is the same, whichever g-inverse $(\mathbf{X}^{T}\mathbf{X})^{-}$ of $\mathbf{X}^{T}\mathbf{X}$ is used to compute $\hat\beta_{\mathrm{OLS}}$ through (38). This stems from the hat matrix $H_{\mathbf{X}}$ being equal to the matrix of the orthogonal projection of $\mathbb{R}^n$ onto the range space of $\mathbf{X}$ ([22], Appendix A). Therefore, the residual response vector

$$\hat{\mathbf{e}}=\mathbf{y}-\hat{\mathbf{y}}=\mathbf{y}-\mathbf{X}\hat\beta_{\mathrm{OLS}}=(I_n-H_{\mathbf{X}})\mathbf{y}, \tag{40}$$

does not vary with the g-inverse ( X Τ X ) either. Then we will need the result:

Lemma 4. With $r_{\mathbf{X}}=\mathrm{rank}(\mathbf{X})$, one has $\mathrm{tr}(H_{\mathbf{X}})=r_{\mathbf{X}}$ and $E(\|\hat{\mathbf{e}}\|^2\mid\mathbf{X})=(n-r_{\mathbf{X}})\sigma^2$.

Proof. On the one hand, one has:

$$\mathrm{tr}(H_{\mathbf{X}})=\mathrm{tr}[\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-}\mathbf{X}^{T}]=\mathrm{tr}[(\mathbf{X}^{T}\mathbf{X})^{-}\mathbf{X}^{T}\mathbf{X}]. \tag{41}$$

Now, $(\mathbf{X}^{T}\mathbf{X})^{-}$ being a g-inverse of $\mathbf{X}^{T}\mathbf{X}$, it follows that $(\mathbf{X}^{T}\mathbf{X})^{-}\mathbf{X}^{T}\mathbf{X}$ is an idempotent matrix ([22], Appendix A, page 509). Hence

$$\mathrm{tr}[(\mathbf{X}^{T}\mathbf{X})^{-}\mathbf{X}^{T}\mathbf{X}]=\mathrm{rank}[(\mathbf{X}^{T}\mathbf{X})^{-}\mathbf{X}^{T}\mathbf{X}]=\mathrm{rank}(\mathbf{X}^{T}\mathbf{X})=\mathrm{rank}(\mathbf{X})=r_{\mathbf{X}}. \tag{42}$$

Relations (41) and (42) give tr ( H X ) = r X . On the other hand,

$$\hat{\mathbf{e}}=(I_n-H_{\mathbf{X}})\mathbf{y}=(I_n-H_{\mathbf{X}})(\mathbf{X}\beta+\boldsymbol\varepsilon)=(I_n-H_{\mathbf{X}})\mathbf{X}\beta+(I_n-H_{\mathbf{X}})\boldsymbol\varepsilon. \tag{43}$$

Now, $H_{\mathbf{X}}$ being the matrix of the orthogonal projection of $\mathbb{R}^n$ onto the range space of $\mathbf{X}$, we have $H_{\mathbf{X}}\mathbf{X}=\mathbf{X}$. Hence,

$$(I_n-H_{\mathbf{X}})\mathbf{X}=\mathbf{X}-H_{\mathbf{X}}\mathbf{X}=\mathbf{0}. \tag{44}$$

From (43) and (44), we get $\hat{\mathbf{e}}=(I_n-H_{\mathbf{X}})\boldsymbol\varepsilon$. Therefore:

$$\|\hat{\mathbf{e}}\|^2=\boldsymbol\varepsilon^{T}(I_n-H_{\mathbf{X}})^{T}(I_n-H_{\mathbf{X}})\boldsymbol\varepsilon=\boldsymbol\varepsilon^{T}(I_n-H_{\mathbf{X}})^2\boldsymbol\varepsilon=\boldsymbol\varepsilon^{T}(I_n-H_{\mathbf{X}})\boldsymbol\varepsilon.$$

Under Assumptions ($\mathcal{A}_0$)-($\mathcal{A}_1$), and thanks to the known identity giving the expectation of a quadratic form ([27], Appendix B, page 170), one has:

$$\begin{aligned}E[\|\hat{\mathbf{e}}\|^2\mid\mathbf{X}]&=[E(\boldsymbol\varepsilon\mid\mathbf{X})]^{T}(I_n-H_{\mathbf{X}})E(\boldsymbol\varepsilon\mid\mathbf{X})+\mathrm{tr}[(I_n-H_{\mathbf{X}})\,\mathrm{Mcov}(\boldsymbol\varepsilon\mid\mathbf{X})]\\&=\mathrm{tr}[(I_n-H_{\mathbf{X}})\sigma^2I_n]=\sigma^2[\mathrm{tr}(I_n)-\mathrm{tr}(H_{\mathbf{X}})]=\sigma^2(n-r_{\mathbf{X}}).\end{aligned}$$

An unbiased estimate of the LM residual variance σ 2 in this case is thus known to be:

$$\hat\sigma_{\mathbf{X}}^2=\frac{\mathrm{SSR}}{n-r_{\mathbf{X}}}, \quad\text{with } \mathrm{SSR}=\|\hat{\mathbf{e}}\|^2=\|\mathbf{y}-\mathbf{X}\hat\beta_{\mathrm{OLS}}\|^2=\sum_{i=1}^n(y_i-x_i^{T}\hat\beta_{\mathrm{OLS}})^2. \tag{45}$$

We will also need the mean vector and covariance matrix of β ^ OLS . First,

$$E(\hat\beta_{\mathrm{OLS}}\mid\mathbf{X})=(\mathbf{X}^{T}\mathbf{X})^{-}\mathbf{X}^{T}\mathbf{X}\beta, \tag{46a}$$

which shows that β ^ OLS is a biased estimator of β when Assumption ( A 3 ) does not hold. But in spite of that, note that from (39),

$$E(\hat{\mathbf{y}}\mid\mathbf{X})=E(\mathbf{X}\hat\beta_{\mathrm{OLS}}\mid\mathbf{X})=\mathbf{X}E(\hat\beta_{\mathrm{OLS}}\mid\mathbf{X})=H_{\mathbf{X}}\mathbf{X}\beta=\mathbf{X}\beta=E(\mathbf{y}\mid\mathbf{X}). \tag{46b}$$

On the other hand,

$$\mathrm{Mcov}(\hat\beta_{\mathrm{OLS}}\mid\mathbf{X})=\sigma^2(\mathbf{X}^{T}\mathbf{X})^{-}_{S}, \quad\text{with } (\mathbf{X}^{T}\mathbf{X})^{-}_{S}=(\mathbf{X}^{T}\mathbf{X})^{-}\mathbf{X}^{T}\mathbf{X}[(\mathbf{X}^{T}\mathbf{X})^{-}]^{T}, \tag{47a}$$

the symmetric and positive semi-definite matrix ( X Τ X ) S also being a g-inverse of X Τ X . Then

$$\mathrm{Mcov}(\hat{\mathbf{y}}\mid\mathbf{X})=\mathrm{Mcov}(\mathbf{X}\hat\beta_{\mathrm{OLS}}\mid\mathbf{X})=\mathbf{X}\,\mathrm{Mcov}(\hat\beta_{\mathrm{OLS}}\mid\mathbf{X})\,\mathbf{X}^{T}=\sigma^2\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-}_{S}\mathbf{X}^{T}=\sigma^2H_{\mathbf{X}}, \tag{47b}$$

again independent of the g-inverse ( X Τ X ) of X Τ X used to compute β ^ OLS .

4.2.2. Preliminary for the MSPE in the LM Fitted by OLS without Assumption ( A 3 )

Our first aim here is to examine the MSPE in the LM when fitted by OLS under the assumption that the design matrix X might not have full column rank. So β has been estimated through β ^ OLS given by (38) and we are interested in MSPE ( Y ^ 0 , Y 0 ) = E ( Y ^ 0 Y 0 ) 2 , where Y ^ 0 = X 0 Τ β ^ OLS is taken as prediction of Y on an independently sampled new individual for whom X = X 0 p would have been observed, but not Y = Y 0 . We are going to use the results of Section 3.2. First note that since, from the above, β ^ OLS is a biased estimator of β , Corollary 2 does not apply here. Nonetheless, from Theorem 1, we get:

$$\mathrm{MSPE}(\hat Y_0,Y_0\mid\mathbf{X})=\sigma^2+\mathrm{tr}[\mathrm{Mse}(\hat\beta_{\mathrm{OLS}}\mid\mathbf{X})\,E(X_0X_0^{T})], \tag{48a}$$

$$\mathrm{MSPE}(\hat Y_0,Y_0)=\sigma^2+\mathrm{tr}[\mathrm{Mse}(\hat\beta_{\mathrm{OLS}})\,E(X_0X_0^{T})]. \tag{48b}$$

Our estimation of MSPE ( Y ^ 0 , Y 0 ) in this case will be based on those two identities and:

Lemma 5 In the LM (14) with Assumptions ( A 0 )-( A 1 ),

$$\mathrm{tr}[\mathrm{Mse}(\hat\beta_{\mathrm{OLS}}\mid\mathbf{X})\,\mathbf{X}^{T}\mathbf{X}]=r_{\mathbf{X}}\,\sigma^2, \quad\text{with } r_{\mathbf{X}}=\mathrm{rank}(\mathbf{X}). \tag{49}$$

Proof. Let $A(\mathbf{X})=\mathrm{tr}[\mathrm{Mse}(\hat\beta_{\mathrm{OLS}}\mid\mathbf{X})\,\mathbf{X}^{T}\mathbf{X}]=\mathrm{tr}\{E[(\hat\beta_{\mathrm{OLS}}-\beta)(\hat\beta_{\mathrm{OLS}}-\beta)^{T}\mid\mathbf{X}]\,\mathbf{X}^{T}\mathbf{X}\}$. Then

$$\begin{aligned}A(\mathbf{X})&=\mathrm{tr}\{\mathbf{X}\,E[(\hat\beta_{\mathrm{OLS}}-\beta)(\hat\beta_{\mathrm{OLS}}-\beta)^{T}\mid\mathbf{X}]\,\mathbf{X}^{T}\}\\&=\mathrm{tr}\{E[\mathbf{X}(\hat\beta_{\mathrm{OLS}}-\beta)(\hat\beta_{\mathrm{OLS}}-\beta)^{T}\mathbf{X}^{T}\mid\mathbf{X}]\}\\&=\mathrm{tr}\{E[(\hat{\mathbf{y}}-\mathbf{X}\beta)(\hat{\mathbf{y}}-\mathbf{X}\beta)^{T}\mid\mathbf{X}]\}\\&=\mathrm{tr}[\mathrm{Mcov}(\hat{\mathbf{y}}\mid\mathbf{X})], \ \text{thanks to (46b)},\\&=\sigma^2\,\mathrm{tr}(H_{\mathbf{X}}), \ \text{by (47b)}.\end{aligned}$$

Hence (49), thanks to Lemma 4.

4.2.3. The FPE and the GCV in the LM Fitted by OLS with X Column Rank Deficient

Given (48a), we first estimate E ( X 0 X 0 Τ ) by X Τ X / n as in (33), which entails the following preliminary estimate of MSPE ( Y ^ 0 , Y 0 | X ) in the present case, using the last lemma:

$$\widetilde{\mathrm{MSPE}}(\hat Y_0,Y_0\mid\mathbf{X})=\sigma^2+\frac1n\mathrm{tr}[\mathrm{Mse}(\hat\beta_{\mathrm{OLS}}\mid\mathbf{X})\,\mathbf{X}^{T}\mathbf{X}]=\sigma^2\Big(1+\frac{r_{\mathbf{X}}}{n}\Big). \tag{50}$$

But since $\mathrm{MSPE}(\hat Y_0,Y_0)=E_{\mathbf{X}}\,\mathrm{MSPE}(\hat Y_0,Y_0\mid\mathbf{X})$, (50) also gives a preliminary estimate of $\mathrm{MSPE}(\hat Y_0,Y_0)$. Then, estimating $\sigma^2$ by $\hat\sigma_{\mathbf{X}}^2$ given by (45), our final estimate of $\mathrm{MSPE}(\hat Y_0,Y_0)$ in this case, computable from data, is:

$$\mathrm{FPE}=\hat\sigma_{\mathbf{X}}^2\Big(1+\frac{r_{\mathbf{X}}}{n}\Big)=\hat S^2\,\frac{n+r_{\mathbf{X}}}{n-r_{\mathbf{X}}}, \tag{51}$$

which is also an estimate of MSPE ( Y ^ 0 , Y 0 | X ) . It is denoted FPE because it generalizes (34) in assessing the goodness-of-fit of OLS in the LM when the design matrix X is column rank deficient. The remarkable feature is that this estimate is the same, whichever g-inverse ( X Τ X ) of X Τ X was used to get the estimate β ^ OLS of β in (38).

Estimating, instead, $E(X_0X_0^{T})$ by $\mathbf{X}^{T}\mathbf{X}/(n-r_{\mathbf{X}})$ yields as estimate of $\mathrm{MSPE}(\hat Y_0,Y_0)$:

$$\hat\sigma_{\mathbf{X}}^2\Big(1+\frac{r_{\mathbf{X}}}{n-r_{\mathbf{X}}}\Big)=\frac{n\,\mathrm{SSR}}{(n-r_{\mathbf{X}})^2}=\frac{\hat S^2}{(1-r_{\mathbf{X}}/n)^2}=\mathrm{GCV}, \tag{52}$$

the GCV estimate of MSPE ( Y ^ 0 , Y 0 ) in the LM fitted by OLS when X is not full column rank.
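
A minimal sketch of (51) and (52) in this rank-deficient case follows; the Moore-Penrose pseudoinverse is used only as one convenient g-inverse (an implementation choice of ours, not required by the derivation), and the resulting SSR, hence the two criteria, is invariant to that choice.

```python
import numpy as np

def fpe_gcv_rank_deficient(X, y):
    """FPE (51) and GCV (52) when X may be column rank deficient."""
    n = X.shape[0]
    r = np.linalg.matrix_rank(X)                   # r_X = rank(X)
    beta_hat = np.linalg.pinv(X.T @ X) @ X.T @ y   # one OLS solution (38)
    ssr = np.sum((y - X @ beta_hat) ** 2)          # invariant to the chosen g-inverse
    s2 = ssr / n
    fpe = s2 * (n + r) / (n - r)                   # (51)
    gcv = s2 / (1 - r / n) ** 2                    # (52)
    return fpe, gcv
```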

5. MSPE when Fitting the LM by Ridge Regression

The design matrix X being column rank deficient means that its p columns are linearly dependent, or almost so. This happens when there is multicollinearity among the p regressors X 1 , , X p , and thus some of them are redundant with some others. However, when this occurs, computing an OLS estimate β ^ OLS of β , given by (38), in a numerically stable manner is not easy and requires using carefully designed Numerical Linear Algebra programming routines. The difficulty stems from the fact that this requires, at least implicitly, to find, along the way, the exact rank of X , which is difficult to achieve, precisely because of the multicollinearity among its columns. It can then be of much interest to have a method which can fit the LM without having to bother about the exact rank of X . This is precisely what Ridge regression (RR) tries to achieve.

Hoerl and Kennard presented two variants of Ridge Regression [19] . In the initial one (the default), which some have termed Ordinary Ridge Regression (ORR), to fit the LM (14), one estimates β through regularizing the OLS criterion (28) by a ridge constraint, yielding:

β ^ = β ^ λ = arg min β p [ i = 1 n ( y i x i Τ β ) 2 + λ β 2 ] , (53)

for some $\lambda>0$, a penalty parameter to be chosen appropriately. The unique solution to (53) is known to be (whatever the rank of $\mathbf{X}$):

$$\hat\beta_\lambda=G_\lambda^{-1}\mathbf{X}^{T}\mathbf{y}, \quad\text{with } G_\lambda=\mathbf{X}^{T}\mathbf{X}+\lambda I_p. \tag{54}$$

Hoerl and Kennard presented ORR in [19] assuming that the design matrix $\mathbf{X}$ was of full column rank (i.e. our Assumption ($\mathcal{A}_3$)), which also requires that $n\geq p$. But because the symmetric matrix $\mathbf{X}^{T}\mathbf{X}$ is always at least positive semi-definite, imposing $\lambda>0$ entails that the $p\times p$ matrix $G_\lambda$ is symmetric and positive definite (SPD), whatever the rank of $\mathbf{X}$ and the ranking between the integers n and p. That is why, in what follows, we do not impose any rank constraint on $\mathbf{X}$ (apart from it being trivially ≥1 because $\mathbf{X}$ is nonzero). No specific ranking is assumed between n and p either.

However, hereafter, since it does not require extra work, we directly consider the extended setting of Generalized Ridge Regression (GRR) which, to fit the LM (14), estimates β through solving the minimization problem:

$$\hat\beta=\hat\beta_\lambda=\arg\min_{\beta\in\mathbb{R}^p}\Big[\sum_{i=1}^n(y_i-x_i^{T}\beta)^2+\lambda\|\beta\|_D^2\Big], \tag{55}$$

where λ is as in ORR and D is a given $p\times p$ symmetric positive semi-definite (SSPD) matrix, while $\|\beta\|_D^2=\beta^{T}D\beta$. The solution of (55) is still given by (54), but now with

$$G_\lambda=\mathbf{X}^{T}\mathbf{X}+\lambda D, \tag{56}$$

under the assumption that the latter is SPD. Since smoothing splines fitting can be cast in the GRR form (55), what follows applies to that hugely popular nonparametric regression method as well.
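
For concreteness, here is a minimal sketch of the GRR computation just described, treating λ and D as given ($D=I_p$ recovers ORR); the hat matrix $H_\lambda=\mathbf{X}G_\lambda^{-1}\mathbf{X}^{T}$ computed here is the one used in Section 5.1 below. This is only an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def grr_fit(X, y, lam, D=None):
    """Generalized Ridge Regression estimate (54)/(56) and its hat matrix."""
    n, p = X.shape
    if D is None:
        D = np.eye(p)                        # D = I_p recovers Ordinary Ridge Regression
    G = X.T @ X + lam * D                    # G_lambda in (56), assumed SPD
    beta_hat = np.linalg.solve(G, X.T @ y)   # ridge estimate beta_hat_lambda
    H = X @ np.linalg.solve(G, X.T)          # hat matrix H_lambda = X G^{-1} X^T
    return beta_hat, H
```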

5.1. The MSPE Issue for Ridge Regression

Even more than for Least Squares, it is critical to assess how Ridge Regression fits the LM for given data, since the fit depends on the unspecified parameter λ, in order to be able to select the best λ value, i.e. the one ensuring the best fit. From the prediction error point of view adopted in this article, this amounts to choosing the $\lambda>0$ for which the RR fit has the smallest MSPE. It thus requires estimating $\mathrm{MSPE}_\lambda(\hat Y_0,Y_0)=E(\hat Y_0-Y_0)^2$ for any given λ value, where $X_0$ and $Y_0$ are as before, while $\hat Y_0=X_0^{T}\hat\beta_\lambda$. Traditionally, estimating $\mathrm{MSPE}_\lambda(\hat Y_0,Y_0)$ in this context is mostly done using the Generalized Cross Validation (GCV) criterion initially developed by Craven and Wahba for selecting the best value of the smoothing parameter in a smoothing spline [3]. That GCV is obtained as a variation of the LOO-CV. Here, to estimate $\mathrm{MSPE}_\lambda(\hat Y_0,Y_0)$, we take a different route. We first note that the fitted sample response vector is

$$\hat{\mathbf{y}}_\lambda=\mathbf{X}\hat\beta_\lambda=H_\lambda\mathbf{y}, \quad\text{with } H_\lambda=\mathbf{X}G_\lambda^{-1}\mathbf{X}^{T}. \tag{57}$$

On the other hand, from (54),

$$E(\hat\beta_\lambda\mid\mathbf{X})=G_\lambda^{-1}\mathbf{X}^{T}E(\mathbf{y}\mid\mathbf{X})=T_\lambda\beta, \quad\text{with } T_\lambda=G_\lambda^{-1}\mathbf{X}^{T}\mathbf{X}\neq I_p. \tag{58}$$

So, again, Corollary 2 does not apply. But, using Theorem 1,

$$\mathrm{MSPE}_\lambda(\hat Y_0,Y_0\mid\mathbf{X})=\sigma^2+\mathrm{tr}[\mathrm{Mse}(\hat\beta_\lambda\mid\mathbf{X})\,E(X_0X_0^{T})], \tag{59a}$$

$$\mathrm{MSPE}_\lambda(\hat Y_0,Y_0)=\sigma^2+\mathrm{tr}[\mathrm{Mse}(\hat\beta_\lambda)\,E(X_0X_0^{T})]. \tag{59b}$$

For estimating those quantities, we will need some preliminary results.

5.2. Preliminary Results for Estimating the MSPE in Ridge Regression

First, two simple but remarkable identities about the matrices $H_\lambda$ and $T_\lambda$ given in (57) and (58).

Lemma 6. $T_\lambda=I_p-\lambda G_\lambda^{-1}D$ and $\lambda\mathbf{X}G_\lambda^{-1}D=(I_n-H_\lambda)\mathbf{X}$.

Proof. The first identity easily follows from (58) and (56). Indeed, one has:

$$I_p=G_\lambda^{-1}G_\lambda=G_\lambda^{-1}\mathbf{X}^{T}\mathbf{X}+\lambda G_\lambda^{-1}D=T_\lambda+\lambda G_\lambda^{-1}D.$$

As for the second one,

$$\lambda\mathbf{X}G_\lambda^{-1}D=\mathbf{X}G_\lambda^{-1}(G_\lambda-\mathbf{X}^{T}\mathbf{X})=\mathbf{X}-\mathbf{X}G_\lambda^{-1}\mathbf{X}^{T}\mathbf{X}=(I_n-H_\lambda)\mathbf{X}.$$

Next, a key preliminary about the Mean Squared Error matrix of β ^ λ as an estimate of β :

Lemma 7 Under Assumptions ( A 0 )-( A 1 ), one has:

$$\mathrm{tr}[\mathrm{Mse}(\hat\beta_\lambda\mid\mathbf{X})\,\mathbf{X}^{T}\mathbf{X}]=\sigma^2\,\mathrm{tr}(H_\lambda^2)+\|(I_n-H_\lambda)\mathbf{X}\beta\|^2. \tag{60}$$

Proof. Inserting (20) (in its conditional-on-$\mathbf{X}$ form) into $\mathrm{tr}[\mathrm{Mse}(\hat\beta_\lambda\mid\mathbf{X})\,\mathbf{X}^{T}\mathbf{X}]=\mathrm{tr}[\mathbf{X}\,\mathrm{Mse}(\hat\beta_\lambda\mid\mathbf{X})\,\mathbf{X}^{T}]$ gives:

$$\mathrm{tr}[\mathrm{Mse}(\hat\beta_\lambda\mid\mathbf{X})\,\mathbf{X}^{T}\mathbf{X}]=\mathrm{tr}(A)+\mathrm{tr}(B),$$

with $A=\mathbf{X}\,\mathrm{Mcov}(\hat\beta_\lambda\mid\mathbf{X})\,\mathbf{X}^{T}$ and $B=\mathbf{X}\,\mathrm{Bias}(\hat\beta_\lambda)\,[\mathbf{X}\,\mathrm{Bias}(\hat\beta_\lambda)]^{T}$. Now,

$$A=\mathrm{Mcov}(\hat{\mathbf{y}}_\lambda\mid\mathbf{X})=H_\lambda\,\mathrm{Mcov}(\mathbf{y}\mid\mathbf{X})\,H_\lambda=\sigma^2H_\lambda^2,$$

$$\mathrm{tr}(B)=\mathrm{tr}\{[\mathbf{X}\,\mathrm{Bias}(\hat\beta_\lambda)]^{T}\mathbf{X}\,\mathrm{Bias}(\hat\beta_\lambda)\}=\|\mathbf{X}\,\mathrm{Bias}(\hat\beta_\lambda)\|^2=\|\mathbf{X}(T_\lambda-I_p)\beta\|^2=\|\lambda\mathbf{X}G_\lambda^{-1}D\beta\|^2=\|(I_n-H_\lambda)\mathbf{X}\beta\|^2,$$

the last three identities using (58) and Lemma 6.

With the above lemma, we are now in a position to be able to estimate MSPE λ ( Y ^ 0 , Y 0 ) in Ridge Regression. We examine, hereafter, two paths for achieving that: one leads to the FPE, the other one to the GCV.

5.3. Estimating the MSPE in Ridge Regression by the FPE

Here, we first estimate $E(X_0X_0^{T})$ in (59a) by $\mathbf{X}^{T}\mathbf{X}/n$. Then, given (60), this suggests the preliminary estimate of $\mathrm{MSPE}_\lambda(\hat Y_0,Y_0\mid\mathbf{X})$ in RR:

$$\widetilde{\mathrm{MSPE}}_\lambda^0(\hat Y_0,Y_0\mid\mathbf{X})=\sigma^2+\frac1n\mathrm{tr}[\mathrm{Mse}(\hat\beta_\lambda\mid\mathbf{X})\,\mathbf{X}^{T}\mathbf{X}]=\sigma^2\Big[1+\frac{\mathrm{tr}(H_\lambda^2)}{n}\Big]+\frac{\|(I_n-H_\lambda)\mathbf{X}\beta\|^2}{n}, \tag{61}$$

which is also, therefore, a preliminary estimate of $\mathrm{MSPE}_\lambda(\hat Y_0,Y_0)$. It is only a preliminary estimate because, even given λ, it still depends on the two unknowns $\sigma^2$ and β. Interestingly, it can be shown ([8], Chapter 3, page 46) that (61) coincides with $\mathrm{PSE}(\hat{\mathbf{y}}_\lambda\mid\mathbf{X})$, as given by (12), in the present setting.

It is useful to note that (61) depends on β only through the squared bias term $\|(I_n-H_\lambda)\mathbf{X}\beta\|^2$. To estimate the latter, let $\hat{\mathbf{e}}_\lambda=\mathbf{y}-\hat{\mathbf{y}}_\lambda=(I_n-H_\lambda)\mathbf{y}$, the vector of sample residuals, and $\mathrm{SSR}_\lambda=\|\hat{\mathbf{e}}_\lambda\|^2$, the sum of squared residuals in the Ridge Regression fit. Then, thanks to a well-known identity ([9], Chapter 2, page 38),

$$E(\mathrm{SSR}_\lambda\mid\mathbf{X})=\|(I_n-H_\lambda)\mathbf{X}\beta\|^2+\sigma^2\,\mathrm{tr}[(I_n-H_\lambda)^2],$$

implying that $\mathrm{SSR}_\lambda-\sigma^2\,\mathrm{tr}[(I_n-H_\lambda)^2]$ is an unbiased estimate of $\|(I_n-H_\lambda)\mathbf{X}\beta\|^2$. Hence a general formula for computing an estimate of the MSPE in Ridge Regression:

$$\mathrm{FPE}(\hat\sigma^2)=\frac{2\hat\sigma^2\,\mathrm{tr}(H_\lambda)+\mathrm{SSR}_\lambda}{n}, \tag{62}$$

where σ ^ 2 is a chosen estimate of σ 2 , possibly computed from the RR fit, thus dependent on λ .
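
For the record, the short expansion that leads from (61) to (62), after plugging in the above unbiased estimate of the squared bias term with $\sigma^2$ replaced by $\hat\sigma^2$, and using $\mathrm{tr}[(I_n-H_\lambda)^2]=n-2\,\mathrm{tr}(H_\lambda)+\mathrm{tr}(H_\lambda^2)$, is:

$$\hat\sigma^2\Big[1+\frac{\mathrm{tr}(H_\lambda^2)}{n}\Big]+\frac{\mathrm{SSR}_\lambda-\hat\sigma^2\,\mathrm{tr}[(I_n-H_\lambda)^2]}{n}=\hat\sigma^2+\frac{\hat\sigma^2\,\mathrm{tr}(H_\lambda^2)+\mathrm{SSR}_\lambda-\hat\sigma^2\big[n-2\,\mathrm{tr}(H_\lambda)+\mathrm{tr}(H_\lambda^2)\big]}{n}=\frac{2\hat\sigma^2\,\mathrm{tr}(H_\lambda)+\mathrm{SSR}_\lambda}{n}.$$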

We denote the estimate of the MSPE given by (62) as $\mathrm{FPE}(\hat\sigma^2)$ for the following reason. Indeed, probably the most popular estimate of the residual variance $\sigma^2$ from an RR fit is the one proposed by Wahba in the context of smoothing splines [28]:

$$\hat\sigma_1^2=\frac{\mathrm{SSR}_\lambda}{n-\mathrm{tr}(H_\lambda)}. \tag{63}$$

Now, if one uses $\hat\sigma^2=\hat\sigma_1^2$ in (62), an algebraic manipulation easily leads to:

$$\mathrm{FPE}(\hat\sigma_1^2)=\frac{\mathrm{SSR}_\lambda}{n}\cdot\frac{n+\mathrm{tr}(H_\lambda)}{n-\mathrm{tr}(H_\lambda)}=\mathrm{FPE}, \tag{64}$$

recovering the classical formula of the FPE for this setting, but this time as an estimate of the MSPE rather than the PSE.

5.4. Estimating the MSPE in Ridge Regression by the GCV

Here, we estimate $E(X_0X_0^{T})$ in (59a) rather by $\mathbf{X}^{T}\mathbf{X}/[n-\mathrm{tr}(H_\lambda)]$. With (60), this suggests a preliminary estimate $\widetilde{\mathrm{MSPE}}_\lambda^1(\hat Y_0,Y_0\mid\mathbf{X})$ of $\mathrm{MSPE}_\lambda(\hat Y_0,Y_0\mid\mathbf{X})$ in RR, corresponding to $\widetilde{\mathrm{MSPE}}_\lambda^0(\hat Y_0,Y_0\mid\mathbf{X})$ with the denominators n replaced by $n-\mathrm{tr}(H_\lambda)$. Using the same unbiased estimate of $\|(I_n-H_\lambda)\mathbf{X}\beta\|^2$ as in the previous section and an estimate $\hat\sigma^2$ of $\sigma^2$, we get another general formula for computing an estimate of the MSPE in RR:

$$\mathrm{GCV}(\hat\sigma^2)=\frac{\hat\sigma^2\,\mathrm{tr}(H_\lambda)+\mathrm{SSR}_\lambda}{n-\mathrm{tr}(H_\lambda)}. \tag{65}$$

We denote it $\mathrm{GCV}(\hat\sigma^2)$ because, again taking $\hat\sigma^2=\hat\sigma_1^2$, one gets:

$$\mathrm{GCV}(\hat\sigma_1^2)=\frac{\mathrm{SSR}_\lambda/n}{[1-n^{-1}\mathrm{tr}(H_\lambda)]^2}=\mathrm{GCV}, \tag{66}$$

the well known formula of Craven and Wahba’s GCV [3] for this setting, but this time derived without any reference to Cross Validation.
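
Putting (63), (64) and (66) together gives a simple, directly computable selector for λ; the sketch below evaluates both criteria over a user-supplied grid of λ values (the grid and the default $D=I_p$ are illustrative assumptions of ours).

```python
import numpy as np

def ridge_fpe_gcv(X, y, lambdas, D=None):
    """FPE (64) and GCV (66), both based on sigma_hat_1^2 of (63), over a grid of lambda."""
    n, p = X.shape
    if D is None:
        D = np.eye(p)
    results = []
    for lam in lambdas:
        G = X.T @ X + lam * D                     # G_lambda in (56)
        H = X @ np.linalg.solve(G, X.T)           # hat matrix H_lambda (57)
        trH = np.trace(H)
        ssr = np.sum((y - H @ y) ** 2)            # SSR_lambda
        fpe = (ssr / n) * (n + trH) / (n - trH)   # (64)
        gcv = (ssr / n) / (1 - trH / n) ** 2      # (66)
        results.append((lam, fpe, gcv))
    return results   # one would retain the lambda minimizing FPE or GCV
```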

6. Conclusion and Perspectives

In this work, the goal was not to derive new and better selection criteria to assess the goodness-of-fit in regression, but rather to show how one can derive the well-known Akaike FPE and Craven and Wahba's GCV [3] as direct estimates of the measure of prediction error most commonly known to users, which is not how they are traditionally derived. We achieved this for some of the best known linear model fitters, with the two derivations differing only slightly for each of them. But, nowadays, in regression, use of the FPE criterion is generally not recommended because much better performing criteria are known [29], while GCV has its own shortcomings in certain settings (e.g. small sample sizes), though it is hugely popular and almost the best in some difficult high-dimensional situations [12]. It is then our hope that, in the future, one can, through the same unified framework used in this paper, derive new and better selection criteria, different from other proposals already available for the same setting, among which we can cite the AICc [30], the modified GCV [31] [32], the RGCV and R1GCV [33] [34], and $RC_p$, $RC_p^+$ and $\widehat{RC_p}$ [5].

NOTES

¹By default, in this paper, vectors and random vectors are column vectors, unless specified otherwise; and $A^{T}$ denotes the transpose of the matrix (or vector) A.

²However, that defect is more and more mitigated these days, thanks to the availability of increasingly user-friendly parallel programming environments in popular statistical software systems, provided one has a computer able to exploit such possibilities, e.g. a laptop with several processor cores.

3Keeping in line with our stated convention of denoting a random variable and its realization by the same symbol, whereas, technically, an estimate of a parameter is a realization of a random variable called estimator of that parameter, we use the term estimate here for both. So when an estimate appears inside an expectation or a covariance notation, it is definitely the estimator which is meant.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] Akaike, H. (1970) Statistical Predictor Identification. Annals of the Institute of Statistical Mathematics, 22, 203-217.
https://doi.org/10.1007/BF02506337
[2] Borra, S. and Di Ciaccio, A. (2010) Measuring the Prediction Error. A Comparison of Cross-Validation, Bootstrap and Covariance Penalty Methods. Computational Statistics & Data Analysis, 54, 2976-2989.
https://doi.org/10.1016/j.csda.2010.03.004
[3] Craven, P. and Wahba, G. (1979) Smoothing Noisy Data with Spline Functions: Estimating the Correct Degree of Smoothing by the Method of Generalized Cross-Validation. Numerische Mathematik, 31, 377-403.
https://doi.org/10.1007/BF01404567
[4] Li, K.-C. (1985) From Stein’s Unbiased Risk Estimates to the Method of Generalized Cross Validation. The Annals of Statistics, 13, 1352-1377.
[5] Rosset, S. and Tibshirani, R. (2020) From Fixed-X to Random-X Regression: Bias-Variance Decompositions, Covariance Penalties, and Prediction Error Estimation. JASA, 115, 138-151.
https://doi.org/10.1080/01621459.2018.1424632
[6] Hastie, T., Tibshirani, R. and Friedman, J. (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd Edition, Springer-Verlag, New York.
[7] Burman, P. (1989) A Comparative Study of Ordinary Cross-Validation, v-Fold Cross-Validation and Repeated Learning-Testing Methods. Biometrika, 76, 503-514.
https://doi.org/10.1093/biomet/76.3.503
[8] Hastie, T. and Tibshirani, R. (1990) Generalized Additive Models. Monographs on Statistics and Applied Probability. Chapman and Hall, London.
[9] Eubank, R.L. (1999) Nonparametric Regression and Spline Smoothing. 2nd Edition, Marcel Dekker, New York.
https://doi.org/10.1201/9781482273144
[10] McQuarrie, A.D.R. and Tsai, C.-L. (1998) Regression and Time Series Model Selection. World Scientific Publishing Co. Pte. Ltd., Singapore.
https://doi.org/10.1142/3573
[11] Breiman, L. and Spector, P. (1992) Submodel Selection and Evaluation in Regression. The X-Random Case. International Statistical Review, 60, 291-319.
https://doi.org/10.2307/1403680
[12] Leeb, H. (2008) Evaluation and Selection of Models for Out-of-Sample Prediction When the Sample Size Is Small Relative to the Complexity of the Data-Generating Process. Bernoulli, 14, 661-690.
https://doi.org/10.3150/08-BEJ127
[13] Dicker, L.H. (2013) Optimal Equivariant Prediction for High-Dimensional Linear Models with Arbitrary Predictor Covariance. The Electronic Journal of Statistics, 7, 1806-1834.
https://doi.org/10.1214/13-EJS826
[14] Dobriban, E. and Wager, S. (2018) High-Dimensional Asymptotics of Prediction. Annals of Statistics, 46, 247-279.
https://doi.org/10.1214/17-AOS1549
[15] Lukas, M.A. (2014) Performance Criteria and Discrimination of Extreme Undersmoothing in Nonparametric Regression. Journal of Statistical Planning and Inference, 153, 56-74.
https://doi.org/10.1016/j.jspi.2014.05.006
[16] Lukas, M.A., de Hoog, F.R. and Anderssen, R.S. (2015) Practical Use of Robust GCV and Modified GCV for Spline Smoothing. Computational Statistics, 31, 269-289.
https://doi.org/10.1007/s00180-015-0577-7
[17] Kitagawa, G. (2008) Contributions of Professor Hirotogu Akaike in Statistical Science. Journal of the Japan Statistical Society, 38, 119-130.
https://doi.org/10.14490/jjss.38.119
[18] Tibshirani, R. (1996) Regression Shrinkage and Selection via the Lasso. JRSS, Series B, 58, 267-288.
https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
[19] Hoerl, A.E. and Kennard, R. (1970) Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 12, 55-67.
https://doi.org/10.1080/00401706.1970.10488634
[20] Searle, S.R. (1971) Linear Models. John Wiley & Sons, New York.
[21] Rao, C.R. (1973) Linear Statistical Inference and Its Applications. 2nd Edition, Wiley, New York.
https://doi.org/10.1002/9780470316436
[22] Rao, C.R., Toutenburg, H., et al. (2008) Linear Models and Generalizations: Least Squares and Alternatives. 3rd Edition, Springer-Verlag, Berlin.
[23] Olive, D.J. (2017) Linear Regression. Springer International Publishing, Berlin.
https://doi.org/10.1007/978-3-319-55252-1
[24] Moore, E.H. (1920) On the Reciprocal of the General Algebraic Matrix (Abstract). Bulletin of the AMS, 26, 394-395.
[25] Penrose, R. (1955) A Generalized Inverse for Matrices. Proceedings—Cambridge Philosophical Society, 51, 406-413.
https://doi.org/10.1017/S0305004100030401
[26] Rao, C.R. (1962) A Note on a Generalized Inverse of a Matrix with Applications to Problems in Mathematical Statistics. JRSS, Series B, 24, 152-158.
https://doi.org/10.1111/j.2517-6161.1962.tb00447.x
[27] Kendrick, D.A. (2002) Stochastic Control for Economic Models. 2nd Edition, VTEX Ltd., Vilnius.
[28] Wahba, G. (1978) Improper Priors, Spline Smoothing and the Problem of Guarding against Model Errors in Regression. JRSS, Series B, 40, 364-372.
https://doi.org/10.1111/j.2517-6161.1978.tb01050.x
[29] Härdle, W., Hall, P. and Marron, J.S. (1988) How Far Are Automatically Chosen Regression Smoothing Parameters from Their Optimum? (With Discussion). JASA, 83, 86-89.
https://doi.org/10.2307/2288922
[30] Hurvich, C.M., Simonoff, J.S. and Tsai, C.-L. (1998) Smoothing Parameter Selection in Nonparametric Regression Using an Improved Akaike Information Criterion. JRSS, Series B, 60, 271-293.
https://doi.org/10.1111/1467-9868.00125
[31] Cummins, D.J., Filloon, T.G. and Nychka, D. (2001) Confidence Intervals for Nonparametric Curve Estimates: Toward More Uniform Pointwise Coverage. JASA, 96, 233-246.
https://doi.org/10.1198/016214501750332811
[32] Kim, Y.-J. and Gu, C. (2004) Smoothing Spline Gaussian Regression: More Scalable Computation via Efficient Approximation. JRSS, Series B, 66, 337-356.
https://doi.org/10.1046/j.1369-7412.2003.05316.x
[33] Lukas, M.A. (2006) Robust Generalized Cross-Validation for Choosing the Regularization Parameter. Inverse Problems, 22, 1883-1902.
https://doi.org/10.1088/0266-5611/22/5/021
[34] Lukas, M.A. (2008) Strong Robust Generalized Cross-Validation for Choosing the Regularization Parameter. Inverse Problems, 24, Article ID: 034006.
https://doi.org/10.1088/0266-5611/24/3/034006
