Nonparametric Estimation in Linear Mixed Models with Uncorrelated Homoscedastic Errors

Abstract

Today, Linear Mixed Models (LMMs) are mostly fitted by assuming that the random effects and the errors have Gaussian distributions, and hence by using Maximum Likelihood (ML) or REML estimation. However, for many data sets, that double assumption is unlikely to hold, particularly for the random effects, a crucial component whose magnitude assessment is key in such modeling. Alternative fitting methods not relying on that assumption (such as ANOVA methods and Rao's MINQUE) quite often apply only to the very constrained class of variance components models. In this paper, a new computationally feasible estimation methodology is designed, first for the widely used class of 2-level (or longitudinal) LMMs, with the only assumption (beyond the usual basic ones) being that the residual errors are uncorrelated and homoscedastic, and with no distributional assumption imposed on the random effects. A major asset of this new approach is that it yields nonnegative variance estimates and covariance matrix estimates which are symmetric and, at least, positive semi-definite. Furthermore, it is shown that when the LMM is, indeed, Gaussian, this new methodology differs from ML only through a slight variation in the denominator of the residual variance estimate. The new methodology actually generalizes to LMMs a well known nonparametric fitting procedure for standard Linear Models. Finally, the methodology is also extended to ANOVA LMMs, generalizing an old method by Henderson for ML estimation in such models under normality.


1. Introduction

The Linear Mixed Model (LMM) is an extension of the classical Linear Model (LM) aimed at modeling a continuous scalar response Y in terms of observed covariates, some of which (say $X_1, \ldots, X_p$) have fixed effects as in the LM, i.e. the same for all individuals in the population under study, and others (say $Z_1, \ldots, Z_r$) have random effects, thus possibly varying between some well identified subgroups of that population. Mixed models have been used to analyze data sets with clustered, longitudinal or multilevel structure in a variety of fields, such as medicine [1] [2], agriculture [3], animal breeding [4], small area estimation [5] [6], genetics [7] [8], growth modeling [9] [10], etc. Detailed presentations of LMMs can be found in [11] or Demidenko [12], with a more practical emphasis in West et al. [13]. The subclass of variance components (or ANOVA) LMMs is thoroughly examined in Searle et al. [14].

At the outset of this field, that latter class of LMMs was almost the only one considered to fit real world data sets, with ANOVA methods at the forefront. But things started to turn around by the end of the 1970s, with the advent of widespread powerful computational tools (both in hardware and software) which have now sufficiently matured and are widely accessible. As a consequence, nowadays by far the most popular approach to fitting an LMM to a given data set is to use a Gaussian LMM, i.e. to assume that both the random effects and the model residual errors have Gaussian distributions, and then to use either the Maximum Likelihood (ML) or the Restricted ML (REML) estimation procedure. This is no easy task, however, since the corresponding (restricted) likelihood equations are quite involved, even to simply derive, and have no closed form solutions. Thus, in the last 4 - 5 decades, considerable endeavor, both theoretical and computational, has been devoted to deriving and numerically solving these equations as efficiently as possible [11] [15] [16] [17] [18] [19]. The numerical solution of the nonlinear system of (restricted) likelihood equations is most often obtained using one of two standard iterative methods: Newton-Raphson (NR) or Fisher scoring (FS, a more statistical variant of NR), and the Expectation-Maximization (EM) algorithm in its various forms [20] [21] [22].

Despite that unrivaled popularity, there is an obvious major issue with fitting Gaussian LMMs: the assumption of normality of both random effects and errors is dubious in many practical settings, especially for the former. Lange and Ryan [23], for instance, present practical cases of nonnormal random effects. In the last decade, faced with bigger and bigger data sets, both in size and dimensionality, a growing interest has been focused on how to analyze LMMs for such data. For instance, a quasi-likelihood approach for estimation and inference in linear mixed-effects models with high-dimensional fixed effects and possibly large or unbalanced cluster sizes has been recently proposed by Li et al. [24].

Now, prior to the widespread adoption of ML/REML in LMM fitting, some methods not relying on Gaussian assumptions had been developed and used, such as ANOVA methods [14] [25] [26] [27] [28], Henderson's Methods I, II and III [29], Rao's Minimum Norm Quadratic Unbiased Estimation (MINQUE) method [30] [31] [32] [33], and iterative MINQUE [34]. But these are tailored only to a very specific and limited class of LMMs, namely ANOVA ones (or variance components LMMs). Moreover, these methods share the common significant drawback of not ensuring nonnegative estimates of variance components. A qualified discussion in Searle et al. [14] seldom recommends them for general usage. Alternatively, a quasi-likelihood approach is presented by Jiang [35] using the REML equations in non-Gaussian ANOVA LMMs, showing consistency and asymptotic normality of the variance components estimators under some additional assumptions. Heyde [36] does the same, but emphasizes the relationship between quasi-likelihood and estimating functions. Iterative weighted least squares and iterative generalized least squares estimation methods for LMMs are presented, respectively, in Jiang et al. [37] and Goldstein [38].

Being based on quite restrictive assumptions, the aforementioned methods developed for non-Gaussian LMMs have a limited range of LMMs for which they are theoretically valid fitting methods. For instance, the overly simplifying assumptions on covariance matrices structure (like the ones in ANOVA and MINQUE methods) even exclude most Gaussian LMMs from that validity range, which helps explain why the ML and REML approaches have superseded them. In contrast, our goal in this work is to design an estimation methodology for LMMs in which the fixed effects vector of parameters and the random effects covariance matrix are estimated based on assumptions as weak as possible. We will first restrict attention to the very popular subclass of 2-level (also called longitudinal) LMMs pioneered by Laird and Ware [22], before extending the approach to fit ANOVA LMMs as well.

To achieve that goal, we take the practical standpoint here that, in most situations, except when some additional information is available about the data (such as serial correlation in errors for some longitudinal data), one has no other choice than to assume that the residual errors in the LMM to fit are uncorrelated and homoscedastic (u.ho.). This is probably the default option for error modeling (together with the Gaussian assumptions when using ML or REML) in most statistical software packages tailored for parameter estimation in an LMM. In our modeling of 2-level LMMs here, based on adding only that assumption to the basic ones of zero mean and finite covariances for random effects and errors, we devise a new approach for estimating the fixed effects parameter vector $\beta$, the cluster random effects covariance matrix $D$ and the residual error variance $\sigma_\varepsilon^2$. Thus, we do not impose any assumption on the type of the cluster random effects distribution. The approach can be viewed as an adaptation to 2-level LMMs of the well known 2-step procedure for fitting the LM to a given data set, under that same assumption, where first $\beta$ is estimated by Ordinary Least Squares (OLS), then the variance $\sigma_\varepsilon^2$ of the errors by a carefully designed unbiased estimator. On closer scrutiny, it also turns out to generalize a little-publicized estimating procedure for ML computation in Gaussian variance components models, credited to Henderson by Harville [16] and detailed in Searle et al. ( [14], pages 278-279).

As for the organization of the material presented in this article, Section 2 briefly reviews LMMs, up to the random effects prediction issue. Section 3 highlights some key preliminary results on LMMs which do not involve any Gaussian distributional assumption and which will serve as the backbone of our new estimation methodology. In Section 4, we construct that new estimation methodology for 2-level LMMs with u.ho. errors. Section 5 shows that our estimation methodology can be adapted both to 2-level LMMs with u.ho. errors and a diagonal cluster random effects covariance matrix, and to ANOVA LMMs. In Section 6, our new estimation approach is compared with the traditional Gaussian ML through a simulation study, an application to a classical data set with cluster structure, and an application to a longitudinal data set, with implementations done in the R software [39]. Section 7 draws our conclusion about the work presented. The Appendix contains the lengthiest proof of a result presented in the running text. A supplementary material document is also provided, which gathers some known results on LMMs which we have used but which are scattered throughout the literature, some new results of our own, as well as important implementation details about our iterative methods for fitting LMMs. To distinguish it from the article, the numbering of sections, results and equations in that document is prefixed by the letter "S".

Before we proceed, please note that in this paper, all vectors are columns. Moreover, $A^T$, $\operatorname{tr}(A)$, $\|u\| = \sqrt{u^T u}$ and $\|u\|_M = \sqrt{u^T M u}$ respectively denote the transpose of matrix A, the trace of square matrix A, the Euclidean norm of vector u and its norm w.r.t. the SPD matrix M, where "SPD" stands for symmetric and positive definite while "SPSD" means symmetric and positive semi-definite (matrix). $I_n$ is the identity matrix of order n. $\mathcal{M}_{n,p}(\mathbb{R})$ and $\mathcal{M}_n(\mathbb{R})$ are the spaces of matrices with real elements, respectively $n \times p$ and $n \times n$. $E(X)$ and $\mathrm{Mcov}(X)$ denote the mean and covariance matrix of random vector (or variable) X. $\mathcal{N}_q(m, \Sigma)$ is the q-dimensional Gaussian distribution with mean vector m and covariance matrix $\Sigma$, "$\sim$" stands for "has probability distribution", while estimating equation will be abbreviated EE.

2. Linear Mixed Models: An Overview

2.1. The General Form of an LMM

The general form of an LMM is [11]:

Y = X β + Z U + ε , (1)

where $Y \in \mathbb{R}^n$ is the observed response, $\beta \in \mathbb{R}^p$ is the unknown vector of fixed effects parameters, $U \in \mathbb{R}^q$ is the vector of unobserved random effects, $\varepsilon \in \mathbb{R}^n$ comprises the unknown residual errors, while $X \in \mathcal{M}_{n,p}(\mathbb{R})$ and $Z \in \mathcal{M}_{n,q}(\mathbb{R})$ are given design matrices. The usual basic assumptions are [11] [12] [22]:

Assumption A1g. U and $\varepsilon$ are two independent zero mean random vectors with respective positive definite covariance matrices $\mathrm{Mcov}(U) = G \in \mathcal{M}_q(\mathbb{R})$ and $\mathrm{Mcov}(\varepsilon) = R \in \mathcal{M}_n(\mathbb{R})$.

With these assumptions, the covariance matrix of Y is $V = \mathrm{Mcov}(Y) = ZGZ^T + R$. We will also use the following assumption:

Assumption A 2 g . The n × p design matrix X in (1) has full column rank.

In (1), the main objective is to estimate β and the covariance matrices R and G . LMMs are classified into two main groups, based on Gaussian distributional assumptions made or not on the random effects and errors in devising estimation procedures. Model (1) is termed Gaussian if, in addition to Assumption A 1 g , it holds that:

Assumption A3g. $U \sim \mathcal{N}_q(0, G)$ and $\varepsilon \sim \mathcal{N}_n(0, R)$.

If the random effects and/or errors are not assumed to be normal, but Assumption A 1 g holds, then model (1) is usually said to be (rather abusively) a non-Gaussian LMM. The main goal in this work is to devise estimation procedures for LMMs which do not use Assumption A 3 g . The only key added assumption will rather be:

Assumption A4g. $R = \sigma_\varepsilon^2 I_n$ ($\sigma_\varepsilon^2 > 0$), i.e. the residual errors in (1) are uncorrelated and homoscedastic.

2.2. The 2-Level (or Longitudinal) LMM

We will first design our new estimation methodology for one of the types of LMM most used in statistical analyses. Since Laird and Ware [22], this type of LMM is usually qualified as longitudinal, but we feel it more encompassing to label it 2-level. In this article, we will use the units-cluster terminology for 2-level LMMs. At the cluster level, such a model can be written by grouping the scalar responses for all units in each cluster j as:

$$Y_j = X_j \beta + Z_j U_j + \varepsilon_j, \qquad j = 1, \ldots, m, \qquad (2)$$

where $Y_j = (Y_{1j}, \ldots, Y_{n_j j})^T \in \mathbb{R}^{n_j}$ is the response vector; $\varepsilon_j = (\varepsilon_{1j}, \ldots, \varepsilon_{n_j j})^T \in \mathbb{R}^{n_j}$ is the vector of the errors; $X_j \in \mathcal{M}_{n_j, p}(\mathbb{R})$ is the fixed effects design matrix; $Z_j \in \mathcal{M}_{n_j, r}(\mathbb{R})$ is the random effects design matrix. The unknown parts of the model are $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T \in \mathbb{R}^p$, the vector of fixed effects parameters, and $U_j \in \mathbb{R}^r$, the unobserved vector of random effects for cluster j. We will also consider the following usual basic assumptions for such models:

Assumption A1. The $U_j$'s are independent and identically distributed in $\mathbb{R}^r$.

Assumption A2. The $\varepsilon_j$'s are independent.

Assumption A3. The set of $U_j$'s and the set of $\varepsilon_j$'s are independent of each other.

Assumption A4. $E(U_j) = 0_r$, $E(\varepsilon_j) = 0_{n_j}$, $\mathrm{Mcov}(U_j) = D \in \mathcal{M}_r(\mathbb{R})$, $\mathrm{Mcov}(\varepsilon_j) = R_j \in \mathcal{M}_{n_j}(\mathbb{R})$. Here, D and $R_j$ are, respectively, the covariance matrices of $U_j$ and $\varepsilon_j$ and are assumed finite and positive definite.

Model (2) can be written in the form (1), but with the covariance matrices having special block diagonal structures: $G = \mathrm{diag}(D, \ldots, D)$, $R = \mathrm{diag}(R_1, \ldots, R_m)$, $V = \mathrm{diag}(V_1, \ldots, V_m)$, with $V_j = Z_j D Z_j^T + R_j$, $j = 1, \ldots, m$.

2.3. Variance Components Models (or ANOVA LMMs)

We will subsequently show that our methodology can also be adapted to the popular ANOVA LMMs, i.e. LMMs (1) in which the term Z U can be decomposed as:

$$ZU = Z_1 U_1 + \cdots + Z_m U_m, \qquad (3)$$

where $Z_1, \ldots, Z_m$ are given design matrices, with $Z_j \in \mathcal{M}_{n, q_j}(\mathbb{R})$, $j = 1, \ldots, m$, and $U_1, \ldots, U_m$ now become the random effects vectors of respective dimensions $q_1, \ldots, q_m$ ($q_1 + \cdots + q_m = q$). Taking $Z_{m+1} = I_n$ and $U_{m+1} = \varepsilon$ (so $q_{m+1} = n$), the ANOVA LMM (1) with random effects term (3) is more neatly written:

$$Y = X\beta + Z\tilde{U}, \qquad (4)$$

with $Z = (Z_1, \ldots, Z_{m+1})$, $\tilde{U} = (U_1^T, \ldots, U_{m+1}^T)^T$ and default assumptions:

M1. $U_1, \ldots, U_{m+1}$ are independent;

M2. $\forall j = 1, \ldots, m+1$, $E(U_j) = 0$ and $\mathrm{Mcov}(U_j) = \sigma_j^2 I_{q_j}$, with $\sigma_j^2 > 0$ finite.

Then (4) implies $G = \mathrm{diag}(\sigma_1^2 I_{q_1}, \ldots, \sigma_m^2 I_{q_m})$, and the only unknown parameters in (4), besides $\beta$, are the variance components gathered in the vector $\theta = (\sigma_1^2, \ldots, \sigma_{m+1}^2)^T$. That is why (4) is called, under M1 and M2, a variance components model.

2.4. Prediction of Random Effects: The BLP and the BLUP of U

Besides being sometimes an important target in LMM modeling, it turns out that having an effective and computable (given parameters values) predictor for the random effects vector U in the LMM (1) is a crucial component in our newly designed estimation methodology. In that respect, a first candidate is the best linear predictor (BLP) of U given Y (see [14], Chapter 7):

$$\mathrm{BLP}(U \mid Y) = \bar{U} = G Z^T V^{-1} (Y - X\beta). \qquad (5)$$

But a far more popular predictor is the best linear unbiased predictor (BLUP) of U | Y given, under Assumptions A 1 g - A 2 g , by Searle [40]:

$$\mathrm{BLUP}(U \mid Y) = \tilde{U} = G Z^T V^{-1} (Y - X\tilde{\beta}), \qquad (6)$$

where $\tilde{\beta}$ is the Best Linear Unbiased Estimator (BLUE) of $\beta$,

$$\tilde{\beta} = (X^T V^{-1} X)^{-1} X^T V^{-1} Y. \qquad (7)$$

But, in our new methodology (as in previous ones), trying to use either the BLP or the BLUP requires first estimating G and R. Replacing them in (6) by estimators $\hat{G}$ and $\hat{R}$ yields $\hat{U} = \mathrm{EBLUP}(U \mid Y) = \hat{G} Z^T \hat{V}^{-1} (Y - X\hat{\beta})$, a so-called Empirical BLUP (EBLUP) of $U \mid Y$, with $\hat{V} = Z\hat{G}Z^T + \hat{R}$, and an Empirical BLUE (EBLUE) $\hat{\beta} = (X^T \hat{V}^{-1} X)^{-1} X^T \hat{V}^{-1} Y$ of $\beta$. However, importantly for designing our new fitting methodology for LMMs, we stress that $\mathrm{BLUP}(U \mid Y)$ can also be viewed as a preliminary estimator of $\mathrm{BLP}(U \mid Y)$. Consequently, any $\mathrm{EBLUP}(U \mid Y)$ can also be taken as a practical estimator of $\bar{U} = \mathrm{BLP}(U \mid Y)$, thus also qualifying as an Empirical BLP (EBLP) of $U \mid Y$. Indeed, that is how we will estimate $\bar{U}$, using Henderson's mixed model equations briefly reviewed next.

2.5. Henderson’s Mixed Model Equations

For given G and R in a Gaussian LMM (1), to simultaneously get an estimator for $\beta$ and a predictor for U, Henderson [41] [42] maximized, w.r.t. $\beta$ and U (the latter also viewed as a parameter), the joint density of the random couple (Y, U). That maximization yields the so-called Henderson's mixed model equations (HMMEs), given in matrix form as:

$$\begin{pmatrix} X^T R^{-1} X & X^T R^{-1} Z \\ Z^T R^{-1} X & Z^T R^{-1} Z + G^{-1} \end{pmatrix} \begin{pmatrix} \tilde{\beta} \\ \tilde{U} \end{pmatrix} = \begin{pmatrix} X^T R^{-1} Y \\ Z^T R^{-1} Y \end{pmatrix}. \qquad (8)$$

From the outset, there has been debate over whether this was a valid way of trying to estimate $\beta$ and predict U. Henderson himself acknowledged (see [43], page 16) that (even given an observed $Y = y \in \mathbb{R}^n$) the maximized function is not a true likelihood for the couple $(\beta, U)$ because U is not a fixed unknown parameter but rather an unobserved random variable. So, strictly speaking, this does not qualify as an ML method. Nonetheless, from our standpoint, that debate is mostly peripheral. The only two relevant practical questions are: 1) What are the solutions $\tilde{\beta}$ and $\tilde{U}$ of the linear system (8)? 2) What properties do they have if taken, respectively, as an estimator of $\beta$ and a predictor of U? The answer to both questions is provided by the following result, credited by Harville [16] to Henderson in an unpublished 1963 report:

Theorem 1 ( [16] [44] ). If X has full column rank while G and R are SPD matrices, then the unique solutions $\tilde{\beta}$ and $\tilde{U}$ of the linear system (8) are precisely the BLUE of $\beta$ and the BLUP of $U \mid Y$ given by (7) and (6), with $V = ZGZ^T + R$.

So, in the general LMM (1), given G , R and the observed response vector Y , the BLUE β ˜ of β and the BLUP U ˜ of U | Y are the unique solutions of the HMMEs.
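For concreteness, here is a minimal R sketch (not code from the paper; the function name and interface are ours) of how the linear system (8) can be assembled and solved for given G and R, returning the BLUE of $\beta$ and the BLUP of U as asserted by Theorem 1.

solve_hmme <- function(X, Z, y, G, R) {
  # Assemble Henderson's mixed model equations (8) for given G and R
  Rinv <- solve(R)
  p <- ncol(X); q <- ncol(Z)
  A <- rbind(
    cbind(t(X) %*% Rinv %*% X, t(X) %*% Rinv %*% Z),
    cbind(t(Z) %*% Rinv %*% X, t(Z) %*% Rinv %*% Z + solve(G))
  )
  b <- c(t(X) %*% Rinv %*% y, t(Z) %*% Rinv %*% y)
  sol <- solve(A, b)
  list(beta_tilde = sol[1:p],              # BLUE of beta, Equation (7)
       u_tilde    = sol[(p + 1):(p + q)])  # BLUP of U, Equation (6)
}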

3. More about Henderson’s Mixed Model Equations and LMMs without Gaussian Assumptions

As announced in the introduction, we intend to devise a new estimation methodology, first for 2-level LMMs, which, contrary to usage outside of the Gaussian case, simultaneously estimates unknown parameters and predicts random effects values under Assumptions A 1 - A 4 and A 4 g . In that perspective, we give, in this section, and, especially, in Section 3.2, a series of new results about LMMs which do not use any Gaussian assumption, be it on the random effects or the residual errors. Those results will serve as basis for the design of our new fitting methodology for LMMs in Section 4.

3.1. An Important Preliminary: The BLUE, the BLUP and the HMMEs Do Not Need Gaussian Assumptions

Though Henderson, as reported above, famously discovered his eponymous mixed model Equations (8) via the Gaussian route, Theorem 1, which solves them, uses no Gaussian distributional assumption to do so. It actually uses no probabilistic or statistical framework or reasoning whatsoever. It is rather a pure Matrix Algebra result asserting that the linear system (8) has as its unique solutions the BLUE $\tilde{\beta}$ (7) of $\beta$ and the BLUP $\tilde{U}$ (6) of $U \mid Y$. Now, one can derive both $\tilde{\beta}$ and $\tilde{U}$ by optimizing, for each, a specific minimum variance criterion, but without relying on any Gaussian assumption, or any parametric one for that matter [40]. That double observation shows that the HMMEs are not tied to Gaussian distributional assumptions, contrary to the assumptions people always feel compelled to impose before using them.

Remark 1. For the latter reason, we will use the HMMEs as our first two estimating equations when devising the first approach (coded 3S-A1-V1) of our new estimation methodology for LMMs in Section 4.1.

3.2. More about the HMMEs Solutions

Here, we present a series of new Matrix Algebra results and their consequences for LMMs with u.ho. errors, without imposing any Gaussian assumption, much in line with Theorem 1. The discovery of those results triggered the design of our new fitting methodology for that class of LMMs. They derive from the very peculiar structure of the first HMME which we start by stressing.

3.2.1. The First HMME as a Weighted Least Squares Problem

In (8), the first equation can be rewritten: $X^T R^{-1} X \tilde{\beta} = X^T R^{-1} (Y - Z\tilde{U})$. One then recognizes the normal equations of the Weighted Least Squares (WLS) problem:

$$\min_{\beta \in \mathbb{R}^p} \left\| (Y - Z\tilde{U}) - X\beta \right\|_{R^{-1}}. \qquad (9)$$

Combining that observation with Theorem 1 yields the remarkable result:

Theorem 2. In the LMM (1) with Assumptions A1g - A2g, and given the BLUP $\tilde{U}$ of $U \mid Y$, the BLUE $\tilde{\beta}$ of $\beta$ is the WLS estimate of $\beta$ in the LM: $Y - Z\tilde{U} = X\beta + \tilde{\varepsilon}$, with response $Y - Z\tilde{U}$, design matrix X, error $\tilde{\varepsilon}$ and weighting matrix $R^{-1}$.

But Theorem 2 will be more useful to us in designing our new methodology for LMM fitting in Section 4 when one adds the assumption that the LMM has uncorrelated and homoscedastic errors. We detail hereafter why.

3.2.2. More about the BLUE of β and the BLUP of U in an LMM with u.ho. Errors

Here, we focus attention on LMMs with u.ho. residual errors, i.e. Assumption A4g holds, in addition to A1g - A2g. But again, no Gaussian assumption will be relied upon: only the u.ho. errors and their impact on the geometry of the HMMEs solutions play a role here. Furthermore, we emphasize that our results to be presented hereafter are valid for any LMM (1) with u.ho. errors, and not only for 2-level LMMs with such errors. In that respect, first note that in the u.ho. residual errors scenario, the HMMEs (8) simplify significantly. Indeed, if $R = \sigma_\varepsilon^2 I_n$ with $\sigma_\varepsilon^2 > 0$, then (8) reduces to:

$$\begin{pmatrix} X^T X & X^T Z \\ Z^T X & Z^T Z + \sigma_\varepsilon^2 G^{-1} \end{pmatrix} \begin{pmatrix} \tilde{\beta} \\ \tilde{U} \end{pmatrix} = \begin{pmatrix} X^T Y \\ Z^T Y \end{pmatrix}. \qquad (10)$$

The results derived hereafter highlight some peculiar properties of the solutions of the linear system (10), thus yielding some remarkable intertwined relationships, under Assumption A4g, between its unique solutions (Theorem 1), namely the BLUE $\tilde{\beta}$ of $\beta$ and the BLUP $\tilde{U}$ of $U \mid Y$.

We start with an identity which is a trivial consequence of the structure of V in an LMM (1) under u.ho. errors. Indeed, then $V = ZGZ^T + \sigma_\varepsilon^2 I_n$, so:

$$I_n - ZGZ^T V^{-1} = \sigma_\varepsilon^2 V^{-1}. \qquad (11)$$

This is instrumental first for:

Lemma 3. Consider the LMM (1) with Assumptions A1g and A4g. Let $(\tilde{\beta}, \tilde{U}) \in \mathbb{R}^p \times \mathbb{R}^q$ be such that $\tilde{U} = GZ^T V^{-1}(Y - X\tilde{\beta})$. Then one has:

$$Y - X\tilde{\beta} - Z\tilde{U} = \sigma_\varepsilon^2 V^{-1}(Y - X\tilde{\beta}). \qquad (12a)$$

Moreover, if Assumption A2g also holds, then the following equivalence is true:

$$\tilde{\beta} = (X^T V^{-1} X)^{-1} X^T V^{-1} Y \iff \tilde{\beta} = (X^T X)^{-1} X^T (Y - Z\tilde{U}). \qquad (12b)$$

Proof. Let Assumptions A1g and A4g hold, $R = \sigma_\varepsilon^2 I_n$ and $\tilde{U} = GZ^T V^{-1} \tilde{Y}$, with $\tilde{Y} = Y - X\tilde{\beta}$, for an arbitrary $\tilde{\beta} \in \mathbb{R}^p$. Then $Y - X\tilde{\beta} - Z\tilde{U} = \tilde{Y} - ZGZ^T V^{-1}\tilde{Y} = (I_n - ZGZ^T V^{-1})\tilde{Y} = \sigma_\varepsilon^2 V^{-1}\tilde{Y}$, the last equality thanks to (11), which proves (12a).

Now, if Assumption A2g is also true, then the square matrix $X^T X$ is nonsingular; so, given that $Y - Z\tilde{U} = X\tilde{\beta} + \sigma_\varepsilon^2 V^{-1}\tilde{Y}$ from (12a), then $(X^T X)^{-1} X^T (Y - Z\tilde{U}) = \tilde{\beta} + \sigma_\varepsilon^2 (X^T X)^{-1} X^T V^{-1}(Y - X\tilde{\beta})$. Hence, since $\sigma_\varepsilon^2 > 0$, $\tilde{\beta} = (X^T X)^{-1} X^T (Y - Z\tilde{U}) \iff X^T V^{-1}(Y - X\tilde{\beta}) = 0$. Then (12b) follows because Assumption A2g implies that $X^T V^{-1} X$ is an SPD matrix, thus nonsingular. □

Theorem 1 and Lemma 3 imply the following for the solutions of (10):

Theorem 4. In the LMM (1), if $P = V^{-1} - V^{-1}X(X^T V^{-1} X)^{-1}X^T V^{-1}$ and Assumptions A1g - A2g, A4g hold, then:

$$Y - X\tilde{\beta} - Z\tilde{U} = \sigma_\varepsilon^2 V^{-1}(Y - X\tilde{\beta}) = \sigma_\varepsilon^2 P Y, \qquad (13a)$$

$$\tilde{\beta} = (X^T V^{-1} X)^{-1} X^T V^{-1} Y = (X^T X)^{-1} X^T (Y - Z\tilde{U}). \qquad (13b)$$

Proof. Thanks to Lemma 3, the respective definitions (7) and (6) of the BLUE $\tilde{\beta}$ and the BLUP $\tilde{U}$, and the standard identity $V^{-1}(Y - X\tilde{\beta}) = PY$. □

Remark 2. The usefulness of the second expression of $\tilde{\beta}$ in (13b) stems from the fact that one recognizes the OLS estimate, under Assumption A2g, of $\beta$ in the LM $\tilde{Y} = X\beta + \tilde{\varepsilon}$, where $\tilde{Y} = Y - Z\tilde{U}$. This clearly suggests how, in an iterative algorithm for LMM fitting, one may update the estimate of $\beta$ given the current estimate of the BLUP of U. That observation is precisely what motivated the construction of the second approach (coded 3S-A1-V2) in our new estimation methodology for LMMs in Section 4.1.

We then have the following corollary:

Corollary 1. If $A = Y^T(Y - X\tilde{\beta} - Z\tilde{U})$ and $B = \|Y - X\tilde{\beta} - Z\tilde{U}\|^2$ in the LMM (1) with Assumptions A1g - A2g and A4g, then

$$A - B = \sigma_\varepsilon^2 \left\| Z^T V^{-1}(Y - X\tilde{\beta}) \right\|_G^2 = \sigma_\varepsilon^2 \left\| \tilde{U} \right\|_{G^{-1}}^2. \qquad (14)$$

Therefore, the real random variable $Y^T(Y - X\tilde{\beta} - Z\tilde{U})$ is always nonnegative.

Proof. Let Assumptions A1g - A2g and A4g be true and $\tilde{Y} = Y - X\tilde{\beta}$. First,

$$A - B = (X\tilde{\beta} + Z\tilde{U})^T(\tilde{Y} - Z\tilde{U}) = \tilde{U}^T Z^T(\tilde{Y} - Z\tilde{U}), \qquad (15)$$

where we have used the fact that Theorem 4 and Remark 2 imply: $X\tilde{\beta} \perp (\tilde{Y} - Z\tilde{U})$. Now, thanks to (13a) in Theorem 4, $\tilde{Y} - Z\tilde{U} = \sigma_\varepsilon^2 V^{-1}\tilde{Y}$. The latter, inserted in (15) alongside the expression (6) of $\tilde{U}$, implies:

$$A - B = \sigma_\varepsilon^2 \tilde{Y}^T V^{-1} Z G Z^T V^{-1} \tilde{Y} = \sigma_\varepsilon^2 (Z^T V^{-1}\tilde{Y})^T G (Z^T V^{-1}\tilde{Y}) = \sigma_\varepsilon^2 \left\| Z^T V^{-1}\tilde{Y} \right\|_G^2.$$

Equivalently, $A - B = \sigma_\varepsilon^2 (GZ^T V^{-1}\tilde{Y})^T G^{-1} (GZ^T V^{-1}\tilde{Y}) = \sigma_\varepsilon^2 \tilde{U}^T G^{-1} \tilde{U} = \sigma_\varepsilon^2 \|\tilde{U}\|_{G^{-1}}^2$. □

We end this series of results about the geometry of the HMMEs solutions with one about two remarkable expectations related to the u.ho. residual errors variance σ ε 2 in the LMM (1):

Theorem 5. In the LMM (1) with Assumptions A1g - A2g and A4g, one has, with P as in Theorem 4:

$$E\left[ Y^T (Y - X\tilde{\beta} - Z\tilde{U}) \right] = \sigma_\varepsilon^2 (n - p), \qquad (16a)$$

$$E\left\| Y - X\tilde{\beta} - Z\tilde{U} \right\|^2 = \sigma_\varepsilon^4 \operatorname{tr}(P). \qquad (16b)$$

Proof. First, (13a) in Theorem 4 implies that $Y^T(Y - X\tilde{\beta} - Z\tilde{U}) = \sigma_\varepsilon^2 Y^T P Y$. Hence, with $m_Y = E(Y) = X\beta$ and $V = \mathrm{Mcov}(Y)$,

$$E\left[ Y^T (Y - X\tilde{\beta} - Z\tilde{U}) \right] = \sigma_\varepsilon^2 E(Y^T P Y) = \sigma_\varepsilon^2 \left[ m_Y^T P m_Y + \operatorname{tr}(PV) \right]. \qquad (17a)$$

But $P m_Y = P X\beta = V^{-1}X\beta - V^{-1}X\beta = 0$, so $m_Y^T P m_Y = 0$. Furthermore, since $PV = I_n - V^{-1}X(X^T V^{-1}X)^{-1}X^T$, then $\operatorname{tr}(PV) = n - \operatorname{tr}\left[ (X^T V^{-1}X)^{-1} X^T V^{-1} X \right] = n - p$. Substituting these results in (17a) yields (16a).

Similarly, from (13a), we get $\|Y - X\tilde{\beta} - Z\tilde{U}\|^2 = \sigma_\varepsilon^4 Y^T P^2 Y$, since P is symmetric. Hence

$$E\left\| Y - X\tilde{\beta} - Z\tilde{U} \right\|^2 = \sigma_\varepsilon^4 E(Y^T P^2 Y) = \sigma_\varepsilon^4 \left[ m_Y^T P^2 m_Y + \operatorname{tr}(P^2 V) \right].$$

Now, $m_Y^T P^2 m_Y = 0$ as $P m_Y = 0$. Also, since $PVP = P$,

$$\operatorname{tr}(P^2 V) = \operatorname{tr}(PVP) = \operatorname{tr}(P),$$

hence (16b). □

Remark 3. Identity (16a) is remarkable and seems familiar. Indeed, it is the analogue, for LMMs with u.ho. errors, of the identity which yields the unbiased estimator of the residual variance in an LM with that same type of errors. And we will use it in much the same way when deriving our new estimation methodology for LMMs with u.ho. errors. Indeed, in conjunction with the last sentence of Corollary 1, (16a) easily suggests how to compute, in an LMM fitting algorithm, a nonnegative update to estimate σ ε 2 given the current estimates of β and the BLUP of U.

3.3. More about the BLP of U | Y in an LMM

The BLP $\bar{U} = \mathrm{BLP}(U \mid Y)$, given by (5), will play a key role in our new estimation methodology for LMMs with u.ho. errors. That is because, first of all, its derivation does not require any Gaussian distributional assumption. Secondly, its computation and that of its covariance matrix are cheaper than those of the BLUP, especially under this scenario of u.ho. errors. See the needed formulas in the Supplementary material. In the latter, we also derive new expectation formulas specifically based on the BLP (whereas the ones in the previous section, such as (16a), were based on the BLUP) which can be used to obtain EEs for the residual variance in an LMM.

In our new methodology to be presented from Section 4 onwards, the results in Section 3.2 above and the Supplementary material will provide the tools to derive EEs for $\beta$, $\sigma_\varepsilon^2$ and $\mathrm{BLP}(U \mid Y)$. On the other hand, to seek an EE for D, we will start from the fact that given $\bar{U}_j = \mathrm{BLP}(U_j \mid Y_j)$ and $V_j^* = D - \mathrm{Mcov}(\bar{U}_j)$ ($j = 1, \ldots, m$), then, under Assumptions A1 - A4, A2g and A4g, we get, as an unbiased estimator of D, the $r \times r$ matrix

$$\tilde{D}_1 = \frac{1}{m} \sum_{j=1}^{m} \left( V_j^* + \bar{U}_j \bar{U}_j^T \right). \qquad (18)$$

4. Estimation in 2-Level LMMs with u.ho. Errors: A New Approach

We first present our approach for estimating the 3 main parameters of interest in a 2-level LMM (2) under Assumptions A1 - A4, A2g and A4g: $\beta \in \mathbb{R}^p$, $\sigma_\varepsilon^2 > 0$ and $D \in \mathcal{M}_r(\mathbb{R})$, alongside obtaining a prediction for $U = (U_1^T, \ldots, U_m^T)^T \in \mathbb{R}^q$. For that latter aspect, let, for $j = 1, \ldots, m$, $\bar{u}_j = \mathrm{BLP}(U_j \mid Y_j = y_j) = \sigma_\varepsilon^{-2} V_j^* Z_j^T (y_j - X_j \beta)$, the BLP of $U_j$ given the response $Y_j = y_j \in \mathbb{R}^{n_j}$ in cluster j. We will predict U through estimating

$$\bar{u} = \mathrm{BLP}(U \mid Y = y) = (\bar{u}_1^T, \ldots, \bar{u}_m^T)^T = \sigma_\varepsilon^{-2} V^* Z^T (y - X\beta), \qquad (19)$$

the BLP of U given the response $Y = y \in \mathbb{R}^n$ for the whole sample.

Our approach is based on iteratively solving appropriate EEs for $\beta$, $\sigma_\varepsilon^2$, D, $\bar{u}$, derived using only sound nonparametric estimation principles. Devising nonparametric EEs from given data to solve a statistical problem is an old idea in Statistics. The method of moments, whenever applicable, is in that category. In the field of LMMs, one may even say that that is where it all started, with the ANOVA methods for fitting variance components models. But it has turned out to be especially hard to derive EEs applicable to fitting a broad class of LMMs, as unconstrained as possible, while not relying on Gaussian assumptions. Attempts in that direction include the iterative weighted least squares and iterative generalized least squares estimation methods for LMMs presented, respectively, in Jiang et al. [37] and Goldstein [38]. But they are still not sufficiently general. What we propose hereafter is a 3-step construction to get closer to such a goal.

4.1. Two 3-Step Sequences for Estimation in 2-Level LMMs with u.ho. Errors

The key point in our approach is the fact that the BLP is computationally cheaper to handle than the BLUP (including at the covariance matrix level) while at the same time, as observed at the end of Section 2.4, an empirical BLUP can also be viewed as an empirical BLP. So, to devise an estimation procedure for $\beta$, $\sigma_\varepsilon^2$, D and a prediction for U from $Y = y$, and given the preliminary results in Section 3, our starting ideas are described in what follows (we provide 2 different versions). In them, when $\hat{D}$ is an estimate of D, it is understood that $\hat{G} = \mathrm{diag}(\hat{D}, \ldots, \hat{D})$ (m times) is the corresponding estimate of G.

4.1.1. Starting Ideas: Version 1

Step 1: Estimating $\beta$ and $\bar{u}$, given estimates of $\sigma_\varepsilon^2$ and D

Given preliminary estimates $\hat{\sigma}_\varepsilon^2$ and $\hat{D}$ of $\sigma_\varepsilon^2$ and D, one can obtain estimates $\hat{\beta}$ of $\beta$ and $\hat{u}$ of $\bar{u}$ by solving the system of HMMEs (10) using $\sigma_\varepsilon^2 = \hat{\sigma}_\varepsilon^2$ and $G = \hat{G}$.

Step 2: Improved estimate of $\sigma_\varepsilon^2$, given $\hat{\beta}$, $\hat{u}$ from Step 1

With $\hat{\beta}$ and $\hat{u}$ obtained as above in Step 1, (16a) in Theorem 5 suggests taking

$$\hat{\sigma}_\varepsilon^2 = \frac{y^T (y - X\hat{\beta} - Z\hat{u})}{n - p} \qquad (20)$$

as a hopefully improved estimate of $\sigma_\varepsilon^2$. By Corollary 1, the way $\hat{\beta}$ and $\hat{u}$ were obtained in Step 1 implies that $\hat{\sigma}_\varepsilon^2$ is a nonnegative real number.

Step 3: Improved estimate of D, given preliminary estimates $\hat{\sigma}_\varepsilon^2$, $\tilde{D}$, $\hat{u}$ of $\sigma_\varepsilon^2$, D, $\bar{u}$

Here, we propose an approach to get an improved estimate $\hat{D}$ of D from respective preliminary estimates $\hat{\sigma}_\varepsilon^2$, $\tilde{D}$, $\hat{u}$ of $\sigma_\varepsilon^2$, D, and $\bar{u}$ in (19). We start with the unbiased estimator $\tilde{D}_1$ of D given by (18), which is not computable from the available data since it still depends on the unknown parameters, including D itself. Nevertheless, with the preliminary estimates $\tilde{D}$ of D and $\hat{\sigma}_\varepsilon^2$ of $\sigma_\varepsilon^2$, Equation (S:1.7c) in the Supplementary material suggests estimating $V_j^*$ by $\hat{V}_j^* = \hat{\sigma}_\varepsilon^2 (Z_j^T Z_j + \hat{\sigma}_\varepsilon^2 \tilde{D}^{-1})^{-1}$. Then, for $j = 1, \ldots, m$, given the available estimate $\hat{u}_j$ of $\bar{u}_j$, (18) suggests as a hopefully improved computable estimate of D:

$$\hat{D} = \frac{1}{m} \sum_{j=1}^{m} \left( \hat{V}_j^* + \hat{u}_j \hat{u}_j^T \right). \qquad (21)$$

4.1.2. Starting Ideas: Version 2

Here, Step 3 is the same as in Version 1 above, but Steps 1 and 2 rather go as follows:

Step 1: Estimation of $\bar{u}$ and $\beta$, given estimates of $\beta$, D, $\sigma_\varepsilon^2$

Assume that we have respective preliminary estimates $\tilde{\beta}$, $\hat{D}$, $\hat{\sigma}_\varepsilon^2$ of $\beta$, D, $\sigma_\varepsilon^2$. Given $Y = y$ and (19), we estimate $\bar{u}$ by

$$\hat{u} = \hat{\sigma}_\varepsilon^{-2} \hat{V}^* Z^T (y - X\tilde{\beta}), \qquad (22)$$

with $\hat{V}^* = \hat{\sigma}_\varepsilon^2 (Z^T Z + \hat{\sigma}_\varepsilon^2 \hat{G}^{-1})^{-1}$. From Theorem 4, we hope to get an improved estimate of $\beta$ through:

$$\hat{\beta} = (X^T X)^{-1} X^T (y - Z\hat{u}). \qquad (23)$$

Remark 4. This is not the same as Step 1 of Version 1 because it is $\tilde{\beta}$ which appears in the r.h.s. of (22) instead of $\hat{\beta}$. So here, $\hat{u}$ and $\hat{\beta}$ are not solutions to the HMMEs given $G = \hat{G}$ and $R = \hat{\sigma}_\varepsilon^2 I_n$.

Remark 5. In (22), we have $\hat{u} = (\hat{u}_1^T, \ldots, \hat{u}_m^T)^T$, where for $j = 1, \ldots, m$, $\hat{u}_j \in \mathbb{R}^r$ is given by:

$$\hat{u}_j = \hat{\sigma}_\varepsilon^{-2} \hat{V}_j^* Z_j^T (y_j - X_j \tilde{\beta}), \qquad \hat{V}_j^* = \hat{\sigma}_\varepsilon^2 (Z_j^T Z_j + \hat{\sigma}_\varepsilon^2 \hat{D}^{-1})^{-1}. \qquad (24)$$

Remark 6. The vector $\hat{\beta}$ given by (23) is the OLS estimate, under Assumption A2g, of $\beta$ in the LM: $y - Z\hat{u} = X\beta + \tilde{\varepsilon}$, with $y - Z\hat{u}$ as response and $\tilde{\varepsilon}$ as vector of residual errors.

Step 2: Improved estimate of $\sigma_\varepsilon^2$, given $\hat{\beta}$, $\hat{u}$ from Step 1, and preliminary estimates of $\sigma_\varepsilon^2$, D

With Remark 4 above, there is no guarantee that the numerator of (20) is a nonnegative number given $\hat{u}$ and $\hat{\beta}$ from Step 1 in this Version 2. To ensure a nonnegative estimate of $\sigma_\varepsilon^2$, we combine Corollary 1 and Theorem 5 to obtain, from the available estimates $\tilde{\sigma}_\varepsilon^2$ and $\hat{D}$ of $\sigma_\varepsilon^2$ and D, the hopefully improved estimate of $\sigma_\varepsilon^2$:

$$\hat{\sigma}_\varepsilon^2 = \frac{\| y - X\hat{\beta} - Z\hat{u} \|^2 + \tilde{\sigma}_\varepsilon^2 \, \| \hat{u} \|_{\hat{G}^{-1}}^2}{n - p}. \qquad (25)$$

Remark 7. From Corollary 1, Theorem 5 and Propositions S1.4-S1.5, S1.8-S1.9, two alternative, somewhat more sophisticated, computable nonnegative estimates of $\sigma_\varepsilon^2$ in this context are:

$$\hat{\sigma}_\varepsilon^2 = \frac{\| y - X\hat{\beta} - Z\hat{u} \|^2 + q\,\tilde{\sigma}_\varepsilon^2}{n - p + \operatorname{tr}(\hat{G}^{-1}\hat{V}^*)} = \frac{\| y - X\hat{\beta} - Z\hat{u} \|^2 + q\,\tilde{\sigma}_\varepsilon^2}{n - p + \operatorname{tr}\left( \hat{D}^{-1} \sum_{j=1}^{m} \hat{V}_j^* \right)}, \qquad (26a)$$

$$\hat{\sigma}_\varepsilon^2 = \frac{\operatorname{tr}(\hat{V}^* Z^T Z)}{\| \hat{u} \|_{\hat{G}^{-1}}^2} = \sum_{j=1}^{m} \operatorname{tr}(\hat{V}_j^* Z_j^T Z_j) \Big/ \sum_{j=1}^{m} \| \hat{u}_j \|_{\hat{D}^{-1}}^2. \qquad (26b)$$

4.2. Targeted Estimates for β, σ ε 2 , D and Prediction of U

4.2.1. The Estimating Equations

The reasoning in Steps 1-2-3 of Version 1 above leads us to seek, given the data vector $Y = y$, respective estimates $\hat{\beta}$, $\hat{\sigma}_\varepsilon^2$, $\hat{D}$ of the parameters $\beta$, $\sigma_\varepsilon^2$, D, and a prediction $\hat{u}$ of U satisfying the system of equations:

$$\begin{pmatrix} X^T X & X^T Z \\ Z^T X & Z^T Z + \hat{\sigma}_\varepsilon^2 \hat{G}^{-1} \end{pmatrix} \begin{pmatrix} \hat{\beta} \\ \hat{u} \end{pmatrix} = \begin{pmatrix} X^T y \\ Z^T y \end{pmatrix}, \qquad (27a)$$

$$\hat{\sigma}_\varepsilon^2 = \frac{y^T (y - X\hat{\beta} - Z\hat{u})}{n - p}, \qquad (27b)$$

$$\hat{D} = \frac{1}{m} \sum_{j=1}^{m} \left( \hat{V}_j^* + \hat{u}_j \hat{u}_j^T \right). \qquad (27c)$$

Steps 1-2-3 of Version 2 suggest the following system, with $\hat{D}$ as in (27c):

$$\hat{u} = \hat{\sigma}_\varepsilon^{-2} \hat{V}^* Z^T (y - X\hat{\beta}), \qquad (28a)$$

$$\hat{\beta} = (X^T X)^{-1} X^T (y - Z\hat{u}), \qquad (28b)$$

$$\hat{\sigma}_\varepsilon^2 = \frac{\| y - X\hat{\beta} - Z\hat{u} \|^2 + \hat{\sigma}_\varepsilon^2 \, \| \hat{u} \|_{\hat{G}^{-1}}^2}{n - p}. \qquad (28c)$$

We so obtain two distinct sets of fixed point EEs for $\beta$, $\sigma_\varepsilon^2$, D and $\bar{u}$. Our estimation approach for the 2-level LMM (2) under Assumptions A1 - A4 and A2g, A4g consists in solving either of them through a successive approximations iterative procedure. However, we stress that, from Theorem 4 and Corollary 1, if $\hat{D}$ is SPD, $\hat{\sigma}_\varepsilon^2 > 0$ and Assumption A2g holds, then one has the following logical equivalence:

(27a) - (27b) $\iff$ (28a) - (28b) - (28c).

So the above two systems of equations are actually equivalent, and, thus, have the same solutions if any. However, they do suggest two different successive approximations algorithms to try to calculate these 4 unknowns.

4.2.2. 3S: A Code Name for the LMMs New Estimation Methodology

Our proposed two estimating algorithms are presented in the next section. Since each of them is based on a 3-step sequence construction, we shall use the code name 3S for this new methodology for fitting LMMs. In the future, we will present other variants of this approach. So the current one is coded 3S-A1, and the two algorithms we present hereafter are coded 3S-A1-V1 and 3S-A1-V2.

4.2.3. Two Iterative Estimating Procedures for 2-Level LMMs with u.ho. Errors

Given the response vector $y = (y_1^T, \ldots, y_m^T)^T \in \mathbb{R}^n$ and the EEs (27a)-(27c), our first estimating algorithm for a 2-level LMM with u.ho. errors is:

Algorithm 3S-A1-V1. Estimating $\beta$, $\sigma_\varepsilon^2$, D and predicting $U_1, \ldots, U_m$ in a 2-level LMM with u.ho. errors: Version 1

1) Initialization: At iteration 0, we estimate $\beta$, $\sigma_\varepsilon^2$, D and predict $U_1, \ldots, U_m$ as follows:

a) $\hat{\beta}^{(0)} = (X^T X)^{-1} X^T y$, the OLS estimate of $\beta$ in the linear model $y = X\beta + \varepsilon$;

b) $\hat{\sigma}_\varepsilon^{2(0)} = \| y - X\hat{\beta}^{(0)} \|^2 / (n - p)$, the corresponding unbiased estimate of the residual variance;

c) $\forall j = 1, \ldots, m$, $\hat{u}_j^{(0)} = (Z_j^T Z_j)^{+} Z_j^T y_j^{(0)}$, the OLS estimate of $U_j$ in the linear model $y_j^{(0)} = Z_j U_j + \varepsilon_j$, with $U_j$ considered as a fixed parameter, $Z_j$ the design matrix, $y_j^{(0)} = y_j - X_j \hat{\beta}^{(0)}$ as the response, and $(Z_j^T Z_j)^{+}$ being the Moore-Penrose pseudo-inverse of $Z_j^T Z_j$;

d) $\hat{D}^{(0)} = \frac{1}{m} \sum_{j=1}^{m} \hat{u}_j^{(0)} \hat{u}_j^{(0)T}$;

2) The iterative process: Given $\hat{\beta}^{(t)}$, $\hat{\sigma}_\varepsilon^{2(t)}$, $\hat{u}^{(t)} = (\hat{u}_1^{(t)T}, \ldots, \hat{u}_m^{(t)T})^T$ and $\hat{D}^{(t)}$ from iteration t, compute estimates and predictions at iteration t + 1 as follows:

a) $\hat{A}_j^{(t+1)} = Z_j^T Z_j + \hat{\sigma}_\varepsilon^{2(t)} [\hat{D}^{(t)}]^{-1}$, $j = 1, \ldots, m$;

b) Solve for $\hat{\beta}^{(t+1)}$ and $\hat{u}^{(t+1)}$ in the linear system, with $\hat{A}^{(t+1)} = \mathrm{diag}(\hat{A}_1^{(t+1)}, \ldots, \hat{A}_m^{(t+1)})$:

$$\begin{pmatrix} X^T X & X^T Z \\ Z^T X & \hat{A}^{(t+1)} \end{pmatrix} \begin{pmatrix} \hat{\beta}^{(t+1)} \\ \hat{u}^{(t+1)} \end{pmatrix} = \begin{pmatrix} X^T y \\ Z^T y \end{pmatrix}; \qquad (29)$$

c) $\hat{\sigma}_\varepsilon^{2(t+1)} = y^T \left( y - X\hat{\beta}^{(t+1)} - Z\hat{u}^{(t+1)} \right) / (n - p)$;

d) $\hat{V}_j^{(t+1)} = \hat{\sigma}_\varepsilon^{2(t+1)} [\hat{A}_j^{(t+1)}]^{-1}$, $j = 1, \ldots, m$;

e) $\hat{D}^{(t+1)} = \frac{1}{m} \sum_{j=1}^{m} \left[ \hat{V}_j^{(t+1)} + \hat{u}_j^{(t+1)} \hat{u}_j^{(t+1)T} \right]$;

3) Stopping criterion: Assume convergence when $\| \hat{D}^{(t+1)} - \hat{D}^{(t)} \|_{(M)} \le \delta_1 \| \hat{D}^{(t+1)} \|_{(M)}$, $\| \hat{\beta}^{(t+1)} - \hat{\beta}^{(t)} \| \le \delta_2 \| \hat{\beta}^{(t+1)} \|$ and $| \hat{\sigma}_\varepsilon^{2(t+1)} - \hat{\sigma}_\varepsilon^{2(t)} | \le \delta_3 \hat{\sigma}_\varepsilon^{2(t+1)}$ are all satisfied, where $\delta_1, \delta_2, \delta_3$ are relative tolerance levels set in $(0, 1)$, and $\| \cdot \|_{(M)}$ is a chosen matrix norm. Otherwise, set $t \leftarrow t + 1$ and repeat Step 2.

4) Extracting estimates: At convergence, take $\hat{\beta} = \hat{\beta}^{(t+1)}$, $\hat{\sigma}_\varepsilon^2 = \hat{\sigma}_\varepsilon^{2(t+1)}$, $\hat{D} = \hat{D}^{(t+1)}$ as estimates of $\beta$, $\sigma_\varepsilon^2$, D. Also, take $\hat{u}_1^{(t+1)}, \ldots, \hat{u}_m^{(t+1)}$ as predictions of $U_1, \ldots, U_m$.

Remark 8. In that algorithm, whereas stabilization of the D and $\sigma_\varepsilon^2$ iterates is enough to ensure that the values of all the other estimates have stabilized as well, the stopping criterion of the algorithm also monitors the $\beta$ iterates for added security.
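To make the above steps fully concrete, the following R sketch implements Algorithm 3S-A1-V1 under the stated assumptions. It is only an illustration written for this exposition (function and argument names are ours, MASS::ginv is used for the pseudo-inverse in the initialization, and none of the safeguards discussed in Remark 9 below are included); it is not the implementation used for the numerical results of Section 6.

fit_3s_a1_v1 <- function(X, y, Zlist, idx, max_iter = 500, tol = 1e-8) {
  # X: n x p fixed effects design; y: length-n response;
  # Zlist[[j]]: n_j x r random effects design of cluster j;
  # idx[[j]]: rows of (X, y) belonging to cluster j.
  m <- length(Zlist); r <- ncol(Zlist[[1]])
  n <- length(y); p <- ncol(X)
  Z <- matrix(0, n, m * r)              # block-diagonal random effects design
  for (j in 1:m) Z[idx[[j]], ((j - 1) * r + 1):(j * r)] <- Zlist[[j]]
  # Initialization (iteration 0)
  beta <- drop(solve(crossprod(X), crossprod(X, y)))
  sig2 <- sum((y - X %*% beta)^2) / (n - p)
  ulist <- lapply(1:m, function(j) {
    rj <- y[idx[[j]]] - X[idx[[j]], , drop = FALSE] %*% beta
    drop(MASS::ginv(crossprod(Zlist[[j]])) %*% crossprod(Zlist[[j]], rj))
  })
  D <- Reduce(`+`, lapply(ulist, tcrossprod)) / m
  # Iterations
  for (t in 1:max_iter) {
    Dinv  <- solve(D)
    Alist <- lapply(Zlist, function(Zj) crossprod(Zj) + sig2 * Dinv)   # step a)
    A <- matrix(0, m * r, m * r)
    for (j in 1:m) A[((j - 1) * r + 1):(j * r), ((j - 1) * r + 1):(j * r)] <- Alist[[j]]
    M   <- rbind(cbind(crossprod(X), crossprod(X, Z)),                 # step b): system (29)
                 cbind(crossprod(Z, X), A))
    sol <- solve(M, c(crossprod(X, y), crossprod(Z, y)))
    beta_new <- sol[1:p]
    u        <- sol[-(1:p)]
    sig2_new <- drop(crossprod(y, y - X %*% beta_new - Z %*% u)) / (n - p)  # step c)
    ulist <- split(u, rep(1:m, each = r))
    Vlist <- lapply(Alist, function(Aj) sig2_new * solve(Aj))          # step d)
    D_new <- Reduce(`+`, Map(function(Vj, uj) Vj + tcrossprod(uj), Vlist, ulist)) / m  # step e)
    conv  <- norm(D_new - D, "F") <= tol * norm(D_new, "F") &&
             abs(sig2_new - sig2) <= tol * sig2_new &&
             sqrt(sum((beta_new - beta)^2)) <= tol * sqrt(sum(beta_new^2))
    beta <- beta_new; sig2 <- sig2_new; D <- D_new
    if (conv) break
  }
  list(beta = beta, sigma2_eps = sig2, D = D, u = ulist)
}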

Given the alternative system of EEs (28a)-(28c), our second algorithm for 2-level LMMs with u.ho. errors goes as follows:

Algorithm 3S-A1-V2. Estimating $\beta$, $\sigma_\varepsilon^2$, D and predicting $U_1, \ldots, U_m$ in a 2-level LMM with u.ho. errors: Version 2

1) Initialization: The same as in Algorithm 3S-A1-V1;

2) The iterative process: Given $\hat{\beta}^{(t)}$, $\hat{\sigma}_\varepsilon^{2(t)}$, $\hat{u}_1^{(t)}, \ldots, \hat{u}_m^{(t)}$, hence $\hat{u}^{(t)} = (\hat{u}_1^{(t)T}, \ldots, \hat{u}_m^{(t)T})^T$, and $\hat{D}^{(t)}$ from iteration t, compute estimates and predictions at iteration t + 1 as follows:

a) $\hat{B}_j^{(t+1)} = (Z_j^T Z_j + \hat{\sigma}_\varepsilon^{2(t)} [\hat{D}^{(t)}]^{-1})^{-1}$, $j = 1, \ldots, m$;

b) $\hat{\beta}^{(t+1)} = (X^T X)^{-1} X^T (y - Z\hat{u}^{(t)})$;

c) $\hat{u}_j^{(t+1)} = \hat{B}_j^{(t+1)} Z_j^T (y_j - X_j \hat{\beta}^{(t+1)})$, $j = 1, \ldots, m$;

d) $\hat{\sigma}_\varepsilon^{2(t+1)} = \frac{1}{n - p} \left( \| y - Z\hat{u}^{(t+1)} - X\hat{\beta}^{(t+1)} \|^2 + \hat{\sigma}_\varepsilon^{2(t)} \| \hat{u}^{(t+1)} \|^2_{[\hat{G}^{(t)}]^{-1}} \right)$, with $\hat{G}^{(t)} = \mathrm{diag}(\hat{D}^{(t)}, \ldots, \hat{D}^{(t)})$;

e) $\hat{V}_j^{(t+1)} = \hat{\sigma}_\varepsilon^{2(t+1)} \hat{B}_j^{(t+1)}$, $j = 1, \ldots, m$;

f) $\hat{D}^{(t+1)} = \frac{1}{m} \sum_{j=1}^{m} \left[ \hat{V}_j^{(t+1)} + \hat{u}_j^{(t+1)} \hat{u}_j^{(t+1)T} \right]$;

3) Stopping criterion and extracting estimates: As in Algorithm 3S-A1-V1.
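For comparison, here is a sketch of one iteration t → t + 1 of Algorithm 3S-A1-V2, with the same conventions as in the 3S-A1-V1 sketch above (again purely illustrative code of ours): $\beta$ is updated from the previous prediction of U, then U from the new $\beta$, rather than solving for both jointly.

step_3s_a1_v2 <- function(state, X, y, Zlist, idx) {
  # state: list(beta, sig2, D, ulist) carried over from iteration t
  m <- length(Zlist); r <- ncol(Zlist[[1]]); n <- length(y); p <- ncol(X)
  Dinv  <- solve(state$D)
  Blist <- lapply(Zlist, function(Zj) solve(crossprod(Zj) + state$sig2 * Dinv))  # a)
  Zu_old <- numeric(n)
  for (j in 1:m) Zu_old[idx[[j]]] <- Zlist[[j]] %*% state$ulist[[j]]
  beta <- drop(solve(crossprod(X), crossprod(X, y - Zu_old)))                    # b)
  ulist <- lapply(1:m, function(j)                                               # c)
    drop(Blist[[j]] %*% crossprod(Zlist[[j]],
         y[idx[[j]]] - X[idx[[j]], , drop = FALSE] %*% beta)))
  Zu <- numeric(n)
  for (j in 1:m) Zu[idx[[j]]] <- Zlist[[j]] %*% ulist[[j]]
  pen  <- sum(vapply(ulist, function(uj) drop(crossprod(uj, Dinv %*% uj)), 0))   # ||u||^2 w.r.t. G^{-1}
  sig2 <- (sum((y - X %*% beta - Zu)^2) + state$sig2 * pen) / (n - p)            # d)
  Vlist <- lapply(Blist, function(Bj) sig2 * Bj)                                 # e)
  D <- Reduce(`+`, Map(function(Vj, uj) Vj + tcrossprod(uj), Vlist, ulist)) / m  # f)
  list(beta = beta, sig2 = sig2, D = D, ulist = ulist)
}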

Remark 9. Important elements about the practical implementation of the above two algorithms are detailed further in Section S2 of the accompanying Supplementary material document. In particular, it is explained there how we monitor a possible rank deficiency of D and, in the first place, whether or not random effects are really present in a given data set. For the latter aspect, we introduce and motivate

$$\rho = \| Z G Z^T \|_{(M)} / \sigma_\varepsilon^2, \qquad (30)$$

a quantity called the random effects ratio, through formula (S:2.6b), to try to assess the likelihood of having significant random effects in the data or not.
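As a small illustration (our code; the paper's actual criterion is formula (S:2.6b) in the Supplementary material), with the Frobenius norm taken as the chosen matrix norm, the ratio (30) can be estimated from fitted quantities as:

random_effects_ratio <- function(Z, G_hat, sig2_hat) {
  # rho = ||Z G Z^T|| / sigma_eps^2, here with the Frobenius matrix norm
  norm(Z %*% G_hat %*% t(Z), type = "F") / sig2_hat
}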

4.3. First Properties of the Estimates $\hat{\beta}$, $\hat{\sigma}_\varepsilon^2$, $\hat{D}$ and $\hat{u}$

The convergence of the above two algorithms has not been investigated yet. However, the fact that their convergence can only occur to estimates $\hat{\beta}$, $\hat{\sigma}_\varepsilon^2$, $\hat{u}$, $\hat{D}$ (of $\beta$, $\sigma_\varepsilon^2$, $\bar{u}$, D) satisfying the two equivalent systems of EEs of Section 4.2.1 allows us to identify some first properties of those estimates. Indeed, assume that either of the above two algorithms has converged to $\hat{\beta}$, $\hat{\sigma}_\varepsilon^2$, $\hat{D}$, $\hat{u}$, with $\hat{D}$ SPD, $\hat{\sigma}_\varepsilon^2 > 0$ and Assumption A2g true. We highlight two important properties of these estimates.

4.3.1. Relationship with the HMMEs

Firstly, Theorems 1 & 4 and Corollary 1 imply the following about $\hat{\beta}$, $\hat{\sigma}_\varepsilon^2$, $\hat{D}$, $\hat{u}$:

Theorem 6. If one takes $G = \hat{G}$ and $\sigma_\varepsilon^2 = \hat{\sigma}_\varepsilon^2$ in the HMMEs (10), then $\tilde{\beta} = \hat{\beta}$ and $\tilde{U} = \hat{u}$ are the solutions to (10). Therefore, the estimates $\hat{\beta}$, $\hat{\sigma}_\varepsilon^2$, $\hat{D}$ and the prediction $\hat{u}$ satisfy:

$$\hat{\beta} = \mathrm{EBLUE}(\beta \mid y, \hat{G}, \hat{\sigma}_\varepsilon^2) = (X^T \hat{V}^{-1} X)^{-1} X^T \hat{V}^{-1} y, \qquad (31)$$

$$\hat{u} = \mathrm{EBLUP}(U \mid y, \hat{G}, \hat{\sigma}_\varepsilon^2) = \hat{G} Z^T \hat{V}^{-1} (y - X\hat{\beta}), \qquad (32)$$

where $\hat{V} = Z\hat{G}Z^T + \hat{\sigma}_\varepsilon^2 I_n$. That is, $\hat{\beta}$ and $\hat{u}$ are, respectively, an empirical BLUE of $\beta$ and an empirical BLUP of U when G is estimated by $\hat{G}$ and $\sigma_\varepsilon^2$ by $\hat{\sigma}_\varepsilon^2$.

Remark 10. The main difference between Algorithms 3S-A1-V1 and 3S-A1-V2 is that the former enforces satisfaction of the HMMEs by the computed estimates at each iteration $t \ge 1$, while the latter only guarantees that for the final estimates, if convergence is achieved.

4.3.2. Relationship with the 2-Level Gaussian LMM Likelihood Equations under i.i.d. Errors

The EEs (27b) and (28b) indicate that our 2 estimation procedures above are extensions, to 2-level LMMs with u.ho. errors, of the well known 2-step method to fit an LM with such errors, whereby $\beta$ is first estimated by OLS and, then, that estimator is used to get an unbiased estimator of $\sigma_\varepsilon^2$, and this without using any parametric distributional assumption. Now, it is trivial that for the LM with i.i.d. Gaussian errors, that 2-step estimation method is asymptotically equivalent to the ML estimation of $\beta$ and $\sigma_\varepsilon^2$. We are going to show that the same property holds for our 2 estimation procedures when the 2-level LMM is, indeed, a Gaussian one with i.i.d. errors, at least in terms of the solutions $\hat{\beta}$, $\hat{\sigma}_\varepsilon^2$, $\hat{D}$ satisfying the likelihood equations. So, we assume a 2-level LMM satisfying Assumptions A1 - A4 and, for $j = 1, \ldots, m$:

Assumption A5. $U_j \sim \mathcal{N}_r(0, D)$ and $\varepsilon_j \sim \mathcal{N}_{n_j}(0, \sigma_\varepsilon^2 I_{n_j})$.

Estimating the parameters $\beta$, $\sigma_\varepsilon^2$, D of that 2-level Gaussian LMM through ML usually starts by trying to solve the likelihood equations:

$$\frac{\partial}{\partial \beta} \, l(\beta, \sigma_\varepsilon^2, D \mid y) = 0, \qquad (33a)$$

$$\frac{\partial}{\partial D} \, l(\beta, \sigma_\varepsilon^2, D \mid y) = 0, \qquad (33b)$$

$$\frac{\partial}{\partial \sigma_\varepsilon^2} \, l(\beta, \sigma_\varepsilon^2, D \mid y) = 0, \qquad (33c)$$

where $l(\beta, \sigma_\varepsilon^2, D \mid y)$ is the log-likelihood of the model with observed response $Y = y$.

First, it is well known that, given our assumptions:

$$(33a) \iff \beta = (X^T V^{-1} X)^{-1} X^T V^{-1} y. \qquad (34)$$

Thus, thanks to (31), the triplet $(\beta, \sigma_\varepsilon^2, D) = (\hat{\beta}, \hat{\sigma}_\varepsilon^2, \hat{D})$ satisfies the first likelihood Equation (33a). For the other two likelihood equations, generalizing computations in Searle et al. ( [14], pages 278-279) for the ANOVA model (4), one has (proof in the Appendix):

Theorem 7. Let $\tilde{\beta}$ and $\tilde{U}$ satisfy the HMMEs (10) given $\sigma_\varepsilon^2$, D for the 2-level LMM (2), with Assumptions A1 - A4, A2g and A5.

1) If (33a) holds, then:

$$(33b) \iff D = \frac{1}{m} \sum_{j=1}^{m} \left( V_j^* + \tilde{U}_j \tilde{U}_j^T \right).$$

2) If (33a) and (33b) both hold, then:

$$(33c) \iff \sigma_\varepsilon^2 = y^T (y - X\tilde{\beta} - Z\tilde{U}) / n.$$

Given the EE (27c), the first equivalence in that theorem shows that the second likelihood Equation (33b) is satisfied by the quadruplet $(\beta, \sigma_\varepsilon^2, D, U) = (\hat{\beta}, \hat{\sigma}_\varepsilon^2, \hat{D}, \hat{u})$. But, given our EE (27b), we see that the triplet $(\beta, \sigma_\varepsilon^2, U) = (\hat{\beta}, \hat{\sigma}_\varepsilon^2, \hat{u})$ falls short of satisfying (33c) only through having a denominator $n - p$ instead of n, exactly as in the LM case. So, when the 2-level LMM is a Gaussian one with i.i.d. errors, our 2 estimation procedures above are asymptotically equivalent to the ML method. However, as in the LM case, we stick here with our denominator $n - p$, inspired by the unbiasedness property (16a) in Theorem 5.

Remark 11. An important by-product of Theorem 7 is that, by replacing $n - p$ with n in the denominator of the formula updating the residual variance estimate during the iterations of Algorithms 3S-A1-V1 and 3S-A1-V2, we get two new algorithms for computing the Gaussian ML estimates in a 2-level LMM. They may be coded 3S-ML-A1-V1 and 3S-ML-A1-V2 and can be viewed as alternatives to existing algorithms based on Newton-Raphson, Fisher scoring or EM (Expectation-Maximization).

4.4. 3S LMM Fitting: What about the Accuracy of the Estimates?

Obviously, once a 3S algorithm has converged to the targeted estimates of the parameters for fitting an LMM to a given data set, the next question is, as usual in statistical inference: what about the accuracy of those estimates? In a strictly parametric modeling with Gaussian ML or REML estimation, the answer is customarily built through estimating the inverse of Fisher's Information Matrix for the estimated parameters. But we cannot go that route here because our aim was precisely a nonparametric modeling of the provided data set with the LMM. Instead, to assess the accuracy of the estimates computed by a 3S algorithm, we use the bootstrap. But bootstrapping mixed models requires significantly more care than is usually the case with routine i.i.d. data. Methods and up-to-date discussions about this can be found, for instance, in Carpenter et al. [45], Van der Leeden et al. [46], Chambers and Chandra [47], Thai et al. [48], Modugno and Giannerini [49].

5. 3S Fitting of LMMs: Beyond 2-Level LMMs

As stated from the outset, the first goal in this work was to design a methodology for parameter estimation in a 2-level LMM where the only assumption added to the basic ones (A1 to A4) would be that of u.ho. errors (Assumption A4g). Nonetheless, analyzing the 3-step sequence upon which each of our 3S iterative algorithms was constructed in Section 4, one is struck by the fact that the 2-level element only intervenes in Step 3, i.e. when deriving an EE for D. In the construction of both algorithms, what is done in Steps 1-2 is based on the results derived in Section 3.2.2 and on classical formulas for the BLP in an LMM and its covariance matrix, and thus only uses Assumptions A1g, A2g and A4g of an arbitrary LMM (1). It follows that, for any LMM satisfying the latter 3 assumptions, if one can reliably devise an EE for G, then one immediately gets respective adaptations of Algorithms 3S-A1-V1 and 3S-A1-V2 for fitting it. To that end, one can start from the fact that (analogously to (18) for the 2-level case)

$$\tilde{G}_1 = V^* + \bar{U}\bar{U}^T \qquad (35)$$

is an unbiased estimator of G, where $V^* = G - \mathrm{Mcov}(\bar{U})$.

The most obvious case is when G is a diagonal matrix, but with some diagonal elements being equal by design. We examine two classical such situations hereafter: the 2-level LMM with D diagonal and the ANOVA LMM (4).

5.1. 3S Fitting of 2-Level LMMs with u.ho. Errors and D Diagonal

In a 2-level LMM (2), when the dimension r of the vectors $U_1, \ldots, U_m$ is large while the dimension n of the response vector is small to moderate, the number $p + mr + r(r+1)/2$ of scalar parameters to estimate in $\beta$, D and the random effects predictors might become too large for any LMM fitting method. In that case, imposing a diagonal structure on D may be a convenient way to achieve a reasonable estimation process for $\beta$, D and $\sigma_\varepsilon^2$ from the observed response vector $Y = y \in \mathbb{R}^n$. This motivates assuming then that $D = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_r^2)$, with $\sigma_k^2 > 0$, $k = 1, \ldots, r$.

With that, the unbiased estimator $\tilde{D}_1$ of D given by (18) implies that, for each $k = 1, \ldots, r$, an unbiased preliminary estimator of the variance component $\sigma_k^2$ is

$$\tilde{\sigma}_k^2 = \frac{1}{m} \sum_{j=1}^{m} \left[ (V_j^*)_{kk} + (\bar{U}_j)_k^2 \right], \qquad (36)$$

where $(V_j^*)_{kk}$ and $(\bar{U}_j)_k$ are the kth diagonal element of the matrix $V_j^*$ and the kth component of the vector $\bar{U}_j$. Thus the EE (21) for D simplifies here to:

$$\hat{D} = \mathrm{diag}(\hat{\sigma}_1^2, \ldots, \hat{\sigma}_r^2), \qquad \hat{\sigma}_k^2 = \frac{1}{m} \sum_{j=1}^{m} \left[ (\hat{V}_j^*)_{kk} + (\hat{u}_j)_k^2 \right]. \qquad (37)$$

Then one can easily adapt Algorithms 3S-A1-V1 and 3S-A1-V2 for the 2-level LMM with u.ho. errors and diagonal D , to get, say, Algorithms 3S-A1-V1-diag and 3S-A1-V2-diag. The latter are presented in Section S4.1 of the Supplementary material document.
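As an illustration of the simplification, the update (37) only needs the diagonal entries of the $\hat{V}_j^*$'s; a minimal R sketch of this updating step (our code, same conventions as the earlier sketches) is:

update_D_diag <- function(Vlist, ulist) {
  # EE (37): Vlist[[j]] is the current estimate of V_j^*, ulist[[j]] the current u_j
  r <- length(ulist[[1]]); m <- length(ulist)
  sig2_k <- sapply(1:r, function(k)
    mean(sapply(1:m, function(j) Vlist[[j]][k, k] + ulist[[j]][k]^2)))
  diag(sig2_k, nrow = r)   # returns the updated diagonal matrix D_hat
}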

5.2. 3S Fitting of an ANOVA LMM

In the ANOVA LMM (4), we can partition $\bar{U} = \mathrm{BLP}(U \mid Y)$ as in (S:1.6a), but where each $\bar{U}_j$ is now a $q_j$-vector ($j = 1, \ldots, m$). Now, let $s_0 = 0$ and, for $j = 1, \ldots, m$, $s_j = q_1 + \cdots + q_j = s_{j-1} + q_j$. Then each vector $\bar{U}_j$ is made up of the $(s_{j-1}+1)$th to $(s_j)$th components of $\bar{U}$.

With the structure of G in this case, the estimator $\tilde{G}_1$ in (35) implies that an unbiased preliminary estimator of each variance component $\sigma_j^2$ is

$$\tilde{\sigma}_j^2 = \frac{1}{q_j} \left[ \sum_{k = s_{j-1}+1}^{s_j} (V^*)_{kk} + \| \bar{U}_j \|^2 \right] \qquad (38)$$

for $j = 1, \ldots, m$. Hence, we get an EE for G here as:

$$\hat{G} = \mathrm{diag}(\hat{\sigma}_1^2 I_{q_1}, \ldots, \hat{\sigma}_m^2 I_{q_m}), \qquad \hat{\sigma}_j^2 = \frac{1}{q_j} \left[ \sum_{k = s_{j-1}+1}^{s_j} (\hat{V}^*)_{kk} + \| \hat{u}_j \|^2 \right]. \qquad (39)$$

From that, Algorithms 3S-A1-V1 and 3S-A1-V2 can be adapted to fit an ANOVA LMM, yielding, say, Algorithms 3S-A1-V1-ANOVA and 3S-A1-V2-ANOVA. They are presented in Section S4.2 of the Supplementary material. The first one is nearly the same as one of those derived by Henderson for Gaussian ANOVA LMMs [14] [16]. The only difference is again the $n - p$ in the denominator on the r.h.s. of the EE for $\sigma_\varepsilon^2$ in (27b).
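Analogously, a minimal R sketch (our code) of the variance components update (39) for an ANOVA LMM, given the current estimate of $V^*$ and the stacked prediction $\hat{u}$ of length $q = q_1 + \cdots + q_m$:

update_G_anova <- function(Vstar, u, qvec) {
  # qvec = c(q_1, ..., q_m); blocks of u and of diag(Vstar) follow that partition
  s <- cumsum(c(0, qvec))
  sig2 <- sapply(seq_along(qvec), function(j) {
    k <- (s[j] + 1):s[j + 1]
    (sum(diag(Vstar)[k]) + sum(u[k]^2)) / qvec[j]
  })
  # G_hat = diag(sig2_1 I_{q_1}, ..., sig2_m I_{q_m}) as in (39)
  diag(rep(sig2, times = qvec), nrow = sum(qvec))
}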

6. Numerical Examples

In this section, we carry out some numerical experiments on simulated data sets (Section 6.1) and two real world data sets (Section 6.2). For both situations, we compared Algorithm 3S-A1-V1 vs. Gaussian ML. For the latter, and in light of Theorem 7, we could have used Algorithm 3S-ML-A1-V1 in Remark 11. But that would have looked like a self comparison. So we instead chose to use the implementation of Gaussian ML by the lmer function in the reference lme4 package [50] of the R software. Hereafter, we will denote the latter by lmer-ML. The reason why Algorithm 3S-A1-V2 is not included in the study is that, as expected, it tends to give the same results as Algorithm 3S-A1-V1.

6.1. A Simulation Study

Here, to investigate the performance of Algorithm 3S-A1-V1 vs. lmer-ML, we fitted both to 500 simulated data sets, each of size n = 200, under various distributional scenarios. First, we assumed a population $\Omega$ comprising m = 10 clusters in equal proportions of 1/10 and that, in each cluster j, both the vector X of fixed effects covariates and the vector Z of random effects covariates follow Gaussian distributions with given parameters, respectively in $\mathbb{R}^4$ and $\mathbb{R}^3$. To get a unit in a simulated data set, we first sample its cluster among the 10 in $\Omega$, then its covariates by sampling from the Gaussian distributions of X and Z in that cluster. The distribution parameters used for that, and the simulated distributions described below, are given in Section S5 of the Supplementary material document, alongside the vector of fixed effects parameters $\beta \in \mathbb{R}^4$ we used.

To get one data set, after simulating the 200 units in it with their fixed and random effects covariates as just described,

• we first simulate a vector of random effects $U_j \in \mathbb{R}^3$ for each cluster $j \in \{1, 2, \ldots, 10\}$ according to a chosen distribution;

• then, for each unit i sampled in cluster j and included in the data set, with simulated fixed effects covariates $X_{ij} \in \mathbb{R}^4$ and random effects covariates $Z_{ij} \in \mathbb{R}^3$, we simulate a residual error $\varepsilon_{ij}$ according to a chosen distribution. Its simulated exact response is then calculated as $Y_{ij} = X_{ij}^T \beta + Z_{ij}^T U_j + \varepsilon_{ij}$.

For the distribution of the residual errors, we examined three options: a Gaussian, a mixture of 4 Gaussians with given mixing proportions, and a discrete distribution with 4 mass points. On the other hand, the covariance matrix D of each $U_j$ is $3 \times 3$. For the distribution of the $U_j$'s in $\mathbb{R}^3$, we examined five options: Gaussian with D SPD, Gaussian with $\mathrm{rank}(D) = 1$, Gaussian with $\mathrm{rank}(D) = 2$, a mixture of 3 Gaussians with given mixing proportions, and a discrete distribution with 3 mass points. Thus, we have $3 \times 5 = 15$ possible scenarios. Hence, for each of Algorithm 3S-A1-V1 and lmer-ML, in each of the 15 scenarios, we simulated 500 data sets of size 200 and fitted the 2-level LMM (expressed in the R software notation):

Y ~ fe(-1 + X1 + X2 + X3 + X4) + re(-1 + Z1 + Z2 + Z3) + Gr(cluster)   (40)

where "-1" signifies that no intercept is included, be it in the fixed or in the random effects.

The fit of the LMM (40) to each simulated data set yields an estimate $\hat{\theta} \in \{\hat{\beta}, \hat{D}, \hat{\sigma}^2\}$ of each parameter $\theta \in \{\beta, D, \sigma^2\}$, and a prediction

$\hat{U} = (\hat{U}_1^T, \ldots, \hat{U}_{10}^T)^T$ of the random effects vector $U = (U_1^T, \ldots, U_{10}^T)^T \in \mathbb{R}^{30}$.

Furthermore, to estimate the Mean Squared Prediction Error (MSPE) of the unit response Y by its prediction $\hat{Y}$ from a fitted model, for each simulated data set, we independently simulated an additional data set of size 100 with simulated response vector $Y \in \mathbb{R}^{100}$ and computed $\hat{Y}$, the response vector predicted by the fitted LMM. For each $\theta \in \{\beta, D, \sigma^2\}$, we then used its simulated replicates $\hat{\theta}_1, \ldots, \hat{\theta}_{500}$ to estimate

$$b_R(\hat{\theta}) = \| E(\hat{\theta}) - \theta \| / \| \theta \| \quad \text{and} \quad \mathrm{RMSE}(\hat{\theta}) = E\| \hat{\theta} - \theta \|^2 / \| \theta \|^2, \qquad (41)$$

respectively the Euclidean norm of the Relative Bias and the Relative Mean Squared Error of $\hat{\theta}$. Each expectation was estimated by the corresponding Monte Carlo estimate. In that calculation, the symmetric $3 \times 3$ matrices $\theta = D$ and $\hat{\theta} = \hat{D}$ were each identified with the vector of the elements of their upper triangular part (including the diagonal). Likewise, for each predictor $\hat{P} \in \{\hat{U}, \hat{Y}\}$ of $P \in \{U, Y\}$, we used its simulated replicates to estimate the norm of its bias and its MSPE:

$$b(\hat{P}) = \| E(\hat{P}) - P \| \quad \text{and} \quad \mathrm{MSPE}(\hat{P}) = E\| \hat{P} - P \|^2. \qquad (42)$$
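For clarity, the Monte Carlo estimation of (41) from the 500 replicates can be sketched in R as follows (our code; replicates of $\hat{D}$ are first vectorized through their upper triangular part as described above):

vec_upper <- function(D) D[upper.tri(D, diag = TRUE)]   # vectorize a symmetric matrix

rel_bias_rmse <- function(theta_hats, theta) {
  # theta_hats: one replicate per row; theta: the true parameter vector
  mc_mean <- colMeans(theta_hats)
  dev2    <- rowSums((theta_hats - matrix(theta, nrow(theta_hats),
                                          length(theta), byrow = TRUE))^2)
  c(rel_bias = sqrt(sum((mc_mean - theta)^2)) / sqrt(sum(theta^2)),  # b_R in (41)
    rel_mse  = mean(dev2) / sum(theta^2))                            # RMSE in (41)
}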

For both algorithms, the results about the estimation of $\beta$, D, $\sigma_\varepsilon^2$ and the prediction of the random effects and responses are reported in Table 1.

Table 1. Estimates of the bias and MSE of estimates of parameters, predicted random effects and responses in the LMM (40) applied to simulated data sets with different distributions for the random effects and residual errors.

D is the cluster random effects (RE) $3 \times 3$ covariance matrix; R(D) = rank(D) is the rank of D. In each cell of two numbers in the table, the one on top is the result from Algorithm 3S-A1-V1, while the one below it, in parentheses, is the result from Gaussian ML.

Before we specifically comment on those results, note, however, that when running an iterative algorithm on simulated data sets, the algorithm might fail to converge on some of them. In our case, the strategy was to discard such data sets, but, in each scenario, to run the simulation until 500 data sets had been successfully fitted by the considered algorithm. Nonetheless, while Algorithm 3S-A1-V1 never failed during this set of simulations, the situation turned out to be particularly tricky when running Algorithm lmer-ML. Indeed, with its default settings, its inner workings are such that it signaled a failure of convergence on a massive number of simulated data sets, so we actually decided not to reject them all, rejecting only the cases where the algorithm signaled a difficulty to converge due to a suspected singular fit, i.e. true model parameters likely on or close to the boundary of their space of values. Even with that, the only scenarios where we did not witness any failure of lmer-ML were under the Gaussian mixture distribution for the cluster random effects in $\mathbb{R}^3$, whereas one would have only expected failure when the covariance matrix D is rank deficient.

As might be expected, the results of both algorithms are comparable. Nevertheless, while there is no aspect of the results in which the lmer-ML algorithm uniformly or almost uniformly dominates 3S-A1-V1, a dominance of the latter over the former can be observed, in almost all scenarios, in the form of a lower bias of the estimate of $\sigma_\varepsilon^2$, and nearly so in the form of a lower MSE of that same estimate. Such an almost uniform dominance of the 3S-A1-V1 algorithm can also be seen in the prediction of the response, in terms of bias. As for comparing the two algorithms in the various individual scenarios, the main striking observation is that 3S-A1-V1 almost uniformly dominates lmer-ML when the cluster random effects covariance matrix D is rank deficient, more decisively so the smaller that rank.

6.2. Two Real World Data Sets

In this section, we apply Algorithm 3S-A1-V1 to two classical data sets, the cake data [51] and the Blackmore data (from the car R package), and compare the results to those from Gaussian ML. The results of estimating β, σ_ε^2, D and predicting U_1, …, U_m by Algorithm 3S-A1-V1 will be presented. Each of these estimates is given with variability indicators: estimates of the bias, Mean Squared Error (MSE), t-statistic, p-value for significance and a 95% two-sided confidence interval (CI), all computed using the nonparametric residuals bootstrap approach outlined in Carpenter et al. [45]. For each LMM fit to a data set, we will also provide an estimate of the random effects ratio ρ in the data, given by (30).

6.3. Application to the Cake Data

The cake data set consists of observations of the breakage angle of chocolate cakes made with 3 different recipes and baked at 6 different temperatures, with 15 replicates for each combination of a recipe and a temperature. Hence, there are 3 × 6 × 15 = 270 sample units (the baked cakes) for each of which there is a recorded value of the variables: angle (numeric), recipe (factor with levels A, B, C), temperature (ordered factor with levels 175 < 185 < 195 < 205 < 215 < 225), replicate (factor with levels 1 to 15), temp (with the same values as temperature, but viewed as numeric). In the R software notations [39], we fitted to these data the two LMMs:

angle ∼ fe(−1 + temp) + re(1) + Gr(recipe:replicate) (43a)

angle ∼ fe(−1 + temperature) + re(1) + Gr(recipe:replicate) (43b)

In the LMM (43a), the response is angle, the only fixed effects variable is temp, and we have a random intercept as the only random effects variable, while the clustering variable is obtained by crossing the categories of the two factors recipe and replicate, thus yielding 3 × 15 = 45 clusters. The difference between the LMMs (43a) and (43b) is that the latter uses instead, as lone fixed effects variable, temperature viewed as a 6-level factor. The reason the intercept has been excluded from the fixed effects part of both models is that a preliminary fit of the LMM (43a) with an intercept among the fixed effects revealed it to be insignificant through the bootstrap procedure. The interest in fitting the two models is that it allows one to assess whether the actual values of the temperature really matter for the quality of the fit.

The results for models (43a) and (43b) are presented in Table 2 and Table 3, respectively. For a given parameter estimate, the t-statistic is the estimate value divided by the square root of its MSE, and the p-value is calculated assuming that the t-statistic has, approximately, a standard Gaussian distribution when the true

Table 2. Results of algorithm 3S-A1-V1 fitting the LMM (43a) to the cake data.

βtemp: coeff. of the fixed effects variable temp; σ_intercept^2: random intercept variance; σ_ε^2: residual variance; ρ: random effects ratio; est.: estimate; [2.5% - 97.5%], bias, std.dev, √MSE, t-stat, p-value: respectively the estimated two-sided 95% percentile confidence interval, bias, standard deviation, square root of the MSE, t-statistic and p-value of the t-test for a zero value, all computed by a bootstrap simulation of 1000 replicates of the model fit using the nonparametric residuals bootstrap approach described in Carpenter et al. [45]. These variability estimates are computed from the 1000 bootstrap replicates using the standard formulas introduced by Efron (see [52]).

Table 3. Results of Algorithm 3S-A1-V1 fitting the LMM (43b) to the cake data.

Meaning of parameters as in Table 2.

value of the parameter is 0. The latter is rarely a bad approximation in this context. Overall, the two model fits are quite similar in terms of random effects variance, residual variance and MSPE estimate. Moreover, the relative difference in fitted response values between the two fits ranges between −0.047 and 0.030. Finally, the p-value of the random effects ratio ρ is roughly 3 × 10^{-4}, strongly suggesting the presence of random effects on the intercept w.r.t. the recipe:replicate clustering considered for the baked cakes.
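To fix ideas, the following small R sketch shows how such bootstrap variability indicators can be obtained from the replicates; it is only an outline under the approximations stated above, and the object names (est.hat, boot.est) are hypothetical.

## Sketch: summarizing B bootstrap replicates boot.est of a scalar parameter
## estimate est.hat (names hypothetical), along the lines used for Tables 2-3.
boot.summary <- function(est.hat, boot.est, level = 0.95) {
  alpha <- (1 - level) / 2
  bias  <- mean(boot.est) - est.hat          # bootstrap bias estimate
  sdev  <- sd(boot.est)                      # bootstrap standard deviation
  sqMSE <- sqrt(bias^2 + sdev^2)             # square root of the bootstrap MSE
  tstat <- est.hat / sqMSE                   # t-statistic as defined above
  pval  <- 2 * pnorm(-abs(tstat))            # standard Gaussian approximation
  ci    <- quantile(boot.est, c(alpha, 1 - alpha))   # percentile CI
  list(est = est.hat, bias = bias, std.dev = sdev, sqrt.MSE = sqMSE,
       t.stat = tstat, p.value = pval, conf.int = ci)
}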

For comparison, we also fitted each of the two LMMs (43a) and (43b) by the Gaussian ML method. The results are given in Table 4 and, as expected, the parameter estimates are almost identical to those in the two previous tables.
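For the record, a plausible lme4 call for this Gaussian ML fit of (43a) is sketched below (our fe/re/Gr notation translated to the lmer formula interface; the exact call behind Table 4 may differ in minor options).

## Gaussian ML fit of the LMM (43a) with lme4 (sketch); REML = FALSE requests ML.
library(lme4)
fit.43a <- lmer(angle ~ 0 + temp + (1 | recipe:replicate),
                data = cake, REML = FALSE)
summary(fit.43a)
## For (43b), replace temp by the 6-level factor temperature in the fixed part.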

6.4. Application to the Blackmore Longitudinal Data

To illustrate our methods on longitudinal LMMs, we consider the Blackmore data described in detail in Davis et al. [53]. These are longitudinal, retrospective, self-reported data from a case-control study on the physical exercise histories of 231 teenage girls, 138 of whom are eating disorder (anorexia nervosa) patients recruited from a 4-year inpatient Eating Disorder Program at the Toronto Hospital for Sick Children. The other 93 comparable “control” subjects, with no history of a psychiatric disorder (determined by asking relevant clinical questions), were recruited through informational letters sent via school boards to parents, inviting the teenage daughter (the letter bearer) to take part in the study. Retrospective recalls of leisure-time sport and exercise activities at target ages 8, 10, 12, 14 years (if applicable in relation to the girl’s current age), and in the 12 months

Table 4. LMM Gaussian maximum likelihood fitting to the cake data.

prior to the study, were obtained during a structured interview. Since the girls were recruited at different ages, the number of observations and the age at the last observation vary, respectively from 2 to 5 observations and from 11.58 to 17.92 years. The Blackmore data are given in the long format for a longitudinal data set, with the following variables: subject (an identification code for each girl); age (the girl’s age in years at the time of observation); exercise (the variable of interest, the amount of physical activity in which the girl engaged, expressed as estimated hours per week) and group (a factor indicating whether the girl is a “patient” or a “control”). For modeling exercise in terms of age and group across the sample of girls, Davis et al. [53] first log2-transformed exercise to make its distribution for both groups more symmetric and linearised its relationship with age. Because there are some 0 values of exercise, 5 minutes (5/60 of an hour) were added to each value of exercise prior to taking logs.

For these data, we fit an LMM using as fixed effects: an intercept, age minus 8 (denoted age8), group and the interaction of the latter two. As for the random effects in this study, they are attached to variations from girl to girl, so the clustering variable is subject. Now, a follow-up plot of log2 exercise by age for 20 randomly selected patients and 20 randomly selected control girls showed that log2 exercise at the start of follow-up varies considerably, suggesting a random intercept in the LMM. Also, the evolution of log2 exercise with age differs from girl to girl, suggesting the inclusion of a random slope for age8. The model to fit is therefore:

log2 exercise ∼ fe(1 + age8 + group + age8*group) + re(1 + age8) + Gr(subject) (44)

When fitting that model, and as is traditional in this type of study, the reference category in the numerical coding of the binary variable group is “control”. Moreover, the covariance matrix D is 2 × 2, with diagonal elements σ_intercept^2 and σ_age8^2, and both off-diagonal elements equal to cov(intercept, age8).
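For comparison with the Gaussian ML results of Table 6, a plausible R sketch of the preprocessing and of the corresponding lmer call is given below; it assumes the Blackmore data frame shipped with the car/carData package, and the exact code used in the paper may differ slightly.

## Sketch: Blackmore data preprocessing and Gaussian ML fit of the LMM (44).
library(lme4)
data("Blackmore", package = "carData")
Blackmore$log.exercise <- log2(Blackmore$exercise + 5/60)  # add 5 minutes before taking logs
Blackmore$age8 <- Blackmore$age - 8                        # age minus 8
fit.44 <- lmer(log.exercise ~ age8 * group + (age8 | subject),
               data = Blackmore, REML = FALSE)             # "control" is the reference level
summary(fit.44)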

The results for the model using Algorithm 3S-A1-V1 versus ML are presented in Table 5 and Table 6, respectively. The interaction of age with group is highly significant with both methods, reflecting a steeper average trend of the amount of exercise with age in the patient group. The fixed intercept and group effects are not significant at the 5% level. Parameter estimates from Algorithm 3S-A1-V1 are

Table 5. Results of algorithm 3S-A1-V1 fitting the LMM (44) to the Blackmore data.

corrintercept;age8: correlation coefficient between random effects variables intercept and age8.

Table 6. Results of Gaussian ML fitting of the LMM (44) to the Blackmore data.

very close to those from ML. The p-value for the random effects ratio estimate is less than 0.001, confirming the presence, w.r.t. the girls, of random effects on the intercept and/or age8, and probably both since the estimates of their variances σ intercept 2 and σ age8 2 are significantly greater than zero. It also appears that there is a significant negative correlation between those two random effects.

In the Supplementary material, we fitted another LMM to this data set by adding a random slope for age8*group in (44). The results using Algorithm 3S-A1-V1 show that, in addition to the parameters already significant in (44), the same is true of the variance of the interaction age8*group, implying a probable random effect on it also. We remark that Gaussian ML fitting of this model failed to converge with the default settings in the R function lmer.

7. Concluding Remarks

Till today, it has been difficult to routinely fit LMMs without assuming that both the random effects and the residual errors have Gaussian distributions (the default in almost all statistical software packages designed for that purpose). Yet, for many data sets, that assumption may be debatable, especially for the random effects. This is disturbing since modeling the behavior of the random effects is one of the main goals of LMM fitting in the first place. Therefore, there has been an implicit need, for long now, to develop fitting methods for LMMs not requiring Gaussian assumptions, while being applicable to as wide a range of LMMs as possible. Being restricted to variance components models, the venerable ANOVA methods and Rao’s MINQUE largely fall short in that respect. In the work presented here, we were able to devise a new iterative fitting methodology for 2-level (or longitudinal) LMMs with the only added assumption (beyond the basic ones) that the residual errors are uncorrelated and homoscedastic. Each variant of that estimation methodology iterates through a small set of estimating equations (EEs) and, when convergent, yields nonnegative estimates of variances and SPSD estimates of covariance matrices. Though no Gaussian assumption is involved in the derivation of these EEs, we showed that if the 2-level LMM is, indeed, Gaussian with i.i.d. errors, then these EEs are equivalent to the likelihood ones, save for a denominator n − p in lieu of n in the EE for the residual variance. Furthermore, the newly developed estimation methodology for LMMs, nicknamed 3S, is not exclusive to 2-level models. The same ideas can be used to fit some other classes of LMMs as well, as we showed for variance components (or ANOVA) LMMs, generalizing an old method of Henderson for ML estimation in such models of Gaussian type. An interesting by-product is, actually, an extension of that Henderson method to Gaussian ML fitting of 2-level LMMs with u.ho. errors.

Proof of Theorem 7

In addition to Lemmas S3.1 and S3.2, we shall need the following one:

Lemma 8. In the 2-level LMM (2), if the cluster random effects covariance matrix D satisfies:

\( D = \frac{1}{m} \sum_{j=1}^{m} \big( V_j^* + \tilde{U}_j \tilde{U}_j^T \big), \) (45)

with \( \tilde{U}_1, \ldots, \tilde{U}_m \in \mathbb{R}^r \), then, letting \( \tilde{U} = (\tilde{U}_1^T, \ldots, \tilde{U}_m^T)^T \in \mathbb{R}^q \),

\( \operatorname{tr}(Z G Z^T V^{-1}) = \| \tilde{U} \|_{G^{-1}}^2. \) (46)

Proof. From the proof of Proposition S1.4, we already deduce that

\( \operatorname{tr}(Z G Z^T V^{-1}) = \operatorname{tr}[(G - V^*) G^{-1}]. \) (47a)

Now, in the 2-level LMM (2), we know that \( G = \operatorname{diag}(D, \ldots, D) \) and \( V^* = \operatorname{diag}(V_1^*, \ldots, V_m^*) \), which are two conformal q × q block-diagonal matrices. Hence, one also has:

\( G - V^* = \operatorname{diag}(D - V_1^*, \ldots, D - V_m^*) \quad\text{and}\quad G^{-1} = \operatorname{diag}(D^{-1}, \ldots, D^{-1}), \)

which implies that \( (G - V^*) G^{-1} = \operatorname{diag}\big( (D - V_1^*) D^{-1}, \ldots, (D - V_m^*) D^{-1} \big) \). Therefore,

\( \operatorname{tr}[(G - V^*) G^{-1}] = \sum_{j=1}^{m} \operatorname{tr}[(D - V_j^*) D^{-1}] = \operatorname{tr}\Big[ \sum_{j=1}^{m} (D - V_j^*) \, D^{-1} \Big]. \) (47b)

Notice then that (45) is equivalent to \( \sum_{j=1}^{m} (D - V_j^*) = \sum_{j=1}^{m} \tilde{U}_j \tilde{U}_j^T \), which, inserted in (47b), yields:

\( \operatorname{tr}[(G - V^*) G^{-1}] = \operatorname{tr}\Big[ \sum_{j=1}^{m} \tilde{U}_j \tilde{U}_j^T D^{-1} \Big] = \sum_{j=1}^{m} \operatorname{tr}(\tilde{U}_j \tilde{U}_j^T D^{-1}). \) (47c)

Now, for \( j = 1, \ldots, m \), \( \operatorname{tr}(\tilde{U}_j \tilde{U}_j^T D^{-1}) = \operatorname{tr}(\tilde{U}_j^T D^{-1} \tilde{U}_j) = \tilde{U}_j^T D^{-1} \tilde{U}_j \). Hence, (47c) implies: \( \operatorname{tr}[(G - V^*) G^{-1}] = \sum_{j=1}^{m} \tilde{U}_j^T D^{-1} \tilde{U}_j = \tilde{U}^T G^{-1} \tilde{U} = \| \tilde{U} \|_{G^{-1}}^2 \), which, with (47a), yields (46). □

• We now proceed to prove Theorem 7 proper:

Proof. Let \( \tilde{\beta} \) and \( \tilde{U} \) satisfy the HMMEs (10) given \( \sigma_\varepsilon^2 \), D and \( Y = y \) in the 2-level LMM (2). Then, thanks to Theorem 1,

\( \tilde{\beta} = (X^T V^{-1} X)^{-1} X^T V^{-1} y \quad\text{and}\quad \tilde{U} = G Z^T V^{-1} (y - X \tilde{\beta}), \) (48a)

the latter being equivalent here to:

\( \tilde{U}_j = D Z_j^T V_j^{-1} (y_j - X_j \tilde{\beta}), \quad\text{for } j = 1, \ldots, m. \) (48b)

Now, let Assumptions A1-A5 be true. Then, for \( j = 1, \ldots, m \), \( Y_j \sim \mathcal{N}_{n_j}(X_j \beta, V_j) \), with density:

\( f_{Y_j}(y_j \,|\, \beta, \sigma_\varepsilon^2, D) = (2\pi)^{-n_j/2} \, |V_j|^{-1/2} \exp\Big[ -\tfrac{1}{2} \| y_j - X_j \beta \|_{V_j^{-1}}^2 \Big]. \)

The log-likelihood over the data from all the m clusters is thus:

\( l(\beta, \sigma_\varepsilon^2, D \,|\, y) = c - \tfrac{1}{2} \sum_{j=1}^{m} \big[ \log |V_j| + (y_j - X_j \beta)^T V_j^{-1} (y_j - X_j \beta) \big]. \) (48c)

1) Now, assume also that (33a) holds, hence \( \beta = \tilde{\beta} \), thanks to (34). On the other hand, using (S:3.1a) and (S:3.1b) in Lemma S3.2 to differentiate (48c) w.r.t. D gives:

\( -2 \frac{\partial l}{\partial D} = \sum_{j=1}^{m} \Big[ \frac{\partial \log |V_j|}{\partial D} + \frac{\partial (y_j - X_j \beta)^T V_j^{-1} (y_j - X_j \beta)}{\partial D} \Big] = \operatorname{diag}(S) - 2S, \)

where \( S = \sum_{j=1}^{m} S_j \), with \( S_j = W_j - B_j \), \( B_j = Z_j^T V_j^{-1} Z_j \), \( W_j = Z_j^T V_j^{-1} (y_j - X_j \beta)(y_j - X_j \beta)^T V_j^{-1} Z_j \). With Lemma S3.1, the nonsingularity of D, the fact that \( \beta = \tilde{\beta} \) and using (48b), then (S:1.6c),

\( \frac{\partial l}{\partial D} = 0 \;\Longleftrightarrow\; S = 0 \;\Longleftrightarrow\; \sum_{j=1}^{m} W_j = \sum_{j=1}^{m} B_j \;\Longleftrightarrow\; D \Big( \sum_{j=1}^{m} W_j \Big) D = D \Big( \sum_{j=1}^{m} B_j \Big) D \;\Longleftrightarrow\; \sum_{j=1}^{m} \tilde{U}_j \tilde{U}_j^T = \sum_{j=1}^{m} (D - V_j^*) \;\Longleftrightarrow\; D = \frac{1}{m} \sum_{j=1}^{m} (V_j^* + \tilde{U}_j \tilde{U}_j^T) = \tilde{D}. \) (49a)

2) Now, let also (33a) and (33b) both hold. Hence \( \tilde{\beta} = \beta \) and (49a) are both true. Furthermore, (49a) implies that \( G = \tilde{G} = \operatorname{diag}(\tilde{D}, \ldots, \tilde{D}) \). Using (S:3.1c) and (S:3.1d) in Lemma S3.2 to differentiate (48c) w.r.t. \( \sigma_\varepsilon^2 \),

\( -2 \frac{\partial l}{\partial \sigma_\varepsilon^2} = \operatorname{tr}(V^{-1}) - \| V^{-1} (y - X \beta) \|^2. \) (49b)

Now, given that \( \beta = \tilde{\beta} \) and (48a), we get, from (12a) in Lemma 3:

\( \sigma_\varepsilon^2 \, V^{-1} (y - X \tilde{\beta}) = y - X \tilde{\beta} - Z \tilde{U}. \) (49c)

Also, using (11), then Lemma 8 (given that (49a) is true, by assumption),

\( \sigma_\varepsilon^2 \operatorname{tr}(V^{-1}) = \operatorname{tr}(I_n) - \operatorname{tr}(Z G Z^T V^{-1}) = n - \| \tilde{U} \|_{\tilde{G}^{-1}}^2. \) (49d)

Inserting (49c) and (49d) in (49b), we get:

\( -2 \frac{\partial l}{\partial \sigma_\varepsilon^2} = \frac{n - \| \tilde{U} \|_{\tilde{G}^{-1}}^2}{\sigma_\varepsilon^2} - \frac{1}{\sigma_\varepsilon^4} \| y - X \tilde{\beta} - Z \tilde{U} \|^2. \)

Consequently, from Corollary 1,

\( \frac{\partial l}{\partial \sigma_\varepsilon^2} = 0 \;\Longleftrightarrow\; n \sigma_\varepsilon^2 = \| y - X \tilde{\beta} - Z \tilde{U} \|^2 + \sigma_\varepsilon^2 \| \tilde{U} \|_{\tilde{G}^{-1}}^2 = y^T (y - X \tilde{\beta} - Z \tilde{U}). \)

Supplementary Material

Supplementary Information for the article: Nonparametric Estimation in Linear Mixed Models with Uncorrelated Homoscedastic Errors

Introduction

In this document, we present supplementary materials useful for the understanding of the article. For that, the first section gathers some known results on LMMs scattered here and there in the literature, and some new results of our own. Section S2 presents some important implementation details about our presented iterative methods for fitting LMMs, while Section S3 presents two lemmas useful for proving Theorem 7 in the article. In Section S4, 3S fitting algorithms for 2-level LMMs with u.ho. errors and diagonal covariance matrix for the cluster random effects, and ANOVA LMMs are presented. More analysis of the Blackmore data is presented in the last section.

S1. LMMs without Gaussian Assumptions: Some Useful Results

S1.1. An Equivalent Formulation of Assumptions A 1 - A 3

It is useful to note the following equivalent formulation of Assumptions A 1 - A 3 :

Proposition S1.1. Assumptions A 1 - A 3 are equivalent to the following set of 3 assumptions:

a1) The U_j's are identically distributed in ℝ^r.

a2) j , the two random vectors U j and ε j are independent.

a3) The random couples ( U 1 , ε 1 ) , , ( U m , ε m ) are mutually independent.

S1.2. More about the BLP of U | Y in an LMM

The best linear predictor U ¯ = BLP ( U | Y ) , given by (5), plays a key role in our new estimation methodology for LMMs with u.ho. errors. That is why we expand on it here. And, again, no Gaussian distributional assumption is involved.

S1.2.1. Covariance Matrix and Another Expression of the BLP of U | Y in an LMM

The relationship between the covariance matrix of U ¯ = BLP ( U | Y ) and that of the true random effects vector U will be instrumental in estimating the latter in our methodology, starting with:

\( \operatorname{Mcov}(\bar{U}) = G Z^T V^{-1} \operatorname{Mcov}(Y) V^{-1} Z G = G Z^T V^{-1} V V^{-1} Z G = G Z^T V^{-1} Z G, \) (S:1.1a)

thanks to (5). Hence,

\( \operatorname{Mcov}(\bar{U}) = \mathrm{E}(\bar{U} \bar{U}^T) = G - V^*, \) (S:1.1b)

where \( V^* \) is the symmetric matrix:

\( V^* = G - G Z^T V^{-1} Z G. \) (S:1.1c)

Introducing the apparently complicated matrix V * is motivated by the fact that it is only q × q (and not n × n like V ) and turns out instrumental in devising an alternative useful way of rewriting the BLP (5) in the LMM (1), more so with u.ho. residual errors:

Lemma S1.2. In the LMM (1) with Assumption A1g, the following identities hold:

\( V^* = (Z^T R^{-1} Z + G^{-1})^{-1}, \) (S:1.2a)

\( G Z^T V^{-1} = V^* Z^T R^{-1}. \) (S:1.2b)

Moreover, if Assumption A4g also holds, then these identities simplify to:

\( V^* = (\sigma_\varepsilon^{-2} Z^T Z + G^{-1})^{-1} = \sigma_\varepsilon^2 (Z^T Z + \sigma_\varepsilon^2 G^{-1})^{-1}, \) (S:1.2c)

\( G Z^T V^{-1} = \sigma_\varepsilon^{-2} V^* Z^T. \) (S:1.2d)

The advantage of (S:1.2c)-(S:1.2d) is that they give respective expressions of V * and G Z T V 1 using only q × q and n × q matrices, and no n × n one. These imply a much computationally cheaper formula for the BLP of U | Y in an LMM with u.ho. errors, when compared with (5):

Proposition S1.3. In the LMM (1) with Assumption A1g, one has:

\( \bar{U} = \operatorname{BLP}(U \,|\, Y) = V^* Z^T R^{-1} (Y - X \beta). \) (S:1.3a)

Moreover, if Assumption A4g also holds, then this simplifies to:

\( \bar{U} = \operatorname{BLP}(U \,|\, Y) = \sigma_\varepsilon^{-2} V^* Z^T (Y - X \beta). \) (S:1.3b)

It is also worth noting:

Proposition S1.4. In the LMM (1) with Assumption A1g, one has:

\( \mathrm{E}\big( \| \bar{U} \|_{G^{-1}}^2 \big) = q - \operatorname{tr}(G^{-1} V^*). \) (S:1.4a)

Moreover, if Assumption A4g also holds, then we also have:

\( \mathrm{E}\big( \| \bar{U} \|_{G^{-1}}^2 \big) = n - \sigma_\varepsilon^2 \operatorname{tr}(V^{-1}) = \sigma_\varepsilon^{-2} \operatorname{tr}(V^* Z^T Z). \) (S:1.4b)

Proof. Given that \( \| \bar{U} \|_{G^{-1}}^2 = \bar{U}^T G^{-1} \bar{U} \), \( \mathrm{E}(\bar{U}) = 0 \) and (S:1.1b),

\( \mathrm{E}\big( \| \bar{U} \|_{G^{-1}}^2 \big) = \mathrm{E}(\bar{U})^T G^{-1} \mathrm{E}(\bar{U}) + \operatorname{tr}[G^{-1} \operatorname{Mcov}(\bar{U})] = \operatorname{tr}[G^{-1} \operatorname{Mcov}(\bar{U})] = \operatorname{tr}[G^{-1} (G - V^*)] = \operatorname{tr}(I_q - G^{-1} V^*) = q - \operatorname{tr}(G^{-1} V^*). \)

Secondly, from (S:1.1a), \( \operatorname{tr}[G^{-1} \operatorname{Mcov}(\bar{U})] = \operatorname{tr}(Z^T V^{-1} Z G) = \operatorname{tr}(V^{-1} Z G Z^T) \). Now, if Assumption A4g also holds, then

\( \operatorname{tr}(V^{-1} Z G Z^T) = \operatorname{tr}[V^{-1} (V - \sigma_\varepsilon^2 I_n)] = \operatorname{tr}(I_n) - \sigma_\varepsilon^2 \operatorname{tr}(V^{-1}), \)

which entails the first equality in (S:1.4b). For the second one, using (S:1.2d),

\( \operatorname{tr}(V^{-1} Z G Z^T) = \operatorname{tr}(Z G Z^T V^{-1}) = \sigma_\varepsilon^{-2} \operatorname{tr}(Z V^* Z^T) = \sigma_\varepsilon^{-2} \operatorname{tr}(V^* Z^T Z). \) □

The last proposition implies some additional striking matrix identities in an LMM with u.ho. errors:

Proposition S1.5. In the LMM (1) with Assumptions A1g and A4g,

\( \sigma_\varepsilon^2 \operatorname{tr}(V^{-1}) + \sigma_\varepsilon^{-2} \operatorname{tr}(V^* Z^T Z) = n, \) (S:1.5a)

\( \operatorname{tr}(G^{-1} V^*) + \sigma_\varepsilon^{-2} \operatorname{tr}(V^* Z^T Z) = q. \) (S:1.5b)

S1.2.2. Cluster Partitioning of the B L P ( U | Y ) in a 2-Level LMM

In the 2-level LMM (2), we can partition \( \bar{U} = \operatorname{BLP}(U \,|\, Y) \) cluster-wise:

\( \bar{U} = (\bar{U}_1^T, \ldots, \bar{U}_m^T)^T, \quad\text{with each } \bar{U}_j \text{ a random } \mathbb{R}^r\text{-vector } (j = 1, \ldots, m). \) (S:1.6a)

Combining that with (5), the assumptions and the partitioned structures of the various design and covariance matrices G, R and V, we get:

\( \bar{U}_j = \operatorname{BLP}(U_j \,|\, Y_j) = D Z_j^T V_j^{-1} (Y_j - X_j \beta) \quad (j = 1, \ldots, m). \) (S:1.6b)

Similarly, from (S:1.1c), one sees that \( V^* = \operatorname{diag}(V_1^*, \ldots, V_m^*) \), where

\( V_j^* = D - D Z_j^T V_j^{-1} Z_j D \in M_r(\mathbb{R}) \quad (j = 1, \ldots, m). \) (S:1.6c)

Furthermore, from (S:1.6b), calculations similar to (S:1.1a) give:

\( \operatorname{Mcov}(\bar{U}_j) = \mathrm{E}(\bar{U}_j \bar{U}_j^T) = D Z_j^T V_j^{-1} Z_j D = D - V_j^* \quad (j = 1, \ldots, m). \) (S:1.6d)

So, here, \( \operatorname{Mcov}(\bar{U}) = \operatorname{diag}\big( \operatorname{Mcov}(\bar{U}_1), \ldots, \operatorname{Mcov}(\bar{U}_m) \big) \), thus yielding

\( \operatorname{Mcov}(\bar{U}) = \operatorname{diag}(D - V_1^*, \ldots, D - V_m^*) = G - V^*. \) (S:1.6e)

Finally, the cluster-wise versions of Lemma S1.2 and Propositions S1.3-S1.5 for the 2-level LMM (2) are:

Lemma S1.6. In the 2-level LMM (2), the following identities hold, for \( j = 1, \ldots, m \):

\( V_j^* = (Z_j^T R_j^{-1} Z_j + D^{-1})^{-1}, \) (S:1.7a)

\( D Z_j^T V_j^{-1} = V_j^* Z_j^T R_j^{-1}. \) (S:1.7b)

Moreover, if Assumption A4g also holds, then these identities simplify to:

\( V_j^* = (\sigma_\varepsilon^{-2} Z_j^T Z_j + D^{-1})^{-1} = \sigma_\varepsilon^2 (Z_j^T Z_j + \sigma_\varepsilon^2 D^{-1})^{-1}, \) (S:1.7c)

\( D Z_j^T V_j^{-1} = \sigma_\varepsilon^{-2} V_j^* Z_j^T. \) (S:1.7d)

Proposition S1.7. In the 2-level LMM (2), one has, for \( j = 1, \ldots, m \):

\( \bar{U}_j = \operatorname{BLP}(U_j \,|\, Y_j) = V_j^* Z_j^T R_j^{-1} (Y_j - X_j \beta). \) (S:1.8a)

Moreover, if Assumption A4g also holds, then this simplifies to:

\( \bar{U}_j = \operatorname{BLP}(U_j \,|\, Y_j) = \sigma_\varepsilon^{-2} V_j^* Z_j^T (Y_j - X_j \beta). \) (S:1.8b)

Proposition S1.8. In the 2-level LMM (2), one has, for \( j = 1, \ldots, m \):

\( \mathrm{E}\big( \| \bar{U}_j \|_{D^{-1}}^2 \big) = r - \operatorname{tr}(D^{-1} V_j^*). \) (S:1.9)

Moreover, if Assumption A4g also holds, then we also have:

\( \mathrm{E}\big( \| \bar{U}_j \|_{D^{-1}}^2 \big) = n_j - \sigma_\varepsilon^2 \operatorname{tr}(V_j^{-1}) = \sigma_\varepsilon^{-2} \operatorname{tr}(V_j^* Z_j^T Z_j). \) (S:1.10)

Proposition S1.9. In the 2-level LMM (2) with Assumptions A1g and A4g, for \( j = 1, \ldots, m \):

\( \sigma_\varepsilon^2 \operatorname{tr}(V_j^{-1}) + \sigma_\varepsilon^{-2} \operatorname{tr}(V_j^* Z_j^T Z_j) = n_j, \) (S:1.11a)

\( \operatorname{tr}(D^{-1} V_j^*) + \sigma_\varepsilon^{-2} \operatorname{tr}(V_j^* Z_j^T Z_j) = r. \) (S:1.11b)
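As a small illustration of how these cluster-wise identities are used in practice, here is a minimal R sketch computing V_j^* and the BLP Ū_j under u.ho. errors via (S:1.7c) and (S:1.8b); the argument names are hypothetical.

## Sketch of (S:1.7c) and (S:1.8b) for one cluster: Zj is the n_j x r design
## matrix, resid.j = y_j - X_j %*% beta, D the r x r RE covariance matrix and
## sigma2 the residual variance (names hypothetical).
cluster.blp <- function(Zj, resid.j, D, sigma2) {
  Vj.star <- sigma2 * solve(crossprod(Zj) + sigma2 * solve(D))   # (S:1.7c)
  Uj.bar  <- (1 / sigma2) * Vj.star %*% crossprod(Zj, resid.j)   # (S:1.8b)
  list(Vj.star = Vj.star, Uj.bar = drop(Uj.bar))
}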

S2. 3S Iterative Algorithms: Some Important Implementation Details

We give here some useful precisions to effectively program Algorithms 3S-A1-V1 and 3S-A1-V2.

S2.1. Solving the HMMEs in Algorithm 3S-A1-V1

Our way of solving a system of the form (29) is based on the following result which we admit:

Theorem S2.1. (Solving a 2 × 2 block nonsingular system) Let

\( A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \in M_n(\mathbb{R}) \) be nonsingular, with \( n = n_1 + n_2 \) and \( n_1, n_2 \in \mathbb{N}^* \). If \( A_{11} \in M_{n_1}(\mathbb{R}) \) is nonsingular, then:

1) \( B_{22} = A_{22} - A_{21} A_{11}^{-1} A_{12} \in M_{n_2}(\mathbb{R}) \) and is nonsingular.

2) The unique solution of the linear system \( A X = Y \) in \( \mathbb{R}^n \) is given by:

\( X_2 = B_{22}^{-1} (Y_2 - A_{21} A_{11}^{-1} Y_1) \in \mathbb{R}^{n_2}, \qquad X_1 = A_{11}^{-1} (Y_1 - A_{12} X_2) \in \mathbb{R}^{n_1}, \)

with \( X_1, Y_1 \in \mathbb{R}^{n_1} \), \( X_2, Y_2 \in \mathbb{R}^{n_2} \) such that \( X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \) and \( Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} \).

Obviously, Theorem S2.1 has a twin version whereby it is rather the lower diagonal block A 22 M n 2 ( ) which is assumed nonsingular. To solve the linear system (29) at the beginning of each new iteration t + 1 in Algorithm 3S-A1-V1, we use Theorem S2.1 with:

\( A_{11} = X^T X \in M_p(\mathbb{R}), \qquad A_{12} = X^T Z \in M_{p,q}(\mathbb{R}), \)

\( A_{21} = Z^T X = A_{12}^T \in M_{q,p}(\mathbb{R}), \qquad A_{22} = Z^T Z + \hat{\sigma}_\varepsilon^{2(t)} \, \hat{G}^{(t)\,-1} \in M_q(\mathbb{R}), \)

\( Y_1 = X^T y \in \mathbb{R}^p, \qquad Y_2 = Z^T y \in \mathbb{R}^q. \)

Remark S2.1. The product matrices A 11 , A 12 , A 21 , Z T Z and vectors Y 1 , Y 2 are all independent from the iteration index t. Hence, they can be computed before the iterations start, which is a major computational advantage for the running time of Algorithm 3S-A1-V1.
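The following R sketch illustrates how Theorem S2.1 can be applied to the HMMEs with the precomputed products of Remark S2.1; it is only an outline, and the object names are hypothetical.

## Sketch: block solution of the HMMEs (29) via Theorem S2.1 (names hypothetical).
## A11 = X'X, A12 = X'Z, ZtZ = Z'Z, Y1 = X'y, Y2 = Z'y are computed once for all.
solve.hmme <- function(A11, A12, ZtZ, Y1, Y2, sigma2, Ginv) {
  A22  <- ZtZ + sigma2 * Ginv                               # lower diagonal block at iteration t+1
  B22  <- A22 - crossprod(A12, solve(A11, A12))             # Schur complement of A11
  u    <- solve(B22, Y2 - crossprod(A12, solve(A11, Y1)))   # X2 in Theorem S2.1
  beta <- solve(A11, Y1 - A12 %*% u)                        # X1 in Theorem S2.1
  list(beta = drop(beta), u = drop(u))
}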

S2.2. Choice of the Matrix Norm ( M ) in Instruction 3 of the 3S Algorithms

In the two iterative estimating algorithms for 2-level LMMs presented in Section 4.2.3, all the matrix iterates D̂^(t) are at least SPSD. Based on that, we choose an appropriate matrix norm ‖·‖_(M) to use in Instruction 3 of these algorithms to cheaply compute ‖D̂^(t+1)‖_(M) at any new iteration t + 1. Indeed, if \( A = (a_{ij}) \in M_r(\mathbb{R}) \) is SPSD, then it is known that:

\( \max_{k,l} |a_{kl}| = \max_k a_{kk}. \)

Consequently, if we take as matrix norm:

\( \| A \|_{(M)} = \max_{k,l} |a_{kl}|, \) (S:2.1)

then ‖A‖_(M) = max[diag(A)], where diag(A) is the vector of diagonal elements of the matrix A. So, using the matrix norm given by (S:2.1), ‖D̂^(t+1)‖_(M) is very cheap to compute and the first control test inequality of the stopping criterion in Instruction 3 then simplifies to:

\( \max_{k,l} \big| \big( \hat{D}^{(t+1)} - \hat{D}^{(t)} \big)_{kl} \big| \le \delta_1 \max\big[ \operatorname{diag}(\hat{D}^{(t+1)}) \big]. \)

S2.3. Terminating a 3S Iterative Algorithm: The Various Scenarios

An iterative algorithm aimed at equations solving (or optimization), however theoretically well crafted, almost always runs the risk of premature termination in some cases, i.e. the algorithm stalls before its expected convergence has been achieved. Algorithms 3S-A1-V1 and 3S-A1-V2 are not immune to that threat: each of them may fail on a given data set because the latter is far from satisfying one of the LMM assumptions upon which the algorithm is built. Therefore, it is necessary to include, in the inner workings of our 3S algorithms, capabilities to detect, as early as possible, the most plausible scenarios of abnormal termination or behavior.

So, as is traditional in equations solving and optimization iterative algorithms, we use a categorical flag variable code.Stop in each of our algorithms to record the cause of the algorithm termination during a particular run, with success code:

code.Stop = OK:1: convergence, with satisfaction of the intended stopping criterion.

Then we handle the following cases of possible premature stoppage, in which we also include a code.Warning flag variable giving a more specific warning about the iterations (even if they ended apparently well).

S2.3.1. code.Stop = KO:X: Fixed Effects Design Matrix X Not of Full Column Rank

As developed till today, a key assumption used in our 3S methodology for estimation in LMMs is A 2 g , i.e. the fixed effects design matrix X is of full column rank. If not, then some of our derivations no longer hold. Although the methodology might be extended in the future to bypass the need of that assumption, at present we stop any of our algorithms if A 2 g is not satisfied at the outset.

1) Checking if the fixed effects design matrix X is full column rank in Algorithm 3S-A1-V1

For Algorithm 3S-A1-V1, the checking is based on the fact that computing and storing the p × p product matrix X T X is a necessary requirement in that algorithm, to form the coefficient matrix of the linear system of HMMEs (29). Now, X and X T X always have the same rank, so X has full column rank if, and only if, X T X is nonsingular. But the latter holds if, and only if, the symmetric matrix X T X (which is at least SPSD) is SPD, which is equivalent to having a Cholesky factorization,i.e. X T X = U T U , where U is a square upper triangular matrix with positive diagonal elements.

Hence, at the beginning of Algorithm 3S-A1-V1, to check for Assumption A2g, once X^T X has been obtained, we compute its pivoted Cholesky factorization, using an available software program. Modern numerical software such as the LAPACK routines (Anderson et al., 1999), called by the functions of the R statistical software (our programming tool) that handle Numerical Linear Algebra (NLA) calculations, can compute that factorization and check whether an input symmetric matrix is even SPSD in the first place, and if so, check whether it is likely, or not, SPD, at least up to the numerical instabilities due to computer rounding errors. They can also output an estimate of the rank k of an SPSD matrix and indicate its k columns most likely to be linearly independent. Applying that to X^T X, we can decide whether it is clearly SPD, or likely close to a singular SPSD matrix.

In the latter case, we set code.Stop = KO:X and do not launch Algorithm 3S-A1-V1 at all, informing the user why. But we also output the estimated rank k of X^T X and the indices of its k columns most likely to be linearly independent. Since these two pieces of information are the same for X, the user can then, if he/she wishes, re-launch the 3S iterative algorithm, but with that fixed effects design matrix changed by suppressing its other p − k columns.
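A minimal R sketch of this check, relying on the pivoted Cholesky factorization available in base R (which calls LAPACK), could look as follows; the surrounding bookkeeping of our actual implementation is omitted.

## Sketch: checking Assumption A2g from X'X via a pivoted Cholesky factorization.
XtX <- crossprod(X)                              # X already built
ch  <- suppressWarnings(chol(XtX, pivot = TRUE))
k   <- attr(ch, "rank")                          # estimated rank of X'X (hence of X)
if (k < ncol(X)) {
  kept.cols <- attr(ch, "pivot")[seq_len(k)]     # columns most likely linearly independent
  ## code.Stop = "KO:X": report k and kept.cols to the user and do not iterate.
}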

2) Checking if the fixed effects design matrix X is full column rank in Algorithm 3S-A1-V2

Algorithm 3S-A1-V2, instead of the HMMEs, needs to solve, at each iteration, an OLS problem which design matrix is X . To achieve that, rather than first computing the product matrix X T X , numerical analysts recommend to compute the QR factorization (Demmel, 1997) of X . Interestingly, modern numerical software such as the LINPACK routines (Dongarra et al., 1978) called by the R NLA functions can do so while also giving an estimate of the rank of X . So for Algorithm 3S-A1-V2, our checking of Assumption A 2 g will be based on the results of the QR factorization with pivoting of X , with the conclusion drawn as for Algorithm 3S-A1-V1 above.

S2.3.2. A Vital Issue for LMM Fitting: Random Effects or Not?

People trying to fit an LMM rather than an LM to their data most often do not worry whether the incurred excess in mathematical modeling and computational time is worth the price in the first place. They do not ask themselves the obvious question:

Are there really random effects in the data with the provided clustering?

But that’s a mistake, noticeably because it turns out that the answer to that question directly impacts the way an LMM fitting iterative procedure really behaves on a given data set. In the programming of a 3S algorithm, we anticipate three possible scenarios related to that issue.

3) code.Stop = KO:1: RE covariance matrix likely (or close to) singular

Even when the cluster random effects are there, they may have been poorly parameterized, so that their scalar components are correlated, resulting in a rank deficient (or singular) covariance matrix D . If so, this will affect adversely any of our 3S algorithms since they all assume D to be SPD.

To deal with that potential difficulty in a 3S algorithm, since D is not known (we are trying to estimate it among other parameters...), our strategy is that after computing each new iterate D ^ ( t ) ( t 1 ), we test whether it is of full rank or not. Since in our computations, D ^ ( t ) is always at least SPSD, we numerically check for its rank by attempting a pivoted Cholesky factorization on it along the same lines as described above for the matrix X T X in the code.Stop = KO:X scenario. If the Cholesky software routine declares the newly computed D ^ ( t ) at an iteration t of not being full rank, the iterative algorithm is stopped, informing the user why: the true cluster random effects covariance matrix D is either singular or close to such a matrix.

After that premature termination (and as for the case of singular X T X previously), the algorithm also returns the estimated rank l of D ^ ( t ) and the indices of its l most uncorrelated columns. Using these pieces of information as estimates of the corresponding features of the true (but unknown) D , the user can re-launch the 3S iterative algorithm, but with cluster random effects vectors U 1 , , U m of dimension reduced from r to l , only keeping the l scalar components associated to the l columns most uncorrelated in D . In the data, this amounts to keeping just the corresponding l columns in each of the cluster random effects design matrices Z 1 , , Z m .

Note, however, that this does not work when r = 1, i.e. when each cluster random effect is actually a scalar rather than a vector, because then the cluster random effects covariance matrix D also reduces to a (nonnegative) scalar and is, therefore, rank deficient if, and only if, that scalar is zero. This will also be true of its D̂^(t) iterates. But the latter are unlikely to be numerically zero up to computer rounding errors, so there is little chance that any of them would be detected as rank deficient based on the Cholesky factorization. So we need to handle other possible abnormal termination scenarios including that subcase.

4) code.Stop = KO:2: Slow convergence, likely because of no cluster RE

Recall that a 3S algorithm has converged when the 3 inequalities

\( \| \hat{D}^{(t+1)} - \hat{D}^{(t)} \|_{(M)} \le \delta_1 \| \hat{D}^{(t+1)} \|_{(M)}, \) (S:2.2a)

\( \| \hat{\beta}^{(t+1)} - \hat{\beta}^{(t)} \| \le \delta_2 \| \hat{\beta}^{(t+1)} \|, \) (S:2.2b)

\( | \hat{\sigma}_\varepsilon^{2(t+1)} - \hat{\sigma}_\varepsilon^{2(t)} | \le \delta_3 \, \hat{\sigma}_\varepsilon^{2(t+1)}, \) (S:2.2c)

are simultaneously satisfied, where δ_1, δ_2, δ_3 are relative tolerance levels set in (0, 1), and ‖·‖_(M) is a chosen matrix norm. Now, when a 3S algorithm appears to be converging slowly, an inspection of the evolution of the iterations' intermediate results reveals that this is, more often than not, entirely due to the slow convergence of the sequence of D̂^(t) iterates. This manifests itself through the difficulty of having the relative error control inequality (S:2.2a) satisfied, whereas there really is a decrease in the sequence of absolute error estimates ‖D̂^(t+1) − D̂^(t)‖_(M). From Numerical Analysis, we know that such a situation typically happens when the sought limit of the D̂^(t) iterates (i.e. our targeted estimate of D) is either zero or close to zero (probably because the true D is so). A traditional way to anticipate such a situation consists in replacing (S:2.2a) with a mix of both relative and absolute error controls:

\( \| \hat{D}^{(t+1)} - \hat{D}^{(t)} \|_{(M)} \le \delta_1 \| \hat{D}^{(t+1)} \|_{(M)} \quad\text{or}\quad \| \hat{D}^{(t+1)} - \hat{D}^{(t)} \|_{(M)} \le \delta_{11}, \) (S:2.3)

where δ 11 is an absolute error tolerance upper bound.

But because it is generally difficult to set an appropriate value for δ 11 in a given context, here we instead use the following context dependent alternative to the second inequality in (S:2.3):

\( \| \hat{D}^{(t+1)} - \hat{D}^{(t)} \|_{(M)} \le 0.5 \, \| \hat{D}^{(t+1)} \|_{(M)} \quad\text{and}\quad \| Z \hat{G}^{(t+1)} Z^T \|_{(M)} \le 0.01 \, \hat{\sigma}_\varepsilon^{2(t+1)}, \) (S:2.4)

the latter inequality aiming at detecting a high probability of zero cluster random effects in the given data set. Hence, the final stopping criterion used in our 3S algorithms is rather:

((S:2.2a) or (S:2.4)) and (S:2.2b) and (S:2.2c). (S:2.5)

The rightmost inequality in (S:2.4) tests whether the contribution to the response covariance matrix V of the random effects covariance matrix G is (at least) two orders of magnitude smaller than that of the residual error variance σ_ε^2. When this occurs, it is an indication that the random effects either are not really there or are insignificant, which slows down the convergence of the D̂^(t) iterates, and thus of the algorithm. So when (S:2.4), (S:2.2b) and (S:2.2c) are satisfied, but not (S:2.2a), we stop with code.Stop = OK:2, inviting the user to view this as a partial success scenario.
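A compact R sketch of the resulting test (S:2.5) is given below; the tolerance defaults shown are purely illustrative and the object names are hypothetical.

## Sketch of the stopping test (S:2.5); normM is the max-abs norm (S:2.1).
normM <- function(A) max(abs(A))
converged <- function(D.new, D.old, beta.new, beta.old, s2.new, s2.old,
                      ZGZt.new, delta = c(1e-4, 1e-4, 1e-4)) {      # illustrative tolerances
  testD <- normM(D.new - D.old) <= delta[1] * normM(D.new)          # (S:2.2a)
  altD  <- normM(D.new - D.old) <= 0.5 * normM(D.new) &&
           normM(ZGZt.new) <= 0.01 * s2.new                         # (S:2.4)
  testB <- sqrt(sum((beta.new - beta.old)^2)) <= delta[2] * sqrt(sum(beta.new^2))  # (S:2.2b)
  testS <- abs(s2.new - s2.old) <= delta[3] * s2.new                # (S:2.2c)
  (testD || altD) && testB && testS                                 # (S:2.5)
}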

5) code.Warning = 3: RE might not be there, anyway...

A 3S algorithm being outright successful (i.e. termination with code.Stop = OK:1) does not imply that there are, indeed, random effects in the provided data. So it is useful to try to detect their presence or absence anyway, convergence of the algorithm or not. Our way of doing so is to always test the rightmost inequality in (S:2.4) after exiting a 3S algorithm. If satisfied, a warning is issued to the user: despite the observed convergence (if any), with the given clustering variable(s), random effects might actually not be there, or seem insignificant. An advantage in that way of trying to detect the zero random effects scenario is that it works whatever the dimension r of these ones in each cluster.

In any event, we always include, in the output from the algorithm, the value of the last ratio:

\( \hat{\rho} = \frac{ \| Z \hat{G}^{(t)} Z^T \|_{(M)} }{ \hat{\sigma}_\varepsilon^{2(t)} }, \) (S:2.6a)

estimating

\( \rho = \frac{ \| Z G Z^T \|_{(M)} }{ \sigma_\varepsilon^2 }. \) (S:2.6b)

The latter can be called the random effects ratio in the considered LMM for the given data set, and can be roughly interpreted as the coefficient of random effects really present in the data:

1) the higher above the number 1 the ratio ρ is, the more effective such a presence is;

2) the smaller below 1 the ratio ρ is, the more doubt can be cast about that presence.

But, for a 2-level LMM (2), note that the block diagonal structure of Z implies:

\( \| Z G Z^T \|_{(M)} = \max_{1 \le j \le m} \| Z_j D Z_j^T \|_{(M)}, \qquad \| Z \hat{G}^{(t)} Z^T \|_{(M)} = \max_{1 \le j \le m} \| Z_j \hat{D}^{(t)} Z_j^T \|_{(M)}. \) (S:2.7)
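In code, the cluster-wise form (S:2.7) makes the estimate (S:2.6a) cheap to obtain, as in the following hedged R sketch (names hypothetical).

## Sketch of the random effects ratio estimate (S:2.6a) via (S:2.7):
## Zj.list holds the cluster design matrices Z_1, ..., Z_m.
re.ratio <- function(Zj.list, D.hat, s2.hat) {
  num <- max(vapply(Zj.list,
                    function(Zj) max(abs(Zj %*% D.hat %*% t(Zj))),  # max-abs norm per cluster
                    numeric(1)))
  num / s2.hat
}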

Remark S2.2. One may wonder: why not simply use a statistical test to decide whether there are cluster random effects or not in the given data? The answer is twofold. First, theoretically valid statistical tests are hard to develop for LMMs, more so in our context where we aim at fitting these models without imposing any parametric distributional assumptions on the data. Secondly, as seen above, the situation of no cluster random effects actually first impacts negatively on the convergence of a 3S algorithm (as well as other iterative algorithms for LMM fitting), often severely slowing it down or even flat out stalling it. Hence in that scenario, it is difficult to even compute reliable estimates on which to perform statistical tests. Here, we wanted to develop practical numerical tests aimed at detecting it during the run of the algorithm. And such a detection is of the utmost practical importance since, if well done, it may allow to decide whether LMM modeling even makes sense or not for a given data set. Nevertheless, upon exiting a 3S algorithm, we can perform a bootstrap test to decide whether the ratio ρ is, or not, significantly greater than zero for the given data set.

S2.3.3. code.Warning = 4: Almost Constant Sequence of β ^ ( t ) Iterates

For many data sets, we observed that the sequence of β ^ ( t ) iterates produced by any of our 3S algorithms is nearly constant. This has to do with our way of initializing that sequence by the OLS in an LM fit of the data, i.e. an LMM with no RE. With that initialization, if the sequence ( β ^ ( t ) ) exhibits a constant trend from the outset, one may suspect that the LM is probably better suited for the data set at hand than a pure LMM, i.e. an LMM with nonzero RE. And this is likely so because either there are no cluster random effects or these are nonsignificant or would require a bigger sample size to be detected. Hence, whatever the stoppage condition (convergence or not) of the iterative algorithm, if for all iterations carried, the β ^ ( t ) iterates were all only negligibly different from their OLS starting point β ^ ( 0 ) , we conclude that we might be in that zero RE scenario, meaning that what was supposed to be cluster random effects actually behave, more or less, the same from one cluster to the other.

More specifically, we always perform the following test after termination of any of our 3S algorithms (successfully or not):

\( \text{for } t = 1(1)T, \qquad \| \hat{\beta}^{(t)} - \hat{\beta}^{(0)} \| \le \delta_2 \| \hat{\beta}^{(0)} \|, \) (S:2.8)

where t = T is the iteration at which the algorithm was stopped, and δ_2 is the same as in (S:2.2b). If (S:2.8) is satisfied, then the user is informed that the LM may be better suited for his/her data set. The recommendation then is either to use the LM, or to redesign the clustering variable, because it might have been poorly chosen in the first place.

S2.3.4. code.Stop = KO:5: Authorized Maximum Number of Iterations Reached without Convergence

As is customary for iterative algorithms expected to stop upon satisfaction of a convergence criterion, we must also anticipate the possibility that, for the provided data set, a 3S algorithm either does not actually converge or does so too slowly. For that, we set, as usual, a maximum number max.Its (default = 200) of authorized iterations in the algorithm. If that number is reached without the convergence stopping criterion being satisfied, the algorithm is terminated with code.Stop = KO:5, and we inform the user why.

S2.3.5. Complement: Authorized Minimum Number of Iterations

Somewhat untraditionally, but to limit the risk of an optimistic premature successful termination, we impose that a 3S algorithm always performs a minimum number min.Its (default = 10) before starting to check for a success in convergence.

S3. Two Lemmas Useful for Proving Theorem 7

We admit the following two lemmas (the second one can be proved using multivariable differential tools presented in Graham (1981)):

Lemma S3.1. Let \( S \in M_n(\mathbb{R}) \). Then, one has: \( 2S - \operatorname{diag}(S) = 0 \iff S = 0 \).

Lemma S3.2. Let \( x \in \mathbb{R}^n \) and \( N \in M_n(\mathbb{R}) \), SPD, such that

\( N = Z Q Z^T + \alpha I_n, \quad\text{with } Q \in M_r(\mathbb{R}) \text{ symmetric}, \; \alpha \in \mathbb{R} \text{ and } Z \in M_{n,r}(\mathbb{R}). \)

If the upper triangular part of Q has functionally independent elements which are also independent of α, whereas x and Z are independent of Q and α, then:

\( \frac{\partial (x^T N^{-1} x)}{\partial Q} = \operatorname{diag}(W) - 2W, \) (S:3.1a)

\( \frac{\partial \log |N|}{\partial Q} = 2B - \operatorname{diag}(B), \) (S:3.1b)

\( \frac{\partial (x^T N^{-1} x)}{\partial \alpha} = - \| N^{-1} x \|^2, \) (S:3.1c)

\( \frac{\partial \log |N|}{\partial \alpha} = \operatorname{tr}(N^{-1}), \) (S:3.1d)

where \( W = Z^T N^{-1} x x^T N^{-1} Z \), \( B = Z^T N^{-1} Z \), and if A is a square matrix, then \( \operatorname{diag}(A) \) is the diagonal matrix with the same main diagonal as A.

S4. 3S Fitting Algorithms for 2-Level LMMs with u.ho. Errors and Diagonal Covariance Matrix for the Cluster Random Effects, and ANOVA LMMs

S4.1. 3S Algorithms for 2-Level LMMs with u.ho. Errors Assuming a Diagonal Covariance Matrix for the Cluster Random Effects

We present here the counterparts of Algorithms 3S-A1-V1 and 3S-A1-V2 for 2-level LMMs with u.ho. errors when the covariance matrix of the cluster random effects vector is assumed diagonal, that is:

D = diag ( σ 1 2 , , σ r 2 ) .

They carry respectively the code names 3S-A1-V1-d and 3S-A1-V2-d.

Algorithm 3S-A1-V1-d. Estimation of β , σ ε 2 , D = diag ( σ 1 2 , , σ r 2 ) and prediction of U 1 , , U m in a 2-level LMM with u.ho.errors:Version 1

1) Initialization:At iteration 0,we estimate β , σ ε 2 , D and predict U 1 , , U m ,as follows:

a) As in Algorithm 3S-A1-V1;

b) σ ^ k 2 ( 0 ) = 1 m j = 1 m ( u ^ j , k ( 0 ) ) 2 , k = 1 , , r , with u ^ j , k ( 0 ) the kth element of the vector u ^ j ( 0 ) ;

c) D ^ ( 0 ) = diag ( σ ^ 1 2 ( 0 ) , , σ ^ r 2 ( 0 ) ) ;

2) The iterative process:Given β ^ ( t ) , σ ^ ε 2 ( t ) , u ^ 1 ( t ) , , u ^ m ( t ) ,hence u ^ ( t ) = ( u ^ 1 ( t ) T , , u ^ m ( t ) T ) T ,and D ^ ( t ) = diag ( σ 1 2 ( t ) , , σ r 2 ( t ) ) from iteration t,we obtain estimates and predictions at iteration t + 1 as follows:

a) As in Algorithm 3S-A1-V1;

b) σ ^ k 2 ( t + 1 ) = 1 m j = 1 m [ ( V ^ j ( t + 1 ) ) k k + ( u ^ j , k ( t + 1 ) ) 2 ] , k = 1 , , r ;

c) D ^ ( t + 1 ) = diag ( σ ^ 1 2 ( t + 1 ) , , σ ^ r 2 ( t + 1 ) ) ;

3) Stopping criterion:We assume convergence when all the 3inequalities

ϑ ^ ( t + 1 ) ϑ ^ ( t ) δ 1 ϑ ^ ( t + 1 ) , (S:4.1a)

β ^ ( t + 1 ) β ^ ( t ) δ 2 β ^ ( t + 1 ) , (S:4.1b)

| σ ^ ε 2 ( t + 1 ) σ ^ ε 2 ( t ) | δ 3 σ ^ ε 2 ( t + 1 ) , (S:4.1c)

are satisfied,where δ 1 , δ 2 , δ 3 are relative tolerance levels set in ( 0,1 ) ,and ϑ ^ = ( σ ^ 1 2 , , σ ^ r 2 ) T . Otherwise,repeat Step 2with t t + 1 .

4) Extracting estimates:At convergence,take β ^ = β ^ ( t + 1 ) , σ ^ ε 2 = σ ^ ε 2 ( t + 1 ) , D ^ = D ^ ( t + 1 ) as estimates of β , σ ε 2 , D .Also,take u ^ 1 ( t + 1 ) , , u ^ m ( t + 1 ) as predictions of U 1 , , U m .
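For concreteness, Steps 2b-2c of Algorithm 3S-A1-V1-d above amount to the following short R sketch, in which Vj.list holds the matrices V̂_j^(t+1) and u.mat is the m × r matrix whose jth row is û_j^(t+1) (names hypothetical).

## Sketch of Steps 2b-2c: per-component variance update and diagonal D update.
r <- ncol(u.mat)
sigma.k2 <- colMeans(u.mat^2 + t(vapply(Vj.list, diag, numeric(r))))  # Step 2b
D.new    <- diag(sigma.k2, nrow = r)                                  # Step 2c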

Algorithm 3S-A1-V2-d. Estimation of β , σ ε 2 , D = diag ( σ 1 2 , , σ r 2 ) and prediction of U 1 , , U m in a 2-level LMM with u.ho.errors:Version 2

1) Initialization:The same as in Algorithm 3S-A1-V1-d;

2) The iterative process:Given β ^ ( t ) , σ ^ ε 2 ( t ) , u ^ 1 ( t ) , , u ^ m ( t ) ,hence u ^ ( t ) = ( u ^ 1 ( t ) T , , u ^ m ( t ) T ) T ,and D ^ ( t ) = diag ( σ 1 2 ( t ) , , σ r 2 ( t ) ) from iteration t,we obtain estimates and predictions at iteration t + 1 as follows:

a) As in Algorithm 3S-A1-V2;

b) As in Algorithm 3S-A1-V1-d;

3) Stopping criterion:As in Algorithm 3S-A1-V1-d.Otherwise,repeat Step 2with t t + 1 .

4) Extracting estimates:As in Algorithm 3S-A1-V1-d.

S4.2. 3S Algorithms for ANOVA LMMs

We present the adaptations of Algorithms 3S-A1-V1 and 3S-A1-V2 to ANOVA LMMs, that is, the case where

D = G = diag ( σ 1 2 I m 1 , , σ g 2 I m g ) ,

where, for k = 1, , g , each variance σ k 2 corresponds to the kth categorical variable with m k categories, with g the total number of categorical variables with random effects. They carry respectively the code names 3S-ANOVA-A1-V1 and 3S-ANOVA-A1-V2.

Algorithm 3S-ANOVA-A1-V1. Estimation of β , σ ε 2 , D = G and prediction of U in a 2-level ANOVA LMM:Version 1

1) Initialization:At iteration 0,we estimate β , σ ε 2 , G and predict U,as follows:

a) As in Algorithm 3S-A1-V1;

b) u ^ ( 0 ) = ( Z T Z ) 1 Z T ( y X β ^ ( 0 ) ) , the OLS estimate of U in the linear model Y ( 0 ) = Z U + ε ,with U considered as fixed parameter in the model, Z the design matrix,and Y ( 0 ) = Y X β ^ ( 0 ) as the response vector;

c) σ ^ ε 2 ( 0 ) = y X β ^ ( 0 ) Z u ^ ( 0 ) 2 / ( n p ) ;

d) σ ^ k 2 ( 0 ) = 1 m k u ^ k ( 0 ) 2 = 1 m k l = 1 m k ( u ^ k , l ( 0 ) ) 2 ,for k = 1, , g ,with u ^ ( 0 ) = ( u ^ 1 ( 0 ) T , , u ^ g ( 0 ) T ) T split according to the g categorical variables that form the boolean matrix Z ,such that u ^ k ( 0 ) = ( u ^ k ,1 ( 0 ) , , u ^ k , m k ( 0 ) ) T is the m k -vector of random effects of the kth categorical variable.

e) G ^ ( 0 ) = diag ( σ ^ 1 2 ( 0 ) I m 1 , , σ ^ g 2 ( 0 ) I m g ) ;

2) The iterative process:Given β ^ ( t ) , σ ^ ε 2 ( t ) , u ^ ( t ) ,and G ^ ( t ) = diag ( σ ^ 1 2 ( t ) I m 1 , , σ ^ g 2 ( t ) I m g ) from iteration t,we obtain estimates and predictions at iteration t + 1 as follows:

a) As in Algorithm 3S-A1-V1;

b) V ^ ( t + 1 ) = ( σ ^ ε 2 ( t + 1 ) Z T Z + G ^ ( t ) 1 ) 1 ;

c) σ ^ k 2 ( t + 1 ) = 1 m k l = 1 m k [ ( V ^ k k ( t + 1 ) ) l l + ( u ^ k , l ( t + 1 ) ) 2 ] , k = 1 , , g , where V ^ k k ( t + 1 ) is the kth diagonal block of V ^ ( t + 1 ) corresponding to the m k categories of the kth categorical random effects variable;

d) G ^ ( t + 1 ) = diag ( σ ^ 1 2 ( t + 1 ) I m 1 , , σ ^ g 2 ( t + 1 ) I m g ) ;

3) Stopping criterion:We assume convergence when all the 3inequalities

ϑ ^ ( t + 1 ) ϑ ^ ( t ) δ 1 ϑ ^ ( t + 1 ) , (S:4.2a)

β ^ ( t + 1 ) β ^ ( t ) δ 2 β ^ ( t + 1 ) , (S:4.2b)

| σ ^ ε 2 ( t + 1 ) σ ^ ε 2 ( t ) | δ 3 σ ^ ε 2 ( t + 1 ) , (S:4.2c)

are satisfied,where δ 1 , δ 2 , δ 3 are relative tolerance levels set in ( 0,1 ) ,and ϑ ^ = ( σ ^ 1 2 , , σ ^ g 2 ) T . Otherwise,repeat Step 2with t t + 1 .

4) Extracting estimates:At convergence,take β ^ = β ^ ( t + 1 ) , σ ^ ε 2 = σ ^ ε 2 ( t + 1 ) , G ^ = G ^ ( t + 1 ) as estimates of β , σ ε 2 , G .Also,take u ^ ( t + 1 ) as predictions of U.

Algorithm 3S-ANOVA-A1-V2.Estimation of β , σ ε 2 , D = G and prediction of U in a 2-level ANOVA LMM:Version 2

1) Initialization:The same as in Algorithm 3S-ANOVA-A1-V1;

2) The iterative process:Given β ^ ( t ) , σ ^ ε 2 ( t ) , u ^ ( t ) ,and G ^ ( t ) = diag ( σ ^ 1 2 ( t ) I m 1 , , σ ^ g 2 ( t ) I m g ) from iteration t,we obtain estimates and predictions at iteration t + 1 as follows:

a)As in Algorithm 3S-A1-V2,but removing the precision about G ^ ( t ) ;

b) A ( t + 1 ) = ( Z T Z + σ ^ ε 2 ( t + 1 ) G ^ ( t ) 1 ) 1 ;

c) u ^ ( t + 1 ) = A ( t + 1 ) Z T ( y X β ^ ( t + 1 ) ) ;

d) V ^ ( t + 1 ) = σ ^ ε 2 ( t + 1 ) A ( t + 1 ) ;

e) As in Algorithm 3S-ANOVA-A1-V1;

3)Stopping criterion:As in Algorithm 3S-ANOVA-A1-V1.Otherwise,repeat Step 2with t t + 1 .

4) Extracting estimates:As in 3S-ANOVA-A1-V1.

S5. Parameters of the Distributions Used for the Simulations

In this section, we detail the distributions and parameters used for the simulations that yield the results in Section 6.1. As a general rule, the values of most of these parameters were themselves generated randomly before launching the simulations.

S5.1. Simulation of Sample Items, Clusters, Fixed and Random Effects Covariates

We assume a population comprising m = 10 clusters in the same proportions of 1/10. To get a sample item, we first sampled its cluster, then its fixed and random effects covariates, respectively in ℝ^4 and ℝ^3, assuming that in each cluster j, the vector X of fixed effects covariates followed N_4(m_j^X, I_4). Likewise, in each cluster j, the vector Z of random effects covariates followed N_3(m_j^Z, 0.7 I_3). We used:

m 1 X = ( 2,1, 2,2 ) T , m 2 X = ( 4, 4, 4,1 ) T , m 3 X = ( 4, 1,3,2 ) T , m 4 X = ( 1,3,0, 2 ) T , m 5 X = ( 3,3,2, 2 ) T , m 6 X = ( 2, 1,3, 3 ) T , m 7 X = ( 3, 3,4,4 ) T , m 8 X = ( 4,1,4, 3 ) T , m 9 X = ( 1,2,0, 3 ) T , m 10 X = ( 1,0,0,4 ) T ;

m 1 Z = ( 2,0, 3 ) T , m 2 Z = ( 3,1,1 ) T , m 3 Z = ( 0, 1, 1 ) T , m 4 Z = ( 2,0, 2 ) T , m 5 Z = ( 3,2,3 ) T , m 6 Z = ( 3, 3,0 ) T , m 7 Z = ( 2,3,2 ) T , m 8 Z = ( 2,3, 3 ) T , m 9 Z = ( 0,3,1 ) T , m 10 Z = ( 2,2,0 ) T .

In addition, no intercept was included, be it in the fixed part or the random part of the simulated LMM, while we took β = ( 1.5,0.7,0,2.3 ) T 4 as vector of fixed effects parameters.
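A hedged R sketch of how one sample item can be drawn as described above is given below (using MASS::mvrnorm; mX and mZ are lists holding the 10 cluster mean vectors listed above).

## Sketch: drawing one sample item (cluster, then X and Z covariates).
library(MASS)
draw.item <- function(mX, mZ) {
  j <- sample(1:10, 1)                                   # cluster, uniform over 1..10
  X <- mvrnorm(1, mu = mX[[j]], Sigma = diag(4))         # X ~ N_4(m_j^X, I_4)
  Z <- mvrnorm(1, mu = mZ[[j]], Sigma = 0.7 * diag(3))   # Z ~ N_3(m_j^Z, 0.7 I_3)
  list(cluster = j, X = X, Z = Z)
}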

S5.2. Parameters of the Distributions for Simulating Random Effects and Residual Errors

To simulate the cluster random effects in 3 , we examined 5 distributions, while for the residual errors, we examined 3, hence, testing in total, 5 × 3 = 15 scenarios.

S5.2.1. Distributions Tested for the Random Effects

Case 1: Non-degenerate Gaussian distribution in ℝ^3 with mean vector and covariance matrix:

\( m = 0, \qquad \Sigma = \begin{pmatrix} 32.784612 & 28.546477 & 9.726209 \\ 28.546477 & 44.499170 & 3.915344 \\ 9.726209 & 3.915344 & 27.435629 \end{pmatrix}. \)

Case 2: A degenerate Gaussian distribution in ℝ^3 with mean vector and a rank one covariance matrix:

\( m = 0, \qquad \Sigma = \begin{pmatrix} 2.5820307 & 11.232197 & 0.46004207 \\ 11.2321974 & 48.861640 & 2.00124781 \\ 0.4600421 & 2.001248 & 0.08196599 \end{pmatrix}. \)

Case 3: As in Case 2, but with mean vector and a rank two covariance matrix:

\( m = 0, \qquad \Sigma = \begin{pmatrix} 10.19483 & 6.661080 & 17.54595 \\ 6.66108 & 5.462537 & 10.91097 \\ 17.54595 & 10.910967 & 30.47329 \end{pmatrix}. \)

Case 4: Mixture of 3 Gaussian distributions in ℝ^3 with vector of mixing proportions π = (1/6, 1/3, 1/2)^T, common covariance matrix Σ = 2 I_3 and respective mean vectors:

m 1 = ( 0.2815777,1.8254139, 0.5799608 ) T ,

m 2 = ( 0.16064371,0.01403464, 1.63159078 ) T ,

m 3 = ( 0.01323658, 0.61782773,1.28104746 ) T .

Case 5: A discrete distribution in ℝ^3 with vector of respective probabilities π = (1/6, 1/3, 1/2)^T on the 3 mass points:

u 1 = ( 0.7043073 0.7533598 0.7370090 ) , u 2 = ( 0.6410311 0.8449623 0.7769852 ) , u 3 = ( 3.3851412 0.3501683 0.8949348 ) .
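For Cases 4 and 5 above, the cluster random effects can be drawn with a simple two-stage scheme, as in the following hedged R sketch (MASS::mvrnorm for the Gaussian mixture; means and points are lists of the vectors given above).

## Sketch: sampling a cluster random effect from the mixture of Case 4 or the
## discrete distribution of Case 5.
library(MASS)
draw.case4 <- function(means, prob = c(1/6, 1/3, 1/2)) {
  k <- sample(seq_along(prob), 1, prob = prob)       # mixture component
  mvrnorm(1, mu = means[[k]], Sigma = 2 * diag(3))   # common covariance 2 * I_3
}
draw.case5 <- function(points, prob = c(1/6, 1/3, 1/2)) {
  points[[sample(seq_along(prob), 1, prob = prob)]]  # one of the 3 mass points
}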

S5.2.2. Distributions Tested for the Residual Errors

Case 1: Univariate Gaussian distribution N ( 0,2.25 ) .

Case 2: Mixture of 4 Gaussian distributions in ℝ, each of variance σ^2 = 0.9613748, with respective means:

m 1 = 1.293309, m 2 = 1.912166, m 3 = 0.4601711, m 4 = 0.690338,

and vector of mixing proportions π = ( 1 / 8 , 1 / 4 , 1 / 2 , 1 / 8 ) T .

Case 3: A discrete distribution in ℝ with the 4 mass points:

u 1 = 1.532882 , u 2 = 1.332857 , u 3 = 1.494380 , u 4 = 1.778924

of respective probabilities π = ( 1 / 8 , 1 / 4 , 1 / 2 , 1 / 8 ) T .

S6. More Analysis of the Blackmore Data

One can also consider fitting an LMM to the Blackmore data where, in addition to having a random slope for age as in (44), we also have a random slope for the interaction age8*group, thus the following LMM:

log2 exercise ∼ fe(1 + age8 + group + age8*group) + re(1 + age8 + age8*group) + Gr(subject) (S:6.1)

The results for this model using Algorithm 3S-A1-V1 are presented in Table S1. In addition to the parameters already significant in the LMM (44), the estimate of the variance σ_age8*group^2 of the random effect age8*group in the LMM (S:6.1) is also significantly greater than zero, confirming that a random effect on age8*group is also probable. This suggests a likely variation of the interaction between age and group across the girl population in this study.

Table S1. Results of algorithm 3S-A1-V1 fitting the LMM (S:6.1) to the Blackmore data.

int, gr, pat: respective shortcuts for intercept, group, patient.

In contrast, we signal that Gaussian ML fitting of the LMM (S:6.1) to the Blackmore data failed to converge with the default settings in the lmer function in the lme4 R package.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] Gibbons, R.D. and Hedeker, D. (2000) Applications of Mixed-Effects Models in Biostatistics. Sankhyā, 62, 70-103.
[2] Yarkiner, Z., Hunter, G., O’Neil, R. and de Lusignan, S. (2013) Applications of Mixed Models for Investigating Progression of Chronic Disease in a Longitudinal Dataset of Patient Records from General Practice. Journal of Biometrics & Biostatistics, S9, Article No. 001.
https://doi.org/10.4172/2155-6180.S9-001
[3] Van der Merwe, A.J. and Pretorius, A.L. (2003) An Application of the Mixed Linear Model and the Dirichlet Process Prior in Veterinary Medicine Research. Journal of Agricultural, Biological, and Environmental Statistics, 8, Article No. 328.
https://doi.org/10.1198/1085711032192
[4] Milkevych, V., Madsen, P., Gao, H. and Jensen, J. (2021) The Relative Effect of Genomic Information on Efficiency of Bayesian Analysis of the Mixed Linear Model with Unknown Variance. Journal of Animal Breeding and Genetics, 138, 14-22.
[5] Torkashvand, E., Jozani, M.J. and Torabi, M. (2017) Clustering in Small Area Estimation with Area Level Linear Mixed Models. Journal of the Royal Statistical Society Series A (Statistics in Society), 180, 1253-1279.
https://doi.org/10.1111/rssa.12308
[6] Maiti, T. (2001) Robust Generalized Linear Mixed Models for Small Area Estimation. Journal of Statistical Planning and Inference, 98, 225-238.
https://doi.org/10.1016/S0378-3758(00)00302-5
[7] Kizilkaya, K., Garrick, D.J., Fernando, R.L., Mestav, B. and Yildiz, M.A. (2010) Use of Linear Mixed Models for Genetic Evaluation of Gestation Length and Birth Weight Allowing for Heavy-Tailed Residual Effects. Genetics Selection Evolution, 42, Article No. 26.
https://doi.org/10.1186/1297-9686-42-26
[8] Dandine-Roulland, C. and Perdry, H. (2015) The Use of the Linear Mixed Model in Human Genetics. Human Heredity, 80, 196-206.
https://doi.org/10.1159/000447634
[9] Lussetti, D., Kuljus, K., Ranneby, B., Ilstedt, U., Falck, J. and Karlsson, A. (2019) Using Linear Mixed Models to Evaluate Stand Level Growth Rates for Dipterocarps and Macaranga Species Following Two Selective Logging Methods in Sabah, Borneo. Forest Ecology and Management, 437, 372-379.
https://doi.org/10.1016/j.foreco.2019.01.044
[10] Zamudio, F., Yanez, M., Guerra, F., Fuentes, D. and Gonzalez, A. (2020) Comparative Analysis of SNP Data and Hybrid Taxa Information by Using a Classificatory Linear Mixed Model to Study the Genetic Variation and Heritability of Initial Height Growth in Selected Poplar Hybrids. Tree Genetics & Genomes 16, Article No. 69.
https://doi.org/10.1007/s11295-020-01435-1
[11] Jiang, J. (2007) Linear and Generalized Linear Mixed Models and Their Applications. Springer, New York.
[12] Demidenko, E. (2013) Mixed Models: Theory and Applications with R. 2nd Edition, John Wiley & Sons, Inc., Hoboken.
[13] West, B.T., Welch, K.B. and Galecki, A.T. (2007) Linear Mixed Models: A Practical Guide Using Statistical Software. Chapman & Hall/CRC, London.
[14] Searle, S.R., Casella, G. and McCulloch, C.E. (2006) Variance Components. Wiley, New York.
[15] Hartley, H.O. and Rao, J.N.K. (1967) Maximum Likelihood Estimation for the Mixed Analysis of Variance Model. Biometrika, 54, 93-108.
https://doi.org/10.1093/biomet/54.1-2.93
[16] Harville, D.A. (1977) Maximum Likelihood Approaches to Variance Component Estimation and to Related Problems. Journal of the American Statistical Association, 72, 320-338.
https://doi.org/10.1080/01621459.1977.10480998
[17] Lindstrom, M.J. and Bates, D.M. (1988) Newton-Raphson and EM Algorithms for Linear Mixed-Effects Models for Repeated-Measures Data. Journal of the American Statistical Association, 83, 1014-1022.
[18] Pinheiro, J.C. and Bates, D.M. (2000) Mixed-Effects Models in S and S-PLUS. Springer, NY.
[19] Gumedze, F.N. and Dunne, T.T. (2011) Parameter Estimation and Inference in the Linear Mixed Model. Linear Algebra and its Applications, 435, 1920-1944.
https://doi.org/10.1016/j.laa.2011.04.015
[20] Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977) Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society Series B, 39, 1-38.
[21] Laird, N.M. (1982) Computation of Variance Components Using the EM Algorithm. Journal of Statistical Computation and Simulation, 14, 295-303.
https://doi.org/10.1080/00949658208810550
[22] Laird, N.M. and Ware, J.M. (1982) Random Effects Models for Longitudinal Data. Biometrics, 38, 963-974.
https://doi.org/10.2307/2529876
[23] Lange, N. and Ryan, L. (1989) Assessing Normality in Random Effects Models. Annals of Statistics, 17, 624-642.
https://doi.org/10.1214/aos/1176347130
[24] Li, S., Cai, T.T. and Li, H. (2021) Inference for High-Dimensional Linear Mixed-Effects Models: A Quasi-Likelihood Approach. Journal of the American Statistical Association. (In Press)
https://doi.org/10.1080/01621459.2021.1888740
[25] Eisenhart, C. (1947) The Assumptions Underlying the Analysis of Variance. Biometrics, 3, 1-21.
https://doi.org/10.2307/3001534
[26] Anderson, R.L. and Bancroft, T.A. (1952) Statistical Theory in Research. McGraw-Hill, New York.
[27] Montgomery, D.C. (2004) Design and Analysis of Experiments. Wiley, New York.
[28] Oehlert, G.W. (2010) A First Course in Design and Analysis of Experiments. Library of Congress Cataloging-in-Publication Data, USA.
[29] Henderson, C.R. (1953) Estimation of Variance and Covariance Components. Biometrics, 9, 226-252.
https://doi.org/10.2307/3001853
[30] Rao, C.R. (1970) Estimation of Heteroscedastic Variances in Linear Models. Journal of the American Statistical Association, 65, 161-172.
https://doi.org/10.1080/01621459.1970.10481070
[31] Rao, C.R. (1971) Estimation of Variance and Covariance Components: MINQUE Theory. Journal of Multivariate Analysis, 1, 257-275.
https://doi.org/10.1016/0047-259X(71)90001-7
[32] Rao, C.R. (1971) Minimum Variance Quadratic Unbiased Estimation of Variance Components. Journal of Multivariate Analysis, 1, 445-456.
https://doi.org/10.1016/0047-259X(71)90019-4
[33] Rao, C.R. (1972) Estimation of Variance and Covariance Components in Linear Models. Journal of the American Statistical Association, 67, 112-115.
https://doi.org/10.1080/01621459.1972.10481212
[34] Rao, C.R. and Kleffe, J. (1988) Estimation of Variance Components and Applications. North-Holland, Amsterdam.
[35] Jiang, J. (1996) REML Estimation: Asymptotic Behavior and Related Topics. Annals of Statistics, 24, 255-286.
https://doi.org/10.1214/aos/1033066209
[36] Heyde, C.C. (1994) A Quasi-Likelihood Approach to the REML Estimating Equations. Statistics & Probability Letters, 21, 381-384.
https://doi.org/10.1016/0167-7152(94)00035-2
[37] Jiang, J., Luan, Y. and Wang, Y. (2007) Iterative Estimating Equations: Linear Convergence and Asymptotic Properties. The Annals of Statistics, 35, 2233-2260.
https://doi.org/10.1214/009053607000000208
[38] Goldstein, H. (1986) Multilevel Mixed Linear Model Analysis Using Iterative Generalized Least Squares. Biometrika, 73, 43-56.
https://doi.org/10.1093/biomet/73.1.43
[39] R Core Team (2019) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna.
https://www.R-project.org/
[40] Searle, S.R. (1995) The Matrix Handling of BLUE and BLUP in the Mixed Linear Model. BU-1275-MA, Biometrics Unit, Cornell University, Ithaca.
[41] Henderson, C.R. (1950) Estimation of Genetic Parameters (Abstract). Annals of Mathematical Statistics, 21, 309-310.
[42] Henderson, C.R., Kempthorne, O., Searle, S.R. and von Krosigk, C.N. (1959) The Estimation of Environmental and Genetic Trends from Records Subject to Culling. Biometrics, 15, 192-218.
https://doi.org/10.2307/2527669
[43] Henderson, C.R. (1973) Sire Evaluation and Genetic Trends. Journal of Animal Science, 1973 (Symposium), 10-41.
https://doi.org/10.1093/ansci/1973.Symposium.10
[44] Henderson, C.R. (1963) Selection Index and Expected Genetic Advance. In: Statistical Genetics and Plant Breeding, National Academy of Sciences, No. 982, National Research Council Publication, Washington DC, 141-163.
[45] Carpenter, J.R., Goldstein, H. and Rasbash, J. (2003) A Novel Bootstrap Procedure for Assessing the Relationship between Class Size and Achievement. Journal of the Royal Statistical Society: Series C, 52, 431-443.
https://doi.org/10.1111/1467-9876.00415
[46] Van der Leeden, R., Meijer, E. and Busing, F. (2008) Resampling Multilevel Models. In: de Leeuw, J. and Meijer, E., Eds., Handbook of Multilevel Analysis, Chapter 11, Springer, New York, 401-433.
[47] Chambers, R. and Chandra, H. (2013) A Random Effect Block Bootstrap for Clustered Data. Journal of Computational and Graphical Statistics, 22, 452-470.
https://doi.org/10.1080/10618600.2012.681216
[48] Thai, H.-T., Mentré, F., Holford, N., Veyrat-Follet, C. and Comets, E. (2013) A Comparison of Bootstrap Approaches for Estimating Uncertainty of Parameters in Linear Mixed-Effects Models. Pharmaceutical Statistics, 12, 129-140.
https://doi.org/10.1002/pst.1561
[49] Modugno, L. and Giannerini, S. (2015) The Wild Bootstrap for Multilevel Models. Communications in Statistics—Theory and Methods, 44, 4812-4825.
https://doi.org/10.1080/03610926.2013.802807
[50] Bates, D., Maechler, M., Bolker, B. and Walker, S. (2015) Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67, 1-48.
https://doi.org/10.18637/jss.v067.i01
[51] Cook, F.E. (1938) Chocolate Cake, I. Optimum Baking Temperature. Master’s Thesis, Iowa State College, Iowa.
[52] Efron, B. and Tibshirani, R.J. (1993) An Introduction to the Bootstrap. Chapman and Hall, London.
[53] Davis, C., Blackmore, E., Katzman, D.K. and Fox, J. (2005) Female Adolescents with Anorexia Nervosa and Their Parents: A Case-Control Study of Exercise Attitudes and Behaviours. Psychological Medicine, 35, 377-386.
https://doi.org/10.1017/S0033291704003447
