Weighted Maximum Likelihood Technique for Logistic Regression

Abstract

In this paper, a weighted maximum likelihood technique (WMLT) for the logistic regression model is presented. The method depends on a continuously adaptive weight function based on the Mahalanobis distances of the predictor variables. The asymptotic consistency of the proposed estimator is demonstrated under the model, and its finite-sample properties are investigated via simulation. In simulation studies and on real data sets, the newly proposed technique shows the best performance among all estimators compared.

Share and Cite:

Idriss, I., Cheng, W. and Fissuh, Y. (2023) Weighted Maximum Likelihood Technique for Logistic Regression. Open Journal of Statistics, 13, 803-821. doi: 10.4236/ojs.2023.136041.

1. Introduction

Logistic regression is a statistical method for modeling the relationship between a binary dependent variable and one or more explanatory variables, and it is widely used across many fields. The parameters of a statistical model are usually estimated by the maximum likelihood estimator (MLE), which maximizes the likelihood function. In logistic regression models, the MLE is used to estimate the coefficients of the predictor variables that best fit the data. Unfortunately, this approach is not robust to unusual observations. Several robust estimators have been proposed as alternatives to the MLE to address this issue. The MLE in logistic regression was shown to be very sensitive to outlying data by [1] , who also devised a diagnostic assessment of outlying observations; see also [2] . For binary regression, [3] studied several M-estimators whose estimates downweight leverage points, a Mallows-type estimator. [4] constructed a robust estimate for the logistic regression model based on a modified median estimator, and also investigated a Wald-type test statistic. [5] constructed highly robust projection estimators for the GLM, but their computation is quite difficult. [6] proposed a quasi-likelihood estimator by substituting the least absolute deviation estimator (L1 norm) for the least squares estimator (L2 norm) in the definition of quasi-likelihood. [7] suggested resilient estimators and testing techniques for Poisson and binomial models based on the idea of the quasi-likelihood estimator introduced by [8] . The breakdown of the MLE in the logistic model was investigated in [9] , whereas [10] offered a robust method for the logistic regression model. [11] created a new robust technique for Poisson regression. [12] provided reliable estimators for generalized linear models; the fundamental concept is to transform the response via a variance-stabilizing transformation before estimation.
[13] proposed an estimator indicated to be very consistent and reliable. [14] provides a reliable and efficient approach for computing the M-estimator described in [13] . [15] presented a fast technique for the generalized linear model based on the breakdown point of the trimmed likelihood. Fisher-consistent estimators are another kind of robust estimator, as introduced in [13] . [16] investigated a resistant robust estimator whose estimation relies on the misclassification model. [17] presented a new family of robust methods for logistic regression. [18] compared minimum distance approaches to more robust methods, discovered a unique weighted likelihood, and applied it to Poisson and binary regression. The optimally bounded score functions described by [19] for linear models were applied to the logistic model in [20] .

All of these estimators differ significantly in their resistance to outliers and their efficiency under the model. In this paper, we have conducted a comprehensive investigation into the behavior of some of these estimators, both in terms of their asymptotic properties and their behavior in finite samples. Our findings indicate that the Mallows-type estimator proposed by [3] is very robust to outlier contamination but inefficient under the model, while the Schweppe-type estimators proposed by [2] are very efficient under the model but show poor outlier resistance. In this paper, we propose an estimator that can be as robust as Mallows-type estimators under contamination but is much more efficient under the model; this is achieved by an adaptive continuous weight. This continuous weighted maximum likelihood estimator depends on a nuisance-parameter estimator derived from a Kolmogorov-Smirnov statistic. The maximum likelihood estimator and the logistic regression model are covered in Section 2. Robust methods for the logistic regression model are discussed in Section 3. We propose a robust technique for logistic regression in Section 4. Section 5 displays the findings of the Monte Carlo simulation study and the real data analysis. Section 6 contains the conclusions.

2. Logistic Regression Model and ML Estimator

The logistic regression model is a popular technique used to examine the relationship between a categorical variable and one or more predictor variables. It is often used in binary classification problems, where the dependent variable has only two possible outcomes (e.g., true or false, yes or no). The logistic function is the foundation of the logistic regression model; it transforms a continuous input into a probability value between 0 and 1. Assume a random sample of observations (x_1, y_1), …, (x_n, y_n), where x_i represents p predictor variables and y_i ∈ {0, 1} is a binary variable, and suppose the probability of a positive response μ_i = P(y_i = 1 | x_i) is associated with the covariates through the relation g(μ) = Xβ, where g is the logit link function and g^{-1}(Xβ) maps the linear predictor into the range (0, 1).

Using the logit link function, the multiple logistic regression model may be written as:

P(Y = 1 | X = x_i) = F(x_i^T β) = exp(x_i^T β) / (1 + exp(x_i^T β)),  i = 1, …, n,  (1)

where x_i = (1, x_{i1}, …, x_{ip})^T contains the predictor variable values and β^T = (β_0, β_1, …, β_p) represents an unknown parameter vector. We may characterize the binary regression model as follows:

η_i = x_i^T β,

where η_i is the linear predictor, sometimes referred to as the transformation function, with η_i = ln(μ_i / (1 − μ_i)). The MLE is a method for estimating the parameters of a statistical model by maximizing the likelihood function. The MLE assumes that the data are generated by a specific probability distribution (in this case, the logistic model) and finds the parameter values that make it most likely for that distribution to have generated those data. The MLE is often used in logistic regression because it provides consistent, asymptotically unbiased estimates of the model parameters and has good statistical properties. Assume that the dependent variables y_i have a Bernoulli distribution; then the probability distribution of the i-th observation is:

f(y_i) = μ_i^{y_i} (1 − μ_i)^{1 − y_i},  i = 1, 2, …, n,

and each observation y_i takes the value 1 with probability μ_i or the value 0 with probability 1 − μ_i. The likelihood function is defined as follows:

l(β; y_1, y_2, …, y_n) = ∏_{i=1}^n f(y_i) = ∏_{i=1}^n μ_i^{y_i} (1 − μ_i)^{1 − y_i}.  (2)

Then we compute the log-likelihood of the preceding formula:

ln l(β; y_1, y_2, …, y_n) = ln ∏_{i=1}^n f(y_i) = Σ_{i=1}^n y_i ln(μ_i / (1 − μ_i)) + Σ_{i=1}^n ln(1 − μ_i),

where η_i = ln[μ_i / (1 − μ_i)] = x_i^T β and 1 − μ_i = [1 + exp(x_i^T β)]^{−1}. So we can express the log-likelihood as:

ln l(β; Y) = Σ_{i=1}^n y_i x_i^T β − Σ_{i=1}^n ln[1 + exp(x_i^T β)] = β^T X^T Y − Σ_{i=1}^n ln[1 + exp(x_i^T β)].  (3)

In experimental designs, we take repeated observations at each level of the independent variables x. Let n_i be the number of trials at the i-th predictor level and y_i be the number of 1's observed at the i-th level, with n = n_1 + n_2 + ⋯ + n_m. Therefore, we may express the log-likelihood as:

ln l(β; Y) = β^T X^T Y − Σ_{i=1}^m n_i ln[1 + exp(x_i^T β)];  (4)

we may maximize the likelihood function by differentiating it with respect to β:

∂ ln l(β; Y) / ∂β = X^T Y − Σ_{i=1}^m [n_i e^{x_i^T β} / (1 + e^{x_i^T β})] x_i,

where e^{x_i^T β} / (1 + e^{x_i^T β}) = 1 / (1 + e^{−x_i^T β}) = μ_i, so we have

∂ ln l(β; Y) / ∂β = X^T Y − Σ_{i=1}^m n_i μ_i x_i,

where n_i μ_i denotes the mean of the binomial variable. The preceding formula may be written in matrix form as X^T (Y − μ), where

μ = (μ_1, μ_2, …, μ_m)^T.

Hence, the MLE is normally computed by solving the score equation:

X^T (Y − μ) = 0.  (5)

Since Equation (5) is nonlinear in β, the iteratively weighted least squares (IWLS) technique may be used to find a solution.
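Equation (5) has no closed-form solution in β, so IWLS repeatedly takes Newton steps on the score X^T (Y − μ). Below is a minimal numpy sketch for the Bernoulli case (n_i = 1); the function name and convergence settings are illustrative, not from the paper:

```python
import numpy as np

def logistic_mle_iwls(X, y, tol=1e-8, max_iter=100):
    """Solve the score equation X^T (y - mu) = 0 by Newton-Raphson,
    i.e. iteratively weighted least squares for the logistic model."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))   # fitted probabilities F(x_i^T beta)
        W = mu * (1.0 - mu)                      # Bernoulli variances, the IWLS weights
        # Newton step: (X^T W X)^{-1} X^T (y - mu)
        step = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - mu))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

At convergence the score X^T (y − μ̂) is numerically zero, which is exactly condition (5).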

3. Robust Methods for Logistic Regression Model

In logistic regression models, robust estimators are statistical methods for estimating parameters that are less sensitive to outliers and influential observations. These techniques are designed to provide reliable estimates of the regression coefficients even when the data contain extreme values or other anomalies that can distort the results. Several robust estimators can be used in logistic regression. [13] suggested robust estimators via the management of deviations to obtain theoretically unbiased estimators; however, an extra bias-correction component must be introduced, which makes the calculation of their estimator very complex and the estimator itself not straightforward. [2] developed Mallows-type estimators by separately manipulating the covariates and residuals in the estimating equation. In the case of logistic regression, they are defined as solutions of

Σ_{i=1}^n w(x_i; η̂) ψ_b( y_i − π(x_i^T β̂) − c(π(x_i^T β̂), b) ) x_i = 0,  (6)

where η̂ represents the nuisance parameters (location and scatter estimates of the covariates), ψ_b is usually taken to be Huber's function ψ_b(t) = max(−b, min(t, b)), and c(t, b) is the bias-correction function expressed as

c(t, b) = b π(t) / (1 − π(t)) − π(t),   if t < 0 and b < 1 − π(t),
c(t, b) = 1 − π(t) − b (1 − π(t)) / π(t),   if t > 0 and b < π(t),
c(t, b) = 0,   otherwise;

the weights w(x_i; η̂) often depend only on the continuous variables. Suppose we write x_i^T = (u_i^T, z_i^T), where u_i ∈ R^{p−q} are the qualitative variables and z_i ∈ R^q are the continuous variables; the weights then usually take the form w(x_i; η̂) = w((z_i − μ̂)^T Σ̂^{−1} (z_i − μ̂) / t), with w : R_+ → R_+ a non-increasing function, μ̂ a robust estimator of the location of the z_i, Σ̂ a robust estimator of the scatter of the z_i, and t a threshold value (usually t = χ²_{q,1−α} for some α ∈ (0, 1)). The initial robust estimators of location and scatter of the predictor variables, μ̂ and Σ̂, can be calculated using the minimum covariance determinant (MCD) method. The MCD was one of the first multivariate location and scatter estimators that were both affine equivariant and highly robust. It finds the h (> n/2) observations x_i whose classical covariance matrix

V = (1/h) Σ_i (x_i − t)(x_i − t)^T

has the smallest possible determinant, where t is the average of those h points. As observed, the residual weight ψ_b and the covariate weight w are independent; this reduces the efficiency of the resulting estimators, because the estimating equation downweights well-fitted observations with extreme covariates. [21] presented a family of resilient adaptive weighted maximum likelihood techniques for logistic regression models. These adaptive weights depend on adaptive cut-off thresholds that regulate observations with extreme covariates. They demonstrated that estimators based on adaptive thresholds are more efficient than estimators based on non-adaptive thresholds for clean models and have equivalent resilience for contaminated models.

The lack of dependence between the weights assigned to covariates and the weights assigned to deviances in Equation (6) is the underlying cause of the generally lower efficiency of Mallows-type estimators compared to Schweppe-type estimators. This occurs due to the downweighting of observations with extreme covariates, even when they exhibit good fit. It is evident that enhancing the efficiency of Mallows-type estimators can be achieved by reducing the thresholding proportions, although this may compromise the robustness of the estimator. In order to simultaneously achieve both high efficiency and high robustness, it becomes necessary to employ adaptive thresholds, as detailed in the next section.

4. Weighted Maximum Likelihood Technique (WMLT)

In this section, we build a novel class of continuous weighted maximum likelihood estimators based on a nuisance-parameter estimator that is a function of the Kolmogorov-Smirnov statistic. We will refer to these estimators as WMLT (weighted maximum likelihood technique). First, build two estimators μ̂⁽⁰⁾ and Σ̂⁽⁰⁾, the initial estimates of location and scatter of the predictor variables z_i; thereafter, calculate the squared Mahalanobis distances of the z_i, characterized by m_i² = (z_i − μ̂⁽⁰⁾)^T (Σ̂⁽⁰⁾)^{−1} (z_i − μ̂⁽⁰⁾). The empirical distribution function of the m_i² may be expressed as:

F_n(t) = (1/n) Σ_{i=1}^n I(m_i² ≤ t).

When the z_i have a normal distribution, F_n converges to F_{χ²_q} (the χ²_q distribution function). We can then estimate the proportion of outliers in the covariates by [21]

α_n = sup_{t ≥ F_{χ²_q}^{−1}(1−δ)} { F_{χ²_q}(t) − F_n(t) }⁺ = max_{i ≥ i_0} { F_{χ²_q}(m_(i)²) − (i − 1)/n }⁺,

where {·}⁺ denotes the positive part, δ measures the length of the tail (δ = 0.25 is an acceptable choice), and i_0 = min{ i : m_(i)² ≥ F_{χ²_q}^{−1}(1 − δ) }. When F_{χ²_q}(t) − F_n(t) is large for large t, it indicates that the sample contains outliers. Hence, an adaptive threshold may be defined as

t_n = F_n^{−1}(1 − α_n) = m²_(n − [n α_n]).
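The computation of α_n and t_n reduces to a comparison between the χ²_q cdf and the order statistics of the m_i². A minimal numpy sketch, under the assumption q = 2 (so F_{χ²_2}(t) = 1 − e^{−t/2} has a closed form); the function names are illustrative:

```python
import numpy as np

def chi2_cdf_2(t):
    # chi-square cdf with q = 2 degrees of freedom: F(t) = 1 - exp(-t/2)
    return 1.0 - np.exp(-0.5 * np.asarray(t))

def adaptive_cutoff(m2, delta=0.25):
    """Estimate the outlier proportion alpha_n and the adaptive threshold t_n
    from squared Mahalanobis distances m2 (clean m_i^2 approx. chi^2_2)."""
    n = len(m2)
    m2_sorted = np.sort(m2)
    cutoff = -2.0 * np.log(delta)                 # F^{-1}(1 - delta) for chi^2_2
    i0 = int(np.searchsorted(m2_sorted, cutoff))  # first order statistic past the cutoff
    # alpha_n = max_{i >= i0} { F(m_(i)^2) - (i - 1)/n }^+
    gaps = chi2_cdf_2(m2_sorted) - np.arange(n) / n
    alpha_n = float(max(0.0, gaps[i0:].max())) if i0 < n else 0.0
    # t_n = m^2_(n - [n alpha_n])
    k = max(n - int(n * alpha_n), 1)
    return alpha_n, m2_sorted[k - 1]
```

On clean data α_n is close to zero (so t_n is large and almost nothing is downweighted); a cluster of inflated distances pushes α_n toward the contamination fraction.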

[21] proposed adaptive threshold estimators. Specifically, they developed a Mallows-type estimator with weights w(x_i; η̂) = w(m_i² / t_n), which is essentially a weighted maximum likelihood estimator. Our proposed weight function is defined as

w(x_i; α_n) = m(α_n m_i²),

where we assume m is a non-increasing continuous mapping from R_+ to (0, 1] such that m(0) = 1, sup_{x>0}{x m(x)} < ∞, and the first derivative is bounded with m′(0) = 0. Define the objective function

ψ_{β,η}(x, y) = w_η(x) φ_β(x, y),  x, β ∈ R^p, y ∈ {0, 1},  (7)

where η = (α, μ, Σ) collects the goodness-of-fit, location, and scatter parameters of the explanatory variables respectively, w_η(x) = m(α (z − μ)^T Σ^{−1} (z − μ)) is the adaptive weight function, and φ_β(x, y) = (y − F(x^T β)) x, with μ ∈ R^q, Σ a q × q real matrix and x^T = (u^T, z^T). Finally, we define our adaptive estimator β̂_n of β as the solution of the estimating equation

Σ_{i=1}^n ψ_{β,η̂}(x_i, y_i) = 0,

where η ^ is a consistent estimator of η = ( α , μ , Σ ) .
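Given the covariate weights w_i = m(α_n m_i²), the estimating equation Σ_i w_i (y_i − F(x_i^T β)) x_i = 0 can be solved by the same Newton iterations as the MLE, with the weights held fixed. A hedged sketch (illustrative names, not the authors' code):

```python
import numpy as np

def wmlt_fit(X, y, w, tol=1e-8, max_iter=100):
    """Solve sum_i w_i (y_i - F(x_i^T beta)) x_i = 0 by Newton steps.
    The covariate weights w_i = m(alpha_n * m_i^2) stay fixed throughout."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))          # F(x_i^T beta)
        score = X.T @ (w * (y - mu))                     # weighted score
        H = X.T @ ((w * mu * (1.0 - mu))[:, None] * X)   # negative Jacobian of the score
        step = np.linalg.solve(H, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

With w ≡ 1 this reduces to the ordinary MLE; downweighted observations simply contribute less to both the score and the Jacobian.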

If the z_i are normally distributed, then t_n → ∞ and the reweighted estimators are asymptotically equivalent to the sample mean and the sample covariance matrix, and therefore fully efficient. This efficiency carries over to the adaptive Mallows-type estimators, as shown in the next subsection.

4.1. Asymptotic of Proposed Method

The estimating equation (6) can be written as Σ_{i=1}^n Ψ(x_i, y_i; β̂; η̂) = 0, with Ψ(x, y; β; η) = w(x; η) ψ_b( y − F(x^T β) − c(F(x^T β), b) ) x. Under appropriate regularity conditions, the classical asymptotics of M-estimators hold; see [22] . Let β_0 be the model parameter and E_0 denote expectation under the model; define M_0(β) = E_0{ Ψ(x, y; β; η_0) }, with η_0 = plim η̂. If C_0 = D M_0(β)|_{β=β_0}, where D denotes the differential, and A_0 = E_0{ Ψ(x, y; β_0; η_0) Ψ(x, y; β_0; η_0)^T }, then

√n (β̂ − β_0) → N_p(0, C_0^{−1} A_0 C_0^{−1}).

This result is valid for non-adaptive and adaptive weights alike, as long as η̂ converges to η_0 in probability. For the weighted maximum likelihood estimator given by ψ_b(u) = u, A_0 and C_0 have the simple expressions

C_0 = E{ w(x; η_0) F′(x^T β_0) x x^T },  (8)

A_0 = E{ w²(x; η_0) F(x^T β_0)(1 − F(x^T β_0)) x x^T }.  (9)

Using the asymptotic normality of β̂, it is possible to construct confidence ellipsoids for β_0. First, we estimate the matrices (8) and (9) by

Ĉ = (1/n) Σ_{i=1}^n w(x_i; η̂) F′(x_i^T β̂) x_i x_i^T,

Â = (1/n) Σ_{i=1}^n w²(x_i; η̂) F(x_i^T β̂)(1 − F(x_i^T β̂)) x_i x_i^T.

Then the estimated asymptotic variance of √n (β̂ − β_0) is V̂ = Ĉ^{−1} Â Ĉ^{−1}. The asymptotic confidence ellipsoid of level 1 − α for β_0 is given by ξ(β_0) = { β ∈ R^p : n (β̂ − β)^T V̂^{−1} (β̂ − β) ≤ χ²_{p,1−α} }. This can be generalized to linear transformations of β_0.
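The sandwich variance V̂ = Ĉ^{−1} Â Ĉ^{−1} from (8) and (9) can be assembled in a few lines. A sketch under the assumption ψ_b(u) = u, so that F′(t) = F(t)(1 − F(t)) for the logistic cdf; the function name is illustrative:

```python
import numpy as np

def sandwich_variance(X, beta_hat, w):
    """Estimate C (8), A (9) and the sandwich V = C^{-1} A C^{-1} for the
    weighted score psi = w(x)(y - F(x^T beta))x under the logistic model."""
    n = X.shape[0]
    mu = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
    Fp = mu * (1.0 - mu)                        # F' = F(1-F) = Var(y | x)
    C = X.T @ ((w * Fp)[:, None] * X) / n       # uses weights w
    A = X.T @ ((w**2 * Fp)[:, None] * X) / n    # uses squared weights w^2
    Cinv = np.linalg.inv(C)
    return Cinv @ A @ Cinv
```

The Wald ellipsoid then compares n (β̂ − β)^T V̂^{−1} (β̂ − β) with the χ²_{p,1−α} quantile; note that with w ≡ 1, A = C and V̂ collapses to the usual inverse Fisher information C^{−1}.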

4.2. Asymptotic Properties of WMLT

This subsection focuses on the asymptotic properties of the estimator β̂_n described in the preceding section. We will show that the estimator is consistent under some general assumptions on the moments of the predictor variables. Suppose that β_0, μ_0 and Σ_0 are the true values of β, μ and Σ respectively, and that the independent sample (x_1, y_1), …, (x_n, y_n) follows the logistic model P(y_i = 1) = μ_i, i = 1, …, n. We define the functions

Ψ_n(β) = P_n ψ_β = (1/n) Σ_{i=1}^n ψ_{β,η̂}(x_i, y_i)

and

Ψ(β) = P ψ_{β,0} = P φ_β,

where ψ_{β,0} is calculated from Equation (7) with η replaced by η_0 = (0, μ_0, Σ_0), and P denotes the joint probability distribution of the (x, y)'s. Theorem 1 establishes that β̂_n is consistent; it makes use of the conclusions of Lemmas 1, 2, and 3 below. To prove the lemmas and the theorem, the following assumptions are made:

B1: μ̂ →_p μ_0 and Σ̂ →_p Σ_0.

B2: E(x x^T) is nonsingular.

B3: E_{G_0} ‖x‖⁴ < ∞.

B4: m(x) is a continuous weight function with a bounded first derivative, m(0) = 1 and m′(0) = 0.

B1 is met by the vast majority of well-known initial robust estimators, such as the minimum covariance determinant estimator used in the simulation experiments. In the following lemmas and theorem, asymptotic statements are understood as n → ∞. The proof of Lemma 1 is given in [21] .

Lemma 1. If B1 holds, then α_n = o_p(1).

Lemma 2. If B2 holds, then Ψ(β_n) → 0 implies β_n − β_0 → 0 for any sequence {β_n} ⊂ Θ.

Proof of Lemma 2. Consider the following

Ψ(β_n) = P ψ_{β_n,0} = P φ_{β_n} = P (y − F(x^T β_n)) x = P (y − F(x^T β_0) + F(x^T β_0) − F(x^T β_n)) x = P (y − F(x^T β_0)) x + P (F(x^T β_0) − F(x^T β_n)) x.

If Ψ(β_n) → 0, then ‖Ψ(β_n)‖ → 0. Since P (y − F(x^T β_0)) x = 0, we see from the equality above that P (F(x^T β_0) − F(x^T β_n)) x → 0. Note that

P (F(x^T β_0) − F(x^T β_n)) x = P ( F′(c)(x^T β_0 − x^T β_n) x ) ≤ (1/4) P ( x^T (β_0 − β_n) x ),

where c ∈ (x^T β_0, x^T β_n) and h′(μ) is the first derivative of the logistic link, h′(x) = e^x / (1 + e^x)² ∈ (0, 1/4]. Write x = (x_1, x_2, …, x_p)^T and β_0 − β_n = β_d = (β_{d1}, β_{d2}, …, β_{dp})^T. Then

P ( x^T (β_0 − β_n) x ) = P ( Σ_{i=1}^p β_{di} (x_1 x_i, x_2 x_i, …, x_p x_i)^T ) = Σ_{i=1}^p β_{di} P ( (x_1 x_i, x_2 x_i, …, x_p x_i)^T ) → 0.

Since P(x x^T) = E(x x^T) is nonsingular, it follows that β_d = β_0 − β_n → 0, that is, β_n − β_0 → 0. ∎

Lemma 3. Suppose that B3 and B4 hold. Then the class { ψ_{β,η} : β ∈ Θ, ‖η − η_0‖ < δ } is P-Glivenko-Cantelli for some δ > 0, where η_0 = (0, μ_0, Σ_0).

Proof of Lemma 3. To show that a class F of vector-valued functions ψ : (x, y) → R^p is Glivenko-Cantelli, it suffices to show that each of the coordinate classes ψ_i : (x, y) → R, with ψ = (ψ_1, …, ψ_p)^T ranging over F (i = 1, 2, …, p), is Glivenko-Cantelli.

The class F = { ψ_γ : γ = (β, η) = (β, α, μ, Σ), α ∈ [0, 1], μ ∈ R^q, Σ ∈ S_+^q, β ∈ Θ, ‖η − η_0‖ < δ } is a set of measurable functions indexed by a bounded subset of Γ ⊂ Θ × [0, 1] × R^q × R^{q×q}, where S_+^q denotes the collection of positive semidefinite matrices in R^{q×q}. This is because Σ is a variance-covariance matrix of continuous predictor variables, hence positive semidefinite and symmetric. For the norm, we use

‖η − η_0‖ = ( α² + ‖μ − μ_0‖² + ‖Σ − Σ_0‖² + ‖β − β_0‖² )^{1/2}, where ‖·‖ denotes the Euclidean norm for the vectors μ, β and the induced norm for the matrix Σ. For two values γ_i = (β_i, η_i), (i = 1, 2) of γ, we have

|ψ_{γ_1,i}(x, y) − ψ_{γ_2,i}(x, y)| = |w_{η_1}(x) φ_{β_1,i}(x, y) − w_{η_2}(x) φ_{β_2,i}(x, y)| = |w_{η_1}(x) φ_{β_1,i}(x, y) − w_{η_1}(x) φ_{β_2,i}(x, y) + w_{η_1}(x) φ_{β_2,i}(x, y) − w_{η_2}(x) φ_{β_2,i}(x, y)| ≤ |w_{η_1}(x) φ_{β_1,i}(x, y) − w_{η_1}(x) φ_{β_2,i}(x, y)| + |w_{η_1}(x) φ_{β_2,i}(x, y) − w_{η_2}(x) φ_{β_2,i}(x, y)|.  (10)

But

|w_{η_1}(x) φ_{β_1,i}(x, y) − w_{η_1}(x) φ_{β_2,i}(x, y)| = w_{η_1}(x) |(y − F(x^T β_1)) x_i − (y − F(x^T β_2)) x_i| ≤ |F(x^T β_1) − F(x^T β_2)| |x_i| ≤ K_0 |x_i| |x^T β_2 − x^T β_1| ≤ K_0 |x_i| ‖x‖ ‖β_2 − β_1‖,  (11)

where K_0 is an upper bound on the first derivative of the link function h(μ).

Making use of the Mean Value Theorem, for each x_1 and x_2 there exists c ∈ (x_1, x_2) such that

μ(x_1) − μ(x_2) = h′(c)(x_1 − x_2), so |μ(x_1) − μ(x_2)| ≤ K_0 |x_1 − x_2|.

In addition, we have

|w_{η_1}(x) φ_{β_2,i}(x, y) − w_{η_2}(x) φ_{β_2,i}(x, y)| = |w_{η_1}(x) − w_{η_2}(x)| |y − F(x^T β_2)| |x_i| = |w_{α_1,μ_1,Σ_1}(x) − w_{α_1,μ_2,Σ_2}(x) + w_{α_1,μ_2,Σ_2}(x) − w_{α_2,μ_2,Σ_2}(x)| |y − F(x^T β_2)| |x_i| ≤ ( |w_{α_1,μ_1,Σ_1}(x) − w_{α_1,μ_2,Σ_2}(x)| + |w_{α_1,μ_2,Σ_2}(x) − w_{α_2,μ_2,Σ_2}(x)| ) |x_i|.  (12)

then, we get

|w_{α_1,μ_2,Σ_2}(x) − w_{α_2,μ_2,Σ_2}(x)| = |m(α_1 (Z − μ_2)^T Σ_2^{−1} (Z − μ_2)) − m(α_2 (Z − μ_2)^T Σ_2^{−1} (Z − μ_2))| ≤ K_1 |α_2 − α_1| (Z − μ_2)^T Σ_2^{−1} (Z − μ_2),  (13)

and

|w_{α_1,μ_1,Σ_1}(x) − w_{α_1,μ_2,Σ_2}(x)| = |m(α_1 (Z − μ_1)^T Σ_1^{−1} (Z − μ_1)) − m(α_1 (Z − μ_2)^T Σ_2^{−1} (Z − μ_2))| ≤ K_1 α_1 |(Z − μ_1)^T Σ_1^{−1} (Z − μ_1) − (Z − μ_2)^T Σ_2^{−1} (Z − μ_2)| ≤ K_1 |(Z − μ_1)^T Σ_1^{−1} (Z − μ_1) − (Z − μ_1)^T Σ_2^{−1} (Z − μ_1)| + K_1 |(Z − μ_1)^T Σ_2^{−1} (Z − μ_1) − (Z − μ_2)^T Σ_2^{−1} (Z − μ_2)|.  (14)

Since Σ is positive semidefinite, so is Σ^{−1}. Denote the eigenvalues of Σ_2^{−1} by λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_q ≥ 0. Consequently, Σ_2^{−1} has a set of orthonormal eigenvectors, say p_1, …, p_q, such that Σ_2^{−1} p_i = λ_i p_i. In matrix form, there exists an orthogonal matrix Q such that

Q^{−1} Σ_2^{−1} Q = Q^T Σ_2^{−1} Q = Λ.

Then we get

|(Z − μ_1)^T Σ_2^{−1} (Z − μ_1) − (Z − μ_2)^T Σ_2^{−1} (Z − μ_2)| = |Σ_{i=1}^q λ_i (p_i^T (Z − μ_1))² − Σ_{i=1}^q λ_i (p_i^T (Z − μ_2))²| = |Σ_{i=1}^q λ_i p_i^T (2Z − μ_1 − μ_2) p_i^T (μ_2 − μ_1)| ≤ λ_1 ‖2Z − μ_1 − μ_2‖ ‖μ_2 − μ_1‖ = ρ(Σ_2^{−1}) ‖μ_2 − μ_1‖ ‖2Z − μ_1 − μ_2‖ ≤ ‖Σ_2^{−1}‖ ‖μ_2 − μ_1‖ ‖2Z − μ_1 − μ_2‖,  (15)

where ρ(Σ_2^{−1}) is the spectral radius of Σ_2^{−1} and ρ(A) ≤ ‖A‖ for any induced norm. Similarly, Σ_2^{−1} − Σ_1^{−1} is also symmetric; denote its eigenvalues by λ_1*, …, λ_q* with eigenvectors p_1*, …, p_q*, such that (Σ_2^{−1} − Σ_1^{−1}) p_i* = λ_i* p_i*. In matrix form, there exists an orthogonal matrix Q* such that

(Q*)^{−1} (Σ_2^{−1} − Σ_1^{−1}) Q* = (Q*)^T (Σ_2^{−1} − Σ_1^{−1}) Q* = Λ*.

Next, we have

|(Z − μ_1)^T Σ_2^{−1} (Z − μ_1) − (Z − μ_1)^T Σ_1^{−1} (Z − μ_1)| = |(Z − μ_1)^T (Σ_2^{−1} − Σ_1^{−1}) (Z − μ_1)| = |(Z − μ_1)^T Q* Λ* (Q*)^T (Z − μ_1)| = |[(Q*)^T (Z − μ_1)]^T Λ* [(Q*)^T (Z − μ_1)]| = |Σ_{i=1}^q λ_i* ((p_i*)^T (Z − μ_1))²| ≤ max_{i=1,…,q} |λ_i*| Σ_{i=1}^q ((p_i*)^T (Z − μ_1))² ≤ ρ(Σ_2^{−1} − Σ_1^{−1}) ‖Z − μ_1‖² ≤ ‖Z − μ_1‖² ‖Σ_2^{−1} − Σ_1^{−1}‖.  (16)

[23] presented a simple and popular M-estimator that minimizes a "bounded" version of the sum of residuals. Its estimating equation is

θ̂_M :  Σ_{i=1}^n ψ( r_i(θ̂) / σ ) x_i = 0.  (17)

Then, using (12), (13), (14), (15) and (16), we get

|w_{η_1}(x) φ_{β_2,i}(x, y) − w_{η_2}(x) φ_{β_2,i}(x, y)| < K_1 |α_2 − α_1| (Z − μ_2)^T Σ_2^{−1} (Z − μ_2) |x_i| + K_1 ‖Σ_2^{−1}‖ ‖μ_2 − μ_1‖ ‖2Z − μ_1 − μ_2‖ |x_i| + K_1 ‖Z − μ_1‖² ‖Σ_2^{−1} − Σ_1^{−1}‖ |x_i|.  (18)

Since |α_2 − α_1|, ‖μ_2 − μ_1‖, ‖β_2 − β_1‖ < ‖γ_2 − γ_1‖, from (10), (11) and (18) we can now provide a bound for ψ:

|ψ_{γ_1,i}(x, y) − ψ_{γ_2,i}(x, y)| < K_0 |x_i| ‖x‖ ‖β_2 − β_1‖ + K_1 |α_2 − α_1| (Z − μ_2)^T Σ_2^{−1} (Z − μ_2) |x_i| + K_1 ‖Σ_2^{−1}‖ ‖μ_2 − μ_1‖ ‖2Z − μ_1 − μ_2‖ |x_i| + K_1 ‖Z − μ_1‖² ‖Σ_2^{−1} − Σ_1^{−1}‖ |x_i| ≤ ( K_0 |x_i| ‖x‖ + K_1 (Z − μ_2)^T Σ_2^{−1} (Z − μ_2) |x_i| + K_1 ‖Σ_2^{−1}‖ ‖2Z − μ_1 − μ_2‖ |x_i| + K_1 ‖Z − μ_1‖² |x_i| ) ‖γ_2 − γ_1‖ = L_i(x) ‖γ_2 − γ_1‖,

for every γ_1, γ_2. We have established a Lipschitz condition for each ψ_i, i = 1, …, p, so for ψ we have ‖ψ_{γ_1}(x, y) − ψ_{γ_2}(x, y)‖ < L(x) ‖γ_2 − γ_1‖ for every γ_1, γ_2,

where

L(x) = K_0 ‖x‖² + K_1 (Z − μ_2)^T Σ_2^{−1} (Z − μ_2) ‖x‖ + K_1 ‖Σ_2^{−1}‖ ‖2Z − μ_1 − μ_2‖ ‖x‖ + K_1 ‖Z − μ_1‖² ‖x‖.

Now investigate the bracketing entropy relative to the L_r(P)-norm

‖ψ_i‖_{P,r} = ( P |ψ_i|^r )^{1/r}.

Use brackets of the type [ψ_γ − εL, ψ_γ + εL] for γ ranging over a suitably chosen subset of Γ; these brackets have L_r(P)-size 2ε‖L‖_{P,r}. If γ ranges over a grid of mesh width ε over Γ, then the brackets [ψ_γ − εL, ψ_γ + εL] cover F. Based on the Lipschitz condition we get

ψ_{γ_1} − εL ≤ ψ_{γ_2} ≤ ψ_{γ_1} + εL,  if ‖γ_2 − γ_1‖ ≤ ε.

Hence, we need as many brackets as balls of radius ε/2 to cover Γ; alternatively, we require fewer than (diam Γ / ε)^{2p+2} cubes of size ε to cover the parameter space Γ. If P|L_i|^r < ∞, then there exists a constant J, depending only on Γ and P, such that the bracketing numbers satisfy

N_{[ ]}(ε, F, L_r(P)) ≤ J (diam Γ / ε)^p,  for every 0 < ε < diam Γ.

Given that all ψ ∈ F are continuous functions, they are measurable. If B3 is fulfilled, then P|L| < ∞, and as a result the class F is P-Glivenko-Cantelli by Theorem 19.4 (Glivenko-Cantelli) in [24] . ∎

Theorem 1. If B1, B2, B3 and B4 hold, then the estimator β̂_n obtained as the solution of the equation Ψ_n(β̂_n) = 0 converges to β_0 in probability.

Proof of Theorem 1. Denote

Ψ(β) = P ψ_{β,0} = P φ_β,  Ψ(β, η) = P ψ_{β,η} = P φ_β w_η,

Ψ_n(β, η) = (1/n) Σ_{i=1}^n ψ_{β,η}(x_i, y_i),  Ψ_n(β, η̂) = (1/n) Σ_{i=1}^n ψ_{β,η̂}(x_i, y_i).

Note that Ψ(β̂_n) = (Ψ(β̂_n) − Ψ_n(β̂_n)) + Ψ_n(β̂_n). If we can establish that sup_{β∈Θ} ‖Ψ_n(β) − Ψ(β)‖ = o_p(1), then Ψ(β̂_n) = o_p(1), since Ψ_n(β̂_n) = 0. Thus, by Lemma 2, we have β̂_n − β_0 = o_p(1). To show sup_{β∈Θ} ‖Ψ_n(β) − Ψ(β)‖ = o_p(1), we decompose it as follows:

sup_{β∈Θ} ‖Ψ_n(β, η̂) − Ψ(β)‖ ≤ sup_{β∈Θ} ‖Ψ_n(β, η̂) − Ψ_n(β, η)‖ + sup_{β∈Θ} ‖Ψ_n(β, η) − Ψ(β, η)‖ + sup_{β∈Θ} ‖Ψ(β, η) − Ψ(β, 0)‖ = J_1 + J_2 + J_3,

where

J_1 = sup_{β∈Θ} ‖Ψ_n(β, η̂) − Ψ_n(β, η)‖,

J_2 = sup_{β∈Θ} ‖Ψ_n(β, η) − Ψ(β, η)‖,

J_3 = sup_{β∈Θ} ‖Ψ(β, η) − Ψ(β, 0)‖.

Lemma 3 tells us that F is a P-Glivenko-Cantelli class, so J_2 → 0. For J_1,

J_1 = sup_{β∈Θ} ‖(1/n) Σ_{i=1}^n φ_β(x_i, y_i)(w_{η̂}(x_i) − w_η(x_i))‖ ≤ (1/n) Σ_{i=1}^n sup_{β∈Θ} ‖φ_β(x_i, y_i)(w_{η̂}(x_i) − w_η(x_i))‖ = (1/n) Σ_{i=1}^n |w_{η̂}(x_i) − w_η(x_i)| sup_{β∈Θ} ‖φ_β(x_i, y_i)‖.

Using (13), (15) and (16), we get

|w_{η̂}(x_i) − w_η(x_i)| < K_1 |α̂ − α| (Z − μ)^T Σ^{−1} (Z − μ) + K_1 ‖Σ^{−1}‖ ‖μ̂ − μ‖ ‖2Z − μ̂ − μ‖ + K_1 ‖Z − μ̂‖² ‖Σ̂^{−1} − Σ^{−1}‖.  (19)

Since η̂ − η = (α̂ − α, μ̂ − μ, Σ̂ − Σ) = o_p(1), we have α̂ − α = o_p(1), μ̂ − μ = o_p(1) and Σ̂ − Σ = o_p(1). Also, Σ̂^{−1} − Σ^{−1} = Σ̂^{−1}(Σ − Σ̂)Σ^{−1}, and as a result Σ̂^{−1} − Σ^{−1} = o_p(1). Therefore, from (19) it follows that |w_{η̂}(x_i) − w_η(x_i)| = o_p(1). Because B2 and B3 are fulfilled, we also have sup_{β∈Θ} P‖φ_β(x, y)‖² < ∞, so we get J_1 →_p 0. For J_3,

J_3 = sup_{β∈Θ} ‖P φ_β(x, y)(w_η(x) − w_{η_0}(x))‖ ≤ sup_{β∈Θ} P ‖φ_β(x, y)‖ |w_η(x) − w_{η_0}(x)| ≤ sup_{β∈Θ} P^{1/2} ‖φ_β(x, y)‖² P^{1/2} |w_η(x) − w_{η_0}(x)|² = P^{1/2} |w_η(x) − w_{η_0}(x)|² sup_{β∈Θ} P^{1/2} ‖φ_β(x, y)‖²,

where the Cauchy-Schwarz inequality was employed in the last inequality. We already know that

w_{η_0}(x) − w_η(x) = 1 − m( α (Z − μ)^T Σ^{−1} (Z − μ) ) → 0 as α → 0.

Since m(0) = 1 and |w_η(x) − w_{η_0}(x)|² is bounded by 1, the dominated convergence theorem gives

lim_{n→∞} P^{1/2} |w_η(x) − w_{η_0}(x)|² = 0.

We also have

sup_{β∈Θ} P ‖φ_β(x, y)‖² = sup_{β∈Θ} P ‖(y − F(x^T β)) x‖² ≤ P ‖x‖²,

and from assumption B3 we obtain sup_{β∈Θ} P ‖φ_β(x, y)‖² < ∞. It then follows that J_3 → 0. ∎

5. Assessing the Robustness of the Estimators

To evaluate the robustness of the methods, two approaches were employed. The first involves simulated models for contrasting the novel technique with the traditional MLE, the Mallows-type estimator (Mallows) and the conditionally unbiased bounded-influence estimator (CUBIF). In the second, we used a real leukemia data set and the Erythrocyte Sedimentation Rate data.

5.1. Simulation Study

A Monte Carlo simulation study was performed in this subsection to assess the efficiency and robustness of the suggested estimator β̂_n. For the initial robust estimators of scatter and location, Σ̂ and μ̂, we used the minimum covariance determinant (MCD). We calculated the following estimators for comparison: the MLE, the conditionally unbiased bounded-influence estimator (CUBIF) of [2] , and the Mallows-type estimator (Mallows) of [3] . In the simulation, the following weight function m(x) was applied:

m(x) = a (1 − (x ∧ 1)⁶) + b,

where a = 0.8, b = 0.2, and x ∧ 1 = min(x, 1), so that m(0) = a + b = 1.
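Reading the weight function above as m(x) = a(1 − (min(x, 1))⁶) + b — an interpretation on our part, chosen because it satisfies the conditions of B4 (m(0) = a + b = 1, m′(0) = 0, non-increasing, bounded below by b) — it can be coded and checked directly:

```python
import numpy as np

def m_weight(x, a=0.8, b=0.2):
    """Simulation weight m(x) = a(1 - (x ∧ 1)^6) + b, with (x ∧ 1) = min(x, 1)
    (our assumption): m(0) = 1, m'(0) = 0, non-increasing, m(x) = b for x >= 1."""
    x = np.asarray(x, dtype=float)
    return a * (1.0 - np.minimum(x, 1.0) ** 6) + b
```

Observations with α_n m_i² ≥ 1 thus keep the floor weight b = 0.2 rather than being discarded entirely, which is what makes the weight continuous rather than a hard threshold.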

Three models are involved in the simulation study: a clean logistic regression model, a contaminated model with a 10% contamination rate, and a contaminated model with a 20% contamination rate. First, for the clean model, the standard normal distribution was used to generate two predictor variables, x_1 ~ N(0, 1) and x_2 ~ N(0, 1). Three sample sizes were used (n = 100, 300, 500) with p = 2. We generate the response variable according to the Bernoulli distribution with parameter π_i = exp(β_0 + β_1 x_1 + β_2 x_2) / (1 + exp(β_0 + β_1 x_1 + β_2 x_2)). The true parameter values β are taken as (0, 1.6, 1.2) for the three models. Second, 10% of the data are contaminated, with the contaminating predictor values generated from a normal distribution with μ = 0 and σ = 1. Third, 20% of the data are contaminated in a similar manner.

The performance of these estimators is evaluated using the bias and mean squared error (MSE) under the various models; the estimator with the smallest bias and MSE is the best. Each simulation scenario comprised 1000 repetitions. For each parameter, the bias and mean squared error are calculated as:

Bias = | (1/1000) Σ_{i=1}^{1000} β̂_i − β |,

and

MSE = (1/1000) Σ_{i=1}^{1000} | β̂_i − β |².
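The two Monte Carlo criteria above can be computed per parameter as follows (a small sketch; the R × p array layout for the replications is our convention):

```python
import numpy as np

def bias_mse(estimates, beta_true):
    """Per-parameter Monte Carlo bias and MSE over R replications:
    Bias_j = |mean_i(beta_hat_ij) - beta_j|,
    MSE_j  = mean_i (beta_hat_ij - beta_j)^2,
    where estimates is an R x p array of fitted coefficient vectors."""
    est = np.asarray(estimates, dtype=float)
    bias = np.abs(est.mean(axis=0) - beta_true)
    mse = np.mean((est - beta_true) ** 2, axis=0)
    return bias, mse
```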

5.2. Numerical Results

The numerical results displayed in this paper are based on the simulation studies and two real data applications; they are intended to evaluate the performance of the proposed method. Table 1 shows the bias and mean squared errors of the four estimation techniques for the clean model. The findings show that the bias and MSE of the MLE, Mallows, and CUBIF estimators are relatively similar, while the WMLT estimator performs slightly worse than the other estimators. As the sample size increases, the bias and mean squared errors decrease. As shown in Table 2, when 10% of the data are contaminated, the new robust approach WMLT has the best overall performance among all compared estimators across the various sample sizes. Table 3 demonstrates that when 20% of the data are contaminated, our proposed technique (WMLT) again outperforms the other estimators in terms of bias and mean squared error. Owing to its sensitivity to anomalies, the conventional maximum likelihood estimator performs inadequately in the contaminated models. In conclusion, the proposed method outperforms all other compared methods on contaminated data. Furthermore, the new estimator performs reasonably well on clean data.

Table 1. Bias and mean squared errors of estimators for clean model.

Table 2. Bias and (MSE) of estimators for second model (10% of data are contaminated).

Table 3. Bias and (MSE) of estimators for third model (20% of data are contaminated).

5.3. Leukemia Data

This study uses data from [25], which include information on 33 patients who died of acute myeloid leukemia. Three variables were measured for each patient: AG, WBC, and survival time. The response variable is the survival time in weeks; it was converted into a binary variable, with Y = 1 indicating patients whose survival time exceeded 52 weeks and Y = 0 indicating those whose did not. WBC is the patient's white blood cell concentration, and AG (present = 1, absent = 0) indicates the presence or absence of a morphologic characteristic of the white blood cells. Observation 17 appears to be atypical. A logistic regression model was fitted using AG and WBC as predictor variables and the binary survival time Y as the response. The estimators analyzed here are the weighted maximum likelihood technique (WMLT), the MLE, MLE17 (the maximum likelihood estimator computed with observation 17 excluded), Mallows (an estimator of the Mallows type), and CUBIF (the conditionally unbiased bounded-influence function estimator).

Table 4 demonstrates that the MLE is extremely sensitive to influential observations; eliminating observation 17 reduces the estimated effect of WBC to nearly zero. For the leukemia data, the new estimator (WMLT) demonstrated the best performance among all the estimators, while the Mallows estimates are reasonably close to those of MLE17.

5.4. The Erythrocyte Sedimentation Rate Data

In the Erythrocyte Sedimentation Rate (ESR) data, the primary objective was to determine whether the levels of two plasma proteins (fibrinogen and γ-globulin) were responsible for an elevated ESR in healthy individuals. The study was conducted by the Institute for Medical Research in Kuala Lumpur, Malaysia, on 32 patients, and the original data were collected by [26]. A response of zero indicates a healthy person, whereas a response of one indicates an unwell person. Here, the continuous variables (FIB and γ.GLO) are related to the binary response (ESR). In the original ESR data, [27] identified two outliers (cases 13 and 29) in the X-space, while cases 14 and 15 are influential observations; removing cases 14 and 15 would leave the data with no overlapping cases.

Table 4. The estimated parameters and standard errors for the leukemia data.

To carry out an uncontaminated data analysis, case 13 was modified so that (y, x1, x2) = (3, 6, 37). Starting from the uncontaminated data, the ESR data were then contaminated by interchanging the occurrences (Y = 1) and non-occurrences (Y = 0) for cases 14 and 15, which may leave only one of the three overlapping cases in the ESR data.
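The label-swap contamination described here is straightforward to express in code. This is a minimal sketch; the function name `flip_cases` is ours, and the indices are 0-based, so cases 14 and 15 would be passed as [13, 14].

```python
import numpy as np

def flip_cases(y, cases):
    """Swap occurrences (Y = 1) and non-occurrences (Y = 0) for the
    given case indices, as done for ESR cases 14 and 15."""
    y = np.asarray(y).copy()
    idx = np.asarray(cases)
    y[idx] = 1 - y[idx]
    return y

# flipping two cases of a toy binary response
y_contam = flip_cases([0, 1, 0, 1], [1, 2])
```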

Under the contaminated ESR data, β0 and se(β0) are more strongly affected by the outliers than the other parameters for all estimators (see Table 5). The results in Table 5 also indicate that the MLE is the estimator most influenced by the outliers. After the contamination of the data, only one overlapping observation, case 13, remains; this is why the coefficients and standard errors of the WMLT, which downweights this observation, are so large. Nevertheless, the WMLT has the smallest χ2 value, and its estimates are relatively close to those of the MLE on the uncontaminated data, indicating that the WMLT is a suitable estimator for the ESR data.

6. Conclusions

In this paper, we develop a new robust technique for logistic regression, called the weighted maximum likelihood technique (WMLT). The asymptotic consistency of the proposed estimator was demonstrated.

To evaluate the robustness of the new technique, we conducted simulation experiments under a variety of scenarios and data sets. The classical maximum likelihood estimates lack robustness in the presence of outliers. Our simulation study for the clean model showed that the MLE, Mallows, and CUBIF estimators perform similarly, while the new weighted technique performs slightly less effectively than these estimators. The simulation study also shows that the WMLT technique outperforms the other estimators when dealing with contaminated data, demonstrating the best performance among all compared estimators across the various scenarios and real data sets. The proposed method (WMLT) can be applied to other generalized linear models (GLMs) and is expected to be superior to existing methods in practical applications. The findings

Table 5. Estimated coefficients, standard errors, and χ2 for ESR.

of this paper provide researchers and practitioners with a new approach to developing robust estimators for logistic regression and potentially other generalized linear models (GLMs).

Acknowledgments

We thank the Editor and the referee for their comments. Research of M. Baron is funded by the National Science Foundation grant DMS 1322353. This support is greatly appreciated.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] Pregibon, D. (1981) Logistic Regression Diagnostics. Annals of Statistics, 9, 705-724.
https://doi.org/10.1214/aos/1176345513
[2] Künsch, H.R., Stefanski, L.A. and Carroll, R.J. (1989) Conditionally Unbiased Bounded-Influence Estimation in General Regression Models, with Applications to Generalized Linear Models. Journal of the American Statistical Association, 84, 460-466.
https://doi.org/10.1080/01621459.1989.10478791
[3] Carroll, R.J. and Pederson, S. (1993) On Robustness in the Logistic Regression Model. Journal of the Royal Statistical Society, Series B, 55, 693-706.
https://doi.org/10.1111/j.2517-6161.1993.tb01934.x
[4] Hobza, T., Martín, N. and Pardo, L. (2017) A Wald-Type Test Statistic Based on Robust Modified Median Estimator in Logistic Regression Models. Journal of Statistical Computation and Simulation, 87, 2309-2333.
https://doi.org/10.1080/00949655.2017.1330414
[5] Bergesio, A. and Yohai, V.J. (2011) Projection Estimators for Generalized Linear Models. Journal of the American Statistical Association, 106, 661-671.
https://doi.org/10.1198/jasa.2011.tm09774
[6] Morgenthaler, S. (1992) Least-Absolute-Deviations Fits for Generalized Linear Models. Biometrika, 79, 747-754.
https://doi.org/10.1093/biomet/79.4.747
[7] Cantoni, E. and Ronchetti, E. (2001) Robust Inference for Generalized Linear Models. Journal of the American Statistical Association, 96, 1022-1030.
https://doi.org/10.1198/016214501753209004
[8] Wedderburn, R.W.M. (1974) Quasi-Likelihood Functions, Generalized Linear Models, and the Gauss-Newton Method. Biometrika, 61, 439-447.
https://doi.org/10.1093/biomet/61.3.439
[9] Croux, C., Flandre, C. and Haesbroeck, G. (2002) The Breakdown Behavior of the Maximum Likelihood Estimator in the Logistic Regression Model. Statistics and Probability Letters, 60, 377-386.
https://doi.org/10.1016/S0167-7152(02)00292-4
[10] Rousseeuw, P.J. and Christmann, A. (2005) Robustness against Separation and Outliers in Logistic Regression. Quality Control and Applied Statistics, 50, 451-452.
[11] Idriss, I.A. and Cheng, W. (2023) Robust Estimators for Poisson Regression. Open Journal of Statistics, 13, 112-118.
https://doi.org/10.4236/ojs.2023.131007
[12] Valdora, M. and Yohai, V.J. (2014) Robust Estimators for Generalized Linear Models. Journal of Statistical Planning and Inference, 146, 31-48.
https://doi.org/10.1016/j.jspi.2013.09.016
[13] Bianco, A.M. and Yohai, V.J. (1996) Robust Estimation in the Logistic Regression Model. Springer, New York, 17-34.
https://doi.org/10.1007/978-1-4612-2380-1_2
[14] Croux, C. and Haesbroeck, G. (2003) Implementing the Bianco and Yohai Estimator for Logistic Regression. Computational Statistics and Data Analysis, 44, 273-295.
[15] Müller, C.H. and Neykov, N. (2003) Breakdown Points of Trimmed Likelihood Estimators and Related Estimators in Generalized Linear Models. Journal of Statistical Planning and Inference, 116, 503-519.
[16] Copas, J.B. (1988) Binary Regression Models for Contaminated Data. Journal of the Royal Statistical Society, 50, 225-265.
https://doi.org/10.1111/j.2517-6161.1988.tb01723.x
[17] Ahmed, I.A.I. and Cheng, W.H. (2020) The Performance of Robust Methods in Logistic Regression Model. Open Journal of Statistics, 10, 127-138.
https://doi.org/10.4236/ojs.2020.101010
[18] Ruckstuhl, A.F. and Welsh, A.H. (2001) Robust Fitting of the Binomial Model. The Annals of Statistics, 29, 1117-1136.
https://doi.org/10.1214/aos/1013699996
[19] Krasker, W.S. and Welsch, R.E. (1982) Efficient Bounded-Influence Regression Estimation. Journal of the American Statistical Association, 77, 595-604.
https://doi.org/10.1080/01621459.1982.10477855
[20] Stefanski, L.A., Carroll, R.J. and Ruppert, D. (1986) Optimally Bounded Score Functions for Generalized Linear Models with Applications to Logistic Regression. Biometrika, 73, 413-424.
https://doi.org/10.1093/biomet/73.2.413
[21] Gervini, D. (2005) Robust Adaptive Estimators for Binary Regression Models. Journal of Statistical Planning and Inference, 131, 297-311.
https://doi.org/10.1016/j.jspi.2004.02.006
[22] Vandev, D.L. and Neykov, N.M. (1998) About Regression Estimators with High Breakdown Point. Statistics, 32, 111-129.
https://doi.org/10.1080/02331889808802657
[23] Huber, P.J. (1964) Robust Estimation of a Location Parameter. The Annals of Mathematical Statistics, 35, 73-101.
https://doi.org/10.1214/aoms/1177703732
[24] van der Vaart, A.W. (2000) Asymptotic Statistics. Cambridge University Press, Cambridge.
[25] Cox, D. and Oakes, D. (1984) Analysis of Survival Data. Chapman and Hall, New York.
[26] Collett, D. and Jemain, A. (1985) Residuals, Outliers and Influential Observations in Regression Analysis. Sains Malaysiana, 14, 493-511.
[27] Syaiba, B. and Habshah, M. (2010) Robust Logistic Diagnostic for the Identification of High Leverage Points in Logistic Regression Model. Journal of Applied Sciences, 10, 3042-3050.
https://doi.org/10.3923/jas.2010.3042.3050

Copyright © 2024 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.