A Study of EM Algorithm as an Imputation Method: A Model-Based Simulation Study with Application to a Synthetic Compositional Data

Abstract

Compositional data, such as relative information, is a crucial aspect of machine learning and other related fields. It is typically recorded as closed data, i.e., data that sums to a constant such as 100%. The statistical linear model is the most widely used technique for identifying hidden relationships between underlying random variables of interest. However, data quality is a significant challenge in machine learning, especially when missing data is present. The linear regression model is a commonly used statistical modeling technique applied in many settings to find relationships between variables of interest. When estimating linear regression parameters, which are useful for tasks such as future prediction and partial effects analysis of independent variables, maximum likelihood estimation (MLE) is the method of choice. However, many datasets contain missing observations, and recovering them can be costly and time-consuming. To address this issue, the expectation-maximization (EM) algorithm has been suggested as a solution for situations involving missing data. The EM algorithm iteratively computes maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models that depend on unobserved variables or data. Using the current parameter estimate, the expectation (E) step constructs the expected log-likelihood function; the maximization (M) step then finds the parameters that maximize this expected log-likelihood. This study examined how well the EM algorithm performs on a synthetic compositional dataset with missing observations, using both the ordinary least squares and the robust least squares regression versions of the algorithm. The efficacy of the EM algorithm was compared with two alternative imputation techniques, k-Nearest Neighbor (k-NN) and mean imputation ($\bar{x}$), in terms of Aitchison distances and covariance.

Share and Cite:

Abolade, Y. and Zhao, Y. (2024) A Study of EM Algorithm as an Imputation Method: A Model-Based Simulation Study with Application to a Synthetic Compositional Data. Open Journal of Modelling and Simulation, 12, 33-42. doi: 10.4236/ojmsi.2024.122002.

1. Introduction

Compositional data consists exclusively of relative information: the components are parts of a larger whole. It is typically recorded as closed data, i.e., data that sums to a constant value such as 100%. An illustrative example from medicine is the analysis of the constituent elements of bodily fluids such as blood and urine. The statistical linear model is frequently employed to uncover latent associations among relevant random variables because it is easy to use and interpret. In machine learning and its related disciplines, ensuring data quality is a significant challenge. The quality of the underlying data largely determines the quality of the information obtained through machine learning algorithms, since these algorithms rely solely on data. One significant data-quality concern is the presence of missing data, particularly in compositional datasets. The linear regression model is a widely employed statistical modeling technique used across many applications to ascertain relationships between variables of interest. Maximum likelihood estimation (MLE) is commonly employed to estimate the parameters of linear regression by determining the values that maximize the likelihood function given the observed data. The resulting model can be used for partial effects analysis of the independent variables as well as for predicting future outcomes.

However, many datasets exhibit missing observations. In practice, participants may decline to answer a survey question, files may be destroyed, or data may be inadequately preserved. Restarting data collection and recovery in such cases incurs financial expense and requires a significant amount of time, so the problem of analyzing incomplete data deserves attention. The expectation-maximization (EM) algorithm has been proposed as a solution for scenarios with missing data because of its robust convergence properties. Parameters of statistical models involving unobserved variables or missing data are commonly estimated by repeated application of the EM algorithm, which seeks maximum likelihood or maximum a posteriori (MAP) estimates. The E-step constructs a function for the expected value of the log-likelihood, evaluated using the current parameter estimate; the M-step then estimates the parameters that maximize this expected log-likelihood. These updated estimates are used in the next E-step to determine the distribution of the latent variables or missing data.

Despite wide investigation of the EM algorithm as an imputation tool, little is known about its effectiveness on compositional data. This study investigates the performance of the EM algorithm on a synthetic compositional dataset with missing observations. Two versions of the algorithm, based on least squares and robust least squares regression, are applied and evaluated. The EM technique is applied in a model-based simulation study by iteratively imputing a compositional dataset containing random missing data and outliers, under a normality assumption. The effectiveness of the EM method is evaluated against two commonly employed imputation techniques, k-Nearest Neighbor (k-NN) and mean imputation ($\bar{x}$), in terms of Aitchison distance and covariance [1]. The experiments show that the robust variant of the EM algorithm outperforms the alternative imputation strategies.

2. Methodology

2.1. Linear Regression Model

We consider a linear regression model with a one-dimensional predictor and response. Suppose we have $n$ observations in our dataset. We define the predictor $X = (x_1, x_2, \dots, x_n)$ and the response $Y = (y_1, y_2, \dots, y_n)$. For the $i$th observation, we assume that $y_i$ and $x_i$ are related by the linear regression model in Equation (1):

$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad \epsilon_i \sim \mathrm{NID}(0, \sigma^2)$ (1)

We assume that $x_i \sim N(\alpha, \delta^2)$, i.i.d. Under these assumptions, the conditional distribution of $Y$ given $X$ is $[Y \mid X] \sim N(\beta_0 + \beta_1 X, \sigma^2)$. Then we can write down the joint probability density of $X$ and $Y$, given by

$f(y_i, x_i) = f(y_i \mid x_i)\, f(x_i) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2\sigma^2}(y_i - \beta_0 - \beta_1 x_i)^2} \times \frac{1}{\delta\sqrt{2\pi}}\, e^{-\frac{1}{2\delta^2}(x_i - \alpha)^2}$ (2)

2.2. Missing Values

The data is not completely observed in many real-world scenarios. We expand our model so that the response values $Y$ are fully observed while only $m$ of the predictor values are observed (i.e., $n - m$ predictor values are missing). We can arrange the dataset so that the first $m$ observations are fully observed.

$X_{\mathrm{comp}} = (x_1, \dots, x_m, x_{m+1}, \dots, x_n) = (X_{\mathrm{obs}}, X_{\mathrm{miss}})$ (3)

Equation (4) thus allows the complete-data log-likelihood for the model to be decomposed into the observed and missing parts.

$L(\theta; X, Y) = \sum_{i=1}^{n} L(\theta; x_i, y_i) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{n}{2}\log(2\pi\delta^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{m}(y_i - \beta_0 - \beta_1 x_i)^2 - \frac{1}{2\sigma^2}\sum_{i=m+1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2 - \frac{1}{2\delta^2}\sum_{i=1}^{m}(x_i - \alpha)^2 - \frac{1}{2\delta^2}\sum_{i=m+1}^{n}(x_i - \alpha)^2$ (4)

where $\theta = (\beta_0, \beta_1, \sigma^2, \alpha, \delta^2) \in \mathbb{R}^5$.
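For concreteness, the complete-data log-likelihood in Equation (4) translates directly into a short function. The following is a minimal sketch, assuming NumPy arrays for the data; the parameter ordering follows $\theta = (\beta_0, \beta_1, \sigma^2, \alpha, \delta^2)$, and the function name is illustrative.

```python
import numpy as np

def complete_loglik(theta, x, y):
    """Complete-data log-likelihood of Equation (4), assuming x is fully observed."""
    b0, b1, s2, a, d2 = theta
    n = len(y)
    ll = -0.5 * n * np.log(2 * np.pi * s2) - 0.5 * n * np.log(2 * np.pi * d2)
    ll -= np.sum((y - b0 - b1 * x) ** 2) / (2 * s2)   # regression part
    ll -= np.sum((x - a) ** 2) / (2 * d2)             # predictor part
    return ll
```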

2.3. The EM Algorithm Formulation

The issue with the above calculation is that $X_{\mathrm{miss}}$ is not observed and needs to be estimated. A reasonable approach is to replace each of $x_{m+1}, \dots, x_n$ by its conditional expectation given the observed data, $X_{\mathrm{obs}}$ and $Y$.

1) E-step:

$E\left[\sum_{i=m+1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2\right] = \sum_{i=m+1}^{n}\left((y_i - \beta_0)^2 + \beta_1^2\, E(X_i^2 \mid y_i, \theta^*) - 2\beta_1 (y_i - \beta_0)\, E(X_i \mid y_i, \theta^*)\right)$

$E\left[\sum_{i=m+1}^{n}(x_i - \alpha)^2\right] = \sum_{i=m+1}^{n}\left(E(X_i^2 \mid y_i, \theta^*) - 2\alpha\, E(X_i \mid y_i, \theta^*) + \alpha^2\right)$ (5)

where $E[X_i \mid y_i, \theta^*]$ and $E[X_i^2 \mid y_i, \theta^*]$ are the first and second conditional moments, respectively. Since $X$ and $Y$ have a bivariate normal distribution, we can derive the conditional distribution of $X$ given $Y$ and $\theta^*$:

$X_i \mid y_i, \theta^* \sim N\!\left(\alpha + \frac{\beta_1\delta^2}{\sigma^2 + \beta_1^2\delta^2}(y_i - \beta_0 - \beta_1\alpha),\ \frac{\sigma^2\delta^2}{\sigma^2 + \beta_1^2\delta^2}\right)$ (6)

Then we can easily find the conditional first and second moments of $X_{\mathrm{miss}}$ given $Y$ and $\theta^*$, denoted $M_{i1}$ and $M_{i2}$ respectively.

$M_{i1} = \alpha + \frac{\beta_1\delta^2}{\sigma^2 + \beta_1^2\delta^2}(y_i - \beta_0 - \beta_1\alpha)$ (7)

$M_{i2} = \left(\alpha + \frac{\beta_1\delta^2}{\sigma^2 + \beta_1^2\delta^2}(y_i - \beta_0 - \beta_1\alpha)\right)^2 + \frac{\sigma^2\delta^2}{\sigma^2 + \beta_1^2\delta^2}$ (8)

With these terms computed above, the E-step formulation is shown in Equation (9) below.

$Q(\theta, \theta^*) = E\left[L(\theta; X, Y)\right] = -n\log(2\pi) - n\log\delta - n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{m}(y_i - \beta_0 - \beta_1 x_i)^2 - \frac{1}{2\sigma^2}\sum_{i=m+1}^{n}\left((y_i - \beta_0)^2 + \beta_1^2 M_{i2} - 2\beta_1(y_i - \beta_0)M_{i1}\right) - \frac{1}{2\delta^2}\sum_{i=1}^{m}(x_i - \alpha)^2 - \frac{1}{2\delta^2}\sum_{i=m+1}^{n}\left(M_{i2} - 2\alpha M_{i1} + \alpha^2\right)$ (9)

where $M_{i1}$ and $M_{i2}$ are given in Equations (7) and (8).
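Equations (7) and (8) translate directly into code. The sketch below computes the conditional moments for the missing predictors given the current parameter estimate $\theta^*$; the function name and argument layout are illustrative assumptions.

```python
import numpy as np

def conditional_moments(theta_star, y_mis):
    """First and second conditional moments of the missing x_i given y_i and theta*,
    following Equations (7) and (8)."""
    b0, b1, s2, a, d2 = theta_star
    denom = s2 + b1 ** 2 * d2
    m1 = a + (b1 * d2 / denom) * (y_mis - b0 - b1 * a)   # E[X_i | y_i, theta*]
    m2 = m1 ** 2 + s2 * d2 / denom                       # E[X_i^2 | y_i, theta*]
    return m1, m2
```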

2) M-step:

The M-step maximizes $Q(\theta, \theta^*)$ computed in the E-step. Solving $\frac{\partial Q(\theta, \theta^*)}{\partial \theta} = 0$, we get the following results. The updated estimate of $\beta$ is just the OLS solution to the model, i.e., $\beta = (X^{\mathrm{T}}X)^{-1}X^{\mathrm{T}}Y$:

$\begin{bmatrix}\beta_0 \\ \beta_1\end{bmatrix} = \begin{bmatrix} n & \sum_{i=1}^{n}\tilde{x}_i^* \\ \sum_{i=1}^{n}\tilde{x}_i^* & \sum_{i=1}^{n}\widetilde{x_i^2}^* \end{bmatrix}^{-1} \begin{bmatrix}\sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n}\tilde{x}_i^*\, y_i\end{bmatrix}$

where $\tilde{X}^* = (X_{\mathrm{obs}}, M_1^*) \in \mathbb{R}^n$ and $\widetilde{X^2}^* = (X_{\mathrm{obs}}^2, M_2^*) \in \mathbb{R}^n$ are the completed predictor and its square, estimated under the current parameter $\theta^*$. Similarly, the other updated parameters are:

$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left((y_i - \beta_0)^2 - 2\beta_1(y_i - \beta_0)\tilde{x}_i^* + \beta_1^2\,\widetilde{x_i^2}^*\right)$

$\alpha = \frac{1}{n}\sum_{i=1}^{n}\tilde{x}_i^*$

$\delta^2 = \frac{1}{n}\sum_{i=1}^{n}\left(\widetilde{x_i^2}^* - 2\alpha\tilde{x}_i^* + \alpha^2\right)$
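Putting the E-step moments and the M-step updates together gives one full EM iteration for this model. The following is a minimal sketch, assuming the `conditional_moments` helper above; the function name, the variable names `x_obs`, `y_obs`, `y_mis`, the starting values, and the simple parameter-change stopping rule are all illustrative choices rather than the paper's exact implementation.

```python
import numpy as np

def em_regression(x_obs, y_obs, y_mis, n_iter=200, tol=1e-8):
    """EM for simple linear regression with missing predictor values.
    x_obs, y_obs: fully observed pairs; y_mis: responses whose predictors are missing."""
    n = len(y_obs) + len(y_mis)
    # Crude starting values from the complete cases.
    b1, b0 = np.polyfit(x_obs, y_obs, 1)
    a, d2 = x_obs.mean(), x_obs.var()
    s2 = np.mean((y_obs - b0 - b1 * x_obs) ** 2)
    theta = np.array([b0, b1, s2, a, d2])

    for _ in range(n_iter):
        # E-step: conditional moments of the missing predictors (Eqs. (7)-(8)).
        m1, m2 = conditional_moments(theta, y_mis)
        x_t = np.concatenate([x_obs, m1])          # completed predictor
        x2_t = np.concatenate([x_obs ** 2, m2])    # completed squared predictor
        y_all = np.concatenate([y_obs, y_mis])

        # M-step: normal equations with the completed sufficient statistics.
        lhs = np.array([[n, x_t.sum()], [x_t.sum(), x2_t.sum()]])
        rhs = np.array([y_all.sum(), (x_t * y_all).sum()])
        b0_new, b1_new = np.linalg.solve(lhs, rhs)
        s2_new = np.mean((y_all - b0_new) ** 2
                         - 2 * b1_new * (y_all - b0_new) * x_t
                         + b1_new ** 2 * x2_t)
        a_new = x_t.mean()
        d2_new = np.mean(x2_t - 2 * a_new * x_t + a_new ** 2)

        theta_new = np.array([b0_new, b1_new, s2_new, a_new, d2_new])
        if np.max(np.abs(theta_new - theta)) < tol:
            return theta_new
        theta = theta_new
    return theta

# Illustrative usage on data simulated from the model in Equation (1).
rng = np.random.default_rng(0)
x = rng.normal(1.0, 2.0, size=500)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=500)
miss = rng.random(500) < 0.2                      # 20% of predictors missing at random
theta_hat = em_regression(x[~miss], y[~miss], y[miss])
```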

2.4. Convergence of EM Algorithm

We will discuss the convergence of the EM algorithm in a more general setting. Suppose we have a dataset of $m$ independent examples and want to fit a parametric model $p(x, z)$ to the dataset; then the log-likelihood function is:

$\ell(\theta) = \sum_{i=1}^{m}\log p(x^{(i)}; \theta) = \sum_{i=1}^{m}\log\sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)$

where the $z^{(i)}$ are the latent random variables. Explicitly finding the maximum likelihood estimate of the parameter $\theta$ is hard, but if $z^{(i)}$ were observed, the estimation would be easy. Let $Q_i$ be a distribution over $z$, i.e., $Q_i(z) \ge 0$ and $\sum_z Q_i(z) = 1$. Since $f(x) = \log(x)$ is a concave function, by Jensen's inequality we get a lower bound on $\ell(\theta)$:

$\sum_i \log p(x^{(i)}; \theta) = \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)$ (10)

$= \sum_i \log \sum_{z^{(i)}} Q_i(z^{(i)})\, \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$ (11)

$\ge \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$ (12)

Note that this inequality holds for any distribution $Q_i$; it gives a lower bound on $\ell(\theta)$. Later we will show that $\ell(\theta)$ increases monotonically across successive iterations of EM if the lower bound is made tight at the current $\theta$. Jensen's inequality holds with equality when the random variable is constant, so it suffices to require:

$\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c, \quad \text{i.e.,} \quad Q_i(z^{(i)}) \propto p(x^{(i)}, z^{(i)}; \theta)$

where $c$ is a constant that does not depend on $z^{(i)}$. Under this requirement, since $\sum_z Q_i(z) = 1$, we get:

$Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_z p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta)$

So $Q_i$ is just the posterior distribution of $z^{(i)}$ given $x^{(i)}$ and the current setting of $\theta$. We can now state the EM iteration:

While $\|\theta^{(t)} - \theta^{(t-1)}\| > \epsilon$ do

E-step:

Compute $Q_i^{(t)}(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta^{(t)})$

M-step:

Compute

$\theta^{(t+1)} := \arg\max_{\theta} \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i^{(t)}(z^{(i)})}$

End

Consider that:

$\ell(\theta^{(t+1)}) \ge \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t+1)})}{Q_i^{(t)}(z^{(i)})}$ (13)

$\ge \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{Q_i^{(t)}(z^{(i)})}$ (14)

$= \ell(\theta^{(t)})$ (15)

Inequality (13) holds because the lower bound (12) is valid for any distribution $Q_i$ and any $\theta$, in particular for $Q_i^{(t)}$ and $\theta^{(t+1)}$. Inequality (14) follows from the definition of $\theta^{(t+1)}$ as the maximizer in the M-step. The final equality (15) holds because the lower bound is tight at $\theta = \theta^{(t)}$ by our choice of $Q_i^{(t)}$. So the sequence $\{\ell(\theta^{(t)})\}_t$ is increasing and bounded above (by 0 for discrete models, since each probability is at most 1). Hence, in the EM algorithm the log-likelihood converges monotonically.
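This monotone-increase property can be checked numerically for the regression model of Section 2. Below is a small sketch, assuming the `conditional_moments` and `em_regression` helpers above: it evaluates the observed-data log-likelihood, in which the missing predictors are integrated out analytically, and (as indicated in the comments) checks that it never decreases across recorded iterates.

```python
import numpy as np
from scipy.stats import norm

def obs_loglik(theta, x_obs, y_obs, y_mis):
    """Observed-data log-likelihood: complete pairs contribute f(y_i, x_i);
    when x_i is missing, X is integrated out, so Y_i ~ N(b0 + b1*a, s2 + b1^2 * d2)."""
    b0, b1, s2, a, d2 = theta
    ll = np.sum(norm.logpdf(y_obs, b0 + b1 * x_obs, np.sqrt(s2))
                + norm.logpdf(x_obs, a, np.sqrt(d2)))
    ll += np.sum(norm.logpdf(y_mis, b0 + b1 * a, np.sqrt(s2 + b1 ** 2 * d2)))
    return ll

# If the EM loop is modified to record its iterates in a list `trajectory`,
# the log-likelihood values should be non-decreasing up to numerical noise:
#   lls = [obs_loglik(th, x[~miss], y[~miss], y[miss]) for th in trajectory]
#   assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))
```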

3. Application

This section covers the model-based simulation study applying the EM algorithm with least squares and robust least squares regression to compositional data. We also analyze the robustness and efficiency of the EM approach and compare its output with two other commonly used imputation techniques for handling missing data, k-Nearest Neighbor (k-NN) and mean imputation ($\bar{x}$).

3.1. Data Description

Compositional data is a special kind of non-negative data in which the pertinent information lies not in the actual data values but in the ratios between the variables. An observation $x = (x_1, \dots, x_D)$ is a $D$-part composition if, and only if, $x_i > 0$ for $i = 1, \dots, D$ and, according to Aitchison [2], the ratios between the components carry all the important information.

The $D$-part compositions form the simplex

$S^D = \left\{[x_1, \dots, x_D] : x_i > 0\ (i = 1, \dots, D),\ x_1 + \dots + x_D = 1\right\}, \qquad (x_1, \dots, x_D) = \frac{(w_1, \dots, w_D)}{w_1 + \dots + w_D},$

where $w = w_1 + \dots + w_D$ denotes the total weight and $(w_1, \dots, w_D)$ are the component weights. According to Aitchison [2], compositional data is not directly represented in Euclidean space. The Aitchison distance $d_A$ is a suitable way to measure distances between $D$-part compositions on the simplex [3]. According to Egozcue et al. [4], the isometric log-ratio (ilr) transformation is used to convert the $D$-dimensional simplex into the real space $\mathbb{R}^{D-1}$. With this transformation, the Aitchison distance can be expressed as $d_A(x, y) = d_E(\mathrm{ilr}(x), \mathrm{ilr}(y))$, where $d_E$ denotes the Euclidean distance. The data in this simulation is generated from a normal distribution on the simplex, denoted $s^D(\mu, \Sigma)$ (Mateu-Figueras, Pawlowsky-Glahn, and Egozcue) [5]. We generated 10,000 realizations of a random variable [6] $X \sim s^4(\mu, \Sigma)$ with $\mu = (0, 2, 3)^{\mathrm{T}}$ and

$\Sigma = \begin{pmatrix} 1 & 0.5 & 1.4 \\ 0.5 & 1 & 0.6 \\ 1.4 & 0.6 & 2 \end{pmatrix}$ [7].
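To make the simulation design concrete, the sketch below generates logistic-normal data on the simplex by drawing ilr coordinates from $N(\mu, \Sigma)$ and back-transforming, and computes the Aitchison distance as the Euclidean distance between ilr coordinates. The particular orthonormal (Helmert-type) ilr basis used here is an assumption; the paper does not state which basis was used.

```python
import numpy as np

def ilr_basis(D):
    """Helmert-type orthonormal basis of the plane orthogonal to (1, ..., 1) in R^D."""
    V = np.zeros((D, D - 1))
    for k in range(1, D):
        V[:k, k - 1] = 1.0 / np.sqrt(k * (k + 1))
        V[k, k - 1] = -k / np.sqrt(k * (k + 1))
    return V

def ilr(x, V):
    clr = np.log(x) - np.log(x).mean(axis=-1, keepdims=True)   # centred log-ratio
    return clr @ V

def ilr_inv(z, V):
    w = np.exp(z @ V.T)
    return w / w.sum(axis=-1, keepdims=True)                    # closure to the simplex

def aitchison_distance(x, y, V):
    return np.linalg.norm(ilr(x, V) - ilr(y, V), axis=-1)

# Normal distribution on the 4-part simplex with the parameters used in the paper.
mu = np.array([0.0, 2.0, 3.0])
Sigma = np.array([[1.0, 0.5, 1.4],
                  [0.5, 1.0, 0.6],
                  [1.4, 0.6, 2.0]])
rng = np.random.default_rng(1)
V = ilr_basis(4)
Z = rng.multivariate_normal(mu, Sigma, size=10_000)   # ilr coordinates
X = ilr_inv(Z, V)                                     # 10,000 four-part compositions
```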

3.2. Experimental Design

· To assess the effectiveness of the EM algorithm, we examine the results using least squares (LS) and its robust version (RLS) across a range of missing-data rates (contamination levels of 5% and 10%) and outlier rates (1%, 3%, 5%, and 10%), expressed in terms of the Aitchison distance ($d_A$) [8].

· We also examine the EM algorithm's output in terms of the covariances for the same rates of outliers (1%, 3%, 5%, and 10%) and missing data (contamination levels of 5% and 10%); a sketch of this setup is given below.
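As a rough illustration of this design, the snippet below injects missing cells and multiplicative outliers into the simulated compositions, imputes them with mean and k-NN imputation, and scores each method by the average Aitchison distance to the uncontaminated data. It is a sketch under stated assumptions: imputation is applied here to the raw parts with rows re-closed afterwards, the multiplicative outlier mechanism is one plausible choice rather than the paper's exact scheme, and the LS/RLS EM-based imputation is not reproduced. It reuses the ilr helpers and the simulated matrix `X` from the data-generation sketch above.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

def contaminate(X, na_rate, eps, rng):
    """Return a copy of X with a fraction eps of rows multiplicatively distorted
    (then re-closed) and a fraction na_rate of cells set to NaN."""
    Xc = X.copy()
    out_rows = rng.random(len(Xc)) < eps
    Xc[out_rows] *= rng.uniform(0.1, 10.0, size=Xc[out_rows].shape)   # outliers
    Xc /= Xc.sum(axis=1, keepdims=True)
    Xc[rng.random(Xc.shape) < na_rate] = np.nan                       # missing cells
    return Xc

def score(imputer, X_true, X_na, V):
    X_hat = imputer.fit_transform(X_na)
    X_hat /= X_hat.sum(axis=1, keepdims=True)   # re-close imputed rows to the simplex
    return aitchison_distance(X_true, X_hat, V).mean()

rng = np.random.default_rng(2)
for eps in (0.01, 0.03, 0.05, 0.10):
    for na_rate in (0.05, 0.10):
        X_na = contaminate(X, na_rate, eps, rng)
        d_mean = score(SimpleImputer(strategy="mean"), X, X_na, V)
        d_knn = score(KNNImputer(n_neighbors=5), X, X_na, V)
        print(f"eps={eps:.2f} NArate={na_rate:.2f}  xMean={d_mean:.3f}  kNN={d_knn:.3f}")
```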

3.3. Results and Analysis

The detailed results for all experiments are discussed in this section (Table 1, Table 2, Figure 1, Figure 2).

Table 1. Performance metrics for different imputation methods in terms of distance.

1Epsilon denotes the rate of outliers (1%, 3%, 5%, and 10%); 2NArate denotes the missing rate (contamination level of 5% and 10%); 3xMean denotes the arithmetic mean imputation method; 4kNN denotes the k-Nearest Neighbor imputation method; 5LS denotes the least squares regression version of the EM algorithm; 6RLS denotes the robust least squares regression version of the EM algorithm; *Bold numbers indicate the best-performing imputation method for a given epsilon and missing rate.

Figure 1. Performance for different imputation methods in terms of distance.

Table 2. Performance metrics for different imputation methods in terms of Covariance.

1Epsilon denotes the rate of outliers (1%, 3%, 5%, and 10%); 2NArate denotes the missing rate (contamination level of 5% and 10%); 3xMean denotes the arithmetic mean imputation method; 4kNN denotes the k-Nearest Neighbor imputation method; 5LS denotes the least squares regression version of the EM algorithm; 6RLS denotes the robust least squares regression version of the EM algorithm; *Bold numbers indicate the best-performing imputation method for a given epsilon and missing rate.

Figure 2. Performance for different imputation methods in terms of Covariance.

4. Conclusions

In this study, it is postulated that missing values occur at random. Consequently, both the least squares and robust least squares EM algorithm approaches are employed to address the issue of missing compositional data. Simulated synthetic compositional data was utilized as the practical dataset. It has been shown that as the rates of missing and outlying data decrease, the Aitchison distance approaches zero, and the robust least squares variant of the EM algorithm yields the smallest values in this regard, making it the best-performing algorithm. Distance values close to zero indicate the most efficacious imputation strategies.

We also compare the Expectation-Maximization (EM) method to arithmetic mean imputation and k-Nearest Neighbor imputation and find that the robust EM version performs best in terms of covariances. We provide evidence that all strategies are effective under conditions of low missing rates, specifically those below 5%. The EM algorithm demonstrates improved performance as the rate of missing data increases. However, when the incidence of missing data exceeds 10%, most imputation methods become inadequate for generating a dependable approximation, exacerbating the difficulty of recovering the data.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

References

[1] Aitchison, J. (2002) Simplicial Inference. In: Viana, M.A.G. and Richards, D.S.P., Eds., Contemporary Mathematics Series, Vol. 287: Algebraic Methods in Statistics and Probability, American Mathematical Society, Providence, 1-22.
[2] Aitchison, J. (1986) The Statistical Analysis of Compositional Data. Chapman & Hall, London.
[3] Aitchison, J. (1989) Measures of Location of Compositional Data Sets. Mathematical Geology, 21, 787-790.
https://doi.org/10.1007/BF00893322
[4] Martín-Fernández, J.A., Egozcue, J.J., Olea, R.A. et al. (2021) Units Recovery Methods in Compositional Data Analysis. Natural Resources Research, 30, 3045-3058.
https://doi.org/10.1007/s11053-020-09659-7
[5] Pawlowsky, V., Olea, R.A. and Davis, J.C. (1995) Estimation of Regionalized Compositions: A Comparison of Three Methods. Mathematical Geosciences, 27, 105-127.
https://doi.org/10.1007/BF02083570
[6] Weltje, G.J. (1997) End-Member Modeling of Compositional Data: Numerical-Statistical Algorithms for Solving the Explicit Mixing Problem. Mathematical Geosciences, 29, 503-549.
https://doi.org/10.1007/BF02775085
[7] Rehder, U. and Zier, S. (2001) Comment on “Logratio Analysis and Compositional distance by Aitchison et al. (2000)”. Journal of Mathematical Geology, 32, 741-763.
[8] Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977) Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, 39, 1-22.
https://doi.org/10.1111/j.2517-6161.1977.tb01600.x
