Superiority of Bayesian Imputation to MICE in Logit Panel Data Models
1. Introduction
Complications in statistical analyses due to missing values in data sets have been of major concern to researchers in virtually all fields of study. If the response variable is binary, these complications are compounded by the nonlinear nature of model specification adopted to relate the continuous covariates to the categorical response. Numerous studies show that frequentist approaches to missing data, like complete case analysis, often lead to biased estimates and substantial loss of power [1] [2] [3] [4] . The logit panel data model may therefore not be an exception to this problem.
Model-based techniques of imputation have proved to be more reliable than single-value imputation. One customary model-based approach is multiple imputation with chained equations (MICE) [4] [5], which has been widely accepted due to its good performance and ease of use. Available statistical tools designed for MICE operate in three key stages: 1) a small number (usually five or ten) of data sets are created by imputing each missing value multiple times; 2) each of the completed data sets is analyzed using known complete-data methods; 3) the overall result is obtained by pooling the derived estimates so as to take the additional uncertainty created by the missing values into account. The separation of the imputation and analysis stages of MICE requires imputation models that contain not only the other covariates in the predictor function but also the outcome [6]. When the outcome is univariate and the data are complete, this is easily done because the outcome is just one of the variables in the data set. However, when the outcome is multivariate, as in a longitudinal binary outcome study, the way the outcome enters the imputation models becomes subjective and may rely on simple or complex summaries of the outcome trajectories, such as a single observed value, the average value or the area under the trajectory. Evidently, it is not easy to settle on the most adequate representation for a specific analysis model, and except in very simple situations, some of the summaries adopted discard relevant information. The comparative assessment done herein shows that the inclusion of inadequate summary measures of subjects' trajectories can lead to increased bias.
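As a minimal illustration of these three stages (and not the simulation code used later in this paper), the mice package in R exposes them directly; the data frame dat and the logistic analysis model below are assumed for illustration only.

# Hedged sketch of the three MICE stages: 1) impute, 2) analyse each completed set, 3) pool with Rubin's rules
library(mice)
# 'dat' is an assumed data frame with a binary outcome y and incomplete covariates x1, x2
imp    <- mice(dat, m = 5, printFlag = FALSE)               # stage 1: five imputed data sets
fits   <- with(imp, glm(y ~ x1 + x2, family = binomial))    # stage 2: analyse each completed set
pooled <- pool(fits)                                        # stage 3: pool the m sets of estimates
summary(pooled)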
A full Bayesian imputation technique combines the analysis model with the imputation models without having to specify an appropriate summary measure for the outcome variable. The milestone difference from MICE is that, by combining imputation and analysis in one estimation procedure, the Bayesian approach obtains inferences on the posterior distribution of the parameters and missing covariates directly and jointly. Thereby, the whole trajectory of the longitudinal outcome is implicitly taken into account in the imputation of missing covariate values, and no summary representation is necessary. This study embraces the work of Ibrahim et al. [7], who propose a decomposition of the likelihood into a series of univariate conditional distributions, which produces the sequential full Bayesian (SFB) approach that is flexible and easy to implement as an alternative to MICE. Besides, the uncertainty about the missing values is automatically taken into account, and no pooling is necessary since missing values are continuously imputed in each iteration of the estimation procedure. Stubbendick and Ibrahim [8] used a likelihood-based approach that factorized the joint likelihood into a sequence of conditional distributions; this approach is akin to SFB except for the model used. In addition, other authors have shown how to apply weighted estimating equations for inference in settings with incomplete data [9] [10]. Generally, studies agree that model-based imputation techniques are superior to single-value imputations. However, little literature is available on conditional logit models with incomplete covariates.
In this study, we describe a few strategies to include a binary outcome in the MICE algorithm and compare them with the Bayesian approach. Each of the methods was evaluated using simulation from which biases and root mean square errors (RMSE) are compared to discern the better of the two methods.
We structure the rest of the paper as follows: Section 2 briefly describes the logit panel data model. In Section 3, we introduce the problem of missing data and describe and compare the two methods of interest, MICE, and the SFB approach. Both methods are applied to simulated data in Section 4. An evaluation of the bias and RMSE for both approaches is described in Section 5 where the findings are summarized and discussed.
2. Model Specification
2.1. The Logit Panel Data Model
A general panel data model is a special case of the generalized longitudinal model in which a population unit i is observed across different time periods t and takes the form

$y_{it} = \mathbf{x}_{it}'\boldsymbol{\beta} + \mu_i + \lambda_t + \varepsilon_{it}$ (1)

where $y_{it}$ is the tth observation of individual i, measured at time t; $\boldsymbol{\beta}$ denotes the vector of k regression coefficients of the design matrix of the fixed effects $\mathbf{X}$, where $\mathbf{x}_{it}$ is a column vector that contains the tth row of that matrix; $\mu_i$ is an individual-specific time-invariant parameter while $\lambda_t$ is a time-specific individual-invariant parameter; and $\varepsilon_{it}$ is an error term that is normally distributed with mean zero and variance $\sigma_{\varepsilon}^2$. In many studies, individuals or subjects are assumed to change over time, such that no two of them have the same characteristics at a time t. As such, we only have the individual-specific effects; letting $\lambda_t = 0$, Equation (1) takes the form

$y_{it} = \mathbf{x}_{it}'\boldsymbol{\beta} + \mu_i + \varepsilon_{it}$ (2)

The association between $\mu_i$ and $\mathbf{x}_{it}$ establishes whether relation (2) is treated as a fixed or random effects model. That is, if $\mu_i$ is correlated with $\mathbf{x}_{it}$, then the model has only $\varepsilon_{it}$ as the stochastic part and $\mu_i$ is treated as fixed (non-random); consequently, we have a fixed effects (FE) panel model. On the other hand, (2) is a random effects (RE) model if $\mu_i$ becomes part of the stochastic part $\mu_i + \varepsilon_{it}$. Cumulatively, Equation (2) has a total of $N + k$ parameters to be estimated, namely $\mu_1, \ldots, \mu_N$ and $\beta_1, \ldots, \beta_k$. For FE, one does not estimate the effects of the variables that are individual-specific and time-invariant but rather controls for them or “partials them out” to reduce the bias from other omitted variables.
For a continuous dependent variable $y_{it}$, the parameters in the panel data model can be estimated unbiasedly and efficiently. Unfortunately, however, the dependent variable may be categorical, which calls for specific nonlinear functions that preserve the structure of the dependent variable. The logistic function is one such nonlinear relation, and it yields the logit model. Conditional maximum likelihood estimation is the most preferred method for logistic regressions. In this method, the conditional maximum likelihood “conditions” the fixed effects $\mu_i$ out of the likelihood function [16]. This is done by conditioning the likelihood function on the total number of events observed for each subject, as first explained by Chamberlain [11].
Most economic studies have the dependent variable as categorical, with two (or more) options indicating a success or a failure of an event. Such a dependent variable is normally represented by a binary choice variable $y_{it}$ for individual i at time t. Then $y_{it}$ follows a binomial distribution with probability of success for individual i at time t given by $p_{it}$, so that

$E(y_{it}) = \Pr(y_{it} = 1) = p_{it}$ (3)

This expected value of the binary outcome is modeled as a function of some explanatory variables or covariates as

$E(y_{it} \mid \mathbf{x}_{it}) = \Pr(y_{it} = 1 \mid \mathbf{x}_{it}) = F(\mathbf{x}_{it}'\boldsymbol{\beta} + \mu_i)$ (4)

where $F(\cdot)$ is a link function that relates the binary outcome to the various types of covariates in $\mathbf{x}_{it}$. Here, we assume strict exogeneity holds, i.e. the residual $\varepsilon_{it}$ is uncorrelated with all x-variables over the entire time period spanned by the panel, so that $\varepsilon_{it}$ has a zero mean.

If F is the unit (identity) function for all values of $\mathbf{x}_{it}'\boldsymbol{\beta} + \mu_i$, then under random sampling the unconditional probability that $y_{it} = 1$ is equal to the unconditional expected value of $y_{it}$, i.e. $\Pr(y_{it} = 1) = E(y_{it})$. So if the binary response model above is correctly specified, we have

$\Pr(y_{it} = 1 \mid \mathbf{x}_{it}) = \mathbf{x}_{it}'\boldsymbol{\beta} + \mu_i$ (5)

Model (5) is referred to as the linear probability model (LPM), since the probability of success is expressed as a linear function of the explanatory variables in the vector $\mathbf{x}_{it}$, and the parameters can be estimated by OLS or the within estimator. A major shortfall of the LPM when used to estimate the parameters for any discrete choice response variable is that we can get predicted “probabilities” either less than zero or greater than one, which contravenes the definition of probability.
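For illustration, the within (fixed effects) estimator of the LPM can be obtained with the plm package; this is a minimal sketch with a hypothetical data frame, shown only to contrast with the nonlinear estimators discussed next.

# Hedged sketch: linear probability model with individual fixed effects,
# estimated by the within estimator of the plm package.
library(plm)
# 'panel' is an assumed data frame with columns id, time, y (0/1), x1, x2
pdat   <- pdata.frame(panel, index = c("id", "time"))
lpm_fe <- plm(y ~ x1 + x2, data = pdat, model = "within")
summary(lpm_fe)   # note: fitted values are not constrained to lie in [0, 1]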
The problems of the LPM can be addressed by choosing a monotonically increasing function F such that

$\lim_{z \to -\infty} F(z) = 0$ and $\lim_{z \to +\infty} F(z) = 1$. (7)

Thus F is a nonlinear function, and hence we cannot use linear regression estimation techniques. Various nonlinear functions for F have been suggested in the literature; the most common ones are the logistic distribution, yielding the logit model, and the standard normal distribution, yielding the probit model.

In the logit model, F takes the form

$F(z_{it}) = \dfrac{\exp(z_{it})}{1 + \exp(z_{it})} = \dfrac{1}{1 + \exp(-z_{it})}$, with $z_{it} = \mathbf{x}_{it}'\boldsymbol{\beta} + \mu_i$, (8)

which is between zero and one for all values of $z_{it}$. This is the cumulative distribution function (CDF) of a logistic variable, whose parameters can be estimated.
2.2. Implications of the Probability Function Pr(.) for Binary Outcomes
When the outcome variable is binary by specification, its probability function Pr(·) only permits two plausible values, for either a success or a failure, irrespective of the functional forms of the covariates in the model. Overlooking this property of the binary outcome yields the linear probability model (LPM) (5), which has quite a number of limitations. Among these limitations are: 1) we can get predicted “probabilities” either less than zero or greater than one from the LPM, which is absurd since predictions outside this range are meaningless; 2) conceptually, it does not make sense to say that a probability is linearly related to a continuous independent variable for all possible values; if it were, then continually increasing this explanatory variable would eventually drive $\Pr(y_{it} = 1)$ above one or below zero; 3) the residual of the LPM is heteroskedastic, and the practical remedy is to obtain estimates of the standard errors that are robust to heteroskedasticity; 4) the residual is not normally distributed, hence small-sample inferences cannot be based on the usual suite of normality-based distributions such as the t test.
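For limitation 3), a common remedy in R is to combine an OLS fit of the LPM with heteroskedasticity-robust standard errors from the sandwich and lmtest packages; the data frame and variables below are placeholders.

# Hedged sketch: heteroskedasticity-robust (HC1) standard errors for an LPM fit,
# using sandwich::vcovHC together with lmtest::coeftest.
library(sandwich)
library(lmtest)
# 'panel' is an assumed data frame; lpm is an ordinary least squares fit of the LPM
lpm <- lm(y ~ x1 + x2, data = panel)
coeftest(lpm, vcov = vcovHC(lpm, type = "HC1"))  # robust t statistics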
Equation (8) now represents what is known as the (cumulative) logistic distribution function, which cushions against the limitations of the LPM. It is easy to verify that as $z_{it} = \mathbf{x}_{it}'\boldsymbol{\beta} + \mu_i$ ranges from −∞ to +∞, F ranges between 0 and 1, and that F is nonlinearly related to $z_{it}$, thus satisfying the requirements for a probability function. However, in satisfying these requirements we have created an estimation problem, because F is nonlinear not only in X but also in the β's, as can be seen clearly from (8). This implies that we cannot use the familiar OLS procedure to estimate the parameters. Through linearization, however, this problem becomes more apparent than real. The linearization is done by obtaining the odds $p_{it}/(1 - p_{it})$ in favor of a success, i.e. the ratio of the probability that $y_{it} = 1$ to the probability that $y_{it} = 0$. The logarithm of the odds,

$\log\!\left(\dfrac{p_{it}}{1 - p_{it}}\right) = \mathbf{x}_{it}'\boldsymbol{\beta} + \mu_i$,

is not only linear in X but also (from the estimation viewpoint) linear in the parameters. This log of the odds is called the logit, and hence the name logit model.
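Numerically, the logit transformation in (8) and its linearity can be verified in R with the base functions plogis() (the logistic CDF) and qlogis() (the log-odds); the predictor values below are arbitrary illustrations.

# Hedged sketch: the probability is nonlinear in the predictor, but the
# log-odds recovered by qlogis() is exactly linear in it.
eta <- seq(-3, 3, by = 0.5)   # illustrative values of the linear predictor x'beta + mu
p   <- plogis(eta)            # logistic CDF of Equation (8): probabilities in (0, 1)
all.equal(qlogis(p), eta)     # TRUE: log(p / (1 - p)) returns the linear predictor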
3. Dealing with Missing Data—Model-Based Approach for Logit Panel Data Models
The relationship between a binary outcome y and predictor variables X is expressed in a standard (generalized) linear mixed modeling framework as in Equation (1).
If the data setting is complete and balanced, the probability density function of interest is $p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta})$, where $\boldsymbol{\theta}$ denotes the vector of all parameters of the model. Conversely, when missingness results in some of the covariates being incomplete, $\mathbf{X}$ is partitioned into two parts, $\mathbf{X} = (\mathbf{X}_{obs}, \mathbf{X}_{mis})$: the completely observed variables $\mathbf{X}_{obs}$ and those variables containing missing values, $\mathbf{X}_{mis}$. We then have a measurement model that depends on unobserved data, expressed as $p(\mathbf{y} \mid \mathbf{X}_{obs}, \mathbf{X}_{mis}, \boldsymbol{\theta})$, which cannot be handled efficiently by the standard complete-data methods.
Two key assumptions now need to be made so that we may not require any digression from standard approaches of handling mixed effects models. These assumptions are (a) that we only have cross-sectional covariates to impute and (b) that the missing data mechanism of the outcome is ignorable, that is, missing at random (MAR) or missing completely at random (MCAR) [12] .
Despite the availability of various frequentist techniques for imputing missing data, we restrict attention to model-based techniques and give an in-depth discussion of multiple imputation using chained equations (MICE) and the sequential full Bayesian (SFB) approach. The concept of replacing a missing datum with multiple values was introduced by Donald B. Rubin [13], who stated that
“[…] of course (1) imputing one value for missing datum can’t be correct in general, and (2) in order to insert sensible values for a missing datum we must rely more or less on some model relating unobserved values to observed values.”
1) Multiple imputations using chained equations (MICE)
The underlying principle of MI is to divide the analysis of incomplete data into three steps: imputation, analysis, and pooling. MICE has become a popular imputation technique because it allows for multivariate missing data and does not require a specific missingness pattern. Since the core principle of multiple imputation is divided into these three stages, MICE is particularly useful in addressing the initial stage of imputation.
Performing MICE under certain regularity conditions, the multivariate distribution

$p(\mathbf{x}_{mis} \mid \mathbf{x}_{obs}, \mathbf{y}, \boldsymbol{\theta})$ (9)

with $\mathbf{x} = (\mathbf{x}_{obs}, \mathbf{x}_{mis})$ and $\boldsymbol{\theta}$ the vector of parameters, can be uniquely determined by its full-conditional distributions. Van Buuren et al. [14] assert that, through this unique determination, Gibbs sampling of the conditional distributions can be used to produce a sample from (9). Though the MICE procedure does not actually start from a specification of (9), it directly defines a series of conditional, predictive models of the form

$p(x_{mis,\ell} \mid \mathbf{x}_{obs}, \mathbf{x}_{mis,-\ell}, \mathbf{y}, \boldsymbol{\theta}_\ell), \qquad \ell = 1, \ldots, q$, (10)

which links each incomplete predictor variable $x_{mis,\ell}$, $\ell = 1, \ldots, q$, with the other incomplete and complete predictors, $\mathbf{x}_{mis,-\ell}$ and $\mathbf{x}_{obs}$, respectively, without forgetting the response $\mathbf{y}$. The predictive distributions (10) are drawn from the extended exponential family with linear predictor expressed as

$h_\ell\{E(x_{mis,\ell})\} = \mathbf{x}_{obs}'\boldsymbol{\alpha}_\ell + \mathbf{x}_{mis,-\ell}'\boldsymbol{\gamma}_\ell + g(\mathbf{y}_i)'\boldsymbol{\xi}_\ell$,

where $h_\ell$ is the one-to-one monotonic link function for the ℓth covariate and $\boldsymbol{\alpha}_\ell$, $\boldsymbol{\gamma}_\ell$ and $\boldsymbol{\xi}_\ell$ are vectors of parameters relating the complete covariates, the missing covariates and the outcome to $x_{mis,\ell}$.
The function $g(\cdot)$ specifies how the outcome enters the linear predictor. In the univariate case, the default choice for $g(\mathbf{y}_i)$ is simply the identity function. However, when we have a multivariate $\mathbf{y}_i$, such as a repeatedly measured binary outcome, we cannot always simply set $g(\mathbf{y}_i) = \mathbf{y}_i$, because $y_{it}$ may take on two different values for an individual i at different times t. Hence, it is not meaningful to use the same regression coefficient $\boldsymbol{\xi}_\ell$ to connect the outcomes of different individuals with $x_{mis,\ell}$, and a representation needs to be found that summarizes $\mathbf{y}_i$ and that has the same number of elements, with the same interpretation, for all individuals.
We propose the following relations for $g(\mathbf{y}_i)$ in this study:

$g(\mathbf{y}_i) = 0$ (11)

$g(\mathbf{y}_i) = y_{it_0}$ for one chosen time point $t_0$ (12)

$g(\mathbf{y}_i) = \bar{y}_i = \dfrac{1}{T_i}\sum_{t=1}^{T_i} y_{it}$ (13)

$g(\mathbf{y}_i) = \sum_{t=1}^{T_i - 1} \dfrac{y_{it} + y_{i,t+1}}{2}\,(t_{i,t+1} - t_{it})$ (14)

From the above relations, it can be deduced that:
i) (11) omits the response completely from the relation for $x_{mis,\ell}$.
ii) (12) uses only one observation chosen from the outcomes for the ith individual.
iii) (13) uses the average of the observed outcomes.
iv) (14) uses the trapezoidal area under the curve of $y_{it}$ around the observation times t and t + 1, as exemplified in Figure 1 below.
Linear combinations of the above relations may also suffice, so there are infinitely many options for specifying the predictor function g(·). It is therefore not a simple task to determine an appropriate $g(\mathbf{y}_i)$; only in very simple settings is it possible to establish which function of $\mathbf{y}_i$ is appropriate. Under a random intercept model for $\mathbf{y}_i$, (13) is the appropriate summary function in the imputation model for a normal cross-sectional covariate [15], whereas for more complex analysis models or discrete covariates it is not straightforward to derive the appropriate summary functions.

Figure 1. The area under the trajectory used as a summary of the response variable y over the study periods.

The ideal inclusion of the outcome variable in the predictor function would be one in which all available outcome values are captured in $g(\mathbf{y}_i)$. This, however, is only possible in settings where the outcome is balanced or close to balanced and does not have a large number of repeated measurements. Alternatively, we could impute missing outcome values so that all individuals have the same number of measurements at approximately the same time points.
In order to achieve high validity of imputations using MICE, we need to ensure that (a) the imputation models are correctly specified and (b) the missing data mechanism is ignorable—MAR or MCAR [12] [16] .
2) Full Bayesian imputation
Since MICE may be faced with the challenge of choosing the best summary representation of a multivariate outcome, we can settle for a full Bayesian approach. In this approach, the complete-data posterior is obtained by combining the complete-data likelihood with prior information to yield

$p(\boldsymbol{\theta}, \boldsymbol{\psi} \mid \mathbf{y}, \mathbf{X}_{obs}, \mathbf{X}_{mis}) \propto p(\mathbf{y} \mid \mathbf{X}_{obs}, \mathbf{X}_{mis}, \boldsymbol{\theta})\, p(\mathbf{X}_{mis} \mid \mathbf{X}_{obs}, \boldsymbol{\psi})\, p(\boldsymbol{\theta})\, p(\boldsymbol{\psi})$,

where $\boldsymbol{\psi}$ is a vector containing parameters that are associated with the likelihood of the partially observed covariates $\mathbf{X}_{mis}$, and $p(\boldsymbol{\theta})$ and $p(\boldsymbol{\psi})$ are prior distributions. Ibrahim et al. [7] explain that a convenient way to specify the joint likelihood of the missing covariates $p(\mathbf{X}_{mis} \mid \mathbf{X}_{obs}, \boldsymbol{\psi})$ is to use a sequence of conditional univariate distributions

$p(x_{mis,1}, \ldots, x_{mis,q} \mid \mathbf{x}_{obs}, \boldsymbol{\psi}) = p(x_{mis,1} \mid \mathbf{x}_{obs}, \boldsymbol{\psi}_1) \prod_{\ell=2}^{q} p(x_{mis,\ell} \mid \mathbf{x}_{obs}, x_{mis,1}, \ldots, x_{mis,\ell-1}, \boldsymbol{\psi}_\ell)$ (15)

with $\boldsymbol{\psi} = (\boldsymbol{\psi}_1, \ldots, \boldsymbol{\psi}_q)$. It is similarly assumed that each of these distributions is again chosen from the extended exponential family and in accordance with the type of the respective variable. An advantage of this sequential way of writing the joint distribution of the covariates is that it provides a straightforward way to specify the joint distribution even when the covariates are of mixed type.
After specifying the prior distributions $p(\boldsymbol{\theta})$ and $p(\boldsymbol{\psi})$, Markov chain Monte Carlo (MCMC) methods, such as Gibbs sampling, can be used to draw samples from the joint posterior distribution of all parameters and missing values. Because all missing values are imputed in each iteration of the Gibbs sampler, the additional uncertainty created by the missing values is automatically taken into account, and no pooling is necessary.
We opt to adopt the full likelihood instead of the series of predictive models (10) due to one key advantage: we can choose how to factorize this full likelihood. Precisely, by factorizing the joint distribution $p(\mathbf{y}, \mathbf{X}_{mis} \mid \mathbf{X}_{obs})$ into the conditional distribution $p(\mathbf{y} \mid \mathbf{X}_{mis}, \mathbf{X}_{obs})$ and the marginal distribution $p(\mathbf{X}_{mis} \mid \mathbf{X}_{obs})$, the joint posterior distribution can be specified without having to include the outcome in any predictor, and no summary representation $g(\mathbf{y}_i)$ is needed. This becomes clear when writing out the full conditional distribution of the incomplete covariates used by the Gibbs sampler:

$p(x_{mis,\ell} \mid \mathbf{y}, \mathbf{x}_{obs}, \mathbf{x}_{mis,-\ell}, \boldsymbol{\theta}, \boldsymbol{\psi}) \propto p(\mathbf{y} \mid \mathbf{x}_{obs}, \mathbf{x}_{mis}, \boldsymbol{\theta})\, p(x_{mis,\ell} \mid \mathbf{x}_{obs}, x_{mis,1}, \ldots, x_{mis,\ell-1}, \boldsymbol{\psi}_\ell) \prod_{k=\ell+1}^{q} p(x_{mis,k} \mid \mathbf{x}_{obs}, x_{mis,1}, \ldots, x_{mis,k-1}, \boldsymbol{\psi}_k)$,
where the densities $p(\mathbf{y} \mid \cdot)$, $p(x_{mis,\ell} \mid \cdot)$ and $p(x_{mis,k} \mid \cdot)$ are members of the extended exponential family with linear predictors expressed as

$\eta_{y,it} = \mathbf{x}_{it,obs}'\boldsymbol{\beta}_{obs} + \mathbf{x}_{it,mis}'\boldsymbol{\beta}_{mis} + \mu_i, \quad t = 1, \ldots, T$, (16)

$h_\ell\{E(x_{mis,\ell,i})\} = \mathbf{x}_{obs,i}'\boldsymbol{\alpha}_\ell + \sum_{m < \ell} x_{mis,m,i}\,\gamma_{\ell m}$, (17)

$h_k\{E(x_{mis,k,i})\} = \mathbf{x}_{obs,i}'\boldsymbol{\alpha}_k + \sum_{m < k} x_{mis,m,i}\,\gamma_{k m}, \quad k = \ell + 1, \ldots, q$, (18)

with T denoting the number of repeated measurements of individual i, $\boldsymbol{\beta} = (\boldsymbol{\beta}_{obs}, \boldsymbol{\beta}_{mis})$ the coefficients of the analysis model, and $\boldsymbol{\alpha}_\ell$, $\gamma_{\ell m}$ and $\mu_i$ as defined before.
Equation (16) represents the predictor of the linear mixed model, Equation (17) the predictor of the imputation model of $x_{mis,\ell}$ with link function $h_\ell$ from the extended exponential family, and Equation (18) represents the predictors of the covariates that have $x_{mis,\ell}$ as a predictive variable, with $h_k$ being the corresponding link function. Clearly, looking at Equations (16)-(18), none of the covariate imputation predictors depends on a summary of the outcome variable, as was the case in MICE. This puts SFB ahead of MICE at the specification stage. Analogously to MICE, however, the SFB approach assumes an ignorable missing data mechanism and correctly specified conditional distributions of the covariates.
It has been mentioned before that it is not obvious how the imputation models in the sequence should be ordered [17] and, from a theoretical point of view, different orderings may result in different joint distributions, leading to different results. Chen and Ibrahim [18] suggest conditioning the categorical imputation models on the continuous covariates. In the context of MI, it has been suggested to impute variables in a sequence such that the missingness pattern is close to monotone. Our convention is to order the imputation models in (15) according to the number of missing values, starting with the variable with the fewest missing values. It has been shown, however, that sequential specifications, as used in the Bayesian approach, are quite robust against changes in the ordering [18], and results may be unbiased even when the order of the sequence is misspecified, as long as the imputation models fit the data well enough.
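To make the SFB mechanics concrete, the sketch below writes a small JAGS model (accessed through the rjags package) for a random-intercept logit analysis model with one incomplete continuous covariate; the variable names, priors and the single imputation model are illustrative assumptions, not the exact specification used in this study.

# Hedged sketch: sequential full Bayesian imputation in JAGS. NA entries in the data
# supplied for x1 are treated as unknowns and sampled in every MCMC iteration.
library(rjags)   # assumes JAGS is installed
sfb_model <- "
model {
  for (i in 1:n) {
    alpha[i] ~ dnorm(0, tau.alpha)            # individual-specific effect
    for (t in 1:T) {
      y[i, t] ~ dbern(p[i, t])                # analysis model, logit link
      logit(p[i, t]) <- alpha[i] + beta1 * x1[i, t] + beta2 * x2[i, t]
      x1[i, t] ~ dnorm(mu.x1[i, t], tau.x1)   # imputation model for the incomplete covariate
      mu.x1[i, t] <- a0 + a1 * x2[i, t]
    }
  }
  beta1 ~ dnorm(0, 0.001)
  beta2 ~ dnorm(0, 0.001)
  a0 ~ dnorm(0, 0.001)
  a1 ~ dnorm(0, 0.001)
  tau.x1 ~ dgamma(0.01, 0.01)
  tau.alpha ~ dgamma(0.01, 0.01)
}"
# jags.model(textConnection(sfb_model), data = list(y = y, x1 = x1, x2 = x2, n = n, T = T))
# followed by coda.samples() draws beta1, beta2 and the missing x1 jointly, so no pooling step is needed.

Because the incomplete covariate appears both in the analysis model and in its own imputation model, its full conditional automatically involves the whole outcome trajectory, which is exactly the property emphasized above.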
4. Monte Carlo Simulation Study
4.1. Relative Theory of Monte Carlo Study
A Monte Carlo simulation (also known as multiple probability simulation) is used to model the probability of different outcomes in a process that cannot easily be predicted due to the intervention of random variables. Monte Carlo studies, or Monte Carlo experiments, encompass computational algorithms that rely on repeated, randomized sampling to obtain numerical results. The major goal is therefore to use this randomness to solve modeling problems. Monte Carlo studies allow us to understand the impact of risk and uncertainty. Universally, Monte Carlo simulation studies follow five key steps:
1) Setting up a probability distribution for important variables.
2) Building a cumulative probability distribution for each variable.
3) Establishing an interval of random numbers for each variable.
4) Generating random numbers.
5) Actually simulating a series of trials.
4.2. Design
In this section, we subject the theoretical analysis of MICE and SFB to hypothetical binary outcome panel data to support the hypothesis that SFB may be superior to MICE. To this end we focus on the logit estimator. This simulation aims at comparing the bias and RMSE of the parameter estimates obtained by the unconditional logit estimator and the conditional logit estimator in the presence of imputed missing covariates.
To take care of the different possible features of the data, this comparison will be made for two sets of data, one complete (balanced) and the other incomplete (unbalanced) due to intermittent nonresponses. The latter data set is then balanced by imputing the missing observations using the two model based approaches described in Section 3. We considered a total of four data sets: one complete set, two imputed sets by MICE using the specifications (11) and (13) of how the response variable relates to missingness and one imputed set by SFB.
We specify all the panel sets fitted to the estimation of the following model:

$y_{it}^{*} = \mu_i + \mathbf{x}_{it}'\boldsymbol{\beta} + v_{it}, \qquad y_{it} = \mathbb{1}\{y_{it}^{*} > 0\}$,

where $\mathbf{x}_{it}$ is a vector of five explanatory variables drawn from uniform, binomial and normal distributions and the error term $v_{it}$ is derived from standard normal draws. The variables' descriptions are in Table 1. All the specified predictor variables have parameters, β1 to β5, which are used to define the dependent variable y. These parameters were fixed as β1 = 1, β2 = −1, β3 = 1, β4 = 1 and β5 = 1. The dependent variable y is then calculated from the relation $y_{it} = \mathbb{1}\{\mu_i + \beta_1 x_{1,it} + \cdots + \beta_5 x_{5,it} + v_{it} > 0\}$, where $v_{it}$ is a logistic-type variable given by $v_{it} = \log\lvert u_{it}/(1 + u_{it})\rvert$, with $u_{it}$ being a standard normal random variable. The fixed effects $\mu_i$ are obtained as functions of $x_1$ and T by the relation $\mu_i = \sqrt{T}\,\tfrac{1}{n}\sum_{j,t} x_{1,jt} + a_i$, with $a_i$ being a standard normal random variable as well. By simulation, the probability densities of the five variables are as shown in Figure 2, where the blue density represents the complete data and the magenta densities are from the few imputed data sets by MICE and SFB.
For in-depth comparisons, we also assess the impact of sample size on the parameter estimates obtained by these approaches. In order to factor in small, medium and large sample size possibilities, we settled for three different values of N that were used for all sets of data fitted into the models (N = 50, N = 100 and N = 250). This was established by a subjective imposition of the expected probability of success p and plausible coefficients of variation (CoV) of 0.2, 0.14 and 0.09, respectively, in the approximate relation $N = (1 - p)/(p \cdot \mathrm{CoV}^2)$. Further, we evaluate the impact of the proportion of missingness from 10% to 30% by randomly deleting the desired proportion of observations from the data set and imputing them back for each sample size.

Figure 2. Covariate densities of complete data (blue) and imputed data (magenta).
Since we estimate a fixed effects model, the coefficients are truncated in order to ensure convergence. After 1000 iterations, the summarized results are tabulated in Tables 2-4, where for both estimators considered (unconditional logit and conditional logit) we report the mean bias and the root mean squared error (RMSE) for all the parameter estimates. R has several packages that make use of the Bayesian framework for imputation [14] [19] [20] [21]. However, for longitudinal or panel data sets, the use of WinBUGS [21] or JAGS [22] provides a much more straightforward platform to handle SFB.
For each of the variables x(1) through x(5), five imputed data sets were generated by MICE, with densities as shown in Figure 2. The simulated densities tend to follow the trend of the observed (complete) density, shown in blue, for each of the study variables.
4.3. Simulation Results
Table 2. Sample size n = 50, percentage of missingness = 10%; 30%.
Table 3. Sample size n = 100, percentage of missingness = 10%; 30%.
Table 4. Sample size n = 250, percentage of missingness = 10%; 30%.
5. Discussion, Conclusions and Recommendation
Through the variation of the simulated sample sizes, the study confirms the asymptotic properties of parameter bias. Undeniably, all the reported measures (mean bias and the root mean square errors) are observed to have an inverse relationship with increasing sample size across all the models considered. We have also established that, as much as MICE performs well in most standard situations, SFB yields relatively stable estimates that are robust to the proportion of missingness in a binary response panel data frame. In addition, MICE does not respond consistently with categorical or discrete data but yields consistent parameter estimates for the continuous regressors. This is also observed for SFB, but it improves asymptotically. Uniquely, the magnitude of the median bias shows increased biases for the conditional logit estimator compared to the unconditional logit estimator under all three imputation techniques, more so when the sample size is large [23]. Figure 3 shows the superiority of SFB over MICE for both scenarios considered when handling the logistic panel data model.

Figure 3. Average bias for parameter estimates.
Since the SFB approach uses a sequence of univariate conditional distributions to specify the joint pdf, it is advantageous because the analysis model is part of the conditional distributions and the parameters of interest are estimated simultaneously with the imputation of the missing values; hence no further pooling is done, as is the case in MICE. This means that specification of the univariate conditional imputation models permits a more straightforward and flexible imputation.
In summary, this paper has briefly discussed estimation methods and procedures for estimating nonlinear (binary choice logit) panel data regression models with missing covariates.
The key objective of this study was to assess the performance of MICE and Bayesian imputation in addressing the effect of non-response (missingness) on the parameter estimates of logit panel data models. Although studies show that MICE performs relatively well in many standard situations, this study has demonstrated the advantage of working with the full likelihood in more complex specifications of the response variable. In addition, situations where imputation with MICE does not take into account the actual structure of the measurement model may yield poorly imputed values, which affect the consistency of the analysis.
Deriving the estimators is important for increasing the theoretical understanding of the estimators and for reducing the computational complexity of estimating logit panel models. As observed from the Monte Carlo results, unbalancedness in a data set biases the parameter estimates, and the different imputation techniques employed in this study respond differently in terms of the bias and efficiency of the estimates.
As a recommendation, further developments can be done on this study by considering other imputation techniques in MICE where the response variable enters the imputation model in other different specifications.
Appendix
R CODES FOR MONTE CARLO SIMULATION
n = 50 with 10% missingness
set.seed(12345) # Use this to make the randomly generated data the same each time you run the simulation#
library(survival) # provides clogit() and strata() used for the conditional logit fits
N.iter = 1000
j = 1:5
A <- array(0, dim=c(N.iter,5))
B <- array(0, dim=c(N.iter,5))
C <- array(0, dim=c(N.iter,5))
D <- array(0, dim=c(N.iter,5))
E <- array(0, dim=c(N.iter,5))
F <- array(0, dim=c(N.iter,5))
G <- array(0, dim=c(N.iter,5))
H <- array(0, dim=c(N.iter,5))
for(i in 1:N.iter){
n=50 # vary n
t=2 # vary t and change time in pData
nt=n*t
b1=1
b2=-1
b3=1
b4=1
b5=1
ai=rnorm(n,0,1)
x1=rnorm(nt,0,1)
x2 <- runif(nt,0,1)
x3 <- rnorm(nt,0.5,0.5)
x4 <- rbinom(nt,2,0.65) # nt different such numbers
hit=rnorm(nt,0,1)
uit=rnorm(nt,0,1)
vit=log(abs(uit/(1+uit)))
x5 <- ifelse(x1+hit>0, 1, 0)
alphai <- sqrt(t)*sum(x1)/n+ai #alphai simulation
ci <- kronecker(alphai ,matrix(1,t,1))#kronecker of alphai
wit=ci+b1*x1+b2*x2+b3*x3+b4*x4+b5*x5
zit= wit+vit
y <- ifelse(zit>0, 1, 0)
pData <- data.frame(id = rep(paste("stdnt", 1:n, sep = "_"), each = t),time = rep(1:2, n), y,zit,ci,x1,x2,x3,x4,x5)
# Randomly Insert A Certain Proportion Of NAs Into A Dataframe #
pData2<- cbind(x1,x2,x3,x4,x5)
##################################
NAins <- NAinsert <- function(df, prop = .1){
n <- nrow(df)
m <- ncol(df)
num.to.na <- ceiling(prop*n*m)
id <- sample(0:(m*n-1), num.to.na, replace = FALSE)
rows <- id %/% m + 1
cols <- id %% m + 1
sapply(seq(num.to.na), function(x){
df[rows[x], cols[x]] <<- NA
}
)
return(df)
}
##############
# TRY IT OUT #
##############
XX<-NAins(pData2, .1)
XX
pData3<- data.frame(id = rep(paste("stdnt", 1:n, sep = "_"), each = t),time = rep(1:2, n), y,XX)
pData3
#Scenario 1 S1 when f(yi)=0
yS1<-y*0
XXX<-data.frame(yS1,XX)
#check missingness proportions by covariate
pMiss <- function(x){sum(is.na(x))/length(x)*100}
apply (XXX,2,pMiss)
apply(XXX,1,pMiss)
#Performing MICE in XXX
library(mice)
md.pattern(XXX)
library(VIM)
aggr_plot <- aggr(XXX, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(XXX), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
tempData <- mice(XXX,m=5,maxit=50,meth='pmm',seed=500)
summary(tempData)
#to see the new imputed covariate values, say X1
tempData$imp$x1
#showing covariate densities of observed data(blue) and imputed data( magenta)
densityplot(tempData)
#getting back the completed data and binding with original y
completedData <- complete(tempData,1)
completedDataS1<- subset(completedData, select=c(x1,x2,x3,x4,x5))
XXXS1 <- data.frame(id = pData$id, y, completedDataS1)  # keep the subject id so strata(id) is available
XXXS1
#parameter estimation, unconditional logit and conditional logit for Scenario S1
glm.out1 = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=pData)
clogit1 <- clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = pData)  # strata() must be part of the formula
glm.out2 = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=XXXS1)
clogit2 <- clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = XXXS1)
#Scenario 3 S3 when f(yi)=E(yit)=p
S3=rep(mean(y),times=nt)
yS3<-data.frame(S3)
XXX3<-data.frame(yS3,XX)
#check missingness proportions by covariate
pMiss <- function(x){sum(is.na(x))/length(x)*100}
apply(XXX3,2,pMiss)
apply(XXX3,1,pMiss)
#Performing MICE in XXX3
library(mice)
md.pattern(XXX3)
library(VIM)
aggr_plot <- aggr(XXX3, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(XXX3), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
tempData3 <- mice(XXX3,m=5,maxit=50,meth='norm',seed=500)
summary(tempData3)
#to see the new imputed covariate values, say X1
tempData3$imp$x1
#showing covariate densities of observed data(blue) and imputed data( magenta)
densityplot(tempData3)
#getting back the completed data and binding with original y
completedData3 <- complete(tempData3,1)
completedDataS3<- subset(completedData3, select=c(x1,x2,x3,x4,x5))
XXXS3 <- data.frame(id = pData$id, y, completedDataS3)  # keep the subject id for the conditional logit
XXXS3
# Impute missing values using Bayesian imputation
imputed_data <- mice(XX[, c("x1", "x2", "x3", "x4", "x5")], method = "norm.predict") # Use correct variable names and select only the predictor variables
# Extract the completed dataset from the imputed_data object
XXX4 <- complete(imputed_data)
BayesData <- data.frame(id = pData$id, y, XXX4)
#parameter estimation, unconditional logit and conditional logit for Complete Original Data
glm.out1 = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=pData)
clogit1 <- clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = pData)
#parameter estimation, unconditional logit and conditional logit for Scenario S1
glm.out2 = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=XXXS1)
clogit2 <- clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = XXXS1)
#parameter estimation, unconditional logit and conditional logit for Scenario S3
glm.out3 = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=XXXS3)
clogit3 <- clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = XXXS3)
#parameter estimation, unconditional logit and conditional logit for BayesData
glm.outBayes = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=BayesData)
clogitBayes <- clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = BayesData)
#Calling out coefficients
A[i,] <- glm.out1$coef[j+1]
B[i,] <- clogit1$coef[j]
C[i,] <- glm.out2$coef[j+1]
D[i,] <- clogit2$coef[j]
E[i,] <- glm.out3$coef[j+1]
F[i,] <- clogit3$coef[j]
G[i,] <- glm.outBayes$coef[j+1]
H[i,] <- clogitBayes$coef[j] }
#mean bias#
mean.biasA<- c(b1-mean(A[,1]), b2-mean(A[,2]),b3-mean(A[,3]),b4-mean(A[,4]),b5-mean(A[,5]))
mean.biasB<- c(b1-mean(B[,1]), b2-mean(B[,2]),b3-mean(B[,3]),b4-mean(B[,4]),b5-mean(B[,5]))
mean.biasC<- c(b1-mean(C[,1]), b2-mean(C[,2]),b3-mean(C[,3]),b4-mean(C[,4]),b5-mean(C[,5]))
mean.biasD<- c(b1-mean(D[,1]), b2-mean(D[,2]),b3-mean(D[,3]),b4-mean(D[,4]),b5-mean(D[,5]))
mean.biasE<- c(b1-mean(E[,1]), b2-mean(E[,2]),b3-mean(E[,3]),b4-mean(E[,4]),b5-mean(E[,5]))
mean.biasF<- c(b1-mean(F[,1]), b2-mean(F[,2]),b3-mean(F[,3]),b4-mean(F[,4]),b5-mean(F[,5]))
mean.biasG<- c(b1-mean(G[,1]), b2-mean(G[,2]),b3-mean(G[,3]),b4-mean(G[,4]),b5-mean(G[,5]))
mean.biasH<- c(b1-mean(H[,1]), b2-mean(H[,2]),b3-mean(H[,3]),b4-mean(H[,4]),b5-mean(H[,5]))
#residual errors#
rmseA<-c(sqrt(mean((b1-A[,1])^2)), sqrt(mean((b2-A[,2])^2)), sqrt(mean((b3-A[,3])^2)), sqrt(mean((b4-A[,4])^2)), sqrt(mean((b5-A[,5])^2)))
rmseB<-c(sqrt(mean((b1-B[,1])^2)), sqrt(mean((b2-B[,2])^2)), sqrt(mean((b3-B[,3])^2)), sqrt(mean((b4-B[,4])^2)), sqrt(mean((b5-B[,5])^2)))
rmseC<-c(sqrt(mean((b1-C[,1])^2)), sqrt(mean((b2-C[,2])^2)), sqrt(mean((b3-C[,3])^2)), sqrt(mean((b4-C[,4])^2)), sqrt(mean((b5-C[,5])^2)))
rmseD<-c(sqrt(mean((b1-D[,1])^2)), sqrt(mean((b2-D[,2])^2)), sqrt(mean((b3-D[,3])^2)), sqrt(mean((b4-D[,4])^2)), sqrt(mean((b5-D[,5])^2)))
rmseE<-c(sqrt(mean((b1-E[,1])^2)), sqrt(mean((b2-E[,2])^2)), sqrt(mean((b3-E[,3])^2)), sqrt(mean((b4-E[,4])^2)), sqrt(mean((b5-E[,5])^2)))
rmseF<-c(sqrt(mean((b1-F[,1])^2)), sqrt(mean((b2-F[,2])^2)), sqrt(mean((b3-F[,3])^2)), sqrt(mean((b4-F[,4])^2)), sqrt(mean((b5-F[,5])^2)))
rmseG<-c(sqrt(mean((b1-G[,1])^2)), sqrt(mean((b2-G[,2])^2)), sqrt(mean((b3-G[,3])^2)), sqrt(mean((b4-G[,4])^2)), sqrt(mean((b5-G[,5])^2)))
rmseH<-c(sqrt(mean((b1-H[,1])^2)), sqrt(mean((b2-H[,2])^2)), sqrt(mean((b3-H[,3])^2)), sqrt(mean((b4-H[,4])^2)), sqrt(mean((b5-H[,5])^2)))
mean.biasA
rmseA
mean.biasB
rmseB
mean.biasC
rmseC
mean.biasD
rmseD
mean.biasE
rmseE
mean.biasF
rmseF
mean.biasG
rmseG
mean.biasH
rmseH
n= 50 with 30% missingness
set.seed(123456) # Use this to make the randomly generated data the same each time you run the simulation#
library(survival) # provides clogit() and strata() used for the conditional logit fits
N.iter=1000
j= 1:5
A <- array(0, dim=c(N.iter,5))
B <- array(0, dim=c(N.iter,5))
C <- array(0, dim=c(N.iter,5))
D <- array(0, dim=c(N.iter,5))
E <- array(0, dim=c(N.iter,5))
F <- array(0, dim=c(N.iter,5))
G <- array(0, dim=c(N.iter,5))
H <- array(0, dim=c(N.iter,5))
for(i in 1:N.iter){
n=50 # vary n
t=2 # vary t and change time in pData
nt=n*t
b1=1
b2=-1
b3=1
b4=1
b5=1
ai=rnorm(n,0,1)
x1=rnorm(nt,0,1)
x2 <- runif(nt,0,1)
x3 <- rnorm(nt,0.5,0.5)
x4 <- rbinom(nt,2,0.65) # nt different such numbers
hit=rnorm(nt,0,1)
uit=rnorm(nt,0,1)
vit=log(abs(uit/(1+uit)))
x5 <- ifelse(x1+hit>0, 1, 0)
alphai <- sqrt(t)*sum(x1)/n+ai #alphai simulation
ci <- kronecker(alphai ,matrix(1,t,1))#kronecker of alphai
wit=ci+b1*x1+b2*x2+b3*x3+b4*x4+b5*x5
zit= wit+vit
y <- ifelse(zit>0, 1, 0)
pData <- data.frame(id = rep(paste("stdnt", 1:n, sep = "_"), each = t),time = rep(1:2, n), y,zit,ci,x1,x2,x3,x4,x5)
# Randomly Insert A Certain Proportion Of NAs Into A Dataframe #
pData2<- cbind(x1,x2,x3,x4,x5)
##################################
NAins <- NAinsert <- function(df, prop = .3){
n <- nrow(df)
m <- ncol(df)
num.to.na <- ceiling(prop*n*m)
id <- sample(0:(m*n-1), num.to.na, replace = FALSE)
rows <- id %/% m + 1
cols <- id %% m + 1
sapply(seq(num.to.na), function(x){
df[rows[x], cols[x]] <<- NA
}
)
return(df)
}
##############
# TRY IT OUT #
##############
XX<-NAins(pData2, .3)
XX
pData3<- data.frame(id = rep(paste("stdnt", 1:n, sep = "_"), each = t),time = rep(1:2, n), y,XX)
pData3
#Scenario 1 S1 when f(yi)=0
yS1<-y*0
XXX<-data.frame(yS1,XX)
#check missingness proportions by covariate
pMiss <- function(x){sum(is.na(x))/length(x)*100}
apply(XXX,2,pMiss)
apply(XXX,1,pMiss)
#Performing MICE in XXX
library(mice)
md.pattern(XXX)
library(VIM)
aggr_plot <- aggr(XXX, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(XXX), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
tempData <- mice(XXX,m=5,maxit=50,meth='pmm',seed=500)
summary(tempData)
#to see the new imputed covariate values, say X1
tempData$imp$x1
#showing covariate densities of observed data(blue) and imputed data( magenta)
densityplot(tempData)
#getting back the completed data and binding with original y
completedData <- complete(tempData,1)
completedDataS1<- subset(completedData, select=c(x1,x2,x3,x4,x5))
XXXS1 <- data.frame(id = pData$id, y, completedDataS1)  # keep the subject id so strata(id) is available
XXXS1
#parameter estimation, unconditional logit and conditional logit for Scenario S1
glm.out1 = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=pData)
clogit1 <- clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = pData)  # strata() must be part of the formula
glm.out2 = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=XXXS1)
clogit2 <- clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = XXXS1)
#Scenario 3 S3 when f(yi)=E(yit)=p
S3=rep(mean(y),times=nt)
yS3<-data.frame(S3)
XXX3<-data.frame(yS3,XX)
#check missingness proportions by covariate
pMiss <- function(x){sum(is.na(x))/length(x)*100}
apply(XXX3,2,pMiss)
apply(XXX3,1,pMiss)
#Performing MICE in XXX3
library(mice)
md.pattern(XXX3)
library(VIM)
aggr_plot <- aggr(XXX3, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(XXX3), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
tempData3 <- mice(XXX3,m=5,maxit=50,meth='norm',seed=500)
summary(tempData3)
#to see the new imputed covariate values, say X1
tempData3$imp$x1
#showing covariate densities of observed data(blue) and imputed data( magenta)
densityplot(tempData3)
#getting back the completed data and binding with original y
completedData3 <- complete(tempData3,1)
completedDataS3<- subset(completedData3, select=c(x1,x2,x3,x4,x5))
XXXS3 <- data.frame(id = pData$id, y, completedDataS3)  # keep the subject id for the conditional logit
XXXS3
# Impute missing values using Bayesian imputation
imputed_data <- mice(XX[, c("x1", "x2", "x3", "x4", "x5")], method = "norm.predict") # Use correct variable names and select only the predictor variables
# Extract the completed dataset from the imputed_data object
XXX4 <- complete(imputed_data)
BayesData <- data.frame(id = pData$id, y, XXX4)
#parameter estimation, unconditional logit and conditional logit for Complete Original Data
glm.out1 = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=pData)
clogit1 <- clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = pData)
#parameter estimation, unconditional logit and conditional logit for Scenario S1
glm.out2 = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=XXXS1)
clogit2 <- clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = XXXS1)
#parameter estimation, unconditional logit and conditional logit for Scenario S3
glm.out3 = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=XXXS3)
clogit3 <- clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = XXXS3)
#parameter estimation, unconditional logit and conditional logit for BayesData
glm.outBayes = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=BayesData)
clogitBayes <- clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = BayesData)
#Calling out coefficients
A[i,] <- glm.out1$coef[j+1]
B[i,] <- clogit1$coef[j]
C[i,] <- glm.out2$coef[j+1]
D[i,] <- clogit2$coef[j]
E[i,] <- glm.out3$coef[j+1]
F[i,] <- clogit3$coef[j]
G[i,] <- glm.outBayes$coef[j+1]
H[i,] <- clogitBayes$coef[j] }
#mean bias#
mean.biasA<- c(b1-mean(A[,1]), b2-mean(A[,2]),b3-mean(A[,3]),b4-mean(A[,4]),b5-mean(A[,5]))
mean.biasB<- c(b1-mean(B[,1]), b2-mean(B[,2]),b3-mean(B[,3]),b4-mean(B[,4]),b5-mean(B[,5]))
mean.biasC<- c(b1-mean(C[,1]), b2-mean(C[,2]),b3-mean(C[,3]),b4-mean(C[,4]),b5-mean(C[,5]))
mean.biasD<- c(b1-mean(D[,1]), b2-mean(D[,2]),b3-mean(D[,3]),b4-mean(D[,4]),b5-mean(D[,5]))
mean.biasE<- c(b1-mean(E[,1]), b2-mean(E[,2]),b3-mean(E[,3]),b4-mean(E[,4]),b5-mean(E[,5]))
mean.biasF<- c(b1-mean(F[,1]), b2-mean(F[,2]),b3-mean(F[,3]),b4-mean(F[,4]),b5-mean(F[,5]))
mean.biasG<- c(b1-mean(G[,1]), b2-mean(G[,2]),b3-mean(G[,3]),b4-mean(G[,4]),b5-mean(G[,5]))
mean.biasH<- c(b1-mean(H[,1]), b2-mean(H[,2]),b3-mean(H[,3]),b4-mean(H[,4]),b5-mean(H[,5]))
#residual errors#
rmseA<-c(sqrt(mean((b1-A[,1])^2)), sqrt(mean((b2-A[,2])^2)), sqrt(mean((b3-A[,3])^2)), sqrt(mean((b4-A[,4])^2)), sqrt(mean((b5-A[,5])^2)))
rmseB<-c(sqrt(mean((b1-B[,1])^2)), sqrt(mean((b2-B[,2])^2)), sqrt(mean((b3-B[,3])^2)), sqrt(mean((b4-B[,4])^2)), sqrt(mean((b5-B[,5])^2)))
rmseC<-c(sqrt(mean((b1-C[,1])^2)), sqrt(mean((b2-C[,2])^2)), sqrt(mean((b3-C[,3])^2)), sqrt(mean((b4-C[,4])^2)), sqrt(mean((b5-C[,5])^2)))
rmseD<-c(sqrt(mean((b1-D[,1])^2)), sqrt(mean((b2-D[,2])^2)), sqrt(mean((b3-D[,3])^2)), sqrt(mean((b4-D[,4])^2)), sqrt(mean((b5-D[,5])^2)))
rmseE<-c(sqrt(mean((b1-E[,1])^2)), sqrt(mean((b2-E[,2])^2)), sqrt(mean((b3-E[,3])^2)), sqrt(mean((b4-E[,4])^2)), sqrt(mean((b5-E[,5])^2)))
rmseF<-c(sqrt(mean((b1-F[,1])^2)), sqrt(mean((b2-F[,2])^2)), sqrt(mean((b3-F[,3])^2)), sqrt(mean((b4-F[,4])^2)), sqrt(mean((b5-F[,5])^2)))
rmseG<-c(sqrt(mean((b1-G[,1])^2)), sqrt(mean((b2-G[,2])^2)), sqrt(mean((b3-G[,3])^2)), sqrt(mean((b4-G[,4])^2)), sqrt(mean((b5-G[,5])^2)))
rmseH<-c(sqrt(mean((b1-H[,1])^2)), sqrt(mean((b2-H[,2])^2)), sqrt(mean((b3-H[,3])^2)), sqrt(mean((b4-H[,4])^2)), sqrt(mean((b5-H[,5])^2)))
mean.biasA
rmseA
mean.biasB
rmseB
mean.biasC
rmseC
mean.biasD
rmseD
mean.biasE
rmseE
mean.biasF
rmseF
mean.biasG
rmseG
mean.biasH
rmseH
n= 100 with 10% missingness
set.seed(1234567) # Use this to make the randomly generated data the same each time you run the simulation#
library(survival) # provides clogit() and strata() used for the conditional logit fits
N.iter=1000
j= 1:5
A <- array(0, dim=c(N.iter,5))
B <- array(0, dim=c(N.iter,5))
C <- array(0, dim=c(N.iter,5))
D <- array(0, dim=c(N.iter,5))
E <- array(0, dim=c(N.iter,5))
F <- array(0, dim=c(N.iter,5))
G <- array(0, dim=c(N.iter,5))
H <- array(0, dim=c(N.iter,5))
for(i in 1:N.iter){
n=100 # vary n
t=2 # vary t and change time in pData
nt=n*t
b1=1
b2=-1
b3=1
b4=1
b5=1
ai=rnorm(n,0,1)
x1=rnorm(nt,0,1)
x2 <- runif(nt,0,1)
x3 <- rnorm(nt,0.5,0.5)
x4 <- rbinom(nt,2,0.65) # nt different such numbers
hit=rnorm(nt,0,1)
uit=rnorm(nt,0,1)
vit=log(abs(uit/(1+uit)))
x5 <- ifelse(x1+hit>0, 1, 0)
alphai <- sqrt(t)*sum(x1)/n+ai #alphai simulation
ci <- kronecker(alphai ,matrix(1,t,1))#kronecker of alphai
wit=ci+b1*x1+b2*x2+b3*x3+b4*x4+b5*x5
zit= wit+vit
y <- ifelse(zit>0, 1, 0)
pData <- data.frame(id = rep(paste("stdnt", 1:n, sep = "_"), each = t),time = rep(1:2, n), y,zit,ci,x1,x2,x3,x4,x5)
# Randomly Insert A Certain Proportion Of NAs Into A Dataframe #
pData2<- cbind(x1,x2,x3,x4,x5)
##################################
NAins <- NAinsert <- function(df, prop = .1){
n <- nrow(df)
m <- ncol(df)
num.to.na <- ceiling(prop*n*m)
id <- sample(0:(m*n-1), num.to.na, replace = FALSE)
rows <- id %/% m + 1
cols <- id %% m + 1
sapply(seq(num.to.na), function(x){
df[rows[x], cols[x]] <<- NA
}
)
return(df)
}
##############
# TRY IT OUT #
##############
XX<-NAins(pData2, .1)
XX
pData3<- data.frame(id = rep(paste("stdnt", 1:n, sep = "_"), each = t),time = rep(1:2, n), y,XX)
pData3
#Scenario 1 S1 when f(yi)=0
yS1<-y*0
XXX<-data.frame(yS1,XX)
#check missingness proportions by covariate
pMiss <- function(x){sum(is.na(x))/length(x)*100}
apply(XXX,2,pMiss)
apply(XXX,1,pMiss)
#Performing MICE in XXX
library(mice)
md.pattern(XXX)
library(VIM)
aggr_plot <- aggr(XXX, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(XXX), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
tempData <- mice(XXX,m=5,maxit=50,meth='pmm',seed=500)
summary(tempData)
#to see the new imputed covariate values, say X1
tempData$imp$x1
#showing covariate densities of observed data(blue) and imputed data( magenta)
densityplot(tempData)
#getting back the completed data and binding with original y
completedData <- complete(tempData,1)
completedDataS1<- subset(completedData, select=c(x1,x2,x3,x4,x5))
XXXS1 <- data.frame(id = pData$id, y, completedDataS1)  # keep the subject id so strata(id) is available
XXXS1
#parameter estimation, unconditional logit and conditional logit for Scenario S1
glm.out1 = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=pData)
clogit1 <- clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = pData)  # strata() must be part of the formula
glm.out2 = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=XXXS1)
clogit2 <- clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = XXXS1)
#Scenario 3 S3 when f(yi)=E(yit)=p
S3=rep(mean(y),times=nt)
yS3<-data.frame(S3)
XXX3<-data.frame(yS3,XX)
#check missingness proportions by covariate
pMiss <- function(x){sum(is.na(x))/length(x)*100}
apply(XXX3,2,pMiss)
apply(XXX3,1,pMiss)
#Performing MICE in XXX3
library(mice)
md.pattern(XXX3)
library(VIM)
aggr_plot <- aggr(XXX3, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(XXX3), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
tempData3 <- mice(XXX3,m=5,maxit=50,meth='norm',seed=500)
summary(tempData3)
#to see the new imputed covariate values, say X1
tempData3$imp$x1
#showing covariate densities of observed data(blue) and imputed data( magenta)
densityplot(tempData3)
#getting back the completed data and binding with original y
completedData3 <- complete(tempData3,1)
completedDataS3<- subset(completedData3, select=c(x1,x2,x3,x4,x5))
XXXS3 <- data.frame(id = pData$id, y, completedDataS3)  # keep the subject id for the conditional logit
XXXS3
# Impute missing values using Bayesian imputation
imputed_data <- mice(XX[, c("x1", "x2", "x3", "x4", "x5")], method = "norm.predict") # Use correct variable names and select only the predictor variables
# Extract the completed dataset from the imputed_data object
XXX4 <- complete(imputed_data)
BayesData <- data.frame(id = pData$id, y, XXX4)
#parameter estimation, unconditional logit and conditional logit for Complete Original Data
glm.out1 = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=pData)
clogit1 <- clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = pData)
#parameter estimation, unconditional logit and conditional logit for Scenario S1
glm.out2 = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=XXXS1)
clogit2 <- clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = XXXS1)
#parameter estimation, unconditional logit and conditional logit for Scenario S3
glm.out3 = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=XXXS3)
clogit3 <- clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = XXXS3)
#parameter estimation, unconditional logit and conditional logit for BayesData
glm.outBayes = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=BayesData)
clogitBayes <- clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = BayesData)
#Calling out coefficients
A[i,] <- glm.out1$coef[j+1]
B[i,] <- clogit1$coef[j]
C[i,] <- glm.out2$coef[j+1]
D[i,] <- clogit2$coef[j]
E[i,] <- glm.out3$coef[j+1]
F[i,] <- clogit3$coef[j]
G[i,] <- glm.outBayes$coef[j+1]
H[i,] <- clogitBayes$coef[j] }
#mean bias#
mean.biasA<- c(b1-mean(A[,1]), b2-mean(A[,2]),b3-mean(A[,3]),b4-mean(A[,4]),b5-mean(A[,5]))
mean.biasB<- c(b1-mean(B[,1]), b2-mean(B[,2]),b3-mean(B[,3]),b4-mean(B[,4]),b5-mean(B[,5]))
mean.biasC<- c(b1-mean(C[,1]), b2-mean(C[,2]),b3-mean(C[,3]),b4-mean(C[,4]),b5-mean(C[,5]))
mean.biasD<- c(b1-mean(D[,1]), b2-mean(D[,2]),b3-mean(D[,3]),b4-mean(D[,4]),b5-mean(D[,5]))
mean.biasE<- c(b1-mean(E[,1]), b2-mean(E[,2]),b3-mean(E[,3]),b4-mean(E[,4]),b5-mean(E[,5]))
mean.biasF<- c(b1-mean(F[,1]), b2-mean(F[,2]),b3-mean(F[,3]),b4-mean(F[,4]),b5-mean(F[,5]))
mean.biasG<- c(b1-mean(G[,1]), b2-mean(G[,2]),b3-mean(G[,3]),b4-mean(G[,4]),b5-mean(G[,5]))
mean.biasH<- c(b1-mean(H[,1]), b2-mean(H[,2]),b3-mean(H[,3]),b4-mean(H[,4]),b5-mean(H[,5]))
#residual errors#
rmseA<-c(sqrt(mean((b1-A[,1])^2)), sqrt(mean((b2-A[,2])^2)), sqrt(mean((b3-A[,3])^2)), sqrt(mean((b4-A[,4])^2)), sqrt(mean((b5-A[,5])^2)))
rmseB<-c(sqrt(mean((b1-B[,1])^2)), sqrt(mean((b2-B[,2])^2)), sqrt(mean((b3-B[,3])^2)), sqrt(mean((b4-B[,4])^2)), sqrt(mean((b5-B[,5])^2)))
rmseC<-c(sqrt(mean((b1-C[,1])^2)), sqrt(mean((b2-C[,2])^2)), sqrt(mean((b3-C[,3])^2)), sqrt(mean((b4-C[,4])^2)), sqrt(mean((b5-C[,5])^2)))
rmseD<-c(sqrt(mean((b1-D[,1])^2)), sqrt(mean((b2-D[,2])^2)), sqrt(mean((b3-D[,3])^2)), sqrt(mean((b4-D[,4])^2)), sqrt(mean((b5-D[,5])^2)))
rmseE<-c(sqrt(mean((b1-E[,1])^2)), sqrt(mean((b2-E[,2])^2)), sqrt(mean((b3-E[,3])^2)), sqrt(mean((b4-E[,4])^2)), sqrt(mean((b5-E[,5])^2)))
rmseF<-c(sqrt(mean((b1-F[,1])^2)), sqrt(mean((b2-F[,2])^2)), sqrt(mean((b3-F[,3])^2)), sqrt(mean((b4-F[,4])^2)), sqrt(mean((b5-F[,5])^2)))
rmseG<-c(sqrt(mean((b1-G[,1])^2)), sqrt(mean((b2-G[,2])^2)), sqrt(mean((b3-G[,3])^2)), sqrt(mean((b4-G[,4])^2)), sqrt(mean((b5-G[,5])^2)))
rmseH<-c(sqrt(mean((b1-H[,1])^2)), sqrt(mean((b2-H[,2])^2)), sqrt(mean((b3-H[,3])^2)), sqrt(mean((b4-H[,4])^2)), sqrt(mean((b5-H[,5])^2)))
mean.biasA
rmseA
mean.biasB
rmseB
mean.biasC
rmseC
mean.biasD
rmseD
mean.biasE
rmseE
mean.biasF
rmseF
mean.biasG
rmseG
mean.biasH
rmseH
n= 100 with 30% missingness
set.seed(12345678) # Use this to make the randomly generated data the same each time you run the simulation#
library(survival) # provides clogit() and strata() used for the conditional logit fits
N.iter=1000
j= 1:5
A <- array(0, dim=c(N.iter,5))
B <- array(0, dim=c(N.iter,5))
C <- array(0, dim=c(N.iter,5))
D <- array(0, dim=c(N.iter,5))
E <- array(0, dim=c(N.iter,5))
F <- array(0, dim=c(N.iter,5))
G <- array(0, dim=c(N.iter,5))
H <- array(0, dim=c(N.iter,5))
for(i in 1:N.iter){
n=100 # vary n
t=2 # vary t and change time in pData
nt=n*t
b1=1
b2=-1
b3=1
b4=1
b5=1
ai=rnorm(n,0,1)
x1=rnorm(nt,0,1)
x2 <- runif(nt,0,1)
x3 <- rnorm(nt,0.5,0.5)
x4 <- rbinom(nt,2,0.65) # nt different such numbers
hit=rnorm(nt,0,1)
uit=rnorm(nt,0,1)
vit=log(abs(uit/(1+uit)))
x5 <- ifelse(x1+hit>0, 1, 0)
alphai <- sqrt(t)*sum(x1)/n+ai #alphai simulation
ci <- kronecker(alphai ,matrix(1,t,1))#kronecker of alphai
wit=ci+b1*x1+b2*x2+b3*x3+b4*x4+b5*x5
zit= wit+vit
y <- ifelse(zit>0, 1, 0)
pData <- data.frame(id = rep(paste("stdnt", 1:n, sep = "_"), each = t),time = rep(1:2, n), y,zit,ci,x1,x2,x3,x4,x5)
# Randomly Insert A Certain Proportion Of NAs Into A Dataframe #
pData2<- cbind(x1,x2,x3,x4,x5)
##################################
NAins <- NAinsert <- function(df, prop = .3){
n <- nrow(df)
m <- ncol(df)
num.to.na <- ceiling(prop*n*m)
id <- sample(0:(m*n-1), num.to.na, replace = FALSE)
rows <- id %/% m + 1
cols <- id %% m + 1
sapply(seq(num.to.na), function(x){
df[rows[x], cols[x]] <<- NA
}
)
return(df)
}
##############
# TRY IT OUT #
##############
XX<-NAins(pData2, .3)
XX
pData3<- data.frame(id = rep(paste("stdnt", 1:n, sep = "_"), each = t),time = rep(1:2, n), y,XX)
pData3
#Scenario 1 S1 when f(yi)=0
yS1<-y*0
XXX<-data.frame(yS1,XX)
#check missingness proportions by covariate
pMiss <- function(x){sum(is.na(x))/length(x)*100}
apply(XXX,2,pMiss)
apply(XXX,1,pMiss)
#Performing MICE in XXX
library(mice)
md.pattern(XXX)
library(VIM)
aggr_plot <- aggr(XXX, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(XXX), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
tempData <- mice(XXX,m=5,maxit=50,meth='pmm',seed=500)
summary(tempData)
#to see the new imputed covariate values, say X1
tempData$imp$x1
#showing covariate densities of observed data(blue) and imputed data( magenta)
densityplot(tempData)
#getting back the completed data and binding with original y
completedData <- complete(tempData,1)
completedDataS1<- subset(completedData, select=c(x1,x2,x3,x4,x5))
XXXS1 <- data.frame(id = pData$id, y, completedDataS1)  # keep the subject id so strata(id) is available
XXXS1
#parameter estimation, unconditional logit and conditional logit for Scenario S1
glm.out1 = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=pData)
clogit1 <- clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = pData)  # strata() must be part of the formula
glm.out2 = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=XXXS1)
clogit2 <- clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = XXXS1)
#Scenario 3 S3 when f(yi)=E(yit)=p
S3=rep(mean(y),times=nt)
yS3<-data.frame(S3)
XXX3<-data.frame(yS3,XX)
#check missingness proportions by covariate
pMiss <- function(x){sum(is.na(x))/length(x)*100}
apply(XXX3,2,pMiss)
apply(XXX3,1,pMiss)
#Performing MICE in XXX3
library(mice)
md.pattern(XXX3)
library(VIM)
aggr_plot <- aggr(XXX3, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(XXX3), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
tempData3 <- mice(XXX3,m=5,maxit=50,meth='norm',seed=500)
summary(tempData3)
#to see the new imputed covariate values, say X1
tempData3$imp$x1
#showing covariate densities of observed data(blue) and imputed data( magenta)
densityplot(tempData3)
#getting back the completed data and binding with original y
completedData3 <- complete(tempData3,1)
completedDataS3<- subset(completedData3, select=c(x1,x2,x3,x4,x5))
XXXS3 <- data.frame(id = pData$id, y, completedDataS3)  # keep the subject id for the conditional logit
XXXS3
# Impute missing values using Bayesian imputation
imputed_data <- mice(XX[, c("x1", "x2", "x3", "x4", "x5")], method = "norm.predict") # Use correct variable names and select only the predictor variables
# Extract the completed dataset from the imputed_data object
XXX4 <- complete(imputed_data)
BayesData <- data.frame(id = pData$id, y, XXX4)
#parameter estimation, unconditional logit and conditional logit for Complete Original Data
glm.out1 = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=pData)
clogit1 <- clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = pData)
#parameter estimation, unconditional logit and conditional logit for Scenario S1
glm.out2 = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=XXXS1)
clogit2 <- clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = XXXS1)
#parameter estimation, unconditional logit and conditional logit for Scenario S3
glm.out3 = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=XXXS3)
clogit3 <- clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = XXXS3)
#parameter estimation, unconditional logit and conditional logit for BayesData
glm.outBayes = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=BayesData)
clogitBayes <- clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = BayesData)
#Calling out coefficients
A[i,] <- glm.out1$coef[j+1]
B[i,] <- clogit1$coef[j]
C[i,] <- glm.out2$coef[j+1]
D[i,] <- clogit2$coef[j]
E[i,] <- glm.out3$coef[j+1]
F[i,] <- clogit3$coef[j]
G[i,] <- glm.outBayes$coef[j+1]
H[i,] <- clogitBayes$coef[j] }
#mean bias#
mean.biasA<- c(b1-mean(A[,1]), b2-mean(A[,2]),b3-mean(A[,3]),b4-mean(A[,4]),b5-mean(A[,5]))
mean.biasB<- c(b1-mean(B[,1]), b2-mean(B[,2]),b3-mean(B[,3]),b4-mean(B[,4]),b5-mean(B[,5]))
mean.biasC<- c(b1-mean(C[,1]), b2-mean(C[,2]),b3-mean(C[,3]),b4-mean(C[,4]),b5-mean(C[,5]))
mean.biasD<- c(b1-mean(D[,1]), b2-mean(D[,2]),b3-mean(D[,3]),b4-mean(D[,4]),b5-mean(D[,5]))
mean.biasE<- c(b1-mean(E[,1]), b2-mean(E[,2]),b3-mean(E[,3]),b4-mean(E[,4]),b5-mean(E[,5]))
mean.biasF<- c(b1-mean(F[,1]), b2-mean(F[,2]),b3-mean(F[,3]),b4-mean(F[,4]),b5-mean(F[,5]))
mean.biasG<- c(b1-mean(G[,1]), b2-mean(G[,2]),b3-mean(G[,3]),b4-mean(G[,4]),b5-mean(G[,5]))
mean.biasH<- c(b1-mean(H[,1]), b2-mean(H[,2]),b3-mean(H[,3]),b4-mean(H[,4]),b5-mean(H[,5]))
#residual errors#
rmseA<-c(sqrt(mean((b1-A[,1])^2)), sqrt(mean((b2-A[,2])^2)), sqrt(mean((b3-A[,3])^2)), sqrt(mean((b4-A[,4])^2)), sqrt(mean((b5-A[,5])^2)))
rmseB<-c(sqrt(mean((b1-B[,1])^2)), sqrt(mean((b2-B[,2])^2)), sqrt(mean((b3-B[,3])^2)), sqrt(mean((b4-B[,4])^2)), sqrt(mean((b5-B[,5])^2)))
rmseC<-c(sqrt(mean((b1-C[,1])^2)), sqrt(mean((b2-C[,2])^2)), sqrt(mean((b3-C[,3])^2)), sqrt(mean((b4-C[,4])^2)), sqrt(mean((b5-C[,5])^2)))
rmseD<-c(sqrt(mean((b1-D[,1])^2)), sqrt(mean((b2-D[,2])^2)), sqrt(mean((b3-D[,3])^2)), sqrt(mean((b4-D[,4])^2)), sqrt(mean((b5-D[,5])^2)))
rmseE<-c(sqrt(mean((b1-E[,1])^2)), sqrt(mean((b2-E[,2])^2)), sqrt(mean((b3-E[,3])^2)), sqrt(mean((b4-E[,4])^2)), sqrt(mean((b5-E[,5])^2)))
rmseF<-c(sqrt(mean((b1-F[,1])^2)), sqrt(mean((b2-F[,2])^2)), sqrt(mean((b3-F[,3])^2)), sqrt(mean((b4-F[,4])^2)), sqrt(mean((b5-F[,5])^2)))
rmseG<-c(sqrt(mean((b1-G[,1])^2)), sqrt(mean((b2-G[,2])^2)), sqrt(mean((b3-G[,3])^2)), sqrt(mean((b4-G[,4])^2)), sqrt(mean((b5-G[,5])^2)))
rmseH<-c(sqrt(mean((b1-H[,1])^2)), sqrt(mean((b2-H[,2])^2)), sqrt(mean((b3-H[,3])^2)), sqrt(mean((b4-H[,4])^2)), sqrt(mean((b5-H[,5])^2)))
mean.biasA
rmseA
mean.biasB
rmseB
mean.biasC
rmseC
mean.biasD
rmseD
mean.biasE
rmseE
mean.biasF
rmseF
mean.biasG
rmseG
mean.biasH
rmseH
# n = 250 with 10% missingness
set.seed(123456789) # Use this to make the randomly generated data the same each time you run the simulation#
N.iter=1000
j= 1:5
A <- array(0, dim=c(N.iter,5))
B <- array(0, dim=c(N.iter,5))
C <- array(0, dim=c(N.iter,5))
D <- array(0, dim=c(N.iter,5))
E <- array(0, dim=c(N.iter,5))
F <- array(0, dim=c(N.iter,5))
G <- array(0, dim=c(N.iter,5))
H <- array(0, dim=c(N.iter,5))
for(i in 1:N.iter){
n=250 # vary n
t=2 # vary t and change time in pData
nt=n*t
b1=1
b2=-1
b3=1
b4=1
b5=1
ai=rnorm(n,0,1)
x1=rnorm(nt,0,1)
x2 <- runif(nt,0,1)
x3 <- rnorm(nt,0.5,0.5)
x4 <- rbinom(nt,2,0.65) # nt different such numbers
hit=rnorm(nt,0,1)
uit=rnorm(nt,0,1)
vit=log(abs(uit/(1+uit)))
x5 <- ifelse(x1+hit>0, 1, 0)
alphai <- sqrt(t)*sum(x1)/n+ai #alphai simulation
ci <- kronecker(alphai ,matrix(1,t,1))#kronecker of alphai
wit=ci+b1*x1+b2*x2+b3*x3+b4*x4+b5*x5
zit= wit+vit
y <- ifelse(zit>0, 1, 0)
pData <- data.frame(id = rep(paste("stdnt", 1:n, sep = "_"), each = t),time = rep(1:2, n), y,zit,ci,x1,x2,x3,x4,x5)
# Randomly Insert A Certain Proportion Of NAs Into A Dataframe #
pData2<- cbind(x1,x2,x3,x4,x5)
##################################
NAins <- NAinsert <- function(df, prop = .1){
n <- nrow(df)
m <- ncol(df)
num.to.na <- ceiling(prop*n*m)
id <- sample(0:(m*n-1), num.to.na, replace = FALSE)
rows <- id %/% m + 1
cols <- id %% m + 1
sapply(seq(num.to.na), function(x){
df[rows[x], cols[x]] <<- NA
}
)
return(df)
}
##############
# TRY IT OUT #
##############
XX<-NAins(pData2, .1)
XX
pData3<- data.frame(id = rep(paste("stdnt", 1:n, sep = "_"), each = t),time = rep(1:2, n), y,XX)
pData3
#Scenario 1 S1 when f(yi)=0
yS1<-y*0
XXX<-data.frame(yS1,XX)
#check missingness proportions by covariate
pMiss <- function(x){sum(is.na(x))/length(x)*100}
apply(XXX,2,pMiss)
apply(XXX,1,pMiss)
#Performing MICE in XXX
library(mice)
md.pattern(XXX)
library(VIM)
aggr_plot <- aggr(XXX, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(XXX), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
tempData <- mice(XXX,m=5,maxit=50,meth='pmm',seed=500)
summary(tempData)
#to see the new imputed covariate values, say X1
tempData$imp$x1
#showing covariate densities of observed data(blue) and imputed data( magenta)
densityplot(tempData)
#getting back the completed data and binding with original y
completedData <- complete(tempData,1)
completedDataS1<- subset(completedData, select=c(x1,x2,x3,x4,x5))
XXXS1 <- data.frame(id = pData$id, y, completedDataS1) # keep id for the conditional logit strata
XXXS1
#parameter estimation, unconditional logit and conditional logit for Complete Original Data and Scenario S1
glm.out1 = glm(y ~ x1 + x2 + x3 + x4 + x5, family = binomial(logit), data = pData)
clogit1 = clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = pData)
glm.out2 = glm(y ~ x1 + x2 + x3 + x4 + x5, family = binomial(logit), data = XXXS1)
clogit2 = clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = XXXS1)
#Scenario 3 S3 when f(yi)=E(yit)=p
S3=rep(mean(y),times=nt)
yS3<-data.frame(S3)
XXX3<-data.frame(yS3,XX)
#check missingness proportions by covariate
pMiss <- function(x){sum(is.na(x))/length(x)*100}
apply(XXX3,2,pMiss)
apply(XXX3,1,pMiss)
#Performing MICE in XXX3
library(mice)
md.pattern(XXX3)
library(VIM)
aggr_plot <- aggr(XXX3, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(XXX3), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
tempData3 <- mice(XXX3,m=5,maxit=50,meth='norm',seed=500)
summary(tempData3)
#to see the new imputed covariate values, say X1
tempData3$imp$x1
#showing covariate densities of observed data(blue) and imputed data( magenta)
densityplot(tempData3)
#getting back the completed data and binding with original y
completedData3 <- complete(tempData3,1)
completedDataS3<- subset(completedData3, select=c(x1,x2,x3,x4,x5))
XXXS3 <- data.frame(id = pData$id, y, completedDataS3) # keep id for the conditional logit strata
XXXS3
# Impute missing covariate values for the Bayesian-imputation comparison
imputed_data <- mice(XX[, c("x1", "x2", "x3", "x4", "x5")], method = "norm.predict") # impute the covariate columns only
# Extract the first completed data set from the imputed_data object
XXX4 <- complete(imputed_data)
BayesData <- data.frame(id = pData$id, y, XXX4) # keep id for the conditional logit strata
#parameter estimation, unconditional logit and conditional logit for Complete Original Data
glm.out1 = glm(y ~ x1 + x2 + x3 + x4 + x5, family = binomial(logit), data = pData)
clogit1 = clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = pData)
#parameter estimation, unconditional logit and conditional logit for Scenario S1
glm.out2 = glm(y ~ x1 + x2 + x3 + x4 + x5, family = binomial(logit), data = XXXS1)
clogit2 = clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = XXXS1)
#parameter estimation, unconditional logit and conditional logit for Scenario S3
glm.out3 = glm(y ~ x1 + x2 + x3 + x4 + x5, family = binomial(logit), data = XXXS3)
clogit3 = clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = XXXS3)
#parameter estimation, unconditional logit and conditional logit for BayesData
glm.outBayes = glm(y ~ x1 + x2 + x3 + x4 + x5, family = binomial(logit), data = BayesData)
clogitBayes = clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = BayesData)
#Calling out coefficients
A[i,] <- glm.out1$coef[j+1]
B[i,] <- clogit1$coef[j]
C[i,] <- glm.out2$coef[j+1]
D[i,] <- clogit2$coef[j]
E[i,] <- glm.out3$coef[j+1]
F[i,] <- clogit3$coef[j]
G[i,] <- glm.outBayes$coef[j+1]
H[i,] <- clogitBayes$coef[j] }
#mean bias#
mean.biasA<- c(b1-mean(A[,1]), b2-mean(A[,2]),b3-mean(A[,3]),b4-mean(A[,4]),b5-mean(A[,5]))
mean.biasB<- c(b1-mean(B[,1]), b2-mean(B[,2]),b3-mean(B[,3]),b4-mean(B[,4]),b5-mean(B[,5]))
mean.biasC<- c(b1-mean(C[,1]), b2-mean(C[,2]),b3-mean(C[,3]),b4-mean(C[,4]),b5-mean(C[,5]))
mean.biasD<- c(b1-mean(D[,1]), b2-mean(D[,2]),b3-mean(D[,3]),b4-mean(D[,4]),b5-mean(D[,5]))
mean.biasE<- c(b1-mean(E[,1]), b2-mean(E[,2]),b3-mean(E[,3]),b4-mean(E[,4]),b5-mean(E[,5]))
mean.biasF<- c(b1-mean(F[,1]), b2-mean(F[,2]),b3-mean(F[,3]),b4-mean(F[,4]),b5-mean(F[,5]))
mean.biasG<- c(b1-mean(G[,1]), b2-mean(G[,2]),b3-mean(G[,3]),b4-mean(G[,4]),b5-mean(G[,5]))
mean.biasH<- c(b1-mean(H[,1]), b2-mean(H[,2]),b3-mean(H[,3]),b4-mean(H[,4]),b5-mean(H[,5]))
#root mean squared errors (RMSE) of the estimates#
rmseA<-c(sqrt(mean((b1-A[,1])^2)), sqrt(mean((b2-A[,2])^2)), sqrt(mean((b3-A[,3])^2)), sqrt(mean((b4-A[,4])^2)), sqrt(mean((b5-A[,5])^2)))
rmseB<-c(sqrt(mean((b1-B[,1])^2)), sqrt(mean((b2-B[,2])^2)), sqrt(mean((b3-B[,3])^2)), sqrt(mean((b4-B[,4])^2)), sqrt(mean((b5-B[,5])^2)))
rmseC<-c(sqrt(mean((b1-C[,1])^2)), sqrt(mean((b2-C[,2])^2)), sqrt(mean((b3-C[,3])^2)), sqrt(mean((b4-C[,4])^2)), sqrt(mean((b5-C[,5])^2)))
rmseD<-c(sqrt(mean((b1-D[,1])^2)), sqrt(mean((b2-D[,2])^2)), sqrt(mean((b3-D[,3])^2)), sqrt(mean((b4-D[,4])^2)), sqrt(mean((b5-D[,5])^2)))
rmseE<-c(sqrt(mean((b1-E[,1])^2)), sqrt(mean((b2-E[,2])^2)), sqrt(mean((b3-E[,3])^2)), sqrt(mean((b4-E[,4])^2)), sqrt(mean((b5-E[,5])^2)))
rmseF<-c(sqrt(mean((b1-F[,1])^2)), sqrt(mean((b2-F[,2])^2)), sqrt(mean((b3-F[,3])^2)), sqrt(mean((b4-F[,4])^2)), sqrt(mean((b5-F[,5])^2)))
rmseG<-c(sqrt(mean((b1-G[,1])^2)), sqrt(mean((b2-G[,2])^2)), sqrt(mean((b3-G[,3])^2)), sqrt(mean((b4-G[,4])^2)), sqrt(mean((b5-G[,5])^2)))
rmseH<-c(sqrt(mean((b1-H[,1])^2)), sqrt(mean((b2-H[,2])^2)), sqrt(mean((b3-H[,3])^2)), sqrt(mean((b4-H[,4])^2)), sqrt(mean((b5-H[,5])^2)))
mean.biasA
rmseA
mean.biasB
rmseB
mean.biasC
rmseC
mean.biasD
rmseD
mean.biasE
rmseE
mean.biasF
rmseF
mean.biasG
rmseG
mean.biasH
rmseH
# n = 250 with 30% missingness
set.seed(1234567899) # Use this to make the randomly generated data the same each time you run the simulation#
N.iter=1000
j= 1:5
A <- array(0, dim=c(N.iter,5))
B <- array(0, dim=c(N.iter,5))
C <- array(0, dim=c(N.iter,5))
D <- array(0, dim=c(N.iter,5))
E <- array(0, dim=c(N.iter,5))
F <- array(0, dim=c(N.iter,5))
G <- array(0, dim=c(N.iter,5))
H <- array(0, dim=c(N.iter,5))
for(i in 1:N.iter){
n=250 # vary n
t=2 # vary t and change time in pData
nt=n*t
b1=1
b2=-1
b3=1
b4=1
b5=1
ai=rnorm(n,0,1)
x1=rnorm(nt,0,1)
x2 <- runif(nt,0,1)
x3 <- rnorm(nt,0.5,0.5)
x4 <- rbinom(nt,2,0.65) # nt different such numbers
hit=rnorm(nt,0,1)
uit=rnorm(nt,0,1)
vit=log(abs(uit/(1+uit)))
x5 <- ifelse(x1+hit>0, 1, 0)
alphai <- sqrt(t)*sum(x1)/n+ai #alphai simulation
ci <- kronecker(alphai ,matrix(1,t,1)) #kronecker of alphai
wit=ci+b1*x1+b2*x2+b3*x3+b4*x4+b5*x5
zit= wit+vit
y <- ifelse(zit>0, 1, 0)
pData <- data.frame(id = rep(paste("stdnt", 1:n, sep = "_"), each = t),time = rep(1:2, n), y,zit,ci,x1,x2,x3,x4,x5)
# Randomly Insert A Certain Proportion Of NAs Into A Dataframe #
pData2<- cbind(x1,x2,x3,x4,x5)
##################################
NAins <- NAinsert <- function(df, prop = .3){
n <- nrow(df)
m <- ncol(df)
num.to.na <- ceiling(prop*n*m)
id <- sample(0:(m*n-1), num.to.na, replace = FALSE)
rows <- id %/% m + 1
cols <- id %% m + 1
sapply(seq(num.to.na), function(x){
df[rows[x], cols[x]] <<- NA
}
)
return(df)
}
##############
# TRY IT OUT #
##############
XX<-NAins(pData2, .3)
XX
pData3<- data.frame(id = rep(paste("stdnt", 1:n, sep = "_"), each = t),time = rep(1:2, n), y,XX)
pData3
#Scenario 1 S1 when f(yi)=0
yS1<-y*0
XXX<-data.frame(yS1,XX)
#check missingness proportions by covariate
pMiss <- function(x){sum(is.na(x))/length(x)*100}
apply(XXX,2,pMiss)
apply(XXX,1,pMiss)
#Performing MICE in XXX
library(mice)
md.pattern(XXX)
library(VIM)
aggr_plot <- aggr(XXX, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(XXX), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
tempData <- mice(XXX,m=5,maxit=50,meth='pmm',seed=500)
summary(tempData)
#to see the new imputed covariate values, say X1
tempData$imp$x1
#showing covariate densities of observed data(blue) and imputed data( magenta)
densityplot(tempData)
#getting back the completed data and binding with original y
completedData <- complete(tempData,1)
completedDataS1<- subset(completedData, select=c(x1,x2,x3,x4,x5))
XXXS1 <- data.frame(id = pData$id, y, completedDataS1) # keep id for the conditional logit strata
XXXS1
#parameter estimation, unconditional logit and conditional logit for Complete Original Data and Scenario S1
glm.out1 = glm(y ~ x1 + x2 + x3 + x4 + x5, family = binomial(logit), data = pData)
clogit1 = clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = pData)
glm.out2 = glm(y ~ x1 + x2 + x3 + x4 + x5, family = binomial(logit), data = XXXS1)
clogit2 = clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = XXXS1)
#Scenario 3 S3 when f(yi)=E(yit)=p
S3=rep(mean(y),times=nt)
yS3<-data.frame(S3)
XXX3<-data.frame(yS3,XX)
#check missingness proportions by covariate
pMiss <- function(x){sum(is.na(x))/length(x)*100}
apply(XXX3,2,pMiss)
apply(XXX3,1,pMiss)
#Performing MICE in XXX3
library(mice)
md.pattern(XXX3)
library(VIM)
aggr_plot <- aggr(XXX3, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(XXX3), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
tempData3 <- mice(XXX3,m=5,maxit=50,meth='norm',seed=500)
summary(tempData3)
#to see the new imputed covariate values, say X1
tempData3$imp$x1
#showing covariate densities of observed data(blue) and imputed data( magenta)
densityplot(tempData3)
#getting back the completed data and binding with original y
completedData3 <- complete(tempData3,1)
completedDataS3<- subset(completedData3, select=c(x1,x2,x3,x4,x5))
XXXS3 <- data.frame(id = pData$id, y, completedDataS3) # keep id for the conditional logit strata
XXXS3
# Impute missing covariate values for the Bayesian-imputation comparison
imputed_data <- mice(XX[, c("x1", "x2", "x3", "x4", "x5")], method = "norm.predict") # impute the covariate columns only
# Extract the first completed data set from the imputed_data object
XXX4 <- complete(imputed_data)
BayesData <- data.frame(id = pData$id, y, XXX4) # keep id for the conditional logit strata
#parameter estimation, unconditional logit and conditional logit for Complete Original Data
glm.out1 = glm(y ~ x1 + x2 + x3 + x4 + x5, family = binomial(logit), data = pData)
clogit1 = clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = pData)
#parameter estimation, unconditional logit and conditional logit for Scenario S1
glm.out2 = glm(y ~ x1 + x2 + x3 + x4 + x5, family = binomial(logit), data = XXXS1)
clogit2 = clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = XXXS1)
#parameter estimation, unconditional logit and conditional logit for Scenario S3
glm.out3 = glm(y ~ x1 + x2 + x3 + x4 + x5, family = binomial(logit), data = XXXS3)
clogit3 = clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = XXXS3)
#parameter estimation, unconditional logit and conditional logit for BayesData
glm.outBayes = glm(y ~ x1 + x2 + x3 + x4 + x5, family = binomial(logit), data = BayesData)
clogitBayes = clogit(y ~ x1 + x2 + x3 + x4 + x5 + strata(id), data = BayesData)
#Calling out coefficients
A[i,] <- glm.out1$coef[j+1]
B[i,] <- clogit1$coef[j]
C[i,] <- glm.out2$coef[j+1]
D[i,] <- clogit2$coef[j]
E[i,] <- glm.out3$coef[j+1]
F[i,] <- clogit3$coef[j]
G[i,] <- glm.outBayes$coef[j+1]
H[i,] <- clogitBayes$coef[j] }
#mean bias#
mean.biasA<- c(b1-mean(A[,1]), b2-mean(A[,2]),b3-mean(A[,3]),b4-mean(A[,4]),b5-mean(A[,5]))
mean.biasB<- c(b1-mean(B[,1]), b2-mean(B[,2]),b3-mean(B[,3]),b4-mean(B[,4]),b5-mean(B[,5]))
mean.biasC<- c(b1-mean(C[,1]), b2-mean(C[,2]),b3-mean(C[,3]),b4-mean(C[,4]),b5-mean(C[,5]))
mean.biasD<- c(b1-mean(D[,1]), b2-mean(D[,2]),b3-mean(D[,3]),b4-mean(D[,4]),b5-mean(D[,5]))
mean.biasE<- c(b1-mean(E[,1]), b2-mean(E[,2]),b3-mean(E[,3]),b4-mean(E[,4]),b5-mean(E[,5]))
mean.biasF<- c(b1-mean(F[,1]), b2-mean(F[,2]),b3-mean(F[,3]),b4-mean(F[,4]),b5-mean(F[,5]))
mean.biasG<- c(b1-mean(G[,1]), b2-mean(G[,2]),b3-mean(G[,3]),b4-mean(G[,4]),b5-mean(G[,5]))
mean.biasH<- c(b1-mean(H[,1]), b2-mean(H[,2]),b3-mean(H[,3]),b4-mean(H[,4]),b5-mean(H[,5]))
#root mean squared errors (RMSE) of the estimates#
rmseA<-c(sqrt(mean((b1-A[,1])^2)), sqrt(mean((b2-A[,2])^2)), sqrt(mean((b3-A[,3])^2)), sqrt(mean((b4-A[,4])^2)), sqrt(mean((b5-A[,5])^2)))
rmseB<-c(sqrt(mean((b1-B[,1])^2)), sqrt(mean((b2-B[,2])^2)), sqrt(mean((b3-B[,3])^2)), sqrt(mean((b4-B[,4])^2)), sqrt(mean((b5-B[,5])^2)))
rmseC<-c(sqrt(mean((b1-C[,1])^2)), sqrt(mean((b2-C[,2])^2)), sqrt(mean((b3-C[,3])^2)), sqrt(mean((b4-C[,4])^2)), sqrt(mean((b5-C[,5])^2)))
rmseD<-c(sqrt(mean((b1-D[,1])^2)), sqrt(mean((b2-D[,2])^2)), sqrt(mean((b3-D[,3])^2)), sqrt(mean((b4-D[,4])^2)), sqrt(mean((b5-D[,5])^2)))
rmseE<-c(sqrt(mean((b1-E[,1])^2)), sqrt(mean((b2-E[,2])^2)), sqrt(mean((b3-E[,3])^2)), sqrt(mean((b4-E[,4])^2)), sqrt(mean((b5-E[,5])^2)))
rmseF<-c(sqrt(mean((b1-F[,1])^2)), sqrt(mean((b2-F[,2])^2)), sqrt(mean((b3-F[,3])^2)), sqrt(mean((b4-F[,4])^2)), sqrt(mean((b5-F[,5])^2)))
rmseG<-c(sqrt(mean((b1-G[,1])^2)), sqrt(mean((b2-G[,2])^2)), sqrt(mean((b3-G[,3])^2)), sqrt(mean((b4-G[,4])^2)), sqrt(mean((b5-G[,5])^2)))
rmseH<-c(sqrt(mean((b1-H[,1])^2)), sqrt(mean((b2-H[,2])^2)), sqrt(mean((b3-H[,3])^2)), sqrt(mean((b4-H[,4])^2)), sqrt(mean((b5-H[,5])^2)))
mean.biasA
rmseA
mean.biasB
rmseB
mean.biasC
rmseC
mean.biasD
rmseD
mean.biasE
rmseE
mean.biasF
rmseF
mean.biasG
rmseG
mean.biasH
rmseH
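# A compact tabulation of the results above (a sketch; the estimator labels are ours, and the
# objects mean.bias* and rmse* are those created for the current n and missingness setting):
estimators <- c("logit, complete data", "clogit, complete data",
                "logit, MICE S1", "clogit, MICE S1",
                "logit, MICE S3", "clogit, MICE S3",
                "logit, Bayesian imputation", "clogit, Bayesian imputation")
bias.tab <- rbind(mean.biasA, mean.biasB, mean.biasC, mean.biasD,
                  mean.biasE, mean.biasF, mean.biasG, mean.biasH)
rmse.tab <- rbind(rmseA, rmseB, rmseC, rmseD, rmseE, rmseF, rmseG, rmseH)
results <- data.frame(estimator = estimators, bias.tab, rmse.tab)
colnames(results)[2:11] <- c(paste0("bias.b", 1:5), paste0("rmse.b", 1:5))
results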