Likelihood Methods for Basic Stratified Sampling, with Application to Von Bertalanffy Growth Model Estimation ()
1. Introduction
In stratified random sampling (SRS), the population or a random sample of the population is partitioned into relatively homogeneous subgroups, or strata, and then random samples are taken independently in each stratum for full observation. Such sampling design may also be regarded as a kind of two-phase sampling, with the population or the large sample before partitioning being the first phase sample, and the smaller and more extensive subsamples after partitioning being the second phase samples.
Practical implementations of SRS frequently fall into two categories as classified by [1]: 1) basic stratified sampling (BSS) where the maximum second phase subsample size (BSS1) or subsampling fraction (BSS2) in each stratum is prefixed, and 2) variable probability sampling (VPS) in which sequential units are independently generated from a model and then classified into strata where they are selected for full observation with pre-specified probabilities. [2] classified BSS2 as VPS, and hence all the inference methods for VPS are also suitable for BSS2.
Assume that there are a total of N sampling units on which the stratified sampling is conducted. Let
and
,
, denote respectively the vectors of responses and covariates of the ith individual generated from the joint distribution
, with
being a vector of all the parameters describing the conditional distribution of
given
. In SRS
are fully observed only for a subset of size n of the N units, which are called complete data in this paper, and only a subset
of
is observed for the other
units, which are called incomplete data.
In SRS the unobserved elements of
and/or
are missing data, and missingness can be fully accounted for by variable
which is observed for all the N units; that is, the unsampled variables are missing at random (MAR) in the terminology of [3]. In addition, for BSS and VPS, given the observed data, the missing probability for all the missing data is a constant involving no parameters
. As a result, the likelihood, which is called full information likelihood, is given by (see e.g. [1])
(1)
where
is the density function of
,
enumerates the second phase complete data, and
enumerates the first phase incomplete data.
If the response
is not involved in the stratification, namely, vector
contains no elements of vector
,
is independent of parameters
, and the full likelihood
reduces to
, which is trivial since neither the sampling scheme nor the covariate distribution
needs to be taken into account. In this paper we consider only the SRS where the response
is involved in stratification, which is often referred to as response-selective stratified sampling (RSSS).
In fisheries surveys, length-stratified age sampling (LSAS) is one of the most popular strategies for sampling the age distribution of a fish population. In the first phase of LSAS a large amount of caught fish of a certain species is measured for length, and classified into length strata (e.g. two centimeters, five centimeters). In the second phase a pre-specified small number of fish are randomly selected from each stratum for age measurement. LSAS is BSS, and since growth models generally describe how length increases as a function of age (i.e. length is the response and age is the covariate), it is response-selective. LSAS has been conducted world-wide for several decades. For example, the Canadian Department of Fisheries Oceans (DFO) conducts annual surveys since the 1970’s and uses LSAS for age sampling for many species such as cod and American plaice. Millions of length-at-age data have been accumulated for each species, which are invaluable for fisheries stock assessment and ocean ecosystem studies. In this paper we focus on BSS, with some of the methods and conclusions also applicable to VPS.
[4] suggested to model the age distribution of fish in a survey using the Gamma distribution. [5] also assumed a Gamma age distribution in their hierarchical model of growth for many fish populations, and they showed that parameter estimates did not change much when a more flexible parametric age distribution was used. [6] showed that a flexible Normal mixture distribution for age distribution is more robust to misspecification of the age distribution. Following these studies, in this paper we focus on the case where a valid parametric covariate distribution model is available. For the examples in the simulation studies and real data analysis, we simply use a Gamma distribution for age so that our comparison among various inference approaches is less influenced by numerical issues related to integrating over a complicated covariate distribution.
This is the motivation of this paper. [1] and [7] gave the complete-data likelihood for VPS, which is based solely on the second phase complete data and can be used when the information is not retained for units not selected for full observation. In this study we would like to derive a likelihood function for BSS requiring only the second phase complete data and the total sample size N, which can be used when the first phase BSS data is not available. Some authors ([e.g. [8]) applied a pseudoconditional likelihood approach [1] to LSAS data. We improved this approach to accommodate the first phase incomplete data and the complexities in fisheries LSAS. We conducted simulation studies to compare the new and existing likelihood and pseudolikelihood approaches that have been used or are conveniently applicable to fisheries LSAS. Our purpose is to identify the approaches with the best performance.
The outline of this paper is as follows. In Section 2 we define notations and review the likelihood and pseudolikelihood approaches relevant to this study. In Section 3 we derive the complete-data density function, complete-data likelihood and full-data likelihood for BSS. Application of an empirical proportion approach, which is an improved version of the pseudoconditional likelihood approach, to BSS is explored in Section 4. Results from simulation studies based on a linear model with between-individual (BI) variation and a Von Bertalanffy growth model with BI variation are presented in Section 5 to compare the performance of all these new and existing estimators discussed in this paper. The most promising estimators are then further illustrated in Section 6 by fitting the VonB model with BI variation to growth data for American plaice (Hippoglossoides platessoides) collected by DFO. Some further discussions are provided in Section 7.
2. Notation, Likelihoods and Pseudolikelihoods
Suppose that N units
, are generated from the joint distribution
. As mentioned previously we always assume that an appropriate parametric covariate distribution is available, then
here includes not only the parameters describing conditional distribution of response
given covariate
, but also the parameters defining the covariate distribution. The range of
is divided into H exhaustive and mutually exclusive strata
. Denote the probability for
to fall into the hth stratum as
, namely,
(2)
Define the indicator variable
(3)
Because BSS2 can be classified as VPS [2], in the following we use BSS specially for BSS1. For BSS we assume that in each stratum
there are
units from which
units are randomly selected for full observation of
, with
and
. For the remaining
units the values of
are only partially observed for a subset
. Here
is the maximum sample size for full observation and
(4)
Although the likelihood for BSS (4) is given by (1), several published studies use other likelihoods, and some of these are described as follows.
[9] studied maximizing the likelihood function
for fitting regression models, and called this approach the conditional maximum likelihood. Under the assumption that a valid parametric covariate distribution is available, and the randomness in
can be neglected for all the strata so that the
in (4) are always equal to
in all strata, in the Appendix we show that
(5)
and the constant of proportionality does not depend on
. The conditional likelihood then becomes
(6)
which is adopted in [10] and [11].
Weighted pseudo-likelihood estimators have been studied extensively since the 1980’s for problems involving response-selective sampling. For this topic we refer to [12] - [18]. In the most basic and popular version of this approach, the log-likelihood function if all the N units were fully observed,
with ln denoting natural logarithm, is estimated by the Horvitz-Thompson (HT) method based on the n units that are actually observed in full,
(7)
Although this weighted log-pseudo-likelihood (7) may provide an unbiased parameter estimating equation, the HT approach is known to be inefficient, and can be seriously so in some situations such as when the sample unit values are not approximately proportional to the inclusion probabilities ([19]; [20], page 103-104).
An approach for addressing this inefficiency issue is to adjust the standard HT weights by using the whole set of incomplete data, namely, those with only a subset of
measured but available for all the N sample units (see e.g. [17] [18] [21]). We call this method the calibrated weighted likelihood approach. As an implementation of the calibrated weighted likelihood approach, in this study we modified the traditional Horvitz-Thompson weights by minimizing the chi-squared distance (see Equation (1.1) in [21]) between the original and modified weights subject to the constraint
(8)
where
are the modified weights. Similarly one can also calibrate up to higher order moments or calibrate the empirical distributions by imposing the constraints
(9)
where i enumerates all the subjects selected for full observation, and
if
and 0 otherwise. Nevertheless, these calibration strategies may not produce better estimates than (8) does, according to our simulation studies. Hence, in this paper we only report results with constraint (8). The calibrated weighted likelihood approach under all these constraints can be conveniently implemented with Equation (9) in [17].
In some applications (e.g. [8]) researchers use an approximate density based on variable probability sampling (VPS). In VPS, units are randomly selected for full observation from the
partially observed units, with subsampling probabilities
that vary for each stratum h. The density approximation is based on the empirical subsampling probability
(see Equation (2) in [2]),
(10)
Parameters are estimated based on the likelihood function defined from (10). Note that with the availability of a valid covariate distribution, a density function similar to (10) can also be constructed for the
incomplete observations (i.e. those partially observed units). In LSAS there are always some empty strata with
but non-negligible occupation probability
, which are missing in the denominator of (10). We will address these issues in Section 4 and call the improved likelihood the “empirical proportion (EP) likelihood”.
3. Complete-Data Likelihood for BSS
As mentioned previously, the methods for VPS are applicable to BSS2, and the complete-data likelihood for VPS is given in [1] and [7]. Therefore in this section we only consider BSS1 and refer to BSS1 as BSS for convenience.
We denote
and
respectively as the binomial probability mass function and cumulative probability, with number of successes x, total number of events N and success probability p. The density function for a unit selected for full observation in BSS is denoted as
with “BC” indicating “BSS complete data”.
Theorem 1. In BSS the density function of a unit
selected for full observation is given by
(11)
if
.
The proof of Theorem 1 is given in the Appendix. As suggested in [9], [10], [22] and [23], the BSS complete-data (BC) likelihood can be constructed as
(12)
With the same arguments for deriving (11), the density function for the partially observed units is
(13)
where the subscript “BI” denotes “BSS incomplete data”. The summations in (13) may be calculated more efficiently using
(14)
Densities (11) and (13) incorporate respectively the information of complete data and incomplete data. We anticipate that they together can lead to better inference than using only complete data. The BSS full-data (BF) likelihood is
(15)
Here and in the remainder of this paper, we enumerate the fully observed units in the hth stratum as
, and the partially observed units in the same stratum as
.
In some cases only the number of incomplete measurements,
, in each stratum are known, instead of the measured values of all
’s. In this situation we need to integrate out
in (13) and rewrite the likelihood function (15) as
(16)
In real data analysis it is important to examine residuals for the fitted model to assess the validity of assumptions. Equation (11) gives the density function for BSS complete data, and can be used to calculate residuals. For simplicity we assume response
to be univariate y. Define the density function of
conditional on
as
.
The standardized residual for the ith observation
is
(17)
The measured data such as length and age are usually discrete, and the above integrations become summations, which are easier to evaluate.
4. Application of Empirical Proportion Approach to BSS
In this section we expand density (10) for application in BSS and especially in LSAS.
Empty strata (
) always happen with LSAS. For the empty strata in (10), the empirical selection proportions
are not defined. We need to assign selection probabilities for full and incomplete observations to those unobserved strata. In VPS these selection probabilities may be determined by the maximum likelihood method [10]. For sampling model (4), when
, all the individuals in the hth stratum are selected for full measurement; hence, logically the empirical selection probability is 1 when
. We assume that in unobserved strata the probability for full observation is 1, and the probability for incomplete observation is 0. Hence, the empirical proportion (EP) density of the complete data with
fully measured is given by
(18)
Here
enumerate the strata with data observed, and
enumerate the strata without data.
is the total number of strata with nonnegligible occupation probabilities
(see Equation (2)).
Similarly, we can include information from the incomplete observations using their EP density,
(19)
Here, without loss of generality, we assume that
falls in the hth stratum. For an unobserved stratum h, since we have defined its proportion for full observation
, its proportion for partial observation
. The EP likelihood function then has the form
(20)
If only the number of incomplete observations in each stratum is reported without knowing the
values,
in (19) needs to be integrated out and the likelihood (20) becomes
(21)
5. Simulation Study
In this section we examine the performance of the inference approaches for BSS described in the previous sections. We use two simple examples: a linear model with between individual (BI) variation, and a nonlinear Von Bertalanffy (VonB) growth model with BI variation. The simulation setup is as follows.
5.1. Linear Model with BI Variation
The linear model with BI variation is
(22)
where
,
and
. Capital letter B denotes the random effect of BI variation. We randomly generated
pairs,
, from model (22). The parameters of the model were chosen as
,
and 1.0,
,
,
, and
. Here we selected a small intercept
so that the issues with the relative performance in its estimation as defined by (25) can be clearly seen. Slope is an important parameter in linear model. Hence we selected small, moderate and large values for its mean
and a relatively large standard deviation (SD)
to test different approaches in identifying the slope under various situations. The mean
and SD
for covariate X are chosen so that the spread of the covariate allows reasonable estimates of the model parameters. We adopted a moderate error SD (
) relative to the other parameters. We stratified the data by length (Y) bins of size 2 and randomly selected a maximum of 15 units per length stratum to keep their X values, and dropped the X values of the other units not selected. This sampling design is close to the LSAS of fishery surveys that we would like to address in this study.
5.2. VonB Growth Model with BI Variation
The VonB model is a commonly used growth model in fisheries science (e.g. [24]). The basic VonB model is given by
(23)
where
denotes length at age
,
is the maximum possible size (as
), k is the growth rate parameter, and
is the theoretical age at which the fish would have had zero length. Variation in growth is also important for population and community dynamics (e.g. [25]). Not accounting for individual variation in growth may lead to bias in estimating the population mean growth parameters and length at age, as noted by [26] and [27]. The VonB model with BI variation follows [11],
(24)
where Y is the measured length,
,
and
. The error
here in fact includes both BI variation and Y observation error.
We randomly generated
ages from a gamma distribution with Case 1:
, and Case 2:
.
and
are determined by matching the
and
with those of the age data for American plaice that we have been investigating. Case 1 represents a younger population with mean age = 4.46 and variance = 5.47, while case 2 represents an older population with mean = 7.20 and variance = 4.61. Case 1 has a broad age distribution close to the origin, and case 2 has a narrower distribution of ages. Lengths were then generated from model (24) with
,
,
and
. We stratified the data by length classes of size 2 and randomly sampled a maximum of 15 units per length stratum to keep their ages and dropped all the other ages not selected.
5.3. Estimation Performance
Relative biases (RBias), relative standard errors (RSE), and relative square root mean squared errors (RRMSE) are defined as
(25)
We derived these values using 500 simulations for the full information likelihood (1), conditional likelihood (6), weighted likelihood (7), calibrated weighted likelihood, complete-data likelihood (12), full-data likelihood (15), and EP likelihood (20) (see Tables 1-4). We also include the “random approach” based on maximizing the likelihood
Table 1. Relative bias (RBias), relative standard error (RSE) and relative square root mean squared error (RRMSE) of the estimates from various approaches for the parameters in the linear model with BI variation (22).
.
Table 2. Relative bias (RBias), relative standard error (RSE) and relative square root mean squared error (RRMSE) of the estimates from various approaches for the parameters in the linear model with BI variation (22).
.
Table 3. Relative bias (RBias), relative standard error (RSE) and relative square root mean squared error (RRMSE) of the estimates from various approaches for the parameters in the linear model with BI variation (22).
.
(26)
as a reference point to see the difference between considering BSS and totally ignoring BSS.
For the linear model with BI variation (22), Tables 1-3 indicate that the full information, full-data and EP likelihood approaches have quite close performance, and in general they perform substantially better than all the other approaches in terms of RBias, RSE and RRMSE for all estimated parameters. The weighted likelihood (WL) and calibrated WL approaches have close performance, and there is no evidence that calibration improves the estimation; that is,
Table 4. Relative bias (RBias), relative standard error (RSE) and relative square root mean squared error (RRMSE) of the estimates from various approaches for the parameters in the VonB model with BI variation (24). Case 1:
. Case 2:
.
in some cases the calibrated WL has a little smaller RRMSEs than WL, and in the other cases the reverse happens, but the differences have no clear pattern, and are too small to draw reliable conclusions. Similarly, even though there is some difference in performance between the complete-data likelihood approach and the two WL approaches, it is not clear which method performs better. The two WL approaches have smaller RRMSEs for
and
estimation, while the complete-data likelihood approach has smaller RRMSEs for other parameter estimation. The conditional likelihood approach based on (6) performs the worst among all the approaches in this study except the random approach. Especially for
,
,
and
estimation, its RRMSEs are more than twice of those from the complete-data likelihood approach. Nevertheless, the conditional likelihood approach performs substantially better than the random approach.
Simulation results presented in Table 4 provide a comparison of the various estimation approaches for a nonlinear VonB model with BI variation. The outcomes for this nonlinear case are similar to the linear case just described. The full information, full-data and EP likelihood approaches have only tiny differences in performance, and in general perform better than the other approaches. The WL and calibrated WL approaches have almost identical performance. The complete-data likelihood approach has close performance as the two WL approaches and it is not clear which method is better. In Case 1 the conditional likelihood approach performs better than the random approach, but worse than all the other approaches including the complete-data approach. In Case 2 its performance is much worse than all the approaches including the random approach. Actually, the conditional likelihood approach failed for this case because it did not converge in 107 of the 500 simulations. All the methods in this study cannot estimate
well, with large RBias, large RSEs, and hence large RRMSEs. In practice we suggest to borrow information from other studies such as larvae studies to fix
, or equivalently to fix length at age 0, for the VonB model.
6. Real Data Analysis
The simulation study indicates that the full information likelihood (1), full-data likelihood (15) and EP likelihood (20) approaches perform better than the other estimation methods. In this section we apply these three approaches to fit the VonB model (24) using a dataset collected by DFO in NAFO Division 3N during the spring of 2011. Here we consider only female American plaice because males and females follow different growth models.
The LSAS within each Division involved measuring the length of all fish caught in research trawl tows, classifying them into 2 cm length strata, and subsampling a few or no otoliths from each length stratum. The sampling goal in each Division was to obtain about 25 age measurements per 2 cm length stratum by sex if
, and about 15 age measurements per stratum without sex distinguishment if
.
Parameter estimates (ESTs) and the corresponding standard errors (SEs) are provided in Table 5. The three estimation approaches give similar values for all the parameters and SEs, which agrees with their close performance in the simulation study. For comparison, we also included estimates from the random approach (26), which result in a substantially larger value for
and a smaller value for k. The standard errors of the estimates from the random approach are also larger, especially for
.
Applying (17), we obtained the standardized residuals of the second phase complete data for all approaches, whose box-and-whisker plots by age are shown in Figure 1. The standardized residuals from the full information likelihood, EP
Figure 1. Box-and-whisker plots of standardized residuals vs. age from fitting the VonB model with BI variation (24) to the American plaice data from DFO 2011 Spring survey in NAFO Division 3N by the four likelihood approaches: full information likelihood (Full information), empirical proportion likelihood (EP), full-data likelihood (Full-data) and random sample assumption based likelihood (Random). The black dots are the medians. The boxes indicate the lower and upper quartiles. The ends of the whiskers represent the lowest datum still within 1.5 IQR (interquartile range) of the lower quartile, and the highest datum still within 1.5 IQR of the upper quartile.
Table 5. Parameter estimates (EST) and standard errors (SE) for the VonB model with between-individual variation (24).
likelihood and full-data likelihood approaches do not indicate bias in fitted mean length at age from the data mean along the full range of age. The standardized residuals from the random approach (26) exhibit clear bias to negative values at ages larger than about 12, indicating over-estimation of
. In Figure 1 the interquartile range (IQR, the box) of the residuals is much larger at younger ages (≤4) compared to older ages (≥9). The standard deviation (SD) of the standardized residuals at each age is supposed to be 1. However, the calculated SDs (results not shown) transfers from being greater than 1 at younger ages (≤4) to being mainly smaller than about 0.6 at older ages (≥9). These suggest two problems with the model: 1) the BI variation model in (24) under-estimates the variation at shorter lengths and vice-versa at longer lengths for this data, and 2) due to reproduction, the juvenile female American plaice follows a different growth model from the adult female American plaice, which is neglected by the current model.
7. Discussion
We derived the density function (11) for BSS (basic stratified sampling) complete data, and constructed the complete-data likelihood (12), which allows statistical inference when the incomplete data are not well retained. The complete-data density can also be used for standardized residual calculation as discussed in Section 3. Residuals are important for validation of fitted models.
Both the complete-data likelihood approach and the random approach make use of only the complete data. The complete approach performs substantially better than the random approach in the simulation studies, indicating the importance of correctly incorporating the sampling scheme in the inference methods. The conditional likelihood (6) accounts for the sampling scheme approximately by ignoring the randomness in
in all the strata. Therefore its performance lies between the random and the complete-data likelihood approaches in almost all the cases in the simulation study. However in some BSS sampling projects where the number of strata is small and the maximum subsample size
for each stratum can usually be obtained, then the conditional likelihood (6) is appropriate.
Another method to incorporate the sampling scheme is to use the count information of the incomplete data in each stratum, as in the weighted likelihood (WL) and calibrated WL approaches. Even though in the simulation study the two methods of accounting for the sampling scheme, namely the complete-data likelihood and the (calibrated) WL approaches, have comparable performance, the complete-data likelihood requires an appropriate distribution model for covariates, which can limit its application. The WL and calibrated WL approaches are not subject to this restriction, and hence can be more practical.
A full utilization of the information in incomplete data is to incorporate the density function of the incomplete data in the likelihood. In this regard, we proposed two new likelihoods for BSS, namely, the full-data likelihood and the empirical proportion (EP) likelihood. If the covariate distribution can be properly modeled, the two new approaches perform as well as the standard full information likelihood approach, and they all perform substantially better than the other methods covered in this study. This result suggests the significance of the information in the incomplete data.
On the whole this study indicates that the complete data, the incomplete data, and the sampling scheme are all important for a consistent and efficient statistical inference from BSS data.
In this work we found that the EP likelihood approach, which was originally proposed for the variable probability sampling (VPS), works well (or the best together with the full-data and full information likelihood approaches) for BSS data. Its merits will further show up when covariates cannot be modeled effectively. This work is under the condition that a valid covariate distribution model is available, which may be a strong assumption in practice. We will explore the case when no appropriate covariate distribution model is available in another paper.
Acknowledgements
Research funding to Nan Zheng was provided by the Ocean Frontier Institute, through an award from the Canada First Research Excellence Fund. Funding was also provided by a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery grant to NC and NC’s Ocean Choice International Industry Research Chair program at the Marine Institute of Memorial University of Newfoundland.
Appendix: Proof of Theorem 1
Without loss of generality, we assume that
, then
Since the selection for full observation is random given
,
and we have
(27)
where
is the sample size in the hth stratum as defined by (4).
, that is, the probability for a selected unit to be in a stratum h is proportional to the number of vacancies in the stratum h. Also,
, namely, the event {a unit is selected without any further information about its
} is independent of the event {there are
units that are selected in the stratum h}.
Hence, when
,
which can be normalized into (11).
Note that in the case
for all the strata
,
, which is a constant independent of
. Then (27) leads to
, which proved (5).