On the Index of Repeatability: Estimation and Sample Size Requirements
1. Introduction
Repeatability and reproducibility are ways of measuring precision, particularly in the fields of biochemistry, radiology, and medical diagnostics. In general, scientists perform the same experiment several times in order to confirm their findings, and these findings may show variation. In the context of an experiment, repeatability measures the variation in measurements taken by a single instrument or person under the same conditions, while reproducibility measures whether an entire study or experiment can be reproduced. There has been confusion in the literature about the way repeatability and reproducibility are quantified: both concepts have often been reported as either standard deviations or coefficients of variation.
The main focus of this paper is the concept of repeatability, which was first introduced by Bland and Altman [1]. For repeatability to be established, the following conditions must be in place: the measurements should be taken in the same location, with the same measurement procedure, by the same observer, using the same measuring instrument under the same conditions, and repeated over a short period of time.
What is known as “the repeatability” is in fact a measure of precision, which denotes the absolute difference between a pair of repeated test results. We note that when we have more than two readings per subject, the idea of pairing produces several repeatability coefficients and the concept becomes unclear.
Repeatability is also known as test-retest reliability, indicating the closeness of the agreement between the results of successive measurements of the same measurand carried out under the same conditions of measurement. Less-than-perfect test-retest reliability causes test-retest variability, which can arise from, for example, intra-individual variability and intra-observer variability. A measurement may be said to be repeatable when this variation is smaller than a pre-determined acceptance criterion. A complete account of the reliability literature can be found in Shoukri [2] [3].
One of the most important applications of the concept of repeatability is in the construction of the normal range (or reference range) in clinical medicine, which relies on the availability of a large sample of healthy individuals. Research has shown that the distribution of these measurements is affected by two main sources of variation: the between-subject and the within-subject components of variation.
This paper has three objectives. First, we define a proposed index of repeatability as the ratio of the within-subject variation to the between-subject variation; the within-subject variation is expected to be quite small relative to the between-subject variation. To formalize the presentation, we assume that a single measurement $y_i$ from subject $i$ is written as:

$$y_i = \mu + b_i + e_i \quad (1)$$

Hence $b_i$ represents the sources of between-subject biological variation, $e_i$ represents the sources of within-subject variation, and $\mu$ denotes the population mean. Note that the assumption of additivity of the components is made to simplify the presentation; a multiplicative model may be made additive under the logarithmic transformation. Following Harries and De Mets [4] it is further assumed that $b_i \sim N(0, \sigma_b^2)$ and $e_i \sim N(0, \sigma_e^2)$, and that $\operatorname{cov}(b_i, e_i) = 0$ for all $i$. We define the “Repeatability Index Parameter” (RIP) as

$$\theta = \sigma_e^2 / \sigma_b^2.$$
The salient point is that $\theta$ cannot be estimated unless we have at least two repeated measurements on each subject in the study.
In Section 2 we specify the model generating the observations and discuss a general method of estimating RIP from a sample of $k$ subjects when there is an opportunity to take $n$ repeated samples per subject. In Section 3 we provide two alternative sampling strategies. In the first, we assume that the investigator has decided to acquire a total number of measurements $N = nk$, and the question becomes: what is the best split between $n$ and $k$ that maximizes the accuracy of estimating RIP? One of the biggest obstacles in clinical studies is cost constraints; therefore, the second strategy is to find the optimal split of $N = nk$ so that RIP is estimated with maximum precision under cost restrictions (constrained optimization). The third objective of the study is to address the issue of estimating RIP when the assumption of a Gaussian distribution for the observations is not tenable.
2. Model Specifications and Parameter Estimation
We assume that for subject $i$, $n$ replicates of the same variable of interest are taken by the same instrument at the same time, so that

$$y_{ij} = \mu + b_i + e_{ij}, \qquad i = 1, 2, \ldots, k; \; j = 1, 2, \ldots, n \quad (2)$$

where $k$ is the number of subjects and $n$ is the number of replications per subject. We further assume that the components of the model described by (2) are such that the subject effects $b_i \sim N(0, \sigma_b^2)$ are independently distributed random variables, distributed independently of the within-subject errors $e_{ij} \sim N(0, \sigma_e^2)$.
Under the additivity assumption of the model components, we have:

$$\operatorname{var}(y_{ij}) = \sigma_b^2 + \sigma_e^2.$$

The parameter $\theta = \sigma_e^2 / \sigma_b^2$ is the target parameter of interest, named the “Repeatability Index Parameter” (RIP). The components of variation of the model set-up can be estimated using the well-known one-way Analysis of Variance (ANOVA) with random effects (Table 1).

S.O.V = Source of variation, DF = Degrees of freedom associated with the corresponding sum of squares, S.O.S = Corrected sums of squares, MS = Mean square = S.O.S/DF, EMS = Expected mean square. The sample statistics needed for the ANOVA computations based on the available observations are given as:

$$\bar{y}_i = \frac{1}{n}\sum_{j=1}^{n} y_{ij}, \qquad \bar{y} = \frac{1}{kn}\sum_{i=1}^{k}\sum_{j=1}^{n} y_{ij},$$

$$MSB = \frac{n}{k-1}\sum_{i=1}^{k}(\bar{y}_i - \bar{y})^2, \qquad MSW = \frac{1}{k(n-1)}\sum_{i=1}^{k}\sum_{j=1}^{n}(y_{ij} - \bar{y}_i)^2,$$

with $E(MSB) = \sigma_e^2 + n\sigma_b^2$ and $E(MSW) = \sigma_e^2$.
The moment estimator of the parameter $\theta$, and hence the maximum likelihood estimator (under the balanced design), is given by:

$$\hat{\theta} = \frac{n \, MSW}{MSB - MSW} \quad (3)$$
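As a concrete illustration, the estimator in Equation (3) can be computed directly from the one-way ANOVA mean squares. The following Python sketch is ours, not part of the original analysis; the function name and simulated data are hypothetical, and it assumes the moment estimator takes the form $\hat{\theta} = n\,MSW/(MSB - MSW)$:

```python
import numpy as np

def rip_estimate(y):
    """Moment estimator of the Repeatability Index Parameter (RIP).

    y : (k, n) array -- k subjects, n replicates per subject.
    Returns theta_hat = n * MSW / (MSB - MSW), the estimated ratio of the
    within-subject to the between-subject variance component.
    """
    k, n = y.shape
    subject_means = y.mean(axis=1)
    grand_mean = y.mean()
    msb = n * np.sum((subject_means - grand_mean) ** 2) / (k - 1)
    msw = np.sum((y - subject_means[:, None]) ** 2) / (k * (n - 1))
    return n * msw / (msb - msw)

# Simulated example: sigma_b^2 = 1.0, sigma_e^2 = 0.01, so the true RIP = 0.01
rng = np.random.default_rng(0)
k, n = 500, 3
b = rng.normal(0.0, 1.0, size=(k, 1))   # between-subject effects b_i
e = rng.normal(0.0, 0.1, size=(k, n))   # within-subject errors e_ij
y = 10.0 + b + e                        # model (2): y_ij = mu + b_i + e_ij
print(round(rip_estimate(y), 4))        # close to the true RIP of 0.01
```

With $k = 500$ subjects the estimate falls close to the true ratio $0.01/1.0 = 0.01$.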
The parameter estimator $\hat{\theta}$ given in Equation (3) is a nonlinear function of the sample statistics, and therefore an exact expression for its variance is not available. We use the delta method (Kendall and Stuart, 1989) [5] to obtain a first-order approximation of the variance of $\hat{\theta}$, given by:

$$\operatorname{var}(\hat{\theta}) \approx \left(\frac{\partial \theta}{\partial MSB}\right)^{2} \operatorname{var}(MSB) + \left(\frac{\partial \theta}{\partial MSW}\right)^{2} \operatorname{var}(MSW) + 2\,\frac{\partial \theta}{\partial MSB}\,\frac{\partial \theta}{\partial MSW}\,\operatorname{cov}(MSB, MSW) \quad (4)$$

Under normality, $MSB$ and $MSW$ are independent and the covariance term vanishes. Substituting the required quantities in (4) and simplifying, we get the first-order approximation of the variance of $\hat{\theta}$ as:

$$v \equiv \operatorname{var}(\hat{\theta}) \approx \frac{2\theta^{2}(n+\theta)^{2}}{n^{2}}\left[\frac{1}{k-1} + \frac{1}{k(n-1)}\right] \quad (5)$$
A $100(1-\alpha)\%$ approximate Wald confidence interval on $\theta$ may be constructed as:

$$\hat{\theta} \pm z \sqrt{\hat{v}} \quad (6)$$

where $z$ in Equation (6) is the $100(1-\alpha/2)\%$ cut-off point from the standard normal table, and $\hat{v}$ is Equation (5) evaluated at $\hat{\theta}$.
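To make Equations (5) and (6) concrete, the following Python sketch (ours; it assumes the variance approximation has the standard one-way random-effects form $v = 2\theta^2(n+\theta)^2 n^{-2}[1/(k-1) + 1/(k(n-1))]$) evaluates the approximate variance and the corresponding 95% Wald limits:

```python
import math

def rip_variance(theta, k, n):
    """First-order (delta-method) approximation to var(theta_hat);
    assumed form of Equation (5)."""
    return (2 * theta**2 * (n + theta)**2 / n**2) * (1/(k - 1) + 1/(k*(n - 1)))

def rip_wald_ci(theta_hat, k, n, z=1.96):
    """Approximate 95% Wald interval as in Equation (6); the lower limit is
    truncated at zero since theta is a ratio of variances."""
    se = math.sqrt(rip_variance(theta_hat, k, n))
    return (max(theta_hat - z * se, 0.0), theta_hat + z * se)

lo, hi = rip_wald_ci(theta_hat=0.01, k=20, n=3)
print(round(lo, 4), round(hi, 4))  # -> 0.0023 0.0177
```

Even with a small estimated RIP, a design with only $k = 20$ subjects and $n = 3$ replicates leaves the interval quite wide relative to $\hat{\theta}$.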
3. How Many Repeats Do We Need?
Our first approach to estimating the optimal number of replications is to assume that the total number of measurements $N = nk$ is fixed a priori, and one needs to determine the number of replicates that minimizes the variance $v$ given in (5). Minimizing $v$ with respect to the number of replications per subject and solving for $n$, we get:

$$n_{\text{opt}} = 2 + \theta \quad (7)$$

This means that $v$ is minimized (i.e., precision is maximized) when at least 2 repeats are obtained from each subject, as shown in Equation (7). When $\sigma_e^2 = 0$ (no within-subject variation), then $n_{\text{opt}} = 2$ precisely.
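The optimum in Equation (7) can be checked numerically. The sketch below (ours) fixes the total number of measurements $N = nk$, treats $k = N/n$ as continuous, and searches for the replicate count minimizing the variance approximation of Equation (5) (assumed form as in the sketch above):

```python
def rip_variance(theta, k, n):
    # Assumed form of Equation (5)
    return (2 * theta**2 * (n + theta)**2 / n**2) * (1/(k - 1) + 1/(k*(n - 1)))

def best_split(N, theta, n_max=20):
    """Among n = 2..n_max with k = N/n, return the n minimizing var(theta_hat)."""
    return min(range(2, n_max + 1), key=lambda n: rip_variance(theta, N / n, n))

print(best_split(600, 1.0))  # 3, i.e. n = 2 + theta
```

For $\theta = 1$ the search returns $n = 3 = 2 + \theta$; for a small $\theta$ such as 0.1 it returns $n = 2$, consistent with the statement above.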
We may also estimate the number of repeats for a fixed-width confidence interval as follows. Suppose that we have decided on the number of subjects $k$. The question now is how many repeats per subject are needed to estimate $\theta$ with 95% confidence such that the width of the confidence interval does not exceed a given length $w$. Since the length of the Wald confidence interval is $2z\sqrt{v}$, we set $2z\sqrt{v} = w$ and solve for $n$:

(8)

This closed-form expression is quite simple, and the computation of $n$ from Equation (8) is straightforward: substituting the chosen values of $\theta$, $w$, and $k$ in (8) yields the required number of repeats.
4. Estimating the Number of Repeats under Cost Constraints
It is extremely expensive, and in some circumstances difficult, to obtain repeated samples from each subject. Some of these difficulties are related to cost and time (which may itself be translated into cost). Clearly, too small a sample may lead to a study that produces many false negatives, while too large a sample may produce many false positives at additional cost. Thus, a critical step in constructing an accurate estimate of the normal range is to balance the cost of recruiting healthy normal subjects against the need to obtain an accurate estimate of RIP. In this section we address the issue of obtaining the combination $(n, k)$ that minimizes the variance of $\hat{\theta}$ subject to cost constraints. The sampling cost depends primarily on the size of the sample, and includes the data collection costs, subject recruitment costs, and management and technician costs. Overhead costs, on the other hand, remain fixed regardless of the sample size. The total cost is assumed to follow the additive formula:
$$T = c_0 + k c_1 + k n c_2 \quad (9)$$

In Equation (9), $c_0$ is the fixed cost, $c_1$ is the cost of recruiting a healthy subject, and $c_2$ is the cost of taking a single measurement. Denoting the variance of $\hat{\theta}$ by $V$, the main objective is to determine the number of repeated measurements that minimizes the variance of $\hat{\theta}$ subject to the cost constraint $T$. In the language of optimization, we construct the objective function
$$Q = V + \lambda\,(c_0 + k c_1 + k n c_2 - T) \quad (10)$$

The parameter $\lambda$ in Equation (10) is the Lagrange multiplier. The necessary conditions for the minimization of $Q$ are obtained by differentiating $Q$ with respect to $n$, $k$, and $\lambda$ and equating the derivatives to zero, which yields a cubic equation in $n$:

(11)

Note that from Equation (9) we have $k = (T - c_0)/(c_1 + n c_2)$. The cubic Equation (11) has an explicit solution given by:

(12)
Equation (12) gives the optimum number of replicates per subject needed to minimize the variance of the estimated RIP when the total cost of the investigation is held fixed. Note that when $c_1 = 0$ (i.e., when the cost constraint fixes only the total number of measurements $N = kn$), then $n = 2 + \theta$, as given in Equation (7). This means that a special cost structure is implied by the optimal allocation procedure discussed in the previous section. Note also that the optimal $n$ increases with the cost ratio $c_1/c_2$, implying that this ratio is an important factor in determining the optimal allocation $(n, k)$.
Examples: substituting selected values of $\theta$ and of the cost ratio $c_1/c_2$ into Equation (12) gives the corresponding optimal number of replicates $n$ for each configuration.
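Although the closed-form root in Equation (12) is not reproduced here, the constrained optimum is easy to obtain numerically: eliminate $k$ through the cost identity (9) and search over $n$. In the Python sketch below (ours; the cost figures are purely illustrative and the variance form is assumed, not taken from the paper), recruiting a subject costs ten times as much as a single measurement:

```python
def optimal_n_under_cost(theta, T, c0, c1, c2, n_max=50):
    """Minimize the (assumed) delta-method variance of theta_hat over n,
    with k = (T - c0)/(c1 + n*c2) implied by the cost constraint
    T = c0 + k*c1 + k*n*c2 of Equation (9)."""
    def variance(n):
        k = (T - c0) / (c1 + n * c2)
        if k <= 2:  # not enough subjects left in the budget
            return float("inf")
        return (2*theta**2*(n + theta)**2 / n**2) * (1/(k - 1) + 1/(k*(n - 1)))
    return min(range(2, n_max + 1), key=variance)

print(optimal_n_under_cost(theta=1.0, T=5000.0, c0=500.0, c1=50.0, c2=5.0))  # 7
print(optimal_n_under_cost(theta=1.0, T=5000.0, c0=500.0, c1=0.0, c2=5.0))   # 3
```

When the recruitment cost $c_1$ is set to zero the budget fixes only $N = kn$ and the search returns $n = 2 + \theta$, matching Equation (7); a positive ratio $c_1/c_2$ pushes the optimum toward more replicates per subject.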
Remarks:
We set as a benchmark for the value of the estimator of RIP a maximum of 1%. That is, if the within-subject variation relative to the between-subject variation is above 1%, then repeatability is low, and vice versa.
Note also that the estimator of θ is a nonlinear function of the sample data, and hence is a potentially biased estimator. Moreover, the derived variance is just a first-order approximation of the actual variance. Finally, if the measurements are not normally distributed, then constructing a confidence interval on the population parameter using the normal quantiles will not be acceptable unless the sample size is quite large. One way to assess the properties of the proposed estimator is to use nonparametric bootstrap sampling techniques. We address this issue in the data analysis section.
5. Effect of Non-Normality of Components of Variations on the Estimated Variance of RIP
Not all biological markers measured on a continuous scale have Gaussian distributions. In this section we drop the assumption of normality regarding the distributions of $b_i$ and $e_{ij}$, and evaluate the effect of non-normality on the estimation of the RIP. The immediate consequences of dropping the assumption of normality of the measurements are:

1) The one-way ANOVA mean squares MSB and MSW will not have chi-square distributions.

2) The mean squares MSB and MSW are no longer independent, and hence the ratio $F = MSB/MSW$ of the mean squares will not have the usual F-distribution.
When the assumption of normality is relaxed, the coefficients of kurtosis of both $b_i$ and $e_{ij}$ are needed in the calculation of the asymptotic variance of $\hat{\theta}$ [6]. Let $\gamma_b$ and $\gamma_e$ denote respectively the coefficients of (excess) kurtosis of $b_i$ and $e_{ij}$. These quantities are defined as:

$$\gamma_b = \frac{E(b_i^4)}{\sigma_b^4} - 3, \qquad \gamma_e = \frac{E(e_{ij}^4)}{\sigma_e^4} - 3.$$
Using results for the balanced one-way ANOVA [6] we have:

$$\operatorname{var}(MSW) = \sigma_e^4\left[\frac{2}{k(n-1)} + \frac{\gamma_e}{kn}\right],$$

$$\operatorname{var}(MSB) = \frac{2(\sigma_e^2 + n\sigma_b^2)^2}{k-1} + \frac{n^2 \gamma_b \sigma_b^4}{k} + \frac{\gamma_e \sigma_e^4}{kn},$$

and

$$\operatorname{cov}(MSB, MSW) = \frac{\gamma_e \sigma_e^4}{kn}.$$

Using the delta method [5] and substituting in (4), we get the first-order approximation of the variance of $\hat{\theta}$. Simplifying, we get:

$$\operatorname{var}(\hat{\theta}) \approx \frac{2\theta^2 (n+\theta)^2}{n^2}\left[\frac{1}{k-1} + \frac{1}{k(n-1)}\right] + \frac{\theta^2 \gamma_b}{k} + \frac{\theta^2 \gamma_e}{kn} \quad (13)$$

When $\gamma_b = \gamma_e = 0$, Equation (13) reduces to Equation (5).
Comments:
The first question that needs to be answered is: which component of variation has the largest effect on the variance of the RIP estimate, and hence on the number of repeats? We answer this question in a heuristic manner. We note from Equation (13) that $\gamma_e$ is divided by the factor $kn$ in $\operatorname{var}(MSW)$, $\operatorname{var}(MSB)$, and $\operatorname{cov}(MSB, MSW)$. The implication is that, as the number of subjects increases, the kurtosis of the error term has a negligible effect on the variance of the estimated RIP.
We may also demonstrate the effect of non-normality using the tools of probability and power calculations. This can be illustrated through the testing of statistical hypotheses on the RIP. Suppose that we need to determine the number of subjects required to detect a departure from the null hypothesis $H_0: \theta = \theta_0$ in the direction of the one-sided alternative $H_1: \theta = \theta_1 > \theta_0$, with type-one error rate $\alpha$ and power $1-\beta$. For fixed $n$, we can show that:

$$k = \left[\frac{z_{1-\alpha}\sqrt{V(\theta_0)} + z_{1-\beta}\sqrt{V(\theta_1)}}{\theta_1 - \theta_0}\right]^2 \quad (14)$$

where $V(\theta)$ denotes $k \operatorname{var}(\hat{\theta})$ from Equation (13), with $k-1$ replaced by $k$.
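The sample-size calculation of Equation (14) can be sketched in Python as follows. This is our reconstruction: the per-subject variance factor $V(\theta) = 2\theta^2(n+\theta)^2/(n(n-1)) + \theta^2\gamma_b + \theta^2\gamma_e/n$ and the illustrative values of $\theta_0$, $\theta_1$, $\gamma_b$, $\gamma_e$ are assumptions, not the paper's original figures:

```python
import math

def V(theta, n, gb=0.0, ge=0.0):
    """Per-subject variance factor k*var(theta_hat), to first order, with
    excess-kurtosis corrections gb (between) and ge (within); assumed form."""
    return 2*theta**2*(n + theta)**2/(n*(n - 1)) + theta**2*gb + theta**2*ge/n

def subjects_needed(theta0, theta1, n, gb=0.0, ge=0.0,
                    z_alpha=1.645, z_beta=0.842):
    """One-sided test H0: theta = theta0 vs H1: theta = theta1 (> theta0),
    alpha = 5% and power = 80% by default, as in Equation (14)."""
    num = (z_alpha * math.sqrt(V(theta0, n, gb, ge))
           + z_beta * math.sqrt(V(theta1, n, gb, ge)))
    return math.ceil((num / (theta1 - theta0))**2)

print(subjects_needed(0.01, 0.05, n=3))                   # normal components: 7
print(subjects_needed(0.01, 0.05, n=3, gb=2.0, ge=2.0))   # heavy-tailed: 13
```

The heavier-tailed configuration roughly doubles the required recruitment for the same effect size, mirroring the pattern described in the text.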
If we set the Type I error rate at 5% and the power at 80%, for given values of $\theta_0$, $\theta_1$, $\gamma_b$, and $\gamma_e$, the estimated values of $k$ can easily be calculated. Specifically, for one illustrative effect size and choice of kurtosis coefficients, Equation (14) requires recruiting 6 subjects, while for the same values of the RIP we need to recruit 21 subjects under a second, more heavy-tailed choice. The worst situation is when the two components of variation are both far from normal, which inflates the required $k$ further; bringing either component closer to normality reduces the required recruitment substantially. These computations illustrate the impact of departures from normality of the distributions of the between- and within-subject variations on the sample size requirements.
6. Data Analysis and Bootstrap
In this section we apply the methodology presented in this paper to serum alanine aminotransferase (ALT). ALT is a critical parameter for both the assessment and follow-up of patients with liver disease; therefore, establishing the repeatability and precision of ALT measurements as a diagnostic marker is of paramount importance. Regardless of gender or body mass index (BMI) [7], the normal range has most often been estimated from populations that included patients with subclinical liver disease, including non-alcoholic fatty liver disease (NAFLD), which is now documented as the most prevalent cause of chronic liver disease worldwide [8]. Recent studies have recommended establishing normal ranges for ALT separately in males and females [9]. Furthermore, recently published HBV guidelines suggested that treatment decisions should be based on these new ALT levels [10]; with the exception of one recently published Korean study, no earlier reports have established normal liver histology when evaluating reference ALT ranges [11].
From a large tertiary hospital-based registry, the available data were grouped into a female group of 20 subjects and a male group of 30 subjects. In both groups, each subject’s ALT was measured three times according to the rules set out in [1]. The data are summarized in Table 2 for females and in Table 3 for males.
Bootstrap results
We used R to bootstrap the data, setting the number of bootstrap samples at 1000 for both data sets.
Bootstrap Statistics for females’ data:
original bias std. error
0.001 0.00117 0.0004
As can be seen from Figure 1, both the histogram and the Q-Q plot show that the large-sample distribution of the estimator is skewed to the right. Therefore, one should be careful when constructing Wald confidence limits for the population RIP.
Bootstrap Statistics for males’ data:
original bias std. error
0.002 -0.0001 0.0006
In contrast to the females’ data, the histogram of the bootstrap statistics shown in Figure 2 is skewed to the left, but the Q-Q plot is closer to normality. This may be due to the fact that the males’ sample is larger than the females’ sample.
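The R bootstrap code itself is not reproduced in the paper; the following Python sketch (ours) illustrates the same nonparametric scheme, resampling whole subjects with replacement and recomputing $\hat{\theta}$ for each resample. The simulated data stand in for the ALT measurements, which are not listed here:

```python
import numpy as np

def rip_estimate(y):
    """theta_hat = n * MSW / (MSB - MSW) from a (k, n) data matrix."""
    k, n = y.shape
    m = y.mean(axis=1)
    msb = n * np.sum((m - y.mean()) ** 2) / (k - 1)
    msw = np.sum((y - m[:, None]) ** 2) / (k * (n - 1))
    return n * msw / (msb - msw)

def bootstrap_rip(y, B=1000, seed=1):
    """Nonparametric bootstrap: resample whole subjects (rows) with
    replacement and recompute theta_hat for each of the B resamples."""
    rng = np.random.default_rng(seed)
    k = y.shape[0]
    return np.array([rip_estimate(y[rng.integers(0, k, k)]) for _ in range(B)])

# Simulated stand-in for the female ALT data: k = 20 subjects, n = 3 replicates
rng = np.random.default_rng(42)
y = 30.0 + rng.normal(0, 8.0, (20, 1)) + rng.normal(0, 0.8, (20, 3))
boot = bootstrap_rip(y)
print(round(boot.mean(), 4), round(boot.std(ddof=1), 4))
```

A histogram and Q-Q plot of `boot` can then be inspected for the skewness discussed above, and its standard deviation compared with the delta-method standard error.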
Table 2. Descriptive statistics of the female ALT data.
Table 3. Descriptive statistics of the male ALT data.
Figure 1. Histogram and the Q-Q plots of the 1000 bootstrap samples of the estimated RIP (females ALT data).
Figure 2. Males’ data histogram and the Q-Q plots of the 1000 bootstrap samples of the estimated RIP.
7. Comments and Summary
As can be seen from the histograms and the Q-Q plots, the distribution of the estimated RIP (t1* in the bootstrap output) is far from normally distributed. We expect, however, that the distributional properties would be closer to normality if the number of subjects were much larger than the numbers we have here. When one attempts to establish a population-based reference range for healthy populations, the number of subjects is typically in the hundreds, and the issue of normality may then be irrelevant.
Further investigation is needed for the case of categorical measurements, and for the case in which the number of replications per subject is not fixed.