Frequentist and Bayesian Sample Size Determination for Single-Arm Clinical Trials Based on a Binary Response Variable: A Shiny App to Implement Exact Methods
1. Introduction
Sample Size Determination (SSD) is an essential step in the design of a research study, especially in clinical trials. Let us denote by $\theta$ the parameter of interest, which measures the efficacy of a novel treatment, and assume that we are interested in testing $H_0: \theta \in \Theta_0$ vs $H_1: \theta \in \Theta_1$, where $\Theta_0$ and $\Theta_1$ form a partition of the parameter space $\Theta$.
A well-established strategy for SSD exploits the concept of power function: the study is sized to guarantee a large probability of rejecting the null hypothesis H0 when it is actually false. The decision about the rejection of H0 can be made under a frequentist framework or by performing a Bayesian analysis. In the latter case, a prior distribution, called the analysis prior, is introduced to incorporate into the procedure the pre-trial knowledge the researcher wants to take into account, together with pre-experimental evidence, if available. Moreover, the conjecture that the alternative hypothesis is true is an essential element of the methodology. It can be realized by assuming that the true $\theta$ is equal to a fixed design value $\theta_d$, suitably selected under H1; in this case, the probability of rejecting H0 is evaluated using the sampling distribution of the test statistic conditional on $\theta_d$ (conditional power). Alternatively, we can introduce uncertainty on the guessed design value by incorporating another prior distribution, called the design prior, which assigns negligible probability to values of $\theta$ under H0. In this case, the probability of rejecting H0 is computed using the prior predictive distribution of the test statistic, under the assumption that $\theta$ is distributed according to the design prior (predictive power). By combining frequentist and Bayesian procedures of analysis with both the conditional and the predictive approach, we obtain four power functions that can be used for sample size determination (see [1] ). The general idea is to select the minimum sample size necessary to achieve a desired level of power. This methodology, which relies on two distinct prior distributions and is therefore known as the two-priors approach, was initially formalized by Wang and Gelfand [2] . It is now well established, with many implementations presented in the literature (see, among others, [3] - [10] ).
In this paper, we consider the problem of SSD based on power analysis for single-arm studies based on a single binomial proportion. This design is typically used in Phase II clinical trials, where the parameter of interest is the probability of response to a novel therapy. Sambucini [1] derived the four power functions described above using frequentist and Bayesian exact methods at the analysis stage, which are particularly attractive because Phase II sample sizes are usually small. Notably, since we are dealing with discrete data, the power functions show a generally increasing, but non-monotonic, behaviour as a function of the sample size. This “saw-tooth behaviour” requires a modification of the standard criterion to select the optimal sample size if we want the condition on the power function to be fulfilled also for all sample sizes greater than the optimal one ( [1] [11] ). This modification of the SSD criteria has also been introduced in Gentile and Sambucini [12] , where the four power functions have been derived for single-arm trials based on count data. The aim of this paper is to present an R Shiny web application (app) developed to implement the SSD criteria provided in [1] . Some R functions, contained in existing packages, are available to compute the optimal sample size for a single binomial proportion, but they are based only on the frequentist conditional power and rely on asymptotic approximations. For instance, the functions pwr.p.test and prop1, implemented in the R packages pwr [13] and pwr2ppl [14] , exploit the arcsine transformation of the proportion [15] , whereas the function power.prop1.test of the package MKpower uses the normal approximation [16] . Furthermore, some functions allow for exact sample size computation but do not account for the saw-tooth behaviour of the power function. Examples are the power_binom_test function in the package MESS [17] and the propTestPower function in the package EnvStats [18] . The function power.diagnostic.test of the package MKpower accounts for the saw-tooth behaviour, but it aims to size diagnostic tests for an expected sensitivity or specificity [16] . Functions implementing the exact frequentist conditional power are also available in dedicated sample size software. Examples of freeware are G*Power [19] and Lenth’s applet [20] . For a more exhaustive list, the reader is referred to the textbook by Ryan [15] .
In practice, when the interest is focused on a single binomial proportion, many software tools implement the standard SSD procedures based on power analysis conducted with the frequentist conditional approach. However, to the best of our knowledge, no software tool has so far been available to implement exact criteria based on the other three power functions. We therefore developed an R Shiny App [21] [22] that provides a user-friendly and interactive environment to obtain the optimal sample size according to the criteria based on the two-priors approach derived in [1] . The app displays the behaviour of the four power functions as the sample size increases and lets the user decide whether or not to take into account the saw-tooth behaviour of the power when selecting the sample size. It also contains specific tools to select the analysis and the design prior distributions.
The rest of the paper is organized as follows. In Section 2, we review the exact procedures, based on the two-priors approach, to select the optimal sample size for a single binomial proportion. Section 3 discusses some strategies to elicit the prior distributions. In Section 4, we present the Shiny App and illustrate its features through an example. Finally, Section 5 contains some concluding remarks.
2. Exact SSD Methods for a Single Binomial Proportion
In this Section, we review the exact SSD procedures based on the four possible power functions, for one-sample testing problems with a binary response [1] .
Specifically, let us suppose that we are interested in testing the proportion of responders to a novel therapy. We consider n patients, each of whom receives the same treatment dosage and is classified as a responder or a non-responder to the therapy by means of a binary variable Y. We assume that we are interested in testing $H_0: \theta \le \theta_0$ vs $H_1: \theta > \theta_0$, where $\theta$ denotes the parameter of interest, i.e. the true response rate, while $\theta_0$ is a fixed target value that should represent the efficacy rate of the standard of care and is usually estimated from historical data.
2.1. Frequentist Power Functions
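Initially we assume that, at the analysis stage, the decision to reject the null hypothesis is made under a frequentist framework. The test statistic is the number of responders out of the n patients, $Y_n$, whose sampling distribution is

$$f_n(y;\theta) = \mathrm{Bin}(y; n, \theta), \qquad y = 0, 1, \ldots, n,$$

where $\mathrm{Bin}(\cdot; n, p)$ denotes the probability mass function of a binomial with parameters n and p. Therefore, the frequentist rejection region at level $\alpha$ is $\{Y_n \ge r\}$, where the critical value r is given by

$$r = \min\Big\{k \in \{0, 1, \ldots, n\} : \sum_{y=k}^{n} \mathrm{Bin}(y; n, \theta_0) \le \alpha\Big\}. \tag{1}$$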
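The search in (1) is easy to carry out exactly. The following Python sketch (standard library only) is a minimal illustration, not the app's own code, and its function names are hypothetical. Because the upper tail grows as k decreases, it accumulates the tail from k = n downwards and stops at the first k where the tail exceeds the level. Returning n + 1 when no k in {0, ..., n} qualifies is a convention of this sketch, meaning that H0 cannot be rejected at that sample size.

```python
from math import comb


def binom_pmf(y: int, n: int, p: float) -> float:
    """Binomial probability mass function Bin(y; n, p)."""
    return comb(n, y) * p**y * (1.0 - p) ** (n - y)


def critical_value(n: int, theta0: float, alpha: float) -> int:
    """Critical value r in (1): smallest k with P(Y_n >= k | theta0) <= alpha.

    The tail is accumulated from k = n downwards, so one pass over k suffices.
    Returns n + 1 when even P(Y_n = n | theta0) exceeds alpha (H0 never rejected).
    """
    tail = 0.0
    for k in range(n, -1, -1):
        tail += binom_pmf(k, n, theta0)  # tail = P(Y_n >= k | theta0)
        if tail > alpha:
            return k + 1
    return 0  # only reached when alpha >= 1


def actual_size(n: int, theta0: float, alpha: float) -> float:
    """Attained Type I error P(Y_n >= r | theta0); never larger than alpha."""
    r = critical_value(n, theta0, alpha)
    return sum(binom_pmf(y, n, theta0) for y in range(r, n + 1))


if __name__ == "__main__":
    for n in range(10, 16):
        r = critical_value(n, 0.5, 0.05)
        print(f"n={n:2d}  r={r:2d}  attained size={actual_size(n, 0.5, 0.05):.4f}")
```

Printing the attained size for consecutive n typically shows it moving up and down below the nominal level as r changes in steps, one visible consequence of the discreteness discussed in the Introduction.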
Note that, since the binomial distribution is discrete, the actual Type I error rate does not attain $\alpha$ exactly but is always less than or equal to it. To exploit the power function for SSD purposes, at the design stage we need to consider a scenario under which the alternative hypothesis is true. A first possibility is to specify a design value $\theta_d$ for $\theta$ that belongs to H1. In this case, we obtain the frequentist conditional power

$$\eta_F^C(n) = \mathbb{P}_{f_n(\cdot;\theta_d)}(Y_n \ge r),$$

where $\mathbb{P}_{f_n(\cdot;\theta_d)}$ denotes the probability measure associated with the sampling distribution of $Y_n$ for $\theta = \theta_d$. Note that $\eta_F^C(n)$ represents the probability of correctly rejecting H0 using a frequentist procedure when $\theta$ is equal to $\theta_d$.

However, we can add flexibility to the procedure by avoiding the use of a fixed design value. In fact, it is possible to introduce uncertainty on the design value by eliciting a design prior distribution, $\pi_d(\theta)$. This is an instrumental tool that allows us to model design expectations on $\theta$ under the assumption that the treatment is effective. Consequently, it should be chosen as an informative distribution, as we will discuss further in Section 3. In our specific case, given a beta design prior, $\pi_d(\theta) = \mathrm{Beta}(\theta; \alpha_d, \beta_d)$, the prior predictive distribution of $Y_n$ is

$$m_d(y) = \int_0^1 f_n(y;\theta)\,\pi_d(\theta)\,d\theta = \mathrm{BeBin}(y; \alpha_d, \beta_d, n), \qquad y = 0, 1, \ldots, n, \tag{2}$$

where $\mathrm{BeBin}(\cdot; p, q, n)$ is the probability mass function of a beta-binomial with parameters p, q and n. Therefore, the frequentist predictive power is given by

$$\eta_F^P(n) = \mathbb{P}_{m_d}(Y_n \ge r),$$

where $\mathbb{P}_{m_d}$ denotes the probability measure associated with the prior predictive distribution of $Y_n$ in (2) and r is the critical value defined in (1). Note that $\eta_F^P(n)$ provides the probability of correctly rejecting H0 using a frequentist procedure when $\theta$ is assumed to belong to the alternative hypothesis, where it is distributed according to $\pi_d(\theta)$.
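Both frequentist power functions can be sketched in a few lines of Python. As before, the helper names are hypothetical and this is not the app's code. The conditional power sums binomial probabilities under the design value above the critical value r. The predictive power sums the beta-binomial probabilities of (2), computed here through log-Beta functions for numerical stability.

```python
from math import comb, exp, lgamma


def binom_pmf(y: int, n: int, p: float) -> float:
    """Binomial probability mass function Bin(y; n, p)."""
    return comb(n, y) * p**y * (1.0 - p) ** (n - y)


def log_beta_fn(a: float, b: float) -> float:
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_pmf(y: int, n: int, p: float, q: float) -> float:
    """Beta-binomial pmf BeBin(y; p, q, n) = C(n, y) B(y + p, n - y + q) / B(p, q)."""
    return comb(n, y) * exp(log_beta_fn(y + p, n - y + q) - log_beta_fn(p, q))


def critical_value(n: int, theta0: float, alpha: float) -> int:
    """Frequentist critical value r in (1); n + 1 means H0 is never rejected."""
    tail = 0.0
    for k in range(n, -1, -1):
        tail += binom_pmf(k, n, theta0)  # tail = P(Y_n >= k | theta0)
        if tail > alpha:
            return k + 1
    return 0


def freq_conditional_power(n, theta0, alpha, theta_d):
    """eta_F^C(n) = P(Y_n >= r) when Y_n ~ Bin(n, theta_d)."""
    r = critical_value(n, theta0, alpha)
    return sum(binom_pmf(y, n, theta_d) for y in range(r, n + 1))


def freq_predictive_power(n, theta0, alpha, a_d, b_d):
    """eta_F^P(n) = P(Y_n >= r) when Y_n ~ BeBin(a_d, b_d, n), the prior predictive (2)."""
    r = critical_value(n, theta0, alpha)
    return sum(betabinom_pmf(y, n, a_d, b_d) for y in range(r, n + 1))


if __name__ == "__main__":
    # Illustrative inputs: theta0 = 0.3, alpha = 0.05, theta_d = 0.5, design prior Beta(11, 11)
    for n in (20, 30, 40):
        print(n, freq_conditional_power(n, 0.3, 0.05, 0.5),
              freq_predictive_power(n, 0.3, 0.05, 11.0, 11.0))
```

Both functions share the same critical value; they differ only in the distribution used to evaluate $\mathbb{P}(Y_n \ge r)$.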
2.2. Bayesian Power Functions
Alternatively, it is possible to perform the analysis under a Bayesian framework. This choice allows us to take into account pre-experimental information available on the treatment, for instance based on historical data or on the subjective opinions of experts. The information is incorporated through the elicitation of another prior distribution on the parameter, the analysis prior distribution, $\pi_a(\theta)$. To exploit conjugate analysis results, we consider a beta density, $\pi_a(\theta) = \mathrm{Beta}(\theta; \alpha_a, \beta_a)$, so that the corresponding posterior distribution is

$$\pi_a(\theta \mid y_n) = \mathrm{Beta}(\theta; \alpha_a + y_n, \beta_a + n - y_n).$$
Under this setup, to build a Bayesian equivalent of a power function, we need to determine the set of values of $Y_n$ that, if observed, would lead to rejecting the null hypothesis. In line with Spiegelhalter et al. [23] , we call this condition on the random result “Bayesian significance” and establish that it consists in rejecting H0 if the posterior probability of the alternative hypothesis is sufficiently high. Thus, in our specific case, $Y_n$ can be considered “significant” from a Bayesian perspective if

$$\mathbb{P}_{\pi_a(\cdot \mid Y_n)}(\theta > \theta_0) > 1 - \lambda, \tag{3}$$

where $\mathbb{P}_{\pi_a(\cdot \mid Y_n)}$ denotes the probability measure associated with the posterior distribution of $\theta$ and $\lambda$ is a probability threshold typically selected as a small value. The condition in (3) is a random object at the design phase because it depends on the future result $Y_n$. Nevertheless, for a fixed value of n, the posterior probability in (3) increases with $Y_n$. Therefore, the Bayesian rule consists in rejecting H0 if $Y_n \ge r_B$, where

$$r_B = \min\big\{k \in \{0, 1, \ldots, n\} : \mathbb{P}_{\pi_a(\cdot \mid k)}(\theta > \theta_0) > 1 - \lambda\big\}. \tag{4}$$
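A minimal Python sketch of the Bayesian rule in (4), assuming that Bayesian significance means that the posterior probability of $\theta > \theta_0$ exceeds $1 - \lambda$. The Beta CDF is evaluated through the regularized incomplete beta function, implemented with the standard continued-fraction (modified Lentz) expansion so that no third-party package is needed; a library routine such as SciPy's `scipy.special.betainc` could replace it. All names are illustrative and not taken from the app.

```python
from math import exp, lgamma, log


def _betacf(a: float, b: float, x: float) -> float:
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny, eps = 1e-300, 3e-14
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c, d = 1.0, 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, 1001):
        m2 = 2 * m
        # even and odd coefficients of the continued fraction at step m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def betainc_reg(a: float, b: float, x: float) -> float:
    """Regularized incomplete beta I_x(a, b), i.e. the Beta(a, b) CDF evaluated at x."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    bt = exp(lgamma(a + b) - lgamma(a) - lgamma(b) + a * log(x) + b * log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):  # continued fraction converges fast here
        return bt * _betacf(a, b, x) / a
    return 1.0 - bt * _betacf(b, a, 1.0 - x) / b  # symmetry I_x(a, b) = 1 - I_{1-x}(b, a)


def posterior_prob_h1(y, n, theta0, a_a, b_a):
    """P(theta > theta0 | Y_n = y) under the Beta(a_a + y, b_a + n - y) posterior."""
    return 1.0 - betainc_reg(a_a + y, b_a + n - y, theta0)


def bayes_critical_value(n, theta0, lam, a_a, b_a):
    """r_B in (4): smallest k with P(theta > theta0 | k) > 1 - lam; n + 1 if none exists."""
    for k in range(n + 1):
        if posterior_prob_h1(k, n, theta0, a_a, b_a) > 1.0 - lam:
            return k
    return n + 1


if __name__ == "__main__":
    # Uniform (1, 1) and Jeffreys (0.5, 0.5) analysis priors; theta0 and lambda illustrative
    for a_a, b_a in ((1.0, 1.0), (0.5, 0.5)):
        print(a_a, b_a, [bayes_critical_value(n, 0.3, 0.05, a_a, b_a) for n in (10, 20, 30)])
```

A linear scan is enough here because the posterior probability increases with k; for large n a bisection over k would reduce the number of incomplete beta evaluations.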
Then, we need to compute the probability that the Bayesian significance condition is fulfilled under the optimistic assumption that the treatment is effective. Once again, we may realize this assumption by using either a conditional or a predictive approach. In the first case, we fix a suitable design value $\theta_d$ under H1 and define the Bayesian conditional power as

$$\eta_B^C(n) = \mathbb{P}_{f_n(\cdot;\theta_d)}(Y_n \ge r_B).$$

In the second case, we elicit a design prior distribution for $\theta$, as described above, and obtain the Bayesian predictive power, that is

$$\eta_B^P(n) = \mathbb{P}_{m_d}(Y_n \ge r_B).$$

Clearly, both power functions $\eta_B^C(n)$ and $\eta_B^P(n)$ provide the probability of correctly rejecting H0 using a Bayesian procedure, under the assumption that the alternative hypothesis is true. Moreover, it is worth pointing out that the Bayesian predictive power is the one that allows us to model both prior knowledge and uncertainty on the design value: it includes the other power functions as special cases. In fact, if we consider a point-mass design distribution on $\theta_d$, then no design uncertainty is involved, and the predictive approach coincides with the conditional one. On the other hand, if no pre-experimental information is available, a non-informative analysis prior can be elicited and the Bayesian powers coincide with the frequentist ones.
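Combining the Bayesian critical value with the two design-stage distributions gives a sketch of the Bayesian conditional and predictive powers. This is an illustrative, self-contained Python implementation with hypothetical names, not the app's code; the incomplete beta routine is the same continued-fraction expansion used for the posterior probability, and a SciPy routine could be used instead.

```python
from math import comb, exp, lgamma, log


def _betacf(a, b, x):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny, eps = 1e-300, 3e-14
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c, d = 1.0, 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, 1001):
        m2 = 2 * m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def betainc_reg(a, b, x):
    """Regularized incomplete beta I_x(a, b) = Beta(a, b) CDF at x."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    bt = exp(lgamma(a + b) - lgamma(a) - lgamma(b) + a * log(x) + b * log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * _betacf(a, b, x) / a
    return 1.0 - bt * _betacf(b, a, 1.0 - x) / b


def binom_pmf(y, n, p):
    """Binomial pmf Bin(y; n, p)."""
    return comb(n, y) * p**y * (1.0 - p) ** (n - y)


def betabinom_pmf(y, n, p, q):
    """Beta-binomial pmf BeBin(y; p, q, n), computed on the log scale."""
    def log_b(u, v):
        return lgamma(u) + lgamma(v) - lgamma(u + v)
    return comb(n, y) * exp(log_b(y + p, n - y + q) - log_b(p, q))


def bayes_critical_value(n, theta0, lam, a_a, b_a):
    """r_B in (4); n + 1 means Bayesian significance is unattainable at this n."""
    for k in range(n + 1):
        if 1.0 - betainc_reg(a_a + k, b_a + n - k, theta0) > 1.0 - lam:
            return k
    return n + 1


def bayes_conditional_power(n, theta0, lam, a_a, b_a, theta_d):
    """eta_B^C(n) = P(Y_n >= r_B) when Y_n ~ Bin(n, theta_d)."""
    r_b = bayes_critical_value(n, theta0, lam, a_a, b_a)
    return sum(binom_pmf(y, n, theta_d) for y in range(r_b, n + 1))


def bayes_predictive_power(n, theta0, lam, a_a, b_a, a_d, b_d):
    """eta_B^P(n) = P(Y_n >= r_B) when Y_n ~ BeBin(a_d, b_d, n)."""
    r_b = bayes_critical_value(n, theta0, lam, a_a, b_a)
    return sum(betabinom_pmf(y, n, a_d, b_d) for y in range(r_b, n + 1))


if __name__ == "__main__":
    # Illustrative inputs: theta0 = 0.3, lambda = 0.05, analysis prior Beta(1.4, 1.6),
    # design value 0.5 and design prior Beta(11, 11).
    for n in (20, 30, 40):
        print(n, bayes_conditional_power(n, 0.3, 0.05, 1.4, 1.6, 0.5),
              bayes_predictive_power(n, 0.3, 0.05, 1.4, 1.6, 11.0, 11.0))
```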
2.3. Sample Size Determination Criteria
Whatever power function is chosen, the standard SSD criterion selects the optimal n as the minimum value such that the power exceeds a threshold of interest $\gamma$. Hence, the optimal sample sizes are obtained as

$$n_j^i = \min\{n \in \mathbb{N} : \eta_j^i(n) > \gamma\}, \qquad i \in \{C, P\},\ j \in \{F, B\}, \tag{5}$$

where the superscript refers to the approach used at the design stage, while the subscript refers to the approach used at the analysis stage. However, given the saw-tooth shape of the power curves as a function of n, a slightly different and more conservative SSD criterion can be adopted. The idea is to select the smallest sample size such that the condition on the power is fulfilled also for all larger sample sizes, that is

$$\tilde{n}_j^i = \min\{n^* \in \mathbb{N} : \eta_j^i(n) > \gamma \ \text{for all } n \ge n^*\}. \tag{6}$$

This criterion prevents the condition of interest from being satisfied at the selected sample size but no longer satisfied for some larger values of n.
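Criteria (5) and (6) only require the power evaluated on a grid of sample sizes. The Python sketch below assumes a user-supplied maximum sample size `n_max` (a hypothetical parameter mirroring the app's maximum sample size input) and returns both sizes. Because the conservative criterion involves all larger n, it can only be verified up to `n_max`. The example uses the frequentist conditional power, but any of the four power functions can be passed in.

```python
from math import comb
from typing import Callable, List, Optional, Tuple


def binom_pmf(y: int, n: int, p: float) -> float:
    """Binomial probability mass function Bin(y; n, p)."""
    return comb(n, y) * p**y * (1.0 - p) ** (n - y)


def critical_value(n: int, theta0: float, alpha: float) -> int:
    """Frequentist critical value r in (1); n + 1 means H0 is never rejected."""
    tail = 0.0
    for k in range(n, -1, -1):
        tail += binom_pmf(k, n, theta0)
        if tail > alpha:
            return k + 1
    return 0


def freq_conditional_power(n: int, theta0: float, alpha: float, theta_d: float) -> float:
    """eta_F^C(n), used here as an example power function."""
    r = critical_value(n, theta0, alpha)
    return sum(binom_pmf(y, n, theta_d) for y in range(r, n + 1))


def optimal_sample_sizes(power: Callable[[int], float], gamma: float,
                         n_max: int) -> Tuple[Optional[int], Optional[int]]:
    """Standard criterion (5) and conservative criterion (6) over n = 1, ..., n_max.

    The conservative size is the smallest n* with power(n) > gamma for every n in
    [n*, n_max]; larger n cannot be checked. None means "not reached by n_max".
    """
    values: List[float] = [power(n) for n in range(1, n_max + 1)]
    standard = next((n for n, v in enumerate(values, start=1) if v > gamma), None)
    conservative: Optional[int] = None
    for n in range(n_max, 0, -1):  # walk back from n_max while the condition holds
        if values[n - 1] > gamma:
            conservative = n
        else:
            break
    return standard, conservative


if __name__ == "__main__":
    toy = [0.10, 0.85, 0.70, 0.90, 0.95]  # a saw-tooth sequence for n = 1, ..., 5
    print(optimal_sample_sizes(lambda n: toy[n - 1], 0.8, 5))  # (2, 4)
    # Illustrative inputs: theta0 = 0.2, theta_d = 0.4, alpha = 0.05, gamma = 0.8
    print(optimal_sample_sizes(lambda n: freq_conditional_power(n, 0.2, 0.05, 0.4), 0.8, 200))
```

With the toy sequence, the standard criterion stops at n = 2 even though the power drops below the threshold at n = 3, whereas the conservative criterion returns n = 4.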
3. Prior Distributions Selection
This section discusses some strategies to elicit the design and analysis prior distributions, accounting for their different aims. We start with the design prior distribution $\pi_d(\theta)$. The idea is to express the hyperparameters in terms of the prior mode $\theta_d$ and the prior sample size $n_d$ by using [1] [24] :

$$\alpha_d = n_d\,\theta_d + 1, \qquad \beta_d = n_d\,(1 - \theta_d) + 1. \tag{7}$$
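A sketch of the mapping from (prior mode, prior sample size) to beta hyperparameters, assuming the parameterization $\alpha = n\,\theta + 1$, $\beta = n\,(1 - \theta) + 1$. Under this parameterization the mode $(\alpha - 1)/(\alpha + \beta - 2)$ equals $\theta$, and a prior sample size of zero yields the uniform Beta(1, 1). The function names are hypothetical.

```python
from typing import Tuple


def beta_from_mode(mode: float, prior_n: float) -> Tuple[float, float]:
    """Beta hyperparameters from prior mode and prior sample size (assumed form of (7))."""
    if not 0.0 < mode < 1.0 or prior_n < 0.0:
        raise ValueError("mode must lie in (0, 1) and prior_n must be non-negative")
    return prior_n * mode + 1.0, prior_n * (1.0 - mode) + 1.0


def beta_mode(a: float, b: float) -> float:
    """Mode (a - 1) / (a + b - 2) of a Beta(a, b) density, defined for a, b > 1."""
    return (a - 1.0) / (a + b - 2.0)


if __name__ == "__main__":
    for n_d in (0, 10, 50):
        a_d, b_d = beta_from_mode(0.5, n_d)
        print(n_d, (a_d, b_d))  # n_d = 0 gives the uniform Beta(1, 1)
```

Keeping the mode fixed and increasing the prior sample size only changes the concentration of the density, which is what the strategies below exploit.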
We can center $\pi_d(\theta)$ on the design value we would consider in the conditional approach, i.e. set its mode equal to $\theta_d$, and regulate its concentration through the choice of the prior sample size. It is crucial to emphasize that $\pi_d(\theta)$ should be an informative distribution. First, it serves to realize the assumption that $\theta$ belongs to the alternative hypothesis. Furthermore, as n approaches infinity, $\eta_F^P(n)$ and $\eta_B^P(n)$ tend to the probability assigned to the alternative hypothesis by $\pi_d(\theta)$, denoted by $\mathbb{P}_{\pi_d}(\theta > \theta_0)$ [25] . Thus, $\mathbb{P}_{\pi_d}(\theta > \theta_0)$ should be close to one to ensure that the power tends to 1 as n goes to infinity. We suggest two possible strategies to ensure that the design prior distribution satisfies these features. Once the prior mode $\theta_d$ has been specified, we determine $n_d$ numerically so that:
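1) $\pi_d(\theta)$ assigns a probability close to one to the alternative hypothesis;
2) $\pi_d(\theta)$ assigns a probability close to one to a symmetric interval $[\theta_d - \varepsilon, \theta_d + \varepsilon]$, where $\varepsilon$ is a non-negative real number such that $\theta_d - \varepsilon > \theta_0$.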
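Both strategies reduce to a one-dimensional root-finding problem in $n_d$. The Python sketch below is not the app's implementation: it uses bisection and assumes that the relevant probability increases with $n_d$, as one expects when the mode lies inside H1 (and, for the second strategy, the interval lies inside H1). It also assumes the parameterization $\alpha_d = n_d\theta_d + 1$, $\beta_d = n_d(1-\theta_d) + 1$. The targets 0.99 and 0.95 and $\varepsilon = 0.15$ in the usage lines are arbitrary illustrative values, and all names are hypothetical.

```python
from math import exp, lgamma, log
from typing import Callable, Tuple


def _betacf(a, b, x):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny, eps = 1e-300, 3e-14
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c, d = 1.0, 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, 1001):
        m2 = 2 * m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def betainc_reg(a, b, x):
    """Regularized incomplete beta I_x(a, b) = Beta(a, b) CDF at x."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    bt = exp(lgamma(a + b) - lgamma(a) - lgamma(b) + a * log(x) + b * log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * _betacf(a, b, x) / a
    return 1.0 - bt * _betacf(b, a, 1.0 - x) / b


def beta_from_mode(mode: float, prior_n: float) -> Tuple[float, float]:
    """Assumed parameterization: alpha = n * mode + 1, beta = n * (1 - mode) + 1."""
    return prior_n * mode + 1.0, prior_n * (1.0 - mode) + 1.0


def design_prob_h1(mode: float, n_d: float, theta0: float) -> float:
    """Strategy 1): probability assigned by the design prior to H1 = {theta > theta0}."""
    a, b = beta_from_mode(mode, n_d)
    return 1.0 - betainc_reg(a, b, theta0)


def design_prob_interval(mode: float, n_d: float, eps: float) -> float:
    """Strategy 2): probability of the symmetric interval [mode - eps, mode + eps]."""
    a, b = beta_from_mode(mode, n_d)
    return betainc_reg(a, b, mode + eps) - betainc_reg(a, b, mode - eps)


def solve_prior_n(prob: Callable[[float], float], target: float, tol: float = 1e-8) -> float:
    """Smallest n (up to tol) with prob(n) >= target, assuming prob increases with n."""
    lo, hi = 0.0, 1.0
    while prob(hi) < target:  # double hi until the target is bracketed
        lo, hi = hi, 2.0 * hi
        if hi > 1e5:
            raise ValueError("target probability not reached for n_d <= 1e5")
    while hi - lo > tol:  # invariant: prob(hi) >= target
        mid = 0.5 * (lo + hi)
        if prob(mid) >= target:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    theta0, mode = 0.3, 0.5  # illustrative values only
    n_h1 = solve_prior_n(lambda n: design_prob_h1(mode, n, theta0), 0.99)
    n_int = solve_prior_n(lambda n: design_prob_interval(mode, n, 0.15), 0.95)
    print(round(n_h1, 2), beta_from_mode(mode, n_h1))
    print(round(n_int, 2), beta_from_mode(mode, n_int))
```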
Both these procedures are implemented in the Shiny App described in the next Section. Finally, as $n_d$ tends to infinity, $\pi_d(\theta)$ tends to assign all the probability mass to the prior mode $\theta_d$, so that no variability is introduced around it. Consequently, the predictive and conditional approaches align.
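The limiting argument can be checked numerically. Assuming the mode/prior-sample-size parameterization $\alpha_d = n_d\theta_d + 1$, $\beta_d = n_d(1-\theta_d) + 1$, this hypothetical Python sketch increases $n_d$ with the mode fixed at the design value and compares the frequentist predictive power with the conditional power at that design value; the gap should become negligible for large $n_d$. The values $\theta_0 = 0.3$, $\theta_d = 0.5$, $\alpha = 0.05$ and n = 20 are purely illustrative.

```python
from math import comb, exp, lgamma


def binom_pmf(y, n, p):
    """Binomial pmf Bin(y; n, p)."""
    return comb(n, y) * p**y * (1.0 - p) ** (n - y)


def betabinom_pmf(y, n, p, q):
    """Beta-binomial pmf BeBin(y; p, q, n), computed on the log scale."""
    def log_b(u, v):
        return lgamma(u) + lgamma(v) - lgamma(u + v)
    return comb(n, y) * exp(log_b(y + p, n - y + q) - log_b(p, q))


def critical_value(n, theta0, alpha):
    """Frequentist critical value r in (1); n + 1 means H0 is never rejected."""
    tail = 0.0
    for k in range(n, -1, -1):
        tail += binom_pmf(k, n, theta0)
        if tail > alpha:
            return k + 1
    return 0


def freq_conditional_power(n, theta0, alpha, theta_d):
    r = critical_value(n, theta0, alpha)
    return sum(binom_pmf(y, n, theta_d) for y in range(r, n + 1))


def freq_predictive_power(n, theta0, alpha, a_d, b_d):
    r = critical_value(n, theta0, alpha)
    return sum(betabinom_pmf(y, n, a_d, b_d) for y in range(r, n + 1))


def beta_from_mode(mode, prior_n):
    """Assumed parameterization: alpha = n * mode + 1, beta = n * (1 - mode) + 1."""
    return prior_n * mode + 1.0, prior_n * (1.0 - mode) + 1.0


if __name__ == "__main__":
    n, theta0, alpha, theta_d = 20, 0.3, 0.05, 0.5  # illustrative values only
    cond = freq_conditional_power(n, theta0, alpha, theta_d)
    for n_d in (1, 10, 100, 1000, 10**6):
        a_d, b_d = beta_from_mode(theta_d, n_d)
        pred = freq_predictive_power(n, theta0, alpha, a_d, b_d)
        print(f"n_d={n_d:>7}  predictive={pred:.5f}  conditional={cond:.5f}")
```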
The elicitation of the analysis prior distribution can be based on historical data or on subjective opinions of experts. However, one of the most common ways of proceeding is to choose a non-informative density or a density based on very weak information, to avoid underweighting the results of the current experiment in determining the analysis outcome. For instance, we may rely on the uniform distribution on the interval $(0, 1)$ by specifying $\alpha_a = \beta_a = 1$, or on the non-informative Jeffreys prior by setting $\alpha_a = \beta_a = 1/2$. Alternatively, and similarly to the choice of $\pi_d(\theta)$, we can express the hyperparameters of the analysis prior distribution in terms of the prior mode, $\theta_a$, and the prior sample size, $n_a$, where $n_a$ is typically set equal to one, or to a very low value, in order to obtain a weakly informative prior distribution. This allows us to express skepticism, neutrality or optimism about large treatment effects through the choice of the prior mode $\theta_a$. Finally, note that if we introduce no prior information, i.e. if $n_a$ is set equal to 0, the Bayesian and the frequentist setups coincide.
4. Shiny App
This Section presents a Shiny App that implements the sample size criteria described in Section 2. It is available at the following link: https://susanna-gentile.shinyapps.io/SSD_singlearm.
The Shiny package in R enables the creation of interactive web apps directly from R ( [21] [22] ). Our Shiny App aims to provide an intuitive and user-friendly tool for applying the methodologies discussed in this paper. The main functionalities of the app are:
To compute the optimal sample size according to each of the four power functions;
To implement both the standard and the conservative criterion to select the optimal sample size;
To display, if requested, the power as a function of n;
To store the design parameters and results in a table and download it as a CSV file;
To help select the analysis and the design prior distributions and visualize them.
Users can select either the conservative criterion in (6), which accounts for the saw-tooth behaviour, or the standard criterion in (5). We suggest using the former; however, we leave this choice to the user’s discretion, as there is no unanimous agreement on the appropriateness of this methodology [15] . The tools to select the design and analysis priors are organized into two separate panels since, as previously stressed, the two distributions have different aims and should be kept distinct.
4.1. User Interface Structure
We start by describing the User Interface (UI). The UI changes according to the methodologies chosen to conduct the design and analysis phases, as depicted in Figures 1-4. In the upper part of the UI, users can input the design parameters, split into three groups.
General setting: The inputs include the historical control $\theta_0$, the power level $\gamma$, and the maximum sample size. Users can choose whether to use the conservative criterion (the default), the standard one, or both.
Analysis stage: The inputs depend on the planned final analysis. For a frequentist analysis, the app requires the Type-I error probability $\alpha$ (Figure 1). For a Bayesian analysis, the app requires the Bayesian significance level $\lambda$ and the analysis prior’s hyperparameters. Users can exploit the “Analysis prior” panel to select them (Figure 3, panels (a) and (b)).
Figure 1. User interface when the aim is to compute $n_F^C$, given $\theta_0$, $\alpha$, $\theta_d$ and $\gamma$. We consider both the conservative and the standard criterion.
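Figure 2. User interface and design prior selection panel when the aim is to compute $n_F^P$, given $\theta_0$, $\alpha$, $\theta_d$ and $\gamma$; $n_d$ is selected so that $\pi_d(\theta)$ assigns a fixed probability to the alternative hypothesis. (a) User interface and results for the frequentist predictive power; (b) Design prior selection panel.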
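Figure 3. User interface and analysis prior selection panel when the aim is to compute $n_B^C$, given $\theta_0$, $\lambda$, $\theta_d$ and $\gamma$. For the analysis prior, we set the prior mode $\theta_a$, and $n_a$ is selected so that $\mathbb{P}_{\pi_a}(\theta > \theta_0)$ equals a fixed value. (a) User interface and results for the Bayesian conditional power; (b) Analysis prior selection panel.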
Figure 4. User interface when the aim is to compute $n_B^P$, given $\theta_0$, $\lambda$ and $\gamma$. The design and the analysis priors are the ones selected previously.
Design stage: The inputs depend on the approach used to realize the optimistic assumption that the experimental treatment is effective. The app requires the design value $\theta_d$ for the conditional approach (Figure 1) and the hyperparameters of the design prior distribution for the predictive approach. We recommend using the “Design prior” panel to select the design prior (Figure 2, Panel (b)).
Once all the required inputs have been provided, the “Results” panel shows, on the left, a summary of the design parameters, the optimal sample size, and the critical value. On the right, if requested, the power as a function of n is displayed. Users can save the results in a table by clicking “Save results”. The table can then be downloaded as a CSV file (Figure 1). The info icons provide some suggestions on the choice of the parameters.
Note that if the user changes the target value parameter $\theta_0$, the other design parameters have to be entered again. This mechanism prevents the app from crashing if the old input parameters are inconsistent with the new target value and the corresponding hypothesis system. Once the user provides $\theta_0$, the app checks whether all the inputs satisfy their boundaries before computing the results, and informs the user otherwise.
The maximum sample size input allows the user to specify the maximum sample size available. If the optimal sample size exceeds the maximum, the app prints a message and the power corresponding to the maximum sample size in the “Results” panel. By default, the app still computes the optimal sample size according to the selected criterion, as shown in the plot on the right. However, if the optimal sample size is greater than 1000, the app stops, and the user can decide whether to increase the maximum sample size by modifying the input parameter, as suggested by a printed message. This boundary also prevents the app from crashing if the optimal sample size at the desired level does not exist, which may happen if the prior distributions are not well specified.
4.2. Tools for Selecting the Prior Distributions
The “Design prior” panel appears if the user opts for the predictive approach at the design stage (Figure 2, Panel (b)). The panel requires the specification of the prior mode $\theta_d$. Then, the “Select the prior sample size $n_d$” drop-down list allows the user to select $n_d$ according to three possible strategies:
By assigning a fixed probability to the alternative hypothesis: the resulting $\pi_d(\theta)$ assigns the selected probability to the alternative hypothesis.
By assigning a fixed probability to an interval: the resulting $\pi_d(\theta)$ assigns the selected probability to a symmetric interval $[\theta_d - \varepsilon, \theta_d + \varepsilon]$.
Manually: the user selects the prior sample size.
As emphasized in the previous Section, the design prior should be highly informative and assign a negligible probability to the null hypothesis. The first two methods implement the strategies described in the previous Section and ensure this condition by selecting $n_d$ numerically. However, users can also choose $n_d$ at their discretion. In this case, we encourage users to verify that the probability assigned to the alternative hypothesis is greater than the desired power level $\gamma$: since the predictive powers tend to this probability as n grows, the desired power may never be reached otherwise. To help the user in this choice, the probability corresponding to the inserted $n_d$ is printed under the hyperparameters. Regardless of the selected method, the design distribution is displayed on the right. Finally, the “Update prior parameters” button updates the corresponding inputs in the user interface with the hyperparameters of the chosen distribution.
The “Analysis prior” panel appears when the user decides to conduct the analysis stage under a Bayesian framework (see Figure 3, Panel (b)). First, the user needs to specify the prior mode $\theta_a$. Then, the drop-down list “Select the prior sample size $n_a$” allows selecting the prior sample size $n_a$ “Manually”, i.e., at the user’s discretion, or “Automatically”. In the latter case, $n_a$ is selected by fixing the probability of the alternative hypothesis, $\mathbb{P}_{\pi_a}(\theta > \theta_0)$. The range of admissible probabilities is determined numerically by ensuring that:
1) The hyperparameters $\alpha_a$ and $\beta_a$ are both greater than 1, so that $\pi_a(\theta)$ admits a mode in $(0, 1)$;
2) The prior sample size $n_a$ is less than 100.
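The automatic choice of $n_a$ can be sketched as a bracketed root search on $(0, 100)$. This Python sketch is an illustration under stated assumptions rather than the app's algorithm. It uses the parameterization $\alpha_a = n_a\theta_a + 1$, $\beta_a = n_a(1-\theta_a) + 1$, under which condition 1) holds for any $n_a > 0$. It takes the admissible probabilities to be those between the limits at $n_a \to 0$ (the uniform prior, giving $1 - \theta_0$) and at $n_a = 100$, and then bisects on the sign change. The mode 0.4, $\theta_0 = 0.3$ and target 0.8 are illustrative values; all names are hypothetical.

```python
from math import exp, lgamma, log
from typing import Tuple


def _betacf(a, b, x):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny, eps = 1e-300, 3e-14
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c, d = 1.0, 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, 1001):
        m2 = 2 * m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def betainc_reg(a, b, x):
    """Regularized incomplete beta I_x(a, b) = Beta(a, b) CDF at x."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    bt = exp(lgamma(a + b) - lgamma(a) - lgamma(b) + a * log(x) + b * log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * _betacf(a, b, x) / a
    return 1.0 - bt * _betacf(b, a, 1.0 - x) / b


def beta_from_mode(mode: float, prior_n: float) -> Tuple[float, float]:
    """Assumed parameterization: alpha = n * mode + 1, beta = n * (1 - mode) + 1."""
    return prior_n * mode + 1.0, prior_n * (1.0 - mode) + 1.0


def analysis_prob_h1(mode: float, n_a: float, theta0: float) -> float:
    """P(theta > theta0) under the analysis prior with given mode and prior sample size."""
    a, b = beta_from_mode(mode, n_a)
    return 1.0 - betainc_reg(a, b, theta0)


def feasible_range(mode: float, theta0: float, n_max: float = 100.0) -> Tuple[float, float]:
    """Probabilities between n_a -> 0 (uniform prior) and n_a = n_max (assumed monotone)."""
    p_at_zero = analysis_prob_h1(mode, 0.0, theta0)  # equals 1 - theta0
    p_at_max = analysis_prob_h1(mode, n_max, theta0)
    return min(p_at_zero, p_at_max), max(p_at_zero, p_at_max)


def select_n_a(mode: float, theta0: float, target: float,
               n_max: float = 100.0, iters: int = 100) -> float:
    """Bisection on n_a in (0, n_max) for P(theta > theta0) = target."""
    p_min, p_max = feasible_range(mode, theta0, n_max)
    if not p_min < target < p_max:
        raise ValueError(f"target must lie in ({p_min:.4f}, {p_max:.4f})")
    lo, hi = 0.0, n_max
    f_lo = analysis_prob_h1(mode, lo, theta0) - target
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = analysis_prob_h1(mode, mid, theta0) - target
        if (f_mid > 0.0) == (f_lo > 0.0):  # same sign as lo: root lies in [mid, hi]
            lo, f_lo = mid, f_mid
        else:  # sign change: root lies in [lo, mid]
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    print(feasible_range(0.4, 0.3))
    n_a = select_n_a(0.4, 0.3, 0.8)
    print(n_a, beta_from_mode(0.4, n_a))
```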
As outputs, the panel returns the hyperparameters and a graphical representation of the analysis prior. If the user opts for selecting $n_a$ manually, the probability assigned to the alternative hypothesis is also shown as a check. If the user opts for the automatic selection, the app prints the prior sample size $n_a$ instead. Finally, $\alpha_a$ and $\beta_a$ can be stored in the corresponding UI values by clicking on “Update prior parameters”.
4.3. Illustrative Example
We now illustrate an example of how to use the app. When inserting the inputs, we suggest starting from the target value $\theta_0$, so that the app can automatically check whether the other parameters are well selected.
Let us start by considering the frequentist conditional power; the corresponding UI is shown in Figure 1. We assume that the aim is to test, at level $\alpha$, the null hypothesis that the actual response rate is less than or equal to $\theta_0$. We set $\theta_d = \theta_0 + 0.2$, as we consider an increase of 0.2 clinically relevant, and fix the desired power level $\gamma$. We set the maximum sample size to 200 and consider both the standard and the conservative criterion. Because of the saw-tooth behaviour, the two optimal sample sizes differ: they are 35 and 38, respectively.
Then, we switch to a predictive approach by selecting “Predictive” in the Design Stage window. This choice leads to the User Interface in Figure 2. We rely on the “Design prior” panel to select the design prior hyperparameters. More specifically, we use the same design value $\theta_d$ as prior mode and select $n_d$ using the “By assigning a fixed probability to the alternative hypothesis” method, requiring that the design prior assigns a probability close to one to the alternative hypothesis; the resulting beta density is displayed in Figure 2, Panel (b). The optimal sample size is 40 for the standard criterion and 46 for the conservative one. As expected, both are greater than the corresponding sample sizes obtained with the conditional approach, because we are now accounting for the uncertainty around the design value $\theta_d$.
Let us now suppose that there is an optimistic prior opinion about treatment efficacy and that, according to experts, the most plausible value for the parameter is $\theta_a$. We switch to a Bayesian analysis framework to incorporate this information. Figure 3 and Figure 4 show the Shiny App screenshots for the conditional and predictive approaches. We set the Bayesian significance level $\lambda$ equal to the previous $\alpha$ to ensure comparability with the previous results, and use the “Analysis prior” panel to select $\alpha_a$ and $\beta_a$. More specifically, we set the prior mode equal to $\theta_a$, while $n_a$ is selected so that $\mathbb{P}_{\pi_a}(\theta > \theta_0)$ equals a fixed value; the panel then returns the resulting beta analysis prior and the corresponding prior sample size $n_a$. Considering the conditional approach, as in Figure 3, the optimal sample size according to the conservative criterion is smaller than the corresponding frequentist one, because the selected analysis prior distribution expresses a modest enthusiasm towards treatment efficacy. Similarly, if we adopt a predictive approach to account for the uncertainty around the design value, as in Figure 4, we obtain the optimal sample size shown there. As expected, the latter is greater than the Bayesian conditional sample size, due to the predictive approach, but smaller than the corresponding frequentist sample size.
5. Conclusions
The methodologies based on the two-priors approach allow us to exploit four different power functions to determine the optimal sample size. We have reviewed these procedures for single-arm studies based on a single binomial proportion. Although several R packages and software tools implement the classical SSD procedures based on the frequentist conditional approach, easy-to-use computational tools implementing criteria based on the other three power functions are not yet available. To fill this gap, we developed an interactive and user-friendly Shiny application, whose main functionalities are presented in this paper.
In addition to computing the optimal sample size, the Shiny app allows users to check the behaviour of the four power functions as the sample size varies and to choose between the standard and the conservative SSD criterion, the latter taking into account the saw-tooth behaviour of the power functions. Moreover, since the distinction between analysis and design priors is an essential element of the implemented procedures, the app provides two separate panels to help select both prior distributions: different strategies can be used to elicit these priors, and the app plots the corresponding densities.
Finally, note that in this paper we refer to single-arm designs in Phase II clinical trials. These designs are frequently used to determine whether a new treatment is likely to meet a basic level of efficacy before comparing it with the standard therapy in larger, randomized Phase III trials, and efficacy is commonly measured as a response rate. However, the SSD procedures implemented in the Shiny App can be used to size experiments conducted in fields other than the clinical one, as long as they are based on a single binomial proportion.