Quasi-Negative Binomial: Properties, Parametric Estimation, Regression Model and Application to RNA-SEQ Data ()
1. Introduction
A random variable X is said to have “Quasi Negative Binomial Distribution”, QNBD if the probability function is given by:
(1)
The distribution whose probability function is given in (1) was first derived by Takács [1] as a queuing model. He assumed that we have a single server queue with independent customers arriving according to a Poisson process in batches of size
with traffic intensity π and exponential service time with mean 1/α. It is also assumed that the service time is independent of the interarrival time. Under these conditions, the probabilities of arrival
and departure = 1 −θ. Takács [1] and later Consul and Gupta [2] showed that the probability that a buy period has
is for fixed β given by (1). The distribution is a member of the Lagrange class of distributions [3] [4] [5].
The shape of the histogram of X depends on the combination (β, θ). In Figure 1 & Figure 2 we can see that the distribution has a much longer tail for large values of β.
The paper is structured as follows: In Section 2 we demonstrate the connection between the QNBD and the regular exponential family of distributions [6] and derive the higher order central moments of the distribution. A limiting form of the distribution will be investigated as well. In Section 3, we derive the first order approximation of the variances and biases of the moment estimators of (β,θ). In Section 4, we derive the maximum likelihood estimators and their asymptotic variances and biases. In Section 5, we develop the regression model and establish discuss the maximum likelihood estimation of the regression parameters. In Section 6, we apply the models to real-life data arising from genomic studies. We provide general discussion in Section 7.
2. Moments of the Distribution
The simplest approach to derive the higher order central moments of the distribution is to first write (1) in the general form of the linear exponential family.
For fixed β, the QNBD belongs to the regular exponential family of discrete random variables:
(2)
with
![]()
Figure 1. Histogram of QNBD for β = 2, and θ = 0.10.
![]()
Figure 2. Histogram of X when for β = 10, and θ = 0.09.
The mean
and variance
of X are given respectively by:
(3)
(4)
Writing
, and
, one can establish a recurrence relationship among the central moments so that:
Here
(5)
Therefore, the third and fourth central moments are:
(6)
(7)
where,
.
Moreover, the fifth central moment is given by:
where
3. Moment Estimators
Suppose that we have a random sample
with sample mean
and sample variance
Equating the sample statistics to their corresponding population parameters (3) and (4) and solving for θ and β we get
(8)
(9)
We use the delta method to evaluate the variances and biases of a moment estimator.
From Kendall and Ord [7] we have:
With similar expressions for
and
.
One can show that:
Note that, the information matrix is the determinant of the variance covariance matrix of the moment estimators and is given by:
Example: Modeling the number of brain lesions to predict Multiple Sclerosis.
The use of gadolinium (Gd) withT1 weighted imaging can identify areas of breakdown in the blood-brain barrier and increases the reliability and in detecting active Multiple Sclerosis (MS) lesions [8]. The number of new Gd enhancing lesions is a widely used end point for monitoring disease activity and for evaluating the effect of treatments in phase II clinical trials. In these studies, the results of the Magnetic Resonance Imaging (MRI) end point are in the form of counts [9]. To deal with the problem of overdispersion, the negative binomial distribution is used to model this type of data.
As application of the QNBD we simulated lesions count data like the situation described in [8] (Table 1).
The sample size = 116 subjects.
The histogram of the data s is given in Figure 3.
The y-axis we have the frequency of each x.
mean(x) = 3.37, and var(x) = 69.63.
The moment estimators are
and
.
Bootstrapping the distribution of the moment estimators
, and
(Figure 4).
![]()
Table 1. Distribution of the number of brain lesions.
![]()
Figure 3. Histogram of the data shown in Table 1. The x-axis we have the x values.
![]()
Figure 4. Histogram of distribution of
based on 1000 bootstrap samples.
The distribution is negatively skewed. The empirical bias in the moment estimator of
is
.
Similarly
.
From Figure 5 we may infer that the distribution of
seems to be a mixture of two distribution or is bimodal. From these results, we may conclude that the moment estimators are not reliable unless we have extremely large sample. In the next section, we discuss the maximum likelihood estimation.
4. Maximum Likelihood Estimators (MLE)
It is well-known that the estimators obtained from application of the method of MLE possess optimal properties such asymptotic normality and efficiency. Based on a simple random sample the log-likelihood (l) function is given by:
(10)
(11)
Similarly, setting
equal to zero and solving for β we get:
(12)
where
![]()
Figure 5. Histogram of distribution of
based on 1000 bootstrap samples.
The MLEs of θ and β are thus obtained by solving (11) and (12) iteratively, noting that (12) is in the form of
or a fixed-point equation.
Elements of the variance-covariance matrix of the
are obtained by inverting the Fisher’s information matrix. We can show that
(13)
(14)
(15)
and
where
We note that on using the digamma approximation we can write
The R-Code for fitting the QNBD is given in Appendix 1.
5. Orthogonal Polynomial Approximation for
The evaluation of the asymptotic variance covariance matrix is difficult because
does not have a tractable form. To overcome this difficulty, following [5] we employ an asymptotic expansion for
as a linear combination of orthogonal polynomials. From Morgan et al. [9], if
is a distribution function with feint moments
of all orders, then the point
is a point of increase for
, if
for every
. If the distribution function P has atleast Y points of increase, Cramér [10] has proved that there exists a sequence of polynomials
, uniquly determined under the following conditions:
1)
is of degree n, and the coefficient of
in
is positive
2)
satisfy the orthogonality conditions
If
Szegő [11] derived the formal Fourier expansion of a continuous function
in terms of a set of orthogonal polynomials such that:
where
are selected so that:
is minimum. He showed that
Direct calculations give:
and,
, are the orthogonal polynomials associated with the probability function
, where
. Moreover, we write
Hence
Now, since
, and
then
Since
Then
The asymptotic relative efficiency of the moment estimators is therefore given by:
For the lesion data Eff = 16.6%. We interpret this number as follows: for the moment estimators to be as efficient as the maximum likelihood estimators, we need a sample size that is 16.6% larger compared to the sample size used for the maximum likelihood estimation.
3.4 Asymptotic biases of the MLE
Unlike the moment estimators, the
do not have closed form expressions, and the applications of the delta method cannot be used to obtain their asymptotic biases. Sherton and Wallington [12] used an approach that depends on the asymptotic expansion of the log-likelihood functions. We denote the biases of
, and
by
and
, and these are the solutions of the system of equations
In the above system of equation we have the following notations:
And
Since
We can show that
where
and
where
where
Finally, using the above information we can show that
.
Solving the system of equations, we obtain the asymptotic biases so that
For the lesion data, the biases of the maximum likelihood estimators are given by:
, and
6. Quasi Negative Binomial Regression
Our aim in this section is develop regression model based on the GNBD. The approach is facilitated by the fact that the QNBD is a member of the regular exponential family shown in [13]. We employ the transformation:
(16)
Here we assume to
be monotone, differentiable, and positive function of θ [13]. In (16) z is a vector of
exploratory variables and γ is a vector of regression parameters. To estimate
, and β, we assure that
are independent random variables and
(17)
In this section, we derive the maximum likelihood estimators of the regression parameters, the parameter β and their asymptotic properties. The log-likelihood function is given by:
where,
can be approximated using the results:
where δ is Euler’s number. Therefore
Simplifying we get:
can be approximated by:
The variance covariance matrix of the estimated parameters, and β based on the regression model is given by the inverse of Fisher’s information matrix:
where M is and q × q symmetric matrix whose elements are
so that
, and O is a 1 × q matrix whose elements are
and C is a 1 × 1 element with
.
The simplest approach to obtain the maximum likelihood estimators of γ and β is by solving the equations;
and
iteratively using a numeric technique such as Newton-Raphson. Following Cox and Hinkley [14], we have as
and under certain regularity conditions, the maximum likelihood estimators of
are asymptotically normal and consistent.
That is
in law
4-Limiting form of the QNBD: The Quasi-Poisson Distribution
As
, so that
, the distribution (1) takes the following form:
Therefore,
. Expressing the distribution in terms of the mean parameter μ, the limiting distribution can be written as:
(18)
In a paper that follows, we shall discuss the issues of maximum likelihood estimation for the parameter μ of the probability function (18) and the regression model associated with it.
7. Data Analysis: RNA_SEQ Data: Modeling the Distribution of Read Counts
Over the past decade, various statistical analysis tools have been developed to analyze expression profiling data generated by microarrays (Reviewed in [15] [16] [17] ). Before these tools can be applied to RNA-Seq data, it is worth noting that microarray data and RNA-Seq data are inherently different [16]. Microarray data is “analog” since expression levels are represented as continuous hybridization signal intensities. In contrast, RNA-Seq data is “digital”, representing expression levels as discrete counts. This inherent difference leads to the difference in the parametric statistical methods that are used since they often depend on the assumptions of the random mechanism that generates the data. The Poisson, Binomial and Negative binomial distributions are more suitable for modeling discrete data in an RNA-Seq experiment. Therefore, a statistical method developed for microarray data analysis cannot be directly applied to RNA-Seq data analysis without first examining the underlying distributions. Recently several statistical methods have been developed to deal specifically with RNA-Seq count data [17]. In an RNA-Seq dataset, the expression levels of a specific gene were modeled using the Poisson distribution. This Poisson model is verified in the case where there are only technical replicates using a single source of RNA [15]. In the Poisson model, over-dispersion occurs if the sample variance is greater than the sample mean. There could be several sources that cause over-dispersion in RNA-Seq data, including the variability in biological replicates due to heterogeneity within a population of cells, possible correlation between gene expressions due to regulation, and other uncontrolled variations [18]. The existence of over-dispersion in real data was observed in several previous studies [18]. Popular models to safeguard against over-dispersion include the negative binomial distribution, or two-stage Poisson distribution [19], as discussed below.
When over-dispersion is observed across the samples, the gene counts cannot be estimated accurately by a simple Poisson model [20]. One way to handle this problem is to allow the Poisson mean to be a random variable and then model the gene counts by the marginal distribution of the mean count. Specifically, assume that the Poisson mean follows a Gamma distribution then the marginal distribution of the gene count has a Negative Binomial distribution with mean
and variance =
, where ε is the dispersion parameter [20].
Yoon and Nam [21] [22] showed that the gene dispersion value as estimated under the negative binomial modelling of read counts is the key determinant of the read count bias.
Whenever multiple samples are available and instead of modeling the raw expression, we model the gene counts as a function of the experimental sample and gene dispersion as covariates. For highly expressed genes we used the QNB regression model for published data that we downloaded from http://woldlab.caltech.edu/rnaseq/.
The published data were downloaded from http://www.ncbi.nlm.nih.gov/sra/ as the fastq files: SRA010153 for the MAQC data, SRP000727 for the human data (the two low-coverage MAQC samples were excluded), SRX000559-SRX000564 for the yeast data.
We analyzed the read count of the Mice-Brain tissue data under four experimental conditions:
Z1 = Chrom_ chr11, Z2 = Chrom chr9_ra, Z3 = Chrom chrUn_ra, and Z4 = Chrom chr13_ra, and d = the gene dispersion levels. Zj are modeled as categorical variables with categorical with Z4 being the reference category, and d is measured on the continuous scale. Figure 6 shows the histogram of the read counts for the 4 groups (Tables 2(a)-(d)).
We now analyze the data using three count regression models; the Poisson, the Negative binomial, and the QNB (Tables 3-5).
![]()
Figure 6. Histogram of the read counts data.
![]()
Table 3. Fitting the data to Poisson regression model.
AIC: 314241
![]()
Table 4. Results of fitting data to the Negative Binomial regression model.
(Dispersion parameter for Negative Binomial (2.0488, with SE = 0.0185). AIC: 214866
![]()
Table 5. Results of fitting data to the QNBD.
AIC = 213104
1) Modeling read count as a Poisson regression model glm(formula = y ~ Z1 + Z2 + Z3 + d, family = poisson, data = ratdata);
2) Modeling read count using Negative Binomial to account for overdispersion;
3) Quasi negative binomial regression model.
8. Comments on the Data Fitting
We used three count regression models to fit the RNA-SEQ data. All models were fitted using the R package [23]. The first is a Poisson regression model, the second is the well-known negative binomial, and the third is the proposed QNB regression model. The Poisson model is fitted in R by applying the “GLM” while the negative binomial is fitted by using the “MASS” package in R. We provided the R-code for fitting the QNB in Appendix 2 in Appendix 2. We based the comparisons among these models on the AIC values (the smaller the better). Clearly, the Poisson model with the largest AIC = 314241, is the worst as it fails to properly account for the overdispersion in the data. Remarkable improvement is attained when the negative binomial regression model is used as its AIC = 214866. Although the QNB regression model has the smallest AIC = 213104, the improvement over the negative binomial is not tangible. We still believe that our proposed model should be a close competitor to the negative binomial model.
9. Discussion
There has been a growing interest among bioinformaticians and statisticians in constructing flexible distributions for counts that exhibit overdispersion to improve the modeling of count data. As a result, significant progress has been made towards generalizing some well-known discrete models, which have been successfully applied to problems arising in several areas of research. The proposed distribution was utilized to model two data sets; it was shown to provide a better fit than several other related models, including some with the same number of parameters. In the future paper, we shall demonstrate the applicability of the limiting form of our proposed distribution to genomics data together with inference procedures using multiple samples. Finally, we believe that the inferential results developed in this article should find numerous applications in bioinformatics, genomics, medicine, data engineering, and other areas of physical sciences.
Acknowledgements
The authors acknowledge the positive comments made by anonymous reviewers of this work.
Appendices
Appendix 1: R-CODE for Fitting the Univariate Version of the QNBD Using the Maximum Likelihood Method Applied to the “Brain Lesion” Data
QNBD<- function(x,theta,beta,log = FALSE){
loglik <- log((((beta-1)/(beta-1+beta*x))*(factorial(beta-1+beta*x))/
(factorial(x)*factorial(beta-1+beta*x-x))*
((theta^x)*(1-theta)^(beta*x+beta-1-x))))
if(log = = FALSE)
density <- exp(loglik)
else density<-loglik
return(density)
}
parameter <- maxlogL(x = x,dist = "QNBD",start = c(.01,2),optimizer = 'optim')
summary(parameter)
The fitting results by the method maximum likelihood are:
Appendix 2: R-CODE: QNB Regression Fitting by the Method of Maximum Likelihood Applied to the RNA_SEQ Read Count Data
llik=function(y,par){
b0=par [1]
b1=par [2]
b2=par [3]
b3=par [4]
b4=par [5]
beta=par [6]
n=length(y)
eta=b0+b1*x1+b2*x2+b3*x3+b4*x4
mu=exp(eta)/(1+exp(eta))
ll=sum(log(beta-1)-log(beta-1+beta*y)
+lgamma(beta+beta*y)-lgamma(1+y)-lgamma(beta+beta*y-y)
+y*log(mu)+(beta+beta*y-1-y)*log(1-mu))
return(-ll)
}
res=optim(par=c(2,.6,-.02,.42,-.12,2.1),llik,y=y,method="BFGS",hessian=T)
theta=res$par
theta
#CALCULATING THE STANDARD ERRORS OF MLE
out3=nlm(llik,theta,y=y,hessian=TRUE)
fish=out3$hessian
solve(fish)
element=diag((solve(fish)))
se=sqrt(element)
qqnorm(y,resid(out3))
z=theta/se
p_value=1-pnorm(abs(z))
result.GNBD=data.frame(theta,se,z,p_value)
result.GNBD=round(result.GNBD,4)