Quasi-Negative Binomial: Properties, Parametric Estimation, Regression Model and Application to RNA-SEQ Data

Abstract

Background: The Poisson and the Negative Binomial distributions are commonly used to model count data. The Poisson is characterized by the equality of mean and variance, whereas the Negative Binomial has a variance larger than the mean; the latter is therefore the appropriate model for over-dispersed count data. Objectives: A new two-parameter probability distribution called the Quasi-Negative Binomial Distribution (QNBD) is studied in this paper, generalizing the well-known negative binomial distribution. This model turns out to be quite flexible for analyzing count data. Our main objectives are to estimate the parameters of the proposed distribution and to discuss its applicability to genetics data. As an application, we demonstrate how the QNBD regression model can be used to analyze genomic data sets. Results: The new distribution is shown to provide a good fit with respect to the Akaike Information Criterion (AIC), a measure of model goodness of fit. The proposed distribution may serve as a viable alternative to other distributions available in the literature for modeling count data exhibiting overdispersion, arising in various fields of scientific investigation such as genomics and biomedicine.

Share and Cite:

Shoukri, M. and Aleid, M. (2022) Quasi-Negative Binomial: Properties, Parametric Estimation, Regression Model and Application to RNA-SEQ Data. Open Journal of Statistics, 12, 216-237. doi: 10.4236/ojs.2022.122016.

1. Introduction

A random variable X is said to have “Quasi Negative Binomial Distribution”, QNBD if the probability function is given by:

P_x = P(X=x) = \frac{\beta-1}{(\beta-1)+\beta x}\,\frac{\Gamma(\beta+\beta x)\,\theta^x(1-\theta)^{\beta+\beta x-x-1}}{x!\,\Gamma(\beta+\beta x-x)}, \quad x = 0,1,2,\ldots;\ 0<\theta<1,\ 0<\beta\theta<1 \quad (1)
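As a numerical sanity check, the pmf in (1) can be evaluated on the log scale to avoid overflow in the Gamma ratios. The following Python sketch (the helper name is ours, not from the paper) verifies that the probabilities sum to one for a parameter pair satisfying 0 < βθ < 1:

```python
from math import lgamma, exp, log

def qnbd_pmf(x, theta, beta):
    """QNBD probability P(X = x) from Eq. (1), computed on the log scale
    using lgamma for numerical stability."""
    lp = (log(beta - 1) - log(beta - 1 + beta * x)
          + lgamma(beta + beta * x) - lgamma(x + 1) - lgamma(beta + beta * x - x)
          + x * log(theta) + (beta + beta * x - x - 1) * log(1 - theta))
    return exp(lp)

# The probabilities should sum to (essentially) 1 when 0 < beta*theta < 1.
total = sum(qnbd_pmf(x, theta=0.10, beta=2.0) for x in range(500))
```

The same function also reproduces the closed-form mean given later in (3), which for θ = 0.10 and β = 2 equals (β−1)θ/(1−βθ) = 0.125.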

The distribution whose probability function is given in (1) was first derived by Takács [1] as a queuing model. He assumed a single-server queue with independent customers arriving according to a Poisson process in batches of size (β−1), with traffic intensity π and exponential service time with mean 1/α. It is also assumed that the service time is independent of the interarrival time. Under these conditions, the probability of an arrival is θ = π/(π+α) and that of a departure is 1−θ. Takács [1] and later Consul and Gupta [2] showed that, for fixed β, the probability that a busy period consists of x batches is given by (1). The distribution is a member of the Lagrange class of distributions [3] [4] [5].

The shape of the histogram of X depends on the combination (β, θ). In Figures 1 and 2 we can see that the distribution has a much longer tail for large values of β.

The paper is structured as follows: In Section 2 we demonstrate the connection between the QNBD and the regular exponential family of distributions [6] and derive the higher order central moments of the distribution. In Section 3, we derive the first order approximations of the variances and biases of the moment estimators of (β, θ). In Section 4, we derive the maximum likelihood estimators, and in Section 5 we develop an orthogonal polynomial approximation to the Fisher information along with the asymptotic biases of the maximum likelihood estimators. In Section 6, we develop the regression model, discuss the maximum likelihood estimation of the regression parameters, and present a limiting form of the distribution. In Section 7, we apply the models to real-life data arising from genomic studies, and we comment on the model fits in Section 8. We provide a general discussion in Section 9.

2. Moments of the Distribution

The simplest approach to derive the higher order central moments of the distribution is to first write (1) in the general form of the linear exponential family.

For fixed β, the QNBD belongs to the regular exponential family of discrete random variables:

P_x = h(x)\exp[\eta(\theta)S(x) - \psi(\theta)] \quad (2)

with

h(x) = \frac{\beta-1}{(\beta-1)+\beta x}\,\frac{\Gamma(\beta+\beta x)}{x!\,\Gamma(\beta+\beta x-x)}

S(x) = x

Figure 1. Histogram of QNBD for β = 2, and θ = 0.10.

Figure 2. Histogram of X for β = 10, and θ = 0.09.

\eta(\theta) = \log\left[\theta(1-\theta)^{\beta-1}\right]

\psi(\theta) = -\log(1-\theta)^{\beta-1} = -(\beta-1)\log(1-\theta)

The mean \mu_1 and variance \mu_2 of X are given respectively by:

\mu_1 = \frac{(\beta-1)\theta}{1-\beta\theta} \quad (3)

\mu_2 = \frac{(\beta-1)\theta(1-\theta)}{(1-\beta\theta)^3} \quad (4)

Writing g(\theta) = \exp[\eta(\theta)] and f(\theta) = \exp[\psi(\theta)], one can establish a recurrence relationship among the central moments \mu_r = E[(x-\mu_1)^r], namely:

\mu_{r+1} = \frac{g(\theta)}{g'(\theta)}\left[\frac{\partial\mu_r}{\partial\theta} + r\,\mu_{r-1}\frac{\partial\mu_1}{\partial\theta}\right], \quad r = 1,2,\ldots \quad (5)

Here the mean is \mu_1 = \dfrac{g(\theta)}{g'(\theta)}\dfrac{\partial\ln f}{\partial\theta}.

Therefore, the third and fourth central moments are:

\mu_3 = \frac{(\beta-1)\theta(1-\theta)(1-2\theta+2\beta\theta-\beta\theta^2)}{(1-\beta\theta)^5} \quad (6)

\mu_4 = 3\mu_2^2 + \frac{(\beta-1)\theta(1-\theta)}{(1-\beta\theta)^7}\,M \quad (7)

where,

M = 1 - 6\theta + 6\theta^2 + 2\beta\theta(4-9\theta+4\theta^2) + \beta^2\theta^2(6-6\theta+\theta^2).

Moreover, the fifth central moment is given by:

\mu_5 = 10\mu_2\mu_3 + \frac{(\beta-1)\theta(1-\theta)}{(1-\beta\theta)^9}\,B

where

B = 1 - 14\theta + 36\theta^2 + 24\theta^3 + 2\theta\beta(11-42\theta+28\theta^2) - \theta^2\beta(29-96\theta+58\theta^2) + \theta^2\beta^2(58-96\theta+29\theta^2) - 2\theta^3\beta^2(28-42\theta+11\theta^2) + 2\theta^3\beta^3(12-9\theta+\theta^2) - \theta^4\beta^3(18-12\theta+\theta^2)
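The closed forms (3), (4), (6) and (7) can be checked against the pmf (1) by direct summation; the sketch below (helper names are ours) computes the central moments numerically and compares them with the formulas:

```python
from math import lgamma, exp, log

def qnbd_pmf(x, theta, beta):
    # Eq. (1) on the log scale for stability
    lp = (log(beta - 1) - log(beta - 1 + beta * x)
          + lgamma(beta + beta * x) - lgamma(x + 1) - lgamma(beta + beta * x - x)
          + x * log(theta) + (beta + beta * x - x - 1) * log(1 - theta))
    return exp(lp)

theta, beta = 0.10, 2.0
xs = range(2000)                     # truncation; the tail mass is negligible
p = [qnbd_pmf(x, theta, beta) for x in xs]
m1 = sum(x * px for x, px in zip(xs, p))                       # mean, Eq. (3)
mu = [sum((x - m1)**k * px for x, px in zip(xs, p)) for k in (2, 3, 4)]

t = beta * theta
mu2 = (beta - 1) * theta * (1 - theta) / (1 - t)**3            # Eq. (4)
mu3 = (beta - 1) * theta * (1 - theta) * (1 - 2*theta + 2*t - beta*theta**2) / (1 - t)**5   # Eq. (6)
M = 1 - 6*theta + 6*theta**2 + 2*t*(4 - 9*theta + 4*theta**2) + t**2*(6 - 6*theta + theta**2)
mu4 = 3*mu2**2 + (beta - 1) * theta * (1 - theta) * M / (1 - t)**7                          # Eq. (7)
```

For θ = 0.10 and β = 2 the direct sums and the closed forms agree to machine precision.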

3. Moment Estimators

Suppose that we have a random sample x_1, x_2, \ldots, x_n with sample mean \bar{x} and sample variance s^2:

\bar{x} = \frac{1}{n}(x_1+x_2+\cdots+x_n), \qquad s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2

Equating the sample statistics to their corresponding population parameters (3) and (4) and solving for θ and β we get

\hat{\theta} = 1 - \frac{\bar{x}(1+\bar{x})^2}{s^2} \quad (8)

\hat{\beta} = \frac{s^2 - \bar{x}(1+\bar{x})}{s^2 - \bar{x}(1+\bar{x})^2} \quad (9)
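Equations (8) and (9) can be verified by inversion: substituting the estimates back into the population mean (3) and variance (4) must reproduce the sample moments exactly. A small Python check, using the sample moments of the lesion data analyzed below (x̄ = 3.37, s² = 69.63):

```python
xbar, s2 = 3.37, 69.63          # sample moments of the lesion-count data

theta_hat = 1 - xbar * (1 + xbar)**2 / s2                          # Eq. (8)
beta_hat = (s2 - xbar * (1 + xbar)) / (s2 - xbar * (1 + xbar)**2)  # Eq. (9)

# Plugging the estimates back into the population mean (3) and variance (4)
mean_back = (beta_hat - 1) * theta_hat / (1 - beta_hat * theta_hat)
var_back = (beta_hat - 1) * theta_hat * (1 - theta_hat) / (1 - beta_hat * theta_hat)**3
```

The recovered mean and variance equal x̄ and s² to machine precision, and the estimates satisfy the parameter constraints 0 < θ̂ < 1 and β̂θ̂ < 1.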

We use the delta method to evaluate the variances and biases of the moment estimators.

From Kendall and Ord [7] we have:

\operatorname{var}(\hat\theta) = \operatorname{var}(\bar{x})\left(\frac{\partial\hat\theta}{\partial\bar{x}}\right)^2 + \operatorname{var}(s^2)\left(\frac{\partial\hat\theta}{\partial s^2}\right)^2 + 2\operatorname{cov}(\bar{x},s^2)\frac{\partial\hat\theta}{\partial\bar{x}}\frac{\partial\hat\theta}{\partial s^2}

\operatorname{Bias}(\hat\theta) = \frac{1}{2!}\left[\operatorname{var}(\bar{x})\frac{\partial^2\hat\theta}{\partial\bar{x}^2} + \operatorname{var}(s^2)\frac{\partial^2\hat\theta}{\partial(s^2)^2} + 2\operatorname{cov}(\bar{x},s^2)\frac{\partial^2\hat\theta}{\partial\bar{x}\,\partial s^2}\right]

With similar expressions for var ( β ^ ) and Bias ( β ^ ) .

One can show that:

V_1 = \operatorname{var}(\hat\theta) = \frac{(1-\theta)[1+2\beta\theta-3\theta]^2}{n\theta(\beta-1)(1-\beta\theta)} + \frac{(\mu_4-\mu_2^2)(1-\beta\theta)^6}{n\theta^2(\beta-1)^2} - \frac{2\mu_3(1+2\beta\theta-3\theta)(1-\beta\theta)^4}{n\theta^2(\beta-1)^2}

V_2 = \operatorname{var}(\hat\beta) = \frac{(1-\theta)[1+\beta\theta-2\theta]^2}{n\theta(1-\theta)(\beta-1)} + \frac{(\mu_4-\mu_2^2)(1-\beta\theta)^8}{n\theta^2(1-\theta)^2(\beta-1)^2} - \frac{2\mu_3(1-\beta\theta)^6[1+\beta\theta-2\theta]}{n\theta^2(1-\theta)^2(\beta-1)^2}

\operatorname{Bias}(\hat\theta) = \frac{4(\beta-1)[2+\beta\theta-3\theta]}{n(1-\beta\theta)} - \frac{(\mu_4-\mu_2^2)(1-\beta\theta)^6}{n\theta^2(1-\theta)(\beta-1)^2} + \frac{\mu_3(1-\beta\theta)^4(1+2\beta\theta-3\theta)}{n(\beta-1)^2\theta^2(1-\theta)}

\operatorname{Bias}(\hat\beta) = \frac{1}{n}\left[1 + \frac{(\mu_4-\mu_2^2)(1-\beta\theta)^7}{\theta^2(1-\theta)^2(\beta-1)^2} - \frac{\mu_3(1-\beta\theta)^3}{\theta^2(1-\theta)^2}(1+\beta\theta-2\theta)\right]

\operatorname{cov}(\hat\beta,\hat\theta) = \frac{\mu_2}{n}\frac{\partial\theta}{\partial\bar{x}}\frac{\partial\beta}{\partial\bar{x}} + \frac{\mu_4-\mu_2^2}{n}\frac{\partial\theta}{\partial s^2}\frac{\partial\beta}{\partial s^2} + \frac{\mu_3}{n}\left[\frac{\partial\theta}{\partial\bar{x}}\frac{\partial\beta}{\partial s^2} + \frac{\partial\theta}{\partial s^2}\frac{\partial\beta}{\partial\bar{x}}\right]

Note that the generalized variance of the moment estimators, i.e., the determinant of their variance-covariance matrix, is given by:

D = \operatorname{var}(\hat\theta)\operatorname{var}(\hat\beta) - \operatorname{cov}^2(\hat\theta,\hat\beta)

Example: Modeling the number of brain lesions to predict Multiple Sclerosis.

The use of gadolinium (Gd) with T1-weighted imaging can identify areas of breakdown in the blood-brain barrier and increases the reliability of detecting active Multiple Sclerosis (MS) lesions [8]. The number of new Gd-enhancing lesions is a widely used end point for monitoring disease activity and for evaluating the effect of treatments in phase II clinical trials. In these studies, the results of the Magnetic Resonance Imaging (MRI) end point are in the form of counts [9]. To deal with the problem of overdispersion, the negative binomial distribution is used to model this type of data.

As an application of the QNBD, we simulated lesion count data similar to the situation described in [8] (Table 1).

The sample size is n = 116 subjects.

The histogram of the data is given in Figure 3; the y-axis gives the frequency of each value of x.

The sample moments are mean(x) = 3.37 and var(x) = 69.63, and the moment estimators are \hat\theta = 0.077 and \hat\beta = 10.227.

Bootstrapping the distribution of the moment estimators gave SE(\hat\theta) = 0.344 and SE(\hat\beta) = 0.106 (Figure 4).

Table 1. Distribution of the number of brain lesions.

Figure 3. Histogram of the data shown in Table 1; the x-axis gives the values of x.

Figure 4. Histogram of distribution of θ ^ based on 1000 bootstrap samples.

The distribution is negatively skewed. The empirical bias in the moment estimator of θ is bias(\hat\theta) = 0.1659; similarly, bias(\hat\beta) = 9.77.

From Figure 5 we may infer that the distribution of \hat\beta seems to be a mixture of two distributions, i.e., bimodal. From these results, we may conclude that the moment estimators are not reliable unless we have an extremely large sample. In the next section, we discuss maximum likelihood estimation.
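The bootstrap assessment of the moment estimators can be sketched as follows. This is a hypothetical illustration on simulated QNBD data (not the authors' lesion sample), with sampling by inversion of the cdf; replicates for which (8) falls outside (0, 1) are skipped, which is itself a symptom of the instability noted above:

```python
import random
from math import lgamma, exp, log

def qnbd_pmf(x, theta, beta):
    # Eq. (1) on the log scale
    lp = (log(beta - 1) - log(beta - 1 + beta * x) + lgamma(beta + beta * x)
          - lgamma(x + 1) - lgamma(beta + beta * x - x)
          + x * log(theta) + (beta + beta * x - x - 1) * log(1 - theta))
    return exp(lp)

def moment_estimates(sample):
    n = len(sample)
    xbar = sum(sample) / n
    s2 = sum((x - xbar)**2 for x in sample) / n
    theta_hat = 1 - xbar * (1 + xbar)**2 / s2                         # Eq. (8)
    beta_hat = (s2 - xbar * (1 + xbar)) / (s2 - xbar * (1 + xbar)**2) # Eq. (9)
    return theta_hat, beta_hat

random.seed(7)
cdf, c = [], 0.0
for x in range(200):
    c += qnbd_pmf(x, 0.2, 2.0)
    cdf.append(c)

def draw():
    u = random.random()
    return next(i for i, ci in enumerate(cdf) if u <= ci)

data = [draw() for _ in range(200)]       # one simulated "observed" sample

boot_theta = []
for _ in range(300):                      # bootstrap resamples
    resample = [random.choice(data) for _ in data]
    th, _b = moment_estimates(resample)
    if 0 < th < 1:                        # skip unstable replicates
        boot_theta.append(th)

m = sum(boot_theta) / len(boot_theta)
se_theta = (sum((t - m)**2 for t in boot_theta) / (len(boot_theta) - 1))**0.5
```

The bootstrap standard error of θ̂ is simply the standard deviation of the retained replicate estimates.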

4. Maximum Likelihood Estimators (MLE)

It is well known that estimators obtained from the method of maximum likelihood possess optimal properties such as asymptotic normality and efficiency. Based on a simple random sample, the log-likelihood function l is given by:

l = n\log(\beta-1) - \sum_{i=1}^{n}\log(\beta-1+\beta x_i) + \sum_{i=1}^{n}\sum_{j=1}^{x_i}\log[\beta(1+x_i)-j] + n\bar{x}\log\theta + n(\beta-1)(1+\bar{x})\log(1-\theta) \quad (10)

Setting \partial l/\partial\theta equal to zero and solving for \theta, we get:

\tilde\theta = \frac{\bar{x}}{\tilde\beta(1+\bar{x})-1} \quad (11)

Similarly, setting \partial l/\partial\beta equal to zero and solving for \beta we get:

\tilde\beta = (1+\bar{x})^{-1} + \bar{x}(1+\bar{x})^{-1}\left[1-\exp\left[\Omega_x\left((\tilde\beta-1)+(1-\bar{x})\right)^{-1}\right]\right]^{-1} \quad (12)

where

\Omega_x = [n(1+\bar{x})]^{-1}\left[\sum_{i=1}^{n}\frac{1+x_i}{\tilde\beta(1+x_i)-1} - \sum_{i=1}^{n}\sum_{j=1}^{x_i}\frac{1+x_i}{\tilde\beta(1+x_i)-j}\right]

Figure 5. Histogram of distribution of β ^ based on 1000 bootstrap samples.

The MLEs of θ and β are thus obtained by solving (11) and (12) iteratively, noting that (12) has the form \tilde\beta = f(\tilde\beta), i.e., a fixed-point equation.
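Because (11) is the exact maximizer of l in θ for a fixed β, the two-parameter maximization reduces to a one-dimensional search in β. The sketch below illustrates this on simulated data (our own illustration; a simple grid search stands in for the fixed-point iteration of (12)):

```python
import random
from collections import Counter
from math import lgamma, log, exp

def qnbd_pmf(x, theta, beta):
    # Eq. (1) on the log scale
    lp = (log(beta - 1) - log(beta - 1 + beta * x) + lgamma(beta + beta * x)
          - lgamma(x + 1) - lgamma(beta + beta * x - x)
          + x * log(theta) + (beta + beta * x - x - 1) * log(1 - theta))
    return exp(lp)

# Simulate a sample from QNBD(theta=0.2, beta=2) by cdf inversion.
random.seed(3)
cdf, c = [], 0.0
for x in range(200):
    c += qnbd_pmf(x, 0.2, 2.0)
    cdf.append(c)

def draw():
    u = random.random()
    return next(i for i, ci in enumerate(cdf) if u <= ci)

data = [draw() for _ in range(2000)]
freq = Counter(data)
xbar = sum(data) / len(data)

def loglik(theta, beta):
    # Eq. (10), up to the additive constant -sum(log x_i!)
    return sum(m * (log(beta - 1) - log(beta - 1 + beta * x)
                    + lgamma(beta + beta * x) - lgamma(beta + beta * x - x)
                    + x * log(theta) + (beta - 1) * (1 + x) * log(1 - theta))
               for x, m in freq.items())

def theta_of(beta):           # Eq. (11): conditional MLE of theta given beta
    return xbar / (beta * (1 + xbar) - 1)

grid = [1.05 + 0.005 * k for k in range(800)]     # beta in (1.05, 5.05)
beta_mle = max(grid, key=lambda b: loglik(theta_of(b), b))
theta_mle = theta_of(beta_mle)
```

With 2000 simulated observations the profile search recovers estimates close to the true (θ, β) = (0.2, 2), and the constraint βθ < 1 holds automatically along the profile.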

Elements of the variance-covariance matrix of (\tilde\theta, \tilde\beta) are obtained by inverting Fisher's information matrix. We can show that

i_{\theta\theta} = -E\left[\frac{\partial^2 l}{\partial\theta^2}\right] = \frac{n(\beta-1)}{\theta(1-\theta)(1-\beta\theta)} \quad (13)

i_{\theta\beta} = -E\left[\frac{\partial^2 l}{\partial\theta\,\partial\beta}\right] = \frac{n}{1-\beta\theta} \quad (14)

i_{\beta\beta} = -E\left[\frac{\partial^2 l}{\partial\beta^2}\right] = \frac{n}{(\beta-1)^2} - E\left[\sum_{i=1}^{n}\frac{(1+x_i)^2}{[\beta(1+x_i)-1]^2} - \sum_{i=1}^{n}\sum_{j=1}^{x_i}\frac{(1+x_i)^2}{[\beta(1+x_i)-j]^2}\right] \quad (15)

\operatorname{var}(\tilde\theta) = i_{\beta\beta}/\Delta, \quad \operatorname{var}(\tilde\beta) = i_{\theta\theta}/\Delta,

and

\operatorname{cov}(\tilde\theta,\tilde\beta) = -i_{\theta\beta}/\Delta

where \Delta = i_{\theta\theta}\,i_{\beta\beta} - i_{\theta\beta}^2.

We note that on using the digamma approximation we can write

i_{\beta\beta} \approx \frac{n}{(\beta-1)^2} - E\left[\sum_{i=1}^{n}\frac{(1+x_i)^2}{[\beta(1+x_i)-1]^2} - \sum_{i=1}^{n}\frac{x_i(1+x_i)^2}{[1+\beta(1+x_i)][1+\beta(1+x_i)-x_i]}\right]

The R-Code for fitting the QNBD is given in Appendix 1.

5. Orthogonal Polynomial Approximation for i β β

The evaluation of the asymptotic variance-covariance matrix is difficult because P_{22} = E[\partial^2\log l/\partial\beta^2] does not have a tractable form. To overcome this difficulty, following [5] we employ an asymptotic expansion of \partial\log P_x/\partial\beta as a linear combination of orthogonal polynomials. From Morgan et al. [9], if P_x is a distribution function with finite moments \mu_r of all orders, then the point x_0 is a point of increase for P_x if P(x_0+h) > P(x_0-h) for every h > 0. If the distribution function P has sufficiently many points of increase, Cramér [10] proved that there exists a sequence of polynomials G_0(x), G_1(x), \ldots, uniquely determined by the following conditions:

1) G_n(x) is of degree n, and the coefficient of x^n in G_n(x) is positive;

2) the G_n(x) satisfy the orthogonality conditions

\sum_{x=0}^{\infty} G_r(x)G_s(x)P_x = E[G_r^2(x)] \ \text{if}\ r=s, \quad = 0 \ \text{if}\ r\neq s \quad (r,s = 0,1,2,\ldots)

Szegő [11] derived the formal Fourier expansion of a continuous function h ( x ) in terms of a set of orthogonal polynomials such that:

h(x) = \sum_{r=0}^{\infty} a_r G_r(x)

where the a_r are selected so that:

\sum_{x=0}^{\infty}\left[\frac{\partial\log P_x}{\partial\beta} - \sum_{r=0}^{\infty}a_r G_r(x)\right]^2 P_x

is minimum. He showed that

a_0 = 0, \quad a_1 = \frac{\partial\mu_1/\partial\beta}{E[G_1^2(x)]}, \quad a_2 = \frac{\partial\mu_2/\partial\beta - (\mu_3/\mu_2)\,\partial\mu_1/\partial\beta}{E[G_2^2(x)]}

Direct calculations give:

G_0 \equiv 1, \quad G_1(x) = x-\mu_1, \quad G_2(x) = (x-\mu_1)^2 - \frac{\mu_3}{\mu_2}(x-\mu_1) - \mu_2, are the orthogonal polynomials associated with the probability function P_x, where E[G_1^2(x)] = \mu_2. Moreover, we write

\Omega = E[G_2^2(x)] = E\left[(x-\mu_1)^4 + \frac{\mu_3^2}{\mu_2^2}(x-\mu_1)^2 + \mu_2^2 - \frac{2\mu_3}{\mu_2}(x-\mu_1)^3 - 2\mu_2(x-\mu_1)^2 + \frac{2\mu_3}{\mu_2}(x-\mu_1)\mu_2\right]

Hence

\Omega = \mu_4 + \frac{\mu_3^2}{\mu_2} + \mu_2^2 - \frac{2\mu_3^2}{\mu_2} - 2\mu_2^2 = \mu_4 - \frac{\mu_3^2}{\mu_2} - \mu_2^2
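The orthogonality of G₁ and G₂ and the closed form for Ω can be confirmed numerically under the QNBD (a sketch with our own helper names):

```python
from math import lgamma, exp, log

def qnbd_pmf(x, theta, beta):
    # Eq. (1) on the log scale
    lp = (log(beta - 1) - log(beta - 1 + beta * x) + lgamma(beta + beta * x)
          - lgamma(x + 1) - lgamma(beta + beta * x - x)
          + x * log(theta) + (beta + beta * x - x - 1) * log(1 - theta))
    return exp(lp)

theta, beta = 0.10, 2.0
xs = list(range(2000))
p = [qnbd_pmf(x, theta, beta) for x in xs]
m1 = sum(x * px for x, px in zip(xs, p))
mu2, mu3, mu4 = (sum((x - m1)**k * px for x, px in zip(xs, p)) for k in (2, 3, 4))

def G1(x): return x - m1
def G2(x): return (x - m1)**2 - (mu3 / mu2) * (x - m1) - mu2

e_G2 = sum(G2(x) * px for x, px in zip(xs, p))            # E[G2] should be ~0
e_G1G2 = sum(G1(x) * G2(x) * px for x, px in zip(xs, p))  # orthogonality: ~0
omega = sum(G2(x)**2 * px for x, px in zip(xs, p))
omega_closed = mu4 - mu3**2 / mu2 - mu2**2                # the closed form above
```

Both expectations vanish to numerical precision, and the summed Ω agrees with the closed form.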

Now, since

\frac{\partial\mu_1}{\partial\beta} = \frac{\partial}{\partial\beta}\left[\frac{(\beta-1)\theta}{1-\beta\theta}\right] = \frac{\theta(1-\theta)}{(1-\beta\theta)^2}, \quad\text{and}\quad \frac{\partial\mu_2}{\partial\beta} = \frac{\partial}{\partial\beta}\left[\frac{(\beta-1)\theta(1-\theta)}{(1-\beta\theta)^3}\right] = \frac{\theta(1-\theta)(1+2\beta\theta-3\theta)}{(1-\beta\theta)^4}

then

\frac{\partial\log P_x}{\partial\beta} = \frac{\partial\mu_1}{\partial\beta}\,\frac{x-\mu_1}{E[G_1^2(x)]} + \frac{\frac{\partial\mu_2}{\partial\beta} - \frac{\mu_3}{\mu_2}\frac{\partial\mu_1}{\partial\beta}}{E[G_2^2(x)]}\left[(x-\mu_1)^2 - \frac{\mu_3}{\mu_2}(x-\mu_1) - \mu_2\right]

Since

n\,E\left[\left(\frac{\partial\log P_x}{\partial\beta}\right)^2\right] = -E\left[\frac{\partial^2 l}{\partial\beta^2}\right]

Then

i_{\beta\beta} \approx n\left[\frac{\theta(1-\theta)}{(\beta-1)(1-\beta\theta)} + \frac{\theta^4(1-\theta)^2}{(1-\beta\theta)^6\left[\mu_4 - \mu_3^2/\mu_2 - \mu_2^2\right]}\right]

The asymptotic relative efficiency of the moment estimators is therefore given by:

\operatorname{Eff} = \frac{1}{\Delta D}

For the lesion data, Eff = 16.6%. We interpret this number as follows: for the moment estimators to be as precise as the maximum likelihood estimators, we would need a sample roughly 1/0.166 ≈ 6 times as large as the one used for maximum likelihood estimation.

5.1. Asymptotic Biases of the MLE

Unlike the moment estimators, (\tilde\theta, \tilde\beta) do not have closed-form expressions, and the delta method cannot be applied to obtain their asymptotic biases. Shenton and Wallington [12] used an approach that depends on the asymptotic expansion of the log-likelihood function. We denote the biases of \tilde\theta and \tilde\beta by b_1(\tilde\theta) and b_2(\tilde\beta); these are the solutions of the system of equations

\begin{pmatrix} P_{11} & P_{12}\\ P_{21} & P_{22}\end{pmatrix}\begin{pmatrix} b_1(\tilde\theta)\\ b_2(\tilde\beta)\end{pmatrix} = \begin{pmatrix} A_1/(2Dn)\\ A_2/(2Dn)\end{pmatrix}

In the above system of equations we use the following notation:

D = P_{11}P_{22} - P_{12}^2

A_1 = P_{22}P_{1,11} - 2P_{12}P_{1,12} + P_{11}P_{1,22}

A_2 = P_{22}P_{2,11} - 2P_{12}P_{2,12} + P_{11}P_{2,22}

and

P_{1,11} = E\left[\frac{1}{P^2}\frac{\partial P}{\partial\theta}\frac{\partial^2 P}{\partial\theta^2}\right]

P_{1,12} = E\left[\frac{1}{P^2}\frac{\partial P}{\partial\theta}\frac{\partial^2 P}{\partial\theta\,\partial\beta}\right]

P_{1,22} = E\left[\frac{1}{P^2}\frac{\partial P}{\partial\theta}\frac{\partial^2 P}{\partial\beta^2}\right]

P_{2,11} = E\left[\frac{1}{P^2}\frac{\partial P}{\partial\beta}\frac{\partial^2 P}{\partial\theta^2}\right]

P_{2,12} = E\left[\frac{1}{P^2}\frac{\partial P}{\partial\beta}\frac{\partial^2 P}{\partial\theta\,\partial\beta}\right]

P_{2,22} = E\left[\frac{1}{P^2}\frac{\partial P}{\partial\beta}\frac{\partial^2 P}{\partial\beta^2}\right]

Since

\frac{\partial P}{\partial\theta} = \frac{(x-\mu_1)(1-\beta\theta)}{\theta(1-\theta)}\,P

We can show that

P_{1,11} = \mu_2 w_1 + \mu_3 w_2

where

w_1 = \frac{(1-\beta\theta)(2\theta-\beta\theta^2-1)}{\theta^3(1-\theta)^3}

and

w_2 = \frac{(1-\beta\theta)^3}{\theta^3(1-\theta)^3}

P_{1,12} = \left[\theta(1-\theta)(1-\beta\theta)\right]^{-1}

P_{2,11} = (1-\theta)^{-2}, \qquad P_{1,22} = P_{2,22} = 0

P_{2,12} = D_1 + D_2 + D_3

where

D_1 = \frac{1-2\theta+\beta\theta(2-\theta)}{(\beta-1)(1-\theta)^2}

D_2 = \left[\frac{\mu_2\frac{\partial\mu_2}{\partial\beta} - \mu_3\frac{\partial\mu_1}{\partial\beta}}{\mu_2\mu_4 - \mu_3^2 - \mu_2^3}\right]^2\left[\mu_5 + \frac{\mu_3^3}{\mu_2^2} - \frac{2\mu_3\mu_4}{\mu_2}\right]

D_3 = \frac{2(1-\beta\theta)^5}{(\beta-1)^2\theta(1-\theta)^2}\left[\mu_2\frac{\partial\mu_2}{\partial\beta} - \mu_3\frac{\partial\mu_1}{\partial\beta}\right]

where

\frac{\partial\mu_1}{\partial\beta} = \frac{\theta(1-\theta)}{(1-\beta\theta)^2} \quad\text{and}\quad \frac{\partial\mu_2}{\partial\beta} = \frac{\theta(1-\theta)(1+2\beta\theta-3\theta)}{(1-\beta\theta)^4}

Finally, using the above information we can show that P 2 , 22 = 0 .

Solving the system of equations, we obtain the asymptotic biases so that

\operatorname{bias}(\tilde\theta) = \frac{\theta^2(1-\theta)^3(1-\beta\theta)^2}{2n\left[(\beta-1)(1-\theta)P_{22} - \theta(1-\beta\theta)\right]^2}\times\left[P_{22}(1-\theta)^2 - 2P_{1,12}(1-\theta) + 2P_{22}\theta(1-\theta)(1-\beta\theta) - 2\beta(\beta-1)P_{22}^2\theta(1-\beta\theta)^2\right]

\operatorname{bias}(\tilde\beta) = \frac{\theta^2(1-\theta)^2(1-\beta\theta)}{2n\left[(\beta-1)(1-\theta)P_{22} - (1-\beta\theta)\right]^2}\times\left[\frac{2\beta(\beta-1) - \beta P_{22}\theta(1-\beta\theta)^2}{\theta(1-\theta)} + 2(\beta-1)P_{22}\theta - \frac{(\beta-1)P_{22}}{1-\theta}\right]

For the lesion data, the biases of the maximum likelihood estimators are given by:

\operatorname{bias}(\tilde\theta) = 0.003 \quad\text{and}\quad \operatorname{bias}(\tilde\beta) = 0.002

6. Quasi Negative Binomial Regression

Our aim in this section is to develop a regression model based on the QNBD. The approach is facilitated by the fact that the QNBD is a member of the regular exponential family [13]. We employ the transformation:

\tau(\theta_i) = z_i^T\gamma, \quad i = 1,2,\ldots,n \quad (16)

Here we assume \tau(\theta_i) to be a monotone, differentiable and positive function of θ [13]. In (16), z_i is a ν × 1 (ν < n) vector of explanatory variables and γ is a vector of regression parameters. To estimate \gamma_1, \gamma_2, \ldots, \gamma_q and β, we assume that

x_i \sim \operatorname{QNBD}(\theta_i, \beta), \quad i = 1,2,\ldots,n

are independent random variables and

\operatorname{logit}[\theta_i(z_i)] = z_i^T\gamma \quad (17)

In this section, we derive the maximum likelihood estimators of the regression parameters, the parameter β and their asymptotic properties. The log-likelihood function is given by:

l = n\log(\beta-1) - \sum_{i=1}^{n}\log(\beta-1+\beta x_i) + \sum_{i=1}^{n}\log\Gamma(\beta+\beta x_i) - \sum_{i=1}^{n}\log\Gamma(\beta+\beta x_i-x_i) + \sum_{i=1}^{n}x_i\log\theta_i + \sum_{i=1}^{n}(\beta-1)(1+x_i)\log(1-\theta_i) = l_1(\beta, x_i) + l_2(\beta, x_i, \theta_i)

\frac{\partial l_2}{\partial\gamma_r} = \sum_{i=1}^{n}x_i\frac{\partial}{\partial\gamma_r}[\log\theta_i] + \sum_{i=1}^{n}(\beta-1)(1+x_i)\frac{\partial}{\partial\gamma_r}[\log(1-\theta_i)] = \sum_{i=1}^{n}x_i z_{ir}\left[1+e^{z_i^T\gamma}\right]^{-1} - \sum_{i=1}^{n}(\beta-1)(1+x_i)z_{ir}e^{z_i^T\gamma}\left[1+e^{z_i^T\gamma}\right]^{-1}

\sigma_{r\beta} \equiv -\frac{\partial^2 l_2}{\partial\beta\,\partial\gamma_r} = \sum_{i=1}^{n}(1+x_i)z_{ir}e^{z_i^T\gamma}\left[1+e^{z_i^T\gamma}\right]^{-1}

I_{\beta r} = -E\left[\frac{\partial^2 l}{\partial\beta\,\partial\gamma_r}\right] = \sum_{i=1}^{n}z_{ir}\frac{e^{z_i^T\gamma}}{1+e^{z_i^T\gamma}}E[1+x_i] = \sum_{i=1}^{n}z_{ir}e^{z_i^T\gamma}\left[\left(1+e^{z_i^T\gamma}\right)\left(1-(\beta-1)e^{z_i^T\gamma}\right)\right]^{-1}

\frac{\partial^2 l_2}{\partial\gamma_r\,\partial\gamma_s} = -\sum_{i=1}^{n}x_i z_{ir}z_{is}e^{z_i^T\gamma}\left(1+e^{z_i^T\gamma}\right)^{-2} - (\beta-1)\sum_{i=1}^{n}(1+x_i)z_{ir}z_{is}\left[\frac{e^{z_i^T\gamma}}{1+e^{z_i^T\gamma}} - \left(\frac{e^{z_i^T\gamma}}{1+e^{z_i^T\gamma}}\right)^2\right]

-E\left[\frac{\partial^2 l_2}{\partial\gamma_r\,\partial\gamma_s}\right] \equiv \sigma_{rs} = \sum_{i=1}^{n}z_{ir}z_{is}\theta_i(1-\theta_i)\frac{(\beta-1)\theta_i}{1-\beta\theta_i} + (\beta-1)\sum_{i=1}^{n}z_{ir}z_{is}\frac{1-\theta_i}{1-\beta\theta_i}\left[\theta_i-\theta_i^2\right]

-E\left[\frac{\partial^2 l}{\partial\gamma_r\,\partial\gamma_s}\right] = (\beta-1)\sum_{i=1}^{n}z_{ir}z_{is}\frac{\theta_i(1-\theta_i)}{1-\beta\theta_i}

where,

\theta_i = e^{z_i^T\gamma}\big/\left(1+e^{z_i^T\gamma}\right)

\partial l/\partial\beta can be approximated using the digamma series

\frac{\Gamma'(y)}{\Gamma(y)} = -\delta - \frac{1}{y} + \sum_{j=1}^{\infty}\frac{y}{j(y+j)}

where \delta is Euler's constant. The exact derivative is

\frac{\partial l}{\partial\beta} = \frac{n}{\beta-1} - \sum_{i=1}^{n}\frac{1+x_i}{\beta-1+\beta x_i} + \sum_{i=1}^{n}\frac{\Gamma'(\beta+\beta x_i)}{\Gamma(\beta+\beta x_i)}(1+x_i) - \sum_{i=1}^{n}\frac{\Gamma'(\beta+\beta x_i-x_i)}{\Gamma(\beta+\beta x_i-x_i)}(1+x_i) + \sum_{i=1}^{n}(1+x_i)\log(1-\theta_i)

and therefore

\frac{\partial l}{\partial\beta} \approx \frac{n}{\beta-1} - \sum_{i=1}^{n}\frac{1+x_i}{\beta(1+x_i)-1} + \sum_{i=1}^{n}(1+x_i)\left[\log(1+\beta(1+x_i)) - \log(\beta(1+x_i)+1-x_i)\right] + \sum_{i=1}^{n}(1+x_i)\log(1-\theta_i)

\frac{\partial^2 l}{\partial\beta^2} \approx -\frac{n}{(\beta-1)^2} + \sum_{i=1}^{n}\frac{(1+x_i)^2}{[\beta(1+x_i)-1]^2} + \sum_{i=1}^{n}(1+x_i)\left\{\frac{1+x_i}{1+\beta(1+x_i)} - \frac{1+x_i}{1+\beta(1+x_i)-x_i}\right\}

Simplifying we get:

\frac{\partial^2 l}{\partial\beta^2} = -\frac{n}{(\beta-1)^2} + \sum_{i=1}^{n}\frac{(1+x_i)^2}{[\beta(1+x_i)-1]^2} - \sum_{i=1}^{n}\frac{x_i(1+x_i)^2}{[1+\beta(1+x_i)]^2 - x_i[1+\beta(1+x_i)]}

\sigma_{\beta\beta} = -E\left[\frac{\partial^2 l}{\partial\beta^2}\right] can be approximated by:

\sigma_{\beta\beta} = \frac{1}{(\beta-1)^2}\sum_{i=1}^{n}\theta_i(2-\theta_i) + (\beta-1)\sum_{i=1}^{n}\frac{\theta_i(1-\theta_i)^2}{(1-\beta\theta_i)(1-2\beta\theta_i+\beta)(1-3\beta\theta_i+\beta+\theta_i)}

The variance-covariance matrix of the estimated parameters γ and β based on the regression model is given by the inverse of Fisher's information matrix:

\Sigma = \begin{bmatrix}\sigma_{\gamma\gamma} & \sigma_{\gamma\beta}\\ \sigma_{\gamma\beta}^T & \sigma_{\beta\beta}\end{bmatrix}^{-1} = \begin{bmatrix} M & O^T\\ O & C\end{bmatrix}

where M is a q × q symmetric matrix whose elements are m_{ij} = \operatorname{cov}(\hat\gamma_i, \hat\gamma_j), O is a 1 × q matrix whose elements are O_j = \operatorname{cov}(\hat\gamma_j, \hat\beta), and C is a scalar with C = \operatorname{var}(\hat\beta).

The simplest approach to obtain the maximum likelihood estimators of γ and β is to solve the equations

\frac{\partial l}{\partial\beta} = 0 \quad\text{and}\quad \frac{\partial l}{\partial\gamma_r} = 0, \quad r = 1,2,\ldots,q

iteratively, using a numerical technique such as Newton-Raphson. Following Cox and Hinkley [14], as n → ∞ and under certain regularity conditions, the maximum likelihood estimator \hat\phi = (\hat\gamma, \hat\beta) is asymptotically normal and consistent.

That is

\sqrt{n}\,(\hat\phi - \phi) \to N_{q+1}(0, \Sigma) \quad \text{in law}
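The regression log-likelihood with the logit link (17) can be prototyped directly. The sketch below mirrors the structure of the Appendix 2 R code in Python (function names are ours) and checks that an intercept-only model collapses to the univariate likelihood (10):

```python
from math import lgamma, log, exp

def nll_qnb_reg(params, y, Z):
    """Negative log-likelihood of the QNB regression model with a logit link;
    an illustrative sketch, not the authors' production code."""
    *gamma, beta = params
    total = 0.0
    for yi, zi in zip(y, Z):
        eta = sum(g * z for g, z in zip(gamma, zi))
        th = 1.0 / (1.0 + exp(-eta))            # theta_i = logistic(z_i' gamma)
        total += (log(beta - 1) - log(beta - 1 + beta * yi)
                  + lgamma(beta + beta * yi) - lgamma(1 + yi)
                  - lgamma(beta + beta * yi - yi)
                  + yi * log(th) + (beta - 1) * (1 + yi) * log(1 - th))
    return -total

# Intercept-only check: with Z = [[1]]*n the model reduces to the
# univariate QNBD likelihood with theta = logistic(gamma0).
y = [0, 1, 2, 0, 3, 1, 0, 2]
g0, beta = -1.5, 2.0
theta = 1.0 / (1.0 + exp(-g0))                  # here beta*theta < 1 holds
uni = sum(log(beta - 1) - log(beta - 1 + beta * x) + lgamma(beta + beta * x)
          - lgamma(1 + x) - lgamma(beta + beta * x - x)
          + x * log(theta) + (beta - 1) * (1 + x) * log(1 - theta) for x in y)
reg = -nll_qnb_reg([g0, beta], y, [[1.0]] * len(y))
```

In practice the parameters would be estimated by passing `nll_qnb_reg` to a numerical optimizer, as done with `optim` in the R appendix.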

6.1. Limiting Form of the QNBD: The Quasi-Poisson Distribution

As \beta \to \infty and \theta \to 0 such that \beta\theta = \alpha remains fixed, the distribution (1) takes the following form:

P_x = \frac{\alpha^x}{x!}(1+x)^{x-1}e^{-\alpha(1+x)}, \quad 0 < \alpha < 1

\mu = E(x) = \frac{\alpha}{1-\alpha}, \qquad \operatorname{var}(x) = \frac{\alpha}{(1-\alpha)^3}

Therefore, \operatorname{var}(x) = \mu(1+\mu)^2. Expressing the distribution in terms of the mean parameter μ, the limiting distribution can be written as:

P_x = \frac{(1+x)^{x-1}}{x!}\left(\frac{\mu}{1+\mu}\right)^x e^{-\frac{\mu}{1+\mu}(1+x)} \quad (18)
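The limit can be checked numerically: for large β with βθ = α held fixed, the QNBD pmf (1) approaches the quasi-Poisson pmf above. A sketch (the tolerance is our choice):

```python
from math import lgamma, exp, log

def qnbd_pmf(x, theta, beta):
    # Eq. (1) on the log scale
    lp = (log(beta - 1) - log(beta - 1 + beta * x) + lgamma(beta + beta * x)
          - lgamma(x + 1) - lgamma(beta + beta * x - x)
          + x * log(theta) + (beta + beta * x - x - 1) * log(1 - theta))
    return exp(lp)

def qpd_pmf(x, alpha):
    # limiting pmf: alpha^x (1+x)^(x-1) e^(-alpha(1+x)) / x!
    return exp(x * log(alpha) + (x - 1) * log(1 + x) - alpha * (1 + x) - lgamma(x + 1))

alpha = 0.5
total = sum(qpd_pmf(x, alpha) for x in range(400))   # limiting pmf sums to ~1
err = max(abs(qnbd_pmf(x, alpha / 1000.0, 1000.0) - qpd_pmf(x, alpha))
          for x in range(30))                        # pointwise gap at beta=1000
```

For β = 1000 the maximum pointwise discrepancy is already of order 10⁻⁴, consistent with an O(1/β) rate of convergence.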

In a follow-up paper, we shall discuss maximum likelihood estimation of the parameter μ of the probability function (18) and the regression model associated with it.

7. Data Analysis: RNA-Seq Data: Modeling the Distribution of Read Counts

Over the past decade, various statistical analysis tools have been developed to analyze expression profiling data generated by microarrays (Reviewed in [15] [16] [17] ). Before these tools can be applied to RNA-Seq data, it is worth noting that microarray data and RNA-Seq data are inherently different [16]. Microarray data is “analog” since expression levels are represented as continuous hybridization signal intensities. In contrast, RNA-Seq data is “digital”, representing expression levels as discrete counts. This inherent difference leads to the difference in the parametric statistical methods that are used since they often depend on the assumptions of the random mechanism that generates the data. The Poisson, Binomial and Negative binomial distributions are more suitable for modeling discrete data in an RNA-Seq experiment. Therefore, a statistical method developed for microarray data analysis cannot be directly applied to RNA-Seq data analysis without first examining the underlying distributions. Recently several statistical methods have been developed to deal specifically with RNA-Seq count data [17]. In an RNA-Seq dataset, the expression levels of a specific gene were modeled using the Poisson distribution. This Poisson model is verified in the case where there are only technical replicates using a single source of RNA [15]. In the Poisson model, over-dispersion occurs if the sample variance is greater than the sample mean. There could be several sources that cause over-dispersion in RNA-Seq data, including the variability in biological replicates due to heterogeneity within a population of cells, possible correlation between gene expressions due to regulation, and other uncontrolled variations [18]. The existence of over-dispersion in real data was observed in several previous studies [18]. Popular models to safeguard against over-dispersion include the negative binomial distribution, or two-stage Poisson distribution [19], as discussed below.

When over-dispersion is observed across the samples, the gene counts cannot be estimated accurately by a simple Poisson model [20]. One way to handle this problem is to allow the Poisson mean to be a random variable and then model the gene counts by the marginal distribution of the mean count. Specifically, if the Poisson mean follows a Gamma distribution, then the marginal distribution of the gene count is a Negative Binomial distribution with mean \mu_i and variance \mu_i(1+\varepsilon\mu_i), where \varepsilon is the dispersion parameter [20].

Yoon and Nam [21] [22] showed that the gene dispersion value as estimated under the negative binomial modelling of read counts is the key determinant of the read count bias.

Whenever multiple samples are available, instead of modeling the raw expression we model the gene counts as a function of the experimental sample and the gene dispersion as covariates. For highly expressed genes we used the QNB regression model for published data that we downloaded from http://woldlab.caltech.edu/rnaseq/.

The published data were downloaded from http://www.ncbi.nlm.nih.gov/sra/ as the fastq files: SRA010153 for the MAQC data, SRP000727 for the human data (the two low-coverage MAQC samples were excluded), SRX000559-SRX000564 for the yeast data.

We analyzed the read count of the Mice-Brain tissue data under four experimental conditions:

Z1 = Chrom_chr11, Z2 = Chrom chr9_ra, Z3 = Chrom chrUn_ra, and Z4 = Chrom chr13_ra, and d = the gene dispersion level. The Zj are modeled as categorical variables with Z4 as the reference category, and d is measured on the continuous scale. Figure 6 shows the histograms of the read counts for the four groups (Tables 2(a)-(d)).

We now analyze the data using three count regression models; the Poisson, the Negative binomial, and the QNB (Tables 3-5).

Figure 6. Histogram of the read counts data.

(a) Chrom-chr1 (b) Chrom-chr13_ra (c) Chrom-chr9_ran (d) Chrom-chrUn_ran

Table 2. (a) Summary statistics of the read count data for Chrom-Chr1 sample; (b) Summary statistics of the read count data for Chrom-Chr13 sample; (c) Summary statistics of the read count data for Chrom-Chr9_ran sample; (d) Summary statistics of the read count data for Chrom-ChrUn_ran sample.

Table 3. Fitting the data to Poisson regression model.

AIC: 314241

Table 4. Results of fitting data to the Negative Binomial regression model.

Dispersion parameter for the Negative Binomial: 2.0488 (SE = 0.0185). AIC: 214866

Table 5. Results of fitting data to the QNBD.

AIC = 213104

1) Modeling read count as a Poisson regression model glm(formula = y ~ Z1 + Z2 + Z3 + d, family = poisson, data = ratdata);

2) Modeling read count using Negative Binomial to account for overdispersion;

3) Quasi negative binomial regression model.

8. Comments on the Data Fitting

We used three count regression models to fit the RNA-Seq data. All models were fitted in R [23]. The first is a Poisson regression model, the second is the well-known negative binomial, and the third is the proposed QNB regression model. The Poisson model is fitted in R with the “glm” function, while the negative binomial is fitted using the “MASS” package. We provide the R-code for fitting the QNB in Appendix 2. We based the comparisons among these models on the AIC values (the smaller the better). Clearly, the Poisson model, with the largest AIC = 314241, is the worst, as it fails to properly account for the overdispersion in the data. A remarkable improvement is attained when the negative binomial regression model is used, as its AIC = 214866. Although the QNB regression model has the smallest AIC = 213104, the improvement over the negative binomial is modest. We still believe that our proposed model should be a close competitor to the negative binomial model.

9. Discussion

There has been a growing interest among bioinformaticians and statisticians in constructing flexible distributions for counts that exhibit overdispersion to improve the modeling of count data. As a result, significant progress has been made towards generalizing some well-known discrete models, which have been successfully applied to problems arising in several areas of research. The proposed distribution was utilized to model two data sets; it was shown to provide a better fit than several other related models, including some with the same number of parameters. In a future paper, we shall demonstrate the applicability of the limiting form of our proposed distribution to genomics data, together with inference procedures using multiple samples. Finally, we believe that the inferential results developed in this article should find numerous applications in bioinformatics, genomics, medicine, data engineering, and other areas of the physical sciences.

Acknowledgements

The authors acknowledge the positive comments made by anonymous reviewers of this work.

Appendices

Appendix 1: R-CODE for Fitting the Univariate Version of the QNBD Using the Maximum Likelihood Method Applied to the “Brain Lesion” Data

library(EstimationTools)   # assumed source of maxlogL() used below

QNBD <- function(x, theta, beta, log = FALSE){
  loglik <- log(((beta-1)/(beta-1+beta*x)) * (factorial(beta-1+beta*x)) /
                (factorial(x)*factorial(beta-1+beta*x-x)) *
                ((theta^x)*(1-theta)^(beta*x+beta-1-x)))
  if (log == FALSE) density <- exp(loglik) else density <- loglik
  return(density)
}

parameter <- maxlogL(x = x, dist = "QNBD", start = c(.01, 2), optimizer = 'optim')

summary(parameter)

The fitting results by the method of maximum likelihood are:

AIC = 426.76 , θ ˜ = 0.298 ± 0.015 , β ˜ = 2.81 ± 0.057

Appendix 2: R-CODE: QNB Regression Fitting by the Method of Maximum Likelihood Applied to the RNA_SEQ Read Count Data

llik <- function(y, par){
  b0 <- par[1]; b1 <- par[2]; b2 <- par[3]; b3 <- par[4]; b4 <- par[5]
  beta <- par[6]
  eta <- b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4
  mu <- exp(eta)/(1+exp(eta))      # theta_i under the logit link (17)
  ll <- sum(log(beta-1) - log(beta-1+beta*y)
            + lgamma(beta+beta*y) - lgamma(1+y) - lgamma(beta+beta*y-y)
            + y*log(mu) + (beta+beta*y-1-y)*log(1-mu))
  return(-ll)
}

res <- optim(par = c(2, .6, -.02, .42, -.12, 2.1), llik, y = y, method = "BFGS", hessian = TRUE)
theta <- res$par
theta

# CALCULATING THE STANDARD ERRORS OF THE MLE
out3 <- nlm(llik, theta, y = y, hessian = TRUE)
fish <- out3$hessian
solve(fish)
element <- diag(solve(fish))
se <- sqrt(element)
# qqnorm(y, resid(out3))  # omitted: nlm() output has no resid() method
z <- theta/se
p_value <- 1 - pnorm(abs(z))
result.QNBD <- data.frame(theta, se, z, p_value)
result.QNBD <- round(result.QNBD, 4)

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] Takács, L. (1962) A Generalization of the Ballot Problem and Its Application in the Theory of Queues. Journal of the American Statistical Association, 57, 327-337.
https://doi.org/10.1080/01621459.1962.10480662
[2] Consul, P.C. and Gupta, H.C. (1980) The Generalized Negative Binomial Distribution and Its Characterization by Zero Regression. SIAM Journal of Applied Mathematics, 39, 231-237.
https://doi.org/10.1137/0139020
[3] Consul, P.C. and Shenton, L.R. (1972) Use of Lagrange Expansion for Generating Generalized Probability Distributions. SIAM Journal of Applied Mathematics, 23, 239-248.
https://doi.org/10.1137/0123026
[4] Consul, P.C. and Famoye, F. (2006) Lagrangian Probability Distributions. Birkhäuser, Boston.
[5] Shoukri, M.M. (1980) Estimation of Generalized Discrete Distributions. Unpublished PhD Thesis, The University of Calgary, Calgary.
[6] Nelder, J.A. and Wedderburn, R.W.M. (1972) Generalized Linear Models. Journal of the Royal Statistical Society, Series A, 135, 370-384.
https://doi.org/10.2307/2344614
[7] Kendall, M. and Ord, K. (2009) The Advanced Theory of Statistics. Vol. 1, 6th Edition, Griffin, London.
[8] Rudick, R., Antel, J., Confavreux, C., Cutter, G., Ellison, G., et al. (1996) Clinical Outcomes Assessment in Multiple Sclerosis. Annals of Neurology, 40, 469-479.
https://doi.org/10.1002/ana.410400321
[9] Morgan, C.J., Aban, I.B., Katholi, C.R. and Cutter, G.R. (2010) Modeling Lesion Counts in Multiple Sclerosis When Patients Have Been Selected for Baseline Activity. Multiple Sclerosis, 16, 926-934.
https://doi.org/10.1177/1352458510373110
[10] Cramér, H. (1946) Mathematical Methods of Statistics. Princeton University Press, Princeton.
[11] Szegő, G. (1939) Orthogonal Polynomials. Vol. 23, Colloquium Publications, American Mathematical Society, New York.
[12] Shenton, L.R. and Wallington, P.A. (1962) The Bias of the Moment Estimators with an Application to the Negative Binomial Distribution. Biometrika, 49, 193-204.
https://doi.org/10.1093/biomet/49.1-2.193
[13] McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models. Chapman Hall, London.
[14] Cox, D.R. and Hinkley, D. (1974) Theoretical Statistics. Chapman and Hall, London.
[15] McCarthy, D.J., Chen, Y. and Smyth, G.K. (2012) Differential Expression Analysis of Multifactor RNA-Seq Experiments with Respect to Biological Variation. Nucleic Acids Research, 40, 4288-4297.
https://doi.org/10.1093/nar/gks042
[16] Pan, W. (2002) A Comparative Review of Statistical Methods for Discovering Differentially Expressed Genes in Replicated Microarray Experiments. Bioinformatics, 18, 546-554.
https://doi.org/10.1093/bioinformatics/18.4.546
[17] Marioni, J.C., Mason, C.E., Mane, S.M., Stephens, M. and Gilad, Y. (2008) RNA-Seq: An Assessment of Technical Reproducibility and Comparison with Gene Expression Arrays. Genome Research, 18, 1509-1517.
https://doi.org/10.1101/gr.079558.108
[18] Koch, C.M., Chiu, S.F., Akbarpour, M., Bahart, A., Ridge, K.M., Bartom, E.T. and Winter, D.R. (2018) A Beginner’s Guide to Analysis of RNA Sequencing Data. American Journal of Respiratory Cell and Molecular Biology, 59, 145-157.
[19] Yoon, S., Kim, S.Y. and Nam, D. (2016) Improving Gene-Set Enrichment Analysis of RNA-Seq Data with Small Replicates. PLoS ONE, 11, e0165919.
https://doi.org/10.1371/journal.pone.0165919
[20] Auer, P.L. and Doerge, R.W. (2011) A Two-Stage Poisson Model for Testing RNA-Seq Data. Statistical Applications in Genetics and Molecular Biology, 10, 26.
https://doi.org/10.2202/1544-6115.1627
[21] Yoon, S. and Nam, D. (2017) Gene Dispersion Is the Key Determinant of the Read Count Bias in Differential Expression Analysis of RNA-Seq Data. BMC Genomics, 18, Article No. 408.
https://doi.org/10.1186/s12864-017-3809-0
[22] Robinson, M.D. and Smyth, G.K. (2008) Small-Sample Estimation of Negative Binomial Dispersion, with Applications to SAGE Data. Biostatistics, 9, 321-332.
https://doi.org/10.1093/biostatistics/kxm030
[23] https://cran.r-project.org/bin/windows/base/

Copyright © 2024 by authors and Scientific Research Publishing Inc.


This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.