Using Pearson's System of Curves to Approximate the Distributions of the Difference between Two Correlated Estimates of Signal-to-Noise Ratios: The Cases of Bivariate Normal and Bivariate Lognormal Distributions
1. Introduction
The signal-to-noise ratio is an important measure of measurement quality, with applications in many fields including medicine, engineering, and genomics. Generally speaking, the signal is what the investigator tries to measure, and the noise is the amount of uncertainty that surrounds the value of the signal, making it hard to identify the signal's actual value. For example, in a trial of weight-reduction drugs, the Body Mass Index (BMI) may be the signal of interest, while the noise may be attributed to failure to account for participants' baseline measurements.
Radiologists use the signal-to-noise ratio (SNR), defined as the ratio of the mean signal to its standard deviation, as a measure of image quality (Cunningham and Shaw [1]). For example, in Magnetic Resonance Imaging (MRI), the noise may be distributed uniformly throughout the image, and one way of measuring its effect is to calculate the SNR. For MRI data, radiologists evaluate the SNR by computing the mean signal intensity over a certain region of interest (ROI) and dividing it by the standard deviation of the signal from outside the image. In cancer diagnosis, Jung et al. [2] applied the SNR to investigate the accuracy and inter-observer variability of a standardized evaluation system for endorectal three-dimensional MR spectroscopic imaging of the prostate. They noted that most current image processing applied to MRI data can be formulated as a parameter estimation problem, and in the case of noise filtering, the SNR is the target parameter. Sim and Kamel [3] proposed an autoregressive (AR) model for SNR estimation. In more general settings, Taguchi and Wu [4] used the SNR as an indicator of "closeness to target", and [5] provided references for indices of measurement quality and reliability.
Applications of SNR in Genomics
One of the interesting applications of the SNR is in genomics. The unique conditions in gene expression analysis [6] are: 1) the level of a transcript depends roughly on the concentration of the related factors that control the rate of production of the transcript; 2) the random variations for any particular transcript are normally distributed. In several applications, it is assumed that the variation of any transcript is constant relative to most of the other transcripts in the genome, which means the coefficient of variation σ/μ is considered constant across the genome. Alternatively, we may take μ/σ (the signal-to-noise ratio), which we use in this paper as a measure of relative variation.
For example, suppose that we have a microarray with $R$ genes with red and green expression values $x_{j1}$ and $x_{j2}$, $j = 1, 2, \dots, R$. Let $\mu_i$ and $\sigma_i$ denote the mean and standard deviation of the $i$th channel $(i = 1, 2)$. It is of interest to test the null hypothesis $H_0\colon \mu_1/\sigma_1 = \mu_2/\sigma_2$.
Our paper has three objectives. The first is to develop nonparametric methods for testing the equality of two correlated SNR parameters. The second is to use Pearson's system of curves to construct a probability distribution for the difference between two estimates of the SNR in the cases of the bivariate normal and the bivariate lognormal distributions. This is quite important because providing the best approximating distribution allows us to study the characteristics of the distribution of the statistic of interest beyond point and interval estimation of the corresponding population parameter. The last objective is to illustrate the methodologies on real-life data.
2. Testing the Equality of Two Correlated SNRs
Let $X_1, X_2, \dots, X_n$ be a random sample drawn from $N(\mu, \sigma^2)$. We define the signal-to-noise ratio as $\theta = \mu/\sigma$. The maximum likelihood estimator (MLE) of $\theta$ is $\hat{\theta} = \bar{X}/S$, where $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $S^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2$ are, respectively, the maximum likelihood estimators of $\mu$ and $\sigma^2$. Instead of $S^2$, we shall use $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$, which is an unbiased estimator of $\sigma^2$.
Since one of the characteristics of the normal distribution is the stochastic independence of $\bar{X}$ from $s^2$, we have $E(\hat{\theta}^r) = E(\bar{X}^r)\,E(s^{-r})$, with $\bar{X} \sim N(\mu, \sigma^2/n)$. Now, since $(n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$, where $\chi^2_{\nu}$ denotes a chi-square random variable with $\nu$ degrees of freedom, we can evaluate the exact moments of $\hat{\theta} = \bar{X}/s$. The $r$th non-central moments of $\hat{\theta}$ follow from:

$E(\bar{X}) = \mu, \quad E(\bar{X}^2) = \mu^2 + \sigma^2/n$ (1)

$E(s^{-r}) = \left(\frac{n-1}{2}\right)^{r/2} \frac{\Gamma\left(\frac{n-1-r}{2}\right)}{\Gamma\left(\frac{n-1}{2}\right)}\, \sigma^{-r}, \quad n > r + 1$ (2)

$E(\hat{\theta}^r) = E(\bar{X}^r)\, E(s^{-r})$ (3)

Hence, from (1) and (2) we have

$\mathrm{Var}(\hat{\theta}) = \frac{n-1}{n-3}\left(\theta^2 + \frac{1}{n}\right) - c_n^2 \theta^2, \quad c_n = \sqrt{\frac{n-1}{2}}\, \frac{\Gamma\left(\frac{n-2}{2}\right)}{\Gamma\left(\frac{n-1}{2}\right)}$ (4)

This estimator is biased and has an expected value:

$E(\hat{\theta}) = c_n \theta$ (5)

From (5) the relative bias of $\hat{\theta}$ is given by:

$RB = \frac{E(\hat{\theta}) - \theta}{\theta} = c_n - 1$ (6)
For selected values of the sample size n, the corresponding values RB given in (6) are provided in Table 1.
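The relative bias in (6) can be evaluated directly from the standard gamma-function expression for $E(1/s)$ under normal sampling. The following Python sketch is illustrative (it is not the paper's original code):

```python
import math

def relative_bias(n):
    """Relative bias (6) of the MLE of theta = mu/sigma under normal
    sampling: RB = E(theta_hat)/theta - 1, using the standard result
    E(theta_hat) = sqrt((n-1)/2) * Gamma((n-2)/2) / Gamma((n-1)/2) * theta.
    Log-gamma is used for numerical stability; requires n >= 4."""
    c_n = math.sqrt((n - 1) / 2) * math.exp(
        math.lgamma((n - 2) / 2) - math.lgamma((n - 1) / 2))
    return c_n - 1

for n in (10, 30, 100):
    print(n, round(relative_bias(n), 4))
```

The bias decreases roughly like $3/(4n)$, which is why it is negligible for moderate sample sizes.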
From Table 1, we can see that the RB of the MLE of $\theta$ becomes negligible as the sample size $n$ gets larger. Now, if two independent samples are available, the difference between the two estimated independent SNRs is defined as $D = \hat{\theta}_2 - \hat{\theta}_1$. We can show, under the assumption of independence, that:

$\mathrm{Var}(D) = \mathrm{Var}(\hat{\theta}_1) + \mathrm{Var}(\hat{\theta}_2)$ (7)

where each $\mathrm{Var}(\hat{\theta}_i)$ is given by (4). We may therefore construct a $(1-\alpha)100\%$ CI on $\theta_2 - \theta_1$ as:

$D \pm z_{1-\alpha/2}\, \sqrt{\widehat{\mathrm{Var}}(D)}$ (8)
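As a numerical illustration of the interval in (8), the following hedged Python sketch substitutes the familiar large-sample delta-method approximation $\mathrm{Var}(\hat{\theta}) \approx 1/n + \theta^2/(2n)$ for the exact gamma-function variance; the input values are made up for illustration:

```python
import math

def snr_diff_ci(theta1_hat, n1, theta2_hat, n2):
    """Approximate 95% confidence interval (8) for theta2 - theta1 from two
    independent samples, using the large-sample delta-method variance
    Var(theta_hat) ~ 1/n + theta^2/(2n) (an approximation, not the
    exact gamma-function variance)."""
    var_d = (1 / n1 + theta1_hat**2 / (2 * n1)
             + 1 / n2 + theta2_hat**2 / (2 * n2))
    se = math.sqrt(var_d)
    d = theta2_hat - theta1_hat
    return d - 1.96 * se, d + 1.96 * se

lo, hi = snr_diff_ci(1.0, 50, 1.5, 50)  # illustrative values, not from the paper
print(round(lo, 3), round(hi, 3))
```

If the resulting interval covers zero, the data do not contradict equality of the two SNRs at the 5% level.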
When neither the assumption of independence nor that of normality is satisfied, we propose several approaches to test the hypothesis of equality of two correlated SNRs. The first approach is nonparametric and is suitable when the distributional assumptions of the parent populations cannot be verified.
Table 1. Relative bias (RB) of the MLE of the SNR for selected sample sizes; the relative bias is negligible even for small sample sizes.
3. Nonparametric Test of the Null Hypothesis H0: SNR1 = SNR2
Let $(X_{1j}, X_{2j})$, $j = 1, 2, \dots, n$, denote a random sample of size $n$ from a bivariate distribution with parameter vector $(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$. The sample summary statistics of the data are $\bar{X}_1$, $\bar{X}_2$, $S_1^2$, $S_2^2$, and $r$, the sample correlation between $X_1$ and $X_2$.
We define the signal-to-noise ratio (SNR) of the $i$th variable by $\theta_i = \mu_i/\sigma_i$ $(i = 1, 2)$, whose moment estimators are $\hat{\theta}_i = \bar{X}_i/S_i$.
It is evident that if $\mu_1 = \mu_2$ and $\sigma_1^2 = \sigma_2^2$, then $\theta_1 = \theta_2$. Therefore, if the available data support the null hypothesis:

$H_0\colon \mu_1 = \mu_2, \ \sigma_1^2 = \sigma_2^2$ (9)

then we conclude that $\theta_1 = \theta_2$.
Testing the hypothesis (9) is equivalent to using the test developed in [7]. Earlier, [7] suggested tests of equality of correlated means and correlated variances using the statistic (10):

$W = \dfrac{S_1^2 S_2^2 (1 - r^2)}{\left[\dfrac{S_1^2 + S_2^2}{2}(1 + r)\right]\left[\dfrac{S_1^2 + S_2^2}{2}(1 - r) + \dfrac{(\bar{X}_1 - \bar{X}_2)^2}{2}\right]}$ (10)

where $\bar{X}_i$ and $S_i^2$ are the sample means and variances, and $r$ is the sample correlation between the two sets of observations.
Then $-2\log W$ is asymptotically distributed as a chi-square random variable with 2 degrees of freedom.
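The statistic (10) is easy to compute directly; the following Python sketch mirrors the bootstrap kernel of the R code in Appendix 2 (the sample data below are illustrative placeholders, not the paper's data):

```python
import math

def wilk_statistic(x, y):
    """Statistic (10) for simultaneously testing equality of correlated
    means and variances; -2*log(W) is referred to a chi-square
    distribution with 2 degrees of freedom."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx2 = sum((v - mx) ** 2 for v in x) / (n - 1)
    sy2 = sum((v - my) ** 2 for v in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    r = sxy / math.sqrt(sx2 * sy2)
    num = sx2 * sy2 * (1 - r ** 2)
    den1 = ((sx2 + sy2) / 2) * (1 + r)
    den2 = ((sx2 + sy2) / 2) * (1 - r) + (mx - my) ** 2 / 2
    return num / (den1 * den2)

# Illustrative paired measurements (placeholders)
x = [1.2, 2.3, 1.9, 3.1, 2.8, 2.2, 1.7]
y = [1.4, 2.1, 2.2, 2.9, 3.0, 2.0, 1.5]
w = wilk_statistic(x, y)
print(w, -2 * math.log(w))
```

By the AM-GM inequality the denominator dominates the numerator, so $W$ always lies in $(0, 1]$ and $-2\log W \ge 0$.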
Example 1:
The Petrifilm test, a quantitative microbiological test for Escherichia coli O157:H7, was evaluated for its performance as a beef-carcass monitoring test compared with an E. coli O157:H7 detection method using a hydrophobic grid membrane filter (HGMF) [8]. The Petrifilm test showed excellent agreement with the HGMF method when test samples were obtained from pure cultures and experimentally contaminated meat. Here we shall compare the SNRs of the two methods using the nonparametric indirect approach based on the statistic (10), with a sample of 23 pairs of observations.
In Appendix 2 we provide the R code, in which we use the bootstrap methodology to establish the large-sample properties of the proposed statistic.
RESULTS
SNR_Petrifilm = 0.979
SNR_hgmf = 1.05
Correlation = 0.996
quantile (−2 * log(WILK), probs = c (0.05, 0.95))
5% 95%
0.006 0.863
quantile (xx, probs = c (0.05, 0.95))
5% 95%
0.101 5.983
As can be seen, the (5%, 95%) bootstrap quantiles of the test statistic for the equality of the two SNRs, (0.006, 0.863), are completely contained within the corresponding quantiles of the chi-square sample xx with 2 degrees of freedom. Thus, we established the equality of the two SNRs. The approximate distribution of the statistic (10) is depicted graphically in Figure 1 and Figure 2.
Figure 1 shows that the histogram is skewed to the right, while Figure 2 shows that the distribution of −2 * log(WILK) is in close agreement with xx (a chi-square random variable with 2 degrees of freedom).
Figure 1. Histogram of 1000 bootstrap samples of the statistic (10).
Figure 2. The q-q plot of the statistic (10) plotted against 1000 samples, xx, from the chi-square distribution with 2 degrees of freedom.
4. The Pearson’s System of Curves
Karl Pearson introduced a system of distributions that has been found useful in many applications. Pearson's family of distributions generates different densities $f$ that are solutions of the differential equation:

$\dfrac{1}{f}\dfrac{df}{dx} = -\dfrac{a + x}{b_0 + b_1 x + b_2 x^2}$ (11)

The system of curves was discussed in detail in [9], and in this section we present a summary of those results.
The moments of a PSC (Pearson's System of Curves) are determined by the values of the constants in (11) in addition to the constant of integration; conversely, if we have the first four moments of a Pearson curve, we can solve for the constants in Equation (11). Pearson classified the solutions into types numbered I to VI. In constructing confidence intervals on the population parameter, the quantiles of $x$ are more important than the form of the probability density $f$.
To clarify this point, we assume that $x$ has a Pearson curve density with mean $\mu$ and central moments $\mu_2 = \sigma^2$, $\mu_3$ and $\mu_4$. The percentage points of the standardized variable $x' = (x - \mu)/\sigma$ can be obtained as functions of the skewness $\sqrt{\beta_1} = \mu_3/\sigma^3$ and the kurtosis $\beta_2 = \mu_4/\sigma^4$, and are tabulated in double-entry tables against $\sqrt{\beta_1}$ and $\beta_2$. The percentage points of $x$ are then easily obtained from those of $x'$.
In summary, the steps of fitting a Pearson curve to a theoretical distribution, as summarized in [9], are:
1) Calculate the skewness $s = \sqrt{\beta_1}$ and kurtosis $k = \beta_2$.
2) Based on $s$ and $k$, obtain $x'_{\alpha}$, the percentage point of the standardized curve at upper or lower level $\alpha$.
3) Calculate $x_{\alpha} = \mu + \sigma x'_{\alpha}$ if $\mu_3$ is positive, or $x_{\alpha} = \mu - \sigma x'_{1-\alpha}$ if $\mu_3$ is negative.
There are six main types of Pearson distributions:
Type I: These distributions are (location-scale transformations of) Beta distributions. The probability density function with shape parameters $a$, $b$, scale $= s$ and location $= m$ is given by:

$f(x) = \dfrac{1}{s\,B(a,b)} \left(\dfrac{x-m}{s}\right)^{a-1} \left(1 - \dfrac{x-m}{s}\right)^{b-1}$ (12)

for $m < x < m + s$, $a > 0$, $b > 0$.
Type II: Pearson Type II is the symmetric Beta, a special case of Type I obtained when $a = b$. Therefore, the probability density function is given by:

$f(x) = \dfrac{1}{s\,B(a,a)} \left(\dfrac{x-m}{s}\right)^{a-1} \left(1 - \dfrac{x-m}{s}\right)^{a-1}$ (13)

Type III: This is the Gamma distribution. The probability density function with shape $= a$, scale $= ss$, and location $= m$ is given by:

$f(x) = \dfrac{1}{ss\,\Gamma(a)} \left(\dfrac{x-m}{ss}\right)^{a-1} \exp\left(-\dfrac{x-m}{ss}\right)$ (14)

for $x > m$ (when $ss > 0$) and $a > 0$.
Type IV: The Pearson Type IV with location parameter $= m$, scale parameter $= ss$ and shape parameters $(a, \nu)$ has pdf given by:

$f(x) = \dfrac{k}{ss} \left[1 + \left(\dfrac{x-m}{ss}\right)^2\right]^{-a} \exp\left[-\nu \arctan\left(\dfrac{x-m}{ss}\right)\right]$ (15)

where $k$ is the normalizing constant and $a > 1/2$ ($\nu = 0$ corresponds to the Pearson Type VII distribution family).
Type V: This is the Inverse Gamma distribution. The pdf is given by:

$f(x) = \dfrac{ss^{\alpha}}{\Gamma(\alpha)}\, (x-m)^{-\alpha-1} \exp\left(-\dfrac{ss}{x-m}\right), \quad x > m$ (16)

where $ss$ = scale parameter, $\alpha$ = shape parameter, and $m$ = location parameter. The scale parameter is permitted to have negative values to allow for left skewness.
Type VI: Known as the Beta Prime distribution; these are in fact scaled F-distributions. The pdf with shape parameters $a$, $b$ is given by:

$f(x) = \dfrac{1}{ss\,B(a,b)} \left(\dfrac{x-m}{ss}\right)^{a-1} \left(1 + \dfrac{x-m}{ss}\right)^{-a-b}, \quad x > m$ (17)

where $ss$ = scale parameter and $m$ = location parameter.
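The choice among the types can be automated through Pearson's classical criterion $\kappa = \beta_1(\beta_2+3)^2 / [4(4\beta_2 - 3\beta_1)(2\beta_2 - 3\beta_1 - 6)]$. The sketch below is a simplified classifier that ignores some boundary cases; $\beta_2$ here is the ordinary (non-excess) kurtosis:

```python
def pearson_type(beta1, beta2):
    """Rough Pearson-type classification from the squared skewness beta1
    and (ordinary, non-excess) kurtosis beta2, via Pearson's criterion
    kappa = beta1*(beta2+3)^2 / [4*(4*beta2-3*beta1)*(2*beta2-3*beta1-6)].
    Main regions: kappa < 0 -> Type I; 0 < kappa < 1 -> Type IV;
    kappa > 1 -> Type VI; boundary values give Types II, III, V, VII."""
    denom = 4 * (4 * beta2 - 3 * beta1) * (2 * beta2 - 3 * beta1 - 6)
    if denom == 0:
        return "Type III (gamma line)" if beta1 > 0 else "boundary"
    kappa = beta1 * (beta2 + 3) ** 2 / denom
    if kappa < 0:
        return "Type I"
    if kappa == 0:
        return "symmetric (normal, Type II or Type VII)"
    if kappa == 1:
        return "Type V"
    return "Type IV" if kappa < 1 else "Type VI"

# A gamma with shape 2 has beta1 = 2, beta2 = 6, which lies exactly
# on the Type III line (2*beta2 - 3*beta1 - 6 = 0).
print(pearson_type(2.0, 6.0))
# Moments close to those fitted later in Example 2 (skewness 0.192,
# kurtosis 2.932); Type II is the symmetric boundary of the Type I region.
print(pearson_type(0.192 ** 2, 2.932))
```

In practice the R package PearsonDS performs this classification internally when fitting by moments.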
Clearly, to determine the approximating Pearson curve for any random variable, we need the numerical values of the skewness and kurtosis; that is, we have to determine its first four central moments. To find the first four central moments of $D = \hat{\theta}_2 - \hat{\theta}_1$, we shall use the delta method by employing a Taylor expansion of $\hat{\theta}_i = \bar{X}_i/S_i$. For ready calculation of the moments, we shall use some of the results from [10] together with the expectations of products of correlated chi-square variables and the higher-order moments of the bivariate normal distribution. These expectations are given in Appendix 1.
5. The Case of the Bivariate Normal Distribution
Let
be a random sample from bivariate normal distribution.
Our main objective are:
1) Construct a large sample confidence interval on
2) Use the first four central moments of
to fit a Pearson family and find the appropriate quantiles.
We shall use the delta method by employing the Taylor’s expansion on the statistic
.
Now we derive the first 4 central moments of the targeted statistic
.
On using the Taylor expansion, the variance of $D$ is defined as $\mu_2 = E[D - E(D)]^2$, with the required partial derivatives evaluated at the true parameter values. Substituting these values, using the appropriate expectations in Appendix 1, and after simplifications, to the first order of approximation we get:

$\mu_2 = \dfrac{2(1-\rho)}{n-1} + \dfrac{\theta_1^2 + \theta_2^2 - 2\rho^2\theta_1\theta_2}{2(n-1)}$ (18)

The third central moment is defined as $\mu_3 = E[D - E(D)]^3$. Again, using the expectations in Appendix 1, and after simplifications we get:

$\mu_3 = \dfrac{\theta_1^3 + \theta_2^3}{(n-1)^2} - \dfrac{3n\rho^2\theta_1\theta_2(\theta_1 + \theta_2)}{2(n-1)}$ (19)

The fourth central moment is defined as $\mu_4 = E[D - E(D)]^4$. Again, using the expectations in Appendix 1 and after simplifications we get:

$\mu_4 = \dfrac{12(1-\rho)^2}{n} + \dfrac{6\rho^2\theta_1\theta_2}{n(n-1)} - \dfrac{3(n+3)\rho^2\theta_1\theta_2(\theta_1^2+\theta_2^2)}{(n-1)^3} + \dfrac{3(n+1)(\theta_1^4+\theta_2^4)}{4(n-1)^3} + \dfrac{3\theta_1^2\theta_2^2}{8}\left[\dfrac{n+1}{(n-1)^2} + \dfrac{8(n+1)\rho^4}{(n-1)^3} + \dfrac{4\rho^4}{(n-1)^3}\right]$ (20)
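The moments (18)-(20) can be evaluated numerically; this Python sketch transcribes the same formulas implemented in the R code of Appendix 3 ($\theta_1, \theta_2$ denote the two SNRs, $\rho$ the correlation, $n$ the sample size):

```python
def central_moments_D(t1, t2, r, n):
    """First-order approximations (18)-(20) to the central moments of
    D = theta2_hat - theta1_hat under bivariate normal sampling."""
    mu2 = 2 * (1 - r) / (n - 1) + (t1**2 + t2**2 - 2 * t1 * t2 * r**2) / (2 * (n - 1))
    mu3 = (t1**3 + t2**3) / (n - 1) ** 2 - 3 * n * r**2 * t1 * t2 * (t1 + t2) / (2 * (n - 1))
    mu4 = (12 * (1 - r) ** 2 / n
           + 6 * r**2 * t1 * t2 / (n * (n - 1))
           - 3 * (n + 3) * r**2 * t1 * t2 * (t1**2 + t2**2) / (n - 1) ** 3
           + 3 * (n + 1) * (t1**4 + t2**4) / (4 * (n - 1) ** 3)
           + (3 * t1**2 * t2**2 / 8)
             * ((n + 1) / (n - 1) ** 2
                + 8 * (n + 1) * r**4 / (n - 1) ** 3
                + 4 * r**4 / (n - 1) ** 3))
    return mu2, mu3, mu4

# Illustrative values: theta1 = 1, theta2 = 2, rho = 0, n = 11
mu2, mu3, mu4 = central_moments_D(1.0, 2.0, 0.0, 11)
print(mu2, mu3, mu4)
```

When $\rho = 0$, (18) reduces to $2/(n-1) + (\theta_1^2 + \theta_2^2)/(2(n-1))$, which for the values above equals 0.45.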
Example 2: Vitamin B12 and Folic Acid as Predictors of Osteoporosis
Osteoporosis and bone health remain a major health problem often associated with significant morbidity, mortality, and healthcare costs [11]. An epidemiological analysis reported that 34% of healthy Saudi women and 30.7% of healthy Saudi men aged between 50 and 79 years are osteoporotic [12]. Furthermore, approximately 40% - 50% of women and 25% - 33% of men sustain osteoporotic fractures in their lifetimes. Hence, identifying the risk factors for osteoporosis is crucial in reducing the incidence of fractures. One study investigated the relationship between vitamin B12 and folate levels and bone mineral density (BMD) in Saudi Arabia [13] [14]. A recent study established the association between B12 and folate levels and BMD [15]. We used part of the data to investigate which factor has a higher SNR so it can be used reliably in the prediction of osteoporotic fractures.
Results of data analysis are shown in Table 2.
On using the normal Q-Q plot, we noted that the log transformed B12, and Folate are approximately normally distributed with correlation = 0.290 (p-value < 0.001).
Figure 3 and Figure 4 show respectively that the log-transformed variables B12 and Folate are approximately normally distributed. Preliminary data analysis gave:
Table 2. Summary statistics of the B12 and Folate data.
Figure 3. Q-Q normal plot of the log-transformed B12.
Figure 4. Q-Q normal plot of the log-transformed Folate.
SNR(X1) = 5.6546/0.51782 = 10.92
SNR(X2) = 7.618/0.30944 = 24.62
Difference = SNR(X2) − SNR(X1) = 13.7
SE(Difference) = 0.41
Lower 95% confidence limit = Difference − 1.96 × SE(Difference) = 12.9 (21)
Upper 95% confidence limit = Difference + 1.96 × SE(Difference) = 14.5 (22)
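The limits (21) and (22) are easy to verify with a few lines of Python (the values are those reported above):

```python
diff = 24.62 - 10.92          # SNR(X2) - SNR(X1), values from the text
se = 0.41                     # reported standard error of the difference
lower = diff - 1.96 * se      # limit (21)
upper = diff + 1.96 * se      # limit (22)
print(round(diff, 1), round(lower, 1), round(upper, 1))  # prints: 13.7 12.9 14.5
```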
Evidently, logFolate has a significantly higher SNR than logB12.
Again, we use the bootstrap of the data to find the best-fitting Pearson curve for the distribution of PIVOT $= \hat{\theta}_2 - \hat{\theta}_1$. The R code is given in Appendix 3.
Results
Variance = 1.106637, mean = 13.78
quantile (PIVOT, probs = c (0.05, 0.95))
5% 95%
12.155 15.613
Since the 95% empirical confidence limits (12.155, 15.613) do not include zero, we conclude that the two SNRs are significantly different. Note that the empirical limits are almost identical to the 95% confidence limits (21) and (22) (Figure 5 and Figure 6).
PEARSON’s CURVE:
Parameters estimates of the Pearson curve
type a b location scale
II 15.75 30.53 8.58 15.26
Pearson moments
mean variance skewness kurtosis
13.779 1.107 0.192 2.932
Figure 5. The histogram of 1000 bootstrap values of the statistic PIVOT $= \hat{\theta}_2 - \hat{\theta}_1$.
Figure 6. Region of definition (yellow) of the Pearson’s curve showing that the Type II is the best curve to fit the distribution of the difference between the estimated SNRs.
6. The Bivariate Lognormal Case
Let $Y_1$ and $Y_2$ be two lognormally distributed random variables defined on the intervals $0 < y_1 < \infty$ and $0 < y_2 < \infty$, and let $X_1 = \log Y_1$ and $X_2 = \log Y_2$ denote the transformed normal variables, where the joint distribution of $(X_1, X_2)$ is bivariate normal. The joint pdf of $(Y_1, Y_2)$ is given by:

$f(y_1, y_2) = \dfrac{1}{2\pi y_1 y_2 \sigma_1 \sigma_2 \sqrt{1-\rho^2}} \exp\left\{-\dfrac{1}{2(1-\rho^2)}\left[\left(\dfrac{\log y_1 - \mu_1}{\sigma_1}\right)^2 - 2\rho\,\dfrac{(\log y_1 - \mu_1)(\log y_2 - \mu_2)}{\sigma_1\sigma_2} + \left(\dfrac{\log y_2 - \mu_2}{\sigma_2}\right)^2\right]\right\}$ (23)

where $\mu_1, \mu_2$ are the means of $X_1, X_2$, $\sigma_1, \sigma_2$ are their standard deviations, and $\rho$ is the correlation between $X_1$ and $X_2$.
The joint probability density function given in (23) is that of a bivariate lognormal distribution, which was studied extensively in [16]. The mean and the variance of the lognormal distribution are given respectively by $E(Y_i) = \exp(\mu_i + \sigma_i^2/2)$ and $\mathrm{Var}(Y_i) = [\exp(\sigma_i^2) - 1]\exp(2\mu_i + \sigma_i^2)$.
The SNR is thus given by $\delta_i = E(Y_i)/\sqrt{\mathrm{Var}(Y_i)} = [\exp(\sigma_i^2) - 1]^{-1/2}$, which depends only on $\sigma_i^2$.
We are interested in deriving the distribution of the maximum likelihood estimator of

$\Delta = \delta_2 - \delta_1 = [\exp(\sigma_2^2) - 1]^{-1/2} - [\exp(\sigma_1^2) - 1]^{-1/2}$ (24)

This estimator is given by:

$\hat{\Delta} = [\exp(\hat{\sigma}_2^2) - 1]^{-1/2} - [\exp(\hat{\sigma}_1^2) - 1]^{-1/2}$ (25)
Note that the case of two independent lognormal populations was investigated in [17] .
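Since the lognormal SNR depends only on the variance of the underlying normal variable, it is straightforward to compute; this small sketch uses the standard lognormal-moment facts (it is not code from the paper):

```python
import math

def lognormal_snr(sigma2):
    """SNR of a lognormal variable Y = exp(X), X ~ N(mu, sigma2):
    E(Y)/sd(Y) = 1/sqrt(exp(sigma2) - 1); note it does not depend on mu."""
    return 1.0 / math.sqrt(math.exp(sigma2) - 1.0)

# When sigma2 = log(2), exp(sigma2) - 1 = 1, so the SNR equals 1 exactly.
print(lognormal_snr(math.log(2.0)))
```

The SNR is strictly decreasing in $\sigma^2$, so comparing two lognormal SNRs amounts to comparing the variances of the log-transformed variables.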
To find the first four central moments, and similarly to the case of the bivariate normal distribution discussed in the previous section, we employ a Taylor series expansion of $\hat{\Delta}$, as shown in (26). In general, for the pair of estimators $(\hat{\sigma}_1^2, \hat{\sigma}_2^2)$, assumed unbiased for $(\sigma_1^2, \sigma_2^2)$, we have:

$\hat{\Delta} - \Delta \approx c_2(\hat{\sigma}_2^2 - \sigma_2^2) - c_1(\hat{\sigma}_1^2 - \sigma_1^2)$ (26)

where $c_i = \partial\delta_i/\partial\sigma_i^2$ means the partial derivative evaluated at the true value of the parameter. This partial derivative is given in (27):

$c_i = \dfrac{\partial\delta_i}{\partial\sigma_i^2} = -\dfrac{1}{2}\exp(\sigma_i^2)\,[\exp(\sigma_i^2) - 1]^{-3/2}$ (27)

Let $\mu_r(\hat{\Delta})$ denote the $r$th central moment of $\hat{\Delta}$, that is, $\mu_r(\hat{\Delta}) = E[\hat{\Delta} - E(\hat{\Delta})]^r$. Therefore, using (27) we get:

$\mu_2(\hat{\Delta}) = E[c_2(\hat{\sigma}_2^2 - \sigma_2^2) - c_1(\hat{\sigma}_1^2 - \sigma_1^2)]^2 = \dfrac{2}{n-1}\left[c_1^2\sigma_1^4 + c_2^2\sigma_2^4 - 2\rho^2 c_1 c_2 \sigma_1^2\sigma_2^2\right]$ (28)

$\mu_3(\hat{\Delta}) = E[c_2(\hat{\sigma}_2^2 - \sigma_2^2) - c_1(\hat{\sigma}_1^2 - \sigma_1^2)]^3$ (29)

$\mu_4(\hat{\Delta}) = E[c_2(\hat{\sigma}_2^2 - \sigma_2^2) - c_1(\hat{\sigma}_1^2 - \sigma_1^2)]^4$ (30)

where (29) and (30) are expanded using the third- and fourth-order cross-product expectations of correlated chi-square variables given in Appendix 1.
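The partial derivative in (27) can be checked numerically against a central finite-difference approximation; a quick hedged sketch:

```python
import math

def snr(sigma2):
    # Lognormal SNR: delta = (exp(sigma2) - 1)^(-1/2)
    return (math.exp(sigma2) - 1.0) ** -0.5

def dsnr(sigma2):
    # Analytic derivative (27):
    #   d delta / d sigma2 = -(1/2) exp(sigma2) (exp(sigma2) - 1)^(-3/2)
    return -0.5 * math.exp(sigma2) * (math.exp(sigma2) - 1.0) ** -1.5

s2, h = 0.8, 1e-6  # evaluation point and step size (illustrative)
numeric = (snr(s2 + h) - snr(s2 - h)) / (2 * h)
print(dsnr(s2), numeric)
```

The two values agree to several decimal places, confirming the form of (27); the derivative is negative because the lognormal SNR decreases in $\sigma^2$.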
Figure 7. The histogram of the 1000 bootstrap samples for the distribution of $\hat{\Delta} = \hat{\delta}_2 - \hat{\delta}_1$.
Figure 8. Pearson's diagram shows that, in the case of the correlated lognormal distribution, the Type I curve is the best-fitting distribution of $\hat{\Delta}$ for the bivariate lognormally distributed data.
The moments of $\hat{\Delta}$ given in (28), (29), and (30) are used to derive the skewness and kurtosis that are needed to calculate the parameters of the Pearson curve.
Example 3:
We showed in the previous example that, after applying the log transformation to the B12 and Folate data, the transformed variables were quite close to normally distributed. Therefore, we may conclude that the untransformed variables have approximately lognormal distributions. A similar bootstrap approach may then be used to find the empirical Pearson distribution of the difference between the estimated SNRs, $\hat{\Delta} = \hat{\delta}_2 - \hat{\delta}_1$. The R code in this case is the same as in the case of the bivariate normal distribution. The results are summarized below.
Variance($\hat{\Delta}$) = 0.023, Mean($\hat{\Delta}$) = 1.363, Skewness = 0.070, and Kurtosis = 2.916.
The empirical (5%, 95%) quantiles are (1.118, 1.612). Since these quantiles do not include zero, we conclude, with 95% confidence, that there is a significant difference between the two SNRs under the bivariate lognormal setup. The parameter estimates of the Pearson curve are:
type a b location scale
I 27.14 36.028 0.32 2.428
Note that the notation PIVOT_L is used to denote the random variable $\hat{\Delta} = \hat{\delta}_2 - \hat{\delta}_1$.
As can be seen from the histogram in Figure 7, the distribution of $\hat{\Delta}$ is slightly skewed to the right.
From the diagram in Figure 8, it seems that Pearson's Type I is the best approximating distribution.
7. Discussion and Conclusions
In this paper, we considered two approaches to estimating and testing the difference between two SNRs. Through bootstrap sampling, the nonparametric approach confirmed the large-sample distribution of the test statistic for the hypothesis of simultaneously testing the equality of means and variances, and hence the equality of two correlated SNRs.
When the bivariate distribution is either normal or lognormal, we derived the first four central moments of the target statistics using the delta method. Thereafter, we used the bootstrap to find the Pearson family that best fits the distribution of the estimated difference between SNRs. Pearson's Types I and II gave the best fit for the available data. Several R [18] packages were used in the computational steps to achieve the objectives of the paper.
Acknowledgements
The author acknowledges the anonymous reviewer’s comments that have greatly improved the presentation of the paper.
Appendix 1. Cross Product Expectations of Higher Order Functions of Bivariate Normal and Bivariate Chi-Square Random Variables
,
where
By symmetry
Appendix 2. R Code for Example 1
# Bootstrap of the statistic (10); assumes a data frame `data`
# with columns `petrifilm` and `hgmf`.
library(boot)
B=1000
n=nrow(data)
n
WILK=numeric(B)
b<-data$petrifilm
f<-data$hgmf
mb<-mean(b)
sb<-sd(b)
mf<-mean(f)
sf<-sd(f)
for (bb in 1:B){
i<-sample(1:n,size=n,replace=TRUE)
b<-data$petrifilm[i]
f<-data$hgmf[i]
mb=mean(b)
mf=mean(f)
sb=sd(b)
sf=sd(f)
rr=cor(b,f)
up=sb^2*sf^2*(1-rr^2)
down1=((sb^2+sf^2)/2)*(1+rr)
down2=(((sb^2+sf^2)/2)*(1-rr))+((mb-mf)^2/2)
WILK[bb]=up/(down1*down2)
}
print(vv<-var(-2*log(WILK)))
print(mm<-mean(-2*log(WILK)))
hist(-2*log(WILK),prob=TRUE)
xx<-rchisq(1000,2)
qqplot(xx,-2*log(WILK))
quantile(-2*log(WILK), probs = c(.05,.95))
quantile(xx,probs=c(.05,.95))
Appendix 3. R Code for Fitting Pearson's Curve to the B12 and Folate Data of Examples 2 and 3
# Bootstrapping the distribution of the PIVOT; assumes a data frame
# `data` with columns `logB12` and `logFolate`.
library(ggplot2)
library(boot)
B=1000
n=nrow(data)
n
PIVOT=numeric(B)
b<-data$logB12
f<-data$logFolate
mb<-mean(b)
sb<-sd(b)
s1<-var(b)
mf<-mean(f)
sf<-sd(f)
s2<-var(f)
for (bb in 1:B){
i<-sample(1:n,size=n,replace=TRUE)
b<-data$logB12[i]
f<-data$logFolate[i]
mb=mean(b)
mf=mean(f)
sb=sd(b)
sf=sd(f)
r=cor(b,f)
r
t1=mb/sb
t2=mf/sf
MU2<-(2*(1-r)/(n-1))+((t1^2+t2^2-2*t1*t2*r^2)/(2*(n-1)))
MU2
MU31<- (t1^3+t2^3)/(n-1)^2
MU32<- (3*n*r^2*t1*t2*(t1+t2))/(2*(n-1))
MU3<- MU31-MU32
A1<- (12*(1-r)^2/n)
A2<- (6*r^2*t1*t2/(n*(n-1)))
A3<- ((3*(n+3)*r^2*t1*t2*(t1^2+t2^2))/(n-1)^3)
A4<- ((3*(n+1)*(t1^4+t2^4))/(4*(n-1)^3))
A51<-(3*t1^2*t2^2/8)
A52<-(n+1)/(n-1)^2
A53<- ((8*(n+1)*r^4)/(n-1)^3)+(4*r^4/(n-1)^3)
MU4<- A1+A2-A3+A4+A51*(A52+A53)
PIVOT[bb]=(t2-t1)
}
print(vv<-var(PIVOT))
print(mm<-mean(PIVOT))
hist(PIVOT,prob=TRUE)
xx<-rnorm(1000)
qqplot(xx,PIVOT)
quantile(PIVOT, probs = c(.05,.95))
den <- density(PIVOT)
plot(den, frame = FALSE, col = "blue",main = "Density plot")
#######MOMENTS AND PEARSON CURVE
library(moments)
all.moments(PIVOT,central=TRUE,order.max=4)
s=skewness(PIVOT)
k=kurtosis(PIVOT)
m=mean(PIVOT)
v=var(PIVOT)
library(PearsonDS)
###### PEARSON CODE#######
moments<-c(mean=m,variance=v,skewness=s,kurtosis=k)
ppar<-pearsonFitM(moments=moments)
print(unlist(ppar))
pearsonMoments(params=ppar)
pearsonDiagram(max.skewness=sqrt(.25),max.kurtosis=4,squared.skewness=TRUE,lwd=2,legend=TRUE,n=999)
pearsonMoments(moments=moments)