TITLE:
Application of Sparse Bayesian Generalized Linear Model to Gene Expression Data for Classification of Prostate Cancer Subtypes
AUTHORS:
Behrouz Madahian, Lih Y. Deng, Ramin Homayouni
KEYWORDS:
LASSO, Robustness, Sparsity, MCMC, Gibbs Sampling
JOURNAL NAME:
Open Journal of Statistics, Vol.4 No.7, August 28, 2014
ABSTRACT:
A major limitation of expression profiling is the large number of variables assessed relative to the small sample sizes available. In this study, we developed a multinomial probit Bayesian model that uses the double exponential prior to induce shrinkage and reduce the number of covariates in the model [1]. A hierarchical Sparse Bayesian Generalized Linear Model (SBGLM) that takes into account the progressive nature of the response variable was developed to facilitate Gibbs sampling. The method was evaluated using a published dataset (GSE6099) containing 99 prostate cancer cell types in four different progressive stages [2]. Initially, 398 genes were selected using ordinal logistic regression with a cutoff value of 0.05 after Benjamini and Hochberg FDR correction. The dataset was randomly divided into training (N = 50) and test (N = 49) groups such that each group contained an equal number of each cancer subtype. To obtain more robust results, we performed 50 re-samplings of the training and test groups. Using the top ten genes obtained from SBGLM, we achieved average classification accuracies of 85% and 80% in the training and test groups, respectively. To functionally evaluate model performance, we used a literature mining approach called Geneset Cohesion Analysis Tool (GCAT) [3]. Examination of the top 100 genes produced an average functional cohesion p-value of 0.007, compared to 0.047 and 0.131 produced by classical multi-category logistic regression and Random Forest approaches, respectively. In addition, 96 percent of the SBGLM runs resulted in a GCAT literature cohesion p-value smaller than 0.047. Taken together, these results suggest that the sparse Bayesian multinomial probit model applied to cancer progression data allows for better subclass prediction and produces more functionally relevant gene sets.
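The gene pre-selection and resampling steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names `benjamini_hochberg` and `stratified_split`, the synthetic labels, the per-stage sample counts, and the 0.5 training fraction are all assumptions made for illustration. The first function applies the standard Benjamini and Hochberg step-up adjustment to a vector of p-values; the second draws a random split that keeps the class proportions roughly equal in both groups.

```python
import numpy as np


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


def stratified_split(labels, train_fraction, rng):
    """Split sample indices so each class is divided by train_fraction."""
    labels = np.asarray(labels)
    train = []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_train = int(round(train_fraction * idx.size))
        train.extend(idx[:n_train].tolist())
    train = np.sort(np.array(train, dtype=int))
    test = np.setdiff1d(np.arange(labels.size), train)
    return train, test


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Hypothetical p-values, one per gene; keep genes with adjusted p < 0.05.
    pvals = rng.uniform(size=1000)
    selected = np.flatnonzero(benjamini_hochberg(pvals) < 0.05)
    print("genes kept:", selected.size)

    # Synthetic stage labels (four classes); class sizes are made up.
    stages = np.repeat([0, 1, 2, 3], [25, 25, 25, 24])
    for _ in range(50):  # 50 re-samplings, as in the abstract
        train_idx, test_idx = stratified_split(stages, 0.5, rng)
        # A classifier would be fitted on train_idx and scored on test_idx.
    print("last split sizes:", train_idx.size, test_idx.size)
```

Splitting within each class, rather than over the pooled samples, is what keeps every stage represented in both groups on every re-sampling.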