A novel over-sampling method and its application to miRNA prediction


MicroRNAs (miRNAs) are short (~22 nt) non-coding RNAs that play an indispensable role in gene regulation in many biological processes. Most current computational methods, both comparative and non-comparative, aim to distinguish human precursor microRNA (pre-miRNA) hairpins from both genome pseudo hairpins and other non-coding RNAs (ncRNAs). Although a few approaches have achieved promising results by applying class imbalance learning methods, the problem has not yet been fully solved because of the imbalanced class distribution in the datasets. For example, SMOTE is a well-known, general over-sampling method that addresses this problem, but in some cases it fails to improve classification performance and sometimes even reduces it. We therefore developed a novel over-sampling method named incremental-SMOTE to distinguish human pre-miRNA hairpins from both genome pseudo hairpins and other ncRNAs. Experimental results on the pre-miRNA datasets from Batuwita et al. showed that our method achieved better Sensitivity and G-mean than the control (no over-sampling), SMOTE, and several modified successors of SMOTE, including safe-level-SMOTE and borderline-SMOTE. In addition, we applied the novel method to five imbalanced benchmark datasets from the UCI Machine Learning Repository and again achieved improvements in Sensitivity and G-mean. These results suggest that our method outperforms SMOTE and several of its successors in various biomedical classification problems, including miRNA classification.
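For readers unfamiliar with the baseline and the metrics named above, the following is a minimal Python sketch of plain SMOTE as described by Chawla et al. (2002), together with Sensitivity and G-mean. It is not the incremental-SMOTE method proposed in this paper, whose details the abstract does not give. The function names, the default `k=5`, the random choice of base samples, and the toy data are assumptions made for illustration. G-mean is taken here in its usual sense, the geometric mean of Sensitivity and Specificity, which the abstract does not define explicitly.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Plain SMOTE sketch: each synthetic point lies on the segment between
    a randomly chosen minority sample and one of its k nearest minority
    neighbours. Hypothetical helper, not the authors' incremental-SMOTE."""
    rng = random.Random(seed)
    n = len(minority)
    if n < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    k = min(k, n - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(n)
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic


def sensitivity_specificity_gmean(y_true, y_pred, positive=1):
    """Sensitivity = TP / (TP + FN), Specificity = TN / (TN + FP),
    G-mean = sqrt(Sensitivity * Specificity).
    Assumes both classes occur in y_true."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return sens, spec, math.sqrt(sens * spec)


# Toy usage: four minority points in 2-D, ten synthetic samples.
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority_points, n_synthetic=10, k=2, seed=1)
```

Because G-mean multiplies the two per-class rates, a classifier that ignores the minority class scores zero however high its overall accuracy is, which is why it is commonly reported alongside Sensitivity on imbalanced data.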

Share and Cite:

Dang, X. , Hirose, O. , Saethang, T. , Tran, V. , Nguyen, L. , Le, T. , Kubo, M. , Yamada, Y. and Satou, K. (2013) A novel over-sampling method and its application to miRNA prediction. Journal of Biomedical Science and Engineering, 6, 236-248. doi: 10.4236/jbise.2013.62A029.

Conflicts of Interest

The authors declare no conflicts of interest.


[1] Kim, V.N. and Nam, J.-W. (2006) Genomics of microRNA. Trends in Genetics, 22, 165-173. doi:10.1016/j.tig.2006.01.003
[2] Lee, R.C., Feinbaum, R.L. and Ambros, V. (1993) The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell, 75, 843-854. doi:10.1016/0092-8674(93)90529-Y
[3] Harfe, B.D., McManus, M.T., Mansfield, J.H., Hornstein, E. and Tabin, C.J. (2005) The RNaseIII enzyme Dicer is required for morphogenesis but not patterning of the vertebrate limb. Proceedings of the National Academy of Sciences, 102, 10898-10903. doi:10.1073/pnas.0504834102
[4] Wilfred, B.R., Wang, W. and Nelson, P.T. (2007) Energizing miRNA research: A review of the role of miRNAs in lipid metabolism, with a prediction that miR-103/107 regulates human metabolic pathways. Molecular Genetics and Metabolism, 91, 209-217. doi:10.1016/j.ymgme.2007.03.011
[5] Lodish, H.F., Chen, C. and Bartel, D.P. (2004) MicroRNAs modulate hematopoietic lineage differentiation. Science, 303, 83-86. doi:10.1126/science.1091903
[6] Lim, L.P., Lau, N.C., Garrett-Engele, P. and Grimson, A. (2005) Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs. Nature, 433, 769-773.
[7] Kozomara, A. and Griffiths-Jones, S. (2011) miRBase: Integrating microRNA annotation and deep-sequencing data. Nucleic Acids Research, 39, D152-D157. doi:10.1093/nar/gkq1027
[8] Hertel, J. and Stadler, P.F. (2006) Hairpins in a Haystack: Recognizing microRNA precursors in comparative genomics data. Bioinformatics, 22, e197-e202. doi:10.1093/bioinformatics/btl257
[9] Lim, L.P., Glasner, M.E., Yekta, S., Burge, C.B. and Bartel, D.P. (2003) Vertebrate microRNA genes. Science, 299, 1540. doi:10.1126/science.1080372
[10] Lai, E.C., Tomancak, P., Williams, R.W. and Rubin, G.M. (2003) Computational identification of Drosophila microRNA genes. Genome Biology, 4, R42. doi:10.1186/gb-2003-4-7-r42
[11] Jones-Rhoades, M.W. and Bartel, D.P. (2004) Computational identification of plant microRNAs and their targets, including a stress-induced miRNA. Molecular Cell, 14, 787-799. doi:10.1016/j.molcel.2004.05.027
[12] Bonnet, E., Wuyts, J., Rouzé, P. and Van de Peer, Y. (2004) Detection of 91 potential conserved plant microRNAs in Arabidopsis thaliana and Oryza sativa identifies important target genes. Proceedings of the National Academy of Sciences, 101, 11511-11516. doi:10.1073/pnas.0404025101
[13] Ng, K.L.S. and Mishra, S.K. (2007) De novo SVM classification of precursor microRNAs from genomic pseudo hairpins using global and intrinsic folding measures. Bioinformatics, 23, 1321-1330. doi:10.1093/bioinformatics/btm026
[14] Berezikov, E., Guryev, V., Van de Belt, J., Wienholds, E., Plasterk, R.H.A. and Cuppen, E. (2005) Phylogenetic shadowing and computational identification of human microRNA genes. Cell, 120, 21-24. doi:10.1016/j.cell.2004.12.031
[15] Sewer, A., Paul, N., Landgraf, P., Aravin, A., Pfeffer, S., Brownstein, M.J., Tuschl, T., Van Nimwegen, E. and Zavolan, M. (2005) Identification of clustered microRNAs using an ab initio prediction method. BMC Bioinformatics, 6, 267. doi:10.1186/1471-2105-6-267
[16] Xue, C., Li, F., He, T., Liu, G.-P., Li, Y. and Zhang, X. (2005) Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinformatics, 6, 310. doi:10.1186/1471-2105-6-310
[17] Clote, P., Ferré, F., Kranakis, E. and Krizanc, D. (2005) Structural RNA has lower folding energy than random RNA of the same dinucleotide frequency. RNA, 11, 578-591. doi:10.1261/rna.7220505
[18] Jiang, P., Wu, H., Wang, W., Ma, W., Sun, X. and Lu, Z. (2007) MiPred: Classification of real and pseudo microRNA precursors using random forest prediction model with combined features. Nucleic Acids Research, 35, 339-344. doi:10.1093/nar/gkm368
[19] Tang, X., Xiao, J., Li, Y., Wen, Z., Fang, Z. and Li, M. (2012) Systematic analysis revealed better performance of random forest algorithm coupled with complex network features in predicting microRNA precursors. Chemometrics and Intelligent Laboratory Systems, 118, 317-323. doi:10.1016/j.chemolab.2012.05.001
[20] Wang, Y., Chen, X., Jiang, W., Li, L., Li, W., Yang, L., Liao, M., Lian, B., Lv, Y., Wang, S., Wang, S. and Li, X. (2011) Predicting human microRNA precursors based on an optimized feature subset generated by GA-SVM. Genomics, 98, 73-78. doi:10.1016/j.ygeno.2011.04.011
[21] Zhang, Y., Yang, Y., Zhang, H., Jiang, X., Xu, B., Xue, Y., Cao, Y., Zhai, Q., Zhai, Y., Xu, M., Cooke, H.J. and Shi, Q. (2011) Prediction of novel pre-microRNAs with high accuracy through boosting and SVM. Bioinformatics, 27, 1436-1437. doi:10.1093/bioinformatics/btr148
[22] Kadri, S., Hinman, V. and Benos, P.V. (2009) HHMMiR: Efficient de novo prediction of microRNAs using hierarchical hidden Markov models. BMC Bioinformatics, 10, S35. doi:10.1186/1471-2105-10-S1-S35
[23] Batuwita, R. and Palade, V. (2009) microPred: Effective classification of pre-miRNAs for human miRNA gene prediction. Bioinformatics, 25, 989-995. doi:10.1093/bioinformatics/btp107
[24] Xiao, J., Tang, X., Li, Y., Fang, Z., Ma, D., He, Y. and Li, M. (2011) Identification of microRNA precursors based on random forest with network-level representation method of stem-loop structure. BMC Bioinformatics, 12, 165. doi:10.1186/1471-2105-12-165
[25] Bunkhumpornpat, C., Sinapiromsaran, K. and Lursinsap, C. (2009) Safe-Level-SMOTE: Safe-level-synthetic minority over-sampling technique. Lecture Notes in Computer Science, 5476, 475-482. doi:10.1007/978-3-642-01307-2_43
[26] Han, H., Wang, W. and Mao, B. (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. Lecture Notes in Computer Science, 3644, 878-887. doi:10.1007/11538059_91
[27] Chawla, N.V., Bowyer, K.W. and Hall, L.O. (2002) SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.
[28] Burges, C.J.C. (1998) A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2, 121-167. doi:10.1023/A:1009715923555
[29] Vapnik, V.N. (1999) An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10, 988-999. doi:10.1109/72.788640
[30] Karatzoglou, A. and Smola, A. (2004) kernlab—An S4 package for kernel methods in R. Journal of Statistical Software, 11, 1-20.
[31] Venables, W.N. and Ripley, B.D. (2002) Modern applied statistics with S. 4th Edition. Springer, New York. doi:10.1007/978-0-387-21706-2
[32] Liaw, A. and Wiener, M. (2002) Classification and regression by random forest. R News, 2, 18-22.
[33] Akbani, R., Kwek, S. and Japkowicz, N. (2004) Applying support vector machines to imbalanced datasets. Lecture Notes in Computer Science, 3201, 39-50. doi:10.1007/978-3-540-30115-8_7
[34] Anand, A., Pugalenthi, G., Fogel, G.B. and Suganthan, P.N. (2010) An approach for classification of highly imbalanced data using weighting and undersampling. Amino Acids, 39, 1385-1391. doi:10.1007/s00726-010-0595-2
[35] Kubat, M. and Matwin, S. (1997) Addressing the curse of imbalanced training sets: One-sided selection. Proceedings of the 14th International Conference on Machine Learning, Nashville, 8-12 July 1997, 179-186.
[36] Han, K. (2011) Effective sample selection for classification of pre-miRNAs. Genetics and Molecular Research, 10, 506-518. doi:10.4238/vol10-1gmr1054
[37] Xuan, P., Guo, M., Liu, X., Huang, Y., Li, W. and Huang, Y. (2011) PlantMiRNAPred: Efficient classification of real and pseudo plant pre-miRNAs. Bioinformatics, 27, 1368-1376. doi:10.1093/bioinformatics/btr153
[38] Frank, A. and Asuncion, A. (2010) UCI machine learning repository. University of California, Irvine.

Copyright © 2024 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.