Random forest classifier combined with feature selection for breast cancer diagnosis and prognostic


As the incidence of breast cancer has increased significantly in recent years, the application of expert systems and machine learning techniques to this problem has attracted considerable attention from researchers. This study aims to diagnose and prognosticate breast cancer with a machine learning method based on a random forest classifier and a feature selection technique. By weighting features, keeping useful ones, and removing redundant ones from the datasets, the method addresses the diagnosis problem by classifying the Wisconsin Breast Cancer Diagnosis Dataset and the prognosis problem by classifying the Wisconsin Breast Cancer Prognostic Dataset. On these datasets we obtained a classification accuracy of 100% in the best case and around 99.8% on average, which is very promising compared with previously reported results. Although these results are for the Wisconsin Breast Cancer datasets, they indicate that the method can be used confidently for other breast cancer diagnosis problems as well.
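The combination described above, a random forest classifier preceded by a feature selection step, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the weighting or selection procedure, so the sketch assumes scikit-learn, its bundled copy of the Wisconsin diagnostic data (`load_breast_cancer`), and importance-based selection with a median threshold as a stand-in. The function name `evaluate` and its parameters are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def evaluate(n_trees=100, folds=5, seed=0):
    """Return (mean cross-validated accuracy, number of features kept).

    Hypothetical helper for illustration only.
    """
    # scikit-learn's bundled Wisconsin diagnostic data (569 samples, 30 features).
    X, y = load_breast_cancer(return_X_y=True)

    # Assumed stand-in for the paper's feature selection step: keep the
    # features whose random forest importance is at least the median.
    selector = SelectFromModel(
        RandomForestClassifier(n_estimators=n_trees, random_state=seed),
        threshold="median",
    )
    classifier = RandomForestClassifier(n_estimators=n_trees, random_state=seed)

    # With the selector inside the pipeline, it is refit on every training
    # fold, so held-out samples never influence which features are kept.
    model = make_pipeline(selector, classifier)
    scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")

    # Fit the selector once on all data just to report how many features survive.
    kept = int(selector.fit(X, y).get_support().sum())
    return scores.mean(), kept


if __name__ == "__main__":
    accuracy, kept = evaluate()
    print(f"mean CV accuracy: {accuracy:.3f}, features kept: {kept}")
```

Because the selection rule and the evaluation protocol here are assumptions, the accuracy this sketch prints should not be compared directly with the figures reported in the abstract.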

Share and Cite:

Nguyen, C., Wang, Y. and Nguyen, H. (2013) Random forest classifier combined with feature selection for breast cancer diagnosis and prognostic. Journal of Biomedical Science and Engineering, 6, 551-560. doi: 10.4236/jbise.2013.65070.

Conflicts of Interest

The authors declare no conflicts of interest.



Copyright © 2020 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.