D-IMPACT: A Data Preprocessing Algorithm to Improve the Performance of Clustering

Abstract

In this study, we propose D-IMPACT, a data preprocessing algorithm inspired by the IMPACT clustering algorithm. D-IMPACT iteratively moves data points based on attraction and density in order to detect and remove noise and outliers and to separate clusters. Experimental results on two-dimensional datasets and practical datasets show that the preprocessed datasets it produces improve the performance of clustering algorithms.
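The general idea of moving points according to attraction and density can be illustrated with a small sketch. This is not the D-IMPACT algorithm itself: the abstract does not give its formulas, so the density measure (neighbor count within a radius), the attraction rule (a density-weighted pull toward neighbors), and all names and parameters (`density`, `shrink_step`, `preprocess`, `radius`, `step`, `min_density`) are assumptions chosen only for illustration.

```python
import math


def density(points, radius):
    """Assumed density measure: number of other points within `radius`."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and math.dist(p, q) <= radius)
        for i, p in enumerate(points)
    ]


def shrink_step(points, radius, step):
    """Move each point a fraction `step` toward the density-weighted
    mean of its neighbors (an assumed stand-in for 'attraction')."""
    dens = density(points, radius)
    moved = []
    for i, p in enumerate(points):
        total_weight = 0.0
        target = [0.0] * len(p)
        for j, q in enumerate(points):
            if j == i or math.dist(p, q) > radius:
                continue
            w = 1 + dens[j]  # denser neighbors pull more strongly
            total_weight += w
            target = [t + w * x for t, x in zip(target, q)]
        if total_weight == 0:
            moved.append(tuple(p))  # isolated point: nothing attracts it
        else:
            moved.append(
                tuple(x + step * (t / total_weight - x) for x, t in zip(p, target))
            )
    return moved


def preprocess(points, radius=1.0, step=0.5, iterations=5, min_density=1):
    """Drop low-density points as noise, then iteratively shrink the rest."""
    dens = density(points, radius)
    kept = [tuple(p) for p, d in zip(points, dens) if d >= min_density]
    for _ in range(iterations):
        kept = shrink_step(kept, radius, step)
    return kept


# Two small groups plus one isolated point, which is dropped as noise.
sample = [(0, 0), (0.2, 0), (0, 0.2), (5, 5), (5.2, 5), (5, 5.2), (20, 20)]
cleaned = preprocess(sample)
```

In this toy run, the isolated point has no neighbors within the radius and is removed, while each group contracts toward its own center and stays well apart from the other. The output could then be passed to any clustering algorithm in place of the original data.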

Share and Cite:

Tran, V., Hirose, O., Saethang, T., Nguyen, L., Dang, X., Le, T., Ngo, D., Sergey, G., Kubo, M., Yamada, Y. and Satou, K. (2014) D-IMPACT: A Data Preprocessing Algorithm to Improve the Performance of Clustering. Journal of Software Engineering and Applications, 7, 639-654. doi: 10.4236/jsea.2014.78059.

Conflicts of Interest

The authors declare no conflicts of interest.

Copyright © 2023 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.