Automatic Classification of Unstructured Blog Text

Abstract

Automatic classification of blog entries is generally treated as a semi-supervised machine learning task, in which each blog entry is automatically assigned to one of a set of pre-defined classes based on features extracted from its textual content. This paper classifies unstructured blog entries in several steps: pre-processing (tokenization, stop-word elimination and stemming), statistical feature-set extraction, feature-set enhancement using semantic resources, and modeling with two alternative machine learning models, the naïve Bayesian model and the artificial neural network (ANN) model. Empirical evaluations indicate that this multi-step approach achieves good overall classification accuracy on unstructured blog text datasets with both models. However, the naïve Bayesian model clearly outperforms the ANN-based model when only a smaller feature set is available, which is usually the case when a blog topic is recent and the amount of available training data is limited.
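The pre-processing and naïve Bayesian steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The stop-word list, the suffix-stripping rule (a toy stand-in for a real stemmer such as Porter's), the raw term-count features, the add-one smoothing, and the four training sentences are all assumptions made for illustration. The semantic feature-set enhancement and the ANN model are omitted.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "and", "to", "was", "for", "on", "by"}


def crude_stem(token):
    """Strip a few common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize on letters, drop stop words, then stem each remaining token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


class NaiveBayes:
    """Multinomial naive Bayes over term counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(preprocess(doc))
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def predict(self, doc):
        # Terms never seen in training carry no evidence, so they are skipped.
        tokens = [t for t in preprocess(doc) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data, invented for this sketch.
train_docs = [
    "The striker scored two goals in the final match",
    "The bowler was bowling fast and took three wickets",
    "The goalkeeper saved a penalty kick",
    "The batsman scored a century in the test match",
]
train_labels = ["football", "cricket", "football", "cricket"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("goalkeeper penalty goals"))
print(model.predict("wickets bowling century"))
```

With so few training documents the vocabulary is very small, which loosely mirrors the small-feature-set situation the abstract describes for recent blog topics. The sketch says nothing about how an ANN would compare.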

Share and Cite:

M. Dalal and M. Zaveri, "Automatic Classification of Unstructured Blog Text," Journal of Intelligent Learning Systems and Applications, Vol. 5 No. 2, 2013, pp. 108-114. doi: 10.4236/jilsa.2013.52012.

Conflicts of Interest

The authors declare no conflicts of interest.

Copyright © 2024 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.