Proceedings of 2010 Cross-Strait Conference on Information Science and Technology (CSCIST 2010 E-BOOK)

Qinhuangdao, China, July 9-13, 2010

ISBN: 978-1-935068-15-0 Scientific Research Publishing, USA

E-Book 840pp Pub. Date: July 2010

Category: Computer Science & Communications

Price: $120

Title: Two Improved TAN Models for Text Categorization
Source: Proceedings of 2010 Cross-Strait Conference on Information Science and Technology (CSCIST 2010 E-BOOK) (pp 340-344)
Author(s): Caiyan Jia, School of Computer and Information Technology, Beijing Jiaotong University, Beijing, 100044
Jia Liu, School of Computer and Information Technology, Beijing Jiaotong University, Beijing, 100044
Abstract: TAN (Tree Augmented Naïve Bayes) combines the simplicity of Naïve Bayes with the ability of Bayesian networks to express relationships among attributes. It outperforms Naïve Bayes while retaining its computational simplicity. However, the existing TAN model for text categorization, BL-TAN, has two problems: 1) it does not take into account features that do not appear in a text; 2) it ignores word-frequency information. To address these problems, we propose two improved TAN models, BNL-TAN and MUL-TAN: BNL-TAN accounts for the features that do not appear, and MUL-TAN uses word-frequency information. We then compare the original BL-TAN with the two improved versions on unbalanced Chinese and English text datasets. Experimental results show that BNL-TAN outperforms BL-TAN and that MUL-TAN performs much better than both BL-TAN and BNL-TAN. We therefore conclude that word frequency is very important for text categorization.
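The abstract does not give the formulas behind BL-TAN, BNL-TAN, and MUL-TAN, so the following is only a rough illustration of the three ways of treating features that it contrasts: scoring only the words present in a text, also scoring the words that are absent, and weighting each word by its frequency. It is a minimal sketch on plain Naïve Bayes, assuming Laplace smoothing and omitting both class priors and the tree-structured dependencies that define TAN. All function names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def train(docs, vocab):
    """Estimate smoothed per-class parameters from token lists of ONE class."""
    n_docs = len(docs)
    # Number of documents containing each word (presence statistics).
    doc_freq = Counter(w for d in docs for w in set(d) if w in vocab)
    # Total occurrences of each word (frequency statistics).
    term_freq = Counter(w for d in docs for w in d if w in vocab)
    total_terms = sum(term_freq.values())
    # Laplace smoothing in both cases (an assumption of this sketch).
    p_present = {w: (doc_freq[w] + 1) / (n_docs + 2) for w in vocab}
    p_term = {w: (term_freq[w] + 1) / (total_terms + len(vocab)) for w in vocab}
    return p_present, p_term

def score_present_only(doc, p_present):
    """Log-likelihood using only the words that appear; counts are ignored."""
    return sum(math.log(p_present[w]) for w in set(doc) if w in p_present)

def score_with_absent(doc, p_present):
    """Bernoulli-style log-likelihood: absent words contribute log(1 - p)."""
    present = set(doc)
    return sum(math.log(p if w in present else 1.0 - p)
               for w, p in p_present.items())

def score_with_frequency(doc, p_term):
    """Multinomial-style log-likelihood: each word is weighted by its count."""
    counts = Counter(w for w in doc if w in p_term)
    return sum(c * math.log(p_term[w]) for w, c in counts.items())

if __name__ == "__main__":
    vocab = {"ball", "goal", "vote"}
    sports = [["ball", "goal", "ball"], ["goal", "ball"]]
    p_present, p_term = train(sports, vocab)
    for doc in (["ball"], ["ball", "ball", "ball"]):
        print(doc,
              score_present_only(doc, p_present),
              score_with_absent(doc, p_present),
              score_with_frequency(doc, p_term))
```

In this sketch the present-only score is identical for a text containing "ball" once and a text containing it three times, whereas the frequency-weighted score distinguishes them. That is the kind of information the abstract says a presence-based model discards.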