Gender Prediction on Twitter Using Stream Algorithms with N-Gram Character Features


The rapid growth of social networks has produced an unprecedented amount of user-generated data, which provides an excellent opportunity for text mining. Authorship analysis, an important part of text mining, attempts to learn about the author of a text through subtle variations in writing style that occur among gender, age, and social groups. Such information has a variety of applications, including advertising and law enforcement. One of the most accessible sources of user-generated data is Twitter, which makes the majority of its user data freely available through its data access API. In this study we seek to identify the gender of Twitter users using the Perceptron and Naïve Bayes algorithms with selected character 1- through 5-gram features from tweet text. Stream versions of these algorithms were employed for gender prediction to handle the speed and volume of tweet traffic. Because informal text, such as tweets, cannot be easily evaluated using traditional dictionary methods, n-gram features were used in this study to represent streaming tweets. The number of distinct 1- through 5-grams is so large that only a subset of them can be used in gender classification; for this reason, informative n-gram features were chosen using multiple selection algorithms. In the best case, the Naïve Bayes and Perceptron algorithms produced accuracy, balanced accuracy, and F-measure above 99%.
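As an illustration of the two ideas named in the abstract, the following is a minimal Python sketch of character 1- through 5-gram extraction feeding a perceptron that is updated one tweet at a time. It is not the authors' implementation: the names `char_ngrams` and `StreamPerceptron`, the +1/-1 label encoding, and the toy tweets are all assumptions made for this example, and the feature selection step is omitted.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=5):
    """Count every character n-gram of length n_min..n_max in text."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # range() is empty when the text is shorter than n.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


class StreamPerceptron:
    """Binary perceptron updated one example at a time (labels are +1 or -1)."""

    def __init__(self):
        self.weights = {}  # n-gram -> weight, grows as new n-grams arrive
        self.bias = 0.0

    def predict(self, features):
        score = self.bias
        for gram, count in features.items():
            score += self.weights.get(gram, 0.0) * count
        return 1 if score >= 0 else -1

    def learn(self, features, label):
        # Mistake-driven update: weights change only on a wrong prediction.
        if self.predict(features) != label:
            for gram, count in features.items():
                self.weights[gram] = self.weights.get(gram, 0.0) + label * count
            self.bias += label


# Hypothetical toy stream of (tweet text, label) pairs; not data from the study.
stream = [
    ("omg love this sooo much", 1),
    ("watching the game tonight", -1),
    ("love love love it", 1),
    ("game was great", -1),
]

model = StreamPerceptron()
correct = 0
for text, label in stream:
    features = char_ngrams(text.lower())
    # Test-then-train: score each tweet before the model learns from it.
    correct += int(model.predict(features) == label)
    model.learn(features, label)
```

Because each tweet is scored before it is used for training, the running value of `correct` gives an accuracy estimate without a separate held-out set, which suits an unbounded stream.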

Share and Cite:

Z. Miller, B. Dickinson and W. Hu, "Gender Prediction on Twitter Using Stream Algorithms with N-Gram Character Features," International Journal of Intelligence Science, Vol. 2 No. 4A, 2012, pp. 143-148. doi: 10.4236/ijis.2012.224019.

Conflicts of Interest

The authors declare no conflicts of interest.



Copyright © 2024 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.