A Hybrid Algorithm for Stemming of Nepali Text

Abstract

This paper proposes a new context-free stemmer that combines a traditional rule-based system with a string similarity approach; the resulting algorithm can be called a hybrid algorithm. It is language dependent. A context-free stemmer stems a word without regard to its context, that is, the same rules are applied in every context. Stripping words with the traditional context-free rule-based approach may over-stem or under-stem inflected words. This problem is addressed by applying a string similarity function computed with dynamic programming, with edit distance as the similarity measure. The stripped inflected word is compared with the words stored in an available text database, and the word with the minimum distance is taken as the substitute for the stripped word, which yields its stem. The approach thus draws heavily on both the traditional rule-based system and the corpus-based approach. The algorithm is tested on the Nepali language, which is written in the Devanagari script. It gives better results than the traditional rule-based system, although this has been shown for the Nepali language only. The overall accuracy of the hybrid algorithm is 70.10%, whereas that of the traditional rule-based system is 68.43%.
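The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the suffix list, the function names (`edit_distance`, `strip_suffix`, `hybrid_stem`), and the tiny word list used as the "text database" are all assumptions made for illustration, and the actual rule set is not given in the abstract.

```python
# Illustrative suffix list (assumption): a few common Nepali postpositions and
# the plural marker. The paper's actual rules are not stated in the abstract.
SUFFIXES = ["हरू", "लाई", "को", "मा", "ले"]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming, counted in code points."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def strip_suffix(word: str, suffixes=SUFFIXES) -> str:
    """Context-free rule step: remove the longest matching suffix, if any."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)]
    return word


def hybrid_stem(word: str, lexicon, suffixes=SUFFIXES) -> str:
    """Strip by rule, then replace with the nearest lexicon entry."""
    stripped = strip_suffix(word, suffixes)
    if not lexicon:
        return stripped  # nothing to compare against; keep the rule output
    return min(lexicon, key=lambda entry: edit_distance(stripped, entry))


if __name__ == "__main__":
    lexicon = ["आमा", "किताब", "घर"]  # hypothetical word list
    print(hybrid_stem("घरहरू", lexicon))  # rule output is already in the list
    print(hybrid_stem("आमा", lexicon))    # rule over-stems; lookup corrects it
```

In this sketch, ties in distance are resolved by the order of the word list, and distances are counted over Unicode code points rather than full Devanagari syllables; either choice could reasonably be made differently.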

Share and Cite:

Sitaula, C. (2013) A Hybrid Algorithm for Stemming of Nepali Text. Intelligent Information Management, 5, 136-139. doi: 10.4236/iim.2013.54014.

Conflicts of Interest

The author declares no conflicts of interest.

Copyright © 2024 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.