Probabilistic, Statistical and Algorithmic Aspects of the Similarity of Texts and Application to Gospels Comparison

DOI: 10.4236/jdaip.2015.34012   PDF   HTML   XML   3,192 Downloads   3,542 Views   Citations

Abstract

The fundamental problem of similarity studies, in the frame of data-mining, is to examine and detect similar items in articles, papers, and books with huge sizes. In this paper, we are interested in the probabilistic, and the statistical and the algorithmic aspects in studies of texts. We will be using the approach of k-shinglings, a k-shingling being defined as a sequence of k consecutive characters that are extracted from a text (k ≥ 1). The main stake in this field is to find accurate and quick algorithms to compute the similarity in short times. This will be achieved in using approximation methods. The first approximation method is statistical and, is based on the theorem of Glivenko-Cantelli. The second is the banding technique. And the third concerns a modification of the algorithm proposed by Rajaraman et al. ([1]), denoted here as (RUM). The Jaccard index is the one being used in this paper. We finally illustrate these results of the paper on the four Gospels. The results are very conclusive.

Share and Cite:

Dembele, S. and Lo, G. (2015) Probabilistic, Statistical and Algorithmic Aspects of the Similarity of Texts and Application to Gospels Comparison. Journal of Data Analysis and Information Processing, 3, 112-127. doi: 10.4236/jdaip.2015.34012.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] Rajaraman, A. and Ullman, J.D. (2011) Mining of Massive Datasets. Cambridge University Press, Cambridge. http://dx.doi.org/10.1017/cbo9781139058452
[2] Stein, B. and Eissen, S.M. (2006) Near Similarity Search and Plagiarism Analysis. In: Spiliopoulou, et al., Eds., From Data and Information Analysis to Knowledge Engineering Selected Papers from the 29th Annual Conference of the German Classification Society (GfKl) Magdeburg, Springer, Berlin Heidelberg, 430-437. http://dx.doi.org/10.1007/3-540-31314-1_52
[3] Gionis, A., Indyk, P. and Motwani, R. (1999) Similarity Search in High Dimensions via Hashing. Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, 1999, 518-529.
[4] Chen, D.-Y., Tian, X.-P., Shen, Y.-T. and Ouhyoung, M. (2003) On Visual Similarity Based 3D Model Retrieval. Eurographics 2003, 22, 223-232.
[5] Cha, S.H. (2007) Comprehensive Survey on Distance Similarity Measures between Probability Density Functions. International Journal of Mathematical Models and Methods in Applied Sciences, 4, 300-307.
[6] Gower, J.C. and Legendre, P. (1986) Metric and Euclidean Properties of Dissimilarity Coefficients. Journal of Classification, 3, 5-48. http://dx.doi.org/10.1007/BF01896809
[7] Zezula, P., Dohnal, V. and Amato, G. (2006) Similarity Search the Metric Space Approach. Advances in Database Systems, Springer Series, 32, 220 p.
[8] Strehl, A., Ghosh, J. and Mooney, R. (2000) Impact of Similarity Measures on Web-Page Clustering. American Association for Artificial Intelligence, 78712-1084.
[9] Formica A. (2005) Ontology-Based Concept Similarity in Formal Concept Analysis. Information Sciences, 176, 2624-2641.
[10] Bilenko, M. and Mooney, R.J. (2003) Adaptive Duplicate Detection Using Learnable String Similarity Measures. Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Washington DC, 2003, 39-48, http://dx.doi.org/10.1145/956750.956759
[11] Theobold, M., Siddharth, J. and Paepcke, A. SpotSigs: robust and efficient near duplicate detection in large web collections. Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Singapore, 2008, 563-570.
[12] http://fr.wikipedia.org/wiki/évangiles
[13] http://www.info-bible.org/lsg/INDEX.html

  
comments powered by Disqus

Copyright © 2020 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.