A Novel Mathematical Model for Similarity Search in Pattern Matching Algorithms

P. Vinod-Prasad

doi:10.4236/jcc.2020.89008

Journal of Computer and Communications > Vol.8 No.9, September 2020

A Novel Mathematical Model for Similarity Search in Pattern Matching Algorithms

P. Vinod-Prasad
Department of Technology, Ministry of Higher Education, Sur, Oman.
DOI: 10.4236/jcc.2020.89008 PDF HTML XML 355 Downloads 1,101 Views Citations

Abstract

Modern applications require large databases to be searched for regions that are similar to a given pattern. The DNA sequence analysis, speech and text recognition, artificial intelligence, Internet of Things, and many other applications highly depend on pattern matching or similarity searches. In this paper, we discuss some of the string matching solutions developed in the past. Then, we present a novel mathematical model to search for a given pattern and it’s near approximates in the text.

Keywords

String Matching, Pattern Matching, Similarity Search, Substring Search

Share and Cite:

Vinod-Prasad, P. (2020) A Novel Mathematical Model for Similarity Search in Pattern Matching Algorithms. Journal of Computer and Communications, 8, 94-99. doi: 10.4236/jcc.2020.89008.

1. Introduction

Similarity searches commonly known as approximate string matching allow for some mismatches between the text and the pattern. Similarity searches are widely used in computational biology, search engines, data mining, signal processing, digital dictionaries, and many other applications where exact and similar patterns are to be searched over a given text. Most of the algorithms developed prior to the 1990’s such as Morris and Pratt [1], Aho and Corasick [2], Knuth et al. [3], Boyer and Moore [4], Horspool [5], and Karp and Rabin [6], were designed to search for exact pattern matches in the text. However, with some modifications, they can also be used to find approximate matches.

The term “distance” is often used when comparing two strings for similarity. A lesser distance is expected for a greater similarity. The Hamming distance [7] approximates the similarity between two strings of equal length by measuring the number of character mismatches at corresponding locations. Let R and S be two non-empty equal-length strings of size M such that $R = r_{0} r_{1} \dots r_{M - 1}$ and $S = s_{0} s_{1} \dots s_{M - 1}$ . Then, the Hamming distance between R and S is given by ham(R, S) = number of locations where $r_{i} \neq s_{i}$ , $0 \leq i \leq M - 1$ . Clearly, if R and S are identical, then ham(R, S) = 0. For a given pattern P modern applications may require database to be searched for exact or approximate matches of P. In the literature, this problem is sometimes referred as “string matching with k mismatches” which can be stated as follows: Given the text $T = t_{0} t_{1} \dots t_{N - 1}$ of size N, and the pattern $P = p_{0} p_{1} \dots p_{M - 1}$ of size M defined over a finite set of text character called alphabet λ. Let hd_i be the Hamming distance such that $h d_{i} = h a m (P, t_{i} t_{i + 1} \dots t_{i + M - 1})$ , where, $0 \leq i \leq (N - M)$ . Then, for a given integer k such that $0 \leq k \leq M$ , report all locations iin T where $h d_{i} \leq k$ .

To solve the “k-mismatch” problem, Landau and Vishkin [8] proposed a suffix-tree-based algorithm. They used a suffix tree to preprocess the text and the pattern in O(N + M) time, and then report k mismatches in O(kN) time. However, the algorithm requires O(k(M + N)) space, which is a concern when N becomes large. Galil and Giancarlo [9] presented an algorithm that uses O(kN) time and O(M) space to solve the same problem. Amir et al. [10] developed an algorithm to identify all such locations in $O (N \sqrt{k} \log_{2} k)$ time. As “k” increases the performance of these algorithms deteriorates, and approaches O(MN) as “k” approaches M. For k=M, the problem becomes independent of k and reduces to “string matching with mismatches”, which can be stated as follows: Given the text $T = t_{0} t_{1} \dots t_{N - 1}$ , and the pattern $P = p_{0} p_{1} \dots p_{M - 1}$ , for every i such that $0 \leq i \leq (N - M)$ , output the Hamming distance hd_i such that $h d_{i} = h a m (P, t_{i} t_{i + 1} \dots t_{i + M - 1})$ . Abrahamson [11] applied a technique known as the Boolean convolution of the pattern and the text to solve the problem in $O (N \sqrt{M \log M})$ time and O(N) space. Using a linked list, Yates and Perleberg [12] presented O(N + Nf_max) time and O(2M + σ) space algorithm, where f_max is the frequency of the most commonly occurring character in the pattern. Many of these algorithms are covered in Crochemore et al. [13]. For a detailed survey on approximate string matching refer to Navarro [14], Boytsov [15]. Most of the algorithms developed in the past use data-structures and methods outlined above to create indexes over the text or the pattern to accelerate the search process. However, their costly maintenance has always been a cause of concern.

2. Assumptions and Notations

We use the following assumptions and notations. “λ” represents a finite non-empty ordered set of characters called an alphabet, such that $| λ | = σ$ is the size of the alphabet. “λ_i” is the i^th character in λ such that $1 \leq i \leq σ$ . String S is a finite sequence of characters defined over alphabet λ. S[i] or S_i represents the i^th character of string S, where, “i” is refers to as the shift, location, or index in S. Both T and P represent non-empty text and pattern strings defined over the alphabet λ. N and M represent sizes of the text and the pattern respectively such that M ≤ N. We use the phrase “Number of matches of P in T at shift t”, which refers to the number of character matches when pattern P is aligned with shift t in T.

3. The Model

Traditional, pattern matching algorithms attempts to align P from the first character of T which may result in losing some valuable information regarding the character matches. Let’s consider an example: suppose pattern P = ABCDEF, and text T = CDEFABCD. Assuming 0 being the initial index of the strings, a four character match is found when P is aligned at the index −2 in T. All traditional algorithms would lose this information as attempts are made to align P from index 0 in T. In other words, traditional algorithms consider the indexes in the range $0 \leq i \leq N - M$ , where i represent the text index. However, as we have seen, considering i’s in the range $(1 - M) \leq i \leq (N - 1)$ may provide additional information, particularly when the pattern is considerably large, and the character matches exist at opposite ends of the pattern or text. The lemma and proof given below are already discussed in our previous work [16]. However, a brief discussion is provided below for a prompt reference.

The Lemma

Let T and P be non-empty text and pattern strings defined over an alphabet λ such that: $T = t_{0} t_{1} \dots t_{N - 1}$ and $P = p_{0} p_{1} \dots p_{M - 1}$ . Corresponding to every index j in P, we define a set R_j such that $R_{j} = {i - j | t_{i} = p_{j}, \forall 0 \leq i \leq N - 1}$ . Further, let S represent a set such that $S = R_{0} \cap R_{1} \cap R_{2} \cap \dots \cap R_{M - 1}$ .Then:

1) Every integer $s \in S$ represents an index in T where an exact match of P is found when P is aligned at s in T.

2) The cardinality |S| represents the number of occurrences of P in T. If |S| =0 then P is not present in T.

Proof: Suppose P is found in T when P is aligned at index s in T. Then, we have to show that $s \in S$ . If P appears in T at shift (index)s that means all M characters of pattern $P = p_{0} p_{1} \dots p_{M - 1}$ can be successfully matched with $T = t_{s} t_{s + 1} t_{s + 2} \dots t_{s + M - 1}$ . Hence, $\forall j$ in P such that $0 \leq j \leq M - 1$ , we have $T_{s + j} = p_{j}$ . Now, from the definition of R_j, we get $R_{j} = {(s + j) - j = s}$ ⇒ $\forall 0 \leq j \leq M - 1$ , $s \in R_{j}$ ⇒ $s \in S$ . Further, since all $s \in S$ represent an exact match of P when P is aligned at index s in T ⇒ |S| = Number of occurrences of P in T.

Example 1: Let T = GCABABABCBA be a text array and P = ABAB be a pattern array of size 4. For the given text we have i such that $0 \leq i \leq 10$ . For each shift j ( $0 \leq j \leq 3$ ) in P we create a set R_j such that $R_{j} = {i - j | T [i] = p_{j}, 0 \leq i \leq 10}$ . Which gives: $R_{0} = {2, 4, 6, 10}$ , $R_{1} = {2, 4, 6, 8}$ , $R_{2} = {0, 2, 4, 8}$ , and $R_{3} = {0, 2, 4, 6}$ . Hence, $R_{0} \cap R_{1} \cap R_{2} \cap R_{3} = S = {2, 4}$ ⇒ $| S | = 2$ , which shows P occurs in T twice at locations 2 and 4.

Corollary 1: Given the M sets R_j defined as in the lemma given above. Let f_s be the frequency of the occurrence of an integer “s” in all sets. Then, f_s represents the number of character matches when P is aligned with shift s in T.

Proof: We have already proved that any $s \in S = R_{0} \cap R_{1} \cap R_{2} \cap \dots \cap R_{M - 1}$ represents exactly M character matches of $P = p_{0}, p_{1}, p_{2}, \dots, p_{M - 1}$ at shift s in T ⇒ each $s \in R_{j}$ represents a single character match p_j of P ⇒ the frequency of integer “s” = f_s = Number of character matches of P at shift s in T. Please note: for an exact match of P at shift s in T, “s” must be present in all M sets, in that case: $s \in S$ , i.e. f_s= M. Also, M - f_s represents the Hamming distance, i.e. the number of character mismatches when pattern P is aligned with shift s in T.

Example 2: Consider example 1, the integer 6 appears in three sets: R₀, R₁ and R₃. Hence, the frequency of the integer 6 is equal to f₆ = 3. Which shows three character matches of P when P is aligned with shift 6 in T. Similarly, $f_{0} = f_{8} = 2$ reveal two character matches of P when P is aligned with locations 0 and 80 in T.

4. Model Implementation

We present another example to see how the model described above can be implemented successfully. Consider the pattern array P = FCTHZCTZCF, and the text array T = SKRFCTHZCTZCFTYCTZGHTTCTHZTHZFCTHZCTZCFT. The first column of Table 1 below summarizes all unique characters of the pattern. The second and third columns of the table present the shifts of the corresponding pattern characters in the text and in the pattern respectively. Each row of the last column represents set R_j as defined in the lemma.

As noted above, the frequency of occurrence f_s of an integer “s” in all set represents the number of character matches of P at shift sin T. Therefore, we simply need a mechanism for counting the number of occurrences of individual “s” in all sets in Table 1. This can be done using an array of integers “hit[]” of size N with all cells initialized to 0. Then, for each $s \in R_{j}$ the count at hit[s] is incremented by one. In other words, each $s \in R_{j}$ induces a hit at “hit[s]”. Figure 1 shows the resulting array. The first row of Figure 1 represents array indexes, and middle row indicates the hit count at the corresponding index. Note that locations 3 and 29 have been hit 10 times, which indicate 10 character matches of P and T at alignment locations 3 and 29 in T. Moreover, location 21

Table 1. Hit index.

Figure 1. The hit array.

Figure 2. The re-indexed hit array.

has been hit 5 times, location 14 has been hit 4 times, representing 5 and 4 character matches of P at alignment locations 21 and 14 respectively.

The method described above has two shortcomings. First, we need an array of size N, which is undesirable for large values of N. To resolve this, the hit[] can be assumed as cyclic that allow us to reuse the previously used array cells. Second, as we have seen in previous examples that we might get negative values of s for which the array cell does not exist. For example, s = −1 in the two sets R₅ and R₆ in Table 1, suggesting that we can obtain two character matches if P is aligned at location −1 in T. Such hits are ignored in Figure 1 because as we cannot record hits at negative array locations. This issue can be resolved by assuming the initial index of the text file to be ≥M − 1 rather than 0. This will ensure that s ≥ 0 for all $s \in R_{j}$ . Figure 2 shows the re-indexed version of the array, where, each text character location is hyped by M. Therefore, index M = 10 in the re-indexed array corresponds to location 0 of the actual text array, index 9 corresponds to location −1, and so on. The benefit is straightforward; with the re-indexed array, we can say that 2 hits are found if the pattern is aligned at location 9, which corresponds to location -1 in the actual text array.

5. Conclusion

Highly practical solutions can be drawn based on the model we have presented in this paper. The novel approach we have followed may become an alternative to existing solutions.

Acknowledgements

We thank anonymous reviewers for their comments and suggestions.

Conflicts of Interest

The author declares no conflicts of interest regarding the publication of this paper.

References

[1]	Morris, J. and Pratt, V. (1970) A Linear Pattern Matching Algorithm. Technical Report 40, Computing Center, University of California, Berkeley.
[2]	Aho, A. and Corasick, M. (1975) Efficient String Searching: An Aid to Bibliographic Search. Communications of the ACM, 18, 333-340. https://doi.org/10.1145/360825.360855
[3]	Knuth, D., Morris, J. and Pratt, V. (1977) Fast Pattern Matching in Strings. SIAM Journal on Computing, 6, 323-350. https://doi.org/10.1137/0206024
[4]	Boyer, R. and Moore, S. (1977) A Fast String Searching Algorithm. Communications of the ACM, 20, 762-772. https://doi.org/10.1145/359842.359859
[5]	Horspool, N. (1980) Practical Fast Searching in Strings. Software Practice and Experience, 10, 501-506. https://doi.org/10.1002/spe.4380100608
[6]	Karp, R. and Rabin, M. (1987) Efficient Randomized Pattern-Matching Algorithms. IBM Journal Research and Development, 31, 249-260. https://doi.org/10.1147/rd.312.0249
[7]	Hamming, R. (1950) Error Detecting and Error Correcting Codes. Bell System Technical Journal, 29, 147-160. https://doi.org/10.1002/j.1538-7305.1950.tb00463.x
[8]	Landau, G. and Vishkin, U. (1986) Efficient String with k Mismatches. Theoretical Computer Science, 43, 239-249. https://doi.org/10.1016/0304-3975(86)90178-7
[9]	Galil, Z. and Giancarlo, R. (1986) Improved String Matching with k Mismatches. SIGACT News, 17, 52-54. https://doi.org/10.1145/8307.8309
[10]	Amir, A., Lewenstein, M. and Porat, E. (2004) A Faster Algorithms for String Matching with k Mismatches. Journal of Algorithms, 50, 257-275. https://doi.org/10.1016/S0196-6774(03)00097-X
[11]	Abrahamson, K. (1987) Generalized String Matching. SIAM Journal of Computing, 16, 1039-1051. https://doi.org/10.1137/0216067
[12]	Baeza-Yates, R. and Perleberg, C.H. (1996) Fast and Practical String Matching. Information Processing Letters, 59, 21-27. https://doi.org/10.1016/0020-0190(96)00083-X
[13]	Crochemore, M., Hancart, C. and Lecroq, T. (2007) Algorithms on Strings. Cambridge University Press, Cambridge. https://doi.org/10.1017/CBO9780511546853
[14]	Navarro, G. (2001) A Guided Tour to Approximate String Matching. ACM Computing Surveys, 33, 31-88. https://doi.org/10.1145/375360.375365
[15]	Boytsov, L. (2011) Indexing Methods for Approximate Dictionary Searching: Comparative Analysis. ACM Journal of Experimental Algorithmics, 16, Article No. 1.1. https://doi.org/10.1145/1963190.1963191
[16]	Vinod-Prasad, P. (2016) A Novel Algorithm for String Matching with Mismatches. 5th International Conference of Pattern Recognitions Applications and Methods, Rome, February 2016, 638-644. https://doi.org/10.5220/0005752006380644

Journals Menu

Follow SCIRP

	+1 323-425-8868
	customer@scirp.org
	+86 18163351462(WhatsApp)
	1655362766

	Paper Publishing WeChat

Journals Menu

Home

About SCIRP

Service

Policies