On Some Properties of Graph of Prefix Code

Abstract

We investigate decomposition of codes and finite languages. A prime decomposition is a decomposition of a code or languages into a concatenation of nontrivial prime codes or languages. A code is prime if it cannot be decomposed into at least two nontrivial codes as the same for the languages. In the paper, a linear time algorithm is designed, which finds the prime decomposition. If codes or finite languages are presented as given by its minimal deterministic automaton, then from the point of view of abstract algebra and graph theory, this automaton has special properties. The study was conducted using system for computational Discrete Algebra GAP.

Share and Cite:

Krainiukov, N. , Abramyan, M. and Melnikov, B. (2024) On Some Properties of Graph of Prefix Code. Journal of Applied Mathematics and Physics, 12, 1571-1581. doi: 10.4236/jamp.2024.124096.

1. Introduction

A formal language is set of the words over some alphabet [1]. The standard set operations, like union and interception, can be performed under formal languages [2]. Another simple basic operation of formal languages is their concatenation. However, the complexity of the inverse operation of decomposing a formal language into a nontrivial concatenation of factor languages is more sophisticated and has a long history of studying [3]. A concatenation is trivial if one of the languages consists exactly of the empty string {ε}. A non-empty language is said to be prime if it cannot be written as a catenation of two languages neither one of which is the singleton language consisting of the empty word.

We investigate decomposition problems for the class of finite languages. The finite language is finite set of words. This class contends the codes, the codes are languages recognized by special kind of automata, so-called flower automata. The representation of the finite language in special graph of automata makes it possible to decompose into a product of prime languages.

Section 2 contains the basic definition and notation of formal languages, codes, graphs, finite automata and formal serials over the semiring.

Section 3 discusses the algorithm of decomposition prefix codes and finite language and a practical example of application of this algorithm.

Section 4 applies computer discrete algebra system GAP to decompose prefix codes and finite language.

2. Definitions and Notation

In this section, we remember the definition and notation about formal languages, codes, free monoid, free algebra, automata and graphs. The following definitions taken from [4] [5] [6] will be used.

An alphabet Σ is finite set letters Σ = { a , b , c , } . A word or string w is finite length sequence of letters over alphabet Σ. We denote as Σ * the set of all finite words. The set of Σ * with respect to the concatenation operation forms a free monoid. Language L is subset of monoid Σ * . Free monoid Σ * contains the empty word ε. Semigroup Σ + = Σ * \ ε is monoid Σ * without empty word ε.

A basic operation of free monoid Σ * is concatenation of two words w = u v . The concatenation can be expanded to the formal languages L 1 , L 2 . The result of concatenation is the language = L 1 L 2 = { w | w = v u , v L 1 , u L 2 } .

This operation is like integer multiplication. The inverse operation that is factorization is more complicated. There is problem to find factors/divisors for big integer numbers and prime integer numbers.

The complexity of the inverse operation of decomposing a language L into a nontrivial concatenation L 1 , L 2 is also complicated.

A concatenation is obvious if one of languages L 1 , L 2 consists exactly of the empty string. A language L is prime language if it cannot be decomposed to a non-trivial concatenation of two languages L1∙and L2. The prime factorization of language is the decomposition into prime factors, prime languages.

The problem of prime factorization is undecidable for context-free languages [6]. A set X is a code if any word in X + can be written uniquely as a product of words in X. To say other words, word w X + has a unique factorization in words from X. A code X never contains the empty word ε, because word w = ε w = w ε has different presentation. Any subset words from a code X is a code too.

The word u is a prefix of a word v, denoted as u v , if v = u w , for some w Σ * . We say that u and v are prefix comparable if either v u , or u v . The set X is a prefix code if no element of X is a proper prefix of another element in X. This is equivalent to say that there are no comparable words u v of the relation in the set X. For all words u , v X , if u v , then u = v .

We use standard graph theory notions, as contained in [7], so we only fix the notation and give a few definitions and example below.

A digraph (directed graph) G = ( V , E ) consists of a finite set of vertices V and a set of edges E V 2 . For a subset of vertices V , let G [ U ] denote the induced sub(di)graph ( U , E U × U ) , which is obtained by restricting the vertex set V of G to U and redefining the edge set E appropriately.

The graph G = ( V , E ) has vertices V = { 1 , 2 , 3 , 4 , 5 } and edges E = { ( 1 , 1 ) , ( 1 , 2 ) , ( 1 , 3 ) , ( 2 , 5 ) , ( 5 , 5 ) , ( 5 , 2 ) , ( 3 , 3 ) , ( 3 , 4 ) , ( 4 , 1 ) } (Figure 1).

An automaton A [2] [4] [8] over alphabet Σ consists of a set of states Q, the initial states I Q , the final/terminal states T Q , and a set F Q × A × Q called the set of edges. The automaton is denoted by:

A = ( Q , F , I , T )

The automaton is finite when the set Q is finite. The language L is recognized by A, denoted L(A), is the set of words in Σ * which are labels of paths from I to T.

Figure 2 shows the automaton A with four states, the set of initial states I = { 1 } , the set of terminal states T = { 3 , 4 } , the set of edges F = { ( 1 , a , 2 ) , ( 1 , a , 4 ) , ( 2 , a , 3 ) , ( 2 , b , 4 ) } . The finite language L ( A ) = { a a , a b , b } is recognized by automaton A.

Recall that an algebra [9] over a field K is a K-vector space A with a binary operation (multiplication) A × A A , ( a , b ) a b specified on it, satisfying the following requirements:

1) a ( b + c ) = a b + a c , ( b + c ) a = b a + c a for any a , b , c A ;

2) ( λ a ) b = a ( λ b ) = λ ( a b ) for any λ K , a , b A .

We will additionally assume that:

3) There is a unit in A, i.e. an element 1 such that 1 a = a 1 = a for any a A ;

4) Algebra A is associative, i.e. ( a b ) c = a ( b c ) for any a , b , c A .

Figure 1. Graph G with V = 5 vertices and E = 9 edges.

Figure 2. Automaton A with four states.

Throughout the following, we will additionally assume that the field K is the field of rational numbers or the field Z/2Z. We can embed monoid Σ * over alphabet Σ = { x 1 , x 2 , x n } into free algebra of polynomials K [ x 1 , x 2 , x n ] with homomorphism φ : Σ * K [ x 1 , x 2 , x n ] by definition on letters of alphabet Σ:

φ ( x i ) = x i , i = 1 , , n .

We need to use the formal series over the alphabet Σ with coefficients in semiring K. We recall the definition of semiring [10].

Let K be a semiring.

A semiring K is a set equipped with two operations denoted + and satisfying the following axioms:

1) The set K is a commutative monoid for addition + with a neutral element denoted by 0;

2) The set K is a monoid for multiplication with a neutral element denoted by 1;

3) Multiplication is distributive on addition;

4) For all x K , 0 x = x 0 = 0 .

A formal series over alphabet Σ with coefficients in K is a mapping:

σ : Σ * K

We denote the set of formal series σ over Σ * by K Σ * or K Σ * . The value of σ on w Σ * is denoted ( σ , w ) and we can formally denote:

σ = w Σ * ( σ , w )

The support of a series σ K Σ * is the set:

s u p p ( σ ) = { w Σ * | ( σ , w ) 0 } .

If the set s u p p ( σ ) is finite, then we denote that set of formal series σ by K < Σ * > .

Another words the set of formal series σ K Σ * such that ( σ , w ) = 0 for all but a finite number of w Σ * is denoted K < Σ * > and then σ is called a polynomial.

We define the formal series + τ , σ τ , and k σ by:

( σ + τ , w ) = ( σ , w ) + ( τ , w )

( σ τ , w ) = w = u v Σ * ( σ , u ) ( τ , v )

( k σ , w ) = k ( σ , w ) .

For example, let K is Boolean semiring with addition and multiplication table:

0 + 0 = 0 , 0 + 1 = 1 + 0 = 1 , 1 + 1 = 1

0 0 = 0 , 1 0 = 0 1 = 0 , 1 1 = 1.

alphabet Σ = { a , b } ,

and series σ , τ is defined ( σ , a ) = 1 , ( σ , a b ) = 1 , ( τ , ε ) = 1 , ( τ , b ) = 1 :

σ = 1 a + 1 a b

τ = 1 ε + 1 a b

Then,

σ + τ = 1 ε + 1 a + 1 a b

σ τ = 1 a + 1 a b + 1 a a b + 1 a b a b

It is easy to see that in this case, K < Σ * > is a free semiring of noncommutative polynomials and for finite languages X , Y Σ * σ X Y = σ X + σ Y , σ X Y = σ X σ Y .

3. Decomposition of Codes and Finite Languages

At the first, we consider decomposition problems for the class of finite prefix codes. Then, subset X is a prefix code if no element of X is a proper prefix of another element in X. This is equivalent to the fact that there are no comparable words u v of the relation in the set X. That is for all words, v X , if u v , then u = v . For example, the set X = { b b , a b a } is prefix code. A convenient representation for the prefix code is a tree view.

Figure 3 shows the presentation of prefix code X = { b b , a b a } .

The bold lines present the words of code, the dotted lines present the words Σ * . A given code X can be associated a subtree of the literal representation of this code X. The infinite tree may present the free monoid Σ * . In this case the root of tree of relation over Σ * is drawing in Figure 3 as node 1.

Regular prefix codes are languages recognized by finite automata and such that no word is a prefix of another. The representation of the code is a deterministic finite automaton. Let two automaton A 1 , A 2 present two prefix codes X 1 , X 2 appropriately, then we can draw finite set of words X = X 1 X 2 same kind like Figure 3. In this case, tree presents code X consists of the words of prefix code X1 and then words of prefix code X2. It is easy to see that X is prefix code

Figure 3. Presentation of prefix code X.

too. The word of code X is the leaf of this tree. There is the only (unambiguous) path from root of this tree to the leafs for each word of code X. Then, deterministic automaton A, which presents code X, has a graph like in Figure 4. The words u 1 , u 2 , , u n , v 1 , v 2 , , v n are words of codes X 1 = { u 1 , u 2 , , u n } and X 2 = { v 1 , v 2 , , v n } , where n, m are the number of words in codes X1 and X2.

We denote the vertex V d i v i s o r of graph G = ( V , E ) of minimal deterministic automaton A and call divisor-border vertex V d i v i s o r of automaton A. The DB-vertex of complete minimal deterministic automaton A [11] of prefix code X is the vertex that we have to divided graph G by this vertex on several automaton A i . The automaton A i is automaton appropriately factors X i of code X = X 1 X 2 X k .

Theorem

A prefix code X = X 1 X 2 X k is divided on k prefix code X i if and only if a prefix code X has graph G of its minimal deterministic automaton A with ( k 1 ) DB-vertex.

Proof

If prefix code X = X 1 X 2 X k is divided on k prefix code X i , then the words of this code X present the unambiguous paths from root of the tree for prefix code X to the leaves of the tree. So, we can factorize this tree after the end of words each codes X i as the result of this operation is like Figure 4. Then, graph G of its minimal deterministic automaton A has ( k 1 ) DB-vertex.

If the graph G has ( k 1 ) DB-vertex, we can prove by mathematical induction. The theorem is obvious for =2. Suppose that it is true for k = n , then prove it for k = n + 1 . We can divide the graph G on last DB-vertex Vn on two graph Gn and G1. For graph Gn, we have presentation X 1 X 2 X n and just concatenate the words of automaton for graph G1. The results X = X 1 X 2 X n X n + 1 .

Example

For prime prefix code X 1 = { " a " , " b a " , " b b " } , X 2 = { " a a " , " b a " , " b b " } .

X = X 1 X 2 = { " a a a " , " a b a " , " a b b " , " b a a a " , " b a b a " , " b a b b " , " b b a a " , " b b b a " , " b b b b " }

We construct the automaton from the prefix code X.

Display(A);

| 1 2 3 4 5 6 7

---------------------------------------

a | 5 4 1 6 5 1 4

b | 5 7 1 3 5 5 4

Initial state: [ 2 ]

Accepting state: [ 1 ]

Figure 4. Graph of automaton A.

Figure 5 shows the automaton A with DB-vertex (state) V = 4.

The state V = 5 is a “dead” state of automaton A.

Figure 6 shows Graph G 1 and G 2 for automata A.

We decompose the prefix code X = X 1 X 2 to the product of two prefix codes X 1 , X 2 .

Now, we can formulate Theorem of decomposing for finite language L.

Theorem

A finite language L = L 1 L 2 L is divided on k finite languages L i if and only if a finite language L has graph G of its minimal deterministic automaton A with ( k 1 ) DB-vertex and the number of final state automaton A is equal one.

Proof

It is the same as the Theorem for prefix codex.

Algorithm for factorization prefixes code and finite language.

Input

An finite prefix set X = { u 1 , u 2 , , u n } .

Step 1

Built the minimal deterministic automaton A of prefix set X.

Step 2

Define the number n of DB-vertex in the graph of automaton A.

It can be do with complexity of O ( n max ( lengthofwordsinprefixset X ) ) .

Step 3

Cut the graph of automaton A and find n + 1 factors of prefix set X = X 1 X 2 X n X n + 1 .

There is algebraic approach to decomposing the finite language. The main idea is the embedding language L = { v 1 , v 2 , , v n } into free algebra of polynomials K [ x 1 , x 2 , , x n ] or set of formal series σ over semiring K. We can embed language L Σ * over alphabet Σ = { x 1 , x 2 , , x n } with homomorphism φ : Σ * K [ x 1 , x 2 , , x n ] by definition on letters of alphabet Σ:

φ ( x i ) = x i , i = 1 , , n

The homomorphism φ ( v i ) = p i maps the word v i L into monomials ˚ p i .

Figure 5. Automaton A with DB-vertex V = 4.

Figure 6. Graph G 1 and G 2 for automata A.

The set of monomials generate the ideal I L . Then, we can try to find the divisors of this ideal I L . We use an exhaustive search for the divisors just added monomials q d with less power to the ideal I L and apply Buchberger algorithm for this set to find the noncommutative Groebner basis. If the Groebner basis will be equals the added monomials q d , then we find this divisors for finite language L. This algorithm has exponential time, but it will be ended because finite alphabet S and finite language L.

4. Apply System GAP for Decomposition Prefix Codes and Finite Language

The computer discrete algebra system GAP is the system for Groups, Algorithms and Programming. The name GAP reflects the original aim of the system, but now GAP has many packages for wide field investigation of modern algebra. The discrete algebra system GAP has become somewhat broader, and contained the information about algorithms and programming for other algebraic structures, such as semigroups, algebras, discrete automata and many other things.

Below describes the data and structures used in packages “automata” for finite deterministic and nondeterministic automata [12] [13] and some functions to determine property about them.

We can easy create free monoid over alphabet Σ = { a , b } .

gap> m1:=FreeMonoid(["a","b"]);

gap> gm1:=GeneratorsOfMonoid(m1);

[ a, b ]

gap> a:=gm1[1];

a

gap> b:=gm1[2];

b

alphabet Σ = { a , b } .

It is easy to form the list of elements free monoid (associative words) and product of two lists:

gap> s1:=[a,b*a,b*b];

[ a, b*a, b^2 ]

gap> s2:=[a*a,b*a,b*b];

[ a^2, b*a, b^2 ]

gap> s3:=Mult_Sp(sp1,s2);

[ a^3, a*b*a, a*b^2, b*a^3, (b*a)^2, b*a*b^2, b^2*a^2, b^3*a, b^4 ]

Now, we transform the list of associative words to list of words:

gap> aw1:=AssocWord_Sp(s3, gm1);

[ "aaa", "aba", "abb", "baaa", "baba", "babb", "bbaa", "bbba", "bbbb" ]

Build the minimal deterministic automaton new3 from list of words:

new3:=ListOfWordsToAutomaton("ab",aw1);

gap> Display(new3);

| 1 2 3 4 5 6 7

------------------------------------------

a | 5 4 1 6 5 1 4

b | 5 7 1 3 5 5 4

Initial state: [ 2 ]

Accepting state: [ 1 ]

Then, we can draw the graph of automaton new3:

newS3:=DotStringForDrawingAutomaton(new3);

At the last, we discuss algebraic algorithm decomposing the finite set of words free monoid Σ*.

We can embedding the finite set of words set X = { u 1 , u 2 , , u n } in free semiring of polynomials as described earlier φ : Σ * K [ x 1 , x 2 , , x n ] and find the noncommutative Groebner bases in this ideal generate by polynomials from words { u 1 , u 2 , , u n } [14]. In our example, K is field of rational numbers.

gap> A1:=FreeAssociativeAlgebraWithOne(Rationals, "a", "b");;

gap> gA1:=GeneratorsOfAlgebraWithOne(A1);

[ (1)*a, (1)*b ]

gap> e:=One(A1);

(1)*

gap> a:=gA1[1];

(1)*a

gap> b:=gA1[2];

(1)*b

We defined the generators of free associative algebra with one.

Now, we use the list of associative words as the prefix code. The prefix code is the same as above example with automaton new3.

gap> s2:=[a*a,b*a,b*b];

[ a^2, b*a, b^2 ]

gap> s3:=[ a^3, a*b*a, a*b^2, b*a^3, (b*a)^2, b*a*b^2, b^2*a^2, b^3*a, b^4 ];;

gap> pL2:=[ (1)*a^3, (1)*a*b*a, (1)*a*b^2, (1)*b*a^3, (1)*(b*a)^2, (1)*b*a*b^2, (1)*b^2*a^2, (1)*b^3*a, (1)*b^4,(1)*a^2, (1)*b*a, (1)*b^2 ];;

gL2:=GP2NPList(pL2);

[ [ [ [ 1, 1, 1 ] ], [ 1 ] ], [ [ [ 1, 2, 1 ] ], [ 1 ] ], [ [ [ 1, 2, 2 ] ], [ 1 ] ], [ [ [ 2, 1, 1, 1 ] ], [ 1 ] ],

[ [ [ 2, 1, 2, 1 ] ], [ 1 ] ], [ [ [ 2, 1, 2, 2 ] ], [ 1 ] ], [ [ [ 2, 2, 1, 1 ] ], [ 1 ] ],

[ [ [ 2, 2, 2, 1 ] ], [ 1 ] ], [ [ [ 2, 2, 2, 2 ] ], [ 1 ] ], [ [ [ 1, 1 ] ], [ 1 ] ], [ [ [ 2, 1 ] ], [ 1 ] ],

[ [ [ 2, 2 ] ], [ 1 ] ] ][ [ [ 1, 2, 2 ], [ 1, 2, 1 ], [ 1, 1, 1 ], [ 1 ] ], [ 1, 1, 1, 1 ] ]

Function GP2NPList() transforms list of polynomials into internal presentation polynomials in the package gbnp . Now, we use the list gL2 for

gap> GB := Grobner(gL2);

#I number of entered polynomials is 12

#I number of polynomials after reduction is 3

#I End of phase I

#I End of phase II

#I List of todo lengths is [ 0 ]

#I End of phase III

#I The computation took 2 msecs.

[ [ [ [ 1, 1 ] ], [ 1 ] ], [ [ [ 2, 1 ] ], [ 1 ] ], [ [ [ 2, 2 ] ], [ 1 ] ] ]

gap> PrintNPList(GB);

a^2

ba

b^2

The result polynomials a 2 , b a , b 2 consists a basis of divisor for polynomials list [ a 3 , a b a , a b 2 , b a 3 , ( b a ) 2 , b a b 2 , b 2 a 2 , b 3 a , b 4 , a 2 , b a , b 2 ] . Thus, we obtain the decomposition of the prefix code into products of codes using the Buchberger algorithm. Similarly, factorization can be obtained for a finite language.

5. Conclusions

The linear time algorithm is designed, which finds the prime decomposition for prefix codes. It can be applied to some modifications for finite languages.

The algorithm for factorization of finite languages has exponential complexity. From the point of view of abstract algebra, we can apply Buchberger algorithm for an exhaustive search of the factors of the finite language.

The study was conducted using system for computational Discrete Algebra GAP. The results of decomposition of languages are obtained when considering modified graphs of this automaton.

There is a big gap between complexity factorization of prefix codes (lineal complexity) and finite languages (exponential complexity). Further research will related to the search for a polynomial algorithm to determine the class of languages for which such an algorithm is applicable.

Founding

This work is supported by a grant from the research program of Chinese universities “Higher Education Stability Support Program” (Section “Shenzhen 2022―Science, Technology and Innovation Commission of Shenzhen Municipality”).

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

References

[1] Aho, A. and Ullman, J. (1973) The Theory of Parsing, Translation, and Compiling. Vol. 1, Prentice Hall, Upper Saddle River.
[2] Brauer, W. (1984) Automation Theory: An Introduction to the Theory of Finite Automata. Vieweg + Teubner Verlag, Wiesbaden.
[3] Mateescu, A., Salomaa, A. and Yu, S. (1998) On the Decomposition of Finite Languages. Technical Report 222, Turku Centre for Computer Science, Turku.
[4] Lallement, G. (1979) Semigroups and Combinatorial Applications. Wiley & Sons, Inc., Hoboken, NJ, 376 p.
[5] Berstel, J. and Perrin, D. (2008) Theory of Codes. Academic Press, New York, 345 p.
[6] Mateescu, A., Salomaa, A. and Yu, S. (2002) Factorizations of Languages and Commutativity Conditions. Acta Cybernetica, 15, 339-351.
[7] Diestel, R. (2005) Graph Theory. 3rd Edition, Springer, Berlin, 421 p.
[8] Melnikov, B. (2018) Regular Languages and Nondeterministic Finite Automata. RGSU Publ., Moscow. (In Russian)
[9] Winberg, E.B. (2005) Course of Algebra. Factorial Press, Providence. (In Russian)
[10] Berstel, J. and Perrin, D. (2005) Codes and Automata. Springer, Berlin, 545 p.
[11] Melnikov, B. (2017) The Complete Finite Automaton. International Journal of Open Information Technologies, 5, 9-17.
[12] Melnikov, B. and Dolgov, V. (2022) Simplified Regular Languages and a Special Equivalence Relation on the Class of Regular Languages. Part I. International Journal of Open Information Technologies, 10, 12-20. (In Russian)
[13] Abramyan, M.E. (2021) Computing the Weight of Subtasks in State Minimization of Nondeterministic Finite Automata by the Branch and Bound Method. University Proceedings. Volga Region. Physical and Mathematical Sciences, 2, 46-52. (In Russian) https://doi.org/10.21685/2072-3040-2021-2-4
[14] Mora, T. (1994) An Introduction to Commutative and Noncommutative Griibner Bases. Theoretical Computer Science, 134, 131-173. https://doi.org/10.1016/0304-3975(94)90283-6

Copyright © 2024 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.