Optimization of Data Structure in Classification and Clustering Problems
1. Introduction
The paper proposes a new computational technology for solving the classification problem based on a new concept of object similarity, which is one of the fundamental concepts in machine learning, since it allows one to compare subsets of data in order to recognize objects of different classes [1]. Usually, the similarity of objects is assessed by the distance between them in a metric space. Here, objects of a finite set belonging to the same class are considered close in the value of a certain feature if these values are close enough to each other.
According to this concept, the center of the computational procedures is not the object as an element of a multidimensional feature space, but the object's feature value as an element of each feature vector. Therefore, the majority of calculations are performed for one-dimensional rather than multidimensional functions, which leads to a qualitative simplification of the algorithm.
We considered several options for implementing this approach, which differ in the way the structure of the data matrix is transformed. Of greatest interest is the method that reduces to splitting the values of each feature of the objects in the combined sample (consisting of the training and test samples) into the same number of intervals, which play the role of calculated parameters [2]. Lists of the training sample (TS) objects of the same class falling within these intervals were considered information granules [3]. Then the frequency of any feature value in a certain class is equal to the frequency of the corresponding granule, the frequency of an object in each class is equal to the average frequency of all its features in that class, and the class of an object corresponds to the maximum of these frequencies.
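As a rough illustration of this interval scheme, the following sketch classifies objects by average granule frequencies. The function name, the equal-width binning rule, and the handling of boundary values are assumptions for illustration, not the exact procedure of [2]:

```python
import numpy as np

def granule_classify(X_train, y_train, X_test, n_bins=10):
    """Classify by average per-feature granule (interval) frequencies.

    Illustrative sketch: each feature of the combined sample is split
    into n_bins equal-width intervals; the class whose granules an
    object falls into most frequently, on average, wins.
    """
    classes = np.unique(y_train)
    X_all = np.vstack([X_train, X_test])   # combined sample defines the intervals
    n_features = X_all.shape[1]

    # Interval edges per feature, computed on the combined sample.
    edges = [np.linspace(X_all[:, j].min(), X_all[:, j].max(), n_bins + 1)
             for j in range(n_features)]

    # freq[c, j, b]: relative frequency of class c in bin b of feature j.
    freq = np.zeros((len(classes), n_features, n_bins))
    for ci, c in enumerate(classes):
        Xc = X_train[y_train == c]
        for j in range(n_features):
            bins = np.clip(np.digitize(Xc[:, j], edges[j]) - 1, 0, n_bins - 1)
            for b in bins:
                freq[ci, j, b] += 1
        freq[ci] /= max(len(Xc), 1)

    y_pred = []
    for x in X_test:
        scores = np.zeros(len(classes))
        for j in range(n_features):
            b = int(np.clip(np.digitize(x[j], edges[j]) - 1, 0, n_bins - 1))
            scores += freq[:, j, b]
        # Average frequency over features; the maximum determines the class.
        y_pred.append(classes[np.argmax(scores / n_features)])
    return np.array(y_pred)
```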
Let us note that, as shown in [4], the above classification algorithm corresponds to the mechanism by which the receptors of the various sensory systems of a mammal process information received from the environment. The totality of these data is supplemented by previously obtained information and is generalized in the brain only at the last stage. Thus, the approach under consideration is bio-inspired. In this paper, a new version of this approach is developed [5] based on the use of ordered data.
2. Ordering of Feature Vectors
2.1. Some Properties of Ordered Features
Let us consider the training sample (TS) of a classification problem. Let $X = \{x_{ij}\}$ be the quantitative data matrix of the TS, where $M$ and $N$ are the numbers of objects and features, respectively, and $X_j = (x_{1j}, \dots, x_{Mj})$ is the vector of feature $j$. We will call the set of elements of the vector $X_j$ ordered if they have been renumbered and received new numbers $n = 1, \dots, M$ such that the corresponding values of this vector form a non-decreasing sequence $x_{1j} \le x_{2j} \le \dots \le x_{Mj}$. We will extend the term “ordered” to similar sets of quantities [6].
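In code, this ordering amounts to a simple per-feature argsort; a minimal sketch with a hypothetical feature vector:

```python
import numpy as np

# Toy feature vector (hypothetical values) with original object numbers 0..7.
x = np.array([0.7, 0.2, 0.9, 0.1, 0.5, 0.8, 0.3, 0.6])

order = np.argsort(x, kind="stable")   # new number n -> original object number
x_ordered = x[order]                   # non-decreasing sequence of feature values

# The ordered values form a non-decreasing sequence by construction.
assert np.all(np.diff(x_ordered) >= 0)
```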
By definition, the ordered elements of a vector are nearest neighbors by feature value. Note that nearest neighbor methods [7] are widely used in solving classification problems. However, these methods address the distribution of object features in a metric space, not changes in the data structure, as in the present article.
The numbering of the elements of the ordered vector $X_j$ has important peculiarities. Obviously, if the feature values of objects $i$ and $l$ satisfy $x_{ij} < x_{lj}$, then their ordered numbers satisfy $n_i < n_l$, and this relationship is preserved for objects of the same class. Therefore, the objects of a certain class can be identified not only by their ordered numbers but also by the sequence of these numbers $n_k, n_k + 1, \dots, n_k + M_k - 1$. Here $M_k$ is the length of class $k$, and the number $n_k$ corresponds to the minimum feature value for the objects of this class.
Let us illustrate the peculiarities of the numbering using the example of a vector $X_j$ for the case of two classes. Here the set of TS objects is the union of the subsets of objects belonging to class $k = 1$ and class $k = 2$, respectively. In the ordinal scale these subsets take the form of sequences of consecutive numbers $n$, so that the corresponding subvectors of $X_j$ describe, in this scale, the features of the objects of classes $k = 1$ and $k = 2$, respectively.
Note that in the case $x_{ij} = x_{lj}$ we obtain an ambiguous relation between the ordered numbers $n_i$ and $n_l$. But this circumstance does not affect the subsequent results, since both object numbers correspond to the same feature value.
As shown above, for any $j$ the values $x_{nj}$ form a non-decreasing sequence on the set of points $n = 1, \dots, M$. In other words, on this set there is defined a deterministic function $f_j$ such that $x_{nj} = f_j(n)$. This function describes the relationships between the classes and the features of the TS.
However, for objects of the same class, the distribution of the values of each feature over the set of original object numbers has many jumps that are close in magnitude to the range of the feature values. Therefore, for the original data there is no functional dependence between classes and features. For the ordered values this distribution is quite smooth, since the nearest neighbors by feature value are arranged consecutively on the set $n = 1, \dots, M$. It can be considered that, owing to the ordering, the complex chaotic relationship between the feature values and the class for objects of the same class is transformed into the deterministic function $f_j(n)$. This conclusion means that ordering the feature values reduces the level of uncertainty of the information and, accordingly, the information entropy of the data contained in the TS.
The data matrix updated by this structuring will be a set of $N$ ordered feature vectors. Compared to the original one, the new structure has an important advantage for solving the classification problem, since the functions $f_j(n)$ are defined on the set of its ordered features, which, as will be shown below, significantly simplifies the solution algorithm. Note that these functions exist for any TS, since their derivation did not require any assumptions or restrictions. Moreover, the structuring reduces to a simple sorting of the values of the individual TS features.
The new structure can be viewed as an independent version of the given data matrix. Apparently, for some databases and solution methods this version of the matrix may be preferable to the original one. In what follows, we restrict ourselves to solving the classification problem for the updated structure.
2.2. Classification of Structured Data
In the example above, the characteristics of the vector $X_j$ were specified for objects of the individual classes of the TS; they can be visualized on a plane by constructing diagrams whose horizontal axis corresponds to the ordered numbers $n$ and whose vertical axis corresponds to the values $x_{nj}$. To visualize the information contained in the TS, we consider similar scatter plots of the function $f_j(n)$ for some $j$.

Such diagrams are presented in the panels of Figure 1 for the Wine database for two of the features (left and right). Here the maximum value of $n$ is equal to the number of TS objects $M$, since each panel contains points for the objects of all classes. For clarity, the diagrams are constructed for the normalized values $\bar{x}_{nj} = (x_{nj} - x_{j,\min})/(x_{j,\max} - x_{j,\min})$, where the subscripts min and max correspond to the minimum and maximum of the values of feature $j$. Further we will assume that all features are normalized.
Figure 1. Graphs of the functions $f_j(n)$ for the Wine database for two of the features (left and right).
The diagrams show that the values of the corresponding feature of objects of the same class are represented by their own chain of points, which, with some exceptions, lie quite far from the points corresponding to other classes. These points are the nearest neighbors by feature value. Therefore, the following classification algorithm based on the new concept of similarity is proposed.
Let $\{Y_i\}$ be the set of vectors of normalized feature values of the test-sample objects, $Y_i = (y_{i1}, \dots, y_{iN})$. For each object $i$ of this set and each feature $j$, we find the value $k_{ij}$ equal to the class $k$ of the TS object whose ordered feature value $\bar{x}_{nj}$ lies in the $\varepsilon$-neighborhood of the value $y_{ij}$:

$$|y_{ij} - \bar{x}_{nj}| \le \varepsilon. \quad (1)$$

Here $\varepsilon$ is the proximity parameter, which characterizes the acceptable level of accuracy in the problem under consideration. Then the average frequency of class $k$ for object $i$ over all features is equal to

$$F_{ik} = \frac{1}{N} \sum_{j=1}^{N} [k_{ij} = k]. \quad (2)$$

The maximum of these frequencies determines the object's class:

$$k_i = \arg\max_{k} F_{ik}. \quad (3)$$

Calculations performed for the Iris and Wine [8] databases showed that the number of classification errors on the test sample for small values of $\varepsilon$ ranges from 0 to 2.
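The steps (1)–(3) above can be sketched as follows. This is an illustrative implementation only: the normalization on the training sample, the fractional voting among $\varepsilon$-neighbors, and the tie-breaking by first class are assumptions, not the authors' exact procedure:

```python
import numpy as np

def eps_classify(X_train, y_train, X_test, eps=0.05):
    """Per-feature epsilon-neighborhood voting, in the spirit of (1)-(3).

    Features are min-max normalized; for each test object and feature,
    training objects within eps of the test value vote for their class,
    and the votes are averaged over the features.
    """
    classes = np.unique(y_train)
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    Xtr = (X_train - lo) / span
    Xte = (X_test - lo) / span

    y_pred = []
    for y_obj in Xte:
        scores = np.zeros(len(classes))
        for j in range(Xtr.shape[1]):
            near = np.abs(Xtr[:, j] - y_obj[j]) <= eps      # inequality (1)
            if near.any():
                for ci, c in enumerate(classes):            # class frequency (2)
                    scores[ci] += np.mean(y_train[near] == c)
        # Maximum average frequency determines the class (3);
        # with no neighbors at all, the first class is returned by default.
        y_pred.append(classes[np.argmax(scores)])
    return np.array(y_pred)
```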
Let us note that, for the object features of any class of the combined sample, inequality (1) is fulfilled randomly, in particular due to errors of observation and measurement of the data. Considering that the discrete function $f_j(n)$ is monotone, we can quite simply reduce the influence of these errors and, accordingly, increase the accuracy of the solution by transforming this function into a continuous one using interpolation or approximation. However, issues of improving the proposed algorithm are beyond the scope of this article.
3. Properties of an Ordered Data Matrix
Consider the effect of ordering on the data matrix. It is obvious that the arrangement of the ordered elements $x_{nj}$ along the length of the TS depends on the distribution of the values of feature $j$. At the same time, the sets of these elements must still describe the given objects, represented by the rows of the data matrix $X$. Therefore, the ordering of the entire set of TS data is carried out for each of the features separately: the data matrix $X$ is mapped onto a set of $N$ data matrices $X^{(j)}$, $j = 1, \dots, N$, in which the rows of $X$ are rearranged so that the values of feature $j$ form a non-decreasing sequence.
The columns of the matrices $X$ and $X^{(j)}$ corresponding to the same feature consist of the same elements and differ only in the order of their arrangement. Likewise, the rows of these matrices differ only in the order of their arrangement, since each row represents the set of feature values of an individual object.
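The construction of the feature-ordered matrices can be illustrated with toy data (the values below are hypothetical, chosen only for illustration):

```python
import numpy as np

# Toy data matrix X: 4 objects (rows) x 2 features (columns).
X = np.array([[0.4, 0.9],
              [0.1, 0.3],
              [0.8, 0.1],
              [0.2, 0.6]])

# X_ordered[j]: rows of X rearranged so that feature j is non-decreasing.
X_ordered = [X[np.argsort(X[:, j], kind="stable")] for j in range(X.shape[1])]

# Each column of X_ordered[j] is a permutation of the same column of X,
# and feature j itself is sorted.
for j, Xj in enumerate(X_ordered):
    assert np.all(np.diff(Xj[:, j]) >= 0)
```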
Let the class $k$ of the TS object with number $i$, equal to the row number of the data matrix $X$, be given by the dependence $k = g(i)$. We will consider the properties of subsets of objects, called clusters $C_{jk}$, whose features are described by the rows $n$ of the matrix $X^{(j)}$ occupying the same positions as the rows of the objects of class $k$ in the matrix $X$. Notice that the clusters $C_{jk}$ and the classes $k$ have the same length. To assess the level of coincidence between the objects of class $k$ and the clusters $C_{jk}$, we find, averaged over all classes, the number of TS objects for which the dependence

$$g(i_n) = g(n) \quad (4)$$

is satisfied, where $i_n$ is the original number of the object occupying row $n$ of the matrix $X^{(j)}$.
The number of such coincidences, divided by the length of the set, is called the coincidence index $S_j$ of the feature $j$.
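A sketch of the coincidence-index computation, assuming, as for the databases discussed in the text, that the objects of each class occupy contiguous rows of the original matrix (the function name is chosen for illustration):

```python
import numpy as np

def coincidence_index(X, y):
    """Coincidence index S_j per feature: the fraction of row positions
    at which the class of the feature-ordered matrix matches the class
    at the same position of the original matrix X.

    Assumes objects of each class occupy contiguous rows of X.
    """
    M, N = X.shape
    S = np.zeros(N)
    for j in range(N):
        order = np.argsort(X[:, j], kind="stable")   # rows of the ordered matrix
        S[j] = np.mean(y[order] == y)                # dependence (4), averaged
    return S
```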
An analysis of the index was performed for 10 databases [9]. The calculations showed that for a third of the databases considered the maximum index value exceeds 0.9, 0.7, or 0.5; for one of the databases it reaches 0.961, and for another a high value is attained for all features $j$. From the results obtained, it follows that the classes $k$ and the clusters $C_{jk}$ split the set of objects into subsets which partially (in many cases) or almost completely (in some cases) consist of the same objects.
The result obtained is unexpected, but to a certain extent it corresponds to ideas about the role of order in nature and allows us to penetrate deeper into the essence of the concept of a class of a set [10]. As is known, classes are subsets of objects that have common properties, and these properties differ between subsets. The division of a set into classes is performed by specialists in a certain field of knowledge based on the analysis of some common properties of the objects, for example, those related to cost, health, or the quality of products or services. But when calculating the index, we did not take the specifics of these properties into account.
This conclusion indicates that structuring by ordering the features makes it possible to identify the relationship between classes and features in the clustering problem. A similar relationship, in the form of a functional dependence, was established in the previous section for the classification problem.
For each $j$, dependence (4) determines the numbers of the objects common to class $k$ and cluster $C_{jk}$. Their feature description for each $j$ is represented by the corresponding rows of the matrix $X^{(j)}$. Thus, all objects of a cluster $C_{jk}$ are nearest neighbors by feature $j$ and, according to the similarity hypothesis, have common properties with respect to this feature. Since this situation occurs for all $j$, an object of the cluster has certain properties that distinguish it from the objects of other clusters. It can be assumed that these properties are close to the properties of the objects of class $k$. Note that the wide range of the index values $S_j$ is partly caused by measurement errors and by the selection of the features characterizing the properties of the class objects.
4. Conclusions
The paper develops a new concept for solving classification and clustering problems, based on transforming the structure of the original data by ordering the feature vectors. This concept is bio-inspired.
It has been established that the ordering of features leads to a decrease in the entropy of the feature distributions, which makes it possible to detect functional dependencies of the object classes on the feature values. When these dependencies are used, the algorithm for solving the classification problem is qualitatively simplified.
The updated data structure can serve as the basis for a new type of neural networks, in which the functional dependencies obtained in the article are used to simplify and speed up training.
It is shown that, by ordering the features, one can find a large number of options for partitioning the set of objects into clusters that are close in object composition to the corresponding classes.