Legendre Polynomial Kernel: Application in SVM

Abstract

In machine learning, the Support Vector Machine (SVM) is a classification method. For non-linearly separable data, kernel functions are a basic ingredient of the SVM technique. In this paper, we briefly recall some useful results on the decomposition of RKHSs. Based on orthogonal polynomial theory and the Mercer theorem, we construct the high power Legendre polynomial kernel on the cube [-1,1]d. After presenting the theoretical background of SVM, we evaluate the performance of this kernel on some illustrative examples in comparison with the Rbf, linear and polynomial kernels.

Share and Cite:

Rebei, H. and Alharbi, N. (2022) Legendre Polynomial Kernel: Application in SVM. Journal of Applied Mathematics and Physics, 10, 1732-1747. doi: 10.4236/jamp.2022.105121.

1. Introduction

The reproducing kernel approach has many applications in probability theory, statistics and, more recently, machine learning (see [1] [2] [3]). Kernels have been applied in many fields of the life sciences, such as pattern recognition, biology, medical diagnosis, chemistry and bio-informatics.

It has long been known that data sets can be modeled by a family of points $\mathcal{X}\subset {ℝ}^{d}$, where similarity between points is measured by an inner product on ${ℝ}^{d}$. The classification problem consists of separating these points into classes with respect to given properties. In the simplest situation, the points are linearly separable, in the sense that there exists a hyperplane ${H}_{s}$ separating the two classes.

Support vector machines (SVMs) have become a very powerful tool in machine learning, in particular for classification and regression problems ( [1] [4] [5]). A classification problem may involve only two classes of data, in which case we speak of binary classification, or more than two classes, which is called a multi-class classification problem. In what follows, we focus on binary classification and the application of kernel functions in SVM classification.

In the simplest situations, the linear SVM technique finds the optimal hyperplane separating the points into two classes. Unfortunately, in concrete examples these points are often not linearly separable. One can then express the similarity of points in terms of a positive definite kernel function k on $\mathcal{X}$. This leads to a technique called non-linear SVM, which combines linear SVM with kernel tools.

Mathematically, this approach to non-linear SVM is related to the Kolmogorov representation $\left(\mathcal{H},\Phi \right)$, where $\mathcal{H}$ is the feature space and $\Phi$ is the feature map (see [1]). Since separation expresses a degree of similarity between points in the same class, Vapnik used the kernel approach to translate the problem from the initial space ${ℝ}^{d}$ to the feature space. The transfer between the initial space and the feature space is made through the feature map, and similarities are expressed by the inner product in the feature space, which is given by a kernel k. The crucial idea of kernel methods is that non-linearly separable points can be transformed into linearly separable points in the feature space while preserving similarities, which is expressed by

$〈\Phi \left({x}_{i}\right),\Phi \left({x}_{j}\right)〉=k\left({x}_{i},{x}_{j}\right).$

Furthermore, the solution of the classification problem using non-linear SVM is given by the decision function, which depends only on the kernel and the support vectors ( ${x}_{i}\in S$ ). Precisely, it takes the following form

$F\left(x\right)=sgn\left(b+\underset{{x}_{i}\in S}{\sum }\text{ }\text{ }{\alpha }_{i}k\left({x}_{i},x\right)\right).$

Since the choice of the kernel function is not canonical, the first problem of this approach is how to choose a suitable kernel function for given data points. The second problem is that the feature space may be infinite-dimensional, which causes a technical difficulty when designing a computational algorithm for the solution. This problem is also related to the feature map $\Phi$, which is in general unknown. In our case, we use the Mercer decomposition theorem to obtain a polynomial-type kernel with a good separation property. This means that the infinite-dimensional feature space can be approximated by a finite-dimensional one, which reduces the dimensionality of the transformed data in the feature space.

The paper is organized as follows: In Section 2, we recall some known results on Reproducing Kernel Hilbert Spaces (RKHS in what follows), in particular the Mercer decomposition theorem 2.2 and the high power kernel theorem 2.3. Section 3 is devoted to the main result, in which we introduce the one-dimensional Legendre polynomial kernel ${K}_{n}$ and give its canonical decomposition in terms of Legendre orthogonal polynomials (see Theorem 3.1). Next, we deduce the high power Legendre polynomial kernel (denoted again ${K}_{n}$ ) defined on the cube ${\left[-1,1\right]}^{d}$, for which the feature space ${H}_{n}={\mathcal{H}}_{n}^{\otimes d}$ is the tensor product Hilbert space of the RKHS associated with the one-dimensional kernel. In Section 4, we recall the theoretical foundation of linear and non-linear SVMs. In Section 5, we give some illustrative examples in order to evaluate the performance of the Legendre polynomial kernel in comparison with some predefined kernels in Python.

2. Preliminaries

In this section, we begin by describing the RKHS and its associated kernel. Then we give the decomposition of a given positive definite kernel on a measure space $\mathcal{X}$ (see Theorem 2.2).

Definition 1. (Positive definite kernel).

Let $\mathcal{X}$ be a nonempty set. A symmetric function $k:\mathcal{X}×\mathcal{X}\to ℝ$ is called a positive definite kernel if

$\underset{i=1}{\overset{n}{\sum }}\text{\hspace{0.17em}}\underset{j=1}{\overset{n}{\sum }}\text{ }\text{ }{a}_{i}{a}_{j}k\left({x}_{i},{x}_{j}\right)\ge 0\text{\hspace{0.17em}}\text{\hspace{0.17em}}\forall n\ge 1,\text{\hspace{0.17em}}{a}_{1},\cdots ,{a}_{n}\in ℝ,\text{\hspace{0.17em}}{x}_{1},\cdots ,{x}_{n}\in \mathcal{X}$ (2.1)

and for mutually distinct ${x}_{i}$, the equality holds only when all the ${a}_{i}$ are zero.

Clearly, every inner product is a positive definite kernel. Moreover, if $\mathcal{H}$ is any Hilbert space, $\mathcal{X}$ a nonempty set and $\varphi :\mathcal{X}\to \mathcal{H}$, then the function $h\left(x,y\right):=〈\varphi \left(x\right),\varphi \left(y\right)〉$ is positive definite.
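This last fact is easy to check numerically. The sketch below uses an arbitrarily chosen feature map $\varphi$ (a hypothetical example, not one used elsewhere in the paper) and verifies that the resulting Gram matrix is symmetric and positive semi-definite:

```python
import numpy as np

# Any feature map phi: X -> H induces the kernel h(x, y) = <phi(x), phi(y)>.
# The map below is an arbitrary illustrative choice.
def phi(x):
    return np.array([1.0, x, x**2])

def h(x, y):
    return phi(x) @ phi(y)

xs = np.linspace(-1.0, 1.0, 7)
G = np.array([[h(a, b) for b in xs] for a in xs])  # Gram matrix of h
```

Since the feature space here is 3-dimensional, the Gram matrix of 7 points has rank at most 3, so h is positive semi-definite rather than strictly positive definite.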

Now let $\mathcal{H}$ be a Hilbert space of functions mapping from some non-empty set $\mathcal{X}$ to $ℝ$, i.e., $\mathcal{H}$ is considered as a subset of ${ℝ}^{\mathcal{X}}$. We write the inner product on $\mathcal{H}$ as

$〈f,g〉,\text{ }f,g\in \mathcal{H}$

and the associated norm will be denoted by

${‖f‖}_{2}={〈f,f〉}^{\frac{1}{2}},\text{ }f\in \mathcal{H}.$

We may alternatively write the function f as $f\left(\cdot \right)$, to indicate it takes an argument in $\mathcal{X}$.

Definition 2. (Reproducing kernel Hilbert space).

Let $\left(\mathcal{H},〈\cdot ,\cdot 〉\right)$ be a Hilbert space of $ℝ$ -valued functions defined on a non-empty set $\mathcal{X}$. We say that $\mathcal{H}$ is a Reproducing Kernel Hilbert Space if there exists a kernel function $k:\mathcal{X}×\mathcal{X}\to ℝ$ such that the following properties are satisfied:

1) $\forall x\in \mathcal{X}$, ${k}_{x}:=k\left(.,x\right)\in \mathcal{H}$,

2) $\forall x\in \mathcal{X}$, $\forall f\in \mathcal{H}$, $f\left(x\right)=〈{k}_{x},f〉$ (the reproducing property).

Remark 1. From the reproducing property, we note that

$k\left(x,y\right)={k}_{y}\left(x\right)=〈{k}_{x},{k}_{y}〉=〈k\left(.,x\right),k\left(.,y\right)〉,\text{\hspace{0.17em}}x,y\in \mathcal{X}.$ (2.2)

Thus, necessarily, k is positive definite. Note also that the reproducing kernel k associated with $\mathcal{H}$ is unique.

Theorem 2.1. (Moore-Aronszajn [6])

Let $k:\mathcal{X}×\mathcal{X}\to ℝ$ be a positive definite kernel. There is a unique RKHS $\mathcal{H}\subset {ℝ}^{\mathcal{X}}$ with reproducing kernel k. Moreover, if the space

${\mathcal{H}}_{0}=span\left[{\left\{{k}_{x}\right\}}_{x\in \mathcal{X}}\right]$ (2.3)

is endowed with the inner product

${〈f,g〉}_{{\mathcal{H}}_{0}}=\underset{i=1}{\overset{n}{\sum }}\text{\hspace{0.17em}}\underset{j=1}{\overset{m}{\sum }}\text{ }\text{ }{\alpha }_{i}{\beta }_{j}k\left({x}_{i},{y}_{j}\right),$ (2.4)

where $f={\sum }_{i=1}^{n}\text{ }\text{ }{\alpha }_{i}{k}_{{x}_{i}}$ and $g={\sum }_{j=1}^{m}\text{ }\text{ }{\beta }_{j}{k}_{{y}_{j}}$, then $\mathcal{H}$ is the completion of ${\mathcal{H}}_{0}$ w.r.t the inner product (2.4).
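The inner product (2.4) and the reproducing property are easy to illustrate numerically. The sketch below uses the Gaussian kernel as a concrete positive definite choice; the centers and coefficients are arbitrary:

```python
import numpy as np

def k(x, y, gamma=1.0):
    # Gaussian kernel: a concrete positive definite choice
    return np.exp(-gamma * (x - y) ** 2)

# f = sum_i alpha_i k(., x_i)  and  g = sum_j beta_j k(., y_j) in H_0
alpha, Xs = np.array([0.5, -1.0]), np.array([0.0, 1.0])
beta, Ys = np.array([2.0]), np.array([0.3])

def inner(a, P, b, Q):
    # inner product (2.4) on H_0
    return sum(a[i] * b[j] * k(P[i], Q[j])
               for i in range(len(a)) for j in range(len(b)))

norm_f2 = inner(alpha, Xs, alpha, Xs)   # ||f||^2 >= 0
norm_g2 = inner(beta, Ys, beta, Ys)     # ||g||^2 >= 0
fg = inner(alpha, Xs, beta, Ys)         # <f, g>

# reproducing property: <k_x, f> = f(x)
x0 = 0.3
f_x0 = sum(alpha[i] * k(x0, Xs[i]) for i in range(len(alpha)))
kx_f = inner(np.array([1.0]), np.array([x0]), alpha, Xs)
```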

Now let us consider a compact metric space $\mathcal{X}$ and a finite Borel measure $\nu$ on it. We consider a positive definite continuous kernel on $\mathcal{X}$. Let ${S}_{k}$ be the linear map

${S}_{k}:{L}_{2}\left(\mathcal{X};\nu \right)\to \mathcal{C}\left(\mathcal{X}\right),$

$\left({S}_{k}f\right)\left(x\right)={\int }_{\mathcal{X}}\text{ }\text{ }k\left(x,y\right)f\left(y\right)\text{d}\nu \left( y \right)$

and ${T}_{k}={I}_{k}\circ {S}_{k}$, where ${I}_{k}:\mathcal{C}\left(\mathcal{X}\right)\to {L}_{2}\left(\mathcal{X};\nu \right)$ is the canonical injection. Then we have the following results:

Theorem 2.2. (Mercer’s theorem [6])

1) ${T}_{k}$ is a compact, self-adjoint, positive definite operator from ${L}_{2}\left(\mathcal{X};\nu \right)$ to itself. It is called the integral operator corresponding to the kernel k.

2) There exists an orthonormal system ${\left\{{\stackrel{˜}{e}}_{j}\right\}}_{j\in J}$ of eigenvectors of ${T}_{k}$ with eigenvalues ${\left\{{\lambda }_{j}\right\}}_{j\in J}$, where J is an at most countable set of indices corresponding to the strictly positive eigenvalues of ${T}_{k}$, such that

${T}_{k}\left(f\right)=\underset{j\in J}{\sum }\text{ }\text{ }{\lambda }_{j}〈{\stackrel{˜}{e}}_{j},f〉{\stackrel{˜}{e}}_{j},\text{ }f\in {L}_{2}\left(\mathcal{X},\nu \right).$ (2.5)

3) For all $x,y\in \mathcal{X}$

$k\left(x,y\right)=\underset{j\in J}{\sum }\text{ }\text{ }{\lambda }_{j}{e}_{j}\left(x\right){e}_{j}\left(y\right),$ (2.6)

where ${e}_{j}={\lambda }_{j}^{-1}{S}_{k}{\stackrel{˜}{e}}_{j}\in \mathcal{C}\left(\mathcal{X}\right)$ and where the convergence is uniform on $\mathcal{X}×\mathcal{X}$, and absolute for each pair $\left(x,y\right)\in \mathcal{X}×\mathcal{X}$.

4) The Hilbert space

$\mathcal{H}=\left\{f=\underset{j\in J}{\sum }\text{ }\text{ }{a}_{j}{e}_{j}:\left(\frac{{a}_{j}}{\sqrt{{\lambda }_{j}}}\right)\in {\mathcal{l}}^{2}\left(J\right)\right\}$

endowed with the inner product

${〈\underset{j\in J}{\sum }\text{ }\text{ }{a}_{j}{e}_{j},\underset{j\in J}{\sum }\text{ }\text{ }{b}_{j}{e}_{j}〉}_{\mathcal{H}}=\underset{j\in J}{\sum }\frac{{a}_{j}{b}_{j}}{{\lambda }_{j}}$ (2.7)

is the RKHS associated with k.
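The Mercer decomposition can be approximated numerically by discretizing the integral operator ${T}_{k}$ on a uniform grid (a Nyström-type sketch; the Gaussian kernel, the grid size and the truncation level are illustrative choices):

```python
import numpy as np

# Discretize T_k f(x) = \int k(x, y) f(y) dnu(y) on a uniform grid of [-1, 1]
# with nu = Lebesgue measure, then check the truncated Mercer expansion
# k(x, y) ~ sum_j lambda_j e_j(x) e_j(y).
N = 200
xs = np.linspace(-1, 1, N)
w = 2.0 / N                                    # quadrature weight
K = np.exp(-np.subtract.outer(xs, xs) ** 2)    # Gaussian kernel, gamma = 1
lam, V = np.linalg.eigh(K * w)                 # eigenpairs of the discretized T_k
lam, V = lam[::-1], V[:, ::-1]                 # sort eigenvalues descending
E = V / np.sqrt(w)                             # e_j(x_i) ~ V[i, j] / sqrt(w)

m = 25                                         # keep only the top-m eigenpairs
K_hat = (E[:, :m] * lam[:m]) @ E[:, :m].T      # truncated Mercer sum
err = np.abs(K - K_hat).max()
```

For a smooth kernel such as the Gaussian, the eigenvalues decay very quickly, so a few terms of the Mercer sum already reconstruct the kernel to high accuracy.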

Theorem 2.3. (Power of kernel [4] [7])

Let k be a kernel on $\mathcal{X}$. Then

${K}_{d}\left(x,y\right):=\underset{i=1}{\overset{d}{\prod }}\text{ }\text{ }k\left({x}_{i},{y}_{i}\right),\text{\hspace{0.17em}}x=\left({x}_{1},\cdots ,{x}_{d}\right),y=\left({y}_{1},\cdots ,{y}_{d}\right)\in {\mathcal{X}}^{d}$

is a kernel on ${\mathcal{X}}^{d}$. In addition, there is an isometric isomorphism between ${\mathcal{H}}_{{K}_{d}}$ and the Hilbert space tensor product ${\mathcal{H}}_{k}^{\otimes d}$.

Example 1. Here we list some of the most commonly used kernels in SVM.

1) The linear kernel: ${k}_{lin}\left(x,y\right)=〈x,y〉+c$, $c\ge 0$.

2) The polynomial kernel: ${k}_{poly}\left(x,y\right)={\left(a〈x,y〉+b\right)}^{m}$, $a>0$, $b\ge 0$, $m\ge 1$.

3) The exponential kernel: ${k}_{exp}\left(x,y\right)={\text{e}}^{\sigma 〈x,y〉}$, $\sigma >0$.

4) The radial basis function (Rbf): ${k}_{rbf}\left(x,y\right)={\text{e}}^{-\gamma {‖x-y‖}^{2}}$, $\gamma >0$. It is also called the Gaussian kernel.

5) The sigmoid kernel: ${k}_{sigm}\left(x,y\right)=\mathrm{tanh}\left(〈x,y〉+c\right)$, $c\ge 0$.
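These five kernels are straightforward to implement; the sketch below uses arbitrary parameter defaults:

```python
import numpy as np

def k_lin(x, y, c=0.0):                # linear kernel, c >= 0
    return x @ y + c

def k_poly(x, y, a=1.0, b=1.0, m=3):   # polynomial kernel, a > 0, b >= 0
    return (a * (x @ y) + b) ** m

def k_exp(x, y, sigma=1.0):            # exponential kernel, sigma > 0
    return np.exp(sigma * (x @ y))

def k_rbf(x, y, gamma=1.0):            # Rbf (Gaussian) kernel, gamma > 0
    return np.exp(-gamma * np.sum((x - y) ** 2))

def k_sigm(x, y, c=0.0):               # sigmoid kernel, c >= 0
    return np.tanh(x @ y + c)

x, y = np.array([1.0, -0.5]), np.array([0.25, 2.0])
```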

3. Legendre Polynomial Kernel and High Power Legendre Polynomial Kernel

Let us consider the family of Legendre polynomials ${\left\{{L}_{n}\right\}}_{n\ge 0}$ defined by the following recurrence relation (see [8])

$\left\{\begin{array}{l}{L}_{0}\left(x\right)=1,\text{ }\text{\hspace{0.17em}}{L}_{1}\left(x\right)=x,\\ {L}_{n+1}\left(x\right)=x{L}_{n}\left(x\right)-{w}_{n}{L}_{n-1}\left(x\right),\text{\hspace{0.17em}}n\ge 1,\end{array}$ (3.1)

where

${w}_{n}=\frac{{n}^{2}}{\left(2n-1\right)\left(2n+1\right)},\text{\hspace{0.17em}}n\ge 1.$ (3.2)

It is well known [8] that these polynomials are orthogonal w.r.t the inner product

$〈f,g〉:={\int }_{-1}^{1}\text{ }\text{ }f\left(x\right)g\left(x\right)\text{d}x.$

Moreover, since ${‖{L}_{0}‖}^{2}=2$, we have

${‖{L}_{n}‖}^{2}=2{w}_{1}\cdots {w}_{n},\text{\hspace{0.17em}}n\ge 1.$ (3.3)

In order to apply the Mercer theorem in SVM, we consider the Legendre polynomial kernel defined on ${\left[-1,1\right]}^{2}$ by

${K}_{n}\left(x,y\right)=\underset{k=0}{\overset{n}{\sum }}\text{ }\text{ }{L}_{k}\left(x\right){L}_{k}\left(y\right).$ (3.4)
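A minimal sketch of the recurrence (3.1)-(3.2) and of the kernel (3.4), with a numerical check of the orthogonality of the ${L}_{k}$ on a fine grid (the grid size is an arbitrary choice):

```python
import numpy as np

def legendre(n, x):
    """Values [L_0(x), ..., L_n(x)] from the recurrence (3.1)-(3.2)."""
    x = np.asarray(x, dtype=float)
    L = [np.ones_like(x), x]
    for m in range(1, n):
        w = m**2 / ((2*m - 1) * (2*m + 1))     # (3.2)
        L.append(x * L[m] - w * L[m - 1])
    return np.array(L[: n + 1])

def K(n, x, y):
    """One-dimensional Legendre polynomial kernel (3.4)."""
    return np.sum(legendre(n, x) * legendre(n, y), axis=0)

# trapezoidal rule on a fine grid, to check orthogonality numerically
xs = np.linspace(-1.0, 1.0, 2001)
vals = legendre(3, xs)
dx = xs[1] - xs[0]

def integral(f):
    return dx * (f.sum() - 0.5 * (f[0] + f[-1]))
```

For instance, the recurrence gives ${L}_{2}\left(x\right)={x}^{2}-1/3$, and the numerical integrals confirm that distinct ${L}_{k}$ are orthogonal on $\left[-1,1\right]$.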

Theorem 3.1. Let

${\mathcal{H}}_{n}=\left\{f=\underset{k=0}{\overset{n}{\sum }}\text{ }\text{ }{\alpha }_{k}{L}_{k},{\alpha }_{k}\in ℝ\right\}$ (3.5)

be the linear space generated by this family ${\left\{{L}_{k}\right\}}_{0\le k\le n}$. Then

1) ${\mathcal{H}}_{n}$ is a Hilbert space endowed with the inner product

$〈\underset{i=0}{\overset{n}{\sum }}\text{ }\text{ }{\alpha }_{i}{L}_{i},\underset{j=0}{\overset{n}{\sum }}\text{ }\text{ }{\beta }_{j}{L}_{j}〉=\underset{j=0}{\overset{n}{\sum }}\text{ }\text{ }{\alpha }_{j}{\beta }_{j}$ (3.6)

2) The sequence ${\left\{{L}_{k}\right\}}_{0\le k\le n}$ is an orthonormal basis of ${\mathcal{H}}_{n}$.

3) ${\mathcal{H}}_{n}$ is a RKHS with reproducing kernel ${K}_{n}$.

4) The feature map associated with ${K}_{n}$ is

${\varphi }_{n}:\left[-1,1\right]\to {\mathcal{H}}_{n},\text{\hspace{0.17em}}t↦{\varphi }_{n}\left(t\right)=\underset{j=0}{\overset{n}{\sum }}\text{ }\text{ }{L}_{j}\left(t\right){L}_{j}.$

Proof. First: Clearly, (3.6) defines an inner product on ${\mathcal{H}}_{n}$, for which ${\mathcal{H}}_{n}$ is a Hilbert space since it is finite-dimensional.

Second: From the definition of the inner product (3.6), it is clear that ${\left\{{L}_{k}\right\}}_{0\le k\le n}$ is an orthonormal system of ${\mathcal{H}}_{n}$ whose cardinality coincides with the dimension of ${\mathcal{H}}_{n}$. Thus it is an orthonormal basis of ${\mathcal{H}}_{n}$.

Third: Let us consider the operator ${S}_{n}$ defined on ${\mathcal{H}}_{L}:={L}^{2}\left(\left[-1,1\right]\right)$ by

${S}_{n}:f↦{\int }_{-1}^{1}\text{ }\text{ }{K}_{n}\left(x,y\right)f\left(y\right)\text{d}y\in \mathcal{C}\left(\left[-1,1\right]\right).$

Set ${T}_{n}={I}_{n}\circ {S}_{n}$, where ${I}_{n}$ is the canonical injection of $\mathcal{C}\left(\left[-1,1\right]\right)$ into ${\mathcal{H}}_{L}$.

1) ${K}_{n}$ is positive definite. In fact

$\begin{array}{c}\underset{1\le i,j\le m}{\sum }\text{ }\text{ }{a}_{i}{a}_{j}{K}_{n}\left({x}_{i},{x}_{j}\right)=\underset{1\le i,j\le m}{\sum }\text{ }\text{ }{a}_{i}{a}_{j}\underset{k=0}{\overset{n}{\sum }}\text{ }\text{ }{L}_{k}\left({x}_{i}\right){L}_{k}\left({x}_{j}\right)\\ =\underset{k=0}{\overset{n}{\sum }}\text{\hspace{0.17em}}\underset{1\le i,j\le m}{\sum }\text{ }\text{ }{a}_{i}{a}_{j}{L}_{k}\left({x}_{i}\right){L}_{k}\left({x}_{j}\right)\\ =\underset{k=0}{\overset{n}{\sum }}{\left(\underset{i=1}{\overset{m}{\sum }}\text{ }\text{ }{a}_{i}{L}_{k}\left({x}_{i}\right)\right)}^{2}\ge 0.\end{array}$

2) ${T}_{n}$ is symmetric since ${K}_{n}$ is symmetric, so by the Mercer theorem 2.2, ${T}_{n}$ is compact, self-adjoint and positive definite.

3) For all $j=0,1,\cdots ,n$, we have

$\begin{array}{c}{T}_{n}\left({L}_{j}\right)\left(x\right)={\int }_{-1}^{1}\text{ }\text{ }{K}_{n}\left(x,y\right){L}_{j}\left(y\right)\text{d}y\\ ={\int }_{-1}^{1}\left(\underset{k=0}{\overset{n}{\sum }}\text{ }\text{ }{L}_{k}\left(x\right){L}_{k}\left(y\right)\right){L}_{j}\left(y\right)\text{d}y\\ =\underset{k=0}{\overset{n}{\sum }}\text{ }\text{ }{\int }_{-1}^{1}\text{ }\text{ }{L}_{k}\left(x\right){L}_{k}\left(y\right){L}_{j}\left(y\right)\text{d}y\\ =\underset{k=0}{\overset{n}{\sum }}\text{ }\text{ }{L}_{k}\left(x\right){\int }_{-1}^{1}\text{ }\text{ }{L}_{k}\left(y\right){L}_{j}\left(y\right)\text{d}y\\ =\underset{k=0}{\overset{n}{\sum }}\text{ }\text{ }{L}_{k}\left(x\right){‖{L}_{j}‖}^{2}{\delta }_{k,j}\\ ={‖{L}_{j}‖}^{2}{L}_{j}\left(x\right).\end{array}$

So the eigenvectors of ${T}_{n}$ are exactly the polynomials ${L}_{j}$ and the corresponding eigenvalues are ${\lambda }_{j}={‖{L}_{j}‖}^{2}$. Thus $\left\{{L}_{j},\text{\hspace{0.17em}}0\le j\le n\right\}$ is an orthonormal basis of ${\mathcal{H}}_{n}$ formed by eigenvectors of ${T}_{n}$.

From the Mercer theorem 2.2, the RKHS associated with ${K}_{n}$ is given by

${\mathcal{H}}_{{K}_{n}}=\left\{f=\underset{j=0}{\overset{n}{\sum }}\text{ }\text{ }{\alpha }_{j}{L}_{j},\frac{{\alpha }_{j}}{\sqrt{{\lambda }_{j}}}\in {l}^{2}\right\}$

Since the condition $\frac{{\alpha }_{j}}{\sqrt{{\lambda }_{j}}}\in {l}^{2}$ is automatically satisfied for every finite sequence $\left({\alpha }_{j}\right)$, the space ${\mathcal{H}}_{{K}_{n}}$ coincides with ${\mathcal{H}}_{n}$ given by (3.5). Thus ${\mathcal{H}}_{n}$ is the RKHS associated with the reproducing kernel ${K}_{n}$.

The reproducing property is immediate from the Mercer theorem, but it can also be checked by hand using (3.6):

$〈{K}_{n}\left(x,\cdot \right),f\left(\cdot \right)〉=〈\underset{k=0}{\overset{n}{\sum }}\text{ }\text{ }{L}_{k}\left(x\right){L}_{k}\left(\cdot \right),\underset{j=0}{\overset{n}{\sum }}\text{ }\text{ }{\alpha }_{j}{L}_{j}\left(\cdot \right)〉=\underset{k=0}{\overset{n}{\sum }}\text{ }\text{ }{\alpha }_{k}{L}_{k}\left(x\right)=f\left(x\right).$

Fourth: Let ${\varphi }_{n}$ be given by

${\varphi }_{n}:\left[-1,1\right]\to {\mathcal{H}}_{n},\text{\hspace{0.17em}}t↦{\varphi }_{n}\left(t\right)=\underset{j=0}{\overset{n}{\sum }}\text{ }\text{ }{L}_{j}\left(t\right){L}_{j}\left(\cdot \right).$

So from (3.6), we get

$〈{\varphi }_{n}\left(s\right),{\varphi }_{n}\left(t\right)〉=〈\underset{i=0}{\overset{n}{\sum }}\text{ }\text{ }{L}_{i}\left(s\right){L}_{i},\underset{j=0}{\overset{n}{\sum }}\text{ }\text{ }{L}_{j}\left(t\right){L}_{j}〉=\underset{j=0}{\overset{n}{\sum }}\text{ }\text{ }{L}_{j}\left(s\right){L}_{j}\left(t\right)={K}_{n}\left(s,t\right).$

Now let us consider the set $\mathcal{X}={\left[-1,1\right]}^{d}$. Define the high power Legendre polynomial kernel function ${K}_{n}:\mathcal{X}×\mathcal{X}\to ℝ$ (we keep the same symbol ${K}_{n}$, by abuse of notation) as follows:

${K}_{n}\left(x,y\right)=\underset{i=1}{\overset{d}{\prod }}\text{ }\text{ }{K}_{n}\left({x}_{i},{y}_{i}\right)=\underset{i=1}{\overset{d}{\prod }}\text{ }\text{ }\underset{k=0}{\overset{n}{\sum }}\text{ }\text{ }{L}_{k}\left({x}_{i}\right){L}_{k}\left({y}_{i}\right),\text{\hspace{0.17em}}\text{ }x=\left({x}_{1},\cdots ,{x}_{d}\right),\text{\hspace{0.17em}}y=\left({y}_{1},\cdots ,{y}_{d}\right).$ (3.7)

Let us introduce ${H}_{n}={\mathcal{H}}_{n}^{\otimes d}=\underset{d\text{-times}}{\underset{︸}{{\mathcal{H}}_{n}\otimes \cdots \otimes {\mathcal{H}}_{n}}}$, called the d-fold tensor product of ${\mathcal{H}}_{n}$.

It is known that ${H}_{n}$ is a real Hilbert space endowed with the inner product

${〈f|g〉}_{{H}_{n}}=\underset{i=1}{\overset{d}{\prod }}〈{f}_{i},{g}_{i}〉,\text{\hspace{0.17em}}f={f}_{1}\otimes \cdots \otimes {f}_{d},\text{\hspace{0.17em}}g={g}_{1}\otimes \cdots \otimes {g}_{d}\in {H}_{n},\text{\hspace{0.17em}}{f}_{i},{g}_{i}\in {\mathcal{H}}_{n}.$ (3.8)

For $\left(x,y\right)\in \mathcal{X}×\mathcal{X}$, we have

${K}_{n}\left(x,y\right)=\underset{i=1}{\overset{d}{\prod }}\text{ }\text{ }{K}_{n}\left({x}_{i},{y}_{i}\right)=\underset{i=1}{\overset{d}{\prod }}〈{\varphi }_{n}\left({x}_{i}\right),{\varphi }_{n}\left({y}_{i}\right)〉={〈\underset{i=1}{\overset{d}{\otimes }}\text{ }\text{ }{\varphi }_{n}\left({x}_{i}\right)|\underset{i=1}{\overset{d}{\otimes }}\text{ }\text{ }{\varphi }_{n}\left({y}_{i}\right)〉}_{{H}_{n}}.$ (3.9)

Define now the function ${\Phi }_{n}:\mathcal{X}\to {H}_{n}$ given by

${\Phi }_{n}\left(x\right)=\underset{i=1}{\overset{d}{\otimes }}\text{ }\text{ }{\varphi }_{n}\left({x}_{i}\right),\text{ }x=\left({x}_{1},\cdots ,{x}_{d}\right).$ (3.10)

By substituting (3.10) into (3.9), we get

${K}_{n}\left(x,y\right)={〈{\Phi }_{n}\left(x\right)|{\Phi }_{n}\left(y\right)〉}_{{H}_{n}}.$ (3.11)

Hence ${K}_{n}$ is a kernel, with the feature map ${\Phi }_{n}$ and feature space ${H}_{n}$.
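A minimal, self-contained sketch of the product kernel (3.7), built on the recurrence (3.1); the degree n and the test points are arbitrary:

```python
def legendre(n, t):
    """Values [L_0(t), ..., L_n(t)] of the recurrence (3.1)-(3.2)."""
    L = [1.0, t]
    for m in range(1, n):
        L.append(t * L[m] - m**2 / ((2*m - 1) * (2*m + 1)) * L[m - 1])
    return L[: n + 1]

def K1(n, s, t):
    """One-dimensional Legendre polynomial kernel (3.4)."""
    return sum(a * b for a, b in zip(legendre(n, s), legendre(n, t)))

def K_power(n, x, y):
    """High power Legendre polynomial kernel (3.7) on [-1, 1]^d."""
    out = 1.0
    for xi, yi in zip(x, y):
        out *= K1(n, xi, yi)
    return out
```

For d = 1 the product kernel reduces to the one-dimensional kernel, and on the diagonal it is a product of sums of squares, hence non-negative.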

Now, applying Theorem 2.3 to the kernel ${K}_{n}$, we deduce the following result.

Corollary 3.1.1. The space ${H}_{n}$ is a RKHS with corresponding kernel ${K}_{n}$.

4. Application of Kernels in Classification Problem

Binary classification means that we are given data points $\mathcal{X}={\left({x}_{i}\right)}_{1\le i\le L}\subset {ℝ}^{d}$ belonging to two classes. One is denoted Class(1), the other Class(−1). To each point ${x}_{i}$ we associate a label ${y}_{i}=±1$ indicating to which class ${x}_{i}$ belongs. The points ${\left({x}_{i}\right)}_{1\le i\le L}$ are called training points. The idea of SVM is to find a hyperplane ${\mathcal{H}}_{S}$ separating the two classes and to construct a decision function f with which a new point x can be classified (x belongs to Class(1) or to Class(−1)).

4.1. Support Vector Machine: Linearly Separable Classification

When the training samples $\mathcal{X}={\left({x}_{i}\right)}_{1\le i\le L}\subset {ℝ}^{d}$ are linearly separable, we speak of linear classification (i.e., there exists a hyperplane separating the two classes). The idea of SVM is the following:

Let us consider the set of training points ${\left\{\left({x}_{i},{y}_{i}\right)\right\}}_{1\le i\le L}$, where ${x}_{i}=\left({x}_{i}^{\left(1\right)},\cdots ,{x}_{i}^{\left(d\right)}\right)\in {ℝ}^{d}$, with the corresponding labels ${y}_{i}\in \left\{-1,1\right\}$. The first step is to find a hyperplane ${\mathcal{H}}_{S}$ in the space ${ℝ}^{d}$ separating the two classes. The second step is to construct the decision function f.

Support vectors are the training examples closest to the separating hyperplane ${\mathcal{H}}_{S}$, and the aim of SVM is to orient this hyperplane so as to be as far as possible from the closest members of both classes. For example, in a 2-dimensional space this means that we can draw a line separating the two classes on a graph of one coordinate versus the other. In higher-dimensional spaces ( $d>2$ ), the line is replaced by an affine hyperplane ${\mathcal{H}}_{S}$, i.e., $\mathrm{dim}{\mathcal{H}}_{S}=d-1$ (see Figure 1).

Step.1.

Recall that any affine hyperplane is described by the equation

$w\cdot x+b=0,$

where “ $\cdot$ ” is the dot product in ${ℝ}^{d}$ and w is normal to the hyperplane. So we have to determine the appropriate values of w and b for the hyperplane ${\mathcal{H}}_{S}$. If we now consider only the points that lie closest to the separating hyperplane, i.e., the Support Vectors (circled in Figure 1), then the two planes ${\mathcal{H}}_{1}$ and ${\mathcal{H}}_{-1}$ that these points lie on can be described by:

Figure 1. Separating hyperplane and marginal distance.

${x}_{i}\cdot w+b=1\text{ }\text{ }\text{\hspace{0.17em}}\text{for}\text{\hspace{0.17em}}{\mathcal{H}}_{1}$ (4.1)

${x}_{i}\cdot w+b=-1\text{ }\text{ }\text{\hspace{0.17em}}\text{for}\text{\hspace{0.17em}}{\mathcal{H}}_{-1}$ (4.2)

Referring to Figure 1, we define ${\delta }_{1}$ as being the distance from ${\mathcal{H}}_{1}$ to the hyperplane ${\mathcal{H}}_{S}$ and ${\delta }_{2}$ from ${\mathcal{H}}_{-1}$ to it. The hyperplane’s equidistance from ${\mathcal{H}}_{1}$ and ${\mathcal{H}}_{-1}$ means that $\delta ={\delta }_{1}={\delta }_{2}$. The quantity $\delta$ is known as the SVM’s margin. In order to orientate the hyperplane to be as far from the Support Vectors as possible, we need to maximize this margin. It is known from [1] that this margin is equal to

$\delta =\frac{1}{‖w‖}.$

Now the problem is to find w and b such that the marginal distance $\delta$ is maximal and for all $i=1,\cdots ,L$,

$w\cdot {x}_{i}+b\ge 1\text{\hspace{0.17em}}\text{ }\text{for}\text{ }\text{\hspace{0.17em}}{y}_{i}=+1\text{\hspace{0.17em}}\text{ }\text{and}\text{\hspace{0.17em}}\text{ }\text{\hspace{0.17em}}w\cdot {x}_{i}+b\le -1\text{\hspace{0.17em}}\text{ }\text{for}\text{ }\text{\hspace{0.17em}}{y}_{i}=-1.$ (4.3)

This is equivalent to the optimization problem under constraint

$\mathrm{min}\left(\frac{1}{2}{‖w‖}^{2}\right)\text{ }\text{ }\text{S}\text{.T}\text{ }\text{ }{y}_{i}\left(w\cdot {x}_{i}+b\right)-1\ge 0\text{ }\forall i=1,\cdots ,L,$ (4.4)

which is in fact a quadratic programming optimization (Q.P-optimization). Using the Lagrange multipliers $\lambda =\left({\lambda }_{1},\cdots ,{\lambda }_{L}\right)$, where ${\lambda }_{i}\ge 0$ $\forall i=1,\cdots ,L$, the problem (4.4) will be equivalent to

$\underset{\lambda }{\mathrm{max}}\left[\underset{i=1}{\overset{L}{\sum }}\text{ }\text{ }{\lambda }_{i}-\frac{1}{2}\lambda H{\lambda }^{\text{T}}\right]\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{ }\text{S}\text{.T}\text{ }\text{\hspace{0.17em}}{\lambda }_{i}\ge 0,\text{\hspace{0.17em}}\sum \text{ }\text{ }{\lambda }_{i}{y}_{i}=0,$ (4.5)

where

$H={\left({H}_{i,j}\right)}_{1\le i,j\le L}={\left({y}_{i}{y}_{j}{x}_{i}\cdot {x}_{j}\right)}_{1\le i,j\le L}.$

This is a convex quadratic optimization problem, and we run a QP-solver which returns $\lambda$. Thus we can deduce w and b, which are given by [1]

$w=\underset{i=1}{\overset{L}{\sum }}\text{ }\text{ }{\lambda }_{i}{y}_{i}{x}_{i};\text{ }b=\frac{1}{{N}_{S}}\underset{s\in S}{\sum }\left({y}_{s}-\underset{k\in S}{\sum }\text{ }\text{ }{\lambda }_{k}{y}_{k}{x}_{k}\cdot {x}_{s}\right),$ (4.6)

where S is the set of support vectors ${x}_{s}$ (i.e., those whose indices i satisfy ${\lambda }_{i}>0$ ) and ${N}_{S}$ is the number of support vectors.

Step.2.

The second step is to create the decision function f which determines to which class a new point belongs. From [1], the decision function is given by

$f\left(x\right)=sgn\left(w\cdot x+b\right)\in \left\{±1\right\}.$ (4.7)

Thus, for a new data point x, $f\left(x\right)=1$ means that x belongs to the first class and $f\left(x\right)=-1$ means that x belongs to the second class.

In practice, in order to use an SVM to solve a linearly separable binary classification problem, we need to:

1) Create the matrix $H=\left({H}_{i,j}\right)\in {M}_{L}\left(ℝ\right)$, where ${H}_{i,j}={y}_{i}{y}_{j}{x}_{i}\cdot {x}_{j}$.

2) Find $\lambda$ so that

$\underset{i=1}{\overset{L}{\sum }}\text{ }\text{ }{\lambda }_{i}-\frac{1}{2}\lambda H{\lambda }^{\text{T}}$

is maximized, subject to the constraints ${\lambda }_{i}\ge 0$ $\forall i$ and ${\sum }_{i=1}^{L}\text{ }\text{ }{\lambda }_{i}{y}_{i}=0$.

This can be done using a QP-solver.

3) Calculate $w={\sum }_{i=1}^{L}\text{ }\text{ }{\lambda }_{i}{y}_{i}{x}_{i}$.

4) Determine the set of Support Vectors S by finding the indices such that ${\lambda }_{i}>0$.

5) Calculate $b=\frac{1}{{N}_{S}}{\sum }_{s\in S}\left({y}_{s}-{\sum }_{k\in S}\text{ }\text{ }{\lambda }_{k}{y}_{k}{x}_{k}\cdot {x}_{s}\right)$.

6) Each new point ${x}^{\prime }$ is classified by evaluating $f\left({x}^{\prime }\right)=sgn\left(w\cdot {x}^{\prime }+b\right)$.
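The six steps above can be sketched end-to-end on a toy dataset. The projected-gradient loop below is a simple stand-in for a real QP-solver (the data are small and symmetric so that this toy solver converges; in practice one would use a dedicated QP library):

```python
import numpy as np

# Toy linearly separable data (symmetric, so the toy solver below converges)
X = np.array([[1., 1.], [2., 0.], [-1., -1.], [-2., 0.]])
y = np.array([1., 1., -1., -1.])
L = len(y)

# Step 1: H_ij = y_i y_j x_i . x_j
H = (y[:, None] * y[None, :]) * (X @ X.T)

# Step 2: maximise the dual by projected gradient ascent
# (a toy stand-in for a QP-solver)
lam = np.zeros(L)
eta = 0.01
for _ in range(5000):
    lam += eta * (1.0 - H @ lam)        # gradient of the dual objective
    for _ in range(50):                 # project onto {lam >= 0, y . lam = 0}
        lam -= (y @ lam / L) * y
        lam = np.maximum(lam, 0.0)

w = (lam * y) @ X                       # step 3
S = np.where(lam > 1e-6)[0]             # step 4: support vector indices
b = np.mean([y[s] - ((lam * y) @ (X @ X[s])) for s in S])   # step 5

pred = np.sign(X @ w + b)               # step 6, applied to the training points
```

For this dataset the optimal hyperplane is $w=\left(0.5,0.5\right)$, $b=0$, and all training points are classified correctly.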

4.2. SVM for Data That Is Not Fully Linearly Separable

In order to extend the SVM methodology to handle data that is not fully linearly separable, we relax the constraints (4.3) slightly to allow for misclassified points. This is done by introducing positive slack variables ${\xi }_{i}\ge 0$, $i=1,\cdots ,L$:

$w\cdot {x}_{i}+b\ge 1-{\xi }_{i}\text{ }\text{\hspace{0.17em}}\text{ }\text{for}\text{\hspace{0.17em}}{y}_{i}=+1,$

$w\cdot {x}_{i}+b\le -1+{\xi }_{i}\text{ }\text{\hspace{0.17em}}\text{ }\text{for}\text{\hspace{0.17em}}{y}_{i}=-1.$

which can be combined into:

${y}_{i}\left(w\cdot {x}_{i}+b\right)-1+{\xi }_{i}\ge 0\text{ }\forall i=1,\cdots ,L.$ (4.8)

In this soft margin SVM, data points on the incorrect side of the margin boundary incur a penalty that increases with their distance from it. As we are trying to reduce the number of misclassifications, a sensible way to adapt the objective function (4.4) is to find

$\mathrm{min}\left(\frac{1}{2}{‖w‖}^{2}+C\underset{i=1}{\overset{L}{\sum }}\text{ }\text{ }{\xi }_{i}\right)\text{ }\text{ }\text{S}\text{.T}\text{ }\text{ }{y}_{i}\left(w\cdot {x}_{i}+b\right)-1+{\xi }_{i}\ge 0\text{ }\forall i=1,\cdots ,L,$ (4.9)

where the parameter C controls the trade-off between the slack variable penalty and the size of the margin.

Similarly to what was done in the previous case, the Lagrange method leads to the following convex quadratic optimization problem

$\underset{\lambda }{\mathrm{max}}\left[\underset{i=1}{\overset{L}{\sum }}\text{ }\text{ }{\lambda }_{i}-\frac{1}{2}\lambda H{\lambda }^{\text{T}}\right]\text{ }\text{ }\text{S}\text{.T}\text{ }\text{ }0\le {\lambda }_{i}\le C,\text{\hspace{0.17em}}\sum \text{ }\text{ }{\lambda }_{i}{y}_{i}=0.$ (4.10)

We run a QP-solver which returns $\lambda$. The values of w and b are calculated in the same way as in (4.6), though in this instance the set of Support Vectors used to calculate b is determined by finding the indices i for which $0<{\lambda }_{i}<C$.

In practice, we need to:

1) Create the matrix H, where ${H}_{i,j}={y}_{i}{y}_{j}{x}_{i}\cdot {x}_{j}$.

2) Select a suitable value for the parameter C which determines how significantly misclassifications should be treated.

3) Find $\lambda$ satisfying

$\underset{i=1}{\overset{L}{\sum }}\text{ }\text{ }{\lambda }_{i}-\frac{1}{2}\lambda H{\lambda }^{\text{T}}\text{\hspace{0.17em}}\text{ }\text{is}\text{\hspace{0.17em}}\text{maximized}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{S}\text{.T}\text{ }\text{ }0\le {\lambda }_{i}\le C\text{\hspace{0.17em}}\text{\hspace{0.17em}}\forall i\text{\hspace{0.17em}}\text{ }\text{and}\text{ }\text{\hspace{0.17em}}\underset{i=1}{\overset{L}{\sum }}\text{ }\text{ }{\lambda }_{i}{y}_{i}=0.$

This is done using a QP-solver.

4) Calculate $w={\sum }_{i=1}^{L}\text{ }\text{ }{\lambda }_{i}{y}_{i}{x}_{i}$.

5) Determine the set of Support Vectors S by finding the indices such that $0<{\lambda }_{i}<C$.

6) Calculate $b=\frac{1}{{N}_{S}}{\sum }_{s\in S}\left({y}_{s}-{\sum }_{k\in S}\text{ }\text{ }{\lambda }_{k}{y}_{k}{x}_{k}\cdot {x}_{s}\right)$.

7) Each new point ${x}^{\prime }$ is classified by evaluating

$f\left({x}^{\prime }\right)=sgn\left(w\cdot {x}^{\prime }+b\right)=sgn\left[\left(\underset{i=1}{\overset{L}{\sum }}\text{ }\text{ }{\lambda }_{i}{y}_{i}{x}_{i}\cdot {x}^{\prime }\right)+b\right].$

4.3. Non-Linear SVM

In the case where the data points are not linearly separable, i.e., there is no hyperplane separating the data into two classes, we have to transform the data in order to obtain linearly separable points. This is based on kernel functions. It is worth noting that, in the case of linearly separable data, the decision function requires only the dot products of the data points ${x}_{i}$ with each other and with the input vector x. In fact, when applying the SVM technique to linearly separable data, we started by creating the matrix H and the scalar b from the dot products of our input variables

${H}_{i,j}={y}_{i}{y}_{j}{x}_{i}\cdot {x}_{j};\text{ }b=\frac{1}{{N}_{S}}\underset{s\in S}{\sum }\left({y}_{s}-\underset{k\in S}{\sum }\text{ }\text{ }{\lambda }_{k}{y}_{k}{x}_{k}\cdot {x}_{s}\right).$

This is a key observation for the kernel trick. Indeed, the dot product can be replaced by a kernel, which is also a positive definite function. The idea is based on the choice of a kernel function k, and the trick is to map the data into a high-dimensional feature space $\mathcal{H}$ via a transformation $\varphi$ related to k, in such a way that the transformed data are linearly separable. The map $\varphi :\mathcal{X}\to \mathcal{H}$ is called the feature map. When the separating hyperplane is mapped back into the original space, it describes a (generally non-linear) surface.

Similarly to the previous section, we adopt the same procedure of separation at the level of the feature space $\mathcal{H}$. This leads to the following steps.

1) Choose a kernel function k.

2) Create the matrix H, where ${H}_{i,j}={y}_{i}{y}_{j}k\left({x}_{i},{x}_{j}\right)$.

3) Choose how significantly misclassifications should be treated, by selecting a suitable value for the parameter C.

4) Determine $\lambda$ so that

$\underset{i=1}{\overset{L}{\sum }}\text{ }\text{ }{\lambda }_{i}-\frac{1}{2}\lambda H{\lambda }^{\text{T}}$

is maximal under the constraint $0\le {\lambda }_{i}\le C$ $\forall i$ and ${\sum }_{i=1}^{L}\text{ }\text{ }{\lambda }_{i}{y}_{i}=0$.

This can be done using a QP-solver.

5) Determine the set of Support Vectors S by finding the indices i such that $0<{\lambda }_{i}$.

6) Calculate $b=\frac{1}{{N}_{S}}{\sum }_{s\in S}\left({y}_{s}-{\sum }_{i\in S}\text{ }\text{ }{\lambda }_{i}{y}_{i}k\left({x}_{i},{x}_{s}\right)\right)$.

7) Each new point ${x}^{\prime }$ is classified by evaluating

$f\left({x}^{\prime }\right)=sgn\left(w\cdot {x}^{\prime }+b\right)=sgn\left[\left(\underset{i=1}{\overset{L}{\sum }}\text{ }\text{ }{\lambda }_{i}{y}_{i}k\left({x}_{i},{x}^{\prime }\right)\right)+b\right].$

Note that in general the feature map $\varphi$ is unknown, so w is also unknown. But we do not need it: we only need the values of the kernel at the training points (i.e., $k\left({x}_{i},{x}_{j}\right)$ for all $i,j$) to compute b, and $k\left({x}_{i},{x}^{\prime }\right)$ to evaluate the decision function f at the input vector ${x}^{\prime }$.
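The non-linear procedure can be illustrated on the classical XOR configuration, which no hyperplane separates in the original space. This is a hedged sketch: `scikit-learn`'s `SVC` accepts an arbitrary kernel as a callable, and here a degree-2 polynomial kernel stands in for the Legendre kernel of (3.1), which is not reproduced in this snippet.

```python
import numpy as np
from sklearn.svm import SVC

# XOR-like toy set: the class is the sign of x1 * x2
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])

# Any positive definite kernel can be passed as a callable; a degree-2
# polynomial kernel stands in for the Legendre kernel here.
poly2 = lambda A, B: (1.0 + A @ B.T) ** 2

clf = SVC(kernel=poly2, C=10.0)
clf.fit(X, y)
print(clf.predict([[2.0, 2.0], [2.0, -2.0]]))  # predicts class +1 then -1
```

In the feature space induced by this kernel, the cross term $x_1 x_2$ appears as a coordinate, so the four points become linearly separable even though they are not in the plane.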

5. Numerical Simulations

A dataset of female patients of the Pima Indian population, aged at least twenty-one years, has been taken from the UCI machine learning repository. This dataset is originally owned by the National Institute of Diabetes and Digestive and Kidney Diseases. It contains a total of 768 instances classified into two classes, diabetic and non-diabetic, with eight risk factors: Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree Function and Age. To diagnose diabetes in the Pima Indian population, the performance of all kernels is evaluated on metrics such as accuracy, precision, recall score, F1 score and execution time.

Before giving experiment results, we would like to recall some characteristics of the confusion matrix (Table 1) and related metric evaluation.

5.1. Terminology and Derivations from a Confusion Matrix

Table 1. Confusion Matrix.

where:

- (TP) is the number of True Positive cases (real positive cases and detected positive),

- (TN) is the number of True Negative cases (real negative cases and detected negative),

- (FP) is the number of False Positive cases (real negative cases and detected positive),

- (FN) is the number of False Negative cases (real positive cases and detected negative),

- (PP) is the number of Predicted Positive cases $\text{PP}=\text{TP}+\text{FP}$,

- (PN) is the number of Predicted Negative cases $\text{PN}=\text{FN}+\text{TN}$,

- (RP) is the number of Real Positive cases $\text{RP}=\text{TP}+\text{FN}$,

- (RN) is the number of Real Negative $\text{RN}=\text{FP}+\text{TN}$,

- The Total Population ${T}_{p}$ is given by

${T}_{p}=\text{PP}+\text{PN}=\text{RP}+\text{RN}.$

The following metrics are used to compare the performance of the Legendre polynomial kernel with the linear, rbf and polynomial kernels.

- Precision (or positive predictive value, PPV) measures the classifier’s ability to provide correct positive predictions of diabetes.

$\text{PPV}=\frac{\text{TP}}{\text{TP}+\text{FP}}.$

- Recall score (or true positive rate, or sensitivity) measures the proportion of actual positive cases of diabetes correctly identified by the classifier.

$\text{TPR}=\frac{\text{TP}}{\text{RP}}=\frac{\text{TP}}{\text{TP}+\text{FN}}.$

- Precision and recall score together yield the F1 score, which thus takes both into account. It is the harmonic mean of precision and sensitivity:

${\text{F}}_{1}=\frac{2\,\text{PPV}\times \text{TPR}}{\text{PPV}+\text{TPR}}.$

- Accuracy (ACC) is the ratio of true positives and true negatives to all positive and negative observations. In other words, accuracy tells us how often the model correctly predicts an outcome out of the total number of predictions made.

$\text{ACC}=\frac{\text{TP}+\text{TN}}{\text{RP}+\text{RN}}=\frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}}.$
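The four formulas above can be checked with a small helper. This is an illustrative sketch (the function name and the confusion-matrix counts are our own invention, not taken from the paper's experiments).

```python
def metrics(tp, tn, fp, fn):
    """Derive PPV, TPR, F1 and ACC from the four confusion-matrix cells."""
    ppv = tp / (tp + fp)                   # precision
    tpr = tp / (tp + fn)                   # recall / sensitivity
    f1 = 2 * ppv * tpr / (ppv + tpr)       # harmonic mean of the two
    acc = (tp + tn) / (tp + tn + fp + fn)  # overall correctness
    return ppv, tpr, f1, acc

# Hypothetical counts: TP=40, TN=45, FP=10, FN=5
ppv, tpr, f1, acc = metrics(40, 45, 10, 5)
print(ppv, tpr, f1, acc)  # → 0.8, 0.888..., 0.842..., 0.85
```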

5.2. Numerical Results

Since the Legendre polynomial kernel is defined on ${\left[-1,1\right]}^{d}$, a scaling procedure is needed in order to map the data into $\left[-1,1\right]$. For this, we suggest the transformation ${z}_{i}^{\left(j\right)}=\frac{2{x}_{i}^{\left(j\right)}-\left({M}_{j}+{m}_{j}\right)}{{M}_{j}-{m}_{j}},\text{\hspace{0.17em}}\text{\hspace{0.17em}}j=1,\cdots ,d,\text{\hspace{0.17em}}\text{\hspace{0.17em}}i=1,\cdots ,L$, where ${M}_{j}={\mathrm{max}}_{1\le i\le L}{x}_{i}^{\left(j\right)};\text{\hspace{0.17em}}{m}_{j}={\mathrm{min}}_{1\le i\le L}{x}_{i}^{\left(j\right)}.$
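This column-wise affine map is straightforward to vectorize. A minimal sketch (the function name `scale_to_cube` and the sample matrix are ours):

```python
import numpy as np

def scale_to_cube(X):
    """Map each feature j into [-1, 1] via
    z = (2x - (M_j + m_j)) / (M_j - m_j),
    where M_j and m_j are the column max and min."""
    M, m = X.max(axis=0), X.min(axis=0)
    return (2 * X - (M + m)) / (M - m)

X = np.array([[0.0, 10.0],
              [5.0, 20.0],
              [10.0, 30.0]])
print(scale_to_cube(X))  # each column becomes [-1, 0, 1]
```

Note that the column extremes are mapped exactly to -1 and 1, so the transformed data lie on the cube ${\left[-1,1\right]}^{d}$ where the kernel is defined.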

The confusion matrices corresponding to each kernel mentioned above are given in Figures 2-5, where the test size is 0.05 and the penalty coefficient is C = 0.01.

Figure 2. Polynomial Legendre kernel.

Figure 3. Linear kernel.

Figure 4. Rbf kernel.

Figure 5. Polynomial kernel.

In order to show the separation power of the kernel defined by (3.1), our numerical simulations were carried out on the diabetes detection model described above. In this example we apply SVM with different kernels to the Pima Indian diabetes dataset in order to compare the performance of the Legendre polynomial kernel with the linear, rbf and polynomial kernels with respect to accuracy, precision, recall score, F1 score and execution time (see Table 2).

Table 2. Numerical results.

The programs were implemented in Python 3.7 on a Windows 7 computer with 3 GB of memory. The test size is 0.2 and the penalty coefficient is C = 0.001.
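The comparison protocol can be sketched as follows. This is a hedged outline, not the paper's script: a synthetic dataset generated with `make_classification` stands in for the Pima dataset (which is not bundled here), while the test size (0.2) and penalty coefficient (C = 0.001) match the values stated above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the Pima dataset: 768 instances, 8 features
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit one SVM per built-in kernel and record the test accuracy
accs = {}
for name in ["linear", "rbf", "poly"]:
    clf = SVC(kernel=name, C=0.001).fit(Xtr, ytr)
    accs[name] = accuracy_score(yte, clf.predict(Xte))
    print(name, accs[name])
```

The Legendre polynomial kernel would enter this loop as a callable kernel, after scaling the data into ${\left[-1,1\right]}^{d}$ as described in Section 5.2.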

5.3. Conclusion

From the comparative table above, it is clear that the polynomial Legendre kernel has a good separation property, with good precision and accuracy relative to the classical predefined kernels in Python. Essentially, we have shown that an RKHS of infinite dimension is not necessary in order to separate any dataset by a hyperplane in the feature space: a good classification is already obtained with the moderate degree $N=20$, and the separation improves as N increases. The only disadvantage is the time required to fit the model with the polynomial Legendre kernel. The idea used here can be generalized to an arbitrary sequence of orthogonal polynomials in order to obtain new kernel functions. The performance of the resulting separation depends on the polynomial sequence and should be tested on further datasets.

Acknowledgements

The authors gratefully acknowledge Qassim University, represented by the Deanship of Scientific Research, for the support for this research.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

[1] Cortes, C. and Vapnik, V. (1995) Support-Vector Networks. Machine Learning, 20, 273-297. https://doi.org/10.1007/BF00994018

[2] Chatterjee, R. and Yu, T. (2017) Generalized Coherent States, Reproducing Kernels and Quantum Support Vector Machines. arXiv:1612.03713v2.

[3] Abbassa, N., Abdessamad, A. and Bahri, S.M. (2019) Regularized Jacobi Wavelets Kernel for Support Vector Machines. Statistics, Optimization & Information Computing, 7, 669-685. https://doi.org/10.19139/soic-2310-5070-634

[4] Jakkula, V. (2010) Tutorial on Support Vector Machine (SVM). School of EECS, Washington State University, Pullman.

[5] Hastie, T., Tibshirani, R. and Friedman, J. (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York. https://doi.org/10.1007/978-0-387-84858-7

[6] Aronszajn, N. (1950) Theory of Reproducing Kernels. Transactions of the American Mathematical Society, 68, 337-404. https://doi.org/10.1090/S0002-9947-1950-0051437-7

[7] De Vito, E., Umanità, V. and Villa, S. (2011) An Extension of Mercer Theorem to Vector-Valued Measurable Kernels. arXiv:1110.4017v1.

[8] Rebei, H., Riahi, A. and Chammam, W. (2022) Hankel Determinant of Linear Combination of the Shifted Catalan Numbers. Multilinear Algebra.
