Bayesian Classifier Based on Robust Kernel Density Estimation and Harris Hawks Optimisation
1. Introduction
Data mining systematically extracts valuable knowledge from vast quantities of noisy data [1]. Anomalous records are encountered in production data for numerous reasons. Such abnormal data can distort the extracted knowledge and potentially lead to incorrect outcomes. Hence, it is imperative to eliminate the disruption caused by outliers [2]. Beyond recognizing and deleting outliers [3], it is often possible to dynamically attenuate their impact on the learning model.
Improving classification accuracy for continuous data sets is a critical issue for Bayesian classifiers due to their reliance on accurate probability density estimation. Continuous data often exhibit complexities such as noise and outliers, which can skew the density estimates and lead to misclassification. Accurate density estimation is essential for determining the posterior probabilities that inform the classification decision. When outliers are present, traditional methods may produce biased estimates, adversely affecting the classifier’s performance. Therefore, enhancing classification accuracy through robust methods like optimized kernel density estimation is paramount, as it mitigates the influence of outliers and improves the overall reliability of the classifier’s predictions.
This study presents a specific technique to enhance the classifier’s generalization in classification when the training data includes outliers. The proposed approach utilizes a new Bayesian classifier that relies on Optimised Robust Kernel Density Estimation (ORKDE). This method employs the Harris Hawks Optimisation (HHO) [4] as its optimization strategy to minimize the Complete Cross-validation (CCV) objective function for determining the optimal bandwidth.
Additionally, M-estimation techniques can ensure the model’s robustness during estimation. This procedure effectively eliminates outliers that disrupt the legitimate data. The method employs the Iterative Weighted Least Squares Method (IWLSM) along with the Hampel cost function to reduce the impact of outliers [5] and leads to a precise calculation of the data distribution. The ORKDE technique is employed for class conditional probability calculations in Bayesian classifiers. The outcomes of this method are used to infer class labels for the test samples.
The study is organized as follows: the initial section discusses the classification learning issue when polluted samples are included and analyses its impact on a classification model’s decision boundary. The second section provides the estimation basic principles of the Bayesian classifier for attributes with continuous values. The third section explores the basic tenets of Robust Kernel Density Estimation (RKDE). The fourth section discusses the influence of bandwidth on estimates and provides the HHO technique for optimizing bandwidth. Experimental results conclusively validate the excellent performance of the proposed method.
2. Description of Outlier Problem in Classification
Suppose there are training data samples:

$$x_1, x_2, \ldots, x_n \sim (1-\varepsilon)\,f(x) + \varepsilon\, g(x). \tag{1}$$

Let $x_i$ represent a $d$-dimensional random sample, where a fraction of the samples is independently and identically distributed according to an unknown density function $f$. The remaining samples follow the distribution function $g$ for contaminated data. The quantity $\varepsilon$ represents the proportion of samples generated by $g$, and it always satisfies $0 < \varepsilon < 1$. In a binary classification problem, the decision boundary reflects how the model's preferences are influenced by the distribution of the outliers [6]. Figure 1 demonstrates that the density of outlier samples generated by the contaminating distribution $g$ has a harmful impact on the decision boundary. The model erroneously identifies the test samples depicted as pentagons during testing. Biased Bayesian inference is often caused by the misrepresentation of sample decision boundaries resulting from inaccurate distribution estimation. The key to fixing the problem is achieving precise density estimates, even when outlier data are included. Methods for estimating general probability distributions can be classified into parametric estimation [6] and non-parametric estimation [7]. Parametric estimation methods assume that data is drawn from a known probability distribution with unknown parameters, allowing the inference problem to be framed as a parametric solution. These techniques are widely used in statistical and machine-learning applications. However, making assumptions about the underlying data distribution can be challenging in real-world scenarios, often requiring substantial prior knowledge. This limitation hinders the ability to draw conclusive insights about the distribution's structure and parameters [8]. In some cases, the mathematical representation of the problem is crucial for its resolution, particularly in outlier detection and robust estimation. Parametric methods may struggle when data deviates from the assumed distribution or is complex.
Figure 1. Decision boundary for outlier perturbation.
Furthermore, these outlier problems tend to occur in regions where formulating and thoroughly describing modeling assumptions is naturally challenging. The benefit of non-parametric estimation is its ability to avoid assuming a predetermined form for the general distribution [9]. Instead, it seeks to extract the density information directly from the data and construct a statistical model to characterize it independently. This approach also offers a more accurate depiction of the density distribution in the complex region.
Non-parametric models usually have relevant advantages, such as fewer overall assumptions, broad applicability, and ease of understanding [10]. Such estimation strategies are significantly less restrictive. Typical estimation methods include histogram estimation [11], which can be used to estimate the frequency variation of the data. However, the resulting curve is not smooth, does not reflect the density function's local details, and is unsuitable for outliers. Kernel Density Estimation (KDE) [12], also known as Parzen window estimation, can be considered an additive model applied to density function estimation. Assuming that the data $x_1, x_2, \ldots, x_n$ are taken from an unknown continuous distribution $f$, the density estimate at point $x$ is defined as

$$\hat{f}_H(x) = \frac{1}{n}\sum_{i=1}^{n} K_H(x - x_i),$$

where $K$ is called the kernel function, $H$ is the bandwidth parameter, and the kernel function itself satisfies the conditions $K(u) \ge 0$ and $\int K(u)\,du = 1$.
Standard kernel functions include the Triangular kernel, Epanechnikov kernel, Gaussian kernel, etc. Mathematically, the estimate can be considered the average of the kernel mapping for each sample. Outliers can easily affect such a method, and it isn't easy to estimate the density function accurately [13]. A weighting strategy needs to be considered, which treats each sample unequally according to its importance.
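For concreteness, a minimal sketch of a product-kernel Gaussian KDE with equal weights is given below; the function name, the fixed bandwidth vector, and the example data are illustrative assumptions rather than part of the proposed method.

```python
import numpy as np

def gaussian_kde(X, queries, h):
    """Plain (equally weighted) product-kernel Gaussian KDE.

    X       : (n, d) array of training samples
    queries : (m, d) array of evaluation points
    h       : (d,) array of per-dimension bandwidths
    """
    X = np.asarray(X, dtype=float)
    queries = np.asarray(queries, dtype=float)
    h = np.asarray(h, dtype=float)
    n, d = X.shape
    # standardised differences, shape (m, n, d)
    u = (queries[:, None, :] - X[None, :, :]) / h
    # product of 1-D Gaussian kernels over the d dimensions
    kernel_vals = np.exp(-0.5 * np.sum(u ** 2, axis=2)) / ((2 * np.pi) ** (d / 2) * np.prod(h))
    # equal weights 1/n for every sample
    return kernel_vals.mean(axis=1)

# usage: 500 points from a 2-D Gaussian, evaluated at two query points
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
grid = np.array([[0.0, 0.0], [1.0, 1.0]])
print(gaussian_kde(X, grid, h=np.array([0.3, 0.3])))
```

Because every sample receives the same weight $1/n$, a handful of outliers can visibly inflate the estimated density in regions that the clean data never visit, which is the weakness the weighted estimators below address.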
3. Bayesian Classifier
The Bayesian classifier is a traditional fundamental classifier designed to minimize the theoretical average risk at a predetermined cost [14]. The method has a strong theoretical foundation in statistical analysis and aims to maximize the Bayesian posterior probability. It calculates the posterior probability using Bayes' theorem, and the class with the highest posterior probability is then selected as the label for the test object. Suppose the dataset is $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^d$ and the label $y_i$ of any training sample belongs to the class set $C = \{c_1, c_2, \ldots, c_m\}$. At this point, the Bayesian maximized posterior can be expressed as:

$$\hat{y} = \arg\max_{c \in C} P(c)\, p(x \mid c). \tag{2}$$

The calculation of the class-conditional density $p(x \mid c)$ in Equation (2) is essential when all of the attribute values are continuous. Assuming that the attribute values are independent, this value is calculated as $p(x \mid c) = \prod_{j=1}^{d} p(x_j \mid c)$. Non-parametric estimation methods can extract the density information for a continuous random variable. If the sample contains outliers, it is necessary to use a robust estimation approach to accurately calculate the probability value and the class-conditional probability density value [15].
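As an illustration of how Equation (2) is applied to continuous attributes, the following sketch plugs one-dimensional KDE estimates into a naive Bayes decision rule; the helper names and the fixed bandwidth are illustrative placeholders (the proposed method instead selects the bandwidth by optimising CCV with HHO and replaces the plain KDE by ORKDE).

```python
import numpy as np

def kde_1d(train_vals, x, h):
    """1-D Gaussian KDE of a single attribute, evaluated at scalar x."""
    u = (x - train_vals) / h
    return np.mean(np.exp(-0.5 * u ** 2) / (np.sqrt(2 * np.pi) * h))

def kde_naive_bayes_predict(X_train, y_train, x_test, h=0.5):
    """Naive Bayes with KDE class-conditional densities (Equation (2)).

    Assumes attribute independence: p(x|c) = prod_j p(x_j|c).
    The fixed bandwidth h is a placeholder for the CCV/HHO-optimised value.
    """
    classes = np.unique(y_train)
    best_class, best_score = None, -np.inf
    for c in classes:
        Xc = X_train[y_train == c]
        prior = len(Xc) / len(X_train)
        # log posterior (up to a constant): log P(c) + sum_j log p(x_j | c)
        log_post = np.log(prior)
        for j in range(X_train.shape[1]):
            log_post += np.log(kde_1d(Xc[:, j], x_test[j], h) + 1e-300)
        if log_post > best_score:
            best_class, best_score = c, log_post
    return best_class
```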
4. Robust Kernel Density Estimator
RKDE achieves resilience in estimating density by integrating traditional kernel density estimation with the M-estimation statistical technique. The KDE process for a Positive Semi-Definite (PSD) kernel can be viewed as the inner product of the average sample values in the Reproducing Kernel Hilbert Space (RKHS) [16]. Outliers in the original space can shift the sample mean in the Hilbert space. The method uses the M-estimation technique to obtain accurate estimates of the mean in the RKHS. RKDE employs a Kernelized Iteratively Reweighted Least-Squares (KIRWLS) approach to minimize the robust loss function, and the convergence of this method has been demonstrated [17].
4.1. Kernel Density Estimation Using the Hilbert Space Mapping
There is a mapping function $\Phi : \mathbb{R}^d \rightarrow \mathcal{H}$, where $\mathcal{H}$ represents the Hilbert space. The kernel function $k_\sigma(x, x') = \langle \Phi(x), \Phi(x') \rangle_{\mathcal{H}}$ can be defined as the inner product of the Hilbert space resulting from the mapping. By examining the KDE from this viewpoint, the function can be formulated as the following equation:

$$\hat{f}_{KDE}(x) = \frac{1}{n}\sum_{i=1}^{n} k_\sigma(x, x_i) = \left\langle \Phi(x), \frac{1}{n}\sum_{i=1}^{n}\Phi(x_i) \right\rangle_{\mathcal{H}}. \tag{3}$$

According to Equation (3), the KDE can be understood as the inner product between the mean of the mapped samples and the mapping of $x$. Due to the properties of the RKHS, the estimated KDE value is equal to the mean of the sample mapping in the RKHS. This average value is highly affected by extreme values among the mapped samples and needs to be calculated using a robust estimator [18]. The sample mean is therefore replaced with a robust M-estimate:

$$\hat{f}_{RKDE} = \arg\min_{g \in \mathcal{H}} \sum_{i=1}^{n} \rho\left(\left\| \Phi(x_i) - g \right\|_{\mathcal{H}}\right). \tag{4}$$

By using the representer property of the RKHS and a weighted form, the RKDE solution is expressed as follows:

$$\hat{f}_{RKDE} = \sum_{i=1}^{n} w_i\, \Phi(x_i). \tag{5}$$

The resulting kernel density estimator is provided in a weighted form:

$$\hat{f}_{RKDE}(x) = \sum_{i=1}^{n} w_i\, k_\sigma(x, x_i). \tag{6}$$

In contrast to the conventional KDE, this expression incorporates weighted samples, eliminating the use of equal weighting. The weights adhere to the conditions $w_i \ge 0$ and $\sum_{i=1}^{n} w_i = 1$, with a tendency to assign lower weights to outlier data points. The function $\rho$ in Equation (4) is the robust loss function, and the value of $w_i$ can be calculated in this way:

$$w_i = \frac{\varphi\!\left(\left\|\Phi(x_i) - \hat{f}_{RKDE}\right\|_{\mathcal{H}}\right)}{\sum_{j=1}^{n} \varphi\!\left(\left\|\Phi(x_j) - \hat{f}_{RKDE}\right\|_{\mathcal{H}}\right)}, \qquad \varphi(x) = \frac{\rho'(x)}{x}. \tag{7}$$

At the same time, the RKHS norm can be evaluated through the kernel function:

$$\left\|\Phi(x_i) - \hat{f}_{RKDE}\right\|_{\mathcal{H}} = \sqrt{k_\sigma(x_i, x_i) - 2\sum_{j=1}^{n} w_j\, k_\sigma(x_i, x_j) + \sum_{j=1}^{n}\sum_{l=1}^{n} w_j w_l\, k_\sigma(x_j, x_l)}. \tag{8}$$
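Assuming a Gaussian kernel, the RKHS distances appearing in Equations (7) and (8) can be computed without an explicit feature map; the sketch below illustrates this kernel-trick evaluation, with hypothetical helper names.

```python
import numpy as np

def gaussian_gram(X, sigma):
    """Gram matrix K[i, j] = k_sigma(x_i, x_j) for the Gaussian kernel."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    d = X.shape[1]
    return np.exp(-sq / (2 * sigma ** 2)) / ((2 * np.pi * sigma ** 2) ** (d / 2))

def rkhs_distances(K, w):
    """||Phi(x_i) - sum_j w_j Phi(x_j)||_H for every i (Equation (8))."""
    # expand the squared norm: k(x_i,x_i) - 2 sum_j w_j k(x_i,x_j) + w^T K w
    sq = np.diag(K) - 2.0 * K @ w + w @ K @ w
    return np.sqrt(np.maximum(sq, 0.0))
```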
4.2. M-Estimator and Kernelized Iterative Reweighting Algorithm
The M-estimator is a robust estimator that extends the maximum likelihood estimator to handle contamination in the distribution [19]. The M-estimator effectively addresses the contaminated model in Equation (1); it resembles the maximum likelihood estimator but employs a distinct loss function. The maximum likelihood estimator for multivariate functions can be generalized by substituting the negative log-likelihood with a loss function $\rho$ in the following manner:

$$\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{n} \rho\left(x_i - \theta\right). \tag{9}$$

The function $\rho$ is strictly convex or a continuous symmetric function on the positive semi-axis. If $\rho$ is differentiable, the M-estimator can also be expressed as the following equation:

$$\sum_{i=1}^{n} \psi\left(x_i - \hat{\theta}\right) = 0, \qquad \psi(x) = \rho'(x). \tag{10}$$

The evaluation function $\psi$ is the derivative of the loss function, and the solution of Equation (10) is equivalent to that of Equation (9) if $\rho$ is a convex function whose derivative exists everywhere. By selecting appropriate evaluation functions $\rho$, M-estimation can achieve efficient estimation while ensuring essential robustness. Standard robust estimation functions include the Huber and the Hampel loss functions. The Huber function is defined in the following equation:

$$\rho(x) = \begin{cases} \dfrac{x^2}{2}, & |x| \le a, \\[4pt] a|x| - \dfrac{a^2}{2}, & |x| > a. \end{cases} \tag{11}$$

Then, we can form the corresponding $\psi$ function:

$$\psi(x) = \begin{cases} x, & |x| \le a, \\ a\,\mathrm{sgn}(x), & |x| > a. \end{cases} \tag{12}$$

The parameter $a$ controls the behaviour of the function: for outliers, that is when $|x| > a$, $\psi$ is constant, but when $|x| \le a$, it increases monotonically. Using the Huber function, the objective in Equation (6) can be computed with an iterative algorithm that optimizes the weights:

$$w_i^{(k)} = \frac{\varphi\!\left(\left\|\Phi(x_i) - \hat{f}^{(k-1)}\right\|_{\mathcal{H}}\right)}{\sum_{j=1}^{n} \varphi\!\left(\left\|\Phi(x_j) - \hat{f}^{(k-1)}\right\|_{\mathcal{H}}\right)}, \qquad \hat{f}^{(k)} = \sum_{i=1}^{n} w_i^{(k)}\, \Phi(x_i). \tag{13}$$
Algorithm 4.2 outlines the computational procedure for the kernelized iterative reweighting algorithm. During each update, the weights are calculated based on Equations (5), (8), and (9). After each iteration, the weights of the samples are normalized and reintroduced into the formula for further calculations. The computation terminates and outputs the optimal weights once the change between successive iterations falls below a specified threshold.
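A minimal sketch of this kernelized iterative reweighting loop is given below, using the Huber-based weight function $\varphi(x) = \psi(x)/x$; the stopping rule on the weight change, the iteration cap, and the parameter $a$ are illustrative assumptions.

```python
import numpy as np

def huber_phi(r, a):
    """phi(r) = psi(r)/r for the Huber loss: 1 inside [0, a], a/r beyond a."""
    return np.where(r <= a, 1.0, a / np.maximum(r, 1e-12))

def kirwls_weights(K, a, tol=1e-4, max_iter=100):
    """Kernelized IRWLS sketch: returns weights w_i >= 0 with sum w_i = 1.

    K is the kernel Gram matrix of the training samples.
    """
    n = K.shape[0]
    w = np.full(n, 1.0 / n)                       # start from the ordinary KDE
    for _ in range(max_iter):
        # RKHS distance ||Phi(x_i) - sum_j w_j Phi(x_j)|| via the kernel trick
        sq = np.diag(K) - 2.0 * K @ w + w @ K @ w
        r = np.sqrt(np.maximum(sq, 0.0))
        phi = huber_phi(r, a)
        w_new = phi / phi.sum()                   # normalise the weights
        if np.max(np.abs(w_new - w)) < tol:       # stop once weights stabilise
            return w_new
        w = w_new
    return w
```

The returned weights are plugged into the weighted estimator of Equation (6), so samples lying far from the robust mean in the RKHS contribute little to the final density estimate.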
5. Optimized Robust Kernel Density Estimation
Optimizing RKDE is achieved by a combination of advanced statistical techniques to enhance the performance of Bayesian classifiers, particularly in the presence of outliers. The process begins with M-estimation to minimize the influence of extreme values by iteratively updating parameter estimates based on weighted least squares. The weights assigned to each sample are derived from their distance to the estimated mean, allowing the method to dampen outliers’ impact on the overall density estimation [20]. The influence of bandwidth on the accuracy of estimations is a critical aspect that must be addressed in the kernel density estimation process. The method employs CCV to ascertain the optimal bandwidth for the kernel density estimator. This process entails partitioning the dataset into multiple training and validation subsets, thereby enabling a comprehensive assessment of the estimator’s performance across a range of bandwidth values. By systematically evaluating the integrated squared error (ISE) for each bandwidth, the method identifies the bandwidth that minimizes the ISE, thereby ensuring that the estimator is finely tuned to the specific characteristics of the dataset. This tailored approach enhances the accuracy of the density estimates and mitigates the adverse effects of outliers, ultimately leading to more reliable probabilistic assessments in the context of Bayesian classification. The integration of CCV into the methodology underscores the importance of bandwidth selection as a pivotal factor in achieving robust and accurate density estimation [21].
5.1. Complete Cross-Validation
CCV is a proper way to obtain the optimal bandwidth. Unbiased Cross-Validation (UCV), Biased Cross-Validation (BCV), and the Bootstrap comprise the CCV. The ISE is defined as the following equation:

$$\mathrm{ISE}(h) = \int \left(\hat{f}_h(x) - f(x)\right)^2 dx. \tag{14}$$

Expanding the square gives

$$\mathrm{ISE}(h) = \int \hat{f}_h^2(x)\,dx - 2\int \hat{f}_h(x) f(x)\,dx + \int f^2(x)\,dx, \tag{15}$$

where the roughness of the true density is denoted by $R(f) = \int f^2(x)\,dx$. This term can be ignored since $R(f)$ does not vary with $h$. Using a leave-one-out estimator, cross-validation is performed for the calculation:

$$\hat{f}_{h,-i}(x_i) = \frac{1}{(n-1)h}\sum_{j \ne i} K\!\left(\frac{x_i - x_j}{h}\right). \tag{16}$$

This allows us to estimate the second term in Equation (15) by noting that

$$E\!\left[\frac{1}{n}\sum_{i=1}^{n} \hat{f}_{h,-i}(x_i)\right] = E\!\left[\int \hat{f}_h(x) f(x)\,dx\right], \tag{17}$$

so the UCV criterion is obtained as the approximation

$$\mathrm{UCV}(h) = \int \hat{f}_h^2(x)\,dx - \frac{2}{n}\sum_{i=1}^{n} \hat{f}_{h,-i}(x_i). \tag{18}$$

Hence, the expected value of UCV equals the expected value of ISE minus the constant $R(f)$; therefore, an alternative name for this approach is unbiased cross-validation. The generalization of least-squares cross-validation to arbitrary dimensions can be achieved by using the multivariate product kernel. It is constructed as follows: let $X$ be an $n \times d$ data matrix of random vectors $X_1, \ldots, X_n$, where the $X_i$ are independent observations sampled from a multivariate density $f$ of dimension $d$. Let $X_{ij}$ denote the $j$th entry of $X_i$. The multivariate product kernel estimator of $f$ is

$$\hat{f}_h(x) = \frac{1}{n\,h_1 \cdots h_d}\sum_{i=1}^{n}\prod_{j=1}^{d} K\!\left(\frac{x_j - X_{ij}}{h_j}\right), \tag{19}$$

where $h_j$ is the smoothing parameter for the $j$th dimension. The smoothness term $\int \hat{f}_h^2(x)\,dx$ when using the standard Normal kernel for $K$ becomes

$$\int \hat{f}_h^2(x)\,dx = \frac{1}{n^2\, h_1 \cdots h_d}\sum_{i=1}^{n}\sum_{j=1}^{n}\prod_{k=1}^{d} \bar{K}\!\left(\frac{X_{ik} - X_{jk}}{h_k}\right), \tag{20}$$

where $\bar{K} = K * K$ is the convolution of the kernel with itself; for the standard Normal kernel, $\bar{K}(u) = e^{-u^2/4}/(2\sqrt{\pi})$. The second part of the UCV estimator in Equation (18) is given explicitly by

$$\frac{2}{n}\sum_{i=1}^{n}\hat{f}_{h,-i}(X_i) = \frac{2}{n(n-1)\,h_1\cdots h_d}\sum_{i=1}^{n}\sum_{j \ne i}\prod_{k=1}^{d} K\!\left(\frac{X_{ik} - X_{jk}}{h_k}\right). \tag{21}$$

For the sake of simplicity, the factor $n-1$ is replaced with $n$ for large $n$. The multivariate least-squares cross-validation function, $\mathrm{UCV}(h_1, \ldots, h_d)$, is then obtained by combining Equations (20) and (21):

$$\mathrm{UCV}(h_1,\ldots,h_d) = \frac{1}{n^2\, h_1\cdots h_d}\sum_{i=1}^{n}\sum_{j=1}^{n}\prod_{k=1}^{d}\bar{K}\!\left(\frac{X_{ik}-X_{jk}}{h_k}\right) - \frac{2}{n^2\, h_1\cdots h_d}\sum_{i=1}^{n}\sum_{j\ne i}\prod_{k=1}^{d}K\!\left(\frac{X_{ik}-X_{jk}}{h_k}\right). \tag{22}$$
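As a concrete reference, the sketch below evaluates the UCV criterion for the one-dimensional Gaussian-kernel case, where the convolution term has the closed form used in Equation (20); the multivariate product-kernel version of Equation (22) simply multiplies the per-dimension factors. The function name and the bandwidth grid are illustrative.

```python
import numpy as np

def ucv_gaussian_1d(x, h):
    """Unbiased cross-validation score for a 1-D Gaussian-kernel KDE.

    UCV(h) = int fhat^2 - (2/n) * sum_i fhat_{-i}(x_i), with the integral
    term written in closed form for the standard Normal kernel.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    diff = (x[:, None] - x[None, :]) / h                  # pairwise scaled differences
    # closed-form convolution term: (K*K)(u) = exp(-u^2/4) / (2*sqrt(pi))
    term1 = np.sum(np.exp(-diff ** 2 / 4.0)) / (2.0 * np.sqrt(np.pi) * n ** 2 * h)
    # leave-one-out term: exclude the diagonal (i == j) contributions
    phi = np.exp(-diff ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
    term2 = (np.sum(phi) - n * phi[0, 0]) * 2.0 / (n * (n - 1) * h)
    return term1 - term2

# usage: scan a bandwidth grid and keep the minimiser
data = np.random.default_rng(1).normal(size=300)
grid = np.linspace(0.05, 1.0, 40)
best_h = grid[np.argmin([ucv_gaussian_1d(data, h) for h in grid])]
```

In the proposed pipeline, this grid scan is replaced by the HHO search of Section 5.2, which minimises the full CCV criterion rather than UCV alone.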
The BCV method integrates bias correction into the UCV method [22] to address potential biases that may arise from the iterative exclusion of individual data points. The mean integrated squared error (MISE) is expressed as

$$\mathrm{MISE}(h) = E\!\left[\int \left(\hat{f}_h(x) - f(x)\right)^2 dx\right]. \tag{23}$$

The general form of BCV, obtained from the asymptotic form of Equation (23), is the following equation [23].

(24)
Extensive research has been conducted on applying the bootstrap to univariate kernel density estimation. A simplistic method for utilizing the bootstrap involves resampling $X_1^*, \ldots, X_n^*$ from the original dataset and subsequently creating a bootstrap density estimate in the following manner:

$$\hat{f}_h^*(x) = \frac{1}{n\,h_1\cdots h_d}\sum_{i=1}^{n}\prod_{j=1}^{d}K\!\left(\frac{x_j - X_{ij}^*}{h_j}\right). \tag{25}$$

From Equation (23), the global MISE criterion can be computed from the point-wise MSE criterion:

$$\mathrm{MISE}(h) = \int \mathrm{MSE}\!\left(\hat{f}_h(x)\right) dx = \int E\!\left[\left(\hat{f}_h(x) - f(x)\right)^2\right] dx. \tag{26}$$

The naive bootstrap estimate of this formula is expressed as

$$\mathrm{MSE}^*\!\left(\hat{f}_h(x)\right) = E^*\!\left[\left(\hat{f}_h^*(x) - \hat{f}_h(x)\right)^2\right]. \tag{27}$$

However, with the naive bootstrap, we obtain the following,

$$E^*\!\left[\hat{f}_h^*(x)\right] = \hat{f}_h(x), \tag{28}$$

since each $X_i^*$ has a discrete uniform distribution over the original data. Thus, the simple bootstrap estimate of the MSE in Equation (27) fails because the bias term vanishes and only the variance is estimated.
Reducing this bias has been approached in many ways, for example by limiting the resampling size to be less than the original sample size when building the estimator [24] [25], and it has been recommended to use the smoothed bootstrap to calculate the MSE in Equation (27) [26]. To clarify, $X_1^*, \ldots, X_n^*$ is resampled from the kernel estimate $\hat{f}_h$ instead of the original data. The smoothed bootstrap technique using a Normal kernel does not require actual resampling because the calculations can be written in closed form. The new MISE of the estimator takes the following form,

$$\mathrm{MISE}^*(h) = E^*\!\left[\int \left(\hat{f}_h^*(x) - \hat{f}_h(x)\right)^2 dx\right], \tag{29}$$

where $\hat{f}_h$ is the multivariate product kernel estimator given in Equation (19), $\hat{f}_h^*$ is a multivariate product kernel estimator calculated with data sampled from $\hat{f}_h$, and the expectation, $E^*$, is taken with respect to the density $\hat{f}_h$. The aim is to choose $h$ to minimize the value of Equation (29).
By employing the standard Normal kernel for $K$ in Equation (19) and evaluating Equation (29), the multivariate bootstrap criterion function to be minimized over $(h_1, \ldots, h_d)$ is derived. The $i = j$ terms are eliminated from the sums to reduce the inflated variance of this resampling strategy [27] [28].

(30)

CCV leverages the advantages of its components while minimizing their downsides, and it handles complex, noisy data well. This combination completes the bandwidth selection analysis for RKDE. From Equations (22), (24), and (30), CCV can be expressed as the following equation.

(31)
5.2. Harris Hawks Optimisation Algorithm
The HHO algorithm is then applied to optimize the CCV objective function. HHO mimics the hunting strategies of Harris hawks and is particularly effective in exploring the parameter space for optimal solutions [29]. The implementation involves initializing a population of hawks, each representing a potential solution in the bandwidth search space. The fitness of each hawk is evaluated based on the CCV error, and their positions are iteratively updated to converge toward the optimal bandwidth. This optimization process is crucial as it ensures that the bandwidth selection is not only practical but also efficient, leading to improved classification performance. This hunting method comprises two phases, exploration and exploitation, and each phase mimics hawk predation behaviours [30]. Over time, the prey's escape energy $E$ decreases according to the following formula:

$$E = 2E_0\left(1 - \frac{t}{T}\right). \tag{32}$$

Here, $E_0 = 2r - 1$ represents the initial energy state, $r$ is a random value in $(0, 1)$, and $E_0$ fluctuates between $(-1, 1)$ during each iteration. The maximum and current iteration numbers are $T$ and $t$, respectively. The Harris hawks' capture tactics depend on $|E|$: they are in the exploration phase when $|E|$ is not less than 1; otherwise, they are in the exploitation phase.
In the exploration phase, hawks follow target animals and use other flock members' erratic prey-seeking movements to establish their location [31]. The following equation describes the mathematical model:

$$X(t+1) = \begin{cases} X_{rand}(t) - r_1\left|X_{rand}(t) - 2r_2 X(t)\right|, & q \ge 0.5, \\ \left(X_{rabbit}(t) - X_m(t)\right) - r_3\left(LB + r_4(UB - LB)\right), & q < 0.5. \end{cases} \tag{33}$$

In each iteration, $r_1$, $r_2$, $r_3$, $r_4$, and $q$ are random values in $(0, 1)$. The upper and lower search boundaries are $UB$ and $LB$. $X(t+1)$ represents the hawks' position in the next iteration, $X_{rabbit}(t)$ represents the optimal position of the prey, $X(t)$ is the current hawk position, $X_{rand}(t)$ denotes a randomly selected hawk, and $X_m(t)$ is the average position of the current population. The following formula calculates the average position of the hawks,

$$X_m(t) = \frac{1}{N}\sum_{i=1}^{N} X_i(t), \tag{34}$$

where $X_i(t)$ represents each hawk's location in iteration $t$ and $N$ is the population size.
In the exploitation phase, hawks chase and catch the victim. This period can be modeled by four predatory behaviours: soft besiege, hard besiege, soft besiege with progressive rapid dives, and hard besiege with progressive rapid dives. HHO chooses a different strategy based on the escape energy $E$ and the probability $r$.
When $|E| \ge 0.5$ and $r \ge 0.5$, the HHO carries out the soft besiege. In this stage, the prey still has the energy to escape, and the hawks use a weak encirclement strategy to deplete it and launch a surprise attack. The following equations model the behaviour:

$$X(t+1) = \Delta X(t) - E\left|J\,X_{rabbit}(t) - X(t)\right| \tag{35}$$

and

$$\Delta X(t) = X_{rabbit}(t) - X(t), \qquad J = 2(1 - r_5), \tag{36}$$

where $\Delta X(t)$ is the difference between the prey position and the Harris hawks' current position in iteration $t$, $r_5$ is a random value in $(0, 1)$, and $J$ is the rabbit's random jump strength when escaping. To replicate rabbit motions, $J$ changes randomly in each iteration.
When $|E| < 0.5$ and $r \ge 0.5$, the HHO executes the hard besiege. In this stage, the prey has no energy to escape, and the Harris hawks use a hard encirclement to hunt the prey for a final assault. The current position is updated according to the following equation:

$$X(t+1) = X_{rabbit}(t) - E\left|\Delta X(t)\right|. \tag{37}$$

When $|E| \ge 0.5$ and $r < 0.5$, the HHO performs the soft besiege with progressive rapid dives. In this stage, the prey still has enough energy to escape, and the Harris hawks gradually dive and softly surround the target. The Lévy flight function is used in the position update process to simulate the prey's escape pattern and jumping actions. This process is modeled as

$$Y = X_{rabbit}(t) - E\left|J\,X_{rabbit}(t) - X(t)\right|, \qquad Z = Y + S \times LF(D), \tag{38}$$

where $D$ is the problem dimension and $S$ is a $D$-dimensional random row vector. $LF$ is the Lévy flight function expressed as Equation (39):

$$LF(x) = 0.01 \times \frac{u \times \sigma}{|v|^{1/\beta}}, \qquad \sigma = \left(\frac{\Gamma(1+\beta)\sin(\pi\beta/2)}{\Gamma\!\left(\frac{1+\beta}{2}\right)\beta\, 2^{\frac{\beta-1}{2}}}\right)^{1/\beta}, \tag{39}$$

where $r$ is the escape probability, $v$ and $u$ are random values in the range $[0, 1]$, and $\beta$ is set to 1.5. The final hawk position is determined as in Equation (40):

$$X(t+1) = \begin{cases} Y, & F(Y) < F(X(t)), \\ Z, & F(Z) < F(X(t)), \end{cases} \tag{40}$$

where $F$ is the fitness function.
When $|E| < 0.5$ and $r < 0.5$, the HHO carries out the hard besiege with progressive rapid dives. In this stage, the prey is exhausted and has low escape energy; the Harris hawks surround the prey through the hard besiege with progressive rapid dives and attempt to reduce the distance between their average position and the target prey. The position update formula is

$$X(t+1) = \begin{cases} Y, & F(Y) < F(X(t)), \\ Z, & F(Z) < F(X(t)), \end{cases} \tag{41}$$

where $Y$ and $Z$ are updated by Equation (42):

$$Y = X_{rabbit}(t) - E\left|J\,X_{rabbit}(t) - X_m(t)\right|, \qquad Z = Y + S \times LF(D). \tag{42}$$
Algorithm 5.2 presents the basic flow of the Harris Hawks Optimisation.
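A compact, simplified sketch of the HHO loop for minimising a scalar objective such as the CCV bandwidth criterion is given below; it follows the update rules of Equations (32)-(42), but the Lévy-flight sampling, boundary handling, and greedy-acceptance details follow commonly published defaults rather than the exact settings of this study.

```python
import numpy as np
from math import gamma, pi, sin

def levy(dim, beta=1.5, rng=None):
    """Levy flight step (Mantegna form) with exponent beta = 1.5."""
    rng = rng or np.random.default_rng()
    sigma = (gamma(1 + beta) * sin(pi * beta / 2)
             / (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0, sigma, dim)
    v = rng.normal(0, 1, dim)
    return 0.01 * u / np.abs(v) ** (1 / beta)

def hho_minimize(f, lb, ub, n_hawks=50, max_iter=100, seed=0):
    """Simplified Harris Hawks Optimisation of objective f over the box [lb, ub]."""
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    dim = lb.size
    X = rng.uniform(lb, ub, (n_hawks, dim))
    fit = np.array([f(x) for x in X])
    best, best_fit = X[np.argmin(fit)].copy(), fit.min()
    for t in range(max_iter):
        for i in range(n_hawks):
            E0 = 2 * rng.random() - 1                    # initial energy in (-1, 1)
            E = 2 * E0 * (1 - t / max_iter)              # escape energy, Eq. (32)
            J = 2 * (1 - rng.random())                   # random jump strength
            r = rng.random()
            if abs(E) >= 1:                              # exploration, Eq. (33)
                if rng.random() >= 0.5:
                    Xr = X[rng.integers(n_hawks)]
                    X[i] = Xr - rng.random() * np.abs(Xr - 2 * rng.random() * X[i])
                else:
                    Xm = X.mean(axis=0)                  # mean position, Eq. (34)
                    X[i] = (best - Xm) - rng.random() * (lb + rng.random() * (ub - lb))
            elif r >= 0.5 and abs(E) >= 0.5:             # soft besiege, Eqs. (35)-(36)
                X[i] = (best - X[i]) - E * np.abs(J * best - X[i])
            elif r >= 0.5:                               # hard besiege, Eq. (37)
                X[i] = best - E * np.abs(best - X[i])
            else:                                        # besieges with rapid dives
                ref = X[i] if abs(E) >= 0.5 else X.mean(axis=0)
                Y = best - E * np.abs(J * best - ref)    # Eqs. (38)/(42)
                Z = Y + rng.random(dim) * levy(dim, rng=rng)
                cands = [np.clip(c, lb, ub) for c in (Y, Z)]
                scores = [f(c) for c in cands]
                if min(scores) < f(X[i]):                # greedy accept, Eqs. (40)/(41)
                    X[i] = cands[int(np.argmin(scores))]
            X[i] = np.clip(X[i], lb, ub)
            fit[i] = f(X[i])
            if fit[i] < best_fit:
                best, best_fit = X[i].copy(), fit[i]
    return best, best_fit
```

In the proposed pipeline, `f` would be the CCV criterion evaluated at a candidate bandwidth vector, and the returned position is the optimised bandwidth passed to the robust kernel density estimator.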
6. Experimental Results and Analysis
This study employs synthetic data to validate the effectiveness of the proposed method in three parts. First, datasets drawn from different distributions are created to compare various optimization methods for finding the optimal bandwidth. Next, the density estimation performance of ORKDE, RKDE, and KDE is assessed under identical optimized bandwidths using different probability density functions. Finally, the classification comparison employs ORKDE, RKDE, and KDE with Naïve Bayes.
6.1. Comparative Experiments of Intelligent Optimisation Algorithms Based on CCV
This experiment compares optimization techniques for the CCV criterion. HHO started with a population size of 50 and a convergence threshold of 0.0003. In the Quantum Particle Swarm Optimisation (QPSO) algorithm [32], a population size of 50 and a control parameter of 0.5 were used. With an initial population size of 50, the Black Widow Optimisation (BWO) algorithm [33] was limited to a maximum of 50 iterations. All optimization algorithms are restricted to 100 iterations, and each algorithm is run 50 times independently. Figure 2(a) depicts the search paths of the different optimization methods, showing that the optimization path of HHO is relatively optimal. Table 1 gives the optimized values of UCV for data generated from different distributions; the optimization algorithm is run independently 50 times for each simulated dataset, with a sample size of 500. The Gaussian function has the parameters μ: [0, 0], δ: [[3, 1.2], [1.2, 3]]; the T-distribution function has the parameters L: [0, 0], degrees of freedom DF = 5, and S: [[2, 0.1], [0.1, 2]]; the Cauchy distribution has the parameters L: [0, 0], S: [[2, 0.1], [0.1, 2]], and degrees of freedom DF1 = 1; and the Laplace distribution has L: [0, 0], S: [[2, 0.1], [0.1, 2]].
Upon analyzing the 2-D CCV findings, it is evident that HHO outperforms the other four optimization methods. Figure 2(b) compares the convergence of the optimization algorithms for Gaussian-distributed data under two-dimensional bandwidth conditions. The results demonstrate that the HHO algorithm surpasses the other algorithms in terms of convergence speed and converged value. Overall, HHO outperforms the other algorithms in terms of the schematic search path, the convergence speed of the objective function, and the final convergence value.
Table 1. CCV comparison with different optimization methods.
| Distribution | PSO | BWO | WOA | QPSO | HHO |
| --- | --- | --- | --- | --- | --- |
| Gaussian | −6.8 ± 0.4 | −6.8 ± 0.1 | −6.7 ± 0.6 | −6.7 ± 0.6 | −6.9 ± 0.2 |
| T-distribution | 3.4 ± 0.2 | 2.3 ± 0.5 | 2.3 ± 0.5 | 1.1 ± 0.2 | 0.8 ± 0.2 |
| Cauchy | 0.8 ± 0.2 | 0.6 ± 0.2 | 0.7 ± 0.3 | 0.5 ± 0.2 | 0.4 ± 0.1 |
| Laplace | 2.1 ± 0.4 | 1.8 ± 0.4 | 1.1 ± 0.2 | 0.3 | 1.2 ± 0.3 |
Figure 2. Comparison of different methods: (a) search paths under 2-D; (b) objective function convergence.
6.2. Experiments Comparing KDE and RKDE Distribution Fitting
The HHO-optimized bandwidth was used as the bandwidth value for all estimators in the experimental comparison of function-fitting performance. The MISE metric measures the difference between the estimated results and the true distribution function. The experiments are conducted for both one-dimensional and two-dimensional data. For the one-dimensional comparison, 300 sample points are generated from a normal distribution with mean 0 and variance 0.5, and 40 outlier sample points are generated from a uniform distribution on the interval from 6 to 17. In this case, the parameter $a$ in the Huber function is taken as the 75% quantile of the mapped sample distances after each calculation, and the parameters $a$, $b$, and $c$ of the Hampel function are taken as the 25%, 50%, and 75% quantiles, respectively. Figure 3 shows that the traditional KDE has the highest density values in the outlier section, even when the bandwidth value is optimized, which means that the main density components are represented less accurately. Among all the fitted curves, the red curve of the RKDE with the Huber function is the closest to the original reference function curve. Although the red curve still assigns a non-zero density value to the outlier region, its fit of the principal components is optimal compared with the other methods. The RKDE using the Hampel loss function completely suppresses the density values of the outlier points for one-dimensional data, but there is a significant deviation in the fit of the principal function.
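For reference, a sketch of the Hampel weight function $\varphi(r) = \psi(r)/r$ with the quantile-based parameter choice described above is given below; the three-part redescending form shown is the standard one, and the helper names are illustrative rather than taken from the paper.

```python
import numpy as np

def hampel_phi(r, a, b, c):
    """phi(r) = psi(r)/r for Hampel's three-part redescending loss (r >= 0)."""
    r = np.asarray(r, dtype=float)
    phi = np.ones_like(r)                                 # quadratic zone: psi(r) = r
    mid = (r >= a) & (r < b)
    phi[mid] = a / r[mid]                                 # linear zone: psi(r) = a
    tail = (r >= b) & (r < c)
    phi[tail] = a * (c - r[tail]) / ((c - b) * r[tail])   # descending zone
    phi[r >= c] = 0.0                                     # far outliers get zero weight
    return phi

def hampel_params_from_quantiles(r):
    """a, b, c taken as the 25%, 50%, and 75% quantiles of the current distances r."""
    return np.quantile(r, [0.25, 0.50, 0.75])
```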
The one-dimensional experimental results indicate that the RKDE using the Huber function is the most effective at fitting one-dimensional data. For the two-dimensional experiments, 1000 sample points are generated from distribution F1, a Gaussian with mean [0, 0] and covariance [[1, 0], [0, 1]]; 1000 samples from distribution F2, a Gaussian with mean [5.5, 5.5] and covariance [[1, 0], [0, 1]]; and 1000 samples from a Gaussian with mean [−2, 6.5] and covariance [[2, 0], [0, 2]]. The Gaussian distribution F3 generates 100 outlier sample points. The bandwidth values for KDE, RKDE, and ORKDE are all set to the HHO-optimized bandwidth [0.34, 0.32]. Compared with the reference function in Figure 3(b), the traditional KDE estimate in Figure 3(c) produces non-zero density values in the region of the outlier points, which introduces some error in the central part, whereas the ORKDE estimate in Figure 3(d) is more accurate. Table 2 gives the parameters, outlier counts, and sample sizes for generating the 1-D and 2-D data distributions. The experimental results in Table 3 show that the MISE value of ORKDE is the smallest among all the fitting methods.
6.3. Comparative Experiments of Bayesian Classifiers Based on Optimized RKDEs
This part of the experiment compares classification using ORKDE-NB, RKDE-NB, KDE-NB, and common classifiers. The data are simulated from several different distributions, covering a binary classification and a three-class classification problem, with data dimensions ranging from two to four. The outlier proportions used in this classification process and the other specific parameters are shown in Table 5. Meanwhile, ten-fold cross-validation is used on each dataset to verify the classifiers' effectiveness. Table 4 gives a separate comparison of the effectiveness of each classifier for the two-dimensional T-distribution.
Figure 3. Comparative experiments of fitness: (a) comparison of fitness using 1-D data; (b) 2-D reference probability density curve; (c) 2-D KDE probability density estimation curves; (d) 2-D ORKDE probability density estimation curve.
Tables 6-9 give the mean accuracy of the algorithms on the simulated experimental data generated from the distributions in Table 5. In addition to the relevant Bayesian classifiers, basic classifiers such as SVM, Decision Tree (DT), Random Forest (RF), and XGboost are compared. Among them, SVM adopts an RBF kernel, with its parameters selected by a lattice (grid) search with a step size of 0.1. The Decision Tree adopts the C4.5 model combined with dichotomization to handle continuous attributes. The Random Forest parameters were set as criterion = "gini", max_depth = None, min_samples_split = 2, min_samples_leaf = 1, bootstrap = True. The XGboost parameters were set as learning_rate = 0.3, gamma = 2.5, max_depth = 6, min_child_weight = 3, max_delta_step = 0.5. According to the classification results in Tables 6-9, ORKDE-NB achieves the best classification accuracy in all cases, across all dimensions, for both the binary and the three-class problems. As the data dimension increases, the accuracy of each classifier decreases; this experimental result further demonstrates the effectiveness of the method proposed in this paper.
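For reproducibility, the baseline configurations listed above map onto scikit-learn and XGBoost roughly as follows; the data generation and the grid-search ranges for C and gamma are illustrative placeholders, since the exact ranges are not reported.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# X, y: simulated feature matrix and labels (placeholders for the Table 5 data)
X = np.random.default_rng(0).normal(size=(600, 2))
y = np.random.default_rng(1).integers(0, 2, 600)

# SVM with RBF kernel; C and gamma found by grid ("lattice") search with step 0.1
svm = GridSearchCV(SVC(kernel="rbf"),
                   {"C": np.arange(0.1, 1.1, 0.1), "gamma": np.arange(0.1, 1.1, 0.1)},
                   cv=10)

rf = RandomForestClassifier(criterion="gini", max_depth=None,
                            min_samples_split=2, min_samples_leaf=1, bootstrap=True)

xgb = XGBClassifier(learning_rate=0.3, gamma=2.5, max_depth=6,
                    min_child_weight=3, max_delta_step=0.5)

# ten-fold accuracy, as used for all classifiers in the comparison
for name, clf in [("SVM", svm), ("RF", rf), ("XGboost", xgb)]:
    print(name, cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean())
```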
Table 2. Parameters for distribution function generation.
| Distribution function | 1-D distribution | 2-D distribution |
| --- | --- | --- |
| Gaussian | μ1: 0; Samples: 500; δ1: 0.5; Outliers: 20; μ2: 1; δ2: 3 | μ1: [0, 0]; Samples: 500; δ1: [[3, 1.2], [1.2, 3]]; Outliers: 100; μ2: [0.5, 0.5]; δ2: [[3, 1.5], [1.5, 3]] |
| T-distribution | μ1: 0.5; δ1: 3; L1: 0; DF1 = 5; S1: 2.1; Samples: 500; Outliers: 20 | μ2: [0.5, 0.5]; δ2: [[3, 1.5], [1.5, 3]]; L1: [0, 0]; DF1 = 5; S1: [[2, 0.1], [0.1, 2]]; Samples: 500; Outliers: 20 |
| Cauchy | μ1: 0.5; δ1: 3; L1: 0; Samples: 500; S1: 2.1; DF1 = 1; Outliers: 20 | L1: [0, 0]; Samples: 500; S1: [[2, 0.1], [0.1, 2]]; DF1 = 1; Outliers: 20; μ2: [0.5, 0.5]; δ2: [[3, 1.5], [1.5, 3]] |
| Laplace | L1: 0; Samples: 500; S1: 2.2; Outliers: 20; μ1: 1; δ1: 3 | L1: [0, 0]; Samples: 500; S1: [[2, 0.1], [0.1, 2]]; Outliers: 20; μ2: [0.5, 0.5]; δ2: [[3, 1.5], [1.5, 3]] |
Table 3. MISE comparison with different distributions.
| Estimator | Gaussian (1-D) | Gamma (1-D) | T (1-D) | Chi-square (1-D) | Laplace (1-D) | Gaussian (2-D) | T (2-D) | Laplace (2-D) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| KDE | 4.76 | 2.80 | 3.36 | 9.33 | 2.35 | 4.27 | 5.56 | 9.15 |
| RKDE | 5.16 | 3.15 | 3.57 | 9.87 | 2.44 | 4.31 | 5.58 | 9.31 |
| ORKDE | 4.36 | 2.76 | 3.16 | 8.25 | 2.28 | 3.57 | 5.51 | 8.94 |
Table 4. Comparisons of two-dimensional T-distribution data.
| Classifier | Precision | Recall | Accuracy |
| --- | --- | --- | --- |
| KDE-NB | 0.78 ± 0.2 | 0.83 ± 0.2 | 0.81 ± 0.2 |
| RKDE-NB | 0.79 ± 0.3 | 0.84 ± 0.4 | 0.82 ± 0.4 |
| ORKDE-NB | 0.82 ± 0.1 | 0.85 ± 0.3 | 0.83 ± 0.1 |
| SVM | 0.80 ± 0.4 | 0.79 ± 0.2 | 0.78 ± 0.3 |
| DecisionTree | 0.81 ± 0.5 | 0.83 ± 0.7 | 0.80 ± 0.4 |
| RandomForest | 0.82 ± 0.2 | 0.84 ± 0.6 | 0.82 ± 0.5 |
| XGboost | 0.82 ± 0.3 | 0.84 ± 0.3 | 0.82 ± 0.2 |
The ORKDE method demonstrates significant advantages over traditional kernel density estimation methods. Its robustness to outliers is achieved through the combination of M-estimation and a weighted kernel density estimator, resulting in more accurate density estimates. Additionally, the adaptive bandwidth selection facilitated by CCV and HHO enhances the estimator's ability to adjust to the underlying data distribution, making it more versatile in real-world applications. In terms of applicability, the ORKDE method is designed to accommodate various data distributions, ensuring that the kernel density estimates accurately reflect the actual underlying structure of the data.
Table 6. Comparisons with Gaussian multidimensional data (classification accuracy).

| Classifier | 2-D (binary) | 3-D (binary) | 4-D (binary) | 2-D (triple) | 3-D (triple) | 4-D (triple) |
| --- | --- | --- | --- | --- | --- | --- |
| KDE-NB | 0.81 | 0.79 | 0.73 | 0.89 | 0.81 | 0.78 |
| RKDE-NB | 0.82 | 0.78 | 0.72 | 0.88 | 0.80 | 0.77 |
| ORKDE-NB | 0.82 | 0.83 | 0.76 | 0.90 | 0.86 | 0.79 |
| SVM | 0.78 | 0.78 | 0.74 | 0.80 | 0.83 | 0.75 |
| DT | 0.80 | 0.76 | 0.71 | 0.83 | 0.82 | 0.76 |
| RF | 0.82 | 0.79 | 0.75 | 0.84 | 0.82 | 0.77 |
| XGboost | 0.82 | 0.81 | 0.75 | 0.87 | 0.81 | 0.78 |
Table 7. Comparisons with T multidimensional data (classification accuracy).

| Classifier | 2-D (binary) | 3-D (binary) | 4-D (binary) | 2-D (triple) | 3-D (triple) | 4-D (triple) |
| --- | --- | --- | --- | --- | --- | --- |
| KDE-NB | 0.81 | 0.79 | 0.78 | 0.71 | 0.71 | 0.70 |
| RKDE-NB | 0.82 | 0.78 | 0.77 | 0.73 | 0.80 | 0.72 |
| ORKDE-NB | 0.83 | 0.83 | 0.76 | 0.75 | 0.82 | 0.76 |
| SVM | 0.78 | 0.78 | 0.73 | 0.73 | 0.82 | 0.71 |
| DT | 0.80 | 0.76 | 0.76 | 0.74 | 0.81 | 0.70 |
| RF | 0.82 | 0.75 | 0.75 | 0.75 | 0.80 | 0.73 |
| XGboost | 0.82 | 0.82 | 0.77 | 0.75 | 0.81 | 0.72 |
Table 8. Comparisons with Cauchy distribution data (classification accuracy).

| Classifier | 2-D (binary) | 3-D (binary) | 4-D (binary) | 2-D (triple) | 3-D (triple) | 4-D (triple) |
| --- | --- | --- | --- | --- | --- | --- |
| KDE-NB | 0.84 | 0.78 | 0.71 | 0.73 | 0.80 | 0.72 |
| RKDE-NB | 0.86 | 0.74 | 0.72 | 0.74 | 0.82 | 0.73 |
| ORKDE-NB | 0.87 | 0.77 | 0.73 | 0.76 | 0.83 | 0.75 |
| SVM | 0.79 | 0.72 | 0.76 | 0.77 | 0.84 | 0.71 |
| DT | 0.78 | 0.73 | 0.73 | 0.78 | 0.76 | 0.73 |
| RF | 0.76 | 0.74 | 0.72 | 0.73 | 0.78 | 0.74 |
| XGboost | 0.77 | 0.79 | 0.76 | 0.71 | 0.83 | 0.73 |
Table 9. Comparisons with Laplacian distribution data (classification accuracy).

| Classifier | 2-D (binary) | 3-D (binary) | 4-D (binary) | 2-D (triple) | 3-D (triple) | 4-D (triple) |
| --- | --- | --- | --- | --- | --- | --- |
| KDE-NB | 0.87 | 0.81 | 0.74 | 0.83 | 0.91 | 0.79 |
| RKDE-NB | 0.88 | 0.82 | 0.73 | 0.82 | 0.92 | 0.80 |
| ORKDE-NB | 0.89 | 0.83 | 0.76 | 0.84 | 0.93 | 0.82 |
| SVM | 0.83 | 0.80 | 0.71 | 0.80 | 0.80 | 0.79 |
| DT | 0.81 | 0.78 | 0.75 | 0.79 | 0.83 | 0.78 |
| RF | 0.80 | 0.80 | 0.75 | 0.84 | 0.85 | 0.77 |
| XGboost | 0.84 | 0.82 | 0.78 | 0.82 | 0.88 | 0.76 |
This adaptability is essential in scenarios where datasets may contain varying degrees of contamination from outliers, allowing the method to maintain high performance across diverse conditions. Overall, the ORKDE method represents a significant advancement in robust statistical estimation techniques, providing a reliable framework for improving classification accuracy in Bayesian classifiers, especially when dealing with continuous datasets that exhibit complexities such as noise and outliers. The integration of M-estimation, CCV, and HHO not only enhances the accuracy of probability density estimation but also contributes to the overall reliability of classification decisions, making it a valuable tool in the field of machine learning and data analysis.
7. Conclusions
This study introduces a Bayesian classifier based on ORKDE that offers substantial benefits, including enhanced robustness to outliers, improved classification accuracy, and adaptive bandwidth selection. These features collectively contribute to the method’s effectiveness in handling complex datasets characterized by noise and outliers. However, certain limitations persist, particularly concerning the computational complexity associated with the HHO algorithm and the necessity for meticulous tuning of hyperparameters. Future research directions may encompass several avenues:
Exploration of alternative optimization algorithms: Investigating other optimization techniques that could provide improved convergence rates or enhanced computational efficiency, thereby broadening the applicability of the ORKDE method.
Extension of applicability: Conducting empirical evaluations of the method across a diverse array of datasets and application domains to further substantiate its robustness and effectiveness in varied contexts.
Integration with deep learning approaches: Merging the ORKDE method with deep learning frameworks to capitalize on their capabilities in feature extraction and representation learning. This integration has the potential to further enhance classification performance, particularly in complex scenarios where traditional methods may falter.
These proposed enhancements and explorations will significantly contribute to the ongoing development of robust classification methodologies, particularly in environments where data is prone to contamination from outliers and noise. The ORKDE method stands as a promising advancement in the field, offering a reliable framework for improving Bayesian classification accuracy and fostering further innovations in machine learning and data analysis.