Outlier Detection Based on Robust Mahalanobis Distance and Its Application ()
1. Introduction
With the advancement of information technology, all fields have gradually entered into the era of big data. In addition, with more comprehensive research data in various research fields, it also brings trouble to the processing data. In the case of more variables that need to be detected, collected and processed, the greater the probability of errors will cause, the more the number of outliers in the data will increase. In fact, the data will be affected by various complicated and uncertain factors, as well as the occurrence of outliers due to the accuracy of the instrument, statistical omissions, and operational errors. Therefore, detecting outliers are generally required before analyzing data. Due to the increase of the probability of occurrence of outliers in high-dimensional data, it is more necessary to detect outliers in high-dimensional data. For one-dimensional data, there are many methods for determining outliers, e.g., three standard deviation standards, box plots [1] , etc. However, in high-dimensional data, because some variables may have a certain correlation, the exception of one of the variables does not indicate that this is an outlier, so the method of detecting outliers for one-dimensional data is not directly applicable to high-dimensional data.
There are many methods for detecting outliers in high-dimensional data. For example, leverage value [2] [3] [4] , Mahalanobis distance [5] , genetic algorithm [6] . Previous studies include: Yan et al. [7] proposed anomaly diagnosis based on leveraged large dataset sampling, Shi et al. [8] proposed a high-dimensional outlier detection algorithm based on genetic algorithm, and so on. The classical Mahalanobis distance is a common method for detecting outliers. However, it is a method based on sample mean vector and sample covariance matrix. Since the classical mean vector and covariance matrix algorithms are sensitive to outliers, the classical Mahalanobis distance is also sensitive to outliers. Many authors have proposed robust estimation methods for mean vector and covariance matrix, such as S-D estimator (Stahel (1981) [9] , Donoho (1982) [10] ), MVE estimator [11] , MCD estimator (Grübel R, 1988) [12] , S estimator (Rousseeuw and Yohai, 1983) [13] , etc. At the same time, the robust Mahalanobis distance is proposed in the literature based on the robust mean vector and covariance matrix. In addition, the fast MCD estimator [14] is widely applied since its computation is simple and fast, and it has high robustness. In 2005, Wang [15] proposed a robust Mahalanobis distance based on fast MCD estimator. In this paper, the robust sample Mahalanobis distance is calculated based on the fast MCD estimator. Compared with the classical Mahalanobis distance, there is a good improvement in robustness. In 2014, Feng et al. [16] applied the robust Mahalanobis distance based on fast MCD estimator to the analysis of LiDAR point cloud data. In 2017, Maronna and Yohai [17] pointed out that the bias of the fast MCD estimator increased as the data dimension increased, and then proposed a Rocke estimator. The paper shown that with equal efficiencies, comparing to S-D estimator, MCD estimator and MM estimator etc, the Rocke estimator has the best robust when the data dimension was larger than 15. Also its computing time is competitive for data dimension was less than 100, and can presumably be improved. Due to the increasing number of variables in practical applications, and MCD estimator is not robust under high dimensional data, we need a robust Mahalanobis distance algorithm for high-dimensional data. Thus, in this paper, we propose a robust Mahalanobis distance algorithm based on the Rocke estimator. The numerical simulation and a real data analysis show that our proposed method can better detect the outliers in the data than the Mahalanobis distance method and the robust Mahalanobis distance base on the fast MCD estimator when there are outliers in the data and the dimensions of data are very high.
The rest of the paper is organized as follows. In Section 2, we introduce the principle and application of Mahalanobis distance and the basic principles and algorithms of Rocke estimator, and propose an algorithm for detecting the outlier of robust Mahalanobis distance in high-dimensional data. In Section 3, simulation studies are conducted to evaluate the finite sample performance of the proposed methods. In Section 4, a real data set is analyzed to compare the proposed methods with the existing methods. A discussion is given in Section 5.
2. Methodology and Algorithm
2.1. Principle of Mahalanobis Distance
The Mahalanobis distance was proposed by the Indian statistician Mahalanobis [5] . It represents a covariance distance of data, which can effectively estimate the similarity of sample sets. Compared with Euclidean distance, the Mahalanobis distance considers the correlation between features and is dimensionless. For a p-dimensional data
with mean vector
and covariance matrix
, the Mahalanobis distance is defined as follows:
(1)
It can be understood as the difference between the mean of x and the sample data, that is, the difference between x and the total sample position. If the difference is larger, it is considered that the possibility that x is not the total sample source is greater. In addition, when
is an identity matrix, the Mahalanobis distance is the same as the Euclidean distance.
2.2. Application of Mahalanobis Distance
In data mining, such as clustering, classification and other algorithms, the distance function is applied, and the Mahalanobis distance is one of the most commonly used distances. Furthermore, in the field of signal processing, information security, and even biomedicine, astronomy is inseparable from the concept of distance. Therefore, the Mahalanobis distance is very meaningful for these researches. For a set of samples
with n and dimensions of p, we first calculate the mean vector
and covariance matrix
of the sample
, and then calculate the Mahalanobis distance of each sample. We identify whether a point is an outlier, a threshold is needed. We know that
approximates a chi-square distribution with a degree of p. Therefore, given a confidence level
, if there is
for a certain sample, then the sample is an outlier, and vice versa.
2.3. The Principles and Algorithms of Rocke Estimator
Rocke estimator first proposed by Rocke is a robust estimation method for high-dimensional data based on an improved S-estimator, and then further improved and empirically compared to other estimates by Maronna and Yohai [17] . They pointed out that robustness was superior to other estimates when the data dimension was larger than 15. At the same time, the initial value for Rocke estimator has a significant influence. The subsampling approach usually employed for computing the starting values is very expensive for large dimensions. This study demonstrates that a semi-deterministic equivariant procedure, initially proposed by Peñaand Prieto (2007) [18] for outlier detection, dramatically improves both the computing times and the statistical performances of the estimators. By comparing the initial values obtained by KSD estimator [18] with those obtained by MVE estimator, KSD estimatoris more robust and rapid than other initial values in terms of robustness and operation speed. For a set of samples
with empirical distribution function
, we first estimate the initial mean vector
and the covariance matrix
with KSD estimator, and calculate the squared of the Mahalanobis distance as
. Rocke estimator mainly applies a non-monotonic weight function by Rocke (1996) [19] , and iteratively updates the weight of each sample point, and finally obtains a robust mean vector and covariance matrix estimator. When the distance change is very small, that is, the scale of the Mahalanobis distance to be small, the iteration is stopped, and a robust mean vector and covariance matrix can be obtained. The detailed algorithm for Rocke estimator is given by the following four steps:
In the first step, centering and scaling the data, and then the mean vector
and covariance matrix
of the sample data are obtained by KSD estimator.
In the second step, let
represent the scale estimate of the Mahalanobis distance, and solve it by
(2)
where
controls the size of the breakdown point, when
, Rocke estimator can achieve the highest finite sample breakdown,
where p represent the data dimension. The relationship between the
function and the weight function W is:
, The function
is given by:
(3)
where
denotes the weight range. Since the Mahalanobis distance d approximates the chi-square
distribution with the degree of p, when the
value of
is outside
, there is
(
is the
confidence). When the p is large, the
distribution tends to be symmetric, and there are:
(4)
(5)
let
, and calculate
by fixed point method.
In the third step, the Mahalanobis distance is obtained by the initial value; then the new mean vector and the covariance matrix are calculated by the following weight function:
(6)
The different weights are applied to different samples by the following equation to calculate
, the uncoordinated covariance matrix C to obtain the final covariance matrix
:
(7)
(8)
(9)
In the fourth step, repeat the second and third steps until obtain
has the following relationship with the
obtained last time:
stops the iteration, where
is the preset error, and finally obtains the final stable mean vector and covariance matrix, and then calculates the Mahalanobis distance of the sample data.
3. Numerical Simulation Examples
We will apply the classical Mahalanobis distance and Mahalanobis distance based MCD estimator, and Mahalanobis distance based on the Rocke estimator to detect the outlier via the simulation studies. We generate mixture distribution data set, which consists of the standard normal distribution and contaminated data. The mixture distribution data is
where
denotes the contaminated ratio and the constant
determines the scatter of the outliers. In this simulation studies, we consider n = 100, p = 6; n = 300, p = 30,
, and
, and take
as the threshold. We calculate the above three Mahalanobis distance, and then identify the outliers in the data. We also calculate the number of real outliers (NRO), detected the corrected number of outliers (DENO), rate containing outliers in the data (DaOR), and detected outliers rate (DeOR). The corresponding results are shown in Figures 1-8.
It can be seen from the above results that under the 6-dimensional data setting, the results detected by the two robust Mahalanobis distances are substantially consistent with those of the classical Mahalanobis distances in the absence of outliers. When the outlier ratio increases to 20%, the classical Mahalanobis distances is affected by the outliers. Therefore, the classical Mahalanobis distance
Figure 1. λ = 0, n = 100, p = 6, ε = 0. (a) Rocked detection; (b) Classical detection; (c) MCD detection.
Figure 2. λ = 0, n = 100, p = 6, ε = 0.2. (a) Rocked detection; (b) Classical detection; (c) MCD detection.
Figure 3. λ = 0, n = 300, p = 30, ε = 0. (a) Rocked detection; (b) Classical detection; (c) MCD detection.
Figure 4. λ = 0, n = 300, p = 30, ε = 0.2. (a) Rocked detection; (b) Classical detection; (c) MCD detection.
Figure 5. λ = 0.5, n = 100, p = 6, ε = 0. (a) Rocked detection; (b) Classical detection; (c) MCD detection.
Figure 6. λ = 0.5, n = 100, p = 6, ε = 0.2. (a) Rocked detection; (b) Classical detection; (c) MCD detection.
Figure 7. λ = 0.5, n = 300, p = 30, ε = 0. (a) Rocked detection; (b) Classical detection; (c) MCD detection.
Figure 8. λ = 0.5, n = 300, p = 30, ε = 0.2. (a) Rocked detection; (b) Classical detection; (c) MCD detection.
cannot detect outliers. However, both robust Mahalanobis distances still maintain good robustness and can accurately detect outliers. In the 30-dimensional data setting, the detection results of the three Mahalanobis distances are also similar when there are no outliers. After the outlier ratio is increased to 20%, the classical Mahalanobis distance is still unable to detect the outliers. In addition, the detection result of the Mahalanobis distance based on MCD estimator has been greatly deviated. However, the Mahalanobis distance based on Rocke estimator can still accurately detect the outliers. Therefore, the Mahalanobis distance based on Rocke estimator can accurately detect the outliers in both low-dimensional data set and high-dimensional data set.
4. Empirical Analysis
In this section, we apply the proposed methodology to analyze the Breast Cancer Wisconsin (Diagnostic) Data Set (1995) [20] . The data set contained 30 test variables with a total of 569 data, of which 357 were diagnosed as benign and 212 were diagnosed as malignant. Classification by characteristic variables of the sample can distinguish between benign and malignant (Wolberg, Street, Heiseyand Mangasarian) [21] . Therefore, for data diagnosed as benign, data diagnosed as malignant is equivalent to the contaminated data. Adding contaminated data to the data diagnosed as benign with data diagnosed as malignant, and use
as the threshold. In the following, we apply the above three methods to detect outliers, and obtain the number and proportion of detected outlier and the scatter plot. Since the actual data may contain a certain proportion of outlier, it is not advisable to add too much data when the data diagnosed as malignant is used as the outliers. The first 200 diagnosed as benign data were taken, and 0 and 16 (100th to 115th) of the data diagnosed as malignant were sequentially added. We also calculate the number of outliers added (NOA), the number of outlier detected (NOD), detected the corrected number of outliers (DENO), and detected outliers rate (DeOR). The results are shown in Figure 9 and Figure 10.
Figure 9. Diagnosed as benign data test results. (a) Rocked detection; (b) Classical detection; (c) MCD detection.
Figure 10. Adding diagnosed as malignancy data test results. (a) Rocked detection; (b) Classical detection; (c) MCD detection.
It can be observed that when no contaminated data is added, the proportion of outlier detected by Mahalanobis distance based MCD estimator, and Mahalanobis distance based on the Rocke estimator are about 30% and 40%, respectively, and the classical Mahalanobis distance detects about 10%. This implies that the data itself contains outliers. When 16 contaminated data are added, the classical Mahalanobis distance and Mahalanobis distance based MCD estimator can only accurately detect one of the real outliers added, but Mahalanobis distance based on the Rocke estimator can accurately detect the all 16 added outliers. Therefore, in the real data analysis, the Mahalanobis distance based on the Rocke estimator can be used to effectively identify the outliers in the high-dimensional data.
5. Discussion and Conclusion
Since the classical Mahalanobis distance is greatly affected by the outliers, there is a large deviation in either low-dimensional data set or high-dimensional data set when there are outliers in the data set. Therefore, the outliers cannot be accurately detected. Mahalanobis distance based MCD estimator can detect outliers more accurately in low-dimensional data, but it will produce large deviation in high-dimensional data. Our proposed method is more robust in both low-dimensional data set and high-dimensional data set. Specially, the robustness advantage is more obvious in high-dimensional data set, and it can accurately detect outliers. Through numerical simulation and empirical analysis, the accuracy and practicability of our proposed methodology in high-dimensional data set are validated. Thus, our proposed methodology can provide a new robust method for effectively detecting outliers in high-dimensional real data analysis.
Funding
Li’s research is partially supported by and the Natural Science Foundation of Guangdong Province (No. 2018A030313171) and National Statistical Scientific Research Center Projects (No. 2017LY65).