Survey and Proposal of an Adaptive Anomaly Detection Algorithm for Periodic Data Streams
1. Introduction
Anomaly detection in real-time massive data streams (a practically infinite flow of data arriving over time, each piece of data having its own timestamp) is an important research topic nowadays, since most of the world's data generation is a continuous temporal process. Many sophisticated and highly effective anomaly detection methods exist that run in batch mode, where the data is collected and processed after the fact. However, identifying anomalies long after they happened is not our primary goal. On the contrary, real-time data processing requires continual input, time-critical processing, and instant output (e.g. an alarm) if an anomaly happens. Instead of searching for unknown anomalies, we can model the normal behavior of the data stream in advance and compare it to the observed one. Consequently, the values of a stream are predicted one time step ahead, the deviation between the predicted values and the observed values is measured, and a decision mechanism that determines whether an observed value exceeds normal behavior is established. Yet other questions arise. Real-time streams are infinite, can have a high rate of data arrival per time unit (high volume, high velocity), and can evolve over time. Thus, the model of normal behavior must adapt to these challenges to maintain detection accuracy: it must be iterative, use only a part of the stream (even before it is permanently stored), and be implemented with positive feedback in the learning process (e.g. repeated anomaly labeling in a supervised process). Due to the need for a real-time detection process, detection algorithms have to be robust, with low processing time (low complexity), even at the cost of accuracy. Currently, the most intensively developed anomaly detection methods that address these challenges are based on machine learning, neural networks, and predictive and statistical time series forecasting models.
In this paper, we are interested in anomaly detection of real data streams that have seasonal patterns. There are a number of studies in this area. The most adequate and often used models are Moving Average (MA) [2] , the AutoRegressive Moving Average (ARMA) and AutoRegressive Integrated Moving Average (ARIMA) [2] , exponential smoothing algorithms HW [3] and TDHW [4] , Hierarchical Temporal Memory (HTM) [5] algorithm and sliding windows [6] [7] .
However, our work brings several benefits [1] :
· Review and classification of existing literature for anomaly detection;
· From all the reviewed literature on anomaly detection, we assessed methods and algorithms for anomaly detection in data streams (time series) that are appropriate and capable of responding to the challenges posed by massive data streams and real-time detection;
· We propose an enhancement of the additive HW and TDHW algorithms that answers the stated challenges. The algorithm is implemented as a positive feedback optimization with a periodic adaptation of the algorithm parameters;
· Starting with ideas of numerous papers [4] [7] - [13] , we use the GA optimization process, to optimize α, β, γ, ω, the HW and TDHW smoothing parameters, where we added optimization of the three new parameters k, n and δ;
· Improvement is made in the new definition of the optimization function based on the input training datasets with the annotated anomaly intervals, enhanced Hyndman’s MASE [14] definition where k and n define the two sliding windows intervals, and δ is the threshold parameter;
· The positive feedback learning process is achieved if the anomalies detected in the next time frame, by the proposed detection engine based on the computed optimal parameters from the annotated anomalies of previous one, are verified/acknowledged by human and reused for parameter optimization;
· The results of the experiments performed on the sets of synthetic and real data periodic streams show that our proposed HW algorithm, with GA optimized parameters and with improved MASE, outperforms the other algorithms.
The data used for the experiments are the well-known anomaly detection benchmarks NUMENTA [12] and Yahoo [15] datasets with annotated anomalies, and our real log data from the Macedonian national education system e-dnevnik.
The rest of this paper has the following structure: the second section covers related work; the third section describes the proposed model for real-time data stream anomaly detection; the fourth section presents the experimental results; and the last section contains conclusions and further work.
2. Review and Classification of Methods for Anomaly Detection
In this section, we review algorithms used for anomaly detection, classified by type of data, type of anomaly, application area, and type of supervision; we also classify the algorithms as predictive and non-predictive.
Figure 1 below shows this classification of anomaly detection algorithms across different fields.
2.1. Classification of Anomaly Detection Methods by Type of Data
Kalinichenko et al., 2014 [2] categorize the data (and related methods) into three categories: the metric data, evolving data, and multi-structured data.
The Metric Data
Figure 1. Classification of anomaly detection methods.
The methods used are based on the distance between objects, the correlation between them, and the probabilistic distribution of the data. Further subdivision is based on the notion of distance (clustering methods, K nearest neighbors and their derivatives), on correlations (linear regression, PCA (Principal Component Analysis)), on data distributions (an iterative algorithm based on the maximum likelihood method), and on methods for high-dimensional data (dimensionality reduction methods).
The Evolving Data (Discrete Sequences Data and Time Series Data)
The methods for Discrete Sequences Data are distance-based, frequency-based and Hidden Markov Model that measure the deviation of a specific value or whole sequence. In the survey [1] , the methods are divided into three groups: sequence-based, contiguous subsequence-based and pattern-based. The first group includes Kernel Based Techniques, Window Based Techniques, Markovian Techniques, contiguous subsequence methods include Window Scoring Techniques and Segmentation Based Techniques. Pattern-based methods include Substring Matching, Subsequence Matching, and Permutation Matching Techniques. For the category of Time series data, the methods that are used are based on well-developed apparatus of time series analysis, including predictive methods, Kalman Filtering, Autoregressive Modeling, detection of unusual shapes with the Haar transform and various statistic techniques.
Multi-Structured Data
The data are categorized into two categories: text data and graph data. The methods used for outlier detection in text data are LSA (Latent semantic analysis) which makes it possible to group text, integrating it with the standard anomaly detection methods, and tf-idf measure. For graph data, the methods are Minimal Description Length principle, Bayesian Models, Markov Random Field, Ising Model, EM and LOF algorithm.
2.2. Classification of Methods by Type of Anomalies
Unlike previous authors, in their comparative research Chandola et al., 2009 [1] and Pokrajac et al., 2008 [16] classify methodologies according to the type of anomaly (point, contextual, and collective anomaly).
The widest range of methodologies is devoted to the simplest one, point anomaly detection. These are classification methodologies, supervised, semi-supervised and unsupervised (based on rules, neural networks and SVM), methodologies based on the nearest neighbor (density, distance, local outlier factor LOF), clustering, statistical (parametric, nonparametric) methods, methods based on information theory and statistical probability, spectral methods, visualization-based methods, and others. Contextual anomalies (also called conditional anomalies) often occur in data such as time series [17] [18] and spatial data [19] , and the choice of methodology is often associated with the application domain. Unlike for point anomalies, there is not a wide range of methodologies proposed for contextual anomalies. They fall into two categories: those using separate transformations to reduce the problem to point anomaly detection in a particular context (for example, the methodology illustrated in [20] ), and predictive sequence and time series methodologies (mostly regression-based techniques for time series).
A collective anomaly occurs when a collection of instances deviates in relation to the rest of the data. For example, an individual event in a computer system does not necessarily indicate an anomaly, but a certain sequence of events can indicate a hacker attack. A collective anomaly may exist if the data are associated by certain relations. It is investigated in data series [21] [22] , time series [23] , graphs [24] and spatial data [19] . The discovery of collective anomalies is more complex than that of point and contextual anomalies, since it requires a separate examination of the structure.
2.3. Classification of Methods by Application Area
Due to the wide number of areas for anomaly detection, the authors of some comparative studies limited themselves to comparing methods and techniques for detecting anomalies within a separate research field, data type or application area. Thus Phua et al., 2004 [25] , Dua and Du [26] , and Sreevidya et al., 2014 [27] compare methods in the field of data mining; Chandola et al., 2009 [3] methods for the discrete series data type; Gogoi et al., 2011 [28] , Ranshous et al., 2013 [29] , and Lazarevic et al., 2003 [9] methods for detecting anomalies in networks; Zwietasch, 2014 [30] detection of anomalies in log files; and Viaene, 2014 [31] selection of the best methodology in integrated management. Significant comparative research is given by Gupta et al., 2014 [4] , who from the perspective of IT researchers classify and compare the methodologies for anomaly detection in temporal and spatial data, data streams and time series with reference to accounting and H/S features. In most of the comparative studies, the authors discuss methodology advantages/disadvantages in terms of the application field, the type of anomaly, the data characteristics, and computational complexity.
2.4. Classification of Methods by Type of Supervision
The other significant classification of methods is by type of supervision. A training data set is required by techniques that involve building an explicit predictive model. The labels associated with a data instance denote whether that instance is normal or an outlier. Based on the extent to which these labels are utilized, Chandola et al., 2009 [1] divide outlier detection techniques into three categories: supervised, semi-supervised and unsupervised outlier detection techniques.
2.5. Review of Predictive and Not Predictive Algorithms for Real-Time Anomaly Detection in Massive Data Streams (Contextual Anomalies)
Usually, authors of newly proposed algorithms for anomaly detection compare their results with those of state-of-the-art techniques (for example, LOF, k-NN), but often they do not take into account the possibility of real-time detection in a huge amount of incoming data. The starting goal of this work was to evaluate different categories of algorithms (we divided them into predictive and non-predictive (statistical) algorithms) that we expected to be fast and to have a satisfactory detection rate (sensitivity-recall and precision [32] [33] ), and thus to be suitable for real-time anomaly detection in massive data streams.
Several algorithms were explored, MAD, runMAD [34] , Boxplot [35] , Twitter ADVec [36] , DBSCAN [37] [38] , our proposal combination of runMAD and Boxplot, ARIMA [39] , Moving Range Technique [40] [41] , Statistical Control Chart Techniques [39] , Moving Average [42] , Hierarchical Temporal Memory (HTM), Holt-Winters and Taylor’s Double Seasonality Holt-Winters.
Autoregressive (AR) and Moving Average (MA) forecasting models have been in existence since the early 1900s. Exponential Smoothing Methods, as a forecasting tool, are introduced in the 1950s. Detailed history, statistical theory, and classification depending on the time series characteristics can be found in [43] .
Following is a more detailed review of the research papers dealing with Holt-Winters and Taylor’s Double Seasonality Holt-Winters forecast modeling of normal data streams behavior. Papers are grouped in studies where HW and TDHW models are used for anomaly detection and their model parameters calculated by exponential formula or decided experimentally [3] [44] [45] , studies that deal with optimization of the model parameters for the best fitted forecast [4] [8] [9] [10] [46] , parameter optimization are done using classical optimization methods (e.g. using excel solver) [47] [48] , or different metaheuristic algorithms as GA [8] [9] [46] , Particle Swarm (PS) [9] , Artificial Bee Colony algorithm (BEE) [4] , Differential Evolution (DE) [48] , Hill climber (HC) and Simulated Annealing (SA) [9] , etc.
J. Brutlag [44] was the first, in 2000, to use a model based on HW forecasting. He integrated it into the Cricket/RRDtool open source monitoring tools to automatically detect, in real time, aberrant behavior of WebTV service streams. He proposed the use of exponential formulas for calculating the smoothing parameters. An anomaly is detected if the newly observed stream value yt falls outside an interval determined by a measure of deviation dt for each time point in the seasonal cycle. The deviation dt is a weighted average of the absolute deviation, updated via exponential smoothing (calculated with the same parameter γ as the seasonal factor in HW). While perhaps not optimal, this solution was shown to be a flexible, efficient, and effective tool for automatic detection. The authors of [3] implement the same idea in a multiplicative HW forecasting model, as part of a test platform that collects real IP flows, based on the open source software Nfsen/RRDtool. The parameters were calculated as in [44] . They used Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) as suitable for comparing different forecasting methods, and Mean Absolute Percentage Error (MAPE) to compare how well a forecasting method suits different time series. In [45] the author emphasizes the need for close examination of the stream behavior before choosing the forecast model: trend existence, characterization of single/multiple seasons, threshold determination with respect to the importance of the numbers of correct and false detections, and the number of anomalies detected per time unit needed to signal an alarm.
Optimization of parameters in forecasting models dates back to 1996 [8] . GA optimization is applied to determine the HW smoothing parameters α, β, γ, including the variable s, a seasonality interval, and the corresponding start-up values for level, trend, and seasonality, by minimizing the forecasting Mean Square Error (MSE) as the evaluation function. As the forecasting task presented in that thesis did not require great precision for the parameters and the start-up values, a binary GA (not a real-valued one) is used. The authors underline the great applicability of GAs in such prediction tasks, especially when a large number of parameters is required.
Similarly to the previous paper, in [46] optimization of the HW parameters is done along with tuning of the GA initialization, population size, and crossover probability, enabling a comparative study of the best prediction accuracy (minimum value of MAPE). The data used in that study are a monthly data set of the total number of tourist arrivals over a ten-year period.
In some of the works, authors used classical non-linear optimization methods with constrained values of variables, to optimize HW parameters. In [47] authors used the MS Excel Nonlinear Solver, a spreadsheet-based non-linear optimizer, to find the values of the smoothing parameters, together with an initial forecast that minimize a measure of forecast error MAD or MSE. A detailed description is given to avoid problems reported by several other authors. Similar work is given in [48] where spreadsheet modeling of additive and multiplicative HW is given to identify optimal smoothing parameters by minimizing MSE with MS Excel Nonlinear Solver and DE heuristics.
In Ashraf [4] the authors improved the prediction accuracy (MSE) by employing the Artificial Bee Colony algorithm to optimize the smoothing parameters of the multiplicative multi-seasonal HW forecasting model. A cloud workload data stream with multi-seasonal cycles is forecast in order to scale computational resources in advance. The performance of the proposed algorithm has been evaluated against double and triple exponential smoothing methods using MAPE and RMSE.
In [9] the authors optimize the α, β, γ, δ smoothing parameters, the φ damping parameter and the λ adjustment for first-order autocorrelation error of the multiplicative double-seasonality and additive damped-trend HW forecast. They compare the results of minimizing the sum of squared errors (SSE) by several meta-heuristic methods: the local improvement procedures HC and SA, Evolutionary Algorithms (EA), GA, and PS. The optimization is implemented in MATLAB for a three-month Portuguese electricity demand data stream. The conclusion is that the values obtained for the forecasting equation's parameters using the different meta-heuristic algorithms were similar, as was the post-sample forecasting performance, which suggests that the HC algorithm, given its simplicity, is a good solution.
In [10] authors use PS metaheuristic minimizing the Residual Standard Error (RSE), Sum of Squared Errors (SSE), Mean of Squared Errors (MSE) or Mean Absolute Deviation (MAD) to determine optimal smoothing parameters of the additive Holt model. The direction of the exchange rate and the actual exchange rate values for the Dollar-Peso and Euro-Peso is accurately forecasted.
The work in [7] is interesting due to its proposed optimization of the sliding time windows that define the set of time lags used to build various forecasting methods, and that also define the number of model inputs, using Genetic and Evolutionary Algorithms (GEA) with a real-valued representation.
Ideas for using metaheuristic optimization of parameters of similar forecasting models exist. In [49] , Seasonal Autoregressive Integrated Moving Average SARIMA forecasting model parameters are optimized by GA. In [11] , authors compared slightly modified HW (that instead of using the time intervals immediately before the analyzed ones for the forecasting calculation, used the time intervals that are equal to the current and relating to the prior seasonal cycle), with the Ant Colony Optimization (ACO) cluster model.
More details about the suitable algorithms and their classification are given in the following section.
3. Methods
This section presents the algorithms against which the proposed method is compared, followed by the proposed method itself [1] .
3.1. Algorithms for Anomaly Detection in Real-Time Massive Data Streams
Given the need for real-time alarms and semi-supervised or unsupervised procedures for massive streaming data anomaly detection, algorithms have to be robust with low processing time, possibly at the cost of accuracy.
The studied algorithms we categorize into two classes:
1) Non-predictive, statistical (Boxplot, DBSCAN, MAD);
2) Predictive (HTM, ARIMA, HW, TDHW).
We choose to analyze algorithms with rather low computational complexity: runMAD [34] , Twitter ADVec [36] , Boxplot [35] , Moving range technique [40] [41] , Statistical Control Charts [39] , ARIMA [39] , Moving Average [42] , DBSCAN [37] [38] , HTM [5] , HW [45] and TDHW [43] . We implement all of them in the R language [50] , except HTM, which is already implemented in the NAB environment [12] .
DBSCAN algorithm is a density-based clustering algorithm. It works by greedily agglomerating points that are close to each other. Outliers are considered clusters with few points in them [38] . This algorithm has two main parts: a parameter ε that specifies a distance threshold under which two points are considered to be close; and the minimum number of points that have to be within a point’s ε-radius before that point can start agglomerating.
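A minimal 1-D sketch of this idea in Python (the function name and toy data are ours; full DBSCAN additionally expands clusters outward from core points, while this sketch only applies the core/border/noise labeling to flag outliers):

```python
def dbscan_outliers(values, eps, min_pts):
    """Flag points that are DBSCAN noise: a point is 'core' if at least
    min_pts points (itself included) lie within eps of it; a point that is
    neither core nor within eps of some core point is an outlier (noise)."""
    n = len(values)
    core = [sum(abs(values[i] - values[j]) <= eps for j in range(n)) >= min_pts
            for i in range(n)]
    outliers = []
    for i in range(n):
        if core[i]:
            continue
        # border point: non-core, but within eps of a core point
        if not any(core[j] and abs(values[i] - values[j]) <= eps for j in range(n)):
            outliers.append(i)
    return outliers

stream = [10.1, 10.3, 9.9, 10.0, 42.0, 10.2, 10.1]
print(dbscan_outliers(stream, eps=0.5, min_pts=3))  # -> [4], the lone spike
```

Note the quadratic neighbor search: production implementations use spatial indexes to keep the ε-neighborhood queries fast.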
The Tukey (1977) BoxPlot does not make any distribution assumptions, nor does it depend on a mean or standard deviation. The lower quartile (q1, the 25th percentile) and the upper quartile (q3, the 75th percentile) of the data define the inter-quartile range (IQR), and lines (whiskers) indicate variability outside the upper and lower limits (the 9th and 91st percentiles, or 1.5 IQR above q3 and below q1), defining anomalies.
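The 1.5 IQR fences can be sketched as follows (a hypothetical helper using simple linear-interpolation quartiles; quartile conventions vary between implementations):

```python
def tukey_outliers(values, k=1.5):
    """Flag values outside [q1 - k*IQR, q3 + k*IQR] (Tukey's fences)."""
    s = sorted(values)
    n = len(s)

    def quartile(p):  # linear-interpolation percentile on the sorted data
        idx = p * (n - 1)
        lo, hi = int(idx), min(int(idx) + 1, n - 1)
        return s[lo] + (idx - lo) * (s[hi] - s[lo])

    q1, q3 = quartile(0.25), quartile(0.75)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

print(tukey_outliers([2, 3, 3, 4, 4, 5, 5, 90]))  # -> [90]
```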
RunMAD (Median Absolute Deviation of Moving Windows) for streaming data is the median of the absolute deviations from the data's median within the defined window. As such, it does not make any distribution assumptions. Similar window functions are runmin, runmax, runmed, runquartile, etc. Depending on the stringency of the researcher's criteria, which should be defined and justified by the researcher, the author of [51] proposes the values k = 3 (very conservative), k = 2.5 (moderately conservative) or even k = 2 (poorly conservative) for detecting anomalies that fall outside Median ± k*MAD.
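A trailing-window Median ± k*MAD check, as described above, might look like this (a simplified sketch; the R runMAD window function offers more alignment and endpoint options):

```python
import statistics

def run_mad_outliers(values, window, k=3.0):
    """Flag index i if values[i] falls outside Median ± k*MAD computed over
    the trailing window values[i-window:i]. Windows with MAD = 0 are skipped."""
    out = []
    for i in range(window, len(values)):
        w = values[i - window:i]
        med = statistics.median(w)
        mad = statistics.median(abs(x - med) for x in w)
        if mad > 0 and abs(values[i] - med) > k * mad:
            out.append(i)
    return out

data = [10, 11, 10, 12, 11, 10, 11, 50, 11, 10]
print(run_mad_outliers(data, window=5))  # -> [7], the spike of 50
```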
Twitter ADVec [36] proposed by Twitter is composed of different algorithms. The primary algorithm, Seasonal Hybrid ESD (S-H-ESD), builds upon the Generalized ESD test for detecting anomalies. S-H-ESD can be used to detect both global and local anomalies. This is achieved by employing time series decomposition and using robust statistical metrics, viz., median together with ESD. In addition, for long time series such as 6 months of minute data, the algorithm employs piecewise approximation. This is rooted in the fact that trend extraction in the presence of anomalies is non-trivial for anomaly detection.
Statistical control chart technique [39] is a graph used to study how a process changes over time and control of repetitive processes. In general, the chart has a central line that represents the mean value of the in-control process and the other two lines, the upper control limit, and the lower control limit. These control limits are chosen so that almost all the data points will fall within these limits as long as the process remains in control. Data could be a chart of individual data, aggregated by a time parameter (e.g. hour), moving range, moving average and others.
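The individuals control chart described above reduces to a center line and limits at ±k standard deviations computed from an in-control baseline (a minimal sketch; names and the choice k = 3 are ours, following the common Shewhart convention):

```python
import statistics

def control_limits(baseline, k=3.0):
    """Center line ± k standard deviations from an in-control baseline
    (Shewhart individuals chart)."""
    mean = statistics.fmean(baseline)
    sd = statistics.stdev(baseline)
    return mean - k * sd, mean, mean + k * sd

def out_of_control(values, lcl, ucl):
    """Indices of points falling outside the control limits."""
    return [i for i, v in enumerate(values) if v < lcl or v > ucl]

baseline = [10, 10, 11, 9, 10, 11, 10, 9, 10, 10]
lcl, cl, ucl = control_limits(baseline)                # limits 8.0 .. 12.0
print(out_of_control([10, 11, 13, 10, 7], lcl, ucl))   # -> [2, 4]
```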
In statistics and econometrics, and in particular in time series analysis, an autoregressive integrated moving average (ARIMA) model is a generalization of an autoregressive moving average (ARMA) model. Both of these models are fitted to time series data either to better understand the data or to predict future points in the series (forecasting). In time series analysis, the moving average (MA) model is a common approach for modeling univariate time series. Together with the autoregressive (AR) model, the moving-average model is a special case and key component of the more general ARMA and ARIMA models of time series, which have a more complicated stochastic structure.
Hierarchical Temporal Memory (HTM) [5] is a machine learning algorithm that models the input stream and predicts the next value. A raw anomaly score that measures the deviation between the model's predicted input and the actual input is calculated. The score distribution is modeled as a rolling normal distribution, where the sample mean and variance are continuously updated from previous anomaly scores. The recent short-term average of anomaly scores is used as the mean in the Gaussian tail probability to decide whether to declare an anomaly. HTM can robustly detect anomalies in a variety of conditions. The resulting system is efficient, extremely tolerant of noisy data, continually adapts to changes in the statistics of the data, and detects very subtle anomalies while minimizing false positives.
3.2. The Adaptive Algorithm for Anomaly Detection
In Figure 2, the positive feedback optimization method for continuous adaptation of the anomaly detection parameters is shown. The method is composed of four different stages [1] .
Figure 2. Model for a proposed method for anomaly detection [1] .
First is the annotation of the anomalies in the training dataset. The anomaly annotation is defined as a time interval where an anomaly is located. The annotation is done by a human or an oracle.
The second stage is the computation of the anomaly detection parameters for our algorithm using GAs, i.e. computation of the HW or TDHW parameters, together with δ, k and n. GAs have been successfully applied to solve optimization problems, both for continuous (whether differentiable or not) and discrete functions [14] . This enables us to find near-optimal values of the anomaly detection parameters very successfully.
The third stage is the actual anomaly detection engine based on the computed optimal parameters from the second stage. This stage outputs the detected anomalies with our proposed algorithm.
The fourth stage is the human acknowledgment of the output data, and classifies the output data into TP (true positive), FP (false positive) and FN (false negative). The result of the verification/acknowledgment stage is then used again in the second stage for further optimization of the anomaly detection parameters.
In the rest of this section, we present the improved algorithm for anomaly detection of real data streams with seasonal patterns, based on the well-known HW and TDHW [3] [4] additive forecasting models.
The first improvement is done by modification of the Mean Absolute Scaled Error (MASE) [52] , and the second one by optimization of the model parameters.
3.2.1. Standard Algorithms for Anomaly Detection and MASE Modification
Additive HW trend forecast prediction ŷ_{t+1} is defined iteratively (1) by three components, level l_t, trend b_t and seasonality s_t, using restricted real smoothing constants 0 ≤ α, β, γ ≤ 1:

Forecast equation: ŷ_{t+1} = l_t + b_t + s_{t+1−m}

Level: l_t = α(y_t − s_{t−m}) + (1 − α)(l_{t−1} + b_{t−1})

Trend: b_t = β(l_t − l_{t−1}) + (1 − β)b_{t−1}   (1)

Seasonality: s_t = γ(y_t − l_{t−1} − b_{t−1}) + (1 − γ)s_{t−m}

where m is the periodicity of one whole seasonal cycle, i.e. the number of time steps of one season. Good initial values l_0, b_0 and s_i (2) can be obtained from streaming data covering two full seasonal cycles 2m:

Initial level component: l_0 = (1/m) Σ_{i=1}^{m} y_i

Initial trend component: b_0 = (1/m) Σ_{i=1}^{m} (y_{m+i} − y_i)/m   (2)

Initial seasonal components: s_i = y_i − l_0, for i = 1, …, m.
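The additive HW recursions (1) with initial values (2) can be sketched in Python (the paper's implementation is in R; this is an illustrative sketch, not the authors' code):

```python
def holt_winters_additive(y, m, alpha, beta, gamma):
    """One-step-ahead additive Holt-Winters forecasts for a series y with
    seasonal period m. Level/trend are initialized from the first two
    seasonal cycles, seasonal indices from the first cycle."""
    level = sum(y[:m]) / m
    trend = sum(y[m + i] - y[i] for i in range(m)) / (m * m)
    season = [y[i] - level for i in range(m)]
    forecasts = []
    for t in range(len(y)):
        s = season[t % m]                       # s_{t-m}: last updated a cycle ago
        forecasts.append(level + trend + s)     # one-step-ahead forecast
        new_level = alpha * (y[t] - s) + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        season[t % m] = gamma * (y[t] - new_level) + (1 - gamma) * s
        level = new_level
    return forecasts

# a perfectly periodic, zero-trend toy stream: forecasts should track y exactly
y = [10, 20, 30] * 3
f = holt_winters_additive(y, m=3, alpha=0.5, beta=0.1, gamma=0.1)
```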
Additive TDHW trend forecast prediction ŷ_{t+1} (3) is defined iteratively by four components: level l_t, trend b_t, m1 seasonality s_t and m2 seasonality w_t, using restricted real smoothing constants 0 ≤ α, β, γ, ω ≤ 1:

Forecast equation: ŷ_{t+1} = l_t + b_t + s_{t+1−m1} + w_{t+1−m2}

Level: l_t = α(y_t − s_{t−m1} − w_{t−m2}) + (1 − α)(l_{t−1} + b_{t−1})

Trend: b_t = β(l_t − l_{t−1}) + (1 − β)b_{t−1}   (3)

m1 seasonality: s_t = γ(y_t − l_t − w_{t−m2}) + (1 − γ)s_{t−m1}

m2 seasonality: w_t = ω(y_t − l_t − s_{t−m1}) + (1 − ω)w_{t−m2}

For example, if the stream values are observed every minute, a daily cycle gives m1 = 24 × 60 = 1440 and a weekly cycle m2 = 24 × 60 × 7 = 10,080 [53] . Possible initial values are obtained analogously to (2), using full cycles of each seasonality.
Measurement of the forecast accuracy by MASE, defined by Hyndman [52] , is calculated as follows:

MASE = [ (1/l) Σ_{t=1}^{l} |y_t − ŷ_t| ] / [ (1/(l−1)) Σ_{t=2}^{l} |y_t − y_{t−1}| ]   (4)

where l is the number of values in the training stream. In anomaly detection models based on HW or TDHW [3] [44] [53] , if MASE > δ, where δ is a predefined threshold, the newly arrived stream value y_t is declared an anomaly.
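Hyndman's MASE over a training stream of length l, as in (4), can be computed directly (illustrative sketch; the denominator is the in-sample mean absolute error of the one-step naive forecast):

```python
def mase(y, forecasts):
    """Mean Absolute Scaled Error: mean |forecast error| scaled by the
    mean absolute error of the naive (persistence) forecast."""
    l = len(y)
    mae = sum(abs(y[t] - forecasts[t]) for t in range(l)) / l
    naive = sum(abs(y[t] - y[t - 1]) for t in range(1, l)) / (l - 1)
    return mae / naive

print(mase([1, 2, 3, 4], [0, 1, 2, 3]))  # -> 1.0 (as bad as the naive forecast)
```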
We propose [1] an adaptation of the MASE definition (5) by adding two sliding-window parameters k and n to the current iterative processes (1) and (3) with smoothing parameters α, β, γ and ω. For the HW forecast, MASE depends on the parameters α, β, γ, δ, k, n, and for TDHW on α, β, γ, ω, δ, k, n:

MASE_t(k, n) = [ (1/k) Σ_{i=t−k+1}^{t} |y_i − ŷ_i| ] / [ (1/n) Σ_{i=t−n+1}^{t} |y_i − y_{i−1}| ]   (5)

where ŷ_i is the HW (1) or TDHW (3) one-step forecast. The anomaly is declared if MASE_t(k, n) > δ, where δ is the threshold.
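One reading of the windowed definition with parameters k, n and threshold δ can be sketched as follows (a sketch under our assumptions about window placement: forecast errors over the last k points, naive errors over the last n points):

```python
def windowed_mase(y, forecasts, t, k, n):
    """MASE restricted to trailing windows ending at time t: mean |forecast
    error| over the last k points, scaled by the mean absolute naive error
    over the last n points."""
    mae = sum(abs(y[i] - forecasts[i]) for i in range(t - k + 1, t + 1)) / k
    naive = sum(abs(y[i] - y[i - 1]) for i in range(t - n + 1, t + 1)) / n
    return mae / naive

def is_anomaly(y, forecasts, t, k, n, delta):
    return windowed_mase(y, forecasts, t, k, n) > delta

# the last observation deviates strongly from its forecast
print(windowed_mase([1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 0], t=5, k=1, n=3))  # -> 6.0
```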
3.2.2. Finding the Optimal Values of the Algorithm Parameters
The goal of our proposed algorithm is to find the optimal parameter values for the anomaly detection algorithm in order to achieve the correct TP and zero FP and FN.
The evaluation of the optimization parameters for the anomaly detection is based on input datasets and annotated anomaly intervals. We define the following procedures for counting the TP, FP and FN:
· TP (true positive) is the number of annotated anomaly intervals with at least one detected anomaly;
· FP (false positive) is the number of detected anomalies outside of all annotated intervals;
· FN (false negative) is the number of annotated intervals with 0 detected anomalies.
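The three counting procedures above can be sketched directly (function name and toy data are ours):

```python
def count_tp_fp_fn(detected, annotated_intervals):
    """detected: timestamps flagged as anomalies.
    annotated_intervals: (start, end) ground-truth anomaly intervals.
    Returns (TP, FP, FN) per the interval-based counting rules above."""
    hits = [0] * len(annotated_intervals)
    fp = 0
    for t in detected:
        inside = False
        for j, (s, e) in enumerate(annotated_intervals):
            if s <= t <= e:
                hits[j] += 1
                inside = True
        if not inside:
            fp += 1
    tp = sum(1 for h in hits if h > 0)   # intervals with >= 1 detection
    fn = sum(1 for h in hits if h == 0)  # intervals never hit
    return tp, fp, fn

print(count_tp_fp_fn([3, 4, 12], [(2, 5), (8, 10)]))  # -> (1, 1, 1)
```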
Having defined these values, we use the following evaluation function for our genetic algorithm optimization:
(6)
where w1, w2, w3 and w4 are weight factors (constants) assigned according to the importance of the targeted goals. In our case, we favor achieving correct TP and minimal FP and FN; hence w1 is 100 and w2, w3 and w4 are 1.
Based on the defined EF (6), we use a real-valued GA optimization of the parameters under the following constraints:
EF starts with the calculation of a prediction using additive HW (1). Then, based on this prediction, we calculate MASE (4) and evaluate its value against δ. δmax is defined experimentally based on the dataset (in our case 50). If our algorithm detects an anomaly, we add the timestamp to a list of anomalies for further evaluation. The next step is an evaluation of the anomaly list against the annotated anomaly intervals, thus deriving TP, FP and FN, and finally calculating the EF value.
The GA optimization is very effective: we use small populations with less than 100 individuals, and achieve the optimal solutions in less than 20 iterations. The proposed algorithm is implemented in R language.
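A generic real-valued GA of the kind described (small population, few generations, elitism) can be sketched as follows. The operators, constants, and the toy fitness function here are illustrative and are not the exact EF (6) or the GA configuration used in the paper:

```python
import random

def real_ga(fitness, bounds, pop_size=50, generations=20, seed=1):
    """Minimal real-valued GA: elitism, tournament-style parent pool,
    blend crossover, Gaussian mutation clipped to bounds. Maximizes fitness."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        nxt = scored[:2]                          # keep the two best (elitism)
        while len(nxt) < pop_size:
            p1, p2 = rng.sample(scored[:pop_size // 2], 2)   # fitter half
            child = [a + rng.random() * (b - a) for a, b in zip(p1, p2)]
            i = rng.randrange(len(bounds))        # mutate one random gene
            lo, hi = bounds[i]
            child[i] = min(hi, max(lo, child[i] + rng.gauss(0, 0.1 * (hi - lo))))
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

# toy check: maximize -(x - 0.3)^2 over [0, 1]; best[0] should land near 0.3
best = real_ga(lambda v: -(v[0] - 0.3) ** 2, [(0.0, 1.0)])
```

In the paper's setting, the fitness would be the EF value obtained by running the HW forecast with the candidate parameters (α, β, γ, δ, k, n) and counting TP, FP, FN against the annotated intervals.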
4. Experimental Results
In this section, the datasets used in the experiments are described. The main part of the section is a comparison of the results (TP, FP, FN, detection rate, precision) achieved with our proposed algorithms HW GA and TDHW GA against several older variations of HW and TDHW, and the ARIMA, MA and HTM algorithms.
4.1. Experimental Datasets
To evaluate the proposed algorithm, we have used the most known benchmarks from Yahoo, Webscope dataset “data-labeled-time-series-anomalies-v1_0” [15] , NAB [12] “artificial With Anomaly” and our real data log-file, generated by NEIS.
We have exploited the first 4 out of 100 of the Yahoo synthetic A2 and real A3 and A4 time-series benchmarks, with tagged anomaly points. The datasets are suitable for testing detection accuracy for various anomaly types, including outliers and change-points. The synthetic dataset consists of time series with varying trend, noise and seasonality, while the real one consists of time series representing the metrics of various Yahoo services. Some datasets have a weekly, and some both a weekly and a daily, seasonality. Part of the A4 datasets is shown in Figure 3 below.
NAB contains artificially-generated datasets with varying types of tagged anomalies and a daily seasonality. The NEIS dataset has weekly and daily seasonality. Anomalies are unknown but are analyzed and tagged by a human. All the datasets contain a timestamp and single value based on the log.
4.2. Results and Discussion
In order to evaluate if the optimization of the parameters works well, we have separated the datasets into training and test sets. The optimal values of the parameters are determined on the training set and then they are verified on the test set.
Our proposed algorithm (HW GA), with GA-optimized parameters (α, β, γ, δ, k, n) and with the improved MASE(k, n), is compared with ARIMA and MA (implemented in our previous work [54] ), the HTM [5] algorithm, HW with smoothing parameters calculated by formula and default MASE (HW calc. MASE), HW with default smoothing parameters (optimized in R) and default MASE (HW def. MASE), and HW with default smoothing parameters and the improved MASE (HW def. MASE(k, n)).
Figure 3. Yahoo A4 benchmark time series.
HW GA [1] automatically counts the number of TP, FP and FN, which is not possible with the other compared algorithms. The smoothing parameters can be calculated by Formula (7), where for the total weight we take 0.95:
(7)
The number of points per season (frequency) for a Yahoo benchmark stream with weekly seasonality is 24 × 7 = 168, with data every hour. The number of points for a Numenta benchmark stream is 12 × 24 = 288, with data every 5 minutes.
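Choosing the smoothing parameter so that the most recent m observations carry a total weight W follows the standard exponential-smoothing identity 1 − (1 − α)^m = W, i.e. α = 1 − (1 − W)^(1/m); we assume this is what the paper's Formula (7) computes. A minimal sketch for the two season lengths:

```python
# Sketch: smoothing parameter alpha such that the last m observations carry
# a total weight W (here W = 0.95). Assumes alpha = 1 - (1 - W)**(1/m).
def smoothing_parameter(m, total_weight=0.95):
    """Smoothing parameter for a season of m points."""
    return 1.0 - (1.0 - total_weight) ** (1.0 / m)

alpha_yahoo = smoothing_parameter(24 * 7)   # hourly data, weekly season: m = 168
alpha_nab = smoothing_parameter(12 * 24)    # 5-minute data, daily season: m = 288
```

By construction, 1 − (1 − α)^m recovers the chosen total weight of 0.95 for either season length.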
To be able to compare the results, we use the detection rate (recall) in % (d.r.) and precision (prec.), the statistical performance measures of a binary classification test. Due to the large number of TN (true negative) values, specificity (the true negative rate) and accuracy are not suitable measures for time series data.
In Tables 1-5 below, the numbers of detected TP, FP, and FN for NUMENTA, Yahoo, and NEIS on the training and test sets are given.
Similarly, Taylor’s Double Holt-Winters GA (TDHW GA), with optimized parameters (α, β, γ, ω, δ, k, n) and with improved MASE, is compared with the same algorithms as for HW, where the HW-type algorithms are replaced with TDHW.
Table 6 below shows the double-seasonality experiments on both the training and test sets for the NEIS data.
The last rows, shaded gray, show the results of our HW GA. As can be seen, in all cases it outperforms or equals the other algorithms. A direct comparison on the same benchmark datasets can be made between the proposed HW GA algorithm and the HTM anomaly detection algorithm [5] (implemented online in [1]). HW GA and HTM give equally good results on the NUMENTA datasets, while HW GA (100% detection rate and 0% false positives) significantly outperforms HTM on all the Yahoo benchmark datasets, as well as on our e-dnevnik dataset. HW GA also outperforms the best results (detection rate 84.67% and false-positive rate 10.12%) of the HW forecasting algorithm with maximum-likelihood parameter estimation in [53], as well as the results of another type of algorithm (sliding windows) applied to similar data streams, reported in [6].
The other important achievement of HW GA [1] is that the algorithm is self-learning and can be implemented as positive-feedback optimization with periodic adaptation of the algorithm’s parameters. In Table 2 the first dataset is used as a training set. Anomalies detected on the second dataset (test set) are verified/acknowledged by a human and reused for a new parameter optimization. With these newly optimized parameters, detection is performed on the third set, and so on.
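The positive-feedback scheme just described can be sketched as follows. The helpers `ga_optimize`, `detect`, and `human_verify` are hypothetical placeholders standing in for the paper's GA optimizer, the HW detector, and the human acknowledgement step:

```python
# Sketch of the positive-feedback loop: optimize parameters on the first
# (tagged) dataset, detect on each subsequent dataset, let a human confirm
# the alarms, and re-optimize with the confirmed anomalies before moving on.
def feedback_loop(datasets, ga_optimize, detect, human_verify):
    """datasets: list of (data, labels) pairs; only the first labels are used."""
    training_set, labels = datasets[0]       # first set with tagged anomalies
    params = ga_optimize(training_set, labels)
    results = []
    for data, _ in datasets[1:]:
        detected = detect(data, params)      # alarms raised with current params
        confirmed = human_verify(detected)   # analyst keeps only true anomalies
        results.append(confirmed)
        params = ga_optimize(data, confirmed)  # re-optimize with new labels
    return results
```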
Correct results are achieved even when there are no anomalies in the training set while the test set has anomalies (see the example in Table 3).
In Tables 3-6 below, the parameters used by the various algorithms are shown. The parameters δ, k, and n, tagged by (*), are determined experimentally.
Table 2. The result from all tested algorithms for Yahoo benchmark (DR and precision).
Table 3. Part of Numenta training set and test set optimal parameters.
Table 4. Part of Yahoo training set and test set optimal parameters.
Table 5. e-Dnevnik training set and test set TDHW GA optimal parameters.
Table 6. Percentage of TP anomalies found depending on and the GA iteration.
5. Conclusions
In conclusion, anomaly detection in real-time massive data streams is very important in many domains. From the reviewed and classified literature, we conclude that there is a broad research area, covering mathematical, statistical, and information-theoretic methodologies for anomaly detection. A large number of methods from these areas (distance-based, clustering, classification, machine learning, predictive) relate to the various factors and problems of anomaly detection at hand (the type of data, the type of anomaly, and the availability of annotated anomalies in the training set).
In this paper, we restricted ourselves to studying algorithms for anomaly detection in data streams (time series data), because the problem area we investigate is anomaly detection in log-file streams.
In order to choose an appropriate algorithm, we studied several algorithms suitable for anomaly detection in real-time massive data streams, from which we chose several (MA, ARIMA, HTM) for further testing, and, together with standard HW and TDHW, used them as the basis for our proposed algorithm.
Based on the experimental evaluation of detection rate and precision, performed on sets of synthetic and real periodic data streams, we can conclude that our proposed HW with GA-optimized parameters (α, β, γ, δ, k, n) [1] and with improved MASE outperforms the other algorithms. The same cannot be concluded for TDHW with GA optimization. Thanks to the iterative HW procedures, the detection time is suitable for real-time anomaly detection. Optimization with GA is also rather fast, requiring a small number of iterations (about 25-30 iterations are needed to recognize all tagged anomalies in the training sets), and can be done in batch mode on the training sets, as can re-optimization with verified newly detected anomalies. In our future work, we will incorporate HW GA into our implemented infrastructure [14] for anomaly detection in massive data streams. We plan further investigation and tuning of TDHW with GA optimization and a generalization of the optimization function by including additional parameters, such as seasonality and initial values. The ongoing work is motivated by the need for real-time alarms in the case of anomalies in the national online educational system.
Availability of Data and Material
The dataset used in this paper is offered by the FINKI data center for research purposes; it consists of data from e-dnevnik (the national education system in Macedonia). Link: http://ednevnik.edu.mk/.
Authors’ Contributions
Jakup Fondaj reviewed the existing algorithms and took part in executing the experiments. Zirije Hasani proposed the new adaptive algorithm.
List of Abbreviations
· MA—Moving Average
· ARMA—Auto Regressive Moving Average
· ARIMA—AutoRegressive Integrated Moving Average
· TDHW—Taylor’s Double Holt-Winters
· HW—Holt-Winters
· NEIS—National Education Information System
· GA—Genetic Algorithm
· MASE—Mean Absolute Scaled Error
· TP—True positive
· FP—False positive
· FN—False negative
NOTES
1. http://ednevnik.edu.mk/.
2. The term was first mentioned in: Xiuyao Song, Mingxi Wu, Christopher Jermaine, Sanjay Ranka, Conditional Anomaly Detection, IEEE Transactions on Knowledge and Data Engineering, 2006.
3. http://svitsrv25.epfl.ch/R-doc/library/caTools/html/runmad.html.