1. Introduction
The 21st century has witnessed the profound impact of the Internet, which has emerged as one of the most transformative inventions in our lives. Today, the Internet transcends numerous boundaries, revolutionizing the way we communicate, engage in recreational activities, work, shop, socialize, enjoy music and movies, order food, manage finances, extend birthday wishes to friends, and more. Modern organizations depend on these services and require them to be available without interruption and accessible worldwide around the clock.
The exponential growth of sensitive services and web-based applications has become a magnet for hackers seeking lucrative gains, technological secrets, including vaccine-related information, or any competitive edge. This surge in valuable data has not only enticed criminal organizations globally but has also led certain governmental entities to recruit exceptionally skilled security experts for cyberattack operations.
The continuous expansion of both lawful and unlawful activities has led to an exponential increase in the complexity and volume of Internet traffic. As a result, network security administrators grapple with ever-evolving and intricate challenges, striving to swiftly impede malicious traffic. To combat this, they heavily rely on a trio of key tools: Firewalls, SIEM (Security Information and Event Management), and IDSs (Intrusion Detection Systems), which stand as primary instruments for detecting and filtering suspicious traffic.
To scrutinize and identify potentially suspicious activities within network traffic using IDSs, two primary detection methods prevail: signature-based and anomaly-based detection. Signature-based or misuse detection methods employ pattern-matching techniques to identify pre-known attacks. The primary advantage lies in their high accuracy, ensuring minimal false positives or negatives when detecting previously recognized suspicious attacks. Anomaly-based detection methods necessitate an initial phase to comprehend normal traffic patterns, employing techniques like machine learning, statistical analysis, or knowledge-based methodologies. Any significant deviation between observed traffic and established norms is flagged as suspicious. The primary advantage lies in its capability to effectively identify unknown suspicious attacks with commendable accuracy.
The current state of the art presents a myriad of intriguing techniques (e.g., [1]-[4]) and tools that have notably bolstered network security by effectively detecting and thwarting malicious traffic. Nevertheless, the challenge persists: cyberattacks persistently wreak havoc, inflicting substantial damage. Hence, any novel contribution that mitigates the risks associated with network traffic would be immensely valued.
This paper introduces a novel technique employing differential analysis to discern suspicious network traffic. The approach first segments traffic into small time slices and transforms each of them into a value in $\mathbb{R}^n$. Subsequently, it computes the divergence between neighboring slices to unveil abrupt changes in traffic behavior. After that, clustering techniques are applied to abstracted intervals to validate traffic homogeneity (a single class) or detect significant variations (multiple classes), indicating potential suspicious activities.
The approach we introduce is geared towards enhancing the efficiency of Security Information and Event Management (SIEM) [5], an integral component of a Security Operations Center (SOC) [6]. A SIEM, such as Wazuh [7], encapsulates a suite of functionalities aimed at gathering, analyzing, and presenting information sourced from network and security devices. It essentially integrates two vital components: Security Information Management (SIM) and Security Event Management (SEM). SIM focuses on storing, analyzing, and reporting log files, while SEM is responsible for real-time monitoring, event correlation, notifications, and console views.
The rest of this paper is organized as follows. Section 2 delves into related works within the field. Section 3 details the methodology of the approach. Section 4 presents three case studies. Finally, concluding remarks are presented in Section 5.
2. Related Work
The state of the art contains many valuable techniques that have significantly improved the security of network services and applications. Here, the study focuses on anomaly-based detection techniques and methods that try to detect suspicious traffic based on IP packet information such as IP addresses (layer 3 in the TCP/IP model), TCP or UDP ports (layer 4) and web application data (layer 5).
Najafabadi et al. proposed in [8] an anomaly detection mechanism for detecting HTTP GET flood attacks. They applied the Principal Component Analysis (PCA)-subspace method to browsing behavior instances extracted from HTTP server logs in order to detect abnormal behaviors, and used the approach to detect some DDoS and HTTP GET flood attacks. This approach uses supervised machine learning techniques.
In [9], Betarte et al. proposed a method based on machine learning to enhance the well-known ModSecurity [10], a Web Application Firewall provided by OWASP, by using one-class classification and n-gram techniques on three datasets. The proposed method uses supervised machine learning techniques and provides better detection and false positive rates than the original version of ModSecurity.
Wang et al. presented in [11] a new web anomaly detection method that uses the Frequent Closed Episode Rules Mining (FCERMining) algorithm to analyze web logs and detect new, unknown web attacks. The method uses supervised machine learning techniques and has a detection rate of 96.67% and a false alarm rate of 3.33% for detecting abnormal users.
In [12], Bronte et al. proposed an anomaly detection approach that uses the cross-entropy technique to calculate three metrics: cross entropy parameters (CEP), cross entropy value (CEV) and cross entropy data type (CET). These metrics measure the deviation between learned request profiles and a new web request. The cross-entropy approach performs better than the Value Length and Mahalanobis distance approaches. This approach uses supervised machine learning techniques, focuses on detecting four types of web attacks (SQLI, XSS, RFI, and DT), and has a detection rate of 66.7%.
Ren et al. presented in [13] a method based on the bag of words (BOW) model to extract features and efficiently detect web attacks with hidden Markov algorithms. BOW has a higher detection rate and a lower false alarm rate than N-gram feature-extraction algorithms. This approach uses supervised machine learning techniques to detect SQL injection and cross-site scripting attacks. The accuracy increased to 96%, while the false alarm rate remained low.
In [14], Pukkawanna et al. proposed a method using port pair distribution and Kullback-Leibler (KL) divergence to detect suspicious flows when the KL divergence deviates from an adaptive 3-sigma rule-based threshold. This approach uses unsupervised machine learning techniques to detect mimicry attacks and does not need any previous learning step.
Hounkpevi proposed in [15] a method using K-means, port pair distribution and the Kullback-Leibler (KL) divergence that improves on [14]. The approach compares the traffic of the current time interval with that of nearby intervals by applying the k-means algorithm. Any significant divergence means that the traffic of the current time interval is suspicious. This approach uses unsupervised machine learning techniques to detect mimicry attacks. The proposed approach seems more efficient than [14].
In [16], Munz et al. presented a novel Network Data Mining approach that applies the K-means clustering algorithm to feature datasets extracted from flow records. Training data containing unlabelled flow records are separated into clusters of normal and anomalous traffic. This approach uses unsupervised machine learning techniques to detect port scans and DoS attacks. A challenge in this approach is to determine the optimal number of clusters.
Asselin et al. presented in [17] an anomaly detection model based on a crawling method and an n-gram model that is effective in reducing the access to the log file generated by web servers. It has been shown to be a good solution for black-box analysis of web applications, but it is not efficient for detecting attacks that use cookie or POST data. This approach uses unsupervised machine learning techniques to detect brute force, DDoS, Crawler Miss, High Load, and Anomalous Query attacks and has a detection rate of 95%.
Swarnkar and Hubballi described, in [18], a new method for payload-based anomaly detection that learns normal behavior and detects deviations. In the training phase, the approach builds a frequency range of occurrences of n-grams from packets; it then counts the number of deviations from this range to detect anomalies. The approach showed fewer false positives and a higher detection rate than Anagram methods.
Kang et al. [19] described a one-class classification method for improving intrusion detection performance for malicious attacks. Result scores were evaluated based on artificially generated instances in two-dimensional space. In the detection phase, the approach relies on simple logic: the center of the normal patterns was set at (0, 0), and the two malicious class centers were set at (1, 1) and (−1, −1), respectively. Experimental results on simulated data show better performance.
Camacho et al. [20] developed a framework that used a PCA-based multivariate statistical process control (MSPC) approach. The framework monitors both the Q-statistic and the D-statistic, which makes it possible to establish control limits and to detect anomalies when these limits are consistently exceeded.
Yoshimura et al. [21] proposed a new model called DOC-IDS, which is an intrusion detection system based on Perera’s deep one-class classification. This approach uses supervised machine learning techniques to detect multiple attacks and has a detection rate of 97%.
Zavrak et al. [22] proposed an intrusion detection and prevention architecture called SAnDet, which is based on an anomaly-based attack detection module that uses the EncDecAD method to detect attacks. This approach uses semi-supervised machine learning techniques to detect DoS and Portscan attacks and has a detection rate of 99.3%.
The evaluation of the previous approaches according to cited criteria is illustrated by Table 1.
Table 1. Evaluation of the approaches.
Author | Techniques | Attack types | Target | Learning types | Logic rules | Training is not required | Multi-target | Detection rate |
Pukkawanna et al. [14], 2015 | Kullback-Leibler (KL) Divergence | Mimicry attacks | TCP/UDP-Ports | unsupervised learning | × | ✓ | × | 12.5% |
Hounkpevi [15], 2020 | Kullback-Leibler (KL) Divergence; k-means algorithm | Mimicry attacks | TCP/UDP-Ports | unsupervised learning | × | ✓ | × | 66.7% |
Najafabadi et al. [8], 2017 | PCA (Principal Component Analysis)-Subspace method | HTTP GET flood attacks, DDoS | HTTP.Url | supervised learning | × | × | × | - |
Betarte et al. [9], 2018 | one-class classification; n-gram | Multi attacks | HTTP.Url | supervised learning | × | × | × | 90% |
Wang et al. [11], 2017 | FCER (Frequent Closed Episode Rules) Mining algorithm | Unknown web attacks | HTTP.Url | supervised learning | × | × | × | 96.67% |
Bronte et al. [12], 2016 | Cross Entropy | SQLI, XSS, RFI, and DT | HTTP.Url | supervised learning | × | × | × | 66.7% |
Ren et al. [13], 2018 | Bag of words (BOW) model; Hidden Markov algorithms | SQL injection and cross-site scripting | HTTP.Url | supervised learning | × | × | × | 96% |
Munz et al. [16], 2007 | K-means algorithm | Port scans and DoS attacks | TCP/UDP-Ports | unsupervised learning | × | ✓ | × | - |
Asselin et al. [17], 2016 | black-box approach (crawling based); N-gram model | brute force, DDoS, Crawler Miss, High Load, Anomalous Query | HTTP.Url | unsupervised learning | × | ✓ | × | 95% |
Yoshimura et al. [21], 2022 | one-class classification | Multi attacks | - | supervised learning | × | × | × | 97% |
Zavrak et al. [22], 2023 | EncDecAD; LSTM | DoS, Portscan | - | semi-supervised learning | × | × | × | 99.3% |
The existing approaches can be evaluated according to several criteria, such as:
Attack Types: The different types of attacks detected by the approach.
Target: The fields of the IP packet that the approach analyzes to detect suspicious behaviors, such as IP address, HTTP.Url and TCP/UDP port.
Learning Types: Whether the approach uses supervised or unsupervised machine learning techniques.
Logic Rules: Whether the approach provides an expressive language, such as temporal logic, to specify a rich variety of malicious traffic (fine-grained specification).
Training is not required: Most existing approaches require a training step, but a few do not.
Multi-Target: The ability of the approach to detect suspicious traffic that requires the analysis of several fields in IP packets at the same time.
Detection Rate: The percentage of malicious traffic that is detected.
3. Methodology
The detection of suspicious traffic is based on the following simple observation: the nature of the traffic should not change suddenly. If it does, the change is suspicious. For example, there is no reason for the traffic during the period P1 = [10 am - 10:30 am] to be very different from the traffic during the period P2 = [10:30 am - 11 am]. However, distinctions might reasonably exist between daytime and nighttime traffic patterns, as well as between traffic from different years.
Let $f$ be a function such that $y = f(x)$ measures a particular feature related to the network traffic (e.g., $x$ is time and $y$ is the number of packets coming from a specific country). If the curve of $f$ is as shown in Figure 1, then there clearly exists a sudden variation between two neighboring points of the curve, which is suspicious.
More precisely, the traffic τ is split into one or many sequences of ordered slices. On each of these slices, we apply a function $F$ that measures some of its features. After that, we compute the distance between successive values of $F$, as shown in Figure 2. A sudden change in $F$ appears when there are large deviations between the measured distances.
Figure 1. Sudden variation in traffic.
Figure 2. Looking for sudden variation in traffic.
The function F does not necessarily yield a single real value in $\mathbb{R}$; its outputs could instead lie in $\mathbb{R}^n$. For example, it might produce a complete distribution that assesses various characteristics across analyzed slices of the trace. In such scenarios, assessing the disparity between F values could involve employing measures like KL-divergence or Euclidean distance.
Furthermore, in determining whether the variation between successive F values exhibits abrupt changes or unacceptable deviations, clustering analysis could be valuable. If the resultant clusters surpass one in number, and the expectation dictates smooth change in traffic distributions across successive slices, we conclude that the analyzed traffic is suspicious.
In the subsequent sections, we elaborate on and formalize all of these analyses.
To maintain simplicity in presenting the approach, we concentrate solely on network traffic. However, it’s important to note that the same concept can be extended to analyze any type of log file.
3.1. Preliminary Notations
In order to state the definition of suspicious traffic formally and more succinctly, it’s essential to establish a set of initial notations.
We assume that network traffic is represented by a sequence of time-stamped IP packets or messages, each of which is a structure that contains a header and a payload. We suppose that we have access to any field (e.g., IP addresses, ports and protocols) of any non-encrypted header of the network protocols (e.g., IP, TCP and UDP) inside intercepted traffic.
Definition 1 (Messages). We denote by $\mathcal{M}$ the set of messages that could be found in the network traffic.
$f$: we use $f$ to range over the possible fields in messages of $\mathcal{M}$. Examples of $f$ are given in Table 2.
$m@f$: if $m$ is a message and $f$ is an attribute, we denote by $m@f$ the value of $f$ in $m$.
Table 2. Examples of attributes.
Stamped messages are called events and are defined as follows:
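Definition 2 (Events). We denote by $\mathcal{E}$ the set of possible events built from $\mathcal{M}$ as follows:
$\mathcal{E} = \{(t, m) \mid t \text{ is a timestamp and } m \in \mathcal{M}\}$
$e@f$: we denote by $e@f$ the value of $f$ in $e$. It is defined as follows: $(t, m)@\mathit{time} = t$ and $(t, m)@f = m@f$, if $f \neq \mathit{time}$.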
A sequence of stamped events forms a trace.
Definition 3 (Trace). A trace $\tau$ over $\mathcal{E}$ is defined using the following BNF grammar:
$\tau ::= \epsilon \mid e.\tau$
where $\epsilon$ is the empty trace. The “.” represents the chronological order, i.e., if $e$ appears before $e'$ in a trace $\tau$, then $e$ necessarily happened earlier than $e'$.
We introduce the following propositional logic, which allows us to verify whether an event in a trace respects some conditions. The main purpose of this language is to define specific patterns of messages we are looking for within the trace, such as messages having a given source or destination IP address or port.
Definition 4 (Propositional Event Logic). Let $f$ be a field name and $v$ be a value. We introduce the Propositional Event Logic (PEL) as follows:
An event $e$ respects a proposition $p$, written $p(e)$, if one of the following conditions holds:
For instance, to know whether $(\text{TCP.DestPort} = 80)(e)$ holds, we check whether $e@\text{TCP.DestPort} \stackrel{?}{=} 80$.
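As an illustration only, propositions over events can be sketched in Python as predicates. The dictionary encoding of events and the helper names (`at`, `eq`, `neg`, `conj`, `disj`) are assumptions made for this sketch; the connectives shown are the usual propositional ones and may not match the exact PEL grammar.

```python
from typing import Any, Callable, Dict

# Assumed encoding: an event maps field names (plus "time") to values.
Event = Dict[str, Any]
# A proposition is a predicate over events.
Prop = Callable[[Event], bool]


def at(e: Event, field: str) -> Any:
    """Value of `field` in event e (the e@field notation); None if absent."""
    return e.get(field)


def eq(field: str, value: Any) -> Prop:
    """Atomic proposition (field = value)."""
    return lambda e: at(e, field) == value


def neg(p: Prop) -> Prop:
    """Negation of a proposition."""
    return lambda e: not p(e)


def conj(p: Prop, q: Prop) -> Prop:
    """Conjunction of two propositions."""
    return lambda e: p(e) and q(e)


def disj(p: Prop, q: Prop) -> Prop:
    """Disjunction of two propositions."""
    return lambda e: p(e) or q(e)


# Hypothetical event, used only to exercise the helpers.
event = {"time": 4, "IP.Prot": 6, "TCP.DestPort": 80}
is_web = eq("TCP.DestPort", 80)
print(is_web(event))  # True
```

Encoding propositions as closures keeps them composable, so a slicing or counting step can accept any combination of conditions without knowing how it was built.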
3.2. Trace Slicing
This step requires meticulous attention to ensure the approach’s effectiveness is maximized. It’s important to decompose the trace into one or multiple sequences of slices characterized by smooth variations. The end user must have a clear understanding of their activity’s nature to identify instances where sudden changes should not occur. Below, we provide some illustrative examples:
Significant and sudden fluctuations in traffic volume are often indicative of potential Denial of Service (DoS) attacks. To detect this activity, it’s appropriate to divide the traffic trace τ into successive discrete slices, denoted as $\tau_1, \ldots, \tau_n$, each representing a predefined time window, such as 10 minutes.
The previous analysis will be more precise and efficient if we separate the traffic of different IP addresses. Input traffic can also be separated from output traffic. A sudden variation in input traffic can be due to a DoS attack, whereas a variation in output traffic can be generated by malware (e.g., botnet) activity. This kind of separation therefore reveals both the IP address involved in the suspicious traffic and the nature of the attack.
Input and output traffic of different IP addresses can be further separated into traffic related to different IP protocols and TCP ports.
The previous divisions can be further refined, as we will show in the case study section. For instance, we can separate the traffic of different days of the week. By doing so, we assume that traffic related to successive Mondays should not present a sudden change.
The forthcoming definition introduces a slicing function designed to partition a trace, catering to diverse scenarios and requirements.
Definition 5 (Slicing). Let p be a propositional formula in PEL and τ be a trace in
. We inductively introduce a slicing function
as follows:
Let
denote propositions. We extend the selection function to operate on sets of sequences of propositions as follows:
If
is a proposition that depends on i, we use the notation
as an abbreviation of
where n is the natural number such that
and
. For instance:
is same as
, and
is same as
, where:
Example 1 (Selection). Let τ be the trace containing the traffic captured between 10:00:000 and 10:00:052 focusing on IP.Prot as shown by Table 3.
Table 3. Captured traffic.
Let
. When slicing τ using φ, we compute
, resulting in the sequence
, as illustrated in Table 4.
Table 4. Sliced captured traffic.
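A minimal sketch of time-based slicing, assuming events are dictionaries with a numeric `time` field (milliseconds here) and that a slice keeps the events satisfying a proposition, in their original order. The names `select`, `time_window` and `slice_trace`, and the sample trace, are hypothetical.

```python
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]
Prop = Callable[[Event], bool]


def select(p: Prop, trace: List[Event]) -> List[Event]:
    """Keep the events of the trace that satisfy p, preserving their order."""
    return [e for e in trace if p(e)]


def time_window(start: float, end: float) -> Prop:
    """Proposition that holds for events with start <= time < end."""
    return lambda e: start <= e["time"] < end


def slice_trace(props: List[Prop], trace: List[Event]) -> List[List[Event]]:
    """One slice per proposition, in the order the propositions are given."""
    return [select(p, trace) for p in props]


# Made-up trace: timestamps in ms and IP protocol numbers.
trace = [{"time": t, "IP.Prot": prot}
         for t, prot in [(0, 6), (4, 17), (12, 6), (18, 1), (25, 6)]]
windows = [time_window(10 * i, 10 * (i + 1)) for i in range(3)]
slices = slice_trace(windows, trace)
print([len(s) for s in slices])  # [2, 2, 1]
```

The same `slice_trace` helper could take propositions on IP addresses, direction or day of the week instead of time windows, which is how the refinements listed above would be expressed.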
3.3. Feature Measuring
Each slice, derived from the preceding step, is transformed into an element of $\mathbb{R}^n$ ($n \geq 1$) by quantifying certain characteristics through a predefined function F. For simplicity, we concentrate on a class of functions F that produce distributions by tallying events adhering to specified conditions, as delineated in the following definition:
Definition 6 (Feature Measuring Function). Let $q$ be a propositional formula in PEL and $\tau$ be a trace. We introduce a feature measuring function $F_q$ inductively as follows:
$F_q(\epsilon) = 0$, and $F_q(e.\tau) = 1 + F_q(\tau)$ if $q(e)$ holds, $F_q(e.\tau) = F_q(\tau)$ otherwise.
Broadly speaking, $F_q(\tau)$ returns the number of packets in $\tau$ that satisfy the property $q$.
We also extend the measuring function to operate on both a sequence of propositions and a set of traces as follows:
Example 2. Let’s examine the trace provided in Example 1. Let
such that
,
,
and
, then when applying the function
to the slices
as depicted in Table 4, the resulting outcomes are as illustrated in Table 5.
Table 5. Quantification of slices using $F$.
For instance, the value $(1, 3, 1, 0)$ indicates that the corresponding slice contains 1 packet with IP.Prot = 1, 3 packets with IP.Prot = 6, 1 packet with IP.Prot = 17, and 0 packets with other IP.Prot values.
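The counting step can be sketched as follows, assuming events are dictionaries and propositions are Python predicates. The helper names (`count`, `measure`, `prot_is`, `prot_other`) and the sample slice are hypothetical; the slice is built so that it reproduces the tuple discussed above.

```python
from typing import Any, Callable, Dict, List, Tuple

Event = Dict[str, Any]
Prop = Callable[[Event], bool]


def count(q: Prop, trace: List[Event]) -> int:
    """Number of events in the trace that satisfy q."""
    return sum(1 for e in trace if q(e))


def measure(props: List[Prop], trace: List[Event]) -> Tuple[int, ...]:
    """Tuple of counts, one per proposition, for a single slice."""
    return tuple(count(q, trace) for q in props)


def prot_is(number: int) -> Prop:
    """Proposition IP.Prot = number."""
    return lambda e: e["IP.Prot"] == number


def prot_other(e: Event) -> bool:
    """Proposition: IP.Prot is none of 1 (ICMP), 6 (TCP), 17 (UDP)."""
    return e["IP.Prot"] not in (1, 6, 17)


# Made-up slice: one ICMP packet, three TCP packets, one UDP packet.
one_slice = [{"IP.Prot": p} for p in (6, 1, 6, 17, 6)]
features = measure([prot_is(1), prot_is(6), prot_is(17), prot_other], one_slice)
print(features)  # (1, 3, 1, 0)
```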
The distributions of these slices serve as inputs to algorithms like KL-Divergence, enabling the measurement of traffic divergence across distinct slices. However, in cases where certain events are absent during observation, their frequencies register as zero, posing a challenge for computing KL-Divergence and potentially leading to division by zero errors. To address this issue, we must either explore alternative divergence techniques or slightly adjust the data distribution through methods such as smoothing. The following definition illustrates one of the well-known smoothing techniques.
Definition 7 (Laplace Smoothing). Let
be a sequence of real numbers. We denote by
the k-Laplace Smoothing Distribution (k-LSD) of a trace and we define it as follows:
We augment the function
with Laplace smoothing as follows:
Definition 8 (Feature Measuring Function with Smoothing). We denote by
, the smoothed version of
achieved through the application of the smoothing function
. More formally:
Example 3. By applying
to column 2 of Table 5, we obtain
as shown by column 3 of Table 6.
Table 6. Quantification and smoothing of slices using the smoothed version of $F$.
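A short sketch of additive (Laplace) smoothing in its usual form, where each count $x_i$ becomes $(x_i + k) / (\sum_j x_j + k\,n)$. Whether this matches the k-LSD of Definition 7 term for term is an assumption, and the function name `laplace_smooth` is hypothetical.

```python
from typing import List, Sequence


def laplace_smooth(counts: Sequence[float], k: float = 1.0) -> List[float]:
    """Additive smoothing: add k to every count, then normalize to sum to 1."""
    if k <= 0:
        raise ValueError("k must be positive so that no probability is zero")
    total = sum(counts) + k * len(counts)
    return [(c + k) / total for c in counts]


smoothed = laplace_smooth((1, 3, 1, 0), k=1)
print([round(x, 3) for x in smoothed])  # [0.222, 0.444, 0.222, 0.111]
```

With `k = 1`, the zero count in the last position becomes 1/9, so a later KL-Divergence computation never divides by zero or takes the logarithm of zero.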
When detecting suspicious activities within traffic data, it can be advantageous to prioritize specific positions within the values returned by $F$ in $\mathbb{R}^n$. For instance, if $F$ yields $(x_1, \ldots, x_n)$ where each $x_i$ represents traffic originating from a specific country, these values might be weighted according to the respective country’s reputation in cyberattacks, assigning greater weight to countries with negative reputations. At present, there is no systematic approach to guide end users in determining these weight values. However, we believe that fine-tuning these weights based on intuition could enhance detection capabilities.
The subsequent definition formalizes the concept of weights.
Definition 9 (Weighting Function ω). We denote by ω a weighting function that accepts weights in
, a tuple in
, and returns a probability distribution, i.e.:
.
Let
be in
). We extend ω to a set
and a sequences
of tuples as follows:
The following definition provides an example of ω.
Definition 10 (Scalar Product Weighting Function). We define the scalar product weighting function, abbreviated as spw, as follows:
where $w \cdot u$ is the scalar product of the two vectors $w$ and $u$, i.e., $w \cdot u = \sum_{i=1}^{n} w_i u_i$.
We extend the function
by incorporating a weighting function as follows:
Definition 11 (Feature Measuring Function with Smoothing and Weighting). Let ω be a weighting function. In the sequel, we denote by
, the weighted version of
using the weighting function ω. More precisely:
and for any trace τ and a weight vector w, we have:
Example 4. Let’s examine the trace provided in Example 3. Suppose we aim to prioritize packets whose IP.Prot value is not in {1, 6, 17}. As an example, we apply the weighting function
with weights
. The results are illustrated in Table 7.
Table 7. Slice distribution.
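One plausible reading of a scalar product weighting function is sketched below: each component $u_i$ is multiplied by its weight $w_i$, and the result is divided by $w \cdot u$ so that it sums to 1. This normalization is an assumption consistent with the requirement that ω return a probability distribution; the function name `spw` follows the abbreviation in the text, and the weights and input vector are made up.

```python
from typing import List, Sequence


def spw(w: Sequence[float], u: Sequence[float]) -> List[float]:
    """Weight u component-wise by w and renormalize by the scalar product w.u."""
    if len(w) != len(u):
        raise ValueError("w and u must have the same length")
    dot = sum(wi * ui for wi, ui in zip(w, u))
    if dot == 0:
        raise ValueError("the scalar product w.u must be non-zero")
    return [wi * ui / dot for wi, ui in zip(w, u)]


# Hypothetical weights that triple the importance of the last position.
weights = (1, 1, 1, 3)
distribution = (0.2, 0.4, 0.3, 0.1)
weighted = spw(weights, distribution)
print([round(x, 3) for x in weighted])  # [0.167, 0.333, 0.25, 0.25]
```

With uniform weights the function returns the input distribution unchanged, which corresponds to the case where every position has the same importance.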
3.4. Divergence Measuring
After abstracting and transforming the traffic into smoothed distributions, the next step involves measuring the divergence between adjacent slices within each sequence. To achieve this, we employ a divergence function such as the KL-Divergence.
Definition 12 (Divergence Function). A divergence measuring function, denoted by Δ, can be any function with the following signature: $\Delta : \mathbb{R}^n \times \mathbb{R}^n \rightarrow \mathbb{R}$.
Examples of divergence measuring functions are given in Table 8.
Table 8. Examples of divergence functions.
Divergence Δ | KL-Divergence [23] | Cosine [24] | TF-IDF [25] | … |
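Notice that, since the KL-Divergence, usually denoted by $D_{KL}(P \| Q)$, between two distributions $P$ and $Q$ is not commutative (i.e., $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$ in general, as shown by Equations (1) and (2)), we can consider $D_{KL}(P \| Q) + D_{KL}(Q \| P)$ as the divergence value.
$D_{KL}(P \| Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}$ (1)
$D_{KL}(Q \| P) = \sum_{i} Q(i) \log \frac{Q(i)}{P(i)}$ (2)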
Example 5. We apply the KL-Divergence to the trace of Example 4. The result is shown by Table 9.
Table 9. Slice distribution.
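Slice $i$ ($P$: slice $i$, $Q$: slice $i+1$) | $D_{KL}(P \| Q)$ | $D_{KL}(Q \| P)$ | $D_{KL}(P \| Q) + D_{KL}(Q \| P)$ |
1 | 0.049 | 0.051 | 0.100 |
2 | 0.0755 | 0.066 | 0.142 |
3 | 1.3435 | 1.2440 | 2.587 |
4 | 0.5107 | 0.6265 | 1.137 |
5 | 0.4024 | 0.5388 | 0.941 |
6 | - | - | - |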
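The symmetric divergence between successive slices can be sketched as below. The natural logarithm is an assumption (the base is not stated in the text), the helper names are hypothetical, and the inputs are expected to be smoothed distributions with no zero entries.

```python
import math
from typing import List, Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL(P || Q) with the natural logarithm; q must have no zero entry."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL(P || Q) + D_KL(Q || P), which does not depend on argument order."""
    return kl(p, q) + kl(q, p)


def successive_divergences(dists: Sequence[Sequence[float]]) -> List[float]:
    """Symmetric divergence between each slice distribution and the next one."""
    return [symmetric_kl(a, b) for a, b in zip(dists, dists[1:])]


# Made-up smoothed distributions for three successive slices.
slices = [(0.5, 0.5), (0.5, 0.5), (0.9, 0.1)]
divergences = successive_divergences(slices)
print(divergences[0])  # 0.0
```

A sequence of n slices yields n - 1 divergence values, which is why the last slice has no entry of its own.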
3.5. Divergence Clustering
After quantifying the divergence between successive slices of traces, the next step is to ascertain if significant abrupt changes have occurred. To accomplish this, we estimate the number of clusters generated by the divergence values. If this count exceeds one, we infer that the trace contains suspicious traffic.
Definition 13 (Clustering). Let
be a clustering algorithm that estimates the optimal number of clusters N associated with a dataset in
. It returns true if $N > 1$, indicating that the threshold for suspicious activity has been surpassed, and false otherwise.
We are particularly interested in
. When
returns true, it indicates that the traffic is considered suspicious. Examples of the
function are provided in Table 10.
Table 10. Examples of clustering functions.
Example 6. Let’s apply the K-means algorithm with the Elbow Method to compute
on the trace from the previous example, as illustrated in Table 11.
Table 11. K-means results.
Cluster 1 | Cluster 2 |
0.100, 0.142, 0.941, 1.137 | 2.587 |
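For one-dimensional data such as divergence values, k-means can be solved exactly, because optimal clusters are contiguous runs of the sorted values. The sketch below is a stand-in for a library K-means, not the implementation used in the paper: it computes the inertia (within-cluster sum of squares) for a given k by dynamic programming, which is the quantity an Elbow plot displays. The function names are hypothetical, and the Elbow decision itself is left to inspection of the inertia curve.

```python
from typing import List, Sequence, Tuple


def sse(xs: Sequence[float]) -> float:
    """Sum of squared deviations from the mean (inertia of one cluster)."""
    if not xs:
        return 0.0
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)


def kmeans_1d(values: Sequence[float], k: int) -> Tuple[float, List[List[float]]]:
    """Exact 1-D k-means: returns (inertia, clusters) for k clusters."""
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of values")
    inf = float("inf")
    # best[c][j]: minimal inertia when the first j values form c clusters.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                cost = best[c - 1][i] + sse(xs[i:j])
                if cost < best[c][j]:
                    best[c][j], cut[c][j] = cost, i
    # Walk the stored cut points backwards to recover the clusters.
    clusters, j = [], n
    for c in range(k, 0, -1):
        i = cut[c][j]
        clusters.append(xs[i:j])
        j = i
    return best[k][n], clusters[::-1]


divergences = [0.100, 0.142, 2.587, 1.137, 0.941]
inertia_1, _ = kmeans_1d(divergences, 1)
inertia_2, clusters = kmeans_1d(divergences, 2)
print(clusters)  # [[0.1, 0.142, 0.941, 1.137], [2.587]]
```

On the five divergence values of the example, the two-cluster partition isolates 2.587, in agreement with Table 11, and the inertia drops from about 4.09 for one cluster to about 0.86 for two.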
3.6. Suspicious Traffic Detection
Now, we have all the necessary ingredients to define suspicious traffic.
Definition 14 (Suspicious Traffic)
Let τ be a trace.
Let
be a sequence of n propositions.
Let
be a sequence of m propositions.
Let
be a weight vector in
.
Let
be a divergence measuring function such as KL-Divergence.
Let
be a clustering algorithm that estimated the best number of clusters N related to the set of data in
and returns true if $N > 1$, false otherwise.
We define
, a generic function designed to detect suspicious traffic within an analyzed trace τ, as follows:
The Suspicious function integrates various analyses, conducted in the sequence depicted in Figure 3, and returns true if the traffic is deemed suspicious, and false otherwise. It requires three functions, ω, Δ, and
, as well as four parameters: τ, w, φ, and ψ.
Figure 3. Steps involved in detecting suspicious traffic.
Example 7. Let’s apply the Suspicious function to the trace provided in Example 1 to ascertain if there exists a sudden change. Based on the results shown in Table 11, where
generates more than one cluster, we deduce that:
The suspicious traffic is triggered on slice
.
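The composition of the steps can be sketched as a single higher-order function. This is an illustrative reading of the pipeline, not the paper's formal definition: the argument order, the built-in additive smoothing, and the stand-in functions used in the demonstration (an identity weighting, an L1 distance instead of KL-Divergence, and a crude spread test instead of K-means with the Elbow Method) are all assumptions chosen to keep the example short.

```python
from typing import Any, Callable, Dict, List, Sequence

Event = Dict[str, Any]
Prop = Callable[[Event], bool]


def suspicious(trace: List[Event], w: Any, phi: Sequence[Prop], psi: Sequence[Prop],
               omega: Callable, delta: Callable, cluster: Callable,
               k: float = 1.0) -> bool:
    """Slice, measure, smooth, weight, compare successive slices, then cluster."""
    divergences = []
    previous = None
    for p in phi:                                   # one slice per proposition
        piece = [e for e in trace if p(e)]
        counts = [sum(1 for e in piece if q(e)) for q in psi]
        total = sum(counts) + k * len(counts)       # additive smoothing
        current = omega(w, [(c + k) / total for c in counts])
        if previous is not None:
            divergences.append(delta(previous, current))
        previous = current
    return cluster(divergences)


def window(start: float, end: float) -> Prop:
    return lambda e: start <= e["time"] < end


def proto(number: int) -> Prop:
    return lambda e: e["IP.Prot"] == number


# Stand-ins for the three function parameters (assumptions, see lead-in).
identity = lambda w, u: u
l1 = lambda p, q: sum(abs(a - b) for a, b in zip(p, q))
spread = lambda ds: bool(ds) and max(ds) - min(ds) > 1.0

phi = [window(10 * i, 10 * (i + 1)) for i in range(4)]
psi = [proto(6), proto(17)]
steady = [{"time": t, "IP.Prot": 6} for t in range(40)]
shifted = [{"time": t, "IP.Prot": 6 if t < 30 else 17} for t in range(40)]
print(suspicious(steady, None, phi, psi, identity, l1, spread))   # False
print(suspicious(shifted, None, phi, psi, identity, l1, spread))  # True
```

Passing the weighting, divergence and clustering steps as parameters mirrors the way the Suspicious function is described as requiring three functions in addition to the trace, the weights and the two sequences of propositions.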
4. Case Study
In this section, we present three cases of detecting suspicious activities using two distinct datasets comprising real traffic. The first case involves detecting suspicious activities based on daily patterns in the dataset from [29]. The second and third cases utilize the UNSW-NB15 dataset [30] to detect suspicious traffic by analyzing TCP and DNS traffic, respectively.
4.1. Detecting Suspicious Activities Based on Days of the Week
An example of an interesting dataset with real traffic is available at [29]. It contains 21,000 rows and covers the traffic of 10 workstations with local IP addresses over a period of three months. Half of these local IP addresses were compromised at some point during this period, became members of different botnets, and generated abnormal traffic.
A screenshot of a part of the dataset is shown in Figure 4, where:
date: yyyy-mm-dd (from 2006-07-01 through 2006-09-30);
l_ipn: local IP address (coded as an integer from 0-9);
r_asn: remote ASN (an integer which identifies the remote Autonomous System Network);
f: flows (number of connections during the corresponding day).
Figure 4. A part of the dataset provided by [29].
We try to detect the infected computer based on the following assumption for each workstation of the network: the nature of traffic may vary across different days of the week. For instance, weekend traffic could differ significantly from that of weekdays. However, when we consider a specific day, such as Monday, there is no compelling reason for it to undergo substantial changes from one week to another. This implies that Monday’s traffic should remain relatively consistent across all weeks. A similar pattern is expected for other days of the week, such as Tuesday, Wednesday, and so forth.
Based on the assumption, we proceed as follows: we segregate the traffic associated with each workstation and day of the week into distinct files. With ten workstations and seven days a week, this results in a total of 70 files. Subsequently, each of these files undergoes analysis to identify any abrupt changes.
Here are the values of the parameters used by the function GSuspicious to detect suspicious traffic.
τ (trace): the dataset available at [29].
The trace is split into various slices, each exclusively comprising traffic linked to a specific IP address and a designated day of the week. To illustrate, for IP address 0, distinct slices are allocated for Mondays, Tuesdays, and so forth. Similar slices are built for IP addresses 1 to 9. With this division, we implicitly assume that, for any IP address, the traffic of different Mondays should be quite similar, and likewise for the other days of the week. More formally, the slicing is based on the following set of propositions:
where
and
represents the number of events in the trace.
Let
where
are the different values that appear in the column r_asn, presented in ascending order.
. This captures the fact that each element of the partition has the same weight.
Δ is the KL-Divergence.
Let
be the composition of the K-means and Elbow methods: K-means performs the clustering, and the Elbow Method estimates the best number of clusters.
All these fixed parameters are the input of our Suspicious function, which concludes whether the traffic is suspicious or not. This function proceeds as follows:
After applying the function
to the dataset, we obtain a separate file for each IP address and each day of the week. For instance, for IP address 0 and Monday, we generate a file that will be analyzed independently for suspicious traffic. This file aggregates traffic not only from a single Monday but from multiple Mondays, and our objective is to detect any sudden changes in the distribution of traffic from one Monday to another. We repeat this process for the other days of the week and for the remaining IP addresses.
The traffic from each IP address and each day of the week undergoes transformation through the function
, resulting in a point in
, where each dimension represents the number of connections related to every r_asn, and n is the total number of r_asn.
Thanks to the function Δ, we quantify the divergence between every two successive Mondays for each IP address, and we repeat this process for the other days of the week as well.
Using the function
(composition of the K-means and the Elbow Method), we estimate the number of clusters generated by the previous steps.
If we observe two or more clusters for any analyzed sequence, we infer that the traffic is suspicious.
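The first two steps above can be sketched as follows, assuming the dataset is available as rows with the columns `date`, `l_ipn`, `r_asn` and `f` described earlier. The function names and the sample rows are hypothetical; `date.weekday()` from the Python standard library returns 0 for Monday.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, List


def group_by_host_and_weekday(rows: Iterable[Dict[str, str]]):
    """Map (l_ipn, weekday) -> calendar day -> r_asn -> total number of flows."""
    groups = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for row in rows:
        day = date.fromisoformat(row["date"])
        key = (int(row["l_ipn"]), day.weekday())    # Monday is 0
        groups[key][day][int(row["r_asn"])] += int(row["f"])
    return groups


def flow_vectors(per_day, asns: List[int]) -> List[List[int]]:
    """One vector of flow counts per calendar day, in chronological order."""
    return [[per_day[d].get(a, 0) for a in asns] for d in sorted(per_day)]


# Made-up rows in the format of the dataset (values are illustrative only).
rows = [
    {"date": "2006-07-03", "l_ipn": "0", "r_asn": "701", "f": "5"},
    {"date": "2006-07-10", "l_ipn": "0", "r_asn": "701", "f": "2"},
    {"date": "2006-07-10", "l_ipn": "0", "r_asn": "714", "f": "7"},
    {"date": "2006-07-04", "l_ipn": "0", "r_asn": "701", "f": "1"},
]
groups = group_by_host_and_weekday(rows)
print(flow_vectors(groups[(0, 0)], [701, 714]))  # [[5, 0], [2, 7]]
```

Each key of `groups` plays the role of one of the 70 files, and each vector returned by `flow_vectors` is the raw count vector of one day before smoothing and divergence are applied.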
Below, we present the results obtained from the Elbow Method corresponding to the different days and IP addresses.
1) Monday: Based on the analysis of Monday traffic depicted in Figure 5, we identify five non-suspicious machines (3, 5, 6, 7, and 9) and five suspicious machines (0, 1, 2, 4, and 8).
Figure 5. Elbow results for every Monday.
2) Tuesday: Based on the analysis of Tuesday traffic depicted in Figure 6, we identify five non-suspicious machines (3, 5, 6, 7, and 9) and five suspicious machines (0, 1, 2, 4, and 8).
3) Wednesday: Based on the analysis of Wednesday traffic shown in Figure 7, we observe five non-suspicious machines (3, 5, 6, 7, and 9) and five suspicious machines (0, 1, 2, 4, and 8).
4) Thursday: According to the analysis depicted in Figure 8, we identify five non-suspicious machines (3, 5, 6, 7, and 9) and five suspicious machines (0, 1, 2, 4, and 8) on Thursday.
5) Friday: Based on the analysis presented in Figure 9, we observe five non-suspicious machines (3, 5, 6, 7, and 9) and five suspicious machines (0, 1, 2, 4, and 8) on Friday.
Figure 6. Elbow results for every Tuesday.
Figure 7. Elbow results for every Wednesday.
Figure 8. Elbow results for every Thursday.
Figure 9. Elbow results for every Friday.
6) Saturday: According to the analysis shown in Figure 10, we can identify five non-suspicious machines (3, 5, 6, 7, and 9) and five suspicious machines (0, 1, 2, 4, and 8) on Saturday.
Figure 10. Elbow results for every Saturday.
7) Sunday: Based on the analysis presented in Figure 11, we observed five non-suspicious machines (3, 5, 6, 7, and 9) and five suspicious machines (0, 1, 2, 4, and 8) on Sunday.
Here are the conclusions extracted from Figures 5-11:
Five clear elbows show that the traffic of the machines with l_ipn values 0, 1, 2, 4, and 8 forms more than one cluster; these machines are therefore the origins of the suspicious traffic listed in Table 12.
The five machines with l_ipn values 3, 5, 6, 7, and 9 show no elbow, meaning that their traffic forms a single cluster; they are therefore not associated with any suspicious traffic listed in Table 12.
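The link between "no elbow" and "one cluster" can be illustrated with a small sketch. It runs a one-dimensional K-means on a sequence of divergence values, records the distortion for each k, and reads off the elbow. The decision rule used here is an assumption made for illustration: a near-zero distortion at k = 1 (threshold `flat_tol`) is treated as a flat curve, and otherwise the elbow is the first k after which the distortion improves by less than `drop_ratio`. Both thresholds and all function names are hypothetical.

```python
def kmeans_1d_distortion(values, k, iters=50):
    """Lloyd's algorithm on scalars; returns the sum of squared distances."""
    xs = sorted(values)
    # Deterministic initialization: spread the centers over the sorted values.
    centers = [xs[(2 * i + 1) * len(xs) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda c: (x - centers[c]) ** 2)
            groups[nearest].append(x)
        updated = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    return sum(min((x - c) ** 2 for c in centers) for x in xs)

def estimate_clusters(values, max_k=4, flat_tol=1e-6, drop_ratio=0.5):
    """Estimate the number of clusters from the elbow of the distortion curve."""
    if len(values) < 2:
        return 1
    ks = range(1, min(max_k, len(values)) + 1)
    distortions = [kmeans_1d_distortion(values, k) for k in ks]
    if distortions[0] <= flat_tol:
        return 1  # flat curve: no elbow, a single cluster
    for i in range(1, len(distortions)):
        prev, cur = distortions[i - 1], distortions[i]
        if prev <= flat_tol or prev - cur < drop_ratio * prev:
            return i  # adding a cluster beyond k = i no longer helps much
    return len(distortions)

def is_suspicious(values):
    """Two or more clusters in a sequence are taken as a sign of suspicious traffic."""
    return estimate_clusters(values) >= 2

stable = [0.0, 0.0, 0.0, 0.0, 0.0]            # divergences that never change
shifted = [0.01, 0.02, 0.01, 5.0, 5.1, 0.02]  # divergences with a sudden jump
```

With these toy sequences, `stable` yields a flat distortion curve and is not flagged, while `shifted` produces a sharp drop between k = 1 and k = 2 and is flagged. A library implementation of K-means could replace the hand-written loop; the pure-Python version is used only to keep the sketch self-contained.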
Figure 11. Elbow results for every Sunday.
Table 12. Detecting suspicious traffic based on days of the week.
| | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday |
|---|---|---|---|---|---|---|---|
| Unsuspicious | 3, 5, 6, 7, 9 | 3, 5, 6, 7, 9 | 3, 5, 6, 7, 9 | 3, 5, 6, 7, 9 | 3, 5, 6, 7, 9 | 3, 5, 6, 7, 9 | 3, 5, 6, 7, 9 |
| Suspicious | 0, 1, 2, 4, 8 | 0, 1, 2, 4, 8 | 0, 1, 2, 4, 8 | 0, 1, 2, 4, 8 | 0, 1, 2, 4, 8 | 0, 1, 2, 4, 8 | 0, 1, 2, 4, 8 |
| Total Suspicious | 0, 1, 2, 4, 8 | | | | | | |
| | Predicted Negative | Predicted Positive |
|---|---|---|
| Actual Negative | True Negative (TN) | False Positive (FP) |
| Actual Positive | False Negative (FN) | True Positive (TP) |
Our approach predicted that 5 of the 10 local IPs are botnets, and exactly these 5 local IPs are actual botnets. Therefore:
| | Predicted Negative | Predicted Positive | Total |
|---|---|---|---|
| Actual Negative | 5 | 0 | 5 |
| Actual Positive | 0 | 5 | 5 |
| Total | 5 | 5 | 10 |
It follows that:
.
.
.
False Negative (FN) = 0%.
False Positive (FP) = 0%.
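For reference, the usual evaluation metrics can be derived from the four cells of a confusion matrix as in the sketch below. The selection of metrics (accuracy, precision, recall, and the two error rates) and the function name are assumptions made for illustration; the cell values passed in the example are those of the botnet confusion matrix above.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Standard metrics from confusion-matrix counts; empty denominators give 0.0."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
    }

# Botnet experiment: 5 true negatives, 5 true positives, no errors.
botnet = confusion_metrics(tn=5, fp=0, fn=0, tp=5)
```

With no false positives and no false negatives, every rate in `botnet` is either 1.0 (accuracy, precision, recall) or 0.0 (the two error rates), which agrees with the 0% figures reported above.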
Performance: Our code was executed on an Ubuntu virtual machine equipped with 2 cores and 4 GB of RAM, running on a 2.3 GHz Intel Core i9 processor. Processing the entire dataset, which consists of 21,000 rows covering 10 workstations over a three-month period, took approximately 51.7 seconds.
4.2. Detecting Suspicious Activities Based on DNS and HTTP Traffic
The UNSW-NB15 dataset [30] was generated using the IXIA PerfectStorm tool. It encompasses nine categories of modern attack types and incorporates realistic normal traffic behavior. The dataset comprises 49 features across various categories, some of which are illustrated in Figure 12. Used as an attack tool, IXIA dispatches both benign and malicious traffic to different network nodes. A sample of selected fields from this traffic is shown in Table 13.
Figure 12. UNSW-NB15: example of features.
The network contains three subnetworks, as shown in Figure 13:
1) Subnetwork 1 (server1): contains nodes with source IP addresses from 59.166.0.0 to 59.166.0.9.
2) Subnetwork 2 (server2): contains nodes with source IP addresses from 175.45.176.0 to 175.45.176.3.
3) Subnetwork 3 (server3): contains nodes with source IP addresses from 149.171.126.0 to 149.171.126.19.
Table 13. UNSW-NB15 samples.
Figure 13. UNSW-NB15 network.
Subnetwork 1 (server1) and subnetwork 3 (server3) are configured to exhibit normal traffic patterns, whereas subnetwork 2 (server2) is associated with abnormal or malicious activities.
We employ our approach across various source IP addresses within all subnetworks. Our assumption is that the nature of the outbound traffic should not undergo sudden changes.
4.2.1. Detecting Suspicious Activities Based on DNS Traffic
Below are the values of the required parameters used within the Suspicious function for detecting suspicious traffic:
τ (trace): it is the dataset available at [30].
Let
where
are the different IP source addresses appearing in τ.
UDP.SourcePort
Let
, where
are the different IP destination addresses appearing in τ.
.
Δ is the KL-Divergence.
Let
is the composition of the K-means and Elbow methods. K-means performs the clustering and the Elbow method estimates the best number of clusters.
Below, we present the results obtained from the Elbow Method corresponding to the different subnetworks (servers).
1) Subnetwork 1 (server1), source IP addresses from 59.166.0.0 to 59.166.0.9: shown in Figure 14.
Figure 14. Elbow results for sub network1 (server1) based on DNS services.
2) Subnetwork 2 (server2), source IP addresses from 175.45.176.0 to 175.45.176.3: shown in Figure 15.
Figure 15. Elbow results for sub network2 (server2) based on DNS services.
3) Subnetwork 3 (server3), source IP addresses from 149.171.126.0 to 149.171.126.19: shown in Figure 16.
Figure 16. Elbow results for sub network3 (server3) based on DNS services.
From Figure 14, the maximum distortion value for subnetwork 1 (server1) is above 8. In Figure 15, the maximum distortion value for subnetwork 2 (server2) exceeds 80. Meanwhile, in Figure 16, the maximum distortion value for subnetwork 3 (server3) is above 70, but for only one IP address, whereas more than 20 IP addresses have a distortion value of zero.
Based on these findings, our approach predicts that subnetwork 2 (server2) is suspicious, as it exhibits abnormal or malicious activities in the network traffic. Therefore:
| | Predicted Negative | Predicted Positive | Total |
|---|---|---|---|
| Actual Negative | 2 | 0 | 2 |
| Actual Positive | 0 | 1 | 1 |
| Total | 2 | 1 | 3 |
It follows that:
.
.
.
False Negative (FN) = 0%.
False Positive (FP) = 0%.
4.2.2. Detecting Suspicious Activities Based on HTTP Traffic
Below are the values of the required parameters used within the Suspicious function for detecting suspicious traffic:
τ (trace): it is the dataset available at [30].
Let
where
are the different IP source addresses appearing in τ.
UDP.SourcePort
Let
, where
are the different IP destination addresses appearing in τ.
.
Δ is the KL-Divergence.
Let
is the composition of K-means and Elbow Methods.
Below, we present the results obtained from the Elbow Method for different subnetworks (servers).
1) Subnetwork 1 (server1), source IP addresses from 59.166.0.0 to 59.166.0.9: shown in Figure 17.
2) Subnetwork 2 (server2), source IP addresses from 175.45.176.0 to 175.45.176.3: shown in Figure 18.
3) Subnetwork 3 (server3), source IP addresses from 149.171.126.0 to 149.171.126.19: shown in Figure 19.
From Figure 17, the maximum distortion value for subnetwork 1 (server1) is above 4. In Figure 18, the maximum distortion value for subnetwork 2 (server2) exceeds 2. Meanwhile, in Figure 19, the maximum distortion value for subnetwork 3 (server3) is above 0.008, but only for one IP address, while more than 20 IP addresses have a distortion value of zero.
Based on these findings, our approach predicts that subnetworks 1 (server1) and 2 (server2) are suspicious, as they exhibit abnormal or malicious activities in the network traffic. Therefore:
Figure 17. Elbow results for sub network1 (server1) based on HTTP services.
Figure 18. Elbow results for sub network2 (server2) based on HTTP services.
Figure 19. Elbow Results for sub network3 (server3) based on HTTP services.
| | Predicted Negative | Predicted Positive | Total |
|---|---|---|---|
| Actual Negative | 1 | 1 | 2 |
| Actual Positive | 0 | 1 | 1 |
| Total | 1 | 2 | 3 |
It follows that:
.
.
.
False Negative (FN) = 0%.
False Positive (FP)
.
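As a worked check, the standard definitions of accuracy, precision, recall, and false positive rate can be applied to the HTTP confusion matrix above (TN = 1, FP = 1, FN = 0, TP = 1). The choice of these four metrics, and of normalizing the false positive rate by the number of actual negatives, is an assumption made here for illustration:

$$
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} = \frac{1 + 1}{3} \approx 66.7\%,\\
\text{Precision} &= \frac{TP}{TP + FP} = \frac{1}{1 + 1} = 50\%,\\
\text{Recall} &= \frac{TP}{TP + FN} = \frac{1}{1 + 0} = 100\%,\\
\text{False positive rate} &= \frac{FP}{FP + TN} = \frac{1}{1 + 1} = 50\%.
\end{aligned}
$$

Under these definitions, the single false positive (subnetwork 1) lowers accuracy and precision while leaving recall unaffected, since the only malicious subnetwork is still detected.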
4.3. Discussion
Table 14. Evaluation of the proposed approaches.
| Techniques | Attack types | Target | Learning type | Logic rules | Training is not required | Multi-target | Detection rate |
|---|---|---|---|---|---|---|---|
| Kullback-Leibler (KL) Divergence; Cosine Similarity; TF-IDF; K-means algorithm | Suspicious attacks for different targets | IP addresses; TCP/UDP ports; HTTP.Url; others | Unsupervised learning | ✓ | ✓ | ✓ | 100% |
Table 14 summarizes the main features of the proposed approach. Although it achieved the best detection rate (100%), our experimental dataset remains small, and the approach must be applied to further representative datasets to obtain a more reliable estimate of this rate and of other metrics.
5. Conclusions
This paper introduces a promising new technique for incident detection, leveraging differential analysis. Initially, the traffic undergoes dispersion via a slicing function
, partitioning it into sequences of slices based on propositional logical formulas φ, which are specified by the end-user. Subsequently, each slice undergoes transformation through a measuring function
, mapping it to a point in
by quantifying selected characteristics defined by the end-user via a formula ψ. Following this, the distances between successive values returned by
, associated with the same sequence, are evaluated using a designated function Δ (e.g., KL-Divergence). Lastly, employing a clustering technique (e.g., K-means), the values produced by Δ are clustered, and the number of clusters is estimated. If any sequence yields more than one cluster, it indicates suspicious activity.
The experimental results demonstrate significant promise, with a 100% accuracy achieved across both datasets used in the experiments. However, it’s essential to note that this level of accuracy may not be guaranteed with other datasets and is contingent upon the parameters selected for analysis, such as φ and ψ.
In addition to its remarkable efficiency, the approach is versatile enough to tackle a wide array of attacks spanning various activities, including those targeting networks, operating systems, and applications. Notably, it requires neither a training step nor training data.
Looking ahead, our future endeavors entail applying this methodology to diverse datasets encompassing log files that capture a spectrum of activities across networks, operating systems, and applications. Furthermore, we aspire to integrate this approach into an open-source Security Information and Event Management (SIEM) tool like Wazuh, thereby extending its accessibility and practicality within cybersecurity frameworks.