Adaptive Financial Fraud Detection in Imbalanced Data with Time-Varying Poisson Processes
1. Introduction
Financial fraud is a fast-growing problem, especially because of the large sums involved. McAfee estimated in 2018 that cybercrime, of which financial fraud is a component, costs the world about US$600 billion, or 0.8% of global GDP. According to McKinsey, global losses due to card fraud could reach nearly US$44 billion by 2025. In addition to the direct cost of fraud, companies also suffer from lost sales when genuine transactions are declined: McKinsey estimates that false positives account for up to 25% of transactions denied by online retailers, see Dyzma (2018). As a first step, banks and financial institutions have approached fraud detection with manual procedures or rule-based solutions, which have yielded good results, but these methods now show their limits. The rule-based approach requires a complex set of requirements for suspicious-transaction reporting to be defined and reviewed manually. While this may be effective in detecting anomalies consistent with known patterns, it does not detect frauds that follow new or unknown patterns. The increasing complexity of digital attacks and the creativity of cyber-attackers make these conventional detection methods less effective and quickly obsolete. More sophisticated techniques must be developed, including machine learning algorithms, so that fraud detection evolves towards adaptive rules that tighten the mesh of the net.
Machine learning models work with many parameters and are much more efficient at finding subtle correlations in the data, which can be missed by an expert system or by human review, Dyzma (2018). The large volume of transactional and client data readily available in the financial services industry makes it an ideal field for the application of complex machine learning algorithms. In addition to learning from known patterns, machine learning can go further and learn new patterns without human intervention. This allows models to adapt over time, discovering previously unknown patterns or identifying new tactics used by fraudsters. However, conventional machine learning algorithms were developed to solve specific problems, and one of their most important assumptions is that the distribution of the data is balanced, unlike financial fraud data, which are highly imbalanced. Most standard classifiers such as decision trees and neural networks assume that learning samples are evenly distributed among the different classes. However, in many real-world applications, the ratio of the minority class is very small (1:100, 1:1000, or even beyond 1:10000). Due to the lack of data, the few samples of the minority class tend to be falsely detected by the classifiers and the decision boundary is therefore far from correct. Numerous research works in machine learning have been proposed to solve the problem of data imbalance; He and Garcia (2009), Galar et al. (2012), Krawczyk (2016), Elrahman and Abraham (2013), etc. However, most of these algorithms suffer from certain limitations in real-world applications, such as the loss of useful information, classification cost, excessive time, and adjustments, see Elrahman and Abraham (2013).
In this paper, we address the problem of fraud detection in imbalanced data using the Poisson process; fraud is defined as a rare event occurring at a random time and involving significant financial losses. In this context, the fraud times are defined as the jump times of a Poisson process whose intensity describes the instantaneous rate of fraud. Unlike machine learning methods, we do not search for subtle correlations in the data; instead, we assume that an exogenous rate, or intensity, must be determined. Instead of asking why the fraud is committed, the fraud rate is calibrated using market data. A lot of research has been done on the application of the Poisson process to financial risks, see Artzner and Delbaen (1995), Jarrow and Turnbull (1995), Duffie and Singleton (1999), etc. For calibration purposes, we assume that the intensity is a deterministic function of time, which covers both the homogeneous and inhomogeneous Poisson processes. Three main inputs are needed to estimate the intensity: the deterministic form of the intensity function, the arrival times of the frauds and the labels.
The rest of the paper is organized as follows. Section 2 defines the mathematical concepts of the Poisson process; the homogeneous and inhomogeneous Poisson processes are reviewed, and the estimation of the intensity and the prediction of fraud events are discussed. In Section 3, the model is applied to financial datasets, and the results are presented in Section 4. The dataset was provided by NetGuardians1, a Swiss company which develops solutions for banks to proactively prevent fraud.
2. Mathematical Concepts of Poisson Process
2.1. Fraud Event
Consider a financial institution such as a bank, an insurance company, a trading company, etc., and information about its clients. We are interested in the occurrence of fraud in client transactions for such an institution. The fraud event is then defined as a rare event occurring at a random time and resulting in significant financial losses for the client and the financial institution. Whatever the definition used for a fraudulent event, let us denote the fraud time by $\tau$, which corresponds to the value of a random variable on the filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_t)_{t \ge 0}, \mathbb{P})$. $\Omega$ denotes the possible states of the world, $\mathcal{F}$ is the $\sigma$-algebra, $(\mathcal{F}_t)_{t \ge 0}$ is the filtration, where $\mathcal{F}_t$ contains all information up to time $t$ and $\mathcal{F}_t \subseteq \mathcal{F}$. $\mathbb{P}$ is the probability measure describing the likelihood of certain events. The only mathematical structure assumed for $\tau$ is that it should be a stopping time, that is, a random variable $\tau : \Omega \to [0, \infty]$ such that $\{\tau \le t\} \in \mathcal{F}_t$ for all $t \ge 0$. Intuitively, one can determine whether or not the fraud time occurs before a certain deterministic time $t$ by observing the past up to time $t$, which is encoded in the filtration $(\mathcal{F}_t)_{t \ge 0}$.
Now consider a sequence $(\tau_n)_{n \ge 1}$ of fraud times and let $N = (N_t)_{t \ge 0}$ be a counting process given by

$$N_t = \sum_{n \ge 1} \mathbf{1}_{\{\tau_n \le t\}}. \quad (1)$$

In other words, $N_t$ counts the number of fraud events between 0 and $t$. $N$ has the following properties: 1) $N_0 = 0$; 2) $N_t$ is an integer; 3) for $s \le t$, $N_s \le N_t$. The last property implies that $N$ is a submartingale since $\mathbb{E}[N_t \mid \mathcal{F}_s] \ge N_s$. Because of the last property, the Doob-Meyer theorem guarantees the existence of an increasing predictable process $A$, called the compensator, starting at 0 and such that $N - A$ is a martingale. The compensator $A$ is uniquely defined up to indistinguishability and governs the distribution of $N$. We assume that the compensator $A$ is absolutely continuous w.r.t. the Lebesgue measure, so that there is a non-negative, integrable and predictable intensity process $\lambda = (\lambda_t)_{t \ge 0}$ that satisfies

$$A_t = \int_0^t \lambda_s \, \mathrm{d}s. \quad (2)$$

The process $\lambda$ represents the conditionally expected number of events per unit of time in the sense that, at any time $t$, the conditional probability of an event between $t$ and $t + h$ is approximately $\lambda_t h$ for small $h$, where $\mathcal{F}_{t^-}$ contains all information just before time $t$. In fact, because $N$ has the predictable intensity process $\lambda$, $\mathrm{d}N_t - \lambda_t \, \mathrm{d}t$ is a martingale increment, and heuristically we thus have

$$\mathbb{E}\left[\mathrm{d}N_t - \lambda_t \, \mathrm{d}t \mid \mathcal{F}_{t^-}\right] = 0. \quad (3)$$

Since $\lambda$ is predictable, $\lambda_t$ is $\mathcal{F}_{t^-}$-measurable, so we can move $\lambda_t$ outside the expectation and obtain

$$\mathbb{E}\left[\mathrm{d}N_t \mid \mathcal{F}_{t^-}\right] = \lambda_t \, \mathrm{d}t, \quad (4)$$

and

$$\mathbb{P}\left(\mathrm{d}N_t = 1 \mid \mathcal{F}_{t^-}\right) \approx \lambda_t \, \mathrm{d}t. \quad (5)$$
For more details see Reiss (1993) and Fleming and Harrington (2005).
In the rest of this paper, we will focus on the counting process with a deterministic intensity, which gives rise to the homogeneous and inhomogeneous Poisson processes. In this context, the likelihood of fraud events will be derived and implemented.
2.2. Homogeneous Poisson Process
The Homogeneous Poisson Process (HPP) is a fundamental stochastic process which is simple, easy to understand and possesses desirable mathematical and theoretical properties making it easy to handle. It can be easily extended to more complicated and realistic situations, Kingman (1993). Let $N = (N_t)_{t \ge 0}$ be the counting process defined above, i.e. for each $t \ge 0$, $N_t$ counts the number of fraud events that happen between time 0 and time $t$. In order to have an overview of the Poisson process, let us consider three definitions of the Poisson process that are equivalent to each other. For the proofs see Ross (2010) and Drazek (2013).
Definition 2.1. $N$ is an HPP with constant intensity $\lambda > 0$ if:
1) $N_0 = 0$;
2) The process has stationary and independent increments;
3) For small $h$, $\mathbb{P}(N_{t+h} - N_t = 1) = \lambda h + o(h)$;
4) $\mathbb{P}(N_{t+h} - N_t \ge 2) = o(h)$.
Definition 2.2. $N$ is an HPP with constant intensity $\lambda > 0$ if:
1) $N_0 = 0$;
2) The process has stationary and independent increments;
3) For $s < t$, $N_t - N_s$ is Poisson distributed with parameter $\lambda (t - s)$. That is,

$$\mathbb{P}(N_t - N_s = k) = \frac{[\lambda (t - s)]^k}{k!} e^{-\lambda (t - s)}, \quad k = 0, 1, 2, \ldots \quad (6)$$

For any interval of size $t$, $\lambda t$ is the expected number of frauds in that interval.
Definition 2.3. $N$ is an HPP with constant intensity $\lambda > 0$ if the waiting times between successive events, or arrivals, follow an exponential distribution with parameter $\lambda$.
This definition makes the Poisson process unique among renewal processes, owing to the memoryless property of the exponential distribution.
Estimation of the Constant Intensity λ for Homogeneous Poisson Process
The simplest way of estimating the constant intensity $\lambda$ is to use the third definition of the HPP above, related to the exponential distribution of the waiting times.
Let $N$ be a homogeneous Poisson process with parameter $\lambda$ and $(\tau_n)_{n \ge 1}$ a sequence of fraud times. We define $W_n = \tau_n - \tau_{n-1}$, the waiting time between the event $n-1$ and the event $n$, with $\tau_0 = 0$. Because $W_n$ follows the exponential distribution with parameter $\lambda$,

$$\mathbb{E}[W_n] = \frac{1}{\lambda}. \quad (7)$$

Let $\bar{W} = \frac{1}{n} \sum_{i=1}^{n} W_i$ be an estimator of $\mathbb{E}[W_n]$; using the method of moments, the estimator $\hat{\lambda}$ of $\lambda$ is given by

$$\hat{\lambda} = \frac{1}{\bar{W}} = \frac{n}{\sum_{i=1}^{n} W_i} = \frac{n}{\tau_n}, \quad (8)$$

which is also the Maximum Likelihood Estimator (MLE) of $\lambda$.
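The estimator in (8) is straightforward to compute from a client's fraud timestamps. The following Python sketch is our own illustration of the idea (variable names and the time unit are assumptions, not the implementation used in the paper):

```python
import numpy as np

def estimate_hpp_intensity(fraud_times):
    """Method-of-moments / MLE estimate of a constant intensity, Equation (8).

    fraud_times: 1-D array of fraud event times (e.g. in days), measured from
    the start of the observation window. Returns the estimated number of frauds
    per unit of time, with the convention lambda = 0 if no fraud is observed.
    """
    fraud_times = np.sort(np.asarray(fraud_times, dtype=float))
    if fraud_times.size == 0:
        return 0.0
    waiting_times = np.diff(np.concatenate(([0.0], fraud_times)))
    return 1.0 / waiting_times.mean()   # equals n / tau_n

# Example: 3 frauds observed at days 40, 250 and 600 -> about 0.005 frauds per day
print(estimate_hpp_intensity([40.0, 250.0, 600.0]))
```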
In the next section, we consider a time-varying intensity, which leads to the Non-Homogeneous Poisson Process.
2.3. Non-Homogeneous Poisson Process
A Non-Homogeneous Poisson Process (NHPP) is one whose intensity $\lambda(t)$ is a deterministic function of time. Thus, the distribution of the number of events between two particular points on the timeline is no longer a function of the difference between these points, as in the case of a Homogeneous Poisson Process (HPP). Here it is a function of the starting point and the end point of the time interval and is not necessarily stationary. Let us start with the definition of the NHPP given in Ross (2010).
Definition 2.4. The counting process $N = (N_t)_{t \ge 0}$ is said to be a NHPP with intensity function $\lambda(t)$, $t \ge 0$, if it satisfies:
1) $N_0 = 0$;
2) $N$ has independent increments;
3) For small $h$, $\mathbb{P}(N_{t+h} - N_t = 1) = \lambda(t) h + o(h)$;
4) $\mathbb{P}(N_{t+h} - N_t \ge 2) = o(h)$.
The function $\lambda(t)$ is sometimes called the instantaneous arrival rate of the NHPP.
A consequence of the above definition is that $N_t - N_s$ follows a Poisson distribution with parameter $\int_s^t \lambda(u)\,\mathrm{d}u$. That is,

$$\mathbb{P}(N_t - N_s = k) = \frac{\left[\int_s^t \lambda(u)\,\mathrm{d}u\right]^k}{k!} \exp\left(-\int_s^t \lambda(u)\,\mathrm{d}u\right), \quad k = 0, 1, 2, \ldots \quad (9)$$

We can explore the relationship between the average number of events occurring up to time $t$ and the intensity function $\lambda(t)$ of the corresponding NHPP:

$$\Lambda(t) := \mathbb{E}[N_t] = \int_0^t \lambda(u)\,\mathrm{d}u. \quad (10)$$

As described above, the compensator $\Lambda(t)$ is a non-decreasing right-continuous function and is referred to here as the expectation function of the NHPP.
In addition, the expected number of events between times $t$ and $t + s$ is expressed as

$$\mathbb{E}[N_{t+s} - N_t] = \Lambda(t+s) - \Lambda(t) = \int_t^{t+s} \lambda(u)\,\mathrm{d}u. \quad (11)$$
According to Cox and Lewis (1966), the distribution function of the time to the next event in a NHPP, given that the last event occurred at time $t$, is

$$F_t(s) = 1 - \exp\left(-\int_t^{t+s} \lambda(u)\,\mathrm{d}u\right). \quad (12)$$

Let $f_t(s)$ be the probability density function of the time to the next event, which can be obtained by differentiating the expression in (12) with respect to $s$:

$$f_t(s) = \lambda(t+s) \exp\left(-\int_t^{t+s} \lambda(u)\,\mathrm{d}u\right). \quad (13)$$

As we will see later, this expression (13) is very useful in estimating the intensity.
Estimation of the Intensity $\lambda(t)$ for Non-Homogeneous Poisson Process
There is a substantial history of statistical inference for the Non-Homogeneous Poisson process; see Basawa and Rao (1980), Brown (1972), Ross (1996), etc. Suppose we have data from a non-homogeneous Poisson process $N$ and we are looking for the intensity function that generated them. The first step is to define the form of the intensity $\lambda(t)$; we limit ourselves to the case of a parametric intensity. In the second step, given the probability density function defined in (13), we can use the principle of Maximum Likelihood Estimation (MLE) to find the intensity parameters maximizing the likelihood of the observed fraud times. The procedure is the following:
Suppose the $n$ events occur at $t_1 < t_2 < \cdots < t_n$ in the interval $[0, T]$. Since the $n$ events are independent and using (13), the desired joint probability density takes the form

$$f(t_1, \ldots, t_n) = \left[\prod_{i=1}^{n} \lambda(t_i) \exp\left(-\int_{t_{i-1}}^{t_i} \lambda(u)\,\mathrm{d}u\right)\right] \mathbb{P}\big(\text{no event in } (t_n, T]\big),$$

where $\mathbb{P}\big(\text{no event in } (t_n, T]\big)$ is the probability that no event occurs in the interval $(t_n, T]$. It is calculated as follows:

$$\mathbb{P}\big(\text{no event in } (t_n, T]\big) = \exp\left(-\int_{t_n}^{T} \lambda(u)\,\mathrm{d}u\right).$$

The likelihood of observing $t_1, \ldots, t_n$ is then

$$L = \left[\prod_{i=1}^{n} \lambda(t_i)\right] \exp\left(-\int_0^T \lambda(u)\,\mathrm{d}u\right).$$

The log-likelihood is:

$$\log L = \log\left(\prod_{i=1}^{n} \lambda(t_i)\right) - \int_0^T \lambda(u)\,\mathrm{d}u \quad (14)$$

$$= \sum_{i=1}^{n} \log \lambda(t_i) - \int_0^T \lambda(u)\,\mathrm{d}u. \quad (15)$$
For more details about the derivation of (15), see Ross (1996).
The intensity estimation consists of finding the parameters of the intensity $\lambda(t)$ maximizing the log-likelihood function defined in (15). This estimated intensity is then used to predict the fraud event on the next transaction (at time $T + s$), based on the information available up to the time $T$ of the last observed transaction.
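For a parametric intensity, the log-likelihood (15) is simple to code. The sketch below is our own illustration (not the paper's implementation) for the linear intensity $\lambda(t) = a + bt$ of Section 3.1, for which $\int_0^T \lambda(u)\,\mathrm{d}u = aT + bT^2/2$; it maximizes (15) numerically under the non-negativity constraints discussed in Section 3.1.

```python
import numpy as np
from scipy.optimize import minimize

def log_likelihood_linear(params, fraud_times, T):
    """Log-likelihood (15) for the linear intensity lambda(t) = a + b*t."""
    a, b = params
    lam = a + b * np.asarray(fraud_times, dtype=float)
    if np.any(lam <= 0.0):
        return -1e12                         # heavy penalty: intensity must stay positive
    integral = a * T + 0.5 * b * T**2        # closed form of int_0^T lambda(u) du
    return float(np.sum(np.log(lam)) - integral)

def fit_linear_intensity(fraud_times, T):
    """Maximize (15) subject to a >= 0 and a + b*T >= 0 (see Section 3.1)."""
    constraints = [{'type': 'ineq', 'fun': lambda p: p[0]},              # a >= 0
                   {'type': 'ineq', 'fun': lambda p: p[0] + p[1] * T}]   # a + b*T >= 0
    x0 = [max(len(fraud_times), 1) / T, 0.0]   # homogeneous estimate as starting point
    res = minimize(lambda p: -log_likelihood_linear(p, fraud_times, T),
                   x0=x0, constraints=constraints, method='SLSQP')
    return res.x                               # (a_hat, b_hat)

# Example: frauds becoming more frequent over a 600-day training window
print(fit_linear_intensity([100.0, 320.0, 450.0, 520.0, 580.0], T=600.0))
```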
2.4. Prediction of Fraud Event
Consider the filtration $\mathcal{F}_T$ that contains the information about the fraud events up to time $T$. Suppose a new transaction is in progress at time $T + s$ ($s > 0$) and we would like to know whether this transaction is fraudulent or not.
Proposition 1. The probability that a fraud occurs at time $T + s$ is given by

$$\mathbb{P}\big(\text{fraud at } T + s \mid \mathcal{F}_T\big) = 1 - e^{-(\Lambda(T+s) - \Lambda(T))}, \quad (16)$$

where $\Lambda(t) = \int_0^t \lambda(u)\,\mathrm{d}u$.
Proof. Following (12),

$$\mathbb{P}\big(\text{fraud at } T + s \mid \mathcal{F}_T\big) = 1 - \exp\left(-\int_T^{T+s} \lambda(u)\,\mathrm{d}u\right) = 1 - e^{-(\Lambda(T+s) - \Lambda(T))}.$$

In the special case of the homogeneous Poisson process, that is for constant intensity $\lambda$,

$$\mathbb{P}\big(\text{fraud at } T + s \mid \mathcal{F}_T\big) = 1 - e^{-\lambda s}. \quad (17)$$

We observe that in the case of the homogeneous Poisson process, the probability of fraud is a function of the parameter $\lambda$ and the elapsed time $s$ between the two transactions. For the inhomogeneous Poisson process, it is a function of the difference between the compensators $\Lambda(T+s)$ and $\Lambda(T)$.
Following (16): as $s \to \infty$, $\Lambda(T+s) - \Lambda(T) \to \infty$ and the probability of fraud tends to 1. On the other side, as $s \to 0$, $\Lambda(T+s) - \Lambda(T) \to 0$ and the probability of fraud tends to 0.
Therefore, when the time between two transactions is large, it is very likely that the model generates a fraud alert. On the other hand, when two transactions are close in time, the model will not generate a fraud alert. This could reduce the predictive power of the model when several fraud events occur in quick succession.
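To make (16) and (17) concrete, the small Python sketch below (our own illustration; for the linear intensity the compensator is $\Lambda(t) = at + bt^2/2$) computes the fraud probability of a transaction occurring $s$ time units after the last observed one.

```python
import numpy as np

def fraud_probability_linear(a, b, T, s):
    """Probability (16) of a fraud on a transaction at time T + s for the linear
    intensity lambda(t) = a + b*t, whose compensator is Lambda(t) = a*t + b*t**2/2."""
    compensator = lambda t: a * t + 0.5 * b * t**2
    return 1.0 - np.exp(-(compensator(T + s) - compensator(T)))

def fraud_probability_homogeneous(lam, s):
    """Special case (17) with constant intensity lambda."""
    return 1.0 - np.exp(-lam * s)

# Example: lambda = 0.005 frauds/day and a transaction 10 days after the last one
print(fraud_probability_homogeneous(0.005, 10.0))   # about 0.049
```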
3. Application to Financial Dataset
3.1. Choice of Deterministic Intensity Functions
To apply the Poisson process to the dataset, the shape of the intensity function must be defined. Three classes of intensity functions are proposed. For each class of function $\lambda(t)$, we set the conditions ensuring $\lambda(t) \ge 0$.
1) $\lambda(t) = \lambda$: this is the case of the homogeneous Poisson process and $\lambda$ must be greater than 0. $\lambda$ is estimated following § 2.2.1.
2) $\lambda(t) = a + bt$: the intensity is assumed to be a linear function of time. To ensure $\lambda(t) \ge 0$ for $t \in [0, T]$, we impose, as in Massey et al. (1996), the conditions

$$a \ge 0 \quad \text{and} \quad a + bT \ge 0. \quad (18)$$

Proof. We want $a + bt \ge 0$ for all $t \in [0, T]$.
If $b \ge 0$: $a + bt \ge a$ for all $t \in [0, T]$.
If $b < 0$: $a + bt \ge a + bT$ for all $t \in [0, T]$.
We also know that $a \ge 0$ since $\lambda(0) = a$. In order to have $\lambda(t) \ge 0$, it is sufficient that $a + bT \ge 0$. So, the conditions are $a \ge 0$ and $a + bT \ge 0$.
If $b = 0$, we obtain the trivial condition $a \ge 0$.
Therefore, when we consider a short period to estimate the intensity parameters, the feasible region of (18) is larger, giving more room to find the optimal solution. Figure 1 shows an example of feasible regions for different values of $T$; for the sake of readability, the ranges of $a$ and $b$ are restricted. We observe that when $T$ becomes larger, the feasible region shrinks towards the trivial region.
3) $\lambda(t) = a + bt + ct^2$: the intensity is a quadratic function of time. To ensure $\lambda(t) \ge 0$ for $t \in [0, T]$, we impose the conditions

$$a \ge 0, \quad a + bT + cT^2 \ge 0, \quad \text{and} \quad a - \frac{b^2}{4c} \ge 0 \ \text{ whenever } c > 0 \text{ and } 0 \le -\frac{b}{2c} \le T, \quad (19)$$

i.e. the quadratic must be non-negative at both endpoints of $[0, T]$ and at its interior minimum, if any.
Figure 1. (a) Example of the feasible region defined by (18) for the linear intensity $\lambda(t) = a + bt$ and a small value of $T$. (b) As for (a) but with a larger $T$. (c) As for (a) but with an even larger $T$.
The proof is similar to the above. Also, when $c = 0$, (19) reduces to the conditions in (18).
The conditions (18) and (19) are the constraints of the optimization problem in (15) for the Inhomogeneous Poisson process.
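For the optimization of (15), the conditions (18) and (19) can be expressed directly as inequality constraints for a numerical optimizer. The snippet below is a sketch under our own assumptions (the training horizon `T` is set to an arbitrary example value) of one possible encoding for scipy.optimize.minimize; the quadratic condition simply checks non-negativity at both endpoints and at the interior vertex when it lies in $[0, T]$.

```python
T = 600.0  # example length of the training window (an assumption for illustration)

# Conditions (18) for lambda(t) = a + b*t with p = (a, b)
linear_constraints = [
    {'type': 'ineq', 'fun': lambda p: p[0]},              # a >= 0
    {'type': 'ineq', 'fun': lambda p: p[0] + p[1] * T},   # a + b*T >= 0
]

# Conditions (19) for lambda(t) = a + b*t + c*t**2 with p = (a, b, c)
def quadratic_minimum(p, horizon=T):
    a, b, c = p
    values = [a, a + b * horizon + c * horizon**2]        # endpoints of [0, T]
    if c > 0 and 0.0 <= -b / (2.0 * c) <= horizon:        # interior vertex lies in [0, T]
        values.append(a - b**2 / (4.0 * c))
    return min(values)

quadratic_constraints = [{'type': 'ineq', 'fun': quadratic_minimum}]

# Either list can be passed as `constraints=` to scipy.optimize.minimize
# together with the negative log-likelihood of Equation (15).
```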
3.2. Data
The datasets provided by NetGuardians2 consist of two years of transactions for clients of a financial institution. They cover the period from 09-2015 to 09-2017 and include a total of 18,139,078 transactions made by 124,177 clients. For confidentiality reasons, the name of the financial institution is not mentioned. The dataset includes a total of 49 features such as transaction dates, transaction amounts, transaction sender IDs, transaction recipient account numbers, banking countries, etc. To be able to train a Poisson process algorithm, labelled data with examples of fraud are needed. All transactions in the dataset are labelled as fraudulent or not. Since the ground truth is not available, the labelling is based on the following simple pattern: transactions for which the banks receiving the money are outside Switzerland are considered fraudulent. With this labelling method, only 55,226 clients have fraudulent transactions. To train the Poisson process, three features are required: the client ID, the timestamp and the label. The timestamps and labels of each client are used to estimate the client's fraud intensity, which is then used to predict fraudulent events.
The proportion of fraud, corresponding to the number of fraudulent transactions relative to the total number of transactions, is calculated for each client. According to the labelling method, some clients may have a 100% fraud proportion; this concerns clients for whom the recipient institutions are all located outside Switzerland. To be realistic, we remove these clients from our analysis. In addition, clients without any fraud event in the complete dataset are removed, because their fraud event times are unknown and their intensity cannot be estimated. Moreover, these clients' datasets contain only one class and, in this context, no classification performance measure such as ROC-AUC is defined.
Figure 2 shows the distribution and the boxplot of the fraud proportions. We notice that the cleaned dataset is still generally imbalanced because most clients have a low proportion of fraud. The boxplot shows right-skewed data with the presence of large outliers. The median indicates that 50% of the clients have a fraud proportion of less than 9%.
However, it is important to mention that the labelling method is relatively simple and that the above histogram is not representative of the true distribution of fraud because, in practice, the majority of fraud proportions are less than 1%. To study our analysis in an imbalanced-dataset framework, we propose to focus on the clients with less than 20% fraud. Next, we divide this dataset into four subsets containing different fraud profiles. The first subset includes clients with a fraud proportion of less than 1%, the second subset concerns clients with a proportion between 1% and 5%, the third subset is for clients whose fraud proportion is between 5% and 10% and the last one for clients whose fraud proportion is between 10% and 20%. Figure 3 shows the boxplot for each group. The four datasets are roughly symmetric with no outliers. As expected, the greater variability in group 4 and the smaller variability in group 1 are clearly visible.

Figure 2. (a) Histogram of fraud proportions in the full dataset. (b) Boxplot of fraud proportions in the full dataset. The clients with no fraud events and the clients with a 100% fraud proportion are removed from this full dataset.
In each subset, we randomly select 500 clients and we train and test the Poisson models on the transactions of each client. The training set consists of the first 80% of transactions, on which the intensity parameters are estimated. The test set consists of the last 20%, on which the fraud events are predicted with the estimated parameters. In addition, to take into account time-varying intensity parameters, the prediction on the test set is also performed with rolling windows.

Figure 3. Boxplots for the four subsets. Clients with a fraud proportion below 20% are grouped into four subsets containing different fraud profiles.
From a practical point of view, when there is no fraud in the training set, it is difficult to estimate the fraud intensity because the fraud event times are not available; see Equation (8) and Equation (15). Two solutions are possible:
1) Remove the clients for whom there was no fraud occurrence in the training set; the consequence is that we would lose more information.
2) Assume that the intensity, i.e. the occurrence rate of fraud, is zero ($\lambda = 0$), since there are no fraud events in the training set. In this context the fraud prediction probability is zero; see Proposition 1.
We conduct our analysis with the latter, that is, the intensity is set to $\lambda = 0$ when we train on a dataset with no fraud information. The main reason is that we want to keep most of the client profiles in our analysis. As we will see later, under this assumption the dynamic models perform worse than the static models. To compare the various Poisson models, we define a baseline model (benchmark) based on a naive approach: we calculate the proportion of fraud in the training set and use that probability to predict fraud in the test set. Finally, the predictive performance is summarized in each subset using two performance measures: ROC-AUC and the Average Precision (AP) score.
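Per-client evaluation with these two metrics can be done with scikit-learn; the sketch below is our own illustration (function and variable names are assumptions), applied to a client's test labels and predicted fraud probabilities.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate_client(y_test, fraud_probabilities):
    """ROC-AUC and Average Precision for a single client's test set.

    y_test              : binary labels (1 = fraudulent, 0 = genuine).
    fraud_probabilities : probabilities from Proposition 1, one per test transaction.
    ROC-AUC is undefined when only one class is present in the test set.
    """
    y_test = np.asarray(y_test)
    scores = np.asarray(fraud_probabilities)
    auc = roc_auc_score(y_test, scores) if len(np.unique(y_test)) == 2 else np.nan
    ap = average_precision_score(y_test, scores)
    return auc, ap

# Group summary (mean, std, min, max of the AUCs over the 500 selected clients):
# aucs = [evaluate_client(y, p)[0] for y, p in client_results]
# print(np.nanmean(aucs), np.nanstd(aucs), np.nanmin(aucs), np.nanmax(aucs))
```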
4. Results
By adding the rolling-windows approach to our study, we have a total of 6 models to compare. Let us start by describing the 6 models:
1) The first model is the homogeneous Poisson process ($\lambda(t) = \lambda$). The constant intensity $\lambda$ is estimated on the training set. By (17), the estimated $\hat{\lambda}$ is used for predicting the fraud events in the whole test set. We denote this model HomoStatic.
2) The second model is the homogeneous Poisson process, except that the prediction is done by rolling windows. The window starts with the training set, which is used for the estimation of the intensity; this estimated intensity is used to predict the fraud event on the next transaction in the test set. Then, the sliding window is shifted one step ahead to the next transaction. The intensity is estimated again in the second time window and is used for the prediction of fraud on the next transaction. This procedure is repeated until the end of the test set (a sketch of the rolling-window procedure is given after this list). The goal of this methodology is to take into account the time variation of the intensity. The model is denoted HomoDynamic.
3) The third model is the non-homogeneous Poisson process with an intensity that is a linear function of time ($\lambda(t) = a + bt$). The intensity parameters are estimated on the training set and are used for fraud prediction in the whole test set. It is denoted LinearStatic.
4) The fourth model is the inhomogeneous Poisson process with linear intensity, except that the prediction is performed by rolling windows. The rolling-window procedure is the same as above. It is denoted LinearDynamic.
5) The fifth model is the non-homogeneous Poisson process with the intensity being a quadratic function of time ($\lambda(t) = a + bt + ct^2$). The procedure is the same as for LinearStatic. We denote this model QuadraticStatic.
6) The last model is the same as QuadraticStatic, except that we make the prediction by rolling windows. It is denoted QuadraticDynamic.
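A minimal sketch of the rolling-window (dynamic) procedure, under our own naming assumptions: `estimate_intensity` stands for any of the estimators of Section 2 (returning $\lambda = 0$ when the window contains no fraud) and `predict_prob` for the probability of Proposition 1. For simplicity, the window grows by one transaction at each step; a fixed-length window would also drop the oldest transaction.

```python
import numpy as np

def rolling_window_predictions(train_times, train_labels, test_times, test_labels,
                               estimate_intensity, predict_prob):
    """Dynamic prediction: re-estimate the intensity before every test transaction.

    *_times  : sorted transaction timestamps; *_labels : 1 = fraud, 0 = genuine.
    estimate_intensity(fraud_times, T) -> intensity parameters.
    predict_prob(params, T, s)         -> fraud probability from Proposition 1.
    """
    times, labels = list(train_times), list(train_labels)
    probabilities = []
    for t, y in zip(test_times, test_labels):
        fraud_times = [u for u, flag in zip(times, labels) if flag == 1]
        params = estimate_intensity(fraud_times, T=times[-1])
        probabilities.append(predict_prob(params, T=times[-1], s=t - times[-1]))
        times.append(t)      # shift the window one transaction ahead
        labels.append(y)
    return np.array(probabilities)
```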
In addition, we denote by NaiveStatic the baseline model, which estimates the probability of fraud in the training set and uses the same probability for the prediction in the test set. The predicted probabilities are therefore the same for all transactions of the test set. This is equivalent to a random classifier because the model has no capability to discriminate genuine transactions from fraudulent ones.
We are interested in the predictive power of the different models. Thus, all the results presented below are based on the predicted probabilities and the labels in the test set. Tables 1-4 show the AUC (Area Under the Curve) of the ROC (Receiver Operating Characteristic) curve for the different models in each group. AUC-ROC measures classification performance across all threshold settings: the ROC is a probability curve and the AUC represents the degree of separability, i.e. how well the model is able to distinguish between classes. The higher the AUC, the better the model is at distinguishing between genuine and fraudulent transactions. The tables show the mean, the standard deviation, the minimum and the maximum of the AUCs calculated over the 500 clients in each group.
Table 1. AUC: summary statistics in group 1 (fraud proportion below 1%).
Table 2. AUC: summary statistics in group 2 (fraud proportion between 1% and 5%).
Table 3. AUC: summary statistics in group 3 (fraud proportion between 5% and 10%).
Table 4. AUC: summary statistics in group 4 (fraud proportion between 10% and 20%).

We note that the dynamic models (with rolling windows) are more volatile than the static models (without rolling windows). All static models perform significantly better than the dynamic models. The LinearStatic model is the best one, with a mean AUC of 69%, 73%, 72% and 71% in group 1, group 2, group 3 and group 4 respectively. It is followed by the QuadraticStatic model. The baseline model (naive approach) is significantly worse than the Poisson models, with the exception of the QuadraticDynamic model in group 1, where the mean AUC is 47%. However, the HomoDynamic model performs better than the other dynamic models. It is important to mention that in some cases the Poisson models do not predict frauds correctly, as some AUCs are equal to 0. This is often the case when the fraud information used in the training set to estimate the intensity is not sufficient for the prediction in the test set. Let us illustrate one common situation in our dataset, where the absence of fraud in the training set leads to AUC = 0. Consider an example of a dataset with 6 training instances and 3 test instances. The labels are:
Training set: [0 0 0 0 0 0] Test set: [1 0 0].
The label 0 indicates a genuine transaction and the label 1 indicates a fraudulent transaction. There are no fraud events in the training set and, from the above assumption, $\lambda = 0$. For all static models, the prediction probabilities in the test set are 0 and therefore the AUC-ROC is equal to 0.5. On the other hand, the dynamic models based on the sliding windows show an AUC-ROC equal to 0. In fact, it is easy to show that, using the sliding windows in the test set, the first predicted probability is 0 and the next two are different from 0; this leads to an AUC-ROC equal to 0.
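This behaviour is easy to reproduce; in the toy example below (the dynamic scores are chosen only to mimic the mechanism, not taken from the actual models), scikit-learn returns exactly these AUC values.

```python
from sklearn.metrics import roc_auc_score

y_test = [1, 0, 0]                              # test labels of the example above

# Static models: no fraud in training => lambda = 0 => constant zero probabilities
print(roc_auc_score(y_test, [0.0, 0.0, 0.0]))   # 0.5: no discrimination at all

# Dynamic models: the fraud enters the window only after the first test transaction,
# so the fraudulent transaction gets score 0 while the genuine ones get positive scores
print(roc_auc_score(y_test, [0.0, 0.3, 0.2]))   # 0.0: the fraud is ranked last
```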
AUC-ROC can be a misleading measure for classification in an imbalanced fraud dataset. One of the main reasons is that it underestimates the false positive rate: since the number of legitimate transactions (negative examples) far exceeds the number of fraudulent transactions (positive examples), a large variation in the number of false positives leads only to a slight change in the false positive rate, which can lead to erroneous conclusions. In this case, the precision-recall analysis is more appropriate because these metrics do not take into account the number of legitimate transactions (negative examples) in their calculation. We focus on the Average Precision (AP), which is an estimate of the area under the precision-recall curve; the results are shown in Tables 5-8. All the Poisson models significantly outperform the naive approach, and the static approaches perform better than the dynamic approaches. The LinearStatic model still remains the best one for all groups, followed by the QuadraticStatic model. Also, the HomoDynamic model performs better than the other dynamic models. In conclusion, the AUC-ROC and AP analyses show that in all four groups the LinearStatic model is the best; it is followed by the QuadraticStatic model and then by the HomoDynamic model. All the Poisson models significantly outperform the baseline approach.
Table 5. AP: summary statistics in group 1 (fraud proportion below 1%).
Table 6. AP: summary statistics in group 2 (fraud proportion between 1% and 5%).
Table 7. AP: summary statistics in group 3 (fraud proportion between 5% and 10%).
Table 8. AP: summary statistics in group 4 (fraud proportion between 10% and 20%).

We are also interested in the relative predictive performance between the Poisson models and the baseline approach. The idea is to determine in which group the Poisson models perform best. AP scores are used for this analysis. The relative variations between the Mean Average Precision (MAP) of the different Poisson models and that of the baseline model are reported in Table 9. The table shows that the relative variation decreases when the fraud proportion of the group increases. So the predictive power of the Poisson models increases with the degree of imbalance of the dataset. Figure 4 shows the relative performance of the different models in each group. We observe that the relative performance is best in group 1 and that the LinearStatic model outperforms the other 5 models.
During the analysis, we observe that the dynamic approaches (rolling windows) are less efficient than the static approaches regardless of the performance measure. That is, taking into account the temporal variation of the intensity parameters through rolling windows does not produce better results. Two main reasons could explain this weak performance of the dynamic models. First, as illustrated above, the assumption of $\lambda = 0$ when we train on a dataset with no fraud may lead to this weak performance. Second, the window size is essential for forecast accuracy. In fact, following Inoue et al. (2017), different window sizes may lead to different empirical results in practice, and good results might be obtained simply by chance. To produce better results, one can vary the window size and select the optimal one for better prediction. Another possibility is to consider a stochastic intensity model that incorporates the time variation of the parameters. This will be addressed in future research.

Table 9. Relative variations of MAP between the Poisson models and the baseline model in the four groups.
Figure 4. Relative performance between the different models and the baseline approach, plotted for each group to show in which group the Poisson models perform best.
5. Conclusion
The Poisson process is applied to detect fraud in an imbalanced dataset. The cases of homogeneous and non-homogeneous Poisson processes are investigated. For the non-homogeneous Poisson process, linear and quadratic intensity functions are considered. We have shown how to estimate the intensity and how to predict fraud events. Our methodology is applied to financial datasets.
For each Poisson model studied, we consider a static and a dynamic approach. Unlike the static approach, the dynamic one takes into account the temporal variation of the intensity parameters and works with rolling windows. All models are compared to a baseline model that predicts fraud using the proportion of frauds observed in the training set. We found that all Poisson models outperform the baseline and that the static approaches perform better than the dynamic ones. The static linear model remains the best for all groups, followed by the static quadratic model and then by the homogeneous Poisson model. The study also showed a better predictive power of the Poisson models in the case of the more imbalanced datasets.
One of the main problems of this study is the training of the Poisson process on a set with no fraud events. In this context, it is difficult to estimate the intensity parameters because we have no fraud event times. In this study, it is assumed that the intensity is then zero, but, as indicated above, this assumption could lead to poorer performance of the model.
Another problem is the dynamics of the intensity function. It is assumed here that the fraud rate is constant or a deterministic function of time. In fact, fraud is a rare event that can happen at any time, so the intensity itself should arguably be stochastic, i.e. a random variable at any time. These issues will be addressed in future research by detecting fraud using a stochastic intensity model combined with deep learning algorithms.
The main contributions of the paper are:
1) Even though the intensity-based approach is used in many fields, such as credit risk models, we are among the first to apply this approach to fraud detection.
2) The Poisson process is well suited to rare events and requires few inputs for the estimation of the intensity; thus, the risk of over-fitting and the computational cost are reduced.
3) The approach, combined with machine learning algorithms, can lead to a sophisticated technique for detecting frauds.
NOTES
1https://netguardians.ch.
2https://netguardians.ch.