1. Introduction
The digital revolution has allowed the development of sophisticated information systems as well as made information exchange easier. This has given rise to security problems at both the local and network levels. Numerous solutions have been developed to deal with security problems, such as encryption and hashing algorithms [1] [2] [3] [4] . Kouam and Marcellin [1] have proposed a security model that provides multi-level security of data by using biometric authentication of information exchanges. The interest is to prevent data theft and fraudulent information transfers in cyberspace.
Cybersecurity is the branch of computer science that focuses on protecting information systems in cyberspace. With the rapid proliferation of digital devices, the number of attacks is increasing, making it more important to maintain security. Cyber deception is one component of an effective security strategy. Several models [5] [6] [7] have proposed ways of deceiving attackers using chatbots, game theory, and other methods. Machine-based text analysis was proposed by [8] for sentence classification. Implementing a deception system involves creating both the environment itself and the fake data. Given the volume of data to consider, automating the data generation process becomes crucial.
Natural Language Processing (NLP) can be defined as the automatic manipulation of natural languages, like speech and text, by software. NLP is used in cybersecurity to address several tasks, like malicious domain name detection [9] , spam detection or mail classification [10] , vulnerability detection [11] , and others. NLP is also used for deception to encrypt the context of a message [12] , encrypt relevant information in data [13] , duplicate documents in a server [14] , and many others. To the best of our knowledge, we have not seen any model using NLP for data generation in a cyber deception context, and this work is the first one to address the issue.
Fake data generation in the context of deception must take the actual data into account during the generation process in order to be consistent. Since some texts (documents) can cover several domains, the model must be able to consider each domain contained in the text. Our main objective is to propose a text generation model for the cyber deception context.
The rest of this work will be organized as follows: section 2 will give a brief state of the art on NLP in cybersecurity domain and text generation models, section 3 will present methodology, and finally, section 4 will present some results.
2. State of the Art
Cyber deception tends to touch multiple domains in order to implement an effective security solution. It uses game theory, graph theory, NLP and others. Language is the main tool that promotes human-machine teamwork in cybersecurity activities. Among the different tasks that can be performed using NLP, some can be used to create threats or attacks or also to secure resources in information systems.
NLP can be used to encrypt the context of a text [12]. The method replaces part of a message with another, unrelated message. However, part of the real message remains readable, and this part may contain the main information, which is a real problem in the deception context. To build a deception environment, Tanmoy [14] proposes “a fake repository Engine for cyber deception”. The model identifies important documents on a server and, for each document, generates a set of fake documents associated with the original. The goal is to waste the attacker's time so that he can be intercepted either while downloading the set of documents or while searching for the original document on the server. The authors do not address document content, only how to identify and duplicate documents.
Prakruthi et al. [15] proposed fake document generation for cyber deception by manipulating text comprehensibility. They adopt a set of quantitative measures based on qualitative principles of psycholinguistics and reading comprehension: connectivity, dispersion, and sequentiality. Given an input text, the authors insert or remove sentences so that the attacker misunderstands the text. The method identifies sentences containing target concepts in order to manipulate their positioning; however, the selection of sentences is based purely on the occurrence of concepts. It does not consider rephrasings of a concept or related information in the selected sentences. Another problem is that, contrary to our work, part of the real information remains visible.
Syntactic analysis is one of the key tools used to complete the tasks of the NLP, which is used to determine how the natural language aligns with the grammatical rules. The most widely used techniques in NLP are lemmatization, morphological segmentation, word segmentation, part-of-speech marking, parsing, sentence breaking, stemming, etc.
Indeed, semantic analysis can allow the classification of data and can thus classify, for example, emails received as an attack or a normal email [10] . The applicability of feature extraction for malicious message filtering is determined by text mining methodologies [16] .
Data analysis has become more complicated with large volumes of data and the diverse properties that data presents. This does not make it easy for data analysis methods to consider all the properties to produce a good analysis. Thus, NLP can be used to reduce the dimensionality of the data by extracting features for a more efficient analysis [17] . This allows for the removal of duplicates and others in the considered data. Many vulnerabilities exploited by attackers, however, are sometimes found in the programs we use. Indeed, some programs have had security flaws since the development process. Mokhov et al. [18] used n-grams NLP techniques combined with machine learning for the detection, classification, and reporting of weaknesses related to vulnerability or bad coding practices found in artificially constrained language.
The growing number of cyberattack reports can make a deeper analysis of such data prohibitively time-consuming. However, shallow text analysis cannot provide many of the details necessary to yield actionable steps for improving security measures against the vast number of cyberattacks occurring each day. Pawlick et al. [7] use game theory to deceive the attacker. Yanlin Chen et al. [8] proposed machine-based text analysis that builds an automatic sentence classifier. Given a new category, it can automatically update the training data and build a tool to analyze the text of cybersecurity strategies. Another problem is malicious domain name detection, which can be done using lexical analysis [9]. NLP is increasingly used both by cybercriminals and by security defense tools to understand and process unstructured data. NLP’s ultimate aim is to extract knowledge from unstructured data or information [19]. Behavior modeling is used to detect malware and attackers in an information system. To formulate a behavior report, [11] used the bag-of-words (BoW) representation from NLP.
NLP can also be used for text generation, a task that is evolving rapidly. The goal is to generate text that looks as real as possible to humans. Text generation involves predicting words to construct sentences. One of the most widely used models for this task is the sequence-to-sequence (Seq2Seq) model, recent versions of which include attention [20] [21].
Text generation covers several topics. One is open-domain dialogue [22] [23] [24]: [22] focuses on the specific topic of the current conversation and makes automatic changes, [23] uses unconventional texts to train the model, and [24] proposes a model that associates images with the different words during the conversion process.
Many generation models have focused on building long texts such as paragraphs, long sentences, or documents [25] - [31], but some problems can be observed. These models cannot generate long sentences, and they have problems with output style, syntax, context, and other aspects. Moreover, general-purpose tools such as ChatGPT do not specifically address text generation for the cyber deception context. To be used in cyber deception, a text generation model must preserve properties such as context, domain, syntax, consistency, and size.
3. Methodology
Many information systems are vulnerable to multi-level attacks. Attackers use this technique to disrupt defenders and to carry out their attacks. However, multi-level deception aims at setting deception barriers to intercept them.
3.1. Deception Architecture
Figure 1 presents our architecture on multi-level deception which has two (2) levels:
All users who want to access the system start at an authentication point. We used the authentication model proposed by Kouam and Nkenlifack [1] with biometric authentication.
After the authentication process, the intrusion detection system classifies users into three groups: malicious users, legitimate users, and uncertain users. Uncertain users are those whom the system has not been able to identify as either legitimate or malicious.
The attackers then have access to a database containing fake data, while legitimate users access the real data. The fake data is obtained using a generation module that takes the real data as input and provides the generated data as output. However, the actions of attackers are collected for analysis and security enhancement purposes. These actions can include attempts at unauthorized access, data deletion, query modification, and many others. When the system realizes that the attacker’s interaction is decreasing (multiple attempts at fraudulent action), the attacker is redirected to the second level of deception with the response to his request. The attacker will then believe that he has achieved his objective or is making progress in the system. The data generation module is designed in such a way that the data generated at the first deception level slightly generates the data available at the second level.
Figure 1. Multi-stage deception architecture.
3.2. Multi-Stage Deception
Multi-level deception consists of moving the attacker from one level of deception to another. Indeed, when the attacker’s behavior shows that he is no longer interacting sufficiently with the system and that he is making requests to access the system, he is redirected to the second deception level. This shows the attacker that he is making progress. Based on Figure 1, we propose the multi-level deception management algorithm (Algorithm 1) shown in Figure 2.
Let BS, WS, and GS be the black state, white state, and gray state, respectively. Let u be a user, and L the level of deception the user u is in. When the user authenticates, he or she is placed in one of the system states: black, white, or gray. When the user is placed in the black state, he is redirected to the first level of deception, and the data to which he has access is fake. If the user is placed in the gray state and the number of connection attempts has not been exhausted, the system switches to a forced authentication approach. If this fails, the user is considered an attacker and placed in the black state. If not, he’s considered a legitimate user. Legitimate users are redirected to the real information system, where they can access real data.
If the attacker is at the first deception level and his actions demonstrate persistence in breaking into the real system, he is redirected to the second deception level in order to renew his belief. All the attacker’s actions at different levels are collected to increase the system’s level of security.
Figure 2. Algorithm for multi-level deception.
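The state handling described in this section can be sketched in a few lines of Python. This is only an illustrative reading of the prose, not a transcription of the algorithm in Figure 2: the names `State`, `User`, `place_user`, and `record_action` are hypothetical, the `forced_auth` callback stands in for the forced authentication step, and the `persistent` flag is an assumed placeholder for whatever criterion the system uses to detect persistence. The database names are those used later in Section 4.1.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class State(Enum):
    BLACK = "BS"  # malicious user
    WHITE = "WS"  # legitimate user
    GRAY = "GS"   # uncertain user


MAX_ATTEMPTS = 3  # assumed limit; three attempts are used in the simulations


@dataclass
class User:
    name: str
    state: State
    level: int = 0      # 0 = real system, 1 or 2 = deception level
    attempts: int = 0   # forced authentication attempts so far
    actions: List[str] = field(default_factory=list)  # kept for later analysis


def place_user(u: User, forced_auth: Callable[[User], bool]) -> str:
    """Resolve a gray user, then return the database the user is connected to."""
    # Gray users retry forced authentication until it succeeds or attempts run out.
    while u.state is State.GRAY and u.attempts < MAX_ATTEMPTS:
        u.attempts += 1
        if forced_auth(u):
            u.state = State.WHITE
    if u.state is State.GRAY:
        u.state = State.BLACK  # authentication failed: treat as an attacker
    if u.state is State.BLACK:
        u.level = 1
        return "fake_base1"
    return "real_base"


def record_action(u: User, action: str, persistent: bool) -> str:
    """Log an attacker action and move the attacker to level 2 if persistent."""
    if u.state is not State.BLACK:
        return "real_base"
    u.actions.append(action)
    if u.level == 1 and persistent:
        u.level = 2
    return "fake_base2" if u.level == 2 else "fake_base1"
```

For instance, a gray user whose forced authentication always fails ends up in the black state at level 1, and is moved to level 2 the first time `record_action` is called with `persistent=True`.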
3.3. Text Generation Component
The text generation component is used to generate the data used to feed the servers in deception environments. This component considers as input the real data and produces as output the generated data. It must be able to produce several outputs for the same data. It will allow the creation of several deception servers from the same real data server. Considering that the second level of deception contains particular attackers, that is to say, who have an idea of the quality of the data and the system functioning, the data generated for the server of this environment must be more practical.
Deceiving an attacker with text requires that the text be generated in accordance with a number of principles. Among these principles, there are two that are essential for deception to be effective:
· Firstly, the generated text must not provide the attacker with any sensitive information or allow the attacker to deduce any sensitive information. This ensures that the model is risk-free.
· Secondly, the generated text must remain in the same domain as the original text. This increases the attacker’s confidence. If, for example, the attack targets the banking system and the attacker receives a medical text, he will quickly realize that the text received is not credible.
In addition to these two elements, we can add the proportionality of input and output text, the consistency of generated text, and multi-context generation. Multi-context or multi-domain generation refers to a text in which several unrelated subjects can be identified.
Thus, the generation model consists of four stages: pre-processing, extraction of sensitive information, replacement of sensitive information, and, finally, generation of the final text. If our model is applied to an enterprise data server with small data sets, the segmentation stage is not necessary, as all the data in the database will be of the same type. Figure 3 shows the architecture of our model.
3.3.1. Step 1: Pre-Processing
This stage consists of removing all characters that could bias the model’s training. These include punctuation characters such as commas, semi-colons, periods, question marks, exclamation marks, etc., as well as specific characters such as $, #, &, \, /, etc.
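As an illustration, this cleaning step could be written with Python's standard `re` module as below. The function name `preprocess` and the `lowercase` option are assumptions made for this sketch; the choice to replace removed characters with a space (rather than deleting them) is also an assumption, made so that adjacent words are not glued together.

```python
import re

# Anything that is not a letter, digit, or whitespace is treated as noise.
# The underscore is listed separately because \w would otherwise keep it.
_NOISE = re.compile(r"[^\w\s]|_")
_SPACES = re.compile(r"\s+")


def preprocess(text: str, lowercase: bool = False) -> str:
    """Strip punctuation and special characters, then normalise whitespace."""
    cleaned = _NOISE.sub(" ", text)              # assumption: replace by a space
    cleaned = _SPACES.sub(" ", cleaned).strip()  # collapse runs of whitespace
    return cleaned.lower() if lowercase else cleaned
```

Calling `preprocess("Pay $100, now! #urgent")` yields a string containing only words and single spaces.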
3.3.2. Step 2: Extraction of Sensitive Information
Assume a text T consisting of n sentences and each sentence consisting of m words. The set of sentences and words in text T can be defined by S = {s1, s2, ..., sn}, with each sentence si defined by wi = {w1,i, w2,i, ..., wm,i}, respectively. To extract the key parameters, we used the Stanford parser model available at https://nlp.stanford.edu/software/.
Let G = (C, E, w) be a graph, where C is the set of sensitive information (concepts) defining the nodes, E is the set of links between sensitive information such that E ⊆ C × C, and w is a function that associates with each link (c, c') ∈ E a weight representing the frequency of occurrence of c and c' in T.
If c is used to explain c', then it is important to understand c before c'. Agrawal et al. [32] propose a quantification between c and c' to define the level of comprehension. Given that the input text T is made up of a set of concepts, with R(c) the set of concepts related to c (c' ∈ R(c) if w(c, c') > 1 [32]) and f(c, T) the frequency of appearance of c in T, the sensitivity level of c in T is defined by
(1)
The key phrase of the text (the most sensitive phrase) noted, sk, is defined by the phrase with the highest sensitivity level, δ (c, T).
Given the graph G = (C, E, w) defined above and the connection links (c, c') ∈ E between these different critical words, the connectivity of c is defined by:
(2)
where w is the weight of edges connecting to c, and λ is a constant set [33] .
The connectivity of the text T can be given by
(3)
Based on Equations (2) and (3), we can replace each sensitive word c ∈ C. The connectivity graph helps in choosing a suitable replacement word, which preserves text consistency.
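To make the graph G = (C, E, w) more concrete, the sketch below builds the edge weights from a tokenised text. It rests on two assumptions that are not stated in the text: the weight w(c, c') is taken to be the number of sentences in which c and c' both appear, and each concept is a single token. The function names `build_concept_graph` and `related` are hypothetical, and the sensitivity and connectivity measures of Equations (1) to (3) are deliberately not implemented here.

```python
from collections import Counter
from itertools import combinations
from typing import List, Set


def build_concept_graph(sentences: List[List[str]], concepts: Set[str]) -> Counter:
    """Return edge weights: w[{c, c'}] = number of sentences containing both.

    Assumption: co-occurrence is counted once per sentence.
    """
    weights: Counter = Counter()
    for words in sentences:
        present = sorted(concepts.intersection(words))
        for c1, c2 in combinations(present, 2):
            weights[frozenset((c1, c2))] += 1  # undirected edge
    return weights


def related(c: str, weights: Counter) -> Set[str]:
    """R(c): the concepts c' linked to c with weight w(c, c') > 1."""
    out: Set[str] = set()
    for pair, w in weights.items():
        if c in pair and w > 1:
            out |= pair - {c}
    return out
```

With such a structure, candidate replacement words can be compared against the neighbours of c, which is one way the graph could support a consistent choice.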
3.3.3. Step 3: Critical Word Replacement
Critical word replacement involves taking θ (c) and connect (T) and finding a word or group of words cr that can replace c in such a way that the text retains its consistency. By replacing the set of critical words in T, we end up with a text T’. Thus, the critical words are defined by the function fc by:
(4)
where Cr is the set of replaced critical words. Thus, cr connectivity is defined by:
(5)
Similarly, T' connectivity is given by:
(6)
3.3.4. Step 4: Text Generation Process
Text is generated after the critical words have been replaced. This ensures that the output text supplied to the attacker cannot contain any sensitive information.
Text generation can be performed in two ways:
· Either a total generation of T', which consists in generating a text from all the words in T'.
· Or by partially generating T', i.e., generating all the other words in the T' text, but without touching the critical words.
The generation process consists of two phases as shown in Figure 4:
· The first phase consists of segmenting the input text into sentences (input sequence) and identifying the context of each sentence (context extraction). This is passed to the generator, which uses the sequence-to-sequence generator and outputs n candidate sequences for each input sequence.
· The second phase takes the candidate sentences and the general context of the input text as input and selects the best candidate sentence close to the general context using similarity based on the cosine function [34] .
· The selection function then performs a post-processing operation to reconstitute the various segments into a single output segment.
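The selection phase can be illustrated as follows. The sketch uses plain bag-of-words count vectors instead of the Doc2Vec embeddings mentioned in Section 4.2, purely to stay self-contained; the function names (`bow`, `cosine`, `select_best`, `assemble`) are hypothetical, and the candidate sentences are assumed to come from the Seq2Seq generator, which is not reproduced here.

```python
import math
from collections import Counter
from typing import List


def bow(text: str) -> Counter:
    """Bag-of-words vector (a stand-in for a learned document embedding)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_best(candidates: List[str], context: str) -> str:
    """Pick the candidate sentence closest to the general context."""
    ctx = bow(context)
    return max(candidates, key=lambda s: cosine(bow(s), ctx))


def assemble(candidates_per_sentence: List[List[str]], context: str) -> str:
    """Post-processing: join the selected sentence of each segment."""
    return " ".join(select_best(c, context) for c in candidates_per_sentence)
```

Replacing `bow` with an embedding model changes only how the vectors are produced; the selection by maximum cosine similarity stays the same.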
3.4. Deception Data Base
Generation in the context of cyber deception differs from classical generation in that it is not enough to generate text randomly; the text must be able to convince an attacker. We assume that attackers have partial knowledge of the information they are attacking.
Our generation model is therefore a module that lies between real data and generated data as presented in Figure 5.
For each data of the real system, we have generated data in a deception environment. This module can thus be regarded as a function defined in the following way:
Let D be the set of real data and D' the set of generated data. We can define a function f using Equation (7):
\[ f \colon D \to D', \qquad d \mapsto f(d) = d' \tag{7} \]
where d is real data and d’ is the generated data. d can be a word, sentence or file.
According to the generation model process expressed by Equation (7), f must be:
· Injective: for two distinct elements d1 and d2 in D, we have two distinct elements d'1 and d'2 in D'. It is not possible to obtain the same output from two distinct texts. This ensures that, even for documents with similar content, the generated texts will not contain repeated sentences;
· Surjective: for each element d' of D' there is an element d of D from which d' was generated. Each text generated in D' is based on a text d of D.
In view of the above, we can therefore conclude that f must be a bijective function.
In addition to being bijective, this function must be a one-way function. This means that from a generated text d', it must not be possible to find the original text d (line number 2). The fact that f must be bijective makes the model more interesting. From the same text d of D, we can have several outputs d'1, d'2, ..., d'n of D'. Hence, the ability of the model to provide several outputs makes it possible to build several deception environments from the same data set. The fact that f is one-way means that there is no risk of recovering the real message from a generated text.
4. Simulation Result
4.1. Deception Environment Implementation
The authentication model implemented in this work is the one proposed by Kouam and Nkenlifack [1], which uses two-factor authentication: password and biometric. We also used the intrusion detection system proposed by [35], which enabled us to create three user queues: malicious, uncertain, and legitimate.
We have installed and used WAMP server and MySQL database management systems with three databases: real_base, fake_base1, and fake_base2. Malicious users are connected to fake_base1, legitimate users to real_base, and unsure users are sent back for an authentication process as long as the number of attempts is not exceeded. We have considered three attempts in our simulations.
Figure 5. Deception process in data base.
If the first deception level records repetitive actions or bypass attempts, for example, this implies that the attacker is aware that he is in a deception system. We can then transfer him to the second deception level by connecting him to fake_base2 to make him believe that he has succeeded in bypassing security since he has different data. In this way, the first level of deception is used to connect newly detected users, and when the attacker’s actions raise doubts, he is sent to the second level of deception.
4.2. Dataset and Implementation Parameters
We used several datasets in order to vary the model evaluation contexts. The first dataset, named True dataset [36], was downloaded from Kaggle. It contains a list of articles considered real news, with four columns: title (the title of the article), text (the text of the article), subject (the subject of the article), and date (the date the article was posted). For our tests, we focus on the text and subject columns. The text column contains 21192 unique values, and the subject column contains 53% political news and 47% world news. This dataset therefore gives us two (2) contexts to manage: political data and world data. The second dataset is that of OpenMRS [37]. This database contains up to 5000 patients with almost 500,000 observations. The third dataset was collected on the Internet.
Pre-processing consists of converting the text to lower case to make training easier, and of removing special characters such as ?, |, and #. We used the Generative Pre-trained Transformer, one of today’s most widely used pre-processing models.
For the implementation, we used Seq2Seq with attention for the first module and Doc2Vec for the second module. Cosine similarity is used to compare two texts. To build the first module, we used an LSTM with a size of 120 layers, a softmax activation function, and a dropout of 0.1. For the second module, the model is trained using a bidirectional LSTM layer with a size of 512, ReLU as the activation function, and a dropout of 0.5, followed by a dense layer with the size of the dimension vector. We use the Adam optimizer and the log-cosh function to compute the loss, with a learning rate of 0.001. After training the model for 70 epochs, we obtain an accuracy of 98% for the True dataset, 89% for the OpenMRS dataset, and 91% for the data collected on the Internet.
4.3. Example of Text Generated
We ran a simulation, taking data from social networks on the one hand and also running tests on members of our laboratory on the other. The model was evaluated on its ability to identify and replace sensitive words so as to hide the real message. Let’s consider the following text in Figure 6.
Using the keyword identification model, we obtain Figure 7.
In this text, the words in red are words identified as sensitive according to our sensitive word dictionary. The graph of sensitive words and input text is given in Figure 8 and Figure 9, respectively.
These graphs make it possible to carefully select words to substitute for sensitive words. The result is Figure 10, with the replaced words in blue.
At this point, we can proceed with text generation. The text obtained after generation is presented in Figure 11.
In this output, the sensitive words have been replaced with alternative terms to obfuscate the original meaning. As we can see, “packet delivery” has been replaced with “secure item transfer,” “Cleveland Clinic” has been replaced with “City Medical Center,” “Hospital” has been replaced with “Facility Medical Facility,” and “parcel” has been replaced with “package.” These changes aim to deceive potential attackers by altering the context and making it harder to discern the true meaning of the text.
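The effect of the replacements listed above can be reproduced with a simple dictionary substitution, shown here only as an illustration. The model itself selects replacements using the connectivity graph, which this sketch does not do; the function name `replace_sensitive` and the `REPLACEMENTS` mapping are hypothetical, and the mapping contains only three of the pairs reported above.

```python
import re
from typing import Dict

# Pairs taken from the example discussed in the text.
REPLACEMENTS: Dict[str, str] = {
    "packet delivery": "secure item transfer",
    "Cleveland Clinic": "City Medical Center",
    "parcel": "package",
}


def replace_sensitive(text: str, mapping: Dict[str, str]) -> str:
    """Replace each sensitive term by its substitute, ignoring case."""
    # Longest terms first, so multi-word terms win over shorter ones.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b", re.IGNORECASE
    )
    lookup = {k.lower(): v for k, v in mapping.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)
```

A fixed dictionary like this cannot guarantee consistency of the surrounding text, which is why the subsequent generation step remains necessary.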
4.4. Evaluation Metrics
For evaluation, we used four metrics:
Precision: Precision measures how accurately our model identifies and replaces sensitive information. It calculates the ratio of true positive replacements (accurately identified and replaced sensitive information) to the total number of replacements made by the model. A higher precision indicates a lower rate of false positives.
Figure 7. Text with identified sensitive information.
Figure 10. Text with sensitive information replaced.
Recall: Recall measures how well the model captures all instances of sensitive information. It calculates the ratio of true positive replacements to the total number of actual sensitive instances in the text. A higher recall indicates a lower rate of false negatives.
F1 Score: The F1 score provides a balanced measure of precision and recall. It is the harmonic mean of precision and recall, giving equal weight to both metrics. The F1 score helps evaluate the overall performance of the model in identifying and replacing sensitive information.
Human Evaluation: Human evaluation can help identify any nuanced errors or improvements that may not be captured by automated metrics. We have used five users to make human evaluations.
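The three automatic metrics can be computed from the set of words the model replaced and the set of words that are actually sensitive, as in the sketch below. The function name `precision_recall_f1` and the representation of both inputs as sets of strings are assumptions made for illustration.

```python
from typing import Set, Tuple


def precision_recall_f1(predicted: Set[str], actual: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for sensitive-word identification.

    predicted: words the model identified and replaced.
    actual:    words that are truly sensitive (e.g. marked by a human).
    """
    tp = len(predicted & actual)  # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1
```

For example, if the model replaces four words of which three are truly sensitive, and the text contains five sensitive words, precision is 3/4 and recall is 3/5.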
4.5. Evaluation
The evaluation metrics are presented in Figure 12.
For the human test, we asked five users to provide us with data and manually identify their keywords. We ran the data through our model and obtained the results presented in Table 1.
By using data in Table 1, we can plot the graph in Figure 13.
4.6. Discussion
Babu [12] proposes a deception model that encrypts the context of a message using NLP. The approach involves replacing part of the input text. The output can be inconsistent, and critical information can remain visible in it. Kushwaha et al. [13] encrypt relevant information in order to keep it. The problem is that after the sensitive information has been changed, the attacker can read the message and deduce the sensitive part. Tanmoy [14] proposes a model that, for each document on a server, generates n documents and keeps them on the same server. The problem is that the attacker may be lucky and directly obtain the real document. Moreover, the model uses too much space to store the duplicate documents. Prakruthi et al. [15] propose adding text to a given text in order to manipulate its comprehensibility. However, the output contains the whole real message alongside the added text, so the attacker can read all the sensitive information. The authors of [25] - [31] have worked on generating texts of substantial size. However, they encountered problems of consistency and missing words in the generated text.
Figure 13. Precision and recall across evaluation scenarios.
Looking at these models, we can see that each of them presents at least one major problem when the aim is to deceive the attacker. In our model, however, the segmentation introduced in Module 1 solves the problem linked to the size of the input text. Indeed, by splitting a text into several segments, we obtain inputs that can be easily handled by the model. One of the key problems in generating deceptive text lies in the ability to hide sensitive information from the attacker in the output text. Unlike [12] and [13], which replace words in the input text to modify the context, our model searches for sensitive words and replaces them in such a way as to preserve the coherence of the text after replacement. Moreover, the text is then regenerated before being output. So, contrary to the literature, it is not possible to read or deduce sensitive information from the generated text.
4.7. Applicability
The multi-level deception model proposed in this article can be used in any information system, but it needs to be optimized for use in embedded tools such as sensors. However, the text generation model applied to it can only be used by systems handling textual data, as the model does not handle numerical data or formulas. To be effective in the medical field and produce good results, the model will require further training with more data sets, as the model is not effective when it comes to abbreviations. Ultimately, the model cannot be used in banking systems, information systems manipulating image or video data or geographic coordinates, such as drones, and the like.
5. Conclusions
Information systems face multiple attacks, usually targeting data. These range from modification and deletion to making data unavailable. In this paper, we propose a multi-level deception system based on three states: the black state containing malicious users, the gray state containing uncertain users, and the white state containing legitimate users. Data manipulated by black-state users is generated from real data.
Text generation involves several steps, including the identification of sensitive words, the replacement of sensitive words, and text generation. We applied this to test data and users, and the results obtained validated the model.
However, the model has a few limitations, such as the handling of synonymous words. In the generated text discussed in Section 4.3, for example, sensitive terms such as “packet delivery” and “Cleveland Clinic” were replaced, but the word “drone”, which refers to a UAV, was not identified as sensitive, nor was the word “delivery”.
Acknowledgements
Research was sponsored by the Army Research Office and was accomplished under Grant Number W911NF-21-1-0326. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.
The work of the second author is partially supported by the EPSRC grant EP/V049038/1 and the Alan Turing Institute under the EPSRC grant EP/N510129/1.