“stppSim”: A Novel Analytical Tool for Creating Synthetic Spatio-Temporal Point Data
1. Introduction
Understanding crime dynamics holds immense significance in shaping effective policies and fostering safer communities [1]. By examining the complex interplay of factors that drive criminal behaviour and patterns, we gain insights that can guide targeted interventions and law enforcement strategies, and thereby help to prevent crime. The analysis of temporal patterns and spatial concentrations, along with the interconnection of these dimensions in criminal behaviour, enables law enforcement agencies to allocate resources strategically and deploy preventive measures proactively [2] [3]. This, in turn, reduces the opportunity for criminal acts to occur.
Ongoing advancements in digital data acquisition systems have improved the quality of urban crime recordings [4] [5] across various policing jurisdictions, enabling police practitioners to enhance their understanding of crime dynamics. Fine-grained crime data is particularly useful in hotspot policing, where it is used to identify problematic areas and target appropriate policing responses [6] [7]. However, improved data quality raises concerns regarding the confidentiality of personal information [8] [9]. The disclosure of personal information from crime data is a serious concern, as it can put individuals at risk of harm, discrimination, or other negative consequences. As such, police agencies take steps to protect the confidentiality of the individuals to whom the data relate. This includes implementing strict data sharing protocols, limiting access to data, and ensuring that any data released is anonymized. Data aggregation is one common technique used to coarsen spatiotemporal data for the purpose of sharing while protecting privacy [10] [11]. However, such techniques can also have negative impacts on data accuracy [12] [13] [14] [15], data quality [16] [17], and data fitness for purpose [18] [19]. While spatial aggregation may serve to reduce biases in analytical outcomes [20], fine-grained raw datasets are often considered more valuable due to their flexibility for manipulation and suitability for a wider range of purposes.
Accessing detailed spatiotemporal crime records presents a set of formidable challenges. Chief among them is the intricacy of data privacy and security [21]. The police are responsible for sensitive and confidential data related to criminal activity, and must ensure that any data sharing complies with legal requirements and security protocols. Moreover, the fragmentation of data sources across jurisdictions poses a significant hurdle. The diverse methods, formats, and standards of data collection within different geographical boundaries necessitate complex integration efforts, hindering seamless analysis of spatiotemporal patterns. Additionally, resource constraints within law enforcement agencies can impede data quality and accessibility [22] [23]. Many agencies lack the necessary technological infrastructure and expertise to effectively manage and share complex spatiotemporal data [22]. As a result, inconsistencies and gaps in data reporting can arise. Overcoming these challenges demands collaborative efforts to ensure data security, integration, and accessibility while maintaining the privacy of individuals and the integrity of ongoing investigations.
As an alternative, synthetic data that model specific aspects of crime dynamics, such as patterns of biases in crime counting [24], spatial concentration of crimes [25] [26], and target selection by offenders [27], can be developed. However, existing studies lack adaptable methodologies and practical tools to replicate real-world datasets or synthesise predefined patterns and interactions among crime occurrences in both spatial and temporal domains. The significance of crime event interactions in crime science cannot be overstated. Numerous crime phenomena, including repeat victimisation [28] [29] [30] [31], crime concentrations [32], and optimal foraging [33] [34], emanate from the interplay between crime events in space and time. Consequently, scrutinizing space-time interactions of crime carries both research and operational benefits. For instance, the recurrent patterns of residential offenders can guide law enforcement in targeting limited police resources. Hence, this paper addresses this methodological challenge by developing the “stppSim” tool on the R platform in order to allow reproducibility and advancement in other domains.
Specific criminological theories played important roles in the development of the “stppSim” tool. These include rational choice theory (RCT) [35], routine activity theory (RAT) [36], and crime pattern theory (CPT) [37]. In particular, RAT describes the conditions that have to be met while the offender moves and interacts with the environment. It states that a crime occurs when three elements, namely motivated offenders, suitable targets, and the absence of capable guardians, converge in space and time. The use of RAT and other related theories in simulation studies of crime is well documented in the existing literature [38] [39] [40].
In order to simulate crime in a virtual environment, two approaches are commonly used, namely agent-based modeling (ABM) [41] [42] and microsimulation (MSM) [43] [44]. These techniques operate at the individual (entity) level and rely on assumptions, domain theories, and previous findings. ABM focuses on the interactions between individuals to produce unexpected outcomes. It can simulate complex social systems and model how individuals’ actions affect each other and their environment. MSM, on the other hand, focuses on individual stochastic behavior to generate aggregated or disaggregated patterns. In other words, it can simulate the behavior of large populations by modeling the behavior of individual members. The study in [45] demonstrated that ABM and MSM can be integrated to simulate burglary crimes in a heterogeneous environment by combining street network and land use information. This hybrid approach is considered more dynamic than traditional methods and allows for more realistic simulation of crime patterns.
This article introduces an innovative fusion of ABM and MSM techniques to establish a versatile framework for simulating point events across spatial and temporal dimensions. Complementing this framework is an analytical tool named “stppSim”, developed within the R programming platform. The primary objective of this study is to elucidate the operational mechanics of the framework, highlight the tool’s functionalities, and offer insights into its pivotal outcomes. By harnessing the potential synergy of ABM and MSM, the stppSim tool replicates crime patterns to align with pre-defined specifications. It simulates the stochastic conduct of individual offenders and their engagements with the environment, from which crime patterns and interactions emerge across both spatial and temporal dimensions. In order to ensure a balance between safeguarding the spatiotemporal identities of real individuals and generating valuable data (i.e., accurate records), the simulation commences by anchoring the process at a higher macro (global) level. This involves capturing the overall behaviour, trends, and patterns of the system without delving into the details of individual components. As simulation parameters are gradually propagated from a broad perspective to a finer granularity, the framework generates outputs in alignment with the predefined data structure, while upholding location privacy at the detailed level.
This paper is organized as follows: Section 2 presents a detailed overview of the proposed agent-based microsimulation framework. In Section 3, the implementation of the “stppSim” tool is described, emphasizing its key features and functionalities. The application of the tool for generating synthetic spatiotemporal point patterns of crime is described in Section 4. The last section of the paper discusses the significance of the tool in research and practical contexts, and concludes with essential considerations for users and potential areas for future enhancements.
2. Spatio-Temporal Point Pattern Simulation Framework
The proposed simulation framework is aimed at synthesizing crime events, marked by their locations and reference times, through artificial offenders (agents) within a specified geographical region and time period. The objective is to ensure that a significant number of events which are relatively close in space are also relatively close in time [46], according to specified spatial and temporal thresholds. This closeness in both dimensions is what constitutes the space-time interactions between the events.
The framework consists of two main components: Features Calibration and Model Integration, as shown in Figure 1. The Features Calibration component contains two sets of variables: global and individual level variables. These variables are identified as important to crime modeling based on existing theories. The initial values of these variables are set using expert knowledge and research findings. The global variables are those that affect the overall spatial patterns and trend of crime, such as the spatial proportional ratio [47] and trend direction. On the other hand, individual level variables refer to the characteristics of agents (e.g., offenders) that are embedded in the simulation. These variables may include residences (origins) and speed (step length) of offenders.
Figure 1. Spatio-temporal point pattern simulation framework.
The Model Integration component takes the selected variables from the Features Calibration component to initialize and configure the modeling functions (ABMs and MSMs) within the simulation environment. This integration results in the changing of agents’ states from “exploratory” to “offending” and vice versa. A crime event is said to occur when an agent assumes an “offending” state. Each component is further described as follows:
2.1. Features Calibration
Using global level variables (see Table 1), the framework configures the spatial and temporal properties of the simulation environment. For example, the spatial proportional ratio is a global level variable that controls the spatial concentration of events across the simulation space [47]. Similarly, the trend is a global level variable that determines the long-term direction of the simulated time series. These variables play a critical role in defining the overall characteristics of the simulation environment and ensuring that the synthetic data generated by the simulation are realistic and comparable to the existing data.
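To make the role of a global temporal variable concrete, the following is a minimal sketch in Python, not the package's R implementation. It assumes that the trend can be represented as a linear weight over days; the function name `sample_event_days` and the `slope` parameter are hypothetical and introduced only for illustration.

```python
import random

def sample_event_days(n_events, n_days, slope, rng):
    """Draw event days with probability proportional to a linear trend.

    Assumption for illustration: the global trend is a straight line whose
    weight rises (slope > 0) or falls (slope < 0) across the study period.
    """
    # Weight of day d; clipped so that every weight stays positive.
    weights = [max(1e-9, 1.0 + slope * d / (n_days - 1)) for d in range(n_days)]
    return rng.choices(range(n_days), weights=weights, k=n_events)

rng = random.Random(0)
days = sample_event_days(n_events=5000, n_days=365, slope=3.0, rng=rng)

# With a rising trend, later days should receive more events than earlier ones.
first_half = sum(d < 365 // 2 for d in days)
second_half = len(days) - first_half
print(first_half, second_half)
```

A negative `slope` would give a falling trend under the same assumption. Short-term variation, which the framework handles at the individual level, is deliberately left out of this sketch.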
The individual level variables complement the global level variables to ensure that the simulated data has the desired characteristics at the local level. In other words, these variables control local variations in the simulation, such as the variance in local concentration of events (using the “s_band” variable), as well as short-term patterns in the time series. Note that the landscape in which the simulation takes place can be either homogeneous or heterogeneous, with varying levels of restrictions depending on the features included, such as land use or street network.
2.2. Model Integration
In order to initialize and configure the functions that enable agents’ movements and interactions across the landscape, the selected variables are integrated together with the environmental features to allow changes in the behaviours (states) of the agents. The process is summarised as pseudocode in Table 2.
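The state change can be illustrated with a short Python sketch. It is a simplified stand-in written under stated assumptions, not the procedure in Table 2: the agent performs an unrestricted random walk on a homogeneous landscape, and the function `simulate_agent` and the per-step offending probability `p_offend` are hypothetical names.

```python
import math
import random

def simulate_agent(origin, step_length, n_steps, p_offend, rng):
    """Random-walk one agent and record an event each time it offends."""
    x, y = origin
    events = []
    for t in range(n_steps):
        # Exploratory state: move one step in a uniformly random direction.
        heading = rng.uniform(0.0, 2.0 * math.pi)
        x += step_length * math.cos(heading)
        y += step_length * math.sin(heading)
        # State change (assumed rule): offend with probability p_offend.
        state = "offending" if rng.random() < p_offend else "exploratory"
        if state == "offending":
            # A crime event is recorded at the agent's current location and time.
            events.append((x, y, t))
        # The agent is exploratory again at the start of the next step.
    return events

rng = random.Random(42)
events = simulate_agent(origin=(0.0, 0.0), step_length=20.0,
                        n_steps=200, p_offend=0.1, rng=rng)
print(len(events), "events simulated")
```

In a heterogeneous landscape, the step would additionally be constrained by restriction features such as land use, which this sketch omits.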
3. “stppSim” in Practice
The proposed framework is developed into an add-on package, “stppSim”, for the statistical software R [56]. Its utility and reproducibility are described as follows:
3.1. Implementation
The “stppSim” package is freely available under the open-source GNU GPL 3 license on the Comprehensive R Archive Network (CRAN) (https://cran.r-project.org/web/packages/stppSim/index.html). The development version and code are available on GitHub (https://github.com/MAnalytics/stppSim). To install stppSim, open the R console (or RStudio) and type install.packages("stppSim"), then run library(stppSim) to load the package.
Table 1. Simulation parameters and their descriptions.
· Package name: stppSim
· Current version: 1.3.1
· Package home page: https://github.com/MAnalytics/stppSim
· Operating system(s): Platform independent
· Programming language: R
· Other requirements: R (≥4.1.0)
· Key dependencies: SiMRiv [57]; raster [58]
· License: GNU GPL v3.0
· Any restrictions to use by non-academics: None
Table 2. Model integration and state change.
3.2. Modes of Operation and Assessments
The tool operates in two modes. The first mode allows users to generate complete synthetic data from sample (source) data, using the “psim_real” function. The function learns the spatial and temporal properties of the sample data and generates the synthetic dataset accordingly. This method is particularly useful when only a sparse or small sample of crime records is available. The second mode, using the “psim_artif” function, generates synthetic data based on pre-defined spatiotemporal characteristics provided by the user, without the need for a sample dataset. This mode is useful when no sample source dataset is available. In both modes, many of the arguments have default values chosen to suit a wide range of scenarios. However, users can re-define any argument to suit their specific research objectives. Detailed instructions and reproducible examples can be found in the package manual and vignette.
The efficacy of the “stppSim” tool can be evaluated through both visual inspection and basic statistical methods. From a visual standpoint, the spatiotemporal patterns can be observed by mapping the distribution of points and tracking event trajectories over a given period. Here, a scatterplot for the spatial distribution and a time series plot for the temporal patterns are most appropriate. From a statistical standpoint, the space-time interactions present in point datasets can be scrutinized with the NearRepeat calculator [59]. This tool determines the statistical significance of the proximity of points in both space (within a set spatial range) and time (within a specified temporal frame).
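The logic of such a test can be sketched in Python as a Knox-type ratio with a permutation test. This is an illustrative sketch under stated assumptions and does not reproduce the NearRepeat calculator: the band boundaries, the use of the mean permuted count as the expected value, and the names `knox_count` and `knox_ratio` are choices made here for illustration.

```python
import math
import random
from itertools import combinations

def knox_count(points, s_lo, s_hi, t_lo, t_hi):
    """Count event pairs whose distance and time lag both fall inside the band."""
    count = 0
    for (x1, y1, t1), (x2, y2, t2) in combinations(points, 2):
        dist = math.hypot(x1 - x2, y1 - y2)
        lag = abs(t1 - t2)
        if s_lo <= dist < s_hi and t_lo <= lag <= t_hi:
            count += 1
    return count

def knox_ratio(points, s_lo, s_hi, t_lo, t_hi, n_perm=99, seed=1):
    """Observed pair count divided by the mean count after shuffling event times."""
    rng = random.Random(seed)
    observed = knox_count(points, s_lo, s_hi, t_lo, t_hi)
    times = [t for _, _, t in points]
    permuted_counts = []
    for _ in range(n_perm):
        shuffled = times[:]
        rng.shuffle(shuffled)  # breaks any link between where and when
        permuted = [(x, y, t) for (x, y, _), t in zip(points, shuffled)]
        permuted_counts.append(knox_count(permuted, s_lo, s_hi, t_lo, t_hi))
    expected = sum(permuted_counts) / n_perm
    ratio = observed / expected if expected > 0 else float("inf")
    # Pseudo p-value: share of permutations at least as extreme as observed.
    p_value = (sum(c >= observed for c in permuted_counts) + 1) / (n_perm + 1)
    return ratio, p_value

# Toy data: 30 events, each followed by a nearby event 1 to 3 days later.
rng = random.Random(3)
points = []
for _ in range(30):
    x, y, t = rng.uniform(0, 2000), rng.uniform(0, 2000), rng.randint(0, 300)
    points.append((x, y, t))
    points.append((x + rng.uniform(-30, 30), y + rng.uniform(-30, 30),
                   t + rng.randint(1, 3)))

ratio, p_value = knox_ratio(points, s_lo=0, s_hi=100, t_lo=0, t_hi=7)
print(round(ratio, 2), p_value)
```

A ratio well above 1 with a small p-value indicates more close-in-space, close-in-time pairs than would be expected if event times were unrelated to locations.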
4. Applications
4.1. Replicating Spatio-Temporal Patterns
Besides its ability to generate pre-conceived spatiotemporal point patterns across a research area (via the “psim_artif” function), “stppSim” can also discern the spatiotemporal patterns and trends present in a sample source dataset (through the “psim_real” function) and subsequently generate new datasets based on those patterns. To illustrate this capability, we used a randomly chosen subset of residential burglary records from a section of Southwest-side Detroit (Michigan, US) [60]. This subset facilitated the creation of a fully synthesized dataset of 1960 events. Figure 2 offers a visual comparison of the spatial and temporal point arrangements between the source sample (Figure 2A(i-ii)), representing 40% (or around 780 records), and the comprehensive original dataset (Figure 2B(i-ii)) with its 1960 records. Notably, the sample dataset exhibits a spatial point distribution (SPD) akin to that of the full dataset. Moreover, the time series (TS) plots of both groups align considerably, with the overall trends appearing more consistent than their medium-term variations, which, in turn, are more consistent than their short-term variations.
Figure 3A(i) displays the spatial point distribution (SPD) of the synthetic dataset, with its corresponding time series (TS) plot illustrated in Figure 3A(ii). At a glance, the SPD and prominent hotspots closely mirror those of the original dataset, particularly the pronounced hotspots in the southern and western sectors. Nonetheless, some differences are evident, such as the missing cluster in the area’s southwest corner within the synthetic data. This omission could be counteracted by using the “interactive” argument of the function, which previews potential results prior to initiating the simulation. Regions marked as off-limits (like parks or swamps) based on land use data are consistently bypassed in the simulation. For instance, the tract between the south-west corner and the west is designated as a swamp (“restriction value of 1”). The original dataset’s hotspots seem more condensed than their synthetic counterparts, which appear slightly dispersed. A possible remedy could be to use fewer origins during the simulation. Users are encouraged to refer to the package manual for a deeper understanding of how varying parameters can impact the simulation. On a more detailed scale, such as street-level units, there are discernible variances between the source and synthetic datasets. The correlation between the original and synthetic data stands at 0.07, suggesting that pinpoint event locations in the original data do not typically correspond with those in the synthetic set.
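A fine-scale comparison of this kind can be sketched in Python as follows. The sketch assumes square grid cells as a stand-in for street-level units and a Pearson correlation of per-cell event counts; the paper does not state which correlation measure or units were used, and the names `grid_counts` and `count_correlation` are hypothetical.

```python
import math
from collections import Counter

def grid_counts(points, cell_size):
    """Count events per square grid cell (assumed stand-in for street units)."""
    return Counter((math.floor(x / cell_size), math.floor(y / cell_size))
                   for x, y in points)

def pearson(a, b):
    """Pearson correlation of two equal-length lists; NaN if either is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return float("nan")
    return cov / math.sqrt(var_a * var_b)

def count_correlation(original, synthetic, cell_size):
    """Correlate per-cell counts over cells occupied in at least one dataset."""
    counts_o = grid_counts(original, cell_size)
    counts_s = grid_counts(synthetic, cell_size)
    cells = sorted(set(counts_o) | set(counts_s))
    # Counter returns 0 for cells that are empty in one of the datasets.
    return pearson([counts_o[c] for c in cells], [counts_s[c] for c in cells])

original = [(10, 10), (12, 11), (90, 90), (55, 40)]
synthetic = [(60, 10), (70, 20), (10, 80), (20, 90)]
r = count_correlation(original, synthetic, cell_size=50)
print(round(r, 3))
```

A value near zero at a fine cell size, alongside similar patterns at coarser scales, is the behaviour described above: broad hotspots are preserved while individual event locations are not.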
Figure 2. Spatial and temporal pattern of residential burglary in South-west Detroit (Michigan, US, 2009-2010).
In the long term, the most striking resemblance between the two datasets lies in their overarching trend. Both sets, for instance, display a consistent upward trajectory. While the synthetic data showcases more pronounced seasonal spikes compared to the original, their general patterns remain analogous. When observed at a finer temporal resolution, like daily aggregates, the datasets seem more disparate. This distinctiveness in both spatial and temporal detail is essential, ensuring that the precise spatiotemporal locations of individual events in the source datasets remain confidential.
4.2. Simulating Space-Time (ST) Interactions
Utilizing the “psim_real” and “psim_artif” functions, users can, respectively, replicate the space-time interactions found in a source dataset and create new synthetic datasets with specified space-time interactions.
Figure 3. Spatial and temporal pattern of synthetic data.
1) Emulating Patterns from the Original Dataset
Drawing from the repeat victimization study on residential burglary, a maximum spatial boundary of 600 metres (i.e., s_range = 600) is established. This range is then divided into three equal spatial ranges: (0 - 200 m), (201 - 400 m), and (401 - 600 m), which are labeled as “small”, “medium”, and “large” spatial bandwidths, respectively. By setting a time span of 30 days with a daily incremental range, Table 3 juxtaposes the outcomes derived from the sample source data (780 records), the full source data (1960 records), and the synthesized data for each bandwidth (1960 records each). The table presents the Knox ratios as per the NearRepeat Calculator, with asterisks denoting statistically significant point interactions for the given space-time bandwidths.
When contrasting the outcomes of the synthetic datasets with both the sample and the full source datasets, it’s evident that the package often yields results more akin to the sample datasets than the entire data collection. For instance, within the initial time span (i.e., days 1 - 15), there exist seven overlapping spatiotemporal bandwidths with significant interactions, such as (0 - 200 m) at 6 days, between the synthetic and sample source datasets. In comparison, there are only five shared bandwidths displaying significant interactions, like (201 - 400 m) at 12 days, when matching the synthetic data with the full source datasets.
Transitioning to the latter segment of the time frame (i.e., days 16 - 30), the sample source data shows six concurrent spatiotemporal bandwidths with significant interactions, while the full source data only offers one. Furthermore, the data suggests that interactions within the proximate temporal span (days 1 - 15) are pinpointed with greater precision than those in the extended temporal range (i.e., days 16 - 30). Overall, the outcomes demonstrate the proficiency of stppSim in mirroring spatiotemporal interactions present in source datasets.
Table 3. Comparing space-time interactions of source and synthetic datasets.
Signif. codes: p < 0.001 “*”.
2) Simulation of Pre-defined ST Interaction
In the same study region (a segment of Detroit’s Southwest side), we generated a synthetic dataset featuring simulated spatiotemporal interactions. Here, three distinct spatial bandwidths were defined: [0 - 100 m], [100 - 200 m], and [200 - 300 m]. Concurrently, four 2-day interval temporal bandwidths were specified: 4 - 5, 13 - 14, 21 - 22, and 28 - 29 days. As a result, twelve individual synthetic datasets were formulated. Each dataset encapsulates point interactions as characterized by a distinct combination of spatiotemporal bandwidths that mirror the actual bandwidths.
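The arithmetic behind the twelve datasets, and one simple way of planting an interaction inside a chosen bandwidth, can be sketched in Python. The planting rule below (one follow-up event per parent event, at a distance and lag drawn from the chosen bands) is an assumption made for illustration and is not the mechanism used by “psim_artif”; the function `plant_near_repeats` is hypothetical.

```python
import math
import random
from itertools import product

spatial_bands = [(0, 100), (100, 200), (200, 300)]        # metres
temporal_bands = [(4, 5), (13, 14), (21, 22), (28, 29)]   # days

def plant_near_repeats(parents, s_band, t_band, rng):
    """Give each parent event one follow-up inside the chosen space-time band."""
    children = []
    for x, y, t in parents:
        dist = rng.uniform(*s_band)                # distance within the spatial band
        heading = rng.uniform(0.0, 2.0 * math.pi)  # direction chosen at random
        lag = rng.randint(*t_band)                 # whole-day lag within the temporal band
        children.append((x + dist * math.cos(heading),
                         y + dist * math.sin(heading),
                         t + lag))
    return children

rng = random.Random(7)
parents = [(rng.uniform(0, 5000), rng.uniform(0, 5000), rng.randint(0, 300))
           for _ in range(50)]

# One synthetic dataset per combination of spatial and temporal bandwidth.
datasets = {}
for s_band, t_band in product(spatial_bands, temporal_bands):
    children = plant_near_repeats(parents, s_band, t_band, rng)
    datasets[(s_band, t_band)] = parents + children

print(len(datasets))  # 3 spatial x 4 temporal = 12
```

Each dataset can then be tested against every bandwidth combination, which is the comparison reported in the paragraphs that follow.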
For every synthetic dataset, the NearRepeat calculator is utilized to evaluate all potential combinations of spatiotemporal bandwidths (hereafter referred to as test bandwidths) juxtaposed against the actual bandwidths. Table 4 showcases these findings. It’s worth noting that the diagonally aligned cells with statistical significance denote that the spatiotemporal interactions at the pertinent real spatial and temporal bandwidths were simulated with accuracy.
Table 4. Detection of space-time interactions in synthetic data.
Signif. codes: p < 0.001 “*”.
Each of the twelve synthetic datasets effectively mirrored the intended point interactions, as evidenced by the significant findings in the diagonal cells. This reaffirms the tool’s prowess in precisely emulating spatiotemporal interactions.
The presence of significant results in off-diagonal cells is also noteworthy. Such findings can be attributed to the compounded effects of the bandwidths specified. Essentially, a set spatial or temporal bandwidth can inadvertently give rise to point interactions across broader spatial or temporal bandwidths, especially if these larger bandwidths are direct multiples of the original ones. A clear illustration of this phenomenon can be observed in cells adjacent to the diagonal with significant results. For instance, the 0 - 100 m spatial bandwidth could be perceived as an inherent component of its larger counterparts, namely 100 - 200 m and 200 - 300 m. Furthermore, beyond the spatial dimension, the notable groupings of statistically significant cells within the temporal bandwidths of 21 - 22 days and 27 - 28 days can be traced back to the cumulative influences of the 4 - 5 days and 13 - 14 days temporal bandwidths, respectively.
5. Discussion and Conclusions
Given the limited availability of detailed crime records, the “stppSim” package serves as a valuable data resource for both research and educational purposes. It provides an alternative data source that can facilitate in-depth examination of crime dynamics in space and time, leading to potential policy and operational implications. The package is conveniently accessible on the CRAN platform, allowing users to freely download, reuse, redistribute, and explore its applications in various domains.
The field of criminology recognizes the significant value of examining the space-time interaction of crime events in various contexts. In the analysis of recurring residential burglaries, such analysis can aid in identifying individuals and locations that face a disproportionate risk of victimization [29] [48] . Researchers are often interested not only in the “same repeat” victims, referring to individuals or locations that experience multiple crimes within a short period after the initial incident, but also in the concept of “near repeat” victims. These near repeat victims are nearby individuals or locations that become victimized shortly after the initial crime occurs. The stppSim package provides opportunities for simulating or exploring different scenarios of repeat and near-repeat victimization within a specific geographical area. As demonstrated in this paper using a section of South-west Detroit as an example, it becomes possible to identify the spatial and/or temporal signatures associated with a particular area [47] [61] [62] .
The analysis of spatio-temporal point interaction extends beyond criminology and finds applications in various research domains, such as earthquakes, ecology, epidemiology, and more. In these fields, the identification of relationships between events and their evolution over space and time holds significant importance. Researchers seek to understand the underlying phenomenon by studying spatio-temporal event interactions, clustering or regularity patterns, and distances that provide insights into these interactions. For instance, in ecological studies at the community level, the analysis focuses on examining interactions of competition and facilitation among trees as a primary objective.
There are several important considerations for potential users of the stppSim package. Firstly, it should be noted that the synthetic point events generated by the package are inherently geomasked at a fine-grained level. This is done to protect the confidentiality of any source data used. Secondly, the simulation functions in the package incorporate specific random elements, which means that two synthetic datasets generated with the same simulation parameters may not be identical. However, an “interactive” argument embedded in the functions can be used to preview the spatio-temporal models before continuing the actual simulation. Thirdly, it is important to recognize that the properties of the synthetic data may be biased towards the characteristics of the sample dataset provided, rather than accurately representing the entire population of the actual data. Therefore, synthetic data should not be considered a replacement for real or source data. Any modeling or inference conducted on synthetic data carries additional risks. The author of the package suggests that synthetic data, when used in a research context, can help expedite the research process, but it is crucial that any final results intended for real-world applications be evaluated and fine-tuned using the actual data if necessary. Lastly, it should be noted that the current version of the stppSim package is computationally intensive, particularly when using the “psim_real” function. On a standard office PC with an Intel Core i7-7500 CPU and 16.0 GB RAM, a “psim_real” run takes approximately 30 minutes to complete. In contrast, the “psim_artif” function generates synthetic data within a relatively short period, around 5 minutes. Future work on the package will prioritize improving computational efficiency by incorporating parallel processing functions.
Additionally, upcoming versions of the package will include the ability to simulate other relevant nominal information, such as age, gender, occupation, and so on, of the objects under study. The author of the package encourages users to provide suggestions, feedback, bug reports, and explore opportunities for collaborations to further enhance its capabilities.
In summary, while stppSim offers valuable synthetic data generation capabilities, users should be aware of the geomasking, inherent randomness, and potential biases of the synthetic data. It is essential to exercise caution and verify results with real data when applying the findings to real-world scenarios.