A Hybrid Spatial Dependence Model Based on Radial Basis Function Neural Networks (RBFNN) and Random Forest (RF)
1. Introduction
The real world can be described in terms of geographic location. Virtually everything that occurs or exists does so “somewhere”, and knowing “where” is crucial because all human actions require an understanding of the earth [1]. The information contained in a map or photograph can now be expressed digitally because a variety of specialized computer systems have been developed to process geographical information [2].
Nowadays, topographic surveys, satellites, and other methods collect vast amounts of data in digital form, frequently with geographical references [2]. Data describing the position, attributes and associations of features in space are often termed spatial data [3]. Being referenced to a particular location in space gives these data unique characteristics (properties), which create both difficulties and opportunities for information mining. For that reason, the exploration of these data requires the implicit or explicit knowledge provided by spatial science [4].
Machine learning (ML) techniques enable computers to learn from data, extract knowledge, and recognize structures in large, high-dimensional datasets. They can be supervised, unsupervised, semi-supervised or reinforcement-based [5]. These techniques are frequently used across many fields [6] and have also been applied to spatial data. The main challenge is to build an adaptable model, based on the interaction between geographical data and ML, that can handle and evaluate complex spatial information.
The first property of spatial data, known as spatial dependence, states that “everything is related to everything else, but near things are more related than distant things” [7]. Consequently, we would anticipate that the majority of geographic events exhibit some form of spatial autocorrelation (dependence). This geographical relationship violates the assumption of independent and identically distributed (i.i.d.) observations on which many non-spatial statistical methods are based. This is frequently the case in population data, since people who have similar qualities tend to live in similar neighborhoods for a variety of reasons, including house prices, proximity to employers, and cultural considerations. For this reason, an ML algorithm may not adequately capture a significant occurrence [8].
One of the most common applications of ML is spatial prediction, which uses training samples to predict unknown values at specific locations [9]. ML algorithms have been used as potential replacements for geostatistical interpolation techniques (ordinary kriging and regression kriging) and for spatial analysis techniques (spatial autoregressive models and geographically weighted regression). Kriging and its multiple variants have been used as the Best Linear Unbiased Prediction approach since the 1960s, but the authors of [10], [11] and [12] have shown ML models to be superior to geostatistical interpolation techniques in terms of prediction.
According to [13], incorporating geographical proximity (buffer distance) effects into the learning process allows an ML model to generate spatial predictions comparable to those of ordinary and regression kriging. In the same vein, [14] incorporated spatial lag and eigenvector spatial filtering (ESF) features into an ML model, which showed lower errors and lower spatial autocorrelation than a model without spatial features. In spatial ML prediction, spatial features are therefore commonly incorporated into a model to account for spatial dependence.
To assess spatial dependence in models, Moran’s I has been computed [14], and a new integration of the weights matrix has been developed to detect spatial dependence patterns among residuals [15]. Therefore, when applying ML to spatial data, we need to be aware of the specific issues that dependence raises. Explicitly managing spatial dependence may improve the performance of the ML model or provide important new insights into how a task is learned, whereas failing to consider or include that property in the ML model can impair learning.
The exploration of spatial dependence within learning algorithms is still in its early stages. This study develops an approach that illustrates and improves the ability of an ML model to identify this dependence. It focuses on a hybrid ML model that captures the spatial dependence through the residuals without incorporating any spatial features in the model. Two representative ML models are proposed for spatial data: RBFNN, a type of neural network used for prediction tasks in which the current outcome is affected by the neighbors’ states or by contextual information, and RF, which is also appropriate for spatially dependent samples.
The paper is structured as follows: Section 2 presents the mathematical methods of the models; Section 3 presents their application; Section 4 discusses the experimental results; Section 5 provides the conclusion.
2. Methods
2.1. Models Specification
The purpose of supervised learning is to infer a function or mapping from labeled training data. It involves defining the input vector $X$ (the features that will be used to train the model, which may include spatial indexing) and the output vector $y$ (the label that the model is trying to predict). The model parameters then determine how the model uses these features and labels to make accurate predictions.
The following algorithms will be developed: RBFNN, RF and Hybrid model.
2.2. Radial Basis Function Neural Networks
Radial Basis Function (RBF)
In mathematics, an RBF is a function $\varphi$ whose value depends on the distance between the input and some fixed point. In the case where $c$ is the fixed point, the formula can be expressed as:
$$\varphi(x) = \varphi(\lVert x - c \rVert) \tag{2.1}$$
where $\lVert \cdot \rVert$ is the distance function.
In the domain of neural networks, the RBF is used to develop a mathematical model called the Radial Basis Function Network, which uses radial basis functions as activation functions [16]. The notion behind it is that the predicted target value of an item is likely to be close to that of other items with nearby values of the predictor variables. Therefore, an RBF network places one or more RBF neurons in the space described by the predictor variables. The distance used is the Euclidean distance in that space, measured from the point being evaluated to the centre of each neuron [17].
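Let $x$ denote the input vector and $y(x)$ the output of the network, given by a scalar function of the input vector. Then, the relation between the output and the input layers can be expressed as:
$$y(x) = \sum_{j=1}^{M} w_j \, \varphi(\lVert x - c_j \rVert) \tag{2.2}$$
In Equation (2.2), $M$ is the number of neurons in the hidden layer, $c_j$ is the center vector for neuron $j$ and $w_j$ the weight of neuron $j$ in the linear output neuron. By including the bias $w_0$ in Equation (2.2), the formula can be expressed as follows:
$$y(x) = \sum_{j=0}^{M} w_j \, \varphi_j(x) \tag{2.3}$$
where $\varphi_j(x) = \varphi(\lVert x - c_j \rVert)$ for $j \geq 1$ and $\varphi_0(x) = 1$.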
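Generally, the radial basis function is taken to be Gaussian:
$$\varphi(\lVert x - c_j \rVert) = \exp\left(-\beta_j \lVert x - c_j \rVert^2\right) \tag{2.4}$$
The parameter $\beta_j$ is given by $\beta_j = 1/(2\sigma_j^2)$, where $\sigma_j$ denotes the width of the basis function.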
As shown in Figure 1, the input layer corresponds to the input vector space. The hidden layer processes the spatial data through radial basis functions, each centered on a point of that space. The last layer is the linear output layer, which sums the values obtained from the hidden layer, each multiplied by the weight of the corresponding neuron.
Figure 1. The architecture of RBF networks.
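The forward pass described by Equations (2.2)-(2.4) can be illustrated with a minimal Python sketch. It assumes a Gaussian basis with the width parameterization $\exp(-\lVert x - c_j \rVert^2 / (2\sigma_j^2))$; the function names, centers, widths and weights below are hypothetical values chosen for illustration and are not taken from the study.

```python
import math

def gaussian_rbf(x, center, sigma):
    """Gaussian basis value exp(-||x - c||^2 / (2 * sigma^2)) (assumed form)."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def rbf_network_output(x, centers, sigmas, weights, bias):
    """Linear output layer: bias plus the weighted sum of hidden activations."""
    hidden = [gaussian_rbf(x, c, s) for c, s in zip(centers, sigmas)]
    return bias + sum(w * h for w, h in zip(weights, hidden))

# Hypothetical two-neuron network on a 2-D input.
centers = [(0.0, 0.0), (1.0, 1.0)]
sigmas = [1.0, 1.0]
weights = [2.0, -1.0]
bias = 0.5

# Evaluate the network exactly at the first center.
y_at_center = rbf_network_output((0.0, 0.0), centers, sigmas, weights, bias)
```

At the first center the first neuron is fully activated (its basis value is 1), while the second contributes according to its distance, which is the locality property the hidden layer relies on.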
2.3. Random Forest
Random forest is a commonly used machine learning algorithm based on the idea of constructing multiple decision trees. The use of multiple decision trees and random feature subsets allows the model to capture both linear and nonlinear relationships. Each decision tree is trained on a different sample, and the forest takes the majority vote of the trees for classification and their average for regression [18].
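Let $y$ denote the continuous response and $X = (x_1, \dots, x_m)$ the input vector, where $m$ is the number of inputs and $i$ indexes the partitions of the feature space. The training set is expressed as:
$$D = \{(X_1, y_1), \dots, (X_n, y_n)\} \tag{2.5}$$
where $n$ is the number of observations.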
During the training process, the RF splits the input data at each node so that the parameters of the split functions can be improved to fit the $D$ set. From the first step, the algorithm generates a number of decision trees. Each decision tree has to determine the best split among all variables. This splitting process begins at the root and proceeds through each node, with each node applying its own split function to the new input $X$. This is done recursively until a terminal node (also known as a leaf) is reached [19]. At the end of this process, a prediction function $h(x)$ of each decision tree is constructed over $D$ and calculated as follows:
$$h(x) = \sum_{i} \bar{y}_i \, I(x \in R_i) \tag{2.6}$$
where $I(\cdot)$ is the indicator function that returns 1 if $x$ is in the subset $R_i$ and 0 otherwise, and $\bar{y}_i$ is the average of $y$ over the training observations in $R_i$.
The final prediction of the Random Forest model is the average of the predictions of all the decision trees in the forest. This can be written as:
$$\hat{f}_{RF}(x) = \frac{1}{B} \sum_{b=1}^{B} h_b(x) \tag{2.7}$$
where $B$ is the number of decision trees in the forest and $h_b(x)$ is the prediction of the $b$-th decision tree for the input $x$.
In Figure 2, the input features are used to construct multiple decision trees, and the mean of the predictions of all trees is taken to obtain the prediction of the random forest.
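As a small illustration of Equations (2.6) and (2.7), the sketch below represents each tree as a list of one-dimensional regions with a leaf mean, and the forest as the average over trees. The region boundaries and leaf means are hypothetical; a real forest would learn them from bootstrap samples and random feature subsets.

```python
def tree_predict(tree, x):
    """Piecewise-constant tree output: leaf mean times the region indicator."""
    return sum(mean * (1 if low <= x < high else 0) for low, high, mean in tree)

def forest_predict(forest, x):
    """Random forest regression output: average of the tree predictions."""
    return sum(tree_predict(tree, x) for tree in forest) / len(forest)

INF = float("inf")

# Three hypothetical trees, each splitting the real line into two regions
# given as (lower bound, upper bound, leaf mean).
forest = [
    [(-INF, 0.5, 1.0), (0.5, INF, 3.0)],
    [(-INF, 0.4, 1.2), (0.4, INF, 2.8)],
    [(-INF, 0.6, 0.8), (0.6, INF, 3.2)],
]

prediction = forest_predict(forest, 0.45)
```

Because the trees disagree on where to split, the averaged prediction near the boundaries is smoother than that of any single tree.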
2.4. Hybrid Model
The field of ML has rapidly evolved in recent years, with various models being developed and implemented to tackle different tasks. One such model that has gained popularity is the hybrid model. A hybrid model is a combination of ML models that are designed to solve a particular problem [20] . In this section, two distinct types of algorithms are used in the formulation of the hybrid model. These algorithms are RBFNN and RF.
Figure 2. Random forest regression construction.
RBFNN and RF prediction results are integrated as extra features within a Generalized Boosted Regression (gbm) technique to build a single strong model (strong learner). This combined gbm model can then be used to make predictions on new spatial data by following the same process of extracting RBFNN and RF predictions. One of the main benefits of this hybrid model is its ability to reduce errors. Errors arise when a model does not fit the training data set properly, resulting in poor performance. By incorporating the predictions from the two models as additional features, the gbm model can benefit from the different perspectives and strengths of the individual models, potentially leading to improved performance without losing the spatial dependence.
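Let $Z$ be a set of base learners:
$$Z = \{X, \hat{y}_{RBFNN}, \hat{y}_{RF}\} \tag{2.8}$$
where $X$ is the matrix of the initial features, and $\hat{y}_{RBFNN}$ and $\hat{y}_{RF}$ are the predicted values from RBFNN, $\hat{y}_{RBFNN} = f_{RBFNN}(X)$, and RF, $\hat{y}_{RF} = f_{RF}(X)$, respectively.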
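The construction in Equation (2.8) amounts to appending two columns to the feature matrix. A minimal sketch follows, in which `augment_features` is a hypothetical helper name and the numbers are invented for illustration:

```python
def augment_features(X, rbfnn_pred, rf_pred):
    """Append the RBFNN and RF predictions to each row of the feature matrix."""
    if not (len(X) == len(rbfnn_pred) == len(rf_pred)):
        raise ValueError("X and both prediction vectors must have the same length")
    return [list(row) + [p_rbfnn, p_rf]
            for row, p_rbfnn, p_rf in zip(X, rbfnn_pred, rf_pred)]

# Two observations with two initial features each (illustrative values).
X = [[1, 2], [3, 4]]
Z = augment_features(X, rbfnn_pred=[0.1, 0.2], rf_pred=[0.3, 0.4])
```

The same augmentation has to be applied to new data before the boosted model can predict, since that model expects the two extra columns.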
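Given a training set $\{(z_i, y_i)\}_{i=1}^{n}$ of known $(z, y)$ values, where $z_i$ collects the initial features and the two base-model predictions for observation $i$, the objective is to identify a prediction function $F^*(z)$ for predicting $y$ that maps $z$ to $y$ in such a way that, given the joint distribution of all $(z, y)$ values, the expected value of a certain loss function $L(y, F(z))$ is minimized.
Let $L$ define the loss function:
$$F^* = \arg\min_{F} \, E_{z,y}\left[L(y, F(z))\right] \tag{2.9}$$
Then the model is initialized with a constant:
$$F_0(z) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma) \tag{2.10}$$
The loss function is determined by the squared-error loss $L(y, F(z)) = \frac{1}{2}(y - F(z))^2$.
For $m = 1$ to $M$ do:
compute the pseudo residual by taking the derivative of the loss function with regard to the previous prediction, i.e. $F_{m-1}(z_i)$, and multiplying it by −1:
$$r_{im} = -\left[\frac{\partial L(y_i, F(z_i))}{\partial F(z_i)}\right]_{F = F_{m-1}} = y_i - F_{m-1}(z_i) \tag{2.11}$$
where $m$ is the iteration number.
At every iteration, a model $h_m(z)$ (also called a weak prediction model) is fitted to the residuals and the prediction is updated:
$$F_m(z) = F_{m-1}(z) + \gamma_m h_m(z) \tag{2.12}$$
Each model $h_m$ is constructed based on the residuals and is trained to minimize the remaining error. This process iterates until a low learning error is reached.
The weak models are combined to create a strong predictive model that captures the complex relationships between the predictors and the response. The final model is:
$$F_M(z) = F_0(z) + \sum_{m=1}^{M} \gamma_m h_m(z) \tag{2.13}$$
where $h_m$ corresponds to a weak prediction model and $\gamma_m$ stands for the weight of the weak model.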
As we can see in Figure 3, the predicted values of each algorithm ($\hat{y}_{RBFNN}$ and $\hat{y}_{RF}$) and the initial features are combined in the gbm function to obtain the hybrid model.
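The boosting loop of Equations (2.10)-(2.13) can be sketched in a few lines of Python for the squared-error loss. This is a simplified illustration, not the gbm package: it uses one-split regression trees (stumps) as weak models and a single constant weight (the shrinkage) for every weak model, and all names and data are hypothetical.

```python
def fit_stump(X, r):
    """Fit a one-split regression tree (stump) to the residuals r by least squares."""
    n = len(X)
    best = None
    for j in range(len(X[0])):
        # Candidate thresholds: every observed value except the largest,
        # so that both sides of the split are non-empty.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [r[i] for i in range(n) if X[i][j] <= t]
            right = [r[i] for i in range(n) if X[i][j] > t]
            mean_l = sum(left) / len(left)
            mean_r = sum(right) / len(right)
            sse = (sum((v - mean_l) ** 2 for v in left)
                   + sum((v - mean_r) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, mean_l, mean_r)
    if best is None:  # no valid split: fall back to a constant prediction
        mean_all = sum(r) / n
        return (0, float("inf"), mean_all, mean_all)
    return best[1:]

def stump_predict(stump, row):
    j, t, mean_l, mean_r = stump
    return mean_l if row[j] <= t else mean_r

def gradient_boost(X, y, n_trees=20, shrinkage=0.5):
    """Squared-error boosting: each stump is fitted to the current residuals."""
    f0 = sum(y) / len(y)  # constant that minimises the squared error
    pred = [f0] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]  # negative gradient
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        pred = [p + shrinkage * stump_predict(stump, row) for p, row in zip(pred, X)]
    return {"f0": f0, "shrinkage": shrinkage, "stumps": stumps}

def boost_predict(model, row):
    return model["f0"] + model["shrinkage"] * sum(
        stump_predict(s, row) for s in model["stumps"])

# Toy augmented features: one initial feature plus hypothetical RBFNN and RF predictions.
Z = [[0.0, 1.1, 0.9], [1.0, 1.2, 1.1], [2.0, 2.7, 3.1], [3.0, 3.2, 2.9]]
y = [1.0, 1.0, 3.0, 3.0]
model = gradient_boost(Z, y)
fitted = [boost_predict(model, row) for row in Z]
```

On this toy set the residuals shrink by the shrinkage factor at every iteration, which makes the role of the learning rate visible; the gbm package additionally supports deeper trees and other options that this sketch omits.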
2.5. Cross-Validation
Validating the stability of a model is always necessary in machine learning. Cross-validation is a data re-sampling technique used to estimate the true prediction error of models and to tune model parameters so as to avoid overfitting [21]. In this study, we use k-fold cross-validation. This technique splits the data into k subsets or folds; the model is trained on k − 1 of these folds, while the remaining fold is used for testing. This process is repeated k times, with each fold being used once for testing and the other k − 1 folds used for training [22]. The idea is to assess the stability of each model by comparing its performance across folds. It is expressed as:
$$CV_{(k)} = \frac{1}{k} \sum_{i=1}^{k} L_i(\hat{f}) \tag{2.14}$$
where $CV_{(k)}$ is the cross-validation estimate of the true error rate or performance metric, $L_i$ is the loss or performance metric for the $i$-th fold, $\hat{f}$ is the model that was trained and $k$ is the number of folds.
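The k-fold procedure can be sketched generically as follows. The helper names are hypothetical, the model is passed in as a pair of `fit` and `predict` callables, and a trivial mean predictor stands in for the actual models of this study.

```python
import math
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def cross_validate(X, y, k, fit, predict, loss, seed=0):
    """Return the average fold loss and the list of per-fold losses."""
    fold_losses = []
    for test_idx in k_fold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = [predict(model, X[i]) for i in test_idx]
        fold_losses.append(loss([y[i] for i in test_idx], preds))
    return sum(fold_losses) / k, fold_losses

# Placeholder model: always predicts the mean of the training targets.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)

def predict_mean(model, row):
    return model

X = [[float(i)] for i in range(10)]
y = [2.0] * 10
cv_estimate, fold_losses = cross_validate(X, y, 5, fit_mean, predict_mean, rmse)
```

Inspecting `fold_losses` alongside the average is what allows the stability of a model across folds to be judged, rather than only its mean error.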
2.6. Spatial Autocorrelation
Spatial autocorrelation is a statistical concept that refers to the degree to which nearby locations are similar to one another. In other words, it measures the degree (ranging from −1 to +1) to which the values of a variable at one location are similar to the values of the same variable at nearby locations. Positive spatial autocorrelation occurs when nearby locations have similar values for a variable, while negative spatial autocorrelation occurs when nearby locations have dissimilar values. When there is no spatial autocorrelation, the values at nearby locations are unrelated to one another. In this study, we use Global Moran’s I to analyze the global spatial autocorrelation, and Local Moran’s I to compare each individual feature with its neighbors and look for clustering.
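For reference, Global Moran’s I is commonly defined as $I = \frac{n}{S_0} \frac{\sum_i \sum_j w_{ij} z_i z_j}{\sum_i z_i^2}$, where $z_i$ is the deviation of value $i$ from the mean, $w_{ij}$ are the spatial weights and $S_0$ is their sum. The sketch below implements this definition directly; the four-location chain and its weights are a hypothetical example, not data from this study.

```python
def global_morans_i(values, weights):
    """Global Moran's I for a list of values and an n x n spatial weights matrix.

    Raises ZeroDivisionError if all values are equal or all weights are zero.
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(zi ** 2 for zi in z)

# Four locations on a line; each is a neighbor of the adjacent ones.
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

i_trend = global_morans_i([1.0, 2.0, 3.0, 4.0], W)          # smooth gradient
i_alternating = global_morans_i([1.0, -1.0, 1.0, -1.0], W)  # checkerboard pattern
```

The smooth gradient yields a positive value because neighbors resemble each other, whereas the alternating pattern yields a negative one.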
2.7. Models Evaluation
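To evaluate the performance of the models, three criteria are used: Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) and R-squared ($R^2$), which are defined by
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{2.15}$$
$$MAE = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert \tag{2.16}$$
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{2.17}$$
where $y_i$ is the observed value, $\hat{y}_i$ the predicted value, $\bar{y}$ the mean of the observed values and $n$ the number of observations.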
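The three criteria can be computed with the short functions below, which use the standard definitions of RMSE, MAE and $R^2$. The observed and predicted values are invented for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
scores = (rmse(y_true, y_pred), mae(y_true, y_pred), r_squared(y_true, y_pred))
```

RMSE penalizes the single large error more heavily than MAE does, which is why the two are usually reported together.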
3. Application
3.1. Dataset and Tools
A public spatial dataset is used in this study. The proposed hybrid model (HM) is applied to the California housing prices spatial data from Kaggle. This dataset was originally published by Dr. Pace and Dr. Ronald Barry to build spatial autoregressive models on 1990 California Census data. It contains information on the demographics (income, population, households) of the districts, as well as their location (latitude, longitude), and a general description of each district’s homes (number of rooms, number of bedrooms, house value, ocean proximity). The dataset contains a total of 20,640 observations of housing prices with 10 variables. Each observation corresponds to a single census block group in California. Table 1 describes the attributes of the dataset and shows a data sample. Figure 4 shows the distribution of each variable; the histograms and bar graph provide further details about the distribution of each feature. Some features are skewed to the right, with Median House Value peaking on the far right. Figure 5 displays the dispersion of median home values across California by population and geographical area. We can observe that, on average, the houses nearest to the ocean tend to have higher median house values: homes along the water typically cost more than homes inland. Therefore, it makes sense to take the spatial information into consideration when predicting house prices.
We used the open-source statistical programming language R, with the algorithm implementations of the following packages: caret (short for Classification and Regression Training), randomForest (Classification and Regression with Random Forest), RSNNS (R Stuttgart Neural Network Simulator), gbm (Generalized Boosted Regression Modeling), spdep (Spatial Dependence) and tmap (Thematic Map Visualization).
3.2. Preprocessing
Real-world data frequently has undesirable characteristics such as inconsistent formats, missing values and unreadable entries. The spatial dataset contains 10 columns and 20,640 rows, one for each census block group in California. This raw data needs to be processed before a machine learning algorithm can use it. This phase is known as preprocessing.
Modern ML techniques do well without feature selection, as models learn to identify useless features and focus on others [23] [24]. In this study, all features are included in the training process. The hybrid model (gbm method) is capable of evaluating the importance of features based on their contribution to the predictive performance of the model [25]. The target variable is the median house value, which ranges from $14,999 to $500,001. The features are longitude, latitude, housing median age, total rooms, total bedrooms, population, households, median income and ocean proximity. Only ocean proximity is a categorical variable; it has been transformed into numerical values. The variable total bedrooms contains missing values, which are imputed with the median value. The dataset can then be visualized.
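The two preprocessing steps mentioned above can be sketched as follows. The text does not state how ocean proximity was converted to numbers, so the integer coding by alphabetical order used here is an assumption, and the helper names and sample values are hypothetical.

```python
from statistics import median

def impute_median(column):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]

def encode_categories(column):
    """Map each category to an integer code (assumed scheme: alphabetical order, from 1)."""
    levels = sorted(set(column))
    mapping = {level: code for code, level in enumerate(levels, start=1)}
    return [mapping[v] for v in column], mapping

# Illustrative values only.
total_bedrooms = impute_median([129.0, None, 190.0, 1106.0])
ocean_codes, code_map = encode_categories(["NEAR BAY", "INLAND", "NEAR BAY", "NEAR OCEAN"])
```

The median is preferred over the mean here because the count variables are right-skewed, so a few very large districts would otherwise pull the imputed value upward.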
3.2.1. Normalization
Table 1. California housing prices description.
Data normalization rescales the features so that they are on a similar scale. This process improves the performance and training stability of the model. Mathematically, it is given as follows:
$$x' = \frac{x - x_{min}}{x_{max} - x_{min}} \tag{3.1}$$
where $x'$ is the normalized value, $x_{min}$ is the minimum value of the feature and $x_{max}$ is the maximum value of the feature.
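A minimal sketch of min-max scaling is given below. The handling of a constant column (mapped to zeros) is a choice made for this illustration, and the sample values simply reuse the reported range of the target variable.

```python
def min_max_normalize(column):
    """Rescale values to [0, 1] using the column minimum and maximum."""
    low, high = min(column), max(column)
    if high == low:
        # Constant column: the ratio is undefined, so map everything to 0.0.
        return [0.0 for _ in column]
    return [(v - low) / (high - low) for v in column]

scaled = min_max_normalize([14999.0, 257500.0, 500001.0])
```

In practice the minimum and maximum should be taken from the training data and reused for the test data, so that no information from the test set leaks into training.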
3.2.2. Training and Testing Process
Figure 5. Distribution of house prices across the population in California.
The training and testing process is a crucial step in the development of machine learning models. The 20,640 observations were divided into two parts: 70% as training data and 30% as testing data. The former was used to build the previously mentioned models, while the latter was used to validate them, as shown in Table 2.
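A random 70/30 split of the row indices can be sketched as below. The seed and the helper name are arbitrary choices for this illustration, and the resulting partition is not claimed to reproduce the one used in the study.

```python
import random

def train_test_split_indices(n, train_fraction=0.7, seed=42):
    """Shuffle the indices 0..n-1 and cut them into training and testing parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_fraction * n))
    return idx[:cut], idx[cut:]

train_idx, test_idx = train_test_split_indices(20640)
```

Fixing the seed makes the split reproducible, which matters when several models are compared on the same testing set.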
4. Experimental Results
Since the results need to be quantified, the findings in this step are extracted from the fit statement, which contains a list of stored values for each model. The RBFNN and RF algorithms are trained separately and independently of each other. Next, we implemented the hybrid model to improve on the individual models.
4.1. Results of Machine Learning Models
The training stage involves cross-validation and hyper-parameter adjustment to better capture spatial information. The rbf() function in the RSNNS package is used to build the RBFNN model. A parameter grid was defined for the number of hidden neurons (size = c(5, 10, 15, 20)), with a maximum number of iterations (maxit) of 1000. For RF, the number of variables randomly sampled at each split (mtry) was tuned over the values c(2, 5, 10), with the number of trees to grow set to ntree = 100. Then, the hybrid model was constructed using the predicted values from RBFNN and RF as additional features for training the gbm model. The number of iterations (n.trees), the maximum depth of each tree (interaction.depth) and the learning rate (shrinkage) were set to 100, 2 and 0.8 respectively for the HM model, and the distribution was specified as gaussian.
Table 3 shows the results of all models. The proposed HM model outperforms the individual models. It has extremely low values for both RMSE and MAE, indicating that the predicted values align closely with the actual values. The R2 value of 0.9991087 suggests that this model explains approximately 99.91% of the variance in the spatial data, indicating a very strong fit. This improvement over the other models is obtained by combining the best-performing RBFNN (R2 = 74.22%) and RF (R2 = 82.26%) models. Of the two individual models, both perform reasonably well, but RF outperforms RBFNN in terms of RMSE, MAE, and R2.
4.2. K-Fold Cross-Validation Results
To demonstrate the significance of our proposed approach, we performed 5-fold cross-validation with all models on the spatial dataset. Table 4 shows the RMSE of each model in each fold, along with the corresponding test error (the average RMSE across all folds for a model). The HM model consistently achieves the lowest RMSE across all folds and the lowest test error, which strongly suggests that it is the most accurate model for the given task. The RF model also performs reasonably well, with lower RMSE values than the RBFNN model. The RBFNN model exhibits slightly higher RMSE values, indicating comparatively lower accuracy in predicting the target values.
4.3. Spatial Autocorrelation Evaluation
In this section, we measure the spatial autocorrelation (Moran’s I) in the residuals to validate the HM model. To do so, we created a spatial weights matrix with the knn2nb function, based on the k-nearest neighbors (k = 3) of the location coordinates (latitude and longitude) of the observations. Table 5 reports the global Moran’s I statistic of the residuals for each model.
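The idea of a k-nearest-neighbor weights matrix can be sketched in Python as follows. This is not the spdep implementation: it uses plain Euclidean distance on the coordinates and row-standardized weights, both of which are assumptions of this illustration, and the coordinates are invented.

```python
def knn_weights(coords, k=3):
    """Row-standardized k-nearest-neighbor weights from 2-D coordinates.

    Each location gives weight 1/k to its k nearest neighbors and 0 elsewhere.
    Requires more than k locations.
    """
    n = len(coords)
    if n <= k:
        raise ValueError("need more than k locations")
    W = [[0.0] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(coords):
        # Squared Euclidean distance to every other location, nearest first.
        dists = sorted(
            ((xi - xj) ** 2 + (yi - yj) ** 2, j)
            for j, (xj, yj) in enumerate(coords) if j != i
        )
        for _, j in dists[:k]:
            W[i][j] = 1.0 / k
    return W

# Five hypothetical locations; the last one is far from the others.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
W = knn_weights(coords, k=3)
```

Note that k-nearest-neighbor relations are not symmetric: the isolated location has three neighbors, but none of the other locations lists it among theirs.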
The Global Moran’s I value represents the spatial autocorrelation, specifically the degree of spatial clustering or similarity, observed in the residuals. The RBFNN model exhibits a Moran’s I value of 0.42, indicating a relatively strong positive spatial autocorrelation pattern in the residuals. The RF model shows a lower Moran’s I value of 0.21, suggesting weaker spatial autocorrelation than in the RBFNN model. In contrast, the HM model has the lowest Moran’s I value, 0.12, indicating the lowest spatial autocorrelation of the three models. In summary, the comparison of the Moran’s I values of the models shows a reduction of the spatial autocorrelation in the residuals. It indicates that the performance of a model influences the way in which the model captures the spatial autocorrelation in the residuals.
Table 3. Results obtained from all models on the testing set.
Table 4. 5-fold cross validation results.
Table 5. Degree of global spatial autocorrelation.
The study also looked at the spatial association around each individual location. Figures 6-8 show maps of the local Moran’s statistic ($I_i$). A positive $I_i$ value implies that the unit is surrounded by units with similar values. The maps suggest that all models used in this study captured the neighborhood structure, which means that they take into account the dependencies in the residuals, as shown by the blue points in the figures. For the hybrid model, $I_i$ has a smaller range (−7.32 to 14.55) than for the other models. While the HM model demonstrates high performance, it does not fully account for the spatial structure in the data, although it captures some dependencies.
This suggests that the models successfully captured the spatial property, as evidenced by the robust performance evaluation using RMSE for the HM model and moderate performance for the other two models. These results align with our expectations, demonstrating that a specific ML model can effectively process spatial information without incorporating explicit spatial features during the learning process. Additionally, they highlight the model’s ability to capture spatial dependencies and improve accuracy.
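The local statistic mapped in Figures 6-8 can be sketched with the common definition $I_i = \frac{z_i}{m_2} \sum_j w_{ij} z_j$, where $z_i$ is the deviation from the mean and $m_2 = \frac{1}{n} \sum_i z_i^2$. The residuals and the row-standardized weights below are hypothetical, and the scaling by $m_2$ is one of several conventions in use.

```python
def local_morans_i(values, weights):
    """Local Moran's I_i for each location, given row-standardized weights."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    m2 = sum(zi ** 2 for zi in z) / n
    # Spatial lag: weighted average of the neighbors' deviations.
    lag = [sum(weights[i][j] * z[j] for j in range(n)) for i in range(n)]
    return [z[i] * lag[i] / m2 for i in range(n)]

# Four locations on a line with row-standardized adjacency weights.
W = [
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
]
residuals = [1.0, 2.0, 3.0, 4.0]
local_i = local_morans_i(residuals, W)
mean_local_i = sum(local_i) / len(local_i)
```

With row-standardized weights, the average of the local values equals the global Moran’s I computed with the same weights, which links the maps to the global statistics in Table 5.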
5. Conclusion
This study proposed a hybrid approach for spatial dependence detection using machine learning (ML) without incorporating any spatial features in the learning process. The hybrid model (HM) was developed by combining two models, radial basis function neural networks (RBFNN) and random forest (RF), to achieve high accuracy and efficiency. Both models (RBFNN and RF) perform well and can detect the dependence because of their ensemble architecture. Combined, they achieved an R2 of 99.91%. This significant performance improvement can be attributed to the boosting technique (Generalized Boosted Regression), which identifies the errors of each model. In conclusion, the individual models were able to capture a greater amount of spatial information, including spatial dependencies as measured by global and local Moran’s I, despite having lower R2 values than the HM model. The HM model, on the other hand, exhibited a high R2 but showed weak positive spatial dependence.