1. Introduction
Traffic flow theory is the theoretical basis for analyzing how traffic flow operates under different traffic conditions, so that the transportation system can be organized and managed effectively. Car-following behavior is the driving behavior in which a driver follows the preceding vehicle when he or she cannot change lanes. As the most basic driving behavior, car-following is one of the core research topics of traffic flow theory, and its modeling has received extensive attention from researchers in multiple fields [1] [2]. Compared with the other common driving behavior model (i.e., the lane-changing model [3] [4]), the car-following model describes the longitudinal behavior of vehicles in the current lane, which is very common on sections where overtaking is restricted (such as ramps) and on continuous-flow facilities (such as highways). Establishing an effective model is the prerequisite for accurately describing car-following behavior. At present, theory-driven models dominate the research on car-following behavior [5]. The theory-driven models represented by the GM model [6], Gipps model [7], OV model [8], FVD model [9], and ID model [10], as well as their extended models [11] [12] [13] [14] [15], have shown high performance in their respective research fields. However, car-following behavior is a typical nonlinear and time-varying research object. For this type of object, it is difficult to construct, with one single theoretical method, a model that describes its characteristics with high accuracy and strong generalization ability. By comparison, data-driven methods have shown excellent performance in describing nonlinear and time-varying research objects.
Unlike theory-driven methods, which have a clear model structure and rest on explicit premises and strict mathematical derivation, data-driven methods establish a description of the research object from data by exploring the internal connections of the data. Data-driven methods are not sensitive to prior knowledge and theoretical assumptions but are very sensitive to the quality of data. In other words, the availability of high-quality data directly determines whether an effective and accurate model can be constructed using data-driven methods. In recent years, ITS-related technologies, whose core feature is informatization, have developed and spread rapidly. In ITS, using high-altitude or overhead image acquisition systems, global positioning systems, smartphones, vehicle-mounted sensors, roadside sensors, and other V2X equipment, traffic managers and researchers can obtain high-precision and large-scale vehicle trajectory data, which provides the basis for modeling car-following behavior with data-driven methods. The existing data-driven car-following models mainly focus on the fuzzy logic method [16] [17] [18], the ANN method [19] [20] [21] [22], and the combination of these two methods [23]. In the fuzzy logic method, it is difficult to construct the fuzzy sets and the corresponding membership functions. In the ANN method, the structure is relatively complex and training requires high-performance computing resources. In contrast, as a typical ensemble machine learning method, the RF [24] has shown very high performance in many fields [25] [26] [27] [28].
Based on this, this work constructs an RF-based car-following model from high-precision, high-refresh-rate, and large-scale vehicle trajectory data, exploring the internal connections of the data to achieve an accurate description of car-following behavior. The remainder of the paper is organized as follows: Section 2 proposes the model; Section 3 presents the data preprocessing and the calibration and training of the model; Section 4 verifies the model and discusses the results; and Section 5 gives the conclusion.
2. Model
The RF is a parallel ensemble learning algorithm based on the Bagging ensemble learning theory [29] and the random subspace method [30], whose basic learner is the Classification and Regression Tree (CART). The basic structure of the RF is shown in Figure 1.
Figure 1 shows that the core characteristics of the RF method are "random" and "parallel". The "random" characteristic gives the RF method high prediction accuracy and strong generalization ability, and the "parallel" characteristic gives it high training and working efficiency. The randomness of the RF method is reflected in two aspects: the randomness of the samples and the randomness of the attributes of the samples. The parallelism of the RF method lies in the fact that all T decision trees contained in the RF can be trained at the same time, which greatly improves the efficiency of training and working.
The training process of the RF method is as shown in Figure 2.
Figure 1. Basic structure of the RF method.
Figure 2. Training process of the RF method.
As shown in Figure 2, when a data set is input, the RF method randomly draws a sample set from it according to the Bagging theory, i.e., by sampling with replacement. For an input data set containing m data, the probability P that a given datum is not selected in m draws is:

$$P = \left(1 - \frac{1}{m}\right)^{m} \tag{1}$$

Taking the limit of Equation (1), one can obtain

$$\lim_{m \to \infty} \left(1 - \frac{1}{m}\right)^{m} = \frac{1}{e} \approx 0.368 \tag{2}$$
From Equation (2), we can see that about 36.8% of the data are left out in each round of sampling, so about 63.2% of the distinct data in the input data set are randomly selected for the training of one of the decision trees; this is the sample randomness mentioned above. For ensemble learning, the stronger the independence of the basic learners, the better the performance of the assembled learner. It is almost impossible to construct completely independent basic learners, but the random sampling principle of the Bagging theory, as expressed by Equations (1) and (2), keeps the basic learners as independent of one another as possible.
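The sampling argument above can be checked numerically. The following is a minimal sketch, not part of the original study: the function names, the data set size, and the random seed are all assumptions chosen for illustration. It compares the probability of a datum being left out with the limit 1/e and with the fraction of distinct data actually drawn in one simulated round of sampling with replacement.

```python
import math
import random


def out_of_bag_probability(m: int) -> float:
    """Probability that a given datum is never drawn in m draws with replacement."""
    return (1.0 - 1.0 / m) ** m


def bootstrap_unique_fraction(m: int, rng: random.Random) -> float:
    """Draw one bootstrap sample of size m and return the fraction of distinct data in it."""
    drawn = {rng.randrange(m) for _ in range(m)}
    return len(drawn) / m


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed, only for reproducibility
    m = 2000                # hypothetical data set size
    print("probability of not being selected:", out_of_bag_probability(m))
    print("limit 1/e:", math.exp(-1))
    print("simulated fraction of selected data:", bootstrap_unique_fraction(m, rng))
```

For large m the first two printed values should nearly coincide, and the simulated fraction should lie close to one minus that value.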
When applying the RF method, four points need to be determined:
1) Input of the model;
2) Number of attributes in the split attribute set;
3) Impurity function;
4) Size of the forest.
The process of training the RF is the process of training the decision trees it contains. The core of this process is how to segment the features. Given the relatively small number of features involved in this study, the exhaustive method is adopted, which traverses all the values of each feature to find the optimal segmentation. The impurity is used to evaluate the quality of a segmentation. For each candidate segmentation of a node into two child nodes, the impurity [31] is calculated as

$$G(x_i, v_{ij}) = \frac{n_{left}}{N_s} H(X_{left}) + \frac{n_{right}}{N_s} H(X_{right}) \tag{3}$$

where $x_i$ is the segmentation variable, $v_{ij}$ is a segmentation value of the segmentation variable, $n_{left}$ and $n_{right}$ respectively are the numbers of training samples of the left and right child nodes after segmentation, $N_s$ is the number of training samples of the current node, $X_{left}$ and $X_{right}$ respectively are the training sample sets of the left and right child nodes, and $H(\cdot)$ is the impurity function. The commonly used impurity functions are shown in Table 1.
The first two impurity functions are suitable for the classification problem, while the latter two impurity functions are suitable for the regression problem.
Based on the characteristics of the RF method and considering the characteristics of the research object car-following behavior, the structure of the RF-based car-following model constructed in this research is as shown in Figure 3.
Table 1. Optional impurity functions in the RF method.
Figure 3. Structure of the RF-based car-following model.
In the model, the inputs are the velocity $v_n(t)$ of the object vehicle at the current moment t, the headway $h_n(t)$ between the object vehicle and its preceding vehicle at the current moment t, and the relative velocity $\Delta v_n(t)$ between the object vehicle and its preceding vehicle at the current moment t, and the output is the acceleration $a_n(t + \Delta t)$ of the object vehicle at the next moment $t + \Delta t$. The impurity function $H(\cdot)$ employed in this work is
(4)
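Then the training process for a certain node in the RF is equivalent to minimizing the impurity $G$ of Equation (3) over the segmentation variable $x_i$ and its segmentation value $v_{ij}$:

$$(x^{*}, v^{*}) = \arg\min_{x_i,\, v_{ij}} G(x_i, v_{ij}) \tag{5}$$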
Substituting Equation (4) into Equation (5), we can obtain
(6)
Equation (6) is the solution method of each node in the RF-based car-following model constructed in this research.
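The exhaustive search at a single node can be illustrated with a short sketch. It is not the authors' implementation: it assumes the mean squared error as the impurity function (one of the impurities suited to regression), sends samples whose feature value is less than or equal to the candidate value to the left child, and uses hypothetical function names. Each row of `features` would hold the three model inputs and each target the acceleration to be predicted.

```python
from typing import Optional, Sequence, Tuple


def mse_impurity(targets: Sequence[float]) -> float:
    """Mean squared deviation of the targets from their mean (assumed impurity H)."""
    if not targets:
        return 0.0
    mean = sum(targets) / len(targets)
    return sum((y - mean) ** 2 for y in targets) / len(targets)


def best_split(
    features: Sequence[Sequence[float]], targets: Sequence[float]
) -> Optional[Tuple[int, float, float]]:
    """Traverse every value of every feature and keep the lowest weighted impurity.

    Returns (feature index, segmentation value, weighted impurity), or None
    when no candidate separates the samples into two non-empty child nodes.
    """
    n_samples = len(targets)
    best: Optional[Tuple[int, float, float]] = None
    for j in range(len(features[0])):
        for value in sorted({row[j] for row in features}):
            left = [y for row, y in zip(features, targets) if row[j] <= value]
            right = [y for row, y in zip(features, targets) if row[j] > value]
            if not left or not right:
                continue  # not a real split
            # Child impurities weighted by each child's share of the node's samples.
            g = (len(left) * mse_impurity(left) + len(right) * mse_impurity(right)) / n_samples
            if best is None or g < best[2]:
                best = (j, value, g)
    return best


if __name__ == "__main__":
    # Toy data, invented for illustration: the target depends only on the second column.
    toy_features = [[3.0, 1.0], [1.0, 2.0], [4.0, 3.0], [2.0, 4.0]]
    toy_targets = [0.0, 0.0, 1.0, 1.0]
    print(best_split(toy_features, toy_targets))
```

The nested loops make the cost grow with the number of features times the number of distinct values, which is acceptable here because only three input features are involved.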
The size of the forest is determined by the iterative method during the training process, and the detailed information about this is given in Section 3.
3. Calibration and Training
3.1. Data Preprocessing
The effectiveness and accuracy of a data-driven model depend on the quality of the training data. The US101 data set provided by the NGSIM project, which was initiated by the Federal Highway Administration, is used to train and verify the proposed model. The NGSIM project aims to provide the high-precision vehicle trajectory data required for research in the transportation field. The data set is characterized by abundant data, complete object coverage, high accuracy, and a sampling interval of 0.1 s, and it is widely used in research on car-following behavior and other topics. The validity of this data set has been widely recognized. However, the data set is very large, and many of its data are not suitable for this research, so preprocessing needs to be carried out.
The NGSIM project collected vehicle trajectory data on different road sections in December 2003, April 2005, and June 2005. The US101 data set employed in this work was collected on the Hollywood Freeway (US-101) in June 2005, and the lane setting of the data collection section is shown in Figure 4.
The US101 data set contains microscopic car-following trajectory data, such as the position, velocity, and acceleration, of 6101 vehicles of different types. The specific data fields include Vehicle ID, Frame ID, Total Frames, Global Time, Local X, Local Y, Global X, Global Y, Vehicle Length, Vehicle Width, Vehicle Class, Vehicle Velocity, Vehicle Acceleration, Lane Identification, Preceding Vehicle, Following Vehicle, Spacing, and Headway. For the purposes of this work, some of these data fields are redundant, specifically: Local X, Local Y, Global X, Global Y, Vehicle Length, Vehicle Width, Vehicle Class, Vehicle Acceleration, and Following Vehicle.
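Removing the redundant fields is a simple projection of each record. The sketch below assumes that every record is a dictionary keyed by the field names exactly as listed above; the column headers in the actual NGSIM files may be spelled differently, and the function name is hypothetical.

```python
from typing import Dict, Iterable, List

# Field names as listed in the text; the headers of the actual data files
# may differ, so these strings are assumptions.
REDUNDANT_FIELDS = {
    "Local X", "Local Y", "Global X", "Global Y", "Vehicle Length",
    "Vehicle Width", "Vehicle Class", "Vehicle Acceleration", "Following Vehicle",
}


def drop_redundant_fields(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return copies of the records without the fields judged redundant."""
    return [
        {name: value for name, value in record.items() if name not in REDUNDANT_FIELDS}
        for record in records
    ]
```

The input records are left unmodified, so the full data set remains available if another field turns out to be needed later.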
Although the US101 data set contains large-scale car-following trajectory data of up to 6101 vehicles, it cannot be directly used for the training and verification of the constructed model. Preprocessing needs to be carried out, and the detailed procedure is shown in Figure 5.
Figure 4. Lane setting of the US101 road section.
Figure 5. Preprocessing procedure of the data set.
By traversing and processing all the items included in the US101 data set one by one according to the procedure shown in Figure 5, a data set containing 2152 groups of car-following trajectory data suitable for this study is obtained. 70% of them (1506 groups in total) are randomly selected as the training set, and the remaining 30% (646 groups in total) are used as the validation set.
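The random 70/30 division can be sketched as follows. The function name and the seed are assumptions, and the groups are represented by plain indices; the point is only that shuffling once and cutting at 70% yields 1506 training groups and 646 validation groups out of 2152.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

Item = TypeVar("Item")


def split_groups(
    groups: Sequence[Item], train_ratio: float, seed: int
) -> Tuple[List[Item], List[Item]]:
    """Shuffle the groups once and cut them into a training part and a validation part."""
    shuffled = list(groups)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_ratio)  # fractional part is truncated
    return shuffled[:n_train], shuffled[n_train:]


if __name__ == "__main__":
    # Indices stand in for the 2152 preprocessed trajectory groups.
    train, validation = split_groups(list(range(2152)), train_ratio=0.7, seed=0)
    print(len(train), len(validation))
```

Splitting at the level of trajectory groups rather than individual records keeps all samples of one car-following episode on the same side of the split.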
3.2. Model Calibration and Training
The training set is input into the RF-based car-following model, and the model is trained on it. In the training of the model, the number of trees contained in the RF (the size of the forest) has a significant impact on the training quality. In previous research and applications, there is no fixed rule for setting the number of trees. The common approach is to rely on expert experience to set an initial value and then repeat the process of testing, adjusting the parameter, and retesting until the optimal value is obtained, which is the so-called iterative method. Based on the iterative method, the number of trees is set to each value in a given interval in turn, and the optimal value of the parameter is determined by examining the prediction error of the model under the corresponding value. Considering the scale of the data set and the features used in the research, the interval is set as
, and the size of the trees is set as
. Then we can obtain
.
The corresponding error under different values of $T$ is shown in Figure 6.
From Figure 6, one can see that the error decreases considerably as $T$ increases while $T$ is less than 100. When $T$ is more than 110, the error decreases only slightly with further increases of $T$, and it also fluctuates somewhat. Considering that the consumption of computing resources grows with $T$ while the error reduction gained per unit increase of $T$ becomes less and less significant, $T$ in the proposed model is set as 110. Based on this, the training of the model is conducted.
Figure 6. Errors under different values of $T$.
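The trade-off described above, stopping once additional trees no longer buy a meaningful error reduction, can be written as a simple selection rule. This is one possible formalization and not the procedure used in the study: the function name, the tolerance, and the synthetic error curve are all assumptions, and in practice the errors would come from evaluating the trained model at each candidate forest size.

```python
from typing import Sequence


def choose_forest_size(
    candidates: Sequence[int], errors: Sequence[float], tolerance: float
) -> int:
    """Return the smallest candidate size that no larger size beats by more than `tolerance`."""
    if not candidates or len(candidates) != len(errors):
        raise ValueError("candidates and errors must be non-empty and of equal length")
    for i, (size, error) in enumerate(zip(candidates, errors)):
        best_later_error = min(errors[i:])
        if error - best_later_error <= tolerance:
            return size
    return candidates[-1]  # not reached: the last candidate always satisfies the rule


if __name__ == "__main__":
    sizes = list(range(10, 201, 10))
    # Synthetic, flattening curve used only for illustration, not measured data.
    synthetic_errors = [0.1 + 1.0 / size for size in sizes]
    print(choose_forest_size(sizes, synthetic_errors, tolerance=0.005))
```

Comparing each candidate with the best later error, rather than with its immediate neighbor, makes the rule robust to the small fluctuations seen at large forest sizes.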
4. Verification and Discussion
To verify the validity and accuracy of the RF-based car-following model constructed in this research, the performance of the model is evaluated using the verification set. A representative data-driven model (the one based on the ANN) and two theory-driven models (the GM model and the FVD model) are then evaluated on the same verification set for comparison with the proposed model. Before the verification, these models are trained or calibrated according to previous research: the aforementioned velocity, headway, and relative velocity are set as the inputs of the ANN model, and the Genetic Algorithm is employed to calibrate the parameters of the GM model and the FVD model. After the training or calibration of these models, the verification set is used to evaluate the performance of the proposed model and the comparison models. The Mean Error (ME), Mean Absolute Error (MAE), Mean Absolute Relative Error (MARE), and Root Mean Squared Error (RMSE) are employed as the evaluation indicators, and the equations of these indicators are
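$$ME = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right) \tag{7}$$

$$MAE = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i - y_i\right| \tag{8}$$

$$MARE = \frac{1}{N}\sum_{i=1}^{N}\frac{\left|\hat{y}_i - y_i\right|}{\left|y_i\right|} \tag{9}$$

$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^{2}} \tag{10}$$

where N is the total amount of data, $\hat{y}_i$ is the output value of the i-th object vehicle, and $y_i$ is the measured value of the i-th object vehicle.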
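The four indicators can be computed together, as in the following minimal sketch. It assumes the usual definitions, with the error taken as output value minus measured value and the relative error divided by the magnitude of the measured value; the function name is hypothetical and the numbers in the usage example are invented.

```python
import math
from typing import Dict, Sequence


def evaluate_indicators(outputs: Sequence[float], measured: Sequence[float]) -> Dict[str, float]:
    """Compute ME, MAE, MARE and RMSE for paired output and measured values."""
    if not outputs or len(outputs) != len(measured):
        raise ValueError("outputs and measured must be non-empty and of equal length")
    n = len(outputs)
    errors = [o - m for o, m in zip(outputs, measured)]
    return {
        "ME": sum(errors) / n,
        "MAE": sum(abs(e) for e in errors) / n,
        # MARE is undefined when a measured value is zero (division by zero).
        "MARE": sum(abs(e) / abs(m) for e, m in zip(errors, measured)) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
    }


if __name__ == "__main__":
    # Invented values, only to show the call.
    print(evaluate_indicators([2.0, 4.0], [1.0, 5.0]))
```

Because measured accelerations can be zero or very close to zero, MARE may need those samples to be excluded or handled separately in practice.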
The above evaluation indicators are used to evaluate the performance of the proposed model in this work and the models employed to compare with the proposed model. The evaluation results are as shown in Table 2.
Table 2. Evaluation results of the models.
The ME refers to the arithmetic mean of the errors of all output values relative to the measured ones, which reflects the average deviation between the output value and the measured value. The MAE further introduces the absolute value to avoid the inaccurate evaluation caused by positive and negative errors offsetting each other. The RMSE is very sensitive to unusually large errors, and thus it reflects how pronounced the deviation between the output value and the measured value is. ME, MAE, and RMSE reflect the magnitude of the error, while MARE reflects the size of the error relative to the measured values. From Table 2, we can see that the four models perform considerably differently on the same data set. Among these models, the model proposed in this work performs better than the others. According to the evaluation indicators, the performance improvement of the proposed model is up to 85.716% and is at least 5.227%. In addition, the two data-driven models fit the measured data significantly better than the two theory-driven models. Among the data-driven models, compared with the ANN model commonly used in previous research, the performance improvement of the proposed model reaches up to 77.282%. Among the theory-driven models, the FVD model, which considers more factors, performs better than the GM model. This is consistent with the research consensus in the field of car-following modeling, which supports the validity and reliability of the employed evaluation system. The performance improvement of the proposed model is up to 85.513% compared with the FVD model and up to 85.716% compared with the GM model. Even at the lowest level, the improvement is 11.672% compared with the FVD model and 10.009% compared with the GM model.
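A performance improvement of this kind is commonly expressed as the relative reduction of an error indicator with respect to a comparison model. The sketch below assumes that definition, which is not stated explicitly above; the function name is hypothetical and the values in the usage example are invented, not taken from Table 2.

```python
def improvement_percent(baseline_error: float, proposed_error: float) -> float:
    """Relative reduction of the error magnitude with respect to a baseline, in percent."""
    if baseline_error == 0:
        raise ValueError("baseline_error must be non-zero")
    # Magnitudes are used so that a signed indicator such as ME is handled consistently.
    return (abs(baseline_error) - abs(proposed_error)) / abs(baseline_error) * 100.0


if __name__ == "__main__":
    # Invented error values, only to show the call.
    print(improvement_percent(baseline_error=2.0, proposed_error=0.5))
```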
5. Conclusion
Theory-driven car-following models still have shortcomings in terms of prediction accuracy and generalization ability. The application of ITS facilitates the collection of large-scale, high-quality vehicle trajectory data, which is the research foundation for data-driven car-following models. In this work, a data-driven car-following model was constructed based on the RF method, and the NGSIM data set was used to train and verify the model. The results show that, compared with the data-driven model and the theory-driven models widely used in previous research, the proposed model performs better on four typical evaluation indicators, which verifies its validity and accuracy. Compared with typical data-driven methods such as the ANN method, the RF method employed in this work not only has better prediction accuracy but also has the advantages of low computational cost and a wide range of application. The RF method does not require a high-performance GPU to achieve excellent training performance. With an appropriate data set, the RF method is in principle suitable for a considerable range of scientific problems, including regression and classification problems. In traffic flow theory, these correspond to car-following behavior and lane-changing behavior, respectively. The effectiveness of the random forest method for traffic flow problems other than car-following behavior, and in even broader fields, is worthy of further exploration.
Acknowledgements
This study was funded by the Qingdao Top Talent Program of Entrepreneurship and Innovation (Grant No. 19-3-2-11-zhc), the Natural Science Foundation of Shandong Province (Grant No. ZR2020MF082), the Foundation of Shandong Intelligent Green Manufacturing Technology and Equipment Collaborative Innovation Center (Grant No. IGSD-2020-012), and the National Key Research and Development Project (Grant No. 2018YFB1601500).