A Comparison of Machine Learning Techniques in the Carpooling Problem


Urban traffic congestion is a severe and widely studied problem over the decade because of the negative impacts. However, in recent years some approaches emerge as proper and suitable solutions. The Carpooling initiative is one of the most representative efforts to propitiate a responsible use of particular vehicles. Thus, the paper introduces a carpooling model considering the users’ preference to reach an appropriate match among drivers and passengers. In particular, the paper conducts a study of 6 of the most avid classified techniques in machine learning to create a model for the selection of travel companions. The experimental results show the models’ precision and assess the best cases using Friedman’s test. Finally, the conclusions emphasize the relevance of the proposed study and suggest that it is necessary to extend the proposal with more drives and passengers’ data.

Share and Cite:

Santos, M. , Santos, C. , Martínez, S. , Rocha, J. , Menchaca, J. , Villanueva, J. , Berrones, M. , Cobos, J. and Rocha, E. (2020) A Comparison of Machine Learning Techniques in the Carpooling Problem. Journal of Computer and Communications, 8, 159-169. doi: 10.4236/jcc.2020.812015.

1. Introduction

Due to the constant growth of urban populations, cities face enormous challenges related to transport, road congestion, carbon control and environmental pollution, avoid accidents, as well as other situations and problems that are unavoidable. As mentioned in [1] Carpooling is a strategy proposed in the 1940s by the U.S. government during World War II, when fuel and rubber shortages for the manufacture of tires forced residents to make more careful use of private vehicles for personal transport. According to [2] the average occupancy rate for personal vehicle travel is 1.2 persons per kilometer transferred by vehicle; millions of hours/man are lost daily due to excessive traffic generated in roads, in turn causing levels of environmental and auditory pollution that affect health. For example, a regular vehicle with the ability to transport 5 people in full occupancy and used only for travel by transporting only one person, 80% of the carrying capacity is wasted during personal travel. In Mexico alone, this involves 25 million personal vehicles and the fact that you are not using a car in at least 60% of its occupancy, i.e. 3 people per car trip, causes a great loss for the country. Similarly, this inefficiency has also been observed in the transport of businesses such as taxis, vans, trucks and other vehicles.

In response to this problem, Carpooling is an environmentally friendly transport system, which is based on sharing empty vehicle seats with people (cooked or unknown) who have the same or no destination on the driver’s route, and is one of the most effective solutions to reduce vehicle traffic congestion and improve other situations, therefore, according to [3] Carpooling is the dynamic in which a driver shares their car with one or more additional passengers who have a similar destination. In doing so, the occupancy rate of cars could be substantially increased by reducing the number of empty seats in these vehicles making the use of roads more efficient. This would require fewer vehicles to transport the same number of people to their destinations, resulting in significantly fewer cars on the road. Additional benefits of Carpooling include cost-sharing of travel, reduced road traffic, and reduced polluting emissions to the environment produced by vehicles, among others. In this sense, ridesharing can be an effective tool to reduce vehicle traffic congestion and in turn CO2 emissions. The problem of shared travel is an issue that has gained great importance in recent years, that is why several studies have been carried out on how the shared travel benefits the environment to the users who perform this practice, one of the main benefits is the reduction in hydrocarbon consumption, such is the case [4], where the study resulted in the casual sharing of vehicles savings of between 1.7 and 3.5 million liters of petrol per year, or 200 to 400 liters per participant. This also brings great benefits to the environment by helping to reduce emissions of polluting gases, according to [5] performing this practice can lead to a decrease of up to 22% in CO2 emissions. In turn, studies have also been carried out showing that users both drivers and passengers who carry out this practice have benefited in different aspects, these are mainly economical and comfort [6] [7] [8].

Currently there is a lot of work dedicated to the assignment of drivers and passengers to make the shared trip, where general data such as the source and destination location of the users, the time of day, etc. are considered. However, few studies have been conducted that consider users’ personal preferences to improve comfort and experience when sharing vehicles, such as [9] [10] [11] [12] where the most common preferences were obtained as results: the age and gender of the user, whether the person is a smoker or not, the volume and style of music used during the trip, as well as the time that lasted the journey and the cost of the trip, thus improving the comfort of the users who practiced the shared trips , this data was obtained from different sources, either through surveys applied to people, online forms or data provided by agencies engaged in carpooling, the data obtained were very similar in most cases.

In the literature, work has been found by various researchers, in which different methodologies for the selection of travel companions are proposed, such as [13] [14] [15] where authors employ classification techniques in the area of machine learning or Machine Learning. Machine learning is the science of teaching computers how to make data-driven predictions. At a basic level, machine learning involves giving a computer a dataset and asking it to make a prediction. At first, the computer will have many incorrect predictions. However, over the course of thousands of predictions, the computer will update its algorithm to make better predictions [16]. This study presents a review and comparison of results among the most popular machine learning techniques, for this study a dataset based on user preferences is used to practice Carpooling due to the difficulty of performing a great amount of proves to refine the models with real people.

2. Related Work

The K-NN algorithm is one of the best-known classification algorithms and an example of supervised learning. In this algorithm we will use a set of already classified examples that we will call training set to classify the new examples. A new model is not created, but the model is the training set itself. The algorithm is so called K-NN (K-Nearest Neighbors) because it classifies each new example by calculating the distance of that example with all those in the training set. The class predicted for this new example will be given by the class to which the closest examples of the training set belong, the value of k is the one that determines how many neighbors we should look at to predict the class. Thus, with a value of k × 1, the predicted class for each new example will be the class to which the closest example of the training set belongs [17]. In [18], the authors mention that this algorithm has been used to classify different objects in road networks, an example where this applies most is in shared taxis and in the practice of Carpooling. It is for this reason that these authors decided to use this technique and modify it to adapt it to their own needs, which consists of grouping moving objects, to make shared trips. K-Means is a method of grouping or clustering. Clustering is a technique for finding and classifying K data groups (clusters). Thus, elements that share similar characteristics will be together in the same group, separate from the other groups with which they do not share characteristics. To know if the data is similar or different, the K-media algorithm uses the distance between the data. Observations that look alike will have a shorter distance between them. In general, the Euclidean distance is used as a measure, although other functions can also be used. K-Means needs as input data the number of groups in which we are going to segment the population. From this k number of clusters, the algorithm first places k random points (centroids). Then assign any of those points all samples with the smallest distances [17]. The K-means algorithm has been used in various jobs related to the practice of shared cars, such as [19], where the authors developed an implementation of this technique for the assignment of travel companions in private vehicles, this implementation used a set of test data in which the travel routes that were most widely used by users were analyzed and from this key points were created where they could serve as places where users both drivers and passengers agreed to meet to take the shared trip. Another example of the use of the K-Means technique can be found at [20], unlike the work cited above, in this document, the authors made an implementation for the assignment of passengers in shared taxis, this being another way of sharing vehicles, as well as the previous work, the authors conclude that it is the use of this technique helps to improve the allocation of travel companions based on the distribution (location) of participating users. In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan [21], used to generate a decision tree from the dataset. ID3 is typically used in machine learning and natural language processing. The ID3 algorithm is a classification algorithm based on information entropy, its basic idea is that all examples are assigned to different categories based on different values in the condition attribute set; its core is to determine the best sets of classification attribute form condition attributes. The algorithm chooses information gain as attribute selection criteria; typically, the attribute that has the highest information gain is selected as the split attribute of the current node. According to the different attribute values, branches can be established, and the above process is called recursively on each branch to create other nodes and branches until all samples in a branch belong to the same category. To select split attributes, the concepts of Entropy and Information Gain are used. Bayesian networks are an important method not only because it offers a qualitative analysis of the attributes and values that can intervene in the problem, but also because it also accounts for the quantitative importance of these attributes. On the qualitative aspect we can represent how these attributes relate either in a causal form, or by simply pointing out the correlation that exists between those variables (or attributes). Quantitatively (and this is the great contribution of Bayesian methods), it gives a probabilistic measure of the importance of these variables in the problem (and therefore an explicit probability of the hypotheses that are formulated). This is perhaps one of the fundamental differences offered by Bayesian networks with respect to other methods, which do not give a quantitative measure of that classification. In addition to these considerations, Bayesian network-based learning is especially appropriate in certain tasks such as text classification, being even more efficient than the other methods already outlined, and offers a measure for the study and understanding of these other methods [22].

The Naive Bayes classifier is another machine learning technique used in carpooling practice, such as [23], where the authors implement this algorithm for predicting drivers on shared trips based on the waiting times of both drivers and passengers.

Artificial Neural Networks (NN) are a computational model based on a large set of simple neural units (artificial neurons), roughly analogous to the behavior observed in the axons of neurons in biological brains. Each neural unit is connected to many others and the links between them can increase or inhibit the activation state of adjacent neurons. Each neural unit, individually, operates using addition functions. There may be a limiting or threshold function in each connection and in the unit itself, so that the signal must exceed a limit before propagating to another neuron. These systems learn and train themselves, rather than are explicitly programmed, and excel in areas where the detection of solutions or features is difficult to express with conventional programming [24].

3. Methodology

A Carpooling-oriented dataset was created for the application of machine learning techniques based on user preferences (see Table 1). This data set was generated from decisions made by a user who worked to take on the role of a driver. Based on the dataset, the configuration parameters of the different techniques were established. Subsequently, these were applied to achieve their efficiency in the classification of the data set.

For the calculation of the efficiency of the techniques it was considered to start the dataset with the total number of 108 cases that was divided into two segments: training data and test data. This was done through dataset segmentation as follows: 87 cases for training (80%) and 21 cases for training validation (20%).

The following describes the process in which the application of each of the techniques considered in this work was carried out. Similarly, the efficiency obtained by each of these. To perform the experimentation of the K-NN technique it was decided to use the Manhattan distance to measure the similarity between the new case to be classified and the rest of the cases belonging to the training data. It was decided to use this metric as in addition to being easy to implement, it is a measure of robust similarity, since in the selection process it was compared

Table 1. Attributes used in the dataset.

with the Euclidean and Chebyshev metrics, with Manhattan being the one that obtained the best results in this experimentation.

Having performed various applications of this technique on the test dataset, as can be seen in Table 2, it is observed that this technique obtained excellent results using a value of K-16, this being the best configuration, obtaining an efficiency of 95.23%.

To perform the experimentation of the K-Means technique unlike the aforementioned technique, the measure of similarity that obtained the best results was the Euclidean distance, also for the configuration of this technique different configurations were made for the K value in the creation of the groups, in this case the best configuration was K-2, yet the algorithm performed presented an efficiency of only 78.77%, this value was obtained by obtaining the mean based on the results obtained in each of the generated groups, in Table 3 you can see the results achieved for each of the groups of an experimentation with a configuration of K-2.

To perform the experimentation of the ID3 decision tree, no special configuration was made, as this technique automatically creates the decision tree based on the entropy of the dataset, this means that depending on the dataset will be the result to be obtained, so this algorithm will always get the same result for a given dataset, this technique proved to have a low level of efficiency in the classification of selected test cases. In the experimentation carried out for the data set described above this technique obtained 90.47% efficiency. The next technique used in experimentation is a variation of Bayesian networks called Naive Bayes, this technique just like simple decision trees will get the same result for a banished dataset, so no special configuration was made for this machine learning technique either. This technique proved to have a slightly lower level of efficiency compared to the ID3 decision tree, in the classification of the selected test cases. In the experimentation carried out for the data set described above this technique obtained 85.71% efficiency. There are many variations of neural networks, for this project was carried out the implementation of a neural network of retro propagation, this type of neural network usually gets good results, since after a training time the errors are transmitted towards the beginning thus updating

Table 2. Examples of results obtained with different K values.

Table 3. Examples of results obtained using K-2 value.

the weights of the initial neurons and improving the training of the hidden layer, this RNA was made a configuration of 5 inputs and 3 outputs, for the inputs one was implemented for each of the parameters regarding personal preferences, and for the outputs one for each of the possible decisions and an additional one that works as support for the previous two. Simple perceptron training was performed using a hidden layer with 9 neurons, as well as sigmoid function as a training method. Based on this structure, an efficiency of 90.47% was achieved by correctly classifying 19 out of 21 test cases.

4. Experimental Results

Once the calibration phase for each of the techniques used was completed, the performance of these with different data sets of different sizes was evaluated, for this phase of experimentation, in particular 4 data sets distributed as follows were used:

● The first dataset has 108 elements divided into two sections, the training set with 87 cases and the training set with 21.

● The second dataset consists of 82 cases, of which 66 of these belong to the training set and the remaining 16 to the test data.

● The third set contains 65 total cases in which 52 of them aim to train the algorithm and the other 13 validate the results of this training.

● The fourth data set is the smallest of them, with only 50 elements, where 40 of these are used as training and the remaining 10 for the purpose of evaluating the performance of the algorithm.

After experimenting with each of the datasets for each of the techniques, the results shown in Table 4 were obtained, where at first glance you can see that it is the technique that obtained the best performance for the datasets used in experimentation.

To perform a comparison with multiple datasets, it is necessary to check if all the results obtained by the algorithms have any inequality. In the case of finding it, then we can know, by using a post-hoc test, which algorithms have different average results. The non-parametric tests used are described below.

The first is the Friedman test, which is a non-parametric test equivalent to the ANOVA. Under the null hypothesis, it states that all algorithms are equivalent, so a rejection of this hypothesis implies the existence of differences between the performances of all the algorithms studied. After this, a post-hoc test could be used to find out whether the control or the proposed algorithm has statistical

Table 4. Results obtained in experimentation.

differences from the remaining methods in the comparison. The simplest test of these is Bonferroni-Dunn, but it is a very conservative procedure and we can use more powerful tests that control the FWER and reject more hypotheses than the Bonferroni-Dunn test; for example, Holm’s method [25].

The second statistical test technique used is the Holm test. The Holm test is a multiple comparison procedure that can work with a control algorithm (usually chosen at best) and compares it to the remaining methods. The Holm test is an intensification procedure that sequentially evaluates hypotheses ordered by their importance. Let us denote the p-values sorted by p1, p2, ..., so that p1p2 ≤ ... ≤ pk1. The Holm test compares each i to α (k á i), based on the most significant p-value. If p1 is below α (k × 1), the corresponding hypothesis is rejected and p2 can be compared with α (k × 2). If the second hypothesis is rejected, the test continues with the third hypothesis, and so on. As soon as a certain null hypothesis cannot be rejected, all remaining hypotheses are also preserved [25].

To validate the efficiency values of the techniques used in the experimentation phase, Friedman’s non-parametric statistical test was used using a significance level value of α-0.05. By applying this test, the data shown in Table 5 were obtained as a result, which shows the average efficiency for each of them, based on the 4 sets used in the experimentation phase. In this sense the K-NN technique proved to be the best ranked by this statistic, followed by the neural network by a wide difference.

Friedman’s test results are as follows. Friedman’s statistic showed a value of 12.85, and the P-value obtained was 0.012032 which is less than α, this means that if there is a calculated efficiency difference for the machine learning techniques evaluated. However, the result obtained does not provide the necessary information to determine which of the techniques used is best for assigning travel companions. This is why the second test mentioned above, the Holm test, was used. The P-values adjusted by Holm shown in Table 6 indicate that the

Table 5. Classification of implemented ML techniques.

Table 6. Holm’s adjusted P-values.

algorithms have a value less than or equal to the value of α, in this sense it can be verified that, if there is a significant statistical difference between these algorithms, it is the K-NN algorithm that obtained the best results.

5. Conclusions

One of the purposes of this research was to understand the problems related to the provision of travel companion assignment services when taking shared trips. From the literature, we know that Carpooling is an essential element in the transport structure of large cities; thus, some works in the literature aim to promote such practices using user preferences to improve the experience in the real way of share vehicle.

For all of the above, in this work it was decided to carry out an evaluation of the most used machine learning techniques in order to determine which of them is best suited to be used in the allocation and decision-making in Carpooling. Based on the efficiency results obtained, it was concluded that the integration of the K-NN learning technique into the process of classifying passenger and driver assignment on shared travel is appropriate for use in the creation of such systems where you want to improve the user experience when performing carpooling practice. Future work consists in the development of a comprehensive platform in which users both drivers and passengers can register with their personal preferences and that they can make shared trips based on their tastes and preferences, as well as feedback the system with the experiences of the participating users, thus validating the effectiveness of the system and the selected technique.


Authors M. A. Arteaga Santos and C. Méndez Santos are supported by the CONACYT Mexico with the grant 717529 and 719425 respectively.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.


[1] Ferguson, E. (1997) The Rise and Fall of the American Carpool: 1970-1990. Transportation, 24, 349-376.
[2] ONU-Hábitat and ONU-Hábitad (2015) Reporte Nacional de Movilidad Urbana en México 2014-2015. Rep. Glob. en Asentam. Humanos, p. 100.
[3] Ferguson, E. (1990) Demographics of Carpooling. Transportation Research Record, 1496, 142-150.
[4] Minett, P. and Pearce, J. (2011) Estimating the Energy Consumption Impact of Casual Carpooling. Energies, 4, 126-139.
[5] Bruck, B.P., Incerti, V., Iori, M. and Vignoli, M. (2017) Minimizing CO2 Emissions in a Practical Daily Carpooling Problem. Computers & Operations Research, 81, 40-50.
[6] Amey, A., Attanucci, J. and Mishalani, R. (2011) Real-Time Ridesharing: Opportunities and Challenges in Using Mobile Phone Technology to Improve Rideshare Services. Transportation Research Record: Journal of the Transportation Research Board, 2, 103-110.
[7] Abrahamse, W. and Keall, M. (2012) Effectiveness of a Web-Based Intervention to Encourage Carpooling to Work: A Case Study of Wellington, New Zealand. Transport Policy, 21, 45-51.
[8] Neoh, J.G., Chipulu, M. and Marshall, A. (2017) What Encourages People to Carpool? An Evaluation of Factors with Meta-Analysis. Transportation, 44, 423-447.
[9] Furuhata, M., Dessouky, M., Ordóñez, F., Brunet, M.-E., Wang, X. and Koenig, S. (2013) Ridesharing: The State-of-the-Art and Future Directions. Transportation Research Part B: Methodological, 57, 28-46.
[10] Malodia, S. and Singla, H. (2016) A Study of Carpooling Behaviour Using a Stated Preference Web Survey in Selected Cities of India. Transportation Planning and Technology, 39, 538-550.
[11] Berlingerio, M., Ghaddar, B., Guidotti, R., Pascale, A. and Sassi, A. (2017) The GRAAL of Carpooling: Green and Social Optimization from Crowd-Sourced Data. Transportation Research Part C: Emerging Technologies, 80, 20-36.
[12] Park, Y., Chen, N. and Akar, G. (2018) Who Is Interested in Carpooling and Why: The Importance of Individual Characteristics, Role Preferences and Carpool Markets. Transportation Research Record, 2672, 036119811875688.
[13] Anagnostopoulos, T., Anagnostopoulos, C. and Hadjiefthymiades, S. (2011) An Adaptive Machine Learning Algorithm for Location Prediction. International Journal of Wireless Information Networks, 18, 88-99.
[14] Liu, N., Feng, Y., Wang, F., Liu, B. and Tang, J. (2013) Mobility Crowdsourcing: toward Zero-Effort Carpooling on Individual Smartphone. International Journal of Distributed Sensor Networks, 2013, No. 5.
[15] Galland, S., Knapen, L., Yasar, A.-U.-H., et al. (2014) Multi-Agent Simulation of Individual Mobility Behavior in Carpooling. Transportation Research Part C: Emerging Technologies, 45, 83-98.
[16] Norman, A.T. (2019) Aprendizaje Automático En Acción: Un Libro Para El Lego, Guía Paso A Paso Para Los Novatos.
[17] García, C. and Gómez, I. (2006) Algoritmos de aprendizaje: Knn & kmeans. Universidad Carlos III de Madrid, Madrid.
[18] Dong, T.Y., Yuan, L.L., Cheng, Q., Cao, B. and Fan, J. (2019) Direction-Aware KNN Queries for Moving Objects in a Road Network. World Wide Web, 22, 1765-1797.
[19] Cruz, M.O., Macedo, H. and Guimarães, A. (2015) Grouping Similar Trajectories for Carpooling Purposes. 2015 Brazilian Conference on Intelligent Systems (BRACIS), Natal, 4-7 November 2015, 234-239.
[20] Xiao, Q., He, R. and Yu, J. (2018) Evaluation of Taxi Carpooling Feasibility in Different Urban Areas through the K-Means Matter-Element Analysis Method. Technology in Society, 53, 135-143.
[21] Adhatrao, K., Gaykar, A., Dhawan, A., Jha, R. and Honrao, V. (2013) Predicting Students’ Performance Using ID3 and C4.5 Classification Algorithms. International Journal of Data Mining & Knowledge Management Process, 3, 39-52.
[22] Mitchell, T.M. (1997) Machine Learning. McGraw-Hill Education, New York.
[23] Papoutsis, P., Michel, B., Philippe, A. and Duong, T. (2020) Bayesian Hierarchical Models for the Prediction of the Driver Flow and Passenger Waiting Times in a Stochastic Carpooling Service.
[24] Hertz, J., Krogh, A. and Palmer, R.G. (2018) Introduction to the Theory of Neural Computation. CRC Press, Boca Raton.
[25] Holm, S. (1979) A Simple Sequentially Rejective Multiple Test Procedure. Scandinavian Journal of Statistics, 6, 65-70.

Copyright © 2022 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.