Optimizing Facial Expression Recognition through Effective Preprocessing Techniques
1. Introduction
Facial Expression Recognition (FER) is a field of study that plays a crucial role in various human-computer interaction applications and emotional understanding. Recognizing and interpreting human emotions from facial expressions involves several components and challenges [1] [2].
Figure 1 shows the components of a FER model. A large and diverse dataset is essential as input; to ensure the model learns to recognize a broad range of expressions, the dataset should cover a variety of emotions. The data gathered are typically unorganized and originate from several sources [3] [4], so preprocessing is mandatory: the data must be standardized and cleaned before being given as input to the machine learning model. Preprocessing is often employed to decrease the complexity of the applied algorithm and to increase prediction accuracy [5] [6]. The preprocessing technique proposed in this research work combines resizing the raw input facial image, converting the resized image to grayscale, and finally normalizing that image. The preprocessed image is then given as input to the CNN model, which classifies it into a specific expression such as happy, sad, anger, disgust, surprise, fear or neutral.
There are many challenges in FER, of which intra-class variance and inter-class variation are the major ones. Intra-class variance: minimizing the variation within the same class of expression across different individuals is a challenge, because people can express the same emotion in diverse ways, making it hard to standardize these expressions. Inter-class variation: maximizing the differences between expressions of different emotions across individuals is also challenging, because different individuals might express different emotions in ways that are visually similar. FER techniques constantly evolve to address these challenges [7]. Deep learning methods, especially Convolutional Neural Networks (CNNs), have shown promising results in automatically learning features from facial images, improving accuracy in recognizing and categorizing expressions [8] [9]. These techniques can learn hierarchical representations, which helps in dealing with the complexities of inter-class and intra-class variations.
Despite the challenges, ongoing research and technological advancements continue to refine FER methodologies, making them more accurate and applicable in various domains, from security to emotional intelligence applications. The steps outlined provide a comprehensive guide for building a facial expression recognition system using machine learning. Each step mentioned below is crucial for creating an effective and accurate model for recognizing expressions.
Figure 1. Components of Facial Expression Recognition (FER).
Step 1: Clearly defining the purpose and scope is essential for setting the right objectives and understanding the context in which the model will be used.
Step 2: Gathering a diverse and sizable dataset is vital. It should cover various emotions to ensure the model learns to detect a wide range of expressions.
Step 3: Preparing the data involves cleaning and standardizing it. Each data type might require specific preprocessing techniques to make it suitable for machine learning algorithms.
Step 4: Correctly labelled data is fundamental for supervised learning. Ensuring accurate annotations of emotions is crucial for the model’s learning process.
Step 5: Splitting the dataset into training, validation, and testing subsets helps in training the model, tuning hyperparameters, and evaluating its performance.
Step 6: Choosing an appropriate model or algorithm for the emotion detection task, considering the specific characteristics of the dataset and problem.
Step 7: Designing the architecture of the selected model, specifying layers, neurons, and activation functions so that it suits the problem’s complexity.
Step 8: Training the model on the training dataset involves adjusting the model’s parameters and optimizing its performance.
Step 9: Assessing the model’s performance using various evaluation metrics on the test dataset is crucial to determine its accuracy and generalization capabilities.
Step 10: Integrating the model into the target application or system for real-world use.
Step 11: Models should be periodically retrained with new data to maintain accuracy and relevancy.
Step 12: Being mindful of ethical implications and ensuring the responsible and fair use of the model is essential.
By following the above-mentioned steps, we can create a robust and effective facial expression recognition system [10] [11]. It is important to note that machine learning is an iterative process, and refinement might be necessary at various stages to achieve the best results.
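As an illustration of Step 5, the following is a minimal sketch of a shuffled train/validation/test split. The 70/15/15 proportions, the function name `split_dataset`, and the toy `(image_id, label)` pairs are assumptions made for this example; the source does not state the split used.

```python
import random

def split_dataset(samples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle samples and split them into train/validation/test lists."""
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty test share")
    shuffled = list(samples)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Hypothetical (image_id, label) pairs standing in for a labelled dataset.
samples = [(i, i % 7) for i in range(100)]
train, val, test = split_dataset(samples)
```

Shuffling before slicing avoids a split in which one subset contains only certain expression classes, which could happen if the files are stored class by class.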
The organization of this paper is as follows. Section 2 reviews some of the publications relevant to this topic. Section 3 describes the research resources and methods, including the dataset utilized for the research, the preprocessing techniques, and the outcome of applying those techniques to the input facial images. It also gives more detail on the individual preprocessing steps that convert a raw image to a form that enhances the prediction of facial expressions. The experimental results and their explanations are discussed in Section 4. Finally, Section 5 presents the conclusions of this research work, along with findings that may be useful for future research.
2. Literature Survey
Many researchers around the world have worked in this field. Some of the papers most relevant to this research are discussed here. Lopes et al. [12] proposed a facial expression recognition system that combines a standard method, the Convolutional Network, with specific image pre-processing steps. Experiments showed that combining both Spatial and Intensity Normalization procedures with synthetic samples achieved an accuracy of 92%, the highest among the preprocessing techniques compared; this significantly increased the accuracy of the method and also reduced the time needed to train the model. Shin et al. [13] tested five types of input data and compared the resulting accuracy of a facial expression recognition system: raw, histogram equalization, isotropic smoothing, diffusion-based normalization, and difference of Gaussian. They concluded that, as a preprocessing method for the input image, histogram equalization showed the most reliable performance across all the network models.
Yao Qian et al. [14] propose a new preprocessing algorithm with local binary pattern for facial expression recognition. A skin color model is first established to extract the face region, and the complete face region is obtained by using the cumulative projection. All the images are then normalized by rotating the inclined faces and removing the effect of light, resulting in the desired facial input samples. Facial features are next extracted with a rotation-invariant uniform local binary pattern. Finally, the facial expressions are classified by a support vector machine classifier. The algorithm is implemented in Matlab and evaluated on an Indian male facial expression database, where it achieves an accuracy of 72.75% in predicting facial expressions.
Zhao et al. [15] employ the independent component analysis (ICA) technique to improve the performance of the Locally Linear Embedding (LLE) algorithm in facial expression recognition. Preprocessing begins by representing the face images with independent components and filtering the noise from them. The work also proposes a Supervised LLE (SLLE) algorithm to learn the hidden manifold underlying factors such as pose variations, illumination conditions, and facial expressions. Based on the cluster information and the Euclidean distances between the data, SLLE creates neighborhood graphs for the data, and it uses the same embedding step as LLE. In the final phase, an implicit nonlinear mapping from the ICA space to the embedded manifold is extracted using a Generalized Regression Neural Network (GRNN). Experiments performed on the JAFFE database yield encouraging findings.
3. Methods and Materials
This section describes the methods and other information needed to obtain the desired accuracy in predicting facial expressions. Preprocessing is the standardization of raw input facial images. Two sets of facial images are required: one to train the model and one to test it. To train the FER system efficiently, a large number of facial images are taken from available repositories and preprocessed. To test the model, facial images can be downloaded from social media or captured directly; these are preprocessed as well. The facial image that results from preprocessing is fed to the CNN model to predict the seven basic facial expressions, namely happy, sad, fear, anger, surprise, disgust and neutral.
3.1. Preprocessing Methods
Preprocessing techniques play a vital role in improving the accuracy of facial expression recognition models. The following is an overview of various image enhancement techniques, each of which plays an important role in improving image quality and making images more suitable for specific applications:
Gaussian Smoothing reduces noise by replacing each pixel with a Gaussian-weighted average of its neighboring pixels. Median Filtering reduces noise by replacing each pixel’s value with the median value of its neighborhood. Wavelet Denoising decomposes the image into different frequency bands and suppresses noise in specific bands. Histogram Equalization improves contrast by spreading out the most frequent intensity values [16] [17] [18].
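Two of the techniques above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the implementation used in this work: the function names, the fixed 3x3 window, and the edge-replication border handling are all assumptions, and a library such as OpenCV would normally be used instead.

```python
import numpy as np

def median_filter3(img):
    """3x3 median filter on a 2-D uint8 image, replicating edge pixels at the borders."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    # Stack the nine shifted views of the image, then take the per-pixel median.
    windows = np.stack([padded[dy:dy + h, dx:dx + w]
                        for dy in range(3) for dx in range(3)])
    return np.median(windows, axis=0).astype(img.dtype)

def equalize_histogram(img):
    """Histogram equalization of a 2-D uint8 image via its cumulative distribution."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]      # CDF value at the darkest level present
    total = img.size
    if total == cdf_min:           # constant image: nothing to spread out
        return img.copy()
    # Map each grey level to its rescaled CDF value (a 256-entry lookup table).
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[img]
```

The median filter removes isolated outlier pixels (salt-and-pepper noise) without blurring edges as much as an averaging filter, while equalization stretches a narrow band of grey levels across the full 0 to 255 range.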
Contrast Stretching widens the range of pixel intensity values to improve contrast. Normalization, sometimes called contrast stretching or histogram stretching, transforms an n-dimensional grayscale image with intensity values in the range (Min, Max) into a new image with intensity values in the range (newMin, newMax). Nearest Neighbour Interpolation is simple but may result in pixelation. Bilinear Interpolation takes a distance-weighted average of the nearest four pixels to determine a new pixel’s value [19] [20] [21]. Bicubic Interpolation utilizes a more complex mathematical model to estimate pixel values during resizing.
Gray World Assumption assumes the average color in an image should be a shade of gray. White Balance adjusts color balance based on the color temperature of the light source. Color Transfer transfers color characteristics from a reference image to the target image. Thresholding divides an image into foreground and background based on a certain threshold. Edge Detection identifies edges and boundaries in an image [22] [23] [24]. Region Growing expands regions based on certain criteria to segment the image.
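Linear normalization of this kind is commonly written as I_N = (I - Min) * (newMax - newMin) / (Max - Min) + newMin. The sketch below applies that formula with NumPy, assuming the source range is taken from the image's own minimum and maximum; the function name `contrast_stretch` and its default target range of 0 to 1 are choices made for this example.

```python
import numpy as np

def contrast_stretch(img, new_min=0.0, new_max=1.0):
    """Linearly map intensities from [img.min(), img.max()] to [new_min, new_max]."""
    img = np.asarray(img, dtype=np.float64)
    old_min, old_max = img.min(), img.max()
    if old_max == old_min:         # flat image: avoid division by zero
        return np.full_like(img, new_min)
    return (img - old_min) * (new_max - new_min) / (old_max - old_min) + new_min
```

For instance, an image whose values span 50 to 200 and is stretched to the range 0 to 255 maps 50 to 0, 100 to 85, and 200 to 255.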
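To make two of these ideas concrete, here is a hedged NumPy sketch of gray-world color balancing and of simple global thresholding. The function names, the default threshold of 128, and the convention that pixels at or above the threshold are foreground are assumptions for illustration only.

```python
import numpy as np

def gray_world_balance(rgb):
    """Scale each channel so that every channel mean equals the overall mean."""
    rgb = np.asarray(rgb, dtype=np.float64)
    channel_means = rgb.reshape(-1, 3).mean(axis=0)
    gray_mean = channel_means.mean()
    # Guard against division by zero for an all-black channel.
    gains = gray_mean / np.maximum(channel_means, 1e-12)
    return np.clip(rgb * gains, 0, 255).astype(np.uint8)

def threshold(gray, t=128):
    """Binary threshold: pixels with value >= t are marked as foreground (True)."""
    return np.asarray(gray) >= t
```

An image with a uniform reddish cast, for example every pixel equal to (200, 100, 100), comes out of `gray_world_balance` as a neutral gray, because the red channel is scaled down and the other two are scaled up until their means agree.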
Each technique has its strengths and limitations, and their effectiveness often depends on the specific application and the quality of the input image. In many cases, a combination of these techniques might be used to achieve the desired image enhancement or processing goals. Based on the selection of the preprocessing techniques the accuracy of prediction of facial expressions by the system also varies. Identifying the best preprocessing techniques is therefore essential to enhance the performance of the system.
3.2. Dataset for Preprocessing
In a FER system two datasets are mandatory: one for training the system and the other for testing or validating it. The training dataset must be large and can be downloaded from the repositories that are available. In this research work the training dataset is downloaded from FER2013 [25] [26] [27].
Table 1 shows the details of the images used here for testing the model and for preprocessing. An average of ten students in the age group of 9-10 were asked to pose with different facial expressions, and their facial images were captured. The images were cropped to a size of 2 KB, focusing on the facial region alone. These facial images were then used as raw image input for the proposed preprocessing techniques.
3.3. Proposed Preprocessing Technique
The technical setup of this Facial Expression Recognition system uses hardware resources that are freely available in Google Colab. The dataset can be downloaded from Kaggle and is processed using the Python package TensorFlow 2.0 to train the model. The dataset for testing the model can be created by capturing facial images from the real world or by downloading them from social media. The preprocessing can be accomplished in Python using the following libraries: 1) Pandas and NumPy: used to manipulate data, which is a routine requirement in Python machine learning tasks. 2) Matplotlib: used to visualize the data and the performance of the model. 3) OpenCV: a library of programming functions for real-time computer vision [28] [29]. The preprocessing procedure consists of resizing, conversion to grayscale, and normalization, in that order. The output of the preprocessing code is shown in Table 2.
Table 1. Characteristics of dataset for preprocessing.
Table 2. Stepwise preprocessing of facial images.
The preprocessing is programmed in Python using Visual Studio Code. It begins by resizing the raw facial image to the desired pixel dimensions, followed by converting the image to grayscale. The reason for converting to grayscale is that image processing operations work on a single plane of image data in grayscale images, whereas in RGBA images the operations are applied to each of the four image planes and the results are then combined. A grayscale image therefore requires processing only 1/4 of the data of the corresponding RGBA image, and this data reduction allows the algorithm to run in a reasonable amount of time [30] [31]. The only drawback of grayscale images is the loss of color information, which is not needed for recognizing facial expressions.
After conversion to grayscale, the facial image is normalized by dividing each pixel value by 255, so that the values range from 0 to 1 instead of 0 to 255. This is often useful for machine learning tasks, as it makes the images more comparable and easier to process. For example, if a happy facial image is 255 pixels wide and 255 pixels high, and each pixel has a value ranging from 0 to 255, then the code img = img/255 divides each pixel value by 255. The result is still 255 pixels wide and 255 pixels high, but each pixel now has a value ranging from 0 to 1. When training a machine learning model to recognize happy facial images, it helps if all the happy facial images share the same range of pixel values. Normalization ensures this, which can make the training process easier for the model. Both the training and test data must be preprocessed in the same way to enhance the accuracy with which the facial expression recognition model predicts emotions [32] [33].
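The three steps (resize, grayscale, divide by 255) can be sketched end to end as follows. This is a NumPy-only illustration under stated assumptions, not the code used in this work, which relies on OpenCV: the 48 x 48 target size (the size of FER2013 images), nearest-neighbour interpolation, the BT.601 luma weights, and all function names are choices made for this example, and the input is a synthetic random array rather than a real face image.

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of an H x W (x C) array."""
    in_h, in_w = img.shape[:2]
    # For each output row/column, pick the index of the source pixel it maps to.
    rows = np.minimum((np.arange(out_h) * in_h / out_h).astype(int), in_h - 1)
    cols = np.minimum((np.arange(out_w) * in_w / out_w).astype(int), in_w - 1)
    return img[rows[:, None], cols]

def to_grayscale(rgb):
    """Weighted sum of R, G, B (any alpha plane is dropped) using BT.601 luma weights."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb[..., :3].astype(np.float64) @ weights

def preprocess(rgb, size=(48, 48)):
    """Resize, convert to grayscale, then scale pixel values to the range 0 to 1."""
    resized = resize_nearest(rgb, *size)
    gray = to_grayscale(resized)
    return (gray / 255.0).astype(np.float32)

# Synthetic stand-in for a captured face image (random pixels, not real data).
raw = np.random.default_rng(0).integers(0, 256, size=(120, 90, 3), dtype=np.uint8)
x = preprocess(raw)
```

Resizing first means the grayscale conversion and the division touch only the pixels that survive, and the single-channel float output is the form a CNN input layer typically expects.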
4. Results and Discussions
In this research work, Python libraries such as OpenCV and TensorFlow, running on an NVIDIA GPU (version 375.74 from nvidia-375), are used to perform preprocessing and to run the CNN model. The preprocessing stage includes three steps, namely 1) resizing, 2) converting the image to grayscale and 3) normalizing the facial image. The effect of preprocessing on the prediction of facial expressions is discussed in this section.
Figure 2 and Figure 3 show the result of preprocessing. The raw facial image is given as input for preprocessing, where it is first resized to uniform pixel dimensions and then converted to grayscale, which saves storage space; this matters because the volume of data handled in FER is enormous. Finally, the grayscale image is normalized, which enhances the performance of the FER system in predicting facial expressions.
Table 3 shows the effect of different preprocessing techniques on the accuracy with which the model predicts facial expressions. The seven basic facial expressions considered for prediction by the model are anger, disgust, fear, happy, sad, surprise and contempt. The accuracy of predicting each class of emotion is very low when the input facial images are not preprocessed. When only cropping and resizing are done, the accuracy increases moderately. Converting the image to grayscale alone increases the accuracy further, and performing normalization alone produces a slight increase. The best accuracy is achieved only when all three stages, namely resizing, conversion to grayscale and normalization, are performed together in the preprocessing stage. The average accuracy across all seven classes of emotion is lowest when the images are not preprocessed, intermediate when the images are only converted to grayscale, and highest when the images are resized, converted to grayscale, and normalized.
Figure 2. Facial-image-1 before and after preprocessing.
Table 3. Preprocessing techniques and accuracy.
Figure 3. Facial-image-2 before and after preprocessing.
Table 4 lists the training and validation losses obtained at the end of every five epochs. Training loss and validation loss are metrics used in machine learning to evaluate how well a model fits the training data and the validation data respectively. One epoch is completed when all the training data has passed through the network once. Training loss measures how well the model fits the training data in each epoch, and validation loss measures how well it fits new data at the end of each epoch. The difference between the training and validation loss curves shows how much the model overfits or underfits the data; a wide gap indicates that the model is overfitting. It is observed that as the number of epochs increases, both the training and validation losses decrease, and the difference between the curves is negligible for epochs 55 to 65, where the model is neither overfitting nor underfitting. This state is achieved only when the input facial images for both training and testing the model are preprocessed.
Figure 4 shows that after preprocessing the images, both the validation and training losses decrease. This indicates that the preprocessing techniques help to reduce the overfitting problem, and the result can be treated as a good baseline. By epoch 50 the difference between the training and validation loss curves has almost vanished, which suggests that the prediction model generalizes well. However, there is still scope for improvement in eliminating the remaining difference between training and validation errors. A systematic strategy to further enhance the model would be to search for an optimal combination across the whole set of candidate hyperparameters. The learning rate, batch size, number of epochs, and other parameters may all be varied to obtain the best possible combination. The main cost of such a hyperparameter search is its large demand for computing resources. The HParams dashboard in TensorBoard provides several tools to help identify the best experiment or the most promising sets of hyperparameters.
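The gap between the two loss curves can be inspected programmatically, as in the sketch below. The loss values are made up for illustration and are not the values in Table 4; the function names and the tolerance of 0.06 are likewise assumptions.

```python
def loss_gaps(train_loss, val_loss):
    """Per-epoch gap between validation loss and training loss."""
    if len(train_loss) != len(val_loss):
        raise ValueError("both curves need one value per epoch")
    return [v - t for t, v in zip(train_loss, val_loss)]

def epochs_within(gaps, tolerance):
    """1-based epoch numbers whose absolute gap is at most `tolerance`."""
    return [i + 1 for i, g in enumerate(gaps) if abs(g) <= tolerance]

# Made-up loss values for illustration only; they are not the values in Table 4.
train_loss = [1.80, 1.40, 1.10, 0.95, 0.90]
val_loss = [1.70, 1.45, 1.20, 1.00, 0.93]
gaps = loss_gaps(train_loss, val_loss)
close = epochs_within(gaps, tolerance=0.06)
```

A gap that keeps growing while training loss falls is the usual sign of overfitting, so a list like `close` gives a quick view of the epochs at which the two curves agree.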
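A search over hyperparameter combinations can be sketched as an exhaustive grid search. In the sketch below, `toy_validation_score` is a hypothetical stand-in for training the model and returning its validation accuracy, and the candidate values in `grid` are arbitrary; neither comes from this work.

```python
import itertools

def grid_search(evaluate, grid):
    """Evaluate every combination in `grid`; return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical stand-in for "train the model and return validation accuracy".
# A real run would train the CNN here, which is where the compute cost arises.
def toy_validation_score(learning_rate, batch_size):
    return -abs(learning_rate - 0.001) * 1000 - abs(batch_size - 64) / 64

grid = {"learning_rate": [0.01, 0.001, 0.0001], "batch_size": [32, 64, 128]}
best_params, best_score = grid_search(toy_validation_score, grid)
```

The number of training runs is the product of the list lengths (nine here), which illustrates why the computational cost grows quickly as more hyperparameters or candidate values are added.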
Figure 4. Training and validation loss across the epochs.
Table 4. Results of training and validation loss across epochs.
Table 5 lists the training and validation accuracy of the model at the end of every five epochs. Training accuracy measures how well the model performs on the training dataset: it is the ratio of the number of correctly predicted instances to the total number of instances in the training set. Validation accuracy measures how well the model generalizes to new, unseen data. It is calculated on a separate dataset called the validation set, which the model has not seen during training. The table shows a clear increase in both accuracies as the epochs progress. At the end of the 80th epoch the accuracies reach their maximum, with training accuracy having risen from 32.81 to 71.45. Similarly, validation accuracy reaches 68.15 at the 80th epoch from its initial value of 42.11. The experimental results indicate that the preprocessing techniques play a vital role in increasing prediction accuracy.
Figure 5 shows that the accuracy on both the training dataset and the test dataset increases with the number of epochs. In the initial epochs both the training and validation accuracies are very low. Once the accuracy stops improving, training can be stopped at that epoch. The experimental results indicate that the accuracy rate is enhanced by the preprocessing techniques proposed in this work.
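The accuracy ratio described above is simple to compute directly. The labels in this sketch are toy values chosen for illustration and are not results from this work.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels for illustration only (not results from this work).
y_true = ["happy", "sad", "fear", "happy", "neutral"]
y_pred = ["happy", "sad", "happy", "happy", "sad"]
acc = accuracy(y_true, y_pred)   # 3 of 5 predictions are correct
```

The same function gives training accuracy when applied to the training set and validation accuracy when applied to the held-out validation set; only the data passed in differs.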
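One common way to decide when accuracy has stopped improving is patience-based early stopping, sketched below. The source does not state which stopping rule was used, so the rule, the `patience` and `min_delta` parameters, and the accuracy values are all assumptions for illustration.

```python
def early_stopping_epoch(val_accuracy, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop.

    Training stops once validation accuracy has failed to improve by more
    than `min_delta` for `patience` consecutive epochs.
    """
    best = float("-inf")
    stale = 0
    for epoch, acc in enumerate(val_accuracy, start=1):
        if acc > best + min_delta:
            best, stale = acc, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_accuracy)      # rule never triggered: all epochs were run

# Made-up validation accuracies for illustration only.
history = [0.42, 0.50, 0.58, 0.63, 0.63, 0.62, 0.63, 0.64]
stop = early_stopping_epoch(history, patience=3)
```

With these toy values the best accuracy is first reached at epoch 4 and is not exceeded in epochs 5 to 7, so the rule stops training at epoch 7. A larger `patience` tolerates longer plateaus at the cost of extra training time.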
Figure 5. Training and validation accuracies across epochs.
Table 5. Training and validation accuracy across epochs.
5. Conclusion
This research work proposes a facial expression recognition system in which seven facial expressions of different people from the FER2013 dataset are analyzed. The captured facial images are preprocessed, features are extracted from the preprocessed images, and the CNN model detects the facial expressions. Accuracy is used as the metric to evaluate the performance of the system. Tests reveal that the accuracy of the model increases greatly when normalization is combined with rescaling and grayscale conversion. According to the results, this method provides a simpler solution and achieves higher accuracy than traditional classifiers that make use of the same facial expression database. Furthermore, training takes less time. In future, the proposed work can be extended to other databases to achieve cross-database validation, and the accuracy of the model can be further enhanced with more appropriate combinations of preprocessing techniques.