1. Introduction
Speech is one of the most important means of human communication, and it is a convenient and simple way to transmit information. The speech signal carries not only the linguistic meaning but also the speaker’s emotional information, which traditional speech processing has generally ignored [1]. Because this emotional information plays a very important role in spoken communication, emotion recognition has become an active research topic in recent years. Traditional features, such as energy (E), zero-crossing rate (ZCR), the fundamental frequency (F0), the first formant (FF), Mel Frequency Cepstrum Coefficients (MFCC), linear prediction coefficients (LPC), and the short-time average magnitude (SAM), together with their statistics, such as maximum (Max), minimum (Min), mean, variance (Var.), first-order difference (FOD), and rate of change (RC), are generally applied to recognize speech emotion. Combinations of these features can be used to recognize the emotion of an utterance. In this paper, speech emotion is divided into four categories: Fear, Happy, Neutral and Surprise.
Because the feature parameters contain redundant and irrelevant information, it is essential to select the features that best characterize the emotional information in speech [2]. This paper uses a Back Propagation (BP) neural network to rank the features by their saliency measure and selects a subset of them.
Based on the feature selection, two new features of speech emotion are added to the selected features: MFCC features extracted from the fundamental frequency curve (MFCCF0) and amplitude perturbation parameters extracted from the short-time average magnitude curve (APSAM). A Gaussian Mixture Model (GMM) is used to recognize speech emotion [3]. The experimental results show that the two new features increase the recognition rate.
2. Database
The Chinese emotional speech database CASIA is used in this paper. CASIA was released by the Institute of Automation, Chinese Academy of Sciences [4]. It is composed of 1,200 wave files that represent six emotional states: happy, fear, sad, surprise, neutral and angry. Four actors (two female and two male) each read 50 different texts in each of the six emotions (4 × 50 × 6 = 1,200 utterances). Four emotions, Fear, Happy, Neutral and Surprise, comprising 800 utterances in total, are chosen for the experiment; for each emotion, half of the data is used for training and the remaining half for testing.
3. Feature Extraction
3.1. Traditional Features
Traditional features, which have proven to be useful, are generally applied to recognize speech emotion [5]. In this section, we first preprocess the speech signal by pre-emphasis, framing (256 sampling points per frame with a 128-point frame shift), and windowing. We then extract the feature parameters E, ZCR, F0, FF, MFCC, and LPC from each frame and compute their Max, Min, mean, Var., and FOD. Together with two further features, the RC of F0 and the RC of FF, this gives 6 × 5 + 2 = 32 traditional features in total.
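The preprocessing chain and two of the frame-level features (E and ZCR) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the pre-emphasis coefficient 0.97, the Hamming window, and the reading of FOD as the mean absolute first-order difference are assumptions, and all function names are hypothetical. Only the frame length (256) and the frame shift (128) come from the text.

```python
import math

FRAME_LEN = 256    # samples per frame (from the text)
FRAME_SHIFT = 128  # frame shift in samples (from the text)

def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter y[n] = x[n] - alpha * x[n-1] (alpha is assumed)."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def split_frames(signal, frame_len=FRAME_LEN, frame_shift=FRAME_SHIFT):
    """Cut the signal into overlapping frames; a trailing partial frame is dropped."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, frame_shift)]

def hamming_window(length):
    """Hamming window (the window type is an assumption; the text does not name it)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

def short_time_energy(frame):
    """Sum of squared samples of one frame."""
    return sum(x * x for x in frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def contour_statistics(values):
    """Utterance-level statistics of one per-frame feature contour."""
    n = len(values)
    mean = sum(values) / n
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return {
        "max": max(values),
        "min": min(values),
        "mean": mean,
        "var": sum((v - mean) ** 2 for v in values) / n,
        "fod": sum(diffs) / len(diffs) if diffs else 0.0,  # assumed definition
    }

def energy_and_zcr_statistics(signal):
    """Pre-emphasize, frame, window, then summarize E and ZCR over the utterance.

    Assumes the signal contains at least one full frame.
    """
    frames = split_frames(pre_emphasis(signal))
    window = hamming_window(FRAME_LEN)
    energy = [short_time_energy([x * w for x, w in zip(f, window)]) for f in frames]
    zcr = [zero_crossing_rate(f) for f in frames]
    return contour_statistics(energy), contour_statistics(zcr)
```

The same `contour_statistics` helper would be applied to the F0, FF, MFCC, and LPC contours, whose extraction is not shown here.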
3.2. New Features
Speech emotion recognition based on the traditional features alone does not achieve a satisfactory recognition rate. We therefore introduce two new features, MFCCF0 and APSAM, to improve the recognition rate.
We extract the fundamental frequency from each frame of the speech signal by the autocorrelation function method and obtain a fundamental frequency curve. A 5-point median filter is then applied to smooth this curve, and the fundamental frequency points that deviate from the curve are deleted. Finally, we extract 4th-order MFCC feature parameters from the processed curve; these form our first new feature, MFCCF0.
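The first two steps, autocorrelation-based F0 estimation and 5-point median smoothing, can be sketched as below. This is an illustrative sketch only: the F0 search range (60 to 500 Hz), the voicing threshold, the rule for deleting deviating points (more than 20% away from the smoothed curve), and the skipping of unvoiced frames are all assumptions, and the function names are hypothetical. The final MFCC step on the contour is not shown.

```python
import math
from statistics import median

def f0_autocorrelation(frame, sample_rate, f0_min=60.0, f0_max=500.0, voicing=0.3):
    """Estimate F0 (Hz) of one frame from the highest autocorrelation peak.

    Returns 0.0 for frames judged unvoiced. The search range and the voicing
    threshold are assumed values, not taken from the text.
    """
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return 0.0
    lag_min = max(1, int(sample_rate / f0_max))
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag, best_value = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    if best_lag == 0 or best_value / energy < voicing:
        return 0.0
    return sample_rate / best_lag

def median_filter(values, width=5):
    """Sliding median; the window is shortened at both ends of the contour."""
    half = width // 2
    return [median(values[max(0, i - half):i + half + 1])
            for i in range(len(values))]

def smooth_f0_contour(raw_f0, max_deviation=0.2):
    """5-point median smoothing, then removal of points far from the smoothed curve."""
    voiced = [f for f in raw_f0 if f > 0.0]  # assumed: unvoiced frames are skipped
    smoothed = median_filter(voiced, width=5)
    # Keep the smoothed value wherever the raw value stays close to the curve.
    return [s for f, s in zip(voiced, smoothed)
            if abs(f - s) <= max_deviation * s]
```

With 256-sample frames, the longest usable lag is 255 samples, which bounds the lowest detectable F0 for a given sampling rate.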
Next, we extract the SAM of each frame of the speech signal and obtain a SAM curve. From this curve we compute amplitude perturbation parameters, which describe the level of amplitude jitter within a certain range. The amplitude perturbation parameters are the amplitude jitter percentage (Shim), the amplitude jitter in dB (ShdB), and the amplitude perturbation quotient (APQ); their formulas are as follows:
(1)
(2)
(3)
where A represents the short-time average magnitude, N represents the number of short-time average magnitude values, and i = 1, 2, ∙∙∙, N.
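For illustration, the three parameters can be computed from a SAM curve as in the sketch below. It uses the definitions of Shim, ShdB, and APQ that are common in voice-perturbation analysis; these are assumed here and may differ in detail from Equations (1) to (3). In particular, the 5-point APQ smoothing window is an assumption, and the function names are hypothetical.

```python
import math

def shimmer_percent(amps):
    """Shim: mean absolute difference of consecutive amplitudes,
    relative to the mean amplitude, in percent (assumed definition)."""
    n = len(amps)
    mean_abs_diff = sum(abs(amps[i] - amps[i + 1]) for i in range(n - 1)) / (n - 1)
    return 100.0 * mean_abs_diff / (sum(amps) / n)

def shimmer_db(amps):
    """ShdB: mean absolute level difference of consecutive amplitudes in dB
    (assumed definition; all amplitudes must be positive)."""
    n = len(amps)
    return sum(abs(20.0 * math.log10(amps[i + 1] / amps[i]))
               for i in range(n - 1)) / (n - 1)

def apq(amps, window=5):
    """APQ: mean absolute deviation of each amplitude from its local
    `window`-point average, relative to the mean amplitude, in percent.
    The window length is an assumption."""
    n = len(amps)
    half = window // 2
    deviations = [abs(sum(amps[i - half:i + half + 1]) / window - amps[i])
                  for i in range(half, n - half)]
    return 100.0 * (sum(deviations) / len(deviations)) / (sum(amps) / n)
```

A perfectly steady SAM curve gives zero for all three measures, while a curve that alternates between two levels gives large Shim and ShdB values.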
4. Feature Selection Based on BP Neural Network
In order to constitute the best set of features and reduce the dimension of the feature space, we adopt the saliency measure proposed by Ruck et al., which uses the sensitivity of the network outputs to its inputs to rank the input features [6]. In the experiment, the network has one hidden layer with 15 hidden nodes. The activation function of the hidden layer is the sigmoid function, and the activation function of the output layer is linear. The set of 32 traditional features is used as the network input. After the network has been trained, the weights are fixed, and the saliency value of each input is calculated. Because each network is initialized with a different set of random weights, we average the saliency values of 10 trained networks to improve reliability. The resulting ranking of the average saliency values is shown in Table 1. The input with the highest saliency value is ranked No. 1 and the input with the lowest is ranked No. 32.
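The saliency computation for a network of this shape (sigmoid hidden layer, linear output layer) can be sketched as follows. This is a simplified reading of the measure, assumed for illustration: the saliency of an input is taken as the sum, over the training samples and the network outputs, of the absolute derivative of each output with respect to that input. The weight layout and all function names are hypothetical, and network training itself is not shown.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def output_input_derivatives(x, w_hidden, b_hidden, w_out):
    """Return d[k][j] = dz_k/dx_j for a sigmoid-hidden, linear-output network.

    w_hidden[h][j] connects input j to hidden node h; w_out[k][h] connects
    hidden node h to output k. The output bias does not affect the derivative.
    """
    slopes = []
    for weights, bias in zip(w_hidden, b_hidden):
        s = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        slopes.append(s * (1.0 - s))  # derivative of the sigmoid at this node
    return [[sum(w_out[k][h] * slopes[h] * w_hidden[h][j]
                 for h in range(len(slopes)))
             for j in range(len(x))]
            for k in range(len(w_out))]

def saliency(samples, w_hidden, b_hidden, w_out):
    """Sum of |dz_k/dx_j| over all samples and outputs, one value per input j."""
    total = [0.0] * len(samples[0])
    for x in samples:
        for row in output_input_derivatives(x, w_hidden, b_hidden, w_out):
            for j, d in enumerate(row):
                total[j] += abs(d)
    return total

def rank_features(saliency_per_network):
    """Average the saliency vectors of several trained networks and rank inputs.

    Returns (order, average), where order[0] is the index of the most salient input.
    """
    n_networks = len(saliency_per_network)
    n_inputs = len(saliency_per_network[0])
    average = [sum(s[j] for s in saliency_per_network) / n_networks
               for j in range(n_inputs)]
    order = sorted(range(n_inputs), key=lambda j: average[j], reverse=True)
    return order, average
```

In the setting described above, `rank_features` would receive 10 saliency vectors of length 32, one per trained network.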
5. Results and Discussion
5.1. Recognition Based on Traditional Features
From the saliency ranking of the 32 traditional features (Table 1), the first six traditional features in descending order are: E FOD, F0 mean, MFCC mean, ZCR FOD, LPC mean and E mean. We choose the first four, five, and six features respectively and use GMMs to recognize emotion from them. The recognition results are shown in Table 2, Table 3, and Table 4. Because the number of Gaussian mixture components (NGMM) has a considerable effect on the recognition rates, we experimented with many different values of NGMM and show the most representative results as follows.
Table 1. Ranking of the 32 traditional features.
Table 2. Recognition rates of the first four features (E FOD, F0 mean, MFCC mean, ZCR FOD).
Table 3. Recognition rates of the first five features (E FOD, F0 mean, MFCC mean, ZCR FOD, LPC mean).
Table 4. Recognition rates of the first six features (E FOD, F0 mean, MFCC mean, ZCR FOD, LPC mean, E mean).
This paper uses only the first few features for recognition, because they contribute most to speech emotion recognition. As shown in the three tables above, when the first five features (E FOD, F0 mean, MFCC mean, ZCR FOD, LPC mean) are combined, the average recognition rate reaches its highest value, 79.75%, with a recognition rate of 82% for Fear and 87% for Neutral. If we continue to add features as inputs, both the per-emotion recognition rates and the average recognition rate decrease. This indicates that the five features carry considerable information for differentiating emotions. As more features are selected, the redundant and irrelevant information among the features increases, and the recognition rates decrease [7].
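The GMM recognition step used in this section can be sketched as a maximum-likelihood decision: one GMM is held per emotion, and a feature vector is assigned to the emotion whose model gives it the highest likelihood. The sketch below assumes diagonal covariances and already trained parameters (training, typically by the EM algorithm, is not shown); the model layout and function names are hypothetical, and the number of components per model corresponds to NGMM.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (assumed form)."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of x under a GMM, using log-sum-exp for numerical stability."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def classify(x, models):
    """Return the emotion label whose GMM assigns x the highest likelihood.

    models maps a label to a (weights, means, variances) tuple.
    """
    return max(models, key=lambda label: gmm_log_likelihood(x, *models[label]))

# Toy two-dimensional models with made-up parameters, for illustration only.
toy_models = {
    "Neutral": ([1.0], [[0.0, 0.0]], [[1.0, 1.0]]),
    "Happy": ([0.5, 0.5], [[3.0, 3.0], [5.0, 5.0]], [[1.0, 1.0], [1.0, 1.0]]),
}
```

In the experiments described here, the feature vector would hold the selected features (for example the first five) rather than the two toy dimensions used above.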
5.2. Recognition Based on Traditional and New Features
Based on the feature selection above, two new features of speech emotion, MFCCF0 and APSAM, are added to the selected features [8]. The recognition results of the first five features (E FOD, F0 mean, MFCC mean, ZCR FOD, LPC mean) combined with MFCCF0 are shown in Table 5. The recognition results of the first five features combined with APSAM are shown in Table 6. The recognition results of the first five features combined with both MFCCF0 and APSAM are shown in Table 7.
Table 5. Recognition rates after adding MFCCF0 to the first five features.
Table 6. Recognition rates after adding APSAM to the first five features.
Table 7. Recognition rates after adding MFCCF0 and APSAM to the first five features.
As shown in Table 5, after adding MFCCF0, the average rate reaches 80.5% [9]. This is because speech emotion is related to F0. When different texts are read in different emotions, their fundamental frequency curves differ. Emotion, moreover, is independent of the text. Therefore, feature parameters extracted from the fundamental frequency curve can characterize some emotional information and raise the recognition rate [10]. As shown in Table 6, after adding APSAM, the average rate increases by 1.0 percentage point and reaches 80.75%. This is because the amplitude perturbation parameters describe the jitter level within a certain range. Speech in different emotions exhibits different jitter levels, so the feature can characterize emotional information to a certain extent and raise the recognition rate. From Table 7, we can see that after adding both new features, MFCCF0 and APSAM, the average rate increases by 2.5 percentage points and reaches 82.25%; Neutral attains the highest rate, 90%, and the other three emotions all achieve at least 76%.
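Taking the 79.75% average rate of the first five features as the baseline, the gains in average recognition rate quoted in this section work out as follows (all values in percentage points):

```latex
\begin{align*}
\text{MFCCF0 only:} \quad & 80.50 - 79.75 = 0.75 \\
\text{APSAM only:} \quad & 80.75 - 79.75 = 1.00 \\
\text{MFCCF0 and APSAM:} \quad & 82.25 - 79.75 = 2.50
\end{align*}
```

The combined gain of 2.5 points is larger than the sum of the two individual gains (1.75 points).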
6. Conclusion
The method of feature selection based on the BP neural network not only makes it convenient to choose the most effective of the various traditional features, but also reduces the dimension of the feature space [11]. The average recognition rate reaches 79.75% when a set of five traditional features (E FOD, F0 mean, MFCC mean, ZCR FOD, LPC mean) is used to recognize speech emotion. Based on the feature selection, two new features of speech emotion, MFCCF0 and APSAM, are added to the selected five features. With GMM, we obtain the highest average recognition rate over the four emotions, 82.25%, and a recognition rate of 90% for Neutral. The experimental results show that the two new features improve the recognition rate of speech emotion, because they characterize additional emotional information.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (No. 61062011, No. 61362003) and GuangXi Key Lab of Multi-source Information Mining & Security, Electronic Engineering College, Guangxi Normal University.