A Noise Suppression Method for Speech Signal by Jointly Using Bayesian Estimation and Fuzzy Theory
1. Introduction
Speech recognition systems have been applied in various fields, for example, in inspection and maintenance operations in industrial factories and at construction sites, where handwriting is difficult. For speech recognition in such real circumstances, methods for suppressing the surrounding noise are indispensable.
Previously reported methods for noise reduction in speech recognition can be classified into two categories. One is based on a single microphone [1] [2], and the other uses a microphone array [3]. The latter requires a priori information on the number of noise sources, and it needs more microphones than noise sources when multiple noise sources are present, so this category demands large-scale systems. Therefore, the former, based on a single microphone, is more advantageous than the latter [4] [5]. For noise suppression of speech signals based on a single microphone, many algorithms applying the Kalman filter have been proposed up to now [6] [7] [8] [9]. However, the Kalman filter is originally based on the assumption of Gaussian white noise [10], whereas actual noises show complex fluctuation forms with non-Gaussian and non-white properties.
From the above viewpoint, in our previously reported study, a noise suppression algorithm for actual speech signals that does not require the assumption of Gaussian white noise was proposed [11]. The method can be applied to actual complex situations where both the noise statistics and the fluctuation forms of the speech signal are unknown. By applying the algorithm to real speech signals with several kinds of noises, its effectiveness was experimentally confirmed in comparison with the Kalman filter.
Furthermore, signal processing methods to remove the noise from actual speech signals have been proposed by jointly using the measured data of bone- and air-conducted speech signals [12] [13]. However, the algorithms of the previous methods adopted a simple additive model of the original speech signal and the surrounding noise for the air-conducted speech observation. Furthermore, the derived algorithms have been applied only to signals mixed with noises on a computer, and not to signals measured in a real environment in the presence of noises.
In this study, a new noise suppression method for speech signals is proposed by using Bayes’ theorem, employing a posterior distribution based on the air-conducted speech observation contaminated by surrounding noise. In the proposed algorithm, in order to improve the accuracy of estimation of the speech signal, an expansion expression of the conditional probability density function reflecting all linear and non-linear correlation information between the original speech signal and the air-conducted speech observation is adopted as the model of the speech observation. Then, a probability distribution with parameters estimated from the bone-conducted speech is adopted as the prior distribution. Furthermore, the algorithm proposed in this study is applied to signals measured in a real environment in the presence of noises.
Though the bone-conducted speech signal is a kind of solid-propagated sound that is less affected by the surrounding noise, the high-frequency components of the signal are attenuated through the propagation process [14]. By regarding the bone-conducted speech signal with attenuated high-frequency components as fuzzy data and applying the probability measure of fuzzy events [15], a new simplified noise suppression method is derived that reflects both the air- and bone-conducted speech signals.
The effectiveness of the proposed method is confirmed by applying it to bone- and air-conducted speech measured in a real environment under the existence of surrounding noise.
2. Theoretical Consideration
2.1. Stochastic Model for Air- and Bone-Conducted Speech Signals by Introducing Fuzzy Theory
In the actual environment with a surrounding noise, let
,
and
be the original speech signal and the observations of the air- and bone-conducted speech signals, respectively, at a discrete time k. The observation
is contaminated by a surrounding noise
. In our previous studies, a simple additive model was considered for the air-conducted speech observation
[12] [13]. In this study, in order to improve the accuracy of estimation of speech signal
, an expansion expression of conditional probability density function
[11] reflecting all linear and non-linear correlation information between
and
is adopted as the model of air-conducted speech observation.
(1)
with
, (2)
where
denotes the averaging operation on variables.
As the probability density functions
and
showing non-Gaussian distribution, the following statistical orthonormal expansion series expressions are adopted.
, (3)
(4)
with
,
,
,
,
,
,
,
,
, (5)
where
is the Hermite polynomial of ith order. Functions
and
are orthonormal polynomials having weighting functions
and
, respectively. These orthonormal polynomials can be decomposed into linearly independent series as
, (6)
. (7)
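A decomposition of this kind writes each orthonormal polynomial as a weighted sum of monomials, and the weights can be obtained by Gram-Schmidt orthogonalization. The following is a minimal sketch, not the authors' implementation: it assumes a standard Gaussian weighting function, evaluates inner products by Gauss-Hermite quadrature, and uses the hypothetical helper name `orthonormal_coeffs`. Under this assumption the result should coincide with the normalized Hermite polynomials.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def orthonormal_coeffs(max_order, nodes, weights):
    """Gram-Schmidt on 1, x, x^2, ... under <f, g> = sum_i w_i f(x_i) g(x_i).

    Returns a lower-triangular matrix C where C[n, j] is the coefficient
    of x**j in the n-th orthonormal polynomial.
    """
    # powers[i, j] = nodes[i] ** j
    powers = np.vander(nodes, max_order + 1, increasing=True)
    C = np.zeros((max_order + 1, max_order + 1))
    for n in range(max_order + 1):
        c = np.zeros(max_order + 1)
        c[n] = 1.0  # start from the monomial x**n
        for m in range(n):
            # projection of x**n onto the m-th orthonormal polynomial
            proj = np.sum(weights * powers[:, n] * (powers @ C[m]))
            c -= proj * C[m]
        norm = math.sqrt(np.sum(weights * (powers @ c) ** 2))
        C[n] = c / norm
    return C


# Assumption: weighting function is the standard normal density N(0, 1).
nodes, w = hermegauss(20)          # quadrature for the weight exp(-x^2 / 2)
w = w / math.sqrt(2.0 * math.pi)   # rescale so the weights sum to one
C = orthonormal_coeffs(3, nodes, w)
print(np.round(C, 4))
```

For a non-Gaussian weighting function, the same routine could be reused by passing sample points and normalized weights that represent that distribution instead of the quadrature rule.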
The coefficients
and
are calculated beforehand by using Schmidt’s orthogonalization algorithm [16]. The expansion coefficients
with order
,
can be obtained from the correlation relationship between original speech signal
and noisy observation of air-conducted speech
. Since the original speech signal is unknown in the presence of noise, these coefficients have to be estimated on the basis of the observation
. Regarding the expansion coefficients
as an unknown parameter vector
:
,
,
, (8)
the following simple dynamical model is introduced for the simultaneous estimation of the parameters with the specific signal
:
, (9)
Next, in order to express the relationship between the original speech signal and bone-conducted speech, after regarding the bone-conducted speech as fuzzy data, the conditional probability distribution function
can be obtained by applying the probability measure of fuzzy events [15] to (1), as follows.
(10)
where
is a membership function of the bone-conducted speech
, and a Gaussian type function:
,
, (11)
where a and b are constants and
is a parameter, is adopted. Accordingly, by considering
in Equation (1) and
in Equation (4), and the membership function in Equation (11), the numerator of Equation (10) can be expressed as follows:
(12)
with
,
,
. (13)
After considering the equality on Hermite polynomial:
, (14)
where
are expansion coefficients reflecting bone-conducted speech signal, and using the orthonormal condition:
, (15)
the integral in Equation (12) can be calculated. Thus, the following expression is derived
, (16)
. (17)
Furthermore, through a similar calculation process, the denominator of Equation (10) can be derived as follows:
,
. (18)
Therefore, by substituting Equations (16) and (18) into Equation (10), the conditional probability distribution function
can be expressed explicitly.
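The role of the membership function can be illustrated numerically with Zadeh's probability measure of fuzzy events, P(A) = ∫ μ_A(x) p(x) dx, and the corresponding conditional density μ_A(x) p(x) / P(A). The sketch below is only an illustration on a grid, not the closed-form expansion derived above. The membership form exp(-alpha (a x + b - z)^2) is an assumption chosen to match the constants a, b and the parameter named in the text, and the Gaussian prior and all function names are hypothetical.

```python
import numpy as np


def gaussian_membership(x, z, a, b, alpha):
    """Assumed Gaussian-type membership of x given a bone-conducted reading z."""
    return np.exp(-alpha * (a * x + b - z) ** 2)


def fuzzy_conditional_density(x_grid, prior, z, a, b, alpha):
    """Density of x conditioned on the fuzzy event 'the reading is about z'.

    x_grid must be uniformly spaced; prior holds density values on that grid.
    """
    dx = x_grid[1] - x_grid[0]
    weighted = gaussian_membership(x_grid, z, a, b, alpha) * prior
    prob_event = np.sum(weighted) * dx   # P(A) = integral of mu_A(x) p(x) dx
    return weighted / prob_event, prob_event


# Illustrative values only: standard normal prior, a = 1, b = 0, alpha = 0.5.
x = np.linspace(-6.0, 6.0, 2001)
prior = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
density, p_event = fuzzy_conditional_density(x, prior, z=1.0, a=1.0, b=0.0, alpha=0.5)
cond_mean = np.sum(x * density) * (x[1] - x[0])
print(p_event, cond_mean)
```

With these illustrative values the product of prior and membership is again Gaussian, so the conditional mean lies halfway between the prior mean 0 and the reading 1.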
2.2. Derivation of Noise Suppression Algorithm Based on Bayesian Estimation
To derive an estimation algorithm for the speech signal
, Bayes’ theorem for the conditional probability distribution [17] is first considered. Since the parameter
is also unknown, the conditional joint probability distribution of
and
is expressed as
, (19)
where
is a set of air-conducted speech data up to time k. By expanding the conditional joint probability distribution
in a statistical orthogonal expansion series on the basis of the well-known Gaussian distribution and calculating the conditional expectation, the estimates of
and
for mean can be derived as follows:
(20)
(21)
with
,
,
,
,
,
,
,
,
. (22)
Furthermore, the estimate of
for variance is derived as follows:
(23)
with
,
,
. (24)
Using Equation (1) and the orthonormal property of
, variables
and
in Equations (20) (21) and (23) can be calculated as follows:
(25)
(26)
with
,
,
,
,
,
,
,
. (27)
Furthermore, by considering Equations (10) (16) (18) and orthonormal property of
, variables
,
in Equation (22) and the conditional expectation in Equations (25) (26) can be calculated as follows:
(28)
(29)
(30)
with
,
,
,
,
,
,
,
. (31)
Since Equations (28) (29) and (30) can be evaluated by measuring bone-conducted speech
, no time transition models of
are necessary. Therefore, the computation time of the proposed algorithm can be reduced compared with the previous one [12]. Furthermore, by considering Equation (9), two parameters
and
in Equation (22) are given by the estimates of
at the discrete time
, as follows:
,
. (32)
Finally, considering Equations (1) (9) and (10), the expansion coefficients
in the estimation algorithm in Equations (20) (21) and (23) are given by the measurement of bone-conducted speech
, estimates of parameter
at the discrete time
, through a calculation process similar to that of Equations (25)-(30). Therefore, recursive estimation of the speech signal
can be achieved.
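The underlying Bayesian step (posterior proportional to likelihood times prior, followed by the conditional mean and variance) can be checked numerically on a grid. This is a brute-force sketch of the principle only. It does not implement the orthogonal-expansion algorithm of Equations (20)-(24), and the Gaussian prior and likelihood used in the example are assumptions chosen so that the result has a known closed form; `bayes_update` is a hypothetical name.

```python
import numpy as np


def bayes_update(x_grid, prior, likelihood):
    """One Bayes step on a uniform grid: returns posterior, mean and variance."""
    dx = x_grid[1] - x_grid[0]
    post = prior * likelihood
    post = post / (np.sum(post) * dx)            # normalize to unit area
    mean = np.sum(x_grid * post) * dx            # conditional expectation
    var = np.sum((x_grid - mean) ** 2 * post) * dx
    return post, mean, var


# Illustrative setting: prior N(0, 1), observation y = x + noise, noise ~ N(0, 1).
x = np.linspace(-8.0, 8.0, 4001)
prior = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
y_obs = 2.0
likelihood = np.exp(-0.5 * (y_obs - x) ** 2)     # any non-negative array works
posterior, x_hat, x_var = bayes_update(x, prior, likelihood)
print(x_hat, x_var)
```

In a recursive use, the posterior of one time step would be propagated to form the prior of the next. Because the likelihood is passed as a plain array, a non-Gaussian observation model can be substituted without changing the update itself.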
3. Application to Speech Signal in Real Environment
In order to confirm the practical usefulness of the proposed noise suppression algorithm, it was applied to speech signals in a real noise environment. In the previous studies [12] [13], the noisy air-conducted speech was created on a computer by mixing noise with the original air-conducted speech signal measured in a noise-free environment. In contrast, the algorithm proposed in this study was applied to signals measured in a real environment in the presence of actual noises. For a female and a male speech signal digitized with a sampling frequency of 10 kHz and 16-bit quantization, we estimated the speech signal based on the observation corrupted by additive noise.
More specifically, air-conducted speech was measured in a real environment in the presence of a white noise generated by a noise generator and an actual machine noise. The bone-conducted speech was measured simultaneously with the air-conducted speech by use of an acceleration sensor. By setting the amplitude of the noises roughly at two levels, the proposed algorithm was applied to extremely difficult situations with low SNR (noise-free air-conducted speech signal to noise ratio defined by
) being approximately −3 dB and −5 dB.
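For reference, an SNR figure of this kind can be computed as below. The defining expression is not reproduced in this text, so the sketch assumes the usual energy ratio in decibels, 10 log10(Σ x² / Σ v²), between a noise-free speech record and a noise record. The test signals are synthetic stand-ins, not the measured data, and `snr_db` is a hypothetical name.

```python
import numpy as np


def snr_db(clean, noise):
    """Signal-to-noise ratio in dB, assuming the usual energy-ratio definition."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)
    return 10.0 * np.log10(np.sum(clean**2) / np.sum(noise**2))


# Synthetic stand-ins: one second at 10 kHz, noise with twice the signal energy.
t = np.arange(10_000) / 10_000.0
clean = np.sin(2.0 * np.pi * 200.0 * t)
noise = np.sqrt(2.0) * np.cos(2.0 * np.pi * 200.0 * t)
print(round(snr_db(clean, noise), 2))
```

A noise record with twice the energy of the speech record gives about −3 dB, which is the milder of the two conditions considered here.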
Using the observed bone-conducted speech and the noisy observation of air-conducted speech, constants a and b are first calculated by introducing the linear regression model in Equation (11) and applying the least-squares method to this model. Secondly, the parameter
of the membership function is obtained by calculating the standard deviation
of
around
, as
assuming a Gaussian distribution for the deviation.
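The two-step parameter setting described above can be sketched as follows. Several points are assumptions rather than facts taken from the text: the regression direction (bone-conducted sample ≈ a × air-conducted sample + b), the conversion alpha = 1/(2σ²) that makes exp(-alpha d²) match a Gaussian of standard deviation σ, and the synthetic data used in place of measured speech. The name `fit_membership_parameters` is hypothetical.

```python
import numpy as np


def fit_membership_parameters(air, bone):
    """Least-squares fit bone ~ a * air + b, then spread of the residuals."""
    a, b = np.polyfit(air, bone, 1)        # highest-degree coefficient first
    residual = bone - (a * air + b)
    sigma = np.std(residual)               # deviation around the regression line
    alpha = 1.0 / (2.0 * sigma**2)         # assumed Gaussian-width conversion
    return a, b, sigma, alpha


# Synthetic stand-in data with known a = 0.5, b = 0.1, sigma = 0.2.
rng = np.random.default_rng(0)
air = rng.standard_normal(20_000)
bone = 0.5 * air + 0.1 + 0.2 * rng.standard_normal(20_000)
a_hat, b_hat, sigma_hat, alpha_hat = fit_membership_parameters(air, bone)
print(a_hat, b_hat, sigma_hat, alpha_hat)
```

With enough samples the fitted values should be close to the generating ones, which gives a quick sanity check before applying the same routine to recorded signals.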
The observed signals on air-conducted female speech contaminated by the white noise and machine noise are shown in Figure 1 and Figure 2. Furthermore, for the male speech signal, noisy air-conducted speech observations are shown in Figure 3 and Figure 4 respectively.
The estimated results by using the algorithm based on Equations (20)-(24) are shown in Figure 5 and Figure 6 for the female speech signal and in Figure 7 and Figure 8 for the male speech signal. For comparison, the estimated results of the female and male speech signals by using the estimation algorithm based on only the observation of air-conducted speech are shown in Figures 9-12.
By comparing Figures 5-8 with Figures 9-12, it is obvious that the proposed method can suppress the effects of white noise and real machine noise better than the method based on observation of only air-conducted speech.
The air-conducted female and male speech signals spoken by the same speakers in a separate situation without any noise are shown in Figure 13 and Figure 14 as references. Comparing these speech signals measured in noise-free circumstances with the results estimated by the proposed method and with the results of the algorithm based on the observation of only the air-conducted signal shows the effectiveness of the proposed method. Furthermore, the computation time of the proposed method was reduced by 55.2% compared with the algorithm based on only the air-conducted observation, because it is unnecessary for the proposed method to calculate recursively the estimate of variance of
based on the air-conducted speech
.
Figure 1. Observed female speech signal contaminated by white noise with
.
Figure 2. Observed female speech signal contaminated by machine noise with
.
Figure 3. Observed male speech signal contaminated by white noise with
.
Figure 4. Observed male speech signal contaminated by machine noise with
.
Figure 5. Estimated female speech signal by use of the proposed method based on observation contaminated by white noise with
.
Figure 6. Estimated female speech signal by use of the proposed method based on observation contaminated by machine noise with
.
Figure 7. Estimated male speech signal by use of the proposed method based on observation contaminated by white noise with
.
Figure 8. Estimated male speech signal by use of the proposed method based on observation contaminated by machine noise with
.
Figure 9. Estimated female speech signal by use of the method based on only air-conducted observation contaminated by white noise with
.
Figure 10. Estimated female speech signal by use of the method based on only air-conducted observation contaminated by machine noise with
.
Figure 11. Estimated male speech signal by use of the method based on only air-conducted observation contaminated by white noise with
.
Figure 12. Estimated male speech signal by use of the method based on only air-conducted observation contaminated by machine noise with
.
Figure 13. Air-conducted female speech signal in the different situation without any noises.
Figure 14. Air-conducted male speech signal in the different situation without any noises.
4. Conclusions
In this paper, by regarding the bone-conducted speech signal with attenuated high-frequency components as fuzzy data and applying the probability measure of fuzzy events, a new noise suppression method has been derived on the basis of Bayes’ theorem as the fundamental principle of estimation. Furthermore, the proposed algorithm has been applied to real speech signals contaminated by noises, measured in an actual environment with low SNR. The experiments indicate that the proposed algorithm gives better estimation results than the method based on only air-conducted observations.
The proposed approach is quite different from the traditional standard techniques. However, it is still at an early stage of development, and a number of practical problems remain to be investigated. These include: 1) application to a diverse range of speech signals in actual noise environments, 2) extension to cases with multiple noise sources, and 3) finding an optimal number of expansion terms for the expansion-based probability expressions adopted.
Acknowledgements
The authors are grateful to Ms. Yui Maeda of the Prefectural University of Hiroshima for her help during this study. This work was supported in part by the Grant-in-Aid for Scientific Research No. 19K04428 from the Ministry of Education, Culture, Sports, Science and Technology-Japan.