Text Independent Automatic Speaker Recognition System Using Mel-Frequency Cepstrum Coefficient and Gaussian Mixture Models

Abstract

The aim of this paper is to present the accuracy and timing results of a text-independent automatic speaker recognition (ASR) system, based on Mel-Frequency Cepstrum Coefficients (MFCC) and Gaussian Mixture Models (GMM), developed for a security access-control gate. 450 speakers were randomly selected from the Voxforge.org audio database. Their utterances were enhanced using spectral subtraction; MFCCs were then extracted, and these coefficients were statistically modeled by GMMs to build a profile for each speaker. For each speaker two different speech files were used: the first to build the profile database, the second to test system performance. The accuracy achieved by the proposed approach is greater than 96%, and the time required for a single test run, implemented in Matlab, is about 2 seconds on a common PC.
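As a rough illustration of the modeling and identification stages described above — not the authors' Matlab implementation, and omitting the spectral-subtraction and MFCC-extraction front end — the following pure-NumPy sketch fits a small diagonal-covariance GMM per speaker with plain EM and identifies a test utterance by maximum average log-likelihood. Synthetic feature vectors stand in for real MFCC frames, and all function names and parameter values are illustrative assumptions.

```python
import numpy as np

def fit_diag_gmm(X, k=2, iters=20, seed=0):
    """Fit a k-component diagonal-covariance GMM to feature rows X via EM (toy sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, k, replace=False)]      # initialize means from data points
    var = np.tile(X.var(axis=0) + 1e-3, (k, 1))  # shared initial per-dimension variances
    w = np.full(k, 1.0 / k)                      # uniform mixture weights
    for _ in range(iters):
        # E-step: per-frame, per-component log densities -> responsibilities
        logp = (np.log(w)
                - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)
                - 0.5 * np.sum((X[:, None, :] - mu) ** 2 / var, axis=2))
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances from responsibilities
        nk = r.sum(axis=0) + 1e-9
        w = nk / n
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ (X ** 2)) / nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var

def avg_loglik(X, model):
    """Average per-frame log-likelihood of feature rows X under a fitted GMM."""
    w, mu, var = model
    logp = (np.log(w)
            - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)
            - 0.5 * np.sum((X[:, None, :] - mu) ** 2 / var, axis=2))
    m = logp.max(axis=1, keepdims=True)          # stable log-sum-exp over components
    return float(np.mean(m.squeeze() + np.log(np.exp(logp - m).sum(axis=1))))

# Synthetic stand-in for MFCC frames: 3 "speakers" with well-separated feature means.
rng = np.random.default_rng(1)
centers = [np.zeros(12), np.full(12, 4.0), np.full(12, -4.0)]
train = [c + rng.normal(size=(200, 12)) for c in centers]
models = [fit_diag_gmm(X) for X in train]

# Identification: score a test "utterance" against every speaker model.
test_frames = centers[1] + rng.normal(size=(100, 12))
scores = [avg_loglik(test_frames, m) for m in models]
best = int(np.argmax(scores))
print("identified speaker:", best)
```

In a full system, each score would come from a GMM trained on real MFCC vectors extracted from noise-reduced speech, and identification would compare the test utterance against all enrolled profiles exactly as the argmax above does.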

Share and Cite:

A. Maesa, F. Garzia, M. Scarpiniti and R. Cusani, "Text Independent Automatic Speaker Recognition System Using Mel-Frequency Cepstrum Coefficient and Gaussian Mixture Models," Journal of Information Security, Vol. 3 No. 4, 2012, pp. 335-340. doi: 10.4236/jis.2012.34041.

Conflicts of Interest

The authors declare no conflicts of interest.


Copyright © 2024 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.