A two-level multi-gene genetic programming model for speech quality prediction in Voice over Internet Protocol systems☆
Introduction
In the world of real-time technologies, the perceived quality plays a critical role in increasing user satisfaction level. Hence, developing non-intrusive models to predict speech quality has been an important research subject during the past few years. The non-intrusive term refers to those methods do not need the reference signal in the process of estimating the quality. Originally, speech quality can be measured by performing reliable subjective tests. These methods utilize Mean Opinion Score (MOS) to categorize speech quality from poor to excellent based on human listener opinions [1]. Nevertheless, performing costly and time-consuming tests makes these methods unsuitable for real-time applications such as Voice over Internet Protocol (VoIP). Therefore, the objective models have been proposed to predict test results (i.e. MOS) from relevant parameters instead of human judgment. This type of evaluation methods can be classified as intrusive or non-intrusive [2]. The intrusive methods estimate the quality from calculation of the difference between clean and output degraded signals. The difference value (i.e. distortion) is then mapped into MOS value to obtain the predicted quality [2]. The most common intrusive method which formed as ITU-T P.862, is Perceptual Evaluation of Speech Quality (PESQ) [3]. The intrusive methods provide good correlation with subjective results. However, the major drawback is the need for original signal which make them unusable for live traffics. In order to overcome this weakness, non-intrusive methods have been proposed which measure the quality from speech features or transmission parameters. The signal-based methods measure the quality directly from analysis of the degraded version of received speech. For example, the ITU-T P.563 [4] utilizes various speech features to predict overall quality according to three main distortion classes. Parameters-based methods, in other hands, calculate the speech quality by using a number of parameters relevant to communication network. For instance, the E-model [5] uses some transmission impairments to develop a mathematical linear relation for predicting speech quality score. This model provides a quality rating factor R on a scale of 0–100.
In recent years, more attention has been paid to the use of Artificial Intelligence (AI) algorithms for developing non-intrusive quality assessment models. In [6] a parametric-based model by using Artificial Neural Network (ANN) for speech quality estimation was proposed. Falk et al. develop a speech quality prediction model using a set of extracted speech features together with Gaussian Mixture Models (GMMs) [7]. An auditory model for non-intrusive speech quality estimation was proposed in [8], which is based on temporal envelope representation of speech signal. Researchers in [9] employed the per-frame features which commonly are used in speech coding to predict speech quality. The proposed model utilized the degree of consistency between speech features and GMM of original signal to estimate overall quality. Raja [10] explored the capabilities of the Genetic Programming (GP) by developing a symbolic form model based on common single-gene GP. A non-intrusive objective measurement for estimating the quality of speech is proposed based on fuzzy Gaussian Mixture Model (GMM) and Support Vector Regression (SVR) was proposed in [11] for both narrowband and wideband speech. In [12], [13], different types of neural network were utilized to estimate speech quality, non-intrusively. These approaches employed the PESQ results to prepare dataset and predict overall MOS score. Researchers in [14] utilized Mel Filter Bank energies as the desired features and develop a non-intrusive prediction model based on Support Vector Regression (SVR). In [15] the capabilities of Bayesian learning method was studied for estimating the quality of VoIP by using several number of network impairments such as codec type, packet loss, gender and language of speaker. A parameter-based approach by using common single-gene GP method for quality estimation was introduced in [16]. A fuzzy-based model was also developed in [17] which uses a hybrid of GA and Neuro-Fuzzy for estimating quality of speech in IP networks. In [18] a modified version of E-model (ITU-G.107) was introduced by adding two new parameters related to delay impairment. In order to exploit the advantages of ensemble learning method, different base learners were learned for developing quality model in [19]. The proposed method uses the MFCC features which are extracted from different frequency sub-bands of original signal. Also, the Discrete Wavelet Transform (DWT) was utilized to decompose original signal into different sub-bands. In [20] different factors including culture, ageing, and language were used to measure perception of Quality of Service (QoS) of IP applications. The paper [21] exploited the subjective test results to develop a mathematical model for quality measurement based on native Thai users.
The aim of this study is to develop a low complexity model for predicting the quality in VoIP systems. In this way, similar to [10], [16], the GP algorithm is used as one of the well-known AI algorithms for developing quality measurement model. The GP family algorithms are the biologically-inspired techniques which attempt to find a solution (program) for a problem in a symbolic form through a number of genetic operations. In contrast to mentioned studies, a unique type of GP algorithm which utilizes the multi-gene individuals instead of common single-gene ones is utilized in this study. Using multi-gene GP is due to its ability in developing more accurate and efficient symbolic regression models in comparison with the common GP. Furthermore, in the most studies in the field of quality measurement, the proposed models are parameters-aware or signal-aware, exclusively. Against these approaches, this paper proposes a signal-aware model which is not unaware of transmission media conditions. This hybrid method utilizes the transmission impairments and speech features simultaneously to derive desired quality model. Another aspect is the complexity of developed models which is a major problem in the most of studies due to large feature vectors. In order to overcome this challenge, a 2-level structure is proposed which utilizes the parallel models at the first level. The parallelism leads to dimension reduction which avoids the curse of dimensionality. Utilizing parallel models also provides the ability for exploring the effect of combining speech features in improving model performance. Based on proposed structure, the second level calculates the overall quality score by aggregating the intermediate quality scores that are obtained at parallel models. For this, a different number of the network impairments and speech features are organized in a single feature vector to derive quality measurement model. The efficiency of proposed method is measured by performing several numbers of experiments in the cases that original (noise-free) and noisy speech signals are used at the sender side of VoIP systems.
The rest of this paper is organized as follows: in Section 2, we describe two common types of objective speech quality measurement methods including intrusive and non-intrusive briefly. In Section 3, various speech signal features and transmission impairments are described. Section 4 introduces MGGP and then describes the proposed 2-level structure for developing MGGP-based prediction model. Section 5 includes simulation system setup and dataset preparation. In Section 6, different experiments are performed according to different evolved MGGP model and results are compared. Section 7 concludes the paper.
Section snippets
Objective speech quality assessment
The objective methods attempts to predict subjective scores through an objective process based on human perception. The objective methods can be categorized in three main types: (1) The time domain methods like signal-to-noise ratio (SNR) and segmental SNR (SSNR) [2] use the original and degraded signals difference in time domain to obtain quality score, (2) The spectral domain methods use the parameters of speech production model to estimate quality. The Itakura–Saito (IS) measure, Spectral
The speech features and network impairment parameters for quality measurement
The selection of appropriate speech features has a vital importance in quality prediction modeling. Hence, the first step is to extract speech features for representing speech signal as a set of parameters. This section describes several techniques for per-frame feature extraction that are used in this study. These approaches are Mel-Frequency Cepstrum Coefficients (MFCC) [22], Linear Prediction Coding (LPC) [23], and Perceptual Linear Predictive (PLP) coefficients [24]. At the end of this
The proposed MGGP-based model description
In this section, firstly we introduce the principles of MGGP approach which is a unique type of GP machine learning technique. Then, the proposed MGGP modeling based on a 2-level structure will be described.
System setup and dataset preparation
In order to provide training and test datasets by using degraded speech signals, we set up a simulation system at MATLAB environment. Fig. 12 shows the various steps of speech quality (MOS) prediction based on MGGP modeling. The system includes the iLBC encoder for generating speech frame at sender and reconstruct degraded version of speech at receiver, the packet-loss simulator which is a 2-state Markov model pattern generator to distort the digitized speech signals. In order to extract
Experimental results and discussion
In this section, some experiments are performed based on two mentioned structures. For this, three types of per-frame feature are extracted from speech which are MFCC, LPC, and PLP with the number of coefficients 12, 12, 13 respectively. The performance of the developed model are compared with the well-known objective method i.e. PESQ. Using PESQ as a reference model is due to its high correlation with subjective test results [3]. Also, the MOS measurement accuracy is assessed using the Root
Concluding remarks
In order to predict speech quality in VoIP systems, a non-intrusive symbolic form model based on a two-level structure was proposed. In this way, multi-gene GP was utilized in two hierarchical levels to construct the final model. Several experiments are performed by employing the proposed structure and different combinations of feature vector. Furthermore, to make a thorough comparison between all derived models, the statistical hypothesis testing Paired Two-tailed t-test was carried out. The
Acknowledgment
The authors gratefully acknowledge the financial support provided by Institute of Science and High Technology and Environmental Sciences, Graduate University of Advanced Technology, Kerman, Iran, under Contract number 7.1614.
Farhad Rahdari received his M.Sc. from computer engineering department, Iran University of Science & Technology (IUST), Tehran, Iran in 2000. He also got his B.Sc. from the same department in 2007. He is now a Ph.D. candidate at faculty of computer engineering, Isfahan University. His main research interests include Signal and Speech Processing, Machine Learning and Hardware Design.
References (26)
- et al.
Non-intrusive speech quality prediction in VoIP networks using a neural network approach
Neurocomputing
(2009) - et al.
Evaluating perceived voice quality on packet networks using different random neural network architectures
Perform Eval
(2011) - ITU-T Rec. P. 800.1: Mean Opinion Score (MOS) terminology. International Telecommunication Union, Geneva,...
Speech quality of VoIP: assessment and prediction
(2007)- ITU-T Rec. P.862: perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality...
- ITU-T Rec. P.563: single-ended method for objective speech quality assessment in narrow-band telephony applications....
- ITU-T Rec. G. 107: the E-model, a computational model for use in transmission planning. International Telecommunication...
- et al.
Voice quality prediction models and their application in VoIP networks
IEEE Trans Multimed
(2006) - et al.
Nonintrusive speech quality estimation using Gaussian mixture models
IEEE Signal Process Lett
(2006) ANIQUE: An auditory model for single-ended speech quality estimation
IEEE Trans Speech Audio Process
(2005)
Low-complexity, nonintrusive speech quality assessment
IEEE Trans Speech Audio Process
Non-intrusive objective speech quality measurement based on fuzzy GMM and SVR for narrowband speech
J Beijing Inst Technol
Cited by (7)
Nonlinear symbolic regression for bit error rate prediction of NOMA systems in 5G cellular communications
2024, Engineering Applications of Artificial IntelligenceClassification of speech under stress using harmonic peak to energy ratio
2016, Computers and Electrical EngineeringEstimating the Perceived Audio Quality Based on Multigene Symbolic Regression for Broadcasting Systems and Web-Casting Applications
2021, 2021 31st International Conference Radioelektronika, RADIOELEKTRONIKA 2021A Benchmark of Non-intrusive Parametric Audio Quality Estimation Models for Broadcasting Systems and Web-casting Applications
2021, Advances in Electrical and Electronic EngineeringDaily predictions of solar radiation utilizing genetic programming techniques
2020, Indonesian Journal of Electrical Engineering and Computer Science
Farhad Rahdari received his M.Sc. from computer engineering department, Iran University of Science & Technology (IUST), Tehran, Iran in 2000. He also got his B.Sc. from the same department in 2007. He is now a Ph.D. candidate at faculty of computer engineering, Isfahan University. His main research interests include Signal and Speech Processing, Machine Learning and Hardware Design.
Mahdi Eftekhari received his B.Sc. in computer engineering from Department of Computer Science and Engineering, Shiraz University, Shiraz, Iran in 2001. He obtained his M.Sc. and Ph.D. degrees in Artificial Intelligence from the same department in 2004 and 2008, respectively. Mahdi has been a faculty member of Shahid Bahonar University of Kerman since 2008. His research interests include Machine Learning and modeling.
Reza Mousavi received his B.Sc. in computer engineering from Department of Computer Engineering, Yazd University, Yazd, Iran in 2010. He obtained his M.Sc. degree in Information Technology from Department of Electrical and Computer Engineering, Graduate University of Advanced Technology, Kerman, Iran in 2014. His research interests include ensemble learning, evolutionary algorithms, machine learning, data mining.
- ☆
Reviews processed and recommended for publication to the Editor-in-Chief by Associate Editor Dr. Jia Hu.