A Hybrid Approach for Speech Enhancement in Speaker Identification System


Published in the International Journal of Informatics, Technology & Computers
Publication Date: August 2019

Zaw Win Aung
Technological University (Loikaw), Loikaw Township, Kayah State, Myanmar


Abstract
The performance of a speaker identification system is significantly degraded when the input speech signal is distorted by background noise. Noise not only reduces speech intelligibility and voice quality but also degrades speech processing accuracy, and can even prevent the system from working properly. Speech enhancement is therefore important for improving the performance of speaker identification systems under noisy conditions. Many speech enhancement methods have been developed over the years. Spectral subtraction is one of the most popular, because it is easy to implement and computationally light; it removes noise from the noisy input speech signal. Another popular speech enhancement method is endpoint detection, which separates the speech segments of an utterance from the background, i.e., the non-speech segments picked up during recording. Using either speech enhancement method on its own, however, is not robust enough to guarantee high accuracy in a speaker identification system. In this paper, a new robust hybrid approach to speech enhancement is developed that combines the spectral subtraction method with the endpoint detection method. The experimental study shows that the proposed hybrid speech enhancement method is more robust and faster in computation than using each speech enhancement method separately.

Keywords: speaker identification; spectral subtraction; endpoint detection; speech enhancement

1. INTRODUCTION
Speech is the most important, direct, effective and convenient means of information exchange. With the rapid development of science and technology in recent years, people are no longer satisfied with exchanging information with the computer through the keyboard and mouse, and hope instead to control the computer by speech. Speech signal processing technology arose to meet this goal. Some speech signal processing systems are now embedded in intelligent systems, but they can only work in a quiet environment, whereas the speech acquisition process inevitably suffers from various kinds of noise interference. Noise not only reduces speech intelligibility and voice quality but also degrades speech processing accuracy, and can even prevent the system from working properly.
When an automatic speaker identification system developed in the laboratory is migrated to real-world applications, the input speech signal recorded from the microphone is corrupted by background noises such as street noise, babble noise and car noise. The performance of the speaker identification system is significantly degraded by this distortion of the input speech signal [1]. Speech enhancement is therefore important for improving the performance of speaker identification systems under noisy conditions.
The term speech enhancement refers to methods that aim to recover the speech signal from a noisy observation. Speech enhancement techniques are concerned with algorithms that mitigate these unwanted noise effects and thus improve signal quality.
Speech enhancement systems can be classified in a number of ways [2, 3], based on the criteria used or on the application of the enhancement system. Noise reduction systems are generally classified by the number of input channels (one/multiple), the domain of processing (time/frequency/spatial) and the type of algorithm (non-adaptive/adaptive) [3, 4, 5]. Speech enhancement techniques can also be divided into two broad classes: single-microphone and multi-microphone techniques [2].
In past decades, research in the field of speech enhancement has focused on the suppression of additive background noise [6, 7, 8]. From a signal processing point of view, additive noise is easier to deal with than convolutive noise or nonlinear disturbances [9]. The ultimate goal of speech enhancement is to eliminate the additive noise present in the speech signal and restore the signal to its original form. Several methods have been developed as a result of these research efforts [10], most of them derived under auditory, perceptual or statistical constraints placed on the speech and noise signals.
The spectral subtraction method, first proposed by Boll [10], is suitable for enhancing speech signals degraded by uncorrelated additive noise. It estimates the power spectral density of the clean signal by subtracting an estimate of the power spectral density of the noise from an estimate of the power spectral density of the degraded signal. The estimation is performed on a frame-by-frame basis, where each frame consists of 20-40 ms of speech samples.
Speech endpoint detection (EPD) plays a major role as a preprocessing block in a variety of speech processing applications, such as speech enhancement, speech coding and speech recognition, where it is desirable to detect the beginning and ending boundaries of speech in the input signal. Accurate speech endpoint detection is crucial for recognition performance: it improves recognition accuracy and reduces computational complexity. Many endpoint detection algorithms have been developed. In isolated-word automatic speech recognition, endpoints are detected in order to separate the speech signal from unwanted background noise [11]; this process is called voice activity detection. The main uses of endpoint detection are in speech coding and speech recognition.

2. SYSTEM ARCHITECTURE
In the proposed system, there are three main modules: speech enhancement, feature extraction and feature matching. A new hybrid approach combining spectral subtraction and endpoint detection is used to enhance the input speech signal. The spectral subtraction method eliminates the unwanted signals and background noise from the noisy input speech signal. After spectral subtraction, non-speech segments still remain in the enhanced speech signal; the endpoint detection method removes them in order to improve the recognition accuracy of the speaker identification system. Figure 1 shows the system architecture of the proposed system.

Figure 1. System architecture of the proposed system

2.1 Spectral Subtraction
The spectral subtraction algorithm is considered to be one of the first algorithms proposed for noise suppression [10], and many variations of it have been developed aiming at better suppression of noise. It is based on a simple principle: assuming the background noise is acoustically added to clean speech, an estimate of the clean speech signal can be obtained by subtracting an estimate of the noise magnitude spectrum from the noisy speech spectrum. The spectral information needed to describe the noise spectrum is obtained during periods when speech is absent. The inverse DFT of the estimated signal spectrum, combined with the phase of the noisy signal, gives the enhanced output signal. The algorithm is computationally simple, since it involves only forward and inverse Fourier transforms, and it suppresses acoustic noise in order to improve the intelligibility and quality of the processed speech signal.
The basic spectral subtraction algorithm was proposed by Boll [10], under the assumption that the speech signal and the noise are uncorrelated. The basic equation for spectral subtraction can be expressed as follows.
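In a standard power-spectral form, assuming the noisy speech is y(n) = s(n) + d(n), where s(n) is the clean speech and d(n) is the additive noise, with short-time spectra Y(\omega) and D(\omega), the relation reads

|\hat{S}(\omega)|^2 = |Y(\omega)|^2 - |\hat{D}(\omega)|^2

where |\hat{D}(\omega)|^2 is the noise power spectrum estimated during non-speech periods, negative differences are set to zero, and the enhanced time-domain frame is recovered by the inverse Fourier transform of the estimated magnitude combined with the phase of Y(\omega).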

Therefore, after basic spectral subtraction noise elimination, some residual spectral components with relatively large power still remain, appearing as random spikes in the spectrum. After the inverse Fourier transform, these residuals produce a new, rhythmically fluctuating noise in the enhanced speech (musical noise), and this kind of noise cannot be removed by the general spectral subtraction method.
In order to minimize the secondary degradation of the speech caused by musical noise (rhythmically fluctuating noise), the spectral subtraction method can be improved. In noisy speech, the speech energy is generally concentrated in certain frequencies or frequency bands, while the noise energy is often spread over the entire frequency range. Accordingly, the noise can be subtracted more heavily in the time frames where its amplitude is higher.
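As an illustration of the basic procedure, a minimal sketch in Python is given below (the system itself is implemented in MATLAB, so this is only indicative; the frame length, overlap and number of leading noise-only frames are assumptions):

```python
import numpy as np

def spectral_subtraction(noisy, frame_len=256, overlap=128, noise_frames=5):
    """Basic magnitude spectral subtraction (illustrative sketch)."""
    noisy = np.asarray(noisy, dtype=float)
    window = np.hamming(frame_len)
    step = frame_len - overlap
    n_frames = 1 + (len(noisy) - frame_len) // step

    # Noise magnitude spectrum estimated from the leading non-speech frames
    noise_mag = np.mean(
        [np.abs(np.fft.rfft(noisy[i * step:i * step + frame_len] * window))
         for i in range(noise_frames)], axis=0)

    enhanced = np.zeros(len(noisy))
    for i in range(n_frames):
        seg = noisy[i * step:i * step + frame_len] * window
        spec = np.fft.rfft(seg)
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)  # subtract and half-wave rectify
        phase = np.angle(spec)                           # keep the noisy phase
        enhanced[i * step:i * step + frame_len] += np.fft.irfft(
            mag * np.exp(1j * phase), frame_len)         # overlap-add
    return enhanced
```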

2.2 Endpoint Detection
The process of separating the speech segments of an utterance from the background, i.e., the non-speech segments picked up during recording, is called endpoint detection [12]. Accurate speech endpoint detection is crucial for recognition performance: it improves recognition accuracy and reduces computational complexity. In noisy environments, speech samples containing unwanted signals and background noise are trimmed by the endpoint detection method. Over the years, different approaches have been proposed for detecting the speech segments in the input signal. Early algorithms were based on features such as short-term energy, zero-crossing rate, linear prediction and pitch analysis. More recently, classification of voiced and unvoiced segments has been based on cepstral coefficients, wavelet transforms, periodicity measures and statistical models. Short-term energy is used in the proposed system.
Speech is produced by a time-varying vocal tract system with time-varying excitation, so the speech signal is non-stationary in nature. However, it can be regarded as stationary when viewed in blocks of 10-30 ms [13]. Short-term processing divides the input speech signal into short analysis segments that are isolated and processed as if they had fixed (non-time-varying) properties. These short analysis segments, called analysis frames, almost always overlap one another. The energy associated with voiced speech is large compared with unvoiced speech [14], and silence has the least (negligible) energy compared with unvoiced speech [13]. Hence, short-term energy can be used to classify speech into voiced, unvoiced and silence regions. For short-term energy computation, speech is considered in short analysis frames whose size typically ranges from 10 to 30 ms. The energies used for signal analysis are given in equations (7), (8) and (9), where equation (7) is the logarithmic short-term energy, equation (8) is the squared short-term energy and equation (9) is the absolute short-term energy [15].
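Standard definitions of these measures, assumed here to correspond to equations (7), (8) and (9), are:

E_{log} = \log \sum_{n=1}^{N} s(n)^2    (7)

E_{sq} = \sum_{n=1}^{N} s(n)^2    (8)

E_{abs} = \sum_{n=1}^{N} |s(n)|    (9)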

where s(n) is the speech signal and N is the length of the sampled signal. The logarithmic short-term energy is the most suitable of the three and is therefore used in the proposed system.
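A minimal sketch of energy-based endpoint trimming is shown below in Python (the paper's implementation is in MATLAB; the frame size, frame step and the 10 dB margin over the quietest frame are illustrative assumptions, not values taken from the paper):

```python
import numpy as np

def detect_endpoints(signal, frame_len=200, frame_step=80, margin_db=10.0):
    """Trim leading/trailing non-speech using logarithmic short-term energy."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // frame_step

    # Logarithmic short-term energy of each analysis frame
    log_energy = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * frame_step:i * frame_step + frame_len]
        log_energy[i] = 10.0 * np.log10(np.sum(frame ** 2) + 1e-12)

    # A frame is labelled speech if it exceeds the quietest frame by margin_db
    threshold = log_energy.min() + margin_db
    speech = np.where(log_energy > threshold)[0]
    if speech.size == 0:
        return signal                        # nothing above threshold; return as-is
    start = speech[0] * frame_step
    end = speech[-1] * frame_step + frame_len
    return signal[start:end]                 # speech-only segment
```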

2.3 Feature Extraction
The purpose of this module is to convert the speech waveform into a parametric representation (at a considerably lower information rate) for further analysis and processing. This is often referred to as the signal-processing front end. The speech signal is a slowly time-varying signal (it is called quasi-stationary): when examined over a sufficiently short period of time (between 5 and 100 ms), its characteristics are fairly stationary, but over longer periods (on the order of 1/5 s or more) the signal characteristics change to reflect the different speech sounds being spoken. Therefore, short-time spectral analysis is the most common way to characterize the speech signal [16].
A wide range of possibilities exists for parametrically representing the speech signal for the speaker identification task, such as Linear Predictive Coding (LPC), Mel-Frequency Cepstral Coefficients (MFCC) and others. MFCC is perhaps the best known and most popular, and it is used in this system.
MFCCs are based on the known variation of the human ear's critical bandwidths with frequency: filters spaced linearly at low frequencies and logarithmically at high frequencies are used to capture the phonetically important characteristics of speech. This is expressed in the mel-frequency scale, which has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. The block diagram of the MFCC processor is shown in Figure 2.
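A commonly used analytic form of this scale maps a frequency f (in Hz) to mel(f) = 2595 \log_{10}(1 + f/700).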

Figure 2. Block diagram of MFCC processor

First, the input speech signal is blocked into frames of N samples, with adjacent frames overlapping by N - M samples; in this system N = 256 and M = 100. Each frame is then windowed with a Hamming window, which has the form:
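w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1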

By applying the procedure described above, a set of mel-frequency cepstral coefficients is computed for each speech frame of about 30 ms (with overlap). These coefficients are the result of a cosine transform of the logarithm of the short-term power spectrum expressed on the mel-frequency scale. The set of coefficients is called an acoustic vector, so each input utterance is transformed into a sequence of acoustic vectors.
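The steps above can be sketched in Python as follows (the actual system is implemented in MATLAB; the filterbank size, number of cepstral coefficients and the helper names hz_to_mel/mel_to_hz are illustrative assumptions, while N = 256 and M = 100 follow the paper):

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=8000, n=256, m=100, n_filters=20, n_ceps=12):
    """Compute MFCC acoustic vectors (illustrative sketch)."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - n) // m
    # Frame blocking: frames of n samples, overlapping by n - m samples
    frames = np.stack([signal[i * m:i * m + n] for i in range(n_frames)])
    frames *= np.hamming(n)                           # Hamming-window each frame
    power = np.abs(np.fft.rfft(frames, n)) ** 2 / n   # short-term power spectrum

    # Triangular filters spaced uniformly on the mel scale from 0 to fs/2
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n // 2 + 1))
    for j in range(1, n_filters + 1):
        lo, ctr, hi = bins[j - 1], bins[j], bins[j + 1]
        fbank[j - 1, lo:ctr] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)
        fbank[j - 1, ctr:hi] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)

    log_mel = np.log(power @ fbank.T + 1e-10)         # log mel spectrum
    # Discrete cosine transform; drop c0 and keep n_ceps coefficients
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, 1:n_ceps + 1]
```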

2.4 Feature Matching
State-of-the-art feature matching techniques used in speaker identification include Dynamic Time Warping (DTW), Hidden Markov Models (HMM) and Vector Quantization (VQ). In this paper, the VQ approach is used because of its ease of implementation and high accuracy. VQ is a process of mapping vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and can be represented by its center, called a codeword; the collection of all codewords is called a codebook. The distance from a vector to the closest codeword of a codebook is called the VQ distortion. In the recognition phase, an input utterance of an unknown voice is "vector-quantized" using each trained codebook and the total VQ distortion is computed; the speaker corresponding to the VQ codebook with the smallest total distortion is identified. Figure 3 shows the block diagram of the basic VQ training and classification structure.

Figure 3. Block diagram of the basic VQ training and classification structure

Initially, the training set of vectors is used to create the optimal set of codebook vectors for representing the spectral variability observed in the training set. A similarity or distance measure between pairs of spectral analysis vectors is then used to cluster the training vectors and to associate or classify arbitrary spectral vectors with unique codebook entries. The next step is a centroid computation procedure. Finally, a classification procedure chooses the codebook vector closest to the input vector and uses the codebook index as the resulting spectral representation; this is often referred to as nearest-neighbor labeling or the optimal encoding procedure. The classification procedure is essentially a quantizer that accepts a speech spectral vector as input and provides, as output, the index of the codebook vector that best matches the input [11].
After the enrollment session, the acoustic vectors extracted from the input speech of a speaker provide a set of training vectors. The next important step is to build a specific VQ codebook for this speech signal from those training vectors. A well-known algorithm for this purpose is the LBG algorithm, which clusters a set of L training vectors into a set of M codebook vectors. The algorithm is formally implemented by the following recursive procedure:
1. Design a 1-vector codebook: this is the centroid of the entire set of training vectors (hence, no iteration is required here).
2. Double the size of the codebook by splitting each current codeword yn according to the rule:
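y_n^{+} = y_n(1 + \epsilon), \quad y_n^{-} = y_n(1 - \epsilon)

where \epsilon is a small splitting parameter (typically about 0.01).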

3. Nearest-Neighbor Search: for each training vector, find the codeword in the current codebook that is closest (in terms of similarity measurement), and assign that vector to the corresponding cell (associated with the closest codeword).
4. Centroid Update: update the codeword in each cell using the centroid of the training vectors assigned to that cell.
5. Iteration 1: repeat steps 3 and 4 until the average distance falls below a preset threshold.
6. Iteration 2: repeat steps 2, 3 and 4 until a codebook size of M is designed.
Intuitively, the LBG algorithm designs an M-vector codebook in stages: it starts by designing a 1-vector codebook, then uses a splitting technique on the codewords to initialize the search for a 2-vector codebook, and continues splitting until the desired M-vector codebook is obtained. Figure 4 shows the block diagram of the LBG algorithm.

Figure 4. Block diagram of LBG algorithm

“Cluster vectors” is the nearest-neighbor search procedure which assigns each training vector to a cluster associated with the closest codeword. “Find centroids” is the centroid update procedure. “Compute D (distortion)” sums the distances of all training vectors in the nearest-neighbor search so as to determine whether the procedure has converged.
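A compact Python sketch of the LBG procedure is given below (the system itself is implemented in MATLAB; the splitting parameter eps, the convergence threshold and the assumption that M is a power of two are illustrative choices):

```python
import numpy as np

def lbg(training_vectors, M, eps=0.01, threshold=1e-3):
    """Design an M-codeword VQ codebook with the LBG splitting algorithm (sketch)."""
    X = np.asarray(training_vectors, dtype=float)
    codebook = X.mean(axis=0, keepdims=True)            # step 1: 1-vector codebook
    while codebook.shape[0] < M:
        # step 2: split every codeword y into y(1 + eps) and y(1 - eps)
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        prev_dist = np.inf
        while True:
            # step 3: nearest-neighbor search (Euclidean distance)
            d = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # step 4: centroid update for every non-empty cell
            for k in range(codebook.shape[0]):
                members = X[labels == k]
                if len(members) > 0:
                    codebook[k] = members.mean(axis=0)
            # step 5: stop when the average distortion no longer improves
            dist = d[np.arange(len(X)), labels].mean()
            if (prev_dist - dist) / dist < threshold:
                break
            prev_dist = dist
    return codebook
```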

3. IMPLEMENTATION OF THE SYSTEM
The proposed speaker identification system is simulated in MATLAB; it takes speech signals as input and produces the identity of the speaker as output. The speaker utters his or her name once in a training session and again in a later testing session. The sounds to be trained and tested were recorded in WAV format.

The proposed system has three main steps. In the first step, the new hybrid approach combining spectral subtraction and endpoint detection is used to enhance the input speech signal: the spectral subtraction method eliminates the noise from the noisy input speech signal, and the endpoint detection method removes the non-speech segments from the enhanced speech signal. Figure 5 shows the waveforms of the noisy speech signal and the enhanced speech signal.

Figure 5. The waveforms of the noisy speech signal and the enhanced speech signal

In Figure 5, it can be seen that the signal waveform is improved after spectral subtraction noise elimination is applied, but non-speech segments still remain in the enhanced speech signal. After the endpoint detection method is applied, the non-speech segments are removed and the signal waveform is significantly improved.
In the second step, the speech signal is processed for feature extraction using the MFCC feature extraction algorithm. In the third step, the speech signal is processed for feature matching and decision making using the Vector Quantization (VQ) approach.
In the training phase, the codebook (reference model) for each speech signal is constructed from the MFCC feature vectors using the LBG clustering algorithm and stored in the database.
In the identification phase, the input speech signal is compared with the stored reference models in the database, and the distance between them is calculated using the Euclidean distance. The system then outputs, as the identification result, the speaker ID with the minimum distance.
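This matching step can be sketched in Python as follows (MATLAB is used in the actual system; the dictionary-of-codebooks layout and the function name identify_speaker are illustrative assumptions):

```python
import numpy as np

def identify_speaker(test_vectors, codebooks):
    """Return the ID of the codebook with the smallest average VQ distortion."""
    X = np.asarray(test_vectors, dtype=float)
    best_id, best_dist = None, np.inf
    for speaker_id, codebook in codebooks.items():
        cb = np.asarray(codebook, dtype=float)
        # Euclidean distance from each test vector to its nearest codeword
        d = np.linalg.norm(X[:, None, :] - cb[None, :, :], axis=2)
        distortion = d.min(axis=1).mean()
        if distortion < best_dist:
            best_id, best_dist = speaker_id, distortion
    return best_id
```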

4. EXPERIMENTAL RESULT
In this section, the performance of the proposed system is compared with that of two other systems: an endpoint detection based system and a spectral subtraction based system.
Speech samples were collected from 20 adults: ten male speakers and ten female speakers. Speakers were asked to utter their names at normal speed, under normal laboratory conditions. The same microphone was used for all recordings, and the speech signals were sampled at 8000 Hz. Each speaker utters his or her name three times in the training session and once in the later testing session. Training samples were recorded by uttering the name of the speaker (e.g. "I am Zaw Win Aung"), each sample being about 2 seconds long. Testing samples were recorded in the same way.
The speaker database consists of 60 speaker models collected from the 20 adults, ten male and ten female speakers (3 speech samples per speaker).
The proposed system takes less training time than the spectral subtraction based system. Compared with the endpoint detection based system, however, the proposed system takes more training time to build the speaker models for the speaker database, because it uses two speech enhancement methods to improve the quality of the input speech signal. Figure 6 shows the training time taken by the systems.

Figure 6. Training time taken by the systems
In the experiment, twenty speech samples collected from the 10 male and 10 female speakers are used as test samples. Before the experiment, the test samples are mixed with white noise at 0 dB, -5 dB, -10 dB, -15 dB, -20 dB and -40 dB respectively. Six experiments are carried out, each using the twenty test speech samples at one of these noise levels.
The experiments show that the testing time of the proposed system is less than that of the endpoint detection based system and the spectral subtraction based system; the testing time taken by the systems is shown in Table 1. In terms of accuracy, the proposed system achieves higher accuracy than the other two systems; the accuracy of the systems is shown in Table 2.

Table 1. Testing time taken by the systems

Table 2. Accuracy of the systems

5. CONCLUSION
From this work it can be concluded that the proposed hybrid speech enhancement method eliminates noise and non-speech segments more effectively and improves the signal-to-noise ratio without significantly impairing speech intelligibility. Although the proposed system takes more training time than the endpoint detection based system, training time does not affect the run-time performance of the system and is negligible in real-world applications. It can therefore be concluded that the proposed hybrid system is more robust and faster than either the endpoint detection based system or the spectral subtraction based system, and that it is suitable for real-time operation.

6. ACKNOWLEDGMENTS
The author would like to thank all of his colleagues who have given him support and encouragement during the period of the research work. The author would like to extend his thanks to his students who provided the speech samples for this research work. Finally the author would like to express his indebtedness and deep gratitude to his beloved parents, wife and sons for their kindness, support and understanding during the research work.