Short Term Time Domain Processing of Speech

 

Need for Short Term Processing of Speech

 

 

Speech is produced by a time-varying vocal tract system driven by a time-varying excitation. As a result, the speech signal is non-stationary in nature. Most of the signal processing tools studied in signals and systems and in signal processing assume a time-invariant system and a time-invariant excitation, i.e. a stationary signal. Hence these tools are not directly applicable to speech, because using them directly on speech violates their underlying assumption. Even if such a tool is applied blindly and an output is computed, that output is of little practical significance. For instance, the relation for total energy is a fundamental tool in signal processing. That is,

$$E_T = \sum_{n=-\infty}^{\infty} s^2(n)$$

This relation is useful for a stationary signal having finite energy. Suppose this tool is used to compute the total energy of a speech signal. No doubt, it gives the total energy present in the speech signal. However, this single number is of little use, because from the nature of speech production we know that speech has time-varying amplitude and energy. What is important in the case of speech is therefore a tool that gives information about the time-varying energy. Thus there is a need for a different way of processing speech.

 

An engineering solution proposed for processing speech is to use the existing signal processing tools in a modified fashion. To be more specific, the tools can still assume that the signal being processed is stationary, because speech can be treated as approximately stationary when it is viewed in blocks of 10-30 msec. Hence, to process speech with different signal processing tools, it is viewed in blocks (frames) of 10-30 msec. Such processing is termed Short Term Processing (STP).
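The idea of viewing speech in 10-30 msec blocks can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `frame_signal`, the 8 kHz sampling rate, and the 20 msec frame with 10 msec shift are assumptions chosen for the example, not values fixed by the text.

```python
def frame_signal(signal, sample_rate, frame_ms=20.0, shift_ms=10.0):
    """Split a sequence of samples into short, possibly overlapping frames."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    if frame_len <= 0 or shift <= 0:
        raise ValueError("frame and shift must be at least one sample long")
    frames = []
    # Only full frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(signal) - frame_len + 1, shift):
        frames.append(list(signal[start:start + frame_len]))
    return frames


# One second of a dummy signal at an assumed sampling rate of 8 kHz.
samples = [0.0] * 8000
frames = frame_signal(samples, sample_rate=8000)
print(len(frames), len(frames[0]))
```

Each frame can then be handed to a conventional tool that assumes a stationary input.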

 

Short Term Processing of speech can be performed either in the time domain or in the frequency domain. The domain chosen depends on the information that we want to extract from the speech. For instance, parameters like short term energy, short term zero crossing rate and short term autocorrelation can be computed by time domain processing of speech. Alternatively, the short term Fourier transform can be computed by frequency domain processing of speech. Each of these parameters gives different information about speech that can be used for automatic processing.

 

 

Short Term Energy Parameter

 

 

The energy associated with speech is time varying in nature. Hence the interest for any automatic processing of speech is to know how the energy varies with time and, to be more specific, the energy associated with a short term region of speech. By the nature of its production, the speech signal consists of voiced, unvoiced and silence regions. Further, the energy associated with a voiced region is large compared to that of an unvoiced region, and a silence region has the least or negligible energy. Thus short term energy can be used for voiced, unvoiced and silence classification of speech.

 

The relation for finding the short term energy can be derived from the total energy relation defined in signal processing. The total energy of an energy signal is given by

$$E_T = \sum_{n=-\infty}^{\infty} s^2(n)$$

In the case of short term energy computation, we consider speech in frames of 10-30 msec. Let the samples in a frame of speech be indexed from n=0 to n=N-1, where "N" is the length of the frame in samples. For energy computation, the amplitude of the speech samples is taken to be zero outside the frame. Accordingly, we can write the above relation as

$$E_T = \sum_{n=0}^{N-1} s^2(n)$$

This relation gives the total energy present in the frame of speech from n=0 to n=N-1. To represent explicitly that only one frame of speech is considered, we use the relation

$$E_T = \sum_{n=-\infty}^{\infty} \left( s(n)\, w(n) \right)^2$$

where "w(n)" represents a windowing function of finite duration. Several windowing functions are available in the signal processing literature. The most commonly used ones include the rectangular, Hanning and Hamming windows. For the estimation of all time domain parameters we use the rectangular window for its simplicity.
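For reference, the three windows named above can be generated as follows. This is a minimal sketch using the usual textbook raised-cosine forms of the Hanning and Hamming windows; the helper names `make_window` and `apply_window` are hypothetical and introduced only for illustration.

```python
import math


def make_window(kind, length):
    """Return `length` samples of a 'rectangular', 'hanning' or 'hamming' window."""
    if length < 2:
        raise ValueError("length must be at least 2 samples")
    if kind == "rectangular":
        return [1.0] * length
    if kind == "hanning":
        a, b = 0.5, 0.5
    elif kind == "hamming":
        a, b = 0.54, 0.46
    else:
        raise ValueError("unknown window kind: " + kind)
    # Raised-cosine form: w(n) = a - b * cos(2*pi*n / (N - 1)), n = 0..N-1
    return [a - b * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def apply_window(frame, window):
    """Multiply a frame by a window of the same length, sample by sample."""
    return [x * w for x, w in zip(frame, window)]


hamming = make_window("hamming", 5)
windowed = apply_window([1.0, 1.0, 1.0, 1.0, 1.0], hamming)
```

With the rectangular window every sample is weighted equally, which is why it reduces to simply cutting out the frame.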

 

Now we can write the relation of short term energy as follows

$$e(n) = \sum_{m=-\infty}^{\infty} \left( s(m)\, w(n-m) \right)^2$$

where "n" is the sample position at which we are interested in knowing the short term energy, and the shift (or rate) is the number of samples by which "n" is advanced between successive values. The shift can be as small as one sample or as large as the frame size. Computing the short term energy for every one-sample shift is usually not required, since the energy variation in speech is relatively slow. For this reason the shift is kept much larger than one sample. Usually it is about half the frame size.
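The short term energy with a rectangular window and a shift of half the frame size can be sketched as below. The function name `short_term_energy`, the 8 kHz sampling rate and the synthetic test signal (silence followed by a sine tone) are assumptions made for this example only.

```python
import math


def short_term_energy(signal, frame_len, shift):
    """Energy of each rectangular-windowed frame, one value per shift."""
    energies = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = signal[start:start + frame_len]
        # Rectangular window: every sample in the frame has weight 1.
        energies.append(sum(x * x for x in frame))
    return energies


# Synthetic stand-in for speech: 50 msec of silence followed by
# 50 msec of a 200 Hz sine, sampled at an assumed rate of 8 kHz.
fs = 8000
silence = [0.0] * 400
tone = [math.sin(2 * math.pi * 200 * n / fs) for n in range(400)]
signal = silence + tone

# 20 msec frames (160 samples) with a shift of half the frame size.
energy = short_term_energy(signal, frame_len=160, shift=80)
print(energy)
```

The resulting contour stays at zero over the silent part and rises over the tone, which is the behaviour that makes short term energy usable for separating silence from speech.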

 

 

 

 

Figure_1: Short term energy contour for the speech signal

 

The last point about the short term energy is the choice of frame size. Since the stationarity assumption for speech is valid for 10 to 30 msec, the typical frame size is about 20 msec. With larger frame sizes we get a much smoother version of the energy contour and may fail to capture the time varying nature of the short term energy. Figure_1 shows the short term energy contour for the speech signal taken for study.

 


Short Term Zero Crossing Rate (ZCR)

 

 

Zero Crossing Rate gives information about the number of zero crossings present in a given signal. Intuitively, if the number of zero crossings in a given signal is large, then the signal is changing rapidly and accordingly may contain high frequency information. Along similar lines, if the number of zero crossings is small, then the signal is changing slowly and accordingly may contain low frequency information. Thus ZCR gives indirect information about the frequency content of the signal.

 

The ZCR in case of stationary signal is defined as,

  


 

This relation can be modified for non-stationary signals like speech and termed as short term ZCR. It is defined as

 

 

 

The factor "2" appears in the denominator to take care of the fact that there are two zero crossings per cycle of a signal.

 

 

 

 

Figure_2: Short term zero crossing rate of a speech signal

 

In the case of speech, the nature of the signal changes with time over a few msec, for instance from voiced to unvoiced and back to voiced, and so on. To obtain useful information, ZCR needs to be computed using a typical frame size of 10-30 msec with half the frame size as the shift. A speech signal for the message "she had your suit in your greasy wash water all year" and its short term ZCR are shown in Figure_2. As can be observed, for unvoiced sounds like |s| the ZCR value is significantly higher than in regions of voiced sounds like |a|, and hence ZCR can be used for distinguishing voiced and unvoiced regions.
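A short term ZCR computation can be sketched as follows. This sketch assumes the common sign-difference form, in which the sum of |sgn(s(m)) - sgn(s(m-1))| over a frame is divided by 2N; that form, the function names, the 8 kHz sampling rate and the two sine test signals are assumptions for illustration rather than details taken from the text.

```python
import math


def sign(x):
    # Treat zero as positive so every sample has a sign of +1 or -1.
    return 1 if x >= 0 else -1


def short_term_zcr(signal, frame_len, shift):
    """Zero crossing rate of each rectangular-windowed frame."""
    rates = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = signal[start:start + frame_len]
        # Each sign change between neighbouring samples contributes 2.
        total = sum(abs(sign(frame[m]) - sign(frame[m - 1]))
                    for m in range(1, frame_len))
        rates.append(total / (2.0 * frame_len))
    return rates


# Two synthetic signals at an assumed 8 kHz sampling rate: a slowly varying
# 100 Hz sine and a rapidly varying 2000 Hz sine (small phase offset so that
# no sample falls exactly on zero).
fs = 8000
low = [math.sin(2 * math.pi * 100 * n / fs + 0.1) for n in range(800)]
high = [math.sin(2 * math.pi * 2000 * n / fs + 0.1) for n in range(800)]

zcr_low = short_term_zcr(low, frame_len=160, shift=80)
zcr_high = short_term_zcr(high, frame_len=160, shift=80)
print(zcr_low[0], zcr_high[0])
```

The high frequency signal yields a much larger ZCR than the low frequency one, mirroring the difference between unvoiced and voiced regions described above.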

 

 

Short Term Autocorrelation

 

 

The crosscorrelation tool from signal processing can be used for finding the similarity between two sequences and refers to the case of having two different sequences for correlation. Autocorrelation refers to the case of having only one sequence for correlation. In autocorrelation, the interest is in observing how similar the signal characteristics are with respect to time. This is achieved by applying different time lags to the sequence and correlating each lagged version with the given sequence as reference.

 

The autocorrelation is a very useful tool in speech processing. However, due to the non-stationary nature of speech, a short term version of the autocorrelation is needed. The autocorrelation rxx(k) of a stationary sequence x(n) is given by

$$r_{xx}(k) = \sum_{m=-\infty}^{\infty} x(m)\, x(m+k)$$

The corresponding short term autocorrelation of a non-stationary sequence s(n) is defined as

$$r_{ss}(n, k) = \sum_{m=-\infty}^{\infty} s_w(m)\, s_w(m+k)$$

where sw(m) = s(m).w(n-m) is the windowed version of s(m). Thus, for a given windowed segment of speech, the short term autocorrelation is a sequence. The nature of the short term autocorrelation sequence is markedly different for voiced and unvoiced segments of speech. Hence information from the autocorrelation sequence can be used for discriminating voiced and unvoiced segments. Figure_3 and Figure_4 show segments of voiced and unvoiced speech and the corresponding autocorrelation sequences.
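The contrast between the two cases can be illustrated with a small sketch of the autocorrelation of a single windowed frame. The function name, the 8 kHz sampling rate, the 100 Hz sine standing in for voiced speech and the seeded random noise standing in for unvoiced speech are all assumptions made for this example.

```python
import math
import random


def short_term_autocorrelation(frame, max_lag):
    """r(k) = sum over m of frame[m] * frame[m + k], for k = 0..max_lag.

    Samples outside the frame are treated as zero (rectangular window).
    """
    n = len(frame)
    return [sum(frame[m] * frame[m + k] for m in range(n - k))
            for k in range(max_lag + 1)]


fs = 8000            # assumed sampling rate in Hz
frame_len = 320      # 40 msec frame

# Periodic stand-in for voiced speech: 100 Hz sine (period of 80 samples).
voiced = [math.sin(2 * math.pi * 100 * n / fs) for n in range(frame_len)]

# Noise-like stand-in for unvoiced speech (seeded for repeatability).
rng = random.Random(0)
unvoiced = [rng.uniform(-1.0, 1.0) for _ in range(frame_len)]

r_voiced = short_term_autocorrelation(voiced, max_lag=160)
r_unvoiced = short_term_autocorrelation(unvoiced, max_lag=160)

# Autocorrelation at a lag of one period, normalised by the zero-lag value.
print(r_voiced[80] / r_voiced[0], r_unvoiced[80] / r_unvoiced[0])
```

For the periodic frame the normalised value at a lag of one period stays large, whereas for the noise-like frame it is close to zero.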

 

 

 

Figure_3: Voiced segment of speech and its autocorrelation sequence

 

 

 

Figure_4: Unvoiced segment of speech and its autocorrelation sequence

 

 

The short term autocorrelation of a non-stationary signal like speech is essentially a 3D plot. The autocorrelation sequence for a fixed frame "n" is a 2D plot of rxx(k) against the lag "k", and plotting such sequences for different instants of time "n" adds the third dimension. The 3D plot of short term autocorrelation for a speech signal is shown in Figure_5.

 

Figure_5: Short term Fourier transform of speech signal

 

The frame used for computing the short term autocorrelation should include at least two cycles of the speech signal in the voiced case. To ensure this, the frame size is chosen in the range 30-50 msec. The nature of the autocorrelation sequence of voiced speech can be exploited for finding the periodicity of voiced speech. Accordingly, the autocorrelation of voiced speech should give a strong peak at the lag equal to the pitch period, and no such peak in the case of unvoiced speech. Therefore, the autocorrelation of speech has become a standard approach for estimating pitch. Figure_6 shows a speech signal and the raw pitch contour estimated from it by autocorrelation analysis.

The smooth contour positions indicate the pitch values and the random values represent unvoiced portions.
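Picking the pitch from the autocorrelation peak of one frame can be sketched as below. This is a simplified illustration, not the procedure used to produce the figure: the function names, the 50-400 Hz search range, the 8 kHz sampling rate and the 125 Hz sine used as a voiced-like frame are all assumptions.

```python
import math


def autocorrelation(frame, max_lag):
    """r(k) = sum over m of frame[m] * frame[m + k], for k = 0..max_lag."""
    n = len(frame)
    return [sum(frame[m] * frame[m + k] for m in range(n - k))
            for k in range(max_lag + 1)]


def estimate_pitch(frame, sample_rate, min_hz=50.0, max_hz=400.0):
    """Return (pitch_hz, peak_strength) from the strongest autocorrelation peak.

    peak_strength is r(best_lag) / r(0); a small value suggests that the
    frame has no clear periodicity (unvoiced or silent).
    """
    min_lag = int(sample_rate / max_hz)
    max_lag = min(int(sample_rate / min_hz), len(frame) - 1)
    r = autocorrelation(frame, max_lag)
    if r[0] <= 0.0:
        return 0.0, 0.0  # silent frame: nothing to estimate
    # Search only lags that correspond to the assumed pitch range.
    best_lag = max(range(min_lag, max_lag + 1), key=lambda k: r[k])
    return sample_rate / best_lag, r[best_lag] / r[0]


fs = 8000  # assumed sampling rate in Hz
# 40 msec voiced-like frame: 125 Hz sine, i.e. a period of 64 samples.
frame = [math.sin(2 * math.pi * 125 * n / fs) for n in range(320)]
pitch_hz, strength = estimate_pitch(frame, fs)
print(pitch_hz, strength)
```

Repeating this for every frame gives a raw pitch contour of the kind described above: consistent values over voiced frames and scattered values, with low peak strength, over unvoiced frames.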

 

 

Figure_6: A speech signal and raw pitch contour estimation from speech signal by the autocorrelation analysis

 

 

 


Copyright @ 2024 Under the NME ICT initiative of MHRD
