
IT694 (Seminar):
Very low bit rate Speech Coding

Chetan Vaity (02329901)
Under the guidance of Dr. Preeti Rao

18 July, 2003


The problem of efficient speech coding has seen considerable activity in the past decades. The approach based on a speech production model produced methods such as Linear Predictive Coding (LPC), which form the basis of many standards today. Another approach, based on comparing and concatenating waveform segments, has produced very low bit rates with good intelligibility. Improvements such as better distance measures, HMMs and pitch coding further enhance the quality of these coders.

1. Introduction

Speech Coding is the process of coding speech signals for efficient transmission. The problem of reducing the bit rate of a speech representation, while preserving the quality of speech reconstructed from such a representation has received continuous attention in the past five decades.

Speech coded at 64 kilobits per second (kbits/s) using logarithmic PCM is considered ``non-compressed'' and is often used as a reference for comparisons. The term medium-rate is used for coding in the range of 8-16 kbits/s, low-rate for systems working below 8 kbits/s and down to 2.4 kbits/s, and very-low-rate for coders operating below 2.4 kbits/s.

1.1 Applications

Recent advances in computer technology allow a wide variety of applications for speech coding. Transmission can either be real time as in normal telephone conversations, or off-line, as in storing speech for electronic mail forwarding of voice messages. In either case, the transmission bit rate (or storage requirements) is crucial to evaluate the practicality of different coding schemes.

Low bit rate concatenative coders can be very useful when large amounts of pre-recorded speech must be stored. A talking book, the spoken equivalent of its printed version, requires huge space for storing speech waveforms unless a high-compression coding scheme is applied. Similarly, a wide variety of multimedia applications, such as language learning assistance, electronic dictionaries and encyclopedias, are potential applications of very low bit rate speech coders.

Interest in exchanging voice messages across the Internet is increasing. To save on bandwidth, such coders could have a large role to play. For scenarios where two persons (or a small set of persons) are frequently exchanging voice messages, concatenative synthesis could be employed.

1.2 Traditional speech coding

Natural speech waveforms are continuous in time and amplitude. Periodically sampling an analog waveform at the Nyquist rate (twice the highest frequency) converts it to a discrete-time signal. The signal amplitude at each time instant is quantized to one of a set of $L$ amplitude values (where $B = \log_2 L$ is the number of bits used to digitally code each value). Digital communication of an analog amplitude $x$ consists of: A/D conversion, transmission of binary information over a digital channel, and D/A conversion to reconstruct the analog $x$ value. If the channel is noise-free, the output value differs from the input by an amount known as the quantization noise. The bit rate for a signal is the product of its sampling rate $F_s$ (in samples/second) and the number of bits $B$ used to code each sample.
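These relations can be illustrated numerically. The 8 kHz / 8-bit values below reproduce the 64 kbits/s logarithmic-PCM reference rate mentioned earlier, though the quantizer sketched here is uniform rather than logarithmic (a simplification for the sketch):

```python
import numpy as np

# Illustrative numbers: 8 kHz sampling and B = 8 bits/sample give the
# 64 kbit/s reference rate. The quantizer here is uniform; logarithmic
# PCM would use a companded step size instead.
F_s = 8000                    # sampling rate, samples/second
B = 8                         # bits per sample
L = 2 ** B                    # number of quantization levels
bit_rate = F_s * B            # 64000 bits/second

# Quantize one second of a full-scale 440 Hz tone and measure the noise
x = np.sin(2 * np.pi * 440 * np.arange(F_s) / F_s)
step = 2.0 / L                # amplitude range [-1, 1] split into L levels
x_q = np.clip(np.round(x / step) * step, -1.0, 1.0)
noise = x - x_q               # quantization noise
snr_db = 10 * np.log10(np.mean(x ** 2) / np.mean(noise ** 2))
# snr_db lands near the classic ~6 dB/bit rule (about 50 dB for 8 bits)
```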

The process of extraction of properties or features from a speech signal which are important for communication is called speech analysis. This involves a transformation of the speech signal into another signal, a set of signals, or a set of features, with the objective of simplification and data reduction. The standard model of speech production (a source exciting vocal tract filter) is implicit in many analysis methods, including LPC.

Most of the methods operate in the frequency domain as it offers the most useful parameters for speech processing. Human hearing appears to pay much more attention to spectral aspects of speech (e.g., amplitude distribution in frequency) than to phase or timing aspects.

1.3 Overview

In this report, the traditional low bit rate coders and techniques used to optimize them are discussed in Chapter 2. The waveform concatenative synthesis approach is discussed in the Chapter 3. Chapter 4 includes a description of an attempt made to implement simple concatenative speech compression.

2. Traditional approach

2.1 LPC based coding

LPC is one of the most common techniques for low-bit-rate speech coding. The popularity of LPC derives from its compact yet precise representation of the speech spectral magnitude, as well as its relatively simple computation. LPC analysis produces a set, or vector, of real-valued features which represents an estimate of the spectrum of the windowed speech signal.

The LPC vector for a signal frame typically consists of about 8-12 spectral coefficients with 5-6 bits/coefficient. The gain level and pitch are coded with 2-4 bits each. In addition, the binary voiced/unvoiced decision is transmitted. Thus, a 2400 bits/second vocoder might send 60 bits/frame every 25 ms.

2.2 Vector Quantization

Most speech coders transmit time or frequency samples as independent (scalar) parameters, but coding efficiency can be enhanced by eliminating redundant information within blocks of parameters and transmitting a single index code to represent the entire block. This is Vector Quantization (VQ).

During the coding phase, basic analysis yields $k$ scalar parameters (features) for each frame, forming a vector $\overline{v}$. Then a particular $k$-dimensional vector is chosen from the set of $M$ vectors stored in a codebook, namely the one that corresponds most closely to $\overline{v}$. A $\log_2 M$ bit code (the index of the chosen vector in the codebook) is sent in place of the $k$ scalar parameters. The system's decoder must include a codebook identical to that of the coder. To synthesize the output speech, the decoder uses the parameters stored in the codebook under the index corresponding to the received code.
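This coder/decoder exchange can be sketched as follows. The codebook here is random and tiny, purely for illustration; a real codebook would be trained on speech feature vectors:

```python
import numpy as np

# Toy VQ coder/decoder: k = 4 features per frame, M = 8 codebook
# entries, so only log2(M) = 3 bits are sent per frame.
rng = np.random.default_rng(0)
k, M = 4, 8
codebook = rng.standard_normal((M, k))   # identical at coder and decoder

def vq_encode(v):
    """Index of the codebook vector closest to the feature vector v."""
    return int(np.argmin(np.sum((codebook - v) ** 2, axis=1)))

def vq_decode(index):
    """Decoder looks up the parameters under the received index."""
    return codebook[index]

v = rng.standard_normal(k)               # analysis output for one frame
idx = vq_encode(v)                       # 3-bit code sent over the channel
v_hat = vq_decode(idx)                   # parameters used for synthesis
```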

The key issues in VQ are the design and search of the codebook. In coders with scalar quantization, coding distortion comes from the finite precision used to represent each parameter. VQ distortion comes instead from using synthesis parameters from a codebook entry, which differ from the parameters determined by the speech analyzer. The size of the codebook, $M$, should be large enough that each possible input vector corresponds to a codebook entry whose substitution for $\overline{v}$ yields output speech close to the original. However, efficient search procedures and storage considerations limit $M$ to smaller values.

The greater the degree of correlation among the vector elements, the more the bit rate can be lowered. LPC typically sends 50 bits/frame (10 coefficients, 5 bits each) with scalar quantization, but VQ succeeds with about 10 bits. A well-chosen set of 1024 spectra ($2^{10}$ for 10 bit VQ) can adequately represent most possible speech sounds.

2.3 Speech Coding based upon Vector Quantization

One of the first experimental comparisons between optimized scalar quantization and vector quantization is presented in [2]. In this work, the gain parameter is treated separately from the rest of the information. The signal is segmented into frames of $N$ samples, for each of which one gain parameter is sent along with the $N$-sample vector. This approach is called gain separation. The gain and spectral codebooks are separate, and each entry is decoded as a scalar gain times a waveform vector. Since this method allows separate codebook searches, only $2^L + 2^M$ entries must be examined instead of $2^L 2^M$ in the original method, with $2^M$ spectral codebook entries and $2^L$ gain possibilities. Furthermore, the gain codebook search is much simpler since it involves scalars rather than the $k$-dimensional vectors of the spectral codebook.

Another sub-optimal technique, binary tree search, is used for searching the codebook. In a full codebook search, the vector for each frame is compared with each of the $M$ codebook entries, thus requiring $M$ distance calculations. A binary tree search replaces $M$ comparisons with only $2\log_2 M$ comparisons. The $M$ codebook entries form the lowest nodes of the tree, with each higher node being represented by the centroid of all entries below it in the tree. Note that this approach doubles the memory requirement for the codebook.
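The descent through such a tree can be sketched as below. Building a good tree requires clustering the codebook; here the entry list is simply halved at each level, a placeholder for that design step, so the result is only an approximate nearest entry:

```python
import numpy as np

# Structure-only sketch of binary tree codebook search: each internal
# node stores the centroids of the entries below each child, so the
# search makes 2 comparisons per level (2 log2 M total) instead of M.
rng = np.random.default_rng(1)
M, k = 16, 4
codebook = rng.standard_normal((M, k))

def build_tree(entries):
    """Recursively halve the entry list; real designs would cluster."""
    if len(entries) == 1:
        return {"leaf": True, "entry": entries[0]}
    mid = len(entries) // 2
    return {
        "leaf": False,
        "centroid_left": np.mean(entries[:mid], axis=0),
        "centroid_right": np.mean(entries[mid:], axis=0),
        "children": (build_tree(entries[:mid]), build_tree(entries[mid:])),
    }

def tree_search(node, v):
    """Descend toward the child whose centroid is nearer to v."""
    while not node["leaf"]:
        d_left = np.sum((v - node["centroid_left"]) ** 2)
        d_right = np.sum((v - node["centroid_right"]) ** 2)
        node = node["children"][0 if d_left <= d_right else 1]
    return node["entry"]

tree = build_tree(list(codebook))
nearest = tree_search(tree, rng.standard_normal(k))  # approximate match
```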

The benefits of VQ are made quite apparent in the results of this work [2]. For a 10-bits/frame full search vector quantizer, the measured distortion is approximately 1.8 dB. The equivalent distortion for scalar quantizer occurs at approximately 37 bits/frame resulting in a difference of 27 bits/frame or a 73 percent reduction in bit-rate for the same equivalent distortion. With binary tree search, the distortion was slightly greater and the improvement obtained was about 66 percent.

In [3], VQ is applied to modify a 2400 bits/s LPC vocoder to operate at 800 bits/s while retaining acceptable intelligibility and naturalness. No change is made to the LPC design other than the quantization algorithms for the LPC parameters. One of the modifications here is the separation of pitch and voicing, in addition to gain. The quantization of pitch and gain is scalar, with one value per 3 frames. Voiced and unvoiced speech spectra are in most cases very different, hence separate codebooks are employed for the two. The tree search has also been modified to reduce the distortion of the binary tree approach: a 32 branches/node tree search was found to be a good compromise, requiring only 1/16 the computation of the full search procedure while achieving an average distortion very close to that of the full search codebook.

Some techniques for reducing the bit rate are discussed in [4].

2.4 Achieving very low rates

A new method where input speech is modeled as a sequence of variable-length segments is introduced in [5] and further optimized in [6]. An automatic segmentation algorithm is used to obtain segments with an average duration comparable to that of a phoneme. Such a segment is quantized as a single block. For segmentation, speech is considered as a succession of steady states separated by transitions. Spectral time-derivatives are thresholded to determine the middles of transitions. The threshold is chosen such that approximately 11 segments/s are obtained (equal to the expected phoneme rate). The distance between two segments is calculated using an ``equi-spaced'' sampled representation of the segment (spatial sampling in 14 LPC spectral dimensions). The Euclidean distances between corresponding equi-spaced points on the two segments are summed to arrive at a distance value.

Each segment, in this approach, is a 140-dimensional vector (14 spectral values $\times$ 10 spatial samples). Usually, a clustering algorithm is used to obtain an optimal set of segment templates for the codebook. However, at the large dimensionality of the segment vocoder, the expected quantization error of a properly chosen random quantizer is nearly equal to the distortion-rate bound. Therefore, a computationally intensive clustering algorithm was not used; a random set of segments was taken as the codebook. Approximately 8000 segment templates (13 bits) were used for coding, and a further 8 bits were used for gain, pitch and timing information. The resulting bit rate was 231 bits/s at 11 segments/s.

In [6], a further decrease in bit rate was achieved by using a segment network that restricts the number of segment templates that can follow a given template: if the current input segment is quantized to a given template, then only those segment templates that follow this template in the network can be used to quantize the following input segment. The network is constructed as follows. The last spectrum of the template chosen for the current input segment determines the subset of templates allowed for quantizing the following input segment, namely those whose first spectrum is nearest (in Euclidean distance) to that last spectrum.
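The successor restriction can be sketched as below. The spectra here are random stand-ins (in the real coder they would be LPC spectra of the first and last frames of each template); the sizes match the comparison reported next, 1024 templates with 256 allowed successors:

```python
import numpy as np

# Sketch of the segment-network restriction of [6]: after quantizing the
# current input segment to a template, only the templates whose first
# spectrum is nearest to that template's last spectrum may quantize the
# next input segment (256 candidates -> 8 bits instead of 10).
rng = np.random.default_rng(2)
n_templates, dim, n_allowed = 1024, 14, 256

first_spectra = rng.standard_normal((n_templates, dim))  # first frame of each template
last_spectra = rng.standard_normal((n_templates, dim))   # last frame of each template

def allowed_successors(template_idx):
    """Indices of the n_allowed templates whose first spectrum is
    nearest (Euclidean) to the last spectrum of the given template."""
    d = np.sum((first_spectra - last_spectra[template_idx]) ** 2, axis=1)
    return np.argsort(d)[:n_allowed]

succ = allowed_successors(0)   # candidate set for the next input segment
```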

Comparison was done with an unconstrained case, using a total of 1024 segment templates (10 bits). With the segment network, the choice of segment templates is restricted to 256 and thus 8 bits/segment are needed to code the segment. Almost no difference in quantization error was found in the two approaches.

2.5 Phonetic vocoder

Coding speech at very low data rates of about 100 bits/s means approaching the true information rate of the communication (an inventory of about 60 phones occurring at a rate of 15-20 phones/second). One approach is to extract detailed linguistic knowledge of the information embedded in the speech signal. In [7], a coding technique based on phonetic speech recognition is explored. The motivation for the work stems from the observation that Hidden Markov Model (HMM) based speech recognition systems (working on LPC features) tend to do an excellent job of modeling the acoustic data. A basic premise of the paper is that the quality of the acoustic match will be good even if the phone recognition accuracy is poor. In other words, a phone may be recognised as another phone by the HMM system, but this will not result in significant deterioration of the speech quality, as the two phones will be very close acoustically.

A large acoustic-phonetic database is used which has been phonetically transcribed and labeled with a set of 60 phones. A basic phone recognition system is implemented with a null grammar (any phone can follow any phone). A contextually rich set of exemplars of each of the 60 phones is clustered. The major way in which the phonetic vocoder distinguishes itself from a vector quantization system is the manner in which spectral information is transmitted. Rather than transmitting indices into a VQ codebook representing each spectral vector, a phone index is transmitted along with auxiliary information describing the path through the model. Good overall synthesized speech quality was achieved with 8 clusters per phone (i.e. 480 clusters). Thus, a simple inventory of phones is shown to be sufficient for capturing the bulk of the acoustic information.

3. Concatenative synthesis of waveforms

In the previous chapter, it was discussed how speech coding at medium-rates and below is achieved using an analysis-synthesis process. In the analysis stage, speech is represented by a compact set of parameters which are encoded efficiently. In the synthesis stage, these parameters are decoded and used in conjunction with a reconstruction mechanism to form speech. In this chapter, a new method is discussed in which the original waveform corresponding to the nearest template segment (codebook entry) is used for synthesis. The primary difference from conventional coders is that no speech generation framework like the source-filter model is used. Instead, it is assumed that any speech signal can be reconstructed by concatenating short segment waveforms that are suitably chosen from a database. Such an approach is reported to give speech with a better perceptive quality than the LPC synthesized speech using pulse/noise excitation.

3.1 Waveform segment vocoder

The first foray into waveform based synthesis was made in [8]. Here, the decoding stage works with the waveforms of the nearest templates and not their spectral representation. The pitch, energy and duration of each template are independently modified to match those of the input segment. These modified segments are then concatenated to produce the output waveform. The paper describes the time-scale and pitch-scale modification algorithms applied to the template waveforms to match the time and pitch characteristics of the input segment. Several sentences from a single male speaker are vocoded using his own templates (a speaker-dependent paradigm). The resulting speech has a choppy quality, presumably due to segment boundary discontinuities.

The authors view this work as a modification of the original LPC vocoder developed by them [5]. Hence, phoneme-like segments were the basic elements in their waveform approach. These segments were then subjected to considerable modification, which may in fact reduce the naturalness of the waveform.

3.2 Contemporary techniques

The waveform concatenation approach has been investigated in some detail in [9]. Here, frames are used instead of segments as the units for selection. One advantage is that since frame selection does not require a time-warping process, the speech signal is synthesized without time scale modification (as was needed in [8]). On the downside, the bit rate of the frame-based approach is higher, because the longer units of segmental coders are what give them their high compression ratios. Mel-frequency cepstrum coefficients (MFCCs) are used as the feature parameters for unit selection. The database comprises about 76 minutes of speech, corresponding to about 460,000 frames of 10 ms duration each. There are two databases, as shown in Fig. 3.1. The first contains the MFCCs of the roughly 460,000 frames obtained by the feature extraction process. The second contains the speech waveforms used while generating the output waveform. The raw speech from which the MFCCs of the first database are computed is the same as that in the second database. In addition to the unit indices, pitch ($F_0$) and gain parameters are also transmitted. The unit index represents the position where the selected unit is located in the database.

Figure 3.1: Block diagram of the coder in [9]

3.2.1 Unit selection

For unit selection, a novel approach is proposed in [10]. Units in the synthesis database are considered as a state transition network in which the state occupancy cost is the distance between a database unit and a target, and the transition cost is an estimate of the quality of concatenation of two consecutive units. This framework has many similarities to HMM-based speech recognition. A pruned Viterbi search is used to select the best units for synthesis from the database.

Unit selection is based on two cost functions. The target cost, $C^t(u_i, t_i)$, is an estimate of the difference between a database unit, $u_i$, and the target (input frame), $t_i$, which it is supposed to represent. The concatenation cost, $C^c(u_{i-1}, u_i)$, is an estimate of the quality of a join between units ($u_{i-1}$ and $u_i$). The two costs are aimed towards minimizing distortion within an individual frame and preserving natural coarticulation.

Given the target specification, the sequence $t_{1}^n = (t_1, \ldots, t_n)$, the problem is to select the set of units, $u_{1}^n = (u_1, \ldots, u_n)$, which are closest to the target. In the state transition network terminology, the state occupancy cost is given by the target cost, and the state transition cost is given by the concatenation cost. Because any unit can potentially be followed by any other, the network is fully connected. It is worth pointing out that such a treatment of a synthesis database has many similarities to HMM-based speech recognition systems [11]. The important difference is that Markov models are probabilistic, whereas the state transition network uses cost functions.

Cost Functions

Each frame in the input speech and the synthesis database are characterized by an MFCC feature vector of size $p$. The target cost is calculated as the weighted sum of the differences between the elements of the target and candidate feature vectors.
\begin{displaymath}
C^t(t_i, u_i) = \sum_{j=1}^{p} w_{j}^{t} C_{j}^{t}(t_i, u_i)
\end{displaymath} (3.1)

$w_{j}^t$ is the weight for each feature and $C_{j}^t(t_i, u_i)$ is the Euclidean distance between the corresponding features.

The concatenation cost, $C^c(u_{i-1}, u_i)$, is also determined by the weighted sum of the $p$ Euclidean distances between the MFCC features.

\begin{displaymath}
C^c(u_{i-1}, u_i) = \sum_{j=1}^{p} w_{j}^{c} C_{j}^{c}(u_{i-1}, u_i)
\end{displaymath} (3.2)

As a special case, if $u_{i-1}$ and $u_i$ are consecutive units in the synthesis database, then their concatenation is natural and therefore has a cost of zero. This condition encourages the selection of runs of consecutive units from the synthesis database.

The total cost for a sequence of $n$ units (i.e. a path through the state transition network) is the sum of the target and concatenation costs:

\begin{displaymath}
C(t_{1}^{n}, u_{1}^{n}) = \sum_{i=1}^{n} C^t(t_i, u_i) + \sum_{i=2}^{n} C^c(u_{i-1}, u_i)
\end{displaymath} (3.3)

The unit selection procedure is the task of determining the set of units $\overline{u}_{1}^n$ so that the total cost defined by (3.3) is minimized:
\begin{displaymath}
\overline{u}_{1}^{n} = \mathop{\arg\min}_{u_1, \ldots, u_n} C(t_{1}^{n}, u_{1}^{n})
\end{displaymath} (3.4)
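The minimization in (3.4) admits a dynamic-programming solution, sketched below on a toy problem. The costs are random stand-ins for the weighted MFCC distances of (3.1) and (3.2), with consecutive database units given the zero concatenation cost described above:

```python
import numpy as np

# Viterbi-style minimization of the total cost (3.3) over unit sequences.
rng = np.random.default_rng(3)
n_targets, n_units = 5, 20

target_cost = rng.random((n_targets, n_units))    # stand-in for C^t(t_i, u)
concat_cost = rng.random((n_units, n_units))      # stand-in for C^c(u', u)
for u in range(n_units - 1):
    concat_cost[u, u + 1] = 0.0                   # consecutive units join freely

# best[i, u]: minimal cost of any unit sequence for t_1..t_{i+1} ending in u
best = np.zeros((n_targets, n_units))
back = np.zeros((n_targets, n_units), dtype=int)
best[0] = target_cost[0]
for i in range(1, n_targets):
    trans = best[i - 1][:, None] + concat_cost    # every predecessor choice
    back[i] = np.argmin(trans, axis=0)            # best predecessor per unit
    best[i] = trans[back[i], np.arange(n_units)] + target_cost[i]

# Backtrack the minimizing unit sequence
path = [int(np.argmin(best[-1]))]
for i in range(n_targets - 1, 0, -1):
    path.append(int(back[i, path[-1]]))
path.reverse()
```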

This minimization can be performed by Viterbi search [11]. Another criterion is to find the best sequence in the sense of maximizing the number of consecutive frames as well as minimizing the cost function.

Pruning for faster Viterbi search

In the above equations, the number of candidate units is very large (about 460,000 in this database). To reduce computation time in the Viterbi algorithm, the search space is pruned. The pruning strategy is described below.

In the Viterbi search, the number of possible units is limited by choosing only those units whose spectral envelope is relatively close to that of the input frame. This pruning is done by using vector quantization for clustering, thus partitioning the entire database space. If the input frame is quantized to a specific code vector, the database units quantized to the same code vector are selected as the candidate frames. Good results are obtained with a codebook size of 10 or 11 bits.
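This cell-based pruning can be sketched as below. The sizes are scaled-down stand-ins (the real database has about 460,000 units and a 10-11 bit codebook), and both units and codebook are random rather than trained:

```python
import numpy as np
from collections import defaultdict

# Sketch of VQ-based pruning: partition the database offline by code
# vector, then keep only the units sharing the input frame's cell.
rng = np.random.default_rng(4)
n_units, dim, n_codes = 2000, 12, 256

units = rng.standard_normal((n_units, dim))      # database feature vectors
codebook = rng.standard_normal((n_codes, dim))   # clustering codebook

def quantize(v):
    """Index of the code vector nearest to v (Euclidean)."""
    return int(np.argmin(np.sum((codebook - v) ** 2, axis=1)))

# Offline: bucket every database unit by its code vector
cells = defaultdict(list)
for idx, u in enumerate(units):
    cells[quantize(u)].append(idx)

# Online: only units in the input frame's cell survive as candidates
frame = rng.standard_normal(dim)
candidates = cells[quantize(frame)]
```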

A phonetically labeled speech database is used, with which it can be predicted whether a given path is possible in some context. For example, assuming that the database is labeled using half-phones, the frame following a frame labeled ``aa1'' (aa1 is the first half of the phoneme aa) must be labeled ``aa1'' or ``aa2'' (aa2 is the last half of phoneme aa). All other combinations, like aa1-ae1 or aa1-k2, must be removed. In this way, the number of paths in the Viterbi algorithm can be reduced; about 50% of the total number of paths are pruned by this approach. The sound quality after this pruning remained almost the same.

3.2.2 Coding the selected unit sequence

Since the concatenation cost in (3.2) is set to zero when two frames are consecutive in the database, the resulting unit sequence contains many sequential frames. To take advantage of this property, a run-length coding technique is employed to compress the unit sequence: a series of consecutive frames is represented by the start frame and the number of following consecutive frames. Thereby, a run of consecutive frames is encoded into only two values.
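This run-length scheme can be sketched directly; the index values below are made up for illustration:

```python
# Run-length coding of the selected unit-index sequence: each run of
# consecutive database frames becomes one (start_frame, run_length) pair.

def rle_encode(indices):
    """Collapse runs of consecutive frame indices into (start, length)."""
    runs = []
    start, length = indices[0], 1
    for prev, cur in zip(indices, indices[1:]):
        if cur == prev + 1:
            length += 1
        else:
            runs.append((start, length))
            start, length = cur, 1
    runs.append((start, length))
    return runs

def rle_decode(runs):
    """Expand (start, length) pairs back into the unit-index sequence."""
    return [start + i for start, length in runs for i in range(length)]

seq = [100, 101, 102, 103, 57, 58, 900]   # toy selected-unit sequence
runs = rle_encode(seq)                    # [(100, 4), (57, 2), (900, 1)]
assert rle_decode(runs) == seq            # lossless
```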

3.2.3 Coding pitch

Accurate coding of the pitch ($F_0$) contour plays an important role in a very low rate coder, since the correct pitch contour increases naturalness. Piecewise linear approximation is used to implement contour-wise $F_0$ coding. This method offers high compression, as only a small number of sampled points need to be transmitted instead of all individual samples. Of course, the intervals between the sampled points must also be transmitted for proper interpolation. Piecewise linear approximation presumes some degree of smoothness in the function being approximated, so the $F_0$ contour is smoothed before compression. The methods used for finding the locations of the $F_0$ sample points are discussed in [9].
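A minimal sketch of the idea follows. The contour is a toy smoothed $F_0$ track, and the breakpoints are simply taken every 10 frames plus the endpoint; the paper's actual breakpoint-selection method is not reproduced here:

```python
import numpy as np

# Piecewise-linear F0 coding sketch: only the (location, value) pairs
# at the breakpoints would be transmitted; the decoder reconstructs the
# full contour by linear interpolation.
frames = np.arange(100)                        # frame indices
f0 = 120.0 + 20.0 * np.sin(frames / 15.0)      # toy smoothed F0 contour (Hz)

knots = np.append(np.arange(0, 100, 10), 99)   # transmitted sample locations
values = f0[knots]                             # transmitted F0 values

f0_hat = np.interp(frames, knots, values)      # decoder-side reconstruction
max_err = np.max(np.abs(f0 - f0_hat))          # small for a smooth contour
```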

4. My comments

4.1 Phonetic labeling?

Some of the works in this field use phonetically labeled speech signal databases, which are usually developed as part of text-to-speech (TTS) systems. Generation of such databases involves a large amount of human effort. It is felt that such databases should not be used in the speech coding arena, as they inherently bind the system to a particular voice (or set of voices). Further, phonetic breakdown and manual labeling of speech is not feasible for generic speech coding. In fact, speech generated by TTS systems usually has a dry, machine-like, unnatural sound. It is believed that very natural sounding speech can be produced using concatenative synthesis, given a phonetically and prosodically balanced input sample for creating the database.

4.2 An application

The audio news downloads which appear on news websites are a possible application where concatenative speech synthesis can be tried out. These news items are usually read out by a single speaker (or a small set of speakers) in a controlled studio environment. The database required for coding and synthesis can be prepared and made available to users as a one-time download. The news items themselves can then be just the coded files containing the database indices, pitch contour coding, etc., which will be a much smaller download than the compressed audio files currently used.

Indeed, wherever such speaker-dependent environments exist, concatenative speech synthesis can be employed for obtaining natural sounding speech at very low bit rates (high compression ratios).

4.3 Project

The work done during a course project in April 2003 [12] is described in brief here.

4.3.1 Profile generation

In this work, profile is the term used for the speech database used for concatenative synthesis.

A recorded lecture was obtained from the DEP (Distance Education Programme, IIT Bombay) archives. All experiments have been conducted using this 2 hour sample (sampling rate: 22050 Hz, single channel).

A 15 minute sample was extracted for profile generation. This file was divided into 45000 files of 20 ms duration each, using the sox utility. 10 MFCC features were computed for each of these sound-slices, using a tool called cepstral.

It was assumed that 10000 different elementary sounds would be enough to characterize the range of sounds produced by a person; this number was arrived at empirically. The 45000 sound samples were then clustered into 10000 clusters based on their MFCC features, using a bisecting variant of the k-means algorithm (the vcluster tool of the CLUTO clustering suite). For each of these clusters, a representative sample was chosen randomly.

These 10000 representative sound samples were then assigned unique codes (the cluster numbers were used as the codes). This collection of representative sounds and their codes is the profile, using which other sound samples can be encoded.

4.3.2 Encoding

A new 10 second sample was taken from the lecture (disjoint from the earlier sample, of course) and divided into 20 ms slices. MFCC features were extracted from the 500 sound-slices created this way. For each of these feature vectors, the closest match among the 10000 feature vectors of the profile's representative samples was found by minimizing the Euclidean distance in the 10-dimensional feature space. Thus, for each of the 500 sound-slices, a representative sound from the profile was identified. The encoded file consists of the sequence of codes of these representative sound samples.

4.3.3 Decoding

The decoding is done using the encoded file and the profile (i.e. the 10000 representative sound-slices). The resultant audio is created by successively concatenating the representative sound samples indicated in the encoded file. No smoothing was done due to time constraints; it is believed that smoothing would improve the quality of the resulting decoded sample.
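The whole encode/decode path of the project can be sketched end to end. The profile data below is random noise standing in for the real profile (which holds 10000 slices of 441 samples, i.e. 20 ms at 22050 Hz):

```python
import numpy as np

# End-to-end sketch: 'profile_features' stand in for the MFCC vectors of
# the representative slices, 'profile_slices' for their 20 ms waveforms.
rng = np.random.default_rng(5)
n_profile, n_mfcc, slice_len = 1000, 10, 441

profile_features = rng.standard_normal((n_profile, n_mfcc))
profile_slices = rng.standard_normal((n_profile, slice_len))

def encode(slice_features):
    """Map each input slice's MFCC vector to its nearest profile code."""
    codes = []
    for v in slice_features:
        d = np.sum((profile_features - v) ** 2, axis=1)
        codes.append(int(np.argmin(d)))
    return codes

def decode(codes):
    """Concatenate the representative slices named by the codes."""
    return np.concatenate([profile_slices[c] for c in codes])

codes = encode(rng.standard_normal((500, n_mfcc)))   # 500 slices = 10 s
audio = decode(codes)                                # decoded waveform
```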

Figure 4.1: Methodology followed in the project


References

[1] Douglas O'Shaughnessy.
Speech Communications - Human and Machine.
Universities Press, 2001.

[2] A. Buzo, A. H. Gray, Jr., R. M. Gray, and J. D. Markel.
Speech coding based upon vector quantization.
IEEE International Conference on Acoustics, Speech and Signal Processing, 1980.

[3] D. Y. Wong, B. H. Juang, and A. H. Gray, Jr.
Recent developments in vector quantization for speech processing.
IEEE International Conference on Acoustics, Speech and Signal Processing, 1981.

[4] Richard M. Schwartz and Salim E. Roucos.
A comparison of methods for 300-400 b/s vocoders.
IEEE International Conference on Acoustics, Speech and Signal Processing, 1983.

[5] S. Roucos, R. Schwartz, and J. Makhoul.
Segment quantization for very-low-rate speech coding.
IEEE International Conference on Acoustics, Speech and Signal Processing, 1982.

[6] S. Roucos, R. M. Schwartz, and J. Makhoul.
A segment vocoder at 150 b/s.
IEEE International Conference on Acoustics, Speech and Signal Processing, 1983.

[7] Joseph Picone and George R. Doddington.
A phonetic vocoder.
IEEE International Conference on Acoustics, Speech and Signal Processing, 1989.

[8] Salim Roucos and Alexander M. Wilgus.
The waveform segment vocoder: A new approach for very-low-rate speech coding.
IEEE International Conference on Acoustics, Speech and Signal Processing, 1985.

[9] Ki-Seung Lee and Richard V. Cox.
A very low bit rate speech coder based on a recognition/synthesis paradigm.
IEEE Transactions on Speech and Audio Processing, 2001.

[10] Andrew J. Hunt and Alan W. Black.
Unit selection in a concatenative speech synthesis system using a large speech database.
IEEE International Conference on Acoustics, Speech and Signal Processing, 1996.

[11] Lawrence R. Rabiner.
A tutorial on hidden Markov models and selected applications in speech recognition.
Proceedings of the IEEE, Volume 77, No. 2, 1989.

[12] Chetan Vaity, Shweta Agrawal, and Amreek Singh.
Profile based speech coding using clustering on elementary sounds.
Technical report, IT608 Course project, KReSIT, IIT Bombay ($\sim$chetanv/personal/speechcoding/it608_project/sc_report.pdf), 2003.

Chetan Vaity 2003-07-21