Speech Synthesis

 

 

 

Speech synthesis is the process of artificially producing human speech.  A device or system that performs speech synthesis is called a speech synthesizer.  A speech synthesizer can be implemented either in software (using a computer) or hardware (using specialized electronic circuits).  Two common methods used in synthesizing speech are 'concatenative synthesis' and 'formant synthesis'.

   

 

Concatenative synthesis consists of recording many words and phrases (or even syllables) in digital format and then storing these in a database.  A sentence is synthetically spoken by retrieving the recording of each of its words (or syllables) from the database and playing them back in the right sequence. This process is called 'concatenative synthesis' because the various segments of recorded speech are concatenated (i.e., connected together) to form a spoken sentence.
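The retrieve-and-concatenate idea can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the `DATABASE` dictionary, the `fake_recording` helper, and the short pause between words are hypothetical, and the "recordings" are plain tones standing in for real digitized speech.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this sketch)


def fake_recording(freq_hz, duration_ms=100):
    """Stand-in for a recorded word: a short tone instead of real audio."""
    n = SAMPLE_RATE * duration_ms // 1000
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


# Hypothetical database: each word maps to its stored samples.
DATABASE = {
    "hello": fake_recording(220.0),
    "world": fake_recording(330.0),
}


def synthesize(sentence, pause_ms=50):
    """Look up each word and join the segments in sentence order."""
    pause = [0.0] * (SAMPLE_RATE * pause_ms // 1000)
    output = []
    for word in sentence.lower().split():
        if word not in DATABASE:
            raise KeyError(f"no recording stored for {word!r}")
        if output:
            output.extend(pause)  # short silence between words
        output.extend(DATABASE[word])
    return output


samples = synthesize("Hello world")
```

A practical system would store real recordings and smooth the joins between segments; the sketch only shows the lookup and the concatenation.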

 

Concatenative synthesis is easy to do on today's computers, but human-like speech synthesis was once a major engineering challenge.  When computers were much slower and hardware and software support for digital sound processing was scarce, PC-based synthesized speech was almost unrecognizable.

  

Formant synthesis, on the other hand, generates artificial speech acoustically rather than from recorded segments of speech.  It synthesizes speech by producing acoustic waveforms whose parameters are varied over time.  The parameters varied to 'shape' a waveform include its fundamental frequency, voicing, and noise levels, among others.
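As a rough illustration, one common way to build such a waveform is to pass a simple sound source (a pulse train for voiced sounds, noise for unvoiced ones) through a chain of resonators tuned to the formant frequencies.  The sketch below assumes that source-filter arrangement; the function names, sample rate, and formant values are illustrative choices, not taken from any particular synthesizer.

```python
import math
import random

SAMPLE_RATE = 16000  # samples per second (assumed for this sketch)


def source(f0_hz, n_samples, voiced=True, seed=0):
    """Excitation: a pulse train at the fundamental frequency, or noise."""
    if not voiced:
        rng = random.Random(seed)
        return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    period = max(1, round(SAMPLE_RATE / f0_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]


def resonator(signal, freq_hz, bandwidth_hz):
    """Second-order digital resonator that emphasizes one formant."""
    t = 1.0 / SAMPLE_RATE
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(f0_hz, formants, duration_ms=200, voiced=True):
    """Filter the source through each (frequency, bandwidth) pair in turn."""
    n = SAMPLE_RATE * duration_ms // 1000
    wave = source(f0_hz, n, voiced)
    for freq_hz, bandwidth_hz in formants:
        wave = resonator(wave, freq_hz, bandwidth_hz)
    peak = max(abs(s) for s in wave) or 1.0
    return [s / peak for s in wave]  # normalize to the range [-1, 1]


# Rough, illustrative formant values for an "ah"-like vowel.
AH = [(730.0, 90.0), (1090.0, 110.0), (2440.0, 170.0)]
wave = synthesize_vowel(120.0, AH)
```

The sketch holds the parameters fixed for one short segment.  To vary them over time, a synthesizer would update the fundamental frequency, voicing, and formant values from one short frame to the next.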

  

Other methods used in synthesizing speech include:

-  'Articulatory Synthesis', which applies computational techniques to models of the human vocal tract,

-  'Hidden Markov Model-based (or HMM-based) Synthesis', wherein speech waveforms are created using HMMs, and

-  'Sinewave Synthesis', which produces speech from pure-tone whistles.
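To make the last item concrete, the sketch below sums a few pure tones whose frequencies glide over time, which is the basic idea behind sinewave synthesis.  The `tracks` format (start frequency, end frequency, amplitude) and the example values are assumptions made for illustration only.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed for this sketch)


def sinewave_speech(tracks, duration_ms=300):
    """Sum pure tones, each gliding linearly from a start to an end frequency."""
    n = SAMPLE_RATE * duration_ms // 1000
    out = [0.0] * n
    for start_hz, end_hz, amplitude in tracks:
        phase = 0.0
        for i in range(n):
            frac = i / (n - 1) if n > 1 else 0.0
            freq = start_hz + (end_hz - start_hz) * frac
            out[i] += amplitude * math.sin(phase)
            # Accumulate phase so the tone stays continuous as it glides.
            phase += 2.0 * math.pi * freq / SAMPLE_RATE
    return out


# Three hypothetical tones standing in for the first three formants.
TRACKS = [(300.0, 700.0, 1.0), (2200.0, 1200.0, 0.5), (3000.0, 2500.0, 0.25)]
tones = sinewave_speech(TRACKS)
```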

       

Figure 1.  Speech synthesis can be implemented either through software (left) or hardware (right) or by combining both.

  

It is easy to tell whether a speech synthesizer is good: the speech it produces sounds very human-like and is easily understood by a human listener.  The property that describes how closely synthesized speech matches the quality of a natural human voice is known as 'naturalness'.  The property that describes how easily synthesized speech can be understood by people is known as 'intelligibility'.

  

A speech synthesizer that can convert text into speech is known as a 'text-to-speech converter'.  Text-to-speech converters help people with speech disabilities communicate by sound.  Aside from text-to-speech conversion, speech synthesizers are also widely used in appliances, automobiles, telephone systems, and computer and video games.

    

 

    
