Speech synthesis is the process of artificially producing human speech. A device or system that performs speech synthesis is called a speech synthesizer. A speech synthesizer can be implemented either in software (using a computer) or in hardware (using specialized electronic circuits). Two common methods of synthesizing speech are 'concatenative synthesis' and 'formant synthesis'.

Concatenative synthesis consists of recording many words and phrases (or even syllables) in digital format and storing them in a database. A sentence is synthetically spoken by retrieving the recording of each of its words (or syllables) from the database and playing them back in the right sequence. The process is called 'concatenative synthesis' because the segments of recorded speech are concatenated (i.e., connected together) to form a spoken sentence.
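
The lookup-and-playback process described above can be sketched in a few lines of Python. This is a minimal illustration, assuming each recording has already been decoded into a list of audio samples at one shared sample rate; `SAMPLE_RATE`, `synthesize`, and the short silent gap inserted between words are hypothetical choices. Practical concatenative systems usually also smooth the joins between segments, which this sketch does not attempt.

```python
from typing import Dict, List

SAMPLE_RATE = 16_000  # hypothetical sample rate shared by all recordings (Hz)


def synthesize(sentence: str, database: Dict[str, List[float]],
               gap_ms: int = 50) -> List[float]:
    """Look up each word's recording and join the recordings in sentence order."""
    gap = [0.0] * (SAMPLE_RATE * gap_ms // 1000)  # silence placed between words
    output: List[float] = []
    for i, word in enumerate(sentence.lower().split()):
        if word not in database:
            raise KeyError(f"no recording for word: {word!r}")
        if i > 0:
            output.extend(gap)
        output.extend(database[word])  # concatenate this word's samples
    return output
```

For example, with a database holding one recording each for "hello" and "world", `synthesize("Hello world", database)` returns the "hello" samples, 50 ms of silence, and then the "world" samples. A missing word raises an error rather than being skipped silently.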

Concatenative synthesis is easy to do on today's computers, but there was a time when human-like speech synthesis was a major engineering challenge. When computers were much slower and supporting hardware and software for digital sound processing were scarce, PC-based synthesized speech was often almost unrecognizable.

Formant synthesis, on the other hand, creates artificial human speech acoustically. It does not use recorded segments of speech. Instead, it synthesizes speech by producing acoustic waveforms whose parameters vary over time. Parameters varied to 'shape' the waveform include the fundamental frequency, voicing, noise levels, and the formant frequencies (the resonances of the vocal tract) that give the method its name.
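
The article does not describe a specific algorithm, so the sketch below shows one common way to realize the idea: a source-filter design. Each frame mixes a voiced source (one impulse per pitch period at the fundamental frequency) with a noise source, then passes the result through a cascade of second-order resonators, one per formant, using the resonator coefficients from Klatt's well-known formant synthesizer. The `Frame` layout, `Resonator`, `formant_synthesize`, and the 10 kHz sample rate are hypothetical choices for illustration.

```python
import math
import random
from typing import List, Sequence, Tuple

SAMPLE_RATE = 10_000  # hypothetical sample rate (Hz)

# Hypothetical frame layout:
# (duration_s, f0_hz, voiced, noise_level, [(formant_hz, bandwidth_hz), ...])
Frame = Tuple[float, float, bool, float, Sequence[Tuple[float, float]]]


class Resonator:
    """Second-order digital resonator (Klatt-style) modelling one formant."""

    def __init__(self) -> None:
        self.y1 = 0.0  # output one sample ago
        self.y2 = 0.0  # output two samples ago

    def step(self, x: float, freq: float, bw: float) -> float:
        t = 1.0 / SAMPLE_RATE
        c = -math.exp(-2.0 * math.pi * bw * t)
        b = 2.0 * math.exp(-math.pi * bw * t) * math.cos(2.0 * math.pi * freq * t)
        a = 1.0 - b - c  # normalizes the gain at 0 Hz to 1
        y = a * x + b * self.y1 + c * self.y2
        self.y2, self.y1 = self.y1, y
        return y


def formant_synthesize(frames: Sequence[Frame], seed: int = 0) -> List[float]:
    """Generate samples frame by frame, holding each frame's parameters constant."""
    rng = random.Random(seed)  # seeded so the noise source is reproducible
    resonators = [Resonator() for _ in range(max(len(f[4]) for f in frames))]
    out: List[float] = []
    phase = 0.0  # progress through the current pitch period, in cycles
    for duration, f0, voiced, noise_level, formants in frames:
        for _ in range(round(duration * SAMPLE_RATE)):
            # Source: one impulse per pitch period when voiced, plus noise.
            source = 0.0
            if voiced:
                phase += f0 / SAMPLE_RATE
                if phase >= 1.0:
                    phase -= 1.0
                    source = 1.0
            source += noise_level * rng.uniform(-1.0, 1.0)
            # Filter: shape the source with the formant resonators in cascade.
            sample = source
            for res, (freq, bw) in zip(resonators, formants):
                sample = res.step(sample, freq, bw)
            out.append(sample)
    return out
```

For example, passing a 0.2 s voiced frame at 120 Hz with formants `[(730.0, 90.0), (1090.0, 110.0), (2440.0, 170.0)]`, followed by a 0.1 s unvoiced frame with `noise_level=0.3`, yields a vowel-like sound followed by a hiss; these formant values are only illustrative. Changing the parameters from one frame to the next is what 'shapes' the output over time.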

Other methods of synthesizing speech include:
- 'Articulatory synthesis', which applies computational techniques to models of the human vocal tract;
- 'Hidden Markov model-based (HMM-based) synthesis', in which hidden Markov models generate the acoustic parameters (such as spectrum and fundamental frequency) from which the speech waveform is produced; and
- 'Sinewave synthesis', which produces speech from pure-tone whistles.
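
Sinewave synthesis is commonly described as replacing the first few formants of an utterance with pure tones that follow those formants' frequency and amplitude over time. The sketch below assumes the tracks have already been extracted, with one frequency (Hz) and one amplitude value per fixed-length frame; `sinewave_synthesize`, `frame_len`, and the sample rate are hypothetical.

```python
import math
from typing import List, Sequence

SAMPLE_RATE = 16_000  # hypothetical sample rate (Hz)


def sinewave_synthesize(freq_tracks: Sequence[Sequence[float]],
                        amp_tracks: Sequence[Sequence[float]],
                        frame_len: int = 160) -> List[float]:
    """Sum one pure tone per track; each track gives a frequency and amplitude per frame."""
    n_frames = len(freq_tracks[0])
    out = [0.0] * (n_frames * frame_len)
    for freqs, amps in zip(freq_tracks, amp_tracks):
        phase = 0.0  # running phase keeps the tone continuous across frames
        for k in range(n_frames):
            for i in range(frame_len):
                phase += 2 * math.pi * freqs[k] / SAMPLE_RATE
                out[k * frame_len + i] += amps[k] * math.sin(phase)
    return out
```

Accumulating the phase sample by sample, rather than computing `sin(2 * pi * f * t)` directly, keeps each whistle continuous when its frequency changes at a frame boundary, which avoids audible clicks.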

Figure 1. Speech synthesis can be implemented in software (left), in hardware (right), or by combining both.

It is easy to tell whether a speech synthesizer is good: the speech it produces sounds very human-like and is easily understood by a listener. The closeness of synthesized speech to the quality of a natural human voice is known as 'naturalness'. How easily people can understand synthesized speech is known as 'intelligibility'.

A speech synthesizer that can convert text into speech is known as a 'text-to-speech converter'. Text-to-speech converters help people with speech disabilities communicate by sound. Aside from text-to-speech conversion, speech synthesizers are also widely used in appliances, automobiles, telephone systems, and computer and video games.