How does a computer convert speech into data that it can then manipulate or execute? When we speak, a microphone converts the analogue signal of our voice into digital samples that the computer must analyze. From this data the computer must extract enough information to confidently guess the word being spoken. This is not an easy task for a computer. In fact, in the early 1990s, the best recognizers yielded a 15% word error rate on a relatively small 20,000-word dictation task. Today that error rate has dropped to as low as 1-2%, although it can vary greatly between speakers. So how is this done?
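The error rates quoted above are word error rates (WER): the number of word substitutions, insertions and deletions needed to turn the recognizer's output into a reference transcript, divided by the length of the reference. A minimal sketch of that computation, using the standard dynamic-programming edit distance (the sentences are made up for illustration):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance over words, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits to turn the first i reference words into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") across six reference words: about 17% WER.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```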
Phonemes are best described as linguistic units. They are the sounds that group together to form our words, although exactly how a phoneme is realized as sound depends on many factors, including the surrounding phonemes and the speaker's accent and age.
Here are a few examples: the "k" sound in "cat", the "sh" sound in "ship" and the "t" sound in "top" are each single phonemes.
English uses about 40 phonemes to convey the 500,000 or so words it contains, making them a compact and practical unit for speech engines to work with.
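In practice, a recognizer relies on a pronunciation lexicon that maps each word to its phoneme sequence. A minimal sketch of the idea, using ARPAbet-style phoneme labels (the tiny dictionary here is illustrative, not drawn from any real lexicon):

```python
# Illustrative pronunciation lexicon: word -> phoneme sequence.
LEXICON = {
    "cat":    ["K", "AE", "T"],
    "ship":   ["SH", "IH", "P"],
    "assure": ["AH", "SH", "UH", "R"],
}

def phonemes_for(word: str) -> list[str]:
    """Look up a word's phoneme sequence; unknown words get an empty list."""
    return LEXICON.get(word.lower(), [])

print(phonemes_for("ship"))  # ['SH', 'IH', 'P']
```

With only ~40 distinct phoneme labels covering the whole vocabulary, the recognizer can model a few dozen sound units instead of hundreds of thousands of words.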
Phonemes are often extracted by running the waveform through a Fourier transform, which allows the waveform to be analyzed in the frequency domain. So, what does this mean? It is easier to understand this principle by looking at a spectrogram. A spectrogram is a 3D plot of a waveform's frequency and amplitude versus time; in many cases, the amplitude at each frequency is expressed as a colour (either greyscale or a colour gradient). For example, if I said "Countash", which contains the "sh" phoneme, and "assure", which contains the "ss" phoneme, the two phonemes would appear almost the same on...
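The frequency-domain analysis described above can be sketched as a short-time Fourier transform: slice the waveform into overlapping frames, window each frame, and take the magnitude of its FFT. Each row of the result is one column of the spectrogram. The frame and hop sizes below are illustrative choices, not values from any particular recognizer:

```python
import numpy as np

def spectrogram(signal: np.ndarray, frame_len: int = 256, hop: int = 128) -> np.ndarray:
    """Magnitude spectrogram: one row per frame, one column per frequency bin."""
    window = np.hanning(frame_len)  # taper each frame to reduce spectral leakage
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))

# One second of a 440 Hz tone sampled at 8 kHz: the energy should
# concentrate in the frequency bin nearest 440 Hz.
sr = 8000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (number of frames, frame_len // 2 + 1)
```

Plotting `spec` with time on one axis, frequency bin on the other, and magnitude as colour reproduces exactly the kind of picture the text describes.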