Perceptual insight and speech recognition

I bought a book for a dollar the other day called New Horizons in Linguistics, published in the early 70s. Out of date as it is, one of the articles discusses speech perception and how our brains are highly trained to recognize tiny variations in frequency, timing and intensity in order to identify certain phonemes as meaningful. One passage in particular caught my eye:

If we take a sample of the ‘same’ word uttered by a man, a woman, and a child, and make acoustic measurements on each of these, we shall not obtain absolute measurements that are identical or even particularly close to each other. There will be marked differences in overall intensity, spectral distribution of energy, fundamental frequency and duration. The fact that any listener will recognize the same word in all three cases is due to his reliance upon acoustic cues based in relative values, relations within each utterance, relations between the three utterances and, most important of all, relations between this particular utterance and others which might have come from the same speaker, for there is good evidence that the listener is able, on the basis of a very short sample, to infer a whole frame of reference for dealing with any individual’s speech. (D.B. Fry, “Speech Reception and Perception” 37)

It reminded me of this example of “perceptual insight” graciously hosted by the MRC Cognition and Brain Sciences Unit near Cambridge. If you listen to the first clip, the sine-wave version, it should sound like a bunch of squeaks and bleeps. But if you listen to the second clip, the clear speech, and then go back and listen to the first one, you should easily be able to recognize the sounds in the first clip as meaningful words. As Matt Davis explains, the first clip is a simplified version of the spectrogram of the clear speech: it keeps only sine waves that track the centers of the formant frequencies. If you go to his site and listen to the other four sine-wave/clear speech pairs, you should notice that it doesn’t take long to learn to recognize the sine-wave speech as meaningful, even without listening to the clear speech first.

This is an example of what Fry was pointing out: human brains are very capable of building a whole frame of reference for relative judgments of frequency, timing and intensity after listening to only a short sample of speech. Something similar happens when you first meet someone with a thick accent, or when you speak to a young child who is learning to talk. By developing a frame of reference based on what you do understand of their speech, you are able to recognize the patterns that turn the sounds coming from their mouth into meaningful utterances.
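For the curious, here is a rough sketch of how sine-wave speech like this could be put together, assuming you already have formant tracks (a center frequency and an amplitude for each formant, sampled every few milliseconds). This is not Matt Davis's actual procedure; the function name, the frame format and the formant values are all made up for illustration. Each formant is replaced by a single sine wave whose pitch and loudness follow the track, and the sine waves are summed.

```python
import math

def synthesize_sine_wave_speech(frames, frame_dur=0.01, sample_rate=16000):
    """Turn formant tracks into a list of audio samples.

    frames: list of frames; each frame is a list of (freq_hz, amplitude)
    pairs, one pair per formant. This layout is an assumption of the sketch.
    """
    n_formants = len(frames[0])
    samples_per_frame = int(round(frame_dur * sample_rate))
    phases = [0.0] * n_formants  # running phase of each sine wave
    out = []
    for i, frame in enumerate(frames):
        # Glide toward the next frame; the last frame holds its values.
        nxt = frames[min(i + 1, len(frames) - 1)]
        for s in range(samples_per_frame):
            t = s / samples_per_frame
            value = 0.0
            for k in range(n_formants):
                freq = frame[k][0] + t * (nxt[k][0] - frame[k][0])
                amp = frame[k][1] + t * (nxt[k][1] - frame[k][1])
                # Accumulate phase so frequency changes do not cause clicks.
                phases[k] += 2 * math.pi * freq / sample_rate
                value += amp * math.sin(phases[k])
            out.append(value / n_formants)
    return out

# Hypothetical tracks: three formants gliding between two vowel-like shapes.
frames = [
    [(700.0, 1.0), (1200.0, 0.5), (2600.0, 0.25)],
    [(300.0, 1.0), (2300.0, 0.5), (3000.0, 0.25)],
]
samples = synthesize_sine_wave_speech(frames)
```

Everything else in the original signal (the voice pitch, the breathiness, the fine detail of the spectrum) is thrown away, which is why the result sounds like whistles until your brain works out the frame of reference.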


