Transcription vs Speech Recognition (ASR)

Justin Dupree -

Transcription (the attempt to determine what was said as a whole) and grammar-based speech recognition (the attempt to match a word or words against a predefined list of choices) are very different creatures. With a grammar, when a user provides an answer, the speech recognizer doesn't know what was said. It only knows whether the sounds it heard matched the sounds of one of the grammar's choices.

For example, if the grammar is "red, blue, violet" and I say "blew", that's going to match, because the sounds are the same. If you set the confidence level low enough, "violent" might match too, or even something like "pilot." The speech engine isn't trying to understand what someone said; it's just trying to match a pattern of sounds. In effect it asks how far the sound it just heard is from the collection of sounds that make up each word in the grammar. That's why background noise can sometimes register as a positive match.
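The matching behavior described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the phoneme spellings are hand-written approximations, and `edit_distance`, `similarity`, and `match` are hypothetical helpers, not part of any real speech engine. Real recognizers score acoustic features rather than tidy phoneme lists, but the threshold effect is the same.

```python
from typing import Optional, Sequence

# Hypothetical, hand-written phoneme spellings (roughly ARPAbet-style).
GRAMMAR = {
    "red": ["R", "EH", "D"],
    "blue": ["B", "L", "UW"],
    "violet": ["V", "AY", "AH", "L", "AH", "T"],
}

# What the caller actually said, spelled the same hypothetical way.
HEARD = {
    "blew": ["B", "L", "UW"],
    "violent": ["V", "AY", "AH", "L", "AH", "N", "T"],
    "pilot": ["P", "AY", "L", "AH", "T"],
}


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # drop a phoneme from a
                            curr[j - 1] + 1,          # add a phoneme from b
                            prev[j - 1] + (x != y)))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """1.0 means identical sounds; 0.0 means nothing in common."""
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def match(sounds: Sequence[str], confidence: float) -> Optional[str]:
    """Return the closest grammar choice, or None if it is below the confidence level."""
    best = max(GRAMMAR, key=lambda word: similarity(sounds, GRAMMAR[word]))
    return best if similarity(sounds, GRAMMAR[best]) >= confidence else None
```

In this sketch "blew" matches "blue" at any confidence level because the phonemes are identical. "violent" is rejected at 0.9 but matches "violet" at 0.8, and "pilot" only matches "violet" once the level drops to around 0.6. At no point does the matcher know which word was spoken; it only measures distance to the choices it was given.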

Transcription, on the other hand, works by hearing a long string of sounds and comparing patterns in a macro sense. If a 30-second snippet has one known sound at the beginning and another at the end, then the unknown sound in the middle is probably "X". It takes multiple passes and uses guesses about some components of the audio to make guesses about others. Over several passes, it's able to refine those guesses into what was likely said.

Untrained transcription accuracy is terrible for general-purpose audio (untrained meaning not tuned to your voice and inflection). Call a Google Voice number and read it a passage from a book; the resulting transcription won't resemble the text you read. That's because the service starts with certain assumptions, including that the audio is a voicemail message. Google Voice expects the message to start with a greeting and to include someone's name. It's expecting something like: "Hi Joe, it's Josh. Can you call me about the meeting this afternoon?" This means the passage you read from a book will often end up looking like a voicemail message in the transcription, because that's the context the service is expecting.

Even for trained speech recognition systems (something like Dragon NaturallySpeaking), single words get awful results. There's no way for Dragon to tell if the noise you just made was "blue", "blew", "eww", or even "stew"; it needs a lot of words to get context about the sound. You'll notice as you use it that words it has already transcribed will change over the course of your speech. It's using the context of what you're saying to revise its earlier guesses.
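A toy sketch of that revision step, under loud assumptions: the `BIGRAMS` scores are invented for illustration, and `score` and `best_transcript` are hypothetical helpers. Dragon's actual models are not described here, and this is not meant to represent them. The sketch only shows why a lone sound is ambiguous while the same sound followed by more words is not.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

# Invented word-pair scores standing in for "context"; not from any real model.
BIGRAMS: Dict[Tuple[str, str], float] = {
    ("blew", "the"): 1.0,
    ("the", "whistle"): 1.0,
    ("is", "blue"): 1.0,
}


def score(words: Sequence[str]) -> float:
    """Sum the scores of each adjacent word pair in a candidate transcript."""
    return sum(BIGRAMS.get(pair, 0.0) for pair in zip(words, words[1:]))


def best_transcript(slots: List[List[str]]) -> Tuple[str, ...]:
    """Try every combination of sound-alike candidates and keep the best-scoring one."""
    return max(product(*slots), key=score)


# Each slot lists the words that one stretch of audio could plausibly be.
first_word_only = [["blue", "blew"]]
full_phrase = [["blue", "blew"], ["the"], ["whistle"]]

early_guess = best_transcript(first_word_only)  # a tie: there is no context to score
later_guess = best_transcript(full_phrase)      # context now favors "blew"
```

With only the first slot, both candidates score zero and the choice between them is arbitrary. Once "the whistle" arrives, the combination starting with "blew" scores highest, so the first word of the transcript is revised after the fact.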
