Improving Transcription Quality

Justin Dupree -

Transcription works by hearing a long string of sounds and comparing patterns in a macro sense. If a 30-second snippet has "this" sound at the beginning and "this" sound at the end, then "this" sound in the middle is probably "X". It takes multiple passes, using its guesses about some components of the audio to inform its guesses about others. Over several passes, it refines those guesses into what was likely said.

The first step toward providing higher quality transcription is to improve the source data. Phone calls already have a disadvantage here, as the audio bandwidth on a telephone call is much lower than in a face-to-face conversation. That's why people sound different over the phone. The audio is compressed, leaving the sound a little "off". Saving the recording in a compressed format like MP3 on top of that lowers the audio quality even further. The fastest way to boost your transcription quality on Tropo is to use WAV files for your recordings instead of MP3.
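As a rough way to confirm that a recording really is uncompressed WAV before it goes to transcription, the sketch below reads the file header with Python's standard wave module. The helper names describe_wav and make_silent_wav are hypothetical and are not part of Tropo, and the 8000 Hz mono 16-bit test clip is only an assumption modelled on typical narrowband telephone audio, not a statement of what Tropo records.

```python
import io
import wave


def describe_wav(data: bytes) -> dict:
    """Return basic format facts for WAV bytes.

    Raises wave.Error if the bytes are not a WAV file the wave module
    can read (for example, MP3 data).
    """
    with wave.open(io.BytesIO(data), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": rate,
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": frames / rate,
        }


def make_silent_wav(seconds: int = 1, rate: int = 8000) -> bytes:
    """Build a silent mono 16-bit WAV in memory, used here as test input.

    The 8000 Hz default is an assumption based on typical telephone audio.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * (rate * seconds))
    return buf.getvalue()


if __name__ == "__main__":
    print(describe_wav(make_silent_wav()))
```

A file that raises wave.Error here is probably not the WAV you meant to record, which is worth catching before you judge the transcription results.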

Untrained transcription (untrained meaning not tuned to your voice and inflection) is terrible at general-purpose audio. Call a Google Voice number and read it a passage from a book; the resulting transcription won't resemble the text you read. That's because Google Voice starts with certain assumptions, including that the audio is a voicemail message, and so do we. Both Google Voice and Tropo expect the message to start with a greeting and to include someone's name, something like: "Hi Joe, it's Josh. Can you call me about the meeting this afternoon?" This means the passage you read from a book will often end up looking like a voicemail message in the transcription, because that's the context they're expecting.

Even for trained speech recognition systems (something like Dragon NaturallySpeaking), single words get awful results. There's no way for Dragon to tell if the noise you just made was "blue", "blew", "eww" or even "stew"; it needs a lot of words to get context about the sound. You'll notice as you use it that words it has already transcribed will change over the course of your speech. It's using the context of what you're saying to revise its earlier guesses.
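To make the role of context concrete, here is a deliberately tiny sketch that picks between sound-alike words by looking at the words on either side. It is a toy illustration only, not how Dragon or Tropo actually works; the BIGRAM_COUNTS table and the pick_word helper are invented for the example.

```python
# Toy counts of how often one word follows another.
# These numbers are invented purely for illustration.
BIGRAM_COUNTS = {
    ("the", "blue"): 50,
    ("blue", "sky"): 40,
    ("wind", "blew"): 30,
    ("blew", "hard"): 20,
    ("beef", "stew"): 25,
}


def pick_word(previous, candidates, following):
    """Choose the candidate that fits best between its neighbours.

    The score is the sum of the two bigram counts. With no usable
    context every score is 0 and the first candidate is returned,
    which is no better than a guess.
    """
    def score(word):
        return (BIGRAM_COUNTS.get((previous, word), 0)
                + BIGRAM_COUNTS.get((word, following), 0))

    return max(candidates, key=score)


if __name__ == "__main__":
    sound_alikes = ["blue", "blew", "stew"]
    print(pick_word("the", sound_alikes, "sky"))    # surrounded by "the ... sky"
    print(pick_word("wind", sound_alikes, "hard"))  # surrounded by "wind ... hard"
    print(pick_word(None, sound_alikes, None))      # a single word, no context
```

The last call shows the single-word problem: with nothing on either side, all candidates tie and the choice is arbitrary.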

If what you're looking for is highly accurate transcription, especially transcription that includes non-standard words like proper nouns, or even single letters, your best bet is to implement an independent and likely human-driven transcription service. There are many available, and while we can't recommend one over another as we don't have any partnerships, finding one with an API that accepts a REST POST would be the easiest to integrate into your application (as you can POST your recording audio out of Tropo for use by other services).
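As a sketch of what the receiving side of that integration might look like, the snippet below forwards WAV bytes to a transcription service with a plain HTTP POST, using only Python's standard library. The URL is a placeholder for whichever service you choose, and the function names are hypothetical; a real service will document its own endpoint, authentication and expected content type.

```python
import urllib.request

# Hypothetical endpoint; replace with your chosen service's documented URL.
TRANSCRIPTION_URL = "https://transcription.example.com/v1/jobs"


def build_transcription_request(audio: bytes,
                                url: str = TRANSCRIPTION_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying raw WAV bytes."""
    return urllib.request.Request(
        url,
        data=audio,
        method="POST",
        # Assumes the service accepts the raw audio as the request body.
        headers={"Content-Type": "audio/wav"},
    )


def submit_for_transcription(audio: bytes, url: str = TRANSCRIPTION_URL) -> bytes:
    """Send the recording and return the raw response body."""
    request = build_transcription_request(audio, url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()
```

Keeping the request construction separate from the network call makes it easy to check the method, headers and body without contacting a live service.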
