Why companies like Amazon manually review voice data


Amazon employs contract workers to review voice clips recorded by Alexa, a recent report revealed. How do other companies’ practices compare?
Last week, Bloomberg revealed unsavory details about Alexa’s ongoing development that were known in some circles but hadn’t previously been widely reported: Amazon employs thousands of contract workers in Boston, Costa Rica, India, Romania, and other locations to annotate thousands of hours of audio each day from devices powered by its assistant. “We take the security and privacy of our customers’ personal information seriously,” an Amazon spokesman told the publication, adding that customers can opt out of supplying their voice recordings for feature development.
Bloomberg notes that Amazon doesn’t make it explicitly clear in its marketing and privacy policy materials that it reserves some audio recordings for manual review. But what about other companies?
Today, most speech recognition systems are aided by deep neural networks — layers of neuron-like mathematical functions that improve as they’re trained on more data — that predict phonemes, or perceptually distinct units of sound. Unlike older automatic speech recognition (ASR) techniques, which relied on hand-tuned statistical models, deep neural nets translate sound in the form of segmented spectrograms, or representations of how a signal’s frequency content changes over time, into characters.
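To make that mapping concrete, here is a minimal sketch of such an acoustic model, assuming PyTorch; the layer sizes and 29-symbol character set are illustrative choices, not any production system’s configuration. It turns a batch of spectrogram frames into per-frame character probabilities.

```python
import torch
import torch.nn as nn

class TinyAcousticModel(nn.Module):
    """Maps spectrogram frames to per-frame character log-probabilities."""
    def __init__(self, n_mels=80, n_chars=29):  # 26 letters + space + apostrophe + blank
        super().__init__()
        self.rnn = nn.LSTM(input_size=n_mels, hidden_size=256, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * 256, n_chars)

    def forward(self, spectrogram):               # (batch, time, n_mels)
        frames, _ = self.rnn(spectrogram)         # (batch, time, 2 * 256)
        return self.proj(frames).log_softmax(-1)  # log-probs over characters

# A dummy 3-second clip: 300 spectrogram frames of 80 mel-frequency bins each.
clip = torch.randn(1, 300, 80)
log_probs = TinyAcousticModel()(clip)             # shape: (1, 300, 29)
```

In a full system, those per-frame probabilities would then be decoded into words, typically with a technique like CTC decoding combined with a language model.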
Joe Dumoulin, chief technology innovation officer at Next IT, told Ars Technica in an interview that it takes 30 to 90 days to build a query-understanding module for a single language, depending on how many intents it needs to cover. Part of the difficulty is that during a typical chat with an assistant, users often invoke multiple voice apps in successive questions, and those apps repurpose variables like “town” and “city.” If someone asks for directions and follows up with a question about a restaurant’s location, a well-trained assistant needs to suss out which thread to reference in its answer.
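As a rough illustration of that thread-tracking problem, here is a short Python sketch in which a follow-up question that omits the city reuses the value remembered from the previous turn. The intent names, slot names, and replies are hypothetical, not any vendor’s actual schema.

```python
# Hypothetical cross-intent slot carry-over for a two-turn exchange.
context = {}  # slot values remembered across turns

def handle(intent, slots):
    # Fill missing slots from the conversation context, then remember new ones.
    merged = {**context, **{k: v for k, v in slots.items() if v is not None}}
    context.update(merged)
    if intent == "GetDirections":
        return f"Starting directions to {merged['city']}."
    if intent == "FindRestaurant":
        return f"Here are restaurants in {merged['city']}."
    return "Sorry, I didn't catch that."

print(handle("GetDirections", {"city": "Boston"}))  # uses the city just given
print(handle("FindRestaurant", {"city": None}))     # reuses "Boston" from context
```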
Moreover, most speech synthesis systems (the components that give assistants their voices) tap a database of phones — distinct speech sounds — strung together to verbalize words. Concatenation, as it’s called, requires capturing the complementary diphones (units of speech comprising two connected halves of phones) and triphones (phones with half of a preceding phone at the beginning and half of a succeeding phone at the end) in lengthy recording sessions. The number of speech units can easily exceed a thousand. The recognition side is just as data-hungry: in a recent experiment, researchers on Amazon’s Alexa team developed an acoustic model using 7,000 hours of manually annotated data. The open source LibriSpeech corpus contains over 1,000 hours of spoken English derived from audiobook recordings, while Mozilla’s Common Voice data set comprises over 1,400 hours of speech from 42,000 volunteer contributors across 18 languages.
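To see why the unit inventory balloons, here is a small Python sketch that builds a diphone inventory from per-word phone sequences; the two-word lexicon and ARPAbet-style phone labels are made up for illustration.

```python
# Build a diphone inventory from per-word phone sequences.
lexicon = {
    "alexa":   ["AH", "L", "EH", "K", "S", "AH"],
    "weather": ["W", "EH", "DH", "ER"],
}

def diphones(phones):
    # Pad with silence so word-initial and word-final transitions are captured.
    padded = ["SIL"] + phones + ["SIL"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

inventory = {d for phones in lexicon.values() for d in diphones(phones)}
print(sorted(inventory))
# ['AH-L', 'AH-SIL', 'DH-ER', 'EH-DH', 'EH-K', 'ER-SIL', 'K-S', 'L-EH',
#  'S-AH', 'SIL-AH', 'SIL-W', 'W-EH']
```

With roughly 40 English phones, the full diphone set approaches 40 x 40 combinations, which is why recording sessions have to cover so many units and the count quickly climbs past a thousand.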
