Fireside chat with Rev's Daniel Kokotov, VP of Engineering within Speech Technologies

18 Mar, 2022

#7 of our 'Voice of Innovation' fireside chat series: Robotics and AI reporter Rachel Gordon speaks to Kokotov on his vision to understand every human voice.

Wevolver and Syntiant are creating a series that explores the work of innovators and the future of pervasive AI. Syntiant is developing ultra-low-power AI processors. Because it believes in the importance of innovation, Syntiant is engaging in these fireside chats with engineers and designers who are on the cutting edge of their fields.

This video is part of a series of fireside chats in which Wevolver and Syntiant partner to engage with global innovators. In previous discussions we spoke with The Things Industries CEO Wienke Giezeman, Star Wars Animatronic Designer Gustav Hoegen, and Fashiontech Designer Anouk Wipprecht. 

Using humans and AI to translate and transcribe 

When Rev was in its infancy, it consisted mainly of human freelancers sitting behind their screens, listening intently to deliver true-to-word accuracy. But in 2016, with the onset of the deep learning revolution, speech-to-text took off.

“The coupling of hiring intelligent people and working with a dataset that we had created over many years of doing human transcription translated into a very powerful automated speech recognition ecosystem. Now it's humans and machines working together.” 

Rev’s speech recognition engine is trained on a corpus of human-transcribed text. When a client needs human-level quality, a human still produces the transcript, but now that the model has more robust datasets behind it, they start from a draft generated by the speech recognition model.

The dataset is fed to the speech recognition engine for training, and the more data it gets from humans, the better the engine becomes. In turn, the “Revvers”, or human freelancers, save a considerable amount of time by starting with a more accurate draft produced by the machine.

“You can imagine, if we're doing transcripts for earnings calls, and all those need to be 100 percent accurate, they’re going to want human quality. On the other hand, if you’re doing something that necessitates a quick turnaround, where you might want something within minutes of having available audio, or even live – imagine a Zoom app that could show you a live transcript of our conversation - that will just be AI.” 

Speech-to-text for law enforcement

Speech converted to text is now so ubiquitous that it has started to feel like a simple extension of how we speak. Digital assistants are popping up on our shelves and mantels, telling us the weather or switching on a light. Legal officials transcribe interviews for depositions. And more recently, law enforcement has used speech-to-text for body camera equipment, a vertical Kokotov is exploring.

Rev is working with Axon, the largest manufacturer of body camera equipment, which provides software tools that let police departments capture the audio from body cameras and quickly use it to write up reports for evidence collection. Officers record, the footage is uploaded to Axon, it goes through transcription with Rev, and they don't have to worry about forgetting crucial details when filing reports.

“Many of us have seen more body camera footage in the last year and a half than we've ever had before. It’s extremely important that we get the content of those conversations right because they affect a lot of people's lives, and they need to be scalable because officers are constantly in the field.” 

Low resource natural language processing

Speech recognition technologies notoriously favor the English language, and oftentimes fail to recognize different languages, accents, or speech patterns. Multilingual conversations are harder still: when speakers swing back and forth between Portuguese and Arabic, for example, no speech recognition engine would really handle that very well. However, new techniques, such as using multilingual transformers, are much more amenable to understanding these low-resource languages.

“I believe this will be one of the biggest challenges and themes over the next five to ten years, in taking speech recognition global, doing better with more languages, and improving these cross language situations. Our goal is to bring what we think is the best speech recognition to as many people as possible. In the service of this vision is understanding every human voice.”

More by Rachel Gordon

Communications and Media Relations Manager at CSAIL. MIT’s Computer Science and Artificial Intelligence Laboratory pioneers research in computing that improves the way people work, play, and learn.