Wevolver and Syntiant are creating a series that explores the work of innovators and the future of pervasive AI. Syntiant develops ultra-low-power AI processors, and as part of its commitment to innovation it is hosting these fireside chats with engineers and designers at the cutting edge of their fields.
In previous conversations we spoke with, among others, The Things Industries CEO Wienke Giezeman, Rev.com’s Daniel Kokotov, Star Wars animatronic designer Gustav Hoegen, and FashionTech designer Anouk Wipprecht.
While working on her PhD in speech and language technologies, Bajorek, after poring over the datasets that power many of the tools that listen, translate, and talk back, realized something was missing.
For all of the technical innovation, she asked, who decides where that innovation goes?
“When we're looking at systemic change in these systems, we have to look at what our datasets look like, who is in the room, and who is deciding what this looks like. It’s not necessarily technical limitations, it's about having women and people of color in the room.”
In 2018 Bajorek launched Women in Voice (WiV), a nonprofit that addresses the lack of support for women in the voice and conversational AI fields. WiV now has 21 chapters in 15 countries. This past year, WiV launched its first Summit, developed programming with partners like Google Assistant, Alexa Education & Alexa Startups, and Symbl.ai, and debuted a formal Career Accelerator.
Racial and gender bias is a well-known and persistent problem in AI, especially in speech recognition technologies. Systems struggle to recognize female voices, accented speech, jargon and abbreviations, and low-resource languages.
Despite encouraging research on multilingual transformers and sentence embeddings, much of the current benchmarking for improvement is narrow in scope, usually focused on early-adopter datasets whose demographics can be at least 70% white, male, and English-speaking.
“From a business perspective as well, if my preferred demographic to work for is going to be LatinX Gen Z, which statistically would be really smart here in the United States, we need to ask the right questions. If I want to do something that performs well for this type of voice, how does the software have to change? How does my dataset need to change? How does the performance of whatever my hyperparameters look like in a multilingual context need to be tweaked? If I'm looking for a multilingual Gen Z Latina dataset of speech samples, does that even exist? We need to recognize the right framework, and also the huge potential in knowing how to optimize and make systems more equitable,” says Bajorek.
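Bajorek's questions map naturally onto how evaluation could work in practice: instead of reporting one pooled accuracy number, score the system separately for each demographic slice. The sketch below illustrates this with the open-source jiwer library; the transcripts and group labels are invented for illustration.

```python
# A minimal sketch of subgroup-level ASR evaluation (hypothetical data).
# jiwer (pip install jiwer) computes word error rate (WER); grouping results
# by demographic bucket keeps a strong average from hiding weak subgroups.
from collections import defaultdict
from jiwer import wer

# Hypothetical (reference transcript, ASR output, demographic group) triples.
samples = [
    ("turn on the kitchen lights", "turn on the kitchen lights", "en-US male"),
    ("play my workout playlist",   "play my work out play list", "en-US female"),
    ("llama a mi abuela",          "lama ami a wela",            "es-MX female"),
]

by_group = defaultdict(lambda: ([], []))
for ref, hyp, group in samples:
    by_group[group][0].append(ref)
    by_group[group][1].append(hyp)

# One WER per demographic bucket instead of a single pooled number.
for group, (refs, hyps) in sorted(by_group.items()):
    print(f"{group}: WER = {wer(refs, hyps):.2f}")
```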
To that end, Bajorek highlights the Mozilla Common Voice team as one example moving the needle in the right direction: for each language, they track what percentage of contributions fall into different gender buckets.
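That kind of audit is straightforward to reproduce on Common Voice's published metadata. The sketch below assumes the per-language validated.tsv files that ship with Common Voice releases, each carrying a self-reported gender column; the language codes and file paths are illustrative.

```python
# A sketch of a Common Voice gender audit, assuming release folders laid out
# as <lang>/validated.tsv with a self-reported 'gender' column.
import pandas as pd

for lang in ["en", "es", "sw"]:  # illustrative subset of language codes
    df = pd.read_csv(f"{lang}/validated.tsv", sep="\t", usecols=["gender"])
    # Clips without a self-reported gender come through as NaN; count them too.
    shares = df["gender"].fillna("unreported").value_counts(normalize=True)
    print(f"--- {lang} ---")
    print((shares * 100).round(1).to_string())
```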
Automatic Speech Recognition (ASR) has continued to mature. The field has pivoted many times, from the days of dictating each word separately and training speaker-dependent models that needed three hours of your speech to adequately capture your voice, to the end-to-end deep learning systems we see today.
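The contrast is easy to see in code: where older systems needed per-speaker enrollment, a pretrained end-to-end model transcribes an arbitrary speaker out of the box. Below is a minimal sketch using the Hugging Face transformers pipeline; the model name and audio file are illustrative choices, not anything specific to Bajorek's work.

```python
# A minimal sketch of modern end-to-end ASR: no per-user training session,
# just a pretrained deep learning model applied to any speaker's audio.
# Requires transformers and ffmpeg; model and file name are illustrative.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
result = asr("sample_utterance.wav")  # any short mono audio clip
print(result["text"])
```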
In the near future, Bajorek sees a world of EdTech that pairs robust language models with virtual reality.
“You can imagine a world using virtual reality or a similar immersive experience in a language setting that you’re learning the language of. For example, if you're learning Spanish, you could go to a bar and speak Spanish and have an interaction with someone, and the automated speech recognition will pick it up. And if you get it right, you'll choose different things to order. With this type of EdTech, we’re empowering people to experience different cultures and languages without even having to travel.”
Longer term, the goals are more challenging: major system improvements. “I'm really excited about ASR, Text to Speech (TTS), and Natural Language Processing (NLP). What I’m even more excited about is how it's going to transform our world and our use cases for day-to-day things that we barely even pay attention to today. All voices, all sounds, and being able to parse them perfectly in real time with zero latency is a very, very hard problem. I hope to live long enough that we find that solution, but it will be a multi-pronged solution that we probably have not seen yet today.”