Earlier this summer at re:MARS, an Amazon-hosted event focused on machine learning, automation, robotics, and space, Rohit Prasad, chief scientist and vice president of Alexa AI, aimed to wow audiences with a paranormal parlor trick: talking with the dead. “While AI can’t take away that pain of loss, it can definitely make their memories last,” he said, before showing a short video in which an adorable boy asks Alexa, “Can Grandma finish reading me The Wizard of Oz?”
The voice of the woman who reads a few sentences from the book sounds convincingly grandmotherly. But without knowing the real grandma, it was impossible to judge the resemblance. And the whole thing struck many observers as more than a little creepy: Ars Technica called the demo “morbid.” Still, Prasad’s revelation of how the trick was performed was genuinely gasp-worthy: Amazon scientists were able to summon Grandma’s voice from an audio sample only a minute long. And they can do the same with just about any voice, a prospect you may find exciting, terrifying, or a bit of both.
Fear of “deepfake” voices capable of fooling humans or voice-recognition technology is not unfounded: in one 2020 case, thieves used an artificially generated voice to convince a Hong Kong bank manager to release $400,000 in funds before the ruse was discovered. At the same time, as voice interactions with technology become more commonplace, brands want to be represented by distinctive voices. And consumers seem to want technology that sounds more human (although a Google voice assistant that mimicked “ums,” “mm-hmms,” and other human speech tics was criticized for being too realistic).
This has led to a wave of innovation and investment in AI-powered text-to-speech (TTS) technology. A Google Scholar search turns up more than 20,000 text-to-speech research papers published since 2021. Globally, the text-to-speech market is expected to grow from around $2.3 billion in 2020 to $7 billion by 2028, according to Emergen Research.
Today, the most widespread uses of TTS are digital assistants and chatbots. But emerging applications of voice identity in games, media, and personal communication are easy to imagine: personalized voices for your virtual characters, text messages read aloud in your own voice, voiceovers by absent (or deceased) actors. The rise of the metaverse is also changing the way we interact with technology.
“There will be a lot more of these virtualized experiences, where the interaction is less and less a keyboard, and more about speech,” says Frank Chang, founding partner of Flying Fish, a Seattle-based venture capital fund focused on AI. “Everyone thinks voice recognition is the hot thing, but at the end of the day, if you talk to something, don’t you want it to talk back to you? If it can be personalized – with your voice or the voice of someone you want to hear – all the better.” Providing accessibility for people with impaired vision, limited motor function, and cognitive challenges is another factor driving the development of voice technology, especially for e-learning.
Whether you like the idea of a “Grandma Alexa” or not, the demo highlights how quickly AI has advanced text-to-speech, and suggests that convincing fake human voices may be much closer than we think.
The original Alexa, released with the Echo device in November 2014, was reportedly based on the voice of Nina Rolle, a Boulder-based voice-over artist (something neither Amazon nor Rolle has ever confirmed), and relied on technology developed by the Polish text-to-speech company Ivona, which Amazon acquired in 2013. But Alexa’s conversational style left a lot to be desired. In 2017, VentureBeat wrote: “Alexa is smart enough, but no matter what the AI-powered assistant is talking about, there’s no getting around its relatively flat and monotonous voice.”
Early versions of Alexa used a “concatenative” form of speech synthesis, which works by compiling a large library of speech fragments recorded from a single speaker, fragments that can then be recombined to produce new words and sentences. Imagine a ransom note, where letters are cut out and pasted together to form new sentences. This approach generates intelligible audio with an authentic-sounding timbre, but it requires many hours of recorded voice data and a lot of fine-tuning, and its reliance on a library of recorded sounds makes it difficult to modify voices. Another technique, known as parametric TTS, does not use recorded speech at all; instead, it starts with statistical models of individual speech sounds, which can be assembled into a sequence of words and phrases and rendered by a voice synthesizer called a vocoder. (Google’s “standard” text-to-speech voices use a variation of this technology.) It offers more control over the voice output, but it sounds muffled and robotic. You wouldn’t want it to read you a bedtime story.
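The ransom-note analogy can be made concrete with a toy sketch. This is purely illustrative, not how any production system is implemented: the fragment library here is a handful of made-up sample values keyed by hypothetical phoneme labels, standing in for the hours of recorded audio a real concatenative system needs.

```python
# Toy illustration of concatenative synthesis: prerecorded fragments are
# stitched together in order, like letters cut from a ransom note.

# Hypothetical unit library: each phoneme label maps to a short list of
# numbers standing in for a recorded audio fragment.
UNIT_LIBRARY = {
    "HH": [0.1, 0.3, 0.2],
    "EH": [0.5, 0.6, 0.4],
    "L":  [0.2, 0.1, 0.1],
    "OW": [0.7, 0.8, 0.6],
}

def synthesize(phonemes):
    """Concatenate recorded units in order. A missing unit is fatal,
    which is why concatenative systems need large recorded libraries
    and why editing a voice is hard: every sound must exist on tape."""
    waveform = []
    for p in phonemes:
        if p not in UNIT_LIBRARY:
            raise KeyError(f"No recorded unit for phoneme {p!r}")
        waveform.extend(UNIT_LIBRARY[p])
    return waveform

audio = synthesize(["HH", "EH", "L", "OW"])  # "hello"
print(len(audio))  # 12 samples stitched from 4 units
```

The timbre is authentic because every fragment really was recorded, but the joins between fragments are where the many hours of data and fine-tuning go in a real system.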
In an effort to create new, more expressive and natural voices, Amazon, Google, Microsoft, Baidu and other major text-to-speech players have all adopted some form of “neural TTS” in recent years. NTTS systems use deep learning neural networks trained on human speech to model audio waveforms from scratch, dynamically converting any text input into fluent speech. Neural systems are able to learn not only pronunciation, but also patterns of rhythm, stress, and intonation that linguists call “prosody.” And they can adopt new speaking styles or switch speaker “identities” with relative ease.
Google Cloud’s Text-to-Speech API currently offers developers more than 100 neural voices in languages ranging from Arabic to Vietnamese (plus regional dialects), alongside “standard” voices that use the older parametric TTS. Microsoft’s Azure gives developers access to more than 330 neural voices in over 110 languages and dialects, with a range of speaking styles, including newscast, customer service, shouting, whispering, anger, excitement, cheerfulness, sadness, and terror. Azure neural voices have also been adopted by companies such as AT&T, Duolingo, and Progressive. (In March, Microsoft completed its acquisition of Nuance, a leader in conversational AI and a partner in building Apple’s Siri, whose Vocalizer service offers more than 120 neural chatbot voices in more than 50 languages.) Amazon’s Polly text-to-speech API supports roughly three dozen neural voices in 20 languages and dialects, in conversational and “newscaster” styles.
The technology behind the Grandma demo was developed by scientists at Amazon’s text-to-speech lab in Gdańsk, Poland. In a research paper, the developers describe their new approach to cloning a voice from a very limited sample – a “few-shot” problem, in the parlance of machine learning. Essentially, they split the task into two parts. First, the system converts text into “generic” speech, using a model trained on 10 hours of another speaker’s voice. Then a “voice filter” – trained on a one-minute sample of the target speaker’s voice – imparts a new speaker identity, altering the characteristics of the generic voice so that it sounds like the target speaker. Very few training samples are needed to create a new voice.
Rather than having to build a new text-to-speech model for each new voice, this modular approach turns the creation of a new speaker identity into the computationally easier task of transforming one voice into another. On objective and subjective measures, the quality of synthetic speech generated this way was comparable to that of models trained on 30 times more data. That said, the system cannot completely mimic a specific person’s speaking style. In an email to Fast Company, Alexa researchers explained that the voice filter changes only the timbre of the voice – its basic resonance. The voice’s prosody – its rhythms and intonation – comes from the generic voice model. So the result might sound like Grandma reading, but without the distinctive way she would lengthen some words or pause between others.
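The two-stage split described above can be sketched in miniature. This is a heavily simplified stand-in, not Amazon’s method: where the real system uses neural networks operating on spectral features, this toy uses per-band gains as a crude “learned” timbre transform, and fake spectral frames in place of real audio. All function names and data here are invented for illustration.

```python
# Toy sketch of a two-stage voice-cloning pipeline: a generic TTS stage,
# then a "voice filter" fit on a tiny target-speaker sample.

def generic_tts(text):
    """Stage 1: a 'generic voice' model maps text to spectral frames.
    Here each frame is a fake 3-band spectrum derived from character codes."""
    return [[(ord(c) % 7) + 1.0, (ord(c) % 5) + 1.0, (ord(c) % 3) + 1.0]
            for c in text]

def fit_voice_filter(target_frames, generic_frames):
    """'Train' a filter from a small target sample: per-band gains that
    move the generic voice's average spectrum toward the target's.
    This is the cheap step -- no full model is retrained per voice."""
    n_bands = len(generic_frames[0])
    gains = []
    for b in range(n_bands):
        g_avg = sum(f[b] for f in generic_frames) / len(generic_frames)
        t_avg = sum(f[b] for f in target_frames) / len(target_frames)
        gains.append(t_avg / g_avg)
    return gains

def apply_voice_filter(frames, gains):
    """Stage 2: re-timbre the generic output. The frame sequence itself
    (standing in for prosody) is untouched; only band energies change --
    mirroring why the clone keeps the generic model's rhythm and pauses."""
    return [[x * g for x, g in zip(frame, gains)] for frame in frames]

generic = generic_tts("once upon a time")
# A one-minute "sample" of the target speaker, stood in by two fake frames:
grandma_sample = [[4.0, 2.0, 1.5], [5.0, 2.5, 1.0]]
gains = fit_voice_filter(grandma_sample, generic)
cloned = apply_voice_filter(generic, gains)
```

The point of the design shows up in the code’s shape: `generic_tts` is the expensive part trained once, while `fit_voice_filter` needs only the tiny target sample, which is what makes the few-shot economics work.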
Amazon won’t say when the new voice-cloning capabilities will be available to developers and the public. In an email, a spokesperson wrote: “Alexa voice personalization is a highly sought after feature by our customers, who could use this technology to create many enjoyable experiences. We’re working to improve on the basic science we demonstrated at re:MARS and explore use cases that will delight our customers, with the necessary safeguards to prevent potential misuse.”
One can imagine customizing something like Reading Sidekick, an Alexa feature that lets kids take turns reading with Alexa, with the voice of a loved one. And it’s easy to see how the Grandma demo could portend an expanded cast of more adaptable celebrity voices for virtual assistants. Alexa’s current celebrity voices – Shaquille O’Neal, Melissa McCarthy, and Samuel L. Jackson – each took around 60 hours of studio recording to produce, and they’re limited in what they can do: answering questions about the weather and telling jokes and stories, but defaulting to the standard Alexa voice for requests outside the system’s comfort zone.
Google Assistant’s “celebrity voice cameos” by John Legend and Issa Rae, introduced in 2018 and 2019 but no longer supported, similarly combined prerecorded audio with impromptu responses synthesized with WaveNet technology. The ability to develop more robust celebrity voices that can read any text input after a short recording session could be a game-changer, and could even help boost stagnant smart speaker sales. (According to research firm Omdia, U.S. smart speaker shipments were down nearly 30% last year from 2020, including a nearly 51% drop in shipments of Amazon Alexa smart speakers.)
As big tech companies continue to invest in text-to-speech, one thing is certain: it will become increasingly difficult to tell whether the voice you hear is made by a human or by a human-made algorithm.