A Bot Can Say My Name Better Than I Can

It’s not that hard to say my name, Saahil Desai. Saahil: rhymes with sawmill, or at least that gets you 90 percent there. Desai: like decide with the last bit chopped off. That’s really it.

More often than not, however, my name gets butchered into a menagerie of gaffes and blunders. The most common one, Sa-heel, is at least an honest attempt—unlike its mutant twin, a monosyllabic mess that comes out sounding like seal. Others defy all possible logic. Once, a college classmate read my name, paused, and then confidently said, “Hi, Seattle.”

But the mispronunciations that bug me the most aren’t uttered by any human. They come from bots. All day long, Siri reads out my text messages through the AirPods wedged into my ears, and mangles my name into Sa-hul. It fares better than the AI service I use to transcribe interviews, which has identified me by a string of names that seem stripped from a failed British boy band (Nigel, Sal, Michael, Daniel, Scott Hill). Silicon Valley aspires for its products to be world-changing, but evidently that also means name-changing.

Or at least that’s what I thought, until I listened to a clip of a voice saying my name.

It’s an AI voice named Adam from ElevenLabs, a start-up that specializes in voice cloning. (It’s sort of like the DALL-E of audio.) This bot not only says my name well; it says my name better than I can. After all, Saahil comes from Sanskrit, a language I do not speak. The end result is a dopamine hit of familiarity, an amazing feeling that’s like the tech equivalent of finding a souvenir key chain with your name on it.

In addition to chatbots that can write haiku and artbots that can render a pizza in the style of Picasso, the generative-AI revolution has unleashed voicebots that can finally nail my name. Just as ChatGPT learns from internet posts, ElevenLabs has trained its voices on a huge volume of audio clips (at least 500,000 hours, compared with the tens or hundreds of hours used for earlier speech models) to figure out how to talk as people do. “We have spent the last two years developing a new foundational model for speech,” ElevenLabs CEO Mati Staniszewski wrote in an email. “It means our model is context-aware and language agnostic and therefore better able to pick-up on nuances like names, as well as delivering the intonation and emotions that reflect the textual input.” The training data behind newer voicebots might include any number of websites dedicated to pronouncing things, and if someone has correctly said your name in an audiobook, a podcast, or a YouTube video, newer AI models might have it down.

Companies such as Amazon, Google, Meta, and Microsoft are also developing more advanced voicebots, although they’re still a mixed bag. I tested the same sentence (“C’mon, it’s not that hard to say Saahil Desai”) on AI voice programs from each of them. They all could handle Desai, but I was not greeted with a chorus of perfect pronunciations of Saahil. Amazon’s Polly software, perhaps even worse than Siri, thinks my name is something like Saaaaal.

Both Google Cloud and Microsoft Azure were inoffensive but not perfect, slightly twisting Saahil into something recognizably foreign. Nothing could beat ElevenLabs, but Voicebox, an unreleased tool from Meta that the company recently touted as a “breakthrough in generative AI for speech,” got very close.

Computers can now say so many more names than just my own. “I noticed the same thing the other day when my student and I created a recording on ElevenLabs of CNN’s Anderson Cooper saying ‘Professor Hany Farid is a complete and total dips**t’ (it’s a long story),” Hany Farid, a UC Berkeley computer scientist, wrote in an email. “I was surprised at how well it pronounced my name. I’ve also noticed that it correctly pronounces the names of my non-American students.” Other tricky names I tested also fared well: ElevenLabs nailed Lupita Nyong’o and Timothée Chalamet, although it turned poor Pete Buttigieg’s last name into a very unfortunate Buttygig.

That AI voices can now say unusual names is no small feat. They face the same pronunciation struggles that leave many humans stumped; names like Giannis Antetokounmpo don’t abide by the rules of English, while even a simpler name can have multiple pronunciations (AN-dree-uh or ahn-DRAY-uh?) or spellings (Michaela? Mikayla? Michela?). And a name might still fall flat to our ears if an AI voice’s color and texture ring more HAL 9000 than human, Farid said.

Previous generations of voice assistants, including Siri, Alexa, Google Assistant, and your car’s GPS, just didn’t have enough information to handle all of these challenges. (In some cases, you can provide that information yourself: A spokesperson for Apple told me that you can manually input a name’s phonetic spelling into the Contacts app to tweak how Siri reads it.) Over the years, this technology “really sort of plateaued,” Farid said. “It was just really struggling to get through that uncanny valley where it’s sort of human-like, but also a little weird. And then it just blasted through the door.” Newer “deep-learning” techniques, inspired by the human brain, can more readily spot patterns in pitch, rhythm, and intonation.

That is the weird contradiction of AI right now: Even as this technology is prone to biases that can alienate users (voice assistants more frequently misidentify words from Black speakers than white speakers), it can also help pop smaller feelings of alienation that bubble up. To constantly hear bots bungle my name is a digital indignity that reminds me that my devices do not seem made with me in mind, even though Saahil Desai is a common name in India. My blue iPhone 12 is a six-inch slab that contains more of me than any other single thing in my life. And yet it still screws up the most basic thing about my identity.

But a world in which the bots can understand and speak my name, and yours, is also an eerie one. ElevenLabs is the same voice-cloning tech that has been used to make believable deepfakes—of a rude Taylor Swift, of Joe Rogan and Ben Shapiro debating Ratatouille, of Emma Watson reading a section of Mein Kampf. An AI scam pretending to be someone you know is far more believable when the voice on the other end can say your name just as your relatives do.

Once it became readily clear that I couldn’t stump ElevenLabs, I slotted in my middle name, Abhijit. Out came a terrible mess of syllables that would never fool me. Okay fine: I admit it’s actually pretty hard to say Saahil Abhijit Desai.

