Which voice assistant speaks the most languages, and why?

Contrary to popular Anglocentric belief, English isn’t the world’s most-spoken language by the total number of native speakers, nor is it the second. In fact, the West Germanic tongue ranks third on the list, followed by Hindi, Arabic, Portuguese, Bengali, and Russian. (Mandarin and Spanish are first and second, respectively.)

Surprisingly, Google Assistant, Apple’s Siri, Amazon’s Alexa, and Microsoft’s Cortana recognize a relatively narrow slice of those. It wasn’t until this fall that Samsung’s Bixby gained support for German, French, Italian, and Spanish, languages collectively spoken by 616 million people worldwide. And it took years for Cortana to become conversant in Spanish, French, and Portuguese.

So why the snail’s pace of innovation? If you’re looking for an explanation, a good place to start might be the techniques used to train speech recognition algorithms. AI assistants, as it turns out, are a lot more complicated than meets the eye — or ear.

Why supporting a new language is so hard

Adding support for a language to a voice assistant is a multi-pronged process — one that requires a substantial amount of R&D on both the speech recognition and voice synthesis sides of the equation.

“From a voice interaction perspective, there’s two things that kind of work independent of each other,” Himi Khan, vice president of product at Clinc, a startup that builds conversational AI experiences for banks, drive-through restaurants, and automakers like Ford, told VentureBeat in an interview. “One is the speech to text — the act of taking the speech itself and converting it into some sort of visual text format. Then there’s the [natural language processing] component.”

Today, most speech recognition systems are aided by deep neural networks (layers of neuron-like mathematical functions that self-improve over time) that predict the phonemes, or perceptually distinct units of sound (for example, p, b, d, and t in the English words pad, pat, bad, and bat). Unlike automatic speech recognition (ASR) techniques of old, which relied on hand-tuned statistical models that calculated probabilities for combinations of words to occur in a phrase, deep neural nets translate sound (in the form of segmented spectrograms, or representations of the spectrum of frequencies of sound) into characters. This not only reduces error rates, but largely obviates the need for human supervision.
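To make the older statistical approach concrete, here is a minimal sketch of a bigram model that scores how likely a sequence of words is to occur in a phrase. It illustrates the legacy technique, not the neural approach, and the three-sentence corpus and function names are invented for the example; a production system would be trained on vastly more text and would smooth its estimates.

```python
from collections import Counter

# Toy training data (an assumption for illustration only).
CORPUS = [
    "turn on the lights",
    "turn off the lights",
    "turn on the radio",
]

def train_bigrams(sentences):
    """Count word pairs and how often each word appears as a left-hand context."""
    context_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()  # <s> marks the start of a phrase
        context_counts.update(words[:-1])
        bigram_counts.update(zip(words, words[1:]))
    return context_counts, bigram_counts

def phrase_probability(phrase, context_counts, bigram_counts):
    """Multiply P(word | previous word) across the phrase (no smoothing)."""
    words = ["<s>"] + phrase.split()
    probability = 1.0
    for previous, word in zip(words, words[1:]):
        if context_counts[previous] == 0:
            return 0.0  # the model has never seen this context
        probability *= bigram_counts[(previous, word)] / context_counts[previous]
    return probability

contexts, bigrams = train_bigrams(CORPUS)
likely = phrase_probability("turn on the lights", contexts, bigrams)
scrambled = phrase_probability("lights the on turn", contexts, bigrams)
```

With this corpus, the well-formed phrase scores higher than the scrambled one, which is the kind of signal such models used to pick between competing transcriptions.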

But baseline language understanding isn’t enough. Without localization, voice assistants can’t pick up on cultural idiosyncrasies or, worse, they misapply norms from one culture to another. Joe Dumoulin, chief technology innovation officer at Next IT, told Ars Technica in an interview that it takes between 30 and 90 days to build a query-understanding module for a new language, depending on how many intents it needs to cover. And even market-leading smart speakers from the likes of Google and Amazon have trouble understanding speakers with certain accents. A September test conducted by Vocalize.ai found that Apple’s HomePod and Amazon Echo devices managed to catch only 78 percent of Chinese words compared to 94 percent of English and Indian words.

“At the core level, certain languages are very, very different. In English, for example, adjectives usually come before nouns and adverbs can come before or after — there’s different rules that are in place from a grammar perspective,” Khan said. “A good example of where this becomes really difficult is if someone says, ‘Starfish.’ Depending on your speech-to-text engine and things like that, it could be easy to associate ‘star’ to ‘fish’ as an adjective, or as a single noun. There’s all sort of different terms that are used and different speech patterns you have to adapt to.”

It’s tough enough with one language. Researchers at Amazon’s Alexa AI division described one of the potential problems in August 2018. During a typical chat with an assistant, users often invoke multiple voice apps in successive questions. These apps repurpose variables — for example, “town” and “city.” If someone asks for directions and follows up with a question about a restaurant’s location, a well-trained assistant needs to be able to suss out which thread to reference in its answer.
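The “town” versus “city” problem can be sketched with a simple lookup table that maps each app’s slot names to a shared canonical name. This is a deliberately simplified stand-in, not the method the Alexa AI researchers described; the synonym table, slot names, and function names are all hypothetical.

```python
# Hypothetical equivalences between slot names used by different voice apps.
SLOT_SYNONYMS = {
    "town": "location",
    "city": "location",
    "place": "location",
}

def canonical(slot_name):
    """Map an app-specific slot name to a shared name, if one is known."""
    return SLOT_SYNONYMS.get(slot_name, slot_name)

def carry_over(previous_slots, target_slot_names):
    """Fill the slots a new app expects from values captured in earlier turns."""
    by_canonical = {canonical(name): value for name, value in previous_slots.items()}
    carried = {}
    for name in target_slot_names:
        value = by_canonical.get(canonical(name))
        if value is not None:
            carried[name] = value
    return carried

# A directions app stored the destination as "town"; a restaurant app asks for "city".
carried = carry_over({"town": "Seattle"}, ["city", "cuisine"])
```

Here the restaurant app receives the city from the earlier turn, while “cuisine” stays unfilled because nothing in the conversation supplied it.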

And then, the assistant has to respond. It wouldn’t be of much use if it couldn’t.

While cutting-edge text-to-speech (TTS) systems like Google’s Tacotron 2 (which builds voice synthesis models based on spectrograms) and WaveNet (which builds models based on waveforms) learn languages more or less from speech alone, conventional systems tap a database of phones (distinct speech sounds or gestures) strung together to verbalize words. Concatenation, as it’s called, requires capturing the complementary diphones (units of speech comprising two connected halves of phones) and triphones (phones with half of a preceding phone at the beginning and a succeeding phone at the end) in lengthy recording sessions. The number of speech units can easily exceed a thousand.
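As a rough illustration of why the unit count grows so quickly, the sketch below lists the diphones a concatenative system would need to have recorded for a handful of words. The phone spellings are informal placeholders rather than a standard phonetic alphabet, and the function names are hypothetical.

```python
def diphones(phones, boundary="_"):
    """Return the phone-to-phone transitions in a word, padded with silence."""
    padded = [boundary] + list(phones) + [boundary]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]

def inventory(words):
    """Collect every distinct diphone needed to speak the given words."""
    units = set()
    for phones in words.values():
        units.update(diphones(phones))
    return units

# Informal phone sequences (an assumption for illustration).
WORDS = {
    "pad": ["p", "a", "d"],
    "pat": ["p", "a", "t"],
    "bad": ["b", "a", "d"],
}

needed = inventory(WORDS)
```

Even these three short words call for eight distinct recorded units, and every new phone a language introduces multiplies the possible transitions.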

Another technique, known as parametric TTS, taps mathematical models to recreate sounds that are then assembled into words and sentences. The data required to generate those sounds are stored in the parameters (variables), and the speech itself is created using a vocoder, a voice codec (a coder-decoder) that analyzes and synthesizes the output signals.

Still, TTS is an easier problem to tackle than language comprehension, particularly with deep neural networks like WaveNet at data scientists’ disposal. Amazon’s Polly cloud-based TTS service supports 28 languages, and Microsoft’s Azure speech recognition API supports over 75. And already, Google, Microsoft, and Amazon offer a select few voices in Chinese, Dutch, French, German, Italian, Japanese, Korean, Swedish, and Turkish synthesized by AI systems.

Language support by assistant

Google Assistant

With the addition of more than 20 new languages in January, the Google Assistant took the crown among voice assistants in terms of the number of tongues it understands. It’s now conversant in 30 languages in 80 countries, up from 8 languages and 14 countries in 2017. They include:

Arabic (Egypt, Saudi Arabia)
Bengali
Chinese (Traditional)
Danish
Dutch
English (Australia, Canada, India, Indonesia, Ireland, Philippines, Singapore, Thailand, UK, US)
French (Canada, France)
German (Austria, Germany)
Gujarati
Hindi
Indonesian
Kannada
Italian
Japanese
Korean
Malayalam
Marathi
Norwegian
Polish
Portuguese (Brazil)
Russian
Spanish (Argentina, Chile, Colombia, Peru)
Swedish
Tamil
Telugu
Thai
Turkish
Urdu

Apple’s Siri

Apple’s Siri, which until January had Google Assistant beat in terms of sheer breadth of supported languages, comes in a close second. Currently, it supports 21 languages in 36 countries, and dozens of dialects for Chinese, Dutch, English, French, German, Italian, and Spanish:

Arabic
Chinese (Mandarin, Shanghainese, and Cantonese)
Danish
Dutch
English
Finnish
French
German
Hebrew
Italian
Japanese
Korean
Malay
Norwegian
Portuguese
Russian
Spanish
Swedish
Thai

Siri is also localized with unique voices in Australia, where voiceover artist Karen Jacobsen supplied lines and phrases, and in the U.K., where former technology journalist Jon Briggs provided his voice.

It’s a little less robust on the HomePod, however. Apple’s smart speaker gained support for French, German, and Canadian English, and with a software upgrade last fall became conversant in Spanish and Canadian French.

Microsoft’s Cortana

Cortana, which made its debut at Microsoft’s Build developer conference in April 2013 and later came to Windows 10, headphones, smart speakers, Android, iOS, Xbox One, and even Alexa via a collaboration with Amazon, might not support as many languages as Google Assistant and Siri. Still, it’s come a long way in six years. Here are the languages it recognizes:

Chinese (Simplified)
English (Australia, Canada, New Zealand, India, UK, US)
French (Canada, France)
German
Italian
Japanese
Portuguese (Brazil)
Spanish (Mexico, Spain)

Like Siri, Cortana has been extensively localized. The U.K. version — which is voiced by Anglo-French actress Ginnie Watson — speaks with a British accent and uses British idioms, while the Chinese version, dubbed Xiao Na, speaks Mandarin Chinese and has an icon featuring a face and two eyes.

Amazon’s Alexa

Alexa might be available on over 150 products in 41 countries, but it understands the fewest languages of any voice assistant:

English (Australia, Canada, India, UK, and US)
French (Canada, France)
German
Japanese (Japan)
Spanish (Mexico, Spain)

To be fair, Amazon has taken pains to localize the experience for new regions. When Alexa came to India last year, it launched with an “all-new English voice” that understood and could converse in local pronunciations.

And it’s worth noting that the situation is improving. More than 10,000 engineers are working on various components of its NLP stack, Amazon says, and the company’s bootstrapping expanded language support through crowdsourcing. Last year, it released Cleo, a gamified skill that rewards users for repeating phrases in local languages and dialects like Mandarin Chinese, Hindi, Tamil, Marathi, Kannada, Bengali, Telugu, and Gujarati.

Samsung’s Bixby

Samsung’s Bixby, the assistant built into the Seoul, South Korea-based company’s flagship and midrange Galaxy smartphone series and forthcoming Galaxy Home smart speaker, is available in 200 markets globally, but only supports a handful of languages in those countries:

English
Chinese
German
French
Italian
Korean
Spanish

Samsung has suffered NLP setbacks, historically. The Wall Street Journal reported in March 2017 that Samsung was forced to delay the release of the English version of Bixby because it had trouble getting the assistant to understand certain syntaxes and grammars.

How language support might improve in the future

Clearly, some voice assistants are further along on the language front than others. So what might it take to get them on the same footing?

A heavier reliance on machine learning might help, according to Khan.

“One of the main challenges of dealing with multi-language support is actually the grammar rules that go along with it, and having to think about and accommodate for those grammar rules,” he explained. “Most NLP models out there take a sentence, do parts-of-speech tagging — in a sense identifying the grammar, or the grammars within an utterance, and creating rules to determine how to interpret that grammar.”

With a “true” neural network stack — one that doesn’t rely heavily on language libraries, keywords, and dictionaries — the emphasis shifts from grammars to word embeddings and the relational patterns within word embeddings, Khan says. Then, it becomes possible to train a voice recognition system on virtually any language.
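The idea of relying on relational patterns within word embeddings, rather than grammar rules, can be sketched with a few hand-made vectors. The numbers below are invented for illustration and are not taken from any trained model or from Clinc’s system; real embeddings have hundreds of dimensions and are learned from data.

```python
import math

# Made-up three-dimensional "embeddings" (assumptions, not learned values).
EMBEDDINGS = {
    "balance": [0.90, 0.10, 0.00],
    "saldo": [0.85, 0.15, 0.05],  # Spanish for an account balance
    "weather": [0.00, 0.20, 0.90],
}

def cosine(u, v):
    """Cosine similarity: near 1.0 for vectors pointing the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

def nearest(word, vocabulary=EMBEDDINGS):
    """Find the other word whose vector lies closest to the given word's."""
    others = (w for w in vocabulary if w != word)
    return max(others, key=lambda w: cosine(vocabulary[word], vocabulary[w]))

closest = nearest("saldo")
```

In this toy space “saldo” lands next to “balance” without any grammar rule saying so, which is the property that lets an embedding-based system treat a new language as more data rather than a new rulebook.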

That’s Clinc’s approach: it advertises its technology as more or less language-agnostic. The company builds corpora by posing open-ended questions to a large number of native speakers, like “If you could talk to your phone and ask about your personal finances, what would you say?” It treats the responses as “tuner” datasets for real-world use.

So long as the datasets are curated and created in a native language, Clinc claims it can add support for a language with just three to 500 utterances — thousands fewer than are required with traditional, statistical methods.

“All the data we used to train our AI is curated by native speakers,” Khan said. “That way, the AI optimizes to actual consumer behavior.”

San Francisco-based Aiqudo takes a slightly different tack. The startup, which supplies the underlying technology behind Motorola’s Hello Moto assistant, focuses on intents (the actions users want an intelligent system to perform) and creates “action indexes” across categories like restaurants, movies, and geographies to map given intents to apps, services, and features.

Aiqudo’s models don’t have to understand the entire language — just the intents. From the action indexes alone, they know, for example, that “Avia” in the utterance “Make a dinner reservation for tomorrow at 7 p.m. at Avia” likely refers to a restaurant rather than a TV show.

“We don’t really necessarily understand the language per se,” CEO John Foster told VentureBeat in a phone interview. “What we do is we essentially pre-train our algorithms with repositories of data that we can acquire, and then we go and statistically rank the words by their position on the page and their position relative to other words around them on the page. That becomes our basis for reading what is one of these words mean in various different contexts.”

Localization simply entails building region-specific action indexes. (“Avia” in Barcelona is likely to refer to something different than “Avia” in Mexico City.) This not only allows Aiqudo’s models to gain support for new languages relatively quickly, but enables them to handle hybrid languages (those that combine words, expressions, and idioms from more than one language) like Spanglish.
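One way to picture a region-specific action index is as a lookup keyed by region, then by entity name, with the user’s intent deciding which category applies. The sketch below is an assumption about the general shape of such an index, not Aiqudo’s implementation, and every entry in it is made up.

```python
# Hypothetical per-region indexes: entity name -> categories it may belong to.
ACTION_INDEXES = {
    "barcelona": {"avia": ["restaurant", "tv_show"]},
    "mexico city": {"avia": ["hotel"]},
}

# Hypothetical mapping from intents to the categories they can act on.
INTENT_CATEGORIES = {
    "reserve_table": {"restaurant"},
    "book_room": {"hotel"},
    "play_video": {"tv_show"},
}

def resolve(entity, intent, region):
    """Pick the category for an entity that fits both the region and the intent."""
    candidates = ACTION_INDEXES.get(region, {}).get(entity.lower(), [])
    allowed = INTENT_CATEGORIES.get(intent, set())
    for category in candidates:
        if category in allowed:
            return category
    return None  # nothing in this region's index matches the intent

in_barcelona = resolve("Avia", "reserve_table", "barcelona")
in_mexico_city = resolve("Avia", "reserve_table", "mexico city")
```

The same utterance resolves to a restaurant against the Barcelona index and to nothing against the Mexico City one, so supporting a new market becomes a matter of populating another index rather than retraining a language model.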

“Our models don’t get confused by [hybrid languages], because [when] they look at a Hindi sentence, they just look for the intent. And if some of the words are English and some of the words are in Hindi, that’s OK,” Foster said. “Most of what you need in terms of understanding intents already exists in English, so it’s just a matter of understanding those intents in the next language.”

Undoubtedly, Google, Apple, Microsoft, Amazon, Samsung, and others are already using techniques like those described by Foster and Khan to bring new languages to their respective voice assistants. But some had a head start, and others have to contend with legacy systems. That’s why Foster thinks it’ll take time before they’re all speaking the same languages.

He’s optimistic that they’ll get there eventually, though. “Understanding what the user said and the action that they want is ultimately what a voice assistant has to do for users,” he said.