In a post on its AI research blog, Microsoft today detailed a new language system, Speller100, that the company claims is one of the most comprehensive ever made in terms of linguistic coverage and accuracy. Comprising a number of AI models that collectively correct spelling in more than 100 languages, Speller100 now powers all spelling correction on Bing, which previously supported spell check for only about two dozen languages.
For languages with little web presence, it’s challenging to collect an amount of data sufficient to train a spell-correcting model. Moreover, systems can’t rely solely on training data to learn the spelling of a language. At its core, spelling correction is about building an error model and a language model, and not all errors are the same. For example, non-word errors occur when a word isn’t in the vocabulary for a given language, while real-word errors occur when the word exists but doesn’t fit in a larger context.
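To make the distinction between the two error types concrete, here is a minimal Python sketch. It is not Microsoft's implementation: the vocabulary, the bigram counts, the confusable-word table, and the function names are all invented for illustration. A token missing from the vocabulary is labeled a non-word error, while a valid word is labeled a real-word error only if a confusable alternative fits the neighboring words better.

```python
# Toy data, invented for this example (not from Speller100).
VOCAB = {"a", "piece", "peace", "of", "cake", "world"}
BIGRAM_COUNTS = {
    ("a", "piece"): 45,
    ("piece", "of"): 50,
    ("of", "cake"): 40,
    ("world", "peace"): 30,
}
# Hypothetical table of valid words that are easily confused with each other.
CONFUSABLE = {"peace": ["piece"], "piece": ["peace"]}


def context_score(tokens, i, word):
    """Sum of bigram counts for `word` placed at position i among its neighbors."""
    score = 0
    if i > 0:
        score += BIGRAM_COUNTS.get((tokens[i - 1], word), 0)
    if i < len(tokens) - 1:
        score += BIGRAM_COUNTS.get((word, tokens[i + 1]), 0)
    return score


def classify(tokens):
    """Label each token as 'non-word', 'real-word', or 'ok'."""
    labels = []
    for i, tok in enumerate(tokens):
        if tok not in VOCAB:
            # The token does not exist in the language's vocabulary.
            labels.append("non-word")
        elif any(
            context_score(tokens, i, alt) > context_score(tokens, i, tok)
            for alt in CONFUSABLE.get(tok, [])
        ):
            # The token is a valid word, but an alternative fits the context better.
            labels.append("real-word")
        else:
            labels.append("ok")
    return labels


print(classify(["a", "peace", "of", "caek"]))
```

In this toy setup, "caek" is flagged as a non-word error and "peace" as a real-word error, while "peace" in "world peace" is left alone. A production system would replace the hand-built tables with a learned error model and language model.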
Speller100 is built around the concept of language families, or larger groups of languages based on similarities that multiple languages share. It also employs zero-shot learning, a technique that allows a model to learn and correct spelling without additional language-specific labeled training data.
To scale Speller100 to over 100 languages, Microsoft says it developed a spelling correction pretraining approach that relies on functions to take text extracted from webpages and generate errors like deletion, addition, rotation, and replacement. This eliminated the need for a massive dataset of misspelled searches, enabling Speller100 to reach 50% correction recall for top candidates in languages for which zero training data existed. Deployed as-is on Bing, where about 15% of searches are misspelled, it would have corrected about half of those, or roughly 7.5% of all searches.
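The error-generating functions described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Microsoft's code: the function names are hypothetical, "rotation" is interpreted here as swapping two adjacent characters, and the alphabet is limited to lowercase ASCII, whereas a multilingual system would need per-script character sets.

```python
import random
import string


def delete_char(word, rng):
    """Deletion: drop one character at a random position."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def add_char(word, rng, alphabet):
    """Addition: insert one random character at a random position."""
    i = rng.randrange(len(word) + 1)
    return word[:i] + rng.choice(alphabet) + word[i:]


def rotate_chars(word, rng):
    """Rotation (assumed here to mean swapping two adjacent characters)."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def replace_char(word, rng, alphabet):
    """Replacement: overwrite one character with a random one."""
    if not word:
        return word
    i = rng.randrange(len(word))
    return word[:i] + rng.choice(alphabet) + word[i + 1:]


def make_training_pair(word, rng, alphabet=string.ascii_lowercase):
    """Return (noisy, clean): a synthetic misspelling and its correction target."""
    op = rng.choice(["delete", "add", "rotate", "replace"])
    if op == "delete":
        noisy = delete_char(word, rng)
    elif op == "add":
        noisy = add_char(word, rng, alphabet)
    elif op == "rotate":
        noisy = rotate_chars(word, rng)
    else:
        noisy = replace_char(word, rng, alphabet)
    # Note: the noisy form can occasionally equal the clean word, for example
    # when two identical adjacent letters are swapped.
    return noisy, word


rng = random.Random(0)  # seeded so runs are repeatable
for clean_word in ["spelling", "correction", "language"]:
    print(make_training_pair(clean_word, rng))
```

Applied to clean text scraped from webpages, functions like these yield (misspelled, correct) pairs without any log of real user typos, which is the property that makes the approach usable for languages with no misspelled-search data.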
To improve performance even further, Microsoft leveraged the orthographic, morphological, and semantic similarities between languages to build a dozen or so language family-based models. This maximized the zero-shot benefit and kept Speller100 compact enough for runtime, making the system well-suited to spelling correction for languages with relatively little training data, like Afrikaans and Luxembourgish.
Microsoft says that to date on Bing, Speller100 has reduced the number of pages with no results by up to 30% and the number of times users have to manually reformulate their searches by 5%. It has also raised the rate at which users click on Bing’s spelling suggestion from 8% to 67%.
Microsoft says it plans to implement Speller100 in more of its products going forward.
“Spelling correction is the very first component in the Bing search stack because searching for the correct spelling of what users mean improves all downstream search components,” principal applied science manager Jingwen Lu, principal applied software engineering manager Jidong Long, and vice president Rangan Majumder wrote in the blog post. “Our spelling correction technology powers several product experiences across Microsoft. Since it is important to us to provide all customers with access to accurate, state-of-the-art spelling correction, we are improving search so that it is inclusive of more languages from around the world with the help of large-scale AI.”