Sapienz: Facebook’s push to automate software testing

It can take 15 years or more for research to transfer from academia to full industrial deployment. For the founders of Majicke, an automated software testing startup created out of University College London (UCL), it took not much over a year.

In September 2016, a trio of UCL researchers founded Majicke with the idea of building on decades of search-based software engineering (SBSE) research to create tools that automate the process of finding test cases. Traditionally designed by humans, test cases are used to determine whether software will function correctly under different circumstances. Majicke’s core product was Sapienz, a tool that leverages SBSE to automatically generate test sequences and find crashes.

In January 2017, Facebook announced that it was acqui-hiring Majicke’s founders, Professor Mark Harman (scientific advisor), Ke Mao (CTO), and Yue Jia (CEO), alongside some of the company’s assets — while Majicke itself was wound down.

Above: Sapienz: Ke Mao (CTO), Mark Harman (scientific advisor), and Yue Jia (CEO)

Today, Harman is an engineering manager at Facebook, where he is able to test the impact of his research on products used by billions of people — though he also maintains a part-time academic position at UCL. Mao and Jia are also now software engineers at Facebook.

Facebook already uses artificially intelligent software across its suite of public-facing products to automate myriad processes, from detecting illegal content to assisting with translations. Behind the scenes, the company has also been pushing to scale automated software testing and verification across its products in order to detect glitches long before they hit Google’s or Apple’s app stores.

Back in 2013, Facebook announced it was acquiring Monoidics, the London-based developer behind a static automated code verification tool called Infer Static Analyzer, which was designed to identify buggy mobile code early on and then demonstrate that the bug had been fixed. Around the same time, Harman and his team at UCL were doing research on generating test cases, a technique related to verification. “In testing, you try to find the presence of bugs so you can get rid of them, and in verification you prove the absence of bugs,” Harman said in a Q&A session held at Facebook’s London HQ.

The Monoidics acquisition, ultimately, was to be the genesis for Harman’s startup.

“We thought we should have a startup, too, if we were going to have an impact with this research,” Harman continued. “So we set up a startup called Majicke.”

Breaking things

Facebook has been known for its “move fast and break things” mantra since it first launched on the web 14 years ago. But with the advent of native mobile phone apps, rolling out fixes for bugs isn’t quite so easy. If a bug is found on the web, an update can be rolled out immediately, but mobile apps require the user to physically update their app to get a fix, which makes it all the more important to find bugs well before the app ships.

Above: Infer at work

A widely accepted principle in the software engineering realm is that the later a bug is caught, the more effort — and cost — goes into fixing it. This is where both Infer and Sapienz come into play.

Infer is actually complementary to Sapienz, and both teams still work from Facebook’s engineering hub in London. Together, the products let programmers build code without spending too much time testing for bugs.

Infer is what is known as a “static” analysis tool that is useful earlier in the development process, before the code is executed, while Sapienz is a dynamic analysis tool, which means it’s designed for an executable “runtime” environment. Infer basically pinpoints code that it thinks looks dodgy, while Sapienz confirms it by running the code and finding a crash.

“Sapienz’ job is to run the code in a realistic environment to see if it can cause a failure in practice,” Harman said. “If Sapienz finds a real problem, and Infer had a likely possible cause, then if we connect those two up we’ve got all the path between cause and effect.”
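To make the static/dynamic distinction concrete, here is a minimal Python sketch. It is not how Infer or Sapienz work internally, which this article does not describe; the `greeting` function, the `.get()` heuristic, and both helper names are assumptions invented for illustration. The static check reads the source without running it and flags a value that might be missing, while the dynamic check runs the code to see whether it really fails.

```python
import ast

# Hypothetical code under analysis, held as text so it can be inspected without running it.
SOURCE = '''
def greeting(users, name):
    user = users.get(name)   # dict.get returns None for a missing key
    return "Hello, " + user.upper()
'''

def static_warnings(source):
    """Static pass: flag attribute access on variables assigned from a .get() call."""
    tree = ast.parse(source)
    maybe_none = set()
    for node in ast.walk(tree):
        if (isinstance(node, ast.Assign)
                and isinstance(node.value, ast.Call)
                and isinstance(node.value.func, ast.Attribute)
                and node.value.func.attr == "get"):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    maybe_none.add(target.id)
    warnings = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Attribute)
                and isinstance(node.value, ast.Name)
                and node.value.id in maybe_none):
            warnings.append((node.lineno, node.value.id))
    return warnings

def dynamic_confirms(source, users, name):
    """Dynamic pass: execute the code and report whether it actually crashes."""
    namespace = {}
    exec(source, namespace)
    try:
        namespace["greeting"](users, name)
    except AttributeError:
        return True
    return False

if __name__ == "__main__":
    print("static suspects:", static_warnings(SOURCE))
    print("crash confirmed:", dynamic_confirms(SOURCE, {"ann": "ann"}, "bob"))
```

The static pass can only say the line looks suspicious. It takes a run with a missing name to show the failure happens in practice, which is the cause-and-effect link Harman describes.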

Sapienz runs on a whole bunch of emulators rather than the live version of an app — remember, the goal is to catch bugs before they ship. Here you can see an example of various instances of Facebook’s apps being tested by Sapienz — basically creating test sequences to try to catch problems in the code.

Above: Examples of Facebook apps being tested in emulators.

The most common bug identified by Sapienz is what is known in the industry as a null pointer error, in which a line of code tries to use a reference that does not point to a valid object.
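The idea of generating test sequences to hunt for this kind of crash can be sketched in a few lines of Python. This is a deliberately simple random search followed by greedy shortening, not the search algorithm Sapienz uses; `ToyApp`, its events, and every function name are hypothetical stand-ins. Python has no null pointers as such, so the bug here is the closest analogue: using a value that has been set to `None`.

```python
import random

# Hypothetical UI events for a toy app; none of these come from Facebook's apps.
EVENTS = ["open_profile", "log_out", "tap_avatar", "scroll"]

class ToyApp:
    """Stand-in for an app under test, with one planted bug."""

    def __init__(self):
        self.user = {"name": "ann"}
        self.profile_open = False

    def handle(self, event):
        if event == "open_profile":
            self.profile_open = True
        elif event == "log_out":
            self.user = None              # bug: the profile screen stays open
        elif event == "tap_avatar" and self.profile_open:
            return self.user["name"]      # raises TypeError when self.user is None
        return None

def run(sequence):
    """Replay a sequence on a fresh app; return the crashing index, or None."""
    app = ToyApp()
    for index, event in enumerate(sequence):
        try:
            app.handle(event)
        except TypeError:
            return index
    return None

def search_for_crash(seed=0, attempts=200, length=8):
    """Random search: try event sequences until one crashes the app."""
    rng = random.Random(seed)
    for _ in range(attempts):
        sequence = [rng.choice(EVENTS) for _ in range(length)]
        if run(sequence) is not None:
            return sequence
    return None

def minimize(sequence):
    """Greedily drop events while the shortened sequence still crashes."""
    i = 0
    while i < len(sequence):
        candidate = sequence[:i] + sequence[i + 1:]
        if run(candidate) is not None:
            sequence = candidate
        else:
            i += 1
    return sequence

if __name__ == "__main__":
    crashing = search_for_crash()
    print("crashing sequence:", crashing)
    print("minimized:", minimize(crashing))
```

Shortening the sequence matters as much as finding it: a three-step reproduction is far easier for an engineer to act on than a long run of random taps.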

The ultimate goal of Sapienz is, of course, to expedite crash fixes so the final version of an app update is as polished as possible. But it’s also about allowing developers to move faster on the actual writing of new code, and to work on things that are more interesting.

“They [developers] would much rather be creative and create new products than try to work out why this particular pointer here was referencing something it shouldn’t or was a null,” Harman said.

Deployed

Sapienz was deployed for the first time in Facebook’s main Android app in September 2017. This represented a rapid rise in fortunes for Sapienz’ creators, in particular CTO Ke Mao, who worked as chief developer of the first incarnation of Sapienz while he was a PhD student.

“He was able to go from being a PhD student to joining Facebook and seeing the work in his PhD deployed … I mean, it was starting to be deployed even before he’d submitted his thesis,” Harman added. “There’s research that shows how long it takes for an idea to go from conception to practice — 15 to 17 years it can take to go from academic research to industrial deployment. This PhD student did it in 17 months, if not fewer.”

In the months since its first deployment, Sapienz has been expanded to cover Facebook’s other Android apps, including those for Messenger, Instagram, and Workplace, as well as the main Facebook iOS app.

So what induces an esteemed computer engineering professor to join a company such as Facebook? Well, it all comes down to application at scale — the ability to see the impact of their work on more than 2 billion people.

“One of the things that attracts scholars to come work here [at Facebook] is that the biggest challenge in software engineering is scalability: how do you scale up the techniques you’re applying?” Harman said. “In a university, you can work on fairly small-scale examples in laboratory conditions, but what you really want to be able to do is see ‘Can my ideas apply at very big scale?’”

According to Harman, around 100,000 changes are made to Facebook’s various products each week, which affords a significant opportunity to test Sapienz at scale.

“That kind of scale, as an academic … we can’t find that in very many other places,” he added.

Fixer upper

According to Harman, 75 percent of the crashes Sapienz reports end up getting fixed, which means that Sapienz is, more often than not, flagging genuine issues in the code.

“For an automated technique to have a fix rate of 75 percent is pretty impressive, because it’s very easy for an automated technique to generate all sorts of irrelevant noise for engineers,” he said.

As Facebook continues honing its bug-finding smarts, it’s simultaneously working on automated technology that will fix the code. “Our dream is a world in which we can automatically find faults in software and then automatically fix them, as well,” Harman added.

A few months back, Facebook unveiled SapFix, which is already in the early stages of deployment in the Facebook Android app. SapFix automatically generates fixes for specific bugs, though the final call on whether to accept the fix is made by a human engineer.

Underpinning this is a tool called Getafix, which provides fixes for bugs found by both Infer and Sapienz, and which learns from previous fixes conducted by engineers — so any recommendations it makes “are intuitive for engineers to review,” according to Facebook.

What we’re now seeing is a situation in which Infer and Sapienz are used to find and flag bugs and crashes, which will then trigger a patch generator via SapFix to fix the issues.
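That find, patch, and review loop can be illustrated with a small Python sketch. It is an assumption-laden toy rather than SapFix's actual method, which this article does not detail: the candidate patches are written by hand, and `render_name`, `crashes`, `propose_fix`, and the `reviewer` callback are all hypothetical names. The one point it mirrors from the description above is that a candidate must first survive automated re-testing and is only accepted once a human signs off.

```python
def render_name(user):
    """Buggy original: crashes when user is None."""
    return user["name"].title()

def patch_guard(user):
    """Candidate fix: fall back to a placeholder when there is no user."""
    if user is None:
        return "Guest"
    return user["name"].title()

def patch_bad(user):
    """Deliberately poor candidate: still uses the missing user."""
    return user["name"]

# Inputs that reproduce the crash, plus one normal case (hypothetical data).
TEST_INPUTS = [None, {"name": "ann"}]

def crashes(func, test_inputs):
    """Re-run the function on every input and report whether any run fails."""
    for value in test_inputs:
        try:
            func(value)
        except Exception:
            return True
    return False

def propose_fix(candidates, test_inputs, reviewer):
    """Return the first candidate that passes re-testing and human review."""
    for candidate in candidates:
        if crashes(candidate, test_inputs):
            continue                  # automated validation rejects it
        if reviewer(candidate):       # the final call stays with a person
            return candidate
    return None

if __name__ == "__main__":
    approve_all = lambda candidate: True   # stand-in for an engineer's decision
    chosen = propose_fix([patch_bad, patch_guard], TEST_INPUTS, approve_all)
    print("accepted patch:", chosen.__name__ if chosen else None)
```

Passing the reviewer in as a function keeps the human decision explicit: if the reviewer declines every candidate, nothing is applied.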

“This is very much bleeding edge, and it’s also a very current hot topic in the research community internationally,” Harman said. “We wanted to take all this technology, and the unique position we find ourselves in with both static and dynamic analysis, and see whether we can combine all these techniques to automatically fix some of the bugs we’re finding.”

As noted, 75 percent of bugs reported by Sapienz are fixed, but only a small portion of those are currently being fixed by SapFix — and yes, most of those are null pointers.

“About half of those that SapFix tries to fix, they actually work out to be good fixes and are accepted once checked [by an engineer],” Harman added.

Redundant?

To the casual observer, it may appear that we’re fast heading to a world in which developers will be redundant — or, at least, a significant chunk of them. But Harman doesn’t think that will be the case. For now, human developers still review the final code before it’s catapulted into the main codebase, and of course they have to generate the code in the first place.

“We wouldn’t let an automated technology loose on our codebase without having developer oversight,” Harman said.

But what about years into the future? Does Harman ever envisage a day when software engineers are sidelined?

“Theoretically, you could get to that place, but I’m not sure practically whether we would want to do that,” he continued. “Psychologists have studied for a long time the difference between ‘generating’ and ‘checking’, and checking is usually an order of magnitude easier than generating.”

A good analogy here would perhaps be that of a spell-check program on a computer. Though machines are getting better at generating meaningful text, for example in sports reporting, it’s not clear that they will ever be able to rival humans at generating prose and other creative works. But most people now use spell-checking systems to spot errors in their text, and desktop publishing has allowed anyone to produce professional-grade publications without complex equipment.

Could automated software testing and debugging have a similar impact and open up programming to more people? Harman thinks that could be one potential outcome in the future — “because coding becomes more exciting and creative, and less about the nitty gritty that puts a lot of people off,” he said.

In other words, programming becomes more about making than fixing.

Open-sourcing

In 2015, Facebook announced it was open-sourcing Infer to improve its efficacy, something the company is also planning for both Sapienz and SapFix — though it hasn’t provided a timescale for either. We are probably looking at years rather than months, though.

“Ultimately, we can make this technology available to the whole community, and it can also have just as much impact on software in general as it does on Facebook here,” Harman said. “We can make the technology open source and the community can work on this, develop it, and apply it to their problems.”

Facebook has a history of open-sourcing its technology, and the company is among the top contributors on GitHub. But it’s not purely an altruistic endeavor — open-sourcing also benefits Facebook, as the more projects Sapienz and SapFix are exposed to, the better the tools will become. The practice also plays an important role in attracting top technical talent to the company.

“One of the appeals for me, as an academic coming to Facebook, was the fact that Facebook has a good track record of making its code for infrastructural work on software engineering available,” Harman added.

Automation and AI are infiltrating just about every facet of society, so it makes sense that we’re also seeing such advances in the software engineering sphere. A few months back, Alphabet’s investment arm, GV, led a $20 million investment in automated software-testing startup Mabl, while San Francisco-based Sauce Labs has also raised big bucks for automated app testing smarts.

It seems that this concerted effort is part of a joint push to get engineers to a point where they can spend more time on creative stuff, rather than being bogged down in the nitty gritty of null pointers.