Uber AI ‘reliably’ completes all stages in Montezuma’s Revenge
In a blog post and forthcoming paper, AI scientists at Uber describe Go-Explore, a family of so-called quality diversity AI models capable of achieving maximum scores of over 2,000,000 on Montezuma’s Revenge and average scores over 400,000. (That’s compared to the current state-of-the-art model’s average and maximum score of 10,070 and 17,500, respectively.) Furthermore, in testing, the models were able to “reliably” solve the entire game up to level 159.
Additionally, and no less notably, the researchers claim that Go-Explore is the first AI system to achieve a score higher than 0 in the Atari 2600 game Pitfall, reaching 21,000 and “far surpassing” average human performance.
“All told, Go-Explore advances the state of the art on Montezuma’s Revenge and Pitfall by two orders of magnitude,” the Uber team wrote. “It does not require human demonstrations, but also beats the state-of-the-art performance on Montezuma’s Revenge of imitation learning algorithms that are given the solution in the form of human demonstrations … Go-Explore differs radically from other deep RL algorithms. We think it could enable rapid progress in a variety of important, challenging problems, especially robotics.”
The problem most AI models find difficult to overcome with Montezuma’s Revenge is its “sparse rewards”; completing a stage requires learning complex tasks with infrequent feedback. Complicating matters, what little feedback the game provides is often deceptive, meaning that it encourages AI to maximize rewards in the short term instead of working toward a big-picture goal (for example, hitting an enemy repeatedly instead of climbing a rope close to the exit).
One way to solve the sparse rewards problem is by adding bonuses for exploration, otherwise known as intrinsic motivation (IM). But even models that make use of IM struggle with Montezuma’s Revenge and fail on Pitfall — the researchers theorize that a phenomenon known as detachment is to blame. Basically, algorithms “forget” about promising areas they’ve visited before, and so don’t return to them to find out whether they lead to new places or states. As a result, AI agents stop exploring, or stall when areas close to where they visited have already been explored.
“Imagine an agent between the entrances to two mazes. It may by chance start exploring the West maze and IM may drive it to learn to traverse, say, 50 percent of it,” the researchers wrote. “The agent may at some point begin exploring the East maze, where it will also encounter a lot of intrinsic rewards. After completely exploring the East maze, it has no explicit memory of the promising exploration frontier it abandoned in the West maze. It likely would also have no implicit memory of this frontier either … Worse, the path leading to the frontier in the West maze has already been explored, so no (or little) intrinsic motivation remains to rediscover it.”
The researchers propose a two-phase solution: exploration and robustification.
In the exploration phase, Go-Explore builds an archive of different game states, called cells, along with the trajectory that led to each one and the score that trajectory achieved. It repeatedly chooses a cell from the archive, returns to it, and explores from it. For every cell it visits along the way, it replaces the stored trajectory if the new one is better (i.e., the score is higher).
The aforementioned cells are merely downsampled game frames: 11-by-8 grayscale images with eight possible pixel intensities. Frames similar enough not to warrant separate exploration are conflated into the same cell.
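To make the idea of a cell concrete, here is a minimal sketch of how a frame might be reduced to a coarse key. It assumes the grid is 11 pixels wide by 8 high, that blocks of the original frame are averaged, and that the frame arrives as rows of 0-255 grayscale values; the function name `frame_to_cell` and the averaging method are illustrative assumptions, not details taken from the researchers' code.

```python
def frame_to_cell(frame, out_w=11, out_h=8, levels=8):
    """Map a grayscale frame (rows of 0-255 ints) to a hashable cell key.

    Assumption: block averaging followed by quantization. The real
    system may downsample differently.
    """
    in_h, in_w = len(frame), len(frame[0])
    cell = []
    for cy in range(out_h):
        # Rows of the original frame covered by this output row.
        y0 = cy * in_h // out_h
        y1 = max((cy + 1) * in_h // out_h, y0 + 1)
        for cx in range(out_w):
            # Columns of the original frame covered by this output column.
            x0 = cx * in_w // out_w
            x1 = max((cx + 1) * in_w // out_w, x0 + 1)
            block = [frame[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            mean = sum(block) / len(block)
            # Quantize the 0-255 mean into `levels` intensity buckets.
            cell.append(min(int(mean * levels / 256), levels - 1))
    return tuple(cell)
```

Two frames that differ by only a few pixels produce the same tuple, so they count as one cell in the archive, while visibly different frames produce different tuples.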
The exploration phase confers a number of advantages. Thanks to the aforementioned archive, Go-Explore is able to remember and return to “promising” areas for exploration. By first returning to cells (by loading the game state) before exploring from them, it avoids over-exploring easily reached places. And because Go-Explore is able to visit all reachable states, the researchers claim it’s less susceptible to deceptive reward functions.
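The choose, return, explore, and update cycle described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not Uber's implementation: `CorridorEnv` is a hypothetical environment invented for the example, cells are chosen uniformly at random (the researchers may weight the choice differently), and the cell is simply the agent's position.

```python
import random


class CorridorEnv:
    """Hypothetical toy game: walk a line, +1 the first time each multiple of 5 is reached."""

    def __init__(self, length=30):
        self.length = length
        self.reset()

    def reset(self):
        self.pos, self.collected = 0, frozenset()

    def get_state(self):
        return (self.pos, self.collected)

    def set_state(self, state):
        self.pos, self.collected = state

    def step(self, action):  # action is -1 (left) or +1 (right)
        self.pos = max(0, min(self.length, self.pos + action))
        if self.pos > 0 and self.pos % 5 == 0 and self.pos not in self.collected:
            self.collected = self.collected | {self.pos}
            return 1
        return 0

    def cell(self):
        return self.pos


def explore(env, iterations=2000, steps_per_iter=10, seed=0):
    rng = random.Random(seed)
    env.reset()
    # cell -> (score, list of actions, restorable game state)
    archive = {env.cell(): (0, [], env.get_state())}
    for _ in range(iterations):
        # 1. Choose a cell from the archive (uniformly, as an assumption).
        score, trajectory, state = archive[rng.choice(list(archive))]
        # 2. Return to it by loading the saved game state.
        env.set_state(state)
        trajectory = list(trajectory)
        # 3. Explore from it with purely random actions.
        for _ in range(steps_per_iter):
            action = rng.choice((-1, 1))
            score += env.step(action)
            trajectory.append(action)
            best = archive.get(env.cell())
            # 4. Store the trajectory if the cell is new or the score is higher.
            if best is None or score > best[0]:
                archive[env.cell()] = (score, list(trajectory), env.get_state())
    return archive
```

Because each archive entry stores a restorable state, the agent never has to rediscover the path to a frontier it abandoned earlier, which is the failure mode the researchers call detachment.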
Another, optional element of Go-Explore improves its robustness further: domain knowledge. Instead of relying only on downsampled frames, cells can be built from information about the game, which on Montezuma’s Revenge includes stats extracted directly from pixels, such as the x and y positions, the current room, and the current number of keys held.
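As a rough illustration of a domain-knowledge cell, the sketch below builds a key from the four quantities the article lists. The name `DomainCell`, the helper `domain_cell`, and the 16-pixel bucket size are all assumptions made for the example; the article does not say how finely positions are discretized.

```python
from typing import NamedTuple


class DomainCell(NamedTuple):
    """Hypothetical cell key built from game facts instead of raw pixels."""
    x_bucket: int
    y_bucket: int
    room: int
    keys: int


def domain_cell(x, y, room, keys, bucket=16):
    # Coarsen the position so nearby pixels share a cell (bucket size is assumed).
    return DomainCell(x // bucket, y // bucket, room, keys)
```

With this key, standing a few pixels apart in the same room counts as the same cell, while picking up a key creates a new cell even at the same position.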
The robustification stage acts as a shield against noise. If Go-Explore’s solutions are not robust to noise, it robustifies them into a deep neural network — layers of mathematical functions that mimic the behavior of neurons in the human brain — with an imitation learning algorithm.
In testing, when set loose on Montezuma’s Revenge, Go-Explore reached an average of 37 rooms and solved the first level 65 percent of the time. That’s better than the previous state of the art, which explored 22 rooms on average.
The current incarnation of Go-Explore taps a technique known as imitation learning to learn policies from demonstrations of the task at hand. The demonstrations in question can be performed by a human, but alternatively, the first phase of Go-Explore automatically generates them.
A full 100 percent of Go-Explore’s generated policies solved the first level of Montezuma’s Revenge, achieving a mean score of 35,410 — more than three times the previous state of the art of 10,070 and slightly better than the average for human experts of 34,900.
With domain knowledge added to the mix, Go-Explore performed even better. It found 238 rooms and solved over nine levels on average. And after robustification, it reached a mean of 29 levels and a mean score of 469,209.
“Go-Explore’s max score is substantially higher than the human world record of 1,219,200, achieving even the strictest definition of ‘superhuman performance,’” the researchers wrote. “This shatters the state of the art on Montezuma’s Revenge both for traditional RL algorithms and imitation learning algorithms that were given the solution in the form of a human demonstration.”
As for Pitfall, which requires more significant exploration and has sparser rewards (32 scattered over 255 rooms), Go-Explore was able to visit all 255 rooms and collect over 60,000 points in the exploration phase, with knowledge only of the position on the screen and the room number.
From trajectories collected in the exploration phase, the researchers managed to robustify trajectories that collect more than 21,000 points, outperforming both the state of the art and average human performance.
They leave models with “more intelligent” exploration policies and learned representations to future work.
“It is remarkable that Go-Explore works by taking entirely random actions during exploration (without any neural network!) and that it is effective even when applied on a very simple discretization of the state space,” the researchers wrote. “Its success despite such surprisingly simplistic exploration strongly suggests that remembering and exploring from good stepping stones is a key to effective exploration, and that doing so even with otherwise naive exploration helps the search more than contemporary methods for finding new states and representing those states.”