A new study by a team of researchers at MIT, the MIT-IBM Watson AI Lab, and DeepMind demonstrates the potential of symbolic AI applied to an image comprehension task. They say that in tests, their hybrid model learned object-related concepts like color and shape, then used that knowledge to suss out object relationships in a scene with minimal training data and “no explicit programming.”
“One way children learn concepts is by connecting words with images,” said study lead author Jiayuan Mao in a statement. “A machine that can learn the same way needs much less data, and is better able to transfer its knowledge to new scenarios.”
The team’s model comprises a perception component that translates images into an object-based representation, and a language layer that extracts meanings from words and sentences and creates “symbolic programs” (i.e., instructions) that tell the AI how to answer a given question about the scene. A third module runs the symbolic programs on the scene and spits out an answer, updating the model when it makes mistakes.
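To make the third module's job more concrete, here is a minimal sketch of running a symbolic program against an object-based scene. It is an illustration under stated assumptions, not the researchers' implementation: the scene format, the `run_program` function, and the operation names (`filter`, `count`, `query`) are all hypothetical, and the sketch uses exact attribute matching where the real model would presumably rely on learned, fuzzier concept representations.

```python
# Hypothetical object-based scene: one dict of attributes per detected object.
# In the study, a perception component would produce this from an image.
scene = [
    {"color": "green", "shape": "cylinder", "material": "rubber", "size": "large"},
    {"color": "blue", "shape": "sphere", "material": "metal", "size": "small"},
    {"color": "red", "shape": "cube", "material": "metal", "size": "large"},
]


def run_program(program, scene):
    """Execute a list of (operation, argument) steps against a scene.

    The operation names are assumptions made for this sketch.
    """
    selected = list(scene)  # start with every object in the scene
    result = selected
    for op, arg in program:
        if op == "filter":
            # Keep only the objects whose attribute matches the given value.
            attribute, value = arg
            selected = [obj for obj in selected if obj[attribute] == value]
            result = selected
        elif op == "count":
            # Answer with the number of objects currently selected.
            result = len(selected)
        elif op == "query":
            # Answer with one attribute of the single selected object.
            if len(selected) != 1:
                raise ValueError("query needs exactly one selected object")
            result = selected[0][arg]
        else:
            raise ValueError(f"unknown operation: {op}")
    return result


# "What's the color of the cube?"
color_program = [("filter", ("shape", "cube")), ("query", "color")]
# "How many metal objects are there?"
count_program = [("filter", ("material", "metal")), ("count", None)]

color_answer = run_program(color_program, scene)
count_answer = run_program(count_program, scene)
print(color_answer, count_answer)
```

In this toy version, the language layer's role would be to turn a question into a list like `color_program`. Because each step is explicit, a wrong answer can be traced back to the step that went astray.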
The researchers trained it on images paired with related questions and answers from Stanford University’s CLEVR image comprehension dataset. (For example: “What’s the color of the object?” and “How many objects are both right of the green cylinder and have the same material as the small blue ball?”) The questions grew progressively harder as the model learned, and once it mastered object-level concepts, the model advanced to learning how to relate objects and their properties to each other.
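The easy-to-hard progression described above can be sketched as a simple curriculum. This is a rough illustration only: the article does not say how difficulty was measured, so the sketch assumes it can be approximated by the number of steps in a question's program. The `curriculum_stages` function, the step names, and the thresholds are all hypothetical.

```python
# Two example questions with hypothetical program steps attached.
questions = [
    {
        "text": "What's the color of the object?",
        "program": ["scene", "query_color"],
    },
    {
        "text": (
            "How many objects are both right of the green cylinder and "
            "have the same material as the small blue ball?"
        ),
        "program": [
            "filter_green", "filter_cylinder", "relate_right",
            "filter_small", "filter_blue", "filter_sphere",
            "same_material", "intersect", "count",
        ],
    },
]


def curriculum_stages(questions, max_steps_per_stage):
    """Build cumulative training stages ordered from easy to hard.

    Stage i holds every question whose program has at most
    max_steps_per_stage[i] steps, so later stages also keep the
    easier questions seen earlier.
    """
    stages = []
    for max_steps in max_steps_per_stage:
        stage = [q for q in questions if len(q["program"]) <= max_steps]
        stages.append(stage)
    return stages


# Assumed thresholds: short object-level questions first, relational ones later.
stages = curriculum_stages(questions, max_steps_per_stage=[3, 10])
for number, stage in enumerate(stages, start=1):
    print(f"stage {number}: {len(stage)} question(s)")
```

A training loop would then work through `stages` in order, moving on only once the model handles the current stage well.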
In experiments, it was able to interpret new scenes and concepts “almost perfectly,” the researchers report, handily outperforming other bleeding-edge AI systems while using just 5,000 images and 100,000 questions (compared with the 70,000 images and 700,000 questions in the full CLEVR training set). The team leaves to future work improving its performance on real-world photos and extending it to video understanding and robotic manipulation.