Researchers at the University of Science and Technology of China in Hefei claim to have made progress, though. In a paper published on the preprint server Arxiv.org this week (“Deep Fusion: An Attention Guided Factorized Bilinear Pooling for Audio-video Emotion Recognition”), they describe an AI system that can recognize a person’s emotional state with state-of-the-art accuracy on a popular benchmark.
“Automatic emotion recognition (AER) is a challenging task due to the abstract concept and multiple expressions of emotion,” they wrote. “Inspired by this cognitive process in human beings, it’s natural to simultaneously utilize audio and visual information in AER … The whole pipeline can be completed in a neural network.”
Part of the team’s AI system consists of audio-processing algorithms that, with speech spectrograms (visual representations of the spectrum of frequencies of sound over time) as input, help the overall AI model home in on the regions most relevant to emotion. A second component runs video frames of faces through two computational layers: a basic face detection algorithm and a trio of “state-of-the-art” face recognition networks “fine-tuned” to make them “emotion-relevant.” It’s a trickier undertaking than it sounds. As the paper’s authors note, not all frames contribute equally to an emotional state, so they had to implement an attention mechanism that picks out the important frames.
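To make the idea of frame-level attention concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the scoring rule (a dot product with a query vector), the function names, and the toy numbers are all assumptions chosen for illustration. Each frame gets a score, the scores are turned into weights that sum to one, and the frame features are averaged using those weights.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(frame_features, query):
    """Weight each frame by its relevance and return the weighted average.

    `query` is a hypothetical stand-in for a learned vector; in a real
    model it would be trained along with the rest of the network.
    """
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frame_features]
    weights = softmax(scores)
    dim = len(frame_features[0])
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frame_features))
        for d in range(dim)
    ]
    return weights, pooled

# Toy data: three frames with two features each (made-up values).
frames = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
query = [1.0, 1.0]
weights, pooled = attention_pool(frames, query)
```

In this toy example the second frame scores highest, so it dominates the pooled feature vector, which is the behavior the researchers describe wanting from frames that carry more emotional information.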
After features (that is, measurable characteristics) have been extracted by the facial recognition networks, they’re fused with the speech features to “deeply capture” associations between them for a final emotion prediction.
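The fusion step named in the paper's title, factorized bilinear pooling, can be illustrated with a short sketch. This follows one common formulation of the technique (project each modality, multiply element-wise, then sum over small groups) and is not taken from the authors' code. The dimensions, the random matrices standing in for learned weights, and the function names are all assumptions.

```python
import random

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def factorized_bilinear_pool(audio, video, U, V, k):
    """Fuse two feature vectors with a low-rank bilinear interaction.

    U and V project each modality to out_dim * k values; the projections
    are multiplied element-wise and summed in groups of k.
    """
    projected_audio = matvec(U, audio)
    projected_video = matvec(V, video)
    joint = [a * v for a, v in zip(projected_audio, projected_video)]
    return [sum(joint[i:i + k]) for i in range(0, len(joint), k)]

# Hypothetical sizes and random weights, for illustration only.
rng = random.Random(0)
audio_dim, video_dim, out_dim, k = 4, 6, 3, 2
U = [[rng.uniform(-1, 1) for _ in range(audio_dim)] for _ in range(out_dim * k)]
V = [[rng.uniform(-1, 1) for _ in range(video_dim)] for _ in range(out_dim * k)]
audio = [rng.uniform(-1, 1) for _ in range(audio_dim)]
video = [rng.uniform(-1, 1) for _ in range(video_dim)]

fused = factorized_bilinear_pool(audio, video, U, V, k)
```

Because every output value depends on products of audio and video terms, the fused vector reflects interactions between the two modalities rather than just placing their features side by side.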
To “teach” the AI model to classify emotions, the team fed it 653 video and corresponding audio clips from AFEW8.0, a database of film and television shows used in the audio-video subchallenge of EmotiW2018, a grand challenge at the ACM International Conference on Multimodal Interaction. In tests, it held its own, correctly categorizing emotions from seven choices (“angry,” “disgust,” “fear,” “happy,” “neutral,” “sad,” and “surprise”) 62.48 percent of the time on a validation set of 383 samples. Moreover, the researchers demonstrated that its video frame analyses were influenced by audio signals. In other words, the AI system took the relationship between speech and facial expressions into account in making its predictions.
That said, the model tended to fare better with emotions that had “obvious” characteristics, like “angry,” “happy,” and “neutral,” while struggling with “disgust,” “surprise,” and other emotions whose expressions were “weak” or easily confused with one another. Still, it performed nearly as well as a previous approach that employed five visual models and two audio models.
“Compared with the state-of-the-art approach,” the researchers wrote, “[our] proposed approach can achieve a comparable result with a single model, and make a new milestone with multi-models.”