2025-09-25
Haruka Asanuma
This time, I will report on the experimental results of how well the latest Vision-Language Models (VLMs), Gemini and GPT, can recognize facial expressions, using the In the Wild dataset we previously investigated.
For the experiment, we used Gemini 2.0 Flash and GPT-4o. Both models are representative commercial VLMs at present, boasting high performance in various image recognition benchmarks.
For the dataset, we used CAER, which was introduced in the previous article. CAER consists of video data filmed in everyday environments, making it suitable for measuring AI performance in real-world scenarios. Such datasets are referred to as 'In the Wild,' meaning they represent the more complex and unpredictable real world, rather than ideal laboratory environments. We also conducted experiments using CAER-S, a still image dataset extracted from videos.
First, we had Gemini 2.0 Flash, which can process video input, classify 7 types of emotions (Anger, Disgust, Fear, Happy, Neutral, Sad, Surprise) using the following prompt.
Please identify the facial expression of the person in the video by choosing one from the following list of emotions.
If multiple people are shown, please select the expression of the person whose face is most clearly visible.
Emotion List: Anger, Disgust, Fear, Happy, Neutral, Sad, Surprise
The output should only be the label name (e.g., 'Happy', 'Sad'), without any additional explanations or punctuation.
As a result, the Confusion Matrix showed a very high number of classifications as 'Neutral'. It became clear that even with rich video input, classification 'In the Wild' is quite challenging.

Since Gemini's accuracy was low in Experiment 1, we conducted a similar experiment with GPT-4o to investigate whether this was a model issue or the inherent difficulty of the task itself. As GPT cannot process video input, we used the CAER-S dataset.
For both models,most images were estimated as Neutral,and the classification accuracy was extremely low. From this result, it was suggested that, regardless of whether the input was video or still images,the 'In the Wild' facial expression recognition task itself is a very difficult challenge for VLMs.This was suggested.


Due to the strong bias towards Neutral, we evaluated the accuracy for six emotions, excluding Neutral.


To reduce misclassification as 'Neutral', we attempted the following two-stage estimation.
We also tried this approach, but Gemini classified everything asNeutral,and GPT classified most asSurprise,and,two-stage estimation does not work.It was found.


This experiment revealed the following:
Notably,Neutralthe unusually large bias towards this category is an interesting point. This suggests that VLMs, when "uncertain," chooseNeutralas the safest option. Alternatively, it's possible that many real-world expressions are not "clear expressions" like those used in research, but rather "ambiguous expressions close to Neutral."
To improve the accuracy of facial expression recognition, it appears we need a deeper understanding of dataset characteristics and model behavior, beyond just prompt engineering.
Next time, we'll delve deeper into whether VLMs truly struggle with facial expression recognition. To test the hypothesis that "recognition is possible but classification is difficult," we'll report the results of an experiment where VLMs were prompted to freely describe the facial expressions of people in images.