How do VLMs describe human facial expressions? A comparative experiment with Gemini and GPT-4o.
2025-10-03
Haruka Asanuma
Hello, I'm Asanuma, an intern at Quixotiks.
This time, we report the experimental results of having the latest VLM (Vision-Language Model) freely describe the expressions of people in images and verifying its capabilities.
In the previous blog post, we reported that it is difficult for models to accurately classify 'in the wild' facial expression datasets into 6-7 categories. Therefore, this time, by having the model describe human expressions, we aimed to clarify whether it "cannot recognize faces at all" or "can recognize faces but cannot classify the subtle nuances of expressions."
Models
For the experiment, we used Gemini-2.5-flash, Gemini-2.5-pro, and gpt-4o-2024-11-20.
These models are representative commercial VLMs that boast high performance in various image recognition benchmarks, and we adopted the latest models available at the time of the experiment.
Dataset: CAER-S
For the dataset, we used CAER-S, which was also used in the previous article. CAER-S consists of still images extracted from video data filmed in everyday environments, making it suitable for measuring AI performance in real-world scenarios.
(Note) CAER-S is published on this blog because it has a commercial use license.
Experiment
We input the following prompt in English and translated the resulting English output into Japanese for reporting. Since the CAER-S dataset primarily consists of Western content, we believed that instructing the models in English would maximize their performance compared to using Japanese.
Prompt
###💡
Please describe the facial expression of the person in the photo.
###
Results
Based on the results of the previous experiment, let's examine the results by dividing them into 'expressions that are easy to classify (Happy, Sad)', 'expressions that are often misclassified (Anger, Disgust, Surprise)', and 'expressions that cannot be classified at all (Fear)'.
Expressions that are easy to classify
In the previous experiment, we found that Happy and Sad were frequently classified successfully. First, let's look at the descriptions of Happy and Sad images.
Happy
All models correctly recognize the positive emotion of "looking happy" from the woman in the photo.
What's noteworthy is that Gemini recognizes this image as a scene from the American drama 'Friends'. Gemini Flash refers to the character by the proper noun "Monica," indicating that it is linking the image with relevant knowledge in its response.
Even humans would judge this image to be a typical expression of "sadness." All models detect negative emotions, but while both Gemini models express sadness with direct terms like "intense distress," "grief," and "deep sorrow," GPT-4o offers a slightly different interpretation, using "frustration and anguish."
Next, in the previous classification experiment, models often confused Anger, Disgust, Surprise .
Anger
This image, while not necessarily typical, is one that many people would likely identify as an expression of "anger."
All models do not definitively identify "anger," instead detecting negative but slightly different emotions such as "doubt," "perplexity," or "skepticism." While facial recognition is successful, it highlights the difficulty in classifying it into a specific category.
The woman in this photo clearly appears to be feeling unpleasant. It's an image that even a human would likely classify as 'disgust.'
Gemini captures related negative emotions such as 'irritation' and 'skepticism.' On the other hand, GPT offers an overall positive interpretation, such as 'slight amusement' or 'casual interest.'
In this example, Gemini appears to have captured the nuances of the expression more accurately.
Reading a 'surprise' expression from the person in this photo seems quite difficult even for humans.
None of the models identified 'surprise,' instead estimating negative expressions such as 'contemplation' or 'skepticism.' Given the context that the image was taken in a high-end restaurant, it might not necessarily be a negative surprise, highlighting the importance of contextual understanding.
Finally, let's look at 'Fear,' an emotion that was completely unclassifiable in the previous experiment.
Fear
Reading 'fear' from this man's expression is very difficult even for humans.
In fact, while all models identified negative emotions such as 'bewilderment' or 'anxiety,' none explicitly mentioned 'fear.' This outcome suggests that a single still image critically lacks the contextual information needed to accurately interpret expressions.
Facial expression recognition in real-world environments is extremely challenging. A closer look at the dataset revealed that it contained many images, especially for 'Fear' and 'Surprise,' that even humans would find difficult to categorize consistently. This is likely one reason why the model's accuracy struggled in the previous classification task.
Gemini generates more detailed and analytical descriptions. Overall, Gemini tended to generate more detailed descriptions than GPT-4o. Notably, it meticulously reported the movements of facial features, such as eyebrow angle, eye openness, and the shape of the mouth corners, as the basis for its expression judgments.
Does Gemini have the edge in overall facial expression interpretation? As seen in the 'Disgust' example, there were instances where it more accurately captured the nuances of emotion. Considering this alongside its specific descriptions based on facial features, it suggests that Gemini may be a step ahead in overall facial expression recognition capabilities.
No significant difference observed between Gemini Flash and Pro. Within the scope of this experiment, no significant performance difference was observed between Gemini Pro and the lighter Gemini Flash. This is an interesting point when considering cost-efficiency. Furthermore, Gemini Flash's response speed was approximately 1.3 to 1.5 times faster than Gemini Pro's. From this, it seemed that Gemini Flash would be a better choice for facial expression recognition tasks.
It was found that while the latest VLMs can grasp the general direction of expressions, such as positive or negative, they have not yet fully captured the subtle nuances of emotions. Distinguishing whether this challenge stems from the limited contextual information provided by a single still image or from the fundamental capabilities of the models themselves will be a crucial point for future investigation.