Facial expression classification with the "In the Wild" dataset

2025-09-25

Haruka Asanuma

This time, I will report on the experimental results of how well the latest Vision-Language Models (VLMs), Gemini and GPT, can recognize facial expressions, using the In the Wild dataset we previously investigated.


Models and Datasets Used in the Experiment

Models

For the experiment, we used Gemini 2.0 Flash and GPT-4o. Both models are representative commercial VLMs at present, boasting high performance in various image recognition benchmarks.

Dataset: CAER


For the dataset, we used CAER, which was introduced in the previous article. CAER consists of video data filmed in everyday environments, making it suitable for measuring AI performance in real-world scenarios. Such datasets are referred to as 'In the Wild,' meaning they represent the more complex and unpredictable real world, rather than ideal laboratory environments. We also conducted experiments using CAER-S, a still image dataset extracted from videos.

Experimental Results

Experiment 1: 7-Category Classification using Video [Gemini]


First, we had Gemini 2.0 Flash, which can process video input, classify 7 types of emotions (Anger, Disgust, Fear, Happy, Neutral, Sad, Surprise) using the following prompt.

Please identify the facial expression of the person in the video by choosing one from the following list of emotions.
If multiple people are shown, please select the expression of the person whose face is most clearly visible.
Emotion List: Anger, Disgust, Fear, Happy, Neutral, Sad, Surprise
The output should only be the label name (e.g., 'Happy', 'Sad'), without any additional explanations or punctuation.

As a result, the Confusion Matrix showed a very high number of classifications as 'Neutral'. It became clear that even with rich video input, classification 'In the Wild' is quite challenging.

Experiment 2: 7-category classification using still images [Gemini, GPT]

Since Gemini's accuracy was low in Experiment 1, we conducted a similar experiment with GPT-4o to investigate whether this was a model issue or the inherent difficulty of the task itself. As GPT cannot process video input, we used the CAER-S dataset.

For both models,most images were estimated as Neutral,and the classification accuracy was extremely low. From this result, it was suggested that, regardless of whether the input was video or still images,the 'In the Wild' facial expression recognition task itself is a very difficult challenge for VLMs.This was suggested.

Experiment 3: 6-category classification using still images (excluding Neutral)

Due to the strong bias towards Neutral, we evaluated the accuracy for six emotions, excluding Neutral.

  • Gemini 2.0 Flash: While the accuracy for Happy and Sad was relatively good, it could barely distinguish Fear. Furthermore, there was a tendency to misclassify Surprise and Fear as Sad, and the estimation of Angry was also unstable.
  • GPT-4o: In addition to Happy and Sad, the classification of Disgust was also relatively good. However, there was a strong tendency to misclassify Angry as Surprise.

Experiment 4: Attempt at Two-Stage Estimation

To reduce misclassification as 'Neutral', we attempted the following two-stage estimation.

  1. First, determine whether an image isNeutralornon-Neutral.
  2. Next,non-Neutralimages are subjected to 6-class classification.

We also tried this approach, but Gemini classified everything asNeutral,and GPT classified most asSurprise,and,two-stage estimation does not work.It was found.

Summary and Discussion

This experiment revealed the following:

  1. Facial expression recognition "in the wild" is extremely difficult, even with the latest VLMs.
  2. The classification of "Neutral" is a major factor in significantly reducing classification accuracy.
  3. While specific emotions (Happy, Sad) are relatively easy to recognize, complex emotions such as Fear and Anger are difficult to recognize.

Notably,Neutralthe unusually large bias towards this category is an interesting point. This suggests that VLMs, when "uncertain," chooseNeutralas the safest option. Alternatively, it's possible that many real-world expressions are not "clear expressions" like those used in research, but rather "ambiguous expressions close to Neutral."

To improve the accuracy of facial expression recognition, it appears we need a deeper understanding of dataset characteristics and model behavior, beyond just prompt engineering.

Next time, we'll delve deeper into whether VLMs truly struggle with facial expression recognition. To test the hypothesis that "recognition is possible but classification is difficult," we'll report the results of an experiment where VLMs were prompted to freely describe the facial expressions of people in images.