
2025-10-14
2025-10-21
Haruka Asanuma

I'm Asanuma, an intern at Quixotiks.
In our previous blog post, we investigated how well Gemini and GPT, both Large Vision-Language Models (VLMs), could interpret facial expressions in images. The results showed that,the general sentiment of expressions, such as positive or negative, could be captured.However, the previous dataset primarily consisted of images of Caucasians and younger individuals. As is often the case with datasets used in AI development, there was a bias. Therefore, this time, we decided to focus on what could be considered "outliers" in the dataset:Asians (Japanese) and the elderly.We decided to additionally investigate how accurately the latest VLMs can recognize their facial expressions.
This is an image of a man intently reading a newspaper. Each model described his expression as follows:

Interestingly,all models estimated negative-leaning emotions such as "perplexity," "worry," and "discomfort".
However, from my perspective as a Japanese person, it's quite natural for elderly individuals to make such a "stern face" when concentrating on a newspaper; it doesn't necessarily indicate strong negative emotions.
This discrepancy in interpretation with the AI might not only be due to cultural context differences, but alsoa bias in the age distribution of the training dataset. Perhaps the AI learned facial patterns from abundant data of younger people (e.g., frowning = dissatisfaction or perplexity) and applied them to images of the elderly. In other words,there is insufficient data to capture the nuances of facial expressions unique to the elderly, which might have led to these estimations based on the facial expressions of younger individuals.
Next is an image of a woman resting her chin on her hand, seemingly lost in thought.

For this image, all three models provided estimations very close to my own interpretation, such as "contemplation," "sadness," and "worry." The gesture of resting one's chin on a hand while lost in thought is not only culturally universal but alsoshows relatively little age-related difference, being commonly observed across generations. Therefore, it might have been easier for the AI to learn this pattern and make an accurate judgment.
This verification has shed light on the current capabilities and challenges of VLM's facial recognition.
While the expression of a woman lost in thought was recognized with high accuracy, the expression of a man reading a newspaper was interpreted as negative, diverging from human perception. This difference is likely due to the "universality" of certain expressions and the "bias in AI training data."
While the gesture of resting one's chin on a hand is universal across ages, the "stern face" made during concentration carries nuances specific to the elderly. The AI likely judged the expressions of the elderly, for which it has less data, based on the patterns of younger individuals, for whom it has abundant data, leading to the discrepancy in interpretation.
This indicates that for AI to truly understand human expressions, in addition to high-level contextual understanding abilities like situation and culture, the foundationalimportance of diverse and unbiased datasetswas reaffirmed. We will continue to pay close attention to the future evolution of VLM.