"In the Wild": Video Datasets for Facial Expression Recognition

2025-09-22

Haruka Asanuma

Once again, I'm reporting on a technical survey concerning Facial Expression Recognition (FER).

In the previous article, I introduced two types of facial expression recognition datasets: 'Controlled' and 'In the Wild'. This time, I summarize the 'In the Wild' datasets (expression data collected under more natural conditions), which I felt were particularly important.

Recap: Controlled vs. In the Wild

Controlled datasets are created by recording participants displaying specific expressions in a uniform environment prepared by researchers. Their labels are accurate, making them suitable for basic research, such as studying the relationship between expressions and facial muscle movements. However, a challenge is that they diverge from the natural emotional expressions we show in daily life.

In the Wild datasets are collected in situations closer to the real world, including diverse backgrounds and face angles. While the data tends to be noisy and accurate labeling is difficult, it is indispensable for creating 'practically usable' models.

After researching various datasets, I felt that using 'In the Wild' data is more effective than 'Controlled' data when actually building a facial expression recognition model.

"In the Wild" Video Datasets

When I actually looked at 'In the Wild' datasets, I felt that many of them made it difficult to judge emotions from images alone. I believe it's important to capture expressions as videos, as their meaning lies not just in a momentary shape but also in the flow of movement and change before and after.

Therefore, this article focuses on representative 'In the Wild' datasets, particularly those provided in video format.

AFEW (2012): A Pioneer of "In the Wild"

The first facial expression recognition dataset in a natural environment

Data Source: Movies
Features:
- Unlike data captured in lab environments, it provides more realistic data, including various lighting conditions, head movements, and age groups.
- Scenes were collected using keywords like "laugh" found in movie subtitles as clues.
Labeling:
- There are two annotators.
- The reliability of the labels has not been checked.
- Both the dominant emotion at the scene level and the individual emotion of each person are explicitly recorded.
License: For non-commercial use only
Source:

CAER (2019): Large-scale data also available for commercial use

The first large-scale "in the wild" video dataset.

Data Source: 79 TV programs
Features:
- Includes everyday situations and diverse contexts.
- Over 13,000 video clips were collected.
Labeling:
- If two or more annotators (those who apply emotion labels) assigned the same emotion, that label was adopted.
- Annotators reported the reliability of the labels they applied, and if the average reliability was low, the data was excluded from the dataset.
Challenges: Upon reviewing the actual video data, there appears to be a bias in the race and age of the individuals depicted.
License: Commercial use permitted (however, copyright remains with the original video owner).
Source
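
The two CAER labeling rules above can be sketched as a small filter. This is only my illustration of the idea, not the authors' actual pipeline: the function name `aggregate_clip`, the 0-1 confidence scale, and the 0.5 cutoff are all assumptions I made for the example.

```python
from collections import Counter
from statistics import mean


def aggregate_clip(annotations, min_votes=2, min_mean_confidence=0.5):
    """Return the agreed label for one clip, or None if the clip is dropped.

    annotations: list of (label, confidence) pairs, one per annotator.
    The confidence scale (0-1) and both thresholds are hypothetical.
    """
    if not annotations:
        return None
    # Rule 2: drop the clip when the annotators' average confidence is low.
    if mean([conf for _, conf in annotations]) < min_mean_confidence:
        return None
    # Rule 1: adopt a label only if at least `min_votes` annotators chose it.
    # On a tie, Counter returns the label that was seen first.
    label, votes = Counter(lbl for lbl, _ in annotations).most_common(1)[0]
    return label if votes >= min_votes else None
```

For example, `aggregate_clip([("happy", 0.9), ("happy", 0.8), ("sad", 0.7)])` keeps the clip with the label "happy", while three annotators who all disagree, or who all report low confidence, cause the clip to be dropped.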

DFEW (2020): Larger and more diverse movie data

A dataset with high-quality data labels.

Data Source: Over 1500 movies
Features:
- Over 16,000 videos were collected from movies of diverse genres such as comedy, tragedy, and war.
Labeling:
- Each video was labeled by 10 expert annotators.
- The label is adopted when more than 60% of annotators assign the same label.
Challenges: There is a bias in the number of emotion labels, with extremely limited data for emotions such as "fear" and "disgust."
License: For non-profit research purposes only
Source
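
The DFEW agreement rule described above can be written in a few lines. This is a minimal sketch under my own assumptions: the function name `consensus_label` is hypothetical, and I read "more than 60%" as a strict inequality, so please check the original paper for the exact rule.

```python
from collections import Counter


def consensus_label(votes, min_agreement=0.6):
    """Return the majority label if its share of votes exceeds min_agreement.

    votes: list of labels, one per annotator (10 per video in DFEW).
    Returns None when no label reaches the required agreement.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    # Strictly "more than" the threshold: 6 of 10 votes is not enough here.
    return label if count / len(votes) > min_agreement else None
```

With 10 annotators, 7 matching votes yield a label, while a 6-to-4 split leaves the video unlabeled under this reading of the rule.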

FERV39K (2022): Data considering "what kind of scene it is"

It incorporates a new concept, "scene (context)," rather than just classifying facial expressions.

Data Source: Various sources such as movies and TV shows
Features:
- 22 specific scenes, such as "argument," "school," "business," and "crime," were defined, and data was collected and classified accordingly.
- With approximately 39,000 videos, it is a very large-scale video dataset.
- An effort was made to collect videos from diverse regions such as Asia, Africa, and Europe/America.
Labeling:
- Two-stage annotation structure
  - Stage 1: Three annotators label the data
  - Stage 2: Experts verify
License: For non-profit research purposes only
Source

MAFW (2022): Data including compound emotions and text

MAFW is a highly ambitious dataset that attempts to capture the complexity of facial expression recognition in greater depth.

Data Sources: YouTube, talk shows, etc.
Features:
- Multimodal: Not only video but also audio data is included.
- Compound Emotion Labels: Labels are also assigned for cases where multiple emotions exist simultaneously, such as a face showing a mix of joy and surprise (label example: 'anger + disgust').
- Emotion Description Text: Text that describes emotions and situations in sentences, such as 'He frowned while sighing in relief', is provided in English and Chinese.
Labeling:
- 11 skilled annotators
- A score (0-1) is assigned to each video for each of 11 emotion categories.
- Using the Expectation-Maximization (EM) algorithm, the reliability of each annotator is estimated by calculating "the probability that the emotion was correctly assigned (reliability α)" and "the probability that it was not incorrectly assigned (reliability β)."
License: For non-profit research purposes only
Source
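
To make the EM step in the MAFW labeling more concrete, here is a rough sketch for a single emotion category. It is a generic two-parameter annotator model, not the MAFW authors' exact procedure. I assume binary votes (1 = "this emotion is present") instead of 0-1 scores, and the function name `em_annotator_reliability` is hypothetical. Here α is the probability that an annotator marks the emotion when it is truly present, and β is the probability that the annotator does not mark it when it is absent.

```python
def em_annotator_reliability(votes, n_iter=50, eps=1e-6):
    """Estimate per-annotator reliability (alpha, beta) with EM.

    votes[i][j] is 1 if annotator j marked the emotion as present in
    video i, else 0. Returns (posterior, alpha, beta), where posterior[i]
    is the estimated probability that the emotion is present in video i.
    """
    n_items, n_annot = len(votes), len(votes[0])

    def clamp(x):
        # Keep probabilities away from 0 and 1 to avoid division by zero.
        return min(max(x, eps), 1 - eps)

    # Initialise the posterior with the fraction of positive votes.
    post = [sum(row) / n_annot for row in votes]
    for _ in range(n_iter):
        # M-step: re-estimate the prior and each annotator's alpha / beta.
        pos = sum(post)
        neg = n_items - pos
        prior = clamp(pos / n_items)
        alpha = [clamp(sum(p * row[j] for p, row in zip(post, votes))
                       / max(pos, eps)) for j in range(n_annot)]
        beta = [clamp(sum((1 - p) * (1 - row[j]) for p, row in zip(post, votes))
                      / max(neg, eps)) for j in range(n_annot)]
        # E-step: update the probability that the emotion is truly present.
        new_post = []
        for row in votes:
            like_pos, like_neg = prior, 1 - prior
            for j, v in enumerate(row):
                like_pos *= alpha[j] if v else 1 - alpha[j]
                like_neg *= 1 - beta[j] if v else beta[j]
            new_post.append(like_pos / (like_pos + like_neg))
        post = new_post
    return post, alpha, beta


# Toy data: annotators 0 and 1 agree with each other, annotator 2 is noisy.
toy_votes = [[1, 1, 1], [1, 1, 0], [0, 0, 1], [0, 0, 0], [1, 1, 1], [0, 0, 1]]
posterior, alpha, beta = em_annotator_reliability(toy_votes)
```

On the toy data, the noisy third annotator ends up with lower α and β than the other two, so that annotator's votes count for less when the final labels are decided.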

All of the datasets above include the seven basic emotions (anger, disgust, fear, joy, neutral, sadness, surprise) among their emotion labels.

Key Dataset Comparison Summary

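Dataset	Year	Data Source	Features	License
AFEW	2012	Movies	A pioneer of "In the Wild"	Non-commercial only
CAER	2019	TV programs	Large-scale, commercial use permitted	Public
DFEW	2020	Movies	Diverse shooting conditions (lighting, background, etc.)	Non-commercial only
FERV39K	2022	Various	Introduces the concept of scene (context)	Non-commercial only
MAFW	2022	Various	Includes audio, compound emotions, and text descriptions	Non-commercial only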
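
When shortlisting datasets for a project, the comparison above can be kept as a small lookup structure. The sketch below simply encodes this article's summary. The class name `VideoFERDataset` and the `commercial_use` flag are my own hypothetical names, and the flag reflects my reading of each license, so always check the actual license terms before use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoFERDataset:
    name: str
    year: int
    source: str
    commercial_use: bool  # as summarized in this article, not legal advice


DATASETS = [
    VideoFERDataset("AFEW", 2012, "Movies", False),
    VideoFERDataset("CAER", 2019, "TV programs", True),
    VideoFERDataset("DFEW", 2020, "Movies", False),
    VideoFERDataset("FERV39K", 2022, "Various", False),
    VideoFERDataset("MAFW", 2022, "Various", False),
]


def usable_commercially(datasets):
    """Return the names of datasets flagged as allowing commercial use."""
    return [d.name for d in datasets if d.commercial_use]
```

Calling `usable_commercially(DATASETS)` returns only CAER, which matches the licensing situation described in this article.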

Summary

Datasets for facial expression recognition are no longer merely filmed in "natural environments"; they also consider:

- in what kind of scene the expression occurs (FERV39K)
- whether multiple emotions are mixed (MAFW)
- how the expression relates to other information, such as voice and behavior (MAFW)

In other words, a shift towards context-aware approaches has been observed.

Furthermore, most cutting-edge datasets are currently restricted to research purposes, posing significant challenges for business implementation.

Moving forward, we anticipate the emergence of datasets that overcome licensing issues, cover a wider range of ethnicities, cultures, and age groups, and are capable of capturing more complex emotions.

I'm excited to see the technological advancements these datasets will enable.

Note: While this blog post was written after consulting research papers to ensure accuracy, please refer to the original papers for precise information.

