Once again, I'm reporting on a technical survey concerning Facial Expression Recognition (FER).
In the previous article, I introduced two types of facial expression recognition datasets: 'Controlled' and 'In the Wild'. This time, I will focus on and summarize 'In the Wild' (expression data collected under more natural conditions), which I felt was particularly important.
Recap: Controlled vs. In the Wild
Controlled datasets are created by recording participants displaying specific expressions in a uniform environment prepared by researchers. Their labels are accurate, making them suitable for basic research, such as studying the relationship between expressions and facial muscle movements. However, a challenge is that they diverge from the natural emotional expressions we show in daily life.
In the Wild datasets are collected in situations closer to the real world, including diverse backgrounds and face angles. While the data tends to be noisy and accurate labeling is difficult, it is indispensable for creating 'practically usable' models.
After researching various datasets, I felt that using 'In the Wild' data is more effective than 'Controlled' data when actually building a facial expression recognition model.
"In the Wild" Video Datasets
When I actually looked at 'In the Wild' datasets, I felt that many of them made it difficult to judge emotions from images alone. I believe it's important to capture expressions as videos, as their meaning lies not just in a momentary shape but also in the flow of movement and change before and after.
Therefore, this article focuses on representative 'In the Wild' datasets, particularly those provided in video format.
AFEW (2012): A Pioneer of "In the Wild"
The first facial expression recognition dataset in a natural environment
- Data Source: Movies
- Features:
- Unlike data captured in lab environments, it provides more realistic data, including various lighting conditions, head movements, and age groups.
- Scenes were collected using keywords like "laugh" found in movie subtitles as clues.
- Labeling:
- There are two annotators.
- The reliability of the labels has not been checked.
- Both the dominant emotion at the scene level and the individual emotion of each person are explicitly recorded.
- License: For non-commercial use only
- Source:
CAER (2019): Large-scale data also available for commercial use
The first large-scale "in the wild" video dataset.
- Data Source: 79 TV programs
- Features:
- Over 13,000 video clips were collected, covering everyday situations and diverse contexts.
- Labeling:
- If two or more annotators (those who apply emotion labels) assigned the same emotion, that label was adopted.
- Annotators reported the reliability of the labels they applied, and if the average reliability was low, the data was excluded from the dataset.
- Challenges: Upon reviewing the actual video data, there appears to be a bias in the race and age of the individuals depicted.
- License: Commercial use permitted (however, copyright remains with the original video owner).
- Source
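The CAER labeling rule described above (agreement between annotators plus a reliability filter) can be sketched roughly as follows. This is only an illustration under my own assumptions: the function name `caer_style_label`, the 0-1 reliability scale, and the 0.5 cutoff are hypothetical and are not taken from the CAER paper.

```python
from collections import Counter


def caer_style_label(votes, min_agree=2, min_mean_reliability=0.5):
    """Aggregate annotator votes for one clip.

    votes: list of (label, reliability) pairs, one per annotator.
    The reliability scale (0-1) and the cutoff are assumptions for this sketch.
    Returns the adopted label, or None if the clip would be excluded.
    """
    if not votes:
        return None

    # Exclude the clip when the average self-reported reliability is low.
    mean_reliability = sum(r for _, r in votes) / len(votes)
    if mean_reliability < min_mean_reliability:
        return None

    # Adopt the most frequent label only if enough annotators agree on it.
    # Note: ties are resolved by first occurrence, which is an arbitrary choice.
    label, count = Counter(lbl for lbl, _ in votes).most_common(1)[0]
    return label if count >= min_agree else None
```

For example, `caer_style_label([("happy", 0.9), ("happy", 0.8), ("sad", 0.7)])` adopts "happy", while three annotators who all disagree, or who all report low reliability, lead to the clip being dropped.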
DFEW (2020): Larger and more diverse movie data
A dataset with high-quality data labels.
- Data Source: Over 1500 movies
- Features:
- Over 16,000 videos were collected from movies of diverse genres such as comedy, tragedy, and war.
- Labeling:
- Each video was labeled by 10 expert annotators.
- The label is adopted when more than 60% of annotators assign the same label.
- Challenges: There is a bias in the number of emotion labels, with extremely limited data for emotions such as "fear" and "disgust."
- License: For non-profit research purposes only
- Source
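To make the DFEW adoption rule concrete, here is a minimal sketch of the "more than 60% of annotators" check. The function name `dfew_style_label` is hypothetical, and I am assuming the threshold is strict (6 out of 10 is not enough), which is how I read "more than 60%".

```python
from collections import Counter


def dfew_style_label(votes, threshold_percent=60):
    """Return the label chosen by more than threshold_percent of annotators.

    votes: list of labels, one per annotator (10 per video according to the text).
    Returns None when no label clears the threshold.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    # Integer arithmetic avoids floating-point surprises at exactly 60%.
    if count * 100 > threshold_percent * len(votes):
        return label
    return None
```

Under this reading, 7 matching votes out of 10 adopt the label, while a 6-to-4 split leaves the video unlabeled.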
FERV39K (2022): Data considering "what kind of scene it is"
It incorporates a new concept, "scene (context)," rather than just classifying facial expressions.
- Data Source: Various sources such as movies and TV shows
- Features:
- 22 specific scenes, such as "argument," "school," "business," and "crime," were defined, and data was collected and classified accordingly.
- A very large-scale video dataset containing approximately 39,000 videos.
- An effort was made to collect videos from diverse regions such as Asia, Africa, and Europe/America.
- Labeling:
- Two-stage annotation structure
- Stage 1: Three annotators label the data
- Stage 2: Experts verify
- License: For non-profit research purposes only
- Source
MAFW (2022): Data including compound emotions and text
MAFW is a highly ambitious dataset that attempts to capture the complexity of facial expression recognition in greater depth.
- Data Sources: YouTube, talk shows, etc.
- Features:
- Multimodal: Not only video, but also audio data is included.
- Compound Emotion Labels: Labels are also assigned for cases where multiple emotions exist simultaneously, such as 'a face mixed with joy and surprise' (e.g., 'anger + disgust').
- Emotion Description Text: Text that describes emotions and situations in sentences, such as 'He frowned while sighing in relief', is provided in English and Chinese.
- Labeling:
- 11 skilled annotators
- Each annotator assigns a score (0-1) for 11 emotion categories to each video.
- The reliability of each annotator is estimated using the Expectation-Maximization (EM) algorithm, which calculates "the probability that the emotion was correctly assigned (reliability α)" and "the probability that it was not incorrectly assigned (reliability β)."
- License: For non-profit research purposes only
- Source
- All of the datasets above use the seven basic emotions (anger, disgust, fear, joy, neutrality, sadness, surprise) as emotion labels.
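The EM-based reliability estimate described for MAFW can be illustrated with a small sketch. To be clear, this is not the MAFW authors' implementation: it is a generic two-parameter-per-annotator EM (in the style of Dawid and Skene) for a single emotion, and I am assuming binary votes (1 = the annotator assigned the emotion, 0 = not). The function name `em_annotator_reliability` and the clipping constant are my own.

```python
def em_annotator_reliability(votes, n_iter=50, eps=1e-6):
    """Estimate per-annotator reliability for one emotion with EM.

    votes[i][j] is 1 if annotator j assigned the emotion to clip i, else 0.
    Returns (alpha, beta, posterior) where
      alpha[j]     ~ P(annotator j votes 1 | emotion truly present)
      beta[j]      ~ P(annotator j votes 0 | emotion truly absent)
      posterior[i] ~ P(emotion truly present in clip i | all votes)
    """
    n_items, n_annot = len(votes), len(votes[0])

    def clip(x):
        # Keep probabilities away from exactly 0 or 1.
        return min(max(x, eps), 1 - eps)

    # Initialise the posterior with the fraction of positive votes per clip.
    post = [sum(row) / n_annot for row in votes]
    alpha, beta = [], []

    for _ in range(n_iter):
        # M-step: re-estimate the prior and each annotator's alpha and beta.
        pos = max(sum(post), eps)
        neg = max(n_items - sum(post), eps)
        prior = clip(sum(post) / n_items)
        alpha = [clip(sum(p * row[j] for p, row in zip(post, votes)) / pos)
                 for j in range(n_annot)]
        beta = [clip(sum((1 - p) * (1 - row[j]) for p, row in zip(post, votes)) / neg)
                for j in range(n_annot)]

        # E-step: update the belief that the emotion is present in each clip.
        new_post = []
        for row in votes:
            like_pos, like_neg = prior, 1 - prior
            for j, v in enumerate(row):
                like_pos *= alpha[j] if v else 1 - alpha[j]
                like_neg *= 1 - beta[j] if v else beta[j]
            new_post.append(like_pos / (like_pos + like_neg))
        post = new_post

    return alpha, beta, post
```

As a usage example, with three annotators where the first two always agree and the third often disagrees, the third annotator should end up with lower α and β:

```python
votes = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
]
alpha, beta, posterior = em_annotator_reliability(votes)
```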
Key Dataset Comparison Summary
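| Dataset | Year | Data source | Features | License |
|---|---|---|---|---|
| AFEW | 2012 | Movies | Pioneer of "In the Wild" | Non-profit only |
| CAER | 2019 | TV programs | Large-scale, commercial use permitted | Public |
| DFEW | 2020 | Movies | Diverse shooting conditions (lighting, background, etc.) | Non-profit only |
| FERV39K | 2022 | Various | Introduces the concept of scenes (context) | Non-profit only |
| MAFW | 2022 | Various | Includes audio, compound emotions, and text descriptions | Non-profit only |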
Summary
Datasets for facial expression recognition are no longer merely filmed in "natural environments"; they also consider:
- in what kind of scene the expression occurs (FERV39K)
- whether multiple emotions are mixed (MAFW)
- how the expression relates to other information, such as voice and behavior (MAFW)
Overall, a shift towards more context-aware approaches can be observed.
Furthermore, most cutting-edge datasets are currently restricted to research purposes, posing significant challenges for business implementation.
Moving forward, we anticipate the emergence of datasets that overcome licensing issues, cover a wider range of ethnicities, cultures, and age groups, and are capable of capturing more complex emotions.
I'm excited to see the technological advancements these datasets will enable.
Note: While this blog post was written after consulting research papers to ensure accuracy, please refer to the original papers for precise information.