I'm Asanuma, an intern at Quixotiks since October 2024. I've been working on research, and I'm looking forward to sharing our survey findings in a series of articles.
In this first installment, we'll look at "facial expression recognition datasets," which are used to train AI to classify human expressions into emotion labels. Datasets can also be categorized by whether they contain images or videos, but this article focuses on one crucial distinction that defines their nature: whether a dataset is "Controlled" or "In the Wild."
Two Types of Facial Expression Recognition Datasets
Datasets used for AI to learn facial expressions can be broadly categorized into two types based on their creation method: data generated in a "controlled laboratory" setting and data collected from "real-world, everyday situations." These are commonly known as Controlled and In the Wild (Uncontrolled) datasets.
- Controlled: Background, lighting, and poses are predetermined. The expressions to be made are also specified by the researchers.
- In the Wild: Backgrounds and situations vary widely, capturing natural, spontaneous moments.
Now, let's explore the characteristics of each type and their representative datasets.
The World of Controlled Datasets: Expressions in a Laboratory Setting
Controlled datasets are captured in a uniform environment prepared by researchers, with participants asked to make specific expressions.
- Characteristics:
  - Backgrounds are plain white or gray.
  - Subjects face the camera directly.
  - Posed expressions based on instructions like "Please smile."
  - Consistent lighting and other conditions, resulting in low noise.
- Representative example: CK+ (The Extended Cohn-Kanade Dataset)
- ✅ Advantages:
  - The data is very clean and suitable for basic research, such as the relationship between expressions and facial muscle movements.
  - The labeling accuracy is very high.
- ❌ Disadvantages:
  - Since these are acted expressions, they diverge from the natural emotional expressions we show in daily life.
  - AI trained solely on this data will find it difficult to handle the diverse expressions of the real world.
The "In the Wild" World: Everyday Expressions
To overcome the limitations of controlled datasets, researchers began collecting data from the real world, and "In the Wild" datasets were born from that movement.
- Features:
  - Clips are cut from a variety of videos, such as movies, TV shows, and YouTube.
  - Facial orientations, lighting conditions, and backgrounds are diverse.
  - Rich in spontaneous, non-acted expressions.
- Examples: AFEW, DFEW, CAER, FERV39K, MAFW, etc.
- ✅ Advantages:
  - More practical, and essential for developing expression recognition models that work in the real world.
  - Helps improve recognition accuracy for partially hidden faces, varied head angles, and complex lighting.
- ❌ Disadvantages (Challenges):
  - The data contains a lot of noise (background, occlusions, etc.), making it harder to process and learn from.
  - Labeling emotions is very difficult. It is often hard to answer a question like "Was that truly an expression of joy?" Because the person labeling the data is not the person who made the expression, there is no guarantee that the label matches the emotion that was actually felt.
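To make the labeling challenge concrete, here is a minimal sketch of one common way to gauge label reliability: collect votes from several annotators per clip, take the majority label, and record how many annotators agreed. This is an illustration, not the procedure of any dataset named above; the clip IDs, the votes, and the `majority_label` helper are all hypothetical.

```python
from collections import Counter


def majority_label(votes):
    """Return (label, agreement) for one clip's annotator votes.

    agreement is the fraction of annotators who chose the winning label.
    Ties are not resolved specially here; a real pipeline would need a rule.
    """
    if not votes:
        raise ValueError("votes must not be empty")
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    return label, top / len(votes)


# Hypothetical votes from five annotators for two clips.
clips = {
    "clip_001": ["happy", "happy", "happy", "surprise", "happy"],
    "clip_002": ["sad", "neutral", "fear", "sad", "angry"],
}

for clip_id, votes in clips.items():
    label, agreement = majority_label(votes)
    print(clip_id, label, f"{agreement:.2f}")
```

A low agreement score (as for the second clip) flags a sample whose label may be unreliable, which is the kind of ambiguity that in-the-wild data tends to produce.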
Summary: Dataset Quick Reference Chart
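| Characteristic | Controlled | In the Wild (Uncontrolled) |
| --- | --- | --- |
| Environment | Controlled environments such as laboratories | Everyday life, movies, web videos, etc. |
| Expressions | Posed expressions made on instruction | Spontaneous expressions |
| Capture conditions | Frontal view, plain background, uniform lighting | Varied angles, complex backgrounds, diverse lighting |
| Data | Clean and easy to handle | Noisy and complex |
| Strengths | Basic research, analysis of facial movements | Real-world applications, robust model development |
| Challenges | Gap from the real world | Difficulty of labeling |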
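If you want to keep this distinction at hand in code, the sketch below tags each dataset mentioned in this article with its collection type so that a survey script can filter by it. The class and function names (`Collection`, `FERDataset`, `names_by_collection`) are hypothetical, and the registry holds only the dataset names and categories given above, nothing about their actual file formats or sizes.

```python
from dataclasses import dataclass
from enum import Enum


class Collection(Enum):
    CONTROLLED = "controlled"
    IN_THE_WILD = "in_the_wild"


@dataclass(frozen=True)
class FERDataset:
    name: str
    collection: Collection


# Only the datasets named in this article, with the category it assigns them.
DATASETS = [
    FERDataset("CK+", Collection.CONTROLLED),
    FERDataset("AFEW", Collection.IN_THE_WILD),
    FERDataset("DFEW", Collection.IN_THE_WILD),
    FERDataset("CAER", Collection.IN_THE_WILD),
    FERDataset("FERV39K", Collection.IN_THE_WILD),
    FERDataset("MAFW", Collection.IN_THE_WILD),
]


def names_by_collection(kind):
    """Return dataset names of the given collection type, in registry order."""
    return [d.name for d in DATASETS if d.collection is kind]


print(names_by_collection(Collection.CONTROLLED))
print(names_by_collection(Collection.IN_THE_WILD))
```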
In this article, we explained "Controlled" and "In the Wild," the two major categories of facial expression recognition datasets. It's not a matter of one being superior to the other; both types of datasets play crucial roles depending on the research objectives.
Next time, we will delve deeper into the history of "In the Wild" datasets.