Meta Reality Labs has released "Sapiens", a family of artificial intelligence models providing high-resolution capabilities for human-centric vision tasks: analyzing and understanding people and their actions in images and videos. These tasks include recognizing human poses, segmenting body parts, estimating depth, and inferring the orientation of object surfaces. The models were pretrained on more than 300 million human images and perform well in a variety of complex environments.
- 2D Pose Estimation: recognizing and estimating human body pose in 2D images.
- Body Part Segmentation: accurately segmenting human body parts in an image, e.g., identifying and distinguishing hands, feet, and head.
- Depth Estimation: predicting the depth of objects in an image, which helps understand distances and layout in 3D space.
- Surface Normal Prediction: inferring the orientation of object surfaces in an image, which helps in understanding an object's shape and material.
These models handle very high-resolution images and perform well with very little labeled data, or even with purely synthetic data, which makes them useful in real-world applications where data is scarce.
In addition, the Sapiens models are simple in design and easy to scale: as the number of parameters increases, performance improves significantly across tasks. On multiple human-centric vision benchmarks, Sapiens surpasses existing baseline models.
Scenarios of Sapiens
The Sapiens model is mainly used in several key human vision task areas. Its application scenarios and uses include:
1. 2D Pose Estimation
- Application scenarios: 2D pose estimation is a key technology in video surveillance, virtual reality, motion capture, medical rehabilitation, and related fields, where it recognizes human pose, movement, and gestures.
- Functionality: Sapiens accurately detects and predicts human keypoints (such as joints and facial features) and works well even in multi-person scenes, giving it broad potential in motion analysis and human-computer interaction.
2. Body Part Segmentation
- Application scenarios: Accurate human body part segmentation is a basic technology in fields such as medical image analysis, virtual fitting, animation production, and augmented reality (AR).
- Functionality: The Sapiens model can accurately classify each pixel in an image into different parts of the body (such as upper body, lower body, facial details, etc.). This helps develop more sophisticated virtual clothing fitting, medical diagnostic tools, and more natural virtual character animation.
3. Depth Estimation
- Application scenarios: Depth estimation is crucial in autonomous driving, robot navigation, 3D modeling and virtual reality, helping to understand the three-dimensional structure in the scene.
- Functionality: The Sapiens model can infer depth information for a scene from a single image, particularly in human-centric scenes. By generating high-quality depth maps, it supports applications that require understanding spatial relationships, such as obstacle detection in autonomous driving and robot path planning.
4. Surface Normal Prediction
- Application scenarios: Surface normal prediction is widely used in 3D rendering, physical simulation, reverse engineering, and lighting processing.
- Functionality: The Sapiens model can infer the surface normal direction of each pixel in the image, which is essential for generating high-quality 3D models and achieving more realistic lighting effects. This function is particularly important in applications that require precise surface features, such as virtual reality and digital content creation.
5. Common Human Vision Tasks
- Application scenarios: The Sapiens model can be applied to any scenario that requires understanding and analyzing human images, including social media content analysis, security monitoring, sports science research, and digital human generation.
- Functionality: Due to its strong performance on multiple tasks, Sapiens can be used as a general base model to support various human-centric vision tasks, thereby accelerating the development of related applications.
6. Virtual Reality and Augmented Reality
- Application scenarios: Virtual reality (VR) and augmented reality (AR) applications require highly accurate understanding of human posture and structure to achieve an immersive experience.
- Functionality: Sapiens supports the creation of realistic human images in virtual environments by providing high-resolution, accurate human pose and part segmentation, and can dynamically adapt to changes in user movements.
7. Medical and Health
- Application scenarios: In medical imaging and rehabilitation training, accurate posture detection and human segmentation can be used for patient monitoring, treatment tracking and rehabilitation guidance.
- Functionality: The Sapiens model helps medical professionals analyze patients' posture and movement to provide more personalized and effective treatment plans.
Technical Methods of Sapiens
1. Dataset and preprocessing
- Humans-300M dataset: The pre-training dataset for the Sapiens model is Humans-300M, a large-scale dataset containing 300 million "in-the-wild" human images. The dataset has been carefully curated to remove watermarks, text, artistic depictions, or unnatural elements.
- Data filtering: A pretrained bounding-box person detector filters the images; only images with a detection score above 0.9 and a bounding box larger than 300 pixels are kept, ensuring data quality.
- Multi-view capture and annotation: In order to accurately capture human body postures and parts, multi-view capture technology is used to acquire images, and 308 key points and 28 body part categories are manually annotated to generate high-quality annotated data.
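The filtering rule above can be sketched as a simple predicate over per-image detections. The record layout and the reading of "bounding box size" as the shorter box side are illustrative assumptions, not details confirmed by the text.

```python
# Sketch of the data-filtering step: keep an image only if some person
# detection clears both the score and size thresholds described above.
# Each detection here is a hypothetical (score, box_width, box_height) tuple.

def keep_image(detections, min_score=0.9, min_box_px=300):
    """Return True if any detection passes both quality thresholds."""
    return any(
        score > min_score and min(w, h) >= min_box_px
        for score, w, h in detections
    )

# One confident, large-enough detection is sufficient to keep the image.
sample = [(0.95, 420, 780), (0.40, 120, 200)]
print(keep_image(sample))  # True
print(keep_image([(0.95, 100, 100)]))  # False: box too small
```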
2. Model Architecture
- Vision Transformers (ViT): The Sapiens model uses the Vision Transformers (ViT) architecture, which has performed well in image classification and understanding tasks. By dividing the image into fixed-size non-overlapping patches, the model is able to handle high-resolution inputs and perform fine-grained reasoning.
- Encoder-Decoder Architecture: The model follows an encoder-decoder design. The encoder extracts features from the image and is initialized with pretrained weights, while the decoder is a lightweight, task-specific module that is randomly initialized and fine-tuned together with the encoder.
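The ViT patching step described above can be sketched in a few lines of NumPy: the image is cut into fixed-size, non-overlapping patches, each flattened into one token. The patch size of 16 is a common ViT choice used here for illustration, not a value stated in the text.

```python
import numpy as np

# Minimal sketch of ViT-style patching: split an (H, W, C) image into
# non-overlapping P x P patches and flatten each patch into a token row.

def patchify(img, patch=16):
    """Return an (N, patch*patch*C) array, one flattened patch per row."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    img = img.reshape(h // patch, patch, w // patch, patch, c)
    img = img.transpose(0, 2, 1, 3, 4)         # (H/p, W/p, p, p, C)
    return img.reshape(-1, patch * patch * c)  # one token per patch

tokens = patchify(np.zeros((1024, 1024, 3)))
print(tokens.shape)  # (4096, 768): a 64x64 grid of patches at 1024px input
```

At a 1024-pixel input this yields 4096 tokens, which is why high-resolution pretraining is so much more expensive than the usual 224-pixel ViT setting.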
3. Masked Autoencoder (MAE) Pre-training
- Masking strategy: The MAE method is used for model pre-training. The model reconstructs the original image by observing partially masked images. This strategy enables the model to learn more robust feature representations.
- High-resolution input: The input image resolution during pre-training is set to 1024 pixels, which brings 4 times the computational complexity compared to existing vision models, but also improves the output quality of the model.
- Multi-task learning: By fine-tuning on high-quality labeled data, the Sapiens model is able to handle multiple tasks such as 2D pose estimation, body part segmentation, depth estimation, and surface normal prediction.
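The masking strategy above can be sketched as random selection over patch tokens: most patches are hidden, the encoder sees only the visible ones, and the model is trained to reconstruct the rest. The 75% mask ratio is the common MAE default, an assumption not stated in the text.

```python
import numpy as np

# Sketch of MAE-style random patch masking: choose a random subset of
# patches to keep visible; everything else is masked and must be
# reconstructed from the visible context.

def random_mask(num_patches, mask_ratio=0.75, rng=None):
    rng = rng or np.random.default_rng(0)
    num_keep = int(num_patches * (1 - mask_ratio))
    perm = rng.permutation(num_patches)
    keep_idx = np.sort(perm[:num_keep])      # patches the encoder sees
    mask = np.ones(num_patches, dtype=bool)  # True = masked (reconstruct)
    mask[keep_idx] = False
    return keep_idx, mask

keep_idx, mask = random_mask(4096)
print(len(keep_idx), int(mask.sum()))  # 1024 3072
```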
4. Task-specific methods
- 2D pose estimation: A top-down approach detects the locations of K keypoints in the input image. The model predicts a heatmap for each keypoint and is trained with a mean squared error (MSE) loss.
- Body part segmentation: Each pixel in the input image is classified into one of C categories, and the model is trained with a weighted cross-entropy loss. It supports both a standard 20-category segmentation vocabulary and an extended 28-category vocabulary.
- Depth estimation: A modified segmentation architecture with a single output channel performs depth regression. High-resolution depth maps rendered from synthetic data are used for training, with a relative depth loss (L_depth).
- Surface normal prediction: The model predicts the xyz components of the normal vector for each pixel. The training loss (L_normal) combines an L1 term with a dot-product term between predicted and ground-truth normals.
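The surface-normal loss described above can be sketched as the sum of an L1 term and an alignment term derived from the dot product. Equal weighting of the two terms is an assumption for illustration; the exact weighting is not stated in the text.

```python
import numpy as np

# Sketch of a surface-normal loss: L1 distance between unit normals plus
# (1 - cosine similarity), which is zero when vectors are aligned.
# Equal weighting of the two terms is an illustrative assumption.

def normal_loss(pred, gt, eps=1e-8):
    """pred, gt: (N, 3) per-pixel normal vectors."""
    pred_u = pred / (np.linalg.norm(pred, axis=1, keepdims=True) + eps)
    gt_u = gt / (np.linalg.norm(gt, axis=1, keepdims=True) + eps)
    l1 = np.abs(pred_u - gt_u).sum(axis=1).mean()
    dot = (1.0 - (pred_u * gt_u).sum(axis=1)).mean()  # 0 when aligned
    return l1 + dot

perfect = np.array([[0.0, 0.0, 1.0]])
print(normal_loss(perfect, perfect))  # ~0.0 for identical normals
```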
5. Large-scale pre-training and fine-tuning
- Pre-training scale: The Sapiens models are pretrained on 300 million images; pretraining runs for 18 days on up to 1024 A100 GPUs using the PyTorch framework.
- Optimization method: The AdamW optimizer is used, combined with cosine-annealing and linear-decay learning-rate schedules. Different learning rates are applied at different layers to preserve the model's generalization ability.
- Fine-tuning strategy: For fine-tuning, input images are resized to a 4:3 aspect ratio, and standard data augmentations (cropping, scaling, flipping, and photometric distortion) are applied.
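The cosine-annealing schedule mentioned above can be sketched as a function of training step. The linear-warmup length and peak learning rate below are illustrative values, not figures from the paper.

```python
import math

# Sketch of a cosine-annealing learning-rate schedule with linear warmup:
# the rate climbs linearly to a peak, then decays along a half cosine.
# warmup=1000 steps and peak_lr=1e-3 are hypothetical illustration values.

def lr_at(step, total_steps, peak_lr=1e-3, warmup=1000):
    if step < warmup:                       # linear warmup phase
        return peak_lr * step / warmup
    t = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * t))  # cosine decay

print(lr_at(1000, 10000))   # peak: 0.001
print(lr_at(10000, 10000))  # fully decayed to ~0
```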
Experimental Results of Sapiens
2D Pose Estimation
The Sapiens model performs well in 2D pose estimation tasks, especially in key point detection of the whole body, face, hands and feet, significantly surpassing existing state-of-the-art methods.
Body Part Segmentation
The Sapiens model achieves higher mean intersection over union (mIoU) and pixel accuracy (mAcc) in body part segmentation tasks, and performs particularly well in detail-rich segmentation tasks.
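The mIoU metric cited above can be sketched directly: per-class intersection over union, averaged across the classes present. The label arrays below are toy values for illustration only.

```python
import numpy as np

# Sketch of mean intersection-over-union (mIoU) for segmentation:
# per-class IoU averaged over classes that occur in prediction or truth.

def mean_iou(pred, gt, num_classes):
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                 # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([0, 0, 1, 1])
gt   = np.array([0, 1, 1, 1])
print(mean_iou(pred, gt, 2))  # class 0: 1/2, class 1: 2/3 -> 0.5833...
```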
Depth Estimation
The Sapiens model performs well in depth estimation, particularly in human-centric scenes, where its accuracy significantly exceeds existing methods, including in complex multi-person scenes.
Surface Normal Prediction
In the surface normal prediction task, the Sapiens model demonstrates higher accuracy and consistency, performs well across different scenarios, and significantly reduces the mean angular error.
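The mean angular error used to score surface normals, mentioned above, can be sketched as the average angle between predicted and ground-truth unit vectors. The input vectors below are toy values for illustration.

```python
import numpy as np

# Sketch of mean angular error for surface normals: average angle, in
# degrees, between predicted and ground-truth unit normal vectors.

def mean_angular_error_deg(pred, gt):
    """pred, gt: (N, 3) arrays of unit normal vectors."""
    cos = np.clip((pred * gt).sum(axis=1), -1.0, 1.0)  # clamp for arccos
    return float(np.degrees(np.arccos(cos)).mean())

a = np.array([[0.0, 0.0, 1.0]])
b = np.array([[0.0, 1.0, 0.0]])
print(mean_angular_error_deg(a, b))  # orthogonal normals: ~90 degrees
```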
Pre-training data source
Human-centric pre-training datasets are crucial to improving the performance of Sapiens models in various tasks, demonstrating the importance of human-specific data.
Zero-shot generalization
The Sapiens model demonstrates extensive zero-shot generalization capabilities and is able to adapt to different scenarios, age groups, and viewpoints despite limited training data.
Project address: https://about.meta.com/realitylabs/codecavatars/sapiens
- Author: KCGOD
- URL: https://kcgod.com/sapiens-visual-model-by-meta-ai
- Copyright: Unless otherwise stated, all articles in this blog are licensed under the BY-NC-SA agreement. Please credit the source!