Meta Reality Labs has released "Sapiens", a family of artificial intelligence models providing high-resolution capabilities for human-centric vision tasks: analyzing and understanding people and their actions in images and videos. These tasks include recognizing human poses, segmenting body parts, estimating depth, and inferring the orientation of object surfaces. The models were pretrained on more than 300 million human images and perform well in a variety of complex environments.
- 2D Pose Estimation: recognizing and estimating human body pose in 2D images.
- Body Part Segmentation: accurately segmenting human body parts in an image, e.g., identifying and distinguishing hands, feet, and head.
- Depth Estimation: predicting the depth of objects in an image, which helps understand distances and layout in 3D space.
- Surface Normal Prediction: inferring the orientation of object surfaces in an image, which helps in understanding an object's shape and material.
These models handle very high-resolution images and perform well with very little labeled data, or even with purely synthetic data, which makes them useful in real-world applications where data is scarce.
In addition, the Sapiens models are simple in design and easy to scale: as the number of parameters increases, performance improves significantly across tasks. On multiple human-centric vision benchmarks, Sapiens surpasses existing baseline models.
Scenarios of Sapiens
The Sapiens model is mainly used in several key human vision task areas. Its application scenarios and uses include:
1. 2D Pose Estimation
- Application scenarios: 2D pose estimation is a key technology in video surveillance, virtual reality, motion capture, medical rehabilitation, and related fields, where it recognizes human pose, movement, and gestures.
- Functionality: Sapiens accurately detects and predicts human keypoints (such as joints and facial features) and works well even in multi-person scenes, giving it broad potential in motion analysis and human-computer interaction.
2. Body Part Segmentation
- Application scenarios: Accurate human body part segmentation is a basic technology in fields such as medical image analysis, virtual fitting, animation production, and augmented reality (AR).
- Functionality: The Sapiens model can accurately classify each pixel in an image into different parts of the body (such as upper body, lower body, facial details, etc.). This helps develop more sophisticated virtual clothing fitting, medical diagnostic tools, and more natural virtual character animation.
3. Depth Estimation
- Application scenarios: Depth estimation is crucial in autonomous driving, robot navigation, 3D modeling and virtual reality, helping to understand the three-dimensional structure in the scene.
- Functionality: The Sapiens model can infer depth information for a scene from a single image, particularly in human-centric scenes. By generating high-quality depth maps, it supports applications that require understanding spatial relationships, such as obstacle detection in autonomous driving and robot path planning.
4. Surface Normal Prediction
- Application scenarios: Surface normal prediction is widely used in 3D rendering, physical simulation, reverse engineering, and lighting processing.
- Functionality: The Sapiens model can infer the surface normal direction of each pixel in the image, which is essential for generating high-quality 3D models and achieving more realistic lighting effects. This function is particularly important in applications that require precise surface features, such as virtual reality and digital content creation.
5. Common Human Vision Tasks
- Application scenarios: The Sapiens model can be applied to any scenario that requires understanding and analyzing human images, including social media content analysis, security monitoring, sports science research, and digital human generation.
- Functionality: Due to its strong performance on multiple tasks, Sapiens can be used as a general base model to support various human-centric vision tasks, thereby accelerating the development of related applications.
6. Virtual Reality and Augmented Reality
- Application scenarios: Virtual reality (VR) and augmented reality (AR) applications require highly accurate understanding of human posture and structure to achieve an immersive experience.
- Functionality: Sapiens supports the creation of realistic human images in virtual environments by providing high-resolution, accurate human pose and part segmentation, and can dynamically adapt to changes in user movements.
7. Medical and Health
- Application scenarios: In medical imaging and rehabilitation training, accurate posture detection and human segmentation can be used for patient monitoring, treatment tracking and rehabilitation guidance.
- Functionality: The Sapiens model helps medical professionals analyze patients' posture and movement to provide more personalized and effective treatment plans.
Technical Methods of Sapiens
1. Dataset and preprocessing
- Humans-300M dataset: The pre-training dataset for the Sapiens model is Humans-300M, a large-scale dataset containing 300 million "in-the-wild" human images. The dataset has been carefully curated to remove watermarks, text, artistic depictions, or unnatural elements.
- Data filtering: A pretrained bounding-box person detector filters the images; only images with a detection score above 0.9 and a bounding box larger than 300 pixels are kept, ensuring data quality.
- Multi-view capture and annotation: In order to accurately capture human body postures and parts, multi-view capture technology is used to acquire images, and 308 key points and 28 body part categories are manually annotated to generate high-quality annotated data.
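The filtering rule above can be sketched as a simple predicate over per-image detections. The record layout and the reading of "bounding box size" as the shorter box side are illustrative assumptions, not details confirmed by the text.

```python
# Sketch of the data-filtering step: keep an image only if some person
# detection clears both the score and size thresholds described above.
# Each detection here is a hypothetical (score, box_width, box_height) tuple.

def keep_image(detections, min_score=0.9, min_box_px=300):
    """Return True if any detection passes both quality thresholds."""
    return any(
        score > min_score and min(w, h) >= min_box_px
        for score, w, h in detections
    )

# One confident, large-enough detection is sufficient to keep the image.
sample = [(0.95, 420, 780), (0.40, 120, 200)]
print(keep_image(sample))  # True
print(keep_image([(0.95, 100, 100)]))  # False: box too small
```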
2. Model Architecture
- Vision Transformers (ViT): The Sapiens model uses the Vision Transformers (ViT) architecture, which has performed well in image classification and understanding tasks. By dividing the image into fixed-size non-overlapping patches, the model is able to handle high-resolution inputs and perform fine-grained reasoning.
- Encoder-Decoder Architecture: The model follows an encoder-decoder design. The encoder extracts features from the image and is initialized with pretrained weights, while the decoder is a lightweight, task-specific module that is randomly initialized and fine-tuned together with the encoder.
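The ViT patching step described above can be sketched in a few lines of NumPy: the image is cut into fixed-size, non-overlapping patches, each flattened into one token. The patch size of 16 is a common ViT choice used here for illustration, not a value stated in the text.

```python
import numpy as np

# Minimal sketch of ViT-style patching: split an (H, W, C) image into
# non-overlapping P x P patches and flatten each patch into a token row.

def patchify(img, patch=16):
    """Return an (N, patch*patch*C) array, one flattened patch per row."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    img = img.reshape(h // patch, patch, w // patch, patch, c)
    img = img.transpose(0, 2, 1, 3, 4)         # (H/p, W/p, p, p, C)
    return img.reshape(-1, patch * patch * c)  # one token per patch

tokens = patchify(np.zeros((1024, 1024, 3)))
print(tokens.shape)  # (4096, 768): a 64x64 grid of patches at 1024px input
```

At a 1024-pixel input this yields 4096 tokens, which is why high-resolution pretraining is so much more expensive than the usual 224-pixel ViT setting.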
3. Masked Autoencoder (MAE) Pre-training
- Masking strategy: The MAE method is used for model pre-training. The model reconstructs the original image by observing partially masked images. This strategy enables the model to learn more robust feature representations.
- High-resolution input: The input image resolution during pre-training is set to 1024 pixels, which brings 4 times the computational complexity compared to existing vision models, but also improves the output quality of the model.
- Multi-task learning: By fine-tuning on high-quality labeled data, the Sapiens model is able to handle multiple tasks such as 2D pose estimation, body part segmentation, depth estimation, and surface normal prediction.
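The masking strategy above can be sketched as random selection over patch tokens: most patches are hidden, the encoder sees only the visible ones, and the model is trained to reconstruct the rest. The 75% mask ratio is the common MAE default, an assumption not stated in the text.

```python
import numpy as np

# Sketch of MAE-style random patch masking: choose a random subset of
# patches to keep visible; everything else is masked and must be
# reconstructed from the visible context.

def random_mask(num_patches, mask_ratio=0.75, rng=None):
    rng = rng or np.random.default_rng(0)
    num_keep = int(num_patches * (1 - mask_ratio))
    perm = rng.permutation(num_patches)
    keep_idx = np.sort(perm[:num_keep])      # patches the encoder sees
    mask = np.ones(num_patches, dtype=bool)  # True = masked (reconstruct)
    mask[keep_idx] = False
    return keep_idx, mask

keep_idx, mask = random_mask(4096)
print(len(keep_idx), int(mask.sum()))  # 1024 3072
```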
4. Task-specific methods
- 2D pose estimation: A top-down approach detects the locations of K keypoints in the input image. The model predicts a heatmap for each keypoint and is trained with a mean squared error (MSE) loss.
- Body part segmentation: Each pixel in the input image is classified into one of C categories, and the model is trained with a weighted cross-entropy loss. It supports both a standard 20-category segmentation vocabulary and an extended 28-category vocabulary.
- Depth estimation: A modified segmentation architecture with a single output channel performs depth regression. High-resolution depth maps rendered from synthetic data are used for training, with a relative depth loss (L_depth).
- Surface normal prediction: The model predicts the xyz components of the normal vector for each pixel. The training loss (L_normal) combines an L1 term with a dot-product term between predicted and ground-truth normals.
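The surface-normal loss described above can be sketched as the sum of an L1 term and an alignment term derived from the dot product. Equal weighting of the two terms is an assumption for illustration; the exact weighting is not stated in the text.

```python
import numpy as np

# Sketch of a surface-normal loss: L1 distance between unit normals plus
# (1 - cosine similarity), which is zero when vectors are aligned.
# Equal weighting of the two terms is an illustrative assumption.

def normal_loss(pred, gt, eps=1e-8):
    """pred, gt: (N, 3) per-pixel normal vectors."""
    pred_u = pred / (np.linalg.norm(pred, axis=1, keepdims=True) + eps)
    gt_u = gt / (np.linalg.norm(gt, axis=1, keepdims=True) + eps)
    l1 = np.abs(pred_u - gt_u).sum(axis=1).mean()
    dot = (1.0 - (pred_u * gt_u).sum(axis=1)).mean()  # 0 when aligned
    return l1 + dot

perfect = np.array([[0.0, 0.0, 1.0]])
print(normal_loss(perfect, perfect))  # ~0.0 for identical normals
```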
5. Large-scale pre-training and fine-tuning
- Pre-training scale: The Sapiens models are pretrained on 300 million images; pretraining runs for 18 days on up to 1024 A100 GPUs using the PyTorch framework.
- Optimization method: The AdamW optimizer is used, combined with cosine-annealing and linear-decay learning-rate schedules. Different learning rates are applied at different layers to preserve the model's generalization ability.
- Fine-tuning strategy: For fine-tuning, input images are resized to a 4:3 aspect ratio, and standard data augmentations (cropping, scaling, flipping, and photometric distortion) are applied.
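The cosine-annealing schedule mentioned above can be sketched as a function of training step. The linear-warmup length and peak learning rate below are illustrative values, not figures from the paper.

```python
import math

# Sketch of a cosine-annealing learning-rate schedule with linear warmup:
# the rate climbs linearly to a peak, then decays along a half cosine.
# warmup=1000 steps and peak_lr=1e-3 are hypothetical illustration values.

def lr_at(step, total_steps, peak_lr=1e-3, warmup=1000):
    if step < warmup:                       # linear warmup phase
        return peak_lr * step / warmup
    t = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * t))  # cosine decay

print(lr_at(1000, 10000))   # peak: 0.001
print(lr_at(10000, 10000))  # fully decayed to ~0
```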
Experimental Results of Sapiens
2D Pose Estimation
The Sapiens model performs well in 2D pose estimation tasks, especially in key point detection of the whole body, face, hands and feet, significantly surpassing existing state-of-the-art methods.
Body Part Segmentation
The Sapiens model achieves higher mean intersection over union (mIoU) and pixel accuracy (mAcc) in body part segmentation tasks, and performs particularly well in detail-rich segmentation tasks.
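The mIoU metric cited above can be sketched directly: per-class intersection over union, averaged across the classes present. The label arrays below are toy values for illustration only.

```python
import numpy as np

# Sketch of mean intersection-over-union (mIoU) for segmentation:
# per-class IoU averaged over classes that occur in prediction or truth.

def mean_iou(pred, gt, num_classes):
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                 # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([0, 0, 1, 1])
gt   = np.array([0, 1, 1, 1])
print(mean_iou(pred, gt, 2))  # class 0: 1/2, class 1: 2/3 -> 0.5833...
```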
Depth Estimation
The Sapiens model performs well in depth estimation, particularly in human-centric scenes, where its accuracy significantly exceeds existing methods, including in complex multi-person scenes.
Surface Normal Prediction
In the surface normal prediction task, the Sapiens model demonstrates higher accuracy and consistency, performs well across different scenarios, and significantly reduces the mean angular error.
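The mean angular error used to score surface normals, mentioned above, can be sketched as the average angle between predicted and ground-truth unit vectors. The input vectors below are toy values for illustration.

```python
import numpy as np

# Sketch of mean angular error for surface normals: average angle, in
# degrees, between predicted and ground-truth unit normal vectors.

def mean_angular_error_deg(pred, gt):
    """pred, gt: (N, 3) arrays of unit normal vectors."""
    cos = np.clip((pred * gt).sum(axis=1), -1.0, 1.0)  # clamp for arccos
    return float(np.degrees(np.arccos(cos)).mean())

a = np.array([[0.0, 0.0, 1.0]])
b = np.array([[0.0, 1.0, 0.0]])
print(mean_angular_error_deg(a, b))  # orthogonal normals: ~90 degrees
```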
Pre-training data source
Human-centric pre-training datasets are crucial to improving the performance of Sapiens models in various tasks, demonstrating the importance of human-specific data.
Zero-shot generalization
The Sapiens model demonstrates extensive zero-shot generalization capabilities and is able to adapt to different scenarios, age groups, and viewpoints despite limited training data.
Project address: https://about.meta.com/realitylabs/codecavatars/sapiens
- Author: KCGOD
- URL: https://kcgod.com/sapiens-visual-model-by-meta-ai
- Copyright: Unless otherwise stated, all articles in this blog are licensed under the BY-NC-SA agreement. Please credit the source!