X-Portrait 2: ByteDance's Revolutionary AI Animation Tool for Cross-Style Expression Transfer

Last week, Runway launched Act-One, a generative character performance tool that converts video into virtual character animation in any style while keeping expressions, voice, and lip movements synchronized. Starting from nothing more than a camera recording of an actor's performance, Act-One produces the character animation and carries over details like the actor's gaze, facial expressions, movement rhythm, and speaking patterns.

This week, ByteDance's team reached out to me, saying they also have a similar product in internal testing that performs even better than Runway's Act-One, and invited me to try it out.

I was genuinely surprised by what I discovered. ByteDance clearly has many impressive technologies, though they tend to keep them under wraps.

This tool doesn't have an official product name yet and is internally called X-Portrait 2. As the name suggests, they've been working on it for some time, given it's already in its second iteration.

X-Portrait 2 is an efficient AI-powered portrait animation generation tool. Users only need to provide a static portrait image and a "driving video" containing expressions and movements. X-Portrait 2 can then transfer the expressions and movements from the video to the static image, generating natural, fluid, and expressive animations.

Beyond transferring the actions and expressions of the person in the video to the target image, it also captures and reproduces extremely subtle facial expression changes, such as pouting, puffing cheeks, and frowning, so the generated animations are both smooth and emotionally rich.

Let me share some test cases I've tried.

X-Portrait 2 can accurately capture and convey rapid head movements and even reproduce subtle facial expression changes and emotional transitions from the video, making the generated animations more realistic and vivid.

The model is highly adaptable and can transfer expressions across styles, for example from a realistic portrait to a cartoon image.

It works well with both real human portraits and virtual characters like cartoons and comic characters.

Previously, such results required actors to wear motion capture equipment or relied on camera-based motion capture technology. Now they can be achieved with nothing more than a static image and a driving video.

Separation of "Face" and "Expression": Changing Expressions Without Altering Faces

To maintain the original appearance while animating photos, X-Portrait 2 employs a method that separates "face" and "expression."

This approach is like splitting a person's appearance from their expressions, allowing only the expressions to change while preserving the original facial features.

This separation method ensures that photos maintain their original appearance while mimicking video expressions, preventing facial structure changes due to expressions.

Detailed Motion Reproduction: Capturing Every Detail

X-Portrait 2 is highly sensitive to subtle expressions and rapid movements. For instance, quick head turns, pouting, or slight eyebrow raises are all captured and reproduced by the model, resulting in very detailed video effects. This precise motion reproduction makes it particularly suitable for visual effects or animation production, making generated characters appear more realistic.

Compared to state-of-the-art methods like X-Portrait and the recently released Runway Act-One, X-Portrait 2 more faithfully represents rapid head movements, subtle expression changes, and intense personal emotions, which are crucial for high-quality content creation (such as animation and film production).

Technical Innovations:

High-Precision Expression Encoder: Achieving Realistic Reproduction of Subtle Expressions

Capturing subtle emotional changes: X-Portrait 2's expression encoder, trained on large-scale datasets, can capture and reproduce complex facial details and emotional changes. For example, it can accurately reproduce small but crucial expressions like pouting, puffed cheeks, and frowning, making the generated animations not just mechanical imitations but full of personality and subtle emotions.

High-fidelity expression transfer: The encoder preserves the original video's emotions and tone during generation, making expressions more natural and accurately conveying emotional intensity, providing creators with an animation generation experience that surpasses traditional methods.

Strong Appearance and Motion Disentanglement

Separating appearance from expression changes: X-Portrait 2's technical architecture separates image appearance from expression and motion, allowing the model to focus solely on transferring expression and motion information without altering the static portrait's appearance. This separation ensures the independence and consistency of expression generation, especially when handling complex dynamic changes, making expression transfer more natural.

Support for multi-style applications: The separation of appearance and motion means the model can easily apply to images of different styles. Whether it's realistic portraits or cartoon characters, X-Portrait 2 can accurately transfer expressions to the target style. This cross-style capability allows creators to integrate image materials of different styles into one project, enriching creative expression.
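To make the idea of disentanglement concrete, here is a minimal toy sketch in Python. It is not X-Portrait 2's architecture: the `Frame` type, the two encoder functions, and `animate` are hypothetical names invented for illustration, and the split between appearance and motion is simply given here, whereas a real model has to learn it from pixels. The sketch only shows the contract described above: appearance comes from the static portrait, motion comes from the driving video, and the two are recombined frame by frame.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vector = Tuple[float, ...]


@dataclass(frozen=True)
class Frame:
    """Toy frame whose two factors are already separated (an assumption)."""
    appearance: Vector  # identity and drawing style
    motion: Vector      # expression and head pose at this instant


def encode_appearance(frame: Frame) -> Vector:
    """Hypothetical appearance encoder: keeps identity and style only."""
    return frame.appearance


def encode_motion(frame: Frame) -> Vector:
    """Hypothetical motion encoder: keeps expression and pose only."""
    return frame.motion


def animate(portrait: Frame, driving_video: List[Frame]) -> List[Frame]:
    """Recombine the portrait's appearance with each driving frame's motion."""
    appearance = encode_appearance(portrait)
    return [Frame(appearance, encode_motion(f)) for f in driving_video]


# One driving performance applied to two portraits of different styles.
actor = (0.9, 0.1)
driving = [Frame(actor, (0.0, 0.0)), Frame(actor, (0.5, -0.2)), Frame(actor, (1.0, 0.3))]
photo = Frame((0.2, 0.7), (0.0, 0.0))
cartoon = Frame((0.5, 0.5), (0.0, 0.0))

photo_clip = animate(photo, driving)
cartoon_clip = animate(cartoon, driving)
```

In this toy, both clips follow the same motion sequence as the driving video while each keeps its own appearance, which is the property that makes cross-style transfer possible.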

Innovative Application of Generative Diffusion Models

Multi-view training and diffusion generation: X-Portrait 2 uses a generative diffusion model trained on multi-view data, so it can reproduce expression changes from different angles and generate more fluid, realistic animation. Through multi-view training, the diffusion model keeps expression movements natural and coherent at every angle, avoiding the inconsistency issues found in traditional methods.

Denoising mechanism and consistency optimization: The diffusion model uses a denoising mechanism during generation, producing higher quality images and reducing noise in expression and motion transitions. This denoising process ensures clarity in complex expressions and rapid movements, making generated animations smoother and more refined.
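For readers unfamiliar with how a diffusion model denoises, the toy loop below may help. It is a sketch under loud assumptions, not X-Portrait 2's sampler, which is not described here: it uses a DDIM-style deterministic update, a linear noise schedule chosen arbitrarily, and an "oracle" noise predictor that already knows the clean target. In a real system a trained network, conditioned on the portrait and the driving expression, would play the oracle's role. All function names are hypothetical.

```python
import math
import random
from typing import List


def make_alpha_bars(num_steps: int, beta_start: float = 1e-4,
                    beta_end: float = 0.02) -> List[float]:
    """Cumulative signal fraction per step; index 0 means no noise at all."""
    alpha_bars = [1.0]
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alpha_bars


def oracle_noise(x_t: List[float], target: List[float], alpha_bar: float) -> List[float]:
    """Stand-in for a trained noise predictor (assumption: it knows the target)."""
    scale = math.sqrt(1.0 - alpha_bar)
    return [(x - math.sqrt(alpha_bar) * c) / scale for x, c in zip(x_t, target)]


def denoise(target: List[float], num_steps: int = 50, seed: int = 0) -> List[List[float]]:
    """Start from Gaussian noise and apply deterministic DDIM-style steps."""
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in target]
    trajectory = [x]
    for t in range(num_steps, 0, -1):
        a_t, a_prev = alpha_bars[t], alpha_bars[t - 1]
        eps = oracle_noise(x, target, a_t)
        # Estimate the clean sample implied by the predicted noise.
        x0_est = [(xi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t)
                  for xi, e in zip(x, eps)]
        # Re-noise that estimate to the (lower) noise level of the previous step.
        x = [math.sqrt(a_prev) * c + math.sqrt(1.0 - a_prev) * e
             for c, e in zip(x0_est, eps)]
        trajectory.append(x)
    return trajectory


clean_frame = [0.2, -0.5, 0.9]       # toy "pixels" of the frame we want
steps = denoise(clean_frame)
final_frame = steps[-1]              # the sample after the last denoising step
```

Because the oracle is perfect, the last step lands on the clean frame up to floating-point error. A learned predictor would only approximate this, which is why the quality of the conditioning signal, such as the expression encoding, matters so much.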

Highly Adaptive Cross-Domain Expression Transfer Capability

Support for cross-domain applications: X-Portrait 2's cross-domain transfer capability makes it suitable for animation needs across different styles and domains, easily achieving expression transfer from realistic portraits to virtual characters, comic styles, and more. This cross-domain adaptability allows flexible use in creation, providing creators with a broader range of style choices.

Compatibility with multiple driving inputs: X-Portrait 2 supports various types of driving videos, whether they're film shots, animations, or user-recorded videos. This compatibility broadens the tool's applicability and lets creators pick the most suitable driving source for each need.

Enhancement of Realism and Dynamic Expressiveness

Realism and detail capture: X-Portrait 2 can meticulously reproduce rapid head movements, subtle facial changes, and emotional characteristics, enhancing the realism of generated animations. Compared to traditional methods, the model shows clear advantages on highly dynamic, expressive performances, bringing generated animations closer to real footage.

Cinema-grade animation quality: X-Portrait 2 excels at generating dynamic scenes, making it suitable for high-quality film and animation production. Whether conveying subtle emotion or dramatic expression changes, it keeps expressions fluid and coherent, bringing cinema-grade animation quality to content creation.

Project page: https://byteaigc.github.io/X-Portrait2/