StoryMaker is a personalized image-generation solution designed to produce consistent character images. It maintains not only facial consistency in multi-character scenes but also consistency of clothing, hairstyle, and body pose, so it can be used to generate a storyline told through a series of pictures.
This consistency is essential for telling a continuous story, as in comics, story visualization, and similar applications.
What Issues Does StoryMaker Solve
Overall consistency
Previous image-generation methods can maintain consistency in facial features but not in other aspects such as clothing, hairstyle, and body. StoryMaker achieves full character consistency by combining facial information with cropped character images.
Avoid mixing characters and backgrounds
When generating multiple characters, traditional methods often blend characters together or merge them with the background. StoryMaker keeps different characters and the background independent by constraining their cross-attention regions.
Pose Decoupling
StoryMaker uses pose decoupling technology to enable characters to show different poses in different images without relying on the poses of the reference image, making the generated stories more diverse.
Key Features of StoryMaker
Multi-character consistent generation
StoryMaker is able to generate images of multiple characters with consistency in face, clothing, hairstyle, body, etc. This is very important for generating coherent and narrative image sequences.
Background and character separation
By constraining the cross-attention of different characters and backgrounds, StoryMaker can effectively avoid the confusion between characters and backgrounds or between different characters, thereby maintaining the clear separation of various parts in the image.
Pose diversity
StoryMaker supports pose decoupling. Combined with ControlNet, it can generate character images in different poses while maintaining character consistency, allowing a character to show diverse actions and poses across scenes.
High-fidelity image generation
Through LoRA technology, StoryMaker enhances the fidelity and visual quality of generated images, preserving character consistency while ensuring detail and realism.
Flexible text control
StoryMaker can control the background, pose, and style of generated images through text prompts, allowing users to generate image sequences that meet narrative needs based on different scene requirements.
Support for multiple applications
The model supports functions such as clothing exchange and character interpolation, and can be integrated with other generation plug-ins (such as LoRA and ControlNet), supporting a variety of generation scenarios.
Core Technologies and Main Methods of StoryMaker
1. Core Technologies
- Positional-aware Perceiver Resampler (PPR): PPR is one of the core technologies of StoryMaker, responsible for extracting facial and character features from reference images. It processes facial features and character features separately through independent Resampler modules, and then combines positional embedding to distinguish different characters. In this way, StoryMaker can maintain the independence of each character in a multi-character scene and prevent the characteristics of the character from being confused with the background or other characters.
- Function: Used to extract and process the facial, clothing, hairstyle, and body features of characters, integrate this information into character feature embedding, and distinguish different characters through position information.
- How it works: The PPR module generates embedding information from facial features and character features, combined with position information to distinguish characters from background, ensuring the consistency of each character's appearance.
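The fusion step above can be illustrated with a minimal numpy sketch. This is a toy illustration under assumed dimensions, not the actual PPR module: the function name `build_character_tokens` and the simple additive fusion are hypothetical stand-ins for the Resampler outputs, per-character positional embeddings, and the learnable background embedding described above.

```python
import numpy as np

def build_character_tokens(face_embs, char_embs, pos_embs, bg_emb):
    """Fuse per-character face and body embeddings, then add a positional
    embedding per character so different characters stay distinguishable.
    A (learnable) background embedding is appended as an extra token.
    face_embs, char_embs, pos_embs: (num_chars, dim); bg_emb: (dim,)."""
    tokens = face_embs + char_embs + pos_embs      # one fused token per character
    return np.vstack([tokens, bg_emb[None, :]])    # append the background token

rng = np.random.default_rng(0)
dim, n_chars = 8, 2
face = rng.normal(size=(n_chars, dim))   # e.g. ArcFace-style face features
body = rng.normal(size=(n_chars, dim))   # e.g. CLIP features of the cropped character
pos = rng.normal(size=(n_chars, dim))    # positional embedding per character
bg = rng.normal(size=dim)                # learnable background embedding
tokens = build_character_tokens(face, body, pos, bg)
print(tokens.shape)  # (3, 8): one token per character plus the background token
```

The positional embedding is what lets downstream attention tell "character 1" from "character 2" even when their visual features are similar.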
- Decoupled Cross-Attention: This technology is based on IP-Adapter and allows the image generation model to decouple the image features of the character from the text prompts during the generation process. Specifically, it injects facial and character features into the text-image generation model separately through a dual cross-attention mechanism to ensure that the generated image can maintain the consistency of the character.
- Function: Avoid confusion between multiple characters and backgrounds, and ensure the independence of characters and backgrounds in the image.
- How it works: Through the decoupled cross-attention mechanism, character and background information are embedded into the generative model independently, ensuring clear separation of the different elements.
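The decoupling idea can be sketched in a few lines of numpy. This is a simplified single-head version under toy dimensions, not StoryMaker's implementation: following the IP-Adapter pattern, text tokens and injected image tokens each get their own cross-attention over the same queries, and the two outputs are summed.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    # scaled dot-product cross-attention
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def decoupled_cross_attention(q, text_k, text_v, img_k, img_v, scale=1.0):
    """IP-Adapter-style decoupling: text and image conditions each get a
    separate cross-attention over the same latent queries; outputs are summed,
    so image features never mix with text keys/values."""
    return cross_attention(q, text_k, text_v) + scale * cross_attention(q, img_k, img_v)

rng = np.random.default_rng(0)
d = 16
q = rng.normal(size=(4, d))                                # latent queries
tk, tv = rng.normal(size=(7, d)), rng.normal(size=(7, d))  # text tokens
ik, iv = rng.normal(size=(3, d)), rng.normal(size=(3, d))  # character tokens
out = decoupled_cross_attention(q, tk, tv, ik, iv, scale=0.5)
print(out.shape)  # (4, 16)
```

Setting `scale=0` recovers plain text-conditioned attention, which is what makes the image condition an additive, removable branch.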
- ControlNet: ControlNet is used to solve the control problem of pose generation. By decoupling the pose information of the character image, StoryMaker can control the pose changes of the character in the generated image through text prompts or reference poses, which is especially important for the generation of multiple image sequences.
- Function: Supports the generation of character images with different poses, thereby increasing the diversity and narrative ability of generated images.
- How it works: Combined with ControlNet, StoryMaker can control the generated character poses through text or predefined poses while maintaining character consistency.
- LoRA (Low-Rank Adaptation): LoRA is an important module used to improve the quality and consistency of image generation. It enhances the consistency of character identity, clothing, and hairstyle by fine-tuning the cross-attention of each layer, while improving the quality of details in the generated images. LoRA's low-rank adaptation technology effectively reduces the computational burden of training while maintaining high-quality output.
- Function: Enhance the fidelity and visual quality of images and maintain the consistency of characters.
- How it works: By adding a LoRA layer to each attention mechanism, only a small number of low-rank parameters are trained to improve the quality and detail of the generated images.
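The low-rank update itself is simple enough to show directly. This is a generic LoRA forward pass in numpy under assumed toy shapes, not StoryMaker's training code: the frozen weight `W` is augmented with a trainable product `B @ A` of rank `r`, scaled by `alpha / r`.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=8):
    """LoRA: frozen weight W plus a low-rank update B @ A, scaled by alpha/r.
    Only A (r x d_in) and B (d_out x r) are trained."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 4
W = rng.normal(size=(d_out, d_in))   # frozen attention weight
A = rng.normal(size=(r, d_in))
B = np.zeros((d_out, r))             # B starts at zero: the update is a no-op at init
x = rng.normal(size=(1, d_in))
print(np.allclose(lora_forward(x, W, A, B), x @ W.T))  # True at initialization
print(A.size + B.size, "trainable params vs", W.size, "frozen")
```

The trainable parameter count here is 512 versus 4096 frozen, which is where the reduced computational burden mentioned above comes from.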
2. Main Methods
- Information extraction: StoryMaker uses facial recognition models (such as ArcFace) to extract facial features from reference images, and uses CLIP image encoder to extract character clothing, hairstyle, and body information. The extracted facial and character features are passed to two independent Resampler modules for processing to ensure that different features of each character are effectively extracted.
- Refinement and fusion of reference information: The extracted facial and character features are fused through the PPR module and combined with position information embedding to distinguish different characters. This process also introduces a learnable background embedding to ensure the separation between characters and background. In multi-character scenes, this module effectively prevents feature confusion between characters and between characters and background.
- Decoupled Cross Attention: StoryMaker uses the decoupled cross attention mechanism to fuse the extracted image information with the text prompts. Through the dual attention mechanism, facial features and character features are embedded into the text-image generation model separately, ensuring that the generated image not only retains the consistency of the character's face, but also maintains the consistency of other features such as clothing and hairstyle.
- Pose decoupling: During training, StoryMaker uses Pose-ControlNet to control the pose of the character in the generated image. By decoupling the pose of the reference image, StoryMaker can generate character images with different poses through text prompts, and can also input new poses during inference to control the dynamics of the character in the generated image.
- Loss constraints during training: To prevent interference between characters and between characters and the background, StoryMaker uses segmentation masks to constrain the cross-attention regions, computing a mean squared error (MSE) loss between the soft cross-attention values and the segmentation mask. This design helps StoryMaker achieve effective separation of characters and background.
- Overall loss function: The final loss function combines the diffusion loss and the cross-attention loss constraints to ensure that the generated images maintain high quality while maintaining consistency and independence between multiple characters and backgrounds.
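The two loss terms above can be sketched as follows. This is a toy numpy illustration under made-up values, not the paper's training code: `attention_mask_loss` is the MSE between soft attention values and a binary segmentation mask, and `total_loss` combines it with the diffusion loss using an assumed weight `lam`.

```python
import numpy as np

def attention_mask_loss(attn_probs, seg_mask):
    """MSE between soft cross-attention values and a binary segmentation mask,
    pushing each subject's attention toward its own image region."""
    return np.mean((attn_probs - seg_mask) ** 2)

def total_loss(diffusion_loss, attn_loss, lam=0.1):
    # combined objective: denoising loss plus the weighted attention constraint
    return diffusion_loss + lam * attn_loss

attn = np.array([[0.9, 0.1],
                 [0.2, 0.8]])   # toy 2x2 soft attention map
mask = np.array([[1.0, 0.0],
                 [0.0, 1.0]])   # the character occupies the diagonal region
print(attention_mask_loss(attn, mask))  # 0.025
```

Pixels where attention leaks outside the mask (the off-diagonal 0.1 and 0.2 here) are exactly what this term penalizes.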
Experimental Results of StoryMaker
Quantitative Evaluation
In multiple benchmark tests, StoryMaker performs well on image consistency, especially for facial, clothing, and body features. Compared with other models, it stands out in the following respects:
- CLIP-I (Image Consistency): StoryMaker achieved the highest overall image consistency score, being able to simultaneously maintain consistency of faces, hairstyles, and clothing.
- Facial Similarity: Although InstantID is slightly higher in facial similarity, StoryMaker is the only model that can keep the identities, clothing, and hairstyles of multiple characters consistent in multi-character scenes.
- CLIP-T (Text Consistency): Because of its focus on character consistency, StoryMaker scores slightly lower on text consistency, but it still generates high-quality images relevant to the prompt.
Qualitative Assessment
- Single-character generation: In the single-character generation task, StoryMaker maintains consistency of face and clothing, outperforming IP-Adapter-FaceID and InstantID, which perform well on face consistency but poorly on clothing consistency.
- Multi-character generation: In the multi-character generation task, StoryMaker is able to maintain the consistency of multiple characters in face, clothing, and hairstyle through independent feature embedding. In addition, StoryMaker can generate multiple poses and adjust the background and style based on text prompts.
Composite of two portraits
Uses of StoryMaker
StoryMaker has broad application prospects, especially in personalized image generation, storytelling, and digital creation.
Typical applications include:
Personalized story generation
- Uses: Generate a series of images with consistent characters based on a reference image to tell a complete story. Control background, pose and scene changes through text prompts.
- Example: Generate a five-image story describing a "day in the life of an office worker" where the character's pose and background vary based on text prompts, but the character's face, clothing, and hairstyle remain consistent.
Clothing exchange
- Uses: Generate images of different clothing styles by replacing the character's clothing images while keeping the character's face and body consistent.
- Example: Replace a character's clothes with other clothing images to generate a series of images of the character wearing different clothes.
Character Interpolation
- Uses: Perform feature interpolation between two characters to generate an image that combines the features of the two characters.
- Example: Generate a new character image with mixed features by interpolating between two characters, maintaining a natural transition between faces and clothing.
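A common way to realize such interpolation is linear blending in embedding space. The sketch below is a hypothetical illustration (the function name and the use of plain linear interpolation are assumptions; the paper does not specify the exact scheme): blending the two characters' feature embeddings with a weight `t` before generation.

```python
import numpy as np

def interpolate_characters(emb_a, emb_b, t):
    """Linearly interpolate between two character embeddings:
    t=0 gives character A, t=1 gives character B."""
    return (1.0 - t) * emb_a + t * emb_b

a = np.array([1.0, 0.0, 2.0])  # toy embedding of character A
b = np.array([0.0, 2.0, 0.0])  # toy embedding of character B
print(interpolate_characters(a, b, 0.5))  # midpoint: [0.5, 1.0, 1.0]
```

Sweeping `t` from 0 to 1 yields a sequence of embeddings, and hence images, that transition naturally between the two characters' faces and clothing.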
Multimodal plugin integration
- Uses: StoryMaker, as a plug-in module, can be used in conjunction with tools such as LoRA or ControlNet to generate more diverse and personalized images while maintaining character consistency.
- Example: Generate images of different styles by combining LoRA while maintaining the consistency of the characters.
Model download: https://huggingface.co/RED-AIGC/StoryMaker
Technical report: https://arxiv.org/pdf/2409.12576
- Author: KCGOD
- URL: https://kcgod.com/StoryMaker
- Copyright: All articles in this blog, except where otherwise stated, are licensed under the BY-NC-SA agreement. Please cite the source.