ReSyncer is a new framework developed by Tsinghua University, Baidu and Nanyang Technological University's S-Lab. It generates highly realistic, audio-synchronized lip-sync videos and supports personalized fine-tuning, video-driven lip sync, speaking-style transfer, and face swapping.
- High-fidelity audio-driven lip-sync videos: ReSyncer can create highly realistic videos whose mouth movements are accurately synced to the audio.
- Personalized fine-tuning: Allows users to personalize the generated content to meet different needs.
- Video-driven lip sync: In addition to audio, it can also drive synchronization based on the mouth movements of other videos, allowing the characters in the new video to imitate the speaking movements in the existing video.
- Speaking style transfer: ReSyncer can transfer one person’s speaking style (such as tone and rhythm) to another person.
- Face Swap: It can also replace the face of the speaker in the video while keeping the lip sync with the audio.
ReSyncer solves the following issues:
1. Multifunctional audio and video synchronization
- Problem: Existing lip-syncing techniques usually focus on specific tasks, such as generating lip-sync videos or performing facial editing. They typically require dedicated training on long video clips, are inefficient, and may leave visible defects in the generated videos.
- Solution: The ReSyncer framework achieves efficient unified model training by reconfiguring the style-based generator and incorporating 3D facial dynamic information. This framework not only generates high-fidelity lip-sync videos, but also supports multiple features such as fast personalized fine-tuning, video-driven lip-sync generation, speaking style transfer, and even face-swapping.
2. High-quality lip-sync generation
- Problem: Many existing methods rely on low-dimensional audio information to directly modify high-dimensional visual data, which may cause unstable mouth movements or other visual defects in the video. In addition, traditional methods are prone to leaving visible artifacts when processing high-quality videos.
- Solution: ReSyncer uses 3D facial meshes as intermediate representations and combines them with the style-injected lip-sync converter (Style-SyncFormer) to generate high-quality, stable lip-sync videos through unified training. This framework effectively solves the problem of cross-domain information injection between audio and image domains, and improves the stability and visual quality of the generated results.
3. Unified face swapping and lip syncing
- Problem: Traditionally, face swapping and lip syncing are usually handled separately. The two tasks require different models and training methods, resulting in low efficiency.
- Solution: The ReSyncer framework implements face swapping and lip syncing in a unified model by leveraging 3D facial meshes and style space information. This enables the framework to achieve high-fidelity face swapping while maintaining high-quality lip sync generation, meeting the diverse needs of creating virtual performers.
Main Features of ReSyncer
High-fidelity audio-synced lip-sync video generation
- ReSyncer can generate lip-synced animation videos from audio, ensuring that the mouth movements accurately match the input sound. In effect, you can apply a recording to a person's face and have the mouth movements match the voice exactly.
- For example, if you have an audio clip and want a video of someone "saying" it, ReSyncer can match every subtle mouth movement to the audio, so the resulting video looks as if the person is really speaking those words.
Personalized fine-tuning
- ReSyncer can quickly learn and adapt to a specific person's mouth shape and facial movement patterns from only a few seconds of video, so you can use it to create personalized videos for different people. It also supports personalized adjustments, allowing users to fine-tune the generated content to suit specific needs.
- Suppose you want the tool to learn your mouth shape and facial expressions: it only needs to watch a few seconds of your video, and it can then generate mouth animations that feel tailor-made for you.
Video-driven lip-sync
- In addition to driving lip sync through audio, ReSyncer can also drive synchronization based on mouth movements in other videos, allowing the generated character to imitate the speaking movements in existing videos. This means you can use the movements in one video to control the mouth in another video.
- For example, if you have two videos, one of which is someone speaking and the other is another person's face, ReSyncer can make the person in the second video "speak" according to the mouth movements of the person in the first video, so that the two videos can be seamlessly combined.
Speaking style transfer
- Not only can you match people's mouths to audio, but you can also "transfer" one person's speaking style (such as tone, rhythm, and expression) to another person, so that the generated video presents a specific speaking style. For example, you can make one person "speak" in the way another person speaks.
- For example, if you have a speaker who always speaks slowly and methodically, ReSyncer can allow another person to imitate the speaker's style while speaking, and the resulting video will make it feel like the other person has learned how the speaker speaks.
Face Swap
- The framework also supports high-quality face swapping, which can replace the speaker's face in the video while keeping the mouth movements, expressions and audio in sync. This means not only can the face be swapped, but the swapped face can also continue to be synchronized with the sound. This allows users to seamlessly replace different faces in the video, which is suitable for a variety of creative scenarios.
- ReSyncer can not only do this, but also ensure that the mouth shape of the replaced face still accurately matches the audio when speaking, making it look as if the "new face" originally belonged to this body.
Versatile unified model
- A notable feature of ReSyncer is that it implements all of the above functions through a unified model. This means users do not need different tools for different tasks (such as lip sync and face swapping); ReSyncer completes all of them with one model.
- You only need one tool to complete all these complex tasks, saving time and energy.
Real-time processing and application
- ReSyncer can be used in live broadcasts, generating video output that is synchronized with the audio in real time. This means you can make a virtual character "speak" in a live broadcast, with the character's mouth shape synchronized with the sound as it happens.
- If you use an avatar in your live stream, ReSyncer can synchronize the character's mouth movements with your speech, making it seem like you are speaking live. This is very helpful for live streams that require a virtual host or digital avatar.
Technical Methods of ReSyncer
1. Reconfiguration of Style-based generator
- Core idea: The key innovation of ReSyncer is to reconfigure the traditional style-based generator so that it can more effectively handle synchronization between audio and vision. The framework first converts the audio input into 3D facial dynamics (i.e., changes in the 3D facial mesh), which then guide the generator.
- Implementation: By incorporating 3D facial dynamic information into the generator’s training process, the generator’s ability to handle complex facial expressions and lip shape changes is enhanced.
- Detail:
- It uses the pre-trained Wav2Vec2 model to extract audio features and then processes them through the Transformer structure.
- This module learns how to generate displacements of a facial mesh based on audio, thereby capturing the speaker’s mouth movements and changes in expression.
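For illustration, here is a minimal PyTorch sketch of this audio-to-mesh step, assuming the pre-trained Wav2Vec2 model from the Hugging Face transformers library; the class name AudioToMeshPredictor, the vertex count, and all hyperparameters are assumptions, not the paper's released code.

```python
# Minimal sketch (not the released code) of the audio-to-mesh step:
# Wav2Vec2 features -> Transformer encoder -> per-frame vertex displacements.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model


class AudioToMeshPredictor(nn.Module):
    def __init__(self, n_vertices=5023, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        # Pre-trained audio encoder, kept frozen in this sketch.
        self.wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        self.wav2vec.requires_grad_(False)
        self.proj = nn.Linear(self.wav2vec.config.hidden_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Regress a displacement (x, y, z) for every mesh vertex, per frame.
        self.head = nn.Linear(d_model, n_vertices * 3)

    def forward(self, waveform):
        # waveform: (batch, num_samples) raw 16 kHz audio
        feats = self.wav2vec(waveform).last_hidden_state   # (B, T, 768)
        h = self.encoder(self.proj(feats))                  # (B, T, d_model)
        offsets = self.head(h)                              # (B, T, V * 3)
        return offsets.view(offsets.size(0), offsets.size(1), -1, 3)


# Usage: predicted displacements are added to a neutral template mesh per frame.
# deltas = AudioToMeshPredictor()(torch.randn(1, 16000))   # 1 second of audio
```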
2. Style-injected Transformer
- Core idea: A technology called Style-injected Transformer is introduced to integrate audio information and 3D facial dynamic information.
- How it works: The Transformer network extracts features from the audio and combines these features with the dynamic changes of the 3D facial model to generate facial animations that are synchronized with the audio. This process consists of two steps:
- 3D facial dynamics prediction: predicting the displacement and change of the corresponding 3D facial model from audio features.
- Style Injection: These changes are injected into the generated facial model via a Transformer network to generate lip-synced videos with personalized style.
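The style-injection step could look roughly like the following sketch, which maps mesh-derived motion features to additive offsets in a StyleGAN2-style latent space; the StyleInjector class, the number of style layers, and the per-layer MLPs are illustrative assumptions rather than the actual Style-SyncFormer implementation.

```python
# Sketch of the style-injection idea: motion features derived from the
# predicted 3D mesh are mapped to additive offsets in a StyleGAN2-style latent
# space, one offset per style layer. All names and sizes are assumptions.
import torch
import torch.nn as nn


class StyleInjector(nn.Module):
    def __init__(self, motion_dim=512, style_dim=512, n_styles=14):
        super().__init__()
        # One small MLP per style layer, each predicting an additive offset.
        self.mappers = nn.ModuleList(
            nn.Sequential(nn.Linear(motion_dim, style_dim),
                          nn.LeakyReLU(0.2),
                          nn.Linear(style_dim, style_dim))
            for _ in range(n_styles))

    def forward(self, identity_styles, motion_feat):
        # identity_styles: (B, n_styles, style_dim) codes of the rendered face
        # motion_feat:     (B, motion_dim) features from the driving 3D mesh
        offsets = torch.stack([m(motion_feat) for m in self.mappers], dim=1)
        # The injected styles are then fed to the (reconfigured) generator.
        return identity_styles + offsets
```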
3. Use of 3D facial dynamics
- Core idea: Use 3D facial dynamic model as an intermediate representation to better guide the fusion of audio and vision.
- What it does: During training, a 3D facial mesh (i.e., the geometric structure of the face) is used to guide the generator in reconstructing the target frame. This approach overcomes the limitation of directly using low-dimensional audio information to modify high-dimensional visual data, making the generated lip sync more accurate and natural.
- Detail:
- During training, the model compares the facial mesh to a standard template mesh and learns variations in facial expressions.
- Such prediction of 3D facial dynamics enables high-precision lip synchronization and provides strong spatial guidance for subsequent face generation.
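A small NumPy sketch of this intermediate representation, under the assumption that facial motion is stored as per-vertex offsets from a neutral template and splatted into a 2D map as spatial guidance (the real pipeline would rasterize the full mesh; all shapes here are illustrative):

```python
# NumPy sketch: motion as per-vertex offsets from a neutral template, plus a
# very rough 2D "guidance map" obtained by splatting the vertices.
import numpy as np


def mesh_displacements(frame_vertices, template_vertices):
    """Per-vertex offsets (V, 3) of one frame relative to the neutral template."""
    return frame_vertices - template_vertices


def render_guidance_map(vertices, image_size=256):
    """Splat mesh vertices (assumed normalized to [-1, 1] in x/y) onto a map."""
    guide = np.zeros((image_size, image_size), dtype=np.float32)
    xy = ((vertices[:, :2] + 1.0) * 0.5 * (image_size - 1)).astype(int)
    xy = np.clip(xy, 0, image_size - 1)
    guide[xy[:, 1], xy[:, 0]] = 1.0
    return guide


template = np.random.uniform(-1.0, 1.0, size=(5023, 3))  # illustrative neutral mesh
driven = template.copy()
driven[:, :2] += 0.01                                     # pretend audio moved the face
deltas = mesh_displacements(driven, template)             # what the audio model predicts
guide = render_guidance_map(template + deltas)            # spatial guidance for the generator
```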
4. Multifunctional integration and unified model
- Core idea: ReSyncer designs a unified model architecture that can support multiple functions (such as lip synchronization, speaking style transfer, face swapping, etc.) at the same time, without the need to design and train models for each function separately.
- How it works: By introducing different data streams and training objectives into the same generator, the model can perform multiple tasks simultaneously. For example, the model can accept both audio and video as input and generate results that match the audio lip movements and can perform face swapping.
- Detail:
- Personalized fine-tuning: By quickly learning the facial features of a specific person, the generator can be personalized and generate mouth animations that match the characteristics of that specific person.
- Video-driven: Not only can mouth shapes be generated based on audio, but also mouth shapes of target videos can be driven by facial movements of other videos.
- Face Swapping: By introducing additional training strategies, ReSyncer can achieve face swapping while maintaining the consistency of lip sync and facial expressions.
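Schematically, a single unified forward pass covering all three modes might look like the sketch below; every function name here (unified_generate, mesh_predictor, motion_encoder, style_injector) is a hypothetical placeholder used only to show how the driving signal and the identity can be chosen independently.

```python
# Schematic sketch of one unified forward pass. The driving motion may come
# from audio or from another video's mesh, and the identity styles may belong
# to the original speaker or to a swapped-in face. All callables are hypothetical.
def unified_generate(generator, style_injector, motion_encoder, mesh_predictor,
                     identity_styles, audio=None, driving_mesh=None):
    """identity_styles: style codes of the face to render (original or swapped).
    Exactly one of `audio` / `driving_mesh` should supply the motion."""
    if driving_mesh is None:
        if audio is None:
            raise ValueError("need either audio or a driving mesh")
        driving_mesh = mesh_predictor(audio)        # audio-driven lip sync
    motion_feat = motion_encoder(driving_mesh)      # shared mesh encoding
    styles = style_injector(identity_styles, motion_feat)
    return generator(styles)                        # frames synced to the motion
```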
5. Loss function and training strategy
- Core idea: To ensure that the generated lip-sync videos have high fidelity, ReSyncer uses a variety of loss functions to guide the training of the model.
- Specific operations:
- L1 loss: used to minimize the pixel difference between the generated video frames and the real frames.
- VGG loss: ensures visual quality by comparing the features extracted in the VGG network between the generated frames and the real frames.
- Adversarial loss: Combined with the discriminator of StyleGAN2, it enhances the authenticity and consistency of generated videos.
- Temporal consistency loss: ensures that the generated consecutive frames have consistent facial dynamics in time and avoids flickering or discontinuity.
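A hedged sketch of how these losses could be combined in PyTorch; the VGG-19 feature layer, the non-saturating adversarial term, and the loss weights are illustrative choices, not the paper's exact configuration, and a real training loop would also update the discriminator separately.

```python
# Hedged sketch of the combined training objective; layer choices and weights
# are illustrative, and the discriminator would be trained separately.
import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

# Frozen VGG-19 feature extractor for the perceptual (VGG) loss.
vgg_features = vgg19(weights=VGG19_Weights.DEFAULT).features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)


def total_loss(fake, real, d_logits_fake,
               w_l1=1.0, w_vgg=1.0, w_adv=0.1, w_temp=1.0):
    """fake/real: (B, T, 3, H, W) generated and ground-truth frame sequences.
    d_logits_fake: discriminator logits for the generated frames."""
    b, t, c, h, w = fake.shape
    f, r = fake.reshape(b * t, c, h, w), real.reshape(b * t, c, h, w)
    l1 = F.l1_loss(f, r)                                      # pixel (L1) loss
    perceptual = F.l1_loss(vgg_features(f), vgg_features(r))  # VGG loss
    adv = F.softplus(-d_logits_fake).mean()                   # non-saturating GAN loss
    temporal = F.l1_loss(fake[:, 1:] - fake[:, :-1],
                         real[:, 1:] - real[:, :-1])          # temporal consistency
    return w_l1 * l1 + w_vgg * perceptual + w_adv * adv + w_temp * temporal
```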
6. Video-driven lip sync and face swapping
- Core idea: In addition to audio-driven lip sync, ReSyncer also supports lip sync and face swapping through other videos.
- Specific operation: By introducing different reference videos during the training process, the model learns how to map the facial expressions and movements of one person to another, achieving high-fidelity face swapping and style transfer.
Experimental Results of ReSyncer
1. Experimental Dataset
- HDTF and VoxCeleb2: Used to train and test the lip-syncing models. These datasets contain a large number of high-quality video and audio clips.
- FaceForensics++ (FF++): Used to evaluate the effectiveness of face swapping; it contains 1,000 high-quality facial videos.
2. Quantitative Evaluation of Lip-sync Generation
- In the experiments, ReSyncer is compared with several other advanced methods such as Wav2Lip, ReTalking, SadTalker and StyleSync. The evaluation metrics include:
- SSIM (Structural Similarity Index): Used to measure the similarity between the generated video and the original video.
- PSNR (Peak Signal-to-Noise Ratio): A measure of image quality, where higher values indicate better quality.
- LMD (Landmark Distance): Measures the distance between mouth landmarks in the generated and real frames. The smaller the value, the better.
- ∆Sync (synchronization difference): Indicates the audio-visual synchronization error of the generated video relative to the real video. The smaller the value, the better. (A rough sketch of these metrics appears after the results below.)
Result:
- ReSyncer's SSIM and PSNR values on the HDTF and VoxCeleb2 datasets are higher than those of the comparison methods, indicating that the videos it generates are of higher quality.
- In terms of LMD and ∆Sync indicators, ReSyncer also shows the lowest error, indicating that its lip synchronization is more accurate.
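A rough sketch of the image-quality and lip metrics listed above, assuming 8-bit RGB frames and mouth landmarks from any off-the-shelf detector; the paper's exact evaluation protocol may differ (∆Sync, which typically relies on a pre-trained SyncNet, is omitted here).

```python
# Rough sketch of SSIM, PSNR and LMD, assuming uint8 RGB frames and mouth
# landmarks from any detector; the paper's exact protocol may differ.
import numpy as np
from skimage.metrics import structural_similarity  # pip install scikit-image


def frame_ssim(generated, reference):
    """Structural similarity of one RGB frame; higher is better."""
    return structural_similarity(generated, reference, channel_axis=-1)


def psnr(generated, reference):
    """Peak signal-to-noise ratio in dB for uint8 images; higher is better."""
    mse = np.mean((generated.astype(np.float64) - reference.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)


def lmd(generated_mouth_landmarks, reference_mouth_landmarks):
    """Mean Euclidean distance between matched mouth landmarks; lower is better."""
    return float(np.mean(np.linalg.norm(
        generated_mouth_landmarks - reference_mouth_landmarks, axis=-1)))
```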
3. Quantitative Evaluation of Face Swapping
- ReSyncer is compared with other face swapping methods such as SimSwap, InfoSwap, StyleSwap, and E4S on the FaceForensics++ dataset. The evaluation metrics include:
- ID Retrieval: Measures the similarity of the generated image to the target identity, with higher values being better.
- ID Similarity: The cosine similarity between ArcFace embeddings of the generated image and the target identity. The higher the value, the better (a minimal sketch appears after the results below).
- Pose Error: The pose difference between the generated image and the target image. The smaller the value, the better.
- Expression Error: The difference in expression between the generated image and the target image. The smaller the value, the better.
Result:
- ReSyncer performs close to the best on the ID Retrieval and ID Similarity metrics, showing that its face swapping preserves the characteristics of the target identity well.
- On the Pose Error and Expression Error metrics, ReSyncer also shows the lowest error, indicating excellent preservation of pose and expression consistency.
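For reference, the ID Similarity metric described above amounts to a cosine similarity between face-recognition embeddings; the sketch below assumes 512-dimensional ArcFace embeddings produced by a hypothetical extract_arcface_embedding helper.

```python
# ID Similarity as cosine similarity between face-recognition embeddings.
# `extract_arcface_embedding` is a hypothetical helper standing in for any
# ArcFace implementation producing (512,) embeddings.
import numpy as np


def id_similarity(swapped_embedding, identity_embedding):
    """Cosine similarity between two face embeddings; higher means the swapped
    face better preserves the intended identity."""
    a = swapped_embedding / np.linalg.norm(swapped_embedding)
    b = identity_embedding / np.linalg.norm(identity_embedding)
    return float(np.dot(a, b))


# Usage (with hypothetical embeddings):
# sim = id_similarity(extract_arcface_embedding(swapped_frame),
#                     extract_arcface_embedding(reference_face))
```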
4. User Research and Subjective Evaluation
- A user study was conducted to evaluate the quality of the generated videos using Mean Opinion Scores (MOS). Participants rated the videos on a scale of 1 to 5 based on the quality of the lip sync, the visual quality of the generated videos, and the realism of the videos.
- Result: ReSyncer achieves the highest scores in all three metrics, significantly outperforming other methods, especially in terms of video realism and visual quality.
5. Qualitative Evaluation of Lip Sync and Face Swapping
- The lip-synced videos generated by ReSyncer have higher visual quality, more accurate lip movements, and better detail preservation than other methods.
- In the face-swapping task, ReSyncer can not only preserve the characteristics of the target identity, but also achieve more natural expressions and lip synchronization, making the face-swapping effect more realistic.
6. Ablation Experiment
- Ablation experiments are performed by removing certain key components in the ReSyncer framework (such as Mesh-Inject and Mesh-Style) to evaluate the impact of these components on the overall performance.
- Result: After removing Mesh-Inject or Mesh-Style, the performance of the generated videos on LMD and ∆Sync metrics degrades significantly, indicating that the spatial guidance of the 3D facial mesh is crucial for generating high-quality lip-sync videos.
Project Address of ReSyncer:
- Author: KCGOD
- URL: https://kcgod.com/resyncer