LVCD is a tool designed specifically for colorizing line-drawing animation videos. It uses a new diffusion-based method to automatically convert black-and-white line drawings into colored animated video.
The traditional approach is to color each frame by hand, which is inefficient and easily leads to inconsistencies across frames. LVCD instead uses a video diffusion model to process an entire sequence at once, keeping colors consistent from frame to frame even when characters move quickly.
Key Features of LVCD
Line-drawing video colorization
LVCD automatically colorizes black-and-white line-drawing animation frames. Using the color information from a reference frame, the system generates temporally consistent color animation, making it suitable for long animation sequences and keeping colors coherent across many frames.
Handling large motions
LVCD is particularly good at handling animation scenes with large motions. Through the diffusion model and the reference attention mechanism, the colors of characters and backgrounds remain consistent even during fast or large movements, avoiding color misalignment or distortion.
Long video generation
LVCD supports long video generation and is not limited by the fixed clip length of the original model. Through a segmented sampling mechanism and an overlapped blending module, the system generates long animation sequences that exceed the original model's length limit while maintaining color and content consistency across video segments.
Temporal consistency
A core function of LVCD is maintaining temporal consistency between frames. Using the Overlapped Blending Module and Prev-Reference Attention, it keeps color and content consistent across frames in long sequences, avoiding color jumps between frames.
Reference-frame color propagation
LVCD takes the color information in the reference frame and accurately propagates those colors to other frames. Even when other frames differ considerably from the reference in content or motion, the system keeps colors consistent and produces coherent visuals.
Support for diverse line-drawing inputs
LVCD handles multiple types of line-drawing input, including hand-drawn and automatically extracted line drawings. Regardless of the drawing style, the system colorizes them accurately, demonstrating strong adaptability.
Technical Methods of LVCD
Model structure: LVCD is built on the Stable Video Diffusion (SVD) model and introduces a Sketch-guided ControlNet and Reference Attention to handle complex animation scenes. Starting from noise in the latent space, the model gradually generates high-quality, temporally consistent animation frames.
Temporally consistent sampling: By introducing the Overlapped Blending and Prev-Reference Attention mechanisms, LVCD generates temporally consistent video across multiple segments, reducing cumulative error during generation.
1. Stable Video Diffusion (SVD) Model Basics
SVD is the base model used for video generation. It generates video with a diffusion model and consists of two main parts (see the sketch after this list):
- VAE encoder and decoder: used to map input video frames into a low-dimensional latent space and decode the latent variables back to video frames.
- U-Net network: fine-tuned to denoise these latent variables and generate temporally consistent videos by introducing temporal layers such as 3D convolutions and temporal attention layers.
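As a rough illustration of this pipeline, the PyTorch sketch below uses tiny stand-in modules (not the actual SVD architecture) to show the flow: sample noisy latents, denoise them jointly across frames with a temporal network, and decode the latents back into frames.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Stand-in for the VAE: maps frames to a smaller latent space and back."""
    def __init__(self, channels=3, latent=4):
        super().__init__()
        self.enc = nn.Conv2d(channels, latent, 3, stride=2, padding=1)
        self.dec = nn.ConvTranspose2d(latent, channels, 4, stride=2, padding=1)
    def encode(self, x):  # (N, 3, H, W) -> (N, 4, H/2, W/2)
        return self.enc(x)
    def decode(self, z):  # (N, 4, H/2, W/2) -> (N, 3, H, W)
        return self.dec(z)

class TinyTemporalUNet(nn.Module):
    """Stand-in for the temporal U-Net: a 3D conv mixes information across frames."""
    def __init__(self, latent=4):
        super().__init__()
        self.net = nn.Conv3d(latent, latent, kernel_size=3, padding=1)
    def forward(self, z, t):  # z: (B, C, T, H, W); timestep embedding omitted in this toy
        return self.net(z)

@torch.no_grad()
def sample(vae, unet, num_frames=14, size=64, steps=25):
    z = torch.randn(1, 4, num_frames, size // 2, size // 2)  # start from noise in latent space
    for step in range(steps):
        eps = unet(z, step)       # predicted noise for all frames jointly
        z = z - eps / steps       # crude update; real samplers use a proper noise scheduler
    frames = vae.decode(z.permute(0, 2, 1, 3, 4).flatten(0, 1))  # decode frame by frame
    return frames.unflatten(0, (1, num_frames))                   # (B, T, 3, H, W)

vae, unet = TinyVAE(), TinyTemporalUNet()
video = sample(vae, unet)
print(video.shape)  # torch.Size([1, 14, 3, 64, 64])
```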
2. Sketch-guided ControlNet
Sketch-guided ControlNet is another core component of LVCD; it lets users control the generated video content through input line drawings. ControlNet is an auxiliary network structure designed to process structured inputs such as line drawings. In LVCD, ControlNet feeds the input line drawings into the pretrained diffusion model so that the generated video closely follows the layout and shapes of the drawings.
ControlNet introduces the line drawing as an additional condition so that the generated video matches its structure and layout. The authors copy and modify the U-Net encoder and add zero-initialized convolutional layers that encode the line drawing and connect its features to the U-Net input, guiding the model to generate color animation consistent with the drawing. Its working mechanism is as follows:
- LVCD encodes the input line drawing, extracts the structural information, and passes this information to the generative model.
- The generative model then uses this information to generate a color animation that conforms to the line drawing structure.
By guiding the generation process, ControlNet ensures that the resulting video is not only color accurate, but also maintains continuity and accuracy during scenes with large movements.
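A minimal sketch of this idea is shown below. It assumes the line drawing has already been encoded to the same latent size as the video latents; the module names and the identity "encoder" are placeholders for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

def zero_conv(channels):
    """1x1 convolution initialized to zero, so the control branch starts as a no-op."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class SketchControlBranch(nn.Module):
    def __init__(self, base_encoder: nn.Module, channels: int):
        super().__init__()
        self.encoder = base_encoder           # trainable copy of the U-Net encoder
        self.sketch_in = zero_conv(channels)  # zero-init: no effect at training step 0
        self.feature_out = zero_conv(channels)

    def forward(self, latents, sketch_latents):
        # Add the encoded line drawing to the latent input, then produce a residual
        # feature map that would be added back into the main U-Net's skip connections.
        h = self.encoder(latents + self.sketch_in(sketch_latents))
        return self.feature_out(h)

# Toy usage: an identity "encoder" stands in for the copied U-Net encoder.
branch = SketchControlBranch(base_encoder=nn.Identity(), channels=4)
latents = torch.randn(1, 4, 32, 32)
sketch_latents = torch.randn(1, 4, 32, 32)  # line drawing assumed already encoded to latent size
print(branch(latents, sketch_latents).abs().max().item())  # 0.0 -- zero convs start as a no-op
```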
3. Reference Attention
When processing long video sequences, the reference attention mechanism is one of LVCD's key techniques for ensuring color consistency and cross-frame coherence. It extracts color and other visual information from the input reference frame and propagates it to subsequent frames. It works as follows:
- The reference frame (i.e., the frame that has already been colored) serves as the primary source of color and texture.
- Reference Attention uses spatial matching to establish long-range dependencies between the reference frame and the frames to be generated, so colors are propagated smoothly from the reference frame to moving frames.
- In scenes with large motions, this long-range spatial matching lets the model apply color information accurately even to frames that differ greatly from the reference, avoiding color jumps or inconsistencies.
Through this mechanism, the model not only matches local pixels but also propagates the reference frame's color information globally, improving color consistency between animation frames.
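The sketch below reduces reference attention to a plain cross-attention: queries come from the frames being generated, keys and values from the reference frame. The single-head formulation and the shapes are illustrative assumptions, not LVCD's exact layer.

```python
import torch

def reference_attention(frame_feats, ref_feats, dim=64):
    # frame_feats: (T, N, C) features of the frames being generated (N = H*W tokens)
    # ref_feats:   (1, M, C) features of the already-colored reference frame
    q = frame_feats                                           # queries: frames to colorize
    k = v = ref_feats.expand(frame_feats.size(0), -1, -1)     # keys/values: the reference frame
    attn = torch.softmax(q @ k.transpose(1, 2) / dim ** 0.5, dim=-1)  # (T, N, M)
    return attn @ v                                           # reference info propagated to each frame

frames = torch.randn(14, 32 * 32, 64)
reference = torch.randn(1, 32 * 32, 64)
out = reference_attention(frames, reference)
print(out.shape)  # torch.Size([14, 1024, 64])
```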
4. Overlapped Blending Module
The Overlapped Blending Module solves the problem of joining segments in long video generation. Because the diffusion model can only generate clips of fixed length, LVCD produces long videos by dividing them into multiple short segments. To ensure smooth transitions between segments, the module blends the overlapping frames of adjacent segments. The steps are as follows:
- The last few frames of each generated segment overlap with the first few frames of the next segment.
- The overlapping frames blend the color and structure information of the two segments, so the transition from one segment to the next is natural, without discontinuities or color inconsistencies.
This blending avoids abrupt switches between segments in long videos and keeps the generated video smooth and natural.
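A minimal sketch of the blending step, assuming a simple linear ramp over a hypothetical 4-frame overlap (the paper's exact weighting may differ):

```python
import torch

def blend_segments(prev_latents, next_latents, overlap=4):
    # prev_latents, next_latents: (T, C, H, W) latent frames of two adjacent segments
    w = torch.linspace(0, 1, overlap).view(-1, 1, 1, 1)  # 0 -> keep previous, 1 -> keep next
    blended = (1 - w) * prev_latents[-overlap:] + w * next_latents[:overlap]
    return torch.cat([prev_latents[:-overlap], blended, next_latents[overlap:]], dim=0)

seg_a, seg_b = torch.randn(14, 4, 32, 32), torch.randn(14, 4, 32, 32)
video_latents = blend_segments(seg_a, seg_b)
print(video_latents.shape)  # torch.Size([24, 4, 32, 32])
```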
5. Prev-Reference Attention
Prev-Reference Attention is a technique LVCD uses to ensure inter-frame consistency in long video generation. When generating subsequent frames, it queries information from previously generated frames so that color and content stay consistent across frames. The process is as follows (a simplified sketch follows the list):
- When generating a new frame, the system uses prev-reference attention to treat previously generated frames, especially the overlapping frames, as references, so the new frame connects seamlessly with them.
- This prevents color drift or jitter caused by accumulated errors and is particularly suited to long animated videos.
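In the simplified sketch below, the attention context is extended with features from previously generated (overlapping) frames in addition to the reference frame. Pooling the previous frames into a single tensor is an assumption made for brevity, not the exact LVCD formulation.

```python
import torch

def prev_reference_attention(frame_feats, ref_feats, prev_feats, dim=64):
    # frame_feats: (T, N, C) frames of the current segment
    # ref_feats:   (1, M, C) the colored reference frame
    # prev_feats:  (1, K, C) features pooled from previously generated (overlapping) frames
    context = torch.cat([ref_feats, prev_feats], dim=1)        # reference + previous frames
    k = v = context.expand(frame_feats.size(0), -1, -1)
    attn = torch.softmax(frame_feats @ k.transpose(1, 2) / dim ** 0.5, dim=-1)
    return attn @ v

out = prev_reference_attention(
    torch.randn(14, 1024, 64), torch.randn(1, 1024, 64), torch.randn(1, 256, 64)
)
print(out.shape)  # torch.Size([14, 1024, 64])
```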
6. Segmented and Sequential Sampling
Video diffusion models are typically limited to generating clips of fixed length; for example, a model might only generate up to 14 frames at a time. This fixed-length limitation is a problem for long animations, which may contain hundreds or even thousands of frames.
LVCD overcomes this limitation with segmented and sequential sampling. Specifically:
- Segmented sampling: LVCD divides a long video into multiple smaller segments, each of a fixed length (e.g., 14 frames) that the original model can handle.
- Sequential sampling: LVCD generates the segments one after another and stitches them together in order. To make the transitions between segments natural and seamless, it uses the Overlapped Blending Module and Prev-Reference Attention, which exploit partially overlapping frames between segments to maintain continuity.
In this way, LVCD can generate videos of arbitrary length rather than being bound by the original model's fixed length. For example, to produce a video several minutes long, LVCD generates it segment by segment and then connects the segments seamlessly into a complete long video.
- Breaking the fixed-length limitation: the model is no longer restricted to short clips and can generate animated videos of any length.
- Maintaining consistency: even when generated in segments, LVCD keeps color and content consistent between segments, so the resulting long video has no obvious transitions or color jumps.
This means LVCD can generate very long videos without being limited by the model's short clip length, while keeping the different parts of the video coherent and consistent.
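The sketch below shows the overall segmented/sequential sampling loop at a high level. Here `sample_segment` is a hypothetical stand-in for one fixed-length run of the diffusion model, and the segment length and overlap are illustrative values.

```python
import torch

def sample_long_video(sample_segment, total_frames, segment_len=14, overlap=4):
    frames, prev_tail = [], None
    while sum(len(f) for f in frames) < total_frames:
        # prev_tail carries the overlapping frames of the previous segment so the model
        # (via overlapped blending / prev-reference attention) can match them.
        segment = sample_segment(length=segment_len, prev_frames=prev_tail)
        keep = segment if prev_tail is None else segment[overlap:]  # drop the re-generated overlap
        frames.append(keep)
        prev_tail = segment[-overlap:]
    return torch.cat(frames, dim=0)[:total_frames]

# Dummy segment sampler for illustration: returns random "frames" of shape (T, 3, 64, 64).
dummy = lambda length, prev_frames: torch.randn(length, 3, 64, 64)
video = sample_long_video(dummy, total_frames=60)
print(video.shape)  # torch.Size([60, 3, 64, 64])
```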
7. VAE and U-Net Architecture
In LVCD's overall structure, the VAE (variational autoencoder) and U-Net are the basic building blocks that compress and decode video frames during generation:
- VAE: compresses the input video frames into latent-space representations and decodes the latents back into color frames at the end of generation. Working in this lower-dimensional space speeds up generation while preserving the quality of the output frames.
- U-Net: performs progressive denoising in the latent space. In LVCD the U-Net includes spatiotemporal layers that help the model maintain temporal consistency between consecutive frames.
8. Loss Functions and Optimization Strategy
To ensure high quality and temporal consistency of generated results, LVCD uses a variety of loss functions during training:
- MSE (mean squared error) loss: constrains the difference between generated and reference frames, keeping the output consistent with the reference.
- Temporal consistency loss: specifically used to ensure that the content and color between video frames remain consistent, avoiding abrupt color changes or breaks between generated frames.
By optimizing these loss functions, LVCD can generate more coherent long-term animations and ensure the consistency of color and structure between frames.
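A generic sketch of such a training objective is shown below, assuming the targets are ground-truth colored frames and using an illustrative weighting; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def training_loss(pred, target, temporal_weight=0.1):
    # pred, target: (B, T, C, H, W) predicted and ground-truth video frames
    mse = F.mse_loss(pred, target)                   # per-frame reconstruction term
    pred_diff = pred[:, 1:] - pred[:, :-1]           # predicted frame-to-frame change
    target_diff = target[:, 1:] - target[:, :-1]     # ground-truth frame-to-frame change
    temporal = F.mse_loss(pred_diff, target_diff)    # penalize flicker not present in the target
    return mse + temporal_weight * temporal

loss = training_loss(torch.randn(1, 14, 3, 64, 64), torch.randn(1, 14, 3, 64, 64))
print(loss.item())
```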
Experimental Results of LVCD
Temporal consistency and video quality
On multiple animation test sets, LVCD significantly outperforms existing state-of-the-art methods in frame quality, video consistency, and temporal continuity. It performs especially well in scenes with large motions, effectively avoiding color differences and distortion between frames.
Quantitative evaluation
LVCD performs well in frame and video quality (FID and FVD), frame similarity (PSNR, LPIPS, SSIM), and temporal consistency (TC), significantly outperforming existing GAN-based and other generative methods.
Application Scenarios of LVCD
- Animation production: LVCD can be used to automatically generate color animations, reducing the workload of manual coloring.
- Comic video generation: the model can generate long, consistent comic videos from line drawings and reference frames, making it suitable for automatic coloring of cartoons and comics in various artistic styles.
Limitations
- Loss of detail: fine details may be lost because of errors introduced by the VAE reconstruction.
- Inaccurate colors for newly appearing objects: when a new object partially enters the scene, its color may be confused with that of surrounding objects.
Project address: https://luckyhzt.github.io/lvcd