DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors

Computer vision is one of the most active and well-researched fields in AI today, yet despite the rapid progress of computer vision models, image animation remains a longstanding challenge. Image animation frameworks still struggle to convert still images into videos that display natural dynamics while preserving the original appearance of the image. Traditionally, these frameworks have focused on animating natural scenes with stochastic dynamics like fluids and clouds, or on domain-specific motions such as human hair or body movement. Although this approach works to a certain extent, it limits the applicability of these frameworks to more generic visual content.

Furthermore, conventional image animation approaches concentrate primarily on synthesizing oscillating and stochastic motions, or on customizing for specific object categories. A notable flaw of these approaches is the strong assumptions they impose, which ultimately limit their applicability in general scenarios such as open-domain image animation. Over the past few years, Text-to-Video (T2V) models have demonstrated remarkable success in generating vivid and diverse videos from textual prompts, and this success forms the foundation for the DynamiCrafter framework.

The DynamiCrafter framework is an attempt to overcome the current limitations of image animation models and expand their applicability to generic scenarios involving open-world images. It synthesizes dynamic content for open-domain images, converting them into animated videos. The key idea behind DynamiCrafter is to incorporate the image as guidance into the generative process in order to utilize the motion prior of existing text-to-video diffusion models. For a given image, the model first uses a query transformer to project the image into a text-aligned rich context representation space, which lets the video model digest the image content in a compatible manner. Because this representation alone still loses some visual details in the resulting videos, the model also feeds the full image to the diffusion model by concatenating it with the initial noise, supplementing the model with more precise image information.

This article covers the DynamiCrafter framework in depth: we explore its mechanism, methodology, and architecture, along with how it compares with state-of-the-art image and video generation frameworks. So let’s get started.

Animating a still image offers an engaging visual experience because it seems to bring the image to life. Over the years, numerous frameworks have explored various methods of animating still images. Early animation frameworks relied on physical simulation and focused on simulating the motion of specific objects. However, because each object category had to be modeled independently, these approaches were neither effective nor generalizable. To produce more realistic motion, reference-based methods emerged that transferred motion or appearance information from reference signals such as videos into the synthesis process. Although reference-based approaches delivered better results with better temporal coherence than simulation-based approaches, they needed additional guidance, which limited their practical applications.

In recent years, the majority of animation frameworks have focused primarily on animating natural scenes with stochastic, domain-specific, or oscillating motions. Although this approach works to a certain extent, the results are not satisfactory and leave significant room for improvement. The remarkable results achieved by Text-to-Video generative models in the past few years have inspired the developers of the DynamiCrafter framework to leverage the powerful generative capabilities of these models for image animation.

The core idea of the DynamiCrafter framework is to incorporate a conditional image that governs the video generation process of Text-to-Video diffusion models. The goal remains non-trivial, since image animation requires both preserving details and understanding the visual context essential for creating dynamics. Multi-modal controllable video diffusion models like VideoComposer have attempted to enable video generation with visual guidance from an image. However, these approaches are not suitable for image animation, since their less comprehensive image injection mechanisms result in either abrupt temporal changes or low visual conformity to the input image. To overcome this hurdle, the DynamiCrafter framework proposes a dual-stream injection approach consisting of text-aligned context representation and visual detail guidance. The two streams work in a complementary manner, ensuring that the video diffusion model synthesizes dynamic content that preserves the details of the input image.

For a given image, the DynamiCrafter framework first projects the image into the text-aligned context representation space using a specially designed context learning network. To be more specific, this network consists of a pre-trained CLIP image encoder that extracts text-aligned image features, and a learnable query transformer that further promotes their adaptation to the diffusion models. The model consumes the resulting rich context features through cross-attention layers, and uses gated fusion to combine the output of these layers with the text-conditioned features. To some extent, the learned context representation trades visual details for text alignment, which facilitates semantic understanding of the image context and allows reasonable and vivid dynamics to be synthesized. To supply the visual details that are lost this way, the framework also concatenates the full image with the initial noise fed to the diffusion model. As a result, the dual-stream injection approach gives the DynamiCrafter framework both visual conformity to the input image and plausible dynamic content.

Moving along, diffusion models (DMs) have demonstrated remarkable performance and generative prowess in Text-to-Image (T2I) generation. To replicate the success of T2I models in video generation, Video Diffusion Models (VDMs) were proposed that use a space-time factorized U-Net architecture in pixel space to model low-resolution videos. Transferring the learnings of T2I frameworks to T2V frameworks also helps reduce training costs. Although VDMs can generate high-quality videos, they accept text prompts as the sole semantic guidance, which might be vague or might not reflect a user’s true intentions. Existing attempts to add image conditioning to these models rarely adhere to the input image and suffer from unrealistic temporal variation. The DynamiCrafter approach is built upon text-conditioned Video Diffusion Models and leverages their rich dynamic prior for animating open-domain images. It does so by incorporating tailored designs for better semantic understanding of, and conformity to, the input image.

DynamiCrafter : Method and Architecture

For a given still image, the DynamiCrafter framework attempts to animate the image into a video, i.e. to produce a short video clip. The video clip inherits the visual contents of the image and exhibits natural dynamics. However, the image may appear at an arbitrary location in the resulting frame sequence. This makes image animation a special kind of image-conditioned video generation task, one with high visual conformity requirements. The DynamiCrafter framework tackles this challenge by utilizing the generative priors of pre-trained video diffusion models.

Image Dynamics from Video Diffusion Prior

Open-domain text-to-video diffusion models are known to model dynamic visual content conditioned on text descriptions. To animate a still image with Text-to-Video generative priors, the framework must first inject the visual information into the video generation process in a comprehensive manner. For dynamic synthesis, the T2V model should digest the image for context understanding, while also preserving its visual details in the generated videos.

Text Aligned Context Representation

To guide video generation with image context, the DynamiCrafter framework projects the image into a text-aligned embedding space, allowing the video model to use the image information in a compatible fashion. Since the text embeddings are generated using a pre-trained CLIP text encoder, the framework employs the corresponding CLIP image encoder to extract image features from the input image. Although the global semantic token from the CLIP image encoder is aligned with image captions, it primarily represents the visual content at the semantic level and fails to capture the full extent of the image. The DynamiCrafter framework therefore uses the full visual tokens from the last layer of the CLIP image encoder to extract more complete information, since these visual tokens demonstrate high fidelity in conditional image generation tasks. The context and text embeddings then interact with the U-Net intermediate features through dual cross-attention layers. This design lets the model absorb image conditions in a layer-dependent manner. Since the intermediate layers of the U-Net are associated more with object poses or shapes, while the two-end layers are more linked to appearance, the image features are expected to influence predominantly the appearance of the videos.
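The dual cross-attention idea can be illustrated with a small NumPy sketch. It assumes the text stream and the image-context stream are each attended to separately and then summed, with a tanh gate on a learnable scalar scaling the context stream. The gate form, the tensor shapes, and all function and variable names here are illustrative assumptions rather than the framework's actual code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Scaled dot-product attention of U-Net features over condition tokens.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

def dual_cross_attention(queries, text_kv, context_kv, gate_param):
    # Attend to text tokens and image-context tokens separately, then fuse.
    # gate_param stands in for a learnable per-layer scalar (an assumption).
    text_out = cross_attention(queries, *text_kv)
    context_out = cross_attention(queries, *context_kv)
    return text_out + np.tanh(gate_param) * context_out

rng = np.random.default_rng(0)
dim = 8
queries = rng.normal(size=(16, dim))  # 16 positions of a U-Net feature map
text_kv = (rng.normal(size=(77, dim)), rng.normal(size=(77, dim)))
context_kv = (rng.normal(size=(16, dim)), rng.normal(size=(16, dim)))

fused = dual_cross_attention(queries, text_kv, context_kv, gate_param=0.0)
```

With the gate parameter at zero, the fused output equals the text-only attention output, so a gate of this kind would let each layer learn how strongly to absorb the image condition.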

Visual Detail Guidance

The DynamiCrafter framework employs an information-rich context representation that allows its video diffusion model to produce videos that closely resemble the input image. However, as demonstrated in the following image, the generated content might still display some discrepancies. These stem from the limited capability of the pre-trained CLIP encoder to preserve the input information completely, since it was designed to align language and visual features.

To enhance visual conformity, the DynamiCrafter framework provides the video diffusion model with additional visual details taken from the input image. To achieve this, the DynamiCrafter model concatenates the conditional image with the per-frame initial noise and feeds the result to the denoising U-Net as guidance.
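The concatenation step can be sketched in a few lines of NumPy. The sketch assumes the conditional image has already been encoded to a latent of shape (channels, height, width) and that it is repeated for every frame and joined with the noise along the channel axis. The shapes, the channel-axis choice, and the function name are assumptions made for illustration.

```python
import numpy as np

def concat_image_with_noise(image_latent, num_frames, rng):
    # Draw independent initial noise for every frame.
    channels, height, width = image_latent.shape
    noise = rng.standard_normal((num_frames, channels, height, width))
    # Repeat the same conditional image for each frame (a view, no copy).
    repeated = np.broadcast_to(image_latent, noise.shape)
    # Join along the channel axis: each frame now carries noise plus image.
    return np.concatenate([noise, repeated], axis=1)

rng = np.random.default_rng(0)
image_latent = rng.standard_normal((4, 32, 32))
unet_input = concat_image_with_noise(image_latent, num_frames=16, rng=rng)
print(unet_input.shape)  # (16, 8, 32, 32)
```

Under these assumptions, the first layer of the denoising U-Net has to accept twice as many input channels as it would for noise alone.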

Training Paradigm

The DynamiCrafter framework integrates the conditional image through two complementary streams, one responsible for context control and the other for detail guidance. To make the two work together, the DynamiCrafter model employs a three-step training process:

  1. In the first step, the model trains the image context representation network. 
  2. In the second step, the model adapts the image context representation network to the Text to Video model. 
  3. In the third and final step, the model fine-tunes the image context representation network jointly with the Visual Detail Guidance component. 

To adapt image information for compatibility with the Text-to-Video (T2V) model, the DynamiCrafter framework develops a context representation network, P, designed to capture text-aligned visual details from the given image. Because P requires many optimization steps to converge, the framework first trains it on a simpler Text-to-Image (T2I) model. This strategy allows the context representation network to concentrate on learning the image context before it is adapted to the T2V model, which is done by jointly training P and the spatial layers, as opposed to the temporal layers, of the T2V model.

To ensure T2V compatibility, the DynamiCrafter framework concatenates the input image with the per-frame noise and then fine-tunes both P and the spatial layers of the Video Diffusion Model (VDM). This choice keeps the T2V model’s pre-trained temporal prior intact, avoiding the adverse effects of dense image concatenation, which could compromise performance and work against the primary goal. Moreover, the framework randomly selects a video frame as the image condition to achieve two objectives: (i) to prevent the network from learning a shortcut that maps the concatenated image to a specific frame location, and (ii) to encourage a more flexible context representation that does not offer overly rigid information for any particular frame.
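The random frame selection described above can be sketched with Python's standard library. The helper name and the use of a uniform draw over frame indices are assumptions for illustration; the text only states that the conditioning frame is picked at random.

```python
import random

def sample_condition_frame(video_frames, rng=random):
    # Pick any frame of the training clip as the image condition, so the
    # network cannot tie the conditioning image to one fixed frame position.
    index = rng.randrange(len(video_frames))
    return index, video_frames[index]

video = [f"frame_{i}" for i in range(16)]  # stand-in for decoded frames
rng = random.Random(0)
picked = [sample_condition_frame(video, rng)[0] for _ in range(1000)]
```

Over many training steps the conditioning image lands on different positions of the clip, which is the property the two objectives above rely on.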

DynamiCrafter : Experiments and Results

The DynamiCrafter framework first trains the context representation network and the image cross-attention layers on Stable Diffusion. It then replaces Stable Diffusion with VideoCrafter and further fine-tunes the context representation network and the spatial layers for adaptation, and then fine-tunes them again with image concatenation. At inference, the framework adopts the DDIM sampler with multi-condition classifier-free guidance. To evaluate the quality and temporal coherence of the synthesized videos in both the spatial and temporal domains, the authors report FVD (Frechet Video Distance) and KVD (Kernel Video Distance), and evaluate the zero-shot performance of all methods on the MSR-VTT and UCF-101 benchmarks. To measure the perceptual conformity between the generated results and the input image, they introduce PIC (Perceptual Input Conformity), which uses the perceptual distance metric DreamSim as its distance function.
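Two of the pieces mentioned here can be written down compactly. The sketch below assumes that multi-condition classifier-free guidance combines three noise predictions (unconditional, image-only, image plus text) with separate guidance scales, and that PIC averages one minus a perceptual distance between the input image and each generated frame. Both formulas and all names are stated as assumptions; the distance function is passed in as a plain callable, and a toy distance stands in for DreamSim.

```python
import numpy as np

def multi_condition_cfg(eps_uncond, eps_img, eps_img_txt, s_img, s_txt):
    # Start from the unconditional prediction, push toward the image
    # condition, then push further toward the image-plus-text condition.
    return (eps_uncond
            + s_img * (eps_img - eps_uncond)
            + s_txt * (eps_img_txt - eps_img))

def perceptual_input_conformity(input_image, frames, distance):
    # Average of (1 - distance) between the input image and each frame.
    # `distance` is any perceptual distance callable (assumed in [0, 1]).
    return float(np.mean([1.0 - distance(input_image, f) for f in frames]))

rng = np.random.default_rng(0)
eps_uncond, eps_img, eps_img_txt = rng.normal(size=(3, 4, 8, 8))
guided = multi_condition_cfg(eps_uncond, eps_img, eps_img_txt,
                             s_img=7.5, s_txt=7.5)

# Toy stand-in for a perceptual metric, only for demonstration.
toy_distance = lambda a, b: float(np.mean(np.abs(a - b)))
image = rng.uniform(size=(3, 16, 16))
pic_identical = perceptual_input_conformity(image, [image.copy()] * 4,
                                            toy_distance)
```

With both scales set to 1, the guided prediction reduces to the fully conditioned one, and a video whose frames all equal the input image scores a PIC of 1.0 under any distance that is zero for identical inputs.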

The following figure demonstrates the visual comparison of generated animated content with different styles and content. 

As can be observed, among all the compared methods, the DynamiCrafter framework adheres well to the input image condition and generates temporally coherent videos. The following table contains the statistics from a user study with 49 participants, reporting the preference rate for Temporal Coherence (T.C.) and Motion Quality (M.Q.), along with the selection rate for visual conformity to the input image (I.C.). As can be observed, the DynamiCrafter framework outperforms existing methods by a considerable margin.

The following figure demonstrates the results achieved using the dual-stream injection method and the training paradigm. 

Final Thoughts

In this article, we have talked about DynamiCrafter, an attempt to overcome the current limitations of image animation models and expand their applicability to generic scenarios involving open-world images. Its key idea is to incorporate the image as guidance into the generative process in order to utilize the motion prior of existing text-to-video diffusion models. It does so through two complementary streams: a query transformer that projects the image into a text-aligned rich context representation space the video model can digest, and the concatenation of the full image with the initial noise, which restores the visual details that the context representation alone does not preserve.