CameraCtrl: Enabling Camera Control for Text-to-Video Generation

Recent text-to-video (T2V) generation frameworks leverage diffusion models to stabilize their training process. The Video Diffusion Model, one of the pioneers among text-to-video generation frameworks, expands a 2D image diffusion architecture to accommodate video data, and trains the model jointly on videos and images from scratch. To build on a powerful pre-trained image generator such as Stable Diffusion, more recent works inflate the 2D architecture by interleaving temporal layers between the pre-trained 2D layers, and then fine-tune the new model on large video datasets. Despite this progress, text-to-video diffusion models face a significant challenge: the ambiguity of text descriptions used on their own gives the model only weak control over the generated video. To tackle this limitation, some models provide enhanced guidance, while others rely on precise signals to control the scene or human motion in the synthesized videos. There are also a few text-to-video frameworks that adopt images as the control signal for the video generator, which yields either accurate temporal relationship modeling or high video quality.
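To make this "inflation" pattern concrete, here is a minimal PyTorch sketch, under assumed tensor shapes and module names, of interleaving a new temporal attention layer after a frozen, pre-trained 2D spatial layer. It illustrates the general recipe rather than the exact architecture of any particular framework.

```python
import torch
import torch.nn as nn

class InflatedBlock(nn.Module):
    """Sketch of 'inflating' a pre-trained 2D block for video: a temporal
    attention layer is interleaved after the frozen spatial layer, so only
    the newly added temporal parameters need to be trained."""

    def __init__(self, spatial_layer: nn.Module, channels: int, num_heads: int = 8):
        super().__init__()
        self.spatial = spatial_layer  # pre-trained 2D layer, assumed shape-preserving
        for p in self.spatial.parameters():
            p.requires_grad_(False)   # keep the pre-trained weights untouched
        self.norm = nn.LayerNorm(channels)
        self.temporal = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, f, c, h, w = x.shape
        x = self.spatial(x.reshape(b * f, c, h, w)).reshape(b, f, c, h, w)
        # Attend across the frame axis independently at every spatial location.
        t = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
        t = t + self.temporal(self.norm(t), self.norm(t), self.norm(t))[0]
        return t.reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)

# Usage: wrap an existing 2D layer and run a short clip through it.
block = InflatedBlock(nn.Conv2d(64, 64, 3, padding=1), channels=64)
video = torch.randn(2, 16, 64, 32, 32)   # 2 clips, 16 frames each
assert block(video).shape == video.shape
```

Because only the temporal layers introduce new parameters, fine-tuning can focus on them while the pre-trained spatial weights stay fixed.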

It would be safe to say that controllability plays a crucial role in image and video generative tasks, since it allows users to create the content they desire. However, existing frameworks often overlook precise control of the camera pose, which serves as a cinematic language for expressing deeper narrative nuances. To tackle these controllability limitations, this article discusses CameraCtrl, a novel idea that attempts to enable accurate camera pose control for text-to-video models. After precisely parameterizing the camera trajectory, the model trains a plug-and-play camera module on top of a text-to-video model and leaves the other components untouched. Furthermore, the CameraCtrl model also conducts a comprehensive study on the effect of various datasets, and finds that videos with appearances similar to the base model's training data and diverse camera distributions enhance the overall controllability and generalization abilities of the model. Experiments analyzing the performance of the CameraCtrl model on real-world tasks indicate the efficiency of the framework in achieving precise and domain-adaptive camera control, carving a path forward for customized and dynamic video generation from camera pose and textual inputs.

This article aims to cover the CameraCtrl framework in depth: we explore the mechanism, the methodology, and the architecture of the framework, and compare it against state-of-the-art frameworks. So let's get started.

The recent development and advancement of diffusion models have significantly advanced text-guided video generation and revolutionized content design workflows. Controllability plays a significant role in practical video generation applications, since it allows users to customize the generated results to their needs and requirements. With high controllability, a model can enhance the realism, quality, and usability of the videos it generates; and while text and image inputs are commonly used to improve overall controllability, they often lack precise control over motion and content. To tackle this limitation, some frameworks have proposed leveraging control signals such as pose skeletons, optical flow, and other multi-modal signals to guide video generation more accurately. Another limitation of existing frameworks is the lack of precise control over simulating or adjusting camera viewpoints in video generation. The ability to control the camera is crucial because it not only enhances the realism of the generated videos but, by allowing customized viewpoints, also enhances user engagement, a feature that is essential in game development, augmented reality, and virtual reality. Furthermore, managing camera movements skillfully allows creators to highlight character relationships, emphasize emotions, and guide the focus of the target audience, all of which are of great importance in the film and advertising industries.

To tackle and overcome these limitations, the CameraCtrl framework introduces a learnable, precise, plug-and-play camera module with the ability to control the viewpoints of the camera for video generation. However, integrating a customized camera into an existing text-to-video pipeline is easier said than done, forcing the CameraCtrl framework to look for ways to represent and inject the camera into the model architecture effectively. The CameraCtrl framework adopts Plücker embeddings as the primary form of camera parameters, a choice that can be credited to their ability to encode geometric descriptions of the camera pose information. Furthermore, to ensure the generalizability and applicability of the CameraCtrl model after training, the model introduces a camera control module that accepts only Plücker embeddings as input. To ensure the camera control module is trained effectively, the developers conduct a comprehensive study investigating how different training data, ranging from synthetic to realistic, affects the framework. The experimental results indicate that data with a diverse camera pose distribution and an appearance similar to that of the original base model achieves the best trade-off between controllability and generalizability. The developers of the CameraCtrl framework have implemented the model on top of the AnimateDiff framework, enabling precise camera control in video generation across different personalized video models and demonstrating its versatility and utility in a wide range of video creation contexts.

The AnimateDiff framework adopts the efficient LoRA fine-tuning approach to obtain model weights for different types of shots. The Direct-a-Video framework proposes a camera embedder to control camera poses during video generation, but it conditions on only three camera parameters, limiting camera control to the most basic movement types. On the other hand, frameworks such as MotionCtrl design a motion controller that accepts more than three input parameters and can produce videos with more complex camera poses. However, the need to fine-tune part of the video generation model hampers the generalizability of the approach. Furthermore, some frameworks incorporate additional structural control signals, such as depth maps, to enhance controllability for image and video generation. Typically, these control signals are fed into an additional encoder and then injected into the generator using various operations.

CameraCtrl: Model Architecture

Before looking at the architecture and training paradigm of the camera encoder, it is important to understand the different camera representations. A camera pose typically refers to its intrinsic and extrinsic parameters, and one straightforward choice for conditioning a video generator on the camera pose is to feed the raw parameter values into the generator. However, such an approach does not lead to accurate camera control, for a few reasons. First, while the rotation matrix is constrained by orthogonality, the translation vector is typically unconstrained in magnitude, which leads to a mismatch in the learning process that can affect the consistency of control. Second, using raw camera parameters directly can make it difficult for the model to correlate these values with image pixels, reducing control over visual details. To avoid these limitations, the CameraCtrl framework chooses Plücker embeddings as the representation of the camera pose: Plücker embeddings provide a geometric representation for each pixel of a video frame and therefore describe the camera pose information in far more detail.
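For intuition, here is a small NumPy sketch of how per-pixel Plücker coordinates can be computed from camera intrinsics and world-to-camera extrinsics: each pixel gets the direction of its viewing ray together with the moment of that ray about the origin. The exact conventions (pixel centering, normalization) are assumptions for illustration rather than a reference implementation.

```python
import numpy as np

def plucker_embedding(K: np.ndarray, R: np.ndarray, t: np.ndarray,
                      height: int, width: int) -> np.ndarray:
    """Sketch of per-pixel Plücker coordinates for one frame.

    K: (3, 3) camera intrinsics; R, t: (3, 3) and (3,) world-to-camera
    extrinsics. Returns a (6, height, width) map: (moment, direction)."""
    # Camera center in world coordinates.
    cam_center = -R.T @ t
    # Pixel grid in homogeneous coordinates (u, v, 1), sampled at pixel centers.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)  # (H*W, 3)
    # Ray directions in world coordinates, normalized to unit length.
    dirs = (R.T @ np.linalg.inv(K) @ pixels.T).T
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Plücker coordinates: moment = o x d, direction = d.
    moments = np.cross(cam_center, dirs)
    plucker = np.concatenate([moments, dirs], axis=-1)        # (H*W, 6)
    return plucker.reshape(height, width, 6).transpose(2, 0, 1)
```

Stacking these 6-channel maps over the frames of a clip yields the Plücker embedding sequence that the camera encoder consumes.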

Camera Controllability in Video Generators

Since the model parameterizes the camera trajectory into a sequence of Plücker embeddings, i.e. spatial maps, it can use an encoder model to extract camera features and then fuse those features into the video generator. Similar to a text-to-image adapter (T2I-Adapter), the CameraCtrl model introduces a camera encoder designed specifically for videos. The camera encoder includes a temporal attention module after each convolutional block, allowing it to capture the temporal relationships of camera poses throughout the video clip. As illustrated in the accompanying figure, the camera encoder accepts only Plücker embeddings as input and delivers multi-scale features. After obtaining the multi-scale camera features, the CameraCtrl model integrates these features seamlessly into the U-Net architecture of the text-to-video model, determining which layers should be used to incorporate the camera information effectively. Since a majority of existing frameworks adopt a U-Net-like architecture containing both temporal and spatial attention layers, the CameraCtrl model injects the camera representations into the temporal attention blocks. This decision is backed by the ability of the temporal attention layers to capture temporal relationships, aligning with the inherently causal and sequential nature of a camera trajectory, while the spatial attention layers model the individual frames.
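The PyTorch sketch below illustrates the overall shape of such a camera encoder: strided convolutional stages followed by temporal attention, with one feature map emitted per scale. The channel widths and exact module layout are illustrative assumptions, not the actual CameraCtrl implementation.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention over the frame axis at every spatial location."""
    def __init__(self, channels: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, F, C, H, W)
        b, f, c, h, w = x.shape
        t = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
        t = t + self.attn(self.norm(t), self.norm(t), self.norm(t))[0]
        return t.reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)

class CameraEncoder(nn.Module):
    """Sketch of a camera encoder: Plücker maps in, multi-scale features out.
    Each stage is a strided convolution followed by temporal attention."""
    def __init__(self, channels=(64, 128, 256, 512)):
        super().__init__()
        self.stages = nn.ModuleList()
        in_ch = 6  # a Plücker embedding has 6 channels per pixel
        for out_ch in channels:
            self.stages.append(nn.ModuleDict({
                "conv": nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                "temporal": TemporalAttention(out_ch),
            }))
            in_ch = out_ch

    def forward(self, plucker: torch.Tensor):  # (B, F, 6, H, W)
        x, features = plucker, []
        for stage in self.stages:
            b, f, c, h, w = x.shape
            x = stage["conv"](x.reshape(b * f, c, h, w))
            x = x.reshape(b, f, *x.shape[1:])
            x = stage["temporal"](x)
            features.append(x)  # one feature map per U-Net resolution
        return features
```

In the fusion step, each returned feature map would be added to the input of the temporal attention block at the matching resolution of the video U-Net.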

Learning Camera Distributions

Training the camera encoder of the CameraCtrl framework on a video generator requires a large number of well-labeled and annotated videos, where the camera trajectory can be obtained through a structure-from-motion (SfM) approach. The CameraCtrl framework attempts to select datasets whose appearance closely matches the training data of the base text-to-video model and whose camera pose distribution is as wide as possible. Samples in datasets generated with virtual engines exhibit a diverse camera distribution, since developers have the flexibility to control the camera parameters during the rendering phase, although such data suffers from a distribution gap compared to datasets containing real-world samples. For datasets of real-world samples, the camera distribution is usually narrow, and in such cases the framework needs to find a balance between the diversity among different camera trajectories and the complexity of individual camera trajectories. The complexity of individual camera trajectories ensures that the model learns to control complex trajectories during training, while the diversity among different camera trajectories ensures the model does not overfit to certain fixed patterns. Furthermore, to monitor the training process of the camera encoder, the CameraCtrl framework proposes a camera alignment metric that measures camera control quality by quantifying the error between the camera trajectory of the generated samples and the input camera conditions.
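Such a camera alignment metric can be sketched as follows: the trajectory is re-estimated from the generated video (for example with an SfM tool such as COLMAP) and compared against the input conditions. The specific formulas below, a relative rotation angle plus a Euclidean translation distance, are plausible assumptions in the spirit of the RotErr and TransErr metrics discussed later, not the paper's exact definitions.

```python
import numpy as np

def camera_alignment_errors(R_gen, t_gen, R_gt, t_gt):
    """Sketch of a camera alignment metric comparing the trajectory estimated
    from the generated video against the input camera conditions.

    R_gen, R_gt: (F, 3, 3) rotation matrices; t_gen, t_gt: (F, 3) translations.
    Returns (rot_err, trans_err) accumulated over the F frames."""
    rot_err, trans_err = 0.0, 0.0
    for Rg, Rr, tg, tr in zip(R_gen, R_gt, t_gen, t_gt):
        # Relative rotation angle (degrees) between generated and reference poses.
        cos_angle = np.clip((np.trace(Rg @ Rr.T) - 1.0) / 2.0, -1.0, 1.0)
        rot_err += np.degrees(np.arccos(cos_angle))
        # Euclidean distance between camera translations.
        trans_err += np.linalg.norm(tg - tr)
    return rot_err, trans_err
```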

CameraCtrl : Experiments and Results

The CameraCtrl framework implements the AnimateDiff model as its base text-to-video model, chiefly because the training strategy of AnimateDiff allows its motion module to integrate with text-to-image base models or text-to-image LoRAs, accommodating video generation across different genres and domains. The camera module is trained with the Adam optimizer at a constant learning rate of 1e-4. Furthermore, to ensure the framework does not negatively impact the video generation capabilities of the original text-to-video model, CameraCtrl uses the Fréchet Inception Distance (FID) metric to assess the appearance quality of the videos, comparing quality before and after including the camera module.
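A minimal sketch of this training setup is given below, assuming only what the text above states: the base video model stays frozen and the camera module alone is optimized with Adam at a constant learning rate of 1e-4. The function and argument names are hypothetical placeholders, not the actual CameraCtrl API.

```python
import torch
from torch import nn

def configure_camera_training(video_unet: nn.Module,
                              camera_encoder: nn.Module) -> torch.optim.Adam:
    """Freeze the pre-trained text-to-video U-Net and optimize only the
    plug-and-play camera module, mirroring the recipe described above."""
    for p in video_unet.parameters():
        p.requires_grad_(False)            # leave the base generator untouched
    return torch.optim.Adam(camera_encoder.parameters(), lr=1e-4)
```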

To assess its performance, the CameraCtrl framework is evaluated against two existing camera control approaches: MotionCtrl and AnimateDiff. However, since the AnimateDiff framework supports only eight basic camera trajectories, the comparison between CameraCtrl and AnimateDiff is limited to three basic trajectories. For the comparison against MotionCtrl, the framework selects over a thousand random camera trajectories from an existing dataset in addition to the base camera trajectories, generates videos using these trajectories, and evaluates them with the TransErr and RotErr metrics.

As can be observed, the CameraCtrl framework outperforms the AnimateDiff framework on basic trajectories, and delivers better results than the MotionCtrl framework on the complex trajectory metrics.

Furthermore, the following figure demonstrates the effect of the camera encoder architecture on the overall quality of the generated samples. Rows (a) through (d) show the results generated with the camera encoder implemented as ControlNet, ControlNet with temporal attention, T2I-Adapter, and T2I-Adapter with temporal attention, respectively.

In the following figure, the first two show the videos generated using a combination of the SparseCtrl framework's RGB encoder and the method used in the CameraCtrl framework.

Final Thoughts

In this article, we have talked about CameraCtrl, a novel idea that attempts to enable accurate camera pose control for text-to-video models. After precisely parameterizing the camera trajectory, the framework trains a plug-and-play camera module on top of a text-to-video model and leaves the other components untouched. Furthermore, the CameraCtrl model also conducts a comprehensive study on the effect of various datasets, finding that videos with similar appearances and diverse camera distributions can enhance the overall controllability and generalization abilities of the model. Experiments analyzing the performance of the CameraCtrl model on real-world tasks indicate the efficiency of the framework in achieving precise and domain-adaptive camera control, carving a path forward for customized and dynamic video generation from camera pose and textual inputs.