AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation

Over the years, the creation of realistic and expressive portrait animations from static images and audio has found a range of applications, including gaming, digital media, virtual reality, and more. Despite this potential, it remains difficult for developers to build frameworks capable of generating high-quality animations that maintain temporal consistency and are visually captivating. A major source of this complexity is the need for intricate coordination of lip movements, head positions, and facial expressions to craft a visually compelling effect.

In this article, we will be talking about AniPortrait, a novel framework designed to generate high-quality animations driven by a reference portrait image and an audio sample. AniPortrait works in two stages. First, the framework extracts intermediate 3D representations from the audio sample and projects them into a sequence of 2D facial landmarks. Then, it employs a robust diffusion model coupled with a motion module to convert the landmark sequence into temporally consistent and photorealistic animations. Experimental results demonstrate AniPortrait's ability to generate high-quality animations with exceptional visual quality, pose diversity, and facial naturalness, offering an enhanced and enriched perceptual experience. Furthermore, the framework holds remarkable potential in terms of controllability and flexibility, and can be applied effectively to facial reenactment, facial motion editing, and more. This article covers the AniPortrait framework in depth: we explore its mechanism, methodology, and architecture, along with its comparison against state-of-the-art frameworks. So let's get started.

Creating realistic and expressive portrait animations has been a focus of researchers for a while now owing to its incredible potential and applications spanning digital media, virtual reality, gaming, and more. Despite years of research and development, producing high-quality animations that maintain temporal consistency and are visually captivating still presents a significant challenge. A major hurdle for developers is the need for intricate coordination between head positions, facial expressions, and lip movements to craft a visually compelling effect. Existing methods have struggled to tackle these challenges, primarily because a majority of them rely on limited-capacity generators such as NeRFs, motion-based decoders, and GANs for visual content creation. These networks exhibit limited generalization capabilities and are unstable when generating high-quality content. However, the recent emergence of diffusion models has facilitated the generation of high-quality images, and frameworks built on top of diffusion models together with temporal modules have made it possible to create compelling videos.

Building upon these advances in diffusion models, the AniPortrait framework aims to generate high-quality animated portraits from a reference image and an audio sample. The framework works in two stages. In the first stage, AniPortrait employs transformer-based models to extract a sequence of 3D facial meshes and head poses from the audio input, and subsequently projects them into a sequence of 2D facial landmarks. This first stage allows the framework to capture lip movements and subtle expressions from the audio, in addition to head movements that synchronize with the rhythm of the audio sample. In the second stage, AniPortrait employs a robust diffusion model integrated with a motion module to transform the facial landmark sequence into a photorealistic and temporally consistent animated portrait. More specifically, AniPortrait draws upon the network architecture of the existing AnimateAnyone model, which employs Stable Diffusion 1.5, a potent diffusion model, to generate lifelike and fluid animation based on a reference image and a body motion sequence. Notably, AniPortrait does not reuse the pose guider module as it is implemented in the AnimateAnyone framework, but redesigns it, allowing the framework not only to maintain a lightweight design but also to exhibit enhanced precision in generating lip movements.

Experimental results demonstrate the superiority of the AniPortrait framework in creating animations with impressive facial naturalness, excellent visual quality, and varied poses. By employing 3D facial representations as intermediate features, the AniPortrait framework gains the flexibility to modify these representations as required. This adaptability significantly enhances the applicability of the framework across domains including facial reenactment and facial motion editing.

AniPortrait: Working and Methodology

The proposed AniPortrait framework comprises two modules, namely Audio2Lmk and Lmk2Video. The Audio2Lmk module extracts a sequence of landmarks that captures intricate lip movements and facial expressions from the audio input, while the Lmk2Video module uses this landmark sequence to generate high-quality portrait videos with temporal stability. The following figure presents an overview of the AniPortrait workflow. As can be observed, the framework first extracts the 3D facial mesh and head pose from the audio, and subsequently projects these two elements into 2D keypoints. In the second stage, the framework employs a diffusion model to transform the 2D keypoints into a portrait video, with the two stages trained concurrently within the network.
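To make the two-stage workflow concrete, here is a minimal sketch of how the pipeline could be wired together in Python. The class and method names (AniPortraitPipeline and the two stage callables) are placeholders chosen for illustration and do not correspond to the official implementation's API.

```python
# A minimal, illustrative sketch of the two-stage AniPortrait pipeline.
# Class and method names are placeholders, not the official API.
import numpy as np

class AniPortraitPipeline:
    def __init__(self, audio2lmk, lmk2video):
        self.audio2lmk = audio2lmk    # audio -> 3D mesh + head pose -> 2D landmarks
        self.lmk2video = lmk2video    # reference image + landmark sequence -> video frames

    def __call__(self, reference_image: np.ndarray, audio_waveform: np.ndarray, fps: int = 30):
        # Stage 1: predict a 3D facial mesh and head pose per frame, then
        # project them to 2D facial landmarks with a perspective camera.
        landmarks_2d = self.audio2lmk(audio_waveform, fps=fps)

        # Stage 2: a diffusion model with a motion module turns the landmark
        # sequence into temporally consistent, photorealistic frames that
        # preserve the identity of the reference image.
        frames = self.lmk2video(reference_image, landmarks_2d)
        return frames
```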

Audio2Lmk

Given a sequence of speech snippets, the primary goal of the AniPortrait framework is to predict the corresponding 3D facial mesh sequence together with vector representations of head translation and rotation. The framework employs the pre-trained wav2vec method to extract audio features; the model exhibits a high degree of generalization and accurately recognizes intonation and pronunciation from the audio, both of which play a crucial role in generating realistic facial animations. By leveraging these robust speech features, AniPortrait can effectively employ a simple architecture consisting of two fully connected (fc) layers to convert the features into 3D facial meshes. This straightforward design not only enhances the efficiency of the inference process but also ensures accuracy.

When converting audio to pose, the framework employs the same wav2vec network as the backbone, although it does not share weights with the audio-to-mesh module. This is mainly because pose is associated more with the tone and rhythm of the audio, which is a different emphasis from the audio-to-mesh task. To account for the impact of previous states, the framework employs a transformer decoder to decode the pose sequence. During this process, the audio features are integrated into the decoder using cross-attention mechanisms, and both modules are trained with the L1 loss. Once the model obtains the mesh and pose sequences, it employs perspective projection to transform them into a 2D sequence of facial landmarks that serve as the input signal for the subsequent stage.
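The following is a simplified PyTorch sketch of how the Audio2Lmk stage can be assembled from the components described above: a wav2vec 2.0 backbone, two fully connected layers for the mesh head, and a transformer decoder with cross-attention over the audio features for the pose head. The layer sizes, the Hugging Face checkpoint name, and the use of MediaPipe's 468-vertex mesh are assumptions; only the overall structure follows the description above.

```python
# A simplified sketch of the Audio2Lmk stage, assuming a PyTorch / Hugging Face
# setup. Layer sizes and the checkpoint name are illustrative, not the paper's
# exact configuration.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

NUM_MESH_VERTICES = 468   # MediaPipe face mesh size (assumption)

class Audio2Mesh(nn.Module):
    """wav2vec 2.0 features -> per-frame 3D facial mesh via two fc layers."""
    def __init__(self, hidden_dim=768):
        super().__init__()
        self.wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        self.fc = nn.Sequential(
            nn.Linear(hidden_dim, 512), nn.ReLU(),
            nn.Linear(512, NUM_MESH_VERTICES * 3),
        )

    def forward(self, waveform):
        feats = self.wav2vec(waveform).last_hidden_state        # (B, T, 768)
        return self.fc(feats).view(feats.shape[0], -1, NUM_MESH_VERTICES, 3)

class Audio2Pose(nn.Module):
    """Separate wav2vec backbone plus a transformer decoder that attends to
    the audio features (cross-attention) while decoding the pose sequence."""
    def __init__(self, hidden_dim=768, pose_dim=6):
        super().__init__()
        self.wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        layer = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.pose_embed = nn.Linear(pose_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, pose_dim)              # rotation + translation

    def forward(self, waveform, prev_poses):
        memory = self.wav2vec(waveform).last_hidden_state        # audio features
        tgt = self.pose_embed(prev_poses)                        # previous pose states
        return self.head(self.decoder(tgt, memory))

# Both modules are trained with an L1 loss against ground-truth annotations,
# e.g. loss = nn.functional.l1_loss(pred_mesh, gt_mesh).
```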

Lmk2Video

Given a reference portrait image and a sequence of facial landmarks, the proposed Lmk2Video module creates a temporally consistent portrait animation whose motion aligns with the landmark sequence and whose appearance remains consistent with the reference image; the portrait animation is represented as a sequence of frames. The design of the Lmk2Video network structure draws inspiration from the existing AnimateAnyone framework. AniPortrait employs Stable Diffusion 1.5, an extremely potent diffusion model, as its backbone, and incorporates a temporal motion module that effectively converts multi-frame noise inputs into a sequence of video frames. At the same time, a ReferenceNet component mirrors the structure of Stable Diffusion 1.5 and is used to extract appearance information from the reference image and integrate it into the backbone. This strategic design ensures that the facial identity remains consistent throughout the output video.

Differentiating itself from the AnimateAnyone framework, AniPortrait enhances the complexity of the PoseGuider's design. The original version in AnimateAnyone comprises only a few convolution layers, after which the landmark features merge with the latents at the input layer of the backbone. The AniPortrait framework finds that this design falls short in capturing intricate lip movements; to tackle this issue, it adopts the multi-scale strategy of the ConvNet architecture and incorporates landmark features of corresponding scales into different blocks of the backbone. Furthermore, AniPortrait introduces an additional improvement by including the landmarks of the reference image as an extra input. The cross-attention module of the PoseGuider component facilitates interaction between the target landmarks of each frame and the reference landmarks. This provides the network with additional cues for understanding the correlation between appearance and facial landmarks, assisting in the generation of portrait animations with more precise motion.
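Below is an illustrative sketch of what a multi-scale PoseGuider with reference-landmark cross-attention might look like. The channel sizes, number of scales, and attention configuration are assumptions made for the sake of a self-contained example, not the actual AniPortrait code.

```python
# An illustrative sketch of the redesigned PoseGuider: multi-scale convolutional
# landmark features, with cross-attention between each frame's target landmarks
# and the reference-image landmarks. All hyperparameters are assumptions.
import torch
import torch.nn as nn

class PoseGuider(nn.Module):
    def __init__(self, in_channels=3, scales=(64, 128, 256)):
        super().__init__()
        blocks, ch = [], in_channels
        for out_ch in scales:
            blocks.append(nn.Sequential(
                nn.Conv2d(ch, out_ch, kernel_size=3, stride=2, padding=1),
                nn.SiLU(),
            ))
            ch = out_ch
        self.blocks = nn.ModuleList(blocks)
        # One cross-attention layer per scale: target landmark features attend
        # to the reference-image landmark features.
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(out_ch, num_heads=4, batch_first=True) for out_ch in scales
        )

    def forward(self, target_lmk_img, ref_lmk_img):
        feats = []
        t, r = target_lmk_img, ref_lmk_img
        for block, attn in zip(self.blocks, self.cross_attn):
            t, r = block(t), block(r)
            b, c, h, w = t.shape
            q = t.flatten(2).transpose(1, 2)       # (B, H*W, C) queries from target
            kv = r.flatten(2).transpose(1, 2)      # keys/values from reference
            out, _ = attn(q, kv, kv)
            t = t + out.transpose(1, 2).reshape(b, c, h, w)
            feats.append(t)                        # injected into the matching backbone block
        return feats
```

Each element of the returned feature list would be added to the backbone block of corresponding resolution, which is the multi-scale injection idea described above.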

AniPortrait: Implementation and Result

For the Audio2Lmk stage, the AniPortrait framework adopts the wav2vec2.0 component as its backbone and leverages the MediaPipe architecture to extract 3D meshes and 6D poses for annotations. The training data for the Audio2Mesh component comes from an internal dataset comprising nearly 60 minutes of high-quality speech data from a single speaker. To ensure that the 3D mesh extracted by the MediaPipe component is stable, the voice actor was instructed to face the camera and maintain a steady head position throughout the recording process. For the Lmk2Video module, the AniPortrait framework implements a two-stage training approach. In the first stage, the framework trains ReferenceNet, PoseGuider, and the 2D components of the backbone, leaving out the motion module. In the second stage, the framework freezes all the other components and concentrates on training the motion module. For this stage, the framework makes use of two large-scale, high-quality facial video datasets, processing all the data with the MediaPipe component to extract 2D facial landmarks. Furthermore, to enhance the network's sensitivity to lip movements, the model renders the upper and lower lips in distinct colors when drawing the pose image from the 2D landmarks.
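As a small, self-contained example of the lip-coloring trick mentioned above, the following sketch renders a pose image from a single frame of 2D landmarks, drawing the upper and lower lips in distinct colors. The landmark index sets are placeholders rather than MediaPipe's actual lip indices.

```python
# A minimal sketch of rendering a pose image from 2D landmarks, coloring the
# upper and lower lips differently so the network becomes more sensitive to
# lip motion. The index sets below are placeholders, not MediaPipe's.
import cv2
import numpy as np

UPPER_LIP_IDX = set(range(0, 12))     # placeholder indices
LOWER_LIP_IDX = set(range(12, 24))    # placeholder indices

def render_pose_image(landmarks_2d, size=(512, 512)):
    """landmarks_2d: (N, 2) array of pixel coordinates for one frame."""
    canvas = np.zeros((size[1], size[0], 3), dtype=np.uint8)
    for i, (x, y) in enumerate(landmarks_2d.astype(int)):
        if i in UPPER_LIP_IDX:
            color = (0, 0, 255)       # red: upper lip
        elif i in LOWER_LIP_IDX:
            color = (0, 255, 0)       # green: lower lip
        else:
            color = (255, 255, 255)   # white: remaining landmarks
        cv2.circle(canvas, (int(x), int(y)), radius=2, color=color, thickness=-1)
    return canvas
```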

As demonstrated in the following image, the AniPortrait framework generates a series of animations with superior quality and realism.

The framework also produces an intermediate 3D representation that can be edited to manipulate the output as required. For instance, users can extract landmarks from a source and alter their identity, thereby allowing the AniPortrait framework to create a facial reenactment effect.
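As a speculative illustration of how the editable 3D intermediate representation can support facial reenactment, the sketch below transfers per-frame expression offsets from a source actor's mesh sequence onto the reference identity's neutral mesh before projection. The decomposition into a neutral mesh plus offsets is an assumption made for illustration and is not described in the article.

```python
# A speculative sketch: retarget a source actor's facial motion onto the
# reference identity by transferring per-frame deformations. The neutral-mesh
# + offset decomposition is an assumption for illustration only.
import numpy as np

def retarget_meshes(source_meshes, source_neutral, reference_neutral):
    """source_meshes: (T, N, 3); *_neutral: (N, 3). Returns retargeted (T, N, 3)."""
    expression_offsets = source_meshes - source_neutral[None]   # per-frame deformation
    return reference_neutral[None] + expression_offsets         # apply to new identity
```

The retargeted meshes would then be projected to 2D landmarks and fed to Lmk2Video exactly as in the normal pipeline.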

Final Thoughts

In this article, we have talked about AniPortrait, a novel framework designed to generate high-quality animations driven by a reference portrait image and an audio sample. By simply providing a reference image and an audio clip, the AniPortrait framework can generate a portrait video featuring natural head movement and smooth lip motion. By leveraging the robust generalization capabilities of the diffusion model, the framework generates animations with impressively realistic image quality and lifelike motion. The approach works in two stages: first, AniPortrait extracts intermediate 3D representations from the audio sample and projects them into a sequence of 2D facial landmarks; then, a robust diffusion model coupled with a motion module converts the landmark sequence into temporally consistent and photorealistic animations. Experimental results demonstrate AniPortrait's ability to generate high-quality animations with exceptional visual quality, pose diversity, and facial naturalness, offering an enhanced and enriched perceptual experience. Furthermore, the framework holds remarkable potential in terms of controllability and flexibility, and can be applied effectively to areas including facial reenactment and facial motion editing.