BrushNet: Plug and Play Image Inpainting with Dual Branch Diffusion

Image inpainting is one of the classic problems in computer vision: it aims to restore masked regions of an image with plausible and natural content. Existing work that employs traditional image inpainting techniques like Generative Adversarial Networks (GANs) and Variational Auto-Encoders (VAEs) often requires auxiliary hand-engineered features, yet still fails to deliver satisfactory results. Over the past few years, diffusion-based methods have gained popularity within the computer vision community owing to their remarkable high-quality image generation capabilities, output diversity, and fine-grained control. Initial attempts at employing diffusion models for text-guided image inpainting modified the standard denoising strategy by sampling the masked regions from a pre-trained diffusion model and the unmasked areas from the given image. Although these methods performed satisfactorily on simple image inpainting tasks, they struggled with complex mask shapes, text prompts, and image contents, which resulted in an overall lack of coherence. This lack of coherence can be attributed primarily to their limited perceptual knowledge of mask boundaries and of the unmasked image region's context. 

Despite the advancements, research, and development of these models over the past few years, image inpainting remains a major hurdle for computer vision developers. Current adaptations of diffusion models for image inpainting either modify the sampling strategy or develop inpainting-specific diffusion models, and both approaches often suffer from reduced image quality and inconsistent semantics. To tackle these challenges and pave the way forward for image inpainting models, in this article we will be talking about BrushNet, a novel plug-and-play, dual-branch framework that embeds pixel-level masked image features into any pre-trained diffusion model, delivering coherent and enhanced results on image inpainting tasks. The BrushNet framework introduces a novel paradigm under which image features and the noisy latent are divided into separate branches. This division drastically reduces the learning load for the model and facilitates a nuanced incorporation of essential masked image information in a hierarchical fashion. In addition to the BrushNet framework, we will also be talking about BrushBench and BrushData, which facilitate segmentation-based performance assessment and image inpainting training respectively. 

This article aims to cover the BrushNet framework in depth: we explore its mechanism, methodology, and architecture, along with its comparison against state-of-the-art frameworks. So let's get started. 

Image inpainting, a method that attempts to restore the missing regions of an image while maintaining overall coherence, has been a long-standing problem in the computer vision field, and it has troubled developers and researchers for years. Image inpainting finds its applications across a wide variety of computer vision tasks, including image editing and virtual try-ons. Recently, diffusion models like Stable Diffusion and Stable Diffusion 1.5 have demonstrated a remarkable ability to generate high-quality images, and they give users flexible semantic and structural control. This remarkable potential is what has prompted researchers to turn to diffusion models for high-quality image inpainting that aligns with the input text prompts. 

The methods employed by traditional diffusion-based text-guided inpainting frameworks can be split into two categories: Sampling Strategy Modification and Dedicated Inpainting Models. The sampling strategy modification method alters the standard denoising process by sampling the masked regions from a pre-trained diffusion model and copy-pasting the unmasked regions from the given image at each denoising step. Although sampling strategy modification approaches can be implemented with arbitrary diffusion models, they often produce incoherent inpainting results since they have limited perceptual knowledge of mask boundaries and of the unmasked image region's context. Dedicated inpainting models, on the other hand, fine-tune a specifically designed image inpainting model by expanding the input channel dimensions of the base diffusion model to incorporate the corrupted image and the masks. While dedicated inpainting models enable the diffusion model to generate more satisfactory results through specialized shape-aware and content-aware designs, this may not be the best architectural choice for image inpainting models. 
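
To make the first category concrete, the following is a minimal sketch of the sampling strategy modification idea in PyTorch-style pseudocode. The `unet`, `scheduler`, and `vae` objects are placeholders for whatever pre-trained diffusion model and sampler is being used; this illustrates the general technique, not BrushNet's implementation.

```python
import torch

def inpaint_by_sampling_modification(unet, scheduler, vae, image_latent, mask, text_emb):
    """Sampling strategy modification: at every denoising step, keep the model's
    prediction only inside the mask and copy a re-noised version of the original
    image latent into the unmasked region. Illustrative sketch with placeholder objects."""
    latent = torch.randn_like(image_latent)              # start from pure noise
    for t in scheduler.timesteps:                        # reverse diffusion loop
        noise_pred = unet(latent, t, text_emb)           # predict noise for the full latent
        latent = scheduler.step(noise_pred, t, latent)   # one denoising step
        # Re-noise the known image content to the current noise level, then paste it
        # into the unmasked region (mask == 1 marks the area to be inpainted).
        known = scheduler.add_noise(image_latent, torch.randn_like(image_latent), t)
        latent = mask * latent + (1.0 - mask) * known
    return vae.decode(latent)                            # decode the final latent to pixels
```

Because the model only ever sees the unmasked content as a per-step paste rather than as a conditioning signal, the boundary between generated and copied regions can end up incoherent, which is exactly the weakness described above.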

As demonstrated in the following image, dedicated inpainting models fuse the masked image latent, the noisy latent, the text, and the mask at an early stage. Under this architectural design, the masked image features are easily influenced by the text, which prevents the subsequent layers of the UNet architecture from obtaining pure masked image features. Furthermore, handling generation and conditioning in a single branch imposes an additional burden on the UNet architecture, and since these approaches also require fine-tuning for different variations of the diffusion backbone, they are often time-consuming and have limited transferability. 
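
A common way such dedicated models incorporate the corrupted image and the mask is by widening the UNet's first convolution. The sketch below shows how a 4-channel latent input layer could be expanded to 9 channels (noisy latent + masked image latent + downsampled mask); the channel counts follow the usual Stable Diffusion latent layout and are stated here as an assumption, not a quote of any particular implementation.

```python
import torch
import torch.nn as nn

def expand_conv_in(old_conv: nn.Conv2d) -> nn.Conv2d:
    """Expand a 4-channel latent input convolution to 9 channels so the UNet can take
    noisy latent (4) + masked image latent (4) + downsampled mask (1) in one tensor.
    The extra input channels are zero-initialized so training starts from the
    behavior of the original model."""
    new_conv = nn.Conv2d(9, old_conv.out_channels,
                         kernel_size=old_conv.kernel_size,
                         stride=old_conv.stride,
                         padding=old_conv.padding)
    with torch.no_grad():
        new_conv.weight.zero_()
        new_conv.weight[:, :4] = old_conv.weight   # keep the original latent pathway
        new_conv.bias.copy_(old_conv.bias)
    return new_conv
```

Because the text conditioning and the masked image now flow through the same early layers, every later layer sees a mixture of the two, which is the early-fusion problem BrushNet sets out to avoid.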

It might appear that adding an additional branch dedicated to extracting masked image features would be an adequate solution to the problems mentioned above; however, existing frameworks often extract and insert inadequate information when applied directly to inpainting. As a result, existing frameworks like ControlNet yield unsatisfactory results when compared against dedicated inpainting models. To tackle this issue as effectively as possible, the BrushNet framework introduces an additional branch into the original diffusion network, thus creating a more suitable architecture for image inpainting tasks. The design and architecture of the BrushNet framework can be summed up in three points. 

  1. Instead of initializing convolution layers randomly, the BrushNet framework uses a VAE encoder to process the masked image. As a result, the BrushNet framework is able to extract image features that adapt to the UNet's latent distribution more effectively. 
  2. The BrushNet framework gradually incorporates the full UNet features layer by layer into the pre-trained UNet architecture, a hierarchical approach that enables dense per-pixel control. 
  3. The BrushNet framework removes text cross-attention from the UNet component to ensure that only pure image information is considered in the additional branch. Furthermore, the BrushNet model also proposes a blurred blending strategy to attain better consistency along with a higher degree of controllability in the unmasked regions of the image. 

BrushNet: Method and Architecture

The following figure gives us a brief overview of the BrushNet framework. 

As can be observed, the framework employs a dual-branch strategy for masked image guidance insertion, and uses a blending operation with a blurred mask to ensure better preservation of the unmasked regions. It is worth noting that the BrushNet framework can adjust the scale of the added features to achieve flexible control. Given a masked image and its mask as inputs, the BrushNet model outputs an inpainted image. The model first downsamples the mask to match the size of the latent, and feeds the masked image to the VAE encoder so that it aligns with the distribution of the latent space. The model then concatenates the masked image latent, the noisy latent, and the downsampled mask, and uses the result as the input to the additional branch. The features the branch extracts are added to the pre-trained UNet layers after a zero convolution block. After denoising, the model blends the masked image and the generated image using a blurred mask. 
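
The input preparation described above can be sketched in a few lines of PyTorch. The `vae` object, the tensor shapes, and the use of `F.interpolate` here are illustrative assumptions rather than BrushNet's exact code.

```python
import torch
import torch.nn.functional as F

def prepare_branch_input(vae, image, mask, noisy_latent):
    """Build the additional branch's input from the masked image, the mask, and the
    noisy latent. `image` is (B, 3, H, W); `mask` is (B, 1, H, W) with 1.0 marking
    the region to inpaint; `noisy_latent` is (B, 4, h, w) in SD-style latent space."""
    masked_image = image * (1.0 - mask)                  # blank out the region to be generated
    masked_latent = vae.encode(masked_image)             # align with the latent distribution
    # Downsample the mask to the latent resolution (the paper uses cubic interpolation).
    mask_latent = F.interpolate(mask, size=noisy_latent.shape[-2:], mode="bicubic")
    # Concatenate along the channel dimension: 4 + 4 + 1 = 9 channels.
    return torch.cat([noisy_latent, masked_latent, mask_latent], dim=1)
```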

Masked Image Guidance

The BrushNet framework inserts the masked image features into the pre-trained diffusion network through an additional branch, which explicitly separates the feature extraction of the masked image from the image generation process. The input to this branch is formed by concatenating the masked image latent, the noisy latent, and the downsampled mask. More specifically, the noisy latent provides information about the current generation process and helps the framework enhance the semantic coherence of the masked image features. The BrushNet framework extracts the masked image latent from the masked image using a Variational AutoEncoder, and it employs cubic interpolation to downsample the mask so that the mask size aligns with the masked image latent and the noisy latent. To process the masked image features, the BrushNet framework uses a clone of the pre-trained diffusion model with its cross-attention layers removed. The reasoning is that the pre-trained weights of the diffusion model serve as a strong prior for extracting the features of the masked image, while excluding the cross-attention layers ensures that only pure image information is considered within the additional branch. The BrushNet framework inserts these features into the frozen diffusion model layer by layer, enabling hierarchical dense per-pixel control, and employs zero convolution layers to connect the trainable BrushNet branch to the locked model, ensuring that harmful noise has no influence over the hidden states of the trainable copy during the initial training stages. 
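
The sketch below illustrates how such a branch might be constructed: clone the pre-trained UNet, disable its text cross-attention, and create one zero-initialized 1x1 convolution per feature level. The `is_cross_attention` flag and the `feature_channels` list are illustrative assumptions; real UNet implementations expose these details differently.

```python
import copy
import torch.nn as nn

def build_masked_image_branch(pretrained_unet: nn.Module, feature_channels):
    """Clone the pre-trained UNet as the trainable masked-image branch and build the
    zero convolutions that connect it to the frozen model. Illustrative sketch only."""
    branch = copy.deepcopy(pretrained_unet)          # start from the strong pre-trained prior
    for module in branch.modules():
        # Disable text cross-attention so the branch only processes pure image information.
        # "is_cross_attention" is an assumed flag; how it is detected varies per implementation.
        if getattr(module, "is_cross_attention", False):
            module.forward = lambda hidden_states, *args, **kwargs: hidden_states
    zero_convs = nn.ModuleList()
    for channels in feature_channels:
        zc = nn.Conv2d(channels, channels, kernel_size=1)  # 1x1 connection layer
        nn.init.zeros_(zc.weight)                          # zero-initialized, so the branch
        nn.init.zeros_(zc.bias)                            # contributes nothing at first
        zero_convs.append(zc)
    return branch, zero_convs
```

Because the zero convolutions start at zero, the frozen model's behavior is untouched at the beginning of training, and the branch's influence grows only as its weights are learned.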

Blending Operation

As mentioned earlier, conducting the blending operation in latent space requires resizing the mask, which often introduces inaccuracies, and the BrushNet framework encounters this issue as well when it resizes the mask to match the size of the latent space. Furthermore, it is worth noting that the encoding and decoding operations of Variational AutoEncoders have inherently limited reconstruction fidelity, and cannot guarantee a complete reconstruction of the image. To ensure a fully consistent reconstruction of the unmasked region, existing works have implemented different techniques, such as copying the unmasked regions from the original image. Although this approach works, it often results in a lack of semantic coherence in the final generated results. Other methods that adopt latent blending operations, on the other hand, have difficulty preserving the desired information in the unmasked regions. The BrushNet framework therefore blends the generated image with the original masked image using a blurred mask, which keeps the unmasked content intact while softening the transition along the mask boundary. 
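
A minimal sketch of such a blurred, pixel-space blending step is shown below; the Gaussian blur and the kernel size are assumptions chosen for illustration.

```python
import torch
from torchvision.transforms.functional import gaussian_blur

def blurred_blend(generated: torch.Tensor, original: torch.Tensor,
                  mask: torch.Tensor, kernel_size: int = 21) -> torch.Tensor:
    """Pixel-space blending with a blurred mask.
    `generated` and `original` are (B, 3, H, W) images; `mask` is (B, 1, H, W) with
    1.0 inside the region to inpaint and 0.0 elsewhere. Blurring the mask softens the
    boundary so the copied unmasked pixels transition smoothly into the generated
    content instead of leaving a hard seam."""
    soft_mask = gaussian_blur(mask, kernel_size=[kernel_size, kernel_size])
    return soft_mask * generated + (1.0 - soft_mask) * original
```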

Flexible Control

The architectural design of the BrushNet framework makes it inherently suitable for plug-and-play integration with various pre-trained diffusion models, and enables a flexible preservation scale. Since the BrushNet framework does not alter the weights of the pre-trained diffusion model, developers have the flexibility to integrate it as a plug-and-play component with a fine-tuned diffusion model, allowing easy adoption and experimentation with pre-trained models. Furthermore, developers can control the preservation scale of the unmasked regions by incorporating the features of the BrushNet model into the frozen diffusion model with a given weight w, which determines the influence of the BrushNet framework and therefore the desired level of preservation. Finally, the BrushNet framework allows users to adjust the blurring scale and decide whether or not to apply the blurring operation, making it easy to customize the preservation of the unmasked regions and offering flexible adjustment and fine-grained control over the image inpainting process. 
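
The preservation weight w can be illustrated with a one-line feature injection; the function name and default value below are hypothetical, and the blurring scale simply corresponds to the kernel size of the blending step sketched in the previous section.

```python
import torch

def add_branch_feature(frozen_feat: torch.Tensor, branch_feat: torch.Tensor,
                       zero_conv, w: float = 1.0) -> torch.Tensor:
    """Scale the BrushNet branch's contribution by a preservation weight w before
    adding it to the frozen UNet feature: w = 1.0 follows the masked image guidance
    fully, while smaller values trade preservation for more generative freedom."""
    return frozen_feat + w * zero_conv(branch_feat)
```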

BrushNet: Implementation and Results

To analyze its results, the BrushNet framework proposes BrushBench, a segmentation-based image inpainting dataset of 600 images, with each image accompanied by a human-annotated mask and a caption annotation. The images in the benchmark dataset are distributed evenly between natural and artificial images, and are also distributed evenly across different categories, enabling a fair evaluation. To analyze inpainting further, the evaluation also distinguishes between two mask types: segmentation-based masks and brush masks. 

Quantitative Comparison

The following table compares the BrushNet framework against existing diffusion-based image inpainting models on the BrushBench dataset, with Stable Diffusion as the base model. 

As can be observed, the BrushNet framework demonstrates remarkable efficiency across masked region preservation, text alignment, and image quality. Furthermore, models like Stable Diffusion Inpainting, HD-Painter, PowerPaint, and others demonstrate strong performance on inside-inpainting tasks, although they fail to replicate that performance on outside-inpainting tasks, especially in terms of text alignment and image quality. Overall, the BrushNet framework delivers the strongest results. 

Furthermore, the following table compares the BrushNet framework against existing diffusion-based image inpainting models on the EditBench dataset, where the performance is comparable to that observed on the BrushBench dataset. The results indicate that the BrushNet framework delivers strong performance across a wide range of image inpainting tasks with different mask types. 

Qualitative Comparison

The following figure qualitatively compares the BrushNet framework against existing image inpainting methods, with results covering AI-generated and natural images across different inpainting tasks, including random mask inpainting, segmentation mask inside-inpainting, and segmentation mask outside-inpainting. 

As can be observed, the BrushNet framework delivers remarkable results, with coherent generated regions that remain consistent with the unmasked regions, and it successfully realizes awareness of the background information owing to its dual-branch decoupling approach. Furthermore, the untouched branch of the pre-trained diffusion model also helps it cover different data domains such as anime and painting better, resulting in better performance across different scenarios. 

Final Thoughts

In this article we talked about BrushNet, a novel plug-and-play, dual-branch framework that embeds pixel-level masked image features into any pre-trained diffusion model, delivering coherent and enhanced results on image inpainting tasks. The BrushNet framework introduces a novel paradigm under which image features and the noisy latent are divided into separate branches. This division drastically reduces the learning load for the model and facilitates a nuanced incorporation of essential masked image information in a hierarchical fashion. In addition to the BrushNet framework, we also covered BrushBench and BrushData, which facilitate segmentation-based performance assessment and image inpainting training respectively.