One of the core challenges in computer vision is the generation of high-quality segmentation masks. Recent advances in large-scale supervised training have enabled zero-shot segmentation across a wide range of image styles, while unsupervised training has simplified segmentation by removing the need for extensive annotations. Despite these developments, building a computer vision framework that can segment anything in a zero-shot setting without any annotations remains a complex task. Semantic segmentation, a fundamental task in computer vision, divides an image into smaller regions with uniform semantics, and it lays the groundwork for numerous downstream applications such as medical imaging, image editing, and autonomous driving.
To advance computer vision models, image segmentation should not be confined to a fixed dataset with limited categories; instead, it should serve as a versatile foundational task for a variety of other applications. However, the high cost of collecting per-pixel labels presents a significant challenge and limits progress on zero-shot, unsupervised segmentation methods, which require no annotations and have no prior access to the target data. This article discusses how the self-attention layers in Stable Diffusion models can be used to build a model capable of segmenting any input in a zero-shot setting without annotations, since these self-attention layers inherently encode the object concepts learned by a pre-trained Stable Diffusion model.
Semantic segmentation divides an image into sections that each share similar semantics, and it forms the foundation for numerous downstream tasks. Traditionally, the field has relied on supervised semantic segmentation, which uses large datasets with annotated and labeled categories, while unsupervised semantic segmentation in a zero-shot setting remains a challenge. Although supervised methods are effective, their per-pixel labeling cost is often prohibitive, which highlights the need for unsupervised segmentation methods that work in the less restrictive zero-shot setting, where the model requires neither annotated data nor prior knowledge of the target data.
To address this limitation, DiffSeg introduces a novel post-processing strategy that leverages the Stable Diffusion framework to build a generic segmentation model capable of zero-shot transfer to any image. Stable Diffusion has proven effective at generating high-resolution images conditioned on prompts, and for generated images it can produce segmentation masks from the corresponding text prompts, although these masks typically cover only the dominant foreground objects.
In contrast, DiffSeg is an innovative post-processing method that creates segmentation masks from the attention tensors produced by the self-attention layers of a diffusion model. The DiffSeg algorithm is composed of three key components: attention aggregation, iterative attention merging, and non-maximum suppression, as illustrated in the following image.
The DiffSeg algorithm preserves visual information across multiple resolutions by aggregating the 4D attention tensors in a spatially consistent way, and then merges attention maps iteratively by sampling anchor points. These anchors serve as starting points for the merging process, with anchors belonging to the same object eventually being absorbed into a single proposal. DiffSeg controls the merging process using KL divergence to measure the similarity between two attention maps.
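To make the merging criterion concrete, here is a minimal Python sketch of a symmetric KL divergence between two attention maps treated as probability distributions over the spatial grid. The function name and the toy maps are illustrative, not taken from the DiffSeg code.

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-8):
    """Symmetric KL divergence between two attention maps.

    Each map is flattened and treated as a probability distribution
    over spatial locations; eps keeps the logarithm well defined.
    """
    p = p.ravel() + eps
    q = q.ravel() + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

# Two toy 64 x 64 attention maps: a small distance suggests the maps
# attend to the same object and should be merged, a large one that
# they belong to different objects.
a = np.random.rand(64, 64)
b = np.random.rand(64, 64)
print(symmetric_kl(a, b))
```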
Compared with clustering-based unsupervised segmentation methods, DiffSeg does not require developers to specify the number of clusters beforehand, and it can produce segmentations without any prior knowledge or additional resources. Overall, the DiffSeg algorithm is “A novel unsupervised and zero-shot segmentation method that makes use of a pre-trained Stable Diffusion model, and can segment images without any additional resources, or prior knowledge.”
DiffSeg : Foundational Concepts
DiffSeg is a novel algorithm that builds on the learnings of Diffusion Models, Unsupervised Segmentation, and Zero-Shot Segmentation.
Diffusion Models
The DiffSeg algorithm builds on the learnings from pre-trained diffusion models. Diffusion models are among the most popular generative frameworks in computer vision: they learn a forward and a reverse diffusion process and generate an image from sampled isotropic Gaussian noise. Stable Diffusion is the most popular variant, and it has been used for a wide array of tasks including supervised segmentation, zero-shot classification, semantic-correspondence matching, label-efficient segmentation, and open-vocabulary segmentation. The main drawback is that diffusion models rely on high-dimensional visual features to perform these tasks, and they often require additional training to take full advantage of those features.
Unsupervised Segmentation
The DiffSeg algorithm is closely related to unsupervised segmentation, which aims to generate dense segmentation masks without using any annotations. To deliver good performance, however, unsupervised segmentation models still need some unsupervised pre-training on the target dataset. Unsupervised segmentation frameworks can be grouped into two categories: clustering with pre-trained models, and clustering based on invariance. Frameworks in the first category use the discriminative features learned by pre-trained models to generate segmentation masks, whereas frameworks in the second category use a generic clustering algorithm that optimizes the mutual information between two views of an image to segment it into semantic clusters while avoiding degenerate segmentations.
Zero-Shot Segmentation
The DiffSeg algorithm is also closely related to zero-shot segmentation, which aims to segment anything without prior training on, or knowledge of, the data. Zero-shot segmentation models have recently demonstrated exceptional zero-shot transfer capabilities, although they typically require text inputs or prompts. In contrast, the DiffSeg algorithm uses a diffusion model to generate segmentations without querying or synthesizing multiple images and without knowing the contents of the objects.
DiffSeg : Method and Architecture
The DiffSeg algorithm makes use of the self-attention layers in a pre-trained Stable Diffusion model to generate high-quality segmentation masks.
Stable Diffusion Model
Stable Diffusion is one of the fundamental building blocks of the DiffSeg framework. It is a generative AI framework and one of the most popular diffusion models. A defining characteristic of a diffusion model is its forward and reverse passes: in the forward pass, a small amount of Gaussian noise is added to an image at every time step until the image becomes isotropic Gaussian noise, while in the reverse pass the model iteratively removes noise from an isotropic Gaussian noise image to recover an image free of Gaussian noise.
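As an illustration of the forward pass, the sketch below samples a noised image directly at time step t using the standard closed-form DDPM noising equation; the linear noise schedule and array shapes are placeholders rather than Stable Diffusion's actual settings.

```python
import numpy as np

def forward_diffusion(x0, t, betas):
    """Sample a noised image x_t directly from the clean image x_0 using
    the closed-form DDPM forward process:
        x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
    """
    alphas = 1.0 - betas
    alpha_bar_t = np.cumprod(alphas)[t]
    noise = np.random.randn(*x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * noise

# Placeholder linear schedule over 1000 steps; at large t the output
# approaches isotropic Gaussian noise, as described above.
betas = np.linspace(1e-4, 0.02, 1000)
image = np.random.rand(64, 64, 3)                  # stand-in for a real image
noisy = forward_diffusion(image, t=999, betas=betas)
```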
The Stable Diffusion framework employs an encoder-decoder design together with a U-Net that contains attention layers: an encoder first compresses an image into a latent space with smaller spatial dimensions, and a decoder decompresses the latent back into an image. The U-Net itself consists of a stack of modular blocks, where each block is composed of either a Transformer layer or a ResNet layer.
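To show how a Transformer layer's self-attention over the latent's spatial tokens yields spatial attention maps, here is a simplified, dependency-free sketch; real Stable Diffusion layers use learned query/key/value projections and multiple heads, which are omitted here, so treat the function purely as an illustration of the tensor shapes involved.

```python
import numpy as np

def self_attention_maps(features):
    """Compute softmax self-attention over the spatial tokens of a
    latent feature map.

    features: array of shape (h, w, c) from a U-Net block.
    Returns a 4D tensor of shape (h, w, h, w): for every query
    location, a probability map over all spatial locations.
    """
    h, w, c = features.shape
    tokens = features.reshape(h * w, c)
    # Real layers apply learned query/key projections and several heads;
    # a raw dot product is enough to show the resulting tensor shape.
    scores = tokens @ tokens.T / np.sqrt(c)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs.reshape(h, w, h, w)

attn = self_attention_maps(np.random.rand(16, 16, 320))
print(attn.shape)   # (16, 16, 16, 16)
```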
Components and Architecture
The self-attention layers in diffusion models group information about the objects in an image into spatial attention maps. DiffSeg is a novel post-processing method that merges these attention tensors into a valid segmentation mask, with a pipeline consisting of three main components: attention aggregation, iterative attention merging, and non-maximum suppression.
Attention Aggregation
For an input image that passes through the encoder and the U-Net, the Stable Diffusion model produces a total of 16 attention tensors spread across four different resolutions. The goal of attention aggregation is to combine these attention tensors of different resolutions into a single tensor at the highest available resolution. To achieve this, the DiffSeg algorithm treats the four dimensions of each tensor differently.
Of the four dimensions, the last two dimensions of the attention tensors have different resolutions across layers, yet they are spatially consistent: each 2D map describes the correlation between one query location and all spatial locations. Consequently, the DiffSeg framework upsamples these two dimensions of all attention maps to the highest resolution among them, 64 x 64. The first two dimensions, on the other hand, indicate the location to which each attention map refers, as demonstrated in the following image.
Because these dimensions refer to the locations of the attention maps, the maps need to be aggregated accordingly. To ensure that the aggregated attention map remains a valid distribution, the framework normalizes it after aggregation, with every attention map assigned a weight proportional to its resolution.
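A rough sketch of how such an aggregation could be implemented is shown below: every 4D attention tensor is upsampled to the 64 x 64 grid in its last two dimensions, its query locations are broadcast onto the same grid, and each tensor is weighted in proportion to its resolution before normalization. Nearest-neighbor upsampling is used only to keep the sketch dependency-free (the real implementation would use bilinear interpolation), and the helper name is illustrative.

```python
import numpy as np

def aggregate_attention(tensors, target=64):
    """Aggregate 4D attention tensors of mixed resolutions into one
    (target, target, target, target) tensor.

    tensors: list of arrays of shape (r, r, r, r), e.g. r in {8, 16, 32, 64}.
    The last two (key) dimensions are upsampled to the target grid, the
    first two (query) dimensions are broadcast onto it, and each tensor
    is weighted in proportion to its resolution before normalization.
    """
    out = np.zeros((target, target, target, target))
    total_weight = 0.0
    for a in tensors:
        r = a.shape[-1]
        s = target // r
        # Nearest-neighbor upsampling of the key dimensions; dividing by
        # s * s keeps every map a valid probability distribution.
        up = a.repeat(s, axis=2).repeat(s, axis=3) / (s * s)
        # Replicate each low-resolution query location over its block.
        up = up.repeat(s, axis=0).repeat(s, axis=1)
        out += r * up
        total_weight += r
    out /= total_weight
    # Guard: renormalize so every aggregated map sums to one.
    out /= out.sum(axis=(2, 3), keepdims=True)
    return out
```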
Iterative Attention Merging
While attention aggregation produces a single attention tensor, the next aim is to merge the attention maps within that tensor into a stack of object proposals, where each proposal contains the activation of a single object or stuff category. One possible solution is to run a K-Means algorithm on the valid distributions in the tensor to find object clusters. However, K-Means is not the optimal choice, because it requires users to specify the number of clusters beforehand, and it can produce different results for the same image since it depends stochastically on the initialization. To overcome these hurdles, the DiffSeg framework instead generates a sampling grid of anchor points and creates the proposals by merging attention maps iteratively.
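The sketch below illustrates the idea under simplifying assumptions: anchors are sampled on a regular grid, their attention maps become the initial proposals, and proposals closer than a threshold tau (measured with the symmetric KL helper sketched earlier) are averaged together over a few passes. The grid size, threshold, and merging schedule are placeholders, not the exact values or procedure used by DiffSeg.

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-8):
    """Symmetric KL between two attention maps (see the earlier sketch)."""
    p, q = p.ravel() + eps, q.ravel() + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def iterative_merge(aggregated, grid=16, tau=1.0, iterations=3):
    """Simplified sketch of iterative attention merging.

    aggregated: (64, 64, 64, 64) tensor from the aggregation step.
    Anchor maps are sampled on a regular grid and become the initial
    proposals; proposals whose distance falls below tau are averaged,
    and the pass is repeated a few times.
    """
    step = 64 // grid
    proposals = [aggregated[i, j] for i in range(0, 64, step)
                                  for j in range(0, 64, step)]
    for _ in range(iterations):
        merged = []
        for p in proposals:
            for m in merged:
                if symmetric_kl(p, m) < tau:     # same object: absorb the anchor
                    m += (p - m) / 2.0           # average the two maps in place
                    break
            else:
                merged.append(p.copy())          # distinct object: new proposal
        proposals = merged
    return proposals                             # list of object-proposal maps
```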
Non-Maximum Suppression
The iterative attention merging step yields a list of object proposals in the form of probability (attention) maps, where each proposal contains the activation of one object. The framework uses non-maximum suppression to convert this list of proposals into a valid segmentation mask, which is straightforward because each element in the list is already a map of probabilities. For every spatial location across all maps, the algorithm takes the index of the map with the largest probability at that location and assigns membership based on that index.
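Because each proposal is already a probability map, the conversion reduces to a per-pixel argmax, as in the short sketch below (the function name is illustrative). The resulting mask is at the attention resolution and would then be upsampled to the input image size.

```python
import numpy as np

def non_maximum_suppression(proposals):
    """Convert a list of object-proposal probability maps into one
    segmentation mask: every pixel is assigned the index of the
    proposal with the highest probability at that location.
    """
    stacked = np.stack(proposals, axis=0)   # (num_proposals, 64, 64)
    return stacked.argmax(axis=0)           # (64, 64) integer mask
```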
DiffSeg : Experiments and Results
Unsupervised segmentation frameworks are commonly evaluated on two benchmarks, Cityscapes and COCO-Stuff-27. Cityscapes is a self-driving dataset with 27 mid-level categories, whereas COCO-Stuff-27 is a curated version of the original COCO-Stuff dataset that merges its 80 thing and 91 stuff categories into 27 categories. To analyze segmentation performance, the DiffSeg framework reports mean intersection over union (mIoU) and pixel accuracy (ACC), and since DiffSeg cannot provide semantic labels, it uses the Hungarian matching algorithm to assign a ground-truth mask to each predicted mask. If the number of predicted masks exceeds the number of ground-truth masks, the framework counts the unmatched predicted masks as false negatives.
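The sketch below illustrates the evaluation idea: an IoU matrix between predicted and ground-truth segments is built, the Hungarian algorithm (scipy's linear_sum_assignment) finds the best one-to-one matching, and unmatched predictions drag the score down. The function name is hypothetical and the exact protocol in the paper may differ in details.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_miou(pred, gt, num_pred, num_gt):
    """Match predicted segments to ground-truth segments by maximizing
    IoU with the Hungarian algorithm, then report a mean IoU in which
    unmatched predictions count against the score.

    pred, gt: integer label maps of identical shape.
    """
    iou = np.zeros((num_pred, num_gt))
    for i in range(num_pred):
        for j in range(num_gt):
            inter = np.logical_and(pred == i, gt == j).sum()
            union = np.logical_or(pred == i, gt == j).sum()
            iou[i, j] = inter / union if union else 0.0
    rows, cols = linear_sum_assignment(-iou)    # negate to maximize total IoU
    return iou[rows, cols].sum() / max(num_pred, num_gt)
```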
Additionally, the DiffSeg framework highlights three properties that characterize what a method needs at inference time: Language Dependency (LD), Unsupervised Adaptation (UA), and Auxiliary Image (AX). Language Dependency means the method needs descriptive text inputs to facilitate segmentation of the image, Unsupervised Adaptation refers to the requirement that the method undergo unsupervised training on the target dataset, and Auxiliary Image indicates that the method needs additional inputs, either synthetic images or a pool of reference images.
Results
On the COCO benchmark, the DiffSeg framework is compared against two K-Means baselines, K-Means-S and K-Means-C. The K-Means-C baseline uses 6 clusters, calculated by averaging the number of objects in the evaluated images, whereas the K-Means-S baseline uses a specific number of clusters for each image based on the number of objects in that image's ground truth. The results for both baselines are demonstrated in the following image.
As can be seen, the K-Means baselines outperform existing methods, demonstrating the benefit of using self-attention tensors. Interestingly, the K-Means-S baseline outperforms the K-Means-C baseline, which indicates that the number of clusters is a fundamental hyper-parameter and needs to be tuned for every image. Furthermore, even when relying on the same attention tensors, the DiffSeg framework outperforms the K-Means baselines, which shows that DiffSeg not only provides better segmentation but also avoids the disadvantages of the K-Means approach.
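For context, the sketch below shows how a baseline in the spirit of K-Means-C could be built over the aggregated attention tensor, with a fixed cluster count of 6; this is an illustrative reconstruction using scikit-learn, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_baseline(aggregated, n_clusters=6):
    """K-Means baseline over the aggregated attention tensor.

    Every query location on the 64 x 64 grid is described by its
    flattened attention map (a 4096-dimensional feature vector);
    clustering those vectors with a fixed number of clusters yields
    a segmentation mask directly.
    """
    h, w = aggregated.shape[:2]
    features = aggregated.reshape(h * w, -1)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
    return labels.reshape(h, w)
```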
On the Cityscapes dataset, the DiffSeg framework delivers results similar to frameworks that use lower 320-resolution inputs, while outperforming frameworks that take higher 512-resolution inputs in both accuracy and mIoU.
As mentioned before, the DiffSeg framework employs several hyper-parameters as demonstrated in the following image.
Attention aggregation is one of the fundamental concepts in the DiffSeg framework, and the effects of using different aggregation weights are demonstrated in the following image, with the resolution of the image held constant.
As can be observed, the high-resolution 64 x 64 maps in Fig (b) yield the most detailed segmentations, although those segmentations show some visible fractures, whereas the lower-resolution 32 x 32 maps tend to overlook finer details while producing more coherent segmentations. In Fig (d), the low-resolution maps fail to generate any segmentation, as the entire image is merged into a single object under the existing hyper-parameter settings. Finally, Fig (a), which uses the proportional aggregation strategy, achieves both enhanced detail and balanced consistency.
Final Thoughts
Zero-shot unsupervised segmentation remains one of the greatest hurdles for computer vision frameworks, and existing models rely either on non-zero-shot unsupervised adaptation or on external resources. To overcome this hurdle, we have discussed how the self-attention layers in Stable Diffusion models enable the construction of a model capable of segmenting any input in a zero-shot setting without annotations, since these self-attention layers hold the inherent object concepts that a pre-trained Stable Diffusion model learns. We have also discussed DiffSeg, a novel post-processing strategy that harnesses the potential of the Stable Diffusion framework to construct a generic segmentation model capable of zero-shot transfer to any image. The algorithm relies on Intra-Attention Similarity and Inter-Attention Similarity to merge attention maps iteratively into valid segmentation masks, achieving state-of-the-art performance on popular benchmarks.