Visual Instruction Tuning for Pixel-Level Understanding with Osprey

Thanks to recent advances in visual instruction tuning, Multimodal Large Language Models (MLLMs) have demonstrated remarkable general-purpose vision-language capabilities, which makes them key building blocks for modern general-purpose visual assistants. Recent models, including MiniGPT-4, LLaVA, InstructBLIP, and others, exhibit impressive visual reasoning and instruction-following abilities. Most of them rely on image-text pairs for image-level vision-language alignment, and they perform well at that level. However, this reliance on image-level and box-level understanding is the primary reason MLLMs fail to carry that performance over to fine-grained vision-language alignment tasks at the pixel level. In addition, the limited availability of mask-based instruction data for training makes it harder to improve MLLMs further.

Osprey is a mask-text instruction training method with the primary aim of extending MLLMs. It incorporates fine-grained masked regions in language instruction to achieve pixel-level visual-language understanding. To accomplish this, the Osprey framework curates a mask-based region-text dataset with over 700 thousand samples. It injects pixel-level representation into Large Language Models (LLMs) to design a vision-language model. Notably, the Osprey framework adopts a convolutional CLIP model as its vision encoder and integrates a mask-aware visual extractor into its architecture. This allows for precise extraction of visual mask features from high-resolution input.

In this article, we will discuss the Osprey framework and delve deeper into its architecture. We will also explore the curated region-text dataset with over 700 thousand samples and compare its performance in various region understanding tasks. So, let’s get started.

Multimodal Large Language Models like MiniGPT-4, Otter, Qwen-VL, InstructBLIP and others are the frontrunners for developing general-purpose visual assistants, and they are known for their strong multimodal and vision generative capabilities. However, they share a major weakness: they deliver unsatisfactory results on fine-grained image understanding tasks like captioning, region classification, and reasoning. A major reason for this sub-par performance is the lack of alignment at the region level. Recent MLLMs like GPT4RoI, Shikra and others aim to enable region-level understanding in vision-language models by processing regions specified with bounding boxes and by applying visual instruction tuning with object-level spatial features.

Although enabling region-level understanding in this way can improve performance, using sparse bounding boxes directly as the referring input region can introduce irrelevant background features, which leads to inaccurate region-text pair alignment during visual instruction tuning of large language models. At inference time, a box-level referring input may also fail to delimit and represent the object precisely, which can result in semantic deviation, as demonstrated in the following image.


In comparison, using fine-grained masks instead of coarse bounding boxes as the referring input can represent objects with more precision. The recently developed Segment Anything Model (SAM) is trained on more than a billion high-quality masks, demonstrates remarkable zero-shot segmentation quality, and supports points or simple bounding boxes as prompts. However, SAM cannot generate primary semantic labels, nor can it provide detailed semantic captions and attributes. As a result, existing models lack inherent multimodal fine-grained information and have a limited understanding of real-world scenes.

To tackle the challenges faced by existing MLLMs, Osprey, a novel mask-text instruction training method, aims to extend the capabilities of multimodal large language models to fine-grained understanding at the pixel level. The Osprey framework introduces a mask-aware visual extractor that precisely captures visual mask features at varying granularity. The framework then interleaves these visual features with language instructions to form the input sequence for the large language model, and it uses a convolutional CLIP architecture so that high-resolution input can be processed. Owing to this design, the Osprey framework achieves fine-grained semantic understanding of object-level and part-level regions, and it provides detailed object attributes along with the primary object category and richer descriptions of complex scenes.

By leveraging visual instruction tuning, the Osprey framework goes beyond image-level and box-level scene understanding: it can generate fine-grained semantics from the class-agnostic masks produced by an off-the-shelf SAM. Osprey also shows strong capabilities across referring object classification, open-vocabulary recognition, region-level captioning, and detailed region description tasks.

Osprey : Methodology and Architecture

The following figure demonstrates the architecture overview of the Osprey framework consisting of a large language model, pixel-level mask aware visual extractor, and an image-level vision encoder. 

For a given image, the input language, and the referring mask regions, the framework performs conversion and tokenization to generate embeddings before sending the language embedding sequences and interleaved mask features to the large language model to obtain fine-grained semantic understandings.

Convolutional CLIP Vision Encoder

The vision encoder deployed in a majority of multimodal large language models is a ViT-based CLIP model. As a result, these models adopt an image resolution of either 224×224 pixels or 336×336 pixels. However, a ViT-based CLIP model at these resolutions makes it difficult to achieve fine-grained image understanding with pixel-level representations, a problem that is amplified further in small regions. Furthermore, the computational overhead associated with the ViT architecture makes it impractical to increase the input image resolution.

To tackle this challenge, the Osprey framework uses a convolutional CLIP model as the vision encoder in its architecture. CNN-based CLIP models have been shown to generalize well across different input resolutions when compared with vision transformer based CLIP models. Using a CNN-based CLIP model allows for fast inference and efficient training without compromising the model’s performance. Furthermore, a CNN-based CLIP model produces multi-scale feature maps, which the framework then uses directly for feature extraction in each object region.

Mask Aware Visual Extractor

In contrast to existing region-based models that use sparse bounding boxes as the referring input, the Osprey framework uses detailed mask regions to implement object-based representations. The Osprey model employs a mask-aware visual extractor component to capture pixel-level features within each object region. The mask-aware visual extractor encodes mask-level visual features and additionally gathers the spatial position information of each region.

To implement this, Osprey first applies a mask-pooling operation to the multi-level image features generated by the vision encoder: for every single-level feature map, the framework pools all the features that lie within the mask region. The model then passes each pooled feature through a linear projection layer to generate region-level embeddings, and fuses the multi-level features by summation. An MLP layer then produces the visual mask token. Furthermore, Osprey preserves the spatial geometry of the object region by encoding the pixel-level position relationship with a binary mask for each object region. In the end, each mask region embedding consists of the visual mask token and its respective spatial token.
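The steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not Osprey’s actual implementation: the function names, the feature and embedding sizes, the ReLU inside the MLP, the nearest-neighbour mask resizing, and the way the spatial token is built (flattening a resized binary mask and projecting it) are all hypothetical choices made for the example.

```python
import numpy as np

def resize_mask_nearest(mask, height, width):
    """Nearest-neighbour resize of a binary mask to (height, width)."""
    rows = np.arange(height) * mask.shape[0] // height
    cols = np.arange(width) * mask.shape[1] // width
    return mask[np.ix_(rows, cols)]

def mask_pool(feature_map, mask):
    """Average the feature vectors at positions where the mask is on.

    feature_map has shape (H, W, C); mask has shape (H, W).
    """
    mask = mask.astype(bool)
    if not mask.any():
        return np.zeros(feature_map.shape[-1])
    return feature_map[mask].mean(axis=0)

def mask_token(feature_maps, mask, projections, mlp_weights):
    """Pool each level, project to a shared width, sum, then apply an MLP."""
    fused = 0.0
    for fmap, proj in zip(feature_maps, projections):
        level_mask = resize_mask_nearest(mask, fmap.shape[0], fmap.shape[1])
        fused = fused + mask_pool(fmap, level_mask) @ proj
    w1, w2 = mlp_weights
    hidden = np.maximum(fused @ w1, 0.0)  # ReLU is an assumption
    return hidden @ w2

def spatial_token(mask, size, projection):
    """Assumed encoding: resize the binary mask, flatten it, project it."""
    small = resize_mask_nearest(mask, size, size).astype(float)
    return small.reshape(-1) @ projection

# Toy inputs: two feature levels with different resolutions and channels.
rng = np.random.default_rng(0)
mask = np.zeros((16, 16), dtype=np.uint8)
mask[4:12, 4:12] = 1
feature_maps = [rng.normal(size=(16, 16, 8)), rng.normal(size=(8, 8, 16))]
projections = [rng.normal(size=(8, 32)), rng.normal(size=(16, 32))]
mlp_weights = (rng.normal(size=(32, 64)), rng.normal(size=(64, 32)))

token = mask_token(feature_maps, mask, projections, mlp_weights)
position = spatial_token(mask, 8, rng.normal(size=(64, 32)))
print(token.shape, position.shape)
```

Because pooling happens per level before projection, each level can have its own resolution and channel count, and only the projected embeddings need to share a width for the summation.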

LLM Tokenization

As mentioned earlier, the model extracts the image-level embeddings of an image by feeding it into a pre-trained CNN-based visual encoder. For textual information, the model first uses the pre-trained LLM tokenizer to tokenize text sequences, and then projects these tokenized sequences into text embeddings. For mask-based regions, the model defines a special token as a placeholder, and then substitutes it with a mask token along with a spatial token. When the text input refers to an object region, the placeholder is appended after the region name, so that mask regions mix naturally with the text and form complete sentences without occupying the tokenization space. Furthermore, apart from the user instructions, the model also includes a prefix prompt: a special token that serves as a placeholder and is then replaced by the vision encoder’s image-level embeddings. Finally, the framework interleaves the region-level and image-level visual tokens with the text tokens, and feeds the result into the large language model so that it can comprehend the user instructions and the image with its different object regions.
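The placeholder substitution can be illustrated with a short sketch. The placeholder strings `<image>` and `<region>`, the function name, and the tagged-tuple output are hypothetical choices for this example; in a real model the text pieces would be tokenized and embedded, and the visual entries would be embedding vectors rather than strings.

```python
import re

IMAGE_PLACEHOLDER = "<image>"    # hypothetical name for the prefix placeholder
REGION_PLACEHOLDER = "<region>"  # hypothetical name for the region placeholder
_SPLIT = re.compile(
    "(" + re.escape(IMAGE_PLACEHOLDER) + "|" + re.escape(REGION_PLACEHOLDER) + ")"
)

def interleave(prompt, image_tokens, region_tokens):
    """Replace placeholders with visual tokens, keeping the text order.

    region_tokens is a list of (mask_token, spatial_token) pairs, one per
    region placeholder, in the order the placeholders appear in the prompt.
    """
    regions = iter(region_tokens)
    sequence = []
    for piece in _SPLIT.split(prompt):
        if piece == IMAGE_PLACEHOLDER:
            sequence.extend(("image", tok) for tok in image_tokens)
        elif piece == REGION_PLACEHOLDER:
            try:
                mask_tok, spatial_tok = next(regions)
            except StopIteration:
                raise ValueError("more placeholders than regions") from None
            sequence.append(("mask", mask_tok))
            sequence.append(("spatial", spatial_tok))
        elif piece:
            sequence.append(("text", piece))
    if next(regions, None) is not None:
        raise ValueError("more regions than placeholders")
    return sequence

prompt = "<image>\nIs region1 <region> larger than region2 <region>?"
sequence = interleave(prompt, ["img0", "img1"], [("m1", "s1"), ("m2", "s2")])
for kind, value in sequence:
    print(kind, repr(value))
```

Each region placeholder expands to exactly two entries, a mask token followed by a spatial token, which mirrors the substitution described above.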

Osprey : Three Stage Training Process

The Osprey framework deploys a three stage training process in which each of the training phases is supervised by minimizing a next-token prediction loss.
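As a reminder of what a next-token prediction loss computes, here is a minimal pure-Python sketch: the mean negative log-likelihood of each target token under a softmax over the model’s logits. Averaging over every position is an assumption made for simplicity; the article does not say which positions contribute to Osprey’s loss.

```python
import math

def next_token_loss(logits, targets):
    """Mean negative log-likelihood of each target under softmax(logits).

    logits is a list of per-position score lists over the vocabulary;
    targets is a list with the index of the correct next token per position.
    """
    total = 0.0
    for step_logits, target in zip(logits, targets):
        peak = max(step_logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in step_logits))
        total += log_norm - step_logits[target]
    return total / len(targets)

# With uniform scores over 4 tokens, the loss equals log(4).
uniform_loss = next_token_loss([[0.0, 0.0, 0.0, 0.0]] * 3, [0, 1, 2])
print(uniform_loss, math.log(4))
```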

Stage 1: Image-Text Alignment Training

In the first stage, the Osprey framework trains the image-level features and the vision-language connector for image-text feature alignment. This stage involves three components: a pre-trained large language model, a pre-trained CNN-based CLIP vision encoder, and an image-level projector. The framework adopts an MLP layer as the vision-language connector, which helps enhance Osprey’s multimodal generative capabilities.

Stage 2: Mask-Text Alignment Pre-Training

In the second stage, Osprey loads the weights trained in the first stage and employs its Mask-Aware Visual Extractor component to capture pixel-level region features. In this stage, the framework trains only the Mask-Aware Visual Extractor, to align language embeddings with mask-based region features. Furthermore, the model collects pairs of pixel-level masks and short texts from part-level and publicly available object-level datasets, and converts them into instruction-following data to further train the model.

Stage 3: End-to-End Fine Tuning

In the third and the final stage, the model fixes the weights of the vision encoder, and finetunes the large language model, mask-based region feature extractor, and the image-level projector components in its architecture. The primary aim of training in the third stage is to extend the model’s capability to follow user instructions accurately, and efficiently perform pixel-level region understanding tasks. 
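The three stages can be summarized as a small freeze schedule. The sketch below is only a bookkeeping aid: the component names are hypothetical labels, and the stage 1 entry assumes that the pre-trained vision encoder and LLM stay frozen while only the image-level projector is trained, which the description above implies but does not state outright.

```python
from dataclasses import dataclass

COMPONENTS = frozenset(
    {"vision_encoder", "image_projector", "mask_extractor", "llm"}
)

@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset

STAGES = (
    # Assumption: the encoder and the LLM are frozen in stage 1.
    Stage("image-text alignment", frozenset({"image_projector"})),
    # Stage 2 trains only the mask-aware visual extractor.
    Stage("mask-text alignment", frozenset({"mask_extractor"})),
    # Stage 3 keeps the vision encoder fixed and fine-tunes the rest.
    Stage(
        "end-to-end fine-tuning",
        frozenset({"llm", "mask_extractor", "image_projector"}),
    ),
)

def frozen_components(stage):
    """Return the components whose weights stay fixed during a stage."""
    return COMPONENTS - stage.trainable

for index, stage in enumerate(STAGES, start=1):
    print(index, stage.name, "frozen:", sorted(frozen_components(stage)))
```

In a training script, a table like this would typically drive which parameter groups receive gradients in each stage.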

After implementing the three training stages, the Osprey framework is capable of understanding complex scenarios defined by user instructions and based on pixel-level mask regions. 

Osprey : Experimental Results

To evaluate its performance, Osprey developers conduct a wide array of experiments to demonstrate the model’s capabilities in classification, pixel-level region-based recognition, and complex descriptions. 

Open-Vocabulary Segmentation

The primary goal of open-vocabulary segmentation is to recognize mask-based regions and explicitly assign each one its category. To evaluate open-vocabulary recognition, Osprey is first given an input text prompt, and ground-truth mask regions are then used for model inference. Based on the sentence response generated by the multimodal large language model, the semantic similarity between the output and the vocabulary list of each dataset is calculated. The following figure compares Osprey against state-of-the-art multimodal large language models.

As can be observed, the Osprey framework outperforms existing methods by a considerable margin on both the Cityscapes and the ADE20K-150 datasets. The results indicate that Osprey achieves robust understanding and recognition of fine-grained object regions.

Referring Object Classification

In the Referring Object Classification task, the model is required to classify the object within a specific region of an image. To evaluate classification capabilities, two semantic relevance metrics are used: Semantic IoU (S-IoU) and Semantic Similarity (SS). Semantic IoU represents the overlap of words between the ground-truth and the predicted labels, whereas Semantic Similarity measures the similarity between the predicted and the ground-truth labels in a semantic space. The following image demonstrates Osprey’s performance in the Referring Object Classification task when put against models employing box-level and image-level approaches.
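One plausible reading of the two metrics is sketched below. Semantic IoU is computed as the intersection over union of the word sets of the two labels, and Semantic Similarity as the cosine similarity of two embedding vectors. The lowercase whitespace tokenization and the toy vectors are assumptions for illustration; an actual evaluation would use a text embedding model to map labels into the semantic space, and its exact definitions may differ.

```python
import math

def semantic_iou(prediction, ground_truth):
    """Word-level intersection over union of two labels."""
    pred_words = set(prediction.lower().split())
    gt_words = set(ground_truth.lower().split())
    union = pred_words | gt_words
    if not union:
        return 0.0
    return len(pred_words & gt_words) / len(union)

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Two of the four distinct words are shared, so S-IoU is 0.5.
s_iou = semantic_iou("a red sports car", "sports car")
# Toy embeddings stand in for the output of a real text encoder.
ss = cosine_similarity([1.0, 0.0], [1.0, 1.0])
print(s_iou, round(ss, 4))
```

S-IoU rewards exact word matches only, while a similarity in embedding space can still give credit when the prediction uses a synonym of the ground-truth label.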

Detailed-Region Description

In the Detailed-Region Description task, the model is evaluated on its instruction-following detailed description capabilities alongside other region-level approaches. An input inference prompt is selected at random from a list of predefined prompts, and GPT-4 is used to comprehensively measure the quality of the response the model generates for the input referring regions. Using the instruction generation pipeline, questions are generated and GPT-4’s answers are obtained, after which the LLM assesses the correctness of the semantics and the precision of the referring understanding. The following table demonstrates the performance of Osprey against state-of-the-art models on Detailed-Region Description tasks.

Region-Level Captioning

The Osprey framework also outperforms current approaches on Region-Level Captioning tasks with the results contained in the following image. 

Final Thoughts

In this article, we have talked about Osprey, a mask-text instruction training method with the primary aim of extending MLLMs by incorporating fine-grained masked regions in language instruction to achieve pixel-level visual-language understanding. To accomplish its goal, the Osprey framework curates a mask-based region-text dataset with over 700 thousand samples, and injects pixel-level representation into LLM to design a vision-language model. The Osprey framework aims to enhance MLLMs for fine-grained visual understanding significantly, and by implementing a CNN-based CLIP model and a mask-aware visual extractor, Osprey attains the capability to understand images at both part-level and object-level regions.