Enabling spatial understanding in vision-language models remains a core research challenge. This understanding underpins two crucial capabilities: referring and grounding. Referring enables the model to accurately interpret the semantics of specific regions, while grounding uses semantic descriptions to localize those regions.
Researchers have introduced Ferret, a Multimodal Large Language Model (MLLM) capable of understanding spatial referring at any granularity or shape in an image and of accurately grounding open-vocabulary descriptions. Ferret uses a novel hybrid representation combining continuous features and discrete coordinates to represent image regions. Its spatial-aware visual sampler handles varying sparsity across shapes, allowing it to process diverse region inputs such as free-form shapes, bounding boxes, and points.
Ferret’s approach enables it to excel in classical grounding and referring tasks and surpass other MLLMs in localization-demanding and region-based multimodal communication. This article delves into Ferret’s architecture and methodology, highlighting its impressive performance in various multimodal language tasks. Let’s explore this further.
Referring is the capability that lets a model accurately comprehend the semantics of specific given regions, whereas grounding requires the model to use semantic descriptions to localize those regions. Although the two tasks differ, both rest on the same fundamental concept: the alignment of spatial information and semantics. Despite sharing this concept, existing models learn grounding and referring separately. This approach works, but it is a hurdle to achieving human-like capability: humans learn from one task and apply that learning to other tasks seamlessly, and they effortlessly integrate grounding and referring with reasoning and everyday dialogue. The Ferret framework takes inspiration from this gap in existing MLLMs and studies three main questions:
- How can grounding and referring be unified in a single framework, and how does their union benefit each capability?
- Humans refer to regions with versatile formats such as points, boxes, scribbles, and free-form shapes. How can these versatile regions be represented?
- How can grounding and referring be made instruction-following, robust, and open-vocabulary, properties that are critical for their practical application?
The Ferret framework is a novel refer-and-ground Multimodal Large Language Model that targets these questions. Ferret builds on an MLLM foundation owing to such models' remarkable global vision and language understanding. To unify grounding and referring, the framework represents region coordinates in numerical natural-language form. In practice, however, box coordinates or single points cannot efficiently represent versatile region shapes such as scribbles, strokes, or complex polygons, even though these shapes are critical for precision and for more natural human-model interaction. To tackle this issue, Ferret employs a spatial-aware visual sampler that extracts visual features for regions of any shape, handling the varying sparsity across those shapes. The framework then combines these continuous visual features with discrete coordinates to represent regions in the input, yielding Ferret's hybrid region representation.
With these methods, Ferret can handle input that mixes free-form text with referred regions, and it seamlessly generates coordinates for each groundable object alongside the generated text, grounding the mentioned objects in the output. Ferret is thus the first Multimodal Large Language Model able to process free-form region inputs. Furthermore, the framework acquires remarkable open-vocabulary spatial understanding and localization capabilities, allowing it to achieve superior performance on conventional grounding and referring tasks.
Moving along, the Ferret framework draws on three existing lines of work: Multimodal Large Language Models, MLLMs for referring and grounding, and the unification of grounding with vision-language understanding.
The introduction of Large Language Models including GPT, DALL-E, PaLM, LLaMA, and BLOOM has changed the landscape of NLP research and driven significant advances in multimodal language models. Earlier multimodal language models focused primarily on large-scale image-text pre-training, with notable examples being PaLI, SimVLM, GIT, BLIP-2, Flamingo, CM3, and PaLI-X. The Flamingo framework efficiently integrated an LLM with a pre-trained CLIP image encoder through gated cross-attention blocks, yielding remarkable multimodal few-shot learning capabilities. Current research instead looks for ways to use pre-trained large language models for visual instruction tuning, with notable examples including MiniGPT-4, Otter, InstructBLIP, and more. Recent models such as Emu and GILL have also shown remarkable success in using MLLMs for image generation and image retrieval. Finally, Ferret draws on prior research that unifies text and bounding-box outputs for vision-language models.
Ferret: Methodology and Architecture
Hybrid Region Representation
Point, box, and free-form shape are the three dominant formats a language model uses when referring to specific regions. While points and boxes can be accurately represented by coordinates, free-form shapes are harder to map because of their versatility: they encompass a wide array of regions including masks, polygons, and scribbles. Depicting free-form shapes with coordinates alone is complex and verbose, which hinders the model's ability to learn the correlation between regions and their coordinates, and it is also computationally expensive.
To tackle this problem and to generalize across all three formats, the Ferret framework proposes a hybrid region representation that combines continuous visual features with discrete coordinates to refer to a particular region.
For the continuous visual features of a given region, the Ferret framework first constructs a 2D binary mask of the same size as the image, marking a value of 1 inside the targeted region and 0 outside it. The binary mask, together with the extracted image feature map, is then sent to the spatial-aware visual sampler.
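As a rough illustration of how such a mask might be built, here is a minimal sketch assuming a box-shaped region and a PyTorch tensor; the function name is hypothetical, and in practice a polygon or scribble would be rasterized into the same kind of mask.

```python
import torch

def build_region_mask(height: int, width: int,
                      box: tuple[int, int, int, int]) -> torch.Tensor:
    """Build a 2D binary mask the same size as the image:
    1 inside the referred region, 0 outside (box case shown)."""
    x1, y1, x2, y2 = box
    mask = torch.zeros(height, width)
    mask[y1:y2, x1:x2] = 1.0
    return mask

# The mask is paired with the image feature map and handed to the
# spatial-aware visual sampler described below.
```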
Architecture
The architecture of the Ferret model comprises three main components:
- An image encoder to extract image embeddings.
- A spatial-aware visual sampler to extract continuous regional features.
- A Large Language Model to model text, image, and region features jointly.
The image is first fed into the pre-trained visual encoder to extract image embeddings. For text inputs, the framework uses a pre-trained LLM tokenizer to tokenize the text sequence and then projects the tokens into text embeddings. For referred regions, Ferret appends the coordinates and a special token, which serves as a placeholder for the continuous features, after the region name. If the region's name is unknown or hard to describe because it contains several objects, the framework simply uses a generic name such as "area" or "region".
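To make this input format concrete, here is a hypothetical sketch of how a referred region could be rendered in the prompt, with the discrete coordinates followed by a placeholder token that is later replaced by the region's continuous feature. The exact token string and the helper name are assumptions for illustration, not Ferret's actual prompt template.

```python
def format_referred_region(region_name: str,
                           box: tuple[int, int, int, int],
                           feature_token: str = "<region_feat>") -> str:
    """Render a referred region as text: region name, discrete coordinates,
    and a special placeholder token that is later swapped for the continuous
    feature produced by the spatial-aware visual sampler."""
    x1, y1, x2, y2 = box
    return f"{region_name} [{x1}, {y1}, {x2}, {y2}] {feature_token}"

# Example: "What is the dog [40, 60, 180, 220] <region_feat> doing?"
question = f"What is {format_referred_region('the dog', (40, 60, 180, 220))} doing?"
```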
A major challenge in dealing with referred regions is that their shapes vary widely: they are not limited to rectangular boxes or points. Irregularly shaped regions cannot be handled well by traditional grid-based processing such as patch attention or convolution. To tackle this issue, the Ferret framework proposes a spatial-aware visual sampler. Given an extracted feature map and a binary region mask, the model first randomly samples N points within the mask.
For each point, the model obtains its feature by bilinear interpolation from the feature map. The N points are then fed into a cascade of blocks, each of which performs three steps: sampling, gathering, and pooling. In the sampling step, a fixed number of points are sampled from the N available points using the farthest point sampling (FPS) algorithm, which guarantees adequate coverage. In the gathering step, for each sampled point, the framework finds its k nearest neighbors from the pool of N points and, for each such group, fuses the feature of the sampled point with those of its neighbors. In the final step, the framework max-pools the k neighbor features into a single feature that acts as the representation of the sampled point. After these three steps, Ferret is left with fewer points but denser features, since each one incorporates not only the features of its local neighbors but also their relative positions.
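A simplified PyTorch sketch of one sampling-gathering-pooling block follows. The fusion step in Ferret involves learned layers, so the plain concatenation of relative positions and neighbor features used here is a simplifying assumption; the point features themselves would come from bilinear interpolation of the image feature map.

```python
import torch

def farthest_point_sampling(points: torch.Tensor, num_samples: int) -> torch.Tensor:
    """Greedy FPS over (N, 2) point coordinates: repeatedly pick the point
    farthest from the set already chosen, guaranteeing good coverage."""
    n = points.shape[0]
    chosen = torch.zeros(num_samples, dtype=torch.long)
    chosen[0] = torch.randint(n, (1,)).item()
    dist = torch.full((n,), float("inf"))
    for i in range(1, num_samples):
        dist = torch.minimum(dist, (points - points[chosen[i - 1]]).pow(2).sum(-1))
        chosen[i] = torch.argmax(dist)
    return chosen

def sample_gather_pool(points: torch.Tensor, feats: torch.Tensor,
                       num_samples: int, k: int):
    """One block of the sampler: FPS (sampling), k-nearest-neighbour grouping
    (gathering), then max pooling over each group's relative positions and
    features (pooling). points: (N, 2), feats: (N, C)."""
    idx = farthest_point_sampling(points, num_samples)
    centers = points[idx]
    dists = torch.cdist(centers, points)                 # (num_samples, N)
    knn = dists.topk(k, largest=False).indices           # k nearest neighbours per centre
    rel_pos = points[knn] - centers[:, None, :]          # neighbours' positions relative to centre
    grouped = torch.cat([rel_pos, feats[knn]], dim=-1)   # (num_samples, k, 2 + C)
    pooled = grouped.max(dim=1).values                   # fuse k neighbours into one feature
    return centers, pooled
```

Stacking such blocks progressively reduces the number of points while enriching the surviving features, which then stand in for the placeholder tokens in the input sequence.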
GPT-Assisted Visual Data Generation
Dialogue instruction tuning data is of critical importance to Multimodal Large Language Models: it goes beyond converting existing datasets through templates, helping the model understand human intention and generate appropriate responses. Most MLLMs obtain visual instruction tuning data with a few-shot prompting method, in which textual descriptions of the scene in an image are provided to a language model along with human-annotated dialogues as few-shot demonstrations. However, existing instruction tuning methods focus primarily on describing the entire image without explicitly specifying spatial information. The Ferret framework instead emphasizes region-based knowledge and collects refer-and-ground instruction tuning data in three steps.
- In addition to global captions and objects, the framework provides a symbolic scene description that captures region captions and objects, their coordinates, and the physical relationships between them (a toy sketch of such a description follows this list).
- For human-annotated dialogues, the framework adds coordinates after groundable objects or regions, in the input, the output, or both. The dialogues focus primarily on specific regions, which implicitly prompts the language model to follow similar patterns when generating new dialogues.
- The generated dialogue may not follow the rules and patterns set by the few-shot examples and the system prompt. To handle this, the framework uses a language model once more to refine the initially generated dialogues.
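To make the first step concrete, here is a toy sketch of what a symbolic scene description with coordinates might look like before being handed to the text-only language model; the field names, wording, and example boxes are all hypothetical.

```python
def scene_description(objects: list[dict]) -> str:
    """Render a symbolic scene description: one line per object or region
    caption, followed by its box coordinates."""
    return "\n".join(
        f"{obj['caption']} [{', '.join(str(c) for c in obj['box'])}]" for obj in objects
    )

few_shot_context = scene_description([
    {"caption": "a brown dog lying on the grass", "box": [40, 60, 180, 220]},
    {"caption": "a red frisbee next to the dog",  "box": [200, 50, 260, 110]},
])
# The description is combined with few-shot dialogue examples and a system
# prompt asking the model to generate dialogues that refer to and ground regions.
```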
Spatial Negative Mining
Prior research has demonstrated that multimodal large language models are prone to hallucination when answering yes-or-no questions. To keep Ferret from hallucinating in such conditions, the framework employs a spatial negative mining approach comprising image-conditioned category localization and semantics-conditioned category localization. Both methods ask the model to localize specific object categories, which teaches it to recognize when certain objects are absent from the image.
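As a toy illustration of the underlying idea (not Ferret's actual pipeline), one could mine a negative sample by asking the model to localize a category known to be absent from the image:

```python
import random

def make_negative_sample(present_categories: set[str],
                         vocabulary: list[str]) -> tuple[str, str]:
    """Pick a category absent from the image and build a localization query
    whose correct answer states that the object is not there."""
    absent = [c for c in vocabulary if c not in present_categories]
    target = random.choice(absent)
    question = f"Is there a {target} in the image? If so, where is it?"
    answer = f"No, there is no {target} in the image."
    return question, answer
```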
Ferret: Results and Experimentation
To analyze its performance, Ferret is first evaluated on conventional referring and grounding benchmarks, and then on a more complex multimodal chatting task that tests its refer-and-ground capabilities.
Referring capability is evaluated by how accurately the model understands the semantics of a referred region given that region in the image or the question. To measure accuracy, objects, the most basic semantic unit, are considered first, as they are both fundamental and easy to define. To mimic human-level versatility, the referred location of an object is represented in three ways: a point, a box, and a free-form shape. For free-form shapes, the model randomly generates strokes within the ground-truth object to simulate scribbles. For boxes, the framework uses the ground-truth bounding boxes provided by the LVIS annotations. For points, the model randomly samples a point that lies within the ground-truth object and near its boundary. The results on the three types of referring are shown in the figure below.
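For the point case, a minimal sketch of how one could sample a pixel that is inside a ground-truth mask yet near its boundary (the distance threshold and helper name are assumptions):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def sample_point_near_boundary(gt_mask: np.ndarray, max_dist: float = 5.0) -> tuple[int, int]:
    """Sample a pixel inside the ground-truth mask (values 0/1) that lies
    within max_dist pixels of the mask boundary."""
    dist = distance_transform_edt(gt_mask)            # distance of inside pixels to the background
    ys, xs = np.where((dist > 0) & (dist <= max_dist))
    i = np.random.randint(len(ys))
    return int(xs[i]), int(ys[i])
```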
The Ferret framework demonstrates remarkable performance in referential dialogue tasks, opening the door to integration with other visual learning tasks, especially those with grounding outputs. To assess grounding capability, Ferret is first evaluated on benchmark visual grounding tasks with a generative paradigm, and then on grounded captioning tasks to measure the alignment between regions and words.
In visual grounding tasks, the framework aims to ground language queries onto aligned regions of the image. As shown in the figure below, Ferret demonstrates remarkable performance across all benchmarks, comparable to that achieved by specialized fine-tuned methods.
For grounded captioning tasks, the model must generate a caption and then ground the generated noun phrases to image regions. The final prediction consists of three components: visual regions as boxes, the text caption, and the grounding alignments between boxes and words. The results are shown in the figure below; the framework delivers performance comparable to state-of-the-art methods.
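For illustration, the three components of such a prediction can be pictured as a simple record; the field layout here is an assumption, not Ferret's actual output schema.

```python
from dataclasses import dataclass

@dataclass
class GroundedCaption:
    caption: str                               # generated caption text
    boxes: list[tuple[int, int, int, int]]     # predicted visual regions
    alignments: list[tuple[int, list[int]]]    # (box index, indices of aligned caption words)
```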
Finally, multimodal chatting is one of the most desired capabilities of an MLLM. Existing MLLMs primarily evaluate detailed description, conversation, and complex reasoning with a language model as the judge, but no dataset evaluates multimodal chatting that requires referring or grounding actions, which leaves a gap. To bridge it, the Ferret evaluation covers three types of region-based questions that probe referring and grounding in multimodal chat. The results are shown in the figure below.
Finally, the Ferret framework is compared directly against the state-of-the-art GPT model, and the results are shown below.
Final Thoughts
In this article, we have talked about Ferret, a multimodal large language model with remarkable grounding and referring capabilities. Ferret can refer to image regions irrespective of their shape and automatically grounds the text it generates. It employs a spatial-aware visual sampler capable of handling the varying sparsity of different shapes to extract continuous features for versatile regions. As a result, Ferret can accept diverse region inputs, including free-form shapes, bounding boxes, and points.