Ecologists find computer vision models’ blind spots in retrieving wildlife images

Try taking a picture of each of North America’s roughly 11,000 tree species, and you’ll have a mere fraction of the millions of photos within nature image datasets. These massive collections of snapshots — ranging from butterflies to humpback whales — are a great research tool for ecologists because they provide evidence of organisms’ unique behaviors, rare conditions, migration patterns, and responses to pollution, climate change, and other environmental pressures.

While comprehensive, nature image datasets aren’t yet as useful as they could be. It’s time-consuming to search these databases and retrieve the images most relevant to your hypothesis. You’d be better off with an automated research assistant — or perhaps artificial intelligence systems called multimodal vision language models (VLMs). They’re trained on both text and images, making it easier for them to pinpoint finer details, like the specific trees in the background of a photo.
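
To make this concrete, here is a minimal sketch of how a CLIP-style VLM scores a free-text query against candidate images by embedding both into a shared space. The checkpoint name and file paths below are illustrative assumptions, not the setup used in the study.

```python
# Minimal sketch: scoring a text query against images with a CLIP-style VLM.
# Uses the open "openai/clip-vit-base-patch32" checkpoint via Hugging Face
# transformers; file names are hypothetical placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(p) for p in ["frog1.jpg", "frog2.jpg"]]  # hypothetical files
query = "axanthism in a green frog"

inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds one similarity score per (image, text) pair;
# a higher score means the image better matches the query.
scores = outputs.logits_per_image.squeeze(-1)
print(scores)
```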

But just how well can VLMs assist nature researchers with image retrieval? A team from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), University College London, iNaturalist, and elsewhere designed a performance test to find out. Each VLM’s task: locate and reorganize the most relevant results within the team’s “INQUIRE” dataset, composed of 5 million wildlife pictures and 250 search prompts from ecologists and other biodiversity experts. 

Looking for that special frog

In these evaluations, the researchers found that larger, more advanced VLMs, which are trained on far more data, can sometimes get researchers the results they want to see. The models performed reasonably well on straightforward queries about visual content, like identifying debris on a reef, but struggled significantly with queries requiring expert knowledge, like identifying specific biological conditions or behaviors. For example, VLMs somewhat easily uncovered examples of jellyfish on the beach, but struggled with more technical prompts like “axanthism in a green frog,” a condition that limits a frog’s ability to make its skin yellow.

Their findings indicate that the models need much more domain-specific training data to process difficult queries. MIT PhD student Edward Vendrow, a CSAIL affiliate who co-led work on the dataset in a new paper, believes that with exposure to more informative data, the VLMs could one day be great research assistants. “We want to build retrieval systems that find the exact results scientists seek when monitoring biodiversity and analyzing climate change,” says Vendrow. “Multimodal models don’t quite understand more complex scientific language yet, but we believe that INQUIRE will be an important benchmark for tracking how they improve in comprehending scientific terminology and ultimately helping researchers automatically find the exact images they need.”

The team’s experiments illustrated that larger models tended to be more effective for both simpler and more intricate searches due to their expansive training data. They first used the INQUIRE dataset to test if VLMs could narrow a pool of 5 million images to the top 100 most-relevant results (also known as “ranking”). For straightforward search queries like “a reef with manmade structures and debris,” relatively large models like “SigLIP” found matching images, while smaller-sized CLIP models struggled. According to Vendrow, larger VLMs are “only starting to be useful” at ranking tougher queries.
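
In practice, this ranking stage amounts to a nearest-neighbor search over precomputed embeddings. The sketch below assumes image vectors were embedded offline with the same model and L2-normalized; the array sizes and names are hypothetical, not the benchmark’s actual pipeline.

```python
# Sketch of the "ranking" stage: narrow a large image pool to the top 100
# matches for a text query using cosine similarity over precomputed,
# L2-normalized embeddings.
import numpy as np

def top_k(image_embeddings: np.ndarray, query_embedding: np.ndarray, k: int = 100):
    """Return indices of the k images most similar to the query.

    image_embeddings: (N, D) L2-normalized image vectors.
    query_embedding:  (D,)   L2-normalized text vector.
    """
    scores = image_embeddings @ query_embedding   # cosine similarity
    top = np.argpartition(-scores, k)[:k]         # unordered top-k in O(N)
    return top[np.argsort(-scores[top])]          # sort only the k winners

# Toy example with random stand-in data; 5 million real embeddings would be
# computed offline and memory-mapped in practice.
rng = np.random.default_rng(0)
emb = rng.normal(size=(10_000, 512)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
q = emb[42]  # pretend this is the embedded text query
print(top_k(emb, q)[:5])
```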

Vendrow and his colleagues also evaluated how well multimodal models could re-rank those 100 results, reorganizing which images were most pertinent to a search. In these tests, even huge LLMs trained on more curated data, like GPT-4o, struggled: GPT-4o’s precision score was only 59.6 percent, the highest achieved by any model.
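
Re-ranking with a multimodal LLM might look something like the sketch below, which asks the model to rate each candidate image’s relevance and sorts by its answer. The prompt wording, 0-10 scoring scale, and use of the OpenAI chat API are assumptions for illustration, not the paper’s actual protocol.

```python
# Sketch of LLM-based re-ranking: ask a multimodal model to judge how well
# each candidate image matches the query, then sort by its judgment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def relevance_score(image_url: str, query: str) -> float:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"On a scale of 0-10, how well does this image match "
                         f"the query '{query}'? Reply with only the number."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    # Assumes the model complies with the numeric-only instruction.
    return float(response.choices[0].message.content.strip())

def rerank(candidate_urls: list[str], query: str) -> list[str]:
    # Score each of the ~100 ranked candidates, most relevant first.
    return sorted(candidate_urls,
                  key=lambda u: relevance_score(u, query), reverse=True)
```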

The researchers presented these results at the Conference on Neural Information Processing Systems (NeurIPS) earlier this month.

Inquiring for INQUIRE

The INQUIRE dataset includes search queries based on discussions with ecologists, biologists, oceanographers, and other experts about the types of images they’d look for, including animals’ unique physical conditions and behaviors. A team of annotators then spent 180 hours searching the iNaturalist dataset with these prompts, carefully combing through roughly 200,000 results to label 33,000 matches that fit the prompts.

For instance, the annotators used queries like “a hermit crab using plastic waste as its shell” and “a California condor tagged with a green ‘26’” to identify the subsets of the larger image dataset that depict these specific, rare events.

Then, the researchers used the same search queries to see how well VLMs could retrieve iNaturalist images. The annotators’ labels revealed when the models struggled to understand scientists’ keywords, as their results included images previously tagged as irrelevant to the search. For example, VLMs’ results for “redwood trees with fire scars” sometimes included images of trees without any markings.
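
Scoring a model’s ranked list against those annotator labels can be done with a standard retrieval metric such as average precision, sketched below; the exact metric and rank cutoffs the paper reports may differ.

```python
# Sketch of evaluating a ranked result list against annotator labels.
def average_precision(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """Mean of precision@i over the ranks i where a relevant image appears."""
    hits, precisions = 0, []
    for i, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant_ids:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / max(len(relevant_ids), 1)

# Toy example: annotators marked b and d as true matches for the query.
print(average_precision(["a", "b", "c", "d"], {"b", "d"}))  # 0.5
```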

“This is careful curation of data, with a focus on capturing real examples of scientific inquiries across research areas in ecology and environmental science,” says Sara Beery, the Homer A. Burnell Career Development Assistant Professor at MIT, CSAIL principal investigator, and co-senior author of the work. “It’s proved vital to expanding our understanding of the current capabilities of VLMs in these potentially impactful scientific settings. It has also outlined gaps in current research that we can now work to address, particularly for complex compositional queries, technical terminology, and the fine-grained, subtle differences that delineate categories of interest for our collaborators.”

“Our findings imply that some vision models are already precise enough to aid wildlife scientists with retrieving some images, but many tasks are still too difficult for even the largest, best-performing models,” says Vendrow. “Although INQUIRE is focused on ecology and biodiversity monitoring, the wide variety of its queries means that VLMs that perform well on INQUIRE are likely to excel at analyzing large image collections in other observation-intensive fields.”

Inquiring minds want to see

Taking their project further, the researchers are working with iNaturalist to develop a query system to better help scientists and other curious minds find the images they actually want to see. Their working demo allows users to filter searches by species, enabling quicker discovery of relevant results like, say, the diverse eye colors of cats. Vendrow and co-lead author Omiros Pantazis, who recently received his PhD from University College London, also aim to improve the re-ranking system by augmenting current models to provide better results.

University of Pittsburgh Associate Professor Justin Kitzes highlights INQUIRE’s ability to uncover secondary data. “Biodiversity datasets are rapidly becoming too large for any individual scientist to review,” says Kitzes, who wasn’t involved in the research. “This paper draws attention to a difficult and unsolved problem, which is how to effectively search through such data with questions that go beyond simply ‘who is here’ to ask instead about individual characteristics, behavior, and species interactions. Being able to efficiently and accurately uncover these more complex phenomena in biodiversity image data will be critical to fundamental science and real-world impacts in ecology and conservation.”

Vendrow, Pantazis, and Beery wrote the paper with iNaturalist software engineer Alexander Shepard, University College London professors Gabriel Brostow and Kate Jones, University of Edinburgh associate professor and co-senior author Oisin Mac Aodha, and University of Massachusetts at Amherst Assistant Professor Grant Van Horn, who served as co-senior author. Their work was supported, in part, by the Generative AI Laboratory at the University of Edinburgh, the U.S. National Science Foundation/Natural Sciences and Engineering Research Council of Canada Global Center on AI and Biodiversity Change, a Royal Society Research Grant, and the Biome Health Project funded by the World Wildlife Fund United Kingdom.

Startup’s autonomous drones precisely track warehouse inventories

Whether you’re a fulfillment center, a manufacturer, or a distributor, speed is king. But getting products out the door quickly requires workers to know where those products are located in their warehouses at all times. That may sound obvious, but lost or misplaced inventory is a major problem in warehouses around the world.

Corvus Robotics is addressing that problem with an inventory management platform that uses autonomous drones to scan the towering rows of pallets that fill most warehouses. The company’s drones can work 24/7, whether warehouse lights are on or off, scanning barcodes alongside human workers to give them an unprecedented view of their products.

“Typically, warehouses will do inventory twice a year — we change that to once a week or faster,” says Corvus co-founder and CTO Mohammed Kabir ’21. “There’s a huge operational efficiency you gain from that.”

Corvus is already helping distributors, logistics providers, manufacturers, and grocers track their inventory. Through that work, the company has helped customers realize huge gains in the efficiency and speed of their warehouses.

The key to Corvus’s success has been building a drone platform that can operate autonomously in tough environments like warehouses, where GPS doesn’t work and Wi-Fi may be weak, using only cameras and neural networks to navigate. With that capability, the company believes its drones are poised to enable a new level of precision in how products are produced and stored in warehouses around the world.

A new kind of inventory management solution

Kabir has been working on drones since he was 14.

“I was interested in drones before the drone industry even existed,” Kabir says. “I’d work with people I found on the internet. At the time, it was just a bunch of hobbyists cobbling things together to see if they could work.”

In 2017, the same year Kabir came to MIT, he received a message from his eventual Corvus co-founder Jackie Wu, who was a student at Northwestern University at the time. Wu had seen some of Kabir’s work on drone navigation in GPS-denied environments as part of an open-source drone project. The students decided to see if they could use the work as the foundation for a company.

Kabir started working on spare nights and weekends as he juggled building Corvus’ technology with his coursework in MIT’s Department of Aeronautics and Astronautics. The founders initially tried using off-the-shelf drones and equipping them with sensors and computing power. Eventually they realized they had to design their drones from scratch, because off-the-shelf drones did not provide the kind of low-level control and access they needed to build full-lifecycle autonomy.

Kabir built the first drone prototype in his dorm room in Simmons Hall and took to flying each new iteration in the field out front.

“We’d build these drone prototypes and bring them out to see if they’d even fly, and then we’d go back inside and start building our autonomy systems on top of them,” Kabir recalls.

While working on Corvus, Kabir was also one of the founders of the MIT Driverless program that built North America’s first competition-winning driverless race cars.

“It’s all part of the same autonomy story,” Kabir says. “I’ve always been very interested in building robots that operate without a human touch.”

From the beginning, the founders believed inventory management was a promising application for their drone technology. Eventually they rented a facility in Boston and simulated a warehouse with huge racks and boxes to refine their technology.

By the time Kabir graduated in 2021, Corvus had completed several pilots with customers. One customer was MSI, a building materials company that distributes flooring, countertops, tile, and more. Soon MSI was using Corvus every day across multiple facilities in its nationwide network.

The Corvus One drone, which the company calls the world’s first fully autonomous warehouse inventory management drone, is equipped with 14 cameras and an AI system that allows it to safely navigate to scan barcodes and record the location of each product. In most instances, the collected data are shared with the customer’s warehouse management system (typically the warehouse’s system of record), and any discrepancies identified are automatically categorized with a suggested resolution. Additionally, the Corvus interface allows customers to select no-fly zones, choose flight behaviors, and set automated flight schedules.
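
As a rough illustration of that reconciliation step, the sketch below compares drone-scanned pallet locations against a warehouse management system’s records and flags discrepancies with a suggested fix. All field names, types, and rules here are hypothetical, not Corvus’s actual schema or logic.

```python
# Hypothetical sketch: reconcile drone scans against the warehouse
# management system (WMS) record and suggest a resolution per discrepancy.
from dataclasses import dataclass

@dataclass
class Discrepancy:
    pallet_id: str
    expected_location: str | None
    scanned_location: str | None
    suggestion: str

def reconcile(wms: dict[str, str], scans: dict[str, str]) -> list[Discrepancy]:
    """wms and scans both map pallet barcode -> rack location."""
    issues = []
    for pallet, expected in wms.items():
        found = scans.get(pallet)
        if found is None:
            issues.append(Discrepancy(pallet, expected, None,
                                      "not scanned: verify or mark missing"))
        elif found != expected:
            issues.append(Discrepancy(pallet, expected, found,
                                      f"update WMS location to {found}"))
    for pallet, found in scans.items():
        if pallet not in wms:
            issues.append(Discrepancy(pallet, None, found,
                                      "unknown pallet: add to WMS"))
    return issues
```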

“When we started, we didn’t know if lifelong vision-based autonomy in warehouses was even possible,” Kabir says. “It turns out that it’s really hard to make infrastructure-free autonomy work with traditional computer vision techniques. We were the first in the world to ship a learning-based autonomy stack for an indoor aerial robot using machine learning and neural network based approaches. We were using AI before it was cool.”

To set up, Corvus’ team simply installs one or more docks, which act as charging and data-transfer stations, on the ends of product racks and completes a rough mapping step using tape measures. The drones then fill in the fine details on their own. Kabir says it takes about a week to be fully operational in a 1-million-square-foot facility.

“We don’t have to set up any stickers, reflectors, or beacons,” Kabir says. “Our setup is really fast compared to other options in the industry. We call it infrastructure-free autonomy, and it’s a big differentiator for us.”

From forklifts to drones

A lot of inventory management today is done by a person using a forklift or a scissor lift to scan barcodes and make notes on a clipboard. The result is infrequent and inaccurate inventory checks that sometimes require warehouses to shut down operations.

“They’re going up and down on these lifts, and there are all of these manual steps involved,” Kabir says. “You have to manually collect data, then there’s a data entry step, because none of these systems are connected. What we’ve found is many warehouses are driven by bad data, and there’s no way to fix that unless you fix the data you’re collecting in the first place.”

Corvus can bring inventory management systems and processes together. Its drones also operate safely around people and forklifts every day.

“That was a core goal for us,” Kabir says. “When we go into a warehouse, it’s a privilege the customer has given us. We don’t want to disrupt their operations, and we build a system around that idea. You can fly it whenever you need to, and the system will work around your schedule.”

Kabir believes Corvus already offers the most comprehensive inventory management solution available. Moving forward, the company plans to offer more end-to-end solutions that manage inventory from the moment it arrives at a warehouse.

“Drones actually only solve a part of the inventory problem,” Kabir says. “Drones fly around to track rack pallet inventory, but a lot of stuff gets lost even before it makes it to the racks. Products arrive, they get taken off a truck, and then they are stacked on the floor, and before they are moved to the racks, items have been lost. They’re mislabeled, they’re misplaced, and they’re just gone. Our vision is to solve that.”
