Understanding the visual knowledge of language models

You’ve likely heard that a picture is worth a thousand words, but can a large language model (LLM) get the picture if it’s never seen images before?

As it turns out, language models trained purely on text have a solid understanding of the visual world. They can write image-rendering code to generate complex scenes with intriguing objects and compositions, and even when that knowledge isn’t put to good use on the first attempt, LLMs can refine their images. Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) observed this when prompting language models to self-correct their code for different images: the systems improved on their simple clipart drawings with each query.

The visual knowledge of these language models comes from how concepts like shapes and colors are described across the internet, whether in language or code. When users give a direction like “draw a parrot in the jungle,” they jog the LLM into considering what it has read in descriptions before. To assess how much visual knowledge LLMs have, the CSAIL team constructed a “vision checkup” for LLMs: using their “Visual Aptitude Dataset,” they tested the models’ abilities to draw, recognize, and self-correct these concepts. They then collected the final draft of each illustration and trained a computer vision system that identifies the content of real photos.

“We essentially train a vision system without directly using any visual data,” says Tamar Rott Shaham, co-lead author of the study and an MIT electrical engineering and computer science (EECS) postdoc at CSAIL. “Our team queried language models to write image-rendering codes to generate data for us and then trained the vision system to evaluate natural images. We were inspired by the question of how visual concepts are represented through other mediums, like text. To express their visual knowledge, LLMs can use code as a common ground between text and vision.”

To build this dataset, the researchers first queried the models to generate code for different shapes, objects, and scenes. Then, they compiled that code to render simple digital illustrations, like a row of bicycles, showing that LLMs understand spatial relations well enough to draw the two-wheelers in a horizontal row. As another example, the model generated a car-shaped cake, combining two random concepts. The language model also produced a glowing light bulb, indicating its ability to create visual effects. 
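
As a rough sketch of what this prompting-and-rendering loop might look like in practice: the query_llm helper below is a hypothetical stand-in for whichever text-only model is being tested (here it just returns a canned matplotlib snippet so the example runs end to end), and matplotlib itself is only an illustrative choice; the paper does not prescribe a particular rendering library.

```python
# Minimal sketch of the "draw" step, assuming a hypothetical query_llm() helper.
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

def query_llm(prompt: str) -> str:
    # Placeholder: in the real setup this would call the language model.
    # A canned snippet is returned here so the sketch runs end to end.
    return (
        "fig, ax = plt.subplots()\n"
        "for i in range(3):\n"
        "    ax.add_patch(plt.Circle((3 * i + 1.0, 1.0), 1.0, fill=False))\n"
        "    ax.add_patch(plt.Circle((3 * i + 2.5, 1.0), 1.0, fill=False))\n"
        "ax.set_xlim(0, 10); ax.set_ylim(0, 3); ax.set_aspect('equal')\n"
        "fig.savefig('bicycles.png')\n"
    )

prompt = (
    "Write Python matplotlib code that draws a row of three bicycles as "
    "simple 2D shapes and saves the figure to 'bicycles.png'. Return only code."
)
code = query_llm(prompt)
exec(code, {"plt": plt})  # run the model-written rendering code
```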

“Our work shows that when you query an LLM (without multimodal pre-training) to create an image, it knows much more than it seems,” says co-lead author, EECS PhD student, and CSAIL member Pratyusha Sharma. “Let’s say you asked it to draw a chair. The model knows other things about this piece of furniture that it may not have immediately rendered, so users can query the model to improve the visual it produces with each iteration. Surprisingly, the model can iteratively enrich the drawing by improving the rendering code to a significant extent.”
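
That iterative self-correction can be pictured as a simple feedback loop. The sketch below reuses the hypothetical query_llm helper from the previous example; the prompt wording and the number of rounds are invented for illustration and are not the authors’ protocol.

```python
def refine_drawing(code: str, concept: str, rounds: int = 3) -> str:
    # Repeatedly show the model its own rendering code and ask for improvements.
    for _ in range(rounds):
        prompt = (
            f"Here is Python code meant to draw {concept}:\n\n{code}\n\n"
            "Improve the drawing (add missing details, fix shapes and layout) "
            "and return only the revised code."
        )
        code = query_llm(prompt)  # hypothetical LLM call, as in the sketch above
    return code
```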

The researchers gathered these illustrations and used them to train a computer vision system that can recognize objects within real photos (despite never having seen one before). With this synthetic, text-generated data as its only reference point, the system outperforms vision models trained on other procedurally generated image datasets, including ones created with the help of authentic photos.
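
One plausible way to realize that training step, sketched with a stock torchvision classifier rather than the paper’s actual recipe, assuming the rendered drawings have been saved under a rendered/<concept>/ folder structure:

```python
# Train a standard image classifier on nothing but the LLM-rendered drawings.
import torch
from torch import nn
from torchvision import datasets, models, transforms

tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
synthetic = datasets.ImageFolder("rendered", transform=tfm)   # no real photos
loader = torch.utils.data.DataLoader(synthetic, batch_size=32, shuffle=True)

model = models.resnet18(num_classes=len(synthetic.classes))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    for images, labels in loader:
        optimizer.zero_grad()
        loss_fn(model(images), labels).backward()
        optimizer.step()
# The resulting model can then be evaluated on real photographs of the same concepts.
```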

The CSAIL team believes that combining the hidden visual knowledge of LLMs with the artistic capabilities of other AI tools like diffusion models could also be beneficial. Systems like Midjourney sometimes lack the know-how to consistently tweak the finer details in an image, making it difficult for them to handle requests like reducing how many cars are pictured, or placing an object behind another. If an LLM sketched out the requested change for the diffusion model beforehand, the resulting edit could be more satisfactory.

The irony, as Rott Shaham and Sharma acknowledge, is that LLMs sometimes fail to recognize the same concepts that they can draw. This became clear when the models incorrectly identified human re-creations of images within the dataset. Such diverse representations of the visual world likely triggered the language models’ misconceptions.

While the models struggled to perceive these abstract depictions, they demonstrated the creativity to draw the same concepts differently each time. When the researchers queried LLMs to draw concepts like strawberries and arcades multiple times, they produced pictures from diverse angles with varying shapes and colors, hinting that the models might have actual mental imagery of visual concepts (rather than reciting examples they saw before).

The CSAIL team believes this procedure could be a baseline for evaluating how well a generative AI model can train a computer vision system. Additionally, the researchers look to expand the tasks they challenge language models on. As for their recent study, the MIT group notes that they don’t have access to the training set of the LLMs they used, making it challenging to further investigate the origin of their visual knowledge. In the future, they intend to explore training an even better vision model by letting the LLM work directly with it.

Sharma and Rott Shaham are joined on the paper by former CSAIL affiliate Stephanie Fu ’22, MNG ’23 and EECS PhD students Manel Baradad, Adrián Rodríguez-Muñoz ’22, and Shivam Duggal, who are all CSAIL affiliates; as well as MIT Associate Professor Phillip Isola and Professor Antonio Torralba. Their work was supported, in part, by a grant from the MIT-IBM Watson AI Lab, a LaCaixa Fellowship, the Zuckerman STEM Leadership Program, and the Viterbi Fellowship. They present their paper this week at the IEEE/CVF Computer Vision and Pattern Recognition Conference.

Embracer Shuts Down Developer Of The New Alone In The Dark

Embracer has shut down Pieces Interactive, the developer of the Alone in the Dark reimagining that arrived earlier this year. The studio is the latest to be closed by the troubled company over the last 10 months. 

Pieces Interactive’s website now only features a graphic listing “2007-2024 Thanks For Playing With Us” and a short blurb detailing its history: 

Pieces Interactive released over ten titles on PC, Console and Mobile since 2007, both our own concepts such as Puzzlegeddon, Fret Nice, Leviathan Warships, Robo Surf and Kill to Collect, as well as work for hire titles such as Magicka 2 and several DLCs for Magicka. Our client list includes Paradox Interactive, Koei Tecmo, Arrowhead Game Studios, Koch Media and RaceRoom Entertainment.

In 2017, Pieces Interactive was acquired by Embracer Group after working on the Titan Quest expansion Titan Quest: Ragnarök and the third expansion, Titan Quest: Atlantis.

Our last release was the reimagining of Alone in the Dark.

Alone in the Dark launched on March 20 to mixed reviews, currently sitting at a 63 on Metacritic. The game stars actors Jodie Comer and David Harbour, who lent their voices and likenesses, in a third-person reimagining of the 1992 survival horror classic. 

Following the collapse of a $2 billion deal with the Saudi Arabia-backed Savvy Games Group in 2023, Embracer has undergone a massive restructuring to stay afloat, resulting in a significant reduction of headcount across its many studios. That has included laying off over 1,000 employees (including those at Eidos-Montreal), selling off its studios (Saber Interactive, Gearbox Entertainment), and outright closing others (Free Radical Design, Volition Games). Embracer has most recently split itself into three separate companies, with its studios divided among them. 

Monster Hunter Wilds Preview – The Chase Is On – Game Informer

Monster Hunter is known for its protagonists overcoming seemingly insurmountable odds in their pursuit of taking down the massive monsters that populate the region, but with Monster Hunter Wilds, Capcom may be taking it to a whole new level. While at Summer Game Fest Play Days, I took in an extended gameplay demo involving an Alpha Monster hunt. I left the demo excited to dive into this seemingly improved take on what the very popular Monster Hunter World delivered in 2018.

Monster Hunter World served as Capcom’s big push into making a more mainstream and globally appealing entry in the franchise. The team worked hard to bring the franchise up to global triple-A standards and included several quality-of-life improvements, as well as a simultaneous ship date across Japanese and Western markets and additional language localizations. The result was a smash success, with Monster Hunter World currently sitting atop the franchise’s sales charts. Capcom and the Monster Hunter development team hope to go even bigger with Wilds.

“For Monster Hunter Wilds, it’s pretty much a similar approach to what we have accomplished with World in that we want to use what were then the most high-spec machines available to create the world of Monster Hunter in unprecedented detail and depth,” series producer Ryozo Tsujimoto says. “That was true for Monster Hunter World, and for Monster Hunter Wilds, it’s the same approach, but now, hardware has advanced so much in the intervening years that we’re just able to go even further than we did in that direction.”

From the start of the demo, the graphical enhancements are obvious. The lush environments, improved animations, and better faces are immediately evident as the on-screen character walks through the base camp populated with humans and Palico, but once the character hops on his mount (which are larger this time around), Wilds really starts cooking with gas.

A new tool that allows players to pick up items from a distance while riding a mount is just the start, as we have our eyes on hunting an Alpha Doshaguma. Venturing out into the eponymous wilds from base camp requires no load screen, and thanks to the day/night cycle, players must be intentional about when they start their hunts since certain monsters only appear at specific times. Because of this, the character must wait for prime Doshaguma hunting hours, so he asks his Palico friends to set up a mobile camp in the field. These camps are extremely handy, but they can be destroyed by monsters, so you must be strategic about your placement.

Before the hunt, the character cooks a meal via an extended cooking sequence that is, in the words of the demo’s commentator, quite sensual. The detailed food looks terrific, and the character’s facial expressions and sounds seem to reflect that it tastes as good as it looks. After feasting, the hunter is off to find the Doshaguma. Using the Ghillie Mantle, the hunter sneaks right past the standard monsters and right up to the Alpha. He lands a heavy blow, and all chaos breaks loose. The giant bear-like monster alerts all its buddies, and they swarm the hunter in the enclosed space.

The only way the hunter is going to survive this is by trying to thin the herd, but that’s not going to happen in this tight space. The hunter calls upon his mount and makes a break for it. The four congregated monsters give chase. While they’re hot on your tail, you have ways to slow them down or even take them out of the fight. You can lead them through Bramble or even other monsters’ territory. In this case, the demo player leads them right through a pack of smaller monsters who don’t hesitate to jump up on the four Doshaguma. This slows them down, but two stay on the hunter’s trail.

After a few more maneuvers, the hunter loses all of the Doshaguma except the Alpha target. Now it’s time to lead it through even more traps in hopes of slowing it down and landing some damage. First, the hunter leads it through Balahara territory, which results in the Balahara opening up a quicksand trap that sucks the Doshaguma into it. The bear-like monster escapes, so the hunter enters a nearby thunderstorm. There, the area’s Apex monster appears and attacks the Doshaguma but does not finish it off. The hunter then rides into a nearby cave, where boulders crash onto the monster from the ceiling.

It’s obvious that having knowledge of the map will pay dividends as you fight a powerful monster like this Alpha Doshaguma. “With the maps being so much bigger now – two times or more as large as Monster Hunter World – being completely seamless, my approach is to give the player as wide a toolset as possible and place these things so that they can make a choice on what kind of strategy they want to take,” director Yuya Tokuda says. “This is a game you can play for dozens or hundreds of hours, and even though it’s very big, you will see the same field so many times, and I don’t want it to feel static, but we always have our, ‘I know that if I go here, this is what happens.’ That’s option A, and there’s also option B: The way the field actually changes with the daylight cycle and extreme weather system means that you are always able to, ‘Okay, well now that it’s this time of day, if I go here, I know that [a particular monster] will be available for me to track.’ Or for the next time when it’s a different time of day or different weather option, we’ve provided the information to let you decide how to take it differently this time. You can go through your old favorite strategies or just decide to change it up on a whim. I feel my job as the director or the designer is to give you the tools you need to hunt. How you use them is really up to you.”

To prevent players from feeling overwhelmed by the options available to them, the Palico, who can talk this time around, will call things out to ensure the player is aware of what is at their fingertips. “They will call out these things just to make you aware of them, but it doesn’t force you to do them,” Tokuda says. “So, if there’s a storm coming in, then one of the characters might tell you to watch out for that. […] I just want to ensure that the players don’t miss out on all the exciting new features we added because they didn’t know what to look out for.”

After the chase through the cave, the Alpha Doshaguma begins limping, but the hunter also needs to rest, so he lets the monster escape. The hunter returns to the mobile camp, changes to a long sword, and sends up a signal flare to call co-op partners. Together, they set up traps in the area where the monster roams. One hunter lays a pitfall trap while the others make preparations of their own. Once ready, our main hunter sneaks up on the Doshaguma and sets up explosive barrels near where it’s sleeping. The hunter blasts the barrels, and they explode, serving as the rudest alarm clock. The monster wakes up, understandably irritated, and chases the hunters once more. The main hunter leads it into the pitfall trap, where it gets wedged in the ground, and the team unloads on the already-injured beast. The Doshaguma climbs out, but the main hunter jumps on its back, grabs its mane, and begins stabbing it from the top. Thanks to this maneuver, the pre-existing injuries from the thrilling chase, and the teamwork from the co-op partners, the Alpha Doshaguma finally falls, and the hunters use the carving knife to cut the carcass. The entire sequence was white-knuckle, breathless, and edge-of-my-seat thrilling, and I can’t wait to undertake similar chases once I have the game in my possession.

When Monster Hunter World arrived in 2018, I was excited to finally give the series a shot. I liked what I played, but I didn’t make it very far into the campaign. After leaving my Monster Hunter Wilds demo and speaking with the creators, I am ready to download World and its Iceborne expansion in anticipation of Monster Hunter Wilds’ 2025 release.

NVIDIA presents latest advancements in visual AI

NVIDIA researchers are presenting new visual generative AI models and techniques at the Computer Vision and Pattern Recognition (CVPR) conference this week in Seattle. The advancements span areas like custom image generation, 3D scene editing, visual language understanding, and autonomous vehicle perception. “Artificial intelligence, and generative…

The A Quiet Place Game Breaks Its Silence With First Reveal Trailer

Fans of the A Quiet Place films may remember that a video game based on the franchise was announced almost three years ago. After years of worrying yet oddly appropriate silence, publisher Saber Interactive has revealed the title and first trailer for A Quiet Place: The Road Ahead.

The first-person adventure is an original story starring a young woman who must trek across a new area of the films’ post-apocalyptic landscape for unknown reasons (though the game’s description mentions she’s dealing with some kind of family drama). As in the movies, staying quiet to avoid the wrath of the sound-sensitive alien invaders is essential, and it requires tools such as a customized microphone that measures the sound levels in an area.

Stealth is key, and players must carefully observe environments to create their own paths while avoiding noise-making hazards such as stepping on broken glass, as seen in the video. The trailer didn’t show any combat, but if it’s anything like the films, guns are likely to be present but perhaps deemphasized (they’re quite loud, you see). 

Interestingly, A Quiet Place: The Road Ahead seems to have switched developers. When the game was first announced, Illogika was at the helm. Now, the game credits Stormind Games, the team behind the Remothered horror series and, most recently, Batora: Lost Haven. 

A Quiet Place: The Road Ahead was originally slated to launch in 2022, the same year A Quiet Place Part II hit theaters. With this trailer arriving ahead of the June 28 premiere of A Quiet Place: Day One, the game is again riding the momentum of a new film release. A Quiet Place: The Road Ahead is coming sometime this year for PlayStation 5, Xbox Series X/S, and PC. 

A smarter way to streamline drug discovery

The use of AI to streamline drug discovery is exploding. Researchers are deploying machine-learning models to help them identify molecules, among billions of options, that might have the properties they are seeking to develop new medicines.

But there are so many variables to consider — from the price of materials to the risk of something going wrong — that even when scientists use AI, weighing the costs of synthesizing the best candidates is no easy task.

The myriad challenges involved in identifying the best and most cost-efficient molecules to test are one reason new medicines take so long to develop, as well as a key driver of high prescription drug prices.

To help scientists make cost-aware choices, MIT researchers developed an algorithmic framework that automatically identifies optimal molecular candidates, minimizing synthetic cost while maximizing the likelihood that the candidates have the desired properties. The algorithm also identifies the materials and experimental steps needed to synthesize these molecules.

Their quantitative framework, known as Synthesis Planning and Rewards-based Route Optimization Workflow (SPARROW), considers the costs of synthesizing a batch of molecules at once, since multiple candidates can often be derived from some of the same chemical compounds.

Moreover, this unified approach captures key information on molecular design, property prediction, and synthesis planning from online repositories and widely used AI tools.

Beyond helping pharmaceutical companies discover new drugs more efficiently, SPARROW could be used in applications like the invention of new agrichemicals or the discovery of specialized materials for organic electronics.

“The selection of compounds is very much an art at the moment — and at times it is a very successful art. But because we have all these other models and predictive tools that give us information on how molecules might perform and how they might be synthesized, we can and should be using that information to guide the decisions we make,” says Connor Coley, the Class of 1957 Career Development Assistant Professor in the MIT departments of Chemical Engineering and Electrical Engineering and Computer Science, and senior author of a paper on SPARROW.

Coley is joined on the paper by lead author Jenna Fromer SM ’24. The research appears today in Nature Computational Science.

Complex cost considerations

In a sense, whether a scientist should synthesize and test a certain molecule boils down to a question of the synthetic cost versus the value of the experiment. However, both cost and value are tough to determine on their own.

For instance, an experiment might require expensive materials or it could have a high risk of failure. On the value side, one might consider how useful it would be to know the properties of this molecule or whether those predictions carry a high level of uncertainty.

At the same time, pharmaceutical companies increasingly use batch synthesis to improve efficiency. Instead of testing molecules one at a time, they use combinations of chemical building blocks to test multiple candidates at once. However, this means the chemical reactions must all require the same experimental conditions. This makes estimating cost and value even more challenging.

SPARROW tackles this challenge by considering the shared intermediary compounds involved in synthesizing molecules and incorporating that information into its cost-versus-value function.

“When you think about this optimization game of designing a batch of molecules, the cost of adding on a new structure depends on the molecules you have already chosen,” Coley says.

The framework also considers things like the costs of starting materials, the number of reactions that are involved in each synthetic route, and the likelihood those reactions will be successful on the first try.
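
To see why the marginal cost of adding a candidate depends on what is already in the batch, here is a toy greedy selection over made-up candidates, values, and intermediate costs; it is not SPARROW’s actual optimization, only an illustration of how shared intermediates get paid for once.

```python
# Toy cost-versus-value selection: routes that share intermediates
# only pay for those intermediates once.
routes = {
    # candidate: (predicted value, {intermediate or starting material: cost})
    "cand_A": (0.9, {"int_1": 30, "int_2": 20}),
    "cand_B": (0.7, {"int_1": 30, "int_3": 15}),  # shares int_1 with cand_A
    "cand_C": (0.8, {"int_4": 60}),
}

def marginal_cost(candidate, already_purchased):
    _, needs = routes[candidate]
    return sum(cost for name, cost in needs.items() if name not in already_purchased)

selected, purchased, budget = [], set(), 80
while True:
    remaining = [c for c in routes if c not in selected]
    if not remaining:
        break
    # Greedily pick the candidate with the best value per marginal unit of cost.
    best = max(remaining, key=lambda c: routes[c][0] / max(marginal_cost(c, purchased), 1e-9))
    if marginal_cost(best, purchased) > budget:
        break
    budget -= marginal_cost(best, purchased)
    purchased |= set(routes[best][1])
    selected.append(best)

print(selected)  # ['cand_A', 'cand_B']: cand_B becomes cheap once int_1 is already purchased
```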

To utilize SPARROW, a scientist provides a set of molecular compounds they are thinking of testing and a definition of the properties they are hoping to find.

From there, SPARROW collects information on the molecules and their synthetic pathways and then weighs the value of each one against the cost of synthesizing a batch of candidates. It automatically selects the best subset of candidates that meet the user’s criteria and finds the most cost-effective synthetic routes for those compounds.
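
From the user’s side, that workflow might look roughly like the sketch below. The select_batch function, the BatchPlan object, and the property format are all invented stand-ins for illustration; they are not SPARROW’s real interface.

```python
from dataclasses import dataclass, field

@dataclass
class BatchPlan:
    selected_compounds: list
    routes: dict = field(default_factory=dict)

def select_batch(candidates, property_targets, budget):
    # Stub standing in for the framework: a real implementation would score
    # candidates against the property targets, weigh that value against the
    # cost of shared synthetic routes, and return only the subset worth making.
    return BatchPlan(selected_compounds=list(candidates))

candidates = ["CCOC(=O)c1ccccc1", "CC(=O)Nc1ccc(O)cc1"]               # SMILES strings under consideration
property_targets = {"predicted_potency": ">= 0.8", "logP": "<= 3.5"}  # desired property criteria

plan = select_batch(candidates, property_targets, budget=10_000)      # budget in arbitrary cost units
print(plan.selected_compounds)   # the subset judged worth synthesizing
print(plan.routes)               # shared, cost-effective routes for that subset
```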

“It does all this optimization in one step, so it can really capture all of these competing objectives simultaneously,” Fromer says.

A versatile framework

SPARROW is unique because it can incorporate molecular structures that have been hand-designed by humans, those that exist in virtual catalogs, or never-before-seen molecules that have been invented by generative AI models.

“We have all these different sources of ideas. Part of the appeal of SPARROW is that you can take all these ideas and put them on a level playing field,” Coley adds.

The researchers evaluated SPARROW by applying it in three case studies. The case studies, based on real-world problems faced by chemists, were designed to test SPARROW’s ability to find cost-efficient synthesis plans while working with a wide range of input molecules.

They found that SPARROW effectively captured the marginal costs of batch synthesis and identified common experimental steps and intermediate chemicals. In addition, it could scale up to handle hundreds of potential molecular candidates.

“In the machine-learning-for-chemistry community, there are so many models that work well for retrosynthesis or molecular property prediction, for example, but how do we actually use them? Our framework aims to bring out the value of this prior work. By creating SPARROW, hopefully we can guide other researchers to think about compound downselection using their own cost and utility functions,” Fromer says.

In the future, the researchers want to incorporate additional complexity into SPARROW. For instance, they’d like to enable the algorithm to consider that the value of testing one compound may not always be constant. They also want to include more elements of parallel chemistry in its cost-versus-value function.

“The work by Fromer and Coley better aligns algorithmic decision making to the practical realities of chemical synthesis. When existing computational design algorithms are used, the work of determining how to best synthesize the set of designs is left to the medicinal chemist, resulting in less optimal choices and extra work for the medicinal chemist,” says Patrick Riley, senior vice president of artificial intelligence at Relay Therapeutics, who was not involved with this research. “This paper shows a principled path to include consideration of joint synthesis, which I expect to result in higher quality and more accepted algorithmic designs.”

“Identifying which compounds to synthesize in a way that carefully balances time, cost, and the potential for making progress toward goals while providing useful new information is one of the most challenging tasks for drug discovery teams. The SPARROW approach from Fromer and Coley does this in an effective and automated way, providing a useful tool for human medicinal chemistry teams and taking important steps toward fully autonomous approaches to drug discovery,” adds John Chodera, a computational chemist at Memorial Sloan Kettering Cancer Center, who was not involved with this work.

This research was supported, in part, by the DARPA Accelerated Molecular Discovery Program, the Office of Naval Research, and the National Science Foundation.