Teaching AI to communicate sounds like humans do
Whether you’re describing the sound of your faulty car engine or meowing like your neighbor’s cat, imitating sounds with your voice can be a helpful way to relay a concept when words don’t do the trick.
Vocal imitation is the sonic equivalent of doodling a quick picture to communicate something you saw — except that instead of using a pencil to illustrate an image, you use your vocal tract to express a sound. This might seem difficult, but it’s something we all do intuitively: To experience it for yourself, try using your voice to mirror the sound of an ambulance siren, a crow, or a bell being struck.
Inspired by the cognitive science of how we communicate, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) researchers have developed an AI system that can produce human-like vocal imitations with no training, and without ever having “heard” a human vocal impression before.
To achieve this, the researchers engineered their system to produce and interpret sounds much like we do. They started by building a model of the human vocal tract that simulates how vibrations from the voice box are shaped by the throat, tongue, and lips. Then, they used a cognitively inspired AI algorithm to control this vocal tract model and make it produce imitations, taking into consideration the context-specific ways that humans choose to communicate sound.
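The article doesn't include the team's code, but the idea of voice-box vibrations being shaped by the throat, tongue, and lips maps onto a classic source-filter view of speech. The sketch below illustrates that view only; the function names, formant values, and filter shapes are illustrative assumptions, not the CSAIL vocal tract model.

```python
import numpy as np

# Minimal source-filter sketch: a "voice box" excitation shaped by broad
# resonances standing in for throat, tongue, and lip configuration.
# All names and parameter values here are illustrative assumptions.

SAMPLE_RATE = 16_000

def glottal_source(f0_hz: float, duration_s: float) -> np.ndarray:
    """Crude voice-box excitation: a sawtooth-like pulse train at pitch f0."""
    n = int(SAMPLE_RATE * duration_s)
    t = np.arange(n) / SAMPLE_RATE
    return 2.0 * ((t * f0_hz) % 1.0) - 1.0

def formant_filter(excitation: np.ndarray, formants_hz: list[float],
                   bandwidth_hz: float = 120.0) -> np.ndarray:
    """Shape the excitation with resonances standing in for one
    particular throat, tongue, and lip configuration."""
    spectrum = np.fft.rfft(excitation)
    freqs = np.fft.rfftfreq(len(excitation), d=1.0 / SAMPLE_RATE)
    gain = np.zeros_like(freqs)
    for f in formants_hz:
        gain += np.exp(-0.5 * ((freqs - f) / bandwidth_hz) ** 2)
    return np.fft.irfft(spectrum * gain, n=len(excitation))

# An "ah"-like vowel: a 120 Hz pitch shaped by two low resonances.
imitation = formant_filter(glottal_source(120.0, 0.5), [700.0, 1200.0])
```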
The model can effectively take many sounds from the world and generate a human-like imitation of them — including noises like leaves rustling, a snake’s hiss, and an approaching ambulance siren. Their model can also be run in reverse to guess real-world sounds from human vocal imitations, similar to how some computer vision systems can retrieve high-quality images based on sketches. For instance, the model can correctly distinguish the sound of a human imitating a cat’s “meow” versus its “hiss.”
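As a rough illustration of that reverse direction, one can imagine scoring each candidate real-world sound by how closely the model's own imitation of it matches the human recording. The sketch below assumes hypothetical `imitate` and `features` helpers and a simple distance-based score; it is not the paper's actual inference procedure.

```python
import numpy as np

# Hedged sketch of "running the model in reverse": given a recorded human
# imitation, rank candidate real-world sounds by how closely the model's
# own imitation of each candidate matches what the human produced.
# `imitate` and `features` are hypothetical placeholders for the synthesis
# and acoustic-analysis steps; the scoring rule is an assumption.

def infer_source(human_imitation: np.ndarray,
                 candidates: dict[str, np.ndarray],
                 imitate, features) -> str:
    """Return the name of the candidate the imitation most likely refers to."""
    scores = {}
    for name, sound in candidates.items():
        model_imitation = imitate(sound)  # what the model would produce for this sound
        scores[name] = -np.linalg.norm(
            features(model_imitation) - features(human_imitation)
        )
    return max(scores, key=scores.get)  # e.g. distinguishes "meow" from "hiss"
```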
In the future, this model could potentially lead to more intuitive “imitation-based” interfaces for sound designers, more human-like AI characters in virtual reality, and even methods to help students learn new languages.
The co-lead authors — MIT CSAIL PhD students Kartik Chandra SM ’23 and Karima Ma, and undergraduate researcher Matthew Caren — note that computer graphics researchers have long recognized that realism is rarely the ultimate goal of visual expression. For example, an abstract painting or a child’s crayon doodle can be just as expressive as a photograph.
“Over the past few decades, advances in sketching algorithms have led to new tools for artists, advances in AI and computer vision, and even a deeper understanding of human cognition,” notes Chandra. “In the same way that a sketch is an abstract, non-photorealistic representation of an image, our method captures the abstract, non-phonorealistic ways humans express the sounds they hear. This teaches us about the process of auditory abstraction.”
The art of imitation, in three parts
The team developed three increasingly nuanced versions of the model to compare to human vocal imitations. First, they created a baseline model that simply aimed to generate imitations that were as similar to real-world sounds as possible — but this model didn’t match human behavior very well.
The researchers then designed a second “communicative” model. According to Caren, this model considers what’s distinctive about a sound to a listener. For instance, you’d likely imitate the sound of a motorboat by mimicking the rumble of its engine, since that’s its most distinctive auditory feature, even if it’s not the loudest aspect of the sound (compared to, say, the water splashing). This second model created imitations that were better than the baseline, but the team wanted to improve it even more.
To take their method a step further, the researchers added a final layer of reasoning to the model. “Vocal imitations can sound different based on the amount of effort you put into them. It costs time and energy to produce sounds that are perfectly accurate,” says Chandra. The researchers’ full model accounts for this by trying to avoid utterances that are very rapid, loud, or high- or low-pitched, which people are less likely to use in a conversation. The result: more human-like imitations that closely match many of the decisions that humans make when imitating the same sounds.
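One way to read those three stages is as successively richer scoring functions over candidate imitations. The sketch below is a loose paraphrase under assumed feature vectors, a simple distance measure, and hand-picked weights; it is not the authors' actual objective.

```python
import numpy as np

# Illustrative scoring functions for the three model variants described
# above. The distance measure, the distractor set, and the weights 0.5
# and 0.1 are assumptions for exposition, not the paper's formulation.

def baseline_score(imitation, target):
    """Model 1: simply be as acoustically close to the target as possible."""
    return -np.linalg.norm(imitation - target)

def communicative_score(imitation, target, distractors):
    """Model 2: also reward staying far from plausible alternative sounds,
    so the imitation emphasizes what is distinctive about the target."""
    closeness = -np.linalg.norm(imitation - target)
    distinctiveness = min(np.linalg.norm(imitation - d) for d in distractors)
    return closeness + 0.5 * distinctiveness

def full_score(imitation, target, distractors, effort):
    """Model 3: additionally penalize effortful productions (very rapid,
    loud, or extreme-pitched utterances a speaker tends to avoid)."""
    return communicative_score(imitation, target, distractors) - 0.1 * effort
```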
After building this model, the team conducted a behavioral experiment to see whether the AI- or human-generated vocal imitations were perceived as better by human judges. Notably, participants in the experiment favored the AI model 25 percent of the time in general, and as much as 75 percent for an imitation of a motorboat and 50 percent for an imitation of a gunshot.
Toward more expressive sound technology
Passionate about technology for music and art, Caren envisions that this model could help artists better communicate sounds to computational systems and assist filmmakers and other content creators with generating AI sounds that are more nuanced to a specific context. It could also enable a musician to rapidly search a sound database by imitating a noise that is difficult to describe in, say, a text prompt.
In the meantime, Caren, Chandra, and Ma are looking at the implications of their model in other domains, including the development of language, how infants learn to talk, and even imitation behaviors in birds like parrots and songbirds.
The team still has work to do with the current iteration of their model: It struggles with some consonants, like “z,” which led to inaccurate impressions of some sounds, like bees buzzing. They also can’t yet replicate how humans imitate speech, music, or sounds that are imitated differently across different languages, like a heartbeat.
Stanford University linguistics professor Robert Hawkins says that language is full of onomatopoeia and words that mimic but don’t fully replicate the things they describe, like the “meow” sound that very inexactly approximates the sound that cats make. “The processes that get us from the sound of a real cat to a word like ‘meow’ reveal a lot about the intricate interplay between physiology, social reasoning, and communication in the evolution of language,” says Hawkins, who wasn’t involved in the CSAIL research. “This model presents an exciting step toward formalizing and testing theories of those processes, demonstrating that both physical constraints from the human vocal tract and social pressures from communication are needed to explain the distribution of vocal imitations.”
Caren, Chandra, and Ma wrote the paper with two other CSAIL affiliates: Jonathan Ragan-Kelley, MIT Department of Electrical Engineering and Computer Science associate professor, and Joshua Tenenbaum, MIT Brain and Cognitive Sciences professor and Center for Brains, Minds, and Machines member. Their work was supported, in part, by the Hertz Foundation and the National Science Foundation. It was presented at SIGGRAPH Asia in early December.