Study: Transparency is often lacking in datasets used to train large language models

To train more powerful large language models, researchers use vast dataset collections that blend diverse data from thousands of web sources.

But as these datasets are combined and recombined into multiple collections, important information about their origins, and restrictions on how they can be used, is often lost or obscured in the shuffle.

Not only does this raise legal and ethical concerns, it can also damage a model’s performance. For instance, if a dataset is miscategorized, someone training a machine-learning model for a certain task may end up unwittingly using data that are not designed for that task.

In addition, data from unknown sources could contain biases that cause a model to make unfair predictions when deployed.

To improve data transparency, a team of multidisciplinary researchers from MIT and elsewhere launched a systematic audit of more than 1,800 text datasets on popular hosting sites. They found that more than 70 percent of these datasets omitted some licensing information, while about 50 percent had information that contained errors.

Building on these insights, they developed a user-friendly tool called the Data Provenance Explorer that automatically generates easy-to-read summaries of a dataset’s creators, sources, licenses, and allowable uses.

“These types of tools can help regulators and practitioners make informed decisions about AI deployment, and further the responsible development of AI,” says Alex “Sandy” Pentland, an MIT professor, leader of the Human Dynamics Group in the MIT Media Lab, and co-author of a new open-access paper about the project.

The Data Provenance Explorer could help AI practitioners build more effective models by enabling them to select training datasets that fit their model’s intended purpose. In the long run, this could improve the accuracy of AI models in real-world situations, such as those used to evaluate loan applications or respond to customer queries.

“One of the best ways to understand the capabilities and limitations of an AI model is understanding what data it was trained on. When you have misattribution and confusion about where data came from, you have a serious transparency issue,” says Robert Mahari, a graduate student in the MIT Human Dynamics Group, a JD candidate at Harvard Law School, and co-lead author on the paper.

Mahari and Pentland are joined on the paper by co-lead author Shayne Longpre, a graduate student in the Media Lab; Sara Hooker, who leads the research lab Cohere for AI; as well as others at MIT, the University of California at Irvine, the University of Lille in France, the University of Colorado at Boulder, Olin College, Carnegie Mellon University, Contextual AI, ML Commons, and Tidelift. The research is published today in Nature Machine Intelligence.

Focus on fine-tuning

Researchers often use a technique called fine-tuning to improve the capabilities of a large language model that will be deployed for a specific task, like question-answering. For fine-tuning, they carefully build curated datasets designed to boost a model’s performance for this one task.
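
To make this step concrete, here is a minimal fine-tuning sketch using the Hugging Face transformers and datasets libraries (not tooling from the paper); the base model google/flan-t5-small and the SQuAD slice are illustrative stand-ins for a curated, task-specific dataset.

```python
# Minimal fine-tuning sketch with Hugging Face `transformers` and `datasets`.
# The model and dataset names are illustrative placeholders; the curated
# dataset chosen here is exactly where provenance and licensing matter.
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "google/flan-t5-small"  # hypothetical base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# A small slice of a question-answering dataset, standing in for a
# carefully curated fine-tuning set.
dataset = load_dataset("squad", split="train[:1000]")

def preprocess(example):
    # Encode question plus context as the input, the answer as the target.
    inputs = tokenizer(example["question"] + " " + example["context"],
                       truncation=True, max_length=256)
    inputs["labels"] = tokenizer(example["answers"]["text"][0],
                                 truncation=True, max_length=32)["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, remove_columns=dataset.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="qa-finetune",
                                  num_train_epochs=1,
                                  per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```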

The MIT researchers focused on these fine-tuning datasets, which are often developed by researchers, academic organizations, or companies and licensed for specific uses.

When crowdsourced platforms aggregate such datasets into larger collections for practitioners to use for fine-tuning, some of that original license information is often left behind.
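
A toy illustration of how that happens: if an aggregator copies only the fields its collection schema expects, the license silently disappears. The field names and schema below are hypothetical, not any real platform's.

```python
# Toy illustration of license metadata vanishing during aggregation;
# the record fields and the schema are hypothetical.
original_record = {
    "name": "qa-dataset",
    "examples": ["Q: ... A: ..."],
    "license": "CC BY-NC 4.0",   # the restrictive license the creators chose
    "creator": "university-lab",
}

COLLECTION_SCHEMA = ("name", "examples")  # what the aggregator keeps

merged_record = {k: original_record[k] for k in COLLECTION_SCHEMA}
print(merged_record)  # license and creator are gone from the collection
```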

“These licenses ought to matter, and they should be enforceable,” Mahari says.

For instance, if the licensing terms of a dataset are wrong or missing, someone could spend a great deal of money and time developing a model they might be forced to take down later because some training data contained private information.

“People can end up training models where they don’t even understand the capabilities, concerns, or risks of those models, which ultimately stem from the data,” Longpre adds.

To begin this study, the researchers formally defined data provenance as the combination of a dataset’s sourcing, creating, and licensing heritage, as well as its characteristics. From there, they developed a structured auditing procedure to trace the data provenance of more than 1,800 text dataset collections from popular online repositories.
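
Under that definition, a dataset's provenance can be pictured as a structured record along these lines (a sketch with illustrative field names, not the paper's actual schema):

```python
# Sketch of data provenance as a structured record: sourcing, creation,
# and licensing heritage plus dataset characteristics. Field names are
# illustrative, not the paper's schema.
from dataclasses import dataclass, field

@dataclass
class DatasetProvenance:
    name: str
    sources: list[str]            # where the text originally came from
    creators: list[str]           # who assembled and released the dataset
    license: str = "unspecified"  # the license the creators actually attached
    allowed_uses: list[str] = field(default_factory=list)  # e.g. "research"
    languages: list[str] = field(default_factory=list)     # characteristics
```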

After finding that more than 70 percent of these datasets contained “unspecified” licenses that omitted much information, the researchers worked backward to fill in the blanks. Through their efforts, they reduced the number of datasets with “unspecified” licenses to around 30 percent.
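
That backward-tracing step can be thought of as a simple precedence rule: trust the license the original creators attached over whatever label the hosting repository assigned, falling back to "unspecified" only when neither exists. A hypothetical sketch:

```python
def resolve_license(creator_license: str | None,
                    repo_license: str | None) -> str:
    """Hypothetical precedence rule for the audit's back-tracing step:
    the creators' original license wins over the repository's label."""
    return creator_license or repo_license or "unspecified"

# e.g. a repository labeled a dataset "unspecified", but tracing back to
# the source recovers the creators' actual terms
print(resolve_license("CC BY-NC 4.0", None))  # -> CC BY-NC 4.0
```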

Their work also revealed that the correct licenses were often more restrictive than those assigned by the repositories.   

In addition, they found that nearly all dataset creators were concentrated in the global north, which could limit a model’s capabilities if it is trained for deployment in a different region. For instance, a Turkish language dataset created predominantly by people in the U.S. and China might not contain any culturally significant aspects, Mahari explains.

“We almost delude ourselves into thinking the datasets are more diverse than they actually are,” he says.

Interestingly, the researchers also saw a dramatic spike in restrictions placed on datasets created in 2023 and 2024, which might be driven by concerns from academics that their datasets could be used for unintended commercial purposes.

A user-friendly tool

To help others obtain this information without the need for a manual audit, the researchers built the Data Provenance Explorer. In addition to sorting and filtering datasets based on certain criteria, the tool allows users to download a data provenance card that provides a succinct, structured overview of dataset characteristics.
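
In spirit, though not in its actual interface, that filtering workflow looks something like the hypothetical snippet below.

```python
# Hypothetical filtering in the spirit of the Data Provenance Explorer's
# sort-and-filter workflow; this is not the tool's real API or data.
cards = [
    {"name": "qa-corpus", "license": "CC BY 4.0",
     "allowed_uses": ["commercial", "research"]},
    {"name": "news-instructions", "license": "CC BY-NC 4.0",
     "allowed_uses": ["research"]},
    {"name": "web-dialogue", "license": "unspecified", "allowed_uses": []},
]

def filter_by_use(cards, required_use):
    """Keep only datasets whose recorded license permits the given use."""
    return [c for c in cards if required_use in c["allowed_uses"]]

print([c["name"] for c in filter_by_use(cards, "commercial")])
# -> ['qa-corpus']
```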

“We are hoping this is a step, not just to understand the landscape, but also help people going forward to make more informed choices about what data they are training on,” Mahari says.

In the future, the researchers want to expand their analysis to investigate data provenance for multimodal data, including video and speech. They also want to study how terms of service on websites that serve as data sources are echoed in datasets.

As they expand their research, they are also reaching out to regulators to discuss their findings and the unique copyright implications of fine-tuning data.

“We need data provenance and transparency from the outset, when people are creating and releasing these datasets, to make it easier for others to derive these insights,” Longpre says.

“Many proposed policy interventions assume that we can correctly assign and identify licenses associated with data, and this work first shows that this is not the case, and then significantly improves the provenance information available,” says Stella Biderman, executive director of EleutherAI, who was not involved with this work. “In addition, section 3 contains relevant legal discussion. This is very valuable to machine learning practitioners outside companies large enough to have dedicated legal teams. Many people who want to build AI systems for public good are currently quietly struggling to figure out how to handle data licensing, because the internet is not designed in a way that makes data provenance easy to figure out.”

How MIT’s online resources provide a “highly motivating, even transformative experience”

Charalampos (Haris) Sampalis was well established in his career as a product manager at a telecommunications company in Greece. Yet, as someone who enjoys learning, he was on a mission to acquire more knowledge and develop new skills. That’s how he discovered MIT Open Learning resources.

With a bachelor’s degree in computer science from the University of Crete and a master’s in innovation management and entrepreneurship from Hellenic Open University — the only online/distance learning university in Greece — Sampalis had developed expertise in product management and digital strategy. In 2016, he turned to MITx within MIT Open Learning and found a wealth of knowledge and a community of learners who broadened his horizons.

“I’m a person who likes to be constantly absorbing educational information,” Sampalis says. “I strongly believe that education shouldn’t be under boundaries, or strictly belong to specific periods in our lives. I started with computer science, and it grew from there, following programs on a regular basis that may help me expand my horizons and strengthen my skills.”

Sampalis built his life and career in Athens, which makes MIT Open Learning’s digital resources all the more valuable. In 2016 and 2017, he completed MITx courses including 6.00.1x (Introduction to Computer Science and Programming Using Python), 11.155x (Design Thinking for Leading and Learning), and Becoming an Entrepreneur; MITx offers hundreds of high-quality massive open online courses adapted from the MIT classroom for learners worldwide. Sampalis has also enrolled in Management in Engineering: Strategy and Leadership and Management in Engineering: Accounting and Planning, which are part of the MITx MicroMasters Program in Principles of Manufacturing.

“I really appreciate the fact that an established institution like MIT was offering programs online,” he says. “I work full time and it’s not easy at this period of my life to leave everything behind and move to another continent for education — something I might have done at another time in my life. So, this is a model that allows me to access MIT resources and grow myself as part of a community that shares similar interests and seeks further collaborations, even locally where I live, something that makes the overall experience really unique.” 

In 2022, Sampalis applied for and completed the MIT Innovation Leadership Bootcamp. Part of MIT Open Learning, MIT Bootcamps are intensive and immersive educational programs for the global community of innovators, entrepreneurs, and changemakers. The Innovation Leadership Bootcamp was offered online, and Sampalis jumped at the opportunity. 

“I was in collaborative mode, having daily interactions with a diverse group of individuals scattered around the world, and that took place during an intensive 10-week period of my life that really taught me a lot,” says Sampalis. “Working with a global team was extremely engaging. It was a highly motivating, even transformative experience.”

MITx and MIT Bootcamps are both hands-on and interactive experiences offered by MIT Open Learning, which is exactly what appealed to Sampalis. One of the best parts, he says, is that community and collaborations with those he met through MIT continued even after the boot camp concluded. Participants remain in touch not only with their cohort, but with a broader community of over 1,800 other participants from around the world, and have access to continued coaching and mentorship.

Overall, the community of learners has been a highlight of Sampalis’ MIT Open Learning experience.

“What is so beneficial is not just that I get a certificate from MIT and access to a highly valuable repository of knowledge resources, but the fact that I have been exposed to the full umbrella of what Open Learning has to offer — and I share that with other learners,” he says. “I’m part of MIT now. I continue to learn for myself, and I also try to give back, by supporting Open Learning and sharing my story and resources.”
