Malicious open-source packages: Insights from Check Point’s Head of Data Science

Malicious open-source packages: Insights from Check Point’s Head of Data Science

Ori Abramovsky is the Head of Data Science of the Developer-First group at Check Point, where he leads the development and application of machine learning models to the source code domain. With extensive experience in various machine learning types, Ori specializes in bringing AI applications to life. He is committed to bridging the gap between theory and real-world application and is passionate about harnessing the power of AI to solve complex business challenges.

In this thoughtful and incisive interview, Check Point’s Developer-First Head of Data Science, Ori Abramovsky discusses malicious open-source packages. While malicious open-source packages aren’t new, their popularity among hackers is increasing. Discover attack vectors, how malicious packages conceal their intent, and risk mitigation measures. The best prevention measure is…Read the interview to find out.

What kinds of trends are you seeing in relation to malicious open-source packages?

The main trend we’re seeing relates to the increasing sophistication and prevalence of malicious open-source packages. While registries are implementing stricter measures, such as PyPI’s recent mandate for users to adopt two-factor authentication, the advances of Large Language Models (LLMs) pose significant challenges to safeguarding against such threats. Previously, hackers needed substantial expertise in order to create malicious packages. Now, all they need is access to LLMs and to find the right prompts for them. The barriers to entry have significantly decreased.

While LLMs democratise knowledge, they also make it much easier to distribute malicious techniques. As a result, it’s fair to assume that we should anticipate an increasing volume of sophisticated attacks. Moreover, we’re already in the middle of that shift, seeing these attacks extending beyond traditional domains like NPM and PyPI, manifesting in various forms such as malicious VSCode extensions and compromised Hugging Face models. To sum it up, the accessibility of LLMs empowers malicious actors, indicating a need for heightened vigilance across all open-source domains. Exciting yet challenging times lie ahead, necessitating preparedness.

Are there specific attack types that are most popular among hackers, and if so, what are they?

Malicious open-source packages can be applied based on the stage of infection: install (as part of the install process), first use (once the package has been imported), and runtime (infection is hidden as part of some functionality and will be activated once the user will use that functionality). Install and first use attacks typically employ simpler techniques; prioritizing volume over complexity, aiming to remain undetected long enough to infect users (assuming that some users will mistakenly install them). In contrast, runtime attacks are typically more sophisticated, with hackers investing efforts in concealing their malicious intent. As a result, the attacks are harder to detect, but come with a pricier tag. They last longer and therefore have higher chances of becoming a zero-day affecting more users.

Malicious packages employ diverse methods to conceal their intent, ranging from manipulating package structures (the simpler ones will commonly include only the malicious code, the more sophisticated ones can even be an exact copy of a legit package), to employing various obfuscation techniques (from classic methods such as base64 encoding, to more advanced techniques, such as steganography). The downside of using such concealment methods can make them susceptible to detection, as many Yara detection rules specifically target these signs of obfuscation. Given the emergence of Large Language Models (LLMs), hackers have greater access to advanced techniques for hiding malicious intent and we should expect to see more sophisticated and innovative concealment methods in the future.

Hackers tend to exploit opportunities where hacking is easier or more likely, with studies indicating a preference for targeting dynamic installation flows in registries like PyPI and NPM due to their simplicity in generating attacks. While research suggests a higher prevalence of such attacks in source code languages with dynamic installation flows, the accessibility of LLMs facilitates the adaptation of these attacks to new platforms, potentially leading hackers to explore less visible domains for their malicious activities.

How can organisations mitigate the risk associated with malicious open-source packages? How can CISOs ensure protection/prevention?

The foremost strategy for organisations to mitigate the risk posed by malicious open-source packages is through education. One should not use open-source code without properly knowing its origins. Ignorance in this realm does not lead to bliss. Therefore, implementing practices such as double-checking the authenticity of packages before installation is crucial. Looking into aspects like the accuracy of package descriptions, their reputation, community engagement (such as stars and user feedback), the quality of documentation in the associated GitHub repository, and its track record of reliability is also critical. By paying attention to these details, organisations can significantly reduce the likelihood of falling victim to malicious packages.

The fundamental challenge lies in addressing the ignorance regarding the risks associated with open-source software. Many users fail to recognize the potential threats and consequently, are prone to exploring and installing new packages without adequate scrutiny. Therefore, it is incumbent upon Chief Information Security Officers (CISOs) to actively participate in the decision-making process regarding the selection and usage of open-source packages within their organisations.

Despite best efforts, mistakes can still occur. To bolster defences, organisations should implement complementary protection services designed to monitor and verify the integrity of packages being installed. These measures serve as an additional layer of defence, helping to detect and mitigate potential threats in real-time.

What role does threat intelligence play in identifying and mitigating risks related to open-source packages?

Traditionally, threat intelligence has played a crucial role in identifying and mitigating risks associated with open-source packages. Dark web forums and other underground channels were primary sources for discussing and sharing malicious code snippets. This allowed security professionals to monitor and defend against these snippets using straightforward Yara rules. Additionally, threat intelligence facilitated the identification of suspicious package owners and related GitHub repositories, aiding in the early detection of potential threats. While effective for simpler cases of malicious code, this approach may struggle to keep pace with the evolving sophistication of attacks, particularly in light of advancements like Large Language Models (LLMs).

These days, with the rise of LLMs, it’s reasonable to expect hackers to innovate new methods through which to conduct malicious activity, prioritizing novel techniques over rehashing old samples that are easily identifiable by Yara rules. Consequently, while threat intelligence remains valuable, it should be supplemented with more advanced analysis techniques to thoroughly assess the integrity of open-source packages. This combined approach ensures a comprehensive defence against emerging threats, especially within less-monitored ecosystems, where traditional threat intelligence may be less effective.

What to anticipate in the future?

The emergence of Large Language Models (LLMs) is revolutionising every aspect of the software world, including the malicious domain. From the perspective of hackers, this development’s immediate implication equates to more complicated malicious attacks, more diverse attacks and more attacks, in general (leveraging LLMs to optimise strategies). Looking forward, we should anticipate hackers trying to target the LLMs themselves, using techniques like prompt injection or by trying to attack the LLM agents. New types and domains of malicious attacks are probably about to emerge.

Looking at the malicious open-source packages domain in general, a place we should probably start watching is Github. Historically, malicious campaigns have targeted open-source registries such as PyPI and NPM, with auxiliary support from platforms like GitHub, Dropbox, and Pastebin for hosting malicious components or publishing exploited data pieces. However, as these registries adopt more stringent security measures and become increasingly monitored, hackers are likely to seek out new “dark spots” such as extensions, marketplaces, and GitHub itself. Consequently, malicious code has the potential to infiltrate EVERY open-source component we utilise, necessitating vigilance and proactive measures to safeguard against such threats.