Steven Hillion, SVP of Data and AI at Astronomer – Interview Series

Steven Hillion is the Senior Vice President of Data and AI at Astronomer, where he draws on his academic background in research mathematics and more than 15 years of experience building machine learning platforms in Silicon Valley. At Astronomer, he spearheads the creation of Apache Airflow features designed specifically for ML and AI teams and oversees the internal data science team. Under his leadership, Astronomer has advanced its modern data orchestration platform, significantly enhancing its data pipeline capabilities to support a diverse range of data sources and machine learning tasks.

Can you share some information about your journey in data science and AI, and how it has shaped your approach to leading engineering and analytics teams?

I had a background in research mathematics at Berkeley before I moved across the Bay to Silicon Valley and worked as an engineer in a series of successful start-ups. I was happy to leave behind the politics and bureaucracy of academia, but I found within a few years that I missed the math. So I shifted into developing platforms for machine learning and analytics, and that’s pretty much what I’ve done since.

My training in pure mathematics has resulted in a preference for what data scientists call ‘parsimony’ — the right tool for the job, and nothing more.  Because mathematicians tend to favor elegant solutions over complex machinery, I’ve always tried to emphasize simplicity when applying machine learning to business problems. Deep learning is great for some applications — large language models are brilliant for summarizing documents, for example — but sometimes a simple regression model is more appropriate and easier to explain.

It’s been fascinating to see the shifting role of the data scientist and the software engineer in these last twenty years since machine learning became widespread. Having worn both hats, I am very aware of the importance of the software development lifecycle (especially automation and testing) as applied to machine learning projects.

What are the biggest challenges in moving, processing, and analyzing unstructured data for AI and large language models (LLMs)?

In the world of Generative AI, your data is your most valuable asset. The models are increasingly commoditized, so your differentiation is all that hard-won institutional knowledge captured in your proprietary and curated datasets.

Delivering the right data at the right time places high demands on your data pipelines — and this applies to unstructured data just as much as to structured data, perhaps even more. Often you’re ingesting data from many different sources, in many different formats. You need access to a variety of methods in order to unpack the data and get it ready for use in model inference or model training. You also need to understand the provenance of the data, and where it ends up, in order to “show your work”.

If you’re only doing this once in a while to train a model, that’s fine. You don’t necessarily need to operationalize it. If you’re using the model daily, to understand customer sentiment from online forums, or to summarize and route invoices, then it starts to look like any other operational data pipeline, which means you need to think about reliability and reproducibility. Or if you’re fine-tuning the model regularly, then you need to worry about monitoring for accuracy and cost.

The good news is that data engineers have developed a great platform, Airflow, for managing data pipelines, and it has already been applied successfully to model deployment and monitoring by some of the world’s most sophisticated ML teams. So the models may be new, but orchestration is not.
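To make that concrete, here is a minimal sketch of such a pipeline as an Airflow DAG; the source name, task bodies, and helper logic are hypothetical placeholders rather than a specific Astronomer implementation:

```python
# A minimal sketch of an unstructured-document pipeline on Airflow.
# The source names and processing logic are hypothetical placeholders.
from datetime import datetime
from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False, tags=["unstructured"])
def document_ingestion():
    @task
    def extract(source: str) -> list[dict]:
        # Pull raw documents (PDFs, emails, forum posts, ...) from one source.
        # In practice this would call an API client or read from object storage.
        return [{"source": source, "text": "..."}]

    @task
    def transform(docs: list[dict]) -> list[dict]:
        # Clean, chunk, and attach provenance metadata so you can "show your work".
        return [{**d, "chunks": d["text"].split("\n")} for d in docs]

    @task
    def load(docs: list[dict]) -> None:
        # Embed and write to a vector store or feature table for inference or training.
        print(f"loaded {len(docs)} documents")

    load(transform(extract("support_forum")))


document_ingestion()
```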

Can you elaborate on the use of synthetic data to fine-tune smaller models for accuracy? How does this compare to training larger models?

It’s a powerful technique. You can think of the best large language models as encapsulating what they’ve learned about the world from extensive training on diverse datasets, and they can pass that knowledge on to smaller models by generating synthetic data that captures the patterns, structures, and information they have learned. That synthetic data can then be used to train smaller models, effectively transferring some of the knowledge from the larger models to the smaller ones. This process is often referred to as “knowledge distillation” and helps in creating efficient, smaller models that still perform well on specific tasks. Synthetic data also lets you avoid privacy issues and fill in gaps where the training data is small or incomplete.
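As a rough sketch of the idea (the teacher and student interfaces below are hypothetical placeholders, standing in for whichever large-model API and fine-tuning workflow you actually use):

```python
# A rough sketch of distillation via synthetic data, assuming a hypothetical
# `teacher` client (a large hosted LLM) and a separate fine-tuning step.
import json

SEED_PROMPTS = [
    "Summarize this invoice and extract the total amount due: ...",
    "Classify the sentiment of this customer review: ...",
]

def build_synthetic_dataset(teacher, prompts, n_variations=50, path="synthetic.jsonl"):
    """Ask the large model to label (and vary) examples for the small model."""
    with open(path, "w") as f:
        for prompt in prompts:
            for _ in range(n_variations):
                completion = teacher.complete(prompt)  # hypothetical API call
                f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
    return path

# The resulting JSONL file would then be used to fine-tune a smaller, cheaper model,
# e.g. student.fine_tune(training_file="synthetic.jsonl")  # hypothetical API
```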

This can be helpful for training a more domain-specific generative AI model, and it can even be more effective than training a “larger” model, while giving you a greater level of control.

Data scientists have been generating synthetic data for a while, and imputation has been around for as long as messy datasets have existed. But you always had to be very careful that you weren’t introducing biases or making incorrect assumptions about the distribution of the data. Now that synthesizing data is so much easier and more powerful, you have to be even more careful. Errors can be magnified.

A lack of diversity in generated data can lead to ‘model collapse’. The model thinks it’s doing well, but that’s because it hasn’t seen the full picture. And, more generally, a lack of diversity in training data is something that data teams should always be looking out for.
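One simple check along those lines is to compare the distribution of a key feature in the synthetic data against the organic data before training. This is only a minimal sketch; it assumes pandas DataFrames and an illustrative column name:

```python
# A simple sanity check, assuming pandas DataFrames `real` and `synthetic`
# that share a numeric column; the column name is illustrative only.
import pandas as pd
from scipy.stats import ks_2samp

def check_distribution_drift(real: pd.DataFrame, synthetic: pd.DataFrame,
                             column: str, alpha: float = 0.05) -> bool:
    """Two-sample Kolmogorov-Smirnov test: flag synthetic data whose
    distribution diverges noticeably from the organic data."""
    stat, p_value = ks_2samp(real[column].dropna(), synthetic[column].dropna())
    if p_value < alpha:
        print(f"Warning: '{column}' looks different in synthetic data "
              f"(KS statistic={stat:.3f}, p={p_value:.4f})")
        return False
    return True
```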

At a baseline level, whether you are using synthetic data or organic data, lineage and quality are paramount for training or fine-tuning any model. As we know, models are only as good as the data they’re trained on.  While synthetic data can be a great tool to help represent a sensitive dataset without exposing it or to fill in gaps that might be left out of a representative dataset, you must have a paper trail showing where the data came from and be able to prove its level of quality.

What are some innovative techniques your team at Astronomer is implementing to improve the efficiency and reliability of data pipelines?

So many! Astro’s fully-managed Airflow infrastructure and the Astro Hypervisor support dynamic scaling and proactive monitoring through advanced health metrics. This ensures that resources are used efficiently and that systems are reliable at any scale. Astro provides robust data-centric alerting with customizable notifications that can be sent through various channels like Slack and PagerDuty, ensuring timely intervention before issues escalate.
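As a rough sketch of what task-level alerting can look like in plain open-source Airflow (distinct from Astro’s managed alerting; the webhook URL and message text below are placeholders):

```python
# A generic failure-alert pattern in open-source Airflow; the webhook URL
# and message text are placeholders, not Astro's built-in alerting.
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.empty import EmptyOperator

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder

def notify_on_failure(context):
    """Airflow passes the task context to on_failure_callback; post a summary."""
    ti = context["task_instance"]
    message = f"Task {ti.dag_id}.{ti.task_id} failed for run {context['run_id']}"
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)

with DAG(
    dag_id="pipeline_with_alerts",
    schedule="@hourly",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={"on_failure_callback": notify_on_failure},
) as dag:
    EmptyOperator(task_id="placeholder_task")
```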

Data validation tests, unit tests, and data quality checks play vital roles in ensuring the reliability, accuracy, and efficiency of data pipelines, and ultimately of the data that powers your business. These checks ensure that while you quickly build data pipelines to meet your deadlines, they are actively catching errors, improving development times, and reducing unforeseen errors in the background. At Astronomer, we’ve built tools like the Astro CLI to help seamlessly check code functionality or identify integration issues within your data pipelines.
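For example, a common pattern is a small test suite that loads every DAG and fails on import errors; this sketch (run with pytest, say) assumes your DAG files live in a local dags/ folder:

```python
# A common DAG-integrity test pattern; it assumes DAG files in a local "dags/" folder.
from airflow.models import DagBag

def test_dags_import_cleanly():
    """Fail fast if any DAG file has syntax errors or broken imports."""
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    assert not dag_bag.import_errors, f"DAG import errors: {dag_bag.import_errors}"

def test_every_dag_is_tagged():
    """A lightweight policy check: every DAG should carry at least one tag."""
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    for dag_id, dag in dag_bag.dags.items():
        assert dag.tags, f"{dag_id} has no tags"
```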

How do you see the evolution of generative AI governance, and what measures should be taken to support the creation of more tools?

Governance is imperative if the applications of Generative AI are going to be successful. It’s all about transparency and reproducibility. Do you know how you got this result, and from where, and by whom? Airflow by itself already gives you a way to see what individual data pipelines are doing. Its user interface was one of the reasons for its rapid adoption early on, and at Astronomer we’ve augmented that with visibility across teams and deployments. We also provide our customers with Reporting Dashboards that offer comprehensive insights into platform usage, performance, and cost attribution for informed decision making. In addition, the Astro API enables teams to programmatically deploy, automate, and manage their Airflow pipelines, mitigating risks associated with manual processes, and ensuring seamless operations at scale when managing multiple Airflow environments. Lineage capabilities are baked into the platform.

These are all steps toward helping to manage data governance, and I believe companies of all sizes are recognizing the importance of data governance for ensuring trust in AI applications. This recognition and awareness will largely drive the demand for data governance tools, and I anticipate the creation of more of these tools to accelerate as generative AI proliferates. But they need to be part of the larger orchestration stack, which is why we view it as fundamental to the way we build our platform.

Can you provide examples of how Astronomer’s solutions have improved operational efficiency and productivity for clients?

Generative AI processes involve complex and resource-intensive tasks that need to be carefully optimized and repeatedly executed. Astro, Astronomer’s managed Apache Airflow platform, provides a framework at the center of the emerging AI app stack to help simplify these tasks and enhance the ability to innovate rapidly.

By orchestrating generative AI tasks, businesses can ensure computational resources are used efficiently and workflows are optimized and adjusted in real-time. This is particularly important in environments where generative models must be frequently updated or retrained based on new data.
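A sketch of what such a scheduled retraining workflow might look like in Airflow; the task bodies, dataset path, and model identifier below are placeholders, not a specific Astronomer implementation:

```python
# A sketch of a scheduled fine-tuning workflow; each task body is a placeholder
# for your own data collection, training, evaluation, and deployment logic.
from datetime import datetime
from airflow.decorators import dag, task


@dag(schedule="@weekly", start_date=datetime(2024, 1, 1), catchup=False)
def retrain_generative_model():
    @task
    def collect_new_data() -> str:
        # Gather the week's new examples; return a dataset URI (placeholder path).
        return "s3://my-bucket/datasets/latest.jsonl"

    @task
    def fine_tune(dataset_uri: str) -> str:
        # Kick off fine-tuning on your model provider or training cluster.
        return "model-v42"  # placeholder model identifier

    @task
    def evaluate(model_id: str) -> bool:
        # Run an evaluation suite; only promote if quality (and cost) pass.
        return True

    @task
    def deploy(model_id: str, passed: bool) -> None:
        if passed:
            print(f"Promoting {model_id} to production")

    dataset = collect_new_data()
    model = fine_tune(dataset)
    deploy(model, evaluate(model))


retrain_generative_model()
```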

By leveraging Airflow’s workflow management and Astronomer’s deployment and scaling capabilities, teams can spend less time managing infrastructure and focus their attention instead on data transformation and model development, which accelerates the deployment of Generative AI applications and enhances performance.

In this way, Astronomer’s Astro platform has helped customers improve the operational efficiency of generative AI across a wide range of use cases. To name a few, use cases include e-commerce product discovery, customer churn risk analysis, support automation, legal document classification and summarization, garnering product insights from customer reviews, and dynamic cluster provisioning for product image generation.

What role does Astronomer play in enhancing the performance and scalability of AI and ML applications?

Scalability is a major challenge for businesses tapping into generative AI in 2024. When moving from prototype to production, users expect their generative AI apps to be reliable and performant, and for the outputs they produce to be trustworthy. All of this needs to be done cost-effectively, and businesses of all sizes need to be able to harness the technology’s potential. With this in mind, Astronomer lets tasks be scaled horizontally to dynamically process large numbers of data sources. Astro can elastically scale deployments and the clusters they’re hosted on, and queue-based task execution with dedicated machine types provides greater reliability and efficient use of compute resources. To help with the cost-efficiency piece of the puzzle, Astro offers scale-to-zero and hibernation features, which help control spiraling costs and reduce cloud spending. We also provide complete transparency around the cost of the platform. My own data team generates reports on consumption which we make available daily to our customers.
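One concrete mechanism for that horizontal fan-out is Airflow’s dynamic task mapping; in this sketch the list of sources is purely illustrative:

```python
# Fanning out over many sources with Airflow's dynamic task mapping;
# the list of sources here is illustrative only.
from datetime import datetime
from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def fan_out_over_sources():
    @task
    def list_sources() -> list[str]:
        # In practice this might query a catalog or a configuration table.
        return [f"source_{i}" for i in range(100)]

    @task
    def process(source: str) -> int:
        # Each mapped task instance handles one source and can run in parallel,
        # on whichever worker queue the deployment scales out.
        return len(source)

    process.expand(source=list_sources())


fan_out_over_sources()
```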

What are some future trends in AI and data science that you are excited about, and how is Astronomer preparing for them?

Explainable AI is a hugely important and fascinating area of development. Being able to peer into the inner workings of very large models is almost eerie.  And I’m also interested to see how the community wrestles with the environmental impact of model training and tuning. At Astronomer, we continue to update our Registry with all the latest integrations, so that data and ML teams can connect to the best model services and the most efficient compute platforms without any heavy lifting.

How do you envision the integration of advanced AI tools like LLMs with traditional data management systems evolving over the next few years?

We’ve seen both Databricks and Snowflake make announcements recently about how they incorporate both the usage and the development of LLMs within their respective platforms. Other DBMS and ML platforms will do the same. It’s great to see data engineers have such easy access to such powerful methods, right from the command line or the SQL prompt.

I’m particularly interested in how relational databases incorporate machine learning. I’m always waiting for ML methods to be incorporated into the SQL standard, but for some reason the two disciplines have never really hit it off.  Perhaps this time will be different.

I’m very excited about the future of large language models to assist the work of the data engineer. For starters, LLMs have already been particularly successful with code generation, although early efforts to supply data scientists with AI-driven suggestions have been mixed: Hex is great, for example, whereas Snowflake is uninspiring so far. But there is huge potential to change the nature of work for data teams, much more than for developers. Why? For software engineers, the prompt is a function name or the docs, but for data engineers there’s also the data. There’s just so much context that models can work with to make useful and accurate suggestions.

What advice would you give to aspiring data scientists and AI engineers looking to make an impact in the industry?

Learn by doing. It’s so incredibly easy to build applications these days, and to augment them with artificial intelligence. So build something cool, and send it to a friend of a friend who works at a company you admire. Or send it to me, and I promise I’ll take a look!

The trick is to find something you’re passionate about and find a good source of related data. A friend of mine did a fascinating analysis of anomalous baseball seasons going back to the 19th century and uncovered some stories that deserve to have a movie made out of them. And some of Astronomer’s engineers recently got together one weekend to build a platform for self-healing data pipelines. I can’t imagine even trying to do something like that a few years ago, but with just a few days’ effort we won Cohere’s hackathon and built the foundation of a major new feature in our platform.

Thank you for the great interview. Readers who wish to learn more should visit Astronomer.