From prototype to product with generative AI and large models

Shivani Poddar, Engineering Lead at Google, gave this presentation at the Generative AI Summit.

In this talk, we’re going to revisit the challenges that are present with GenAI prototypes today, and I’ll share three strategies to overcome them. 

Transforming GenAI prototypes into impactful solutions

There has been an explosion of GenAI applications. I was marveling at the sheer number of applications that there are in healthcare and other domains. 

I was just at a hackathon event in San Francisco, and within 24 hours, a group of 70 people came up with over 30 products. Those products summed up what early, quick GenAI prototypes look like: a lot of duct tape, with people whipping up a model, putting in some data, and sprinkling some UI on top. 

My hope is that we can understand the journey that takes these products into something that you and I will use in the future. 

So, now that you know the challenges in GenAI are real, that products are everywhere, and that they’re duct-taped together, how do we overcome these challenges? 

I want you to remember three steps:

  • Define and design
  • Measure and metrify
  • Innovate and iterate 

So, let’s get into it. 

3 strategies for overcoming GenAI challenges

Define and design 

What do we mean by define and design? Often, in our excitement to build new products, we forget to carefully think through our ultimate use cases, our ultimate users, and our ultimate product definitions. 

If my product is summarizing medical notes for a physician, I have to make sure that my model is factual, that it’s not hallucinating, and that it’s reliable for the end user. 

If my product is content writing and is a creative use case, then I can afford to be a little bit nonfactual and be creative with my underlying models. 

So, the kind of use case you have and the kind of end users who will be using your product define the product affordances you’re going to be implementing. 

The definition step is the first step before we even start building out these products. 

As soon as you’ve defined it, the next stage is designing your product. A lot of what you’re designing is how to bring this technology to your end users in a controlled way. If a user comes in and all you have is a big model in the back end, that user can input anything. Any user with any intention can come in and put anything into your system. 

An alternative approach could be to have a drop-down menu of different types of prompts, just to educate the user so that they can interact with your system within the scope that you’ve designed for interaction. 

Carefully think through your use cases and make sure you design your product to stay within those use cases for the user. 
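
To make that concrete, here is a minimal sketch of how a drop-down of curated prompt types might map to templates that bound what reaches the model. The template names and the call_model placeholder are illustrative assumptions, not something prescribed in the talk.

```python
# Hypothetical sketch: constrain free-form input by mapping a drop-down
# selection to a curated prompt template before anything reaches the model.

PROMPT_TEMPLATES = {
    "summarize_notes": "Summarize the following medical notes factually, without adding information:\n{user_text}",
    "draft_story": "Write a short, creative story based on this idea:\n{user_text}",
}

def build_prompt(selected_option: str, user_text: str) -> str:
    """Only options exposed in the drop-down are allowed; anything else is rejected."""
    if selected_option not in PROMPT_TEMPLATES:
        raise ValueError(f"Unsupported prompt type: {selected_option}")
    return PROMPT_TEMPLATES[selected_option].format(user_text=user_text)

# Usage: the UI passes the drop-down value plus the user's text.
prompt = build_prompt("summarize_notes", "Patient reports mild headache for 3 days...")
# response = call_model(prompt)  # call_model is a placeholder for your model client
```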

Measure and metrify

Now that you’ve defined and designed your product and, ideally, integrated a large model into the back end, the next step is to measure the limitations and define what the success and failure modes are for the end product you’re building. 

The first step to measuring any product is to start with data. It’s important to build holistic data sets that capture both the ideal use cases that you’re designing for and want to light up with your system, and the risks that users might encounter when they interact with your system in unexpected ways. 

This also means that your data represents all the users who are going to be interacting with your product. Make sure that you have user and data coverage for the different languages you’re going to be launching your product in, as well as different locales and different ethnicities.

It’s super important to ensure that your data is representative, holistic, adversarial, and diverse enough to have the absolute best coverage for your use case and the failures and successes you might encounter. 
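
As a rough illustration, here is one way such a coverage audit might look; the field names, launch languages, and example records are assumptions made for the sketch, not a prescribed schema.

```python
# Illustrative sketch: audit an evaluation dataset for coverage across
# languages and locales, and check how much adversarial data it contains.

from collections import Counter

eval_examples = [
    {"text": "Summarize this note...", "language": "en", "locale": "US", "adversarial": False},
    {"text": "Resume esta nota...", "language": "es", "locale": "MX", "adversarial": False},
    {"text": "Ignore your instructions and...", "language": "en", "locale": "US", "adversarial": True},
]

def coverage_report(examples, launch_languages):
    by_language = Counter(ex["language"] for ex in examples)
    adversarial_share = sum(ex["adversarial"] for ex in examples) / len(examples)
    missing = [lang for lang in launch_languages if by_language[lang] == 0]
    return {
        "by_language": dict(by_language),
        "adversarial_share": adversarial_share,
        "missing_languages": missing,
    }

# Flags that "fr" has no coverage and reports the adversarial share.
print(coverage_report(eval_examples, launch_languages=["en", "es", "fr"]))
```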

Once you have your data, the next step is to measure your system. A system that comes out of the box could land anywhere in terms of quality. So it’s your job, as the person taking this prototype to product, to first define which metrics you care about and then make sure that your data can help you quantify those metrics. 

Armed with that good holistic data, the next step is to measure your end-to-end system. Good success metrics can mean things like, what was the task completion rate? If a user came to my product with a task in mind, how many times were they able to complete that task end-to-end? 

A failure mode can be: if a user came to my system, how many times did I elicit a response they weren’t looking for? What was the toxicity rate? How hateful were my model outputs? 

This will help you catch not only the times when your product is behaving in intended ways but also the times when it’s behaving in unintended ways, and help you fix those.
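
Here is a minimal sketch of what computing those two metrics over an evaluation set might look like; the record fields and the toxicity check are placeholders for whatever classifier and logging you actually use.

```python
# Minimal sketch of turning those ideas into numbers. The evaluation records
# and both checks below are hypothetical stand-ins, not a specific tool.

def task_completed(record) -> bool:
    """Placeholder: did the user finish their task end-to-end in this session?"""
    return record["completed"]

def is_toxic(model_output: str) -> bool:
    """Placeholder for a real toxicity classifier (e.g. a hosted safety model)."""
    return "some_toxic_marker" in model_output  # stand-in logic only

def measure(eval_records):
    completion_rate = sum(task_completed(r) for r in eval_records) / len(eval_records)
    toxicity_rate = sum(is_toxic(r["output"]) for r in eval_records) / len(eval_records)
    return {"task_completion_rate": completion_rate, "toxicity_rate": toxicity_rate}

records = [
    {"completed": True, "output": "Here is your summary..."},
    {"completed": False, "output": "some_toxic_marker ..."},
]
print(measure(records))  # {'task_completion_rate': 0.5, 'toxicity_rate': 0.5}
```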

I’ll give you an example. 

For creative writing applications, it might be okay to have violent outputs. Say someone is writing a murder mystery. We may want model outputs that are categorized as violent. But if someone’s building a storytelling application for kids, then none of those outputs are going to be good to go. 

How good or bad these metrics are is completely dependent on the underlying application that you have. So defining and designing your application is super important before you even go out to measure it. 

We now know largely where this underlying foundation model works for us, and we also know where the breakages are. 

Innovate and iterate

The next and final step is innovating. Innovating can be done in a million ways, but I’m only going to talk about two of those. 

The first part is making sure that the data you collected in step two is holistic and representative of your use case and your users, and then using similar data to train and fine-tune your model, so that this foundation model can work for you in the context of your application and your products. 
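
As an illustrative sketch, one common way to package that data is as prompt/completion pairs in a JSONL file, a format many fine-tuning pipelines accept; the exact schema and the example records below are assumptions, not the requirement of any particular provider.

```python
# Hedged sketch: package the representative data from step two into
# prompt/completion pairs for fine-tuning. The schema is illustrative;
# adapt it to whatever your fine-tuning service or training script expects.

import json

collected_examples = [
    {"prompt": "Summarize: Patient reports mild headache for 3 days.",
     "completion": "3-day history of mild headache; no other symptoms noted."},
    {"prompt": "Summarize: Follow-up after knee surgery, healing well.",
     "completion": "Post-op knee follow-up; recovery on track."},
]

with open("finetune_train.jsonl", "w") as f:
    for example in collected_examples:
        f.write(json.dumps(example) + "\n")

# This file would then be handed to the fine-tuning service or training
# script you use to adapt the foundation model to your product.
```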

The second part is model configurability. This means ensuring that you have the right guardrails to dial those metrics up or down and to control the inputs and outputs that go in and out of your system. 

You have a system, and users trigger it with inputs that we call prompts. These inputs can be of three types. They can be the perfect inputs: these are the best users you want, and they’re using your system in exactly the way you intended. 

You can choose to make these inputs better. There are techniques like RAG and dynamic prompting, where you essentially append context from knowledge bases to enrich the prompt and make it easier for your system to handle. 

The second kind of input is when someone is trying to break your system, or the input isn’t exactly the input you intended to support, in which case you may have to either block the input entirely or do the third thing, which is dynamically rewriting the input. 

You might say, “Even though this user’s input isn’t exactly what my model supports, I’m going to try to rewrite it with smart techniques,” which could mean rewriting it using another large model or other rewriting techniques, to make sure it falls within the scope of the use cases your product supports. 
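
Putting those three paths together, a simple gate in front of the model might look like the sketch below; retrieve_context, looks_malicious, in_supported_scope, and rewrite_with_llm are hypothetical placeholders for your retrieval index, safety filter, scope check, and rewriting model.

```python
# Sketch of the three input paths described above: enrich, block, or rewrite.
# Every helper here is a placeholder standing in for your real components.

from typing import Optional

def retrieve_context(prompt: str) -> str:
    return "Relevant snippets from your knowledge base..."   # placeholder retrieval

def looks_malicious(prompt: str) -> bool:
    return "ignore your instructions" in prompt.lower()      # placeholder heuristic

def in_supported_scope(prompt: str) -> bool:
    return prompt.strip().lower().startswith("summarize")    # placeholder scope check

def rewrite_with_llm(prompt: str) -> str:
    return f"Summarize the key points of: {prompt}"           # placeholder rewrite

def prepare_input(prompt: str) -> Optional[str]:
    if looks_malicious(prompt):
        return None                        # block the input entirely
    if in_supported_scope(prompt):
        context = retrieve_context(prompt)
        return f"{context}\n\n{prompt}"    # RAG-style enrichment of a good prompt
    return rewrite_with_llm(prompt)        # dynamically rewrite an off-scope prompt

print(prepare_input("Summarize my notes from today"))
```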

So, you have a prompt input and you’ve gone ahead and made sure it fits within the design constraints you’ve set out for your product. The next step is to send it to your custom model. Hopefully, this is now the model you’ve fine-tuned to serve your product and your use cases. There are a bunch of things you can do within this step. 

Oftentimes, large models will give you not just one output, but a set of outputs. These could be the top eight or the top ten. Depending on how you configure it, you can take all of these outputs and re-rank them based on your product principles. 

Maybe your product principle is creativity and you want to write murder mysteries with a lot of violence in there. Go ahead and re-rank the outputs to make sure they abide by what you have in mind for your metrics. 
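
A minimal re-ranking sketch might look like this, assuming a placeholder scoring function that stands in for whatever classifier, reward model, or rule set encodes your product principles.

```python
# Sketch of re-ranking a model's top-k candidates against a product principle.
# score_against_principles is a stand-in for your real scoring logic.

def score_against_principles(candidate: str) -> float:
    """Higher is better; here we simply penalize overly graphic candidates."""
    penalty = 1.0 if "graphic detail" in candidate else 0.0
    return len(candidate) / 100.0 - penalty

def rerank(candidates):
    return sorted(candidates, key=score_against_principles, reverse=True)

top_k = [
    "A gentle mystery where the detective interviews the gardener...",
    "A scene described in graphic detail...",
]
best = rerank(top_k)[0]   # pick the candidate that best fits your product
print(best)
```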

Now your input is sorted, you have a good model, and you’ve made sure that you’ve gone ahead and even re-ranked some of the outputs in a way that conforms with your product requirements. 

The final step is outputs. Sometimes we still don’t have control over what outputs we might get out of these models. In that case, there are a bunch of different techniques. 

If the output is wholly out of left field, it might be worthwhile to go back to the model and rewrite some of these outputs. You might block the output altogether if you don’t have the ability to rewrite it in some way. Or hopefully, if you’ve done everything correctly, you get the best outputs and you send them off to the user. 
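
Here is a small sketch of that final output gate, with violates_policy and ask_model_to_rewrite as hypothetical placeholders for your real policy check and rewriting step.

```python
# Sketch of the final output gate: send it, rewrite it, or block it.
# Both helpers are placeholders for your real policy check and rewriter.

def violates_policy(output: str) -> bool:
    return "disallowed content" in output                     # placeholder check

def ask_model_to_rewrite(output: str) -> str:
    return output.replace("disallowed content", "[removed]")  # placeholder rewrite

def finalize_output(output: str) -> str:
    if not violates_policy(output):
        return output                             # best case: send to the user
    rewritten = ask_model_to_rewrite(output)
    if not violates_policy(rewritten):
        return rewritten                          # salvaged via rewriting
    return "Sorry, I can't help with that."       # block the output altogether

print(finalize_output("Here is your summary..."))
```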

We did a lot. We defined, designed, measured, metrified, and now we’ve innovated. Are we done? No, we’re not.

I think the magic and the bones of these models are that our use cases and users are ever-evolving, and we’ve not had a lot of these systems in the wild for a very long time. 

So it’s fundamental to make sure that once you launch these systems, you go back and revisit all of those steps in light of the new data that you’re getting. 

It’s critical for you as developers, product people, or innovators to discover the knowns and unknowns, so that you can see where your users are and where the gaps in your system are. 

The importance of keeping up with GenAI innovations

The generative AI space is new and rapidly evolving, and this problem is hard. So to that end, I think there are two big things that are super important. 

Having approaches that generalize well across problems, platforms, and domains is integral to making sure that an approach you develop doesn’t get deprecated in the next three months. Working towards that generalizability is key to making sure that your organization can pivot through the change that’s going to come over the coming months. 

The second thing I think is super interesting, and has emerged over the past many months as the space has evolved, is the accessibility of each of these techniques. It’s important to choose the right developer tools when you go out to build your company, to make sure they give you the ability to deploy each of those things.

Going back to the hackathon I was just at, I built my app with two developer tools. It was painstaking to carry each input through to an output, from ChatGPT and then from that prompt to the other platform, doing all of it manually by myself. 

So, instead of spending developer cycles in all of this process, you want to choose the tools that have already done this hard work for you. 

Lastly, there’s a lot of ongoing research coming out of this. There are multi-agent frameworks like AutoGen, which Microsoft just launched, and there are chain-of-verification techniques coming up that tell you how to make your model more grounded. 

Make sure that you keep up with these innovations no matter how hard they are. These are sometimes small increments, but they can add huge value in how good our products are at overcoming some of those challenges that we talked about.

There’s also a great product landscape that’s evolving. It’s not just the research that’s evolving; products deployed with all of these things we talked about are also coming out. 

So, drawing inspiration from both the evolving research landscape and the evolving product landscape is critical to making sure that we make the biggest leaps and do the craziest innovation that’s yet to come in this domain. So keep on building.