Optimize LLMs with DSPy: A Step-by-Step Guide to Building, Optimizing, and Evaluating AI Systems

As the capabilities of large language models (LLMs) continue to expand, developing robust AI systems that leverage their potential has become increasingly complex. Conventional approaches often involve intricate prompting techniques, data generation for fine-tuning, and manual guidance to ensure adherence to domain-specific constraints. However, this process can be tedious, error-prone, and heavily reliant on human intervention.

Enter DSPy, a revolutionary framework designed to streamline the development of AI systems powered by LLMs. DSPy introduces a systematic approach to optimizing LM prompts and weights, enabling developers to build sophisticated applications with minimal manual effort.

In this comprehensive guide, we’ll explore the core principles of DSPy, its modular architecture, and the array of powerful features it offers. We’ll also dive into practical examples, demonstrating how DSPy can transform the way you develop AI systems with LLMs.

What is DSPy, and Why Do You Need It?

DSPy is a framework that separates the flow of your program (modules) from the parameters (LM prompts and weights) of each step. This separation allows for the systematic optimization of LM prompts and weights, enabling you to build complex AI systems with greater reliability, predictability, and adherence to domain-specific constraints.

Traditionally, developing AI systems with LLMs involved a laborious process of breaking down the problem into steps, crafting intricate prompts for each step, generating synthetic examples for fine-tuning, and manually guiding the LMs to adhere to specific constraints. This approach was not only time-consuming but also prone to errors, as even minor changes to the pipeline, LM, or data could necessitate extensive rework of prompts and fine-tuning steps.

DSPy addresses these challenges by introducing a new paradigm: optimizers. These LM-driven algorithms can tune the prompts and weights of your LM calls, given a metric you want to maximize. By automating the optimization process, DSPy empowers developers to build robust AI systems with minimal manual intervention, enhancing the reliability and predictability of LM outputs.

DSPy’s Modular Architecture

At the heart of DSPy lies a modular architecture that facilitates the composition of complex AI systems. The framework provides a set of built-in modules that abstract various prompting techniques, such as dspy.ChainOfThought and dspy.ReAct. These modules can be combined and composed into larger programs, allowing developers to build intricate pipelines tailored to their specific requirements.

Each module encapsulates learnable parameters, including the instructions, few-shot examples, and LM weights. When a program is compiled, DSPy’s optimizers tune these parameters to maximize the desired metric, ensuring that the LM’s outputs adhere to the specified constraints and requirements.
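For instance, a built-in module can be declared from an inline signature and invoked like a function. Here is a minimal sketch, assuming an LM has already been configured via dspy.settings.configure:

# Declare a chain-of-thought module from an inline signature
qa = dspy.ChainOfThought("question -> answer")
# Invoke it like a function; DSPy handles prompt construction behind the scenes
response = qa(question="What is the capital of France?")
print(response.answer)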

Optimizing with DSPy

DSPy introduces a range of powerful optimizers designed to enhance the performance and reliability of your AI systems. These optimizers leverage LM-driven algorithms to tune the prompts and weights of your LM calls, maximizing the specified metric while adhering to domain-specific constraints.

Some of the key optimizers available in DSPy include:

  1. BootstrapFewShot: This optimizer extends the signature by automatically generating and including optimized examples within the prompt sent to the model, implementing few-shot learning.
  2. BootstrapFewShotWithRandomSearch: Applies BootstrapFewShot several times with random search over generated demonstrations, selecting the best program found during optimization.
  3. MIPRO: Generates instructions and few-shot examples in each step, with instruction generation being both data-aware and demonstration-aware. It uses Bayesian Optimization to search effectively over the space of generated instructions and demonstrations across your modules.
  4. BootstrapFinetune: Distills a prompt-based DSPy program into weight updates for smaller LMs, allowing you to fine-tune the underlying LLM(s) for enhanced efficiency.

By leveraging these optimizers, developers can systematically optimize their AI systems, ensuring high-quality outputs while adhering to domain-specific constraints and requirements.
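Despite their different strategies, these optimizers share the same basic interface: construct one with a metric, then compile your program against a training set. A minimal sketch, where MyProgram and trainset are placeholders for your own module and data:

from dspy.teleprompt import BootstrapFewShot

# A metric receives a gold example and a prediction (plus an optional trace)
def exact_match(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()

optimizer = BootstrapFewShot(metric=exact_match)
optimized_program = optimizer.compile(MyProgram(), trainset=trainset)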

Getting Started with DSPy

To illustrate the power of DSPy, let’s walk through a practical example of building a retrieval-augmented generation (RAG) system for question-answering.

Step 1: Setting up the Language Model and Retrieval Model

The first step involves configuring the language model (LM) and retrieval model (RM) within DSPy.

To install DSPy, run:

pip install dspy-ai

DSPy supports multiple LM and RM APIs, as well as local model hosting, making it easy to integrate your preferred models.

import dspy
# Configure the LM and RM
turbo = dspy.OpenAI(model='gpt-3.5-turbo')
colbertv2_wiki17_abstracts = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
dspy.settings.configure(lm=turbo, rm=colbertv2_wiki17_abstracts)

Step 2: Loading the Dataset

Next, we’ll load the HotPotQA dataset, which contains a collection of complex question-answer pairs typically answered in a multi-hop fashion.

from dspy.datasets import HotPotQA
# Load the dataset
dataset = HotPotQA(train_seed=1, train_size=20, eval_seed=2023, dev_size=50, test_size=0)
# Specify the 'question' field as the input
trainset = [x.with_inputs('question') for x in dataset.train]
devset = [x.with_inputs('question') for x in dataset.dev]

Step 3: Building Signatures

DSPy uses signatures to define the behavior of modules. In this example, we’ll define a signature for the answer generation task, specifying the input fields (context and question) and the output field (answer).

class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""
    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

Step 4: Building the Pipeline

We’ll build our RAG pipeline as a DSPy module, which consists of an initialization method (__init__) to declare the sub-modules (dspy.Retrieve and dspy.ChainOfThought) and a forward method (forward) to describe the control flow of answering the question using these modules.

class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        # Declare the sub-modules: a retriever and a chain-of-thought answer generator
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        # Retrieve the top passages, then generate an answer grounded in them
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)

Step 5: Optimizing the Pipeline

With the pipeline defined, we can now optimize it using DSPy’s optimizers. In this example, we’ll use the BootstrapFewShot optimizer, which generates and selects effective prompts for our modules based on a training set and a metric for validation.

from dspy.teleprompt import BootstrapFewShot
# Validation metric
def validate_context_and_answer(example, pred, trace=None):
    # The predicted answer must match the gold answer exactly...
    answer_EM = dspy.evaluate.answer_exact_match(example, pred)
    # ...and must also appear in one of the retrieved passages
    answer_PM = dspy.evaluate.answer_passage_match(example, pred)
    return answer_EM and answer_PM
# Set up the optimizer
teleprompter = BootstrapFewShot(metric=validate_context_and_answer)
# Compile the program
compiled_rag = teleprompter.compile(RAG(), trainset=trainset)

Step 6: Evaluating the Pipeline

After compiling the program, it is essential to evaluate its performance on a development set to ensure it meets the desired accuracy and reliability.

from dspy.evaluate import Evaluate
# Set up the evaluator
evaluate = Evaluate(devset=devset, metric=validate_context_and_answer, num_threads=4, display_progress=True, display_table=0)
# Evaluate the compiled RAG program
evaluation_result = evaluate(compiled_rag)
print(f"Evaluation Result: {evaluation_result}")

Step 7: Inspecting Model History

For a deeper understanding of the model’s interactions, you can review the most recent generations by inspecting the model’s history.

# Inspect the model's history
turbo.inspect_history(n=1)

Step 8: Making Predictions

With the pipeline optimized and evaluated, you can now use it to make predictions on new questions.

# Example question
question = "Which award did Gary Zukav's first book receive?"
# Make a prediction using the compiled RAG program
prediction = compiled_rag(question)
print(f"Question: {question}")
print(f"Answer: {prediction.answer}")
print(f"Retrieved Contexts: {prediction.context}")

Minimal Working Example with DSPy

Now, let’s walk through another minimal working example, using the GSM8K dataset and an OpenAI GPT-3.5 model to demonstrate prompting tasks within DSPy.

Setup

First, ensure your environment is properly configured:

import dspy
from dspy.datasets.gsm8k import GSM8K, gsm8k_metric
# Set up the LM
turbo = dspy.OpenAI(model='gpt-3.5-turbo-instruct', max_tokens=250)
dspy.settings.configure(lm=turbo)
# Load math questions from the GSM8K dataset
gsm8k = GSM8K()
gsm8k_trainset, gsm8k_devset = gsm8k.train[:10], gsm8k.dev[:10]
print(gsm8k_trainset)

The gsm8k_trainset and gsm8k_devset splits each contain a list of examples, where every example has a question field and an answer field.
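Each field can be read with the dot operator, for example:

# Peek at the first training example (question and answer fields)
example = gsm8k_trainset[0]
print(example.question)
print(example.answer)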

Define the Module

Next, define a custom program utilizing the ChainOfThought module for step-by-step reasoning:

class CoT(dspy.Module):
    def __init__(self):
        super().__init__()
        # A chain-of-thought predictor over a simple question -> answer signature
        self.prog = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        return self.prog(question=question)

Compile and Evaluate the Model

Now compile it with the BootstrapFewShot teleprompter:

from dspy.teleprompt import BootstrapFewShot
# Set up the optimizer
config = dict(max_bootstrapped_demos=4, max_labeled_demos=4)
# Optimize using the gsm8k_metric
teleprompter = BootstrapFewShot(metric=gsm8k_metric, **config)
optimized_cot = teleprompter.compile(CoT(), trainset=gsm8k_trainset)
# Set up the evaluator
from dspy.evaluate import Evaluate
evaluate = Evaluate(devset=gsm8k_devset, metric=gsm8k_metric, num_threads=4, display_progress=True, display_table=0)
evaluate(optimized_cot)
# Inspect the model's history
turbo.inspect_history(n=1)

This example demonstrates how to set up your environment, define a custom module, compile a model, and rigorously evaluate its performance using the provided dataset and teleprompter configurations.

Data Management in DSPy

DSPy operates with training, development, and test sets. For each example in your data, you typically have three types of values: inputs, intermediate labels, and final labels. While intermediate or final labels are optional, having a few example inputs is essential.

Creating Example Objects

Example objects in DSPy are similar to Python dictionaries but come with useful utilities:

qa_pair = dspy.Example(question="This is a question?", answer="This is an answer.")
print(qa_pair)
print(qa_pair.question)
print(qa_pair.answer)

Output:

Example({'question': 'This is a question?', 'answer': 'This is an answer.'}) (input_keys=None)
This is a question?
This is an answer.

Specifying Input Keys

In DSPy, Example objects have a with_inputs() method to mark specific fields as inputs:

print(qa_pair.with_inputs("question"))
print(qa_pair.with_inputs("question", "answer"))

Values can be accessed using the dot operator, and methods like inputs() and labels() return new Example objects containing only input or non-input keys, respectively.
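For example, with a small article/summary pair (illustrative field names):

article_summary = dspy.Example(article="This is an article.", summary="This is a summary.").with_inputs("article")
# inputs() returns only the fields marked as inputs
print(article_summary.inputs())
# labels() returns the remaining (non-input) fields
print(article_summary.labels())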

Optimizers in DSPy

A DSPy optimizer tunes the parameters of a DSPy program (i.e., prompts and/or LM weights) to maximize specified metrics. DSPy offers various built-in optimizers, each employing different strategies.

Available Optimizers

  • BootstrapFewShot: Generates few-shot examples using provided labeled input and output data points.
  • BootstrapFewShotWithRandomSearch: Applies BootstrapFewShot multiple times with random search over generated demonstrations.
  • COPRO: Generates and refines new instructions for each step, optimizing them with coordinate ascent.
  • MIPRO: Optimizes instructions and few-shot examples using Bayesian Optimization.

Choosing an Optimizer

If you’re unsure where to start, BootstrapFewShotWithRandomSearch is a sensible default. As general guidance:

  • For very little data (around 10 examples), use BootstrapFewShot.
  • For slightly more data (around 50 examples), use BootstrapFewShotWithRandomSearch.
  • For larger datasets (300+ examples), use MIPRO.

Here’s how to use BootstrapFewShotWithRandomSearch:

from dspy.teleprompt import BootstrapFewShotWithRandomSearch
config = dict(max_bootstrapped_demos=4, max_labeled_demos=4, num_candidate_programs=10, num_threads=4)
teleprompter = BootstrapFewShotWithRandomSearch(metric=YOUR_METRIC_HERE, **config)
optimized_program = teleprompter.compile(YOUR_PROGRAM_HERE, trainset=YOUR_TRAINSET_HERE)

Saving and Loading Optimized Programs

After running a program through an optimizer, save it for future use:

optimized_program.save(YOUR_SAVE_PATH)

Load a saved program:

loaded_program = YOUR_PROGRAM_CLASS()
loaded_program.load(path=YOUR_SAVE_PATH)

Advanced Features: DSPy Assertions

DSPy Assertions automate the enforcement of computational constraints on LMs, enhancing the reliability, predictability, and correctness of LM outputs.

Using Assertions

DSPy provides dspy.Assert for hard constraints (which trigger retries and halt if the constraint still fails) and dspy.Suggest for soft constraints (which provide feedback but let execution continue). Define validation functions and declare assertions following the respective model generation. For example:

dspy.Suggest(
    len(query) <= 100,
    "Query should be short and less than 100 characters",
)
dspy.Suggest(
    validate_query_distinction_local(prev_queries, query),
    "Query should be distinct from: " + "; ".join(f"{i+1}) {q}" for i, q in enumerate(prev_queries)),
)
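The validation function above is user-defined. A minimal sketch of what validate_query_distinction_local might look like (one possible implementation, not part of DSPy):

def validate_query_distinction_local(previous_queries, query):
    # Accept the new query only if it does not duplicate any previous query
    return all(query.strip().lower() != q.strip().lower() for q in previous_queries)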

Transforming Programs with Assertions

from dspy.primitives.assertions import assert_transform_module, backtrack_handler
baleen_with_assertions = assert_transform_module(SimplifiedBaleenAssertions(), backtrack_handler)

Alternatively, activate assertions directly on the program:

baleen_with_assertions = SimplifiedBaleenAssertions().activate_assertions()

Assertion-Driven Optimizations

DSPy Assertions compose with DSPy optimizers, particularly BootstrapFewShotWithRandomSearch, in two settings (a sketch follows the list):

  • Compilation with Assertions
  • Compilation + Inference with Assertions
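A sketch of both settings, reusing the metric from earlier and assuming a SimplifiedBaleen program; exact keyword arguments may vary across DSPy versions:

from dspy.teleprompt import BootstrapFewShotWithRandomSearch

teleprompter = BootstrapFewShotWithRandomSearch(metric=validate_context_and_answer, num_candidate_programs=6)
# Compilation with Assertions: the teacher enforces assertions while bootstrapping demonstrations
compiled_baleen = teleprompter.compile(student=SimplifiedBaleen(), teacher=baleen_with_assertions, trainset=trainset)
# Compilation + Inference with Assertions: the compiled student also enforces assertions at inference time
compiled_with_assertions_baleen = teleprompter.compile(student=baleen_with_assertions, teacher=baleen_with_assertions, trainset=trainset)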

Conclusion

DSPy offers a powerful and systematic approach to optimizing language models and their prompts. By following the steps outlined in these examples, you can build, optimize, and evaluate complex AI systems with ease. DSPy’s modular design and advanced optimizers allow for efficient and effective integration of various language models, making it a valuable tool for anyone working in the field of NLP and AI.

Whether you’re building a simple question-answering system or a more complex pipeline, DSPy provides the flexibility and robustness needed to achieve high performance and reliability.