Zephyr-7B: Hugging Face’s Hyper-Optimized LLM Built on Top of Mistral 7B

Introduction

The evolution of open large language models (LLMs) has significantly impacted the AI research community, particularly in developing chatbots and similar applications. Following the release of models like LLaMA, there’s been a surge in research on efficient fine-tuning, extended prompt handling, retrieval augmented generation (RAG), and quantization.

The LLaMA model, for instance, marked a new era in fine-tuning and prompt contextualization, paving the way for subsequent models like MosaicML’s MPT, Together AI’s RedPajama-INCITE, TII’s Falcon, and Meta’s Llama 2. Each of these models contributes unique capabilities, enhancing the overall functionality and scope of LLMs.

Mistral AI, a Paris-based startup founded by former Google DeepMind and Meta employees, made a name for itself with its first model: Mistral 7B.

Mistral 7B’s edge lies in its efficiency, delivering similar or enhanced capabilities compared to peers like Llama 2 but with less computational demand.

Specifically tuned for instructional tasks, Mistral 7B Instruct shines on platforms like Hugging Face, where it surpasses other models of the same size and competes closely with those having nearly double its parameters.

Building on this, Hugging Face introduced Zephyr 7B Alpha, showcasing that a fine-tuned Mistral 7B can indeed surpass the abilities of significantly larger chat models and, in some tasks, even rival GPT-4. The “Alpha” was just the beginning, as Zephyr 7B Beta followed shortly.

This article will explore how Zephyr 7B leverages the power of larger models to refine its ability to respond and align with human instruction, a process made possible through the technique of knowledge distillation. This method involves training smaller models on the complex patterns learned by larger ones, reducing training demands without sacrificing language modeling capabilities. We’ll delve into the specifics of Hugging Face’s knowledge distillation approach.

Knowledge distillation

A key innovation in developing models like Zephyr-7B is distilled supervised fine-tuning (dSFT). This method involves using the output from a larger, more capable ‘teacher’ model to train a smaller ‘student’ model, enhancing its accuracy. While distillation improves open models on various tasks, a gap in performance compared to teacher models still exists.

Knowledge distillation is a method in machine learning where a compact model, referred to as the “student,” is taught to replicate the performance of a larger, more complex “teacher” model. This technique enables the student to perform tasks that were previously beyond its capacity by transferring the intricate patterns learned by the teacher.

Knowledge Distillation | Teacher-Student Model

The student model trains on the output probabilities or features generated by the teacher model, focusing on matching these outputs rather than just the final predictions. This allows the student to learn the nuanced decision-making processes of the teacher, often resulting in improved performance over training with only the ground truth data.
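To make the idea of matching output probabilities concrete, here is a minimal sketch of the classic soft-target distillation loss: the KL divergence between temperature-softened teacher and student distributions. The logits and the temperature value are made-up illustrations, and this is the generic textbook formulation rather than the recipe used for Zephyr-7B, which instead trains on text generated by the teacher.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by the given temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) computed on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits over a three-token vocabulary.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 1.0, 4.0]

print(distillation_loss(close_student, teacher))
print(distillation_loss(far_student, teacher))
```

A student whose distribution is close to the teacher’s gets a small loss, while one that disagrees about the most likely token gets a much larger one. A higher temperature spreads probability mass over more tokens, which exposes more of the teacher’s relative preferences to the student.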

Historically, knowledge distillation has been utilized in models like Hinton’s original distillation networks, and more recently in NLP with models such as DistilBERT, which distilled the BERT model into a smaller, faster version that retains most of the original’s language understanding capabilities. Another example is TinyBERT, which goes further in optimizing the size and speed for mobile or edge devices.

In the case of Zephyr-7B, knowledge distillation is used to imbue a smaller 7B parameter model with the capabilities of its larger counterparts. By doing so, Zephyr-7B achieves a balance between performance and efficiency, making it suitable for environments where computational resources are limited, without sacrificing the quality of interaction and understanding.

In developing Zephyr-7B, researchers tackled the challenge of aligning a small open LLM entirely through distillation. They introduced an approach called distilled direct preference optimization (dDPO), which uses AI Feedback from an ensemble of teacher models as preference data. This method, requiring no human annotation, significantly reduces the time and resources needed for model training.

Constructing Zephyr-7B

To validate dDPO, the researchers built Zephyr-7B, an aligned version of the Mistral-7B model, in three steps:

  1. dSFT using the UltraChat dataset: Distilled supervised fine-tuning (dSFT) trains a language model on the outputs of a larger, more capable “teacher” model. It starts from a raw LLM that must first learn to respond to user prompts. Traditional supervised fine-tuning (SFT) uses a fixed dataset of instructions and responses; dSFT instead has the teacher model generate both the instructions and the responses, following the self-instruct approach. The process starts with a set of seed prompts $x^0_1, x^0_2, \ldots, x^0_J$ covering diverse topics. Each prompt is refined iteratively: for a seed prompt $x^0$, the teacher generates a response $y^0$, and a new instruction $x^1$ is then sampled conditioned on $x^0$ and $y^0$. The final dataset $C = \{(x_1, y_1), \ldots, (x_J, y_J)\}$ is used to fine-tune the model.
  2. Incorporating AI feedback data from UltraFeedback: This data was crucial for refining the model’s responses. In this step, each prompt (for example, a request to describe how to make chocolate brownies) is answered by several different models, and the responses are then scored by a more advanced model such as GPT-4. The highest-scoring response $y_w$ and a randomly chosen lower-scoring response $y_l$ are stored with the prompt as a triple $(x, y_w, y_l)$ in the feedback dataset $D$.
  3. Applying dDPO: The last phase, distilled direct preference optimization (dDPO), refines the dSFT model by maximizing the probability that the preferred response $y_w$ is ranked above $y_l$. The preference model uses a reward function $r_\theta(x, y)$, which is expressed in terms of the optimal LLM policy $\pi^*$ and the original policy $\pi_{\text{dSFT}}$. The optimization objective is

     $$\pi_\theta = \max_{\pi} \; \mathbb{E}_{(x, y_w, y_l) \sim D} \, \log \sigma\left( \beta \log \frac{\pi(y_w \mid x)}{\pi_{\text{dSFT}}(y_w \mid x)} - \beta \log \frac{\pi(y_l \mid x)}{\pi_{\text{dSFT}}(y_l \mid x)} \right)$$

     This simplifies training: optimization starts from the dSFT version of the model and iterates through each AI feedback (AIF) triple $(x, y_w, y_l)$.
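The data-construction parts of steps 1 and 2 can be sketched in a few lines of Python. Everything here is a toy stand-in: `toy_teacher_respond`, `toy_teacher_refine`, the three candidate models, and the length-based `toy_score` are hypothetical placeholders for calls to real models and to a GPT-4-style judge. The sketch also assumes that the final response in each dSFT pair is the teacher’s answer to the refined instruction.

```python
import random

def build_dsft_dataset(seed_prompts, teacher_respond, teacher_refine):
    """Step 1: refine each seed prompt once and pair it with a teacher response."""
    dataset = []
    for x0 in seed_prompts:
        y0 = teacher_respond(x0)      # teacher answers the seed prompt
        x1 = teacher_refine(x0, y0)   # new instruction conditioned on (x0, y0)
        y1 = teacher_respond(x1)      # assumption: teacher answers the refined prompt
        dataset.append((x1, y1))
    return dataset

def build_preference_triples(prompts, candidate_models, score, rng):
    """Step 2: keep the best response and one random lower-scoring response."""
    triples = []
    for x in prompts:
        responses = [model(x) for model in candidate_models]
        ranked = sorted(responses, key=lambda y: score(x, y), reverse=True)
        y_w = ranked[0]               # highest-scoring response
        y_l = rng.choice(ranked[1:])  # random response from the rest
        triples.append((x, y_w, y_l))
    return triples

# Hypothetical stand-ins for real model calls.
def toy_teacher_respond(prompt):
    return f"Answer to: {prompt}"

def toy_teacher_refine(prompt, response):
    return f"{prompt} Explain step by step."

def model_a(prompt):
    return f"short reply to {prompt}"

def model_b(prompt):
    return f"a somewhat longer reply to {prompt}"

def model_c(prompt):
    return f"a much longer and more detailed reply to {prompt}"

def toy_score(prompt, response):
    return len(response)  # toy judge only; a real pipeline would query a stronger model

seed_prompts = ["How do I make chocolate brownies?"]
dsft_data = build_dsft_dataset(seed_prompts, toy_teacher_respond, toy_teacher_refine)
triples = build_preference_triples(
    seed_prompts, [model_a, model_b, model_c], toy_score, random.Random(0)
)
print(dsft_data[0])
print(triples[0])
```

Picking the rejected response at random from the lower-scoring candidates, rather than always taking the worst one, follows the description in step 2 and keeps the chosen and rejected pairs varied.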

The method used in Zephyr-7B mirrors the processes utilized in InstructGPT.
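The dDPO objective from step 3 reduces to a simple per-example loss once the log-probabilities of each response are known. The sketch below computes that loss for a single (x, yw, yl) triple using plain Python. The log-probability values and `beta=0.1` are illustrative assumptions, and a real implementation would obtain the log-probabilities by summing token log-likelihoods under the trained policy and the frozen dSFT model.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Negative log-likelihood that y_w is preferred over y_l for one triple.

    policy_logp_* are log pi(y|x) under the model being trained;
    ref_logp_* are log pi_dSFT(y|x) under the frozen dSFT model.
    """
    chosen_ratio = policy_logp_w - ref_logp_w    # log pi(y_w|x) / pi_dSFT(y_w|x)
    rejected_ratio = policy_logp_l - ref_logp_l  # log pi(y_l|x) / pi_dSFT(y_l|x)
    return -log_sigmoid(beta * (chosen_ratio - rejected_ratio))

# Hypothetical sequence log-probabilities for one preference triple.
ref_w, ref_l = -42.0, -40.0

# At initialization the policy equals the dSFT model, so the margin is zero.
print(dpo_loss(ref_w, ref_l, ref_w, ref_l))

# After some training the policy raises y_w and lowers y_l relative to the reference.
print(dpo_loss(-38.0, -45.0, ref_w, ref_l))
```

Minimizing this loss is equivalent to maximizing the objective above. The loss starts at log 2 when the policy and the dSFT model agree, and it falls as the policy assigns relatively more probability to the preferred response.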

Remarkably, Zephyr-7B achieves performance comparable to much larger 70B-parameter models aligned with human feedback. It excels in both academic benchmarks and conversational capabilities, highlighting the effectiveness of preference learning in model development. For further exploration, models, code, and instructions are available at Hugging Face’s GitHub Repository.

Addressing the Challenge of Intent Alignment

A notable concern with LLMs has been their alignment with human intent. Previous models often failed to produce responses that matched user preferences, leading to inaccurate or irrelevant answers. However, recent benchmarks like MT-Bench and AlpacaEval have provided tools to quantify and improve this aspect, highlighting the superior performance of proprietary models trained with human feedback over those trained solely via distillation.

Evaluation Methods

The evaluation of Zephyr 7B involved rigorous testing across benchmarks that assess a model’s conversational abilities in both single and multi-turn contexts:

  • MT-Bench: This multi-turn benchmark poses 160 questions spanning eight domains, arranged as 80 conversations that each have an opening question and a predefined follow-up. Each response is rated by GPT-4 on a scale of 1 to 10, and the model’s final score is the mean over the two turns.
  • AlpacaEval: In this single-turn benchmark, the model is presented with 805 questions across various subjects. The focus here is on the model’s helpfulness: GPT-4 compares each response with that of a baseline model, and the result is reported as a win rate.
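Both metrics are simple aggregates of judge decisions. The sketch below shows one way to compute them, assuming the judge’s ratings and pairwise verdicts have already been collected. The numbers are invented for illustration, and the real benchmark tooling handles details that are not modeled here.

```python
from statistics import mean

def mt_bench_score(turn1_ratings, turn2_ratings):
    """Average the per-turn mean ratings to get a single MT-Bench-style score."""
    return mean([mean(turn1_ratings), mean(turn2_ratings)])

def win_rate(judge_prefers_model):
    """Percentage of prompts where the judge preferred the model over the baseline."""
    return 100.0 * sum(judge_prefers_model) / len(judge_prefers_model)

# Hypothetical judge ratings (1-10) for the first and second turn of two conversations.
turn1 = [8.0, 6.0]
turn2 = [7.0, 5.0]
print(mt_bench_score(turn1, turn2))

# Hypothetical pairwise verdicts on four prompts.
verdicts = [True, True, False, True]
print(win_rate(verdicts))
```

Because both scores depend entirely on the judge model, any systematic preference the judge has carries straight through to the final number.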

Additionally, Zephyr 7B was tested on the Open LLM Leaderboard, which, while not a direct assessment of conversational skills, offers insights into the model’s reasoning and truthfulness post-fine-tuning.

Zephyr 7B was compared with a variety of open and proprietary models of different sizes and alignment methods. It set a new state of the art for 7B models on MT-Bench and AlpacaEval and was competitive with larger models, validating the effectiveness of distilled direct preference optimization (dDPO) in training.

Both the SFT and DPO phases were run for multiple epochs, with learning rates and batch sizes tuned for each phase. The final Zephyr model performed well on practical tasks and academic benchmarks alike, and overfitting during training did not appear to hurt its downstream performance.

Datasets and Results

Datasets Utilized

Performance and Outcomes

The chart below compares the performance of Zephyr 7B across task categories with that of other models such as GPT-3.5-turbo, Claude 1, GPT-4, and Llama-2-70b-chat. The eight MT-Bench categories are Writing, Humanities, Roleplay, Reasoning, STEM, Extraction, Coding, and Math.

From the chart, we can infer which domains Zephyr 7B excels in and which domains might need further improvement. For instance, if Zephyr’s line stretches further out on the Writing axis compared to others, it suggests that Zephyr is particularly strong in generating written content. Conversely, if the line is closer to the center on the Math axis, it may indicate a relative weakness in solving math problems.

The radar chart helps in identifying the strengths and weaknesses of Zephyr 7B, providing a visual representation of where it stands against larger models like GPT-4 and specialized models like Llama-2-70b-chat.

Model Performance Radar Chart

The comparison below covers various language models on two benchmarks: MT-Bench and AlpacaEval. For each model it lists the size, the alignment method (such as dSFT for distilled supervised fine-tuning or dDPO for distilled direct preference optimization), and the benchmark scores. Zephyr stands out with high scores on both benchmarks, indicating its effectiveness in generating aligned responses.

MT-Bench and AlpacaEval

Conclusion

In conclusion, the development of Zephyr-7B demonstrates that alignment and distillation of conversational capabilities from a large language model (LLM) onto a smaller model can be achieved without reliance on sampling-based methods. By employing direct preference optimization (DPO) with AI feedback, Zephyr-7B leverages the strong foundation of Mistral-7B to set a new benchmark for 7B parameter chat models, showcasing the ability of smaller, open-source models to understand and respond to user intent effectively.

However, this study is not without its limitations. The reliance on GPT-4 as an evaluator for benchmarks introduces a bias towards models that are distilled from it, and potentially towards verbose responses over accurate ones. Additionally, the scalability of this method to larger models, such as Llama 2 70B, and its impact on performance gains remain areas for further research. These limitations highlight the need for continuous innovation and the development of unbiased evaluation methods in the AI community.

Looking beyond the study, it’s evident that the potential for smaller models to perform at the level of larger counterparts can democratize AI, allowing for more accessible and efficient use in various applications. The success of Zephyr-7B encourages further exploration into open-source models, which can accelerate advancements in AI by fostering collaborative research and development.