Zephyr: Direct Distillation of LLM Alignment

The capability and performance of smaller, open large language models have advanced significantly in recent years, and we have witnessed the progress from early GPT-2 models to more compact, accurate, and effective LLM frameworks that are trained on a considerably larger number of tokens than the “compute-optimal” amount recommended by the Chinchilla scaling laws. Furthermore, developers have demonstrated that these smaller LLM frameworks can be trained further using dSFT, or Distilled Supervised Fine-Tuning, an approach that uses the output of an effective, proprietary teacher model as supervised data for the student model in an attempt to boost accuracy. 

In this article, we will be talking about the Zephyr-7B framework, a 7B-parameter model that sets a new state of the art on chat benchmarks for models of its size without requiring human annotations. The primary aim of the framework is to enable developers to produce smaller large language models that are aligned more closely with user intent than ever before. The Zephyr-7B framework not only examines the application of current approaches for larger LLM frameworks, like dSFT, but also explores the possibility of using other approaches to learn a chat model with better alignment to user intent. We will take a deeper dive into the Zephyr framework, and explore its architecture, workings, and results. So let’s get started. 

As mentioned earlier, language models have progressed rapidly in recent years, from the earlier GPT-2 frameworks to the current GPT-4 and MiniGPT-5 LLM frameworks which, although highly token-intensive, are now more accurate and much more efficient. A major highlight of these advanced LLM frameworks is that they are trained on a significantly higher number of tokens than was earlier considered compute-optimal under the Chinchilla scaling laws. Furthermore, developers and researchers working on LLM frameworks have learned that these smaller LLM frameworks can be trained further using dSFT, or Distilled Supervised Fine-Tuning, an approach that uses the output of an effective teacher model as supervised data for the student model in an attempt to boost accuracy. The distillation strategy has proven itself to be a highly effective and useful tool for maximizing the potential and abilities of open models on a wide array of tasks, although it still cannot replicate the performance achieved by the teacher model. Additionally, users have often reported that these models display “intent misalignment”, meaning the models do not behave in a manner that aligns with the requirements of end users, leading to incorrect responses to user inputs or queries. 

Intent alignment has always been a major challenge for developers, with recent work focusing on the development of benchmarks like AlpacaEval and MT-Bench that are designed to measure this misalignment. The motivation for developing the Zephyr framework can be credited to the problem of aligning a small, open LLM framework entirely through distillation, where the primary step is to utilize AIF, or Artificial Intelligence Feedback, from an ensemble of teacher models to obtain preference data, and then to apply distilled preference optimization directly as the primary learning objective, an approach that is referred to as dDPO, or distilled Direct Preference Optimization. The main highlight of the dDPO approach is that, unlike predecessors such as PPO, or Proximal Policy Optimization, it requires neither human annotations nor sampling from the model during fine-tuning, which also reduces the time it takes to train a language model. Instead, the preference objective is optimized directly on the static feedback data throughout training. 

The developers built the Zephyr-7B framework to validate this approach, and in some ways, it is an aligned version of the state-of-the-art Mistral-7B framework. The framework first applies dSFT, or Distilled Supervised Fine-Tuning, based on the UltraChat dataset, and then applies the dDPO, or distilled Direct Preference Optimization, approach on the feedback data. Experiments indicate that the Zephyr-7B framework, with 7 billion parameters, delivers results comparable to those of human-feedback-aligned chat models with 70 billion parameters. Furthermore, experiments also indicate that results improve both on benchmarks that measure conversational capabilities and on standard academic benchmarks, and that the use of preference learning is critical to achieving these results. 

The above figure demonstrates the performance of various language models on the MT-Bench benchmark. The Zephyr-7B framework trained using the dDPO approach is put up against proprietary as well as open-access, larger language models like GPT-3.5-Turbo, Llama-2-70B, and more, which were trained using additional reinforcement learning and a huge amount of human feedback. As can be clearly seen, despite the sheer difference in the number of parameters these frameworks use, the Zephyr-7B framework delivers comparable results against most of them, and outperforms several frameworks in different domains. 

Zephyr-7B: Method, Working and Architecture

The primary goal of the Zephyr-7B framework is to help an open-source large language model align as closely as possible with user intent, and throughout the process, the Zephyr-7B framework assumes access to a large teacher model that is queried using prompt generation. Zephyr-7B follows an approach similar to the one used in the InstructGPT framework, and aims to generate an effective and accurate student model. 

The following figure briefly demonstrates the three primary steps involved in the working of the Zephyr-7B framework. 

  1. dSFT, or Distilled Supervised Fine-Tuning, on a large-scale dataset constructed in a self-instruct style. 
  2. AIF collection using an ensemble of competing chat models, followed by scoring with GPT-4 and binarization into preferences. 
  3. dDPO of the dSFT model making use of the feedback data. 

dSFT or Distilled Supervised Fine-Tuning

The framework starts with a raw Large Language Model that first needs to be trained to respond to user prompts. Traditionally, training these LLM frameworks to respond to user prompts is done using SFT, or Supervised Fine-Tuning, on a dataset consisting of high-quality instructions and their corresponding responses. Since the Zephyr-7B framework has access to a teacher language model, the framework can generate the instructions and responses itself, and train the model directly on them; this approach is known as dSFT, or distilled SFT. The following figure demonstrates the distillation performed via SFT, where x represents a set of seed prompts constructed with the primary purpose of covering a diverse set of topical domains, y represents a sample response from the teacher that is then refined using a new sample instruction represented by x1, and C represents the final dataset obtained at the end of this process. 
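In the paper's notation, the resulting dataset C of instruction-response pairs is then used for standard supervised fine-tuning of the student, which amounts to maximizing the likelihood of the teacher's responses:

$$\pi_{\text{dSFT}} = \max_{\pi} \; \mathbb{E}_{(x, y) \sim \mathcal{C}} \, \log \pi(y \mid x)$$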

AI Feedback through Preferences

Human feedback is traditionally used to align Large Language Models, as it can provide the required additional signal, and this feedback is usually given in the form of preferences over the quality of the responses generated by the LLM frameworks. However, the Zephyr framework uses AI Feedback from the teacher model on other models’ generated outputs instead of human feedback for distillation purposes. The approach followed by the Zephyr framework is influenced by the one used by the UltraFeedback framework, which uses the teacher model to provide preferences on the outputs of other models. 

Similar to the SFT, or Supervised Fine-Tuning, approach, it starts with a set of prompts, where x represents each individual prompt that is then fed to a collection of four models such as Llama, Falcon, Claude, and more, each of which generates a response of its own. These responses are then fed as input to a teacher model such as GPT-4, which outputs a score for each input response. After collecting the output scores, the framework saves the response with the highest score. 
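As a rough illustration, the sketch below shows this scoring-and-selection loop in Python. The helpers `teacher_score` and `best_of_ensemble` are hypothetical names, and the random placeholder score stands in for an actual call to the teacher model's API.

```python
import random

def teacher_score(prompt: str, response: str) -> float:
    """Hypothetical stand-in for asking the teacher (e.g. GPT-4) to rate a response."""
    rating_prompt = (
        "Rate the following response to the prompt on a scale of 1 to 10.\n"
        f"Prompt: {prompt}\nResponse: {response}\nScore:"
    )
    # Placeholder instead of a real API call that would send `rating_prompt`
    # to the teacher model and parse its numeric reply:
    return random.uniform(1.0, 10.0)

def best_of_ensemble(prompt: str, responses: list[str]) -> str:
    """Score each ensemble response with the teacher and keep the top-scored one."""
    return max(responses, key=lambda r: teacher_score(prompt, r))
```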

dDPO or Distilled Direct Preference Optimization

dDPO is the final step of the Zephyr framework, and its primary goal is to refine the dSFT model by maximizing the likelihood of ranking the preferred response above the rejected one under a preference model determined by a reward function that utilizes the student language model. Previous approaches involving the use of AI feedback have focused primarily on Reinforcement Learning methods like PPO, or Proximal Policy Optimization, to optimize with respect to the generated reward. In those approaches, the reward model is trained first, and updates are then computed by sampling from the current policy. DPO, or Direct Preference Optimization, instead optimizes the preference model directly using the static data. The objective obtained after plugging the reward function into the preference model can be written as:
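$$\pi_{\theta} = \max_{\pi} \; \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \, \log \sigma\!\left( \beta \log \frac{\pi(y_w \mid x)}{\pi_{\text{dSFT}}(y_w \mid x)} - \beta \log \frac{\pi(y_l \mid x)}{\pi_{\text{dSFT}}(y_l \mid x)} \right)$$

where $y_w$ is the preferred (chosen) response, $y_l$ is the rejected response, $\sigma$ is the sigmoid function, $\beta$ is a scaling hyperparameter, and $\pi_{\text{dSFT}}$ is the frozen dSFT model used as the reference. In practice, this objective reduces to a simple loss over preference pairs; the minimal PyTorch sketch below assumes the per-sequence log-probabilities under the policy and the frozen reference model have already been computed, and uses β = 0.1 as a typical value rather than a prescribed one.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss given per-example sequence log-probabilities."""
    # Log-ratios of the trained policy against the frozen dSFT reference.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Negative log-sigmoid of the scaled margin, averaged over the batch.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```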

Zephyr-7B: Experiments, Benchmarks and Results

The Zephyr framework conducts its fine-tuning experiments on the current state of the art Mistral-7B framework that delivers comparable performance to much larger language models on a wide array of natural language processing or NLP tasks. 

Datasets

The Zephyr framework makes use of two dialogue datasets that have been distilled from a mixture of proprietary and open models, and that have previously proved effective for producing strong chat models. 

UltraChat

UltraChat is a self-refinement dataset that consists of nearly 1.5 million multi-turn dialogues spread over 30 topics and 20 types of text material, generated by the GPT-3.5-Turbo framework. To tackle the incorrect capitalization found in the UltraChat dataset, the framework applies truecasing heuristics to remove these grammatical errors. 
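The exact heuristics are not spelled out here; purely as an illustration, a naive truecasing pass over a dialogue turn might look something like the sketch below.

```python
import re

def naive_truecase(text: str) -> str:
    """Very rough truecasing: capitalize sentence starts and the pronoun 'i'."""
    def cap(match: re.Match) -> str:
        return match.group(1) + match.group(2).upper()
    # Capitalize the first letter of the string and of each new sentence.
    text = re.sub(r"(^|[.!?]\s+)([a-z])", cap, text)
    # Capitalize the standalone pronoun "i".
    return re.sub(r"\bi\b", "I", text)

print(naive_truecase("hello there. i think this mostly works, doesn't it? yes."))
# -> "Hello there. I think this mostly works, doesn't it? Yes."
```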

UltraFeedback

UltraFeedback is a prompt dataset with over 64k prompts, each of which has four individual LLM responses rated by GPT-4. The Zephyr framework constructs binary preferences by taking the response with the highest mean score as the chosen one, while one of the remaining three LLM responses is selected at random as the rejected one. 
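Concretely, the binarization step can be sketched as follows; the record schema here (a prompt plus a list of responses with mean scores) is a simplified assumption rather than the dataset's exact field names.

```python
import random

def binarize(record: dict) -> dict:
    """Turn one UltraFeedback-style record into a (chosen, rejected) pair."""
    # Assumed simplified schema:
    # {"prompt": str, "responses": [{"text": str, "mean_score": float}, ...]}
    ranked = sorted(record["responses"], key=lambda r: r["mean_score"], reverse=True)
    chosen = ranked[0]["text"]                    # highest mean score
    rejected = random.choice(ranked[1:])["text"]  # random non-best response
    return {"prompt": record["prompt"], "chosen": chosen, "rejected": rejected}
```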

Evaluation

To evaluate the performance of the Zephyr framework, the developers opted for two chat benchmarks, one single-turn and one multi-turn, in an attempt to evaluate the model's ability to follow user instructions and respond accordingly. 

MT-Bench

The MT-Bench evaluation benchmark consists of 160 questions spread over eight unique knowledge areas; under the MT-Bench benchmark, the model has to answer an initial question and then provide a response to a predefined follow-up question. 

AlpacaEval

AlpacaEval is a single-turn benchmark under which the model or framework generates responses to over 800 questions spread across different topics, with the primary focus being on helpfulness. 

In addition to these two primary benchmarks, the Zephyr-7B framework is also evaluated on the Open LLM Leaderboard, which covers multiclass classification tasks such as ARC, HellaSwag, MMLU, and more. Furthermore, regardless of which benchmark the Zephyr-7B framework is evaluated on, it is compared against a range of proprietary and open models, with their alignment procedures being the only differentiating factor. 

Results

Let’s now have a look at how the Zephyr-7B framework performs and how it compares against current state-of-the-art language models. 

Implementation of dDPO Approach Boosts Chat Capabilities

The following table compares the performance of the Zephyr-7B framework against state of the art language models on the AlpacaEval, and MT-Bench benchmarks. 

As can be clearly seen, when put against open 7B models, the Zephyr-7B framework not only significantly outperforms dSFT models across the two benchmarks, but also sets a new state of the art. Furthermore, the Zephyr-7B framework also manages to outscore the XWIN-LM-7B framework, one of the rare models trained with the dPPO, or distilled PPO, approach. Moreover, the performance delivered by the Zephyr-7B framework is comparable to the results delivered by much larger language models like Llama2-Chat with 70B parameters. 

dDPO Boosts Academic Task Performance

The following figure compares the performance of the Zephyr-7B framework against a wide array of open-source, and proprietary LLM frameworks. 

As can be seen, the Zephyr-7B framework significantly outperforms other 7B-parameter LLM frameworks, and the gap between its performance and that of the best-performing dSFT models is also noticeable. As the number of parameters increases, the Zephyr-7B framework does fall short, although it matches the performance delivered by frameworks with 40 billion parameters. 

Preference Optimization

In the following figure, we evaluate how the different steps followed in the alignment process impact performance. As can be observed, the dDPO approach, when combined with dSFT, significantly boosts performance on both the MT-Bench and AlpacaEval benchmarks. 

Finally, in the following figure we can see the training and test accuracies during DPO training. As can be seen, the DPO approach does not hurt the model's performance on downstream tasks. 

Conclusion

In this article, we have talked about the Zephyr-7B framework, which is based on the current state-of-the-art Mistral-7B framework and aims to solve the challenge of distilling alignment from a large language model into a much smaller pretrained framework. The primary aim of the framework is to enable developers to produce smaller large language models that are aligned more closely with user intent than ever before. The Zephyr-7B framework not only examines the application of current approaches for larger LLM frameworks, like dSFT, but also explores the possibility of using other approaches to learn a chat model with better alignment to user intent.

However, despite the promising results, the Zephyr-7B framework is not perfect, and some work still needs to be done. One of the obvious limitations is the use of GPT-4 as the evaluator for the MT-Bench and AlpacaEval benchmarks, since it has often been shown to be biased towards models distilled from it. Nevertheless, the Zephyr-7B framework hopes to pave the way for exploring the capabilities of smaller open models that are capable of aligning with user intent and interactions.