MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning

Owing to its robust performance and broad applicability, LoRA (Low-Rank Adaptation) is one of the most popular PEFT (Parameter-Efficient Fine-Tuning) methods for fine-tuning large language models. LoRA uses two low-rank matrices to decompose and approximate the weight updates of FFT (Full Fine-Tuning), and the number of trainable parameters can be adjusted by changing the rank of these matrices. A major benefit of this design is that the matrices can be merged into the original weights after fine-tuning, so LoRA adds no inference latency. Furthermore, although recent large language models deliver remarkable performance on in-context learning tasks, certain scenarios still require fine-tuning, and these can be categorized broadly into three types. The first type, instruction tuning, aims to align LLMs better with end tasks and user preferences without substantially enhancing their knowledge and capabilities, an approach that simplifies dealing with varied tasks and complex instructions. The second type covers complex reasoning tasks such as mathematical problem solving. The third type is continual pretraining, which attempts to enhance the domain-specific capabilities of large language models.

In this article, we look at whether low-rank updating limits the performance of LoRA, since it has been observed that the low-rank updating mechanism might hamper the ability of a large language model to learn and memorize new knowledge. Building on this, we discuss MoRA, a new method that achieves high-rank updating while maintaining the same number of trainable parameters by employing a square matrix. To achieve this, MoRA introduces non-parameterized operators that reduce the input dimension and increase the output dimension for the square matrix. Furthermore, these operators ensure that the weight can be merged back into the LLM, which makes MoRA deployable like LoRA.

This article aims to cover the MoRA framework in depth: we explore its mechanism, methodology, and architecture, along with how it compares with state-of-the-art frameworks. So let’s get started.

As the size and capabilities of language models increase, PEFT (Parameter-Efficient Fine-Tuning) is emerging as one of the most popular and efficient ways to adapt LLMs to specific downstream tasks. Compared to FFT (Full Fine-Tuning), which updates all parameters, PEFT modifies only a fraction of them. On some tasks it can achieve performance similar to FFT by updating less than 1% of the total parameters, which significantly reduces the optimizer's memory requirements and makes models easier to store and deploy. Among existing PEFT methods, LoRA is the most popular today, especially for LLMs. One of the major reasons LoRA delivers better performance than PEFT methods such as adapters or prompt tuning is that it uses low-rank matrices to update parameters, and these matrices can be merged into the original model parameters without adding computational cost during inference. Although numerous methods attempt to improve LoRA for large language models, a majority of them rely on GLUE to validate their effectiveness, either by requiring fewer trainable parameters or by achieving better performance.

Furthermore, experiments conducted on LoRA across a wide array of tasks, including continual pretraining, mathematical reasoning, and instruction tuning, indicate that LoRA-based methods perform similarly to one another across these tasks and match FFT-based methods on instruction tuning. However, they fall short of FFT on continual pretraining and mathematical reasoning. A possible explanation is LoRA's reliance on low-rank updates: the low-rank update matrix might struggle to estimate the full-rank updates in FFT, especially in memory-intensive tasks that require memorizing domain-specific knowledge, such as continual pretraining. Since the rank of the low-rank update matrix is smaller than the full rank, it caps the capacity to store new information through fine-tuning. Building on these observations, MoRA attempts to maximize the rank of the update matrix while maintaining the same number of trainable parameters, by employing a square matrix instead of the low-rank matrices used in traditional LoRA-based models. The following figure compares the MoRA framework with LoRA under the same number of trainable parameters.

In the above image, (a) represents LoRA, and (b) represents MoRA. W is the frozen weight from the model, M is the trainable matrix in MoRA, A and B are the trainable low-rank matrices in LoRA, and r represents the rank in LoRA and MoRA. As can be observed, with the same number of trainable parameters, MoRA's square matrix offers greater capacity than LoRA because it can reach a much larger rank. MoRA also introduces non-parameterized operators to reduce the input dimension and increase the output dimension for the trainable matrix M. Because M and these operators can together be replaced by a single update matrix, MoRA can be merged back into the large language model like LoRA. The following table compares the performance of FFT, LoRA, LoRA variants, and MoRA on instruction tuning, mathematical reasoning, and continual pre-training tasks.

MoRA: Methodology and Architecture

The Influence of Low-Rank Updating

The key principle of LoRA-based models is to estimate the full-rank updates of FFT with low-rank updates. For a given pre-trained parameter matrix, LoRA employs two low-rank matrices to calculate the weight update. To ensure the weight update is 0 when training begins, LoRA initializes one of the low-rank matrices from a Gaussian distribution and the other with zeros. The overall weight update in LoRA therefore has a low rank compared to the update in FFT. Low-rank updating nevertheless performs on par with full-rank updating on specific tasks, including instruction tuning and text classification. However, LoRA's performance starts to deteriorate on tasks like continual pretraining and complex reasoning. On the basis of these observations, the MoRA authors hypothesize that low-rank updates make it easy to leverage the original knowledge and capabilities of the LLM to solve a task, but struggle with tasks that require enhancing the knowledge and capabilities of the LLM.
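The initialization and the rank cap described above can be checked with a small, dependency-free sketch. It assumes the usual LoRA convention, which this article does not spell out: A has shape r x k and is drawn from a Gaussian, B has shape d x r and starts at zero, and the update is the product B A. The helper names (`matmul`, `matrix_rank`, `lora_init`) are hypothetical and exist only for illustration.

```python
import random
from fractions import Fraction


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def matrix_rank(M):
    """Exact rank via Gaussian elimination over Fractions."""
    rows = [[Fraction(v) for v in row] for row in M]
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current pivot row with a nonzero entry.
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(rank + 1, len(rows)):
            factor = rows[i][col] / rows[rank][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank


def lora_init(d, k, r, seed=0):
    """Assumed convention: A (r x k) is Gaussian, B (d x r) is all zeros."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(r)]
    B = [[0.0] * r for _ in range(d)]
    return A, B


# At initialization the update B A is exactly zero.
A, B = lora_init(d=6, k=5, r=2)
print(all(v == 0 for row in matmul(B, A) for v in row))  # True

# After "training" (random integers stand in for learned values),
# the 6 x 5 update still cannot exceed rank r = 2.
rng = random.Random(1)
A_trained = [[rng.randint(-3, 3) for _ in range(5)] for _ in range(2)]
B_trained = [[rng.randint(-3, 3) for _ in range(2)] for _ in range(6)]
print(matrix_rank(matmul(B_trained, A_trained)) <= 2)  # True
```

Integers are used for the "trained" matrices so that the rank computation is exact. With floating-point products, rounding could make a rank-2 matrix look full rank under exact arithmetic.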

Methodology

Although LLMs with in-context learning are a major improvement over prior approaches, some scenarios still rely on fine-tuning, and they fall broadly into three categories. The first is instruction tuning, which aligns LLMs with end tasks and user preferences without considerably increasing their knowledge and capabilities. This makes it easier to work with multiple tasks and to follow complicated instructions. The second covers complex reasoning tasks such as mathematical problem solving, where general instruction tuning falls short on complex, symbolic, multi-step reasoning. Most related research aims to improve the reasoning capacity of LLMs, either by building training datasets with larger teacher models such as GPT-4 or by rephrasing questions to match the rationale along a reasoning path. The third type, continual pretraining, is designed to improve the domain-specific abilities of LLMs. Unlike instruction tuning, it requires fine-tuning to enrich the relevant domain-specific knowledge and skills.

Nevertheless, the majority of LoRA variants almost exclusively use instruction tuning or GLUE text classification tasks to evaluate their effectiveness in the context of LLMs. As instruction tuning requires the least capacity of the three fine-tuning types, it may not provide a proper comparison among LoRA variants. More recent works have commonly added reasoning tasks to evaluate their methods better. However, the training sets used are generally small, often too small for LLMs to learn reasoning effectively. For example, some approaches use GSM8K, which has only 7.5K training samples. This falls well short of the SOTA method, which was trained on 395K samples, and it makes it hard to judge how well these methods learn to reason.

Based on the observations about the influence of low-rank updating, MoRA proposes a new method to mitigate its negative effects. The basic principle is to use the same number of trainable parameters to achieve as high a rank as possible in the update matrix. For a pre-trained weight of shape d × k, LoRA uses two low-rank matrices A and B with (d + k)r trainable parameters in total for rank r. For the same number of trainable parameters, however, a square matrix can achieve the highest rank, and MoRA makes this possible by reducing the input dimension and increasing the output dimension around the trainable square matrix. These two functions should be non-parameterized operators that execute in time linear in the dimension.
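To make the parameter budget concrete, here is a minimal sketch. It assumes a weight of shape d x k, so LoRA at rank r trains (d + k) * r parameters, and the largest square matrix that fits the same budget has side floor(sqrt((d + k) * r)). The two operators shown (folding the input by summation and tiling the output) are illustrative, parameter-free, linear-time choices made for this example only, and are not necessarily the operators the MoRA authors use. All function names are hypothetical.

```python
import math


def lora_param_count(d, k, r):
    """Trainable parameters in LoRA: A is r x k and B is d x r."""
    return (d + k) * r


def mora_square_size(d, k, r):
    """Largest side r_hat such that r_hat * r_hat <= (d + k) * r."""
    return math.isqrt((d + k) * r)


def compress(x, r_hat):
    """Illustrative input reduction: fold x into r_hat slots by summing."""
    out = [0] * r_hat
    for i, value in enumerate(x):
        out[i % r_hat] += value
    return out


def decompress(h, d):
    """Illustrative output expansion: repeat h until it has length d."""
    return [h[i % len(h)] for i in range(d)]


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def mora_update(M, x, d):
    """Update term only: decompress(M @ compress(x))."""
    return decompress(matvec(M, compress(x, len(M))), d)


def merged_delta(M, d, k):
    """Build the d x k matrix equivalent to compress -> M -> decompress.

    Column j is the response to the j-th basis vector, which is valid
    because both operators above are linear.
    """
    cols = []
    for j in range(k):
        e_j = [1 if i == j else 0 for i in range(k)]
        cols.append(mora_update(M, e_j, d))
    return [list(row) for row in zip(*cols)]


# A 4096 x 4096 layer with LoRA rank 8.
print(lora_param_count(4096, 4096, 8))  # 65536
print(mora_square_size(4096, 4096, 8))  # 256

# Tiny merge check: the merged matrix reproduces the three-step update.
M = [[1, 2], [3, 4]]
x = [1, 0, 2, 5, -1]
print(mora_update(M, x, 6))  # [12, 26, 12, 26, 12, 26]
print(matvec(merged_delta(M, 6, 5), x) == mora_update(M, x, 6))  # True
```

In the 4096 x 4096 example, the same 65,536 parameters that give LoRA a rank of at most 8 give the square matrix a rank of up to 256. The merge check is what makes the approach deployable like LoRA: since the operators carry no parameters and are linear, the whole update collapses into one d x k matrix that can be added to the frozen weight.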

MoRA: Experiments and Results

To understand the influence of high-rank updating, MoRA is evaluated on three tasks: memorizing UUID pairs, fine-tuning tasks, and pre-training.

Memorizing UUID Pairs

To demonstrate the improvements in performance, the MoRA framework is compared against FFT and LoRA frameworks on memorizing UUID pairs. The training loss from the experiment is reflected in the following image. 

It is worth noting that, for the same number of trainable parameters, MoRA outperforms LoRA, indicating that it benefits from the high-rank updating strategy. The character-level training accuracy at different training steps is summarized in the following table.

As can be observed, MoRA takes fewer training steps than LoRA to memorize the UUID pairs.
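To illustrate what this memorization task measures, here is a hedged sketch of how random UUID pairs could be generated and how character-level accuracy could be scored. The article does not describe the dataset size, prompt format, or exact metric definition, so `make_uuid_pairs` and `char_accuracy` are hypothetical helpers written for this example only.

```python
import random
import uuid


def make_uuid_pairs(n, seed=0):
    """Generate n (key, value) pairs of random UUID strings, reproducibly."""
    rng = random.Random(seed)

    def new_uuid():
        # Build a version-4 UUID from 128 seeded random bits.
        return str(uuid.UUID(int=rng.getrandbits(128), version=4))

    return [(new_uuid(), new_uuid()) for _ in range(n)]


def char_accuracy(predictions, targets):
    """Fraction of target characters reproduced at the same position."""
    correct = 0
    total = 0
    for pred, target in zip(predictions, targets):
        total += len(target)
        correct += sum(p == t for p, t in zip(pred, target))
    return correct / total if total else 0.0


pairs = make_uuid_pairs(3)
targets = [value for _, value in pairs]

# A perfect "model" reproduces every value exactly.
print(char_accuracy(targets, targets))  # 1.0

# Corrupt the first character of the first prediction with a non-hex character.
predictions = ["z" + targets[0][1:]] + targets[1:]
print(char_accuracy(predictions, targets))
```

Because the values are random, nothing in the pre-trained model helps predict them. The only way to score well is to store the pairs in the trainable parameters, which is why the task isolates memorization capacity.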

Fine-Tuning Tasks

MoRA is evaluated on three fine-tuning tasks for large language models: instruction tuning, mathematical reasoning, and continual pre-training, with a high-quality dataset chosen for each task and used for both MoRA and LoRA. The results are presented in the following table.

As can be observed, LoRA and MoRA perform similarly on mathematical reasoning and instruction tuning tasks. However, MoRA comes out ahead of LoRA on continual pre-training in both the biomedical and financial domains, benefitting from the high-rank update approach to memorize new knowledge. It is also worth noting that the three tasks differ from one another and demand different fine-tuning abilities.

Pre-Training

To evaluate the influence of high-rank updating on overall performance, transformers are trained from scratch on the C4 dataset with MoRA, and performance is compared against LoRA and ReLoRA. The pre-training loss and the corresponding perplexity on the C4 dataset are shown in the following figures.

As can be observed, MoRA delivers better pre-training performance than LoRA and ReLoRA with the same number of trainable parameters.

Furthermore, to demonstrate the impact of high-rank updating on the rank of the update matrix, the authors analyze the spectrum of singular values of the learned update matrix from pre-training the 250M model, and the results are shown in the following image.

Final Thoughts

In this article, we have talked about whether low-rank updating limits the performance of LoRA, as it has been observed that the low-rank updating mechanism might hamper the ability of a large language model to learn and memorize new knowledge. Building on this, we discussed MoRA, a new method that achieves high-rank updating while maintaining the same number of trainable parameters by employing a square matrix. To achieve this, MoRA introduces non-parameterized operators that reduce the input dimension and increase the output dimension for the square matrix. These operators also ensure that the weight can be merged back into the LLM, which makes MoRA deployable like LoRA.