For LLMs to become ubiquitous and find their way onto low-resource CPUs and mobile phones, which seems almost certain to happen soon, it is crucial to break free from the reliance on high-compute GPUs to train them. And this shift appears to have already begun, thanks to recent open-source models that match GPT-3.5 (GPT-4 in some cases) in performance, as well as some phenomenal advances in fine-tuning them.
So if you, like me, have come across ever more AI jargon (it's getting out of hand) like PEFT, LoRA, QLoRA, and quantization, and wondered what in the world they mean, these next few posts should hopefully help.
But first, what are the problems with fine-tuning LLMs?
LLMs contain huge numbers of parameters, which leads to two problems: training them requires far more compute, and saving the weights associated with all those parameters produces enormous file sizes.
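To get a feel for the scale, here is a rough back-of-the-envelope sketch (the 3-billion-parameter figure is just an illustrative model size, and these are ballpark numbers, not measurements of any specific model):

```python
# Rough memory footprint of fully fine-tuning a 3B-parameter model in fp32.
params = 3e9
bytes_per_param = 4  # fp32

weights_gb = params * bytes_per_param / 1e9  # ~12 GB just to hold the weights
gradients_gb = weights_gb                    # one gradient per weight
adam_states_gb = 2 * weights_gb              # Adam keeps two moment estimates per weight

print(f"weights:   {weights_gb:.0f} GB")
print(f"gradients: {gradients_gb:.0f} GB")
print(f"optimizer: {adam_states_gb:.0f} GB")
print(f"total:     {weights_gb + gradients_gb + adam_states_gb:.0f} GB")  # ~48 GB before activations
```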
As touched on in this AI Glossary entry, Parameter-Efficient Fine-Tuning (PEFT) aims to fine-tune LLMs (or even image generation models like Stable Diffusion) to enhance their performance while optimizing for resources like time, energy, and computational power.
PEFT focuses on adjusting a small subset of key parameters (typically 15-20% of the weights), largely maintaining the original pre-trained model structure.
For example, fine-tuning a 3-billion-parameter model on consumer hardware with 11 GB of RAM becomes feasible with PEFT.
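As a minimal sketch of the core idea (using a toy stand-in rather than a real LLM), freezing the pre-trained weights and training only a small add-on looks like this in PyTorch:

```python
import torch.nn as nn

# Toy stand-in for a pre-trained model (a real LLM would have billions of params).
base_model = nn.Sequential(nn.Linear(512, 512), nn.Linear(512, 512))

# Freeze every pre-trained parameter...
for p in base_model.parameters():
    p.requires_grad = False

# ...and attach a small trainable component; only its weights get gradient updates.
head = nn.Linear(512, 10)

trainable = sum(p.numel() for p in head.parameters())
total = sum(p.numel() for p in base_model.parameters()) + trainable
print(f"trainable fraction: {100 * trainable / total:.1f}%")
```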
Techniques used in PEFT include adapters, low-rank adaptation, prefix tuning, and prompt tuning. These methods each concentrate on fine-tuning specific parts of the model.
PEFT allows for the creation of "lightweight tiny checkpoints": small sets of trained parameters that let the model switch between tasks efficiently.
PEFT also mitigates the problem of "catastrophic forgetting", where a fine-tuned model loses its original pre-trained capabilities due to the weight changes made during fine-tuning, an issue far more likely with full fine-tuning. Since PEFT only updates a small subset of parameters, it's more robust against this effect.
This approach is crucial for making AI both efficient and widely accessible, allowing for optimized performance with minimal resource use. At inference time, the new parameters are combined with the original LLM weights.
Another benefit is that PEFT weights can be trained for different tasks on the same base model and can be easily swapped out for inference, allowing efficient adaptation of the original model to multiple tasks.
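In practice, with Hugging Face's peft library this looks roughly like the sketch below (the base model name and adapter paths are hypothetical placeholders):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model once; the names/paths below are placeholders.
base = AutoModelForCausalLM.from_pretrained("my-org/base-llm")

# Attach a task-specific PEFT checkpoint (a few MB, not a full model copy).
model = PeftModel.from_pretrained(base, "my-org/summarization-adapter",
                                  adapter_name="summarize")

# Load a second adapter on the same base model and switch between tasks at will.
model.load_adapter("my-org/qa-adapter", adapter_name="qa")
model.set_adapter("qa")         # route inference through the QA adapter
model.set_adapter("summarize")  # ...or back to summarization
```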
PEFT Techniques
Let's briefly go through the main classes of PEFT techniques. As you would expect, there are significant trade-offs between the various PEFT methods, spanning factors like parameter efficiency, compute efficiency, and inference costs.

There are 3 main classes:
Selective methods: identify which parameters you want to update and train only certain components of the model, specific layers, or even individual parameter types
Re-parameterization methods: include LoRA, where the original weights are frozen and small rank-decomposition matrices are trained alongside the attention weight matrices, significantly reducing the number of weights to update during training (see the sketch after this list)
Additive methods: carry out fine-tuning by keeping all of the original LLM weights frozen and introducing new trainable components.
Adapter methods add new trainable layers to the architecture of the model, typically inside the encoder or decoder components after the attention or feed-forward layers.
Soft prompt methods keep the model architecture fixed and frozen, and focus on manipulating the input to achieve better performance. This can be done by:
- adding trainable parameters to the prompt embeddings, or
- keeping the input fixed and retraining the embedding weights
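To make the re-parameterization idea concrete, here is a tiny numerical sketch of the LoRA decomposition (the dimensions and rank are chosen purely for illustration):

```python
import torch

d = 4096   # hidden dimension of an attention weight matrix (illustrative)
r = 8      # LoRA rank, r << d

# Full update: one d x d matrix of trainable weights.
full_params = d * d                   # 16,777,216

# LoRA: the frozen weight W gets a trainable low-rank update B @ A.
A = torch.randn(r, d)
B = torch.zeros(d, r)                 # B starts at zero, so training begins from W unchanged
lora_params = A.numel() + B.numel()   # 65,536 -- roughly 0.4% of the full matrix

delta_W = B @ A                       # same d x d shape, built from far fewer parameters
print(f"full: {full_params:,}  lora: {lora_params:,}  "
      f"ratio: {lora_params / full_params:.2%}")
```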
LoRA is broadly used in practice because its performance is comparable to full fine-tuning for many tasks and datasets, so stay tuned for a deep dive into LoRA in the next post.