
Transformers Optimization: Part 1 - KV Cache

Image by Martin Adams

In this Transformers Optimization series, we will explore various optimization techniques for Transformer models. As a kickoff piece, we dive deep into the KV Cache, an inference optimization technique that significantly enhances the inference performance of large language models. What is KV Cache? A common technique for improving the inference performance of large models is to reuse the KV cache from the previous inference pass. Reusing this cache improves inference performance and reduces end-to-end latency without affecting accuracy....

October 7, 2023 · 8 min · Rajan Ghimire
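As a taste of the idea, here is a minimal single-head sketch of KV caching in PyTorch. The projection matrices `W_q`, `W_k`, `W_v`, the dimensions, and the loop structure are illustrative assumptions, not the post's actual implementation: each token's key and value are computed once and appended to a cache, so attention at each step needs only one fresh query.

```python
import torch

def attend(q, k, v):
    # Scaled dot-product attention of the new query over all cached positions.
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

d = 16
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))  # hypothetical head weights

k_cache, v_cache = [], []            # the KV cache, grown one entry per step
for step in range(4):
    x = torch.randn(1, d)            # hidden state of the newest token only
    q = x @ W_q                      # fresh query for the new token
    k_cache.append(x @ W_k)          # K and V are computed once per token...
    v_cache.append(x @ W_v)
    k = torch.cat(k_cache, dim=0)    # ...and reused from the cache afterwards,
    v = torch.cat(v_cache, dim=0)    # avoiding recomputation over the prefix
    out = attend(q, k, v)            # per-step attention cost is O(t), not O(t^2)
print(out.shape)                     # torch.Size([1, 16])
```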

Decoding Strategies in Language Models

The Auto-regression and Decoding Strategies Auto-regressive language generation assumes that the element of the output sequence at time-step $t$ is determined by the input sequence and the time-steps before $t$: $$ P\left(w_{1: T} \mid W_0\right)=\prod_{t=1}^T P\left(w_t \mid w_{1: t-1}, W_0\right), \text { with } w_{1: 0}=\emptyset $$ where $W_0$ is the input sequence, $w_t$ is the word at time-step $t$, and $T$ is determined by the position of the EOS token. source Language models, especially those in the GPT and LLaMA families, are auto-regressive....

September 15, 2023 · 14 min · Rajan Ghimire
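To make the factorization concrete, here is a minimal greedy-decoding sketch in PyTorch. The toy embedding-plus-linear "model" and all names (`greedy_decode`, `toy_lm`, `eos_id`) are hypothetical stand-ins for a real language model: each iteration conditions on the prompt $W_0$ plus all previously generated tokens, and emitting EOS fixes the length $T$.

```python
import torch

def greedy_decode(logits_fn, prompt_ids, max_new_tokens, eos_id):
    # ids holds W_0 followed by every generated w_t.
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = logits_fn(torch.tensor(ids))  # models P(w_t | w_{1:t-1}, W_0)
        next_id = int(logits[-1].argmax())     # greedy: most likely next token
        ids.append(next_id)
        if next_id == eos_id:                  # emitting EOS fixes the length T
            break
    return ids

# Toy stand-in for a language model: embedding + linear head over the vocabulary.
vocab, dim = 100, 32
emb = torch.nn.Embedding(vocab, dim)
head = torch.nn.Linear(dim, vocab)
toy_lm = lambda ids: head(emb(ids))

print(greedy_decode(toy_lm, prompt_ids=[1, 2, 3], max_new_tokens=10, eos_id=0))
```

Greedy argmax is only one decoding strategy; the post also covers sampling-based alternatives that draw from the same per-step distribution.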

Supercharge Your LLaMA: Fine-Tuning Made Effortless and Efficient 🚀

In this blog, we’ll cover the core concepts behind LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention, diving into its use of zero-init attention and how it blends in new instructional cues without compromising pre-existing knowledge. We will also cover the practical implementation of the LLaMA-Adapter. To facilitate understanding, we first cover concepts like Prompt Tuning, Prefix Tuning, and Adapters, which collectively form the core of LLaMA-Adapter, empowering it with unique capabilities and efficiencies....

September 8, 2023 · 13 min · Rajan Ghimire
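As a rough illustration of the zero-init idea, the sketch below uses a scalar `tanh`-squashed gate over precomputed attention outputs; this is a simplified, hypothetical stand-in, not LLaMA-Adapter's actual per-head gating inside attention. It shows why training starts from the frozen model's behaviour: the gate is zero at initialization, so the adaptation-prompt signal contributes nothing until it is learned.

```python
import torch
import torch.nn as nn

class ZeroInitGate(nn.Module):
    """Mix attention over new adaptation prompts into the frozen model's
    attention through a learnable gate that starts at zero (a simplified,
    hypothetical stand-in for LLaMA-Adapter's per-head zero-init gating)."""
    def __init__(self):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(1))  # zero init: no effect at step 0

    def forward(self, base_attn_out, prompt_attn_out):
        # At initialization tanh(0) = 0, so the output equals the frozen
        # model's output; training gradually admits the prompt signal.
        return base_attn_out + torch.tanh(self.gate) * prompt_attn_out

gate = ZeroInitGate()
base = torch.randn(1, 10, 64)    # attention output over the original sequence
prompt = torch.randn(1, 10, 64)  # attention output over adaptation prompts
out = gate(base, prompt)
print(torch.allclose(out, base))  # True: the gate starts fully closed
```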

The Secret Sauce of LLaMA🦙 : A Deep Dive!

Modern Llama. Generated using Stable Diffusion v2

In the information era, a new king reigns supreme: the language model. With the internet drowning in a never-ending flood of data, there is a growing demand for intelligent machines that can not only absorb this data but also produce, analyze, and interact with it in previously unthinkable ways. Enter LLaMA, Meta’s Large Language Model, a monument to artificial intelligence’s current peak. But what lurks underneath the many layers of this behemoth?...

August 20, 2023 · 34 min · Rajan Ghimire

LoRA (Low-Rank Adaptation): A Deeper Dive

LoRA is a fast fine-tuning approach developed by Microsoft researchers for adapting huge models to specific tasks and datasets. The idea behind LoRA is that a single pretrained LLM can serve various tasks, with different internal features handling each task; by identifying the appropriate features from the pool of many and amplifying them, we can obtain better outcomes for specific tasks. Fine-tuning Let $L$ = the loss function, and $X, y$ = the input and output data....

March 6, 2023 · 7 min · Rajan Ghimire
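A minimal sketch of the low-rank idea in PyTorch (the layer name, rank $r = 4$, and the $\alpha$ scaling are illustrative choices, not the post's code): the pretrained weight stays frozen while a zero-initialized product $BA$ carries the task-specific update, so the adapted layer computes $Wx + \frac{\alpha}{r} BAx$.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained weight W plus a trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, d_in, d_out, r=4, alpha=8):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.W.weight.requires_grad_(False)           # pretrained weight: frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # Gaussian init
        self.B = nn.Parameter(torch.zeros(d_out, r))  # zero init: update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        # Only A and B (r * (d_in + d_out) parameters) receive gradients.
        return self.W(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(d_in=128, d_out=128, r=4)
x = torch.randn(2, 128)
print(layer(x).shape)  # torch.Size([2, 128]); output equals W @ x at init
```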