
Transformers Optimization: Part 1 - KV Cache

Image by Martin Adams. In this Transformers Optimization series, we will explore various optimization techniques for Transformer models. As a kickoff piece, we dive deep into the KV cache, an inference-time optimization that significantly enhances the performance of large language models. What is the KV cache? A common technique for speeding up large-model inference is to reuse the keys and values computed during previous decoding steps. Reusing this cached KV state improves inference throughput and reduces end-to-end latency without affecting accuracy....

October 7, 2023 · 8 min · Rajan Ghimire
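The reuse the excerpt describes can be sketched with a toy single-head attention loop: at each decode step, only the new token's key and value are projected and appended to a cache, and the result matches a full recompute over the prefix. This is a minimal NumPy illustration, not the post's implementation; all names and shapes here are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 8  # toy model dimension
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    # Scaled dot-product attention for a single query vector.
    scores = q @ K.T / np.sqrt(d)
    return softmax(scores) @ V

tokens = rng.normal(size=(5, d))  # stand-in for embedded tokens
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
for t in range(5):
    x = tokens[t]
    # With a KV cache, only the new token's K and V are projected...
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    out_cached = attend(x @ Wq, K_cache, V_cache)
    # ...instead of re-projecting K and V for the whole prefix.
    out_full = attend(x @ Wq, tokens[: t + 1] @ Wk, tokens[: t + 1] @ Wv)
    assert np.allclose(out_cached, out_full)
```

The outputs agree exactly; the saving is that the cached path does O(1) new K/V projections per step rather than O(t).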

Decoding Strategies in Language Models

The Auto-regression and Decoding Strategies. Auto-regressive language generation assumes that the element of the output sequence at time-step $t$ is determined by the input sequence and the time-steps before $t$: $$ P\left(w_{1: T} \mid W_0\right)=\prod_{t=1}^T P\left(w_t \mid w_{1: t-1}, W_0\right), \text { with } w_{1: 0}=\emptyset $$ where $W_0$ is the input sequence, $w_t$ is the word at time-step $t$, and the length $T$ is determined by the position of the EOS token (source). Language models, especially those like GPT and LLaMA, are auto-regressive....

September 15, 2023 · 14 min · Rajan Ghimire
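The factorization in the excerpt can be checked numerically with a toy model: the probability of a whole sequence is the product of per-step conditionals. This sketch uses an illustrative hand-written bigram table (the vocabulary and probabilities are made up, not from the post).

```python
import math

# Toy conditional distribution P(w_t | w_{1:t-1}, W_0), here reduced to a
# bigram table over a tiny vocabulary (all entries are illustrative).
bigram = {
    ("the", "cat"): 0.6, ("the", "dog"): 0.4,
    ("cat", "sat"): 0.9, ("cat", "ran"): 0.1,
    ("sat", "<eos>"): 1.0,
}

def sequence_logprob(prompt, words):
    """log P(w_1..w_T | W_0) = sum over t of log P(w_t | w_{t-1})."""
    prev, total = prompt[-1], 0.0
    for w in words:
        total += math.log(bigram[(prev, w)])
        prev = w
    return total

lp = sequence_logprob(["the"], ["cat", "sat", "<eos>"])
# Product form of the same quantity: 0.6 * 0.9 * 1.0 = 0.54
assert math.isclose(math.exp(lp), 0.54)
```

Decoding strategies (greedy, beam search, sampling) differ only in how they pick $w_t$ from each conditional; the factorization itself is shared by all of them.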

Supercharge Your LLaMA: Fine-Tuning Made Effortless and Efficient 🚀

In this blog, we’ll cover the core concepts behind LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention, diving into its use of zero-init attention and how it blends in new instructional cues without compromising pre-existing knowledge. We will also cover a practical implementation of the LLaMA-Adapter. To facilitate understanding, we’ll first cover concepts like Prompt Tuning, Prefix Tuning, and Adapters, which collectively form the core of LLaMA-Adapter, empowering it with unique capabilities and efficiencies....

September 8, 2023 · 13 min · Rajan Ghimire

The Secret Sauce of LLaMA🦙 : A Deep Dive!

Modern Llama. Generated using Stable Diffusion v2. In the information era, a new king reigns supreme: the language model. With the internet drowning in a never-ending flood of data, there is a growing demand for intelligent machines that can not only absorb this data but also produce, analyze, and interact with it in previously unthinkable ways. Enter LLaMA, Meta’s Large Language Model, a testament to the current peak of artificial intelligence. But what lurks underneath the many layers of this behemoth?...

August 20, 2023 · 34 min · Rajan Ghimire