
Transformers Optimization: Part 1 - KV Cache

Image by Martin Adams. In this Transformers Optimization series, we will explore various optimization techniques for Transformer models. As a kickoff piece, we dive deep into the KV cache, an inference optimization technique that significantly enhances the inference performance of large language models. What is a KV cache? A common technique for speeding up large-model inference is to reuse the key-value (KV) cache from the previous inference step. Reusing the KV cache improves inference performance and reduces end-to-end latency without affecting accuracy....
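As a rough illustration of the idea (a minimal single-head sketch with toy weights, not the post's implementation), caching the keys and values of already-processed tokens means each decoding step only has to project the newest token:

```python
import torch

d = 64                                # head dimension (toy value)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
k_cache, v_cache = [], []             # grow by one entry per generated token

def decode_step(x_new):
    """x_new: (1, d) hidden state of the newest token only."""
    q = x_new @ Wq
    k_cache.append(x_new @ Wk)        # project only the new token...
    v_cache.append(x_new @ Wv)        # ...and reuse every earlier K/V pair
    K = torch.cat(k_cache)            # (t, d)
    V = torch.cat(v_cache)            # (t, d)
    attn = torch.softmax(q @ K.T / d ** 0.5, dim=-1)
    return attn @ V                   # (1, d) attention output for the new token

for _ in range(5):                    # toy autoregressive decode loop
    out = decode_step(torch.randn(1, d))
```

Without the cache, every step would recompute K and V for the entire prefix; with it, the per-step cost of the projections stays constant as the sequence grows.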

October 7, 2023 · 8 min · Rajan Ghimire

Decoding Strategies in Language Models

The Auto-regression and Decoding Strategies Auto-regressive language generation assumes that the element of the output sequence at time-step $t$ is determined by the input sequence and the time-steps before $t$. $$ P\left(w_{1: T} \mid W_0\right)=\prod_{t=1}^T P\left(w_t \mid w_{1: t-1}, W_0\right), \text { with } w_{1: 0}=\emptyset $$ where $W_0$ is the input sequence, $w_t$ is the word at timestep $t$, and $T$ is determined by the position at which the EOS token is generated. source Language models, especially those in the GPT and LLaMA families, are auto-regressive....
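A minimal greedy-decoding sketch of this factorization (the `model(input_ids)` call returning per-position logits is an assumption here, not any specific library's API):

```python
import torch

def greedy_decode(model, input_ids, max_new_tokens=20, eos_id=2):
    """Pick argmax P(w_t | w_{1:t-1}, W_0) at every step."""
    for _ in range(max_new_tokens):
        logits = model(input_ids)                 # (1, seq_len, vocab) assumed
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
        if next_id.item() == eos_id:              # T is fixed by where EOS lands
            break
    return input_ids
```

Greedy search is only one of the strategies the post covers; beam search and sampling replace the `argmax` step with different selection rules over the same per-step distribution.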

September 15, 2023 · 14 min · Rajan Ghimire

Supercharge Your LLaMA: Fine-Tuning Made Effortless and Efficient 🚀

In this blog, we’ll cover the core concepts behind LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention, diving into its use of zero-init attention and how it blends new instructional cues without compromising pre-existing knowledge. We will also cover the practical implementation of LLaMA-Adapter. To facilitate understanding, let’s first cover concepts like Prompt Tuning, Prefix Tuning, and Adapters, which collectively form the core of LLaMA-Adapter and give it its unique capabilities and efficiencies....
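The zero-init trick can be sketched roughly as a learnable gate initialized to zero, so the adapter prompts contribute nothing at the start of training (a simplified illustration of the idea, not the paper's exact module):

```python
import torch
import torch.nn as nn

class ZeroInitGate(nn.Module):
    """Gated blend of adapter-prompt attention into the frozen attention output."""
    def __init__(self):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(1))   # starts at 0: no disturbance

    def forward(self, frozen_attn_out, adapter_attn_out):
        # At initialization the model behaves exactly like pretrained LLaMA;
        # during fine-tuning the gate learns how much new signal to admit.
        return frozen_attn_out + torch.tanh(self.gate) * adapter_attn_out
```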

September 8, 2023 · 13 min · Rajan Ghimire

The Secret Sauce of LLaMA🦙 : A Deep Dive!

Modern Llama, generated using Stable Diffusion v2. In the information era, a new king reigns supreme: the language model. With the internet drowning in a never-ending flood of data, there is a growing demand for intelligent machines that can not only absorb this data but also produce, analyze, and interact with it in previously unthinkable ways. Enter LLaMA, Meta’s Large Language Model, a monument to the current peak of artificial intelligence. But what lurks beneath the many layers of this behemoth?...

August 20, 2023 · 34 min · Rajan Ghimire

Semantic Segmentation from scratch in PyTorch.

In this blog, we will use the DeepLabv3+ architecture to build a person-segmentation pipeline entirely from scratch. DeepLabv3+ Architecture: DeepLabv3 was introduced in the paper “Rethinking Atrous Convolution for Semantic Image Segmentation”. After DeepLabv1 and DeepLabv2, the authors rethought and restructured the DeepLab architecture and arrived at the more capable DeepLabv3. DeepLabv3+ was then introduced in the paper “Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation”....
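As a small taste of the atrous (dilated) convolutions at the heart of DeepLab, here is one illustrative branch in PyTorch; the channel sizes and dilation rate are assumptions, not the pipeline built in the post:

```python
import torch
import torch.nn as nn

class AtrousBranch(nn.Module):
    """One dilated 3x3 branch of the kind used in ASPP-style heads."""
    def __init__(self, in_ch=2048, out_ch=256, rate=6):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3,
                              padding=rate, dilation=rate, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # Dilation enlarges the receptive field without adding parameters
        # or reducing the spatial resolution of the feature map.
        return self.act(self.bn(self.conv(x)))

feat = torch.randn(1, 2048, 32, 32)   # backbone feature map (assumed shape)
out = AtrousBranch()(feat)            # -> (1, 256, 32, 32)
```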

July 25, 2023 · 12 min · Rajan Ghimire

Quantization in PyTorch: Optimizing Architectures for Enhanced Performance

Introduction: In the rapidly evolving world of machine learning, one of the fundamental challenges is making deep learning models run more efficiently. Model quantization is a strategy that reduces memory requirements and computational cost, making the deployment of such models on hardware with constrained resources feasible and more efficient. In this blog, we take a deep dive into the realm of PyTorch model quantization....
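As a quick preview of what this looks like in practice, here is post-training dynamic quantization on a toy model; this is only one of the modes PyTorch offers, and the model itself is a placeholder:

```python
import torch
import torch.nn as nn

# Tiny stand-in model; any module with nn.Linear layers would do.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))

# Weights of the listed layer types are converted to int8 at load time;
# activations are quantized dynamically at inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 256)
print(quantized(x).shape)   # same interface, smaller and faster linear layers
```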

July 15, 2023 · 13 min · Rajan Ghimire

LoRA (Low-Rank Adaptation): A Deeper Dive

LoRA is a fast fine-tuning approach developed by Microsoft researchers for adapting huge models to specific tasks and datasets. The idea behind LoRA is that a single LLM can serve various tasks by incorporating different neurons or features to handle each task; by identifying the appropriate features from a pool of many and improving them, we can obtain better outcomes for specific tasks. Fine-tuning: Let $L$ = loss function, $X, y$ = input and output data....
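Concretely, the low-rank idea can be sketched as a frozen weight matrix plus a trainable rank-$r$ update $BA$ (a bare-bones illustration; the rank, scaling, and initialization below are assumptions, not the post's exact settings):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen dense weight W plus a trainable low-rank update B @ A."""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features),
                                   requires_grad=False)      # pretrained, frozen
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        # Only A and B (a tiny fraction of the parameters) receive gradients.
        return x @ (self.weight + self.scale * self.B @ self.A).T

y = LoRALinear(512, 512)(torch.randn(2, 512))   # -> (2, 512)
```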

March 6, 2023 · 7 min · Rajan Ghimire

Vision Transformer (ViT)

Transformers were first widely used in the field of natural language processing. As a result of their success in NLP, many researchers have begun applying the Transformer architecture to other domains, such as computer vision. One such architecture, the Vision Transformer (ViT), was developed by Google Research and Brain Team to tackle the challenge of image classification. Naturally, you need prior knowledge of how Transformers function and the issues they addressed in order to grasp how ViT operates....

February 6, 2023 · 15 min · Rajan Ghimire