Why Fine-Tuning Still Matters in 2026
Pre-trained large language models are trained on broad internet text — which means they know a lot about everything in general and not enough about your specific domain. A model trained on Common Crawl has never seen your company's internal GST filing procedures, your hospital's Malayalam discharge summaries, or the specific tone your legal team uses when drafting notices under Indian contract law. When you deploy such a model without adaptation, it answers correctly on average but fails on your actual workload.
Fine-tuning solves this by continuing model training on your domain-specific data. The result is a model that retains general capabilities while becoming measurably better at your specific tasks — whether that means classifying Kerala government procurement documents, extracting structured data from handwritten income certificates, or generating customer support responses that match your brand voice in Manglish. The alternative, prompt engineering alone, hits a ceiling quickly when the task requires knowledge the base model genuinely lacks.
The barrier historically was hardware. Full fine-tuning of a 7B parameter model requires approximately 80-140GB of GPU VRAM in bf16 precision — far beyond what any individual developer or small team in India could access without serious cloud spend. LoRA and QLoRA changed that equation fundamentally.
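The 80-140GB range can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below assumes an Adam-style optimizer holding two fp32 state tensors per parameter, plus an optional fp32 master copy of the weights. The per-parameter byte counts are illustrative assumptions, and activation memory is left out entirely.

```python
def full_finetune_memory_gb(n_params: float, master_weights: bool) -> float:
    """Rough memory for weights, gradients and Adam states; activations excluded."""
    bytes_per_param = 2       # bf16 weights
    bytes_per_param += 2      # bf16 gradients
    bytes_per_param += 8      # Adam first and second moments in fp32
    if master_weights:
        bytes_per_param += 4  # optional fp32 master copy of the weights
    return n_params * bytes_per_param / 1e9

n = 7e9  # round figure for a "7B" model
low = full_finetune_memory_gb(n, master_weights=False)
high = full_finetune_memory_gb(n, master_weights=True)
print(f"{low:.0f} GB to {high:.0f} GB before activations")
```

Activations for long sequences and larger batches sit on top of these figures, which is why the practical requirement stretches further.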
What LoRA and QLoRA Actually Do
LoRA (Low-Rank Adaptation) works on a simple mathematical insight: the weight updates needed during fine-tuning tend to have low intrinsic rank. Instead of updating all model weights directly, LoRA freezes the original model and injects pairs of small trainable matrices — called adapters — at specific layers. If a weight matrix has dimensions 4096×4096, LoRA approximates its update as two smaller matrices: 4096×r and r×4096, where r is the rank (typically 8-64). At rank 16, you are training 2×4096×16 = 131,072 parameters instead of 16,777,216 — a 128x reduction for that layer alone.
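The parameter arithmetic above can be checked in a few lines. This is a minimal NumPy sketch, not PEFT's implementation: the names `A`, `B`, and `alpha` follow common LoRA notation, and the zero initialisation of `B` and the `alpha / r` scaling are assumptions taken from the usual formulation rather than from this article.

```python
import numpy as np

d, r, alpha = 4096, 16, 32

full_params = d * d            # entries in a dense 4096x4096 update
lora_params = d * r + r * d    # entries in the two low-rank factors
print(full_params, lora_params, full_params // lora_params)

rng = np.random.default_rng(0)
B = np.zeros((d, r), dtype=np.float32)                      # 4096 x r, starts at zero
A = rng.normal(0.0, 0.01, size=(r, d)).astype(np.float32)   # r x 4096, small random init
delta_W = (alpha / r) * (B @ A)                             # rank-r update with the full (d, d) shape
print(delta_W.shape)
```

Because `B` starts at zero in this sketch, the update is exactly zero before training, so the adapted model initially behaves like the base model.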
Across a full 7B model, LoRA adapters typically represent less than 1% of total parameters. Gradient and optimizer-state memory shrink accordingly, because only the adapter weights receive updates. The frozen base weights still have to sit in GPU memory during training, and activation memory is largely unchanged, so the savings are large but not strictly proportional. After training, you can merge the adapters back into the base model for inference, or serve them separately with hot-swapping for multi-tenant deployments.
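Merging and separate serving produce the same outputs, which a toy example can confirm. The sketch below uses small random matrices as stand-ins for a trained adapter. All sizes and the `alpha / r` scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 4, 8           # toy sizes so the check runs instantly

W = rng.normal(size=(d, d))      # frozen base weight
B = rng.normal(size=(d, r))      # adapter factors (random stand-ins for trained values)
A = rng.normal(size=(r, d))
x = rng.normal(size=(d,))
scale = alpha / r

# Serving the adapter separately: base path plus low-rank path
y_adapter = W @ x + scale * (B @ (A @ x))

# Merging: fold the update into the weight once, then use a single matmul
W_merged = W + scale * (B @ A)
y_merged = W_merged @ x

print(np.allclose(y_adapter, y_merged))
```

In PEFT, the merge step is exposed as `merge_and_unload()` on the adapted model. Keeping adapters separate costs one extra low-rank matmul per layer but lets several adapters share a single copy of the base weights.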
QLoRA (Quantized LoRA, introduced by Dettmers et al. in 2023) adds a second innovation: it loads the frozen base model in 4-bit NormalFloat (NF4) precision instead of the standard 16-bit. This cuts the base model's memory footprint by roughly 75%. A Mistral-7B model that occupies about 14GB in fp16 loads into approximately 4-5GB in 4-bit. Add the LoRA adapters and optimizer states, and you can train a 7B model within the roughly 15GB of usable VRAM on a free Colab T4 GPU (16GB nominal), which is out of reach for full fine-tuning and for standard 16-bit LoRA.
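The 75% figure follows directly from the bit widths. The helper below is a rough sketch that counts weight storage only. The 7.0e9 parameter count is a round-number assumption, and the function name is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, ignoring quantization constants and activations."""
    return n_params * bits_per_param / 8 / 1e9

n = 7.0e9  # round figure for a "7B" model, not an exact parameter count
fp16_gb = weight_memory_gb(n, 16)
nf4_gb = weight_memory_gb(n, 4)
print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {nf4_gb:.1f} GB, saving: {1 - nf4_gb / fp16_gb:.0%}")
```

The loaded footprint in practice lands somewhat above the 4-bit figure, since quantization constants add overhead and some layers stay in higher precision.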
Setting Up QLoRA Fine-Tuning on Google Colab
The Hugging Face ecosystem — specifically the PEFT (Parameter-Efficient Fine-Tuning) library combined with bitsandbytes and TRL (Transformer Reinforcement Learning) — handles all the complexity. Here is a working setup for fine-tuning Mistral-7B-Instruct on a custom Indian legal Q&A dataset.
First, install the required packages. On a fresh Colab runtime with a T4 GPU assigned:
!pip install -q transformers==4.40.0 peft==0.10.0 bitsandbytes==0.43.0 trl==0.8.6 datasets accelerate
Next, load the base model in 4-bit precision using BitsAndBytesConfig:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.3"

# 4-bit NF4 quantization, with double quantization of the quantization constants
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    # The T4 (Turing) has no native bfloat16 support, so compute in float16
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on the available GPU automatically
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
# The Mistral tokenizer defines no pad token; reuse EOS so batches can be padded
tokenizer.pad_token = tokenizer.eos_token
Configure the LoRA adapter. For a domain Q&A task, a common choice is to target the attention projection matrices (query, key, value, and output), as the configuration here does:
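from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepares the quantized model for training (casts norm layers, enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,            # adapter rank
    lora_alpha=32,   # scaling numerator; effective scale is lora_alpha / r = 2
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Expect roughly 13.6M trainable parameters, well under 1% of the ~7.2B total
# (k_proj and v_proj are smaller than q_proj in Mistral-7B because of grouped-query attention)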
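The figure that print_trainable_parameters() reports can be cross-checked by hand. The sketch below assumes the published Mistral-7B shapes (hidden size 4096, 8 key/value heads of dimension 128, 32 layers). Those values are assumptions to verify against the model's config.json, and `lora_params` is a hypothetical helper.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """One adapter pair: A is (r x in_features), B is (out_features x r)."""
    return r * in_features + out_features * r

# Assumed Mistral-7B shapes; verify against the model's config.json
hidden, kv_dim, n_layers, r = 4096, 8 * 128, 32, 16

per_layer = (
    lora_params(hidden, hidden, r)    # q_proj
    + lora_params(hidden, kv_dim, r)  # k_proj
    + lora_params(hidden, kv_dim, r)  # v_proj
    + lora_params(hidden, hidden, r)  # o_proj
)
total = per_layer * n_layers
print(f"{total:,} trainable adapter parameters")
```

If the number printed by PEFT differs from this estimate, the usual causes are a different set of target_modules or a model whose projection shapes differ from the ones assumed here.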
Use TRL's SFTTrainer for supervised fine-tuning. It handles dataset formatting, gradient accumulation, and mixed-precision training automatically:
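from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Each JSONL record needs a "text" field holding the fully formatted example
dataset = load_dataset("json", data_files="your_dataset.jsonl", split="train")

training_args = TrainingArguments(
    output_dir="./mistral-legal-qlora",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # effective batch size of 8 sequences
    learning_rate=2e-4,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    fp16=True,                      # the T4 supports fp16 mixed precision, not bf16
    logging_steps=25,
    save_steps=200,
    optim="paged_adamw_8bit",       # paged 8-bit AdamW from bitsandbytes
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,            # reuse the tokenizer configured earlier
    args=training_args,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    packing=True,                   # concatenate short examples into full-length sequences
)
trainer.train()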
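It helps to know how many optimizer steps these settings imply before starting a run on a time-limited runtime. The helper below is a rough approximation for a single GPU, not the exact logic inside the Trainer. The function name is hypothetical, and the 2,000-sequence figure is an illustrative assumption.

```python
import math

def schedule(n_sequences: int, per_device_batch: int, grad_accum: int,
             epochs: int, warmup_ratio: float) -> dict:
    """Approximate optimizer-step counts for a single-GPU run."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = max(n_sequences // effective_batch, 1)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
    }

# 2,000 training sequences is a hypothetical figure; packing changes the real count
plan = schedule(2000, per_device_batch=2, grad_accum=4, epochs=3, warmup_ratio=0.03)
print(plan)
```

With save_steps=200, a run of about 750 steps would checkpoint at steps 200, 400, and 600. A 500-sequence dataset gives fewer than 200 steps in total under the same settings, so it may be worth lowering save_steps or saving explicitly at the end to guard against a disconnect.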
Dataset Preparation for Indian Use Cases
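The dataset format matters as much as the training code. Because the training script reads a single "text" field, each example must already be rendered in the prompt template the base model expects. For Mistral-7B-Instruct that is the [INST] ... [/INST] layout: the instruction (including any system-style preamble) sits inside the [INST] tags, followed by the expected model output. For a GST accounting assistant, an example looks like this (pretty-printed here; in a .jsonl file each record occupies a single line):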
{
"text": "<s>[INST] You are a GST compliance assistant for Indian businesses. A trader asks: 'My supplier filed GSTR-1 late for March 2025. Can I still claim ITC in my GSTR-3B?' [/INST] You can claim Input Tax Credit for supplies reflected in GSTR-2B even if your supplier filed GSTR-1 late, provided the invoice appears in your GSTR-2B for the relevant period. Per Section 16(2)(aa) of the CGST Act, ITC eligibility is tied to reflection in GSTR-2B, not the supplier's filing date. However, if the late filing pushes the reflection into a later month's GSTR-2B, your ITC claim moves to that month. Verify in your GST portal under 'ITC Available' before filing. </s>"
}
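Records in this shape are easier to generate than to write by hand. The following is a minimal sketch: `to_mistral_text` is a hypothetical helper that mirrors the layout of the sample record, and the placeholder strings stand in for your own reviewed content.

```python
import json

def to_mistral_text(system: str, question: str, answer: str) -> str:
    """Render one example in the [INST] ... [/INST] layout of the sample record."""
    return f"<s>[INST] {system} {question} [/INST] {answer} </s>"

# Hypothetical raw records; in practice these come from reviewed Q&A pairs
records = [
    {
        "system": "You are a GST compliance assistant for Indian businesses.",
        "question": "A trader asks: '<question text>'",
        "answer": "<reviewed answer text>",
    },
]

# One JSON object per line; ensure_ascii=False keeps Malayalam or Hindi text readable
lines = [json.dumps({"text": to_mistral_text(**rec)}, ensure_ascii=False) for rec in records]
print("\n".join(lines))
```

Writing the joined lines to your_dataset.jsonl with UTF-8 encoding yields a file that load_dataset("json", ...) can read. It is worth checking how your tokenizer treats the literal <s> and </s> markers, since some setups add these special tokens again during tokenization.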
Dataset quality checklist for Indian enterprise fine-tuning: each example should represent a complete, self-contained task. Avoid examples where the correct answer requires information not in the prompt. For Malayalam language tasks, ensure consistent Unicode normalization — mixed encoding in the training data produces erratic inference behaviour. Aim for 500-2,000 examples for task-specific adaptation; more is better only if the additional examples are genuinely diverse, not repetitive paraphrases of the same underlying task.
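Unicode normalization is straightforward to enforce with the standard library. The sketch below assumes NFC as the target form, which is a reasonable default but still a choice. The essential point is that training and inference text go through the same function. `normalize_text` is a hypothetical helper.

```python
import unicodedata

def normalize_text(text: str) -> str:
    """Normalize to NFC and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())

composed = "\u0d15\u0d4b"                            # Malayalam KA followed by vowel sign OO
decomposed = unicodedata.normalize("NFD", composed)  # the same text in decomposed form

print(composed == decomposed)
print(normalize_text(composed) == normalize_text(decomposed))
```

Applying the same function to user input at inference time keeps the model from seeing byte sequences it never encountered during training.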
Cloud Options: Cost Estimates for Indian Teams
Google Colab free tier provides a T4 GPU (16GB VRAM) with runtime limits of approximately 4-6 hours before disconnection. That is adequate for prototyping and for datasets under 1,000 examples. Colab Pro at ₹899/month reduces runtime interruptions and provides priority access to T4 and V100 GPUs, making it the practical minimum for serious fine-tuning work.
RunPod offers on-demand GPU instances billable by the hour. An RTX 3090 (24GB VRAM) costs approximately $0.22-0.35/hour — roughly ₹18-29/hour at current exchange rates. A 3-epoch fine-tuning run on 2,000 examples typically completes in 4-6 hours, putting the total cost under ₹200. Payment requires an international credit or debit card; Indian Visa debit cards with international transactions enabled generally work.
Lambda Labs offers A10 instances at $0.75/hour (approximately ₹62/hour). For teams needing the extra VRAM headroom of 24GB to fine-tune 13B models, this is a reliable choice with straightforward billing. Their storage is persistent across sessions, which matters when you are iterating on datasets.
AWS SageMaker is the right choice for enterprise deployments where fine-tuning needs to be reproducible, auditable, and integrated with existing AWS infrastructure. An ml.g5.2xlarge instance (A10G GPU, 24GB VRAM) costs approximately ₹180-200/hour in the Mumbai region. Expensive for experimentation, but the managed environment and IAM integration justify the premium for regulated industries like banking or healthcare.
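The per-run costs quoted in this section are simple products of hours, hourly rate, and exchange rate. The helper below makes the arithmetic explicit. The ₹83 per USD default and the 6-hour Lambda run are assumptions for illustration, so substitute current figures before budgeting.

```python
def run_cost_inr(hours: float, usd_per_hour: float, inr_per_usd: float = 83.0) -> float:
    """Cost of one training run in rupees; the default exchange rate is an assumption."""
    return hours * usd_per_hour * inr_per_usd

# Hourly rates quoted in this section; run lengths of 4 to 6 hours
runpod_low = run_cost_inr(4, 0.22)
runpod_high = run_cost_inr(6, 0.35)
lambda_6h = run_cost_inr(6, 0.75)   # hypothetical 6-hour run on an A10

print(f"RunPod RTX 3090: ₹{runpod_low:.0f} to ₹{runpod_high:.0f}")
print(f"Lambda A10, 6 hours: ₹{lambda_6h:.0f}")
```

Multiplying by the number of experiments you expect to run, not just one, usually gives a more honest budget.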
When Fine-Tuning Is the Wrong Answer
Fine-tuning is not the default answer to every LLM performance problem. If your use case involves retrieving and reasoning over documents that change frequently — updated SEBI circulars, new GST notifications, live product catalogues — fine-tuning memorises a snapshot that goes stale. RAG (Retrieval-Augmented Generation) handles dynamic knowledge far better because it fetches fresh documents at inference time rather than relying on what was in the training data.
Similarly, if the base model already performs well on your task with a well-crafted system prompt and few-shot examples, fine-tuning adds cost and complexity without proportional benefit. Run a proper baseline evaluation first: test the base model with your best prompt on 50-100 representative examples, score the outputs, and only proceed to fine-tuning if the gap is material (more than 10-15 percentage points on your task metric). Many teams skip this step and fine-tune unnecessarily.
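The go/no-go rule can be written down so that it is applied consistently across projects. This is a sketch under stated assumptions: `should_fine_tune` is a hypothetical helper, accuracy is used as the task metric, and the 10-point default mirrors the lower end of the threshold mentioned above.

```python
def should_fine_tune(baseline_correct: int, n_examples: int,
                     target_accuracy: float, min_gap_points: float = 10.0) -> bool:
    """True when the prompted baseline misses the target by at least min_gap_points."""
    baseline_accuracy = 100.0 * baseline_correct / n_examples
    return (target_accuracy - baseline_accuracy) >= min_gap_points

# Hypothetical results on 100 representative prompts, with 90% needed in production
print(should_fine_tune(68, 100, target_accuracy=90.0))  # gap of 22 points
print(should_fine_tune(84, 100, target_accuracy=90.0))  # gap of 6 points
```

With only 50-100 examples the estimate is noisy, so a gap close to the threshold is a reason to evaluate more examples rather than to decide either way.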
Evaluating Whether Fine-Tuning Actually Worked
Hold out 10-15% of your dataset before training and never let the model see it. After training, evaluate on this held-out set using task-appropriate metrics. For classification tasks, report F1 score and accuracy. For generation tasks like summarisation or translation, BLEU and ROUGE scores provide an automated signal, but human evaluation on 50-100 examples is the most reliable indicator of domain-specific quality.
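A reproducible split and a classification metric need nothing beyond the standard library. The sketch below is illustrative: `holdout_split` and `macro_f1` are hypothetical helpers, the 15% fraction and the seed are arbitrary choices, and macro-averaging is one of several reasonable ways to combine per-class F1.

```python
import random
from collections import Counter

def holdout_split(examples: list, test_fraction: float = 0.15, seed: int = 42):
    """Shuffle once with a fixed seed and carve off the held-out portion."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def macro_f1(y_true: list, y_pred: list) -> float:
    """Unweighted mean of per-class F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    scores = []
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)

train, test = holdout_split(list(range(1000)))
print(len(train), len(test))
print(round(macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"]), 3))
```

Fixing the seed means the same examples stay held out across experiments, so scores from different training runs remain comparable.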
For Indian language tasks specifically, automated metrics often understate quality: BLEU scores for Malayalam or Hindi generation can be misleading because tokenisation differences penalise correct outputs. Build a small human evaluation rubric covering fluency, factual accuracy, and task completion on a 1-5 scale, scored by a native speaker familiar with the domain.
Compare your fine-tuned model against the base model and against GPT-4o or Claude 3.5 Sonnet on the same test set. If your fine-tuned open-source model matches or exceeds the frontier model on your specific task, you have achieved the goal: a performant, private, self-hosted model at a fraction of the per-call API cost. That is the business case for fine-tuning — and it is a compelling one when the numbers work out.
Frequently Asked Questions
Can I fine-tune a 7B parameter model on a free Google Colab notebook?
Yes, with QLoRA and 4-bit quantization, a 7B model like Mistral-7B fits within Colab's free tier T4 GPU (16GB VRAM). Expect training to take 2-4 hours per 1,000 training samples. The free tier has runtime limits, so Colab Pro at ₹899/month is recommended for serious fine-tuning work where you need uninterrupted sessions.
How much training data do I need for effective LoRA fine-tuning?
For task-specific fine-tuning — like teaching a model to format GST invoices or answer Malayalam queries — 500 to 2,000 high-quality examples often outperform 50,000 noisy ones. Quality matters far more than volume. Use the Alpaca or ShareGPT format: each example should be a prompt-completion pair that precisely represents your target task. Avoid repetitive paraphrases; genuine diversity in phrasing and context improves generalisation.
What is the difference between LoRA and QLoRA in practical terms?
LoRA adds trainable low-rank adapter matrices to the frozen base model, reducing trainable parameters from billions to millions. QLoRA does everything LoRA does but additionally loads the base model in 4-bit NormalFloat precision, cutting the memory taken by the base weights by roughly 75%. Combined, a 7B model that needs 80GB or more of VRAM for full fine-tuning can be trained in around 8-10GB with QLoRA at modest sequence lengths. That is the difference between needing an A100 and running fine on a consumer RTX 3080 or a Colab T4.