Fine-tuning transforms a general-purpose language model into a specialist. This guide walks through the complete process: deciding whether to fine-tune at all, preparing data, training, evaluating, and deploying.

When to Fine-Tune (And When Not To)

Fine-tuning makes sense when:

  • You need consistent output format or style
  • Domain-specific knowledge isn’t in the base model
  • Prompt engineering has hit its limits
  • You have at least 100 (ideally 500-1,000) high-quality examples

Don’t fine-tune when:

  • Few-shot prompting solves the problem
  • You lack quality training data
  • The task changes frequently

Step 1: Data Preparation

Quality training data is the single most important factor. Each example should follow this format:

{
  "instruction": "What you want the model to do",
  "input": "The specific input (optional)",
  "output": "The ideal response"
}
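
For instance, a filled-in example for a hypothetical support-ticket summarizer might look like:

{
  "instruction": "Summarize the customer ticket in one sentence.",
  "input": "Order #4521 arrived with a cracked screen. I need a replacement before Friday.",
  "output": "The customer received order #4521 with a cracked screen and needs a replacement by Friday."
}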

Data quality checklist (a validation sketch follows the list):

  • Outputs are genuinely good (you’d be happy to see them in production)
  • Instructions are clear and consistent
  • No contradictory examples
  • Diverse coverage of your use case
  • At least 100 examples (500+ preferred)
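
Before training, it pays to validate the dataset programmatically rather than by eye. A minimal sketch, assuming your examples live in a train.jsonl file (hypothetical name) with one JSON object per line:

import json

REQUIRED_KEYS = {"instruction", "output"}  # "input" is optional

with open("train.jsonl") as f:
    examples = [json.loads(line) for line in f]

for i, ex in enumerate(examples):
    missing = REQUIRED_KEYS - ex.keys()
    if missing:
        raise ValueError(f"Example {i} is missing keys: {missing}")
    if not ex["output"].strip():
        raise ValueError(f"Example {i} has an empty output")

print(f"{len(examples)} examples passed basic checks")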

Step 2: Choosing a Base Model

Use Case            Recommended Base
------------------  --------------------------
General assistant   Mistral 7B, Llama 2 7B
Code generation     CodeLlama, StarCoder
Long context        Mistral, Llama 2 with RoPE
Multilingual        BLOOM, mGPT
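
Whichever base you pick, loading it with Hugging Face transformers looks the same. A minimal sketch using Mistral 7B (the model ID is just one option; device_map="auto" assumes the accelerate package is installed):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # swap in your chosen base
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",  # place layers on available devices automatically
)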

Step 3: Training Configuration

For most use cases, LoRA (Low-Rank Adaptation) offers the best tradeoff between quality and compute:

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                    # Rank (start with 8-16)
    lora_alpha=32,           # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Attention query/value projections
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
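
Applying the config is a single call. A sketch, assuming model is the base model loaded in Step 2:

from peft import get_peft_model

model = get_peft_model(model, lora_config)  # wrap the base model with LoRA adapters
model.print_trainable_parameters()          # LoRA trains only a small fraction of weights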

Training parameters (a starting configuration is sketched after the list):

  • Learning rate: 1e-4 to 2e-4
  • Batch size: 4-8 (with gradient accumulation)
  • Epochs: 3-5 (watch for overfitting)
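
These values map directly onto Hugging Face TrainingArguments. A starting sketch (output_dir is a hypothetical path; a batch size of 4 with 4 accumulation steps gives an effective batch size of 16):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./lora-out",        # hypothetical output directory
    learning_rate=2e-4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size of 16
    num_train_epochs=3,
    logging_steps=10,
)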

Step 4: Evaluation

Don’t just eyeball outputs. Create a held-out test set and measure:

  • Task accuracy (does it do what you want?)
  • Format compliance (structured outputs; a check is sketched below)
  • Regression testing (did it forget general capabilities?)
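
Format compliance is the easiest metric to automate. A minimal sketch for JSON outputs, where generate_response is a hypothetical wrapper around your inference code:

import json

def format_compliance(test_prompts, generate_response):
    """Fraction of held-out prompts whose outputs parse as valid JSON."""
    passed = 0
    for prompt in test_prompts:
        try:
            json.loads(generate_response(prompt))
            passed += 1
        except json.JSONDecodeError:
            pass
    return passed / len(test_prompts)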

Step 5: Deployment

Merge the LoRA weights into the base model (see the sketch after this list), quantize if needed, and deploy via:

  • vLLM (high throughput)
  • llama.cpp (local/edge)
  • Hugging Face TGI (managed)
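
The merge itself is a few lines with peft. A sketch, assuming the adapter was saved to ./lora-out during training (paths are hypothetical):

from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
model = PeftModel.from_pretrained(base, "./lora-out")  # load the trained adapter
merged = model.merge_and_unload()                      # fold LoRA weights into the base
merged.save_pretrained("./merged-model")               # ready for quantization or serving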

Common Pitfalls

  1. Overfitting: Model memorizes training data, performs poorly on new inputs
  2. Catastrophic forgetting: Loses general capabilities
  3. Data leakage: Test examples too similar to the training data (a quick check is sketched below)
  4. Format inconsistency: Training data has varying output formats
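
For the leakage pitfall, even a crude exact-match check catches the worst cases. A sketch that flags test examples whose normalized text also appears in the training data (near-duplicates need fuzzier matching, such as n-gram overlap):

def find_leaks(train_examples, test_examples):
    """Return test examples whose instruction+input also appear in training data."""
    def key(ex):
        return (ex["instruction"] + " " + ex.get("input", "")).lower().strip()
    train_keys = {key(ex) for ex in train_examples}
    return [ex for ex in test_examples if key(ex) in train_keys]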