Fine-tuning transforms a general-purpose language model into a specialist. This guide walks through the complete process: deciding whether to fine-tune at all, preparing data, training, evaluating, and deploying.

When to Fine-Tune (And When Not To)

Fine-tuning makes sense when:

  • You need consistent output format or style
  • Domain-specific knowledge isn’t in the base model
  • Prompt engineering has hit its limits
  • You have at least 100 (ideally 500-1,000) high-quality examples

Don’t fine-tune when:

  • Few-shot prompting solves the problem
  • You lack quality training data
  • The task changes frequently

Step 1: Data Preparation

Quality training data is the single most important factor. Each example should follow this format:

{
  "instruction": "What you want the model to do",
  "input": "The specific input (optional)",
  "output": "The ideal response"
}
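
For instance, a filled-in example for a hypothetical support-ticket summarizer might look like:

{
  "instruction": "Summarize the customer ticket in one sentence.",
  "input": "Order #4521 arrived with a cracked screen. I need a replacement before Friday.",
  "output": "The customer received order #4521 with a cracked screen and needs a replacement by Friday."
}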

Data quality checklist (a validation sketch follows the list):

  • Outputs are genuinely good (you’d be happy to see them in production)
  • Instructions are clear and consistent
  • No contradictory examples
  • Diverse coverage of your use case
  • At least 100 examples (500+ preferred)
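
Before training, it pays to validate the dataset programmatically rather than by eye. A minimal sketch, assuming your examples live in a train.jsonl file (hypothetical name) with one JSON object per line:

import json

REQUIRED_KEYS = {"instruction", "output"}  # "input" is optional

with open("train.jsonl") as f:
    examples = [json.loads(line) for line in f]

for i, ex in enumerate(examples):
    missing = REQUIRED_KEYS - ex.keys()
    if missing:
        raise ValueError(f"Example {i} is missing keys: {missing}")
    if not ex["output"].strip():
        raise ValueError(f"Example {i} has an empty output")

print(f"{len(examples)} examples passed basic checks")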

Step 2: Choosing a Base Model

Use Case            Recommended Base
------------------  --------------------------
General assistant   Mistral 7B, Llama 2 7B
Code generation     CodeLlama, StarCoder
Long context        Mistral, Llama 2 with RoPE
Multilingual        BLOOM, mGPT
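
Whichever base you pick, loading it with Hugging Face transformers looks the same. A minimal sketch using Mistral 7B (the model ID is just one option; device_map="auto" assumes the accelerate package is installed):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # swap in your chosen base
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",  # place layers on available devices automatically
)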

Step 3: Training Configuration

For most use cases, LoRA (Low-Rank Adaptation) offers the best tradeoff between quality and compute:

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                    # Rank (start with 8-16)
    lora_alpha=32,           # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Attention query/value projections
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
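
Applying the config is a single call. A sketch, assuming model is the base model loaded in Step 2:

from peft import get_peft_model

model = get_peft_model(model, lora_config)  # wrap the base model with LoRA adapters
model.print_trainable_parameters()          # LoRA trains only a small fraction of weights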

Training parameters (a starting configuration is sketched after the list):

  • Learning rate: 1e-4 to 2e-4
  • Batch size: 4-8 (with gradient accumulation)
  • Epochs: 3-5 (watch for overfitting)
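
These values map directly onto Hugging Face TrainingArguments. A starting sketch (output_dir is a hypothetical path; a batch size of 4 with 4 accumulation steps gives an effective batch size of 16):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./lora-out",        # hypothetical output directory
    learning_rate=2e-4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size of 16
    num_train_epochs=3,
    logging_steps=10,
)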

Step 4: Evaluation

Don’t just eyeball outputs. Create a held-out test set and measure:

  • Task accuracy (does it do what you want?)
  • Format compliance (structured outputs; a check is sketched below)
  • Regression testing (did it forget general capabilities?)
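
Format compliance is the easiest metric to automate. A minimal sketch for JSON outputs, where generate_response is a hypothetical wrapper around your inference code:

import json

def format_compliance(test_prompts, generate_response):
    """Fraction of held-out prompts whose outputs parse as valid JSON."""
    passed = 0
    for prompt in test_prompts:
        try:
            json.loads(generate_response(prompt))
            passed += 1
        except json.JSONDecodeError:
            pass
    return passed / len(test_prompts)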

Step 5: Deployment

Merge the LoRA weights into the base model (see the sketch after this list), quantize if needed, and deploy via:

  • vLLM (high throughput)
  • llama.cpp (local/edge)
  • Hugging Face TGI (managed)
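
The merge itself is a few lines with peft. A sketch, assuming the adapter was saved to ./lora-out during training (paths are hypothetical):

from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
model = PeftModel.from_pretrained(base, "./lora-out")  # load the trained adapter
merged = model.merge_and_unload()                      # fold LoRA weights into the base
merged.save_pretrained("./merged-model")               # ready for quantization or serving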

Common Pitfalls

  1. Overfitting: Model memorizes training data, performs poorly on new inputs
  2. Catastrophic forgetting: Loses general capabilities
  3. Data leakage: Test examples too similar to the training data (a quick check is sketched below)
  4. Format inconsistency: Training data has varying output formats
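
For the leakage pitfall, even a crude exact-match check catches the worst cases. A sketch that flags test examples whose normalized text also appears in the training data (near-duplicates need fuzzier matching, such as n-gram overlap):

def find_leaks(train_examples, test_examples):
    """Return test examples whose instruction+input also appear in training data."""
    def key(ex):
        return (ex["instruction"] + " " + ex.get("input", "")).lower().strip()
    train_keys = {key(ex) for ex in train_examples}
    return [ex for ex in test_examples if key(ex) in train_keys]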