Fine-tuning transforms a general-purpose language model into a specialist. This guide walks through the complete process: deciding whether to fine-tune at all, preparing data, training, evaluating, and deploying.
When to Fine-Tune (And When Not To)
Fine-tuning makes sense when:
- You need consistent output format or style
- Domain-specific knowledge isn’t in the base model
- Prompt engineering has hit its limits
- You have at least 100 (ideally 500-1,000) high-quality examples
Don’t fine-tune when:
- Few-shot prompting solves the problem
- You lack quality training data
- The task changes frequently
Step 1: Data Preparation
Quality training data is the single most important factor. Each example should follow this format:
```json
{
  "instruction": "What you want the model to do",
  "input": "The specific input (optional)",
  "output": "The ideal response"
}
```
Data quality checklist:
- Outputs are genuinely good (you’d be happy to see them in production)
- Instructions are clear and consistent
- No contradictory examples
- Diverse coverage of your use case
- At least 100 examples (500+ preferred)
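Much of this checklist can be enforced mechanically before you spend any GPU time. Below is a minimal validation sketch, assuming the examples are stored one JSON object per line in a file named train.jsonl (a hypothetical path):

```python
import json
from collections import Counter

REQUIRED_KEYS = {"instruction", "output"}  # "input" is optional per the format above

def validate_dataset(path="train.jsonl", min_examples=100):
    examples, problems = [], []
    with open(path) as f:
        for i, line in enumerate(f, 1):
            ex = json.loads(line)
            missing = REQUIRED_KEYS - ex.keys()
            if missing:
                problems.append(f"line {i}: missing keys {sorted(missing)}")
            elif not ex["output"].strip():
                problems.append(f"line {i}: empty output")
            examples.append(ex)
    # Exact duplicates inflate apparent dataset size and encourage memorization
    counts = Counter(
        (e.get("instruction"), e.get("input"), e.get("output")) for e in examples
    )
    dupes = sum(1 for c in counts.values() if c > 1)
    if dupes:
        problems.append(f"{dupes} duplicated example(s)")
    if len(examples) < min_examples:
        problems.append(f"only {len(examples)} examples ({min_examples}+ recommended)")
    return problems

if __name__ == "__main__":
    print(validate_dataset() or "no issues found")
```

This catches structural problems only; judging whether outputs are "genuinely good" still requires human review of a sample.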
Step 2: Choosing a Base Model
| Use Case | Recommended Base |
|---|---|
| General assistant | Mistral 7B, Llama 2 7B |
| Code generation | CodeLlama, StarCoder |
| Long context | Mistral, Llama 2 with RoPE scaling |
| Multilingual | BLOOM, mGPT |
Step 3: Training Configuration
For most use cases, LoRA (Low-Rank Adaptation) offers the best tradeoff between quality and compute:
```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                                 # Rank (start with 8-16)
    lora_alpha=32,                        # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Attention layers
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
```
Training parameters:
- Learning rate: 1e-4 to 2e-4
- Batch size: 4-8 (with gradient accumulation)
- Epochs: 3-5 (watch for overfitting)
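As a concrete illustration, here is one way those pieces might fit together with Hugging Face transformers and peft. The base model name and output_dir are placeholders, and dataset loading plus the actual Trainer call are omitted:

```python
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import get_peft_model

# Wrap the base model with the LoRA adapter defined above
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # sanity check: typically well under 1% trainable

training_args = TrainingArguments(
    output_dir="lora-out",           # hypothetical output path
    learning_rate=2e-4,              # top of the 1e-4 to 2e-4 range
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size of 16
    num_train_epochs=3,
    logging_steps=10,
    save_strategy="epoch",
    fp16=True,
)
```

Watching the eval loss per epoch is the simplest guard against the overfitting warned about above: if it starts rising while train loss keeps falling, stop early.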
Step 4: Evaluation
Don’t just eyeball outputs. Create a held-out test set and measure:
- Task accuracy (does it do what you want?)
- Format compliance (structured outputs)
- Regression testing (did it forget general capabilities?)
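A small harness makes these checks repeatable across training runs. The sketch below assumes a test.jsonl in the same format as the training data and a generate function you supply that maps a prompt to the model's output string; exact match is a crude stand-in for task accuracy, so substitute task-specific scoring where it matters, and the JSON check only applies if your outputs are supposed to be JSON:

```python
import json

def evaluate(generate, test_path="test.jsonl"):
    """generate: callable taking a prompt string, returning the model's reply."""
    n = exact = fmt_ok = 0
    with open(test_path) as f:
        for line in f:
            ex = json.loads(line)
            prompt = ex["instruction"] + ("\n" + ex["input"] if ex.get("input") else "")
            pred = generate(prompt)
            n += 1
            exact += int(pred.strip() == ex["output"].strip())
            try:                      # format compliance: is the output valid JSON?
                json.loads(pred)
                fmt_ok += 1
            except ValueError:
                pass
    return {"exact_match": exact / max(n, 1), "format_rate": fmt_ok / max(n, 1)}
```

For the regression check, run the same harness against a slice of a general benchmark or a handful of generic prompts and compare against the base model's scores.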
Step 5: Deployment
Merge the LoRA weights into the base model, quantize if needed, and deploy via:
- vLLM (high throughput)
- llama.cpp (local/edge)
- Hugging Face TGI (managed)
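Merging is a few lines with peft. This sketch assumes the adapter was saved to lora-out as in the training example above (a hypothetical path):

```python
from peft import AutoPeftModelForCausalLM

# Load the base model with the adapter attached, then fold the LoRA deltas
# into the base weights so no peft dependency is needed at serving time
model = AutoPeftModelForCausalLM.from_pretrained("lora-out")
merged = model.merge_and_unload()
merged.save_pretrained("merged-model")  # point vLLM/TGI here, or convert for llama.cpp
```

Remember to save the tokenizer alongside the merged weights, and re-run your evaluation harness on the merged (and, if applicable, quantized) model before shipping it.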
Common Pitfalls
- Overfitting: Model memorizes training data, performs poorly on new inputs
- Catastrophic forgetting: Loses general capabilities
- Data leakage: Test examples too similar to training examples
- Format inconsistency: Training data has varying output formats
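Of these, data leakage is the easiest to catch automatically. One cheap heuristic is long n-gram overlap between test and training outputs; the sketch below uses the same hypothetical train.jsonl/test.jsonl files as earlier:

```python
import json

def ngrams(text, n=8):
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def leakage_check(train_path="train.jsonl", test_path="test.jsonl", n=8):
    """Return test line numbers whose output shares a long n-gram with training data."""
    train_grams = set()
    with open(train_path) as f:
        for line in f:
            train_grams |= ngrams(json.loads(line)["output"], n)
    suspects = []
    with open(test_path) as f:
        for i, line in enumerate(f, 1):
            if ngrams(json.loads(line)["output"], n) & train_grams:
                suspects.append(i)
    return suspects
```

Flagged examples are not necessarily leaks, but they are worth a manual look before you trust your evaluation numbers.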