# Fine-Tuning LLMs: When and How to Customize AI Models
Fine-tuning lets you specialize an LLM for your needs. But it’s not always the answer.
## What Is Fine-Tuning?
Fine-tuning trains an existing model on your specific data:
```
Base Model (general knowledge)
            ↓
    + Your Training Data
            ↓
Fine-Tuned Model (specialized for your use case)
```
## When to Fine-Tune

### Good Reasons to Fine-Tune
- **Specific style/tone**
  - Match your brand voice
  - Consistent formatting
  - Domain-specific language
- **Specialized knowledge**
  - Industry terminology
  - Company-specific information
  - Rare domains
- **Performance optimization**
  - Reduce prompt length
  - Faster inference
  - More consistent outputs
- **Cost reduction**
  - Use a smaller fine-tuned model
  - Fewer tokens per request
  - Simplified prompts
### When NOT to Fine-Tune
- **RAG is sufficient**
  - For factual knowledge retrieval
  - When information changes frequently
  - When you need citations
- **Prompt engineering works**
  - Simple formatting changes
  - Standard use cases
  - Still experimenting
- **Limited data**
  - Fine-tuning needs hundreds of examples or more
  - Quality matters more than quantity
  - Diverse examples are required
## Fine-Tuning vs. Alternatives
| Approach | Best For | Effort |
|---|---|---|
| Prompt engineering | Quick adjustments | Low |
| Few-shot examples | Format/style guidance | Low |
| RAG | Factual knowledge | Medium |
| Fine-tuning | Deep customization | High |
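If you're unsure which rung you need, start at the bottom of the table. Here's a minimal few-shot sketch showing the lowest-effort alternative (the prompt, examples, and model name are all illustrative):

```python
# Few-shot prompting: steer format and style with in-context examples
# instead of training. Everything below is illustrative.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Answer in the friendly brand voice shown in the examples."},
        # In-context example
        {"role": "user", "content": "Is my order refundable?"},
        {"role": "assistant", "content": "Absolutely! You have 30 days to request a refund."},
        # The real question
        {"role": "user", "content": "Do you ship internationally?"},
    ],
)
print(response.choices[0].message.content)
```

If two or three in-context examples already give you consistent outputs, you may not need fine-tuning at all.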
## How to Fine-Tune

### Step 1: Prepare Data
Create training examples:
```json
{
  "messages": [
    {"role": "system", "content": "You are a customer service agent..."},
    {"role": "user", "content": "Customer question here"},
    {"role": "assistant", "content": "Ideal response here"}
  ]
}
```
### Step 2: Format Dataset
Requirements vary by provider:

- OpenAI: JSONL, one chat-format example per line
- Anthropic: no self-serve fine-tuning API; Claude fine-tuning has been offered through partners such as Amazon Bedrock
- Open source: varies by framework (Hugging Face datasets, instruction-format JSON, etc.)
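For OpenAI, here's a minimal sketch of producing that JSONL file: one JSON object per line, each containing a `messages` array (the example content is illustrative):

```python
# Write chat-format training examples as JSONL: one JSON object per line.
import json

examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a customer service agent..."},
            {"role": "user", "content": "Customer question here"},
            {"role": "assistant", "content": "Ideal response here"},
        ]
    },
    # ... more examples
]

with open("training_data.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```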
### Step 3: Upload and Train
```python
# OpenAI example
from openai import OpenAI

client = OpenAI()

# Upload the training file
file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Create the fine-tuning job (fine-tuning requires a dated model snapshot)
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",
)
```
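Jobs run asynchronously and can take minutes to hours. A minimal polling sketch, continuing from the block above:

```python
import time

# Poll until the job reaches a terminal status.
while True:
    job = client.fine_tuning.jobs.retrieve(job.id)
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)

print(job.status)
print(job.fine_tuned_model)  # the model name to use at inference time
```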
### Step 4: Evaluate
Test your fine-tuned model:
- Compare to base model
- Check for regression
- Measure on held-out data
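A minimal comparison harness, assuming a held-out set the model never saw in training; `held_out`, `generate`, and `score` are illustrative names, and the `ft:` model ID is a placeholder:

```python
# Run held-out prompts through the base and fine-tuned models and score each.
from openai import OpenAI

client = OpenAI()

held_out = [
    {"prompt": "Summarize this contract: ...", "reference": "..."},
    # ... examples that were never in the training set
]

def generate(model: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def score(output: str, reference: str) -> float:
    # Placeholder metric: swap in exact match, ROUGE, or an LLM-as-judge.
    return float(output.strip() == reference.strip())

for model in ["gpt-4o-mini", "ft:gpt-4o-mini-2024-07-18:your-org::abc123"]:
    results = [score(generate(model, ex["prompt"]), ex["reference"]) for ex in held_out]
    print(f"{model}: {sum(results) / len(results):.2%}")
```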
### Step 5: Deploy
Use your custom model:
```python
response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:your-org::abc123",
    messages=[{"role": "user", "content": "Hello"}],
)
```
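The `ft:` model name above is a placeholder; use the `fine_tuned_model` value from your completed job, which follows this naming pattern.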
## Data Requirements

### Quantity
| Use Case | Minimum Examples |
|---|---|
| Style adjustment | 50-100 |
| Task specialization | 200-500 |
| Complex behavior | 1000+ |
### Quality
- Diverse examples
- Correct outputs
- Representative of real use
- Clean formatting
### Structure
Good training example:
```json
{
  "messages": [
    {"role": "user", "content": "Summarize this contract: [long text]"},
    {"role": "assistant", "content": "**Key Terms:**\n- Duration: 2 years\n- Value: $50,000\n**Obligations:**\n- Monthly reporting\n- Annual audit"}
  ]
}
```
## Cost Considerations

### Training Costs
| Provider | Approximate Cost |
|---|---|
| OpenAI GPT-4o-mini | ~$3-25 per typical job (scales with training tokens) |
| OpenAI GPT-4o | Higher per-token training rate |
| Open source | Compute costs only |
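Providers typically bill training by token, so a rough estimate is training tokens × epochs × rate. At an illustrative $3 per million training tokens, 500 examples averaging 1,000 tokens each over 3 epochs is 500 × 1,000 × 3 = 1.5M tokens, or about $4.50. Check your provider's current pricing page before budgeting.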
### Inference Costs
Fine-tuned models often cost more per token, but:
- Shorter prompts needed
- Better results = fewer retries
- Net cost often lower
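A hypothetical illustration: if the base model needs a 1,500-token prompt (instructions plus few-shot examples) and the fine-tuned model does the same job with 100 tokens, then even at a 2x per-token premium the fine-tuned request costs the equivalent of 200 base-rate input tokens instead of 1,500, roughly an 87% reduction on the prompt side.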
## Common Mistakes
| Mistake | Fix |
|---|---|
| Too few examples | Get more data |
| Poor quality data | Clean and curate |
| Overfitting | More diverse examples |
| Wrong task | Maybe use RAG instead |
| Ignoring base model | Build on its strengths |
## Open Source Options

### Frameworks
| Tool | Best For |
|---|---|
| Hugging Face (Transformers + TRL) | Standard fine-tuning |
| LLaMA Factory | LLaMA models |
| Axolotl | Easy configuration |
| PEFT | Parameter-efficient fine-tuning (LoRA, etc.) |
### Efficient Techniques
- LoRA: Train small adapters
- QLoRA: LoRA + quantization
- PEFT: Parameter-efficient methods
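To show how lightweight these methods are in practice, here is a minimal LoRA sketch using the `peft` library; the checkpoint and hyperparameters are illustrative, not recommendations:

```python
# Minimal LoRA setup with Hugging Face peft: freeze the base model and
# train small low-rank adapter matrices on selected attention projections.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Any causal-LM checkpoint works; this one is illustrative (and gated).
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # adapter scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```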
## Evaluation Checklist
Before deploying:
- [ ] Tested on held-out data
- [ ] Compared to base model
- [ ] Checked for regressions
- [ ] Evaluated edge cases
- [ ] Measured cost impact
- [ ] User tested
Need help deciding if fine-tuning is right for you? Let’s discuss your use case.