TECHNOLOGIES

Fine-Tuning

Continuing a model's training on your own examples to change how it behaves, which is the right tool for style and format and the wrong tool for injecting fresh facts.

Last reviewed: byKevin Riedl wiki ↗

Fine-tuning takes a pretrained LLM and trains it further on a curated set of your own examples, adjusting the model’s weights so it leans toward the behavior you demonstrated. You use it to bake in a consistent tone, a specific output format, a domain vocabulary, or a task the base model handles awkwardly. It changes how the model responds, not what facts it has access to.

The decision that actually matters is fine-tuning versus RAG, and people get it backwards constantly. RAG injects fresh, changing facts at runtime by retrieving from your data, so the answer is only as current as your last document update. Fine-tuning teaches durable behavior but freezes the knowledge at training time, so it is the wrong tool when your facts change, and it will not reliably stop the model from inventing things either, since hallucination is reduced by grounding, not by tuning. The rule of thumb: tune for form, retrieve for facts. Many real systems use both, a fine-tuned model that answers in your house style, grounded by retrieval for the live data.

The cost reality is less scary than it used to be but still real. You need a clean, labeled dataset (usually hundreds to thousands of examples), the training run itself, and the ongoing cost of re-tuning every time the base model improves or your requirements shift. That last cost is the one teams forget. A fine-tune is not a one-time project, it is a maintenance commitment.

Worked example of fine-tuning winning outright: a company needs every model response to come back as strict JSON matching an internal schema, with a specific terse house style, across millions of calls. Prompting can get there most of the time, but the occasional malformed response breaks the downstream system, and stuffing the full style guide and schema into every prompt burns tokens on every call. A fine-tune bakes the format and tone into the model’s weights, so the instructions no longer have to be repeated in the context, which makes responses both more reliable and cheaper per call at that volume. That is the sweet spot: durable behaviour, high call volume, and a format that prompting cannot quite nail consistently.

When does fine-tuning win outright? When prompting plus retrieval cannot reliably produce the format or behavior you need, and you have enough high-quality examples to teach it. When in doubt, exhaust prompting and RAG first, because they are cheaper to change. We make this build-versus-tune call under Artificial Intelligence.

// FAQ

FAQs

Tune for form, retrieve for facts. Fine-tuning changes how the model behaves (tone, format, task). RAG injects current facts at runtime. If your data changes often, you want RAG, not a fine-tune that freezes knowledge.
More than the training run suggests. Budget for building a clean labeled dataset, the training itself, and the recurring cost of re-tuning whenever the base model or your requirements change. The maintenance is the part teams underestimate.
When you need fresh or frequently changing facts in the answers. Fine-tuning bakes knowledge in at training time. For live data, use retrieval. Also wrong if prompting and RAG already get you there cheaper.