By Fabian Farestam and Jonathan Ahrlind, on behalf of Dentio Technical Staff

TLDR: LLM-as-a-judge evals are noisy because judges reward style, not just content. We rewrite every candidate into a neutral, fact-only form before judging, trading hidden style bias for a uniform, controllable bias. This cuts style-induced score spread without burying real clinical errors.

LLM-as-a-judge evaluation has become a default pattern for many production AI systems. It is useful, cheap enough to run often, and flexible enough to score messy outputs that do not fit clean unit tests. It is also easy to fool: judges tend to prefer certain styles, often reward verbosity, and can favor outputs that look like they came from their own model family. In multilingual settings, including Swedish, the agreement can get worse.

At Dentio, this matters because we are evaluating clinical documentation workflows, not chatbot answers. We help dentists draft journal notes from patient conversations, and a clinically correct note can look very different depending on the dentist. It might be terse or verbose, bullet-pointed or narrative, structured by procedure or by chronology. This variation, while good for the product, is bad for evals. If the goal is to measure whether the pipeline preserves the right clinical facts, then style should not influence the score as much as it usually does. Our fix is deliberately simple: before judging, make every candidate boring.

We call this style normalization: rewriting each candidate into short, neutral factual prose before passing it to the judge. Instead of pretending the judge is unbiased, we introduce a controlled bias that is applied uniformly across all candidates.

The setup

Suppose we want to evaluate some factor A, e.g., whether a generated journal note preserves clinically relevant facts. Instead of sending the candidate directly to the judge, we first send it to a separate model Y with a narrower instruction:

Du är en medicinsk textnormaliserare. Din uppgift är att läsa ett journalanteckningsutdrag och återge ENBART de kliniska fakta som en kompakt, neutral punktlista på svenska.

Regler:
- En rad per faktum. Använd bindestreck som punkt.
- Behåll tandnummer, ytor, diagnoser, mätvärden och föreslagna åtgärder exakt som de står.
- Ta bort artighetsfraser, motiveringsspråk, stilistiska omsvep och formateringsrubriker som "Anamnes" / "Bedömning".
- Lägg INTE till information som inte finns i originalet.
- Skriv inga förklaringar eller rubriker före listan. Bara raderna.

Journaltext:
---
{journal}
---

The output from Y is then passed to the same LLM-as-a-judge evaluator. This still introduces bias, as the normalized text will reflect the preferences and failure modes of model Y, but this bias is applied uniformly across all candidates.
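As a rough illustration of the flow described above, here is a minimal Python sketch. It is not our production code: the names LLMCall, Judge, normalize and judge_with_normalization are hypothetical, the model calls are passed in as plain functions so the sketch stays independent of any particular client library, and NORMALIZER_PROMPT is an abbreviated stand-in for the full Swedish prompt shown earlier.

```python
from typing import Callable

# Hypothetical interfaces: wire these to whatever model client you actually use.
LLMCall = Callable[[str], str]        # prompt in, model text out
Judge = Callable[[str, str], float]   # (candidate, reference) in, score out

# Abbreviated placeholder; in practice this would hold the full normalizer prompt.
NORMALIZER_PROMPT = "Journaltext:\n---\n{journal}\n---"


def normalize(journal: str, call_normalizer: LLMCall) -> str:
    """Ask model Y to restate the note as a neutral list of clinical facts."""
    prompt = NORMALIZER_PROMPT.format(journal=journal)
    return call_normalizer(prompt).strip()


def judge_with_normalization(
    candidate: str,
    reference: str,
    call_normalizer: LLMCall,
    judge: Judge,
) -> float:
    """Score the normalized candidate instead of the raw candidate."""
    normalized = normalize(candidate, call_normalizer)
    # The judge itself is unchanged; only its input differs.
    return judge(normalized, reference)
```

The point of keeping the judge untouched is that the only thing that changes between a raw and a normalized run is the text the judge sees, so any difference in scores can be attributed to the normalization step.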

Some results

We tested this on an internal set of manually annotated Swedish dental journals. For each example, we generated style variants (verbose, terse, bullet, narrative) holding all clinical facts identical, then scored every variant against the gold reference. The judge used a 1-10 holistic clinical-quality score covering correctness, structure, and professionalism.

Without normalization, the same factual content received scores that spread across 2.5 points on the 10-point scale depending purely on style. Noise of roughly a quarter of the scale from presentation alone is enough to make A/B comparisons between pipeline versions unreliable.
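To make the metric concrete, here is one way style-induced spread could be computed. This is a sketch under an assumption: we take spread to mean the highest minus the lowest judge score across the style variants of one example, averaged over examples. The function names and the scores in the example are made up for illustration and are not results from our data.

```python
from statistics import mean


def style_spread(scores_by_style: dict[str, float]) -> float:
    """Max minus min judge score across style variants of one example."""
    values = scores_by_style.values()
    return max(values) - min(values)


def mean_style_spread(examples: list[dict[str, float]]) -> float:
    """Average the per-example spread over the whole evaluation set."""
    return mean(style_spread(scores) for scores in examples)


# Illustrative scores only: same facts, four styles, one example.
example = {"verbose": 8.0, "terse": 6.0, "bullet": 7.0, "narrative": 7.5}
print(style_spread(example))  # 2.0
```

Because all variants of an example carry identical clinical facts, any nonzero spread here is attributable to presentation rather than content.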

We tested ten different models as the normalizer Y, ranging from small (Gemma 3 4b, GPT-5.4 nano) to large (Claude Opus 4.7, GPT-5.5). In this sample, every tested model reduced style-induced spread, with larger normalizers generally performing better.

Figure 1

Larger normalizers strip more style

Every model we tested cuts style-induced score spread. Bigger normalizers generally do it better, but even a 4B model helps.

The critical safety question is whether normalization buries real clinical errors. We tested this by introducing wrong-tooth errors into the journals (changing one FDI tooth number to another valid but incorrect number) and measuring how much the judge's score dropped compared to the clean version. The important result is that normalization reduced style-induced score spread without reducing sensitivity to a clinically meaningful injected error.
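The error injection described above can be sketched roughly as follows. This is an illustrative reconstruction, not the exact script we ran: inject_wrong_tooth is a hypothetical helper, it only covers permanent teeth in FDI notation (11-18, 21-28, 31-38, 41-48), and its simple regex would also match other two-digit numbers in that range (for example the day in a date), so real use would need a stricter way to locate tooth references.

```python
import random
import re

# Permanent teeth in FDI notation: quadrant 1-4 followed by position 1-8.
VALID_FDI = [f"{quadrant}{position}" for quadrant in range(1, 5) for position in range(1, 9)]
FDI_PATTERN = re.compile(r"\b[1-4][1-8]\b")


def inject_wrong_tooth(journal: str, rng: random.Random) -> str:
    """Replace one FDI tooth number with a different, still valid, number."""
    matches = list(FDI_PATTERN.finditer(journal))
    if not matches:
        raise ValueError("no FDI tooth number found in journal text")
    target = rng.choice(matches)
    # Pick any valid tooth number except the one already in the text.
    wrong = rng.choice([tooth for tooth in VALID_FDI if tooth != target.group()])
    return journal[:target.start()] + wrong + journal[target.end():]


rng = random.Random(0)  # seeded so the corrupted set is reproducible
print(inject_wrong_tooth("Karies på tand 36, fyllning planerad.", rng))
```

Only the tooth number changes, so the corrupted note stays fluent and plausible, which is what makes this error class a useful probe for the judge.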

Figure 2

Real errors still cut through

We injected wrong-tooth errors and measured the judge's score drop with and without normalization. The dashed line marks 100%, where normalization neither helps nor hurts. Above it, the error became more salient.
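One plausible reading of the percentage in this figure is the judge's score drop with normalization divided by its score drop without normalization. The sketch below encodes that reading; the function name and the scores are hypothetical, and the exact definition used for the figure may differ.

```python
def error_salience_pct(
    raw_clean: float,
    raw_corrupted: float,
    norm_clean: float,
    norm_corrupted: float,
) -> float:
    """Score drop with normalization as a percentage of the drop without it."""
    raw_drop = raw_clean - raw_corrupted
    norm_drop = norm_clean - norm_corrupted
    if raw_drop <= 0:
        raise ValueError("the raw judge did not penalize the injected error")
    return 100.0 * norm_drop / raw_drop


# Illustrative scores only: raw drop is 2 points, normalized drop is 3 points.
print(error_salience_pct(8.0, 6.0, 8.0, 5.0))  # 150.0
```

Under this reading, a value of 100 means normalization left the penalty for the error unchanged, and a value above 100 means the judge penalized the error more once the style was stripped away.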

Practical takeaway

You do not need a frontier model as the normalizer. In this setup, mid-size models gave 33-40% spread reduction while preserving sensitivity to the error class we tested, at a fraction of the cost of using models like Claude Opus 4.7 or GPT-5.5 for every eval call.
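For readers who want to run the same comparison on their own normalizers, spread reduction can be read as the relative decrease in style-induced spread once normalization is applied. The sketch below assumes that definition; spread_reduction is a hypothetical helper and the numbers in the example are illustrative, not our measurements.

```python
def spread_reduction(raw_spread: float, normalized_spread: float) -> float:
    """Relative decrease in style-induced spread, as a fraction of the raw spread."""
    if raw_spread <= 0:
        raise ValueError("raw spread must be positive")
    return 1.0 - normalized_spread / raw_spread


# Illustrative numbers only: spread falls from 2.0 points to 1.5 points.
print(f"{spread_reduction(2.0, 1.5):.0%}")  # 25%
```

Dividing this number by the per-call cost of each normalizer gives a rough way to compare a mid-size model against a frontier model for the same eval budget.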

Conclusion

Style normalization is one small layer, not the whole eval stack. But it helped with a very real production problem: our evals were too easily impressed by style. By making every candidate boring first, we made the judge focus more on the clinical content we actually cared about. Raw judging has hidden style bias, while style-normalized judging has normalizer bias; for our use case, the second one has been much easier to reason about.

Read the full paper here

Correspondence: [email protected]