Notes · Vol. I · № 0002
Shipping

EvalLens.

A small open-source library for evaluating structured LLM outputs, with row-level diffs. Vibes are not an evaluation.

Apr 2026 · 4 min read · v1 · London

A few weeks ago I open-sourced EvalLens, a small library for evaluating structured outputs from language models. It does one thing: it compares what a model produced against what you expected, row by row, schema by schema, and tells you where the model is wrong with the kind of precision you can put in a pull request.

This is a short note about why it exists.

§ 1 · Vibes are not an evaluation

[Figure: two kinds of confidence. Before: "looks right" (vibes). After: "18 of 20 fields match" (structured evaluation).]
Two kinds of confidence. One that ships and breaks; one that ships and stays.

Most LLM "evaluation", in practice, is not evaluation. It is checking. Someone runs a prompt, looks at the output, and says: that looks right. Then they ship. A week later the model returns something different and nobody can tell whether the difference is good, bad, or noise. The original sample is gone. The decision was made on a feeling.

Evaluation, as a discipline, demands a different shape. You say in advance what correct looks like. You compare what came out to that definition. You produce an answer that is the same answer next week, on the same input, regardless of how you feel about the result.

Vibes do not survive the second week.

§ 2 · Why structured outputs

[Figure: a schema as a contract. An ExtractedInvoice schema, field by field: vendor (string), issued (ISO 8601), amount (decimal), currency ({USD…}), line_items (array). Model output is validated against it.]
A schema is a contract. Each row is a slot the model has to fill, with a name and a constraint.

The hard case is not free text. Free text is hard to grade in the abstract, but the LLM-shaped systems that matter most in production are the ones that have to produce structured output. Extraction. Classification. JSON. The system has to put a thing in a slot, and the slot has a name, a type, and a constraint.

These are the systems where the difference between "seems to work" and "actually works" compounds the fastest. A small drift in how a model handles a date format, or how it disambiguates a near-duplicate field, costs you months downstream. The cheap insurance is to grade every output, every field, every time, against an expected schema. Not vibes. Diffs.
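The "every field, every time" idea fits in a few lines. This is a sketch of the technique, not EvalLens's actual API: the schema, the helper name, and the verdict labels below are all illustrative.

```python
# Illustrative sketch of field-by-field grading against an expected
# schema. Names and verdict labels are hypothetical, not EvalLens's API.
EXPECTED_SCHEMA = {
    "vendor": str,
    "amount": float,
    "currency": str,
}

def grade_fields(expected: dict, got: dict) -> dict:
    """Return a per-field verdict: 'pass', 'drift', or 'missing'."""
    report = {}
    for field, type_ in EXPECTED_SCHEMA.items():
        if field not in got or got[field] is None:
            report[field] = "missing"
        elif not isinstance(got[field], type_):
            report[field] = "drift"   # right slot, wrong shape
        elif got[field] != expected[field]:
            report[field] = "drift"   # right shape, wrong value
        else:
            report[field] = "pass"
    return report

report = grade_fields(
    {"vendor": "Acme", "amount": 99.95, "currency": "USD"},
    {"vendor": "Acme", "amount": 99.949, "currency": None},
)
# One verdict per field, no aggregate score.
```

The point of the shape is that nothing is averaged away: a precision drift in `amount` and a missing `currency` stay visible as two separate, named failures.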

§ 3 · What it does

A diff, not a score. Row-level report:

    ROW  EXPECTED               GOT                     DIFF
    01   {date: 2024-01-15}     {date: 2024-01-15}      ✓
    02   {date: 2024-01-15}     {date: Jan 15, 2024}    ◐ format
    03   {amount: 250.00}       {amount: 250.00}        ✓
    04   {amount: 99.95}        {amount: 99.949}        ◐ precision
    05   {currency: USD}        {currency: null}        ✗ missing

    2 / 5 pass · 2 drifts · 1 fail
Five rows. Each row is a single decision the model made. Each decision is checked, character by character, against the contract.

The tool itself is small on purpose. You point it at a CSV of inputs and expected outputs, you give it a way to call your model, and it gives you a row-level report. Which rows passed. Which rows failed. Which fields drifted. Which fields were exactly right.

The result is a diff, not a score. Scores are too aggregated to be useful when you are debugging; diffs tell you where the model is actually wrong. If you have ten thousand rows, ten thousand diffs is a lot. If you have ten rows, ten diffs is exactly the right amount of information. The library does not hide that asymmetry from you.
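The diff-over-score idea can be sketched as a loop that keeps every row's verdict instead of collapsing them into one number. The function name and return shape here are illustrative, not the library's real interface.

```python
# Hypothetical row-level diff loop; names are illustrative,
# not EvalLens's actual interface.
def diff_rows(rows):
    """rows: iterable of (expected, got) dict pairs.
    Returns one entry per row with the exact field-level mismatches."""
    diffs = []
    for i, (expected, got) in enumerate(rows, start=1):
        row_diff = {
            field: {"expected": want, "got": got.get(field)}
            for field, want in expected.items()
            if got.get(field) != want
        }
        diffs.append({"row": i, "diff": row_diff, "pass": not row_diff})
    return diffs

report = diff_rows([
    ({"date": "2024-01-15"}, {"date": "2024-01-15"}),
    ({"date": "2024-01-15"}, {"date": "Jan 15, 2024"}),
    ({"currency": "USD"}, {}),
])
# report[0] passes; report[1] and report[2] each carry the exact
# field-level mismatch rather than contributing to a 1/3 score.
```

A score would tell you one of three rows failed; the diff tells you row 2 has the right date in the wrong format and row 3 dropped `currency` entirely, which are different bugs with different fixes.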

§ 4 · How it ended up open

The honest version is that I built it for myself first. I needed a way to grade structured outputs that did not involve me staring at a CSV in a notebook for an hour, and nothing on the shelf was small enough to want. So I wrote a small one. It worked for the thing I was working on, then for the next thing, then for the one after that.

After a few months of using it I realised I could either keep it in a private repo and rebuild some version of it the next time the laptop changed, or I could clean it up and put it somewhere I would not lose. Cleaning it up was a weekend; publishing it was an hour. So that is what I did.

If you find it useful, that is good. If not, that is also fine. Putting it out was less about the tool and more about not having to re-build the same small thing the next time I needed it.

§ 5 · What's next

The project is small and I would like to keep it that way. The next moves are the ones that make the row-level diff easier to read: a CLI report, a JSON output that fits into CI, a small adapter for the eval harness everyone is going to be standardising on next year.

Open-source things either grow or they don't. This one I would be happy with either way.

Repo on GitHub. Issues and pull requests welcome.