A small open-source library for evaluating structured LLM outputs, with row-level diffs. Vibes are not an evaluation.
A few weeks ago I open-sourced EvalLens, a small library for evaluating structured outputs from language models. It does one thing: it compares what a model produced against what you expected, row by row, schema by schema, and tells you where the model is wrong with the kind of precision you can put in a pull request.
This is a short note about why it exists.
Most LLM "evaluation", in practice, is not evaluation. It is checking. Someone runs a prompt, looks at the output, and says: that looks right. Then they ship. A week later the model returns something different and nobody can tell whether the difference is good, bad, or noise. The original sample is gone. The decision was made on a feeling.
Evaluation, as a discipline, demands a different shape. You say in advance what correct looks like. You compare what came out to that definition. You produce an answer that is the same answer next week, on the same input, regardless of how you feel about the result.
Vibes do not survive the second week.
The hard case is not free text. Free text is hard to grade in the abstract, but the LLM-shaped systems that matter most in production are the ones that have to produce structured output. Extraction. Classification. JSON. The system has to put a thing in a slot, and the slot has a name, a type, and a constraint.
These are the systems where the difference between seems to work and actually works compounds the fastest. A small drift in how a model handles a date format, or how it disambiguates a near-duplicate field, costs you months downstream. The cheap insurance is to grade every output, every field, every time, against an expected schema. Not vibes. Diffs.
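To make that concrete, here is a minimal field-by-field diff in plain Python. This is the shape of the check, not the library's own code, and the invoice fields are made up for illustration:

```python
# Compare a model's structured output against an expected record,
# field by field, and report what drifted rather than a single score.

def diff_fields(expected: dict, actual: dict) -> dict:
    """Return {field: (expected_value, actual_value)} for every mismatch."""
    diffs = {}
    for field, want in expected.items():
        got = actual.get(field)
        if got != want:
            diffs[field] = (want, got)
    return diffs

expected = {"invoice_date": "2024-03-01", "total": "120.00", "currency": "EUR"}
actual   = {"invoice_date": "03/01/2024", "total": "120.00", "currency": "EUR"}

print(diff_fields(expected, actual))
# {'invoice_date': ('2024-03-01', '03/01/2024')}
```

The date drift above is exactly the kind of failure a single accuracy number would bury: two of three fields are right, but the one that is wrong is the one that breaks whatever consumes the output.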
The tool itself is small on purpose. You point it at a CSV of inputs and expected outputs, you give it a way to call your model, and it gives you a row-level report. Which rows passed. Which rows failed. Which fields drifted. Which fields were exactly right.
The result is a diff, not a score. Scores are too aggregated to be useful when you are debugging; diffs tell you where the model is actually wrong. If you have ten thousand rows, ten thousand diffs is a lot. If you have ten rows, ten diffs is exactly the right amount of information. The library does not hide that asymmetry from you.
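For orientation, here is a sketch of that workflow: a CSV with an input column and an expected column, a callable that runs your model, and a per-row report. The function and column names are placeholders, not EvalLens's actual API; the repo is the reference for that.

```python
# Hypothetical sketch of the CSV-driven workflow described above.
# "input" / "expected" column names and run_rows are placeholders.
import csv
import json

def run_rows(csv_path: str, call_model) -> list[dict]:
    """Build a row-level report: one entry per CSV row, with per-field diffs."""
    report = []
    with open(csv_path, newline="") as f:
        for i, row in enumerate(csv.DictReader(f)):
            expected = json.loads(row["expected"])   # expected output stored as JSON
            actual = call_model(row["input"])        # however you call your model
            diffs = {k: (v, actual.get(k))
                     for k, v in expected.items() if actual.get(k) != v}
            report.append({"row": i, "passed": not diffs, "diffs": diffs})
    return report
```

The report is just a list you can read, filter, or dump into CI; nothing is averaged away before you have seen it.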
The honest version is that I built it for myself first. I needed a way to grade structured outputs that did not involve me staring at a CSV in a notebook for an hour, and nothing on the shelf was small enough to want. So I wrote a small one. It worked for the thing I was working on, then for the next thing, then for the one after that.
After a few months of using it I realised I could either keep it in a private repo and rebuild some version of it the next time the laptop changed, or I could clean it up and put it somewhere I would not lose. Cleaning it up was a weekend; publishing it was an hour. So that is what I did.
If you find it useful, that is good. If not, that is also fine. Putting it out was less about the tool and more about not having to re-build the same small thing the next time I needed it.
The project is small and I would like to keep it that way. The next moves are the ones that make the row-level diff easier to read: a CLI report, a JSON output that fits into CI, a small adapter for the eval harness everyone is going to be standardising on next year.
Open-source things either grow or they don't. Either way, I would be happy with this one.
Repo on GitHub. Issues and pull requests welcome.