Structured output evaluation

Spot LLM output failures before they reach users.

EvalLens helps you upload datasets, score model responses, and inspect row-level failures with a fast, focused workflow.

How it works

01

Upload

Bring your CSV or JSONL file with id, prompt, expected, and actual columns.
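For example, a minimal JSONL dataset with the four required columns (the rows themselves are illustrative):

```jsonl
{"id": "1", "prompt": "What is the capital of France?", "expected": "Paris", "actual": "Paris"}
{"id": "2", "prompt": "What is 2 + 2?", "expected": "4", "actual": "5"}
```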

02

Evaluate

Run automatic checks to measure pass rate and classify failure reasons.
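As a rough sketch of what an automatic check computes, the snippet below scores rows by exact match and assigns a failure reason. The comparison logic and reason labels here are assumptions for illustration, not EvalLens's actual checks.

```python
def evaluate(rows):
    """Score each row and classify its failure reason (labels are hypothetical)."""
    results = []
    for row in rows:
        expected, actual = row["expected"], row["actual"]
        if actual == expected:
            reason = None  # exact match: pass
        elif actual.strip() == expected.strip():
            reason = "whitespace_mismatch"
        else:
            reason = "content_mismatch"
        results.append({"id": row["id"], "passed": reason is None, "reason": reason})
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results

rows = [
    {"id": "1", "prompt": "What is 2 + 2?", "expected": "4", "actual": "4"},
    {"id": "2", "prompt": "Capital of France?", "expected": "Paris", "actual": "Paris "},
]
rate, results = evaluate(rows)
print(rate)  # 0.5: row 2 fails with reason "whitespace_mismatch"
```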

03

Inspect

Filter and drill into row-level results to diagnose what broke and why.

Hosted vs self-hosted


HOSTED

Bring your completed outputs

Hosted mode evaluates files that already include both expected and actual values. It is fast and ideal when you already have a generation pipeline producing outputs.

SELF-HOSTED

Generate, then evaluate in one run

Self-hosted mode can generate missing actual outputs using your configured provider keys, then evaluate the results. Run it yourself to keep data private and fully under your control.

Best for teams that need local data boundaries, environment-based provider control, or reproducible evals in CI. You can bring your own keys for OpenAI, Anthropic, or Gemini and switch models without changing your dataset format.

1. Clone the repo from GitHub and run locally or in Docker.

2. Set EVALLENS_MODE=self-hosted and at least one provider API key.

3. Upload your dataset, generate missing outputs, and inspect failures row by row.
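In a terminal, the three steps above might look like the following. The repository URL and start command are placeholders, and the provider key names (OPENAI_API_KEY and friends) are common conventions assumed here; only EVALLENS_MODE=self-hosted comes from the steps above.

```shell
# 1. Clone and run locally or in Docker (repository URL is a placeholder)
git clone https://github.com/your-org/evallens.git
cd evallens

# 2. Enable self-hosted mode and set at least one provider API key
export EVALLENS_MODE=self-hosted
export OPENAI_API_KEY=sk-...   # or ANTHROPIC_API_KEY / GEMINI_API_KEY

# 3. Start the app (command is an assumption), then upload your dataset
docker compose up
```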

Evaluate your outputs

Upload a CSV or JSONL file with id, prompt, expected, and actual columns.
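A minimal CSV with the four required columns might look like this (rows are illustrative):

```csv
id,prompt,expected,actual
1,What is the capital of France?,Paris,Paris
2,What is 2 + 2?,4,5
```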

Drop your file here, or browse

CSV, JSON, or JSONL