Structured output evaluation

Spot LLM output failures before they reach users.

EvalLens helps you upload datasets, score model responses, and inspect row-level failures with a fast, focused workflow.

How it works

01

Upload

Bring your CSV or JSONL file with id, prompt, expected, and actual columns.
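For example, a minimal JSONL dataset with the four required columns (the rows themselves are illustrative):

```jsonl
{"id": "1", "prompt": "What is the capital of France?", "expected": "Paris", "actual": "Paris"}
{"id": "2", "prompt": "What is 2 + 2?", "expected": "4", "actual": "5"}
```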

02

Evaluate

Run automatic checks to measure pass rate and classify failure reasons.
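As a rough sketch of what an automatic check computes, the snippet below scores rows by exact match and assigns a failure reason. The comparison logic and reason labels here are assumptions for illustration, not EvalLens's actual checks.

```python
def evaluate(rows):
    """Score each row and classify its failure reason (labels are hypothetical)."""
    results = []
    for row in rows:
        expected, actual = row["expected"], row["actual"]
        if actual == expected:
            reason = None  # exact match: pass
        elif actual.strip() == expected.strip():
            reason = "whitespace_mismatch"
        else:
            reason = "content_mismatch"
        results.append({"id": row["id"], "passed": reason is None, "reason": reason})
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results

rows = [
    {"id": "1", "prompt": "What is 2 + 2?", "expected": "4", "actual": "4"},
    {"id": "2", "prompt": "Capital of France?", "expected": "Paris", "actual": "Paris "},
]
rate, results = evaluate(rows)
print(rate)  # 0.5: row 2 fails with reason "whitespace_mismatch"
```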

03

Inspect

Filter and drill into row-level results to diagnose what broke and why.

Hosted vs self-hosted


HOSTED

Bring your completed outputs

Hosted mode evaluates files that already include both expected and actual values. It is fast and ideal when you already have a generation pipeline producing outputs.

SELF-HOSTED

Generate, then evaluate in one run

Self-hosted mode can generate missing actual outputs using your configured provider keys, then evaluate the results. Run it yourself to keep data private and fully under your control.

Best for teams that need local data boundaries, environment-based provider control, or reproducible evals in CI. You can bring your own keys for OpenAI, Anthropic, or Gemini and switch models without changing your dataset format.

1. Clone the repo from GitHub and run locally or in Docker.

2. Set EVALLENS_MODE=self-hosted and at least one provider API key.

3. Upload your dataset, generate missing outputs, and inspect failures row by row.
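In a terminal, the three steps above might look like the following. The repository URL and start command are placeholders, and the provider key names (OPENAI_API_KEY and friends) are common conventions assumed here; only EVALLENS_MODE=self-hosted comes from the steps above.

```shell
# 1. Clone and run locally or in Docker (repository URL is a placeholder)
git clone https://github.com/your-org/evallens.git
cd evallens

# 2. Enable self-hosted mode and set at least one provider API key
export EVALLENS_MODE=self-hosted
export OPENAI_API_KEY=sk-...   # or ANTHROPIC_API_KEY / GEMINI_API_KEY

# 3. Start the app (command is an assumption), then upload your dataset
docker compose up
```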

Evaluate your outputs

Upload a CSV or JSONL file with id, prompt, expected, and actual columns.
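A minimal CSV with the four required columns might look like this (rows are illustrative):

```csv
id,prompt,expected,actual
1,What is the capital of France?,Paris,Paris
2,What is 2 + 2?,4,5
```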

Drop your file here, or browse

CSV, JSON, or JSONL