Calculator

AI Eval Set Size calculator.

Calculate how many eval examples you need to detect a quality regression with confidence.

How we calibrated this
Used internally before any AI eval engagement.
Inputs

Tell us about your project.

This is a static reference card. For interactive calculators, talk to us — we tune the assumptions per client.

Current quality baseline

Range: 50–99 % · Default: 80 %

Smallest regression you want to detect

Range: 1–20 % · Default: 5 %

Statistical confidence
  • 90% · 0.7×
  • 95% · 1×
  • 99% · 1.4×
How it's calculated

The formula.

Power-analysis-style: the required n is derived from the baseline, the effect size (the smallest regression you want to detect), and the confidence level.
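
As a rough illustration of what a power-analysis-style calculation can look like, here is a minimal Python sketch. It is not the formula behind this card. It assumes a one-sided, one-sample z-test on a pass rate (normal approximation) and 80% statistical power, neither of which is stated above, and the function name `eval_set_size` and its parameters are hypothetical. It does not reproduce the 0.7× / 1× / 1.4× confidence multipliers listed in the inputs.

```python
import math
from statistics import NormalDist


def eval_set_size(baseline: float, drop: float,
                  confidence: float = 0.95, power: float = 0.80) -> int:
    """Examples needed to detect a fall from `baseline` to `baseline - drop`.

    Assumptions (not taken from the card): one-sided, one-sample z-test
    for a proportion, normal approximation, 80% power by default.
    All rates are fractions, e.g. 0.80 for 80 %.
    """
    if not 0.0 < drop < baseline < 1.0:
        raise ValueError("need 0 < drop < baseline < 1")
    if not (0.5 < confidence < 1.0 and 0.5 < power < 1.0):
        raise ValueError("confidence and power must be between 0.5 and 1")

    p0 = baseline          # current pass rate
    p1 = baseline - drop   # pass rate after the regression

    z_alpha = NormalDist().inv_cdf(confidence)  # one-sided critical value
    z_beta = NormalDist().inv_cdf(power)        # quantile for the chosen power

    # Standard sample-size expression for a one-sample proportion test.
    numerator = (z_alpha * math.sqrt(p0 * (1 - p0))
                 + z_beta * math.sqrt(p1 * (1 - p1)))
    return math.ceil((numerator / drop) ** 2)


if __name__ == "__main__":
    # Card defaults: 80 % baseline, 5-point regression, 95 % confidence.
    print(eval_set_size(0.80, 0.05, confidence=0.95))
```

Under these assumptions the required size grows with the confidence level and grows roughly with the inverse square of the regression you want to detect, so halving the detectable regression roughly quadruples the set.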

Output

Recommended eval set size

Examples needed.

Output

Cost to build (engineer days)

Approximate dataset-build time.

Output

Cost to run (LLM API)

Per regression-test run.

Want a real estimate?

This is a band, not a quote.

For a real estimate calibrated to your specific project, brief us. We get back to you within two business days.

Brief us on evals

Got a specific project?

Brief us in three sentences. We'll send a tailored estimate.
