AI Eval Set Size calculator.

Calculate how many eval examples you need to detect a quality regression with confidence.

How we calibrated this

Used internally before any AI eval engagement.

Inputs

Tell us about your project.

This is a static reference card. For an interactive calculator, talk to us; we tune the assumptions per client.

Current quality baseline

Range: 50–99 % · Default: 80 %

Smallest regression you want to detect

Range: 1–20 % · Default: 5 %

Statistical confidence

How it's calculated

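Power-analysis-style: the required number of examples n is derived from the quality baseline, the effect size (the smallest regression you want to detect), and the statistical confidence.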
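As a rough illustration of that kind of calculation, here is a minimal sketch that assumes a one-sided, one-sample test of a pass rate using the normal approximation. It is not necessarily the formula behind this card. The function name eval_examples_needed, the 95 % default confidence, and the 80 % power are assumptions made for the example; the card does not state them. Inputs are fractions rather than percentages, and the regression is treated as an absolute drop in pass rate.

```python
from math import ceil, sqrt
from statistics import NormalDist


def eval_examples_needed(baseline, min_drop, confidence=0.95, power=0.80):
    """Approximate eval set size needed to detect a drop in pass rate.

    baseline   -- current pass rate as a fraction, e.g. 0.80
    min_drop   -- smallest absolute drop to detect, e.g. 0.05
    confidence -- 1 - alpha for a one-sided test (assumed default)
    power      -- chance of catching a real drop (assumed default)
    """
    if not 0.0 < min_drop < baseline < 1.0:
        raise ValueError("need 0 < min_drop < baseline < 1")
    if not (0.5 < confidence < 1.0 and 0.5 < power < 1.0):
        raise ValueError("confidence and power must be between 0.5 and 1")

    degraded = baseline - min_drop  # pass rate after the regression

    # Standard normal quantiles for the chosen confidence and power.
    z_alpha = NormalDist().inv_cdf(confidence)
    z_beta = NormalDist().inv_cdf(power)

    # Normal-approximation sample size for a one-sample proportion test.
    numerator = (
        z_alpha * sqrt(baseline * (1.0 - baseline))
        + z_beta * sqrt(degraded * (1.0 - degraded))
    )
    return ceil((numerator / min_drop) ** 2)


if __name__ == "__main__":
    # The card's defaults: 80 % baseline, 5 % smallest regression.
    print(eval_examples_needed(0.80, 0.05))
```

Under these assumptions, halving the smallest regression you want to detect roughly quadruples the number of examples needed, because n scales with the inverse square of the drop. Comparing two separately measured runs, rather than one run against a fixed baseline, would call for a two-sample formula and more examples.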

Output

Examples needed.

Approximate dataset-build time.

Per regression-test run.

Want a real estimate?

For an estimate calibrated to your specific project, brief us. We get back to you within two business days.

Brief us in three sentences. We'll send a tailored estimate.