LLM Eval Calculator

LLM evaluation sample size calculator

Stop guessing your AI brand mention rate. Calculate how many prompts you need to test to measure it within a chosen margin of error across ChatGPT, Gemini, Claude, and more.

Set your confidence level and margin of error
Account for LLM variability (model "wobble")
Get exact prompt counts and testing estimates

Confidence Level: 95%

80%, 90%, 95%, 99%

Margin of Error: ±2.0%

±0.5%, ±2%, ±3.5%, ±5%

Resamples per prompt (K): 3

1, 5, 10, 20

Unique Prompts

353

Total API Calls

1,059

95% confidence, ±2% margin, 3 resamples per prompt

The Bottom Line Up Front

Testing too few prompts wastes your conclusions. Testing too many wastes your resources.

AI models give different answers to the same prompt. This "wobble" means you need enough samples to be statistically confident in your brand mention rate. Our calculator uses variance-aware formulas based on real-world LLM testing to tell you exactly how many prompts you need.

What is the LLM evaluation sample size calculator?

The LLM Evaluation Sample Size Calculator helps you plan statistically rigorous AI visibility tests. When measuring how often ChatGPT, Gemini, or Claude mentions your brand, you need enough prompts to be confident in your results — but not so many that you waste time and API costs.

Unlike traditional sample size calculators, ours accounts for LLM-specific variance: the "wobble" where AI gives different answers to the same prompt. We incorporate both between-question variance (how rates differ across prompts) and within-question variance (how the same prompt varies across regenerations).
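As a rough illustration of how the two variance sources combine, here is a minimal Python sketch. It assumes the textbook two-stage sampling formula, in which each prompt contributes a variance of (between-question variance + within-question variance / K) and the overall estimate divides that by the number of prompts. The function names and the variance inputs are hypothetical, and this is not necessarily the exact formula or the defaults our calculator uses.

```python
import math
from statistics import NormalDist


def z_score(confidence: float) -> float:
    """Two-sided normal critical value, e.g. about 1.96 for 95% confidence."""
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)


def required_prompts(confidence: float, margin: float, k: int,
                     between_var: float, within_var: float) -> int:
    """Unique prompts needed so the overall mention rate lands within +/- margin.

    Assumption: each prompt is resampled k times, so the variance carried by
    one prompt's average is between_var + within_var / k.
    """
    if k < 1 or margin <= 0:
        raise ValueError("k must be >= 1 and margin must be positive")
    per_prompt_var = between_var + within_var / k
    return math.ceil(z_score(confidence) ** 2 * per_prompt_var / margin ** 2)


# Hypothetical variance inputs; replace them with estimates from your own pilot data.
prompts = required_prompts(0.95, 0.02, k=3, between_var=0.03, within_var=0.03)
print(prompts, "unique prompts,", prompts * 3, "total API calls")
```

The variance inputs drive the result, so treat any output as only as good as the pilot data behind those two numbers.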

Based on real-world testing across industries and personas, the calculator tells you exactly how many unique prompts to test and how many times to resample each. Perfect for AEO teams measuring brand visibility, marketers tracking AI share of voice, or researchers benchmarking LLM behavior.

How the LLM eval calculator works

Set your precision target

Choose your margin of error (±2% is standard). This determines how accurate your measured brand mention rate will be.

Choose confidence level

Select how certain you want to be (95% is industry standard). Higher confidence requires more samples.

Set resampling strategy

Choose how many times to test each prompt (K). More resamples reduce within-question variance from LLM wobble.

Get your sample size

Receive the exact number of unique prompts needed, total API calls, and estimated testing effort.

Frequently asked questions

What is an AI brand mention rate?

AI brand mention rate is how often AI assistants like ChatGPT or Gemini recommend your brand when users ask relevant questions. For example, if your brand is mentioned in 40 out of 100 responses to category prompts, your mention rate is 40%.
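The arithmetic in that example can be sketched in a few lines of Python. This is a simplified illustration: it assumes a plain case-insensitive substring match counts as a "mention", and `Acme` is a placeholder brand name, not a real data set.

```python
def mention_rate(responses: list[str], brand: str) -> float:
    """Share of responses that mention the brand (case-insensitive substring match)."""
    if not responses:
        raise ValueError("need at least one response")
    needle = brand.lower()
    hits = sum(1 for text in responses if needle in text.lower())
    return hits / len(responses)


# Placeholder data: 40 responses that mention the brand, 60 that do not.
sample = ["We recommend Acme for this."] * 40 + ["Consider another vendor."] * 60
print(f"{mention_rate(sample, 'Acme'):.0%}")
```

In practice you may want stricter matching (for example, whole words or known brand aliases), since substring matching can over-count.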

Why do I need to calculate sample size for LLM testing?

LLMs give different answers each time — even to the same prompt. Without enough samples, your measured mention rate could be wildly different from the true rate. Sample size calculation ensures your results are statistically valid, not just noise.

What confidence level should I use?

95% is the industry standard for most business decisions. Use 90% for quick directional insights, or 99% for high-stakes decisions. Higher confidence requires more prompts but gives more reliable results.
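To see roughly how much each confidence level costs, the sketch below compares the prompts needed against a 95% baseline. It assumes the usual normal approximation, where sample size scales with the square of the z-score; the function names are illustrative, not part of the calculator.

```python
from statistics import NormalDist


def z_score(confidence: float) -> float:
    """Two-sided normal critical value for the given confidence level."""
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)


def sample_multiplier(confidence: float, baseline: float = 0.95) -> float:
    """How many times more (or fewer) prompts are needed relative to the baseline."""
    return (z_score(confidence) / z_score(baseline)) ** 2


for level in (0.80, 0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {z_score(level):.3f}, "
          f"prompts vs 95% = {sample_multiplier(level):.2f}x")
```

Under this assumption, moving from 95% to 99% confidence needs roughly 1.7 times as many prompts, while dropping to 90% needs roughly 0.7 times as many.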

What margin of error should I choose?

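±2-3% is typical for marketing decisions. If your true rate is 40%, a ±2% margin means your measured rate should fall between 38% and 42% at your chosen confidence level (for example, in about 95 out of 100 test runs at 95% confidence). Tighter margins require more samples but give more precise results.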
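Once you have results, you can check the margin you actually achieved. The sketch below is one simple way to do that, assuming each prompt's average mention rate (across its resamples) is treated as a single observation and a normal approximation applies. The function names and the sample rates are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def achieved_margin(prompt_rates: list[float], confidence: float = 0.95) -> float:
    """Margin of error for the overall rate, from per-prompt average rates."""
    if len(prompt_rates) < 2:
        raise ValueError("need at least two prompts")
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return z * stdev(prompt_rates) / math.sqrt(len(prompt_rates))


def confidence_interval(prompt_rates: list[float], confidence: float = 0.95):
    """(low, high) bounds around the overall mention rate."""
    center = mean(prompt_rates)
    margin = achieved_margin(prompt_rates, confidence)
    return center - margin, center + margin


# Hypothetical per-prompt rates from a very small pilot.
rates = [0.2, 0.4, 0.6, 0.4]
low, high = confidence_interval(rates)
print(f"overall rate between {low:.1%} and {high:.1%}")
```

With only a handful of prompts the interval is very wide, which is why a ±2% target needs hundreds of prompts.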

Why do I need resampling (K)?

LLMs have "wobble" — they give different answers to the same prompt. Resampling the same prompt multiple times (K=3-10) reduces this noise and gives you a more stable estimate of how that prompt typically performs.
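A small sketch of why resampling helps: if each regeneration is treated as an independent yes/no draw (a simplifying assumption), the noise in one prompt's measured rate shrinks with the square root of K. The function name is illustrative.

```python
import math


def per_prompt_std_error(true_rate: float, k: int) -> float:
    """Standard error of one prompt's measured mention rate after k resamples.

    Assumption: each regeneration is an independent Bernoulli draw.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    return math.sqrt(true_rate * (1.0 - true_rate) / k)


for k in (1, 3, 5, 10, 20):
    print(f"K={k}: +/- {per_prompt_std_error(0.5, k):.3f}")
```

For a prompt whose true rate is 50%, a single run has a standard error of 0.5, and four runs cut that to 0.25: quadrupling K halves the noise, so the gains taper off at higher K.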

What is between-question variance?

Between-question variance measures how much brand mention rates differ across different prompts. Some prompts might have a 60% mention rate while others have 20%. The higher this variance, the more unique prompts you need to test.

What is within-question variance?

Within-question variance measures LLM "wobble": how much the responses to the same prompt vary across regenerations. Even with identical input, AI might mention your brand 7 out of 10 times in one batch of regenerations and 5 out of 10 times in the next. Resampling (K) helps average out this noise.
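Both variance components can be estimated from a small pilot run. The sketch below uses a simple method-of-moments approach, assuming every prompt was resampled the same number of times (K of at least 2) and each response is recorded as 1 (brand mentioned) or 0. The function name and the pilot data are hypothetical, and this is one common estimator rather than the calculator's documented method.

```python
from statistics import mean, variance


def variance_components(outcomes: list[list[int]]) -> tuple[float, float]:
    """Return (between_var, within_var) from per-prompt lists of 0/1 outcomes."""
    k = len(outcomes[0])
    if k < 2 or len(outcomes) < 2 or any(len(o) != k for o in outcomes):
        raise ValueError("need >= 2 prompts, each with the same K >= 2 resamples")
    prompt_rates = [mean(o) for o in outcomes]
    # Within: average sample variance of the resamples inside each prompt.
    within = mean([variance(o) for o in outcomes])
    # Between: spread of the prompt averages, minus the wobble each average
    # still carries (within / k). Clamped at zero for small, noisy pilots.
    between = max(0.0, variance(prompt_rates) - within / k)
    return between, within


# Hypothetical pilot: 3 prompts, 4 regenerations each.
pilot = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 1, 0]]
between_var, within_var = variance_components(pilot)
print(f"between = {between_var:.3f}, within = {within_var:.3f}")
```

A pilot this small is only for illustration; estimates from a few dozen prompts are far more stable.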

How often should I re-test my AI mention rate?

Monthly testing catches trends: AI models update frequently, and your competitors' content changes over time. Quarterly is the minimum for tracking; monthly is better for active AEO programs.

How long does testing take?

With the calculator's default settings (95% confidence, ±2% margin, K=3), you need 353 unique prompts × 3 resamples = 1,059 API calls per platform. At standard rate limits, this takes roughly 1-2 hours per platform.
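For a rough sense of run time, you can divide the total calls by your sustained throughput. The sketch below uses the calculator's example readout (353 unique prompts, K=3); the 10 calls per minute figure is purely a placeholder assumption, since real rate limits depend on the platform and your account tier.

```python
def total_api_calls(unique_prompts: int, k: int) -> int:
    """Every unique prompt is sent k times."""
    return unique_prompts * k


def estimated_minutes(unique_prompts: int, k: int, calls_per_minute: float) -> float:
    """Wall-clock estimate, assuming calls run at a steady sustained rate."""
    if calls_per_minute <= 0:
        raise ValueError("calls_per_minute must be positive")
    return total_api_calls(unique_prompts, k) / calls_per_minute


calls = total_api_calls(353, 3)
minutes = estimated_minutes(353, 3, calls_per_minute=10)  # placeholder rate
print(f"{calls} calls, about {minutes:.0f} minutes per platform")
```

Retries, long responses, and parallel requests will all shift the real number, so treat this as a planning estimate only.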

Is this calculator free?

Yes. The LLM evaluation sample size calculator is free to use. Adjust the sliders to instantly see how many prompts you need for your specific confidence and precision requirements.