LLM evaluation sample size calculator
Stop guessing your AI brand mention rate. Calculate exactly how many prompts you need to test for statistically significant results across ChatGPT, Gemini, Claude, and more.
- Set your confidence level and margin of error
- Account for LLM variability (model "wobble")
- Get exact prompt counts and testing estimates
Example settings: 95% confidence, ±2% margin, 3 resamples per prompt
The Bottom Line Up Front
Testing too few prompts undermines your conclusions. Testing too many wastes your resources.
AI models give different answers to the same prompt. This "wobble" means you need enough samples to be statistically confident in your brand mention rate. Our calculator uses variance-aware formulas based on real-world LLM testing to tell you exactly how many prompts you need.
What is the LLM evaluation sample size calculator?
The LLM Evaluation Sample Size Calculator helps you plan statistically rigorous AI visibility tests. When measuring how often ChatGPT, Gemini, or Claude mentions your brand, you need enough prompts to be confident in your results, but not so many that you waste time and API budget.
Unlike traditional sample size calculators, ours accounts for LLM-specific variance: the "wobble" where an AI model gives different answers to the same prompt. We incorporate both between-question variance (how mention rates differ from one prompt to another) and within-question variance (how answers to the same prompt vary across regenerations).
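To make the two variance sources concrete, here is a minimal sketch of how they could be estimated from a small pilot run. It assumes each response is scored as a simple yes/no brand mention and that every prompt is resampled the same number of times. The function name `variance_components` and the estimator are illustrative assumptions, not the calculator's published method.

```python
from statistics import mean, variance


def variance_components(mentions, k):
    """Estimate (between_var, within_var) from pilot data.

    mentions: brand-mention counts per prompt, each out of k resamples.
    k: number of resamples per prompt (must be at least 2).
    Hypothetical helper: a standard two-level decomposition for
    yes/no outcomes, not the calculator's actual formula.
    """
    if k < 2 or len(mentions) < 2:
        raise ValueError("need at least 2 prompts and 2 resamples per prompt")

    # Observed mention rate for each prompt.
    rates = [m / k for m in mentions]

    # Within-question variance: average Bernoulli variance r * (1 - r),
    # scaled by k / (k - 1) to correct the small-sample bias.
    within_var = mean(r * (1 - r) for r in rates) * k / (k - 1)

    # Between-question variance: spread of the observed rates minus the
    # share of that spread explained by resampling noise alone.
    between_var = max(0.0, variance(rates) - within_var / k)

    return between_var, within_var


# Made-up pilot: 4 prompts, 3 resamples each, mention counts per prompt.
between_var, within_var = variance_components([0, 3, 3, 0], k=3)
```

In this made-up pilot every prompt answers the same way each time, so all of the variance is between prompts. A pilot where each prompt flips between mention and no mention would push the estimate toward within-question variance instead.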
Based on real-world testing across industries and personas, the calculator tells you exactly how many unique prompts to test and how many times to resample each. Perfect for AEO teams measuring brand visibility, marketers tracking AI share of voice, or researchers benchmarking LLM behavior.
How the LLM eval calculator works
Set your precision target
Choose your margin of error (±2 percentage points is standard). This determines how close your measured brand mention rate is expected to be to the true rate.
Choose confidence level
Select how certain you want to be (95% is the industry standard). Higher confidence requires more samples.
Set resampling strategy
Choose how many times to test each prompt (K). More resamples average out the within-question variance caused by LLM wobble, so each prompt's measured rate is more stable.
Get your sample size
Receive the exact number of unique prompts needed, total API calls, and estimated testing effort.
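The four steps above can be sketched as a short calculation. This is a rough illustration under stated assumptions, not the calculator's actual implementation: it uses the normal approximation and a two-level variance model in which the variance per prompt is the between-question variance plus the within-question variance divided by K. The function name `required_prompts` and the variance inputs (0.04 and 0.15) are hypothetical placeholders that you would replace with estimates from your own pilot data.

```python
import math
from statistics import NormalDist


def required_prompts(margin, confidence, k, between_var, within_var):
    """Return (unique_prompts, total_api_calls) for a target precision.

    margin: half-width of the confidence interval, e.g. 0.02 for +/-2 points.
    confidence: two-sided confidence level, e.g. 0.95.
    k: resamples per prompt.
    between_var, within_var: assumed variance components (hypothetical inputs).
    """
    if not 0 < confidence < 1 or margin <= 0 or k < 1:
        raise ValueError("invalid confidence, margin, or k")

    # Two-sided z-score for the chosen confidence level (about 1.96 at 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)

    # Averaging k resamples shrinks only the within-question part.
    per_prompt_var = between_var + within_var / k

    # Standard sample size formula: n = z^2 * variance / margin^2.
    n = math.ceil(z ** 2 * per_prompt_var / margin ** 2)
    return n, n * k


# 95% confidence, +/-2 points, 3 resamples, assumed variance components.
prompts, api_calls = required_prompts(0.02, 0.95, 3, 0.04, 0.15)
print(prompts, api_calls)

# Resampling more lowers the prompt count but raises calls per prompt.
for k in (1, 3, 10):
    print(k, required_prompts(0.02, 0.95, k, 0.04, 0.15))
```

With these placeholder inputs the sketch returns 865 unique prompts and 2,595 total API calls. Because extra resamples shrink only the within-question term, the between-question variance sets a floor on the prompt count that resampling alone cannot lower.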
Explore our other tools
More free tools to help optimize your content for AI search.
Reddit threads finder
Find threads cited in ChatGPT and Gemini for AI visibility and AEO
AEO content evaluator
Grade your content for AI search optimization and citation readiness
Headline optimizer
Generate and rank headline variants for AI search and SEO
AI Assist Widget
Add AI buttons to any website to boost AEO and brand citations