A 4-percentage-point accuracy gap between two AI coding tools tested on 100 prompts (72% against 68%, say) produces a p-value of 0.54. Put plainly, if the two tools were equally accurate, a gap at least that large would still turn up in more than half of such tests. The observed difference is statistically indistinguishable from chance. Yet nearly every "I tested 20 tools" article published this year treats single-digit accuracy gaps as the basis for firm rankings, detailed scorecards, and confident recommendations.

This piece does the math those articles skip. Every formula below uses standard two-proportion test statistics available in any introductory textbook. The arithmetic is straightforward. The conclusions will not be comfortable for anyone publishing AI coding tool leaderboards derived from triple-digit prompt sets.

How Reliable Is a 100-Prompt Accuracy Test?

Not very, once you define "reliable" with numbers. The standard error of a proportion measured over n binary trials is SE = √(p(1−p)/n). For a tool that passes 72 of 100 prompts, SE = √(0.72 × 0.28 / 100) = 0.0449. The 95% confidence interval around that estimate spans 63.2% to 80.8% — an 18-point window.
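The interval arithmetic above can be checked in a few lines. This is a minimal sketch using the simple normal-approximation interval from the paragraph; the function name `proportion_ci` is ours, not from any library.

```python
import math

def proportion_ci(passes: int, n: int, z: float = 1.96) -> tuple[float, float, float]:
    """Return (standard error, lower bound, upper bound) for a pass rate."""
    p = passes / n
    se = math.sqrt(p * (1 - p) / n)
    return se, p - z * se, p + z * se

se, low, high = proportion_ci(72, 100)
print(f"SE = {se:.4f}, 95% CI = {low:.1%} to {high:.1%}")
```

Calling `proportion_ci(68, 100)` gives the interval for a second tool at 68%.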

That window is the range of true accuracy the data cannot rule out. A tool reported at "72% accurate" might sit anywhere from the low 60s to the low 80s in true performance. A second tool reported at 68% carries its own interval: 58.9% to 77.1%. The two ranges share about 14 of their roughly 18 points. The published ranking treats them as distinct positions on a leaderboard. The math does not.

One hundred binary trials produce wide confidence bands around any proportion near the middle of the range. The sample cannot support the precision that a numbered ranking implies.

Can a 4-Point Gap Between Tools Justify a Ranking?

No. A two-proportion z-test settles this quickly. Tool A: 72 passes out of 100. Tool B: 68 out of 100. Pooled proportion: (72 + 68) / 200 = 0.70. Standard error of the difference: √(0.70 × 0.30 × (1/100 + 1/100)) = 0.0648. Test statistic: z = (0.72 − 0.68) / 0.0648 = 0.617. Two-tailed p-value: 0.54.

That p-value sits an order of magnitude above the 0.05 threshold. The difference is noise dressed as signal.

Widen the gap to 10 points — 72% against 62%. Pooled proportion shifts to 0.67, standard error to 0.0665, z to 1.50, and the p-value to 0.13. Still not significant at any conventional threshold. A gap that looks decisive on a bar chart dissolves under arithmetic that takes less than a minute to run.
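Both comparisons in this section can be reproduced with the standard library alone. The sketch below follows the pooled two-proportion z-test described above; the function name is illustrative, and the p-value comes from the normal tail via `math.erfc`.

```python
import math

def two_proportion_z_test(pass_a: int, pass_b: int, n_a: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-tailed p-value) using the pooled standard error."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-tailed p-value for a standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

for a, b in [(72, 68), (72, 62)]:
    z, p = two_proportion_z_test(a, b, 100, 100)
    print(f"{a} vs {b}: z = {z:.2f}, p = {p:.2f}")
```

Swapping in your own pass counts takes seconds, which is the point: the check is cheap enough that there is little excuse for skipping it.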

How Many Prompts Would It Take to Produce a Valid Ranking?

Approximately 2,000 per tool — for a 4-point difference. The sample-size formula for a two-proportion test at 80% power and α = 0.05 is n = (Z₀.₀₂₅ + Z₀.₂₀)² × 2p̄(1−p̄) / (p₁ − p₂)². Substitute Z₀.₀₂₅ = 1.96, Z₀.₂₀ = 0.84, p̄ = 0.70, and a 4-point target difference. The result: n ≈ 2,058 prompts per tool.

For a more generous 10-point gap, the requirement drops to roughly 350 per tool (about 347 with p̄ = 0.67, the pooled rate from the 72-versus-62 comparison). That is still 3.5 times the 100-prompt set that most published comparisons share across every tool they test.
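The sample-size formula is short enough to run directly. This sketch uses the same rounded critical values as the text (1.96 and 0.84) and assumes p̄ is the midpoint of the two pass rates being compared; `prompts_per_tool` is an illustrative name.

```python
def prompts_per_tool(p1: float, p2: float, z_alpha: float = 1.96, z_beta: float = 0.84) -> float:
    """Per-tool sample size for a two-proportion test (normal approximation)."""
    p_bar = (p1 + p2) / 2
    return (z_alpha + z_beta) ** 2 * 2 * p_bar * (1 - p_bar) / (p1 - p2) ** 2

print(round(prompts_per_tool(0.72, 0.68)))  # 4-point gap
print(round(prompts_per_tool(0.72, 0.62)))  # 10-point gap
```

Because the required n scales with the inverse square of the gap, halving the difference you want to detect roughly quadruples the prompt count.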

The mismatch is not marginal. It is structural. Most articles test 20 tools with 100 shared prompts and then rank them to the decimal point. The math requires either far fewer tools or far more prompts. Nobody adjusts.

What Does "Accuracy" Actually Measure in These Tests?

It depends entirely on the evaluator, and most articles never say. Pass@1 measures whether the model's first output passes all test cases. Pass@10 measures whether any of ten outputs passes. These two numbers can reorder a leaderboard without changing a single tool.

A model with 60% pass@1 might reach 92% pass@10 because its distribution of outputs is broad and correct solutions appear within the first several attempts. A competitor with 55% pass@1 but tighter output variance might sit at only 83% pass@10. In this example the first tool wins on both metrics. But narrow the pass@1 margin, or give the trailing tool the broader output distribution, and the ranking can flip depending on which number the tester chose to report.
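A toy model makes the flip concrete. The sketch below assumes each attempt on a prompt succeeds independently with a fixed per-prompt probability, which is a simplification, and the two tools and their probabilities are invented purely for illustration.

```python
def pass_at_k(per_prompt_success: list[float], k: int) -> float:
    """Expected pass@k when each attempt on prompt i succeeds independently with probability p_i."""
    return sum(1 - (1 - p) ** k for p in per_prompt_success) / len(per_prompt_success)

# Hypothetical tools, 100 prompts each (values invented for illustration).
# "consistent": solves 60 prompts every time and never solves the other 40.
consistent = [1.0] * 60 + [0.0] * 40
# "diverse": has a 55% chance of solving any given prompt on each attempt.
diverse = [0.55] * 100

for name, tool in [("consistent", consistent), ("diverse", diverse)]:
    print(f"{name}: pass@1 = {pass_at_k(tool, 1):.2f}, pass@10 = {pass_at_k(tool, 10):.2f}")
```

The consistent tool leads on pass@1 and trails badly on pass@10, so the "winner" depends entirely on which column the article prints. Real tools sit between these two extremes.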

Syntactic correctness adds another axis. Code that compiles but returns wrong output is not accurate by any functional standard, yet some evaluations count it as a pass. The metric chosen shapes the ranking. The ranking shapes the recommendation. The article rarely discloses which metric it used.

Does Temperature Setting Change Which Tool Ranks First?

Yes. Language models sample from a probability distribution over tokens. At temperature 0 (greedy decoding), the output is deterministic in principle. At 0.7 or higher, outputs vary between runs. Most published benchmarks default to temperature 0 for reproducibility. Most users work between 0.3 and 0.8.

A model optimized for greedy decoding might rank first at temperature 0 and fourth at temperature 0.7. The relationship between architecture, training data, and sampling strategy is nonlinear. Even small temperature shifts can reshuffle the order.

This gap matters for one reason. The testing condition does not match the deployment condition. Nobody ships a coding assistant locked to greedy decoding. The leaderboard the reader consults was produced under settings the reader will never use. Whether the ranking transfers to real-world conditions is an empirical question that none of these articles attempt to answer.

Why Does Running the Same 100 Prompts Twice Produce Different Scores?

Stochastic sampling. At any temperature above zero, the model's output varies between calls by design. Even at very low temperatures, provider-side batching, quantization choices, and API routing can introduce micro-variation across requests.

Run the same 100 prompts through the same tool twice at temperature 0.4. Expect the pass count to differ by 2 to 5 between runs. That variance is not an artifact you can average away with two repetitions — the standard error of the mean of two samples is still large relative to the between-tool gaps these articles report as meaningful.
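How large the run-to-run swing is depends on how many prompts sit in the uncertain middle. The following sketch assumes each prompt passes independently with its own fixed probability and uses a normal approximation for the typical gap; the difficulty mix is hypothetical, not measured from any tool.

```python
import math

def rerun_spread(per_prompt_pass_prob: list[float]) -> tuple[float, float]:
    """Return (SD of one run's pass count, typical |difference| between two runs).

    Assumes each prompt passes independently with its own fixed probability.
    """
    variance = sum(p * (1 - p) for p in per_prompt_pass_prob)
    sd_single = math.sqrt(variance)
    sd_diff = math.sqrt(2 * variance)
    # The mean absolute value of a normal variable with SD s is s * sqrt(2 / pi).
    return sd_single, sd_diff * math.sqrt(2 / math.pi)

# Hypothetical mix: 60 near-certain prompts, 20 coin flips, 20 long shots.
mix = [0.95] * 60 + [0.5] * 20 + [0.1] * 20
sd_single, typical_gap = rerun_spread(mix)
print(f"SD of one run: {sd_single:.1f} passes; typical gap between two runs: {typical_gap:.1f} passes")
```

Under this made-up mix, two runs differ by three or four passes on average, which is the same size as the between-tool gaps these articles rank on.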

Reproducibility requires either many repeated runs, which is expensive, or greedy decoding, which is unrepresentative of real use. Most published tests choose neither. They run once, record the count, rank. The implicit assumption that a single run is sufficient fails basic replication logic. One draw from a distribution is not a stable estimate of the parameter.

How Does Prompt Selection Bias the Entire Leaderboard?

Heavily. A curated set of 100 prompts is not a random sample of "all coding tasks." It reflects the tester's language preferences, framework familiarity, and difficulty intuitions. A set heavy on Python string manipulation rewards different strengths than a set dominated by TypeScript generics or Rust lifetime resolution.

Difficulty clustering compounds the problem. If 60 of 100 prompts are solvable by most tools, scores compress into a narrow band — say 55% to 75% — and the test loses power to separate tools on hard problems. If the set skews hard, the ranking reflects elite-tier performance only, which may not match the reader's actual workload.

No published 100-prompt benchmark we have examined includes a difficulty calibration step or a stratification plan. The prompts are what the author chose to write. The ranking reflects that editorial decision as much as it reflects tool capability.

What Would a Statistically Defensible AI Coding Benchmark Require?

Three properties, minimum. First, adequate sample size: at least 350 prompts per tool to detect 10-point accuracy differences, or 2,000 per tool for 4-point differences, each at 80% power. Second, stratified difficulty: prompts categorized by language, domain, and estimated difficulty tier, with balanced representation across strata. Third, repeated measurement: each tool runs each prompt at least five times at the target deployment temperature, and the reported figure is the mean pass rate accompanied by a confidence interval.
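The third property can be sketched as follows. This is one reasonable way to report it, not the only one: it treats the prompt as the unit of analysis, averages the repetitions within each prompt first, and builds a normal-approximation interval from the spread of those per-prompt rates. It does not handle stratification, and the demo data is made up.

```python
import math
import statistics

def pass_rate_with_ci(results: list[list[bool]], z: float = 1.96) -> tuple[float, float, float]:
    """results[i][j] is True when repetition j of prompt i passed.

    Returns (mean pass rate, lower bound, upper bound). Each prompt's pass
    rate is computed first, and the interval is built from the spread of
    those per-prompt rates.
    """
    per_prompt = [sum(runs) / len(runs) for runs in results]
    mean = statistics.fmean(per_prompt)
    se = statistics.stdev(per_prompt) / math.sqrt(len(per_prompt))
    return mean, mean - z * se, mean + z * se

# Tiny made-up example: 4 prompts, 5 repetitions each.
demo = [
    [True, True, True, True, True],
    [True, False, True, True, False],
    [False, False, False, True, False],
    [True, True, True, False, True],
]
mean, low, high = pass_rate_with_ci(demo)
print(f"mean pass rate = {mean:.2f}, 95% CI = {low:.2f} to {high:.2f}")
```

With only four prompts the interval is enormous and the normal approximation is crude; the function is meant for prompt sets in the hundreds or thousands.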

The cost is concrete. Running 2,000 prompts × 5 repetitions × 20 tools = 200,000 API calls. At current pricing tiers, that places the experiment in four- to five-figure territory depending on model mix and average prompt length. This expense is why almost nobody does it.
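The call count is simple multiplication, and the dollar figure follows from whatever per-call price applies. In this sketch the two prices are placeholders chosen only to show the arithmetic; they are not quotes from any provider.

```python
def benchmark_calls(prompts: int, repetitions: int, tools: int) -> int:
    """Total API calls for a full prompts x repetitions x tools grid."""
    return prompts * repetitions * tools

def cost_range(calls: int, low_price: float, high_price: float) -> tuple[float, float]:
    """Total cost at a low and a high per-call price (same currency as the inputs)."""
    return calls * low_price, calls * high_price

calls = benchmark_calls(prompts=2_000, repetitions=5, tools=20)
# Placeholder per-call prices in USD, chosen only to illustrate the arithmetic.
low, high = cost_range(calls, low_price=0.01, high_price=0.25)
print(f"{calls:,} calls, roughly ${low:,.0f} to ${high:,.0f}")
```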

But the alternative — publishing ranked leaderboards the underlying math cannot support — is not cheaper. It is less honest. The question worth asking next is not which tool won some 100-prompt sprint. It is whether any benchmark published in 2026 has met even two of these three criteria, and what a collaborative, open-methodology effort to build one would actually cost.

FAQ

Is 100 prompts ever enough to distinguish two AI coding tools?

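Only when the observed accuracy gap exceeds roughly 13 percentage points, assuming pass rates near 70%. At that spread, a two-proportion z-test reaches p < 0.05 with n = 100 per group. Even then, a true gap of exactly that size would clear the threshold only about half the time, because the observed gap lands below the true gap as often as above it. For the tighter gaps typically reported in published comparisons (2 to 8 points), the sample is insufficient to reject the null hypothesis. Most rankings built on 100-prompt tests are reporting differences the data cannot confirm.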
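The 13-point figure, and the odds of detecting a smaller gap, can be estimated with a short sketch. It assumes a pooled pass rate of 0.70 for the threshold and uses a normal approximation for power; both function names are illustrative.

```python
import math
from statistics import NormalDist

def min_significant_gap(p_bar: float, n: int, z_alpha: float = 1.96) -> float:
    """Smallest observed gap that reaches two-tailed significance with n prompts per tool."""
    return z_alpha * math.sqrt(2 * p_bar * (1 - p_bar) / n)

def approx_power(p1: float, p2: float, n: int, z_alpha: float = 1.96) -> float:
    """Approximate chance of a significant result when the true rates are p1 and p2.

    Normal approximation; ignores the small chance of significance in the wrong direction.
    """
    se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    return NormalDist().cdf(abs(p1 - p2) / se - z_alpha)

print(f"minimum significant gap: {min_significant_gap(0.70, 100):.3f}")
print(f"power for 72% vs 68% at n = 100: {approx_power(0.72, 0.68, 100):.2f}")
```

Under these assumptions a 100-prompt test would flag a real 4-point difference less than one time in ten.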

What is the difference between pass@1 and pass@k in coding benchmarks?

Pass@1 scores whether the model's single first output passes all test cases. Pass@k allows k attempts and checks whether any single output among them passes. Pass@k is always equal to or higher than pass@1 for any k greater than 1. The two metrics can produce different tool rankings from identical prompt sets because they reward different properties — raw first-shot precision for pass@1, output diversity and coverage for pass@k.

Does testing at temperature 0 reflect how developers actually use coding tools?

No. Most developers work at temperature settings between 0.3 and 0.8, where the model explores a wider range of possible outputs. Temperature 0 produces deterministic results useful for reproducibility but may favor tools whose training optimized for greedy decoding rather than for the sampling regime users encounter in practice. Rankings derived at temperature 0 do not necessarily transfer to real deployment conditions.

How much would a rigorous 20-tool coding benchmark cost to run?

A defensible test requires roughly 2,000 prompts per tool with five repeated runs each. For 20 tools, that totals 200,000 API calls. Depending on model pricing and average prompt length, compute costs range from low four figures to mid five figures in USD. This excludes the human effort of curating prompts, writing functional test cases, and calibrating difficulty tiers — which represents the majority of the true cost.

Why do different articles testing the same tools produce different rankings?

Prompt selection, temperature setting, accuracy metric (pass@1 or pass@k), evaluation criteria (functional correctness or syntactic validity), and random sampling variance all contribute. Two testers using 100 different prompts at different temperatures with different pass definitions can produce opposite rankings from tools whose true capabilities are similar. The methodology is usually the dominant variable, not the tool.

Can averaging multiple runs of a 100-prompt test solve the sample size problem?

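Partially. Five runs of 100 prompts reduce the run-to-run sampling noise by a factor of √5 ≈ 2.24. That would match a single run of about 500 prompts only if every trial were independent. Because the same 100 prompts are reused, the uncertainty that comes from which prompts were chosen does not shrink at all, so the effective sample size is smaller than 500. Even the optimistic 500 figure falls well short of the approximately 2,000-prompt threshold needed to detect 4-point differences at 80% power. Averaging helps at the margins. It does not rescue a fundamentally undersized experiment.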
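A √5 reduction assumes every trial is an independent draw, which reruns of the same prompts are not. The sketch below splits the variance into a between-prompt part, which shrinks only with the number of prompts, and a within-prompt sampling part, which shrinks with prompts times repetitions. The difficulty mix is hypothetical and stands in for the population the prompts are drawn from.

```python
import math

def se_of_mean_pass_rate(per_prompt_pass_prob: list[float], n_prompts: int, repetitions: int) -> float:
    """SE of the mean pass rate for n_prompts drawn from this mix, each run `repetitions` times.

    Between-prompt variance shrinks only with n_prompts; within-prompt
    (sampling) variance shrinks with n_prompts * repetitions.
    """
    k = len(per_prompt_pass_prob)
    mean = sum(per_prompt_pass_prob) / k
    between = sum((p - mean) ** 2 for p in per_prompt_pass_prob) / k
    within = sum(p * (1 - p) for p in per_prompt_pass_prob) / k
    return math.sqrt(between / n_prompts + within / (n_prompts * repetitions))

# Hypothetical mix: 60% near-certain prompts, 20% coin flips, 20% long shots.
mix = [0.95] * 60 + [0.5] * 20 + [0.1] * 20
for n, r in [(100, 1), (100, 5), (500, 1)]:
    print(f"{n} prompts x {r} run(s): SE = {se_of_mean_pass_rate(mix, n, r):.3f}")
```

With this mix, five reruns of 100 prompts leave a noticeably larger standard error than 500 distinct prompts run once, because most of the variance lives between prompts rather than between runs.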