A/B Test Sample Size Calculator

A/B Test Sample Size Before the Experiment Starts

What the Calculator Is Really Checking

An A/B test can be mathematically clean and still waste time if it is underpowered. Sample size planning asks a practical question before launch: how many observations does each variant need before the test has a reasonable chance of detecting the effect we care about? Without that step, teams often run tests that cannot distinguish noise from a meaningful change. The result is a dashboard full of movement and no decision worth trusting.

For a conversion-rate test, each visitor either converts or does not. The observed rate is an estimate of the true rate, and estimates bounce around because samples are finite. Smaller effects are harder to detect because the two rates sit closer together. Higher confidence asks for stronger evidence before calling a winner. Higher power asks for a better chance of noticing the effect if it is real. Those goals all push sample size upward.

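The A/B Test Sample Size Calculator uses this core relationship: n per group ~= 2 * pbar * (1 - pbar) * (z_alpha/2 + z_power)^2 / effect^2. Here pbar is the pooled conversion rate, usually taken as the average of the baseline and target rates; z_alpha/2 and z_power are the standard normal quantiles for the chosen confidence and power; and effect is the absolute difference between the two rates. That formula is short enough to look harmless, but it carries the whole model. Before using the highlighted result, identify what the model includes and what it leaves out. In this tool, the visible inputs are baseline conversion, minimum detectable effect, confidence, and power. Those inputs are not just boxes to fill in; they are the assumptions that decide whether the answer belongs to your situation.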
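The relationship above can be sketched in a few lines of Python. This is a minimal sketch, not the calculator's actual source: the function name `sample_size_per_group` is hypothetical, rates are passed as fractions rather than percentages, the test is assumed to be two-sided, and pbar is assumed to be the average of the baseline and target rates.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline, mde_abs, confidence=0.95, power=0.80):
    """Approximate users per variant for a two-sided test of two proportions.

    baseline : baseline conversion rate as a fraction (0.05 = 5 percent)
    mde_abs  : minimum detectable effect as an absolute difference
               (0.01 = 1 percentage point)
    """
    target = baseline + mde_abs
    pbar = (baseline + target) / 2           # ASSUMPTION: average of the two rates
    alpha = 1 - confidence
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    n = 2 * pbar * (1 - pbar) * (z_alpha + z_power) ** 2 / mde_abs ** 2
    return math.ceil(n)                       # round up to whole users


# 5 percent baseline, 1 percentage point lift, 95 percent confidence, 80 percent power
print(sample_size_per_group(0.05, 0.01))      # about 8,159 per variant
```

Other tools may differ by a few percent because they use unpooled variances or a continuity correction, so small disagreements between calculators are expected.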

Manual Calculation Path

The normal approximation uses a baseline conversion rate, a target rate, confidence, power, and the absolute difference between rates. The minimum detectable effect should be entered in percentage points. A move from 5 percent to 6 percent is a 1 percentage point absolute lift, not a 1 percent relative lift; in relative terms it is a 20 percent lift. The formula estimates the number of users per variant. Doubling that gives the total sample for a two-arm test. The result should be rounded up because a fraction of a user cannot be sampled.
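The unit conversions in this path are where manual calculations most often go wrong, so a small sketch may help. The helper names are hypothetical, and the unrounded per-variant value of 8158.9 is only an illustrative input, not a result taken from the calculator.

```python
import math


def relative_to_absolute(baseline, relative_lift):
    """Convert a relative lift (0.20 = 20 percent) into an absolute rate difference."""
    return baseline * relative_lift


def absolute_to_relative(baseline, absolute_lift):
    """Convert an absolute rate difference into a relative lift."""
    return absolute_lift / baseline


def total_for_two_arms(n_per_variant_unrounded):
    """Round the per-variant figure up first, then double it for two arms."""
    return 2 * math.ceil(n_per_variant_unrounded)


baseline, target = 0.05, 0.06
absolute_lift = target - baseline
print(f"absolute lift: {absolute_lift * 100:.1f} percentage points")
print(f"relative lift: {absolute_to_relative(baseline, absolute_lift) * 100:.0f} percent")
print(f"total sample:  {total_for_two_arms(8158.9)}")  # illustrative input value
```

Rounding before doubling keeps the two arms the same size, which matches the equal-allocation assumption behind the formula.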

The calculator also states its working assumption plainly: "Uses a normal approximation for planning. Sequential testing, multiple comparisons, and low counts require more care." That sentence is part of the calculation, not legal fine print. It tells you when the result is a quick planning estimate and when the problem needs an exact test, a simulation, a sequential design, or a more detailed model. If a real experiment violates the assumption, the number may still be useful as a reference point, but it should not be treated as final evidence.

A reliable hand check does not need to reproduce every displayed digit. It should confirm the direction and scale. Increase the input that should make the result larger, such as confidence or power, and confirm that the result moves upward. Cut the minimum detectable effect in half and confirm that the required sample roughly quadruples, because the effect enters the formula squared. That habit catches percentages entered as fractions, relative lifts entered as absolute ones, and copied values faster than staring at a finished number.
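The direction-and-scale check described above can also be scripted. The sketch below assumes the same normal-approximation formula with pbar taken as the midpoint of the two rates; `n_raw` is a hypothetical helper that returns the unrounded value so the ratios are easy to read.

```python
from statistics import NormalDist


def n_raw(baseline, mde_abs, confidence=0.95, power=0.80):
    """Unrounded per-variant sample size from the normal approximation."""
    pbar = baseline + mde_abs / 2             # ASSUMPTION: midpoint of the two rates
    z = (NormalDist().inv_cdf(1 - (1 - confidence) / 2)
         + NormalDist().inv_cdf(power))
    return 2 * pbar * (1 - pbar) * z ** 2 / mde_abs ** 2


reference = n_raw(0.05, 0.01)

# Halving the effect should roughly quadruple the sample (effect is squared).
print("half the effect:   x", round(n_raw(0.05, 0.005) / reference, 2))

# Raising power or confidence should raise the sample, but by less than 2x here.
print("power 0.80 -> 0.90: x", round(n_raw(0.05, 0.01, power=0.90) / reference, 2))
print("conf  0.95 -> 0.99: x", round(n_raw(0.05, 0.01, confidence=0.99) / reference, 2))
```

The first ratio lands a little under 4 rather than exactly on it because pbar shifts slightly when the target rate changes.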

Reading the Inputs

Baseline conversion should come from recent, comparable traffic. If weekday and weekend behavior differ, use the segment that matches the experiment. Minimum detectable effect should be the smallest change worth acting on, not the smallest change someone hopes to see. Confidence is commonly 95 percent, while power is often 80 or 90 percent. Raising both makes the test more demanding. If traffic is limited, the honest answer may be that the test cannot detect small changes in a reasonable time.
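When traffic is the fixed quantity, it can help to turn the question around and ask what effect a given sample could detect. The sketch below inverts the normal approximation under a loudly stated simplification: it uses the baseline variance in place of pbar, which makes the answer slightly optimistic for small baseline rates. The function name `detectable_effect` is hypothetical.

```python
import math
from statistics import NormalDist


def detectable_effect(baseline, n_per_variant, confidence=0.95, power=0.80):
    """Approximate smallest absolute lift detectable with n users per variant.

    SIMPLIFICATION: variance is evaluated at the baseline rate only.
    """
    z = (NormalDist().inv_cdf(1 - (1 - confidence) / 2)
         + NormalDist().inv_cdf(power))
    return z * math.sqrt(2 * baseline * (1 - baseline) / n_per_variant)


for n in (2_000, 8_000, 32_000):
    lift = detectable_effect(0.05, n)
    print(f"{n:>6} per variant -> about {lift * 100:.2f} percentage points")
```

Each fourfold increase in sample only halves the detectable effect, which is why small lifts on low-traffic pages are often out of reach.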

The field labels are deliberately plain because the calculator is meant for quick use, but plain labels still need context. If a value comes from an analytics dashboard, check which population, date range, and conversion definition it reflects. If it comes from a previous test, record the setup. If it comes from a guess, mark it as a guess. The result is only as honest as the least honest input.

Where the Answer Can Mislead

A frequent mistake is peeking at the test every hour and stopping when the graph looks exciting. That changes the statistical behavior unless the test is designed for sequential monitoring. Another mistake is testing too many metrics or variants without adjustment. Low baseline rates also make normal approximations less reliable. Sample ratio mismatch, bot traffic, repeated users, seasonality, and instrumentation changes can damage a test more than the sample size formula can repair.
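The cost of peeking can be seen in a small simulation. The sketch below runs A/A tests, where both arms share the same true rate, and compares how often a naive z-test crosses the 95 percent threshold at any of several looks versus at the final look only. All parameters (10 percent rate, 2,000 users per arm, 10 looks, 500 simulated tests) are illustrative assumptions, and the function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def z_stat(conv_a, conv_b, n):
    """Pooled two-proportion z statistic for two arms of equal size n."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    return 0.0 if se == 0 else (conv_b - conv_a) / n / se


def false_positive_rates(rate=0.10, n_per_arm=2000, looks=10, sims=500, seed=1):
    """Return (rate with peeking, rate at final look only) for A/A tests."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.975)      # two-sided 95 percent threshold
    step = n_per_arm // looks
    any_look = final_look = 0
    for _ in range(sims):
        conv_a = conv_b = 0
        crossed = False
        for look in range(1, looks + 1):
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            significant = abs(z_stat(conv_a, conv_b, look * step)) > z_crit
            crossed = crossed or significant
        any_look += crossed                   # would have stopped at some look
        final_look += significant             # result at the planned end only
    return any_look / sims, final_look / sims


peeking, fixed = false_positive_rates()
print(f"false positives with peeking: {peeking:.1%}")
print(f"false positives at the planned end: {fixed:.1%}")
```

Because there is no real difference between the arms, every "win" here is a false positive. The peeking rate is typically several times the nominal 5 percent, which is the inflation that sequential designs are built to control.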

Sample per variant is the planning number. Total sample helps estimate calendar time from traffic. Target conversion reminds everyone what effect the test is powered to detect. If the required sample is huge, the team has options: accept only larger detectable effects, run longer, target a higher-volume metric, reduce variance, use a different design, or skip the test and make a product judgment. Not every decision deserves an experiment, and not every experiment is affordable.
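Turning total sample into calendar time is simple arithmetic, sketched below. The function name, the `traffic_share` parameter, and the traffic figures are illustrative assumptions rather than calculator outputs.

```python
import math


def days_to_run(sample_per_variant, daily_visitors, variants=2, traffic_share=1.0):
    """Estimate whole days needed to reach the planned sample.

    daily_visitors : eligible visitors per day across all variants
    traffic_share  : fraction of eligible traffic enrolled in the test
    """
    total_sample = sample_per_variant * variants
    enrolled_per_day = daily_visitors * traffic_share
    return math.ceil(total_sample / enrolled_per_day)


print(days_to_run(8159, 1500))                     # 11
print(days_to_run(8159, 1500, traffic_share=0.5))  # 22
```

If weekday and weekend behavior differ, it may be worth rounding the estimate up to whole weeks so each variant sees complete weekly cycles.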

The supporting metrics are there to reduce that risk. They expose intermediate quantities and related values that make the main answer easier to challenge. When one of those supporting numbers looks strange, pause before moving on. An enormous sample size or a target conversion that falls outside the range of 0 to 100 percent usually means the calculator is telling you something important about either the design or the way the problem was entered.

Using the Result in Real Work

Use the calculator before writing the experiment ticket. Put the sample size, planned duration, primary metric, decision rule, and guardrail metrics in the test plan. During the test, monitor data quality and sample ratio, but avoid changing the decision rule midstream. After the test, report the observed effect with uncertainty, not just "winner" or "loser." The calculator helps set expectations so stakeholders are not surprised when a small effect needs a long run.
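Two of the checks mentioned above, monitoring the sample ratio and reporting the effect with uncertainty, can be sketched with the standard library. Both functions are hypothetical helpers that assume a planned 50/50 split and large enough counts for a normal approximation; the visitor and conversion counts are made-up illustrative values.

```python
import math
from statistics import NormalDist


def srm_p_value(n_a, n_b):
    """Two-sided p-value for a sample ratio mismatch against a 50/50 split."""
    z = (n_a - n_b) / math.sqrt(n_a + n_b)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def lift_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald interval for the absolute difference in conversion rate (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


# Illustrative counts only.
print(f"SRM p-value: {srm_p_value(8000, 8600):.6f}")
low, high = lift_confidence_interval(400, 8000, 480, 8000)
print(f"lift: {low * 100:.2f} to {high * 100:.2f} percentage points")
```

A very small sample ratio p-value suggests an assignment or logging problem that should be investigated before any lift is interpreted. Reporting the interval rather than a single "winner" label shows stakeholders how wide the plausible range of effects still is.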

A good A/B test note records baseline rate, minimum detectable effect, confidence, power, sample per variant, planned duration, primary metric, and stopping rule. The calculator does not make experimentation automatic, but it prevents one of the most common failures: running a test that never had enough information to answer the question. The best time to learn that is before launch, when changing the design is cheap.
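One way to keep such a note consistent across experiments is a small record type. The sketch below is only a suggestion: the class name, field names, and example values are hypothetical, and the fields simply mirror the list in the paragraph above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ABTestNote:
    """Planning record written before launch; frozen so it cannot be edited midstream."""
    baseline_rate: float               # fraction, e.g. 0.05 for 5 percent
    minimum_detectable_effect: float   # absolute difference, e.g. 0.01
    confidence: float
    power: float
    sample_per_variant: int
    planned_duration_days: int
    primary_metric: str
    stopping_rule: str


# Illustrative values only.
note = ABTestNote(
    baseline_rate=0.05,
    minimum_detectable_effect=0.01,
    confidence=0.95,
    power=0.80,
    sample_per_variant=8159,
    planned_duration_days=14,
    primary_metric="checkout conversion",
    stopping_rule="stop at planned sample; no early stopping on significance",
)

for field, value in asdict(note).items():
    print(f"{field}: {value}")
```

Freezing the record is a small nudge toward the discipline described above: if the plan must change, a new note is written rather than the old one being quietly edited.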

For a clean review, save the input values, the highlighted result, the supporting metric that most constrains the design, and the next check you would run. That next check might be an A/A test, a sample ratio check, a power simulation, or a second calculation with a more conservative baseline. The goal is not to make the calculator look authoritative. The goal is to make the reasoning easy for another person to inspect and improve.