Analyze two real A/B test datasets with kstats: proportion z-test, Bayesian posteriors, paired t-test, effect sizes, power analysis, and multiple comparison correction.
Kotlin Notebook
Run this tutorial as a Kotlin Notebook with interactive Kandy charts — all code cells, DataFrame outputs, and visualizations included.
This tutorial walks through two real-world A/B test datasets end-to-end:
E-commerce Landing Page (Kaggle) — ~294K users randomly assigned to old vs new checkout page, binary outcome (converted / not)
Marketing Campaign (Kaggle) — daily campaign metrics (spend, clicks, impressions, purchases) for control vs test campaign
This tutorial uses Kotlin DataFrame for data loading and Kandy for visualization. These are Kotlin Notebook dependencies — the kstats statistical functions work in any Kotlin environment.
Before any hypothesis test, verify that randomization worked correctly. If the split ratio deviates from 50/50 by more than chance would allow, the experiment is compromised — all downstream results are untrustworthy. The split is checked with a one-sample proportion z-test: is the observed split consistent with a 50/50 ratio?
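The mechanics of this check are simple. Here is a from-scratch sketch in plain Kotlin (the function names are illustrative, not the kstats API, and the counts in `main` are hypothetical):

```kotlin
import kotlin.math.abs
import kotlin.math.exp
import kotlin.math.sqrt

// Standard normal CDF via the Abramowitz-Stegun polynomial approximation (error < 7.5e-8).
fun normalCdf(z: Double): Double {
    val t = 1.0 / (1.0 + 0.2316419 * abs(z))
    val density = 0.3989422804014327 * exp(-z * z / 2.0)
    val tail = density * t * (0.319381530 + t * (-0.356563782 +
        t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))))
    return if (z >= 0.0) 1.0 - tail else tail
}

// One-sample proportion z-test against a hypothesized proportion p0 (two-sided).
fun srmZTest(successes: Long, trials: Long, p0: Double = 0.5): Pair<Double, Double> {
    val pHat = successes.toDouble() / trials
    val z = (pHat - p0) / sqrt(p0 * (1.0 - p0) / trials)
    return z to 2.0 * (1.0 - normalCdf(abs(z)))
}

fun main() {
    // Hypothetical counts: 147,200 of 294,500 users landed in treatment.
    val (z, p) = srmZTest(successes = 147_200, trials = 294_500)
    println("SRM check: z = %.4f, p = %.4f".format(z, p))
}
```

A large p-value here means the observed split is consistent with true 50/50 randomization.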
Power analysis should happen before data collection to determine the required sample size. We ask: “How many users do we need to detect a 1 percentage point lift from a 12% baseline with 80% power at alpha = 0.05?” This ensures the experiment is neither underpowered (missing real effects) nor wasteful (running longer than necessary).
// Effect size for a 1 pp lift: 12% → 13%
val h = cohensH(p1 = 0.13, p2 = 0.12)
println("Cohen's h for 12% → 13%: ${h.fmt(4)}")

val requiredN = proportionZTestRequiredN(effectSize = h, power = 0.8)
println("Required per group: $requiredN users")
println("Total: ${requiredN * 2} users")
Cohen's h for 12% → 13%: 0.0302
Required per group: 17164 users
Total: 34328 users
// With ~145K per group, what power do we actually have?
val actualPower = proportionZTestPower(effectSize = h, n = 145_000)
println("Power with N=145K per group: ${(actualPower * 100).fmt(1)}%")

val mde = proportionZTestMinimumEffect(n = 145_000, power = 0.8)
println("MDE (Cohen's h): ${mde.fmt(4)}")
Power with N=145K per group: 100.0%
MDE (Cohen's h): 0.0104
With ~145K users per group, the experiment is massively overpowered for a 1 pp lift — power is effectively 100%. The minimum detectable effect (MDE) is Cohen’s h = 0.0104, meaning the experiment can detect extremely small differences.
The core question: is the conversion rate different between control and treatment? For comparing two binomial proportions on large samples, the proportion z-test is the standard method. It compares the observed difference against sampling variability under the null hypothesis of equal rates.
val conversionTest = proportionZTest(
    successes1 = treatmentConverted,
    trials1 = treatmentTotal,
    successes2 = controlConverted,
    trials2 = controlTotal
)
val ci = conversionTest.confidenceInterval!!
println("Two-Sample Proportion z-Test")
println(" z-statistic: ${conversionTest.statistic.fmt(4)}")
println(" p-value: ${conversionTest.pValue.fmt(4)}")
println(" 95% CI for Δ(p): [${ci.lower.fmt(4)}, ${ci.upper.fmt(4)}]")
println(" Significant at α=0.05: ${conversionTest.isSignificant()}")
Two-Sample Proportion z-Test
 z-statistic: -1.3109
 p-value: 0.1899
 95% CI for Δ(p): [-0.0039, 0.0008]
 Significant at α=0.05: false
p > 0.05 — we cannot reject the null hypothesis. The new page does not significantly change conversion rate. The 95% CI for the difference includes zero, consistent with no effect.
The company only cares if the new page is better. A one-sided test has more power to detect an improvement, but p ≈ 0.905 here strongly suggests the treatment is actually worse (or at best equal).
val oneSided = proportionZTest(
    successes1 = treatmentConverted,
    trials1 = treatmentTotal,
    successes2 = controlConverted,
    trials2 = controlTotal,
    alternative = Alternative.GREATER
)
println("One-sided (treatment > control) p = ${oneSided.pValue.fmt(4)}")
println("Significant: ${oneSided.isSignificant()}")
One-sided (treatment > control) p = 0.9051
Significant: false
A hypothesis test answers “is there a difference?”. An effect size answers “how big is the difference?” With N=290K even tiny differences can become “significant”. Cohen’s h puts the difference on a standardized scale independent of sample size.
| Cohen’s h | Interpretation |
|-----------|----------------|
| < 0.2     | Negligible     |
| 0.2       | Small          |
| 0.5       | Medium         |
| 0.8+      | Large          |
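For reference, Cohen’s h is the difference of arcsine-transformed proportions. A minimal plain-Kotlin sketch of the formula (`cohensHSketch` is an illustrative name, not the kstats function):

```kotlin
import kotlin.math.asin
import kotlin.math.sqrt

// Cohen's h: difference of arcsine-transformed proportions,
// h = 2*asin(sqrt(p1)) - 2*asin(sqrt(p2)).
fun cohensHSketch(p1: Double, p2: Double): Double =
    2.0 * asin(sqrt(p1)) - 2.0 * asin(sqrt(p2))

fun main() {
    // Reproduces the power-analysis effect size above: 12% -> 13%.
    println("h = %.4f".format(cohensHSketch(0.13, 0.12)))  // h = 0.0302
}
```

The arcsine transform stabilizes the variance of proportions, which is why h (unlike a raw difference in rates) is comparable across different baseline rates.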
val effectH = cohensH(p1 = treatmentRate, p2 = controlRate)
println("Cohen's h = ${effectH.fmt(4)}")
println("Interpretation: negligible (|h| < 0.2)")
Cohen's h = -0.0049
Interpretation: negligible (|h| < 0.2)
The chart shows Cohen’s h for various conversion lifts from a 12% baseline. The red line marks the “small effect” threshold (h = 0.2). A 1 pp lift barely registers; you need at least a 3-5 pp lift to reach a small effect.
The Wald CI from the z-test can misbehave near 0 or 1. The Wilson score interval is recommended for per-group conversion rate estimates, especially with small samples.
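The Wilson interval has a closed form, so it is easy to sanity-check by hand. A from-scratch sketch (not the kstats API; the counts in `main` are hypothetical):

```kotlin
import kotlin.math.sqrt

// Wilson score interval for a binomial proportion (z = 1.96 gives ~95% coverage).
// Unlike the Wald interval, it never extends below 0 or above 1.
fun wilsonInterval(successes: Long, trials: Long, z: Double = 1.96): Pair<Double, Double> {
    val n = trials.toDouble()
    val pHat = successes / n
    val z2 = z * z
    val denom = 1.0 + z2 / n
    val center = (pHat + z2 / (2.0 * n)) / denom
    val half = z * sqrt(pHat * (1.0 - pHat) / n + z2 / (4.0 * n * n)) / denom
    return (center - half) to (center + half)
}

fun main() {
    // Hypothetical: 36 conversions out of 300 impressions.
    val (lo, hi) = wilsonInterval(successes = 36, trials = 300)
    println("Wilson 95%% CI: [%.4f, %.4f]".format(lo, hi))
}
```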
The frequentist approach gives a binary answer: reject or don’t reject. The Bayesian approach answers a more intuitive question: what is the probability that the treatment is better? We use the Beta-Binomial conjugate model:
Prior: Beta(1, 1) — uninformative (uniform on [0, 1])
Posterior: Beta(1 + successes, 1 + failures) — updated belief after seeing data
Decision: Compare posterior distributions via Monte Carlo sampling
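The Monte Carlo step can be sketched from scratch: draw from each Beta posterior (via Gamma variates) and count how often treatment beats control. All names below are illustrative helpers, not the kstats API:

```kotlin
import java.util.Random
import kotlin.math.ln
import kotlin.math.sqrt

val rng = Random(42)

// Marsaglia-Tsang Gamma(shape, 1) sampler; valid for shape >= 1,
// which always holds for Beta(1 + successes, 1 + failures) posteriors.
fun sampleGamma(shape: Double): Double {
    val d = shape - 1.0 / 3.0
    val c = 1.0 / sqrt(9.0 * d)
    while (true) {
        var x: Double
        var v: Double
        do {
            x = rng.nextGaussian()
            v = 1.0 + c * x
        } while (v <= 0.0)
        v = v * v * v
        val u = rng.nextDouble()
        if (u < 1.0 - 0.0331 * x * x * x * x) return d * v
        if (ln(u) < 0.5 * x * x + d * (1.0 - v + ln(v))) return d * v
    }
}

// Beta(a, b) draw as a ratio of Gamma draws.
fun sampleBeta(a: Double, b: Double): Double {
    val ga = sampleGamma(a)
    return ga / (ga + sampleGamma(b))
}

// Monte Carlo estimate of P(pB > pA) under independent Beta(1 + s, 1 + f) posteriors.
fun probBGreaterA(sA: Long, fA: Long, sB: Long, fB: Long, draws: Int = 100_000): Double {
    var wins = 0
    repeat(draws) {
        if (sampleBeta(1.0 + sB, 1.0 + fB) > sampleBeta(1.0 + sA, 1.0 + fA)) wins++
    }
    return wins.toDouble() / draws
}

fun main() {
    // Hypothetical counts: control 100/1000 converted, treatment 150/1000.
    println("P(B > A) ≈ ${probBGreaterA(sA = 100, fA = 900, sB = 150, fB = 850)}")
}
```

With 100K draws the Monte Carlo error on the probability is roughly ±0.3 percentage points, which is ample precision for a ship/no-ship decision.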
The Bayesian approach tells the same story: P(treatment > control) is only ~9.5%. Most organizations require P(B > A) > 95% to ship a change. Frequentist vs Bayesian:
Frequentist: “p = 0.19; fail to reject” (a statement about how surprising the data would be under the null, not about the hypothesis itself)
Bayesian: “There’s a ~9.5% chance the new page is better” — direct probability statement, more intuitive for stakeholders
Both agree: there is no compelling evidence to launch the new page.
We used a uniform Beta(1,1) prior. With N=145K observations, the prior has virtually no effect on the posterior. For smaller experiments (N < 1000), consider using an informative prior based on historical conversion rates.
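One common way to build such a prior is to give a historical rate the weight of an effective sample size. A sketch (the 12% rate and the weight of 200 are illustrative assumptions, not values from the datasets):

```kotlin
// Map a historical rate and a chosen prior weight (effective sample size)
// to Beta(alpha, beta) parameters with mean = historicalRate.
fun informativePrior(historicalRate: Double, priorWeight: Double): Pair<Double, Double> =
    (historicalRate * priorWeight) to ((1.0 - historicalRate) * priorWeight)

fun main() {
    val (a, b) = informativePrior(historicalRate = 0.12, priorWeight = 200.0)
    println("Prior: Beta(%.0f, %.0f), mean = %.3f".format(a, b, a / (a + b)))
    // Posterior after observing s successes in n trials: Beta(a + s, b + n - s)
}
```

The weight acts like pseudo-observations: a weight of 200 means the prior is worth about 200 historical users, which matters at N < 1000 but is negligible at N = 145K.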
We now switch to continuous metrics where different statistical tools are needed.

Dataset: Marketing Campaign A/B Testing — daily campaign metrics for control vs test campaign over 30 days.
val dfControlRaw = DataFrame.readCsv("data/campaign-control.csv", delimiter = ';')
val dfTestRaw = DataFrame.readCsv("data/campaign-test.csv", delimiter = ';')

// Both CSVs share the same 30 dates. Drop nulls independently, then keep only matched dates.
val dfControlClean = dfControlRaw.dropNulls()
val dfTestClean = dfTestRaw.dropNulls()

val ctrlDates = dfControlClean["Date"].toList().map { it.toString() }.toSet()
val testDates = dfTestClean["Date"].toList().map { it.toString() }.toSet()
val pairedDates = ctrlDates.intersect(testDates)

val dfControlPaired = dfControlClean
    .filter { "Date"<Any>().toString() in pairedDates }
    .sortBy("Date")
val dfTestPaired = dfTestClean
    .filter { "Date"<Any>().toString() in pairedDates }
    .sortBy("Date")
Matched days: 29 (dropped 1 day with missing data)
Control: 29 rows × 10 columns
Test: 29 rows × 10 columns
Why rates, not raw counts? The two groups have very different exposure: treatment received ~30% fewer impressions but ~12% more spend. Comparing raw daily counts would conflate the campaign effect with the exposure difference. Instead, we normalize by daily impressions to get CTR (clicks / impressions) and purchase rate (purchases / impressions).

Why paired tests? Since control and test observations are matched by date, a paired t-test (or Wilcoxon signed-rank) removes day-to-day variability, increasing power compared to an independent-samples test.
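The normalization itself is just element-wise division. A minimal sketch with plain Kotlin lists standing in for the DataFrame columns (the values are hypothetical):

```kotlin
// Element-wise normalization of daily event counts by daily impressions.
fun rates(events: List<Double>, impressions: List<Double>): List<Double> =
    events.zip(impressions) { e, imp -> e / imp }

fun main() {
    val clicks = listOf(120.0, 98.0, 143.0)          // hypothetical daily clicks
    val impressions = listOf(4000.0, 3500.0, 4400.0) // hypothetical daily impressions
    println("CTR by day: ${rates(clicks, impressions)}")
}
```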
Before running hypothesis tests, visualize the data to understand its structure and spot potential issues.
CTR
Purchase Rate
Strip plot with mean ± SE — each dot is one day; the treatment mean is clearly higher, but individual days vary widely.
Box plot — the treatment median sits above the control range, with several high-CTR outlier days.
Time series — the treatment (red) consistently exceeds control (blue) day-over-day, supporting a paired test design.
Strip plot with mean ± SE — same pattern as CTR: treatment mean higher, with more day-to-day spread.
Box plot — treatment interquartile range sits above the control group, with two high-purchase outlier days.
Time series — treatment purchase rate tracks above control on most days, though the gap is smaller than for CTR.
The treatment group shows higher CTR and purchase rate across most days, with notably more variance. The strip and box plots confirm a clear upward shift in the treatment group.
Shapiro-Wilk on differences passes (p > 0.05) → paired t-test is primary
Shapiro-Wilk fails → paired t-test is often robust to mild non-normality, but at N ≈ 29 results should be interpreted cautiously. The Wilcoxon signed-rank serves as a sensitivity check — if both tests agree, the conclusion is more trustworthy
No need to check equal variances — paired tests work on differences, not separate groups
We report both parametric (paired t-test) and non-parametric (Wilcoxon signed-rank) tests.
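For intuition, the paired t statistic reduces to a few lines. A from-scratch sketch (`pairedT` is an illustrative name; kstats supplies the p-value and confidence interval on top of this statistic):

```kotlin
import kotlin.math.sqrt

// Paired t statistic: mean of within-pair differences over its standard error.
// Returns (t, degrees of freedom).
fun pairedT(x: DoubleArray, y: DoubleArray): Pair<Double, Int> {
    require(x.size == y.size) { "paired samples must have equal length" }
    val d = DoubleArray(x.size) { x[it] - y[it] }
    val mean = d.average()
    val variance = d.sumOf { (it - mean) * (it - mean) } / (d.size - 1)
    return (mean / sqrt(variance / d.size)) to (d.size - 1)
}

fun main() {
    // Hypothetical 4-day toy example (the real analysis pairs 29 days of rates).
    val treatment = doubleArrayOf(2.0, 4.0, 6.0, 8.0)
    val control = doubleArrayOf(1.0, 3.0, 5.0, 6.0)
    val (t, df) = pairedT(treatment, control)
    println("t = %.4f, df = %d".format(t, df))  // t = 5.0000, df = 3
}
```

Because only the differences enter the statistic, any day-level shock that hits both campaigns equally cancels out, which is exactly why pairing adds power.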
Both parametric and non-parametric tests agree on both metrics: the treatment campaign has significantly higher CTR and purchase rate. This agreement strengthens our confidence despite the non-normal differences.
For paired designs, Cohen’s dz = mean(differences) / SD(differences) is the appropriate effect size. It captures how large the within-pair differences are relative to their variability.
Unlike unpaired Cohen’s d (which uses pooled between-group SD), dz directly reflects how consistently one condition outperforms the other across matched pairs. Larger dz values indicate more consistent day-over-day advantages for the treatment group.
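A minimal sketch of the computation (`cohensDz` is an illustrative name; the differences in `main` are hypothetical):

```kotlin
import kotlin.math.sqrt

// Cohen's dz for paired data: mean(differences) / sd(differences).
fun cohensDz(diffs: DoubleArray): Double {
    val mean = diffs.average()
    val sd = sqrt(diffs.sumOf { (it - mean) * (it - mean) } / (diffs.size - 1))
    return mean / sd
}

fun main() {
    // Hypothetical daily differences (treatment - control), in percentage points.
    val diffs = doubleArrayOf(1.0, 1.0, 1.0, 2.0)
    println("dz = %.2f".format(cohensDz(diffs)))  // dz = 2.50
}
```

Note that dz = t / √n for the paired t statistic, which makes the link between significance and effect size explicit.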
We tested two metrics: CTR and Purchase Rate. Testing multiple hypotheses inflates the chance of false positives (Type I error). Three correction methods:
| Method | Controls | Conservatism |
|--------|----------|--------------|
| Bonferroni | FWER (probability of any false positive) | Most conservative |
| Holm-Bonferroni | FWER | Less conservative, uniformly more powerful |
| Benjamini-Hochberg | FDR (expected proportion of false positives) | Least conservative |
Use FWER control (Bonferroni/Holm) when any false positive is costly (e.g., regulatory settings). Use FDR control (BH) when screening many metrics and you can tolerate some false positives.
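Holm's step-down adjustment is straightforward to implement directly. A sketch (`holmAdjust` is an illustrative name, not the kstats API):

```kotlin
// Holm step-down adjustment: multiply the k-th smallest p-value by (m - k + 1),
// enforce monotonicity over the sorted order, and cap at 1.
fun holmAdjust(pValues: DoubleArray): DoubleArray {
    val m = pValues.size
    val order = pValues.indices.sortedBy { pValues[it] }
    val adjusted = DoubleArray(m)
    var running = 0.0
    for ((rank, idx) in order.withIndex()) {
        running = maxOf(running, ((m - rank) * pValues[idx]).coerceAtMost(1.0))
        adjusted[idx] = running
    }
    return adjusted
}

fun main() {
    // Two metrics, as in this tutorial (the p-values here are illustrative).
    val adjusted = holmAdjust(doubleArrayOf(0.01, 0.04))
    println(adjusted.joinToString { "%.4f".format(it) })  // 0.0200, 0.0400
}
```

Compare each adjusted p-value directly to α; the adjustment makes no distributional assumptions about the underlying test statistics.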
Are our two test metrics correlated? Bonferroni controls FWER regardless of dependence structure, but when metrics are positively correlated the correction becomes more conservative than necessary.
val allCTR = controlCTR + treatmentCTR
val allPurchaseRate = controlPurchaseRate + treatmentPurchaseRate

val corrP = pearsonCorrelation(allCTR, allPurchaseRate)
val corrS = spearmanCorrelation(allCTR, allPurchaseRate)
println("Pearson r = ${corrP.coefficient.fmt(3)}, p = ${"%.2e".format(corrP.pValue)}")
println("Spearman ρ = ${corrS.coefficient.fmt(3)}, p = ${"%.2e".format(corrS.pValue)}")
Pearson r = 0.748, p = 1.55e-11
Spearman ρ = 0.517, p = 3.22e-05
Scatter plot — each point is one campaign-day; the positive trend confirms that higher-CTR days also see more purchases per impression.
Correlation heatmap — Pearson r across all rate and exposure metrics. CTR and purchase rate (r = 0.75) are the strongest pair.

CTR and purchase rate are strongly correlated (r = 0.75). This means the Bonferroni/Holm corrections are more conservative than necessary — the true FWER is below the nominal alpha. With two correlated metrics, both corrections are acceptable.
Act 2 methodology note: The marketing campaign groups had very different exposure levels (treatment received ~30% fewer impressions but ~12% more spend). All Act 2 analyses use rate metrics (per impression) to remove this confound, and paired tests matched by date to control for day-level variability.
Caveat on Act 2 power: With only ~29 matched pairs, the study has moderate power (~55% for a medium effect dz ≈ 0.4). The significant results correspond to medium-to-large effects (dz ≈ 0.59-0.75), well above the detectable threshold. Nevertheless, the wide confidence intervals reflect the small sample.

Business conclusion: The new landing page does not improve conversion rate — the experiment was well-powered (>99% to detect a 1 pp lift) and both frequentist and Bayesian analyses agree. For the marketing campaign, both CTR and purchase rate are significantly higher in the treatment group after Holm correction for multiple comparisons. However, with only 29 matched days the effect size estimates are imprecise; a longer experiment would narrow the confidence intervals and confirm the magnitude of the improvement.