A/B Testing — Compare Variants with Statistical Tests
Run a complete A/B test workflow in Kotlin: summarize groups, check assumptions, choose the right test, measure effect size, and correct for multiple comparisons with kstats.
Kotlin Notebook: try this guide as a Kotlin Notebook with Kandy visualizations. Run the cells to see the charts and explore the data interactively.
This guide walks through an A/B test comparing two checkout flow variants in a mobile app. The primary metric is session duration (seconds); the secondary metric is number of completed steps.
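The snippets below refer to `controlDurationSec`, `treatmentDurationSec`, and `treatmentSteps`. If you are following along outside the notebook, a hypothetical setup might look like the following; the values are made up for illustration, and `DoubleArray` is an assumption here, so match whatever collection type kstats actually expects.

```kotlin
// Illustrative sample data (hypothetical values, not from a real experiment).
val controlDurationSec = doubleArrayOf(48.0, 52.0, 50.0, 47.0, 53.0, 49.0)
val treatmentDurationSec = doubleArrayOf(41.0, 43.0, 40.0, 44.0, 42.0, 39.0)
val treatmentSteps = doubleArrayOf(5.0, 4.0, 6.0, 3.0, 4.0, 6.0)
```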
```kotlin
// Welch's t-test (default: equalVariances = false)
val result = tTest(controlDurationSec, treatmentDurationSec)
result.statistic
result.pValue
result.confidenceInterval // 95% CI for the difference in means
result.isSignificant() // true if p < 0.05
```
If the Levene test confirmed equal variances:
```kotlin
val equalVar = tTest(
    controlDurationSec,
    treatmentDurationSec,
    equalVariances = true
)
equalVar.pValue
```
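For intuition, the Welch statistic can be sketched from first principles. This is a minimal standalone version of the textbook formula, not the actual kstats internals (which are an assumption here):

```kotlin
import kotlin.math.sqrt

// Welch's t-statistic: difference in means divided by the standard error,
// using each group's own sample variance (no pooling).
fun welchT(a: DoubleArray, b: DoubleArray): Double {
    val ma = a.average()
    val mb = b.average()
    val va = a.sumOf { (it - ma) * (it - ma) } / (a.size - 1)
    val vb = b.sumOf { (it - mb) * (it - mb) } / (b.size - 1)
    return (ma - mb) / sqrt(va / a.size + vb / b.size)
}

fun main() {
    val control = doubleArrayOf(48.0, 52.0, 50.0, 47.0, 53.0)
    val treatment = doubleArrayOf(41.0, 43.0, 40.0, 44.0, 42.0)
    println(welchT(control, treatment)) // ≈ 5.96: a large positive t
}
```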
If the normality assumption does not hold, fall back to the nonparametric Mann-Whitney U test:

```kotlin
val result = mannWhitneyUTest(controlDurationSec, treatmentDurationSec)
result.statistic
result.pValue
result.isSignificant()
```
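The U statistic itself is simple to state: count, over all cross-group pairs, how often a value from one group exceeds a value from the other (ties count as one half). A brute-force sketch, for illustration only:

```kotlin
// Mann-Whitney U by direct pair counting. Quadratic in input size, so
// real implementations use ranking instead, but the result is the same.
fun mannWhitneyU(a: DoubleArray, b: DoubleArray): Double {
    var u = 0.0
    for (x in a) for (y in b) {
        u += when {
            x > y -> 1.0
            x == y -> 0.5
            else -> 0.0
        }
    }
    return u
}

fun main() {
    println(mannWhitneyU(doubleArrayOf(3.0, 5.0, 7.0), doubleArrayOf(1.0, 2.0, 6.0))) // 7.0
}
```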
A p-value tells you whether a difference exists; effect size tells you how large it is. Cohen's d expresses the difference in standard-deviation units: |d| < 0.2 is negligible, around 0.2 small, around 0.5 medium, and 0.8 or above large.
```kotlin
// Cohen's d: how large is the difference in standard-deviation units?
val d = cohensD(controlDurationSec, treatmentDurationSec)
d // ~2.9 → large effect (|d| ≥ 0.8)
```
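The classic definition divides the mean difference by a pooled standard deviation. A standalone sketch of that formula (whether kstats uses exactly this pooled variant is an assumption):

```kotlin
import kotlin.math.sqrt

// Cohen's d with pooled standard deviation:
// d = (mean(a) - mean(b)) / s_pooled
fun cohensDPooled(a: DoubleArray, b: DoubleArray): Double {
    val ma = a.average()
    val mb = b.average()
    val va = a.sumOf { (it - ma) * (it - ma) } / (a.size - 1)
    val vb = b.sumOf { (it - mb) * (it - mb) } / (b.size - 1)
    val pooled = sqrt(((a.size - 1) * va + (b.size - 1) * vb) / (a.size + b.size - 2))
    return (ma - mb) / pooled
}

fun main() {
    val control = doubleArrayOf(48.0, 52.0, 50.0, 47.0, 53.0)
    val treatment = doubleArrayOf(41.0, 43.0, 40.0, 44.0, 42.0)
    println(cohensDPooled(control, treatment)) // ≈ 3.77: well past the 0.8 "large" threshold
}
```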
Check whether the two metrics move together within each group.
```kotlin
// Within the treatment group: do faster sessions correlate with more completed steps?
val correlation = spearmanCorrelation(treatmentDurationSec, treatmentSteps)
correlation.coefficient // negative means shorter sessions correlate with more steps
correlation.pValue
```
Spearman correlation is preferred here because one metric (steps) is ordinal.
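Conceptually, Spearman's coefficient is just Pearson correlation applied to ranks, which is why it only cares about monotonic relationships. A minimal sketch (with average ranks for ties):

```kotlin
import kotlin.math.sqrt

// Assign average ranks: equal values share the mean of their rank positions.
fun ranks(v: DoubleArray): DoubleArray {
    val r = DoubleArray(v.size)
    for (i in v.indices) {
        var less = 0
        var equal = 0
        for (x in v) if (x < v[i]) less++ else if (x == v[i]) equal++
        r[i] = less + (equal + 1) / 2.0
    }
    return r
}

// Plain Pearson correlation coefficient.
fun pearson(x: DoubleArray, y: DoubleArray): Double {
    val mx = x.average()
    val my = y.average()
    var sxy = 0.0; var sxx = 0.0; var syy = 0.0
    for (i in x.indices) {
        sxy += (x[i] - mx) * (y[i] - my)
        sxx += (x[i] - mx) * (x[i] - mx)
        syy += (y[i] - my) * (y[i] - my)
    }
    return sxy / sqrt(sxx * syy)
}

// Spearman = Pearson on the ranks.
fun spearman(x: DoubleArray, y: DoubleArray) = pearson(ranks(x), ranks(y))

fun main() {
    // Perfectly monotone decreasing relationship → coefficient of -1.
    println(spearman(doubleArrayOf(1.0, 2.0, 3.0, 4.0), doubleArrayOf(8.0, 6.0, 4.0, 2.0)))
}
```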