Descriptive Statistics

kstats-core provides descriptive statistics as extension functions on DoubleArray and Iterable<Double>. It also includes Int and Long overloads for the most common operations.

Summary Snapshot

describe() is the fastest way to get a complete picture of a sample. It returns a DescriptiveStatistics object combining count, center, spread, quartiles, shape, and standard error.

val data = doubleArrayOf(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)
val stats = data.describe()

stats.count              // 8
stats.mean               // 5.0
stats.median             // 4.5
stats.standardDeviation  // 2.1380
stats.variance           // 4.5714
stats.q1                 // 4.0
stats.q3                 // 5.5
stats.interquartileRange // 1.5
stats.skewness           // 0.6563
stats.kurtosis           // -0.1640
stats.range              // 7.0
stats.standardError      // 0.7559

variance() and standardDeviation() use the sample formula (dividing by

n - 1

) by default. Pass PopulationKind.POPULATION to use the population formula (dividing by

n

Central Tendency

Different notions of “center” suit different data shapes. mean() is the arithmetic average, median() is the positional center, mode() returns the most frequent values, and the trimmed and weighted variants handle outliers and importance weights.

val data = doubleArrayOf(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)

data.mean()                    // 5.0
data.median()                  // 4.5
data.toList().mode()           // {4.0}

data.trimmedMean(0.1)          // 4.8333 — trims 10% from each tail

val positive = doubleArrayOf(1.0, 2.0, 4.0, 8.0)
positive.geometricMean()       // 2.8284
positive.harmonicMean()        // 2.1333

val values = doubleArrayOf(1.0, 2.0, 3.0)
val weights = doubleArrayOf(3.0, 1.0, 1.0)
values.weightedMean(weights)   // 1.6

Geometric mean — use for growth rates and ratios. Requires all positive values.
Harmonic mean — use for rates and speed averages. Requires all positive values.
Trimmed mean — drops a proportion from each tail before averaging. Robust to outliers.
Weighted mean — each observation contributes proportionally to its weight.

Math details

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i

\bar{x}_{\text{geo}} = \left(\prod_{i=1}^{n} x_i\right)^{1/n}

\bar{x}_{\text{harm}} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}

\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}

The trimmed mean drops

\lfloor p \cdot n \rfloor

observations from each end of the sorted sample, then computes the arithmetic mean of the remaining values.

Dispersion

Spread measures describe how far values deviate from the center.

val data = doubleArrayOf(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)

data.variance()                                    // 4.5714 (sample)
data.variance(PopulationKind.POPULATION)            // 4.0
data.standardDeviation()                            // 2.1380
data.range()                                        // 7.0
data.interquartileRange()                           // 1.5
data.meanAbsoluteDeviation()                        // 1.5

data.trimmedVariance(0.1)                           // trimmed sample variance

data.semiVariance(5.0, SemiVarianceDirection.DOWNSIDE) // 1.7143
data.semiVariance(5.0, SemiVarianceDirection.UPSIDE)   // 2.8571

semiVariance measures variability on one side of a threshold. DOWNSIDE captures risk below the threshold; UPSIDE captures variability above it.

Math details

s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad s = \sqrt{s^2}, \qquad SE = \frac{s}{\sqrt{n}}

\text{MAD} = \frac{1}{n}\sum_{i=1}^{n}|x_i - \bar{x}|

\text{SemiVar}_{\text{down}}(t) = \frac{1}{n-1}\sum_{x_i < t}(x_i - t)^2

Quantiles and Position

Quantiles split the data at specified probability thresholds. percentile() takes a value on the 0–100 scale; quantile() takes a value on the 0–1 scale.

val data = doubleArrayOf(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0)

data.quantile(0.5)         // 5.5 (median)
data.quantile(0.25)        // 3.25 (Q1)
data.percentile(90.0)      // 9.1 (90th percentile)

val (q1, median, q3) = data.quartiles()
// q1 = 3.25, median = 5.5, q3 = 7.75

quantile() supports a QuantileInterpolation parameter with options: LINEAR (default), LOWER, HIGHER, NEAREST, and MIDPOINT. These control how the value is interpolated when the quantile falls between two data points.

Math details

With linear interpolation (the default), the quantile at probability

p

for sorted data

x_{(1)}, \ldots, x_{(n)}

is computed as:

Q(p) = x_{(\lfloor h \rfloor)} + (h - \lfloor h \rfloor)(x_{(\lceil h \rceil)} - x_{(\lfloor h \rfloor)})

where

h = p(n - 1) + 1

Shape and Moments

Shape measures describe asymmetry and tail weight beyond what the mean and variance capture.

val data = doubleArrayOf(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)

data.skewness()            // 0.6563 — positive: right tail is longer
data.kurtosis()            // -0.1640 — negative excess: lighter tails than normal
data.kurtosis(excess = false) // raw kurtosis (not centered at 0)

data.centralMoment(2)      // 4.0 (equals population variance)
data.centralMoment(3)      // 5.25
data.centralMoment(4)      // 44.5

data.kStatistic(1)          // 5.0 (equals mean)
data.kStatistic(2)          // 4.5714 (equals sample variance)

Skewness near zero suggests symmetry. Positive skewness indicates a longer right tail; negative indicates a longer left tail. Excess kurtosis near zero is similar to a normal distribution; positive values indicate heavier tails.

kurtosis() returns excess kurtosis by default (excess = true), meaning the value is shifted so that a normal distribution has excess kurtosis of 0. Pass excess = false for the raw fourth standardized moment.

Math details

g_1 = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^3

g_2 = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}

The

r

-th central moment is

\mu_r = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^r

. K-statistics (

k_r

) are unbiased estimators of the corresponding cumulants.

Frequency Analysis

The Frequency class counts occurrences, proportions, and cumulative frequencies for any Comparable type. frequencyTable() groups numeric data into equal-width bins.

val freq = listOf("a", "a", "b", "b", "b", "c").toFrequency()
freq.totalCount            // 6
freq.count("b")            // 3
freq.proportion("b")       // 0.5
freq.cumulativeCount("b")  // 5 (a=2 + b=3)
freq.mode                  // {b}

val data = doubleArrayOf(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0)
val bins = data.asIterable().frequencyTable(3)
// Each FrequencyBin has: range, count, relativeFrequency, cumulativeFrequency
bins[0].count              // number of values in the first bin
bins[0].relativeFrequency  // proportion of total

Frequency works with any Comparable type — strings, enums, integers, or custom classes. For numeric binning, use frequencyTable() with either a bin count or a bin width.

Streaming Statistics

OnlineStatistics computes mean, variance, skewness, and kurtosis incrementally without storing the full sample. Values can be added one at a time or in batches.

val online = OnlineStatistics()
online.addAll(doubleArrayOf(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0))

online.count               // 8
online.mean                // 5.0
online.sum                 // 40.0
online.min                 // 2.0
online.max                 // 9.0
online.variance()          // 4.5714 (sample)
online.standardDeviation() // 2.1380
online.skewness()          // 0.6563

// Add more data later
online.add(3.0)
online.count               // 9
online.mean                // 4.7778

Use OnlineStatistics when data arrives incrementally (streaming, event-driven systems) or when buffering the full sample in memory is impractical.

Math details

OnlineStatistics uses Welford’s algorithm with the Terriberry extension for numerically stable single-pass computation of higher-order moments. The algorithm updates running sums of powers of deviations from the current mean, avoiding the catastrophic cancellation that affects naive two-pass formulas on large datasets.

Process Capability

Process capability indices quantify how well a process fits within its . processCapability(lsl, usl) returns four Statistical Process Control (SPC) indices in one call:

Cp, Cpk — potential and actual capability using the sample standard deviation (short-term, within-subgroup spread).
Pp, Ppk — the same formulas using the population standard deviation (long-term, overall spread).

The -k variants penalize a process that is off-center relative to the midpoint of the spec, so Cpk is never larger than Cp.

// Ten parts measured against a spec window of [48, 52]
val measurements = doubleArrayOf(
    50.0, 50.5, 49.5, 50.2, 49.8, 50.1, 49.9, 50.3, 49.7, 50.0
)
val capability = measurements.processCapability(lsl = 48.0, usl = 52.0)

capability.cp   // 2.2646 — potential capability (spread vs tolerance)
capability.cpk  // 2.2646 — actual capability (penalizes off-centering)
capability.pp   // 2.3870 — overall (population σ) counterpart of Cp
capability.ppk  // 2.3870 — overall counterpart of Cpk

Use these indices only on a process that is already in statistical control (stable over time — see Shewhart Control Charts below). For an unstable process the measured spread is not a fixed property of the process.

Values ≥ 1.33 are usually considered capable; ≥ 1.67 highly capable. When Cpk ≪ Cp, re-center the process before trying to reduce variance.

Math details

\mathrm{Cp} = \frac{\mathrm{USL} - \mathrm{LSL}}{6\sigma_s}, \qquad \mathrm{Cpk} = \min\!\left(\frac{\mathrm{USL} - \bar{x}}{3\sigma_s},\ \frac{\bar{x} - \mathrm{LSL}}{3\sigma_s}\right)

Pp and Ppk use the population standard deviation

\sigma_p

(divisor

n

) in place of the sample standard deviation

\sigma_s

(divisor

n-1

). processCapability computes both in a single numerically stable Welford pass.

Shewhart Control Charts

Shewhart plot subgroup statistics over time with three-sigma control limits. xBarRChart() monitors the process mean together with the range within each subgroup; xBarSChart() uses the sample standard deviation instead — more efficient for subgroup sizes above 10. Both require equal-sized subgroups of 2–25 observations.

// Five subgroups of four parts; bracket width monitored per batch
val subgroups = listOf(
    doubleArrayOf(72.0, 84.0, 79.0, 49.0),
    doubleArrayOf(56.0, 87.0, 33.0, 42.0),
    doubleArrayOf(55.0, 73.0, 22.0, 60.0),
    doubleArrayOf(44.0, 80.0, 54.0, 74.0),
    doubleArrayOf(97.0, 26.0, 48.0, 58.0),
)
val chart = xBarRChart(subgroups)

chart.centerLine     // 59.65 — grand mean (x-double-bar)
chart.ucl            // 95.6626 — upper control limit for the mean
chart.lcl            // 23.6374 — lower control limit for the mean
chart.rChart.centerLine // 49.4 — average range (R-bar)
chart.rChart.ucl     // 112.7308 — upper limit for within-subgroup range
chart.rChart.lcl     // 0.0 — lower limit (D₃ = 0 for n ≤ 6)

Control limits use the standard SPC constants

A_2, A_3, D_3, D_4, B_3, B_4, c_4

(Montgomery, Introduction to Statistical Quality Control, Appendix VI) available for subgroup sizes 2–25 through spcConstants(n).

Math details

For

k

subgroups of size

n

with subgroup means

\bar{x}_i

, ranges

R_i

, and standard deviations

s_i

\text{x̄-R:}\quad \mathrm{UCL}/\mathrm{LCL} = \bar{\bar{x}} \pm A_2 \bar{R}, \quad R\text{-chart: } [D_3 \bar{R},\ D_4 \bar{R}]

\text{x̄-S:}\quad \mathrm{UCL}/\mathrm{LCL} = \bar{\bar{x}} \pm A_3 \bar{s}, \quad S\text{-chart: } [B_3 \bar{s},\ B_4 \bar{s}]

CUSUM Chart

A Shewhart chart reacts slowly to drifts of less than 2σ because every point is judged in isolation. cusum() accumulates deviations from target over time, so a 0.5σ–1σ drift is detected within a few observations. The two-sided tabular form tracks an upper sum

C^+

that catches upward shifts and a lower sum

C^-

for downward shifts. An alarm fires on the first index where either sum exceeds the decision interval

H

// Individual measurements from a process with target 10, drifting upward
val observations = doubleArrayOf(10.2, 10.4, 10.6, 10.9, 11.2, 11.5, 11.8, 12.0)
val result = cusum(observations, target = 10.0, k = 0.5, h = 3.0)

result.sPlus      // [0.0, 0.0, 0.1, 0.5, 1.2, 2.2, 3.5, 5.0]
result.sMinus     // all zero — no downward drift
result.alarmIndex // 6 — first index where C⁺ > H

Tune k to half the shift size you want to detect, in units of σ — a common default is

K \approx 0.5\sigma

targeting a 1σ drift. Set h to 4σ–5σ to match the in-control run length of a 3σ Shewhart chart while reacting much faster to small shifts.

Math details

C^+_i = \max\bigl(0,\; C^+_{i-1} + (x_i - \mu_0 - K)\bigr), \qquad C^-_i = \max\bigl(0,\; C^-_{i-1} + (\mu_0 - K - x_i)\bigr)

starting from

C^\pm_0 = 0

, with an alarm the first time

C^+_i > H

C^-_i > H

EWMA Chart

EWMA (Roberts, 1959) is the other classic small-shift detector. Instead of an unbounded cumulative sum, ewma() maintains a weighted moving average that gives recent observations more influence while retaining memory of the past. Control limits widen with time until they reach a steady state, so the chart is most sensitive early — useful for catching an initial shift.

// EWMA chart: target = 25, σ = 1, λ = 0.2, L = 3
val observations = doubleArrayOf(25.0, 24.5, 25.2, 26.1, 25.8, 27.0, 26.5, 28.0)
val result = ewma(
    observations,
    target = 25.0,
    sigma = 1.0,
    lambda = 0.2,
    controlLimitWidth = 3.0
)

result.smoothedValues[0] // 25.0 — Z₀ = λ·x + (1-λ)·target
result.smoothedValues[7] // 26.2549 — smoothed statistic at t = 7
result.ucl[0]            // 25.6 — narrow at first, widens with t
result.ucl[7]            // 25.9858 — approaching steady state
result.outOfControl      // [7] — Z₇ exceeds UCL₇

\lambda = 0.2

with

L \approx 2.7

–

3.0

is a common default. Smaller

\lambda

emphasizes memory and detects smaller shifts;

\lambda = 1

collapses EWMA into a Shewhart individuals chart.

Math details

Z_t = \lambda x_t + (1 - \lambda) Z_{t-1}, \qquad Z_0 = \mu_0

\mathrm{UCL}_t/\mathrm{LCL}_t = \mu_0 \pm L \sigma \sqrt{\frac{\lambda}{2 - \lambda}\,\bigl(1 - (1 - \lambda)^{2t}\bigr)}

Western Electric Rules

westernElectricRules() extends a Shewhart chart beyond the basic ±3σ check with four run-length heuristics that catch trends, clusters, and prolonged one-sided runs.

Rule	Pattern	Detects
1	1 point beyond $\pm 3\sigma$	extreme single excursion
2	2 of last 3 points beyond $\pm 2\sigma$ , same side	strong shift
3	4 of last 5 points beyond $\pm 1\sigma$ , same side	moderate shift
4	8 consecutive points on the same side of the center	sustained shift, any magnitude

Each rule’s array contains the trigger indices — the observation whose arrival completes the offending pattern.

// Process drifting upward in the last four observations
val observations = doubleArrayOf(
    0.1, 0.2, -0.3, 0.0, 1.4, 1.2, 2.4, 2.6, 3.5, 2.2
)
val violations = westernElectricRules(observations, center = 0.0, sigma = 1.0)

violations.rule1 // indices of points beyond ±3σ
violations.rule2 // indices where 2 of last 3 points are beyond ±2σ (same side)
violations.rule3 // indices where 4 of last 5 points are beyond ±1σ (same side)
violations.rule4 // indices where 8 consecutive points fall on the same side

Combine a Shewhart chart (large shifts) with CUSUM or EWMA (small shifts) and Western Electric Rules (patterns) — the three views together catch the widest range of out-of-control conditions.

Error Handling

Empty arrays throw InsufficientDataException. Functions that require at least two observations (variance, standard deviation, skewness) throw InsufficientDataException for single-element input. Functions requiring positive data (geometricMean, harmonicMean) throw InvalidParameterException for non-positive values.

API Reference

Full API Reference

Browse all public types, functions, parameter overloads, and return types in the Dokka-generated reference.

Getting Started

Modules

Descriptive Statistics

Summary Snapshot

Central Tendency

Dispersion

Quantiles and Position

Shape and Moments

Frequency Analysis

Streaming Statistics

Process Capability

Shewhart Control Charts

CUSUM Chart

EWMA Chart

Western Electric Rules

Error Handling

API Reference

Full API Reference

Getting Started

Modules

​Summary Snapshot

​Central Tendency

​Dispersion

​Quantiles and Position

​Shape and Moments

​Frequency Analysis

​Streaming Statistics

​Process Capability

​Shewhart Control Charts

​CUSUM Chart

​EWMA Chart

​Western Electric Rules

​Error Handling

​API Reference

Full API Reference

Summary Snapshot

Central Tendency

Dispersion

Quantiles and Position

Shape and Moments

Frequency Analysis

Streaming Statistics

Process Capability

Shewhart Control Charts

CUSUM Chart

EWMA Chart

Western Electric Rules

Error Handling

API Reference