Summary statistics, central tendency, dispersion, quantiles, shape measures, frequency analysis, and streaming calculations in kstats-core.
kstats-core provides descriptive statistics as extension functions on DoubleArray and Iterable<Double>. It also includes Int and Long overloads for the most common operations.
describe() is the fastest way to get a complete picture of a sample. It returns a DescriptiveStatistics object combining count, center, spread, quartiles, shape, and standard error.
variance() and standardDeviation() use the sample formula (dividing by n−1) by default. Pass PopulationKind.POPULATION to use the population formula (dividing by n).
Different notions of “center” suit different data shapes. mean() is the arithmetic average, median() is the positional center, mode() returns the most frequent values, and the trimmed and weighted variants handle outliers and importance weights.
Geometric mean — use for growth rates and ratios. Requires all positive values.
Harmonic mean — use for rates and speed averages. Requires all positive values.
Trimmed mean — drops a proportion from each tail before averaging. Robust to outliers.
Weighted mean — each observation contributes proportionally to its weight.
Math details
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \bar{x}_{\mathrm{geo}} = \left(\prod_{i=1}^{n} x_i\right)^{1/n} \qquad \bar{x}_{\mathrm{harm}} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} \qquad \bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

The trimmed mean drops ⌊p·n⌋ observations from each end of the sorted sample, then computes the arithmetic mean of the remaining values.
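As a standalone illustration of the formulas above, here is a plain-Kotlin sketch. These are not the library's extension functions, only direct transcriptions of the math:

```kotlin
import kotlin.math.exp
import kotlin.math.ln

fun geometricMean(xs: DoubleArray): Double {
    require(xs.isNotEmpty() && xs.all { it > 0.0 }) { "all values must be positive" }
    return exp(xs.sumOf { ln(it) } / xs.size)   // exp(mean of logs) avoids overflow
}

fun harmonicMean(xs: DoubleArray): Double {
    require(xs.isNotEmpty() && xs.all { it > 0.0 }) { "all values must be positive" }
    return xs.size / xs.sumOf { 1.0 / it }
}

fun trimmedMean(xs: DoubleArray, proportion: Double): Double {
    val sorted = xs.sorted()
    val drop = (proportion * xs.size).toInt()   // ⌊p·n⌋ from each tail
    return sorted.subList(drop, sorted.size - drop).average()
}

fun weightedMean(xs: DoubleArray, ws: DoubleArray): Double {
    require(xs.size == ws.size)
    return xs.indices.sumOf { xs[it] * ws[it] } / ws.sum()
}
```

For example, `trimmedMean(doubleArrayOf(1.0, 2.0, 3.0, 4.0, 100.0), 0.2)` drops one value from each tail and averages the middle three, giving 3.0 despite the outlier.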
Quantiles split the data at specified probability thresholds. percentile() takes a value on the 0–100 scale; quantile() takes a value on the 0–1 scale.
quantile() supports a QuantileInterpolation parameter with options: LINEAR (default), LOWER, HIGHER, NEAREST, and MIDPOINT. These control how the value is interpolated when the quantile falls between two data points.
Math details
With linear interpolation (the default), the quantile at probability p for sorted data $x_{(1)}, \ldots, x_{(n)}$ is computed as:

$$Q(p) = x_{(\lfloor h \rfloor)} + (h - \lfloor h \rfloor)\left(x_{(\lceil h \rceil)} - x_{(\lfloor h \rfloor)}\right), \qquad h = p(n-1) + 1.$$
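A standalone sketch of the LINEAR rule. It uses 0-based indexing, so the position is h₀ = p(n−1), equivalent to the 1-based h above minus one; the function name is illustrative:

```kotlin
import kotlin.math.ceil
import kotlin.math.floor

fun quantileLinear(xs: DoubleArray, p: Double): Double {
    require(xs.isNotEmpty() && p in 0.0..1.0)
    val sorted = xs.sorted()
    val h = p * (sorted.size - 1)          // 0-based fractional position
    val lo = floor(h).toInt()
    val hi = ceil(h).toInt()
    // Interpolate linearly between the two surrounding order statistics.
    return sorted[lo] + (h - lo) * (sorted[hi] - sorted[lo])
}
```

On the data 1.0 through 10.0, `quantileLinear(xs, 0.5)` lands halfway between the 5th and 6th order statistics, giving 5.5.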
Shape measures describe asymmetry and tail weight beyond what the mean and variance capture.
```kotlin
val data = doubleArrayOf(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)

data.skewness()               // 0.6563 — positive: right tail is longer
data.kurtosis()               // -0.1640 — negative excess: lighter tails than normal
data.kurtosis(excess = false) // raw kurtosis (not centered at 0)

data.centralMoment(2)         // 4.0 (equals population variance)
data.centralMoment(3)         // 5.25
data.centralMoment(4)         // 44.5

data.kStatistic(1)            // 5.0 (equals mean)
data.kStatistic(2)            // 4.5714 (equals sample variance)
```
Skewness near zero suggests symmetry. Positive skewness indicates a longer right tail; negative indicates a longer left tail. Excess kurtosis near zero is similar to a normal distribution; positive values indicate heavier tails.
kurtosis() returns excess kurtosis by default (excess = true), meaning the value is shifted so that a normal distribution has excess kurtosis of 0. Pass excess = false for the raw fourth standardized moment.
Math details
$$g_1 = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^3$$

$$g_2 = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$$

The r-th central moment is $\mu_r = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^r$. K-statistics ($k_r$) are unbiased estimators of the corresponding cumulants.
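The central moments translate directly to code. This standalone sketch reproduces the centralMoment() values shown in the example earlier in this section:

```kotlin
import kotlin.math.pow

// μ_r = (1/n) Σ (x_i − x̄)^r — the r-th central moment.
fun centralMoment(xs: DoubleArray, r: Int): Double {
    require(xs.isNotEmpty() && r >= 1)
    val mean = xs.average()
    return xs.sumOf { (it - mean).pow(r) } / xs.size
}
```

Note that μ₂ is the population variance (divisor n), which is why it differs from the sample variance reported by `kStatistic(2)`.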
The Frequency class counts occurrences, proportions, and cumulative frequencies for any Comparable type. frequencyTable() groups numeric data into equal-width bins.
```kotlin
val data = doubleArrayOf(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0)
val bins = data.asIterable().frequencyTable(3)

// Each FrequencyBin has: range, count, relativeFrequency, cumulativeFrequency
bins[0].count             // number of values in the first bin
bins[0].relativeFrequency // proportion of total
Frequency works with any Comparable type — strings, enums, integers, or custom classes. For numeric binning, use frequencyTable() with either a bin count or a bin width.
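For intuition, equal-width binning reduces to a few lines. This minimal standalone sketch only reproduces the counts; the library's frequencyTable() additionally returns range, relative, and cumulative frequencies per bin:

```kotlin
fun equalWidthBinCounts(xs: DoubleArray, binCount: Int): IntArray {
    require(xs.isNotEmpty() && binCount > 0)
    val min = xs.min()
    val max = xs.max()
    require(max > min) { "data must span a nonzero range" }
    val width = (max - min) / binCount
    val counts = IntArray(binCount)
    for (x in xs) {
        // The maximum value belongs to the last bin (its upper edge is closed).
        val i = if (x == max) binCount - 1 else ((x - min) / width).toInt()
        counts[i]++
    }
    return counts
}
```

For the data 1.0 through 10.0 with 3 bins, the edges are [1, 4), [4, 7), [7, 10], giving counts 3, 3, and 4.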
OnlineStatistics computes mean, variance, skewness, and kurtosis incrementally without storing the full sample. Values can be added one at a time or in batches.
Use OnlineStatistics when data arrives incrementally (streaming, event-driven systems) or when buffering the full sample in memory is impractical.
Math details
OnlineStatistics uses Welford’s algorithm with the Terriberry extension for numerically stable single-pass computation of higher-order moments. The algorithm updates running sums of powers of deviations from the current mean, avoiding the catastrophic cancellation that affects naive two-pass formulas on large datasets.
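A standalone sketch of the core Welford update for the first two moments; the library extends the same idea to third and fourth moments (the class name here is illustrative):

```kotlin
class RunningStats {
    var count = 0L
        private set
    var mean = 0.0
        private set
    private var m2 = 0.0   // running sum of squared deviations from the current mean

    fun add(x: Double) {
        count++
        val delta = x - mean
        mean += delta / count       // update the mean first…
        m2 += delta * (x - mean)    // …then accumulate using the *new* mean: stable
    }

    val sampleVariance: Double
        get() {
            require(count > 1) { "variance needs at least two observations" }
            return m2 / (count - 1)
        }
}
```

Because each update touches only three accumulators, memory is O(1) regardless of stream length, and the subtraction-of-large-sums cancellation of the naive formula Σx² − (Σx)²/n never occurs.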
Process capability indices quantify how well a process fits within its specification limits. processCapability(lsl, usl) returns four Statistical Process Control (SPC) indices in one call:
Cp, Cpk — potential and actual capability using the sample standard deviation (short-term, within-subgroup spread).
Pp, Ppk — the same formulas using the population standard deviation (long-term, overall spread).
The -k variants penalize a process that is off-center relative to the midpoint of the spec, so Cpk is never larger than Cp.
```kotlin
// Ten parts measured against a spec window of [48, 52]
val measurements = doubleArrayOf(
    50.0, 50.5, 49.5, 50.2, 49.8, 50.1, 49.9, 50.3, 49.7, 50.0
)
val capability = measurements.processCapability(lsl = 48.0, usl = 52.0)

capability.cp  // 2.2646 — potential capability (spread vs tolerance)
capability.cpk // 2.2646 — actual capability (penalizes off-centering)
capability.pp  // 2.3870 — overall (population σ) counterpart of Cp
capability.ppk // 2.3870 — overall counterpart of Cpk
```
Use these indices only on a process that is already in statistical control (stable over time — see Shewhart Control Charts below). For an unstable process the measured spread is not a fixed property of the process.
Values ≥ 1.33 are usually considered capable; ≥ 1.67 highly capable. When Cpk ≪ Cp, re-center the process before trying to reduce variance.
Math details
$$C_p = \frac{USL - LSL}{6\sigma_s}, \qquad C_{pk} = \min\left(\frac{USL - \bar{x}}{3\sigma_s},\ \frac{\bar{x} - LSL}{3\sigma_s}\right)$$

Pp and Ppk use the population standard deviation σp (divisor n) in place of the sample standard deviation σs (divisor n−1). processCapability computes both in a single numerically stable Welford pass.
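The four indices follow directly from these formulas. A standalone two-pass sketch for comparison (the library does this in a single Welford pass; the Capability type and capability() name here are illustrative):

```kotlin
import kotlin.math.sqrt

data class Capability(val cp: Double, val cpk: Double, val pp: Double, val ppk: Double)

fun capability(xs: DoubleArray, lsl: Double, usl: Double): Capability {
    require(xs.size > 1 && usl > lsl)
    val mean = xs.average()
    val ss = xs.sumOf { (it - mean) * (it - mean) }
    val sigmaS = sqrt(ss / (xs.size - 1))   // sample σ     → Cp, Cpk
    val sigmaP = sqrt(ss / xs.size)         // population σ → Pp, Ppk
    fun indices(sigma: Double) = Pair(
        (usl - lsl) / (6 * sigma),
        minOf((usl - mean) / (3 * sigma), (mean - lsl) / (3 * sigma))
    )
    val (cp, cpk) = indices(sigmaS)
    val (pp, ppk) = indices(sigmaP)
    return Capability(cp, cpk, pp, ppk)
}
```

Running this on the ten measurements above reproduces Cp ≈ 2.2646 and Pp ≈ 2.3870; since that sample is perfectly centered at 50, Cpk equals Cp and Ppk equals Pp.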
Shewhart control charts plot subgroup statistics over time with three-sigma control limits. xBarRChart() monitors the process mean together with the range within each subgroup; xBarSChart() uses the sample standard deviation instead — more efficient for subgroup sizes above 10. Both require equal-sized subgroups of 2–25 observations.
```kotlin
// Five subgroups of four parts; bracket width monitored per batch
val subgroups = listOf(
    doubleArrayOf(72.0, 84.0, 79.0, 49.0),
    doubleArrayOf(56.0, 87.0, 33.0, 42.0),
    doubleArrayOf(55.0, 73.0, 22.0, 60.0),
    doubleArrayOf(44.0, 80.0, 54.0, 74.0),
    doubleArrayOf(97.0, 26.0, 48.0, 58.0),
)
val chart = xBarRChart(subgroups)

chart.centerLine        // 59.65 — grand mean (x-double-bar)
chart.ucl               // 95.6626 — upper control limit for the mean
chart.lcl               // 23.6374 — lower control limit for the mean

chart.rChart.centerLine // 49.4 — average range (R-bar)
chart.rChart.ucl        // 112.7308 — upper limit for within-subgroup range
chart.rChart.lcl        // 0.0 — lower limit (D₃ = 0 for n ≤ 6)
```
Control limits use the standard SPC constants $A_2$, $A_3$, $D_3$, $D_4$, $B_3$, $B_4$, $c_4$ (Montgomery, Introduction to Statistical Quality Control, Appendix VI), available for subgroup sizes 2–25 through spcConstants(n).
Math details
For k subgroups of size n with subgroup means $\bar{x}_i$, ranges $R_i$, and standard deviations $s_i$:

$$\bar{x}\text{-}R:\quad UCL/LCL = \bar{\bar{x}} \pm A_2\bar{R}, \qquad R\text{-chart: } [D_3\bar{R},\ D_4\bar{R}]$$

$$\bar{x}\text{-}S:\quad UCL/LCL = \bar{\bar{x}} \pm A_3\bar{s}, \qquad S\text{-chart: } [B_3\bar{s},\ B_4\bar{s}]$$
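The limit arithmetic is easy to verify by hand for a fixed subgroup size. This standalone sketch hardcodes the tabulated constants for n = 4 (A₂ = 0.729, D₃ = 0, D₄ = 2.282); the function name and return layout are illustrative:

```kotlin
// Returns [LCL, center, UCL] for the mean, then [R-chart LCL, R-chart UCL].
fun xBarRLimits(subgroups: List<DoubleArray>): DoubleArray {
    val a2 = 0.729; val d3 = 0.0; val d4 = 2.282   // constants for n = 4 only
    require(subgroups.isNotEmpty() && subgroups.all { it.size == 4 })
    val grandMean = subgroups.map { it.average() }.average()   // x-double-bar
    val rBar = subgroups.map { it.max() - it.min() }.average() // R-bar
    return doubleArrayOf(
        grandMean - a2 * rBar,
        grandMean,
        grandMean + a2 * rBar,
        d3 * rBar,
        d4 * rBar
    )
}
```

Applied to the five subgroups above, this reproduces the center line 59.65, mean limits 23.6374 and 95.6626, and R-chart upper limit 112.7308.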
A Shewhart chart reacts slowly to drifts of less than 2σ because every point is judged in isolation. cusum() accumulates deviations from target over time, so a 0.5σ–1σ drift is detected within a few observations. The two-sided tabular form tracks an upper sum C+ that catches upward shifts and a lower sum C− for downward shifts. An alarm fires on the first index where either sum exceeds the decision interval H.
```kotlin
// Individual measurements from a process with target 10, drifting upward
val observations = doubleArrayOf(10.2, 10.4, 10.6, 10.9, 11.2, 11.5, 11.8, 12.0)
val result = cusum(observations, target = 10.0, k = 0.5, h = 3.0)

result.sPlus      // [0.0, 0.0, 0.1, 0.5, 1.2, 2.2, 3.5, 5.0]
result.sMinus     // all zero — no downward drift
result.alarmIndex // 6 — first index where C⁺ > H
```
Tune k to half the shift size you want to detect, in units of σ; a common default is K ≈ 0.5σ, targeting a 1σ drift. Set h to 4σ–5σ to match the in-control run length of a 3σ Shewhart chart while reacting much faster to small shifts.
Math details
$$C_i^+ = \max\left(0,\ C_{i-1}^+ + (x_i - \mu_0 - K)\right), \qquad C_i^- = \max\left(0,\ C_{i-1}^- + (\mu_0 - K - x_i)\right)$$

starting from $C_0^\pm = 0$, with an alarm the first time $C_i^+ > H$ or $C_i^- > H$.
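The recursion translates almost line-for-line into code. A standalone sketch of the two-sided tabular form (the function name and return shape are illustrative, not the library's cusum() signature):

```kotlin
import kotlin.math.max

// Returns the running C⁺ sums and the first alarm index (−1 if none fires).
fun twoSidedCusum(xs: DoubleArray, target: Double, k: Double, h: Double): Pair<DoubleArray, Int> {
    var cPlus = 0.0
    var cMinus = 0.0
    var alarm = -1
    val sPlus = DoubleArray(xs.size)
    for (i in xs.indices) {
        cPlus = max(0.0, cPlus + (xs[i] - target - k))    // accumulates upward drift
        cMinus = max(0.0, cMinus + (target - k - xs[i]))  // accumulates downward drift
        sPlus[i] = cPlus
        if (alarm < 0 && (cPlus > h || cMinus > h)) alarm = i
    }
    return Pair(sPlus, alarm)
}
```

On the drifting observations above it reproduces the documented C⁺ trajectory and fires at index 6, where C⁺ = 3.5 first exceeds H = 3.0.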
EWMA (Roberts, 1959) is the other classic small-shift detector. Instead of an unbounded cumulative sum, ewma() maintains a weighted moving average that gives recent observations more influence while retaining memory of the past. Control limits widen with time until they reach a steady state, so the chart is most sensitive early — useful for catching an initial shift.
λ=0.2 with L≈2.7–3.0 is a common default. Smaller λ emphasizes memory and detects smaller shifts; λ=1 collapses EWMA into a Shewhart individuals chart.
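The EWMA recursion and its time-varying limits can be sketched standalone. Parameter names here (mu0, sigma, lambda, l) are illustrative, not ewma()'s actual signature:

```kotlin
import kotlin.math.pow
import kotlin.math.sqrt

// z_i = λ·x_i + (1 − λ)·z_{i−1}, with limits μ0 ± L·σ·sqrt(λ/(2−λ)·(1 − (1−λ)^{2i})).
// Returns one (statistic, LCL, UCL) triple per observation.
fun ewmaChart(xs: DoubleArray, mu0: Double, sigma: Double,
              lambda: Double = 0.2, l: Double = 2.7): List<Triple<Double, Double, Double>> {
    require(lambda > 0.0 && lambda <= 1.0)
    var z = mu0                                   // z_0 starts at the target
    return xs.mapIndexed { i, x ->
        z = lambda * x + (1 - lambda) * z
        // Limits widen with i, approaching L·σ·sqrt(λ/(2−λ)) in steady state.
        val spread = l * sigma *
            sqrt(lambda / (2 - lambda) * (1 - (1 - lambda).pow(2 * (i + 1))))
        Triple(z, mu0 - spread, mu0 + spread)
    }
}
```

Setting λ = 1 makes z equal to the latest observation and the limits constant at ±Lσ, i.e. a Shewhart individuals chart, matching the note above.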
westernElectricRules() extends a Shewhart chart beyond the basic ±3σ check with four run-length heuristics that catch trends, clusters, and prolonged one-sided runs.
| Rule | Pattern | Detects |
|------|---------|---------|
| 1 | 1 point beyond ±3σ | extreme single excursion |
| 2 | 2 of last 3 points beyond ±2σ, same side | strong shift |
| 3 | 4 of last 5 points beyond ±1σ, same side | moderate shift |
| 4 | 8 consecutive points on the same side of the center | sustained shift, any magnitude |
Each rule’s array contains the trigger indices — the observation whose arrival completes the offending pattern.
```kotlin
// Process drifting upward in the last four observations
val observations = doubleArrayOf(
    0.1, 0.2, -0.3, 0.0, 1.4, 1.2, 2.4, 2.6, 3.5, 2.2
)
val violations = westernElectricRules(observations, center = 0.0, sigma = 1.0)

violations.rule1 // indices of points beyond ±3σ
violations.rule2 // indices where 2 of last 3 points are beyond ±2σ (same side)
violations.rule3 // indices where 4 of last 5 points are beyond ±1σ (same side)
violations.rule4 // indices where 8 consecutive points fall on the same side
```
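For intuition about the trigger-index convention, here is a standalone sketch of Rules 1 and 4 only (function names illustrative); each reported index is the observation whose arrival completes the pattern:

```kotlin
import kotlin.math.abs

// Rule 1: any single point more than 3σ from the center line.
fun rule1(xs: DoubleArray, center: Double, sigma: Double): List<Int> =
    xs.indices.filter { abs(xs[it] - center) > 3 * sigma }

// Rule 4: 8 consecutive points strictly on the same side of the center;
// the index of the 8th point in the run is the trigger.
fun rule4(xs: DoubleArray, center: Double): List<Int> =
    xs.indices.filter { i ->
        i >= 7 && ((0..7).all { xs[i - it] > center } ||
                   (0..7).all { xs[i - it] < center })
    }
```

On the sample above with center 0 and σ = 1, Rule 1 fires at index 8 (the 3.5 excursion) and Rule 4 never fires, because the early observations straddle the center line.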
Combine a Shewhart chart (large shifts) with CUSUM or EWMA (small shifts) and Western Electric Rules (patterns) — the three views together catch the widest range of out-of-control conditions.
Empty arrays throw InsufficientDataException. Functions that require at least two observations (variance, standard deviation, skewness) throw InsufficientDataException for single-element input. Functions requiring positive data (geometricMean, harmonicMean) throw InvalidParameterException for non-positive values.