
Kotlin Notebook

Try this guide as a Kotlin Notebook with Kandy visualizations — run the cells to see charts and explore the data interactively.
This guide analyzes backend service performance metrics collected over 30 days. Four metrics are tracked: response time (ms), errors per hour, memory usage (MB), and throughput (requests/sec).

Dataset

val responseTimeMs = doubleArrayOf(
    89.2, 95.1, 87.6, 102.3, 91.8, 88.4, 96.7, 103.5, 90.1, 94.3,
    88.9, 97.2, 105.8, 91.4, 93.6, 87.1, 99.0, 92.5, 96.1, 104.2,
    90.7, 88.3, 101.6, 93.9, 95.4, 89.8, 98.3, 106.1, 91.0, 94.7
)

val errorsPerHour = doubleArrayOf(
    2.0, 3.0, 1.0, 5.0, 2.0, 1.0, 4.0, 6.0, 2.0, 3.0,
    1.0, 4.0, 7.0, 2.0, 3.0, 1.0, 5.0, 2.0, 4.0, 6.0,
    2.0, 1.0, 5.0, 3.0, 3.0, 1.0, 4.0, 8.0, 2.0, 3.0
)

val memoryUsageMb = doubleArrayOf(
    512.3, 528.1, 505.7, 545.2, 519.6, 508.4, 534.8, 551.3, 515.0, 526.7,
    509.2, 537.1, 558.4, 517.8, 524.3, 503.1, 541.6, 520.9, 531.5, 549.7,
    514.2, 506.8, 543.9, 522.5, 529.0, 511.4, 539.3, 561.2, 516.3, 527.4
)

val throughputRps = doubleArrayOf(
    245.0, 238.0, 251.0, 225.0, 242.0, 249.0, 232.0, 218.0, 244.0, 236.0,
    250.0, 230.0, 212.0, 243.0, 237.0, 253.0, 227.0, 241.0, 233.0, 220.0,
    246.0, 252.0, 224.0, 239.0, 235.0, 248.0, 228.0, 210.0, 243.0, 234.0
)

Step 1: Summary Statistics

val rtSummary = responseTimeMs.describe()
val errSummary = errorsPerHour.describe()
val memSummary = memoryUsageMb.describe()
val tpSummary = throughputRps.describe()

// In a notebook cell only the last expression is rendered, so print each summary explicitly.
println("response time (ms): mean=${rtSummary.mean}, sd=${rtSummary.standardDeviation}, min=${rtSummary.min}, max=${rtSummary.max}")
println("errors/hour:        mean=${errSummary.mean}, sd=${errSummary.standardDeviation}, min=${errSummary.min}, max=${errSummary.max}")
println("memory (MB):        mean=${memSummary.mean}, sd=${memSummary.standardDeviation}, min=${memSummary.min}, max=${memSummary.max}")
println("throughput (req/s): mean=${tpSummary.mean}, sd=${tpSummary.standardDeviation}, min=${tpSummary.min}, max=${tpSummary.max}")
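`describe()` here comes from the notebook's statistics library. To sanity-check its numbers without any dependency, the same summary can be sketched with the Kotlin standard library alone (this sketch uses the sample standard deviation, with n - 1 in the denominator; the library may use a different convention):

```kotlin
import kotlin.math.sqrt

// Minimal summary statistics in plain Kotlin (no library dependency).
fun summarize(data: DoubleArray): Map<String, Double> {
    val mean = data.average()
    // Sample variance: divide by n - 1 (Bessel's correction).
    val variance = data.sumOf { (it - mean) * (it - mean) } / (data.size - 1)
    return mapOf(
        "mean" to mean,
        "sd" to sqrt(variance),
        "min" to data.minOrNull()!!,
        "max" to data.maxOrNull()!!
    )
}

println(summarize(doubleArrayOf(89.2, 95.1, 87.6, 102.3, 91.8)))
```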

Frequency distribution

val rtBins = responseTimeMs.frequencyTable(binCount = 5)
rtBins.forEach { bin ->
    // bin.range, bin.count, bin.relativeFrequency
}

val errBins = errorsPerHour.frequencyTable(binSize = 2.0)
errBins.forEach { bin ->
    // bin.range, bin.count, bin.cumulativeFrequency
}
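`frequencyTable` is also a library helper. Equal-width binning is straightforward to sketch by hand if you want to see exactly what lands in each bin (the `Bin` type and the half-open bin edges here are assumptions of this sketch; the library may choose boundaries differently):

```kotlin
data class Bin(val range: ClosedFloatingPointRange<Double>, val count: Int, val relativeFrequency: Double)

// Equal-width binning: split [min, max] into binCount intervals and count values per bin.
fun frequencyTable(data: DoubleArray, binCount: Int): List<Bin> {
    val lo = data.minOrNull()!!
    val hi = data.maxOrNull()!!
    val width = (hi - lo) / binCount
    return (0 until binCount).map { i ->
        val start = lo + i * width
        val end = if (i == binCount - 1) hi else start + width
        // Bins are half-open [start, end), except the last, which includes the maximum.
        val count = data.count { v ->
            if (i == binCount - 1) v >= start && v <= end else v >= start && v < end
        }
        Bin(start..end, count, count.toDouble() / data.size)
    }
}
```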

Step 2: Distribution Shape

responseTimeMs.skewness() // positive = right-skewed
responseTimeMs.kurtosis() // positive excess = heavier tails than Normal

errorsPerHour.skewness()
errorsPerHour.kurtosis()

memoryUsageMb.skewness()
throughputRps.skewness()
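For reference, the simple moment-based skewness and excess kurtosis can be computed by hand. Note that the library may apply a small-sample correction, so its numbers can differ slightly from this sketch:

```kotlin
import kotlin.math.pow
import kotlin.math.sqrt

// Moment-based (population) skewness: E[(x - mean)^3] / sd^3.
fun skewness(data: DoubleArray): Double {
    val n = data.size.toDouble()
    val mean = data.average()
    val sd = sqrt(data.sumOf { (it - mean).pow(2) } / n)
    return data.sumOf { (it - mean).pow(3) } / n / sd.pow(3)
}

// Excess kurtosis: E[(x - mean)^4] / sd^4 - 3 (zero for a Normal distribution).
fun excessKurtosis(data: DoubleArray): Double {
    val n = data.size.toDouble()
    val mean = data.average()
    val sd = sqrt(data.sumOf { (it - mean).pow(2) } / n)
    return data.sumOf { (it - mean).pow(4) } / n / sd.pow(4) - 3.0
}
```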

Normality tests

shapiroWilkTest(responseTimeMs).pValue
shapiroWilkTest(errorsPerHour).pValue
shapiroWilkTest(memoryUsageMb).pValue
shapiroWilkTest(throughputRps).pValue

Fit a candidate distribution

val rtFit = NormalDistribution(
    mu = responseTimeMs.mean(),
    sigma = responseTimeMs.standardDeviation()
)
kolmogorovSmirnovTest(responseTimeMs, rtFit).pValue

val memFit = NormalDistribution(
    mu = memoryUsageMb.mean(),
    sigma = memoryUsageMb.standardDeviation()
)
kolmogorovSmirnovTest(memoryUsageMb, memFit).pValue
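Under the hood, the Kolmogorov-Smirnov statistic is just the largest gap between the empirical CDF and the fitted CDF. A self-contained sketch, using the Abramowitz and Stegun erf approximation for the Normal CDF (the p-value computation the library performs is omitted here):

```kotlin
import kotlin.math.abs
import kotlin.math.exp
import kotlin.math.max
import kotlin.math.sqrt

// Standard Normal CDF via the Abramowitz-Stegun erf approximation (error < 1.5e-7).
fun normalCdf(x: Double, mu: Double, sigma: Double): Double {
    val z = (x - mu) / (sigma * sqrt(2.0))
    val t = 1.0 / (1.0 + 0.3275911 * abs(z))
    val poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
        t * (-1.453152027 + t * 1.061405429))))
    val erf = 1.0 - poly * exp(-z * z)
    return 0.5 * (1.0 + if (z >= 0) erf else -erf)
}

// One-sample KS statistic: max distance between empirical and fitted CDF.
fun ksStatistic(data: DoubleArray, mu: Double, sigma: Double): Double {
    val sorted = data.sorted()
    val n = sorted.size
    return sorted.withIndex().maxOf { (i, x) ->
        val f = normalCdf(x, mu, sigma)
        max(f - i.toDouble() / n, (i + 1.0) / n - f)
    }
}
```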

Step 3: Correlations

// Correlation matrix across all four metrics
val matrix = correlationMatrix(responseTimeMs, errorsPerHour, memoryUsageMb, throughputRps)
// matrix[i][j] gives Pearson r between metrics i and j

// Deeper look at specific relationships
val rtVsErrors = pearsonCorrelation(responseTimeMs, errorsPerHour)
rtVsErrors.coefficient // positive = errors rise with latency
rtVsErrors.pValue

val rtVsThroughput = spearmanCorrelation(responseTimeMs, throughputRps)
rtVsThroughput.coefficient // negative = latency rises when throughput drops
rtVsThroughput.pValue
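Pearson's r itself is only a few lines of Kotlin, which makes it easy to spot-check the matrix above (a sketch; the library also reports a p-value, which this does not):

```kotlin
import kotlin.math.sqrt

// Pearson correlation: covariance over the product of standard deviations.
fun pearson(x: DoubleArray, y: DoubleArray): Double {
    require(x.size == y.size) { "series must be the same length" }
    val mx = x.average()
    val my = y.average()
    val cov = x.indices.sumOf { (x[it] - mx) * (y[it] - my) }
    val sx = sqrt(x.sumOf { (it - mx) * (it - mx) })
    val sy = sqrt(y.sumOf { (it - my) * (it - my) })
    return cov / (sx * sy)
}
```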

Regression

// Model: how does error count relate to response time?
val regression = simpleLinearRegression(errorsPerHour, responseTimeMs)

regression.slope     // ms increase per additional error/hour
regression.intercept // baseline response time at zero errors
regression.rSquared  // proportion of variance explained

regression.predict(4.0) // expected latency at 4 errors/hour
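`simpleLinearRegression` fits ordinary least squares, and the closed-form solution is compact enough to write out. A sketch with a hypothetical `Fit` holder (not the library's type):

```kotlin
data class Fit(val intercept: Double, val slope: Double) {
    fun predict(x: Double) = intercept + slope * x
}

// Ordinary least squares for y = intercept + slope * x.
fun linearFit(x: DoubleArray, y: DoubleArray): Fit {
    val mx = x.average()
    val my = y.average()
    val slope = x.indices.sumOf { (x[it] - mx) * (y[it] - my) } /
        x.sumOf { (it - mx) * (it - mx) }
    return Fit(intercept = my - slope * mx, slope = slope)
}
```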

Step 4: Compare Periods

Split the data into two halves and check whether performance changed.

val firstHalfRt = responseTimeMs.sliceArray(0 until 15)
val secondHalfRt = responseTimeMs.sliceArray(15 until 30)

val periodComparison = tTest(firstHalfRt, secondHalfRt)
periodComparison.pValue
periodComparison.isSignificant()

// Non-parametric alternative
val periodRank = mannWhitneyUTest(firstHalfRt, secondHalfRt)
periodRank.pValue
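The t statistic behind `tTest` is simple to compute by hand; converting it to a p-value needs the t-distribution's CDF, so this sketch stops at the statistic. It assumes Welch's unequal-variance form, while the library may pool variances instead:

```kotlin
import kotlin.math.sqrt

// Welch's two-sample t statistic (unequal variances assumed).
fun welchT(a: DoubleArray, b: DoubleArray): Double {
    val ma = a.average()
    val mb = b.average()
    // Sample variances (n - 1 denominator).
    val va = a.sumOf { (it - ma) * (it - ma) } / (a.size - 1)
    val vb = b.sumOf { (it - mb) * (it - mb) } / (b.size - 1)
    return (ma - mb) / sqrt(va / a.size + vb / b.size)
}
```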
Compare throughput between periods:

val firstHalfTp = throughputRps.sliceArray(0 until 15)
val secondHalfTp = throughputRps.sliceArray(15 until 30)

val tpComparison = tTest(firstHalfTp, secondHalfTp)
tpComparison.pValue

Step 5: Normalize and Rank

Compare metrics on a common scale.

// Z-score: values become standard deviations from mean
val rtNormalized = responseTimeMs.zScore()
val memNormalized = memoryUsageMb.zScore()
// Both are now on the same scale and can be compared directly

// Min-max scaling to [0, 1]
val rtScaled = responseTimeMs.minMaxNormalize()
val tpScaled = throughputRps.minMaxNormalize()

// Rank the days by response time (worst days get highest rank)
val rtRanked = responseTimeMs.rank()

Z-score normalization is useful for combining metrics into a composite score: a day with high z-scores across response time, errors, and memory may warrant investigation.
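That composite-score idea can be sketched directly: z-score each series, then average the z-scores per day. The equal weighting and the `compositeScore` helper below are assumptions of this sketch; weight the metrics however suits your alerting:

```kotlin
import kotlin.math.sqrt

// Z-score a series: (value - mean) / sample standard deviation.
fun zScores(data: DoubleArray): DoubleArray {
    val mean = data.average()
    val sd = sqrt(data.sumOf { (it - mean) * (it - mean) } / (data.size - 1))
    return DoubleArray(data.size) { (data[it] - mean) / sd }
}

// Composite score: average z-score across metrics for each day.
// Flip the sign of "good when high" metrics (e.g. throughput) before combining.
fun compositeScore(vararg series: DoubleArray): DoubleArray {
    val zs = series.map { zScores(it) }
    return DoubleArray(series[0].size) { day -> zs.sumOf { it[day] } / zs.size }
}
```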
Last modified on March 22, 2026