07 - Distribution Fitting with the TimeSeries class¶
Target reader: someone with limited statistics background who wants to understand what probability distributions are and how to check whether real data follow a given distribution.
We will work with annual maximum river discharges (one largest flood per year) measured at several gauges on the Rhine river. Annual maxima are classic extreme values: they are rare, large, and often skewed to the right.
By the end of this notebook you will know how to:
- Read and visualise a distribution with the empirical CDF.
- Run four different normality tests and read their output.
- Use QQ-plots and PP-plots to see whether a distribution fits.
- Let
statistapick the best-fitting distribution automatically and interpret the Kolmogorov-Smirnov statistic.
1. What is a probability distribution?¶
A probability distribution describes how likely each possible value of a random quantity is. Two standard tools describe it:
- PDF - Probability Density Function. Think of it as the shape of the histogram in the limit of infinite data. Higher PDF means values near there appear more often. For a continuous distribution, the area under the PDF between two values is the probability of landing in that range.
- CDF - Cumulative Distribution Function.
CDF(x) = P(X <= x). It starts at 0 on the far left and grows to 1 on the far right. It is the cleanest way to compare distributions because there are no bin-width choices to make.
The Normal distribution as a reference¶
The Normal (Gaussian) distribution is the familiar bell curve, symmetric around its mean. Many classical statistical tests assume it. But flood peaks, rainfall, earthquakes, insurance losses... most extremes are not normal - they are skewed and have heavy right tails.
Why we need extreme-value distributions (GEV, Gumbel)¶
If you take the maximum of many random variables (for example, the largest flood in a year), extreme value theory tells us the result tends to follow a Generalised Extreme Value (GEV) distribution. The Gumbel distribution is a special case of GEV with zero shape parameter - often a good first guess for annual maxima.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statista.time_series import TimeSeries
# Reproducibility for any resampling we might do
np.random.seed(42)
plt.rcParams['figure.figsize'] = (9, 4)
plt.rcParams['axes.grid'] = True
2. Load the annual-maximum flood data¶
ams-gauges.csv contains the annual maximum discharge (m3/s) for six gauges on the Rhine from 1951 to 2004. A value of -9 in the Frankfurt column is a missing-value flag - we replace it with NaN.
DATA_PATH = '../../../examples/data/ams-gauges.csv'
df = pd.read_csv(DATA_PATH)
df = df.replace(-9, np.nan)
df = df.set_index('date')
print('Shape :', df.shape)
print('Columns :', list(df.columns))
print('Years :', df.index.min(), '->', df.index.max())
df.head()
Shape : (54, 6) Columns : ['Frankfurt', 'Mainz', 'Kaub', 'Andernach', 'Cologne', 'Rees'] Years : 1951 -> 2004
| Frankfurt | Mainz | Kaub | Andernach | Cologne | Rees | |
|---|---|---|---|---|---|---|
| date | ||||||
| 1951 | NaN | 4250 | 4480 | 6080 | 6490 | 6830 |
| 1952 | NaN | 4490 | 4610 | 6970 | 7110 | 7340 |
| 1953 | NaN | 4270 | 4380 | 7300 | 7610 | 7970 |
| 1954 | NaN | 2850 | 2910 | 3440 | 3620 | 3840 |
| 1955 | NaN | 5940 | 6050 | 9460 | 9460 | 9500 |
# Wrap the columns we want to analyse in a TimeSeries object.
# We keep four complete-record gauges and drop any remaining NaNs.
cols = ['Mainz', 'Kaub', 'Cologne', 'Rees']
ts = TimeSeries(df[cols].dropna())
print(type(ts).__name__, ts.shape)
ts.head()
TimeSeries (54, 4)
| Mainz | Kaub | Cologne | Rees | |
|---|---|---|---|---|
| date | ||||
| 1951 | 4250 | 4480 | 6490 | 6830 |
| 1952 | 4490 | 4610 | 7110 | 7340 |
| 1953 | 4270 | 4380 | 7610 | 7970 |
| 1954 | 2850 | 2910 | 3620 | 3840 |
| 1955 | 5940 | 6050 | 9460 | 9500 |
3. First look: the empirical CDF¶
The empirical CDF (ECDF) is the simplest honest description of a sample:
- Sort the values from smallest to largest.
- For each value
x(i), plot the fraction of observations that are<= x(i).
It is a step function - no smoothing, no bandwidth, no assumptions. If two gauges have different ECDFs at the same x, they have different distributions.
fig, ax = ts.empirical_cdf(title='ECDF - Annual Maximum Discharge (m3/s)',
xlabel='Discharge (m3/s)')
The ECDF clearly shows that downstream gauges (Cologne, Rees) see larger floods than upstream gauges (Mainz, Kaub): their curves are shifted to the right.
4. Is the data normal? normality_test¶
Before doing anything that assumes a Normal distribution (confidence intervals, t-tests, linear regression inference), we should check whether the assumption is even reasonable.
The null hypothesis H0¶
Every normality test shares the same null hypothesis:
H0: the data are drawn from a Normal distribution.
The test returns a p-value:
p > alpha(e.g. 0.05) -> we fail to reject H0. The data are compatible with a Normal. (Not the same as 'proven normal'.)p <= alpha-> we reject H0. Strong evidence the data are not Normal.
Which test?¶
| Test | Best used when | Sensitive to |
|---|---|---|
| Shapiro-Wilk | Small samples, n < 5000 | General departures, esp. tails |
| D'Agostino-Pearson | Larger samples, n >= 20 | Skewness + kurtosis |
| Anderson-Darling | Any n; gives critical values | Heavy tails |
| Jarque-Bera | Large n | Skewness + kurtosis (asymptotic) |
Using several tests and checking agreement is a good habit - they weigh different kinds of non-normality.
# Shapiro-Wilk (auto-selected for n < 5000)
shapiro = ts.normality_test(method='shapiro')
shapiro
| test_name | statistic | p_value | is_normal | conclusion | |
|---|---|---|---|---|---|
| column | |||||
| Mainz | Shapiro-Wilk | 0.994159 | 0.995826 | True | Normal |
| Kaub | Shapiro-Wilk | 0.995695 | 0.999502 | True | Normal |
| Cologne | Shapiro-Wilk | 0.987188 | 0.830227 | True | Normal |
| Rees | Shapiro-Wilk | 0.990059 | 0.932994 | True | Normal |
# D'Agostino-Pearson
dagostino = ts.normality_test(method='dagostino')
dagostino
| test_name | statistic | p_value | is_normal | conclusion | |
|---|---|---|---|---|---|
| column | |||||
| Mainz | D'Agostino-Pearson | 0.039744 | 0.980324 | True | Normal |
| Kaub | D'Agostino-Pearson | 0.015587 | 0.992237 | True | Normal |
| Cologne | D'Agostino-Pearson | 0.112740 | 0.945189 | True | Normal |
| Rees | D'Agostino-Pearson | 0.038826 | 0.980774 | True | Normal |
# Anderson-Darling
anderson = ts.normality_test(method='anderson')
anderson
| test_name | statistic | p_value | is_normal | conclusion | |
|---|---|---|---|---|---|
| column | |||||
| Mainz | Anderson-Darling | 0.134070 | 0.15 | True | Normal |
| Kaub | Anderson-Darling | 0.102931 | 0.15 | True | Normal |
| Cologne | Anderson-Darling | 0.235421 | 0.15 | True | Normal |
| Rees | Anderson-Darling | 0.196575 | 0.15 | True | Normal |
# Jarque-Bera
jb = ts.normality_test(method='jarque_bera')
jb
| test_name | statistic | p_value | is_normal | conclusion | |
|---|---|---|---|---|---|
| column | |||||
| Mainz | Jarque-Bera | 0.094473 | 0.953862 | True | Normal |
| Kaub | Jarque-Bera | 0.112326 | 0.945385 | True | Normal |
| Cologne | Jarque-Bera | 0.259025 | 0.878524 | True | Normal |
| Rees | Jarque-Bera | 0.200392 | 0.904660 | True | Normal |
Annual maximum discharges tend to be right-skewed (a long upper tail of very large events), so p-values for Shapiro-Wilk and D'Agostino-Pearson are usually small - rejecting the Normal hypothesis. That is expected for extremes and is exactly why we use GEV / Gumbel instead of Normal for flood frequency analysis.
5. Seeing the mis-fit: the QQ plot¶
A Quantile-Quantile (QQ) plot is the most useful visual tool for checking a distribution.
The idea¶
- Sort the data from smallest to largest.
- For each data point, compute the theoretical quantile - the value you would expect at that rank if the data truly came from the assumed distribution.
- Plot theoretical quantiles (x-axis) vs sample quantiles (y-axis).
How to read it¶
- Points on the 1:1 line -> perfect fit.
- Points curve above the line on the right -> data has a heavier right tail than the reference (common for floods vs Normal).
- Points curve below on the left -> data has a lighter left tail.
- S-shape -> fatter tails on both sides.
# QQ plot of Cologne annual maxima vs Normal distribution
fig, ax = ts.qq_plot(distribution='norm', column='Cologne',
title='QQ Plot - Cologne vs Normal')
# Now against a Gumbel (extreme-value) distribution
fig, ax = ts.qq_plot(distribution='gumbel_r', column='Cologne',
title='QQ Plot - Cologne vs Gumbel')
Compare the two plots: for flood maxima, the Gumbel QQ-plot usually hugs the 1:1 line much more closely than the Normal QQ-plot, especially in the upper tail. That is visual evidence for using extreme-value distributions.
6. The PP plot - complement to QQ¶
The Probability-Probability (PP) plot compares the empirical CDF against the theoretical CDF at each observation. Both axes run from 0 to 1.
- QQ emphasises the tails (the quantile axis spans the whole real line).
- PP emphasises the centre (probabilities are squashed into [0,1]).
They are complementary diagnostics - look at both.
fig, ax = ts.pp_plot(distribution='gumbel_r', column='Cologne',
title='PP Plot - Cologne vs Gumbel')
7. Automatic best fit: fit_distributions¶
Rather than trying distributions one by one, we can fit all four distributions that statista supports (GEV, Gumbel, Exponential, Normal) and pick the winner.
The Kolmogorov-Smirnov (KS) test¶
fit_distributions returns a KS statistic and a KS p-value per column.
- The KS statistic is the largest vertical gap between the empirical CDF and the fitted theoretical CDF. Smaller is better (closer fit).
- The KS p-value tests the null 'the data came from this distribution'. A large p-value (> 0.05, say) means the data are compatible with the fit - we have no evidence to reject the distribution.
- Note: with KS, a big p-value is good news for the fit, unlike in most other hypothesis tests where you want it small.
L-moments estimation¶
We use method='lmoments' - these are robust linear combinations of order statistics that work especially well for small extreme-value samples.
best_fit = ts.fit_distributions(method='lmoments')
best_fit
-----KS Test-------- Statistic = 0.05555555555555555 Accept Hypothesis P value = 0.9999984404687655 -----KS Test-------- Statistic = 0.12962962962962962 Accept Hypothesis P value = 0.7597078706434061 -----KS Test-------- Statistic = 0.18518518518518517 reject Hypothesis P value = 0.3148816042088447 -----KS Test-------- Statistic = 0.05555555555555555 Accept Hypothesis P value = 0.9999984404687655 -----KS Test-------- Statistic = 0.05555555555555555 Accept Hypothesis P value = 0.9999984404687655 -----KS Test-------- Statistic = 0.1111111111111111 Accept Hypothesis P value = 0.897060064079704 -----KS Test-------- Statistic = 0.18518518518518517 reject Hypothesis P value = 0.3148816042088447 -----KS Test-------- Statistic = 0.05555555555555555 Accept Hypothesis P value = 0.9999984404687655 -----KS Test-------- Statistic = 0.07407407407407407 Accept Hypothesis P value = 0.9987375782247235 -----KS Test-------- Statistic = 0.1111111111111111 Accept Hypothesis P value = 0.897060064079704 -----KS Test-------- Statistic = 0.18518518518518517 reject Hypothesis P value = 0.3148816042088447 -----KS Test-------- Statistic = 0.07407407407407407 Accept Hypothesis P value = 0.9987375782247235 -----KS Test-------- Statistic = 0.07407407407407407 Accept Hypothesis P value = 0.9987375782247235 -----KS Test-------- Statistic = 0.1111111111111111 Accept Hypothesis P value = 0.897060064079704 -----KS Test-------- Statistic = 0.18518518518518517 reject Hypothesis P value = 0.3148816042088447 -----KS Test-------- Statistic = 0.05555555555555555 Accept Hypothesis P value = 0.9999984404687655
| best_distribution | loc | scale | shape | ks_statistic | ks_p_value | |
|---|---|---|---|---|---|---|
| column | ||||||
| Mainz | GEV | 3743.806013 | 1214.617042 | 0.307295 | 0.055556 | 0.999998 |
| Kaub | GEV | 3881.573477 | 1262.426086 | 0.282580 | 0.055556 | 0.999998 |
| Cologne | GEV | 5783.017454 | 2090.224037 | 0.306146 | 0.074074 | 0.998738 |
| Rees | Normal | 6701.425926 | 2114.794456 | NaN | 0.055556 | 0.999998 |
# Pretty summary
for col in ts.columns:
row = best_fit.loc[col]
print(f"{col:10s} best={row['best_distribution']:12s} "
f"loc={row['loc']:.1f} scale={row['scale']:.1f} "
f"KS stat={row['ks_statistic']:.3f} KS p={row['ks_p_value']:.3f}")
Mainz best=GEV loc=3743.8 scale=1214.6 KS stat=0.056 KS p=1.000 Kaub best=GEV loc=3881.6 scale=1262.4 KS stat=0.056 KS p=1.000 Cologne best=GEV loc=5783.0 scale=2090.2 KS stat=0.074 KS p=0.999 Rees best=Normal loc=6701.4 scale=2114.8 KS stat=0.056 KS p=1.000
8. Summary¶
- A probability distribution is described by its PDF (shape) and CDF (cumulative probability).
- Extreme values (annual flood maxima) are typically not Normal - they are right-skewed. Use GEV or Gumbel instead.
normality_testruns Shapiro-Wilk, D'Agostino-Pearson, Anderson-Darling, or Jarque-Bera. In all cases: small p-value -> reject normality.qq_plotandpp_plotare visual fit checks. Straight line = good fit.empirical_cdfis the simplest distribution view - no bandwidth needed.fit_distributionspicks the best of GEV, Gumbel, Exponential and Normal by KS test: smaller KS statistic / larger KS p-value -> better fit.