07 - Distribution Fitting with the `TimeSeries` class¶

Target reader: someone with limited statistics background who wants to understand what probability distributions are and how to check whether real data follow a given distribution.

We will work with annual maximum river discharges (one largest flood per year) measured at several gauges on the Rhine river. Annual maxima are classic extreme values: they are rare, large, and often skewed to the right.

By the end of this notebook you will know how to:

Read and visualise a distribution with the empirical CDF.
Run four different normality tests and read their output.
Use QQ-plots and PP-plots to see whether a distribution fits.
Let statista pick the best-fitting distribution automatically and interpret the Kolmogorov-Smirnov statistic.

1. What is a probability distribution?¶

A probability distribution describes how likely each possible value of a random quantity is. Two standard tools describe it:

PDF - Probability Density Function. Think of it as the shape of the histogram in the limit of infinite data. Higher PDF means values near there appear more often. For a continuous distribution, the area under the PDF between two values is the probability of landing in that range.
CDF - Cumulative Distribution Function. CDF(x) = P(X <= x). It starts at 0 on the far left and grows to 1 on the far right. It is the cleanest way to compare distributions because there are no bin-width choices to make.

The Normal distribution as a reference¶

The Normal (Gaussian) distribution is the familiar bell curve, symmetric around its mean. Many classical statistical tests assume it. But flood peaks, rainfall, earthquakes, insurance losses... most extremes are not normal - they are skewed and have heavy right tails.

Why we need extreme-value distributions (GEV, Gumbel)¶

If you take the maximum of many random variables (for example, the largest flood in a year), extreme value theory tells us the result tends to follow a Generalised Extreme Value (GEV) distribution. The Gumbel distribution is a special case of GEV with zero shape parameter - often a good first guess for annual maxima.

In [1]:

Copied!





import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statista.time_series import TimeSeries

# Reproducibility for any resampling we might do
np.random.seed(42)

plt.rcParams['figure.figsize'] = (9, 4)
plt.rcParams['axes.grid'] = True
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statista.time_series import TimeSeries

# Reproducibility for any resampling we might do
np.random.seed(42)

plt.rcParams['figure.figsize'] = (9, 4)
plt.rcParams['axes.grid'] = True

2. Load the annual-maximum flood data¶

ams-gauges.csv contains the annual maximum discharge (m3/s) for six gauges on the Rhine from 1951 to 2004. A value of -9 in the Frankfurt column is a missing-value flag - we replace it with NaN.

In [2]:

Copied!





DATA_PATH = '../../../examples/data/ams-gauges.csv'

df = pd.read_csv(DATA_PATH)
df = df.replace(-9, np.nan)
df = df.set_index('date')

print('Shape    :', df.shape)
print('Columns  :', list(df.columns))
print('Years    :', df.index.min(), '->', df.index.max())
df.head()
DATA_PATH = '../../../examples/data/ams-gauges.csv'

df = pd.read_csv(DATA_PATH)
df = df.replace(-9, np.nan)
df = df.set_index('date')

print('Shape    :', df.shape)
print('Columns  :', list(df.columns))
print('Years    :', df.index.min(), '->', df.index.max())
df.head()

Shape    : (54, 6)
Columns  : ['Frankfurt', 'Mainz', 'Kaub', 'Andernach', 'Cologne', 'Rees']
Years    : 1951 -> 2004

Out[2]:

	Frankfurt	Mainz	Kaub	Andernach	Cologne	Rees
date
1951	NaN	4250	4480	6080	6490	6830
1952	NaN	4490	4610	6970	7110	7340
1953	NaN	4270	4380	7300	7610	7970
1954	NaN	2850	2910	3440	3620	3840
1955	NaN	5940	6050	9460	9460	9500

In [3]:

Copied!





# Wrap the columns we want to analyse in a TimeSeries object.
# We keep four complete-record gauges and drop any remaining NaNs.
cols = ['Mainz', 'Kaub', 'Cologne', 'Rees']
ts = TimeSeries(df[cols].dropna())
print(type(ts).__name__, ts.shape)
ts.head()
# Wrap the columns we want to analyse in a TimeSeries object.
# We keep four complete-record gauges and drop any remaining NaNs.
cols = ['Mainz', 'Kaub', 'Cologne', 'Rees']
ts = TimeSeries(df[cols].dropna())
print(type(ts).__name__, ts.shape)
ts.head()

TimeSeries (54, 4)

Out[3]:

	Mainz	Kaub	Cologne	Rees
date
1951	4250	4480	6490	6830
1952	4490	4610	7110	7340
1953	4270	4380	7610	7970
1954	2850	2910	3620	3840
1955	5940	6050	9460	9500

3. First look: the empirical CDF¶

The empirical CDF (ECDF) is the simplest honest description of a sample:

Sort the values from smallest to largest.
For each value x(i), plot the fraction of observations that are <= x(i).

It is a step function - no smoothing, no bandwidth, no assumptions. If two gauges have different ECDFs at the same x, they have different distributions.

In [4]:

Copied!

fig, ax = ts.empirical_cdf(title='ECDF - Annual Maximum Discharge (m3/s)',
                           xlabel='Discharge (m3/s)')
fig, ax = ts.empirical_cdf(title='ECDF - Annual Maximum Discharge (m3/s)',
                           xlabel='Discharge (m3/s)')

No description has been provided for this image

The ECDF clearly shows that downstream gauges (Cologne, Rees) see larger floods than upstream gauges (Mainz, Kaub): their curves are shifted to the right.

4. Is the data normal? `normality_test`¶

Before doing anything that assumes a Normal distribution (confidence intervals, t-tests, linear regression inference), we should check whether the assumption is even reasonable.

The null hypothesis H0¶

Every normality test shares the same null hypothesis:

H0: the data are drawn from a Normal distribution.

The test returns a p-value:

p > alpha (e.g. 0.05) -> we fail to reject H0. The data are compatible with a Normal. (Not the same as 'proven normal'.)
p <= alpha -> we reject H0. Strong evidence the data are not Normal.

Which test?¶

Test	Best used when	Sensitive to
Shapiro-Wilk	Small samples, n < 5000	General departures, esp. tails
D'Agostino-Pearson	Larger samples, n >= 20	Skewness + kurtosis
Anderson-Darling	Any n; gives critical values	Heavy tails
Jarque-Bera	Large n	Skewness + kurtosis (asymptotic)

Using several tests and checking agreement is a good habit - they weigh different kinds of non-normality.

In [5]:

Copied!

# Shapiro-Wilk (auto-selected for n < 5000)
shapiro = ts.normality_test(method='shapiro')
shapiro
# Shapiro-Wilk (auto-selected for n < 5000)
shapiro = ts.normality_test(method='shapiro')
shapiro

Out[5]:

	test_name	statistic	p_value	is_normal	conclusion
column
Mainz	Shapiro-Wilk	0.994159	0.995826	True	Normal
Kaub	Shapiro-Wilk	0.995695	0.999502	True	Normal
Cologne	Shapiro-Wilk	0.987188	0.830227	True	Normal
Rees	Shapiro-Wilk	0.990059	0.932994	True	Normal

In [6]:

Copied!

# D'Agostino-Pearson
dagostino = ts.normality_test(method='dagostino')
dagostino
# D'Agostino-Pearson
dagostino = ts.normality_test(method='dagostino')
dagostino

Out[6]:

	test_name	statistic	p_value	is_normal	conclusion
column
Mainz	D'Agostino-Pearson	0.039744	0.980324	True	Normal
Kaub	D'Agostino-Pearson	0.015587	0.992237	True	Normal
Cologne	D'Agostino-Pearson	0.112740	0.945189	True	Normal
Rees	D'Agostino-Pearson	0.038826	0.980774	True	Normal

In [7]:

Copied!

# Anderson-Darling
anderson = ts.normality_test(method='anderson')
anderson
# Anderson-Darling
anderson = ts.normality_test(method='anderson')
anderson

Out[7]:

	test_name	statistic	p_value	is_normal	conclusion
column
Mainz	Anderson-Darling	0.134070	0.15	True	Normal
Kaub	Anderson-Darling	0.102931	0.15	True	Normal
Cologne	Anderson-Darling	0.235421	0.15	True	Normal
Rees	Anderson-Darling	0.196575	0.15	True	Normal

In [8]:

Copied!

# Jarque-Bera
jb = ts.normality_test(method='jarque_bera')
jb
# Jarque-Bera
jb = ts.normality_test(method='jarque_bera')
jb

Out[8]:

	test_name	statistic	p_value	is_normal	conclusion
column
Mainz	Jarque-Bera	0.094473	0.953862	True	Normal
Kaub	Jarque-Bera	0.112326	0.945385	True	Normal
Cologne	Jarque-Bera	0.259025	0.878524	True	Normal
Rees	Jarque-Bera	0.200392	0.904660	True	Normal

Annual maximum discharges tend to be right-skewed (a long upper tail of very large events), so p-values for Shapiro-Wilk and D'Agostino-Pearson are usually small - rejecting the Normal hypothesis. That is expected for extremes and is exactly why we use GEV / Gumbel instead of Normal for flood frequency analysis.

5. Seeing the mis-fit: the QQ plot¶

A Quantile-Quantile (QQ) plot is the most useful visual tool for checking a distribution.

The idea¶

Sort the data from smallest to largest.
For each data point, compute the theoretical quantile - the value you would expect at that rank if the data truly came from the assumed distribution.
Plot theoretical quantiles (x-axis) vs sample quantiles (y-axis).

How to read it¶

Points on the 1:1 line -> perfect fit.
Points curve above the line on the right -> data has a heavier right tail than the reference (common for floods vs Normal).
Points curve below on the left -> data has a lighter left tail.
S-shape -> fatter tails on both sides.

In [9]:

Copied!

# QQ plot of Cologne annual maxima vs Normal distribution
fig, ax = ts.qq_plot(distribution='norm', column='Cologne',
                     title='QQ Plot - Cologne vs Normal')
# QQ plot of Cologne annual maxima vs Normal distribution
fig, ax = ts.qq_plot(distribution='norm', column='Cologne',
                     title='QQ Plot - Cologne vs Normal')

In [10]:

Copied!

# Now against a Gumbel (extreme-value) distribution
fig, ax = ts.qq_plot(distribution='gumbel_r', column='Cologne',
                     title='QQ Plot - Cologne vs Gumbel')
# Now against a Gumbel (extreme-value) distribution
fig, ax = ts.qq_plot(distribution='gumbel_r', column='Cologne',
                     title='QQ Plot - Cologne vs Gumbel')

Compare the two plots: for flood maxima, the Gumbel QQ-plot usually hugs the 1:1 line much more closely than the Normal QQ-plot, especially in the upper tail. That is visual evidence for using extreme-value distributions.

6. The PP plot - complement to QQ¶

The Probability-Probability (PP) plot compares the empirical CDF against the theoretical CDF at each observation. Both axes run from 0 to 1.

QQ emphasises the tails (the quantile axis spans the whole real line).
PP emphasises the centre (probabilities are squashed into [0,1]).

They are complementary diagnostics - look at both.

In [11]:

Copied!

fig, ax = ts.pp_plot(distribution='gumbel_r', column='Cologne',
                     title='PP Plot - Cologne vs Gumbel')
fig, ax = ts.pp_plot(distribution='gumbel_r', column='Cologne',
                     title='PP Plot - Cologne vs Gumbel')

7. Automatic best fit: `fit_distributions`¶

Rather than trying distributions one by one, we can fit all four distributions that statista supports (GEV, Gumbel, Exponential, Normal) and pick the winner.

The Kolmogorov-Smirnov (KS) test¶

fit_distributions returns a KS statistic and a KS p-value per column.

The KS statistic is the largest vertical gap between the empirical CDF and the fitted theoretical CDF. Smaller is better (closer fit).
The KS p-value tests the null 'the data came from this distribution'. A large p-value (> 0.05, say) means the data are compatible with the fit - we have no evidence to reject the distribution.
Note: with KS, a big p-value is good news for the fit, unlike in most other hypothesis tests where you want it small.

L-moments estimation¶

We use method='lmoments' - these are robust linear combinations of order statistics that work especially well for small extreme-value samples.

In [12]:

Copied!

best_fit = ts.fit_distributions(method='lmoments')
best_fit
best_fit = ts.fit_distributions(method='lmoments')
best_fit

-----KS Test--------
Statistic = 0.05555555555555555
Accept Hypothesis
P value = 0.9999984404687655
-----KS Test--------
Statistic = 0.12962962962962962
Accept Hypothesis
P value = 0.7597078706434061
-----KS Test--------
Statistic = 0.18518518518518517
reject Hypothesis
P value = 0.3148816042088447
-----KS Test--------
Statistic = 0.05555555555555555
Accept Hypothesis
P value = 0.9999984404687655
-----KS Test--------
Statistic = 0.05555555555555555
Accept Hypothesis
P value = 0.9999984404687655
-----KS Test--------
Statistic = 0.1111111111111111
Accept Hypothesis
P value = 0.897060064079704
-----KS Test--------
Statistic = 0.18518518518518517
reject Hypothesis
P value = 0.3148816042088447
-----KS Test--------
Statistic = 0.05555555555555555
Accept Hypothesis
P value = 0.9999984404687655
-----KS Test--------
Statistic = 0.07407407407407407
Accept Hypothesis
P value = 0.9987375782247235
-----KS Test--------
Statistic = 0.1111111111111111
Accept Hypothesis
P value = 0.897060064079704
-----KS Test--------
Statistic = 0.18518518518518517
reject Hypothesis
P value = 0.3148816042088447
-----KS Test--------
Statistic = 0.07407407407407407
Accept Hypothesis
P value = 0.9987375782247235
-----KS Test--------
Statistic = 0.07407407407407407
Accept Hypothesis
P value = 0.9987375782247235
-----KS Test--------
Statistic = 0.1111111111111111
Accept Hypothesis
P value = 0.897060064079704
-----KS Test--------
Statistic = 0.18518518518518517
reject Hypothesis
P value = 0.3148816042088447
-----KS Test--------
Statistic = 0.05555555555555555
Accept Hypothesis
P value = 0.9999984404687655

Out[12]:

	best_distribution	loc	scale	shape	ks_statistic	ks_p_value
column
Mainz	GEV	3743.806013	1214.617042	0.307295	0.055556	0.999998
Kaub	GEV	3881.573477	1262.426086	0.282580	0.055556	0.999998
Cologne	GEV	5783.017454	2090.224037	0.306146	0.074074	0.998738
Rees	Normal	6701.425926	2114.794456	NaN	0.055556	0.999998

In [13]:

Copied!





# Pretty summary
for col in ts.columns:
    row = best_fit.loc[col]
    print(f"{col:10s}  best={row['best_distribution']:12s}  "
          f"loc={row['loc']:.1f}  scale={row['scale']:.1f}  "
          f"KS stat={row['ks_statistic']:.3f}  KS p={row['ks_p_value']:.3f}")
# Pretty summary
for col in ts.columns:
    row = best_fit.loc[col]
    print(f"{col:10s}  best={row['best_distribution']:12s}  "
          f"loc={row['loc']:.1f}  scale={row['scale']:.1f}  "
          f"KS stat={row['ks_statistic']:.3f}  KS p={row['ks_p_value']:.3f}")

Mainz       best=GEV           loc=3743.8  scale=1214.6  KS stat=0.056  KS p=1.000
Kaub        best=GEV           loc=3881.6  scale=1262.4  KS stat=0.056  KS p=1.000
Cologne     best=GEV           loc=5783.0  scale=2090.2  KS stat=0.074  KS p=0.999
Rees        best=Normal        loc=6701.4  scale=2114.8  KS stat=0.056  KS p=1.000

8. Summary¶

A probability distribution is described by its PDF (shape) and CDF (cumulative probability).
Extreme values (annual flood maxima) are typically not Normal - they are right-skewed. Use GEV or Gumbel instead.
normality_test runs Shapiro-Wilk, D'Agostino-Pearson, Anderson-Darling, or Jarque-Bera. In all cases: small p-value -> reject normality.
qq_plot and pp_plot are visual fit checks. Straight line = good fit.
empirical_cdf is the simplest distribution view - no bandwidth needed.
fit_distributions picks the best of GEV, Gumbel, Exponential and Normal by KS test: smaller KS statistic / larger KS p-value -> better fit.

07 - Distribution Fitting with the TimeSeries class¶