03 — Missing Data and Outliers with TimeSeries¶
Real-world hydrological and climate records almost always have gaps (sensor failure, power outages, flood damage) and outliers (equipment glitches, true extremes). This notebook shows how the MissingData mixin of statista.time_series.TimeSeries helps you diagnose and visualise both.
We use annual-maximum gauge data from the Rhine — six stations, 54 years — where missing values are coded as -9.
1. Why missing data matters¶
Ignoring missing data can silently bias every downstream analysis:
- Biased means and trends. If the missing years are systematically wet (or dry), the computed mean is wrong.
- Wrong return periods. Flood frequency analysis assumes each year is represented.
- Broken periodicities. Seasonal decomposition fails if a whole month is missing.
A gap diagnosis is the first thing a careful analyst runs.
2. Load the data and convert the -9 sentinel to NaN¶
The raw file uses -9 as a missing-value marker. Pandas treats -9 as a real number unless we tell it otherwise, so we replace with NaN immediately. We also promote the date column (years) into a proper DatetimeIndex.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statista.time_series import TimeSeries
df = pd.read_csv('../../../examples/data/ams-gauges.csv')
print('Raw shape:', df.shape)
df = df.replace(-9, np.nan)
df['date'] = pd.to_datetime(df['date'], format='%Y')
df = df.set_index('date')
ts = TimeSeries(df)
ts.head()
Raw shape: (54, 7)
| Frankfurt | Mainz | Kaub | Andernach | Cologne | Rees | |
|---|---|---|---|---|---|---|
| date | ||||||
| 1951-01-01 | NaN | 4250 | 4480 | 6080 | 6490 | 6830 |
| 1952-01-01 | NaN | 4490 | 4610 | 6970 | 7110 | 7340 |
| 1953-01-01 | NaN | 4270 | 4380 | 7300 | 7610 | 7970 |
| 1954-01-01 | NaN | 2850 | 2910 | 3440 | 3620 | 3840 |
| 1955-01-01 | NaN | 5940 | 6050 | 9460 | 9460 | 9500 |
3. .missing_summary() — per-column diagnostics¶
This method returns a one-row-per-column table with:
total_count/valid_count/missing_count— absolute numbers.missing_pct— missing as a % of total.longest_gap— length of the biggest contiguous stretch of NaNs.n_gaps— number of separate gap episodes.mean_gap_length— average length of a gap episode.first_valid/last_valid— the useful range of the record.
summary = ts.missing_summary()
print(summary)
total_count missing_count missing_pct valid_count longest_gap \
Frankfurt 54 13 24.074074 41 13
Mainz 54 0 0.000000 54 0
Kaub 54 0 0.000000 54 0
Andernach 54 0 0.000000 54 0
Cologne 54 0 0.000000 54 0
Rees 54 0 0.000000 54 0
n_gaps mean_gap_length first_valid last_valid
Frankfurt 1 13.0 1964-01-01 2004-01-01
Mainz 0 0.0 1951-01-01 2004-01-01
Kaub 0 0.0 1951-01-01 2004-01-01
Andernach 0 0.0 1951-01-01 2004-01-01
Cologne 0 0.0 1951-01-01 2004-01-01
Rees 0 0.0 1951-01-01 2004-01-01
What this means for your data. Frankfurt is missing 13 years — an entire block at the start of the record (1951-1963). All other stations are complete. If you are comparing gauges you should either trim to the common period (1964 onward) or explicitly handle the asymmetry.
4. .gap_analysis() — locate every gap¶
Sometimes you want a list of every gap, sorted longest first, so you can see whether the missing data is a single long block or many small holes. The returned frame has one row per gap with column, gap_start, gap_end, and gap_length.
gaps = ts.gap_analysis()
print(gaps)
column gap_start gap_end gap_length 0 Frankfurt 1951-01-01 1963-01-01 13
What this means for your data. Only one gap in the whole dataset — a single 13-year block in Frankfurt (1951-1963). This is the easiest kind of missing-data problem to handle because it is a clean prefix we can simply clip off.
5. .completeness_report(freq='YE') — yearly completeness¶
This method groups the data by a pandas frequency alias ('YE' = year-end, 'ME' = month-end, 'QE' = quarter-end, ...) and returns the percentage of non-missing values in each period. Because our data is already annual, each year is either 0% or 100% complete, but on daily data this method can quickly tell you which years are partial.
completeness = ts.completeness_report(freq='YE')
print('Shape:', completeness.shape)
print(completeness.head(15))
Shape: (54, 6)
Frankfurt Mainz Kaub Andernach Cologne Rees
date
1951-12-31 0.0 100.0 100.0 100.0 100.0 100.0
1952-12-31 0.0 100.0 100.0 100.0 100.0 100.0
1953-12-31 0.0 100.0 100.0 100.0 100.0 100.0
1954-12-31 0.0 100.0 100.0 100.0 100.0 100.0
1955-12-31 0.0 100.0 100.0 100.0 100.0 100.0
1956-12-31 0.0 100.0 100.0 100.0 100.0 100.0
1957-12-31 0.0 100.0 100.0 100.0 100.0 100.0
1958-12-31 0.0 100.0 100.0 100.0 100.0 100.0
1959-12-31 0.0 100.0 100.0 100.0 100.0 100.0
1960-12-31 0.0 100.0 100.0 100.0 100.0 100.0
1961-12-31 0.0 100.0 100.0 100.0 100.0 100.0
1962-12-31 0.0 100.0 100.0 100.0 100.0 100.0
1963-12-31 0.0 100.0 100.0 100.0 100.0 100.0
1964-12-31 100.0 100.0 100.0 100.0 100.0 100.0
1965-12-31 100.0 100.0 100.0 100.0 100.0 100.0
6. What is an outlier?¶
An outlier is an observation that is unusually far from the rest of the data. Outliers can be:
- Errors (instrument spikes, mis-typed values) → remove or correct.
- Real extremes (a true flood) → keep; in hydrology these are the most important observations.
The tricky part is deciding which kind you have. The tools below flag candidates; you still need domain knowledge to classify them.
Three common methods¶
| Method | Rule | Pros | Cons |
|--------|------|------|------|
| IQR | flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] | Distribution-free, robust | Can miss outliers when the data is very skewed |
| Z-score | flag points with |z| > 3 where z = (x - mean)/std | Simple, well known | Assumes normality; single extremes inflate std, masking themselves |
| Modified Z-score | flag points with 0.6745*|x - median| / MAD > 3.5 | Robust (uses median and MAD) | Still assumes a roughly symmetric distribution |
The modified Z-score (Iglewicz & Hoaglin, 1993) is the recommended default for many workflows because it does not get fooled by the outliers themselves.
7. .detect_outliers(method='iqr') — the classic recipe¶
outliers_iqr = ts.detect_outliers(method='iqr', threshold=1.5)
print('Outliers per column (IQR):')
print(outliers_iqr.sum())
Outliers per column (IQR): Frankfurt 0 Mainz 0 Kaub 1 Andernach 1 Cologne 1 Rees 0 dtype: int64
What this means for your data. The IQR rule flags only 1 outlier each in Kaub, Andernach and Cologne. This is a fairly conservative rule and tends to be forgiving with heavy-tailed hydrological data.
8. .detect_outliers(method='modified_zscore') — robust¶
The modified Z-score uses the median and MAD (from the first notebook). The default threshold is 3.5: any observation whose scaled distance from the median exceeds 3.5 is flagged.
outliers_mz = ts.detect_outliers(method='modified_zscore', threshold=3.5)
print('Outliers per column (modified Z-score):')
print(outliers_mz.sum())
Outliers per column (modified Z-score): Frankfurt 0 Mainz 0 Kaub 0 Andernach 0 Cologne 0 Rees 0 dtype: int64
What this means for your data. The modified Z-score flags many more points than IQR. For right-skewed hydrological data this is expected: the robust rule picks up all the real flood peaks. Whether these are 'errors' to remove depends on context — in flood frequency analysis they are the signal, not the noise.
9. .outlier_plot() — see them on the timeline¶
Finally, visualising outliers on the actual time series tells you whether they cluster in time (suspicious) or are scattered (more likely real extremes). Red markers show flagged points, dashed orange lines show the IQR bounds.
fig, ax = ts.outlier_plot(
method='iqr',
threshold=1.5,
column='Cologne',
title='Cologne annual maxima — IQR outliers',
ylabel='Annual max flow',
)
We can also compare with the modified Z-score on the same gauge:
fig, ax = ts.outlier_plot(
method='modified_zscore',
threshold=3.5,
column='Cologne',
title='Cologne annual maxima — modified Z-score outliers',
ylabel='Annual max flow',
)
10. Summary¶
You have seen the complete missing-data and outlier workflow of TimeSeries:
| Step | Method | What it does |
|---|---|---|
| 1 | missing_summary() |
Per-column missing-value diagnostics |
| 2 | gap_analysis() |
Lists every contiguous gap |
| 3 | completeness_report(freq=...) |
% complete per time period |
| 4 | detect_outliers(method='iqr') |
Classic, conservative outlier flag |
| 5 | detect_outliers(method='modified_zscore') |
Robust outlier flag using MAD |
| 6 | outlier_plot(...) |
Visual inspection on the time axis |
Rules of thumb.
- Always replace sentinel codes like
-9or-999withNaNbefore any analysis. - Use IQR for quick screening; use modified Z-score for robust detection.
- Never delete an outlier without first checking whether it is a real extreme value.
Reference. Iglewicz, B. & Hoaglin, D. C. (1993). How to detect and handle outliers. ASQC Quality Press.