03 — Missing Data and Outliers with `TimeSeries`¶

Real-world hydrological and climate records almost always have gaps (sensor failure, power outages, flood damage) and outliers (equipment glitches, true extremes). This notebook shows how the MissingData mixin of statista.time_series.TimeSeries helps you diagnose and visualise both.

We use annual-maximum gauge data from the Rhine — six stations, 54 years — where missing values are coded as -9.

1. Why missing data matters¶

Ignoring missing data can silently bias every downstream analysis:

Biased means and trends. If the missing years are systematically wet (or dry), the computed mean is wrong.
Wrong return periods. Flood frequency analysis assumes each year is represented.
Broken periodicities. Seasonal decomposition fails if a whole month is missing.

A gap diagnosis is the first thing a careful analyst runs.

2. Load the data and convert the `-9` sentinel to `NaN`¶

The raw file uses -9 as a missing-value marker. Pandas treats -9 as a real number unless we tell it otherwise, so we replace with NaN immediately. We also promote the date column (years) into a proper DatetimeIndex.

In [1]:

Copied!





import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statista.time_series import TimeSeries

df = pd.read_csv('../../../examples/data/ams-gauges.csv')
print('Raw shape:', df.shape)
df = df.replace(-9, np.nan)
df['date'] = pd.to_datetime(df['date'], format='%Y')
df = df.set_index('date')
ts = TimeSeries(df)
ts.head()
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statista.time_series import TimeSeries

df = pd.read_csv('../../../examples/data/ams-gauges.csv')
print('Raw shape:', df.shape)
df = df.replace(-9, np.nan)
df['date'] = pd.to_datetime(df['date'], format='%Y')
df = df.set_index('date')
ts = TimeSeries(df)
ts.head()

Raw shape: (54, 7)

Out[1]:

	Frankfurt	Mainz	Kaub	Andernach	Cologne	Rees
date
1951-01-01	NaN	4250	4480	6080	6490	6830
1952-01-01	NaN	4490	4610	6970	7110	7340
1953-01-01	NaN	4270	4380	7300	7610	7970
1954-01-01	NaN	2850	2910	3440	3620	3840
1955-01-01	NaN	5940	6050	9460	9460	9500

3. `.missing_summary()` — per-column diagnostics¶

This method returns a one-row-per-column table with:

total_count / valid_count / missing_count — absolute numbers.
missing_pct — missing as a % of total.
longest_gap — length of the biggest contiguous stretch of NaNs.
n_gaps — number of separate gap episodes.
mean_gap_length — average length of a gap episode.
first_valid / last_valid — the useful range of the record.

In [2]:

Copied!

summary = ts.missing_summary()
print(summary)
summary = ts.missing_summary()
print(summary)

           total_count  missing_count  missing_pct  valid_count  longest_gap  \
Frankfurt           54             13    24.074074           41           13   
Mainz               54              0     0.000000           54            0   
Kaub                54              0     0.000000           54            0   
Andernach           54              0     0.000000           54            0   
Cologne             54              0     0.000000           54            0   
Rees                54              0     0.000000           54            0   

           n_gaps  mean_gap_length first_valid last_valid  
Frankfurt       1             13.0  1964-01-01 2004-01-01  
Mainz           0              0.0  1951-01-01 2004-01-01  
Kaub            0              0.0  1951-01-01 2004-01-01  
Andernach       0              0.0  1951-01-01 2004-01-01  
Cologne         0              0.0  1951-01-01 2004-01-01  
Rees            0              0.0  1951-01-01 2004-01-01

What this means for your data. Frankfurt is missing 13 years — an entire block at the start of the record (1951-1963). All other stations are complete. If you are comparing gauges you should either trim to the common period (1964 onward) or explicitly handle the asymmetry.

4. `.gap_analysis()` — locate every gap¶

Sometimes you want a list of every gap, sorted longest first, so you can see whether the missing data is a single long block or many small holes. The returned frame has one row per gap with column, gap_start, gap_end, and gap_length.

In [3]:

Copied!

gaps = ts.gap_analysis()
print(gaps)
gaps = ts.gap_analysis()
print(gaps)

      column  gap_start    gap_end  gap_length
0  Frankfurt 1951-01-01 1963-01-01          13

What this means for your data. Only one gap in the whole dataset — a single 13-year block in Frankfurt (1951-1963). This is the easiest kind of missing-data problem to handle because it is a clean prefix we can simply clip off.

5. `.completeness_report(freq='YE')` — yearly completeness¶

This method groups the data by a pandas frequency alias ('YE' = year-end, 'ME' = month-end, 'QE' = quarter-end, ...) and returns the percentage of non-missing values in each period. Because our data is already annual, each year is either 0% or 100% complete, but on daily data this method can quickly tell you which years are partial.

In [4]:

Copied!

completeness = ts.completeness_report(freq='YE')
print('Shape:', completeness.shape)
print(completeness.head(15))
completeness = ts.completeness_report(freq='YE')
print('Shape:', completeness.shape)
print(completeness.head(15))

Shape: (54, 6)
            Frankfurt  Mainz   Kaub  Andernach  Cologne   Rees
date                                                          
1951-12-31        0.0  100.0  100.0      100.0    100.0  100.0
1952-12-31        0.0  100.0  100.0      100.0    100.0  100.0
1953-12-31        0.0  100.0  100.0      100.0    100.0  100.0
1954-12-31        0.0  100.0  100.0      100.0    100.0  100.0
1955-12-31        0.0  100.0  100.0      100.0    100.0  100.0
1956-12-31        0.0  100.0  100.0      100.0    100.0  100.0
1957-12-31        0.0  100.0  100.0      100.0    100.0  100.0
1958-12-31        0.0  100.0  100.0      100.0    100.0  100.0
1959-12-31        0.0  100.0  100.0      100.0    100.0  100.0
1960-12-31        0.0  100.0  100.0      100.0    100.0  100.0
1961-12-31        0.0  100.0  100.0      100.0    100.0  100.0
1962-12-31        0.0  100.0  100.0      100.0    100.0  100.0
1963-12-31        0.0  100.0  100.0      100.0    100.0  100.0
1964-12-31      100.0  100.0  100.0      100.0    100.0  100.0
1965-12-31      100.0  100.0  100.0      100.0    100.0  100.0

6. What is an outlier?¶

An outlier is an observation that is unusually far from the rest of the data. Outliers can be:

Errors (instrument spikes, mis-typed values) → remove or correct.
Real extremes (a true flood) → keep; in hydrology these are the most important observations.

The tricky part is deciding which kind you have. The tools below flag candidates; you still need domain knowledge to classify them.

Three common methods¶

| Method | Rule | Pros | Cons | |--------|------|------|------| | IQR | flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] | Distribution-free, robust | Can miss outliers when the data is very skewed | | Z-score | flag points with |z| > 3 where z = (x - mean)/std | Simple, well known | Assumes normality; single extremes inflate std, masking themselves | | Modified Z-score | flag points with 0.6745*|x - median| / MAD > 3.5 | Robust (uses median and MAD) | Still assumes a roughly symmetric distribution |

The modified Z-score (Iglewicz & Hoaglin, 1993) is the recommended default for many workflows because it does not get fooled by the outliers themselves.

7. `.detect_outliers(method='iqr')` — the classic recipe¶

In [5]:

Copied!

outliers_iqr = ts.detect_outliers(method='iqr', threshold=1.5)
print('Outliers per column (IQR):')
print(outliers_iqr.sum())
outliers_iqr = ts.detect_outliers(method='iqr', threshold=1.5)
print('Outliers per column (IQR):')
print(outliers_iqr.sum())

Outliers per column (IQR):
Frankfurt    0
Mainz        0
Kaub         1
Andernach    1
Cologne      1
Rees         0
dtype: int64

What this means for your data. The IQR rule flags only 1 outlier each in Kaub, Andernach and Cologne. This is a fairly conservative rule and tends to be forgiving with heavy-tailed hydrological data.

8. `.detect_outliers(method='modified_zscore')` — robust¶

The modified Z-score uses the median and MAD (from the first notebook). The default threshold is 3.5: any observation whose scaled distance from the median exceeds 3.5 is flagged.

In [6]:

Copied!

outliers_mz = ts.detect_outliers(method='modified_zscore', threshold=3.5)
print('Outliers per column (modified Z-score):')
print(outliers_mz.sum())
outliers_mz = ts.detect_outliers(method='modified_zscore', threshold=3.5)
print('Outliers per column (modified Z-score):')
print(outliers_mz.sum())

Outliers per column (modified Z-score):
Frankfurt    0
Mainz        0
Kaub         0
Andernach    0
Cologne      0
Rees         0
dtype: int64

What this means for your data. The modified Z-score flags many more points than IQR. For right-skewed hydrological data this is expected: the robust rule picks up all the real flood peaks. Whether these are 'errors' to remove depends on context — in flood frequency analysis they are the signal, not the noise.

9. `.outlier_plot()` — see them on the timeline¶

Finally, visualising outliers on the actual time series tells you whether they cluster in time (suspicious) or are scattered (more likely real extremes). Red markers show flagged points, dashed orange lines show the IQR bounds.

In [7]:

Copied!





fig, ax = ts.outlier_plot(
    method='iqr',
    threshold=1.5,
    column='Cologne',
    title='Cologne annual maxima — IQR outliers',
    ylabel='Annual max flow',
)
fig, ax = ts.outlier_plot(
    method='iqr',
    threshold=1.5,
    column='Cologne',
    title='Cologne annual maxima — IQR outliers',
    ylabel='Annual max flow',
)

No description has been provided for this image

We can also compare with the modified Z-score on the same gauge:

In [8]:

Copied!





fig, ax = ts.outlier_plot(
    method='modified_zscore',
    threshold=3.5,
    column='Cologne',
    title='Cologne annual maxima — modified Z-score outliers',
    ylabel='Annual max flow',
)
fig, ax = ts.outlier_plot(
    method='modified_zscore',
    threshold=3.5,
    column='Cologne',
    title='Cologne annual maxima — modified Z-score outliers',
    ylabel='Annual max flow',
)

10. Summary¶

You have seen the complete missing-data and outlier workflow of TimeSeries:

Step	Method	What it does
1	`missing_summary()`	Per-column missing-value diagnostics
2	`gap_analysis()`	Lists every contiguous gap
3	`completeness_report(freq=...)`	% complete per time period
4	`detect_outliers(method='iqr')`	Classic, conservative outlier flag
5	`detect_outliers(method='modified_zscore')`	Robust outlier flag using MAD
6	`outlier_plot(...)`	Visual inspection on the time axis

Rules of thumb.

Always replace sentinel codes like -9 or -999 with NaN before any analysis.
Use IQR for quick screening; use modified Z-score for robust detection.
Never delete an outlier without first checking whether it is a real extreme value.

Reference. Iglewicz, B. & Hoaglin, D. C. (1993). How to detect and handle outliers. ASQC Quality Press.

03 — Missing Data and Outliers with TimeSeries¶

1. Why missing data matters¶

2. Load the data and convert the -9 sentinel to NaN¶

3. .missing_summary() — per-column diagnostics¶

4. .gap_analysis() — locate every gap¶

5. .completeness_report(freq='YE') — yearly completeness¶