09 - Decomposition and Smoothing¶
Target reader: someone with limited statistics background who wants to split a time series into understandable components and reduce noise.
We use temp.csv, a record of daily minimum temperatures: 10 years of daily values (1981-1990) with a clear yearly cycle.
1. Why decompose a time series?¶
Raw time series mix several things at once:
- A slow-moving trend (long-term direction: warming, cooling, urbanisation...).
- A repeating seasonal pattern (yearly for climate, weekly for traffic, daily for electricity demand).
- A residual - everything left over (weather noise, measurement error, irregular events).
Decomposition separates the three so each can be studied on its own.
Additive vs multiplicative¶
Two classic ways to combine the parts:
| Model | Formula | Use when... |
|---|---|---|
| Additive | Y(t) = Trend(t) + Seasonal(t) + Residual(t) | Seasonal swings are constant in size |
| Multiplicative | Y(t) = Trend(t) * Seasonal(t) * Residual(t) | Seasonal swings scale with the level |
Temperature in degC is typically additive: a 10 degC winter-summer swing does not get bigger when the annual mean rises by 1 degC. Retail sales are typically multiplicative: the December spike grows as the business grows. A quick check is to plot the data: if the height of the seasonal wiggle is roughly constant over time, use additive; if it fans out as the level rises, use multiplicative.
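The visual check can also be done numerically. The sketch below uses synthetic monthly data (not temp.csv) and a hypothetical helper, yearly_swing, to compare the within-year range of an additive and a multiplicative series that share the same rising trend. It is an illustration under those assumptions, not a formal test.

```python
import numpy as np
import pandas as pd

# Synthetic monthly data (illustrative only): 10 years with a rising trend.
idx = pd.date_range("2000-01-01", periods=120, freq="MS")
t = np.arange(120)
trend = 100 + 1.0 * t
cycle = np.sin(2 * np.pi * t / 12)

additive = pd.Series(trend + 10 * cycle, index=idx)               # fixed +/-10 swing
multiplicative = pd.Series(trend * (1 + 0.1 * cycle), index=idx)  # swing is 10% of the level


def yearly_swing(s):
    """Hypothetical helper: max minus min within each calendar year."""
    by_year = s.groupby(s.index.year)
    return by_year.max() - by_year.min()


add_swing = yearly_swing(additive)
mul_swing = yearly_swing(multiplicative)

print("additive       swing, first vs last year:",
      round(add_swing.iloc[0], 1), round(add_swing.iloc[-1], 1))
print("multiplicative swing, first vs last year:",
      round(mul_swing.iloc[0], 1), round(mul_swing.iloc[-1], 1))
```

The additive series shows the same swing every year, while the multiplicative swing grows with the level - the numerical counterpart of the "fans out" pattern in a plot.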
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statista.time_series import TimeSeries
np.random.seed(42)
plt.rcParams['figure.figsize'] = (11, 3.5)
plt.rcParams['axes.grid'] = True
DATA_PATH = '../../../examples/data/temp.csv'
df = pd.read_csv(DATA_PATH, parse_dates=['Date']).set_index('Date')
df = df.rename(columns={'Temp': 'temp'})
print('Shape :', df.shape)
print('Range :', df.index.min().date(), '->', df.index.max().date())
df.head()
Shape : (3650, 1)
Range : 1981-01-01 -> 1990-12-31
| Date | temp |
|---|---|
| 1981-01-01 | 20.7 |
| 1981-01-02 | 17.9 |
| 1981-01-03 | 18.8 |
| 1981-01-04 | 14.6 |
| 1981-01-05 | 15.8 |
ts = TimeSeries(df[['temp']])
fig, ax = plt.subplots(figsize=(11, 3.5))
ax.plot(ts.index, ts['temp'].values, linewidth=0.4, color='steelblue')
ax.set_title('Daily minimum temperature (degC)')
ax.set_ylabel('temp')
plt.show()
The seasonal wiggle has roughly the same amplitude every year, so we will use an additive decomposition with a period of 365 (one year of daily data; the 3650 rows are exactly 10 x 365).
2. classical_decompose step by step¶
The classical decomposition procedure:
- Trend = centred moving average of length period. Smoothing over a full year cancels the seasonal cycle.
- Detrended series = Y - Trend (or Y / Trend for multiplicative).
- Seasonal = the average value of the detrended series at each position in the cycle. All Januarys are averaged, all Februarys, ..., all Decembers.
- Residual = what remains after removing Trend and Seasonal.
The trend and residual have NaNs at the edges (half a period on each side) - this is unavoidable because the moving average needs data on both sides.
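The four steps can be sketched with plain pandas. This is a minimal illustration on synthetic data, not statista's implementation: the function name classical_additive_sketch is hypothetical, and it only handles odd periods (an even period would need a slightly different centred average).

```python
import numpy as np
import pandas as pd


def classical_additive_sketch(y, period):
    """Illustrative additive decomposition; `period` must be odd in this sketch."""
    if period % 2 == 0:
        raise ValueError("this sketch only handles odd periods")
    # 1. Trend: centred moving average over one full cycle.
    trend = y.rolling(window=period, center=True).mean()
    # 2. Detrended series.
    detrended = y - trend
    # 3. Seasonal: mean of the detrended values at each position in the cycle
    #    (NaNs from the trend edges are skipped by .mean()).
    position = np.arange(len(y)) % period
    seasonal_means = detrended.groupby(position).mean()
    seasonal = pd.Series(seasonal_means.to_numpy()[position], index=y.index)
    # 4. Residual: whatever is left.
    residual = y - trend - seasonal
    return pd.DataFrame({"observed": y, "trend": trend,
                         "seasonal": seasonal, "residual": residual})


# Synthetic daily series with a weekly pattern (illustrative, not temp.csv).
rng = np.random.default_rng(0)
n, period = 140, 7
pattern = np.array([3.0, 1.0, 0.0, -1.0, -2.0, -1.0, 0.0])  # sums to zero
t = np.arange(n)
y = pd.Series(0.05 * t + np.tile(pattern, n // period) + rng.normal(0, 0.3, n),
              index=pd.date_range("2020-01-01", periods=n, freq="D"))

parts = classical_additive_sketch(y, period)
print("NaNs in trend:", int(parts["trend"].isna().sum()))
print("estimated weekly pattern:", parts["seasonal"].iloc[:period].round(2).tolist())
```

With period=7 the trend loses three values at each end, which is the same "half a period on each side" effect described above, just at a smaller scale.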
decomp, _ = ts.classical_decompose(period=365, model='additive')
decomp.head()
| Date | observed | trend | seasonal | residual |
|---|---|---|---|---|
| 1981-01-01 | 20.7 | NaN | 4.202612 | NaN |
| 1981-01-02 | 17.9 | NaN | 3.880025 | NaN |
| 1981-01-03 | 18.8 | NaN | 3.124256 | NaN |
| 1981-01-04 | 14.6 | NaN | 2.868122 | NaN |
| 1981-01-05 | 15.8 | NaN | 2.524804 | NaN |
# Quick numerical summary
print('Observed mean :', round(decomp['observed'].mean(), 3))
print('Trend mean :', round(decomp['trend'].dropna().mean(), 3))
print('Seasonal amplitude :', round(decomp['seasonal'].max() - decomp['seasonal'].min(), 3))
print('Residual std :', round(decomp['residual'].std(), 3))
Observed mean : 11.178
Trend mean : 11.113
Seasonal amplitude : 12.109
Residual std : 2.555
Sanity check: reconstruction¶
For an additive decomposition we should get observed = trend + seasonal + residual exactly (up to floating-point error) where the trend is defined.
reconstructed = decomp['trend'] + decomp['seasonal'] + decomp['residual']
diff = (decomp['observed'] - reconstructed).dropna()
print('Max |observed - (trend+seasonal+residual)| =',
float(np.nanmax(np.abs(diff))))
Max |observed - (trend+seasonal+residual)| = 3.552713678800501e-15
3. Why smooth? The signal/noise trade-off¶
Raw daily data is spiky: day-to-day variation hides slower patterns. Smoothing replaces each point by a weighted average of its neighbours so the slow signal becomes visible.
The fundamental trade-off¶
- Narrow window -> smoother than the raw data, but still noisy; follows every real turn closely (high responsiveness, low smoothness).
- Wide window -> very smooth; but lags behind turns and flattens real peaks (high smoothness, low responsiveness).
There is no single correct window - pick the one that emphasises the scale of variation you care about.
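The trade-off can be made concrete with numbers. The sketch below is a synthetic illustration (one Gaussian peak plus random noise, both assumptions of this example) that measures two things for a narrow and a wide centred moving average: how much day-to-day jitter survives, and how tall the real peak remains.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = np.arange(400)
signal = 10 * np.exp(-0.5 * ((t - 200) / 8) ** 2)   # one real peak of height 10
noise = rng.normal(0, 1, size=t.size)
raw = pd.Series(signal + noise)

results = {}
for window in (5, 51):
    smoothed = raw.rolling(window, center=True).mean()
    # Roughness: standard deviation of day-to-day changes that survive smoothing.
    roughness = smoothed.diff().std()
    # Peak height: what the same filter does to the noise-free peak.
    peak = pd.Series(signal).rolling(window, center=True).mean().max()
    results[window] = {"roughness": roughness, "peak": peak}
    print(f"window={window:>2}  roughness={roughness:.3f}  peak height={peak:.2f}")
```

The wide window gives the lower roughness but also the lower peak: smoothness is bought by flattening real features.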
4. Moving average¶
smooth(method='moving_average', window=w) replaces each value with the simple average of the w values centred on that point. Pros: trivial to understand. Cons: every value in the window gets equal weight, so peaks get flattened, and NaNs appear at the edges.
ma7 = ts.smooth(method='moving_average', window=7)
ma30 = ts.smooth(method='moving_average', window=30)
ma365 = ts.smooth(method='moving_average', window=365)
fig, ax = plt.subplots(figsize=(11, 4))
ax.plot(ts.index, ts['temp'], alpha=0.3, linewidth=0.3, label='raw')
ax.plot(ma7.index, ma7['temp'], linewidth=0.6, label='MA-7 days')
ax.plot(ma30.index, ma30['temp'], linewidth=1.0, label='MA-30 days')
ax.plot(ma365.index, ma365['temp'], linewidth=1.5, color='red', label='MA-365 days (trend)')
ax.set_title('Moving-average smoothing at three window sizes')
ax.legend()
plt.show()
The 7-day window keeps the seasonal cycle plus some noise, the 30-day window removes most of the noise, and the 365-day window removes the seasonal cycle entirely, leaving only the slow trend.
5. Exponential smoothing¶
smooth(method='exponential', window=w) uses an exponentially weighted average: recent points count more than old points. The window parameter is the span - roughly the effective number of recent observations.
Advantages over the moving average:
- No NaNs at the edges - the filter is one-sided.
- Reacts faster to recent changes (good for online forecasting).
Trade-off: because the filter only looks backwards, it introduces a small phase lag - the smoothed curve tends to sit slightly behind the raw data.
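To see the one-sided behaviour and the lag in isolation, the sketch below applies pandas' ewm to a step function. It assumes that the window argument corresponds to pandas' span (so alpha = 2 / (span + 1)); whether statista maps it exactly this way, and whether it uses adjust=False, is an assumption of this example.

```python
import numpy as np
import pandas as pd

span = 30
alpha = 2 / (span + 1)   # pandas' definition of span (assumed mapping for `window`)

# A step from 0 to 1 at t = 100 makes the lag easy to see.
step = pd.Series(np.r_[np.zeros(100), np.ones(100)])

ewm = step.ewm(span=span, adjust=False).mean()   # one-sided: uses past values only
ma = step.rolling(span, center=True).mean()      # centred: uses past and future values

# First index at which each smoother has covered half of the step.
ewm_cross = int((ewm >= 0.5).idxmax())
ma_cross = int((ma >= 0.5).idxmax())

print("alpha:", round(alpha, 4))
print("NaNs -> exponential:", int(ewm.isna().sum()),
      "| moving average:", int(ma.isna().sum()))
print("reaches 0.5 at -> exponential:", ewm_cross, "| moving average:", ma_cross)
```

The exponential filter has no NaNs but crosses the halfway mark several steps after the jump. The centred moving average crosses it at the jump only because it looks ahead, which is not possible when values arrive one at a time.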
ewm30 = ts.smooth(method='exponential', window=30)
fig, ax = plt.subplots(figsize=(11, 4))
ax.plot(ts.index, ts['temp'], alpha=0.3, linewidth=0.3, label='raw')
ax.plot(ma30.index, ma30['temp'], linewidth=1.0, label='Moving average (30)')
ax.plot(ewm30.index, ewm30['temp'], linewidth=1.0, color='red', label='Exponential (30)')
ax.set_title('Moving average vs exponential smoothing (window=30)')
ax.legend()
plt.show()
6. Savitzky-Golay¶
smooth(method='savgol', window=w, polyorder=k) fits a low-order polynomial (default degree 2) to each rolling window by least squares and returns the polynomial's centre value.
Why use it?¶
A moving average is equivalent to fitting a constant - that is why it flattens peaks. Savitzky-Golay fits a curve, so peaks stay sharp. It is the go-to smoother when the shape of peaks matters (spectroscopy, hydrograph analysis, signal processing).
The window must be odd; if you pass an even number statista adds 1 for you.
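The peak-preservation claim can be checked on a noise-free peak. The sketch below is a deliberately naive Savitzky-Golay written with numpy.polyfit (the function name savgol_sketch is hypothetical); it is not statista's code, and unlike the smoother described here it leaves half a window of NaNs at each edge to stay short.

```python
import numpy as np


def savgol_sketch(y, window, polyorder=2):
    """Fit a polynomial to each centred window and keep its value at the centre.
    Edges (half a window on each side) are left as NaN in this sketch."""
    if window % 2 == 0:
        window += 1                      # force an odd window
    half = window // 2
    x = np.arange(-half, half + 1)       # window coordinates, centre at 0
    out = np.full(len(y), np.nan)
    for i in range(half, len(y) - half):
        coeffs = np.polyfit(x, y[i - half:i + half + 1], polyorder)
        out[i] = np.polyval(coeffs, 0)   # polynomial evaluated at the window centre
    return out


t = np.arange(200)
peak = 10 * np.exp(-0.5 * ((t - 100) / 6) ** 2)   # noise-free peak of height 10

window = 21
moving_avg = np.convolve(peak, np.ones(window) / window, mode="same")
sg = savgol_sketch(peak, window, polyorder=2)

print("true peak      :", round(float(peak.max()), 2))
print("moving average :", round(float(moving_avg.max()), 2))
print("Savitzky-Golay :", round(float(np.nanmax(sg)), 2))
```

With the same window, the quadratic fit keeps the peak much closer to its true height than the flat average does.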
savgol = ts.smooth(method='savgol', window=31, polyorder=2)
fig, ax = plt.subplots(figsize=(11, 4))
ax.plot(ts.index, ts['temp'], alpha=0.3, linewidth=0.3, label='raw')
ax.plot(ma30.index, ma30['temp'], linewidth=1.0, label='Moving average (30)')
ax.plot(savgol.index, savgol['temp'], linewidth=1.0, color='red', label='Savitzky-Golay (31, order 2)')
ax.set_title('Moving average vs Savitzky-Golay - note peak preservation')
ax.legend()
plt.show()
When to use which?¶
| Method | Preserves peaks | NaNs at edge | Weights | Good for |
|---|---|---|---|---|
| Moving average | No | Yes | Equal | Trend, de-seasonalising |
| Exponential | Partly | No | Recent > old | Online forecasting |
| Savitzky-Golay | Yes | No | Polynomial fit | Peak shapes, spectra |
7. Rolling envelope - visualising variability¶
Smoothing hides variability by design. Sometimes we want the opposite: to see how variable the data is around its local level. envelope(window, lower_pct, upper_pct) plots:
- the raw series (thin, translucent),
- the rolling median,
- a shaded band between the rolling lower_pct and upper_pct percentiles.
A widening band means more variability; a narrowing band means a calmer period.
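The quantities behind such a plot can be computed with pandas rolling quantiles. The sketch below uses synthetic noise whose spread doubles halfway through; the centred window is an assumption of this example and may differ from what envelope does internally.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 600
# Synthetic noise whose spread doubles halfway through (illustrative only).
scale = np.where(np.arange(n) < n // 2, 1.0, 2.0)
s = pd.Series(rng.normal(0, scale))

window = 60
roll = s.rolling(window, center=True)
median = roll.median()          # the central line of the envelope
lower = roll.quantile(0.10)     # corresponds to lower_pct=10
upper = roll.quantile(0.90)     # corresponds to upper_pct=90
band_width = upper - lower

calm = band_width.iloc[:n // 2].median()
noisy = band_width.iloc[n // 2:].median()
print("typical band width, calm half :", round(float(calm), 2))
print("typical band width, noisy half:", round(float(noisy), 2))
```

The band is clearly wider in the second half, which is exactly the signal a widening envelope is meant to convey.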
fig, ax = ts.envelope(window=60, lower_pct=10, upper_pct=90,
title='Temperature envelope - 60-day 10-90% band')
8. Summary¶
- Decomposition separates a series into trend + seasonal + residual (additive) or trend * seasonal * residual (multiplicative). Use additive when seasonal amplitude is constant, multiplicative when it scales with level.
- classical_decompose(period=P) needs you to supply the seasonal period (365 for daily-with-yearly-cycle, 12 for monthly, 7 for daily-with-weekly-cycle).
- Reconstruction sanity check: observed = trend + seasonal + residual (additive) holds up to floating-point error wherever the trend is defined.
- Smoothing trades responsiveness for smoothness. Pick the window from the time scale of interest.
- Moving average - simplest; flattens peaks, NaNs at edges.
- Exponential - recent values weigh more; no NaNs; some lag.
- Savitzky-Golay - fits a low-order polynomial; preserves peaks.
- Envelope - shows variability instead of hiding it.