09 - Decomposition and Smoothing

Target reader: someone with limited statistics background who wants to split a time series into understandable components and reduce noise.

We use the daily minimum temperature record temp.csv: 10 years (1981-1990) of daily values with a clear yearly cycle.

1. Why decompose a time series?

Raw time series mix several things at once:

A slow-moving trend (long-term direction: warming, cooling, urbanisation...).
A repeating seasonal pattern (yearly for climate, weekly for traffic, daily for electricity demand).
A residual - everything left over (weather noise, measurement error, irregular events).

Decomposition separates the three so each can be studied on its own.

Additive vs multiplicative

Two classic ways to combine the parts:

Model	Formula	Use when...
Additive	`Y(t) = Trend(t) + Seasonal(t) + Residual(t)`	Seasonal swings are constant in size
Multiplicative	`Y(t) = Trend(t) * Seasonal(t) * Residual(t)`	Seasonal swings scale with the level

Temperature in degC is typically additive: a 10 degC winter-summer swing does not get bigger when the annual mean rises by 1 degC. Retail sales are typically multiplicative: the December spike grows as the business grows. A quick check is to plot the data. If the height of the seasonal wiggle stays roughly constant over time, use an additive model; if it fans out as the level rises, use a multiplicative one.
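
The visual check can also be done with numbers. The sketch below does not use the temperature data: it builds two synthetic monthly series (all values and names are invented for illustration) and measures the peak-to-trough swing inside each year. It also shows that taking logs turns a multiplicative series into an additive one.

```python
import numpy as np

# Synthetic monthly data: 6 years, with a level that steps up each year.
# All numbers are invented for illustration.
n_years, period = 6, 12
level = np.repeat(100.0 + 20.0 * np.arange(n_years), period)
cycle = np.tile(np.sin(2 * np.pi * np.arange(period) / period), n_years)

additive = level + 10.0 * cycle               # swing is always +/-10
multiplicative = level * (1.0 + 0.1 * cycle)  # swing is +/-10% of the level


def swing_per_year(series, period=12):
    """Peak-to-trough range inside each complete cycle."""
    by_year = np.asarray(series).reshape(-1, period)
    return by_year.max(axis=1) - by_year.min(axis=1)


add_swing = swing_per_year(additive)
mul_swing = swing_per_year(multiplicative)
log_swing = swing_per_year(np.log(multiplicative))

print("additive swing      :", np.round(add_swing, 2))   # same every year
print("multiplicative swing:", np.round(mul_swing, 2))   # grows with the level
print("log(multiplicative) :", np.round(log_swing, 3))   # constant again
```

If the swing of the raw series grows but the swing of its logarithm is stable, a multiplicative model (or an additive model on the logged data) is the more natural choice.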

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statista.time_series import TimeSeries

np.random.seed(42)
plt.rcParams['figure.figsize'] = (11, 3.5)
plt.rcParams['axes.grid'] = True

In [2]:

DATA_PATH = '../../../examples/data/temp.csv'
df = pd.read_csv(DATA_PATH, parse_dates=['Date']).set_index('Date')
df = df.rename(columns={'Temp': 'temp'})
print('Shape :', df.shape)
print('Range :', df.index.min().date(), '->', df.index.max().date())
df.head()

Shape : (3650, 1)
Range : 1981-01-01 -> 1990-12-31

Out[2]:

	temp
Date
1981-01-01	20.7
1981-01-02	17.9
1981-01-03	18.8
1981-01-04	14.6
1981-01-05	15.8

In [3]:

ts = TimeSeries(df[['temp']])
fig, ax = plt.subplots(figsize=(11, 3.5))
ax.plot(ts.index, ts['temp'].values, linewidth=0.4, color='steelblue')
ax.set_title('Daily minimum temperature (degC)')
ax.set_ylabel('temp')
plt.show()

The seasonal wiggle has roughly the same amplitude every year -> we will use an additive decomposition with a period of 365 (one year in daily data).

2. `classical_decompose` step by step

The classical decomposition procedure:

Trend = centred moving average of length period. Smoothing over a full year cancels the seasonal cycle.
Detrended series = Y - Trend (or Y / Trend for multiplicative).
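Seasonal = the average value of the detrended series at each position in the cycle. With period=365 the positions are days of the cycle: every day 1 is averaged across the years, every day 2, ..., every day 365. (With monthly data and period=12 this would be all Januarys, all Februarys, ..., all Decembers.)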
Residual = what remains after removing Trend and Seasonal.

The trend and residual have NaNs at the edges (half a period on each side) - this is unavoidable because the moving average needs data on both sides.
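
The four steps above can be sketched in plain NumPy. This is a textbook version written for illustration, not statista's implementation: the function name `classical_additive` is hypothetical, it only handles odd periods, and centring the seasonal pattern on zero is a common convention that statista may or may not follow. The data is synthetic with a short period of 7 so the result is easy to inspect.

```python
import numpy as np


def classical_additive(y, period):
    """Textbook classical additive decomposition for an odd period."""
    if period % 2 == 0:
        raise ValueError("this sketch only handles odd periods")
    y = np.asarray(y, dtype=float)
    n, half = y.size, period // 2

    # Step 1: centred moving average; the first and last `half` values stay NaN.
    trend = np.full(n, np.nan)
    trend[half:n - half] = np.convolve(y, np.ones(period) / period, mode="valid")

    # Step 2: remove the trend.
    detrended = y - trend

    # Step 3: average the detrended values at each position in the cycle.
    position = np.arange(n) % period
    pattern = np.array([np.nanmean(detrended[position == p]) for p in range(period)])
    pattern -= pattern.mean()   # assumed convention: seasonal part averages to zero
    seasonal = pattern[position]

    # Step 4: whatever is left over.
    residual = y - trend - seasonal
    return trend, seasonal, residual


# Synthetic series: linear trend + known weekly pattern + noise.
rng = np.random.default_rng(0)
period = 7
true_pattern = np.array([3.0, 1.0, 0.0, -1.0, -3.0, -1.0, 1.0])   # sums to zero
t = np.arange(10 * period)
y = 0.1 * t + np.tile(true_pattern, 10) + rng.normal(0.0, 0.3, t.size)

trend, seasonal, residual = classical_additive(y, period)
print("NaNs at the start of the trend:", int(np.isnan(trend[:period]).sum()))  # period // 2 = 3
print("true pattern     :", true_pattern)
print("estimated pattern:", np.round(seasonal[:period], 2))
```

The estimated pattern should sit close to the true one, and the NaN count at each edge equals half the period, which is the edge effect described above.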

In [4]:

decomp, _ = ts.classical_decompose(period=365, model='additive')
decomp.head()

Out[4]:

	observed	trend	seasonal	residual
Date
1981-01-01	20.7	NaN	4.202612	NaN
1981-01-02	17.9	NaN	3.880025	NaN
1981-01-03	18.8	NaN	3.124256	NaN
1981-01-04	14.6	NaN	2.868122	NaN
1981-01-05	15.8	NaN	2.524804	NaN

In [5]:

# Quick numerical summary
print('Observed mean      :', round(decomp['observed'].mean(), 3))
print('Trend mean         :', round(decomp['trend'].dropna().mean(), 3))
print('Seasonal amplitude :', round(decomp['seasonal'].max() - decomp['seasonal'].min(), 3))
print('Residual std       :', round(decomp['residual'].std(), 3))

Observed mean      : 11.178
Trend mean         : 11.113
Seasonal amplitude : 12.109
Residual std       : 2.555

Sanity check: reconstruction

For an additive decomposition we should get observed = trend + seasonal + residual exactly (up to floating-point error) where the trend is defined.

In [6]:

reconstructed = decomp['trend'] + decomp['seasonal'] + decomp['residual']
diff = (decomp['observed'] - reconstructed).dropna()
print('Max |observed - (trend+seasonal+residual)| =',
      float(np.nanmax(np.abs(diff))))

Max |observed - (trend+seasonal+residual)| = 3.552713678800501e-15

3. Why smooth? The signal/noise trade-off

Raw daily data is spiky: day-to-day variation hides slower patterns. Smoothing replaces each point by a weighted average of its neighbours so the slow signal becomes visible.

The fundamental trade-off

Narrow window -> smoother than the raw data, but still noisy; follows every real turn closely (high responsiveness, low smoothness).
Wide window -> very smooth; but lags behind turns and flattens real peaks (high smoothness, low responsiveness).

There is no single correct window - pick the one that emphasises the scale of variation you care about.
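
The trade-off can be measured on synthetic data. In the sketch below (invented signal, illustrative names), a real feature and pure noise are smoothed separately; because a moving average is linear, smoothing their sum gives the sum of the two results. Wider windows remove more noise but also shave more height off the real peak.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(1000)
bump = np.exp(-0.5 * ((t - 500) / 15.0) ** 2)   # a real feature of height 1
noise = rng.normal(0.0, 0.5, t.size)            # noise with standard deviation 0.5


def moving_average(x, window):
    """Centred moving average; incomplete edge windows are dropped for simplicity."""
    return np.convolve(x, np.ones(window) / window, mode="valid")


peaks, noises = [], []
for window in (5, 21, 101):
    peak_kept = moving_average(bump, window).max()     # 1.0 would mean no flattening
    noise_left = moving_average(noise, window).std()   # 0.5 would mean no noise removed
    peaks.append(peak_kept)
    noises.append(noise_left)
    print(f"window={window:3d}  peak kept={peak_kept:.2f}  noise std={noise_left:.3f}")
```

Reading the table from top to bottom, both columns shrink: that is the responsiveness you give up in exchange for smoothness.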

4. Moving average

`smooth(method='moving_average', window=w)` replaces each value with the simple average of the w values centred on that point. Pros: trivial to understand. Cons: every value in the window gets equal weight, so sharp peaks are flattened, and NaNs appear at the edges where the window is incomplete.
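
A five-point example makes both drawbacks visible. The sketch uses pandas' own `rolling` directly rather than statista, so it illustrates the idea of a centred window and is not a statement about how `smooth` is implemented internally.

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 9.0, 2.0, 1.0])   # a single sharp peak of height 9

# Centred 3-point average: each output uses the point itself and one neighbour per side.
centred = values.rolling(window=3, center=True).mean()
print(centred.round(2).tolist())    # [nan, 4.0, 4.33, 4.0, nan]

# Allowing incomplete windows fills the edges, but those values average fewer points.
filled = values.rolling(window=3, center=True, min_periods=1).mean()
print(filled.round(2).tolist())     # [1.5, 4.0, 4.33, 4.0, 1.5]
```

The peak of 9 drops to about 4.33, and the first and last positions have no complete window, which is where the NaNs come from.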

In [7]:

ma7   = ts.smooth(method='moving_average', window=7)
ma30  = ts.smooth(method='moving_average', window=30)
ma365 = ts.smooth(method='moving_average', window=365)

fig, ax = plt.subplots(figsize=(11, 4))
ax.plot(ts.index, ts['temp'], alpha=0.3, linewidth=0.3, label='raw')
ax.plot(ma7.index, ma7['temp'], linewidth=0.6, label='MA-7 days')
ax.plot(ma30.index, ma30['temp'], linewidth=1.0, label='MA-30 days')
ax.plot(ma365.index, ma365['temp'], linewidth=1.5, color='red', label='MA-365 days (trend)')
ax.set_title('Moving-average smoothing at three window sizes')
ax.legend()
plt.show()

The 7-day window keeps the seasonal cycle plus some noise; the 30-day window removes most of the noise but keeps the cycle; the 365-day window averages over a full year, so the seasonal cycle cancels and only the slow trend remains.

5. Exponential smoothing

smooth(method='exponential', window=w) uses an exponentially weighted average: recent points count more than old points. The window parameter is the span - roughly the effective number of recent observations.

Advantages over the moving average:

No NaNs at the edges - the filter is one-sided.
Reacts faster to recent changes (good for online forecasting).

Trade-off: because the filter only looks backwards, it introduces a small phase lag. The smoothed curve tends to sit slightly behind the raw data, and the lag grows with the span.
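
The recursion behind an exponentially weighted average is short enough to write out. The sketch below assumes the span is converted to a smoothing factor with alpha = 2 / (span + 1), which is the convention pandas uses for `span`; whether statista applies exactly this recursion is an assumption. A step input makes the lag easy to see.

```python
import numpy as np


def ewma(x, span):
    """Recursive exponentially weighted mean, seeded with the first value."""
    alpha = 2.0 / (span + 1.0)   # span-to-alpha convention used by pandas (assumed here)
    out = np.empty(len(x), dtype=float)
    out[0] = x[0]
    for i in range(1, len(x)):
        # New value = a little of the latest observation + most of the previous estimate.
        out[i] = alpha * x[i] + (1.0 - alpha) * out[i - 1]
    return out


# A step from 0 to 1 at index 50.
step = np.concatenate([np.zeros(50), np.ones(150)])
smooth = ewma(step, span=30)

half_way = int(np.argmax(smooth >= 0.5)) - 50   # index offset at which half the jump is covered
print("any NaNs:", bool(np.isnan(smooth).any()))
print("samples after the step before 50% is reached:", half_way)
```

Every output is defined from the first sample onwards, so there are no NaNs, but the smoothed curve needs several samples to catch up with the step. That delay is the phase lag.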

In [8]:

ewm30 = ts.smooth(method='exponential', window=30)

fig, ax = plt.subplots(figsize=(11, 4))
ax.plot(ts.index, ts['temp'], alpha=0.3, linewidth=0.3, label='raw')
ax.plot(ma30.index,  ma30['temp'],  linewidth=1.0, label='Moving average (30)')
ax.plot(ewm30.index, ewm30['temp'], linewidth=1.0, color='red', label='Exponential (30)')
ax.set_title('Moving average vs exponential smoothing (window=30)')
ax.legend()
plt.show()

6. Savitzky-Golay

smooth(method='savgol', window=w, polyorder=k) fits a low-order polynomial (default degree 2) to each rolling window by least squares and returns the polynomial's centre value.

Why use it?

A moving average is equivalent to fitting a constant - that is why it flattens peaks. Savitzky-Golay fits a curve, so peaks stay sharp. It is the go-to smoother when the shape of peaks matters (spectroscopy, hydrograph analysis, signal processing).

The window must be odd; if you pass an even number statista adds 1 for you.
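
To see why fitting a curve preserves peaks, the filter can be written as an explicit local polynomial fit. The helper `savgol_centre` below is a hypothetical, deliberately slow sketch built on `np.polyfit`; it leaves the edges as NaN, whereas library implementations usually treat edges differently. The test signal is a noise-free synthetic peak.

```python
import numpy as np


def savgol_centre(y, window, polyorder):
    """Fit a polynomial to each full (odd) window and keep its value at the centre."""
    y = np.asarray(y, dtype=float)
    half = window // 2
    offsets = np.arange(-half, half + 1)
    out = np.full(y.size, np.nan)   # edges without a full window stay NaN in this sketch
    for i in range(half, y.size - half):
        coeffs = np.polyfit(offsets, y[i - half:i + half + 1], polyorder)
        out[i] = np.polyval(coeffs, 0.0)   # fitted curve evaluated at the window centre
    return out


t = np.arange(201)
peak = np.exp(-0.5 * ((t - 100) / 8.0) ** 2)            # noise-free peak of height 1

ma = np.convolve(peak, np.ones(31) / 31, mode="same")   # moving average, window 31
sg2 = savgol_centre(peak, window=31, polyorder=2)       # local quadratic fit
sg0 = savgol_centre(peak, window=31, polyorder=0)       # local constant fit

print(f"peak height after moving average : {ma[100]:.2f}")
print(f"peak height after Savitzky-Golay : {sg2[100]:.2f}")
print("order-0 fit equals moving average:", bool(np.allclose(sg0[15:-15], ma[15:-15])))
```

With the same window, the quadratic fit keeps noticeably more of the peak height than the moving average, and the order-0 fit reproduces the moving average away from the edges.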

In [9]:

savgol = ts.smooth(method='savgol', window=31, polyorder=2)

fig, ax = plt.subplots(figsize=(11, 4))
ax.plot(ts.index, ts['temp'], alpha=0.3, linewidth=0.3, label='raw')
ax.plot(ma30.index, ma30['temp'], linewidth=1.0, label='Moving average (30)')
ax.plot(savgol.index, savgol['temp'], linewidth=1.0, color='red', label='Savitzky-Golay (31, order 2)')
ax.set_title('Moving average vs Savitzky-Golay - note peak preservation')
ax.legend()
plt.show()

When to use which?

Method	Preserves peaks	NaNs at edge	Weights	Good for
Moving average	No	Yes	Equal	Trend, de-seasonalising
Exponential	Partly	No	Recent > old	Online forecasting
Savitzky-Golay	Yes	No	Polynomial fit	Peak shapes, spectra

7. Rolling envelope - visualising variability

Smoothing hides variability by design. Sometimes we want the opposite: to see how variable the data is around its local level. envelope(window, lower_pct, upper_pct) plots:

the raw series (thin, translucent),
the rolling median,
a shaded band between the rolling lower_pct and upper_pct percentiles.

A widening band means more variability; a narrowing band means a calmer period.
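
The numbers behind such a band are rolling percentiles. The sketch below computes them with NumPy on a synthetic series whose noise level triples halfway through. It uses a trailing window and a hypothetical helper `rolling_percentile`; how statista aligns its window is not stated here, so treat the alignment as an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
# Synthetic series: calm first half (std 1), noisy second half (std 3).
spread = np.where(np.arange(n) < 200, 1.0, 3.0)
series = 10.0 + rng.normal(0.0, 1.0, n) * spread


def rolling_percentile(x, window, pct):
    """Percentile over a trailing window; the first window-1 values are NaN."""
    out = np.full(x.size, np.nan)
    for i in range(window - 1, x.size):
        out[i] = np.percentile(x[i - window + 1:i + 1], pct)
    return out


window = 60
lower = rolling_percentile(series, window, 10)
median = rolling_percentile(series, window, 50)
upper = rolling_percentile(series, window, 90)
band_width = upper - lower

print("mean band width, calm half :", round(float(np.nanmean(band_width[:200])), 2))
print("mean band width, noisy half:", round(float(np.nanmean(band_width[260:])), 2))
```

The band is clearly wider in the noisy half, which is exactly what the shaded region in the envelope plot conveys at a glance.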

In [10]:

fig, ax = ts.envelope(window=60, lower_pct=10, upper_pct=90,
                      title='Temperature envelope - 60-day 10-90% band')

8. Summary

Decomposition separates a series into trend + seasonal + residual (additive) or trend * seasonal * residual (multiplicative). Use additive when seasonal amplitude is constant, multiplicative when it scales with level.
classical_decompose(period=P) needs you to supply the seasonal period (365 for daily-with-yearly-cycle, 12 for monthly, 7 for daily-with-weekly-cycle).
Reconstruction sanity check: for an additive decomposition, observed = trend + seasonal + residual holds up to floating-point error wherever the trend is defined.
Smoothing trades responsiveness for smoothness. Pick the window from the time scale of interest.
Moving average - simplest; flattens peaks, NaNs at edges.
Exponential - recent values weigh more; no NaNs; some lag.
Savitzky-Golay - fits a low-order polynomial; preserves peaks.
Envelope - shows variability instead of hiding it.