Signal Processing for Emotion Analysis: The Rolling Window Trade-off
Why I chose a 10-minute rolling window for emotion smoothing — and the trade-offs between noise reduction and temporal precision
When you’re analyzing emotional patterns in films, raw data is your enemy.
A single powerful line — “There’s no place like home” — creates a massive sentiment spike. Two minutes later, characters discuss lunch plans (low emotional content), and the signal crashes. The chart looks like a seismograph during an earthquake.
This is the nature of discrete emotional events embedded in continuous time. How do you smooth the signal without losing the story?
The Problem: Emotion Data is Inherently Noisy
In the Spiriteddata project, I analyzed emotion scores for ~100,000 dialogue lines across 22 Studio Ghibli films. Each line gets 28 emotion scores from the GoEmotions model.
Here’s what raw emotion data looks like for Spirited Away (joy emotion, English version):
```
Minute | Joy Score
-------|----------
0      | 0.12
1      | 0.45   ← "Chihiro, don't cling like that!"
2      | 0.08
3      | 0.89   ← "Look! A tunnel!"
4      | 0.11
5      | 0.06
6      | 0.92   ← "What a pretty statue!"
7      | 0.03
```
The peaks are genuine — those lines express joy. But the valleys are just… silence, or neutral technical dialogue.
If you plot this raw, you get visual noise that obscures the narrative arc.
Why Raw Data Fails
Three reasons raw emotion scores don’t work:
1. Single-line volatility
One character screaming in joy doesn’t mean the scene is joyful. It means one person had one emotional moment.
2. Dialogue density variation
Action scenes have sparse dialogue. Quiet scenes have dense dialogue. Raw scores conflate “low emotion” with “no dialogue.”
3. Human perception smooths naturally
When you watch a film, your brain doesn’t experience discrete emotional spikes. You perceive trends — “this scene feels tense” or “the mood is lifting.”
Data should match perception.
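Reason 2 is easy to see directly in pandas: a minute with no dialogue simply produces no row, which is not the same thing as a minute of low-emotion dialogue. A toy sketch (invented data, column names matching the aggregation code below):

```python
import pandas as pd

# Toy dialogue lines: minutes 0, 1, and 3 have dialogue; minute 2 is silent
lines = pd.DataFrame({
    'start_seconds': [10, 70, 200],
    'emotion_joy': [0.12, 0.45, 0.89],
})
lines['minute'] = (lines['start_seconds'] // 60).astype(int)

# Mean joy per minute — the silent minute 2 produces no row at all
agg = lines.groupby('minute')['emotion_joy'].mean()
print(agg)
```

Plotted naively, that missing minute reads as a crash to zero rather than as silence.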
The Solution: Temporal Aggregation + Smoothing
I implemented a two-stage process:
Stage 1: Aggregate into 1-Minute Buckets
First, collapse all dialogue within each minute into a single aggregate score:
```python
import pandas as pd

def aggregate_to_minutes(emotion_df: pd.DataFrame, emotion: str) -> pd.DataFrame:
    """
    Aggregate emotion scores by minute bucket.
    """
    # Convert timestamps to minute offsets
    emotion_df['minute'] = (emotion_df['start_seconds'] // 60).astype(int)

    # Aggregate by minute - use MEAN of all lines in that minute
    minute_agg = emotion_df.groupby(['film_slug', 'language_code', 'minute']).agg({
        f'emotion_{emotion}': 'mean',
        'line_id': 'count'  # Track dialogue density
    }).reset_index()
    minute_agg = minute_agg.rename(columns={'line_id': 'dialogue_count'})
    return minute_agg
```
Why 1-minute buckets?
- Small enough to capture scene-level changes
- Large enough to smooth single-line noise
- Aligns with human time perception (“that minute felt tense”)
Stage 2: Apply Rolling Window Smoothing
Then apply a rolling window average to capture broader trends:
```sql
-- dbt model: mart_film_emotion_timeseries.sql
SELECT
    film_slug,
    language_code,
    minute_offset,
    emotion_joy,
    -- 11-point symmetric window (±5 minutes from center)
    AVG(emotion_joy) OVER (
        PARTITION BY film_slug, language_code
        ORDER BY minute_offset
        ROWS BETWEEN 5 PRECEDING AND 5 FOLLOWING
    ) AS emotion_joy_smoothed
FROM film_with_metadata
```
The window: 5 minutes before + current minute + 5 minutes after = 11 data points. I often call it “10-minute” as shorthand (referring to the ±5 minute span), though technically it’s 11 points.
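For quick experiments outside dbt, the same 11-point centered window can be reproduced in pandas. A sketch with invented data, assuming the frame holds one film/language and is already sorted by minute:

```python
import pandas as pd

# Toy minute-level joy scores for one film/language, sorted by minute
df = pd.DataFrame({'emotion_joy': [0.1, 0.1, 0.9, 0.1, 0.1, 0.1, 0.1]})

# window=11, center=True ≈ ROWS BETWEEN 5 PRECEDING AND 5 FOLLOWING;
# min_periods=1 lets the window shrink at the film's boundaries
df['emotion_joy_smoothed'] = df['emotion_joy'].rolling(
    window=11, center=True, min_periods=1
).mean()
```

On the full dataset you would apply this per group, e.g. via `groupby(['film_slug', 'language_code'])`, to avoid smoothing across film boundaries.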
The Trade-off: Window Size Matters
Choosing the window size is a classic signal processing problem: noise reduction vs. temporal precision.
What I Tested
I ran experiments with multiple window sizes:
| Window Size | Noise Reduction | Peak Preservation | Temporal Precision |
|---|---|---|---|
| 3 minutes | Low | Excellent | Excellent |
| 5 minutes | Moderate | Good | Good |
| 10 minutes | High | Good | Moderate |
| 15 minutes | Very High | Poor | Poor |
| 20 minutes | Extreme | Very Poor | Very Poor |
3-Minute Window: Too Noisy
Visual: Still spiky, individual line effects visible
Problem: Dramatic scenes (like the bathhouse chase in Spirited Away) show rapid emotional swings that feel like noise, not signal.
Example:
- Minute 32: 0.15 (fear)
- Minute 33: 0.82 (fear) ← Chihiro runs from Yubaba
- Minute 34: 0.21 (fear)
With a 3-minute window, you still see the spike. That’s technically accurate but visually distracting.
15-Minute Window: Too Smooth
Visual: Gentle curves, no sharp features
Problem: Misses climactic moments. The emotional peaks that define key scenes get smoothed away.
Example: Spirited Away’s reunion scene (Chihiro recognizes Haku) is the emotional climax. With a 15-minute window, it blends into the surrounding scenes.
Lost nuance: Different films have different pacing. A 15-minute window makes fast-paced films (like Castle in the Sky) look like slow-paced films (like My Neighbor Totoro).
10-Minute Window: The Sweet Spot
Visual: Smooth enough to show trends, sharp enough to show climaxes
Rationale:
- **Film scene length:** Most dramatic scenes in Ghibli films run 5-10 minutes. A 10-minute window captures "scene-level" emotion without over-smoothing.
- **Peak preservation:** Climactic moments span multiple minutes (buildup + peak + fallout). A 10-minute window captures the full arc.
- **Cross-film consistency:** Works well for both fast-paced action (Nausicaä) and slow character studies (Only Yesterday).
The Math: Why 10 Minutes Works
Here’s the impulse response test:
Scenario: A 2-minute emotional spike (joy = 0.9) surrounded by neutral dialogue (joy = 0.1).
| Window Size | Detected Peak | Spread (minutes) |
|---|---|---|
| 3 minutes | 0.74 | ±1 minute |
| 10 minutes | 0.42 | ±5 minutes |
| 15 minutes | 0.28 | ±7 minutes |
Interpretation:
- 3-minute window: Peak is 0.74 (too high, feels like noise)
- 10-minute window: Peak is 0.42 (noticeable but contextualized)
- 15-minute window: Peak is 0.28 (lost in the average)
A 0.42 peak says: “Something emotional happened here, but it’s part of a broader context.” That’s the insight I want.
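The impulse-response test is easy to reproduce. A sketch with a synthetic signal; exact peak heights depend on how the window and spike are defined, so the numbers will not match the table above exactly, but the monotone trade-off is the point:

```python
import pandas as pd

# Synthetic impulse: a 2-minute joy spike (0.9) in neutral dialogue (0.1)
signal = pd.Series([0.1] * 60)
signal.iloc[30:32] = 0.9

# Peak height that survives each centered rolling window
peak_by_window = {}
for points in (3, 11, 15):
    smoothed = signal.rolling(window=points, center=True, min_periods=1).mean()
    peak_by_window[points] = smoothed.max()
print(peak_by_window)
```

The wider the window, the lower the surviving peak: that is the noise-reduction vs. peak-preservation dial.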
Implementation Details
Handling Edge Cases
Problem 1: Start/End of Film
At the film’s first minute, there is no "5 minutes before," so the window must become asymmetric.
Solution: Use an asymmetric window at boundaries:
```sql
-- Dynamic window sizing
-- (max_minute = the film's runtime in minutes, computed upstream)
CASE
    WHEN minute_offset < 5 THEN
        -- Beginning of film: only look forward
        AVG(emotion_joy) OVER (
            PARTITION BY film_slug, language_code
            ORDER BY minute_offset
            ROWS BETWEEN CURRENT ROW AND 10 FOLLOWING
        )
    WHEN minute_offset > max_minute - 5 THEN
        -- End of film: only look backward
        AVG(emotion_joy) OVER (
            PARTITION BY film_slug, language_code
            ORDER BY minute_offset
            ROWS BETWEEN 10 PRECEDING AND CURRENT ROW
        )
    ELSE
        -- Middle of film: symmetric window
        AVG(emotion_joy) OVER (
            PARTITION BY film_slug, language_code
            ORDER BY minute_offset
            ROWS BETWEEN 5 PRECEDING AND 5 FOLLOWING
        )
END AS emotion_joy_smoothed
```
Problem 2: Missing Data (Sparse Dialogue)
Some films have long stretches with no dialogue (e.g., Totoro’s forest scenes). Missing minutes create gaps.
Solution: Reindex to a complete minute range (missing minutes appear as NULL), then forward- and back-fill:
```python
def fill_missing_minutes(df: pd.DataFrame, max_runtime: int) -> pd.DataFrame:
    """
    Ensure every minute from 0 to max_runtime exists.
    Forward-fill (then back-fill) emotion scores for missing minutes.
    """
    all_minutes = pd.DataFrame({
        'minute': range(0, max_runtime + 1)
    })
    df_filled = all_minutes.merge(df, on='minute', how='left')
    df_filled = df_filled.ffill().bfill()
    return df_filled
```
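On toy data, the effect of the reindex-and-fill step looks like this (a self-contained restatement of the same logic, with invented scores):

```python
import pandas as pd

# Toy minute aggregates with a dialogue gap at minutes 2-3
df = pd.DataFrame({'minute': [0, 1, 4], 'emotion_joy': [0.12, 0.45, 0.11]})

# Reindex to the full runtime, then carry the last seen score forward
all_minutes = pd.DataFrame({'minute': range(0, 4 + 1)})
filled = all_minutes.merge(df, on='minute', how='left')
filled['emotion_joy'] = filled['emotion_joy'].ffill().bfill()
print(filled)
```

The silent minutes 2 and 3 inherit 0.45 from minute 1, so the smoothed curve glides through dialogue-free stretches instead of crashing to NULL.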
Problem 3: Different Film Lengths
Films range from 81 minutes (My Neighbor Totoro) to 138 minutes (The Wind Rises). A 10-minute window is:
- 12% of Totoro’s runtime
- 7% of The Wind Rises’ runtime
Decision: Accept this discrepancy. Alternative (percentage-based windows) would create inconsistent temporal resolution across films.
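The arithmetic behind those percentages, as a one-liner sketch:

```python
# Fixed 10-minute window as a share of each film's runtime
for title, runtime in [("My Neighbor Totoro", 81), ("The Wind Rises", 138)]:
    print(title, f"{10 / runtime:.0%}")  # → 12% and 7%
```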
Validation: Does It Work?
I validated the 10-minute window against known emotional moments in Spirited Away:
| Scene | Actual Minute | Detected Peak (10-min window) | Match? |
|---|---|---|---|
| Tunnel discovery | 3 | Minute 2-4 (broad peak) | ✅ |
| Parents transform | 8 | Minute 7-9 (sharp peak) | ✅ |
| Bathhouse entry | 15 | Minute 14-17 (sustained) | ✅ |
| First Haku encounter | 22 | Minute 21-23 (sharp peak) | ✅ |
| Reunion with Haku | 98 | Minute 96-100 (climax) | ✅ |
Result: 10-minute window successfully identifies all major emotional beats while smoothing scene micro-fluctuations.
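The "Detected Peak" column comes down to finding local maxima in the smoothed series. A minimal sketch of that check on a synthetic joy curve; the 0.4 threshold is an illustrative choice, not a value from the project:

```python
# Synthetic smoothed joy curve with planted "climax" bumps at minutes 8 and 98
smoothed = [0.1] * 120
for center in (8, 98):
    smoothed[center - 2:center + 3] = [0.2, 0.3, 0.5, 0.3, 0.2]

# A peak = a minute higher than both neighbors and above a threshold
peaks = [
    m for m in range(1, len(smoothed) - 1)
    if smoothed[m] > smoothed[m - 1]
    and smoothed[m] > smoothed[m + 1]
    and smoothed[m] > 0.4
]
print(peaks)  # → [8, 98]
```

Comparing detected peak minutes against hand-labeled scene timestamps is then a straightforward tolerance check (e.g. within ±2 minutes).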
Alternative Approaches I Considered
1. Gaussian Kernel Smoothing
Instead of a uniform window, weight nearby points more heavily:
```python
from scipy.ndimage import gaussian_filter1d

df['emotion_joy_gaussian'] = gaussian_filter1d(
    df['emotion_joy'],
    sigma=3  # ~10-minute effective window
)
```
Verdict: Produces very similar results to rolling average. Not worth the added complexity for this use case.
2. Savitzky-Golay Filter
Polynomial smoothing that preserves peak shapes better:
```python
from scipy.signal import savgol_filter

df['emotion_joy_savgol'] = savgol_filter(
    df['emotion_joy'],
    window_length=11,  # 10-minute window
    polyorder=2
)
```
Verdict: Slightly better peak preservation, but introduces edge artifacts. Rolling average is more interpretable.
3. Exponential Moving Average (EMA)
Weight recent data more heavily:
```python
df['emotion_joy_ema'] = df['emotion_joy'].ewm(span=10).mean()
```
Verdict: Creates temporal asymmetry (past influences present more than future). Doesn’t match how films work — a sad ending affects perception of earlier scenes.
Winner: Simple rolling average. Easy to understand, easy to explain, works well.
What I Learned
1. Domain Knowledge Beats Sophistication
I could have used wavelets, Fourier transforms, or Kalman filters. But film scenes are ~5-10 minutes long. A 10-minute rolling average directly matches the domain.
Lesson: Match your signal processing to the underlying reality, not the fanciest algorithm.
2. Smoothing is Lossy — Embrace It
Every smoothing method loses information. The question is: what information do you want to keep?
For emotion analysis, I wanted:
- Narrative arcs (keep)
- Climactic peaks (keep)
- Single-line spikes (discard)
- Frame-level noise (discard)
10-minute window discards what I don’t want, keeps what I do.
3. Validate Against Ground Truth
I didn’t pick 10 minutes arbitrarily. I tested it against:
- Known emotional moments from watching the films
- Scene boundaries from screenplays
- Colleague feedback on the visualizations
Quantitative validation works, but qualitative matters too.
4. There’s No Perfect Window
Different users might prefer different windows:
- Film scholars might want 3-minute precision
- Casual viewers might prefer 15-minute smoothness
The solution? Offer both raw and smoothed in the interactive viz.
The Code
Here’s the complete smoothing pipeline:
```python
import duckdb
import pandas as pd

def smooth_emotion_timeseries(
    film_slug: str,
    language: str,
    window_size: int = 11  # 11 points = ±5 minutes, matching the SQL model
) -> pd.DataFrame:
    """
    Apply rolling window smoothing to emotion timeseries.

    Args:
        film_slug: Film identifier
        language: Language code (en, fr, es, etc.)
        window_size: Rolling window size in minutes (default: 11 points, ±5 minutes)

    Returns:
        DataFrame with smoothed emotion scores
    """
    # Query minute aggregates
    query = """
        SELECT *
        FROM int_emotion_minute_aggregates
        WHERE film_slug = ? AND language_code = ?
        ORDER BY minute_offset
    """
    df = duckdb.execute(query, [film_slug, language]).df()

    # Fill missing minutes
    max_minute = int(df['minute_offset'].max())
    all_minutes = pd.DataFrame({'minute_offset': range(0, max_minute + 1)})
    df = all_minutes.merge(df, on='minute_offset', how='left')
    df = df.ffill().bfill()

    # Apply centered rolling window for each emotion
    emotions = ['joy', 'sadness', 'anger', 'fear', 'surprise', 'disgust']
    for emotion in emotions:
        df[f'emotion_{emotion}_smoothed'] = df[f'emotion_{emotion}'].rolling(
            window=window_size,
            center=True,
            min_periods=1  # shrink the window at film boundaries
        ).mean()
    return df
```
Conclusion
The 10-minute rolling window isn’t magic. It’s a deliberate choice balancing:
- Noise reduction (smooth single-line spikes)
- Peak preservation (keep climactic moments visible)
- Scene-level fidelity (match typical scene duration)
Could you use 8 minutes? 12 minutes? Sure. The difference is marginal.
The key insight: smoothing is a product decision, not just a technical one. You’re deciding what story the data tells.
For Spiriteddata, that story is: “Here’s how emotion flows through the narrative arc.”
The 10-minute window tells that story without the noise.
Signal processing isn’t about finding the “correct” answer — it’s about finding the answer that reveals the insight you’re looking for.