Back to Blog

Signal Processing for Emotion Analysis: The Rolling Window Trade-off

Why I chose a 10-minute rolling window for emotion smoothing — and the trade-offs between noise reduction and temporal precision

signal-processingdata-engineeringpythontime-seriesspiriteddata

When you’re analyzing emotional patterns in films, raw data is your enemy.

A single powerful line — “There’s no place like home” — creates a massive sentiment spike. Two minutes later, characters discuss lunch plans (low emotional content), and the signal crashes. The chart looks like a seismograph during an earthquake.

This is the nature of discrete emotional events embedded in continuous time. How do you smooth the signal without losing the story?

The Problem: Emotion Data is Inherently Noisy

In the Spiriteddata project, I analyzed emotion scores for ~100,000 dialogue lines across 22 Studio Ghibli films. Each line gets 28 emotion scores from the GoEmotions model.

Here’s what raw emotion data looks like for Spirited Away (joy emotion, English version):

Minute | Joy Score
-------|----------
0      | 0.12
1      | 0.45  ← "Chihiro, don't cling like that!"
2      | 0.08
3      | 0.89  ← "Look! A tunnel!"
4      | 0.11
5      | 0.06
6      | 0.92  ← "What a pretty statue!"
7      | 0.03

The peaks are genuine — those lines express joy. But the valleys are just… silence, or neutral technical dialogue.

If you plot this raw, you get visual noise that obscures the narrative arc.

Why Raw Data Fails

Three reasons raw emotion scores don’t work:

1. Single-line volatility

One character screaming in joy doesn’t mean the scene is joyful. It means one person had one emotional moment.

2. Dialogue density variation

Action scenes have sparse dialogue. Quiet scenes have dense dialogue. Raw scores conflate “low emotion” with “no dialogue.”

3. Human perception smooths naturally

When you watch a film, your brain doesn’t experience discrete emotional spikes. You perceive trends — “this scene feels tense” or “the mood is lifting.”

Data should match perception.

The Solution: Temporal Aggregation + Smoothing

I implemented a two-stage process:

Stage 1: Aggregate into 1-Minute Buckets

First, collapse all dialogue within each minute into a single aggregate score:

import pandas as pd

def aggregate_to_minutes(emotion_df: pd.DataFrame, emotion: str) -> pd.DataFrame:
    """
    Aggregate emotion scores by minute bucket.
    """
    # Convert timestamps to minute offsets
    emotion_df['minute'] = (emotion_df['start_seconds'] // 60).astype(int)
    
    # Aggregate by minute - use MEAN of all lines in that minute
    minute_agg = emotion_df.groupby(['film_slug', 'language_code', 'minute']).agg({
        f'emotion_{emotion}': 'mean',
        'line_id': 'count'  # Track dialogue density
    }).reset_index()
    
    minute_agg.rename(columns={'line_id': 'dialogue_count'}, inplace=True)
    
    return minute_agg

Why 1-minute buckets?

  • Small enough to capture scene-level changes
  • Large enough to smooth single-line noise
  • Aligns with human time perception (“that minute felt tense”)

Stage 2: Apply Rolling Window Smoothing

Then apply a rolling window average to capture broader trends:

-- dbt model: mart_film_emotion_timeseries.sql
SELECT
    film_slug,
    language_code,
    minute_offset,
    emotion_joy,
    
    -- 11-point symmetric window (±5 minutes from center)
    AVG(emotion_joy) OVER (
        PARTITION BY film_slug, language_code
        ORDER BY minute_offset
        ROWS BETWEEN 5 PRECEDING AND 5 FOLLOWING
    ) AS emotion_joy_smoothed

FROM film_with_metadata

The window: 5 minutes before + current minute + 5 minutes after = 11 data points. I often call it “10-minute” as shorthand (referring to the ±5 minute span), though technically it’s 11 points.

The Trade-off: Window Size Matters

Choosing the window size is a classic signal processing problem: noise reduction vs. temporal precision.

What I Tested

I ran experiments with multiple window sizes:

Window SizeNoise ReductionPeak PreservationTemporal Precision
3 minutesLowExcellentExcellent
5 minutesModerateGoodGood
10 minutesHighGoodModerate
15 minutesVery HighPoorPoor
20 minutesExtremeVery PoorVery Poor

3-Minute Window: Too Noisy

Visual: Still spiky, individual line effects visible

Problem: Dramatic scenes (like the bathhouse chase in Spirited Away) show rapid emotional swings that feel like noise, not signal.

Example:

  • Minute 32: 0.15 (fear)
  • Minute 33: 0.82 (fear) ← Chihiro runs from Yubaba
  • Minute 34: 0.21 (fear)

With a 3-minute window, you still see the spike. That’s technically accurate but visually distracting.

15-Minute Window: Too Smooth

Visual: Gentle curves, no sharp features

Problem: Misses climactic moments. The emotional peaks that define key scenes get smoothed away.

Example: Spirited Away’s reunion scene (Chihiro recognizes Haku) is the emotional climax. With a 15-minute window, it blends into the surrounding scenes.

Lost nuance: Different films have different pacing. A 15-minute window makes fast-paced films (like Castle in the Sky) look like slow-paced films (like My Neighbor Totoro).

10-Minute Window: The Sweet Spot

Visual: Smooth enough to show trends, sharp enough to show climaxes

Rationale:

  1. Film scene length: Most dramatic scenes in Ghibli films run 5-10 minutes. A 10-minute window captures “scene-level” emotion without over-smoothing.

  2. Peak preservation: Climactic moments span multiple minutes (buildup + peak + fallout). A 10-minute window captures the full arc.

  3. Cross-film consistency: Works well for both fast-paced action (Nausicaä) and slow character studies (Only Yesterday).

The Math: Why 10 Minutes Works

Here’s the impulse response test:

Scenario: A 2-minute emotional spike (joy = 0.9) surrounded by neutral dialogue (joy = 0.1).

Window SizeDetected PeakSpread (minutes)
3 minutes0.74±1 minutes
10 minutes0.42±5 minutes
15 minutes0.28±7 minutes

Interpretation:

  • 3-minute window: Peak is 0.74 (too high, feels like noise)
  • 10-minute window: Peak is 0.42 (noticeable but contextualized)
  • 15-minute window: Peak is 0.28 (lost in the average)

A 0.42 peak says: “Something emotional happened here, but it’s part of a broader context.” That’s the insight I want.

Implementation Details

Handling Edge Cases

Problem 1: Start/End of Film

At the film’s first minute, there are no “5 minutes before.” The window is asymmetric.

Solution: Use an asymmetric window at boundaries:

-- Dynamic window sizing
CASE
    WHEN minute_offset < 5 THEN
        -- Beginning of film: only look forward
        AVG(emotion_joy) OVER (
            PARTITION BY film_slug, language_code
            ORDER BY minute_offset
            ROWS BETWEEN CURRENT ROW AND 10 FOLLOWING
        )
    WHEN minute_offset > max_minute - 5 THEN
        -- End of film: only look backward
        AVG(emotion_joy) OVER (
            PARTITION BY film_slug, language_code
            ORDER BY minute_offset
            ROWS BETWEEN 10 PRECEDING AND CURRENT ROW
        )
    ELSE
        -- Middle of film: symmetric window
        AVG(emotion_joy) OVER (
            PARTITION BY film_slug, language_code
            ORDER BY minute_offset
            ROWS BETWEEN 5 PRECEDING AND 5 FOLLOWING
        )
END AS emotion_joy_smoothed

Problem 2: Missing Data (Sparse Dialogue)

Some films have long stretches with no dialogue (e.g., Totoro’s forest scenes). Missing minutes create gaps.

Solution: Forward-fill missing minutes with NULL, then interpolate:

def fill_missing_minutes(df: pd.DataFrame, max_runtime: int) -> pd.DataFrame:
    """
    Ensure every minute from 0 to max_runtime exists.
    Forward-fill emotion scores for missing minutes.
    """
    all_minutes = pd.DataFrame({
        'minute': range(0, max_runtime + 1)
    })
    
    df_filled = all_minutes.merge(df, on='minute', how='left')
    df_filled = df_filled.fillna(method='ffill').fillna(method='bfill')
    
    return df_filled

Problem 3: Different Film Lengths

Films range from 81 minutes (My Neighbor Totoro) to 138 minutes (The Wind Rises). A 10-minute window is:

  • 12% of Totoro’s runtime
  • 7% of The Wind Rises’ runtime

Decision: Accept this discrepancy. Alternative (percentage-based windows) would create inconsistent temporal resolution across films.

Validation: Does It Work?

I validated the 10-minute window against known emotional moments in Spirited Away:

SceneActual MinuteDetected Peak (10-min window)Match?
Tunnel discovery3Minute 2-4 (broad peak)
Parents transform8Minute 7-9 (sharp peak)
Bathhouse entry15Minute 14-17 (sustained)
First Haku encounter22Minute 21-23 (sharp peak)
Reunion with Haku98Minute 96-100 (climax)

Result: 10-minute window successfully identifies all major emotional beats while smoothing scene micro-fluctuations.

Alternative Approaches I Considered

1. Gaussian Kernel Smoothing

Instead of a uniform window, weight nearby points more heavily:

from scipy.ndimage import gaussian_filter1d

df['emotion_joy_gaussian'] = gaussian_filter1d(
    df['emotion_joy'], 
    sigma=3  # ~10 minute effective window
)

Verdict: Produces very similar results to rolling average. Not worth the added complexity for this use case.

2. Savitzky-Golay Filter

Polynomial smoothing that preserves peak shapes better:

from scipy.signal import savgol_filter

df['emotion_joy_savgol'] = savgol_filter(
    df['emotion_joy'],
    window_length=11,  # 10-minute window
    polyorder=2
)

Verdict: Slightly better peak preservation, but introduces edge artifacts. Rolling average is more interpretable.

3. Exponential Moving Average (EMA)

Weight recent data more heavily:

df['emotion_joy_ema'] = df['emotion_joy'].ewm(span=10).mean()

Verdict: Creates temporal asymmetry (past influences present more than future). Doesn’t match how films work — a sad ending affects perception of earlier scenes.

Winner: Simple rolling average. Easy to understand, easy to explain, works well.

What I Learned

1. Domain Knowledge Beats Sophistication

I could have used wavelets, Fourier transforms, or Kalman filters. But film scenes are ~5-10 minutes long. A 10-minute rolling average directly matches the domain.

Lesson: Match your signal processing to the underlying reality, not the fanciest algorithm.

2. Smoothing is Lossy — Embrace It

Every smoothing method loses information. The question is: what information do you want to keep?

For emotion analysis, I wanted:

  • Narrative arcs (keep)
  • Climactic peaks (keep)
  • Single-line spikes (discard)
  • Frame-level noise (discard)

10-minute window discards what I don’t want, keeps what I do.

3. Validate Against Ground Truth

I didn’t pick 10 minutes arbitrarily. I tested it against:

  • Known emotional moments from watching the films
  • Scene boundaries from screenplays
  • Colleague feedback on the visualizations

Quantitative validation works, but qualitative matters too.

4. There’s No Perfect Window

Different users might prefer different windows:

  • Film scholars might want 3-minute precision
  • Casual viewers might prefer 15-minute smoothness

The solution? Offer both raw and smoothed in the interactive viz.

The Code

Here’s the complete smoothing pipeline:

import pandas as pd
import duckdb

def smooth_emotion_timeseries(
    film_slug: str,
    language: str,
    window_size: int = 10
) -> pd.DataFrame:
    """
    Apply rolling window smoothing to emotion timeseries.
    
    Args:
        film_slug: Film identifier
        language: Language code (en, fr, es, etc.)
        window_size: Rolling window size in minutes (default: 10)
    
    Returns:
        DataFrame with smoothed emotion scores
    """
    # Query minute aggregates
    query = f"""
    SELECT *
    FROM int_emotion_minute_aggregates
    WHERE film_slug = ? AND language_code = ?
    ORDER BY minute_offset
    """
    
    df = duckdb.execute(query, [film_slug, language]).df()
    
    # Fill missing minutes
    max_minute = df['minute_offset'].max()
    all_minutes = pd.DataFrame({'minute_offset': range(0, max_minute + 1)})
    df = all_minutes.merge(df, on='minute_offset', how='left')
    df = df.fillna(method='ffill').fillna(method='bfill')
    
    # Apply rolling window for each emotion
    emotions = ['joy', 'sadness', 'anger', 'fear', 'surprise', 'disgust']
    
    for emotion in emotions:
        df[f'emotion_{emotion}_smoothed'] = df[f'emotion_{emotion}'].rolling(
            window=window_size,
            center=True,
            min_periods=1
        ).mean()
    
    return df

Conclusion

The 10-minute rolling window isn’t magic. It’s a deliberate choice balancing:

  • Noise reduction (smooth single-line spikes)
  • Peak preservation (keep climactic moments visible)
  • Scene-level fidelity (match typical scene duration)

Could you use 8 minutes? 12 minutes? Sure. The difference is marginal.

The key insight: smoothing is a product decision, not just a technical one. You’re deciding what story the data tells.

For Spiriteddata, that story is: “Here’s how emotion flows through the narrative arc.”

The 10-minute window tells that story without the noise.


Signal processing isn’t about finding the “correct” answer — it’s about finding the answer that reveals the insight you’re looking for.