Data Quality Validation: From 41.8% to 54.5%

How I improved subtitle file quality for emotion analysis through systematic validation and iterative improvement.

data-quality · testing · python · data-engineering

When I first validated my subtitle files for the Ghibli emotion analysis project, only 41.8% passed quality checks. That number eventually reached 54.5% — still below my 75% target, but the journey taught me valuable lessons about data quality at scale.

Here’s what actually happened.

The Problem: Film Version Mismatch

For my Spiriteddata project, I needed to analyze emotions in Studio Ghibli film dialogue across 5 languages (English, French, Spanish, Dutch, Arabic). But there’s a catch: subtitle files come from different film versions.

Theatrical cuts, Blu-ray releases, streaming versions — they all have slightly different runtimes. A subtitle file synced to a 124-minute Blu-ray version won’t align correctly with a 119-minute streaming version. And if the timing is off, the emotion analysis timestamps become meaningless.

The Validation Approach

The core idea: compare the subtitle file’s total duration against documented film runtimes. If the last subtitle ends way before (or after) the film’s expected runtime, the file was probably synced to a different version.

I documented reference runtimes for all 22 films from IMDb and Blu-ray releases, then compared each subtitle file against them (a minimal code sketch of the check follows the thresholds):

Validation Thresholds:

  • < 2% timing drift: ✅ PASS
  • 2-5% timing drift: ⚠️ WARN
  • > 5% timing drift: ❌ FAIL
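
A minimal sketch of that check, assuming plain .srt files and a hand-maintained runtime table (the film keys and runtime values here are illustrative; the real table covers all 22 films):

    import re
    from pathlib import Path

    # Illustrative reference runtimes in minutes, documented from IMDb / Blu-ray specs.
    REFERENCE_RUNTIMES = {
        "spirited_away": 125,
        "kikis_delivery_service": 103,
    }

    # Matches SRT timestamps like 01:58:03,456 (comma or dot before milliseconds).
    TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})")

    def last_cue_end_minutes(srt_path: Path) -> float:
        """Return the end time of the final subtitle cue, in minutes."""
        text = srt_path.read_text(encoding="utf-8", errors="replace")
        stamps = TIMESTAMP.findall(text)
        if not stamps:
            raise ValueError(f"no timestamps found in {srt_path}")
        h, m, s, ms = (int(x) for x in stamps[-1])
        return h * 60 + m + s / 60 + ms / 60_000

    def validate(srt_path: Path, film_key: str) -> str:
        """Classify a subtitle file as PASS / WARN / FAIL by timing drift."""
        expected = REFERENCE_RUNTIMES[film_key]
        actual = last_cue_end_minutes(srt_path)
        drift_pct = abs(actual - expected) / expected * 100
        if drift_pct < 2:
            return "PASS"
        if drift_pct <= 5:
            return "WARN"
        return "FAIL"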

Baseline Results (134 files across 22 films × ~6 languages):

  • PASS: 56 files (41.8%)
  • WARN/FAIL: 78 files (58.2%)

Nearly 60% of my subtitle files had timing issues significant enough to affect analysis.

Phase 1: English Priority Films

Rather than trying to fix everything at once, I focused on 14 English priority films — the ones most likely to be featured in the portfolio.

Strategy:

  • Use OpenSubtitles API with quality filters
  • Prioritize: high download count, verified uploaders, runtime matching (a ranking sketch follows this list)
  • Validate immediately after acquisition
  • Refine if initial file fails
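
A rough sketch of how that prioritization can be scored. The field names (runtime_min, from_verified_uploader, download_count) are stand-ins for whatever the OpenSubtitles search response actually returns, not the real API schema:

    def score_candidate(candidate: dict, expected_runtime_min: float) -> float:
        """Rank a subtitle search result: runtime match first, trust second, popularity last."""
        runtime = candidate.get("runtime_min")            # assumed field name
        if runtime:
            drift = abs(runtime - expected_runtime_min) / expected_runtime_min
        else:
            drift = 1.0                                   # unknown runtime: treat as worst case
        runtime_score = max(0.0, 1.0 - drift * 20)        # 5% drift or more scores zero
        trust_score = 1.0 if candidate.get("from_verified_uploader") else 0.0
        popularity = min(candidate.get("download_count", 0) / 10_000, 1.0)
        # Runtime alignment is weighted heaviest: a popular file synced
        # to the wrong cut is useless for timestamp-based analysis.
        return 0.6 * runtime_score + 0.25 * trust_score + 0.15 * popularity

    def pick_best(candidates: list[dict], expected_runtime_min: float) -> dict:
        """Choose the highest-scoring candidate from a search result list."""
        return max(candidates, key=lambda c: score_candidate(c, expected_runtime_min))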

Results:

  • 14/14 films acquired successfully
  • 14/14 achieved PASS status after refinement
  • Average timing drift: <1%
  • Best case: 0.005% drift (Kiki’s Delivery Service)

Key insight: Download count ≠ quality. Subtitles with fewer downloads were often more accurate than popular ones, likely because they came from specific Blu-ray releases rather than generic versions.

Phase 2: Multi-Language Expansion

With the methodology proven, I scaled to non-English languages: French, Spanish, Dutch, and Arabic.

Challenges:

  • Not all languages available for all films
  • More runtime variations across international releases
  • API rate limiting at 40 requests per 10 seconds (a simple throttle sketch follows this list)
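
The rate limit is easy to respect with a small sliding-window throttle. This is a generic sketch, not the project's actual client code:

    import time
    from collections import deque

    class SlidingWindowThrottle:
        """Allow at most max_requests calls per window_seconds, blocking when full."""

        def __init__(self, max_requests: int = 40, window_seconds: float = 10.0):
            self.max_requests = max_requests
            self.window_seconds = window_seconds
            self.sent: deque = deque()   # monotonic timestamps of recent requests

        def wait(self) -> None:
            """Call before each API request; sleeps only when the window is full."""
            now = time.monotonic()
            # Drop timestamps that have aged out of the window.
            while self.sent and now - self.sent[0] >= self.window_seconds:
                self.sent.popleft()
            if len(self.sent) >= self.max_requests:
                time.sleep(self.window_seconds - (now - self.sent[0]))
            self.sent.append(time.monotonic())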

Results:

  • 49 multi-language files acquired
  • 22 PASS, 14 WARN, 27 FAIL
  • Dutch had the lowest success rate: 7.7% pass (vs. 71.4% for English)

Current Status:

  • Overall pass rate: 54.5% (72/132 files)
  • Improvement: +12.7 percentage points from baseline
  • Gap to target: 20+ percentage points remaining

Lessons Learned

1. Define “Valid” Based on Your Use Case

My validation checked runtime alignment, not content quality. A subtitle file with perfect grammar but 10% timing drift is useless for timestamp-based emotion analysis.

2. Small Drifts Don’t Break the Narrative

When I visualized the emotion timelines, I noticed something important: small runtime differences (<10 minutes) didn't meaningfully change a film's visual narrative relative to its other language versions. The emotional arcs still lined up. The peaks and valleys still told the same story.

This realization changed my approach. Instead of chasing perfect data alignment, I focused on fixing the worst timing drifts — like the subtitle file that was clearly synced to a TV broadcast version with commercial breaks baked in. Those files genuinely broke the analysis.

3. Keep the Option Open

I can always go back and source new subtitles for the failing files later. The infrastructure is there. The validation scripts work. If the issue becomes more important, the path forward is clear.

4. Document Your Quality Thresholds

Having explicit thresholds (<2% = PASS) made decisions objective. No debates about whether a file was “good enough.”

5. Some Data Quality Issues Can’t Be Fixed (Yet)

27 files still fail validation. For some film/language combinations, no suitable subtitle exists with the right runtime. That’s okay — better to analyze fewer films accurately than force bad data through the pipeline.

The Portfolio Value

This validation work demonstrates:

  1. Proactive quality identification — Found issues before they corrupted analysis
  2. Quantifiable metrics — Specific pass/warn/fail thresholds
  3. Iterative approach — Prove methodology, then scale
  4. Pragmatic scope — Accept that 54.5% is better than nothing, keep improving

What’s Next

The 27 failing files need alternative sources or manual refinement. Target: 75%+ pass rate. But the emotion analysis already works on 72 validated files across 5 languages — enough to demonstrate cross-language emotion patterns in the portfolio.

Data quality isn’t glamorous work, but it’s the foundation everything else builds on.