Building a Data Pipeline with dbt and DuckDB

How I built a transformation layer for the Spiriteddata project using dbt-core and DuckDB — featuring emotion analysis, knowledge graph edges, and data quality validation.

dbt · duckdb · data-engineering · python

When I started the Spiriteddata project — analyzing emotional arcs in Studio Ghibli films — I needed a transformation layer that was both powerful and portable. I didn’t want to spin up a BigQuery project for a passion project, but I also didn’t want to write raw SQL files without any testing or documentation.

Enter dbt + DuckDB: the modern data stack, miniaturized.

Why dbt + DuckDB?

The combination offers some compelling advantages for smaller projects:

  • Zero infrastructure: DuckDB runs in-process, no server needed
  • Full dbt features: Testing, documentation, lineage — all available
  • Blazing fast: DuckDB’s columnar engine handles analytical queries efficiently
  • Version controlled: Everything lives in Git, including the database file
  • Production-ready patterns: The same modeling techniques scale to BigQuery/Snowflake

What I Actually Built

The transformation layer grew larger than I initially expected. Here’s the real structure:

Staging models (5): Clean and standardize raw data from the Ghibli API, Kaggle, and the emotion analysis output.

Intermediate models (5): Data quality validation and knowledge graph edge generation.

Mart models (14): Analysis-ready tables for the Streamlit app — everything from emotion timeseries to film similarity matrices.
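
Roughly, the models directory looks like this (directory names follow standard dbt conventions and are assumed here; only the models mentioned in this post are listed):

models/
├── staging/          # 5 models
│   ├── stg_films.sql
│   └── ...
├── intermediate/     # 5 models
│   ├── int_emotion_data_quality_checks.sql
│   ├── int_film_character_edges.sql
│   └── ...
└── marts/            # 14 models
    ├── mart_film_emotion_timeseries.sql
    ├── mart_cross_language_emotion_comparison.sql
    ├── mart_director_emotion_profile.sql
    ├── mart_film_similarity_matrix.sql
    └── ...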

Setting Up the Connection

The profiles.yml configuration:

ghibli:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: ../../data/ghibli.duckdb
      schema: main
      threads: 4

That’s it. No credentials, no network configuration.

Staging Layer: Standardize the Raw Data

The staging models clean and type-cast raw data. Here’s a simplified version of stg_films.sql:

{{
  config(
    materialized='view',
    schema='staging'
  )
}}

SELECT
    id,
    title,
    CAST(release_date AS INTEGER) AS release_year,
    CAST(running_time AS INTEGER) AS running_time,
    CAST(rt_score AS INTEGER) AS rt_score,
    COALESCE(director, '') AS director,
    loaded_at,
    source
FROM {{ source('raw', 'films') }}

The actual models also handle arrays (for character/location relationships) and include more columns, but the pattern is straightforward: cast types, handle nulls, standardize names.
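
For the array handling, DuckDB’s UNNEST does the heavy lifting: it explodes an array column into one row per element. A minimal sketch, assuming the raw films table exposes a people array of character URLs (the real model differs):

-- Sketch only: one row per (film, character URL), assuming a `people` array column
SELECT
    id AS film_id,
    UNNEST(people) AS character_url
FROM {{ source('raw', 'films') }}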

Intermediate Layer: Validation and Graph Edges

The intermediate layer does two things:

1. Data Quality Validation — int_emotion_data_quality_checks.sql validates that emotion data doesn’t extend beyond the film’s runtime (sketched after this list). This catches subtitle version mismatches (a real problem I encountered — see my data quality post).

2. Knowledge Graph Edges — Models like int_film_character_edges.sql create relationship tables connecting characters to films, locations to films, and characters to species. These power a NetworkX knowledge graph (a feature I built but ultimately didn’t emphasize in the portfolio).
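
Here is a minimal sketch of the runtime check from item 1. The stg_emotions model name and the join key are assumptions rather than the actual code; the idea is to compare the latest emotion timestamp per film and language against the film’s runtime:

WITH emotion_extent AS (
    SELECT
        film_slug,
        language_code,
        MAX(minute_offset) AS max_minute
    FROM {{ ref('stg_emotions') }}  -- assumed staging model name
    GROUP BY film_slug, language_code
)

SELECT
    e.film_slug,
    e.language_code,
    e.max_minute,
    f.running_time,
    e.max_minute > f.running_time AS exceeds_runtime  -- flags subtitle/runtime mismatches
FROM emotion_extent e
JOIN {{ ref('stg_films') }} f
    ON e.film_slug = f.film_slug  -- assumed join key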

Mart Layer: Analysis-Ready Tables

The mart layer is where the real value lives. Some highlights:

mart_film_emotion_timeseries — The core table powering emotion visualizations. Applies a 10-minute rolling window to smooth 28 emotion dimensions:

-- Simplified from actual 270-line model
SELECT
    film_slug,
    language_code,
    minute_offset,
    
    AVG(emotion_joy) OVER (
        PARTITION BY film_slug
        ORDER BY minute_offset
        ROWS BETWEEN 5 PRECEDING AND 5 FOLLOWING
    ) AS emotion_joy,
    
    -- ... repeat for all 28 emotions
    
FROM film_with_metadata
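
One thing the snippet glosses over: writing that window function 28 times by hand gets tedious. A Jinja loop can generate the columns instead. This is a sketch rather than the actual model, assuming the columns follow an emotion_<name> pattern:

{% set emotions = ['joy', 'sadness', 'fear', 'anger'] %}  -- abbreviated; the real list has 28

SELECT
    film_slug,
    language_code,
    minute_offset,
    {% for emotion in emotions %}
    -- window spec copied from the simplified snippet above
    AVG(emotion_{{ emotion }}) OVER (
        PARTITION BY film_slug
        ORDER BY minute_offset
        ROWS BETWEEN 5 PRECEDING AND 5 FOLLOWING
    ) AS emotion_{{ emotion }}{% if not loop.last %},{% endif %}
    {% endfor %}
FROM film_with_metadata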

mart_cross_language_emotion_comparison — Compares emotion scores between language pairs to identify translation biases (e.g., Arabic subtitles showed 87% higher amusement scores than English).
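
The gist of that comparison, sketched with assumed column names rather than the real model: average each emotion per film and language, then self-join on the film to compare language pairs:

WITH per_language AS (
    SELECT
        film_slug,
        language_code,
        AVG(emotion_amusement) AS avg_amusement
    FROM {{ ref('mart_film_emotion_timeseries') }}
    GROUP BY film_slug, language_code
)

SELECT
    a.film_slug,
    a.language_code AS language_a,
    b.language_code AS language_b,
    a.avg_amusement / NULLIF(b.avg_amusement, 0) - 1 AS relative_difference
FROM per_language a
JOIN per_language b
    ON a.film_slug = b.film_slug
   AND a.language_code <> b.language_code  -- every ordered language pair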

mart_director_emotion_profile — Aggregates emotions by director to compare “signature styles” between Miyazaki and Takahata.

mart_film_similarity_matrix — Pre-computed cosine similarity for “films like this” recommendations.
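
The idea behind the similarity matrix is simple: build an average emotion vector per film, then compute pairwise cosine similarity. A sketch with three of the 28 dimensions and assumed inputs:

WITH film_vectors AS (
    SELECT
        film_slug,
        AVG(emotion_joy)     AS joy,
        AVG(emotion_sadness) AS sadness,
        AVG(emotion_fear)    AS fear
    FROM {{ ref('mart_film_emotion_timeseries') }}
    GROUP BY film_slug
)

SELECT
    a.film_slug AS film_a,
    b.film_slug AS film_b,
    (a.joy * b.joy + a.sadness * b.sadness + a.fear * b.fear)
        / NULLIF(
            SQRT(a.joy * a.joy + a.sadness * a.sadness + a.fear * a.fear)
          * SQRT(b.joy * b.joy + b.sadness * b.sadness + b.fear * b.fear),
            0
        ) AS cosine_similarity
FROM film_vectors a
JOIN film_vectors b
    ON a.film_slug < b.film_slug  -- each unordered pair once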

Testing and Documentation

dbt’s testing framework catches data quality issues. The schema.yml files define tests like:

models:
  - name: mart_film_emotion_timeseries
    columns:
      - name: emotion_joy
        tests:
          - not_null
      - name: language_code
        tests:
          - accepted_values:
              values: ['en', 'fr', 'es', 'nl', 'ar']

Running dbt docs generate && dbt docs serve creates a documentation site with the full DAG — incredibly useful when I returned to the project after a break.

Running the Pipeline

The workflow:

# 1. Run Python NLP pipeline (generates emotion analysis)
python -m src.nlp.analyze_emotions

# 2. Navigate to transformation directory
cd src/transformation

# 3. Run dbt transformations
dbt run

# 4. Test data quality
dbt test

# 5. Generate documentation
dbt docs generate
dbt docs serve

What I Learned

1. Start simple, add complexity as needed
I started with just staging and marts. The intermediate layer and many mart models came later as I discovered new analysis needs.

2. Data quality validation is worth the investment
The runtime validation model caught real issues with subtitle version mismatches. Without it, I would have analyzed emotions against wrong timestamps.

3. dbt docs are genuinely useful
Even on a solo project, the generated documentation helped me remember transformation logic weeks later.

4. DuckDB handles more than expected
With ~100,000 dialogue entries and 28 emotion dimensions, I expected performance issues. DuckDB processed the entire pipeline in under 30 seconds.

The Result

The pipeline feeds into a Streamlit app at spiriteddata.streamlit.app. The emotion timeseries, cross-language comparisons, and director profiles all come from these dbt marts.

If you’re building a data project and don’t need cloud infrastructure, dbt + DuckDB is a powerful combination.