Building a Data Pipeline with dbt and DuckDB
How I built a transformation layer for the Spiriteddata project using dbt-core and DuckDB — featuring emotion analysis, knowledge graph edges, and data quality validation.
When I started the Spiriteddata project — analyzing emotional arcs in Studio Ghibli films — I needed a transformation layer that was both powerful and portable. I didn’t want to spin up a BigQuery project for a passion project, but I also didn’t want to write raw SQL files without any testing or documentation.
Enter dbt + DuckDB: the modern data stack, miniaturized.
Why dbt + DuckDB?
The combination offers some compelling advantages for smaller projects:
- Zero infrastructure: DuckDB runs in-process, no server needed
- Full dbt features: Testing, documentation, lineage — all available
- Blazing fast: DuckDB’s columnar engine handles analytical queries efficiently
- Version controlled: Everything lives in Git, including the database file
- Production-ready patterns: The same modeling techniques scale to BigQuery/Snowflake
What I Actually Built
The transformation layer grew larger than I initially expected. Here’s the real structure:
Staging models (5): Clean and standardize raw data from the Ghibli API, Kaggle, and the emotion analysis pipeline.
Intermediate models (5): Data quality validation and knowledge graph edge generation.
Mart models (14): Analysis-ready tables for the Streamlit app — everything from emotion timeseries to film similarity matrices.
Setting Up the Connection
The profiles.yml configuration:
ghibli:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: ../../data/ghibli.duckdb
      schema: main
      threads: 4
That’s it. No credentials, no network configuration.
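One nice side effect: because the target is just a file, you can open it directly with the DuckDB CLI (duckdb data/ghibli.duckdb from the repository root) and poke at whatever dbt built. A quick sanity check, assuming the marts land in the default main schema (adjust to whatever schema names dbt reports during the run):
-- Quick look at what dbt built; schema and table names are assumptions.
SELECT film_slug, COUNT(*) AS minutes_of_dialogue
FROM main.mart_film_emotion_timeseries
GROUP BY film_slug
ORDER BY minutes_of_dialogue DESC;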
Staging Layer: Standardize the Raw Data
The staging models clean and type-cast raw data. Here’s a simplified version of stg_films.sql:
{{
    config(
        materialized='view',
        schema='staging'
    )
}}

SELECT
    id,
    title,
    CAST(release_date AS INTEGER) AS release_year,
    CAST(running_time AS INTEGER) AS running_time,
    CAST(rt_score AS INTEGER) AS rt_score,
    COALESCE(director, '') AS director,
    loaded_at,
    source
FROM {{ source('raw', 'films') }}
The actual models also handle arrays (for character/location relationships) and include more columns, but the pattern is straightforward: cast types, handle nulls, standardize names.
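For those array columns (the Ghibli API returns people and locations as lists of URLs), DuckDB's UNNEST does the heavy lifting. A rough sketch of the pattern, with column names assumed rather than taken from the real model:
-- Sketch: explode the people array into one row per film/character pair.
-- The people column name follows the Ghibli API payload; adjust to the real raw table.
SELECT
    id AS film_id,
    UNNEST(people) AS person_url
FROM {{ source('raw', 'films') }}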
Intermediate Layer: Validation and Graph Edges
The intermediate layer does two things:
1. Data Quality Validation — int_emotion_data_quality_checks.sql validates that emotion data doesn’t extend beyond film runtime. This catches subtitle version mismatches (a real problem I encountered — see my data quality post). A simplified sketch of this check follows the list.
2. Knowledge Graph Edges — Models like int_film_character_edges.sql create relationship tables connecting characters to films, locations to films, and characters to species. These power a NetworkX knowledge graph (a feature I built but ultimately didn’t emphasize in the portfolio).
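The validation model mentioned in the first item boils down to comparing the last emotion timestamp per film and language against the film's runtime. A minimal sketch, with model and column names assumed rather than taken from the real int_emotion_data_quality_checks.sql:
-- Sketch: surface films/languages whose emotion timestamps run past the film's runtime.
-- Model and column names here are assumptions; the real check is more thorough.
SELECT
    e.film_slug,
    e.language_code,
    MAX(e.minute_offset) AS last_emotion_minute,
    f.running_time
FROM {{ ref('stg_emotions') }} AS e
JOIN {{ ref('stg_films') }} AS f
    ON e.film_slug = f.film_slug
GROUP BY e.film_slug, e.language_code, f.running_time
HAVING MAX(e.minute_offset) > f.running_time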
Mart Layer: Analysis-Ready Tables
The mart layer is where the real value lives. Some highlights:
mart_film_emotion_timeseries — The core table powering emotion visualizations. Applies a rolling 10-minute window (5 minutes either side of each minute) to smooth 28 emotion dimensions:
-- Simplified from actual 270-line model
SELECT
    film_slug,
    language_code,
    minute_offset,
    AVG(emotion_joy) OVER (
        PARTITION BY film_slug, language_code
        ORDER BY minute_offset
        ROWS BETWEEN 5 PRECEDING AND 5 FOLLOWING
    ) AS emotion_joy,
    -- ... repeat for all 28 emotions
FROM film_with_metadata
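Writing that window expression out 28 times by hand gets repetitive. The real model is 270 lines and I'm not claiming this is how it's written, but a Jinja loop over the emotion columns is one way to keep a model like this readable (the emotion list below is a stand-in, not the full set):
-- Sketch: generate the smoothed columns with a Jinja loop instead of copy-paste.
{% set emotions = ['joy', 'sadness', 'fear', 'amusement'] %}
SELECT
    film_slug,
    language_code,
    minute_offset,
    {% for emotion in emotions %}
    AVG(emotion_{{ emotion }}) OVER (
        PARTITION BY film_slug, language_code
        ORDER BY minute_offset
        ROWS BETWEEN 5 PRECEDING AND 5 FOLLOWING
    ) AS emotion_{{ emotion }}{{ "," if not loop.last }}
    {% endfor %}
FROM film_with_metadata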
mart_cross_language_emotion_comparison — Compares emotion scores between language pairs to identify translation biases (e.g., Arabic subtitles showed 87% higher amusement scores than English).
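Under the hood, a comparison like that is mostly aggregation plus a self-join on the film. A simplified sketch of the idea, with model and column names assumed:
-- Sketch: compare average amusement per film between English and Arabic subtitles.
WITH per_language AS (
    SELECT
        film_slug,
        language_code,
        AVG(emotion_amusement) AS avg_amusement
    FROM {{ ref('mart_film_emotion_timeseries') }}
    GROUP BY film_slug, language_code
)
SELECT
    en.film_slug,
    ar.avg_amusement / NULLIF(en.avg_amusement, 0) - 1 AS amusement_lift_ar_vs_en
FROM per_language AS en
JOIN per_language AS ar
    ON en.film_slug = ar.film_slug
WHERE en.language_code = 'en'
  AND ar.language_code = 'ar'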
mart_director_emotion_profile — Aggregates emotions by director to compare “signature styles” between Miyazaki and Takahata.
mart_film_similarity_matrix — Pre-computed cosine similarity for “films like this” recommendations.
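Cosine similarity is perfectly doable in plain SQL: a self-join over a long-format film-by-emotion table and a few aggregates. A sketch under that assumption (film_emotion_vectors is a hypothetical long-format relation, not the real model):
-- Sketch: cosine similarity between films over their average emotion vectors.
-- film_emotion_vectors(film_slug, emotion, value) is an assumed long-format input.
SELECT
    a.film_slug AS film_a,
    b.film_slug AS film_b,
    SUM(a.value * b.value)
        / (SQRT(SUM(a.value * a.value)) * SQRT(SUM(b.value * b.value))) AS cosine_similarity
FROM film_emotion_vectors AS a
JOIN film_emotion_vectors AS b
    ON a.emotion = b.emotion
   AND a.film_slug < b.film_slug
GROUP BY a.film_slug, b.film_slug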
Testing and Documentation
dbt’s testing framework catches data quality issues. The schema.yml files define tests like:
models:
  - name: mart_film_emotion_timeseries
    columns:
      - name: emotion_joy
        tests:
          - not_null
      - name: language_code
        tests:
          - accepted_values:
              values: ['en', 'fr', 'es', 'nl', 'ar']
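Generic tests like these cover most cases, but dbt also supports singular tests: standalone SQL files in the tests/ directory that fail if they return any rows. A small sketch of one that would fit this project (file name and columns are assumptions):
-- tests/assert_emotion_scores_in_range.sql (sketch)
-- dbt treats any returned row as a test failure.
SELECT film_slug, minute_offset, emotion_joy
FROM {{ ref('mart_film_emotion_timeseries') }}
WHERE emotion_joy < 0 OR emotion_joy > 1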
Running dbt docs generate && dbt docs serve creates a documentation site with the full DAG — incredibly useful when I returned to the project after a break.
Running the Pipeline
The workflow:
# 1. Run Python NLP pipeline (generates emotion analysis)
python -m src.nlp.analyze_emotions
# 2. Navigate to transformation directory
cd src/transformation
# 3. Run dbt transformations
dbt run
# 4. Test data quality
dbt test
# 5. Generate documentation
dbt docs generate
dbt docs serve
What I Learned
1. Start simple, add complexity as needed
I started with just staging and marts. The intermediate layer and many mart models came later as I discovered new analysis needs.
2. Data quality validation is worth the investment
The runtime validation model caught real issues with subtitle version mismatches. Without it, I would have analyzed emotions against the wrong timestamps.
3. dbt docs are genuinely useful
Even on a solo project, the generated documentation helped me remember transformation logic weeks later.
4. DuckDB handles more than expected
With ~100,000 dialogue entries and 28 emotion dimensions, I expected performance issues. DuckDB processed the entire pipeline in under 30 seconds.
The Result
The pipeline feeds into a Streamlit app at spiriteddata.streamlit.app. The emotion timeseries, cross-language comparisons, and director profiles all come from these dbt marts.
If you’re building a data project and don’t need cloud infrastructure, dbt + DuckDB is a powerful combination.