When to Kill Your AI Project: The Sora Deprecation Story

Why I deprecated an AI chatbot that technically worked — a lesson in knowing when the right technical solution isn't the right product solution

Tags: ai, rag, langchain, engineering-decisions, spiriteddata

I built an AI assistant that technically worked. It could answer complex questions about emotional patterns in Studio Ghibli films and provide context-aware responses with dialogue citations.

Then I deprecated it.

This is the story of Sora — a RAG-powered chatbot that I intentionally killed despite it showing promise. It’s about knowing when the right technical solution isn’t the right product solution.

What Sora Was

Sora was an AI assistant built on top of the Spiriteddata emotional analysis pipeline. The vision: let users ask natural language questions about Ghibli films instead of manually exploring charts.

Example Queries Sora Could Answer:

  • “Which film has the highest emotional volatility?”
  • “How does joy compare between English and French versions of Spirited Away?”
  • “What are the top 3 emotional peak moments in Howl’s Moving Castle?”
  • “Do Miyazaki and Takahata films differ in sadness patterns?”

These questions required:

  • Cross-film aggregation (comparing metrics across 22 films)
  • Cross-language correlation (emotion arcs in 5 languages)
  • Timestamp precision (finding specific emotional peaks)
  • Director comparison (Miyazaki vs. Takahata patterns)

Traditional dashboards struggle with this kind of flexible, multi-dimensional querying. A chatbot seemed perfect.

The Architecture: RAG on Rails

I built Sora using LangChain as the orchestration framework, with OpenAI GPT-4 as the reasoning engine and ChromaDB for vector embeddings.

The Stack

from langchain.chat_models import ChatOpenAI
from langchain.agents import initialize_agent, AgentType
from langchain.tools import Tool
from chromadb import Client

# Initialize LLM
llm = ChatOpenAI(model="gpt-4", temperature=0)

# Initialize vector store
chroma_client = Client()
collection = chroma_client.create_collection(
    name="ghibli_emotions",
    metadata={"description": "Emotion analysis results across 22 films"}
)

# Agent with custom tools (detailed below)
agent = initialize_agent(
    tools=custom_tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)
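
To make that concrete, here's roughly how a question flows through the agent once the tools are registered. A minimal sketch, assuming the custom_tools list defined in the next section is in scope; the invocation itself isn't part of the setup code above.

# Illustrative only: the agent picks the relevant tool(s),
# calls them, and synthesizes a natural-language answer.
question = "Which film has the highest emotional volatility?"
answer = agent.run(question)
print(answer)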

The 6 Custom Tools

Instead of having the LLM query a database directly, I gave it domain-specific tools:

1. query_film_emotions — Get emotion timeline for a specific film + language

def query_film_emotions(film_slug: str, language: str, emotion: str) -> dict:
    """
    Retrieve emotion scores over time for a specific film/language/emotion.
    
    Returns: {
        "film": str,
        "language": str,
        "emotion": str,
        "timeline": [(minute, score), ...]
    }
    """
    # Query DuckDB emotion mart
    query = f"""
    SELECT minute_offset, emotion_{emotion}_smoothed
    FROM mart_film_emotion_timeseries
    WHERE film_slug = ? AND language_code = ?
    ORDER BY minute_offset
    """
    results = duckdb_conn.execute(query, [film_slug, language]).fetchall()
    
    return {
        "film": film_slug,
        "language": language,
        "emotion": emotion,
        "timeline": results
    }

2. calculate_cross_language_correlation — Compare emotion patterns across translations

def calculate_cross_language_correlation(
    film_slug: str,
    emotion: str,
    lang1: str = "en",
    lang2: str = "fr"
) -> float:
    """
    Calculate Pearson correlation between emotion arcs in two languages.
    """
    arc1 = get_emotion_timeline(film_slug, lang1, emotion)
    arc2 = get_emotion_timeline(film_slug, lang2, emotion)
    
    # Align the two arcs on minute offset before computing correlation
    merged = arc1.merge(arc2, on="minute_offset", suffixes=("_1", "_2"))
    return merged["smoothed_value_1"].corr(merged["smoothed_value_2"])

3. find_emotional_peaks — Identify top N moments for an emotion

def find_emotional_peaks(film_slug: str, emotion: str, top_n: int = 3) -> list:
    """
    Find the strongest peaks for a given emotion in a film.
    
    Returns list of (timestamp, score, context) tuples.
    """
    # Get emotion timeline
    timeline = query_film_emotions(film_slug, "en", emotion)
    
    # Detect local maxima in the emotion curve
    peaks = detect_peaks(timeline['timeline'], prominence=0.1)
    
    # Keep the top N peaks by score
    top_peaks = sorted(peaks, key=lambda x: x[1], reverse=True)[:top_n]
    
    # Attach dialogue context for each peak
    return [
        (minute, score, get_dialogue_at_minute(film_slug, minute))
        for minute, score in top_peaks
    ]
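
detect_peaks is a small helper not shown above. A rough sketch using scipy.signal.find_peaks, assuming the timeline is a list of (minute, score) pairs as returned by query_film_emotions:

from scipy.signal import find_peaks

def detect_peaks(timeline: list, prominence: float = 0.1) -> list:
    """Return (minute, score) pairs at local maxima of an emotion curve."""
    minutes = [point[0] for point in timeline]
    scores = [point[1] for point in timeline]
    # Indices where the curve forms a local maximum with at least
    # the requested prominence
    peak_indices, _ = find_peaks(scores, prominence=prominence)
    return [(minutes[i], scores[i]) for i in peak_indices]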

4. compare_directors — Miyazaki vs. Takahata emotional styles

def compare_directors(emotion: str, metric: str = "average") -> dict:
    """
    Compare emotion metrics between Miyazaki and Takahata films.
    """
    miyazaki_films = get_films_by_director("Hayao Miyazaki")
    takahata_films = get_films_by_director("Isao Takahata")
    
    miyazaki_scores = [
        calculate_film_metric(film, emotion, metric)
        for film in miyazaki_films
    ]
    
    takahata_scores = [
        calculate_film_metric(film, emotion, metric)
        for film in takahata_films
    ]
    
    return {
        "miyazaki_avg": np.mean(miyazaki_scores),
        "takahata_avg": np.mean(takahata_scores),
        "difference": np.mean(miyazaki_scores) - np.mean(takahata_scores)
    }

5. get_film_metadata — Basic film info (runtime, languages, etc.)

6. search_similar_moments — Vector similarity search for dialogue
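
For a sense of how the vector search and the agent wiring fit together, here's a rough sketch (simplified, not the exact implementation): search_similar_moments querying the ChromaDB collection from the stack above, and the domain functions wrapped into the custom_tools list the agent expects.

def search_similar_moments(query_text: str, top_k: int = 5) -> list:
    """Semantic search over dialogue snippets embedded in ChromaDB."""
    # Assumes dialogue lines were embedded into `collection` with
    # film/minute metadata at ingestion time.
    results = collection.query(query_texts=[query_text], n_results=top_k)
    return [
        {"dialogue": doc, "metadata": meta}
        for doc, meta in zip(results["documents"][0], results["metadatas"][0])
    ]

# Register the domain functions as tools the agent can call
custom_tools = [
    Tool(
        name="search_similar_moments",
        func=search_similar_moments,
        description="Find dialogue moments semantically similar to a phrase.",
    ),
    # ... the other five functions are wrapped the same way
]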

The Validation: Mixed Results

I created a test set of 10 sentiment-focused queries across different complexity levels:

| Question Type | Count | Example |
| --- | --- | --- |
| Sentiment analysis | 1 | “Show me the sentiment curve for Spirited Away” |
| Correlation study | 3 | “Calculate correlation between sentiment and revenue” |
| Trajectory analysis | 2 | “Do rising sentiment films perform better with critics?” |
| Multilingual | 1 | “Compare sentiment arcs across English, French, and Spanish” |
| Success prediction | 3 | “Correlation between peak emotions and commercial success” |
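
Conceptually, the validation harness just ran each test query through the agent, timed the response, and checked it against an expected result. A rough sketch; the passes() check below is a stand-in for the per-query validation logic, which isn't shown here:

import time

test_queries = [
    "Show me the sentiment curve for Spirited Away",
    "Compare sentiment arcs across English, French, and Spanish",
    # ... the remaining eight queries
]

passed, timings = 0, []
for query in test_queries:
    start = time.time()
    answer = agent.run(query)
    timings.append(time.time() - start)
    if passes(query, answer):  # hypothetical per-query check
        passed += 1

print(f"Pass rate: {passed}/{len(test_queries)}")
print(f"Average response time: {sum(timings) / len(timings):.1f}s")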

Results:

  • Queries passed: 5/10 (50%)
  • Overall validation score: 55.2%
  • Average response time: ~15 seconds

The pattern was clear: Sora excelled at straightforward sentiment queries and multilingual comparisons (100% pass rate on those categories), but struggled with complex correlation studies and success prediction queries (33% pass rate).

The failures revealed fundamental limitations:

  • Complex multi-step analytical queries required too much context for reliable answers
  • Some queries hit rate limits due to token-heavy responses
  • Statistical analysis outputs were inconsistent

50% pass rate for a RAG system answering analytical questions? Not terrible—but the dashboards already provide these insights instantly and reliably.

So Why Deprecate It?

Here’s the harsh truth: the interactive visualizations answered the same questions faster, cheaper, and more reliably.

Cost Comparison

| Metric | Sora (Chatbot) | Interactive Dashboards |
| --- | --- | --- |
| Baseline cost | ~$0.04/query (GPT-4-turbo) | $0/query (static) |
| Response time | 10-20 seconds | Instant |
| Monthly cost (1000 users) | ~$400-600 | $0 (static hosting) |
| Reliability | ~50% pass rate | 100% (user-controlled) |

(The monthly figure assumes each user runs roughly 10-15 queries at ~$0.04 each.)

User Experience Reality

I asked a few colleagues for feedback on Sora. Here’s what I learned:

Problem 1: The Discoverability Gap

Users didn’t know what questions to ask. They’d type vague queries like “tell me about Spirited Away” and Sora would ask for clarification, creating friction.

With the visual dashboard, exploration is guided. You see film thumbnails, click one, see emotion timelines. No need to know the “right question.”

Problem 2: Trust Issues

When Sora said “Howl’s Moving Castle has the highest emotional volatility,” users wanted proof. They’d ask “how do you know?” and Sora would cite data sources, but…

…they’re already looking at a screen. Why not just show the chart?

The interactive viz lets you:

  • See the raw timeline
  • Hover over peaks to see exact values
  • Compare multiple films side-by-side

Trust comes from transparency. Chatbots abstract; dashboards reveal.

Problem 3: The Iteration Tax

Analytical work is iterative. You ask a question, see an answer, refine the question.

With Sora:

  1. Ask question (10-20 sec wait)
  2. Get answer
  3. Refine question (10-20 sec wait)
  4. Repeat

With dashboards:

  1. Select film (instant)
  2. See all emotions (instant)
  3. Toggle languages (instant)
  4. Compare directors (instant)

The latency compounds. Every iteration costs seconds of wait time. Dashboards give you immediate feedback loops.

The Decision Framework: When to Kill It

I used a simple decision matrix:

Does This Tool Solve a Problem Users Can’t Solve Another Way?

For Sora: No. The dashboards already provide all the data Sora queries.

Is the Added Value Worth the Added Complexity?

For Sora: No.

  • Added complexity: RAG pipeline, vector db, LLM costs, prompt engineering
  • Added value: Natural language interface
  • Trade-off: Not worth it when the alternative is instant, free, and more transparent

What Would It Take to Make This Valuable?

For Sora to be worth the complexity, it would need to:

  • Answer questions the dashboards can’t answer
  • Connect to external data sources (Wikipedia, IMDB, reviews)
  • Generate novel insights through reasoning

That’s a different product. Not a query interface — a research assistant.

I wasn’t building that. I was building a conversational wrapper around a database. That’s not AI’s strength.

What I Learned

1. Working Code ≠ Right Solution

A working RAG system sounds impressive. But technical capability isn’t the metric that matters. User value is.

Sora worked for some queries. It failed as a product.

2. AI Excels at Transformation, Not Translation

LLMs are powerful when they:

  • Transform information (summarize, rewrite, synthesize)
  • Generate new content (write, ideate, create)
  • Reason through ambiguity (interpret, infer, decide)

They’re weak when they:

  • Translate structured queries (SQL in disguise)
  • Retrieve exact information (databases do this better)
  • Replace existing good UX (dashboards work great)

Sora was translation. Not transformation.

3. Choose the Right Tool for the Job

I love AI. But it’s not always the answer.

| Scenario | Right Tool |
| --- | --- |
| Analytical exploration | Interactive dashboards |
| Exact data lookup | Database queries or search |
| Open-ended research | LLMs + RAG |
| Creative generation | LLMs |
| Ambiguous interpretation | LLMs |

Know your tool’s strengths. Don’t use AI because it’s cool — use it because it’s the best solution.

4. Deprecation is a Feature

Killing Sora wasn’t failure. It was good engineering judgment.

I built it to test a hypothesis: “Can a chatbot make emotion exploration easier?”

Answer: No. Dashboards win.

Now I know. Hypothesis tested. Time to move on.

5. Build MVPs to Learn, Not to Ship

Sora took ~15 hours to build. If I’d committed to “building the perfect RAG system,” I’d still be iterating on prompt engineering.

Instead:

  • Built MVP (15 hours)
  • Ran validation (2 hours)
  • Asked colleagues for feedback (3 hours)
  • Made deprecation decision (1 hour)

Total: ~20 hours to decisively answer: “Is this valuable?”

That’s efficient product development.

The Memorial: Sora Lives On as Reference

I didn’t delete the code. It’s preserved in the Spiriteddata project under “Memories of Sora” — a dedicated section explaining:

  • What Sora was
  • How it worked
  • Why I deprecated it
  • What I learned

It’s a case study in engineering decision-making. Sometimes the best code you write is the code you choose not to use.


Sora taught me more by failing softly than it would have by succeeding marginally.

The next time you’re building an AI feature, ask:

  • Does this transform information or just translate it?
  • Is the value worth the complexity?
  • Could I solve this simpler?

Sometimes the right answer is hitting delete.