When to Kill Your AI Project: The Sora Deprecation Story
Why I deprecated an AI chatbot that technically worked — a lesson in knowing when the right technical solution isn't the right product solution
I built an AI assistant that technically worked. It could answer complex questions about emotional patterns in Studio Ghibli films and provide context-aware responses with dialogue citations.
Then I deprecated it.
This is the story of Sora — a RAG-powered chatbot that I intentionally killed even though it showed promise. It’s about knowing when the right technical solution isn’t the right product solution.
What Sora Was
Sora was an AI assistant built on top of the Spiriteddata emotional analysis pipeline. The vision: let users ask natural language questions about Ghibli films instead of manually exploring charts.
Example Queries Sora Could Answer:
- “Which film has the highest emotional volatility?”
- “How does joy compare between English and French versions of Spirited Away?”
- “What are the top 3 emotional peak moments in Howl’s Moving Castle?”
- “Do Miyazaki and Takahata films differ in sadness patterns?”
These questions required:
- Cross-film aggregation (comparing metrics across 22 films)
- Cross-language correlation (emotion arcs in 5 languages)
- Timestamp precision (finding specific emotional peaks)
- Director comparison (Miyazaki vs. Takahata patterns)
Traditional dashboards struggle with this kind of flexible, multi-dimensional querying. A chatbot seemed perfect.
The Architecture: RAG on Rails
I built Sora using LangChain as the orchestration framework, with OpenAI GPT-4 as the reasoning engine and ChromaDB for vector embeddings.
The Stack
from langchain.chat_models import ChatOpenAI
from langchain.agents import initialize_agent, AgentType
from langchain.tools import Tool
from chromadb import Client
# Initialize LLM
llm = ChatOpenAI(model="gpt-4", temperature=0)
# Initialize vector store
chroma_client = Client()
collection = chroma_client.create_collection(
    name="ghibli_emotions",
    metadata={"description": "Emotion analysis results across 22 films"}
)
# Agent with custom tools (detailed below)
agent = initialize_agent(
    tools=custom_tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)
The 6 Custom Tools
Instead of having the LLM query a database directly, I gave it domain-specific tools:
1. query_film_emotions — Get emotion timeline for a specific film + language
def query_film_emotions(film_slug: str, language: str, emotion: str) -> dict:
    """
    Retrieve emotion scores over time for a specific film/language/emotion.
    Returns: {
        "film": str,
        "language": str,
        "emotion": str,
        "timeline": [(minute, score), ...]
    }
    """
    # Query DuckDB emotion mart
    query = f"""
        SELECT minute_offset, emotion_{emotion}_smoothed
        FROM mart_film_emotion_timeseries
        WHERE film_slug = ? AND language_code = ?
        ORDER BY minute_offset
    """
    results = duckdb_conn.execute(query, [film_slug, language]).fetchall()
    return {
        "film": film_slug,
        "language": language,
        "emotion": emotion,
        "timeline": results
    }
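The tool assumes a module-level duckdb_conn handle pointing at the DuckDB emotion mart. A minimal sketch of that setup plus an example call (the file name here is an assumption, not the project’s actual path):
import duckdb
# Hypothetical DuckDB file holding the emotion marts
duckdb_conn = duckdb.connect("spiriteddata_emotions.duckdb", read_only=True)
# Example call: joy timeline for the English track of Spirited Away
timeline = query_film_emotions("spirited-away", "en", "joy")
print(timeline["timeline"][:5])  # first five (minute, score) rows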
2. calculate_cross_language_correlation — Compare emotion patterns across translations
def calculate_cross_language_correlation(
    film_slug: str,
    emotion: str,
    lang1: str = "en",
    lang2: str = "fr"
) -> float:
    """
    Calculate Pearson correlation between emotion arcs in two languages.
    """
    arc1 = get_emotion_timeline(film_slug, lang1, emotion)
    arc2 = get_emotion_timeline(film_slug, lang2, emotion)
    # Align timelines and compute correlation
    corr = arc1['smoothed_value'].corr(arc2['smoothed_value'])
    return corr
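get_emotion_timeline is a small helper; a plausible sketch, assuming it wraps query_film_emotions and returns a pandas DataFrame indexed by minute so that .corr() aligns the two arcs on their shared index:
import pandas as pd

def get_emotion_timeline(film_slug: str, language: str, emotion: str) -> pd.DataFrame:
    """Emotion timeline as a DataFrame indexed by minute_offset."""
    data = query_film_emotions(film_slug, language, emotion)
    df = pd.DataFrame(data["timeline"], columns=["minute_offset", "smoothed_value"])
    return df.set_index("minute_offset")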
3. find_emotional_peaks — Identify top N moments for an emotion
def find_emotional_peaks(film_slug: str, emotion: str, top_n: int = 3) -> list:
    """
    Find the strongest peaks for a given emotion in a film.
    Returns a list of dicts with minute, score, and dialogue context.
    """
    # Get emotion timeline
    timeline = query_film_emotions(film_slug, "en", emotion)
    # Detect local maxima
    peaks = detect_peaks(timeline['timeline'], prominence=0.1)
    # Keep the top N peaks by score
    top_peaks = sorted(peaks, key=lambda p: p['score'], reverse=True)[:top_n]
    # Add dialogue context for each peak
    for peak in top_peaks:
        peak['context'] = get_dialogue_at_minute(film_slug, peak['minute'])
    return top_peaks
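detect_peaks and get_dialogue_at_minute are helpers from the analysis pipeline. A minimal sketch of the peak-detection step, assuming SciPy and the (minute, score) timeline format above; the dict shape is an assumption:
from scipy.signal import find_peaks

def detect_peaks(timeline: list, prominence: float = 0.1) -> list:
    """Return local maxima of an emotion timeline as {"minute", "score"} dicts."""
    minutes = [m for m, _ in timeline]
    scores = [s for _, s in timeline]
    # find_peaks returns the indices of local maxima above the prominence threshold
    indices, _ = find_peaks(scores, prominence=prominence)
    return [{"minute": minutes[i], "score": scores[i]} for i in indices]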
4. compare_directors — Miyazaki vs. Takahata emotional styles
import numpy as np

def compare_directors(emotion: str, metric: str = "average") -> dict:
    """
    Compare emotion metrics between Miyazaki and Takahata films.
    """
    miyazaki_films = get_films_by_director("Hayao Miyazaki")
    takahata_films = get_films_by_director("Isao Takahata")
    miyazaki_scores = [
        calculate_film_metric(film, emotion, metric)
        for film in miyazaki_films
    ]
    takahata_scores = [
        calculate_film_metric(film, emotion, metric)
        for film in takahata_films
    ]
    return {
        "miyazaki_avg": np.mean(miyazaki_scores),
        "takahata_avg": np.mean(takahata_scores),
        "difference": np.mean(miyazaki_scores) - np.mean(takahata_scores)
    }
5. get_film_metadata — Basic film info (runtime, languages, etc.)
6. search_similar_moments — Vector similarity search for dialogue
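Tool 6 ran a similarity query against the ChromaDB collection from the stack snippet, and all six functions were handed to the agent as LangChain Tool objects. A rough sketch of both pieces, assuming dialogue lines were already added to the collection; the comma-separated input convention and the tool descriptions are illustrative, not the production code:
def search_similar_moments(query_text: str, top_n: int = 5) -> list:
    """Vector similarity search over dialogue lines stored in the collection."""
    results = collection.query(query_texts=[query_text], n_results=top_n)
    return [
        {"dialogue": doc, "metadata": meta}
        for doc, meta in zip(results["documents"][0], results["metadatas"][0])
    ]

# Register the Python functions as agent tools; the descriptions guide tool selection
custom_tools = [
    Tool(
        name="query_film_emotions",
        func=lambda q: query_film_emotions(*(p.strip() for p in q.split(","))),
        description="Get an emotion timeline. Input: 'film_slug,language,emotion'.",
    ),
    Tool(
        name="search_similar_moments",
        func=search_similar_moments,
        description="Find dialogue moments semantically similar to a query string.",
    ),
    # ...the remaining four tools are wrapped the same way
]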
The Validation: Mixed Results
I created a test set of 10 sentiment-focused queries across different complexity levels:
| Question Type | Count | Example |
|---|---|---|
| Sentiment analysis | 1 | “Show me the sentiment curve for Spirited Away” |
| Correlation study | 3 | “Calculate correlation between sentiment and revenue” |
| Trajectory analysis | 2 | “Do rising sentiment films perform better with critics?” |
| Multilingual | 1 | “Compare sentiment arcs across English, French, and Spanish” |
| Success prediction | 3 | “Correlation between peak emotions and commercial success” |
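The scoring harness was deliberately simple. A hedged sketch of the loop, where the expected-keyword checks are illustrative rather than the actual rubric:
import time

# Illustrative test cases: each query pairs with substrings the answer must contain
TEST_CASES = [
    ("Show me the sentiment curve for Spirited Away", ["spirited-away", "sentiment"]),
    ("Compare sentiment arcs across English, French, and Spanish", ["correlation"]),
    # ...remaining queries
]

def run_validation(agent, test_cases):
    passed, timings = 0, []
    for query, expected_terms in test_cases:
        start = time.time()
        answer = agent.run(query)
        timings.append(time.time() - start)
        if all(term.lower() in answer.lower() for term in expected_terms):
            passed += 1
    return {
        "pass_rate": passed / len(test_cases),
        "avg_response_seconds": sum(timings) / len(timings),
    }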
Results:
- Queries passed: 5/10 (50%)
- Overall validation score: 55.2%
- Average response time: ~15 seconds
The pattern was clear: Sora excelled at straightforward sentiment queries and multilingual comparisons (100% pass rate on those categories), but struggled with complex correlation studies and success prediction queries (33% pass rate).
The failures revealed fundamental limitations:
- Complex multi-step analytical queries required too much context for reliable answers
- Some queries hit rate limits due to token-heavy responses
- Statistical analysis outputs were inconsistent
50% pass rate for a RAG system answering analytical questions? Not terrible—but the dashboards already provide these insights instantly and reliably.
So Why Deprecate It?
Here’s the harsh truth: the interactive visualizations answered the same questions faster, cheaper, and more reliably.
Cost Comparison
| Metric | Sora (Chatbot) | Interactive Dashboards |
|---|---|---|
| Baseline cost | ~$0.04/query (GPT-4-turbo) | $0/query (static) |
| Response time | 10-20 seconds | Instant |
| Monthly cost (1000 users) | ~$400-600 | $0 (static hosting) |
| Reliability | ~50% pass rate | 100% (user-controlled) |
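The monthly figure is simple arithmetic. A back-of-envelope sketch, assuming 10 to 15 queries per user per month (the usage range is an assumption):
cost_per_query = 0.04        # ~$0.04 at GPT-4-turbo pricing at the time
users = 1000
queries_per_user = (10, 15)  # assumed monthly usage range
low = cost_per_query * users * queries_per_user[0]   # $400
high = cost_per_query * users * queries_per_user[1]  # $600
print(f"${low:.0f}-${high:.0f} per month")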
User Experience Reality
I asked a few colleagues for feedback on Sora. Here’s what I learned:
Problem 1: The Discoverability Gap
Users didn’t know what questions to ask. They’d type vague queries like “tell me about Spirited Away” and Sora would ask for clarification, creating friction.
With the visual dashboard, exploration is guided. You see film thumbnails, click one, see emotion timelines. No need to know the “right question.”
Problem 2: Trust Issues
When Sora said “Howl’s Moving Castle has the highest emotional volatility,” users wanted proof. They’d ask “how do you know?” and Sora would cite data sources, but…
…they’re already looking at a screen. Why not just show the chart?
The interactive viz lets you:
- See the raw timeline
- Hover over peaks to see exact values
- Compare multiple films side-by-side
Trust comes from transparency. Chatbots abstract; dashboards reveal.
Problem 3: The Iteration Tax
Analytical work is iterative. You ask a question, see an answer, refine the question.
With Sora:
- Ask question (10-20 sec wait)
- Get answer
- Refine question (10-20 sec wait)
- Repeat
With dashboards:
- Select film (instant)
- See all emotions (instant)
- Toggle languages (instant)
- Compare directors (instant)
The latency compounds. Every iteration costs seconds of wait time. Dashboards give you immediate feedback loops.
The Decision Framework: When to Kill It
I used a simple decision matrix:
Does This Tool Solve a Problem Users Can’t Solve Another Way?
For Sora: No. The dashboards already provide all the data Sora queries.
Is the Added Value Worth the Added Complexity?
For Sora: No.
- Added complexity: RAG pipeline, vector db, LLM costs, prompt engineering
- Added value: Natural language interface
- Trade-off: Not worth it when the alternative is instant, free, and more transparent
What Would It Take to Make This Valuable?
For Sora to be worth the complexity, it would need to:
- Answer questions the dashboards can’t answer
- Connect to external data sources (Wikipedia, IMDB, reviews)
- Generate novel insights through reasoning
That’s a different product. Not a query interface — a research assistant.
I wasn’t building that. I was building a conversational wrapper around a database. That’s not AI’s strength.
What I Learned
1. Working Code ≠ Right Solution
A working RAG system sounds impressive. But technical capability isn’t the metric that matters. User value is.
Sora worked for some queries. It failed as a product.
2. AI Excels at Transformation, Not Translation
LLMs are powerful when they:
- Transform information (summarize, rewrite, synthesize)
- Generate new content (write, ideate, create)
- Reason through ambiguity (interpret, infer, decide)
They’re weak when they:
- Translate structured queries (SQL in disguise)
- Retrieve exact information (databases do this better)
- Replace existing good UX (dashboards work great)
Sora was translation. Not transformation.
3. Choose the Right Tool for the Job
I love AI. But it’s not always the answer.
| Scenario | Right Tool |
|---|---|
| Analytical exploration | Interactive dashboards |
| Exact data lookup | Database queries or search |
| Open-ended research | LLMs + RAG |
| Creative generation | LLMs |
| Ambiguous interpretation | LLMs |
Know your tool’s strengths. Don’t use AI because it’s cool — use it because it’s the best solution.
4. Deprecation is a Feature
Killing Sora wasn’t failure. It was good engineering judgment.
I built it to test a hypothesis: “Can a chatbot make emotion exploration easier?”
Answer: No. Dashboards win.
Now I know. Hypothesis tested. Time to move on.
5. Build MVPs to Learn, Not to Ship
Sora took ~15 hours to build. If I’d committed to “building the perfect RAG system,” I’d still be iterating on prompt engineering.
Instead:
- Built MVP (15 hours)
- Ran validation (2 hours)
- Asked colleagues for feedback (3 hours)
- Made deprecation decision (1 hour)
Total: ~20 hours to decisively answer: “Is this valuable?”
That’s efficient product development.
The Memorial: Sora Lives On as Reference
I didn’t delete the code. It’s preserved in the Spiriteddata project under “Memories of Sora” — a dedicated section explaining:
- What Sora was
- How it worked
- Why I deprecated it
- What I learned
It’s a case study in engineering decision-making. Sometimes the best code you write is the code you choose not to use.
Sora taught me more by failing softly than it would have by succeeding marginally.
The next time you’re building an AI feature, ask:
- Does this transform information or just translate it?
- Is the value worth the complexity?
- Could I solve this simpler?
Sometimes the right answer is hitting delete.