RAG in production for $12/month: a WhatsApp chatbot with ChromaDB and GPT-4o mini
How I built a Retrieval-Augmented Generation chatbot for a real business, with multi-provider fallback, swappable personalities, and multimedia support. No complex infrastructure.
$12 a month
That’s what it costs to run a WhatsApp chatbot that answers questions about products, pricing, and business processes for a premium kitchenware company. It handles natural language, understands images and audio, and knows when to hand off to a human.
It runs in production with a 184KB knowledge base: 10 documents covering product catalogs, price lists, and business guides.
The stack: FastAPI, ChromaDB, GPT-4o mini, WhatsApp Business API. No Kubernetes. No Redis. No message queues. A single Python process and a 6.7MB vector database.
The architecture on a napkin
WhatsApp/Telegram → Webhook → FastAPI
↓
User question
↓
Embedding (all-MiniLM-L6-v2)
↓
ChromaDB → Top 3 chunks
↓
Prompt = Personality + RAG Context + Question
↓
GPT-4o mini (or fallback)
↓
Response + Confidence
↓
WhatsApp/Telegram → User
Each message takes this path in 2-5 seconds. The bottleneck is the LLM call, not the vector search.
The RAG pipeline
RAG sounds complex until you take it apart. Three steps: chunking, embeddings, search.
Chunking: splitting documents into useful pieces
The knowledge base documents are Markdown: product catalogs, price lists with financing options, business guides, terms and conditions. Some are 36KB.
An LLM can’t process 36KB of context per question (well, it can, but it would be expensive and slow). So we split each document into 1000-character chunks with 200-character overlap:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)
The 200-character overlap is key. Without it, a sentence that falls on a boundary loses context. With it, edges overlap and semantic search finds matches even when relevant information spans two chunks.
Separators are priority-ordered: first try to cut at paragraphs, then lines, then sentences. It never splits a word in half unless there’s no other option.
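The overlap mechanics can be illustrated with a pure-Python sliding window (a simplification — LangChain's splitter additionally respects the separator priorities described above):

```python
def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    # Naive fixed-window chunker: each chunk starts (size - overlap)
    # characters after the previous one, so neighbors share `overlap` chars
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "".join(chr(65 + i % 26) for i in range(1500))  # 1.5KB stand-in document
chunks = chunk(doc)
# Consecutive chunks share a 200-character window
assert chunks[0][-200:] == chunks[1][:200]
```

A sentence that straddles the cut at character 1000 appears whole in the second chunk, which is exactly what the overlap buys you.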
Embeddings: turning text into vectors
Each chunk becomes a 384-dimensional vector using all-MiniLM-L6-v2. It’s a HuggingFace model that runs locally — no API call, no per-embedding cost.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode("How much is the FLIP pot?")
# → numpy array of 384 floats
The model weighs ~90MB and generates embeddings in 200-500ms. Indexing the full 10-document knowledge base takes seconds.
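Semantic search compares these vectors by cosine similarity. A sketch with toy 4-dimensional vectors standing in for the real 384-dimensional embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors: 1.0 = same direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for real embeddings
v_question = np.array([0.9, 0.1, 0.0, 0.2])   # "how much is the FLIP pot?"
v_pricing = np.array([0.8, 0.2, 0.1, 0.3])    # a pricing chunk
v_shipping = np.array([0.0, 0.1, 0.9, 0.0])   # an unrelated chunk

assert cosine_similarity(v_question, v_pricing) > cosine_similarity(v_question, v_shipping)
```

The pricing chunk points in nearly the same direction as the question vector; the unrelated chunk is close to orthogonal. That geometric fact is the entire retrieval mechanism.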
Search: ChromaDB as the brain
ChromaDB stores vectors in SQLite + HNSW indices. When a question arrives, it converts it to a vector and finds the closest chunks:
def search(self, query: str, n_results: int = 5):
    query_embedding = self.model.encode([query])[0].tolist()
    results = self.collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        include=["documents", "metadatas", "distances"]
    )
    # ChromaDB returns distances; convert each to a similarity score
    # (0.0 = unrelated, 1.0 = identical)
    results["similarities"] = [
        1 - distance for distance in results["distances"][0]
    ]
    return results
The query “how much is the FLIP pot?” returns chunks from the price catalog with the product spec, financing plans, and warranty details. All of this gets injected as context into the LLM prompt.
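A minimal sketch of that injection step (the function name and delimiters are illustrative, not the exact production code):

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    # Join the top-k retrieved chunks into one context block
    context = "\n\n---\n\n".join(chunks)
    return (
        "Answer using ONLY the context below. "
        "If the answer is not there, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_rag_prompt(
    "How much is the FLIP pot?",
    ["FLIP pot — $149. Financing: 3 or 6 installments.",
     "Warranty: 5 years on all cookware."],
)
```

The "ONLY the context" instruction is what keeps the model grounded in the knowledge base instead of improvising.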
6.7MB. The entire vector database. Fits on a conference swag USB drive.
WhatsApp integration
WhatsApp Business API works with webhooks. Meta sends a POST every time someone writes. You respond by calling their Graph API.
Webhook verification
@app.get("/webhook/whatsapp")
async def verify_whatsapp(request: Request):
    mode = request.query_params.get("hub.mode")
    token = request.query_params.get("hub.verify_token")
    challenge = request.query_params.get("hub.challenge")

    if mode == "subscribe" and token == VERIFY_TOKEN:
        return int(challenge)
    raise HTTPException(403, "Invalid token")
Meta sends a GET with a challenge. If you respond with the right number, it activates the webhook. After that, everything comes via POST.
Message processing
Each WhatsApp message arrives wrapped in several JSON layers. The text is at entry[0].changes[0].value.messages[0].text.body. Images carry a media_id that requires another API call to resolve:
# Step 1: resolve the media_id to a temporary download URL
media_url = requests.get(
    f"https://graph.facebook.com/v17.0/{media_id}",
    headers={"Authorization": f"Bearer {access_token}"}
).json()["url"]

# Step 2: download the actual file content
media_content = requests.get(
    media_url,
    headers={"Authorization": f"Bearer {access_token}"}
).content
Two requests for one file. The first gives you a temporary URL, the second downloads the content. Redundant, but that’s Meta’s API.
Multimedia support
The bot receives text, images, audio, and documents. GPT-4o mini has native vision, so images go straight to the model in base64:
if media_type == "image":
    user_message = [{
        "type": "image_url",
        "image_url": {
            "url": f"data:image/jpeg;base64,{media_data}"
        }
    }]
For audio there’s a specialized model (gpt-4o-mini-audio-preview) that processes WAV directly without prior transcription.
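The audio request uses OpenAI's `input_audio` content type. A sketch of building the message payload (the bytes here are a placeholder, not a real WAV file):

```python
import base64

def build_audio_message(wav_bytes: bytes) -> dict:
    # Chat message carrying raw WAV for gpt-4o-mini-audio-preview
    return {
        "role": "user",
        "content": [{
            "type": "input_audio",
            "input_audio": {
                "data": base64.b64encode(wav_bytes).decode("ascii"),
                "format": "wav",
            },
        }],
    }

message = build_audio_message(b"RIFF....WAVEfmt ")  # placeholder bytes
```

The message goes into the same `messages` list as text; the audio model handles the rest.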
The fallback system
A production chatbot can’t depend on a single provider. If OpenAI goes down at 3 AM, the bot needs to keep answering.
The solution: a three-tier fallback chain.
1. OpenAI GPT-4o mini → $0.15/1M input, $0.60/1M output
2. OpenRouter (same model) → similar pricing, different infra
3. DeepSeek v3 (free tier) → $0
When the primary provider fails, the system switches automatically:
try:
    response = call_openai(prompt, context)
except Exception:
    try:
        response = call_openrouter(prompt, context)
        response["fallback"] = True
        response["confidence"] *= 0.8  # penalize confidence
    except Exception:
        response = call_openrouter_free(prompt, context)
        response["free_tier"] = True
        response["confidence"] *= 0.7
The confidence penalty matters. If the response comes from the free tier, the system is more aggressive about suggesting “talk to a human.” GPT-4o mini and a free model aren’t the same thing.
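The same logic generalizes to a flat loop over an ordered provider list, which is easier to extend than nested try/except. A sketch with fake provider callables standing in for the real API clients:

```python
def call_with_fallback(prompt, context, providers):
    # Walk providers in priority order; the first success wins.
    # Each entry: (callable, confidence penalty, flag to set or None)
    last_error = None
    for call, penalty, flag in providers:
        try:
            response = call(prompt, context)
            response["confidence"] = response.get("confidence", 1.0) * penalty
            if flag:
                response[flag] = True
            return response
        except Exception as exc:
            last_error = exc
    raise RuntimeError("All LLM providers failed") from last_error

# Demo: the primary times out, the free tier answers
def primary(p, c): raise TimeoutError("OpenAI down")
def free_tier(p, c): return {"text": "ok", "confidence": 1.0}

result = call_with_fallback("hi", "", [(primary, 1.0, None), (free_tier, 0.7, "free_tier")])
# → {"text": "ok", "confidence": 0.7, "free_tier": True}
```

Adding a fourth provider becomes a one-line change to the list instead of another nesting level.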
Bot personalities
The same bot can be three different people depending on context. Each personality is an .env file with variables that define tone, instructions, and style:
Lucía Casual — For younger customers. Relaxed tone, plenty of emojis, short responses.
Lucía Formal — For corporate clients. Zero emojis, technical vocabulary, ROI-focused.
Lucía Sales — For active conversion. Urgency, social proof, objection handling.
Switching personality is one command:
python scripts/change_personality.py vendedora
The system builds the system prompt dynamically based on the active personality and query type:
def build_system_prompt(self, query="", media_type="text"):
    prompt = f"I'm {self.name}, {self.role} at {self.company}. "
    prompt += self.personality

    if "price" in query.lower():
        prompt += self.price_prompt
    elif "warranty" in query.lower():
        prompt += self.warranty_prompt

    if media_type != "text":
        prompt += self.multimedia_prompt
    return prompt
If someone asks about pricing, the prompt includes specific instructions on how to present financing plans. If they send a product photo, it activates multimedia instructions. The LLM gets precise context for each situation.
When to hand off to a human
Not everything is solved with AI. The system calculates a confidence score and decides if the answer is good enough or needs human intervention:
confidence = 0.8 if context and len(context) > 100 else 0.3

if response.get("fallback"):
    confidence *= 0.8
if response.get("free_tier"):
    confidence *= 0.7

# Complex queries lower the threshold
complex_keywords = ["problem", "complaint", "issue", "return"]
threshold = 0.4 if any(kw in query.lower() for kw in complex_keywords) else 0.6

requires_human = confidence < threshold
If someone says “I have a problem with my order,” the threshold drops to 0.4 — the system is more likely to escalate because a complaint mishandled by AI can spiral. If the question is “how much is the FLIP pot?”, the threshold is 0.6 and the RAG probably has the exact answer.
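Wrapped as a function, both scenarios can be checked directly (the function name is mine, not the project's):

```python
def needs_human(query: str, context: str, response: dict) -> bool:
    # Same scoring as above: rich RAG context → 0.8, thin context → 0.3
    confidence = 0.8 if context and len(context) > 100 else 0.3
    if response.get("fallback"):
        confidence *= 0.8
    if response.get("free_tier"):
        confidence *= 0.7
    complex_keywords = ["problem", "complaint", "issue", "return"]
    threshold = 0.4 if any(kw in query.lower() for kw in complex_keywords) else 0.6
    return confidence < threshold

# A complaint with no supporting context escalates...
assert needs_human("I have a problem with my order", "", {})
# ...a pricing question with solid RAG context does not
assert not needs_human("how much is the FLIP pot?", "x" * 200, {})
```

Note the interaction with the fallback chain: a free-tier answer to an ordinary question (0.8 × 0.7 = 0.56 < 0.6) also escalates, which is the penalty doing its job.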
What it actually costs
Monthly breakdown for ~1000 messages/day:
| Component | Cost |
|---|---|
| GPT-4o mini input (~300 tokens × 1000 msgs/day × 30 days) | ~$1.35/month |
| GPT-4o mini output (~200 tokens × 1000 msgs/day × 30 days) | ~$3.60/month |
| WhatsApp Business (messages within 24h) | $0 |
| ChromaDB | $0 (local) |
| Embeddings (all-MiniLM-L6-v2) | $0 (local) |
| Total | ~$5-12/month |
The range depends on actual message volume and how many include multimedia (which consume more tokens). Peak months with promotions never exceeded $12.
The embedding model runs locally. The vector database runs locally. The only variable cost is the OpenAI API, and GPT-4o mini is absurdly cheap for what it does.
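The table's API numbers follow from straightforward arithmetic at GPT-4o mini's list prices:

```python
MSGS_PER_DAY, DAYS = 1000, 30
AVG_INPUT_TOKENS, AVG_OUTPUT_TOKENS = 300, 200
PRICE_IN, PRICE_OUT = 0.15, 0.60  # $ per 1M tokens

input_cost = MSGS_PER_DAY * DAYS * AVG_INPUT_TOKENS / 1e6 * PRICE_IN
output_cost = MSGS_PER_DAY * DAYS * AVG_OUTPUT_TOKENS / 1e6 * PRICE_OUT
# 9M input tokens → $1.35; 6M output tokens → $3.60
```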
What I’d do differently
Conversation memory. The current system doesn’t maintain context between messages. Each question is independent. For a sales bot, this is a real limitation: the customer says “I want the red pot” and then asks “does it ship free?” and the bot doesn’t know which pot they mean.
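A minimal sketch of what that memory could look like (hypothetical — none of this exists in the current bot):

```python
from collections import deque

class ConversationMemory:
    """Keep the last N turns per user and prepend them to the prompt."""
    def __init__(self, max_turns: int = 5):
        self.turns: dict[str, deque] = {}
        self.max_turns = max_turns

    def add(self, user_id: str, role: str, text: str) -> None:
        self.turns.setdefault(
            user_id, deque(maxlen=self.max_turns * 2)
        ).append((role, text))

    def as_context(self, user_id: str) -> str:
        return "\n".join(f"{role}: {text}" for role, text in self.turns.get(user_id, ()))

memory = ConversationMemory()
memory.add("user1", "user", "I want the red pot")
memory.add("user1", "assistant", "Great choice — the red FLIP pot.")
# Now "does it ship free?" arrives with the red pot already in context
```

The `deque(maxlen=...)` caps token growth automatically: old turns fall off as new ones arrive.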
Semantic cache. Many questions repeat: “what payment methods do you accept?”, “do you ship nationwide?”. A cache that detects semantically similar questions would avoid unnecessary LLM calls.
Populated FAQ. The knowledge_base/faq/ folder exists but is empty. Real customer questions are the best input for improving the knowledge base, and we’re not capturing them.
Fine-tuned embedding model. all-MiniLM-L6-v2 is a generalist. A model fine-tuned with kitchenware vocabulary and product names would improve search accuracy, especially for product names that aren’t common words.
The point
RAG doesn’t have to be complex. ChromaDB + a local embedding model + GPT-4o mini solve 80% of enterprise chatbot use cases. The other 20% is product engineering: deciding when to escalate to a human, how to handle multimedia, which personality to use for each context.
The most expensive infrastructure in this project is domain knowledge. The Markdown documents in the knowledge base are written by hand, updated manually, and they’re what makes the bot useful instead of generic.
The AI is the channel. The value is in what it knows.