RAG in production for $12/month: a WhatsApp chatbot with ChromaDB and GPT-4o mini
How I built a Retrieval-Augmented Generation chatbot for a real business, with multi-provider fallback, swappable personalities, and multimedia support. No complex infrastructure.
$12 a month
That’s what it costs to run a WhatsApp chatbot that answers questions about products, pricing, and business processes for a premium kitchenware company. It handles natural language, understands images and audio, and knows when to hand off to a human.
It runs in production with a 184KB knowledge base: 10 documents covering product catalogs, price lists, and business guides.
The stack: FastAPI, ChromaDB, GPT-4o mini, WhatsApp Business API. No Kubernetes. No Redis. No message queues. A single Python process and a 6.7MB vector database.
The architecture on a napkin
WhatsApp/Telegram → Webhook → FastAPI
↓
User question
↓
Embedding (all-MiniLM-L6-v2)
↓
ChromaDB → Top 3 chunks
↓
Prompt = Personality + RAG Context + Question
↓
GPT-4o mini (or fallback)
↓
Response + Confidence
↓
WhatsApp/Telegram → User
Each message takes this path in 2-5 seconds. The bottleneck is the LLM call, not the vector search.
The RAG pipeline
RAG sounds complex until you take it apart. Three steps: chunking, embeddings, search.
Chunking: splitting documents into useful pieces
The knowledge base documents are Markdown: product catalogs, price lists with financing options, business guides, terms and conditions. Some are 36KB.
An LLM can’t process 36KB of context per question (well, it can, but it would be expensive and slow). So we split each document into 1000-character chunks with 200-character overlap:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)
The 200-character overlap is key. Without it, a sentence that falls on a boundary loses context. With it, edges overlap and semantic search finds matches even when relevant information spans two chunks.
Separators are priority-ordered: first try to cut at paragraphs, then lines, then sentences. It never splits a word in half unless there’s no other option.
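The overlap mechanics can be illustrated with a pure-Python sliding window (a simplification — LangChain's splitter additionally respects the separator priorities described above):

```python
def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    # Naive fixed-window chunker: each chunk starts (size - overlap)
    # characters after the previous one, so neighbors share `overlap` chars
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "".join(chr(65 + i % 26) for i in range(1500))  # 1.5KB stand-in document
chunks = chunk(doc)
# Consecutive chunks share a 200-character window
assert chunks[0][-200:] == chunks[1][:200]
```

A sentence that straddles the cut at character 1000 appears whole in the second chunk, which is exactly what the overlap buys you.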
Embeddings: turning text into vectors
Each chunk becomes a 384-dimensional vector using all-MiniLM-L6-v2. It’s a HuggingFace model that runs locally — no API call, no per-embedding cost.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode("How much is the FLIP pot?")
# → numpy array of 384 floats
The model weighs ~90MB and generates embeddings in 200-500ms. Indexing the full 10-document knowledge base takes seconds.
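Semantic search compares these vectors by cosine similarity. A sketch with toy 4-dimensional vectors standing in for the real 384-dimensional embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors: 1.0 = same direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for real embeddings
v_question = np.array([0.9, 0.1, 0.0, 0.2])   # "how much is the FLIP pot?"
v_pricing = np.array([0.8, 0.2, 0.1, 0.3])    # a pricing chunk
v_shipping = np.array([0.0, 0.1, 0.9, 0.0])   # an unrelated chunk

assert cosine_similarity(v_question, v_pricing) > cosine_similarity(v_question, v_shipping)
```

The pricing chunk points in nearly the same direction as the question vector; the unrelated chunk is close to orthogonal. That geometric fact is the entire retrieval mechanism.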
Search: ChromaDB as the brain
ChromaDB stores vectors in SQLite + HNSW indices. When a question arrives, it converts it to a vector and finds the closest chunks:
def search(self, query: str, n_results: int = 5):
    query_embedding = self.model.encode([query])[0].tolist()
    results = self.collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        include=["documents", "metadatas", "distances"]
    )
    # ChromaDB returns distances; convert each to a similarity score
    # (0.0 = unrelated, 1.0 = identical)
    results["similarities"] = [
        1 - distance for distance in results["distances"][0]
    ]
    return results
The query “how much is the FLIP pot?” returns chunks from the price catalog with the product spec, financing plans, and warranty details. All of this gets injected as context into the LLM prompt.
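A minimal sketch of that injection step (the function name and delimiters are illustrative, not the exact production code):

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    # Join the top-k retrieved chunks into one context block
    context = "\n\n---\n\n".join(chunks)
    return (
        "Answer using ONLY the context below. "
        "If the answer is not there, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_rag_prompt(
    "How much is the FLIP pot?",
    ["FLIP pot — $149. Financing: 3 or 6 installments.",
     "Warranty: 5 years on all cookware."],
)
```

The "ONLY the context" instruction is what keeps the model grounded in the knowledge base instead of improvising.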
6.7MB. The entire vector database. Fits on a conference swag USB drive.
WhatsApp integration
WhatsApp Business API works with webhooks. Meta sends a POST every time someone writes. You respond by calling their Graph API.
Webhook verification
@app.get("/webhook/whatsapp")
async def verify_whatsapp(request: Request):
    mode = request.query_params.get("hub.mode")
    token = request.query_params.get("hub.verify_token")
    challenge = request.query_params.get("hub.challenge")

    if mode == "subscribe" and token == VERIFY_TOKEN:
        return int(challenge)
    raise HTTPException(403, "Invalid token")
Meta sends a GET with a challenge. If you respond with the right number, it activates the webhook. After that, everything comes via POST.
Message processing
Each WhatsApp message arrives wrapped in several JSON layers. The text is at entry[0].changes[0].value.messages[0].text.body. Images carry a media_id that requires another API call to resolve:
# Step 1: resolve the media_id to a temporary download URL
media_url = requests.get(
    f"https://graph.facebook.com/v17.0/{media_id}",
    headers={"Authorization": f"Bearer {access_token}"}
).json()["url"]

# Step 2: download the actual file content
media_content = requests.get(
    media_url,
    headers={"Authorization": f"Bearer {access_token}"}
).content
Two requests for one file. The first gives you a temporary URL, the second downloads the content. Redundant, but that’s Meta’s API.
Multimedia support
The bot receives text, images, audio, and documents. GPT-4o mini has native vision, so images go straight to the model in base64:
if media_type == "image":
    user_message = [{
        "type": "image_url",
        "image_url": {
            "url": f"data:image/jpeg;base64,{media_data}"
        }
    }]
For audio there’s a specialized model (gpt-4o-mini-audio-preview) that processes WAV directly without prior transcription.
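The audio request uses OpenAI's `input_audio` content type. A sketch of building the message payload (the bytes here are a placeholder, not a real WAV file):

```python
import base64

def build_audio_message(wav_bytes: bytes) -> dict:
    # Chat message carrying raw WAV for gpt-4o-mini-audio-preview
    return {
        "role": "user",
        "content": [{
            "type": "input_audio",
            "input_audio": {
                "data": base64.b64encode(wav_bytes).decode("ascii"),
                "format": "wav",
            },
        }],
    }

message = build_audio_message(b"RIFF....WAVEfmt ")  # placeholder bytes
```

The message goes into the same `messages` list as text; the audio model handles the rest.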
The fallback system
A production chatbot can’t depend on a single provider. If OpenAI goes down at 3 AM, the bot needs to keep answering.
The solution: a three-tier fallback chain.
1. OpenAI GPT-4o mini → $0.15/1M input, $0.60/1M output
2. OpenRouter (same model) → similar pricing, different infra
3. DeepSeek v3 (free tier) → $0
When the primary provider fails, the system switches automatically:
try:
    response = call_openai(prompt, context)
except Exception:
    try:
        response = call_openrouter(prompt, context)
        response["fallback"] = True
        response["confidence"] *= 0.8  # penalize confidence
    except Exception:
        response = call_openrouter_free(prompt, context)
        response["free_tier"] = True
        response["confidence"] *= 0.7
The confidence penalty matters. If the response comes from the free tier, the system is more aggressive about suggesting “talk to a human.” GPT-4o mini and a free model aren’t the same thing.
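The same logic generalizes to a flat loop over an ordered provider list, which is easier to extend than nested try/except. A sketch with fake provider callables standing in for the real API clients:

```python
def call_with_fallback(prompt, context, providers):
    # Walk providers in priority order; the first success wins.
    # Each entry: (callable, confidence penalty, flag to set or None)
    last_error = None
    for call, penalty, flag in providers:
        try:
            response = call(prompt, context)
            response["confidence"] = response.get("confidence", 1.0) * penalty
            if flag:
                response[flag] = True
            return response
        except Exception as exc:
            last_error = exc
    raise RuntimeError("All LLM providers failed") from last_error

# Demo: the primary times out, the free tier answers
def primary(p, c): raise TimeoutError("OpenAI down")
def free_tier(p, c): return {"text": "ok", "confidence": 1.0}

result = call_with_fallback("hi", "", [(primary, 1.0, None), (free_tier, 0.7, "free_tier")])
# → {"text": "ok", "confidence": 0.7, "free_tier": True}
```

Adding a fourth provider becomes a one-line change to the list instead of another nesting level.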
Bot personalities
The same bot can be three different people depending on context. Each personality is an .env file with variables that define tone, instructions, and style:
Lucía Casual — For younger customers. Relaxed tone, plenty of emojis, short responses.
Lucía Formal — For corporate clients. Zero emojis, technical vocabulary, ROI-focused.
Lucía Sales — For active conversion. Urgency, social proof, objection handling.
Switching personality is one command:
python scripts/change_personality.py vendedora
The system builds the system prompt dynamically based on the active personality and query type:
def build_system_prompt(self, query="", media_type="text"):
    prompt = f"I'm {self.name}, {self.role} at {self.company}. "
    prompt += self.personality

    if "price" in query.lower():
        prompt += self.price_prompt
    elif "warranty" in query.lower():
        prompt += self.warranty_prompt

    if media_type != "text":
        prompt += self.multimedia_prompt
    return prompt
If someone asks about pricing, the prompt includes specific instructions on how to present financing plans. If they send a product photo, it activates multimedia instructions. The LLM gets precise context for each situation.
When to hand off to a human
Not everything is solved with AI. The system calculates a confidence score and decides if the answer is good enough or needs human intervention:
confidence = 0.8 if context and len(context) > 100 else 0.3

if response.get("fallback"):
    confidence *= 0.8
if response.get("free_tier"):
    confidence *= 0.7

# Complex queries lower the threshold
complex_keywords = ["problem", "complaint", "issue", "return"]
threshold = 0.4 if any(kw in query.lower() for kw in complex_keywords) else 0.6

requires_human = confidence < threshold
If someone says “I have a problem with my order,” the threshold drops to 0.4 — the system is more likely to escalate because a complaint mishandled by AI can spiral. If the question is “how much is the FLIP pot?”, the threshold is 0.6 and the RAG probably has the exact answer.
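Wrapped as a function, both scenarios can be checked directly (the function name is mine, not the project's):

```python
def needs_human(query: str, context: str, response: dict) -> bool:
    # Same scoring as above: rich RAG context → 0.8, thin context → 0.3
    confidence = 0.8 if context and len(context) > 100 else 0.3
    if response.get("fallback"):
        confidence *= 0.8
    if response.get("free_tier"):
        confidence *= 0.7
    complex_keywords = ["problem", "complaint", "issue", "return"]
    threshold = 0.4 if any(kw in query.lower() for kw in complex_keywords) else 0.6
    return confidence < threshold

# A complaint with no supporting context escalates...
assert needs_human("I have a problem with my order", "", {})
# ...a pricing question with solid RAG context does not
assert not needs_human("how much is the FLIP pot?", "x" * 200, {})
```

Note the interaction with the fallback chain: a free-tier answer to an ordinary question (0.8 × 0.7 = 0.56 < 0.6) also escalates, which is the penalty doing its job.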
What it actually costs
Monthly breakdown for ~1000 messages/day:
| Component | Cost |
|---|---|
| GPT-4o mini input (~300 tokens × 1000 msgs/day × 30 days) | ~$1.35/month |
| GPT-4o mini output (~200 tokens × 1000 msgs/day × 30 days) | ~$3.60/month |
| WhatsApp Business (messages within 24h) | $0 |
| ChromaDB | $0 (local) |
| Embeddings (all-MiniLM-L6-v2) | $0 (local) |
| Total | ~$5-12/month |
The range depends on actual message volume and how many include multimedia (which consume more tokens). Peak months with promotions never exceeded $12.
The embedding model runs locally. The vector database runs locally. The only variable cost is the OpenAI API, and GPT-4o mini is absurdly cheap for what it does.
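The table's API numbers follow from straightforward arithmetic at GPT-4o mini's list prices:

```python
MSGS_PER_DAY, DAYS = 1000, 30
AVG_INPUT_TOKENS, AVG_OUTPUT_TOKENS = 300, 200
PRICE_IN, PRICE_OUT = 0.15, 0.60  # $ per 1M tokens

input_cost = MSGS_PER_DAY * DAYS * AVG_INPUT_TOKENS / 1e6 * PRICE_IN
output_cost = MSGS_PER_DAY * DAYS * AVG_OUTPUT_TOKENS / 1e6 * PRICE_OUT
# 9M input tokens → $1.35; 6M output tokens → $3.60
```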
What I’d do differently
Conversation memory. The current system doesn’t maintain context between messages. Each question is independent. For a sales bot, this is a real limitation: the customer says “I want the red pot” and then asks “does it ship free?” and the bot doesn’t know which pot they mean.
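A minimal sketch of what that memory could look like (hypothetical — none of this exists in the current bot):

```python
from collections import deque

class ConversationMemory:
    """Keep the last N turns per user and prepend them to the prompt."""
    def __init__(self, max_turns: int = 5):
        self.turns: dict[str, deque] = {}
        self.max_turns = max_turns

    def add(self, user_id: str, role: str, text: str) -> None:
        self.turns.setdefault(
            user_id, deque(maxlen=self.max_turns * 2)
        ).append((role, text))

    def as_context(self, user_id: str) -> str:
        return "\n".join(f"{role}: {text}" for role, text in self.turns.get(user_id, ()))

memory = ConversationMemory()
memory.add("user1", "user", "I want the red pot")
memory.add("user1", "assistant", "Great choice — the red FLIP pot.")
# Now "does it ship free?" arrives with the red pot already in context
```

The `deque(maxlen=...)` caps token growth automatically: old turns fall off as new ones arrive.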
Semantic cache. Many questions repeat: “what payment methods do you accept?”, “do you ship nationwide?”. A cache that detects semantically similar questions would avoid unnecessary LLM calls.
Populated FAQ. The knowledge_base/faq/ folder exists but is empty. Real customer questions are the best input for improving the knowledge base, and we’re not capturing them.
Fine-tuned embedding model. all-MiniLM-L6-v2 is a generalist. A model fine-tuned with kitchenware vocabulary and product names would improve search accuracy, especially for product names that aren’t common words.
The point
RAG doesn’t have to be complex. ChromaDB + a local embedding model + GPT-4o mini solve 80% of enterprise chatbot use cases. The other 20% is product engineering: deciding when to escalate to a human, how to handle multimedia, which personality to use for each context.
The most expensive infrastructure in this project is domain knowledge. The Markdown documents in the knowledge base are written by hand, updated manually, and they’re what makes the bot useful instead of generic.
The AI is the channel. The value is in what it knows.