TL;DR — what you'll actually pay
A production RAG pipeline serving 1 million queries per month in 2026 typically costs $400–$900 using Claude Haiku 4.5 + a small embedding model + a managed vector DB. Premium Claude Opus 4.7 builds run $4,000–$9,000. The LLM call is 70–85% of total cost — vector DB, embeddings and infra are rounding errors at most scales. The two biggest cost levers are prompt caching (60–80% LLM savings) and reducing retrieval k.
RAG ("retrieval-augmented generation") is the dominant pattern for putting LLMs on private data. It's how you build a chatbot that knows your docs, a support agent that cites the actual policy page, or an internal search that answers in sentences instead of lists of links.
The architecture is well-understood by now. The pricing isn't. Vendors quote per-vector, per-token, per-query, per-GB — and the dimensions don't line up. This post strips away the marketing math and gives you a real, line-item breakdown per million queries, in 2026 prices.
In this post
The five line items
Every RAG pipeline breaks into the same five cost centers:
- Embedding — one-time per document ingest, plus a tiny per-query embedding
- Vector database — storage + queries-per-second
- Retrieval — k nearest neighbors fetch, optional reranking
- LLM call — generation with retrieved context
- Infrastructure + observability — hosting, logging, eval, monitoring
The wild thing: in almost every real pipeline, item 4 (the LLM call) dominates 70–85% of total cost, while items 1–3 combined are under 20%. We'll see why.
1. Embedding cost
Embedding turns text into vectors. You pay for it twice: once when you ingest your knowledge base, once per query.
Per ingest: If your knowledge base is 100K documents × 5K tokens average = 500M tokens. At OpenAI text-embedding-3-small ($0.02/1M tokens), that's $10 one-time. Even at the larger text-embedding-3-large ($0.13/1M), it's $65.
Per query: Embedding a 50-token query costs $0.000001. For 1M queries: $1.00 total.
Embedding is basically free at any reasonable scale. The only time it dominates is for continuous re-embedding use cases (e.g., re-indexing 10M documents every week as the underlying data changes).
2. Vector database cost
Three categories in 2026:
- Managed serverless (Pinecone Serverless, Qdrant Cloud, Weaviate Cloud): pay per query + per stored vector. For 1M vectors and 1M queries/month, expect $70–$200/month.
- Self-hosted on a VPS (Qdrant, Weaviate, Milvus): $20–$80/month VPS handles 1–5M vectors and ~1,000 QPS easily. ~$40/month.
- Postgres with pgvector: if you already pay for managed Postgres, pgvector adds nothing. Works fine up to ~10M vectors with HNSW indexing.
For most teams the vector DB cost rounds to "negligible." The reason it feels expensive is that pricing-page math (per-vector, per-namespace, per-shard) looks scary. Run the actual numbers for your volume — it's almost always under $200/month.
Want a fixed-price RAG build?
Our RAG specialists ship end-to-end pipelines — ingest, vector DB, retrieval, LLM, evals, observability — in 2–6 weeks. Average cost: $4,000–$18,000 depending on scope.
Hire a RAG specialist →3. Retrieval cost
Retrieval itself is just a vector DB query, which we already paid for above. The added cost here is reranking — running an extra precision pass on the top-N retrieved chunks.
In 2026, the common rerankers are Cohere Rerank 3 ($1 per 1K reranks at default settings) and self-hosted BGE-Reranker-v2 (free, ~150ms latency on a $50/month GPU). For 1M queries with rerank at top-20:
- Cohere Rerank 3: ~$50/month
- Self-hosted BGE: $50/month VPS + GPU
For high-precision domains (legal, medical, support) a reranker typically lifts answer quality 10–30% — usually worth it. For casual Q&A you can often skip it by tuning chunk size.
4. LLM call cost (where everything lives)
This is the line item that dominates. Why? Because the LLM call sends a lot more tokens than retrieval does. A typical RAG call sends:
- System prompt: ~500 tokens (cacheable)
- Retrieved context: 3–10 chunks × 500 tokens = 1,500–5,000 tokens
- User question: ~50 tokens
- Output: 200–500 tokens generated
So per query: ~2,000–6,000 input tokens + ~300 output tokens.
2026 pricing per million queries, at these averages (assume 3K input + 300 output per query):
| Model | Input $/1M tok | Output $/1M tok | Total per 1M queries |
|---|---|---|---|
| Claude Haiku 4.5 | $1.00 | $5.00 | ~$4,500 |
| Claude Sonnet 4.6 | $3.00 | $15.00 | ~$13,500 |
| Claude Opus 4.7 | $15.00 | $75.00 | ~$67,500 |
| GPT-4o-mini class | $0.15 | $0.60 | ~$630 |
This is before caching. Apply prompt caching (90% discount on the cached prefix, hits ~90% of repeated calls) and Claude Haiku 4.5 drops to roughly $700–$900 per 1M queries.
5. Infrastructure + observability
Don't forget the boring bits:
- App server: $20–$200/month depending on QPS
- Observability: LangSmith, Helicone, Langfuse — free tier covers 10K–50K traces/month, paid tiers $50–$300
- Eval pipeline: $50–$200/month if you're running automated quality regression on every deploy
- Redis or other query cache: $20–$80/month, often pays for itself 5–10× in LLM savings
Budget $200–$600/month for the whole infra layer at production scale.
Putting it all together: 1M-query scenarios
Three realistic builds, all costed for 1M queries/month with 30% query-cache hit rate:
| Scenario | Stack | ~Cost / 1M queries |
|---|---|---|
| Lean startup | Haiku 4.5 + text-embedding-3-small + self-hosted Qdrant + Redis + prompt caching | $500–$900 |
| Mid-market production | Sonnet 4.6 + text-embedding-3-large + Qdrant Cloud + Cohere Rerank + Langfuse | $3,000–$5,500 |
| High-stakes enterprise | Opus 4.7 + large embeddings + pgvector on RDS + Rerank + full eval pipeline | $10,000–$18,000 |
The 36× cost spread between "lean" and "enterprise" comes almost entirely from the LLM choice. Vector DB, embeddings and infra differ by maybe 3×.
Cost-cutting levers that don't tank quality
In order of impact:
- Prompt caching — 60–80% off LLM cost. The system prompt + retrieved context often repeats across queries from the same session or user — perfect for caching.
- Query result caching (Redis) — 20–40% of real-world queries are duplicates within 24h. Cache them.
- Drop k from 10 to 3–5 — measure quality first. Most pipelines over-retrieve. Cutting context in half often costs <5% quality and 50% of LLM cost.
- Use a smaller model with better retrieval — Haiku 4.5 with 3 highly-relevant chunks beats Sonnet with 10 mediocre chunks for most use cases. Invest in retrieval quality, not LLM tier.
- Fine-tune for narrow domains — a fine-tuned smaller model can match a larger generic one for 10× less per query at high volume (>5M queries/month).
- Switch to streaming + early-exit — if your UI lets users stop reading, configure your client to abort the LLM call. Saves output tokens on aborted requests.
FAQ
How much does a RAG pipeline cost per 1M queries?
In 2026, a well-built RAG pipeline using Claude Haiku 4.5 + OpenAI text-embedding-3-small + a managed vector DB like Qdrant Cloud costs roughly $400–$900 per million queries, depending on context length, cache hit rate and retrieval k. The same pipeline on Claude Opus 4.7 + larger embeddings can hit $4,000–$9,000 per million. The LLM call is usually 70–85% of the total.
What's the cheapest production-grade RAG stack right now?
Self-hosted Qdrant on a $40/month VPS + OpenAI text-embedding-3-small ($0.02/1M tokens) + Claude Haiku 4.5 (~$1/1M input tokens) + Redis for query caching. For 1M queries with a 30% cache hit rate, you're looking at $200–$500/month total — vector DB is rounding error, LLM is everything.
Should I use Pinecone or self-host Qdrant?
For under 5M vectors and <100 queries/sec, self-hosting Qdrant on a small server saves 70–90% vs Pinecone. Past 50M vectors or 1,000 QPS you start wanting managed infrastructure — at that point, Qdrant Cloud, Pinecone Serverless or pgvector on managed Postgres are all in the same ballpark.
Why is the LLM call so much more expensive than retrieval?
Because the LLM processes 10–50× more tokens. A retrieval call embeds the user query (~50 tokens). The LLM call sends the query plus 3–10 retrieved chunks (~2,000–8,000 tokens) and generates a 200–500 token answer. Even at flat per-token rates, that's a 20–50× cost ratio.
What's the single biggest cost-cutting lever for RAG?
Prompt caching on the LLM call. Claude's prompt caching gives you ~90% discount on cached input tokens for repeated prefixes. If your system prompt + retrieved context overlaps across queries (it usually does at high volume), caching can cut total LLM cost by 60–80%. After caching, the next lever is shrinking k (fewer retrieved chunks) or moving to a smaller-but-fine-tuned model.
Do I need a reranker?
For high-precision use cases (legal, medical, support) — yes, a Cohere or BGE reranker after initial retrieval typically lifts answer quality 10–30% for ~$0.10–$0.50 per 1M reranks. For general Q&A, you can usually skip it by tuning chunk size and overlap. Always measure quality on a fixed eval set before committing.
Quote your RAG build in 24 hours
Send us your scope — corpus size, expected QPS, quality bar — and our RAG specialists will quote a fixed-price end-to-end build. Most projects land between $4K–$18K.
Get a RAG quote →Related: LangChain vs LangGraph · How to hire an AI automation freelancer