The problem
Most RAG demos look great in a single happy-path Jupyter cell. They fall over the moment someone asks a question just outside the demo corpus, or when embeddings drift after a model swap, or when a silent prompt regression ships on a Friday.
This project fixes that third one first: make every RAG change land with an evaluation that would have caught it.
The stack
- Qdrant as the vector store — fast hybrid search (dense + sparse) with payload filtering for tenant isolation
- Claude Sonnet 4.6 as the generator, behind a provider-agnostic interface so we can A/B against other frontier models
- Braintrust as the eval harness — deterministic, versioned, and wired into CI so PRs can't merge with regressions
- FastAPI + Pydantic for the service, Postgres for audit logs
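The "provider-agnostic interface" in front of the generator can be as small as a structural type. A minimal sketch (the `Generator` protocol, `EchoGenerator`, and `answer` names are illustrative, not the project's actual API):

```python
from typing import Protocol


class Generator(Protocol):
    """Anything with this shape can serve as the generator backend."""
    def generate(self, prompt: str, context: list[str]) -> str: ...


class EchoGenerator:
    """Stand-in backend for illustration; a real one would wrap an LLM SDK."""
    def generate(self, prompt: str, context: list[str]) -> str:
        return f"[{len(context)} passages] {prompt}"


def answer(gen: Generator, question: str, passages: list[str]) -> str:
    # The service depends only on the protocol, so backends can be A/B swapped
    # without touching call sites.
    return gen.generate(question, passages)
```

Because `Protocol` is structural, each provider adapter only has to expose a matching `generate` method; no shared base class is needed.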
Retrieval
Hybrid retrieval (BM25 + embedding cosine) followed by a small cross-encoder rerank. We measured:
- pure dense: Recall@5 = 0.78
- pure sparse: Recall@5 = 0.71
- hybrid + rerank: Recall@5 = 0.92
Worth the extra ~120ms.
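The write-up doesn't specify how the dense and sparse rankings are fused; reciprocal-rank fusion is one common scheme, sketched here together with the Recall@k metric used above (function names are illustrative):

```python
def rrf_fuse(dense: list[str], sparse: list[str], k: int = 60) -> list[str]:
    """Reciprocal-rank fusion of two ranked doc-id lists.

    Each list contributes 1 / (k + rank + 1) per document; documents that
    rank well in both lists float to the top of the fused ordering.
    """
    scores: dict[str, float] = {}
    for ranking in (dense, sparse):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant passages found in the top-k retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)
```

The fused top-N would then go to the cross-encoder for the final rerank; RRF only decides which candidates are worth the reranker's latency budget.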
Evals that actually block deploys
Phase 1: Seed the eval set
280 real user questions from the design-partner corpus, each labelled with ideal context passages and a faithful answer. No synthetic fluff.
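Each labelled case can be captured in a small schema. A stdlib-dataclass sketch (the real service presumably validates with Pydantic, per the stack above; field names here are illustrative):

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EvalCase:
    """One labelled eval example: a real user question plus its gold labels."""
    question: str
    ideal_passages: list[str]   # ids of the context passages a good retrieval surfaces
    faithful_answer: str        # reference answer grounded in those passages
    tags: list[str] = field(default_factory=list)  # e.g. tenant, topic


def is_valid(case: EvalCase) -> bool:
    # Reject empty labels at ingest time so the eval set stays trustworthy.
    return bool(
        case.question.strip()
        and case.ideal_passages
        and case.faithful_answer.strip()
    )
```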
Phase 2: Multi-axis scoring
Every change scored on recall, faithfulness (LLM-as-judge with a strict rubric), answer helpfulness, latency p50/p95, and cost per query.
Phase 3: CI gate
A GitHub Action runs the suite on every PR. Any axis that regresses beyond its tolerance blocks the merge; the Braintrust UI surfaces the deltas for review.
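The gate logic itself reduces to comparing candidate metrics against a baseline with per-axis tolerances and directions. A sketch under assumed metric names (the actual axes and tolerances live in the project's CI config):

```python
def gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: dict[str, float],
    higher_is_better: frozenset[str] = frozenset({"recall", "faithfulness", "helpfulness"}),
) -> list[str]:
    """Return the axes that regress beyond tolerance; empty means the PR may merge."""
    failures = []
    for axis, tol in tolerance.items():
        delta = candidate[axis] - baseline[axis]
        if axis not in higher_is_better:
            delta = -delta  # for latency/cost, an increase is the regression
        if delta < -tol:
            failures.append(axis)
    return failures
```

The CI step then exits non-zero when `gate(...)` is non-empty, which is what actually blocks the merge.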
Phase 4: Shadow eval in prod
A fraction of live traffic is replayed against candidate configurations nightly. Drift shows up in the dashboard before users complain.
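One way to pick the replayed fraction is hash-based sampling on a stable query id, so each nightly run compares candidate configurations on the same slice of traffic (a sketch; the project's actual sampling mechanism isn't described):

```python
import hashlib


def in_shadow_sample(query_id: str, fraction: float) -> bool:
    """Deterministically assign a stable fraction of traffic to the shadow eval."""
    digest = hashlib.sha256(query_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < fraction


def select_for_replay(query_ids: list[str], fraction: float = 0.05) -> list[str]:
    # Stable across nights: the same query id always lands in the same bucket,
    # so drift in the dashboard reflects the system, not the sample.
    return [q for q in query_ids if in_shadow_sample(q, fraction)]
```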
What surprised me
- Reranking matters more than embedding choice. Swapping embedding models moved Recall@5 by ±2 points; adding rerank added 14.
- LLM-as-judge is fine if your rubric is strict. Loose rubrics give every answer an 8/10. A rubric that looks for specific citations from the retrieved context is far more discriminating.
- Your eval set rots. Regenerate ~10% of it each month from recent real queries, or you'll end up optimising for an old distribution.
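The strict-citation point above also has a cheap deterministic complement: before asking an LLM judge anything, count which retrieved passages the answer actually cites. A sketch assuming answers cite passages as `[passage-id]` (a convention invented here for illustration):

```python
def citation_coverage(answer: str, retrieved_ids: list[str]) -> float:
    """Fraction of retrieved passages the answer explicitly cites as [id].

    A crude, deterministic proxy for the strict rubric: answers that cite
    nothing from the retrieved context can fail fast without a judge call.
    """
    if not retrieved_ids:
        return 0.0
    cited = sum(1 for pid in retrieved_ids if f"[{pid}]" in answer)
    return cited / len(retrieved_ids)
```

Scores like this don't replace the judge; they make its strict rubric cheaper to enforce by filtering the obvious failures first.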
What's next
- Public write-up with the full rubric + scoring code
- Open-source the Qdrant + Braintrust scaffold as a template repo
- Add adversarial eval cases (jailbreaks, prompt-injection in retrieved docs)
