
Builders Spotlight — Ragas

The story and philosophy behind one open-source AI project: what drove it, what makes it different, and why it matters.

OSS projects worth knowing — the builder story, the design decisions, the real-world use.

Ragas

A reference-free evaluation framework for RAG pipelines, built by Exploding Gradients, that measures retrieval and generation quality without needing gold-standard ground truth.

The problem it set out to solve

RAG systems are notoriously hard to evaluate. You can’t just check if an LLM’s answer matches a reference — you need to know if it retrieved the right documents, synthesized them correctly, and stayed faithful to the source material. But building evaluation datasets requires expert annotation and domain knowledge you often don’t have. Teams were shipping RAG systems blind, tweaking retrieval parameters and prompts based on gut feel rather than metrics.

The key insight

You don’t need human-labeled ground truth to measure RAG quality. Instead, you can use the LLM itself as a judge — asking it to evaluate whether a retrieved context actually answers the question, whether the generated answer is faithful to that context, and whether the answer is complete. This flips the paradigm: instead of comparing outputs to a gold standard, you’re testing the internal coherence of the pipeline. The builder’s insight was that LLM-as-judge scales and works across domains in ways traditional metrics (BLEU, ROUGE) never could.
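
To make the paradigm concrete, here is a minimal sketch of a single LLM-as-judge check, scoring whether the retrieved context can answer the question. The prompt is illustrative, not Ragas's actual prompt, and ask_llm is a stand-in for whatever chat-completion call you already use:

def judge_context(question: str, context: str, ask_llm) -> float:
    """Return 1.0 if the judge says the context can answer the question.

    Illustrative only: real frameworks use more careful prompts and
    often produce fractional scores instead of a binary verdict.
    """
    prompt = (
        f"Question: {question}\n"
        f"Retrieved context: {context}\n"
        "Does the context contain the information needed to answer the "
        "question? Reply with exactly YES or NO."
    )
    return 1.0 if ask_llm(prompt).strip().upper().startswith("YES") else 0.0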

How it works (in plain terms)

Ragas takes three inputs: a question, the retrieved documents, and the generated answer. It computes four independent metrics that each ask a different question: Does the retrieved context actually contain relevant information? Is the generated answer grounded in the context (not hallucinated)? Does the answer address the question fully? Are the retrieved documents diverse and non-redundant? Each metric is scored on a 0–1 scale, and the scores are averaged into a single pipeline score. The clever part: each metric is itself a small LLM evaluation prompt, so you can inspect what the evaluator is thinking and even swap in your own judge.
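
As an illustration of how one metric can decompose into inspectable judge steps, here is a hedged sketch of a faithfulness-style score: split the answer into statements, verify each against the context, and report the supported fraction. The prompts and the ask_llm helper are assumptions for illustration, not Ragas internals:

def faithfulness_score(context: str, answer: str, ask_llm) -> float:
    # Step 1: decompose the answer into short standalone statements.
    raw = ask_llm(
        "Rewrite this answer as short, standalone factual statements, "
        f"one per line:\n{answer}"
    )
    statements = [line.strip() for line in raw.splitlines() if line.strip()]
    if not statements:
        return 0.0
    # Step 2: ask the judge whether each statement follows from the context.
    supported = sum(
        ask_llm(
            f"Context:\n{context}\n\nStatement: {statement}\n"
            "Can the statement be inferred from the context alone? YES or NO."
        ).strip().upper().startswith("YES")
        for statement in statements
    )
    # Faithfulness is the fraction of statements grounded in the context.
    return supported / len(statements)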

What it looks like in practice

from ragas import evaluate
from ragas.metrics import (
    context_precision,   # does the retrieved context contain what's needed?
    faithfulness,        # is the answer grounded in the context?
    answer_relevancy,    # does the answer address the question?
    answer_correctness,  # does the answer agree with a reference answer?
)

# Note: metric names have shifted across Ragas releases; check the docs
# for your installed version.
results = evaluate(
    dataset=rag_dataset,  # questions, contexts, answers
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        answer_correctness,
    ],
    llm=llm,              # your judge model
    embeddings=embeddings,
)

print(results["faithfulness"])  # 0.87
print(results)                  # full report
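
The rag_dataset above is typically a Hugging Face Dataset with one row per question. This schema is a hedged sketch matching older Ragas releases; the column names (and whether a ground_truth column is required) vary by version and by which metrics you run:

from datasets import Dataset

rag_dataset = Dataset.from_dict({
    "question": ["What does Ragas measure?"],
    "contexts": [[  # one list of retrieved chunks per question
        "Ragas is a reference-free evaluation framework for RAG pipelines."
    ]],
    "answer": ["It measures retrieval and generation quality."],
    "ground_truth": [  # only needed by reference-based metrics
        "Ragas scores retrieval and generation quality."
    ],
})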

Why it matters

  • Closed the feedback loop: Teams can now iterate on RAG systems quantitatively. You can A/B test retrieval strategies, prompt changes, or model swaps and see whether they actually improve the metrics that matter (see the sketch after this list).
  • Lowered the barrier to production: You no longer need to hand-label 500 Q&A pairs to know whether your system is working. Ragas made RAG evaluation accessible to small teams and solo builders.
  • Became a standard reference for RAG benchmarking: Ragas shows up as a reference point across RAG frameworks and research papers, creating a common language for comparing systems.
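
Here is a hedged sketch of that A/B loop, reusing the evaluate call from above. build_rag_dataset, bm25_config, and hybrid_config are hypothetical stand-ins for your own pipeline code, and llm and embeddings are the judge model and embeddings from the earlier example:

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

for name, config in [("bm25", bm25_config), ("hybrid", hybrid_config)]:
    # Hypothetical helper: run your pipeline under this configuration and
    # collect question / contexts / answer rows into a dataset.
    dataset = build_rag_dataset(config)
    scores = evaluate(dataset=dataset,
                      metrics=[faithfulness, answer_relevancy],
                      llm=llm, embeddings=embeddings)
    print(name, scores)  # keep the config with better grounding and relevance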

Where to go next