LLM Eval Framework: DeepEval vs Braintrust vs RAGAS

By Ergini, Software & AI Developer in Pristina, Kosovo

TL;DR

An eval that runs on every PR beats a comprehensive framework that runs quarterly. I benchmarked DeepEval, RAGAS, Braintrust, Promptfoo, and LangSmith on three real client projects. Here is what I would actually ship and why.

Why your eval matters more than your prompt

Every AI project I have shipped in the last two years had the same inflection point. The first version is built on vibes - the prompt looks good, the demo works, the founder is happy. Then a real user asks a weird question, the model hallucinates, and suddenly the team is debating whether to switch from GPT-5 to Claude Opus 4.7. Without an eval suite, that debate is religion. With an eval suite, it is a 20-minute experiment. The eval is the thing that turns AI engineering from prompt-tweaking into actual engineering.

I have benchmarked the five frameworks below on three real client projects over the past 12 months: a RAG-backed customer support bot, an agent that triages and routes inbound sales emails, and a document-extraction pipeline for a fintech. The findings are opinionated because shipped systems force you to be opinionated. If you want neutral coverage of every metric in every framework, the respective docs are better than I will be. If you want to know which one to install on Monday morning, this post is for you.

The 5 frameworks worth knowing in 2026

There are dozens of evaluation libraries on GitHub. These five are the ones I see in real production pipelines, plus OpenAI Evals as the honorable mention every reviewer asks about. The table below is the cheat sheet I use when scoping a new project.

| Framework | Best for | Hosting | RAG metrics | Agent metrics | CI integration | Free tier |
| --- | --- | --- | --- | --- | --- | --- |
| DeepEval | Pytest-style eval in TS/Py | Self-host or Confident AI Cloud | Good (faithfulness, contextual recall) | Good (task completion, tool correctness) | First-class, pytest-native | Fully open source |
| RAGAS | Rigorous RAG-specific scoring | Self-host (Python only) | Best in class | Limited | Library, you wire it up | Fully open source |
| Braintrust | Observability + eval in one tool | Hosted SaaS (self-host on enterprise) | Good, custom-friendly | Good, trace-driven | CLI and SDK, CI-friendly | Generous free tier |
| Promptfoo | Fast prompt-level iteration | Self-host (CLI-first) | Adequate via plugins | Adequate via plugins | YAML config, very CI-friendly | Fully open source |
| LangSmith | LangChain-native stacks | Hosted SaaS or self-host | Good with LangChain | Strong for LangGraph | SDK, opinionated | Free tier under 5K traces/mo |
| OpenAI Evals (mention) | OpenAI-only model comparison | Self-host (Python) | Basic | Basic | Library, manual wiring | Fully open source |

OpenAI Evals is on the list because reviewers always ask. It was the original - and in early 2023 it was the only thing in the category - but in 2026 the ecosystem has lapped it. I would not start a new project on it. Use it only if you are doing pure OpenAI-model-against-OpenAI-model comparisons and you want a zero-dependency Python library.

The 4 metric families you actually need

Before picking a framework, understand the metric families. Every tool below is some combination of these four. If you cannot name which family each of your metrics belongs to, you are measuring noise.

Reference-based metrics. Classical NLP scores like BLEU, ROUGE, METEOR, and embedding-similarity. They compare model output against a fixed reference answer. Cheap, fast, deterministic, and almost always misleading on open-ended generation. They earn their keep for translation, summarization with tight reference summaries, and extraction tasks where the answer is a known string. For chat, agents, and most RAG, they will tell you a perfectly good answer is wrong because it phrased things differently from your reference. Treat them as a smoke test, not a quality gate.
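
To make the paraphrase problem concrete, here is a minimal sketch of a unigram-overlap F1 score in the spirit of ROUGE-1. It is a simplified illustration, not the official ROUGE implementation, and the example sentences are invented for this post.

```typescript
// Lowercase, strip punctuation, and split on whitespace.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .split(/\s+/)
    .filter((t) => t.length > 0);
}

// Unigram-overlap F1 between a candidate and a reference (ROUGE-1-style).
function unigramF1(candidate: string, reference: string): number {
  const candTokens = tokenize(candidate);
  const refTokens = tokenize(reference);
  if (candTokens.length === 0 || refTokens.length === 0) return 0;

  // Count reference tokens so a repeated word is matched at most as often as it occurs.
  const refCounts = new Map<string, number>();
  for (const t of refTokens) refCounts.set(t, (refCounts.get(t) ?? 0) + 1);

  let overlap = 0;
  for (const t of candTokens) {
    const remaining = refCounts.get(t) ?? 0;
    if (remaining > 0) {
      overlap += 1;
      refCounts.set(t, remaining - 1);
    }
  }
  if (overlap === 0) return 0;

  const precision = overlap / candTokens.length;
  const recall = overlap / refTokens.length;
  return (2 * precision * recall) / (precision + recall);
}

// Hypothetical example data.
const reference = "Refunds are processed within five business days.";
const verbatim = "Refunds are processed within five business days.";
const paraphrase = "You will get your money back in about a week.";

console.log(unigramF1(verbatim, reference).toFixed(2));   // 1.00
console.log(unigramF1(paraphrase, reference).toFixed(2)); // 0.00
```

The paraphrase is a reasonable answer to a human reader, yet it shares no tokens with the reference and scores zero. That is the failure mode to keep in mind before using overlap scores as a gate.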

LLM-as-judge metrics. Use a stronger LLM to score the output of your production LLM against a rubric. This is the dominant pattern in 2026 because it scales and it handles open-ended outputs that reference-based metrics cannot. The catch is that judges drift, judges hallucinate, and judges have biases (position bias, verbosity bias, self-preference bias). You must calibrate them. I cover this in the trap section below.

RAG-specific metrics. Faithfulness (did the answer stay grounded in the retrieved context?), context recall (did we retrieve all the relevant chunks?), context precision (was the retrieved context mostly relevant?), and answer relevancy (does the answer address the question?). These are the metrics that diagnose why your RAG is broken, because they isolate retrieval failure from generation failure. RAGAS implements these most rigorously; DeepEval and Braintrust have solid versions; everyone else has approximations. If you are building anything in the RAG architecture space, you need these four scored on every PR.
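
The two retrieval-side metrics are easiest to reason about with labeled data. The sketch below computes a plain set-based context precision and context recall, assuming you have human-labeled relevant chunk IDs for each question. That is a simplification: RAGAS and DeepEval derive these scores with LLM judgments and do not work this way internally. The type and field names are hypothetical.

```typescript
interface RetrievalCase {
  retrievedIds: string[]; // chunk IDs returned by the retriever
  relevantIds: string[];  // chunk IDs a human labeled as relevant (assumed to exist)
}

// Share of retrieved chunks that are relevant.
function contextPrecision(c: RetrievalCase): number {
  if (c.retrievedIds.length === 0) return 0;
  const relevant = new Set(c.relevantIds);
  const hits = c.retrievedIds.filter((id) => relevant.has(id)).length;
  return hits / c.retrievedIds.length;
}

// Share of relevant chunks that were retrieved.
function contextRecall(c: RetrievalCase): number {
  if (c.relevantIds.length === 0) return 1; // nothing to find, nothing missed
  const retrieved = new Set(c.retrievedIds);
  const found = c.relevantIds.filter((id) => retrieved.has(id)).length;
  return found / c.relevantIds.length;
}

// Hypothetical example: 4 chunks retrieved, 3 chunks labeled relevant.
const example: RetrievalCase = {
  retrievedIds: ["doc-1", "doc-7", "doc-9", "doc-4"],
  relevantIds: ["doc-1", "doc-4", "doc-12"],
};

console.log(contextPrecision(example)); // 0.5 (2 of 4 retrieved chunks are relevant)
console.log(contextRecall(example).toFixed(2)); // 0.67 (2 of 3 relevant chunks were retrieved)
```

Low recall with high precision points at the retriever missing material; the reverse points at noisy retrieval that the generator has to wade through.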

Trajectory and agent metrics. Did the agent call the right tools, in the right order, with the right arguments? Did it complete the task? Did it stay within a sensible step budget? Did it recover from errors? These matter the moment you ship anything in the agentic RAG category. DeepEval and Braintrust handle these well via trace-based evaluation; LangSmith is strongest if you are already on LangGraph; RAGAS does not really play here.
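
As a rough illustration of those questions, the sketch below scores a recorded agent trajectory against an expected one. It is a hand-rolled example, not the API of DeepEval, Braintrust, or LangSmith, and the tool names are hypothetical.

```typescript
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface TrajectoryScore {
  toolsCorrect: boolean; // same tools were called, ignoring order
  orderCorrect: boolean; // same tools in the same order
  argsCorrect: boolean;  // same arguments at every step (requires correct order)
  withinBudget: boolean; // agent did not exceed the step budget
}

// Shallow argument comparison that ignores key order.
function sameArgs(a: Record<string, unknown>, b: Record<string, unknown>): boolean {
  const keysA = Object.keys(a).sort();
  const keysB = Object.keys(b).sort();
  if (keysA.length !== keysB.length) return false;
  return keysA.every(
    (key, i) => key === keysB[i] && JSON.stringify(a[key]) === JSON.stringify(b[key]),
  );
}

function scoreTrajectory(
  expected: ToolCall[],
  actual: ToolCall[],
  stepBudget: number,
): TrajectoryScore {
  const expectedNames = expected.map((c) => c.name);
  const actualNames = actual.map((c) => c.name);

  const toolsCorrect =
    [...expectedNames].sort().join("\n") === [...actualNames].sort().join("\n");
  const orderCorrect = expectedNames.join("\n") === actualNames.join("\n");
  const argsCorrect =
    orderCorrect &&
    expected.every((call, i) => {
      const other = actual[i];
      return other !== undefined && sameArgs(call.args, other.args);
    });

  return { toolsCorrect, orderCorrect, argsCorrect, withinBudget: actual.length <= stepBudget };
}

// Hypothetical email-triage trajectory: right tools, wrong order.
const expectedCalls: ToolCall[] = [
  { name: "lookup_customer", args: { email: "a@example.com" } },
  { name: "route_email", args: { queue: "sales" } },
];
const actualCalls: ToolCall[] = [
  { name: "route_email", args: { queue: "sales" } },
  { name: "lookup_customer", args: { email: "a@example.com" } },
];

console.log(scoreTrajectory(expectedCalls, actualCalls, 3));
```

For this example the result is `toolsCorrect: true`, `orderCorrect: false`, `argsCorrect: false`, `withinBudget: true`. Keeping the checks separate matters because a single pass/fail flag hides which part of the trajectory went wrong.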

DeepEval: the pytest of LLM evaluation

DeepEval is my default for new projects in 2026. It is open source, has 14+ built-in metrics, supports both Python and TypeScript reasonably well, and the API is intentionally pytest-shaped, which means it slots into existing test workflows with no ceremony. You write an eval the same way you write a unit test, you run it the same way, and CI integration is a matter of pointing your existing pytest runner at the eval folder.

Strengths. The pytest-style API is the killer feature - engineers already know it, and the cognitive overhead of adding evals is near zero. Metric coverage is broad: hallucination, faithfulness, contextual precision, contextual recall, answer relevancy, task completion, tool correctness, and bias. Custom metrics are a clean subclass of a base metric class. The Confident AI hosted dashboard is optional but useful for tracking eval runs over time without building your own.

Weaknesses. The TypeScript SDK lags the Python one by a few releases. Some of the more exotic RAG metrics (especially around context precision with explanations) are sharper in RAGAS. Judge model selection is on you - if you do not pin a judge model and track its calibration, your numbers will drift release to release.

When I pick it. Default for greenfield projects, especially TS-heavy teams. Default for anyone whose engineers already think in pytest. Default when I want one framework that handles chat, RAG, and agent evals without juggling three tools.

RAGAS: the specialist for retrieval pipelines

RAGAS is the framework that taught the industry how to measure RAG. The original faithfulness and context-recall papers behind it are still the cleanest formalization of the problem, and the implementations are unusually rigorous for an open-source library. If you are doing anything serious with retrieval, this should be on your radar.

Strengths. The four core RAG metrics (faithfulness, context recall, context precision, answer relevancy) are the most well-validated implementations available. The library is small, focused, and easy to read. Synthetic test set generation from your own corpus is built-in - you point RAGAS at your docs and it generates a starter eval set, which is genuinely useful when you do not have human labels yet. Integration with major frameworks (LangChain, LlamaIndex, Haystack) is first-class.

Weaknesses. Python only. The scope is narrow - chat and agent evaluation is not really its job. The hosted dashboard story is weak compared to Braintrust or Confident AI. Judge calls are expensive at scale because the metrics use multi-step prompting; a full eval run on a 500-question test set against GPT-5 as judge will cost you real money, so plan your sampling.

When I pick it. RAG-heavy projects where retrieval quality is the product, especially when I want to debug the retrieval step in isolation from the generation step. Pairs cleanly with DeepEval - RAGAS for retrieval metrics, DeepEval for everything else.

Braintrust: eval and observability in one tool

Braintrust is the polished commercial option in the category. It does eval, logging, observability, and dataset management in one product, and the developer experience is the cleanest I have used. The free tier is generous enough for small teams to run real production workloads on it before paying.

Strengths. The unified model is the headline feature - your eval runs share a data model with your production traces, which means you can promote a real production interaction into an eval case with one click. Dataset versioning, prompt versioning, and experiment comparison are first-class. The TypeScript SDK is on par with the Python one, which is rare. Custom scorers are easy to write and the autoeval primitives (faithfulness, summarization quality) are solid defaults. CI integration via the CLI is painless.

Weaknesses. Hosted SaaS, which is a non-starter for some regulated workloads (self-host is enterprise-only). The pricing climbs once you cross the free tier in a meaningful way - at 100K+ traced calls per month the bill is large enough that a few open-source tools plus a few hours of glue code become a competitive alternative. The opinionated data model is great when you embrace it and friction when you want to do something custom.

When I pick it. Teams that want one vendor for eval plus observability, do not want to self-host anything, and are okay with a SaaS bill. Especially good for product teams who want non-engineers to be able to inspect eval results - the UI is the best in the category for that. Often the right pairing with my AI integration engagements.

Promptfoo: zero-config prompt iteration

Promptfoo is the fastest tool in the category for the specific job of comparing prompts and models. The entire workflow is a YAML file plus a CLI command, and you can have a meaningful eval running against five models in under 10 minutes. For pure prompt engineering work, nothing else gets you to first results this quickly.

Strengths. The YAML config is genuinely zero-ceremony - describe your prompts, your test cases, your assertions, and run. Model comparison across OpenAI, Anthropic, Google, local Ollama, and more is one config block. The HTML report is shareable and the diff view between runs is the best in the category. Red-teaming and prompt-injection probing are built in, which matters more than most teams realize after reading my prompt injection defense notes. CI integration via the CLI is one line.

Weaknesses. The model is YAML-first, which is great for prompt evaluation and awkward for complex agent or multi-step pipelines. Custom JS or Python assertions help but you eventually outgrow the config-driven approach. RAG-specific metrics exist but are not as sharp as RAGAS or DeepEval. The tool is at its best on prompts and at its worst on complex agentic flows.

When I pick it. Prompt engineering passes, model bake-offs, and regression suites for chat applications. Especially good for teams who want non-engineers to contribute test cases - editing YAML is approachable in a way that writing pytest fixtures is not.

LangSmith: LangChain-native, painful otherwise

LangSmith is the eval and observability platform built by the LangChain team. If you are already building on LangChain or LangGraph, it is the path of least resistance - tracing is automatic, datasets flow from production into eval runs, and the UI knows what your chains and agents look like.

Strengths. LangGraph trajectory evaluation is the strongest in the category. Automatic tracing for LangChain components means zero instrumentation overhead in the happy path. The eval SDK supports both built-in and custom evaluators. The dataset management UI is good. Self-hosting is supported, which matters for regulated workloads.

Weaknesses. The framework lock-in is real - outside of LangChain and LangGraph, you are pushing a square peg into a round hole. For teams using the Vercel AI SDK, the OpenAI SDK directly, or custom orchestration (which is most teams I work with now), the instrumentation overhead negates the convenience advantage. The company strategy on LangChain itself has been turbulent enough that I would think hard before betting a multi-year eval stack on this product specifically. For my take on whether you should even be on LangChain in 2026, see my OpenAI API cost post and the related framework discussions.

When I pick it. LangChain or LangGraph projects where the team already has Python infrastructure expertise. Otherwise, almost never - the alternatives have caught up on every axis where LangSmith used to lead.

A real CI eval setup in TypeScript

This is the actual skeleton I drop into client projects. It uses DeepEval-style scoring (any framework would work the same way) and runs as a GitHub Action on every PR that touches the prompt or retrieval folder. The whole thing is under 40 lines.

// evals/rag.eval.ts
import { evaluate, FaithfulnessMetric, AnswerRelevancyMetric } from "deepeval";
import { ragAnswer } from "../src/lib/rag";
import dataset from "./dataset.json";

// Pin the judge model so scores stay comparable from run to run.
const JUDGE_MODEL = "gpt-5-mini";

const metrics = [
  new FaithfulnessMetric({ threshold: 0.8, model: JUDGE_MODEL }),
  new AnswerRelevancyMetric({ threshold: 0.75, model: JUDGE_MODEL }),
];

// Run the RAG pipeline for every dataset row and wait for all of them,
// so evaluate() receives plain test cases rather than pending promises.
const testCases = await Promise.all(
  dataset.map(async (row) => {
    const result = await ragAnswer(row.question);
    return {
      input: row.question,
      actualOutput: result.answer,
      retrievalContext: result.context,
      expectedOutput: row.expected,
    };
  }),
);

// Fail the process (and therefore the CI job) when a metric errors out.
await evaluate({ testCases, metrics, failOnError: true });

And the matching GitHub Action that gates merges:

# .github/workflows/eval.yml
name: eval
on:
  pull_request:
    paths:
      - "src/lib/rag/**"
      - "src/prompts/**"
      - "evals/**"
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v3
      - uses: actions/setup-node@v4
        with: { node-version: 22, cache: pnpm }
      - run: pnpm install --frozen-lockfile
      - run: pnpm tsx evals/rag.eval.ts
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          DEEPEVAL_API_KEY: ${{ secrets.DEEPEVAL_API_KEY }}

Two design choices are worth calling out. First, the eval runs only on PRs that touch RAG-relevant paths, so it does not waste API spend on every README typo. Second, the judge model is pinned to a small cheap model - judge calls dominate the bill, and a smaller calibrated judge is almost always good enough. The full cost of a 200-case eval run on this setup is roughly $0.40 against a GPT-5-mini judge, which is cheap enough to run on every PR.

The trap of LLM-as-judge

The most common eval failure mode I see in 2026 is teams trusting their judge model without ever validating it. Judges are LLMs. They hallucinate, they have biases, they drift between model versions, and they will happily give you a 0.92 score on outputs that humans would call bad. If you ship LLM-as-judge metrics without calibration, you are measuring whether your judge likes your output, not whether your output is good.

The fix is calibration. Take 30 to 100 examples, have a human score them on the same rubric your judge uses, run the judge on the same examples, and measure agreement. Cohen's kappa above 0.6 is usable, above 0.75 is good. If your judge disagrees with humans more often than it agrees, your rubric is broken, your judge model is too weak, or both. I have seen judge scores swing 20 points when calibration was finally done - almost always downward, because the uncalibrated judge was rewarding verbosity and confident phrasing.
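
The agreement check is a few lines of arithmetic. Here is a minimal sketch of Cohen's kappa for categorical labels, assuming human and judge labels are stored as parallel arrays. The labels and data are made up for illustration.

```typescript
// Cohen's kappa: agreement between two raters, corrected for chance agreement.
function cohensKappa(human: string[], judge: string[]): number {
  if (human.length !== judge.length || human.length === 0) {
    throw new Error("label arrays must be non-empty and the same length");
  }
  const n = human.length;
  const humanCounts = new Map<string, number>();
  const judgeCounts = new Map<string, number>();
  let agreements = 0;

  human.forEach((h, i) => {
    const j = judge[i];
    if (j === undefined) return;
    if (h === j) agreements += 1;
    humanCounts.set(h, (humanCounts.get(h) ?? 0) + 1);
    judgeCounts.set(j, (judgeCounts.get(j) ?? 0) + 1);
  });

  // Observed agreement.
  const po = agreements / n;
  // Agreement expected by chance, from each rater's label frequencies.
  let pe = 0;
  humanCounts.forEach((count, label) => {
    pe += (count / n) * ((judgeCounts.get(label) ?? 0) / n);
  });

  if (pe === 1) return 1; // both raters used a single identical label throughout
  return (po - pe) / (1 - pe);
}

// Hypothetical calibration set: 10 examples, raters agree on 8.
const humanLabels = ["pass", "pass", "pass", "pass", "pass", "pass", "fail", "fail", "fail", "fail"];
const judgeLabels = ["pass", "pass", "pass", "pass", "pass", "fail", "fail", "fail", "fail", "pass"];

console.log(cohensKappa(humanLabels, judgeLabels).toFixed(2)); // 0.58
```

In this example the judge agrees with the human on 8 of 10 cases, yet kappa comes out at 0.58, just under the 0.6 bar. Raw agreement overstates judge quality whenever one label dominates, which is why kappa is the better number to track.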

The other trap is judge bias. Position bias means a judge prefers whichever answer is shown first when comparing two options. Verbosity bias means a judge prefers longer answers. Self-preference bias means a judge prefers outputs from its own model family, for example a GPT judge preferring GPT outputs to Claude outputs. Mitigations are straightforward - randomize position, normalize for length, and never use the same model as both generator and judge for production-critical metrics. None of this is exotic, but almost nobody does it on the first pass.
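
One common way to handle position bias in pairwise comparisons is to ask the judge twice with the answers swapped and only accept a verdict that survives the swap. This is a variant of the randomize-position advice rather than the same thing. The sketch below assumes a hypothetical `PairwiseJudge` callback; a real judge call would be async, and it is kept synchronous here so the example stays self-contained.

```typescript
type Verdict = "A" | "B" | "tie";

// Hypothetical judge signature: says which of the two shown answers is better.
type PairwiseJudge = (question: string, first: string, second: string) => "first" | "second";

// Ask in both orders; a verdict that flips with position is treated as a tie.
function debiasedCompare(
  judge: PairwiseJudge,
  question: string,
  answerA: string,
  answerB: string,
): Verdict {
  const forward = judge(question, answerA, answerB);  // A shown first
  const reversed = judge(question, answerB, answerA); // B shown first

  const forwardWinner: "A" | "B" = forward === "first" ? "A" : "B";
  const reversedWinner: "A" | "B" = reversed === "first" ? "B" : "A";

  return forwardWinner === reversedWinner ? forwardWinner : "tie";
}

// Toy judges standing in for real LLM calls.
const alwaysFirst: PairwiseJudge = () => "first"; // pure position bias
const prefersShorter: PairwiseJudge = (_q, first, second) =>
  first.length <= second.length ? "first" : "second"; // position-independent

console.log(debiasedCompare(alwaysFirst, "q", "short", "a much longer answer"));    // tie
console.log(debiasedCompare(prefersShorter, "q", "short", "a much longer answer")); // A
```

The cost is two judge calls per comparison. A high tie rate from this check is itself a useful signal that the judge or the rubric is position-sensitive.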

Finally, the rubric matters more than the framework. A bad rubric in DeepEval scores the same as a bad rubric in Braintrust. Spend the time to write a rubric a human reviewer could agree with, then wire it into whichever tool you picked. This is also where human-in-the-loop review flows pay off - your eval set grows from real production cases that humans have actually labeled.

What I run on every client project

After a year of mixing and matching, this is my default eval stack for new client work. It is not the only way to do it but it is the one I trust enough to stake project quality on.

DeepEval as the runner. Pytest-style, TypeScript and Python, slots into CI. Most metrics live here.

RAGAS for the four RAG metrics. Faithfulness, context recall, context precision, answer relevancy. Run as a separate eval stage on RAG projects, results uploaded to the same dashboard.

Braintrust for observability and dataset management. Production traces flow into Braintrust, the best ones get promoted into the eval dataset weekly, and the team uses the dashboard for cross-experiment comparison. This is also where stakeholders look at eval trends.

Promptfoo for prompt iteration spikes. When I am iterating on a single prompt or comparing models, the YAML workflow is faster than firing up the main eval suite. The good results graduate into DeepEval test cases.

A pinned judge model. Currently GPT-5-mini for cost-sensitive metrics, GPT-5 for high-stakes ones. Calibrated quarterly against a 50-example human-labeled set.

An eval budget alarm. Daily LLM spend on eval calls is tracked separately from production spend. If a PR triples the eval bill, that is a signal something has changed and needs a human look. I cover the cost-tracking patterns in detail in the OpenAI API cost post.
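
The budget alarm does not need tooling to get started. A minimal sketch, assuming you already log the dollar cost of each eval run somewhere: compare the current run against the median of recent runs and flag anything at three times the baseline or more. The function names and the sample numbers are hypothetical.

```typescript
// Median of a list of numbers; returns 0 for an empty list.
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const upper = sorted[mid] ?? 0;
  const lower = sorted[mid - 1] ?? upper;
  return sorted.length % 2 === 0 ? (lower + upper) / 2 : upper;
}

// True when the current run costs at least `factor` times the recent median.
function evalSpendAlarm(
  recentRunCostsUsd: number[],
  currentRunCostUsd: number,
  factor = 3,
): boolean {
  if (recentRunCostsUsd.length === 0) return false; // no baseline yet
  const baseline = median(recentRunCostsUsd);
  return baseline > 0 && currentRunCostUsd >= baseline * factor;
}

// Hypothetical history of per-run eval costs in USD.
const history = [0.38, 0.41, 0.4, 0.39, 0.42];

console.log(evalSpendAlarm(history, 0.43)); // false
console.log(evalSpendAlarm(history, 1.3));  // true
```

The median is used instead of the mean so that one earlier outlier run does not raise the baseline and mask the next spike.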

When to write your own eval vs use a framework

Almost never write your own from scratch. The frameworks above have spent thousands of engineering hours on edge cases you will not anticipate - token counting in judge prompts, multi-step scoring with retries, dataset versioning, result aggregation, statistical significance testing. Replicating any of that in a weekend produces something that looks right and silently lies.

The exceptions are narrow. If your domain has a scoring rule that nobody has implemented (say, IFRS-compliant financial-extraction accuracy with line-item tolerance), write that single metric as a custom scorer inside an existing framework. DeepEval and Braintrust both make this clean - subclass a base metric, return a score and an explanation, you are done. The framework handles everything around your custom logic.
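
The scoring logic for such a metric is usually small. Below is a sketch of a line-item accuracy scorer with a numeric tolerance, written as a plain function that returns a score and an explanation. It does not use the DeepEval or Braintrust base classes, whose exact interfaces you should take from their docs. The types, labels, and tolerance are hypothetical, and the rule is a toy rather than anything IFRS-specific.

```typescript
interface LineItem {
  label: string;
  amount: number;
}

interface MetricResult {
  score: number;  // fraction of expected line items extracted within tolerance
  reason: string; // human-readable explanation of any misses
}

function lineItemAccuracy(
  expected: LineItem[],
  extracted: LineItem[],
  tolerance = 0.01,
): MetricResult {
  if (expected.length === 0) return { score: 1, reason: "no expected line items" };

  const extractedByLabel = new Map<string, number>();
  for (const item of extracted) extractedByLabel.set(item.label, item.amount);

  const misses: string[] = [];
  for (const item of expected) {
    const got = extractedByLabel.get(item.label);
    if (got === undefined) {
      misses.push(`${item.label}: missing`);
    } else if (Math.abs(got - item.amount) > tolerance) {
      misses.push(`${item.label}: expected ${item.amount}, got ${got}`);
    }
  }

  const score = (expected.length - misses.length) / expected.length;
  const reason = misses.length === 0 ? "all line items within tolerance" : misses.join("; ");
  return { score, reason };
}

// Hypothetical extraction result: one rounding difference, one wrong value, one missing item.
const expectedItems: LineItem[] = [
  { label: "revenue", amount: 1200.0 },
  { label: "tax", amount: 240.0 },
  { label: "total", amount: 1440.0 },
];
const extractedItems: LineItem[] = [
  { label: "revenue", amount: 1200.004 },
  { label: "tax", amount: 204.0 },
];

console.log(lineItemAccuracy(expectedItems, extractedItems));
```

Here only `revenue` passes, so the score is one third and the reason lists the wrong `tax` value and the missing `total`. Wrapping this function in a framework's custom-metric class is then mostly boilerplate.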

The other narrow case is when your eval is downstream of an actual product metric. If the thing you care about is conversion rate, ticket-deflection rate, or a labeled outcome from your operations team, that is not really an LLM eval - it is a product analytics signal, and it should live in your data warehouse, not in DeepEval. The two complement each other; do not conflate them.

Cost math: what an eval suite actually runs

The dominant cost line for any framework above is judge API spend. The framework itself is either free or cheap; the LLM calls are not. Here is the math I use to size eval budgets at three traffic tiers, assuming a 200-case eval suite running on every PR plus a nightly run whose size grows with the tier (2000 cases for a typical small team).

| Tier | PR runs/day | Nightly cases | Judge model | Monthly eval LLM cost |
| --- | --- | --- | --- | --- |
| Solo / early MVP | 5 | 500 | GPT-5-mini | $20 to $40 |
| Small team (5 engineers) | 15 | 2000 | GPT-5-mini | $120 to $250 |
| Production AI product | 30 | 5000, mixed judge | GPT-5-mini + GPT-5 | $500 to $1200 |

Add framework hosting costs if you go SaaS: Braintrust free under a generous threshold then scales with traced calls, LangSmith free under 5K traces per month then per-seat plus per-trace, DeepEval optional Confident AI Cloud at modest seat pricing. For the open-source-only path (DeepEval self-hosted plus RAGAS plus Promptfoo plus your own dashboard), the only line item is judge API spend.

Closing

The framework wars in LLM eval are real but the stakes are smaller than the marketing suggests. Any of the five tools above is good enough to ship - the difference between picking well and picking badly is measured in a couple of weeks of engineering time and a few hundred dollars per month in LLM spend. The decision that actually matters is whether you have an eval suite at all, whether it runs on every PR, and whether you have calibrated your judge against humans. A 100-case DeepEval suite that ships beats a 10000-case Braintrust suite that nobody looks at.

Start with DeepEval for the runner. Add RAGAS if you are doing RAG. Add Braintrust if you want observability in the same product. Add Promptfoo for prompt iteration. Skip LangSmith unless you are on LangChain. Skip OpenAI Evals unless you are doing pure OpenAI-on-OpenAI comparisons. Pin your judge. Calibrate it. Re-calibrate it when you change models. That is the whole playbook.

Frequently asked questions

These are the questions I get most often when teams scope an eval stack with me. The answers are also embedded as FAQ structured data for search.

What is an LLM evaluation framework?

An LLM evaluation framework is a tool that measures whether your AI system produces good outputs against a fixed dataset of inputs and expected behaviors. The good ones run in CI on every PR, score outputs across multiple metric families (reference-based, LLM-as-judge, RAG-specific, agent trajectory), and fail the build when quality regresses.

Which LLM eval framework should I use in 2026?

For RAG pipelines, RAGAS is the most specialized. For TypeScript or Python apps that want pytest-style evals in CI, DeepEval is the safest default. For teams that already want observability and eval in one tool, Braintrust is the cleanest hosted option. For zero-config YAML iterations on prompts, Promptfoo. For LangChain shops, LangSmith.

Is DeepEval better than RAGAS?

They solve different problems. RAGAS is laser-focused on RAG metrics and the implementations are the most rigorous I have seen. DeepEval is a broader test framework with a pytest-like API, more than 14 built-in metrics, and good support for both Python and TypeScript. On a RAG project I will often use RAGAS for retrieval metrics and DeepEval for everything else.

How much does Braintrust cost compared to open source?

Braintrust has a free tier that covers small teams, then climbs as traffic and seat count grow. DeepEval, RAGAS, and Promptfoo are open source and free to run yourself - you pay only the LLM API spend for judge calls, which is the dominant cost line for almost everyone regardless of framework.

Is LLM-as-judge reliable?

Reliable enough to ship, not reliable enough to trust blindly. The standard pattern is to calibrate your judge against 30 to 100 human-labeled examples, measure inter-rater agreement, and re-validate any time you change the judge model or the rubric. A judge you have never validated is theater.

Should I write my own eval or use a framework?

Use a framework. The open-source options in 2026 are better than anything you will hand-roll in a month. The only reason to write custom evals on top of a framework is when your domain has a specific scoring rule that no library implements - and even then, write it as a single custom metric inside DeepEval or Braintrust, not from scratch. This is also the kind of decision I help with via AI agent development engagements.

How often should I run my evals?

On every pull request that touches a prompt, retriever, model, or tool. Plus a nightly run against a larger holdout set. Plus on every production deployment as a gate. If your eval suite takes more than 10 minutes, sample it down - a fast eval that runs catches more regressions than a comprehensive one that does not.
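
If you do sample the suite down, it helps to sample deterministically so every PR runs the same subset and scores stay comparable between runs. One possible sketch, assuming each eval case has a stable `id`: hash the ID and keep the case when the hash falls under the sampling rate. The hash is an FNV-1a-style function over UTF-16 code units, and all names here are hypothetical.

```typescript
// FNV-1a-style 32-bit hash over the string's UTF-16 code units.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Keep a case when its hash, mapped to [0, 1), falls below the sampling rate.
// The same ID always lands in the same bucket, so the subset is stable.
function sampleCases<T extends { id: string }>(cases: T[], rate: number): T[] {
  return cases.filter((c) => fnv1a(c.id) / 0x100000000 < rate);
}

// Hypothetical 2000-case suite, sampled to roughly 10% for PR runs.
const allCases = Array.from({ length: 2000 }, (_, i) => ({
  id: `case-${i}`,
  question: `question ${i}`,
}));
const prSubset = sampleCases(allCases, 0.1);

console.log(`${prSubset.length} of ${allCases.length} cases selected for the PR run`);
```

The full set still runs nightly; the sampled subset only decides what gates a PR. Adding new cases does not reshuffle which existing cases are in the subset, which a random shuffle with a fixed count would not guarantee.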

Does Promptfoo work for agents or only prompts?

Promptfoo is sharpest for prompt-level evaluation but it does support agent and tool-calling evals through provider plugins. For complex multi-step agent trajectories with tool use and state, I would reach for DeepEval or Braintrust first. For prompt iteration, model comparison, and quick regression suites, Promptfoo is the fastest tool in the category. If you want help scoping this end-to-end, I take engagements through my hire an AI developer in Kosovo page.