AI Engineering | 12 min read

LLM Observability: Langfuse vs LangSmith vs Helicone

By Ergini, Software & AI Developer in Pristina, Kosovo

TL;DR

Helicone takes one base-URL change. Langfuse self-hosts for free. LangSmith owns LangChain. I run all three in production and break down which fits which workload - with a TCO calculator at three traffic tiers.

Why LLM observability is different from APM

Traditional application performance monitoring assumes deterministic code. A function takes inputs, returns outputs, and either succeeds or throws. You instrument latency, error rate, and throughput, and Datadog or New Relic shows you the shape. LLM observability breaks every assumption in that model. The same input produces different outputs run to run. Cost is denominated in tokens rather than CPU seconds. A single user request fans out into a tree of model calls, retrieval steps, tool calls, and post-processing - and the only way to debug a bad final answer is to walk the whole tree.

Three things make LLM workloads genuinely different. First, non-determinism: the same prompt with the same model and the same temperature can produce different outputs because sampling, batch fingerprints, and rolling model updates all introduce variance. You cannot diff today's output against yesterday's the way you diff a unit test. Second, token cost as a first-class metric: a single runaway agent loop can cost more than a month of cloud compute, and the bill arrives a week after the damage is done. Third, trace trees: a chat with one retrieval, one rerank, one synthesis call, and three tool calls is six dependent spans (seven counting the root request span) that have to be linked through a session ID, a user ID, and a request ID before anyone can debug it.
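To make the trace-tree idea concrete, here is a minimal sketch of one user request modeled as a root span with a retrieval, a rerank, a synthesis call, and three tool calls beneath it. The `Span` shape, the span names, and the cost figures are hypothetical and chosen for illustration; each tool in this post defines its own schema.

```typescript
// Hypothetical span shape for illustration; real tools each define their own schema.
type SpanKind = "request" | "retrieval" | "rerank" | "llm" | "tool";

interface Span {
  id: string;
  kind: SpanKind;
  name: string;
  costUsd: number;
  latencyMs: number;
  error: string | null;
  children: Span[];
}

// Sum cost across a span and all of its descendants.
function totalCost(span: Span): number {
  return span.children.reduce((sum, child) => sum + totalCost(child), span.costUsd);
}

// Depth-first walk that returns the path of span names to the first span carrying an error.
function findFailingPath(span: Span, path: string[] = []): string[] | null {
  const here = [...path, span.name];
  if (span.error !== null) return here;
  for (const child of span.children) {
    const found = findFailingPath(child, here);
    if (found) return found;
  }
  return null;
}

// Helper for childless spans.
function leaf(id: string, kind: SpanKind, name: string, costUsd: number, latencyMs: number, error: string | null = null): Span {
  return { id, kind, name, costUsd, latencyMs, error, children: [] };
}

// One user request: retrieval, rerank, synthesis, and three tool calls under a root span.
const exampleTrace: Span = {
  id: "req_1", kind: "request", name: "chat", costUsd: 0, latencyMs: 4100, error: null,
  children: [
    leaf("s1", "retrieval", "retrieve-docs", 0, 180),
    leaf("s2", "rerank", "rerank", 0.001, 220),
    {
      id: "s3", kind: "llm", name: "synthesis", costUsd: 0.008, latencyMs: 2900, error: null,
      children: [
        leaf("s4", "tool", "get-weather", 0, 300),
        leaf("s5", "tool", "search-orders", 0, 250, "timeout"),
        leaf("s6", "tool", "send-email", 0, 120),
      ],
    },
  ],
};

console.log(totalCost(exampleTrace).toFixed(3));
console.log(findFailingPath(exampleTrace)?.join(" > "));
```

Walking the tree is what turns "the answer was bad" into "the second tool call under synthesis timed out", which is the debugging workflow every tool below is built around.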

The tools in this post all solve the same core problem - capture every model call as a structured span, tag it with cost and latency, link it into a trace tree, and let you slice the data by user, session, model, prompt version, and outcome. They differ in how invasive the integration is, how much they cost at scale, how good the eval story is, and how comfortable you are running your own infrastructure.

The 5 tools that matter in 2026

The LLM observability category has consolidated since 2024. There are dozens of products, but five are doing the bulk of the production work I see across client projects. Here is the cheat sheet I use when scoping a new build.

Tool | Self-host? | Open source? | Best for | Free tier | Integration time | Pricing shape
Langfuse | Yes (Postgres + container) | Yes, MIT | Framework-agnostic, evals, low TCO | 50k observations/mo | 5-15 min | Per-observation, cheap at scale
LangSmith | Self-hosted (Enterprise) | No | LangChain and LangGraph stacks | 5k traces/mo (Developer) | Zero with LangChain | Per-trace, paid plans climb fast
Helicone | Yes (Cloudflare Workers + Clickhouse) | Yes, Apache 2.0 | Proxy logging, cost dashboards, caching | 10k requests/mo | 1 line (base URL change) | Per-request, generous free tier
Braintrust | Hybrid (data plane) | No | Eval-first teams shipping prompts to prod | 1k rows/mo | 10-30 min | Per-evaluation row, premium pricing
Phoenix (Arize) | Yes (local Docker or notebook) | Yes, ELv2 | ML-research-leaning teams, embeddings drift | Self-host is free | 10-20 min | Arize Cloud is the paid tier

Three tools you will see mentioned and that I have skipped on purpose: Weights and Biases Weave is mature but heavier than the average application team needs; Datadog LLM Observability is fine if you are already a Datadog shop but it costs roughly what the rest of this category costs combined; and OpenLLMetry plus Traceloop is interesting if you want OpenTelemetry-native instrumentation, but you still need a backend like Langfuse or Phoenix to make the data useful.

The 4 things you actually need to observe

Before picking a tool, get clear on what you are actually trying to observe. Every LLM observability product markets itself as an all-in-one platform; in practice, you are trying to answer four distinct questions, and different tools answer them with different levels of polish.

1. Traces. The structured tree of every model call, tool call, retrieval, and rerank that produced a user-facing response. Without traces, debugging a bad answer is guesswork. With traces, you walk the tree and find the step that broke. This is the foundation; every tool here handles it.

2. Cost attribution. Token spend broken down by user, session, prompt version, model, and feature. This is the metric the CFO asks about the day after launch. Helicone is the cleanest here because it sits at the HTTP layer and sees every call regardless of framework.

3. Evals. Quality measurement that runs against captured traces - either online (sampling live traffic) or offline (a regression suite on a curated dataset). Braintrust is eval-first; Langfuse and LangSmith have strong eval support; Helicone is weakest here. I cover the eval framework picture in detail in the LLM eval framework comparison.

4. Errors and anomalies. Refusals, timeouts, schema violations, prompt injection attempts, and the long tail of weird-input failures. Every tool surfaces basic errors; the ones that catch anomalies (sudden cost spikes, latency regressions, refusal rate increases) are the ones that earn their keep at 3am.
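Cost attribution (question 2 above) reduces to simple arithmetic once every call is logged with its token counts. The sketch below shows one way to do it; the model names and per-million-token prices are hypothetical placeholders, not real provider rates, and the `CallRecord` shape is an assumption for illustration.

```typescript
// Hypothetical per-million-token prices in USD, for illustration only;
// look up your provider's current rates.
const PRICE_PER_MILLION: Record<string, { input: number; output: number }> = {
  "model-small": { input: 0.5, output: 1.5 },
  "model-large": { input: 5, output: 15 },
};

interface CallRecord {
  userId: string;
  feature: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
}

// Cost of a single call from its token counts.
function callCostUsd(call: CallRecord): number {
  const price = PRICE_PER_MILLION[call.model];
  if (!price) throw new Error(`No price configured for model ${call.model}`);
  return (call.inputTokens * price.input + call.outputTokens * price.output) / 1_000_000;
}

// Group spend by one dimension of the record: user, feature, or model.
function costBy(calls: CallRecord[], key: "userId" | "feature" | "model"): Map<string, number> {
  const totals = new Map<string, number>();
  for (const call of calls) {
    totals.set(call[key], (totals.get(call[key]) ?? 0) + callCostUsd(call));
  }
  return totals;
}

const exampleCalls: CallRecord[] = [
  { userId: "u1", feature: "chat", model: "model-large", inputTokens: 2000, outputTokens: 500 },
  { userId: "u1", feature: "summarize", model: "model-small", inputTokens: 10000, outputTokens: 1000 },
  { userId: "u2", feature: "chat", model: "model-large", inputTokens: 1000, outputTokens: 200 },
];

console.log(costBy(exampleCalls, "userId"));
console.log(costBy(exampleCalls, "feature"));
```

The observability tools do this for you; the sketch is mainly useful for sanity-checking a dashboard number against raw token counts.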

Langfuse deep dive: the practical default

Langfuse is the tool I default to when there is no strong reason to pick something else. It is open source under MIT, the self-hosted deployment is a Docker container plus a Postgres database, the cloud free tier is generous enough to validate a product, and the SDK is framework-agnostic. It speaks OpenTelemetry, integrates natively with the OpenAI SDK, the Anthropic SDK, the Vercel AI SDK, LangChain, LlamaIndex, and Haystack, and the trace UI is the cleanest in the category.

Strengths: the self-host story is genuinely zero-friction - docker compose up and you are running. The eval primitives are first-class, so you can store datasets, run offline evals, and tag traces with quality scores from the same UI. The pricing curve at scale is the most generous in the category; I have a client running 8 million traces per month for roughly $200 on the cloud plan.

Weaknesses: the UI is dense for newcomers, the dataset management interface is functional rather than delightful, and the alerting story is weaker than dedicated observability tools. If you want a beautiful dashboard you can show executives, Helicone wins. If you want the most complete picture of what the application is doing, Langfuse wins.

Pick Langfuse when you want a vendor-neutral tool you can either use as a managed cloud service or run yourself, you care about long-term cost, and your team is comfortable reading a dense UI. It is the default I recommend for every project that does not already use LangChain.

LangSmith deep dive: LangChain-native, premium-priced

LangSmith is the observability product from the LangChain team. If your application is built on LangChain or LangGraph, LangSmith requires approximately zero work to integrate - you set two environment variables and every chain, every agent, every tool call is automatically traced. The trace UI is shaped around LangChain concepts (chains, agents, tools, retrievers) and visualizes them better than anything else.

Strengths: zero-config tracing for LangChain users, the most mature prompt versioning and playground experience in the category, strong dataset and eval tooling, and the experience of being built by the team that wrote the framework you are using. For LangGraph in particular, the visualization of agent state transitions is genuinely useful.

Weaknesses: pricing is the highest of the five tools here and the Developer plan caps at 5,000 traces per month before paid plans kick in. The lock-in is real - your traces only make sense through the LangSmith UI, and exporting them is awkward. Outside of LangChain, integration is no easier than any other tool, and in some cases more awkward because the SDK assumes LangChain primitives. The self-hosted option is gated behind an Enterprise plan that starts in the four-figure-per-month range.

Pick LangSmith when your stack is LangChain or LangGraph, you value zero-config integration over price, and you do not anticipate leaving the ecosystem. For everyone else, it is hard to justify over Langfuse.

Helicone deep dive: the one-line install

Helicone takes a different architectural approach. Instead of an SDK that wraps your model calls, Helicone is a proxy that sits between your application and the model provider. You change the OpenAI base URL from api.openai.com to oai.helicone.ai, add a header with your Helicone key, and every call now gets logged, cost-tracked, and optionally cached. The integration is one line of code.

Strengths: the fastest setup in the category, period. The cost dashboards are the most polished, the caching feature can meaningfully cut costs on idempotent calls, and the rate-limiting and user-throttling features are genuinely useful for SaaS apps with per-user quotas. The free tier covers 10,000 requests per month, which is enough to validate a product before paying anything.

Weaknesses: because Helicone is a proxy, your application becomes dependent on Helicone uptime in the critical path. Their SLA is good in practice but it is a real consideration. The trace tree visualization is weaker than Langfuse or LangSmith because the proxy sees HTTP calls in isolation - linking them into a session requires you to pass session headers yourself. Eval support exists but is the weakest of the tools here.

Pick Helicone when you want cost tracking and basic logging with the absolute minimum integration work, or when you want proxy-level features like caching and rate limiting that the SDK-based tools cannot provide. It pairs well with Langfuse in the stack pattern I describe later.
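Because the proxy sees each HTTP call in isolation, session linking is something your code has to supply. A minimal sketch of a header builder follows. The header names are the ones I understand Helicone to document for sessions, user tracking, and caching; verify them against the current Helicone docs before relying on them. The `HeliconeContext` type and `heliconeHeaders` function are hypothetical helpers, not part of any SDK.

```typescript
// Hypothetical helper type; not part of the Helicone or OpenAI SDKs.
interface HeliconeContext {
  sessionId: string;   // stable for the whole conversation or job
  sessionPath: string; // position of this call within the session, e.g. "/chat/rerank"
  userId?: string;
  cache?: boolean;
}

// Build the per-request headers that let the proxy group isolated HTTP calls.
function heliconeHeaders(ctx: HeliconeContext): Record<string, string> {
  const headers: Record<string, string> = {
    "Helicone-Session-Id": ctx.sessionId,
    "Helicone-Session-Path": ctx.sessionPath,
  };
  if (ctx.userId) headers["Helicone-User-Id"] = ctx.userId;
  if (ctx.cache) headers["Helicone-Cache-Enabled"] = "true";
  return headers;
}

const synthesisHeaders = heliconeHeaders({
  sessionId: "sess_abc",
  sessionPath: "/chat/synthesis",
  userId: "user_123",
});

// With the OpenAI Node SDK these would be passed as per-request options, e.g.
// openai.chat.completions.create(body, { headers: synthesisHeaders });
console.log(synthesisHeaders);
```

Generating the session ID once per conversation and threading it through every call is the part that takes discipline; the header plumbing itself is trivial.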

Braintrust deep dive: evals as the center of gravity

Braintrust approaches the problem from the eval side first. Most tools add evals to a tracing product; Braintrust built an eval product and added tracing. The result is the strongest experience for teams that treat prompt and model changes like code - write the evaluation, run it on a dataset, compare experiments side by side, ship only if the score improves.

Strengths: the eval UI is the best in the category. Side-by-side diff of two experiments, with row-level disagreement highlighting, makes prompt iteration feel like a code review. The playground supports multiple providers and is genuinely useful for spike work. The hybrid data plane option keeps prompts and outputs in your infrastructure while metadata flows to Braintrust.

Weaknesses: pricing is premium and meant for funded teams. The free tier is small (1,000 evaluation rows per month) and paid plans land north of $250 per month for any serious volume. Tracing-only use cases are not where Braintrust shines - you are paying for the eval platform whether you use it or not. The learning curve to get the most out of it is real.

Pick Braintrust when evaluation is your central workflow, budget is not the constraint, and your team is willing to invest in the methodology. For agents and assistants where output quality is the product, this is the tool that takes you furthest.

Phoenix and Arize: the open-source ML-team option

Phoenix is the open-source LLM observability tool from Arize. It runs locally in a notebook or as a Docker container, ships the OpenInference instrumentation library (which is becoming a de facto standard alongside OpenTelemetry), and is particularly strong for embeddings analysis, retrieval evaluation, and drift detection. The aesthetic is closer to a Jupyter-native ML tool than to a production application dashboard.

Strengths: the embeddings visualization for RAG debugging is unique - you can see clusters of retrieved chunks plotted in 2D and spot when retrieval is missing a semantic region. The OpenInference instrumentation is framework-agnostic and works with LlamaIndex, LangChain, Haystack, and direct OpenAI calls. Self-hosting cost is effectively zero. Arize Cloud is the paid upgrade path when you outgrow local Phoenix.

Weaknesses: the UI is less polished than the SaaS competitors, the alerting and team-collaboration features are thinner, and the eval primitives are less developed than Braintrust or Langfuse. If your team thinks like application engineers, Phoenix can feel notebook-flavored.

Pick Phoenix when your team has ML-research DNA, your RAG debugging would benefit from embeddings visualization, and you want a self-hosted open-source tool with zero infrastructure commitment. It also works well as a development-time tool alongside a different production observability backend.

Integration time head-to-head

Here is what wiring each tool into a TypeScript application actually looks like. These are the snippets I use as a starting point on client projects. Every example assumes you have already set the relevant API keys as environment variables.

Helicone - one line, just change the base URL:

import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "https://oai.helicone.ai/v1",
  defaultHeaders: {
    "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
  },
});

Langfuse - wrap the OpenAI client and you get full trace trees automatically:

import { observeOpenAI } from "langfuse";
import OpenAI from "openai";

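// userId and sessionId are passed as top-level options so Langfuse can group
// calls by user and session; metadata is for free-form extras.
const openai = observeOpenAI(new OpenAI(), {
  generationName: "chat-completion",
  userId: "user_123",
  sessionId: "sess_abc",
});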

LangSmith - set two environment variables (plus an optional project name), the LangChain SDK does the rest:

# .env
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=ls__...
# Optional: groups traces under a named project
LANGCHAIN_PROJECT=my-agent-prod

Braintrust - wrap calls explicitly so you can attach eval scores later:

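import { initLogger, wrapOpenAI } from "braintrust";
import OpenAI from "openai";

// A logger has to be initialized before wrapped calls are recorded.
// The project name is a placeholder; the API key is read from the environment.
initLogger({ projectName: "my-agent-prod" });

const openai = wrapOpenAI(new OpenAI());

// Calls are now logged to Braintrust with full request/response capture.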

Phoenix - instrument once at app start with OpenInference, then make normal calls:

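import { registerInstrumentations } from "@opentelemetry/instrumentation";
import { OpenAIInstrumentation } from "@arizeai/openinference-instrumentation-openai";

// Assumes an OpenTelemetry tracer provider with an exporter pointed at your
// Phoenix collector is already registered; without one, spans are created
// but never leave the process.
registerInstrumentations({
  instrumentations: [new OpenAIInstrumentation()],
});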

The integration-time gap is real but smaller than the marketing suggests. All five are under 30 minutes of work for a competent engineer. The interesting differences show up at scale, in cost, and in the slice-and-dice power of the trace UI.

TCO at 3 traffic tiers

Tool cost matters most at the boundaries - the free tier where you validate, and the high-traffic tier where you start writing real checks. Here is what the bill actually looks like at three realistic volumes for an application that averages 5 LLM calls per user request, retains traces for 30 days, and runs a small eval suite weekly.

Tool | 100k traces/mo | 1M traces/mo | 10M traces/mo | Notes
Langfuse Cloud | $0 (free tier) | $59 (Pro) | ~$300-500 | Per-observation pricing, cheapest at scale
Langfuse self-hosted | ~$30 (small VM) | ~$60 (medium VM) | ~$150 (VM + Postgres) | Infra only, zero license cost
LangSmith | $0 (Developer) | $199 (Plus) | ~$2,000+ (Enterprise) | Climbs fastest, Enterprise is custom
Helicone Cloud | $0 (free tier) | $25 (Pro) | ~$250-600 | Generous free, per-request pricing
Helicone self-hosted | ~$40 (Workers + DB) | ~$80 | ~$200 | Requires Clickhouse, more ops
Braintrust | $0 (free tier) | $249 (Pro) | ~$1,500+ (custom) | Eval-row pricing, premium tier
Phoenix self-hosted | $0 (local) | ~$50 (Docker on VM) | ~$150 (VM + Postgres) | Open source, no SaaS markup

Three honest caveats. First, these are list prices and most vendors will negotiate at the higher tiers. Second, your real bill depends on retention - 90-day retention can double the cost. Third, the eval workload layered on top usually costs more than the observability tool itself, because every eval is more LLM calls. Budget for it. The same patterns I document in the OpenAI API cost breakdown apply directly here.

What I actually run in production

After shipping observability stacks for clients and my own products since 2023, here is the stack I default to in 2026 unless there is a strong reason to change it.

Helicone in the proxy layer, Langfuse in the application layer. Helicone catches every model call at the HTTP boundary, gives me the cost dashboard the founder asks about, and provides cache and rate-limit primitives I would have to build otherwise. Langfuse wraps the same calls at the application layer, builds the trace tree across retrievals and tool calls, stores my eval datasets, and gives me the deep debugging UI when something goes wrong.

The two tools observe different layers and do not duplicate. Helicone sees "OpenAI request with prompt X took 1200ms and cost $0.003." Langfuse sees "user request with session Y triggered retrieval, then rerank, then synthesis, then two tool calls, total cost $0.012, total latency 4.1 seconds, final eval score 0.84." Both views matter; neither replaces the other.

For projects where evaluation is the central workflow - agents whose quality is the product, or content-generation pipelines where small prompt changes matter - I add Braintrust on top for the eval experience. That is a heavier stack and I only run it when the value is clear. For the architecture patterns this stack supports, see the AI SaaS architecture and agentic RAG architecture guides - observability is the dependency that makes both of those production-ready.

Self-host vs SaaS - when each makes sense

Three of the tools in this post (Langfuse, Helicone, Phoenix) have credible self-host options. The choice between self-host and SaaS is rarely about features - it is about three other constraints.

Data residency. If your application processes regulated data (healthcare, finance, EU personal data under GDPR), the prompts and outputs that flow into your observability tool are themselves regulated data. SaaS observability means that data leaves your perimeter. Self-host keeps it inside. Some SaaS providers (Braintrust's hybrid data plane, Langfuse Enterprise) offer architectures where prompts stay on your infrastructure and only metadata flows out - those are worth a serious look if you fall in this bucket.

Cost crossover. The cost math flips around 5 to 10 million traces per month. Below that, SaaS free tiers and the smallest paid plans are cheaper than the engineer time to run your own infrastructure. Above that, self-hosting Langfuse on a Postgres-backed VM can save several hundred dollars per month - meaningful for some teams, noise for others.

Ops overhead. Self-hosting means backups, upgrades, monitoring, on-call. Langfuse and Helicone are not complicated to run, but they are real systems. If you do not have a person who already runs Postgres and Docker for production, the SaaS premium is the right trade. If you do, the operational tax is small.
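The crossover and the ops tax can be put into one small calculation. The sketch below uses the Langfuse figures from the TCO table as example inputs; the ops hours and the hourly rate are assumptions I picked for illustration, so substitute your own.

```typescript
interface MonthlyCost {
  saasUsd: number;       // hosted plan bill
  infraUsd: number;      // VM plus database for self-hosting
  opsHours: number;      // assumed engineer hours on backups, upgrades, on-call
  hourlyRateUsd: number; // assumed loaded engineer cost
}

// Full monthly cost of self-hosting: infrastructure plus the people running it.
function selfHostTotalUsd(c: MonthlyCost): number {
  return c.infraUsd + c.opsHours * c.hourlyRateUsd;
}

// Positive means self-hosting is cheaper; negative means SaaS is cheaper.
function selfHostSavingsUsd(c: MonthlyCost): number {
  return c.saasUsd - selfHostTotalUsd(c);
}

// Langfuse at 1M traces/mo: $59 Pro plan vs ~$60 of infrastructure.
const atOneMillion: MonthlyCost = { saasUsd: 59, infraUsd: 60, opsHours: 2, hourlyRateUsd: 80 };

// Langfuse at 10M traces/mo: upper end of the ~$300-500 range vs ~$150 of infrastructure.
const atTenMillion: MonthlyCost = { saasUsd: 500, infraUsd: 150, opsHours: 2, hourlyRateUsd: 80 };

console.log(selfHostSavingsUsd(atOneMillion));
console.log(selfHostSavingsUsd(atTenMillion));
```

Under these assumptions SaaS wins comfortably at 1M traces and self-hosting pulls ahead at 10M. The result is sensitive to the ops-hours estimate, which is the number teams most often leave out.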

The pattern I see most often with clients I onboard through my AI integration work: start on SaaS, validate the product, move to self-host when the bill becomes painful or the data residency story forces it. The migration is half a day for Langfuse if you have used the SDK from the start.

Pitfalls - what these tools don't catch

Observability tools are necessary but not sufficient. Here are the failure modes I have seen ship to production with an observability tool happily logging the whole way down.

Eval is not observability. Trace tools tell you what happened. They do not tell you whether what happened was good. A trace with 200ms latency, $0.001 cost, and a confident-looking answer might still be hallucinated. You need an eval layer - either online sampling or an offline regression suite - that scores quality independently of the trace. The Braintrust and Langfuse eval features exist precisely because tracing alone is not enough.
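As a rough illustration of an offline regression check that scores quality independently of the trace, the sketch below compares a candidate prompt against a baseline on the same dataset. The `EvalRow` shape and the phrase-matching scorer are assumptions that stand in for a real scorer (LLM-as-judge, exact match, a rubric); this is not how Braintrust or Langfuse implement evals.

```typescript
// Hypothetical shape: one row per captured trace plus the phrases a good answer must contain.
interface EvalRow {
  input: string;
  output: string;        // model output captured in the trace
  mustContain: string[]; // assumption: a phrase check stands in for a real scorer
}

// Fraction of required phrases found in the output, case-insensitive.
function scoreRow(row: EvalRow): number {
  if (row.mustContain.length === 0) return 1;
  const text = row.output.toLowerCase();
  const hits = row.mustContain.filter((phrase) => text.includes(phrase.toLowerCase())).length;
  return hits / row.mustContain.length;
}

// Average score over a dataset; an empty dataset scores 0.
function meanScore(rows: EvalRow[]): number {
  if (rows.length === 0) return 0;
  return rows.reduce((sum, row) => sum + scoreRow(row), 0) / rows.length;
}

// Gate a prompt change: the candidate must not score below the baseline.
function passesGate(baseline: EvalRow[], candidate: EvalRow[]): boolean {
  return meanScore(candidate) >= meanScore(baseline);
}

const required = ["30 days", "refund"];
const baselineRows: EvalRow[] = [
  { input: "What is the refund policy?", output: "Refunds are available within 30 days.", mustContain: required },
];
const candidateRows: EvalRow[] = [
  { input: "What is the refund policy?", output: "We offer refunds.", mustContain: required },
];

console.log(meanScore(baselineRows), meanScore(candidateRows), passesGate(baselineRows, candidateRows));
```

Both outputs above would look equally healthy in a trace view: fast, cheap, confident. Only the scorer notices that the candidate dropped the 30-day detail.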

Sampling bias. Most teams that hit volume eventually turn on sampling - log 10% of traces, not 100%. The moment you do, your dashboards become a sample, and your sample is biased toward whatever the sampler picks (random by default, head-based in some tools, tail-based in others). Errors and slow calls are exactly the ones you want full visibility on. Make sure your sampling rule keeps errors at 100%.
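One way to express "sample the boring traffic, keep every error" is a decision function that runs once the outcome of a trace is known. This is a sketch under that assumption; the `TraceSummary` shape and the thresholds are hypothetical, and the tools in this post expose sampling through their own configuration rather than this code.

```typescript
// Hypothetical summary of a finished trace.
interface TraceSummary {
  traceId: string;
  hasError: boolean;
  latencyMs: number;
}

// FNV-1a hash mapped to [0, 1) so the keep/drop decision is stable per trace ID.
function hashToUnit(id: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < id.length; i++) {
    h ^= id.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h / 0x100000000;
}

// Errors and slow traces are always kept; everything else is sampled at sampleRate.
function shouldKeep(trace: TraceSummary, sampleRate: number, slowMs: number): boolean {
  if (trace.hasError) return true;
  if (trace.latencyMs >= slowMs) return true;
  return hashToUnit(trace.traceId) < sampleRate;
}

const failed: TraceSummary = { traceId: "t-1", hasError: true, latencyMs: 300 };
const slow: TraceSummary = { traceId: "t-2", hasError: false, latencyMs: 9000 };
const normal: TraceSummary = { traceId: "t-3", hasError: false, latencyMs: 300 };

console.log(shouldKeep(failed, 0.1, 5000), shouldKeep(slow, 0.1, 5000), shouldKeep(normal, 0.1, 5000));
```

Hashing the trace ID instead of calling a random number generator means every service that sees the same trace makes the same decision, so you do not end up with half a tree.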

Trace cardinality explosions. A naive implementation that creates a new session per page load produces millions of single-trace sessions, which makes session-level dashboards meaningless. Conversely, lumping all anonymous users into one session ID makes per-user debugging impossible. Design your session-ID scheme deliberately.
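A deliberate session-ID scheme can be as small as "one actor plus an idle timeout". The sketch below is one possible design, not a recommendation from any of the tools; the `Session` shape, the ID format, and the 30-minute timeout are all assumptions.

```typescript
// Hypothetical session record. actorId is a user ID, or a per-device ID for
// anonymous users, so anonymous traffic is never lumped into one session.
interface Session {
  id: string;
  actorId: string;
  lastSeenMs: number;
}

// Reuse the previous session while the same actor stays active; start a new
// one after an idle gap instead of on every page load.
function resolveSession(prev: Session | null, actorId: string, nowMs: number, idleTimeoutMs: number): Session {
  if (prev !== null && prev.actorId === actorId && nowMs - prev.lastSeenMs <= idleTimeoutMs) {
    return { ...prev, lastSeenMs: nowMs };
  }
  return { id: `${actorId}:${nowMs}`, actorId, lastSeenMs: nowMs };
}

const THIRTY_MINUTES_MS = 30 * 60 * 1000;
const first = resolveSession(null, "device_42", 0, THIRTY_MINUTES_MS);
const second = resolveSession(first, "device_42", 5 * 60 * 1000, THIRTY_MINUTES_MS);
const third = resolveSession(second, "device_42", 2 * 60 * 60 * 1000, THIRTY_MINUTES_MS);

console.log(first.id, second.id, third.id);
```

Here the first two requests share a session and the third, arriving after a long idle gap, starts a new one.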

Prompt and output PII. The prompts that flow into your observability tool contain whatever the user typed. Sensitive emails, personal information, sometimes credit card numbers. Every tool here supports redaction hooks; very few teams enable them on day one. Set up the redaction before you need it, not after. This is the same defense pattern that shows up in the prompt injection defense playbook.
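For a sense of what a redaction hook does, here is a minimal sketch that masks email addresses and likely card numbers before text is logged. The patterns are deliberately simple and will miss plenty of real-world PII; treat it as a starting point, not a complete scrubber. How you register such a function depends on the tool, so the wiring is left out.

```typescript
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
// 13 to 19 digits, optionally separated by single spaces or hyphens.
const CARD_CANDIDATE = /\b\d(?:[ -]?\d){12,18}\b/g;

// Luhn checksum, used to reduce false positives on long digit runs such as order IDs.
function passesLuhn(digits: string): boolean {
  let sum = 0;
  let doubleNext = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48;
    if (doubleNext) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    doubleNext = !doubleNext;
  }
  return sum % 10 === 0;
}

// Replace emails and Luhn-valid card candidates with placeholders.
function redact(text: string): string {
  return text
    .replace(EMAIL, "[EMAIL]")
    .replace(CARD_CANDIDATE, (match) => (passesLuhn(match.replace(/[ -]/g, "")) ? "[CARD]" : match));
}

console.log(redact("Contact jane@example.com, card 4111 1111 1111 1111"));
console.log(redact("order 1234567890123"));
```

The Luhn check is the design choice worth keeping: without it, any long order number or tracking ID gets masked and the trace becomes harder to debug.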

Cost spike alerts. Tools track cost, but almost none alert on it by default. A runaway agent loop can burn through the monthly budget overnight. Set a daily cost threshold alert on day one, not month three.
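If your tool does not alert on cost out of the box, a daily threshold check over exported cost events is a few lines. The sketch below assumes you can export events with a UTC ISO 8601 timestamp and a dollar cost; the `CostEvent` shape and the budget figure are hypothetical.

```typescript
// Hypothetical exported event: ISO 8601 UTC timestamp plus cost in USD.
interface CostEvent {
  timestamp: string;
  costUsd: number;
}

// Sum cost per calendar day (UTC), keyed by "YYYY-MM-DD".
function dailyTotals(events: CostEvent[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const event of events) {
    const day = event.timestamp.slice(0, 10);
    totals.set(day, (totals.get(day) ?? 0) + event.costUsd);
  }
  return totals;
}

// Days whose total spend exceeded the budget, in chronological order.
function daysOverBudget(events: CostEvent[], dailyBudgetUsd: number): string[] {
  const over: string[] = [];
  dailyTotals(events).forEach((total, day) => {
    if (total > dailyBudgetUsd) over.push(day);
  });
  return over.sort();
}

const exampleEvents: CostEvent[] = [
  { timestamp: "2026-01-05T10:00:00Z", costUsd: 12 },
  { timestamp: "2026-01-05T23:30:00Z", costUsd: 45 },
  { timestamp: "2026-01-06T01:00:00Z", costUsd: 8 },
];

console.log(daysOverBudget(exampleEvents, 50));
```

In practice you would run this on a schedule and page someone when the current day appears in the result, rather than reviewing it after the fact.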

Frequently asked questions

These are the questions I get most often when teams scope an observability stack with me. The answers are also embedded as FAQ structured data for search.

What is LLM observability?

LLM observability is the practice of capturing every model call, tool call, retrieval step, and evaluation result in a structured trace so you can debug failures, attribute cost, and measure quality over time. It differs from traditional APM because the inputs and outputs are non-deterministic strings, the unit cost is token-based, and a single user request can fan out into a tree of dependent LLM calls that all need to be linked together.

Which LLM observability tool should I pick first?

If you are using LangChain or LangGraph, start with LangSmith because the integration is automatic and the trace UI is shaped around chains. If you are framework-agnostic and cost-sensitive, start with Langfuse - the cloud free tier is generous, and once your traffic crosses the paid threshold the self-host option carries no license cost, only infrastructure. If you just want a one-line install and price-per-call dashboards, Helicone is the fastest path because it works as a proxy with a single base URL change.

How much does LLM observability cost in production?

At 100,000 traces per month, all the major tools have free or near-free tiers. At 1 million traces per month, expect roughly $25 to $250 per month depending on tool and retention. At 10 million traces per month, hosted plans land anywhere from roughly $250 to $2,000 or more per month, and self-hosting Langfuse on a small Postgres-backed VM becomes meaningfully cheaper. The number that surprises people is not the tool cost - it is the eval cost layered on top.

Is Langfuse really free if I self-host?

Yes, the core Langfuse server is MIT-licensed and the self-hosted deployment has no usage caps or feature gates for the open-source tier. You pay for the Postgres database and the container hosting, which is typically $30 to $80 per month for a workload up to a few million traces. The Enterprise edition adds SSO, role-based access control, and a few advanced features but is not required for production use.

What is the difference between Langfuse and LangSmith?

LangSmith is built and operated by the LangChain team. It is the most polished experience if your stack is LangChain or LangGraph, and the SDK integration is automatic. Langfuse is independent, open-source, framework-agnostic, and can be self-hosted. LangSmith is faster to set up if you already use LangChain; Langfuse gives you more control, lower long-term cost, and works just as well with the Vercel AI SDK, plain OpenAI calls, or any custom orchestrator.

Does Helicone slow down my LLM calls?

Helicone runs as a proxy, so every call passes through their edge before reaching OpenAI or Anthropic. In practice the added latency is in the 20 to 60 millisecond range, which is invisible against the multi-second LLM response time. The benefit is that you do not have to instrument anything in your code - you change the base URL and you get logs, cost tracking, and caching for free. The downside is that you are now dependent on Helicone uptime in the request path.

Can I combine multiple observability tools?

Yes, and I often do. A common pattern is Helicone as the proxy for cost tracking and rate-limit protection, plus Langfuse for trace-level debugging and eval runs. They observe different layers - Helicone watches the HTTP call, Langfuse watches the application logic - so they do not duplicate. The cost is roughly the sum of the two free tiers until you grow into paid plans on either side.

What about Phoenix from Arize and other open-source options?

Phoenix is the open-source LLM observability tool from Arize. It is particularly strong for teams that already think in ML-evaluation terms - drift detection, embeddings analysis, and cluster visualization. It is a lighter-weight Langfuse alternative for notebooks and local development, and the OpenInference instrumentation it ships is becoming a standard. If your team is ML-research-leaning rather than application-engineering-leaning, Phoenix is worth a serious look.

Closing

LLM observability in 2026 is a solved category in the sense that any of the five tools above will get you to production. The difference between picking well and picking badly is a few hours of integration time, a few hundred dollars per month at scale, and the difference between debugging a bad answer in five minutes versus five hours. Default to Langfuse if you have no constraints, add Helicone at the proxy layer if you want cost dashboards and caching, reach for LangSmith only if you live in the LangChain ecosystem, reach for Braintrust if evaluation is your central workflow, and reach for Phoenix if your team thinks in embeddings. Wire it on day one, not month three - the trace you do not capture today is the bug you cannot reproduce tomorrow. If you want help putting this together for your stack, the work I cover under AI agent development and the team I can introduce through hire an AI developer in Kosovo both default to the stack patterns in this post.