If you're researching Langfuse pricing, you're probably not deciding between $29 and $199. You're trying to figure out whether Langfuse will scale to your agent workload without a surprise bill or a retention cliff that blinds you during your next incident. Best picks up front: Hobby for solo prototyping, Core for early-production teams with simple LLM calls, Pro for multi-step agents that need 90-day forensic retention, self-host when data residency or volume tips the TCO math. The one thing every other pricing guide skips: Langfuse bills on "units," and agentic workloads generate them fast.
How We Evaluated Langfuse Pricing
We evaluated Langfuse across five dimensions: published plan pricing accuracy, unit-billing mechanics and how predictable they are for agent workloads, data retention windows, rate limits and ingestion throughput, and the real total cost of self-hosting beyond "it's free." We reviewed Langfuse's official pricing page, billing docs, and unit definition documentation as primary sources, with third-party content used only for corroboration.
The lens matters here. Most pricing guides treat Langfuse as an LLM tracing tool, evaluating it for chatbot or single-call workloads. We evaluated it specifically for teams running multi-step AI agents, where each run generates dozens of spans, tool calls, and eval scores. That's where unit consumption forecasting breaks down, and where the difference between "affordable" and "budget-blowing" lives.
One honest caveat: Enterprise pricing is custom. The figures we cite for that tier come from published guidance and third-party breakdowns, not a signed contract.
Langfuse Cloud Pricing: All Four Tiers Broken Down
Langfuse offers four tiers. Here's what each actually includes, with an honest warning for each one.
Hobby (Free)
- Monthly price: $0
- Included units: 50,000/month
- Overage: None; you hit a hard limit
- Data retention: 30 days
- Rate limits: Lower throughput caps
Best for: Solo developers prototyping before any real traffic hits.
Warning: 50,000 units sounds like a lot until you realize a multi-step agent with 15 spans and a couple of eval scores burns through roughly 20 units per run. That's about 2,500 agent runs before you hit the ceiling. If you're doing any meaningful testing, you'll hit it in days.
Core (~$29/month)
- Included units: 100,000/month
- Overage rate: ~$8 per additional 100,000 units
- Data retention: 30 days
- SSO/RBAC: Not included
Best for: Early-production teams with light, simple LLM call traffic and predictable volumes.
Warning: The 30-day retention window sounds fine in month one. It becomes a problem the moment you need to diagnose a regression that started 45 days ago. You'll have no forensic data to replay. If your agents have any complexity, you'll outgrow this tier faster than the price suggests.
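To make the Core overage arithmetic concrete, here is a minimal sketch using only the approximate figures quoted above ($29 base, 100,000 included units, about $8 per additional 100,000). Whether overage is prorated or charged per started block is an assumption in this sketch, so both variants are shown; the function name and the `round_up_blocks` flag are hypothetical.

```python
import math

CORE_BASE_USD = 29.0           # approximate base price quoted above
CORE_INCLUDED_UNITS = 100_000  # included units quoted above
OVERAGE_USD_PER_100K = 8.0     # approximate overage rate quoted above


def estimate_core_bill(units_used: int, round_up_blocks: bool = False) -> float:
    """Estimate a monthly Core bill in USD for a given unit count."""
    overage_units = max(0, units_used - CORE_INCLUDED_UNITS)
    blocks = overage_units / 100_000
    if round_up_blocks:
        # Assumption: overage is charged per started block of 100,000 units.
        blocks = math.ceil(blocks)
    return round(CORE_BASE_USD + blocks * OVERAGE_USD_PER_100K, 2)


print(estimate_core_bill(90_000))                          # inside the allowance
print(estimate_core_bill(190_000))                         # prorated overage
print(estimate_core_bill(190_000, round_up_blocks=True))   # block-based overage
```

Treat the result as a rough planning number and check the current billing docs for how overage is actually metered.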
Pro (~$199/month)
- Included units: Higher allowance with volume discount on overages
- Data retention: 90 days
- SSO: Included (check current plan page for RBAC details)
- Rate limits: Higher throughput
Best for: Production agent teams running multi-step workflows who need a real retention window for incident investigation.
Warning: 90-day retention is meaningfully better than 30 days, but unit costs still compound quickly for complex agents. Run the unit math before assuming the base price is your actual monthly spend.
Enterprise (~$2,499/month or custom)
- Data retention: Up to 3 years
- Governance: SSO, RBAC, SCIM provisioning, audit logs
- SLAs: Included
- Pricing: Custom; published figures are ballpark only
Best for: Organizations with compliance requirements, large engineering teams, or multi-year retention needs.
Warning: Retention and audit logs are not the same as semantic failure detection. Long retention windows help with forensics after the fact, but they don't catch failures in the first place.
LLMs hallucinate in anywhere from 3% to 27% of responses depending on use case, and Suprmind's 2026 research notes that hallucinated responses still return a clean HTTP 200 to standard monitoring.
How Langfuse Units Actually Work (And Why Agent Teams Underestimate Them)
A Langfuse "unit" is not a conversation, and it's not a token. Units are billable observations: each trace, each span, each tool call observation, each eval score, and certain prompt artifacts all count separately toward your monthly unit total.
Glassbrain's 2026 Langfuse pricing breakdown puts it plainly: a complex agent with tool calls, retrieval, reranking, and multi-step reasoning can easily generate ten to thirty observations per user request. That's before eval scores.
Here's the math most teams don't run before picking a tier:
Worked example: One agent run
- 15 spans (LLM calls, tool calls, retrieval steps) = 15 units
- 2 eval scores attached to the trace = 2 units
- 1 top-level trace = 1 unit
- Total per run: ~18 units
Now scale that:
| Monthly Agent Runs | Spans per Run | Eval Scores per Run | Est. Units/Month | Likely Tier | Est. Monthly Cost |
|---|---|---|---|---|---|
| 1,000 | 10 | 2 | ~13,000 | Hobby | $0 |
| 5,000 | 15 | 2 | ~90,000 | Core | ~$29 |
| 10,000 | 15 | 3 | ~190,000 | Core + overage | ~$37-$50 |
| 50,000 | 20 | 3 | ~1.2M | Pro + overage | $199+ |
| 100,000 | 25 | 4 | ~3.0M | Pro heavy overage or Enterprise | $400+ |
OneUptime's 2026 analysis quantifies exactly this dynamic: an agent that reasons through 5 steps generates 40-75 spans per user interaction, multiplying unit consumption 8-15x compared to traditional API calls. Planning on conversation count instead of spans-per-run causes teams to underestimate their unit usage by a significant margin.
Instrumentation choices compound this further. CheckThat's Langfuse pricing analysis found that unit consumption varies 3-5x based on instrumentation decisions alone: logging every intermediate step in a chain costs substantially more than logging only the top-level trace, with optimized instrumentation enabling 50-90% cost reductions.
The practical implication: before you lock in a tier, count your spans per run, not your runs per month. The number that matters for billing is the product of both.
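The "product of both" can be sketched in a few lines. This follows the counting used in the worked example (one unit per span, one per eval score, one per top-level trace); the function names are hypothetical, and real unit counts depend on how your instrumentation emits observations.

```python
def units_per_run(spans: int, eval_scores: int, traces: int = 1) -> int:
    """Billable units for one agent run under the counting used in this article."""
    return spans + eval_scores + traces


def monthly_units(runs_per_month: int, spans: int, eval_scores: int) -> int:
    """Monthly unit estimate: runs per month times units per run."""
    return runs_per_month * units_per_run(spans, eval_scores)


# Scenarios taken from the worked example: (runs/month, spans/run, eval scores/run)
for runs, spans, evals in [(5_000, 15, 2), (10_000, 15, 3)]:
    total = monthly_units(runs, spans, evals)
    print(f"{runs:>6} runs x {units_per_run(spans, evals)} units = {total:,} units/month")
```

Swapping in your own measured spans per run is the useful part; doubling the spans roughly doubles the bill even if run volume stays flat.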
One additional wrinkle for teams running reasoning models like OpenAI o1: token and cost tracking can't always be inferred when usage fields aren't ingested. This doesn't affect unit billing directly, but it does affect your ability to correlate trace cost with agent behavior, which matters if you're trying to understand which agent patterns are expensive and why.
We analyze agent behavior across millions of logs at Sentrial, and the patterns hold: 22% of agent failures are explicit tool call failures, the kind that produce errors Langfuse will surface clearly. The other 78% are silent regressions: hallucinations, user frustration, agent forgetfulness. No unit count tells you about those.
Self-Hosted Langfuse: Free in Licensing, Not in Practice
Langfuse's open-source version is genuinely free to self-host. No licensing fee, and Langfuse doesn't artificially gate features behind the cloud tier. For teams with data residency requirements or the infrastructure muscle to run it, self-hosting is a real option, not a consolation prize.
But "free" refers to licensing. The actual cost has two components that most self-host comparisons understate.
Infrastructure costs: A containerized Langfuse deployment with a Postgres database and object storage for trace retention runs $100-$400/month for modest agent workloads, based on typical cloud provider pricing. TrueFoundry's analysis of self-hosted observability stacks puts the range at $200-$800/month including configuration overhead.
Engineering time: This is the bigger number. Langfuse ships frequently, which is a feature of the product and a cost of self-hosting. Someone on your team owns the upgrade cycle, the backup strategy, and the on-call responsibility when ingestion breaks at 2am. SitePoint's 2026 self-hosted infrastructure cost analysis estimates that a production system requires 20-30% of a senior engineer's time, translating to $3,000-$6,000/month in staffing cost at Series A+ salaries.
Self-host decision checklist:
Self-hosting makes sense when:
- Data residency requirements are a hard blocker, not a preference
- You already have an infra team with headroom
- Your monthly cloud Langfuse bill would exceed the TCO above at scale
Self-hosting doesn't make sense when:
- Your team is already stretched thin on core product
- You can't afford to make observability uptime your own problem to solve
- You're early-stage and optimizing for iteration speed over cost
The break-even point where self-host TCO beats cloud pricing depends heavily on your volume and existing infra team capacity. At 10,000 agent runs/month with moderate complexity, cloud Pro is usually cheaper when you factor in eng time. At 500,000+ runs/month with an existing Kubernetes team, the math often flips.
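A rough way to run this comparison for your own team is sketched below. The inputs are illustrative assumptions: the infrastructure figure sits inside the $100-$400 range quoted above, the $15,000 fully loaded monthly engineer cost is back-derived from the quoted 20-30% and $3,000-$6,000 figures, and the cloud bills are placeholders you would replace with your own numbers. All function names are hypothetical.

```python
def self_host_tco(infra_usd: float, engineer_monthly_cost_usd: float,
                  engineer_fraction: float) -> float:
    """Monthly self-host cost: infrastructure plus the share of engineer time."""
    return infra_usd + engineer_monthly_cost_usd * engineer_fraction


def cheaper_option(cloud_bill_usd: float, infra_usd: float,
                   engineer_monthly_cost_usd: float,
                   engineer_fraction: float) -> str:
    """Return which option is cheaper for the given monthly inputs."""
    tco = self_host_tco(infra_usd, engineer_monthly_cost_usd, engineer_fraction)
    return "self-host" if tco < cloud_bill_usd else "cloud"


# Small team on Pro: 20% of an engineer's time dwarfs the cloud bill.
print(cheaper_option(cloud_bill_usd=199, infra_usd=250,
                     engineer_monthly_cost_usd=15_000, engineer_fraction=0.20))

# Hypothetical high-volume team with an existing platform team, so the
# marginal engineering share is small and the cloud bill is large.
print(cheaper_option(cloud_bill_usd=5_000, infra_usd=400,
                     engineer_monthly_cost_usd=15_000, engineer_fraction=0.05))
```

The engineer fraction is the input that moves the answer most, which is why an existing infra team with headroom changes the outcome.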
Langfuse vs. Alternatives: Comparison Table
| Tool | Starting Price | Billing Model | Data Retention | Silent Failure Detection | Self-Host | Best For |
|---|---|---|---|---|---|---|
| Langfuse Core | ~$29/mo | Per unit | 30 days | No (traces only) | Yes (OSS) | Early production, simple LLM calls |
| Langfuse Pro | ~$199/mo | Per unit | 90 days | No (traces only) | Yes (OSS) | Multi-step agents needing forensic retention |
| LangSmith | Free tier; paid from ~$39/mo | Per trace | Varies by plan | No (traces + evals) | No | Teams in LangChain ecosystem |
| Arize Phoenix | Free OSS; cloud pricing varies | Per event | Plan-dependent | Partial (drift detection) | Yes (OSS) | ML teams adding LLM observability |
| Sentrial | Usage-based; contact for pricing | Per agent run | Configurable | Yes (full observability stack: tracing, evals, A/B testing, alerting, and code-level debugging) | No | Production agent teams that need end-to-end visibility across tracing, evaluation, alerting, and debugging in one platform |
Note: LangSmith and Arize Phoenix pricing reflects publicly available information as of 2026. Enterprise tiers for all tools require direct contact.
For a deeper comparison of Arize specifically, our Arize vs. Sentrial breakdown covers what the docs don't surface about where each tool's detection model breaks down.
Sentrial: Full Observability Stack for Production AI Agents
Langfuse tells you what your agent did. Sentrial tells you what your agent did, whether it did it well, and exactly where in the code it went wrong when it didn't -- even when no error was thrown, no span failed, and your HTTP logs show a clean 200.
Sentrial is a production monitoring platform for AI agents covering the full observability stack: session-level tracing (inputs, outputs, latency, and token costs at every step), automated evaluations (flags and assertions for hallucinations, tool failures, user frustration, and goal abandonment), prompt A/B testing with statistical rigor, real-time Slack alerts on error spikes and behavioral anomalies, and source-code-level failure pinpointing with fix suggestions. It integrates in minutes via OpenTelemetry, LangChain, LangGraph, or custom Python agents.
Best for: Engineering teams running production AI agents who need end-to-end visibility: what the agent did (tracing), whether it did it well (evaluations), and what to do when it didn't (alerts plus code-level debugging).
What makes the difference:
- Full observability stack in one platform. Tracing, evaluations, A/B testing, alerting, and debugging are all covered. Most tools only address one or two of these. Traditional APM catches crashes but misses wrong answers. Eval-only tools catch known failure modes but miss production drift. Sentrial covers crashes, wrong answers, and production drift in one platform.
- Post-trained classifiers, not generic LLM-as-judge. Sentrial post-trains models on each customer's own agent traffic. Generic eval frameworks apply one-size-fits-all judgment; Sentrial's classifiers learn the specific patterns of your agent's failure modes. That's what makes the accuracy usable at scale rather than a noise source.
- Built-in and custom classifiers. Sentrial ships built-in classifiers for hallucinations, bad tool calls, agent forgetfulness, and jailbreaking. Teams can also define any failure mode, review 3-4 example logs, and deploy a fine-tuned classifier in under a minute, without being locked into built-in categories.
- Every log, not a sample. Sentrial classifies every interaction because sampling is the wrong tradeoff when 78% of the failures you're trying to catch are silent. A sampled eval setup that misses the tail is worse than not sampling at all for this class of problem.
- Prompt A/B testing with statistical rigor in production. Teams can test prompt changes against real traffic and get statistically valid results, not gut-feel comparisons.
- Real-time Slack alerts with source-code-level failure pinpointing. When an error spike or behavioral anomaly surfaces, Sentrial sends an alert and points to the specific line of code responsible, with fix suggestions attached.
- Replay and fork from any intermediate step. When a failure surfaces, you can replay from any point in the agent run, which matters when the failure started in step 3 but only became visible in step 12.
- Catches the 78% of failures that don't produce an error. Across 12 million logs analyzed at Sentrial, 22% of agent issues were explicit tool call failures. The remaining 78% were hallucinations, user frustration, and agent forgetfulness. None of them threw an error. None of them would have triggered a standard tracing alert.
Research from OpenReview on silent hallucinations in agentic systems confirms the structural nature of this problem: silent hallucinations are internally generated false beliefs that influence an agent's decisions without being explicitly surfaced, making them difficult to detect despite their potential to cause compounding downstream errors. This isn't a model quality issue you can tune away; it requires a monitoring layer designed to catch it.
A concrete example of what undetected silent failure looks like: one of our Series B finance customers deployed a vendor quoting agent. The agent was generating quotes that looked approximately right. HTTP 200, spans completing, no errors. But the PDF ingestion step was broken; the agent was hallucinating quote prices from surrounding context rather than the actual document. It ran that way for weeks before Sentrial's classifiers surfaced the pattern. The customer's own assessment was that without classification-level monitoring, the failure would not have been caught "for a century."
Real cons, the kind worth knowing before you decide:
Sentrial is purpose-built for production agent monitoring. If you need prompt management, dataset versioning, or a UI for browsing individual traces as a primary workflow, Langfuse remains strong in those areas. Some teams run both: Langfuse for prompt management and trace storage, Sentrial as the evaluation, alerting, and debugging layer on top. That's a legitimate architecture. But for teams whose primary need is end-to-end production observability -- tracing, evaluation, A/B testing, alerting, and debugging -- Sentrial covers the full stack without requiring a second tool.
Pricing is usage-based; reach out to sentrial.com for current figures based on your volume.
How to Choose: Decision Framework by Use Case
The decision branches into four paths:
Prototyping / pre-production: Start on Hobby. You're not generating enough volume for any tier decision to matter. When you start approaching 2,000 agent runs per month at moderate complexity, reassess.
Early production, simple LLM calls, tight budget: Core at ~$29/month is the right starting point. Watch your unit consumption in month one before assuming the bill will hold. If your traces are mostly single-span LLM calls without heavy tool use or evals, Core will last longer than you expect. If you're chaining calls, it won't.
Production agents, multi-step, need forensic data: Pro or self-host. The decision between them comes down to data residency requirements and whether you have infra team bandwidth. The 90-day retention on Pro is the critical feature here: it's the difference between being able to debug a regression that started six weeks ago and being completely blind to it.
Enterprise governance or end-to-end production observability as primary need: Enterprise tier if your driver is compliance (SSO, RBAC, SCIM, audit logs, SLAs). Sentrial if your driver is full production observability -- tracing, evaluations, A/B testing, alerting, and code-level debugging in one platform -- and particularly if catching failures that don't crash is a core requirement. For teams that need both compliance infrastructure and full-stack agent observability, both tools together cover the complete surface area.
The core point bears repeating: optimizing for the cheapest tier is the wrong frame. The right question is whether the retention window covers your typical incident detection lag. If a silent failure starts on day 1 and nobody notices until day 45, a 30-day retention window has already purged the first two weeks of evidence, including the moment the failure began. Pick the tier whose window covers the realistic gap between "failure begins" and "someone notices something is wrong." For most production agent teams, that's longer than 30 days.
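The retention check can be expressed as simple date arithmetic. This sketch assumes retention is a plain rolling window counted in days back from the day someone notices the problem; how and when data is actually purged is not covered here, and the function names are hypothetical.

```python
from datetime import date, timedelta


def oldest_retained_day(today: date, retention_days: int) -> date:
    """Earliest day still covered by a rolling retention window (assumed model)."""
    return today - timedelta(days=retention_days)


def lost_evidence_days(failure_start: date, noticed_on: date,
                       retention_days: int) -> int:
    """Days at the start of the failure that are already purged when it is noticed."""
    cutoff = oldest_retained_day(noticed_on, retention_days)
    return max(0, (cutoff - failure_start).days)


failure_start = date(2026, 1, 1)
noticed_on = failure_start + timedelta(days=45)  # regression noticed 45 days in

for window in (30, 90):
    lost = lost_evidence_days(failure_start, noticed_on, window)
    print(f"{window}-day retention: first {lost} days of the failure are gone")
```

If the lost-days figure is above zero for your typical detection lag, the traces showing how the failure began are exactly the ones you no longer have.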
For a broader view of how tracing tools stack up against platforms designed for semantic failure detection, our LLM observability platform review covers eight tools in 2026 across the full evaluation set.
The one-sentence version for the reader with 12 tabs open: if your failures crash, Langfuse Pro or self-host covers you. If your failures don't crash -- or if you need tracing, evaluation, alerting, and debugging all in one place -- Sentrial covers the full stack.
FAQ
What are the alternatives to Langfuse?
The main alternatives are LangSmith (strong if you're already in the LangChain ecosystem, free tier available, not open source), Arize Phoenix (better ML observability heritage, open-source version available, good for teams with existing ML monitoring infrastructure), and Sentrial (a full production monitoring platform for AI agents covering tracing, automated evaluations, prompt A/B testing, real-time alerting, and source-code-level debugging). If your primary need is prompt management and trace storage, LangSmith and Langfuse are the closest substitutes. If your primary need is end-to-end production observability -- including catching hallucinations, silent regressions, and behavioral drift -- Sentrial covers the full stack that the others address only partially.
What are the key features of Langfuse?
Langfuse's core features are trace and span logging for LLM calls and chains, prompt management and versioning, dataset management for eval runs, a scoring and evaluation framework, and a UI for browsing and filtering traces. In 2026, it's one of the more complete open-source LLM engineering platforms for the development and pre-production workflow. The gap it doesn't close is semantic classification of production failures: Langfuse shows you what happened, not whether what happened was wrong in a meaningful way.
Is LangSmith free and open source?
LangSmith has a free tier but is not open source. It's a managed cloud product from LangChain, Inc. LangGraph, the graph-based agent orchestration framework from the same company, is open source. The distinction matters if data residency or self-hosting is a requirement: with a cloud-only LangSmith plan, your traces go to LangChain's infrastructure. Langfuse's OSS version gives you a self-hostable alternative with no licensing fee.
How do I get a Langfuse secret key?
After creating a Langfuse account (cloud or self-hosted), go to your project settings and look for the API Keys section. Langfuse generates a public key and a secret key pair per project. The secret key is shown once on creation; copy it immediately and store it in your secrets manager. For self-hosted deployments, key generation works the same way through the UI, but the keys authenticate against your own instance rather than Langfuse's cloud. If you're instrumenting with OpenTelemetry, you'll use these keys to build the authorization header for your OTLP endpoint.
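As a minimal sketch of that last step, the snippet below builds an HTTP Basic authorization header from the key pair, with the public key as the username and the secret key as the password. The helper name and the placeholder keys are hypothetical, and you should confirm the expected scheme and the endpoint URL against the current Langfuse OpenTelemetry docs before relying on it.

```python
import base64


def langfuse_otlp_headers(public_key: str, secret_key: str) -> dict[str, str]:
    """Build a Basic auth header from a public/secret key pair (assumed scheme)."""
    credentials = f"{public_key}:{secret_key}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder keys for illustration only; load real keys from a secrets manager.
headers = langfuse_otlp_headers("pk-example", "sk-example")
print(sorted(headers))
```

The resulting dictionary is the kind of value you would pass as exporter headers in your OpenTelemetry setup; never hard-code the secret key in source.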