A missing path in a URL killed our LLM tracing for 4 weeks

The health check was green the whole time.

I run an SMS agent in production — a Python FastAPI service, multi-tenant, answering real customers over text. It exports OpenTelemetry traces to Langfuse so I can see every prompt, retrieval, token count, and latency. When a customer says "the bot told me something wrong," traces are the first thing I open.

They'd been dead for about a month before I noticed.

How I found it

I was doing an unrelated audit and ran docker logs milo-agent-api. This was scrolling past, every ten seconds or so, nonstop:

opentelemetry.exporter.otlp...trace_exporter: Failed to export span batch code: 404, reason: Not Found

So I checked the trace database directly. Langfuse stores traces in ClickHouse:

SELECT count(), max(timestamp) FROM traces;
-- 31490,  2026-05-26 06:33:00

31,490 traces, and the newest one was four weeks old. Nothing had landed since. The observability I'd have reached for the instant a customer complained had been off for a month, and I had no idea.

The cause

One env var:

OTEL_EXPORTER_OTLP_ENDPOINT=https://langfuse.milo.connectivity.cx

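The OTLP HTTP exporter appends /v1/traces to the endpoint and POSTs spans there. Langfuse's OTLP ingestion lives under /api/public/otel, so it expects spans at /api/public/otel/v1/traces. With the bare host as the endpoint, every batch went to https://langfuse.milo.connectivity.cx/v1/traces, a path that returns 404, forever. The exporter was doing exactly what I told it to, against a URL that was never going to accept it.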

The fix:

OTEL_EXPORTER_OTLP_ENDPOINT=https://langfuse.milo.connectivity.cx/api/public/otel

Redeploy, traces land again. Two minutes. The embarrassing part is the four weeks before I looked.
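
To make the mismatch concrete, here is a minimal Python sketch of the path rule. It is not the exporter's actual code, and traces_url is a hypothetical helper. It only mimics the behaviour described above, where the traces path is appended to whatever OTEL_EXPORTER_OTLP_ENDPOINT holds.

```python
TRACES_PATH = "v1/traces"  # suffix appended to the base endpoint for trace exports


def traces_url(endpoint: str) -> str:
    """Hypothetical helper: derive the URL spans are POSTed to from the base endpoint."""
    return endpoint.rstrip("/") + "/" + TRACES_PATH


broken = traces_url("https://langfuse.milo.connectivity.cx")
fixed = traces_url("https://langfuse.milo.connectivity.cx/api/public/otel")

print(broken)  # https://langfuse.milo.connectivity.cx/v1/traces
print(fixed)   # https://langfuse.milo.connectivity.cx/api/public/otel/v1/traces
```

Both values look like perfectly reasonable config. Only the second resolves to a path the server actually serves.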

Why nothing caught it

This is the part I actually want to talk about, because it's not really about OpenTelemetry. It's the shape of almost every silent failure I've seen in an AI system.

The container was healthy. /health returned 200 the entire time. The process was up, serving SMS, answering customers. Every monitoring check I had said "fine."

But none of them checked whether traces were arriving. Nobody — including me — had a check that asked "did a single trace land in the last 24 hours?" The only evidence that tracing was dead was a log line going by every ten seconds, and nobody reads those.

It gets worse. The 404 spam was so constant that a real error would have drowned in it. So the silent failure was also quietly degrading my ability to spot the loud ones.

A green health check answered a question correctly. It just wasn't the question that mattered. "Is the container up" was true. "Is the data arriving" was false, and nothing was asking.

The general lesson: existence is not arrival

Almost every monitoring setup I see checks existence: is the process up, is the endpoint reachable, did the deploy succeed, is the config set. Those are easy to check and they're all necessary.

They're also not the thing that breaks. What breaks is arrival: did the trace land, did the message send, did the job drain, did the webhook get received, did the row get written. Existence is about your side. Arrival is about the other side actually receiving what you sent. The gap between them is where things die quietly.

Every outbound write path in your system needs an arrival check, not just an existence check:

Traces / metrics → assert the newest row is recent.
Outbound webhooks → assert recent successful deliveries, alert on the failure rate.
Queues → assert the dead-letter count is zero and the drain age is low.
Any external sink whose errors you swallow with || true and forget → that's a silent failure waiting to happen.
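
The checks in that list can be sketched as small pure functions in Python. This is a rough outline under assumptions, not a finished monitor: every name here (ArrivalResult, check_freshness, check_failure_rate, check_dead_letters) is hypothetical, the thresholds and webhook counts are made up for illustration, and you would still have to fetch the real numbers from your own sinks. The trace timestamp is the one from my ClickHouse query, assumed to be UTC, and "now" is pinned to four weeks later.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class ArrivalResult:
    name: str
    ok: bool
    detail: str


def check_freshness(name: str, newest: Optional[datetime], max_age: timedelta,
                    now: Optional[datetime] = None) -> ArrivalResult:
    """Pass only if the newest record at the sink is younger than max_age."""
    now = now or datetime.now(timezone.utc)
    if newest is None:
        # An empty sink is a failure, not a pass.
        return ArrivalResult(name, False, "no records at all")
    age = now - newest
    return ArrivalResult(name, age <= max_age, f"newest record is {age} old")


def check_failure_rate(name: str, delivered: int, failed: int,
                       max_rate: float) -> ArrivalResult:
    """Pass only if deliveries were attempted and few enough of them failed."""
    total = delivered + failed
    if total == 0:
        return ArrivalResult(name, False, "no delivery attempts recorded")
    rate = failed / total
    return ArrivalResult(name, rate <= max_rate, f"failure rate {rate:.1%}")


def check_dead_letters(name: str, dead_letter_count: int) -> ArrivalResult:
    """Pass only if nothing is sitting in the dead-letter queue."""
    return ArrivalResult(name, dead_letter_count == 0,
                         f"{dead_letter_count} dead-lettered messages")


# Illustrative inputs only; in practice these come from queries against each sink.
now = datetime(2026, 6, 23, 6, 33, tzinfo=timezone.utc)
results = [
    check_freshness("traces", datetime(2026, 5, 26, 6, 33, tzinfo=timezone.utc),
                    timedelta(hours=24), now=now),
    check_failure_rate("webhooks", delivered=980, failed=20, max_rate=0.05),
    check_dead_letters("queue", 0),
]
failing = [r for r in results if not r.ok]
for r in failing:
    print(f"ALERT: {r.name}: {r.detail}")
```

The design choice that matters is that "no data" counts as a failure in every check. A check that passes when the sink is empty is just another existence check.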

For the trace case, the check that would have saved me is about five lines on a cron:

# traces must have arrived in the last 24h
set -euo pipefail  # a failed query must fail the check, not silently pass it
# Compute the age inside ClickHouse so shell and server time zones can't disagree.
age_h=$(clickhouse-client -q "SELECT dateDiff('hour', max(timestamp), now()) FROM traces")
[ "$age_h" -lt 24 ] || { echo "ALERT: newest trace is ${age_h}h old. Pipeline is dead."; exit 1; }

Five lines would have turned a four-week blind spot into a page within a day.

What I'd tell you if you're shipping an AI app

"Configured correctly" and "working" are different facts. My exporter was configured perfectly and failing 100% of the time. Test delivery, not config.
Put a freshness check on every write path. If something is supposed to arrive somewhere on a schedule, assert that it did.
If it matters, it gates a deploy or it pages someone. "It's in the logs" means nobody knows.
A constant error storm isn't just its own bug. It's a detector you've turned off for everything else.
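
"Test delivery, not config" can be sketched as a deploy gate in Python. Treat this as a sketch under assumptions rather than a recipe: probe_status, delivery_ok and gate are hypothetical names, the probe sends an empty OTLP/JSON export request, and I am assuming your sink accepts OTLP over HTTP with JSON and answers an empty request with a 2xx. Any auth headers your sink needs are passed in by the caller.

```python
import urllib.error
import urllib.request


def probe_status(url: str, headers: dict, timeout: float = 5.0) -> int:
    """POST an empty OTLP/JSON trace export and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=b'{"resourceSpans": []}',
        headers={"Content-Type": "application/json", **headers},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is exactly what we want.
        return err.code
    # Connection errors (URLError) are left to propagate so the gate fails loudly.


def delivery_ok(status: int) -> bool:
    """Only a 2xx means the sink accepted the payload."""
    return 200 <= status < 300


def gate(endpoint: str, headers: dict, probe=probe_status) -> int:
    """Return a process exit code: 0 if delivery works, 1 otherwise."""
    url = endpoint.rstrip("/") + "/v1/traces"
    status = probe(url, headers)
    if delivery_ok(status):
        return 0
    print(f"ALERT: {url} answered {status}; traces are not being accepted.")
    return 1
```

In a deploy script you would pass the value of OTEL_EXPORTER_OTLP_ENDPOINT and your auth headers to gate and hand its return value to sys.exit. The probe argument is injectable so the gate logic can be tested without touching the network.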

The wider thing: AI systems get built to pass the demo, and the demo only proves the happy path exists. The failures live in the paths where something was supposed to arrive and quietly stopped — tracing, evals, guardrails, the parts nobody watches because they were green on launch day.

I've been writing up the failure classes I keep hitting in production LLM systems, and the gates that catch them, as a free set of skills you can run against your own codebase: Ship-Safe Harness. A checklist to find your gaps, and generator skills that build the missing gates in your stack.

If you've shipped an AI feature from a working demo and want another set of eyes on what fails silently before you scale, that's the work I do — rishabh.vaaiv@gmail.com.