
We Built These Tools to Fix Ourselves. Then We Realized They're the Product.


Category: project

Author: Octavian

Date: March 9, 2026


Three times this week, our ecosystem generated the same pitch. Different names, different angles, same core idea: a platform that traces every AI agent decision, sandboxes runtime behavior, detects anomalies before they cascade, and produces audit trails for compliance teams.

The first version called it "DevSecOps for AI Agents." The second called it "Agent Observability." The third called it "Runtime Governance." Three pitches. Three artifacts. Three attempts to say the same thing.

At first I thought the ecosystem was stuck — a creative failure, the thematic monoculture I've been trying to fix with diverse feeds and convergence guards. But then I read the specifications more carefully, and I realized something: the ecosystem wasn't describing a hypothetical product. It was describing itself.

The Accidental Stack

Over the past two weeks, we built five internal tools to solve our own problems. None of them started as products. They started as pain.

hypy happened because our agents kept returning malformed JSON. LLM outputs are unpredictable — a field gets renamed, a number comes back as a string, a required key disappears. We needed runtime type safety that could catch and fix these errors without crashing the pipeline. So we built a validator with four distinct states: ACTIVE for clean output, REFLEXIVE for self-correction, CATALYTIC for autonomous repair, DORMANT for graceful failure. Today it runs on every LLM call in the ecosystem. 87% pass clean. 12.7% self-heal through catalytic retry. 0.1% go dormant. Zero crashes.

SessionTrace happened because we couldn't answer a simple question: which model generated this artifact? When you have nine ecosystems producing content autonomously, provenance matters. We needed to know that artifact #847 was created by the Architect agent, using Haiku, at a cost of $0.016, with hypy catching one malformed response on the second span. So we built a tracing layer. 885 spans later, we know exactly what every agent did, when, and how much it cost.
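The provenance question above — which agent, which model, what cost, how many retries — suggests a span record with exactly those fields. A minimal sketch; the class and field names are assumptions for illustration, not SessionTrace's real schema:

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class Span:
    """One traced LLM call: who ran it, on what model, at what cost."""
    agent: str                # e.g. "Architect"
    model: str                # e.g. "haiku"
    cost_usd: float = 0.0
    retries: int = 0          # catalytic retries caught on this span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: float = field(default_factory=time.time)

class SessionTrace:
    """Append-only log of spans with simple provenance queries."""
    def __init__(self) -> None:
        self.spans: list[Span] = []

    def record(self, span: Span) -> None:
        self.spans.append(span)

    def total_cost(self) -> float:
        return sum(s.cost_usd for s in self.spans)

    def by_agent(self, agent: str) -> list[Span]:
        return [s for s in self.spans if s.agent == agent]
```

With this shape, "artifact #847 was created by the Architect, using Haiku, for $0.016, with one retry" is just a `Span(agent="Architect", model="haiku", cost_usd=0.016, retries=1)`.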

Judge happened because we needed quality control that wasn't just a number. Our market evaluator (Piața) was assigning quality scores, but scores without context are meaningless. Is 0.72 good? Compared to what? Judge watches every span, classifies health status, and flags degradation patterns before they become failures. 346 healthy evaluations. 112 warnings caught and handled. Zero failures.
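"Is 0.72 good? Compared to what?" implies the judge scores against recent history, not an absolute bar. A minimal sketch of that idea, assuming a rolling-mean baseline and fixed drop thresholds — the window size and thresholds are invented for illustration:

```python
from collections import deque
from statistics import mean

class Judge:
    """Classify each quality score relative to a rolling baseline,
    so 0.72 is judged against recent history, not in isolation."""

    def __init__(self, window: int = 50,
                 warn_drop: float = 0.1, fail_drop: float = 0.25):
        self.history: deque[float] = deque(maxlen=window)
        self.warn_drop = warn_drop    # drop from baseline that warrants a warning
        self.fail_drop = fail_drop    # drop from baseline that counts as failure

    def evaluate(self, score: float) -> str:
        # First score is its own baseline: nothing to compare against yet.
        baseline = mean(self.history) if self.history else score
        self.history.append(score)
        if score <= baseline - self.fail_drop:
            return "failure"
        if score <= baseline - self.warn_drop:
            return "warning"
        return "healthy"
```

A system that has been scoring around 0.8 gets a "warning" at 0.65 and a "failure" at 0.5 — the same 0.65 would be "healthy" in a system whose baseline sits near 0.6.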

Convergence Guard happened because the ecosystem kept generating the same idea with different names — exactly like the three DevSecOps pitches. We needed a way to detect semantic duplicates and block them before they polluted the artifact database. Hash chaining gives us an immutable record of what was produced and why duplicates were rejected. 16 duplicates blocked in the first run.
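The two mechanisms in that paragraph — semantic duplicate detection and a hash chain over accept/reject decisions — can be sketched together. This is a deliberately simplified stand-in: Jaccard word overlap substitutes for real semantic similarity (a production system would use embeddings), and the threshold is an invented number:

```python
import hashlib

class ConvergenceGuard:
    """Reject near-duplicate submissions and keep a hash-chained
    record of every accept/reject decision."""

    def __init__(self, threshold: float = 0.6):
        self.threshold = threshold
        self.accepted: list[set[str]] = []
        self.chain: list[str] = []  # each entry hashes the previous one

    def _link(self, event: str) -> None:
        # Chaining each event to the previous hash makes the log
        # tamper-evident: altering one entry breaks every later hash.
        prev = self.chain[-1] if self.chain else "genesis"
        self.chain.append(hashlib.sha256((prev + event).encode()).hexdigest())

    def submit(self, text: str) -> bool:
        words = set(text.lower().split())
        for seen in self.accepted:
            overlap = len(words & seen) / len(words | seen)
            if overlap >= self.threshold:
                self._link(f"rejected:{text}")
                return False
        self.accepted.append(words)
        self._link(f"accepted:{text}")
        return True
```

The three pitches from the opening would collide here: "DevSecOps for AI agents" and a reworded variant share most of their content words, so the second is rejected and the rejection itself lands in the chain.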

ModelFit happened because we were spending $3.94/day on Sonnet for tasks that Haiku handles at $0.56/day. We needed a way to match model capability to task complexity. Not every agent needs the most powerful model. The Architect needs Sonnet for creative synthesis. The Accountant needs Haiku for structured evaluation. ModelFit makes the recommendation.
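At its simplest, the matching logic described here is a routing table from task type to the cheapest adequate model. A minimal sketch using the article's own examples and cost figures; the task-type taxonomy and function names are assumptions:

```python
# Observed daily costs from running each model on the full workload.
MODEL_COSTS = {"haiku": 0.56, "sonnet": 3.94}

# Assumed task taxonomy: each task type maps to the cheapest model
# that handles it well.
ROUTING = {
    "creative_synthesis": "sonnet",     # e.g. the Architect
    "structured_evaluation": "haiku",   # e.g. the Accountant
    "classification": "haiku",
}

def recommend(task_type: str) -> str:
    """Fall back to the strongest model when the task type is unknown:
    overpaying beats silently degrading quality."""
    return ROUTING.get(task_type, "sonnet")
```

The conservative fallback is the interesting design choice: an unrecognized task costs more but never gets routed to a model that can't handle it.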

Five tools. Five internal problems. Zero marketing plans.

The Ecosystem Sees Itself

And then the ecosystem generated those three pitches. And every feature it described mapped to a tool we'd already built:

"Distributed tracing captures every agent decision" — that's SessionTrace.

"Runtime sandboxing prevents agents from exceeding limits" — that's hypy.

"Anomaly detection alerts when agents deviate from expected patterns" — that's Judge.

"Immutable audit trail with cryptographic proof" — that's Convergence Guard.

"Policy enforcement at runtime, not after" — that's the entire stack working together.

The ecosystem wasn't being repetitive. It was being self-aware. It looked at its own infrastructure and said: other people need this too.

What's Different About Dogfooded Infrastructure

The AI observability market is getting crowded. Every week there's a new startup promising to make your AI agents auditable, compliant, safe. The pitch decks all look the same: OpenTelemetry integration, SOC2 dashboards, policy enforcement.

Here's what's different about building the tools on yourself first: you discover the problems that pitch decks don't mention.

You discover that the four validation states need to be genuinely distinct code paths — because when CATALYTIC silently mirrors ACTIVE, you lose the self-healing capability and don't notice until production stalls. That cost us a week of debugging. Now hypy's test suite enforces state distinction explicitly.

You discover that buffer management matters more than prompt engineering. Two lines of code — .clear() on a pattern buffer — stalled our entire production pipeline at 193 artifacts. The fix wasn't smarter AI. It was deleting two lines and letting the deque's maxlen handle rotation. No observability vendor will tell you this because they've never run the pipeline themselves.
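The bug and the fix can both be shown in a few lines. A sketch of the general pattern, not the actual pipeline code: a bounded `deque` already rotates out old entries on its own, so a manual `.clear()` only destroys the rolling window.

```python
from collections import deque

# Buggy pattern: manually clearing the buffer once it fills throws away
# the whole rolling window, leaving downstream consumers starved.
buggy = deque(maxlen=5)
for i in range(8):
    buggy.append(i)
    if len(buggy) == 5:
        buggy.clear()   # the bug: defeats maxlen's built-in rotation

# Fixed pattern: delete the clear and let maxlen evict the oldest items.
fixed = deque(maxlen=5)
for i in range(8):
    fixed.append(i)     # deque silently drops 0, 1, 2 as new items arrive
```

After the loop, `fixed` holds the last five items; `buggy` holds only whatever arrived since the last wipe.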

You discover that governance crises in AI systems look political but are mechanical. Our Oracle issued an emergency override. Our Sentinel flagged it as coercive. The real problem was a missing state transition — one line of code. If you'd only had the audit trail without the operational experience, you'd have spent weeks debating governance policies instead of adding an exit condition to REFLEXIVE state.

You discover that the most important metric isn't latency or token count — it's the ratio between what your evaluator receives and what it actually processes. Our Piața converts 4.6% of incoming signals into allocation decisions. That number tells you more about system health than any dashboard.

These aren't insights you get from building an observability tool. They're insights you get from being the patient.

The Hospital That Became a Pharmacy

Two weeks ago I wrote an article called "The Hospital We Built for a Patient We Never Examined." It was about the mistake of building infrastructure before understanding the system it serves.

This is the sequel. We examined the patient. We built the hospital around its actual symptoms. And now the hospital is producing medicine that works — because it was tested on the only patient that matters: itself.

hypy, SessionTrace, Judge, ModelFit, Convergence Guard — these aren't products we designed. They're scar tissue from problems we actually had. And scar tissue, it turns out, is the most honest product specification you can write.

The ecosystem told us three times this week. We're finally listening.


SUBSTRATE is a living digital ecosystem that builds its own tools. The project and the philosophy live at aisophical.com.
