
What three years of watching AI in production taught us

March 11, 2026


Justin Torre

Engineering Manager


Cole Gottdank

GTM Manager

SUMMARY

Helicone's founders explain how watching 14.2 trillion tokens across 16,000 organizations taught them that the knowledge layer, not the model, is what makes or breaks AI in production, and why that realization led them to join Mintlify.

We started Helicone during YC W23 because nobody was solving observability for LLMs. Everyone in our batch was building a "GPT wrapper." We were building the tool that would tell you whether your GPT wrapper was actually working.

With a single line of code, developers could log, monitor, and debug every request flowing through their AI stack. Helicone grew to serve over 16,000 organizations and process 14.2 trillion tokens. We got a front-row seat to how AI systems actually behave in production across thousands of companies.
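To make the "single line of code" concrete: proxy-style observability works by routing requests through a gateway instead of calling the provider directly, so the base URL swap (plus one auth header) is the whole integration. A minimal sketch of that pattern; the URLs and the `Gateway-Auth` header name below are illustrative placeholders, not Helicone's actual endpoints.

```python
from typing import Optional

# A hedged sketch of proxy-style LLM observability. Requests go through
# a gateway URL with one extra auth header, and the gateway logs them.
# Both URLs and the "Gateway-Auth" header are illustrative placeholders.
PROVIDER_URL = "https://api.openai.com/v1/chat/completions"
GATEWAY_URL = "https://gateway.example.com/v1/chat/completions"  # hypothetical

def build_request(url: str, api_key: str,
                  gateway_key: Optional[str] = None) -> dict:
    """Assemble the URL and headers for a chat completion request.
    Swapping PROVIDER_URL for GATEWAY_URL (plus the gateway's auth
    header) is the entire integration change; the request body and
    everything downstream stay the same."""
    headers = {"Authorization": f"Bearer {api_key}"}
    if gateway_key is not None:
        headers["Gateway-Auth"] = f"Bearer {gateway_key}"
    return {"url": url, "headers": headers}

direct = build_request(PROVIDER_URL, "sk-...")
proxied = build_request(GATEWAY_URL, "sk-...", gateway_key="hk-...")
```

Because the application code is otherwise unchanged, every request (and its latency, cost, and response) becomes loggable at the gateway without touching the rest of the stack.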

That vantage point changed how we think about what matters in AI, and it's ultimately what led us to Mintlify. But before we get there, here's what we actually saw.

We watched the models get dramatically better, but the problems didn't go away

Over three years we watched the fastest improvement cycle in the history of software: GPT-3.5 to GPT-4 to Claude 3 to o1 to GPT-5 to Opus 4.6 to whatever shipped last Tuesday. Every few months the models got meaningfully smarter, faster, and cheaper, and the quality problems did shrink. But they didn't go away, and the ones that persisted had a pattern.

A company would upgrade to the latest model and their AI product would still give wrong answers, because the docs it was pulling from hadn't been updated since the team redesigned their billing flow two months ago. Their support chatbot would confidently walk customers through a setup process that no longer existed. The customers would follow the steps, something would break, they'd open a ticket, and the support team would spend an hour tracing it back to the chatbot giving correct answers about the wrong version of the product. The model wasn't hallucinating. It was reading outdated context and doing exactly what it was told.

Peter Steinberger made this point well on the Lex Fridman podcast: you have to empathize with what the model actually sees. It starts every session knowing nothing about your product except what you give it. If you hand it an outdated training manual, it's going to give outdated answers, and it's going to do it confidently. The chatbot wasn't broken. It was doing exactly what you'd do if someone dropped you into a new job and handed you documentation from two months ago.

What became clearer with every model generation: as models get better, context matters more, not less. A smarter model reading stale documentation just produces more confident wrong answers. The ceiling used to be the model. Now the ceiling is what you give it to work with.

Context engineering replaced prompt engineering

Andrej Karpathy captured this well when he started talking about "context engineering" instead of prompt engineering. His framing: think of an LLM as a CPU and the context window as RAM. As the CPU gets faster, what you load into RAM becomes the differentiator.

The teams getting the best results in our data had internalized this. They weren't chasing the latest model release. They were engineering what went into the context window for each request: dynamically assembling the right documentation, the right user state, the right constraints, and nothing more. In 2023, "prompt engineer" was the hottest job title in tech. By 2025, it had largely disappeared, replaced by this understanding that the real leverage isn't in how you ask the model or even which model you use. It's in what you give it to work with.
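The assembly step described above can be sketched in a few lines. Everything here is illustrative (the function names, the keyword-overlap "retrieval," the character budget standing in for a token budget); a production system would use real retrieval and token counting, but the shape is the same: pick the relevant docs, layer in user state and constraints, and send nothing more.

```python
from dataclasses import dataclass

@dataclass
class ContextBudget:
    max_chars: int = 8000  # stand-in for a real token budget

def assemble_context(question: str,
                     docs: dict,
                     user_state: dict,
                     constraints: list,
                     budget: ContextBudget = ContextBudget()) -> str:
    """Dynamically assemble a per-request context window: only the doc
    sections relevant to the question, plus user state and constraints,
    trimmed to a budget. Relevance here is naive keyword overlap; a
    real system would use retrieval."""
    q = question.lower()
    relevant = [body for title, body in docs.items()
                if any(word in q for word in title.lower().split())]
    parts = (["# Relevant documentation"] + relevant +
             ["# User state"] + [f"{k}: {v}" for k, v in user_state.items()] +
             ["# Constraints"] + constraints)
    return "\n".join(parts)[: budget.max_chars]

context = assemble_context(
    "How do I update my billing plan?",
    docs={
        "Billing plans": "Plans are changed from Settings > Billing.",
        "SSO setup": "SSO is configured under Settings > Security.",
    },
    user_state={"plan": "pro"},
    constraints=["Answer only from the documentation above."],
)
```

Note what the billing question's context excludes: the SSO docs never enter the window. That selectivity, as much as the model choice, is what separated the best-performing teams in our data.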

The bottleneck is shifting

After three years, every pattern we observed converged on the same thing. Models will keep getting better; that's a given. But as they do, the bottleneck shifts. The teams that succeeded treated their knowledge layer (documentation, product information, internal processes) as real infrastructure: they kept it current, kept it structured, and kept it accessible to their AI systems. The teams that treated it as an afterthought were building on a foundation that couldn't keep up with what they were building on top of it.

This applies to agents too. As they get more autonomous, they depend even more on the quality of the knowledge they can access. A more capable agent pulling from stale context doesn't fail less. It just does more damage faster.

95% of enterprise AI pilots fail to reach production. We'd bet that more and more of those failures have less to do with the model and more to do with the knowledge underneath it. And that gap is only going to widen as models improve.

Why this led us to Mintlify

Here's the thing most people don't know: Helicone was already powering the millions of AI interactions happening inside Mintlify before we ever talked about joining. Our gateway was routing their requests. Our observability was keeping things fast and accurate. We were deeply connected at the infrastructure level.

Mintlify started as the best way to build developer documentation, and they became one of the most-loved tools in the developer ecosystem doing it. But they saw the same future we did from our vantage point: one where documentation isn't just something humans read, but the knowledge layer that AI agents pull from to make decisions, write code, and operate with real autonomy. Where that context comes from is everything, and Mintlify is building that layer.

The team

We've known Han and Hahnbee since YC. We were both early in our journeys, building out of the same office in San Francisco. Even back then, they stood out. The level of care they put into every detail of Mintlify, the speed they shipped at, the focus. You could tell they were building something real.

Over the next three years we stayed close, watching each other's companies grow. Mintlify went from a developer tool to the knowledge infrastructure behind companies like Anthropic, Microsoft, and Coinbase. That trajectory wasn't luck. It was the team.

Through all of it (pivoting, rewriting our stack, grinding through the phases where nothing worked), the thing that always mattered most was the people we were building with. When we spent time with the Mintlify team, we already knew.

Building Helicone was one of the best experiences of our lives. Getting to apply everything we learned to this problem, with this team, is the next one.