▲What makes 5% of AI agents work in production?motivenotes.ai
53 points by AnhTho_FR 4 days ago | 43 comments
sbierwagen 3 hours ago [-]
>This Monday, I moderated a panel in San Francisco with engineers and ML leads from Uber, WisdomAI, EvenUp, and Datastrato. The event, Beyond the Prompt, drew 600+ registrants, mostly founders, engineers, and early AI product builders.

>We weren’t there to rehash prompt engineering tips.

>We talked about context engineering, inference stack design, and what it takes to scale agentic systems inside enterprise environments. If “prompting” is the tip of the iceberg, this panel dove into the cold, complex mass underneath: context selection, semantic layers, memory orchestration, governance, and multi-model routing.

I bet those four people love that the moderator took a couple notes and then asked ChatGPT to write a blog post.

As always, the number one tell of LLM output, besides the tone, is that by default it will never include links in the body of the post.

stingraycharles 3 hours ago [-]
Yeah, “here’s the reality check:”, “not because they’re flashy, but because they’re blah blah”.

Why can’t anyone be bothered anymore to write actual content, especially when writing about AI, where your whole audience is probably already exposed to these patterns in content day in, day out?

It comes off as so cheap.

mccoyb 2 hours ago [-]
It comes off as someone who lives their life according to quantity, not quality.

The real insight: have some fucking pride in what you make, be it a blog post, or a piece of software.

palmotea 1 hours ago [-]
> The real insight: have some fucking pride in what you make, be it a blog post, or a piece of software.

The businessmen's job will be complete when they've totally eliminated all pride from work.

alexchantavy 53 minutes ago [-]
Yeah it bugs me. We've got enough examples in this article to make Cards Against Humanity ChatGPT edition

> One panelist shared a personal story that crystallized the challenge: his wife refuses to let him use Tesla’s autopilot. Why? Not because it doesn’t work, but because she doesn’t trust it.

> Trust isn’t about raw capability, it’s about consistent, explainable, auditable behavior.

> One panelist described asking ChatGPT for family movie recommendations, only to have it respond with suggestions tailored to his children by name, Claire and Brandon. His reaction? “I don’t like this answer. Why do you know my son and my girl so much? Don’t touch my privacy.”

stingraycharles 41 seconds ago [-]
Yeah, AI isn’t creative. You have to ask it to describe these kinds of patterns, then tell it to avoid them in your original prompt, if you want the result to come across as somewhat natural.

What I wonder is whether the author of the article recognized these patterns and didn’t care, didn’t recognize them at all, or simply didn’t proofread the article.

rapind 2 hours ago [-]
> Why can’t anyone be bothered anymore to write actual content

The way I see it is that the majority of people never bothered to write actual content. Now there’s a tool the non-writers can use to write dubious content.

I would wager this tool is being used quite differently by actual writers focused on producing quality. There are just far fewer of them, the same way there are fewer of any specialization.

The real question with AI, to me, is whether it will remain consistently better when wielded by a specialist who has invested their time in whatever it is they are producing. If that ever changes, then we are doomed. When it’s no longer slop…

esperent 2 hours ago [-]
> the number one tell of LLM output, besides the tone, is that by default it will never include links in the body of the post.

This isn't true. I've been using Gemini 2.5 a lot recently and I can't get it to stop adding links!

I added custom instructions: Do not include links in your output. At the start of every reply say "I have not added any links as requested".

It works for the first couple of responses but then it's back to loads of links again.

tkgally 2 hours ago [-]
I started to suspect a few paragraphs in that this post was written with a lot of AI assistance, but I continued to read to the end because the content was interesting to me. Here's one point that resonated in particular:

"There’s a missing primitive here: a secure, portable memory layer that works across apps, usable by the user, not locked inside the provider. No one’s nailed it yet. One panelist said if he weren’t building his current startup, this would be his next one."

ares623 60 minutes ago [-]
Isn’t that markdown files?
tkgally 19 minutes ago [-]
I was thinking about consumer-facing AI products, where md files controlled by the user presumably wouldn’t fly.

I find it annoying that, when prompting ChatGPT, Claude, Gemini, etc. on personal tasks through their chat interfaces, I have to provide the same context about myself and my job again and again to the different providers.

The memory functions of the individual providers now reduce some of that repetition, but it would be nice to have a portable personal-memory context (under my control, of course) that is shared with and updated semiautomatically by any AI provider I interact with.
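
Even something as simple as a user-owned file would go a long way. A toy sketch (the file name and format here are invented, not any actual product):

```
import json, pathlib, datetime

MEMORY_FILE = pathlib.Path("~/.personal_context.json").expanduser()  # invented location

def load_context() -> dict:
    # the portable memory every provider would read before answering
    if MEMORY_FILE.exists():
        return json.loads(MEMORY_FILE.read_text())
    return {"facts": [], "updated": None}

def remember(fact: str, source: str) -> None:
    """Any provider (or the user) appends facts; the file stays under the user's control."""
    ctx = load_context()
    ctx["facts"].append({"fact": fact, "source": source})
    ctx["updated"] = datetime.datetime.now().isoformat()
    MEMORY_FILE.write_text(json.dumps(ctx, indent=2))

# prepend load_context() to a prompt for whichever provider you happen to be using
remember("Works as a technical editor; prefers concise answers.", source="user")
print(load_context())
```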

As isoprophlex suggests in a sister comment, though, that would be hard to monetize.

isoprophlex 21 minutes ago [-]
Sheesh how ever will you monetize a text file

Will someone please think of the MRR!

AdieuToLogic 3 hours ago [-]
It's funny that what the author identifies as "the reality check":

  Here’s the reality check: One panelist mentioned that 95%
  of AI agent deployments fail in production. Not because the 
  models aren’t smart enough, but because the scaffolding 
  around them, context engineering, security, memory design, 
  isn’t there yet.
Could be a reasonable definition of "understanding the problem to solve."

In other words, everything identified as what "the scaffolding" needs is what qualified people provide when delivering solutions to problems people want solved.

whatever1 3 hours ago [-]
They fail because the “scaffolding” amounts to building the complicated expert system that AI promised we wouldn’t have to build.

If I implement a strict parser and an output post-processor myself to guard against hallucinations, I have done 100% of the business-related logic. I can skip the LLM in the middle altogether.
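
To make it concrete, here is a minimal sketch of that shape (a made-up refund example; all the names and rules are invented, not from the article):

```
from dataclasses import dataclass

# Hypothetical business rules; everything here is an assumption for illustration.
VALID_REGIONS = {"EMEA", "APAC", "AMER"}

@dataclass
class RefundDecision:
    region: str
    amount: float
    approved: bool

def parse_llm_output(raw: str) -> RefundDecision:
    """Strict parser: reject anything that doesn't match the expected shape."""
    fields = dict(kv.split("=", 1) for kv in raw.strip().split(";") if "=" in kv)
    return RefundDecision(
        region=fields["region"],
        amount=float(fields["amount"]),
        approved=fields["approved"].lower() == "true",
    )

def post_process(decision: RefundDecision) -> RefundDecision:
    """Post-processor: enforce the actual business rules, whatever the model said."""
    if decision.region not in VALID_REGIONS:
        raise ValueError(f"unknown region {decision.region!r}")
    if decision.amount > 500:          # refunds above the cap always need a human
        decision.approved = False
    return decision

# By the time these two functions exist, the decision is fully determined by code;
# the LLM in the middle only produces text that the parser and rules then override.
print(post_process(parse_llm_output("region=EMEA;amount=750;approved=true")))
```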

AdieuToLogic 3 hours ago [-]
> If I implement a strict parser and an output post-processor myself to guard against hallucinations, I have done 100% of the business-related logic. I can skip the LLM in the middle altogether.

Well said and I could not agree more.

danieltanfh95 2 hours ago [-]
It is really just BS. This is just basic DSA stuff. We deployed a real-world solution by doing all of that on our side. It's not magic. It's engineering.
another_twist 4 hours ago [-]
So I have read the MIT paper, and both the methodology and the conclusions are just something else.

For example, the number comes from perceived successes and failures, not actual measurements. The customer conclusions are along the lines of "it doesn't improve" or "it doesn't remember", literally buying into the hype of recursive self-improvement and completely oblivious to the fact that API consumers don't control model weights and as such can't do much self-improvement beyond writing more CRUD layers. The other complaints are about integrations, which are totally valid. But those come from industries still running Windows XYZ without any API platforms, so that isn't going away in those cases.

Point being, if the paper itself is not good discourse, just well-marketed punditry, why should we debate the 5% number? It makes no sense.

iagooar 2 hours ago [-]
Wow, half of this article deeply resonates with what I am working on.

Text-to-SQL is the funniest example. It seems to be the "hello world" of agentic use in enterprise environments. It looks so easy, so clear, so straightforward. But just because the concept is easy to grasp (LLMs are great at generating markup or code, so let’s have them translate natural language to SQL) doesn’t mean it is easy to get right.

I have spent the past 3 months building a solution that actually bridges the stochastic nature of AI agents and the need for deterministic queries. And boy oh boy is that rabbit hole deep.

jamesblonde 1 hours ago [-]
Text2SQL was 75% on bird-bench 6 months ago. Now it's 80%. Humans are still at 90+%. We're not quite there yet. I suspect text-to-sql needs a lot of intermediate state and composition of abstractions, which vanilla attention is not great at.

https://bird-bench.github.io/

ares623 1 hours ago [-]
Text-to-SQL is solved by having good UX and a reasonable team that’s in touch with the customers’ needs.

A user having to come up with novel queries all the time to warrant text 2 sql is a failure of product design.

caust1c 49 minutes ago [-]
This is exactly it. AI is sniffing out the good data models from the bad. Easy to understand? AI can understand it too! Complex business mess with endless technical debt? Not so much.

But this is precisely why we're seeing startups build insane things fast while well-established companies are still questioning whether it's even worth it.

juleiie 47 minutes ago [-]
> building a solution that actually bridges the stochastic nature of AI agents and the need for deterministic queries

Wait but this just sounds unhinged, why oh why

ares623 3 hours ago [-]
At some point, say 5 years from now, someone will revisit their AI-powered production workloads and ask the question "how can we optimize this by falling back to non-AI workload?". Where does that leave AI companies when the obvious choice is to do away with their services once their customers reach a threshold?
anonzzzies 3 minutes ago [-]
A lot of what we encounter is this: there is a 'chat' interface with a 'wow factor': you type something in English and something (like text-to-SQL) falls out, maybe 60-80% of what was needed. But then the frustration (for the user) starts: fine-tuning the result. After a few uses, they always ask for the 'old way' back for that part: just editing the query, or knobs to turn to fine-tune the result. And most of the knobs they want are, outside the most generic cases (pick a timespan for a datetime column), custom work. So AI is used for the first 10% of the work time (which gives you 60%+ of the solution) until the frustration lands: the last 40% or less is going to take 90% of your time. Still great, as overall it will probably take less time than before.
EdwardDiego 1 hours ago [-]
"Huh, turns out we could replace it all with a 4 line Perl script doing linear regression."
ares623 45 minutes ago [-]
“How I used ancient programming techniques to save the company $100k/year in token costs”
EdwardDiego 1 hours ago [-]
> One team suggested that instead of text-to-SQL, we should build semantic business logic layers, “show me Q4 revenue” should map to a verified calculation, not raw SQL generation.

Okay, how would that work though? Verified by who and calculated by what?

I need deets.

dchftcs 25 minutes ago [-]
On one side, you have an agent calculating the revenue.

On the other side, you have a SQL query that calculates the revenue.

Compare the two. If the two disagree, get the AI to try again. If the AI is still wrong after 10 tries, just use the SQL output.
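
Something like this, roughly; run_sql and run_agent are obviously stand-ins, not real APIs:

```
import random

def run_sql(sql: str) -> float:
    # stand-in for executing the verified query against the warehouse
    return 1_250_000.00

def run_agent(question: str) -> float:
    # stand-in for the LLM agent computing the same figure
    return 1_250_000.00 if random.random() > 0.3 else 1_180_000.00

def reconcile_revenue(question: str, sql: str, max_tries: int = 10) -> float:
    trusted = run_sql(sql)                    # deterministic reference value
    for _ in range(max_tries):
        candidate = run_agent(question)       # stochastic agent answer
        if abs(candidate - trusted) < 0.01:   # the two agree (to the cent)
            return candidate
    return trusted                            # agent never agreed: fall back to SQL

print(reconcile_revenue("What was Q4 revenue?", "SELECT SUM(amount) FROM orders WHERE quarter = 'Q4'"))
```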

tirumaraiselvan 55 minutes ago [-]
A simple way is perhaps to implement a text-to-metrics system, where metrics are defined as SQL functions.
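
Roughly like this (hypothetical metric names and SQL, just to show the shape): the model only names a metric and fills in parameters, while the SQL stays fixed, reviewed, and versioned.

```
METRICS = {
    # metric name -> parameterized, human-reviewed SQL
    "q4_revenue": "SELECT SUM(amount) FROM orders WHERE quarter = 'Q4' AND year = :year",
    "active_users": "SELECT COUNT(DISTINCT user_id) FROM events WHERE day >= :since",
}

def resolve_metric(name: str, params: dict) -> tuple[str, dict]:
    """The model picks a metric and parameters; it never writes SQL."""
    if name not in METRICS:
        raise ValueError(f"unknown metric {name!r}")   # the model can't invent queries
    return METRICS[name], params

# text-to-metrics output would be something like {"name": "q4_revenue", "params": {"year": 2024}}
sql, params = resolve_metric("q4_revenue", {"year": 2024})
print(sql, params)   # hand off to normal parameterized query execution
```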
moomoo11 51 minutes ago [-]
psychedelics
tirumaraiselvan 10 minutes ago [-]
This article is getting a lot of hate, but honestly it does have a good amount of useful content learned through practical experience, although at an abstract level. For example, this section:

```
The teams that succeed don’t just throw SQL schemas at the model. They build:

Business glossaries and term mappings

Query templates with constraints

Validation layers that catch semantic errors before execution
```

Unfortunately, the mixing of fluffy tone and high level ideas is bound to be detested by hands on practitioners.
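
For what it’s worth, the validation-layer idea is easy to sketch (the allowed tables and rules below are invented for illustration, not from the article):

```
import re

ALLOWED_TABLES = {"orders", "customers", "revenue_facts"}               # invented for illustration
GLOSSARY = {"Q4 revenue": "SUM(amount) FILTER (WHERE quarter = 'Q4')"}  # term -> verified expression

def validate_generated_sql(sql: str) -> list[str]:
    """Catch semantic errors before execution; run the query only if this returns []."""
    errors = []
    if not sql.lstrip().lower().startswith("select"):
        errors.append("only SELECT statements are allowed")
    for groups in re.findall(r"\bfrom\s+(\w+)|\bjoin\s+(\w+)", sql, flags=re.I):
        table = next(g for g in groups if g)
        if table.lower() not in ALLOWED_TABLES:
            errors.append(f"unknown table: {table}")
    if "limit" not in sql.lower():
        errors.append("missing LIMIT clause")
    return errors

print(validate_generated_sql("SELECT * FROM orders LIMIT 10"))   # []
print(validate_generated_sql("DELETE FROM user_emails"))         # three problems flagged
```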

monero-xmr 4 hours ago [-]
A non-open-ended path collapses into a decision tree. It's very hard to think of customer support use cases that do not collapse into decision trees. Most prompt engineering on the SaaS side results in very long prompts that re-invent decision trees and protect against edge cases. Ultimately the AI makes a “decision function call” which hits a decision tree. An LLM is a very poor replacement for a decision tree.

I use LLMs every day of my life to make myself highly productive. But I do not use LLM tools to replace my decision trees.
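
A minimal sketch of that shape (names and thresholds invented): the model’s only contribution is picking the entry point; everything after that is a plain decision tree.

```
def refund_tree(order_age_days: int, amount: float) -> str:
    # deterministic decision tree: same inputs, same answer, every time
    if order_age_days > 30:
        return "deny: outside the refund window"
    if amount > 500:
        return "escalate: needs human approval"
    return "approve: automatic refund"

TOOLS = {"handle_refund": refund_tree}   # the "decision function calls" exposed to the LLM

def handle_request(llm_choice: str, **kwargs) -> str:
    # llm_choice is the model's one contribution; the outcome is decided by the tree
    if llm_choice not in TOOLS:
        return "fallback: route to a human"
    return TOOLS[llm_choice](**kwargs)

print(handle_request("handle_refund", order_age_days=10, amount=49.99))
```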

LPisGood 3 hours ago [-]
It just occurred to me that with those massive system files people use we’re basically reinventing expert systems of the past. Time is a flat circle, I suppose.
schrodinger 1 hours ago [-]
A decision tree is simply a model where you follow branches and make a decision at each point. Like...

If we had tech support for a toaster, you might see:

    if toaster toasts the bread:
      if no: has turning it off and on again worked?
        if yes: great! you found a solution
        if no: hmm, try ...
      if yes:
        is the bread burnt after?
          if no: sounds like your toaster is fine!
          if yes: have you tried adjusting the darkness knob?
            if no: try adjusting it first
            if yes: try replacing the timer. does that help?
              if no: ship it in for repair
              if yes: yay, your toaster is fixed
LostMyLogin 2 hours ago [-]
Any chance you can ELI5 this to me?
dmbche 1 hours ago [-]
Just search "expert system"
hn_throwaway_99 2 hours ago [-]
> Here’s the reality check: One panelist mentioned that 95% of AI agent deployments fail in production. Not because the models aren’t smart enough, but because the scaffolding around them, context engineering, security, memory design, isn’t there yet.

It's a big pet peeve of mine when an author states an opinion, with no evidence, as some kind of axiom. I think there is plenty of evidence that "the models aren’t smart enough". Or to put it more accurately: it's an incredibly difficult problem to get a big productivity gain when an automated system is blatantly wrong ~1% of the time, and those wrong answers are inherently designed to look as much like right answers as possible.

jongjong 4 hours ago [-]
It's interesting because my management philosophy when delegating work has been to always start by telling people what my intent is, so that they don't get too caught up in a specific approach. Many problems require out-of-the-box thinking. This is really about providing context. Context engineering is basically a management skill.

Without context, even the brightest people will not be able to fill in the gaps in your requirements. Context is not just nice-to-have, it's a necessity when dealing with both humans and machines.

I suspect that people who are good engineering managers will also be good at 'vibe coding'.

another_twist 4 hours ago [-]
It's weird that this makes the front page while Meta's Code World Model never did.
metadat 4 hours ago [-]
First I've heard of it:

https://ai.meta.com/research/publications/cwm-an-open-weight...

CuriouslyC 3 hours ago [-]
HN front page dynamics are heavily driven by having readers of /new who are stans for your content.
ath3nd 2 hours ago [-]
[dead]
hshdhdhehd 3 hours ago [-]
Base models are the seed, fine tuning is the genetically modified seed. Context is the fertiliser.
handfuloflight 3 hours ago [-]
Agents are the oxen pulling the plow through the seasons... turning over ground, following furrows, adapting to terrain. RAG is the irrigation system. Prompts are the farmer's instructions. And the harvest? That depends on how well you understood what you were trying to grow.
ath3nd 2 hours ago [-]
[dead]