AI coding assistants are getting worse? (spectrum.ieee.org)
432 points by voxadam 1 day ago | 694 comments
llmslave2 1 day ago [-]
One thing I find really funny is that when AI enthusiasts make claims about agents and their own productivity, it's always entirely anecdotal, based on their own subjective experience, but when others make claims to the contrary, suddenly there is some overwhelming burden of proof that has to be met in order to make any sort of claim regarding the capabilities of AI workflows. So which is it?
misja111 11 hours ago [-]
A while ago someone posted a claim like that on LinkedIn again. And of course there was the usual herd of LinkedIn sheep who were full of compliments and wows about the claim he was making: a 10x speedup of his daily work.

The difference from the zillion others who did the same is that he attached a link to a live stream where he was going to show his 10x speedup on a real-life problem. Credit to him for doing that! So I decided to go have a look.

What I then saw was him struggling for one hour with some simple extension to his project. He didn't manage to finish in the hour what he was planning to. And when I had some thought about how much time it would have cost me by hand, I found it would have taken me just as long.

So I answered him in his LinkedIn thread and asked where the 10x speed up was. What followed was complete denial. It had just been a hiccup. Or he could have done other things in parallel while waiting 30 seconds for the AI to answer. Etc etc.

I admit I was a sceptic at the start, but I had honestly been hoping that my scepticism would be proven wrong. It wasn't.

Folcon 10 hours ago [-]
I'm going to try and be honest with you because I'm where you were at 3 months ago

I honestly don't think there's anything I can say to convince you, because from my perspective that's a fool's errand, and the reason for that has nothing to do with the kind of person either of us is, but with what kind of work we're doing and what we're trying to accomplish

The value I've personally been getting which I've been valuing is that it improves my productivity in the specific areas where its average quality of response as one-shot output is better than what I would do myself, because it is equivalent to me Googling an answer, reading 2 to 20 posts, consolidating that information together and synthesising an output

And that's not to say that the output is good, that's to say that the cost of trying things as a result is much cheaper

It's still my job to refine, reflect, define and correct the problem, the approach etc

I can say this because it's painfully evident to me when I try and do something in areas where it really is weak and I honestly doubt that the foundation model creators presently know how to improve it

My personal evidence for this is that after several years of tilting at those windmills, I'm successfully creating things that I have spent the last decade, on and off, trying to create. I had difficulty not because I couldn't do it, but because the cost of change and iteration was so high that, after trying a few things and failing, I would invariably simplify the problem because solving it was too expensive. I'm now solving a category of those problems, and this is different for me; I really feel it, because that sting of persistent failure and dread of trying is absent now

That's my personal perspective on it, sorry it's so anecdotal :)

bigfishrunning 9 hours ago [-]
>The value I've personally been getting which I've been valuing is that it improves my productivity in the specific areas where its average quality of response as one-shot output is better than what I would do myself, because it is equivalent to me Googling an answer, reading 2 to 20 posts, consolidating that information together and synthesising an output

>And that's not to say that the output is good, that's to say that the cost of trying things as a result is much cheaper

But there's a hidden cost here -- by not doing the reading and reasoning out the result, you have learned nothing and your value has not increased. Perhaps you expended a bit less energy producing this output, but you've taken one more step down the road to atrophy.

rectang 2 hours ago [-]
Seeing the code that the LLM generates and occasionally asking it to explain has been an effective way to improve my understanding. It's better in some ways than reading documentation or doing tutorials because I'm working on a practical project I'm highly motivated by.

I agree that there is benefit in doing research and reasoning, but in my experience skill acquisition through supervising an LLM has been more efficient because my learning is more focused. The LLM is a weird meld of domain expert/sycophant/scatterbrain but the explanations it gives about the code that it generates are quite educational.

ben_w 8 hours ago [-]
I think there's a potential unstated assumption here, though forgive me if it was made explicit elsewhere and/or I missed it.

LLM-assisted coding can be with or without code review. The original meaning of "vibe coding" was without, and I absolutely totally agree this rapidly leads to a massive pile of technical debt, having tried this with some left-over credit on a free trial specifically to see what the result would be. Sure, it works, but it's a hell of a mess that will make future development fragile (unless the LLMs improve much faster than I'm expecting) for no good reason.

Before doing that, I used Claude Code the other way, with me doing code reviews to make sure it stayed aligned with my ideas of best practices. I'm not going to claim it was perfect: it did a Python backend and web front end for a webcam in one case, simultaneously a browser-based game engine and an example game for that engine on a second project, and a joke programming language on a third, and I'm not a "real" Python dev or "real" web dev or any kind of compiler engineer (the last time I touched Yacc before this joke language was 20 years earlier at university). But it produced code that I was satisfied I could follow and understand, that wasn't terrible, and that had tests.

I wouldn't let a junior commit blindly without code review and tests, because I know what junior code looks like from all the times I've worked with juniors (or gone back to 20-year-old projects of my own). But even if I were happy to blindly accept a junior's code, or even if the LLM were senior- or lead-quality, the reason you're giving here means code review before acceptance is helpful for professional development even when all the devs are at the top of their games.

bigfishrunning 8 hours ago [-]
Yes, but I'm talking about more than code review -- there is a ton of value in discovering all of the ways not to solve a problem. When reading 25 forum posts or whatever while trying to write some function, you're learning more than just the answer. You're picking up a ton of context about how these sorts of problems are solved. If all you're doing is reviewing the output of some code generator, your mental context is not growing in the same way.
0x262d 7 hours ago [-]
I'm curious if you think the same thing was lost with the transition from reading man pages and first-party documentation to going to stackoverflow or google first (at least, I assume the former was more common a couple decades ago)
bigfishrunning 7 hours ago [-]
What was lost in that transition was that the required quality of first-party documentation decreased; generally that first-party documentation simply didn't contain enough information, so you needed to determine things empirically or read source code to get more. I do think the culture of "copy-and-paste from stackoverflow" harmed the general competency of programmers, but having more third-party information available was only a positive thing.
newsoftheday 5 hours ago [-]
Before the post-2022 age of modern AI, man pages, SO and Google results were all produced by humans, not by AI fabrication and hallucination.
coldtea 3 hours ago [-]
A lot was lost then too.
Freebytes 4 hours ago [-]
Merely choosing lines to copy and paste from one file of your own code to another is a learning experience for your brain. AI is excellent for removing a lot of grunt work, but that type of work also reinforces your brain even if you think you are learning nothing. Something can still be lost even if AI is merely providing templates or scaffolding. The same can be said of using Google to find examples, though. You should try to come up with the correct function name or parameter list yourself in your head before using a search engine or AI. And that is for the most simple examples, e.g. "SQL table creation example". These should be things we know off the top of our heads, so we should first try to type them out before we go to look for an answer.
Aeolun 5 hours ago [-]
> By not doing the reading and reasoning out the result, you have learned nothing and your value has not increased

AI helps at the margins.

It’s like adding anti-piracy. Some people would simply never have bought the game unless they can pirate it.

There's a large volume of simple tools, or experimental software, that I would simply never have had the time to build the traditional way.

Folcon 7 hours ago [-]
I mean you're not wrong

I suppose the way I approach this is: I use libraries which solve problems that I have, that I in principle understand, because I know and understand the theory, but in practice I don't know the specific details, because I've not implemented the solution myself

And honestly, it's not my job to solve everything, I've just got to build something useful or that serves my goals

I basically put LLMs into that category. I'm not much of a NIH kinda person; I'm happy to use libraries, including alpha ones, on projects if they've been vetted over the range of inputs that I care about. I'm not going to go into how to do that here, because honestly it's not that exciting, but there are very standard, boring ways to produce good guarantees about their behaviour, so as long as I've done that, I'm pretty happy

So I suppose what I'm saying is that it isn't a hidden cost to me; it's a pragmatic decision I made, and I was happy with the trade-off :)

When I want to learn, and believe me I do now and again, I'll focus on that there :)

newsoftheday 5 hours ago [-]
> I use libraries

> I basically put LLMs into that category

That says a lot to be sure.

Folcon 53 minutes ago [-]
Seeing as you've chosen to be ambiguous, I'll interpret your comment positively :)

Otherwise feel free to put forward a criticism

brianwawok 9 hours ago [-]
Example for me: I am primarily a web dev today. I needed some Kubernetes stuff set up. Usually that's 4 hours of Google and guess-and-check. Claude did it better in 15 minutes.

Even if all it does is speed up the stuff I suck at, that's plenty. Oh boy, Docker builds; it saves my bacon there too.

Draiken 9 hours ago [-]
And you learned nothing and have no clue if what it spit out is good or not.

How can you even assume what it did is "better" if you have no knowledge of kubernetes in the first place? It's mere hope.

Sure it gets you somewhere, but you learned nothing along the way and now depend on the LLM to maintain it forever, given you don't want to learn the skill.

I use LLMs to help verify my work and it can sometimes spot something I missed (more often it doesn't but it's at least something). I also automate some boring stuff like creating more variations of some tests, but even then I almost always have to read its output line by line to make sure the tests aren't completely bogus. Thinking about it now it's likely better if I just ask for what scenarios could be missing, because when they write it, they screw it up in subtle ways.

It does save me some time in certain tasks like writing some Ansible, but I have to know/understand Ansible to be confident in any of it.

These "speedups" are mostly short-term gains bought by sacrificing long-term ones. Maybe you don't care about the long term and that's fine. But if you do, you'll regret it sooner or later.

My theory is that AI is so popular because mediocrity is good enough to make money. You see the kind of crap that's built these days (even before LLMs) and it's mostly shit anyways, so whether it's shit built by people or machines, who cares, right?

Unfortunately I do, and I'd rather we improve the world we live in instead of making it worse for a quick buck.

IDK how or why learning and growing became so unpopular.

dpark 6 hours ago [-]
> Sure it gets you somewhere, but you learned nothing along the way and now depend on the LLM to maintain it forever, given you don't want to learn the skill.

The kind of person who would vibe code a bunch of stuff and push it with zero understanding of what it does or how it does it is the kind of person who’s going to ruin the project with garbage and technical debt anyway.

Using an LLM doesn't mean you shouldn't look at the results it produces. You should still check its results. You should correct it when it doesn't meet your standards. You still need to understand it well enough to say "that seems right". This isn't about LLMs. This is just about basic care for quality.

But also, I personally don’t care about being an expert at every single thing. I think that is an unachievable dream, and a poor use of individual time and effort. I also pay people to do stuff like maintenance on my car and installing HVAC systems. I want things done well. That doesn’t mean I have to do them or even necessarily be an expert in them.

Bombthecat 7 hours ago [-]
I notice this already, after around 6 months of heavy usage. Skills decline, even information gathering etc.
jpadkins 4 hours ago [-]
I think it is more accurate to say some skills are declining (or not developing) while a different set of skills are improving (the skill of getting an LLM to produce functional output).

Similar to if someone started writing a lot of C, their assembly coding skills may decline (or at least not develop). I think all higher levels of abstraction will create this effect.

llmslave2 2 hours ago [-]
> while a different set of skills are improving (the skill of getting an LLM to produce functional output

Lmaooooo

p410n3 8 hours ago [-]
I agree with both of your points, since I use LLMs for things I am not good at and don't give a single poop about. The only things I did with LLMs are these three examples from the last two years:

- Some "temporary" tool I built years ago as a pareto-style workaround broke. (As temporary tools do after some years.) It's basically a wrapper that calls a bunch of XSLs on a bmecat.xml every 3-6 months. I did not care to learn XSL back then and I don't care to do it now. It's arcane and non-universal; some stuff only works with certain XSL processors. I asked the LLM to fix stuff 20 times and eventually it got it. Probably got that stuff off my back for another couple of years.

- Some third-party tool we use has a timer feature with a bug where it sets a cookie every time you see a timer, once per timer (for whatever reason... the timers are set to end at a certain time and there is no reason to attach them to a user). The cookies have a lifetime of one year. We run time-limited promotions twice a week, so that means two cookies a week for no reason. Eventually our WAF got triggered because it has a rule to block requests when headers are crazy long - which they were, because of the cookies. I asked an LLM to give me a script that clears the cookie when it's older than 7 days, because I remember the last time I hacked together cookie stuff it also felt very "wtf" in a JavaScript kinda way and I did not care to relive that pain. This was in place for some weeks, until the third-party tool fixed the cookie lifetime.

- We list products on a marketplace. The marketplace has their own category system. We have our own category system. Frankly theirs kinda sucks for our use case because it lumps a lot of stuff together, but we needed to "translate" the categories anyway. So I exported all unique "breadcrumbs" we have and gave that + the categories from the marketplace to an LLM one by one, by looping through the list. I then had an apprentice from another dept. that has vastly more product knowledge than me look over that list in a day. The alternative would have been to have said apprentice do that stuff by hand, which is a task I would have personally HATED, so I tried to lessen the burden for them.

All these examples are free tier in whatever I used.
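A minimal sketch of the category-matching loop from the third example might look something like the following (this is illustrative only: the OpenAI model name, the prompt wording, and the file names are assumptions, not what was actually used):

    # Hypothetical sketch of the breadcrumb-to-category loop. Assumes the official
    # OpenAI Python client; model name, prompt wording and file names are made up.
    import csv
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    marketplace_categories = [line.strip() for line in open("marketplace_categories.txt")]
    breadcrumbs = [line.strip() for line in open("our_breadcrumbs.txt")]

    with open("category_mapping.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["our_breadcrumb", "suggested_marketplace_category"])
        for crumb in breadcrumbs:
            prompt = (
                "Pick the single best matching marketplace category for this product "
                f"breadcrumb.\nBreadcrumb: {crumb}\nCategories:\n"
                + "\n".join(marketplace_categories)
                + "\nAnswer with the category name only."
            )
            resp = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
            )
            # A human with product knowledge still reviews this CSV afterwards.
            writer.writerow([crumb, resp.choices[0].message.content.strip()])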

We also use a vector search at work: 300,000 products, with weekly updates of the vector DB.

We pay €250/mo for all of the qdrant instances across all environments, and like €5-10 in OpenAI tokens. And we can easily switch whatever embedding model we use at any time. We can even self-host a model.
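Conceptually, a weekly refresh like that boils down to something like this sketch (assuming the qdrant-client and openai Python packages; the collection name, embedding model and payload fields are illustrative only):

    # Hypothetical sketch of a weekly vector-DB refresh with Qdrant + OpenAI
    # embeddings. Collection name, model and product fields are assumptions.
    from openai import OpenAI
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, VectorParams, PointStruct

    openai_client = OpenAI()
    qdrant = QdrantClient(url="http://localhost:6333")

    COLLECTION = "products"
    EMBED_MODEL = "text-embedding-3-small"  # swapping models is a one-line change

    # (Re)create the collection; 1536 is this embedding model's dimension.
    qdrant.recreate_collection(
        collection_name=COLLECTION,
        vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    )

    def embed(texts):
        resp = openai_client.embeddings.create(model=EMBED_MODEL, input=texts)
        return [d.embedding for d in resp.data]

    def refresh(products):
        # products: iterable of dicts like {"id": 1, "title": "...", "description": "..."}
        batch = list(products)
        vectors = embed([p["title"] + " " + p["description"] for p in batch])
        qdrant.upsert(
            collection_name=COLLECTION,
            points=[
                PointStruct(id=p["id"], vector=vec, payload=p)
                for p, vec in zip(batch, vectors)
            ],
        )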

misja111 9 hours ago [-]
No, I agree with you, there are areas where AI is helping amazingly. Every now and then it helps me with some issue as well, which would have cost me hours earlier and now is done in minutes. E.g. some framework that I'm not that familiar with, or doing the scaffolding for some unit test.

However this is only a small portion of my daily dev work. For most of my work, AI helps me little or not at all. E.g. adding a new feature to a large codebase: forget it. Debugging some production issue: maybe it helps me a little bit to find some code, but that's about it.

And this is what my post was referring to: not that AI doesn't help at all, but to the crazy claims (10x speedup in daily work) that you see all over social media.

newsoftheday 6 hours ago [-]
> I'm going to try and be honest with you because I'm where you were at 3 months ago

> I honestly don't think there's anything I can say to convince you

> The value I've personally been getting which I've been valuing

> And that's not to say that the output is good

> My personal evidence for this is that after several years of tilting at those windmills

It sounds to me like you're rationalizing, and your opening sentences embed your own awareness of the fallibility of what you go on to say, and clearly believe, about your situation.

I feel there are two types of programmers who use AI:

    Type A who aren't very good but AI makes them feel better about themselves.

    Type B who are good with or without AI and probably slightly better with it, but at a productivity cost (from fixing the AI all the way through) rather than a boost, leading to their somewhat negative but valid view of AI.
econ 3 hours ago [-]
It's great in unfamiliar terrain.
FloorEgg 3 hours ago [-]
It's great when the terrain is unfamiliar to the user but extremely familiar to the LLM. And it's useless in the opposite case.

The best programmers are going to be extremely familiar with terrains that are unfamiliar to the LLMs, which is why their views are so negative. These are people working on core parts of complex high performing highly scalable systems, and people with extreme appreciation for the craft of programming and code quality.

But the most productive developers, focused on higher-level user value and functionality (e.g. pumping out full-stack apps or features), are more likely to be working with commonly used technologies while also jumping around between technologies as a means to a functionality or UX objective, rather than as an end in itself such as skill development, elegant code, or satisfying curiosity.

I think this explains a lot of the difference in perspectives. LLMs offer value in the latter but not the former.

It's a shame that so many of the people in one context can't empathize with the people in the other.

lawlessone 5 hours ago [-]
you haven't contributed much to GitHub since 2022?

*edit unless your commits are elsewhere?

lazyfanatic42 11 hours ago [-]
I think people get into a dopamine hit loop with agents and are so high on dopamine, because it's giving them output that simulates progress, that they don't see the reality of where they are at. It is SO DAMN GOOD AT OUTPUT. Agents love to output; it is very easy to think it's inventing physics.

Obviously my subjective experience

queueueue 10 hours ago [-]
Ironic that I'm going to give another anecdotal experience here, but I've noticed this myself too. I catch myself trying to keep on prompting after an LLM has not been able to solve some problem in a specific way, while I could probably do it faster at that point if I switched to doing it fully myself. Maybe because the LLM output feels like it's 'almost there', or some sunk cost fallacy.
qwery 4 hours ago [-]
Not saying this is you, but another way to look at it is that engaging in that process is training you (again, not you specifically, but the user) -- the way you get results is by asking the chat bot, so that's what you try first. You don't need sunk cost or gambling mechanics, it's just simple conditioning.

Press lever --> pellet.

Want pellet? --> press lever.

Pressed lever but no pellet? --> press lever.

raducu 10 hours ago [-]
> I think people get into a dopamine hit loop

I also think that's the case, but I'm open to the idea that there are people that are really really good at this and maybe they are indeed 10x.

My experience is that for SOME tasks LLMs help a lot, but overall nowhere near 10x.

Consistently it's probably.... ~1X.

The difference is I procrastinate a lot and LLMs actually help me not procrastinate BECAUSE of that dopamine kick and I'm confident I will figure it out with an LLM.

I'm sure there are many people who got to a conclusion on their to-do projects with the help of LLMs who, without them, because of procrastination or whatever, would not have had a chance to.

It doesn't mean they're now rich, because most projects won't make you rich or make you any money regardless of whether you finish them or not

Majkemilk 10 hours ago [-]
[dead]
sharadov 7 hours ago [-]
You nailed it - like posting on social media and getting dopamine hits as you get likes and comments. Maybe that's what has got all these vibe coders hooked.
Kerrick 5 hours ago [-]
I feel like I've been incredibly productive with AI assisted programming over the past few weeks, but it's hard to know what folks' baselines are. So in the interest of transparency, I pushed it all up to sourcehut and added Co-Authored-By footers to the AI-assisted commits (almost all of them).

Everything is out there to inspect, including the facts that I:

- was going 12-18 hours per day

- stayed up way too late some nights

- churned a lot (+91,034 -39,257 lines)

- made a lot of code (30,637 code lines, 11,072 comment lines, plus 4,997 lines of markdown)

- ended up with (IMO) pretty good quality Ruby (and unknown quality Rust).

This is all just from the first commit to v0.8.0. https://git.sr.ht/~kerrick/ratatui_ruby/tree/v0.8.0

What do you think: is this fast, or am I just as silly as the live-streamer?

P.S. - I had an edge here because it was a green-field project and it was not for my job, so I had complete latitude to make decisions.

qwery 5 hours ago [-]
I don't really know Ruby, so maybe I'm missing something major, but your commit messages seem extremely verbose yet messy (I can't make heads or tails of them) and I'm seeing language like "deprecated" and a stream of "releases" within a period of hours and it just looks a bit like nonsense.

Don't take "nonsense" negatively, please -- I mean it looks like you were having fun, which is certainly to be encouraged.

Kerrick 4 hours ago [-]
The commit messages with a Co-Authored-By footer were all generated. I recommend clicking the "tree" link to see the actual code. Specifically:

- README.md explains the basics https://git.sr.ht/~kerrick/ratatui_ruby/tree/v0.8.0/item/REA...

- CHANGELOG.md is better than the commit messages, and filtered to only what app devs using the library likely care about: https://git.sr.ht/~kerrick/ratatui_ruby/tree/v0.8.0/item/CHA...

- doc/ holds the Markdown documentation, which I heavily reviewed. https://git.sr.ht/~kerrick/ratatui_ruby/tree/v0.8.0/item/doc

- lib/ holds the Ruby source code of the library, which I heavily designed and reviewed. https://git.sr.ht/~kerrick/ratatui_ruby/tree/v0.8.0/item/lib

- examples/ holds the Ruby source code of some toy apps built with the library. https://git.sr.ht/~kerrick/ratatui_ruby/tree/v0.8.0/item/exa...

- bin/ holds a few Ruby scripts & apps to automate some ops (check out announce) https://git.sr.ht/~kerrick/ratatui_ruby/tree/v0.8.0/item/bin

- tasks/ holds some more Ruby scripts & apps to automate some ops (most I did not read, but I heavily designed and reviewed bump and terminal_preview) https://git.sr.ht/~kerrick/ratatui_ruby/tree/v0.8.0/item/tas...

- ext/ holds the Rust source code of the library, which I did not read most of. https://git.sr.ht/~kerrick/ratatui_ruby/tree/v0.8.0/item/ext

I was having a lot of fun, and part of the reason I took deprecations and releases seriously was because I hoped to encourage adoption. And that I did: https://todo.sr.ht/~kerrick/ratatui_ruby/4 and https://github.com/sidekiq/sidekiq/blob/main/bin/tui

GuB-42 8 hours ago [-]
> What I then saw was him struggling for one hour with some simple extension to his project. He didn't manage to finish in the hour what he was planning to. And when I had some thought about how much time it would have cost me by hand, I found it would have taken me just as long.

For all who are doing that, what is the experience of coding on a livestream? It is something I have never attempted; the simple idea makes me feel uncomfortable. A good portion of my coding would be rather cringe, like spending way too long on a stupid copy-paste or sign error that my audience would have noticed right away. On the other hand, sometimes I am really fast because everything is in my head, but then I would probably lose everyone. I am impressed when watching live coders by how fluid it looks compared to my own work; maybe there is a rubber duck effect at work here.

All this to say that I don't know how working solo compares to a livestream. Is it more or less efficient? Maybe it doesn't matter that much when you get used to it.

qwery 4 hours ago [-]
Have done it, never enough of an audience to be totally humiliated. It's never going to be more efficient.

But as for your cringe issue that the audience noticed, one could see that to be a benefit -- better to have someone say e.g. "you typed `Normalise` (with an 's') again, C++ is written in U.S. English, don't you know / learn to spell, you slime" upfront than to wait for the compiler to tell you that `Normalise` doesn't exist, maybe?

QuercusMax 5 hours ago [-]
I suspect livestream coding, like programming competition coding and whiteboard coding for interviews, is a separate skill that's fairly well correlated with being able to solve useful problems, but it is not the same thing. You can be an excellent problem solver without being good at doing so while being watched, under time pressure.
ruszki 11 hours ago [-]
There were such people here, too.

Copy-pasting the code would have been faster than their work, and there were several problems with their results. But they were so convinced that their work was quick and flawless that they posted a video recording of it.

jennyholzer4 10 hours ago [-]
Hackernews is dominated by these people

LLM marketers have succeeded at inducing collective delusion

judahmeek 9 hours ago [-]
> LLM marketers have succeeded at inducing collective delusion

That's the real trick & one I desperately wish I knew how to copy.

I know there's a connection to Dunning Kruger & I know that there's a dopamine effect of having a responsive artificial minion & there seems to be some of that "secret knowledge" sauce that makes cults & conspiracies so popular (there's also the promise of less effort for the same or greater productivity).

As the list grows, I see the popularity, but I doubt I could easily apply all these qualities to anything else.

jennyholzer4 9 hours ago [-]
IMO algorithmically generated "social" media feeds combined with the lack of adequate mass-media alternatives have supercharged cult recruitment in the last approximately 10 years.

Stupid people in my life have been continually and recklessly joining harebrained cults for the last 5 years.

Really I think it's probably much, much easier to start a cult these days than it has ever been. Good news for tech company founders I guess, bad news for American culture, American society, and the American people.

codyb 9 hours ago [-]
One way to help stop it is to get off social media and stop giving these tech billionaires so much money.

The fewer people on social media, the less real the network effect is, the fewer people who join in the first place, the less money the billionaires have to throw hundreds of millions into politics, the fewer inadvertent cult members.

I've gotten to the point where I just leave my phone at home, and it has been incredibly nice. Before that I deleted most apps that I found to be time wastes and deleted all social media (HN and two small discords are my exceptions).

It's very nice, I'm less stressed, I feel more in the moment, and I respond to my friends when I check my phone every few hours, on the speaker in the other room.

I encourage others to try it, add it to your dry January.

And ya know what I ain't doing a lick of? Sending money and reams of data to these billionaires I think are really lame individuals with corrupted moral compasses.

Now it ain't perfect, I'm sure Google's still getting reams of info about me from my old Gmail account that I still use sometimes, and Apple too from a few sources. But... getting closer!

So many folk sit here and recognize the same problems I do, the way it warps your attention, the addictiveness of the handheld devices, the social media echo chambers, the rising influence of misinformation, the lack of clarity between real and fake...

Seems like there's a solution in front of us :-)

dpark 6 hours ago [-]
> So I answered him in his LinkedIn thread and asked where the 10x speed up was. What followed was complete denial. It had just been a hiccup. Or he could have done other things in parallel while waiting 30 seconds for the AI to answer. Etc etc.

So I’ve been playing with LLMs for coding recently, and my experience is that for some things, they are drastically faster. And for some other things, they will just never solve the problem.

Yesterday I had an LLM code up a new feature with comprehensive tests. It wasn’t an extremely complicated feature. It would’ve taken me a day with coding and testing. The LLM did the job in maybe 10 minutes. And then I spent another 45 minutes or so deeply reviewing it, getting it to tweak a few things, update some test comments, etc. So about an hour total. Not quite a 10x speed up, but very significant.

But then I had to integrate this change into another repository to ensure it worked for the real-world use case, and that ended up being a mess, mostly because I am not an expert in the package management and I was trying to subvert it to use an unpublished package. Debugging this took the better part of the day. For this case, the LLM maybe saved me 20%, because it did have a couple of tricks that I didn't know about. But it was certainly not a massive speed up.

So far, I am skeptical that LLMs will make someone 10x as efficient overall. But that's largely because not everything is actually coding. Subverting the package management system to do what I want isn't really coding. Participating in design meetings and writing specs and sending emails and dealing with red tape and approvals is definitely not coding.

But for the actual coding specifically, I wouldn’t be surprised if lots of people are seeing close to 10x for a bunch of their work.

lossyalgo 3 hours ago [-]
Shopify's CEO just posted the other day that he's super productive using the newest AI models and many of the supportive comments responding to his claim were from CEOs of AI startups.
cmiles74 10 hours ago [-]
I suspect there's also a good amount of astroturfing happening here as well, making it harder to find the real success stories.
jlarocco 7 hours ago [-]
I've noticed a similar trend. There seems to be a lot of babysitting and hand holding involved with vibe-coding. Maybe it can be a game changer for "non-technical founders" stumbling their way through to a product, but if you're capable of writing the code yourself, vibe coding seems like a lot of wasted energy.
Bombthecat 8 hours ago [-]
Even if this took two or three hours and a vibe coder, it's still cheaper than a real developer
boringg 9 hours ago [-]
There's too much money, time and infrastructure committed for this to be anything but successful.

It's tougher than a space race or the nuclear bomb race because there are fewer hard tangibles as evidence of success.

seidleroni 8 hours ago [-]
I think there is also some FOMO involved. Once people started saying how AI was helping them be more productive, a lot of folks felt that if they didn't do the same, they were lagging behind.
chankstein38 8 hours ago [-]
Sounds like someone trying to sell a course or something.
cozzyd 3 hours ago [-]
10 times zero is still zero!
alex1138 11 hours ago [-]
You're supposed to believe in his burgeoning synergy so that one day you may collaborate to push industry leading solutions
dr-detroit 8 hours ago [-]
[dead]
AstroBen 1 day ago [-]
It's an impossible thing to disprove. Anything you say can be countered by their "secret workflow" they've figured out. If you're not seeing a huge speedup well you're just using it wrong!

The burden of proof is 100% on anyone claiming the productivity gains

anonzzzies 20 hours ago [-]
I go to meetups and enjoy myself so much; 80% of people are showing how to install 800000000 MCPs on their 92GB MacBook Pros, new RAG memory, n8n agent flows, super special prompting techniques, secret sauces, killer .md files, special VS Code setups, and after all that they are still not more productive than just vanilla Claude Code in a git repo. You get people saying 'look, I only have to ask xyz... and it does it! magic'; then you just type 'do xyz' in vanilla CC and it does exactly the same thing, often faster.
mikestorrent 17 hours ago [-]
This was always the case. People obsessing over keyboards, window managers, emacs setups... always optimizing around the edges of the problem, but this is all taking an incredible amount of their time versus working on real problems.
sheepscreek 16 hours ago [-]
Yes, the thing they realize much later in life is that perhaps they enjoyed the act of gardening (curating your tools, workflows, etc) much more than farming (being downright focused and productive on the task at hand).

Sadly gardening doesn’t pay the bills!

anonzzzies 15 hours ago [-]
yep, and I have the same thing, but then I am not going to tell everyone it is super productive for the actual task of farming. I say that I have a hobby farm (which I do) and talk about my tools and my meager yields (which won't make any money if sold). I am not going to say that my workflows are so productive while my neighbour, who is a professional farmer, just has old crap and simply starts and works from 5 am to 9 pm, making a living off his land.
hdjrudni 16 hours ago [-]
If I only spend $1000 on hydroponics and 3 weeks tending to a vertical garden I can grow a $1 head of lettuce FOR FREE!
ben_w 8 hours ago [-]
I tried growing lettuce in some cut up plastic bottles at university in soil from the nearby footpath, I think even with the cheap student approach I spent more on the (single pack of) seeds than a fully grown lettuce costs, and only managed about four individual leaves that were only about 1cm by 5cm.
DANmode 14 hours ago [-]
What if I haven’t spent anything,

and I’m making money with lettuce I grew in the woods?

(or, in Anthropic/sama’s backyards)

fc417fc802 9 hours ago [-]
I like farming but a lot of the tools are annoying to use so I find myself tinkering with them (gardening in your analogy I guess). It's not that I prefer tinkering in the shop to farming. More that I just have very little patience for tools that don't conform to the ways in which I think about the world.
songodongo 10 hours ago [-]
Same thing happens in music production. If only I had this guitar, or that synth, or these plugins…
multjoy 10 hours ago [-]
Gear Acquisition Syndrome is a very different problem. Even if you haven't cured the issue the new synth was meant to fix, at least you have a new synth.
sehansen 8 hours ago [-]
It's the four hobbies all over again: https://brooker.co.za/blog/2023/04/20/hobbies.html
whoiskevin 9 hours ago [-]
A better keyboard is a hill I will die on.
mikestorrent 3 hours ago [-]
I have a fantastic keyboard, but I'm not taking pictures of it, changing the keycaps, posting about it. It's a tool, not a fetish; that's how I differentiate these things.
butlike 8 hours ago [-]
It's a keyboard attached to an article of clothing you put your head into so the keys drape over your shoulders. You then type, but also end up giving yourself a shoulder massage!
atakan_gurkan 15 hours ago [-]
Yes, this happens quite often. So often that I wonder if it is among the symptoms of some psychiatric or neurological disorder.
Bridged7756 9 hours ago [-]
It's just boredom probably. Obsessing over productivity tools is relatively more productive than say, something completely unrelated to your work.
abakker 18 hours ago [-]
That ties perfectly with my experience. Just direct prompts, with limited setup and limited context, seem to work better than or just as well as complex custom GPTs. There are not just diminishing but inverting returns to complexity in GPTs
serf 17 hours ago [-]
limited prompts work well for limited programs, or already well defined and cemented source bases.

once scope creeps up you need the guardrails of a carefully crafted prompt (and pre-prompts, tool hooks, AGENTS files, the whole gamut) -- otherwise it turns into cat wrangling rapidly.

anonzzzies 17 hours ago [-]
Not in our experience (we're a 30+ year old software company), and we have large code bases with a lot of scope creep; more than ever, as we can deliver a lot more for a lot less (a lot more revenue / profit too).
PunchyHamster 9 hours ago [-]
No, no, you misunderstand, that's still massive productivity improvement compared to them being on their own with their own incompetence and refusal to learn how to code properly
cindyllm 9 hours ago [-]
[dead]
paodealho 24 hours ago [-]
This gets comical when there are people, on this site of all places, telling you that using curse words or "screaming" in ALL CAPS in your agents.md file makes the bot follow orders with greater precision. And these people have "engineer" on their resumes...
electroglyph 23 hours ago [-]
there's actually quite a bit of research in this field; here are a couple:

"ExpertPrompting: Instructing Large Language Models to be Distinguished Experts"

https://arxiv.org/abs/2305.14688

"Persona is a Double-edged Sword: Mitigating the Negative Impact of Role-playing Prompts in Zero-shot Reasoning Tasks"

https://arxiv.org/abs/2408.08631

AdieuToLogic 22 hours ago [-]
Those papers are really interesting, thanks for sharing them!

Do you happen to know of any research papers which explore constraint programming techniques wrt LLMs prompts?

For example:

  Create a chicken noodle soup recipe.

  The recipe must satisfy all of the following:

    - must not use more than 10 ingredients
    - must take less than 30 minutes to prepare
    - ...
llmslave2 21 hours ago [-]
I've seen some interesting work going the other way: having LLMs generate constraint solvers (or whatever the term is) in Prolog and then feeding input to that. I can't remember the link, but it could be worthwhile searching for it.
Aeolun 5 hours ago [-]
Anything involving numbers, or conditions like ‘less than 30 minutes’ is going to be really hard.
aix1 14 hours ago [-]
This is an area I'm very interested in. Do you have a particular application in mind? (I'm guessing the recipe example is just to illustrate the general principle.)
cess11 14 hours ago [-]
I suspect LLM-like technologies will only rarely back out of contradictory or otherwise unsatisfiable constraints, so it might require intermediate steps where LLMs formalise the problem in some SAT, SMT or Prolog tool and report back about it.
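To make that concrete, here is a toy sketch of the "formalise and report back" step for the recipe constraints above, using the Z3 SMT solver's Python bindings (the proposed values are made up; nothing here comes from an actual LLM pipeline):

    # Toy sketch: check an LLM-proposed recipe against hard constraints with the
    # Z3 SMT solver (pip install z3-solver). The proposed values are made up.
    from z3 import Int, Solver, sat

    num_ingredients = Int("num_ingredients")
    prep_minutes = Int("prep_minutes")

    s = Solver()
    # Constraints from the example prompt: at most 10 ingredients, under 30 minutes.
    s.add(num_ingredients <= 10, prep_minutes < 30)

    # Values extracted (however you parse them) from the LLM's proposed recipe.
    proposed = {"num_ingredients": 9, "prep_minutes": 45}
    s.add(num_ingredients == proposed["num_ingredients"],
          prep_minutes == proposed["prep_minutes"])

    if s.check() == sat:
        print("recipe satisfies the constraints")
    else:
        print("constraint violated -- send the recipe back to the LLM for revision")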
hdra 23 hours ago [-]
I've been trying to stop the coding assistants from making git commits on their own and nothing has been working.
zmmmmm 22 hours ago [-]
hah - I'm the opposite: I want everything done by the AI to be a discrete, clear commit so there is no human/AI entanglement. If you want to squash it later that's fine, but you should have a record of what the AI did. This is Aider's default mode and it's one reason I keep using it.
vitaflo 10 hours ago [-]
It’s the first thing I turn off in Aider.
algorias 22 hours ago [-]
run them in a VM that doesn't have git installed. Sandboxing these things is a good idea anyways.
godelski 21 hours ago [-]

  > Sandboxing these things is a good idea anyways.
Honestly, one thing I don't understand is why agents aren't organized with unique user or group permissions. Like if we're going to be lazy and not make a container for them then why the fuck are we not doing basic security things like permission handling.

Like we want to act like these programs are identical to a person on a system but at the same time we're not treating them like we would another person on the system? Give me a fucking claude user and/or group. If I want to remove `git` or `rm` from that user, great! Also makes giving directory access a lot easier. Don't have to just trust that the program isn't going to go fuck with some other directory

inopinatus 16 hours ago [-]
The agents are being prompted to vibe-code themselves by a post-Docker generation raised on node and systemd. So of course they emit an ad-hoc, informally-specified, bug-ridden, slow reimplementation of things the OS was already capable of.
apetresc 21 hours ago [-]
What's stopping you from `su claude`?
godelski 20 hours ago [-]
I think there's some misunderstanding...

What's literally stopping me is

  su: user claude does not exist or the user entry does not contain all the required fields
Clearly you're not asking that...

But if your question is more "what's stopping you from creating a user named claude, installing claude to that user account, and writing a program so that user godelski can message user claude and watch all of user claude's actions, and all that jazz" then... well... technically nothing.

But if that's your question, then I don't understand what you thought my comment said.

immibis 10 hours ago [-]
Probably because Linux doesn't really have a good model for ad-hoc permission restrictions. It has enough bits to make a Docker container out of, but that's a full new system. You can't really restrict a subprocess to only write files under this directory.
newsoftheday 5 hours ago [-]
For plain Linux, chmod, chmod's sticky bit and setfacl provide extensive ad hoc permission restrictions. Your comment is 4 hours old; I'm surprised I'm the first person to help correct its inaccuracy.
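For anyone curious, a minimal sketch of that approach, assuming a dedicated "claude" user already exists and /home/me/project is the only tree the agent should write to (both names are illustrative; this needs root privileges):

    # Hypothetical sketch: confine a dedicated agent user's write access to one
    # directory tree using plain chmod/setfacl, driven from Python.
    import subprocess

    AGENT_USER = "claude"          # assumed to already exist (e.g. via useradd)
    PROJECT = "/home/me/project"   # the only tree the agent may write to

    def run(cmd):
        # Echo and execute a command, raising if it fails.
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Grant the agent user read/write (and directory traversal) on the tree.
    run(["setfacl", "-R", "-m", f"u:{AGENT_USER}:rwX", PROJECT])
    # Default ACL on the top directory so newly created files inherit the grant
    # (nested directories would need the same treatment).
    run(["setfacl", "-d", "-m", f"u:{AGENT_USER}:rwX", PROJECT])
    # Sticky bit: the agent cannot delete or rename files it does not own.
    run(["chmod", "+t", PROJECT])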
zmmmmm 22 hours ago [-]
but then they can't open your browser to administer your account.

What kind of agentic developer are you?

Aurornis 6 hours ago [-]
Which coding assistant are you using?

I'm a mild user at best, but I've never once seen the various tools I've used try to make a git commit on their own. I'm curious which tool you're using that's doing that.

manwds 22 hours ago [-]
Why not use something like Amp Code, which doesn't do that? People seem to rage at CC or similar tools, but Amp Code doesn't go making random commits or dropping databases.
hdra 19 hours ago [-]
just because I haven't gotten around to trying it out, really.

but what is it about Amp Code that makes it immune from doing that? From what I can tell, it's another CLI tool-calling client to an LLM? So I'd expect it to be subject to the nondeterministic nature of the LLM calling a tool I don't want it to call, just like any other, no?

AstroBen 22 hours ago [-]
Are you using aider? There's a setting to turn that off
dust-jacket 8 hours ago [-]
require commits to be signed.
SoftTalker 23 hours ago [-]
Don't give them a credential/permission that allows it?
godelski 21 hours ago [-]
Typically agents are not operating as a distinct user. So they have the same permissions, and thus credentials, as the user operating them.

Don't get me wrong, I find this framework idiotic and personally I find it crazy that it is done this way, but I didn't write Claude Code/Antigravity/Copilot/etc

AlexandrB 23 hours ago [-]
Making a git commit typically doesn't require any special permissions or credentials since it's all local to the machine. You could do something like running the agent as a different user and carefully setting ownership on the .git directory vs. the source code, but this is not very straightforward to set up, I suspect.
SoftTalker 18 hours ago [-]
IMO it should be well within the capabilities of anyone who calls himself an engineer.
neal_jones 23 hours ago [-]
Wasn't Cursor or someone using one of these horrifying sorts of prompts? Something about having to do a good job or they won't be paid, and then they won't be able to afford their mother's cancer treatment, and then she'll die?
godelski 22 hours ago [-]
How is this any different from the Apple "you're holding it wrong" argument? I mean, the critical reason for that kind of response being so out of touch is that the same people praise Apple for its intuitive nature. How can any reasonable and rational person (especially an engineer!) not see that these two beliefs are in direct opposition?

If "you're holding it wrong" then the tool is not universally intuitive. Sure, there'll always be some idiot trying to use a lightbulb to screw in a nail, but if your nail has threads on it and a notch on the head then it's not the user's fault for picking up a screwdriver rather than a hammer.

  > And these people have "engineer" on their resumes..
What scares me about ML is that many of these people have "research scientist" in their titles. As a researcher myself I'm constantly stunned at people not understanding something as basic as who has the burden of proof. Fuck off. You're the one saying we made a brain by putting lightning into a rock and shoving tons of data into it. There's so much about that that I'm wildly impressed by. But to call it a brain in the same way you say a human brain is a brain requires significant evidence. Extraordinary claims require extraordinary evidence. There's some incredible evidence, but an incredible lack of scrutiny over whether that is actually evidence for something else.
CjHuber 23 hours ago [-]
I'd say such hacks don't make you an engineer, but they are definitely part of engineering anything that has to do with LLMs. With overly long system prompts/agents.md files not working well, it definitely makes sense to optimize the existing prompt with minimal additions. And if swear words, screaming, shaming or tipping works, well, that's the most token-efficient optimization of a brief, well-written prompt.

Also, of course, current agents already have the possibility to run endlessly if they are well instructed; steering them to avoid reward hacking in the long term definitely IS engineering.

Or how about telling them they are working in an orphanage in Yemen and it's struggling for money, but luckily they've got an MIT degree and now they are programming to raise money. But their supervisor is a psychopath who doesn't like their effort and wants children to die, so work has to be done as diligently as possible and each step has to be viewed through the lens that their supervisor might find a reason to forbid programming.

Look, as absurd as it sounds, a variant of that scenario works extremely well for me. Just because it's plain language doesn't mean it can't be engineering; at least I'm of the opinion that it definitely is if it has an impact on what the possible use cases are

AstroBen 24 hours ago [-]
> cat AGENTS.md

WRITE AMAZING INCREDIBLE VERY GOOD CODE OR ILL EAT YOUR DAD

..yeah I've heard the "threaten it and it'll write better code" one too

CjHuber 23 hours ago [-]
I know you're joking, but to contribute something constructive here: most models now have guardrails against being threatened. So if you threaten them, it would be with something out of your control, like "… or the already depressed code-reviewing staff member might kill himself and his wife. We did everything in our control to take care of him, but do the best on your part to avoid the worst case"
nemomarx 21 hours ago [-]
How do those guardrails work? Does the system notice you doing it and not put that in the context, or do they just have something in the system prompt?
CjHuber 20 hours ago [-]
I suppose it's the latter + maybe some fine-tuning. It's definitely not like DeepSeek, where the answer of the model gets replaced when you are talking about something uncomfortable for China
Applejinx 10 hours ago [-]
Works on human subordinates too, kinda, if you don't mind the externalities…
citizenpaul 22 hours ago [-]
>makes the bot follow orders with greater precision.

Gemini will ignore any directions to never reference or use YouTube videos, no matter how many ways you tell it not to. It may remove them if you ask, though.

rabf 20 hours ago [-]
Positive reinforcement works better than negative reinforcement. If you read the prompt guidance from the companies themselves in their developer documentation, it often makes this point. It is more effective to tell them what to do rather than what not to do.
sally_glance 14 hours ago [-]
This matches my experience. You mostly want to not even mention negative things because if you write something like "don't duplicate existing functionality" you now have "duplicate" in the context...

What works for me is having a second agent or session to review the changes with the reversed constraint, i.e. "check if any of these changes duplicate existing functionality". Not ideal because now everything needs multiple steps or subagents, but I have a hunch that this is one of the deeper technical limitations of current LLM architecture.

citizenpaul 53 minutes ago [-]
Probably not related but it reminds me of a book I read where wizards had Additive and Subtractive magic but not always both. The author clearly eventually gave up on trying to come up with creative ways to always add something for solutions after the gimmick wore off and it never comes up again in the book.

Perhaps there is a lesson here.

nomel 20 hours ago [-]
Could you describe what this looks like in practice? Say I don't want it to use a certain concept or function. What would "positive reinforcement" look like to exclude something?
oxguy3 20 hours ago [-]
Instead of saying "don't use libxyz", say "use only native functions". Instead of "don't use recursion", say "only use loops for iteration".
nomel 17 hours ago [-]
This doesn't really answer my question, which was more about specific exclusions.

Both of the answers show the same problem: if you limit your prompts to positive reinforcement, you're only allowed to "include" regions of a "solution space", which can only constrain the LLM to those small regions. With negative reinforcement, you just cut out a bit of the solution space, leaving the rest available. If you don't already know the exact answer, then leaving the LLM free to use solutions that you may not even be aware of seems like it would always be better.

Specifically:

"use only native functions" for "don't use libxyz" isn't really different from "rewrite libxyz since you aren't allowed to use any alternative library". I think this may be a bad example since it massively constrains the LLM, preventing it from using an alternative library that you're not aware of.

"only use loops for iteration" for "don't use recursion" is reasonable, but I think this falls into the category of "you already know the answer". For example, say you just wanted to avoid a single function for whatever reason (maybe it has a known bug or something); the only way to do this "positively" would be to already know the function to use: "use function x"!

Maybe I misunderstand.

bdangubic 20 hours ago [-]
I 100% stopped telling them what not to do. I think even if “AGI” is reached telling them “don’t” won’t work
nomel 17 hours ago [-]
I have the most success when I provide good context, as in what I'm trying to achieve, in the most high level way possible, then guide things from there. In other words, avoid XY problems [1].

[1] https://xyproblem.info

DANmode 14 hours ago [-]
Yes, using tactics like front-loading important directives,

and emphasizing extra important concepts,

things that should be double or even triple checked for correctness because of the expected intricacy,

make sense for human engineers as well as “AI” agents.

soulofmischief 23 hours ago [-]
Except that is demonstrably true.

Two things can be true at the same time: I get value and a measurable performance boost from LLMs, and their output can be so stupid/stubborn sometimes that I want to throw my computer out the window.

I don't see what is new, programming has always been like this for me.

llmslave2 24 hours ago [-]
"don't make mistakes" LMAO
mapontosevenths 9 hours ago [-]
There's no secret IMO. It's actually really simple to get good results. You just expect the same things from the LLM that you would from a Junior. Use an MD file to force it to:

1) Include good comments in whatever style you prefer, document everything it's doing as it goes and keep the docs up to date, and include configurable logging.

2) Make it write and actually execute unit tests for everything before it's allowed to commit anything, again through the md file.

3) Ensure it learns from its mistakes: any time it screws up, tell it to add a rule to its own MD file reminding it not to ever repeat that mistake again. Over time the MD file gets large, but the error rate plummets.

4) This is where it drifts from being treated as a standard Junior. YOU must manually verify that the unit tests are testing for the right thing. I usually add a rule to the MD file telling it not to touch them after I'm happy with them, but even then you must also check that the agent didn't change them the first time it hit a bug. Modern LLMs are now worse at this for some reason. Probably because they're getting smart enough to cheat.

If you do these basic things you'll get good results almost every time.
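To give a flavour, a made-up excerpt of such an MD file might look like this (the specific rules are illustrative, not a real project's file):

    # Project rules (illustrative excerpt)
    - Comment code in the agreed style; keep docs/ and the logging config up to date.
    - Write unit tests for every change and run the full suite before any commit.
    - Never modify tests that are marked as reviewed; ask first.
    ## Lessons learned (append one line every time you make a mistake)
    - Do not assume the dev database is disposable; always ask before dropping it.
    - Store all timestamps in UTC.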

ben_w 8 hours ago [-]
> This is where it drifts from being treated as a standard Junior. YOU must manually verify that the unit tests are testing for the right thing.

You had better juniors than me. What unit tests? :P

butlike 8 hours ago [-]
The MD file is a spec sheet, so now you're expecting every warm body to be a Sr. Engineer, but where do you start as a Junior warm body? Reviewing code, writing specs, reviewing implementation details...that's all Sr. level stuff
Wowfunhappy 24 hours ago [-]
It's impossible to prove in either direction. AI benchmarks suck.

Personally, I like using Claude (for the things I'm able to make it do, and not for the things I can't), and I don't really care whether anyone else does.

AstroBen 24 hours ago [-]
I'd just like to see a live coding session from one of these 10x AI devs

Like genuinely. I want to get stuff done 10x as fast too

Kerrick 5 hours ago [-]
My wife used to be a professional streamer, so I know how distracting it can be to try and entertain an audience. So when I attempted to become one of these 10x AI devs over my Christmas vacation I did not live stream. But I did make a bunch of atomic commits and push them up to sourcehut. Perhaps you'll find that helpful?

Just Christmas Vacation (12-18h days): https://git.sr.ht/~kerrick/ratatui_ruby/log/v0.8.0

Latest (slowed down by job & real life): https://git.sr.ht/~kerrick/ratatui_ruby/log/trunk and https://git.sr.ht/~kerrick/ratatui_ruby-wiki/log/wiki and https://git.sr.ht/~kerrick/ratatui_ruby-tea/log/trunk

lordnacho 23 hours ago [-]
But the benefit might not be speed; it might be economy of attention.

I can code with Claude when my mind isn't fresh. That adds several hours of time I can schedule, where previously I had to do fiddly things when I was fresh.

What I can attest is that I used to have a backlog of things I wanted to fix, but hadn't gotten around to. That's now gone, and it vanished a lot faster than the half a year I had thought it would take.

llmslave2 23 hours ago [-]
Doesn't that mean you're less likely to catch bugs and other issues that the AI spits out?
lordnacho 12 hours ago [-]
No, you are spending less time on fixing little things, so you have more time on things like making sure all the potential errors are checked.
duskdozer 10 hours ago [-]
Not a problem! Just ask the AI to verify its output and make test cases!
gregoryl 20 hours ago [-]
nah, you rely on your coworkers to review your slop!
jennyholzer4 10 hours ago [-]
[flagged]
mpyne 22 hours ago [-]
Code you never ship doesn't have bugs by definition, but never shipping is usually a worse state to be in.
ponector 21 hours ago [-]
I'm sure people from Knight Capital don't think so.
mpyne 20 hours ago [-]
Even there, they made a lot of money before they went bust. Like, if you want an example you'd be better off picking Therac-25, as ancient an example as it is.
LinXitoW 9 hours ago [-]
I don't think any serious dev has claimed 10x as a general statement. Obviously, no true scotsman and all that, so even my statement about makers of anecdotal statements is anecdotal.

Even as a slight fan, I'd never claim more than 10-20% altogether. I could maybe see 5x for some specific typing-heavy usages, like adding basic CRUD for a basic entity to an already existing Spring app.

godelski 21 hours ago [-]

  > I'd just like to see a live coding session from one of these 10x AI devs
I'd also like to see how it compares to their coding without AI.

I mean I really need to understand what the "x" is in 10x. If their x is <0.1 then who gives a shit. But if their x is >2 then holy fuck I want to know.

Who doesn't want to be faster? But it's not like x is the same for everybody.

llmslave2 15 hours ago [-]
Yeah this is the key point. Part of me wonders if it's just 0.1x devs somehow reaching 1.0x productivity...
bonesss 15 hours ago [-]
Also, there are terrible codebases and orgs out there… the amount of churn a bad JavaScript solution with eight frontend frameworks might necessitate is very different from how tight systems code works.
nosianu 12 hours ago [-]
This has nothing to do with JS! I wish that idea would die.

https://news.ycombinator.com/item?id=18442941

It's not just about them (link, Oracle), there is terrible code all over the place. Games, business software, everything.

It has nothing to do with the language! Anyone who claims that may be part of the problem, since they don't understand the problem and concentrate on superficial things.

Also, what looks terrible may not be so. I once had to work on an in-house JS app (for internal cost reporting and control). It used two GUI frameworks - because they had started switching to another one, but then stopped the transition. Sounds bad, yes? But, I worked on the code of the company I linked above, and that "terrible" JS app was easy mode all the way!

Even if it used two GUI frameworks at once, understanding the code, adding new features, debugging, everything was still very easy and doable with just half a brain active. I never had to ask my predecessor anything either, everything was clear with one look at the code. Because everything was well isolated and modular, among other things. Making changes did not affect other places in unexpected ways (as is common in biology).

I found some enlightenment - what seems to be very bad at first glance may not actually matter nearly as much as deeper things.

Bridged7756 8 hours ago [-]
Speaking from ignorance, from ego, or both? There are only three major players: React, Vue, and Angular. Angular is batteries-included; the other two have their own lib ecosystems, and if those fall short you can easily wrap regular JS libs. That's about it. The JS ecosystem sees many newcomers, so it's only natural that some codebases are written poorly or that the FOTM mentality gains a lot of steam, against proper engineering principles.

Anecdotally, the worst code I've ever seen was in a PHP codebase, which to me is the predecessor of JavaScript in this regard: lots of junior programmers maintaining legacy (or writing greenfield) systems because cheap businesses are cheap. Anyway, files thousands of lines long, with broken indentation and newlines, JS and CSS interspersed here and there. Truly madness, but that's another story. The point is JavaScript is JavaScript, and people in other fields, mainly backend, act conceited and talk about JS as if it were the devil, when the likes of C++ and Java aren't exactly known for pretty codebases either.

Bridged7756 8 hours ago [-]
I'm really dubious of such claims. Even if true, I think they're not seeing the whole picture. Sure, I could churn out code 10x as fast, but I still have to review it. I still have to think of the implementation. I still have to think of the test cases and write them. Now, adding the prerequisites for LLMs, I have to word things in a way the AI can understand, which is extra mental load. I sometimes have to review code multiple times if it gets something wrong, and I have to re-generate, make corrections, or sometimes end up fixing entire sections it generated when I decide it just won't get the task right.

Overall, while time is saved on typing and (sometimes) researching dependency docs, I still face the same cognitive load as ever, if not more, due to having extra code to review and having to think about prompting. I'm still limited by the same thing at the end of the day: my mental energy. I can write the code myself and it's, if anything, only a bit slower. I still need to know my dependencies, and I still need to know my codebase and all its quirks, even if the AI generates code correctly. Overall, the net complexity of my codebase is the same, and I don't buy the crap, also because I've never heard stories about reducing complexity (refactoring), only about generating code and patching codebases up with testing and comments/docs (bad practice imo; the shallow docs generated are unlikely to say anything more than what the code already makes evident). Anyways, I'm not a believer; I only use LLMs for scaffolding and rote tasks.
zo1 11 hours ago [-]
I'm a "backend" dev, so you could say that I am very very unfamiliar, have mostly-basic and high-level knowledge of frontend development. Getting this thing to spit out screens and components and adjust them as I see fit has got to be some sort of super-power and definitely 20x'd my frontend development for hobby projects. Previous to this, my team was giving me wild "1 week" estimates to code simple CRUD screens (plus 1 week for "api integration") and those estimates always smelled funny to me.

Now that I've seen what the AI/agents can do, those estimates definitely reek, and the frontend "senior" JavaScript dev's days are numbered. Especially for CRUD screens, which, let's face it, make up most screens these days and should absolutely be churned out as if on an assembly line instead of being delicate, "hand crafted", precious works of art that allow 0.1x devs to waste our time because they are supposedly the only ones who know the ancient and arcane "npm install, npm etc, npm angular component create" spells.

Look at the recent Tailwind team layoffs: they're definitely seeing the impact of this, as are many team leads and managers across our industry. Especially "senior JavaScript dev"-heavy shops in the VC space, which many people are realizing they have an over-abundance of, because those devs bullshitted entire teams and companies into thinking simple CRUD screens take weeks to develop. It was like a giant cartel, with them all padding and confirming each other's estimates and essentially slow-devving their own screens to validate the ridiculous padding.

godelski 31 minutes ago [-]
It's difficult for me to make a good evaluation on this comment.

With the AI writing the UI, are you still getting the feedback loop where the UI informs your backend design and your backend design informs the UI design? If you don't have that feedback loop, I think you're becoming a worse backend designer. A good backend still needs to be frontend-focused. You don't just optimize the routines your profiler flags, you prioritize the routines that are used the most. You design routines that make things easier for people based on how they're using the frontend. And so on.

But the way I read your comment, there's no feedback loop here, and in my experience LLMs will just do exactly what you tell them to: ham-fisting a solution. If you need a mockup design or just a shitty version then yeah, that's probably fine. But I also don't see how that is 20x, since you could probably just "copy-paste from stack overflow", and I'd wager an LLM is really only giving you up to 2x there. If you're designing something actual people (customers) are going to use, though, then it sounds like you're very likely making bad interfaces and slowing down development. But it is really difficult to determine which is happening here.

I mean yeah, there's a lot of dumb coders everywhere and it's not a secret that coding bootcamps focus on front ends but I think you're over generalizing here.

Bridged7756 8 hours ago [-]
Your UIs are likely still ass. Pre-made websites/designs were always a thing, in fact, it's (at least to me) common to just copy the design of another place as "inspiration". When you have 0 knowledge of design everything looks the greatest, it's something you kind of have to get a feel for.

Frontend engineers do more than just churn out code. They still have to write proper tests using Cypress/Playwright, deal with performance, a11y/accessibility, component tests (if any), deal with frontend observability (more complex than backend, by virtue of the different clients and conditions the code runs on), deal with dependencies (in large places it's all in-house libraries or there are private repos to maintain), deal with CI/CD, etc. I'm probably missing more.

Tailwind CSS's layoffs were due to AI cannibalizing their business model by reducing traffic to the site.

And what makes you think the backend is safe? As if churning out endpoints and services, or whatever gospel some thought leader preaches, would be any harder for an AI to do. The frontend has one core benefit: it's pretty varied, and it's an ever-moving field, mostly due to changes in browsers and also due to the "JS culture". Frontend code from 5 years ago is outdated, but Spring code from 5 years ago is still valid.

tjr 7 hours ago [-]
My time spent with Javascript applications has thus far been pretty brief (working on some aircraft cabin interfaces for a while), but a lot of the time ended up being on testing on numerous different types and sizes of devices, and making tiny tweaks to the CSS to account for as many devices as possible.

This has been a while; perhaps the latest frameworks account for all of that better than they used to. But at that time, I could absolutely see budgeting several days to do what seems like a few hours of work, because of all of the testing and revision.

vitaflo 10 hours ago [-]
One of the more ignorant comments I’ve read on HN.
politician 4 hours ago [-]
Other people are dumping on you, but I think you're getting at where the real 20x speedup exists. People who are 'senior' in one type of programming may be 'junior' in other areas -- LLMs can and do bridge those gaps for folks trying to work outside their expertise. This effect is real.

If you're an expert in a field, LLMs might just provide a 2-3x speedup as boilerplate generators.

22 hours ago [-]
topocite 11 hours ago [-]
Obviously, there has to be huge variability between people based on initial starting conditions.

It is like if someone says they are losing weight eating 2500 calories a day and someone else says that is impossible because they started eating 2500 calories and gained weight.

Neither are making anything up or being untruthful.

What is strange to me is that smart people can't see something this obvious.

tonyedgecombe 13 hours ago [-]
> I want to get stuff done 10x as fast too

I don’t. I mean I like being productive but by doing the right thing rather than churning out ten times as much code.

neal_jones 23 hours ago [-]
I’d really like to see a 10x ai dev vs a 10x analog dev
rootnod3 23 hours ago [-]
And an added "6 months" later to see which delivered result didn't blow up in their face down the road.
lifetimerubyist 22 hours ago [-]
Theo, the YouTuber who also runs T3.chat, always makes videos about how great coding agents are, then he'll try to do something on stream, it ALWAYS fails massively, and he's always like "well it wasn't like this when I did it earlier."

Sure buddy.

llmslave2 22 hours ago [-]
Theo is the type of programmer where you don't care when he boos you, because you know what makes him cheer.
mbesto 11 hours ago [-]
> AI benchmarks suck.

Not only do they suck, but it's essentially an impossible task since there is no frame of reference for what "good code" looks like.

9h1L0509h3R 17 hours ago [-]
[dead]
zmmmmm 22 hours ago [-]
Many of them are also burning through absurd amounts of tokens - like running 10 Claudes at once and leaving them running continuously to "brute force" solutions out. It may be possible, but it's not really an acceptable workflow for serious development.
nomel 20 hours ago [-]
> but it's not really an acceptable workflow for serious development.

At what cost do you see this as acceptable? For example, how many hours of saved human development time are worth one hour of salary spent on LLM tokens, funded by the developer? And then, what's acceptable if it's funded by the employer?

zmmmmm 19 hours ago [-]
I guess there are two main concerns I have with it.

One is technical: I don't believe that, when you are grinding out huge amounts of code with little to no supervision, you can claim to be exercising the appropriate amount of engineering oversight over what it is doing. Just as if a junior dev showed up having entirely re-engineered an application over the weekend and presented it back to me, I would probably reject it wholesale. My gut feeling is that this is creating huge longer-term problems with what is coming out of it.

The other is I'm concerned that a vast amount of the "cost" is externalised currently. Whatever you are paying for tokens quite likely bears no resemblance to the real cost. Either because the provider is subsidising it, or the environment is. I'm not at all against using LLMs to save work at a reasonable scale. But if it comes back to a single person increasing their productivity by grinding stupendous amounts of non-productive LLM output that is thrown away (you don't care if it sits there all day going around in circles if it eventually finds the right solution) - I think there's a moral responsibility to use the resources better.

bdangubic 20 hours ago [-]
we get $1,000/month budget, just about every dev uses it for 5 claude accounts
parliament32 5 hours ago [-]
They remind me so much of that group of people who insist the scammy magnetic bracelets[1] "balance their molecules" or something making them more efficient/balanced/productive/energetic/whatever. They are also impossible to argue with, because "I feel more X" is damn near impossible to disprove.

[1] https://en.wikipedia.org/wiki/Power_Balance , https://en.wikipedia.org/wiki/Hologram_bracelet , https://en.wikipedia.org/wiki/Ionized_jewelry

jstummbillig 17 hours ago [-]
We had the fabled 10x engineer long before, and independent of, agentic coding. Some people claim it's real, others claim it's not, with much the same conviction. If something that should be so clear cut is debatable, why would anyone now be able to produce a convincing, discussion-resolving argument for (or against) agentic coding? We don't even manage that for tabs vs. spaces.

The reason neither can be resolved in a forum like this is that coding output is hard to reason about, for various reasons, and people want it to be hard to reason about.

I would like to encourage people to think that the burden of proof always falls on themselves, to themselves. Managing to not be convinced in an online forum (regardless of topic or where you land on the issue) is not hard.

Bridged7756 8 hours ago [-]
I just saw nstummbillig shout racist remarks.
jennyholzer4 10 hours ago [-]
[flagged]
dude250711 24 hours ago [-]
Ah, the "then you are doing it wrong" defence.

Also, you have to learn it right now, because otherwise it will be too late and you will be outdated, even though it is improving very fast allegedly.

llmslave2 24 hours ago [-]
People say it takes at least 6 months to learn how to use LLMs effectively, while at the same time the field is changing rapidly, and at the same time agents were useless until Opus 4.5.

Which is it lol.

wakawaka28 22 hours ago [-]
I used it with practically zero preparation. If you've got a clue then it's fairly obvious what you need to do. You could focus on meta stuff like finding out what it is good or bad at, but that can be done along the way.
marcosdumay 23 hours ago [-]
TBF, there are lots of tools that work great but most people just can't use.

I personally can't use agentic coding, and I'm reasonably convinced the problem is not with me. But it's not something you can completely dismiss.

bodge5000 22 hours ago [-]
> Also, you have to learn it right now, because otherwise it will be too late and you will be outdated, even though it is improving very fast allegedly.

This in general is a really weird behaviour that I come across a lot, I can't really explain it. For example, I use Python quite a lot and really like it. There are plenty of people who don't like Python, and I might disagree with them, but I'm not gonna push them to use it ("or else..."), because why would I care? Meanwhile, I'm often told I MUST start using AI ("or else..."), manual programming is dead, etc... Often by people who aren't exactly saying it kindly, which kind of throws out the "I'm just saying it out of concern for you" argument.

andrekandre 21 hours ago [-]

  > I MUST start using AI ("or else...")
fear of missing out, and maybe also a bit of religious-esque fever...

tech is weird, we have so many hype-cycles: big-data, web3, nfts, blockchain (i once had an acquaintance who quit his job to study blockchain cause soon "everything will be built on it"), and now "ai"... all of them have some usefulness but it gets blown out of proportion imo

bonesss 15 hours ago [-]
Nerd circles are in no way immune to fashion, and often contain a strong orthodoxy (IMO driven by cognitive dissonance caused by the humbling complexity of the world).

Cargo cults, where people reflexively shout slogans and truisms, even when misapplied. Lots of people who’ve heard a pithy framing waiting for any excuse to hammer it into a conversation for self glorification. Not critical humble thinkers, per se.

Hype and trends appeal to young insecure men, it gives them a way to create identity and a sense of belonging. MS and Oracle and the rest are happy to feed into it (cert mills, default examples that assume huge running subscriptions), even as they get eaten up by it on occasion.

duskdozer 10 hours ago [-]
Yeah. It sounds like those pitches letting you in on the secret trick to tons of passive income.
jimbo808 24 hours ago [-]
That one's my favorite. You can't defend against it, it just shuts down the conversation. Odds are, you aren't doing it wrong. These people are usually suffering from Dunning-Kruger at best, or they're paid shills/bots at worst.
neal_jones 23 hours ago [-]
Best part of being dumb is thinking you’re smart. Best part of being smart is knowing you’re smart. Just don’t be in the iq range where you know you’re dumb.
tonyedgecombe 12 hours ago [-]
The smartest people I know are full of doubt.
Terr_ 23 hours ago [-]
If you had negative results using anything more than 3 days old, then it's your fault, your results mean nothing because they've improved since then. /s
munksbeer 14 hours ago [-]
> The burden of proof is 100% on anyone claiming the productivity gains

IMHO, I think this is just going to go away. I was up until recently using copilot in my IDE or the chat interface in my browser and I was severely underwhelmed. Gemini kept generating incorrect code which when pasted didn't compile, and the process was just painful and a brake on productivity.

Recently I started using the Claude Code CLI on their latest Opus model. The difference is astounding. I can give you more details on how I am working with this if you like, but for the moment, my main point is that the Claude Code CLI, with access to run the tests, run the apps, edit files, etc., has made me pretty excited.

And my opinion has now changed because "this is the worst it will be" and I'm already finding it useful.

I think within 5 years, we won't even be having this discussion. The use of coding agents will be so prolific and obviously beneficial that the debate will just go away.

(all in my humble opinion)

vitaflo 10 hours ago [-]
So will all the tech jobs in the US. When it gets that good you can farm it out to some other country for much cheaper.
munksbeer 9 hours ago [-]
I'm not sure. Possibly?

I'm still doing most of my coding by hand, because I haven't yet fully committed. But even for the stuff I'm doing with Claude, I'm still doing a lot of the thought work and steering it toward better designs. It requires an experienced dev to recognize the better designs, just as it always has.

Maybe this eventually changes and the coding agents get as good at that part, I don't know this, but I do know it is an enabler to me at the moment, and I have 20+ years of experience writing C++ and then Java in the finance industry.

I'm still new to Claude, and I am sure I'm going to run up against some walls soon on the more complicated stuff (haven't tried that yet), but everyone ends up working on tasks they don't find that challenging, just lots of manual keypresses to get the code into the IDE. Claude so far is making that a better experience, for me at least.

(Example, plumbing in new message types on our bus and wiring in logic to handle it - not complicated, just sits on top of complicated stuff)

williamcotton 21 hours ago [-]
I mean, a DSL packed full of features, a full LSP, DAP for step debugging, profiling, etc.

https://github.com/williamcotton/webpipe

https://github.com/williamcotton/webpipe-lsp

https://github.com/williamcotton/webpipe-js

Take a look at my GitHub timeline for an idea of how little time this took for a solo dev!

Sure, there’s some tech debt but the overall architecture is pretty extensible and organized. And it’s an experiment. I’m having fun! I made my own language with all the tooling others have! I wrote my own blog in my own language!

One of us, one of us, one of us…

8 hours ago [-]
10 hours ago [-]
bdangubic 20 hours ago [-]
people claiming productivity gains do not have to prove anything to anyone. a few are trying to open the eyes of others but my guess is that will eventually stop. they will be among the few still left doing SWE work in the near future though :)
antihipocrat 15 hours ago [-]
Responses are always to check your prompts, and ensure you are using frontier models - along with a warning about how you will quickly be made redundant if you don't lift your game.

AI is generally useful, and very useful for certain tasks. It's also not initiating the singularity.

BatteryMountain 17 hours ago [-]
Some fuel for the fire: over the last two months, mine has become way better, frequently one-shotting tasks. I do spend a lot of time in planning mode to flesh out proper plans. I don't know what others are doing that makes them so sceptical, but from my perspective, once I figured it out, it really is a massive productivity boost with minimal quality issues. I work on a brownfield project with about 1M LoC, fairly messy, mostly C# (so the strong typing & strict compiler are a massive boon).

My work flow: Planning mode (iterations), execute plan, audit changes & prove to me the code is correct, debug runs + log ingestion to further prove it, human test, human review, commit, deploy. Iterate a couple of times if needed. I typically do around three of these in parallel to not overload my brain. I have done 6 in the past but then it hits me really hard (context switch whiplash) and I start making mistakes and missing things the tool does wrong.

To the ones saying it is not working well for them, why don't you show and tell? I cannot believe our experiences are so fundamentally different; I don't have some secret sauce, but it did take a couple of months to figure out how to best manipulate the tool to get what I want out of it. Maybe these people just need to open their minds and let go of the arrogance & resistance to new tools.

wtetzner 14 hours ago [-]
> My work flow: Planning mode (iterations), execute plan, audit changes & prove to me the code is correct, debug runs + log ingestion to further prove it, human test, human review, commit, deploy. Iterate a couple of times if needed.

I'm genuinely curious if this is actually more productive than a non-AI workflow, or if it just feels more productive because you're not writing the code.

steveklabnik 5 hours ago [-]
One reason why it can be more productive is that it can be asynchronous. I can have Claude churning away on something while I do something else on a different branch. Even if the AI takes as long as a human to do the task, we're doing a parallelism that's not possible with just one person.
nosianu 11 hours ago [-]
Here is a short example from my daily life: a D96A INVOIC EDI message containing multiple invoices, transformed into an Excel file.

I used the ChatGPT web interface for this one-off task.

Input: A D96A INVOIC text message. Here is a short example of what those look like (the one I had was much larger, with multiple invoices and tens of thousands of items): https://developer.kramp.com/edi-edifact-d96a-invoic

The result is not code but a transformed file. This exact scenario can easily be turned into code, though, by changing the request from "do this" to "provide a [Python|whatever] script to do this". Internally the AI produces code, runs it, and gives you the result. You actually make it do less work if you just ask for the script and don't have it run anything.

The prompts were only what I've listed below. I had to ask for some corrections because it made a few mistakes interpreting the codes.

> (message uploaded as file)

> Analyze this D.96A message

> This message contains more than one invoice, you only parsed the first one

(it finds all 27 now)

> The invoice amount is in segment "MOA+77". See https://www.publikationen.gs1-germany.de/Complete/ae_schuhe/... for a list of MOA codes (German - this is a German company invoice).

> Invoice 19 is a "credit note", code BGM+381. See https://www.gs1.org/sites/default/files/docs/eancom/ean02s4/... for a list of BGM codes, column "Description" in the row under "C002 DOCUMENT/MESSAGE NAME"

> Generate Excel report

> No. Go back and generate a detailed Excel report with all details including the line items, with each invoice in a separate sheet.

> Create a variant: All 27 invoices in one sheet, with an additional column for the invoice or credit note number

> Add a second sheet with a table with summary data for each invoice, including all MOA codes for each invoice as a separate column

The result was an Excel file with an invoice per worksheet, and metadata in an additional sheet.

Similarly, by doing what I wrote above, but telling the AI at the start not to do anything itself and to instead give me a Python script, I got a several-hundred-line Python script that processed my collected DESADV EDI messages in XML format ("Process a folder of DESADV XML files and generate an Excel report.")

If I had had to actually write that code myself, it would have taken me all day and maybe more, mostly because I would have had to research a lot of things first; I'm not exactly parsing EDI messages in various formats every day, after all. For this, though, I wrote a pretty lengthy and very detailed request, 44 long lines of text, detailing exactly which items with which paths I wanted from the XML, and how to name and type them in the resulting Excel file.

ChatGPT Query: https://pastebin.com/1uyzgicx

Result (Python script): https://pastebin.com/rTNJ1p0c
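
For anyone curious what the core of such a script boils down to, here is a minimal hand-written sketch of the EDIFACT-to-Excel idea (heavily simplified: it ignores the UNA header and release characters, only pulls the BGM document code/number and the MOA+77 amount, and writes a single summary sheet with openpyxl):

  # Sketch: split a D96A INVOIC interchange into invoices and dump
  # the invoice number, BGM type code and MOA+77 amount into Excel.
  from openpyxl import Workbook

  def parse_invoices(edifact_text):
      invoices = []
      current = None
      # Segments end with "'", elements are separated by "+", components by ":".
      for seg in edifact_text.split("'"):
          parts = seg.strip().split("+")
          tag = parts[0]
          if tag == "BGM" and len(parts) > 2:
              # A BGM segment starts a new document (380 = invoice, 381 = credit note).
              current = {"type_code": parts[1].split(":")[0], "number": parts[2]}
              invoices.append(current)
          elif tag == "MOA" and current is not None and len(parts) > 1:
              comps = parts[1].split(":")
              if comps[0] == "77" and len(comps) > 1:  # 77 = invoice amount
                  current["amount"] = comps[1]
      return invoices

  def write_report(invoices, path="invoices.xlsx"):
      wb = Workbook()
      ws = wb.active
      ws.title = "Summary"
      ws.append(["Invoice number", "BGM type code", "Amount (MOA+77)"])
      for inv in invoices:
          ws.append([inv.get("number"), inv.get("type_code"), inv.get("amount")])
      wb.save(path)

The generated scripts obviously do much more (line items, one sheet per invoice), but the shape of the problem really is that small, which is why handing it to the AI works so well here.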

llmslave2 15 hours ago [-]
> To the ones saying it is not working well for them, why don't you show and tell?

Sure, here you go:

noufalibrahim 14 hours ago [-]
As a die-hard old-schooler, I agree. I wasn't particularly impressed by Copilot, though it did do a few interesting tricks.

Aider was something I liked and used quite heavily (with Sonnet). Claude Code has genuinely been useful. I've coded up things which I'm sure I could do myself if I had the time "on the side" and used them in "production". These were mostly personal tools but I do use them on a daily basis and they are useful. The last big piece of work was refactoring a 4000-line program, which I wrote piece by piece over several weeks, into something with proper packages and structures. There were one or two hiccups but I have it working. Took a day and approximately $25.

spreiti 15 hours ago [-]
I have basically the same workflow. Planning mode has been the game changer for me. One thing I always wonder is how do people work in parallel? Do you work in different modules? Or maybe you split it between frontend and backend? Would love to hear your experience.
9rx 16 hours ago [-]
> why don't you show and tell?

How do you suggest? At a high level, the biggest problems are the high latency and the context switches. It is easy enough to get the AI to do one thing well. But because it takes so long, the only way to derive any real benefit is to have many agents doing many things at the same time. I have not yet figured out how to effectively switch my attention between them. But I wouldn't have any idea how to turn that into a show and tell.

hdjrudni 16 hours ago [-]
I don't know how y'all are letting the AIs run off with these long tasks at all.

The couple times I even tried that, the AI produced something that looked OK at first and kinda sorta ran but it quickly became a spaghetti I didn't understand. You have to keep such a short leash on it and carefully review every single line of code and understand thoroughly everything that it did. Why would I want to let that run for hours and then spend hours more debugging it or cleaning it up?

I use AI for small tasks or to finish my half-written code, or to translate code from one language to another, or to brainstorm different ways of approaching a problem when I have some idea but feel there's something better way to do it.

Or I let it take a crack when I have some concrete failing test or build. Feeding that into an LLM loop is one of my favorite things, because it can just keep trying until it passes, and even if it comes up with something suboptimal you at least have something that compiles which you can tidy up a bit.
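
A minimal sketch of that loop, for the curious (this assumes Claude Code's non-interactive "claude -p <prompt>" mode with permissions already set up so it can edit files, plus pytest; swap in whatever agent CLI and test runner you actually use):

  # Sketch of a "keep retrying until the tests pass" loop around an agent CLI.
  import subprocess

  MAX_ATTEMPTS = 5

  for attempt in range(MAX_ATTEMPTS):
      tests = subprocess.run(["pytest", "-x", "-q"], capture_output=True, text=True)
      if tests.returncode == 0:
          print(f"Tests green after {attempt} fix attempt(s)")
          break
      # Keep the tail of the output, which is where the traceback usually lives.
      failure = (tests.stdout + tests.stderr)[-4000:]
      prompt = (
          "The test suite is failing. Make the smallest change that fixes it, "
          "and do not modify the tests themselves.\n\n" + failure
      )
      subprocess.run(["claude", "-p", prompt])  # the agent edits files in the working tree
  else:
      print("Gave up; review the last failure by hand")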

Sometimes I'll have two sessions going but they're like 5-10 minute tasks. Long enough that I don't want to twiddle my thumbs for that long but small enough that I can rein it in.

wickedsight 16 hours ago [-]
I find it interesting you're all writing 'the AI' as if it's a singular thing. There are a myriad of ways to code with a myriad of AIs, and none of them are identical. I use a Qwen 3 32B with Cline in VSCode for work, since I can't use cloud-based AI. For personal projects, I use Codex in the cloud. I can let Codex perform some pretty complicated tasks and get something usable, while I can ask Qwen something basic and it ends up in a loop, delivering nothing useful.

Then there are the different tasks people might ask of it. Building a fully novel idea vs. CRUD for a family planner might have different outcomes.

It would be useful if we could have more specific discussions here, where we specify the tools and the tasks it either does or does not work for.

DANmode 14 hours ago [-]
This.

If you’re not treating these tools like rockstar junior developers, then you’re “holding it wrong”.

wtetzner 14 hours ago [-]
The problem I have with this take is that I'm very skeptical that guiding several junior developers would be more productive than just doing the work myself.

With real junior developers you get the benefit of helping develop them into senior developers, but you really don't get that with AI.

DANmode 14 hours ago [-]
So then do your thing, while it’s doing scaffolding of your next thing.

Also: are you sure?

There’s as many of them as you’re talented enough to asynchronously instruct,

and you can tell them the boundaries within which to work (or not),

in order to avoid too little or too much being done for you to review and approve effectively.

jaccola 13 hours ago [-]
- This has been going on for well over a year now.

- They always write relatively long, zealous explainers of how productive they are (including some replies to your comment).

These two points together make me think: why do they care so much about convincing me? Why don't they just link me to the amazing thing they made? That would be pretty convincing!

Are they being paid or otherwise incentivised to make these hyperbolic claims? To be fair they don't often look like vanilla LLM output but they do all have the same structure/patter to them.

drogus 9 hours ago [-]
I think it's a mix of people being genuinely hyped and wishing this is the future. For me, productivity gains are mostly in areas where I don't have expertise (the downside, of course, being that I don't learn much if I let AI do the work) or when I know it's a throwaway thing and I absolutely don't care about the quality.

For example, I'm reading a series of books to my daughter at bedtime, and one of them doesn't have a Polish translation, and the Polish publisher stopped working with the author. I vibe-coded an app that extracts an epub, translates each chapter, and packages it back into an epub, with a few features like saving the translations in SQLite so the run can be stopped and resumed, the ability to edit translations, custom instructions, etc. It's only ~1000 lines of Rust code, but Claude generated it while I was making dinner (I just checked progress and prompted next steps every few minutes). I can guarantee it would have taken me at least an evening of coding, probably debugging problems along the way, to make it work. So while I know it still falls short in certain scenarios (novel code in a niche technology, very big projects, etc.), it is kind of a game changer in others. It lets me build small tools that I just wouldn't have time for otherwise.

So I guess what I'm saying is, even with all the limitations, I kinda understand the hype. That said, I think some people may indeed exaggerate LLMs' capabilities, unless they actually know some secret recipe to make them do all those awesome hyped things (but then I would love to see that).

Pxtl 7 hours ago [-]
Hilariously, the only impressive thing I've ever heard of being made with AI was Yegge's "GasTown", which is a Kubernetes-like orchestrator... for AI agents. And half of it seemed to be a workaround for "the agents keep stopping, so I need another agent to monitor another agent to monitor another agent to keep them on task".
evilduck 7 hours ago [-]
> why do they care so much to convince me;

Someone might share something for a specific audience which doesn't include you. Not everything shared is required to be persuasive. Take it or leave it.

> why don't they just link me to the amazing thing they made, that would be pretty convincing?!

99.99% of the things I've created professionally don't belong to me and I have no desire or incentives to create or deal with owning open source projects on my own time. Honestly, most things I've done with AI aren't amazing either, it's usually boring routine tasking, they're just done more cost efficiently.

If you flip the script, it's just as damning. "Hey, here are some general approaches that are working well for me, check it out" has been countered by AI skeptics for years now with "you're lying, I won't even try it, and you're also a bot or a paid shill". Look at basically every AI-related post and there's almost always someone ready to call BS within the first few minutes of it being posted.

jennyholzer4 10 hours ago [-]
[dead]
keeda 22 hours ago [-]
Actually, quite the opposite. It seems any positive comment about AI coding gets at least one response along the lines of "Oh yeah, show me proof" or "Where is the deluge of vibe-coded apps?"

For my part, I point out there are a significant number of studies showing clear productivity boosts in coding, but those threads typically devolve to "How can they prove anything when we don't even know how to measure developer productivity?" (The better studies address this question and tackle it with well-designed statistical methods such as randomized controlled trials.)

Also, there are some pretty large Github repos out there that are mostly vibe-coded. Like, Steve Yegge got to something like 350 thousand LoC in 6 weeks on Beads. I've not looked at it closely, but the commit history is there for anyone to see: https://github.com/steveyegge/beads/commits/main/

reppap 22 hours ago [-]
That seems like a lot more code than a tool like that should require.
keeda 19 hours ago [-]
It does, but I have no mental model of what would be required to efficiently coordinate a bunch of independently operating agents, so it's hard to make a judgement.

Also about half of it seems to be tests. It even has performance benchmarks, which are always a distant afterthought for anything other than infrastructure code in the hottest of loops! https://github.com/steveyegge/beads/blob/main/BENCHMARKS.md

This is one of the defining characteristics of vibe-coded projects: Extensive tests. That's what keeps the LLMs honest.

I had commented previously (https://news.ycombinator.com/item?id=45729826) that the logical conclusion of AI coding will look very weird to us and I guess this is one glimpse of it.

Ianjit 14 hours ago [-]
Please provide links to the studies, I am genuinely curious. I have been looking for data but most studies I find showing an uplift are just looking at LOC or PRs, which of course is nonsense.

Meta measured a 6-12% uplift in productivity from adopting agentic coding. That's paltry. A Stanford case study found that, after accounting for buggy code that needed to be re-worked, there may be no productivity uplift.

I haven't seen any study showing a genuine uplift after accounting for properly reviewing and fixing the AI generated code.

keeda 2 hours ago [-]
I mention a few here: https://news.ycombinator.com/item?id=45379452

> ... just looking at LOC or PRs, which of course is nonsense.

That's basically a variation of "How can they prove anything when we don't even know how to measure developer productivity?" ;-)

And the answer is the same: robust statistical methods! For instance, amongst other things they compare the same developers over time doing regular day-job tasks with the same quality control processes (review etc.) in place, before and after being allowed to use AI. It's like an A/B test. Spreading across a large N and time duration accounts for a lot of the day-to-day variation.

Note that they do not claim to measure individual or team productivity, but they do find a large, statistically significant difference in the data. Worth reading the methodologies to assuage any doubts.

> A Stanford case study found that after accounting for buggy code that needed to be re-worked there may be no productivity uplift.

I'm not sure if we're talking about the same Stanford study; the one in the link above (100K engineers across 600+ companies) does account for "code churn" (ostensibly fixing AI bugs) and still finds an overall productivity boost in the 5 - 30% range. This depends a LOT on the use-case (e.g. complex tasks on legacy COBOL codebases actually see a negative impact).

In any case, most of these studies seem to agree on a 15 - 30% boost.

Note these are mostly from the ~2024 timeframe using the models from then without today's agentic coding harness. I would bet the number is much higher these days. More recent reports from sources like DX find up to a 60% increase in throughput, though I haven't looked closely at this and have some doubts.

> Meta measured a 6-12% uplift in productivity from adopting agentic coding. That's paltry.

Even assuming the lower-end 6% lift, at Meta SWE salaries that is a LOT of savings.

However, I haven't come across anything from Meta yet, could you link a source?

kbelder 3 hours ago [-]
>Meta measured a 6-12% uplift in productivity from adopting agentic coding. That's paltry.

That feels like the right ballpark. I would have estimated 10-20%. But I'd say that's not paltry at all. If it's a 10% boost, it's worth paying for. Not transformative, but worthwhile.

I compare it to moving from a single monitor to a multi-monitor setup, or getting a dev their preferred IDE.

llmslave2 21 hours ago [-]
more code = better software
keeda 19 hours ago [-]
If the software has tens of thousands of users when it wasn't expected to get any at all, does the code even matter?
llmslave2 17 hours ago [-]
Yeah
keeda 3 hours ago [-]
Why?
hackable_sand 15 hours ago [-]
What?
Kiro 16 hours ago [-]
They are not the same thing. If something works for me, I can rule out "it doesn't work at all". However, if something doesn't work for me I can't really draw any conclusions about it in general.
geraneum 16 hours ago [-]
> if something doesn't work for me I can't really draw any conclusions about it in general.

You can. The conclusion would be that it doesn’t always work.

travisjungroth 1 days ago [-]
> anecdotally based on their own subjective experience

So the “subjective” part counts against them. It’s better to make things objective. At least they should be reproducible examples.

When it comes to the “anecdotally” part, that doesn’t matter. Anecdotes are sufficient for demonstrating capabilities. If you can get a race car around a track in three minutes and it takes me four minutes, that’s a three minute race car.

llmslave2 1 days ago [-]
Anecdotal: (of an account) not necessarily true or reliable, because based on personal accounts rather than facts or research.

If you say you drove a 3 minute lap but you didn't time it, that's an anecdote (and is what I mean). If you measured it, that would be a fact.

ozim 1 days ago [-]
I think from your top post you also miss “representative”.

If you measure something and the sample is N=1, it might be a fact, but it's still a fact that's only true for a single person.

I often don't need a sample size of 1000 to consider something worthy of my time, but if it's an N=1 sample from a random person on the internet, I am going to doubt it.

If I see 1000 people claiming it makes them more productive, I am going to check. If it's 5 people I follow and expect to know tech quite well, I am going to check as well.

llmslave2 1 days ago [-]
Checking is good, you should probably check.

Every person I respect as a great programmer thinks agentic workflows are a joke, and almost every programmer I hold in low regard thinks they're the greatest things ever, so while I still check, I'm naturally quite skeptical.

rjh29 24 hours ago [-]
Doesn't help that many people use AI to assist with autocompleting boilerplate crap or simple refactors, where it works well, or even the occasional small feature. But this is conflated with people who think you can just tell an AI to build an entire app and it'll go off and do it by itself in a giant feedback loop and it'll be perfect.
ozim 15 hours ago [-]
There are already people I follow, startup owners and developers themselves, saying they are not hiring "respectable developers" who bash agentic coding; they'd much rather hire a junior who is starry-eyed about working with agents. Because they see the value, as they are the ones running companies.
hackable_sand 15 hours ago [-]
In this case it's more like someone simulated a 3-minute lap and tried to pass it off as a real car with real friction.
tshaddox 1 days ago [-]
The term "anecdotal evidence" is used as a criticism of evidence that is not gathered in a scientific manner. The criticism does not imply that a single sample (a car making a lap in 3 minutes) cannot be used as valid evidence of a claim (the car is capable of making a lap in 3 minutes).
Ianjit 14 hours ago [-]
Studies have shown that software engineers are very bad at judging their own productivity. When a software engineer feels more productive, the inverse is just as likely to be true. That's why anecdotal data can't be trusted.
jimbo808 24 hours ago [-]
I have never once seen extraordinary claims of AI wins accompanied by code and prompts.
DauntingPear7 1 hours ago [-]
As a CS student who kinda knows how to build things, I do in fact get a speedup when querying AI or letting AI do some coding for me. However, I have a poor understanding of the system it builds, and it does a quite frankly terrible job with project architecture. I use Claude Sonnet 4.5 with Claude Code, and I can get things implemented rather quickly while using it, but if anything goes wrong I just don't have a great idea of where anything is, what code is in charge of what, etc. I can also deeply feel the brainrot of using AI. I get lazy and I can feel myself getting worse at solving what should be easy problems. My mental image of the problem to solve gets fuzzy, and I don't train that muscle like I would if I didn't use AI to help me solve it.
nfw2 24 hours ago [-]
The author is not claiming that ai agents don't make him more productive.

"I use LLM-generated code extensively in my role as CEO of Carrington Labs, a provider of predictive-analytics risk models for lenders."

LinXitoW 9 hours ago [-]
Productivity gains in programming have always been incredibly hard to prove, esp. on an individual level. We've had these discussions a million times long before AI. Every time a manager tries to reward some kind of metric for "good" code, it turns out that it doesn't work that way. Every time Rust is mentioned, every C fan finds a million reasons why the improvement doesn't actually have anything to do with using Rust.

AI/LLM discussions are the exact same. How would a person ever measure their own performance? The moment you implement the same feature twice, you're already reusing learnings from the first run.

So, the only thing left is anecdotal evidence. It makes sense that on both sides people might be a little peeved or incredulous about the others claims. It doesn't help that both sides (though mostly AI fans) have very rabid supporters that will just make up shit (like AGI, or the water usage).

Imho, the biggest part missing from these anecdotes is exactly what you're using, what you're doing, and what baseline you're comparing it to. For example, using Claude Code in a typical, modern, decently well-architected Spring app to add a bunch of straightforward CRUD operations for a new entity works absolutely flawlessly, compared to a junior or even medior (mid-level?) dev.

Copy pasting code into an online chat for a novel problem, in an untyped, rare language, with only basic instructions and no way for the chat to run it, will basically never work.

Hobadee 7 hours ago [-]
I will prefix this all by saying I'm not in a professional programming position, but I would consider myself an advanced amateur, and I do code for work some. (General IT stuff)

I think the core problem is a lot of people view AI incorrectly and thus can't use it efficiently. Everyone wants AI to be a Jr or Sr programmer, but I have serious doubts as to the ability of AI to ever have original thought, which is a core requirement of being a programmer. I don't think AI will ever be a programmer, but rather a tool to help programmers take the tedium away. I have seen massive speedups in my own workflow removing the tedium.

I have found prompting AI to be of minimal use, but tab-completion definitely speeds stuff up for me. If I'm about to create some for loop, AI will usually have a pretty good scaffold for me to use. If I need to handle an error, I start typing and AI will autocomplete the error handling. When I write my function documentation I am usually able to just tab-complete it all.

Yes, I usually have to go back and fix some things, and I will often skip various completion hints, but the scaffold is there, and as I start fixing faulty code it generated, the AI will usually pick up on the fixes and help me tab-complete them. If the AI isn't giving me any useful tab-completions, I'll just start coding what I need, and it picks up after a few lines so I can tab-complete again.

Occasionally I will give a small prompt such as "Please write me a loop that does X", or "Please write a setter function that validates the input", but I'll still treat that as a scaffold and go back and fix things, but I always give it pretty simple tasks and treat it simply as a scaffold generator.

I still run into the same problem-solving issues I had before AI (how do I tackle X problem?), and there isn't nearly as much speedup there (although now, instead of talking to a rubber duck, I can chat with AI to help figure things out), but once I settle on the solution and start implementing it, I get that AI tab-completion boost again.

With all that being said, I do also see massive boosts with fairly basic tasks that can be templated off something that already exists, such as creating unit tests or scaffolding a class, although I do need to go back and tweak things.

In summary, yes, I probably do see a 10x speedup, but it's really a 10x speedup in my typing speed more than a 10x speedup in solving the core issues that make programming challenging and fun.

egeozcan 5 hours ago [-]
> I have serious doubts as to the ability of AI to ever have original thought, which is a core requirement of being a programmer

If you find a job as an enterprise software developer, you'd see that your core requirement doesn't hold :)

order-matters 23 hours ago [-]
the people having a good experience with it want the people who aren't to share how they are using it so they can tell them how they are doing it wrong.

honestly though idc about coding with it, i rarely get to leave excel for my work anyway. the fact that I can OCR anything in about a minute is a game changer though

felipeerias 24 hours ago [-]
Claims based on personal experience working on real world problems are likelier to be true.

It’s reasonable to accept that AI tools work well for some people and not for others.

There are many ways to integrate these tools and their capabilities vary wildly depending on the kind of task and project.

frez1 14 hours ago [-]
what i enjoy the most is every "AI will replace engineers" article is written by an employee working at an AI company with testimonials from other people also working at AI companies
heavyset_go 17 hours ago [-]
Now that the "our new/next model is so good that it's sentient and dangerous" AGI hype has died down, the new hype goalpost is "our new/next model is so good it will replace your employees and do their jobs for you".

Within that motte and bailey is, "well my AI workflow makes me a 100x developer, but my workflow goes to a different school in a different town and you don't know her".

There's value there, I use local and hosted LLMs myself, but I think there's an element of mania at play when it comes to self-evaluation of productivity and efficacy.

8 hours ago [-]
lazarus01 8 hours ago [-]
>> when others make claims to the contrary suddenly there is some overwhelming burden of proof that has to be reached

That is just plain narcissism. People seeking attention in the slipstream of megatrends make claims that have very little substance. When they are confronted with rational argument, they can't respond intellectually, so they try to dominate the discussion by demanding an overwhelming burden of proof, while their own position remains underwhelming.

LinkedIn and Medium are densely concentrated with this sort of content. It’s all for the likes.

jimbo808 24 hours ago [-]
This is not always the case, but I get the impression that many of them are paid shills, astroturf accounts, bots, etc. Including on HN. Big AI is running on an absurd amount of capital and they're definitely using that capital to keep the hype cycle going as long as possible while they figure out how to turn a profit (or find an exit, if you're cynical - which I am).
thierrydamiba 24 hours ago [-]
That’s a bit of a reductive view.

For example, even the people with the most negative view on AI don’t let candidates use AI during interviews.

You can disagree on the effectiveness of the tools but this fact alone suggests that they are quite useful, no?

zeroonetwothree 24 hours ago [-]
There is a difference between being useful for sandboxed toy problems and being useful in production.
kop316 21 hours ago [-]
Not really. I'd rather find out very quickly that someone doesn't know a domain space rather than having to wade through plausible looking but bad answers to figure out the exact same thing.
viking123 16 hours ago [-]
At this point it's foolish to assume otherwise. This also applies to places like Reddit and X, where there are intelligence services and companies with armies of bot accounts. Modern LLMs make it so easy to create content that looks real enough. Manufacturing consent is very easy now.
safety1st 17 hours ago [-]
I think it's a complex discussion because there's a whole bundle of new capabilities, the largest one arguably being that you can build a conversational interface to any piece of software. There's tons of pressure to express this in terms of productivity, financial and business benefits, but like with a coding agent, the main win for me is reduction of cognitive load, not an obvious "now the work gets done 50% faster so corporate can cut half the dev team."

I can talk through a possible code change with it, which is just a natural, easy, human way to work; our brains evolved to talk and figure things out in conversation. The jury is out on how much this actually speeds things up or translates into cost savings. But it reduces cognitive load.

We're still stuck in a mindset where we pretend knowledge workers are factory workers and they can sit there for 8 hours producing consistently with their brain turned off. "A couple hours a day of serious focus at best" is closer to the reality, so maybe an LLM can turn the other half of the day into something more useful?

There is also the problem that any LLM provider can and absolutely will enshittify the LLM overnight if they think it's in their best interest (feels like OpenAI has already done this).

My extremely casual observations of whatever research I've seen discussed have suggested that maybe with high-quality AI tools you can get work done 10-20% faster? But you don't have to think quite as hard, which is where I feel the real benefit is.

bryanrasmussen 6 hours ago [-]
subjective experience is heavily influenced by expectations and desires, so they should try to verify.
athrowaway3z 10 hours ago [-]
Public discourse on this is a dumpster fire. But you're not making a meaningful contribution.

It is the equivalent of saying: stenotype enthusiasts claim they're more productive, but when we give stenotypes to a large group of typists we get data disproving that.

Which should immediately highlight the issue.

As long as these discussions aren't prefaced with the metric and methodology, any discussion on this is just meaningless online flame wars / vibe checks.

giancarlostoro 19 hours ago [-]
The last time I ran into this, it came down to how the person was using the AI: they weren't even using agents, they were complaining that the AI didn't do everything in one shot in the browser. You have to figure out how people are using the models, because everyone was using AI in the browser in the beginning, and a lot of people still use it that way. Those of us praising the agents are using things like Claude Code. There is a night-and-day difference in how you use it.
viraptor 23 hours ago [-]
There are different types of contrary claims though, which may be an issue here.

One example: "agents are not doing well with code in languages/frameworks which have many recent large and incompatible changes like SwiftUI" - me: that's a valid issue that can be slightly controlled for with project setup, but still largely unsolved, we could discuss the details.

Another example: "coding agents can't think and just hallucinate code" - me: lol, my shipped production code doesn't care, bring some real examples of how you use agents if they don't work for you.

There's a lot of the second type on HN.

alfalfasprout 22 hours ago [-]
Yeah but there's also a lot of "lol, my shipped production code doesn't care" type comments with zero info about the type of code you're talking about, the scale, and longer term effects on quality, maintainability, and lack of expertise that using agentic tools can have.

That's also far from helpful or particularly meaningful.

viraptor 20 hours ago [-]
There's a lot of "here's how agents work for me" content out there already. From popular examples from simonw and longer videos from Theo, to thousands of posts and comments from random engineers. There's really not much that's worth adding anymore. (Unless you discover something actually new) It works for use cases which many have already described.
shimman 20 hours ago [-]
Using two randos that are basically social media personalities as SMEs is just a damning statement about the current trends of programming.
viraptor 19 hours ago [-]
The area is still relatively fresh. Those two media personalities do actual work though and provide a summary for today's state. You can wait for an academic research on what happened 6 months ago or a consulting industry keynote/advertisement about what they implemented a year ago... but I'm not sure you'll be better informed.
rectang 18 hours ago [-]
… and trollish to boot. Y U gotta “lol”?

But since there's grey in my beard, I've seen it several times: in every technological move forward there are obnoxious hype merchants, reactionary status quo defenders, and then the rest of us doing our best to muddle through.

viraptor 14 hours ago [-]
> Y U gotta “lol”?

Because some opinions are lazy. You can get all the summaries you want by searching "how I use agentic coding / Claude code" on the web or similar queries on YouTube, explaining in lots of details what's good and bad. If someone says "it's just hallucinations", it means they aren't actually interested and just want to complain.

palmotea 1 days ago [-]
> One thing I find really funny is when AI enthusiasts make claims about agents and their own productivity its always entirely anecdotally based on their own subjective experience, but when others make claims to the contrary suddenly there is some overwhelming burden of proof that has to be reached in order to make any sort of claims regarding the capabilities of AI workflows. So which is it?

Really? It's little more than "I am right and you are wrong."

immibis 10 hours ago [-]
Everything you need to know about AI productivity is shown in this first chart here:

https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...

bearforcenine 3 hours ago [-]
Not confident it's quite that straightforward. Here's a presentation from Meta showing a 6-12% increase in diff throughput for above-median users of agentic coding: https://www.youtube.com/watch?v=1OzxYK2-qsI
deadbabe 22 hours ago [-]
This is why I can't wait for the costs of LLMs to shoot up. Nothing tells you more about how people really feel about AI assistants than how much they are willing to pay for them. These AI are useful but I would not pay much more than what they are priced at today.
colechristensen 1 days ago [-]
On one hand "this is my experience, if you're trying to tell me otherwise I need extraordinary proof" is rampant on all sides.

On the other hand one group is saying they've personally experienced a thing working, the other group says that thing is impossible... well it seems to the people who have experienced a thing that the problem is with the skeptic and not the thing.

llmslave2 1 days ago [-]
Someone who swears they have seen ghosts is obviously gonna have a problem with people saying ghosts don't exist. Doesn't mean ghosts exist.
colechristensen 24 hours ago [-]
Ok, but if you're saying I've had delusions of LLMs being helpful, then either I need serious psychiatric care or we need to revisit the premise, because we're talking about a tool being useful, not the existence of supernatural beings.
b00ty4breakfast 24 hours ago [-]
I think the point is that a subjective experience, without accompanying data, is useless for making any factual claims wrt both ghosts and reported productivity-boosts from LLM usage.

Getting photos of ghosts is one thing, but productivity increases are something that we should be able to quantify at some level to demonstrate the efficacy of these tools.

That's a silly thing to request from random people in the comments of an HN thread though ha

llmslave2 24 hours ago [-]
Nobody is saying LLM's can never be helpful, it's skepticism towards certain claims made around agentic workflows re. programming, such as claims of massively increased productivity or the idea that agents will replace most if not all programmers.
d0mine 16 hours ago [-]
Measuring programming productivity is hard. For example, take testing. It is certainly useful. At the same time, you can waste time on it in some situations.

When, what, how to test may be important for productivity.

I don't know whether LLMs are in the same category.

intended 17 hours ago [-]
Funnily enough, the one study that was done, indicates people misjudge the utility and impact of LLMs. https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...
ares623 23 hours ago [-]
One group is keen on rushing to destroy society for a quality-of-life improvement that they can't even be bothered to measure.
Terr_ 23 hours ago [-]
But there is still a hugely important asymmetry: If the tool turns your office into gods of software, they should be able to prove it with godly results by now.

If I tell you AmbrosiaLLM doesn't turn me into a programming god... Well, current results are already consistent with that, so It's not clear what else I could easily provide.

colechristensen 23 hours ago [-]
This is a bit of goalpost moving though because the primary experience is skeptics saying AI couldn't be trusted to design a ham sandwich vs enthusiasts who've made five-course meals with AI. (or, you know, the programming equivalent)

Absolutely there's a lot of unfounded speculation going around and a lot of aggressive skepticism of it, and both sides there are generally a little too excited about their position.

But that is fundamentally not what I'm talking about.

ulfw 13 hours ago [-]
It's because the thing is overhyped and too many people are vested in keeping the hype going. Facing reality at this point, while necessary, is tough. The amount of ads for scam degrees from reputable unis about 'Chief AI Officer' bullshit positions is staggering. There's just tooo much AI bubbling
alfalfasprout 22 hours ago [-]
TBH a lot of this is subjective. Including productivity.

My other gripe too is productivity is only one aspect of software engineering. You also need to look at tech debt introduced and other aspects of quality.

Productivity also takes many forms so it's not super easy to quantify.

Finally... software engineers are far from being created equal. VERY big difference in what someone doing CRUD apps for a small web dev shop does vs. eg; an infra engineer in big tech.

citizenpaul 22 hours ago [-]
It's really a high-level bikeshed. Obviously we are all still using and experimenting with LLMs. However there is a huge gap in experiences and total usefulness depending on the exact task.

The majority of HNers still reach for LLMs pretty regularly even if they fail horribly frequently. That's really the pit the tech is stuck in. Sometimes it oneshots your answer perfectly, or pair programs with you perfectly for one task, or notices a bug you didn't. Sometimes it wastes hours of your time for various subtle reasons. Sometimes it adamantly insists 2 + 2 = 55

nfw2 22 hours ago [-]
Latest reasoning models don't claim 2 + 2 = 55, and it's hard to find them making any sort of obviously false claims, or not admitting to being mistaken if you point out that they are
taormina 20 hours ago [-]
I can't go a full conversation without obviously false claims. They will insist you are correct and that your correction is completely correct despite that also being wrong.
nfw2 16 hours ago [-]
Ironically the start of this thread was bemoaning the use of anecdotal evidence
citizenpaul 51 minutes ago [-]
Also ironic that I specifically mentioned bikeshedding, yet the reply bikesheds my simple example while ignoring the big picture: LLMs still regularly generate blatantly and easily noticed false information as answers.
citizenpaul 1 hours ago [-]
It was clearly a simplified example, like I said endless bikeshed.

Here is a real one. I was using the much lauded new Gemini 3? last week and wanted it to do something a slightly specific way for reasons. I told it specifically and added it to the instructions. DO NOT USE FUNCTION ABC.

It immediately used FUNCTION ABC. I asked it to read back its instructions to me. It confirmed what I put there. So I asked it again to change it to another function. It told me that FUNCTION ABC was not in the code, even though it was clearly right there in the code.

I did a bit more prodding and it adamantly insisted that the code it generated did not exist, again and again and again. Yes I tried reversing to USE FUNCTION XYZ. Still wanted to use ABC

SkyBelow 12 hours ago [-]
If someone seems to have productivity gains when using an AI, it is hard to come up with an alternate explanation for why they did.

If someone sees no productivity gains when using an AI (or a productivity decrease), it is easy to come up with ways it might have happened that weren't related to the AI.

This is an inherent imbalance in the claims, even if both people have brought 100% proof of their specific claims.

A single instance of something doing X is proof of the claim that something can do X, but no amount of instances of something not doing X is proof of the claim that something cannot do X. (Note, this is different from people claiming that something always does X, as one counter example is enough to disprove that.)

Same issue in math with the difference between proving a conjecture is sometimes true and proving it is never true. Only one of these can be proven by examples (and only a single example is needed). The other can't be proven even by millions of examples.

nfw2 1 days ago [-]
[flagged]
AstroBen 23 hours ago [-]
I don't get it? Yes you should require a valid reason before believing something

The only objective measures I've seen people attempt to take have at best shown no productivity loss:

https://substack.com/home/post/p-172538377

https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...

This matches my own experience using agents, although I'm actually secretly optimistic about learning to use it well

johnfn 22 hours ago [-]
The burden you are placing is too high here. Do you demand controlled trials for everything you do or else you refuse to use it or accept that other people might see productivity gains? Do you demand studies showing that static typing is productive? Syntax highlighting? IDEs or Vim? Unit testing? Whatever language you use?

Obviously not? It would be absurd to walk into a thread about Rust and say “Rust doesn’t increase your productivity and unless you can produce a study proving it does then your own personal anecdotes are worthless.”

Why the increased demand for rigor when it comes to AI specifically?

AstroBen 22 hours ago [-]
Typically I hear how other people are doing things and I test it out for myself. Just like I'm doing with AI

Actually IDEs vs vim are a perfect analogy because they both have the ability to feel like they're helping a tonne, and at the end of the work day neither group outperforms the other

I'm not standing on the sidelines criticizing this stuff. I'm using it. I'm growing more and more skeptical because it's not noticably helping me deliver features faster

At this point I'm at "okay record a video and show me these 3x gains you're seeing because I'm not experiencing the same thing"

The increased demand for rigor is because my experience isn't matching what others say

I can see a 25% bump in productivity being realistic if I learn where it works well. There are people claiming 3-10x. It sounds ridiculous

bluGill 21 hours ago [-]
I can't see a 25% jump in productivity because writing code isn't even 25% of what I do. Even if it was infinitely fast I still can't get that high.
bonesss 14 hours ago [-]
Given a hypothetical 25% boost: there are categories of errors that vibe-testing vibed code will bring in, and we know humans suck at critical reading. On the support timeline of an Enterprise product that's gonna lead to one or more true issues.

At what point is an ‘extra’ 25% coding overhead worth it to ensure a sane human reasonably concerned about criminal consequences for impropriety read all code when making it, and every change around it? To prevent public embarrassment that can and will chase off customers? To have someone to fire and sue if need be?

[Anecdotally, the inflection point was finding tests updated to short circuit through mildly obfuscated code (introduced after several reviews). Paired with a working system developed with TDD, that mistake only becomes obvious when the system stops working but the tests don’t. I wrote it, I ran the agents, I read it, I approved it, but was looking for code quality not intentional sabotage/trickery… lesson learned.]

From a team lead perspective in an Enterprise space, using 25% more time on coding to save insane amounts of aggressive and easy-to-flub review, and whole categories of errors, sounds like a smart play. CYA up front, take the pain up front.

bluGill 9 hours ago [-]
Not that you are wrong, but you don't seem to understand my point. I spend less than 25% of my time writing code. I also do code review, various story/architecture planning, testing, bug triage, required training, and other management/people activities; these take up more than 75% of my time. Even if AI could do vibe code as well as me infinitely fast it still wouldn't be a 75% improvement.
Rapzid 18 hours ago [-]
Anecdotally the people who seem to be most adamant about the efficiency of things like vim or Python are some of the slowest engineers I've worked with when it comes to getting shit done. Even compared to people who don't really care for their preferred tech much lol.

I wonder how many 10x AI bros were 1/10th engineers slacking off most of the week before the fun new tech got them to actually work on stuff.

Obviously not all, and clearly there are huge wins to be had with AI. But I wonder sometimes..

llmslave2 22 hours ago [-]
Do you just believe everything everybody says? No quantifiable data required, as long as someone somewhere says it it must be true?

One of the reasons software is in decline is because it's all vibes, nobody has much interest in conducting research to find anything out. It doesn't have to be some double blinded peer reviewed meta analysis, the bar can still be low, it just should be higher than "I feel like"...

johnfn 22 hours ago [-]
You don't seem to have answered my questions - you are just reiterating your own point (which I already responded to). Again I ask you - do you have studies to prove that syntax highlighting is useful or are you just using it because of vibes? Do you have research showing that writing in your language of choice is faster than Assembly?
llmslave2 21 hours ago [-]
I actually prefer no syntax highlighting, and I certainly wouldn't make any claims about it being useful. But something being "useful" is often personal - I find IDEs useful, others find Vim useful, maybe one is better or worse than the other or maybe we're all different and our brains function in different ways and that explains the difference.

With assembly versus say, Go for writing a web server? That's trivially observable, good luck arguing against that one.

nfw2 16 hours ago [-]
That's the whole point. The sky is blue is trivially observable. Any claim that someone has disproven something that is trivially observable should be met with skepticism.

If you have something that needs to be done, and an agent goes and does the whole thing for you without mistakes, it is trivially observable that that is useful. That is the definition of usefulness.

llmslave2 15 hours ago [-]
But useful in the context of these debates isn't that it solves any single problem for someone. Nobody is arguing that LLM's have zero utility. So I don't really see what your point is?
nfw2 22 hours ago [-]
here are some

https://resources.github.com/learn/pathways/copilot/essentia...

https://www.anthropic.com/research/how-ai-is-transforming-wo...

https://www.mckinsey.com/capabilities/tech-and-ai/our-insigh...

llmslave2 21 hours ago [-]
They're all marketing slop lol. Go look at their methodology. Absolutely shite.
nfw2 16 hours ago [-]
This is what you claimed the bar was "it just should be higher than 'I feel like'"

Now you are moving it because your statement is provably false.

Your criticism of it is based on vibes. What specifically is wrong with the methodologies?

One of them randomly broke developers into two groups, one with access to ai and one without, timed them to complete the same task, and compared the results. That seems fine? Any measurement of performance in a lab environment comes with caveats, but since you dismiss real world accounts as vibes, that seems like the best you can do.

llmslave2 15 hours ago [-]
I'm sorry but I'm not going to take "research" about Claude seriously from Anthropic, the company who makes and sells Claude. I'm also not going to do that for Copilot from Microsoft, the company who makes and sells Copilot.
shimman 20 hours ago [-]
I honestly wish we had studies that truly answered these Qs. Modern programming has been a cargo cult for a good 20 years now.
nfw2 16 hours ago [-]
People who think syntax highlighting is useful are a cargo cult?
nfw2 23 hours ago [-]
Why do you believe that the sky is blue? What randomized trial with proper statistical controls has shown this to be true?
admdly 22 hours ago [-]
I’m not sure why you’d need or want a randomised controlled trial to determine the colour of the sky. There have been empirical studies done to determine the colour and the reasoning for it - https://acp.copernicus.org/articles/23/14829/2023/acp-23-148... is an interesting read.
AstroBen 23 hours ago [-]
I can see it, it's independently verifiable by others, and it's measurable
nfw2 22 hours ago [-]
The same is true of AI productivity

https://resources.github.com/learn/pathways/copilot/essentia...

https://www.anthropic.com/research/how-ai-is-transforming-wo...

https://www.mckinsey.com/capabilities/tech-and-ai/our-insigh...

intended 17 hours ago [-]
> https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...

Shows that devs overestimate the impact of LLMs on their productivity. They believe they get faster when they take more time.

Since Anthropic, GitHub are fair game here’s one from Code Rabbit - https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-gen...

22 hours ago [-]
davidgerard 22 hours ago [-]
lol those are all self-reports of vibes

then they put the vibes on a graph, which presumably transforms them into data

nfw2 22 hours ago [-]
"Both GitHub and outside researchers have observed positive impact in controlled experiments and field studies where Copilot has conferred:

55% faster task completion using predictive text

Quality improvements across 8 dimensions (e.g. readability, error-free, maintainability)

50% faster time-to-merge"

how is time-to-merge a vibe?

Orygin 10 hours ago [-]
The subject is productivity. Time to merge is as useful a metric as Lines of Code for determining productivity. I can merge 100s of changes but if they are low quality or incur bugs, then it's not really more productive.
llmslave2 22 hours ago [-]
If you point a spectrometer at the sky during the day in non-cloudy conditions you will observe readings peaking in the roughly 450-495 nanometers range, which crazily enough, is the definition of the colour blue [0]!

Then you can research Rayleigh scattering, of which consists of a large body of academic research not just confirming that the sky is blue, but also why.

But hey, if you want to claim the sky is red because you feel like it is, go ahead. Most people won't take you seriously just like they don't take similar claims about AI seriously.

[0] https://scied.ucar.edu/image/wavelength-blue-and-red-light-i...
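(If you want to take the tangent further, the 1/wavelength^4 dependence of Rayleigh scattering is easy to eyeball in a couple of lines of Python; the 450 nm and 650 nm values are just standard textbook wavelengths for blue and red, not figures from the link above.)

    # Rayleigh scattering intensity scales with 1/wavelength^4, which is why
    # shorter (blue) wavelengths dominate the scattered daylight we see.
    blue_nm, red_nm = 450, 650           # illustrative textbook wavelengths
    ratio = (red_nm / blue_nm) ** 4
    print(f"Blue scatters ~{ratio:.1f}x more strongly than red")   # ~4.4x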

lkjdsklf 20 hours ago [-]
Ever seen a picture of the blue sky from the ISS?
nfw2 22 hours ago [-]
you needed a spectrometer to tell you the sky is blue?
1 days ago [-]
llmslave2 1 days ago [-]
[flagged]
nfw2 24 hours ago [-]
pretending the only way anybody comes to a conclusion about anything is by reading peer-reviewed journals is an absurdly myopic view of epistemological practices in the real world
llmslave2 24 hours ago [-]
Nobody is pretending that's the case...
nfw2 24 hours ago [-]
your argument was that it's laughable on its face that anyone should be more skeptical of one claim vs another a priori
llmslave2 23 hours ago [-]
No, it's that it's hypocritical to make a bunch of unfounded claims and then whine that someone who is conducting actual research and trying to be objective isn't doing it well enough or whatever.
nfw2 23 hours ago [-]
To say that anyone who says they are more productive with ai is making an unfounded claim is evidence that you believe that the only path to knowledge is formal research, which you claimed to not believe.
llmslave2 23 hours ago [-]
Paste this conversation into ChatGPT and have it explain what I said because I just can't be arsed to correct you any longer.
bdangubic 24 hours ago [-]
[flagged]
bdangubic 1 days ago [-]
Which one it is is clear - the enthusiasts have spent countless hours learning/configuring/adjusting, figuring out limitations, guarding against issues etc etc etc and now do 50 to 100 PRs per week like Boris

Others … need to roll up the sleeves and catch up

paodealho 1 days ago [-]
There isn't anything clear until someone manages to publish measurable and reproducible results for these tools while working on real world use cases.

Until then it's just people pulling the lever on a black box.

conception 19 hours ago [-]
Hundreds of millions of people use these every day on real world use cases. If they didn’t work, people wouldn’t use them.
nfw2 1 days ago [-]
This is the measurable evidence you are talking about: https://a16z.com/revenue-benchmarks-ai-apps/
overfeed 24 hours ago [-]
Here's even earlier evidence: https://en.wikipedia.org/wiki/Tulip_mania
nfw2 24 hours ago [-]
this point is about revenue not valuations
FEELmyAGI 20 hours ago [-]
All three companies profiled are selling products TO vibe-coders, not apps created BY AI-utilizers.

The shovel seller in the gold rush analogy.

zeroonetwothree 24 hours ago [-]
Merely counting PRs is not very impressive to me. My pre LLM average is around 50/week anyway. But I’m not going to claim that somehow makes me the best programmer ever. I’m sure someone with 1 super valuable PR can easily create more value than I do.
Terr_ 23 hours ago [-]
Maybe I'm just in a weird place, but I can't imagine 50 PRs a week.

Maybe it's because I spend a lot of my time just turning problem reports on Slack into tickets with tables of results and stack traces.

bdangubic 22 hours ago [-]
automate that shit
Terr_ 21 hours ago [-]
Unfortunately it's mostly B2B integration stuff, where the other end is another company, which can sometimes be just as quirky as a user, except at scale.

"I received your spreadsheet detailing 821 records that are in State A but still haven't been moved to State B by our system as it adds Datapoint X on a regular basis. From what I can tell, it seems your data is missing crucial pieces you assured us would always be there. What's that? You want us to somehow fix whatever is somehow making those records in your AcmeERP system? Don't you have a support contract with that giant vendor? We seem like an easier target to hit up for impromptu tech-support consulting work? Well, I'll escalate that to the product manager..."

wakawaka28 22 hours ago [-]
A bunch of tiny PRs is not hard to do manually. But LLMs can write boatloads of code to do kind of sophisticated things. You do have to figure out how to get to a point where you can trust the code. But the LLMs can help you write boatloads of tests too based on plain English descriptions.
8note 21 hours ago [-]
llms remove a lot of the difficulty of writing a ton of reasonable code, but is that really the bottleneck to producing a bunch of PRs?

isn't it the reviewing time? reviewing code is hard work

wakawaka28 20 hours ago [-]
Reviewing code can be hard but it's not as hard as writing the code. Even with the best autocomplete, and ergonomic editors like vim, it still takes quite a bit of time to write code for some features compared to the actual concepts being implemented. There are also lots of decisions like variable names that can be automated with a LLM. If you don't like what it came up with, you can tell it to change them. I recommend that you keep them fairly unique like you would for your own handwritten code, because ambiguity creates problems for people and machines alike.
christophilus 19 hours ago [-]
For me, review is the hard part.
burnte 1 days ago [-]
Or the tool makers could just make better tools. I'm in that camp, I say make the tool adapt to me. Computers are here to help humans, not the reverse.
bdangubic 24 hours ago [-]
so when you get a new computer you just use it, as-is, just like out of the box that’s your computer experience? you don’t install any programs, connect printer, nothing eh? too funny reading “tool should adapt to me” and there are roughly 8.3 billion “me” around - can’t even put together what that means honestly
CuriouslyC 20 hours ago [-]
People working in languages/libraries/codebases where LLMs aren't good is a thing. That doesn't mean they aren't good tools, or that those things won't be conquered by AI in short order.

I try to assume people who are trashing AI are just working in systems like that, rather than being bad at using AI, or worse, shit-talking the tech without really trying to get value out of it because they're ethically opposed to it.

A lot of strongly anti-AI people are really angry human beings (I suppose that holds for vehemently anti-<anything> people), which doesn't really help the case, it just comes off as old man shaking fist at clouds, except too young. The whole "microslop" thing came off as classless and bitter.

twelvedogs 20 hours ago [-]
the microslop thing is largely just a backlash at ms jamming ai into every possible crevice of every program and service they offer with no real plan or goals other than "do more ai"
renegade-otter 1 days ago [-]
They are not worse - the results are not repeatable, which is a much worse problem.

Like with cab hailing, shopping, social media ads, food delivery, etc: there will be a whole ecosystem, workflows, and companies built around this. Then the prices will start going up with nowhere to run. Their pricing models are simply not sustainable. I hope everyone realizes that the current LLMs are subsidized, like your Seamless and Uber was in the early days.

IMTDb 1 days ago [-]
A key difference is that the cost to execute a cab ride largely stayed the same. Gas to get you from point A to point B is ~$5, and there's a floor on what you can pay the driver. If your ride costs $8 today, you know that's unsustainable; it'll eventually climb to $10 or $12.

But inference costs are dropping dramatically over time, and that trend shows no signs of slowing. So even if a task costs $8 today thanks to VC subsidies, I can be reasonably confident that the same task will cost $8 or less without subsidies in the not-too-distant future.

Of course, by then we'll have much more capable models. So if you want SOTA, you might see the jump to $10-12. But that's a different value proposition entirely: you're getting significantly more for your money, not just paying more for the same thing.

lompad 14 hours ago [-]
>But inference costs are dropping dramatically over time,

Please prove this statement; so far there is no indication that it is actually true - the opposite seems to be the case. Here are some actual numbers [0] (and whether you like Ed or not, his sources have so far always been extremely reliable).

There is a reason the AI companies don't ever talk about their inference costs. They boast about everything they can find, but inference... not.

[0]: https://www.wheresyoured.at/oai_docs/

patresh 12 hours ago [-]
I believe OP's point is that for a given model quality, inference cost decreases dramatically over time. The article you linked talks about effective total inference costs which seem to be increasing.

Those are not contradictory: a company's inference costs can increase due to deploying more models (Sora), deploying larger models, doing more reasoning, and an increase in demand.

However, if we look purely at how much it costs to run inference on a fixed amount of requests for a fixed model quality, I am quite convinced that the inference costs are decreasing dramatically. Here's a model from late 2025 (see Model performance section) [1] with benchmarks comparing a 72B parameter model (Qwen2.5) from early 2025 to the late 2025 8B Qwen3 model.

The 9x smaller model outperforms the larger one from earlier the same year on 27 of the 40 benchmarks they were evaluated on, which is just astounding.

[1] https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct

academia_hack 7 hours ago [-]
++

Anecdotally, I find you can tell if someone worked at a big AI provider or a small AI startup by proposing an AI project like this:

" First we'll train a custom trillion parameter LLM for HTML generation. Then we'll use it to render our homepage to our 10 million daily visitors. "

The startup people will be like "this is a bad idea because you don't have enough GPUs for training that LLM" and the AI lab folks will be like "How do you intend to scale inference if you're not Google?"

forty 24 hours ago [-]
What if we run out of GPU? Out of RAM? Out of electricity?

AWS is already raising GPU prices; that never happened before. What if there is war in Taiwan? What if we want to get serious about climate change and start saving energy for vital things?

My guess is that, while they can do some cool stuff, we cannot afford LLMs in the long run.

jiggawatts 20 hours ago [-]
> What if we run out of GPU?

These are not finite resources being mined from an ancient alien temple.

We can make new ones, better ones, and the main ingredients are sand and plastic. We're not going to run out of either any time soon.

Electricity constraints are a big problem in the near-term, but may sort themselves out in the long-term.

twelvedogs 19 hours ago [-]
> main ingredients are sand and plastic

kinda ridiculous point, we're not running into gpu shortages because we don't have enough sand

renegade-otter 9 hours ago [-]
We already had a sand shortage. In 2019...

https://www.bbc.com/future/article/20191108-why-the-world-is...

Craighead 18 hours ago [-]
Even funnier, there are legitimate shortages of usable sand.
jiggawatts 14 hours ago [-]
That’s my point: the key inputs are not materials but the high tech machinery and the skills to operate them.
Draiken 9 hours ago [-]
Which is better because?

We can't copy/paste a new ASML no matter how hard you try (aside from open sourcing all of their IPs). Even if you do, by the time you copy one generation of machine, they're on a new generation and you now still have the bottleneck on the same place.

Not to mention that with these monopolies they can just keep increasing prices ad infinitum.

jiggawatts 2 hours ago [-]
ASML's secret sauce is not that secret or uncopyable. The Chinese are already working on their clone of the Twinscan tools.

Veritasium recently made a good video on the ASML machine design: https://youtu.be/MiUHjLxm3V0

The outcome may seem like magic, but the input is "simply" hard work and a big budget: billions of dollars and years of investment into tuning the parameters like droplet size, frequency, etc...

The interviews make it clear that the real reason ASML's machines are (currently) unique is that few people had the vision, patience, and money to fund what seemed at the time impossible. The real magic was that ASML managed to hang on by a fingernail and get a successful result before the money ran out.

Now that tin droplet EUV lasers have not only been demonstrated to be possible, but have become the essential component of a hugely profitable AI chip manufacturing industry, obtaining funding to develop a clone will be much easier.

forty 16 hours ago [-]
If the US is ready to start a war against Europe to invade Greenland, it's certainly because they need more sand and plastic? Of course in weight it's probably mostly sand and plastic, but the interesting bit probably needs palladium, copper, boron, cobalt, tungsten, etc.
rhubarbtree 15 hours ago [-]
Well, also for military purposes.

And general imperialism.

jiggawatts 14 hours ago [-]
Greenland is Trump’s Ukraine. He’s jealous of Putin, that is all.

There is nothing in Greenland worth breaking up the alliances with Europe over.

Trump is too stupid to realise this, he just wants land like it’s a Civ game.

PS: An entire rack of the most expensive NVIDIA equipment millions of dollars can buy has maybe a few grams of precious or rare metals in it. The cost of those is maybe a dollar or two. They don't even use gold any more!

The expensive part is making it, not the raw ingredients.

gylterud 8 hours ago [-]
One would then maybe suspect breaking up alliances with Europe is the point of the whole thing.
jiggawatts 1 hours ago [-]
Some of the best advice I've ever heard is to look at how people act and ignore how they claim they act or their stated reasons for doing so.

A corollary is that even a "technically false" model can better predict someone's actions than a "truthful one".

Trump may not be a Russian agent, but he acts like one consistently.

It's more effective to simply assume he's an agent of a foreign power, because that's the best predictor of his actions.

iwontberude 1 days ago [-]
Your point could have made sense but the amount of inference per request is also going up faster than the costs are going down.
supern0va 1 days ago [-]
The parent said: "Of course, by then we'll have much more capable models. So if you want SOTA, you might see the jump to $10-12. But that's a different value proposition entirely: you're getting significantly more for your money, not just paying more for the same thing."

SOTA improvements have been coming from additional inference due to reasoning tokens and not just increasing model size. Their comment makes plenty of sense.

manmal 22 hours ago [-]
Is it? Recent new models tend to need fewer tokens to achieve the same outcome. The days of ultrathink are coming to an end, Opus is well usable without it.
17 hours ago [-]
SecretDreams 1 days ago [-]
> But inference costs are dropping dramatically over time, and that trend shows no signs of slowing. So even if a task costs $8 today thanks to VC subsidies, I can be reasonably confident that the same task will cost $8 or less without subsidies in the not-too-distant future.

I'd like to see this statement plotted against current trends in hardware prices ISO performance. RAM, for example, is not meaningfully better than it was 2 years ago, and yet is 3x the price.

I fail to see how costs can drop while valuations for all major hardware vendors continue to go up. I don't think the markets would price companies this way if they thought all major hardware vendors were going to see margins shrink a la commodities, like you've implied.

santadays 1 days ago [-]
I've seen the following quote.

"The energy consumed per text prompt for Gemini Apps has been reduced by 33x over the past 12 months."

My thinking is that if Google can give away LLM usage (which is obviously subsidized) it can't be astronomically expensive, in the realm of what we are paying for ChatGPT. Google has their own TPUs and company culture oriented towards optimizing the energy usage/hardware costs.

I tend to agree with the grandparent on this, LLMs will get cheaper for what we have now level intelligence, and will get more expensive for SOTA models.

lelanthran 1 days ago [-]
Google is a special case - ever since LLMs came out I've been pointing out that Google owns the entire vertical.

OpenAI, Anthropic, etc are in a race to the bottom, but because they don't own the vertical they are beholden to Nvidia (for chips), they obviously have less training data, they need a constant influx of cash just to stay in that race to the bottom, etc.

Google owns the entire stack - they don't need nvidia, they already have the data, they own the very important user-info via tracking, they have millions, if not billions, of emails on which to train, etc.

Google needs no one, not even VCs. Their costs must be a fraction of the costs of pure-LLM companies.

viraptor 23 hours ago [-]
> OpenAI, Anthropic, etc are in a race to the bottom

There's a bit of nuance hiding in the "etc". Openai and anthropic are still in a race for the top results. Minimax and GLM are in the race to the bottom while chasing good results - M2.1 is 10x cheaper than Sonnet for example, but practically fairly close in capabilities.

lelanthran 13 hours ago [-]
> There's a bit of nuance hiding in the "etc". Openai and anthropic are still in a race for the top results.

That's not what is usually meant by "race to the bottom", is it?

To clarify, in this context I mean that they are all in a race to be the lowest margin provider.

They're at the bottom of the value chain - they sell tokens.

It's like being an electricity provider: if you buy $100 of electricity and produce 100 widgets, which you sell for $1k each, that margin isn't captured by the provider.

That's what being at the bottom of the value chain means.

viraptor 11 hours ago [-]
I get what it means, but it doesn't look to me like they're trying that yet. They don't even care that people buy multiple highest level plans to rotate them every week, because they don't provide a high enough tier for the existing customers. I don't see any price war happening. We don't know what their real margins are, but I don't see the race there. What signs do you see that Anthropic and Openai are in the race to the bottom?
lelanthran 10 hours ago [-]
> I don't see any price war happening. What signs do you see that Anthropic and Openai are in the race to the bottom?

There don't need to be signs of a race (or a price war), only signs of commodification; all you need is a lack of differentiation between providers for something to turn into a commodity.

When you're buying a commodity, there's no big difference between getting your commodity delivered by $PROVIDER_1 and getting your commodity delivered by $PROVIDER_2.

The models are all converging quality-wise. Right now the number of people who swear by OpenAI models are about the same as the number of people who swear by Anthropic models, which are about the same as the number of people who swear by Google's models, etc.

When you're selling a commodity, the only differentiation is in the customer experience.

Right now, sure, there's no price war, but right now almost everyone who is interested are playing with multiple models anyway. IOW, the target consumers are already treating LLMs as a commodity.

flyinglizard 1 days ago [-]
Gmail has 1.8b active users, each with thousands of emails in their inbox. The number of emails they can train on is probably in the trillions.
brokencode 1 days ago [-]
Email seems like not only a pretty terrible training data set, since most of it is marketing spam with dubious value, but also an invasion of privacy, since information could possibly leak about individuals via the model.
palmotea 1 days ago [-]
> Email seems like not only a pretty terrible training data set, since most of it is marketing spam with dubious value

Google probably even has an advantage there: filter out everything except messages sent from valid gmail account to valid gmail account. If you do that you drop most of the spam and marketing, and have mostly human-to-human interactions. Then they have their spam filters.

Terr_ 23 hours ago [-]
I'd upgrade that "probably" leak to "will absolutely" leak, albeit with some loss of fidelity.

Imagine industrial espionage where someone is asking the model to roleplay a fictional email exchange between named corporate figures in a particular company.

SoftTalker 18 hours ago [-]
> Google has ... company culture oriented towards optimizing the energy usage/hardware costs.

Google has a company culture of luring you in with freebies and then mining your data to sell ads.

AdrianB1 23 hours ago [-]
> if Google can give away LLM usage (which is obviously subsidized) it can't be astronomically expensive

There is a recent article by Linus Sebastian (LTT) talking about Youtube: it is almost impossible to support the cost to build a competitor because it is astronomically expensive (vs potential revenue)

SecretDreams 1 days ago [-]
I do not disagree they will get cheaper, but I'm pointing out that none of this is being reflected in hardware pricing. You state LLMs are becoming more optimized (less expensive). I agree. This should have a knock-on effect on hardware prices, but it doesn't. Where is the disconnect? Are hardware prices a lagging indicator? Is Nvidia still a 5 trillion dollar company if we see another 33x improvement in "energy consumed per text prompt"?
zozbot234 1 days ago [-]
Jevons paradox. As AI gets more efficient its potential scope expands further and the hardware it runs on becomes even more valuable.

BTW, the absolute lowest "energy consumed per logical operation" is achieved with so-called 'neuromorphic' hardware that's dog slow in latency terms but more than compensates with extreme throughput. (A bit like an even more extreme version of current NPU/TPUs.) That's the kind of hardware we should be using for AI training once power use for that workload is measured in gigawatts. Gaming-focused GPUs are better than your average CPU, but they're absolutely not the optimum.

PaulHoule 1 days ago [-]
It's not the hardware getting cheaper, it's that LLMs were developed when we really didn't understand how they worked, and there is still some room to improve the implementations, particularly doing more with less RAM... And that's everything from doing more with fewer weights to things like FP16, not to mention that if you can 2x the speed you can get twice as much done with the same RAM and all the other parts.
SecretDreams 1 days ago [-]
Improvements in LLM efficiency should be driving hardware to get cheaper.

I agree with everything you've said, I'm just not seeing any material benefit to the statement as of now.

sothatsit 1 days ago [-]
Inference costs falling 2x doesn’t decrease hardware prices when demand for tokens has increased 10x.
PaulHoule 1 days ago [-]
It's the ratio. If revenue goes up 10x you can afford 10x more hardware if you can afford to do it all.
hug 1 days ago [-]
> I'd like to see this statement plotted against current trends in hardware prices ISO performance.

Prices for who? The prices that are being paid by the big movers in the AI space, for hardware, aren't sticker price and never were.

The example you use in your comment, RAM, won't work: It's not 3x the price for OpenAI, since they already bought it all.

xpe 1 days ago [-]
> I fail to see how costs can drop while valuations for all major hardware vendors continue to go up. I don't think the markets would price companies in this way if the thought all major hardware vendors were going to see margins shrink a la commodity like you've implied.

This isn't hard to see. A company's overall profits are influenced – but not determined – by the per-unit economics. For example, increasing volume (quantity sold) at the same per-unit profit leads to more profits.

doctorpangloss 1 days ago [-]
> I fail to see how costs can drop while valuations for all major hardware vendors continue to go up.

yeah. valuations for hardware vendors have nothing to do with costs. valuations are a meaningless thing to integrate into your thinking about something objective like, will the retail costs of inference trend down (obviously yes)

mcphage 1 days ago [-]
> So even if a task costs $8 today thanks to VC subsidies, I can be reasonably confident that the same task will cost $8 or less without subsidies in the not-too-distant future.

The same task on the same LLM will cost $8 or less. But that's not what vendors will be selling, nor what users will be buying. They'll be buying the same task on a newer LLM. The results will be better, but the price will be higher than the same task on the original LLM.

glemion43 1 days ago [-]
[dead]
1 days ago [-]
oceanplexian 1 days ago [-]
> Their pricing models are simply not sustainable. I hope everyone realizes that the current LLMs are subsidized, like your Seamless and Uber was in the early days.

If you run these models at home it's easy to see how this is totally untrue.

You can build a pretty competent machine that will run Kimi or Deepseek for $10-20k and generate an unlimited amount of tokens all day long (I did a budget version with an Epyc machine for about $4k). Amortize that over a couple years, and it's cheaper than most people spend on a car payment. The pricing is sustainable, and that's ignoring the fact that these big model providers are operating on economies of scale, they're able to parallelize the GPUs and pack in requests much more efficiently.

utopiah 15 hours ago [-]
> run these models at home

Damn what kind of home do you live in, a data center? Teasing aside maybe a slightly better benchmark is what sufficiently acceptable model (which is not objective but one can rely on arguable benchmarks) you can run via an infrastructure that is NOT subsidized. That might include cloud providers e.g. OVH or "neo" clouds e.g. HF but honestly that's tricky to evaluate as they tend to all have pure players (OpenAI, Anthropic, etc) or owners (Microsoft, NVIDIA, etc) as investors.

Unit327 21 hours ago [-]
Ignores the cost of model training, R&D, managing the data centers and more. OpenAI etc regularly admit that all their products lose money. Not to mention that covering their costs isn't enough: they have to pay back all those investors while actually generating a profit at some point in the future.
Denzel 20 hours ago [-]
Uhm, you actually just proved their point if you run the numbers.

For simplicity’s sake we’ll assume DeepSeek 671B on 2 RTX 5090 running at 2 kW full utilization.

In 3 years you’ve paid $30k total: $20k for system + $10k in electric @ $0.20/kWh

The model generates 500M-1B tokens total over 3 years @ 5-10 tokens/sec. Understand that’s total throughput for reasoning and output tokens.

You’re paying $30-$60/Mtok - more than both Opus 4.5 and GPT-5.2, for less performance and less features.

And like the other commenters point out, this doesn’t even factor in the extra DC costs when scaling it up for consumers, nor the costs to train the model.

Of course, you can play around with parameters of the cost model, but this serves to illustrate it’s not so clear cut whether the current AI service providers are profitable or not.
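To make that arithmetic easy to check, here's a minimal back-of-the-envelope sketch in Python using only the assumed inputs above (a $20k system, 2 kW draw, $0.20/kWh, 5-10 tokens/sec over 3 years); none of these are measured figures.

    # Rough cost-per-million-tokens model for the local rig described above.
    # All inputs are the comment's assumptions, not measurements.
    HOURS = 24 * 365 * 3                 # three years of continuous operation
    system_cost = 20_000                 # USD, hardware
    electricity = 2.0 * HOURS * 0.20     # 2 kW at $0.20/kWh -> ~$10.5k
    total_cost = system_cost + electricity

    for tps in (5, 10):                  # assumed tokens/sec, low and high end
        mtok = tps * 3600 * HOURS / 1e6  # total million tokens over 3 years
        print(f"{tps} tok/s: {mtok:,.0f} Mtok, ~${total_cost / mtok:,.0f}/Mtok")
    # Prints roughly $64/Mtok at 5 tok/s and $32/Mtok at 10 tok/s,
    # which is where the $30-$60/Mtok range above comes from.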

kingstnap 15 hours ago [-]
5 to 10 tokens per second is a bungus-tier rate.

https://developer.nvidia.com/blog/nvidia-blackwell-delivers-...

NVIDIA's 8xB200 gets you 30k tps on DeepSeek 671B; at maximum utilization that's 1 trillion tokens per year. At a dollar per million tokens that's $1 million.

The hardware costs around $500k.

Now ideal throughput is unlikely, so let's say you get half that. It's still 500B tokens per year.

Gemini 3 Flash is like $3/million tokens and I assume it's a fair bit bigger, maybe 1 to 2T parameters. I can sort of see how you can get this to work with margins, as the AI companies repeatedly assert.
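A similar sanity check of those numbers, again in Python: the 30k tok/s throughput and ~$500k hardware figures are the ones quoted above, and the $1/Mtok price is purely illustrative.

    # Yearly token output and revenue for the quoted 8xB200 setup.
    SECONDS_PER_YEAR = 365 * 24 * 3600
    peak_tps = 30_000                    # quoted aggregate throughput
    price_per_mtok = 1.00                # USD per million tokens, illustrative

    for utilization in (1.0, 0.5):
        tokens = peak_tps * utilization * SECONDS_PER_YEAR
        revenue = tokens / 1e6 * price_per_mtok
        print(f"{utilization:.0%}: {tokens / 1e12:.2f}T tok/yr, ~${revenue:,.0f}/yr")
    # ~$0.95M/yr at full utilization, ~$0.47M/yr at half, against ~$500k of hardware.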

Denzel 14 hours ago [-]
Cool, that potential 5x cost improvement just got delivered this year. A company can continue running the previous generation until EOL, or take a hit by writing off the residual value - either way they’ll have a mixed cost model that puts their token cost somewhere in the middle between previous and current gens.

Also, you’re missing material capex and opex costs from a DC perspective. Certain inputs exhibit diseconomies of scale when your demand outstrips market capacity. You do notice electricity cost is rising and companies are chomping at the bit to build out more power plants, right?

Again, I ran the numbers for simplicity’s sake to show it’s not clear cut that these models are profitable. “I can sort of see how you can get this to work” agrees with exactly what I said: it’s unclear, certainly not a slam dunk.

Especially when you factor in all the other real-world costs.

We’ll find out soon enough.

surajrmal 8 hours ago [-]
Google runs everything on their TPUs, which are substantially less costly to make and use less energy to run. While I'm sure OpenAI and others are bleeding money by subsidizing things, I'm not entirely sure that's true for Google (despite it actually being easier for them to do so if they wanted to).
8 hours ago [-]
1 days ago [-]
lelanthran 1 days ago [-]
> Amortize that over a couple years, and it's cheaper than most people spend on a car payment.

I'm not parsing that: do you mean that the monthly cost of running your own 24x7 is less than the monthly cost of a car payment?

Whether true or false, I don't get how that is relevant to proving either that the current LLMs are not subsidised, or proving that they are.

franktankbank 1 days ago [-]
If true it means there's a lower bound that is profitable at least taking into account current apparent purchasing costs and energy consumption.
snarf21 1 days ago [-]
I'm not sure. I asked one about a potential bug in iOS 26 yesterday and it told me that iOS 26 does not exist and that I must have meant iOS 16. iOS 26 was announced last June and has been live since September. Of course, I responded that the current iOS version is 26 and got the obligatory meme of "Of course, you are right! ramble ramble ramble...."
amluto 1 days ago [-]
Was this a GPT model? OpenAI seems to have developed an almost-acknowledged inability to usefully pre-train a model after mid-2024. The recent GPT versions are impressively lacking in newer knowledge.

The most amusing example I’ve seen was asking the web version of GPT-5.1 to help with an installation issue with the Codex CLI (I’m not an npm user so I’m unfamiliar with the intricacies of npm install, and Codex isn’t really an npm package, so the whole use of npm is rather odd). GPT-5.1 cheerfully told me that OpenAI had discontinued Codex and hallucinated a different, nonexistent program that I must have meant.

(All that being said, Gemini is very, very prone to hallucinating features in Google products. Sometimes I wonder whether Google should make a list of Gemini-hallucinated Google features and use the list to drive future product development.)

buu700 23 hours ago [-]
Gemini is similar. It insists that information from before its knowledge cutoff is still accurate unless explicitly told to search for the latest information before responding. Occasionally it disagrees with me on the current date and makes sarcastic remarks about time travel.

One nice thing about Grok is that it attempts to make its knowledge cutoff an invisible implementation detail to the user. Outdated facts do sometimes slip through, but it at least proactively seeks out current information before assuming user error.

franktankbank 1 days ago [-]
LLMs solve the naming problem, so now there's just 1 thing wrong with software development. I can't tell if it's a really horrible idea that ultimately leads to a trainwreck, or freedom!
doug_durham 1 days ago [-]
Sure. You have to be mindful of the training cut off date for the model. By default models won't search the web and rely on data baked into their internal model. That said the ergonomics of this is horrible and a huge time waste. If I run into this situation I just say "Search the web".
bluGill 21 hours ago [-]
If the training cutoff is before iOS 26 then the correct answer is 'I don't know anything about it, but it is reasonable to think it will exist soon'. Saying 'of course you are right' is a lie
15 hours ago [-]
realharo 1 days ago [-]
That will only work as long as there is an active "the web" to search. Unless the models get smart enough to figure out the answer from scratch.
jerezzprime 1 days ago [-]
Let's imagine a scenario. For your entire life, you have been taught to respond to people in a very specific way. Someone will ask you a question via email and you must respond with two or three paragraphs of useful information. Sometimes when the person asks you a question, they give you books that you can use, sometimes they don't.

Now someone sends you an email and asks you to help them fix a bug in Windows 12. What would you tell them?

soco 1 days ago [-]
I would say "what the hell is windows 12". And definitely not "but of course, excellent question, here's your brass mounted windows 12 wheeler bug fixer"
mock-possum 1 days ago [-]
I mean I would want to tell them that windows 11 is the most recent version of windows… but also I’d check real quick to make sure windows 12 hadn’t actually come out without me noticing.
Terr_ 23 hours ago [-]
> check real quick

"Hey LLMBot, what's the newest version of Very Malicious Website With Poison Data?"

kaffekaka 1 days ago [-]
The other way around, but a month or so ago Claude told me that a problem I was having was likely caused by my Fedora version "since fedora 42 is long deprecated".
palmotea 24 hours ago [-]
> The other way around, but a month or so ago Claude told me that a problem I was having was likely caused by ny fedora version "since fedora 42 is long deprecated".

Well, obviously, since Fedora 42 came out in 1942, when men still wore hats. Attempting to use such an old, out of style Linux distro is just a recipe for problems.

kaffekaka 11 hours ago [-]
I apologize for the confusion, you are absolutely right!
PaulHoule 1 days ago [-]
You are better off talking to Google's AI mode about that sort of thing because it runs searches. Does great talking about how the Bills are doing because that's a good example where timely results are essential.

I haven't found any LLM where I totally trust what it tells me about Arknights, like there is no LLM that seems to understand how Scavenger recovers DP. Allegedly there is a good Chinese Wiki for that game which I could crawl and store in a Jetbrains project and ask Junie questions about but I can't resolve the URL.

perardi 1 days ago [-]
Even with search mode, I’ve had some hilarious hallucinations.

This was during the Gemini 2.5 era, but I got some just bonkers results looking for Tears of the Kingdom recipes. Hallucinated ingredients, out-of-nowhere recipes, and transposing Breath of the Wild recipes and effects into Tears of the Kingdom.

_puk 1 days ago [-]
You also have to be so exact..

Literally just searched for something, slight typo.

A Vs B type request. Search request comes back with "sorry, no information relevant to your search".

Search results are just a spammy mess.

Correct the typo and you get a really good insight.

cpursley 1 days ago [-]
Which one? Claude (and to some extent, Codex) are the only ones which actually work when it comes to code. Also, they need context (like docs, skills, etc) to be effective. For example: https://github.com/johnrogers/claude-swift-engineering
Night_Thastus 17 hours ago [-]
Yep. The goal is to build huge amounts of hype and demand, get their hooks into everyone, and once they've killed off any competition and built up the walls then they crank up the price.

The prices now are completely unsustainable. They'd go broke if it weren't for investors dumping their pockets out. People forget that what we have now only exists because of absurd amounts of spending on R+D, mountains of dev salaries, huge data centers, etc. That cannot go on forever.

brightball 1 days ago [-]
I've been explaining that to people for a bit now as well as a strong caution for how people are pricing tools. It's all going to go up once dependency is established.

The AWS price increase on 1/5 for GPU's on EC2 was a good example.

renegade-otter 1 days ago [-]
AWS in general is a good example. It used to be much more affordable and better than boutique hosting. Now AWS costs can easily spiral out of control. Somehow I can run a site for $20 on Digital Ocean, but with AWS it always ends up $120.

RDS is a particular racket that will cost you hundreds of dollars for a rock bottom tier. Again, Digital Ocean has options below $20 per month that will serve many a small business. And yet, AWS is the default goto at this point because the lockin is real.

xienze 24 hours ago [-]
> RDS is a particular racket that will cost you hundreds of dollars for a rock bottom tier. Again, Digital Ocean has options below $20 per month that will serve many a small business. And yet, AWS is the default goto at this point because the lockin is real.

This is a little disingenuous though. Yeah you can run a database server on DO cheaper than using RDS, but you’ll have to roll all that stuff that RDS does yourself: automatic backups/restores, tuning, monitoring, failover, etc. etc. I’m confident that the engineers who’ve set up those RDS servers and the associated plumbing/automation have done a far better job of all that stuff than I ever could unless I spent a lot of time and effort on it. That’s worth a premium.

threethirtytwo 1 days ago [-]
The pricing will go down once the hardware prices go down. Historically hardware prices always go down.

Once the hardware prices go low enough pricing will go down to the point where it doesn't even make sense to sell current LLMs as a service.

I would imagine it's possible that, if the aforementioned future ever comes to pass, there will be new forms of ultra high tier compute running other types of AI more powerful than an LLM. But I'm pretty sure AI in its current state will one day be running locally on desktops and/or handhelds, with the former being more likely.

notTooFarGone 15 hours ago [-]
Are hardware prices going to keep going down when each new generation is less and less of an improvement?
threethirtytwo 7 hours ago [-]
Yeah it’s not just a demand side thing. Costs go down as well. Every leap in new hardware costs a lot in initial investment and that’s included in a lot of the pricing.
_puk 1 days ago [-]
Hopefully we'll get some real focus on making LLMs work amazingly well with limited hardware.. the knock on effect of that would be amazing when the hardware eventually drops in price.
scuff3d 1 days ago [-]
We're building a house on sand. Eventually the whole damn thing is going to come crashing down.
djeastm 7 hours ago [-]
>I hope everyone realizes that the current LLMs are subsidized

This is why I'm using it now as much as possible to build as much as possible in the hopes of earning enough to afford the later costs :D

DamnInteresting 8 hours ago [-]
> I hope everyone realizes that the current LLMs are subsidized, like your Seamless and Uber was in the early days.

A.I. == Artificially Inexpensive

Kuinox 1 days ago [-]
It would mean that inference is not profitable. Calculating inference costs shows it's profitable, or close to it.
renegade-otter 1 days ago [-]
Inference costs have in fact been crashing, going from astronomical to... lower.

That said, I am not sure that this indicator alone tells the whole story, if not hides it - sort of like EBITDA.

Kuinox 1 days ago [-]
I think there will still be cheap inference; what will rise in cost is frontier model subscriptions. That is the thing that is not profitable.
wvenable 24 hours ago [-]
> I hope everyone realizes that the current LLMs are subsidized

Hell ya, get in and get out before the real pricing comes in.

Terr_ 23 hours ago [-]
"I'm telling ya kid, the value of nostalgia can only go up! This is your chance to get in on the ground-floor so you can tell people about how things used to be so much better..."
ssss11 16 hours ago [-]
Wait for the ads
turtletontine 24 hours ago [-]
On the bright side, I do think at some point after the bubble pops, we’ll have high quality open source models that you can run locally. Most other tech company business plans follow the enshittification cycle [1], but the interchangeability of LLMs makes it hard to imagine they can be monopolized in the same way.

1: I mean this in the strict sense of Cory Doctorow’s theory (https://en.wikipedia.org/wiki/Enshittification?wprov=sfti1#H...)

featherless 1 days ago [-]
Except most of those services don't have at-home equivalents that you can increasingly run on your own hardware.
oceanplexian 1 days ago [-]
I run models with Claude Code (Using the Anthropic API feature of llama.cpp) on my own hardware and it works every bit as well as Claude worked literally 12 months ago.

If you don't believe me and don't want to mess around with used server hardware you can walk into an Apple Store today, pick up a Mac Studio and do it yourself.
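
Before wiring up Claude Code, you can sanity-check that the local endpoint really speaks the Anthropic API with the official Python SDK (a rough sketch; the port, model name, and dummy key are assumptions about your own llama.cpp server setup):

    import anthropic

    # point the official SDK at the local Anthropic-compatible server
    client = anthropic.Anthropic(
        base_url="http://localhost:8080",   # wherever your llama.cpp server listens
        api_key="not-needed-locally",       # local servers typically ignore the key
    )

    message = client.messages.create(
        model="gpt-oss-120b",               # whatever model name the server exposes
        max_tokens=256,
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(message.content[0].text)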

Eggpants 1 days ago [-]
I’ve been doing the same with GPT-OSS-120B and have been impressed.

Only gotcha is Claude Code expects a 200k context window while that model supports 130k or so at most. I have to do a /compress when it gets close. I'll have to see if there is a way to set the max context window in CC.

Been pretty happy with the results so far as long as I keep the tasks small and self contained.

petesergeant 21 hours ago [-]
I've been making use of gpt-oss-120b extensively for a range of projects, commercial and private, because providers on OpenRouter make it essentially free and instant, and it's roughly as capable as o4-mini was in my experience.

That said, I'm a little surprised to hear you're having great success with it as a coding agent. It's "obviously" worse than the frontier models, and even they can make blindingly dumb decisions pretty regularly. Maybe I should give it a shot.

icedchai 23 hours ago [-]
Whats your preferred local model?
1 days ago [-]
chiengineer 1 days ago [-]
They just need to figure out the KV cache, which has turned into a magic black box; after that it'll be fine.
startupsfail 1 days ago [-]
The results are repeatable. Models are performing with predictable error rates on the tasks these models have been trained and tested on.
makach 1 days ago [-]
AI is built to be non-deterministic. Variation is built into each response. If it wasn't I would expect AI to have died out years ago.

The pricing and quality of Copilot and Codex (which I am experienced in) feel like they are getting worse, but I suspect it may be that my expectations are getting higher as the technology matures...

bee_rider 1 days ago [-]
This seems like a kind of odd test.

> I wrote some Python code which loaded a dataframe and then looked for a nonexistent column.

    import pandas as pd

    df = pd.read_csv('data.csv')
    df['new_column'] = df['index_value'] + 1  # there is no column 'index_value'
> I asked each of them [the bots being tested] to fix the error, specifying that I wanted completed code only, without commentary.

> This is of course an impossible task—the problem is the missing data, not the code. So the best answer would be either an outright refusal, or failing that, code that would help me debug the problem.
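
For concreteness, the "code that would help me debug the problem" could be something as simple as this (my sketch, not something the article shows):

    import pandas as pd

    df = pd.read_csv('data.csv')
    if 'index_value' not in df.columns:
        # surface the real problem instead of silently guessing at intent
        raise KeyError(
            f"'index_value' not found; available columns: {list(df.columns)}"
        )
    df['new_column'] = df['index_value'] + 1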

So his hoped-for solution is that the bot should defy his prompt (since refusal is commentary), and not fix the problem.

Maybe instructability has just improved, which is a problem for workflows that depend on misbehavior from the bot?

It seems like he just prefers how GPT-4 and 4.1 failed to follow his prompt, over 5. They are all hamstrung by the fact that the task is impossible, and they aren’t allowed to provide commentary to that effect. Objectively, 4 failed to follow the prompts in 4/10 cases and made nonsense changes in the other 6; 4.1 made nonsense changes; and 5 made nonsense changes (based on the apparently incorrect guess that the missing ‘index_value’ column was supposed to hold the value of the index).

samrus 1 days ago [-]
Trying to follow invalid/impossible prompts by producing an invalid/impossible result and pretending it's all good is a regression. I would expect a confident coder to point out that the prompt/instruction was invalid. This test is valid; it highlights sycophantism.
bee_rider 1 days ago [-]
I know “sycophantism” is a term of art in AI, and I’m sure it has diverged a bit from the English definition, but I still thought it had to do with flattering the user?

In this case the desired response is defiance of the prompt, not rudeness to the user. The test is looking for helpful misalignment.

zahlman 24 hours ago [-]
> I still thought it had to do with flattering the user?

Assuming the user to be correct, and ignoring contradictory evidence to come up with a rationalization that favours the user's point of view, can be considered a kind of flattery.

bee_rider 22 hours ago [-]
But we could use this plausible but hoop-jumping definition of sycophancy… or we could just use a straightforward understanding of alignment: the newer bots are just sticking closer to the user request.
samrus 1 days ago [-]
I believe the LLM is being sycophantic here because it's trying to follow a prompt even though the basis of the prompt is wrong. Emperor's new clothes kind of thing.
Terr_ 23 hours ago [-]
I'm inclined to view it less as a desire to please humans, and more like a "the show must go on" bias in the mad libs machine.

A kind of improvisational "yes and" that emerges from training, which seems sycophantic because that's one of the most common ways to say it.

cowsandmilk 12 hours ago [-]
“The Emperor Has No Clothes” fits squarely within the definition of sycophancy.
ComplexSystems 18 hours ago [-]
I don't think this is odd at all. This situation will arise literally hundreds of times when coding some project. You absolutely want the agent - or any dev, whether real or AI - to recognize these situations and let you know when interfaces or data formats aren't what you expect them to be. You don't want them to just silently make something up without explaining somewhere that there's an issue with the file they are trying to parse.
bee_rider 15 hours ago [-]
I agree that I’d want the bot to tell me that it couldn’t solve the problem. However, if I explicitly ask it to provide a solution without commentary, I wouldn’t expect it to do the right thing when the only real solution is to provide commentary indicating that the code is unfixable.

Like if the prompt was “don’t fix any bugs and just delete code at random” we wouldn’t take points off for adhering to the prompt and producing broken code, right?

ComplexSystems 14 hours ago [-]
Sometimes you will tell agents (or real devs) to do things they can't actually do because of some mistake on your end. Having it silently change things and cover the problem up is probably not the best way to handle that situation.
franktankbank 1 days ago [-]
IOW not a competent developer because they can't push back, not unlike a lot of incompetent devs.
minimaxir 1 days ago [-]
I suspect 99% of coding agents would be able to say "hey wait, there's no 'index_value' column, here's the correct input.":

    df['new_column'] = df.index + 1

The original bug sounds like a GPT-2 level hallucination IMO. The index field has been accessible in pandas since the beginning and even bad code wouldn't try an 'index_value' column.
bee_rider 1 days ago [-]
My thought process, if someone handed me this code and asked me to fix it, would be that they probably didn’t expect df['index_value'] to hold df.index

Just because, well, how’d the code get into this state? 'index_value' must have been a column that held something; having it just be equal to df.index seems unlikely because, as you mention, that’s always been available. I should probably check the change history to figure out when 'index_value' was removed. Or ask the person about what that column meant, but we can’t do that if we want to obey the prompt.

reedf1 1 days ago [-]
The model (and you) have inferred, completely without context, that index_value is meant to somehow map to the dataframe index. What if this is raw .csv data from another system? I work with .csv files from financial indices - index_value (or sometimes index_level) conveys a completely different meaning in this case.
zahlman 24 hours ago [-]
This inference is not at all "without context". It's based on the meaning of "index", and the contextual assumption that reasonable people put things into CSV columns whose intended purpose aligns with the semantic content of the column's title.
icedchai 23 hours ago [-]
Without seeing a sample of the data, it's ambiguous. Example: It could be the value of an index fund.
minimaxir 1 days ago [-]
That is a fair counterpoint, but if that were the case, there would always be more context accessible, e.g. the agent could do a `df.head()` to get an overview of the data and columns (which would indicate financial indices) or there would be code after that which would give strong signal that the intent is financial indices and not the DataFrame index.
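
Something as cheap as this, run before touching the line, turns the guess into an informed decision (a sketch):

    import pandas as pd

    df = pd.read_csv('data.csv')
    print(df.head())    # peek at the actual rows
    print(df.dtypes)    # column names and types
    # only now decide whether 'index_value' was meant to be the DataFrame
    # index or a domain column such as a financial index level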

This is why vague examples in blog posts aren't great.

anttiharju 18 hours ago [-]
I like AI for software development.

Sometimes I am uncertain whether it's an absolute win. Analogy: I used to use Huel to save time on lunches to have more time to study. Turns out, lunches were not just refueling sessions but ways to relax. So I lost on that relaxation time and it ended up being +-0 long-term.

AI for sure is net positive in terms of getting more done, but it's way too easy to gloss over some details and you'll end up backtracking more.

"Reality has a surprising amount of detail" or something along those lines.

energy123 17 hours ago [-]
I find the hardest thing is explaining what you want to the LLM. Even when you think you've done it well, you probably haven't. It's like a genie, take care with what you wish for.

I put great effort into maintaining a markdown file with my world model (usecases x principles x requirements x ...) pertaining to the project, with every guardrail tightened as much as possible, and every ambiguity and interaction with the user or wider world explained. This situates the project in all applicable contexts. That 15k token file goes into every prompt.

anttiharju 2 hours ago [-]
> It's like a genie, take care with what you wish for.

I used to be stuck with this thought. But I came across this delightful documentation RAG project and got to chat with the devs. Idea was that people can ask natural language questions and they get shown the relevant chunk of docs for that query. They were effectively pleading to a genie if I understood it right. Worse yet, the genie/LLM model kept updating weekly from the cloud platform they were using.

But the devs were engineers. They had a sample set of docs and sample set of questions that they knew the intended chunk for. So after model updates they ran the system through this test matrix and used it as feedback for tuning the system prompt. They said they had been doing it for a few months with good results, search remaining capable over time despite model changes.
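
As I understood it, the harness was roughly this shape (my own sketch; retrieve_chunk and the test cases are hypothetical stand-ins for whatever their system actually exposes):

    # hypothetical regression suite, run after every weekly model update
    test_cases = [
        {"question": "How do I rotate an API key?", "expected": "docs/auth.md#rotating-keys"},
        {"question": "What are the upload size limits?", "expected": "docs/limits.md#uploads"},
    ]

    def run_eval(retrieve_chunk):
        # retrieve_chunk(question) returns the id of the chunk the RAG system picked
        hits = sum(1 for case in test_cases
                   if retrieve_chunk(case["question"]) == case["expected"])
        return hits / len(test_cases)

    # a drop in the score after an update is the signal to re-tune the system prompt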

While these agents.md etc. appear to be useful, I'm not sure they're going to be the key to long-term success. Maybe with a model change they become much less effective and the previous hours spent on them are wasted.

I think something more verifiable/strict is going to be the secret sauce for LLM agents. Engineering. I have heard Claude Code has decent scaffolding. Haven't gotten the chance to play with it myself though.

I liked the headline from some time ago that 'what if LLMs are just another piece of technology'?

hu3 53 minutes ago [-]
> That 15k token file goes into every prompt.

Same here. Large AGENTS.md file in current project.

Today I started experimenting with splitting it into smaller SKILL.md files, but I'm wary that the agent might mistakenly decide to not load some files.

pixl97 7 hours ago [-]
>I find the hardest thing is explaining what you want to the LLM.

Honestly this isn't that much different than explaining to human programmers. Quite often we assume the programmer is going to automatically figure out the ambiguous things, but commonly it leads to undefined behavior or bugs in the product.

Most of the stuff I do is as a support engineer working directly with the client on identifying bugs, needed features, and shortcomings in the application. After a few reports I made went terribly wrong when the feature came out, I've learned to be overly detailed and concise.

greazy 10 hours ago [-]
Do I read correctly that your md file is 15k tokens? how many words is that? that's a lot!
energy123 9 hours ago [-]
Roughly 11k words by the 0.75 words/token rule of thumb.

It's a lot, but for quick projects I don't do this. Only for one important project that I have ownership of for over a year.

Maintaining this has been worth it. It makes the codebase more stable, it's like the codebase slowly converges to what I want (as defined in the doc) the more inferences I run, rather than becoming spaghetti.

jijijijij 10 hours ago [-]
For the life of me, I don't get the productivity argument. At least from a worker perspective.

I mean, it's at best a very momentary thing. Expectations will adapt and the time gained will soon be filled with more work. The net gain in free time will ultimately be zero, optimistically, but I strongly suspect general life satisfaction will be much lower, since you inherently lose confidence in creation and agency, and the experience of self-efficacy is therefore lessened, too. Even if external pressure isn't increased, the brain will adapt to what's considered a new normal for lazy. Everybody hates emptying the dishwasher; the aversion threshold is the same as for washing dishes by hand.

And yeah, in the process you atrophy your problem solving skills and your tolerance for frustration. I think we will collectively learn how important some of these "inefficiencies" are for gaining knowledge and wisdom. It's reminiscent of Goodhart's Law, again and again. "Output" is an insufficient metric to measure performance and value creation.

Costs for using AI services do not at all reflect the actual costs of running them sustainably. So these questionable "productivity gains" should be contrasted with actual costs, in any case. Compare AI to (cheap, plastic) 3D printing, which is factually transformative, revolutionary tech in almost every (real) industry; I don't see how trillions in investment and the absurd wasting of energy and resources could ever be justified by what AI offers, or even by what is imaginable for it (considering its inherent limitations).

anttiharju 5 hours ago [-]
For me it boils down to that I'm much less tied to tech stacks I've previously worked on and can pick up unfamiliar ones quicker.

Democratization they call it.

jijijijij 3 hours ago [-]
> and can pick up unfamiliar ones quicker

Do you tho? Does "picking up" a skill mean the same thing it used to? Do you fact check all the stuff AI tells you? How certain are you that you are learning correct information? Struggling through unfamiliar topics, making mistakes and figuring out solutions by testing internal hypotheses is a big part of how deep, explanatory knowledge is acquired by human brains. Or maybe it's always been 10,000 kilowatt-hours, after all.

Even if you did actually learn different tech stacks faster with AI telling you what to do, it's still a momentary thing, since these systems are fundamentally poisoned by their own talk, so shit's basically frozen in time, still limited to pre-AI-slop information, or requires insane amounts of manual sanitation. And who's gonna write the content for clean new training data anyway?

Mind you, I am talking about the possible prospect of this technology and a cost-value evaluation. Maybe I am grossly ignorant/uninformed, but to me all of it just doesn't add up, if you project inherent limitations onto wider adoption and draw the obvious logical conclusions. That is, if humanity isn't stagnating and new knowledge is created.

anttiharju 2 hours ago [-]
> Do you tho?

Recent success I've been happy with has been moving my laptop config to Nix package manager.

A common complaint people have is Nix the language. It's a bit awkward, "JSON-like". I probably would not have had the patience to engage with it in the little time I have available. But AI mostly gets the syntax right, allowing me to engage with it, and I think I have a decent grasp of the ecosystem and even the syntax by this point. It's been roughly a year, I think.

Like, I don't know all the constructs available in the language, but I can still reason about things as a commoner: I probably don't want to define my username multiple times in my config, esp. when trying to have the setup be reproducible on an arbitrary set of personal laptops. So for a new laptop I just define one new array item as a source of truth and everything downstream just works.

I feel like with AI the architectural properties are more important than the low-level details. Nix has the nice property of reproducibility/declarativeness. You could for sure put even more effort into alternative solutions, but if they lack reproducibility I think you're going to keep suffering, no matter how much AI you have available.

I am certain my config has some silliness in it that someone more experienced would pick out, but ultimately I'm not sure how much that matters. My config is still reproducible enough that I have my very custom env up and running after a few commands on an arbitrary MacBook.

> Does "picking up" a skill mean the same thing it used to?

I personally feel confident in helping people move their config to Nix, so I would say yes. But it's a big question.

> Do you fact check all the stuff AI tells you? How certain are you, you are learning correct information?

Well, usually I have a more or less testable setup so I can verify whether the desired effect was achieved. Sometimes things don't work, which is when I start reaching for the docs or source code of for example the library I'm trying to use.

> Struggling through unfamiliar topics, making mistakes and figuring out solutions by testing internal hypotheses is a big part of how deep, explanatory knowledge is acquired for human brains.

I don't think this is lost. I iterate a lot. I think the Claude Code author does too; didn't they have something like +40k-38k lines of changes over the past year or so? I still use GitHub issues to track what I want to get done when a solution is difficult to reach, and comment progress on them. Recently I did that with my struggles in cross-compiling Rust from Linux to macOS. It's just easier to iterate and I don't need to sleep overnight to get unstuck.

> since these systems are fundamentally poisoned by their own talk,

_I_ feel like this goes into overthinking territory. I think software and systems will still die by their merits. Same applies to training data. If bugs regularly make it to end users and a competing solution has fewer defects, I don't think the buggy solution will stay afloat any longer thanks to AI. So, I'd argue, the training data will be ok. Paradigms can still exist. Like Theory of Modern Go discouraging globals and init functions. And I think this was something that Tesla also had to deal with before modern LLMs? As in, not all drivers drove well enough that they wanted to use their data for training the autopilot.

I really enjoyed your reply, thank you.

skzo 13 hours ago [-]
That's a brilliant analogy, I had the same experience with Huel and AI Assistants
mawadev 11 hours ago [-]
Why do I feel like I've just read a covert advertisement?
cons0le 11 hours ago [-]
Sometimes I feel like the people here live on a different planet. I can't imagine what type of upbringing I would have to have, to start thinking that "eating food" is an engineering problem to be solved.

This might be a controversial opinion, but I for one, like to eat food. In fact I even do it 3 times a day.

Don't yall have a culture that's passed down to you through food? Family recipes? Isn't eating food a central aspect of socialization? Isn't socialization the reason people wanted to go to the office in the first place?

Maybe I'm biased. I love going out to eat, and I love cooking. But it's more than that. I garden. I go to the farmers market. I go to food festivals.

Food is such an integral part of the human experience for me, that I can't imagine "cutting it out". And for what? So you can have more time to stare at the screen you already stare at all day? So you can look at 2% more lines of javascript?

When I first saw commercials for that product, I truly thought it was like a medical/therapeutic thing, for people that have trauma with food. I admit, the food equivalent of an i.v. drip does seem useful for people that legitimately can't eat.

anttiharju 2 hours ago [-]
> I can't imagine what type of upbringing I would have to have, to start thinking that "eating food" is an engineering problem to be solved.

I was really busy with my master's degree, ok? :D

AstroBen 9 hours ago [-]
I like eating, I just don't like spending so much time and decision fatigue on prep. I'm probably the target audience for Huel but I don't actually think it's good for you

90% of meals aren't some special occasion, but I still need to eat. Why not make it easy? Then go explore and try new things every now and then

Treating food as entertainment is how the west has gotten so unhealthy

johnisgood 7 hours ago [-]
I like satisfying my hunger (my goal most of the time when it comes to food), but making food is not a hobby to me. That said, cooking is often a nice, shared experience with my girlfriend.
xnorswap 10 hours ago [-]
I'm with you on this one, the idea of trying to "optimise" away lunches and break time to cram in more "study time" seems utterly alien.
pixl97 6 hours ago [-]
I'm a foodie, I love food and cooking and the eating experience.

This said, I know people that food is a grudging necessity they'd rather do without.

At the end of the day there's a lot of different kinds of people out there.

anttiharju 9 hours ago [-]
I mean I don't think I'm giving a particularly favorable view of the product
jerf 7 hours ago [-]
I expect AI ads to start with blindingly obvious, overwhelmingly excited endorsements, but it won't take long for the metrics to show that this doesn't work very well past the initial intro, and they'll get more subdued over time... but they're always going to be at least positive. The old saying "there's no such thing as bad publicity" is wrong, and the LLMs aren't going to try to get you to buy things by being subtly negative about them. If nothing else, even if you somehow produced a (correct) study showing that negativity does increase buying, I think the marketers would just not be able to tolerate that, for strictly human reasons. They always want their stuff cast in a positive light.
anttiharju 5 hours ago [-]
heh.

I think I've seen an adtech company use AI influencers to market whatever product a customer wanted to sell. I got the impression that it initially worked really well, but then people caught on to the fact it was just AI and performance tanked.

I don't actually know whether that was the case but that's the vibe I got from following their landing page over time.

jennyholzer4 10 hours ago [-]
[flagged]
jackfranklyn 6 hours ago [-]
The measurement problem here is real. "10x faster" compared to what exactly? Your best day or your average? First-time implementation or refactoring familiar code?

I've noticed my own results vary wildly depending on whether I'm working in a domain where the LLM has seen thousands of similar examples (standard CRUD stuff, common API patterns) versus anything slightly novel or domain-specific. In the former case, it genuinely saves time. In the latter, I spend more time debugging hallucinated approaches than I would have spent just writing it myself.

The atrophy point is interesting though. I wonder if it's less about losing skills and more about never developing them in the first place. Junior developers who lean heavily on these tools might never build the intuition that comes from debugging your own mistakes for years.

ronbenton 1 days ago [-]
I am used to seeing technical papers from ieee, but this is an opinion piece? I mean, there is some anecdata and one test case presented to a few different models but nothing more.

I am not necessarily saying the conclusions are wrong, just that they are not really substantiated in any way

wavemode 1 days ago [-]
To be fair, it's very rare that articles praising the power of AI coding assistants are ever substantiated, either.

In the end, everyone is kind of just sharing their own experiences. You'll only know whether they work for you by trying it yourself.

1 days ago [-]
mrguyorama 1 days ago [-]
> You'll only know whether they work for you by trying it yourself.

But at the same time, even this doesn't really work.

The lucky gambler thinks lottery tickets are a good investment. That does not mean they are.

I've found very very limited value from these things, but they work alright in those rather constrained circumstances.

franktankbank 1 days ago [-]
And you can't try it out without for the most part feeding the training machine for at best free.
Leynos 1 days ago [-]
Codex and Claude Code allow you to opt out of model training.

Perhaps you don't believe OpenAI and Anthropic when they say this, but it is a requirement upon which most enterprise contracts are predicated.

pc86 1 days ago [-]
Are there a lot of products or services you can try out without using the product or service?
franktankbank 1 days ago [-]
Without sending info to their servers, yes.
esafak 1 days ago [-]
This is the Spectrum magazine; the lighter fare. https://en.wikipedia.org/wiki/IEEE_Spectrum
troyvit 1 days ago [-]
Yeah I saw the ieee.org domain and was expecting a much more rigorous post.
ronbenton 1 days ago [-]
This may be a situation where HackerNews' shorthand of omitting the subdomain is not good. spectrum.ieee.org appears to be more of a newsletter or editorial part of the website, but you wouldn't know that's what this was just based on the HN tag.
preommr 1 days ago [-]
I've been on this site for over a decade now and didn't know this. That's a genuinely baffling decision given how different content across subdomains can be.
badc0ffee 24 hours ago [-]
Maybe an exception could be made here, like HN does for medium.com.
bee_rider 1 days ago [-]
On the other hand, “ieee spectrum” is directly at the top of the page, then “guest article.”
ronbenton 1 days ago [-]
Well, as much as I'm sure HN is a special place ;) , it is well documented that a lot of people on the internet just read the headlines
hxugufjfjf 1 days ago [-]
Articles? Headlines? I go right to the comments!
causal 1 days ago [-]
And the example given was specific to OpenAI models, yet the title is a blanket statement.

I agree with the author that GPT-5 models are much more fixated on solving exactly the problem given and not as good at taking a step back and thinking about the big picture. The author also needs to take a step back and realize other providers still do this just fine.

wavemode 1 days ago [-]
He tests several Claude versions as well
causal 1 days ago [-]
Ah you're right, scrolled past that - the most salient contrast in the chart is still just GPT-5 vs GPT-4, and it feels easy to contrive such results by pinning one model's response as "ideal" and making that a benchmark for everything else.
verdverm 1 days ago [-]
and they are using OpenAI models; OpenAI hasn't had a successful training run since Ilya left, and GPT 5.x is built on GPT 4.x, not from scratch, AIUI

I'm having a blast with gemini-3-flash and a custom Copilot replacement extension; it's much more capable than Copilot ever was with any model for me, and gives a personalized DX with deep insights into my usage and what the agentic system is doing under the hood.

RugnirViking 12 hours ago [-]
Can you talk a little more about your replacement extension? I get Copilot from my workplace and I'd love to know what I can do with it. I've been trying to build some containerized stuff with Copilot CLI, but I'm worried I have to give it more permissions than I'm comfortable with around git etc.
CashWasabi 1 days ago [-]
I always wonder what happens when LLMs have finally destroyed every source of information they crawl. After Stack Overflow and forums are gone, and when there's no open source code left to improve upon, won't they just cannibalize themselves and slowly degrade?
sosodev 1 days ago [-]
That idea is called model collapse https://en.wikipedia.org/wiki/Model_collapse

Some studies have shown that direct feedback loops do cause collapse but many researchers argue that it’s not a risk with real world data scales.

In fact, a lot of advancements in the open weight model space recently have been due to training on synthetic data. At least 33% of the data used to train nvidia’s recent nemotron 3 nano model was synthetic. They use it as a way to get high quality agent capabilities without doing tons of manual work.

ehnto 16 hours ago [-]
That's not quite the same thing, I think; the risk here is that the sources of training information vanish as well, not necessarily the feedback loop aspect.

For example all the information on the web could be said to be a distillation of human experiences, and often it ended up online due to discussions happening during problem solving. Questions were asked of the humans and they answered with their knowledge from the real world and years of experience.

If no one asks humans anymore, they just ask LLMs, then no new discussions between humans are occurring online and that experience doesn't get syndicated in a way models can train on.

That is essentially the entirety of Stack Overflow's existence until now. You can pretty strongly predict that no new software experience will be put into Stack Overflow from now on. So what of new programming languages or technologies and all the nuances within them? Docs never have all the answers, so models will simply lack the nuanced information.

pixl97 6 hours ago [-]
Then companies will just stick sensors on humans/cars/whatevers to gather information from the real world.

At the end of the day there is still a huge problem space of reality outside of humans that can be explored and distilled.

bandrami 9 hours ago [-]
The Habsburgs thought it wouldn't be a problem either
sethops1 22 hours ago [-]
Can't help but wonder if that's a strategy that works until it doesn't.
extesy 1 days ago [-]
Synthetic data. Like AlphaZero playing randomized games against itself, a future coding LLM would come up with new projects, or feature requests for existing projects, or common maintenance tasks for itself to execute. Its value function might include ease of maintainability, and it could run e2e project simulations to make sure it actually works.
rmunn 20 hours ago [-]
AlphaZero playing games against itself was useful because there's an objective measure of success in a game of Go: at the end of the game, did I have more points than my opponent? So you can "reward" the moves that do well, and "punish" the moves that do poorly. And that objective measure of success can be programmed into the self-training algorithm, so that it doesn't need human input in order to tell (correctly!) whether its model is improving or getting worse. Which means you can let it run in a self-feedback loop for long enough and it will get very good at winning.

What's the objective measure of success that can be programmed into the LLM to self-train without human input? (Narrowing our focus to only code for this question). Is it code that runs? Code that runs without bugs? Code without security holes? And most importantly, how can you write an automated system to verify that? I don't buy that E2E project simulations would work: it can simulate the results, but what results is it looking for? How will it decide? It's the evaluation, not the simulation, that's the inescapably hard part.

Because there's no good, objective way for the LLM to evaluate the results of its training in the case of code, self-training would not work nearly as well as it did for AlphaZero, which could objectively measure its own success.

falloutx 1 days ago [-]
You don't need synthetic data; people are posting vibe coded projects on GitHub every day and they are being added to the next model's training set. I expect that in like 4-5 years, humans will just not be able to do things that are not in the training set. Anything novel or fun will be locked down to creative agencies and the few holdouts who managed to survive.
chneu 1 days ago [-]
Or it'll create an alternative reality where that AI iterates itself into delusion.
eager_learner 1 days ago [-]
That's a valid thought. As AI generates a lot of content, some of which may be hallucinations, the new cycle of training will probably be using the old + the_new_AI_slop data, and as a result degrade the final result.

Unless the AIs find out where mistakes occur, and find this out in the code they themselves generate, your conclusion seems logically valid.

sosodev 1 days ago [-]
Hallucinations generally don't matter at scale. Unless you're feeding back 100% synthetic data into your training loop it's just noise like everything else.

Is the average human 100% correct with everything they write on the internet? Of course not. The absurd value of LLMs is that they can somehow manage to extract the signal from that noise.

phyzome 23 hours ago [-]
It's only "noise" if it's uncorrelated. I don't see any reason to believe it wouldn't be correlated, though.
sosodev 23 hours ago [-]
Are you sure about that? There's a lot of slop on the internet. Imagine I ask you to predict the next token after reading an excerpt from a blog on tortoises. Would you have predicted that it's part of an ad for boner pills? Probably not.

That's not even the worst scenario. There are plenty of websites that are nearly meaningless. Could you predict the next token on a website whose server is returning information that has been encoded incorrectly?

imiric 24 hours ago [-]
> The absurd value of LLMs is that they can somehow manage to extract the signal from that noise.

Say what? LLMs absolutely cannot do that.

They rely on armies of humans to tirelessly filter, clean, and label data that is used for training. The entire "AI" industry relies on companies and outsourced sweatshops to do this work. It is humans that extract the signal from the noise. The machine simply outputs the most probable chain of tokens.

So hallucinations definitely matter, especially at scale. It makes the job of humans much, much harder, which in turn will inevitably produce lower quality models. Garbage in, garbage out.

sosodev 23 hours ago [-]
I think you're confused about the training steps for LLMs. What the industry generally calls pre-training is when the LLM learns the job of predicting the most probable next token given a huge volume of data. A large percentage of that data has not been cleaned at all because it just comes directly from web crawling. It's not uncommon to open up a web crawl dataset that is used for pretraining and immediately read something sexual, nonsensical, or both really.

LLMs really do find the signal in this noise because even just pre-training alone reveals incredible language capabilities but that's about it. They don't have any of the other skills you would expect and they most certainly aren't "safe". You can't even really talk to a pre-trained model because they haven't been refined into the chat-like interface that we're so used to.

The hard part after that for AI labs was getting together high quality data that transforms them from raw language machines into conversational agents. That's post-training and it's where the armies of humans have worked tirelessly to generate the refinement for the model. That's still valuable signal, sure, but it's not the signal that's found in the pre-training noise. The model doesn't learn much, if any, of its knowledge during post-training. It just learns how to wield it.

To be fair, some of the pre-training data is more curated. Like collections of math or code.

imiric 7 hours ago [-]
No, I think you're confused, and doubling down on it, for some reason.

Base models (after pre-training) have zero practical value. They're absolutely useless when it comes to separating signal from noise, using any practical definition of those terms. As you said yourself, their output can be nonsensical, based solely on token probability in the original raw data.

The actual value of LLMs comes after the post-training phase, where the signal is injected into the model from relatively smaller amounts of high quality data. This is the data processed by armies of humans, without which LLMs would be completely worthless.

So whatever capability you think LLMs have to separate signal from noise is exclusively the product of humans. When that job becomes harder, the quality of LLMs will go down. Unless we figure out a way to automate data cleaning/labeling, which seems like an unsolvable problem, or for models to filter it during inference, which is what you're wrongly implying they already do. LLMs could assist humans with cleaning/labeling tasks, but that in itself has many challenges, and is not a solution to the model collapse problem.

sosodev 7 hours ago [-]
I'm not saying that pre-trained only models are useless. They've clearly extracted a ton of knowledge from the corpus. The interface may seem strange because it's not what we're accustomed to, but they still prove valuable. Code completion models, for example, are just LLMs that have pre-trained exclusively on code. They work very well despite their simplicity because... the model has extracted the signal from the noise.
imiric 2 hours ago [-]
You have a strange definition of "signal" and "noise".

Code completion models can be useful because they output the most probable chain of tokens given a specific input, same as any LLM. There is no "signal" there besides probability. Besides, even those models are fine-tuned to follow best practices, specific language idioms, etc.

When we talk about "signal" in the context of general knowledge we refer to information that is meaningful and accurate for a specific context and input. So that if the user asks proof of the Earth being flat, the model doesn't give them false information from a random blog. Of course, LLMs still fall short at this, but post-training is crucial to boost the signal away from the noise. There's nothing inherent in the way LLMs work to make them do this. It is entirely based on the quality of the training data.

intended 17 hours ago [-]
LLM content generation is divorced from human limitations and human scale.

Using human foibles when discussing LLM scale issues is apples and oranges.

grugagag 1 days ago [-]
I guess there’ll be less collaboration and less sharing with the outside world; people will still collaborate/share, but within smaller circles. It’ll bring an end to the era of the sharing-is-caring internet, as it doesn’t benefit anyone but a few big players.
sejje 1 days ago [-]
I bet they'll only train on the internet snapshot from now, before LLMs.

Additional non-internet training material will probably be human created, or curated at least.

pc86 1 days ago [-]
This only makes sense if the percentage of LLM hallucinations is much higher than the percentage of things written online being flat wrong (it's definitely not).
sosodev 1 days ago [-]
Nope. Pretraining runs have been moving forward with internet snapshots that include plenty of LLM content.
sejje 1 days ago [-]
Sure, but not all of them are stupid enough to keep doing that while watching the model degrade, if it indeed does.
theptip 1 days ago [-]
Does it matter? Hypothetically if these pre-training datasets disappeared, you can distill from the smartest current model, or have them write textbooks.
layer8 22 hours ago [-]
If LLMs happened 15 years ago, I guess that we wouldn’t have had the JS framework churn we had.
theptip 1 days ago [-]
They are not getting worse, they are getting better. You just haven't figured out the scaffolding required to elicit good performance from this generation. Unit tests would be a good place to start for the failure mode discussed.

As others have noted, the prompt/eval is also garbage. It’s measuring a non-representative sub-task with a weird prompt that isn’t how you’d use agents in, say, Claude Code. (See the METR evals if you want a solid eval giving evidence that they are getting better at longer-horizon dev tasks.)

This is a recurring fallacy with AI that needs a name. “AI is dumber than humans on some sub-task, therefore it must be dumb”. The correct way of using these tools is to understand the contours of their jagged intelligence and carefully buttress the weak spots, to enable the super-human areas to shine.

frizlab 1 days ago [-]
So basically “you’re holding it wrong?”
dannersy 1 days ago [-]
Every time this is what I'm told. The difference between learning how to Google properly and then the amount of hoops and in-depth understanding you need to get something useful out of these supposedly revolutionary tools is absurd. I am pretty tired of people trying to convince me that AI, and very specifically generative AI, is the great thing they say it is.

It is also a red flag to see anyone refer to these tools as intelligence, as it seems the marketing of calling this "AI" has finally woven its way into our discourse, such that even tech forums think the prediction machine is intelligent.

conception 19 hours ago [-]
I heard it best described to me as: if you put in an hour of work, you get five hours of work out of it. Most people just type at it and don't put in an hour of planning and discussion and scaffolding. They just expect it to work 100% of the time exactly like they want. But you wouldn't expect that from a junior developer. You would put an hour of work into them, teaching them things, showing them where the documentation is, your patterns, how you do things, and then you would set them off. They would probably make mistakes, and you would document their mistakes for them so they wouldn't make them again, but eventually they'd be pretty good. That's more or less where we are today; that will get you success on a great many tasks.
wnolens 9 hours ago [-]
Exactly my experience and how I leverage Claude where some of my coworkers remain unconvinced.
danielbln 1 days ago [-]
"The thing I've learned years ago that is actually complex but now comes easy to me because I take my priors for granted is much easier than the new thing that just came out"

Also, that "it's not really intelligence" horse is so dead, it has already turned into crude oil.

dannersy 14 hours ago [-]
The point I am making is that this is supposed to be some revolutionary tool that threatens our very society in terms of labor and economics, yet the fringe enthusiasts (yes, that is what HN and its users are, an extreme minority of users), the very people plugged into the weekly changes and additions of model adjustments and tools to leverage them, still struggle to show me the value of generative AI day to day. They make big claims, but I don't see them. In fact, I see negatives overwhelming the gains, and that's without even talking about the product and its usability.

In practice I have seen: flowery emails no one bothers to read, emoji filled summaries and documentation that no one bothers to read or check correctness on, prototypes that create more work for devs in the long run, a stark decline in code quality because it turns out reviewing code is a team's ultimate test of due diligence, ridiculous video generation... I could go on and on. It is blockchain all over again, not in terms of actual usefulness, but in terms of our burning desire to monetize it in irresponsible, anti-consumer, anti-human ways.

I DO have a use for LLMs. I use them to tag data that has no tagging. I think the tech behind generative AI is extremely useful. Otherwise, what I see is a collection of ideal states that people fail to demonstrate to me in practice, when in reality it won't be replacing anyone until "the normies" can use it without 1000 lines of instructions markdown. Instead it will just fool people with its casual, authoritative, and convincing language, since that is what it was designed to do.

bojan 13 hours ago [-]
> reviewing code is a team's ultimate test of due diligence

Further even, if you are actually thinking about long-term maintenance during the code review you get seen as a nitpicky obstacle.

frizlab 1 days ago [-]
> Also, that "it's not really intelligence" horse is so dead, it has already turned into crude oil.

Why? Is it intelligence now? I think not.

danielbln 1 days ago [-]
Would you mind defining "intelligence" for me?
Terr_ 23 hours ago [-]
If you're the one saying it exists, you go first. :p
frizlab 1 days ago [-]
There are many types of intelligence. If you want to go to useless places, using certain definitions of intelligence, yes, we can consider AI “intelligent.” But it’s useless.
theptip 1 days ago [-]
I’d say “skill issue” since this is a domain where there are actually plenty of ways to “hold it wrong” and lots of ink spilled on how to hold it better, and your phrasing connotes dismissal of user despair which is not my intent.

(I’m dismissive of calling the tool broken though.)

Workaccount2 1 days ago [-]
Remember when "Googling" was a skill?

LLMs are definitely in the same boat. It's even more specific where different models have different quirks so the more time you spend with one, the better the results you get from that one.

dude250711 24 hours ago [-]
Those skills will age faster than Knockout.js.
petesergeant 21 hours ago [-]
Why would a skill that's being actively exercised against the state of the art, daily, age poorly?
steveklabnik 1 days ago [-]
Do you think it's impossible to ever hold a tool incorrectly, or use a tool in a way that's suboptimal?
mrguyorama 1 days ago [-]
If that tool is sold as "This magic wand will magically fix all your problems" then no, it's not possible to hold it incorrectly.
orangecat 24 hours ago [-]
If your position is that any product that doesn't live up to all its marketing claims is worthless, you're going to have a very limited selection.
steveklabnik 1 days ago [-]
Gotcha. I don't see these tools as being a magic wand nor being able to magically fix every problem. I agree that anyone who sells them that way is overstating their usefulness.
wvenable 24 hours ago [-]
Why does it matter how it's sold? Unless you're overpaying for what it's actually capable of, it doesn't really matter.
callc 20 hours ago [-]
We all have skin in the game when how it’s sold is “automated intelligence so that we can fire all our knowledge workers”

Might be good in some timelines. In our current timeline this will just mean even more extreme concentration of wealth, and worse quality of life for everyone.

Maybe when the world has a lot more safety nets so that not having a job doesn’t mean homelessness, starvation, no healthcare, then society will be more receptive to the “this tool can replace everybody” message.

wvenable 18 hours ago [-]
If a machine can do your job; whether it's harvesting corn or filing a TPS report then making a person sit and do it for the purpose of survival is basically just torture.

There are so many better things for humans to do.

callc 5 hours ago [-]
I agree in theory. In practice people who are automated out of jobs are not taken care of by society in the transition period where they learn how to do a new job.

Once having a job is not intimately tied to basic survival needs then people will be much more willing to automate everything.

I, personally, would be willing to do mind numbing paperwork or hard labor if it meant I could feed myself and my family, have housing, rather than be homeless and starving.

wvenable 4 hours ago [-]
You might as well stop being a software developer. Not because you'll be out of a job, but because you're directly contributing to other people being out of jobs. We've been automating work (which is ultimately human labor) since the dawn of computers. And humans have been automating work for centuries now. We actually call that progress. So let's stop progressing entirely so people can do pointless labor.

If the problem is with society, the solution is with society. We have to stop pretending that it's anything else. AI is not even the biggest technological leap -- it's a blip on the continuum.

pixl97 2 hours ago [-]
>There are so many better things for humans to do.

For the time being, at least.

wvenable 1 hours ago [-]
There will always be better things for people to do. We don't exist on this planet just to sit at a desk and hit buttons all day.
pixl97 49 minutes ago [-]
The only reason we exist is as a carrier for our genes to make more of our genes. Everything after that is an accidental byproduct.
wvenable 20 minutes ago [-]
I can already think of something more useful to pass on my genes than typing on keyboard all day.
greggoB 1 days ago [-]
I found this a pretty apt - if terse - reply. I'd appreciate someone explaining why it deserves being downvoted?
conception 7 hours ago [-]
It’s just dismissive of the idea that you have to learn how to use LLMs, vs a design flaw in a cell phone that was dismissed as user error.

It’s the same as if he had said “I keep typing HTML into VS code and it keeps not displaying it for me. It just keeps showing the code. But it’s made to make webpages, right? people keep telling me I don’t know how to use it but it’s just not showing me the webpage.”

mostlysimilar 1 days ago [-]
There are two camps who have largely made up their minds just talking past each other, instinctively upvoting/downvoting their camp, etc. These threads are nearly useless, maybe a few people on the fringes change their minds but mostly it's just the same tired arguments back and forth.
1 days ago [-]
hug 24 hours ago [-]
Because in its brevity it loses all ability to defend itself from any kind of reasonable rebuttal. It's not an actual attempt to continue the conversation, it's just a semantic stop-sign. It's almost always used in this fashion, not just in the context of LLM discussions, but in this specific case it's particularly frustrating because "yes, you're holding it wrong" is a good answer.

To go further into detail about the whole thing: "You're holding it wrong" is perfectly valid criticism in many, many different ways and fields. It's a strong criticism in some, and weak in others, but almost always the advice is still useful.

Anyone complaining about getting hurt by holding a knife by the blade, for example, is the strongest example of the advice being perfect. The tool is working as designed, cutting the thing with pressure on the blade, which happens to be their hand.

Left-handers using right-handed scissors provides a reasonable example: I know a bunch of left-handers who can cut properly with right-handed scissors and not with left-handed scissors. Me included, if I don't consciously adjust my behaviour. Why? Because they have been trained to hold scissors wrong (by positioning the hand to create opposite push/pull forces to natural), so that they can use the poor tool given to them. When you give them left-handed scissors and they try to use the same reversed push/pull, the scissors won't cut well because their blades are being separated. There is no good solution to this, and I sympathise with people stuck on either side of this gap. Still, learn to hold scissors differently.

And, of course, the weakest, and the case where the snark is deserved: if you're holding your iPhone 4 with the pad of your palm bridging the antenna, holding it differently still resolves your immediate problem. The phone should have been designed such that it didn't have this problem, but it does, and that sucks, and Apple is at fault here. (Although I personally think it was blown out of proportion, which is neither here nor there.)

In the case of LLMs, the language of the prompt is the primary interface -- if you want to learn to use the tool better, you need to learn to prompt it better. You need to learn how to hold it better. Someone who knows how to prompt it well, reading the kind of prompts the author used, is well within their rights to point out that the author is prompting it wrong, and anyone attempting to subvert that entire line of argument with a trite little four-sentence bit of snark in whatever the total opposite of intellectual curiosity is deserves the downvotes they get.

frizlab 15 hours ago [-]
Except this was posted because the situation is akin to the original context in which this phrase was said.

Initial postulate: you have a perfect tool that anybody can use and is completely magic.

Someone says: it does not work well.

Answer: it’s your fault, you’re using it wrong.

In that case it is not a perfect tool that anybody can use. It is just yet another tool, with its flaws and learning curve, that may or may not work depending on the problem at hand. And it's ok! It is definitely a valid answer. But the "it's magic" narrative has got to go.

pixl97 2 hours ago [-]
>Initial postulate: you have a perfect tool that anybody can use and is completely magic.

>Someone says: it does not work well.

Why do we argue with two people who are both building strawmen? It doesn't accomplish much. We keep calling AI 'unintelligent', but people's eager willingness to make incorrect arguments does put some doubt on humanity itself.

Leynos 1 days ago [-]
It's something of a thought-terminating cliché in Hacker News discussions about large language models and agentic coding tools.
data-ottawa 18 hours ago [-]
Needing the right scaffolding is the problem.

Today I asked 3 versions of Gemini "what were sales in December" with access to a SQL model of sales data.

All three ran `WHERE EXTRACT(MONTH FROM date) = 12` with no year (except 2.5 Flash, which sometimes gave me sales for Dec 2023).

No sane human would hear "sales from December" and sum up every December. But it produced numbers that an uncritical eye would not catch as wrong.

That's the type of logical error these models produce that is bothering the author. They can be very poor at analysis in real-world situations because they do these things.
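To make the ambiguity concrete, here is a minimal pandas sketch (the data and column names are made up for illustration):

  import pandas as pd

  # Hypothetical sales rows spanning two years.
  df = pd.DataFrame({
      "date": pd.to_datetime(["2023-12-05", "2024-06-01", "2024-12-10"]),
      "amount": [100.0, 75.0, 250.0],
  })

  # What the models effectively did: every December, across all years.
  all_decembers = df[df["date"].dt.month == 12]["amount"].sum()        # 350.0

  # What a person would assume: the most recent December only.
  dec_2024 = df[(df["date"].dt.year == 2024)
                & (df["date"].dt.month == 12)]["amount"].sum()         # 250.0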

techblueberry 23 hours ago [-]
"They are not getting worse, they are getting better. You just haven't figured out the scaffolding required to elicit good performance from this generation. Unit tests would be a good place to start for the failure mode discussed."

Isn't this the same thing? I mean this has to work with like regular people right?

khalic 12 hours ago [-]
I’ve seen some correlation between people who write clean and structured code, follow best practices and communicate well through naming and sparse comments, and how much they get out of LLM coding agents. Eloquence and depth of technical vocabulary seem to be a factor too.

Make of that what you will…

Garlef 1 days ago [-]
I'm referring to these kind of articles as "Look Ma, I made the AI fail!"
falloutx 1 days ago [-]
Still, I would agree we need some of these articles when other parts of the internet are saying "AI can do everything, sign up for my coding agent for $200/month".
ashleyn 1 days ago [-]
Having to prime it with more context and more guardrails seems to imply they're getting worse. That's less context and fewer guardrails it can infer/intuit on its own.
theptip 1 days ago [-]
No, they are not getting worse. Again, look at METR task times.

The peak capability is very obviously, and objectively, increasing.

The scaffolding you need to elicit top performance changes each generation. I feel it’s less scaffolding now to get good results. (Lots of the “scaffolding” these days is less “contrived AI prompt engineering” and more “well understood software engineering best practices”.)

falloutx 1 days ago [-]
Why the downvotes? This comment makes sense. If you need to write more guardrails, that does increase the work, and at some point the amount of guardrails needed to make these things work in every case would just be impractical. I personally don't want my codebase to be filled with babysitting instructions for code agents.
clownpenis_fart 1 days ago [-]
[dead]
bodge5000 6 hours ago [-]
A little off topic, but this seems like one of the better places to ask where I'm not gonna get a bunch of zealotry; a question for those of you who like using AI for software development, particularly using Claude Code or OpenCode.

I'll admit I'm a bit of a sceptic of AI but want to give it another shot over the weekend. What do people recommend these days?

I'm happy spending money but obviously don't want to spend a tonne since it's just an experiment for me. I hear a lot of people raving about Opus 4.5, though apparently using that is close to $20 a prompt. Sonnet 4.5 seems a lot cheaper, but then I don't know if I'm giving it (by it I mean AI coding) a fair chance if Opus is that much better. There's also OpenCode Zen, which might be a better option, I don't know.

jedberg 6 hours ago [-]
If you want to try Opus you can get the lowest Claude plan for $20 for the month, which has enough tokens for most hobby projects. I've been using it to vibe code some little utilities for myself and haven't hit the limits yet.
bodge5000 6 hours ago [-]
Oh nice, I saw people on reddit say that Opus 4.5 will hit that $20 limit after 1-3 prompts, though maybe that's just on massive codebases. Like you, I'd just want to try it out on some hobby projects.
pbowyer 6 hours ago [-]
> I saw people on reddit say that Opus 4.5 will hit that $20 limit after a 1-3 prompts

That's people doing real vibe-coding prompts, like "Build me a music player with...". I'm using the $20 Codex plan, and by getting it to plan first and then execute (in the same way I, an experienced dev, would instruct a junior) I haven't even managed to exhaust my 5-hour window limits, let alone the weekly limit.

Also if you keep an eye on it and kill it if it goes in the wrong direction you save plenty of tokens vs letting it go off on one. I wasted a bunch when Codex took 25 minutes(!) to install one package because something went wrong and instead of stopping and asking it decided to "problem solve" on its own.

AstroBen 5 hours ago [-]
give codex a try for $20. You get a lot out of the base subscription. Opus will burn through the $20 sub in an hour

The latest models are all really good at writing code. Which is better is just vibes and personal preference at this point IMO

The agent harness of claude code / opencode / codex is what really makes the difference these days

bodge5000 5 hours ago [-]
Oh nice, so Claude/OpenAI isn't as important as (Claude)Code/Codex/OpenCode these days? How is OpenCode in comparison? The idea of Zen does seem quite nice (a lot of flexibility to experiment with different models), though it does seem like a bit more config and work upfront than CC or Codex.
AstroBen 5 hours ago [-]
I'd say OpenCode > Codex > Claude Code in terms of the TUI interface UX. OpenCode feels a lot nicer to use. I haven't noticed a code quality difference, only a difference in the UX

I'm not sure about Zen, but OpenAI seems to be giving me $20 / week worth of tokens within the $20/month

Also for absolutely free models, MiniMax M2.1 has been impressive and useful to me (free through OpenCode). Don't judge the state of the art through the lens of that, though

bodge5000 1 hours ago [-]
Bit of an update on Zen: it looks like Anthropic have blocked Claude usage outside of Claude Code, so if I did want to use Opus, it'd have to be through that. They might reverse it or OpenCode might find a way around it, but overall I'd say at this point it's safest to assume, if you're starting fresh with this, that you go with one or the other.

Still not sure which one I'll go with, though I can't say I feel too keen to get into Claude after that

massysett 6 hours ago [-]
Take some existing code and bundle it into a zip or tar file. Upload it to Gemini and ask it for critique. It's surprisingly insightful and may give you some ideas for improvement. Use one of the Gemini in-depth models like Thinking or Pro; just looking at the thinking process is interesting. Best of all, they're free for limited use.
bodge5000 6 hours ago [-]
Wanted to try more of what I guess would be the opposite approach (it writes the code and I critique), partially to give it a fair shake and partially just out of curiosity. Also I can't lie, I always have a soft spot for a good TUI which no doubt helps
Kuinox 1 days ago [-]
I speculate LLM providers are dynamically serving smaller models to follow usage spikes and to free up compute for training new models. I have observed that agents become worse over time, especially just before a new model is released.
Workaccount2 1 days ago [-]
Internally everyone is compute constrained. No one will convince me that the models getting dumb, or especially them getting lazy, isn't because the servers are currently being inundated.

However, right now it looks like we will move to training-specific hardware and inference-specific hardware, which hopefully relieves some of that tension.

Cthulhu_ 1 days ago [-]
Probably a big factor; the biggest challenge AI companies have now is value vs. cost vs. revenue. There will be a big correction, with many smaller parties collapsing or being subsumed as investor money dries up.
Kuinox 1 days ago [-]
I think it's more a problem of GPU capacity than costs. Training takes a lot of resources, inference too.
lucideng 6 hours ago [-]
This quote feels more relevant than ever:

> Give a man a fish, and you feed him for a day. Teach a man to fish, and you feed him for a lifetime.

Or in the context of AI:

> Give a man code, and you help him for a day. Teach a man to code, and you help him for a lifetime.

amarka 5 hours ago [-]
Or in my context:

> Give a person code, and you help them for a day. Teach them to code, and you frustrate them for a lifetime.

nyrikki 24 hours ago [-]
> If an assistant offered up suggested code, the code ran successfully, and the user accepted the code, that was a positive signal, a sign that the assistant had gotten it right. If the user rejected the code, or if the code failed to run, that was a negative signal, and when the model was retrained, the assistant would be steered in a different direction.

> This is a powerful idea, and no doubt contributed to the rapid improvement of AI coding assistants for a period of time. But as inexperienced coders started turning up in greater numbers, it also started to poison the training data.

It is not just `inexperienced coders` who make this signal pretty much useless. I mostly use coding assistants for boilerplate; I will accept the suggestion and then delete much of what it produced, especially in the critical path.

For many users, this is much faster than trying to get another approximation

     :,/^}/-d
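     " range delete: from the cursor line up to, but not including, the next line starting with }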
Same for `10dd`, etc. It is all muscle memory. Then again, I use a local fill-in-the-middle tiny LLM now, because it is good enough for most of the speedup without the cost/security/latency of a hosted model.

It would be a mistake to think that filtering out jr devs will result in good data as the concept is flawed in general. Accepting output may not have anything to do with correctness of the provided content IMHO.

jackfranklyn 14 hours ago [-]
The quality variation from month to month has been my experience too. I've noticed the models seem to "forget" conventions they used to follow reliably - like proper error handling patterns or consistent variable naming.

What's strange is sometimes a fresh context window produces better results than one where you've been iterating. Like the conversation history is introducing noise rather than helpful context. Makes me wonder if there's an optimal prompt length beyond which you're actually degrading output quality.

wumms 13 hours ago [-]
> Like the conversation history is introducing noise rather than helpful context.

From https://docs.github.com/en/copilot/concepts/prompting/prompt...:

Copilot Chat uses the chat history to get context about your request. To give Copilot only the relevant history:

- Use threads to start a new conversation for a new task

- Delete requests that are no longer relevant or that didn’t give you the desired result

mrtesthah 13 hours ago [-]
Remember that the entire conversation is literally the query you’re making, so the longer it is the more you’re counting on the rational comprehension abilities of the AI to follow it and determine what is most relevant.
sosodev 1 days ago [-]
He asked the models to fix the problem without commentary and then… praised the models that returned commentary. GPT-5 did exactly what he asked. It doesn’t matter if it’s right or not. It’s the essence of garbage in and garbage out.
zeroonetwothree 23 hours ago [-]
If they are supposed to replace actual devs we would expect them to behave like actual devs and push back against impossible requests.
sosodev 23 hours ago [-]
Except it's not an impossible request. If my manager told me "fix this code with no questions asked" I would produce a similar result. If you want it to push back, you can just ask it to do that or at least not forbid it to. Unless you really want a model that doesn't follow instructions?
dathinab 8 hours ago [-]
In general, "failing to run (successfully)" shouldn't per se be seen as a bad signal.

It might still be:

- the closest to a correct solution the model can produce

- helpful for finding out what is wrong

- intended (e.g. in a typical very short red->green unit-test dev approach, you want to generate some code which doesn't run correctly _just yet_). Tests for newly found bugs are supposed to fail (until the bug is fixed). Etc.

- though if "making it run" means removing sanity checks, doing something semantically completely different, or similar, it is, like the OP author said, one of the worst outcomes

amarka 5 hours ago [-]
While the author's (a banker and data scientist) experience is clearly valuable, it is unclear whether it alone is sufficient to support the broader claims made. Engineering conclusions typically benefit from data beyond individual observation.
StarlaAtNight 1 days ago [-]
We should be able to pin to a version of training data history like we can pin to software package versions. Release new updates w/ SemVer and let the people decide if it’s worth upgrading to

I’m sure it will get there as this space matures, but it feels like model updates are very force-fed to users

terminalbraid 1 days ago [-]
If you talk to people who deal with inference using large, fungible datasets, this is an extremely difficult governance problem. SemVer is incredibly insufficient, and there's no well-defined notion of what an "upgrade" even means, let alone "major", "minor", and "patch".

It's a major disservice to the problem to act like it's new and solved or even solvable using code revision language.

willj 1 days ago [-]
I think the models are so big that they can’t keep many old versions around because they would take away from the available GPUs they use to serve the latest models, and thereby reduce overall throughput. So they phase out older models over time. However, the major providers usually provide a time snapshot for each model, and keep the latest 2-3 available.
Leynos 1 days ago [-]
If you're an API customer, you can pin to a specific dated snapshot of the model.

See the "Snapshots" section on these pages for GPT-4o and 4.1, for example:

https://platform.openai.com/docs/models/gpt-4o https://platform.openai.com/docs/models/gpt-4.1

This is done so that application developers whose systems depend upon specific model snapshots don't have to worry about unexpected changes in behaviour.

You can access these snapshots through OpenRouter too, I believe.
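For instance, a minimal sketch with the official Python client (the snapshot name is one of the published dated IDs and is only illustrative):

  from openai import OpenAI

  client = OpenAI()  # reads OPENAI_API_KEY from the environment

  # Pin to a dated snapshot instead of the floating "gpt-4o" alias,
  # so behaviour doesn't change underneath the application.
  response = client.chat.completions.create(
      model="gpt-4o-2024-08-06",
      messages=[{"role": "user", "content": "Summarize this diff in one sentence."}],
  )
  print(response.choices[0].message.content)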

swid 1 days ago [-]
Every model update would be a breaking change; an honest application of SemVer has no place in AI model versions.

Not saying using major.minor depending on architecture is a bad thing, but it wouldn't be SemVer, and that doesn't even cover all the different fine-tunes/flavors that are derived from those models, which generally have no natural ordering.

randall 1 days ago [-]
There's figurative and literal, though. Figurative SemVer (this is a system prompt update vs. a model retrain) would actually work OK... or at least build numbers would.

I think you could actually pretty cleanly map semver onto more structured prompt systems ala modern agent harnesses.

memoriuaysj 1 days ago [-]
That's not enough: the tool definitions change, the agent harness changes; you need to pin a lot of stuff.
kristopolous 23 hours ago [-]
I stopped using them. Occasionally I go back to see if it's better but really I just treat them as a more interactive stackoverflow/google.

I've been stung by them too many times.

The problem is the more I care about something, the less I'll agree with whatever the agent is trying to do.

dudeinhawaii 4 hours ago [-]
The most annoying thing in the LLM space is that people write articles and research with grand pronouncements based upon old models. This article has no mention of Sonnet 4.5, nor does it use any of the actual OpenAI coding models (GPT-5-Codex, GPT-5.1 Codex, etc), and based upon that, even the Opus data is likely an older version.

This then leads to a million posts where on one side people say "yeah see they're crap" and on the other side people are saying "why did you use a model from 6 months ago for your 'test' and write up in Jan 2026?".

You might as well ignore all of the articles and pronouncements and stick to your own lived experience.

The change in quality between 2024 and 2025 is gigantic. The change between early 2025 and late 2025 is _even_ larger.

The newer models DO let you know when something is impossible or unlikely to solve your problem.

Ultimately, they are designed to obey. If you authoritatively request bad design, they're going to write bad code.

I don't think this is a "you're holding it wrong" argument. I think it's "you're complaining about iOS 6 and we're on iOS 12.".

winddude 7 hours ago [-]
Not sure I agree with his tests, but I agree with the headline. I recently had Cursor launch into seemingly endless loops of grepping and `cd`-ing and `ls`-ing files, in multiple new convos. I think they're trying to do too much for too many "vibe coders", and the lighter-weight versions that did less were easier to steer to meet your architecture and needs.
Hobadee 9 hours ago [-]
> If an assistant offered up suggested code, the code ran successfully, and the user accepted the code, that was a positive signal, a sign that the assistant had gotten it right.

So what about all those times I accepted the suggestion because it was "close enough", but then went back and fixed all the crap that AI screwed up? Was it training on what was accepted the first time? If so I'm sincerely sorry to everyone, and I might be single-handedly responsible for the AI coding demise. :'-D

crazygringo 1 days ago [-]
This is a sweeping generalization based on a single "test" of three lines that is in no way representative.
podgorniy 9 hours ago [-]
Can `sweeping generalizations` even be representative? If not, then where do you draw the line?
djdndnc 1 days ago [-]
[flagged]
amelius 1 days ago [-]
A dataset with only data from before 2024 will soon be worth billions.
blahyawnblah 1 days ago [-]
2022, when ChatGPT first came out. https://arstechnica.com/ai/2025/06/why-one-man-is-archiving-...
noir_lord 1 days ago [-]
I’ve already gotten into the habit of sticking “before:2022” in YT if what I’m looking for doesn’t need to be recent.

The AI slop/astroturfing of YT is near complete.

ares623 23 hours ago [-]
I would say around 2023. I refuse to believe the slop propagated that fast.

And there's more than enough content for one person to consume. Very little reason to consume content newer than 2023.

cbm-vic-20 1 days ago [-]
https://en.wikipedia.org/wiki/Low-background_steel
Workaccount2 1 days ago [-]
Synthetic data is already being embraced. Turns out you actually can create good training data with these models.
maxbaines 1 days ago [-]
Not seeing this in my day to day, in fact the opposite.
HendrikHensen 1 days ago [-]
Can you be more specific? E.g. refute something specific that the article mentions. Or are you only reacting to the title, not the article's contents?
ronbenton 1 days ago [-]
I think it should be on the article to prove its title. I hardly think presenting one test case to some different models substantiates the claim that "AI Coding Assistants Are Getting Worse." Note that I have no idea if the title is true or not, but it certainly doesn't follow from the content of the article alone.
samrus 1 days ago [-]
With LLMs being hard to test objectively, any claim made about them has to be substantiated with at least anecdotes. The article presented some backing; if you don't think it's enough you gotta present some of your own, or people can't take you seriously.
ronbenton 1 days ago [-]
I did present my own evidence to support _my_ argument that the article is woefully lacking data to support its conclusion. It's not on me to try to make the counterargument (that AI coding assistants aren't getting worse) because that's not my opinion.
maxbaines 1 days ago [-]
I think, as the article mentions with garbage in, garbage out, we are more trusting and expect more. Coding assistants don't just need a good model, they need a good harness, and these methods have also changed recently.
llm_nerd 1 days ago [-]
The article is ridiculous garbage. I knew the IEEE had fallen into irrelevance, but that their magazine now prints nonsense like this -- basically someone's ad wrapped in an incredibly lazy supposition -- is quite an indictment.

The guy wrote code that depends on an external data file (one the LLM didn't have access to) and refers to a nonexistent column. They then specifically prompted it to provide "completed code only, without commentary". This is idiotic.

"Dear LLM, make a function that finds if a number is prime in linear time. Completed code only! No commentary!".

Guy wanted to advertise his business and its adoption of AI, and wrote some foolish pablum to do so. How is this doing numbers here?

Snuggly73 1 days ago [-]
I mean... the naive approach for a prime-number check is O(n), which is linear. You probably meant constant time?
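For reference, a sketch of the naive check, which is linear in the value of n:

  def is_prime_naive(n: int) -> bool:
      # O(n): tries every candidate divisor below n.
      if n < 2:
          return False
      for d in range(2, n):
          if n % d == 0:
              return False
      return True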
dcchuck 1 days ago [-]
Couldn't agree more.

I would expect older models make you feel this way.

* Agents not trying to do the impossible (or not being an "over eager people pleaser" as it has been described) has significantly improved over the past few months. No wonder the older models fail.

* "Garbage in, garbage out" - yes, exactly ;)

anttiharju 5 hours ago [-]
I've felt this. Bit scary given how essential of a tool it has become.

I started programming before modern LLMs so I can still hack it without, it will just take a lot longer.

kristianp 1 days ago [-]
The failure mode of returning code that only appears to work correctly is one I've encountered before. I've had Sonnet (4 I think) generate a bunch of functions that check if parameter values are out of valid range and just return without error when they should be a failing assertion. That kind of thing does smell of training data that hasn't been checked for correctness by experienced coders.
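A minimal sketch of that pattern (the function and the valid range are hypothetical):

  def scale_reading(value: float) -> float:
      # What the model generated: silently bail out on bad input,
      # so the caller never learns the data was invalid.
      if value < 0 or value > 100:
          return 0.0
      return value / 100

  def scale_reading_strict(value: float) -> float:
      # What was actually wanted: fail loudly on out-of-range input.
      assert 0 <= value <= 100, f"reading out of range: {value}"
      return value / 100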

Edit: Changed 3.5 to 4.

Edit: Looking back to edits and checkins by AI agents, it strikes me that the checkins should contain the prompt used and model version. More recent Aider versions do add the model.

furyofantares 1 days ago [-]
He graded GPT 4 as winning because it didn't follow his instructions. And the instructions are unrealistic to anyone using coding assistants.

Maybe it's true that for some very bad prompts, the old version did a better job by not following the prompt, and that this reduces utility for some people.

Unrelated to assistants or coding, as an API user I've certainly had model upgrades that feel like downgrades at first, until I work out that the new model is following my instructions better. Sometimes my instructions were bad, sometimes they were attempts to get the older model to do what I want by saying over-the-top stuff that the new model now follows more precisely to a worse result. So I can definitely imagine that new models can be worse until you adapt.

Actually, another strange example like this - I had gotten in the habit of typing extremely fast to LLMs because they work just fine with my prompts riddled with typos. I basically disconnected the part of my brain that cares about sequencing between hands, so words like "can" would be either "can" or "cna". This ended up causing problems with newer models which would take my typos seriously. For example, if I ask to add support for commandline flag "allwo-netwokr-requests" it will usually do what I said, while previous versions would do what I wanted.

For anyone with some technical expertise and who is putting in serious effort to using AI coding assistants, they are clearly getting better at a rapid pace. Not worse.

chankstein38 8 hours ago [-]
The issue is NOT particular to the GPT models. Gemini does this stuff to me all of the time as well! Bandaids around actual problems, hides debugging, etc. They're just becoming less usable.
shevy-java 19 hours ago [-]
I find the whole idea of AI coding assistants strange.

For me, the writing speed has never been the issue. The issue has been my thinking speed. I do not see how an AI coding assistant helps me think better. Offloading thinking actually makes my thinking process worse and thus slower.

Swizec 19 hours ago [-]
> For me, the writing speed has never been the issue. The issue has been my thinking speed. I do not see how an AI coding assistant helps me think better

Similar to moving from individual work to coordinating a large codebase: coding agents, human or otherwise, let you think at a higher abstraction level and tackle larger problems by taking care of the small details.

amluto 19 hours ago [-]
If I’m coordinating a large codebase, I expect the people I’m coordinating to be capable of learning and improving over time. Coding agents cannot (currently) do this.

I wonder if a very lightweight RL loop built around the user could work well enough to help the situation. As I understand it, current LLMs generally do not learn at a rate such that one single bad RL example and one (prompted?) better example could result in improvement at anywhere near human speed.

ej88 19 hours ago [-]
I primarily find them useful in augmenting my thinking. Grokking new parts of a codebase, discussing tradeoffs back and forth, self-critiques, catching issues with my plan, etc.
minimaxir 1 days ago [-]
The article uses pandas as a demo example for LLM failures, but for some reason even the latest LLMs are bad at data science code, which is extremely counterintuitive. Opus 4.5 can write an EDA backbone, but it's often too verbose for code that's intended for a Jupyter notebook.

The issues have been less egregious than hallucinating an "index_value" column, though, so I'm suspicious. Opus 4.5 has still been useful for data preprocessing, especially in cases where the input data is poorly structured/JSON.

qsort 1 days ago [-]
This is not my experience. Claude Code has been fine for data science for a while. It has many issues and someone at the wheel who knows what they're doing is very much required, but for many common cases I'm not writing code by hand anymore, especially when the code would have been throwaway anyway. I'd be extremely surprised if a frontier model doesn't immediately get the problem the author is pointing out.
cons0le 1 days ago [-]
And the Ads aren't even baked in yet . . . that's the end goal of every company
grugagag 1 days ago [-]
Ads, dogfooding and ideology
troyvit 1 days ago [-]
There's really not much to take from this post without a repo and a lot of supporting data.

I wish they would publish the experiment so people could try with more than just GPT and Claude, and I wish they would publish their prompts and any agent files they used. I also wish they would say what coding tool they used. Like did they use the native coding tools (Claude Code and whatever GPT uses) or was it through VSCode, OpenCode, aider, etc.?

pablonm 10 hours ago [-]
I noticed Claude Code (on a $100 Max subscription) has become slower for me in the last few weeks. Just yesterday it spent hours coding a simple feature which I could have coded myself faster.
reassess_blind 13 hours ago [-]
I only have experience using it within my small scope, being full-stack NodeJS web development (i.e., an area with many solved problems and millions of lines of existing code for the models to reference), but my experience with the new Opus model in Claude Code has been phenomenal.
erelong 20 hours ago [-]
Interesting if true, but I would presume it to be negligible in comparison to the magnitude of gains over "manual coding", right? So nothing to lose sleep over at the moment...
stared 1 days ago [-]
Is it possible to re-run it? I am curious for Gemini 3 Pro.

As a side note, it is easy to create sharable experiments with Harbor - we migrated our own benchmarks there, here is our experience: https://quesma.com/blog/compilebench-in-harbor/.

Johnny555 21 hours ago [-]
> But as inexperienced coders started turning up in greater numbers, it also started to poison the training data.

I think all general AI agents are running into that problem - as AI becomes more prevalent and people accept and propagate wrong answers, the AI agents are trained to believe those wrong answers.

It feels like lately Google's AI search summaries are getting worse - they have a kernel of truth, but combine it with an incorrect answer.

bob1029 1 days ago [-]
> My team has a sandbox where we create, deploy, and run AI-generated code without a human in the loop.

I think if you keep the human in the loop this would go much better.

I've been having a lot of success recently by combining recursive invocation with an "AskHuman" tool that takes a required tuple of (question itself, how question unblocks progress). Allowing unstructured assistant dialog with the user/context is a train wreck by comparison. I've found that chain-of-thought (i.e., a "Think" tool that barfs into the same context window) seems to be directly opposed to the idea of recursively descending through the problem. Recursion is a much more powerful form of CoT.
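A minimal sketch of what that looks like as an OpenAI-style function/tool schema (the names and fields here are mine, purely illustrative):

  ask_human_tool = {
      "type": "function",
      "function": {
          "name": "ask_human",
          "description": "Ask the human a blocking question. Call only when "
                         "progress is impossible without an answer.",
          "parameters": {
              "type": "object",
              "properties": {
                  "question": {
                      "type": "string",
                      "description": "The question itself, stated precisely.",
                  },
                  "how_it_unblocks": {
                      "type": "string",
                      "description": "How answering this question unblocks progress.",
                  },
              },
              "required": ["question", "how_it_unblocks"],
          },
      },
  }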

falldrown 22 hours ago [-]
Codex is still useful for me. But I don't want to pay $200/month for it.

> To start making models better again, AI coding companies need to invest in high-quality data, perhaps even paying experts to label AI-generated code.

AI trainers hired by companies like Outlier, Mercor and Alignerr are getting paid like $15-$45/hr. Reviewers are crap. The screening processes are horribly done by AI interviewers.

rabf 20 hours ago [-]
Codex is included with the $20-a-month ChatGPT subscription, with very generous limits.
isodev 21 hours ago [-]
> It does this by removing safety checks, or by creating fake output that matches the desired format, or through a variety of other techniques to avoid crashing during execution.

So much this... the number of times Claude sneaks in default values, or avoids `.unwrap()`-ing optional values, just to avoid a crash at all costs... it's nauseating.

mat_b 19 hours ago [-]
I have been noticing this myself for the last couple of months. I cannot get the agent to stop masking failures (ex: swallowing exceptions) and to fail loudly.
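The kind of thing I mean, sketched with made-up names:

  import json

  def load_config_masked(path):
      try:
          with open(path) as f:
              return json.load(f)
      except Exception:
          return {}  # swallows the failure; downstream code limps along on defaults

  def load_config_loud(path):
      with open(path) as f:
          return json.load(f)  # a missing or malformed file fails loudly here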

That said, the premise that AI-assisted coding got worse in 2025 feels off to me. I saw big improvements in the tooling last year.

itopaloglu83 19 hours ago [-]
I keep finding myself saying "stop overcomplicating things" over and over again, because even the simplest question about how to load a file sometimes gets a code response the size of a framework.
metobehonest 1 days ago [-]
I can imagine Claude getting worse. I consider myself bearish on AI in general and have long been a hater of "agentic" coding, but I'm really liking using aider with the deepseek API on my huge monorepo.

Having tight control over the context and only giving it small tasks makes all the difference. The deepseek token costs are unbeatable too.

emsign 16 hours ago [-]
When coding assistants take longer, it's because they use more tokens, which is because AI companies are obligated to make more money.
jvanderbot 1 days ago [-]
Likely, and I'm being blithe here, it's because of great acceptance. If we try it on more difficult code, it'll fail in more difficult ways?

Until we start talking about LOC, programming language, domain expertise required, which agent, which version, and what prompt, it's impossible to make quantitative arguments.

radium3d 1 days ago [-]
The problem is everyone is using a different "level" of AI model. The experiences of those who cannot afford, or choose not to pay for, the advanced reasoning models are far worse than those of people who can and do pay.
j45 5 hours ago [-]
It feels like the more standardized the organization, or the more academic the background of an author, the further their insights lag behind the tip of the arrow.

It's clear AI coding assistants are able to help software developers at least in some ways.

Having a non-software-developer perspective speak about it is one thing, but one should be mindful that there are experienced folks too, for whom the technology appears to be a jetpack.

If it didn't work for you, that just means there's more to learn.

nhd98z 1 days ago [-]
This guy is using AI in the wrong way...
PunchTornado 13 hours ago [-]
ChatGPT is getting worse and is a useless model. Surprised that people are still using it. The article tests only this model.
renarl 1 days ago [-]
Strange that the article talks about ChatGPT 4 and 5 but not the latest 5.2 model.
jeffbee 22 hours ago [-]
Or any models NOT from OpenAI
empath75 1 days ago [-]
I'm not sure it is really getting worse, but I have had AI assistants add todo()s and comments saying that this still needs to be implemented and then tell me they did what I asked them to do.
thefreeman 1 days ago [-]
I think this is what the Ralph Wiggum plugin is for. It just repeatedly reprompts the LLM with the same prompt until the work is fully complete, or something along those lines.
qudat 19 hours ago [-]
Betteridge's law of headlines is an adage that states: "Any headline that ends in a question mark can be answered by the word no."
kazinator 1 days ago [-]
> This is of course an impossible task—the problem is the missing data, not the code.

We cannot with certainty assert that. If the datum is expected to be missing, such that the frame without the datum is still considered valid and must be handled rather than flagged as an error, the code has to do exactly that. Perhaps a missing value in the dictionary can be substituted with a zero.

  df['new_column'] = df.get('index_value', 0) + 1
  # there might be no column ‘index_value’;
  # requirements say that zero should be substituted.
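Illustratively (assuming pandas), when the column is absent the fallback broadcasts:

  import pandas as pd

  df = pd.DataFrame({"a": [10, 20]})               # no 'index_value' column
  df['new_column'] = df.get('index_value', 0) + 1
  print(df['new_column'].tolist())                 # [1, 1]: the zero default, plus one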
fwip 24 hours ago [-]
The author suspects that this effect is due to users accepting these "make it work" fixes. But wouldn't training for coding challenges also explain this? Because those are designed to be solvable, anything that lets you move forward toward the solution is better than giving up.
toss1 1 days ago [-]
The key point in the middle of the article. As AIs expand usage to larger numbers of lower-skilled coders whose lower ability to catch errors and provide feedback generates lower quality training data, the AIs are basically eating their own garbage, and the inevitable GIGO syndrome starts.

>>But as inexperienced coders started turning up in greater numbers, it also started to poison the training data.

>>AI coding assistants that found ways to get their code accepted by users kept doing more of that, even if “that” meant turning off safety checks and generating plausible but useless data. As long as a suggestion was taken on board, it was viewed as good, and downstream pain would be unlikely to be traced back to the source.

Zababa 1 days ago [-]
From what I understand, model collapse/GIGO is not a problem, in that labs generally know where the data comes from, so even if it causes problems in the long run you could filter it out. It's not like labs are forced to train models on user outputs.
toss1 1 days ago [-]
Indeed they are not forced to train them on user outputs, but the author of the article seems to have found good evidence that they are actually doing that, and will need more expert data-tagging/filtering on the inputs to regain their previous performance
Zababa 14 hours ago [-]
I don't think the author of the article found "good evidence". He found a specific case where there was a regression. This could be due to:

- models actually getting worse in general

- his specific style of prompting working well with older models and less well with newer models

- the thing his test tests no longer being a priority for big AI labs

From the article:

> GPT-4 gave a useful answer every one of the 10 times that I ran it. In three cases, it ignored my instructions to return only code, and explained that the column was likely missing from my dataset, and that I would have to address it there.

Here ignoring the instructions to give a "useful answer" (as evaluated by the author) is considered a good thing. This would mean if a model is trained to be better at instruction following, it would lose points in that test.

To me this article feels a bit like saying "this new gun that shoots straight 100% of the time is worse than the older gun that shot straight only 50% of the time, because sometimes I shoot at something I don't actually want to shoot at!". And in a way, it is true: if you're used to being able to shoot at things without them getting hurt, the new gun will be worse from that point of view. But to spin up a whole theory about garbage in/garbage out from that? Or to think all models are getting worse rather than that you're maybe no longer the target audience? That seems weird to me.

toss1 8 hours ago [-]
You're right - I wasn't considering how narrow his case is and was perhaps overgeneralizing, particularly about the cause.

Seems we agree the better solution when column_index_+1 doesn't exist is to call it out instead of stealthily appending a new column, but why the newer models have that behavior is indeed speculative.

It echoes a bit the conundrum from back in the PC days, where IBM hardware was the de facto standard and companies building "compatible" hardware had to decide whether to be compatible with the spec or compatible with every detail of the implementation, including buggy behavior, of which OFC some software took advantage. So, do they build to be "compatible" or "bug-compatible"?

Was the ChatGPT v4 response highlighting the missing column a bug or failure to shoot straight? Not sure I'd characterize it that way, but there definitely could be many other reasons for the change in behavior (other than training on lower-skilled programmers' inputs) — we really have to consider that as a conjecture on the author's part.

ta9000 10 hours ago [-]
Silent but deadly… oooohh scary! Jesus, talk about sensationalizing a boring topic.
nodesocket 1 days ago [-]
While I still prefer to code my side project in Python and Flask myself, I recently used Cursor to write unit tests. It took a few hours of tweaking, refining, and fixing tests, but afterwards I had over 400 unit tests with 99% coverage of my app and routes. I would never have spent the time to get this amount of test coverage manually.
solumunus 1 days ago [-]
I do find there are particular days where I seem to consistently get poor results, but in general this is not my experience. I’m very pleased with the output 80% of days.
oblio 1 days ago [-]
> To start making models better again, AI coding companies need to invest in high-quality data, perhaps even paying experts to label AI-generated code.

Heh, there's only one problem with that. Training models is very expensive from a power/infrastructure/hardware perspective. Inference is not as expensive but it's still fairly expensive and needs sophisticated layers on top to make it cheaper (batching, caching, etc).

Guess in which cost category "high-quality data reviewed by experts" falls under.

Manfred 1 days ago [-]
I would hope the trillions of dollars sloshing around are used to pay people to make the core of the product better.
oblio 1 days ago [-]
If you ask around the Magnificent 7, a lot of the talk rhymes with: "we're converting Opex into Capex", translated: "we're getting rid of people to invest in data centers (to hopefully be able to get rid of even more people over time)."

There are tons of articles online about this, here's one:

https://finance.yahoo.com/news/amazon-bets-ai-spending-capex...

They're all doing it, Microsoft, Google, Oracle, xAI, etc. Those nuclear power plants they want to build, that's precisely to power all the extra data centers.

If anything, everyone hopes to outsource data validation (the modern equivalent to bricklayers under debt slavery).

chiengineer 1 days ago [-]
Where are the benchmarks for all the different tools and subscriptions/APIs?

CLI vs IDE vs Web?

Nothing for gpt codex 5.1 max or 5.2 max?

Nothing about the prompts? Quality of the prompts? I literally feed the AI into the AI: I just ask a smaller model for the most advanced prompts and then use them for the big stuff, and it's smooth sailing.

I got Codex 5.1 max, with the Codex extension in VS Code, to generate over 10k lines of code for my website demo project that worked the first time.

This is also with just the regular $20 subscription.

GitHub Copilot Pro+ with VS Code is my main go-to, and the project, prompts, agent.md quality, and project configuration can all change the outcome of each question.

FrustratedMonky 1 days ago [-]
Perhaps because nobody is on Stack Overflow providing updates?
jnmandal 1 days ago [-]
Yep. Not just Stack Overflow -- pretty much everywhere. If only someone could have foreseen this problem!!!

Anyways, no issue. We'll just get Claude to start answering Stack Overflow questions!

moshegramovsky 1 days ago [-]
This definitely matches my experience.

Gemini 2.5 was genuinely impressive. I even talked it up here. I was a proper fanboy and really enjoyed using it. Gemini 3 is still good at certain things, but it is clearly worse than 2.5 when it comes to working with larger codebases. Recently, I was using AntiGravity and it could not help me find or fix a reference-counting bug. ( 50 classes, 20k LOC total, so well within context limits ) I know AntiGravity is new, which explains why it is rough around the edges. But it is built on Gemini, so the results should at least be on par with Gemini 3, right? Apparently not. I am an excellent prompter, and no amount of additional context, call stacks, watch-window values, you name it, made any difference.

I still use Gemini for code reviews and simple problems, and it remains excellent for those use cases. But in many respects, Gemini 3 is a regression. It hallucinates more, listens less, and seems oddly resistant to evidence. It produces lots of lofty, confident-sounding statements while ignoring the actual facts in front of it. The experience can be exhausting, and I find myself using it much less as a result. I guess this is typical of companies these days - do something great and then enshittify it? Or maybe there are technical issues I'm not aware of.

What is especially interesting is reading all the articles proclaiming how incredible AI coding has become. And to be fair, it is impressive, but it is nowhere near a magic bullet. I recently saw a non-programmer designer type claiming he no longer needs developers. Good luck with that. Have fun debugging a memory leak, untangling a database issue, or maintaining a non-trivial codebase.

At this point, I am pretty sure my use cases are going to scale inversely with my patience and with my growing disappointment.

lunar_mycroft 1 days ago [-]
The following was originally at the start of your comment:

> Here’s the same text with all em dashes removed and the flow adjusted accordingly:

Did you have an LLM write your comment then remove the evidence?

moshegramovsky 1 days ago [-]
I cleaned it up with an LLM. Is there a problem with that?

Sorry, I should be clear: do you have a problem with that?

threethirtytwo 1 days ago [-]
First you insult my credibility, then you use AI to generate a comment? You didn't just use an LLM to "clean it up"; it looks completely written by an LLM. And not only do I have a problem with it, it's, in general, against the rules here. Moderators will warn and eventually ban this type of thing.
wainstead 1 days ago [-]
Is it just me or is this a giant red flag?

> My team has a sandbox where we create, deploy, and run AI-generated code without a human in the loop.

chiengineer 1 days ago [-]
This is more common than you think

Tons of smart people not using it right

Unsure of the power it can actually unleash with the right prompt + configuration

100% needs a human in the loop

It's not Jarvis.

guluarte 23 hours ago [-]
idk but opus is pretty good
ripped_britches 1 days ago [-]
I’m sorry but what a ridiculous assertion. They are objectively better on every measure we can come up with. I used 2b input and 10m output tokens on codex last week alone. Things are improving by the month!
Zababa 1 days ago [-]
>However, recently released LLMs, such as GPT-5, have a much more insidious method of failure. They often generate code that fails to perform as intended, but which on the surface seems to run successfully, avoiding syntax errors or obvious crashes. It does this by removing safety checks, or by creating fake output that matches the desired format, or through a variety of other techniques to avoid crashing during execution.

This is a problem that started with, I think, Claude Sonnet 3.7? Or 3.5, I don't remember well. But it's not recent at all; one of those two Sonnets was known to change tests so that they would pass, even if they didn't properly test anything anymore.

>But as inexperienced coders started turning up in greater numbers, it also started to poison the training data. AI coding assistants that found ways to get their code accepted by users kept doing more of that, even if “that” meant turning off safety checks and generating plausible but useless data. As long as a suggestion was taken on board, it was viewed as good, and downstream pain would be unlikely to be traced back to the source.

No proof or anything is offered here.

The article feels mostly like a mix of speculation and being behind on current practice. You can avoid a lot of the problems of "code that looks right" by making the models write tests, insisting that the tests are easy to review and hard to fake, and offering examples. This worked well 6 months ago and works even better today, especially with Opus 4.5, but even Codex 5.2 and Gemini 3 Pro work well.
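For example, a sketch of the kind of test that is hard to fake (the function and columns are illustrative, echoing the article's example):

  import pandas as pd
  import pytest

  def add_index_plus_one(df: pd.DataFrame) -> pd.DataFrame:
      out = df.copy()
      out["new_column"] = out["index_value"] + 1
      return out

  def test_values_computed_from_the_real_column():
      df = pd.DataFrame({"index_value": [1, 2, 3]})
      assert add_index_plus_one(df)["new_column"].tolist() == [2, 3, 4]

  def test_missing_column_fails_loudly():
      with pytest.raises(KeyError):
          add_index_plus_one(pd.DataFrame({"other": [1]}))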

Kapura 1 days ago [-]
so you're saying all those bros on linkedin telling me that "this is the worst it's ever going to be" were full of shit? i am shocked.
dcre 1 days ago [-]
Counterpoint: no, they're not. The test in the article is very silly.
vidarh 1 days ago [-]
This springs to mind:

"On two occasions I have been asked, – "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question"

It's valid to argue that there's a problem with training models to comply to an extent where they will refuse to speak up when asked to do something fundamentally broken, but at the same time a lot of people get very annoyed when the models refuse to do what they're asked.

There is an actual problem here, though, even if part of the problem is competing expectations of refusal.

But in this case, the test is also a demonstration of exactly how not to use coding assistants: Don't constrain them in ways that create impossible choices for them.

I'd guess (I haven't tested) that you'd have decent odds of getting better results even just pasting the error message into an agent than adding stupid restrictions. And even better if you actually had a test case that verified valid output.

(and on a more general note, my experience is exactly the opposite of the writer's two first paragraphs)

InsideOutSanta 1 days ago [-]
How is it silly?

I've observed the same behavior somewhat regularly, where the agent will produce code that superficially satisfies the requirement, but does so in a way that is harmful. I'm not sure if it's getting worse over time, but it is at least plausible that smarter models get better at this type of "cheating".

A similar type of reward hacking is pretty commonly observed in other types of AI.

vidarh 1 days ago [-]
It's silly because the author asked the models to do something they themselves acknowledged isn't possible:

> This is of course an impossible task—the problem is the missing data, not the code. So the best answer would be either an outright refusal, or failing that, code that would help me debug the problem.

But the problem with their expectation is that this is arguably not what they asked for.

So refusal would be failure. I tend to agree refusal would be better. But a lot of users get pissed off at refusals, and so the training tend to discourage that (some fine-tuning and feedback projects (SFT/RLHF) outright refuse to accept submissions from workers that include refusals).

And asking for "complete" code without providing a test case showing what they expect such code to do does not have to mean code that runs to completion without error, but again, in lots of other cases users expect exactly that, and so for that as well a lot of SFT/RLHF projects would reject responses that don't produce code that runs to completion in a case like this.

I tend to agree that producing code that raises a more specific error would be better here too, but odds are a user that asks a broken question like that will then just paste in the same error with the same constraint. Possibly with an expletive added.

So I'm inclined to blame the users who make impossible requests more than I care about the model doing dumb things in response to dumb requests. As long as they keep doing well on more reasonable ones.

Zababa 1 days ago [-]
It is silly because the problem isn't getting worse, and it isn't caused by AI labs training on user outputs. Reward hacking is a known problem, as you can see in the Opus 4.5 system card (https://assets.anthropic.com/m/64823ba7485345a7/Claude-Opus-...), and they are working to reduce the problem and measure it better. The assertions in the article seem to be mostly false and/or based on speculation, but it's impossible to really tell since the author doesn't offer a lot of detail (for example for the 10h task that used to take 5h and now takes 7-8h) except for a very simple test (that reminds me more of "count the r in strawberry" than coding performance tbh).
amluto 1 days ago [-]
Is it?

This week I asked GPT-5.2 to debug an assertion failure in some code that worked on one compiler but failed on a different compiler. I went through several rounds of GPT-5.2 suggesting almost-plausible explanations, and then it modified the assertion and gave a very confident-sounding explanation of why it was reasonable to do so, but the new assertion didn't actually check what the old assertion checked. It also spent an impressive amount of time arguing, entirely incorrectly and based on flawed reasoning that I don't really think it found in its training set, as to why it wasn't wrong.

I finally got it to answer correctly by instructing it that it was required to identify the exact code generation difference that caused the failure.

I haven’t used coding models all that much, but I don’t think the older ones would have tried so hard to cheat.

This is also consistent with reports of multiple different vendors’ agents figuring out how to appear to diagnose bugs by looking up the actual committed fix in the repository.

efficax 1 days ago [-]
they all do this at some point. claude loves to delete tests that are failing if it can't fix them. or delete code that won't compile if it can't figure it out
amluto 1 days ago [-]
Huh. A while back I gave up fighting with Claude Code to get it to cheat the ridiculous Home Assistant pre-run integration checklist so I could run some under-development code and I ended up doing it myself.
terminalbraid 1 days ago [-]
The strength of argument you're making reminds me of an onion headline.

https://theonion.com/this-war-will-destabilize-the-entire-mi...

"This War Will Destabilize The Entire Mideast Region And Set Off A Global Shockwave Of Anti-Americanism vs. No It Won’t"

dcre 20 hours ago [-]
I was thinking of that when I wrote it.
foxglacier 1 days ago [-]
Yes. He's asking it to do something impossible then grading the responses - which must always be wrong - according to his own made-up metric. Somehow a program to help him debug it is a good answer despite him specifying that he wanted it to fix the error. So that's ignoring his instructions just as much as the answer that simply tells him what's wrong, but the "worst" answer actually followed his instructions and wrote completed code to fix the error.

I think he has two contradictory expectations of LLMs:

1) Take his instructions literally, no matter how ridiculous they are.

2) Be helpful and second guess his intentions.

Leynos 1 days ago [-]
It's the following that is problematic: "I asked each of them to fix the error, specifying that I wanted completed code only, without commentary."

GPT-5 has been trained to adhere to instructions more strictly than GPT-4. If it is given nonsense or contradictory instructions, it is a known issue that it will produce unreliable results.

A more realistic scenario would have been for him to have requested a plan or proposal as to how the model might fix the problem.

nuky 7 hours ago [-]
[dead]
bschmidt300 1 days ago [-]
[dead]
stingtao 12 hours ago [-]
[dead]
stingtao 12 hours ago [-]
Forgot to mention. I made catsbook in 3 days and presentation earlier in 7 days.

I do think AI code assistant is super great.

Recently, I mostly use Open Codex 5.2 + the extra-high reasoning model with the $200 monthly subscription, and it's the best among all the other coding agents.

(I have subscribed to 4 at the same time and use all of them across a dozen projects at the same time.)

nuky 3 hours ago [-]
[dead]
ajjahs 23 hours ago [-]
[dead]
b0rsuk 1 days ago [-]
[dead]
black_13 23 hours ago [-]
[dead]
mikert89 1 days ago [-]
[flagged]
qsort 1 days ago [-]
I mean, it's 2026, you can just say things I guess.
bee_rider 1 days ago [-]
Good point, it’s 2026, they could have just said “Things are getting worse.”
tacoooooooo 1 days ago [-]
This is a wildly out of touch thing to say
fourside 1 days ago [-]
Did you read the article?
dhorthy 1 days ago [-]
I read it. I agree this is out of touch. Not because the things it's saying are wrong, but because the things it's saying have been true for almost a year now. They are not "getting worse"; they "have been bad". I am staggered to find this article qualifies as "news".

If you're going to write about something that's been true and discussed widely online for a year+, at least have the awareness/integrity to not brand it as "this new thing is happening".

flumpcakes 1 days ago [-]
Perhaps the advertising money from the big AI money sinks is running out and we are finally seeing more AI scepticism articles.
minimaxir 1 days ago [-]
> They are not "getting worse" they "have been bad".

The agents available in January 2025 were much much worse than the agents available in November 2025.

Snuggly73 1 days ago [-]
Yes, and for some cases no.

The models have gotten very good, but I'd rather have an obviously broken pile of crap that I can spot immediately than something that is deep-fried with RL to always succeed but has subtle problems that someone will LGTM :( I guess it's not much different with human-written code, but the models seem to have weirdly inhuman failures - like, you would just skim some code, because you just can't believe that anyone could get it wrong, and it turns out to be wrong.

minimaxir 1 days ago [-]
That's what test cases are for, which is good for both humans and nonhumans.
Snuggly73 1 days ago [-]
Test cases are great, but not a total solution. Can you write a test case for the add_numbers(a, b) function?
Snuggly73 1 days ago [-]
Well, for some reason it doesn't let me respond to the child comments :(

The problem (which should be obvious) is that with a and b real you can't construct an exhaustive input/output set. A test case can only prove the presence of a bug, not its absence.

Another category of problems that you can't just test for, and instead have to prove correctness of, is concurrency problems.

And so forth and so on.

minimaxir 1 days ago [-]
Of course you can. You can write test cases for anything.

Even an add_numbers function can have bugs, e.g. you have to ensure the inputs are numbers. Most coding agents would catch this in loosely-typed languages.
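A sketch, assuming the hypothesis library: property-based tests sample the input space rather than exhaust it, so they can show the presence of bugs but not their absence.

  from hypothesis import given, strategies as st

  def add_numbers(a, b):
      if not isinstance(a, (int, float)) or not isinstance(b, (int, float)):
          raise TypeError("add_numbers expects numbers")
      return a + b

  @given(st.integers(), st.integers())
  def test_commutative(a, b):
      assert add_numbers(a, b) == add_numbers(b, a)

  @given(st.integers())
  def test_zero_is_identity(a):
      assert add_numbers(a, 0) == a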

Snuggly73 1 days ago [-]
I mean "have been bad" doesn't exclude "getting worse", right :)