“Bugs are 100x more expensive to fix in production” study might not exist (2021) (theregister.com)
65 points by rafaepta 2 days ago | 52 comments
tester756 2 days ago [-]
I work in the HW industry, but doing SW.

If I deploy a bug despite "unit" tests, then it will probably be caught by 2nd-tier tests a day later, which is already more expensive, since I'll need to read the report, context switch from another task, fix it, redeploy, wait for CI, wait for the 2nd-tier tests, etc.

If it isn't caught by the 2nd tier, then it will probably be caught by validation engineers, which not only costs their time: they also have to contact me / file a bug, and I need to get familiar with it, context switch again, fix it, redeploy, wait for tests #1 and #2, and wait for their confirmation.

If it's released to actual customers, then there's reputational damage (can be huge depending on impact), possibly some lost sales/customers, and probably a few weeks before it gets back to me through customer -> ??? -> validation -> me.

So, while 100x more expensive is an extreme case, the cost is usually very significantly higher the later you find the bug.
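
A minimal sketch of that escalation as a toy cost model - every number below is invented for illustration, only the ordering matters:

  # Toy cost model: hours to resolve a bug by the stage where it's caught.
  # All figures are hypothetical; the point is the shape of the curve.
  HOURLY_RATE = 70  # USD, an assumed fully-loaded engineer cost

  stages = {
      "unit tests":     1,    # fix immediately, no context switch
      "2nd-tier tests": 4,    # read report, context switch, redeploy, CI
      "validation":     16,   # bug filed, triage, both test tiers rerun
      "customer":       160,  # weeks of round-trips, reputation not counted
  }

  for stage, hours in stages.items():
      print(f"{stage:>14}: ~${hours * HOURLY_RATE:,}")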

But what about HW bugs?

I think those can be really, really expensive.

Imagine catching a hardware bug affecting some computation, one that MUST be fixed at the hardware level.

Catching it before release versus catching it after selling 100m units is a difference of tens of billions of dollars for top GPUs and CPUs.

Why do you think those companies are willing to build the same product twice so the two implementations test each other? It significantly increases the cost, but reduces the risk.

Think about CrowdStrike's incident: catching that bug at the dev level would have cost them a shitton less than what actually happened :P

Their stock dropped by like 45%

monksy 2 days ago [-]
> So, while 100x more expensive is an extreme case, the cost is usually very significantly higher the later you find the bug.

Once you're talking reputational damage and lost sales/customers, you're already in 100x territory. It may not be exactly there, but it's close to or exceeding it.

Yet we'll still get the undisciplined engineers whining that they have to write a small and isolated unit test.

cwsx 2 days ago [-]
> Yet we'll still get the undisciplined engineers whining that they have to write a small and isolated unit test.

Oh, I wish that were true in my experience - much more commonly it's a project manager who doesn't understand the value of tests...

Dylan16807 2 days ago [-]
> Think about CrowdStrike's incident: catching that bug at the dev level would have cost them a shitton less than what actually happened :P

> Their stock dropped by like 45%

A program that can block booting is definitely on the high side of software bug risk.

But even in that case, the post-drop stock was still trading higher than a year before and they got back to the same level in six months.

hobs 2 days ago [-]
And in most corporate mindsets that's six months of growth off the table, six months of fighting just to get back to normal - that's a huge real cost and a huge opportunity cost.
Dylan16807 2 days ago [-]
There was still plenty of growth to today. I doubt the stock is much lower than it would have been if that event never happened. Mild real cost, mild opportunity cost.
mixmastamyk 1 day ago [-]
Yes, I'm reminded of the Pentium FDIV bug:

https://en.wikipedia.org/wiki/Pentium_FDIV_bug

A $475 million charge, in 1990s dollars.

Every additional person who gets involved without having authored the bug needs to research and communicate the problem. 100x sounds small after just a few layers of indirection.

faizshah 2 days ago [-]
I have some thoughts on this (in the context of modern SaaS companies).

The most expensive parts of fixing a bug are discovering/diagnosing/triaging it, cleaning up corrupted records, and customer communication. If you discover a bug in development - or better, while writing the function or during code review - you get to bypass triaging, customer calls, escalations, RCAs, etc. At a SaaS company with enterprise customers, each of those steps involves multiple meetings with your Support, Account Manager, Senior Engineer, Product Manager, Engineering Manager, Department Manager, sometimes Legal or a Security Engineer, and then finally the actual coder. So of course, if you can resolve an issue (at a modern SaaS company) during development, it can be 10-100x less expensive, just because of how much bureaucracy is involved in running a large-scale enterprise SaaS company.

It also brings up an interesting side effect of companies adopting non-deterministic coding (AI code): bugs that would have been discovered during design/development by a human engineer writing the code can now leak all the way into prod.

ziggure 2 days ago [-]
The bureaucracy involved is usually the biggest cost driver. Another is the refactoring needed once more code has been built atop the buggy code.
janice1999 2 days ago [-]
If you ship firmware to devices, it could be far more expensive. [1]

[1] https://www.bleepingcomputer.com/news/hardware/botched-firmw...

mrheosuper 2 days ago [-]
We once had a bug in our FW that was caught quite late. We had to unbox thousands of products, connect each one to our phone to download the new FW, then re-package it. Not fun at all.
0xbadcafebee 2 days ago [-]
Forget the study, let's just do a simple thought experiment. Your developer gets paid $140k/yr (let's round up to ~$70/hr). Say a given bug found in testing takes 1 hour to fix; that's $70 (not counting the costs of CI/CD etc.). If they miss it in test and it hits production, would it cost $7,000 to fix? Depends what you mean by "bug", what it affects, and what you mean by "fix in production".
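
The arithmetic behind that setup, spelled out (a minimal sketch using the comment's own numbers):

  # $140k/yr over ~2000 working hours/yr = $70/hr.
  salary_per_year = 140_000
  hourly = salary_per_year / 2000        # = 70.0
  fix_in_test = 1 * hourly               # one hour in test: $70
  fix_in_prod = 100 * fix_in_test        # the claimed 100x: $7,000
  print(hourly, fix_in_test, fix_in_prod)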

- Did you screw up the font size on some text you just published? Ok, you can fix that in about 5 seconds, and it affects pretty much nothing. Doesn't cost 100x.

- Did your SQL migration just delete all records in the production database? Ok, that's going to take longer than 5 seconds to fix. People's data is gone, apps stop working, the missing or bad data fed to other systems causes larger downstream issues, there's the reputational harm, the money you'll have to pay back to advertisers for their ads / your content being down, and all of that multiplied by however long it takes you to restore the database from backup (um... you do test restoring your backups... right?). That's closer to 100x more expensive to fix in production. (A guard sketch for this case follows this list.)

- Did you release a car, airplane, satellite, etc with a bug? We're looking at potentially millions in losses. Way more than 1000x.
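
Returning to the second case above: a minimal sketch of the kind of guard that catches a runaway migration before it commits. The table name and threshold are hypothetical; this is an illustration, not anyone's actual pipeline:

  # Run the destructive step inside a transaction and refuse to commit
  # if it touches far more rows than this migration plausibly should.
  import sqlite3

  conn = sqlite3.connect("app.db")
  try:
      cur = conn.cursor()
      cur.execute("DELETE FROM records WHERE migrated = 1")
      if cur.rowcount > 1000:  # sanity bound for this particular migration
          raise RuntimeError(f"migration touched {cur.rowcount} rows; aborting")
      conn.commit()
  except Exception:
      conn.rollback()
      raise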

And those are just the easy ones. What about a bug you release, that then is adopted (and depended on) by downstream api consumers, and that you then spend decades to patch over and engineer around? How about when production bugs cause your product team to lose confidence in deployments, so they spend weeks and weeks to "get ready" for a single deploy, afraid of it failing and not being able to respond quickly? That fear will dramatically slow down the pace of development/shipping.

The "long tail" of fixing bugs in production involves a lot more complexity than in non-production; that's where the extra cost comes from. These costs could end up costing 10,000x over the long term, when all is said and done. Security bugs, reliability bugs, performance bugs, user interface bugs, etc. There's a universe of bugs which are much harder/costlier to fix in production.

But you know what is certain? It always costs more to fix in production. 1.2x, 10x, 1000x - that's not the point; the point is, fix your bugs before they go to production. ("Shift Left" is how we refer to this in the DevOps space, but it applies to everything in the world that has to do with quality. Improve quality before it gets shipped to customers, and you save money in the long run.)

TuringNYC 2 days ago [-]
>> Did you screw up the font size on some text you just published? Ok, you can fix that in about 5 seconds, and it affects pretty much nothing. Doesn't cost 100x.

Actually, I find these to be worse, because I've been in scrum meetings where 6 people spend 2 minutes talking about this bug, then another 2 minutes talking about the QA of it the next day. Tiny issues are very expensive to fix if you have formulaic team members who aren't taking the reins.

hirsin 2 days ago [-]
And you're lucky if it's two. If they're not familiar with the rendering engine/docs system/API platform/middleware etc (or just think they aren't), or have low confidence in a debt ridden platform, they'll spend five minutes a day for a week or three theorizing on how it could have gone wrong, debating if, actually, it's correct the way it is, refreshing their knowledge of deep internals, debating if a fix could break something, and so on. Way safer to do that than risk making things worse in an uncertain system.
cranky908canuck 2 days ago [-]
I will willingly risk karma hazard by suggesting this isn't about "bugs leaking into later phases" but about "corporate culture allowing endless nitpicking".

(Which I reckon is what my parent comment is saying too!)

So ok, not totally willing: 'end-sarky'.

jamesfinlayson 2 days ago [-]
Yep, bike-shedding in action - the most mundane issues (and features too) will generate way too much discussion. At least with some complicated internal issue, no clueless manager is able to offer their 2c on how to fix it.
Supermancho 2 days ago [-]
Bugs are more like viruses in practice. The cost is better measured in negative lifespan than in cost-to-fix per se. This is why many bugs are never fixed. Those cost nothing to fix, because they never have to be.

> Did your SQL migration just delete all records in the production database? That's closer to 100x more expensive to fix in production.

Companies that do this often don't stay in business. It's not 100x more expensive if you're not in business. Survivorship bias ensures that these classes of bugs don't show a consistent negative return, because they are often fatal.

tedunangst 2 days ago [-]
Is the opportunity cost of one hour of developer time only $70? Is every hour spent testing guaranteed to fix one bug?

I feel like the logic here gets a little twisted because it's comparing the value of a known outcome with an earlier probability. You can save millions of dollars by buying a winning lottery ticket before they announce the number.
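
The same point, restated as expected-value arithmetic (every probability below is invented for illustration):

  # The hour of testing is a certain cost; the production fix is a
  # probabilistic one. The comparison, not the verdict, is the point.
  cost_of_test_hour = 70      # certain
  p_catch = 0.05              # assumed chance that hour catches a real bug
  cost_of_prod_fix = 7_000    # assumed cost if the bug ships

  expected_saving = p_catch * cost_of_prod_fix   # = $350
  print(expected_saving > cost_of_test_hour)     # True with these numbers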

RedNifre 1 day ago [-]
Not necessarily - crashes in production give you way more crash logs, which sometimes make the root cause instantly obvious and thus quicker to fix.
cpeterso 2 days ago [-]
HN discussion about the Register article from 2021: https://news.ycombinator.com/item?id=27917595

HN discussion about the original blog post from 2021: https://news.ycombinator.com/item?id=27892615

tomhow 2 days ago [-]
Thanks!

The “bugs are 100x more expensive to fix in production” study might not exist - https://news.ycombinator.com/item?id=27917595 - July 2021 (130 comments)

Software engineering research is a train wreck - https://news.ycombinator.com/item?id=27892615 - July 2021 (166 comments)

ckastner 1 day ago [-]
There was also Boehm's 1981 "Software Engineering Economics", which went into this, but I can't find the details right now.

This NASA publication [1] cites a number of studies and the cost increase factors they estimate for various stages of development, including the Boehm study.

[1]: https://ntrs.nasa.gov/api/citations/20100036670/downloads/20...

MaulingMonkey 1 day ago [-]
Spelling mistake in the "about us" section of your continuously deployed website? Production or pre-production matters little if it's a developer catching the bug. Maybe if it's caught by a customer, you have some overhead as it's triaged through QA, product leads, routed to someone with commit permissions, etc.

Spelling mistake in the "about us" section of your program baked into ROMs of internationally sold hardware? 100x is a vast underestimate of the cost multiplier to "fix in production", which would likely involve recalls, if not trashing product outright, and ROMs were a lot more common in the era this 100x figure supposedly came from. You might fix it for the next batch, or include the fix if you had a more critical bug that might make the recall worth it, but otherwise that bug lives in production for the life of the hardware.

Spelling mistake in the HTTP 1.x referrer field? You update the world's dictionaries, because that's significantly cheaper than mutating the existing bits of the protocol. Any half measures that would require maintaining backwards compatibility would cause more problems than fixing the spelling "fixes", and any full measure that would fix all the old software for everyone might require a war or three after bankrupting a few billionaires. That bug isn't just in software now, it's burrowed into books and minds.
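
The misspelling really is frozen into the protocol: RFC 9110 still spells the request header "Referer", so every conforming client reproduces the bug. A trivial illustration:

  # The header must stay misspelled to interoperate; a correctly spelled
  # "Referrer" field would be ignored by servers expecting the standard name.
  import urllib.request

  req = urllib.request.Request("https://example.com/")
  req.add_header("Referer", "https://news.ycombinator.com/")  # sic, per spec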

"Same" bug, but different contexts lead to wildly different multipliers. If you want useful numbers for your context, you probably don't need or want a generic study and statistics - I'd bet even a dataless wild guess would be more accurate. Alternatively, you can run your own study in your own context...

tobyjsullivan 2 days ago [-]
Subtitle:

> It's probably still true, though, says formal methods expert

Seems like clickbait. The thesis is predicated on the idea that people claim this is the result of some study. I’ve never once heard it presented that way. It’s a rule of thumb.

SloopJon 2 days ago [-]
> The thesis is predicated on the idea that people claim this is the result of some study. I’ve never once heard it presented that way. It's a rule of thumb.

Code Complete cites eight sources to support the claim that the average cost to fix a defect introduced during requirements is 10-100x if it's not detected until after release. My qualm with Hillel's original assertion is that "They all use this chart from the 'IBM Systems Sciences Institute'" (emphasis added). I haven't personally vetted Steve McConnell's citations, but I am skeptical that they all share this common origin.

jdlshore 1 day ago [-]
Laurent Bossavit’s The Leprechauns of Software Engineering looks into this claim (and several others) and finds that, yes, many of these studies do share a common origin, and often misquote/misrepresent it.
TeMPOraL 2 days ago [-]
I'm quite sure that, over the years, I've seen this claim presented many times with a citation or at least reference pointing at a study somewhere; can't find any particular example right now, unfortunately.

(This claim sits in my memory adjacent to things like "fixed number of bugs per 1000 lines of code", in a bucket labeled "seen multiple times, supposedly came out of some study on software engineering, something IBM or ACM or such".)

rzzzt 2 days ago [-]
If you have impressively low error rates in mind, I think they are coming from "They Write the Right Stuff":

  > Consider these stats: the last three versions of the program - each
  > 420,000 lines long - had just one error each. The last 11 versions of
  > this software had a total of 17 errors. Commercial programs of
  > equivalent complexity would have 5,000 errors.
TeMPOraL 1 day ago [-]
I don't recall that. What I had in mind was this result being used in support of higher-level programming languages. The argument, as I remember it, went: when comparing teams writing the same thing in multiple languages (IIRC Java, and either Assembly or C, were involved), the number of bugs per KLOC was found to be about the same in each case, but the number of features implemented in the same number of lines was obviously much greater in the high-level, more expressive languages; therefore it's better to use high-level languages.

I do buy the general idea (more expressive language -> some classes of bugs inexpressible by design + fewer lines of code for bugs to hide in), but I'm suspicious of the specific result showing a constant bug/KLOC ratio across all tested languages; it feels more like a lucky statistical artifact than some deep, fundamental relationship.
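
The shape of that argument, with invented numbers (a constant defect density is the comment's premise, not an established fact):

  # If defects per KLOC were constant, expressiveness would lower
  # defects per *feature*. Every figure below is hypothetical.
  defects_per_kloc = 15
  loc_per_feature = {"assembly": 2000, "c": 800, "java": 300}

  for lang, loc in loc_per_feature.items():
      print(f"{lang}: ~{defects_per_kloc * loc / 1000:.1f} defects/feature")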

tanseydavid 2 days ago [-]
I don’t get the hairsplitting here - it seems obvious to me that if you build the wrong feature, you have to replace it with something else, which needs building, as well as something akin to demolition of the first feature.

Repeat this cycle more than once for the same feature and it clearly adds up to real impact…

The 100x may be exaggerated, but that’s beside the point to me - I think even 2x or 3x on a feature is regrettable, and oftentimes avoidable.

Dylan16807 2 days ago [-]
It depends on how much money you make from getting features out faster. And how much happier your customers are to have the average feature earlier.

If the cost is only 2x or 3x, there are many situations where the benefits are bigger. If it's 100x there are a lot fewer such situations.

asimeqi 2 days ago [-]
The 100x is also kind of meaningless. 100x compared to what? I can introduce a bug by being careless for 1 minute, discover it in my own tests, and spend the next 2 days figuring it out. That's already 960x compared to the time it took me to introduce it.
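
Spelling out that arithmetic (assuming two 8-hour working days):

  minutes_to_introduce = 1
  minutes_to_fix = 2 * 8 * 60                    # two working days = 960 min
  print(minutes_to_fix // minutes_to_introduce)  # 960x
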
01HNNWZ0MV43FF 2 days ago [-]
Compared to before production
mrheosuper 2 days ago [-]
Even before production there are multiple development stages. An authentication bug at the PoC stage costs a few hours of engineering time; in production, it could make your company go bankrupt.
dlcarrier 2 days ago [-]
Waterfall development never existed, either.

Business management and self-help publishing long predate research, and nothing has changed. For some reason, software development has been extra susceptible to their nonsense.

bdcravens 2 days ago [-]
Some production systems are unique or expensive enough that emulating/virtualizing a setup in development is a large enough effort that it's cheaper to punt the observation and correction of environment-specific bugs to production.
jamesfinlayson 1 day ago [-]
Unfortunately yes - I had an issue last week posting some data to an endpoint. All looked fine in dev, of course, so I went to the prod logs and found that the logging itself had an issue: it generated JSON using a library that didn't produce valid JSON once the output was multiple MB.

Trying to replicate it in dev then turned out to be a monumental pain, because Java's limits on static strings are not huge.

moomin 1 day ago [-]
Speaking of presenting information in a misleading manner: did anybody actually get interviewed for this article, or is it just a reheating of some blog articles?
JoeAltmaier 2 days ago [-]
Every major vendor does testing on client machines now. Release early and often; record bugs; fix some. Rely on community forums for support.

This sounds like it's cheaper to fix in production, by orders of magnitude?

mixmastamyk 1 day ago [-]
If you don't fix the bugs, they're free!
devrandoom 2 days ago [-]
A bug found by you, the developer, vs. found by a user in production? Easily 100x.
michaelmrose 2 days ago [-]
Fixing it in production means it could have affected your production users, who in turn couldn't do whatever it is they do to actually make or give you money, with an unknown but potentially significant effect on your bottom line.

It also involves more, and often more senior, people who are paid more, as the bug must be triaged, assigned, and managed.

Whilst it is unlikely that this falls neatly onto exact orders of magnitude (e.g. exactly 10x and 100x more), if it's taken to mean "substantially" and "very substantially" more expensive, this seems fine.

jerlam 2 days ago [-]
Get a reputation for buggy, unreliable software and soon you won't have a lot of paying customers. That doesn't really fall under the definition of "fixing bugs", but it's a lot more impactful.
pdimitar 2 days ago [-]
> Laurent Bossavit, an Agile methodology expert

Congratulations, you got me to stop reading just at the start of the article.

On topic, I don't think any good engineer ever claimed the title of the article. The "more expensive" part stems from having to rush and maybe do a sloppy job, introducing regressions, higher hosting costs or other maladies.

So the "higher cost" might just be a compounding value borne out of panicky measures. Sometimes you really do have to get your sleeves rolled up and timebox any fix you have in mind and just progress and/or actually kill the problem. Often though, you just deflect the problem to somewhere else temporarily where the "bleeding" will not be as significant. Which buys you the time to do a better job.

Titles like this article's are highly dramatized. I am surprised any serious working person ever took them seriously.

tobyjsullivan 2 days ago [-]
> I don't think any good engineer ever claimed the title

I don’t claim to be a good engineer but I have made the claim in the title many times. Though it’s usually in the form of a more nuanced statement.

It’s about time, rather than money. If you can change a line of code to fix a bug before making your commit, that’s a lot faster than all the rigmarole of shipping the same fix later (new PR, code review, wait for CI, merge, deploy, etc.). Not to mention troubleshooting and debugging effort.

The multiplier depends on your context, but the bar isn't high: 100x a 30-second fix is about an hour of effort. I've worked in several teams where the average effort to change a line in prod approached that (with honest measurement, including context-switching costs).

TeMPOraL 2 days ago [-]
I believe in a soft form of that too (i.e. no specific numbers); the severity really depends on the type of project.

In a few industrial and enterprise projects I worked on, once you crossed past "testing", fixing a bug involved coordinating with another team that was doing a test deployment or evaluation at a customer site; at that point, extra process would kick in, and if the bug was severe enough (or you were unlucky enough) that the customer side got wind of it, you could expect some extra e-mail rounds and possibly a meeting.

Now, if your bug didn't get noticed then and failed in actual production... the time and cost multiplier was effectively unbounded. A fix for a simple, low-impact bug could take a week to get from commit to release-ready, and then wait a month in limbo, because these kinds of projects run on schedules (you can't really do continuous delivery if each release triggers a validation and sign-off process on the customer's end that engages a team of people for a day or more). A fix for a more complex or impactful bug could become... the last thing you release, if the customer gives up on your product over it. Etc.

People like to focus on the technical aspects (restoring databases, hard-to-reach hardware, etc.) when discussing this concept, but there are whole classes of projects where the driving factor is bureaucracy - coordinating the business side, altering long-term project plans, getting sign-off on testing, re-evaluating regulatory compliance, etc. That can quickly get arbitrarily expensive.

jdlshore 1 day ago [-]
> On topic, I don't think any good engineer ever claimed the title of the article.

It’s been a common claim in the past, usually preceded by the phrase “research studies have found…”. Steve McConnell is probably the best-known proponent these days, but Barry Boehm was probably the first to popularize the claim.

Separately, just because you haven’t heard something doesn’t mean it doesn’t happen.

pdimitar 1 day ago [-]
Obviously it could be a thing. But I feel it's been blown out of proportion, as it often happens.
throwaway314155 2 days ago [-]
Claiming that a supposedly well-known study doesn't exist, when that study is in fact not at all well known or even cited (although the premise is), is pure rage-bait/clickbait gold.

edit reply because fuck you hn mods:

What? How?? I'm not claiming it's from a scientific study; I'm claiming it's "conventional wisdom" firmly in the zeitgeist.

aspenmayer 2 days ago [-]
What do you call knowing all this as you do, and furthering the farce by failing to link it forthwith?

Please! If the original post was nerd sniping, this is spawn camping lol

bravesoul2 2 days ago [-]
Actual title:

Everyone cites that 'bugs are 100x more expensive to fix in production' research, but the study might not even exist

igouy 2 days ago [-]
2021