I am of the opinion that Nvidia's hit the wall with their current architecture in the same way Intel historically did with its various architectures - the current generation's power and cooling requirements demand the construction of entirely new datacenters with different designs, which is going to blow out the economics on inference (GPU + datacenter + power plant + nuclear fusion research division + lobbying for datacenter land + water rights + ...).
The story with Intel around these times was usually that AMD or Cyrix or ARM or Apple or someone else would come around with a new architecture that was a clear generation jump past Intel's, and most importantly seemed to break the thermal and power ceilings of the Intel generation (at which point Intel typically fired their chip design group, hired everyone from AMD or whoever, and came out with Core or whatever). Nvidia effectively has no competition, or hasn't had any - nobody's actually broken the CUDA moat, so neither Intel nor AMD nor anyone else is really competing for the datacenter space, so they haven't faced any actual competitive pressure against things like power draws in the multi-kilowatt range for the Blackwells.
The reason this matters is that LLMs are incredibly nifty, often useful tools that are not AGI and also seem to be hitting a scaling wall, and the only way to make the economics of, eg, a Blackwell-powered datacenter make sense is to assume that the entire economy is going to be running on it, as opposed to some useful tools and some improved interfaces. Otherwise, the investment numbers just don't make sense - the gap between the real but limited value we actually see LLMs adding on the ground and the full cost of providing that service with a brand new single-purpose "AI datacenter" is just too great.
So this is a press release, but any time I see something that looks like an actual new hardware architecture for inference, and especially one that doesn't require building a new building or solving nuclear fusion, I'll take it as a good sign. I like LLMs, I've gotten a lot of value out of them, but nothing about the industry's finances adds up right now.
nl 2 hours ago [-]
> I am of the opinion that Nvidia's hit the wall with their current architecture
Based on what?
Their measured performance on things people care about keeps going up, and their software stack keeps getting better and unlocking more performance on existing hardware
Inference tests: https://inferencemax.semianalysis.com/
Training tests: https://www.lightly.ai/blog/nvidia-b200-vs-h100
https://newsletter.semianalysis.com/p/mi300x-vs-h100-vs-h200... (only H100, but vs AMD)
> but nothing about the industry's finances adds up right now
Is that based just on the HN "it is lots of money so it can't possibly make sense" wisdom? Because the released numbers seem to indicate that inference providers and Anthropic are doing pretty well, and that OpenAI is really only losing money on inference because of the free ChatGPT usage.
Further, I'm sure most people heard the mention of an unnamed enterprise paying Anthropic $5000/month per developer on inference(!!) If a company is that cost insensitive, is there any reason why Anthropic would bother to subsidize them?
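(For a rough sense of scale, here is a small Python sketch of what $5,000/month of inference buys; the per-million-token prices and the input/output ratio below are placeholder assumptions, not Anthropic's actual rates.)

    # Back-of-the-envelope: what does $5,000/month per developer buy?
    # All prices are assumptions for illustration, not real Anthropic rates.
    spend_per_month = 5_000.0      # USD
    price_per_m_input = 3.0        # assumed USD per million input tokens
    price_per_m_output = 15.0      # assumed USD per million output tokens
    input_per_output = 4.0         # assume 4 input tokens per output token

    # Cost of one "bundle" of (4M input + 1M output) tokens
    bundle_cost = input_per_output * price_per_m_input + price_per_m_output
    bundles = spend_per_month / bundle_cost
    print(f"~{bundles * input_per_output:.0f}M input and ~{bundles:.0f}M output tokens per month")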
roughly 2 hours ago [-]
> Their measured performance on things people care about keeps going up, and their software stack keeps getting better and unlocking more performance on existing hardware
I'm more concerned about fully-loaded dollars per token - including datacenter and power costs - rather than "does the chip go faster." If Nvidia couldn't make the chip go faster, there wouldn't be any debate; the question right now is "what is the cost of those improvements?" I don't have the answer to that, but the numbers going around for the costs of new datacenters don't give me a lot of optimism.
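(A minimal back-of-the-envelope sketch of that fully-loaded framing; every input below - capex, power, utilization, throughput - is a placeholder assumption chosen only to show the shape of the calculation, not a measured figure.)

    # Toy model of fully-loaded cost per token. All inputs are assumptions.
    capex_per_gpu = 40_000.0             # USD, accelerator + share of server/networking
    datacenter_capex_per_gpu = 15_000.0  # USD, share of building/cooling buildout
    amortization_years = 4.0
    power_per_gpu_kw = 1.5               # board plus cooling overhead
    electricity_usd_per_kwh = 0.08
    utilization = 0.6                    # fraction of the year actually serving traffic
    tokens_per_second = 1_000.0          # sustained throughput per GPU while busy

    hours_per_year = 8760
    annual_capex = (capex_per_gpu + datacenter_capex_per_gpu) / amortization_years
    annual_power = power_per_gpu_kw * hours_per_year * electricity_usd_per_kwh
    annual_tokens = tokens_per_second * 3600 * hours_per_year * utilization

    usd_per_million_tokens = (annual_capex + annual_power) / annual_tokens * 1e6
    print(f"~${usd_per_million_tokens:.2f} per million tokens, fully loaded")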
> Is that based just on the HN "it is lots of money so it can't possibly make sense" wisdom?
OpenAI has $1.15T in spend commitments over the next 10 years: https://tomtunguz.com/openai-hardware-spending-2025-2035/
As far as revenue, the released numbers from nearly anyone in this space are questionable - they're not public companies, we don't actually get to see inside the box. Torture the numbers right and they'll tell you anything you want to hear. What we _do_ get to see is, eg, Anthropic raising billions of dollars every ~3 months or so over the course of 2025. Maybe they're just that ambitious, but that's the kind of thing that makes me nervous.
nl 1 hours ago [-]
> OpenAI has $1.15T in spend commitments over the next 10 years
Yes, but those aren't contracted commitments, and we know some of them are equity swaps. For example "Microsoft ($250B Azure commitment)" from the footnote is an unknown amount of actual cash.
And I think it's fair to point out the other information in your link "OpenAI projects a 48% gross profit margin in 2025, improving to 70% by 2029."
roughly 28 minutes ago [-]
> "OpenAI projects a 48% gross profit margin in 2025, improving to 70% by 2029."
OpenAI can project whatever they want, they're not public.
Forgeties79 2 hours ago [-]
> Is that based just on the HN "it is lots of money so it can't possibly make sense" wisdom?
I mean the amount of money invested across just a handful of AI companies is currently staggering and their respective revenues are nowhere near where they need to be. That’s a valid reason to be skeptical. How many times have we seen speculative investment of this magnitude? It’s shifting entire municipal and state economies in the US.
OpenAI alone is currently projected to burn over $100 billion by what? 2028 or 2029? Forgot what I read the other day. Tens of billions a year. That is a hell of a gamble by investors.
sothatsit 18 minutes ago [-]
The flip side is that these companies seem to be capacity constrained (although that is hard to confirm). If you assume the labs are capacity constrained, which seems plausible, then building more capacity could pay off by allowing labs to serve more customers and increase revenue per customer.
This means the bigger questions are whether you believe the labs are compute constrained, and whether you believe more capacity would allow them to drive actual revenue. I think there is a decent chance of this being true, and under this reality the investments make more sense. I can especially believe this as we see higher-cost products like Claude Code grow rapidly with much higher token usage per user.
This all hinges on demand materialising when capacity increases, and margins being good enough on that demand to get a good ROI. But that seems like an easier bet for investors to grapple with than trying to compare future investment in capacity with today's revenue, which doesn't capture the whole picture. Anthropic has 10x'ed their revenue for three straight years after all.
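(A toy version of that bet; the capex, revenue, and margin figures below are invented purely for illustration.)

    # Toy payback calculation for adding inference capacity.
    # All figures are invented assumptions, not anyone's real numbers.
    capex = 10e9                 # USD to build and fill one new cluster
    annual_revenue = 6e9         # USD/year of new demand it can serve
    gross_margin = 0.50          # fraction of revenue left after power/ops

    annual_gross_profit = annual_revenue * gross_margin
    payback_years = capex / annual_gross_profit
    print(f"Payback in ~{payback_years:.1f} years, if the demand actually shows up")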
Forgeties79 7 minutes ago [-]
Typically a factory is outputting at a profit or has a clear path to profitability in order to pay off a loan/investment. They have debt, but they’re moving towards the black in a concrete, relatively predictable way - no one speculates on a factory like they do with AI companies currently. If said factory’s output is maxed and they’re still not making money, then it’s a losing investment and they wouldn’t expand. It’s not really apples to apples.
segmondy 2 hours ago [-]
> The reason this matters is that LLMs are incredibly nifty, often useful tools that are not AGI and also seem to be hitting a scaling wall
I don't know who needs to hear this, but the real breakthrough in AI that we have had is not LLMs, but generative AI; LLMs are but one specific case. Furthermore, we have hit absolutely no walls. Go download a model from Jan 2024, another from Jan 2025, and one from this year and compare. The improvement in how good they have gotten is exponential.
binary132 2 hours ago [-]
>go download a model
GP was talking about commercially hosted LLMs running in datacenters, not free Chinese models.
Local is definitely still improving. That’s another reason the megacenter model (NVDA’s big line up forever plan) is either a financial catastrophe about to happen, or the biggest bailout ever.
wahnfrieden 1 hours ago [-]
GPT 5.2 is an incredible leap over 5.1 / 5
hadlock 4 minutes ago [-]
5.2 is great if you ask it engineering questions, or questions an engineer might ask. It is extremely mid, and actually worse than the o3/o4-era models, if you start asking it trivia like whether the I-80 tunnel on the Bay Bridge (Yerba Buena Island) is the largest bore in the world. Don't even get me started on whatever model is wired up to the voice chat button.
But yes, it will write you a flawless, physics-accurate flight simulator in Rust on the first try. I've proven that. I guess what I'm trying to say is that Anthropic was eating their lunch at coding, and OpenAI rose to the challenge, but if you're not doing engineering tasks their current models are arguably worse than older ones.
kuil009 2 hours ago [-]
Thanks for this. It put into words a lot of the discomfort I’ve had with the current AI economics.
bsder 37 minutes ago [-]
We've seen this before.
In 2001, there were something like 50+ OC-768 hardware startups.
At the time, something like 5 OC-768 links could carry all the traffic in the world. Even exponential doubling every 12 months wasn't going to get enough customers to warrant all the funding that had poured into those startups.
When your business model bumps into "All the <X> in the world," you're in trouble.
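(A quick sanity check of that arithmetic; the starting demand, the doubling rate, and the capacity the startups collectively needed to sell are all assumed numbers.)

    # OC-768 is roughly a 40 Gbit/s link. Suppose world demand in 2001 was
    # about 5 links' worth and doubled every 12 months (both assumptions).
    demand_links = 5.0
    # Say the funded startups collectively needed to sell ~1000 links' worth
    # of gear to justify their valuations (again, an invented number).
    capacity_needed = 1000.0

    years = 0
    while demand_links < capacity_needed:
        demand_links *= 2
        years += 1
    print(f"~{years} years of perfect doubling before demand catches up")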
xnx 52 minutes ago [-]
Remember that without real competition, Nvidia has little incentive to release something 16x faster when they could release something 2x faster 4 times.
re-thc 2 hours ago [-]
> I am of the opinion that Nvidia's hit the wall with their current architecture
Not likely since TSMC has a new process with big gains.
> The story with Intel
Was that their fabs couldn’t keep up, not their designs.
frankchn 1 hours ago [-]
If Intel's original 10nm process and Cannon Lake had launched within Intel's original timeframe of 2016/17, it would have been class leading.
Instead, they couldn't get 10nm to work and launched one low-power SKU in 2018 that had almost half the die disabled, and stuck to 14nm from 2014-2021.
linuxftw 1 hours ago [-]
Based on conversations I've had with some people managing GPUs at scale in datacenters, inference is an afterthought. There is a gold rush for training right now, and that's where these massive clusters are being used.
LLMs are probably a small fraction of the overall GPU compute in use right now. I suspect in the next 5 years we'll have full Hollywood movies (at least the special effects) being generated entirely by AI.
flyinglizard 2 hours ago [-]
You’re right, but Nvidia enjoys an important advantage Intel always used to mask their sloppy design work: the supply chain. You simply can’t source HBM at scale because Nvidia bought everything, TSMC N3 is likewise fully booked, and between Apple and Nvidia their 18A is probably already far gone. And if you want to connect your artisanal inference hardware together, then congratulations, Nvidia is the leader here too and you WILL buy their switches.
As for the business side, I’ve yet to hear of a transformative business outcome due to LLMs (it will come, but not there yet). It’s only the guys selling the shovels that are making money.
This entire market runs on sovereign funds and cyclical investing. It’s crazy.
bigyabai 2 hours ago [-]
> but nothing about the industry's finances adds up right now.
The acquisitions do. Remember Groq?
wmf 2 hours ago [-]
That may not be a good example because everyone is saying Groq isn't worth $20B.
jsheard 1 hours ago [-]
They were valued at $6.9B just three months before Nvidia bought them for $20B, triple the valuation. That figure seems to have been pulled out of thin air.
minimaltom 1 hours ago [-]
Speaking generally: It makes sense for an acquisition price to be at a premium to the prior valuation, between the dynamics where you have to convince leadership it's better to be bought than to keep growing, and the expected risk posed by them as competition.
Most M&As aren't done by value investors.
petesergeant 2 hours ago [-]
> nothing about the industry's finances add up right now
Nothing about the industry’s finances, or about Anthropic and OpenAI’s finances?
I look at the list of providers on OpenRouter for open models, and I don’t believe all of them are losing money. FWIW Anthropic claims (iirc) that they don’t lose money on inference. So I don’t think the industry or the model of selling inference is what’s in trouble there.
I am much more skeptical of Anthropic and OpenAI’s business model of spending gigantic sums on generating proprietary models. The latest Claude and GPT are very, very good, but not enough better than the competition to justify the cash spend. It feels unlikely that anyone is gonna “winner takes all” the market at this point. I don’t see how Anthropic or OpenAI survive as independent entities, or how current owners don’t take a gigantic haircut, other than by Sam Altman managing to do something insane like reverse-acquiring Oracle.
EDIT: also feels like Musk has shown how shallow the moat is. With enough cash and access to exceptional engineers, you can magic a frontier model out of the ether, however much of a douche you are.
zmmmmm 3 hours ago [-]
What can it actually run? The fact their benchmark plot refers to Llama 3.1 8b signals to me that it's hand implemented for that model and likely can't run newer / larger models. Why else would you benchmark such an outdated model? Show me a benchmark for gpt-oss-120b or something similar to that.
I think the Llama 3 focus mostly reflects demand. It may be hard to believe, but many people aren't even aware gpt-oss exists.
reactordev 2 hours ago [-]
Many are aware, just can’t offload it onto their hardware.
The 8B models are easier to run on an RTX to compare it to local inference. What llama does on an RTX 5080 at 40t/s, Furiosa should do at 40,000t/s or whatever… it’s an easy way to have a flat comparison across all the different hardware llama.cpp runs on.
nl 2 hours ago [-]
> we demonstrated running gpt-oss-120b on two RNGD chips [snip] at 5.8 ms per output token
That's 86 token/second/chip
By comparison, a H100 will do 2390 token/second/GPU [1]
Am I comparing the wrong things somehow?
[1] https://inferencemax.semianalysis.com/
I think you are comparing latency with throughput. You can't take the inverse of latency to get throughput because concurrency is unknown. But then, the RNGD result is probably with concurrency=1.
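(A sketch of why the inverse-of-latency reading only works at concurrency=1; the 5.8 ms and 2390 tok/s figures are the ones quoted above, while the 50 ms/token per-request figure in the batched H100 case is an assumption.)

    # Per-token latency only pins down throughput once you know concurrency.
    # throughput ~= concurrency / latency_per_token (decode only, no overheads).
    def throughput_tok_s(latency_s_per_token: float, concurrency: int) -> float:
        return concurrency / latency_s_per_token

    # Furiosa's quoted 5.8 ms/token on two RNGD chips, read as concurrency=1:
    print(throughput_tok_s(0.0058, 1) / 2)   # ~86 tok/s per chip
    # A batched H100 serving figure like 2390 tok/s implies high concurrency;
    # e.g. at an assumed 50 ms/token per request it needs ~120 concurrent streams:
    print(2390 * 0.050)                      # ~119.5 concurrent requests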
binary132 2 hours ago [-]
I thought they were saying it was more efficient, as in tokens per watt. I didn’t see a direct comparison on that metric but maybe I didn’t look well enough.
nl 1 hours ago [-]
Probably. Companies sell on efficiency when they know they lose on performance.
zmmmmm 2 hours ago [-]
Now I'm interested ...
It still kind of makes the point that you are stuck with a very limited range of models that they are hand implementing. But at least it's a model I would actually use. Give me that in a box I can put in a standard data center with normal power supply and I'm definitely interested.
But I want to know the cost :-)
rjzzleep 2 hours ago [-]
The fact that so many people are focusing solely on massive LLM models is an oversight; they're narrowly focused on a tiny (but very lucrative) subdomain of AI applications.
whimsicalism 3 hours ago [-]
Got excited, then I saw it was for inference. yawns
Seems like it would obviously be in TSMC's interest to give preferential tape-out access to Nvidia competitors; they benefit from having a less consolidated customer base bidding up their prices.
darknoon 3 hours ago [-]
Really weird graph where they're comparing to 3x H100 PCIe, which is a config I don't think anyone is using.
they're trying to compare at iso-power? I just want to see their box vs a box of 8 h100s b/c that's what people would buy instead, and they can divide tokens and watts if that's the pitch.
minimaltom 1 hours ago [-]
What's a more realistic config?
nycdatasci 1 hours ago [-]
Is this from 2024? It mentions "With global data center demand at 60 GW in 2024"
Also, there is no mention of the latest-gen NVDA chips: 5 RNGD servers generate tokens at 3.5x the rate of a single H100 SXM at 15 kW. This is reduced to 1.5x if you instead use 3 H100 PCIe servers as the benchmark.
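(To turn those ratios into tokens per watt you need power figures for each configuration; in the sketch below the 3.5x and 1.5x ratios are the quoted ones, while the server power numbers, and the reading of "a single H100 SXM" as one 8-GPU SXM box, are assumptions rather than measured values.)

    # Normalize throughput ratios to tokens per kW. The throughput ratios
    # (3.5x and 1.5x) come from the post; the power figures are assumptions.
    def tokens_per_kw(relative_throughput: float, power_kw: float) -> float:
        return relative_throughput / power_kw

    rngd_rack = tokens_per_kw(3.5, 15.0)           # 5 RNGD servers in a ~15 kW budget
    h100_sxm = tokens_per_kw(1.0, 10.2)            # assumed ~10.2 kW for one 8x SXM server
    h100_pcie = tokens_per_kw(3.5 / 1.5, 3 * 4.5)  # assumed ~4.5 kW per PCIe server

    print(f"RNGD vs SXM efficiency: {rngd_rack / h100_sxm:.2f}x")
    print(f"RNGD vs PCIe efficiency: {rngd_rack / h100_pcie:.2f}x")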
jszymborski 1 hours ago [-]
Is it reasonable for me not to be able to read a single word of a text-based blog post because I don't have WebGL enabled?
kuil009 3 hours ago [-]
The positioning makes sense, but I’m still somewhat skeptical.
Targeting power, cooling, and TCO limits for inference is real, especially in air-cooled data centers.
But the benchmarks shown are narrow, and it’s unclear how well this generalizes across models and mixed production workloads. GPUs are inefficient here, but their flexibility still matters.
grosswait 3 hours ago [-]
How usable is this in practice for the average non-AI organization? Are you locked into a niche ecosystem that limits the options of what models you can serve?
sanxiyn 2 hours ago [-]
Yes, but in principle it isn't that different from running on Trainium or Inferentia (it's a matter of degree), and plenty of non-AI organizations adopted Trainium/Inferentia.
nl 2 hours ago [-]
So inference only and slower than B200s?
Maybe they are cheap.
richwater 2 hours ago [-]
This is from September 2025, what's new?
sanxiyn 2 hours ago [-]
What's new is HN discovered it. It wasn't posted in September 2025.