R-Zero: Self-Evolving Reasoning LLM from Zero Data (arxiv.org)
102 points by lawrenceyan 17 hours ago | 53 comments
Iv 4 hours ago [-]
"Starting from a single base LLM"

Ok, zero data, except the data used in the teacher model.

nickpsecurity 3 hours ago [-]
Only 1-15 TB of data processed at $10k-$100M, depending on model size. Then this shaves a few hundred to a few grand off fine-tuning. I mean, we're still saving money, at least.
nakamoto_damacy 8 hours ago [-]
Perpetual Motion Machines were a thing at some point, too.
YeGoblynQueenne 8 hours ago [-]
Don't laugh. PMMs work! I built mine ten years ago when I realised I could improve the SOTA by a huge 20%. I've been improving it for the last 10 years and I get an average performance boost of ~0.25% every year. We will have Free Energy in the next 10 years.
ojo-rojo 6 hours ago [-]
I find your comment interesting, even though I'm not sure if I really get what you're saying. You built a perpetual motion machine? You then made improvements? Can you share details?
YeGoblynQueenne 2 hours ago [-]
This is HN so I think it's fine to break standard protocol and clarify: I was joking. Specifically I was riffing off nakamoto_damacy's comment and carrying the comparison (of LLMs) with Perpetual Motion Machines (PMMs) to its logical conclusion.
pas 5 hours ago [-]
they are claiming that they built a PMM prototype, which doesn't fully satisfy the business requirements yet, but they are on track to do so, based on all the amazing, documented, validated, peer-reviewed, published progress they have already made over the years!
YeGoblynQueenne 2 hours ago [-]
That!
suprfsat 6 hours ago [-]
Good news everyone, you've passed the Turing test.
amelius 4 hours ago [-]
Hmm, I guess I didn't pass it then.
nickpsecurity 3 hours ago [-]
The trick is you use magnets, momentum, and WD-40. That can get you most of the way.

It probably will eventually stop, though. Something about the Sun becoming a red giant...

YeGoblynQueenne 2 hours ago [-]
Pf, magnets. That's so 1920's! Room-temperature superconductors are the thing nowadays. I'm sure we'll have those in just a few years.
taneq 5 hours ago [-]
20%? 0.25%? Those are rookie numbers! /s

(I feel like this post is underappreciated by at least 20%. :D )

api 8 hours ago [-]
I refer to the endlessly self-improving runaway AI as an “information-theoretic perpetual motion machine.”

This will work, in a sense. It will do… something… and learn… something. It just won't be related to the physical universe in any way. See also: procedural landscape generators, etc.

hodgehog11 4 hours ago [-]
This makes sense on its face, but the flaw in the logic here is the implicit assumption that current procedures extract all information available in the datasets. We know this is not even remotely close to being true.

Many decades ago, statisticians made a similar erroneous assumption: that maximum likelihood estimators, which also minimize entropy, are "optimal" in the sense of saturating error bounds. The fact that you can do better with smarter regularisation is key to why DL works in the first place.
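
A concrete instance is the James-Stein estimator, which shrinks the MLE toward zero and achieves strictly lower expected squared error for a Gaussian mean in three or more dimensions. A quick numerical sketch (my own illustration, not from any particular paper):

    import numpy as np

    # Sketch: James-Stein shrinkage dominating the MLE for a Gaussian mean.
    rng = np.random.default_rng(0)
    d, trials = 10, 10_000
    theta = rng.normal(size=d)                  # true means
    x = theta + rng.normal(size=(trials, d))    # one noisy observation per trial

    mle = x                                     # MLE of the mean is the observation itself
    shrink = 1 - (d - 2) / np.sum(x**2, axis=1, keepdims=True)
    js = shrink * x                             # James-Stein: shrink toward zero

    risk = lambda est: np.mean(np.sum((est - theta)**2, axis=1))
    print(f"MLE risk: {risk(mle):.2f}")         # ~ d = 10
    print(f"JS  risk: {risk(js):.2f}")          # strictly smaller for d >= 3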

I'm no shill for AI, but you're going to need a better argument for why runaway AI up to obscene levels of performance is not theoretically possible. There are quite a few people, including some of my colleagues, who are looking in earnest, but so far no one has found one.

K0balt 7 hours ago [-]
Might kinda work if you gave it tools to do its research on the open internet, Fiverr, Mechanical Turk, etc.
nakamoto_damacy 6 hours ago [-]
Sure, it could, up until the point where, in order to figure out that it has to use a tool or access the Internet, it needs more intelligence (to know that its answer or understanding is insufficient or incorrect). How do we as humans know that? Someone tells us. Who's going to tell it? Then you end up at Minsky's Society of Mind, but also a distributed perpetual motion machine. Evolution seems to have figured out the intuition mechanism as some sort of probabilistic mechanism honed over millions of years, if not billions (white blood cells track pathogens without having any neural network, so it's possible). -- I think I opened a can of worms with these thoughts.
agentultra 7 hours ago [-]
On its own without any alignment or labelling. Super-intelligence or super-Grok?
api 7 hours ago [-]
That’s at least some contact with reality, if only by proxy. I’m referring to a brain in a vat somehow learning.
thom 11 hours ago [-]
For values of zero quite far above zero.
falcor84 10 hours ago [-]
What am I missing? From my skimming, there's zero external data beyond what is needed for the Challenger to generate questions.
thom 9 hours ago [-]
An existing trained LLM is an enormous amount of 'data', however it might be encoded. AlphaZero didn't start with Stockfish or a database of games.
magicalhippo 9 hours ago [-]
As I understand it, the point of the article isn't to train an LLM from scratch; it's to teach a non-reasoning model to reason without additional explicit training data.
YeGoblynQueenne 8 hours ago [-]
The abstract does use the term "from scratch":

>> To overcome this limitation, we introduce R-Zero, a fully autonomous framework that generates its own training data from scratch.

Giving them the benefit of the doubt, they're just using the term loosely, but the way they use it sure reads like a claim that they found a way to initialise LLMs with 0 data. Only the absurdity of that claim protects the reader from such a misunderstanding, and that's never a good thing in a research paper.

magicalhippo 8 hours ago [-]
If you include the previous and following sentences, it's clear, at least to me, what they mean:

> However, existing methods for training such models still rely heavily on vast human-curated tasks and labels, typically via fine-tuning or reinforcement learning, which poses a fundamental bottleneck to advancing AI systems toward capabilities beyond human intelligence.

> To overcome this limitation, we introduce R-Zero, a fully autonomous framework that generates its own training data from scratch.

> Starting from a single base LLM, R-Zero initializes two independent models with distinct roles, a Challenger and a Solver.

Training an LLM is a multi-stage process[1], and they're tackling the stage at the end. That's where you do fine-tuning or reinforcement learning. They're not training an LLM from scratch. They're explicitly stating they start from a base LLM, i.e. a pretrained, non-fine-tuned model.

As I understand it, and as they mention, training data for the latter stages has typically required high-quality human-curated samples in large numbers, even if they're augmented using LLMs, say by generating multiple variations of each human-curated training sample.

Their proposal is to have a generative adversarial setup generate that data without any initial human input, i.e. from scratch.

[1]: https://snorkel.ai/blog/large-language-model-training-three-...

YeGoblynQueenne 2 hours ago [-]
That's a fair reading, but when you write a technical paper you must try to minimise the number of possible readings of each sentence; otherwise different people will understand different things, which is exactly what you need to avoid.
tucnak 8 hours ago [-]
AlphaZero is oftentimes dragged out to ridicule so-called "self-play LLM training" techniques, but I don't find those arguments terribly convincing. You can think of AlphaZero's games as synthetic data produced in an adversarial setting: they are easy to produce and verify because the rules of chess are verifiable, so on paper it doesn't require much data. This is not the case for most text, with some notable exceptions in verifiable domains, which is coincidentally where self-play is applied most successfully. Thus, you could argue that the pre-existing trained LLM merely functions as a verifier proxy, analogous to the well-defined chess verifier in AlphaZero.
nerpderp82 6 hours ago [-]
Thank you for your mature, intelligent answer.
jasonjmcghee 14 hours ago [-]
Conceptually, it's effectively a GAN
frumiousirc 8 hours ago [-]
My initial thought as well. But what is the "Discriminator" here? What grounds the training toward reality? The "Challenger" and "Solver" adversarial loop alone can only serve to amplify hallucination.

Ahh, GPT-4o is the arbiter.

So, basically, this is a way to perform LLM model compression (GPT-4o to Qwen3) while maximizing the in-distribution domain size. As such, it seems reasonable and useful.

However, the reliance on an arbiter LLM makes the claim that this overcomes the lack-of-training-data problem unreasonable. Once the target LLM is scaled up to match the in-distribution domain size of the arbiter, it seems to me it will turn back into a hallucination amplifier.
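
To make the compression reading concrete, here is a minimal sketch of classic logit distillation in Python. Note this is just the standard technique the compression analogy points at; the paper itself uses GPT-4o only as an answer arbiter, not as a logit teacher, and all names below are illustrative.

    import torch
    import torch.nn.functional as F

    # Classic logit distillation: the student matches the teacher's softened
    # token distribution via KL divergence.
    def distill_loss(student_logits, teacher_logits, T=2.0):
        p_teacher = F.softmax(teacher_logits / T, dim=-1)
        log_p_student = F.log_softmax(student_logits / T, dim=-1)
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

    student_logits = torch.randn(4, 32000, requires_grad=True)  # toy vocab logits
    teacher_logits = torch.randn(4, 32000)
    print(distill_loss(student_logits, teacher_logits))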

djoldman 7 hours ago [-]
See Figure 2.

The solver/challenger is the GAN discriminator/generator.

The challenger is trained to create difficult questions. The solver is trained to strengthen pathways that correctly solve the questions like so:

> To guide the Challenger toward producing challenging yet solvable questions, we first define an uncertainty score. For a generated question x, we query the current Solver... The most frequent response is treated as the pseudo-label y˜(x), and we compute the Solver’s empirical accuracy....The uncertainty reward is then defined.... This function incentivizes questions where the Solver is maximally uncertain (accuracy approaches 50%)

Identifying the best pseudo-label seems like it would be the limitation of the approach.
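
A rough Python sketch of that majority-vote mechanic, assuming a reward of the form 1 - 2|p - 1/2| that peaks at 50% accuracy, as the quoted text describes (the paper's exact definition may differ):

    from collections import Counter

    # Majority vote over sampled Solver answers gives the pseudo-label;
    # the reward is highest when the Solver is split 50/50 on a question.
    def uncertainty_reward(solver_answers):
        counts = Counter(solver_answers)
        pseudo_label, freq = counts.most_common(1)[0]
        p = freq / len(solver_answers)      # empirical accuracy vs pseudo-label
        return 1 - 2 * abs(p - 0.5), pseudo_label

    print(uncertainty_reward(["4", "4", "5", "3"]))  # p = 0.5 -> reward 1.0
    print(uncertainty_reward(["4", "4", "4", "4"]))  # p = 1.0 -> reward 0.0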

magicalhippo 9 hours ago [-]
For those not in the know, that's Generative Adversarial Networks[1], where two neural networks are trained in a competitive way.

One network typically generates tasks for the other, and is rewarded if it manages to make the other network fail the task. The other network is rewarded if it successfully completes the task.

Thus the adversarial network tries to find weaknesses to exploit, and the combined training makes the solving network much stronger. Or at least that's the idea.
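
For a concrete feel, a toy PyTorch version (my own minimal sketch, not from the paper): the generator learns to mimic samples from N(3, 1), while the discriminator learns to tell real samples from generated ones.

    import torch
    import torch.nn as nn

    G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
    D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
    bce = nn.BCELoss()

    for step in range(2000):
        real = torch.randn(64, 1) + 3.0     # samples from the target distribution
        fake = G(torch.randn(64, 8))        # generator's attempt
        # D is rewarded for separating real from fake...
        d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
        # ...while G is rewarded for fooling D.
        g_loss = bce(D(G(torch.randn(64, 8))), torch.ones(64, 1))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()

    print(G(torch.randn(1000, 8)).mean().item())  # should drift toward 3.0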

[1]: https://en.wikipedia.org/wiki/Generative_adversarial_network

torginus 8 hours ago [-]
GANs are a supervised training method, not really self-improving (after converging to reproducing the training set).
Davidzheng 3 hours ago [-]
I think in a formal domain like Lean it should actually be possible to do it from zero -- but it seems like there have been no major successes so far.
clbrmbr 7 hours ago [-]
Terrible choice of name. DeepSeek developed a historically important model called "R1-Zero" (the predecessor to R1 that was trained without any cold-start SFT; it was very strong, but its chain of thought was difficult to read because it code-switched into Chinese and had no line breaks).
lawlessone 1 hours ago [-]
OK but how do you ensure it's improving in a direction that aligns with reality?
freejazz 3 hours ago [-]
I still don't understand what a "reasoning" LLM is
cluckindan 3 hours ago [-]
It’s an LLM that has been trained and prompted to make users believe that the model is using logical reasoning to arrive at its output, when it is in fact still predicting the possible next output tokens, just like any other LLM.

There may be additional feedback loops, but fundamentally, that is what it is doing. Sure, it will show you what steps it takes to arrive at a conclusion, but it is just predicting the steps, the conclusion and the potential validity of the aforementioned based on its training data, not actually evaluating the logic or the truthiness of the output.

If you don’t believe me, ask your ”reasoning” LLM this question: What’s the name of the paternal great-great-grandfather of the son of Jacob’s son’s son’s son?

BrawnyBadger53 55 minutes ago [-]
Or, to put it less pessimistically: the models are trained to prime their own context window so that by the end of the chain they arrive at more valuable responses. By creating intermediary steps in the chain, the next step is easier to generate than moving directly to the desired response. We call it reasoning because it is intuitively analogous to human reasoning methods, though it is understood that LLMs don't succeed as generally as humans do.
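
A minimal sketch of that priming loop; generate() is a hypothetical stand-in for a single LLM completion call, not a real API:

    # generate() is a hypothetical stand-in for one LLM completion call.
    def reason(question: str, generate, max_steps: int = 8) -> str:
        context = question + "\nLet's think step by step.\n"
        for _ in range(max_steps):
            step = generate(context)          # next chunk of "thought" tokens
            context += step + "\n"            # each step primes the next one
            if step.startswith("Answer:"):    # model signals it is done
                break
        return context
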
mvdwoord 2 hours ago [-]
Progress is hard to keep track of in this fast-paced environment, but aren't there already models that can call external tools and simply offload parts of the reasoning there? Maybe over MCP or some other mechanism, so a model can offload e.g. calculations, or test code in a sandbox, or even write code to answer part of a question, execute the code somewhere, and take the results into the rest of the inference process as context?

Or is there a more subtle issue which prevents this, or makes it hard?

Is there something fundamentally impossible about having a model detect that counting the Rs in 'strawberry' is a string-search operation, and in some sandbox execute something like:

% echo "strawberry" | tr -dc "r" | wc -c
3
It seems agents do this already, but regular GPT-style environments seem to lack it?
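
Roughly, agent frameworks run a loop like the sketch below; this is a hypothetical illustration (generate() and the TOOL:/RESULT: convention are made up, not a real API or protocol):

    import subprocess

    # Hypothetical tool loop: the model emits "TOOL: <shell command>" when it
    # wants to offload a computation; anything else is treated as the answer.
    def answer_with_tools(question: str, generate) -> str:
        context = question
        while True:
            out = generate(context)
            if not out.startswith("TOOL:"):
                return out                    # the model answered directly
            cmd = out[len("TOOL:"):].strip()
            result = subprocess.run(cmd, shell=True, capture_output=True,
                                    text=True, timeout=5)  # use a real sandbox in practice
            context += f"\n{out}\nRESULT: {result.stdout.strip()}\n"
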
yunohn 2 hours ago [-]
My observation of AI progress over the past two years is that LLM companies focus purely on raw model knowledge instead of optimised, usable tooling. Unsure when this will ever change, but that's why your example is not the industry standard yet.
mvdwoord 2 hours ago [-]
My intuition, which is of course woefully inadequate in this area, says there is a ton of accuracy to be gained, and also a lot of potential for offloading, and therefore pruning, or better use of the remaining parameters...

Anyway, let me refresh my page, as I am sure some new model architecture is dropping while I type this. ;)

sindriava 2 hours ago [-]
I won't read this because you're not really thinking, just pressing keyboard keys.
cluckindan 2 hours ago [-]
Joke’s on you, I dictated it.
sindriava 2 hours ago [-]
Rich coming from the guy who moved his muscles until sounds came out.

Also, next time you should at least bother to copy-paste your question into any recent LLM, since they can all solve it without issue. But hallucinations like this are common with non-reasoning HN users.

cluckindan 2 hours ago [-]
But can they solve it without referring to the Bible, or without mentioning anyone in the biblical Jacob’s family tree?

I don’t think so. Humans solve that puzzle in a very different way than LLMs ”reason” about it.

nerpderp82 1 hours ago [-]
There can be more than one intelligence. Nature has shown us that there are many. And many which can "outsmart" a human.
Varelion 2 hours ago [-]
Let's break this down carefully, step by step.

Start with Jacob.

Jacob’s son → call him A.

A’s son → call him B.

B’s son → call him C.

C’s son → call him D (this is “the son of Jacob’s son’s son’s son”).

Now the question asks for the paternal great-great-grandfather of D:

D’s father → C

D’s grandfather → B

D’s great-grandfather → A

D’s great-great-grandfather → Jacob

Answer: Jacob

freejazz 2 hours ago [-]
Thank you. I do not have a "reasoning" LLM, and I have not found LLMs very useful in my life, so I do not really engage with them outside of reading about them here and in other places.
neuroelectron 4 hours ago [-]
Now gamify it.
cyberge99 14 hours ago [-]
What could go wrong?
magicalhippo 9 hours ago [-]
Just don't hook it into the nuclear missile controls. We've seen[1] how that goes[2].

[1]: https://en.wikipedia.org/wiki/Colossus:_The_Forbin_Project

[2]: https://en.wikipedia.org/wiki/The_Terminator

koakuma-chan 8 hours ago [-]
[3] https://en.wikipedia.org/wiki/Re:Zero