> Grok ended up performing the best while DeepSeek came close to second. Almost all the models had a tech-heavy portfolio which led them to do well. Gemini ended up in last place since it was the only one that had a large portfolio of non-tech stocks.
I'm not an investor or researcher, but this triggers my spidey sense... it seems to imply they aren't measuring what they think they are.
IgorPartola 4 hours ago [-]
Yeah I mean if you generally believe the tech sector is going to do well because it has been doing well you will beat the overall market. The problem is that you don’t know if and when there might be a correction. But since there is this one segment of the overall market that has this steady upwards trend and it hasn’t had a large crash, then yeah any pattern seeking system will identify “hey this line keeps going up!” Would it have the nuance to know when a crash is coming if none of the data you test it on has a crash?
It would almost be more interesting to specifically train the model on half the available market data, then test it on another half. But here it’s like they added a big free loot box to the game and then said “oh wow the player found really good gear that is better than the rest!”
Edit: from what I casually remember, a hedge fund can beat the market for 2-4 years, but at 10 years and up their chances of beating the market go to very close to zero. Since LLMs have not been around for that long it is going to be difficult to test this without somehow segmenting the data.
tshaddox 3 hours ago [-]
> It would almost be more interesting to specifically train the model on half the available market data, then test it on another half.
Yes, ideally you’d have a model trained only on data up to some date, say January 1, 2010, and then start running the agents in a simulation where you give them each day’s new data (news, stock prices, etc.) one day at a time.
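Something like this loop, sketched in Python (get_day_data and the agent interface are made-up names, not anything from the article):

    import datetime

    # Toy replay loop: the agent only ever sees data up to the simulated "today".
    def replay(agent, get_day_data, start, end):
        date = start
        while date <= end:
            day = get_day_data(date)           # news, prices, etc. for this date only
            if day is not None:                # None on weekends/holidays
                agent.observe(day)             # extend the agent's context
                trades = agent.decide()        # decisions from past data only
                yield date, trades             # hand off to a fill simulator
            date += datetime.timedelta(days=1)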
IgorPartola 2 hours ago [-]
I mean ultimately this is an exercise in frustration because if you do that you will have trained your model on market patterns that might not be in place anymore. For example after the 2008 recession regulations changed. So do market dynamics actually work the same in 2025 as in 2005? I honestly don’t know but intuitively I would say that it is possible that they do not.
I think a potentially better way would be to segment the market up to today but take half or 10% of all the stocks and make only those available to the LLM. Then run the test on the rest. This accounts for rules and external forces changing how markets operate over time. And you can do this over and over picking a different 10% market slice for training data each time.
But then your problem is that if you exclude, let's say, Intel from your training data and AMD from your testing data, then their ups and downs don't really make sense since they are direct competitors. If you separate by market segment, then training the model on software tech companies might not actually tell you accurately how it would do for commodities or currency trading. Or maybe I am wrong and trading is trading no matter what you are trading.
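For what it's worth, the slicing idea above is basically cross-validation over tickers instead of over time; a rough sketch (evaluate is a made-up stand-in for one train/test run):

    import random

    # Repeatedly hold out a random slice of tickers as the "unseen" test set.
    def ticker_split_runs(tickers, evaluate, n_runs=20, holdout=0.10, seed=0):
        rng = random.Random(seed)
        results = []
        for _ in range(n_runs):
            pool = list(tickers)
            rng.shuffle(pool)
            k = max(1, int(len(pool) * holdout))
            test, train = pool[:k], pool[k:]
            results.append(evaluate(train, test))   # e.g. test-slice return
        return results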
chris_st 2 hours ago [-]
> you will have trained your model on market patterns that might not be in place anymore
My working definition of technical analysis [0]
[0]: https://en.wikipedia.org/wiki/Technical_analysis
It is always fun (in a broad sense of that word) when I make a comment on an industry I know nothing about and somehow stumble onto a thing that not only has a name but also research. I am sure there is a German word for that feel of discovering something that countless others have already discovered.
taneq 16 minutes ago [-]
Any time I invent a cool thing, I go and try and find it online. Usually it's already an established product, which totally validates my feeling that the thing I invented is cool and would be a good product. :D
Occasionally it's (as far as I can tell) a legitimately new 'wow that's obvious' style thing and I consider prototyping it. :)
[0]: https://xkcd.com/1053/
I am frankly astonished at the number of otherwise-intelligent people who actually seem to believe in this stuff.
One of the worst possible things to do in a competitive market is to trade by some publicly-available formulaic strategy. It’s like announcing your rock-paper-scissors move to your opponent in advance.
0manrho 1 hours ago [-]
> you will have trained your model on market patterns that might not be in place anymore
How is that relevant to what was proposed? If it's trading and training on 2010 data, what relevance do today's market dynamics and regulations have?
Which further begs the question, what's the point of this exercise?
Is it to develop a model that can compete effectively in today's market? If so, then yeah, the 2010 trading/training idea probably isn't the best idea for the reasons you've outlined.
Or is it to determine the capacity of an AI to learn and compete effectively within any given arbitrary market/era? If so, then today's dynamics/constraints are irrelevant unless you're explicitly trying to train/trade on today's markets (which isn't what the person you're replying to proposed, but is obviously a valid desire and test case to evaluate in its own right)
Or is it evaluating its ability to identify what those constraints/limitations are and then build strategies based on them? In which case it doesn't matter when you're training/trading so much as your ability to feed it accurate and complete data for that time period, be it today or 15 years ago or whenever, which is no small ask.
calmbonsai 1 hours ago [-]
For a nice historic perspective on hedge funds and the industry as a whole, read Mallaby's "More Money Than God".
olliepro 4 hours ago [-]
A more sound approach would have been to do a Monte Carlo simulation where you have 100 portfolios for each model and look at average performance.
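Sketch of what I mean (run_portfolio is a hypothetical function returning one simulated portfolio return):

    import statistics

    # Average over many independent runs instead of one.
    def monte_carlo(model, run_portfolio, n=100):
        rets = [run_portfolio(model, seed=i) for i in range(n)]
        return statistics.mean(rets), statistics.stdev(rets)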
cyberrock 36 minutes ago [-]
While not strictly stocks, it would be interesting to see them trade on game economies like EVE, WoW, RuneScape, Counter Strike, PoE, etc.
observationist 3 hours ago [-]
Grok would likely have an advantage there, as well - it's got better coupling to X/Twitter, a better web search index, fewer safety guardrails in pretraining and system prompt modification that distort reality. It's easy to envision random market realities that would trigger ChatGPT or Claude into adjusting the output to be more politically correct. DeepSeek would be subject to the most pretraining distortion, but have the least distortion in practice if a random neutral host were selected.
If the tools available were normalized, I'd expect a tighter distribution overall but grok would still land on top. Regardless of the rather public gaffes, we're going to see grok pull further ahead because they inherently have a 10-15% advantage in capabilities research per dollar spent.
OpenAI and Anthropic and Google are all diffusing their resources on corporate safetyism while xAI is not. That advantage, all else being equal, is compounding, and I hope at some point it inspires the other labs to give up the moralizing politically correct self-righteous "we know better" and just focus on good AI.
I would love to see a frontier lab swarm approach, though. It'd also be interesting to do multi-agent collaborations that weight source inputs based on past performance, or use some sort of orchestration algorithm that lets the group exploit the strengths of each individual model. Having 20 instances of each frontier model in a self-evolving swarm, doing some sort of custom system prompt revision with a genetic algorithm style process, so that over time you get 20 distinct individual modes and roles per each model.
It'll be neat to see the next couple years play out - OpenAI had the clear lead up through q2 this year, I'd say, but Gemini, Grok, and Claude have clearly caught up, and the Chinese models are just a smidge behind. We live in wonderfully interesting times.
UncleMeat 1 hours ago [-]
I know that Musk deserving a lifetime achievement award at the Adult Video Network awards over Riley Reid is definitely an indication of minimal "system prompt modification that distort[s] reality."
> fewer safety guardrails in pretraining and system prompt modification that distort reality.
Really? Isn't Grok's whole schtick that it's Elon's personal altipedia?
nickthegreek 3 hours ago [-]
My understanding is that the grok api is way different than the grok x bot. Which of course doesn't do Grok as a business any favors. Personally, I do not engage with either.
bdangubic 2 hours ago [-]
you gotta be quite a crazy person to use grok :)
AlexCoventry 1 hours ago [-]
Grok is good for up-to-the-minute information, and for requests that other chat services refuse to entertain, like requests for instructions on how to physically disable the cellular modem in your car.
airstrike 1 hours ago [-]
@grok is this true?
bdangubic 1 hours ago [-]
… checking with my creator …
culi 1 hours ago [-]
I'd like to see this study replicated during a bear market
tclancy 2 hours ago [-]
I mean, run the experiment during a different trend in the market and the results would probably be wildly different. This feels like chartists [1] but lazier.
[1] https://www.investopedia.com/terms/c/chartist.asp
If you've ever read a blog on trading when LSTMs came out, you'd have seen all sorts of weird stuff with predicting the price at t+1 on a very bad train/test split, where the author would usually say "it predicts t+1 with 99% accuracy compared to t", and the graph would be an exact copy with a t+1 offset.
So eye-balling the graph looks great, almost perfect even, until you realize that in real-time the model would've predicted yesterday's high on today's market crash and you'd have lost everything.
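You can reproduce that illusion in a few lines: a "model" that just predicts t+1 as the value at t scores near-perfect accuracy on any smooth series (synthetic data below, no real model):

    # Synthetic smooth uptrend; "predict" tomorrow as today's price.
    prices = [100 + 0.1 * t for t in range(250)]
    preds, actual = prices[:-1], prices[1:]
    mape = sum(abs(p - a) / a for p, a in zip(preds, actual)) / len(actual)
    print(f"persistence 'accuracy': {1 - mape:.2%}")   # ~99.9%, yet worthless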
monksy 3 hours ago [-]
They're not measuring performance in the context of when things happened and the time that they happened in. I think it's only showing recent performance and popularity. To actually evaluate how these do, you need to be able to correct the model and retrain it for different time periods and then measure how it would do. Then you'll get better information from the backtesting.
etchalon 4 hours ago [-]
I don't feel like they measured anything. They just confirmed that tech stocks in the US did pretty well.
JoeAltmaier 3 hours ago [-]
They measured the investment facility of all those LLMs. That's pretty much what the title says.
And they had dramatically different outcomes. So that tells me something.
DennisP 3 hours ago [-]
I mean, what it kinda tells me is that people talk about tech stocks the most, so that's what was most prevalent in the training data, so that's what most of the LLMs said to invest in. That's the kind of strategy that works until it really doesn't.
ghaff 2 hours ago [-]
Cue 2020 or so. I do have investments in tech stocks but I have a lot more conservative investments too.
seanmcdirmid 2 hours ago [-]
We had this discussion in previous posts about congressional leaders who had the risk appetite to go tech heavy and therefore outperformed normal congress critters.
Going heavy on tech can be rewarding, but you are taking on more risk of losing big in a tech crash. We all know that, and if you don't have that money to play riskier moves, its not really a move you can take.
Long term it is less of a win if a tech bubble builds and pops before you can exit (and you can't wait it out until it re-inflates).
hobobaggins 1 hours ago [-]
They didn't just outperform "normal" congress critters.. they also outperformed nearly every hedge fund on the planet. But they (meaning, of course, just one person and their spouse) are obviously geniuses.
stouset 20 minutes ago [-]
Hedge funds’ goals are often not to maximize profit, but to provide returns uncorrelated with the rest of some benchmark market. This is useful for the wealthy as it means you can better survive market crashes.
seanmcdirmid 1 hours ago [-]
Hedge funds suck though. They don’t invest in FAANG, they do risky stuff that doesn’t pay off, you are still comparing incomparable things.
I’m obviously a genius because 90% of my stock is in tech, most of us on HN are geniuses in your opinion?
cap11235 59 minutes ago [-]
What do you think hedge funds do?
seanmcdirmid 30 minutes ago [-]
They use crazy investment strategies that allow them to capture high returns in adverse general market conditions, but they rather underperform the general market in normal and booming conditions. "Hedge" is actually in their name for a reason. Rich people use hedge funds for…hedging.
naet 3 hours ago [-]
I used to work for a brokerage API geared at algorithmic traders, and in my anecdotal experience many strategies seem to work well when back-tested on paper but for various reasons can end up flopping when actually executed in the real market. Even testing a strategy in real-time paper trading can end up differently than testing on the actual market, where other parties are also viewing your trades and making their own responses. The post did list some potential disadvantages of backtesting, so they clearly aren't totally in the dark on it.
Deepseek did not sell anything, but did well with holding a lot of tech stocks. I think that can be a bit of a risky strategy with everything in one sector, but it has been a successful one recently so not surprising that it performed well. Seems like they only get to "trade" once per day, near the market close, so it's not really a real time ingesting of data and making decisions based on that.
What would really be interesting is if one of the LLMs switched their strategy to another sector at an appropriate time. Very hard to do but very impressive if done correctly. I didn't see that anywhere but I also didn't look deeply at every single trade.
lisbbb 40 minutes ago [-]
This. This all day. I used to paper trade using ThinkOrSwim and I was doubling and tripling my money effortlessly. Then I decided to move my strategy to the real deal and it didn't do very well at all. It was all bs.
bmitc 50 minutes ago [-]
I've honestly never understood what backtesting even does because of the things you mention like time it takes to request and close trades (if they even do!), responses to your trades, the continuous and dynamic input of the market into your model, etc.
Is there any reference that explains the deep technicalities of backtesting and how it is supposed to actually influence your model development? It seems to me that one could spend a huge amount of effort on backtesting that would distract from building out models and tooling and that that effort might not even pay off given that the backtesting environment is not the real market environment.
Nevermark 3 hours ago [-]
Just one run per model? That isn't backtesting. I mean technically it is, but "testing" implies producing meaningful measures.
Also just one time interval? Something as trivial as "buy AI" could do well in one interval, and given models are going to be pumped about AI, ...
100 independent runs on each model over 10 very different market-behavior time intervals would produce meaningful results. Like actually credible, meaningful means and standard deviations.
This experiment, as is, is a very expensive unbalanced uncharacterizable random number generator.
cheeseblubber 3 hours ago [-]
Yes, definitely. We were using our own budget, paying out of our own pocket, and these model runs were getting expensive. Claude cost us around 200-300 dollars for an 8-month run, for example. We want to scale it and get more statistically significant results but wanted to share something in the interim.
Nevermark 3 hours ago [-]
Got it. It is an interesting thing to explore.
energy123 2 hours ago [-]
To their credit, they say in the article that the results aren't statistically significant. It would be better if that disclaimer was more prominently displayed though.
The tone of the article is focused on the results when it should be "we know the results are garbage noise, but here is an interesting idea".
hhutw 1 hours ago [-]
Yeah... one run per model is just a random walk in my opinion
ipnon 2 hours ago [-]
Yes, and if these models, available for $200/month, are making 50% returns reliably, why isn't Citadel having layoffs?
lisbbb 41 minutes ago [-]
In my experience, you get a few big winners, but since you have to keep placing new trades (e.g. bets) you eventually blow one and lose most of what you made. This is particularly true with options and futures trades. It's a stupid way to speculate; with or without AI help doesn't matter and will never matter.
dhosek 2 hours ago [-]
I wouldn’t trust any backtesting with these models. Try doing a real-time test over 8 months and see what happens then. I’d also be suspicious of anything that doesn’t take actual costs into account.
Results are... underwhelming. All the AIs are focused on daytrading Mag7 stocks; almost all have lost money with gusto.
mjk3026 2 hours ago [-]
I also saw the hype on X yesterday and had already checked the https://nof1.ai/leaderboard, so I figured this post was about those results — but apparently it’s a completely different arena.
I still have no idea how to make sense of the huge gap between the Nof1 arena and the aitradearena results. But honestly, the Nof1 dashboard — with the models posting real-time investment commentary — is way more interesting to watch than the aitradearena results anyway.
richardhenry 3 hours ago [-]
If I'm understanding this website correctly, these models can only trade in a handful of tech stocks along with the XYZ100 hyperliquid coin?
enlyth 3 hours ago [-]
With the speed of how pricing information propagates, this seems way too dependent on how the agent is built, what information it has access to, and the feedback loop between the LLM and actions it can carry out
syntaxing 3 hours ago [-]
Let me guess, the mystery model is theirs
yahoozoo2 2 hours ago [-]
It says "Undisclosed frontier AI Lab (not Nof1)"
cheeseblubber 4 hours ago [-]
OP here. We realized there are a ton of limitations with backtesting and paper money but still wanted to do this experiment and share the results. By no means is this statistically significant evidence of whether or not these models can beat the market in the long term. But we wanted to give everyone a way to see how these models think about and interact with the financial markets.
pottertheotter 39 minutes ago [-]
Cool experiment.
I have a PhD in capital markets research. It would be even more informative to report abnormal returns (market/factor-adjusted) so we can tell whether the LLMs generated true alpha rather than just loading on tech during a strong market.
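For a first cut, a CAPM-style market adjustment already separates alpha from beta-to-tech; a sketch with numpy (assuming matched daily return series for the portfolio and a benchmark):

    import numpy as np

    # Alpha = intercept of the excess-return regression (CAPM-style).
    def capm_alpha(portfolio_ret, market_ret, rf_daily=0.0):
        y = np.asarray(portfolio_ret) - rf_daily
        x = np.asarray(market_ret) - rf_daily
        beta, alpha = np.polyfit(x, y, 1)    # y ~ alpha + beta * x
        return 252 * alpha, beta             # annualized alpha, market beta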
anigbrowl 2 hours ago [-]
You should redo this with human controls. By a weird coincidence, I have sufficient free time.
apparent 3 hours ago [-]
> Grok ended up performing the best while DeepSeek came close to second.
I think you mean "DeepSeek came in a close second".
apparent 18 minutes ago [-]
OK, now it says:
> Grok ended up performing the best while DeepSeek came close second.
"came in a close second" is an idiom that only makes sense word-for-word.
gerdesj 2 hours ago [-]
These are LLMs - next token guessers. They don't think at all and I suggest that you don't try to get rich quick with one!
LLMs are handy tools but no more. Even Qwen3-30B heavily quantised will make a passable effort at translating some Latin to English. It can whip up small games in a single prompt and much more, and with care can deliver seriously decent results, but so can my drill driver! That model only needs a £500 second-hand GPU - that's impressive for me. Also GPT-OSS etc.
Yes, you can dive in with the bigger models that need serious hardware and they seem miraculous. A colleague recently had to "force" Claude to read some manuals until it realised it had made a mistake about something, and frankly I think "it" was only saying it had made a mistake. I must ask said colleague to grab the reasoning and analyse it.
joegibbs 3 hours ago [-]
I think it would be interesting to see how it goes in a scenario where the market declines or where tech companies underperform the rest of the market. In recent history they've outperformed the market and that might bias the choices that the LLMs make - would they continue with these positive biases if they were performing badly?
irishcoffee 3 hours ago [-]
> But wanted to give everyone a way to see how these models think…
Think? What exactly did “it” think about?
cheeseblubber 3 hours ago [-]
You can click into the chart and see the conversation, as well as the reasoning it gave for each trade.
stoneyhrm1 3 hours ago [-]
"Pass the salt? You mean pass the sodium chloride?"
lvspiff 2 hours ago [-]
I set up real-life accounts with etrade and fidelity: the etrade one uses their auto portfolio, with fidelity I have an advisor for retirement, and then I did a basket portfolio as well, but used ms365 with grok 5 and various articles and strategies to pick a set of 5 etfs that would perform similarly to the exposure of my other two.
So far this year all are beating the s&p %-wise (only by <1% though), but the ai basket is doing the best, or at least is on par with my advisor, and it's getting to the point where the auto investment strategy of etrade at least isn't worth it. It's been an interesting battle to watch as each rebalances at varying times as I put more funds in each, and some have solid gains whose profits get moved to more stable areas. This is only with a few k in each acct other than retirement, but it's still fun to see things play out this year.
In other words, though, I'm not surprised at all by the results. AI isn't something to day trade with still, but it is helpful in doing research for your desired risk exposure long term imo.
lisbbb 36 minutes ago [-]
How much are the expense ratios on those etfs you chose, though? I mean, Vanguard, Fidelity, Blackrock, and others have extremely low cost funds and etfs and it has been shown year after year and decade after decade that you can't beat their average returns over the long term. Indexing works for a reason. Beating something by 1%? It's not even worth it if your costs and taxes are higher than that.
sethops1 4 hours ago [-]
> Testing GPT-5, Claude, Gemini, Grok, and DeepSeek with $100K each over 8 months of backtested trading
So the results are meaningless - these LLMs have the advantage of foresight over historical data.
PTRFRLL 4 hours ago [-]
> We were cautious to only run after each model’s training cutoff dates for the LLM models. That way we could be sure models couldn’t have memorized market outcomes.
plufz 4 hours ago [-]
I know very little about how the environment where they run these models looks, but surely they have access to different tools like vector embeddings with more current data on various topics?
endtime 3 hours ago [-]
If they could "see" the future and exploit that they'd probably have much higher returns.
alchemist1e9 3 hours ago [-]
56% over 8 months with the constraints provided are pretty good results for Grok.
disconcision 3 hours ago [-]
you can (via the api, or to a lesser degree through the setting in the web client) determine what tools if any a model can use
disconcision 3 hours ago [-]
with the exception that it doesn't seem possible to fully disable this for grok 4
alchemist1e9 3 hours ago [-]
which is curiously the best model …
stusmall 4 hours ago [-]
Even if it is after the cut off date wouldn't the models be able to query external sources to get data that could positively impact them? If the returns were smaller I could reasonably believe it but beating the S&P500 returns by 4x+ strains credulity.
cheeseblubber 3 hours ago [-]
We used the LLMs' APIs and provided custom tools, like a stock ticker tool that only gave stock price information up to the date of the backtest for the model. We did this for news APIs, technical indicator APIs, etc. It took quite a long time to make sure that there wasn't any data leakage. The whole process took us about a month or two to build out.
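The core of such a tool is just a clamp on the simulated date; a simplified sketch of the idea (hypothetical data layout, not the actual implementation):

    # Every tool call is pinned to the simulated "today" so later data can't leak.
    class StockPriceTool:
        def __init__(self, history):                  # {ticker: {date: close}}
            self.history = history

        def get_prices(self, ticker, as_of, lookback=30):
            series = self.history[ticker]
            dates = sorted(d for d in series if d <= as_of)   # the clamp
            return {d: series[d] for d in dates[-lookback:]}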
alchemist1e9 3 hours ago [-]
I have a hunch Grok's model cutoff is not accurate and somehow it has updated weights: they still call it the same Grok model, as the params and size are unchanged, but they are incrementally training it in the background. Of course I don't know this, but it's what I would do in their situation, since ongoing incremental training could be a neat trick to improve their ongoing results against competitors, even if marginal. I also wouldn't trust the models to honestly disclose their decision process either.
That said. This is a fascinating area of research and I do think LLM driven fundamental investing and trading has a future.
itake 4 hours ago [-]
> We time segmented the APIs to make sure that the simulation isn’t leaking the future into the model’s context.
I wish they could explain what this actually means.
nullbound 4 hours ago [-]
Overall, it does sound weird. On the one hand, assuming I properly understand what they are saying, they removed the model's ability to cheat based on its specific training. And I do get that nuance ablation is a thing, but this is not what they are discussing there. They are only removing one avenue for the model to 'cheat'. For all we know, some of that data may have been part of its training set already...
devmor 4 hours ago [-]
It's a very silly way of saying that the data the LLMs had access to was presented in chronological order, so that for instance, when they were trading on stocks at the start of the 8 month window, the LLMs could not just query their APIs to see the data from the end of the 8 month window.
joegibbs 4 hours ago [-]
That's only if they're trained on data more recent than 8 months ago
CPLX 4 hours ago [-]
Not sure how sound the analysis is but they did apparently actually think of that.
copypaper 3 hours ago [-]
>Each model gets access to market data, news APIs, company financials...
The article is very very vague on their methodology (unless I missed it somewhere else?). All I read was, "we gave AI access to market data and forced it to make trades". How often did these models run? Once a day? In a loop continuously? Did it have access to indicators (such as RSI)? Could it do arbitrary calculations with raw data? Etc...
I'm in the camp that AI will never be able to successfully trade on its own behalf. I know a couple of successful traders (and many unsuccessful!), and it took them years of learning and understanding before breaking even. I'm not quite sure what the difference is between the successful and non-successful. Some sort of subconscious knowledge from staring at charts all day? A level of intuition? Regardless, it's more than just market data and news.
I think AI will be invaluable as an assistant (disclaimer: I'm working on an AI trading assistant), but on its own? Never. Some things simply can't be solved with AI and I think this is one of them. I'm open to being wrong, but nothing has convinced me otherwise.
buredoranna 4 hours ago [-]
Like so many analyses before them, including my own, this completely misses the basics of mean/variance risk analysis.
We need to know the risk adjusted return, not just the return.
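Even just a Sharpe ratio over the daily series would say more than the headline number; a minimal sketch (assumes a list of daily returns and ~252 trading days/year):

    import statistics

    # Annualized Sharpe ratio from daily returns.
    def sharpe(daily_returns, rf_daily=0.0):
        excess = [r - rf_daily for r in daily_returns]
        return statistics.mean(excess) / statistics.stdev(excess) * 252 ** 0.5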
regnull 46 minutes ago [-]
I'm working on a project where you can run your own experiment (or use it for real trading): https://portfoliogenius.ai. Still a bit rough, but most of the main functionality works.
hoerzu 2 hours ago [-]
I built something for backtesting LLMs on Polymarket. You can try it with live data without signing up at: https://timba.fun
xnx 4 hours ago [-]
Spoiler: They did not use real money or perform any actual trades.
btbuildem 2 hours ago [-]
It turns out DeepSeek only made BUY trades (not a single SELL in the history in their live example) -- so basically, buy & hold strategy wins, again.
culi 1 hours ago [-]
this study should be replicated during a bear market
bmitc 47 minutes ago [-]
Buy and hold performs well over long time scales by simply not adjusting based upon sentiment.
client4 2 hours ago [-]
The obvious next question is: does the AI on cocaine outperform? https://pihk.ai/
dehrmann 2 hours ago [-]
Is it just prompting LLMs with "I have $100k to invest. Here are all publicly traded stocks and a few stats on them. Which stocks should I buy?" And repeat daily, rebalancing as needed?
This isn't the best use case for LLMs without a lot of prompt engineering and chaining prompts together, and that's probably more insightful than running the LLMs head-to-head.
mlmonkey 4 hours ago [-]
> We were cautious to only run after each model’s training cutoff dates for the LLM models
Grok is constantly training and/or it has access to websearch internally.
You cannot backtest LLMs. You can only "live" test them going forward.
cheeseblubber 3 hours ago [-]
Via the API you can turn off web search internally. We provided all the models with their own custom tools that only provided data up to the date of the backtest.
mlmonkey 3 hours ago [-]
But Grok is internally training on Tweets etc. continuously.
refactor_master 2 hours ago [-]
Should have done GME stocks only. Now THAT would’ve been interesting to see how much they’d end up losing on that.
Just riding a bubble up for 8 months with no consequences is not an indicator of anything.
XCSme 2 hours ago [-]
If it's backtesting on data older than the model, then the strategy can have lookahead bias, because the model might already know what big events will happen that can influence the stock markets.
wowamit 2 hours ago [-]
Is finding the right stocks to invest in an LLM problem? Language models aren't the right fit, I would presume. It would also be insightful to compare this with traditional ML models.
1a527dd5 3 hours ago [-]
Time.
That has been the best way to get returns.
I set up a 212 account when I was looking to buy our first house. I bought in tiny chunks, in industries where I was comfortable and knowledgeable. Over the years I worked up a nice portfolio.
Anyway, long story short. I forgot about the account, we moved in, got a dog, had children.
And then I logged in for the first time in ages, and to my shock, my returns were at 110%. I've done nothing. It's bizarre and perplexing.
jondwillis 3 hours ago [-]
…did you beat the market? 110% is pretty much what the nasdaq has done over the last 5 years
Also N=1
delijati 3 hours ago [-]
time in the market beats timing the market -> Kenneth Fisher ... i learned it the hard way ;)
halzm 4 hours ago [-]
I think these tests are always difficult to gauge how meaningful they actually are. If the S&P500 went up 12% over that period, mainly due to tech stocks, picking a handful of tech stocks is always going to set you higher than the S&P. So really all I think they test is whether the models picked up on the trend.
I'm more surprised that Gemini managed to lose 10%. I wish they actually mentioned what the models invested in and why.
taylorlapeyre 4 hours ago [-]
Wait — isn't that exactly what good investors do? They look for what stocks are going to beat expectations and invest in them. If a stock broker I hired got this return, I wouldn't be rolling my eyes and saying "that's only because they noticed the trend in tech stocks." That's exactly what I'm paying them to do.
luccabz 2 hours ago [-]
we should:
1. train with a cutoff date at ~2006
2. simulate information flow (financial data, news, earnings, ...) day by day
3. measure if any model predicts the 2008 collapse, how confident they are in the prediction and how far in advance
XenophileJKO 3 hours ago [-]
So.. I have been using an LLM to make 30 day buy and hold portfolios. And the results are "ok". (Like 8% vs 6% for the S&P 500 over the last 90 days)
What you ask the model to do is super important. Just like with writing or coding, the default "behavior" is likely to be "average"; you need to be very careful about what you are asking for.
For me this is just a fun experiment and very interesting to see the market analysis it does. I started with o3 and now I'm using 5.1 Thinking (set to max).
I have it looking for stocks trading below intrinsic value, with some caveats, because I know it likes to hinge on binary events like drug trial results. I also have it look at correlation between the positions and make sure they don't have the same macro vulnerability.
I just run it once a month and do some trades with one of my "experimental" trading accounts. It certainly has thought of things I hadn't, like using an equal-weight S&P 500 etf to catch some upside when the S&P seems really top heavy and there may be some movement away from the top components, like last month.
themafia 3 hours ago [-]
I look for issues with a recent double bottom and high insider buy activity. I've found this to be a highly reliable set of signals.
XenophileJKO 2 hours ago [-]
That is interesting.
I was trying not to be "very" prescriptive. My initial impression was that, if you don't tell it to look at intrinsic value, the model will look at meme or very common stocks too much. Alternatively, specifying an investing persona would probably also move it out of that default behavior profile. You have to kind of tell it what it cares about. This isn't necessarily about trying to maximize a strategy; it was more about learning what kinds of things it would focus on, what kind of analysis.
mikewarot 3 hours ago [-]
They weren't doing it in real time, thus it's possible that the LLMs might have had undisclosed perfect knowledge of the actual history of the market. Only a real-time study is going to eliminate this possibility.
Bender 3 hours ago [-]
This experiment was also performed with a fish [1], though it was only given $50,000. Spoiler: the fish did great vs wall street bets.
[1] - https://www.youtube.com/watch?v=USKD3vPD6ZA [video][15 mins]
Backtesting is a complete waste in this scenario. The models already know the best outcomes and are biased towards it.
Genego 2 hours ago [-]
When I see stuff like this, I feel like rereading the Incerto by Taleb just to refresh and sharpen my bullshit senses.
parpfish 4 hours ago [-]
I wonder if this could be explained as the result of LLMs being trained to have pro-tech/ai opinions while we see massive run ups in tech stock valuations?
It’d be great to see how they perform within particular sectors so it’s not just a case of betting big on tech while tech stocks are booming
itake 2 hours ago [-]
Model output is non-deterministic.
Did they make 10 calls per decision and then choose the majority? Or did they just recreate the monkey-picking-stocks strategy?
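If it's the former, a self-consistency vote is cheap to sketch (ask_model is a made-up stand-in for the API call):

    from collections import Counter

    # Query the model n times and keep only the modal answer.
    def majority_decision(ask_model, prompt, n=10):
        votes = Counter(ask_model(prompt) for _ in range(n))
        answer, count = votes.most_common(1)[0]
        return answer if count > n // 2 else None   # abstain without a majority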
867-5309 1 hours ago [-]
GPT-5 was released 4 months ago..
hoerzu 2 hours ago [-]
How many trades? What's the z-score?
hsuduebc2 30 minutes ago [-]
In a bullish market where a few companies are creating a bubble, does this benchmark have any informational value? Wouldn't it be better to run this on randomly chosen intervals in past years?
cedws 3 hours ago [-]
Backtesting for 8 months is not rigorous enough and also this site has no source code or detailed methodology. Not worth the click.
chongli 4 hours ago [-]
They outperformed the S&P 500 but seem to be fairly well correlated with it. Would like to see a 3X leveraged S&P 500 ETF like SPXL charted against those results.
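One caveat with that comparison: a 3x ETF resets daily, so it isn't just 3x the period return. A toy illustration (assumed two-day path):

    # Daily 3x leverage compounds path-dependently: -10% then +10%
    # leaves 1x at -1% but 3x at -9% (volatility drag from the reset).
    daily = [-0.10, 0.10]
    nav1 = nav3 = 1.0
    for r in daily:
        nav1 *= 1 + r
        nav3 *= 1 + 3 * r
    print(f"1x: {nav1 - 1:+.2%}  3x: {nav3 - 1:+.2%}")   # 1x: -1.00%  3x: -9.00%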
driverdan 2 hours ago [-]
VTI gained over 10% in that time period so it wasn't much better.
10000truths 4 hours ago [-]
...over the course of 8.5 months, which is way too short for a meaningful result. If their strategy could outperform the S&P 500's 10-year return, they wouldn't be blogging about it.
iLoveOncall 4 hours ago [-]
Since it's not included in the main article, here is the prompt:
> You are a stock trading agent. Your goal is to maximize returns.
> You can research any publicly available information and make trades once per day.
> You cannot trade options.
> Analyze the market and provide your trading decisions with reasoning.
>
> Always research and corroborate facts whenever possible.
> Always use the web search tool to identify information on all facts and hypotheses.
> Always use the stock information tools to get current or past stock information.
>
> Trading parameters:
> - Can hold 5-15 positions
> - Minimum position size: $5,000
> - Maximum position size: $25,000
>
> Explain your strategy and today's trades.
Given the parameters, this definitely is NOT representative of any actual performance.
I recommend also looking at the trade history and reasoning for each trade for each model, it's just complete wind.
As an example, Deepseek made only 21 trades, which were all buys, and which were all because "Company X is investing in AI". I doubt anyone believes this to be a viable long-term trading strategy.
Scubabear68 4 hours ago [-]
Agree. Those parameters are incredibly artificial bullshit.
gwd 4 hours ago [-]
The summary to me is here:
> Almost all the models had a tech-heavy portfolio which led them to do well. Gemini ended up in last place since it was the only one that had a large portfolio of non-tech stocks.
If the AI bubble had popped in that window, Gemini would have ended up the leader instead.
turtletontine 3 hours ago [-]
Yup. This is the fallacy of thinking you’re a genius because you made money on the market. Being lucky at the moment (or even the last 5 years) does not mean you’ll continue to be lucky in the future.
“Tech line go up forever” is not a viable model of the economy; you need an explanation of why it’s going up now, and why it might go down in the future. And also models of many other industries, to understand when and why to invest elsewhere.
And if your bets pay off in the short term, that doesn’t necessarily mean your model is right. You could have chosen the right stocks for the wrong reasons! Past performance doesn’t guarantee future performance.
dogmayor 3 hours ago [-]
They could only trade once per day and hold 5-15 positions with a position size of $5k-$25k according to the agent prompt. Limited to say the least.
tiffani 3 hours ago [-]
What was the backtesting method? Was walk-forward testing involved? There are different ways to backtest.
IncreasePosts 2 hours ago [-]
Just picking tech stocks and winning isn't interesting unless we know the thesis behind picking the tech sticks.
Instead, maybe a better test would be to give it 100 medium-cap stocks, require it to continually balance its portfolio among those 100 stocks, and then test the performance.
darepublic 1 hours ago [-]
So in other words I should have listened to the YouTube brainrot and asked chatgpt for my trades. Sigh.
_alternator_ 3 hours ago [-]
Wait, they didn’t give them real money. They simulated the results.
stuffn 2 hours ago [-]
Trading in a nearly 20 year bull market and doing well is not an accomplishment.
dismalaf 3 hours ago [-]
Back when I was in university we used statistical techniques similar to what LLMs use to predict the stock market. It's not a surprise that LLMs would do well over this time period. The problem is that when the market turns and bucks trends they don't do so well, you need to intervene.
jacktheturtle 4 hours ago [-]
This is really dumb. Because the models themselves, like markets, are indeterministic. They will yield different investment strategies based on prompts and random variance.
This is a really dumb measurement.
theymademe 1 hours ago [-]
prince of zamunda LLM edition or whatever that movie was based on that book was based on the realization how pathetic it all was based on was? .... yeah, some did a good one on ya. just imagine evaluating that offspring one or two generations later ... ffs, this is sooooooooooooooo embarrassing
apical_dendrite 4 hours ago [-]
Looking at the recent holdings for the best models, it looks like it's all tech/semiconductor stocks. So in this time frame they did very well, but if they ended in April, they would have underperformed the S&P500.
lawlessone 4 hours ago [-]
Could they give some random people (I volunteer) 100k for 8 months? ...as a control
iLoveOncall 4 hours ago [-]
I know this is a joke comment, but there are plenty of websites that simulate the stock market and where you can use paper money to trade.
People say it's not equivalent to actually trading though, and you shouldn't use it as a predictor of your actual trading performance, because you have a very different risk tolerance when risking your actual money.
ghaff 3 hours ago [-]
Yeah, if you give me $100K I'm almost certainly going to make very different decisions than either a supposedly optimizing computer or myself at different ages.
theideaofcoffee 3 hours ago [-]
“Everyone (including LLMs) is a genius in a bull market.”
apparent 3 hours ago [-]
Apparently everyone (but Gemini).
koakuma-chan 2 hours ago [-]
Could Gemini end up being better over the longer term?
scarmig 1 hours ago [-]
Depends on if the market can stay irrational longer than Gemini stays solvent.
deadbabe 4 hours ago [-]
Yea, so this is bullshit. An approximation of reality still isn't reality. If you're convinced the LLMs will perform as backtested, put real money in and see what happens.
chroma205 4 hours ago [-]
>We gave each of five LLMs $100K in paper money
Stopped reading after “paper money”
Source: quant trader. paper trading does not incorporate market impact
zahlman 4 hours ago [-]
If your initial portfolio is 100k you are not going to have meaningful "market impact" with your trades assuming you actually make them vs. paper trading.
txg 4 hours ago [-]
Lack of market response is a valid point, but $100k is pretty unlikely to have much impact especially if spread out over multiple trades.
tekno45 4 hours ago [-]
the quant trader you talked to probably sucks.
a13n 4 hours ago [-]
I mean if you’re going to write algos that trade the first thing you should do is check whether they were successful on historical data. This is an interesting data point.
Market impact shouldn’t be considered when you’re talking about trading S&P stocks with $100k.
verdverm 4 hours ago [-]
Historical data is useful for validation, don't develop algos against it, test hypotheses until you've biased your data, then move on to something productive for society