If I have an application that uses OpenAI models, then this service can act as a proxy between my application and the actual OpenAI service. It logs all of the requests that get sent to the OpenAI API. At some later time, I can go through and choose a subset of the API calls and mark them (I'm guessing as good or bad), and these get converted into a training set. I then have to create a value function as its own API that I run on my own servers somewhere (like fly.io). Then I start a training run, which I assume will use some open source AI model to regenerate responses to the training set derived from my initial OpenAI API calls. It then takes the generated responses from that open source model, sends them to my value function API which scores them, and then uses that score to apply some RL magic to the base open source model. At the end of this process I have an open source model that has been RL trained based on the captured API calls as well as the scoring from the value function.
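As a rough sketch, I'd imagine the value function ends up being just an HTTP endpoint that takes a prompt plus a candidate completion and returns a scalar score (the route and field names below are my own guesses, not anything documented):

    # Minimal sketch of a self-hosted value/reward function endpoint (e.g. on fly.io).
    # Route and field names are hypothetical, not the service's actual contract.
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class ScoreRequest(BaseModel):
        prompt: str
        completion: str

    class ScoreResponse(BaseModel):
        score: float  # higher = better

    @app.post("/score", response_model=ScoreResponse)
    def score(req: ScoreRequest) -> ScoreResponse:
        # Toy heuristic: reward non-empty answers that aren't excessively long.
        s = 0.0
        if req.completion.strip():
            s += 0.5
        if len(req.completion) < 2000:
            s += 0.5
        return ScoreResponse(score=s)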
I suppose the argument here is that an RL-trained open source model will perform your task better than the base OpenAI model. So your target market is people already using the OpenAI API who have the desire and funds to experiment with RL, the capability of defining a value function, the ability to sift through their API calls to identify the ones that aren't performing well and isolate them, and the willingness to swap out their OpenAI model for an open source model that is RL trained if it can be shown to be more accurate.
I would guess this market exists and the need is real. Defining a value function is much easier than building the infrastructure to RL a variety of open source models. So someone who wants to do this may appreciate paying someone else who has already set up the infrastructure. And they don't want to host their own model (they're already paying for OpenAI model hosting), so maybe they have no problem paying you for inference as well.
Whether or not this succeeds as a business really depends on how effective RL is for the clients you find. There are two paths here: RL is wildly successful and therefore so are you, or RL fine-tuning is unable to keep up with foundation model advancements and clients learn it is better to wait it out on the big fellas rather than go through the time-consuming and costly process.
Zollerboy1 2 days ago [-]
Wow! Thanks for taking the time to think through it. Yes, you are exactly right! I couldn’t have described Augento better than this myself. We actually want to make writing a reward function completely optional and build some RLHF (Reinforcement Learning from Human Feedback) loop soon. One of our long-term goals is to bring the cost of RL down so the barrier of entry to fine-tuning big models is not as high as it currently is.
spmurrayzzz 2 days ago [-]
I agree with you that the market exists and, as a result, solutions to this problem also exist in abundance. The most difficult part about building a product like the one presented here is making something super generic that works for a wide swath of use cases. If you simplify the stack to a more bespoke/custom approach, the build burden decreases exponentially.
For the folks who are already technical in this vertical, especially ones that leverage a low cardinality architecture (one or two models, small subset of tasks, etc), this type of thing is quite easy to build yourself first as a working prototype and then only slightly more difficult to productionize & automate.
I have some in-house infra that does similar work: it monitors inputs and outputs from models, puts them in a UI for a human to score/rank, preps a DPO dataset for training, and kicks off a training run. The total amount of calendar time I spent from prototype to production was roughly two person weeks. Changing the human intervention mechanism to an automated reward function would be an hour or two worth of work. If I had to make this work for all types of users, tasks, and models — no shot I'd have the time personally to pull that off with any reasonable velocity.
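The DPO prep step in particular is mostly bookkeeping; roughly this (a sketch assuming human-scored records with a prompt and ranked completions, emitting the prompt/chosen/rejected JSONL that trainers like TRL's DPOTrainer expect):

    import json

    # Hypothetical shape of a human-scored record from the review UI:
    # {"prompt": ..., "completions": [{"text": ..., "score": ...}, ...]}
    def to_dpo_pairs(records):
        pairs = []
        for r in records:
            ranked = sorted(r["completions"], key=lambda c: c["score"], reverse=True)
            if len(ranked) >= 2 and ranked[0]["score"] > ranked[-1]["score"]:
                pairs.append({
                    "prompt": r["prompt"],
                    "chosen": ranked[0]["text"],
                    "rejected": ranked[-1]["text"],
                })
        return pairs

    with open("scored.json") as f, open("dpo_train.jsonl", "w") as out:
        for pair in to_dpo_pairs(json.load(f)):
            out.write(json.dumps(pair) + "\n")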
With that said, having a nice UI with great observability into the whole process is a pretty big value-add to get out of the box as well.
(EDIT: for clarity, not affiliated at all with the OP project/org)
_ink_ 2 days ago [-]
Does it mean that after I successfully train the open source model, I don't need OpenAI anymore?
lukasego 2 days ago [-]
Yes, indeed
resiros 1 days ago [-]
Congrats on the launch! The idea sounds very interesting on paper. The tricky part though is the reward function.
Providing finetuning as a service works because the friction with finetuning is operational (getting the GPUs, preparing the training...), so the vendor can take care of that and give you an API. The work becomes straightforward and doesn't require much preparation - give us some examples and we'll provide you a model that works well with these and hopefully generalizes.
RL as a service is much trickier in my opinion. The friction is not only operational. Getting RL to work (at least from my probably outdated 10-year-old knowledge) is much harder because the real friction is in building the right reward function. I've skimmed your docs, and you don't say much about reward functions other than the obvious.
I think to get this to work, you need to improve your docs and examples a lot, and maybe focus on some recurrent use cases (e.g., customer support agent) with clear reward functions. Perhaps provide some building block reward functions and some UI/tools to help create them. Basically, find a way to remove the real friction on how to use RL in my agent - the reward function part.
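For example, "building block" reward functions could be as simple as small composable checks that get weighted and summed (a rough sketch of the idea, not anything the platform ships today):

    import json
    import re

    # Generic building blocks; each maps (prompt, completion) -> a score in [0, 1].
    def valid_json(prompt: str, completion: str) -> float:
        try:
            json.loads(completion)
            return 1.0
        except ValueError:
            return 0.0

    def within_length(max_chars: int):
        return lambda prompt, completion: 1.0 if len(completion) <= max_chars else 0.0

    def echoes_order_ids(prompt: str, completion: str) -> float:
        # e.g. a support agent should repeat back any order ids mentioned in the prompt
        ids = re.findall(r"#\d{6}", prompt)
        return 1.0 if all(i in completion for i in ids) else 0.0

    def combine(*weighted):
        def reward(prompt: str, completion: str) -> float:
            return sum(w * fn(prompt, completion) for w, fn in weighted)
        return reward

    # A customer-support reward assembled from the blocks above.
    support_reward = combine((0.5, echoes_order_ids), (0.5, within_length(1500)))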
In any case, congrats again on the launch. We're building an LLMOps platform (see my profile), there might be collaboration/integration potential, write me if you think that's interesting.
lukasego 1 days ago [-]
Thanks for this very lucid post! For many use cases, such as coding or formatting, it's very clear to users how to define the reward function. For more intricate ones, you're right that it can be tricky. I like your ideas of providing tools to help here and offering recurring reward functions as templates that only need slight adaptations. It will still be the user defining it, but there's a path to simplification. The operational friction of getting the GPUs, optimizing compute and preparing the training is hard for RL, hence we got those things out of the way.
Thanks for the very thoughtful suggestions and for reaching out, great input!
lukasego 8 hours ago [-]
Hi everyone, we removed the need to connect a subscription when you import a provider. You wouldn't have had to pay anyway - but now you can just go ahead and start data ingestion onto Augento without any friction.
And we continue to be happy to answer your feedback!
jacobross 20 hours ago [-]
Man, this is awesome. I've been obsessed with this idea since reading up on end-to-end RL used in reasoning models and OpenAI using it with Deep Research.
Seems like the most powerful agents will make use of some form of RL or advanced learning.
I'm not from an ML/DL background, but these ideas are fascinating and I've begun teaching myself some RL.
I'm curious how long this took to build. Any advice for someone wanting to learn more about RL in this context?
Thanks!
lukasego 8 hours ago [-]
The entirety of the production-ready platform took us 3-4 weeks to build, including figuring out RL and GPU infrastructure. If you want to know more about RL, you can check out Hugging Face. You can also hop on Augento https://augento.ai and join our Slack community. We'll answer and discuss any questions together and with others. You'd get $250 worth of free credits you can use to tinker with RL already - it'll teach you some stuff.
jacobross 7 hours ago [-]
That’s impressive. I have an extremely long chat with Claude that I did about a month ago discussing an idea very similar to this. Obviously an idea is worth next to nothing compared to what you and the team have created here but it’s becoming a genuine obsession of mine. Will Brown’s talk recently on RL ignited this even further given what he explained.
I’ll jump in this weekend.
Part of me wishes I did CS instead of learning SWE. There’s so much to uncover in RL and jumping straight in at the top feels like the wrong strategy to learn effectively.
I love the idea, love the platform. I’ll be keeping a close eye on how you guys go.
If you need a Technical Product Manager, let me know! I’m currently an Artificial Intelligence Lead at a hardware-enabled SaaS company but genuinely believe RL and agents will be the next step towards AGI.
HyprMusic 2 days ago [-]
This looks great.
I have a few questions. 1. I'm assuming by the pricing it's "serverless" inference, what's the cold-start time like? 2. Any idea on inference costs?
Also, just to reiterate what others have said: the option of exporting weights would definitely make it more appealing (although it sounds like that's on the roadmap).
Zollerboy1 1 days ago [-]
Thanks!
> I'm assuming by the pricing it's "serverless" inference, what's the cold-start time like?
Yeah, you could probably call it serverless inference. However, because all fine-tuned models are trained on the same base model(s), we have some interesting optimizations we can apply over standard "serverless" model deployment. The biggest is that we can keep the base model loaded in VRAM and only swap the trained weight deltas per request. This gives us sub-second cold-start times for inference in the average case.
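(For the curious: the general trick here is LoRA adapter hot-swapping. A rough sketch with Hugging Face PEFT, purely illustrative and not our actual serving code; model and adapter names are placeholders:)

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    # Keep one base model resident in VRAM...
    base = AutoModelForCausalLM.from_pretrained("base-model-name", device_map="cuda")
    tok = AutoTokenizer.from_pretrained("base-model-name")

    # ...and attach each fine-tune's weight deltas as a named adapter.
    model = PeftModel.from_pretrained(base, "adapters/customer-a", adapter_name="customer-a")
    model.load_adapter("adapters/customer-b", adapter_name="customer-b")

    def generate(adapter_name: str, prompt: str) -> str:
        model.set_adapter(adapter_name)  # cheap switch; base weights never leave VRAM
        inputs = tok(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=256)
        return tok.decode(out[0], skip_special_tokens=True)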
> Any idea on inference costs?
Right now, we’re pricing inference at $0.5/M input tokens, $2.5/M output tokens. That’s in a similar price range but a bit lower than gpt-4o/Claude 3.5, which we consider the main models we’re "competing" with. As it’s our goal to democratize access to models/agents in the long run, we hope that we can drop the prices for inference further, which should be enabled by some other optimizations we’re currently planning.
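For a quick back-of-the-envelope at those rates (a made-up request mix, just to illustrate):

    # $0.50 per 1M input tokens, $2.50 per 1M output tokens
    input_tokens, output_tokens = 2_000_000, 500_000
    cost = input_tokens / 1e6 * 0.50 + output_tokens / 1e6 * 2.50
    print(f"${cost:.2f}")  # -> $2.25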
thebeardisred 2 days ago [-]
I tell you what I don't like, the game y'all are playing with billing of Slack users:
> Where do you want to access #ext-customers?
> The organization you select is where you’ll find this channel in Slack. Admins will get a chance to review everything before you start collaborating.
> Tip: Add this Slack Connect channel to the organization that’s already connected with P2P Industries, or where you have similar channels.
JackYoustra 2 days ago [-]
I'm confused - what's this?
lmeierhoefer 1 days ago [-]
Yes, we wanted to incentivize people who want to use the platform (redeeming the $20 training credits) to also join a Slack channel, so we can give direct support. We should have pointed this out in the post.
mogili 2 days ago [-]
This is a good problem to solve. But making this closed source makes it a bad choice for us to use.
And the other aspect, as someone already pointed out, is that it seems to only work with single-agent workflows.
Zollerboy1 2 days ago [-]
We could open-source (parts of) our platform. What specifically would you like to see open-sourced?
We thought about developing this into a piece of software you can run in your own cloud (for compliance and security), but at the moment that makes the GPU economics really difficult and would probably only be interesting/relevant to big enterprises.
Anyway, we're definitely curious to hear if anyone has interesting applications for an open-source version of Augento!
tptacek 2 days ago [-]
Why does it being open-source matter for this particular use case?
filipeisho 2 days ago [-]
Also, I think if I were to use your product, I'd like to be able to host the model elsewhere in case I don't like the platform anymore :)
hannesfur 2 days ago [-]
That’s fair! It has been mentioned before, so we‘ll likely build that into the platform. Would you like us to upload your model to your Hugging Face account, let you download the weights, or upload it to an inference provider you choose?
oofbaroomf 2 days ago [-]
I (not GP) would like to be able to choose between the options. Inference provider isn't super necessary though (can do that through huggingface).
lukasego 2 days ago [-]
Thanks for stating your preference! This is something we can incorporate into the platform.
filipeisho 2 days ago [-]
Seems like download the weights would be the most flexible option. HF and inference providers would be nice to have.
georgeck 2 days ago [-]
Is this solution similar to the Direct Preference Optimization (DPO) [1] provided by another 'fine-tuning as a service' - OpenPipe?

[1] https://docs.openpipe.ai/features/dpo/overview
No, DPO avoids a Reinforcement Learning training loop. For the current iteration on verifiable domains, our method is GRPO.
Let me elaborate: DPO is for preference learning - each data sample in the dataset contains 2 pieces: a preferred and a non-preferred response (what the model should avoid generating). DPO optimizes for the preferred response of the 2. That means DPO is one effective method for making a model learn sentiment or preference. We call a generalization of this "alignment mode" - it's on our roadmap.
On the current GRPO implementation side, dataset needs on Augento are simpler: just the prompt, and some captured context if you like - it's then the reward function that scores the model generations.
Currently, with GRPO, training is done on verifiable domains. Instead of a preference pair, each piece of output is judged by a deterministic reward function or by a reward model (which the user decides - you choose it by defining the reward function).
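Roughly, the two dataset shapes look like this (illustrative field names, not our exact schema):

    # DPO: each sample pairs a preferred response with a non-preferred one.
    dpo_sample = {
        "prompt": "Summarize this support ticket ...",
        "chosen": "Concise, accurate summary ...",
        "rejected": "Rambling summary that misses the refund request ...",
    }

    # GRPO (verifiable mode): just the prompt plus optional captured context;
    # the user-defined reward function scores whatever the model generates at training time.
    grpo_sample = {
        "prompt": "Write a JSONata expression that selects all orders over $100",
        "context": {"schema": "..."},
    }

    def reward(prompt: str, completion: str) -> float:
        # e.g. 1.0 if the generated expression parses / passes tests, else 0.0
        ...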
(EDIT: Would you use DPO? Do you have experience with it or needs?)
lukasego 1 days ago [-]
To add, there is an important distinction to be made between RLHF (Reinforcement Learning from Human Feedback) and RL. DPO is a simpler and more efficient way to do RLHF. In its current iteration, Augento does RL (using the term coined by OpenAI: Reinforcement Fine-tuning), which improves model performance on domains where there exists a verification function for the answer that you can use for scoring, rather than a preferred answer such as DPO needs.
But as said, such preference mode is on the roadmap.
foundzen 2 days ago [-]
Only 20 training samples improved LLM performance? That sounds unrealistic! My experience with RLHF for LLM perf differs. Can you be more specific about the case where you achieved this and share technical details about how you did it?
lmeierhoefer 1 days ago [-]
We are not doing RLHF but fine-tuning directly on a reward function. Our task was around improving a coding agent, coding in JSONata (https://jsonata.org).
GPT-4o is quite bad at this, as there are not too many JSONata snippets on the internet. We collected 20 coding problems; the reward function then just assigned a scalar value based on whether the code output of the model was syntactically correct or not. (Most interestingly, we found that by optimizing for the syntax, it also got better at getting the semantics correct.)
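A reward like that can be just a few lines; roughly this (a sketch - the `jsonata` Python import and its `Jsonata` constructor are assumptions here, any JSONata parser would do the job):

    import jsonata  # hypothetical binding, e.g. a Python port of the reference parser

    def reward(prompt: str, completion: str) -> float:
        try:
            jsonata.Jsonata(completion)  # raises if the expression doesn't parse
            return 1.0                   # syntactically correct
        except Exception:
            return 0.0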
I think the discrepancy between our result with direct RL and your experience with RLHF comes from the fact that RLHF is built around non-verifiable/subjective domains, where intrinsically, the reward signal obtained by the HF-proxy is weak(er), i.e. for the same training scenario/prompt you need more samples to get to the same gradient.
EMIRELADERO 2 days ago [-]
RLHF != RL
filipeisho 2 days ago [-]
I love the idea of the product! I would trust your solution to be the best for very simple use cases but not for multistep or ReAct agents. Any thoughts / insights on that?
I think the demo could be more exciting, the voice of the person talking sounds like he's bored haha
dang 2 days ago [-]
Ha - here's the advice I give to YC startups about making demo videos for HN:
"What works well for HN is raw and direct, with zero production values. Skip any introductions and jump straight into showing your product doing what it does best. Voiceover is good, but no marketing slickness—no fancy logos or background music!"
I guess there's zero production values and zero production values...
filipeisho 2 days ago [-]
Totally agree. Raw is great, but energy matters too. If the person sounds bored, it's hard to get excited about the product—even if it's amazing. Passion is contagious.
lukasego 1 days ago [-]
That's true, thanks for the feedback! In the end, it wasn't boredom, but the long hours - we put too much energy into the platform ;) Taking it to heart for the next one!
lukasego 2 days ago [-]
Well... we took the rawness to heart, that's clear!
dang 2 days ago [-]
Which was exactly correct!
lmeierhoefer 2 days ago [-]
Yes, great point. We are currently working on multistep RL.
The big problem with the trivial approach (giving a single reward to the entire (ReAct) trajectory) is that the model receives a weak learning signal per decision (called the credit assignment problem in the literature), i.e. the individual decisions are not properly taken into account, which then makes the training unstable. I guess this has been an unsolved problem for a long time; however, it was not really looked at, since generalist “planning” agents were not a big thing in RL until o1/DeepSeek.
IMO, the most promising approach to this is something along the lines of MA-RLHF (https://arxiv.org/abs/2410.02743) but adapted to the real world, i.e., splitting up the reward model to grade individual actions inside the trajectory to reduce the “attention distance” between the reward and the decision.
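In pseudocode terms, the difference is whether one scalar gets smeared over the whole trajectory or each action gets its own grade (a conceptual sketch, not our implementation):

    # Trajectory-level reward: every action inherits the same weak signal.
    def trajectory_rewards(actions, final_outcome_reward):
        return [final_outcome_reward] * len(actions)

    # Per-action grading (the MA-RLHF-style idea): a grader scores each step,
    # shrinking the "attention distance" between a decision and its reward.
    def per_action_rewards(actions, grade_step):
        # grade_step(action, context) -> float, e.g. a small reward model or rubric
        rewards, context = [], []
        for a in actions:
            rewards.append(grade_step(a, context))
            context.append(a)
        return rewards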
serjester 2 days ago [-]
Are you worried about OpenAI and every other big lab eventually doing this? It’s going to be hard to get anyone to hand over this kind of data / control without a giant name attached.
lmeierhoefer 2 days ago [-]
No, not really. As I posted in the other thread, there are quite a few historical examples of why the big labs won’t take the entire market. They will push to publish something like this soon. Also, I think reinforcement fine-tuning is more convenient on the data-control side. Our platform allows you to self-host the reward function, so we only need the prompts; everything else can theoretically stay on the user side.
brap 2 days ago [-]
Congrats on the launch!
Noob question - from my understanding, SoTA proprietary models already provide APIs for fine tuning, I'd say it's only a matter of time before they provide RL based APIs, no?
lmeierhoefer 2 days ago [-]
Thanks! Yes, absolutely. OpenAI already has a reinforcement learning fine-tuning API in closed beta. However, historically, they’ve always left significant room for integrations into users’ systems. E.g. in the current demo of their RL fine-tuning platform, you can only select predefined reward functions and must manually upload the query datasets. I think that's the reason why so many open-source supervised fine-tuning companies exist.
My long-term take is that the agent economy will be built around a few labs providing (partially open-source) foundational models, where you don’t want to be part of the competition, as this will be the AI equivalent of the high-frequency trading arms race.
And above that will sit an infrastructure layer, specializing these very models to users’ domains. OpenAI/Anthropic/… RL fine-tuning will be a part of that infrastructure layer, but so will open-source-model alternatives like ours.
codingwagie 2 days ago [-]
This is just DevOps wrapped around an open source fine-tuning repo.
qeternity 2 days ago [-]
It's convenience. And people pay for convenience all the time.
lukasego 2 days ago [-]
People pay for convenience, that's true - and part of the equation here. Agreed! The approach is to make data capturing as convenient as possible: you just paste the API key + base URL into your existing code, and you gather all your runs. And then, Reinforcement Learning is hard to figure out - so one of the goals is to commoditize Reinforcement Learning, which is what you're alluding to. In its current iteration, the platform is released with a verifiable mode where Augento takes all the headache of GPU infrastructure, GRPO implementation, training configurations and dataset curation away - you just select your gathered runs and start the training. But we'll go past that and expand Augento into a platform for alignment and self-learning.
Tl;DR Yes, indeed! We designed Augento with convenience in mind.
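Concretely, the capture step is just pointing your existing OpenAI client at the proxy (the base URL below is a placeholder - check the docs for the real endpoint):

    from openai import OpenAI

    # Same application code as before; only base_url and api_key change,
    # so every call gets logged for later selection and training.
    client = OpenAI(
        base_url="https://api.augento.example/v1",  # placeholder
        api_key="<augento-api-key>",
    )

    resp = client.chat.completions.create(
        model="gpt-4o",  # later swappable for a fine-tuned model string, e.g. "augento:v2"
        messages=[{"role": "user", "content": "Generate a JSONata query for ..."}],
    )
    print(resp.choices[0].message.content)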
hannesfur 2 days ago [-]
In a sense, you are not wrong! But when we got started, we thought it would be way easier than it actually was. Procuring powerful GPUs alone is difficult, and so is collecting proper data. But of course you can still do everything yourself. If you want to give this a try yourself, I would recommend taking a look at torchtune (https://github.com/pytorch/torchtune).
noosphr 2 days ago [-]
People not in the field have no idea just how distorted the market is right now.
I was working at a startup doing end-to-end training for modified BERT architectures, and everything from buying a GPU - basically impossible right now; we ended up looking at sourcing franken cards _from_ China -
to the power and heat removal - you need a large factory's worth of power in the space of a small flat -
to pre-training something that's not been pre-trained before - say hello to throwing out more than 80% of pretraining runs because of a novel architecture -
was designed to burn money as fast as possible.
Without hugely deep pockets, a contract from NVidia, and a datacenter right next to a nuclear power plant, you can't compete at the model level.
hannesfur 2 days ago [-]
You are right. If you want to (or can) pay out of your own pocket, RunPod (https://www.runpod.io) deserves a shoutout here. We rented GPUs from them (they actually have them, and they are cheaper and more available than Lambda Labs) until we convinced AWS to give us capacity blocks.
But in general, both the prices for GPUs and their scarcity are really extreme, and unlike mining, you can't really use gaming or franken cards as a fallback. I can count the GPUs we can do this on (even for relatively small models) on one hand.
> https://aws.amazon.com/blogs/machine-learning/customize-deep...
> And charge for it?

Yes, you could do that.
However, you would have created a different platform than Augento. Maybe we should make the distinction clearer though.
The blog article you are referring to uses another method to fine-tune models that many other big platforms like Together AI (and even OpenAI themselves) are already supporting: Supervised Fine Tuning (SFT). We are doing Reinforcement Learning using GRPO instead.
SFT has the big caveat that it requires good prompt-completion datasets to work, which are rare/hard to curate for many use cases. For GRPO, you (the programmer) don’t even need to know what the correct answer is, as long as you can decide whether an answer is a good one (P vs. NP at its heart, essentially).
esafak 2 days ago [-]
Can users download the fine-tuned model?
lukasego 2 days ago [-]
For those that have the need, we'll make it possible for sure! Otherwise, the models are ready for inference directly through Augento - say you’ve been working with the OpenAI chat completions API, you'll just have to change the model string, e.g. to "augento:v2".
hannesfur 2 days ago [-]
If you send us an email, we can send you your weights right now :)
az226 2 days ago [-]
Can you export/download the models after training?
hannesfur 2 days ago [-]
Not yet, but since people seem to want this it’s at the top of our roadmap. If you train a model now and want to get the weights, just message us and we‘ll give them to you ;)
abc-1 2 days ago [-]
Neat. Maybe you guys should make a fine-tuning platform for DeepSeek specifically, with a fine-tune API similar to OpenAI's. You could expand out into hosting those models too.
hannesfur 2 days ago [-]
You mean fine-tuning that feels like SFT but is different (since you can't use that with reasoning models), built around the DeepSeek class of models?
abc-1 2 days ago [-]
I just want to fine-tune DeepSeek V3 chat, but it’s not possible or easy for regular consumers.
Lol, can't use it in non-subscription mode. It requires a subscription to import a provider.
lukasego 2 days ago [-]
Hi! You won't get billed for importing a provider. You just need a user account because your providers need to be associated with your Augento user. You can then start ingesting data onto the platform - free of charge, of course ;) Actual billing then applies only to training and inference.