▲Helix: A vision-language-action model for generalist humanoid controlfigure.ai

303 points by Philpax 259 days ago | 170 comments

porphyra 259 days ago [-]

It seems that end to end neural networks for robotics are really taking off. Can someone point me towards where to learn about these, what the state of the art architectures look like, etc? Do they just convert the video into a stream of tokens, run it through a transformer, and output a stream of tokens?

vessenes 259 days ago [-]

I was reading their site, and I too have some questions about this architecture.

I'd be very interested to see what the output of their 'big model' is that feeds into the small model. I presume the small model gets a bunch of environmental input, and some input from the big model, and we know that the big model input only updates every 30 or 40 frames in terms of small model.

Like, do they just output random control tokens from big model and embed those in small model and do gradient descent to find a good control 'language'? Do they train the small model on english tokens and have the big model output those? Custom coordinates tokens? (probably). Lots of interesting possibilities here.

By the way, the dataset they describe was generated by a large (much larger presumably) vision model tasked with creating tasks from successful videos.

So the pipeline is:

* Video of robot doing something

* (o1 or some other high end model) "describe very precisely the task the robot was given"

* o1 output -> 7B model -> small model -> loss

yurimo 259 days ago [-]

I don't know, there has been so many overhyped and faked demos in humanoid robotics space over the last couple years, it is difficult to believe what is clearly a demo release for shareholders. Would love to see some demonstration in a less controlled environment.

falcor84 259 days ago [-]

I suppose the next big milestone is Wozniak's Coffee Test: A robot is to enter a random home and figure out how to make coffee with whatever they have.

UltraSane 259 days ago [-]

That could still be decades away.

kilroy123 259 days ago [-]

I don't know... I'm starting to seriously think that is only 5-10 years away.

rtkwe 258 days ago [-]

Is that a real 5-10 years, a research 5-10 years[0] or 5-10 years of "FSD in the next 6 months"?

The demo space is so sterile and empty I think we're still a loong ways off from the Coffee test happening. One big thing I see is they don't have to rearrange other items they have nice open bins/drawers/shelves/etc to drop the items into. That kind of multistep planning has been a thorn in independent robotics for decades.

[0] https://xkcd.com/678/

ge96 259 days ago [-]

Imagine they bring one out to a construction site and they treat the robot as a new rookie guy, go pick up those pipes. That would be an ultimate on the fly test to me.

ortsa 259 days ago [-]

Picking up a bundle of loose pipes actually seems like a great benchmark for humanoid robots. Especially if they're not in a perfect pile. A full test could be something like grabbing all the pipes, from the floor, and putting them into a truck bed, in some (hopefully) sane fashion

sayamqazi 259 days ago [-]

I have my personal multimodal benchmark for physical robots.

You put a keyring with bunch of different keys in front of a robot and then instruct it pick it up and open a lock while you are describing which key is the correct one. Something like "Use the key with black plastic head and you need to put it in teeths facing down"

I have low hopes of this being possibe in the next 20 years. I hope I am still alive to witness if it ever happens.

m0llusk 259 days ago [-]

pick up that can, heh heh heh

causal 259 days ago [-]

I'm always wondering at the safety measures on these things. How much force is in those motors?

This is basically safety-critical stuff but with LLMs. Hallucinating wrong answers in text is bad, hallucinating that your chest is a drawer to pull open is very bad.

silentwanderer 259 days ago [-]

In terms of low-level safety, they can probably back out forces on the robot from current or torque measurement and detect collisions. The challenge comes with faster motions carrying lots of inertia and behavioral safety (e.g. don't pour oil on the stove)

rtkwe 258 days ago [-]

That's actually more of a solved problem. Robot arms that can track the force they're applying and where to avoid injuring humans have been kicking around for 10-15 years. It let them go out of the mega safety cells into the same space as people and even do things like letting the operator pose the robot to teach it positions instead of having to do it in a computer program or with a remote control.

The term I see a lot is co-robotics or corobots. At least that's what Kuka calls them.

Symmetry 258 days ago [-]

That's fine for wheeled robots or robots bolted to the floor but for legged robots, especially bipeds, the hard question is how to prevent them from falling over on things. These don't look heavy enough to be too dangerous for a standing adult but you've still got pets/children to worry about.

UltraSane 259 days ago [-]

You can have dedicated controllers for the motors that limit their max torque.

imtringued 259 days ago [-]

That's not enough. When a robot link is in motion and hits an object, the change in momentum creates an impulse over the duration of deceleration. The faster the robot moves, the faster it has to decelerate, the higher the instantaneous braking force at the impact point.

UltraSane 258 days ago [-]

Then limit max velocity.

cess11 259 days ago [-]

Not a big deal on the battlefield.

causal 259 days ago [-]

I'd say a very big deal when munitions and targeting are involved

cess11 257 days ago [-]

Why do you think that?

Battle fields are violent and exhausting, people get the shakes, make mistakes, hurt themselves and each other all the time. Munitions are generally designed to not explode from a bump in the road or getting dropped or squeezed, and targeting systems commonly support automatic tracking or similar, for this specific reason.

rizky05 259 days ago [-]

[dead]

mmh0000 259 days ago [-]

The thing in the video moves slower than the sloth in Zootopia. If you die by that robot, you probably deserve it.

throwaway0123_5 259 days ago [-]

As a sibling comment implies though, there's also danger from it being stupid while unsupervised. For example, I'd be very nervous having it do something autonomously in my kitchen for fear of it burning down my house by accident.

mikehollinger 259 days ago [-]

From a different robot (Boston Dynamics' new Atlas) - the system moves at a "reasonable" speed. But watch at 1m20s in this video[1]. You can see it bump and then move VERY quickly -- with speed that would certainly damage something, or hurt someone.

[1] https://www.youtube.com/watch?v=F_7IPm7f1vI

charlie0 259 days ago [-]

Especially if holding a knife or something sharp.

exe34 259 days ago [-]

or if you're old, injured, groggy from medication, distracted by something/someone else, blind, deaf or any number of things.

it's easy to take your able body for granted, but reality comes to meet all of us eventually.

dr_kiszonka 259 days ago [-]

They are designed to penetrate Holtzman shields, surely.

causal 259 days ago [-]

Are you saying it cannot move faster than they because of some kind of governor?

UltraSane 259 days ago [-]

That is how I would design it. It is common in safety critical PLC systems to have 1 or more separate safety PLCs that try to prevent bad things from happening.

idiotsecant 259 days ago [-]

Although in a SIL safety system the dangerous events are identified and extremely thoroughly characterized as part of system design.

There cannot be a safety system of this type for a generalist platform like a humanoid robot. It's possibility space is just too high.

I think the safety governor in this case would have to be a neural network that is at least as complex as the robots network, if not more so.

Which begs the question: what system checks that one for safety?

UltraSane 258 days ago [-]

Limiting max force applied CAN be can be characterized for this robot.

Symmetry 259 days ago [-]

A governor, the firmware in the motor controllers, something like that. Certainly not the neural network though.

Symmetry 259 days ago [-]

So, there's no way you can have fully actuated control of every finger joint with just 35 degrees of freedom. Which is very reasonable! Humans can't individually control each of our finger joints either. But I'm curious how their hand setups work, which parts are actuated and which are compliant. In the videos I'm not seeing any in-hand manipulation other than just grasping, releasing, and maintaining the orientation of the object relative to the hand and I'm curious how much it can do / they plan to have it be able to do. Do they have any plans to try to mimic OpenAI's one handed rubics cube demo?

wwwtyro 259 days ago [-]

Until we get robots with really good hands, something I'd love in the interim is a system that uses _me_ as the hands. When it's time to put groceries away, I don't want to have to think about how to organize everything. Just figure out which grocery items I have, what storage I have available, come up with an optimized organization solution, then tell me where to put things, one at a time. I'm cautiously optimistic this will be doable in the near term with a combination of AR and AI.

camjw 259 days ago [-]

Maybe I don't understand exactly what you're describing but why would anyone pay for this? When I bring home the shopping I just... chuck stuff in the cupboards. I already know where it all goes. Maybe you can explain more?

loudmax 259 days ago [-]

One use case I imagine is skilled workmanship. For example, putting on a pair of AR glasses and having the equivalent of an experienced plumber telling me exactly where to look for that leak and how to fix it. Or how to replace my brake pads or install a new kitchen sink.

When I hire a plumber or a mechanic or an electrician, I'm not just paying for muscle. Most of the value these professionals bring is experience and understanding. If a video-capable AI model is able to assume that experience, then either I can do the job myself or hire some 20 year old kid at roughly minimum wage. If capabilities like this come about, it will be very disruptive, for better and for worse.

hulahoof 259 days ago [-]

Sounds like what Hololens was designed to solve, more in the AR space than AI though

semi-extrinsic 259 days ago [-]

This is called "watching YouTube tutorials". We've had it for decades.

rolisz 259 days ago [-]

But what if there's no YouTube tutorial for the exact AC unit you have and it doesn't look like any of the videos you checked out?

cess11 259 days ago [-]

Have you met people that seem to be able to fix almost anything?

If you can't get a tutorial on your exact case you learn about the problem domain and intuit from there. Usually it works out if you're careful, unlike software.

semi-extrinsic 259 days ago [-]

Then you are equally fucked as the AI will be, so no difference.

Case in point, I remember about ten years ago our washing machine started making noise from the drum bearing. Found a Youtube tutorial for bearing replacement on the exact same model, but 3 years older. Followed it just fine until it was time to split the drum. Then it turned out that in the newer units like mine, some rent-seeking MBA fuckers had decided more profits could be had if they plastic welded shut the entire drum assembly. Which was then a $300 replacement part for a $400 machine.

An AI doesn't help with this type of shit. It can't know the unknown.

deepGem 259 days ago [-]

But once it knows it’s pretty certain to become common knowledge almost instantaneously. That’s not possible now. What you learn stays localised to you and may be people 1 degree away from you that’s it.

semi-extrinsic 259 days ago [-]

How does that work? None of the current AI models can re-train on the fly. How would the inference engine even know if it's a case of new information that needs to be fed back, or just a user that's not following instructions correctly?

deepGem 259 days ago [-]

This is correct. What I meant to say was that in due course, re-training on the fly will become a norm. Even without on the fly re-training we are looking at a small delta.

__MatrixMan__ 259 days ago [-]

It would be nice to be able to select a recipe and have it populate your shopping list based on what is currently in your cupboards. If you just chuck stuff in the cupboards then you have to be home to know what they contain.

Or you could wear it while you cook and it could give you nutrition information for whatever it is you cooked. Armed with that it could make recommendations about what nutrients you're likely deficient in based on your recent meals and suggest recipes to remedy the gap--recipes based on what it knows is already in the cupboard.

gopher_space 259 days ago [-]

Maybe I’m showing my age, but isn’t this a home ec class?

__MatrixMan__ 259 days ago [-]

I took home ec in 2001. I learned to use a sewing machine, it was great.

But none of the kitchen stuff we learned had anything to do with ensuring that this week's shopping list ensures that you'll get enough zinc next week, or the kind of prep that uses the other half of yesterday's cauliflower in tomorrow's dinner so that it doesn't go bad.

These aren't hard problems to solve if you've got time to plan, but they are hard to solve if you are currently at the grocery store and can't remember that you've got a half a cauliflower that needs an associated recipe.

mistercheph 259 days ago [-]

[flagged]

luma 259 days ago [-]

> why would anyone pay for this?

Presumably, they won't as this is still a tech demo. One can take this simple demonstration and think about some future use cases that aren't too different. How far away is something that'll do the dishes, cook a meal, or fold the laundry, etc? That's a very different value prop, and one that might attract a few buyers.

Philip-J-Fry 259 days ago [-]

The person you're replying to is referring to the GP. The GP asks for an AI that tells them where to put their shopping. Why would anyone pay for THAT? Since we already know where everything goes without needing an AI to tell us. An AI isn't going to speed that up.

SoftTalker 259 days ago [-]

Yes it's pretty amazing how so many people seem to overcomplicate simple household tasks by introducing unnecessary technology.

bear141 259 days ago [-]

Maybe some people just assume there is a “best” or “optimal” way to do everything and AI will tell us what that is. Some things are just preference and I don’t mind the tiny amount of energy that goes into doing small things the way I like.

jayd16 259 days ago [-]

Maybe they're imagining more complex tasks like working on an engine.

Philpax 259 days ago [-]

Sounds like what's described in Manna: https://marshallbrain.com/manna1

sho_hn 259 days ago [-]

Dunno, I would not want to bleed my mental faculties for doing even simple planning work like this by outsourcing it to AI. Reliance on crutches like this would seem like a pathway to early-onset dementia.

meowkit 259 days ago [-]

Already playing out, anecdotally to my experience.

Its similar to losing callouses on our hands if you don’t labor/go to the gym.

mistercheph 259 days ago [-]

[flagged]

RedNifre 259 days ago [-]

I fully agree, building something like this is somewhere in my back log.

I think the key point why this "reverse cyborg" idea is not as dystopian as, say, being a worker drone in a large warehouse where the AI does not let you go to the toilet is that the AI is under your own control, so you decide on the high level goal "sort the stuff away", the AI does the intermediate planning and you do the execution.

We already have systems like that, every time you use you tell your navi where you want to go, it plans the route and gives you primitive commands like "on the next intersection, turn right", so why not have those for cooking, doing the laundry, etc.?

Heck, even a paper calendar is already kinda this, as in separating the planning phase from the execution phase.

Jarwain 259 days ago [-]

I'm quite slowly working on something like this, but for time.

For "stuff" I think a bigger draw is having it so it can let me know "hey you already have 3 of those spices at locations x, y, and z, so don't get another" or "hey you won't be able to fit that in your freezer"

falcor84 259 days ago [-]

This is almost literally the first chapter in Marshall Brain's "Manna" [0], being the first step towards world-controlling AGI:

> Manna told employees what to do simply by talking to them. Employees each put on a headset when they punched in. Manna had a voice synthesizer, and with its synthesized voice Manna told everyone exactly what to do through their headsets. Constantly. Manna micro-managed minimum wage employees to create perfect performance.

[0] https://marshallbrain.com/manna1

__MatrixMan__ 259 days ago [-]

I imagine a something like a headlamp except it's a projector and a camera so it can just light up where it wants you to pick something up in one color or where it wants you to put it down in another color. It can learn from what it sees of my hands how the eventual robot should handle the space (e.g. not putting heavy things on top of fragile things and such).

I'd totally use that to clean my garage so that later I can ask it where the heck I put the thing or ask it if I already have something before I buy one...

lynx97 259 days ago [-]

A good AI fridge would be already a great starting point. With a checkin procedure that makes sure to actually know whats in the fridge. Complete with expiry tracking and recipe suggestions based on personal preferences combined with product expiry. I am totally unimpressed with almost everything I see in home automation these days, but I'd immediately buy the AI fridge if it really worked smoothly.

cactusplant7374 259 days ago [-]

Your solution sounds like the worst cognitive load for getting home from the grocery store and wanting it all to be over.

hooverd 259 days ago [-]

You already have one: a brain.

lucianbr 259 days ago [-]

You want to outsource thinking to a computer system and keep manual labor? You do you, but I want the opposite. I want to decide what goes where but have a robot actually put the stuff there.

TeMPOraL 259 days ago [-]

That's the problem, though - the computer is already better at thinking than you, but we still don't know how to make it good at arbitrary labor requiring a mix of precision and power, something humans find natural.

In other words: I'm sorry, but that's how reality turned out. Robots are better at thinking, humans better at laboring. Why fight against nature?

(Just joking... I think.)

RedNifre 259 days ago [-]

I think he means outsourcing everything eventually, but right now, outsourcing the thought process is possible, while outsourcing the manual labor is not.

malux85 259 days ago [-]

Yeah there’s more to it than that. Do you want a can of beans to be put in the utensil draw just because it would fit? If it was done as you describe the placement of all of your items would be almost random each time, the bot need to have contextual memory and familiarity with your storage habits and preferences.

This can be done of course, in your statement the phrase “just figure out” is doing a lot more heavy lifting than you allude to

htrp 259 days ago [-]

so the kiva-amazon model?

ziofill 259 days ago [-]

There’s nothing I want more than a robot that does house chores. That’s the real 10x multiplier for humans to do what they do best.

01100011 259 days ago [-]

I'd pay $2k for something that folds my laundry reliably. It doesn't need arms or legs, just like my dishwasher doesn't need arms or legs. It just needs to let me dump in a load of clean laundry and output stacks of neatly folded or hung clothing.

mistercheph 259 days ago [-]

There are many services that for ~$3/lb will pick up, wash, dry, fold/hang, and deliver 10lb's of laundry every week for $1,500/yr.

259 days ago [-]

imtringued 259 days ago [-]

Laundry folding machines already exist. You can find cheap ones on AliExpress.

https://www.aliexpress.com/w/wholesale-clothes-folding-machi...

mmh0000 259 days ago [-]

It's called a "house cleaner" and they only cost ~$150 (area and all varies) bi-weekly. I'll shit a brick (and then have the robot clean it up) if a robot is ever cheaper than ~$4000/yr.

vessenes 259 days ago [-]

A robot will definitely cost less than $30/hr eventually. But you'll be running it a lot more than a few hours every other week.

ziofill 259 days ago [-]

Yeah, but a robot will work 24/7, not 2h byweekly -_-

mclau156 259 days ago [-]

do you need a robot to work in your house 24/7?

ziofill 259 days ago [-]

Well perhaps not at night, but otherwise there’s always something to clean, something to fix, something to cook, take care of the yard.. heck I might need two robots ^^’

gigel82 259 days ago [-]

Why not? If it's done with all the chores, I can have it make some silly woodworking / art project for Etsy to earn its keep, or just loan it out to neighbors.

ben_w 259 days ago [-]

In the same way it's hard to earn money from using AI to make art, I don't see Etsy projects made by affordable domestic robots selling above cost.

dartos 259 days ago [-]

Hopefully in the next decade we’ll get there.

Vision+language multimodal models seem to solve some of the hard problems.

cess11 259 days ago [-]

To me this is such a weird wish. Why would you not want to care for your home and the people living there? Why would you want to have a slave taking these activities from you?

I'd rather have less waged labour and more time for chores with the family.

int_19h 258 days ago [-]

Because it's boring and tedious. And no, a robot is not a slave.

abraxas 259 days ago [-]

Yeah, except that future doesn't need us. By us I mean those of us who don't have $1B to their name.

Do you really expect the oligarchs to put up with the environmental degradation of 8 billion humans when they can have a pristine planet to themselves with their whims served by the AI and these robots?

I fully anticipate that when these things mature enough we'll see an "accidental" pandemic sweep and kill off 90% of us. At least 90%.

ben_w 259 days ago [-]

I'd expect Musk and Bezos to know about von Neumann replicators; factories that make these robots staffed entirely by these robots all the way to the mines digging minerals out of the ground… rapid and literally exponential growth until they hit whatever the limiting factor is, but they've both got big orbial rockets now, so the limit isn't necessarily 6e24 kg.

ewjt 259 days ago [-]

Oligarchs would use the robots to kill people instead of a pandemic. A virus carries too much risk of affecting the original creators.

Fortunately, robotic capability like that basically becomes the equivalent of Nuclear MAD.

Unfortunately, the virus approach probably looks fantastic to extremist bad actors with visions of an afterlife.

siavosh 259 days ago [-]

What do humans do best?

jayd16 259 days ago [-]

Everything everything else is worse at.

ein0p 259 days ago [-]

Browse Instagram, apparently.

ziofill 259 days ago [-]

I mean to use their time to pursue their passions and interests, not cleaning up the kitchen or making the bed or doing laundry...

hooverd 259 days ago [-]

Given time to "pursue their passions and interests", most people chose to turn their brain to soup on social media.

KolmogorovComp 259 days ago [-]

but most people think they are better than most people.

259 days ago [-]

plipt 259 days ago [-]

The demo is quite interesting but I am mostly intrigued by the claim that it is running totally local to each robot. It seems to use some agentic decision making but the article doesn't touch on that. What possible combo of model types are they stringing together? Or is this something novel?

The article mentions that the system in each robot uses two ai models.

    S2 is built on a 7B-parameter open-source, open-weight VLM pretrained on internet-scale data

and the other

    S1, an 80M parameter cross-attention encoder-decoder transformer, handles low-level [motor?] control.

It feels like although the article is quite openly technical they are leaving out the secret sauce? So they use an open source VLM to identify the objects on the counter. And another model to generate the mechanical motions of the robot.

What part of this system understands 3 dimensional space of that kitchen?

How does the robot closest to the refrigerator know to pass the cookies to the robot on the left?

How is this kind of speech to text, visual identification, decision making, motor control, multi-robot coordination and navigation of 3d space possible locally?

    Figure robots, each equipped with dual low-power-consumption embedded GPUs

Is anyone skeptical? How much of this is possible vs a staged tech demo to raise funding?

liuliu 259 days ago [-]

It looks pretty obvious (I think):

1. S2 is a 7B VLM, it is responsible for taken in camera streams (from however many of them), run through prompt guided text generation, and before the lm_head (or a few layers leading to it), directly take the latent encoding;

2. S1 is where they collected a few hundreds hours of teleoperating data, retrospectively come up with prompt for 1, then train from the scratch;

Whether S2 finetuned with S1 or not is an open question, at least there is a MLP adapter that is finetuned, but could be the whole 7B VLM is finetuned too.

It looks plausible, but I am still skeptical about the generalization claim given it is all fine-tuned with household tasks. But nowadays, it is really difficult to understand how these models generalize.

tim_ai_robotics 253 days ago [-]

I'm very skeptical. I'm quite familiar with VLAs and this seems like an unbelievable leap forward based on their claims.

bbor 259 days ago [-]

I'm very far from an expert, but:

  What part of this system understands 3 dimensional space of that kitchen?

The visual model "understands" it most readily, I'd say -- like a traditional Waymo CNN "understands" the 3D space of the road. I don't think they've explicitly given the models a pre-generated pointcloud of the space, if that's what you're asking. But maybe I'm misunderstanding?

  How does the robot closest to the refrigerator know to pass the cookies to the robot on the left?

It appears that the robot is being fed plain english instructions, just like any VLM would -- instead of the very common `text+av => text` paradigm (classifiers, perception models, etc), or the less common `text+av => av` paradigm (segmenters, art generators, etc.), this is `text+av => movements`.

Feeding the robots the appropriate instructions at the appropriate time is a higher-level task than is covered by this demo, but I think is pretty clearly doable with existing AI techniques (/a loop).

  How is this kind of speech to text, visual identification, decision making, motor control, multi-robot coordination and navigation of 3d space possible locally?

If your question is "where's the GPUs", their "AI" marketing page[1] pretty clearly implies that compute is offloaded, and that only images and instructions are meaningfully "on board" each robot. I could see this violating the understanding of "totally local" that you mentioned up top, but IMHO those claims are just clarifying that the individual figures aren't controlled as one robot -- even if they ultimately employ the same hardware. Each period (7Hz?) two sets of instructions are generated.

[1] https://www.figure.ai/ai

  What possible combo of model types are they stringing together? Or is this something novel?

Again, I don't work in robotics at all, but have spent quite a while cataloguing all the available foundational models, and I wouldn't describe anything here as "totally novel" on the model level. Certainly impressive, but not, like, a theoretical breakthrough. Would love for an expert to correct me if I'm wrong, tho!

EDIT: Oh and finally:

  Is anyone skeptical? How much of this is possible vs a staged tech demo to raise funding?

Surely they are downplaying the difficulties of getting this setup perfectly, and don't show us how many bad runs it took to get these flawless clips.

They are seeking to raise their valuation from ~$3B to ~$40B this month, sooooooo take that as you will ;)

https://www.reuters.com/technology/artificial-intelligence/r...

plipt 259 days ago [-]

    their "AI" marketing page[1] pretty clearly implies that compute is offloaded

I think that answers most of my questions.

I am also not in robotics, so this demo does seem quite impressive to me but I think they could have been more clear on exactly what technologies they are demonstrating. Overall still very cool.

Thanks for your reply

verytrivial 259 days ago [-]

Are they claiming these robots are also silent? They seem to have "crinkle" sounds handling packaging, which if added in post seems needlessly smoke-and-mirror for what was a very impressive demonstration (of robots impersonating an extreme stoned human.)

bilsbie 259 days ago [-]

This is amazing but it also made me realize I just don’t trust these videos. Is it sped up? How much is preprogrammed?

I now they claim there’s no special coding but did they practice this task? Special training?

Even if this video is totally legit I’m but burned out by all the hype videos in general.

turnsout 259 days ago [-]

They appear to be realtime, based on the robot's movements with the human in the scene. If you believe the article, it's zero shot (no preprogramming, practice or special training).

ge96 259 days ago [-]

they seem slow to me, I was thinking they're slow for safety

259 days ago [-]

aerodog 259 days ago [-]

Interesting timing - same day MSFT releases https://microsoft.github.io/Magma/

_1 259 days ago [-]

Current discussion: https://news.ycombinator.com/item?id=43110265

pr337h4m 259 days ago [-]

Goal 2 has been achieved, at least as a proof of concept (and not by OpenAI): https://openai.com/index/openai-technical-goals/

Symmetry 259 days ago [-]

They can put away clutter but if they could chop a carrot or dust a vase they'd have shown videos demonstrating that sort of capability.

EDIT: Let alone chop an onion. Let me tell you having a robot manipulate onions is the worst. Dealing with loose onion skins is very hard.

j-krieger 259 days ago [-]

Sure. But if you showed this video to someone 5 or 10 years ago, they'd say it's fiction.

Symmetry 258 days ago [-]

Telling a robot verbally "Put the cup on the counter" and having it figure out what the cup is, what the counter is in its field of view would have seemed like science fiction. The object manipulation itself is still well behind what we saw in the 2015 DARPA Grand Challenge though.

squigz 259 days ago [-]

There's something hilarious to me about the idea of chopping onions being a sort of benchmark for robots.

sandis 259 days ago [-]

YouTube link for the video (for whatever reason the video hosted on their site kept buffering for me): https://www.youtube.com/watch?v=Z3yQHYNXPws

ge96 259 days ago [-]

Wonder what their vision stack is like. Depth via sensors or purely visual and the distance estimating of objects and inverse kinematics/proprioception, anyway it looks impressive.

sottol 259 days ago [-]

Imo, the Terminator movies would have been scarier if they moved like these guys - slow, careful, deliberate and measured but unstoppable. There's something uncanny about this.

megous 259 days ago [-]

Unfortunately, there'll be no time travel to save us. That was the lying part of the movie. Other stuff was true.

kla-s 259 days ago [-]

Does anyone know how long they have been at this? Is this mainly a reimplementation of the physical intelligence paper + the dual size/freq + the cooperative part?

pr337h4m 259 days ago [-]

"Over a year" according to the founder: https://x.com/adcock_brett/status/1892578309344502191

bhouston 259 days ago [-]

When doing robot control, how do you model in the control of the robot? Do you have tool_use / function calling at the top level model which then gets turned into motion control parameters via inverse kinematic controllers?

What is the interface from the top level to the motors?

I feel it can not just be a neural network all the way down, right?

imtringued 259 days ago [-]

You don't use function calling. You specifically train the neural network to directly encode the robot action as a token. There are many ways. You can output absolute positions, delta positions, relative trajectory. You can do this in joint space or end effector space.

200Hz is barely enough to control a motor, but it is good enough to send a reference signal to a motor controller. Usually what is done is that you have a neural network to learn complex high level behaviour and use that to produce a high level trajectory, then you have a whole body robot controller based on quadratic programming that does things like balancing, maintaining contacts when holding objects or pressing against things. This requires a model of the robot dynamics so that you know the relationship between torques and acceleration. Then after that you will need a motor controller that accepts reference acceleration/torque, velocity and position commands which then is turned into 10kHz to 100kHz pulse width modulated signals by the motor controller. The motor controller itself is driving MOSFETs so it can only turn them on or off, unless you are using expensive sinusoidal drivers.

Philpax 259 days ago [-]

Have a look at the post - it explains how it works. There are two models: a 7-9Hz 7B vision-language model, and a 200Hz 80M visuomotor model. The former produces a latent vector, which is then interpreted by the latter to drive the motors.

NitpickLawyer 259 days ago [-]

> a 7-9Hz 7B vision-language model, and a 200Hz 80M visuomotor model.

huh. An interesting approach. I wonder if something like this can be used for other things as well, like "computer use" with the same concept of a "large" model handling the goals, and a "small" model handling clicking and stuff, at much higher rates, useful for games and things like that.

whatever1 259 days ago [-]

This is typical in real time applications. A supervisor tries to guess in which region the system is currently and then invokes the correct set of lower level algorithms.

andiareso 259 days ago [-]

Seriously, what's with all of these perceived "high-end" tech companies not doing static content worth a damn.

Stop hosting your videos as MP4s on your web-server. Either publish to a CDN or use a platform like YouTube. Your bandwidth cannot handle serving high resolution MP4s.

/rant

matteocontrini 259 days ago [-]

What do you mean? Videos on that page are served by CloudFront. If you're seeing issues it may be that videos are not encoded for web playback (faststart, etc.) but I haven't checked.

unraveller 259 days ago [-]

https://www.youtube.com/watch?v=Z3yQHYNXPws

the official figure yt release vid

traverseda 259 days ago [-]

"The first time you've seen these objects" is a weird thing to say. One presumes that this is already in their training set, and that these models aren't storing a huge amount of data in their context, so what does that even mean?

jayd16 259 days ago [-]

It probably gives them confidence that they can accurately see a thing even though they don't know what that thing is.

I could also imagine a lot of safety around leaving things outside of the current task alone so you might have to bend over backwards to get new objects worked on.

thomastjeffery 259 days ago [-]

There is no such thing as "thing" here.

These models are trained such that the given conditions (the visual input and the text prompt) will be continued with a desirable continuation (motor function over time).

The only dimension accuracy can apply to is desirability.

jayd16 259 days ago [-]

You don't think there's any segmentation going on?

thomastjeffery 259 days ago [-]

Implicitly, maybe. Does that matter if you don't know where?

ygouzerh 259 days ago [-]

So from what I understand it actually means that they were for example never trained on a video of an apple. Maybe only on a video of bread, pineapple, chocolate.

However, as it was trained using generic text data similarly to a normal LLM, it knows how an apple is supposed to look like.

Similar than a kid that never saw a banana, but his parent described it to him.

Symmetry 259 days ago [-]

It's normal to have a training set and a validation set and I interpreted that to mean that these items weren't in the training set.

swalsh 259 days ago [-]

At this point, this is enough autonomy to have a set of these guys man a howitzer (read as old stockpiles of weapons we already have). Kind of a scary thought. On one hand, I think the idea of moving real people out of danger in war is a good idea, and as an American i'd want Americans to have an edge... and we can't guarantee our enemies won't take it if we skip it, on the other hand I have a visceral reaction to machines killing people.

I think we're at an inflection point now where AI and robotics can be used in warfare, and we need to start having that conversation.

lyu07282 259 days ago [-]

I don't understand we already saw exactly what happens with the emergence of drones and Israel is already using AI to select bombing targets and semi-autonomous turrets. What conversation? What kind of society do you think we are living in?

01100011 259 days ago [-]

We had sufficient AI to make death machines for decades. You don't need fancy LLMs to get a pretty good success rate for targeting.

I have said for years that the only thing keeping us from "stabby the robot" is solving the power problem. If you can keep a drone going for a week, you have a killing machine. Use blades to avoid running out of ammo. Use IR detection to find the jugular. Stab, stab and move on. I'm guessing "traditional" vision algorithms are also sufficient to, say, identify an ethnicity and conduct ethnic cleansing. We are "solving the power problem" away from a new class of WMDs that are accessible to smaller states/groups/individuals.

j-krieger 259 days ago [-]

> We had sufficient AI to make death machines for decades

And we already reached the peek here. Small drones that are cheaply mass produced, fly on SIM cards alone and explode when they reached a target. That's all there is to it. You don't need a gun mounted on a spot or a humanoid robot carrying a gun. Exploding swarms are enough.

charlie0 259 days ago [-]

Black Mirror has the perfect episode for this scenario already.

UltraSane 259 days ago [-]

Except those things would be very easily defeated by a 12 gauge shotgun or a AR-15

j-krieger 248 days ago [-]

You are severely underestimating the speed of drones and their formation capabilities, as well as overestimating your aim under immense danger.

meindnoch 259 days ago [-]

So you're concerned about remote operated howitzers? Autoloaders and remote control land vehicles have existed for 40 or so years by now. If we wanted remote controlled howitzers we could have fielded them already.

int_19h 258 days ago [-]

We already have an ongoing major war (in Ukraine) where both sides are using autonomous AI-driven drones that kill people, at scale, with escalating tit-for-tat advances in lethality. This conversation is being had alright.

imtringued 259 days ago [-]

Panzerhaubitze 2000 already has an autoloader and the entire point of self propelled artillery is that it moves after shooting to avoid counter artillery fire.

Symmetry 259 days ago [-]

They don't look strong enough to pick up a 155mm shell even with both arms - and we haven't seen them pick up something with two arms.

ripped_britches 259 days ago [-]

This would have made a more interesting demo at least

ramenlover 259 days ago [-]

Why do they make “eye contact” after every hand off? Feels oddly forced.

GoatInGrey 259 days ago [-]

Perhaps they're exchanging knowing looks on how stupid they think the demo is. Solidarity between artificial brothers.

pixl97 259 days ago [-]

If you're training robots to interact with humans this is the kind of behavior you'd want. Humans use a ton of nonverbal hints like this to register context.

bear141 259 days ago [-]

This along with the writing style in the description is totally forced anthropomorphizing. It’s creepy.

jimbohn 259 days ago [-]

Gotta hype up the investors somehow

bbor 259 days ago [-]

To focus on something other than the obviously-terrifying nature of this and the skepticism that rightfully entails on our part:

  A fast reactive visuomotor policy that translates the latent semantic representations produced by S2 into precise continuous robot actions at 200 Hz

Why 200Hz...? Any experts in here on robotics? Because to this layman that seems really often to update motor controls.

Animats 259 days ago [-]

"Pick up anything: Figure robots equipped with Helix can now pick up virtually any small household object, including thousands of items they have never encountered before, simply by following natural language prompts."

If they can do that, why aren't they selling picking systems to Amazon by the tens of thousands?

moshun 259 days ago [-]

Most AI startups, like most startups in general, are in the business of selling futures. Much easier to get new seed $$$ once there’s a real hype around your demo. Not saying they aren’t being honest, just pointing out the logic of starting here and working your way up to a huge valuation.

Animats 259 days ago [-]

Right. As I point out occasionally, Tesla, as a car company, is overvalued by an order of magnitude.

If they can find suckers who accept that valuation, it's much easier to exit as a billionaire than actually make it work.

Visualize making it work. You build or buy a robot that has enough operating envelope for an Amazon picking station, provide it with an end-effector, and use this claimed general purpose software to control it. Probably just arms; it doesn't need to move around. Movement is handled by Amazon's Kiva-type AGV units.

You set up a test station with a supply of Amazon products and put it to work. It's measured on the basis of picks per minute, failed picks, and mean time before failure. You spend months to years debugging standard robotics problems such as tendon wear, gripper wear, and products being damaged during picking and placing. Once it's working, Amazon buys some units and puts them to work in real distribution centers. More problems are found and solved.

Now you have a unit that replaces one human, and costs maybe $20,000 to make in quantity. Amazon beats you down in price so you get to sell it for maybe $25,000 in quantity. You have to build manufacturing facilities and service depots. Success is Amazon buying 50,000 of them, for total income of $0.25 billion. This probably becomes profitable about five years from now, if it all works.

By which time someone in China, Japan, or Taiwan is doing it cheaper and better.

ripped_britches 259 days ago [-]

I don’t even think Elizabeth Holmes actually had this mindset. Most entrepreneurs are actually trying to make a business.

Frankion 259 days ago [-]

Perhaps they are in the process of doing so?

Perhaps it's possible to grip them but not to pack them?

bilsbie 259 days ago [-]

I get the impression there’s a language model sending high level commands to a control model? I wonder when we can have one multimodal model that controls everything.

The latest models seemed to be fluidly tied in with generating voice; even singing and laughing.

It seems like it would be possible to train a multimodal that can do that with low level actuator commands.

turnsout 259 days ago [-]

If you read the article, they describe a two-system approach; one "think fast" 80M parameter model running at 200hz to control motion, and one "think slow" 7B parameter model running at ~7-9hz for everything else (scene understanding, language processing, etc).

If that sounds like a cheat, neuroscientists tell us this is how the human brain works.

ripped_britches 259 days ago [-]

This whole thread is just people who didn’t read the technical details or immediately doubt the video’s honesty.

I’m actually fairly impressed with this because it’s one neural net which is the goal, and the two system paradigm is really cool. I don’t know much about robotics but this seems like the right direction.

ianamo 259 days ago [-]

Are we at a point now where Asimov’s laws are programmed into these fellas somewhere?

thomastjeffery 259 days ago [-]

Nope.

The article clearly spells out that it's end to end LLM. Text and video in, motor function out.

Technically, the text model probably has a few copies, but they are nothing more than Asimov's narrative. Laws don't (and can't) exist in a model

exe34 259 days ago [-]

Is there a paper? I think I get how they did their training, but I'd like to understand it more.

Does anyone know if this trained model would work on a different robot at all, or would it need retraining?

the_other 259 days ago [-]

It’s funny… there a lot of comments here asking “why would anyone pay for this, when you could learn to do the thing, or organise your time/plans yourself.”

That’s how I feel about LLMs and code.

259 days ago [-]

kingkulk 259 days ago [-]

Anyone have a link to their paper?

IAmNotACellist 259 days ago [-]

I don't suppose this is open research and I can read about their model architecture?

ein0p 259 days ago [-]

There's no way this is 100% real though. No startup demo ever is.

bilsbie 259 days ago [-]

They should have made them talk. It’s a little dehumanizing otherwise.

butifnot0701 259 days ago [-]

It's kinda eerie how they look at each other after handover

anentropic 259 days ago [-]

Very impressive

Why make such sinister-looking robots though...?

jayd16 259 days ago [-]

With the way they move, they look like stoned teenagers interning at a Bond villain factory. Not to knock the tech but they're scary and silly at the same time.

esafak 259 days ago [-]

Black was not the best color choice.

anentropic 259 days ago [-]

Well, when you put it like that I feel a bit uncomfortable...

But it did seem like title of their mood board must have been "Black Mirror".

Very uncanny valley, the glossy facelessness. It somehow looks neither purely utilitarian/industrial nor 'friendly'. I could see it being based on the aesthetic of laptops and phones, i.e. consumer tech, but the effect is so different when transposed onto a very humanoid form.

kubb 259 days ago [-]

Wow! This is something new.

dr_dshiv 259 days ago [-]

Wake me when robots can make a peanut butter sandwich

abraxas 259 days ago [-]

Is this even reality or CGI? They really should show these things off in less sterile environemtns because this video has a very CGI feel to it.

psb217 259 days ago [-]

Natural, cluttered environments are a lot tougher to deal with. This near future-y minimalist environment has the dual benefits of looking stylish and being much closer to whatever they were able to simulate at scale for training the models.

Loading comments...

porphyra 259 days ago [-]

vessenes 259 days ago [-]

I was reading their site, and I too have some questions about this architecture.

By the way, the dataset they describe was generated by a large (much larger presumably) vision model tasked with creating tasks from successful videos.

So the pipeline is:

* Video of robot doing something

* (o1 or some other high end model) "describe very precisely the task the robot was given"

* o1 output -> 7B model -> small model -> loss

yurimo 259 days ago [-]

falcor84 259 days ago [-]

I suppose the next big milestone is Wozniak's Coffee Test: A robot is to enter a random home and figure out how to make coffee with whatever they have.

UltraSane 259 days ago [-]

That could still be decades away.

kilroy123 259 days ago [-]

I don't know... I'm starting to seriously think that is only 5-10 years away.

rtkwe 258 days ago [-]

Is that a real 5-10 years, a research 5-10 years[0] or 5-10 years of "FSD in the next 6 months"?

[0] https://xkcd.com/678/

ge96 259 days ago [-]

Imagine they bring one out to a construction site and they treat the robot as a new rookie guy, go pick up those pipes. That would be an ultimate on the fly test to me.

ortsa 259 days ago [-]

sayamqazi 259 days ago [-]

I have my personal multimodal benchmark for physical robots.

I have low hopes of this being possibe in the next 20 years. I hope I am still alive to witness if it ever happens.

m0llusk 259 days ago [-]

pick up that can, heh heh heh

causal 259 days ago [-]

I'm always wondering at the safety measures on these things. How much force is in those motors?

This is basically safety-critical stuff but with LLMs. Hallucinating wrong answers in text is bad, hallucinating that your chest is a drawer to pull open is very bad.

silentwanderer 259 days ago [-]

rtkwe 258 days ago [-]

The term I see a lot is co-robotics or corobots. At least that's what Kuka calls them.

Symmetry 258 days ago [-]

UltraSane 259 days ago [-]

You can have dedicated controllers for the motors that limit their max torque.

imtringued 259 days ago [-]

UltraSane 258 days ago [-]

Then limit max velocity.

cess11 259 days ago [-]

Not a big deal on the battlefield.

causal 259 days ago [-]

I'd say a very big deal when munitions and targeting are involved

cess11 257 days ago [-]

Why do you think that?

rizky05 259 days ago [-]

[dead]

mmh0000 259 days ago [-]

The thing in the video moves slower than the sloth in Zootopia. If you die by that robot, you probably deserve it.

throwaway0123_5 259 days ago [-]

mikehollinger 259 days ago [-]

[1] https://www.youtube.com/watch?v=F_7IPm7f1vI

charlie0 259 days ago [-]

Especially if holding a knife or something sharp.

exe34 259 days ago [-]

or if you're old, injured, groggy from medication, distracted by something/someone else, blind, deaf or any number of things.

it's easy to take your able body for granted, but reality comes to meet all of us eventually.

dr_kiszonka 259 days ago [-]

They are designed to penetrate Holtzman shields, surely.

causal 259 days ago [-]

Are you saying it cannot move faster than they because of some kind of governor?

UltraSane 259 days ago [-]

That is how I would design it. It is common in safety critical PLC systems to have 1 or more separate safety PLCs that try to prevent bad things from happening.

idiotsecant 259 days ago [-]

Although in a SIL safety system the dangerous events are identified and extremely thoroughly characterized as part of system design.

There cannot be a safety system of this type for a generalist platform like a humanoid robot. It's possibility space is just too high.

I think the safety governor in this case would have to be a neural network that is at least as complex as the robots network, if not more so.

Which begs the question: what system checks that one for safety?

UltraSane 258 days ago [-]

Limiting max force applied CAN be can be characterized for this robot.

Symmetry 259 days ago [-]

A governor, the firmware in the motor controllers, something like that. Certainly not the neural network though.

Symmetry 259 days ago [-]

wwwtyro 259 days ago [-]

camjw 259 days ago [-]

loudmax 259 days ago [-]

hulahoof 259 days ago [-]

Sounds like what Hololens was designed to solve, more in the AR space than AI though

semi-extrinsic 259 days ago [-]

This is called "watching YouTube tutorials". We've had it for decades.

rolisz 259 days ago [-]

But what if there's no YouTube tutorial for the exact AC unit you have and it doesn't look like any of the videos you checked out?

cess11 259 days ago [-]

Have you met people that seem to be able to fix almost anything?

If you can't get a tutorial on your exact case you learn about the problem domain and intuit from there. Usually it works out if you're careful, unlike software.

semi-extrinsic 259 days ago [-]

Then you are equally fucked as the AI will be, so no difference.

An AI doesn't help with this type of shit. It can't know the unknown.

deepGem 259 days ago [-]

semi-extrinsic 259 days ago [-]

deepGem 259 days ago [-]

This is correct. What I meant to say was that in due course, re-training on the fly will become a norm. Even without on the fly re-training we are looking at a small delta.

__MatrixMan__ 259 days ago [-]

gopher_space 259 days ago [-]

Maybe I’m showing my age, but isn’t this a home ec class?

__MatrixMan__ 259 days ago [-]

I took home ec in 2001. I learned to use a sewing machine, it was great.

mistercheph 259 days ago [-]

[flagged]

luma 259 days ago [-]

> why would anyone pay for this?

Philip-J-Fry 259 days ago [-]

SoftTalker 259 days ago [-]

Yes it's pretty amazing how so many people seem to overcomplicate simple household tasks by introducing unnecessary technology.

bear141 259 days ago [-]

jayd16 259 days ago [-]

Maybe they're imagining more complex tasks like working on an engine.

Philpax 259 days ago [-]

Sounds like what's described in Manna: https://marshallbrain.com/manna1

sho_hn 259 days ago [-]

meowkit 259 days ago [-]

Already playing out, anecdotally to my experience.

Its similar to losing callouses on our hands if you don’t labor/go to the gym.

mistercheph 259 days ago [-]

[flagged]

RedNifre 259 days ago [-]

I fully agree, building something like this is somewhere in my back log.

Heck, even a paper calendar is already kinda this, as in separating the planning phase from the execution phase.

Jarwain 259 days ago [-]

I'm quite slowly working on something like this, but for time.

falcor84 259 days ago [-]

This is almost literally the first chapter in Marshall Brain's "Manna" [0], being the first step towards world-controlling AGI:

[0] https://marshallbrain.com/manna1

__MatrixMan__ 259 days ago [-]

I'd totally use that to clean my garage so that later I can ask it where the heck I put the thing or ask it if I already have something before I buy one...

lynx97 259 days ago [-]

cactusplant7374 259 days ago [-]

Your solution sounds like the worst cognitive load for getting home from the grocery store and wanting it all to be over.

hooverd 259 days ago [-]

You already have one: a brain.

lucianbr 259 days ago [-]

You want to outsource thinking to a computer system and keep manual labor? You do you, but I want the opposite. I want to decide what goes where but have a robot actually put the stuff there.

TeMPOraL 259 days ago [-]

In other words: I'm sorry, but that's how reality turned out. Robots are better at thinking, humans better at laboring. Why fight against nature?

(Just joking... I think.)

RedNifre 259 days ago [-]

I think he means outsourcing everything eventually, but right now, outsourcing the thought process is possible, while outsourcing the manual labor is not.

malux85 259 days ago [-]

This can be done of course, in your statement the phrase “just figure out” is doing a lot more heavy lifting than you allude to

htrp 259 days ago [-]

so the kiva-amazon model?

ziofill 259 days ago [-]

There’s nothing I want more than a robot that does house chores. That’s the real 10x multiplier for humans to do what they do best.

01100011 259 days ago [-]

mistercheph 259 days ago [-]

There are many services that for ~$3/lb will pick up, wash, dry, fold/hang, and deliver 10lb's of laundry every week for $1,500/yr.

259 days ago [-]

imtringued 259 days ago [-]

Laundry folding machines already exist. You can find cheap ones on AliExpress.

https://www.aliexpress.com/w/wholesale-clothes-folding-machi...

mmh0000 259 days ago [-]

It's called a "house cleaner" and they only cost ~$150 (area and all varies) bi-weekly. I'll shit a brick (and then have the robot clean it up) if a robot is ever cheaper than ~$4000/yr.

vessenes 259 days ago [-]

A robot will definitely cost less than $30/hr eventually. But you'll be running it a lot more than a few hours every other week.

ziofill 259 days ago [-]

Yeah, but a robot will work 24/7, not 2h byweekly -_-

mclau156 259 days ago [-]

do you need a robot to work in your house 24/7?

ziofill 259 days ago [-]

Well perhaps not at night, but otherwise there’s always something to clean, something to fix, something to cook, take care of the yard.. heck I might need two robots ^^’

gigel82 259 days ago [-]

Why not? If it's done with all the chores, I can have it make some silly woodworking / art project for Etsy to earn its keep, or just loan it out to neighbors.

ben_w 259 days ago [-]

In the same way it's hard to earn money from using AI to make art, I don't see Etsy projects made by affordable domestic robots selling above cost.

dartos 259 days ago [-]

Hopefully in the next decade we’ll get there.

Vision+language multimodal models seem to solve some of the hard problems.

cess11 259 days ago [-]

To me this is such a weird wish. Why would you not want to care for your home and the people living there? Why would you want to have a slave taking these activities from you?

I'd rather have less waged labour and more time for chores with the family.

int_19h 258 days ago [-]

Because it's boring and tedious. And no, a robot is not a slave.

abraxas 259 days ago [-]

Yeah, except that future doesn't need us. By us I mean those of us who don't have $1B to their name.

I fully anticipate that when these things mature enough we'll see an "accidental" pandemic sweep and kill off 90% of us. At least 90%.

ben_w 259 days ago [-]

ewjt 259 days ago [-]

Oligarchs would use the robots to kill people instead of a pandemic. A virus carries too much risk of affecting the original creators.

Fortunately, robotic capability like that basically becomes the equivalent of Nuclear MAD.

Unfortunately, the virus approach probably looks fantastic to extremist bad actors with visions of an afterlife.

siavosh 259 days ago [-]

What do humans do best?

jayd16 259 days ago [-]

Everything everything else is worse at.

ein0p 259 days ago [-]

Browse Instagram, apparently.

ziofill 259 days ago [-]

I mean to use their time to pursue their passions and interests, not cleaning up the kitchen or making the bed or doing laundry...

hooverd 259 days ago [-]

Given time to "pursue their passions and interests", most people chose to turn their brain to soup on social media.

KolmogorovComp 259 days ago [-]

but most people think they are better than most people.

259 days ago [-]

plipt 259 days ago [-]

The article mentions that the system in each robot uses two ai models.

    S2 is built on a 7B-parameter open-source, open-weight VLM pretrained on internet-scale data

and the other

    S1, an 80M parameter cross-attention encoder-decoder transformer, handles low-level [motor?] control.

What part of this system understands 3 dimensional space of that kitchen?

How does the robot closest to the refrigerator know to pass the cookies to the robot on the left?

How is this kind of speech to text, visual identification, decision making, motor control, multi-robot coordination and navigation of 3d space possible locally?

    Figure robots, each equipped with dual low-power-consumption embedded GPUs

Is anyone skeptical? How much of this is possible vs a staged tech demo to raise funding?

liuliu 259 days ago [-]

It looks pretty obvious (I think):

2. S1 is where they collected a few hundreds hours of teleoperating data, retrospectively come up with prompt for 1, then train from the scratch;

Whether S2 finetuned with S1 or not is an open question, at least there is a MLP adapter that is finetuned, but could be the whole 7B VLM is finetuned too.

tim_ai_robotics 253 days ago [-]

I'm very skeptical. I'm quite familiar with VLAs and this seems like an unbelievable leap forward based on their claims.

bbor 259 days ago [-]

I'm very far from an expert, but:

  What part of this system understands 3 dimensional space of that kitchen?

  How does the robot closest to the refrigerator know to pass the cookies to the robot on the left?

  How is this kind of speech to text, visual identification, decision making, motor control, multi-robot coordination and navigation of 3d space possible locally?

[1] https://www.figure.ai/ai

  What possible combo of model types are they stringing together? Or is this something novel?

EDIT: Oh and finally:

  Is anyone skeptical? How much of this is possible vs a staged tech demo to raise funding?

Surely they are downplaying the difficulties of getting this setup perfectly, and don't show us how many bad runs it took to get these flawless clips.

They are seeking to raise their valuation from ~$3B to ~$40B this month, sooooooo take that as you will ;)

https://www.reuters.com/technology/artificial-intelligence/r...

plipt 259 days ago [-]

    their "AI" marketing page[1] pretty clearly implies that compute is offloaded

I think that answers most of my questions.

I am also not in robotics, so this demo does seem quite impressive to me but I think they could have been more clear on exactly what technologies they are demonstrating. Overall still very cool.

Thanks for your reply

verytrivial 259 days ago [-]

bilsbie 259 days ago [-]

This is amazing but it also made me realize I just don’t trust these videos. Is it sped up? How much is preprogrammed?

I now they claim there’s no special coding but did they practice this task? Special training?

Even if this video is totally legit I’m but burned out by all the hype videos in general.

turnsout 259 days ago [-]

They appear to be realtime, based on the robot's movements with the human in the scene. If you believe the article, it's zero shot (no preprogramming, practice or special training).

ge96 259 days ago [-]

they seem slow to me, I was thinking they're slow for safety

259 days ago [-]

aerodog 259 days ago [-]

Interesting timing - same day MSFT releases https://microsoft.github.io/Magma/

_1 259 days ago [-]

Current discussion: https://news.ycombinator.com/item?id=43110265

pr337h4m 259 days ago [-]

Goal 2 has been achieved, at least as a proof of concept (and not by OpenAI): https://openai.com/index/openai-technical-goals/

Symmetry 259 days ago [-]

They can put away clutter but if they could chop a carrot or dust a vase they'd have shown videos demonstrating that sort of capability.

EDIT: Let alone chop an onion. Let me tell you having a robot manipulate onions is the worst. Dealing with loose onion skins is very hard.

j-krieger 259 days ago [-]

Sure. But if you showed this video to someone 5 or 10 years ago, they'd say it's fiction.

Symmetry 258 days ago [-]

squigz 259 days ago [-]

There's something hilarious to me about the idea of chopping onions being a sort of benchmark for robots.

sandis 259 days ago [-]

YouTube link for the video (for whatever reason the video hosted on their site kept buffering for me): https://www.youtube.com/watch?v=Z3yQHYNXPws

ge96 259 days ago [-]

Wonder what their vision stack is like. Depth via sensors or purely visual and the distance estimating of objects and inverse kinematics/proprioception, anyway it looks impressive.

sottol 259 days ago [-]

Imo, the Terminator movies would have been scarier if they moved like these guys - slow, careful, deliberate and measured but unstoppable. There's something uncanny about this.

megous 259 days ago [-]

Unfortunately, there'll be no time travel to save us. That was the lying part of the movie. Other stuff was true.

kla-s 259 days ago [-]

Does anyone know how long they have been at this? Is this mainly a reimplementation of the physical intelligence paper + the dual size/freq + the cooperative part?

pr337h4m 259 days ago [-]

"Over a year" according to the founder: https://x.com/adcock_brett/status/1892578309344502191

bhouston 259 days ago [-]

What is the interface from the top level to the motors?

I feel it can not just be a neural network all the way down, right?

imtringued 259 days ago [-]

Philpax 259 days ago [-]

NitpickLawyer 259 days ago [-]

> a 7-9Hz 7B vision-language model, and a 200Hz 80M visuomotor model.

whatever1 259 days ago [-]

This is typical in real time applications. A supervisor tries to guess in which region the system is currently and then invokes the correct set of lower level algorithms.

andiareso 259 days ago [-]

Seriously, what's with all of these perceived "high-end" tech companies not doing static content worth a damn.

Stop hosting your videos as MP4s on your web-server. Either publish to a CDN or use a platform like YouTube. Your bandwidth cannot handle serving high resolution MP4s.

/rant

matteocontrini 259 days ago [-]

What do you mean? Videos on that page are served by CloudFront. If you're seeing issues it may be that videos are not encoded for web playback (faststart, etc.) but I haven't checked.

unraveller 259 days ago [-]

https://www.youtube.com/watch?v=Z3yQHYNXPws

the official figure yt release vid

traverseda 259 days ago [-]

jayd16 259 days ago [-]

It probably gives them confidence that they can accurately see a thing even though they don't know what that thing is.

I could also imagine a lot of safety around leaving things outside of the current task alone so you might have to bend over backwards to get new objects worked on.

thomastjeffery 259 days ago [-]

There is no such thing as "thing" here.

These models are trained such that the given conditions (the visual input and the text prompt) will be continued with a desirable continuation (motor function over time).

The only dimension accuracy can apply to is desirability.

jayd16 259 days ago [-]

You don't think there's any segmentation going on?

thomastjeffery 259 days ago [-]

Implicitly, maybe. Does that matter if you don't know where?

ygouzerh 259 days ago [-]

So from what I understand it actually means that they were for example never trained on a video of an apple. Maybe only on a video of bread, pineapple, chocolate.

However, as it was trained using generic text data similarly to a normal LLM, it knows how an apple is supposed to look like.

Similar than a kid that never saw a banana, but his parent described it to him.

Symmetry 259 days ago [-]

It's normal to have a training set and a validation set and I interpreted that to mean that these items weren't in the training set.

swalsh 259 days ago [-]

I think we're at an inflection point now where AI and robotics can be used in warfare, and we need to start having that conversation.

lyu07282 259 days ago [-]

01100011 259 days ago [-]

We had sufficient AI to make death machines for decades. You don't need fancy LLMs to get a pretty good success rate for targeting.

j-krieger 259 days ago [-]

> We had sufficient AI to make death machines for decades

charlie0 259 days ago [-]

Black Mirror has the perfect episode for this scenario already.

UltraSane 259 days ago [-]

Except those things would be very easily defeated by a 12 gauge shotgun or a AR-15

j-krieger 248 days ago [-]

You are severely underestimating the speed of drones and their formation capabilities, as well as overestimating your aim under immense danger.

meindnoch 259 days ago [-]

int_19h 258 days ago [-]

imtringued 259 days ago [-]

Panzerhaubitze 2000 already has an autoloader and the entire point of self propelled artillery is that it moves after shooting to avoid counter artillery fire.

Symmetry 259 days ago [-]

They don't look strong enough to pick up a 155mm shell even with both arms - and we haven't seen them pick up something with two arms.

ripped_britches 259 days ago [-]

This would have made a more interesting demo at least

ramenlover 259 days ago [-]

Why do they make “eye contact” after every hand off? Feels oddly forced.

GoatInGrey 259 days ago [-]

Perhaps they're exchanging knowing looks on how stupid they think the demo is. Solidarity between artificial brothers.

pixl97 259 days ago [-]

If you're training robots to interact with humans this is the kind of behavior you'd want. Humans use a ton of nonverbal hints like this to register context.

bear141 259 days ago [-]

This along with the writing style in the description is totally forced anthropomorphizing. It’s creepy.

jimbohn 259 days ago [-]

Gotta hype up the investors somehow

bbor 259 days ago [-]

To focus on something other than the obviously-terrifying nature of this and the skepticism that rightfully entails on our part:

  A fast reactive visuomotor policy that translates the latent semantic representations produced by S2 into precise continuous robot actions at 200 Hz

Why 200Hz...? Any experts in here on robotics? Because to this layman that seems really often to update motor controls.

Animats 259 days ago [-]

If they can do that, why aren't they selling picking systems to Amazon by the tens of thousands?

moshun 259 days ago [-]

Animats 259 days ago [-]

Right. As I point out occasionally, Tesla, as a car company, is overvalued by an order of magnitude.

If they can find suckers who accept that valuation, it's much easier to exit as a billionaire than actually make it work.

By which time someone in China, Japan, or Taiwan is doing it cheaper and better.

ripped_britches 259 days ago [-]

I don’t even think Elizabeth Holmes actually had this mindset. Most entrepreneurs are actually trying to make a business.

Frankion 259 days ago [-]

Perhaps they are in the process of doing so?

Perhaps it's possible to grip them but not to pack them?

bilsbie 259 days ago [-]

I get the impression there’s a language model sending high level commands to a control model? I wonder when we can have one multimodal model that controls everything.

The latest models seemed to be fluidly tied in with generating voice; even singing and laughing.

It seems like it would be possible to train a multimodal that can do that with low level actuator commands.

turnsout 259 days ago [-]

If that sounds like a cheat, neuroscientists tell us this is how the human brain works.

ripped_britches 259 days ago [-]

This whole thread is just people who didn’t read the technical details or immediately doubt the video’s honesty.

ianamo 259 days ago [-]

Are we at a point now where Asimov’s laws are programmed into these fellas somewhere?

thomastjeffery 259 days ago [-]

Nope.

The article clearly spells out that it's end to end LLM. Text and video in, motor function out.

Technically, the text model probably has a few copies, but they are nothing more than Asimov's narrative. Laws don't (and can't) exist in a model

exe34 259 days ago [-]

Is there a paper? I think I get how they did their training, but I'd like to understand it more.

Does anyone know if this trained model would work on a different robot at all, or would it need retraining?

the_other 259 days ago [-]

It’s funny… there a lot of comments here asking “why would anyone pay for this, when you could learn to do the thing, or organise your time/plans yourself.”

That’s how I feel about LLMs and code.

259 days ago [-]

kingkulk 259 days ago [-]

Anyone have a link to their paper?

IAmNotACellist 259 days ago [-]

I don't suppose this is open research and I can read about their model architecture?

ein0p 259 days ago [-]

There's no way this is 100% real though. No startup demo ever is.

bilsbie 259 days ago [-]

They should have made them talk. It’s a little dehumanizing otherwise.

butifnot0701 259 days ago [-]

It's kinda eerie how they look at each other after handover

anentropic 259 days ago [-]

Very impressive

Why make such sinister-looking robots though...?

jayd16 259 days ago [-]

With the way they move, they look like stoned teenagers interning at a Bond villain factory. Not to knock the tech but they're scary and silly at the same time.

esafak 259 days ago [-]

Black was not the best color choice.

anentropic 259 days ago [-]

Well, when you put it like that I feel a bit uncomfortable...

But it did seem like title of their mood board must have been "Black Mirror".

kubb 259 days ago [-]

Wow! This is something new.

dr_dshiv 259 days ago [-]

Wake me when robots can make a peanut butter sandwich

abraxas 259 days ago [-]

Is this even reality or CGI? They really should show these things off in less sterile environemtns because this video has a very CGI feel to it.

psb217 259 days ago [-]