Running Voxtral-Mini-3B-2507 on GPU requires ~9.5 GB of GPU RAM in bf16 or fp16.
Running Voxtral-Small-24B-2507 on GPU requires ~55 GB of GPU RAM in bf16 or fp16.
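A rough back-of-the-envelope check of where those figures come from (a minimal sketch, not from the model card: I'm assuming the 3B/24B counts are the decoder sizes and that the gap to the quoted totals covers the audio encoder, KV cache, and activations):

    # bf16/fp16 store each parameter in 2 bytes, so the decoder weights
    # alone need roughly params * 2 bytes; the quoted totals also have to
    # cover the audio encoder, KV cache, and activation memory.
    def weights_gb(params_billion, bytes_per_param=2):
        return params_billion * 1e9 * bytes_per_param / 1e9

    print(f"Voxtral-Mini  (3B decoder):  {weights_gb(3):.0f} GB weights vs ~9.5 GB quoted")
    print(f"Voxtral-Small (24B decoder): {weights_gb(24):.0f} GB weights vs ~55 GB quoted")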
kamranjon 14 hours ago [-]
I'm pretty excited to play around with this. I've worked with Whisper quite a bit, and it's awesome to have another model in the same class, from Mistral, who tend to be very open. I'm sure unsloth is already working on some GGUF quants - will probably spin it up tomorrow and try it on some audio.
ipsum2 19 hours ago [-]
24B is crazy expensive for speech transcription. Conspicuously, there's no comparison with Parakeet, a 600M-param model that's currently dominating leaderboards (but only for English).
azinman2 12 hours ago [-]
But it also includes world knowledge, can do tool calls, etc. It’s an omnimodel
sheerun 15 hours ago [-]
In the demo they mention Polish pronunciation is pretty bad, spoken as if it were the second language of a native English speaker. I wonder if it's the same for other languages. On the other hand, whispering in English is hilariously good, especially with different emotions.
Raed667 12 hours ago [-]
It is insane how good the "French man speaking English" demo is. It captures a lot of subtleties
danelski 1 day ago [-]
They claim to undercut competitors of similar quality by half for both models, yet they released both as Apache 2.0 instead of following the smaller-open, larger-closed strategy used for their last releases.
What's different here?
halJordan 19 hours ago [-]
They didn't release a Voxtral Large, so your question doesn't really make sense.
danelski 11 hours ago [-]
It's about what their top offering is at the moment, not about having "Large" in the name. Mistral Medium 3 is notably not Mistral Large 3, yet it was released as API-only.
wmf 19 hours ago [-]
They're working on a bunch of features so maybe those will be closed. I guess they're feeling generous on the base model.
Havoc 19 hours ago [-]
Probably not looking to compete directly in the transcription space.
lostmsu 22 hours ago [-]
My Whisper v3 Large Turbo setup runs at $0.001/min, so their price comparison is not exactly accurate.
ImageXav 22 hours ago [-]
How did you achieve that? I was looking into it and $0.006/min is quoted everywhere.