▲Flash-KMeans: Fast and Memory-Efficient Exact K-Meansarxiv.org

168 points by matt_d 4 days ago | 14 comments

frakt0x90 12 hours ago [-]

They created this in service of their video generation model which "clusters and reorders tokens based on semantic similarity using k-means.":

http://arxiv.org/pdf/2505.18875

smusamashah 8 hours ago [-]

Project website https://svg-project.github.io/

ghm2199 3 hours ago [-]

Nice. I guess this must have existed in some form for GPUs before this, and that this is just better with the latest software stack?

This could be generally useful for a lot of algorithms like kmeans(ish) like Cut algorithm that projects points into eigen space — as opposed to the kernel space in Kmeans — need the affinity matrix(distance for K points).

leecarraher 11 hours ago [-]

Do they mean deterministic k-means, k-means++ ... ? Global optimal k-means is NP-Hard, so linear speedups aren't terribly helpful. It's nice, until you add more input. Standard k-means would be nice, or the k-means++ seed algorithm.

jmalicki 11 hours ago [-]

Kmeans++ is just a seed, this is the inner loop.

Also analogous to flash attention, a linear speedup in big O sense based on the typical algorithmoc complexity computing model can be a polynomial speedup in measured wall clock time due to memory hierarchy differences.

Still small compared to exponential differences, but for an NP-Hard problem, a linear 100x speedup is the difference between practically computable vs. not. There are a ton of things I'd wait 2 hours for that I wouldn't wait a week for.

n4r9 6 hours ago [-]

The abstract suggests they're proposing speed-up techniques for the assignment and centroid update stages of the classic k-means algorithm. Which would therefore also apply to k-means++.

wood_spirit 14 hours ago [-]

Does this have corresponding speed ups or memory gains for normal CPUs too? Just thinking about all the cups of coffee that have been made and drunk while scikit-learn kmeans chugs through a notebook :)

snovv_crash 14 hours ago [-]

For CPU with bigger K you would put the centroids in a search tree, so take advantage of the sparsity, while a GPU would calculate the full NxK distance matrix. So from my understanding the bottleneck they are fixing doesn't show up on CPU.

xavxav 13 hours ago [-]

search trees tend not to scale well to higher dimensions though, right?

from what I've seen I had the impression that Yinyang k-means was the best way to take advantage of the sparsity.

snovv_crash 8 hours ago [-]

Most data I've used is for geospatial with D<=4 (xyzt) so for me search trees worked great. But for things like descriptor or embedding clustering yes, trees wouldn't be useful.

openclaw01 13 hours ago [-]

[dead]

jacquesm 10 hours ago [-]

Nice one. K-Means is one of those neat little powertools that once you get the hang of it you find more and more applications for, but it can be a bit slow for larger data sets. So this is very nice to have, thank you matt_d for posting.

QubridAI 4 hours ago [-]

Exact K-Means but actually practical—faster, leaner, and finally scalable without approximation trade-offs.

matrix2596 14 hours ago [-]

looks like flash attention concepts applied to kmeans, nice speedup results

westurner 6 hours ago [-]

ScholarlyArticle: "Flash-KMeans: Fast and Memory-Efficient Exact K-Means" (2026) https://arxiv.org/abs/2603.09229 .. gscholar: https://scholar.google.com/scholar?hl=en&as_sdt=0%2C43&q=Fla... :

> Abstract: [...] Flash-kmeans introduces two core kernel-level innovations: (1) FlashAssign, which fuses distance computation with an online argmin to completely bypass intermediate memory materialization;

> (2) sort-inverse update, which explicitly constructs an inverse mapping to transform high-contention atomic scatters into high-bandwidth, segment-level localized reductions.

> Furthermore, we integrate algorithm-system co-designs, including chunked-stream overlap and cache-aware compile heuristics, to ensure practical deployability.

> [...] flash-kmeans achieves up to 17.9X end-to-end speedup over best baselines, while outperforming industry-standard libraries like cuML and FAISS by 33X and over 200X, respectively.

k-means clustering > Algorithms > Variations: https://en.wikipedia.org/wiki/K-means_clustering#Variations

maiconburn 12 hours ago [-]

[dead]

Loading comments...

frakt0x90 12 hours ago [-]

They created this in service of their video generation model which "clusters and reorders tokens based on semantic similarity using k-means.":

http://arxiv.org/pdf/2505.18875

smusamashah 8 hours ago [-]

Project website https://svg-project.github.io/

ghm2199 3 hours ago [-]

Nice. I guess this must have existed in some form for GPUs before this, and that this is just better with the latest software stack?

leecarraher 11 hours ago [-]

jmalicki 11 hours ago [-]

Kmeans++ is just a seed, this is the inner loop.

n4r9 6 hours ago [-]

The abstract suggests they're proposing speed-up techniques for the assignment and centroid update stages of the classic k-means algorithm. Which would therefore also apply to k-means++.

wood_spirit 14 hours ago [-]

snovv_crash 14 hours ago [-]

xavxav 13 hours ago [-]

search trees tend not to scale well to higher dimensions though, right?

from what I've seen I had the impression that Yinyang k-means was the best way to take advantage of the sparsity.

snovv_crash 8 hours ago [-]

Most data I've used is for geospatial with D<=4 (xyzt) so for me search trees worked great. But for things like descriptor or embedding clustering yes, trees wouldn't be useful.

openclaw01 13 hours ago [-]

[dead]

jacquesm 10 hours ago [-]

QubridAI 4 hours ago [-]

Exact K-Means but actually practical—faster, leaner, and finally scalable without approximation trade-offs.

matrix2596 14 hours ago [-]

looks like flash attention concepts applied to kmeans, nice speedup results

westurner 6 hours ago [-]

ScholarlyArticle: "Flash-KMeans: Fast and Memory-Efficient Exact K-Means" (2026) https://arxiv.org/abs/2603.09229 .. gscholar: https://scholar.google.com/scholar?hl=en&as_sdt=0%2C43&q=Fla... :

> (2) sort-inverse update, which explicitly constructs an inverse mapping to transform high-contention atomic scatters into high-bandwidth, segment-level localized reductions.

> Furthermore, we integrate algorithm-system co-designs, including chunked-stream overlap and cache-aware compile heuristics, to ensure practical deployability.

> [...] flash-kmeans achieves up to 17.9X end-to-end speedup over best baselines, while outperforming industry-standard libraries like cuML and FAISS by 33X and over 200X, respectively.

k-means clustering > Algorithms > Variations: https://en.wikipedia.org/wiki/K-means_clustering#Variations

maiconburn 12 hours ago [-]

[dead]