So the idea is that you'll eventually arrange n elements into ideally about √n sorted lists, organized so that the endpoints of these lists are also ordered (the "rainbow" shape). A new element goes on the end of one of those lists or a new list; which one is generally decided by binary search, but the index is saved to speed up the next insertion if it's the same. So a natural run might benefit by having adjacent values inserted into the same list.
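To make the insertion step concrete, here's a rough sketch of the cached-index part as I read it (names and details are mine, not the paper's): the list endpoints sit in one sorted array, and the slot found for the previous element is rechecked before falling back to a binary search.

    from bisect import bisect_left

    def find_slot(endpoints, x, hint):
        # Return a slot i such that endpoints[i-1] <= x <= endpoints[i]
        # (treating the ends as -inf/+inf). Try the cached slot from the
        # previous insertion first; consecutive values from a natural run
        # usually hit it and skip the binary search entirely.
        n = len(endpoints)
        if 0 <= hint <= n:
            lo_ok = hint == 0 or endpoints[hint - 1] <= x
            hi_ok = hint == n or x <= endpoints[hint]
            if lo_ok and hi_ok:
                return hint
        return bisect_left(endpoints, x)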
Which cases benefit and which don't seems potentially quite erratic. If you have a series of runs that shrink over the course of the array, with increasing start value and decreasing end value, each will become a new sorted list and you end up with O(n) insertion plus a perfect merge. If the runs get wider instead, they'll cross over existing endpoints and end up scattered around before merging.
It feels like the paper keeps trying to spin constant-factor improvements as somehow being asymptotically relevant despite acknowledging that mathematically they aren't, for example the fact that log(√n) is only log(n)/2 "would not reflect the sublinear nature of base array length k" and runs hitting the best case "approximately 50% of the time... reinforces JesseSort's average runtime complexity as somewhere between O(n) and O(n log_2 n)". Powersort provably takes O(n + n log r) worst-case time where r is the number of runs; I expect this algorithm is O(n log n) worst- and average-case.
However, on random arrays the given benchmarks do improve relative to Python default as the length increases, with faster performance starting a little under a million elements. There isn't a lot of investigation into why this is. I'd be interested to know if the number of comparisons is reduced in particular. Perhaps this algorithm is more cache-friendly with a random input, but this wouldn't be reliable as many inputs can cause the number of sorted lists to scale with O(n) so that the binary searches need to work on a large array.
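Counting comparisons for the builtin sort is at least easy to do from the outside; a toy harness like this (mine, nothing to do with the paper) gives the per-element count for whatever input distribution you want to test:

    import random

    class Counted:
        # Wrapper that counts how many times the sort compares elements;
        # list.sort()/sorted() only ever call "<".
        count = 0
        __slots__ = ("v",)
        def __init__(self, v):
            self.v = v
        def __lt__(self, other):
            Counted.count += 1
            return self.v < other.v

    n = 10**5
    sorted(Counted(random.random()) for _ in range(n))
    print("comparisons per element:", Counted.count / n)

Getting the same number out of jessesort would need a counter inside the Cython code, though, since (as noted below) it casts its input to ints.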
mlochbaum 28 days ago [-]
The analysis would make a lot more sense if it dropped every big O and computed an expected number of comparisons. Of course, the real issue is that the whole thing relies on an empirically-determined length of √8√n for the number of lists in the rainbow, when the theoretical justification isn't there and clearly the worst case is n/2. I'm not seeing any awareness that the expected number of comparisons has to be at least log_2(n!), and am starting to see some worrying signs that the author thinks averages like n log log n are possible. I'd initially assumed that "somewhere between O(n) and O(n log_2 n)" statement meant in the presence of asymptotically large natural runs, but there's a claim that the runs "need not be of any minimum size" and another that the algorithm can "capitalize on natural runs" that seems to be referring to the benchmarks on "purely random values".
mlochbaum 28 days ago [-]
Ah, the README prior to commit b98f724 does claim O(n log(1.5 ln(n))) asymptotic runtime. This is based on a flawed analysis assuming rainbow endpoints will be evenly distributed, which the paper presents before describing the problem that each inserted element pushes one of those endpoints away from the middle. Bit of a perpetual motion machine search: initial results showed the promise of this approach, some details got in the way but surely there's some trick to work around them...
Still there's the benchmark. I remembered timsort.txt shows comparison counts relative to the log_2(n!) bound and it's only like 1% worse, so the >30% improvement over Timsort at the high end has to come from some other factor. I'm thinking it is actually cache effects—not because the algorithm has inherently better cache properties, as it's just a merge sort at large scales, but because it copies the array to ints internally rather than using a generic comparison. That would be a poor result, as good comparison sorts compiled for 4-byte data are many times faster than Timsort.
tim-peters 27 days ago [-]
I want to thank you for your analysis! I'm the "timsort" guy, and I'm asked to look at all sorts of things. The devil is in the details, and I've given up bothering to look unless a paper spells out sufficient details up front. Else it's a pit I'd rather not get sucked into.
As general things, timsort was aimed at exploiting all kinds of pre-existing order, not random lists. The only goal for the latter was to be within reach of CPython's previous highly tuned samplesort implementation (like quicksort on steroids, with much less data movement than mergesorts, but more comparisons).
As you say, for randomly ordered input it gets remarkably close to the information-theoretic lower bound on # of compares. So that's not it.
"Cache effects", maybe, but that's a pit to dig into. I'll note that while the newer "powersort" merge strategy is elegant and provably near-optimal by some relevant measures, in practice it doesn't really appear to run any faster (although cases can be _contrived_ that make it - or the older method! - run faster).
Comparison specialization (to ints) could very well account for it. CPython's general comparison machinery is _very_ expensive, even for what turn out to be native machine ints (which CPython cannot know at compile-time: everything is deduced at runtime).
CPython has since grown new machinery to do a pre-pass over the list, and do a form of cheaper comparison specialization for what turn out to be homogeneous lists of suitable types (including "small enough" ints). That alone gives enough speedup to make some of the original choices sub-optimal. For example, binary insertion sort for short runs is no longer best then. That was aimed at minimizing worst-case # of compares, but as comparisons get cheaper that has less value.
There are also surprises in everything. For example, the worst case for binary insertion sort on most machines was _not_ reverse-ordered input, despite that it requires the most data movement. Instead the worst case was randomly ordered data. Why? Branch prediction. In randomly ordered data, each branch is unpredictable. In reverse-ordered data, "move to the left" is always the test outcome.
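For anyone following along, this is the shape of the binary insertion sort in question (Python standing in for CPython's C code): the comparisons all happen inside the binary search, which is where the unpredictable branch lives on random data, while the data movement is just the block shift.

    from bisect import bisect_right

    def binary_insertion_sort(a):
        for i in range(1, len(a)):
            x = a[i]
            # comparisons happen here, inside the binary search
            pos = bisect_right(a, x, 0, i)
            # data movement happens here: shift the tail right by one slot
            a[pos + 1:i + 1] = a[pos:i]
            a[pos] = x
        return a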
Another generality is that Python's sort cares more about speed for shorter lists than huge ones. "Big data" problems are more likely to use, e.g., extensions geared to "big data problems", like numpy for giant arrays of floats. In those contexts more suitable _algorithms_ exist, like radix sorts, or multi-key quicksorts for lists of long strings with long shared prefixes.
Python's niche is more in, e.g., web services, where a great many sorts of shorter lists are common.
Bottom line: there is no one "best" sorting algorithm. I keep an eye out for newer developments, but haven't seen anything better yet for what Python is aiming at. I was, e.g., very impressed by pdqsort - but it's not a stable sort, and that alone makes it a non-starter for general Python use.
mlochbaum 26 days ago [-]
Whoa, hey Tim! Very much agree on no "best" sorting algorithm. The generic-comparison case is one I keep idly considering, although more focused on long lists than short ones. There's been a lot of improvement lately in sorting small datatypes (arguably kicked off by fluxsort), but really, who has so many 4-byte ints laying around that sorting them is a bottleneck? A point Orson Peters made when presenting glidesort[0] was that lists with many repeated duplicates are common even in generic data, as in ordering customers by city, and quicksort seems to be the best way to deal with such a list. However glidesort/driftsort is still largely oriented towards smaller types, and its mergesort-quicksort hybridization can pretty easily make bad choices, as I described at [1] and especially [2] (slightly surprised I didn't mention binary insertion sort in that first section). So this approach feels like it's still immature to me.
You're much more up to date on the details of recent developments than I am. It's faded into a background interest (although a persistent one) for me.
[0] https://archive.fosdem.org/2023/schedule/event/rust_glidesor...
[1] https://mlochbaum.github.io/BQN/implementation/primitive/sor...
[2] https://mlochbaum.github.io/BQN/implementation/primitive/sor...
tim-peters 26 days ago [-]
One thing that wasn't clear to me about JesseSort: in what way(s) are Rainbows believed to be an improvement over the simpler scheme used by the older "patience sorting"? They both maintain a collection of sorted sublists merged at the end, both use binary search to find the right sublist to add "the next" array element to, and both excel at "low diversity" inputs (if there are K distinct values, patience sorting will build at most K sublists).
As I recall, full-blown patience sorting isn't "naturally stable". Searching through the JesseSort paper, I didn't find it addressed, and the details were too fuzzy to guess offhand. While this is in no way a principled objection, it just _seemed_ "too complicated" to me. Then again, my intuition for such things has served me well before ;-)
mlochbaum 26 days ago [-]
I'd forgotten about patience sorting! Not that I was ever exactly familiar with it. JesseSort seems nearly identical, except that it allows insertion to either side of the lists, tries the previous index before doing a binary search, and uses pairwise instead of k-way merge at the end. The expected Θ(√n) list count is the same as well. An obvious advantage of going double-ended is that you get automatic O(n) sorting if the input's an ascending or descending run, instead of just one of those.
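For reference, single-ended patience sorting fits in a few lines (my own sketch): pile tops stay sorted, so placement is a binary search, and the piles are k-way merged at the end.

    import bisect
    from heapq import merge

    def patience_sort(seq):
        piles = []   # each pile is non-increasing from bottom to top
        tops = []    # tops[i] == piles[i][-1]; stays sorted ascending
        for x in seq:
            # leftmost pile whose top is >= x, else start a new pile
            i = bisect.bisect_left(tops, x)
            if i == len(piles):
                piles.append([x])
                tops.append(x)
            else:
                piles[i].append(x)
                tops[i] = x
        # read each pile top-to-bottom (ascending) and k-way merge
        return list(merge(*(reversed(p) for p in piles)))

Note the asymmetry: a descending input lands entirely on one pile, while an ascending input starts a new pile for every element, which is exactly what the double-ended scheme fixes.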
I think both algorithms can be made stable; if there's a barrier for patience sort it would be that the merge phase pops off the end of the sorted lists so it's going in the opposite direction as insertion? For JesseSort, an element that's equal to a previous one can only be inserted to the same list, or a more central one (created later). So you need to merge sublists giving priority to the outer ones, and also never append an element to the bottom of a sublist if it's equal to that list's current minimum. Doesn't look like JesseSort as implemented in jessesort_c.pyx does either of these: it merges non-adjacent lists and has a backwards merge by virtue of using < instead of <=, and always inserts to the innermost list possible instead of being strict about the bottom side. Which is understandable as being strict has a major downside. If you get a string of equal values in the bottom half of the range, you'd be stuck putting each one in a new sublist until you get to the midpoint and can create a new one where these values start going to the top.
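Here's the tie-breaking I have in mind for the merge phase, as a sketch (my code, and a k-way merge for brevity rather than JesseSort's pairwise merges): on equal keys the earlier-created, more outer sublist has to win.

    from heapq import heapify, heappop, heappush

    def merge_outer_first(sublists):
        # heap entries are (value, sublist index, position); on equal values
        # the smaller sublist index -- the outer, earlier-created list --
        # wins, which is what the stability argument above needs
        heap = [(lst[0], i, 0) for i, lst in enumerate(sublists) if lst]
        heapify(heap)
        out = []
        while heap:
            val, i, j = heappop(heap)
            out.append(val)
            if j + 1 < len(sublists[i]):
                heappush(heap, (sublists[i][j + 1], i, j + 1))
        return out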
tim-peters 26 days ago [-]
Ya, I'm just too old for bottomless pits anymore ;-) Patience sorting is still worth study, for its elegance, simple code, and real-world practicality in the context of solving the longest increasing subsequence problem. As a full-blown sorting algorithm, it only excels at reverse-ordered and "low diversity" inputs, but that's in return for very simple, uniform, and easily understood code.
Of course any of these could be changed to use the powersort merge strategy, if desired. It's elegant and principled. But it also exploits that timsort and powersort keep the _number_ of runs simultaneously active to no more than log2(N). Needing to keep track of potentially O(N) sublists simultaneously is potentially burdensome.
In any case, I'm most interested in adaptive sorting, since partial order so very often exists in the real world (as you noted that Orson Peters noted, this is often due to "low diversity" as a consequence of sorting on one field of a complex record - but it's more than just that). Thanks to your good explanations, I'm persuaded that JesseSort isn't much more promising than patience sorting in that specific respect (yes, looks like JesseSort can handle ascending order in linear time too). So thanks again!
zahlman 28 days ago [-]
>This is based on a flawed analysis assuming rainbow endpoints will be evenly distributed
I guess you know this, but we don't need to do any complicated analysis to disprove such a big-O for a comparison-based sort. It's a well known information-theory result: each comparison can remove at most O(1) entropy, while the initial state has O(n lg n) entropy because that's how log(N!) grows (https://en.wikipedia.org/wiki/Stirling%27s_approximation).
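To put a number on it (my arithmetic, using Stirling via lgamma): the floor is about 21.8 comparisons per element at n = 10^7, which also puts the ~24-comparison figure mentioned elsewhere in the thread in context.

    import math

    def log2_factorial(n):
        # log2(n!) without computing n!, via the log-gamma function
        return math.lgamma(n + 1) / math.log(2)

    n = 10**7
    print(log2_factorial(n) / n)   # ~21.8 comparisons per element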
drpixie 28 days ago [-]
If you're going to make a big claim about sort speed, tell me how speed is better/worse for various data. How do the algorithms compare when the data is already ordered, when it's almost (but not quite) already ordered, when it's largely ordered, when it's completely random, and when it's in the opposite order? This stuff, as well as the size of the dataset, is what we need to know in practice.
tialaramex 28 days ago [-]
Rather than, or at least in addition to, raw measured speed on a specific piece of hardware, which is often affected in hard to understand ways by niche optimisation choices and nuances of specific hardware, I actually really like the choice to report how many comparisons your algorithm needed.
For example Rust's current unstable sort takes ~24 comparisons on average for each of 10^7 truly random elements to sort them all, but if instead all those elements are chosen (still at random) from only 20 possibilities, it only needs a bit more than five comparisons regardless of whether there are 10^3 elements or 10^7.
Unlike "On this Intel i6-9402J Multi Wazoo (Bonk Nugget Edition) here are my numbers" which is not very useful unless you also have the i6-9402J in that specific edition, these comparison counts get to a more fundamental property of the algorithm that transcends micro architecture quirks which will not matter next year.
Vecr 28 days ago [-]
"My boss wants to buy systems with the Intel i10-101010F Medium Core Platinum (with rowhammer & Sonic & Knuckles), can you buy this $20,000 box and test your program so I can write him a report?"
What's your point? The paper you're linking does not include the analysis the post you're responding to is asking for.
gymbeaux 27 days ago [-]
It does give some insight into what you seek, at least. For example, “We find that for smaller n ≲ 262144, JesseSort is slower than Python’s default sort.”
I’d like to see a much larger n but the charts in the research paper aren’t really selling JesseSort. I think as more and more “sorts” come out, they all get more niche. JesseSort might be good for a particular dataset size and ordering/randomness but from what I see, we shouldn’t be replacing the default Python sorting algorithm.
tim-peters 26 days ago [-]
I want to elaborate on timing cautions: a sort that specializes to 4-byte machine ints is doing something that can't be expressed directly in CPython. Its lists are heterogeneous, types aren't known until runtime, and there is no native (to Python) concept of "4-byte int". All CPython ints are stored in "bigint" (unbounded) format, and even zero requires 32 bytes of storage in the format (overheads for a type pointer, refcount, 16-byte heap alignment padding). That format isn't 2's complement either (although it creates the semantic _illusion_ of being such): it's sign+magnitude.
The general object comparison machinery is very expensive at runtime, essentially repeated from scratch on every compare. To mitigate this, CPython does a prepass over the list (O(n)) to see whether the list is in fact homogeneous, and, if so, of a type that can use a more specialized comparison function. Not free, but pays off hugely if you are in fact sorting a list of "small enough" machine ints. But it doesn't change the storage format: on every compare, cycles are still consumed to transform CPython's sign+magnitude bigint format to native 2's-complement for native machine code to compare directly.
I expect this accounts for most of the observed "speedup" over timsort. Despite semi-heroic efforts to cut the cost, comparisons of native ints in CPython will always be significantly more expensive than code specialized to compare native ints directly. At least until (if ever) CPython learns how to "unbox" ints. The PyPy implementation of Python already does, and sorting lists of native machine ints runs about twice as fast under PyPy. It's not the sorting algorithm that matters to this, it's the relative cost of machine-int compares, and almost certainly too the cache pressure of using many more bytes to store small ints than necessary in "boxed" format.
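A quick way to feel the size of that gap from Python itself (my own toy benchmark, not CPython internals; absolute numbers vary by machine and CPython version): sort plain ints, which use the specialized compare after the pre-pass, against the same values wrapped in objects whose "<" runs Python-level code.

    import random
    import timeit

    class Boxed:
        __slots__ = ("v",)
        def __init__(self, v):
            self.v = v
        def __lt__(self, other):
            return self.v < other.v

    vals = [random.randrange(1 << 30) for _ in range(10**5)]
    boxed = [Boxed(v) for v in vals]
    print("plain ints:", timeit.timeit(lambda: sorted(vals), number=20))
    print("boxed     :", timeit.timeit(lambda: sorted(boxed), number=20))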
As mentioned elsewhere, JesseSort reminds me most of the old and simpler "patience sorting", with which it has a lot in common. Both keep a collection of sorted sublists (to be merged at the end), both use binary search to pick a collection member to which to add "the next" input, both excel at "low diversity" inputs, both (on average) end up with O(sqrt(n)) sublists on randomly ordered inputs, and neither is "naturally" stable.
As I noted in listsort.txt at the time, while timsort gets a lot of value out of repeated elements ("low diversity"), it's just not the right approach to exploit that as much as possible. JesseSort and pdqsort (and, by extension, glidesort) will always do better on that. OTOH, timsort gets value out of many kinds of partial ordering, which "reveal themselves" dynamically via one run in a merge "winning" consistently.
vanderZwan 28 days ago [-]
Interesting approach, although I'm wondering why they didn't just use a double-ended queue instead of a linked list.
Also, I get the urge to name the algorithm after yourself, since Timsort is named after its creator as well, but if you call your data structure "rainbow" then my first association would be the rainbow flag, since flag-sorts also have a tradition in sorting algorithm names.
[0] https://en.m.wikipedia.org/wiki/Dutch_national_flag_problem
[1] https://en.m.wikipedia.org/wiki/American_flag_sort
First thing I think about with "rainbow" and "data structures" is rainbow tables.
blibble 28 days ago [-]
a double-ended queue is often implemented using a doubly linked list (like the one in use here)
if you back it by two arrays then you can't do O(1) inserts in the middle (like is going on here)
tialaramex 28 days ago [-]
It's true that awful Deque implementations are often built from a linked list. If you're taught about big-O notation and haven't noticed how modern computers work you might even do that today.
I would expect that they're thinking of something more like Rust's VecDeque, so that's a conventional "growable array" (Rust's Vec, Java ArrayList, C++ std::vector) used internally with some head + tail tracking to form a circular buffer so as to deliver the Deque API (fast push to head or tail, fast pop from head or tail).
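In Python terms, the shape of such a ring-buffer deque is roughly the following (my own sketch, with a fixed capacity to keep it short; a real VecDeque-style container also grows and re-packs the buffer when full):

    class RingDeque:
        def __init__(self, capacity):
            self.buf = [None] * capacity
            self.head = 0        # index of the first element
            self.size = 0
        def push_back(self, x):
            assert self.size < len(self.buf)
            self.buf[(self.head + self.size) % len(self.buf)] = x
            self.size += 1
        def push_front(self, x):
            assert self.size < len(self.buf)
            self.head = (self.head - 1) % len(self.buf)
            self.buf[self.head] = x
            self.size += 1
        def pop_front(self):
            assert self.size
            x = self.buf[self.head]
            self.head = (self.head + 1) % len(self.buf)
            self.size -= 1
            return x
        def pop_back(self):
            assert self.size
            self.size -= 1
            return self.buf[(self.head + self.size) % len(self.buf)]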
blibble 28 days ago [-]
you can make mechanically sympathetic linked lists with low constant factors as long as you avoid pointers
vanderZwan 28 days ago [-]
Do you mean using indices into a backing array instead of pointers if the maximum length is known to be small? That probably helps but still has the cache issue to some degree no?
(also I feel like this is arguing semantics a bit, but I'm sincerely wondering: is a linked list without pointers still a linked list?)
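For what it's worth, here's what I'd picture (my own sketch): nodes live in parallel arrays and the links are integer indices, so the whole structure sits in a couple of contiguous buffers, though whether that's actually cache-friendly still depends on the order you end up hopping through it.

    class IndexLinkedList:
        def __init__(self):
            self.val, self.nxt = [], []   # parallel arrays; links are ints
            self.head = -1                # -1 plays the role of a null link
        def push_front(self, x):
            self.val.append(x)
            self.nxt.append(self.head)
            self.head = len(self.val) - 1
        def __iter__(self):
            i = self.head
            while i != -1:
                yield self.val[i]
                i = self.nxt[i]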
conradludgate 28 days ago [-]
And what about pdqsort, glidesort, driftsort, ipnsort? New research is always cool, but there's plenty of other research in this space to compare this to
agnishom 28 days ago [-]
I am a bit concerned that the primary channel of distribution of the paper is ResearchGate. I am somehow biased against taking it seriously because of this.
behnamoh 28 days ago [-]
I have mixed feelings about the naming. Isn't it kinda off-putting that he named it after himself? Usually it's other people who name your algorithm after you. At least that's how it is in science and engineering (Fourier didn't call it "Fourier transform", Laplace didn't call it "Laplace transform", Kalman didn't name it "Kalman filters", etc.)
bluepnume 28 days ago [-]
He should have at least called it eejss-sort
arduanika 28 days ago [-]
This guy sorts
behnamoh 28 days ago [-]
This guy names
hnisoss 28 days ago [-]
it's only ok if your name is Tim?
tim-peters 27 days ago [-]
Perhaps ;-) I'm the "Tim" in "timsort". The name was an inside joke. I'm not a self-promoter, and never have been. As the so-called "Zen of Python" author, I thought it would be funny to pick a name I'd never pick ;-)
CPython had a unique (in my experience) combination of cheap data movement (only pointer swaps) and very expensive comparisons. That's what I was aiming at. I never publicized it, never wrote about it outside the CPython repo, and never thought I'd hear about it again.
Of course I'm pleased it found wider use, but that took me wholly by surprise. If I had it to do over again, I would probably have named it, say, gallopsort.
If this new sort catches on, Jesse should rename it before it's too late ;-)
hnisoss 16 days ago [-]
I think it's an excellent name Mr. Tim haha, I'm glad it stuck. Always reminds me of effbot (rip) and others. It felt more personal and memorable than gallopsort would.
gyudin 28 days ago [-]
Like Dijkstra's algorithm? Knuth-Morris-Pratt algorithm? Huffman coding?
Did any of these guys put their name on the algorithm by themselves?
vanderZwan 28 days ago [-]
The author seems young, I think we should be a little forgiving if they saw all these algorithms and concluded "ok so if you come up with something new you get to name it after yourself"
behnamoh 28 days ago [-]
those names were given to those algorithms by others, not the creators.
gymbeaux 27 days ago [-]
What did they call their algorithms before they passed away?
atiedebee 26 days ago [-]
Looking at the paper linked on the wikipedia page for Huffman Coding, David Huffman referred to Huffman Coding as just "optimum coding" [0]
[0] http://compression.ru/download/articles/huff/huffman_1952_mi...
I know it is not possible in our Universe but on a lighter note, in some parallel universe out there, someone has made a sorting algorithm that runs in O(1). I wonder though if a quantum computer can simultaneously sort all the elements in O(1)
procaryote 28 days ago [-]
If you can pick parallel universes, just pick the one where the data happens to be sorted
warkdarrior 28 days ago [-]
"quantum computers are no better than classical ones [for sorting]"
A modern Borges could write a really good short story about that universe.
xiaodai 28 days ago [-]
Python's default is gallop sort, however radixsort is much faster and performs in O(n).
selcuka 28 days ago [-]
> radixsort is much faster and performs in O(n).
Radix sort time complexity is not O(n); it's O(n*k) where k is the size of the largest element. Besides it has an additional O(n+k) space complexity.
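For concreteness, a least-significant-digit radix sort over non-negative ints, 8 bits per pass (my own sketch): each pass is O(n + 256), and the number of passes is the k above.

    def radix_sort(a, key_bits=32):
        # stable LSD radix sort for non-negative integers below 2**key_bits
        for shift in range(0, key_bits, 8):
            buckets = [[] for _ in range(256)]
            for x in a:
                buckets[(x >> shift) & 0xFF].append(x)
            a = [x for bucket in buckets for x in bucket]
        return a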
ijustlovemath 28 days ago [-]
I thought it was Timsort?
pstoll 28 days ago [-]
It was and always will be timsort-ly yours. iykyk.
perlgeek 28 days ago [-]
radixsort isn't a comparison-based sort algorithm, so you're comparing apples to chickens.
prirun 28 days ago [-]
"We find that for smaller n≲ 262144, JesseSort is slower than Python’s default sort."
repsilat 28 days ago [-]
I wonder about the global statistics of sorted data... Is the most common number of elements to sort zero? Certainly less than ten?
What about the median? Two elements to sort? One? Zero again?
lblume 28 days ago [-]
The question is also for which list lengths the performance matters most. When sorting a few strings (<20), whether the algorithm uses 5 or 7 comparisons would usually not matter too much. So to find the optimal algorithm for a given situation, we would have to compute a weighted average by list length importance on the performances of the algorithm per list length.
3eb7988a1663 28 days ago [-]
Don't high performance systems have heuristics to decide what specific algorithms to use at runtime? It is not unimaginable to think that there could be a function dedicated to small vs large collections.
Assuming the results hold, someone has to decide if the additional complexity is worth the performance. For something like BLAS, go nuts. For Python standard library, maybe not.
tcper 28 days ago [-]
Amazing, there are still new sort algorithms today!
ggm 28 days ago [-]
"This journal no longer accepts papers on optimal solutions to the 9 queens problem"
quuxplusone 27 days ago [-]
Hmm. I understand the "rainbow" data structure depicted in the paper's Figure 2 (that particular diagram is super useful!), but I don't understand the algorithm to build a (useful) rainbow from an unsorted array. Obviously one way to build a rainbow is to sort the array and then say that the entire array is now the base array of a rainbow, but that's not algorithmically helpful. (Likewise, one can put an array into max-heap (Eytzinger) order simply by reverse-sorting it; but that's not algorithmically helpful.)
So I went to just transcribe the pseudocode algorithm given in the paper, and found instructions like "if index is at the middle of the base array then...", which is pretty confusing at first glance. I believe it just means that we're also tracking the length `L` of the base array, and `if index == (L+1)/2 then...`, but it could be written much more clearly. (In fact it could just say "if index equals the number of bands.") Likewise the use of the adjectives "middle-left" and "middle-right" instead of spelling out which actual indices we're talking about.
Likewise, when `index` is zero I don't think "base array[index-1]" is well-defined. Seems easy to patch up; but if it's so easy then why didn't the author do it?
The algorithm depends on the undefined subroutines MERGEPOLICY and MERGE. The implementation of MERGE might be considered obvious (if we are unconcerned with the memory consumption of our "band" representation, which also doesn't seem to be concretely described in the paper). MERGEPOLICY, though, is opaque to me, and seems kind of critical to the entire algorithm.
The algorithm runs the opaque BINARYSEARCH step O(n) times, giving it a runtime of O(n log n) right off the bat. This git commit — https://github.com/lewj85/jessesort/commit/b98f7245077681989... — indicates that the author was initially not even aware that sorting is generally O(n log n); he originally wrote that JesseSort was "O(n log(ln n))" with some nonsense about "harmonic numbers." (Thankfully removed in that git commit, but worryingly present before. You cannot in general make a good algorithm simply by starting with a bullshit algorithm and removing the bullshit.)
JesseSort seems to boil down to two steps: (1) Create a list of sorted sublists, using the paper's poorly-defined splitting algorithm. (2) Merge the bands, using the paper's undefined merging algorithm. Up to the details of those two steps it's exactly an N-way mergesort (the dynamic selection of `N` being part of the details). Without reliable concrete explanation of those details, I'm not sure how to proceed.
It almost feels like the author was really excited to share this "rainbow" intermediate data structure (Figures 1 and 2) — which, again, is legitimately interesting IMHO — but when it came to motivating why one might care about it, he just kinda went "I dunno, sorting?" and wasn't really excited about fleshing out that idea beyond a sketch.
P.S., I'm sure this is my C++ chauvinism showing, but I'm shocked that the Python reference implementation operates only on `int`. Surely if you know you're sorting nothing but plain old ints, faster algorithms exist! Python's default `sorted` function works on any kind of input. That probably explains a lot of the speed difference right there. Also, `jessesort` doesn't even fail loudly if you pass it non-ints; it just quietly type-casts the input.
>>> from jessesort_c import jessesort
>>> jessesort([1.414, 2.718])
[1, 2]
pinoy420 28 days ago [-]
Why is this implemented in Python? Is it compiled somehow?
00ajcr 28 days ago [-]
There's a Cython implementation (the .pyx file). The Cython code is compiled to C code (then that C code is compiled).
hinkley 28 days ago [-]
So for those of us who aren’t hanging on every update in the Python universe, what is the default sort in Python now? All I recall is that Timsort got dethroned and the author seems to be behaving like it’s some big secret to be kept from the plebs.
Edit: all the way at the bottom of page 6: timsort+powersort
vanderZwan 28 days ago [-]
Powersort is pretty neat, here's a blog post with links to talks about it by one of the authors:
https://www.wild-inter.net/posts/powersort-in-python-3.11
So I had ChatGPT give me a quick run-through after feeding it the JesseSort Python file; here's how it understood the algorithm in simple terms:
1. It starts with two elements and places them in order.
2. Each new element is inserted into the correct position within the growing "rainbow" of sorted bands.
3. By the time all elements are inserted, the list consists of multiple sorted bands that are then merged.
Was it wrong?