# Sorting for "nearly sorted data" ## Algorithm: * Go over the data and like in Stalin-sort keep only those who are in order * BUT: Unlike stalin-sort we partition! * in[], outs[], outns[] * The in is the input array * The outs is the "sorted part" of the separation (Stalin would keep them) * The outns is the "outliers" part of the separation (Stalin would kill them) * Use the same algorithm to recursively sort the outns part * Use the merge sort's merge algoritm to merge outs[] and outns[] back into in[] ## This works because we know for sure that outs has at least a single element! ## When it only has one element we get worst case O(n^2) runtime! ## When the data is nearly sorted, we get nearly O(n) runtime! ## Can be used to "keep an array/list sorted" with an "update" method on it that iterates over and update pos/key similar to kismap. ## Idea: decide if we go from top or bottom based on which is smaller - hopefully mitigates worst case being descending case! ## Example ------------------------- Split 0 in0: 3 7 5 8 9 5 8 9 5 9 9 3 1 outs0: 3 7 8 9 9 9 9 outns0: 5 5 8 5 3 1 ------------------------- Split 1 in1: 5 5 8 5 3 1 outs1: 5 5 8 outns1: 5 3 1 ------------------------- Split 2 in2: 5 3 1 outs2: 5 outns2: 3 1 ------------------------- Split 3 in3: 3 1 outs2: 3 outns2: 1 ------------------------- Merge 3 outs2: 3 outns2: 1 in3 [merge-out]: 1 3 ------------------------- Merge 2 outs2: 5 outns2: 1 3 == in3 in2 [merge-out]: 1 3 5 ------------------------- Merge 1 outs1: 5 5 8 outns1: 1 3 5 == in2 in1 [merge-out]: 1 3 5 5 5 8 ------------------------- Merge 0 outs0: 3 7 8 9 9 9 9 outns0: 1 3 5 5 5 8 == in1 in0: 1 3 3 5 5 5 7 8 8 9 9 9 9 Which is - as you can see the sort result of the input array! 3 7 5 8 9 5 8 9 5 9 9 3 1 ## Time and space analysis On random data this sounds to be close to the O(n*logn) amortized runtime statistically I think but did not go after it. On the worst case its clearly O(n^2) because we always just get a single element to outsi means that... Space analysis is roughly same as the non-optimized merge sort - see below for space optimized merge steps - maybe useful for this to! ## Remark: Yes there is a variant in which we only need outnsi and no outsi vectors! For split: * Have i,j indices on the input * i reads and j writes - except when i == j no self-overwrite happens. * read from [i] * put (pushback) either to outnsi or [j] * this means j <= i * j only grows when was put there * if i is after length of input, j tells bounds of outsi in input arr Recursion just done as-is above... For merge: * See that input has junk data at its end right as many as there is in the corresponding outnsi count * so we merge from right-to-left. * The (j - 1) tells where to read from one merge source * The end of the vector (or popback / peek) tells where to read from other merge source * The (i - 1) is destination where to write the smaller among the two * The corresponding merge-source-indices step only when they are moved to destination and i always moves. This ends with the input array being sorted. Rem.: Maybe it is worth it to set the outnsi vectors to be of length |input| / 2 or maybe even |input| to avoid vector growth at cost of more memory usage... Rem.: This is based on the merge-sort last idea below - which as I said in git commit I think is well known, but just came up with. # A random bad inplace-merge idea ## Example Lets say we have this two lists 1 3 3 5 7 9 2 3 4 5 6 7 But represented in the same array, partitioned into two parts: 1 3 3 5 7 9|2 3 4 5 6 7 We can go with two pointers and try to make this work with SWAPs: 1 3 3 5 7 9|2 3 4 5 6 7 ^ ^ ~ (noswap) 1 3 3 5 7 9|2 3 4 5 6 7 ^ ^ (swap*) 1 2 3 5 7 9|3 3 4 5 6 7 ^ ^ (noswap) 1 2 3 5 7 9|3 3 4 5 6 7 ^ ^ (swap*) 1 2 3 3 7 9|3 4 5 5 6 7 ^ ^ (swap*) 1 2 3 3 3 9|4 5 5 6 7 7 ^ ^ (swap*) 1 2 3 3 3 4|5 5 6 7 7 9 ^ ^ ## Where: swap* means swap element on left with right, but on the right list put it in its right place (binary search + memcpy) ## Maybe: The second part should be heapified! Then we can get log(n) pop&insert, but issue is then it does not stay sorted :-( ## Runtime: O(n^2) worst case which is extreme slow... ## Rem.: Likely swap + bubble is better here for the second side... # Better, but still slow random inplace merge idea 1 3 3 5 7 9|2 3 4 5 6 7 ^ ^ (<=) 1 3 3 5 7 9|2 3 4 5 6 7 ^ ^ (<=) 1 3 3 5 7 9|2 3 4 5 6 7 ^ ^ (<=) 1 3 3 5 7 9|2 3 4 5 6 7 ^ ^ (<=) 1 3 3 5 7 9|2 3 4 5 6 7 ^ ^ (>) 1 3 3 5 6 9|2 3 4 5 7 7 ^ ^ (>) 1 3 3 5 6 7|2 3 4 5 7 9 ^ ^ (!!) 1 3 3 5 6 7|2 3 4 5 7 9 ^ ! ^ (logsearch: ~) 1 3 3 5 6 7|2 3 4 5 7 9 ^ ^ ^ ~ (tmpvec) 1 3 3 5 6 7|. . 4 5 7 9 ^ ^ ^ ~ tmp: 2 3 (memcpy) 1 . . 3 3 5|6 7 4 5 7 9 ^ ^ ^ ~ tmp: 2 3 (backwrite) 1 2 3 3 3 5 6 7|4 5 7 9 ^ ^ ^ tmp: nil (not(3 <= 4 < 3)) 1 2 3 3 3 5 6 7|4 5 7 9 ^ ^ ^ tmp: nil (logsearch: ~) (not(3 <= 4 < 3)) 1 2 3 3 3 5 6 7|4 5 7 9 ^ ^ ^ ~ (tmpvec) 1 2 3 3 3 5 6 7|. . 7 9 ^ ^ ^ ~ tmp: 4 5 (memcpy) 1 2 3 3 3 . . 5|6 7 7 9 ^ ^ ^ ~ tmp: 4 5 (backwrite) 1 2 3 3 3 4 5 5|6 7|7 9 ^ ^ ^ ~ tmp: nil (not(3 <= 4 < 3)) (not(3 <= 4 < 3)) (not(3 <= 4 < 3)) [END] ## This sounds like O(n*logn) for the merge operation - which would make a merge sort slower than n*log*n still, but not so bad as above ## This is not totally in-place because can use worst case a lot of mem, but averagely less than regular merge ## But just using n/2 element tmp array for "regular" alg works if you think about it so not sure if beating that one... # Doing n/2 element tmp array From: arr: 1 3 3 5 7 9|2 3 4 5 6 7 To: arr: . . . . . .|2 3 4 5 6 7 tmp: 1 3 3 5 7 9 And then we just always pick the smaller between the two piecewise: _ arr: . . . . . .|2 3 4 5 6 7 tmp: 1 3 3 5 7 9 ^ ^ _ arr: 1 . . . . .|2 3 4 5 6 7 tmp: . 3 3 5 7 9 ^ ^ _ arr: 1 2 . . . .|. 3 4 5 6 7 tmp: . 3 3 5 7 9 ^ ^ (rem.: tmp is preferred to keep order of elements unchanged for same keys!) _ arr: 1 2 3 . . .|. 3 4 5 6 7 tmp: . . 3 5 7 9 ^ ^ (rem.: tmp is preferred to keep order of elements unchanged for same keys!) _ arr: 1 2 3 3 . .|. 3 4 5 6 7 tmp: . . . 5 7 9 ^ ^ _ arr: 1 2 3 3 3 .|. . 4 5 6 7 tmp: . . . 5 7 9 ^ ^ _ arr: 1 2 3 3 3 4|. . . 5 6 7 tmp: . . . 5 7 9 ^ ^ (rem.: tmp is preferred to keep order of elements unchanged for same keys!) _ arr: 1 2 3 3 3 4|5 . . 5 6 7 tmp: . . . . 7 9 ^ ^ _ arr: 1 2 3 3 3 4|5 5 . . 6 7 tmp: . . . . 7 9 ^ ^ _ arr: 1 2 3 3 3 4|5 5 6 . . 7 tmp: . . . . 7 9 ^ ^ (rem.: tmp is preferred to keep order of elements unchanged for same keys!) _ arr: 1 2 3 3 3 4|5 5 6 7 . 7 tmp: . . . . . 9 ^ ^ _ arr: 1 2 3 3 3 4|5 5 6 7 7 . tmp: . . . . . 9 ^ ^ _ arr: 1 2 3 3 3 4|5 5 6 7 7 9 tmp: . . . . . . ^ ^ And this ends the merge algorithm!