magyarsort/neargoodsort_and_merge_ideas.md


Sorting for "nearly sorted data"

Algorithm:

  • Go over the data and, as in Stalin sort, keep only the elements that are in order
  • BUT: unlike Stalin sort, we partition instead of dropping elements!
  • We work with three arrays: in[], outs[], outns[]
  • in is the input array
  • outs is the "sorted part" of the separation (Stalin would keep these)
  • outns is the "outliers" part of the separation (Stalin would kill these)
  • Use the same algorithm to recursively sort the outns part
  • Use merge sort's merge algorithm to merge outs[] and outns[] back into in[]
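The steps above can be sketched in Python roughly as follows. This is only a minimal sketch, not the author's implementation: the function names `split_in_order`, `merge` and `neargood_sort` are made up for illustration, it returns a new list instead of merging back into in[], and the recursion is unrolled into a loop so that long descending inputs do not hit Python's recursion limit.

```python
def split_in_order(data):
    """One Stalin-sort-like pass that partitions instead of dropping."""
    outs, outns = [], []
    for x in data:
        if not outs or x >= outs[-1]:
            outs.append(x)    # still in order: Stalin would keep it
        else:
            outns.append(x)   # out of order: Stalin would kill it
    return outs, outns


def merge(a, b):
    """Plain two-way merge of two sorted lists (ties are taken from a)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def neargood_sort(data):
    """Split repeatedly, then merge the kept parts back, deepest level first."""
    kept_parts = []
    rest = list(data)
    while rest:                       # each pass keeps at least one element
        outs, rest = split_in_order(rest)
        kept_parts.append(outs)
    result = []
    for outs in reversed(kept_parts):  # same order as unwinding the recursion
        result = merge(outs, result)
    return result
```

For example, `neargood_sort([3, 7, 5, 8, 9, 5, 8, 9, 5, 9, 9, 3, 1])` goes through the same splits as the worked example in this document.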

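This works because we know for sure that outs has at least a single element (the first one is always kept), so outns is always strictly smaller than in and the recursion terminates!

When outs only ever gets one element, we get the worst case O(n^2) runtime!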

When the data is nearly sorted, we get nearly O(n) runtime!

This can be used to "keep an array/list sorted": give it an "update" method that iterates over the elements and updates pos/key, similar to kismap.

Idea: decide whether we go from the top or from the bottom based on which is smaller. This hopefully mitigates the descending input being the worst case!
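One possible reading of this idea, sketched in Python: compare the first and the last element, and if the first one is larger, run the split pass from the end of the array backwards. The comparison rule and the names `split_from_top` and `split_adaptive` are assumptions made for this sketch; the notes do not pin down the exact rule.

```python
def split_from_top(data):
    """Forward pass: keep elements that do not break ascending order."""
    outs, outns = [], []
    for x in data:
        if not outs or x >= outs[-1]:
            outs.append(x)
        else:
            outns.append(x)
    return outs, outns


def split_adaptive(data):
    """Assumed rule: scan from the bottom when the first element is larger
    than the last one, otherwise scan from the top."""
    if len(data) >= 2 and data[0] > data[-1]:
        # Scanning from the bottom is a forward pass over the reversed input.
        outs, outns = split_from_top(data[::-1])
        outns.reverse()  # put the outliers back into their original order
        return outs, outns
    return split_from_top(data)
```

With this rule a fully descending input such as `[9, 8, 7, 6, 5]` ends up entirely in outs after a single pass, instead of losing only one element per pass.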

Example

------------------------- Split 0
in0:
3 7 5 8 9 5 8 9 5 9 9 3 1

outs0:
3 7 8 9 9 9 9
outns0:
5 5 8 5 3 1
------------------------- Split 1
in1:
5 5 8 5 3 1

outs1:
5 5 8
outns1:
5 3 1
------------------------- Split 2
in2:
5 3 1

outs2:
5
outns2:
3 1
------------------------- Split 3
in3:
3 1

outs3:
3
outns3:
1
------------------------- Merge 3
outs3:
3
outns3:
1

in3 [merge-out]:
1 3
------------------------- Merge 2
outs2:
5
outns2:
1 3      == in3

in2 [merge-out]:
1 3 5
------------------------- Merge 1
outs1:
5 5 8
outns1:
1 3 5    == in2

in1 [merge-out]:
1 3 5 5 5 8
------------------------- Merge 0
outs0:
3 7 8 9 9 9 9
outns0:
1 3 5 5 5 8  == in1

in0:
1 3 3 5 5 5 7 8 8 9 9 9 9

Which is, as you can see, the sorted result of the input array:

3 7 5 8 9 5 8 9 5 9 9 3 1
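The split levels of the example can be reproduced with a few lines of Python. This is just a sketch for checking the trace by machine; `split_levels` is a made-up helper name and, like the example, it stops as soon as the outlier part has at most one element.

```python
def split_levels(data):
    """Return the (outs, outns) pair of every split level, top level first."""
    levels = []
    rest = list(data)
    while len(rest) > 1:
        outs, outns = [], []
        for x in rest:
            if not outs or x >= outs[-1]:
                outs.append(x)    # in order, goes to outs
            else:
                outns.append(x)   # outlier, goes to outns
        levels.append((outs, outns))
        rest = outns              # the next level works on the outliers
    return levels


for level, (outs, outns) in enumerate(split_levels([3, 7, 5, 8, 9, 5, 8, 9, 5, 9, 9, 3, 1])):
    print("Split", level, "outs:", outs, "outns:", outns)
```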

Time and space analysis

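On random data this feels like it should be statistically close to O(n*logn) amortized runtime, but I did not verify it. (Caveat: on a uniformly random permutation a single pass keeps only the running maxima, which is about ln n elements in expectation, so this guess may well be too optimistic.)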

In the worst case it is clearly O(n^2): if every split puts just a single element into outs_i, we need n splits, and the i-th split still has to scan the n - i remaining elements.
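To get a feeling for these numbers, one can simply count the split passes and the elements they touch. The helper below is a hypothetical measuring aid (the name `count_passes` is made up), not part of the algorithm itself. Note that it also counts the final pass over a single remaining element.

```python
def count_passes(data):
    """Return (number of split passes, total elements scanned by them)."""
    rest = list(data)
    passes = 0
    touched = 0
    while rest:
        touched += len(rest)
        last_kept = rest[0]       # the first element is always kept
        outliers = []
        for x in rest[1:]:
            if x >= last_kept:
                last_kept = x     # in order: stays in the kept part
            else:
                outliers.append(x)
        rest = outliers
        passes += 1
    return passes, touched


print(count_passes(list(range(1, 9))))      # already sorted input
print(count_passes(list(range(8, 0, -1))))  # descending input
```

For a descending input of n elements this gives n passes and n(n+1)/2 scanned elements, while an already sorted input needs a single pass over n elements.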

Space usage is roughly the same as for a non-optimized merge sort. See below for space-optimized merge steps, which may be useful for this too!

A random bad inplace-merge idea

Example

Let's say we have these two lists:

1 3 3 5 7 9
2 3 4 5 6 7

But represented in the same array, partitioned into two parts:

1 3 3 5 7 9|2 3 4 5 6 7

We can go with two pointers and try to make this work with SWAPs:

1 3 3 5 7 9|2 3 4 5 6 7
^           ^         ~

(noswap)
1 3 3 5 7 9|2 3 4 5 6 7
^           ^
(swap*)
1 2 3 5 7 9|3 3 4 5 6 7
  ^         ^
(noswap)
1 2 3 5 7 9|3 3 4 5 6 7
    ^       ^
(swap*)
1 2 3 3 7 9|3 4 5 5 6 7
      ^     ^
(swap*)
1 2 3 3 3 9|4 5 5 6 7 7
        ^   ^
(swap*)
1 2 3 3 3 4|5 5 6 7 7 9
          ^ ^

Where: swap* means swapping the element on the left with the one on the right, and then moving the element that landed in the right list to its correct place there (binary search + memmove, since the shifted ranges overlap)
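A minimal Python sketch of this swap* merge, assuming the right pointer always stays on the first (smallest) element of the right part. The name `swap_merge` is made up, and the slice assignment stands in for the memmove-style shift:

```python
from bisect import bisect_left


def swap_merge(arr, mid):
    """Merge sorted arr[:mid] and sorted arr[mid:] in place using swap*."""
    n = len(arr)
    if mid >= n:
        return                     # empty right part: nothing to merge
    for i in range(mid):
        if arr[i] > arr[mid]:      # (swap*): the right head is smaller
            displaced = arr[i]
            arr[i] = arr[mid]
            # Binary search for the place of the displaced element among
            # the remaining right elements arr[mid+1:n].
            pos = bisect_left(arr, displaced, mid + 1, n)
            # Shift arr[mid+1:pos] one slot to the left (the memmove step)...
            arr[mid:pos - 1] = arr[mid + 1:pos]
            # ...and drop the displaced element into the freed slot.
            arr[pos - 1] = displaced
        # else (noswap): arr[i] is already in its final place


data = [1, 3, 3, 5, 7, 9, 2, 3, 4, 5, 6, 7]
swap_merge(data, 6)
print(data)
```

Each swap* may shift almost the whole right part, which is where the quadratic worst case comes from.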

Maybe: the second part should be heapified! Then we get log(n) pop & insert, but the issue is that it then does not stay sorted :-(

Runtime: O(n^2) worst case, which is extremely slow...

Rem.: swap + bubble is likely better here for the second side...

Better, but still slow random inplace merge idea

1 3 3 5 7 9|2 3 4 5 6 7
^           ^
    (<=)
1 3 3 5 7 9|2 3 4 5 6 7
  ^           ^
    (<=)
1 3 3 5 7 9|2 3 4 5 6 7
    ^           ^
    (<=)
1 3 3 5 7 9|2 3 4 5 6 7
      ^           ^
    (<=)
1 3 3 5 7 9|2 3 4 5 6 7
        ^           ^
    (>)
1 3 3 5 6 9|2 3 4 5 7 7
          ^           ^
    (>)
1 3 3 5 6 7|2 3 4 5 7 9
            ^           ^
    (!!)
1 3 3 5 6 7|2 3 4 5 7 9
^ !         ^
    (logsearch: ~)
1 3 3 5 6 7|2 3 4 5 7 9
^ ^         ^ ~
    (tmpvec)
1 3 3 5 6 7|. . 4 5 7 9
^ ^         ^ ~
    tmp: 2 3
    (memcpy)
1 . . 3 3 5|6 7 4 5 7 9
^ ^         ^ ~
    tmp: 2 3
    (backwrite)
1 2 3 3 3 5 6 7|4 5 7 9
      ^ ^       ^
    tmp: nil
    (not(3 <= 4 < 3))
1 2 3 3 3 5 6 7|4 5 7 9
        ^ ^     ^
    tmp: nil
    (3 <= 4 < 5)
    (logsearch: ~)
1 2 3 3 3 5 6 7|4 5 7 9
        ^ ^     ^ ~
    (tmpvec)
1 2 3 3 3 5 6 7|. . 7 9
        ^ ^     ^ ~
    tmp: 4 5
    (memcpy)
1 2 3 3 3 . . 5|6 7 7 9
        ^ ^     ^ ~
    tmp: 4 5
    (backwrite)
1 2 3 3 3 4 5 5 6 7|7 9
          ^ ^     ^ ~
    tmp: nil

    (no remaining right-side element fits between any later left pair, so nothing more has to move)
[END]

This looks like O(n log n) for the merge operation, which would make the whole merge sort O(n log^2 n): still slower than O(n log n), but not as bad as the idea above.
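If each merge really costs about c * n * log n (an assumption, the estimate above is only a guess), the usual merge sort recurrence can be unrolled as follows, taking n to be a power of two and c to be some constant:

```latex
\begin{aligned}
T(n) &= 2\,T(n/2) + c\,n\log_2 n \\
     &= n\,T(1) + \sum_{k=0}^{\log_2 n - 1} 2^k \cdot c\,\frac{n}{2^k}\,\log_2\frac{n}{2^k} \\
     &= n\,T(1) + c\,n \sum_{k=0}^{\log_2 n - 1} \left(\log_2 n - k\right) \\
     &= n\,T(1) + c\,n\,\frac{\log_2 n\,(\log_2 n + 1)}{2} \\
     &= O(n \log^2 n)
\end{aligned}
```

So under that assumption the sort picks up one extra log factor compared to a merge sort with a linear-time merge.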

This is not totally in-place, because in the worst case it can use a lot of memory, but on average it uses less than a regular merge.

But simply using an n/2-element tmp array for the "regular" algorithm works if you think about it, so I am not sure this idea beats that one...

Doing n/2 element tmp array

From:
arr: 1 3 3 5 7 9|2 3 4 5 6 7

To:
arr: . . . . . .|2 3 4 5 6 7   tmp: 1 3 3 5 7 9

And then we just always pick the smaller between the two piecewise:

arr: . . . . . .|2 3 4 5 6 7   tmp: 1 3 3 5 7 9
                 ^                  ^
arr: 1 . . . . .|2 3 4 5 6 7   tmp: . 3 3 5 7 9
                 ^                    ^
arr: 1 2 . . . .|. 3 4 5 6 7   tmp: . 3 3 5 7 9
                   ^                  ^
    (rem.: tmp is preferred to keep order of elements unchanged for same keys!)
arr: 1 2 3 . . .|. 3 4 5 6 7   tmp: . . 3 5 7 9
                   ^                    ^
    (rem.: tmp is preferred to keep order of elements unchanged for same keys!)
arr: 1 2 3 3 . .|. 3 4 5 6 7   tmp: . . . 5 7 9
                   ^                      ^
arr: 1 2 3 3 3 .|. . 4 5 6 7   tmp: . . . 5 7 9
                     ^                    ^
arr: 1 2 3 3 3 4|. . . 5 6 7   tmp: . . . 5 7 9
                       ^                  ^
    (rem.: tmp is preferred to keep order of elements unchanged for same keys!)
arr: 1 2 3 3 3 4|5 . . 5 6 7   tmp: . . . . 7 9
                       ^                    ^
arr: 1 2 3 3 3 4|5 5 . . 6 7   tmp: . . . . 7 9
                         ^                  ^
arr: 1 2 3 3 3 4|5 5 6 . . 7   tmp: . . . . 7 9
                           ^                ^
    (rem.: tmp is preferred to keep order of elements unchanged for same keys!)
arr: 1 2 3 3 3 4|5 5 6 7 . 7   tmp: . . . . . 9
                           ^                  ^
arr: 1 2 3 3 3 4|5 5 6 7 7 .   tmp: . . . . . 9
                             ^                ^
arr: 1 2 3 3 3 4|5 5 6 7 7 9   tmp: . . . . . .
                             ^                  ^

And this ends the merge algorithm!
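The n/2-element tmp array merge traced above can be sketched in Python like this. It is a minimal sketch under the assumption that the left part is the one copied out; the name `merge_with_half_buffer` is made up for illustration.

```python
def merge_with_half_buffer(arr, mid):
    """Merge sorted arr[:mid] and sorted arr[mid:] in place, using a
    temporary copy of the left part only."""
    tmp = arr[:mid]              # the left part moves out to tmp
    i = 0                        # read position in tmp
    j = mid                      # read position in the right part of arr
    k = 0                        # write position in arr (never passes j)
    n = len(arr)
    while i < len(tmp) and j < n:
        if tmp[i] <= arr[j]:     # on equal keys tmp is preferred (stable)
            arr[k] = tmp[i]
            i += 1
        else:
            arr[k] = arr[j]
            j += 1
        k += 1
    # Leftover tmp elements fill the tail; leftover right elements are
    # already sitting in their final places.
    arr[k:k + len(tmp) - i] = tmp[i:]


data = [1, 3, 3, 5, 7, 9, 2, 3, 4, 5, 6, 7]
merge_with_half_buffer(data, 6)
print(data)
```

The write position can never overtake the read position in the right part, because it has advanced by exactly as many slots as elements were consumed from tmp and from the right part together.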