added neargoodsort idea (and some merge space optimization ideas that I think are known)

2023-07-20 21:09:57 +02:00 · 2023-07-20 21:09:57 +02:00 · 0521ddd52d
commit 0521ddd52d
parent e3c229337c
1 changed files with 276 additions and 0 deletions
--- a/neargoodsort_and_merge_ideas.md
+++ b/neargoodsort_and_merge_ideas.md
@ -0,0 +1,276 @@
+# Sorting for "nearly sorted data"
+
+## Algorithm:
+
+* Go over the data and like in Stalin-sort keep only those who are in order
+* BUT: Unlike stalin-sort we partition!
+* in[], outs[], outns[]
+* The in is the input array
+* The outs is the "sorted part" of the separation (Stalin would keep them)
+* The outns is the "outliers" part of the separation (Stalin would kill them)
+* Use the same algorithm to recursively sort the outns part
+* Use the merge sort's merge algoritm to merge outs[] and outns[] back into in[]
+
+## This works because we know for sure that outs has at least a single element!
+## When it only has one element we get worst case O(n^2) runtime!
+## When the data is nearly sorted, we get nearly O(n) runtime!
+## Can be used to "keep an array/list sorted" with an "update" method on it that iterates over and update pos/key similar to kismap.
+## Idea: decide if we go from top or bottom based on which is smaller - hopefully mitigates worst case being descending case!
+
+## Example
+
+    ------------------------- Split 0
+    in0:
+    3 7 5 8 9 5 8 9 5 9 9 3 1
+    
+    outs0:
+    3 7 8 9 9 9 9
+    outns0:
+    5 5 8 5 3 1
+    ------------------------- Split 1
+    in1:
+    5 5 8 5 3 1
+    
+    outs1:
+    5 5 8
+    outns1:
+    5 3 1
+    ------------------------- Split 2
+    in2:
+    5 3 1
+    
+    outs2:
+    5
+    outns2:
+    3 1
+    ------------------------- Split 3
+    in3:
+    3 1
+    
+    outs2:
+    3
+    outns2:
+    1
+    ------------------------- Merge 3
+    outs2:
+    3
+    outns2:
+    1
+
+    in3 [merge-out]:
+    1 3
+    ------------------------- Merge 2
+    outs2:
+    5
+    outns2:
+    1 3      == in3
+
+    in2 [merge-out]:
+    1 3 5
+    ------------------------- Merge 1
+    outs1:
+    5 5 8
+    outns1:
+    1 3 5    == in2
+
+    in1 [merge-out]:
+    1 3 5 5 5 8
+    ------------------------- Merge 0
+    outs0:
+    3 7 8 9 9 9 9
+    outns0:
+    1 3 5 5 5 8  == in1
+
+    in0:
+    1 3 3 5 5 5 7 8 8 9 9 9 9
+
+Which is - as you can see the sort result of the input array!
+
+    3 7 5 8 9 5 8 9 5 9 9 3 1
+    
+## Time and space analysis
+
+On random data this sounds to be close to the O(n*logn) amortized runtime statistically I think but did not go after it.
+
+On the worst case its clearly O(n^2) because we always just get a single element to outsi means that...
+
+Space analysis is roughly same as the non-optimized merge sort - see below for space optimized merge steps - maybe useful for this to!
+
+
+# A random bad inplace-merge idea
+
+## Example
+
+Lets say we have this two lists
+
+    1 3 3 5 7 9
+    2 3 4 5 6 7
+
+But represented in the same array, partitioned into two parts:
+
+    1 3 3 5 7 9|2 3 4 5 6 7
+
+We can go with two pointers and try to make this work with SWAPs:
+
+    1 3 3 5 7 9|2 3 4 5 6 7
+    ^           ^         ~
+(noswap)
+    1 3 3 5 7 9|2 3 4 5 6 7
+      ^         ^
+(swap*)
+    1 2 3 5 7 9|3 3 4 5 6 7
+        ^       ^ 
+(noswap)
+    1 2 3 5 7 9|3 3 4 5 6 7
+          ^     ^
+(swap*)
+    1 2 3 3 7 9|3 4 5 5 6 7
+            ^   ^
+(swap*)
+    1 2 3 3 3 9|4 5 5 6 7 7
+              ^ ^
+(swap*)
+    1 2 3 3 3 4|5 5 6 7 7 9
+              ^ ^
+## Where: swap* means swap element on left with right, but on the right list put it in its right place (binary search + memcpy)
+## Maybe: The second part should be heapified! Then we can get log(n) pop&insert, but issue is then it does not stay sorted :-(
+## Runtime: O(n^2) worst case which is extreme slow...
+## Rem.: Likely swap + bubble is better here for the second side...
+
+# Better, but still slow random inplace merge idea
+    1 3 3 5 7 9|2 3 4 5 6 7
+    ^           ^
+        (<=)
+    1 3 3 5 7 9|2 3 4 5 6 7
+      ^           ^
+        (<=)
+    1 3 3 5 7 9|2 3 4 5 6 7
+        ^           ^
+        (<=)
+    1 3 3 5 7 9|2 3 4 5 6 7
+          ^           ^
+        (<=)
+    1 3 3 5 7 9|2 3 4 5 6 7
+            ^           ^
+        (>)
+    1 3 3 5 6 9|2 3 4 5 7 7
+              ^           ^
+        (>)
+    1 3 3 5 6 7|2 3 4 5 7 9
+                ^           ^
+        (!!)
+    1 3 3 5 6 7|2 3 4 5 7 9
+    ^ !         ^
+        (logsearch: ~)
+    1 3 3 5 6 7|2 3 4 5 7 9
+    ^ ^         ^ ~
+        (tmpvec)
+    1 3 3 5 6 7|. . 4 5 7 9
+    ^ ^         ^ ~
+        tmp: 2 3
+        (memcpy)
+    1 . . 3 3 5|6 7 4 5 7 9
+    ^ ^         ^ ~
+        tmp: 2 3
+        (backwrite)
+    1 2 3 3 3 5 6 7|4 5 7 9
+          ^ ^       ^
+        tmp: nil
+        (not(3 <= 4 < 3))
+    1 2 3 3 3 5 6 7|4 5 7 9
+            ^ ^     ^
+        tmp: nil
+        (logsearch: ~)
+        (not(3 <= 4 < 3))
+    1 2 3 3 3 5 6 7|4 5 7 9
+            ^ ^     ^ ~
+        (tmpvec)
+    1 2 3 3 3 5 6 7|. . 7 9
+            ^ ^     ^ ~
+        tmp: 4 5
+        (memcpy)
+    1 2 3 3 3 . . 5|6 7 7 9
+            ^ ^     ^ ~
+        tmp: 4 5
+        (backwrite)
+    1 2 3 3 3 4 5 5|6 7|7 9
+              ^ ^     ^ ~
+        tmp: nil
+
+        (not(3 <= 4 < 3))
+        (not(3 <= 4 < 3))
+        (not(3 <= 4 < 3))
+    [END]
+
+## This sounds like O(n*logn) for the merge operation - which would make a merge sort slower than n*log*n still, but not so bad as above
+## This is not totally in-place because can use worst case a lot of mem, but averagely less than regular merge
+## But just using n/2 element tmp array for "regular" alg works if you think about it so not sure if beating that one...
+
+# Doing n/2 element tmp array
+
+From:
+    arr:    1 3 3 5 7 9|2 3 4 5 6 7
+
+To:
+    arr:    . . . . . .|2 3 4 5 6 7
+    tmp:    1 3 3 5 7 9
+
+And then we just always  pick the smaller between the two piecewise:
+            _
+    arr:    . . . . . .|2 3 4 5 6 7
+    tmp:    1 3 3 5 7 9 ^
+            ^
+              _
+    arr:    1 . . . . .|2 3 4 5 6 7
+    tmp:    . 3 3 5 7 9 ^
+              ^
+                _
+    arr:    1 2 . . . .|. 3 4 5 6 7
+    tmp:    . 3 3 5 7 9   ^
+              ^
+        (rem.: tmp is preferred to keep order of elements unchanged for same keys!)
+                  _
+    arr:    1 2 3 . . .|. 3 4 5 6 7
+    tmp:    . . 3 5 7 9   ^
+                ^
+        (rem.: tmp is preferred to keep order of elements unchanged for same keys!)
+                    _
+    arr:    1 2 3 3 . .|. 3 4 5 6 7
+    tmp:    . . . 5 7 9   ^
+                  ^
+                      _
+    arr:    1 2 3 3 3 .|. . 4 5 6 7
+    tmp:    . . . 5 7 9     ^
+                  ^
+                        _
+    arr:    1 2 3 3 3 4|. . . 5 6 7
+    tmp:    . . . 5 7 9       ^
+                  ^
+        (rem.: tmp is preferred to keep order of elements unchanged for same keys!)
+                          _
+    arr:    1 2 3 3 3 4|5 . . 5 6 7
+    tmp:    . . . . 7 9       ^
+                    ^
+                            _
+    arr:    1 2 3 3 3 4|5 5 . . 6 7
+    tmp:    . . . . 7 9         ^
+                    ^
+                              _
+    arr:    1 2 3 3 3 4|5 5 6 . . 7
+    tmp:    . . . . 7 9           ^
+                    ^
+        (rem.: tmp is preferred to keep order of elements unchanged for same keys!)
+                                _
+    arr:    1 2 3 3 3 4|5 5 6 7 . 7
+    tmp:    . . . . . 9           ^
+                      ^
+                                  _
+    arr:    1 2 3 3 3 4|5 5 6 7 7 .
+    tmp:    . . . . . 9             ^
+                      ^
+                                    _
+    arr:    1 2 3 3 3 4|5 5 6 7 7 9
+    tmp:    . . . . . .             ^
+                        ^
+
+And this ends the merge algorithm!