Pseudo-quicksort time complexity

Pseudo-quicksort time complexity - haskell

I know that quicksort has O(n log n) average time complexity. A pseudo-quicksort (which is only a quicksort when you look at it from far enough away, with a suitably high level of abstraction) that is often used to demonstrate the conciseness of functional languages is as follows (given in Haskell):
quicksort :: Ord a => [a] -> [a]
quicksort [] = []
quicksort (p:xs) = quicksort [y | y<-xs, y<p] ++ [p] ++ quicksort [y | y<-xs, y>=p]
Okay, so I know this thing has problems. The biggest problem with this is that it does not sort in place, which is normally a big advantage of quicksort. Even if that didn't matter, it would still take longer than a typical quicksort because it has to do two passes of the list when it partitions it, and it does costly append operations to splice it back together afterwards. Further, the choice of the first element as the pivot is not the best choice.
But even considering all of that, isn't the average time complexity of this quicksort the same as the standard quicksort? Namely, O(n log n)? Because the appends and the partition still have linear time complexity, even if they are inefficient.

This "quicksort" is actually deforested tree sort:
http://www.reddit.com/r/programming/comments/2h0j2/real_quicksort_in_haskell
data Tree a = Leaf | Node a (Tree a) (Tree a)
mkTree [] = Leaf
mkTree (x:xs) = Node x (mkTree (filter (<= x) xs)) (mkTree (filter (x <) xs))
Binary tree is unbalanced, so O(N^2) worst-case and O(N*Log N) average-case complexity for building search tree.
foldTree f g Leaf = g
foldTree f g (Node x l r) = f x (foldTree f g l) (foldTree f g r)
treeSort l = foldTree (\x lft rht -> lft++[x]++rht) [] (mkTree l)
Retrieval algorithm have O(N^2) worst-case and O(N*Log N) average-case complexity.
Well-balanced:
Prelude> let rnds = iterate step where step x = (75*x) `mod` 65537
Prelude> length . quicksort . take 4000 . rnds $ 1
4000
(0.08 secs, 10859016 bytes)
Prelude> length . quicksort . take 8000 . rnds $ 1
8000
(0.12 secs, 21183208 bytes)
Prelude> length . quicksort . take 16000 . rnds $ 1
16000
(0.25 secs, 42322744 bytes)
Not-so-well-balanced:
Prelude> length . quicksort . map (`mod` 10) $ [1..4000]
4000
(0.62 secs, 65024528 bytes)
Prelude> length . quicksort . map (`mod` 10) $ [1..8000]
8000
(2.45 secs, 241906856 bytes)
Prelude> length . quicksort . map (`mod` 10) $ [1..16000]
16000
(9.52 secs, 941667704 bytes)

I agree with your assumption that the average time complexity still is O(n log n). I'm not an expert and 100% sure, but these are my thoughts:
This is a pseudo code of the in-place quicksort: (call quicksort with l=1 and r=length of the array)
Quicksort(l,r)
--------------
IF r-l>=1 THEN
choose pivot element x of {x_l,x_l+1,...,x_r-1,x_r}
order the array-segment x_l,...x_r in such a way that
all elements < x are on the left side of x // line 6
all elements > x are on the right side of x // line 7
let m be the position of x in the 'sorted' array (as said in the two lines above)
Quicksort(l,m-1);
Quicksort(m+1,r)
FI
The average time complexity analysis then reasons by selecting the "<"-comparisons in line 6 and 7 as the dominant operation in this algorithm and finally comes to the conclusion that the average time complexity is O(n log n). As the cost of line "order the array-segment x_l,...x_r in such a way that..." are not considered (only the dominant operation is important in time complexity analysis if you want to find bounds), I think "because it has to do two passes of the list when it partitions it" is not a problem, also as your Haskell version would just take approximately twice as long in this step. The same holds true for the appendix-operation and I agree with on that this adds nothing to the asymptotic costs:
Because the appends and the partition still have linear time complexity, even if they are inefficient.
For the sake of convenience lets assume that this adds up "n" to our time complexity costs, so that we have "O(n log n+n)". As there exists a natural number o for that n log n > n for all natural numbers greater than o holds true, you can estimate n log n +n to the top by 2 n log n and to the bottom by n log n, therefore n log n+n = O(n log n).
Further, the choice of the first element as the pivot is not the best choice.
I think the choice of the pivot element is irrelevant here, because in the average case analysis you assume uniform distribution of the elements in the array. You can't know from which place in the array you should select it, and you therefore have to consider all these cases in which your pivot-element (independently from which place of the list you take it) is the i-st smallest element of your list, for i=1...r.

I can offer you a run time test on Ideone.com which seems to show more or less linearithmic run-times for both (++) based versions and the one using accumulator technique from the Landei's answer, as well as another one, using one-pass three-way partitioning. On ordered data this turns quadratic or worse for all of them.
-- random: 100k 200k 400k 800k
-- _O 0.35s-11MB 0.85s-29MB 1.80s-53MB 3.71s-87MB n^1.3 1.1 1.0
-- _P 0.36s-12MB 0.80s-20MB 1.66s-45MB 3.76s-67MB n^1.2 1.1 1.2
-- _A 0.31s-14MB 0.62s-20MB 1.58s-54MB 3.22s-95MB n^1.0 1.3 1.0
-- _3 0.20s- 9MB 0.41s-14MB 0.88s-24MB 1.92s-49MB n^1.0 1.1 1.1
-- ordered: 230 460 900 1800
-- _P 0.09s 0.33s 1.43s 6.89s n^1.9 2.1 2.3
-- _A 0.09s 0.33s 1.44s 6.90s n^1.9 2.1 2.3
-- _3 0.05s 0.15s 0.63s 3.14s n^1.6 2.1 2.3
quicksortO xs = go xs where
go [] = []
go (x:xs) = go [y | y<-xs, y<x] ++ [x] ++ go [y | y<-xs, y>=x]
quicksortP xs = go xs where
go [] = []
go (x:xs) = go [y | y<-xs, y<x] ++ (x : go [y | y<-xs, y>=x])
quicksortA xs = go xs [] where
go [] acc = acc
go (x:xs) acc = go [y | y<-xs, y<x] (x : go [y | y<-xs, y>=x] acc)
quicksort3 xs = go xs [] where
go (x:xs) zs = part x xs zs [] [] []
go [] zs = zs
part x [] zs a b c = go a ((x : b) ++ go c zs)
part x (y:ys) zs a b c =
case compare y x of
LT -> part x ys zs (y:a) b c
EQ -> part x ys zs a (y:b) c
GT -> part x ys zs a b (y:c)
The empirical run-time complexities are estimated here as O(n^a) where a = log( t2/t1 ) / log( n2/n1 ). The timings are very approximate as ideone aren't very reliable with occasional far outlyers, but for checking the time complexity it's enough.
Thus these data seem to indicate that one-pass partition is faster by 1.5x-2x than two-pass schemes, and that using (++) is in no way slowing things down - at all. I.e. the "append operations" are not "costly" at all. The quadratic behaviour or (++)/append seems to be an urban myth — in Haskell context of course (edit: ... i.e. in the context of guarded recursion/tail recursion modulo cons; cf. this answer) (update: as user:AndrewC explains, it really is quadratic with the left folding; linear when (++) is used with the right folding; more about this here and here).
later addition: To be stable, the three-way partitioning quicksort version should too build its parts in the top-down manner:
q3s xs = go xs [] where
go (x:xs) z = part x xs go (x:) (`go` z)
go [] z = z
part x [] a b c = a [] (b (c []))
part x (y:ys) a b c =
case compare y x of
LT -> part x ys (a . (y:)) b c
EQ -> part x ys a (b . (y:)) c
GT -> part x ys a b (c . (y:))
(performance not tested).

I don't know how much this improves the runtime complexity, but by using an accumulator you can avoid the expensive (++):
quicksort xs = go xs [] where
go [] acc = acc
go (x:xs) acc = go [y | y<-xs, y<x] (x : go [y | y<-xs, y>=x] acc)

Look here for a true O(n log n) quicksort that will work on both arrays and lists :
http://citeseer.ist.psu.edu/viewdoc/download?doi=10.1.1.23.4398&rep=rep1&type=pdf
It is quite easy to implement in Common Lisp, and it outperforms the sort implementation of many commercial lisps.

Yes, this version has the same asymptotic complexity as the classic version -- you replace the linear-time partition with: two passes (< and >=), and you have the additional linear-time ++ (which includes linear re-allocing/copying). So it's a hefty constant-factor worse than an in-place partition, but it's still linear. All the other aspects of the algorithm are the same, so the same analysis that gives O(n log n) average-case for "true" (i.e. in-place) quicksort still holds here.

Related

Build sorted infinite list of infinite lists

I know it is impossible to sort infinite lists, but I am trying to write a definition of the infinite increasing list of multiples of n numbers.
I already have the function
multiples :: Integer -> [Integer]
multiples n = map (*n) [1..]
that returns the infinite list of multiples of n. But now I want to build a function that given a list of Integers returns the increasing infinite list of the multiples of all the numbers in the list. So the function multiplesList :: [Integer] -> [Integer] given the input [3,5] should yield [3,5,6,9,10,12,15,18,20,....].
I'm new at Haskell, and I'm struggling with this. I think I should use foldr or map since I have to apply multiples to all the numbers in the input, but I don't know how. I can't achieve to mix all the lists into one.
I would really appreciate it if someone could help me.
Thank you!

You are in the right path. following the comments here is a template you can complete.
multiples :: Integer -> [Integer]
multiples n = map (*n) [1..]
-- This is plain old gold recursion.
mergeSortedList :: [Integer] -> [Integer] -> [Integer]
mergeSortedList [] xs = undefined
mergeSortedList xs [] = undefined
mergeSortedList (x:xs) (y:ys)
| x < y = x:mergeSortedList xs (y:ys) -- Just a hint ;)
| x == y = undefined
| x > y = undefined
multiplesList :: [Integer] -> [Integer]
multiplesList ms = undefined -- Hint: foldX mergeSortedList initial xs
-- Which are initial and xs?
-- should you foldr or foldl?

We can easily weave two infinite lists together positionally, taking one element from each list at each step,
weave (x:xs) ys = x : weave ys xs
or we could take longer prefixes each time,
-- warning: expository code only
weaveN n xs ys = take n xs ++ weaveN n ys (drop n xs)
but assuming both lists are not only infinite but also strictly increasing (i.e. there are no duplicates in the lists), we can guide the taking of prefixes by the head value of the opposite list:
umerge :: Ord a => [a] -> [a] -> [a]
-- warning: only works with infinite lists
umerge xs (y:ys) = a ++ [y | head b > y ] ++ umerge ys b
where
(a,b) = span (< y) xs
This is thus a possible encoding of the unique merge operation ("unique" meaning, there won't be any duplicates in its output).
Testing, it seems to work as intended:
> take 20 $ umerge [3,6..] [5,10..]
[3,5,6,9,10,12,15,18,20,21,24,25,27,30,33,35,36,39,40,42]
> [3,6..42] ++ [5,10..42] & sort & nub
[3,5,6,9,10,12,15,18,20,21,24,25,27,30,33,35,36,39,40,42]
> [ p | let { ms :: [Integer] ; ms = takeWhile (< 25^2) $
foldl1 umerge [[p*p,p*p+p..] | p <- [2..25]] },
p <- [2..545], not $ elem p ms ]
[2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,
97,101,...........,499,503,509,521,523,541]
> length it
100
And with an ingenious little tweak (due to Richard Bird as seen in the JFP article by Melissa O'Neill) it can even be used to fold an infinite list of ascending lists, provided that it is sorted in ascending order of their head elements, so the head of the first argument is guaranteed to be the first in the output and can thus be produced without testing:
umerge1 :: Ord a => [a] -> [a] -> [a]
-- warning: only works with infinite lists
-- assumes x < y
umerge1 (x:xs) ~(y:ys) = x : a ++ [y | head b > y ] ++ umerge ys b
where
(a,b) = span (< y) xs
Now
> take 100 [ p | let { ms :: [Integer] ;
ms = foldr1 umerge1 [[p*p,p*p+p..] | p <- [2..]] },
p <- [2..], not $ elem p $ takeWhile (<= p) ms ]
[2,3,5,7,11,13, ...... 523,541]
the same calculation works indefinitely.
to the literalists in the audience: yes, calling elem here is Very Bad Thing. The OP hopefully should have recognized this on their own, (*) but unfortunately I felt compelled to make this statement, thus inadvertently revealing this to them, depriving them of their would-be well-earned a-ha moment, unfortunately.
Also, umerge1's definition can be radically simplified. Again, this is left to the OP to discover on their own. (which would, again, be much better for them if I wasn't compelled to make this remark revealing it to them --- finding something on your own is that much more powerful and fulfilling)
(*) and search for ways to replace it with something more efficient, on their own. No, this code is not presented as The Best Solution to Their Problem.

Project Euler 50: Algorithm is incredibly slow, failing to understand why

I'm using Project Euler to learn Haskell. I'm new at Haskell and am having a lot of trouble coming up with an algorithm that doesn't take an absurd amount of time. I'm estimating that the program here would take 14 gigayears to arrive at the solution.
The problem:
Which prime, below one-million, can be written as the sum of the most
consecutive primes?
Here's my source. I've left out isPrime. I've posted it because it's far too inefficient to solve the problem. I think the issue lies with the slicedChains and primeChains calls, but I'm not sure what it is. I've resolved this before with C++. But for whatever reason, the efficient solution seems beyond me in Haskell.
Edit: I've included isPrime.
import System.Environment (getArgs)
import Data.List (nub,maximumBy)
import Data.Ord (comparing)
isPrime :: Integer -> Bool
isPrime 1 = False
isPrime 2 = True
isPrime x
| any (== 0) (fmap (x `mod`) [2..x-1]) = False
| otherwise = True
primeChain :: Integer -> [Integer]
primeChain x = [ n | n <- 1 : 2 : [3,5..x-1], isPrime n ]
slice :: [a] -> [Int] -> [a]
slice xs args = take (to - from + 1) (drop from xs)
where from = head args
to = last args
subsequencesOfSize :: Int -> [a] -> [[a]]
subsequencesOfSize n xs = let l = length xs
in if n>l then [] else subsequencesBySize xs !! (l-n)
where
subsequencesBySize [] = [[[]]]
subsequencesBySize (x:xs) = let next = subsequencesBySize xs
in zipWith (++) ([]:next) (map (map (x:)) next ++ [[]])
slicedChains :: Int -> [Integer] -> [[Integer]]
slicedChains len xs = nub [x | x <- fmap (xs `slice`) subseqs, length x > 1]
where subseqs = [x | x <- (subsequencesOfSize 2 [1..len]), (last x) > (head x)]
primeSums :: Integer -> [[Integer]]
primeSums x = filter (\ns -> sum ns == x) chain
where xs = primeChain x
len = length xs
chain = slicedChains len xs
compLength :: [[a]] -> [a]
compLength xs = maximumBy (comparing length) xs
cleanSums :: [Integer] -> [[Integer]]
cleanSums xs = fmap (compLength) filtered
where filtered = filter (not . null) (fmap primeSums xs)
main :: IO()
main = do
args <- getArgs
let arg = read (head args) :: Integer
let xs = primeChain arg
print $ maximumBy (comparing length) $ cleanSums xs

Your basic problem is that you are not pruning your search space based on the best solution you have found so far.
I can tell this just from the fact that you are using maximumBy to find the longest sequence.
For instance, if during your search your find a consecutive sequence of 4 primes whose sum is a prime < 10^6, you don't have to examine any sequence which begins with a prime greater than 250000.
To do this kind of pruning you have to keep track of the solution found so far and interleave the testing of candidate sequences with their generation so that the best solution found so far can stop the search early.
Update
There are several inefficiencies in slicedChains. Haskell lists are implemented a linked lists. This video is pretty good overview of linked lists and how they differ from arrays: (link)
The following expressions in your code are going to be problematic w.r.t. efficiency:
* nub has quadratic running time
* length x > 1 - the complexity of length is O(n) where n is the length of the list. A better way to write this is:
lengthGreaterThan1 :: [a] -> Bool
lengthGreaterThan1 (_:_:_) = True
lengthGreaterThan1 _ = False
* subsequencesOfSize 2 [1..len] may be more succinctly written:
[ [a,b] | a <- [1..len], b <- [a+1..len] ]
and this will also ensure that a < b.
* The take and drop calls in slice are also O(n)
* In primeSums the call to primeChain will regenerate essentially the same list over and over again resulting in a lot of multiple calls to isPrime. A better approach is to define primeChain like this:
allPrimes = filter isPrime [1..]
primeChain x = takeWhile (<= x) allPrimes
The list allPrimes will be generated once, and primeChain simply takes prefixes of that list.
* primeSums x is charged with finding sequences whose sum is exactly x, but it looks at a lot of sequences that can't possibly work. For instance, primeSums 31 will examine:
11 + 13 + 17, 11 + 13 + 17 + 23, 11 + 13 + 17 + 23 + 29,
17 + 19, 17 + 19 + 23, 17 + 19 + 23 + 29,
19 + 23, 19 + 23 + 29
23 + 29
even though it's pretty obvious that none of these sums could equal 31.

So the first thing you need is a good data structure: Once you find a sequence of length n you don't care about sequences of shorter length, so your primary needs are: (1) tracking the sum, (2) tracking the primes in the set, (3) removing the least element, (4) adding a new greatest element. The key is amortization, where a big cost is paid infrequently enough that you can pretend it is a small cost per procedure. The data structure looks like this:
data Queue x = Q [x] [x]
q_empty (Q [] []) = True
q_empty _ = False
q_headtails (Q (x:xs) rest) = (x, Q xs rest)
q_headtails (Q [] xs) = case reverse xs of y:ys -> (y, Q ys [])
[] -> error "End of queue."
q_append el (Q beg end) = Q beg (el:end)
So deconstructing the list is possible, but sometimes triggers an O(n) operation, but that's OK because when it does, we won't have to do it for another n steps, so it averages out to one operation per step. (You might also want to do it with a spine-strict list.)
To save on length operations and summing the items of the list you probably want to cache those, too:
type Length = Int
type Sum = Int
type Prime = Int
data PrimeSeq = PS Length Sum (Queue Prime)
headTails (PS len sum q) = (x, PS (len - 1) (sum - x) xs)
where (x, xs) = q_headtails q
append x (PS len sum xs) = PS (len + 1) (sum + x) (q_append x xs)
The algorithm for these looks like:
Cache a copy of the PrimeSeq you're starting with
Keep adding primes to it and testing primality until you get to 10^6.
If you find a new prime with a longer sequence, replace the cache.
Whenever you run into 10^6, revert to the cache, pull a prime off the front of the queue, then repeat as needed.

Your prime generation is quadratic (isPrime 101 tests rem 101 100 == 0 even though 10 is the biggest number by which 101 needs to be tested -- and actually 7 is enough).
Yet even with it, a simple enough list-based code finds the answer in under 2 seconds (on an Intel Core i7 2.5 GHz, interpreted in GHCi). And with the code corrected to take advantage of the above mentioned optimization (and additionally, testing by primes only), it takes 0.1s.
Also, f x | t = False | otherwise = True is the same as f x = not t.
We are asked by the PE site not to give you even a hint.
But in general, the key to efficiency in Haskell, thanks to its laziness, is being generative with as small a duplication of effort as possible. As one example, instead of calculating each slice of a list in isolation starting anew, we can produce the bunch of them together as part of one process,
slices :: Int -> [a] -> [[a]]
slices n = map (take n) . iterate tail -- sequence of list's slices of length n each
Another principle is, try to solve a more general problem, of which yours is an instance.
Having written such a function, we can play with it by trying out different values for its parameters, from smaller to the bigger ones, for an exploratory style of problem solving. We're told about 21 consecutive primes. What about 22 of them? 27? 1127 of them? ... and I've said enough about this already.
If it starts taking too much time, we can assess the full solution's needed run time by empirical orders of growth analysis.
Though the solution is found quickly enough with your unoptimized isPrime code, the exploratory process can be prohibitively slow with it, but it is fast enough with the optimized code:
primes :: [Int]
primes = 2 : filter isPrime [3,5..]
isPrime n = and [rem n p > 0 | p <- takeWhile ((<= n).(^2)) primes]

Generating a list which is made by right shifting elements n times

I am trying a problem recently. And in this case I am having few problems.
Input: generatingListforRightShifting 2 [1,2,3,4,5,6,7,8]
Output: [[1,2,3,4,5,6,7,8],[8,1,2,3,4,5,6,7],[7,8,1,2,3,4,5,6]]
As you understand this program will shift an element in right direction. The 1st argument indicates how many times it will do shifting.
As a newbie I am trying solving it few well known list functions. and using recursion. But to me recursion idea is not clear. My code is:
generatingListforRightShifting' _ []=[]
generatingListforRightShifting' 0 x=x
generatingListforRightShifting' n xs= newList where
newList=takeWhile(\i->[1..n]
<=n)(reverse(take
i(reverse xs))++reverse(drop i(reverse xs)))
I understand that the main mistake I'm doing is in the part takeWhile. But how can I iterate through n times. I have already made a program which directly shows the shifted result such as
Input:generatingListforRightShifting 2 [1,2,3,4,5,6,7,8]
Output: [7,8,1,2,3,4,5,6]
But when I try to get all previous shifting I cannot.
Can anyone help me out here. I also welcome you if you give me the solving idea.

This is more commonly known as rotating instead of shifting. Rotating the list right once is simple, as there are methods to get the last element and the the sublist of all elements but the last.
rotateOnce lst = (last lst):(init lst)
Also note that the rotating twice is equivalent to calling rotateOnce twice. Therefore, the method could be implemented simply as a recursion from the previous result:
rotateN 0 lst = [lst]
rotateN n lst = lst : rotateN (n-1) ((last lst):(init lst))
(Note: that may not be the most optimal solution.)

You can define "shift" recursively: shift 0 is a no-op, shift 1+n (x:xs) is shift n xs.
Something like:
shift 0 = \x -> x
shift n = \lst#(x:xs) -> (shift (n-1) xs)
-- example:
sh3 = shift 3
Then the 'rotate' problem becomes easier:
rotate n = \lst -> (shift lst) ++ (take n lst)

You seem to prefer that we fix your code than start again, so
let's have a look at your code. Firstly, the main list chopping:
reverse (take i (reverse xs)) ++ reverse (drop i (reverse xs))
Now reverse (take i (reverse xs)) takes i elements from the end of the list,
but you reverse the list twice to achieve this, and it would be better to do
drop (length xs - i) xs. Similarly, you can implement reverse (drop i (reverse xs)))
as take (length xs - i) xs. That gives us
drop (length xs - i) xs ++ take (length xs - i) xs
Now your code \i->[1..n]<=n doesn't make sense because it compares the list [1..n]
with n, which can't work. I think you're trying to make a loop where i runs from
1 to n, which is a good plan. Let's use a list comprehension to get the ones we wanted:
[drop (length xs - i) xs ++ take (length xs - i) xs | i <- [1 .. length xs], i <= n]
but now we're running from 1 to the length of the list but throwing away numbers above n,
which would be better written
[drop (length xs - i) xs ++ take (length xs - i) xs | i <- [1..n]]
This does allow n to be more than length xs, but I don't see a big issue there, we could check that at first.
Notice now that we're only using i in the form (length xs - i), and really we're recalculating
length xs an awful lot more than we should, so instead of letting i run from 1 to n, and using
length xs - i, why don't we just have j=length xs -i so j runs from length xs to length xs - n:
[drop j xs ++ take j xs | j <- [length xs,length xs - 1 .. length xs - n]]
which works because for example [6,5..1] == [6,5,4,3,2,1]
It would be neater to do
let l = length xs in
[drop j xs ++ take j xs | j <- [l,l - 1 .. l - n]]
or maybe you like to take more than you like to do arithmetic, so we could use:
let l = length xs in
take n [drop j xs ++ take j xs | j <- [l,l - 1 .. 0]]
which has the added benefit of stopping you doing too many, stopping
you when you get back to the start.
I'd rename your function from generatingListforRightShifting to rotationsR, giving
rotationsR n xs = let l = length xs in
take n [drop j xs ++ take j xs | j <- [l,l - 1 ..]]
Which gives rotationsR 6 [1..4] == [[1,2,3,4],[4,1,2,3],[3,4,1,2],[2,3,4,1],[1,2,3,4]].
Left rotation would look simpler:
rotationsL n xs = take n [drop j xs ++ take j xs | j <- [0..length xs]]
Digression: I couldn't help myself, sorry, and I started again.
I still don't like all that dropping and taking every single time, I'd rather pop
infinitely many copies of xs next to each other (cycle xs) and take infinitely
many tails of that, chopping them all to the right length, but just give you the first n:
rotationsL' n xs = let l = length xs in
take n . map (take l) . tails . cycle $ xs
Because of lazy evaluation, only a finite amount of cycle xs ever gets calculated,
but this one can run and run: rotationsL' 10 [1..4] gives you:
[[1,2,3,4],[2,3,4,1],[3,4,1,2],[4,1,2,3],[1,2,3,4],[2,3,4,1],[3,4,1,2],[4,1,2,3],[1,2,3,4],[2,3,4,1]]
It would be nice to do the right roations that way too, but it doesn't work because
I'd need to start at the end of an infinite list and work my way back. Let's reuse
your reverse, take what you need, reverse trick again, though:
rotationsR' n xs = let l = length xs in
take n . map (reverse.take l) . tails . cycle . reverse $ xs
Undigression: If you'd rather stick more closely to your original code, you can do
generatingListforRightShifting n xs =
[reverse (take i (reverse xs)) ++ reverse (drop i (reverse xs)) | i <- [1..n]]

I would drop the current approach, which is very convoluted. Instead, focus on abstracting the different components of the operation. If you break the operation into parts, you will notice that there are two symmetric components: rotating the list to the left, and rotating the list to the right. The operation you wish to define iterates the right rotation a specified number of times over some list. This suggests that the desired operation can be defined by taking a specified number of iterations of either the left or right rotation. For example,
left :: [a] -> [a]
left [] = []
left xs = tail xs ++ [head xs]
right :: [a] -> [a]
right [] = []
right xs = last xs : init xs
shiftL :: Int -> [a] -> [[a]]
shiftL n = take n . iterate left
shiftR :: Int -> [a] -> [[a]]
shiftR n = take n . iterate right

Using cycle here seems nice:
shifts n xs = take (n+1) $ shifts' (cycle xs)
where
len = length xs
shifts' ys = take len ys:shifts' (drop (len-1) ys)

I find a left rotation to be very straight forward using splitAt:
import Data.Tuple (swap)
rotateLeft n = uncurry (++) . swap . splitAt n
> rotateLeft 2 "Hello World!"
>>> "llo World!He"

No speedup with naive merge sort parallelization in Haskell

Note: This post was completely rewritten 2011-06-10; thanks to Peter for helping me out. Also, please don't be offended if I don't accept one answer, since this question seems to be rather open-ended. (But, if you solve it, you get the check mark, of course).
Another user had posted a question about parallelizing a merge sort. I thought I'd write a simple solution, but alas, it is not much faster than the sequential version.
Problem statement
Merge sort is a divide-and-conquer algorithm, where the leaves of computation can be parallelized.
The code works as follows: the list is converted into a tree, representing computation nodes. Then, the merging step returns a list for each node. Theoretically, we should see some significant performanc gains, since we're going from an O(n log n) algorithm to an O(n) algorithm with infinite processors.
The first steps of the computation are parallelized, when parameter l (level) is greater than zero below. This is done by [via variable strat] selecting the rpar strategy, which will make sub-computation mergeSort' x occur in parallel with mergeSort' y. Then, we merge the results, and force its evaluation with rdeepseq.
data Tree a = Leaf a | Node (Tree a) (Tree a) deriving (Show)
instance NFData a => NFData (Tree a) where
rnf (Leaf v) = deepseq v ()
rnf (Node x y) = deepseq (x, y) ()
listToTree [] = error "listToTree -- empty list"
listToTree [x] = Leaf x
listToTree xs = uncurry Node $ listToTree *** listToTree $
splitAt (length xs `div` 2) xs
-- mergeSort' :: Ord a => Tree a -> Eval [a]
mergeSort' l (Leaf v) = return [v]
mergeSort' l (Node x y) = do
xr <- strat $ runEval $ mergeSort' (l - 1) x
yr <- rseq $ runEval $ mergeSort' (l - 1) y
rdeepseq (merge xr yr)
where
merge [] y = y
merge x [] = x
merge (x:xs) (y:ys) | x < y = x : merge xs (y:ys)
| otherwise = y : merge (x:xs) ys
strat | l > 0 = rpar
| otherwise = rseq
mergeSort = runEval . mergeSort' 10
By only evaluating a few levels of the computation, we should have decent parallel communication complexity as well -- some constant factor order of n.
Results
Obtain the 4th version source code here [ http://pastebin.com/DxYneAaC ], and run it with the following to inspect thread usage, or subsequent command lines for benchmarking,
rm -f ParallelMergeSort; ghc -O2 -O3 -optc-O3 -optc-ffast-math -eventlog --make -rtsopts -threaded ParallelMergeSort.hs
./ParallelMergeSort +RTS -H512m -K512m -ls -N
threadscope ParallelMergeSort.eventlog
Results on a 24-core X5680 # 3.33GHz show little improvement
> ./ParallelMergeSort
initialization: 10.461204s sec.
sorting: 6.383197s sec.
> ./ParallelMergeSort +RTS -H512m -K512m -N
initialization: 27.94877s sec.
sorting: 5.228463s sec.
and on my own machine, a quad-core Phenom II,
> ./ParallelMergeSort
initialization: 18.943919s sec.
sorting: 10.465077s sec.
> ./ParallelMergeSort +RTS -H512m -K512m -ls -N
initialization: 22.92075s sec.
sorting: 7.431716s sec.
Inspecting the result in threadscope shows good utilization for small amounts of data. (though, sadly, no perceptible speedup). However, when I try to run it on larger lists, like the above, it uses about 2 cpus half the time. It seems like a lot of sparks are getting pruned. It's also sensitive to the memory parameters, where 256mb is the sweet spot, 128mb gives 9 seconds, 512 gives 8.4, and 1024 gives 12.3!
Solutions I'm looking for
Finally, if anyone knows some high-power tools to throw at this, I'd appreciate it. (Eden?). My primary interest in Haskell parallelism is to be able to write small supportive tools for research projects, which I can throw on a 24 or 80 core server in our lab's cluster. Since they're not the main point of our group's research, I don't want to spend much time on the parallelization efficiency. So, for me, simpler is better, even if I only end up getting 20% usage.
Further discussion
I notice that the second bar in threadscope is sometimes green (c.f. its homepage, where the second bar seems to always be garbage collection). What does this mean?
Is there any way to sidestep garbage collection? It seems to be taking a lot of time. For example, why can't a subcomputation be forked, return the result in shared memory, and then die?
Is there a better way (arrows, applicative) to express parallelism?

The answer is pretty easy: Because you have at no point introduced parallelism. Eval is just a monad to order computations, you have to ask for things to be executed in parallel manually. What you probably want is:
do xr <- rpar $ runEval $ mergeSort' x
yr <- rseq $ runEval $ mergeSort' y
rseq (merge xr yr)
This will make Haskell actually create a spark for the first computation, instead of trying to evaluate it on the spot.
Standard tips also kind-of apply:
The result should be evaluated deeply (e.g. using evalTraversable rseq). Otherwise you will only force the head of the tree, and the bulk of the data will just be returned unevaluated.
Just sparking everything will most likely eat up any gains. It would be a good idea to introduce a parameter that stops sparking at lower recursion levels.
Edit: The following actually doesn't apply anymore after the question edit
But the worst part last: Your algorithm as you state it is very flawed. Your top-level seq only forces the first cons-cell of the list, which allows GHC to use lazyness to great effect. It will never actually construct the result list, just plow through all of them in a search for the minimum element (that's not even strictly needed, but GHC only produces the cell after the minimum is known).
So don't be surprised when performance actually drops sharply when you start introducing parallelism under the assumptions that you need the whole list at some point in the program...
Edit 2: Some more answers to the edits
The biggest problem with your program is probably that it is using lists. If you want to make more than a toy example, consider at least using (unpacked) Arrays. If you want to go into serious number-crunching, maybe consider a specialised library like repa.
On "Further Discussion":
The colors stand for different GC states, I can't remember which. Try to look at the event log for the associated event.
The way to "sidestep" garbage collection is to not produce so much garbage in the first place, e.g. by using better data structures.
Well, if you are looking for an inspiration on robust parallelization it might be worthwhile to have a look at monad-par, which is relatively new but (I feel) less "surprising" in its parallel behaviour.
With monad-par, your example might become something like:
do xr <- spawn $ mergeSort' x
yr <- spawn $ mergeSort' y
merge <$> get xr <*> get yr
So here the get actually forces you to specify the join points - and the library does the required deepseq automatically behind the scenes.

I had similar luck to what you report in EDIT 3 on a dual core system with these variants. I used a smaller list length because I'm on a smaller computer, compiled with ghc -O2 -rtsopts -threaded MergePar.hs, and ran with ./MergePar +RTS -H256M -N. This might offer a more structured way to compare performance. Note that the RTS option -qa sometimes helps the simple par variants.
import Control.Applicative
import Control.Parallel
import Control.Parallel.Strategies
import Criterion.Main
import GHC.Conc (numCapabilities)
data Tree a = Leaf a | Node (Tree a) (Tree a) deriving Show
listToTree [] = error "listToTree -- empty list"
listToTree [x] = Leaf x
listToTree xs = Node (listToTree (take half xs)) (listToTree (drop half xs))
where half = length xs `div` 2
-- Merge two ordered lists
merge :: Ord a => [a] -> [a] -> [a]
merge [] y = y
merge x [] = x
merge (x:xs) (y:ys) | x < y = x : merge xs (y:ys)
| otherwise = y : merge (x:xs) ys
-- Simple merge sort
mergeSort' :: Ord a => Tree a -> [a]
mergeSort' (Leaf v) = [v]
mergeSort' (Node x y) = merge (mergeSort' x) (mergeSort' y)
mergeSort :: Ord a => [a] -> [a]
mergeSort = mergeSort' . listToTree
-- Merge sort with 'par' annotations on every recursive call
mergeSortP' :: Ord a => Tree a -> [a]
mergeSortP' (Leaf v) = [v]
mergeSortP' (Node x y) = let xr = mergeSortP' x
yr = mergeSortP' y
in xr `par` yr `pseq` merge xr yr
mergeSortP :: Ord a => [a] -> [a]
mergeSortP = mergeSortP' . listToTree
-- Merge sort with 'rpar' annotations on every recursive call
mergeSortR' :: Ord a => Tree a -> [a]
mergeSortR' (Leaf v) = [v]
mergeSortR' (Node x y) =
runEval $ merge <$> rpar (mergeSortR' x) <*> rpar (mergeSortR' y)
mergeSortR :: Ord a => [a] -> [a]
mergeSortR = mergeSortR' . listToTree
-- Parallel merge sort that stops looking for parallelism at a certain
-- depth
smartMerge' :: Ord a => Int -> Tree a -> [a]
smartMerge' _ (Leaf v) = [v]
smartMerge' n t#(Node x y)
| n <= 1 = mergeSort' t
| otherwise = let xr = smartMerge' (n-1) x
yr = smartMerge' (n-2) y
in xr `par` yr `pseq` merge xr yr
smartMerge :: Ord a => [a] -> [a]
smartMerge = smartMerge' numCapabilities . listToTree
main = defaultMain $ [ bench "original" $ nf mergeSort lst
, bench "par" $ nf mergeSortP lst
, bench "rpar" $ nf mergeSortR lst
, bench "smart" $ nf smartMerge lst ]
where lst = [100000,99999..0] :: [Int]

A way to measure performance

Given Exercise 14 from 99 Haskell Problems:
(*) Duplicate the elements of a list.
Eg.:
*Main> dupli''' [1..10]
[1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10]
I've implemented 4 solutions:
{-- my first attempt --}
dupli :: [a] -> [a]
dupli [] = []
dupli (x:xs) = replicate 2 x ++ dupli xs
{-- using concatMap and replicate --}
dupli' :: [a] -> [a]
dupli' xs = concatMap (replicate 2) xs
{-- usign foldl --}
dupli'' :: [a] -> [a]
dupli'' xs = foldl (\acc x -> acc ++ [x,x]) [] xs
{-- using foldl 2 --}
dupli''' :: [a] -> [a]
dupli''' xs = reverse $ foldl (\acc x -> x:x:acc) [] xs
Still, I don't know how to really measure performance .
So what's the recommended function (from the above list) in terms of performance .
Any suggestions ?

These all seem more complicated (and/or less efficient) than they need to be. Why not just this:
dupli [] = []
dupli (x:xs) = x:x:(dupli xs)
Your last example is close to a good fold-based implementation, but you should use foldr, which will obviate the need to reverse the result:
dupli = foldr (\x xs -> x:x:xs) []
As for measuring performance, the "empirical approach" is profiling. As Haskell programs grow in size, they can get fairly hard to reason about in terms of runtime and space complexity, and profiling is your best bet. Also, a crude but often effective empirical approach when gauging the relative complexity of two functions is to simply compare how long they each take on some sufficiently large input; e.g. time how long length $ dupli [1..1000000] takes and compare it to dupli'', etc.
But for a program this small it shouldn't be too hard to figure out the runtime complexity of the algorithm based on your knowledge of the data structure(s) in question--in this case, lists. Here's a tip: any time you use concatenation (x ++ y), the runtime complexity is O(length x). If concatenation is used inside of a recursive algorithm operating on all the elements of a list of size n, you will essentially have an O(n ^2) algorithm. Both examples I gave, and your last example, are O(n), because the only operation used inside the recursive definition is (:), which is O(1).

As recommended you can use the criterion package. A good description is http://www.serpentine.com/blog/2009/09/29/criterion-a-new-benchmarking-library-for-haskell/.
To summarize it here and adapt it to your question, here are the steps.
Install criterion with
cabal install criterion -fchart
And then add the following to your code
import Criterion.Main
l = [(1::Int)..1000]
main = defaultMain [ bench "1" $ nf dupli l
, bench "2" $ nf dupli' l
, bench "3" $ nf dupli'' l
, bench "4" $ nf dupli''' l
]
You need the nf in order to force the evaluation of the whole result list. Otherwise you'll get just the thunk for the computation.
After that compile and run
ghc -O --make dupli.hs
./dupli -t png -k png
and you get pretty graphs of the running times of the different functions.
It turns out that dupli''' is the fastest from your functions but the foldr version that pelotom listed beats everything.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string