Why do I get a stack overflow message in Haskell?

import Data.Vector hiding ((++))
import System.Environment

d = generate 1000000 (\z -> case z of
        0 -> 2
        1 -> 3
        2 -> 5
        _ -> if odd z then (d ! (z-1)) + 2 else (d ! (z-1)) + 4)
algorithmA _ _ 1 pt = pt
algorithmA t k n pt =
    let dk = d ! k
        q  = div n dk
        r  = mod n dk
    in if r /= 0
         then if q > dk
                then algorithmA t (k+1) n pt
                else (n:pt)
         else algorithmA (t+1) k q (dk:pt)
main = do
    args <- getArgs
    let n = read (args !! 0)
    if floor (sqrt (fromInteger n)) > Data.Vector.last d
      then error ("The square root of the number is greater than " ++ show (Data.Vector.last d))
      else print (algorithmA 0 0 n [])
When I compile the above program and run it with, for example, test1 2222 on the command line, I get the message "Stack space overflow: current size ... use +RTS -Ksize -RTS to increase ...". But when I delete the if in the main function, the program works without problems. Also, if I evaluate Data.Vector.last d in GHCi, the value is computed without problems. So why is this message printed? When I increase the stack size to 20M the program runs without problems.
(test1 is the name of the executable.)
Thanks.

The problem is that your code is too lazy when constructing d. Remember that Data.Vector.Vector is a boxed vector type - that is, it is represented internally as an array of pointers to heap objects (which are either values or unevaluated thunks). So when you populate d with generate, you are actually creating a vector of thunks. In your example, when the thunk at position n is accessed, it triggers the evaluation of the thunk at position n-1, which in turn triggers the evaluation of the thunk at position n-2, and so on. So evaluating the last element causes the previous 1000000 - 1 elements to be evaluated, causing the stack to grow. This is why you get the stack overflow error.
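A minimal self-contained reproduction of the effect (my example, not from the question):

import qualified Data.Vector as V

-- Each element's thunk references the previous element's thunk, so forcing
-- the last element needs one evaluation step per element, deepening the stack.
v :: V.Vector Integer
v = V.generate 1000000 (\i -> if i == 0 then 0 else v V.! (i-1) + 1)

main :: IO ()
main = print (V.last v)  -- typically overflows the default stack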
An easy way to fix this without modifying your code is to fully evaluate the whole vector before accessing the last element. In that case all thunks are evaluated in order and there is no stack overflow: once a thunk has been evaluated, it is replaced with the value of the expression it represented, so when element n is evaluated right after element n-1, only that already-evaluated neighbour has to be accessed and the cascading evaluation of all earlier thunks is not triggered:
import Control.DeepSeq (($!!))  -- from the deepseq package
...
let l = V.last $!! d  -- $!! forces d to normal form first (assumes import qualified Data.Vector as V)
...
Testing:
$ ghc -O2 Test.hs
[1 of 1] Compiling Main ( Test.hs, Test.o )
Linking Test ...
$ ./Test 2222
[101,11,2]
Alternatively, you can use unboxed vectors (flat arrays of Ints):
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as M

d :: U.Vector Int
d = U.create $ do
    v <- M.new dSize
    go 0 v
  where
    dSize = 1000000
    go i v
        | i >= dSize = return v
        | otherwise = do
            val <- case i of
                     0 -> return 2
                     1 -> return 3
                     2 -> return 5
                     _ -> if odd i
                            then (+2) <$> M.read v (i-1)
                            else (+4) <$> M.read v (i-1)
            M.write v i val
            go (i+1) v
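With the unboxed version (plus the qualified imports shown above, which the original snippet omitted), the rest of the program needs no changes beyond using the unboxed operations; a hypothetical usage sketch:

main :: IO ()
main = print (U.last d)  -- the vector holds plain Ints, so no thunk chain can form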

Related

Haskell: Parallel code is slower than sequential version

I am pretty new to Haskell threads (and parallel programming in general) and I am not sure why my parallel version of an algorithm runs slower than the corresponding sequential version.
The algorithm tries to find all k-combinations without using recursion. For this, I am using this helper function, which given a number with k bits set, returns the next number with the same number of bits set:
import Data.Bits

nextKBitNumber :: Integer -> Integer
nextKBitNumber n
    | n == 0    = 0
    | otherwise = ripple .|. ones
  where
    smallest    = n .&. (-n)
    ripple      = n + smallest
    newSmallest = ripple .&. (-ripple)
    ones        = (newSmallest `div` smallest) `shiftR` 1 - 1
It is now easy to obtain all k-combinations in the range [2^k - 1, 2^(n-k) + ... + 2^(n-1)] sequentially:
import qualified Data.Stream as ST
import System.Environment

combs :: Int -> Int -> [Integer]
combs n k = ST.takeWhile (<= end) $ kBitNumbers start
  where
    start = 2^k - 1
    end   = sum $ fmap (2^) [n-k..n-1]

kBitNumbers :: Integer -> ST.Stream Integer
kBitNumbers = ST.iterate nextKBitNumber

main :: IO ()
main = do
    params <- getArgs
    let n = read $ params !! 0
        k = read $ params !! 1
    print $ length (combs n k)
My idea is that this should be easily parallelizable splitting this range into smaller parts. For example:
start :: Int -> Integer
start k = 2 ^ k - 1

end :: Int -> Int -> Integer
end n k = sum $ fmap (2 ^) [n-k..n-1]

splits :: Int -> Int -> Int -> [(Integer, Integer, Int)]
splits n k numSplits = fixedRanges ranges []
  where
    s          = start k
    e          = end n k
    step       = (e-s) `div` min (e-s) (toInteger numSplits)
    initSplits = [s, s+step .. e]
    ranges     = zip initSplits (tail initSplits)
    fixedRanges []     acc = acc
    fixedRanges [x]    acc = acc ++ [(fst x, e, k)]
    fixedRanges (x:xs) acc = fixedRanges xs (acc ++ [(fst x, snd x, k)])
At this point, I would like to run each split in parallel, something like:
runSplit :: (Integer, Integer, Int) -> [Integer]
runSplit (start, end, k) = ST.takeWhile (<= end) $ kBitNumbers (fixStart start)
  where
    fixStart s
        | popCount s == k = s
        | otherwise       = fixStart (s + 1)
For parallelization I am using the monad-par package:
import Control.Monad.Par
import System.Environment
import qualified Data.Set as S

main :: IO ()
main = do
    params <- getArgs
    let n        = read $ params !! 0
        k        = read $ params !! 1
        numTasks = read $ params !! 2
        batches        = runPar $ parMap runSplit (splits n k numTasks)
        reducedNumbers = foldl S.union S.empty $ fmap S.fromList batches
    print $ S.size reducedNumbers
The result is that the sequential version is way faster and uses little memory, while the parallel version consumes a lot of memory and is noticeably slower.
What might be the reasons causing this? Are threads a good approach for this problem? For example, every thread generates a (potentially large) list of integers and the main thread reduces the results; are threads expected to need much memory, or are they simply meant to produce simple results (i.e. only CPU-intensive computations)?
I compile my program with stack build --ghc-options -threaded --ghc-options -rtsopts --executable-profiling --library-profiling and run it with ./.stack-work/install/x86_64-osx/lts-6.1/7.10.3/bin/combinatorics 20 3 4 +RTS -pa -N4 -RTS for n=20, k=3 and numSplits=4. An example of the profiling report for the parallel version can be found here and for the sequential version here.
In your sequential version calling combs does not build up a list in memory since after length consumes an element it isn't needed anymore and is freed. Indeed, GHC may not even allocate storage for it.
For instance, this will take a while but won't consume a lot of memory:
main = print $ length [1..1000000000] -- 1 billion
In your parallel version you are generating sub-lists, concatenating them together, building Sets, etc. and therefore the results of each sub-task have to be kept in memory.
A fairer comparison would be to have each parallel task compute the length of the k-bit numbers in its assigned range, and then add up the results. That way the k-bit numbers found by each parallel task wouldn't have to be kept in memory and would operate more like the sequential version.
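A hedged sketch of that idea, reusing the question's splits and runSplit (my glue code, untested against the original):

import Control.Monad.Par (runPar, parMap)
import System.Environment (getArgs)

main :: IO ()
main = do
    [n, k, numTasks] <- fmap (map read) getArgs
    -- each task returns only an Int, so no sub-list has to survive in memory
    let counts = runPar $ parMap (length . runSplit) (splits n k numTasks)
    print (sum counts)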
Update
Here is an example of how to use parMap. Note: under 7.10.2 I've had mixed success getting the parallelism to fire - sometimes it does and sometimes it doesn't. (Figured it out - I was using -RTS -N2 instead of +RTS -N2.)
{-
   compile with: ghc -O2 -threaded -rtsopts foo.hs

   compare:
     time ./foo 26 +RTS -N1
     time ./foo 26 +RTS -N2
-}
import Data.Bits
import Control.Parallel.Strategies
import System.Environment

nextKBitNumber :: Integer -> Integer
nextKBitNumber n
    | n == 0    = 0
    | otherwise = ripple .|. ones
  where
    smallest    = n .&. (-n)
    ripple      = n + smallest
    newSmallest = ripple .&. (-ripple)
    ones        = (newSmallest `div` smallest) `shiftR` 1 - 1

combs :: Int -> Int -> [Integer]
combs n k = takeWhile (<= end) $ iterate nextKBitNumber start
  where
    start = 2^k - 1
    end   = shift start (n-k)

main :: IO ()
main = do
    (arg1 : _) <- getArgs
    let n = read arg1
    print $ parMap rseq (length . combs n) [1..n]
"good approaches for this problem"
What do you mean by this? If it's how to write, analyze and tune a parallel Haskell program, then this is required background reading:
Simon Marlow: Parallel and Concurrent Programming in Haskell
http://community.haskell.org/~simonmar/pcph/
in particular, Section 15 (Debugging, Tuning, ..)
Use threadscope! (a graphical viewer for thread profile information generated by the Glasgow Haskell compiler) https://hackage.haskell.org/package/threadscope

Haskell ways to the 3n+1 challenge

Here is a simple programming problem from SPOJ: http://www.spoj.com/problems/PROBTRES/.
Basically, you are asked to output the biggest Collatz cycle for numbers between i and j. (The Collatz cycle of a number n is the number of steps it takes to eventually get from n to 1.)
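For reference, a direct unmemoized definition of the cycle length (my sketch, not from the question; intermediate values can exceed the starting number, so a 64-bit Int is assumed):

collatzLen :: Int -> Int
collatzLen 1 = 1
collatzLen n
    | even n    = 1 + collatzLen (n `quot` 2)
    | otherwise = 1 + collatzLen (3 * n + 1)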
I have been looking for a Haskell solution with performance comparable to that of Java or C++ (so as to fit within the allowed run-time limit). Although a simple Java solution that memoizes the cycle length of already-computed numbers works, I haven't been successful at applying the idea to obtain a Haskell solution.
I have tried Data.Function.Memoize, as well as a home-brewed log-time memoization technique using the idea from this post: Memoization in Haskell?. Unfortunately, memoization actually makes the computation of cycle(n) even slower. I believe the slowdown comes from the overhead of the Haskell approach. (I tried running the compiled binary, instead of interpreting.)
I also suspect that simply iterating over the numbers from i to j can be costly (i, j <= 10^6). So I even tried to precompute everything for the range query, using the idea from http://blog.openendings.net/2013/10/range-trees-and-profiling-in-haskell.html. However, this still gives a "Time Limit Exceeded" error.
Can you help to inform a neat competitive Haskell program for this?
Thanks!
Update: using the approach below, I was able to submit an accepted answer to SPOJ. You may check the entire code from here.
The problem has bounds 0 < n < 1,000,000. Pre-calculate all of them and store them inside an array; then freeze the array. The array can be used as its own cache / memoization space.
The problem would then reduce to a range query problem over an array, which can be done very efficiently using trees.
With the code below I can get the Collatz lengths of 1..1,000,000 in a fraction of a second:
$ time echo 1000000 | ./collatz
525
real 0m0.177s
user 0m0.173s
sys 0m0.003s
Note that the collatz function below uses a mutable STUArray internally, but is itself a pure function:
import Control.Monad.ST (ST)
import Control.Monad (mapM_)
import Control.Applicative ((<$>))
import Data.Array.Unboxed (UArray, elems)
import Data.Array.ST (STUArray, readArray, writeArray, runSTUArray, newArray)

collatz :: Int -> UArray Int Int
collatz size = out
  where
    next i = if odd i then 3 * i + 1 else i `div` 2

    loop :: STUArray s Int Int -> Int -> ST s Int
    loop arr k
        | size < k  = succ <$> loop arr (next k)
        | otherwise = do
            out <- readArray arr k
            if out /= 0
              then return out
              else do
                out <- succ <$> loop arr (next k)
                writeArray arr k out
                return out

    out = runSTUArray $ do
        arr <- newArray (1, size) 0
        writeArray arr 1 1
        mapM_ (loop arr) [2..size]
        return arr

main = do
    size <- read <$> getLine
    print . maximum . elems $ collatz size
In order to perform range queries on this array, you may build a balanced tree, as simply as below:
-- (assumes the (!) operator is also imported from Data.Array.Unboxed)
type Range = (Int, Int)
data Tree = Leaf Int | Node Tree Tree Range Int

build_tree :: Int -> Tree
build_tree size = loop 1 cnt
  where
    ctz = collatz size
    cnt = head . dropWhile (< size) $ iterate (*2) 1

    (Leaf a)       +: (Leaf b)       = max a b
    (Node _ _ _ a) +: (Node _ _ _ b) = max a b

    loop lo hi
        | lo == hi  = Leaf $ if size < lo then minBound else ctz ! lo
        | otherwise = Node left right (lo, hi) (left +: right)
      where
        i     = (lo + hi) `div` 2
        left  = loop lo i
        right = loop (i + 1) hi
query_tree :: Tree -> Int -> Int -> Int
query_tree (Leaf x) _ _ = x
query_tree (Node l r (lo, hi) x) i j
    | i <= lo && hi <= j = x
    | mid < i            = query_tree r i j
    | j < 1 + mid        = query_tree l i j
    | otherwise          = max (query_tree l i j) (query_tree r i j)
  where
    mid = (lo + hi) `div` 2
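A hypothetical usage sketch (my example, not part of the answer): build the tree once for the problem bound, then answer each (i, j) query in O(log n):

main :: IO ()
main = do
    let tree = build_tree 1000000
    print (query_tree tree 1 10)  -- maximum cycle length over the range [1, 10]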
Here is the same as in the other answer, but with an immutable, recursively defined array (it also leaks slightly - can someone say why? - and so is about two times slower):
import Data.Array

upper = 10^6

step :: Integer -> Int
step i = 1 + colAt (if odd i then 3 * i + 1 else i `div` 2)

colAt :: Integer -> Int
colAt i | i > upper = step i
colAt i = col ! i

col :: Array Integer Int
col = array (1, upper) $ (1, 1) : [(i, step i) | i <- [2..upper]]

main = print $ maximum $ elems col

Euler #4 with bigger domain

Consider the modified Euler problem #4 -- "Find the maximum palindromic number which is a product of two numbers between 100 and 9999."
rev :: Int -> Int
rev x = rev' x 0

rev' :: Int -> Int -> Int
rev' n r
    | n == 0    = r
    | otherwise = rev' (n `div` 10) (r * 10 + n `mod` 10)

pali :: Int -> Bool
pali x = x == rev x

main :: IO ()
main = print . maximum $ [x*y | x <- nums, y <- nums, pali (x*y)]
  where
    nums = [9999,9998..100]
This Haskell solution, using -O2 and GHC 7.4.1, takes about 18 seconds. The similar C solution takes 0.1 second, so Haskell is 180 times slower. What's wrong with my solution? I assumed this was the type of problem Haskell solves pretty well.
Appendix - analogue C solution:
#include <stdio.h>

#define A 100
#define B 9999

int ispali(int n)
{
    int n0 = n, k = 0;
    while (n > 0) {
        k = 10*k + n%10;
        n /= 10;
    }
    return n0 == k;
}

int main(void)
{
    int max = 0;
    for (int i = B; i >= A; i--)
        for (int j = B; j >= A; j--) {
            if (i*j > max && ispali(i*j))
                max = i*j;
        }
    printf("%d\n", max);
}
"The similar C solution"
That is a common misconception.
Lists are not loops!
And using lists to emulate loops has performance implications unless the compiler is able to eliminate the list from the code.
If you want to compare apples to apples, write the Haskell structure that is more or less equivalent to a loop: a tail-recursive worker (with a strict accumulator, though often the compiler is smart enough to figure out the strictness by itself).
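A minimal example of that pattern (my illustration, not from the answer): a sum loop written as a tail-recursive worker whose accumulator is kept strict with a bang pattern.

{-# LANGUAGE BangPatterns #-}

-- Sums 1..n with a tail-recursive worker; the bang on acc prevents
-- a chain of (+) thunks from building up.
sumTo :: Int -> Int
sumTo n = go 0 1
  where
    go !acc i
        | i > n     = acc
        | otherwise = go (acc + i) (i + 1)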
Now let's take a more detailed look. For comparison, the C, compiled with gcc -O3, takes ~0.08 seconds here, the original Haskell, compiled with ghc -O2 takes ~20.3 seconds, with ghc -O2 -fllvm ~19.9 seconds. Pretty terrible.
One mistake in the original code is to use div and mod. The C code uses the equivalent of quot and rem, which map to the machine division instructions and are faster than div and mod. For positive arguments, the semantics are the same, so whenever you know that the arguments are always non-negative, never use div and mod.
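Concretely, applying that change to the question's reversal worker:

rev' :: Int -> Int -> Int
rev' n r
    | n == 0    = r
    | otherwise = rev' (n `quot` 10) (r * 10 + n `rem` 10)  -- quot/rem instead of div/mod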
Changing that, the running time becomes ~15.4 seconds when compiling with the native code generator, and ~2.9 seconds when compiling with the LLVM backend.
The difference is due to the fact that even the machine division operations are quite slow, and LLVM replaces the division/remainder with a multiply-and-shift operation. Doing the same by hand for the native backend (actually, a slightly better replacement taking advantage of the fact that I know the arguments will always be non-negative) brings its time down to ~2.2 seconds.
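The multiply-and-shift trick can be written by hand; here is a hedged sketch of one standard magic-number division by 10 (my constants, not necessarily the exact replacement used for the timing above). Since 10 * 3435973837 = 2^35 + 2, the quotient is exact for 0 <= n < 2^32:

import Data.Bits (shiftR)
import Data.Word (Word64)

-- n `quot` 10 without a division instruction, valid for 0 <= n < 2^32
quot10 :: Int -> Int
quot10 n = fromIntegral ((fromIntegral n * 3435973837 :: Word64) `shiftR` 35)

rem10 :: Int -> Int
rem10 n = n - 10 * quot10 n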
We're getting closer, but are still a far cry from the C.
That is due to the lists. The code still builds a list of palindromes (and traverses a list of Ints for the two factors).
Since lists cannot contain unboxed elements, there is a lot of boxing and unboxing going on in the code, and that takes time.
So let us eliminate the lists, and take a look at the result of translating the C to Haskell:
module Main (main) where

a :: Int
a = 100

b :: Int
b = 9999

ispali :: Int -> Bool
ispali n = go n 0
  where
    go 0 acc = acc == n
    go m acc = go (m `quot` 10) (acc * 10 + (m `rem` 10))

maxpal :: Int
maxpal = go 0 b
  where
    go mx i
        | i < a     = mx
        | otherwise = go (inner mx b) (i-1)
      where
        inner m j
            | j < a             = m
            | p > m && ispali p = inner p (j-1)
            | otherwise         = inner m (j-1)
          where
            p = i*j

main :: IO ()
main = print maxpal
The nested loop is translated to two nested worker functions, we use an accumulator to store the largest palindrome found so far. Compiled with ghc -O2, that runs in ~0.18 seconds, with ghc -O2 -fllvm it runs in ~0.14 seconds (yes, LLVM is better at optimising loops than the native code generator).
Still not quite there, but a factor of about 2 isn't too bad.
Some may find the following version, where the loop is abstracted out, more readable; the generated core is for all intents and purposes identical (modulo a switch of argument order), and the performance is of course the same:
module Main (main) where

a :: Int
a = 100

b :: Int
b = 9999

ispali :: Int -> Bool
ispali n = go n 0
  where
    go 0 acc = acc == n
    go m acc = go (m `quot` 10) (acc * 10 + (m `rem` 10))

downto :: Int -> Int -> a -> (a -> Int -> a) -> a
downto high low acc fun = go high acc
  where
    go i acc
        | i < low   = acc
        | otherwise = go (i-1) (fun acc i)

maxpal :: Int
maxpal = downto b a 0 $ \m i ->
             downto b a m $ \mx j ->
                 let p = i*j
                 in if mx < p && ispali p then p else mx

main :: IO ()
main = print maxpal
@axblount is at least partly right; the following modification makes the program run almost three times as fast as the original:
maxPalindrome = foldl f 0
  where
    f a x | x > a && pali x = x
          | otherwise       = a

main :: IO ()
main = print . maxPalindrome $ [x * y | x <- nums, y <- nums]
  where
    nums = [9999,9998..100]
That still leaves a factor 60 slowdown, though.
This is more true to what the C code is doing:
maxpali :: [Int] -> Int
maxpali xs = go xs 0
  where
    go []     m = m
    go (x:xs) m = if x > m && pali x then go xs x else go xs m

main :: IO ()
main = print . maxpali $ [x*y | x <- nums, y <- nums]
  where
    nums = [9999,9998..100]
On my box this takes 2 seconds vs .5 for the C version.
Haskell may be storing that entire list [x*y | x <- nums, y <- nums, pali (x*y)], whereas the C solution calculates the maximum on the fly. I'm not sure about this.
Also, the C solution only calls ispali if the product beats the previous maximum. I would bet Haskell checks all products for palindromes, regardless of whether x*y is a possible maximum.
It seems to me that you are having a branch prediction problem. In the C code, you have two nested loops, and as soon as a palindrome is seen in the inner loop, the rest of the inner loop is skipped very fast.
Since you feed a flat list of products instead of the nested loops, I am not sure GHC achieves any of this.
Another way to write this is to use two folds, instead of one fold over the flattened list:
-- foldl g0 0 [x*y | x<-[b-1,b-2..a], y<-[b-1,b-2..a], pali (x*y)]    (A)
-- foldl g1 0 [x*y | x<-[b-1,b-2..a], y<-[b-1,b-2..a]]                (B)
-- foldl g2 0 [[x*y | y<-[b-1,b-2..a]] | x<-[b-1,b-2..a]]             (C)
maxpal b a = foldl f1 0 [b-1,b-2..a]                               -- (D)
  where
    f1 m x = foldl f2 m [b-1,b-2..a]
      where
        f2 m y | p > m && pali p = p
               | otherwise       = m
          where p = x*y

main = print $ maxpal 10000 100
This seems to run much faster than (B) (as in larsmans's answer), too - only 3x-4x slower than the following loops-based code. Fusing the foldl and enumFromThenTo definitions gets us the "functional loops" code (as in Daniel Fischer's answer):
maxpal_loops b a = f (b-1) 0                                       -- (E)
  where
    f x m | x < a     = m
          | otherwise = g (b-1) m
      where
        g y m | y < a           = f (x-1) m
              | p > m && pali p = g (y-1) p
              | otherwise       = g (y-1) m
          where p = x*y
The (C) variant is very suggestive of further algorithmic improvements (that's outside the scope of the original Q of course) that exploit the hidden order in the lists, destroyed by the flattening:
{- foldl g2 0 [[x*y | y<-[b-1,b-2..a]] | x<-[b-1,b-2..a]]             (C)
   foldl g2 0 [[x*y | y<-[x,x-1..a]]   | x<-[b-1,b-2..a]]             (C1)
   foldl g0 0 [safehead 0 . filter pali $
                  [x*y | y<-[x,x-1..a]] | x<-[b-1,b-2..a]]            (C2)
   fst $ until ... (\(m,s) -> (max m .
                    safehead 0 . filter pali . takeWhile (> m) $
                    head s, tail s))
               (0, [[x*y | y<-[x,x-1..a]] | x<-[b-1,b-2..a]])         (C3)
   safehead 0 $ filter pali $ mergeAllDescending
                  [[x*y | y<-[x,x-1..a]] | x<-[b-1,b-2..a]]           (C4)
-}
(C3) can stop as soon as the head x*y in a sub-list is smaller than the currently found maximum. That is what short-cutting functional loops code could achieve, but not (C4), which is guaranteed to find the maximal palindromic number first. Plus, for list-based code the algorithmic nature is more visually apparent, IMO.
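The sketches above assume helpers that are not in base: safehead, and mergeAllDescending (which would be a descending-order analogue of mergeAll from the data-ordlist package). A minimal safehead:

safehead :: a -> [a] -> a
safehead d []    = d
safehead _ (x:_) = x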

Haskell solution for InterviewStreet String Similarity challenge

This is my best attempt to solve the String Similarity challenge for InterviewStreet.
import Control.Monad
import Data.Text as T
import qualified Data.Text.IO as TIO

sumSimilarities s = T.length s + sum (Prelude.map (similarity s) (Prelude.tail (tails s)))

similarity :: Text -> Text -> Int
similarity a b = case commonPrefixes a b of
                   Just (x,_,_) -> T.length x
                   Nothing      -> 0

main = do
    cases <- fmap read getLine
    inputs <- replicateM cases TIO.getLine
    forM_ inputs $ print . sumSimilarities
It only passes 7/10 of the test cases. Test cases 7, 8, and 9 fail because they exceed the allotted execution time.
I'm half trying to verify that this is indeed possible to solve in Haskell and half looking for what an optimized Haskell program looks like.
Thanks!
Tyler
Like user5402, I'd be curious whether an equivalent (for certain values of equivalent) C programme would finish within the time limit or also time out. If it would, it would be interesting to see whether an equivalent programme using ByteStrings could finish in time. Not that ByteStrings are per se faster than Text, but since the input must be converted to the internal representation of Text while ByteString takes it as is, that might make a difference. Another possible reason that ByteStrings might be faster - if the testing machines have 32-bit GHCs - is that text's fusion at least used to need more registers than are generally available on 32-bit architectures to get the full benefit. (A long time ago, in the days of text-0.5 to text-0.7, on my 32-bit box, bytestring used to be quite a bit faster; no idea whether that still holds for newer text versions.)
Okay, since user5402 has verified that the naïve algorithm is fast enough in C, I've gone ahead and written an implementation of the naïve algorithm using ByteStrings:
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C
import qualified Data.ByteString.Unsafe as U
import Control.Monad
import Data.Word

main :: IO ()
main = do
    cl <- C.getLine
    case C.readInt cl of
      Just (cases,_) -> replicateM_ cases (C.getLine >>= print . similarity)
      Nothing        -> return ()

-- Just to keep the condition readable.
(?) :: B.ByteString -> Int -> Word8
(?) = U.unsafeIndex

similarity :: B.ByteString -> Int
similarity bs
    | len == 0  = 0
    | otherwise = go len 1
  where
    !len = B.length bs
    go !acc i
        | i < len   = go (acc + prf 0 i) (i+1)
        | otherwise = acc
    prf !k j
        | j < len && bs ? k == bs ? j = prf (k+1) (j+1)
        | otherwise                   = k
and compared it to the OP's Text version on some bad cases. On my box, that is more than four times faster than the Text version, so it'd be interesting whether that's fast enough (the C version is another 4.5 times faster, so it may well not be).
However, I consider it more likely that the time limit is exceeded due to using the naïve algorithm that has quadratic worst-case behaviour. Probably there are test cases that evoke the worst-case for the naïve algorithm.
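For instance (my illustration, not from the answer), a string of n identical characters is such a worst case: every suffix then shares a maximal common prefix with the whole string, so the naïve scan performs n-1 + n-2 + ... + 1, i.e. about n²/2, successful character comparisons.

-- hypothetical generator for a quadratic-time input to the naïve version
badCase :: Int -> C.ByteString
badCase n = C.replicate n 'a'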
So the solution would be to use an algorithm that scales better, optimally linear. One linear algorithm to compute the similarity of a string is the Z-algorithm.
The idea is simple (but, like most good ideas, not easy to have). Let us call a (non-empty) substring that is also a prefix of the string a prefix-substring. To avoid recomputation, the algorithm uses a window of the prefix-substring starting before the currently considered index that extends farthest to the right (initially, the window is empty).
Variables used and invariants of the algorithm:
- i, the index under consideration, starts at 1 (for 0-based indexing; the entire string is not considered) and is incremented up to length - 1.
- left and right, the first and last index of the prefix-substring window; invariants:
    left < i, left <= right < length(S), either left > 0 or right < 1;
    if left > 0, then S[left .. right] is the maximal common prefix of S and S[left .. ];
    if 1 <= j < i and S[j .. k] is a prefix of S, then k <= right.
- An array Z; invariant: for 1 <= k < i, Z[k] contains the length of the longest common prefix of S[k .. ] and S.
The algorithm:
1. Set i = 1, left = right = 0 (any values with left <= right < 1 are allowed), and set Z[j] = 0 for all indices 1 <= j < length(S).
2. If i == length(S), stop.
3. If i > right, find the length l of the longest common prefix of S and S[i .. ] and store it in Z[i]. If l > 0, we have found a window extending farther right than the previous one, so set left = i and right = i+l-1; otherwise leave them unchanged. Increment i and go to 2.
4. Here left < i <= right, so the substring S[i .. right] is known - since S[left .. right] is a prefix of S, it is equal to S[i-left .. right-left].
   Now consider the longest common prefix of S with the substring starting at index i - left. Its length is Z[i-left], hence S[k] = S[i-left + k] for 0 <= k < Z[i-left] and S[Z[i-left]] ≠ S[i-left + Z[i-left]].
   If Z[i-left] <= right-i, then i + Z[i-left] is inside the known window, therefore
       S[i + Z[i-left]] = S[i-left + Z[i-left]] ≠ S[Z[i-left]]
       S[i + k] = S[i-left + k] = S[k] for 0 <= k < Z[i-left]
   and we see that the longest common prefix of S and S[i .. ] has length Z[i-left]. Set Z[i] = Z[i-left], increment i, and go to 2.
   Otherwise, S[i .. right] is a prefix of S and we check how far it extends, starting the comparison of characters at the indices right+1 and right+1 - i. Let the length be l. Set Z[i] = l, left = i, right = i + l - 1, increment i, and go to 2.
Since the window never moves left, and the comparisons always start after the end of the window, each character in the string is compared at most once successfully to an earlier character in the string, and for each starting index, there is at most one unsuccessful comparison, therefore the algorithm is linear.
The code (using ByteString out of habit, ought to be trivially portable to Text):
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C
import qualified Data.ByteString.Unsafe as U
import Data.Array.ST
import Data.Array.Base
import Control.Monad.ST
import Control.Monad
import Data.Word

main :: IO ()
main = do
    cl <- C.getLine
    case C.readInt cl of
      Just (cases,_) -> replicateM_ cases (C.getLine >>= print . similarity)
      Nothing        -> return ()

-- Just to keep the condition readable.
(?) :: B.ByteString -> Int -> Word8
(?) = U.unsafeIndex

-- Calculate the similarity of a string using the Z-algorithm
similarity :: B.ByteString -> Int
similarity bs
    | len == 0  = 0
    | otherwise = runST getSim
  where
    !len = B.length bs
    getSim = do
        za <- newArray (0,len-1) 0 :: ST s (STUArray s Int Int)
        -- The common prefix of the string with itself is the entire string.
        unsafeWrite za 0 len
        let -- Find the length of the common prefix.
            go !k j
                | j < len && (bs ? j == bs ? k) = go (k+1) (j+1)
                | otherwise = return k
            -- The window with indices in [left .. right] is the prefix-substring
            -- starting before i that extends farthest.
            loop !left !right i
                | i >= len  = count 0 0 -- when done, sum
                | i > right = do
                    -- We're outside the window; simply find the length
                    -- of the common prefix and store it in the Z-array.
                    w <- go 0 i
                    unsafeWrite za i w
                    if w > 0
                      -- We got a non-empty common prefix and a new window.
                      then loop i (i+w-1) (i+1)
                      -- No new window, same procedure at next index.
                      else loop left right (i+1)
                | otherwise = do
                    -- We're inside the window, so the substring starting at
                    -- (i - left) has a common prefix with the substring
                    -- starting at i of length at least (right - i + 1)
                    -- (since the [left .. right] window is a prefix of bs).
                    -- But we already know how long the common prefix
                    -- starting at (i - left) is.
                    z <- unsafeRead za (i-left)
                    let !s = right-i+1 -- length of known prefix starting at i
                    if z < s
                      -- If the common prefix of the substring starting at
                      -- (i - left) is shorter than the rest of the window,
                      -- the common prefix of the substring starting at i
                      -- is the same. Store it and move on with the same window.
                      then do
                        unsafeWrite za i z
                        loop left right (i+1)
                      else do
                        -- Otherwise, find out how far the common prefix
                        -- extends, starting at (right + 1) == s + i.
                        w <- go s (s+i)
                        unsafeWrite za i w
                        loop i (i+w-1) (i+1)
            count !acc i
                | i == len  = return acc
                | otherwise = do
                    n <- unsafeRead za i
                    count (acc+n) (i+1)
        loop 0 0 1
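A quick sanity check (my example, using the well-known sample "ababaa", whose suffix similarities are 6, 0, 3, 0, 1, 1):

-- in GHCi, with the module above loaded:
-- >>> similarity (C.pack "ababaa")
-- 11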

Longest non-decreasing subsequence in Haskell is slow. How to improve?

longest'inc'subseq seq = maximum dp
  where
    dp = 1 : [val n | n <- [1 .. length seq - 1]]
    val n = (1 +) . filter'and'get'max ((<= top) . (seq !!)) $ [0 .. pred n]
      where top = seq !! n
    -----
    filter'and'get'max f []  = 0
    filter'and'get'max f [x] = if f x then dp !! x else 0
    filter'and'get'max f (x:xs) = if f x then (if vx > vxs then vx else vxs) else vxs
      where
        vx  = dp !! x
        vxs = filter'and'get'max f xs
This takes about 1-2 s with a seq of length 1000, while the Python version returns almost immediately. The Python code:
def longest(s):
    dp = [0]*len(s)
    dp[0] = 1
    for i in range(1, len(s)):
        need = 0
        for j in range(0, i):
            if s[j] <= s[i] and dp[j] > need:
                need = dp[j]
        dp[i] = need + 1
    return max(dp)
And when the length of seq is 10000, the Haskell program runs for a very long time, while Python returns the answer after 10-15 s.
Can we improve the Haskell speed?
Your core problem is that you're using the wrong data structure in Haskell for this algorithm. You've translated an algorithm that depends on O(1) lookups on a sequence (as in your Python code), into one that does O(n) lookups on a list in Haskell.
Use like-for-like data structures, and then your complexity problems will take care of themselves. In this case, it means using something like Data.Vector.Unboxed to represent the sequence, which has O(1) indexing, as well as low constant overheads in general.
With nothing more than a really mindless wrapping of your lists into Vectors I get 2.5 seconds when the input list is [1..10000].
import qualified Data.Vector as V
import Data.Vector (Vector, (!))

main = print $ liss [0..10000]

liss :: [Int] -> Int
liss seqL = V.maximum dp
  where
    dp  = V.fromList $ 1 : [val n | n <- [1 .. length seqL - 1]]
    seq = V.fromList seqL
    val n = (1 +) . filter'and'get'max ((<= top) . (seq !)) $ [0 .. pred n]
      where top = seq ! n
    -----
    filter'and'get'max :: (Int -> Bool) -> [Int] -> Int
    filter'and'get'max f []  = 0
    filter'and'get'max f [x] = if f x then dp ! x else 0
    filter'and'get'max f (x:xs) = if f x then (if vx > vxs then vx else vxs) else vxs
      where
        vx  = dp ! x
        vxs = filter'and'get'max f xs
The compilation and execution:
tommd#Mavlo:Test$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.0.3
tommd#Mavlo:Test$ ghc -O2 so.hs
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
tommd#Mavlo:Test$ time ./so
10001
real 0m2.536s
user 0m2.528s
A worker-wrapper transformation on filter'and'get'max seems to shave off another second.
Also, I don't understand why you need that middle case (filter'and'get'max f [x]); shouldn't it work fine without it? I guess it changes the result if dp!x < 0. Note that eliminating it saves 0.3 seconds right there.
And the Python code you provided takes ~10.7 seconds (I added a call of longest(range(1,10000))):
tommd@Mavlo:Test$ time python so.py
real 0m10.745s
user 0m10.729s
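As the answer suggests, Data.Vector.Unboxed avoids the boxing overhead entirely. A hedged sketch (my code, not from the answer) of the same quadratic DP using constructN, which hands each step the already-built prefix of dp:

import qualified Data.Vector.Unboxed as U

liss :: U.Vector Int -> Int
liss s
    | U.null s  = 0
    | otherwise = U.maximum dp
  where
    dp = U.constructN (U.length s) step
    step pre
        | U.null pre = 1
        | otherwise  = 1 + U.maximum (U.cons 0 best)
      where
        i    = U.length pre
        -- dp values at positions j < i whose element does not exceed s!i
        best = U.ifilter (\j _ -> s U.! j <= s U.! i) pre

main :: IO ()
main = print $ liss (U.enumFromTo 0 10000)  -- expect 10001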
