I am trying to find frequency of characters in file using Haskell. I want to be able to handle files ~500MB size.
What I've tried till now
It does the job but is a bit slow as it parses the file 256 times
calculateFrequency :: L.ByteString -> [(Word8, Int64)]
calculateFrequency f = foldl (\acc x -> (x, L.count x f):acc) [] [255, 254.. 0]
I have also tried using Data.Map but the program runs out of memory (in ghc interpreter).
import qualified Data.ByteString.Lazy as L
import qualified Data.Map as M
calculateFrequency' :: L.ByteString -> [(Word8, Int64)]
calculateFrequency' xs = M.toList $ L.foldl' (\m word -> M.insertWith (+) word 1 m) (M.empty) xs
Here's an implementation using mutable, unboxed vectors instead of higher level constructs. It also uses conduit for reading the file to avoid lazy I/O.
import Control.Monad.IO.Class
import qualified Data.ByteString as S
import Data.Conduit
import Data.Conduit.Binary as CB
import qualified Data.Conduit.List as CL
import qualified Data.Vector.Unboxed.Mutable as VM
import Data.Word (Word8)
type Freq = VM.IOVector Int
newFreq :: MonadIO m => m Freq
newFreq = liftIO $ VM.replicate 256 0
printFreq :: MonadIO m => Freq -> m ()
printFreq freq =
liftIO $ mapM_ go [0..255]
where
go i = do
x <- VM.read freq i
putStrLn $ show i ++ ": " ++ show x
addFreqWord8 :: MonadIO m => Freq -> Word8 -> m ()
addFreqWord8 f w = liftIO $ do
let index = fromIntegral w
oldCount <- VM.read f index
VM.write f index (oldCount + 1)
addFreqBS :: MonadIO m => Freq -> S.ByteString -> m ()
addFreqBS f bs =
loop (S.length bs - 1)
where
loop (-1) = return ()
loop i = do
addFreqWord8 f (S.index bs i)
loop (i - 1)
-- | The main entry point.
main :: IO ()
main = do
freq <- newFreq
runResourceT
$ sourceFile "random"
$$ CL.mapM_ (addFreqBS freq)
printFreq freq
I ran this on 500MB of random data and compared with #josejuan's UArray-based answer:
conduit based/mutable vectors: 1.006s
UArray: 17.962s
I think it should be possible to keep much of the elegance of josejuan's high-level approach yet keep the speed of the mutable vector implementation, but I haven't had a chance to try implementing something like that yet. Also, note that with some general purpose helper functions (like Data.ByteString.mapM or Data.Conduit.Binary.mapM) the implementation could be significantly simpler without affecting performance.
You can play with this implementation on FP Haskell Center as well.
EDIT: I added one of those missing functions to conduit and cleaned up the code a bit; it now looks like the following:
import Control.Monad.Trans.Class (lift)
import Data.ByteString (ByteString)
import Data.Conduit (Consumer, ($$))
import qualified Data.Conduit.Binary as CB
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as VM
import System.IO (stdin)
freqSink :: Consumer ByteString IO (V.Vector Int)
freqSink = do
freq <- lift $ VM.replicate 256 0
CB.mapM_ $ \w -> do
let index = fromIntegral w
oldCount <- VM.read freq index
VM.write freq index (oldCount + 1)
lift $ V.freeze freq
main :: IO ()
main = (CB.sourceHandle stdin $$ freqSink) >>= print
The only difference in functionality is how the frequency is printed.
#Alex answer is good but, with only 256 values (indexes) an array should be better
import qualified Data.ByteString.Lazy as L
import qualified Data.Array.Unboxed as A
import qualified Data.ByteString as B
import Data.Int
import Data.Word
fq :: L.ByteString -> A.UArray Word8 Int64
fq = A.accumArray (+) 0 (0, 255) . map (\c -> (c, 1)) . concat . map B.unpack . L.toChunks
main = L.getContents >>= print . fq
#alex code take (for my sample file) 24.81 segs, using array take 7.77 segs.
UPDATED:
although Snoyman solution is better, an improvement avoiding unpack maybe
fq :: L.ByteString -> A.UArray Word8 Int64
fq = A.accumArray (+) 0 (0, 255) . toCounterC . L.toChunks
where toCounterC [] = []
toCounterC (x:xs) = toCounter x (B.length x) xs
toCounter _ 0 xs = toCounterC xs
toCounter x i xs = (B.index x i', 1): toCounter x i' xs
where i' = i - 1
with ~50% speedup.
UPDATED:
Using IOVector as Snoyman is as Conduit version (a bit faster really, but this is a raw code, better use Conduit)
import Data.Int
import Data.Word
import Control.Monad.IO.Class
import qualified Data.ByteString.Lazy as L
import qualified Data.Array.Unboxed as A
import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed.Mutable as V
fq :: L.ByteString -> IO (V.IOVector Int64)
fq xs =
do
v <- V.replicate 256 0 :: IO (V.IOVector Int64)
g v $ L.toChunks xs
return v
where g v = toCounterC
where toCounterC [] = return ()
toCounterC (x:xs) = toCounter x (B.length x) xs
toCounter _ 0 xs = toCounterC xs
toCounter x i xs = do
let i' = i - 1
w = fromIntegral $ B.index x i'
c <- V.read v w
V.write v w (c + 1)
toCounter x i' xs
main = do
v <- L.getContents >>= fq
mapM_ (\i -> V.read v i >>= liftIO . putStr . (++", ") . show) [0..255]
This works for me on my computer:
module Main where
import qualified Data.HashMap.Strict as M
import qualified Data.ByteString.Lazy as L
import Data.Word
import Data.Int
calculateFrequency :: L.ByteString -> [(Word8, Int64)]
calculateFrequency xs = M.toList $ L.foldl' (\m word -> M.insertWith (+) word 1 m) M.empty xs
main = do
bs <- L.readFile "E:\\Steam\\SteamApps\\common\\Sid Meier's Civilization V\\Assets\\DLC\\DLC_Deluxe\\Behind the Scenes\\Behind the Scenes.wmv"
print (calculateFrequency bs)
Doesn't run out of memory, or even load the whole file in, but takes forever (about a minute) on 600mb+ files! I compiled this using ghc 7.6.3.
I should point out that the code is basically identical save for the strict HashMap instead of the lazy Map.
Note that insertWith is twice as fast with HashMap than Map in this case. On my machine, the code as written executes in 54 seconds, while the version using Map takes 107.
My two cents (using an STUArray). Can't compare it to other solutions here. Someone might be willing to try it...
module Main where
import Data.Array.ST (runSTUArray, newArray, readArray, writeArray)
import Data.Array.Unboxed (UArray)
import qualified Data.ByteString.Lazy as L (ByteString, unpack, getContents)
import Data.Word
import Data.Int
import Control.Monad (forM_)
calculateFrequency :: L.ByteString -> UArray Word8 Int64
calculateFrequency bs = runSTUArray $ do
a <- newArray (0, 255) 0
forM_ (L.unpack bs) $ \i -> readArray a i >>= writeArray a i . succ
return a
main = L.getContents >>= print . calculateFrequency
Related
I followed Simon Marlow's book on parallel Haskell (Chapter 1) using rpar/rseq .
Below is the code (Solving the Squid Game bridge simulation):
{-# LANGUAGE FlexibleContexts #-}
import Control.DeepSeq (force)
import Control.Exception (evaluate)
import Control.Parallel.Strategies
import Data.Array.IO
( IOUArray,
getAssocs,
newListArray,
readArray,
writeArray,
)
import Data.Functor ((<&>))
import System.Environment (getArgs)
import System.Random (randomRIO)
game ::
Int -> -- number of steps
Int -> -- number of glass at each step
Int -> -- number of players
IO Int -- return the number of survivors
game totalStep totalGlass = go 1 totalGlass
where
go currentStep currentGlass numSurvivors
| numSurvivors == 0 || currentStep > totalStep = return numSurvivors
| otherwise = do
r <- randomRIO (1, currentGlass)
if r == 1
then go (currentStep + 1) totalGlass numSurvivors
else go currentStep (currentGlass - 1) (numSurvivors - 1)
simulate :: Int -> IO Int -> IO [(Int, Int)]
simulate n game =
(newListArray (0, 16) (replicate 17 0) :: IO (IOUArray Int Int))
>>= go 1
>>= getAssocs
where
go i marr
| i <= n = do
r <- game
readArray marr r >>= writeArray marr r . (+ 1)
go (i + 1) marr
| otherwise = return marr
main1 :: IO ()
main1 = do
[n, steps, glassNum, playNum] <- getArgs <&> Prelude.map read
res <- simulate n (game steps glassNum playNum)
mapM_ print res
main2 :: IO ()
main2 = do
putStrLn "Running main2"
[n, steps, glassNum, playNum] <- getArgs <&> Prelude.map read
res <- runEval $ do
r1 <- rpar $ simulate (div n 2) (game steps glassNum playNum) >>= evaluate . force
r2 <- rpar $ simulate (div n 2) (game steps glassNum playNum) >>= evaluate . force
rseq r1
rseq r2
return $
(\l1 l2 -> zipWith (\e1 e2 -> (fst e1, snd e1 + snd e2)) l1 l2)
<$> r1
<*> r2
mapM_ print res
main = main2
For main2, I've compiled using:
ghc -O2 -threaded ./squid.hs
and run as:
./squid 10000000 18 2 16 +RTS -N2
I can't understand why main1 is faster than main2 while main2 has parallelism in it.
Could anyone give me some comments on my code as to whether this is the correct use of parallelism?
Update:
Here's the updated version (the new random is quite cumbersome to use):
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE RankNTypes #-}
import Control.Monad.ST (ST, runST)
import Control.Parallel.Strategies (rpar, rseq, runEval)
import Data.Array.ST
( STUArray,
getAssocs,
newListArray,
readArray,
writeArray,
)
import Data.Functor ((<&>))
import System.Environment (getArgs)
import System.Random (StdGen)
import System.Random.Stateful
( StdGen,
applySTGen,
mkStdGen,
runSTGen,
uniformR,
)
game ::
Int -> -- number of steps
Int -> -- number of glass at each step
Int -> -- number of players
StdGen ->
ST s (Int, StdGen) -- return the number of survivors
game ns ng = go 1 ng
where
go
!cs -- current step number
!cg -- current glass number
!ns -- number of survivors
!pg -- pure generator
| ns == 0 || cs > ns = return (ns, pg)
| otherwise = do
let (r, g') = runSTGen pg (applySTGen (uniformR (1, cg)))
if r == 1
then go (cs + 1) ng ns g'
else go cs (cg - 1) (ns - 1) g'
simulate :: Int -> (forall s. StdGen -> ST s (Int, StdGen)) -> [(Int, Int)]
simulate n game =
runST $
(newListArray (0, 16) (replicate 17 0) :: ST s1 (STUArray s1 Int Int))
>>= go 1 (mkStdGen n)
>>= getAssocs
where
go !i !g !marr
| i <= n = do
(r, g') <- game g
readArray marr r >>= writeArray marr r . (+ 1)
go (i + 1) g' marr
| otherwise = return marr
main :: IO ()
main = do
[n, steps, glassNum, playNum] <- getArgs <&> Prelude.map read
let res = runEval $ do
r1 <- rpar $ simulate (div n 2 - 1) (game steps glassNum playNum)
r2 <- rpar $ simulate (div n 2 + 1) (game steps glassNum playNum)
rseq r1
rseq r2
return $ zipWith (\e1 e2 -> (fst e1, snd e1 + snd e2)) r1 r2
mapM_ print res
Update 2:
Use pure code and the elapsed time is down to 7 seconds.
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE RankNTypes #-}
import Control.Monad.ST ( runST, ST )
import Control.Parallel ( par, pseq )
import Data.Array.ST
( getAssocs, newListArray, readArray, writeArray, STUArray )
import Data.Functor ((<&>))
import System.Environment (getArgs)
import System.Random (StdGen, uniformR, mkStdGen)
game ::
Int -> -- number of total steps
Int -> -- number of glass at each step
Int -> -- number of players
StdGen ->
(Int, StdGen) -- return the number of survivors
game ts ng = go 1 ng
where
go
!cs -- current step number
!cg -- current glass number
!ns -- number of survivors
!pg -- pure generator
| ns == 0 || cs > ts = (ns, pg)
| otherwise = do
let (r, g') = uniformR (1, cg) pg
if r == 1
then go (cs + 1) ng ns g'
else go cs (cg - 1) (ns - 1) g'
simulate :: Int -> (StdGen -> (Int, StdGen)) -> [(Int, Int)]
simulate n game =
runST $
(newListArray (0, 16) (replicate 17 0) :: ST s1 (STUArray s1 Int Int))
>>= go 1 (mkStdGen n)
>>= getAssocs
where
go !i !g !marr
| i <= n = do
let (r, g') = game g
readArray marr r >>= writeArray marr r . (+ 1)
go (i + 1) g' marr
| otherwise = return marr
main :: IO ()
main = do
[n, steps, glassNum, playNum] <- getArgs <&> Prelude.map read
let r1 = simulate (div n 2 - 1) (game steps glassNum playNum)
r2 = simulate (div n 2 + 1) (game steps glassNum playNum)
res = zipWith (\e1 e2 -> (fst e1, snd e1 + snd e2)) r1 r2
res' = par r1 (pseq r2 res)
mapM_ print res'
You aren't actually using any parallelism. You write
r1 <- rpar $ simulate (div n 2) (game steps glassNum playNum) >>= evaluate . force
This sparks a thread to evaluate an IO action, not to run it. That's not useful.
Since your simulate is essentially pure, you should convert it from IO to ST s by swapping in the appropriate array types, etc. Then you can rpar (runST $ simulate ...) and actually do work in parallel. I don't think the force invocations are useful/appropriate in context; they'll free the arrays sooner, but at significant cost.
I'd like to be able to use O(1) amortized addressing with a vector type that grows lazily according to the demanded index.
This could be achieved by using pairing an MVector (PrimState m) a:
with a PrimRef m [a] to hold the remainder, using the standard doubling-algorithm for amoritzed O(1) access:
{-# LANGUAGE ExistentialQuantification #-}
module LazyVec where
import Control.Monad.Primitive
import Data.PrimRef
import Data.Vector.Mutable (MVector)
import qualified Data.Vector.Mutable as M
import Data.Vector (fromList, thaw)
import Control.Monad (forM_)
data LazyVec m a = PrimMonad m => LazyVec (MVector (PrimState m) a) (PrimRef m [a])
-- prime the LazyVec with the first n elements
lazyFromListN :: PrimMonad m => Int -> [a] -> m (LazyVec m a)
lazyFromListN n xs = do
let (as,bs) = splitAt n xs
mvec <- thaw $ fromList as
mref <- newPrimRef bs
return $ LazyVec mvec mref
-- look up the i'th element
lazyIndex :: PrimMonad m => Int -> LazyVec m a -> m a
lazyIndex i lv#(LazyVec mvec mref) | i < 0 = error "negative index"
| i < n = M.read mvec i
| otherwise = do
xs <- readPrimRef mref
if null xs
then error "index out of range"
else do
-- expand the mvec by some power of 2
-- so that it includes the i'th index
-- or ends
let n' = n * 2 ^ ( 1 + floor (logBase 2 (toEnum (i `div` n))))
let growth = n' - n
let (as, bs) = splitAt growth xs
M.grow mvec $ if null bs then length as else growth
forM_ (zip [n,n+1..] as) . uncurry $ M.write mvec
writePrimRef mref bs
lazyIndex i lv
where n = M.length mvec
And I could just use my code - but I'd rather reuse someone else's (for one, I haven't tested mine).
Does a vector type with these semantics (lazy creation from a possibly-infinite list, O(1) amortized access) exist in some package?
As Jake McArthur noted in the comments: "If it's just a function, then I recommend just using one of the existing memoization packages like MemoTrie or data-memocombinators. They should make it easy."
I would like to convert an IO Int to Int from System.Random.MWC, using unsafePerformIO. It does work in ghci:
Prelude System.Random.MWC System.IO.Unsafe> let p = unsafePerformIO(uniformR (0, 30) gen :: IO Int)
Prelude System.Random.MWC System.IO.Unsafe> p
11
Prelude System.Random.MWC System.IO.Unsafe> :t p
p :: Int
However in GHC
import System.Random.MWC
import System.IO.Unsafe
main :: IO()
main = do
gen <-createSystemRandom
print $! s 30 gen
s :: Int-> GenIO -> Int
s !k g = unsafePerformIO(uniformR (0, k - 1) g)
it returns
ghc: panic! (the 'impossible' happened)
(GHC version 7.6.3 for i386-unknown-linux):
make_exp (App _ (Coercion _))
Please report this as a GHC bug: http://www.haskell.org/ghc/reportabug
There's really no need for unsafePerformIO here. Just change the type of s to return IO Int and use do-notation or the bind operator to feed the result to print.
s :: Int -> GenIO -> IO Int
s k g = uniformR (0, k - 1) g
main :: IO ()
main = do
gen <- createSystemRandom
x <- s 30 gen
print x
or
main = do
gen <- createSystemRandom
print =<< s 30 gen
or
main = print =<< s 30 =<< createSystemRandom
I'm attempting to run multiple downloads in parallel in Haskell, which I would normally just use the Control.Concurrent.Async.mapConcurrently function for. However, doing so opens ~3000 connections, which causes the web server to reject them all. Is it possible to accomplish the same task as mapConcurrently, but only have a limited number of connections open at a time (i.e. only 2 or 4 at a time)?
A quick solution would be to use a semaphore to restrict the number of concurrent actions. It's not optimal (all threads are created at once and then wait), but works:
import Control.Concurrent.MSem
import Control.Concurrent.Async
import Control.Concurrent (threadDelay)
import qualified Data.Traversable as T
mapPool :: T.Traversable t => Int -> (a -> IO b) -> t a -> IO (t b)
mapPool max f xs = do
sem <- new max
mapConcurrently (with sem . f) xs
-- A little test:
main = mapPool 10 (\x -> threadDelay 1000000 >> print x) [1..100]
You may also try the pooled-io package where you can write:
import qualified Control.Concurrent.PooledIO.Final as Pool
import Control.DeepSeq (NFData)
import Data.Traversable (Traversable, traverse)
mapPool ::
(Traversable t, NFData b) =>
Int -> (a -> IO b) -> t a -> IO (t b)
mapPool n f = Pool.runLimited n . traverse (Pool.fork . f)
This is really easy to do using the Control.Concurrent.Spawn library:
import Control.Concurrent.Spawn
type URL = String
type Response = String
numMaxConcurrentThreads = 4
getURLs :: [URL] -> IO [Response]
getURLs urlList = do
wrap <- pool numMaxConcurrentThreads
parMapIO (wrap . fetchURL) urlList
fetchURL :: URL -> IO Response
Chunking the threads may be inefficient if a few of them last significantly longer than the others. Here is a smoother, yet more complex, solution:
{-# LANGUAGE TupleSections #-}
import Control.Concurrent.Async (async, waitAny)
import Data.List (delete, sortBy)
import Data.Ord (comparing)
concurrentlyLimited :: Int -> [IO a] -> IO [a]
concurrentlyLimited n tasks = concurrentlyLimited' n (zip [0..] tasks) [] []
concurrentlyLimited' _ [] [] results = return . map snd $ sortBy (comparing fst) results
concurrentlyLimited' 0 todo ongoing results = do
(task, newResult) <- waitAny ongoing
concurrentlyLimited' 1 todo (delete task ongoing) (newResult:results)
concurrentlyLimited' n [] ongoing results = concurrentlyLimited' 0 [] ongoing results
concurrentlyLimited' n ((i, task):otherTasks) ongoing results = do
t <- async $ (i,) <$> task
concurrentlyLimited' (n-1) otherTasks (t:ongoing) results
Note : the above code could be made more generic using an instance of MonadBaseControl IO in place of IO, thanks to lifted-async.
If you have actions in a list, this one has less dependencies
import Control.Concurrent.Async (mapConcurrently)
import Data.List.Split (chunksOf)
mapConcurrentChunks :: Int -> (a -> IO b) -> [a] -> IO [b]
mapConcurrentChunks n ioa xs = concat <$> mapM (mapConcurrently ioa) (chunksOf n xs)
Edit: Just shortened a bit
What is this error?
1.hs:41:30:
Couldn't match expected type Eval [a]' against inferred type()
module Main where
import Control.Parallel(par,pseq)
import Text.Printf
import Control.Exception
import System.CPUTime
import Data.List
import IO
import Data.Char
import Control.DeepSeq
import Control.Parallel.Strategies
--Calcula o tempo entre o inicio e o fim de rodagem do programa
time :: IO t -> IO t
time a = do
start <- getCPUTime
v <- a
end <- getCPUTime
let diff = (fromIntegral (end - start)) / (10^12)
printf "Computation time: %0.3f sec\n" (diff :: Double)
return v
learquivo :: FilePath -> IO ([[Int]])
learquivo s = do
conteudo <- readFile s
return (read conteudo)
main :: IO ()
main = do
t1 <- getCPUTime
conteudo <- learquivo "list.txt"
let !mapasort = (map qsort conteudo) `using` (parList rdeepseq)
t2 <- getCPUTime
let difft2t1 = (fromIntegral (t2 -t1)) / (10^12)
printf "Computation time %0.3f sec" (difft2t1 :: Double)
qsort [] = []
qsort [x] = [x]
qsort (x:xs) =
` losort ++ (x:hisort) `using` strategy `
where
losort = qsort [y|y <- xs, y < x]
hisort = qsort [y|y <- xs, y >= x]
strategy result = rnf losort `par`
rnf hisort `pseq`
rnf result
Probably the problem is that you are using rnf from the Control.Deepseq
rnf :: (NFData a) => a -> ()
which is coincidentally the strategy in terms of parallel < 2.2:
type Strategy a = a -> () -- strategy type in early parallel package
but since version 2.2 of parallel, strategies have a different type:
type Strategy a = a -> Eval a
P.S. Latest version of parallel is 3.1.0.1. You may consider reading the complete history of API revisions. As far as I understand, the latest API version is explained in Seq no More: Better Strategies for Parallel Haskell paper.