Where to find chunks in haskell? - haskell

I'm trying to follow this tutorial: https://wiki.haskell.org/Tutorials/Programming_Haskell/String_IO.
In the last part 7 Extension: using SMP parallelism I copy the code but it fails to compile with this error message
/home/dhilst/parallelspell.hs:13:20: error:
Variable not in scope: chunk :: Int -> [String] -> t
I searched for chunks at Hoogle and got Data.Text.Internal.Lazy, but this seems to be an internal module. And I couldn't import it anyway.
Here is the code:
import Data.Set hiding (map)
import Data.Maybe
import Data.Char
import Text.Printf
import System.IO
import System.Environment
import Control.Concurrent
import Control.Monad
main = do
(f,g,n) <- readFiles
let dict = fromList (lines f)
work = chunk n (words g)
run n dict work
run n dict work = do
chan <- newChan
errs <- getChanContents chan
mapM_ (forkIO . thread chan dict) (zip [1..n] work)
wait n errs 0
wait n xs i = when (i < n) $ case xs of
Nothing : ys -> wait n ys $! i+1
Just s : ys -> putStrLn s >> wait n ys i
thread chan dict (me,xs) = do
mapM_ spellit xs
writeChan chan Nothing
where spellit w = when (spell dict w) $
writeChan chan . Just $ printf "Thread %d: %-25s" (me::Int) w
spell d w = w `notMember` d
readFiles = do
[s,n] <- getArgs
f <- readFile "/usr/share/dict/words"
g <- readFile s
return (f,g, read n)
And here is the compilation line:
ghc -O --make -threaded parallelspell.hs
--
Update: I write my own version of chunk based on this quest:How to partition a list in Haskell?
chunk :: Int -> [a] -> [[a]]
chunk _ [] = []
chunk n xs = (take n xs) : (chunk n (drop n xs))
Still, does this means that the tutorial that I'm following is very old and out of date!? Can anyone confirm if that function already existed some day or if I'm missing something?
Regards,

Looks like the tutorial just forgot to define chunk. I encourage you to update the wiki to include a suitable definition.

Related

How do I recover lazy evaluation of a monadically constructed list, after switching from State to StateT?

With the following code:
(lazy_test.hs)
-- Testing lazy evaluation of monadically constructed lists, using State.
import Control.Monad.State
nMax = 5
foo :: Int -> State [Int] Bool
foo n = do
modify $ \st -> n : st
return (n `mod` 2 == 1)
main :: IO ()
main = do
let ress = for [0..nMax] $ \n -> runState (foo n) []
sts = map snd $ dropWhile (not . fst) ress
print $ head sts
for = flip map
I can set nMax to 5, or 50,000,000, and I get approximately the same run time:
nMax = 5:
$ stack ghc lazy_test.hs
[1 of 1] Compiling Main ( lazy_test.hs, lazy_test.o )
Linking lazy_test ...
$ time ./lazy_test
[1]
real 0m0.019s
user 0m0.002s
sys 0m0.006s
nMax = 50,000,000:
$ stack ghc lazy_test.hs
[1 of 1] Compiling Main ( lazy_test.hs, lazy_test.o )
Linking lazy_test ...
$ time ./lazy_test
[1]
real 0m0.020s
user 0m0.002s
sys 0m0.005s
which is as I expect, given my understanding of lazy evaluation mechanics.
However, if I switch from State to StateT:
(lazy_test2.hs)
-- Testing lazy evaluation of monadically constructed lists, using StateT.
import Control.Monad.State
nMax = 5
foo :: Int -> StateT [Int] IO Bool
foo n = do
modify $ \st -> n : st
return (n `mod` 2 == 1)
main :: IO ()
main = do
ress <- forM [0..nMax] $ \n -> runStateT (foo n) []
let sts = map snd $ dropWhile (not . fst) ress
print $ head sts
for = flip map
then I see an extreme difference between the respective run times:
nMax = 5:
$ stack ghc lazy_test2.hs
[1 of 1] Compiling Main ( lazy_test2.hs, lazy_test2.o )
Linking lazy_test2 ...
$ time ./lazy_test2
[1]
real 0m0.019s
user 0m0.002s
sys 0m0.004s
nMax = 50,000,000:
$ stack ghc lazy_test2.hs
[1 of 1] Compiling Main ( lazy_test2.hs, lazy_test2.o )
Linking lazy_test2 ...
$ time ./lazy_test2
[1]
real 0m29.758s
user 0m25.488s
sys 0m4.231s
And I'm assuming that's because I'm losing lazy evaluation of the monadically constructed list, when I switch to the StateT-based implementation.
Is that correct?
Can I recover lazy evaluation of a monadically constructed list, while keeping with the StateT-based implementation?
In your example, you're only running one foo action per runState, so your use of State and/or StateT is essentially irrelevant. You can replace the use of foo with the equivalent:
import Control.Monad
nMax = 50000000
main :: IO ()
main = do
ress <- forM [0..nMax] $ \n -> return (n `mod` 2 == 1, [n])
let sts = map snd $ dropWhile (not . fst) ress
print $ head sts
and it behaves the same way.
The issue is the strictness of the IO monad. If you ran this computation in the Identity monad instead:
import Control.Monad
import Data.Functor.Identity
nMax = 50000000
main :: IO ()
main = do
let ress = runIdentity $ forM [0..nMax] $ \n -> return (n `mod` 2 == 1, [n])
let sts = map snd $ dropWhile (not . fst) ress
print $ head sts
then it would run lazily.
If you want to run lazily in the IO monad, you need to do it explicitly with unsafeInterleaveIO, so the following would work:
import System.IO.Unsafe
import Control.Monad
nMax = 50000000
main :: IO ()
main = do
ress <- lazyForM [0..nMax] $ \n -> return (n `mod` 2 == 1, [n])
let sts = map snd $ dropWhile (not . fst) ress
print $ head sts
lazyForM :: [a] -> (a -> IO b) -> IO [b]
lazyForM (x:xs) f = do
y <- f x
ys <- unsafeInterleaveIO (lazyForM xs f)
return (y:ys)
lazyForM [] _ = return []
The other answer by K A Buhr explains why State vs StateT is not the pertinent factor (IO is), and also points out how your example is strangely constructed (in that the State(T) part isn't actually used as each number uses a new state []). But aside from those points, I'm not sure I would say "losing lazy evaluation of the monadically constructed list", because if we understand something like "lazy evaluation = evaluated only when needed", then foo does indeed need to run on every element on the input list in order to perform all the effects, so lazy evaluation is not being "lost". You are getting what you asked for. (It just so happens that foo doesn't perform any IO, and perhaps someone else can comment with if it's ever possible for a compiler/GHC to optimize it away on this basis, but you can easily see why GHC does the naive thing here.)
This is a common, well-known problem in Haskell. There are various libraries (best known of which are streaming, pipes, conduit) which solve the problem by giving you streams (basically lists) which are lazy in the effects too. If I recreate your example in a streaming style,
import Data.Function ((&))
import Control.Monad.State
import Streaming
import qualified Streaming.Prelude as S
foo :: Int -> StateT [Int] IO Bool
foo n =
(n `mod` 2 == 1) <$ modify (n:)
nMax :: Int
nMax = 5000000
main :: IO ()
main = do
mHead <- S.head_ $ S.each [0..nMax]
& S.mapM (flip runStateT [] . foo)
& S.dropWhile (not . fst)
print $ snd <$> mHead
then both versions run practically instantaneously. To make the difference more apparent, imagine that foo also called print "hi". Then the streaming version, being lazy in the effects, would print only twice, whereas your original versions would both print nMax times. As they're lazy in the effects, then the whole list doesn't need to be traversed in order to short-circuit and finish early.

How can I read a sentence, separate the words and apply my function to each word? Haskell

I have a function that reads a word, separates the first and the last letter and the remaining content mixes it and at the end writes the first and last letter of the word but with the mixed content.
Example:
Hello -> H lle o
But I want you to be able to read a phrase and do the same in each word of the sentence. What I can do?
import Data.List
import System.IO
import System.Random
import System.IO.Unsafe
import Data.Array.IO
import Control.Monad
import Data.Array
oracion = do
frase <- getLine
let pL = head frase
let contentR = devContent frase
charDisorder <- aleatorio contentR
let uL = last frase
putStrLn $ [pL] ++ charDisorder ++ [uL]
aleatorio :: [d] -> IO [d]
aleatorio xs = do
ar <- newArray n xs
forM [1..n] $ \i -> do
t <- randomRIO (i,n)
vi <- readArray ar i
vt <- readArray ar t
writeArray ar t vi
return vt
where
n = length xs
newArray :: Int -> [d] -> IO (IOArray Int d)
newArray n xs = newListArray (1,n) xs
devContent :: [Char] -> [Char]
devContent x = init (drop 1 x)
That should go like this:
doStuffOnSentence sentence = mapM aleatorio (words sentence)
Whenever You are dealing with monads (especially IO) mapM is real lifesaver.
What's more, if You want to concatenate the final result You can add:
concatIoStrings = liftM unwords
doStuffAndConcat = concatIoStrings . doStuffOnSentence

What is the fastest way to parse line with lots of Ints?

I'm learning Haskell for two years now and I'm still confused, whats the best (fastest) way to read tons of numbers from a single input line.
For learning I registered into hackerearth.com trying to solve every challenge in Haskell. But now I'm stuck with a challenge because I run into timeout issues. My program is just too slow for beeing accepted by the site.
Using the profiler I found out it takes 80%+ of the time for parsing a line with lots of integers. The percentage gets even higher when the number of values in the line increases.
Now this is the way, I'm reading numbers from an input line:
import qualified Data.ByteString.Char8 as C8
main = do
scores <- fmap (map (fst . fromJust . C8.readInt) . C8.words) C8.getLine :: IO [Int]
Is there any way to get the data faster into the variable?
BTW: The biggest testcase consist of a line with 200.000 9-digits values. Parsing takes incredible long (> 60s).
It's always difficult to declare a particular approach "the fastest", since there's almost always some way to squeeze out more performance. However, an approach using Data.ByteString.Char8 and the general method you suggest should be among the fastest methods for reading numbers. If you encounter a case where performance is poor, the problem likely lies elsewhere.
To give some concrete results, I generated a 191Meg file of 20 million 9-digit numbers, space-separate on a single line. I then tried several general methods of reading a line of numbers and printing their sum (which, for the record, was 10999281565534666). The obvious approach using String:
reader :: IO [Int]
reader = map read . words <$> getLine
sum' xs = sum xs -- work around GHC ticket 10992
main = print =<< sum' <$> reader
took 52secs; a similar approach using Text:
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Data.Text.Read as T
readText = map parse . T.words <$> T.getLine
where parse s = let Right (n, _) = T.decimal s in n
ran in 2.4secs (but note that it would need to be modified to handle negative numbers!); and the same approach using Char8:
import qualified Data.ByteString.Char8 as C
readChar8 :: IO [Int]
readChar8 = map parse . C.words <$> C.getLine
where parse s = let Just (n, _) = C.readInt s in n
ran in 1.4secs. All examples were compiled with -O2 on GHC 8.0.2.
As a comparison benchmark, a scanf-based C implementation:
/* GCC 5.4.0 w/ -O3 */
#include <stdio.h>
int main()
{
long x, acc = 0;
while (scanf(" %ld", &x) == 1) {
acc += x;
}
printf("%ld\n", acc);
return 0;
}
ran in about 2.5secs, on par with the Text implementation.
You can squeeze a bit more performance out of the Char8 implementation. Using a hand-rolled parser:
readChar8' :: IO [Int]
readChar8' = parse <$> C.getLine
where parse = unfoldr go
go s = do (n, s1) <- C.readInt s
let s2 = C.dropWhile C.isSpace s1
return (n, s2)
runs in about 0.9secs -- I haven't tried to determine why there's a difference, but the compiler must be missing an opportunity to perform some optimization of the words-to-readInt pipeline.
Haskell Code for Reference
Make some numbers with Numbers.hs:
-- |Generate 20M 9-digit numbers:
-- ./Numbers 20000000 100000000 999999999 > data1.txt
import qualified Data.ByteString.Char8 as C
import Control.Monad
import System.Environment
import System.Random
main :: IO ()
main = do [n, a, b] <- map read <$> getArgs
nums <- replicateM n (randomRIO (a,b))
let _ = nums :: [Int]
C.putStrLn (C.unwords (map (C.pack . show) nums))
Find their sum with Sum.hs:
import Data.List
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Data.Text.Read as T
import qualified Data.Char8 as C
import qualified Data.ByteString.Char8 as C
import System.Environment
-- work around https://ghc.haskell.org/trac/ghc/ticket/10992
sum' xs = sum xs
readString :: IO [Int]
readString = map read . words <$> getLine
readText :: IO [Int]
readText = map parse . T.words <$> T.getLine
where parse s = let Right (n, _) = T.decimal s in n
readChar8 :: IO [Int]
readChar8 = map parse . C.words <$> C.getLine
where parse s = let Just (n, _) = C.readInt s in n
readHand :: IO [Int]
readHand = parse <$> C.getLine
where parse = unfoldr go
go s = do (n, s1) <- C.readInt s
let s2 = C.dropWhile C.isSpace s1
return (n, s2)
main = do [method] <- getArgs
let reader = case method of
"string" -> readString
"text" -> readText
"char8" -> readChar8
"hand" -> readHand
print =<< sum' <$> reader
where:
./Sum string <data1.txt # 54.3 secs
./Sum text <data1.txt # 2.29 secs
./Sum char8 <data1.txt # 1.34 secs
./Sum hand <data1.txt # 0.91 secs

Why doesn't this code operate in constant memory?

I'm using Data.Text.Lazy to process some text files. I read in 2 files and distribute their text to 3 files according to some criteria. The loop which does the processing is go'. I've designed it in a way in which it should process the files incrementally and keep nothing huge in memory. However, as soon as the execution reaches the go' part the memory keeps on increasing till it reaches around 90MB at the end, starting from 2MB.
Can someone explain why this memory increase happens and how to avoid it?
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TI
import System.IO
import System.Environment
import Control.Monad
main = do
[in_en, in_ar] <- getArgs
[h_en, h_ar] <- mapM (`openFile` ReadMode) [in_en, in_ar]
hSetEncoding h_en utf8
en_txt <- TI.hGetContents h_en
let len = length $ T.lines en_txt
len `seq` hClose h_en
h_en <- openFile in_en ReadMode
hs#[hO_lm, hO_en, hO_ar] <- mapM (`openFile` WriteMode) ["lm.txt", "tun_"++in_en, "tun_"++in_ar]
mapM_ (`hSetEncoding` utf8) [h_en, h_ar, hO_lm, hO_en, hO_ar]
[en_txt, ar_txt] <- mapM TI.hGetContents [h_en, h_ar]
let txts#[_, _, _] = map T.unlines $ go len en_txt ar_txt
zipWithM_ TI.hPutStr hs txts
mapM_ (liftM2 (>>) hFlush hClose) hs
print "success"
where
go len en_txt ar_txt = go' (T.lines en_txt) (T.lines ar_txt)
where (q,r) = len `quotRem` 3000
go' [] [] = [[],[],[]]
go' en ar = let (h:bef, aft) = splitAt q en
(hA:befA, aftA) = splitAt q ar
~[lm,en',ar'] = go' aft aftA
in [bef ++ lm, h:en', hA:ar']
EDIT
As per #kosmikus's suggestion I've tried replacing zipWithM_ TI.hPutStr hs txts with a loop which prints line by line as shown below. The memory consumption is now 2GB+!
fix (\loop lm en ar -> do
case (en,ar,lm) of
([],_,lm) -> TI.hPutStr hO_lm $ T.unlines lm
(h:t,~(h':t'),~(lh:lt)) -> do
TI.hPutStrLn hO_en h
TI.hPutStrLn hO_ar h'
TI.hPutStrLn hO_lm lh
loop lt t t')
lm en ar
What's going on here?
The function go' builds a [T.Text] with three elements. The list is built lazily: in each step of go each of the three lists becomes known to a certain extent. However, you consume this structure by printing each element to a file in order, using the line:
zipWithM_ TI.hPutStr hs txts
So the way you consume the data does not match the way you produce the data. While printing the first of the three list elements to a file, the other two are built and kept in memory. Hence the space leak.
Update
I think that for the current example, the easiest fix would be to write to the target files during the loop, i.e., in the go' loop. I'd modify go' as follows:
go' :: [T.Text] -> [T.Text] -> IO ()
go' [] [] = return ()
go' en ar = let (h:bef, aft) = splitAt q en
(hA:befA, aftA) = splitAt q ar
in do
TI.hPutStrLn hO_en h
TI.hPutStrLn hO_ar hA
mapM_ (TI.hPutStrLn hO_lm) bef
go' aft aftA
And then replace the call to go and the subsequent zipWithM_ call with a plain call to:
go hs len en_txt ar_txt

Lazy binary get

Why is Data.Binary.Get isn't lazy as it says? Or am I doing something wrong here?
import Data.ByteString.Lazy (pack)
import Data.Binary.Get (runGet, isEmpty, getWord8)
getWords = do
empty <- isEmpty
if empty
then return []
else do
w <- getWord8
ws <- getWords
return $ w:ws
main = print $ take 10 $ runGet getWords $ pack $ repeat 1
This main function just hangs instead of printing 10 words.
The documentation you linked provides several examples. The first one needs to read all the input before it can return and looks a lot like what you have written. The second one is a left-fold and processes the input in a streaming fashion. Here's your code rewritten in this style:
module Main where
import Data.Word (Word8)
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (runGetState, getWord8)
getWords :: BL.ByteString -> [Word8]
getWords input
| BL.null input = []
| otherwise =
let (w, rest, _) = runGetState getWord8 input 0
in w : getWords rest
main :: IO ()
main = print . take 10 . getWords . BL.pack . repeat $ 1
Testing:
*Main> :main
[1,1,1,1,1,1,1,1,1,1]

Resources