Why doesn't this code operate in constant memory? - haskell

I'm using Data.Text.Lazy to process some text files. I read in 2 files and distribute their text to 3 files according to some criteria. The loop which does the processing is go'. I've designed it in a way in which it should process the files incrementally and keep nothing huge in memory. However, as soon as the execution reaches the go' part the memory keeps on increasing till it reaches around 90MB at the end, starting from 2MB.
Can someone explain why this memory increase happens and how to avoid it?
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TI
import System.IO
import System.Environment
import Control.Monad
main = do
[in_en, in_ar] <- getArgs
[h_en, h_ar] <- mapM (`openFile` ReadMode) [in_en, in_ar]
hSetEncoding h_en utf8
en_txt <- TI.hGetContents h_en
let len = length $ T.lines en_txt
len `seq` hClose h_en
h_en <- openFile in_en ReadMode
hs#[hO_lm, hO_en, hO_ar] <- mapM (`openFile` WriteMode) ["lm.txt", "tun_"++in_en, "tun_"++in_ar]
mapM_ (`hSetEncoding` utf8) [h_en, h_ar, hO_lm, hO_en, hO_ar]
[en_txt, ar_txt] <- mapM TI.hGetContents [h_en, h_ar]
let txts#[_, _, _] = map T.unlines $ go len en_txt ar_txt
zipWithM_ TI.hPutStr hs txts
mapM_ (liftM2 (>>) hFlush hClose) hs
print "success"
where
go len en_txt ar_txt = go' (T.lines en_txt) (T.lines ar_txt)
where (q,r) = len `quotRem` 3000
go' [] [] = [[],[],[]]
go' en ar = let (h:bef, aft) = splitAt q en
(hA:befA, aftA) = splitAt q ar
~[lm,en',ar'] = go' aft aftA
in [bef ++ lm, h:en', hA:ar']
EDIT
As per #kosmikus's suggestion I've tried replacing zipWithM_ TI.hPutStr hs txts with a loop which prints line by line as shown below. The memory consumption is now 2GB+!
fix (\loop lm en ar -> do
case (en,ar,lm) of
([],_,lm) -> TI.hPutStr hO_lm $ T.unlines lm
(h:t,~(h':t'),~(lh:lt)) -> do
TI.hPutStrLn hO_en h
TI.hPutStrLn hO_ar h'
TI.hPutStrLn hO_lm lh
loop lt t t')
lm en ar
What's going on here?

The function go' builds a [T.Text] with three elements. The list is built lazily: in each step of go each of the three lists becomes known to a certain extent. However, you consume this structure by printing each element to a file in order, using the line:
zipWithM_ TI.hPutStr hs txts
So the way you consume the data does not match the way you produce the data. While printing the first of the three list elements to a file, the other two are built and kept in memory. Hence the space leak.
Update
I think that for the current example, the easiest fix would be to write to the target files during the loop, i.e., in the go' loop. I'd modify go' as follows:
go' :: [T.Text] -> [T.Text] -> IO ()
go' [] [] = return ()
go' en ar = let (h:bef, aft) = splitAt q en
(hA:befA, aftA) = splitAt q ar
in do
TI.hPutStrLn hO_en h
TI.hPutStrLn hO_ar hA
mapM_ (TI.hPutStrLn hO_lm) bef
go' aft aftA
And then replace the call to go and the subsequent zipWithM_ call with a plain call to:
go hs len en_txt ar_txt

Related

How can I read a sentence, separate the words and apply my function to each word? Haskell

I have a function that reads a word, separates the first and the last letter and the remaining content mixes it and at the end writes the first and last letter of the word but with the mixed content.
Example:
Hello -> H lle o
But I want you to be able to read a phrase and do the same in each word of the sentence. What I can do?
import Data.List
import System.IO
import System.Random
import System.IO.Unsafe
import Data.Array.IO
import Control.Monad
import Data.Array
oracion = do
frase <- getLine
let pL = head frase
let contentR = devContent frase
charDisorder <- aleatorio contentR
let uL = last frase
putStrLn $ [pL] ++ charDisorder ++ [uL]
aleatorio :: [d] -> IO [d]
aleatorio xs = do
ar <- newArray n xs
forM [1..n] $ \i -> do
t <- randomRIO (i,n)
vi <- readArray ar i
vt <- readArray ar t
writeArray ar t vi
return vt
where
n = length xs
newArray :: Int -> [d] -> IO (IOArray Int d)
newArray n xs = newListArray (1,n) xs
devContent :: [Char] -> [Char]
devContent x = init (drop 1 x)
That should go like this:
doStuffOnSentence sentence = mapM aleatorio (words sentence)
Whenever You are dealing with monads (especially IO) mapM is real lifesaver.
What's more, if You want to concatenate the final result You can add:
concatIoStrings = liftM unwords
doStuffAndConcat = concatIoStrings . doStuffOnSentence

Where to find chunks in haskell?

I'm trying to follow this tutorial: https://wiki.haskell.org/Tutorials/Programming_Haskell/String_IO.
In the last part 7 Extension: using SMP parallelism I copy the code but it fails to compile with this error message
/home/dhilst/parallelspell.hs:13:20: error:
Variable not in scope: chunk :: Int -> [String] -> t
I searched for chunks at Hoogle and got Data.Text.Internal.Lazy, but this seems to be an internal module. And I couldn't import it anyway.
Here is the code:
import Data.Set hiding (map)
import Data.Maybe
import Data.Char
import Text.Printf
import System.IO
import System.Environment
import Control.Concurrent
import Control.Monad
main = do
(f,g,n) <- readFiles
let dict = fromList (lines f)
work = chunk n (words g)
run n dict work
run n dict work = do
chan <- newChan
errs <- getChanContents chan
mapM_ (forkIO . thread chan dict) (zip [1..n] work)
wait n errs 0
wait n xs i = when (i < n) $ case xs of
Nothing : ys -> wait n ys $! i+1
Just s : ys -> putStrLn s >> wait n ys i
thread chan dict (me,xs) = do
mapM_ spellit xs
writeChan chan Nothing
where spellit w = when (spell dict w) $
writeChan chan . Just $ printf "Thread %d: %-25s" (me::Int) w
spell d w = w `notMember` d
readFiles = do
[s,n] <- getArgs
f <- readFile "/usr/share/dict/words"
g <- readFile s
return (f,g, read n)
And here is the compilation line:
ghc -O --make -threaded parallelspell.hs
--
Update: I write my own version of chunk based on this quest:How to partition a list in Haskell?
chunk :: Int -> [a] -> [[a]]
chunk _ [] = []
chunk n xs = (take n xs) : (chunk n (drop n xs))
Still, does this means that the tutorial that I'm following is very old and out of date!? Can anyone confirm if that function already existed some day or if I'm missing something?
Regards,
Looks like the tutorial just forgot to define chunk. I encourage you to update the wiki to include a suitable definition.

Read a list of integers lazily as a bytestring

I'm trying to find the sum of integers in a file. The code using the normal string is:
main = do
contents <- getContents
L.putStrLn (sumFile contents)
where sumFile = sum . map read. words
I tried to change it to use the Data.ByteString.Lazy module like this:
import Data.ByteString.Lazy as L
main = do
contents <- L.getContents
L.putStrLn (sumFile contents)
where sumFile = sum . L.map read. words
But this refused as words was returning a string. Then I tried using Data.ByteString.Char8 but it used a strict ByteString.
How can I make this function completely lazy?
I found a slightly length workaround to reading the file as a ByteString and then as a list of integers. Thanks to #melpomene
import Data.ByteString.Lazy.Char8 as L
main = do
contents <- L.getContents
print (sumFile contents)
where sumFile x = sum $ Prelude.map tups $ Prelude.map L.readInt (L.words x)
where read' = tups.(L.readInt)
tups :: (Num a) => (Maybe (a, b)) -> a
tups (Just (a,b)) = a
tups Nothing = 0

Expanding the abbreviated words from a file in Haskell

I am new at working with files in haskell.I wrote a code to check for occurence of words in a .c file. words are listed in a .txt file .
for example:
abbreviations.txt
ix=index
ctr=counter
tbl=table
Another file is:
main.c
main ()
{
ix = 1
for (ctr =1; ctr < 10; ctr++)
{
tbl[ctr] = ix
}
}
on encountering ix it should be expanded to index and same for ctr and tbl.
This is the code I wrote to check for occurrences(not yet to replace the encountered words)
import System.Environment
import System.IO
import Data.Char
import Control.Monad
import Data.Set
main = do
s <- getLine
f <- readFile "abbreviations.txt"
g <- readFile s
let dict = fromList (lines f)
mapM_ (spell dict) (words g)
spell d w = when (w `member` d) (putStrLn w)
On executing the code it is giving no output.
Instead of the upper code,I tried reading a file using hgetLine then converting it into list of words using words
getLines' h = do
isEOF <- hIsEOF h
if isEOF then
return ()
else
do
line <- hGetLine h
list<-remove (return (words line))
getLines' h
-- print list
main = do
inH <- openFile "abbreviations.txt" ReadMode
getLines' inH
hClose inH
remove [] = []
remove (x:xs)| x == "=" = remove xs
| otherwise = x:remove (xs)
But its giving me errors relating to IO() ,is there any other way in which I could do the following.
Where am I going wrong?
Thank you for any help.
First, there is a problem with your spell function. It should also have an else clause with it:
spell :: (Show a, Ord a) => Set a -> a -> IO ()
spell d w = if (w `member` d)
then print d
else return ()
Also, note that I have changed your putStrLn to print and added a type signature to your code.
On executing the code it is giving no output.
That's because, it's always going to the else clause in your spell function. If you try to trace up the execution of your program, then you will note that, your dict variable will actually contain this Set: ["ctr=counter","ix=index","tbl=table"] and it doesn't contains the words of the file main.c. I hope this will be sufficient to get you started.

Having trouble finishing off this enumeratee

At one point I wrote a packet capture program in haskell and it used lazy IO to catch all the tcp packets. The problem was that sometimes packets are out of order, so I had to insert all of them into a list until I got a fin flag to be sure that I had all the packets necessary to do anything with them, and if I was sniffing something really big, like a video, I had to hold all that in memory. To do it any other way would require some difficult imperative code.
So later I learned about iteratees, and I decided to implement my own. How it would work is, there is an enumeratee. You supply it with the number of packets you want it to hold. As it pulls in packets, it sorts them, and then once it gets up to the number you specify, it starts flushing, but leaves a few in there so that new chunks are sorted into that list before more packets are flushed. The idea is that chunks will be almost in order before they hit this enumeratee, and it will fix most small order problems. When it gets an EOF, it should send all remaining packets back out.
So it almost works. I realize some of these could be replaced by standard enumerator functions, but I wanted to write them myself to understand how it works better. Here's some code:
Readlines just gets lines from a file one line at a time and feeds it.
PrintLines just prints each chunk.
numbers.txt is a line delimited set of numbers that are slightly out of order, some numbers are several spaces before or after they should be.
Reorder is the function that holds n numbers and sorts new ones into its accumulator list, and then shoves out all but the last n of those numbers.
import Prelude as P
import Data.Enumerator as E
import Data.Enumerator.List as EL
import Data.List (sort, insert)
import IO
import Control.Monad.Trans (lift)
import Control.Monad (liftM)
import Control.Exception as Exc
import Debug.Trace
test = run_ (readLines "numbers.txt" $$ EL.map (read ::String -> Int) =$ reorder 10 =$ printLines)
reorder :: (Show a, Ord a) => (Monad m) => Int -> Enumeratee a a m b
reorder n step = reorder' [] n step
where
reorder' acc n (Continue k) =
let
len = P.length
loop buf n' (Chunks xs)
| (n' - len xs >= 0) = continue (loop (foldr insert buf xs) (n' - len xs))
| otherwise =
let allchunx = foldr insert buf xs
(excess,store)= P.splitAt (negate (n' - len xs)) allchunx
in k (Chunks excess) >>== reorder' store 0
loop buf n' (EOF) = k (Chunks (trace ("buf:" ++ show buf) buf)) >>== undefined
in continue (loop acc n)
printLines :: (Show a) => Iteratee a IO ()
printLines = continue loop
where
loop (Chunks []) = printLines
loop (Chunks (x:xs)) = do
lift $ print x
printLines
loop (EOF) = yield () EOF
readLines :: FilePath -> Enumerator String IO ()
readLines filename s = do
h <- tryIO $ openFile filename ReadMode
Iteratee (Exc.finally (runIteratee $ checkContinue0 (blah h) s) (hClose h))
where
blah h loop k = do
x <- lift $ myGetLine h
case x of
Nothing -> continue k
Just line -> k (Chunks [line]) >>== loop
myGetLine h = Exc.catch (liftM Just (hGetLine h)) checkError
checkError :: IOException -> IO (Maybe String)
checkError e = return Nothing
My problem is at the undefined in reorder. What happens is reorder has 10 items stuck in it, and then it receives an EOF from up the stack. So it goes k (Chunks those10items) and then there is an undefined because I don't know what to put here to make it work.
What happens is that the last 10 items get chopped out of the output of the program. You can see the trace, that variable buf has all the remaining items in it. I have tried yielding, but I'm not sure what to yield or if I should yield at all. I'm not sure what to put there to make this work.
Edit: Turns out the reorder was fixed by changing the undefined part of the loop to:
loop buf n' EOF = k (Chunks buf) >>== (\s -> yield s EOF)
which I almost definitely had at one point, but I didn't get the right answer so I assumed it was wrong.
The problem was with printLines. Since reorder was sending out chunks one at a time until it got to the very end, I never noticed the problem with printLines which was that it was discarding chunks other than the first one per loop. In my head I thought that the chunks would carry over or something, which was stupid.
Anyways I changed printLines to this:
printLines :: (Show a) => Iteratee a IO ()
printLines = continue loop
where
loop (Chunks []) = printLines
loop (Chunks xs) = do
lift $ mapM_ print xs
printLines
loop (EOF) = yield () EOF
And now it works. Thanks a lot, I was afraid I wouldn't get an answer.
How about
loop buf n' (EOF) = k (Chunks buf) >>== (\s -> yield s EOF)
(idea taken from EB.isolate).
Depending on what exactly you're trying to do, your printLines may also need fixing; the case for Chunks (x:xs) throws away xs. Something like
loop (Chunks (x:xs)) = do
lift $ print x
loop (Chunks xs)
may (or may not) have been what you intended.

Resources