I have a following program in Haskell:
processDate :: String -> IO ()
processDate date = do
...
let newFlattenedPropertiesWithPrice = filter (notYetInserted date existingProperties) flattenedPropertiesWithPrice
geocodedProperties <- propertiesWithGeocoding newFlattenedPropertiesWithPrice
propertiesWithGeocoding :: [ParsedProperty] -> IO [(ParsedProperty, Maybe LatLng)]
propertiesWithGeocoding properties = do
let addresses = fmap location properties
let batchAddresses = chunksOf 100 addresses
batchGeocodedLocations <- mapM geocodeAddresses batchAddresses
let geocodedLocations = fromJust $ concat <$> sequence batchGeocodedLocations
return (zip properties geocodedLocations)
geocodeAddresses :: [String] -> IO (Maybe [Maybe LatLng])
geocodeAddresses addresses = do
mapQuestKey <- getEnv "MAP_QUEST_KEY"
geocodeResponse <- openURL $ mapQuestUrl mapQuestKey addresses
return $ geocodeResponseToResults geocodeResponse
geocodeResponseToResults :: String -> Maybe [Maybe LatLng]
geocodeResponseToResults inputResponse =
latLangs
where
decodedResponse :: Maybe GeocodingResponse
decodedResponse = decodeGeocodingResponse inputResponse
latLangs = fmap (fmap geocodingResultToLatLng . results) decodedResponse
decodeGeocodingResponse :: String -> Maybe GeocodingResponse
decodeGeocodingResponse inputResponse = Data.Aeson.decode (fromString inputResponse) :: Maybe GeocodingResponse
It reads a list of properties (homes and apartments) from html files, parses them, geocodes the addresses and saves the results into sqlite db.
Everything works fine except for a very high memory usage (around 800M).
By commenting code out I have pinpointed the problem to be the geocoding step.
I send 100 addresses at a time to MapQuest api (https://developer.mapquest.com/documentation/geocoding-api/batch/get/).
The response for 100 addresses is quite massive so it might be one of the culprits, but 800M? I feel like it holds to all of the results until the end which drives the memory usage so high.
After commenting out the geocoding part of the program memory usage is around 30M which is fine.
You can get the full version which reproduces the issue here: https://github.com/Leonti/haskell-memory-so
I'm quite a newbie in Haskell, so not sure how I can optimize it.
Any ideas?
Cheers!
It might be worth recording that this turned out to be a simple streaming problem arising from use of mapM and sequence, which with replicateM and traverse and other things that make you "extract a list from IO" always raise accumulation worries. So a little detour by a streaming library was needed. So in the repo it was necessary just to replace
processDate :: String -> IO ()
processDate date = do
allFiles <- listFiles date
allProperties <- mapM fileToProperties allFiles
let flattenedPropertiesWithPrice = filter hasPrice $ concat allProperties
geocodedProperties <- propertiesWithGeocoding flattenedPropertiesWithPrice
print geocodedProperties
propertiesWithGeocoding :: [ParsedProperty] -> IO [(ParsedProperty, Maybe LatLng)]
propertiesWithGeocoding properties = do
let batchProperties = chunksOf 100 properties
batchGeocodedLocations <- mapM geocodeAddresses batchProperties
let geocodedLocations = fromJust $ concat <$> sequence batchGeocodedLocations
return geocodedLocations
with something like this
import Streaming
import qualified Streaming.Prelude as S
processDate :: String -> IO ()
processDate date = do
allFiles <- listFiles date -- we accept an unstreamed list
S.print $ propertiesWithGeocoding -- this was the main pain point see below
$ S.filter hasPrice
$ S.concat
$ S.mapM fileToProperties -- this mapM doesn't accumulate
$ S.each allFiles -- the list is converted to a stream
propertiesWithGeocoding
:: Stream (Of ParsedProperty) IO r
-> Stream (Of (ParsedProperty, Maybe LatLng)) IO r
propertiesWithGeocoding properties =
S.concat $ S.concat
$ S.mapM geocodeAddresses -- this mapM doesn't accumulate results from mapquest
$ S.mapped S.toList -- convert segments to haskell lists
$ chunksOf 100 properties -- this is the streaming `chunksOf`
-- concat here flattens a stream of lists of as into a stream of as
-- and a stream of maybe as into a stream of as
Then the memory use looks like so, each peak corresponding to a trip to Mapquest promply followed by a little processing and a print, whereupon ghc forgets all about it and moves on:
Of course this could be done with pipes or conduit. But here we just need a little bit of simple mapM / sequence/ traverse / replicateM avoidance and streaming is perhaps simplest for this sort of quick local refactoring. Note that this list is quite short so the thought 'but short lists are cool with mapM/traverse/etc !" can be quite spectacularly false. Why not just get rid of them? Whenever you are about to write list mapM f it is a good idea to consider S.mapM f . S.each (or conduit or pipes equivalent) . You will now have a stream and can recover a list with S.toList or an equivalent, but it is likely that, as in this case, you will find you don't need a reified accumulated list but can e.g. use some streaming process like printing to file or stdout or writing things to a database, after making whatever list like manipulations are needed (here we use eg. streaming filter and also concat to flatten streamed lists and as a sort of catMaybe).
Related
I want to do some infinite sequence of IO actions processing with filtration their results in realtime+perfoming some IO actions in certain moments:
We have some function for reducing sequences (see my question haskell elegant way to filter (reduce) sequences of duplicates from infinte list of numbers):
f :: Eq a => [a] -> [a]
f = map head . group
and expression
join $ sequence <$> ((\l -> (print <$> l)) <$> (f <$> (sequence $ replicate 6 getLine)))
if we run this, user can generate any seq of numbers, for ex:
1
2
2
3
3
"1"
"2"
"3"
[(),(),()]
This means that at first all getLine actions performed (6 times in the example and at the end of this all IO actions for filtered list performed, but I want to do IO actions exactly in the moments then sequencing reduces done for some subsequences of same numbers.
How can I archive this output:
1
2
"1"
2
3
"2"
3
3
"3"
[(),(),()]
So I Want this expression not hangs:
join $ sequence <$> ((\l -> (print <$> l)) <$> (f <$> (sequence $ repeat getLine)))
How can I archive real-time output as described above without not blocking it on infinite lists?
Without a 3rd-party library, you can lazily read the contents of standard input, appending a dummy string to the end of the expected input to force output. (There's probably a better solution that I'm stupidly overlooking.)
import System.IO
print_unique :: (String, String) -> IO ()
print_unique (last, current) | last == current = return ()
| otherwise = print last
main = do
contents <- take 6 <$> lines <$> hGetContents stdin
traverse print_unique (zip <*> tail $ (contents ++ [""]))
zip <*> tail produces tuples consisting of the ith and i+1st lines without blocking. print_unique then immediately outputs a line if the following line is different.
Essentially, you are sequencing the output actions as the input is executed, rather than sequencing the input actions.
This seems like a job for a streaming library, like streaming.
{-# LANGUAGE ImportQualifiedPost #-}
module Main where
import Streaming
import Streaming.Prelude qualified as S
main :: IO ()
main =
S.mapM_ print
. S.catMaybes
. S.mapped S.head
. S.group
$ S.replicateM 6 getLine
"streaming" has an API reminiscent to that of lists, but works with effectful sequences.
The nice thing about streaming's version of group is that it doesn't force you to keep the whole group in memory if it isn't needed.
The least intuitive function in this answer is mapped, because it's very general. It's not obvious that streaming's version of head fits as its parameter. The key idea is that the Stream type can represent both normal effectful sequences, and sequences of elements on which groups have been demarcated. This is controlled by changing a functor type parameter (Of in the first case, a nested Stream (Of a) m in the case of grouped Streams).
mapped let's you transform that functor parameter while having some effect in the underlying monad (here IO). head processes the inner Stream (Of a) m groups, getting us back to an Of (Maybe a) functor parameter.
I found a nice solution with iterateUntilM
iterateUntilM (\_->False) (\pn -> getLine >>= (\n -> if n==pn then return n else (if pn/="" then print pn else return ()) >> return n) ) ""
I don't like some verbose with
(if pn/="" then print pn else return ())
if you know how to reduce this please comment)
ps.
It is noteworthy that I made a video about this function :)
And could not immediately apply it :(
I'm trying to write code in source -> transform -> sink style, for example:
let (|>) = flip ($)
repeat 1 |> take 5 |> sum |> print
But would like to do that using IO. I have this impression that my source can be an infinite list of IO actions, and each one gets evaluated once it is needed downstream. Something like this:
-- prints the number of lines entered before "quit" is entered
[getLine..] >>= takeWhile (/= "quit") >>= length >>= print
I think this is possible with the streaming libraries, but can it be done along the lines of what I'm proposing?
Using the repeatM, takeWhile and length_ functions from the streaming library:
import Streaming
import qualified Streaming.Prelude as S
count :: IO ()
count = do r <- S.length_ . S.takeWhile (/= "quit") . S.repeatM $ getLine
print r
This seems to be in that spirit:
let (|>) = flip ($)
let (.>) = flip (.)
getContents >>= lines .> takeWhile (/= "quit") .> length .> print
The issue here is that Monad is not the right abstraction for this, and attempting to do something like this results in a situation where referential transparency is broken.
Firstly, we can do a lazy IO read like so:
module Main where
import System.IO.Unsafe (unsafePerformIO)
import Control.Monad(forM_)
lazyIOSequence :: [IO a] -> IO [a]
lazyIOSequence = pure . go where
go :: [IO a] -> [a]
go (l:ls) = (unsafePerformIO l):(go ls)
main :: IO ()
main = do
l <- lazyIOSequence (repeat getLine)
forM_ l putStrLn
This when run will perform cat. It will read lines and output them. Everything works fine.
But consider changing the main function to this:
main :: IO ()
main = do
l <- lazyIOSequence (map (putStrLn . show) [1..])
putStrLn "Hello World"
This outputs Hello World only, as we didn't need to evaluate any of l. But now consider replacing the last line like the following:
main :: IO ()
main = do
x <- lazyIOSequence (map (putStrLn . show) [1..])
seq (head x) putStrLn "Hello World"
Same program, but the output is now:
1
Hello World
This is bad, we've changed the results of a program just by evaluating a value. This is not supposed to happen in Haskell, when you evaluate something it should just evaluate it, not change the outside world.
So if you restrict your IO actions to something like reading from a file nothing else is reading from, then you might be able to sensibly lazily evaluate things, because when you read from it in relation to all the other IO actions your program is taking doesn't matter. But you don't want to allow this for IO in general, because skipping actions or performing them in a different order can matter (and above, certainly does). Even in the reading a file lazily case, if something else in your program writes to the file, then whether you evaluate that list before or after the write action will affect the output of your program, which again, breaks referential transparency (because evaluation order shouldn't matter).
So for a restricted subset of IO actions, you can sensibly define Functor, Applicative and Monad on a stream type to work in a lazy way, but doing so in the IO Monad in general is a minefield and often just plain incorrect. Instead you want a specialised streaming type, and indeed Conduit defines Functor, Applicative and Monad on a lot of it's types so you can still use all your favourite functions.
I'm trying to get performance increases in a program I have that parses XML. The program can parse multiple XML files so I thought that I could make this run in parallel, but all my attempts have resulted in lower performance!
For XML parsing, I am using HXT.
I have a run function defined like this:
run printTasks xs = pExec xs >>= return . concat >>= doPrint printTasks 1
'pExec' is given a list of file names and is defined as:
pExec xs = do
ex <- mapM exec xs
as <- ex `usingIO` parList rdeepseq
return as
where 'exec' is defined as:
exec = runX . process
threadscope shows only one thread e ver being used (until the very end).
Can anyone explain why I have failed so miserably to parallelise this code?
In case it helps:
exec :: FilePath -> [CV_scene]
pExec :: [FilePath] -> IO [[CV_scene]]
data CV_scene = Scene [CV_layer] Time deriving (Show)
data CV_layer = Layer [DirtyRects] SourceCrop deriving (Show)
data Rect = Rect Int Int Int Int deriving (Show)-- Left Top Width Height
instance NFData CV_scene where
rnf = foldScene reduceScene
where reduceScene l t = rnf (seq t l)
instance NFData CV_layer where
rnf = foldLayer reduceLayer
where reduceLayer d s = rnf (seq s d)
instance NFData Rect where
rnf = foldRect reduceRect
where reduceRect l t w h = rnf [l,t,w,h]
type SourceCrop = Rect
type DirtyRect = Rect
type Time = Int64
Thanks in advance for your help!
First, it looks like you mislabeled the signature of exec, which should probably be:
exec :: FilePath -> IO [CV_scene]
Now for the important part. I've commented inline on what I think you think is going on.
pExec xs = do
-- A. Parse the file found at each location via exec.
ex <- mapM exec xs
-- B. Force the lazy parsing in parallel.
as <- ex `usingIO` parList rdeepseq
return as
Note that line A does not happen in paralell, which you might think is okay since it will just set up the parsing thunks which are forced in parallel in B. This is a fair assumption, and a clever use of laziness, but the results pull that into question for me.
I suspect that the implementation of exec forces most of the parsing before line B is even reached so that the deep seq doesn't do much. That fits pretty well with my experince parsing and the profiling supports that explanation.
Without the ability to test your code, I can only make the following suggestions. First try separating the parsing of the file from the IO and put the parsing in the parallel execution strategy. In that case lines A and B become something like:
ex <- mapM readFile xs
as <- ex `usingIO` parList (rdeepseq . exec')
with exec' the portion of exec after the file is read from disk.
exec' :: FilePath -> [CVScene]
Also, you may not even need rdeepSeq after this change.
As an alternative, you can do the IO and parsing in parallel using Software Transactional Memory. STM approaches are normally used for separate IO threads which act more like services, rather than pure computations. But if for some reason you cant get the strategies based approach to work, this might be worth a try.
import Control.Concurrent.STM.TChan --(from stm package)
import Control.Concurrent(forkIO)
pExec'' :: [FilePath] -> IO [[CVSene]]
pExec'' xs = do
-- A. create [(Filename,TChan [CVScene])]
tcx <- mapM (\x -> (x,) <$> newTChanIO) xs
-- B. do the reading/parsing in separate threads
mapM_ (forkIO . exec'') tcx
-- C. Collect the results
cvs <- mapM (atomically . readTChan . snd) tcx
exec'' :: [(FilePath,TChan [CVScene])] -> IO ()
exec'' (x,tch) = do
--D. The original exec function
cv <- exec x
--E. Put on the channel fifo buffer
atomically $ writeTChan tch cv
Good luck!
could you please help me with Turtle library.
I want to write simple program, that calculates disk space usage.
Here is the code:
getFileSize :: FilePath -> IO Size
getFileSize f = do
status <- stat f
return $ fileSize status
main = sh $ do
let sizes = fmap getFileSize $ find (suffix ".hs") "."
so now I have sizes bind of type Shell (IO Size). But I can't just sum it, with sum fold, cause there is IO Size in there. If it was something like [IO Size] I could pull IO monad out of there by using sequence to transform it to IO [Size]. But I can't do this with Shell monad since it is not Traversable. So I wrote something like this
import qualified Control.Foldl as F
main = sh $ do
let sizes = fmap getFileSize $ find (suffix ".hs") "."
lst <- fold sizes F.list
let cont = sequence lst
sz <- liftIO $ cont
liftIO $ putStrLn (show (sum sz))
First I folded Shell (IO Size) to [IO Size] and then to IO [Size] to sum list afterwards.
But I wonder if there is more canonical or elegant solution to this, because here I created two lists to accomplish my task. And I throught that Shell monad is for manipulating entities in constant space. Maybe there is some fold to make IO (Shell Size) from Shell (IO Size)?
Thanks.
You have an IO action, and you really want a Shell action. The usual way to handle that is with the liftIO method, which is available because Shell is an instance of MonadIO.
file <- find (suffix ".hs") "."
size <- liftIO $ getFileSize file
or even
size <- liftIO . getFileSize =<< find (suffix ".hs") "."
Fortunately, the Turtle package itself offers some size functions you can use directly with MonadIO instances like Shell in Turtle.Prelude so you don't need to use liftIO yourself.
Now you actually have to sum these up, but you can do that with fold and sum.
I would recommend that you avoid breaking open the Shell type itself. That should be reserved for adding totally new functionality to the API. That certainly isn't necessary in this case.
Actually I've managed to get rid of IO here by using helper transformation
sio :: Shell (IO a) -> Shell a
sio s = Shell (\(FoldShell step begin done) ->
let step' x a = do
a' <- a
step x a'
in
_foldShell s (FoldShell step' begin done))
But now I wonder is there any simpler solution to this task...
Greetings,
I'm trying to understand why I'm seeing the entire file loaded into memory with the following program, yet if you comment out the line below "(***)" then the program runs in constant (about 1.5M) space.
EDIT: The file is about 660MB, the field in column 26 is a date string like '2009-10-01', and there are one million lines. The process uses about 810MB by the time it hits the 'getLine'
Am I right in thinking it's related to the splitting of the string using 'split', and that somehow the underlying ByteString that has been read from the file can't be garbage-collected because it's still referenced? But if so, then I thought BS.copy would work around that. Any ideas how to force the computation - I can't seem to get 'seq' into the right place to have an effect.
(NB the source file is tab-separated lines)
Thanks in advance,
Kevin
module Main where
import System.IO
import qualified Data.ByteString.Lazy.Char8 as BS
import Control.Monad
type Record = BS.ByteString
importRecords :: String -> IO [Record]
importRecords filename = do
liftM (map importRecord.BS.lines) (BS.readFile filename)
importRecord :: BS.ByteString -> Record
importRecord txt = r
where
r = getField 26
getField f = BS.copy $ ((BS.split '\t' txt) !! f)
loopInput :: [Record] -> IO ()
loopInput jrs = do
putStrLn $ "Done" ++ (show $ last jrs)
hFlush stdout
x <- getLine
return ()
-- (***)
loopInput jrs
main = do
jrs <- importRecords "c:\\downloads\\lcg1m.txt"
loopInput jrs
Your call to last forces the list, jrs. To figure that out it must run through the entire file building up thunks for each entry in jrs. Because you aren't evaluating each element in jrs (except the last one) these thunks hang out with references to the bytestring, so that must stay in memory.
The solution is to force the evaluation of those thunks. Because we're talking about space the first thing I did was actually to store your info in a smaller format:
type Year = Word16
type Month = Word8
type Day = Word8
data Record = Rec {-# UNPACK #-} !Year {-# UNPACK #-} !Month {-# UNPACK #-} !Day
deriving (Eq, Ord, Show, Read)
This reduces that ugly 10 byte Bytestring (+ overhead of ~16 bytes of structure information) to around 8 bytes.
importRecord now has to call toRecord r to get the right type:
toRecord :: BS.ByteString -> Record
toRecord bs =
case BS.splitWith (== '-') bs of
(y:m:d:[]) -> Rec (rup y) (rup m) (rup d)
_ -> Rec 0 0 0
rup :: (Read a) => BS.ByteString -> a
rup = read . BS.unpack
We'll need to evalute data when we convert from ByteString to Record, so lets use the parallel package and define an NFData instance from DeepSeq.
instance NFData Record where
rnf (Rec y m d) = y `seq` m `seq` d `seq` ()
Now we're ready to go, I modified main to use evalList, thus forcing the whole list before your function that wants the last one:
main = do
jrs <- importRecords "./tabLines"
let jrs' = using jrs (evalList rdeepseq)
loopInput jrs'
And we can see the heap profile looks beautiful (and top agrees, the program uses very little memory).
Sorry about that other misleading wrong answer - I was hooked on the fact that incremental processing fixes it and didn't really realize the thunks really were hanging around, not sure why my brain glided over that. Though I do stand by the gist, you should incrementally process this information making all of this answer moot.
FYI the huge bytestring didn't show up in those previous heap profiles I posted because foreign allocations (which includes ByteString) aren't tracked by the heap profiler.
There seem to be two questions here:
why does the memory usage depend on the presence or absence of the line (***);
why is the memory usage with (***) present about 800MB, rather than, say, 40MB.
I don't really know what to say about the first one that TomMD didn't already say; inside the loopInput loop, jrs can never be freed, because it's needed as an argument to the recursive call of loopInput. (You know that return () doesn't do anything when (***) is present, right?)
As for the second question, I think you are right that the input ByteString isn't being garbage collected. The reason is that you never evaluate the elements of your list jrs besides the last one, so they still contain references to the original ByteString (even though they are of the form BS.copy ...). I would think that replacing show $ last jrs with show jrs would reduce your memory usage; does it? Alternatively, you could try a stricter map, like
map' f [] = []
map' f (x:xs) = ((:) $! (f $! x)) (map' f xs)
Replace the map in importRecords with map' and see whether that reduces your memory usage.