haskell write large string

Hello Stackoverflow Community.
I'm relatively new to Haskell and I have noticed that writing large strings to a file with
writeFile or hPutStr is extremely slow.
For a 1.5 MB string my program (compiled with GHC) takes about 2 seconds, while the
"same" code in C++ takes only about 0.1 seconds.
The string is generated from a list with about 10000 elements and then dumped with writeFile. I have also tried to traverse the list with mapM_ and hPutStr, with the same result.
Is there a faster way to write a large string?
Update
As @applicative pointed out, the following code finishes with a 2 MB file in no time:
main = readFile "input.txt" >>= writeFile "output.txt"
So my problem seems to be somewhere else. Here are my two implementations for writing the list (WordIndex and CoordList are type aliases for a Map and a List):
with hPutStrLn
-- Print to file
indexToFile :: String -> WordIndex -> IO ()
indexToFile filename index =
  let
    indexList = map (\(k, v) -> entryToString k v) (Map.toList index)
  in do
    output <- openFile filename WriteMode
    mapM_ (\v -> hPutStrLn output v) indexList
    hClose output

-- Convert a list element to a String
entryToString :: String -> CoordList -> String
entryToString key value = (embedString 25 key) ++ (coordListToString value) ++ "\n"
with writeFile
-- Print to file
indexToFile :: String -> WordIndex -> IO ()
indexToFile filename index = writeFile filename (indexToString "" index)

-- Index to String
indexToString :: String -> WordIndex -> String
indexToString lead index =
  Map.foldrWithKey (\k v r -> lead ++ (entryToString k v) ++ r) "" index
Maybe you guys can help me a little in finding a speed-up here.
Thanks in advance.

This is a well-known problem. The default Haskell String type is simply [Char], which is slow by definition and dead slow when it is constructed lazily (the usual situation). However, as a list, it allows simple and clean processing with list combinators and is useful when performance is not an issue. If it is, one should use the ByteString or Text packages. ByteString is more convenient in that it ships with GHC, but it provides no Unicode support. UTF-8 packages based on ByteString are available on Hackage.
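As a rough sketch of what that advice might look like for the question's indexToFile (the WordIndex type here is an assumed stand-in, with coordinates as plain Ints), using a ByteString Builder so the whole file is written through one buffered handle instead of character by character:

import qualified Data.ByteString.Builder as BB
import qualified Data.Map as Map
import System.IO

-- Assumed stand-in for the OP's WordIndex: words mapped to coordinates.
type WordIndex = Map.Map String [Int]

indexToFile :: FilePath -> WordIndex -> IO ()
indexToFile filename index =
    withFile filename WriteMode $ \h ->
        BB.hPutBuilder h (foldMap entry (Map.toList index))
  where
    -- One line per word: the key, its coordinates, a newline.
    entry (key, coords) =
        BB.stringUtf8 key
            <> foldMap (\c -> BB.char7 ' ' <> BB.intDec c) coords
            <> BB.char7 '\n'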

Yes. You could, for instance, use the Text type from the module Data.Text or Data.Text.Lazy, which internally represents text in a more efficient way (namely UTF-16) than a list of Chars does.
When writing binary data (which may or may not contain text encoded in some form) you can use ByteStrings or their lazy equivalents.
When modifying Texts or ByteStrings, some operations are faster on the lazy versions. If you only want to read from such a string after creating it, the strict (non-lazy) versions are generally recommended.
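A hedged sketch of the same idea with lazy Text and its Builder (the key/value rendering is simplified and purely illustrative):

import qualified Data.Text.Lazy.Builder as TB
import qualified Data.Text.Lazy.IO as TLIO

-- Write one "key value" line per entry, building lazily and writing once.
writeEntries :: FilePath -> [(String, String)] -> IO ()
writeEntries path entries =
    TLIO.writeFile path . TB.toLazyText $ foldMap entry entries
  where
    entry (k, v) =
        TB.fromString k <> TB.singleton ' ' <> TB.fromString v <> TB.singleton '\n'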

Related

Does Haskell's `readFile` read the whole file content into memory?

I want to select the n-th last line from a large text file (~10GB) in a Haskell program.
I found a way to get the n-th last line from an in-memory string:
myLen = 7
n = 3 -- one-based from the end

myLines = lines myText
idx = myLen - n
theLine = head (drop idx myLines)

main :: IO ()
main = do
  putStrLn theLine
The documentation for the readFile function says it "reads the content lazily", so once readFile has got to the n-th last line, will it have stored all the preceding lines in memory (and then explode, because I don't have that much memory)?
So, is readFile the right approach here? Plus, how do I get the IO String output from readFile "in a lazy way" into a list of lines, so that I can then select the n-th last line?
The question has several parts:
The documentation for the readFile function says it "reads the content lazily", so once readFile has got to the n-th last line, will it have stored all the preceding lines in memory (and then explode, because I don't have that much memory)?
Not necessarily. If you only iterate over the contents and produce a result, then the garbage collector should deallocate the contents.
So, is readFile the right approach here?
My opinionated answer is that if it's for a serious tool, readFile isn't the right approach because "lazy IO" is a can of worms.
If it's for a quick and dirty script then go ahead, but if not, and if performance is important, then it is probably best to use lower-level calls to read strict ByteStrings, and for your problem read directly from the end of the file and process that (a sketch of this appears at the end of this answer).
The following program will require only about as much memory as the longest n lines in the file being read:
-- like drop, but takes its number encoded as a lazy
-- unary number via the length of the first list
dropUnary :: [a] -> [b] -> [b]
dropUnary [] bs = bs
dropUnary (_:as) (_:bs) = dropUnary as bs

takeLast :: Int -> [a] -> [a]
takeLast n as = dropUnary (drop n as) as

main :: IO ()
main = putStrLn . head . takeLast 3 . lines =<< readFile "input.txt" -- illustrative file name
The Prelude's lines function is already suitably lazy, but some care was taken in writing takeLast here. You can think of this as operating in "one pass" of the file, looking at subsequent chunks of n consecutive lines until it finds the last chunk. Because it does not maintain any references to the contents of the file from before the chunk it's currently looking at, all of the file contents up to the current chunk can be garbage collected (and generally is, fairly soon).
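For completeness, here is a rough sketch of the stricter alternative mentioned earlier in this answer: seek near the end of the file and scan backwards in fixed-size strict ByteString chunks until enough newlines have been seen. All names here are hypothetical and error handling is omitted:

import qualified Data.ByteString.Char8 as B
import System.IO

-- Read the last n lines by accumulating chunks from the end of the file.
lastLines :: Int -> FilePath -> IO [B.ByteString]
lastLines n path = withFile path ReadMode $ \h -> do
    size <- hFileSize h
    go h size B.empty
  where
    chunkSize = 64 * 1024 :: Integer
    go h pos acc
      | pos <= 0 || B.count '\n' acc > n =
          -- enough newlines seen (or start of file reached): keep the tail
          let ls = B.lines acc
          in return (drop (length ls - n) ls)
      | otherwise = do
          let pos' = max 0 (pos - chunkSize)
          hSeek h AbsoluteSeek pos'
          chunk <- B.hGet h (fromIntegral (pos - pos'))
          go h pos' (chunk `B.append` acc)

A call like lastLines 3 "input.txt" then touches only the tail of the file, however large the file is.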

Reduce memory usage of a Haskell program

I have a following program in Haskell:
processDate :: String -> IO ()
processDate date = do
    ...
    let newFlattenedPropertiesWithPrice =
          filter (notYetInserted date existingProperties) flattenedPropertiesWithPrice
    geocodedProperties <- propertiesWithGeocoding newFlattenedPropertiesWithPrice

propertiesWithGeocoding :: [ParsedProperty] -> IO [(ParsedProperty, Maybe LatLng)]
propertiesWithGeocoding properties = do
    let addresses = fmap location properties
    let batchAddresses = chunksOf 100 addresses
    batchGeocodedLocations <- mapM geocodeAddresses batchAddresses
    let geocodedLocations = fromJust $ concat <$> sequence batchGeocodedLocations
    return (zip properties geocodedLocations)

geocodeAddresses :: [String] -> IO (Maybe [Maybe LatLng])
geocodeAddresses addresses = do
    mapQuestKey <- getEnv "MAP_QUEST_KEY"
    geocodeResponse <- openURL $ mapQuestUrl mapQuestKey addresses
    return $ geocodeResponseToResults geocodeResponse

geocodeResponseToResults :: String -> Maybe [Maybe LatLng]
geocodeResponseToResults inputResponse =
    latLangs
  where
    decodedResponse :: Maybe GeocodingResponse
    decodedResponse = decodeGeocodingResponse inputResponse
    latLangs = fmap (fmap geocodingResultToLatLng . results) decodedResponse

decodeGeocodingResponse :: String -> Maybe GeocodingResponse
decodeGeocodingResponse inputResponse =
    Data.Aeson.decode (fromString inputResponse) :: Maybe GeocodingResponse
It reads a list of properties (homes and apartments) from HTML files, parses them, geocodes the addresses, and saves the results into a SQLite db.
Everything works fine except for very high memory usage (around 800M).
By commenting code out I have pinpointed the problem to the geocoding step.
I send 100 addresses at a time to the MapQuest API (https://developer.mapquest.com/documentation/geocoding-api/batch/get/).
The response for 100 addresses is quite massive, so it might be one of the culprits, but 800M? I feel like it holds on to all of the results until the end, which drives the memory usage so high.
After commenting out the geocoding part of the program, memory usage is around 30M, which is fine.
You can get the full version which reproduces the issue here: https://github.com/Leonti/haskell-memory-so
I'm quite a newbie in Haskell, so not sure how I can optimize it.
Any ideas?
Cheers!
It might be worth recording that this turned out to be a simple streaming problem arising from the use of mapM and sequence, which, like replicateM, traverse, and other things that make you "extract a list from IO", always raise accumulation worries. So a little detour through a streaming library was needed. In the repo it was necessary just to replace
processDate :: String -> IO ()
processDate date = do
    allFiles <- listFiles date
    allProperties <- mapM fileToProperties allFiles
    let flattenedPropertiesWithPrice = filter hasPrice $ concat allProperties
    geocodedProperties <- propertiesWithGeocoding flattenedPropertiesWithPrice
    print geocodedProperties

propertiesWithGeocoding :: [ParsedProperty] -> IO [(ParsedProperty, Maybe LatLng)]
propertiesWithGeocoding properties = do
    let batchProperties = chunksOf 100 properties
    batchGeocodedLocations <- mapM geocodeAddresses batchProperties
    let geocodedLocations = fromJust $ concat <$> sequence batchGeocodedLocations
    return geocodedLocations
with something like this
import Streaming
import qualified Streaming.Prelude as S

processDate :: String -> IO ()
processDate date = do
    allFiles <- listFiles date              -- we accept an unstreamed list
    S.print $ propertiesWithGeocoding       -- this was the main pain point, see below
            $ S.filter hasPrice
            $ S.concat
            $ S.mapM fileToProperties       -- this mapM doesn't accumulate
            $ S.each allFiles               -- the list is converted to a stream

propertiesWithGeocoding
    :: Stream (Of ParsedProperty) IO r
    -> Stream (Of (ParsedProperty, Maybe LatLng)) IO r
propertiesWithGeocoding properties =
    S.concat $ S.concat
             $ S.mapM geocodeAddresses      -- this mapM doesn't accumulate results from MapQuest
             $ S.mapped S.toList            -- convert segments to Haskell lists
             $ chunksOf 100 properties      -- this is the streaming `chunksOf`
    -- concat here flattens a stream of lists of as into a stream of as
    -- and a stream of Maybe as into a stream of as
Then the memory use looks like a series of peaks, each peak corresponding to a trip to MapQuest, promptly followed by a little processing and a print, whereupon GHC forgets all about it and moves on.
Of course this could be done with pipes or conduit, but here we just need a little bit of simple mapM/sequence/traverse/replicateM avoidance, and streaming is perhaps simplest for this sort of quick local refactoring. Note that this list is quite short, so the thought "but short lists are cool with mapM/traverse/etc.!" can be quite spectacularly false. Why not just get rid of them? Whenever you are about to write mapM f over a list, it is a good idea to consider S.mapM f . S.each (or the conduit or pipes equivalent). You will then have a stream and can recover a list with S.toList or an equivalent, but it is likely that, as in this case, you will find you don't need a reified accumulated list, but can instead use some streaming process such as printing to a file or stdout, or writing things to a database, after making whatever list-like manipulations are needed (here we use, e.g., the streaming filter, and also concat, which flattens streamed lists and acts as a sort of catMaybes).
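As a minimal, self-contained sketch of that rule of thumb (the effectful function f here is a made-up stand-in):

import qualified Streaming.Prelude as S

-- Instead of `mapM f xs >>= mapM_ print`, which reifies the whole result
-- list, stream the elements through f and print them as they arrive.
main :: IO ()
main = S.mapM_ print (S.mapM f (S.each [1 .. 10 :: Int]))
  where
    f x = return (x * 2)  -- stand-in for a real effectful function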

Reading a file into an array of strings

I'm pretty new to Haskell, and am trying to simply read a file into a list of strings. I'd like one line of the file per element of the list. But I'm running into a type issue that I don't understand. Here's what I've written for my function:
readAllTheLines hdl = (hGetLine hdl):(readAllTheLines hdl)
That compiles fine. I had thought that the file handle needed to be the same one returned from openFile. I attempted to simply show the list from the above function by doing the following:
displayFile path = show (readAllTheLines (openFile path ReadMode))
But when I try to compile it, I get the following error:
filefun.hs:5:43:
    Couldn't match expected type 'Handle' with actual type 'IO Handle'
    In the return type of a call of 'openFile'
    In the first argument of 'readAllTheLines', namely
      '(openFile path ReadMode)'
    In the first argument of 'show', namely
      '(readAllTheLines (openFile path ReadMode))'
So it seems like openFile returns an IO Handle, but hGetLine needs a plain old Handle. Am I misunderstanding the use of these 2 functions? Are they not intended to be used together? Or is there just a piece I'm missing?
Use readFile and lines for a better alternative.
readLines :: FilePath -> IO [String]
readLines = fmap lines . readFile
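For example (the file name here is made up), printing every line of a file:

main :: IO ()
main = readLines "input.txt" >>= mapM_ putStrLn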
Coming back to your solution: openFile returns IO Handle, so you have to run the action to get the Handle. You also have to check whether the Handle is at EOF before reading from it. It is much simpler to just use the above solution.
import System.IO

readAllTheLines :: Handle -> IO [String]
readAllTheLines hndl = do
    eof <- hIsEOF hndl
    notEnded eof
  where
    notEnded False = do
        line <- hGetLine hndl
        rest <- readAllTheLines hndl
        return (line : rest)
    notEnded True = return []

displayFile :: FilePath -> IO [String]
displayFile path = do
    hndl <- openFile path ReadMode
    readAllTheLines hndl
To add on to Satvik's answer, the example below shows how you can use a function to populate an STArray, in case you need to perform computations on a truly random-access data type.
Code Example
Let's say we have the following problem: we have lines in a text file "test.txt", and we need to load them into an array and then display the line found in the center of that file. This kind of computation is exactly the sort of situation where one would want a random-access array rather than a sequentially structured list. Granted, in this example there may not be a huge difference between using a list and an array, but, generally speaking, list accesses cost O(n) time whereas array accesses give you constant-time performance.
First, let's create our sample text file:
test.txt
This
is
definitely
a
test.
Given the file above, we can use the following Haskell program (located in the same directory as test.txt) to print out the middle line of text, i.e. the word "definitely."
Main.hs
{-# LANGUAGE BlockArguments #-} -- See footnote 1

import Control.Monad.ST (runST, ST)
import Data.Array.MArray (newArray, readArray, writeArray)
import Data.Array.ST (STArray)
import Data.Foldable (for_)
import Data.Ix (Ix) -- See footnote 2

populateArray :: (Integral i, Ix i) => STArray s i e -> [e] -> ST s () -- See footnote 3
populateArray stArray es = for_ (zip [0..] es) (uncurry (writeArray stArray))

middleWord' :: (Integral i, Ix i) => i -> STArray s i String -> ST s String
middleWord' arrayLength = flip readArray (arrayLength `div` 2)

middleWord :: [String] -> String
middleWord ws = runST do
    let len = length ws
    array <- newArray (0, len - 1) "" :: ST s (STArray s Int String)
    populateArray array ws
    middleWord' len array

main :: IO ()
main = do
    ws <- words <$> readFile "test.txt"
    putStrLn $ middleWord ws
Explanation
Starting with the top of Main.hs, the ST s monad and its associated function runST allow us to extract pure values from imperative-style computations with in-place updates in a referentially transparent manner. The module Data.Array.MArray exports the MArray typeclass as an interface for instantiating mutable array data types and provides helper functions for creating, reading, and writing MArrays. These functions can be used in conjunction with STArrays since there is an instance of MArray defined for STArray.
The populateArray function is the crux of our example. It uses for_ to "applicatively" loop over a list of tuples of indices and list elements to fill the given STArray with those list elements, producing a value of type () in the ST s monad.
The middleWord' helper function uses readArray to produce a String (wrapped in the ST s monad) that corresponds to the middle element of a given STArray of Strings.
The middleWord function instantiates a new STArray, uses populateArray to fill the array with values from a provided list of strings, and calls middleWord' to obtain the middle string in the array. runST is applied to this whole ST s monadic computation to extract the pure String result.
We finally use our middleWord function in main to find the middle word in the text file "test.txt".
Further Reading
Haskell's STArray is not the only way to work with arrays in Haskell. There are in fact Arrays, IOArrays, DiffArrays and even "unboxed" versions of all of these array types that avoid using the indirection of pointers to simply store "raw" values. There is a page on the Haskell Wikibook on this topic that may be worth some study. Before that, however, looking at the Wikibook page on mutable objects may give you some insight as to why the ST s monad allows us to safely compute pure values from functions that use imperative/destructive operations.
Footnotes
1 The BlockArguments language extension is what allows us to pass a do block directly to a function without any parentheses or use of the function application operator $.
2 As suggested by the Hackage documentation, Ix is a typeclass mainly meant to be used to specify types for indexing arrays.
3 The use of the Integral and Ix type constraints may be a bit of overkill, but it's used to make our type signatures as general as possible.

How do I avoid memory problems when writing to file using the Writer monad?

I am building some moderately large DIMACS files; however, with the method used below the memory usage is rather large compared to the size of the files generated, and on some of the larger files I need to generate I run into out-of-memory problems.
import Control.Monad.State.Strict
import Control.Monad.Writer.Strict
import qualified Data.ByteString.Lazy.Char8 as B
import Control.Monad
import qualified Text.Show.ByteString as BS
import Data.List

main = printDIMACS "test.cnf" test

test = do
    xs <- freshs 100000
    forM_ (zip xs (tail xs))
          (\(x, y) -> addAll [[negate x, negate y], [x, y]])

type Var = Int
type Clause = [Var]

data DIMACSS = DS
    { nextFresh  :: Int
    , numClauses :: Int
    } deriving (Show)

type DIMACSM a = StateT DIMACSS (Writer B.ByteString) a

freshs :: Int -> DIMACSM [Var]
freshs i = do
    next <- gets nextFresh
    let toRet = [next .. next + i - 1]
    modify (\s -> s { nextFresh = next + i })
    return toRet

fresh :: DIMACSM Int
fresh = do
    i <- gets nextFresh
    modify (\s -> s { nextFresh = i + 1 })
    return i

addAll :: [Clause] -> DIMACSM ()
addAll c = do
    tell (B.concat .
          intersperse (B.pack " 0\n") .
          map (B.unwords . map BS.show) $ c)
    tell (B.pack " 0\n")
    modify (\s -> s { numClauses = numClauses s + length c })

add h = addAll [h]

printDIMACS :: FilePath -> DIMACSM a -> IO ()
printDIMACS file f = do
    writeFile file ""
    appendFile file (concat ["p cnf ", show i, " ", show j, "\n"])
    B.appendFile file b
  where
    (s, b) = runWriter (execStateT f (DS 1 0))
    i = nextFresh s - 1
    j = numClauses s
I would like to keep the monadic building of clauses since it is very handy, but I need to overcome the memory problem. How do I optimize the above program so that it doesn't use too much memory?
If you want good memory behavior, you need to make sure that you write out the clauses as you generate them, instead of collecting them in memory and dumping them at the end, either using laziness or a more explicit approach such as conduits, enumerators, pipes, or the like.
The main obstacle to that approach is that the DIMACS format expects the number of clauses and variables in the header. This prevents the naive implementation from being sufficiently lazy. There are two possibilities:
The pragmatic one is to write the clauses first to a temporary location; after that the numbers are known, so you write them to the real file and append the contents of the temporary file (a minimal sketch appears below).
The prettier approach is possible if the generation of clauses has no side effects (besides the effects offered by your DIMACSM monad) and is sufficiently fast: run it twice, first throwing away the clauses and just calculating the numbers, print the header line, then run the generator again, this time printing the clauses.
(This is from my experience with implementing SAT-Britney, where I took the second approach, because it fitted better with other requirements in that context.)
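A minimal, self-contained sketch of the first (temporary file) approach, independent of the DIMACSM machinery (names are illustrative; cleanup and error handling are omitted):

import qualified Data.ByteString.Lazy.Char8 as B
import System.IO

-- The generator writes clauses straight to the temp handle and returns the
-- final (variable count, clause count); only the counters stay in memory.
writeDIMACS :: FilePath -> (Handle -> IO (Int, Int)) -> IO ()
writeDIMACS file gen = do
    let tmp = file ++ ".tmp"
    (nVars, nClauses) <- withFile tmp WriteMode gen
    writeFile file (concat ["p cnf ", show nVars, " ", show nClauses, "\n"])
    B.readFile tmp >>= B.appendFile file  -- lazy read: streamed in chunks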
Also, in your code, addAll is not lazy enough: the list c needs to be retained even after writing (in the MonadWriter sense) the clauses. This is another space leak. I suggest you implement add as the primitive operation and then define addAll = mapM_ add, as sketched below.
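That suggestion might look roughly like this, a sketch against the definitions above that keeps the same " 0\n" terminator per clause:

-- `add` emits a single clause and bumps the counter; the clause can be
-- garbage collected as soon as it has been told.
add :: Clause -> DIMACSM ()
add c = do
    tell (B.unwords (map BS.show c))
    tell (B.pack " 0\n")
    modify (\s -> s { numClauses = numClauses s + 1 })

addAll :: [Clause] -> DIMACSM ()
addAll = mapM_ add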
As explained in Joachim Breitner's answer, the problem was that DIMACSM was not lazy enough, both because the strict versions of the monads were used and because the number of variables and clauses is needed before the ByteString can be written to the file. The solution is to use the lazy versions of the monads and execute the computation twice. It turns out that it is also necessary to make WriterT the outer monad:
import Control.Monad.State
import Control.Monad.Writer
...

type DIMACSM a = WriterT B.ByteString (State DIMACSS) a

...

printDIMACS :: FilePath -> DIMACSM a -> IO ()
printDIMACS file f = do
    writeFile file ""
    appendFile file (concat ["p cnf ", show i, " ", show j, "\n"])
    B.appendFile file b
  where
    s = execState (execWriterT f) (DS 1 0)
    b = evalState (execWriterT f) (DS 1 0)
    i = nextFresh s - 1
    j = numClauses s

Haskell iteratee: simple worked example of stripping trailing whitespace

I'm trying to understand how to use the iteratee library with Haskell. All of the articles I've seen so far seem to focus on building an intuition for how iteratees could be built, which is helpful, but now that I want to get down and actually use them, I feel a bit at sea. Looking at the source code for iteratees has been of limited value for me.
Let's say I have this function which trims trailing whitespace from a line:
import Data.ByteString.Char8
import Data.Char (isSpace) -- isSpace is not exported by Data.ByteString.Char8

rstrip :: ByteString -> ByteString
rstrip = fst . spanEnd isSpace
What I'd like to do is: make this into an iteratee, read a file and write it out somewhere else with the trailing whitespace stripped from each line. How would I go about structuring that with iteratees? I see there's an enumLinesBS function in Data.Iteratee.Char which I could plumb into this, but I don't know if I should use mapChunks or convStream or how to repackage the function above into an iteratee.
If you just want code, it's this:
procFile' iFile oFile =
    fileDriver (joinI $
                enumLinesBS ><>
                mapChunks (map rstrip) $
                I.mapM_ (B.appendFile oFile))
               iFile
Commentary:
This is a three-stage process: first you transform the raw stream into a stream of lines, then you apply your function to convert that stream of lines, and finally you consume the stream. Since rstrip is in the middle stage, it will be creating a stream transformer (Enumeratee).
You can use either mapChunks or convStream, but mapChunks is simpler. The difference is that mapChunks doesn't allow you to cross chunk boundaries, whereas convStream is more general. I prefer convStream because it doesn't expose any of the underlying implementation, but if mapChunks is sufficient the resulting code is usually shorter.
rstripE :: Monad m => Enumeratee [ByteString] [ByteString] m a
rstripE = mapChunks (map rstrip)
Note the extra map in rstripE. The outer stream (which is the input to rstrip) has type [ByteString], so we need to map rstrip onto it.
For comparison, this is what it would look like if implemented with convStream:
rstripE' :: Monad m => Enumeratee [ByteString] [ByteString] m a
rstripE' = convStream $ do
    mLine <- I.peek
    maybe (return []) (\line -> I.drop 1 >> return [rstrip line]) mLine
This is longer, and it's less efficient because it will only apply the rstrip function to one line at a time, even though more lines may be available. It's possible to work on all of the currently available chunk, which is closer to the mapChunks version:
rstripE'2 :: Monad m => Enumeratee [ByteString] [ByteString] m a
rstripE'2 = convStream (liftM (map rstrip) getChunk)
Anyway, with the stripping enumeratee available, it's easily composed with the enumLinesBS enumeratee:
enumStripLines :: Monad m => Enumeratee ByteString [ByteString] m a
enumStripLines = enumLinesBS ><> rstripE
The composition operator ><> follows the same order as the arrow operator >>>. enumLinesBS splits the stream into lines, then rstripE strips them. Now you just need to add a consumer (which is a normal iteratee), and you're done:
writer :: FilePath -> Iteratee [ByteString] IO ()
writer fp = I.mapM_ (B.appendFile fp)
processFile iFile oFile =
enumFile defaultBufSize iFile (joinI $ enumStripLines $ writer oFile) >>= run
The fileDriver functions are shortcuts for simply enumerating over a file and running the resulting iteratee (unfortunately the argument order is switched from enumFile):
procFile2 iFile oFile = fileDriver (joinI $ enumStripLines $ writer oFile) iFile
Addendum: here's a situation where you would need the extra power of convStream. Suppose you want to concatenate every two lines into one. You can't use mapChunks: consider what happens when the chunk is a singleton element, [bytestring]. mapChunks doesn't provide any way to access the next chunk, so there's nothing else to concatenate with it. With convStream, however, it's simple:
concatPairs = convStream $ do
    line1 <- I.head
    line2 <- I.head
    return $ line1 `B.append` line2
This looks even nicer in applicative style:
convStream $ B.append <$> I.head <*> I.head
You can think of convStream as continually consuming a portion of the stream with the provided iteratee, then sending the transformed version to the inner consumer. Sometimes even this isn't general enough, since the same iteratee is called at each step. In that case, you can use unfoldConvStream to pass state between successive iterations.
convStream and unfoldConvStream also allow for monadic actions, since the stream processing iteratee is a monad transformer.
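For instance, a rough sketch of unfoldConvStream that numbers each line, threading the counter as state (the step function's shape, acc -> Iteratee s m (acc, s'), is from the iteratee package; the numbering scheme itself is made up for illustration):

-- The step function receives the accumulator, consumes some of the stream,
-- and returns the new accumulator together with the next output chunk.
numberLines :: Monad m => Enumeratee [ByteString] [ByteString] m a
numberLines = unfoldConvStream step (1 :: Int)
  where
    step n = do
        line <- I.head
        return (n + 1, [B.pack (show n ++ ": ") `B.append` line])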
