Space leaks with Haskell's cereal library? - haskell

As a hobby project called 'beercan', I'm reverse-engineering the resource files of the Torchlight games. Using an okay-ish hex editor, I try to guess the structure of the files, and then I model my ideas, use cereal to write Getters (and later some Putters), and try to decode every file in an application of the library.
I've just started on Torchlight's compiled layout files (*.LAYOUT in TL1, *.LAYOUT.cmp in TL2). The format turns out to be a little trickier than the dat files, but I think I figured out the basic structure, and how they are encoded in the TL2 files. so I'm trying to make a map of file versions, tag numbers, and guessed data types.
To do so, I wrote an application that flattens the data structure, leaving only the guessed type of the values of the leaves, each annotated with the file version and the node and leaf tag numbers. I turn this into a map from the file version and tag numbers to a set of the guessed types. For every file, I'd expect this Map to maybe take twice the file size in memory. (Not sure, though.) Then, I merge these maps, and I print the map.
For some reason, even if I only take 20MB worth of files (100 files), memory usage increases linearly to about 200MB, then decreases to the final size of the resulting map, and then deflates rapidly as I print it.
I wouldn't expect this memory usage. Does anyone know how I could fix it? I've tried to force values after decoding them (using deepseq), I've tried adding bangs to data types, but this hasn't really helped. I've tried copying all bytestrings I keep in the file structure, which brought down the memory usage a bit, but it's still unacceptably high, especially when I want to analyze the entire dataset (200MB+ of original files).
-edit- I've pushed a (not very S)SCCE to demonstrate the performance issue, (accidentally) along with my profiling results.
Clone the repository.
cabal configure, with flags to enable profiling (is it normal to need --enable-library-profiling --enable-executable-profiling --ghc-options="-rtsopts -prof"?)
cabal build
cd test, and run StressTest.sh.
This script tries to load a regular TL2 layout file 100 times. On my machine, top says it takes about 500MB of memory, and the profiling results are consistent with my description above.

I totally agree with #petrpudlak, we would need actual code to make any meaningful comments to the question "why does my code use so much memory?" :) (sorry, you did offer code), however, some of the patterns you describe are pretty typical in Haskell and some generic discussion is possible.
First of all, note that native Haskell types use a lot more memory than you might guess. Take a look at the ghc memory footprint page at http://www.haskell.org/haskellwiki/GHC/Memory_Footprint. Note that even a simple Char will take a full 16 bytes of memory! Add to that pointers for linked list items in a String, and you will easily use more than an order of magnitude greater memory than you might have guessed. If memory is important, you should use another data type, like Data.Text or Data.ByteString, which store Strings internally more like c would (as a block of bytes in memory, with 1-4 bytes per char, depending on encoding and what char is used). If data other than Strings are the problem, you can use unboxed arrays for arbitrary data types.
Second of all, if possible, you can cut down memory usage by processing items in series (where the memory will be garbage collected right away). Haskell laziness often does this for you automatically, for instance, try to run the following program
import Data.Char
main = interact $ map toUpper
As you type, the output will appear continuously (your OS, not Haskell, may buffer full lines, so you may need to hit 'enter' before seeing anything, but you will see output update for each 'enter'). Rather than loading the whole input into memory and then processing all at once, Char memory is being created and garbage collected Char by Char.
Of course this isn't always possible (ie- if you have to process the data in a very nonlocal way), but most of the time at least parts of the code can be refactored this way to cut down total memory usage.
Edit- Sorry, I just realized that you did post a link to the code, and you are using ByteString..... So some of what I wrote isn't valid. But I do still see boxed lists and unpacking of the ByteString, so I will leave the answer as it is.

The memory usage pattern sounds like your application is building up a lot of unnecessary thunks and then memory consumption starts going down when those thunks get evaluated. I only glanced at your code quickly but one simple change you could try is to replace all imports of Data.Map with Data.Map.Strict. This is especially important if you are doing a lot of updates on the values inside a Map without forcing evaluation in between.
Another things you should be aware of is that replicateM is quite inefficient with larger numbers in a strict monad (see e.g. this answer). I'm not sure what kinds of counts you are usually dealing with in your application, but it's good to keep in mind.
It might also help to use strict fields in simple container data types like your LeafValue type and compile with -funbox-strict-fields (and -O2 of course).

Related

philosophy behind http-simple setRequestBodyLBS

I am trying to develop an http client by using http-simple library. Some implementation of the library seems confusing to me.
This library makes heavy use of Conduit; however there is also this 'setRequestBodyLBS' function and interestingly, the function 'setRequestBodyBS' is missing here. It is documented that Conduit and lazy IO do not work well together. So my question is, why not the other way around? i.e., implement the BS version of the function instead of the LBS version? What is the idea behind the choice made here?
Internally, a lazy bytestring is like a linked list of strict bytestrings. Moving from a strict bytestring to a lazy one is cheap (you build a linked list of one element) but going in the reverse direction is costlier (you need to allocate a contiguous chunk of memory for the combined bytes, and then copy each chunk from the list).
Lazy IO uses lazy bytestrings, but they're also useful in other contexts, for example when you have strict chunks arriving from an external source and you want an easy way of accumulating them without having to preallocate a big area of memory or perform frequent reallocations/copies. Instead, you just keep a list of chunks that you later present as a lazy bytestring. (When list concatenations start getting expensive or the granularity is too small, you can use a Builder as a further optimization.)
Another frequent use case is serialization of some composite data structure (say, aeson's Value). If all you are going to do is dump the generated bytes into a file or a network request, it doesn't make much sense to perform a relatively costly consolidation of the serialized bytes of each sub-component. If needed, you can always perform it later with toStrict anyway.

Small subset of huge matrix-like structure from disk transparently

A simplified version of the question
I have a huge matrix-like dataset, that we for now can pretend is actually an n-by-n matrix stored on-disk as n^2 IEEE-754 doubles (see details below the line on how this is a simplification - it probably matters). The file is on the order of a gigabyte, but in a certain (pure) function I will only need on the order of n of the elements contained in it. Exactly which elements will be needed is complicated, and not something like a simple slice.
What are my options for decoupling reading the file from disk and the computation? Most of all, I'd like to treat the on-disk data as if it were in memory (I am of course ready to swear to all the gods of referential transparency that the data on disk will not change). I've looked at mmap and friends, but some cursory testing shows that these seem not to aggressively enough free memory.
Do I have to go couple my computations to IO if I need such fine-grained control of how much of the file is kept in memory?
A more honest description of the on-disk data
The data on disk isn't actually as simple as described. Something closer to the truth would be the following: A file begins with a 32 bit integer n. The following then occurs precisely n times: A 32 bit integer m_i > 0 (1 ≤ i ≤ n), followed by exactly m_i IEEE-754 doubles x_(i,1),…,x_(i, m_i). (So, this is a jagged two-dimensional array).
In practice, determining i and j for which x_(i, j) is needed depends highly on the m_i's. When approaching the problem with mmap, the need to read so many of these m_is seems to essentially load the entire file into memory. The problem is that it all seems to stay there, and I worry that I will have to pull my computation into IO to have more fine-grained control over the releasing of this memory.
Moreover, "the data structure" actually consists of a large number of these files parameterized by their file names. Together they amount to about a gigabyte.
An attempt at a more handwaving, but possibly easier to understand version of the question
Say I have some data on disk consisting of n^2 elements. A pure Haskell function needs on the order of n of the elements, but which of them depends in a complicated way on the values. I do not want to load the entire file into memory, because it is huge. One solution is to throw my function into the IO monad and read out elements as they are needed, but I call this "giving up". mmap lets us treat on-disk data as if it were in memory, essentially doing lazy IO with help from the OS' virtual memory system. This is nice, but since determining which elements of the data are needed requires accessing a lot of the file, mmap seems to keep way too much of the file in memory. In practice, I find that reading the data I need to determine the data I actually need loads the entire file into memory when using mmap.
What options do I have?
I would suggest that you write an interface that is entirely in IO, where you have an abstract type that contains both a Handle and information about the overall structure of your data (perhaps all the m_is if you can fit them), and this is complemented by IO operations that read out precise bits of the data by seeking in the handle.
I would then simply wrap this interface in a bunch of unsafePerformIO calls! This is effectively what mmap does behind the scenes, in a sense. You just are doing so in a more explicitly managed way.
Assuming you aren't worried about anyway "swapping out" the file behind your back, you can get an interface that you can reason about purely while it actually does IO where necessary to give the explicit control over memory you need.

Suitable Haskell type for large, frequently changing sequence of floats

I have to pick a type for a sequence of floats with 16K elements. The values will be updated frequently, potentially many times a second.
I've read the wiki page on arrays. Here are the conclusions I've drawn so far. (Please correct me if any of them are mistaken.)
IArrays would be unacceptably slow in this case, because they'd be copied on every change. With 16K floats in the array, that's 64KB of memory copied each time.
IOArrays could do the trick, as they can be modified without copying all the data. In my particular use case, doing all updates in the IO monad isn't a problem at all. But they're boxed, which means extra overhead, and that could add up with 16K elements.
IOUArrays seem like the perfect fit. Like IOArrays, they don't require a full copy on each change. But unlike IOArrays, they're unboxed, meaning they're basically the Haskell equivalent of a C array of floats. I realize they're strict. But I don't see that being an issue, because my application would never need to access anything less than the entire array.
Am I right to look to IOUArrays for this?
Also, suppose I later want to read or write the array from multiple threads. Will I have backed myself into a corner with IOUArrays? Or is the choice of IOUArrays totally orthogonal to the problem of concurrency? (I'm not yet familiar with the concurrency primitives in Haskell and how they interact with the IO monad.)
A good rule of thumb is that you should almost always use the vector library instead of arrays. In this case, you can use mutable vectors from the Data.Vector.Mutable module.
The key operations you'll want are read and write which let you mutably read from and write to the mutable vector.
You'll want to benchmark of course (with criterion) or you might be interested in browsing some benchmarks I did e.g. here (if that link works for you; broken for me).
The vector library is a nice interface (crazy understatement) over GHC's more primitive array types which you can get to more directly in the primitive package. As are the things in the standard array package; for instance an IOUArray is essentially a MutableByteArray#.
Unboxed mutable arrays are usually going to be the fastest, but you should compare them in your application to IOArray or the vector equivalent.
My advice would be:
if you probably don't need concurrency first try a mutable unboxed Vector as Gabriel suggests
if you know you will want concurrent updates (and feel a little brave) then first try a MutableArray and then do atomic updates with these functions from the atomic-primops library. If you want fine-grained locking, this is your best choice. Of course concurrent reads will work fine on whatever array you choose.
It should also be theoretically possible to do concurrent updates on a MutableByteArray (equivalent to IOUArray) with those atomic-primops functions too, since a Float should always fit into a word (I think), but you'd have to do some research (or bug Ryan).
Also be aware of potential memory reordering issues when doing concurrency with the atomic-primops stuff, and help convince yourself with lots of tests; this is somewhat uncharted territory.

Most efficient method of constructing and printing very large strings

I have a program which constructs very large strings. Currently I am using lazy ByteStrings. Here are the problem parameters summarized:
The current implementation works up to about 500k characters, simply running out of memory afterwards (~600MB). I would like this (amount of characters) to run in under 50MB.
The string isn't accessed while being built. This probably leads to a lot of thunks and hence the memory issue. I am using Builder to make the ByteStrings, but it seems that there is no strict version of Builder (or at least I can't find it).
The string cannot be put in the file while being built. The entire build operation has to happen before the string is placed in a file.
I don't need unicode support. Even 7 bit ascii would do. I believe that ByteString doesn't waste memory to encode unicode characters though.
Things I have tried:
Calling seq on the ByteStrings as they are being built. This seems to work for 50-100k characters but after that the effect is the same.
Using strict ByteStrings. I couldn't figure out how to use Builder with them, so I ended up using lists and concat.
Using UArray Int Char. This means either knowing the size of the string in advance and allocating the entire array, or having a ton of intermediate data structures.

Dealing with large files in Haskell

I have a large file (4+ gigs) of, lets just say, 4 byte floats. I would like to treat it as List, in the sense that I would like to be able to use map, filter, foldl, etc. However, instead of producing a new list with the output, I would like to write the output back into the file, and thus only have to load a small portion of the file in memory. You could say I what a type called MutableFileList
Has anyone ran into this situation before? Instead of re-inventing the wheel I was wondering if there a Hackish way for dealing with this?
You should not treat it as a [Double] or [Float] in memory. What you could do is use one of the list-like packed array types, such as uvector/vector/... in company with mmapFile or readFile to pull chunks of the file in at a time, and process them. Or use a lazy packed array type, equivalent to lazy bytestrings.
This should be quite helpful to you. You can use readFile and writeFile for what you need to do, and everything is done lazily. It only keeps things in memory while they are still being used, so you can read, process, and write the file out without blowing up your computer.
You might use mmap to map the file to memory and then process it. There is a mmap module that promises to read and write mmaped files and can even work with lazily mapped chunks of files, but I haven't tried it.
The interface for writing to the mapped file seems to be quite low level, so you'd have to build your own abstractions or work with Foreign.Ptr and the like.

Resources