Small subset of huge matrix-like structure from disk transparently - Haskell

A simplified version of the question
I have a huge matrix-like dataset that, for now, we can pretend is actually an n-by-n matrix stored on disk as n^2 IEEE-754 doubles (see details below the line on how this is a simplification - it probably matters). The file is on the order of a gigabyte, but in a certain (pure) function I will only need on the order of n of the elements contained in it. Exactly which elements will be needed is complicated, and not something like a simple slice.
What are my options for decoupling reading the file from disk and the computation? Most of all, I'd like to treat the on-disk data as if it were in memory (I am of course ready to swear to all the gods of referential transparency that the data on disk will not change). I've looked at mmap and friends, but some cursory testing suggests that they don't free memory aggressively enough.
Do I have to couple my computations to IO if I need such fine-grained control over how much of the file is kept in memory?
A more honest description of the on-disk data
The data on disk isn't actually as simple as described. Something closer to the truth would be the following: A file begins with a 32 bit integer n. The following then occurs precisely n times: A 32 bit integer m_i > 0 (1 ≤ i ≤ n), followed by exactly m_i IEEE-754 doubles x_(i,1),…,x_(i, m_i). (So, this is a jagged two-dimensional array).
In practice, determining the i and j for which x_(i, j) is needed depends heavily on the m_i's. When approaching the problem with mmap, the need to read so many of these m_i's seems to load essentially the entire file into memory. The problem is that it all seems to stay there, and I worry that I will have to pull my computation into IO to get more fine-grained control over when this memory is released.
Moreover, "the data structure" actually consists of a large number of these files parameterized by their file names. Together they amount to about a gigabyte.
An attempt at a more handwaving, but possibly easier to understand version of the question
Say I have some data on disk consisting of n^2 elements. A pure Haskell function needs on the order of n of the elements, but which ones it needs depends in a complicated way on the values themselves. I do not want to load the entire file into memory, because it is huge. One solution is to throw my function into the IO monad and read out elements as they are needed, but I call this "giving up". mmap lets us treat on-disk data as if it were in memory, essentially doing lazy IO with help from the OS's virtual memory system. This is nice, but since determining which elements of the data are needed requires accessing a lot of the file, mmap seems to keep far too much of it in memory. In practice, I find that just reading the data needed to determine which data I actually want loads the entire file into memory when using mmap.
What options do I have?

I would suggest that you write an interface that is entirely in IO, where you have an abstract type that contains both a Handle and information about the overall structure of your data (perhaps all the m_is if you can fit them), and this is complemented by IO operations that read out precise bits of the data by seeking in the handle.
I would then simply wrap this interface in a bunch of unsafePerformIO calls! This is effectively what mmap does behind the scenes, in a sense; you're just doing it in a more explicitly managed way.
Assuming you aren't worried about anything changing the file behind your back, you get an interface that you can reason about purely, while it actually performs IO where necessary, giving you the explicit control over memory that you need.
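For concreteness, here is a minimal sketch of what that might look like (the JaggedFile type, its field names, the little-endian layout, and the use of binary's getDoublele are all assumptions made for illustration, not details given in the question):
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (runGet, getDoublele)
import System.IO
import System.IO.Unsafe (unsafePerformIO)

-- The abstract handle: the open file plus the structural metadata (the m_i's
-- and the byte offset at which each row's doubles begin), read once up front.
data JaggedFile = JaggedFile
  { jfHandle  :: Handle
  , jfLengths :: [Int]      -- the m_i's
  , jfOffsets :: [Integer]  -- byte offset of row i's first double
  }

-- Pure-looking accessor for x_(i,j): one seek, one 8-byte read, nothing cached.
-- Only defensible because the file is promised never to change.
index :: JaggedFile -> Int -> Int -> Double
index jf i j = unsafePerformIO $ do
  let off = (jfOffsets jf !! i) + fromIntegral (8 * j)
  hSeek (jfHandle jf) AbsoluteSeek off
  bytes <- BS.hGet (jfHandle jf) 8
  return (runGet getDoublele (BL.fromStrict bytes))
The offsets would be computed from the m_i's when the file is opened, so each lookup touches only the eight bytes it needs and the OS is free to drop everything else. In real code you would probably keep the offsets in an unboxed Vector rather than a list, but the shape of the interface is the point here.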

Related

Philosophy behind http-simple setRequestBodyLBS

I am trying to develop an HTTP client using the http-simple library. Some of the library's design seems confusing to me.
This library makes heavy use of Conduit; however, there is also this 'setRequestBodyLBS' function, and interestingly, a 'setRequestBodyBS' counterpart is missing. It is documented that Conduit and lazy IO do not work well together. So my question is: why not the other way around, i.e., implement the BS version of the function instead of the LBS version? What is the idea behind the choice made here?
Internally, a lazy bytestring is like a linked list of strict bytestrings. Moving from a strict bytestring to a lazy one is cheap (you build a linked list of one element) but going in the reverse direction is costlier (you need to allocate a contiguous chunk of memory for the combined bytes, and then copy each chunk from the list).
Lazy IO uses lazy bytestrings, but they're also useful in other contexts, for example when you have strict chunks arriving from an external source and you want an easy way of accumulating them without having to preallocate a big area of memory or perform frequent reallocations/copies. Instead, you just keep a list of chunks that you later present as a lazy bytestring. (When list concatenations start getting expensive or the granularity is too small, you can use a Builder as a further optimization.)
Another frequent use case is serialization of some composite data structure (say, aeson's Value). If all you are going to do is dump the generated bytes into a file or a network request, it doesn't make much sense to perform a relatively costly consolidation of the serialized bytes of each sub-component. If needed, you can always perform it later with toStrict anyway.
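To make the asymmetry concrete, here is a tiny sketch using only the standard bytestring API (the function names are just for illustration):
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL

-- Cheap: fromChunks just links the existing strict chunks together.
gather :: [BS.ByteString] -> BL.ByteString
gather = BL.fromChunks

-- Costly for large inputs: toStrict allocates one contiguous buffer
-- and copies every chunk into it.
consolidate :: BL.ByteString -> BS.ByteString
consolidate = BL.toStrict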

Haskell: Lazy vs. Strict Text values, which one is recommended when?

I've been doing quite a bit of reading on Data.Text, but I haven't been able to find much in the way of when to prefer Strict over Lazy, or vice-versa.
My understanding is that strict Data.Text is a data structure of contiguous characters in memory, whereas Data.Text.Lazy is a list of chunks of contiguous characters.
My question is: why shouldn't I always use Data.Text.Lazy? It seems the only overhead is the chunk management, but I don't know whether that is noticeable in practice. In exchange, concatenation operations can be much cheaper when Text values become large.
Thoughts and insights welcome!
From the docs:
Data.Text.Lazy
A time and space-efficient implementation of Unicode text using lists of packed arrays. This representation is suitable for high performance use and for streaming large quantities of data. It provides a means to manipulate a large body of text without requiring that the entire content be resident in memory.
Some operations, such as concat, append, reverse and cons, have better complexity than their Data.Text equivalents, due to optimisations resulting from the list spine structure. And for other operations lazy Texts are usually within a few percent of strict ones, but with better heap usage. For data larger than available memory, or if you have tight memory constraints, this module will be the only option.
Data.Text
A time and space-efficient implementation of Unicode text using packed Word16 arrays. Suitable for performance critical use, both in terms of large data quantities and high speed.
...
Most of the functions in this module are subject to fusion, meaning that a pipeline of such functions will usually allocate at most one Text value.
So while Data.Text is sufficient for most purposes, Data.Text.Lazy is specifically for when you have very large amounts of data to process and can't practically hold it all in memory at once. Data.Text is somewhat more efficient in general, but which is better for your application is entirely dependent on your use case. A good rule of thumb is to start with strict, and if you're having memory or speed problems then try using lazy.
I'd say that using Data.Text.Lazy inherits many of the problems of lazy IO. So my suggestion would be to prefer Strict, and if you need to process large pieces of data sequentially, use one of the available streaming libraries. See also What is pipes/conduit trying to solve.
Oftentimes, packages for connecting to a database (postgres, redis, etc.) only give you strict values; any lazy values you get from them are created through functions like Data.Text.Lazy's fromStrict. In this case, using lazy values adds extra overhead. An example of such a package is postgresql-simple.
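For reference, the conversions involved are just these (plain text API functions; nothing here is specific to postgresql-simple):
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

-- Cheap: wraps the strict Text as a single-chunk lazy Text.
toLazy :: T.Text -> TL.Text
toLazy = TL.fromStrict

-- Costly for large values: allocates one contiguous buffer and copies.
toStrict :: TL.Text -> T.Text
toStrict = TL.toStrict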

Suitable Haskell type for large, frequently changing sequence of floats

I have to pick a type for a sequence of floats with 16K elements. The values will be updated frequently, potentially many times a second.
I've read the wiki page on arrays. Here are the conclusions I've drawn so far. (Please correct me if any of them are mistaken.)
IArrays would be unacceptably slow in this case, because they'd be copied on every change. With 16K floats in the array, that's 64KB of memory copied each time.
IOArrays could do the trick, as they can be modified without copying all the data. In my particular use case, doing all updates in the IO monad isn't a problem at all. But they're boxed, which means extra overhead, and that could add up with 16K elements.
IOUArrays seem like the perfect fit. Like IOArrays, they don't require a full copy on each change. But unlike IOArrays, they're unboxed, meaning they're basically the Haskell equivalent of a C array of floats. I realize they're strict. But I don't see that being an issue, because my application would never need to access anything less than the entire array.
Am I right to look to IOUArrays for this?
Also, suppose I later want to read or write the array from multiple threads. Will I have backed myself into a corner with IOUArrays? Or is the choice of IOUArrays totally orthogonal to the problem of concurrency? (I'm not yet familiar with the concurrency primitives in Haskell and how they interact with the IO monad.)
A good rule of thumb is that you should almost always use the vector library instead of arrays. In this case, you can use mutable vectors from the Data.Vector.Mutable module.
The key operations you'll want are read and write which let you mutably read from and write to the mutable vector.
You'll want to benchmark, of course (with criterion), or you might be interested in browsing some benchmarks I did, e.g. here (if that link works for you; it was broken for me).
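A tiny sketch of those operations, using the unboxed variant since the elements are plain floats (the 16384 size is just taken from the question):
import qualified Data.Vector.Unboxed.Mutable as VUM
import Control.Monad (forM_)

main :: IO ()
main = do
  -- one 16K buffer of floats, mutated in place, never copied on update
  buf <- VUM.replicate 16384 (0 :: Float)
  forM_ [0 .. VUM.length buf - 1] $ \i ->
    VUM.write buf i (fromIntegral i)
  x <- VUM.read buf 100
  print x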
The vector library is a nice interface (a huge understatement) over GHC's more primitive array types, which you can get at more directly via the primitive package. The same goes for the types in the standard array package; for instance, an IOUArray is essentially a wrapper around a MutableByteArray#.
Unboxed mutable arrays are usually going to be the fastest, but you should compare them in your application to IOArray or the vector equivalent.
My advice would be:
if you probably don't need concurrency, first try a mutable unboxed Vector, as Gabriel suggests
if you know you will want concurrent updates (and feel a little brave), then first try a MutableArray and do atomic updates with the functions from the atomic-primops library. If you want fine-grained locking, this is your best choice. Of course, concurrent reads will work fine on whatever array you choose.
It should also be theoretically possible to do concurrent updates on a MutableByteArray (equivalent to IOUArray) with those atomic-primops functions too, since a Float should always fit into a word (I think), but you'd have to do some research (or bug Ryan).
Also be aware of potential memory reordering issues when doing concurrency with the atomic-primops stuff, and help convince yourself with lots of tests; this is somewhat uncharted territory.

Space leaks with Haskell's cereal library?

As a hobby project called 'beercan', I'm reverse-engineering the resource files of the Torchlight games. Using an okay-ish hex editor, I try to guess the structure of the files, and then I model my ideas, use cereal to write Getters (and later some Putters), and try to decode every file in an application of the library.
I've just started on Torchlight's compiled layout files (*.LAYOUT in TL1, *.LAYOUT.cmp in TL2). The format turns out to be a little trickier than the dat files, but I think I have figured out the basic structure and how they are encoded in the TL2 files. So I'm trying to make a map of file versions, tag numbers, and guessed data types.
To do so, I wrote an application that flattens the data structure, leaving only the guessed type of the values of the leaves, each annotated with the file version and the node and leaf tag numbers. I turn this into a map from the file version and tag numbers to a set of the guessed types. For every file, I'd expect this Map to maybe take twice the file size in memory. (Not sure, though.) Then, I merge these maps, and I print the map.
For some reason, even if I only take 20MB worth of files (100 files), memory usage increases linearly to about 200MB, then decreases to the final size of the resulting map, and then deflates rapidly as I print it.
I wouldn't expect this memory usage. Does anyone know how I could fix it? I've tried to force values after decoding them (using deepseq), I've tried adding bangs to data types, but this hasn't really helped. I've tried copying all bytestrings I keep in the file structure, which brought down the memory usage a bit, but it's still unacceptably high, especially when I want to analyze the entire dataset (200MB+ of original files).
-edit- I've pushed a (not very S)SCCE to demonstrate the performance issue, (accidentally) along with my profiling results.
Clone the repository.
cabal configure, with flags to enable profiling (is it normal to need --enable-library-profiling --enable-executable-profiling --ghc-options="-rtsopts -prof"?)
cabal build
cd test, and run StressTest.sh.
This script tries to load a regular TL2 layout file 100 times. On my machine, top says it takes about 500MB of memory, and the profiling results are consistent with my description above.
I totally agree with @petrpudlak that we would need actual code to make any meaningful comments on the question "why does my code use so much memory?" :) (sorry, you did offer code). However, some of the patterns you describe are pretty typical in Haskell, and some generic discussion is possible.
First of all, note that native Haskell types use a lot more memory than you might guess. Take a look at the GHC memory footprint page at http://www.haskell.org/haskellwiki/GHC/Memory_Footprint. Note that even a simple Char will take a full 16 bytes of memory! Add to that the pointers for linked-list items in a String, and you will easily use more than an order of magnitude more memory than you might have guessed. If memory is important, you should use another data type, like Data.Text or Data.ByteString, which store strings internally more like C would (as a block of bytes in memory, with 1-4 bytes per char, depending on encoding and which char is used). If data other than Strings is the problem, you can use unboxed arrays for arbitrary data types.
Second of all, if possible, you can cut down memory usage by processing items in series (so that the memory can be garbage collected right away). Haskell's laziness often does this for you automatically; for instance, try running the following program:
import Data.Char
main = interact $ map toUpper
As you type, the output will appear continuously (your OS, not Haskell, may buffer full lines, so you may need to hit 'enter' before seeing anything, but you will see output update for each 'enter'). Rather than loading the whole input into memory and then processing all at once, Char memory is being created and garbage collected Char by Char.
Of course this isn't always possible (ie- if you have to process the data in a very nonlocal way), but most of the time at least parts of the code can be refactored this way to cut down total memory usage.
Edit- Sorry, I just realized that you did post a link to the code, and you are using ByteString..... So some of what I wrote isn't valid. But I do still see boxed lists and unpacking of the ByteString, so I will leave the answer as it is.
The memory usage pattern sounds like your application is building up a lot of unnecessary thunks and then memory consumption starts going down when those thunks get evaluated. I only glanced at your code quickly but one simple change you could try is to replace all imports of Data.Map with Data.Map.Strict. This is especially important if you are doing a lot of updates on the values inside a Map without forcing evaluation in between.
Another thing you should be aware of is that replicateM is quite inefficient with large counts in a strict monad (see e.g. this answer). I'm not sure what kinds of counts you are usually dealing with in your application, but it's good to keep in mind.
It might also help to use strict fields in simple container data types like your LeafValue type and compile with -funbox-strict-fields (and -O2 of course).
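To illustrate both suggestions, a small sketch; Guess is a made-up stand-in for something like your LeafValue type, and the fold is just an example of updating a map strictly:
{-# OPTIONS_GHC -funbox-strict-fields #-}
import qualified Data.Map.Strict as Map
import Data.List (foldl')

-- Strict fields force the decoded components as soon as the constructor is built.
data Guess = Guess !Int !Double

-- insertWith from Data.Map.Strict evaluates the combined value, so repeated
-- updates don't accumulate thunks inside the map.
count :: [(String, Int)] -> Map.Map String Int
count = foldl' (\m (k, v) -> Map.insertWith (+) k v m) Map.empty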

Is it possible to make fast big circular buffer arrays for stream recording in Haskell?

I'm considering converting a C# app to Haskell as my first "real" Haskell project. However I want to make sure it's a project that makes sense. The app collects data packets from ~15 serial streams that come at around 1 kHz, loads those values into the corresponding circular buffers on my "context" object, each with ~25000 elements, and then at 60 Hz sends those arrays out to OpenGL for waveform display. (Thus it has to be stored as an array, or at least converted to an array every 16 ms). There are also about 70 fields on my context object that I only maintain the current (latest) value, not the stream waveform.
There are several aspects of this project that map well to Haskell, but the thing I worry about is the performance. If for each new datapoint in any of the streams, I'm having to clone the entire context object with 70 fields and 15 25000-element arrays, obviously there's going to be performance issues.
Would I get around this by putting everything in the IO monad? But then that seems to somewhat defeat the purpose of using Haskell, right? Also, all my code in C# is event-driven; is there an idiom for that in Haskell? It seems like adding a listener creates a "side effect", and I'm not sure how exactly that would be done.
Look at this link, under the section "The ST monad":
http://book.realworldhaskell.org/read/advanced-library-design-building-a-bloom-filter.html
Back in the section called “Modifying array elements”, we mentioned that modifying an immutable array is prohibitively expensive, as it requires copying the entire array. Using a UArray does not change this, so what can we do to reduce the cost to bearable levels?
In an imperative language, we would simply modify the elements of the array in place; this will be our approach in Haskell, too.
Haskell provides a special monad, named ST, which lets us work safely with mutable state. Compared to the State monad, it has some powerful added capabilities.
We can thaw an immutable array to give a mutable array; modify the mutable array in place; and freeze a new immutable array when we are done.
...
The IO monad also provides these capabilities. The major difference between the two is that the ST monad is intentionally designed so that we can escape from it back into pure Haskell code.
So it should be possible to modify in place, and it won't defeat the purpose of using Haskell after all.
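A minimal illustration of modifying an unboxed array in place inside ST and then escaping back into pure code (the array size and contents here are arbitrary):
import Data.Array.ST (runSTUArray, newArray, writeArray)
import Data.Array.Unboxed (UArray)

-- Fills a 16K-element array with in-place writes; runSTUArray freezes it into
-- an immutable UArray at the end without an extra copy.
ramp :: UArray Int Float
ramp = runSTUArray $ do
  arr <- newArray (0, 16383) 0
  mapM_ (\i -> writeArray arr i (fromIntegral i)) [0 .. 16383]
  return arr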
Yes, you would probably want to use the IO monad for mutable data. I don't believe the ST monad is a good fit for this problem space because the data updates are interleaved with actual IO actions (reading input streams). As you would need to perform the IO within ST by using unsafeIOToST, I find it preferable to just use IO directly. The other approach with ST is to continually thaw and freeze an array; this is messy because you need to guarantee that old copies of the data are never used.
Although evidence shows that a pure solution (in the form of Data.Sequence.Seq) is often faster than using mutable data, given your requirement that data be pushed out to OpenGL, you'll possibly get better performance from working with the array directly. I would use the functions from Data.Vector.Storable.Mutable (from the vector package), as then you have access to the ForeignPtr for export.
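As a rough sketch of that storable-vector suggestion, one stream's ring buffer might look something like this (the Ring type and the index bookkeeping are made up for illustration; only the vector operations come from the library):
import qualified Data.Vector.Storable.Mutable as VSM

-- One circular buffer: a storable mutable vector plus a write index that the
-- caller threads through. Each new sample overwrites the oldest slot in place.
type Ring = VSM.IOVector Float

newRing :: Int -> IO Ring
newRing n = VSM.replicate n 0

pushSample :: Ring -> Int -> Float -> IO Int
pushSample ring i x = do
  VSM.write ring i x
  return ((i + 1) `mod` VSM.length ring)
Because the vector is storable, its memory can later be handed to OpenGL via the ForeignPtr, as mentioned above.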
You can look at arrows (Yampa) for one very common approach to event-driven code. Another area is functional reactive programming (FRP). There are starting to be some reasonably mature libraries in this domain, such as Netwire or reactive-banana. I don't know if they'd provide adequate performance for your requirements, though; I've mostly used them for GUI-type programming.
