Best approach to read and write large files with collective MPI-IO

I would like to read and write large data sets in Fortran using MPI-IO. My preferred approach would be to use an MPI type defined with MPI_Type_create_subarray, with a single dimension, to describe the view of each process on the file. My Fortran code thus looks like this:
! A contiguous type to describe the vector per element.
! MPI_TYPE_CONTIGUOUS(COUNT, OLDTYPE, NEWTYPE, IERROR)
call MPI_Type_contiguous(nComponents, rk_mpi, &
& me%vectype, iError)
call MPI_Type_commit( me%vectype, iError )
! A subarray to describe the view of this process on the file.
! MPI_TYPE_CREATE_SUBARRAY(ndims, array_of_sizes, array_of_subsizes,
! array_of_starts, order, oldtype, newtype, ierror)
call MPI_Type_create_subarray( 1, [ globElems ], [ locElems ], &
& [ elemOff ], MPI_ORDER_FORTRAN, &
& me%vectype, me%ftype, iError)
However, array_of_sizes and array_of_starts, which describe global quantities, are just "normal" integers in the MPI interface. Thus this approach hits a limit at about 2 billion elements.
Is there another interface which uses MPI_OFFSET_KIND for these global values?
The only workaround I see so far is to use the displacement option of MPI_File_set_view instead of defining the view with the help of a subarray MPI type. However, this "feels" wrong. Would you expect a performance impact with either approach for collective IO? Does anybody know whether this interface will change in MPI-3?
Maybe I should use some other MPI type?
What is the recommended way to write large data files to disk efficiently, in parallel, with collective IO?

Help is coming.
In MPI-3, there will be datatype manipulation routines that use MPI_Count instead of an int. For backwards compatibility (groan) the existing routines won't change, but you should be able to make your type.
But for now...
For subarray in particular, though, this isn't usually thought of as a huge issue at the moment - even for a 2d array, indices of 2 billion give you an array size of 4×10^18, which is admittedly pretty large (but exactly the sort of number targeted for exascale-type computing). In higher dimensions, it's even larger.
In 1d, though, a list of 2 billion numbers is only ~8GB, which isn't big data by any stretch, and I think that's the situation you find yourself in. My suggestion would be to leave it in the form you have now for as long as you can. Is there a common factor in the local elements? You can work around this by bundling up the types in units of (say) 10 vectypes, if that works - for your code it shouldn't matter, but it would reduce the numbers in locElems and globElems by that same factor. Otherwise, yes, you could always use the displacement field in MPI_File_set_view.

Related

Selecting arbitrary rows from a Neo matrix in Nim?

I am using the Neo library for linear algebra in Nim, and I would like to extract arbitrary rows from a matrix.
I can explicitly select a contiguous sequence of rows as per the examples in the README, but I can't select a disjoint subset of rows.
import neo
let x = randomMatrix(10, 4)
let some_rows = @[1, 3, 5]
echo x[2..4, All] # works fine
echo x[some_rows, All] ## error
The first echo works because you are creating a Slice object, for which neo has defined a proc. The second echo uses a sequence of integers, and that kind of access is not defined in the neo library. Unfortunately, Slices describe contiguous closed ranges - you can't even specify a step to iterate in increments bigger than one - so there is no way to accomplish what you want directly.
Looking at the structure of a Matrix, it seems that it is highly optimised to avoid copying data. Matrix transformation operations seem to reuse the data of the previous matrix and only change the access pattern/dimensions. As such, a matrix transformation with arbitrary row selection would not be possible: the indexes in your example specifically access non-contiguous data, and this would need to be encoded somehow in the new structure. Plus, if you wrote @[1, 5, 3], that would defeat any kind of normal iterative looping.
An alternative, of course, is to write a proc which accepts a sequence instead of a slice and builds a new matrix, copying data from the old one. This implies a performance penalty, but if you think this is a good addition to the library, please request it in the issue tracker of the project. If it is not accepted, then you will need to write such a proc yourself for use in your own programs.

Small subset of huge matrix-like structure from disk transparently

A simplified version of the question
I have a huge matrix-like dataset, that we for now can pretend is actually an n-by-n matrix stored on-disk as n^2 IEEE-754 doubles (see details below the line on how this is a simplification - it probably matters). The file is on the order of a gigabyte, but in a certain (pure) function I will only need on the order of n of the elements contained in it. Exactly which elements will be needed is complicated, and not something like a simple slice.
What are my options for decoupling reading the file from disk and the computation? Most of all, I'd like to treat the on-disk data as if it were in memory (I am of course ready to swear to all the gods of referential transparency that the data on disk will not change). I've looked at mmap and friends, but some cursory testing shows that these seem not to free memory aggressively enough.
Do I have to go couple my computations to IO if I need such fine-grained control of how much of the file is kept in memory?
A more honest description of the on-disk data
The data on disk isn't actually as simple as described. Something closer to the truth would be the following: a file begins with a 32-bit integer n. The following then occurs precisely n times: a 32-bit integer m_i > 0 (1 ≤ i ≤ n), followed by exactly m_i IEEE-754 doubles x_(i,1), …, x_(i,m_i). (So, this is a jagged two-dimensional array.)
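For concreteness, here is a rough sketch of how just the row lengths m_i could be read with Data.Binary.Get while skipping over the doubles themselves (the helper name is hypothetical, and little-endian encoding is assumed):
import Control.Monad (replicateM)
import Data.Binary.Get (Get, getWord32le, skip)

-- Read only the n row lengths m_i, jumping over the doubles of each row.
getRowLengths :: Get [Int]
getRowLengths = do
  n <- fromIntegral <$> getWord32le
  replicateM n $ do
    m <- fromIntegral <$> getWord32le
    skip (8 * m)  -- skip the m_i IEEE-754 doubles of row i
    pure m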
In practice, determining i and j for which x_(i,j) is needed depends highly on the m_i's. When approaching the problem with mmap, the need to read so many of these m_i's seems to essentially load the entire file into memory. The problem is that it all seems to stay there, and I worry that I will have to pull my computation into IO to have more fine-grained control over when this memory is released.
Moreover, "the data structure" actually consists of a large number of these files parameterized by their file names. Together they amount to about a gigabyte.
An attempt at a more handwaving, but possibly easier to understand version of the question
Say I have some data on disk consisting of n^2 elements. A pure Haskell function needs on the order of n of the elements, but which of them depends in a complicated way on the values. I do not want to load the entire file into memory, because it is huge. One solution is to throw my function into the IO monad and read out elements as they are needed, but I call this "giving up". mmap lets us treat on-disk data as if it were in memory, essentially doing lazy IO with help from the OS's virtual memory system. This is nice, but since determining which elements of the data are needed requires accessing a lot of the file, mmap seems to keep way too much of the file in memory. In practice, I find that reading the data needed to determine which data I actually need loads the entire file into memory when using mmap.
What options do I have?
I would suggest that you write an interface that is entirely in IO, where you have an abstract type that contains both a Handle and information about the overall structure of your data (perhaps all the m_is if you can fit them), and this is complemented by IO operations that read out precise bits of the data by seeking in the handle.
I would then simply wrap this interface in a bunch of unsafePerformIO calls! This is effectively what mmap does behind the scenes, in a sense - you're just doing it in a more explicitly managed way.
Assuming you aren't worried about anything swapping out the file behind your back, you get an interface that you can reason about purely, while it actually does IO where necessary to give you the explicit control over memory that you need.
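To make that concrete, here is a minimal sketch of such an interface, assuming the byte offset of each row has already been collected and that the doubles are stored little-endian (all names here are hypothetical):
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (getWord64le, runGet)
import GHC.Float (castWord64ToDouble)
import System.IO
import System.IO.Unsafe (unsafePerformIO)

-- A handle plus whatever index information is needed to locate x_(i,j);
-- here, just the byte offset of each row's first double.
data MatrixFile = MatrixFile
  { mfHandle     :: Handle
  , mfRowOffsets :: [Integer]
  }

-- Read a single double at row i, column j by seeking in the handle.
readElemIO :: MatrixFile -> Int -> Int -> IO Double
readElemIO mf i j = do
  let off = (mfRowOffsets mf !! i) + fromIntegral (8 * j)
  hSeek (mfHandle mf) AbsoluteSeek off
  bytes <- BL.hGet (mfHandle mf) 8
  pure (castWord64ToDouble (runGet getWord64le bytes))

-- The "pure" view, justified only because the file on disk never changes.
readElem :: MatrixFile -> Int -> Int -> Double
readElem mf i j = unsafePerformIO (readElemIO mf i j)
{-# NOINLINE readElem #-}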

Efficient algorithm for grouping array of strings by prefixes

I wonder what is the best way to group an array of strings according to a list of prefixes (of arbitrary length).
For example, if we have this:
prefixes = ['GENERAL', 'COMMON', 'HY-PHE-NATED', 'UNDERSCORED_']
Then
tasks = ['COMMONA', 'COMMONB', 'GENERALA', 'HY-PHE-NATEDA', 'UNDERSCORED_A', 'HY-PHE-NATEDB']
Should be grouped this way:
[['GENERALA'], ['COMMONA', 'COMMONB'], ['HY-PHE-NATEDA', 'HY-PHE-NATEDB'], ['UNDERSCORED_A']]
The naïve approach is to loop through all the tasks and, in an inner loop, through the prefixes (or vice versa) and test each task against each prefix.
Can anyone give me a hint on how to do this more efficiently?
It depends a bit on the size of your problem, of course, but your naive approach should be okay if you sort both your prefixes and your tasks and then build your sub-arrays by traversing both sorted lists forwards only.
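A sketch of that single forward pass (in Haskell, and assuming no prefix is itself a prefix of another prefix):
import Data.List (isPrefixOf, sort)

-- Group sorted tasks under sorted prefixes in one forward traversal.
groupByPrefixes :: [String] -> [String] -> [(String, [String])]
groupByPrefixes prefixes tasks = go (sort prefixes) (sort tasks)
  where
    go []     _  = []
    go (p:ps) ts =
      let ts'          = dropWhile (< p) ts        -- these can match neither p nor any later prefix
          (hits, rest) = span (p `isPrefixOf`) ts' -- tasks starting with p are contiguous here
      in (p, hits) : go ps rest
Tasks that match no prefix at all are silently dropped by this sketch.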
There are a few options, but you might be interested in looking into the trie data structure.
http://en.wikipedia.org/wiki/Trie
The trie data structure is easy to understand and implement and works well for this type of problem. If you find that this works for your situation, you can also look at Patricia tries, which achieve similar performance characteristics but typically have better memory utilization. They are a little more involved to implement, but not overly complex.
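A bare-bones sketch of such a trie in Haskell (illustrative only, not a tuned implementation):
import qualified Data.Map.Strict as M

data Trie = Trie
  { endsHere :: Bool            -- does some prefix end at this node?
  , children :: M.Map Char Trie
  }

emptyTrie :: Trie
emptyTrie = Trie False M.empty

insert :: String -> Trie -> Trie
insert []     t = t { endsHere = True }
insert (c:cs) t =
  let child = M.findWithDefault emptyTrie c (children t)
  in t { children = M.insert c (insert cs child) (children t) }

-- Walk a task down the trie; return the (shortest) prefix it matches, if any.
matchPrefix :: Trie -> String -> Maybe String
matchPrefix = go []
  where
    go acc t _ | endsHere t = Just (reverse acc)
    go _   _ []             = Nothing
    go acc t (c:cs)         = case M.lookup c (children t) of
                                Nothing -> Nothing
                                Just t' -> go (c : acc) t' cs
Build the trie once from the prefixes, then run each task through matchPrefix and group the tasks by the result.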

Suitable Haskell type for large, frequently changing sequence of floats

I have to pick a type for a sequence of floats with 16K elements. The values will be updated frequently, potentially many times a second.
I've read the wiki page on arrays. Here are the conclusions I've drawn so far. (Please correct me if any of them are mistaken.)
IArrays would be unacceptably slow in this case, because they'd be copied on every change. With 16K floats in the array, that's 64KB of memory copied each time.
IOArrays could do the trick, as they can be modified without copying all the data. In my particular use case, doing all updates in the IO monad isn't a problem at all. But they're boxed, which means extra overhead, and that could add up with 16K elements.
IOUArrays seem like the perfect fit. Like IOArrays, they don't require a full copy on each change. But unlike IOArrays, they're unboxed, meaning they're basically the Haskell equivalent of a C array of floats. I realize they're strict. But I don't see that being an issue, because my application would never need to access anything less than the entire array.
Am I right to look to IOUArrays for this?
Also, suppose I later want to read or write the array from multiple threads. Will I have backed myself into a corner with IOUArrays? Or is the choice of IOUArrays totally orthogonal to the problem of concurrency? (I'm not yet familiar with the concurrency primitives in Haskell and how they interact with the IO monad.)
A good rule of thumb is that you should almost always use the vector library instead of arrays. In this case, you can use mutable vectors from the Data.Vector.Mutable module.
The key operations you'll want are read and write, which let you read from and write to the mutable vector in place.
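A minimal sketch of that usage; since the elements here are plain floats, it uses the unboxed variant, Data.Vector.Unboxed.Mutable, which has the same read/write interface as Data.Vector.Mutable:
import qualified Data.Vector.Unboxed.Mutable as VUM

main :: IO ()
main = do
  v <- VUM.replicate 16384 (0 :: Float)  -- allocate the 16K elements once
  VUM.write v 0 3.14                     -- O(1) in-place update
  x <- VUM.read v 0                      -- O(1) read
  print x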
You'll want to benchmark, of course (with criterion), or you might be interested in browsing some benchmarks I did here (if that link works for you; it's broken for me).
The vector library is a nice interface (crazy understatement) over GHC's more primitive array types, which you can get at more directly in the primitive package. The same goes for the things in the standard array package; for instance, an IOUArray is essentially a MutableByteArray#.
Unboxed mutable arrays are usually going to be the fastest, but you should compare them in your application to IOArray or the vector equivalent.
My advice would be:
if you probably don't need concurrency, first try a mutable unboxed Vector, as Gabriel suggests
if you know you will want concurrent updates (and feel a little brave), then first try a MutableArray and do atomic updates with the functions from the atomic-primops library. If you want fine-grained locking, this is your best choice. Of course, concurrent reads will work fine on whatever array you choose.
It should also be theoretically possible to do concurrent updates on a MutableByteArray (equivalent to IOUArray) with those atomic-primops functions too, since a Float should always fit into a word (I think), but you'd have to do some research (or bug Ryan).
Also be aware of potential memory reordering issues when doing concurrency with the atomic-primops stuff, and help convince yourself with lots of tests; this is somewhat uncharted territory.

Is it a reasonable practice to serialize Haskell data structures to disk just using Show/Read

I've played around with the Text.Show.Pretty module, and it makes it possible to serialize Haskell data structures such as records into a nice human-readable format and still be able to deserialize them easily using read. The output format is even more readable than YAML or JSON.
Example serialized output for a Haskell record using Text.Show.Pretty:
Book
  { author = "Plato"
  , title = "Republic"
  , numbers = [ 123
              , 1234
              ]
  }
Coming from the Ruby world, I know that YAML and JSON are most Rubyists' preferred format for serializing data structures. Are Haskell Show and Read instances used often to achieve the same end in Haskell?
For big structures, I wouldn't recommend it. read is slower than molasses. Anecdote time: I have a program named yeganesh. Conceptually, it's pretty simple: read in a [(String, Double)] with about 2000 elements and dump out the keys sorted by their associated values. I used to store this using Show/Read, but found that switching to a custom printer and parser sped up the program by a factor of 8. (Note: it's not that the parsing sped up by a factor of eight; the whole program sped up by a factor of eight. That means the parsing sped up by an even bigger factor.) That made the difference between uncomfortably long pauses and instant gratification.
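For illustration, a hand-rolled printer/parser for a [(String, Double)] can be as simple as the sketch below (this is not yeganesh's actual format; it assumes keys contain no spaces):
-- One "key value" pair per line; parsing line by line avoids Read's
-- backtracking over the whole list.
render :: [(String, Double)] -> String
render = unlines . map (\(k, v) -> k ++ " " ++ show v)

parse :: String -> [(String, Double)]
parse = map parseLine . lines
  where
    parseLine l = let (k, rest) = break (== ' ') l
                  in (k, read (drop 1 rest))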
I agree with Daniel Wagner, but if you want a file that a user can manipulate with a simple text editor, you could use Read/Show for a small set of data, e.g. config files.
I don't think that is a common approach amongst Haskellers, though; I usually use parsec instead of read for 'config data', and a custom class/instance instead of Show.
If you have a lot of data, one usually uses Data.Binary or Data.Serialize.
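For example, with Data.Binary the whole round trip for that kind of data is just a couple of lines (a sketch; the function names are made up):
import Data.Binary (decodeFile, encodeFile)

saveEntries :: FilePath -> [(String, Double)] -> IO ()
saveEntries = encodeFile

loadEntries :: FilePath -> IO [(String, Double)]
loadEntries = decodeFile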
