C++ Threads writing to different parts of array of vector - multithreading

I have an std::array<std::vector, NUM_THREADS> and I basically want each thread to go get some data, and store it in its own std::vector, and also to read from its vector.
Is this safe? Or am I going to have to use a mutex or something?

The rule regarding data-races is that if every memory location is either accessed by no more than one thread at a time, or is only read (by any number of threads, but no writes), you don't need atomicity. Otherwise, you need either atomicity or synchronization (such as mutual-exclusion).
If every thread is only writing to and reading from its own vector, this would be safe. If two threads are writing to the same vector elements without synchronization, or if they're both writing to the same vector itself (e.g., appending or truncating the vector), you're pretty much clobbered: that's two simultaneous writes. If two threads are each writing to elements of their own vectors and reading from both vectors, it's more complicated, but in general I would expect it to be unsafe. There are very specific arrangements where it may be safe/legal, but they will be very brittle, and likely hard to maintain, so it's probably better to re-architect to avoid it.
An example of a usage like this that would be legal (but again, brittle and hard to keep safe during code maintenance) is one where none of the vectors change size (a reallocation is a write to the vector itself, which precludes any reads of the vector or its elements by other threads) and each thread avoids reading any element that is written by another thread (for example, you have two threads, one reading from and writing to even elements of the vectors and the other reading from and writing to odd elements of the vectors).
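A minimal sketch of that even/odd arrangement, assuming two fixed-size vectors of int; the names kNumVectors, kSize, bump_parity and run_even_odd are made up for illustration. It stays legal only because the vectors are sized before the threads start and never resized afterwards.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical sizes, chosen only for illustration.
constexpr std::size_t kNumVectors = 2;
constexpr std::size_t kSize = 8;

using Data = std::array<std::vector<int>, kNumVectors>;

// Each worker touches only indices of one parity, in every vector.
// No vector is resized, so no two threads ever access the same element.
void bump_parity(Data& data, std::size_t parity) {
    for (auto& vec : data) {
        for (std::size_t i = parity; i < vec.size(); i += 2) {
            vec[i] += 1;  // read and write of an element no other thread uses
        }
    }
}

// Runs one thread on even indices and one on odd indices, then returns the data.
Data run_even_odd() {
    Data data;
    for (auto& vec : data) vec.assign(kSize, 0);  // sizes fixed before threads start

    std::thread even(bump_parity, std::ref(data), std::size_t{0});
    std::thread odd(bump_parity, std::ref(data), std::size_t{1});
    even.join();
    odd.join();
    return data;
}
```

A single push_back() or resize() inside bump_parity would break the guarantee, which is the brittleness in question.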
The above example is very artificial and probably not all that useful for real access patterns that might be desired. Other examples I could think of would probably also be artificial and unhelpful. And it's very easy to do some simple operation that would destroy the whole guarantee. In particular, if any thread performs push_back() on its own vector, any other thread that is concurrently reading that vector is almost guaranteed to trigger undefined behavior. (You might be able to align the stars using reserve() very carefully and make code that is legal, but I certainly wouldn't attempt it myself.)
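For contrast, the safe case from the question (each thread writes to and reads from only its own vector) might look like the sketch below. kNumThreads stands in for the question's NUM_THREADS, and fill_and_sum and run_per_thread_vectors are hypothetical names; the "data" each thread fetches is just its own id.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <thread>
#include <vector>

constexpr std::size_t kNumThreads = 4;  // stands in for NUM_THREADS

// Each thread appends to, then reads from, only the vector it was handed.
void fill_and_sum(std::vector<int>& mine, int id, long& sum_out) {
    for (int i = 0; i < 100; ++i) {
        mine.push_back(id);  // may reallocate; fine, no other thread sees this vector
    }
    sum_out = std::accumulate(mine.begin(), mine.end(), 0L);
}

// Starts one thread per vector and returns the per-thread sums.
std::array<long, kNumThreads> run_per_thread_vectors() {
    std::array<std::vector<int>, kNumThreads> data;
    std::array<long, kNumThreads> sums{};
    std::array<std::thread, kNumThreads> workers;
    for (std::size_t t = 0; t < kNumThreads; ++t) {
        workers[t] = std::thread(fill_and_sum, std::ref(data[t]),
                                 static_cast<int>(t), std::ref(sums[t]));
    }
    for (auto& w : workers) w.join();  // join before reading the results
    return sums;
}
```

No mutex is needed here because no memory location is touched by more than one thread until after join().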


How to use Compare-and-swap for complex data structures

Classic CAS examples deal with simple data structures where the critical change is to one primitive variable. However, even a linked list, for example, requires changes to multiple data items: head, tail, next and prev. How do you deal with that?
I have in mind "capturing" the 4 pointers in a 4-word memory block and passing that block to the CAS routine. Is this technically possible, and is it good practice?
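For reference, the "classic" single-variable case the question starts from can be sketched as a push onto a lock-free stack, where the only shared variable that changes is the head pointer. Node, push and push_concurrently are illustrative names, and the sketch deliberately leaves out a concurrent pop, which raises reclamation and ABA issues. It does not show the four-pointer idea; whether a wider block can be swapped atomically depends on the platform, and std::atomic<T>::is_lock_free() reports that for a given T.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

struct Node {
    int value;
    Node* next;
};

// Lock-free push: the only shared variable that changes is `head`.
void push(std::atomic<Node*>& head, int value) {
    Node* node = new Node{value, head.load(std::memory_order_relaxed)};
    // On failure, compare_exchange_weak stores the current head into
    // node->next, so the loop simply retries with the fresh value.
    while (!head.compare_exchange_weak(node->next, node,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
    }
}

// Pushes from several threads, then counts and frees the nodes single-threaded.
std::size_t push_concurrently(int threads, int per_thread) {
    std::atomic<Node*> head{nullptr};
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&head, per_thread] {
            for (int i = 0; i < per_thread; ++i) push(head, i);
        });
    }
    for (auto& w : workers) w.join();

    std::size_t count = 0;
    for (Node* n = head.load(); n != nullptr;) {
        Node* next = n->next;
        delete n;
        n = next;
        ++count;
    }
    return count;
}
```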

What constructs are not possible using Ponylang's lock-free model?

Ponylang is a new language that is lock-free and datarace-free. My impression is that to accomplish this, Ponylang looks at the sentence "if two threads can see the same object, then writes must prohibit any other operation by another thread", and uses a type system to enforce the various special cases. For example, there's a type descriptor that says, "no other thread can see this object", and one that says, "this reference is read-only", and various others. Admittedly my understanding of this is quite poor, and ponylang's documentation is short on examples.
My question is: are there operations possible with a lock-based language that aren't translatable into ponylang's type-based system at all? Also, are there such operations that are not translatable into efficient constructs in ponylang?
[...] are there operations possible with a lock-based language that aren't translatable into ponylang's type-based system at all?
The whole point of reference capabilities in Pony is to prevent you from doing things that are possible, and even trivial, in other languages, like sharing a list between two threads and adding elements to it concurrently. So, yes, in languages like Java, you can share data between threads in a way that is impossible in Pony.
Also, are there such operations that are not translatable into efficient constructs in ponylang?
If you're asking whether lock-based languages can be more efficient than Pony in some situations, then I think so. You can always create a situation that benefits from N threads and 1 lock and is worse when you use the actor model, which forces you to pass information around in messages.
The point is not to see the actor model as superior in all cases. It's a different model of concurrency, and problems are solved differently. For example, to compute N values and accumulate the results in a list:
In a thread model you would:
create a thread pool,
create a thread-safe list,
create N tasks sharing the list, and
wait for the N tasks to finish.
In an actor model you would:
create an actor A waiting for N values,
create N actors B sharing the actor A, and
wait for A to produce a list.
Obviously, each task would add a value to the list and each actor B would send its value to actor A. Depending on how messages are passed between actors, it can be slower to send N values than to lock N times. Typically it will be slower but, on the other hand, you will never get a list with an unexpected size.
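The two recipes can be put side by side in a rough C++ sketch, since Pony itself has no shared-list version to compare against. Everything here is illustrative: Mailbox is a hypothetical stand-in for an actor's message queue, plain threads stand in for a thread pool and for actors, and the "computation" is just i * i. The mailbox uses a mutex internally purely for brevity; that is not a claim about how Pony's runtime passes messages.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Thread model: N tasks share one list guarded by a lock.
std::vector<int> collect_with_lock(int n) {
    std::vector<int> results;
    std::mutex m;
    std::vector<std::thread> tasks;
    for (int i = 0; i < n; ++i) {
        tasks.emplace_back([&results, &m, i] {
            int value = i * i;  // the "computation"
            std::lock_guard<std::mutex> guard(m);
            results.push_back(value);
        });
    }
    for (auto& t : tasks) t.join();
    return results;
}

// Minimal mailbox standing in for an actor's message queue (hypothetical helper).
class Mailbox {
public:
    void send(int msg) {
        {
            std::lock_guard<std::mutex> guard(m_);
            q_.push(msg);
        }
        cv_.notify_one();
    }
    int receive() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        int msg = q_.front();
        q_.pop();
        return msg;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<int> q_;
};

// Actor model: only "actor A" owns the list; the N "actors B" just send values.
std::vector<int> collect_with_messages(int n) {
    Mailbox inbox;
    std::vector<int> results;
    std::thread actor_a([&inbox, &results, n] {
        for (int i = 0; i < n; ++i) results.push_back(inbox.receive());
    });
    std::vector<std::thread> actors_b;
    for (int i = 0; i < n; ++i) {
        actors_b.emplace_back([&inbox, i] { inbox.send(i * i); });
    }
    for (auto& b : actors_b) b.join();
    actor_a.join();  // results is read only after its sole owner has finished
    return results;
}
```

In both versions the order of the values is unspecified; only the contents are the same.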
I believe it can do anything that shared-everything plus locks can do. With just iso objects and consume, it is basically a pure message-passing system, which can do anything that a lock-based system does, in the same way that Mach3 can do anything Linux can.

Small subset of huge matrix-like structure from disk transparently

A simplified version of the question
I have a huge matrix-like dataset, that we for now can pretend is actually an n-by-n matrix stored on-disk as n^2 IEEE-754 doubles (see details below the line on how this is a simplification - it probably matters). The file is on the order of a gigabyte, but in a certain (pure) function I will only need on the order of n of the elements contained in it. Exactly which elements will be needed is complicated, and not something like a simple slice.
What are my options for decoupling reading the file from disk and the computation? Most of all, I'd like to treat the on-disk data as if it were in memory (I am of course ready to swear to all the gods of referential transparency that the data on disk will not change). I've looked at mmap and friends, but some cursory testing shows that these do not seem to free memory aggressively enough.
Do I have to couple my computations to IO if I need such fine-grained control over how much of the file is kept in memory?
A more honest description of the on-disk data
The data on disk isn't actually as simple as described. Something closer to the truth would be the following: A file begins with a 32 bit integer n. The following then occurs precisely n times: A 32 bit integer m_i > 0 (1 ≤ i ≤ n), followed by exactly m_i IEEE-754 doubles x_(i,1),…,x_(i, m_i). (So, this is a jagged two-dimensional array).
In practice, determining i and j for which x_(i, j) is needed depends highly on the m_i's. When approaching the problem with mmap, the need to read so many of these m_is seems to essentially load the entire file into memory. The problem is that it all seems to stay there, and I worry that I will have to pull my computation into IO to have more fine-grained control over the releasing of this memory.
Moreover, "the data structure" actually consists of a large number of these files parameterized by their file names. Together they amount to about a gigabyte.
An attempt at a more handwaving, but possibly easier to understand version of the question
Say I have some data on disk consisting of n^2 elements. A pure Haskell function needs on the order of n of the elements, but which of them depends in a complicated way on the values. I do not want to load the entire file into memory, because it is huge. One solution is to throw my function into the IO monad and read out elements as they are needed, but I call this "giving up". mmap lets us treat on-disk data as if it were in memory, essentially doing lazy IO with help from the OS' virtual memory system. This is nice, but since determining which elements of the data are needed requires accessing a lot of the file, mmap seems to keep way too much of the file in memory. In practice, I find that reading the data I need to determine the data I actually need loads the entire file into memory when using mmap.
What options do I have?
I would suggest that you write an interface that is entirely in IO, where you have an abstract type that contains both a Handle and information about the overall structure of your data (perhaps all the m_is if you can fit them), and this is complemented by IO operations that read out precise bits of the data by seeking in the handle.
I would then simply wrap this interface in a bunch of unsafePerformIO calls! This is effectively what mmap does behind the scenes, in a sense. You just are doing so in a more explicitly managed way.
Assuming you aren't worried about anyone "swapping out" the file behind your back, you can get an interface that you can reason about purely, while it actually does IO where necessary to give you the explicit control over memory you need.

Suitable Haskell type for large, frequently changing sequence of floats

I have to pick a type for a sequence of floats with 16K elements. The values will be updated frequently, potentially many times a second.
I've read the wiki page on arrays. Here are the conclusions I've drawn so far. (Please correct me if any of them are mistaken.)
IArrays would be unacceptably slow in this case, because they'd be copied on every change. With 16K floats in the array, that's 64KB of memory copied each time.
IOArrays could do the trick, as they can be modified without copying all the data. In my particular use case, doing all updates in the IO monad isn't a problem at all. But they're boxed, which means extra overhead, and that could add up with 16K elements.
IOUArrays seem like the perfect fit. Like IOArrays, they don't require a full copy on each change. But unlike IOArrays, they're unboxed, meaning they're basically the Haskell equivalent of a C array of floats. I realize they're strict. But I don't see that being an issue, because my application would never need to access anything less than the entire array.
Am I right to look to IOUArrays for this?
Also, suppose I later want to read or write the array from multiple threads. Will I have backed myself into a corner with IOUArrays? Or is the choice of IOUArrays totally orthogonal to the problem of concurrency? (I'm not yet familiar with the concurrency primitives in Haskell and how they interact with the IO monad.)
A good rule of thumb is that you should almost always use the vector library instead of arrays. In this case, you can use mutable vectors from the Data.Vector.Mutable module (or Data.Vector.Unboxed.Mutable if you want the elements stored unboxed).
The key operations you'll want are read and write, which let you read from and write to the mutable vector in place.
You'll want to benchmark of course (with criterion) or you might be interested in browsing some benchmarks I did e.g. here (if that link works for you; broken for me).
The vector library is a nice interface (crazy understatement) over GHC's more primitive array types, which you can get to more directly in the primitive package. The types in the standard array package are built on the same primitives; for instance, an IOUArray is essentially a MutableByteArray#.
Unboxed mutable arrays are usually going to be the fastest, but you should compare them in your application to IOArray or the vector equivalent.
My advice would be:
if you probably don't need concurrency, first try a mutable unboxed Vector, as Gabriel suggests;
if you know you will want concurrent updates (and feel a little brave), first try a MutableArray and do atomic updates with these functions from the atomic-primops library. If you want fine-grained locking, this is your best choice. Of course, concurrent reads will work fine on whatever array you choose.
It should also be theoretically possible to do concurrent updates on a MutableByteArray (equivalent to IOUArray) with those atomic-primops functions too, since a Float should always fit into a word (I think), but you'd have to do some research (or bug Ryan).
Also be aware of potential memory reordering issues when doing concurrency with the atomic-primops stuff, and help convince yourself with lots of tests; this is somewhat uncharted territory.

Any way to manually indicate element of a MutableArray# safe to GC?

In my application I'm working with MutableArrays (via the primitive package) shared across threads. I know when individual elements are no longer used and I'd like some way (unsafeMarkGarbage or something) to indicate to the runtime that they can be collected. At least I'd like to experiment with that if such a function or equivalent technique exists.
EDIT, to add a bit more detail: I've got a conceptual "infinite tape" implemented as a linked list of short MutableArray segments, something like:
data Seg a = Seg (MutableArray RealWorld a) (IORef (Maybe (Seg a)))
I access the tape using a concurrent counter and always know when an element of the tape will no longer be accessed. In certain cases when a thread is descheduled it's possible that entire array segments (both the array and its elements) which could have been GC'd will stick around as their references will persist.
An ideal solution would avoid an additional write (maybe that's silly), avoid another layer of indirection in the array, and allow entire MutableArrays to be collected when all their elements expire.
Weak references do seem to be the most promising sort of mechanism I've seen, but I can't yet see how they can help me here.
I would suggest you store undefined in the positions that you would like to garbage collect.
