How to program thread-based parallel list iteration?

I need an example of how to program a parallel iter function using OCaml threads. My first idea was to have a function similar to this:
let procs = 4 ;;
let rec _part part i lst = match lst with
    [] -> ()
  | hd::tl ->
      let idx = i mod procs in
      (* Printf.printf "part idx=%i\n" idx; *)
      let accu = part.(idx) in
      part.(idx) <- (hd::accu);
      _part part (i+1) tl ;;
Then a parallel iter could look like this (here as process-based variant):
let iter f lst = let part = Array.create procs [] in
  _part part 0 lst;
  let rec _do i =
    (* Printf.printf "do idx=%i\n" i; *)
    match Unix.fork () with
      0 -> (* Code of child *)
        if i < procs then
          begin
            (* Printf.printf "child %i\n" i; *)
            List.iter f part.(i)
          end
    | pid -> (* Code of father *)
        (* Printf.printf "father %i\n" i; *)
        if i >= procs then ignore (Unix.waitpid [] pid)
        else _do (i+1)
  in
  _do 0 ;;
Because the Thread module is used a little differently, how would I code this using OCaml's Thread module?
A second question: the _part function must scan the whole list to split it into n parts, and each part is then piped through its own process (here). Is there a solution that does not require splitting the list first?

If you have a function which processes a list, and you want to run it on several lists independently, you can call Thread.create with that function and each list. If you store your lists in an array part, then:
let threads = Array.map (Thread.create (List.iter f)) part in
Array.iter Thread.join threads
INRIA OCaml threads are not actual threads: only one thread executes at any given time, which means if you have four processors and four threads, all four threads will use the same processor and the other three will remain unused.
Where threads are useful is that they still allow asynchronous programming: some Thread module primitives can wait for an external resource to become available. This can reduce the time your software spends blocked by an unavailable resource, because another thread can do something else in the meantime. You can also use this to concurrently start several external asynchronous processes (like querying several web servers through HTTP). If you don't have a lot of resource-related blocking, this is not going to help you.
As for your list-splitting question: to access an element of a list, you must traverse all previous elements. While this traversal could theoretically be split across several threads or processes, the communication overhead would likely make it a lot slower than just splitting things ahead of time in one process, or using arrays, which allow direct access by index.
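As a rough illustration of the "use arrays" remark, here is a minimal sketch in Python (not OCaml, chosen only for illustration; `parallel_iter` and its `procs` parameter are hypothetical names). With an indexable sequence, each worker can walk its own stride of indices directly, so no separate partitioning pass is needed. CPython threads share a global interpreter lock, so, much like the OCaml threads described above, this overlaps blocking work but does not spread computation over several cores.

```python
import threading

def parallel_iter(f, items, procs=4):
    """Apply f to every element of items using procs threads.

    Worker k handles indices k, k + procs, k + 2 * procs, ... so the
    sequence is never copied or partitioned up front.
    """
    def worker(k):
        for idx in range(k, len(items), procs):
            f(items[idx])

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    seen = []
    lock = threading.Lock()

    def record(x):
        with lock:  # the result list is shared state, so guard it explicitly
            seen.append(x)

    parallel_iter(record, list(range(10)))
    print(sorted(seen))
```

The order in which elements are visited is not deterministic, which is why the usage example sorts before printing.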

Answer to a question from the comments. The answer does not quite fit in a comment itself.
There is a lock on the OCaml runtime. The lock is released when an OCaml thread is about to enter a C function that may block or may take a long time.
So you can only have one OCaml thread using the heap, but you can sometimes have non-heap-using C functions working in parallel with it.
See for instance the file ocaml-3.12.0/otherlibs/unix/write.c:
memmove (iobuf, &Byte(buf, ofs), numbytes); // if we kept the data in the heap
                                            // the GC might move it from
                                            // under our feet.
enter_blocking_section();                   // release lock.
                                            // Another OCaml thread may
                                            // start in parallel of this one now.
ret = write(Int_val(fd), iobuf, numbytes);
leave_blocking_section();                   // take lock again to continue
                                            // with OCaml code.
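The same effect can be observed from a language with a comparable global lock. The following is a minimal sketch in Python, not OCaml: CPython also holds one interpreter lock and releases it around blocking calls such as `time.sleep`, which stands in here for a blocking `write`. The names `blocking_call` and `run_concurrently` are made up for the illustration.

```python
import threading
import time

def blocking_call(delay):
    # time.sleep blocks with the interpreter lock released, so other
    # threads are free to run while this one waits.
    time.sleep(delay)

def run_concurrently(n, delay):
    """Start n threads that each block for `delay` seconds; return elapsed time."""
    threads = [threading.Thread(target=blocking_call, args=(delay,))
               for _ in range(n)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    elapsed = run_concurrently(4, 0.2)
    print(f"4 blocking calls of 0.2 s took {elapsed:.2f} s in total")
```

If the lock were held during the blocking call, the four waits would run one after another and take about 0.8 s. Because it is released, the total stays close to a single wait.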

Related

How to explain Read/Write global variables in multi threads environment

I am not familiar with multithreading, locks, and atomic/non-atomic operations.
Recently I saw an interview question as below.
Put f1 and f2 in two separate threads and run them at the same time, when both of them return, what is the value of a?
int a = 2, b = 0, c = 0
func f1()
{
    a = a * 2
    a = b
}
func f2()
{
    c = a + 11
    a = c
}
I tried to implement the above code in an Objective-C environment and what I got is a = 11. I'm not sure if this is right, since what I did was put f1 on the main queue and f2 on a global dispatch queue and run it asynchronously, which could be incorrect.
If someone could give an answer and explain the process based on the level of register accessing, CPU processing, memory usage, that would be great.

The answer is that the final value of a is unpredictable. Because access to a is not atomic and there is no synchronization, different threads may see different values for a depending on how their execution happens to be scheduled. If you manage to make a unaligned and run it on x86, you might even see a torn value for a, one that neither thread ever wrote.
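To get a feel for how many outcomes are possible even in the most benign case, here is a small sketch in Python that assumes every statement of f1 and f2 executes atomically. Real compilers and hardware give no such guarantee, so this is a lower bound on the confusion, not a description of what a real machine does. The names `F1`, `F2` and `final_values_of_a` are invented for the example.

```python
from itertools import combinations

# Each statement is modelled as one atomic step on a shared state dict.
F1 = [lambda s: s.update(a=s["a"] * 2),   # a = a * 2
      lambda s: s.update(a=s["b"])]       # a = b
F2 = [lambda s: s.update(c=s["a"] + 11),  # c = a + 11
      lambda s: s.update(a=s["c"])]       # a = c

def final_values_of_a():
    """Run every statement-level interleaving of f1 and f2; collect final a."""
    results = set()
    total = len(F1) + len(F2)
    # Choose which of the 4 time slots belong to f1; the rest go to f2.
    for f1_slots in combinations(range(total), len(F1)):
        state = {"a": 2, "b": 0, "c": 0}
        f1_steps, f2_steps = iter(F1), iter(F2)
        for slot in range(total):
            step = next(f1_steps) if slot in f1_slots else next(f2_steps)
            step(state)
        results.add(state["a"])
    return results

if __name__ == "__main__":
    print(sorted(final_values_of_a()))  # [0, 11, 13, 15]
```

Under this assumption, the value 11 observed in the question corresponds to f1 running to completion before f2 starts. The other schedules give 0, 13 or 15.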

What is process interleaving? (in the realm of Concurrency)

I'm not quite sure what this term means. I saw it during a course where we are learning about concurrency. I've seen a lot of definitions for data interleaving, but I couldn't find anything about process interleaving.
When looking at the term my instincts tell me it is the use of threads to run more than one process simultaneously, is that correct?

If you imagine a process as a (possibly infinite) sequence/trace of statements (e.g. obtained by loop unfolding), then the set of possible interleavings of several processes consists of all possible sequences of statements of any of those processes.
Consider for example the processes
int i;
proctype A() {
    i = 1;
}
proctype B() {
    i = 2;
}
Then the possible interleavings are i = 1; i = 2 and i = 2; i = 1, i.e. the possible final values for i are 1 and 2. This can be of course more complex, for instance in the presence of guarded statements: Then the next possible statements in an interleaving sequence are not necessarily those at the position of the next program counter, but only those that are allowed by the guard; consider for example the proctype
proctype B() {
    if
    :: i == 0 -> i = 2
    :: else -> skip
    fi
}
Then the possible interleavings (given A() as before) are i = 1; skip and i = 2; i = 1, so there is only one possible final value for i.
Indeed the notion of interleavings is crucial for Spin's view of concurrency. In a trace semantics, the set of possible traces of concurrent processes is the set of possible interleavings of the traces of the individual processes.

It simply means performing (data access or execution or ...) in an arbitrary order** (see the note). In the case of concurrency, it usually refers to action interleaving.
If the processes P and Q are in parallel composition (P||Q), then their actions will be interleaved. Consider the following processes:
PLAYING = (play_music -> stop_music -> STOP).
PERFORMING = (dance -> STOP).
||PLAY_PERFORM = (PLAYING || PERFORMING).
So each primitive process can be shown as: (generated by the LTSA model-checking tool)
Then the possible traces as the result of action interleaving will be:
dance -> play_music -> stop_music
play_music -> dance -> stop_music
play_music -> stop_music -> dance
Here is the LTSA tool generated output of this example.
**note: "arbitrary" here means an arbitrary choice of which process executes next, not of the order of statements inside a process. The code within each process is always executed sequentially.
If it is still something that you're not comfortable with you can take a look at: https://www.doc.ic.ac.uk/~jnm/book/firstbook/pdf/ch3.pdf
Hope it helps! :)

Operating systems support Tasks (or Processes). But for now let's think of "Activities".
Activities can be executed in parallel. Here are two activities, P and Q:
P: abc
Q: def
a, b, c, d, e, f are operations.
Each operation has always the same effect independent of what other
operations may be executing at the same time (atomicity).
What is the effect of executing the two activities concurrently? We
do not know for sure, but we know that it will be the same as obtained
by executing sequentially an INTERLEAVING of the two activities
[interleavings are also called SCHEDULES]. Here are the possible
interleavings of these two activities:
abcdef
abdcef
abdecf
abdefc
adbcef
......
defabc
That is, the operations of the two activities are sequenced in all possible ways that preserve the order in which the operations appeared in the two activities. A serial interleaving [serial schedule] of two activities is one where all the operations of one activity precede all the operations of the other activity.
The importance of the concept of interleaving is that it allows us to express the meaning of concurrent programs: The parallel execution of activities is equivalent to the sequential execution of one of the interleavings of these activities.
For detailed information: https://cis.temple.edu/~ingargio/cis307/readings/interleave.html
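The listing of schedules above can be generated mechanically. Here is a minimal sketch in Python (the function name `interleavings` is made up for the example) that enumerates every order-preserving merge of two activities given as strings of single-character operations:

```python
def interleavings(p, q):
    """Yield every merge of p and q that keeps each one's internal order."""
    if not p or not q:
        # One activity is finished: the rest of the other runs serially.
        yield p + q
        return
    # Either the next operation comes from p ...
    for rest in interleavings(p[1:], q):
        yield p[0] + rest
    # ... or it comes from q.
    for rest in interleavings(p, q[1:]):
        yield q[0] + rest

if __name__ == "__main__":
    schedules = list(interleavings("abc", "def"))
    print(len(schedules))               # 20
    print(schedules[0], schedules[-1])  # abcdef defabc
```

For two activities of three operations each there are 20 interleavings (6 choose 3). The first and last ones produced, abcdef and defabc, are the two serial schedules.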

Multithread+Recursion strategies

I am just starting to learn the ins and outs of multithreaded programming and have a few basic questions that, once answered, should keep me occupied for quite some time. I understand that multithreading loses its effectiveness once you have created more threads than there are cores (due to context switching and cache flushing). With that understood, I can think of two ways to multithread a recursive function, but I am not quite sure which is the common way to approach the problem. One seems much more complicated, perhaps with a higher payoff, but that's what I hope you will be able to tell me.
Below is pseudo-code for two different methods of multithreading a recursive function. I have used the terminology of merge sort for simplicity, but it's not that important. It is easy to see how to generalize the methods to other problems. Also, I will personally be employing these methods using the pthreads library in C, so the thread syntax mildly reflects this.
Method 1:
main ()
{
    A = array of length N
    NUM_CORES = get number of functional cores
    chunk[NUM_CORES] = array of indices partitioning A into (N / NUM_CORES) sized chunks
    thread_id[NUM_CORES] = array of thread ids
    thread[NUM_CORES] = array of thread type
    //start NUM_CORES threads, each working on one chunk of A
    for i = 0 to (NUM_CORES - 1) {
        thread_id[i] = thread_start(thread[i], MergeSort, chunk[i])
    }
    //wait for all threads to finish
    //Merge chunks appropriately
    exit
}
MergeSort ( chunk )
{
    MergeSort ( lowerSubChunk )
    MergeSort ( higherSubChunk )
    Merge(lowerSubChunk, higherSubChunk)
}
//Merge(,) not shown
Method 2:
main ()
{
    A = array of length N
    NUM_CORES = get number of functional cores
    chunk = indices 0 and N
    thread_id[NUM_CORES] = array of thread id’s
    thread[NUM_CORES] = array of thread type
    //lock variable aka mutex
    THREADS_IN_USE = 1
    MergeSort( chunk )
    exit
}
MergeSort ( chunk )
{
    lock THREADS_IN_USE
    if ( THREADS_IN_USE < NUM_CORES ) {
        FREE_CORE = find index of unused core
        thread_id[FREE_CORE] = thread_start(thread[FREE_CORE], MergeSort, lowerSubChunk)
        THREADS_IN_USE++
        unlock THREADS_IN_USE
        MergeSort( higherSubChunk )
        //wait for thread_id[FREE_CORE] and current thread to finish
        lock THREADS_IN_USE
        THREADS_IN_USE--
        unlock THREADS_IN_USE
        Merge(lowerSubChunk, higherSubChunk)
    }
    else {
        unlock THREADS_IN_USE
        MergeSort( lowerSubChunk )
        MergeSort( higherSubChunk )
        Merge(lowerSubChunk, higherSubChunk)
    }
}
//Merge(,) not shown
Visually, one can think of the differences between these two methods as follows:
Method 1: creates NUM_CORES separate recursion trees, each one having a single core traversing it.
Method 2: creates a single recursion tree but has all cores traversing it. In particular, whenever there is a free core, it is set to work on the "left child subtree" of the first node where MergeSort is called after the core is freed.
The problem with Method 1 is that if it is the case that the running time of the recursive function varies with the distribution of values within each initial subchunk (i.e. the chunk[i]), one thread could finish much faster leaving a core sitting idle while the others finish. With Merge Sort this is not likely to be the case since the work of MergeSort happens in Merge whose runtime isn't affected much by the distribution of values in the (sorted) subchunks. However, with a more involved recursive function, the running time on one subchunk could be much longer!
With Method 2 it is possible to have the same problem. Again, with merge sort it's not clear, since the running time for each subchunk is likely to be similar, but the line //wait for thread_id[FREE_CORE] and current thread to finish would also require one core to wait for the other. However, with Method 2, all calls to Merge run as soon as possible, as opposed to Method 1, where one must wait for NUM_CORES calls to MergeSort to finish and then do NUM_CORES - 1 merges afterward (although you can multithread this as well, to an extent).
(though the syntax might not be completely correct)
Are both of these methods used in practice? Are there situations where one is more beneficial over the other? Is this the correct way to implement Method 2? (in this case, THREADS_IN_USE is a semaphore?)
Thanks so much for your help!
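For reference, Method 1 can be sketched compactly with a thread pool. This is an illustrative Python version, not the pthreads code the question asks about, and `merge_sort`, `chunked_sort` and `num_workers` are hypothetical names. In CPython the threads will not actually sort in parallel because of the interpreter lock, so the sketch shows only the structure: one chunk per worker, then a final merge.

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge

def merge_sort(xs):
    """Plain sequential merge sort; returns a new sorted list."""
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    return list(merge(merge_sort(xs[:mid]), merge_sort(xs[mid:])))

def chunked_sort(xs, num_workers=4):
    """Method 1: sort one contiguous chunk per worker, then merge the results."""
    size = max(1, -(-len(xs) // num_workers))  # ceiling division
    chunks = [xs[i:i + size] for i in range(0, len(xs), size)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Leaving the with-block waits for all workers to finish.
        sorted_chunks = list(pool.map(merge_sort, chunks))
    # k-way merge of the already sorted chunks.
    return list(merge(*sorted_chunks))

if __name__ == "__main__":
    print(chunked_sort([5, 3, 9, 1, 8, 2, 7, 4, 6, 0]))
```

Cutting the input into more chunks than there are workers lets the pool hand the next chunk to whichever worker finishes first, which is one common way to soften the idle-core concern raised for Method 1.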

Simple Generators

This code comes from a paper called "Lazy v. Yield". It's about a way to decouple producers and consumers of streams of data. I understand the Haskell portion of the code, but the OCaml/F# eludes me. I don't understand this code for the following reasons:
What kind of behavior can I expect from a function that takes as argument an exception and returns unit?
How does the consumer project into a specific exception? (what does that mean?)
What would be an example of a consumer?
module SimpleGenerators
type 'a gen = unit -> 'a
type producer = unit gen
type consumer = exn -> unit (* consumer will project into specific exception *)
type 'a transducer = 'a gen -> 'a gen
let yield_handler : (exn -> unit) ref =
    ref (fun _ -> failwith "yield handler is not set")
let iterate (gen : producer) (consumer : consumer) : unit =
    let oldh = !yield_handler in
    let rec newh x =
        try
            yield_handler := oldh
            consumer x
            yield_handler := newh
        with e -> yield_handler := newh; raise e
    in
    try
        yield_handler := newh
        let r = gen () in
        yield_handler := oldh
        r
    with e -> yield_handler := oldh; raise e

I'm not familiar with the paper, so others will probably be more enlightening. Here are some quick answers/guesses in the meantime.
A function of type exn -> unit is basically an exception handler.
Exceptions can contain data. They're quite similar to polymorphic variants that way--i.e., you can add a new exception whenever you want, and it can act as a data constructor.
It looks like the consumer is going to look for a particular exception(s) that give it the data it wants. Others it will just re-raise. So, it's only looking at a projection of the space of possible exceptions (I guess).

I think the OCaml sample is using a few constructs and design patterns that you would not typically use in F#, so it is quite OCaml-specific. As Jeffrey says, OCaml programs often use exceptions for control flow (while in F# they are only used for exceptional situations).
Also, F# has a really powerful sequence-expression mechanism that can be used quite nicely to separate producers of data from the consumers of data. I did not read the paper in detail, so maybe they have something more complicated, but a simple example in F# could look like this:
// Generator: Produces infinite sequence of numbers from 'start'
// and prints the numbers as they are being generated (to show I/O behaviour)
let rec numbers start = seq {
    printfn "generating: %d" start
    yield start
    yield! numbers (start + 1) }
A simple consumer can be implemented using a for loop, but because the stream is infinite, we need to say how many elements to consume using Seq.take:
// Consumer: takes a sequence of numbers generated by the
// producer and consumes first 100 elements
let consumer nums =
    for n in nums |> Seq.take 100 do
        printfn "consuming: %d" n
When you run consumer (numbers 0) the code starts printing:
generating: 0
consuming: 0
generating: 1
consuming: 1
generating: 2
consuming: 2
So you can see that the effects of producers and consumers are interleaved. I think this is quite a simple and powerful mechanism, but maybe I'm missing the point of the paper and they have something even more interesting. If so, please let me know! Although I think the idiomatic F# solution will probably look quite similar to the above.
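The same producer/consumer interleaving can be reproduced with generators in other languages. As a minimal sketch, here is a Python analogue of the two functions above. It collects the messages in a list called `log` instead of printing them, and the `limit` parameter is an addition for the example.

```python
from itertools import islice

def numbers(start, log):
    """Infinite generator; records each value as it is produced."""
    n = start
    while True:
        log.append(f"generating: {n}")
        yield n
        n += 1

def consumer(nums, log, limit=3):
    """Consume only the first `limit` elements of a possibly infinite stream."""
    for n in islice(nums, limit):
        log.append(f"consuming: {n}")

if __name__ == "__main__":
    log = []
    consumer(numbers(0, log), log)
    print("\n".join(log))
```

Each element is produced only when the consumer asks for it, so the "generating" and "consuming" entries alternate, and nothing past the requested number of elements is generated.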

How to add a finalizer on a TVar

Background
In response to a question, I built and uploaded a bounded-tchan (wouldn't have been right for me to upload jnb's version). If the name isn't enough, a bounded-tchan (BTChan) is an STM channel that has a maximum capacity (writes block if the channel is at capacity).
Recently, I've received a request to add a dup feature like in the regular TChan's. And thus begins the problem.
How the BTChan looks
A simplified (and actually non-functional) view of BTChan is below.
data BTChan a = BTChan
{ max :: Int
, count :: TVar Int
, channel :: TVar [(Int, a)]
, nrDups :: TVar Int
}
Every time you write to the channel you include the number of dups (nrDups) in the tuple - this is an 'individual element counter' which indicates how many readers have gotten this element.
Every reader will decrement the counter for the element it reads, then move its read pointer to the next element in the list. If the reader decrements the counter to zero, then the value of count is decremented to properly reflect available capacity on the channel.
To be clear on the desired semantics: A channel capacity indicates the maximum number of elements queued in the channel. Any given element is queued until a reader of each dup has received the element. No elements should remain queued for a GCed dup (this is the main problem).
For example, let there be three dups of a channel (c1, c2, c3) with a capacity of 2, where 2 items were written into the channel and then all items were read out of c1 and c2. The channel is still full (0 remaining capacity) because c3 hasn't consumed its copies. At any point in time, if all references to c3 are dropped (so c3 is GCed), then the capacity should be freed (restored to 2 in this case).
Here's the issue: let's say I have the following code
c <- newBTChan 1
_ <- dupBTChan c -- This represents what would probably be a pathological bug or terminated reader
writeBTChan c "hello"
_ <- readBTChan c
Causing the BTChan to look like:
BTChan 1 (TVar 0) (TVar []) (TVar 1) --> -- newBTChan
BTChan 1 (TVar 0) (TVar []) (TVar 2) --> -- dupBTChan
BTChan 1 (TVar 1) (TVar [(2, "hello")]) (TVar 2) --> -- readBTChan c
BTChan 1 (TVar 1) (TVar [(1, "hello")]) (TVar 2) -- OH NO!
Notice at the end the read count for "hello" is still 1? That means the message is not considered gone (even though it will get GCed in the real implementation) and our count will never decrement. Because the channel is at capacity (1 element maximum) the writers will always block.
I want a finalizer created each time dupBTChan is called. When a dupped (or original) channel is collected all elements remaining to be read on that channel will get the per-element count decremented, also the nrDups variable will be decremented. As a result, future writes will have the correct count (a count that doesn't reserve space for variables not-read by GCed channels).
Solution 1 - Manual Resource Management (what I want to avoid)
JNB's bounded-tchan actually has manual resource management for this reason. See the cancelBTChan. I'm going for something harder for the user to get wrong (not that manual management isn't the right way to go in many cases).
Solution 2 - Use exceptions by blocking on TVars (GHC can't do this how I want)
EDIT: this solution, and solution 3, which is just a spin-off, do not work! Due to bug 5055 (WONTFIX), GHC sends exceptions to both blocked threads, even though one is sufficient (which is theoretically determinable, but not practical with the GHC GC).
If all the ways to get a BTChan are IO, we can forkIO a thread that reads/retries on an extra (dummy) TVar field unique to the given BTChan. The new thread will catch an exception when all other references to the TVar are dropped, so it will know when to decrement the nrDups and individual element counters. This should work but forces all my users to use IO to get their BTChans:
data BTChan a = BTChan { ... as before ..., dummyTV :: TVar () }
dupBTChan :: BTChan a -> IO (BTChan a)
dupBTChan c = do
    ... as before ...
    d <- newTVarIO ()
    let chan = BTChan ... d
    forkIO $ watchBTChan chan
    return chan
watchBTChan :: BTChan a -> IO ()
watchBTChan b = do
    catch (atomically (readTVar (dummyTV b) >> retry)) $ \e -> do
        case fromException e of
            Just BlockedIndefinitelyOnSTM -> atomically $ do -- the BTChan must have gotten collected
                ls <- readTVar (channel b)
                writeTVar (channel b) (map (\(n, x) -> (n - 1, x)) ls)
                readTVar (nrDups b) >>= writeTVar (nrDups b) . subtract 1
            _ -> watchBTChan b
EDIT: Yes, this is a poor man's finalizer, and I don't have any particular reason to avoid using addFinalizer. That would be the same solution, still forcing use of IO afaict.
Solution 3: A cleaner API than solution 2, but GHC still doesn't support it
Users start a manager thread by calling initBTChanCollector, which will monitor a set of these dummy TVars (from solution 2) and do the needed clean-up. Basically, it shoves the IO into another thread that knows what to do via a global (unsafePerformIOed) TVar. Things work basically like solution 2, but the creation of BTChan's can still be STM. Failure to run initBTChanCollector would result in an ever-growing list (space leak) of tasks as the process runs.
Solution 4: Never allow discarding BTChans
This is akin to ignoring the problem. If the user never drops a dupped BTChan then the issue disappears.
Solution 5
I see ezyang's answer (totally valid and appreciated), but really would like to keep the current API just with a 'dup' function.
Solution 6
Please tell me there's a better option.
EDIT:
I implemented solution 3 (totally untested alpha release) and handled the potential space leak by making the global itself a BTChan. That chan should probably have a capacity of 1 so that forgetting to run init shows up really quickly, but that's a minor change. This works in GHCi (7.0.3), but that seems to be incidental. GHC throws exceptions to both blocked threads (the valid one reading the BTChan and the watching thread), so if you are blocked reading a BTChan when another thread discards its reference, then you die.

Here is another solution: require all accesses to the bounded channel duplicate to be bracketed by a function that releases its resources on exit (by an exception or normally). You can use a monad with a rank-2 runner to prevent duplicated channels from leaking out. It's still manual, but the type system makes it a lot harder to do naughty things.
You really don't want to rely on true IO finalizers, because GHC gives no guarantees about when a finalizer may be run: for all you know it may wait until the end of the program before running the finalizer, which means you're deadlocked until then.
