Simple Generators - haskell

This code comes from a paper called "Lazy v. Yield". It's about a way to decouple producers and consumers of streams of data. I understand the Haskell portion of the code but the O'Caml/F# eludes me. I don't understand this code for the following reasons:
What kind of behavior can I expect from a function that takes as argument an exception and returns unit?
How does the consumer project into a specific exception? (what does that mean?)
What would be an example of a consumer?
module SimpleGenerators
type 'a gen = unit -> 'a
type producer = unit gen
type consumer = exn -> unit (* consumer will project into specific exception *)
type 'a transducer = 'a gen -> 'a gen
let yield_handler : (exn -> unit) ref =
    ref (fun _ -> failwith "yield handler is not set")
let iterate (gen : producer) (consumer : consumer) : unit =
    let oldh = !yield_handler in
    let rec newh x =
        try
            yield_handler := oldh
            consumer x
            yield_handler := newh
        with e -> yield_handler := newh; raise e
    in
    try
        yield_handler := newh
        let r = gen () in
        yield_handler := oldh
        r
    with e -> yield_handler := oldh; raise e

I'm not familiar with the paper, so others will probably be more enlightening. Here are some quick answers/guesses in the meantime.
A function of type exn -> unit is basically an exception handler.
Exceptions can contain data. They're quite similar to polymorphic variants that way--i.e., you can add a new exception whenever you want, and it can act as a data constructor.
It looks like the consumer is going to look for one or more particular exceptions that carry the data it wants. Any others it will just re-raise. So, it's only looking at a projection of the space of possible exceptions (I guess).
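To make the "projection" guess a bit more concrete, here is a minimal OCaml sketch of what such a consumer could look like. The exception Yield_int, the seen accumulator and int_consumer are hypothetical names made up for illustration; they are not taken from the paper.

```ocaml
(* Hypothetical exception that carries a yielded integer. *)
exception Yield_int of int

(* Values collected by the consumer, newest first. *)
let seen : int list ref = ref []

(* A consumer of type exn -> unit: it picks out ("projects onto")
   Yield_int and re-raises every exception it does not recognise. *)
let int_consumer : exn -> unit = function
  | Yield_int n -> seen := n :: !seen
  | e -> raise e

let () =
  int_consumer (Yield_int 1);
  int_consumer (Yield_int 2);
  List.iter (Printf.printf "consumed %d\n") (List.rev !seen)
```

Since exn is an open type, a producer could define its own exception for each kind of payload, and each consumer would only match the ones it cares about.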

I think the OCaml sample is using a few constructs and design patterns that you would not typically use in F#, so it is quite OCaml-specific. As Jeffrey says, OCaml programs often use exceptions for control flow (while in F# they are only used for exceptional situations).
Also, F# has a really powerful sequence expression mechanism that can be used quite nicely to separate producers of data from the consumers of data. I did not read the paper in detail, so maybe they have something more complicated, but a simple example in F# could look like this:
// Generator: Produces infinite sequence of numbers from 'start'
// and prints the numbers as they are being generated (to show I/O behaviour)
let rec numbers start = seq {
  printfn "generating: %d" start
  yield start
  yield! numbers (start + 1) }
A simple consumer can be implemented using a for loop, but because the stream is infinite, we need to say how many elements to consume using Seq.take:
// Consumer: takes a sequence of numbers generated by the
// producer and consumes first 100 elements
let consumer nums =
  for n in nums |> Seq.take 100 do
    printfn "consuming: %d" n
When you run consumer (numbers 0) the code starts printing:
generating: 0
consuming: 0
generating: 1
consuming: 1
generating: 2
consuming: 2
So you can see that the effects of producers and consumers are interleaved. I think this is quite a simple and powerful mechanism, but maybe I'm missing the point of the paper and they have something even more interesting. If so, please let me know! Although I think the idiomatic F# solution will probably look quite similar to the above.
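For comparison, roughly the same producer/consumer split can be sketched in OCaml with the standard Seq module (OCaml 4.07 or later). This is only an illustration of the interleaving, not the paper's technique. The log/note helpers are hypothetical and record messages so the order can be inspected, and take is written by hand here so the sketch does not depend on a newer standard library.

```ocaml
(* Record messages in order of occurrence (newest first). *)
let log : string list ref = ref []
let note s = log := s :: !log

(* Producer: a lazy, infinite sequence of integers starting at [start]. *)
let rec numbers start : int Seq.t = fun () ->
  note (Printf.sprintf "generating: %d" start);
  Seq.Cons (start, numbers (start + 1))

(* Keep at most the first [n] elements, without forcing any further. *)
let rec take n (s : 'a Seq.t) : 'a Seq.t = fun () ->
  if n <= 0 then Seq.Nil
  else
    match s () with
    | Seq.Nil -> Seq.Nil
    | Seq.Cons (x, rest) -> Seq.Cons (x, take (n - 1) rest)

(* Consumer: pulls three elements from the producer. *)
let consumer nums =
  Seq.iter (fun n -> note (Printf.sprintf "consuming: %d" n)) (take 3 nums)

let () =
  consumer (numbers 0);
  List.iter print_endline (List.rev !log)
```

Each element is generated only when the consumer asks for it, so the "generating" and "consuming" messages alternate just as in the F# output.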


Writing several unit definitions?

I've seen many OCaml programs that have all their functions at the top and then a unit definition at the end, like:
let rec factorial num =
  if num = 0 then 1
  else num * factorial (num-1)
let () =
  let num2 = read_int () in
  print_int (factorial num2)
Why is this? Does it act like a main function? If so, you shouldn't be able to use several of them, right?
What is the best way to handle several inputs, for example? Writing several unit definitions?
Yes, a unit expression at the top level of a module acts like the main function of the module. I.e., it gets executed at the time the program is started.
You can have many unit expressions anywhere you can have one unit expression. The ; operator is specifically intended for such cases:
let () =
  Printf.printf "hello\n";
  Printf.printf "world\n"
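To address the "several of them" part directly, here is a small sketch showing that a module may contain more than one let () = definition, and that they run in the order in which they appear, interleaved with ordinary definitions. The names trace and helper are made up for the example.

```ocaml
(* Records the order in which top-level unit definitions run. *)
let trace : string list ref = ref []

let () = trace := "first" :: !trace

let helper x = x * 2

(* A second unit definition; it can use everything defined above it. *)
let () = trace := Printf.sprintf "second (helper 21 = %d)" (helper 21) :: !trace

let () = List.iter print_endline (List.rev !trace)
```

So there is no single distinguished "main"; every top-level definition is evaluated from top to bottom when the program starts.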
As a side comment, I often write a main function in my main module:
let main () =
  (* main calculation of program *)
let () = main ()
This is possibly a holdover from all the years I wrote C code.
I have also seen this in other people's code (possibly there are a lot of us who used to write C code).
I really like Jeffrey's answer, but in case you want extra details and want to know what let () = foo means, here is some extracurricular reading.
Abstractly speaking the operation of OCaml programs could be defined as a machine that reduces expressions until they become irreducible. And an irreducible expression is called a value. For example, 5 + 3 is reduced to 8 and there is no other way to reduce 8 so 8 is a value. A more complex example of a value is (fun x -> x + 1). And a more complex example of expression would be
(fun x -> x + 1) 5
Which is reduced to 6.
The whole semantics of the language is defined as a set of such reduction rules. And a program in OCaml is an ordered list of definitions of the form,
let <pattern> = <expression>
So that when an OCaml program is evaluated (executed), it reduces the <expression> part of each definition and matches the result against the <pattern> on the left-hand side, e.g.,
let 5 = 2 + 3
is a valid definition in OCaml. It will reduce the 2 + 3 expression to 5 and then try to match the resulting value with the left-hand side. If it matches, then the next definition is evaluated, and so on. If it doesn't, the program is terminated with a Match_failure exception.
Here 5 is a very simple value that matches only with 5 and, in general, your values will be more complex. However, there is a value that is even more primitive than 5. It is a value of type unit that has only one inhabitant, denoted as (). And this is also the value, to which colloquially expressions with side effects are reduced. Since in OCaml every expression must reduce to a value, we need a value that represents no value, and that is unit. For example print_endline "foo" reduces to () with a side effect of emitting string foo to the standard output.
Therefore, when we write
let foo () = print_endline "foo"
let () = foo ()
We evaluate (reduce) the function foo until it reaches the () value that indicates that we fully reduced foo ().
We could also use a wildcard matcher and write
let _ = foo ()
or bind the result to a variable, e.g.,
let bar = foo ()
But it is considered a good style to use () on the left-hand side of an expression that evaluates to () to indicate that the right-hand side doesn't produce any interesting value. It also prevents common errors, e.g.,
let () = foo
will yield an error saying that unit -> unit can't be matched with unit, and will even provide a hint: Did you forget to provide `()' as argument?
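Putting the three accepted forms side by side, as a small runnable sketch (the calls counter is only there so the effect can be observed; it is not part of the original explanation):

```ocaml
(* Counts how many times foo has actually been applied. *)
let calls = ref 0
let foo () = incr calls; print_endline "foo"

let () = foo ()      (* pattern (): only the side effect matters *)
let _ = foo ()       (* wildcard: accepted, but would also accept a bare foo *)
let bar = foo ()     (* bar : unit, bound to the value () *)
```

All three lines apply foo, so "foo" is printed three times. Only the first form would be rejected by the type checker if the () argument were forgotten.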

Multithreading based on duplicated jOOλ streams

The code below represents a toy example of the problem I am trying to solve.
Imagine that we have an original stream of data, originalStream, and that the goal is to apply two very different data-processing steps to it. As an example here, one step will multiply each element by 2 and sum the result (dataProcess1) and the other will multiply by 4 and sum the result (dataProcess2). Obviously the operations would not be so simple in real life.
The idea is to use jOOλ in order to duplicate the stream and apply both operations to the 2 streams. However, the trick is that I want to run both data processing in different threads. Since originalStream.duplicate() is not thread-safe out of the box, the code below will fail to give the right result which should be: result1 = 570; result2 = 180. Instead the code may unpredictably fail on NPE, yield the wrong result or (sometimes) even give the right result...
The question is how to minimally modify the code such that it will become thread-safe.
Note that I do not want to first collect the stream into a list and then generate 2 new streams. Instead I want to stay with streams until they are eventually collected at the end of the data process. It may not be the most efficient nor the most logical thing to want to do but I think it is nevertheless conceptually interesting. Note also that I wish to keep using org.jooq.lambda.Seq (group: 'org.jooq', name: 'jool', version: '0.9.12') as much as possible as the real data processing functions will use methods that are specific to this library and not present in regular Java streams.
Seq<Long> originalStream = seq(LongStream.range(0, 10));
Tuple2<Seq<Long>, Seq<Long>> duplicatedOriginalStream = originalStream.duplicate();
ExecutorService executor = Executors.newFixedThreadPool(2);
List<Future<Long>> res = executor.invokeAll(Arrays.asList(
    () -> duplicatedOriginalStream.v1.map(x -> 2 * x).zipWithIndex().map(x -> x.v1 * x.v2).reduce((x, y) -> x + y).orElse(0L),
    () -> duplicatedOriginalStream.v2.map(x -> 4 * x).reduce((x, y) -> x + y).orElse(0L)
));
System.out.printf("result1 = %d\tresult2 = %d\n", res.get(0).get(), res.get(1).get());

G-machine, (non-)strict contexts - why case expressions need special treatment

I'm currently reading Implementing functional languages: a tutorial by SPJ and the (sub)chapter I'll be referring to in this question is 3.8.7 (page 136).
The first remark there is that a reader following the tutorial has not yet implemented C scheme compilation (that is, of expressions appearing in non-strict contexts) of ECase expressions.
The solution proposed is to transform a Core program so that ECase expressions simply never appear in non-strict contexts. Specifically, each such occurrence creates a new supercombinator with exactly one variable, whose body corresponds to the original ECase expression, and the occurrence itself is replaced with a call to that supercombinator.
Below I present a (slightly modified) example of such a transformation from [1]:
t a b = Pack{2,1} ;
f x = Pack{2,2} (case t x 7 6 of
                   <1> -> 1;
                   <2> -> 2) Pack{1,0} ;
main = f 3
== transformed into ==>
t a b = Pack{2,1} ;
f x = Pack{2,2} ($Case1 (t x 7 6)) Pack{1,0} ;
$Case1 x = case x of
             <1> -> 1;
             <2> -> 2 ;
main = f 3
I implemented this solution and it works like a charm, that is, the output is Pack{2,2} 2 Pack{1,0}.
However, what I don't understand is: why all that trouble? I hope it's not just me, but my first thought for solving the problem was to just implement compilation of ECase expressions in the C scheme. I did it by mimicking the rule for compilation in the E scheme (page 134 in [1], but I present that rule here for completeness). So I used
E[[case e of alts]] p = E[[e]] p ++ [Casejump D[[alts]] p]
and wrote
C[[case e of alts]] p = C[[e]] p ++ [Eval] ++ [Casejump D[[alts]] p]
I added [Eval] because Casejump needs an argument on top of the stack in weak head normal form (WHNF) and C scheme doesn't guarantee that, as opposed to E scheme.
But then the output changes to the enigmatic Pack{2,2} 2 6.
The same applies when I use the same rule as for E scheme, i.e.
C[[case e of alts]] p = E[[e]] p ++ [Casejump D[[alts]] p]
So I guess that my "obvious" solution is inherently wrong - and I can see that from outputs. But I'm having trouble stating formal arguments as to why that approach was bound to fail.
Can someone provide me with such argument/proof or some intuition as to why the naive approach doesn't work?
The purpose of the C scheme is to not perform any computation, but just delay everything until an EVAL happens (which it might or might not). What are you doing in your proposed code generation for case? You're calling EVAL! And the whole purpose of C is to not call EVAL on anything, so you've now evaluated something prematurely.
The only way you could generate code directly for case in the C scheme would be to add some new instruction to perform the case analysis once it's evaluated.
But we (Thomas Johnsson and I) decided it was simpler to just lift out such expressions. The exact historical details are lost in time though. :)

What is process interleaving? (in the realm of Concurrency)

I'm not quite sure as to what this term means. I saw it during a course where we are learning about concurrency. I've seen a lot of definitions for data interleaving, but I couldn't find anything about process interleaving.
When looking at the term my instincts tell me it is the use of threads to run more than one process simultaneously, is that correct?
If you imagine a process as a (possibly infinite) sequence/trace of statements (e.g. obtained by loop unfolding), then the set of possible interleavings of several processes consists of all sequences that merge the statements of those processes while preserving each process's own statement order.
Consider for example the processes
int i;
proctype A() {
  i = 1;
}
proctype B() {
  i = 2;
}
Then the possible interleavings are i = 1; i = 2 and i = 2; i = 1, i.e. the possible final values for i are 1 and 2. This can be of course more complex, for instance in the presence of guarded statements: Then the next possible statements in an interleaving sequence are not necessarily those at the position of the next program counter, but only those that are allowed by the guard; consider for example the proctype
proctype B() {
  if
  :: i == 0 -> i = 2
  :: else -> skip
  fi
}
Then the possible interleavings (given A() as before) are i = 1; skip and i = 2; i = 1, so there is only one possible final value for i.
Indeed the notion of interleavings is crucial for Spin's view of concurrency. In a trace semantics, the set of possible traces of concurrent processes is the set of possible interleavings of the traces of the individual processes.
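The final values claimed above can be checked with a small OCaml sketch. It rests on two assumptions that should be stated loudly: i starts at 0, and each process contributes a single atomic step, i.e. the guard and its assignment in the second B are treated as one step, as the traces listed above do. The function names are invented for this example.

```ocaml
(* Each process is modelled as one step: a function from the old i to the new i. *)
let a _ = 1                                  (* A: i = 1 *)
let b_plain _ = 2                            (* first B: i = 2 *)
let b_guarded i = if i = 0 then 2 else i     (* second B: guarded, else skip *)

(* Run a schedule of steps, starting from the assumed initial value i = 0. *)
let run schedule = List.fold_left (fun i step -> step i) 0 schedule

(* Distinct final values of i over both orders of A and the given B. *)
let final_values b = List.sort_uniq compare [run [a; b]; run [b; a]]

let () =
  List.iter (Printf.printf "%d ") (final_values b_plain); print_newline ();
  List.iter (Printf.printf "%d ") (final_values b_guarded); print_newline ()
```

Under these assumptions the unguarded B allows the final values 1 and 2, while the guarded B leaves only 1.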
It simply means performing actions (data access, execution, ...) in an arbitrary order** (see the note). In the case of concurrency, it usually refers to action interleaving.
If the processes P and Q are in parallel composition (P||Q), then their actions will be interleaved. Consider the following processes:
PLAYING = (play_music -> stop_music -> STOP).
PERFORMING = (dance -> STOP).
So each primitive process can be shown as: (generated by the LTSA model-checking tool)
Then the possible traces as the result of action interleaving will be:
dance -> play_music -> stop_music
play_music -> dance -> stop_music
play_music -> stop_music -> dance
Here is the LTSA tool generated output of this example.
**note: "arbitrary" here means an arbitrary choice of which process executes next, not of the order of statements inside a process. The code within each process is always executed sequentially.
If it is still something that you're not comfortable with you can take a look at:
Hope it helps! :)
Operating systems support tasks (or processes). But for now let's think of "activities".
Activities can be executed in parallel. Here are two activities, P and Q:
P: abc
Q: def
a, b, c, d, e, f are operations. Each operation has always the same effect independent of what other operations may be executing at the same time (atomicity).
What is the effect of executing the two activities concurrently? We do not know for sure, but we know that it will be the same as obtained by executing sequentially an INTERLEAVING of the two activities [interleavings are also called SCHEDULES]. Here are the possible interleavings of these two activities:
That is, the operations of the two activities are sequenced in all possible ways that preserve the order in which the operations appeared in the two activities. A serial interleaving [serial schedule] of two activities is one where all the operations of one activity precede all the operations of the other activity.
The importance of the concept of interleaving is that it allows us to express the meaning of concurrent programs: The parallel execution of activities is equivalent to the sequential execution of one of the interleavings of these activities.
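The interleavings of P and Q can be enumerated mechanically. Here is a short OCaml sketch (the function and variable names are my own) that builds every sequence merging the two activities while preserving the order of operations within each one:

```ocaml
(* All interleavings of two lists that preserve each list's internal order. *)
let rec interleavings xs ys =
  match xs, ys with
  | [], zs | zs, [] -> [zs]
  | x :: xs', y :: ys' ->
    (* Either the next operation comes from xs, or it comes from ys. *)
    List.map (fun rest -> x :: rest) (interleavings xs' ys)
    @ List.map (fun rest -> y :: rest) (interleavings xs ys')

let p = ['a'; 'b'; 'c']
let q = ['d'; 'e'; 'f']

let to_string cs = String.concat "" (List.map (String.make 1) cs)

let all = List.map to_string (interleavings p q)

let () = List.iter print_endline all
```

For two activities of three operations each this yields 20 schedules (6 choose 3), among them the two serial ones, abcdef and defabc.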
For detailed information:

Howto program thread-based parallel list iteration?

As an example, I need to program a parallel iter function using OCaml threads. My first idea was to have a function similar to this:
let procs = 4 ;;
let rec _part part i lst = match lst with
    [] -> ()
  | hd::tl ->
      let idx = i mod procs in
      (* Printf.printf "part idx=%i\n" idx; *)
      let accu = part.(idx) in
      part.(idx) <- (hd::accu);
      _part part (i+1) tl ;;
Then a parallel iter could look like this (here as process-based variant):
let iter f lst = let part = Array.create procs [] in
  _part part 0 lst;
  let rec _do i =
    (* Printf.printf "do idx=%i\n" i; *)
    match Unix.fork () with
      0 -> (* Code of child *)
        if i < procs then
          (* Printf.printf "child %i\n" i; *)
          List.iter f part.(i)
    | pid -> (* Code of father *)
        (* Printf.printf "father %i\n" i; *)
        if i >= procs then ignore (Unix.waitpid [] pid)
        else _do (i+1)
  in
  _do 0 ;;
Because the usage of the Thread module is a little bit different, how would I code this using OCaml's Thread module?
And there is another question: the _part function must scan the whole list to split it into n parts, and then each part is handed to its own process (here). Is there a solution that does not split the list first?
If you have a function which processes a list, and you want to run it on several lists independently, you can call Thread.create with that function and every list. If you store your lists in array part then:
let threads = Array.map (Thread.create (List.iter f)) part in
Array.iter Thread.join threads
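A fuller sketch of the same idea, assuming the threads library is linked (for example with ocamlfind and -package threads.posix). The names partition and par_iter are hypothetical, Array.make is used in place of the older Array.create, and the mutex-protected counter is only there to show that every element is visited.

```ocaml
let procs = 4

(* Round-robin split of lst into procs sublists, keeping relative order. *)
let partition lst =
  let part = Array.make procs [] in
  List.iteri (fun i x -> part.(i mod procs) <- x :: part.(i mod procs)) lst;
  Array.map List.rev part

(* Start one thread per sublist, then wait for all of them to finish. *)
let par_iter f lst =
  let threads = Array.map (Thread.create (List.iter f)) (partition lst) in
  Array.iter Thread.join threads

(* Shared state touched from several threads, guarded by a mutex. *)
let total = ref 0
let m = Mutex.create ()

let () =
  par_iter
    (fun x -> Mutex.lock m; total := !total + x; Mutex.unlock m)
    [1; 2; 3; 4; 5; 6; 7; 8; 9; 10];
  Printf.printf "total = %d\n" !total
```

The list is still split up front; the sketch changes only how the parts are run, not the splitting question.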
INRIA OCaml threads are not actual threads: only one thread executes at any given time, which means if you have four processors and four threads, all four threads will use the same processor and the other three will remain unused.
Where threads are useful is that they still allow asynchronous programming: some Thread module primitives can wait for an external resource to become available. This can reduce the time your software spends blocked by an unavailable resource, because you can have another thread do something else in the meantime. You can also use this to concurrently start several external asynchronous processes (like querying several web servers through HTTP). If you don't have a lot of resource-related blocking, this is not going to help you.
As for your list-splitting question: to access an element of a list, you must traverse all previous elements. While this traversal could theoretically be split across several threads or processes, the communication overhead would likely make it a lot slower than just splitting things ahead of time in one process. Or using arrays.
Answer to a question from the comments. The answer does not quite fit in a comment itself.
There is a lock on the OCaml runtime. The lock is released when an OCaml thread is about to enter a C function that
may block;
may take a long time.
So you can only have one OCaml thread using the heap, but you can sometimes have non-heap-using C functions working in parallel with it.
See for instance the file ocaml-3.12.0/otherlibs/unix/write.c
memmove (iobuf, &Byte(buf, ofs), numbytes); // if we kept the data in the heap
// the GC might move it from
// under our feet.
enter_blocking_section(); // release lock.
// Another OCaml thread may
// start in parallel of this one now.
ret = write(Int_val(fd), iobuf, numbytes);
leave_blocking_section(); // take lock again to continue
// with OCaml code.
