The code below represents a toy example of the problem I am trying to solve.
Imagine that we have an original stream of data, originalStream, and that the goal is to apply two very different data-processing operations to it. As an example here, one will multiply each element by 2 and sum the results (dataProcess1), and the other will multiply each element by 4 and sum the results (dataProcess2). Obviously the operations would not be so simple in real life...
The idea is to use jOOλ in order to duplicate the stream and apply one operation to each of the two resulting streams. However, the trick is that I want to run both data-processing operations in different threads. Since originalStream.duplicate() is not thread-safe out of the box, the code below will fail to give the right result, which should be result1 = 570 and result2 = 180. Instead, the code may unpredictably fail with an NPE, yield a wrong result, or (sometimes) even give the right one...
The question is how to minimally modify the code so that it becomes thread-safe.
Note that I do not want to first collect the stream into a list and then generate two new streams. Instead, I want to stay with streams until they are eventually collected at the end of the data processing. It may not be the most efficient nor the most logical thing to want to do, but I think it is nevertheless conceptually interesting. Note also that I wish to keep using org.jooq.lambda.Seq (group: 'org.jooq', name: 'jool', version: '0.9.12') as much as possible, as the real data-processing functions will use methods that are specific to this library and not present in regular Java streams.
Seq<Long> originalStream = seq(LongStream.range(0, 10));
Tuple2<Seq<Long>, Seq<Long>> duplicatedOriginalStream = originalStream.duplicate();
ExecutorService executor = Executors.newFixedThreadPool(2);
List<Future<Long>> res = executor.invokeAll(Arrays.asList(
() -> duplicatedOriginalStream.v1.map(x -> 2 * x).zipWithIndex().map(x -> x.v1 * x.v2).reduce((x, y) -> x + y).orElse(0L),
() -> duplicatedOriginalStream.v2.map(x -> 4 * x).reduce((x, y) -> x + y).orElse(0L)
));
executor.shutdown();
System.out.printf("result1 = %d\tresult2 = %d\n", res.get(0).get(), res.get(1).get());
Related
My Apache Beam pipeline takes an infinite stream of messages. Each message fans out into N elements (N is ~1000 and is different for each input). Then each element produced by the previous stage goes through a map operation, and the resulting N elements should be reduced using a top-1 operation (elements are grouped by the original message that was read from the queue). The result of the top 1 is saved to external storage. In Spark I can easily do it by reading messages from the stream and creating an RDD for each message that does map + reduce. Since Apache Beam does not have nested pipelines, I can't see a way of implementing this in Beam with an infinite stream input. Example:
Infinite stream elements: A, B
Step 1 (fan out, N = 3): A -> A1, A2, A3
(N = 2): B -> B1, B2
Step 2 (map): A1, A2, A3 -> A1', A2', A3'
B1, B2 -> B1', B2'
Step 3 (top1): A1', A2', A3' -> A2'
B1', B2' -> B2'
Output: A2', B2'
There is no dependency between A and B elements. A2' and B2' are the top elements within their respective groups. The stream is infinite. The map operation can take from a couple of seconds to a couple of minutes. Creating a window watermark for the maximum time the map operation can take would make the overall pipeline much slower for fast map operations. A nested pipeline would help, because that way I could create a pipeline per message.
It doesn't seem like you'd need a 'nested pipeline' for this. Let me show you what that looks like in the Beam Python SDK (it's similar for Java):
For example, for the dummy operation of appending a number and an apostrophe to a string (e.g. "A" => "A1'"), you'd do something like this:
def my_fn(value):
  def _inner(elm):
    return (elm, elm + str(value) + "'")  # A KV-pair
  return _inner

# my_stream has [A, B]
pcoll_1 = (my_stream
           | beam.Map(my_fn(1)))

pcoll_2 = (my_stream
           | beam.Map(my_fn(2)))

pcoll_3 = (my_stream
           | beam.Map(my_fn(3)))

def top_1(elms):
  ...  # Some operation

result = ((pcoll_1, pcoll_2, pcoll_3)
          | beam.CoGroupByKey()
          | beam.Map(top_1))
So here is the sort of working solution. I will most likely be editing it for any mistakes I make in understanding the question. (P.S. the template code is in Java.) Assuming that input is your stream source:
PCollection<Messages> msgs = input.apply(
    Window.<Messages>into(FixedWindows.of(Duration.standardSeconds(1)))
        .triggering(AfterWatermark.pastEndOfWindow()
            // fire the moment you see an element
            .withEarlyFirings(AfterPane.elementCountAtLeast(1))
            // optional since you have a small window
            .withLateFirings(AfterProcessingTime.pastFirstElementInPane()))
        .withAllowedLateness(Duration.standardMinutes(60))
        .discardingFiredPanes());
This would allow you to read a stream of Messages, which could be a String, a HashMap, or even a list. Observe that you are telling Beam to fire a pane for every element it receives, and that you have set a maximum window of 1 second. You can change this if you want to fire every 10 messages with a window of a minute, etc., as in the variation below.
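For instance, a variation of the above (my own sketch, not tested) that fires every 10 messages inside one-minute windows would look like:

Window.<Messages>into(FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(AfterWatermark.pastEndOfWindow()
        // fire every 10 buffered messages instead of every single one
        .withEarlyFirings(AfterPane.elementCountAtLeast(10)))
    .withAllowedLateness(Duration.standardMinutes(60))
    .discardingFiredPanes();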
After that, you would primarily need to write two classes that extend DoFn:
PCollection<Element> top = msgs.apply(ParDo.of(new ExtractElements()))
                               .apply(ParDo.of(new TopElement()));
Where Element can be a String, an int, double, etc.
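To make the shape concrete, here is a hypothetical sketch of the fan-out DoFn, assuming the message is a String and each element is a KV of (original message, score); the names ExtractElements and deriveScores and the chosen types are placeholders, not from any real pipeline:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

class ExtractElements extends DoFn<String, KV<String, Double>> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        String msg = c.element();
        // fan out: emit one (message, score) pair per derived element
        for (Double score : deriveScores(msg)) {
            c.output(KV.of(msg, score));
        }
    }

    // dummy stand-in for whatever derives the N elements of a message
    private Iterable<Double> deriveScores(String msg) {
        return java.util.Collections.singletonList((double) msg.length());
    }
}

Note that a per-key top 1 is usually expressed as a combine (e.g. Max.doublesPerKey()) rather than a second ParDo, since a plain DoFn cannot see across elements of the same key.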
Finally, you would write each Element to storage with:
top.apply(ParDo.of(new ParsetoString()))
   .apply(TextIO.write().withWindowedWrites()
                        .withNumShards(1)
                        .to(filename));
Therefore, you would end up with roughly one file for every message, which may be a lot. Sadly, you cannot append to a file, unless you do a windowing where you group all the elements into one list and write that out.
Of course, there is a hacky way to do it without windowing, which I will explain if this use case does not work out for you (or if you are curious).
Let me know if I missed anything! :)
I'm relatively new to Spark and might even be wrong before finishing building up the scenario, so feel free to stop reading and point out wherever you find I'm conceptually wrong. Thanks!
Imagine a piece of driver code like this:
val A = ... (some transformation)
val B = A.filter( fun1 )
val C = A.filter( fun2 )
...
B.someAction()... //do sth with B
...
C.someAction()... //do sth with C
Transformation RDDs B and C both depend on A, which might itself be a complex transformation. So will A be computed twice? I argue that it will, because Spark can't do anything inter-transformation, right? Spark is intelligent at optimizing one transformation's execution at a time, because the bundled tasks in it can be thoroughly analyzed. For example, it's possible that some state change occurs after B.someAction but before C.someAction which may affect the value of A, so the re-computation becomes necessary. For instance, it could happen like this:
val arr = Array(...)
val A = sc.parallelize(...).flatMap(e => arr.map(_ * e)) //now A depends on some local array
... //B and C stays the same as above
B.someAction()
...
arr(i) = arr(i) + 10 //local state modified
...
C.someAction() //should A be recomputed? YES
This is easy to verify so I did a quick experiment and the result supports my reasoning.
However, if B and C just independently depend on A and no other logic like the above exists, then a programmer or some tool could statically analyze the code and say: hey, it's feasible to add a cache on A so that it doesn't get recomputed unnecessarily! But Spark can do nothing about this, and sometimes it's even hard for a human to decide:
val A = ... (some transformation)
var B = A.filter( fun1 )
var C: ??? = null
var D: ??? = null

if (cond) {
    //now whether multiple dependencies exist is runtime determined
    C = A.filter( fun2 )
    D = A.filter( fun3 )
}

B.someAction()... //do sth with B

if (cond) {
    C.someAction()... //do sth with C
    D.someAction()... //do sth with D
}
If the condition is true, then it's tempting to cache A, but you'll never know until runtime. I know this is an artificial, crappy example, but these are already simplified models; things could get more complicated in practice, and the dependencies could be quite long, implicit, and spread across modules. So my question is: what's the general principle for dealing with this kind of problem? When should the common ancestors in the transformation dependency graph be cached (provided memory is not an issue)?
I'd like to hear something like "always follow functional programming paradigms when doing Spark" or "always cache them if you can". However, there's another situation where I may not need to:
val A = ... (some transformation)
val B = A.filter( fun1 )
val C = A.filter( fun2 )
...
B.join(C).someAction()
Again, B and C both depend on A, but instead of calling two actions separately, they are joined to form one single transformation. This time I believe Spark is smart enough to compute A exactly once. I haven't found a proper way to run and examine this yet, but it should be obvious in the web UI DAG. What's more, I think Spark can even reduce the two filter operations into one traversal of A to get B and C at the same time. Is this true?
There's a lot to unpack here.
Transformation RDDs B and C both depend on A which might itself be a complex transformation. So will A be computed twice ? I argue that it will because spark can't do anything that's inter-transformations, right ?
Yes, it will be computed twice, unless you call A.cache() or A.persist(), in which case it will be calculated only once.
For example it's possible that some state change occurs after B.someAction but before C.someAction which may affect the value of A so the re-computation becomes necessary
No, this is not correct. A is immutable, therefore its state cannot change. B and C are also immutable RDDs that represent transformations of A.
sc.parallelize(...).flatMap(e => arr.map(_ * e)) //now A depends on some local array
No, it doesn't depend on the local array; it is an immutable RDD containing a copy of the elements of the (driver-)local array. If the array changes, A does not change. To obtain that behaviour you would have to declare var A = sc.parallelize(...) and then set A again when the local array changes: A = sc.parallelize(...). In that scenario, A isn't 'updated', it is replaced by a new RDD representation of the local array, and as such any cached version of A is invalid.
The subsequent examples you have posted benefit from caching A. Again because RDDs are immutable.
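To illustrate, a minimal sketch with Spark's Java API (the data and the stand-in transformation are made up for the example):

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext sc = new JavaSparkContext("local[*]", "cache-demo");
JavaRDD<Integer> a = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5))
                       .map(x -> x * 10);      // stand-in for the complex transformation
a.cache();                                     // mark A for reuse across actions
JavaRDD<Integer> b = a.filter(x -> x % 20 == 0);
JavaRDD<Integer> c = a.filter(x -> x > 25);
long nb = b.count();   // first action: computes A and materializes the cache
long nc = c.count();   // second action: reuses the cached A, no recomputation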
I'm not quite sure what this term means. I saw it during a course where we are learning about concurrency. I've seen a lot of definitions for data interleaving, but I couldn't find anything about process interleaving.
When looking at the term, my instinct tells me it is the use of threads to run more than one process simultaneously. Is that correct?
If you imagine a process as a (possibly infinite) sequence/trace of statements (e.g. obtained by loop unfolding), then the set of possible interleavings of several processes consists of all possible sequences of statements of any of those processes.
Consider for example the processes
int i;

proctype A() {
    i = 1;
}

proctype B() {
    i = 2;
}
Then the possible interleavings are i = 1; i = 2 and i = 2; i = 1, i.e. the possible final values for i are 1 and 2. This can of course be more complex, for instance in the presence of guarded statements: then the next possible statements in an interleaving sequence are not necessarily those at the position of the next program counter, but only those that are allowed by the guard. Consider for example the proctype:
proctype B() {
    if
    :: i == 0 -> i = 2
    :: else -> skip
    fi
}
Then the possible interleavings (given A() as before) are i = 1; skip and i = 2; i = 1, so there is only one possible final value for i, namely 1.
Indeed the notion of interleavings is crucial for Spin's view of concurrency. In a trace semantics, the set of possible traces of concurrent processes is the set of possible interleavings of the traces of the individual processes.
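For comparison, here is a Java analogue of the first pair of processes (my own illustration, outside Spin/Promela): two threads each write the shared variable once, and the final value depends on which interleaving the scheduler happens to pick.

class Interleaving {
    static volatile int i = 0;

    public static void main(String[] args) throws InterruptedException {
        Thread a = new Thread(() -> i = 1);   // proctype A
        Thread b = new Thread(() -> i = 2);   // proctype B
        a.start(); b.start();
        a.join(); b.join();
        System.out.println(i);                // prints 1 or 2, schedule-dependent
    }
}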
It simply means performing (data access, execution, ...) in an arbitrary order** (see the note below). In the case of concurrency, it usually refers to action interleaving.
If processes P and Q are composed in parallel (P||Q), then their actions will be interleaved. Consider the following processes:
PLAYING = (play_music -> stop_music -> STOP).
PERFORMING = (dance -> STOP).
||PLAY_PERFORM = (PLAYING || PERFORMING).
So each primitive process can be shown as a labelled transition system (generated by the LTSA model-checking tool).
Then the possible traces as the result of action interleaving will be:
dance -> play_music -> stop_music
play_music -> dance -> stop_music
play_music -> stop_music -> dance
**note: "arbitrary" here means arbitrary choice of process execution not their inner sequence of codes. The code execution in each process will be always followed sequentially.
If it is still something that you're not comfortable with you can take a look at: https://www.doc.ic.ac.uk/~jnm/book/firstbook/pdf/ch3.pdf
Hope it helps! :)
Operating Systems support Tasks (or Processes). But for now let's think of "Activities".
Activities can be executed in parallel. Here are two activities, P and Q:
P: abc
Q: def
a, b, c, d, e, f are operations.*
* Each operation always has the same effect, independent of what other operations may be executing at the same time (atomicity).
What is the effect of executing the two activities concurrently? We do not know for sure, but we know that it will be the same as obtained by executing sequentially an INTERLEAVING of the two activities [interleavings are also called SCHEDULES]. Here are the possible interleavings of these two activities:
abcdef
abdcef
abdecf
abdefc
adbcef
......
defabc
That is, the operations of the two activities are sequenced in all possible ways that preserve the order in which the operations appeared in the two activities. A serial interleaving [serial schedule] of two activities is one where all the operations of one activity precede all the operations of the other activity.
The importance of the concept of interleaving is that it allows us to express the meaning of concurrent programs: The parallel execution of activities is equivalent to the sequential execution of one of the interleavings of these activities.
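As a quick illustration of "all possible ways that preserve the order", here is a small recursive Java sketch (my own, not from the course notes linked below) that enumerates every interleaving of two activities; for "abc" and "def" it prints all 20 schedules, from abcdef to defabc:

import java.util.ArrayList;
import java.util.List;

class Interleavings {
    // Append either the next operation of p or the next operation of q,
    // so each activity's internal order is preserved.
    static void interleave(String p, String q, String acc, List<String> out) {
        if (p.isEmpty() && q.isEmpty()) { out.add(acc); return; }
        if (!p.isEmpty()) interleave(p.substring(1), q, acc + p.charAt(0), out);
        if (!q.isEmpty()) interleave(p, q.substring(1), acc + q.charAt(0), out);
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        interleave("abc", "def", "", out);
        out.forEach(System.out::println);   // abcdef, abdcef, ..., defabc
    }
}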
For detailed information: https://cis.temple.edu/~ingargio/cis307/readings/interleave.html
I am just starting to learn the ins and outs of multithreaded programming and have a few basic questions that, once answered, should keep me occupied for quite some time. I understand that multithreading loses its effectiveness once you have created more threads than there are cores (due to context switching and cache flushing). With that understood, I can think of two ways to employ multithreading of a recursive function... but am not quite sure what the common way to approach the problem is. One seems much more complicated, perhaps with a higher payoff... but that's what I hope you will be able to tell me.
Below is pseudo-code for two different methods of multithreading a recursive function. I have used the terminology of merge sort for simplicity, but it's not that important. It is easy to see how to generalize the methods to other problems. Also, I will personally be employing these methods using the pthreads library in C, so the thread syntax mildly reflects this.
Method 1:
main ()
{
    A = array of length N
    NUM_CORES = get number of functional cores
    chunk[NUM_CORES] = array of indices partitioning A into (N / NUM_CORES) sized chunks
    thread_id[NUM_CORES] = array of thread ids
    thread[NUM_CORES] = array of thread type

    //start NUM_CORES threads, each working on one chunk of A
    for i = 0 to (NUM_CORES - 1) {
        thread_id[i] = thread_start(thread[i], MergeSort, chunk[i])
    }

    //wait for all threads to finish
    //Merge chunks appropriately
    exit
}

MergeSort ( chunk )
{
    MergeSort ( lowerSubChunk )
    MergeSort ( higherSubChunk )
    Merge(lowerSubChunk, higherSubChunk)
}

//Merge(,) not shown
Method 2:
main ()
{
    A = array of length N
    NUM_CORES = get number of functional cores
    chunk = indices 0 and N
    thread_id[NUM_CORES] = array of thread ids
    thread[NUM_CORES] = array of thread type

    //lock variable aka mutex
    THREADS_IN_USE = 1

    MergeSort( chunk )
    exit
}

MergeSort ( chunk )
{
    lock THREADS_IN_USE
    if ( THREADS_IN_USE < NUM_CORES ) {
        FREE_CORE = find index of unused core
        thread_id[FREE_CORE] = thread_start(thread[FREE_CORE], MergeSort, lowerSubChunk)
        THREADS_IN_USE++
        unlock THREADS_IN_USE

        MergeSort( higherSubChunk )

        //wait for thread_id[FREE_CORE] and current thread to finish
        lock THREADS_IN_USE
        THREADS_IN_USE--
        unlock THREADS_IN_USE

        Merge(lowerSubChunk, higherSubChunk)
    }
    else {
        unlock THREADS_IN_USE
        MergeSort( lowerSubChunk )
        MergeSort( higherSubChunk )
        Merge(lowerSubChunk, higherSubChunk)
    }
}

//Merge(,) not shown
Visually, one can think of the differences between these two methods as follows:
Method 1: creates NUM_CORES separate recursion trees, each one having a single core traversing it.
Method 2: creates a single recursion tree but has all cores traversing it. In particular, whenever there is a free core, it is set to work on the "left child subtree" of the first node where MergeSort is called after the core is freed.
The problem with Method 1 is that if it is the case that the running time of the recursive function varies with the distribution of values within each initial subchunk (i.e. the chunk[i]), one thread could finish much faster leaving a core sitting idle while the others finish. With Merge Sort this is not likely to be the case since the work of MergeSort happens in Merge whose runtime isn't affected much by the distribution of values in the (sorted) subchunks. However, with a more involved recursive function, the running time on one subchunk could be much longer!
With Method 2 it is possible to have the same problem. Again, with merge sort it's not clear, since the running time for each subchunk is likely to be similar, and the line //wait for thread_id[FREE_CORE] and current thread to finish would also require one core to wait for the other. However, with Method 2 all calls to Merge run as soon as possible, as opposed to Method 1, where one must wait for NUM_CORES calls to MergeSort to finish and then do NUM_CORES - 1 merges afterward (although you can multithread this as well... to an extent).
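For what it's worth, Method 2 is essentially what a work-stealing scheduler automates. A compact Java sketch of the same idea using ForkJoinPool follows (illustrative only; the cutoff value is arbitrary, and my real code would use pthreads in C):

import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

class MergeSortTask extends RecursiveAction {
    private final int[] a;
    private final int lo, hi;   // sorts the range a[lo, hi)

    MergeSortTask(int[] a, int lo, int hi) { this.a = a; this.lo = lo; this.hi = hi; }

    @Override protected void compute() {
        if (hi - lo <= 1024) {               // small chunk: sort inline
            Arrays.sort(a, lo, hi);
            return;
        }
        int mid = (lo + hi) >>> 1;
        MergeSortTask lower = new MergeSortTask(a, lo, mid);
        lower.fork();                        // like thread_start on lowerSubChunk
        new MergeSortTask(a, mid, hi).compute();  // current thread takes the higher half
        lower.join();                        // wait for the forked half
        merge(a, lo, mid, hi);
    }

    // stable merge using a buffer for the lower half only
    private static void merge(int[] a, int lo, int mid, int hi) {
        int[] tmp = Arrays.copyOfRange(a, lo, mid);
        for (int i = 0, j = mid, k = lo; i < tmp.length; ) {
            if (j >= hi || tmp[i] <= a[j]) a[k++] = tmp[i++];
            else a[k++] = a[j++];
        }
    }
}

// usage: new ForkJoinPool().invoke(new MergeSortTask(arr, 0, arr.length));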
Are both of these methods used in practice (even though my pseudocode syntax might not be completely correct)? Are there situations where one is more beneficial than the other? Is this the correct way to implement Method 2? (In this case, is THREADS_IN_USE a semaphore?)
Thanks so much for your help!
This code comes from a paper called "Lazy v. Yield". It's about a way to decouple producers and consumers of streams of data. I understand the Haskell portion of the code, but the OCaml/F# eludes me. I don't understand this code, for the following reasons:
What kind of behavior can I expect from a function that takes as argument an exception and returns unit?
How does the consumer project into a specific exception? (what does that mean?)
What would be an example of a consumer?
module SimpleGenerators

type 'a gen = unit -> 'a
type producer = unit gen
type consumer = exn -> unit   (* consumer will project into specific exception *)
type 'a transducer = 'a gen -> 'a gen

let yield_handler : (exn -> unit) ref =
  ref (fun _ -> failwith "yield handler is not set")

let iterate (gen : producer) (consumer : consumer) : unit =
  let oldh = !yield_handler in
  let rec newh x =
    try
      yield_handler := oldh
      consumer x
      yield_handler := newh
    with e -> yield_handler := newh; raise e
  in
  try
    yield_handler := newh
    let r = gen () in
    yield_handler := oldh
    r
  with e -> yield_handler := oldh; raise e
I'm not familiar with the paper, so others will probably be more enlightening. Here are some quick answers/guesses in the meantime.
A function of type exn -> unit is basically an exception handler.
Exceptions can contain data. They're quite similar to polymorphic variants that way--i.e., you can add a new exception whenever you want, and it can act as a data constructor.
It looks like the consumer is going to look for particular exceptions that give it the data it wants; others it will just re-raise. So it's only looking at a projection of the space of possible exceptions (I guess).
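To make that concrete, here is a rough Java analogue (my own illustration, not from the paper): the exception subtype acts as a data constructor, and the consumer "projects" by handling only that subtype and re-raising everything else.

// A data-carrying exception, playing the role of an OCaml exn constructor.
class YieldedInt extends RuntimeException {
    final int value;
    YieldedInt(int value) { this.value = value; }
}

class ConsumerDemo {
    // The consumer projects into the one exception it understands.
    static void consume(RuntimeException e) {
        if (e instanceof YieldedInt) {
            System.out.println("consuming: " + ((YieldedInt) e).value);
        } else {
            throw e;   // not ours: re-raise, like OCaml's raise
        }
    }

    public static void main(String[] args) {
        consume(new YieldedInt(42));   // prints "consuming: 42"
    }
}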
I think the OCaml sample is using a few constructs and design patterns that you would not typically use in F#, so it is quite OCaml-specific. As Jeffrey says, OCaml programs often use exceptions for control flow (while in F# they are only used for exceptional situations).
Also, F# has a really powerful sequence-expression mechanism that can be used quite nicely to separate producers of data from consumers of data. I did not read the paper in detail, so maybe they have something more complicated, but a simple example in F# could look like this:
// Generator: produces an infinite sequence of numbers from 'start'
// and prints the numbers as they are being generated (to show I/O behaviour)
let rec numbers start = seq {
  printfn "generating: %d" start
  yield start
  yield! numbers (start + 1) }
A simple consumer can be implemented using a for loop, but since the stream is infinite, we need to say how many elements to consume, using Seq.take:
// Consumer: takes a sequence of numbers generated by the
// producer and consumes the first 100 elements
let consumer nums =
  for n in nums |> Seq.take 100 do
    printfn "consuming: %d" n
When you run consumer (numbers 0) the code starts printing:
generating: 0
consuming: 0
generating: 1
consuming: 1
generating: 2
consuming: 2
So you can see that the effects of producers and consumers are interleaved. I think this is quite simple & powerful mechanism, but maybe I'm missing the point of the paper and they have something even more interesting. If so, please let me know! Although I think the idiomatic F# solution will probably look quite similar to the above.