Direct Kafka input stream and window(...) function - apache-spark

I am using the direct Kafka input stream in my Spark app. When I use the window(...) function in the chain, the processing pipeline stops: in the Spark UI I can see the streaming batches queuing up while the pipeline reports that it is still processing one of the first batches.
Derivatives of the window(..) function, like reduceByKeyAndWindow(..), work as expected and the pipeline doesn't stop. The same applies when using a different type of stream.
Is this a known limitation of the window(..) function when used with the direct Kafka input stream?
Thanks
Martin
Java pseudo code:
org.apache.spark.streaming.kafka.DirectKafkaInputDStream s;
s.window(Durations.seconds(10)).print(); // the pipeline stops (single-argument window uses the batch interval as the slide, so the windows overlap)
To be more precise: the issue happens only when the windows overlap (i.e. when sliding_interval < window_length). Otherwise the system behaves as expected.
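For reference, a minimal repro sketch using the Kafka 0.8 direct-stream API; the broker address, topic name and the batch/window durations are placeholders:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class WindowOverlapExample {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("window-overlap").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, String> kafkaParams = new HashMap<>();
        kafkaParams.put("metadata.broker.list", "localhost:9092"); // placeholder broker
        Set<String> topics = Collections.singleton("test-topic");  // placeholder topic

        JavaPairInputDStream<String, String> s = KafkaUtils.createDirectStream(
                jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
                kafkaParams, topics);

        // Non-overlapping window (slide == window length): behaves as expected.
        s.window(Durations.seconds(10), Durations.seconds(10)).print();

        // Overlapping window (slide < window length): the configuration that stalls.
        // s.window(Durations.seconds(10), Durations.seconds(5)).print();

        jssc.start();
        jssc.awaitTermination();
    }
}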

Related

Netty multithreading broken in version 4.1? Unable to process short queries after long ones?

I just want to set up a very common server: it must accept connections and perform some business calculations to return the answer. The calculations can be short or long, so I need some kind of ThreadPoolExecutor to execute them.
In Netty 3, which we had been using for a long time, this was achieved very easily by just putting an ExecutionHandler in the pipeline before my BusinessHandler.
But now, trying to set up the same thing in Netty 4, I read in the documentation that ExecutionHandler no longer exists and that I have to specify an EventExecutor when adding my BusinessHandler to the channel pipeline.
DefaultEventExecutorGroup applicativeExecutorGroup = new DefaultEventExecutorGroup(10);
...
// businessHandler now runs on the executor group instead of the I/O event loop
ch.pipeline().addLast(applicativeExecutorGroup, businessHandler);
It works for very basic scenarios (only short queries), but not in the following one. The reason is that DefaultEventExecutorGroup does not select a free worker, but simply the next one, based on round-robin.
A first request (R1) comes, is assigned T1 (Thread 1 of the DefaultEventExecutorGroup), and will take a long time (say 1 minute).
Then a few other queries Ri (i=2 to 10) are received. They are assigned Ti, and are also processed successfully.
But when R11 comes, it is again assigned to T1, due to the round-robin algorithm implemented in DefaultEventExecutorGroup, and the query is queued behind the long R1. As a result, it will not start processing for a minute, which is clearly an unacceptable delay. In concrete scenarios, clients never get the answer, because they time out before we even start processing.
And it continues like this: one query in every ten just fails, because it is queued behind the long one on the only busy thread, while all the other threads of the group sit idle.
Is there another configuration of my pipeline that would work? For example, does an implementation of EventExecutor exist that works like a standard Executor (i.e. selects a FREE worker)?
Or is it just a bug in Netty 4.1? That would seem very strange, as this looks like a very common scenario for any server.
Thanks for your help.
From what you explained above, I think you want to use UnorderedThreadPoolEventExecutor as a replacement for DefaultEventExecutorGroup, or, if ordering is important, NonStickyEventExecutorGroup.
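A rough sketch of how either option could be wired into the channel initializer; BusinessHandler stands for the handler from the question and the thread counts are arbitrary. Note that NonStickyEventExecutorGroup has to wrap executors that do not already guarantee ordering:

import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.util.concurrent.EventExecutorGroup;
import io.netty.util.concurrent.NonStickyEventExecutorGroup;
import io.netty.util.concurrent.UnorderedThreadPoolEventExecutor;

public class BusinessChannelInitializer extends ChannelInitializer<SocketChannel> {

    // Option 1: tasks are picked up by any free thread, with no ordering guarantees.
    private final EventExecutorGroup unordered = new UnorderedThreadPoolEventExecutor(10);

    // Option 2: per-channel ordering is preserved, but a channel is no longer pinned
    // to a single thread, so a long request does not block unrelated channels.
    private final EventExecutorGroup nonSticky =
            new NonStickyEventExecutorGroup(new UnorderedThreadPoolEventExecutor(10));

    @Override
    protected void initChannel(SocketChannel ch) {
        // BusinessHandler is the application handler from the question (not shown here).
        ch.pipeline().addLast(unordered, new BusinessHandler()); // or: nonSticky
    }
}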

Know when logstash has finished processing everything in its pipelines

I have a relatively complicated logstash pipeline setup, with some pipelines feeding into others, splitting events, making http calls to external services, and sometimes feeding an event back into the pipeline it came from. (There is logic to prevent an infinite loop).
I'm trying to write some integration tests that feed test event(s) into a running logstash, wait for logstash to finish processing them completely (including any extra events they spawned), then check that the resulting output is as expected.
This logstash instance shouldn't be receiving any additional input from elsewhere, so I think it would be sufficient to check that it was "idle" (ignoring any events to do with the xpack monitoring).
I think the pipeline stats monitoring API is probably what I want to use - https://www.elastic.co/guide/en/logstash/current/node-stats-api.html#pipeline-stats - but I'm unsure. If the values for "in" and "out" for every pipeline are equal, does that mean I can be certain that there's nothing more "in flight"? Or is it possible for these counters to be out of sync for some other reason, e.g. event splitting or event filtering?
Discussions at https://discuss.elastic.co/t/pipeline-stats-api-in-out-filtered/163742 (and the links from there) appear to suggest that "in" and "out" will always appear in sync; one bug report argues this shouldn't be the case - https://github.com/elastic/logstash/issues/8752 - and a related bug shows a situation where in and out do differ - https://github.com/elastic/logstash/issues/8753
Most of the time, I have found that using the event stats API https://www.elastic.co/guide/en/logstash/current/node-stats-api.html#event-stats and waiting for "out" to equal "in" achieves what I want.
However, I have on occasion witnessed "out" being higher than "in" - I haven't been able to track this down, but I think it happens when there's an error elsewhere.
So this check is probably good enough for integration tests, when you're going to tear logstash down again afterwards - but I wouldn't want to rely on it in production.
Sadly I've not been able to find any official definitions of what these numbers mean.
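A small sketch of such a wait-until-idle check against the event stats API; it assumes the default monitoring port 9600 and uses naive regex extraction in place of a proper JSON library:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WaitForLogstashIdle {
    // Crude regexes instead of a JSON library; fine for the flat "events" block.
    private static final Pattern IN = Pattern.compile("\"in\"\\s*:\\s*(\\d+)");
    private static final Pattern OUT = Pattern.compile("\"out\"\\s*:\\s*(\\d+)");

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://localhost:9600/_node/stats/events")).GET().build();

        while (true) {
            String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
            long in = extract(IN, body);
            long out = extract(OUT, body);
            // "in > 0" avoids declaring an idle Logstash "done" before the test
            // events have even been ingested.
            if (in > 0 && in == out) {
                break;
            }
            Thread.sleep(1000);
        }
    }

    private static long extract(Pattern p, String body) {
        Matcher m = p.matcher(body);
        return m.find() ? Long.parseLong(m.group(1)) : -1;
    }
}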

Spark Streaming - Poison Pill?

I'm trying to decide how best to design a data pipeline that will involve Spark Streaming.
The essential process I imagine is:
Set up a streaming job that watches a fileStream (this is the consumer)
Do a bunch of computation elsewhere, which populates that file (this is the producer)
The streaming job consumes the data as it comes in, performing various actions
When the producer is done, wait for all the streaming computations to finish, and tear down the streaming job.
It's step (4) that has me confused. I'm not sure how to shut it down gracefully. The recommendations I've found generally amount to hitting "Ctrl-C" on the driver, combined with the spark.streaming.stopGracefullyOnShutdown config setting.
I don't like that approach since it requires the producing code to somehow access the consumer's driver and send it a signal. These two systems could be completely unrelated; this is not necessarily easy to do.
Plus, there is already a communication channel — the fileStream — can't I use that?
In a traditional threaded producer/consumer situation, one common technique is to use a "poison pill". The producer sends a special piece of data indicating "no more data", then you wait for your consumers to exit.
Is there a reason this can't be done in Spark?
Surely there is a way for the stream processing code, upon seeing some special data, to send a message back to its driver?
The Spark docs have an example of listening to a socket, with socketTextStream, and it is somehow able to terminate when the producer is done. I haven't dug into that code yet, but this seems like it should be possible.
Any advice?
Is this fundamentally wrong-headed?
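For illustration, a minimal sketch of the poison-pill idea over a textFileStream; the sentinel string, input directory and intervals are made up, and the sentinel only flags the driver's main thread to perform a graceful stop:

import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class PoisonPillStreaming {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("poison-pill").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Hypothetical directory the producer writes into, ending with the sentinel record.
        JavaDStream<String> lines = jssc.textFileStream("/data/incoming");

        AtomicBoolean stopRequested = new AtomicBoolean(false);

        lines.foreachRDD(rdd -> {
            // ... normal per-batch processing goes here ...
            if (rdd.filter(line -> line.equals("__POISON_PILL__")).count() > 0) {
                stopRequested.set(true); // signal the driver's main thread
            }
        });

        jssc.start();
        // Poll instead of blocking forever, so the driver can react to the sentinel.
        while (!jssc.awaitTerminationOrTimeout(10_000)) {
            if (stopRequested.get()) {
                jssc.stop(true, true); // graceful stop: finish queued batches first
            }
        }
    }
}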

Tensorflow tf.estimator.Estimator with QueueRunner

I'm rewriting my code to use tf.estimator.Estimator as an encapsulating object for my models.
The problem is: I don't see how a typical input pipeline fits into the picture.
My input pipeline uses queues that are coordinated by tf.train.Coordinator.
To satisfy the tf.estimator.Estimator requirements, I create the entire "input graph" in the input_fn that is passed to the estimator when calling:
Estimator.train(...)
It looks like this:
def input_fn():
    # ... build the input graph: queue, enqueue ops, dequeued features/labels ...
    qr = tf.train.QueueRunner(queue, [enqueue_op])
    tf.train.add_queue_runner(qr)
    return features, labels
The problem is: in such a scenario, how can I start and stop the queue runners, respectively at the start and end of Estimator.train(...)?
Starting
I figured out that to start the queues I can pass an init_fn that does it via the Scaffold object passed to the Estimator.
However, how to join the threads and shut them down gracefully - that I do not know.
Is there a reference architecture for a proper threaded input pipeline when using tf.estimator.Estimator?
Is Estimator class even ready to work with queues?
Estimator uses tf.train.MonitoredTrainingSession, which handles starting and joining the threads. You can check a couple of example input_fns, such as tf.estimator.inputs.* and tf.contrib.learn.io.read*.

Splitting a large task to Azure functions and collecting the results (implementing a barrier over a service-bus)

I have tasks that I need to perform in Azure. Each task is broken into several parts that need to run in parallel; the number of parts is not known in advance.
I would like to implement this using Azure functions and service bus. I was thinking about the following architecture:
I receive the task on a service bus. Func 1 determines how many sub-parts should be created, Func 2 does the work, and Func 3 collects the results and passes them on using a service bus.
I could not find an efficient mechanism for collecting the (variable) number of sub-results and knowing when everything has completed, so that once all sub-parts are ready the combined results can be passed to the output service bus.
Is there such a mechanism in Azure for collecting results from parallel sub-parts and only after everything is ready sending the data to the next stage? (this is like the barrier synchronization mechanism).
My approach:
Function 1 gets the work from its message queue. It creates a jobId (GUID) that will be used to correlate all of the pieces. It breaks up the work into sub-parts and records the jobId and sub-parts in a database.
Each sub-part becomes a message that is added into the message queue that Func 2 is listening on.
Once Func 2 receives and processes the message, it places a message in the queue for Func 3.
Func 3 records in the db that this sub-part of the job has completed and checks whether all of the sub-parts are now complete. If they are not, it does nothing else; if they are, it knows that everything has completed and can proceed.
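A minimal sketch of that "record and check" step in Func 3, assuming a hypothetical job_progress table with columns (job_id, remaining) that Func 1 seeds with the number of sub-parts; the row lock taken by the UPDATE means only the final decrement will observe zero:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class JobBarrier {

    // Returns true only for the invocation that completed the last outstanding sub-part.
    public static boolean subPartDone(Connection conn, String jobId) throws SQLException {
        conn.setAutoCommit(false);
        try {
            // The UPDATE takes a row lock, so concurrent Func 3 invocations are
            // serialized on this job's row until the transaction commits.
            try (PreparedStatement upd = conn.prepareStatement(
                    "UPDATE job_progress SET remaining = remaining - 1 WHERE job_id = ?")) {
                upd.setString(1, jobId);
                upd.executeUpdate();
            }
            int remaining;
            try (PreparedStatement sel = conn.prepareStatement(
                    "SELECT remaining FROM job_progress WHERE job_id = ?")) {
                sel.setString(1, jobId);
                try (ResultSet rs = sel.executeQuery()) {
                    rs.next();
                    remaining = rs.getInt(1);
                }
            }
            conn.commit();
            return remaining == 0; // 0 => barrier reached: aggregate and publish
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}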
Check out Logic Apps to govern the whole process. You can trigger from Service Bus, invoke the individual functions in a loop, etc., and incorporate the results back into the document.
