Spark Streaming - Poison Pill?


I'm trying to decide how best to design a data pipeline that will involve Spark Streaming.
The essential process I imagine is:
1. Set up a streaming job that watches a fileStream (this is the consumer).
2. Do a bunch of computation elsewhere, which populates that file (this is the producer).
3. The streaming job consumes the data as it comes in, performing various actions.
4. When the producer is done, wait for all the streaming computations to finish, and tear down the streaming job.
It's step (4) that has me confused. I'm not sure how to shut it down gracefully. The recommendations I've found generally suggest "Ctrl-C" on the driver, along with the spark.streaming.stopGracefullyOnShutdown config setting.
I don't like that approach since it requires the producing code to somehow access the consumer's driver and send it a signal. These two systems could be completely unrelated; this is not necessarily easy to do.
Plus, there is already a communication channel — the fileStream — can't I use that?
In a traditional threaded producer/consumer situation, one common technique is to use a "poison pill". The producer sends a special piece of data indicating "no more data", then you wait for your consumers to exit.
Is there a reason this can't be done in Spark?
Surely there is a way for the stream processing code, upon seeing some special data, to send a message back to its driver?
The Spark docs have an example of listening to a socket, with socketTextStream, and it somehow is able to terminate when the producer is done. I haven't dived into that code yet, but this seems like it should be possible.
Any advice?
Is this fundamentally wrong-headed?
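
For reference, here is a minimal sketch of the threaded poison-pill pattern described above. It is plain Python, not Spark code, and every name in it (POISON_PILL, run_pipeline, the doubling "processing" step) is hypothetical; it only shows the shutdown behaviour the question is asking about.

```python
import queue
import threading

# Hypothetical sentinel object meaning "no more data".
POISON_PILL = object()

def producer(q, items):
    """Push every item, then the sentinel that tells the consumer to stop."""
    for item in items:
        q.put(item)
    q.put(POISON_PILL)

def consumer(q, results):
    """Process items until the sentinel arrives, then exit."""
    while True:
        item = q.get()
        if item is POISON_PILL:
            break
        results.append(item * 2)  # stand-in for the real processing

def run_pipeline(items):
    q = queue.Queue()
    results = []
    worker = threading.Thread(target=consumer, args=(q, results))
    worker.start()
    producer(q, items)
    worker.join()  # returns only after the consumer has seen the sentinel
    return results
```

The producer never touches the consumer thread directly; the queue is the only communication channel, which is the property wanted from the fileStream.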

Related

Is my design for sending data to clients at various intervals correct?

The code should be written in C++. I'm mentioning this just in case someone suggests a solution that won't work efficiently when implemented in C++.
Objective:
A Producer running on thread t1 passes images to a Consumer running on thread t2. The Consumer has a list of clients to which it should send the images at various intervals. E.g. client1 requires an image every 1 sec, client2 requires an image every 5 sec, etc.
Suggested implementation:
There is one main queue, imagesQ, in the Consumer, to which the Producer enqueues images. In addition to the main queue, the Consumer manages a vector of queues, clientImageQs, with one queue per client. The Consumer creates a sub-consumer, which runs on its own thread, for each client. Each sub-consumer dequeues images from its own queue in clientImageQs and sends them to its client at that client's interval.
Every time a new image arrives in imagesQ, the Consumer duplicates it and enqueues a copy to each queue in clientImageQs. Thus, each sub-consumer can send images to its client at its own frequency.
Potential problem and solution:
If the Producer enqueues images at a much higher rate than one of the sub-consumers dequeues them, that queue will grow without bound. However, the Consumer can check the size of each queue in clientImageQs before enqueuing and, if needed, dequeue a few old images before enqueuing new ones.
Question
Is this a good design, or is there a better one?

You describe the problem within a set of already determined solution limitations. Your description is complex, confusing, and I dare say, confused.
Why have a consumer that only distributes images out of a shared buffer? Why not allow each "client", as you call it, to read from the buffer as it needs to?
Why not implement the shared buffer as a single-image buffer? The producer writes at its own rate. The clients perform non-destructive reads of the buffer at their own rates. Each client is guaranteed to get the most recent image whenever it reads the buffer. The producer simply overwrites the buffer with each write.
A multi-element queue offers no benefit in this application. In fact, as you have described, it greatly complicates the solution.
See http://sworthodoxy.blogspot.com/2015/05/shared-resource-design-patterns.html and look for the heading "unconditional buffer".
The examples in the posting listed above are all implemented using Ada, but the concepts related to concurrent design patterns are applicable to all programming languages supporting concurrency.
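
A minimal sketch of such a single-slot, overwrite-on-write buffer, written in Python for brevity rather than C++ or Ada. The class name LatestValueBuffer and the version counter are assumptions added for illustration; the counter only lets a client tell whether it has already seen the current image.

```python
import threading

class LatestValueBuffer:
    """Single-slot buffer: each write overwrites, reads never consume."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = None
        self._version = 0  # incremented on every write

    def write(self, value):
        # The producer never blocks on slow readers; it just overwrites.
        with self._lock:
            self._value = value
            self._version += 1

    def read(self):
        # Non-destructive read: every client sees the most recent value.
        with self._lock:
            return self._version, self._value
```

Each client thread would call read() on its own timer (every 1 sec, every 5 sec, and so on), so no per-client queue is needed and nothing can grow without bound.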

scala - best way to do parallel constant polling and processing

I am trying to figure out the best way to do constant polling in an async, non-blocking way. The entire goal of the application is to start a few threads and have each thread constantly poll an external service (Kafka) for data; each thread can then process that data or hand it over to some other thread. I don't see a way to do this with just a Scala Future, as Await.result requires a timeout value. I could set it to a year, e.g. Await.result(future, 365 days), but that still doesn't seem like a good solution. Any pointers?

There are a couple of async, non-blocking Kafka libraries. You can write a consumer with either of these to pull data from Kafka topics.
https://github.com/cakesolutions/scala-kafka-client
https://github.com/akka/reactive-kafka

Perforce streams: Task streams usage

I have a requirement to create a child stream that picks up only specific folders from the mainline (parent) stream. To achieve this, I restricted the view using share/isolate/import paths when creating the child stream, and I was able to create child streams that contain only the code I am interested in.
However, I have gone through some tutorials on streams and found something about lightweight streams (task streams), which are said to create streams partially from the parent. In my scenario, do I really need to use lightweight streams? What are the main advantages and limitations of lightweight streams compared with the normal approach I described above?

The purpose of task streams is not to create streams "partially" -- you have already done this with your share/import paths. Don't fix what isn't broken!
Task streams are built to be short-lived and easily archive-able once the associated task is complete (via the "unload" command). The limitations of task streams are described in the documentation here:
https://www.perforce.com/perforce/doc.current/manuals/p4v/Content/P4V/streams.task.html
namely that they can't be reparented and they may not have children. If you use task streams as short-lived single-task streams (as the name "task stream" implies, a task stream is for a single task), these limitations won't generally be a problem. If you try to use a task stream as a development mainline, you're going to have problems.
If your development process involves creating a new branch for a short-term task (e.g. an individual hotfix parented to a particular branch), and you have a lot of these tasks, task streams may be useful due to their easy cleanup and low overhead (when a task stream is unloaded it's removed from the db, which means you don't accumulate db cruft over time as you create and abandon them).
If this does not sound like your development process, forget you ever heard about task streams. Do not try to imagine ways that you can use task streams for things that aren't short-term tasks. Hammers are suitable for nails. Do not use them to try to drive screws, especially when you have a perfectly good screwdriver right there and are already using it successfully.
(Can you tell I've seen more than a few instances of people trying to use task streams for absolutely everything because they "sound cool"? Resist the urge!)

File writing from multiple threads.

I have an application A that calls another application B, which does some calculation and writes to a file, File.txt.
A invokes multiple instances of B through multiple threads, and each instance tries to write to the same file, File.txt.
Here comes the actual problem:
Since multiple threads try to access the same file, the file access throws an exception, which is to be expected.
I tried using a concurrent queue in a singleton class: each instance of B adds to the queue, and another thread in this class dequeues the items and writes them to File.txt. The queue is read synchronously and the write operations succeed. This works fine.
With many threads and many items in the queue the file writing still works, but if for some reason my queue crashes or stops abruptly, all the information that was supposed to be written to the file is lost.
If I make the file writing synchronous from B without using the queue, it will be slow because it needs to check for file locking, but there is less chance of data being lost because B writes to the file immediately.
What would be the best approach or design to handle this scenario? I don't need a response after the file writing is completed, and I can't make B wait for the file writing to complete.
Would async/await file writing be of any use here?

I think what you've done is the best that can be done. You may have to tune your producer/consumer queue solution if there are still problems, but it seems to me that you've done rather well with this approach.
If an in-memory queue isn't the answer, perhaps externalizing that to a message queue and a pool of listeners would be an improvement.
Relational databases and transaction managers are born to solve this problem. Why continue with a file based solution? Is it possible to explore an alternative?

is there a better approach or design to handle this scenario?
You can make each producer thread write to its own rolling file instead of queuing the operation. Every X seconds the producers move to new files and an aggregation thread wakes up, reads the previous files (of each producer) and writes the results to the final File.txt output file. No read/write locks are required here.
This ensures safe recovery, since the rolling files exist until you process and delete them.
This also means that you always write to disk, which is much slower than queuing tasks in memory and writing to disk in bulk. But that's the price you pay for consistency.
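
A rough sketch of the rolling-file idea, in Python for illustration only. The file naming scheme (producer-<id>.<generation>.log), the "generation" counter standing in for the X-second window, and all function names are assumptions, not part of any existing library.

```python
import os

def rolling_path(directory, producer_id, generation):
    """Hypothetical naming scheme: one file per producer per time window."""
    return os.path.join(directory, f"producer-{producer_id}.{generation}.log")

def append_line(directory, producer_id, generation, line):
    # Each producer appends only to its own file, so no locking is needed.
    path = rolling_path(directory, producer_id, generation)
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")

def aggregate(directory, generation, output_path):
    """Merge all producer files of a closed generation into the output file."""
    suffix = f".{generation}.log"
    names = sorted(
        n for n in os.listdir(directory)
        if n.startswith("producer-") and n.endswith(suffix)
    )
    with open(output_path, "a", encoding="utf-8") as out:
        for name in names:
            with open(os.path.join(directory, name), encoding="utf-8") as src:
                out.write(src.read())
    # Delete the rolling files only after the output file has been closed.
    for name in names:
        os.remove(os.path.join(directory, name))
    return len(names)
```

The aggregator must only be pointed at a generation that no producer is still writing to. A crash between the merge and the delete would replay that generation on restart, so a real implementation would also need some marker to make the merge idempotent.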
Would async await file writing could be of any use here ?
Using asynchronous IO has nothing to do with this. The problems you mentioned were 1) shared resources (the output file) and 2) lack of consistency (when the queue crashes), neither of which async programming addresses.
Why the async is in picture is because I don't want to delay the existing work by B because of this file writing operation
async would indeed help you with that. Whatever pattern you choose to implement (to solve the original problem), it can always be made async by merely using the asynchronous IO APIs.

"Resequencing" messages after processing them out-of-order

I'm working on what's basically a highly-available distributed message-passing system. The system receives messages from someplace over HTTP or TCP, performs various transformations on them, and then sends them to one or more destinations (also using TCP/HTTP).
The system has a requirement that all messages sent to a given destination are in-order, because some messages build on the content of previous ones. This limits us to processing the messages sequentially, which takes about 750ms per message. So if someone sends us, for example, one message every 250ms, we're forced to queue the messages behind each other. This eventually introduces intolerable delay in message processing under high load, as each message may have to wait for hundreds of other messages to be processed before it gets its turn.
In order to solve this problem, I want to be able to parallelize our message processing without breaking the requirement that we send them in-order.
We can easily scale our processing horizontally. The missing piece is a way to ensure that, even if messages are processed out-of-order, they are "resequenced" and sent to the destinations in the order in which they were received. I'm trying to find the best way to achieve that.
Apache Camel has a thing called a Resequencer that does this, and it includes a nice diagram (which I don't have enough rep to embed directly). This is exactly what I want: something that takes out-of-order messages and puts them in-order.
But, I don't want it to be written in Java, and I need the solution to be highly available (i.e. resistant to typical system failures like crashes or system restarts) which I don't think Apache Camel offers.
Our application is written in Node.js, with Redis and Postgresql for data persistence. We use the Kue library for our message queues. Although Kue offers priority queueing, the featureset is too limited for the use-case described above, so I think we need an alternative technology to work in tandem with Kue to resequence our messages.
I was trying to research this topic online, and I can't find as much information as I expected. It seems like the type of distributed architecture pattern that would have articles and implementations galore, but I don't see that many. Searching for things like "message resequencing", "out of order processing", "parallelizing message processing", etc. turn up solutions that mostly just relax the "in-order" requirements based on partitions or topics or whatnot. Alternatively, they talk about parallelization on a single machine. I need a solution that:
Can handle processing on multiple messages simultaneously in any order.
Will always send messages in the order in which they arrived in the system, no matter what order they were processed in.
Is usable from Node.js
Can operate in a HA environment (i.e. multiple instances of it running on the same message queue at once w/o inconsistencies.)
Our current plan, which makes sense to me but which I cannot find described anywhere online, is to use Redis to maintain sets of in-progress and ready-to-send messages, sorted by their arrival time. Roughly, it works like this:
When a message is received, that message is put on the in-progress set.
When message processing is finished, that message is put on the ready-to-send set.
Whenever there's the same message at the front of both the in-progress and ready-to-send sets, that message can be sent and it will be in order.
I would write a small Node library that implements this behavior with a priority-queue-esque API using atomic Redis transactions. But this is just something I came up with myself, so I am wondering: Are there other technologies (ideally using the Node/Redis stack we're already on) that are out there for solving the problem of resequencing out-of-order messages? Or is there some other term for this problem that I can use as a keyword for research? Thanks for your help!
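
The three rules above can be sketched in a few lines. This is a single-process, in-memory illustration in Python, not the Node/Redis implementation: a heap and a dict stand in for the two Redis sets, arrival order is represented by an integer sequence number, and the class and method names are hypothetical.

```python
import heapq

class Resequencer:
    """Releases results in arrival order, whatever order they finish in."""

    def __init__(self):
        self._in_progress = []  # min-heap of arrival sequence numbers
        self._ready = {}        # sequence number -> processed result

    def received(self, seq):
        # Rule 1: every arriving message joins the in-progress set.
        heapq.heappush(self._in_progress, seq)

    def finished(self, seq, result):
        # Rule 2: a processed message joins the ready-to-send set.
        self._ready[seq] = result
        return self._drain()

    def _drain(self):
        # Rule 3: send while the oldest in-progress message is also ready.
        sendable = []
        while self._in_progress and self._in_progress[0] in self._ready:
            seq = heapq.heappop(self._in_progress)
            sendable.append(self._ready.pop(seq))
        return sendable
```

For example, if messages 1, 2 and 3 arrive and 2 finishes first, finished(2, ...) returns nothing; once 1 finishes, both 1 and 2 are released in that order. A distributed version would additionally need the check-and-remove step to be atomic across instances.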

This is a common problem, so there are surely many solutions available. This is also quite a simple problem, and a good learning opportunity in the field of distributed systems. I would suggest writing your own.
You're going to have a few problems building this, namely
1: Guaranteed order of messages
2: Exactly-once delivery
You've found number 1, and you're solving it by resequencing the messages in Redis, which is an OK solution. The other one, however, is not solved.
It looks like your architecture is not geared towards fault tolerance, so currently, if a server crashes, you restart it and continue with your life. This works fine when processing all requests sequentially, because then you know exactly when you crashed, based on what the last successfully completed request was.
What you need is either a strategy for finding out what requests you actually completed, and which ones failed, or a well-written apology letter to send to your customers when something crashes.
If Redis is not sharded, it is strongly consistent. It will fail and possibly lose all data if that single node crashes, but you will not have any problems with out-of-order data, or data popping in and out of existence. A single Redis node can thus hold the guarantee that if a message is inserted into the to-process-set, and then into the done-set, no node will see the message in the done-set without it also being in the to-process-set.
How I would do it
Using Redis seems like too much fuss, assuming that the messages are not huge, that losing them is OK if a process crashes, and that running them more than once, or even running multiple copies of a single request at the same time, is not a problem.
I would recommend setting up a supervisor server that takes incoming requests, dispatches each to a randomly chosen slave, stores the responses and puts them back in order again before sending them on. You said you expected the processing to take 750ms. If a slave hasn't responded within, say, 2 seconds, dispatch the request again to another node after a random delay of 0-1 seconds. The first response to arrive is the one we're going to use. Beware of duplicate responses.
If the retry request also fails, double the maximum wait time. After 5 failures or so, each waiting up to twice (or any multiple greater than one) as long as the previous one, we probably have a permanent error, so we should probably ask for human intervention. This algorithm is called exponential backoff, and prevents a sudden spike in requests from taking down the entire cluster. Not using a random interval, and retrying after n seconds would probably cause a DOS-attack every n seconds until the cluster dies, if it ever gets a big enough load spike.
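
As an illustration of the retry schedule just described, here is a small Python sketch of exponential backoff with a random wait. The function name, the default values (1 second initial maximum, doubling, 5 attempts) and the injectable rng parameter are assumptions chosen to match the numbers in the text.

```python
import random

def backoff_delays(base=1.0, factor=2.0, attempts=5, rng=random.random):
    """Return one random wait per retry; the maximum wait grows by `factor`."""
    delays = []
    max_wait = base
    for _ in range(attempts):
        # Pick a random point in [0, max_wait) so retries do not synchronize.
        delays.append(rng() * max_wait)
        max_wait *= factor
    return delays
```

With the defaults, the five retries wait up to 1, 2, 4, 8 and 16 seconds respectively; after the last one fails, the request would be handed over for human intervention.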
There are many ways this could fail, so make sure this system is not the only place data is stored. However, this will probably work 99+% of the time, it's probably at least as good as your current system, and you can implement it in a few hundred lines of code. Just make sure your supervisor is using asynchronous requests so that you can handle retries and timeouts. JavaScript is by nature single-threaded, so this is slightly trickier than normal, but I'm confident you can do it.
