Recovery after driver failure by exception with spark-streaming - apache-spark

We are currently working on a system that uses Kafka, Spark Streaming, and Cassandra as the DB. We use checkpointing as described here [http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing]. Inside the function used to create the StreamingContext, we call createDirectStream to create our DStream, and from that point we execute several transformations and actions that end with calls to saveToCassandra on different RDDs.
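For reference, here is a minimal sketch of that setup (placeholder topic, keyspace, and table names; not our actual job), assuming the kafka-0-10 direct stream API and the DataStax Spark Cassandra connector:

```scala
// A minimal sketch: metadata checkpointing + direct Kafka stream + saveToCassandra,
// with the context recreated from the checkpoint directory via getOrCreate.
import com.datastax.spark.connector._
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedKafkaToCassandra {
  val checkpointDir = "hdfs:///tmp/streaming-checkpoint"   // placeholder path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("kafka-to-cassandra")
    val ssc  = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)                          // enables checkpointing

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "recovery-test")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

    stream.foreachRDD { rdd =>
      // transformations go here; the Cassandra write is the action that can fail
      rdd.map(r => (r.key, r.value))
        .saveToCassandra("ks", "events", SomeColumns("id", "payload"))
    }
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart, getOrCreate rebuilds the context (and its pending batches) from the checkpoint
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```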
We are running different tests to establish how the application should recover when a failure occurs. Some key points about our scenario are:
We are testing with a fixed number of records in Kafka (between 10 and 20 million); that is, we consume from Kafka once and the application ingests all of the records in that single run.
We are executing the application with --deploy-mode 'client' on one of the workers, which means that we stop and start the driver manually.
We are not sure how to handle exceptions after the DStreams have been created. For example, if all Cassandra nodes are down while writing, we get an exception that aborts the job; but after re-submitting the application, that job is not re-scheduled, and the application keeps consuming from Kafka, producing only repeated 'isEmpty' calls.
We ran a couple of tests using 'cache' on the repartitioned RDD (which didn't work for any failure other than simply stopping and starting the driver), and tried changing the parameters "query.retry.count", "query.retry.delay" and "spark.task.maxFailures", without success, i.e., the job is still aborted after x failed attempts (see the configuration sketch below for where these are typically set).
At this point we are confused about how we should use the checkpoint to re-schedule jobs after a failure.
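For completeness, a sketch of where parameters like the ones above are usually set (the spark.cassandra prefix for the connector properties is an assumption based on the names given; values are purely illustrative):

```scala
// Sketch only: illustrative values, not settings that solved the problem.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kafka-to-cassandra")
  // spark-cassandra-connector retry settings (property prefix assumed)
  .set("spark.cassandra.query.retry.count", "10")
  .set("spark.cassandra.query.retry.delay", "5000")   // format/availability depends on connector version
  // how many times a single task may fail before the stage (and job) is aborted
  .set("spark.task.maxFailures", "8")
```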

Related

Spark BroadcastHashJoin operator and Dynamic Allocation enabled

In my company, we have the following scenario:
we have dynamic allocation enabled by default for every data pipeline written, so we can save some costs and enable resource sharing among the different executions;
also, most of the queries we run perform joins, and Spark has some interesting optimizations here, such as changing the join strategy when it identifies that one side of the join is small enough to be broadcast. This is what is called a BroadcastHashJoin, and we have lots of queries with this operator in their query plans;
last but not least, our pipelines run on EMR clusters in client mode.
We are having a problem that occurs when the YARN queue (RM on EMR) to which a job was submitted is full and there are not enough resources to allocate new executors for the application. Since the driver process runs on the machine that submitted the application (client mode), the broadcast job starts and, after 300s, fails with the broadcast timeout error.
Running the same job on a different schedule (at a time when queue usage is not as high), it completes successfully.
My questions are all about how these three things work together (dynamic allocation, BHJ, client mode). Without dynamic allocation, it's easy to see that the broadcast operation will reach every executor that was requested initially through the spark-submit command. But if dynamic allocation is enabled, how will the broadcast operation reach the executors that are allocated later? Will the driver have to send it again for every new executor? Will they be subject to the same 300-second timeout? Is there a way to prevent the driver (client mode) from starting the broadcast operation until it has enough executors?
Source: BroadcastExchangeExec source code here
PS: we have already tried setting the spark.dynamicAllocation.minExecutors property to 1, but without success. The job still started with only the driver allocated and errored out after 300s (the settings involved are sketched below).
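For reference, a sketch of the settings that interact in this scenario (values are illustrative): the dynamic-allocation minimum we already tried, the 300-second broadcast timeout, and the threshold that makes Spark pick a BroadcastHashJoin in the first place.

```scala
// Sketch only: illustrative values for the configuration knobs involved here.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bhj-with-dynamic-allocation")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "1")    // what we already tried, without success
  .config("spark.sql.broadcastTimeout", "600")            // default is 300 seconds
  .config("spark.sql.autoBroadcastJoinThreshold", "-1")   // -1 disables automatic BHJ selection
  .getOrCreate()
```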

What is the difference between FAILED and ERROR in spark application states

I am trying to create a state diagram of a submitted Spark application, and I am kind of lost on when an application is considered FAILED.
States are from here: https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/deploy/master/DriverState.scala
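For reference, a condensed excerpt of the states defined in the linked DriverState.scala (the comments paraphrase the ones in the file):

```scala
// Condensed from core/src/main/scala/org/apache/spark/deploy/master/DriverState.scala
private[deploy] object DriverState extends Enumeration {
  type DriverState = Value

  // SUBMITTED: submitted but not yet scheduled on a worker
  // RUNNING: has been allocated to a worker to run
  // FINISHED: previously ran and exited cleanly
  // RELAUNCHING: exited non-zero or due to worker failure, but has not yet started running again
  // UNKNOWN: the status of the driver is temporarily not known due to master failure recovery
  // KILLED: a user manually killed this driver
  // FAILED: the driver exited non-zero and was not supervised
  // ERROR: unable to run or restart due to an unrecoverable error (e.g. missing jar file)
  val SUBMITTED, RUNNING, FINISHED, RELAUNCHING, UNKNOWN, KILLED, FAILED, ERROR = Value
}
```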
This stage is very important, since when it comes to Big Data, Spark is awesome, but let's face it, we haven't solved the problem yet!
When a task/job fails, Spark restarts it (recall that the main abstraction Spark provides is the RDD, a Resilient Distributed Dataset; its resilience is not exactly what we are after here, but it gives the intuition).
I use Spark 1.6.2, and my cluster restarts the job/task 3 times; after that, it is marked as FAILED.
For example, one of my recent jobs had to restart a whole stage:
In the cluster/app UI, one can see the attempt IDs; here the application is in its 3rd and final attempt:
If that attempt is marked as FAILED (for whatever reason, e.g. out of memory, bad DNS, GC allocation failure, disk failure, a node not responding to 4 heartbeats (it is probably down), etc.), then Spark relaunches the job.

Execution order of windows in spark streaming

Background
I've got a Spark Streaming application that reads data from Kinesis -> does windowing on it -> saves the data to an external system (via foreachRDD).
Recently I've observed that my windows are consumed by foreachRDD one by one. This means that if I have a sudden burst of data in my app (so that foreachRDD for a window takes a long time), the windows stack up in a queue before being processed (while most of the machines in my cluster sit idle).
Question
Is it part of Spark Streaming's semantics that windows are processed one by one? If so, is there any way to do the "windowing" operation in parallel in Spark, so that windows are consumed by foreachRDD at the same time?
Find out how many shards your Kinesis stream has and create that many receivers by invoking createStream, defined in the KinesisUtils Scala class (a sketch follows the quoted explanation below).
This blog post from Amazon explains it well:
Every input DStream is associated with a receiver, and in this case also with a KCL worker. The best way to understand this is to refer to the method createStream defined in the KinesisUtils Scala class. Every call to KinesisUtils.createStream instantiates a Spark Streaming receiver and a KCL worker process on a Spark executor. The first time a KCL worker is created, it connects to the Amazon Kinesis stream and instantiates a record processor for every shard that it manages. For every subsequent call, a new KCL worker is created and the record processors are re-balanced among all available KCL workers. The KCL workers pull data from the shards, and routes them to the receiver, which in turns stores them into the associated DStream.
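Here is a sketch of that approach, assuming the spark-streaming-kinesis-asl createStream API and placeholder application, stream, endpoint, and region names: one receiver per shard, then a union so the rest of the pipeline sees a single DStream.

```scala
// Sketch only: stream name, endpoint, region and shard count are placeholders.
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kinesis.KinesisUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

val batchInterval = Seconds(10)
val ssc = new StreamingContext(new SparkConf().setAppName("kinesis-fan-in"), batchInterval)

val numShards = 4   // whatever your stream's shard count is

// One call to createStream per shard => one receiver (and KCL worker) per shard
val kinesisStreams = (1 to numShards).map { _ =>
  KinesisUtils.createStream(
    ssc,
    "my-kcl-app",                                  // KCL application name
    "my-kinesis-stream",                           // Kinesis stream name
    "https://kinesis.us-east-1.amazonaws.com",     // endpoint URL
    "us-east-1",                                   // region
    InitialPositionInStream.LATEST,
    batchInterval,                                 // KCL checkpoint interval
    StorageLevel.MEMORY_AND_DISK_2)
}

// Union the per-receiver DStreams so windowing/foreachRDD operates on one stream
val unified = ssc.union(kinesisStreams)
```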

Where is the Apache Spark reductionByWindow function executed?

I am trying to learn Apache Spark and I can't understand from the documentation how window operations work.
I have two worker nodes and I use the Kafka Spark utilities to create a DStream from a topic.
On this DStream I apply a map function and a reductionByWindow.
I can't understand whether reductionByWindow is executed on each worker or on the driver.
I have searched on Google without any result.
Can someone explain this to me?
Both receiving and processing of data happen on the worker nodes. The driver creates receivers (on worker nodes), which are responsible for data collection, and periodically starts jobs to process the collected data. Everything else is pretty much standard RDDs and normal Spark jobs.
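As an illustration, here is a sketch (broker and topic names are placeholders, assuming the kafka-0-10 direct stream API and the DStream method reduceByWindow) showing that both the map and the window reduction run as tasks on the executors, while the driver only schedules a job per batch:

```scala
// Sketch only: the point is where each step executes, not the specific computation.
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("window-example")
val ssc  = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "window-example")

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

// The map runs as tasks on the executors, not on the driver
val lengths = stream.map(record => record.value.length.toLong)

// The pairwise reduce over the 60s window is also executed as tasks on the executors;
// the driver only creates and schedules a job every 5s slide interval.
val totalPerWindow = lengths.reduceByWindow(_ + _, Seconds(60), Seconds(5))

totalPerWindow.print()
ssc.start()
ssc.awaitTermination()
```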

Data receiving in Spark Streaming

Recently I've been doing performance tests on Spark Streaming. But some problems puzzled me a lot.
In Spark Streaming, receivers are scheduled to run in executors on worker nodes.
How many receivers are there in a cluster? Can I control the number of receivers?
If not all workers run receivers, will the worker nodes without receivers get any data at all? In that case, how can I guarantee task scheduling based on data locality? By copying data from the nodes that run the receivers?
There is only one receiver per DStream, but you can create more than one DStream and union them together to act as one. This is why it is suggested to run Spark Streaming on a cluster with at least N(receivers) + 1 cores. Once the data is past the receiving portion, it is mostly a simple Spark application and follows the same rules as a batch job. (This is why streaming is referred to as micro-batching.)
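A sketch of that pattern with two receiver-based DStreams unioned into one (hostnames and ports are placeholders); with two receivers, the cluster needs at least three cores per the N(receivers) + 1 rule above.

```scala
// Sketch only: two socket receivers unioned so downstream operations see one DStream.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("multi-receiver-union")
val ssc  = new StreamingContext(conf, Seconds(5))

val receiverStreams = Seq(
  ssc.socketTextStream("source-host", 9001),   // receiver #1 (occupies one core)
  ssc.socketTextStream("source-host", 9002))   // receiver #2 (occupies another core)

val unified = ssc.union(receiverStreams)       // acts as a single DStream from here on
unified.count().print()

ssc.start()
ssc.awaitTermination()
```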
