Background
I've got a Spark Streaming application that reads data from Kinesis -> does windowing on it -> saves the data to an external system (via foreachRDD).
Recently I've observed that my windows are consumed by foreachRDD one by one. This means that if I have a sudden burst of data in my app (so that foreachRDD for a window takes a long time), the windows stack up in a queue before being processed (while most of the machines in my cluster sit idle).
Question
Is it a semantic of Spark Streaming that windows are processed one by one? If so, is there any way to do the "windowing" operation in parallel in Spark, so that windows are consumed by foreachRDD at the same time?
Find out how many shards your Kinesis stream has and create that many receivers by invoking createStream, defined in the KinesisUtils Scala class.
This blog post from Amazon explains it well:
Every input DStream is associated with a receiver, and in this case
also with a KCL worker. The best way to understand this is to refer to
the method createStream defined in the KinesisUtils Scala class.
Every call to KinesisUtils.createStream instantiates a Spark Streaming
receiver and a KCL worker process on a Spark executor. The first time
a KCL worker is created, it connects to the Amazon Kinesis stream and
instantiates a record processor for every shard that it manages. For
every subsequent call, a new KCL worker is created and the record
processors are re-balanced among all available KCL workers. The KCL
workers pull data from the shards and route it to the receiver,
which in turn stores it into the associated DStream.
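As a rough sketch of the one-receiver-per-shard pattern, the following creates one Kinesis DStream per shard and unions them into a single stream. The stream name, region, endpoint, and the existing StreamingContext `ssc` are placeholders you would replace with your own:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

// Assumption: `ssc` is an existing StreamingContext, and the stream
// "my-stream" has `numShards` shards in us-east-1.
val numShards = 4  // match this to your actual Kinesis shard count
val kinesisStreams = (0 until numShards).map { _ =>
  KinesisUtils.createStream(
    ssc, "my-app", "my-stream",
    "kinesis.us-east-1.amazonaws.com", "us-east-1",
    InitialPositionInStream.LATEST, Seconds(2),
    StorageLevel.MEMORY_AND_DISK_2)
}
// Union the per-receiver streams into one DStream for downstream processing.
val unified = ssc.union(kinesisStreams)
```

Each createStream call occupies one executor core for its receiver, so make sure the cluster has more cores than receivers.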
Related
I'm building Apache Spark application that acts on multiple streams.
I did read the Performance Tuning section of the documentation:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning
What I didn't get is:
1) Are the streaming receivers located on multiple worker nodes, or on the driver machine?
2) What happens if one of the nodes that receives the data fails (power off/restart)?
Are the streaming receivers located on multiple worker nodes, or on
the driver machine?
Receivers are located on worker nodes, which are responsible for consuming the source that holds the data.
What happens if one of the nodes that receives the data fails (power
off/restart)?
The receiver is located on a worker node. The worker node gets its tasks from the driver. The driver can either be located on a dedicated machine if you're running in client mode, or on one of the workers if you're running in cluster mode. If a node that doesn't run the driver fails, the driver will re-assign the partitions held on the failed node to a different node, which can then re-read the data from the source and do the additional processing needed to recover from the failure.
This is why a replayable source such as Kafka or AWS Kinesis is needed.
We are currently working on a system that uses Kafka, Spark Streaming, and Cassandra as the DB. We are using checkpointing as described here [http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing]. Inside the function used to create the StreamingContext, we use createDirectStream to create our DStream, and from this point we execute several transformations and actions that culminate in calling saveToCassandra on different RDDs.
We are running different tests to establish how the application should recover when a failure occurs. Some key points about our scenario are:
We are testing with a fixed number of records in Kafka (between 10 million and 20 million); that means the application consumes from Kafka once and brings in all the records.
We are executing the application with --deploy-mode client inside one of the workers, which means that we stop and start the driver manually.
We are not sure how to handle exceptions after the DStreams are created. For example, if all Cassandra nodes are down while writing, we get an exception that aborts the job, but after re-submitting the application that job is not re-scheduled, and the application keeps consuming from Kafka, producing repeated 'isEmpty' calls.
We made a couple of tests using 'cache' on the repartitioned RDD (which didn't work after any failure other than simply stopping and starting the driver), and we changed the parameters "query.retry.count", "query.retry.delay" and "spark.task.maxFailures" without success; e.g., the job is still aborted after x failed attempts.
At this point we are confused about how we should use checkpointing to re-schedule jobs after a failure.
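For context, the recovery pattern the checkpointing docs describe requires building the entire DStream graph inside a factory function and handing it to StreamingContext.getOrCreate, so a restarted driver rebuilds the context from the checkpoint rather than creating a fresh one. A minimal sketch, with the checkpoint path and app name as assumed placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-app"  // assumed path

// All DStream setup must happen inside this factory; on restart,
// getOrCreate restores the graph from the checkpoint and skips it.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("my-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // createDirectStream, transformations, and saveToCassandra calls go here
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```

If the DStreams are instead created outside the factory, a restarted application starts from scratch, which matches the behavior described above where the aborted job is never re-scheduled.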
Recently I've been doing performance tests on Spark Streaming. I ran a receiver on one of the 6 slaves and submitted a simple word-count application to the cluster (I know this configuration is not proper in practice; it's just a simple test). I analyzed the scheduling log and found that nearly 88% of the tasks were scheduled to the node where the receiver ran, the locality was always PROCESS_LOCAL, and the CPU utilization was very high. Why does Spark Streaming not distribute data across the cluster and make full use of it? I've read the official guide and it does not explain this in detail, especially for Spark Streaming. Will it copy stream data to another node with a free CPU and start a new task there when a task lands on a node with a busy CPU? If so, how can we explain the former case?
When you run the stream receiver on just one of the 6 nodes, all the received data is processed on that node (that is data locality).
Data is not distributed across the other nodes by default. If you need the input stream to be repartitioned (balanced across the cluster) before further processing, you can use
inputStream.repartition(<number of partitions>)
This distributes the received batches of data across the specified number of machines in the cluster before further processing.
You can read more about the level of parallelism in the Spark documentation:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning
I'm trying to learn Apache Spark and I can't understand from the documentation how window operations work.
I have two worker nodes and I use the Kafka Spark utilities to create a DStream from a topic.
On this DStream I apply a map function and a reduceByWindow.
I can't understand whether reduceByWindow is executed on each worker or in the driver.
I have searched on Google without any result.
Can someone explain this to me?
Both receiving and processing data happen on the worker nodes. The driver creates receivers (on worker nodes), which are responsible for data collection, and periodically starts jobs to process the collected data. Everything else is pretty much standard RDDs and normal Spark jobs.
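To make that concrete, here is a hedged sketch of a windowed reduction, assuming `lines` is a DStream[String] coming from Kafka. The reduce runs as ordinary tasks on the executors; the driver only schedules the jobs and collects the output of print:

```scala
import org.apache.spark.streaming.Seconds

// Sum message lengths over a 30-second window, sliding every 10 seconds.
// Both map and reduceByWindow execute on the workers.
val lengths = lines.map(_.length)
val windowedSum = lengths.reduceByWindow(_ + _, Seconds(30), Seconds(10))
windowedSum.print()
```

Windowed operations also require checkpointing to be enabled (ssc.checkpoint(...)) when an inverse-reduce variant is used, since those maintain state across batches.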
Recently I've been doing performance tests on Spark Streaming. But some problems puzzled me a lot.
In Spark Streaming, receivers are scheduled to run in executors on worker nodes.
How many receivers are there in a cluster? Can I control the number of receivers?
If not all workers run a receiver, will the other worker nodes receive any data? In that case, how can I guarantee task scheduling based on data locality? By copying data from the nodes that run receivers?
There is only one receiver per DStream, but you can create more than one DStream and union them together to act as one. This is why it is suggested to run Spark Streaming against a cluster that has at least N(receivers) + 1 cores. Once the data is past the receiving portion, it is mostly a simple Spark application and follows the same rules as a batch job. (This is why streaming is referred to as micro-batching.)
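A sketch of the multiple-DStream approach with the receiver-based Kafka API; the ZooKeeper quorum, group id, and topic name are assumed placeholders, and `ssc` is an existing StreamingContext:

```scala
import org.apache.spark.streaming.kafka.KafkaUtils

// Three receivers, each its own DStream; with 3 receivers the cluster
// needs at least 4 cores (N receivers + 1 for processing).
val numReceivers = 3
val streams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("my-topic" -> 1))
}
// Union into one logical stream so downstream code sees a single DStream.
val unioned = ssc.union(streams)
```

With fewer cores than receivers, the receivers occupy every core and no processing tasks can run, which shows up as a streaming job that receives data but never produces output.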