I have a simple job (20 executors, 8 GB memory each) that reads from Kafka (with 50 partitions), checkpoints to HDFS, and posts data to an HTTP endpoint (1000 events per second). I recently started to see some straggling executors that take far longer than the other executors. As part of the investigation I am trying to rule out data skew; is there a way to print partition:offsets per executor? Or is there any other way to track why an executor may be straggling?
I know I can implement StreamingQueryListener, but that will only give me partition:offsets per batch, and won't tell me which executor is processing a specific partition.
You can have this printed if you use a foreach sink in Structured Streaming. The ForeachWriter's open method receives the partition and epoch IDs, and it is executed on the executors for every partition of every batch, so you can log those details there.
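Here is a minimal Scala sketch of that approach, assuming the stream has been cast to a Dataset[String] and using the developer-level SparkEnv API to get the executor id (the class name is made up). Note that the partition id passed to open is the sink-side partition, which lines up with the Kafka partition only when there is no shuffle between source and sink.
```scala
import java.net.InetAddress

import org.apache.spark.SparkEnv
import org.apache.spark.sql.ForeachWriter

// Hypothetical writer that logs which executor handles which partition.
class PartitionLoggingWriter extends ForeachWriter[String] {

  override def open(partitionId: Long, epochId: Long): Boolean = {
    // Runs on an executor for every partition of every micro-batch.
    val executorId = SparkEnv.get.executorId
    val host = InetAddress.getLocalHost.getHostName
    println(s"epoch=$epochId partition=$partitionId executor=$executorId host=$host")
    true // process the rows of this partition
  }

  override def process(value: String): Unit = {
    // Post the event to the HTTP endpoint here.
  }

  override def close(errorOrNull: Throwable): Unit = ()
}

// Usage, assuming `events` is the Dataset[String] read from Kafka:
// events.writeStream.foreach(new PartitionLoggingWriter).start()
```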
I have a Spark Structured Streaming job reading from Kafka whose task durations vary greatly.
I don't know why this is the case, since the topic partitions are not skewed and I am using maxOffsetsPerTrigger on the readStream to cap the number of offsets per batch. I would expect each executor to get roughly the same amount of data.
Yet it is common for a stage to have a minimum task duration of 0.8s and a maximum of 12s. In the Spark UI, under Event Timeline, the green bars for Executor Computing Time show the variation.
Details of the job:
is running on Spark-Kubernetes
uses PySpark via Jupyter Notebook
reads from a Kafka topic with n partitions
creates n executors to match the topic partition number
sets maxOffsetsPerTrigger on the readStream
has enough memory and CPU
to isolate where the lag is happening, the output sink is noop, but normally this would be a Kafka sink (a sketch of this setup follows below)
How can I even out the task durations?
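For reference, here is a rough Scala sketch of the setup described above (the job itself uses PySpark, but the options are the same); the broker address, topic name, offset cap, and checkpoint path are placeholders.
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-noop-test").getOrCreate()

// Read from Kafka, capping the number of offsets consumed per micro-batch.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "topic1")                    // placeholder
  .option("maxOffsetsPerTrigger", 100000L)          // placeholder cap per batch
  .load()

// The noop sink (Spark 3.x) discards the data, isolating read/processing cost.
val query = events.writeStream
  .format("noop")
  .option("checkpointLocation", "/tmp/checkpoints/noop-test") // placeholder
  .start()

query.awaitTermination()
```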
Let's say there are 3 executors (ex1, ex2, ex3), and the receiver is running in one of them, ex1. What happens when data comes in from the source?
Say data arrives in a Kafka topic, "topic1". The receiver running in ex1 will consume the data that arrived in the topic, right? Now where is that data stored?
Is it stored in that executor ex1 itself?
What if that data is too large? Does Spark break it down and distribute it over to the other executors?
Say each executor (ex1, ex2, ex3) has a capacity of 10 GB, and 15 GB of data arrives (a hypothetical assumption). What happens to ex1? Will it fail, or will this be handled? If it is handled, how? Does Spark distribute the data over the cluster? If it does distribute it over the cluster, how does foreachRDD fit into the picture, given that only one RDD is formed per batch? If the data is broken up and distributed, isn't there then more than one RDD in the cluster for that batch?
How many receivers run in a Spark job? Does it depend on the number of input sources? If Spark is reading from 4 different Kafka topics, does that mean 4 different receivers will run separately in different executors? What if there are only 2 executors and 4 Kafka topics/sources? In that case, will the 4 receivers be spread evenly across the two executors? What if the number of sources is odd? With 2 executors and 3 Kafka sources, will one of the executors run two receivers? What if one of the executors dies? How will it be recovered?
Is it stored in that executor ex1 itself?
Yes, all the data fetched by the Spark driver is pushed to the executors for further processing.
What if that data is too large? Does Spark break it down and distribute it over to the other executors?
The data is hash-partitioned once it is read by the Spark receiver and is then distributed fairly among the executors. If you still see data skew, try adding a custom partitioner and repartitioning the data.
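As an illustration of that suggestion, here is a hypothetical custom partitioner sketch in Scala; the key choice ("device id") and the partition count are assumptions for illustration only.
```scala
import org.apache.spark.Partitioner

// Hypothetical partitioner that spreads keys across a fixed number of partitions.
class DeviceIdPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = {
    // Non-negative modulo of the key's hash code.
    val mod = key.toString.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }
}

// Usage on a keyed RDD, e.g. inside foreachRDD or transform:
// val balanced = keyedRdd.partitionBy(new DeviceIdPartitioner(20))
// For an un-keyed Dataset, a plain repartition(20) also redistributes rows.
```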
How many receivers run in a Spark job? Does it depend on the number of input sources? If Spark is reading from 4 different Kafka topics, does that mean 4 different receivers will run separately in different executors? What if there are only 2 executors and 4 Kafka topics/sources? In that case, will the 4 receivers be spread evenly across the two executors? What if the number of sources is odd? With 2 executors and 3 Kafka sources, will one of the executors run two receivers? What if one of the executors dies? How will it be recovered?
There is only a single receiver (whether for one or multiple topics), and it does the Kafka offset management. It hands a range of offsets per topic partition to each Spark executor, and the executors read the data directly from Kafka. If any executor dies, its stages are re-executed from the last successfully saved stage.
Spark's load distribution is based not on the size of the data but on the count of events.
Guidelines say that if there are N partitions for a topic then Spark should have 2N executors to achieve optimum CPU resource utilization.
You should find more details at the link below:
https://blog.cloudera.com/reading-data-securely-from-apache-kafka-to-apache-spark/
I have to write Spark Streaming (createDirectStream API) code. I will be receiving around 90K messages per second, so I thought of using 100 partitions for the Kafka topic to improve performance.
Could you please let me know how many executors I should use? Can I use 50 executors with 2 cores per executor?
Also, if the batch interval is 10 seconds and the Kafka topic has 100 partitions, will I receive 100 RDDs, i.e. 1 RDD from each Kafka partition? Or will there be only 1 RDD from each partition per 10-second batch interval?
Thanks
There is no good answer, really, and it depends on how much executor memory + cores you have in your cluster.
The hard limit is that you cannot have more total executor processes than Kafka partitions, and you don't want to saturate your network or other I/O.
Therefore, first find out whether you are saturating the network and/or memory/disks with one executor, then run two and see whether throughput doubles and the network rate is cut in half on that one machine. Then scale out the cores and instances as needed.
Dropbox recently wrote a blog post on their performance testing.
Regarding RDDs, assuming you have a 1:1 mapping of executor instances to partitions, each executor would see 10 seconds' worth of data per interval for only one partition and would have its own partition of the RDD to process, so 100 partitions are processed per batch. IMO, the "number of RDDs" isn't too important, because you always get 1 RDD per interval.
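To make the one-RDD-per-interval point concrete, here is a rough Scala sketch using the spark-streaming-kafka-0-10 integration; the broker, topic, and group names are placeholders. With a 100-partition topic, the printed partition count should be 100 for every batch.
```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("direct-stream-sketch")
val ssc = new StreamingContext(conf, Seconds(10)) // 10-second batch interval

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",               // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "sketch-group"                        // placeholder
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("topic1"), kafkaParams))

stream.foreachRDD { rdd =>
  // One RDD per batch; its partition count equals the Kafka partition count.
  println(s"partitions in this batch's RDD: ${rdd.getNumPartitions}")
}

ssc.start()
ssc.awaitTermination()
```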
I use Apache Spark 2.1 and Apache Kafka 0.9.
I have a Spark Streaming application that runs with 20 executors and reads from a Kafka topic that has 20 partitions. This Spark application does map and flatMap operations only.
Here is what the Spark application does:
Create a direct stream from kafka with interval of 15 seconds
Perform data validations
Execute transformations using Drools, which are map-only (no reduce transformations)
Write to HBase using check-and-put
I wonder, if executors and partitions are mapped 1:1, will every executor independently perform the above steps and write to HBase independently, or will data be shuffled across multiple executors, with operations happening between the driver and the executors?
Spark jobs submit tasks that can only be executed on executors. In other words, executors are the only place where tasks can be executed. The driver's role is to coordinate the tasks and schedule them accordingly.
With that said, I'd say the following is true:
will every executor independently perform the above steps and write to HBase independently
By the way, the answer does not depend on which Spark version is in use. It has always been like this (and I don't see any reason why it would, or even should, change).
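To illustrate executors working independently, here is a hedged Scala sketch of such a write path. It assumes a DStream[(String, String)] named `validated` (produced by the validation and Drools steps upstream) and the HBase 1.x client API; the table and column names are placeholders. Since map and flatMap are narrow transformations, nothing here triggers a shuffle, and the driver only schedules the per-partition tasks.
```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

// `validated` is assumed to be a DStream[(String, String)] of (row key, payload).
validated.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // Runs on the executor that owns this partition; no shuffle, no driver round trip.
    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("events")) // placeholder table
    try {
      records.foreach { case (rowKey, payload) =>
        val put = new Put(Bytes.toBytes(rowKey))
        put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes(payload))
        // Check-and-put: write only if the cell is still absent (null expected value).
        table.checkAndPut(Bytes.toBytes(rowKey), Bytes.toBytes("d"),
          Bytes.toBytes("payload"), null, put)
      }
    } finally {
      table.close()
      conn.close()
    }
  }
}
```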
I am trying to read a Kafka topic with the new directStream method in KafkaUtils.
I have a Kafka topic with 8 partitions.
I am running the streaming job on YARN with 8 executors with 1 core each (--num-executors 8 --executor-cores 1).
I noticed that Spark reads all the topic's partitions in one executor sequentially, which is obviously not what I want.
I want spark to read all partitions in parallel.
How can I achieve that?
Thank you in advance.
An initial communication with Kafka occurs at job creation, solely to set the offsets of the KafkaRDD; more specifically, the offsets for each KafkaRDD partition that makes up the KafkaRDD across the cluster.
They are then used to fetch the data once the job is actually executed, on each executor. Depending on what you observed, it's possible you saw that initial communication (from the driver). If you saw all your jobs executing on the same executor, then something else is going wrong beyond just using Kafka.
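If you want to confirm on which executor each Kafka partition is actually read, here is a sketch following the offset-range pattern from the Kafka integration docs; it assumes the spark-streaming-kafka-0-10 API and that `stream` is the direct stream you created. With 8 partitions and 8 single-core executors, you should see 8 distinct executor ids per batch once the tasks are spread out.
```scala
import org.apache.spark.{SparkEnv, TaskContext}
import org.apache.spark.streaming.kafka010.HasOffsetRanges

// `stream` is assumed to be the DStream returned by KafkaUtils.createDirectStream.
stream.foreachRDD { rdd =>
  // Offset ranges are computed on the driver, one per Kafka partition.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.foreachPartition { _ =>
    // This closure runs on the executors; the RDD partition index matches
    // the index into offsetRanges.
    val range = offsetRanges(TaskContext.get.partitionId)
    println(s"executor=${SparkEnv.get.executorId} " +
      s"topic=${range.topic} partition=${range.partition} " +
      s"offsets=${range.fromOffset}..${range.untilOffset}")
  }
}
```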