Low Spark Streaming CPU utilization - apache-spark

In my Spark Streaming job, the CPU is underutilized (only 5-10%).
The job fetches data from Kafka and sends it to DynamoDB or a third-party endpoint.
Is there any recommendation for making the job better utilize the CPU resources, assuming that the endpoint is not the bottleneck?

The level of parallelism of Kafka depends on the number of partitions of the topic.
If the number of partitions in a topic is small, you will not be able to efficiently parallelize in a spark streaming cluster.
First, increase the number of partitions of the topic.
If you cannot increase the number of partitions of the Kafka topic, repartition the stream in Spark instead, for example by calling DStream.repartition before DStream.foreachRDD.
This distributes the data across all the nodes so that more cores can work in parallel, at the cost of a shuffle.
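The repartitioning step above can be sketched as follows. This is a minimal illustration, not the answerer's code: the helper name `spread_over_cores` and the `tasks_per_core` default are assumptions, and `stream` stands for any object with a `repartition(numPartitions)` method, such as a PySpark DStream.

```python
def spread_over_cores(stream, total_cores, tasks_per_core=2):
    """Repartition a stream so that every core has work in each batch.

    `stream` is assumed to expose repartition(numPartitions), as a
    PySpark DStream does. `tasks_per_core` is a hypothetical tuning knob,
    not a Spark setting.
    """
    if total_cores < 1 or tasks_per_core < 1:
        raise ValueError("total_cores and tasks_per_core must be >= 1")
    # repartition() triggers a shuffle, so the extra parallelism is not free.
    return stream.repartition(total_cores * tasks_per_core)
```

Because repartitioning shuffles every record, it tends to pay off only when the per-record work after the shuffle is heavier than the shuffle itself.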

Related

Does the number of Kafka partitions increase the speed of Spark writing to Kafka?

When reading, Spark has a 1:1 mapping to Kafka partitions, so with more partitions we can leverage more parallelism in our job.
But does this also apply when Spark is writing to Kafka? Is writing the same dataset to a topic with 4 partitions faster than writing to a topic with 1 partition?
Yes.
If your topic has 1 partition, it lives on one broker. So if you increase the producer rate for the topic, that broker becomes busy. But if you have multiple partitions, your Kafka cluster spreads those partitions across different brokers, and the production load is shared among them. So writing the same dataset to a topic with 4 partitions is faster than writing to a topic with 1 partition.
This is not only about the production rate. Kafka brokers also run multiple processes such as compaction, compression, segmentation, etc., so the workload grows with the number of messages. With multiple partitions on multiple brokers, that work is distributed as well.
However, you don’t necessarily want to use more partitions than needed because increasing partition count simultaneously increases the number of open server files and leads to increased replication latency.
From the Kafka documentation:
Distribution
The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.
Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.
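To make the load-sharing argument concrete, here is a toy model. It assumes partition leaders are spread round-robin over brokers and that producers write evenly to every partition; real Kafka leader assignment and traffic are messier, and the function names are made up for this sketch.

```python
from collections import Counter


def leaders_per_broker(num_partitions, num_brokers):
    """Toy model: assign partition leaders to brokers round-robin."""
    return Counter(p % num_brokers for p in range(num_partitions))


def busiest_broker_share(num_partitions, num_brokers):
    """Fraction of all writes handled by the most loaded broker,
    assuming every partition receives the same write rate."""
    counts = leaders_per_broker(num_partitions, num_brokers)
    return max(counts.values()) / num_partitions


# One partition on a 4-broker cluster: a single broker takes every write.
print(busiest_broker_share(1, 4))
# Four partitions on the same cluster: each broker leads one partition.
print(busiest_broker_share(4, 4))
```

In this model, adding partitions beyond the number of brokers barely changes the busiest broker's share, which fits the caution above about not using more partitions than needed.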

Optimal value of spark.sql.shuffle.partitions for a Spark batch Job reading from Kafka

I have a Spark batch job that consumes data from a Kafka topic with 300 partitions. As part of my job, there are various transformations like group by and join which require shuffling.
I want to know whether I should go with the default value of spark.sql.shuffle.partitions, which is 200, or set it to 300, which is the same as the number of input partitions in Kafka and hence the number of parallel tasks spawned to read it.
Thanks
In the chapter on Optimizing and Tuning Spark Applications of the book "Learning Spark, 2nd edition" (O'Reilly) it is written that the default value
"is too high for smaller or streaming workloads; you may want to reduce it to a lower value such as the number of cores on the executors or less.
There is no magic formula for the number of shuffle partitions to set for the shuffle stage; the number may vary depending on your use case, data set, number of cores, and the amount of executors memory available - it's a trial-and-error-approach."
Your goal should be to reduce the number of small partitions being sent across the network to the executors' tasks.
There is a recording of a talk on Tuning Apache Spark for Large Scale Workloads which also talks about this configuration.
However, when you are using Spark 3.x, you do not need to think about this as much, because the Adaptive Query Execution (AQE) framework dynamically coalesces shuffle partitions based on the shuffle file statistics. More details on the AQE framework are given in this blog.
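As a rough sketch of how these settings could be passed to a job, the helper below builds `--conf` arguments for spark-submit. The configuration keys are real Spark SQL properties; the helper names and the choice of 300 are illustrative assumptions. Whether AQE is already on by default depends on the Spark version (it is from 3.2.0 onward).

```python
def shuffle_conf(shuffle_partitions, use_aqe=True):
    """Build Spark SQL settings for the shuffle stage (illustrative helper)."""
    conf = {"spark.sql.shuffle.partitions": str(shuffle_partitions)}
    if use_aqe:
        # Let AQE merge small shuffle partitions at runtime (Spark 3.x).
        conf["spark.sql.adaptive.enabled"] = "true"
        conf["spark.sql.adaptive.coalescePartitions.enabled"] = "true"
    return conf


def to_submit_args(conf):
    """Turn a settings dict into spark-submit style --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Start from the Kafka partition count and let AQE coalesce downward.
print(" ".join(to_submit_args(shuffle_conf(300))))
```

With AQE coalescing enabled, the configured value mostly acts as a starting point, so picking a somewhat high number is less risky than it is without AQE.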

How many executors should I use for spark streaming

I have to write Spark Streaming (createDirectStream API) code. I will be receiving around 90K messages per second, so I thought of using 100 partitions for the Kafka topic to improve performance.
Could you please let me know how many executors I should use? Can I use 50 executors and 2 cores per executor?
Also, if the batch interval is 10 seconds and the Kafka topic has 100 partitions, will I receive 100 RDDs, i.e. 1 RDD from each Kafka partition? Will there be only 1 RDD from each partition for the 10-second batch interval?
Thanks
There is no good answer, really, and it depends on how much executor memory + cores you have in your cluster.
The hard limit is that there is no benefit in having more total executor cores than Kafka partitions, since the extra cores would sit idle, and you don't want to saturate your network or other IO.
Therefore, first find if you are capping the network and/or memory/disks with one executor, then run two, and see if throughput doubles & network rates cut in half on the one machine. Then scale out the cores and instances, as needed.
Dropbox recently wrote a blog on their performance testing
Regarding RDDs: with the direct stream, each batch interval produces a single RDD whose partitions map 1:1 to the Kafka partitions. With 100 Kafka partitions and a 10-second interval, you get one RDD with 100 partitions per batch, and each partition holds the 10 seconds' worth of data from one Kafka partition. Assuming one core per partition, each core processes one of those partitions. IMO, the "number of RDDs" isn't too important because you always get 1 RDD per interval.
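The numbers from the question can be worked through with a small back-of-the-envelope helper. It assumes messages are spread evenly over the Kafka partitions and that each RDD partition is processed by one task; the function and key names are made up for this sketch.

```python
def batch_sizing(msgs_per_sec, batch_seconds, kafka_partitions,
                 executors, cores_per_executor):
    """Estimate per-batch load for a direct Kafka stream (illustrative)."""
    messages_per_batch = msgs_per_sec * batch_seconds
    task_slots = executors * cores_per_executor
    return {
        "messages_per_batch": messages_per_batch,
        # Even spread over partitions is an assumption, not a guarantee.
        "messages_per_partition": messages_per_batch / kafka_partitions,
        "task_slots": task_slots,
        # Slots beyond the partition count have nothing to read.
        "idle_slots": max(0, task_slots - kafka_partitions),
    }


# 90K msg/s, 10 s batches, 100 partitions, 50 executors x 2 cores.
print(batch_sizing(90_000, 10, 100, 50, 2))
```

Under these assumptions, 50 executors with 2 cores each give exactly one task slot per Kafka partition. Whether that is enough still depends on how long one partition's share of a batch takes to process.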

Does Spark read data from Kafka partition into executor, for a batch which is queued?

During spark streaming with streaming-kafka-0-8-integration Direct Approach, If the batches are getting queued, will the executors pull the data for queued batches into their memory? If not, what is the harm in having a very long backlog of batches?
Yes, Spark will pull the data from Kafka and process it in memory. The harm would be pressure on Kafka resources, since Kafka has to hold the long backlog of batches.

Kafka topic partitions to Spark streaming

I have some use cases that I would like clarified, concerning how Kafka topic partitioning affects Spark Streaming resource utilization.
I use Spark standalone mode, so the only settings I have are "total number of executors" and "executor memory". As far as I know, and according to the documentation, the way to introduce parallelism into Spark Streaming is to use a partitioned Kafka topic: the RDD will have the same number of partitions as the Kafka topic when I use the spark-kafka direct stream integration.
So if I have 1 partition in the topic, and 1 executor core, that core will sequentially read from Kafka.
What happens if I have:
2 partitions in the topic and only 1 executor core? Will that core read first from one partition and then from the second one, so there will be no benefit in partitioning the topic?
2 partitions in the topic and 2 cores? Will then 1 executor core read from 1 partition, and second core from the second partition?
1 kafka partition and 2 executor cores?
Thank you.
The basic rule is that you can scale up to the number of Kafka partitions. If you set spark.executor.cores greater than the number of partitions, some of the threads will be idle. If it's less than the number of partitions, Spark will have threads read from one partition then the other. So:
2 partitions, 1 core: reads from one partition, then the other. (I am not sure how Spark decides how much to read from each before switching.)
2p, 2c: parallel execution
1p, 2c: one thread is idle
For case #1, note that having more partitions than executors is OK since it allows you to scale out later without having to re-partition. The trick is to make sure that your partitions are evenly divisible by the number of executors. Spark has to process all the partitions before passing data onto the next step in the pipeline. So, if you have 'remainder' partitions, this can slow down processing. For example, 5 partitions and 4 threads => processing takes the time of 2 partitions - 4 at once then one thread running the 5th partition by itself.
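The "remainder partitions" effect is simple arithmetic: the number of sequential waves is the partition count divided by the thread count, rounded up. The sketch below assumes every partition takes the same time to process, which real workloads only approximate; the function name is illustrative.

```python
import math


def processing_waves(partitions, threads):
    """Number of sequential rounds needed to process all partitions,
    assuming one partition per thread at a time and equal task durations."""
    if partitions < 0 or threads < 1:
        raise ValueError("need partitions >= 0 and threads >= 1")
    return math.ceil(partitions / threads)


print(processing_waves(5, 4))  # 2: four partitions at once, then the fifth alone
print(processing_waves(8, 4))  # 2: same wall time as 5 partitions, no idle threads
print(processing_waves(2, 1))  # 2: one core reads the partitions one after another
print(processing_waves(1, 2))  # 1: the second thread stays idle
```

The comparison between 5 and 8 partitions shows why evenly divisible counts help: the stage takes two waves either way, but with 5 partitions three threads sit idle during the second wave.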
Also note that you may also see better processing throughput if you keep the number of partitions / RDDs the same throughout the pipeline by explicitly setting the number of data partitions in functions like reduceByKey().
