large differences in Spark Structured Streaming task duration - apache-spark

I have a Spark Structured Streaming job reading from Kafka whose task durations vary greatly.
I don't know why this is the case, since the topic partitions are not skewed and I am using maxOffsetsPerTrigger on the readStream to cap the number of records per micro-batch. I would expect each executor to get roughly the same amount of data.
Yet it is common for a stage to have a minimum task duration of 0.8s and a maximum of 12s. In the Spark UI, the green Executor Computing Time bars in the Event Timeline show the variation.
Details of the job:
runs on Spark on Kubernetes
uses PySpark via a Jupyter Notebook
reads from a Kafka topic with n partitions
creates n executors to match the number of topic partitions
sets maxOffsetsPerTrigger on the readStream
has enough memory and CPU
uses a noop output sink to isolate where the lag is happening (normally this would be a Kafka sink); a sketch of the full setup follows below
How can I even out the task durations?
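For reference, a minimal sketch of the setup described above (broker address, topic name, option values and checkpoint path are placeholders, not taken from the original job):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-task-duration").getOrCreate()

df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder
    .option("subscribe", "my_topic")                     # placeholder
    .option("maxOffsetsPerTrigger", 100000)              # caps records per micro-batch
    .load())

# noop sink, used only to rule out sink latency while investigating
query = (df.writeStream
    .format("noop")
    .option("checkpointLocation", "/tmp/checkpoints/noop")  # placeholder
    .start())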

Related

Spark stuck on SynapseLoggingShim.scala while writing into delta table

I'm streaming data from Kafka and trying to merge ~30 million records into a Delta Lake table.
def do_the_merge(microBatchDF, partition):
    # second argument is the micro-batch id passed in by foreachBatch
    deltaTable.alias("target") \
        .merge(microBatchDF.alias("source"), "source.id1 = target.id2 and source.id = target.id") \
        .whenMatchedUpdateAll() \
        .whenNotMatchedInsertAll() \
        .execute()
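For context, a sketch of how a merge function like this is typically attached to the stream via foreachBatch (the source DataFrame name and checkpoint path are placeholders):

query = (kafka_stream_df.writeStream
    .foreachBatch(do_the_merge)  # invoked once per micro-batch with (batchDF, batchId)
    .option("checkpointLocation", "/tmp/checkpoints/merge")  # placeholder
    .start())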
I can see that Spark is stuck for almost an hour on a task named SynapseLoggingShim.
Once this stage completes, the write to the Delta table actually starts and takes one more hour.
I'm trying to understand what this SynapseLoggingShim stage does.
Answering my own question: the SynapseLoggingShim.scala stage was waiting on the merge task to complete.
It's just an OpenTelemetry wrapper used to collect metrics.
The real problem is that we are bottlenecked by the source! The Event Hub we are reading from has 32 partitions, and Spark's parallelism is constrained by the Event Hub partition count.
In simple words, increasing the number of Spark cores doesn't decrease the time, because the source Event Hub limits the parallelism to the topic partition count.
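A quick way to confirm this on your own job is to log the micro-batch's partition count inside foreachBatch; the wrapper below is a hypothetical helper around the merge function from the question:

def do_the_merge_with_check(microBatchDF, batchId):
    # typically equals the Event Hub / Kafka topic partition count (32 here),
    # no matter how many cores the cluster has
    print(f"batch {batchId}: {microBatchDF.rdd.getNumPartitions()} input partitions")
    do_the_merge(microBatchDF, batchId)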

Optimal value of spark.sql.shuffle.partitions for a Spark batch Job reading from Kafka

I have a Spark batch job that consumes data from a Kafka topic with 300 partitions. As part of my job, there are various transformations like group by and join which require shuffling.
I want to know whether I should go with the default value of spark.sql.shuffle.partitions, which is 200, or set it to 300, the same as the number of input partitions in Kafka and hence the number of parallel tasks spawned to read it.
Thanks
In the chapter on Optimizing and Tuning Spark Applications of the book "Learning Spark, 2nd edition" (O'Reilly), it is written that the default value
"is too high for smaller or streaming workloads; you may want to reduce it to a lower value such as the number of cores on the executors or less.
There is no magic formula for the number of shuffle partitions to set for the shuffle stage; the number may vary depending on your use case, data set, number of cores, and the amount of executor memory available - it's a trial-and-error approach."
Your goal should be to reduce the number of small partitions being sent across the network to the executors' tasks.
There is a recording of a talk on Tuning Apache Spark for Large Scale Workloads which also talks about this configuration.
However, when you are using Spark 3.x, you do not need to think about this as much, because the Adaptive Query Execution (AQE) framework will dynamically coalesce shuffle partitions based on the shuffle file statistics. More details on the AQE framework are given in this blog.
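A minimal sketch of the two options discussed above, assuming spark is your SparkSession (these are standard Spark configuration keys, but the values are illustrative, not recommendations):

# pin the shuffle partition count explicitly, e.g. to match the 300 Kafka partitions
spark.conf.set("spark.sql.shuffle.partitions", "300")

# or, on Spark 3.x, let AQE coalesce small shuffle partitions automatically
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")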

How does the default (unspecified) trigger determine the size of micro-batches in Structured Streaming?

When query execution in Spark Structured Streaming has no trigger setting, as in:
import org.apache.spark.sql.streaming.Trigger
// Default trigger (runs micro-batch as soon as it can)
df.writeStream
.format("console")
//.trigger(???) // <--- Trigger intentionally omitted ----
.start()
As of Spark 2.4.3 (Aug 2019), the Structured Streaming Programming Guide - Triggers says:
If no trigger setting is explicitly specified, then by default, the query will be executed in micro-batch mode, where micro-batches will be generated as soon as the previous micro-batch has completed processing.
QUESTION: On what basis does the default trigger determine the size of the micro-batches?
Let's say the input source is Kafka and the job was interrupted for a day because of some outages. When the same Spark job is restarted, it will consume messages from where it left off. Does that mean the first micro-batch will be a gigantic batch with one day's worth of messages that accumulated in the Kafka topic while the job was stopped? Let's assume the job takes 10 hours to process that big batch; does the next micro-batch then contain 10 hours' worth of messages, and so on for X iterations until the backlog is caught up and the micro-batches become smaller?
On which basis the default trigger determines the size of the micro-batches?
It does not. Every trigger (however long) simply requests input datasets from all sources, and whatever they give is processed downstream by the operators. The sources know what to give because they know what has been consumed (processed) so far.
It is as if you asked about a batch structured query and the size of the data this single "trigger" requests to process (BTW, there is also a ProcessingTime.Once trigger).
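For reference, here is how the default and one-off triggers look in PySpark (the snippet in the question is Scala; df is assumed to be a streaming DataFrame defined earlier):

# default: the next micro-batch starts as soon as the previous one finishes
df.writeStream.format("console").start()

# single micro-batch over whatever the sources have available, then stop
df.writeStream.format("console").trigger(once=True).start()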
Does that mean the first micro-batch will be a gigantic batch with 1 day of msg which accumulated in the Kafka topic while the job was stopped?
Almost (and it really does not have much, if anything, to do with Spark Structured Streaming itself).
The number of records the underlying Kafka consumer gets to process is configured by max.poll.records and perhaps some other configuration properties (see Increase the number of messages read by a Kafka consumer in a single poll).
Since the Spark Structured Streaming Kafka data source is simply a wrapper around the Kafka Consumer API, whatever happens in a single micro-batch is equivalent to a single Consumer.poll call.
You can configure the underlying Kafka consumer using options with the kafka. prefix (e.g. kafka.bootstrap.servers), which are applied to the Kafka consumers on the driver and executors.
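A sketch of how such options are passed on the readStream, together with maxOffsetsPerTrigger (the option used in the question at the top of this page) as a way to bound the size of the catch-up batch after a restart; broker, topic, and values are placeholders:

df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder
    .option("subscribe", "events")                       # placeholder
    .option("startingOffsets", "earliest")
    .option("maxOffsetsPerTrigger", 500000)              # caps each micro-batch during catch-up
    .option("kafka.max.poll.records", 10000)             # forwarded to the underlying Kafka consumer
    .load())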

Spark Structured Streaming Print Offsets Per Batch Per Executor

I have a simple job (20 executors, 8G memory each) that reads from Kafka (with 50 partitions), checkpoints to HDFS, and posts data to an HTTP endpoint (1000 events per second). I recently started to see some straggling executors that take far longer than the others. As part of the investigation I was trying to rule out data skew; is there a way to print partition:offsets per executor? Or is there any other way to track why an executor may be straggling?
I know I can implement StreamingQueryListener but that'll only give me partition:offsets per batch, and won't tell me which executor is processing a specific partition.
You can have this printed if you use a foreach sink in Structured Streaming. The open(partitionId, epochId) method is executed on every executor, so you have those details there.
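A minimal sketch of that approach in PySpark (the actual sink logic, e.g. the HTTP post, is omitted and would go in process()):

class PartitionLogger:
    def open(self, partition_id, epoch_id):
        # shows up in the executor's log, tying each partition/epoch to an executor
        print(f"epoch={epoch_id} partition={partition_id}")
        return True

    def process(self, row):
        pass  # real per-row sink logic goes here

    def close(self, error):
        pass

query = df.writeStream.foreach(PartitionLogger()).start()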

Low Spark Streaming CPU utilization

In my Spark Streaming job, the CPU is underutilized (only 5-10%).
It fetches data from Kafka and sends it to DynamoDB or a third-party endpoint.
Is there any recommendation for the job that would better utilize the CPU resources, assuming the endpoint is not the bottleneck?
The level of parallelism when reading from Kafka depends on the number of partitions of the topic.
If the number of partitions in the topic is small, you will not be able to parallelize efficiently in a Spark Streaming cluster.
First, increase the number of partitions of the topic.
If you cannot increase the number of partitions of the Kafka topic, increase the parallelism by repartitioning, e.g. with DStream.repartition or by repartitioning the RDD inside foreachRDD.
This will distribute the data across all the nodes and be more efficient.
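A hedged sketch of that suggestion with the DStream API (the stream creation and sink call are placeholders; kafka_dstream and send_partition are illustrative names):

num_cores = 32  # illustrative: total cores available to the streaming job

repartitioned = kafka_dstream.repartition(num_cores)  # spread records across all nodes

def send_partition(records):
    for record in records:
        pass  # post each record to DynamoDB / the third-party endpoint here

repartitioned.foreachRDD(lambda rdd: rdd.foreachPartition(send_partition))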
