Gracefully exit without running pending batches in Spark DStreams - apache-spark

I am trying Spark DStreams with Kafka.
I am not able to commit offsets for the pending/lagging batches that have already been consumed by the Spark DStream. The issue happens after the streaming context is stopped using ssc.stop(true, true): the streaming context stops, but the Spark context keeps running the pending batches.
Here are a few things I have done.
Create a DStream to get data from a Kafka topic (successful).
Commit offsets manually back to Kafka using
stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
Batch Interval is 60 seconds.
Batch time (time taken to perform some operation on the incoming data) is 2 minutes.
streamingContext.stop(true,true)
Please tell me if there is a way to commit the offsets for the pending batches as well, or to exit gracefully after the currently running batch and discard the pending batches, so that the offsets for the pending batches are not committed.
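For context, here is a minimal sketch of the setup described above, assuming the spark-streaming-kafka-0-10 direct stream API; the broker address, topic, and group id are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val conf = new SparkConf().setAppName("dstream-manual-commit")
val ssc  = new StreamingContext(conf, Seconds(60))   // 60-second batch interval

// Placeholder Kafka settings -- adjust to your environment.
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "my-consumer-group",
  "enable.auto.commit" -> (false: java.lang.Boolean)  // commit manually instead
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("my-topic"), kafkaParams))

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch (takes ~2 minutes in the scenario above) ...
  // Commit only the offsets of the batch that was just processed.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

ssc.start()
// stop(stopSparkContext = true, stopGracefully = true) waits for queued batches
// to drain, which is exactly the behaviour the question wants to avoid.
// ssc.stop(stopSparkContext = true, stopGracefully = true)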

Related

Do Spark Streaming receivers continue pulling data for every block interval during the current micro-batch?

For every spark.streaming.blockInterval (say, 1 minute) receivers listen to streaming sources for data. Suppose the current micro-batch is taking an unnaturally long time to complete (intentionally, say 20 minutes). During this micro-batch, would the receivers still listen to the streaming source and store the data in Spark memory?
The current pipeline runs in Azure Databricks using Spark Structured Streaming.
Can anyone help me understand this?
In the scenario above, Spark will continue to consume/pull data from Kafka, micro-batches will continue to pile up, and this will eventually cause out-of-memory (OOM) issues.
To avoid this scenario, enable the backpressure setting:
spark.streaming.backpressure.enabled=true
For more details on the Spark backpressure feature, see https://spark.apache.org/docs/latest/streaming-programming-guide.html
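As a quick illustration, a minimal sketch of setting this flag (plus the optional initial-rate cap) on the SparkConf before creating the StreamingContext; the app name, rate value, and batch interval are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("backpressure-example")
  // Let Spark adapt the ingestion rate to the observed processing rate.
  .set("spark.streaming.backpressure.enabled", "true")
  // Optional: cap the receiving rate of the very first batch (records/sec per receiver).
  .set("spark.streaming.backpressure.initialRate", "1000")

val ssc = new StreamingContext(conf, Seconds(60))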

How does the default (unspecified) trigger determine the size of micro-batches in Structured Streaming?

When a query in Spark Structured Streaming has no trigger setting:
import org.apache.spark.sql.streaming.Trigger

// Default trigger (runs the micro-batch as soon as it can)
df.writeStream
  .format("console")
  //.trigger(???) // <--- Trigger intentionally omitted ----
  .start()
As of Spark 2.4.3 (Aug 2019), the Structured Streaming Programming Guide - Triggers says:
If no trigger setting is explicitly specified, then by default, the query will be executed in micro-batch mode, where micro-batches will be generated as soon as the previous micro-batch has completed processing.
QUESTION: On what basis does the default trigger determine the size of the micro-batches?
Let's say the input source is Kafka. The job was interrupted for a day because of an outage, then the same Spark job is restarted and consumes messages from where it left off. Does that mean the first micro-batch will be a gigantic batch with one day's worth of messages that accumulated in the Kafka topic while the job was stopped? Assuming the job takes 10 hours to process that big batch, does the next micro-batch then contain 10 hours' worth of messages, and so on for X iterations until the backlog is caught up and the micro-batches become smaller?
On what basis does the default trigger determine the size of the micro-batches?
It does not. Every trigger (however long) simply requests input datasets from all sources, and whatever they give is processed downstream by the operators. The sources know what to give because they know what has been consumed (processed) so far.
It is as if you asked about a batch structured query and the size of the data this single "trigger" requests to process (by the way, there is also a one-time trigger, Trigger.Once).
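For reference, a short sketch of requesting that one-time trigger explicitly, reusing the same (assumed) streaming DataFrame df from the snippet above:

import org.apache.spark.sql.streaming.Trigger

// Process whatever the source has available in a single micro-batch, then stop.
df.writeStream
  .format("console")
  .trigger(Trigger.Once())
  .start()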
Does that mean the first micro-batch will be a gigantic batch with 1 day of msg which accumulated in the Kafka topic while the job was stopped?
Almost (and it really has little, if anything, to do with Spark Structured Streaming itself).
The number of records the underlying Kafka consumer gets to process in a single poll is configured by max.poll.records and perhaps by some other configuration properties (see Increase the number of messages read by a Kafka consumer in a single poll).
Since Spark Structured Streaming uses the Kafka data source, which is simply a wrapper around the Kafka Consumer API, whatever happens in a single micro-batch is equivalent to a single Consumer.poll call.
You can configure the underlying Kafka consumer using options with the kafka. prefix (e.g. kafka.bootstrap.servers); these are applied to the Kafka consumers on both the driver and the executors.
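A minimal sketch of passing such kafka.-prefixed consumer properties to the Structured Streaming Kafka source; spark is the usual SparkSession, and the broker address, topic, and poll size are placeholders:

// kafka.-prefixed options are handed through to the underlying Kafka consumer.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("kafka.max.poll.records", "10000")   // the per-poll cap discussed above
  .option("subscribe", "my-topic")
  .load()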

How to run multiple Spark Streaming batch jobs at the same time?

I have been using Spark Streaming to process data in Spark 2.1.0.
9 receivers receive data at 10-second intervals.
The average processing time has been about 10 seconds since I submitted the streaming application. However, the queued batches have been delayed by more than one day.
Is the queue in the driver, or in each receiver executor?
In Active Batches, only one batch is actually processing data apart from the 9 receivers, so there are always only 10 jobs running.
I am asking how to increase the number of active batches processing data.
There is only one streaming batch job at a time. I set spark.scheduler.mode to FAIR in SparkConf and set the scheduler pool to a fair pool, but only one batch job runs at a time.
According to the Spark job scheduling guide, jobs in the same fair pool are supposed to run in FIFO order. Is this right?
How to run multiple Spark Streaming batch jobs at the same time?
Spark Streaming runs in spark-on-YARN client mode.
8-node cluster, each node: 32 cores, 128 GB
executor_memory: 6g
executor_cores: 4
driver-memory: 4g
sparkConf.set("spark.scheduler.mode", "FAIR")
ssc.sparkContext.setLocalProperty("spark.scheduler.pool", "production")
// "production" is a FAIR pool
sparkConf.set("spark.dynamicAllocation.enabled", "false")
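A tidied sketch of the scheduler configuration described above; the allocation-file path is a placeholder, and the "production" pool is assumed to be defined in that fairscheduler.xml:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf()
  .setAppName("multi-batch-streaming")
  .set("spark.scheduler.mode", "FAIR")                                   // fair scheduling between jobs
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")  // where the pools are defined
  .set("spark.dynamicAllocation.enabled", "false")

val ssc = new StreamingContext(sparkConf, Seconds(10))   // 10-second batch interval

// Jobs submitted from this thread go into the "production" pool.
// Note: within a single pool, jobs still run in FIFO order unless the pool's
// schedulingMode is set to FAIR in fairscheduler.xml.
ssc.sparkContext.setLocalProperty("spark.scheduler.pool", "production")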

Spark Streaming - Kafka direct stream - Can finished batches re-run after restoring from a checkpoint?

Consider a scenario where I have enabled concurrent jobs, so my batches can execute in any order and do not wait for previous batches to finish. What happens when a driver failure occurs while the batch at time t is still executing and the batch at t+1 has already finished execution? Assuming that checkpointing is enabled, does my job relaunch only the pending batch at time t and not bother about the batch at time t+1? Or does it consider the batch at time t+1 incomplete as well? I am interested in knowing this because I would like my output operations on the stream to write data in the same order as the input.

Spark Streaming with Kafka: when recovering from checkpointing, all data is processed in only one micro-batch

I'm running a Spark Streaming application that reads data from Kafka.
I have activated checkpointing to recover the job in case of failure.
The problem is that if the application fails, when it restarts it tries to process all the data since the point of failure in a single micro-batch.
This means that if a micro-batch usually receives 10,000 events from Kafka, and the application fails and restarts after 10 minutes, it has to process one micro-batch of 100,000 events.
So if I want the recovery with checkpointing to succeed, I have to assign much more memory than I normally would.
Is it normal that, when restarting, Spark Streaming tries to process all the past events from the checkpoint at once, or am I doing something wrong?
Many thanks.
If your application finds it difficult to process all events in one micro-batch after recovering from a failure, you can provide the spark.streaming.kafka.maxRatePerPartition configuration, either in spark-defaults.conf or inside your application.
i.e. if you believe your system/app can safely handle 10K events per minute, and your Kafka topic has 2 partitions, add this line to spark-defaults.conf:
spark.streaming.kafka.maxRatePerPartition 5000
or add it inside your code:
val conf = new SparkConf()
conf.set("spark.streaming.kafka.maxRatePerPartition", "5000")
Additionally, I suggest you set this number a little bit higher and enable backpressure. This will try to stream data at a rate that doesn't destabilize your streaming app.
conf.set("spark.streaming.backpressure.enabled","true")
Update: there was a mistake above; the configuration is the number of records per second, not per minute.
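To make the arithmetic concrete, a sketch combining both settings with the corrected per-second unit; the app name and the 10-second batch interval are assumptions for illustration only:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("bounded-recovery")
  // 5000 records/sec per partition x 2 partitions = 10K records/sec overall.
  .set("spark.streaming.kafka.maxRatePerPartition", "5000")
  // Let Spark throttle further if batches start falling behind.
  .set("spark.streaming.backpressure.enabled", "true")

// With an assumed 10-second batch interval, the first batch after recovery is
// capped at roughly 5000 records/sec x 2 partitions x 10 s = 100,000 records,
// rather than the entire backlog landing in one micro-batch.
val ssc = new StreamingContext(conf, Seconds(10))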

Resources