We are using Spark Structured Streaming with foreachBatch to update records in a Delta table. The number of records in each micro-batch appears to be random: we have 10,000 records in the Kinesis stream, but when creating a micro-batch it picks up a seemingly arbitrary number of records, sometimes 800, sometimes 500, and so on. This makes processing take a long time, because ideally it should create 2 batches of 5,000 each.
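The foreachBatch body is essentially this kind of upsert, shown here as a minimal sketch (the table path and the join key "id" are placeholders, not our real values):

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.DataFrame

// Upsert one micro-batch into the Delta table.
// Table path and join key ("id") are placeholders.
def upsertToDelta(batchDF: DataFrame, batchId: Long): Unit = {
  val target = DeltaTable.forPath(batchDF.sparkSession, "/delta/target_table")
  target.as("t")
    .merge(batchDF.as("s"), "t.id = s.id")
    .whenMatched().updateAll()
    .whenNotMatched().insertAll()
    .execute()
}

// streamingDF.writeStream.foreachBatch(upsertToDelta _).start()
```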
Related
I'm reading data from Kafka in a batch fashion using readStream, then doing some transformations and writing the data out using foreachBatch and writeStream.
I have a use case where I need to hold the job for some time, so I want to limit the job to x number of batches. Is this possible in Spark Structured Streaming? Specifically, Spark 2.4.8.
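Something along these lines is what I have in mind, sketched very roughly (the broker, topic, and batch limit are made-up values, and the batch counter is simply tracked on the driver):

```scala
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder.appName("limited-batches").getOrCreate()

val maxBatches = 5                       // placeholder for "x"
val processed  = new AtomicInteger(0)

val source = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
  .option("subscribe", "input-topic")                 // placeholder
  .load()

val query = source.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // transformations and the actual write would go here
    batchDF.count()
    processed.incrementAndGet()
  }
  .start()

// Block on the driver until x batches have completed, then stop the query.
while (query.isActive && processed.get() < maxBatches) {
  Thread.sleep(1000)
}
if (query.isActive) query.stop()
```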
I want to log how much time it takes to fetch data from the streaming source (Kafka, in my case), and getBatch is one metric logged by the MicroBatchExecution class which reports the time it took to fetch data from the streaming source.
I am reading data from Kafka and, inside the foreachBatch loop, I am just printing the count of each micro-batch. What I observe is that for all micro-batches getBatch is either 0 or 1 ms, whereas addBatch is in the thousands (11516, 9244, 8626 ms, etc.).
It looks like getBatch is always 0 or 1 ms because it does not capture the correct time, due to lazy evaluation in Spark.
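For reference, these durations can be read per batch from StreamingQueryProgress.durationMs with a listener; a minimal sketch:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark = SparkSession.builder.getOrCreate()

// Print the per-batch durations reported in StreamingQueryProgress.durationMs.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val d = event.progress.durationMs   // java.util.Map[String, java.lang.Long]
    println(s"batch=${event.progress.batchId} " +
      s"getBatch=${d.get("getBatch")} ms addBatch=${d.get("addBatch")} ms")
  }
})
```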
We wrote a Spark Streaming application that receives Kafka messages (backpressure enabled and spark.streaming.kafka.maxRatePerPartition set), maps the DStream into a Dataset, and writes these Datasets to Parquet files (inside DStream.foreachRDD) at the end of every batch.
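The write path looks roughly like this (the broker, topic, group id, and output path are simplified placeholders, not our real values):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.{Seconds, StreamingContext}

val spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()
import spark.implicits._

val ssc = new StreamingContext(spark.sparkContext, Seconds(30))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker:9092",                                                // placeholder
  "key.deserializer"   -> classOf[org.apache.kafka.common.serialization.StringDeserializer],
  "value.deserializer" -> classOf[org.apache.kafka.common.serialization.StringDeserializer],
  "group.id"           -> "parquet-writer"                                              // placeholder
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

stream.foreachRDD { rdd =>
  // Map the raw message values into a Dataset and append Parquet files for this batch.
  val ds = rdd.map(_.value()).toDS()
  ds.write.mode(SaveMode.Append).parquet("/warehouse/events_parquet")                   // placeholder
}

ssc.start()
ssc.awaitTermination()
```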
At the beginning, everything seems fine: Spark Streaming processing time is around 10 seconds for a 30-second batch interval. The amount of produced Kafka messages is a bit less than the amount of messages we consume in our Spark application, so there's no backpressure needed (in the beginning). The Spark job creates many Parquet files inside our Spark warehouse HDFS directory (x partitions => x Parquet files per batch), as expected.
Everything runs just fine for hours, but after around 12-14 hours our processing time increases rapidly, e.g. it jumps from the normal 10 seconds to more than 1 minute from one batch to the next. This of course leads to a huge batch queue after a short time.
We saw similar results for 5-minute batches (processing time is around 1.5 minutes there and suddenly increases to more than 10 minutes per batch after a period of time).
Similar results also occurred when we wrote ORC instead of Parquet files.
Since the batches can run independently, we do not use the checkpointing feature of Spark Streaming.
We're using the Hortonworks Data Platform 3.1.4 with Spark 2.3.2 and Kafka 2.0.0.
Is this a known problem in Spark Streaming? Are there any dependencies on "old" batches for Parquet/ORC tables? Or is this a general file-based or Hadoop-based problem? Thanks for your help.
I have a situation where I collect data from AWS Kinesis into Apache Spark via streaming. After I receive the data for a batch duration, I process it and update Cassandra. The processing should be done in such a way that, until the result has been updated in Cassandra, Spark does not receive the next batch of records.
So, how do I halt the streaming of the next batch of records until the current batch has been processed?
Spark Streaming does not support this type of functionality. You can simply check the row count after you receive data from Kinesis for each batch; if there are no records (count equal to zero), don't call the Cassandra update and insertion API.
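A minimal sketch of that check inside foreachRDD, assuming the stream has already been parsed into a case class and spark-cassandra-connector is on the classpath (the keyspace, table, and record type are made up):

```scala
import com.datastax.spark.connector._            // spark-cassandra-connector
import org.apache.spark.streaming.dstream.DStream

case class Reading(id: String, value: Double)    // made-up record type

// Skip the Cassandra write when a micro-batch is empty; otherwise save it.
// Keyspace and table names are placeholders.
def writeBatches(readings: DStream[Reading]): Unit = {
  readings.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      rdd.saveToCassandra("my_keyspace", "readings")
    }
  }
}
```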
Is there a way to define the batches in Spark Streaming such that each RDD represents a single record rather than the data of a time interval?