Spark Streaming: Many queued batches after a long time running without problems - apache-spark

We wrote a Spark Streaming application that receives Kafka messages (backpressure enabled and spark.streaming.kafka.maxRatePerPartition set), maps the DStream into a Dataset, and writes these Datasets to Parquet files (inside DStream.foreachRDD) at the end of every batch.
At the beginning, everything seems fine: Spark Streaming processing time is around 10 seconds for a 30-second batch interval. The number of Kafka messages produced is a bit less than the number of messages we consume in our Spark application, so no backpressure is needed (in the beginning). The Spark job creates many Parquet files inside our Spark warehouse HDFS directory (x partitions => x Parquet files per batch), as expected.
Everything runs fine for hours, but after around 12-14 hours our processing time increases rapidly, e.g. it jumped from the normal 10 seconds to more than 1 minute from one batch to the next. This of course leads to a huge batch queue after a short time.
We saw similar results for 5-minute batches (processing time is around 1.5 minutes here and suddenly increases to more than 10 minutes per batch after a period of time).
Similar results also occurred when we wrote ORC instead of Parquet files.
Since the batches can run independently, we do not use the checkpointing feature of Spark Streaming.
We're using Hortonworks Data Platform 3.1.4 with Spark 2.3.2 and Kafka 2.0.0.
Is this a known problem in Spark Streaming? Are there any dependencies on "old" batches for Parquet/ORC tables? Or is this a general file-based or Hadoop-based problem? Thanks for your help.
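
To put a number on the "x partitions => x Parquet files per batch" pattern described in the question, here is a small arithmetic sketch. It only counts files and does not claim that the file count causes the slowdown. The partition count of 10 is a hypothetical stand-in for "x", which the question does not state.

```scala
object ParquetFileCount {
  /** Number of completed batches in a time window. */
  def batches(hours: Long, batchIntervalSeconds: Long): Long =
    hours * 3600L / batchIntervalSeconds

  /** Files written, assuming one file per partition per batch as the question describes. */
  def filesWritten(hours: Long, batchIntervalSeconds: Long, partitions: Int): Long =
    batches(hours, batchIntervalSeconds) * partitions

  def main(args: Array[String]): Unit = {
    val partitions = 10 // hypothetical value for "x"; not given in the question
    // 30-second batches over 12 hours
    println(filesWritten(12L, 30L, partitions))
    // 5-minute batches over 12 hours
    println(filesWritten(12L, 300L, partitions))
  }
}
```

With these assumed inputs, 12 hours of 30-second batches is 1440 batches, so the directory holds 1440 times the partition count in files by the time the slowdown is reported.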

Related

Spark-Streaming Application Optimisation using Repartition

I am trying to optimize my Spark Streaming application, and I was able to speed it up with repartition. However, I do not understand how exactly repartition works here and why it speeds up the streaming process.
Can anyone help me understand the scenario below?
I have created two Kafka topics, say SrcTopic and DestTopic, with 6 partitions each. My streaming application processes data from SrcTopic to DestTopic with a batch interval of 5 minutes and maxOffsetsPerTrigger set to 10000, so the application processes data every 5 minutes, takes at most 10K records per batch, and produces them to DestTopic. This works as expected and takes on average 250-300 seconds to process one complete batch (consume from SrcTopic and produce to DestTopic).
Next, I updated my Spark Streaming job, deleted the checkpoints, and processed data again for the same source and destination (all topic configurations are exactly the same; I am using the same topics as in the first run). The only change is that before writing the data to DestTopic I repartition my DataFrame (df.repartition(6)) and then sink it into the Kafka topic. This run also uses a batch interval of 5 minutes and maxOffsetsPerTrigger of 10000, so the application again processes data every 5 minutes and takes at most 10K records per batch. This also works as expected, but takes on average only 25-30 seconds to process one complete batch (consume from SrcTopic and produce to DestTopic).
Now my question is the following.
In the first and the second run, the number of partitions is exactly the same.
Both runs have 6 partitions in SrcTopic and DestTopic.
I checked the record count of each partition (0, 1, 2, 3, 4, 5); it is the same in both cases (with and without repartition).
Both applications run with exactly the same configuration.
What extra work is repartition doing here, such that the batch takes 10 times less time than without it?
Can you help me understand this?

Checkpoint takes long time in a Spark Job

I have a Spark job (batch) with a checkpoint that takes over 3 hours to finish, and the checkpoint appears over 30 times in the Spark UI:
I tried deleting the checkpoint from the code, and something similar happens: there is a 3-hour gap between one job and the next.
The data is not very big: the job just reads from 6 tables with no more than 3 GB of data, and it runs on a Cloudera platform (YARN).
I have already tried using more shuffle partitions and parallelism, and also fewer, but it doesn't help. I also tried changing the number of executors, but nothing changed...
What do you think is happening?
I was finally able to solve it.
The problem was that the input Hive table had just 5 partitions (5 Parquet files), so the job was working all the time with just 5 partitions.
Adding .repartition(100) after reading solved the problem and sped up the process from 5h to 40 min.
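
The fix described in this answer can be sketched roughly as follows. This is a minimal illustration, not the poster's actual job: spark.range with 5 partitions stands in for the Hive table backed by 5 Parquet files, and the object name, the widen helper, and the local master setting are all assumptions made for the example.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object RepartitionAfterRead {
  /** Redistribute the rows over more partitions so more tasks can run in parallel. */
  def widen[T](input: Dataset[T], targetPartitions: Int): Dataset[T] =
    input.repartition(targetPartitions)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("repartition-after-read")
      .master("local[4]") // assumption: local mode, for demonstration only
      .getOrCreate()

    // Stand-in for the input table: 5 partitions, like a table with 5 Parquet files.
    val input = spark.range(0L, 1000000L, 1L, 5)
    println(s"partitions after read: ${input.rdd.getNumPartitions}")

    // Same call as in the answer: repartition right after reading.
    val widened = widen(input, 100)
    println(s"partitions after repartition: ${widened.rdd.getNumPartitions}")

    spark.stop()
  }
}
```

The repartition adds a shuffle, so it pays off when the downstream work is heavy enough that the extra parallelism outweighs the cost of moving the data, as was apparently the case here.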

How to stream 100GB of data in Kafka topic?

In one of our Kafka topics, there is close to 100 GB of data.
We are running Spark Structured Streaming to get the data into S3.
When the data is up to 10 GB, streaming runs fine and we are able to get the data into S3.
But with 100 GB, it is taking forever to stream the data from Kafka.
Question: How does Spark Streaming read data from Kafka?
Does it take the entire data from the current offset?
Or does it take it in batches of some size?
Spark will work off consumer groups, just as any other Kafka consumer, but in batches. Therefore it takes as much data as possible (based on various Kafka consumer settings) from the last consumed offsets. In theory, if you have the same number of partitions, with the same commit interval as 10 GB, it should only take 10x longer to do 100 GB. You've not stated how long that currently takes, but to some people 1 minute vs 10 minutes might seem like "forever", sure.
I would recommend you plot the consumer lag over time using the kafka-consumer-groups command line tool combined with something like Burrow or Remora... If you notice an upward trend in the lag, then Spark is not consuming records fast enough.
To overcome this, the first option would be to ensure that the Spark executors are evenly consuming all Kafka partitions.
You'll also want to be making sure you're not doing major data transforms other than simple filters and maps between consuming and writing the records, as this also introduces lag.
For non-Spark approaches, I would like to point out that the Confluent S3 connector is also batch-y in that it'll only periodically flush to S3, but the consumption itself is still closer to real-time than Spark. I can verify that it's able to write very large S3 files (several GB in size), though, if the heap is large enough and the flush configurations are set to large values.
Secor by Pinterest is another option that requires no manual coding.
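
On the "entire data or batches of some size" part of the question: in Structured Streaming the Kafka source places no limit on micro-batch size by default, and the maxOffsetsPerTrigger option caps how many offsets one micro-batch reads. The sketch below shows one way to set that up. The broker address, topic name, bucket paths, cap value, and trigger interval are placeholders chosen for illustration, and the code assumes the spark-sql-kafka-0-10 package is on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger

object BoundedKafkaToS3 {
  /** Kafka source whose micro-batches read at most `maxOffsets` offsets in total. */
  def boundedKafkaSource(spark: SparkSession,
                         bootstrapServers: String,
                         topic: String,
                         maxOffsets: Long): DataFrame =
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", bootstrapServers)
      .option("subscribe", topic)
      .option("startingOffsets", "earliest") // start from the beginning of the backlog
      .option("maxOffsetsPerTrigger", maxOffsets) // cap per micro-batch, split across partitions
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-to-s3").getOrCreate()

    // Placeholder broker, topic, and cap; tune the cap to what one batch can handle.
    val records = boundedKafkaSource(spark, "broker:9092", "big-topic", 500000L)

    val query = records.writeStream
      .format("parquet")
      .option("path", "s3a://example-bucket/output/") // placeholder path
      .option("checkpointLocation", "s3a://example-bucket/checkpoints/") // placeholder path
      .trigger(Trigger.ProcessingTime("1 minute"))
      .start()

    query.awaitTermination()
  }
}
```

With a cap in place, a 100 GB backlog is drained as a sequence of bounded micro-batches rather than one very large first batch, which also makes the consumer lag trend easier to read.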

How to write Spark streaming calculated results to HDFS?

I am writing a Spark Streaming job and my batch window is 1 minute. At regular intervals of 30 minutes I want to write something to HDFS.
Can I do that in Spark Streaming?
If yes, how?
I don't want to write in each Spark Streaming batch, as that would create too many files on HDFS.
I am getting an input stream and adding only records I have not seen earlier to an RDD (or DataFrame); then, at the end of each 30-minute interval, I want to write that to HDFS.
The current solution I have in mind is:
Use updateStateByKey
Use Checkpoint with huge interval
Just wondering what the standard pattern is in such use cases.
Thanks,

Spark Streaming with Kafka: when recovering from a checkpoint, all data is processed in only one micro-batch

I'm running a Spark Streaming application that reads data from Kafka.
I have activated checkpointing to recover the job in case of failure.
The problem is that if the application fails, when it restarts it tries to process all the data since the point of failure in a single micro-batch.
This means that if a micro-batch usually receives 10.000 events from Kafka and the application fails and restarts after 10 minutes, it will have to process one micro-batch of 100.000 events.
Now, if I want recovery from the checkpoint to succeed, I have to assign much more memory than I normally would.
Is it normal that, when restarting, Spark Streaming tries to execute all the past events from checkpointing at once or am I doing something wrong?
Many thanks.
If your application finds it difficult to process all events in one micro-batch after recovering from a failure, you can set the spark.streaming.kafka.maxRatePerPartition configuration in the Spark configuration, either in spark-defaults.conf or inside your application.
For example, if you believe your system/app can safely handle 10K events per second and your Kafka topic has 2 partitions, add this line to spark-defaults.conf:
spark.streaming.kafka.maxRatePerPartition 5000
or add it inside your code:
val conf = new SparkConf()
conf.set("spark.streaming.kafka.maxRatePerPartition", "5000")
Additionally, I suggest you set this number a little higher and enable backpressure. Spark will then try to consume data at a rate that doesn't destabilize your streaming app.
conf.set("spark.streaming.backpressure.enabled","true")
Update: corrected a mistake. The configuration is the maximum number of records per second (per partition), not per minute.
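
As a rough worked example of how this limit bounds a recovery batch: with the direct Kafka stream, one batch reads at most maxRatePerPartition x number of partitions x batch interval in seconds. The sketch below applies that to the numbers in this thread. The 1-second batch interval is an assumption, since the question does not state it, and the object and function names are made up for illustration.

```scala
object RecoveryBatchEstimate {
  /** Upper bound on records per batch under spark.streaming.kafka.maxRatePerPartition. */
  def maxRecordsPerBatch(maxRatePerPartition: Long,
                         partitions: Int,
                         batchIntervalSeconds: Long): Long =
    maxRatePerPartition * partitions * batchIntervalSeconds

  /** Batches needed to drain a backlog, ignoring records that arrive in the meantime. */
  def batchesToDrain(backlog: Long, perBatchCap: Long): Long =
    (backlog + perBatchCap - 1) / perBatchCap // ceiling division

  def main(args: Array[String]): Unit = {
    // 5000 records/s per partition, 2 partitions, assumed 1-second batch interval.
    val cap = maxRecordsPerBatch(5000L, 2, 1L)
    println(s"cap per batch: $cap")
    // Backlog of 100.000 events, as in the question.
    println(s"batches to drain: ${batchesToDrain(100000L, cap)}")
  }
}
```

Under these assumptions the backlog is spread over about ten normally sized batches instead of one oversized batch, so the memory requirement during recovery stays close to that of normal operation.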

Resources