Does Spark expose a global watermark gap metric, so that I can access the exact values shown in the Spark UI's global watermark gap chart?
I think the answer is "NO".
I have found this PR: https://github.com/apache/spark/pull/30427/files
I would rather not reimplement the global watermark gap calculation from scratch, so how can I derive it from the metrics Spark already exposes?
The goal is to set a threshold and send an alert on the global watermark gap (GWG).
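One possible way to derive the gap is sketched below. It assumes, as the linked PR appears to do, that the gap is the batch timestamp minus the event-time watermark, and that both are read from a progress dictionary such as the one PySpark's query.lastProgress returns (fields "timestamp" and "eventTime.watermark"). The helper names and the sample values are hypothetical.

```python
from datetime import datetime, timezone

# Timestamp layout used in streaming query progress JSON,
# e.g. "2021-01-01T00:05:00.000Z" (always UTC).
_TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def _parse(ts):
    """Parse a progress timestamp string into an aware UTC datetime."""
    return datetime.strptime(ts, _TS_FORMAT).replace(tzinfo=timezone.utc)


def watermark_gap_seconds(progress):
    """Return batch timestamp minus watermark, in seconds.

    Returns None when the progress record carries no watermark
    (for example, when the query does not use withWatermark).
    """
    watermark = progress.get("eventTime", {}).get("watermark")
    if watermark is None:
        return None
    return (_parse(progress["timestamp"]) - _parse(watermark)).total_seconds()


def should_alert(progress, threshold_seconds):
    """True when a gap is available and exceeds the (hypothetical) threshold."""
    gap = watermark_gap_seconds(progress)
    return gap is not None and gap > threshold_seconds


# Hand-written sample record, not real Spark output.
sample = {
    "timestamp": "2021-01-01T00:05:00.000Z",
    "eventTime": {"watermark": "2021-01-01T00:03:30.000Z"},
}
print(watermark_gap_seconds(sample))
```

In a running job, the same function could be called on each progress record (for instance from a polling loop over query.lastProgress) and the result pushed to whatever alerting system is in use.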
Related
To optimize performance, we want neither too many nor too few data partitions, so during an autoscaling event we'd like to adjust spark.default.parallelism and spark.sql.shuffle.partitions dynamically. Where can these be set so that they take effect on the Spark cluster (running on AWS EMR)?
I've tried setting them dynamically on the driver node, but in the Environment tab the spark.default.parallelism and spark.sql.shuffle.partitions values don't change. I suspect I have to do this on the master node, since the Spark UI runs on the master node.
If these values do have to be set on the master node, how can that be done from within a Spark application, or does it have to be done elsewhere?
What is the best de-duplication strategy to use with Spark?
I have a Kafka source that is continuously fed with structured information (say, JSON) from various producers.
I have an HDInsight Spark cluster that picks up messages from this Kafka source in real time, processes them, and writes them to a destination Kafka topic, also in real time.
In my use case, the information received from the source may contain duplicates, which need to be eliminated. Duplicates have to be checked against, say, the last 24 hours.
My attempt:
I tried using the dropDuplicates method in Spark along with watermarking, but I don't think it's the best approach, since the data for a single-day window may exceed 50 GB in my use case.
I also looked for a Bloom filter implementation that can be used with Spark, but couldn't find a good one.
My question:
What are the possible approaches to eliminating duplicates in general for a large-scale Spark streaming application?
Which of these features can be used with HDInsight clusters on Azure?
What fault-tolerance capabilities do such services have?
As far as I understand, the data size varies with the window interval and the slide interval, and large intervals such as a week or more (a monthly interval is not allowed) might affect performance, since the actual data is stored in the RDDs of a DStream.
Do the window and slide intervals affect the performance of a Spark Streaming application? If so, what are the ways to fine-tune performance and the intervals?
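To make the intuition above concrete, here is a rough back-of-the-envelope sketch. It assumes a constant input rate and relies on the documented constraint that the window and slide durations must be multiples of the batch interval. The function name and the numbers are hypothetical, and the result is only an estimate of how much data a window spans, not a measurement of Spark's actual memory use.

```python
def window_stats(batch_interval_s, window_s, slide_s, records_per_second):
    """Estimate how much data a sliding window spans.

    All durations are in seconds. Assumes a steady input rate.
    """
    if window_s % batch_interval_s or slide_s % batch_interval_s:
        raise ValueError("window and slide must be multiples of the batch interval")
    return {
        # Number of batch RDDs that make up one window.
        "batches_per_window": window_s // batch_interval_s,
        # Records covered by a single window at the assumed rate.
        "records_per_window": window_s * records_per_second,
        # On average, how many overlapping windows each record belongs to.
        "windows_per_record": window_s / slide_s,
    }


# Hypothetical job: 10 s batches, 1 h window sliding every 10 min, 1000 records/s.
stats = window_stats(10, 3600, 600, 1000)
print(stats)
```

A larger window raises the first two figures, while a shorter slide raises the third, which is one way to reason about why very long windows with short slides get expensive.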
Is it possible to limit the size of the batches returned by the Kafka consumer for Spark Streaming?
I am asking because the first batch I get has hundreds of millions of records, and it takes ages to process and checkpoint them.
I think your problem can be solved by Spark Streaming Backpressure.
Check spark.streaming.backpressure.enabled and spark.streaming.backpressure.initialRate.
By default, spark.streaming.backpressure.initialRate is not set and spark.streaming.backpressure.enabled is false, so I suppose Spark will take as much as it can.
From the Apache Spark Kafka configuration:
spark.streaming.backpressure.enabled:
This enables the Spark Streaming to control the receiving rate based
on the current batch scheduling delays and processing times so that
the system receives only as fast as the system can process.
Internally, this dynamically sets the maximum receiving rate of
receivers. This rate is upper bounded by the values
spark.streaming.receiver.maxRate and
spark.streaming.kafka.maxRatePerPartition if they are set (see below).
And since you want to control the first batch, or more specifically the number of messages in the first batch, I think you need spark.streaming.backpressure.initialRate.
spark.streaming.backpressure.initialRate:
This is the initial maximum receiving rate at which each receiver will
receive data for the first batch when the backpressure mechanism is
enabled.
This is useful when your Spark job (that is, your Spark workers as a whole) can process, let's say, 10000 messages from Kafka, but the Kafka brokers hand your job 100000 messages.
You may also want to look at spark.streaming.kafka.maxRatePerPartition, and at the research and suggestions for these properties, based on a real example, that Jeroen van Wilgenburg published on his blog.
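As a minimal sketch of how the properties discussed above might be passed to a job, the snippet below renders them as spark-submit --conf flags. The property names are the ones from the Spark configuration; the values and the helper function are hypothetical and need tuning for your own workload.

```python
# Hypothetical values; tune them to what your job can actually sustain.
backpressure_conf = {
    "spark.streaming.backpressure.enabled": "true",
    "spark.streaming.backpressure.initialRate": "10000",
    "spark.streaming.kafka.maxRatePerPartition": "1000",
}


def to_submit_flags(conf):
    """Render a {property: value} mapping as a list of spark-submit arguments."""
    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


# Prints the flags on one line, ready to paste after `spark-submit`.
print(" ".join(to_submit_flags(backpressure_conf)))
```

The same key/value pairs could equally be set on a SparkConf before the streaming context is created.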
In addition to the answers above: the batch size is the product of three parameters:
batchDuration: The time interval at which streaming data will be divided into batches (in Seconds).
spark.streaming.kafka.maxRatePerPartition: set the maximum number of messages per partition per second. This when combined with batchDuration will control the batch size. You want the maxRatePerPartition to be set, and large (otherwise you are effectively throttling your job) and batchDuration to be very small.
Number of partitions in the Kafka topic
For a better explanation of how this product works when backpressure is enabled or disabled, see "set spark.streaming.kafka.maxRatePerPartition for createDirectStream".
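The product of the three parameters above can be worked through as a small sketch. The function name and the example numbers are hypothetical, and the result is an upper bound on the records per batch rather than a guaranteed size.

```python
def max_records_per_batch(batch_duration_s, max_rate_per_partition, num_partitions):
    """Upper bound on records per batch for a direct Kafka stream.

    batch_duration_s:        batchDuration in seconds
    max_rate_per_partition:  spark.streaming.kafka.maxRatePerPartition
                             (messages per partition per second)
    num_partitions:          number of partitions in the Kafka topic
    """
    return batch_duration_s * max_rate_per_partition * num_partitions


# Hypothetical setup: 10 s batches, 1000 msg/s per partition, 12 partitions.
print(max_records_per_batch(10, 1000, 12))  # 120000
```

With backpressure enabled, the effective rate can drop below maxRatePerPartition, so the actual batch may be smaller than this bound.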
Limiting the maximum batch size greatly helps to control the processing time; however, it increases the processing latency of messages.
By setting the properties below, you can control the batch size:
spark.streaming.receiver.maxRate=
spark.streaming.kafka.maxRatePerPartition=
You can even have the batch size adjusted dynamically based on processing time by enabling backpressure:
spark.streaming.backpressure.enabled:true
spark.streaming.backpressure.initialRate:
I am relatively new to Spark. I need to find out whether there is a way to see which DataFrame is being accessed at what time. Can this be achieved with native Spark logging?
If so, how do I implement it?
The DAG Visualization and Event Timeline are two very important built-in Spark tools, available since Spark 1.4, that you can use to see which DataFrame/RDD is used and in which steps. For more details, see "Understanding your Spark application through visualization".