I use Spark with the Cassandra Spark connector and direct Kafka.
I see the batch processing time increasing slowly over time, even when there is nothing incoming from Kafka to process.
It is only a few milliseconds per batch, but after a long time a batch can take several seconds more, until it reaches the batch interval and finally crashes.
At first I thought it was a memory leak, but in that case I would expect the processing time to grow exponentially rather than linearly.
I don't really know whether the stages themselves become longer and longer or whether the latency
between stages increases.
I use Spark 1.4.0.
Any pointers about this?
EDIT:
I took a closer look at the evolution of each batch's processing time and compared it with the total job processing time.
It appears that even though the batch processing time increases, the job processing times do not.
Example: for a batch that takes 7s, the sum of the job processing times is 1.5s (as shown in the image below).
Is it because the driver-side computation time increases, and not the executor-side computation time?
And is this driver time simply not shown in the job processing UI?
If that's the case, how can I correct it?
I finally found the solution to my problem.
I had this code in the function that adds the filters and transformations to my RDD:
TypeConverter.registerConverter(new SomethingToOptionConverter[EventCC])
TypeConverter.registerConverter(new OptionToSomethingConverter[EventCC])
Because this function is called for every batch, the same objects end up registered many times inside TypeConverter.
I don't really know exactly how the Cassandra Spark converters work, but they seem to use reflection internally on these objects.
Doing that slow reflection over an ever-growing list of registered converters, on every batch, makes the overall batch processing time keep increasing.
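One way to avoid the repeated registration is to register the converters exactly once, outside the per-batch code. A minimal sketch of that idea, assuming the EventCC class and the two converter classes from above; the wrapping object and its lazy val are just one possible way to guarantee single registration:

import com.datastax.spark.connector.types.TypeConverter

object ConverterRegistration {
  // Registers the custom converters a single time, instead of once per batch.
  lazy val init: Unit = {
    TypeConverter.registerConverter(new SomethingToOptionConverter[EventCC])
    TypeConverter.registerConverter(new OptionToSomethingConverter[EventCC])
  }
}

// Call ConverterRegistration.init once before starting the StreamingContext,
// rather than inside the function that builds the per-batch transformations.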
Related
I am running a Spark application where data comes in every minute. The number of repartitions I am doing is 48. It runs on 12 executors with 4 GB executor memory and executor-cores=4.
Below are the streaming batch processing times.
Here we can see that some batches take around 20 seconds while others take around 45 seconds.
I drilled down further into one of the batches that takes less time. Below is the image.
And here is one that takes more time.
Here we can see that more time is spent in the repartitioning task, whereas the batch above did not spend much time repartitioning. This happens every 3-4 batches. The data comes from a Kafka stream and has only a value, no key.
Is there any reason related to the Spark configuration?
Try reducing "spark.sql.shuffle.partitions"; the default value is 200, which is overkill. Reduce the value and analyse the performance.
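A minimal sketch of where this setting can be applied, assuming Spark 2.x with a SparkSession (on older versions the same key can be set via sqlContext.setConf); the value 48 is only an example matching the 48 repartitions mentioned in the question, not a recommendation:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-partitions-example") // hypothetical app name
  // Number of partitions used for shuffles triggered by Spark SQL operations.
  .config("spark.sql.shuffle.partitions", "48")
  .getOrCreate()

// The value can also be changed at runtime and inspected:
spark.conf.set("spark.sql.shuffle.partitions", "48")
println(spark.conf.get("spark.sql.shuffle.partitions"))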
I have a Spark job which reads, deduplicates and joins datasets stored in S3. The stored data is in ORC format and zlib compressed. In the first stage (the reading and deduplicating part), a small number of straggler tasks take up a great amount of time to complete. I analysed the metrics and found the following:
The tasks are processing nearly the same amount of data.
Shuffle writes for the tasks are nearly the same.
GC duration for each task is negligible.
Please find some screenshots for reference. One of the screenshots shows the metrics. The other depicts the time taken (30 min / 4.1 min) for two tasks with barely any difference in shuffle writes (9.2 MB / 10.3 MB) or data skew (6.4M / 7.2M) and without any considerable GC lag (5 s / 1 s).
I am lost here, no idea what could be causing this to happen. Any help would be appreciated.
Best Regards
Note: IPs have been removed from the fifth column in the second image.
I use createDirectStream in my Spark Streaming application. I set the batch interval to 7 seconds, and most of the time the batch job finishes within about 5 seconds. However, in very rare cases a batch job takes 60 seconds, and this delays the following batches.
To cut down the total delay, I would like to process, in one go, more of the streaming data that has spread over the delayed jobs. This would help the stream return to normal as soon as possible.
So, I want to know whether there is a method to dynamically update/merge the input batch size for Spark and Kafka when a delay appears.
You can set the "spark.streaming.backpressure.enabled" option to true.
If a batch delay occurs while backpressure is enabled, the stream initially starts with a small batch size and then dynamically grows to a larger batch size.
See the Spark configuration documentation; the relevant description is quoted below.
Enables or disables Spark Streaming's internal backpressure mechanism
(since 1.5). This enables the Spark Streaming to control the receiving
rate based on the current batch scheduling delays and processing times
so that the system receives only as fast as the system can process.
Internally, this dynamically sets the maximum receiving rate of
receivers. This rate is upper bounded by the values
spark.streaming.receiver.maxRate and
spark.streaming.kafka.maxRatePerPartition if they are set (see below).
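A minimal sketch of how these settings might be applied for a direct Kafka stream, assuming Spark 1.x; the application name, the rate limit value and the 7-second interval are illustrative only:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("backpressure-example") // hypothetical app name
  // Let Spark adapt the ingestion rate to the observed scheduling delay and processing time.
  .set("spark.streaming.backpressure.enabled", "true")
  // Optional upper bound on the rate, in records per second per Kafka partition.
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")

// 7-second batch interval, as in the question.
val ssc = new StreamingContext(conf, Seconds(7))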
Running a Spark Streaming job, I have encountered the following behavior more than once. Processing starts well: the processing time for each batch is well below the batch interval. Then suddenly, the input rate drops to near zero. See these graphs.
This happens even though the program could keep up, and it slows execution down considerably. I believe the drop happens when there is not much unprocessed data left, but because the rate is so low, these final records take up most of the time needed to run the job. Is there any way to avoid this and speed up?
I am using PySpark with Spark 1.6.2 and using the direct approach for Kafka streaming. Backpressure is turned on and there is a maxRatePerPartition of 100.
Setting backpressure is more meaningful with older Spark Streaming versions, where you need receivers to consume the messages from a stream. Since Spark 1.3 there is the receiver-less "direct" approach, which ensures stronger end-to-end guarantees. So you do not need to worry about backpressure, as Spark does most of the fine tuning.
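For reference, a minimal Scala sketch of the receiver-less direct approach referred to above, using the Kafka 0.8 integration that matches Spark 1.6 (the broker address and topic name are placeholders; the question itself uses PySpark, where KafkaUtils.createDirectStream is analogous):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("direct-stream-example")
val ssc = new StreamingContext(conf, Seconds(10))

// Receiver-less direct stream: Spark tracks the Kafka offsets itself, and the
// ingestion rate can still be capped with spark.streaming.kafka.maxRatePerPartition.
val kafkaParams = Map[String, String]("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))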
When I run a Spark Streaming application, the processing time shows strange behavior, even when there is no incoming data. Processing times are not near zero, and steadily increase until they reach the batch interval value of 10 seconds. Then they suddenly drop to a minimum.
Is there an explanation for this strange behavior? I am aware of this question, but I am not using Mesos; I am using YARN. I have seen similar behavior multiple times with multiple applications.