In our Spark app we're consuming a Kafka stream and storing the data in Cassandra.
First, we ran the stream without backpressure and saw a strange anomaly: processing time was constant at ~1 minute, yet the scheduling delay kept increasing. The queue piled up and eventually crashed the stream.
Any thoughts why this could be happening? If it's not the processing, what can cause such dramatic delays?
Then we tried the same setup with backpressure (and an increased maxRatePerPartition). Initially, everything ran well: backpressure did its throttling job and we were able to process at a constant rate of ~100K / minute.
Then, after a few hours, something happened and the rate dropped rapidly to 5K / minute. The processing time was only 5-6 seconds with no scheduling delay, but backpressure kept the rate at 5K / minute and never increased it, even though there was no apparent reason to throttle down to 5K at all.
Our Setup:
Window: 1 minute
spark.streaming.kafka.maxRatePerPartition = 500 (4 partitions * 60 sec * 500 = 120K / window)
spark.streaming.backpressure.enabled = true
spark.streaming.kafka.allowNonConsecutiveOffsets = true
spark.streaming.kafka.consumer.cache.enabled = false
Spark cluster with one master and 2 worker nodes
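For reference, the per-window cap implied by the setup above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name is made up for illustration and is not part of Spark.

```python
def max_records_per_batch(partitions: int, batch_seconds: int,
                          max_rate_per_partition: int) -> int:
    """Upper bound on records per batch implied by maxRatePerPartition.

    Hypothetical helper: maxRatePerPartition is a per-partition,
    per-second limit, so the cap scales with both the partition count
    and the batch length.
    """
    return partitions * batch_seconds * max_rate_per_partition


# Values taken from the setup above: 4 partitions, 1-minute window, 500/s.
cap = max_records_per_batch(partitions=4, batch_seconds=60,
                            max_rate_per_partition=500)
print(cap)
```

The observed ~100K / minute sits below this cap, so the cap itself was not what limited the stream during the healthy phase.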
I have a Spark batch job that runs every minute and processes ~200k records per batch. The usual processing delay of the app is ~30 seconds. In the app, for each request, we make a write request to DynamoDB. At times, the server-side DDB write latency is ~5 ms instead of the usual 3.5 ms (a ~43% increase). This causes the overall delay of the app to increase sixfold (to ~3 minutes).
How does a sub-second increase in DDB call latency increase the overall latency of the app sixfold?
PS: I have verified the root cause by overlaying the CloudWatch graphs of DDB put latency and the Spark app processing delay.
Thanks,
Vinod.
Just a ballpark estimate:
If the average is 3.5 ms latency and about half of your 200k records are processed in 5ms instead of 3.5ms, this would leave us with:
200,000 * 0.5 * (5 - 3.5) = 150,000 (ms)
of total delay, which is 150 seconds or 2.5 minutes. I don't know how well the process is parallelized, but this seems to be within the expected delay.
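The same ballpark can be written as a small Python sketch. The parallelism parameter is an assumption added for illustration: it models writes spread evenly over N concurrent writers, which the original question does not specify.

```python
def extra_delay_seconds(records: int, slow_fraction: float,
                        normal_ms: float, slow_ms: float,
                        parallelism: int = 1) -> float:
    """Extra wall-clock time caused by a share of slower writes.

    Assumes writes are issued one after another within each of
    `parallelism` equally loaded writers (hypothetical model).
    """
    extra_ms = records * slow_fraction * (slow_ms - normal_ms)
    return extra_ms / parallelism / 1000.0


# Numbers from the estimate above: half of 200k records at 5 ms instead of 3.5 ms.
sequential = extra_delay_seconds(200_000, 0.5, 3.5, 5.0)
# Same load spread over 8 hypothetical parallel writers.
with_8_writers = extra_delay_seconds(200_000, 0.5, 3.5, 5.0, parallelism=8)
print(sequential, with_8_writers)
```

With a single sequential writer this gives the 150 seconds from the estimate; the more the writes are parallelized, the smaller the expected impact, so a sixfold slowdown suggests the writes are close to sequential per task.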
Here's my stream analytics topology
EventHubSource => Job A (HoppingWindow every second) => EventHubA
EventHubSource => Job B (HoppingWindow every second) => EventHubB
Each job has a different consumer group in EventHubSource.
Each job is embarrassingly parallel and consumes only 14% SU resources.
When testing Job A and Job B, the difference between the windowEnd and the original event time is only a few hundred milliseconds (~300 ms), which is OK (latency from my producer + Event Hub + Stream Analytics processing time).
But when I join both streams in a new Job C like this:
EventHubA
         \
          => Job C (Join Datediff = 0 and timestamp by windowEnd)
         /
EventHubB
This produces some output, but here is where the problems start:
The real events are multiple minutes apart even if they were pushed at the same time by Job A and Job B (same windowEnd).
When I inspect the data coming out of EventHub A and B, the difference between the windowEnd and the real event timestamp ranges between 39 and 44 minutes for all of them. But when testing as mentioned above, it was only ~300 ms.
The worst part is that when I run it in prod, it emits only about a dozen events and then stops, even though the input count is still in the thousands.
I've been working on this for weeks, and every time I run into some cryptic behavior from ASA. My topology is quite simple and I'm only using hopping windows with a 1 s hop; this shouldn't take weeks of tweaking and trial and error without even understanding what's happening.
For people who have used both ASA and AWS Kinesis Analytics: did you find Kinesis Analytics simpler to work with? What annoys me in ASA is the unpredictable behavior and the issues without error messages (I activated Log Analytics and no error was there...).
Sorry to hear you encountered some issues with ASA. I see you have a 1-second hopping window, but what is the total size of the window and what is your approximate throughput?
Regarding the delay: looking at your question, I think your ASA job may not have enough CPU resources, so event processing is delayed. Unfortunately this is not visible in the current SU% metric, but we plan to show metrics for both CPU and memory in the future.
To confirm this is the root cause, you can check the number of backlogged events in the job diagram. If there are a lot of events backlogged, you may need to increase the number of SUs for this job.
You also mentioned the job stops after about a dozen outputs. Do you see an error message in the logs?
We want to cache some entries (selected by a predicate) in a continuous query cache (CQC) on the client for an IMap. But we want updates sent to the CQC only after a delay (e.g. 30 seconds), even if an entry receives something like 100 updates per second. We can achieve this by setting delay seconds to 30 and coalescing to true.
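import com.hazelcast.config.QueryCacheConfig;

QueryCacheConfig cqc = new QueryCacheConfig();
cqc.setDelaySeconds(30);  // flush buffered events after 30 seconds
cqc.setCoalesce(true);    // keep only the latest event per key
cqc.setBatchSize(30);     // or flush once 30 events are buffered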
CQC fits perfectly well for the above use case.
But we are noticing that the CQC does not receive updates after the delay has elapsed; nothing arrives until the batch size is reached. Is this the expected behavior?
We thought the CQC would receive the latest value for each entry once either the delay had elapsed or the batch size had been reached.
delaySeconds and batchSize have an 'OR' relation. Updates are pushed to caches either when batchSize is reached or when delaySeconds have passed. If coalesce is true, then only the latest update of a key is pushed to the cache.
We have noticed some issues when testing with IntelliJ. Please try another IDE if you are using IntelliJ.
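To make the 'OR' relation concrete, here is a minimal Python model of a coalescing buffer that flushes when either condition is met. It is an illustration only, not Hazelcast's implementation: the class and method names are invented, time is passed in explicitly, and it assumes the batch size counts distinct buffered keys.

```python
class CoalescingBuffer:
    """Toy model: flush when batch_size is reached OR delay_seconds elapsed."""

    def __init__(self, delay_seconds: float, batch_size: int):
        self.delay_seconds = delay_seconds
        self.batch_size = batch_size
        self._pending = {}    # key -> latest value (this is the coalescing)
        self._oldest = None   # arrival time of the oldest unflushed update

    def offer(self, key, value, now: float):
        """Buffer an update, then flush if a condition is already met."""
        if self._oldest is None:
            self._oldest = now
        self._pending[key] = value  # overwrite: only the latest value survives
        return self.poll(now)

    def poll(self, now: float):
        """Return the flushed batch as a dict, or None if nothing is due."""
        if not self._pending:
            return None
        full = len(self._pending) >= self.batch_size
        expired = now - self._oldest >= self.delay_seconds
        if full or expired:
            batch, self._pending, self._oldest = self._pending, {}, None
            return batch
        return None


# 100 rapid updates to one key within the first second: nothing is flushed yet.
buf = CoalescingBuffer(delay_seconds=30, batch_size=30)
early = [buf.offer("k", i, now=i * 0.01) for i in range(100)]
before_delay = buf.poll(now=29.0)
after_delay = buf.poll(now=30.0)
print(before_delay, after_delay)
```

In this model a single hot key never fills the batch, so the delay is the only thing that triggers the flush, and the flushed batch carries just the last value written.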
I am trying to understand what the different metrics that Spark Streaming outputs mean, and I am slightly confused about the difference between the Processing Time, Total Delay and Processing Delay of the last batch.
I have looked at the Spark Streaming guide which mentions the Processing Time as a key metric for figuring if the system is falling behind, but other places such as "Pro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark" speak about using Total Delay and Processing Delay. I have failed to find any documentation that lists all the metrics produced by Spark Streaming with explanation what each one of them means.
I would appreciate it if someone could outline what each of these three metrics means or point me to any resources that can help me understand them.
Let's break down each metric. For that, let's define a basic streaming application which reads a batch at a given 4 second interval from some arbitrary source, and computes the classic word count:
// Word count per batch; on a DStream the output operation is saveAsTextFiles
inputDStream.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .saveAsTextFiles("hdfs://...")
Processing Time: The time it takes to compute a given batch for all its jobs, end to end. In our case this means a single job which starts at flatMap and ends at saveAsTextFiles, and assumes as a prerequisite that the job has been submitted.
Scheduling Delay: The time taken by the Spark Streaming scheduler to submit the jobs of the batch, i.e. how long a batch waits before its processing starts. How is this computed? As we've said, our batch reads from the source every 4 seconds. Now let's assume that a given batch took 8 seconds to compute. This means that we're now 8 - 4 = 4 seconds behind, thus making the scheduling delay 4 seconds long.
Total Delay: This is Scheduling Delay + Processing Time. Following the same example, if we're 4 seconds behind, meaning our scheduling delay is 4 seconds, and the next batch took another 8 seconds to compute, this means that the total delay is now 8 + 4 = 12 seconds long.
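The three definitions above can be reproduced with a small simulation in plain Python. This is a sketch under a simplifying assumption (batches are processed strictly one at a time, in order); the function name is made up and this is not Spark's scheduler code.

```python
def batch_metrics(batch_interval, processing_times):
    """Return (scheduling_delay, processing_time, total_delay) per batch.

    Assumes one batch runs at a time, in order (a simplification).
    """
    metrics = []
    free_at = 0  # time at which the previous batch finishes
    for i, processing in enumerate(processing_times):
        ready_at = (i + 1) * batch_interval  # batch is ready when its interval ends
        start = max(ready_at, free_at)       # wait for the previous batch if needed
        scheduling_delay = start - ready_at
        metrics.append((scheduling_delay, processing, scheduling_delay + processing))
        free_at = start + processing
    return metrics


# The example from the text: 4-second interval, two batches taking 8 seconds each.
example = batch_metrics(4, [8, 8])
for scheduling_delay, processing, total in example:
    print(scheduling_delay, processing, total)
```

The second batch comes out with a scheduling delay of 4 seconds and a total delay of 12 seconds, matching the figures in the definitions.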
A live example from a working Streaming application:
We see that:
The bottom job took 11 seconds to process. So now the next batch's scheduling delay is 11 - 4 = 7 seconds.
If we look at the second row from the bottom, we see that scheduling delay + processing time = total delay, in that case (rounding 0.9 to 1) 7 + 1 = 8.
We're experiencing a stable processing time but an increasing scheduling delay.
Based on the answer, the scheduling delay should be influenced only by the processing time of previous runs.
Spark is running only streaming, nothing else.
Time window is 1 minute, processing 120K records.
If your window is 1 minute and the average processing time is 1 minute 7 seconds, you have a problem: each batch will delay the next one by 7 seconds.
Your processing time graph shows a stable processing time, but one that is always higher than the batch interval.
I think that after a given amount of time your driver will crash with "GC overhead limit exceeded", as it will be full of pending batches waiting to be executed.
You can change this by reducing the processing time so that it stays under the micro-batch interval (this requires code and/or resource allocation changes), by increasing the micro-batch interval, or by moving to continuous streaming.
Rgds
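To get a feel for how quickly the backlog grows in the situation described above (60-second interval, 67-second processing time), here is a rough Python sketch. It assumes strictly sequential batch processing and a constant processing time; the function name is hypothetical.

```python
def pending_batches(elapsed_seconds: int, batch_interval: int,
                    processing_time: int) -> int:
    """Batches generated but not yet completed after elapsed_seconds.

    Simplified model: one batch runs at a time, in order, and every
    batch takes exactly processing_time seconds.
    """
    generated = elapsed_seconds // batch_interval
    free_at = 0
    completed = 0
    for k in range(1, generated + 1):
        start = max(k * batch_interval, free_at)  # wait for the previous batch
        free_at = start + processing_time
        if free_at <= elapsed_seconds:
            completed += 1
    return generated - completed


after_1h = pending_batches(3600, 60, 67)
after_4h = pending_batches(4 * 3600, 60, 67)
print(after_1h, after_4h)
```

Under these assumptions the queue grows without bound, which is consistent with the driver eventually running out of memory.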
We have a Spark Streaming application that reads data from a Kafka queue in a receiver, does some transformation, and writes the output to HDFS. The batch interval is 1 min, and we have already tuned backpressure and the spark.streaming.receiver.maxRate parameter, so it works fine most of the time.
But we still have one problem. When HDFS is completely down, the batch job hangs for a long time (say HDFS is unavailable for 4 hours; the job then hangs for 4 hours), but the receiver does not know that the job has not finished, so it keeps receiving data for those 4 hours. This causes an OOM exception, the whole application goes down, and we lose a lot of data.
So, my question is: is it possible to let the receiver know that the job is not finishing, so that it receives less (or even no) data, and then, once the job finishes, starts receiving more data to catch up? In the scenario above, while HDFS is down the receiver would read less data from Kafka, the blocks generated during those 4 hours would be very small, and neither the receiver nor the application would go down. Once HDFS is back, the receiver would read more data and start catching up.
You can enable back pressure by setting the property spark.streaming.backpressure.enabled=true. This will dynamically modify your batch sizes and will avoid situations where you get an OOM from queue build up. It has a few parameters:
spark.streaming.backpressure.pid.proportional - response signal to error in last batch size (default 1.0)
spark.streaming.backpressure.pid.integral - response signal to accumulated error - effectively a dampener (default 0.2)
spark.streaming.backpressure.pid.derived - response to the trend in error (useful for reacting quickly to changes, default 0.0)
spark.streaming.backpressure.pid.minRate - the minimum rate as implied by your batch frequency, change it to reduce undershoot in high throughput jobs (default 100)
The defaults are pretty good but I simulated the response of the algorithm to various parameters here
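To illustrate how the four parameters above interact, here is a small Python sketch of a PID-style rate estimator. It is modeled loosely on my understanding of Spark's PIDRateEstimator; treat the exact formula, the first-call behavior, and all names as assumptions rather than a faithful copy of Spark's code.

```python
class PIDRateEstimator:
    """Sketch of a PID controller that proposes a new ingestion rate (records/s)."""

    def __init__(self, batch_interval_ms, proportional=1.0, integral=0.2,
                 derivative=0.0, min_rate=100.0):
        self.batch_interval_ms = batch_interval_ms
        self.proportional = proportional
        self.integral = integral
        self.derivative = derivative
        self.min_rate = min_rate
        self.latest_time = None
        self.latest_rate = None
        self.latest_error = 0.0

    def compute(self, time_ms, num_elements, processing_delay_ms, scheduling_delay_ms):
        """Return a new rate bound, or None if no estimate can be made."""
        if num_elements <= 0 or processing_delay_ms <= 0:
            return None
        processing_rate = num_elements / processing_delay_ms * 1000.0
        if self.latest_time is None:
            # First batch: just record the observed rate as the baseline.
            self.latest_time, self.latest_rate = time_ms, processing_rate
            return None
        if time_ms <= self.latest_time:
            return None
        delay_since_update = (time_ms - self.latest_time) / 1000.0
        # Proportional term: how far the last rate was from what we can process.
        error = self.latest_rate - processing_rate
        # Integral term: backlog implied by the scheduling delay.
        historical_error = scheduling_delay_ms * processing_rate / self.batch_interval_ms
        # Derivative term: trend of the error.
        d_error = (error - self.latest_error) / delay_since_update
        new_rate = max(self.latest_rate
                       - self.proportional * error
                       - self.integral * historical_error
                       - self.derivative * d_error,
                       self.min_rate)
        self.latest_time, self.latest_rate, self.latest_error = time_ms, new_rate, error
        return new_rate


est = PIDRateEstimator(batch_interval_ms=1000)
first = est.compute(1000, 1000, 1000, 0)       # baseline only
second = est.compute(2000, 1000, 500, 0)       # processing got faster, no backlog
third = est.compute(3000, 1000, 500, 1000)     # one interval of scheduling delay
print(first, second, third)
```

In this sketch, with the default weights and no scheduling delay, the new rate simply tracks the measured processing rate; a scheduling delay pulls the rate down through the integral term, and minRate acts as a floor.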