The documentation at https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking shows an example with a 10-minute window, a 10-minute watermark and a 5-minute trigger.
In the diagram for APPEND mode, the first results from the 12:00:00->12:10:00 window are only shown at 12:25:00. The reason is that at that time the watermark is at 12:11:00, so all windows ending before 12:11:00 can be sent to the sink.
However, at 12:20:00 we already know the watermark is 12:11:00. So why isn't the first window sent at 12:20:00 instead of 12:25:00?
Because Spark applies a global watermark rather than a watermark per partition: the watermark for the next batch is decided when the tasks in the current batch finish. A single partition cannot decide the watermark, because it only knows about the events in its own partition.
So at 12:20:00, Spark receives the 12:21:00 event and processes it. At the end of that batch, Spark collects the event timestamps, determines the maximum, and derives the watermark for the next batch, 12:11:00. That watermark applies to the batch triggered at 12:25:00.
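The "compute at the end of a batch, apply to the next batch" rule can be sketched with a small model in plain Scala. This is not Spark code. The names (WatermarkTimeline, Batch, firstEmission) are made up for illustration, and the per-batch maximum event times are read off the guide's diagram, so treat them as approximate.

```scala
// Simplified model of a global watermark; this is not Spark code.
object WatermarkTimeline {
  // One micro-batch: its trigger time and the largest event time it read,
  // both in minutes after 12:00.
  final case class Batch(triggerMinute: Int, maxEventMinute: Int)

  val delayMinutes = 10     // watermark delay threshold
  val windowEndMinute = 10  // end of the 12:00 -> 12:10 window

  // For each batch, pair its trigger time with the watermark in force while it
  // ran. The watermark derived from a batch is only visible to the next one.
  def watermarksInForce(batches: Seq[Batch]): Seq[(Int, Option[Int])] = {
    val afterEachBatch = batches.scanLeft(Option.empty[Int]) { (wm, b) =>
      val candidate = b.maxEventMinute - delayMinutes
      Some(wm.fold(candidate)(prev => math.max(prev, candidate)))
    }
    batches.map(_.triggerMinute).zip(afterEachBatch)
  }

  // Trigger time of the first batch that may emit the window in append mode.
  def firstEmission(batches: Seq[Batch]): Option[Int] =
    watermarksInForce(batches).collectFirst {
      case (trigger, Some(wm)) if wm >= windowEndMinute => trigger
    }

  // Illustrative values approximated from the guide's diagram.
  val example: Seq[Batch] =
    Seq(Batch(10, 8), Batch(15, 14), Batch(20, 21), Batch(25, 17))
}
```

In this model the batch triggered at minute 20 reads the 12:21 event but still runs with the older watermark (minute 4), so the window can only be emitted by the batch triggered at minute 25.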
I am using foreachBatch() in Spark Structured Streaming to manually maintain a sliding window consisting of the last 200000 entries. With every microbatch I receive about 50 rows. On this sliding window I manually calculate my desired metrics such as min, max, etc.
Spark also provides a sliding window function, but I have two problems with it:
The interval at which the sliding window is updated can only be configured as a time period; there seems to be no way to force an update with every single incoming microbatch. Is there a possibility that I do not see?
The bigger problem: It seems I can only do aggregations using grouping like:
val windowedCounts = words.groupBy(
  window($"timestamp", "10 minutes", "5 minutes"),
  $"word"
).count()
But I do not want to group over multiple sliding windows. I need something like the existing foreachBatch() that allows me to access not only the current batch but also/or the current sliding window. Is there something like that?
Thank you for your time!
You can probably use the flatMapGroupsWithState feature to achieve this.
Basically, you keep the previous batches (only the information you need) in an internal state, keep updating it, and use it in the next batch.
You can refer to the links below:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#arbitrary-stateful-operations
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/content/spark-sql-streaming-demo-arbitrary-stateful-streaming-aggregation-flatMapGroupsWithState.html
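As a rough sketch of the state update such a function could perform, the plain Scala below keeps the most recent N values and derives min and max from them. The names (SlidingStats, update, stats) are hypothetical, and wiring this into GroupState inside flatMapGroupsWithState is not shown here.

```scala
// Plain Scala sketch of "keep the last N entries" state; not Spark code.
object SlidingStats {
  final case class Stats(count: Int, min: Double, max: Double)

  // Append the new rows and keep only the most recent `capacity` values.
  def update(window: Vector[Double], batch: Seq[Double], capacity: Int): Vector[Double] =
    (window ++ batch).takeRight(capacity)

  // Metrics over the current window, or None while it is still empty.
  def stats(window: Vector[Double]): Option[Stats] =
    if (window.isEmpty) None
    else Some(Stats(window.size, window.min, window.max))
}
```

With a capacity of 200000 and about 50 new rows per microbatch, each call to update would drop the 50 oldest values once the window is full.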
I’m using Apache Spark Structured Streaming to read from Kafka. Sometimes my micro-batches take longer to process than the configured interval, due to heavy write I/O. I was wondering if there is an option to start the next batch before the first one has finished, but have the second batch blocked by the first?
I mean that if the first batch took 7 seconds and the interval is set to 5 seconds, the second batch would start at the fifth second. But if the second batch finishes first, it would be blocked so that it does not write before the previous batch (because I want to keep the messages in the correct order).
No. The next batch only starts once the previous one has completed. I think you mean the trigger interval. It would become a mess otherwise.
See https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers
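To make the timing concrete, here is a simplified model of a fixed-interval trigger in plain Scala. It assumes the behaviour described in the triggers section of the guide: a batch that overruns the interval is followed immediately by the next one, otherwise the next batch waits for the interval boundary. TriggerModel and startTimes are hypothetical names, and scheduling overhead is ignored.

```scala
// Simplified model of a fixed-interval micro-batch trigger; not Spark code.
object TriggerModel {
  // Next interval boundary strictly after `t`.
  private def nextBoundary(t: Long, interval: Long): Long =
    (t / interval + 1) * interval

  // Start time of each batch, given how long each one takes to run.
  // A batch never starts before the previous one has finished.
  def startTimes(interval: Long, durations: Seq[Long]): Seq[Long] =
    durations.scanLeft(0L) { (start, duration) =>
      math.max(start + duration, nextBoundary(start, interval))
    }.init
}
```

With a 5-second interval and a first batch that takes 7 seconds, the second batch in this model starts at second 7, not at second 5.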
I'm using Spark 3.0.2 and I have a streaming job that consumes data from Kafka with trigger duration of "1 minute".
I see in Spark UI that there is a new job every 1 minute as defined, but I see method onQueryProgress is being called every 5~6 minutes. I thought this method should be called directly after each microbatch.
Is there a way to control this duration and make it equal to the trigger duration?
The onQueryProgress method of the StreamingQueryListener is called asynchronously after the data has been completely processed within each micro-batch.
You are seeing this listener being triggered only every 5~6 minutes because it takes the streaming job that long to process all the data fetched in the micro-batch. Setting the trigger duration to 1 minute makes Spark plan tasks accordingly, but it does not mean that the job is able to process all available data within that 1-minute time frame.
To reduce the amount of data being fetched by your query from Kafka you can play around with the source option maxOffsetsPerTrigger.
By the way, if you are not processing any data, this method is called every 10 seconds by default. If you want to skip those calls, you can guard your logic with if(event.progress.numInputRows > 0).
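For reference, the maxOffsetsPerTrigger option mentioned above is set on the Kafka source. The fragment below is a minimal sketch that needs Spark and the Kafka connector on the classpath. The broker address, topic name and limit are placeholders, not values from the question, and KafkaSource.limitedStream is a hypothetical helper.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object KafkaSource {
  // "broker:9092", "events" and "10000" are placeholders for illustration.
  def limitedStream(spark: SparkSession): DataFrame =
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      // Upper bound on the number of offsets read per micro-batch.
      .option("maxOffsetsPerTrigger", "10000")
      .load()
}
```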
I found the reason in my case: the onQueryProgress method itself was taking 5 minutes to complete.
As Mike mentioned, onQueryProgress is called asynchronously, but I think the same thread is used for every call, so it waits for one call to finish before making the next.
So the solution in my case was to figure out why the method was taking that long and to make it faster than the trigger duration.
We plan to implement a Spark Structured Streaming application which will consume a continuous flow of data: evolution of a metric value over time.
This streaming application will work with a window size of 7 days (and a sliding window) in order to frequently calculate the average of the metric value over the last 7 days.
1- Will Spark retain all 7 days of data (with a large impact on memory consumption), or does Spark continuously calculate and update the requested average (and then discard the handled data), so that memory consumption is not impacted as much?
2- If the answer to the first question is that the 7 days of data are retained, does using a watermark prevent this retention?
Let’s say that we have a watermark of 1 hour: will only 1 hour of data be retained in Spark, or are 7 days still retained in Spark memory, with the watermark only there to ignore incoming data whose timestamp is older than 1 hour?
A window size of 7 days is definitely a significant one, but it also depends on the volume of streaming data coming in. The trick lies in how you use the window duration, the update interval, the output mode and, if necessary, the watermark (if the business rule is not impacted).
1- If the stream uses a tumbling window (i.e. the window duration equals the update interval) with complete output mode, you may end up keeping the full data in memory for 7 days. However, if you configure the window duration as 7 days with an update every x minutes, the aggregates are calculated every x minutes and only the result data is kept in memory. So look at the window API parameters and configure them to get the results you need.
2- A watermark changes this behaviour: records older than the watermark threshold are ignored, and the result table is updated after each micro-batch that crosses the watermark time. If your business rules allow for a watermark, it is fine to use one too.
It is worth going through the API, the output modes and watermark usage in detail.
This would help to choose the right combination.
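As a rough illustration of why a windowed average does not need the raw rows, the plain Scala model below keeps one running (sum, count) pair per sliding window. It is a sketch under that assumption, not Spark's state store, and WindowState, windowStarts and aggregate are hypothetical names. Times are in minutes.

```scala
// Plain Scala model of per-window partial aggregates; not Spark code.
object WindowState {
  // Start times of every sliding window [start, start + window) that
  // contains `eventMinute` (assumed non-negative).
  def windowStarts(eventMinute: Long, windowMinutes: Long, slideMinutes: Long): Seq[Long] = {
    val lastStart = eventMinute - (eventMinute % slideMinutes)
    Iterator.iterate(lastStart)(_ - slideMinutes)
      .takeWhile(start => start > eventMinute - windowMinutes)
      .toList
  }

  // Running average: only the sum and the count are stored.
  final case class Avg(sum: Double, count: Long) {
    def add(v: Double): Avg = Avg(sum + v, count + 1)
    def value: Double = sum / count
  }

  // Fold (eventMinute, value) pairs into one Avg per window start.
  def aggregate(events: Seq[(Long, Double)], windowMinutes: Long, slideMinutes: Long): Map[Long, Avg] =
    events.foldLeft(Map.empty[Long, Avg]) { case (state, (t, v)) =>
      windowStarts(t, windowMinutes, slideMinutes).foldLeft(state) { (s, w) =>
        s.updated(w, s.getOrElse(w, Avg(0.0, 0L)).add(v))
      }
    }
}
```

In this model a 7-day window (10080 minutes) sliding every 60 minutes puts each event into 168 windows, so the state grows with the number of open windows (and grouping keys) rather than with the number of rows received.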
I have a spark-streaming service, where I am processing and detecting anomalies on the basis of some offline generated model. I feed data into this service from a log file, which is streamed using the following command
tail -f <logfile>| nc -lk 9999
Here the spark streaming service is taking data from port 9999. However, I observe that the last few lines are being dropped, i.e. spark streaming does not receive those log lines or they are not processed.
However, I also observed that if I simply take the logfile as standard input instead of tailing it, no lines are dropped:
nc -q 10 -lk 9999 < logfile
Can anyone explain why this behavior is happening? And what could be a better resolution to the problem of streaming log data to spark streaming instance?
In Spark Streaming, data comes in over the wire and constitutes a block on every block interval. This block is replicated on other machines (according to your storage level) as soon as it is formed. Once a batch interval elapses, each block formed since the last batch interval tick becomes part of a new RDD. Only once this RDD has been formed can a job be scheduled, so the data collected during batch interval n is processed during batch interval n+1.
So, the possible culprits for "losing a bit of data towards the end" could be:
you are observing your input file at the same time as you are monitoring the input for Spark. If you consider your monitoring at instant t, a bit after n batch intervals have elapsed, your log file has produced the data for n batches and then some ("a little bit more"). Except, the beginning of the next batch (n+1) is at this stage in the data collection phase, in the form of blocks on your Receiver. No data has been lost, the processing of batch n+1 has simply not started yet.
or your application assumes it's receiving a similar number of elements in each RDD and does not process the potentially (much) smaller last batch's RDD correctly.
or you're stopping your application or data before the last batch interval elapses (you need to wait n+1 batch intervals to see the processing of n batches of data).
or there is something weird occurring with the system clock of your executors. Have you thought of synchronizing them with ntp ?
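The "wait n+1 batch intervals" point can be sketched with a small plain Scala model. It assumes a line only becomes eligible for processing once the batch interval it arrived in has fully elapsed, and it ignores scheduling delay. BatchModel and processedBefore are hypothetical names.

```scala
// Simplified model of batch-interval boundaries; not Spark code.
object BatchModel {
  // Arrival times (seconds) whose batch interval had fully elapsed
  // by the time the application was stopped.
  def processedBefore(stopSeconds: Long, arrivals: Seq[Long], batchSeconds: Long): Seq[Long] =
    arrivals.filter { t =>
      val batchClosesAt = (t / batchSeconds + 1) * batchSeconds
      batchClosesAt <= stopSeconds
    }
}
```

With a 10-second batch interval and a stop at second 25, lines that arrived at seconds 21 and 24 are still sitting in blocks on the receiver in this model, which looks like "the last few lines were dropped".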