Spark streaming, get x batch of window - apache-spark

I'm using pyspark to do some operations on 2 incoming streams.
I also do window operations on both streams, for example:
stream = ssc.
streamW = stream.window(100, 1)
One of the operations requires me to look at a specific batch inside streamW. For example, if the window length is 100, I want to look at batch number 3 of the window.
More specifically, I want to look at the first batch (i.e. the batch that in the next time interval is going to be removed from the window).
Just to be clear: I want to get the batch leaving the window.
Any ideas how to accomplish this?
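One possible way to isolate that leaving batch, sketched here as an assumption rather than a tested recipe: a window that is one batch longer than streamW contains exactly one extra batch, namely the one that has just slid out, so subtracting the current window from the longer one each interval yields it (this only works if records are distinct across batches).

# Sketch only: assumes a 1-second batch interval and records that are
# distinct across batches. A 101-second window holds exactly one batch more
# than the 100-second streamW: the batch that has just left it.
streamW_plus = stream.window(101, 1)
leaving_batch = streamW_plus.transformWith(
    lambda rdd_plus, rdd_cur: rdd_plus.subtract(rdd_cur), streamW)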

Related

Single Sliding Window in Spark Structured Streaming

I am trying to implement a system which measures something based on data from the last n hours. Here is a simple example:
I am constantly getting new messages in the following format: {"temp": <temp>, "createdAt": <timestamp>}. I want to calculate the average temperature for the last 1 hour, so if the time is currently 15:10 I want to consider only those records whose createdAt field is set to a time after 14:10. Others should be dropped.
Is there any way I could do this gently in Spark Structured Streaming and keep only one window? I was looking at the sliding window feature, but that would result in multiple windows, which I don't really need.
If you don't want overlapping windows, then what you are looking for is a tumbling window. From the code perspective you use the same syntax as for a sliding window, but you leave the slideDuration param at its default of None.
PySpark docs on this topic: https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.window.html#pyspark-sql-functions-window
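For the temperature example above, a minimal sketch, assuming a streaming DataFrame events with the createdAt and temp columns from the question:

from pyspark.sql import functions as F

# 1-hour tumbling windows: omitting slideDuration keeps them non-overlapping,
# and the watermark lets Spark drop state for windows that are over an hour old.
avg_temp = (events
    .withWatermark("createdAt", "1 hour")
    .groupBy(F.window(F.col("createdAt"), "1 hour"))
    .agg(F.avg("temp").alias("avg_temp")))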

Accessing the current sliding window in foreachBatch in Spark Structured Streaming

I am using foreachBatch() in Spark Structured Streaming to manually maintain a sliding window consisting of the last 200000 entries. With every microbatch I receive about 50 rows. On this sliding window I manually calculate my desired metrics like min, max, etc.
Spark also provides a sliding window function. But I have two problems with it:
The interval at which the sliding window is updated can only be configured based on a time period; there seems to be no way to force an update with each single microbatch coming in. Is there a possibility that I do not see?
The bigger problem: it seems I can only do aggregations using grouping, like:
val windowedCounts = words.groupBy(
  window($"timestamp", "10 minutes", "5 minutes"),
  $"word"
).count()
But I do not want to group over multiple sliding windows. I need something like the existing foreachBatch() that lets me access not only the current batch but also (or instead) the current sliding window. Is there something like that?
Thank you for your time!
You can probably use the flatMapGroupsWithState feature to achieve this.
Basically, you can store and keep updating information from previous batches in an internal state (only the information you need) and use it in the next batch.
You can refer to the links below:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#arbitrary-stateful-operations
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/content/spark-sql-streaming-demo-arbitrary-stateful-streaming-aggregation-flatMapGroupsWithState.html
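As a rough illustration of the same idea in PySpark (the thread's snippets are Scala; this swaps in applyInPandasWithState, the closest PySpark analogue, available since Spark 3.4): keep the most recent 200000 values per key in the group state and recompute the metrics on every microbatch. The column names, key column, schemas, and the streaming DataFrame events below are all assumptions.

import pandas as pd
from pyspark.sql.streaming.state import GroupStateTimeout

out_schema = "id STRING, min_val DOUBLE, max_val DOUBLE"   # assumed output columns
state_schema = "buffer ARRAY<DOUBLE>"                      # assumed state layout

def update_sliding_window(key, pdfs, state):
    # The state holds the most recent 200000 values seen for this key.
    buffer = list(state.get[0]) if state.exists else []
    for pdf in pdfs:
        buffer.extend(pdf["value"].tolist())
    buffer = buffer[-200000:]
    state.update((buffer,))
    yield pd.DataFrame({"id": [key[0]],
                        "min_val": [min(buffer)],
                        "max_val": [max(buffer)]})

result = (events                      # 'events' is an assumed streaming DataFrame
    .groupBy("id")
    .applyInPandasWithState(update_sliding_window, out_schema, state_schema,
                            outputMode="update",
                            timeoutConf=GroupStateTimeout.NoTimeout))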

How to do multiple aggregate with SINGLE window in Flink?

I'm new to Flink, and I want to do something I have done in Spark many times.
For example, in Spark I can do something like this:
ds.groupByKey(???).mapGroups(???) // aggregate 1
  .groupByKey(???).mapGroups(???) // aggregate 2
The first aggregate deals with a batch of input data, and the second aggregate deals with the output of the first aggregate. What I need is the output of the second aggregate.
But in Flink, it seems that every aggregate has to be executed within a specific window, like below:
ds.keyBy(???)
  .window(???)     // window 1
  .aggregate(???)  // aggregate 1
  .keyBy(???)
  .window(???)     // window 2
  .aggregate(???)  // aggregate 2
If I set window 2, then the input data of the second aggregate may NOT be the output of the first aggregate, which goes against my wish.
I want to do multiple continuous aggregates over the same batch of data, which can be gathered in a single window. How can I realize this in Flink?
Thanks for your help.
Update with more details.
A window must have its own strategy; for example, I may set window strategies like below:
ds.keyBy(key1)
  .window(TumblingProcessingTimeWindows.of(Time.of(1, TimeUnit.HOURS)))   // window 1: 1-hour tumbling window
  .aggregate(???)  // aggregate 1
  .keyBy(key2)
  .window(TumblingProcessingTimeWindows.of(Time.of(1, TimeUnit.MINUTES))) // window 2: 1-minute tumbling window
  .aggregate(???)  // aggregate 2
Window 1 may gather one billion rows during its one-hour tumbling interval, and after the aggregate it outputs one million rows.
I want to do some calculation on those one million rows in aggregate 2, but I don't know which window strategy could gather exactly those one million rows.
If I set window 2 to a tumbling time window as above, it may split those one million rows into two batches, and then the output of aggregate 2 will not be what I need.
You can avoid this problem by using event-time windows rather than processing-time windows. And if you don't already have timestamps in the events that you want to use as the basis for timing, then you can do something like this in order to use ingestion-time timestamps:
WatermarkStrategy<MyType> watermarkStrategy =
    WatermarkStrategy
        .<MyType>forMonotonousTimestamps()
        .withTimestampAssigner(
            (event, streamRecordTimestamp) -> Instant.now().toEpochMilli());
DataStream<MyType> timestampedEvents = ds
    .assignTimestampsAndWatermarks(watermarkStrategy);
timestampedEvents.keyBy(...)
    .window(TumblingEventTimeWindows.of(Time.of(1, TimeUnit.MINUTES)))
    .aggregate(...)
    .keyBy(...)
    .window(TumblingEventTimeWindows.of(Time.of(1, TimeUnit.HOURS)))
    .aggregate(...)
This works because the events produced by the first window will each be timestamped with the timestamp of the end of the window they were assigned to. This then requires that the second window's duration be the same as, or a multiple of, the first window's duration.
Similarly, making arbitrary changes to the key-partitioning used by window 2 (compared to window 1) may produce nonsensical results.

How to design spark program to process 300 most recent files?

Situation
New small files come in periodically. I need to do a calculation on the 300 most recent files. So basically there is a window moving forward: the size of the window is 300 files, and I need to do the calculation on that window.
But something very important to know is that this is not Spark Streaming computation, because in Spark Streaming the unit/scope of the window is time, whereas here the unit/scope of the window is the number of files.
Solution 1
I will maintain a dict with a maximum size of 300. Each time a new file comes in, I turn it into a Spark data frame and put it into the dict. Then I make sure the oldest file is popped out of the dict if the length of the dict goes over 300.
After this I merge all the data frames in the dict into a bigger one and do the calculation.
The above process runs in a loop: every time a new file comes in, we go through the loop.
pseudo code for solution 1
for file in file_list:
    data_frame = get_data_frame(file)
    my_dict[timestamp] = data_frame

    # evict stale entries (iterate over a copy since we delete while looping)
    for ts in list(my_dict.keys()):
        if ts older than 24 hours:
            # not only unpersist, but also delete to make sure the memory is released
            my_dict[ts].unpersist()
            del my_dict[ts]

    # pop one data frame from the dict and union the remaining ones onto it
    _, big_data_frame = my_dict.popitem()
    for ts in my_dict.keys():
        df = my_dict.get(ts)
        big_data_frame = big_data_frame.unionAll(df)

    # Then we run SQL on big_data_frame to get the report
problem with solution 1
I always hit OutOfMemory or "GC overhead limit exceeded" errors.
questions
Do you see anything inappropriate in solution 1?
Is there any better solution?
Is this the right kind of situation in which to use Spark?
One observation: you probably don't want to use popitem, since a Python dict does not keep its keys sorted, so you can't guarantee that you're popping the earliest item. Instead I would recreate the dictionary each time using a sorted list of timestamps. Assuming your filenames are just timestamps:
my_dict = {file:get_dataframe(file) for file in sorted(file_list)[-300:]}
I'm not sure if this will fix your problem. Can you paste the full stack trace of your error into the question? It's possible that your problem is happening in the Spark merge/join (which is not included in your question).
My suggestion is to use streaming, but not with respect to time. I mean you will still have some window and sliding interval set, but say it is 60 secs.
So every 60 secs you get a DStream of file contents, in 'x' partitions. These 'x' partitions represent the files you drop onto HDFS or the file system.
This way you can keep track of how many files/partitions have been read; if there are fewer than 300, wait until the count reaches 300, and then you can start processing.
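A rough sketch of that idea, with the input directory, the "roughly one file per batch" simplification, and the final computation all assumed rather than taken from the answer:

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 60)                          # 60-second batches
incoming = ssc.textFileStream("hdfs:///landing/dir")    # assumed input directory

recent = []   # driver-side buffer of per-batch RDDs (roughly one file each)

def handle_batch(time, rdd):
    if not rdd.isEmpty():
        recent.append(rdd)
        del recent[:-300]                # keep only the 300 most recent batches
    if len(recent) >= 300:
        window_rdd = sc.union(recent)
        # ... run the actual computation on window_rdd here ...

incoming.foreachRDD(handle_batch)
ssc.start()
ssc.awaitTermination()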
If it's possible to keep track of the most recent files, or to just discover them once in a while, then I'd suggest doing something like
sc.textFile(','.join(files))
or, if it's possible to identify a specific pattern that matches those 300 files, then
sc.textFile("*pattern*")
It's even possible to use comma-separated patterns, but it might happen that some files matching more than one pattern would be read more than once.
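A small sketch of the first variant, with the directory and file suffix assumed: pick the 300 most recently modified files and pass them to Spark as one comma-separated path.

import glob
import os

# Select the 300 newest files by modification time (paths are assumptions).
files = sorted(glob.glob("/data/incoming/*.txt"), key=os.path.getmtime)[-300:]
recent = sc.textFile(",".join(files))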

Multiple windows of different durations in Spark Streaming application

I would like to process a real-time stream of data (from Kafka) using Spark Streaming. I need to compute various stats from the incoming stream and they need to be computed for windows of varying durations. For example, I might need to compute the avg value of a stat 'A' for the last 5 mins while at the same time compute the median for stat 'B' for the last 1 hour.
In this case, what's the recommended approach to using Spark Streaming? Below are a few options I could think of:
(i) Have a single DStream from Kafka and create multiple DStreams from it using the window() method. For each of these resulting DStreams, the windowDuration would be set to a different value as required, e.g.:
// pseudo-code
val streamA = kafkaDStream.window(Minutes(5), Minutes(1))
val streamB = kafkaDStream.window(Hours(1), Minutes(10))
(ii) Run separate Spark Streaming apps - one for each stat
Questions
To me, (i) seems like the more efficient approach. However, I have a couple of doubts regarding it:
How would streamA and streamB be represented in the underlying data structure?
Would they share data, since they both originate from the KafkaDStream, or would there be duplication of data?
Also, are there more efficient methods to handle such a use case?
Thanks in advance
Your (i) streams look sensible, will share data, and you can look at WindowedDStream to get an idea of the underlying representation. Note your streams are of course lazy, so only the batches being computed upon are in the system at any given time.
Since the state you have to maintain for the computation of an average is small (2 numbers), you should be fine. I'm more worried about the median (which requires a pair of heaps).
One thing you haven't made clear, though, is whether you really need the update component of your aggregation that is implied by the windowing operation: your streamA maintains the last 5 minutes of data, updated every minute, and streamB maintains the last hour, updated every 10 minutes.
If you don't need that freshness, not requiring it will of course minimize the amount of data in the system. You can have a streamA with a batch interval of 5 minutes and a streamB which is derived from it (with window(Hours(1)), since 60 is a multiple of 5).
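A sketch of that lower-freshness variant in PySpark (the thread's snippets are Scala; the broker, topic, and Kafka helper shown here are assumptions about how the stream is wired up):

from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils    # DStream-era Kafka helper

ssc = StreamingContext(sc, 5 * 60)                # 5-minute batch interval
kafkaDStream = KafkaUtils.createDirectStream(     # broker and topic are placeholders
    ssc, ["stats"], {"metadata.broker.list": "broker:9092"})

streamA = kafkaDStream                    # the last 5 minutes is simply one batch
streamB = kafkaDStream.window(60 * 60)    # the last hour, recomputed every batch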
