How to do multiple aggregates with a SINGLE window in Flink? - apache-spark

I'm new to Flink, and I want to do something I have done in Spark many times.
For example, in Spark I can do something like below
ds.groupByKey(???).mapGroups(???) // aggregate 1
  .groupByKey(???).mapGroups(???) // aggregate 2
The first aggregate deals with a batch of input data, and the second aggregate deals with the output of the first aggregate. What I need is the output of the second aggregate.
But in Flink, it seems that any aggregate must be executed within a specific window, like below:
ds.keyBy(???)
  .window(???)    // window 1
  .aggregate(???) // aggregate 1
  .keyBy(???)
  .window(???)    // window 2
  .aggregate(???) // aggregate 2
If I set window 2, the input of the second aggregate may NOT be exactly the output of the first aggregate, which is not what I want.
I want to run multiple consecutive aggregates over the same batch of data, gathered in a single window. How can I achieve this in Flink?
Thanks for your help.
Update with more details.
Each window must have its own strategy; for example, I might set the window strategies like below:
ds.keyBy(key1)
  .window(TumblingProcessingTimeWindows.of(Time.of(1, TimeUnit.HOURS)))   // window 1: 1-hour tumbling window
  .aggregate(???)                                                         // aggregate 1
  .keyBy(key2)
  .window(TumblingProcessingTimeWindows.of(Time.of(1, TimeUnit.MINUTES))) // window 2: 1-minute tumbling window
  .aggregate(???)                                                         // aggregate 2
Window 1 may gather one billion rows over its one-hour tumbling window and, after aggregation, output one million rows.
I want to do some calculation on those one million rows in aggregate 2, but I don't know which window strategy would gather exactly those one million rows.
If I set window 2 as a tumbling time window like above, it may split those one million rows into two batches, and the output of aggregate 2 will not be what I need.

You can avoid this problem by using event-time windows rather than processing-time windows. And if you don't already have timestamps in the events that you want to use as the basis for timing, then you can do something like this in order to use ingestion-time timestamps:
WatermarkStrategy<MyType> watermarkStrategy =
        WatermarkStrategy
                .<MyType>forMonotonousTimestamps()
                .withTimestampAssigner(
                        // the assigner must return a long, so convert the Instant to epoch millis
                        (event, streamRecordTimestamp) -> Instant.now().toEpochMilli());

DataStream<MyType> timestampedEvents = ds
        .assignTimestampsAndWatermarks(watermarkStrategy);

timestampedEvents.keyBy(...)
        .window(TumblingEventTimeWindows.of(Time.of(1, TimeUnit.MINUTES)))
        .aggregate(...)
        .keyBy(...)
        .window(TumblingEventTimeWindows.of(Time.of(1, TimeUnit.HOURS)))
        .aggregate(...);
This works because the events produced by the first window will each be timestamped with the end of the window they were assigned to. It then follows that the second window's duration must be the same as, or a multiple of, the first window's duration.
Similarly, making arbitrary changes to the key-partitioning used by window 2 (compared to window 1) may produce nonsensical results.
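As a rough sketch of what could go into the .aggregate(...) slots above, here is one possible AggregateFunction, written here in Scala; the MyType fields and the counting logic are illustrative assumptions, not part of the original answer:
import org.apache.flink.api.common.functions.AggregateFunction

// Hypothetical event type, standing in for MyType above.
case class MyType(key: String, value: Long)

// Counts the events assigned to each per-key window. An instance of this
// (or any other AggregateFunction) is what gets passed to .aggregate(...).
class CountPerWindow extends AggregateFunction[MyType, Long, Long] {
  override def createAccumulator(): Long = 0L
  override def add(in: MyType, acc: Long): Long = acc + 1L
  override def getResult(acc: Long): Long = acc
  override def merge(a: Long, b: Long): Long = a + b
}
Because the function is incremental (one accumulator per key and window), the first-level results are reduced as events arrive rather than buffered until the window fires.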

Related

Single Sliding Window in Spark Structured Streaming

I am trying to implement a system which measures something based on data from the last n hours. Here is a simple example:
I am constantly getting new messages with the following format: {"temp": <temp>, createdAt: <timestamp>}. I want to calculate the average temperature for the last 1 hour, so if at this moment the time is 15:10, I want to consider only those records whose createdAt field is set to a time after 14:10. Others should be dropped.
Is there any way I could do this cleanly in Spark Structured Streaming and keep only one window? I was looking at the sliding windows feature, but that would result in multiple windows, which I don't really need.
I was looking at sliding windows feature but that would result in multiple windows, which I don't really need.
If you don't want overlapping windows, then what you are looking for is a tumbling window. From the code perspective, you use the same syntax as for a sliding window, but you leave the slideDuration parameter at its default of None.
PySpark docs on this topic - https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.window.html#pyspark-sql-functions-window
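For what it's worth, here is a minimal sketch of that tumbling-window average in Scala (the same window function exists in PySpark, as linked above); the streaming DataFrame readings, its column names, and the one-hour watermark delay are assumptions, not from the question:
import org.apache.spark.sql.functions.{avg, col, window}

// `readings` is assumed to be a streaming DataFrame with columns
// temp: Double and createdAt: Timestamp, e.g. parsed from Kafka.
val hourlyAvg = readings
  .withWatermark("createdAt", "1 hour")           // bound the state kept for late data (assumed delay)
  .groupBy(window(col("createdAt"), "1 hour"))    // tumbling window: no slideDuration given
  .agg(avg(col("temp")).as("avgTemp"))

val query = hourlyAvg.writeStream
  .outputMode("update")
  .format("console")                              // console sink just for the sketch
  .start()
Leaving out the second argument to window() gives non-overlapping one-hour buckets, which is the tumbling behaviour described above.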

How to use synchronous messages on rabbit queue?

I have a node.js function that needs to be executed for each order on my application. In this function my app gets an order number from an Oracle database, processes the order, and then adds 1 to that number in the database (this needs to be the last thing in the function, because the order can fail, in which case the number will not be used).
If all received orders at time T are processed at the same time (asynchronously), then the same order number will be used for multiple orders, and I don't want that.
So I used RabbitMQ to try to remedy this situation, since it is a queue. It seems that the processes finish in the order they should, but a second process does NOT wait for the first one to finish (ack) before beginning, so in the end I'm having the same problem of using the same order number multiple times.
Is there any way I can configure my queue to process one message at a time? To only start processing message n+1 when message n has been acknowledged?
This would be a life saver to me!
If the problem is to avoid duplicate order numbers, then use an Oracle sequence, or use an identity column when you insert into a table to generate the order number:
CREATE TABLE mytab (
  id   NUMBER GENERATED BY DEFAULT ON NULL AS IDENTITY(START WITH 1),
  data VARCHAR2(20));
INSERT INTO mytab (data) VALUES ('abc');
INSERT INTO mytab (data) VALUES ('def');
SELECT * FROM mytab;
This will give:
        ID DATA
---------- --------------------
         1 abc
         2 def
If the problem is that you want orders to be processed sequentially, then don't pull an order from the queue until the previous one is finished. This will limit your throughput, so you need to understand your requirements and make some architectural decisions.
Overall, it sounds like Oracle Advanced Queuing would be a good fit. See the node-oracledb documentation on AQ.

Find sub-sequence of events from a stream of events

I am giving a miniature version of my issue below
I have 2 different sensors sending 1/0 values as a stream. I am able to consume the stream using Kafka and bring it to spark for processing. Please note a sample stream I have given below.
Time         -->  1  2  3  4  5  6  7  8  9  10
Sensor Name  -->  A  A  B  B  B  B  A  B  A  A
Sensor Value -->  1  0  1  0  1  0  0  1  1  0
I want to identify a sub-sequence pattern occurring in this stream. For example, if A = 0 and the very next value (based on time) in the stream is B = 1, then I want to push an alert. In the example above there are two such places (times 3 and 8) where I want to give an alert. In general it will be like
“If a set of sensor-event combination happens within a time interval,
raise an alert”.
I am new to spark and don’t know Scala. I am currently doing my coding using python.
My actual problem contains more sensors, and each sensor can have different value combinations, meaning my sub-sequences and event stream can be much more complex than this example.
I have tried a couple of options without success:
- Window functions – can be useful for moving averages, cumulative sums, etc., but not for this use case.
- Bringing Spark DataFrames/RDDs into local Python structures like lists and pandas DataFrames and doing the sub-sequencing there – it takes lots of shuffles, and the Spark event streams queue up after some iterations.
- updateStateByKey – tried a couple of ways, but I am not able to fully understand how this works and whether it is applicable for this use case.
Anyone looking for a solution to this question can use the following approach:
1- To keep the events connected, you need to gather them with collect_list.
2- It's best to sort the collect_list, but be cautious because array_sort arranges the data by the first struct field, so it's important to put the DateTime in that field.
3- I then dropped the DateTime from the collected elements, as an example.
4- Finally, you should concatenate all elements so you can explore the result with string functions like contains to find your subsequence.
.agg(expr("array_join(transform(array_sort(collect_list(struct(Time, `Sensor Value`))), a -> a.`Sensor Value`), '')") as "MySequence")
After this agg function, you can use any regular expression or string function to detect your pattern.
check this link for more information about collect_list:
collect list
check this link for more information about sorting a collect_list:
sort a collect list
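For anyone who prefers a fuller, self-contained sketch of this approach, here is one in Scala; the parquet path, the column names (Time, SensorName, SensorValue), and the alert pattern are assumptions made for illustration:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, expr}

val spark = SparkSession.builder.appName("SubsequenceSketch").getOrCreate()
// Assumed batch source with columns Time, SensorName, SensorValue
val events = spark.read.parquet("/path/to/events")

val sequence = events
  // one token per event, e.g. "A1" or "B0"
  .withColumn("token", concat(col("SensorName"), col("SensorValue").cast("string")))
  .agg(expr(
    // Time is the first struct field, so array_sort orders by event time;
    // transform then drops Time and keeps only the token
    "array_join(transform(array_sort(collect_list(struct(Time, token))), a -> a.token), ' ')"
  ).as("MySequence"))

// Any string/regexp function can now detect the subsequence,
// e.g. "A = 0 immediately followed by B = 1":
val alerts = sequence.filter(col("MySequence").rlike("A0 B1"))
The same aggregation could instead be grouped by a device or time-bucket key rather than collapsing the whole stream into a single row.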

Spark streaming, get x batch of window

I'm using pyspark to do some operations on 2 incoming streams.
Also I do window operations on both streams.
stream = ssc.
streamW = stream.window(100,1)
for example.
One of the operations requires me to look at a specific batch inside streamW, so for example if the window length is 100, I want to look at the batch that is number 3 in the window.
More specifically I want to look at the first batch (ie: the batch that in next time interval is going to be removed from the window ).
Just to be clear: I want to get the batch leaving the window
Any ideas on how to accomplish this?

Array of RDDs? One RDD for a time window

I have a question about bucketing time events with Spark, and the best way to handle it.
So I'm ingesting a very large dataset, with specific start/stop times for each event.
For instance, I might load in three weeks of data. Within the main time window, I divide that into buckets of smaller intervals. So three weeks divided into 24-hour time buckets, with an array that looks like [(start_epoch, stop_epoch), (start_epoch, stop_epoch), ...]
Within each time bucket I map/reduce my events down into a smaller set.
I'd like to keep the events split up by the time bucket they belong to.
What is the best way to handle this? Each map/reduce operation results in a new RDD so I'm effectively left with a large array of RDDs.
Is it "safe" to just loop over that array from the driver, and then do other transformations/actions on each RDD to get results each time window?
Thanks!
I would suggest thinking about it a bit differently:
You want to read your data and then keyBy the time rounded to hour resolution. Then you can reduceByKey (or combineByKey if you want a different output type).
When working with Spark, it's not necessary to collect items into arrays by some key (it's even an antipattern).
RDD[Event] -> keyBy(ts rounded to hour) -> RDD[(hour, event)] -> reduceByKey -> RDD[(hour, aggregated view of all events in that hour)]
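As a small sketch of that pipeline in Scala (the Event fields and the per-event amount being summed are hypothetical):
import org.apache.spark.rdd.RDD

// Hypothetical event type; timestamp is epoch seconds and amount stands in
// for whatever you aggregate per event.
case class Event(timestamp: Long, amount: Long)

// Round each timestamp down to the hour and aggregate per hour bucket,
// instead of building one RDD per time bucket.
def aggregateByHour(events: RDD[Event]): RDD[(Long, Long)] =
  events
    .keyBy(e => e.timestamp - (e.timestamp % 3600))  // key = start of hour (epoch seconds)
    .mapValues(_.amount)
    .reduceByKey(_ + _)                              // one aggregated value per hour
This keeps everything in a single RDD keyed by hour, so there is no driver-side loop over an array of RDDs.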

Resources