Find sub-sequence of events from a stream of events - apache-spark

I am giving a miniature version of my issue below.
I have 2 different sensors sending 1/0 values as a stream. I am able to consume the stream using Kafka and bring it into Spark for processing. A sample stream is given below.
Time:         1  2  3  4  5  6  7  8  9  10
Sensor Name:  A  A  B  B  B  B  A  B  A  A
Sensor Value: 1  0  1  0  1  0  0  1  1  0
I want to identify a sub-sequence pattern occurring in this stream. For example, if A = 0 and the very next value (based on time) in the stream is B = 1, then I want to push an alert. In the example above there are 2 such places (times 2-3 and 7-8) where I want to give an alert. In general it will be like
“If a set of sensor-event combination happens within a time interval,
raise an alert”.
I am new to Spark and don't know Scala; I am currently doing my coding in Python.
My actual problem contains more sensors, and each sensor can have different value combinations, meaning my sub-sequences and event stream will be more complex than this example.
I have tried a couple of options, without success:

- Window functions – useful for moving averages, cumulative sums, etc., but not for this use case.
- Bringing Spark DataFrames/RDDs into local Python structures such as lists and pandas DataFrames and doing the sub-sequencing there – it takes a lot of shuffles, and the Spark event stream gets queued up after some iterations.
- updateStateByKey – I tried a couple of ways but was not able to fully understand how it works and whether it is applicable to this use case (see the sketch right after this list).
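For reference, a minimal sketch of how updateStateByKey behaves on its own (not a solution to the sub-sequencing problem): a queueStream stands in for the Kafka source, and the state simply remembers the last value seen per sensor across batches.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="updateStateByKey-sketch")
ssc = StreamingContext(sc, 1)
ssc.checkpoint("/tmp/sketch-checkpoint")  # stateful operations require a checkpoint dir

# stand-in source of (sensor, value) events; in the real job this would be the Kafka DStream
events = ssc.queueStream([sc.parallelize([("A", 0)]), sc.parallelize([("B", 1)])])

def remember_last(new_values, last_state):
    # keep only the most recent value per sensor, carrying it across batches
    return new_values[-1] if new_values else last_state

last_per_sensor = events.updateStateByKey(remember_last)
last_per_sensor.pprint()

ssc.start()
ssc.awaitTerminationOrTimeout(10)
ssc.stop()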

Anyone looking for a solution to this question can use my solution:
1- To keep the events connected, you need to gather them with collect_list.
2- It's best to sort the events inside the collect_list, but be cautious, because array_sort arranges the data by the first column, so it's important to put the DateTime in that column.
3- I dropped the DateTime from the collect_list, as an example.
4- Finally, you should concatenate all the elements so you can explore the result with string functions like contains to find your subsequence.
.agg(expr("array_join(transform(array_sort(collect_list(struct(Time, SensorValue))), a -> a.SensorValue), '')") as "MySequence")
after this agg function, you can use any regular expression or string function to detect your pattern.
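For a Python user, here is a rough PySpark sketch of the same idea (column names simplified to Time, SensorName, SensorValue, and the whole sample aggregated into a single sequence): sort by time inside the collected list, concatenate name/value pairs into one string, then search for the "A=0 followed by B=1" pattern.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# sample stream from the question, with simplified column names
events_df = spark.createDataFrame(
    [(1, "A", 1), (2, "A", 0), (3, "B", 1), (4, "B", 0), (5, "B", 1),
     (6, "B", 0), (7, "A", 0), (8, "B", 1), (9, "A", 1), (10, "A", 0)],
    ["Time", "SensorName", "SensorValue"])

# sort by Time inside the collected list, then join name+value pairs into one string
sequence_df = events_df.agg(
    F.expr("array_join(transform(array_sort(collect_list(struct(Time, SensorName, SensorValue))), "
           "a -> concat(a.SensorName, cast(a.SensorValue as string))), '')").alias("MySequence"))

# 'A0B1' in the concatenated string means sensor A sent 0 and the very next event was B sending 1
alerts_df = sequence_df.withColumn("alert", F.col("MySequence").rlike("A0B1"))
alerts_df.show(truncate=False)

For the sample stream above, MySequence comes out as A1A0B1B0B1B0A0B1A1A0, and the A0B1 pattern matches at times 2-3 and 7-8.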
check this link for more information about collect_list:
collect list
check this link for more information about sorting a collect_list:
sort a collect list

Related

Spark Window Function Null Skew

Recently I've encountered an issue running one of our PySpark jobs. While analyzing the stages in the Spark UI, I noticed that the longest-running stage takes 1.2 hours out of the total 2.5 hours the entire process takes to run.
Once I took a look at the stage details, it was clear that I was facing severe data skew: a single task ran for the entire 1.2 hours while all other tasks finished within 23 seconds.
The DAG showed this stage involves window functions, which helped me quickly narrow down the problematic area to a few queries and find the root cause: the column account, used in Window.partitionBy("account"), was 25% null.
I have no interest in calculating the sum for the null accounts, but I do need those rows for further calculations, so I can't filter them out before the window function.
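For context, a quick way to confirm this kind of null concentration (a minimal sketch, assuming the same sales_df and account column used in the query below):

from pyspark.sql.functions import col

total = sales_df.count()
null_accounts = sales_df.where(col("account").isNull()).count()
print(f"null accounts: {null_accounts / total:.1%} of {total} rows")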
Here is my window function query:
from pyspark.sql import Window
from pyspark.sql.functions import col, sum
problematic_account_window = Window.partitionBy("account")
sales_with_account_total_df = sales_df.withColumn("sum_sales_per_account", sum(col("price")).over(problematic_account_window))
So we found the one to blame - What can we do now? How can we resolve the skew and the performance issue?
We basically have 2 solutions for this issue:
1. Break the initial dataframe into 2 different dataframes: one that filters out the null values and calculates the sum, and a second that contains only the null values and is not part of the calculation. Lastly, we union the two back together.
2. Apply a salting technique to the null values in order to spread the nulls across all partitions and bring stability to the stage.
Solution 1:
account_window = Window.partitionBy("account")
# split to null and non null
non_null_accounts_df = sales_df.where(col("account").isNotNull())
only_null_accounts_df = sales_df.where(col("account").isNull())
# calculate the sum for the non null
sales_with_non_null_accounts_df = non_null_accounts_df.withColumn("sum_sales_per_account", sum(col("price")).over(account_window))
# union the calculated result and the non null df to the final result
sales_with_account_total_df = sales_with_non_null_accounts_df.unionByName(only_null_accounts_df, allowMissingColumns=True)
Solution 2:
from pyspark.sql.functions import ceil, coalesce, lit, rand

SPARK_SHUFFLE_PARTITIONS = int(spark.conf.get("spark.sql.shuffle.partitions"))
modified_sales_df = (sales_df
    # create a random partition value that spans as much as the number of shuffle partitions
    .withColumn("random_salt_partition", ceil(rand() * lit(SPARK_SHUFFLE_PARTITIONS)).cast("string"))
    # use the random partition value only when the account value is null
    .withColumn("salted_account", coalesce(col("account"), col("random_salt_partition")))
)
# modify the partition to use the salted account
salted_account_window = Window.partitionBy("salted_account")
# use the salted account window to calculate the sum of sales
sales_with_account_total_df = modified_sales_df.withColumn("sum_sales_per_account", sum(col("price")).over(salted_account_window))
In my case I decided to go with solution 2, since it didn't force me to create more dataframes for the sake of the calculation, and here is the result: the salting technique resolved the skew, and the exact same stage now runs for a total of 5.5 minutes instead of 1.2 hours. The only modification in the code was the salting column in the partitionBy. The comparison is based on the exact same cluster, number of nodes, and cluster configuration.

How to do multiple aggregate with SINGLE window in Flink?

I'm new to Flink, and I want to do something I have done in Spark many times.
For example, in Spark I can do something like below
ds.groupByKey(???).mapGroups(???) // aggregate 1
.groupByKey(???).mapGroups(???) // aggregate 2
The first aggregate deals with a batch of input data, and the second aggregate deals with the output of the first aggregate. What I need is the output of the second aggregate.
But in Flink, it seems that any aggregate should execute with a specific window like below
ds.keyBy(???)
.window(???) // window 1
.aggregate(???) // aggregate 1
.keyBy(???)
.window(???) // window 2
.aggregate(???) // aggregate 2
If I set window 2, then the input data of the second aggregate may NOT be exactly the output of the first aggregate, which is not what I want.
I want to do multiple continuous aggregates over the same batch of data, which can be gathered in a single window. How can I realize this in Flink?
Thanks for your help.
Update for more details.
Each window must have its own strategy; for example, I may set the window strategies like below:
ds.keyBy(key1)
.window(TumblingProcessingTimeWindows.of(Time.of(1, TimeUnit.HOURS))) // window 1, 1 hour tumbling window
.aggregate(???) // aggregate 1
.keyBy(key2)
.window(TumblingProcessingTimeWindows.of(Time.of(1, TimeUnit.MINUTES))) // window 2, 1 minute tumbling window
.aggregate(???) // aggregate 2
Window 1 may gather one billion rows during its one-hour tumbling window, and after aggregation it outputs one million rows.
I want to do some calculation with those one million rows in aggregate 2, but I don't know which window strategy could gather exactly those one million rows.
If I set window 2 to a tumbling time window like above, it may split those one million rows into two batches, and then the output of aggregate 2 will not be what I need.
You can avoid this problem by using event-time windows rather than processing-time windows. And if you don't already have timestamps in the events that you want to use as the basis for timing, then you can do something like this in order to use ingestion-time timestamps:
WatermarkStrategy<MyType> watermarkStrategy =
        WatermarkStrategy
                .<MyType>forMonotonousTimestamps()
                .withTimestampAssigner(
                        (event, streamRecordTimestamp) -> Instant.now().toEpochMilli());

DataStream<MyType> timestampedEvents = ds
        .assignTimestampsAndWatermarks(watermarkStrategy);

timestampedEvents.keyBy(...)
        .window(TumblingEventTimeWindows.of(Time.of(1, TimeUnit.MINUTES)))
        .aggregate(...)
        .keyBy(...)
        .window(TumblingEventTimeWindows.of(Time.of(1, TimeUnit.HOURS)))
        .aggregate(...);
This works because the events produced by the first window will each be timestamped with the timestamp for the end of the window they were assigned to. This requires then that the second window's duration be the same as, or a multiple of, the first window's duration.
Similarly, making arbitrary changes to the key-partitioning used by window 2 (compared to window 1) may produce nonsensical results.

SPARK parallelization of algorithm - non-typical, how to

I have a processing requirement that does not seem to fit the nice SPARK parallelization use cases. On the other hand, I may not see how it can be done in SPARK easily.
I am seeking the easiest way to parallelize the following situation:
1. Given a set of N records of record type A,
2. perform some processing on the A records that generates a not-yet-existing set of initial results, say J records of record type B. Record type B has a date-range aspect to it.
3. Then repeat the process for the set of A records not yet processed (the leftovers) for any records generated as part of B, but look to the left and to the right of those A records.
4. Repeat step 3 until no new records are generated.
This may sound odd, but it is nothing more than taking a set of trading records and deciding, for a given computed period Pn, whether there is a bull or bear spread evident during this period. Once that initial period is found, then, date-wise before Pn and after Pn, one can attempt to look for a bull or bear spread period that precedes or follows the initial Pn period. And so on. It all works correctly.
The algorithm I designed works on inserting records using SQL and some looping. The records generated do not exist initially and get created on the fly. I looked at dataframes and RDDs, but it is not so evident (to me) how one would do this.
Using SQL it is not such a difficult algorithm, but you need to work through the records of a given logical key set sequentially. Thus not a typical SPARK use case.
My questions are then:
How can I achieve at least some parallelization?
Should we use mapPartitions in some way so as to at least get ranges of logical key sets to process, or is this simply not possible given the use case I am trying to present? I am going to try this, but feel I may be barking up the wrong tree here. It may just need to be a loop/while in the driver, running single-threaded.
Some example record A's, shown in tabular format, as per how this algorithm works:

          Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep
key X      -5    1    0   10    9  -20    0    5    7

would result in record B's being generated as follows:

key X   Jan - Feb --> Bear
key X   Apr - Jun --> Bull
This falls into the category of non-typical Spark processing. It was solved via a loop within a loop in Spark Scala, but with JDBC usage; it could just as well have been a plain Scala JDBC program. A variation using foreachPartition is also possible.
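For reference, a minimal sketch of the per-logical-key parallelization idea (a groupBy on the key plus a sequential scan per key, in the same spirit as the mapPartitions/foreachPartition variation mentioned above); the spread-detection body is only a placeholder, since the bull/bear rules are domain-specific:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical input rows: (key, month_index, value), e.g. key X from the table above
rows = [("X", 1, -5), ("X", 2, 1), ("X", 3, 0), ("X", 4, 10), ("X", 5, 9),
        ("X", 6, -20), ("X", 7, 0), ("X", 8, 5), ("X", 9, 7)]

def find_spreads(key, records):
    # sequential scan over one logical key's records, sorted by month;
    # the real bull/bear spread detection (find Pn, expand left/right) goes here
    records = sorted(records, key=lambda r: r[1])
    return [(key, records[0][1], records[-1][1], "TODO")]

# every logical key is gathered onto one executor and scanned there in order;
# different keys are processed in parallel, records within a key stay sequential
result = (spark.sparkContext.parallelize(rows)
          .groupBy(lambda r: r[0])
          .flatMap(lambda kv: find_spreads(kv[0], list(kv[1]))))

print(result.collect())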

Multiple windows of different durations in Spark Streaming application

I would like to process a real-time stream of data (from Kafka) using Spark Streaming. I need to compute various stats from the incoming stream and they need to be computed for windows of varying durations. For example, I might need to compute the avg value of a stat 'A' for the last 5 mins while at the same time compute the median for stat 'B' for the last 1 hour.
In this case, what's the recommended approach to using Spark Streaming? Below are a few options I could think of:
(i) Have a single DStream from Kafka and create multiple DStreams from it using the window() method. For each of these resulting DStreams, the windowDuration would be set to different values as required. eg:
// pseudo-code
val streamA = kafkaDStream.window(Minutes(5), Minutes(1))
val streamB = kafkaDStream.window(Hours(1), Minutes(10))
(ii) Run separate Spark Streaming apps - one for each stat
Questions
To me (i) seems like a more efficient approach. However, I have a couple of doubts regarding that:
How would streamA and streamB be represented in the underlying data structure?
Would they share data, since they originate from the same KafkaDStream, or would there be duplication of data?
Also, are there more efficient methods to handle such a use case?
Thanks in advance
Your (i) streams look sensible, will share data, and you can look at WindowedDStream to get an idea of the underlying representation. Note your streams are of course lazy, so only the batches being computed upon are in the system at any given time.
Since the state you have to maintain for the computation of an average is small (2 numbers), you should be fine. I'm more worried about the median (which requires a pair of heaps).
One thing you haven't made clear, though, is whether you really need the update component of your aggregation that is implied by the windowing operation: your streamA maintains the last 5 minutes of data, updated every minute, and streamB maintains the last hour, updated every 10 minutes.
If you don't need that freshness, not requiring it will of course minimize the amount of data in the system. You can have a streamA with a batch interval of 5 minutes and a streamB that is derived from it (with window(Hours(1)), since 60 is a multiple of 5).
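For completeness, a minimal PySpark Streaming sketch of option (i), with a queueStream standing in for the Kafka source (durations are in seconds in the Python DStream API):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="multi-window-sketch")
ssc = StreamingContext(sc, 60)  # 1-minute batch interval

# stand-in source; in the real job this would be the Kafka DStream
kafka_dstream = ssc.queueStream([sc.parallelize([("A", 1.0), ("B", 2.0)])])

# several windowed views over the same source DStream; the underlying batch RDDs are shared
stream_a = kafka_dstream.window(5 * 60, 60)        # last 5 minutes, sliding every minute
stream_b = kafka_dstream.window(60 * 60, 10 * 60)  # last hour, sliding every 10 minutes

stream_a.count().pprint()
stream_b.count().pprint()

ssc.start()
ssc.awaitTerminationOrTimeout(10)
ssc.stop()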

Redis: intersection of sorted sets by score

I want to store location triples in a Redis datastore, but want to make them searchable as well. This would make it possible to do range queries, like 'give me all points with 1 < x < 3 and y > 2'. Therefore, I use a combination of sorted sets.
Each triple is saved in Redis like so (e.g. location A with x = 1, y = 2, z = 3):
hset /locations/A x "1"
hset /locations/A y "2"
hset /locations/A z "3"
hset /locations/A payload "{ ...some json payload... }"
zadd /locations:x 1 locations/A
zadd /locations:y 2 locations/A
zadd /locations:z 3 locations/A
This way, I can easily find all locations (or paths to locations) with e.g. an x value between 4 and 5:
zrangebyscore /locations:x 4 5
Or all locations with e.g. a y value between 1 and 3:
zrangebyscore /locations:y 1 3
A problem arises when I try to match all locations with an x value between 4 and 5 AND a y value between 1 and 3, because then I have to do two queries to Redis and then compare the results in JavaScript in NodeJS, which could be very time-consuming when a lot of locations are defined. Did anyone encounter such a problem?
I experimented with zinterstore and zunionstore, but haven't found a satisfactory solution yet. I considered storing the zrangebyscores into a temporary set and then doing a zinterstore, but didn't find a Redis command to store the output of zrangebyscore directly to Redis (in the same command).
I'd like to ignore the use case (locations paths/distances) itself because there are multiple proven ways to address this challenge, also with Redis (search for geospatial and ye shall find), and focus instead on the technique.
So, assuming you're going to invent and implement your own geo logic, the most effective way (short of modifying Redis' code) to address the challenges involved in scoring/ranging/intersecting/... these Sorted Sets will be inside of a Lua script. This is what you may buzzwordly call "Data Gravity" - the processor is close to the data so accessing and manipulating the data is fastest and requires no network.
Inside such scripts you can, for example, store the results of ZRANGEBYSCORE in local variables, do whatever you need to do there and with them, and reply with the end result to the (Node.js) client.
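A minimal sketch of that approach, shown here with the redis-py client (any client that can EVAL a script would do) and the key names from the question; the script runs both ZRANGEBYSCORE calls server-side and intersects the members before replying:

import redis

r = redis.Redis(decode_responses=True)

# server-side: both range queries and the intersection happen inside Redis
LUA_INTERSECT_BY_SCORE = """
local xs = redis.call('ZRANGEBYSCORE', KEYS[1], ARGV[1], ARGV[2])
local ys = redis.call('ZRANGEBYSCORE', KEYS[2], ARGV[3], ARGV[4])
local in_y = {}
for _, member in ipairs(ys) do in_y[member] = true end
local result = {}
for _, member in ipairs(xs) do
  if in_y[member] then table.insert(result, member) end
end
return result
"""

intersect_by_score = r.register_script(LUA_INTERSECT_BY_SCORE)
# all locations with 4 <= x <= 5 and 1 <= y <= 3
print(intersect_by_score(keys=["/locations:x", "/locations:y"], args=[4, 5, 1, 3]))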
Just do it: run your two queries, Redis is fast and doesn't care. Then, in JavaScript, use something like "Simplest code for array intersection in javascript" to find the values that exist in both.
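And a minimal sketch of this second suggestion, in Python rather than JavaScript so it lines up with the other snippets here; the JavaScript version is just the same two zrangebyscore calls plus an array intersection:

import redis

r = redis.Redis(decode_responses=True)

# two separate range queries, intersected on the client
in_x = set(r.zrangebyscore("/locations:x", 4, 5))
in_y = set(r.zrangebyscore("/locations:y", 1, 3))
print(in_x & in_y)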
