Array of RDDs? One RDD for a time window - apache-spark

I have a question about bucketing time events with Spark, and the best way to handle it.
So I'm ingesting a very large dataset, with specific start/stop times for each event.
For instance, I might load in three weeks of data. Within the main time window, I divide it into buckets of smaller intervals: three weeks divided into 24-hour time buckets, with an array that looks like [(start_epoch, stop_epoch), (start_epoch, stop_epoch), ...]
Within each time bucket I map/reduce my events down into a smaller set.
I'd like to keep the events split up by the time bucket they belong to.
What is the best way to handle this? Each map/reduce operation results in a new RDD so I'm effectively left with a large array of RDDs.
Is it "safe" to just loop over that array from the driver, and then do other transformations/actions on each RDD to get results each time window?
Thanks!

I would suggest thinking about it a bit differently:
You want to read your data and then key it by its timestamp rounded to hour resolution ("keyBy"). Then you can reduceByKey (or combineByKey if you want a different type in the output).
When working with Spark it's not necessary to collect items into arrays by some key (that's even an antipattern):
RDD[Event] -> keyBy ts rounded to hour -> RDD[(hour, event)] -> reduceByKey (i.e. by hour) -> RDD[(hour, aggregated view of all events in this hour)]
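A minimal sketch of that pipeline (assuming a SparkContext sc in scope and a hypothetical Event case class; the field names are illustrative):

import org.apache.spark.rdd.RDD

// Hypothetical event type: ts is epoch seconds, value is whatever we aggregate.
case class Event(ts: Long, value: Double)

val events: RDD[Event] = sc.parallelize(Seq(
  Event(1000L, 1.0), Event(2000L, 2.0), Event(90000L, 3.0)))

// One keyed RDD instead of one RDD per bucket.
val perHour: RDD[(Long, Double)] = events
  .map(e => (e.ts / 3600 * 3600, e.value)) // round the timestamp down to the hour
  .reduceByKey(_ + _)                      // aggregate within each hour bucket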

Related

Event Processing (17 Million Events/day) using Azure Stream Analytics HoppingWindow Function - 24H Window, 1 Minute Hops

We have a business problem that needs solving and would like some guidance from the community on the combination of products in Azure we could use to solve it.
The Problem:
I work for a business that produces online games. We would like to display the number of users playing a specific game in a 24-hour window, with the value updating every minute. Essentially, the output that the HoppingWindow(Duration(hour, 24), Hop(minute, 1)) function in Azure Stream Analytics provides.
Currently, the number of events is around 17 million a day, and the Stream Analytics job seems to be struggling with the load. We have tried the following so far:
Tests Done:
17 Million Events -> Event Hub (32 Partitions) -> ASA (42 Streaming Units) -> Table Storage
Failed: Stream Analytics Job never outputs on large timeframes (Stopped test at 8 Hours)
17 Million Events -> Event Hub (32 Partitions) -> FUNC -> Table Storage (With Proper Partition/Row Key)
Failed: Table storage does not support distinct count
17 Million Events -> Event Hub -> FUNC -> Cosmos DB
Tentative: Cosmos DB doesn't support distinct count, not natively anyway. There seem to be some hacks going around, but I'm not sure that's the way to go.
Are there any known designs geared toward processing 17 million events over a 24-hour window, with output every minute?
Edit: As per the comments, the code:
SELECT
    GameId,
    COUNT(DISTINCT UserId) AS ActiveCount,
    DateAdd(hour, -24, System.TimeStamp()) AS StartWindowUtc,
    System.TimeStamp() AS EndWindowUtc
INTO [out]
FROM [in] TIMESTAMP BY EventEnqueuedUtcTime
GROUP BY
    HoppingWindow(Duration(hour, 24), Hop(minute, 1)),
    GameId,
    UserId
The expected output: note that in reality there will be 1,440 records per GameId, one for each minute.
To be clear, the problem is that the expected output on the larger timeframes, i.e. 24 hours, is either never produced or takes 8+ hours to appear. The smaller window sizes work; for example, changing the above code to use HoppingWindow(Duration(minute, 10), Hop(minute, 5)).
The tests that followed assumed that ASA is not the answer to the problem, so we tried different approaches, which seems to have caused a bit of confusion; sorry about that.
The way ASA scales up at the moment is with 1 node vertically from 1 to 6SU, then horizontally with multiple nodes of 6SU above that threshold.
Obviously, to be able to scale horizontally a job needs to be parallelizable, which means the stream will be distributed across nodes according to the partition scheme.
Now if the input stream, the query and the output destination are aligned in partitions, then the job is called embarrassingly parallel and that's where you'll be able to reach maximum scale. Each pipeline, from entry to output, will be independent, and each node will only have to maintain in memory the data pertaining to its own state (those 24h of data). That's what we're looking for here: minimizing the local data store at each node.
With EH supporting 32 partitions on most SKUs, the maximum scale publicly available on ASA is 192 SU (6*32). If partitions are balanced, that means each node will have the least amount of data to maintain in its own state store.
Then we need to minimize the payload itself (the size of an individual message), but from the look of the query that's already the case.
Could you try scaling up to 192SU and see what happens?
We are also working on multiple other features that could help on that scenario. Let me know if that could be of interest to you.

How to optimize this short Spark Apache DataFrame function

The following function is supposed to join two DataFrames and return the number of checkouts per location. It is based on the Seattle Public Library dataset.
def topKCheckoutLocations(checkoutDF: DataFrame, libraryInventoryDF: DataFrame, k: Int): DataFrame = {
  checkoutDF
    .join(libraryInventoryDF, "ItemType")
    .groupBy("ItemBarCode", "ItemLocation") // grouping by ItemBarCode and ItemLocation
    .agg(count("ItemBarCode"))              // counting number of ItemBarCode per group
    .withColumnRenamed("count(ItemBarCode)", "NumCheckoutItemsAtLocation")
    .select($"ItemLocation", $"NumCheckoutItemsAtLocation")
}
When I run this, it takes ages to finish (40+ minutes), and I'm pretty sure it is not supposed to take more than a couple of minutes. Can I change the order of the calls to decrease computation time?
As I never managed to finish the computation, I never actually got to check whether the output is correct. I assume it is.
checkoutDF has 3 million rows.
For Spark job performance:
Select only the required columns from each dataset before the join, to decrease the data size.
Partition both datasets by the join column ("ItemType") to avoid shuffling.
A sketch of both suggestions follows below.
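A minimal sketch of those two suggestions, keeping the column names from the question (the repartition call is illustrative; whether "ItemType" is really the right join key is a separate concern, since joining on a low-cardinality column can explode the row count and would itself explain a 40+ minute runtime):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count}

def topKCheckoutLocations(checkoutDF: DataFrame, libraryInventoryDF: DataFrame, k: Int): DataFrame = {
  // 1. Keep only the columns the query actually needs before the join.
  val checkouts = checkoutDF
    .select("ItemType", "ItemBarCode")
    .repartition(col("ItemType")) // 2. co-partition both sides on the join key
  val inventory = libraryInventoryDF
    .select("ItemType", "ItemLocation")
    .repartition(col("ItemType"))

  checkouts
    .join(inventory, "ItemType")
    .groupBy("ItemBarCode", "ItemLocation")
    .agg(count("ItemBarCode").as("NumCheckoutItemsAtLocation"))
    .select("ItemLocation", "NumCheckoutItemsAtLocation")
}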

Hazelcast Jet stream processing end window emission

I've stumbled across an interesting observation while trying to cross-check the results of an aggregation in my stream processing. I created a test case where a pre-defined data set was fed into a journaled map, and the aggregation was supposed to produce exactly one result, as the data and its pre-determined timestamps were in line with the window size/slide. However, the result was never published: the window was not emitted, even though a few accumulate/combine operations were executed. It works differently with real data, but the result of the aggregation is always 'behind' the amount of data drawn from the source. I guess this has something to do with watermarks? How can I make sure in my test case that it doesn't wait for more data to come? Will allowed lateness help?
First, I'll refer you to the two sections in the manual which describe how watermarks work and also talk about the concept of stream skew:
http://docs.hazelcast.org/docs/jet/0.6.1/manual/#unbounded-stream-processing
http://docs.hazelcast.org/docs/jet/0.6.1/manual/#stream-skew
The concept of "current time" in Jet only advances as long as there are events with advancing timestamps. There are typically several factors at play here:
Allowed lateness: This defines your lag per partition, assuming you are using a partitioned source like Kafka. It describes the tolerable degree of out-of-orderness in the timestamps within a single partition. If allowed lateness is 2 sec, the window will only close once you have received an event at N + 2 seconds across all input partitions.
Stream skew: This can happen when for example you have 10 Kafka partitions but only 3 are producing any events. As Jet coalesces watermarks from all partitions, this will cause the stream to wait until the other 7 partitions have some data. There's a timeout after which these partitions are considered idle, but this is by default 60 sec and currently not configurable in the pipeline API. So in this case you won't have any output until these partitions are marked as idle.
When using test data, it's quite common to have very low volume of events and many partitions, which can make it a challenge to advance the time correctly.
The points in Can Gencer's answer are valid. But for a test you can also use a batch source, such as Sources.list. By adding timestamps to a BatchStage you convert it into a StreamStage, on which you can do window aggregation. The aggregate transform will emit any pending windows at the end of the batch.
JetInstance inst = Jet.newJetInstance();
IListJet<TimestampedEntry<String, Integer>> list = inst.getList("data");
list.add(new TimestampedEntry<>(1, "a", 1));
list.add(new TimestampedEntry<>(1, "b", 2));
list.add(new TimestampedEntry<>(1, "a", 3));
list.add(new TimestampedEntry<>(1, "b", 4));

Pipeline p = Pipeline.create();
p.drawFrom(Sources.<TimestampedEntry<String, Integer>>list("data"))
 .addTimestamps(TimestampedEntry::getTimestamp, 0) // allowedLag = 0
 .groupingKey(TimestampedEntry::getKey)
 .window(tumbling(1))                              // tumbling windows of size 1 (timestamp units)
 .aggregate(AggregateOperations.summingLong(TimestampedEntry::getValue))
 .drainTo(Sinks.logger());

inst.newJob(p).join();
inst.shutdown();
The above code prints:
TimestampedEntry{ts=01:00:00.002, key='a', value='4'}
TimestampedEntry{ts=01:00:00.002, key='b', value='6'}
Remember to keep your data in the list ordered by time as we use allowedLag=0.
Answer is valid for Jet 0.6.1.

How to design spark program to process 300 most recent files?

Situation
New small files come in periodically. I need to run a calculation over the most recent 300 files, so basically there is a window moving forward; the size of the window is 300 files, and I need to do the calculation on the window.
Something very important to know is that this is not Spark Streaming computing, because in Spark Streaming the unit/scope of a window is time. Here the unit/scope is a number of files.
Solution 1
I will maintain a dict of size 300. Each time a new file comes in, I turn it into a Spark data frame and put it into the dict, making sure the oldest entry is popped out if the length of the dict exceeds 300.
After this I merge all the data frames in the dict into a bigger one and do the calculation.
The above process runs in a loop; every time a new file comes in we go through the loop.
pseudo code for solution 1
for file in file_list:
    data_frame = get_data_frame(file)
    my_dict[timestamp] = data_frame  # timestamp comes from the file (pseudo-code)

    for timestamp in list(my_dict.keys()):
        if timestamp older than 24 hours:
            # not only unpersist, but also delete to make sure the memory is released
            my_dict[timestamp].unpersist()
            del my_dict[timestamp]

    # pop one data frame from the dict
    _, big_data_frame = my_dict.popitem()
    for timestamp in my_dict.keys():
        df = my_dict.get(timestamp)
        big_data_frame = big_data_frame.unionAll(df)

    # Then we run SQL on the big_data_frame to get the report
problem for solution 1
It always hits an out-of-memory or "GC overhead limit exceeded" error.
question
Do you see anything inappropriate in solution 1?
Is there any better solution?
Is this the right kind of situation in which to use Spark?
One observation: you probably don't want to use popitem, as the keys of a Python dictionary are not sorted, so you can't guarantee that you're popping the earliest item. Instead, I would recreate the dictionary each time using a sorted list of timestamps. Assuming your filenames are just timestamps:
my_dict = {file:get_dataframe(file) for file in sorted(file_list)[-300:]}
Not sure whether this will fix your problem; can you paste the full stack trace of your error into the question? It's possible that your problem is happening in the Spark merge/join (not included in your question).
My suggestion for this is streaming, but not with respect to time; I mean you would still have some window and sliding interval set, but say it is 60 secs.
So every 60 secs you get a DStream of the file contents, in 'x' partitions, where these 'x' partitions represent the files you drop onto HDFS or the file system.
This way you can keep track of how many files/partitions have been read; if they are fewer than 300, wait until the count hits 300 and then start processing (a rough sketch follows below).
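A rough sketch of that idea (assuming new files land in a directory monitored by textFileStream; paths and the threshold are illustrative, and for simplicity the window here is counted in batches rather than individual files):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

val ssc = new StreamingContext("local[*]", "recent-files", Seconds(60))
val buffer = mutable.Queue[RDD[String]]() // one RDD per batch of newly arrived files

ssc.textFileStream("hdfs:///incoming").foreachRDD { rdd =>
  if (!rdd.isEmpty()) buffer.enqueue(rdd)
  // Tracking exact file counts would require inspecting the input paths per batch.
  if (buffer.size >= 300) {
    val window = ssc.sparkContext.union(buffer.toSeq)
    // ... run the per-window computation on `window` here ...
    buffer.dequeue() // slide the window forward
  }
}
ssc.start()
ssc.awaitTermination()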
If it's possible to keep track of the most recent files, or to discover them once in a while, then I'd suggest doing something like
sc.textFile(','.join(files))
or, if it's possible to identify a specific pattern that matches those 300 files,
sc.textFile("*pattern*")
It's even possible to pass comma-separated patterns, but it might happen that files matching more than one pattern are read more than once.
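A small sketch of the first variant (here in Scala; it assumes a SparkContext sc and files on a locally listable filesystem, so the path is illustrative; on HDFS you would list via the Hadoop FileSystem API instead):

import java.io.File

// Pick the 300 most recently modified files and load them as one RDD.
val recent = new File("/data/incoming")
  .listFiles()
  .sortBy(_.lastModified())(Ordering[Long].reverse)
  .take(300)
  .map(_.getAbsolutePath)

val windowRdd = sc.textFile(recent.mkString(","))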

Multiple windows of different durations in Spark Streaming application

I would like to process a real-time stream of data (from Kafka) using Spark Streaming. I need to compute various stats from the incoming stream, and they need to be computed for windows of varying durations. For example, I might need to compute the average value of stat 'A' for the last 5 mins while at the same time computing the median of stat 'B' for the last 1 hour.
In this case, what's the recommended approach to using Spark Streaming? Below are a few options I could think of:
(i) Have a single DStream from Kafka and create multiple DStreams from it using the window() method. For each of these resulting DStreams, the windowDuration would be set to a different value as required, e.g.:
// pseudo-code
val streamA = kafkaDStream.window(Minutes(5), Minutes(1))
val streamB = kafkaDStream.window(Hours(1), Minutes(10))
(ii) Run separate Spark Streaming apps - one for each stat
Questions
To me (i) seems like the more efficient approach. However, I have a couple of doubts regarding it:
How would streamA and streamB be represented in the underlying data structure?
Would they share data, since they originate from the same KafkaDStream, or would there be duplication of data?
Also, are there more efficient methods to handle such a use case?
Thanks in advance
Your (i) streams look sensible, will share data, and you can look at WindowedDStream to get an idea of the underlying representation. Note your streams are of course lazy, so only the batches being computed upon are in the system at any given time.
Since the state you have to maintain for the computation of an average is small (2 numbers), you should be fine. I'm more worried about the median (which requires a pair of heaps).
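As an aside, a minimal sketch of the two-heap running median mentioned above (plain Scala, independent of Spark; the class name is illustrative):

import scala.collection.mutable

// Running median via two heaps: `lo` keeps the smaller half (max-heap),
// `hi` the larger half (min-heap); their tops straddle the median.
class RunningMedian {
  private val lo = mutable.PriorityQueue.empty[Double]                            // max-heap
  private val hi = mutable.PriorityQueue.empty[Double](Ordering[Double].reverse)  // min-heap

  def add(x: Double): Unit = {
    if (lo.isEmpty || x <= lo.head) lo.enqueue(x) else hi.enqueue(x)
    // Rebalance so the heap sizes differ by at most one.
    if (lo.size > hi.size + 1) hi.enqueue(lo.dequeue())
    else if (hi.size > lo.size + 1) lo.enqueue(hi.dequeue())
  }

  def median: Double =
    if (lo.size > hi.size) lo.head
    else if (hi.size > lo.size) hi.head
    else (lo.head + hi.head) / 2
}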
One thing you haven't made clear, though, is whether you really need the update component of your aggregation that is implied by the windowing operation: your streamA maintains the last 5 minutes of data, updated every minute, and streamB maintains the last hour, updated every 10 minutes.
If you don't need that freshness, not requiring it will of course minimize the amount of data in the system. You can have a streamA with a batch interval of 5 mins and a streamB that is derived from it (with window(Hours(1)), since 60 is a multiple of 5), as in the sketch below.
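A short sketch of that alternative (the Kafka setup is elided; only the batch interval and window arithmetic matter here):

import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Hours, Minutes, StreamingContext}

val ssc = new StreamingContext(sc, Minutes(5))   // batch interval = streamA's cadence
val kafkaDStream: DStream[String] = ???          // create the Kafka input stream as in the question

val streamA = kafkaDStream                       // each batch already covers 5 minutes
val streamB = kafkaDStream.window(Hours(1))      // 12 batches; works since 60 is a multiple of 5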
