Samza/Flink/Spark: processing completion ("done" flag)


Good day
My system consumes a stream of trades and consists of multiple processing steps (jobs). Essentially, it computes a metric (VaR) over the trades. The source stream can be split into chunks (trades are grouped into portfolios). I currently have a home-grown solution and am looking to switch to Samza or Flink.
Trades                               Processors     Output
{<portfolio done>, t3, t2, t1}  ->   A->B->...  ->  Measure
The fundamental question I am struggling with is this: I need to know when a portfolio is fully processed and the figure in the output is complete. Basically, I need to correlate the output (the aggregated measure) with the input (the trades whose processing resulted in this measure) so that one can say "this figure reflects these trades". Ideally, when all portfolios are processed, I need to record the final figure.
The processing stages are non-trivial: they create intermediary streams (like return vectors), do some filtering, etc., so simple counting doesn't work.
I feel that my problem should be fairly standard, but I can't find either an algorithm or an implementation in Samza or Flink.
Closest question I found is this one: https://mail-archives.apache.org/mod_mbox/samza-dev/201510.mbox/%3CCAA5FViH_2cz9CcS4VEo49HoWHF++QKyPB5t7y+bDCoVynZBqtg#mail.gmail.com%3E

Related

stream processing and batch processing

Since the latency of batch processing comes from accumulating a specific number of records, can I regard batch processing with a batch size of one as stream processing? Or are there other differences in how the operators do their calculations?
For example, if I set the batch size of a Spark-based program to 1, can I make its latency as low as Flink's?
My thinking is as follows:
In stream processing, a record flows from one operator to the next as soon as it has been processed, but in batch processing, an operator can accept another record only after all the operators have finished processing the current one.
It seems that the pipelining in stream processing accounts for the speed-up.
Is my explanation right? If not, what is the appropriate answer to my question?

TLDR: there are quite a lot of reasons why you should help your program and tell it explicitly whether you want a bounded (batch) or unbounded (stream) computation.
Your thinking is good in theory, but that's not how it works in practice: the batch vs stream setting has to be specified explicitly by the programmer. The runtime won't infer it from the batch size (or batch delay) you set. At least that's how Flink works.
Furthermore, the batch vs stream divide goes much deeper: batch shouldn't care much about time.
Let's say you increase the batch size to the size of the whole dataset. Only in that case will Flink be able to apply performance optimization passes over your plan. For example: in streaming mode, JOINs need to keep both sides in memory in case a match appears on the other side. In batch mode, Flink knows both sides are of fixed size, so it can first materialize the smaller side and keep only that in memory while it probes it with the other side. Thus Flink needs less memory for batch, and it uses CPU caches better (which makes for faster processing).
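The memory difference between the two join styles can be illustrated with a small plain-Python sketch. This is not Flink code and does not reflect Flink's actual operators; the function names and the `(side, key, value)` event layout are assumptions made for the example. It only shows that a bounded join can materialize the smaller input, while an unbounded join has to retain every row seen on both sides.

```python
from collections import defaultdict

def batch_hash_join(left, right):
    """Join two bounded lists of (key, value) pairs.

    Both inputs are fully known, so only the smaller side is built into
    an in-memory table; the larger side is streamed past it.
    Returns (joined rows, number of rows held in memory).
    """
    swapped = len(left) > len(right)
    build, probe = (right, left) if swapped else (left, right)
    table = defaultdict(list)
    for key, value in build:
        table[key].append(value)
    out = []
    for key, value in probe:
        for other in table.get(key, []):
            # Always emit rows as (key, left_value, right_value).
            out.append((key, value, other) if swapped else (key, other, value))
    return out, sum(len(rows) for rows in table.values())

def streaming_symmetric_join(events):
    """Join an interleaved stream of (side, key, value) events.

    Neither side is known to be complete, so every row seen so far on
    both sides is retained in case a match arrives later.
    Returns (joined rows, number of rows held in memory).
    """
    state = {"L": defaultdict(list), "R": defaultdict(list)}
    out = []
    for side, key, value in events:
        other_side = "R" if side == "L" else "L"
        for other in state[other_side].get(key, []):
            out.append((key, value, other) if side == "L" else (key, other, value))
        state[side][key].append(value)
    retained = sum(len(rows) for table in state.values() for rows in table.values())
    return out, retained
```

With a two-row left side and a four-row right side, the batch version holds two rows in memory, while the streaming version ends up holding all six.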
Also, streaming has to maintain watermarks (special row metadata that helps with correlating the right rows together time-wise, persisting a coherent set of rows together, etc.), while batch doesn't care about them. That's overhead.
If you're up for it, you can peruse the Flink source code and compare the Batch vs Stream SQL optimization rules. You'll see that stream has to deal with watermarks (FlinkLogicalWatermarkAssigner) when batch does not, and that it has to expand temporal tables fully (LogicalCorrelateToJoinFromTemporalTableRule). Batch can also sort rows and do sort-merge joins (BatchPhysicalSortMergeJoinRule). Stream has to process aggregates incrementally (IncrementalAggregateRule) when batch can do them locally at the data source (PushLocalHashAggIntoScanRule), etc. Each difference between these two files is either a thing one side has to do specifically because of its (batch vs stream) nature, or an optimization pass that is allowed by its (batch vs stream) nature.
If you would like to know more about this topic and its numerous subtleties, you can also read the Flink blog posts, the Flink documentation, and the Flink Improvement Proposals.

Apache Beam Wall Time keeps increasing

I have a Beam pipeline that reads from a pubsub topic, does some small transformations and then writes the events to some BigQuery tables.
The transforms are light on processing, maybe removing a field or something else, but, as you can see from the image below, the Wall Time is very high for some steps. What can actually cause this?
Every element is actually a tuple of the form ((str, str, str), {**dict with data}). By this key we actually try to do a naive deduplication by taking the latest event by this key.
Basically, whatever I add after that Get latest element per key step is slow, and tagging is also slow, even though it just adds a tag to the element.
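For reference, the naive "latest event per key" deduplication described above amounts to something like the following plain-Python sketch. It is not the pipeline's actual code; the `event_ts` field used to decide which event is latest is a hypothetical name, since the question does not say how recency is determined.

```python
def latest_per_key(elements, ts_field="event_ts"):
    """Keep only the most recent payload for each key.

    elements: iterable of (key, payload) pairs, where key is a tuple of
    three strings and payload is a dict of event data.
    ts_field: hypothetical name of the timestamp field in the payload.
    """
    latest = {}
    for key, payload in elements:
        current = latest.get(key)
        if current is None or payload[ts_field] >= current[ts_field]:
            latest[key] = payload
    return list(latest.items())
```

In a distributed runner, all payloads for one key have to reach the same worker before this comparison can happen, which is why such a step implies a grouping by key.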

By "slow" I assume you mean how many elements it processes per second?
There are two things going on here. First, I assume that Get latest element per key contains a GroupByKey of sorts. This involves a global shuffle, with all elements being sent over the network to other workers, to ensure that all elements with a given key are grouped onto the same worker. This IO can be expensive, at least in terms of wall time.
Secondly, steps that don't need worker-to-worker communication are "fused" which couples their throughputs. E.g. if one has DoFnA followed by DoFnB followed by DoFnC, the processing proceeds by passing the first element through DoFnA, then passing those outputs to DoFnB and subsequently DoFnC before getting the second element to pass to DoFnA. This means that if one of the Fns (or the reading or writing) has a bounded throughput, all of them will.
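The element-by-element order of fused steps can be sketched in plain Python. This is only an illustration of the ordering described above, not Beam's runner code; the three functions are stand-ins for DoFnA, DoFnB and DoFnC, and each is assumed to produce exactly one output per input for simplicity.

```python
def do_fn_a(x):
    return x + 1

def do_fn_b(x):
    return x * 2

def do_fn_c(x):
    return x - 3

def run_fused(elements, fns, trace):
    """Push each element through every step before reading the next one.

    trace collects (step name, source element) pairs in execution order,
    which makes the interleaving of the fused steps visible.
    """
    results = []
    for element in elements:
        value = element
        for fn in fns:
            trace.append((fn.__name__, element))
            value = fn(value)
        results.append(value)
    return results
```

Because the second element is not read until the first has passed through all three steps, a slow step (or a slow read or write) holds back every other step in the chain, and its time shows up across the fused stage.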

Spark sql structured streaming on multiple sources with relatively small amount of data

I have multiple sources, and each of them receives a relatively small amount of data. The schemas of the data received from these sources are the same, so the sources serve only to separate the applications that send the events. The aggregations performed on the data are expensive (at least 200 tasks), which means that I need a lot of processing power. That's OK, but when I have many sources with a small amount of data each, it becomes really cost-inefficient. If I run a separate streaming query against every source, I will have ~200*x tasks, where x is the number of sources. This would be fine as long as I had a lot of data in every source. My question is whether there is a way to combine the data from the different sources (it's also OK to put all the data in one source) and perform the aggregations on the whole chunk, while not mixing the data of different applications. The first thing that came to my mind is to group the data by application name, but then I'm extremely limited in the operations I can perform. So is there any way of doing this, or am I trying to achieve something impossible?

Spark streaming - Does reduceByKeyAndWindow() use constant memory?

I'm playing with the idea of having long-running aggregations (possibly a one day window). I realize other solutions on this site say that you should use batch processing for this.
I'm specifically interested in understanding this function, though. It sounds like it would use constant space to do an aggregation over the window, one interval at a time. If that is true, it sounds like a day-long aggregation would be viable (especially since it uses checkpointing in case of failure).
Does anyone know if this is the case?
This function is documented as: https://spark.apache.org/docs/2.1.0/streaming-programming-guide.html
A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and “inverse reducing” the old data that leaves the window. An example would be that of “adding” and “subtracting” counts of keys as the window slides. However, it is applicable only to “invertible reduce functions”, that is, those reduce functions which have a corresponding “inverse reduce” function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation.
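The incremental scheme in the quoted documentation can be sketched in plain Python as follows. This is an illustration of the idea only, not Spark's implementation; the function name and the representation of each batch interval as a dict of already-reduced values are assumptions made for the example.

```python
from collections import deque
from operator import add, sub

def sliding_window_reduce(batches, window_len, reduce_fn, inv_reduce_fn):
    """Yield the per-key reduced value of each sliding window.

    batches: iterable of dicts mapping key -> reduced value for one
    batch interval. window_len: number of intervals per window.
    Each new window is derived from the previous one by reducing in the
    interval that enters and inverse-reducing the interval that leaves.
    """
    window = deque()   # reduced values of the intervals currently in the window
    totals = {}        # running reduced value per key
    for batch in batches:
        window.append(batch)
        for key, value in batch.items():
            totals[key] = reduce_fn(totals[key], value) if key in totals else value
        if len(window) > window_len:
            leaving = window.popleft()
            for key, value in leaving.items():
                totals[key] = inv_reduce_fn(totals[key], value)
        yield dict(totals)
```

For example, `list(sliding_window_reduce(batches, 2, add, sub))` computes sliding sums over two intervals. Note that this sketch still holds one reduced dict per interval in the window (so that it can be subtracted later), plus one running total per key; it avoids re-reducing the whole window, not storing per-interval results.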

After researching this on the MapR forums, it seems that it would definitely use a constant level of memory, making a daily window possible assuming you can fit one day of data in your allocated resources.
The two downsides are that:
Doing a daily aggregation may only take 20 minutes. Doing a window over a day means that you're using all those cluster resources permanently rather than just for 20 minutes a day. So, stand-alone batch aggregations are far more resource efficient.
It's hard to deal with late data when you're streaming over exactly one day. If your data is tagged with dates, then you need to wait until all your data arrives. A 1-day window in streaming would only be good if you were literally just analyzing the last 24 hours of data regardless of its content.

Spark Streaming: stateless overlapping windows vs. keeping state

What would be some considerations for choosing stateless sliding-window operations (e.g. reduceByKeyAndWindow) vs. choosing to keep state (e.g. via updateStateByKey or the new mapWithState) when handling a stream of sequential, finite event sessions with Spark Streaming?
For example, consider the following scenario:
A wearable device tracks physical exercises performed by
the wearer. The device automatically detects when an exercise starts,
and emits a message; emits additional messages while the exercise
is in progress (e.g. heart rate); and finally, emits a message when the
exercise is done.
The desired result is a stream of aggregated records per exercise session. i.e. all events of the same session should be aggregated together (e.g. so that each session could be saved in a single DB row). Note that each session has a finite length, but the entire stream from multiple devices is continuous. For convenience, let's assume the device generates a GUID for each exercise session.
I can see two approaches for handling this use-case with Spark Streaming:
Using non-overlapping windows, and keeping state. A state is saved per GUID, with all events matching it. When a new event arrives, the state is updated (e.g. using mapWithState), and in case the event is "end of exercise session", an aggregated record based on the state will be emitted, and the key removed.
Using overlapping sliding windows, and keeping only the first sessions. Assume a sliding window of length 2 and interval 1 (see diagram below). Also assume that the window length is 2 X (maximal possible exercise time). On each window, events are aggregated by GUID, e.g. using reduceByKeyAndWindow. Then, all sessions which started in the second half of the window are dumped, and the remaining sessions emitted. This enables using each event exactly once, and ensures all events belonging to the same session will be aggregated together.
Diagram for approach #2:
Only sessions starting in the areas marked with \\\ will be emitted.
-----------
|window 1 |
|\\\\|    |
-----------
     -----------
     |window 2 |
     |\\\\|    |
     -----------
          -----------
          |window 3 |
          |\\\\|    |
          -----------
Pros and cons I see:
Approach #1 is less computationally expensive, but requires saving and managing state (e.g. if the number of concurrent sessions increases, the state might get larger than memory). However if the maximal number of concurrent sessions is bounded, this might not be an issue.
Approach #2 is twice as expensive (each event is processed twice) and has higher latency (2 X maximal exercise time), but it is simpler and more easily manageable, as no state is retained.
What would be the best way to handle this use case - is any of these approaches the "right" one, or are there better ways?
What other pros/cons should be taken into consideration?

Normally there is no single right approach; each has tradeoffs. So I'll add one more approach to the mix and outline my take on the pros and cons of each, so you can decide which one is more suitable for you.
External state approach (approach #3)
You can accumulate state of the events in external storage. Cassandra is quite often used for that. You can handle final and ongoing events separately for example like below:
val stream = ...
// isFinalEvent is assumed to be a predicate on a single event
val ongoingEventsStream = stream.filter(event => !isFinalEvent(event))
val finalEventsStream = stream.filter(event => isFinalEvent(event))
ongoingEventsStream.foreachRDD { rdd => /* accumulate state in Cassandra */ }
finalEventsStream.foreachRDD { rdd => /* finalize state in Cassandra, move to final destination if needed */ }
trackStateByKey approach (approach #1.1)
It might potentially be the optimal solution for you, as it removes the drawbacks of updateStateByKey, but considering that it has only just been released as part of Spark 1.6, it could be risky as well (since for some reason it is not widely advertised). You can use the link as a starting point if you want to find out more.
Pros/Cons
Approach #1 (updateStateByKey)
Pros
Easy to understand or explain (to rest of the team, newcomers, etc.) (subjective)
Storage: better use of memory, as it stores only the latest state of each exercise
Storage: keeps only ongoing exercises and discards them as soon as they finish
Latency is limited only by the performance of each micro-batch
Cons
Storage: if the number of keys (concurrent exercises) is large, the state may not fit into the memory of your cluster
Processing: it runs the updateState function for each key in the state map, so if the number of concurrent exercises is large, performance will suffer
Approach #2 (window)
While it is possible to achieve what you need with windows, it looks significantly less natural in your scenario.
Pros
Processing: in some cases (depending on the data) it might be more efficient than updateStateByKey, because updateStateByKey tends to run the update on every key even if there are no actual updates
Cons
"Maximal possible exercise time" sounds like a huge risk: the duration could be fairly arbitrary, since it depends on human behaviour. Some people might forget to "finish" the exercise. It also depends on the kind of exercise and could range from seconds to hours, so while you want low latency for quick exercises, you would have to keep latency as high as the longest exercise that could potentially occur.
Feels harder to explain to others how it will work (subjective)
Storage: it has to keep all data within the window frame, not only the latest state. It also frees the memory only when the window slides away from that time slot, not when the exercise actually finishes. While this might not be a huge difference if you keep only the last two time slots, it will grow if you try to achieve more flexibility by sliding the window more often.
Approach #3 (external state)
Pros
Easy to explain, etc. (subjective)
Pure stream-processing approach, meaning that Spark is responsible for acting on each individual event, not for trying to store state, etc. (subjective)
Storage: Not limited by memory of the cluster to store state - can handle huge number of concurrent exercises
Processing: State is updated only when there are actual updates to it (unlike updateStateByKey)
Latency is similar to updateStateByKey and only limited by the time required to process each micro-batch
Cons
Extra component in your architecture (unless you already use Cassandra for your final output)
Processing: by default it is slower than processing purely in Spark, as the state is not in memory and you need to transfer the data over the network
You'll have to implement exactly-once semantics for writing data into Cassandra (to cover worker failure during foreachRDD)
Suggested approach
I'd try the following:
test the updateStateByKey approach on your data and your cluster
see if memory consumption and processing time are acceptable even with a large number of concurrent exercises (as expected at peak hours)
fall back to the Cassandra approach if not
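The per-key state idea behind approach #1 can be illustrated with a small plain-Python sketch. It is not Spark code and does not use updateStateByKey; the `(guid, kind, value)` event layout and the `"start"`, `"data"`, `"end"` kinds are assumptions made for the example.

```python
def aggregate_sessions(events):
    """Accumulate events per session GUID and emit a session when it ends.

    events: iterable of (guid, kind, value) tuples, where kind is one of
    "start", "data" or "end" (hypothetical event layout).
    Returns (emitted sessions, sessions still open).
    """
    open_sessions = {}
    emitted = []
    for guid, kind, value in events:
        open_sessions.setdefault(guid, []).append((kind, value))
        if kind == "end":
            # The session is complete: emit it and drop its state.
            emitted.append((guid, open_sessions.pop(guid)))
    return emitted, open_sessions
```

The second return value is the state that has to be kept between micro-batches. A session that never receives its "end" event stays there indefinitely, which is the memory concern listed under the cons of approach #1.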

I think another drawback of the third approach is that the RDDs are not received chronologically, considering that they run on a cluster.
ongoingEventsStream.foreachRDD { /*accumulate state in cassandra*/ }
Also, what about checkpointing and driver node failure? In that case, do you read the whole data again? I'm curious to know how you want to handle this.
I guess mapWithState may be a better approach when you consider all these scenarios.
