Spark Streaming: stateless overlapping windows vs. keeping state - apache-spark

What would be some considerations for choosing stateless sliding-window operations (e.g. reduceByKeyAndWindow) vs. keeping state (e.g. via updateStateByKey or the new mapWithState) when handling a stream of sequential, finite event sessions with Spark Streaming?
For example, consider the following scenario:
A wearable device tracks physical exercises performed by the wearer. The device automatically detects when an exercise starts and emits a message; emits additional messages while the exercise is in progress (e.g. heart rate); and finally emits a message when the exercise is done.
The desired result is a stream of aggregated records per exercise session. i.e. all events of the same session should be aggregated together (e.g. so that each session could be saved in a single DB row). Note that each session has a finite length, but the entire stream from multiple devices is continuous. For convenience, let's assume the device generates a GUID for each exercise session.
I can see two approaches for handling this use-case with Spark Streaming:
Using non-overlapping windows, and keeping state. State is kept per GUID, accumulating all events matching it. When a new event arrives, the state is updated (e.g. using mapWithState), and if the event is "end of exercise session", an aggregated record based on the state is emitted and the key removed.
Using overlapping sliding windows, and keeping only the first sessions. Assume a sliding window of length 2 and interval 1 (see diagram below). Also assume that the window length is 2 X (maximal possible exercise time). In each window, events are aggregated by GUID, e.g. using reduceByKeyAndWindow. Then, all sessions which started in the second half of the window are discarded, and the remaining sessions emitted. This ensures each event is used exactly once, and that all events belonging to the same session are aggregated together.
Diagram for approach #2:
Only sessions starting in the areas marked with \\\ will be emitted.
-----------
|window 1 |
|\\\\|    |
-----------
     -----------
     |window 2 |
     |\\\\|    |
     -----------
          -----------
          |window 3 |
          |\\\\|    |
          -----------
Pros and cons I see:
Approach #1 is less computationally expensive, but requires saving and managing state (e.g. if the number of concurrent sessions increases, the state might grow larger than memory). However, if the maximal number of concurrent sessions is bounded, this might not be an issue.
Approach #2 is twice as expensive (each event is processed twice) and has higher latency (2 X maximal exercise time), but is simpler and easier to manage, as no state is retained.
What would be the best way to handle this use case - is any of these approaches the "right" one, or are there better ways?
What other pros/cons should be taken into consideration?

Normally there is no single right approach; each has tradeoffs. Therefore I'd add an additional approach to the mix and outline my take on their pros and cons, so you can decide which one is most suitable for you.
External state approach (approach #3)
You can accumulate the state of the events in external storage; Cassandra is quite often used for that. You can handle final and ongoing events separately, for example like below:
val stream = ...
val ongoingEventsStream = stream.filter(e => !isFinalEvent(e))
val finalEventsStream = stream.filter(isFinalEvent)

ongoingEventsStream.foreachRDD { rdd => /* accumulate state in Cassandra */ }
finalEventsStream.foreachRDD { rdd => /* finalize state in Cassandra, move to final destination if needed */ }
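For a bit more concreteness, here is a minimal sketch of what those foreachRDD bodies could look like, assuming the DataStax spark-cassandra-connector and a stream of case-class events; the keyspace, tables and connection settings are made up:

import com.datastax.spark.connector._   // adds saveToCassandra to RDDs

// Hypothetical event type whose fields match the Cassandra columns.
case class ExerciseEvent(guid: String, timestamp: Long, payload: String, isFinal: Boolean)

// Assumed tables: fitness.sessions_in_progress and fitness.sessions_completed.
// Assumes spark.cassandra.connection.host is set in the Spark conf.
// Cassandra writes are idempotent upserts, so re-executed batches are harmless.
ongoingEventsStream.foreachRDD { rdd =>
  rdd.saveToCassandra("fitness", "sessions_in_progress")
}

finalEventsStream.foreachRDD { rdd =>
  rdd.saveToCassandra("fitness", "sessions_completed")
}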
mapWithState approach (approach #1.1)
It might potentially be the optimal solution for you, as it removes the drawbacks of updateStateByKey, but considering it has just been released as part of Spark 1.6 (under the name mapWithState; it was originally proposed as trackStateByKey), it could be risky as well (since for some reason it is not heavily advertised). You can use the link as a starting point if you want to find out more.
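A minimal sketch of what this could look like for the exercise-session use case, reusing the hypothetical ExerciseEvent type from the sketch above; the state type, timeout and emission convention are all assumptions:

import org.apache.spark.streaming.{Seconds, State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

// Accumulate events per GUID; when the final event arrives, emit the whole
// session and remove the key from the state.
def trackSession(guid: String,
                 event: Option[ExerciseEvent],
                 state: State[Seq[ExerciseEvent]]): Option[(String, Seq[ExerciseEvent])] = {
  val soFar = state.getOption.getOrElse(Seq.empty) ++ event
  if (event.exists(_.isFinal)) {
    if (state.exists) state.remove()          // session finished, free the state
    Some(guid -> soFar)                       // emit the aggregated session
  } else {
    if (!state.isTimingOut()) state.update(soFar)
    None                                      // session still in progress
  }
}

def sessions(eventsByGuid: DStream[(String, ExerciseEvent)]): DStream[(String, Seq[ExerciseEvent])] = {
  val spec = StateSpec.function(trackSession _).timeout(Seconds(2 * 60 * 60)) // drop abandoned sessions
  eventsByGuid.mapWithState(spec).flatMap(opt => opt)                         // keep only completed sessions
}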
Pros/Cons
Approach #1 (updateStateByKey)
Pros
Easy to understand and explain (to the rest of the team, newcomers, etc.) (subjective)
Storage: Better memory usage, since only the latest state of each exercise is stored
Storage: Keeps only ongoing exercises, and discards them as soon as they finish
Latency is limited only by the processing time of each micro-batch
Cons
Storage: If the number of keys (concurrent exercises) is large, the state may not fit into your cluster's memory
Processing: The update function runs for every key in the state map, so if the number of concurrent exercises is large, performance will suffer
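For comparison, a minimal sketch of approach #1 with plain updateStateByKey, again reusing the hypothetical ExerciseEvent type; the done-flag convention (emit once, drop the key on the following batch) is just one way to do it:

import org.apache.spark.streaming.dstream.DStream

case class SessionState(events: Seq[ExerciseEvent], done: Boolean)

// Accumulate events per GUID; once the final event is seen, mark the session
// done so it can be emitted, then drop the key on the following batch.
def updateSession(newEvents: Seq[ExerciseEvent],
                  old: Option[SessionState]): Option[SessionState] = old match {
  case Some(s) if s.done && newEvents.isEmpty => None   // already emitted, remove the key
  case _ =>
    val all = old.map(_.events).getOrElse(Seq.empty) ++ newEvents
    Some(SessionState(all, newEvents.exists(_.isFinal)))
}

// Note: updateStateByKey requires ssc.checkpoint(...) to be configured.
def sessionsViaUpdateState(eventsByGuid: DStream[(String, ExerciseEvent)]): DStream[(String, SessionState)] =
  eventsByGuid
    .updateStateByKey(updateSession _)
    .filter { case (_, s) => s.done }   // each finished session is emitted exactly once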
Approach #2 (window)
While it is possible to achieve what you need with windows, it looks significantly less natural in your scenario.
Pros
Processing in some cases (depending on the data) might be more efficient than updateStateByKey, because updateStateByKey tends to run the update function on every key even when there are no actual updates
Cons
"maximal possible exercise time" - this sounds like a huge risk - it could be pretty arbitrary duration based on a human behaviour. Some people might forget to "finish exercise". Also depends on kinds of exercise, but could range from seconds to hours, when you want lower latency for quick exercises while would have to keep latency as high as longest exercise potentially could exist
Feels like harder to explain to others on how it will work (subjective)
Storage: Will have to keep all data within the window frame, not only the latest one. Also will free the memory only when window will slide away from this time slot, not when exercise is actually finished. While it might be not a huge difference if you will keep only last two time slots - it will increase if you try to achieve more flexibility by sliding window more often.
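And a rough sketch of approach #2 with reduceByKeyAndWindow, assuming maxExercise is an upper bound on a single session's duration and that events carry a timestamp (same hypothetical ExerciseEvent type as above):

import org.apache.spark.streaming.Minutes
import org.apache.spark.streaming.dstream.DStream

val maxExercise = Minutes(60)   // assumed upper bound on one session

def sessionsByWindow(eventsByGuid: DStream[(String, ExerciseEvent)]): DStream[(String, Seq[ExerciseEvent])] = {
  val windowed = eventsByGuid
    .mapValues(e => Seq(e))
    .reduceByKeyAndWindow((a: Seq[ExerciseEvent], b: Seq[ExerciseEvent]) => a ++ b,
                          maxExercise * 2,   // window length
                          maxExercise)       // slide interval

  // Keep only sessions that started in the first half of the window, so every
  // session is emitted exactly once and all of its events fall inside the window.
  windowed.transform { (rdd, windowEnd) =>
    val secondHalfStart = (windowEnd - maxExercise).milliseconds
    rdd.filter { case (_, events) => events.map(_.timestamp).min < secondHalfStart }
  }
}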
Approach #3 (external state)
Pros
Easy to explain, etc. (subjective)
Pure streaming processing approach, meaning that Spark is responsible for acting on each individual event and does not try to store state itself (subjective)
Storage: Not limited by memory of the cluster to store state - can handle huge number of concurrent exercises
Processing: State is updated only when there are actual updates to it (unlike updateStateByKey)
Latency is similar to updateStateByKey and only limited by the time required to process each micro-batch
Cons
An extra component in your architecture (unless you already use Cassandra for your final output)
Processing: By default this is slower than processing purely in Spark, since the state is not in memory and the data has to be transferred over the network
You'll have to implement exactly-once semantics for writing data into Cassandra (to handle worker failure during foreachRDD)
Suggested approach
I'd try the following:
test the updateStateByKey approach on your data and your cluster
see whether memory consumption and processing time are acceptable even with a large number of concurrent exercises (as expected at peak hours)
fall back to the Cassandra approach if they are not

I think another drawback of the third approach is that the RDDs are not processed chronologically, considering they run on a cluster:
ongoingEventsStream.foreachRDD { /* accumulate state in Cassandra */ }
Also, what about checkpointing and driver node failure? In that case, do you re-read all the data? I'm curious how you would handle this.
I guess mapWithState may be the better approach once you consider all these scenarios.

Related

Why so much criticism around Spark Streaming micro-batch (when using kafka as source)?

Since any Kafka consumer in reality consumes in batches, why is there so much criticism around Spark Streaming micro-batch (when using Kafka as its source), for example in comparison to Kafka Streams (which markets itself as real streaming)?
I mean: a lot of criticism hovers around Spark Streaming's micro-batch architecture. And, normally, people say that Kafka Streams is a real 'real-time' tool, since it processes events one-by-one.
It does process events one by one, but, from my understanding, it uses (as almost every other library/framework) the Consumer API. The Consumer API polls topics in batches in order to reduce network burden (the interval is configurable). Therefore, the consumer will do something like:
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    // PROCESS A **BATCH** OF RECORDS
    for (ConsumerRecord<String, String> record : records) {
        // PROCESS **ONE-BY-ONE**
    }
}
So, although it is right to say that Spark:
maybe has higher latency due to its micro-batch minimum interval that limits latency to at best 100 ms (see Spark Structured Streaming DOCs);
processes records in groups (either as DStreams of RDDs or as DataFrames in Structured Streaming).
But:
One can process records one-by-one in Spark - just loop through the RDDs/Rows
Kafka Streams in reality polls batches of records, but processes them one-by-one, since it implements the Consumer API under the hood.
Just to be clear, I am not asking this from a 'fan side' (which would make it an opinion question); just the opposite, I am really trying to understand it technically, in order to understand the semantics of the streaming ecosystem.
I appreciate every piece of information on this matter.
DISCLAIMER: I have been involved in Apache Storm (which is known as a streaming framework processing "record-by-record", though there is the Trident API as well), and am now involved in Apache Spark ("micro-batch").
One of the major concerns in streaming technology has been "throughput vs latency". From a latency perspective, "record-by-record" processing is clearly the winner, but the cost of "doing everything one by one" is significant and every minor thing becomes a huge overhead. (Consider a system that aims to process a million records per second: any additional per-record overhead is multiplied by a million.) Actually, there has been criticism in the opposite direction as well - bad throughput of "record-by-record" compared to micro-batch. To address this, streaming frameworks add batching in their "internal" logic, but in a way that hurts latency less (e.g. configuring the size of the batch, and a timeout to force-flush the batch).
I think the major difference between the two is whether the tasks run "continuously" and compose a "pipeline".
In streaming frameworks do "record-by-record", when the application is launched, all necessary tasks are physically planned and launched altogether and they never terminate unless application is terminated. Source tasks continuously push the records to the downstream tasks, and downstream tasks process them and push to next downstream. This is done in pipeline manner. Source won't stop pushing the records unless there's no records to push. (There're backpressure and distributed checkpoint, but let's put aside of the details and focus on the concept.)
In streaming frameworks do "micro-batch", they have to decide the boundary of "batch" for each micro-batch. In Spark, the planning (e.g. how many records this batch will read from source and process) is normally done by driver side and tasks are physically planned based on the decided batch. This approach gives end users a major homework - what is the "appropriate" size of batch to achieve the throughput/latency they're targeting. Too small batch leads bad throughput, as planning a batch requires non-trivial cost (heavily depending on the sources). Too huge batch leads bad latency. In addition, the concept of "stage" is appropriate to the batch workload (I see Flink is adopting the stage in their batch workload) and not ideal for streaming workload, because this means some tasks should wait for the "completion" of other tasks, no pipeline.
For sure, I don't think such criticism means micro-batch is "unusable". Do you really need to worry about latency when your actual workload can tolerate minutes (or even tens of minutes) of latency? Probably not. You will want to worry about the cost of the learning curve (most likely Spark only vs. Spark plus something else, though Kafka Streams only or Flink only is possible for sure) and maintenance instead. In addition, if you have a workload which requires aggregation (probably with windowing), the latency restriction imposed by the framework is less important, as you will probably set your window size to minutes/hours.
Micro-batch has an upside as well: when there are long idle periods, the resources running idle tasks are wasted - which is what happens in "record-by-record" streaming frameworks. Micro-batch also allows batch operations on a specific micro-batch which aren't possible in pure streaming. (Though you should keep in mind this only applies to the "current" batch.)
I think there is no silver bullet - Spark has been leading the "batch workload" space, as it originated to deal with the problems of MapReduce, hence its overall architecture is optimized for batch workloads. Other streaming frameworks start out "streaming native", hence should have an advantage on streaming workloads but be less optimal on batch workloads. Unified batch and streaming is a new trend, and at some point one framework (or a couple of them) may provide optimal performance on both workloads, but I'm not sure now is the time.
EDIT: If your workload targets "end-to-end exactly once", the latency is bound to the checkpoint interval even for "record-by-record" streaming frameworks. The records between checkpoints compose a sort of batch, so the checkpoint interval would be a new major piece of homework for you.
EDIT2:
Q1) Why would window aggregations make me care less about latency? Maybe one really wants to update the stateful operation quickly enough.
The output latency difference between micro-batch and record-by-record won't be significant (even micro-batch can achieve sub-second latency in some extreme cases) compared to the delay brought by the nature of windowing.
But yes, I'm assuming the case where emission happens only when the window expires ("append" mode in Structured Streaming). If you'd like to emit all updates whenever there is a change in the window, then yes, there would still be a difference from the latency perspective.
Q2) Why are the semantics important in this trade-off? It sounds related, for example, to Kafka Streams reducing its commit interval when exactly-once is configured. Maybe you mean that checkpointing, possibly per record, would increase overhead and thus impact latency, in order to obtain better semantics?
I don't know the details of Kafka Streams, so my explanation won't be based on how Kafka Streams works. That would be your homework.
If you have read through my answer, you will also agree that streaming frameworks won't checkpoint per record - the overhead would be significant. That said, the records between two checkpoints form a group (a sort of batch) which has to be reprocessed when a failure happens.
If stateful exactly-once (the stateful operation is exactly-once, but the output is at-least-once) is enough for your application, your application can just write the output to the sink and commit immediately, so that readers of the output can read it immediately. Latency won't be affected by the checkpoint interval.
By the way, there are two ways to achieve end-to-end exactly once (especially on the sink side):
the sink supports idempotent updates
the sink supports transactional updates
Case 1) writes the outputs immediately, so the semantics don't affect latency (similar to at-least-once), but the storage must be able to handle upserts, and "partial writes" are visible when a failure happens, so your reader applications have to tolerate them.
Case 2) writes the outputs but doesn't commit them until the checkpoint happens. The streaming framework will try to ensure that the output is committed and exposed only when the checkpoint succeeds and there is no failure in the group. There are various approaches to making distributed writes transactional (2PC, a coordinator doing an "atomic rename", a coordinator writing the list of files the tasks wrote, etc.), but in any case the reader can't see the partial writes until the commit happens, so the checkpoint interval contributes greatly to the output latency.
Q3) This doesn't necessarily address the point about the batch of records that Kafka clients poll.
My answer explains the general concept, which applies even in the case of a source that provides a batch of records per poll request.
Record-by-record: the source continuously fetches records and sends them to the downstream operators. The source doesn't need to wait for downstream operators to finish with previous records. In recent streaming frameworks, non-shuffle operators are handled together in a single task - in that case, "downstream operator" here technically means a downstream operator that requires a "shuffle".
Micro-batch: the engine plans the new micro-batch (the offset range from the source, etc.) and launches tasks for that micro-batch. Within each micro-batch, it behaves similarly to batch processing.

Cassandra, Counters, and Write Conflicts

We are exploring using Cassandra as a way to store time-series data, so this may be somewhat of a noob question. One of the use cases is to read data from a Kafka stream, look for matches, and increment a counter (e.g. 5 customers have clicked through link alpha on page beta, so increment (beta, alpha) by 5). However, we expect a very wide degree of parallelism to keep up with the load, so there may be more than one consumer reading from Kafka at the same time.
My question is: How would Cassandra resolve multiple simultaneous writes to a given counter from multiple sources?
It's my understanding that multiple writes to the counter with different timestamps will be added to the counter in the timestamp order received. However, if there were a simultaneous write with the exact same timestamp, would the LWW model of Cassandra throw out one of those counter increments?
If we were to have a large cluster (100+ nodes), ALL or QUORUM writes may not be sufficiently performant to keep up with the message traffic. Writes with consistency level THREE would seem likely to result in a situation where process #1 writes to nodes A, B, and C, but process #2 writes to X, Y, and Z. Would LWTs work here, or do they not play well with counter activity?
I would try out a proof of concept and benchmark it; it will most likely work just fine. Counters are not super performant in Cassandra though, especially if there is a lot of contention.
Counters are not like normal writes with a simple LWW: a counter update takes a lock on the partition and does a read-before-write, with specialized caches (lightweight transactions and their Paxos rounds are a separate mechanism). The partition lock contention will slow it down some, and the read-before-write adds expensive extra network hops.
Use QUORUM; don't try to do something funky with consistency levels for counters, especially before benchmarking to know whether you need it. A 100+ node cluster should be able to handle a lot, as long as you're not constantly trying to update the same partitions.
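For reference, a minimal sketch of what the counter schema and a QUORUM increment might look like from the consumer side, assuming the DataStax Java driver 3.x; the keyspace, table and column names are hypothetical:

import com.datastax.driver.core.{Cluster, ConsistencyLevel, SimpleStatement}

// Hypothetical schema:
//   CREATE TABLE clicks.by_link (page text, link text, hits counter,
//                                PRIMARY KEY (page, link));
val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect()

// Counter updates are commutative deltas, not LWW overwrites, so concurrent
// increments from multiple Kafka consumers are all applied.
val increment = new SimpleStatement(
  "UPDATE clicks.by_link SET hits = hits + 5 WHERE page = 'beta' AND link = 'alpha'")
increment.setConsistencyLevel(ConsistencyLevel.QUORUM)

session.execute(increment)
cluster.close()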

Learning Spark Streaming

I am learning Spark Streaming using the book "Learning Spark Streaming". In the book I found the following in a section talking about DStreams, RDDs, and blocks/partitions.
Finally, one important point that is glossed over in this schema is that the Receiver interface also has the option of connecting to a data source that delivers a collection (think Array) of data pieces. This is particularly relevant in some de-serialization uses, for example. In this case, the Receiver does not go through a block interval wait to deal with the segmentation of data into partitions, but instead considers the whole collection reflects the segmentation of the data into blocks, and creates one block for each element of the collection. This operation is demanding on the part of the Producer of data, since it requires it to be producing blocks at the ratio of the block interval to batch interval to function reliably (delivering the correct number of blocks on every batch). But some have found it can provide superior performance, provided an implementation that is able to quickly make many blocks available for serialization.
I have been banging my head against this and simply can't understand what the author is talking about, although I feel like I should understand it. Can someone give me some pointers on that?
Disclosure: I'm co-author of the book.
What we want to express there is that the custom receiver API has two working modes: one where the producing side delivers one message at a time, and another where the receiver may deliver many messages at once (bulk).
In the one-message-at-a-time mode, Spark is responsible for buffering and collecting the data into blocks for further processing.
In the bulk mode, the burden of buffering and grouping is on the producing side, but it might be more efficient in some scenarios.
This is reflected in the API:
def store(dataBuffer: ArrayBuffer[T]): Unit
Store an ArrayBuffer of received data as a data block into Spark's memory.
def store(dataItem: T): Unit
Store a single item of received data to Spark's memory.
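To make the two modes concrete, here is a minimal custom-receiver sketch; fetchOne() and fetchMany() are hypothetical calls to an external source, and the threading is kept deliberately simple:

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MyReceiver(bulk: Boolean) extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  override def onStart(): Unit = {
    new Thread("my-receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  override def onStop(): Unit = ()   // the receive loop below checks isStopped()

  private def receive(): Unit = {
    while (!isStopped()) {
      if (bulk) {
        // Bulk mode: each call becomes one block, so the producer should deliver
        // roughly (batch interval / block interval) buffers per batch.
        store(ArrayBuffer(fetchMany(): _*))
      } else {
        // One-at-a-time mode: Spark buffers items and cuts blocks every
        // spark.streaming.blockInterval (200 ms by default).
        store(fetchOne())
      }
    }
  }

  // Hypothetical source calls - placeholders for a real connection.
  private def fetchOne(): String = ???
  private def fetchMany(): Seq[String] = ???
}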
I agree with you that the paragraph is convoluted and might not convey the message as clearly as we would like. I'll take care of improving it.
Thanks for your feedback!

Spark streaming - Does reduceByKeyAndWindow() use constant memory?

I'm playing with the idea of having long-running aggregations (possibly a one-day window). I realize other answers on this site say that you should use batch processing for this.
I'm specifically interested in understanding this function, though. It sounds like it would use constant space to do an aggregation over the window, one interval at a time. If that is true, it sounds like a day-long aggregation would be viable (especially since it uses checkpointing in case of failure).
Does anyone know if this is the case?
This function is documented in the streaming programming guide (https://spark.apache.org/docs/2.1.0/streaming-programming-guide.html):
A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and “inverse reducing” the old data that leaves the window. An example would be that of “adding” and “subtracting” counts of keys as the window slides. However, it is applicable only to “invertible reduce functions”, that is, those reduce functions which have a corresponding “inverse reduce” function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation.
After researching this on the MapR forums, it seems that it would definitely use a constant level of memory, making a daily window possible assuming you can fit one day of data in your allocated resources.
The two downsides are that:
Doing a daily aggregation as a batch job may only take 20 minutes. Doing a window over a day means that you're using all those cluster resources permanently rather than just for 20 minutes a day. So stand-alone batch aggregations are far more resource-efficient.
It's hard to deal with late data when you're streaming over exactly a day. If your data is tagged with dates, then you need to wait until all your data arrives. A 1-day streaming window would only be good if you were literally just doing an analysis of the last 24 hours of data regardless of its content.
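For concreteness, a minimal sketch of the incremental (invertible) form being discussed, assuming a DStream of (key, count) pairs and an existing StreamingContext; the checkpoint path and intervals are placeholders:

import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

def dailyCounts(ssc: StreamingContext, counts: DStream[(String, Long)]): DStream[(String, Long)] = {
  ssc.checkpoint("/tmp/checkpoints")        // required for this operation; path is a placeholder

  counts.reduceByKeyAndWindow(
    (a: Long, b: Long) => a + b,            // "add" the counts entering the window
    (a: Long, b: Long) => a - b,            // "inverse reduce": subtract the counts leaving it
    Minutes(24 * 60),                       // one-day window
    Minutes(10))                            // slide every 10 minutes
}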

Does it make sense to run Spark job for its side effects?

I want to run a Spark job, where each RDD is responsible for sending certain traffic over a network connection. The return value from each RDD is not very important, but I could perhaps ask them to return the number of messages sent. The important part is the network traffic, which is basically a side effect for running a function over each RDD.
Is it a good idea to perform the above task in Spark?
I'm trying to simulate network traffic from multiple sources to test the data collection infrastructure on the receiving end. I could instead manually set up multiple machines to run the sender, but I thought it'd be nice if I could take advantage of Spark's existing distributed framework.
However, it seems like Spark is designed for programs to "compute" and then "return" something, not for programs to run for their side effects. I'm not sure if this is a good idea, and would appreciate input from others.
To be clear, I'm thinking of something like the following:
from operator import add

IDs = sc.parallelize(range(0, n))

def f(x):
    for i in range(0, 100):
        message = make_message(x, i)
        SEND_OVER_NETWORK(message)
    return (x, 100)

IDsOne = IDs.map(f)
counts = IDsOne.reduceByKey(add)

for (ID, count) in counts.collect():
    print("%i ran %i times" % (ID, count))
Generally speaking it doesn't make sense:
Spark is a heavyweight framework. At its core there is this huge machinery which ensures that data is properly distributed and collected, that recovery is possible, and so on. It has a significant impact on overall performance and latency, but doesn't provide any benefits in the case of side-effects-only tasks.
Spark concurrency has a relatively low granularity, with the partition being the main unit of concurrency. At this level, processing becomes synchronous. You cannot move on to the next partition before you finish the current one.
Let's say in your case there is a single slow SEND_OVER_NETWORK. If you use map, you pretty much block processing of a whole partition. You can go to the lower level with mapPartitions, make SEND_OVER_NETWORK asynchronous, and return only when the whole partition has been processed (see the sketch after this list). It is better, but still suboptimal.
You can increase the number of partitions, but that means higher bookkeeping overhead, so at the end of the day you can make the situation worse, not better.
The Spark API is designed mostly for side-effect-free operations, which makes it hard to express operations that don't fit into this model.
What is arguably more important is that Spark guarantees only that each operation is executed at least once (let's ignore zero times, for an RDD that is never materialized). If an application requires, for example, exactly-once semantics, things become tricky, especially when you consider point 2.
It is possible to keep track of local state for each partition outside the main Spark logic, but if you get there, it is a really good sign that Spark is not the right tool.
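A sketch of the mapPartitions idea mentioned above (in Scala; makeMessage and sendOverNetwork are hypothetical stand-ins for the question's helpers, and the timeout is arbitrary):

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.rdd.RDD

// Hypothetical stand-ins for the question's make_message / SEND_OVER_NETWORK.
def makeMessage(id: Int, i: Int): String = s"$id-$i"
def sendOverNetwork(msg: String): Unit = ()

// mapPartitions variant: sends are issued asynchronously and we block only once
// per partition, instead of blocking on every element as with map.
def sendAll(ids: RDD[Int]): RDD[(Int, Int)] =
  ids.mapPartitions { iter =>
    val futures = iter.map { id =>
      Future {
        (0 until 100).foreach(i => sendOverNetwork(makeMessage(id, i)))
        (id, 100)
      }
    }.toList                                        // materialize so all futures start
    Await.result(Future.sequence(futures), 10.minutes).iterator
  }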

Resources