I am learning spark streaming using the book "Learning spark Streaming". In the book i found the following on a section talking about Dstream, RDD, block/partition.
Finally, one important point that is glossed over in this schema is that the Receiver interface also has the option of connecting to a data source that delivers a collection (think Array) of data pieces. This is particularly relevant in some de-serialization uses, for example. In this case, the Receiver does not go through a block interval wait to deal with the segmentation of data into partitions, but instead considers the whole collection reflects the segmentation of the data into blocks, and creates one block for each element of the collection. This operation is demanding on the part of the Producer of data, since it requires it to be producing blocks at the ratio of the block interval to batch interval to function reliably (delivering the correct number of blocks on every batch). But some have found it can provide superior performance, provided an implementation that is able to quickly make many blocks available for serialization.
I have been banging my head around and can't simply understand what the Author is talking about, although i feel like i should understand it. Can someone give me some pointers on that ?
Disclosure: I'm co-author of the book.
What we want to express there is that the custom receiver API has 2 working modes: one where the producing side delivers one-message-at-time and the other where the receiver may deliver many messages at once (bulk).
In the one-message-at-time mode, Spark is responsible of buffering and collecting the data into blocks for further processing.
In the bulk mode, the burden of buffering and grouping is on the producing side, but it might be more efficient in some scenarios.
This is reflected in the API:
def store(dataBuffer: ArrayBuffer[T]): Unit
Store an ArrayBuffer of received data as a data block into Spark's memory.
def store(dataItem: T): Unit
Store a single item of received data to Spark's memory.
I agree with you that the paragraph is convoluted and might not convey the message as clear as we would like. I'll take care of improving it.
Thanks for your feedback!
Related
Since the latency with batch processing generates when accumulating a specific number of data, can I regard batch processing with the "size of one" as stream processing? Or there's other difference when operators do calculations?
For example, if I set the batch size of a spark-based program to 1, can I make its latency as low as flink?
One of my thinking is as below:
For stream processing, one data flows from former operator to latter one if processed, but for batch process, only after all the operator finish processing a data can it accept another data.
It seems the pipeline in stream processing counts for the acceleration.
Am I right in my explantion? If wrong, what's the appropriate explanation to my question.
TLDR: there are quite a lot of reasons why you should help your program and tell it explicitly wether you want a bounded(batch) or unbounded(stream) computation.
Your thinking is good in theory, but that's not how it works in practice: batch vs stream setting is being asked explicitly from the programmer. The runtime won't infer it from the batch size (or batch delay) you set. At least that's how Flink works.
Furthermore, the batch vs stream divide goes much deeper: batch shouldn't care much about time.
Let's say you increase the batch size to be the whole dataset size. Only in that case Flink will be able to apply performance optimization passes over your plan. For example: in streaming mode JOINs need to keep both sides in memory in case a match appears on the other side. In batching mode, Flink knows both sides are fixed-size, it can materialize first the smallest side and only keep that in memory while it queries it with the other side. Thus Flinks need less memory for batching, and it uses CPU caches better (which makes for a faster processing).
Also streaming has to maintain watermarks (special row metadata to help with correlating the right rows together time-wise, persisting coherent set of rows together, etc), while batch doesn't care about them. That's overhead.
If you're up for it you can peruse the Flink source code, and compare the Batch vs Stream SQL optimization rules. You'll see that stream has to deal with watermarks (FlinkLogicalWatermarkAssigner) when batch does not, it has to expand temporal tables fully (LogicalCorrelateToJoinFromTemporalTableRule). Batch also can sort rows and do sort-merge-joins (BatchPhysicalSortMergeJoinRule). Stream has to incrementally process aggregates (IncrementalAggregateRule) when batch can do them locally at the data source (PushLocalHashAggIntoScanRule), etc. Each difference between these two files is either a thing one side has to specifically do because of its (batch vs stream) nature, or an optimization pass that is allowed by its (batch vs stream) nature.
If you would like to know more about this topic and it's numerous subtleties, you can also read the Flink Blog posts, Flink Documentation, Flink Improvement Proposals
How hazelcast-jet achieves anything vastly different from what was earlier achievable by submitting EntryProcessors on keys in an IMap?
Curious to know.
Quoting the InfoQ article on Jet:
Sending a runnable to a partition is analogous to the work of a single DAG vertex. The advantage of Jet comes from the ability to have the vertex transform the data it reads, producing items which no longer belong to the same partition, then reshuffle them while sending to the downstream vertex so they are again correctly partitioned. This is essential for any kind of map-reduce operation where the reducing unit must observe all the data items with the same key. To minimize network traffic, Jet can first reduce the data slice produced on the local member, then send only one item per key to the remote member that combines the partial results.
And note that this is just an advantage in the context of the same or similar use cases currently covered by entry processors. Jet can take data from any source and make use of the whole cluster's computational resources to process it.
We have been using spark streaming with kafka for a while and until now we were using the createStream method from KafkaUtils.
We just started exploring the createDirectStream and like it for two reasons:
1) Better/easier "exactly once" semantics
2) Better correlation of kafka topic partition to rdd partitions
I did notice that the createDirectStream is marked as experimental. The question I have is (sorry if this in not very specific):
Should we explore the createDirectStream method if exactly once is very important to us? Will be awesome if you guys can share your experience with it. Are we running the risk of having to deal with other issues such as reliability etc?
There is a great, extensive blog post by the creator of the direct approach (Cody) here.
In general, reading the Kafka delivery semantics section, the last part says:
So effectively Kafka guarantees at-least-once delivery by default and
allows the user to implement at most once delivery by disabling
retries on the producer and committing its offset prior to processing
a batch of messages. Exactly-once delivery requires co-operation with
the destination storage system but Kafka provides the offset which
makes implementing this straight-forward.
This basically means "we give you at least once out of the box, if you want exactly once, that's on you". Further, the blog post talks about the guarantee of "exactly once" semantics you get from Spark with both approaches (direct and receiver based, emphasis mine):
Second, understand that Spark does not guarantee exactly-once
semantics for output actions. When the Spark streaming guide talks
about exactly-once, it’s only referring to a given item in an RDD
being included in a calculated value once, in a purely functional
sense. Any side-effecting output operations (i.e. anything you do in
foreachRDD to save the result) may be repeated, because any stage of
the process might fail and be retried.
Also, this is what the Spark documentation says about receiver based processing:
The first approach (Receiver based) uses Kafka’s high level API to store consumed
offsets in Zookeeper. This is traditionally the way to consume data
from Kafka. While this approach (in combination with write ahead logs)
can ensure zero data loss (i.e. at-least once semantics), there is a
small chance some records may get consumed twice under some failures.
This basically means that if you're using the Receiver based stream with Spark you may still have duplicated data in case the output transformation fails, it is at least once.
In my project I use the direct stream approach, where the delivery semantics depend on how you handle them. This means that if you want to ensure exactly once semantics, you can store the offsets along with the data in a transaction like fashion, if one fails the other fails as well.
I recommend reading the blog post (link above) and the Delivery Semantics in the Kafka documentation page. To conclude, I definitely recommend you look into the direct stream approach.
I would like to understand if the following would be a correct use case for Spark.
Requests to an application are received either on a message queue, or in a file which contains a batch of requests. For the message queue, there are currently about 100 requests per second, although this could increase. Some files just contain a few requests, but more often there are hundreds or even many thousands.
Processing for each request includes filtering of requests, validation, looking up reference data, and calculations. Some calculations reference a Rules engine. Once these are completed, a new message is sent to a downstream system.
We would like to use Spark to distribute the processing across multiple nodes to gain scalability, resilience and performance.
I am envisaging that it would work like this:
Load a batch of requests into Spark as as RDD (requests received on the message queue might use Spark Streaming).
Separate Scala functions would be written for filtering, validation, reference data lookup and data calculation.
The first function would be passed to the RDD, and would return a new RDD.
The next function would then be run against the RDD output by the previous function.
Once all functions have completed, a for loop comprehension would be run against the final RDD to send each modified request to a downstream system.
Does the above sound correct, or would this not be the right way to use Spark?
Thanks
We have done something similar working on a small IOT project. we tested receiving and processing around 50K mqtt messages per second on 3 nodes and it was a breeze. Our processing included parsing of each JSON message, some manipulation of the object created and saving of all the records to a time series database.
We defined the batch time for 1 second, the processing time was around 300ms and RAM ~100sKB.
A few concerns with streaming. Make sure your downstream system is asynchronous so you wont get into memory issue. Its True that spark supports back pressure, but you will need to make it happen. another thing, try to keep the state to minimal. more specifically, your should not keep any state that grows linearly as your input grows. this is extremely important for your system scalability.
what impressed me the most is how easy you can scale with spark. with each node we added we grew linearly in the frequency of messages we could handle.
I hope this helps a little.
Good luck
What would be some considerations for choosing stateless sliding-window operations (e.g. reduceByKeyAndWindow) vs. choosing to keep state (e.g. via updateStateByKey or the new mapStateByKey) when handling a stream of sequential, finite event sessions with Spark Streaming?
For example, consider the following scenario:
A wearable device tracks physical exercises performed by
the wearer. The device automatically detects when an exercise starts,
and emits a message; emits additional messages while the exercise
is undergoing (e.g. heart rate); and finally, emits a message when the
exercise is done.
The desired result is a stream of aggregated records per exercise session. i.e. all events of the same session should be aggregated together (e.g. so that each session could be saved in a single DB row). Note that each session has a finite length, but the entire stream from multiple devices is continuous. For convenience, let's assume the device generates a GUID for each exercise session.
I can see two approaches for handling this use-case with Spark Streaming:
Using non-overlapping windows, and keeping state. A state is saved per GUID, with all events matching it. When a new event arrives, the state is updated (e.g. using mapWithState), and in case the event is "end of exercise session", an aggregated record based on the state will be emitted, and the key removed.
Using overlapping sliding windows, and keeping only the first sessions. Assume a sliding window of length 2 and interval 1 (see diagram below). Also assume that the window length is 2 X (maximal possible exercise time). On each window, events are aggreated by GUID, e.g. using reduceByKeyAndWindow. Then, all sessions which started at the second half of the window are dumped, and the remaining sessions emitted. This enables using each event exactly once, and ensures all events belonging to the same session will be aggregated together.
Diagram for approach #2:
Only sessions starting in the areas marked with \\\ will be emitted.
-----------
|window 1 |
|\\\\| |
-----------
----------
|window 2 |
|\\\\| |
-----------
----------
|window 3 |
|\\\\| |
-----------
Pros and cons I see:
Approach #1 is less computationally expensive, but requires saving and managing state (e.g. if the number of concurrent sessions increases, the state might get larger than memory). However if the maximal number of concurrent sessions is bounded, this might not be an issue.
Approach #2 is twice as expensive (each event is processed twice), and with higher latency (2 X maximal exercise time), but more simple and easily manageable, as no state is retained.
What would be the best way to handle this use case - is any of these approaches the "right" one, or are there better ways?
What other pros/cons should be taken into consideration?
Normally there is no right approach, each has tradeoffs. Therefore I'd add additional approach to the mix and will outline my take on their pros and cons. So you can decide which one is more suitable for you.
External state approach (approach #3)
You can accumulate state of the events in external storage. Cassandra is quite often used for that. You can handle final and ongoing events separately for example like below:
val stream = ...
val ongoingEventsStream = stream.filter(!isFinalEvent)
val finalEventsStream = stream.filter(isFinalEvent)
ongoingEventsStream.foreachRDD { /*accumulate state in casssandra*/ }
finalEventsStream.foreachRDD { /*finalize state in casssandra, move to final destination if needed*/ }
trackStateByKey approach (approach #1.1)
It might be potentially optimal solution for you as it removes drawbacks of updateStateByKey, but considering it is just got released as part of Spark 1.6 release, it could be risky as well (since for some reason it is not very advertised). You can use the link as starting point if you want to find out more
Pros/Cons
Approach #1 (updateStateByKey)
Pros
Easy to understand or explain (to rest of the team, newcomers, etc.) (subjective)
Storage: Better usage of memory stores only latest state of exercise
Storage: Will keep only ongoing exercises, and discard them as soon as they finish
Latency is limited only by performance of each micro-batch processing
Cons
Storage: If number of keys (concurrent exercises) is large it may not fit into memory of your cluster
Processing: It will run updateState function for each key within the state map, therefore if number of concurrent exercises is large - performance will suffer
Approach #2 (window)
While it is possible to achieve what you need with windows, it looks significantly less natural in your scenario.
Pros
Processing in some cases (depending on the data) might be more effective than updateStateByKey, due to updateStateByKey tendency to run update on every key even if there are no actual updates
Cons
"maximal possible exercise time" - this sounds like a huge risk - it could be pretty arbitrary duration based on a human behaviour. Some people might forget to "finish exercise". Also depends on kinds of exercise, but could range from seconds to hours, when you want lower latency for quick exercises while would have to keep latency as high as longest exercise potentially could exist
Feels like harder to explain to others on how it will work (subjective)
Storage: Will have to keep all data within the window frame, not only the latest one. Also will free the memory only when window will slide away from this time slot, not when exercise is actually finished. While it might be not a huge difference if you will keep only last two time slots - it will increase if you try to achieve more flexibility by sliding window more often.
Approach #3 (external state)
Pros
Easy to explain, etc. (subjective)
Pure streaming processing approach, meaning that spark is responsible to act on each individual event, but not trying to store state, etc. (subjective)
Storage: Not limited by memory of the cluster to store state - can handle huge number of concurrent exercises
Processing: State is updated only when there are actual updates to it (unlike updateStateByKey)
Latency is similar to updateStateByKey and only limited by the time required to process each micro-batch
Cons
Extra component in your architecture (unless you already use Cassandra for your final output)
Processing: by default is slower than processing just in spark as not in-memory + you need to transfer the data via network
you'll have to implement exactly once semantic to output data into cassandra (for the case of worker failure during foreachRDD)
Suggested approach
I'd try the following:
test updateStateByKey approach on your data and your cluster
see if memory consumption and processing is acceptable even with large number of concurrent exercises (expected on peak hours)
fall back to approach with Cassandra in case if not
I think one of other drawbacks of third approach is that the RDDs are not received chronologically..considering running them on a cluster..
ongoingEventsStream.foreachRDD { /*accumulate state in casssandra*/ }
also what about check-pointing and driver node failure..In that case do u read the whole data again? curious to know how you wanna handle this?
I guess maybe mapwithstate is a better approach why you consider all these scenario..