stream processing and batch processing - apache-spark

Since the latency of batch processing comes from waiting to accumulate a given number of records, can I regard batch processing with a batch size of one as stream processing? Or are there other differences in how the operators do their calculations?
For example, if I set the batch size of a spark-based program to 1, can I make its latency as low as flink?
My thinking is as follows:
For stream processing, a record flows from one operator to the next as soon as it has been processed, but for batch processing, an operator can accept another record only after all the operators have finished processing the current one.
It seems that the pipelining in stream processing accounts for the speed-up.
Am I right in my explanation? If not, what is the appropriate explanation?

TL;DR: there are quite a lot of reasons why you should help your program and tell it explicitly whether you want a bounded (batch) or unbounded (stream) computation.
Your thinking is good in theory, but that's not how it works in practice: the batch vs stream setting has to be chosen explicitly by the programmer. The runtime won't infer it from the batch size (or batch delay) you set. At least that's how Flink works.
Furthermore, the batch vs stream divide goes much deeper: batch shouldn't care much about time.
Let's say you increase the batch size to the size of the whole dataset. Only then can Flink apply batch-specific optimization passes to your plan. For example, in streaming mode a JOIN needs to keep both sides in memory in case a match appears on the other side. In batch mode, Flink knows both sides are of fixed size, so it can materialize the smaller side first and keep only that in memory while it probes it with the other side. Thus Flink needs less memory for batch, and it uses CPU caches better (which makes for faster processing).
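To make the JOIN example concrete, here is a minimal plain-Python sketch of the two strategies. It is not Flink code; the function names and the (side, key, value) event layout are assumptions made up for illustration. The streaming variant has to remember every row from both sides, while the batch variant builds a hash table from the smaller side only.

```python
from collections import defaultdict


def streaming_join(events):
    """Symmetric hash join over an interleaved stream of (side, key, value).

    Either side may receive a match later, so both sides are kept as state.
    Returns the joined rows and the number of rows held in state.
    """
    state = {"L": defaultdict(list), "R": defaultdict(list)}
    out = []
    for side, key, value in events:
        other = "R" if side == "L" else "L"
        state[side][key].append(value)
        for match in state[other].get(key, ()):
            # Always emit as (key, left_value, right_value).
            out.append((key, value, match) if side == "L" else (key, match, value))
    rows_held = sum(len(rows) for s in state.values() for rows in s.values())
    return out, rows_held


def batch_join(left, right):
    """Build/probe hash join over two bounded inputs of (key, value).

    Both sizes are known up front, so only the smaller side is materialized.
    Returns the joined rows and the number of rows held in memory.
    """
    build_is_left = len(left) <= len(right)
    build, probe = (left, right) if build_is_left else (right, left)
    table = defaultdict(list)
    for key, value in build:
        table[key].append(value)
    out = []
    for key, value in probe:
        for match in table.get(key, ()):
            out.append((key, match, value) if build_is_left else (key, value, match))
    return out, len(build)
```

Both functions produce the same joined rows; the difference is in how many rows each one has to hold while doing so.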
Also streaming has to maintain watermarks (special row metadata to help with correlating the right rows together time-wise, persisting coherent set of rows together, etc), while batch doesn't care about them. That's overhead.
If you're up for it you can peruse the Flink source code, and compare the Batch vs Stream SQL optimization rules. You'll see that stream has to deal with watermarks (FlinkLogicalWatermarkAssigner) when batch does not, it has to expand temporal tables fully (LogicalCorrelateToJoinFromTemporalTableRule). Batch also can sort rows and do sort-merge-joins (BatchPhysicalSortMergeJoinRule). Stream has to incrementally process aggregates (IncrementalAggregateRule) when batch can do them locally at the data source (PushLocalHashAggIntoScanRule), etc. Each difference between these two files is either a thing one side has to specifically do because of its (batch vs stream) nature, or an optimization pass that is allowed by its (batch vs stream) nature.
If you would like to know more about this topic and its numerous subtleties, you can also read the Flink Blog posts, the Flink Documentation, and the Flink Improvement Proposals.

Related

Why so much criticism around Spark Streaming micro-batch (when using kafka as source)?

Since any Kafka Consumer is in reality consuming in batches, why is there so much criticism around Spark Streaming micro-batch (when using Kafka as its source), for example, in comparison to Kafka Streams (which markets itself as real streaming)?
I mean: a lot of criticism hovers around the Spark Streaming micro-batch architecture. And, normally, people say that Kafka Streams is a real 'real-time' tool, since it processes events one by one.
It does process events one by one, but, from my understanding, it uses (as almost every other library/framework) the Consumer API. The Consumer API polls from topics in batches in order to reduce network burden (the interval is configurable). Therefore, the Consumer will do something like:
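while (true) {
    // poll() returns a BATCH of records; the argument is the maximum time to block.
    // poll(Duration) (java.time.Duration) replaces the deprecated poll(long).
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        // process the records ONE BY ONE
    }
}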
So, although it is right to say that Spark:
maybe has higher latency due to its micro-batch minimum interval that limits latency to at best 100 ms (see Spark Structured Streaming DOCs);
processes records in groups (either as DStreams of RDDs or as DataFrames in Structured Streaming).
But:
One can process records one by one in Spark - just loop through RDDs/Rows
Kafka Streams in reality polls batches of records, but processes them one by one, since it uses the Consumer API under the hood.
Just to make clear, I am not questioning from a 'fan-side' (and therefore, being it an opinion question), just the opposite, I am really trying to understand it technically in order to understand the semantics in the streaming ecosystem.
Appreciate every piece of information in this matter.
DISCLAIMER: I was involved in Apache Storm (which is known as a streaming framework that processes "record-by-record", though there is the Trident API as well), and I am now involved in Apache Spark ("micro-batch").
One of the major concerns in streaming technology has been "throughput vs latency". From a latency perspective, "record-by-record" processing is clearly the winner, but the cost of "doing everything one by one" is significant, and every minor thing becomes a huge overhead. (If the system aims to process a million records per second, any additional per-record overhead gets multiplied by a million.) In fact, there was the opposite criticism as well: bad throughput for "record-by-record" compared to "micro-batch". To address this, streaming frameworks add batching to their "internal" logic, but in a way that hurts latency less (for example, by configuring the batch size and a timeout that forces the batch to be flushed).
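The "batch size plus flush timeout" idea can be sketched in a few lines of plain Python. This is only an illustration of the concept, not how any particular framework implements it; the class name, its parameters, and the explicit `now_ms` argument (used instead of a real clock so the behaviour is deterministic) are all assumptions.

```python
class FlushingBuffer:
    """Collects records and flushes when the batch is full or has waited too long."""

    def __init__(self, max_size, max_wait_ms, flush):
        self.max_size = max_size        # flush once this many records are buffered
        self.max_wait_ms = max_wait_ms  # ...or once the oldest record is this old
        self.flush = flush              # callback that receives the list of records
        self._items = []
        self._first_at_ms = None

    def add(self, record, now_ms):
        if not self._items:
            self._first_at_ms = now_ms  # the timeout counts from the first record
        self._items.append(record)
        self._maybe_flush(now_ms)

    def tick(self, now_ms):
        """Called periodically so a half-empty batch does not wait forever."""
        self._maybe_flush(now_ms)

    def _maybe_flush(self, now_ms):
        if not self._items:
            return
        full = len(self._items) >= self.max_size
        timed_out = now_ms - self._first_at_ms >= self.max_wait_ms
        if full or timed_out:
            batch, self._items = self._items, []
            self.flush(batch)
```

A larger `max_size` amortizes per-batch overhead (throughput), while `max_wait_ms` bounds how long a record can sit in the buffer (latency).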
I think the major difference between the two is whether the tasks run "continuously" and whether they compose a "pipeline".
In streaming frameworks that do "record-by-record", when the application is launched, all necessary tasks are physically planned and launched together, and they never terminate unless the application is terminated. Source tasks continuously push records to the downstream tasks, and downstream tasks process them and push them to the next downstream tasks. This is done in a pipelined manner. The source won't stop pushing records unless there are no records to push. (There are backpressure and distributed checkpoints, but let's put those details aside and focus on the concept.)
In streaming frameworks that do "micro-batch", the boundary of the "batch" has to be decided for each micro-batch. In Spark, the planning (e.g. how many records this batch will read from the source and process) is normally done on the driver side, and tasks are physically planned based on the decided batch. This approach gives end users major homework: what is the "appropriate" batch size to achieve the throughput/latency they are targeting? Too small a batch leads to bad throughput, as planning a batch has a non-trivial cost (heavily depending on the sources). Too large a batch leads to bad latency. In addition, the concept of a "stage" is appropriate for batch workloads (I see Flink is adopting stages in its batch workloads) and not ideal for streaming workloads, because it means some tasks have to wait for the "completion" of other tasks, so there is no pipeline.
To be sure, I don't think such criticism means micro-batch is "unusable". Do you really need to worry about latency when your actual workload can tolerate minutes (or even tens of minutes) of latency? Probably not. You'll want to consider the cost of the learning curve (most likely Spark only vs Spark plus something else, though Kafka Streams only or Flink only is certainly possible) and of maintenance instead. In addition, if you have a workload which requires aggregation (probably with windowing), the latency restriction imposed by the framework is less important, as you'll probably set your window size to minutes or hours.
Micro-batch has an upside as well: if there are long idle periods, the resources held by idle tasks are wasted, which is what happens in "record-by-record" streaming frameworks. It also allows batch operations on a specific micro-batch which aren't possible in streaming. (Though you should keep in mind that this only applies to the "current" batch.)
I think there's no silver bullet. Spark has been leading the "batch workload" as it originated to deal with the problems of MapReduce, hence its overall architecture is optimized for batch workloads. Other streaming frameworks started out "streaming native", hence they should have an advantage on streaming workloads but be less optimal on batch workloads. Unified batch and streaming is a new trend, and at some point a framework (or a couple of them) may provide optimal performance on both workloads, but I'm not sure now is the time.
EDIT: If your workload targets "end-to-end exactly once", the latency is bound to the checkpoint interval even for "record-by-record" streaming frameworks. The records between checkpoints form a sort of batch, so the checkpoint interval would be new major homework for you.
EDIT2:
Q1) Why windows aggregations would make me bother less about latency? Maybe one really wants to update the stateful operation quickly enough.
The difference in output latency between micro-batch and record-by-record won't be significant (even micro-batch can achieve sub-second latency in some extreme cases) compared to the delay inherent in windowing.
But yes, I'm assuming the case where the emit happens only when the window expires ("append" mode in Structured Streaming). If you'd like to emit all the updates whenever there's a change in the window then yes, there would still be a difference in latency.
Q2) Why the semantics are important in this trade-off? Sounds like it is related, for example, to Kafka-Streams reducing commit-interval when exactly-once is configured. Maybe you mean that checkpointing possibly one-by-one would increase overhead and then impact latency, in order to obtain better semantics?
I don't know the details of Kafka Streams, so my explanation won't be based on how Kafka Streams works. That would be your homework.
If you read through my answer carefully, you'll agree that streaming frameworks won't checkpoint per record, as the overhead would be significant. That said, the records between two checkpoints form one group (a sort of batch) which has to be reprocessed when a failure happens.
If stateful exactly once (the stateful operation is exactly once, but the output is at-least once) is enough for your application, your application can just write the output to the sink and commit immediately so that readers of the output can read it immediately. Latency won't be affected by the checkpoint interval.
Btw, there're two ways to achieve end-to-end exactly once (especially the sink side):
1) the sink supports idempotent updates
2) the sink supports transactional updates
Case 1) writes the outputs immediately, so the semantics won't affect latency (similar to at-least once), but the storage should be able to handle upserts, and a "partial write" is visible when a failure happens, so your reader applications should tolerate it.
Case 2) writes the outputs but does not commit them until the checkpoint happens. The streaming framework will try to ensure that the output is committed and exposed only when the checkpoint succeeds and there is no failure in the group. There are various approaches to making distributed writes transactional (2PC, the coordinator does an "atomic rename", the coordinator writes the list of the files the tasks wrote, etc.), but in any case the reader can't see a partial write until the commit happens, so the checkpoint interval contributes greatly to the output latency.
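The two sink styles can be modelled with a toy in-memory sketch. The class and method names below are hypothetical and stand in for whatever storage and commit protocol a real sink uses; the point is only when written rows become visible to readers.

```python
class IdempotentSink:
    """Case 1: upserts by key; rows are visible as soon as they are written."""

    def __init__(self):
        self.visible = {}

    def write(self, key, value):
        # Replaying the same (key, value) after a failure leaves the same result.
        self.visible[key] = value


class TransactionalSink:
    """Case 2: rows are staged and only exposed when the checkpoint succeeds."""

    def __init__(self):
        self.visible = []
        self._pending = []

    def write(self, row):
        self._pending.append(row)  # not yet visible to readers

    def on_checkpoint_success(self):
        # Commit: expose everything written since the previous checkpoint.
        self.visible.extend(self._pending)
        self._pending = []

    def on_failure(self):
        # Abort: discard uncommitted rows; they will be produced again on replay.
        self._pending = []
```

With the idempotent sink a reader can see output right away (including a partial write), while with the transactional sink nothing is visible until `on_checkpoint_success` runs, which is why the checkpoint interval shows up in the output latency.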
Q3) This doesn't necessarily address the point about the batch of records that Kafka clients poll.
My answer explains the general concept, which also applies to a source that provides a batch of records per poll request.
Record-by-record: the source continuously fetches records and sends them to the downstream operators. The source doesn't need to wait for the downstream operators to finish with previous records. In recent streaming frameworks, non-shuffle operators are handled together in a single task; in that case, "downstream operator" here technically means a downstream operator that requires a "shuffle".
Micro-batch: the engine plans the new micro-batch (the offset range of the source, etc.) and launches tasks for it. Within each micro-batch, it behaves similarly to batch processing.
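As a rough single-threaded analogy for the two execution styles (real engines run operators in parallel tasks, so this only shows the ordering of work), the sketch below records the order in which two made-up operators, "parse" and "enrich", touch each record.

```python
def pipelined(records, log):
    """Record-by-record: each record flows through all operators before the next."""

    def parse(source):
        for r in source:
            log.append(("parse", r))
            yield r

    def enrich(source):
        for r in source:
            log.append(("enrich", r))
            yield r

    for _ in enrich(parse(records)):
        pass


def staged(records, log):
    """Stage-based: the second operator starts only after the first has finished."""
    parsed = []
    for r in records:
        log.append(("parse", r))
        parsed.append(r)
    for r in parsed:
        log.append(("enrich", r))
```

In the pipelined version the first record reaches "enrich" before the second record is even parsed; in the staged version every record waits for the whole "parse" stage to complete.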

How big can batches in Flink and Spark get?

I am currently working on a framework for an analysis application for a large-scale experiment. The experiment contains about 40 instruments, each generating about a GB/s with ns timestamps. The data is intended to be analysed in time chunks.
For the implementation I would like to know how big such a "chunk", aka batch, can get before Flink or Spark stop processing the data. I think it goes without saying that I intend to recollect the processed data.
For live data analysis
In general, there is no hard limit on how much data you can process with the systems. It all depends on how many nodes you have and what kind of a query you have.
As it sounds like you mainly want to aggregate per instrument over a given time window, your maximum scale-out is limited to 40. That's the maximum number of machines that you could throw at your problem. Then the question arises of how big your time chunks are and how complex the aggregations become. Assuming that your aggregation requires all data of a window to be present, the system needs to hold 1 GB per second per instrument. So if your window is one hour, the system needs to hold at least 3.6 TB of data per instrument.
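The arithmetic behind that figure, as a small sketch. It assumes decimal units (1 TB = 1000 GB) and reads the 3.6 TB as the amount for a single instrument at 1 GB/s; the total across all 40 instruments follows from the same numbers in the question.

```python
RATE_GB_PER_S = 1   # per instrument, from the question
INSTRUMENTS = 40    # from the question


def window_size_tb(window_seconds, rate_gb_per_s=RATE_GB_PER_S):
    """Data accumulated by one instrument over a window, in TB (1 TB = 1000 GB)."""
    return window_seconds * rate_gb_per_s / 1000


per_instrument_tb = window_size_tb(3600)       # one-hour window: 3.6 TB
total_tb = per_instrument_tb * INSTRUMENTS     # about 144 TB across all instruments
```

Shrinking the window shrinks the requirement linearly, so a one-minute window would need about 0.06 TB (60 GB) per instrument.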
If the main memory of the machines is not sufficient, data needs to be spilled to disk, which slows down processing significantly. Spark really likes to keep all data in memory, so that would be the practical limit. Flink can spill almost all data to disk, but then disk I/O becomes a bottleneck.
If you rather need to calculate small values (like sums, averages), main memory shouldn't become an issue.
For old data analysis
When analysing old data, the system can do batch processing and has many more options to handle the volume, including spilling to local disk. Spark usually shines if you can keep all data of one window in main memory. If you are not certain about that or you know it will not fit into main memory, Flink is the more scalable solution. Nevertheless, I'd expect both frameworks to work well for your use case.
I'd rather look at the ecosystem and how well it fits you. Which languages do you want to use? It feels like using Jupyter notebooks or Zeppelin would work best for your rather ad-hoc analysis and data exploration. Especially if you want to use Python, I'd probably give Spark a try first.

Kappa architecture: when insert to batch/analytic serving layer happens

As you know, the Kappa architecture is a kind of simplification of the Lambda architecture. Kappa doesn't need a batch layer; instead, the speed layer has to guarantee computation precision and enough throughput (more parallelism/resources) for re-computation of historical data.
Still, the Kappa architecture requires two serving layers when you need to do analytics based on historical data. For example, data less than 2 weeks old is stored in Redis (streaming serving layer), while all older data is stored somewhere in HBase (batch serving layer).
When (under the Kappa architecture) do I have to insert data into the batch serving layer?
If the streaming layer inserts data immediately into both the batch and stream serving layers, then what about late-arriving data? Or should the streaming layer back up the speed serving layer to the batch serving layer on a regular basis?
Example: let's say the source of data is Kafka, data is processed by Spark Structured Streaming or Flink, and the sinks are Redis and HBase. When should the writes to Redis and HBase happen?
If we perform stream processing, we want to make sure that output data is firstly made available as a data stream. In your example that means we write to Kafka as a primary sink.
Now you have two options:
have secondary jobs that read from that Kafka topic and write to Redis and HBase. That is the Kafka way, in that Kafka Streams does not support writing directly to any of these systems and you set up a Kafka Connect job. These secondary jobs can then be tailored to the specific sinks, but they add operational overhead. (That's a bit like the backup option that you mentioned.)
with Spark and Flink you also have the option to have secondary sinks directly in your job. You may add additional processing steps to transform the Kafka output into a more suitable form for the sink, but you are more limited when configuring the job. For example in Flink, you need to use the same checkpointing settings for the Kafka sink and the Redis/HBase sink. Nevertheless, if the settings work out, you just need to run one streaming job instead of 2 or 3.
Late events
Now the question is what to do with late data. The best solution is to let the framework handle that through watermarks. That is, data is only committed at all sinks, when the framework is sure that no late data arrives. If that doesn't work out because you really need to process late events even if they arrive much, much later and still want to have temporary results, you have to use update events.
Update events
(as requested by the OP, I will add more details to the update events)
In Kafka Streams, elements are emitted through a continuous refinement mechanism by default. That means windowed aggregations emit results as soon as they have any valid data point and update that result while receiving new data. Thus, any late event is processed and yields an updated result. While this approach nicely lowers the burden on users, as they do not need to understand watermarks, it has some severe shortcomings that led the Kafka Streams developers to add Suppression in 2.1 and onward.
The main issue is that it poses quite big challenges for downstream users who have to process intermediate results, as also explained in the article about Suppression. If it's not obvious whether a result is temporary or "final" (in the sense that all expected events have been processed), then many applications are much harder to implement. In particular, windowing operations need to be replicated on the consumer side to get the "final" value.
Another issue is that the data volume is blown up. If you'd have a strong aggregation factor, using watermark-based emission will reduce your data volume heavily after the first operation. However, continuous refinement will add a constant volume factor as each record triggers a new (intermediate) record for all intermediate steps.
Lastly, and particularly interesting for you is how to offload data to external systems if you have update events. Ideally, you would offload the data with some time lag continuously or periodically. That approach simulates the watermark-based emission again on consumer side.
Mixing the options
It's possible to use watermarks for the initial emission and then use update events for late events. The volume is then reduced for all "on-time" events. For example, Flink offers allowed lateness to make windows trigger again for late events.
This setup makes offloading data much easier, as data only needs to be re-emitted to the external systems if a late event actually happened. The system should be tuned so that a late event is a rare case, though.
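The mixed approach can be modelled with a small plain-Python sketch of a tumbling-window sum. This is a simplified model, not Flink's API or its exact lateness rule: the class name, the integer timestamps, and the "on-time"/"update" labels are assumptions for illustration.

```python
class TumblingSum:
    """Emits one on-time result per window at the watermark, then updates
    for late events that arrive within the allowed lateness."""

    def __init__(self, size, allowed_lateness):
        self.size = size
        self.allowed_lateness = allowed_lateness
        self.sums = {}       # window start -> running sum
        self.fired = set()   # windows whose on-time result was already emitted
        self.watermark = float("-inf")
        self.emitted = []    # (window start, sum, "on-time" | "update")

    def on_event(self, ts, value):
        start = ts - ts % self.size
        end = start + self.size
        if end + self.allowed_lateness <= self.watermark:
            return False  # too late: the window state is gone, drop the event
        self.sums[start] = self.sums.get(start, 0) + value
        if start in self.fired:
            # Late but within the allowed lateness: re-emit as an update event.
            self.emitted.append((start, self.sums[start], "update"))
        return True

    def on_watermark(self, watermark):
        self.watermark = max(self.watermark, watermark)
        for start in sorted(self.sums):
            if start not in self.fired and start + self.size <= self.watermark:
                self.fired.add(start)
                self.emitted.append((start, self.sums[start], "on-time"))
        # Free the state of windows that can no longer receive late events.
        expired = [s for s in self.sums
                   if s + self.size + self.allowed_lateness <= self.watermark]
        for start in expired:
            del self.sums[start]
            self.fired.discard(start)
```

Only windows that actually receive a late event produce an "update" row, so a downstream sink sees one row per window in the common case.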

Using Spark to process requests

I would like to understand if the following would be a correct use case for Spark.
Requests to an application are received either on a message queue, or in a file which contains a batch of requests. For the message queue, there are currently about 100 requests per second, although this could increase. Some files just contain a few requests, but more often there are hundreds or even many thousands.
Processing for each request includes filtering of requests, validation, looking up reference data, and calculations. Some calculations reference a Rules engine. Once these are completed, a new message is sent to a downstream system.
We would like to use Spark to distribute the processing across multiple nodes to gain scalability, resilience and performance.
I am envisaging that it would work like this:
Load a batch of requests into Spark as an RDD (requests received on the message queue might use Spark Streaming).
Separate Scala functions would be written for filtering, validation, reference data lookup and data calculation.
The first function would be passed to the RDD, and would return a new RDD.
The next function would then be run against the RDD output by the previous function.
Once all functions have completed, a for loop comprehension would be run against the final RDD to send each modified request to a downstream system.
Does the above sound correct, or would this not be the right way to use Spark?
Thanks
We have done something similar while working on a small IoT project. We tested receiving and processing around 50K MQTT messages per second on 3 nodes and it was a breeze. Our processing included parsing each JSON message, some manipulation of the object created, and saving all the records to a time series database.
We set the batch time to 1 second; the processing time was around 300 ms and RAM usage was in the hundreds of KB.
A few concerns with streaming. Make sure your downstream system is asynchronous so you won't run into memory issues. It's true that Spark supports back pressure, but you will need to make it happen. Another thing: try to keep the state minimal. More specifically, you should not keep any state that grows linearly as your input grows. This is extremely important for your system's scalability.
What impressed me the most is how easily you can scale with Spark. With each node we added, the frequency of messages we could handle grew linearly.
I hope this helps a little.
Good luck

Does it make sense to run Spark job for its side effects?

I want to run a Spark job, where each RDD is responsible for sending certain traffic over a network connection. The return value from each RDD is not very important, but I could perhaps ask them to return the number of messages sent. The important part is the network traffic, which is basically a side effect for running a function over each RDD.
Is it a good idea to perform the above task in Spark?
I'm trying to simulate network traffic from multiple sources to test the data collection infrastructure on the receiving end. I could instead manually setup multiple machines to run the sender, but I thought it'd be nice if I could take advantage of Spark's existing distributed framework.
However, it seems like Spark is designed for programs to "compute" and then "return" something, not for programs to run for their side effects. I'm not sure if this is a good idea, and would appreciate input from others.
To be clear, I'm thinking of something like the following
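from operator import add  # binary + used by reduceByKey below

IDs = sc.parallelize(range(0, n))

def f(x):
    # send 100 messages for this ID; the network traffic is the side effect
    for i in range(0, 100):
        message = make_message(x, i)
        SEND_OVER_NETWORK(message)
    return (x, 100)

IDsOne = IDs.map(f)
counts = IDsOne.reduceByKey(add)
for (ID, count) in counts.collect():
    print("%i ran %i times" % (ID, count))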
Generally speaking it doesn't make sense:
1. Spark is a heavyweight framework. At its core there is this huge machinery which ensures that data is properly distributed and collected, that recovery is possible, and so on. It has a significant impact on overall performance and latency but doesn't provide any benefits for side-effects-only tasks.
2. Spark concurrency has a relatively coarse granularity, with the partition being the main unit of concurrency. At this level processing becomes synchronous. You cannot move on to the next partition before you finish the current one.
Let's say in your case there is a single slow SEND_OVER_NETWORK. If you use map you pretty much block processing of a whole partition. You can go to a lower level with mapPartitions, make SEND_OVER_NETWORK asynchronous, and return only when a whole partition has been processed. It is better but still suboptimal.
You can increase the number of partitions, but that means higher bookkeeping overhead, so at the end of the day you can make the situation worse, not better.
3. The Spark API is designed mostly for side-effect-free operations. That makes it hard to express operations which don't fit into this model.
What is arguably more important is that Spark guarantees only that each operation is executed at least once (let's ignore zero times if the RDD is never materialized). If the application requires, for example, exactly-once semantics, things become tricky, especially when you consider point 2.
It is possible to keep track of local state for each partition outside the main Spark logic but if you get there it is a really good sign that Spark is not the right tool.
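For completeness, the mapPartitions suggestion above could look roughly like the following. This is a sketch under assumptions: `send_partition` and its parameters are made-up names, `send` stands in for the question's SEND_OVER_NETWORK(make_message(x, i)), and a thread pool is just one way to make the sends asynchronous. The function itself is plain Python, so it can be tried on a list before handing it to Spark.

```python
from concurrent.futures import ThreadPoolExecutor


def send_partition(ids, send, messages_per_id=100, max_workers=8):
    """Process one partition: fire all sends concurrently, then yield (id, count).

    `ids` is the iterator Spark passes for a partition; `send` is a
    caller-supplied function taking an (id, message_index) pair.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = []
        for x in ids:
            futures = [pool.submit(send, (x, i)) for i in range(messages_per_id)]
            pending.append((x, futures))
        # Return only once every send in the partition has finished.
        for x, futures in pending:
            for future in futures:
                future.result()  # re-raises any exception from send
            yield (x, len(futures))
```

With an RDD this would be used as `IDs.mapPartitions(lambda part: send_partition(part, send))`. A slow send no longer blocks the other sends in the partition, but the partition as a whole still finishes only when its slowest send does, and a retried task will send its messages again.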

Resources