Spring Batch: difference between multithreading and partitioning

I cannot understand the difference between multithreading and partitioning in Spring Batch. The implementation is of course different: in partitioning you need to prepare the partitions and then process them. I want to know what the difference is and which one is the more efficient way to process data when the bottleneck is the item processor.

TL;DR;
Neither approach is intended to help when the bottleneck is in the processor. You will see some gains by having multiple items going through a processor at the same time, but both of the options you point out get their full benefits when used in processes that are I/O bound. The AsyncItemProcessor/AsyncItemWriter may be a better option.
Overview of Spring Batch Scalability
There are five options for scaling Spring Batch jobs:
Multithreaded step
Parallel steps
Partitioning
Remote chunking
AsyncItemProcessor/AsyncItemWriter
Each has its own benefits and disadvantages. Let's walk through each:
Multithreaded step
A multithreaded step takes a single step and executes each chunk within that step on a separate thread. This means that the same instances of each of the batch components (readers, writers, etc) are shared across the threads. This can increase performance by adding some parallelism to the step at the cost of restartability in most cases. You sacrifice restartability because in most cases, the ability to restart is based on the state maintained within the reader/writer/etc. With multiple threads updating that state, it becomes invalid and useless for restart. Because of this, you typically need to turn save state off on individual components and set the restartable flag to false on the job.
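To make the restartability problem concrete, here is a minimal plain-JDK sketch. It is not the Spring Batch API, and the class and method names are hypothetical. Several threads pull chunks from one shared reader, so chunks finish in a non-deterministic order and a single "last position read" value no longer says which items were actually completed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a multithreaded step; not the Spring Batch API.
class MultithreadedStepSketch {

    // One reader instance shared by all threads, so access must be synchronized.
    static final class SharedReader {
        private final Iterator<Integer> source;

        SharedReader(List<Integer> items) {
            this.source = items.iterator();
        }

        // Reads up to chunkSize items; returns an empty list when exhausted.
        synchronized List<Integer> readChunk(int chunkSize) {
            List<Integer> chunk = new ArrayList<>();
            while (chunk.size() < chunkSize && source.hasNext()) {
                chunk.add(source.next());
            }
            return chunk;
        }
    }

    // Runs the "step" on the given number of threads and returns everything written.
    static List<Integer> run(List<Integer> items, int chunkSize, int threads) {
        SharedReader reader = new SharedReader(items);
        List<Integer> written = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                List<Integer> chunk;
                while (!(chunk = reader.readChunk(chunkSize)).isEmpty()) {
                    List<Integer> processed = new ArrayList<>();
                    for (int item : chunk) {
                        processed.add(item * 2); // the "processor"
                    }
                    written.addAll(processed);   // the "writer": one call per chunk
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("step did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return written;
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            items.add(i);
        }
        // Every item is written exactly once, but the order of chunks may vary per run.
        System.out.println(run(items, 3, 4));
    }
}
```

Running it a few times typically shows the chunks arriving in different orders, which is why a per-component restart position is not trustworthy in this mode.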
Parallel steps
Parallel steps are achieved via a split. It allows you to execute multiple, independent steps in parallel via threads. This does not sacrifice restartability, but does not help improve the performance of a single step or piece of business logic.
Partitioning
Partitioning divides the data, in advance, into smaller pieces (called partitions) in a master step and then has slaves work independently on the partitions. In Spring Batch, the master and each slave are independent steps, so you get the benefits of parallelism within a single step without sacrificing restartability. Partitioning also lets you scale beyond a single JVM, because the slaves do not have to be local (you can use various communication mechanisms to communicate with remote slaves).
An important note about partitioning is that the only communication between the master and slave is a description of the data and not the data itself. For example, the master may tell slave1 to process records 1-100, slave2 to process records 101-200, etc. The master does not send the actual data, only the information required for the slave to obtain the data it is supposed to process. Because of this, the data must be local to the slave processes and the master can be located anywhere.
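The "description of the data" idea can be sketched in plain Java as follows. This is an illustration only, not the Spring Batch Partitioner interface. The names RangePartitionSketch, Range and partition are hypothetical, and it assumes records are addressed by a contiguous numeric id.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "master" side of partitioning; not the Spring Batch Partitioner API.
class RangePartitionSketch {

    // A partition is only a description of the data: an inclusive id range.
    static final class Range {
        final long fromId;
        final long toId;

        Range(long fromId, long toId) {
            this.fromId = fromId;
            this.toId = toId;
        }

        @Override
        public String toString() {
            return fromId + "-" + toId;
        }
    }

    // Splits [minId, maxId] into at most gridSize contiguous ranges of near-equal size.
    static List<Range> partition(long minId, long maxId, int gridSize) {
        if (gridSize <= 0 || maxId < minId) {
            throw new IllegalArgumentException("need gridSize > 0 and maxId >= minId");
        }
        long total = maxId - minId + 1;
        long size = (total + gridSize - 1) / gridSize; // ceiling division
        List<Range> ranges = new ArrayList<>();
        for (long start = minId; start <= maxId; start += size) {
            ranges.add(new Range(start, Math.min(start + size - 1, maxId)));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Each worker would receive one range and load those records itself.
        System.out.println(partition(1, 300, 3)); // prints [1-100, 101-200, 201-300]
    }
}
```

Only the small Range objects would travel from the master to the workers. Each worker uses its range to read its own records.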
Remote chunking
Remote chunking allows you to scale the processing and, optionally, the writing logic across JVMs. In this use case, the master reads the data and sends it over the wire to the slaves, where it is processed and then either written locally by the slave or returned to the master to be written there.
The important difference between partitioning and remote chunking is that instead of a description going over the wire, remote chunking sends the actual data over the wire. So instead of a single packet saying process records 1-100, remote chunking is going to send the actual records 1-100. This can have a large impact on the I/O profile of a step, but if the processor is enough of a bottleneck, this can be useful.
AsyncItemProcessor/AsyncItemWriter
The final option for scaling Spring Batch processes is the AsyncItemProcessor/AsyncItemWriter combination. In this case, the AsyncItemProcessor wraps your ItemProcessor implementation and executes the call to your implementation in a separate thread. The AsyncItemProcessor then returns a Future that is passed to the AsyncItemWriter where it is unwrapped and passed to the delegate ItemWriter implementation.
Because of the nature of how data flows through this option, certain listener scenarios are not supported (since we don't know the outcome of the ItemProcessor call until inside the ItemWriter) but overall, it can provide a useful tool for parallelizing just the ItemProcessor logic in a single JVM without sacrificing restartability.
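The Future hand-off described above can be sketched with plain java.util.concurrent. This is not the Spring Batch implementation. The names AsyncProcessingSketch, processAsync, writeAsync and runChunk are hypothetical, and the "writer" is simply a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch of the Future hand-off; not Spring Batch's AsyncItemProcessor/AsyncItemWriter.
class AsyncProcessingSketch {

    // "Async processor": submits the delegate call to the pool and returns a Future immediately.
    static <I, O> Future<O> processAsync(ExecutorService pool, Function<I, O> delegate, I item) {
        Callable<O> task = () -> delegate.apply(item);
        return pool.submit(task);
    }

    // "Async writer": unwraps each Future (blocking if needed), then hands the result to the real writer.
    static <O> void writeAsync(List<Future<O>> chunk, List<O> delegateWriter) {
        for (Future<O> future : chunk) {
            try {
                delegateWriter.add(future.get());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            } catch (ExecutionException e) {
                // A processor failure only surfaces here, inside the writer.
                throw new IllegalStateException("processing failed", e.getCause());
            }
        }
    }

    // Processes one chunk with the given number of threads and returns what was written.
    static <I, O> List<O> runChunk(List<I> items, Function<I, O> delegate, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<O>> futures = new ArrayList<>();
            for (I item : items) {
                futures.add(processAsync(pool, delegate, item));
            }
            List<O> written = new ArrayList<>();
            writeAsync(futures, written);
            return written;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Items are processed in parallel but written in their original order.
        System.out.println(runChunk(List.of(1, 2, 3, 4), n -> n * n, 4)); // prints [1, 4, 9, 16]
    }
}
```

The catch block in writeAsync shows why some listener scenarios do not fit this model: the outcome of processing is not known until the writer asks for it.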

Related

How to run multiple queries in Scylla using "Non Atomic" Batch/Pipeline

I understand that Scylla allows batch statements like these.
BEGIN BATCH
<insert-stmt>/ <update-stmt>/ <delete-stmt>
APPLY BATCH
These statements have performance implications as it ensures atomicity. However, I simply have many insert statements which I want to perform from my node client in a single IO. Atomicity among these inserts is not needed. Any idea how I can do that? Can't find anything.
Batching multiple inserts in the Cassandra world is usually an antipattern (except when they all go into one partition, see the docs). When you send inserts for multiple partitions in one batch, the coordinator node has to take the data from this batch and send it to the nodes that own the data. This puts additional load on the coordinator, which first needs to back up the content of the batch so it is not lost if the coordinator crashes in the middle of execution, and then needs to execute all operations and wait for their results before responding to the caller (see this diagram to understand how a so-called logged batch works).
When you don't need atomicity, the best performance comes from sending multiple inserts in parallel and waiting for them to complete. This is faster, puts less load on the nodes, and lets the driver use a token-aware load balancing policy so that requests are sent to the nodes that own the data (if you're using prepared statements). In Node.js you can achieve this with the Concurrent Execution API. There are several variants of its usage, so it's best to check the documentation and pick the one that fits your use case.
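The general idea of "many parallel inserts with a bounded number in flight" can be sketched as follows. This is a language-neutral illustration written in Java, not the Node.js driver's Concurrent Execution API. The executeAsync function is a hypothetical stand-in for whatever asynchronous execute call your driver provides, and maxInFlight is an assumed tuning parameter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

// Hypothetical sketch of bounded concurrent execution; executeAsync stands in for a driver's async call.
class ConcurrentInsertSketch {

    // Sends every statement, keeping at most maxInFlight requests outstanding, and waits for all.
    static <S, R> List<R> executeConcurrent(List<S> statements,
                                            Function<S, CompletableFuture<R>> executeAsync,
                                            int maxInFlight) {
        Semaphore permits = new Semaphore(maxInFlight);
        List<CompletableFuture<R>> pending = new ArrayList<>();
        for (S statement : statements) {
            permits.acquireUninterruptibly(); // block while the window is full
            CompletableFuture<R> future = executeAsync.apply(statement);
            // Free the slot whether the request succeeded or failed.
            future.whenComplete((result, error) -> permits.release());
            pending.add(future);
        }
        List<R> results = new ArrayList<>();
        for (CompletableFuture<R> future : pending) {
            results.add(future.join()); // throws if this request failed
        }
        return results;
    }

    public static void main(String[] args) {
        // Fake "insert" that completes on another thread and echoes an acknowledgement.
        Function<String, CompletableFuture<String>> fakeInsert =
                stmt -> CompletableFuture.supplyAsync(() -> "ok:" + stmt);
        System.out.println(executeConcurrent(List.of("a", "b", "c"), fakeInsert, 2));
        // prints [ok:a, ok:b, ok:c]
    }
}
```

No batch is involved here: each statement is an independent request, so a token-aware driver is free to route each one to a node that owns its partition.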

Why so much criticism around Spark Streaming micro-batch (when using kafka as source)?

Since any Kafka Consumer is in reality consuming in batches, why is there so much criticism of Spark Streaming's micro-batch model (when using Kafka as its source), for example in comparison to Kafka Streams (which markets itself as real streaming)?
I mean: a lot of criticism hovers around the Spark Streaming micro-batch architecture. And people normally say that Kafka Streams is a real 'real-time' tool, since it processes events one by one.
It does process events one by one, but, from my understanding, it uses (as almost every other library/framework) the Consumer API. The Consumer API polls from topics in batches in order to reduce network burden (the interval is configurable). Therefore, the Consumer will do something like:
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    // PROCESS A **BATCH** OF RECORDS
    for (ConsumerRecord<String, String> record : records) {
        // PROCESS **ONE-BY-ONE**
    }
}
So, although it is right to say that Spark:
may have higher latency, because its micro-batch minimum interval limits latency to about 100 ms at best (see the Spark Structured Streaming docs);
processes records in groups (either as DStreams of RDDs or as DataFrames in Structured Streaming).
But:
One can process records one by one in Spark, just by looping through RDDs/Rows.
Kafka Streams in reality polls batches of records but processes them one by one, since it uses the Consumer API under the hood.
Just to be clear, I am not asking as a 'fan' of either side (which would make this an opinion question). On the contrary, I am really trying to understand this technically, in order to understand the semantics in the streaming ecosystem.
Appreciate every piece of information in this matter.
DISCLAIMER: I was involved in Apache Storm (which is known as a streaming framework that processes "record-by-record", though there's the Trident API as well), and am now involved in Apache Spark ("micro-batch").
One of the major concerns in streaming technology has been "throughput vs latency". From the latency perspective, "record-by-record" processing is clearly the winner, but the cost of "doing everything one by one" is significant, and every minor thing becomes a huge overhead. (Consider a system that aims to process a million records per second: any additional per-record overhead gets multiplied by a million.) Actually, there was the opposite criticism as well: bad throughput for "record-by-record" compared to "micro-batch". To address this, streaming frameworks add batching in their "internal" logic, but in a way that hurts latency less (for example, by making the batch size configurable and adding a timeout that forces a flush of the batch).
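The "batch size plus flush timeout" idea mentioned above can be sketched like this. It is a minimal illustration, not code from Storm, Flink or any other framework. The class FlushingBufferSketch and its parameters are hypothetical, and the clock is injected so the timeout behaviour can be shown without sleeping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.LongSupplier;

// Hypothetical sketch of internal batching in a record-by-record engine; not taken from any framework.
class FlushingBufferSketch<T> {
    private final int maxSize;
    private final long maxWaitMillis;
    private final LongSupplier clock; // injected so the timeout can be exercised deterministically
    private final Consumer<List<T>> downstream;
    private final List<T> buffer = new ArrayList<>();
    private long oldestMillis;

    FlushingBufferSketch(int maxSize, long maxWaitMillis, LongSupplier clock,
                         Consumer<List<T>> downstream) {
        this.maxSize = maxSize;
        this.maxWaitMillis = maxWaitMillis;
        this.clock = clock;
        this.downstream = downstream;
    }

    // Buffers one record and flushes as soon as the batch is full (throughput side).
    void add(T record) {
        if (buffer.isEmpty()) {
            oldestMillis = clock.getAsLong();
        }
        buffer.add(record);
        if (buffer.size() >= maxSize) {
            flush();
        }
    }

    // Called periodically; flushes if the oldest record has waited too long (latency side).
    void tick() {
        if (!buffer.isEmpty() && clock.getAsLong() - oldestMillis >= maxWaitMillis) {
            flush();
        }
    }

    // Sends the buffered records downstream as one batch.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        downstream.accept(new ArrayList<>(buffer));
        buffer.clear();
    }

    public static void main(String[] args) {
        long[] now = {0};
        FlushingBufferSketch<String> b =
                new FlushingBufferSketch<>(3, 50, () -> now[0], batch -> System.out.println(batch));
        b.add("a");
        b.add("b");
        b.add("c");   // size limit reached: prints [a, b, c]
        b.add("d");
        now[0] = 60;
        b.tick();     // timeout reached: prints [d]
    }
}
```

The size limit bounds the per-record overhead, and the timeout bounds how long a record can wait when traffic is light.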
I think the major difference between the two is whether the tasks run "continuously" and whether they compose a "pipeline".
In streaming frameworks that do "record-by-record" processing, when the application is launched, all necessary tasks are physically planned and launched together, and they never terminate unless the application is terminated. Source tasks continuously push records to the downstream tasks, and downstream tasks process them and push them further downstream. This is done in a pipelined manner. The source won't stop pushing records unless there are no records to push. (There are backpressure and distributed checkpointing, but let's put the details aside and focus on the concept.)
In streaming frameworks that do "micro-batch" processing, the engine has to decide the boundary of the "batch" for each micro-batch. In Spark, the planning (e.g. how many records this batch will read from the source and process) is normally done on the driver side, and tasks are physically planned based on the decided batch. This approach gives end users a major piece of homework: what is the "appropriate" batch size to achieve the throughput/latency they are targeting? Too small a batch leads to bad throughput, as planning a batch has a non-trivial cost (heavily depending on the sources). Too large a batch leads to bad latency. In addition, the concept of a "stage" suits batch workloads (I see Flink is adopting stages in its batch workload) and is not ideal for streaming workloads, because it means some tasks have to wait for the "completion" of other tasks, so there is no pipelining.
For sure, I don't think such criticism means micro-batch is "unusable". Do you really need to worry about latency when your actual workload can tolerate minutes (or even tens of minutes) of latency? Probably not. You'll want to be concerned about the cost of the learning curve (most likely Spark only vs Spark plus something else, though Kafka Streams only or Flink only is certainly possible) and about maintenance instead. In addition, if you have a workload which requires aggregation (probably with windowing), the latency restriction imposed by the framework is less important, as you'll probably set your window size to minutes or hours.
Micro-batch has an upside as well: if there is a long idle period, the resources held by idle tasks are wasted, and this applies to "record-by-record" streaming frameworks. Micro-batch also allows batch operations on a specific micro-batch which aren't possible in streaming. (Though you should keep in mind that this only applies to the "current" batch.)
I think there's no silver bullet. Spark has been leading in "batch workloads", as it originated to deal with the problems of MapReduce, hence the overall architecture is optimized for batch workloads. Other streaming frameworks started as "streaming native", hence they should have an advantage on streaming workloads but be less optimal on batch workloads. Unified batch and streaming is a new trend, and at some point a framework (or a couple of them) may provide optimal performance on both workloads, but I'm not sure now is the time.
EDIT: If your workload targets "end-to-end exactly once", the latency is bound to the checkpoint interval even for "record-by-record" streaming frameworks. The records between checkpoints compose a sort of batch, so the checkpoint interval would be a new major piece of homework for you.
EDIT2:
Q1) Why windows aggregations would make me bother less about latency? Maybe one really wants to update the stateful operation quickly enough.
The difference in output latency between micro-batch and record-by-record won't be significant (micro-batch can even achieve sub-second latency in some extreme cases) compared to the delay brought by the nature of windowing.
But yes, I'm assuming the case where the emit happens only when the window expires ("append" mode in Structured Streaming). If you'd like to emit all updates whenever there's a change in the window, then yes, there would still be a difference in latency.
Q2) Why the semantics are important in this trade-off? Sounds like it is related, for example, to Kafka-Streams reducing commit-interval when exactly-once is configured. Maybe you mean that checkpointing possibly one-by-one would increase overhead and then impact latency, in order to obtain better semantics?
I don't know the details of Kafka Streams, so my explanation won't be based on how Kafka Streams works. That would be your homework.
If you read through my answer correctly, you'll also agree that streaming frameworks won't checkpoint per record, since the overhead would be significant. That said, the records between two checkpoints form one group (a sort of batch) which has to be reprocessed when a failure happens.
If stateful exactly once (the stateful operation is exactly once, but the output is at-least once) is enough for your application, your application can just write the output to the sink and commit immediately, so that readers of the output can read it right away. Latency won't be affected by the checkpoint interval.
Btw, there are two ways to achieve end-to-end exactly once (especially on the sink side):
1) The sink supports idempotent updates.
2) The sink supports transactional updates.
Case 1) writes the outputs immediately, so the semantics won't affect latency (similar to at-least once), but the storage must be able to handle upserts, and a "partial write" is visible when a failure happens, so your reader applications must tolerate it.
Case 2) writes the outputs but does not commit them until the checkpoint happens. The streaming framework will try to ensure that the output is committed and exposed only when the checkpoint succeeds and there's no failure in the group. There are various approaches to making distributed writes transactional (2PC, the coordinator doing an "atomic rename", the coordinator writing the list of files the tasks wrote, etc.), but either way the reader can't see a partial write until the commit happens, so the checkpoint interval contributes greatly to the output latency.
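The two sink styles can be contrasted with a small in-memory sketch. It is an illustration only. IdempotentSink and TransactionalSink are hypothetical classes, not a real connector API, and real transactional sinks need a distributed commit protocol rather than a list copy.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory sinks illustrating the two approaches; not a real connector API.
class ExactlyOnceSinkSketch {

    // Case 1: idempotent updates. Writes are visible immediately; replaying a record is harmless.
    static final class IdempotentSink {
        private final Map<String, Integer> table = new LinkedHashMap<>();

        void write(String key, int value) {
            table.put(key, value); // upsert by key
        }

        Map<String, Integer> visible() {
            return table;
        }
    }

    // Case 2: transactional updates. Writes are staged and only exposed when the checkpoint commits.
    static final class TransactionalSink {
        private final List<String> staged = new ArrayList<>();
        private final List<String> committed = new ArrayList<>();

        void write(String record) {
            staged.add(record);
        }

        void commitOnCheckpoint() {
            committed.addAll(staged);
            staged.clear();
        }

        void abortOnFailure() {
            staged.clear(); // staged output is discarded; the records get reprocessed later
        }

        List<String> visible() {
            return committed;
        }
    }

    public static void main(String[] args) {
        IdempotentSink upsert = new IdempotentSink();
        upsert.write("user-1", 10);
        upsert.write("user-1", 10);           // replay after a failure: same end state
        System.out.println(upsert.visible()); // prints {user-1=10}

        TransactionalSink tx = new TransactionalSink();
        tx.write("r1");
        System.out.println(tx.visible());     // prints [] (nothing visible before the checkpoint)
        tx.commitOnCheckpoint();
        System.out.println(tx.visible());     // prints [r1]
    }
}
```

In the second sink, a reader sees nothing until commitOnCheckpoint runs, which is where the checkpoint interval turns into output latency.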
Q3) This doesn't necessarily address the point about the batch of records that Kafka clients poll.
My answer explains the general concept, which also applies when the source provides a batch of records per poll request.
Record-by-record: the source continuously fetches records and sends them to the downstream operators. The source doesn't need to wait for downstream operators to finish the previous records. In recent streaming frameworks, non-shuffle operators are handled together in one task; in that case, the downstream operator here technically means a downstream operator that requires a "shuffle".
Micro-batch: the engine plans the new micro-batch (the offset range of the source, etc.) and launches tasks for the micro-batch. Within each micro-batch, it behaves like batch processing.
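The micro-batch planning loop can be sketched roughly as follows. This is a conceptual illustration, not Spark's actual planner. MicroBatchPlannerSketch, planNext and the maxRecordsPerBatch cap are hypothetical names, and the source is modelled as a single offset sequence.

```java
// Hypothetical sketch of driver-side micro-batch planning; not Spark's actual planner.
class MicroBatchPlannerSketch {

    // Half-open offset range [from, until) that one micro-batch will read.
    static final class OffsetRange {
        final long from;
        final long until;

        OffsetRange(long from, long until) {
            this.from = from;
            this.until = until;
        }

        long size() {
            return until - from;
        }

        @Override
        public String toString() {
            return "[" + from + ", " + until + ")";
        }
    }

    // Decides the next batch boundary; returns null when there is nothing new to read.
    static OffsetRange planNext(long committedOffset, long latestOffset, long maxRecordsPerBatch) {
        if (latestOffset <= committedOffset) {
            return null;
        }
        long until = Math.min(latestOffset, committedOffset + maxRecordsPerBatch);
        return new OffsetRange(committedOffset, until);
    }

    public static void main(String[] args) {
        long committed = 0;
        long latest = 250; // pretend the source currently ends here
        OffsetRange batch;
        // Each iteration is one micro-batch: plan, run its tasks to completion, then commit.
        while ((batch = planNext(committed, latest, 100)) != null) {
            System.out.println("run batch " + batch);
            committed = batch.until;
        }
        // prints three lines: run batch [0, 100), run batch [100, 200), run batch [200, 250)
    }
}
```

The next batch is not planned until the previous one has finished, which is the structural difference from the continuously running record-by-record pipeline.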

Order Guarantee with Spark Streaming

I am trying to get some change events from Kafka that I would like to propagate downstream to another system. However, the order of the changes matters. Hence I wonder what the appropriate way to do that is, with some Spark transformation in the middle.
The only option I see is to lose the parallelism and put the DStream on one partition. Maybe there is a way to do the operations in parallel, bring everything back into one partition, and then send it to the external system, or send it back to Kafka and use a Kafka sink for that.
What approach can I try?
In a distributed environment, with some form of caching/buffering at most layers, messages generated on the same machine may reach the back end in a different order. Also, the definition of order is subjective. Implementing a global definition of order will be restrictive (and may not be correct) for the data as a whole.
So, Kafka is meant to keep the data in the order in which it was put, but partitions come as a catch!!! Partitions define the level of parallelism per topic.
Typically, at the level of abstraction at which Kafka sits, it should not bother much about order. It should be optimised for maximum throughput, where partitioning comes in handy!!! Consider ordering just a side effect of supporting streaming!!!
Now, whatever logic ensures that data is put into Kafka in order makes more sense in your application (the Spark job).
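Because Kafka only preserves order within a partition, one way to read the point above is to scope "order" to a key rather than to the whole stream. The sketch below is an assumption about how that could look, not part of the original answer. Event, partitionFor and route are hypothetical, and the hash used here is simply String.hashCode, which is not the hash Kafka's default partitioner uses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of key-based routing; Kafka's default partitioner uses a different hash.
class KeyedOrderingSketch {

    static final class Event {
        final String key;
        final int seq;

        Event(String key, int seq) {
            this.key = key;
            this.seq = seq;
        }

        @Override
        public String toString() {
            return key + "#" + seq;
        }
    }

    // The same key always maps to the same partition (while the partition count is unchanged).
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Appends each event to its partition, preserving arrival order within the partition.
    static List<List<Event>> route(List<Event> events, int numPartitions) {
        List<List<Event>> partitions = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            partitions.add(new ArrayList<>());
        }
        for (Event e : events) {
            partitions.get(partitionFor(e.key, numPartitions)).add(e);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
                new Event("order-1", 1), new Event("order-2", 1),
                new Event("order-1", 2), new Event("order-2", 2));
        // Within each partition, events of one key keep their relative order.
        System.out.println(route(events, 3));
    }
}
```

With this kind of routing, changes to the same entity stay ordered while different entities are still processed in parallel.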

How does Hazelcast Jet achieve anything different from Hazelcast EntryProcessors?

How does Hazelcast Jet achieve anything vastly different from what was already achievable by submitting EntryProcessors on keys in an IMap?
Curious to know.
Quoting the InfoQ article on Jet:
Sending a runnable to a partition is analogous to the work of a single DAG vertex. The advantage of Jet comes from the ability to have the vertex transform the data it reads, producing items which no longer belong to the same partition, then reshuffle them while sending to the downstream vertex so they are again correctly partitioned. This is essential for any kind of map-reduce operation where the reducing unit must observe all the data items with the same key. To minimize network traffic, Jet can first reduce the data slice produced on the local member, then send only one item per key to the remote member that combines the partial results.
And note that this is just an advantage in the context of the same or similar use cases currently covered by entry processors. Jet can take data from any source and make use of the whole cluster's computational resources to process it.
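The two-stage reduction described in the quote (reduce locally, then send one item per key to be combined) can be sketched with plain collections. This is not the Hazelcast Jet API. The names TwoStageAggregationSketch, localReduce and combine are hypothetical, each list stands for the data slice on one cluster member, and counting is used as the example reduction.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of two-stage aggregation; plain collections, not the Hazelcast Jet API.
class TwoStageAggregationSketch {

    // Stage 1: each member reduces its own slice, producing one partial count per key.
    static Map<String, Long> localReduce(List<String> slice) {
        Map<String, Long> partial = new HashMap<>();
        for (String key : slice) {
            partial.merge(key, 1L, Long::sum);
        }
        return partial;
    }

    // Stage 2: the partials are "sent" to the member owning each key and combined there.
    static Map<String, Long> combine(List<Map<String, Long>> partials) {
        Map<String, Long> total = new TreeMap<>(); // sorted only to make the output stable
        for (Map<String, Long> partial : partials) {
            partial.forEach((key, count) -> total.merge(key, count, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> member1 = List.of("a", "b", "a");
        List<String> member2 = List.of("b", "b", "c");
        // Only one entry per key leaves each member, instead of every raw item.
        Map<String, Long> result = combine(List.of(localReduce(member1), localReduce(member2)));
        System.out.println(result); // prints {a=2, b=3, c=1}
    }
}
```

Here member1 ships two entries instead of three items and member2 ships two instead of three. The saving grows with the number of repeated keys per member.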

Dynamic CPUs per Task in Spark

Let's say my job performs several Spark actions, where the first few do not use multiple cores for a single task, so I would like each instance to run (executor.cores) tasks in parallel (spark.task.cpus=1).
Then suppose I have another action which can be parallelized internally. I would like a feature that lets me increase spark.task.cpus (say, to use more cores on the executor) and run fewer tasks simultaneously on each instance.
My workaround right now is to save the data, start a new SparkContext with new settings, and reload the data.
The use case: my later actions may be unavoidably skewed, and I may want to apply more than one core per task to avoid bottlenecking on such large tasks, but I don't want this to impact the earlier actions, which benefit from using 1 core per task.
From looking around, my guess is that I can't do this currently, so I'm mainly wondering whether there is a significant limitation behind not allowing this. Alternatively, I'd welcome suggestions for how I could trick Spark into achieving something similar.
Note: Currently using 1.6.2 but willing to hear other options for Spark2+

Resources