Spark streaming with Kafka - createDirectStream vs createStream - apache-spark

We have been using spark streaming with kafka for a while and until now we were using the createStream method from KafkaUtils.
We just started exploring the createDirectStream and like it for two reasons:
1) Better/easier "exactly once" semantics
2) Better correlation of kafka topic partition to rdd partitions
I did notice that the createDirectStream is marked as experimental. The question I have is (sorry if this in not very specific):
Should we explore the createDirectStream method if exactly once is very important to us? Will be awesome if you guys can share your experience with it. Are we running the risk of having to deal with other issues such as reliability etc?

There is a great, extensive blog post by the creator of the direct approach (Cody) here.
In general, reading the Kafka delivery semantics section, the last part says:
So effectively Kafka guarantees at-least-once delivery by default and
allows the user to implement at most once delivery by disabling
retries on the producer and committing its offset prior to processing
a batch of messages. Exactly-once delivery requires co-operation with
the destination storage system but Kafka provides the offset which
makes implementing this straight-forward.
This basically means "we give you at least once out of the box, if you want exactly once, that's on you". Further, the blog post talks about the guarantee of "exactly once" semantics you get from Spark with both approaches (direct and receiver based, emphasis mine):
Second, understand that Spark does not guarantee exactly-once
semantics for output actions. When the Spark streaming guide talks
about exactly-once, it’s only referring to a given item in an RDD
being included in a calculated value once, in a purely functional
sense. Any side-effecting output operations (i.e. anything you do in
foreachRDD to save the result) may be repeated, because any stage of
the process might fail and be retried.
Also, this is what the Spark documentation says about receiver based processing:
The first approach (Receiver based) uses Kafka’s high level API to store consumed
offsets in Zookeeper. This is traditionally the way to consume data
from Kafka. While this approach (in combination with write ahead logs)
can ensure zero data loss (i.e. at-least once semantics), there is a
small chance some records may get consumed twice under some failures.
This basically means that if you're using the Receiver based stream with Spark you may still have duplicated data in case the output transformation fails, it is at least once.
In my project I use the direct stream approach, where the delivery semantics depend on how you handle them. This means that if you want to ensure exactly once semantics, you can store the offsets along with the data in a transaction like fashion, if one fails the other fails as well.
I recommend reading the blog post (link above) and the Delivery Semantics in the Kafka documentation page. To conclude, I definitely recommend you look into the direct stream approach.

Related

Why so much criticism around Spark Streaming micro-batch (when using kafka as source)?

Since any Kafka Consumer is in reality consuming in batches, why there is so much criticism around Spark Streaming micro-batch (when using Kafka as his source), for example, in comparison to Kafka Streams (which markets itself as real streaming)?
I mean: a lot of criticism hover on Spark Streaming micro-batch architecture. And, normally, people say that Kafka Streams is a real 'real-time' tool, since it processes events one-by-one.
It does process events one by one, but, from my understanding, it uses (as almost every other library/framework) the Consumer API. The Consumer API polls from topics in batches in order to reduce network burden (the interval is configurable). Therefore, the Consumer will do something like:
while (true) {
ConsumerRecords<String, String> records = consumer.poll(100);
///// PROCESS A **BATCH** OF RECORDS
for (ConsumerRecord<String, String> record : records) {
///// PROCESS **ONE-BY-ONE**
}
}
So, although it is right to say that Spark:
maybe has higher latency due to its micro-batch minimum interval that limits latency to at best 100 ms (see Spark Structured Streaming DOCs);
processes records in groups (either as DStreams of RDDs or as DataFrames in Structured Streaming).
But:
One can process records one-by-one in Spark - just loop though RDDs/Rows
Kafka Streams in reality polls batches of records, but processes them one-by-one, since it implements the Consumer API under-the-hoods.
Just to make clear, I am not questioning from a 'fan-side' (and therefore, being it an opinion question), just the opposite, I am really trying to understand it technically in order to understand the semantics in the streaming ecosystem.
Appreciate every piece of information in this matter.
DISCLAIMER: I had involved in Apache Storm (which is known to be a streaming framework processing "record-by-record", though there's trident API as well), and now involving in Apache Spark ("micro-batch").
The one of major concerns in streaming technology has been "throughput vs latency". In latency perspective, "record-by-record" processing is clearly a winner, but the cost of "doing everything one by one" is significant and every minor thing becomes a huge overhead. (Consider the system aims to process a million records per second, then any additional overhead on processing gets multiplexed by a million.) Actually, there was opposite criticism as well, bad throughput on "read-by-record" compared to the "micro-batch". To address this, streaming frameworks add batching in their "internal" logic but in a way to less hurting latency. (like configuring the size of batch, and timeout to force flush the batch)
I think the major difference between the twos is that whether the tasks are running "continuously" and they're composing a "pipeline".
In streaming frameworks do "record-by-record", when the application is launched, all necessary tasks are physically planned and launched altogether and they never terminate unless application is terminated. Source tasks continuously push the records to the downstream tasks, and downstream tasks process them and push to next downstream. This is done in pipeline manner. Source won't stop pushing the records unless there's no records to push. (There're backpressure and distributed checkpoint, but let's put aside of the details and focus on the concept.)
In streaming frameworks do "micro-batch", they have to decide the boundary of "batch" for each micro-batch. In Spark, the planning (e.g. how many records this batch will read from source and process) is normally done by driver side and tasks are physically planned based on the decided batch. This approach gives end users a major homework - what is the "appropriate" size of batch to achieve the throughput/latency they're targeting. Too small batch leads bad throughput, as planning a batch requires non-trivial cost (heavily depending on the sources). Too huge batch leads bad latency. In addition, the concept of "stage" is appropriate to the batch workload (I see Flink is adopting the stage in their batch workload) and not ideal for streaming workload, because this means some tasks should wait for the "completion" of other tasks, no pipeline.
For sure, I don't think such criticism means micro-batch is "unusable". Do you really need to bother the latency when your actual workload can tolerate minutes (or even tens of minutes) of latency? Probably no. You'll want to concern about the cost of learning curve (most likely Spark only vs Spark & other, but Kafka stream only or Flink only is possible for sure.) and maintenance instead. In addition, if you have a workload which requires aggregation (probably with windowing), the restriction of latency from the framework is less important, as you'll probably set your window size to minutes/hours.
Micro-batch has upside as well - if there's a huge idle, the resources running idle tasks are wasted, which applies to "record-to-record" streaming frameworks. It also allows to do batch operations for the specific micro-batch which aren't possible on streaming. (Though you should keep in mind it only applies to "current" batch.)
I think there's no silver bullet - Spark has been leading the "batch workload" as it's originated to deal with problems of MapReduce, hence the overall architecture is optimized to the batch workload. Other streaming frameworks start from "streaming native", hence should have advantage on streaming workload, but less optimal on batch workload. Unified batch and streaming is a new trend, and at some time a (or a couple of) framework may provide optimal performance on both workloads, but I'm not sure now is the time.
EDIT: If your workload targets "end-to-end exactly once", the latency is bound to the checkpoint interval even for "record-by-record" streaming frameworks. The records between checkpoint compose a sort of batch, so checkpoint interval would be a new major homework for you.
EDIT2:
Q1) Why windows aggregations would make me bother less about latency? Maybe one really wants to update the stateful operation quickly enough.
The output latency between micro-batch and record-by-record won't be significant (even the micro-batch could also achieve the sub-second latency in some extreme cases) compared to the delay brought by the nature of windowing.
But yes, I'm assuming the case the emit happens only when window gets expired ("append" mode in Structured Streaming). If you'd like to emit all the updates whenever there's change in window then yes, there would be still difference on the latency perspective.
Q2) Why the semantics are important in this trade-off? Sounds like it is related, for example, to Kafka-Streams reducing commit-interval when exactly-once is configured. Maybe you mean that checkpointing possibly one-by-one would increase overhead and then impact latency, in order to obtain better semantics?
I don't know the details about Kafka stream, so my explanation won't be based on how Kafka stream works. That would be your homework.
If you read through my answer correctly, you've also agreed that streaming frameworks won't do the checkpoint per record - the overhead would be significant. That said, records between the two checkpoints would be the same group (sort of a batch) which have to be reprocessed when the failure happens.
If stateful exactly once (stateful operation is exactly once, but the output is at-least once) is enough for your application, your application can just write the output to the sink and commit immediately so that readers of the output can read them immediately. Latency won't be affected by the checkpoint interval.
Btw, there're two ways to achieve end-to-end exactly once (especially the sink side):
supports idempotent updates
supports transactional updates
The case 1) writes the outputs immediately so won't affect latency through the semantic (similar with at-least once), but the storage should be able to handle upsert, and the "partial write" is seen when the failure happens so your reader applications should tolerate it.
The case 2) writes the outputs but not commits them until the checkpoint is happening. The streaming frameworks will try to ensure that the output is committed and exposed only when the checkpoint succeeds and there's no failure in the group. There're various approaches to make the distributed writes be transactional (2PC, coordinator does "atomic rename", coordinator writes the list of the files tasks wrote, etc.), but in any way the reader can't see the partial write till the commit happens so checkpoint interval would greatly contribute the output latency.
Q3) This doesn't necessarily address the point about the batch of records that Kafka clients poll.
My answer explains the general concept which is also applied even the case of source which provides a batch of records in a poll request.
Record-by-record: source continuously fetches the records and sends to the downstream operators. Source wouldn't need to wait for the completion of downstream operators on previous records. In recent streaming frameworks, non-shuffle operators would have handled altogether in a task - for such case, the downstream operator here technically means that there's a downstream operator requires "shuffle".
Micro-batch: the engine plans the new micro-batch (the offset range of the source, etc.) and launch tasks for the micro batch. In each micro batch, it behaves similar with the batch processing.

Kappa architecture: when insert to batch/analytic serving layer happens

As you know, Kappa architecture is some kind of simplification of Lambda architecture. Kappa doesn't need batch layer, instead speed layer have to guarantee computation precision and enough throughput (more parallelism/resources) on historical data re-computation.
Still Kappa architecture requires two serving layers in case when you need to do analytic based on historical data. For example, data that have age < 2 weeks are stored at Redis (streaming serving layer), while all older data are stored somewhere at HBase (batch serving layer).
When (due to Kappa architecture) I have to insert data to batch serving layer?
If streaming layer inserts data immidiately to both batch & stream serving layers - than how about late data arrival? Or streaming layer should backup speed serving layer to batch serving layer on regular basis?
Example: let say source of data is Kafka, data are processed by Spark Structured Streaming or Flink, sinks are Redis and HBase. When write to Redis & HBase should happen?
If we perform stream processing, we want to make sure that output data is firstly made available as a data stream. In your example that means we write to Kafka as a primary sink.
Now you have two options:
have secondary jobs that reads from that Kafka topic and writes to Redis and HBase. That is the Kafka way, in that Kafka Streams does not support writing directly to any of these systems and you set up a Kafka connect job. These secondary jobs can then be tailored to the specific sinks, but they add additional operations overhead. (That's a bit of the backup option that you mentioned).
with Spark and Flink you also have the option to have secondary sinks directly in your job. You may add additional processing steps to transform the Kafka output into a more suitable form for the sink, but you are more limited when configuring the job. For example in Flink, you need to use the same checkpointing settings for the Kafka sink and the Redis/HBase sink. Nevertheless, if the settings work out, you just need to run one streaming job instead of 2 or 3.
Late events
Now the question is what to do with late data. The best solution is to let the framework handle that through watermarks. That is, data is only committed at all sinks, when the framework is sure that no late data arrives. If that doesn't work out because you really need to process late events even if they arrive much, much later and still want to have temporary results, you have to use update events.
Update events
(as requested by the OP, I will add more details to the update events)
In Kafka Streams, elements are emitted through a continuous refinement mechanism by default. That means, windowed aggregations emit results as soon as they have any valid data point and update that result while receiving new data. Thus, any late event is processed and yield an updated result. While this approach nicely lowers the burden to users, as they do not need to understand watermarks, it has some severe short-comings that led the Kafka Streams developers to add Suppression in 2.1 and onward.
The main issue is that it poses quite big challenges to downward users to process intermediate results as also explained in the article about Suppression. If it's not obvious if a result is temporary or "final" (in the sense that all expected events have been processed) then many applications are much harder to implement. In particular, windowing operations need to be replicated on consumer side to get the "final" value.
Another issue is that the data volume is blown up. If you'd have a strong aggregation factor, using watermark-based emission will reduce your data volume heavily after the first operation. However, continuous refinement will add a constant volume factor as each record triggers a new (intermediate) record for all intermediate steps.
Lastly, and particularly interesting for you is how to offload data to external systems if you have update events. Ideally, you would offload the data with some time lag continuously or periodically. That approach simulates the watermark-based emission again on consumer side.
Mixing the options
It's possible to use watermarks for the initial emission and then use update events for late events. The volume is then reduced for all "on-time" events. For example, Flink offers allowed lateness to make windows trigger again for late events.
This setup makes offloading data much easier as data only needs to be re-emitted to the external systems if a late event actually happened. The system should be tweaked that a late event is a rare case though.

Learning Spark Streaming

I am learning spark streaming using the book "Learning spark Streaming". In the book i found the following on a section talking about Dstream, RDD, block/partition.
Finally, one important point that is glossed over in this schema is that the Receiver interface also has the option of connecting to a data source that delivers a collection (think Array) of data pieces. This is particularly relevant in some de-serialization uses, for example. In this case, the Receiver does not go through a block interval wait to deal with the segmentation of data into partitions, but instead considers the whole collection reflects the segmentation of the data into blocks, and creates one block for each element of the collection. This operation is demanding on the part of the Producer of data, since it requires it to be producing blocks at the ratio of the block interval to batch interval to function reliably (delivering the correct number of blocks on every batch). But some have found it can provide superior performance, provided an implementation that is able to quickly make many blocks available for serialization.
I have been banging my head around and can't simply understand what the Author is talking about, although i feel like i should understand it. Can someone give me some pointers on that ?
Disclosure: I'm co-author of the book.
What we want to express there is that the custom receiver API has 2 working modes: one where the producing side delivers one-message-at-time and the other where the receiver may deliver many messages at once (bulk).
In the one-message-at-time mode, Spark is responsible of buffering and collecting the data into blocks for further processing.
In the bulk mode, the burden of buffering and grouping is on the producing side, but it might be more efficient in some scenarios.
This is reflected in the API:
def store(dataBuffer: ArrayBuffer[T]): Unit
Store an ArrayBuffer of received data as a data block into Spark's memory.
def store(dataItem: T): Unit
Store a single item of received data to Spark's memory.
I agree with you that the paragraph is convoluted and might not convey the message as clear as we would like. I'll take care of improving it.
Thanks for your feedback!

Order Guarantee with Sparking Streaming

I am trying to get some change event from Kafka that I would like to propagate downstream in another system. However the Change order matters. Hence I wonder what is the appropriate way to do that with some Spark transformation in the middle.
The only thing I see is to loose the parallelism and make the DStream on one partition. Maybe there is a way to do operation in parallel and bring everything back in one partition and then send it to the external system or back in Kafka and then use a Kafka Sink for the matter.
What approach can I try?
In a distributed environment, with some form of cashing/buffering at most layer, message generated from same machine may reach back-end in different order. Also the definition of order is subjective. Implementing a global definition of order will be restrictive (may not be correct) for the data as a whole.
So, Kafka is meant for keeping the data in order in the order of put but partition comes as a catch!!! Partition defines the level of parallelism per topic.
Typically, the level of abstraction at which kafka is kept, it should not bother much about order. It should be optimised for maximum throughput, where partitioning will come handy!!! Consider ordering just a side effect of supporting streaming!!!
Now, what ever logic ensures, that data is put in to kafka in order, that makes more sense in your application (spark job).

What is the most simple way to write to kafka from spark stream

I would like to write to kafka from spark stream data.
I know that I can use KafkaUtils to read from kafka.
But, KafkaUtils doesn't provide API to write to kafka.
I checked past question and sample code.
Is Above sample code the most simple way to write to kafka?
If I adopt way like above sample, I must create many classes...
Do you know more simple way or library to help to write to kafka?
Have a look here:
Basically this blog post summarise your possibilities which are written in different variations in the link you provided.
If we will look at your task straight forward, we can make several assumptions:
Your output data is divided to several partitions, which may (and quite often will) reside on different machines
You want to send the messages to Kafka using standard Kafka Producer API
You don't want to pass data between machines before the actual sending to Kafka
Given those assumptions your set of solution is pretty limited: You whether have to create a new Kafka producer for each partition and use it to send all the records of that partition, or you can wrap this logic in some sort of Factory / Sink but the essential operation will remain the same : You'll still request a producer object for each partition and use it to send the partition records.
I'll suggest you continue with one of the examples in the provided link, the code is pretty short, and any library you'll find would most probably do the exact same thing behind the scenes.

Resources