Late data handling in Spark past watermark - apache-spark

Is there a way in Spark to handle data that arrived past watermark?
Consider a use case of devices that send messages, and those messages need to be processed inside Kafka + Spark. While 99% of messages are delivered to Spark server within let us 10 minutes, but occasionally a device may go out of connectivity zone for a day or a week and buffer messages internally, and then once a connection is restored deliver them a week later.
Watermark interval necessarily has to be fairly limited, as (1) results in the mainline case have to be produced timely, and also (2) because buffering space inside Spark is limited too, so Spark cannot keep a week worth of messages for all the devices buffered in a week-long watermark window.
In a regular Spark streaming construct, messages past watermark are discarded.
Is there a way to intercept those "very late" messages and route them to a handler or a separate stream -- only those "rejected" messages that do not fall within the watermark?

No, there is not. Apache Flink can handle such things I seem to remember. Spark has no feed for dropped data.

Related

Watermarking by key in Spark structured streaming

I have data coming in on Kafka from IoT devices. The timestamps of the sensor data of these devices are often not in sync due to network congestion, device being out of range, etc.
We have to write streaming jobs to aggregate sensor values over a window of time for each device independently. With the groupby with watermark operation, we lose the data of all devices that lag behind device with latest timestamp.
Is there any way that the watermarks could be applied separately to each device based on the latest timestamp for that device, and not the latest timestamp across all devices?
We cannot keep a large lag as the device could be out of range for days. We cannot run an individual query for each device as the number of devices is high.
Would it be achievable using flatMapGroupsWithState? Or is this something that cannot be achieved with Spark Structured Streaming at all?
I think instead of watermarking by event timestamp (which could be lagging behind as you said), you could apply a watermark over the processing timestamp (i.e. the time when you process the data in your Spark job). I faced a very similar problem in a recent project I was working on and that's how I solved it.
Example:
val dfWithWatermark = df
.withColumn("processingTimestamp", current_timestamp())
.withWatermark("processingTimestamp", "1 day")
// ... use dfWithWatermark to do aggregations etc
This will keep a state over 1 day of your IoT data, no matter what the timestamp of the data is that you're receiving.
There are some limitations to this of course, for example if there are devices that send data in intervals larger than your watermark. But to figure out a solution for this you'd have to be more specific with your requirements.
By using flatMapGroupsWithState you can be very specific with your state keeping, but I don't think it's really necessary in your case.
Edit: if you however decide to go with flatMapGroupsWithState, you can use different timeouts per device group by calling state.setTimeoutDuration() with different intervals, depending on the type of device you process. This way you can be very specific with the state keeping.

Why so much criticism around Spark Streaming micro-batch (when using kafka as source)?

Since any Kafka Consumer is in reality consuming in batches, why there is so much criticism around Spark Streaming micro-batch (when using Kafka as his source), for example, in comparison to Kafka Streams (which markets itself as real streaming)?
I mean: a lot of criticism hover on Spark Streaming micro-batch architecture. And, normally, people say that Kafka Streams is a real 'real-time' tool, since it processes events one-by-one.
It does process events one by one, but, from my understanding, it uses (as almost every other library/framework) the Consumer API. The Consumer API polls from topics in batches in order to reduce network burden (the interval is configurable). Therefore, the Consumer will do something like:
while (true) {
ConsumerRecords<String, String> records = consumer.poll(100);
///// PROCESS A **BATCH** OF RECORDS
for (ConsumerRecord<String, String> record : records) {
///// PROCESS **ONE-BY-ONE**
}
}
So, although it is right to say that Spark:
maybe has higher latency due to its micro-batch minimum interval that limits latency to at best 100 ms (see Spark Structured Streaming DOCs);
processes records in groups (either as DStreams of RDDs or as DataFrames in Structured Streaming).
But:
One can process records one-by-one in Spark - just loop though RDDs/Rows
Kafka Streams in reality polls batches of records, but processes them one-by-one, since it implements the Consumer API under-the-hoods.
Just to make clear, I am not questioning from a 'fan-side' (and therefore, being it an opinion question), just the opposite, I am really trying to understand it technically in order to understand the semantics in the streaming ecosystem.
Appreciate every piece of information in this matter.
DISCLAIMER: I had involved in Apache Storm (which is known to be a streaming framework processing "record-by-record", though there's trident API as well), and now involving in Apache Spark ("micro-batch").
The one of major concerns in streaming technology has been "throughput vs latency". In latency perspective, "record-by-record" processing is clearly a winner, but the cost of "doing everything one by one" is significant and every minor thing becomes a huge overhead. (Consider the system aims to process a million records per second, then any additional overhead on processing gets multiplexed by a million.) Actually, there was opposite criticism as well, bad throughput on "read-by-record" compared to the "micro-batch". To address this, streaming frameworks add batching in their "internal" logic but in a way to less hurting latency. (like configuring the size of batch, and timeout to force flush the batch)
I think the major difference between the twos is that whether the tasks are running "continuously" and they're composing a "pipeline".
In streaming frameworks do "record-by-record", when the application is launched, all necessary tasks are physically planned and launched altogether and they never terminate unless application is terminated. Source tasks continuously push the records to the downstream tasks, and downstream tasks process them and push to next downstream. This is done in pipeline manner. Source won't stop pushing the records unless there's no records to push. (There're backpressure and distributed checkpoint, but let's put aside of the details and focus on the concept.)
In streaming frameworks do "micro-batch", they have to decide the boundary of "batch" for each micro-batch. In Spark, the planning (e.g. how many records this batch will read from source and process) is normally done by driver side and tasks are physically planned based on the decided batch. This approach gives end users a major homework - what is the "appropriate" size of batch to achieve the throughput/latency they're targeting. Too small batch leads bad throughput, as planning a batch requires non-trivial cost (heavily depending on the sources). Too huge batch leads bad latency. In addition, the concept of "stage" is appropriate to the batch workload (I see Flink is adopting the stage in their batch workload) and not ideal for streaming workload, because this means some tasks should wait for the "completion" of other tasks, no pipeline.
For sure, I don't think such criticism means micro-batch is "unusable". Do you really need to bother the latency when your actual workload can tolerate minutes (or even tens of minutes) of latency? Probably no. You'll want to concern about the cost of learning curve (most likely Spark only vs Spark & other, but Kafka stream only or Flink only is possible for sure.) and maintenance instead. In addition, if you have a workload which requires aggregation (probably with windowing), the restriction of latency from the framework is less important, as you'll probably set your window size to minutes/hours.
Micro-batch has upside as well - if there's a huge idle, the resources running idle tasks are wasted, which applies to "record-to-record" streaming frameworks. It also allows to do batch operations for the specific micro-batch which aren't possible on streaming. (Though you should keep in mind it only applies to "current" batch.)
I think there's no silver bullet - Spark has been leading the "batch workload" as it's originated to deal with problems of MapReduce, hence the overall architecture is optimized to the batch workload. Other streaming frameworks start from "streaming native", hence should have advantage on streaming workload, but less optimal on batch workload. Unified batch and streaming is a new trend, and at some time a (or a couple of) framework may provide optimal performance on both workloads, but I'm not sure now is the time.
EDIT: If your workload targets "end-to-end exactly once", the latency is bound to the checkpoint interval even for "record-by-record" streaming frameworks. The records between checkpoint compose a sort of batch, so checkpoint interval would be a new major homework for you.
EDIT2:
Q1) Why windows aggregations would make me bother less about latency? Maybe one really wants to update the stateful operation quickly enough.
The output latency between micro-batch and record-by-record won't be significant (even the micro-batch could also achieve the sub-second latency in some extreme cases) compared to the delay brought by the nature of windowing.
But yes, I'm assuming the case the emit happens only when window gets expired ("append" mode in Structured Streaming). If you'd like to emit all the updates whenever there's change in window then yes, there would be still difference on the latency perspective.
Q2) Why the semantics are important in this trade-off? Sounds like it is related, for example, to Kafka-Streams reducing commit-interval when exactly-once is configured. Maybe you mean that checkpointing possibly one-by-one would increase overhead and then impact latency, in order to obtain better semantics?
I don't know the details about Kafka stream, so my explanation won't be based on how Kafka stream works. That would be your homework.
If you read through my answer correctly, you've also agreed that streaming frameworks won't do the checkpoint per record - the overhead would be significant. That said, records between the two checkpoints would be the same group (sort of a batch) which have to be reprocessed when the failure happens.
If stateful exactly once (stateful operation is exactly once, but the output is at-least once) is enough for your application, your application can just write the output to the sink and commit immediately so that readers of the output can read them immediately. Latency won't be affected by the checkpoint interval.
Btw, there're two ways to achieve end-to-end exactly once (especially the sink side):
supports idempotent updates
supports transactional updates
The case 1) writes the outputs immediately so won't affect latency through the semantic (similar with at-least once), but the storage should be able to handle upsert, and the "partial write" is seen when the failure happens so your reader applications should tolerate it.
The case 2) writes the outputs but not commits them until the checkpoint is happening. The streaming frameworks will try to ensure that the output is committed and exposed only when the checkpoint succeeds and there's no failure in the group. There're various approaches to make the distributed writes be transactional (2PC, coordinator does "atomic rename", coordinator writes the list of the files tasks wrote, etc.), but in any way the reader can't see the partial write till the commit happens so checkpoint interval would greatly contribute the output latency.
Q3) This doesn't necessarily address the point about the batch of records that Kafka clients poll.
My answer explains the general concept which is also applied even the case of source which provides a batch of records in a poll request.
Record-by-record: source continuously fetches the records and sends to the downstream operators. Source wouldn't need to wait for the completion of downstream operators on previous records. In recent streaming frameworks, non-shuffle operators would have handled altogether in a task - for such case, the downstream operator here technically means that there's a downstream operator requires "shuffle".
Micro-batch: the engine plans the new micro-batch (the offset range of the source, etc.) and launch tasks for the micro batch. In each micro batch, it behaves similar with the batch processing.

Kappa architecture: when insert to batch/analytic serving layer happens

As you know, Kappa architecture is some kind of simplification of Lambda architecture. Kappa doesn't need batch layer, instead speed layer have to guarantee computation precision and enough throughput (more parallelism/resources) on historical data re-computation.
Still Kappa architecture requires two serving layers in case when you need to do analytic based on historical data. For example, data that have age < 2 weeks are stored at Redis (streaming serving layer), while all older data are stored somewhere at HBase (batch serving layer).
When (due to Kappa architecture) I have to insert data to batch serving layer?
If streaming layer inserts data immidiately to both batch & stream serving layers - than how about late data arrival? Or streaming layer should backup speed serving layer to batch serving layer on regular basis?
Example: let say source of data is Kafka, data are processed by Spark Structured Streaming or Flink, sinks are Redis and HBase. When write to Redis & HBase should happen?
If we perform stream processing, we want to make sure that output data is firstly made available as a data stream. In your example that means we write to Kafka as a primary sink.
Now you have two options:
have secondary jobs that reads from that Kafka topic and writes to Redis and HBase. That is the Kafka way, in that Kafka Streams does not support writing directly to any of these systems and you set up a Kafka connect job. These secondary jobs can then be tailored to the specific sinks, but they add additional operations overhead. (That's a bit of the backup option that you mentioned).
with Spark and Flink you also have the option to have secondary sinks directly in your job. You may add additional processing steps to transform the Kafka output into a more suitable form for the sink, but you are more limited when configuring the job. For example in Flink, you need to use the same checkpointing settings for the Kafka sink and the Redis/HBase sink. Nevertheless, if the settings work out, you just need to run one streaming job instead of 2 or 3.
Late events
Now the question is what to do with late data. The best solution is to let the framework handle that through watermarks. That is, data is only committed at all sinks, when the framework is sure that no late data arrives. If that doesn't work out because you really need to process late events even if they arrive much, much later and still want to have temporary results, you have to use update events.
Update events
(as requested by the OP, I will add more details to the update events)
In Kafka Streams, elements are emitted through a continuous refinement mechanism by default. That means, windowed aggregations emit results as soon as they have any valid data point and update that result while receiving new data. Thus, any late event is processed and yield an updated result. While this approach nicely lowers the burden to users, as they do not need to understand watermarks, it has some severe short-comings that led the Kafka Streams developers to add Suppression in 2.1 and onward.
The main issue is that it poses quite big challenges to downward users to process intermediate results as also explained in the article about Suppression. If it's not obvious if a result is temporary or "final" (in the sense that all expected events have been processed) then many applications are much harder to implement. In particular, windowing operations need to be replicated on consumer side to get the "final" value.
Another issue is that the data volume is blown up. If you'd have a strong aggregation factor, using watermark-based emission will reduce your data volume heavily after the first operation. However, continuous refinement will add a constant volume factor as each record triggers a new (intermediate) record for all intermediate steps.
Lastly, and particularly interesting for you is how to offload data to external systems if you have update events. Ideally, you would offload the data with some time lag continuously or periodically. That approach simulates the watermark-based emission again on consumer side.
Mixing the options
It's possible to use watermarks for the initial emission and then use update events for late events. The volume is then reduced for all "on-time" events. For example, Flink offers allowed lateness to make windows trigger again for late events.
This setup makes offloading data much easier as data only needs to be re-emitted to the external systems if a late event actually happened. The system should be tweaked that a late event is a rare case though.

Using Spark to process requests

I would like to understand if the following would be a correct use case for Spark.
Requests to an application are received either on a message queue, or in a file which contains a batch of requests. For the message queue, there are currently about 100 requests per second, although this could increase. Some files just contain a few requests, but more often there are hundreds or even many thousands.
Processing for each request includes filtering of requests, validation, looking up reference data, and calculations. Some calculations reference a Rules engine. Once these are completed, a new message is sent to a downstream system.
We would like to use Spark to distribute the processing across multiple nodes to gain scalability, resilience and performance.
I am envisaging that it would work like this:
Load a batch of requests into Spark as as RDD (requests received on the message queue might use Spark Streaming).
Separate Scala functions would be written for filtering, validation, reference data lookup and data calculation.
The first function would be passed to the RDD, and would return a new RDD.
The next function would then be run against the RDD output by the previous function.
Once all functions have completed, a for loop comprehension would be run against the final RDD to send each modified request to a downstream system.
Does the above sound correct, or would this not be the right way to use Spark?
Thanks
We have done something similar working on a small IOT project. we tested receiving and processing around 50K mqtt messages per second on 3 nodes and it was a breeze. Our processing included parsing of each JSON message, some manipulation of the object created and saving of all the records to a time series database.
We defined the batch time for 1 second, the processing time was around 300ms and RAM ~100sKB.
A few concerns with streaming. Make sure your downstream system is asynchronous so you wont get into memory issue. Its True that spark supports back pressure, but you will need to make it happen. another thing, try to keep the state to minimal. more specifically, your should not keep any state that grows linearly as your input grows. this is extremely important for your system scalability.
what impressed me the most is how easy you can scale with spark. with each node we added we grew linearly in the frequency of messages we could handle.
I hope this helps a little.
Good luck

Associate a time-threshold with a message in kafka

Is there any way I can have time threshold associated message in Kafka.
E.g.
Consumer pulls a message out of Kafka but system does not have enough information to process. So I put the message back in "resolver" queue, but I do not want to pull it out of the "resolver" queue for next 15 minutes, is there any way I can achieve that.
No, there is no way to achieve that with just Kafka. Kafka is designed to store messages for a certain period of time or until a certain size. The only way to get messages from Kafka is by offset. The offset identifies which messages have been consumed, messages before the offset have already been read, messages after the offset are yet to be read.
As far as I know Kafka does not provide time stamps with its messages (I could be wrong about that). But reading through the documentation and working with it for several months I have never encountered any example about retrieving messages from Kafka based on time.

Resources