How to handle bad messages in spark structured streaming - apache-spark

I am using Spark 2.3 Structured Streaming (with Scala) to read messages from Kafka and write them to Postgres. My application is supposed to be long-running, and it should be able to handle any kind of failure without human intervention.
I have been looking for ways to catch unexpected errors in Structured Streaming, and I found this example here:
Spark Structured Streaming exception handling
That way it is possible to catch all errors thrown in the stream, but the problem is that when the application retries, it gets stuck on the same exception again.
Is there a way in Structured Streaming to handle the error and tell Spark to increment the offset in the "checkpointLocation" programmatically, so that it proceeds to consume the next message without getting stuck?

In the streaming/event-processing world this is known as handling a "poison pill".
Please have a look at the following link:
https://www.waitingforcode.com/apache-spark-structured-streaming/corrupted-records-poison-pill-records-apache-spark-structured-streaming/read
It suggests several ways to handle this type of scenario:
Strategy 1: let it crash
The streaming application will log the poison pill message and stop processing. That's not a big deal, because thanks to the checkpointed offsets we can reprocess the data and handle it accordingly, maybe with a try-catch block.
However, as you already saw in your question, this is not good practice in streaming systems, because the consumer stops and accumulates lag during that idle period (the producer keeps generating data).
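As a rough illustration (not code from the article), a restart wrapper in Scala might look like the sketch below; buildQuery is a hypothetical helper that starts the streaming query with a checkpointLocation configured. It also shows why this strategy alone stays stuck on the poison pill: the restarted query resumes from the same checkpointed offset.

```scala
import org.apache.spark.sql.streaming.{StreamingQuery, StreamingQueryException}

// buildQuery() is a hypothetical helper that starts the streaming query
// with a checkpointLocation configured.
def runForever(buildQuery: () => StreamingQuery): Unit = {
  while (true) {
    val query = buildQuery()
    try {
      query.awaitTermination()          // throws if the stream crashed
    } catch {
      case e: StreamingQueryException =>
        // Log and restart. Because offsets are checkpointed, the restarted
        // query resumes at the same offset and will hit the poison pill again.
        println(s"Stream failed: ${e.getMessage}. Restarting...")
    }
  }
}
```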
Strategy 2: ignore errors
If you don't want downtime for your consumer, you can simply skip the corrupted events. In Structured Streaming this boils down to filtering out null records and, optionally, logging the unparseable messages (or any records that produce an error) for further investigation.
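A minimal Scala sketch of that filtering, assuming the Kafka source DataFrame is called kafkaDf and the payload follows the hypothetical schema below; from_json returns null for records it cannot parse, so dropping nulls effectively skips them:

```scala
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{DoubleType, StringType, StructType}

// Hypothetical schema of the expected JSON payload.
val schema = new StructType()
  .add("id", StringType)
  .add("amount", DoubleType)

val parsed = kafkaDf.select(
  col("value").cast("string").as("raw"),
  from_json(col("value").cast("string"), schema).as("data"))

val valid   = parsed.filter(col("data").isNotNull).select("raw", "data.*")
val invalid = parsed.filter(col("data").isNull).select("raw")   // log or inspect these
```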
Strategy 3: Dead Letter Queue
We ignore the errors, but instead of only logging them we dispatch the offending records to another data store.
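For instance (a sketch, not the article's code), the invalid records from the previous snippet could be routed to a dead letter Kafka topic; the topic name, broker address, and checkpoint path are placeholders:

```scala
// The Kafka sink expects a string or binary "value" column.
val dlqQuery = invalid
  .withColumnRenamed("raw", "value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")        // placeholder
  .option("topic", "events-dead-letter")                   // placeholder DLQ topic
  .option("checkpointLocation", "/tmp/checkpoints/dlq")    // placeholder path
  .start()
```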
Strategy 4: sentinel value
You can use a pattern called sentinel value, and it can be freely combined with a Dead Letter Queue.
A sentinel value is a well-known value that is returned every time something goes wrong.
So in your case, whenever a record cannot be converted to the structure you're processing, you emit a common placeholder object.
For the article's own code samples, see the link above.
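Independently of the article, a rough Scala sketch of the idea could look like this; the Event case class and the "id,amount" CSV payload are assumptions made purely for illustration:

```scala
case class Event(id: String, amount: Double)

// Sentinel: a well-known placeholder emitted whenever conversion fails.
val Sentinel = Event("UNPARSEABLE", Double.NaN)

def safeParse(raw: String): Event =
  try {
    val Array(id, amount) = raw.split(",", 2)
    Event(id, amount.toDouble)
  } catch {
    case _: Exception => Sentinel
  }
```

Downstream you can count the sentinels, filter them out, or send them to the dead letter queue.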

Related

Is it possible to return a message to a Kafka topic as an "unread" message?

I'm building a Spark Streaming program, and I want it to read certain messages and return them to the topic as "unread" if they could not be processed by the program.
My intention is that the program eventually reads the message again and tries to process it.
Is that possible?
If you "return" a record to the original topic, it'll be appended to the end of the topic, and therefore consumed again on the next poll.
This will cause an infinite consumption loop (until whatever condition that put the record back is no longer true)
What you're asking for seems to be a "dead letter queue", which is implemented with a separate topic, not the original one. As for how to handle that in Spark, you'd probably have to maintain some Try object (Scala), or another boolean-like flag, that records whether a particular event was processed successfully, and then filter on it before writing to any downstream systems.
For example, a DataFrame with the following schema, where you would filter on the third column and drop it before producing:
topic - string
value - bytes
hasError - boolean
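A minimal Scala sketch of that filter-then-produce step, assuming a streaming DataFrame named events with exactly those columns and placeholder Kafka settings:

```scala
import org.apache.spark.sql.functions.col

val toProduce = events
  .filter(!col("hasError"))    // keep only successfully processed records
  .drop("hasError")            // the Kafka sink only needs topic/value (and optionally key)

toProduce.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")      // placeholder
  .option("checkpointLocation", "/tmp/checkpoints/out")  // placeholder path
  .start()
```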
No, that is not possible given how Kafka offsets work.

How to prevent Spark from keeping old data leading to out of memory in Spark Structured Streaming

I'm using Structured Streaming in Spark, but I'm struggling to understand what data is kept in memory. I'm currently running Spark 2.4.7, whose Structured Streaming Programming Guide says:
The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended.
I understand this to mean that Spark appends all incoming data to an unbounded table that never gets truncated, i.e. it keeps growing indefinitely.
I understand the concept and why it is useful: for example, when I want to aggregate based on event time I can use withWatermark to tell Spark which column is the event time, specify how late I'm willing to accept data, and let Spark drop everything older than that.
However, let's say I want to aggregate on something that is not event time. I have a use case where each message in Kafka contains an array of datapoints. So I use explode_outer to create multiple rows for each message, and for these rows (within the same message) I would like to aggregate based on message-id (getting max, min, avg, etc.). So my question is: will Spark keep all the "old" data, since that is how Structured Streaming works, which will lead to OOM issues? And is the only way to prevent this to add a "fictional" withWatermark on, for example, the time I received the message, and include it in my groupBy as well?
And for the other use case, where I don't even want to do a groupBy and simply want to apply some transformation to each message and pass it along, I only care about the current "batch". Will Spark in that case also keep all old messages, forcing me to add a "fictional" withWatermark along with a groupBy (including message-id in the groupBy and taking, for example, the max of all columns)?
I know I can move to the good old DStreams to eliminate my issue and simply handle each message separately, but then I lose all the good things about Structured Streaming.
Yes, watermarking is necessary to bound the result table, and you need to include the event-time column in the groupBy.
https://spark.apache.org/docs/2.3.2/structured-streaming-programming-guide.html#handling-late-data-and-watermarking
Any reason why you want to avoid that?
Watermarking is "strictly" required only if you have an aggregation or a join, to avoid late events being missed by the aggregation/join (and affecting the output). It is not needed for events that are just transformed and passed along, since late events have no effect on that output; but if you want very late events to be dropped, you may still want to add watermarking. Some links to refer to:
https://medium.com/@ivan9miller/spark-streaming-joins-and-watermarks-2cf4f60e276b
https://blog.clairvoyantsoft.com/watermarking-in-spark-structured-streaming-a1cf94a517ba
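As a rough Scala sketch of that suggestion (a watermark on the receive time, included in the groupBy via a window), assuming a parsed streaming DataFrame parsedDf with hypothetical columns receivedAt (e.g. the Kafka timestamp), messageId, and an array column datapoints:

```scala
import org.apache.spark.sql.functions._

val exploded = parsedDf
  .withColumn("datapoint", explode_outer(col("datapoints")))

val aggregated = exploded
  .withWatermark("receivedAt", "10 minutes")   // lets Spark drop state for old messages
  .groupBy(window(col("receivedAt"), "1 minute"), col("messageId"))
  .agg(
    min("datapoint").as("minVal"),
    max("datapoint").as("maxVal"),
    avg("datapoint").as("avgVal"))
```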

Spark streaming with Kafka - createDirectStream vs createStream

We have been using spark streaming with kafka for a while and until now we were using the createStream method from KafkaUtils.
We just started exploring the createDirectStream and like it for two reasons:
1) Better/easier "exactly once" semantics
2) Better correlation of Kafka topic partitions to RDD partitions
I did notice that createDirectStream is marked as experimental. The question I have is (sorry if this is not very specific):
Should we explore the createDirectStream method if exactly-once is very important to us? It would be awesome if you could share your experience with it. Are we running the risk of having to deal with other issues, such as reliability?
There is a great, extensive blog post by the creator of the direct approach (Cody) here.
In general, reading the Kafka delivery semantics section, the last part says:
So effectively Kafka guarantees at-least-once delivery by default and
allows the user to implement at most once delivery by disabling
retries on the producer and committing its offset prior to processing
a batch of messages. Exactly-once delivery requires co-operation with
the destination storage system but Kafka provides the offset which
makes implementing this straight-forward.
This basically means "we give you at least once out of the box, if you want exactly once, that's on you". Further, the blog post talks about the guarantee of "exactly once" semantics you get from Spark with both approaches (direct and receiver based, emphasis mine):
Second, understand that Spark does not guarantee exactly-once
semantics for output actions. When the Spark streaming guide talks
about exactly-once, it’s only referring to a given item in an RDD
being included in a calculated value once, in a purely functional
sense. Any side-effecting output operations (i.e. anything you do in
foreachRDD to save the result) may be repeated, because any stage of
the process might fail and be retried.
Also, this is what the Spark documentation says about receiver based processing:
The first approach (Receiver based) uses Kafka’s high level API to store consumed
offsets in Zookeeper. This is traditionally the way to consume data
from Kafka. While this approach (in combination with write ahead logs)
can ensure zero data loss (i.e. at-least once semantics), there is a
small chance some records may get consumed twice under some failures.
This basically means that if you're using the receiver-based stream with Spark, you may still have duplicated data if the output transformation fails; it is at-least-once.
In my project I use the direct stream approach, where the delivery semantics depend on how you handle them. This means that if you want to ensure exactly-once semantics, you can store the offsets along with the data in a transaction-like fashion: if one fails, the other fails as well.
I recommend reading the blog post (link above) and the Delivery Semantics in the Kafka documentation page. To conclude, I definitely recommend you look into the direct stream approach.
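A rough sketch of that offsets-with-data pattern, assuming the kafka-0-10 direct stream (package names differ for older integrations) and a hypothetical writeInTransaction function that persists the results and the offset range atomically in your database:

```scala
import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { records =>
    // The offset range this partition is responsible for.
    val range: OffsetRange = offsetRanges(TaskContext.get.partitionId)
    val results = records.map(_.value).toList      // replace with real processing
    writeInTransaction(results, range)             // data + offsets committed together
  }
}
```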

Using Spark to process requests

I would like to understand if the following would be a correct use case for Spark.
Requests to an application are received either on a message queue, or in a file which contains a batch of requests. For the message queue, there are currently about 100 requests per second, although this could increase. Some files just contain a few requests, but more often there are hundreds or even many thousands.
Processing for each request includes filtering of requests, validation, looking up reference data, and calculations. Some calculations reference a Rules engine. Once these are completed, a new message is sent to a downstream system.
We would like to use Spark to distribute the processing across multiple nodes to gain scalability, resilience and performance.
I am envisaging that it would work like this:
Load a batch of requests into Spark as an RDD (requests received on the message queue might use Spark Streaming).
Separate Scala functions would be written for filtering, validation, reference data lookup and data calculation.
The first function would be passed to the RDD, and would return a new RDD.
The next function would then be run against the RDD output by the previous function.
Once all functions have completed, a for comprehension (or foreach) would be run against the final RDD to send each modified request to a downstream system (see the sketch after the answer below).
Does the above sound correct, or would this not be the right way to use Spark?
Thanks
We have done something similar on a small IoT project. We tested receiving and processing around 50K MQTT messages per second on 3 nodes and it was a breeze. Our processing included parsing each JSON message, some manipulation of the resulting object, and saving all the records to a time-series database.
We set the batch interval to 1 second; the processing time was around 300 ms and memory usage was on the order of hundreds of KB.
A few concerns with streaming: make sure your downstream system is asynchronous so you won't run into memory issues. It's true that Spark supports backpressure, but you will need to make it happen. Another thing: try to keep state to a minimum. More specifically, you should not keep any state that grows linearly as your input grows. This is extremely important for your system's scalability.
What impressed me the most is how easily you can scale with Spark: with each node we added, we grew linearly in the number of messages we could handle.
I hope this helps a little.
Good luck
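For reference, a minimal Scala sketch of the chained-transformation pipeline described in the question; the Request type, all of the functions, and the SparkContext value sc are hypothetical placeholders:

```scala
import org.apache.spark.rdd.RDD

case class Request(id: String, payload: String, valid: Boolean = true)

def keep(r: Request): Boolean      = r.payload.nonEmpty   // filtering rule (placeholder)
def validate(r: Request): Request  = r.copy(valid = r.id.nonEmpty)
def enrich(r: Request): Request    = r                    // reference data lookup (placeholder)
def calculate(r: Request): Request = r                    // rules engine call (placeholder)

val requests: RDD[Request] =
  sc.parallelize(Seq(Request("1", "a"), Request("2", "b")))

val processed = requests
  .filter(keep)
  .map(validate)
  .filter(_.valid)
  .map(enrich)
  .map(calculate)

// Finally, send each modified request to the downstream system.
processed.foreach(r => println(s"send downstream: $r"))   // replace println with a real sender
```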

storm - handling exceptions in bolt.execute

I'm using Storm to process a stream in which one of the bolts writes to Cassandra. The Cassandra session.execute() command can throw an exception, and I'm wondering about trapping this to 'fail' the tuple so it gets retried.
The docs for IRichBolt don't show it throwing anything, so I'm wondering how the exception cases are handled.
Primary question: should I wrap the Cassandra call in a try/catch, or will Storm manage this case for me?
Multipart answer:
1) Definitely surround your code with a try-catch block.
2) How to handle failure depends on the Storm topology and kind of failure:
If the exception indicates that an immediate retry might work, then you could loop some small, finite number of times until the attempt does work or you run out of tries.
If the tuple that you're executing is the tuple that was emitted by the spout, then your bolt can fail the tuple. That will force a retry by Storm (that is to say, the fail() method on the spout will be called and you can code the retry).
If there's already been at least one side effect as a result of processing this tuple and you don't want to repeat that side effect as a result of retrying the tuple, then you need to get a little more creative. Your Cassandra bolt can emit the failed tuple onto a failed-tuple stream where it can be persisted somewhere (HBase, file system, Kafka) until you're ready to try again. To try again you can add another Spout to your topology that reads from that store of failed tuples and emits them on a stream back to the Cassandra bolt for the retry. That gives you a way to continuously loop your retries with an extended time between retries. If along with the failed tuple you also persist/log the Cassandra exception, you can browse/monitor the log to see if there are any issues your admin should know.
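As a rough Scala sketch of the first two options (try/catch around the write, then either fail the tuple to force a spout-level retry or emit it to a failed-tuple stream), assuming a Storm 1.x-style API (package names differ in older versions); writeToCassandra is a hypothetical wrapper around session.execute():

```scala
import java.util.{Map => JMap}
import org.apache.storm.task.{OutputCollector, TopologyContext}
import org.apache.storm.topology.OutputFieldsDeclarer
import org.apache.storm.topology.base.BaseRichBolt
import org.apache.storm.tuple.{Fields, Tuple, Values}

class CassandraWriterBolt extends BaseRichBolt {
  private var collector: OutputCollector = _

  override def prepare(conf: JMap[_, _], context: TopologyContext,
                       collector: OutputCollector): Unit = {
    this.collector = collector
  }

  override def execute(tuple: Tuple): Unit = {
    try {
      writeToCassandra(tuple.getStringByField("event"))
      collector.ack(tuple)
    } catch {
      case e: Exception =>
        // Option A: fail the tuple so the spout's fail() is called and the tuple is retried.
        collector.fail(tuple)
        // Option B (instead of failing): emit to a "failed" stream for later replay,
        // then ack the original tuple so the spout does not replay it immediately:
        // collector.emit("failed", tuple,
        //   new Values(tuple.getStringByField("event"), e.getMessage))
        // collector.ack(tuple)
    }
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declareStream("failed", new Fields("event", "error"))

  // Hypothetical helper that calls session.execute(...) on the Cassandra driver.
  private def writeToCassandra(event: String): Unit = ???
}
```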
