I'm using Storm to process a stream wherein one of the bolts writes to Cassandra. The Cassandra session.execute() call can throw an exception, and I'm wondering about trapping this so I can 'fail' the tuple and have it retried.
The docs for IRichBolt don't show it throwing anything, so I'm wondering how exception cases are handled.
Primary question: should I wrap the Cassandra call in a try/catch, or will Storm manage this case for me?
Multipart answer:
1) Definitely surround your code with a try/catch block (a minimal sketch follows this list).
2) How to handle failure depends on the Storm topology and kind of failure:
If the exception indicates that an immediate retry might work, then you could loop some small, finite number of times until the attempt does work or you run out of tries.
If the tuple that you're executing is the tuple that was emitted by the spout, then your bolt can fail the tuple. That forces a retry by Storm (that is to say, the fail() method on the spout will be called and you can code the retry there).
If there's already been at least one side effect as a result of processing this tuple and you don't want to repeat that side effect when the tuple is retried, then you need to get a little more creative. Your Cassandra bolt can emit the failed tuple onto a failed-tuple stream where it can be persisted somewhere (HBase, the file system, Kafka) until you're ready to try again. To try again you can add another spout to your topology that reads from that store of failed tuples and emits them on a stream back to the Cassandra bolt for the retry. That gives you a way to continuously loop your retries with an extended time between retries. If, along with the failed tuple, you also persist/log the Cassandra exception, you can browse/monitor the log to see if there are any issues your admin should know about.
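Here is a minimal sketch of points 1 and 2, assuming a Java bolt that writes with the DataStax driver. The class name, retry count, and the buildInsert() helper are illustrative only, and the package names assume a recent Storm release (older versions use backtype.storm instead of org.apache.storm):

import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;

public class CassandraWriterBolt extends BaseRichBolt {
    private OutputCollector collector;
    private transient Session session;   // build the Cluster/Session in prepare()

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // this.session = Cluster.builder()...build().connect("my_keyspace");
    }

    @Override
    public void execute(Tuple input) {
        Statement insert = buildInsert(input);            // hypothetical helper
        for (int attempt = 1; attempt <= 3; attempt++) {  // small, finite retry loop (point 2)
            try {
                session.execute(insert);
                collector.ack(input);                     // success: the spout will not replay it
                return;
            } catch (Exception e) {
                // fall through and retry; after the last attempt we fail the tuple below so the
                // spout's fail() is called and the tuple is replayed (or emit it onto a dedicated
                // failed-tuple stream instead, as described in point 3)
            }
        }
        collector.fail(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // no output fields for a terminal writer bolt
    }

    private Statement buildInsert(Tuple input) {
        throw new UnsupportedOperationException("build the INSERT for your schema here");
    }
}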
We have a ReliableMessageListener that synchronizes some data structures it holds across the cluster via its onMessage implementation.
The cluster is composed of three nodes. We noticed that one of the topics got out of sync and its listener was terminated due to message loss detected by the ring buffer: we get a "Terminating MessageListener, ... Reason: Underlying ring buffer data related to reliable topic is lost" exception. The node is still up, but this specific listener no longer receives events/messages from the other two nodes, while they do receive them from it.
We end up with a de facto split-brain for this specific topic.
Our message listener is configured with isLossTolerant = false and isTerminal = false.
I am trying to understand what is considered to be a good strategy for handling such a scenario and recovering from it.
For example, is it good practice to try to subscribe to the topic again?
Is it good practice to send a message telling the other nodes in the cluster to clear their data? Will they even get that message after the ring buffer got out of sync?
Thanks
The message Reason: Underlying ring buffer data related to the reliable topic is lost means that the data you are trying to read is not available anymore because it was overwritten by newer data in the underlying Ringbuffer - your producer is likely faster than your consumers.
When such a situation occurs, the ReliableTopic is still usable and you can register a new listener.
To prevent the situation from occurring, you can either increase the size of the underlying Ringbuffer (provide a Ringbuffer config with the same name as the reliable topic) or configure a TopicOverloadPolicy. See the documentation for details.
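A minimal sketch of both options, assuming Hazelcast 3.x-style imports; the topic name "my-topic" and the capacity value are illustrative only:

import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.ITopic;
import com.hazelcast.topic.TopicOverloadPolicy;

public class ReliableTopicRecovery {
    public static void main(String[] args) {
        Config config = new Config();

        // The Ringbuffer backing the reliable topic shares its name with the topic.
        config.getRingbufferConfig("my-topic")
              .setCapacity(100_000);              // default capacity is much smaller

        // Optionally control what producers do when the ringbuffer is full.
        config.getReliableTopicConfig("my-topic")
              .setTopicOverloadPolicy(TopicOverloadPolicy.BLOCK);

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);

        ITopic<String> topic = hz.getReliableTopic("my-topic");
        // After the old listener was terminated, simply register a new one.
        topic.addMessageListener(message -> System.out.println(message.getMessageObject()));
    }
}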
I am using Spark 2.3 Structured Streaming to read messages from Kafka and write them to Postgres (using the Scala programming language). My application is supposed to be long-lived, and it should be able to handle any failure without human intervention.
I have been looking for ways to catch unexpected errors in Structured Streaming, and I found this example here:
Spark Structured Streaming exception handling
This way it is possible to catch all errors thrown in the stream, but the problem is that when the application retries, it gets stuck on the same exception again.
Is there a way in Structured Streaming to handle the error and tell Spark to increment the offset in the "checkpointLocation" programmatically, so that it proceeds to consume the next message without getting stuck?
In the streaming/event-processing world this is known as handling a "poison pill".
Please have a look at the following link:
https://www.waitingforcode.com/apache-spark-structured-streaming/corrupted-records-poison-pill-records-apache-spark-structured-streaming/read
It suggests several ways to handle this type of scenario:
Strategy 1: let it crash
The streaming application will log the poison pill message and stop processing. That's not a big deal, because thanks to the checkpointed offsets we'll be able to reprocess the data and handle it accordingly, maybe with a try/catch block.
However, as you already saw in your question, it's not a good practice in streaming systems, because the consumer stops and during that idle period it accumulates lag (the producer keeps generating data).
Strategy 2: ignore errors
If you don't want downtime for your consumer, you can simply skip the corrupted events. In Structured Streaming this boils down to filtering out null records (and any records that cause an error) and, optionally, logging the unparseable messages for further investigation.
Strategy 3: Dead Letter Queue
We ignore the errors, but instead of only logging the bad records, we dispatch them to another data store (the dead letter queue).
Strategy 4: sentinel value
You can use a pattern called Sentinel Value, and it can be freely combined with a Dead Letter Queue.
A sentinel value is a unique value returned whenever processing a record fails.
So in your case, whenever a record cannot be converted to the structure we're processing, you emit a common placeholder object instead.
For code samples, look inside the link; a minimal sketch of Strategy 2 also follows below.
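Your question uses Scala, but the Dataset API is the same; here is a minimal Java sketch of Strategy 2, assuming JSON payloads, a hypothetical schema, and illustrative Kafka settings:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;

public class SkipPoisonPills {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("skip-poison-pills").getOrCreate();

        // Schema of the expected JSON payload (hypothetical).
        StructType schema = new StructType()
                .add("id", DataTypes.LongType)
                .add("payload", DataTypes.StringType);

        Dataset<Row> raw = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")  // illustrative
                .option("subscribe", "events")                        // illustrative topic
                .load();

        // from_json returns null for records that cannot be parsed, so filtering out
        // the nulls implements Strategy 2 (unparseable rows could instead be routed
        // to a dead letter sink, as in Strategy 3).
        Dataset<Row> parsed = raw
                .select(from_json(col("value").cast("string"), schema).alias("data"))
                .filter(col("data").isNotNull())
                .select("data.*");

        // parsed.writeStream() ... add your Postgres sink here.
    }
}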
I have the following code written in PySpark, which does a series of transformations.
import pyspark.sql.functions as F

# Explode the nested UseData arrays, explode the point array, and keep only
# the 'jrnyCount' rows that have at least one coordinate.
df2=df1.select('*',F.explode(F.col('UseData')).alias('UseData1'))\
.select('*',F.explode(F.col('UseData1')).alias('UseData2'))\
.drop('UseData','UseData1','value')\
.select('*',F.explode(F.col('point'))).drop('point')\
.withColumn('label',F.col('UseData2.label')).filter(F.col('label')=='jrnyCount')\
.withColumn('value',F.col('UseData2.value'))\
.withColumn('datetime',F.col('UseData2.datetime'))\
.withColumn('latitude',F.col('col.latitude')).withColumn('longitude',F.col('col.longitude'))\
.drop('col','UseData2')\
.where(F.col('latitude').isNotNull() | F.col('longitude').isNotNull())
Is there any way to catch an exception that occurs due to bad data in the input dataframe df1? Since the job is executed by multiple executors across different nodes, how can I make sure that if there is an error in any of the above lines, the code will not fail and instead ignores the bad data? Any help would be highly appreciated.
Exception handling is to be done with Python's exception-handling mechanisms.
But I think what you want is logic to ignore outlier/junk data; that should be done as part of pre-processing, for example by writing a UDF that filters or fixes records based on conditions.
If there are errors or warnings when your job is executed, you will see them in the YARN logs, so you don't have to handle them specifically at the node level.
I have one node with replication factor 1 and fire a batch statement query on that node. Cassandra writes the data but fails to acknowledge it within the timeout limit, and then gives a write timeout exception with the following stack trace:
Exception in thread "main" com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency ONE (1 replica were required but only 0 acknowledged the write)
at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:54)
at com.datastax.driver.core.ResultSetFuture.extractCauseFromExecutionException(ResultSetFuture.java:271)
at com.datastax.driver.core.ResultSetFuture.getUninterruptibly(ResultSetFuture.java:187)
at com.datastax.driver.core.Session.execute(Session.java:126)
at jason.Stats.analyseLogMessages(Stats.java:91)
at jason.Stats.main(Stats.java:48)
Then if you go back and check the table, you will find that the data has been written. So my question is: if Cassandra gives a write timeout exception, shouldn't it roll back the changes?
I mean, I don't want the write to hit the database if I am getting a write timeout exception; is there any rollback strategy for that particular scenario?
Based on your description, what you are expecting is that Cassandra supports ACID-compliant transactions, at least with regard to the A, atomicity. Cassandra does not provide ACID-compliant transactions; instead it relies on eventual consistency to provide a durable data store. Cassandra does provide atomicity in as much as a single partition on a node is atomic, by which I mean an entire row will either be written or not. However, a write can still succeed on one or more replicas after the timeout set by your client has expired. In that case the client receives an error but the data is still written. There is nothing that will roll back that write. Instead the data in the cluster will become consistent through the normal repair mechanisms.
My suggestion for you would be to:
In the case of a timeout, retry the write query (a sketch follows this list).
Investigate why you are getting a timeout error on a write with CL=ONE. If this is a multi-DC setup, have you tried CL=LOCAL_ONE?
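A minimal retry sketch using the DataStax Java driver (the same driver as in your stack trace); the contact point, keyspace/table, and retry count are illustrative only, and blindly retrying is only safe here because the INSERT shown is idempotent:

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

public class RetryingWriter {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        BatchStatement batch = new BatchStatement();
        batch.add(new SimpleStatement(
                "INSERT INTO ks.stats (id, value) VALUES (1, 42)"));   // illustrative statement

        int maxAttempts = 3;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                session.execute(batch);
                break;                                   // acknowledged, we're done
            } catch (WriteTimeoutException e) {
                // The write may still have been applied on the server side; because the
                // statement is idempotent, simply retrying is safe.
                if (attempt == maxAttempts) {
                    throw e;                             // give up and surface the error
                }
            }
        }
        cluster.close();
    }
}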
Some docs to read:
https://docs.datastax.com/en/cassandra/2.1/cassandra/dml/dml_atomicity_c.html
https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesReadRepair.html
Cassandra does not have any notion of rollbacks. If a write times out, that means the write may or may not have succeeded. This is why C* tries to focus users on idempotent data models and structures.
The only means of actually performing some kind of conditional write is via lightweight transactions (LWT), which allow for check-and-set operations.
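A minimal sketch of a lightweight transaction with the Java driver; the table and values are illustrative only:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class LightweightTransactionExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // A lightweight transaction: the row is only written if it does not exist yet.
        ResultSet rs = session.execute(
                "INSERT INTO ks.stats (id, value) VALUES (1, 42) IF NOT EXISTS");

        // wasApplied() tells you whether the conditional write actually took effect.
        if (!rs.wasApplied()) {
            System.out.println("Row already existed; nothing was written");
        }
        cluster.close();
    }
}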
When some of the Cassandra servers in the cluster are down, phpcassa takes a long time to respond.
Logically, phpcassa should connect to the running nodes and get the data instead of trying to connect to nodes that are down.
Does anybody have any idea how phpcassa works?
What is its behavior when nodes are down?
Check the make_conn function here and the last few lines of the ConnectionPool constructor.
So, first phpcassa shuffles your server list randomly.
Then it tries to connect to every server in your list, cycling through it twice, but only if the connection queue length is still zero after the first cycle [make_conn];
otherwise it returns the moment it gets the first successful connection.
Also remember that the make_conn function is not called from the constructor; it is called only when there is a need for it. The source code is very simple, so you can go through it to get a better sense of it.
Check this code to see how connection failure is handled, and this to learn the reasons a connection can fail.