How to handle exceptions in pyspark? - apache-spark

I have the following code written in pyspark, which basically performs a certain transformation.
from pyspark.sql import functions as F

df2 = (
    df1
    # flatten the nested 'UseData' array twice
    .select('*', F.explode(F.col('UseData')).alias('UseData1'))
    .select('*', F.explode(F.col('UseData1')).alias('UseData2'))
    .drop('UseData', 'UseData1', 'value')
    # explode the 'point' array (explode without an alias yields a column named 'col')
    .select('*', F.explode(F.col('point'))).drop('point')
    # pull out the struct fields and keep only 'jrnyCount' rows
    .withColumn('label', F.col('UseData2.label')).filter(F.col('label') == 'jrnyCount')
    .withColumn('value', F.col('UseData2.value'))
    .withColumn('datetime', F.col('UseData2.datetime'))
    .withColumn('latitude', F.col('col.latitude')).withColumn('longitude', F.col('col.longitude'))
    .drop('col', 'UseData2')
    .where(F.col('latitude').isNotNull() | F.col('longitude').isNotNull())
)
Is there any way to catch an exception if one occurs due to bad data in the input dataframe df1? Since the job is executed on multiple executors across different nodes, how can I make sure that if there is an error in any of the above lines, the code does not fail and simply ignores the bad data? Any help would be highly appreciated.

Exception handling has to be done with Python's exception handling methods.
But I think what you actually want is logic to ignore outlier/junk data, and that should be done as part of pre-processing: for example, write a UDF or a filter to drop or fix records based on conditions (a minimal example is sketched below).
If there are errors or warnings when your job is executed, you will find them in the YARN logs, so you don't have to handle them specifically at the node level.
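For instance, a hedged sketch of such pre-processing (the column names UseData and point come from the question; what counts as "bad data" here is an assumption and should be adapted to the real schema):

from pyspark.sql import functions as F

# Keep only rows whose nested fields can actually be exploded; the null/size
# checks below are assumptions about what "bad data" means in this dataset.
clean_df1 = (
    df1
    .where(F.col('UseData').isNotNull() & (F.size('UseData') > 0))
    .where(F.col('point').isNotNull() & (F.size('point') > 0))
)

# The rejected rows can be kept aside for inspection instead of failing the job.
bad_rows = df1.subtract(clean_df1)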

Related

How to handle bad messages in spark structured streaming

I am using Spark 2.3 Structured Streaming to read messages from Kafka and write them to Postgres (using the Scala programming language). My application is supposed to be long-lived and should be able to handle any kind of failure without human intervention.
I have been looking for ways to catch unexpected errors in Structured Streaming, and I found this example here:
Spark Structured Streaming exception handling
This way it is possible to catch all errors thrown in the stream, but the problem is that when the application retries, it gets stuck on the same exception again.
Is there a way in Structured Streaming to handle the error and tell Spark to increment the offset in the "checkpointLocation" programmatically, so that it proceeds to consume the next message without getting stuck?
In the streaming event processing world this is known as handling a "poison pill".
Please have a look at the following link:
https://www.waitingforcode.com/apache-spark-structured-streaming/corrupted-records-poison-pill-records-apache-spark-structured-streaming/read
It suggests several ways to handle this type of scenario.
Strategy 1: let it crash
The streaming application will log the poison pill message and stop processing. That's not a big deal, because thanks to the checkpointed offsets we'll be able to reprocess the data and handle it accordingly, maybe with a try-catch block.
However, as you already saw in your question, it's not good practice in streaming systems, because the consumer stops and accumulates lag during that idle period (the producer keeps generating data).
Strategy 2: ignore errors
If you don't want downtime for your consumer, you can simply skip the corrupted events. In Structured Streaming this boils down to filtering out null records and, optionally, logging the unparseable messages (or the records that cause errors) for further investigation.
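The question uses Scala, but the same pattern in PySpark looks roughly like this (the topic name, bootstrap servers and schema are made up for illustration):

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Hypothetical message schema, just to illustrate the pattern.
schema = StructType([
    StructField("id", LongType()),
    StructField("payload", StringType()),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host:9092")
    .option("subscribe", "events")
    .load()
)

# from_json yields null when a message cannot be parsed, so corrupted
# records can simply be filtered out instead of crashing the query.
parsed = raw.select(
    F.col("value").cast("string").alias("raw_value"),
    F.from_json(F.col("value").cast("string"), schema).alias("data"),
)
valid = parsed.where(F.col("data").isNotNull()).select("data.*")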
Strategy 3: Dead Letter Queue
We still ignore the errors, but instead of just logging them we dispatch the offending records into another data store.
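Continuing the Strategy 2 sketch above, a dead letter queue variant might look like this (the output and checkpoint paths are placeholders):

# Route unparseable messages to a separate "dead letter" sink instead of
# dropping them, keeping the raw payload for later inspection.
corrupted = parsed.where(F.col("data").isNull()).select("raw_value")

dead_letter_query = (
    corrupted.writeStream
    .format("parquet")
    .option("path", "/data/dead_letter/events")
    .option("checkpointLocation", "/checkpoints/dead_letter/events")
    .start()
)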
Strategy 4: sentinel value
You can use a pattern called Sentinel Value, and it can be freely combined with a Dead Letter Queue.
A sentinel value is a single, well-known value that is returned every time something goes wrong.
So in your case, whenever a record cannot be converted to the structure we're processing, you emit that common placeholder object instead.
For the code samples, look inside the link.
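As a small hedged illustration of the idea, reusing the parsed stream from the Strategy 2 sketch above (the SENTINEL marker is made up):

from pyspark.sql import functions as F

# Replace fields of unparseable records with a marker value so downstream
# code can recognise (and count or route) them instead of failing.
SENTINEL = "__CORRUPTED__"

with_sentinel = parsed.select(
    F.coalesce(F.col("data.payload"), F.lit(SENTINEL)).alias("payload"),
    F.col("raw_value"),
)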

How to handle Spark write errors?

This is a question about Spark error handling.
Specifically -- handling errors on writes to the target data storage.
The Situation
I'm writing into a non-transactional data store that (in my case) does not support idempotent inserts, and I want to implement error handling for write failures to avoid inserting data multiple times.
So the scenario I'd like to handle is:
created dataframe / DAG
all executed, read data successfully, persisted within spark job (can be in memory)
writing to the target — but that throws an exception / fails midway / is unavailable
In this scenario, Spark would retry the write — without the ability to roll back (due to the nature of the custom target data store) — and thus potentially duplicate data.
The Question.
What is the proper approach in Spark to handle such cases?
Option 1.
Is there a way to add an exception handler at the task level? For a specific task?
Option 2.
We could set max retries to 1 so that the whole app fails and cleanup can be done externally, but I would like to do better than that :)
Option 3.
As an alternative — we could add an extra column to the dataframe, one that would be computed at runtime and be unique across retries (so we could, again, clean it all up externally later). The question is then — what would be the way to compute a column literal at runtime of the Spark job (and not during DAG creation)?
So...
Given the question -- what options are there?
If it's any of the three proposed -- how can it be implemented?
Would appreciate very much any help on this matter!
Thanks in advance...
I would implement the error handling at the driver level, but this comes at the cost of additional queries (which you would need anyway). Basically you need to
check whether your (new) data contains duplicates
check whether your target table/datastore already contains this data
import org.apache.spark.sql.DataFrame
import spark.implicits._

val df: DataFrame = ???
df.cache() // if that fits into memory
val uniqueKeys = Seq("ID")

// any key occurring more than once means the new data itself contains duplicates
val noDuplicates = df.groupBy(uniqueKeys.head, uniqueKeys.tail: _*).count().where($"count" > 1).isEmpty

// a left-semi join keeps only the rows that already exist in the target
val notAlreadyExistsInTarget = spark.table(<targettable>).join(df, uniqueKeys, "leftsemi").isEmpty

if (noDuplicates && notAlreadyExistsInTarget) {
  df.write // persist to your datastore
} else {
  throw new Exception("df contains duplicates / already exists")
}

Pyspark swaps filter and python_udf in optimized plan leading to slowdown and errors

I'm having an issue with the Spark version I'm using.
I'm trying to read some parquet files, apply a filter, and then run a python udf.
from pyspark.sql.functions import col

spark.read.parquet(...).where(
    col('col_name_a').isNotNull()
).withColumn(
    'col_b_transform', costly_python_udf(col('col_name_b'))
)
With just those two steps, the code runs fast (in the sense that I'm acting only on the partial subset of data selected by the col_a != null condition), as I expected. However, if I take the dataframe and add more operations to it, processing slows down. It almost feels like the python udf operation and the null check got swapped.
I looked at the SQL section in the ApplicationMaster, and in the 'Optimized Logical Plan' I can see that the Filter step and BatchEvalPython got swapped.
In the same project this swap also bit me where I was checking for a null string before loading JSON: the JSON load failed on the null values because, again, the filter and the python udf were reordered.
I'm currently managing the issue by inserting a cache() step after the filter and before the python udf, which seems to force Filter and BatchEvalPython to run in the right sequence (see the sketch at the end of this question).
A few questions:
Why does the Filter get pushed after BatchEvalPython? I can't think of any scenario where that would be an optimization; if anything, it should be the other way around.
Is this a known issue? Is it fixed in later versions?
Is there a better way than cache() to stop pyspark from optimizing across the cache call?
spark.version = u'2.1.0'
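For reference, a minimal sketch of the cache() workaround described above (the udf and column names come from the snippet earlier; the input path is a placeholder):

from pyspark.sql.functions import col

# Placeholder input path; the real read is shown in the snippet above.
filtered = spark.read.parquet("/path/to/input").where(col('col_name_a').isNotNull())

# The question reports that caching the filtered result keeps the optimizer
# from pushing the Filter below BatchEvalPython.
filtered = filtered.cache()

result = filtered.withColumn('col_b_transform', costly_python_udf(col('col_name_b')))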

Abort RDD map (all mappers) on condition

I have a huge file to process; it is loaded into an RDD and I run some validations on its lines using a map function.
I have a set of errors that are fatal for the whole file if encountered on even one line. Thus, I would like to abort any other processing (all launched mappers across the whole cluster) as soon as that validation fails on a line, to save some time.
Is there a way to achieve this?
Thank you.
PS: Using Spark 1.6, Java API
Well, after further searching and an understanding of the laziness of Spark transformations, I just have to do something like:
rdd.filter(checkFatal).take(1)
Then, due to the laziness, processing stops by itself once one record matching that rule is found :)
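The same idea as a rough PySpark sketch (the question itself uses the Java API; the file path and the check_fatal predicate below are made up):

# Hypothetical fatal-error predicate; adapt it to the real validation rules.
def check_fatal(line):
    return "FATAL" in line

rdd = sc.textFile("hdfs:///data/huge_file.txt")

# take(1) only evaluates as many partitions as needed to find one match,
# so the scan stops early instead of validating the entire file.
fatal = rdd.filter(check_fatal).take(1)
if fatal:
    raise RuntimeError("Fatal validation error found: {}".format(fatal[0]))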

storm - handling exceptions in bolt.execute

I'm using Storm to process a stream in which one of the bolts writes to Cassandra. The Cassandra session.execute() command can throw an exception, and I'm wondering about trapping this to 'fail' the tuple so it gets retried.
The docs for IRichBolt don't show it throwing anything, so I'm wondering how the exception cases are handled.
Primary question: should I wrap the Cassandra call in a try/catch, or will Storm manage this case for me?
Multipart answer:
1) Definitely surround your code with a try-catch block.
2) How to handle failure depends on the Storm topology and kind of failure:
If the exception indicates that an immediate retry might work, then you could loop some small, finite number of times until the attempt does work or you run out of tries.
If the tuple that you're executing is the tuple that was emitted by the spout, then your bolt can fail the tuple. That will force a retry by Storm (that is to say, the fail() method on the spout will be called and you can code the retry there).
If there's already been at least one side effect as a result of processing this tuple and you don't want to repeat that side effect when the tuple is retried, then you need to get a little more creative. Your Cassandra bolt can emit the failed tuple onto a failed-tuple stream, where it can be persisted somewhere (HBase, file system, Kafka) until you're ready to try again. To try again, you can add another spout to your topology that reads from that store of failed tuples and emits them on a stream back to the Cassandra bolt for the retry. That gives you a way to continuously loop your retries with an extended time between attempts. If, along with the failed tuple, you also persist/log the Cassandra exception, you can browse/monitor the log to see whether there are any issues your admin should know about.
