How to handle Spark write errors? - apache-spark

This is a question about Spark error handling, specifically about handling errors on writes to the target data storage.
The Situation
I'm writing into a non-transactional data store that (in my case) does not support idempotent inserts, and I want to implement error handling for write failures so that data is not inserted multiple times.
So the scenario I'd like to handle is:
the dataframe / DAG was created
everything executed, the data was read successfully and persisted within the Spark job (it can be in memory)
while writing to the target, the write throws an exception / fails midway / the target is unavailable
In this scenario, Spark would retry the write without being able to roll back (due to the nature of the custom target data store), and thus potentially duplicate the data.
The Question.
What is the proper approach in Spark to handle such cases?
Option 1.
Is there a way to add an exception handler at the task level? For a specific task?
Option 2.
I could set max retries to 1 so that the whole app would fail and cleanup could be done externally, but I would like to do better than that :)
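For reference, a rough sketch of what Option 2 would look like as configuration (spark.task.maxFailures and spark.yarn.maxAppAttempts are standard settings; the latter only applies when running on YARN):

import org.apache.spark.sql.SparkSession

// Fail fast: no task-level retries and no application-level re-attempts.
val spark = SparkSession.builder()
  .config("spark.task.maxFailures", "1")
  .config("spark.yarn.maxAppAttempts", "1")
  .getOrCreate()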
Option 3.
As an alternative, we could add an extra column to the dataframe, one that would be computed at runtime and be unique across retries (so we could, again, clean everything up externally later). The question then is: what would be the way to compute a column literal at runtime of the Spark job (and not during DAG creation)?
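For illustration, a hypothetical sketch of Option 3: a non-deterministic UDF evaluated on the executors at write time (not during DAG creation), tagging each row with its partition and task attempt number so retried writes can be told apart (df and the column name write_attempt are placeholders from the scenario above):

import org.apache.spark.TaskContext
import org.apache.spark.sql.functions.udf

// TaskContext is only available inside a running task, i.e. the tag is computed on the
// executors when the write actually runs; attemptNumber() increases on every task retry.
val attemptTag = udf { () =>
  val tc = TaskContext.get()
  s"${tc.partitionId()}-${tc.attemptNumber()}"
}.asNondeterministic() // keep Spark from folding it into a constant at planning time

val tagged = df.withColumn("write_attempt", attemptTag())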
So...
Given the question, what options are there?
If it's any of the three proposed, how can it be implemented?
Would appreciate very much any help on this matter!
Thanks in advance...

I would implement the error handling at the driver level, but this comes at the cost of additional queries (which you would need anyway). Basically you need to check whether:
your (new) data contains duplicates
your target table/datastore already contains this data
import org.apache.spark.sql.DataFrame
import spark.implicits._ // for the $"..." column syntax

val df: DataFrame = ???
df.cache() // if that fits into memory

val uniqueKeys = Seq("ID")

// a key that occurs more than once means the new data itself contains duplicates
val noDuplicates = df.groupBy(uniqueKeys.head, uniqueKeys.tail: _*)
  .count()
  .where($"count" > 1)
  .isEmpty

// the left-semi join keeps exactly the target rows whose keys also appear in df
val notAlreadyExistsInTarget = spark.table(<targettable>)
  .join(df, uniqueKeys, "leftsemi")
  .isEmpty

if (noDuplicates && notAlreadyExistsInTarget) {
  df.write // persist to your datastore
} else {
  throw new Exception("df contains duplicates / already exists")
}

Related

sparkContext.wholeTextFiles(location) does not throw exception for absent path until action is performed

I am reading data organized in hourly paths in S3 through Spark. For example,
sparkContext.wholeTextFiles("s3://'Bucket'/'key'/'yyyy'/'MM'/'dd'/'hh'/*")
The above method returns an RDD of (key, value) pairs, where the key is the file name and the value is its content.
Issue
sparkContext.wholeTextFiles("location").values returns an RDD that does not throw an exception for a non-existent 'location' in S3 until an action is performed on it.
Current code to check whether the given location is present or not:
import scala.util.{Failure, Success, Try}

val data = sparkSession.sparkContext
  .wholeTextFiles(location)
  .values

Try {
  data.isEmpty()
} match {
  case Success(_) => {}
  case Failure(_) => {}
}
Return value of data even if the location is not present: MapPartitionsRDD[2]
Return value after performing the isEmpty() action on data:
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist:
Question
I am using a kind of hack: performing an action, isEmpty() (any other action would do), on the data RDD so that it produces a Failure when the location is not present. Without this check, the same exception is thrown later, when data is actually used, because of lazy evaluation.
Is this the right approach to check whether the location is present before reading the data, given that an action has to be performed on the RDD?
Generally speaking, basing decisions on exceptions is not the best strategy.
You can use Hadoop's FileSystem API to check paths before you start the processing through Spark (see the sketch below).
Spark processes things lazily. That's the expected behavior and the reason why you only get the error when an action is performed.
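A minimal sketch of such a check, assuming the location and sparkSession values from the question (globStatus is used because the path contains a wildcard):

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Check the S3 prefix up front with Hadoop's FileSystem API instead of waiting for an action to fail.
val fs = FileSystem.get(URI.create(location), sparkSession.sparkContext.hadoopConfiguration)
val pathExists = Option(fs.globStatus(new Path(location))).exists(_.nonEmpty)

if (pathExists) {
  val data = sparkSession.sparkContext.wholeTextFiles(location).values
  // ... proceed with processing
} else {
  // handle the missing location without relying on a thrown exception
}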

Can I set es.batch.write.retry.count to zero value

I just want to stop the Spark job if any exception occurs while writing data to ES.
There is a configuration, es.batch.write.retry.count, whose default value is 3.
Is it valid to set es.batch.write.retry.count = 0 so that, if something breaks, writing the Spark data frame to ES stops right there, as my requirement demands?
The es.batch.write.retry.count setting only controls how many times each batch is retried against Elasticsearch before giving up and moving on to the next batch; it does not influence your Spark job.
The workaround you can use is to set spark.task.maxFailures=1, but it will affect your entire job and not only the write to Elasticsearch.
You should note that because writing to Elasticsearch isn't transactional, a failed write task does not mean that none of your data was written; some of it may already be in Elasticsearch.
I don't know what your use case is here, but if you want to make sure that all of your data is written into Elasticsearch, you should issue a _count query after the write and check that it equals df.count() (assuming you are writing to a new index).
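A hedged sketch of that check, reading the document count back through the elasticsearch-hadoop data source instead of a raw _count request (the index name my-new-index is made up):

// Compare the number of rows written with the number of documents visible in the new index.
// Note: Elasticsearch may need an index refresh before all freshly written documents are visible.
val written = df.count()
val inEs = spark.read
  .format("org.elasticsearch.spark.sql")
  .load("my-new-index")   // hypothetical index name
  .count()

if (written != inEs) {
  throw new IllegalStateException(s"expected $written documents in ES, found $inEs")
}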

Spark reuse broadcast DF

I would like to reuse my DataFrame (without falling back to doing this with a "map" function over an RDD/Dataset), which I mark as broadcastable, but Spark seems to keep broadcasting it again and again.
Given a test table "bank", I perform the following:
import org.apache.spark.sql.functions.broadcast

val cachedDf = spark.sql("select * from bank").cache
cachedDf.count

val dfBroadcasted = broadcast(cachedDf)
val dfNormal = spark.sql("select * from bank")

dfNormal.join(dfBroadcasted, List("age"))
  .join(dfBroadcasted, List("age"))
  .count
I'm caching beforehand just in case it makes a difference, but it's the same with or without.
If I execute the above code, I see the following SQL plan:
As you can see, my broadcasted DF gets broadcast TWICE, with different timings as well (if I add more actions afterwards, they broadcast again too).
I care about this, because I actually have a long-running program which has a "big" DataFrame which I can use to filter out HUGE DataFrames, and I would like that "big" DataFrame to be reused.
Is there a way to force reusability? (Not only within the same action, but between actions; I could live with reuse within the same action, though.)
Thanks!
OK, updating the question.
Summarising:
Inside the same action, left-semi joins will reuse broadcasts, while normal/left joins won't. I'm not sure whether that's because Spark already knows the columns of that DF won't affect the output at all, so it can reuse it, or whether it's just an optimization Spark is missing.
My problem seems mostly solved, although it would be great if someone knew how to keep the broadcast across actions.
If I use left_semi (which is the join I'm going to use in my real app), the broadcast is only performed once.
With:
dfNormalxx.join(dfBroadcasted, Seq("age"), "left_semi")
  .join(dfBroadcasted, Seq("age"), "left_semi")
  .count
The plan becomes (I also changed the size so it matches my real one, but this made no difference):
Also, the total wall time is much better than with the plain join (I set 1 executor so it doesn't get parallelized; I just wanted to check whether the job was really being done twice):
Even though my collect takes 10 seconds, this will speed up the table reads + groupBys, which are currently taking around 6-7 minutes.

Why is there different number of elements in Spark DataFrames before and after writing to a new Cassandra table?

In my code I read data from an existing Cassandra table into a Spark DataFrame and transform it to build a set of new tables with the reverse mappings of the original data (the end goal is to serve the search queries that come via the REST API).
Recently I have added some tracing and discovered a thing I cannot explain.
Below is a piece of Scala code to illustrate the matter.
// df: org.apache.spark.sql.DataFrame
//
// control point 1: before writing the data to Cassandra
val inputCount = df.count
// write data to new C* table
df.createCassandraTable(keyspaceName, tableName, <otherArgs>)
df.write.mode("append").cassandraFormat(tableName, keyspaceName).save()
// read data back
val readbackDf = sqlContext.read.cassandraFormat(tableName, keyspaceName).load().cache
// control point 2: data written to C* table
val outputCount = readbackDf.count
// Produces different numbers
println(s"Input count = ${inputCount}; output count = ${outputCount}")
If I calculate .count of the dataframe before I write the data to the newly created table, it differs from the .count of the dataframe I get by reading back from this new table.
Therefore, I've got 2 questions:
Why do I observe different values for inputCount and outputCount?
If I use the wrong way to calculate outputCount in the code above, what would be the correct approach?
The problem was indeed related to Cassandra consistency settings.
Many thanks to Anurag, who pointed it out.
It turned out that in my testing environment I used the default for both the read and write consistency levels, which is LOCAL_ONE. That easily explains the divergence.
I ended up setting them both to LOCAL_QUORUM:
spark.cassandra.input.consistency.level=LOCAL_QUORUM
spark.cassandra.output.consistency.level=LOCAL_QUORUM
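For reference, a sketch of where these settings can live when using the spark-cassandra-connector (they can also be passed per read/write via .option(...)):

import org.apache.spark.sql.SparkSession

// Both consistency levels set once on the session configuration.
val spark = SparkSession.builder()
  .appName("cassandra-consistency")
  .config("spark.cassandra.input.consistency.level", "LOCAL_QUORUM")
  .config("spark.cassandra.output.consistency.level", "LOCAL_QUORUM")
  .getOrCreate()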
Having said that, I'd like to point out that I also tried setting only reads to LOCAL_QUORUM
spark.cassandra.input.consistency.level=LOCAL_QUORUM
spark.cassandra.output.consistency.level=LOCAL_ONE
which almost eliminated the divergence.
Yet I was still able to observe a small divergence with these settings from time to time (roughly one in 3-4 runs) with some of my ETL jobs.
I don't see significant performance degradation with both read and write consistency set to LOCAL_QUORUM, so the issue no longer blocks me, but I'm still curious why setting only reads to LOCAL_QUORUM doesn't fully cure the problem.
Could anyone suggest a "for dummies" explanation of this?

Synchronization between Spark RDD partitions

Say that I have an RDD with 3 partitions and I want to run each executor/ worker in a sequence, such that, after partition 1 has been computed, then partition 2 can be computed, and after 2 is computed, finally, partition 3 can be computed. The reason I need this synchronization is because each partition has a dependency on some computation of a previous partition. Correct me if I'm wrong, but this type of synchronization does not appear to be well suited for the Spark framework.
I have pondered opening a JDBC connection in each worker task node as illustrated below:
rdd.foreachPartition( partition => {
  // 1. open jdbc connection
  // 2. poll database for the completion of dependent partition
  // 3. read dependent edge case value from computed dependent partition
  // 4. compute this partition
  // 5. write this edge case result to database
  // 6. close connection
})
I have even pondered using accumulators, picking the acc value up in the driver, and then re-broadcasting a value so the appropriate worker can start computation, but apparently broadcasting doesn't work like this, i.e., once you have shipped the broadcast variable through foreachPartition, you cannot re-broadcast a different value.
Synchronization is not really the issue. The problem is that you want to use a concurrency layer to achieve this, and as a result you get completely sequential execution. Not to mention that pushing changes to the database just to fetch them back on another worker means you get none of the benefits of in-memory processing. In its current form it doesn't make sense to use Spark at all.
Generally speaking, if you want to achieve synchronization in Spark, you should think in terms of transformations. Your question is rather sketchy, but you can try something like this:
1. Create the first RDD with data from the first partition. Process it in parallel and optionally push the results outside.
2. Compute the differential buffer.
3. Create the second RDD with data from the second partition. Merge it with the differential buffer from step 2, process it, and optionally push the results to the database.
4. Go back to step 2 and repeat.
What do you gain here? First of all, you can utilize your whole cluster. Moreover, partial results are kept in memory and don't have to be transferred back and forth between the workers and the database.
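As a rough, self-contained illustration of this loop (made-up data: the "partitions" are three batches of numbers and the differential buffer is simply a running offset that each batch depends on; each batch is still processed in parallel by Spark, only the small buffer hand-off happens on the driver):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sequential-batches").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val batches = Seq(Seq(1, 2, 3), Seq(10, 20, 30), Seq(100, 200, 300))
var offset = 0L // the differential buffer carried from one step to the next

for (batch <- batches) {
  val currentOffset = offset                            // capture a stable copy for the closure
  val rdd = sc.parallelize(batch)
  val adjusted = rdd.map(_ + currentOffset).cache()     // merge the buffer into this step's data
  adjusted.collect().foreach(println)                   // stand-in for "push results to the database"
  offset = adjusted.sum().toLong                        // compute the buffer for the next step
}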
