I just want to stop the Spark job if any exception occurs while writing data to ES.
There is a configuration es.batch.write.retry.count whose default value is 3.
Is it valid to set es.batch.write.retry.count = 0 so that, if something breaks, writing the Spark data frame to ES stops right there, as per my requirement?
The es.batch.write.retry.count configuration only controls how many times each batch is retried against Elasticsearch before giving up and moving on to the next batch; it does not influence your Spark job.
The workaround you can use is to set spark.task.maxFailures=1, but it will affect your entire job and not only the writes to Elasticsearch.
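For instance, a minimal sketch of setting it when building the session (the app name is a placeholder; the property must be set before the SparkContext is created):

import org.apache.spark.sql.SparkSession

// fail the whole job on the first task failure instead of retrying the task
val spark = SparkSession.builder()
  .appName("es-writer")
  .config("spark.task.maxFailures", "1")
  .getOrCreate()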
You should also note that because writing to Elasticsearch isn't transactional, if one of the write tasks fails, some of your data may already have been written to Elasticsearch.
I don't know what your use case is here, but if you want to make sure that all of your data was written into Elasticsearch, you should run a _count query and check that it equals df.count() after the writing (assuming you are writing to a new index).
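For example, a minimal sketch of an equivalent check from Spark itself using the elasticsearch-spark (elasticsearch-hadoop) connector; the index name is a placeholder and it assumes nothing else writes to that index:

val expected = df.count()
df.write.format("org.elasticsearch.spark.sql").save("my-index")

// read the freshly written index back and compare document counts
val written = spark.read.format("org.elasticsearch.spark.sql").load("my-index").count()
if (written != expected) {
  throw new RuntimeException(s"expected $expected documents in my-index, found $written")
}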
How does Spark Streaming keep track of the files which have been processed?
Question 1: Let's take a scenario: Spark has processed today's files (a.csv, b.csv, c.csv), and after 3 days a new file (d.csv) arrives. How does Spark know it has to process only d.csv? What is the underlying mechanism followed here?
Question 2: As a user, I want to know whether the files have really been processed. How can I check?
How does Spark Streaming keep track of the files which have been processed?
The class responsible for this is FileStreamSource. Here you will find answers to the next two questions.
How does Spark know it has to process only d.csv? What is the underlying mechanism followed here?
A CompactibleFileStreamLog is used to maintain a mapping of seen files, based on the timestamp when each was last modified. From these entries an ever-increasing offset is created (ref. FileStreamSourceOffset). This offset is checkpointed across runs, much like for other streaming sources such as Kafka.
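As an illustration, a minimal sketch of a file source query with a checkpoint location (paths and schema are placeholders); the log of seen files lives under the checkpoint directory, which is why only new files such as d.csv are picked up after a restart:

import org.apache.spark.sql.types._

val schema = new StructType().add("id", IntegerType).add("value", StringType)

val files = spark.readStream
  .schema(schema)
  .csv("s3://my-bucket/input/")   // the monitored directory

files.writeStream
  .format("parquet")
  .option("path", "s3://my-bucket/output/")
  .option("checkpointLocation", "s3://my-bucket/checkpoints/")  // seen-files log is kept here
  .start()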
whether the files have really been processed, how can I check?
From the code I can see that you can check this via DEBUG logs:
batchFiles.foreach { file =>
  seenFiles.add(file._1, file._2)
  logDebug(s"New file: $file")
}
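If it helps, a sketch of turning those logs on via log4j.properties (assuming the default log4j setup and that FileStreamSource still lives in this package in your Spark version):

log4j.logger.org.apache.spark.sql.execution.streaming.FileStreamSource=DEBUG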
Another place you may check is the checkpoint data but since it contains serialized offset info I doubt you will get any details from there.
My data scenario is as follows:
Reading data into a dataframe from a database over JDBC using PySpark.
I make a count() call both to see the number of records and to "know" when the data load is ready. I am doing this to understand a potential bottleneck.
Write to a file in S3 (in the same region).
So, my objective is to know exactly when all the database/table data has been loaded, so I can infer whether there is a problem either reading or writing data when the job gets slow.
In my first attempts, I could get the record count very quickly (after 2 minutes of the job running), but my guess is that calling count() does not mean that the data is all loaded (in memory).
When you call count(), nothing has been loaded yet; count() is an action that triggers the data processing.
If you have a simple logical plan like this:

spark.read(..)
  .map(..)
  .filter(..)
  ...
  .count()

the database will be read as soon as you call an action (in this example, count).
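For example, a minimal sketch of that behaviour with a JDBC source (connection details are placeholders); nothing is fetched from the database until count() runs:

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")  // placeholder connection details
  .option("dbtable", "public.my_table")
  .option("user", "user")
  .option("password", "password")
  .load()            // lazy: only the schema is fetched here

val n = df.count()   // action: this is when the table is actually read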
This is a question about Spark error handling.
Specifically -- handling errors on writes to the target data storage.
The Situation
I'm writing into a non-transactional data storage that does not (in my case) support idempotent inserts — and want to implement error handling for write failures — to avoid inserting data multiple times.
So the scenario I'd like to handle is:
created dataframe / DAG
all executed, read data successfully, persisted within spark job (can be in memory)
writing to the target — but that throws an exception / fails midway / is unavailable
In this scenario, Spark would retry the write — without the ability to roll back (due to the nature of the custom target data store) — and thus potentially duplicate data.
The Question.
What is the proper approach in Spark to handle such cases?
Option 1.
Is there a way to add an exception handler at the task level? For a specific task?
Option 2.
I could set max retries to 1 so that the whole app would fail and cleanup could be done externally, but I would like to do better than that :)
Option 3.
As an alternative, we could add an extra column to the dataframe, one that would be computed at runtime and be unique across retries (so we could, again, clean it all up externally later). The question is then: what would be the way to compute a column literal at runtime of the Spark job (and not during DAG creation)?
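For what it's worth, one hypothetical way to get a per-attempt value (rather than one baked in when the plan is built) is a non-deterministic UDF that reads the task attempt id; the column name and the approach are purely illustrative:

import org.apache.spark.TaskContext
import org.apache.spark.sql.functions.udf

// evaluated on the executors at write time, so a retried task attempt
// produces a different value than the original attempt
val attemptId = udf(() => TaskContext.get().taskAttemptId()).asNondeterministic()
val tagged = df.withColumn("write_attempt_id", attemptId())  // df is the dataframe being written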
So...
Given the question -- what options are there?
If it's any of the three proposed -- how can it be implemented?
Would appreciate very much any help on this matter!
Thanks in advance...
I would implement the error handling at the driver level, but this comes at the cost of additional queries (which you would need anyway). Basically you need to check that:
your (new) data does not contain duplicates
your target table/datastore does not already contain this data
import org.apache.spark.sql.DataFrame
import spark.implicits._  // for the $ column syntax

val df: DataFrame = ???
df.cache() // if that fits into memory

val uniqueKeys = Seq("ID")

// no key occurs more than once within the new data
val noDuplicates = df.groupBy(uniqueKeys.head, uniqueKeys.tail: _*).count().where($"count" > 1).isEmpty

// none of the keys are already present in the target table
val notAlreadyExistsInTarget = spark.table(<targettable>).join(df, uniqueKeys, "leftsemi").isEmpty

if (noDuplicates && notAlreadyExistsInTarget) {
  df.write // persist to your datastore
} else {
  throw new Exception("df contains duplicates / already exists")
}
I have an application written for Spark using the Scala language. My application code is more or less ready, and the job runs for around 10-15 minutes.
There is an additional requirement to provide the status of the application execution while the Spark job is running. I know that Spark runs lazily and that it is not nice to retrieve data back to the driver program during Spark execution. Typically, I would be interested in providing the status at regular intervals.
E.g. if there are 20 functional points configured in the Spark application, then I would like to provide the status of each of these functional points as and when they are executed / when their steps are over during Spark execution.
These incoming statuses of the functional points will then be fed into some custom user interface to display the status of the job.
Can someone give me some pointers on how this can be achieved?
There are a few things you can do on this front that I can think of.
If your job contains multiple actions, you can write a script to poll for the expected output of those actions. For example, imagine your job has 4 different DataFrame save calls. You could have your status script poll HDFS/S3 to see if the data has shown up in the expected output location yet. As another example, I have used Spark to index into Elasticsearch, and I have written status logging that polls how many records are in the index in order to print periodic progress.
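A small sketch of that kind of polling (it could run in a separate monitoring process; the paths are placeholders), checking for the _SUCCESS marker that a save leaves behind:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val hadoopConf = new Configuration()
val outputs = Seq("hdfs:///jobs/myjob/step1", "hdfs:///jobs/myjob/step2")

outputs.foreach { out =>
  // each successful DataFrame save writes a _SUCCESS marker into its output directory
  val marker = new Path(out, "_SUCCESS")
  val done = marker.getFileSystem(hadoopConf).exists(marker)
  println(s"$out finished: $done")
}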
Another thing I have tried before is to use accumulators to keep rough track of progress and of how much data has been written. This works OK, but it is a little arbitrary when Spark updates the visible totals with information from the executors, so I haven't found it to be too helpful for this purpose in general.
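For reference, a small sketch of that idea, assuming df is the DataFrame being processed; the driver can read the value between (or during) actions, with the caveat above about when executor updates become visible:

val recordsProcessed = spark.sparkContext.longAccumulator("recordsProcessed")

// bump the counter on the executors as rows are consumed;
// in a real job the write to the sink would happen in the same pass
df.rdd.foreach(_ => recordsProcessed.add(1))

// back on the driver: the (roughly) current total
println(s"records processed: ${recordsProcessed.value}")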
The other approach you could take is to poll Spark's status and metrics APIs directly. You will be able to pull all of the information backing the Spark UI into your code and do with it whatever you want. It won't necessarily tell you exactly where you are in your driver code, but if you manually work out how your driver maps to stages, you could figure that out. For reference, here is the documentation on polling the status API:
https://spark.apache.org/docs/latest/monitoring.html#rest-api
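If polling from inside the application is easier than hitting the REST endpoint, the SparkContext also exposes similar information programmatically through its status tracker; a minimal sketch:

val tracker = spark.sparkContext.statusTracker

// print progress for whichever stages are currently running
tracker.getActiveStageIds().foreach { stageId =>
  tracker.getStageInfo(stageId).foreach { info =>
    println(s"stage $stageId: ${info.numCompletedTasks()} of ${info.numTasks()} tasks done")
  }
}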
So I am asking if anyone knows a way to change Spark properties (e.g. spark.executor.memory, spark.shuffle.spill.compress, etc.) during runtime, so that a change may take effect between the tasks/stages of a job...
So I know that...
1) The documentation for Spark 2.0+ (and previous versions too) states that once the Spark context has been created, it can't be changed at runtime.
2) SparkSession.conf.set may change a few things for SQL (see the small example after this list), but I was looking for more general, all-encompassing configurations.
3) I could start a new context in the program with new properties, but the point here is to actually tune the properties once a job is already executing.
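To illustrate point 2, a small sketch of a per-session SQL setting that can be changed mid-application (the value is arbitrary):

// runtime-modifiable SQL setting: affects subsequent queries in this session
spark.conf.set("spark.sql.shuffle.partitions", "400")

// core resource settings such as spark.executor.memory, by contrast,
// are fixed once the application has started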
Ideas...
1) Would killing an executor force it to read a configuration file again, or does it just get what was already configured at the beginning of the job?
2) Is there any command to force a "refresh" of the properties in the Spark context?
So I am hoping there might be a way, or that there are other ideas out there (thanks in advance)...
After submitting a Spark application, we can change some parameter values at runtime and some we cannot.
By using the spark.conf.isModifiable() method, we can check whether a parameter value can be modified at runtime. If it returns true then we can modify the parameter value; otherwise, we can't modify the value at runtime.
Examples:
>>> spark.conf.isModifiable("spark.executor.memory")
False
>>> spark.conf.isModifiable("spark.sql.shuffle.partitions")
True
So based on the above testing, we can't modify the spark.executor.memory parameter value at runtime.
No, it is not possible to change settings like spark.executor.memory at runtime.
In addition, there are probably not too many great tricks in the direction of "quickly switching to a new context", as the strength of Spark is that it can pick up data and keep going. What you are essentially asking for is a map-reduce framework. Of course you could rewrite your job into this structure and divide the work across multiple Spark jobs, but then you would lose some of the ease and performance that Spark brings (though possibly not all of it).
If you really think the request makes sense on a conceptual level, you could consider making a feature request. This can be done through your Spark supplier, or directly by logging a Jira ticket on the Apache Spark project.