How to get a Spark batch job to do a global commit on ADLS Gen2? - azure

I have a Spark batch application writing to ADLS Gen2 (hierarchical namespace).
When designing the application I assumed Spark would perform a global commit once the job completes, but what it actually does is commit per task: as soon as a task finishes writing, its output is moved from the temp location to the target storage.
So if the batch fails we are left with partial data, and on retry we get duplicates. Our scale is really huge, so rolling back (deleting data) is not an option for us; finding the partial data to delete would take a lot of time.
Is there any "built-in" solution, something we can use out of the box?
Right now we are considering writing to some temp destination and moving the files only after the whole job has completed, but we would like to find a more elegant solution (if one exists).

This is a known issue. Apache Iceberg, Hudi and Delta Lake are among the possible solutions.
Alternatively, instead of writing the output directly to the "official" location, write it to a staging directory instead. Once the job is done, rename the staging directory to the official location, as in the sketch below.
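For illustration, a minimal PySpark sketch of that staging-then-rename approach, assuming an ADLS Gen2 account with hierarchical namespace (where a directory rename is a single atomic metadata operation). The account, container and paths are hypothetical, and the rename goes through Spark's JVM gateway to the Hadoop FileSystem API:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for the DataFrame the batch job actually produces.
df = spark.range(10)

# Hypothetical staging and final locations on ADLS Gen2.
staging = "abfss://data@myaccount.dfs.core.windows.net/_staging/run-2024-01-01"
final = "abfss://data@myaccount.dfs.core.windows.net/official/run-2024-01-01"

# 1. Write the whole job's output to the staging directory first.
df.write.mode("overwrite").parquet(staging)

# 2. Only after the job has succeeded, promote staging to the final location
#    with a single directory rename via the Hadoop FileSystem API.
jvm = spark._jvm
hadoop_conf = spark._jsc.hadoopConfiguration()
Path = jvm.org.apache.hadoop.fs.Path
fs = Path(final).getFileSystem(hadoop_conf)
fs.mkdirs(Path(final).getParent())  # make sure the parent directory exists
if not fs.rename(Path(staging), Path(final)):
    raise RuntimeError("rename failed; the destination may already exist")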

Related

pyspark partitioning creates an extra empty file for every partition

I am facing a problem in Azure Databricks. In my notebook I am executing a simple write command with partitioning:
df.write.format('parquet').partitionBy("startYear").save(output_path, header=True)
And I see something like this:
Can someone explain why Spark is creating these additional empty files for every partition and how to disable it?
I tried different write modes, different partitioning, and different Spark versions.
I reproduced the above and got the same results when I used Blob Storage.
Can someone explain why Spark is creating these additional empty files for every partition and how to disable it?
Spark won't create these files itself. Blob Storage creates blobs like the ones above when parquet files are written by partition.
You cannot avoid them if you use Blob Storage. You can avoid them by using ADLS storage (see the sketch below).
These are my results with ADLS:
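For reference, a minimal sketch of the same partitioned write pointed at an ADLS Gen2 (abfss://) path rather than Blob Storage; the account, container and columns are hypothetical placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Small stand-in DataFrame with the partition column from the question.
df = spark.createDataFrame(
    [(2020, "a"), (2021, "b"), (2021, "c")],
    ["startYear", "value"],
)

# Hypothetical ADLS Gen2 path; one sub-directory is created per startYear value,
# and with a hierarchical namespace no placeholder blobs appear next to them.
output_path = "abfss://data@myaccount.dfs.core.windows.net/output"
df.write.format("parquet").mode("overwrite").partitionBy("startYear").save(output_path)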

Duplicate records as a result of Spark Task/Executor Failure

I have a scenario regarding a Spark job and want to understand its behavior.
Scenario:
I am reading data using the JDBC connector and writing it to HDFS.
As per my understanding, there won't be any data duplication or loss in HDFS even if an executor/task fails and the same SQL query is re-executed on the RDBMS. Please correct me if I am wrong.
With an eventually consistent target like S3, how will it behave? Is there any concern with using it as a target?
With a strongly consistent target like a GCS bucket, how will it behave?
Thanks in advance
S3 is fully consistent now, but awfully slow on rename, and rename doesn't reliably fail if the destination exists. You need to use a custom S3 committer (the S3A committers, the EMR optimized committer) for Spark/MapReduce. Consult the Hadoop and EMR docs for details.
Google GCS is consistent, but it doesn't do atomic renames, so the v1 commit protocol, which relies on atomic directory rename, isn't safe. v2 isn't safe anywhere.
The next version of Apache Hadoop adds a new "intermediate manifest committer" for GCS and ABFS performance and correctness (see the configuration sketch below).
Finally, Iceberg is fast and safe everywhere.
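As a rough illustration of the committer advice above, here is a hedged PySpark configuration sketch. The S3A committer settings come from the spark-hadoop-cloud/Hadoop cloud-integration documentation, and the last option (the manifest committer factory for abfs) only exists in Hadoop releases that ship the intermediate manifest committer, so treat both as assumptions to verify against your Hadoop and Spark versions:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Route Spark's commit protocol through the cloud-aware committers
    # (requires the spark-hadoop-cloud module on the classpath).
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    # Pick an S3A committer ("magic" or "directory") instead of rename-based commits.
    .config("spark.hadoop.fs.s3a.committer.name", "magic")
    # On newer Hadoop, bind abfs output to the intermediate manifest committer.
    .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.abfs",
            "org.apache.hadoop.fs.azurebfs.commit.AzureManifestCommitterFactory")
    .getOrCreate()
)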

Can Spark ignore a task failure due to an account data issue and continue the job for other accounts?

I want Spark to ignore some tasks that fail due to data issues. Also, I want Spark not to stop the whole job because of some insert failures.
If you are using Databricks, you can handle bad records and files as explained in this article:
https://docs.databricks.com/spark/latest/spark-sql/handling-bad-records.html
From the documentation:
Databricks provides a unified interface for handling bad records and files without interrupting Spark jobs. You can obtain the exception records/files and reasons from the exception logs by setting the data source option badRecordsPath. badRecordsPath specifies a path to store exception files for recording the information about bad records for CSV and JSON sources and bad files for all the file-based built-in sources (for example, Parquet).
You can also use a data cleansing library like Pandas, Optimus, sparkling.data, vanilla Spark, dora, etc. That will give you insight into the bad data and let you fix it before running analysis on it.
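A minimal sketch of the badRecordsPath option quoted above (Databricks-specific; the paths are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read
    .format("csv")
    .option("header", "true")
    # Bad rows/files are written here as exception files instead of failing the job.
    .option("badRecordsPath", "/mnt/logs/badRecords")
    .load("/mnt/raw/accounts/*.csv")
)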

Avoiding re-processing of data during Spark Structured Streaming application updates

I am using Structured Streaming with Spark 2.2. We are using Kafka as our source and are using checkpoints for failure recovery and end-to-end exactly-once guarantees. I would like to get some more information on how to handle updates to the application when there is a change in stateful operations and/or output schema.
As some of the sources suggest, I can start the updated application in parallel with the old application until it catches up with the old application in terms of data, and then kill the old one. But then the new application will have to re-read/re-process all the data in Kafka, which could take a long time.
I want to avoid this re-processing of the data in the newly deployed updated application.
One way I can think of is for the application to keep writing the offsets to something in addition to the checkpoint directory, for example ZooKeeper/HDFS. Then, on an update of the application, I tell the Kafka readStream() to start reading from the offsets stored in this new location (ZooKeeper/HDFS), since the updated application can't read from the checkpoint directory, which is now deemed incompatible (sketched below).
So a couple of questions:
Is the above-stated solution a valid solution?
If yes, how can I automate the detection of whether the application is being restarted because of a failure/maintenance or because of code changes to stateful operations and/or output schema?
Any guidance, example or information source is appreciated.
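A sketch of the offset-bookkeeping idea described in the question, not a tested recipe: persist the last processed offsets per topic-partition somewhere external (HDFS/ZooKeeper/a file), and on a redeploy with a fresh checkpoint directory pass them to Kafka's startingOffsets option as JSON. The helper, file path, brokers and topic below are hypothetical; startingOffsets is only honoured when the query starts without an existing checkpoint.

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical helper: read the offsets the old application last committed,
# e.g. {"my-topic": {"0": 123456, "1": 123789}}.
def load_saved_offsets(path):
    with open(path) as f:
        return json.load(f)

starting_offsets = json.dumps(load_saved_offsets("/data/offsets/my-app.json"))

stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("subscribe", "my-topic")                   # placeholder
    .option("startingOffsets", starting_offsets)
    .load()
)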

What happens when HDInsight sources data from Azure DocumentDB

I have a Hadoop job running on HDInsight that sources data from Azure DocumentDB. This job runs once a day, and as new data comes into DocumentDB every day, my Hadoop job filters out old records and only processes the new ones (this is done by storing a timestamp somewhere). However, if new records come in while the Hadoop job is running, I don't know what happens to them. Are they fed to the running job or not? How does the throttling mechanism in DocumentDB play a role here?
If new records come in while the Hadoop job is running, I don't know what happens to them. Are they fed to the running job or not?
The answer depends on what phase or step the Hadoop job is in. Data gets pulled once at the beginning. Documents added while the data is being pulled will be included in the Hadoop job results. Documents added after the data has finished being pulled will not be included in the Hadoop job results.
Note: ORDER BY _ts is needed for consistent behavior, as the Hadoop job simply follows the continuation token when paging through query results.
"How does the throttling mechanism in DocumentDB play a role here?"
The DocumentDB Hadoop connector will automatically retry when throttled.
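The question concerns the DocumentDB Hadoop connector, but the timestamp-watermark idea (only pull documents newer than the stored timestamp, ordered by _ts) can be sketched with the azure-cosmos Python SDK instead; the account, database, container and stored timestamp below are hypothetical:

from azure.cosmos import CosmosClient

client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("mydb").get_container_client("events")

last_run_ts = 1700000000  # epoch seconds stored by the previous daily run (hypothetical)

# ORDER BY c._ts keeps the paging deterministic while new documents keep arriving.
new_docs = container.query_items(
    query="SELECT * FROM c WHERE c._ts > @last ORDER BY c._ts",
    parameters=[{"name": "@last", "value": last_run_ts}],
    enable_cross_partition_query=True,
)

for doc in new_docs:
    print(doc["id"])  # stand-in for the real processing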
