Databricks Spark job writing/updating _SUCCESS file twice on job completion

I'm using S3 event-based triggers to invoke Lambda functions. A Lambda function is triggered every time a _SUCCESS file is written at a specific location in S3. Data is written to the source location by Databricks Spark jobs. I have observed that once the job writes data to the source location, the Lambda function is triggered twice, consistently.
This behavior only occurs when the _SUCCESS file is written by a Databricks job. When I write the file from the CLI, the Lambda function is triggered just once.
It would be helpful to know the reason behind this behavior of Databricks jobs.
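For reference, this is roughly the shape of the handler on my side; the deduplication part is only a sketch of a possible workaround (the table name, handler, and process_success_marker function are made up for illustration, not part of my actual job):

# Hypothetical sketch (not from the actual setup): a Lambda handler that tolerates
# duplicate S3 notifications by recording each (bucket/key, sequencer) pair in
# DynamoDB with a conditional put, so a redelivery of the same event is ignored.
# Note: two distinct PUTs of _SUCCESS (e.g. create followed by update) still
# produce two different sequencers, so they would still trigger twice.
import boto3
from botocore.exceptions import ClientError

ddb = boto3.client("dynamodb")
SEEN_TABLE = "success-file-events"  # assumed table with partition key "event_id"

def handler(event, context):
    for record in event.get("Records", []):
        s3 = record["s3"]
        key = s3["object"]["key"]
        event_id = f'{s3["bucket"]["name"]}/{key}#{s3["object"].get("sequencer", "")}'
        try:
            ddb.put_item(
                TableName=SEEN_TABLE,
                Item={"event_id": {"S": event_id}},
                ConditionExpression="attribute_not_exists(event_id)",
            )
        except ClientError as e:
            if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
                continue  # duplicate delivery of the same event; skip it
            raise
        process_success_marker(key)  # downstream logic goes here

def process_success_marker(key):
    print(f"processing trigger for {key}")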

Related

Trigger workflow job with Databricks Autoloader

I have a requirement to monitor an S3 bucket for (zip) files being placed. As soon as a file lands in the S3 bucket, the pipeline should start processing it. Currently I have a Workflow job with multiple tasks that perform the processing. I have configured the S3 bucket file path as a job parameter and am able to trigger the pipeline. But I need to automate the monitoring through Autoloader. I have set up Databricks Autoloader in another notebook and managed to get the list of files arriving at the S3 path by querying the checkpoint:
checkpoint_query = "SELECT * FROM cloud_files_state('%s') ORDER BY create_time DESC LIMIT 1" % (checkpoint_path)
But I want to integrate this notebook with my job, and I have no clue how to wire it into the pipeline job. Some pointers would be much appreciated.
You need to create a workflow job, add the pipeline as the upstream task, and add your notebook as the downstream task. Currently there is no way to run custom notebooks within a DLT pipeline.
Check this for how to create a workflow: https://docs.databricks.com/workflows/jobs/jobs.html#job-create
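A minimal sketch of what that job definition could look like through the Jobs API 2.1 (the pipeline id, notebook path, cluster id, and task keys below are placeholders to adapt):

# Hypothetical Jobs API 2.1 payload: the DLT pipeline runs first, and the
# notebook task depends on it. Ids and paths are placeholders.
import json

job_spec = {
    "name": "autoloader-then-processing",
    "tasks": [
        {
            "task_key": "ingest_pipeline",
            "pipeline_task": {"pipeline_id": "<your-dlt-pipeline-id>"},
        },
        {
            "task_key": "post_process_notebook",
            "depends_on": [{"task_key": "ingest_pipeline"}],
            "notebook_task": {"notebook_path": "/Repos/me/post_process"},
            "existing_cluster_id": "<your-cluster-id>",
        },
    ],
}

print(json.dumps(job_spec, indent=2))  # POST this to /api/2.1/jobs/create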

Is s3-dist-cp faster for transferring data between S3 and EMR than using Spark and S3 directly?

I'm currently playing around with Spark on an EMR cluster. I noticed that if I perform the reads/writes into/out of my cluster in the Spark script itself, there is an absurd wait time for my output data to show up in the S3 console, even for relatively lightweight files. Would the write be expedited by writing to HDFS in my Spark script and then adding an additional step to transfer the output from HDFS to S3 using s3-dist-cp?
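To frame the comparison, the two-step approach I have in mind would look roughly like this (the bucket and paths are placeholders):

# Hypothetical sketch: write to HDFS from Spark, then copy to S3 with
# s3-dist-cp as a separate EMR step. Paths and bucket are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-then-s3").getOrCreate()
df = spark.read.parquet("s3://my-bucket/input/")          # placeholder input
df.write.mode("overwrite").parquet("hdfs:///tmp/job-output/")

# Then, as a separate EMR step (or from the master node):
#   s3-dist-cp --src hdfs:///tmp/job-output/ --dest s3://my-bucket/output/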

How to Spark batch job global commit on ADLS Gen2?

I have a Spark batch application writing to ADLS Gen2 (hierarchical namespace).
When designing the application I assumed Spark would perform a global commit once the job completes, but what it really does is commit per task: as soon as a task finishes writing, its output is moved from the temporary location to the target storage.
So if the batch fails we are left with partial data, and on retry we get duplicated data. Our scale is really huge, so rolling back (deleting the data) is not an option for us; finding the files to delete would take a lot of time.
Is there any "built-in" solution, something we can use out of the box?
Right now we are considering writing to some temporary destination and moving the files only after the whole job has completed, but we would like a more elegant solution (if one exists).
This is a known issue. Apache Iceberg, Hudi, and Delta Lake are among the possible solutions.
Alternatively, instead of writing the output directly to the "official" location, write it to a staging directory. Once the job is done, rename the staging directory to the official location.
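A rough sketch of that staging-then-rename idea, assuming an ABFS path on a storage account with hierarchical namespace enabled (the account, container, and paths are placeholders); the directory rename goes through the Hadoop FileSystem API on the JVM:

# Hypothetical sketch: write to a staging directory, then rename it to the final
# location only after the whole job has succeeded. On ADLS Gen2 with hierarchical
# namespace, the directory rename is a single metadata operation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("staged-commit").getOrCreate()

base = "abfss://container@account.dfs.core.windows.net/warehouse/my_table"  # placeholder
staging = base + "/_staging"
final = base + "/data"

df = spark.read.parquet("abfss://container@account.dfs.core.windows.net/input")  # placeholder
df.write.mode("overwrite").parquet(staging)

# Promote the staging directory only if the write above completed without error.
jvm = spark._jvm
conf = spark._jsc.hadoopConfiguration()
src = jvm.org.apache.hadoop.fs.Path(staging)
dst = jvm.org.apache.hadoop.fs.Path(final)
fs = src.getFileSystem(conf)
if fs.exists(dst):
    fs.delete(dst, True)  # replace any previous attempt
if not fs.rename(src, dst):
    raise RuntimeError("rename of staging directory failed")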

Spark RDD S3 saveAsTextFile taking long time

I have a Spark Streaming job on EMR which runs in 30-minute batches, processes the data, and finally writes the output to several different files in S3. The output step to S3 is now taking too long (about 30 minutes). On investigating further, I found that the majority of the time is spent after all tasks have written the data to the temporary folder (which happens within 20s); the rest is spent by the master node moving the files from the _temporary folder to the destination folder and renaming them, etc. (similar to: Spark: long delay between jobs).
Some other details on the job configuration, file format, etc. are below:
EMR version: emr-5.22.0
Hadoop version: Amazon 2.8.5
Applications: Hive 2.3.4, Spark 2.4.0, Ganglia 3.7.2
S3 files: written using the RDD saveAsTextFile API with an S3A URL; the S3 file format is text
Although the EMRFS output committer is enabled by default for the job, it is not being used, because we are writing RDDs as text files, which is only supported from EMR 6.4.0 onwards. One way I can think of to optimize the S3 save time is to upgrade the EMR version, convert the RDDs to DataFrames/Datasets, and use their APIs instead of saveAsTextFile. Is there any other, simpler solution to optimize the time taken by the job?
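Roughly, the DataFrame-based rewrite I have in mind would look like this (the input RDD and output path below are stand-ins for the real ones):

# Hypothetical sketch of the proposed rewrite: wrap the RDD of strings in a
# DataFrame and use the DataFrame writer instead of saveAsTextFile, so an
# S3-optimized committer can apply on a new enough release.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-to-df-write").getOrCreate()

lines_rdd = spark.sparkContext.parallelize(["line 1", "line 2"])  # stand-in for the real RDD
df = lines_rdd.map(lambda line: (line,)).toDF(["value"])
df.write.mode("overwrite").text("s3://my-bucket/output/")  # placeholder path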
Is there any other simpler solution possible to optimize the time taken for the job?
Unless you use an S3-specific committer, your jobs will not only be slow, they will be incorrect in the presence of failures. As this may matter to you, it is good that the slow job commits are providing an early warning of problems even before worker failures result in invalid output.
Options:
Upgrade. The committers were added for a reason.
Use a real cluster FS (e.g. HDFS) as the output, then upload afterwards.
The S3A zero-rename committers do work with saveAsTextFile, but they aren't supported by AWS, and the ASF developers don't test on EMR as it is Amazon's own fork. You might be able to get the s3a connector Amazon ships to work, but you'd be on your own if it didn't.
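For what it's worth, a minimal sketch of switching on an S3A committer via Spark conf, assuming a Hadoop 3.1+ s3a connector and the spark-hadoop-cloud module are on the classpath (the "directory" committer and the output path are choices for illustration, not the only option):

# Hypothetical sketch: enable the S3A "directory" staging committer so commits
# avoid the slow rename of everything under _temporary.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-committer")
    # Route s3a output through the S3A committer factory (covers RDD saveAsTextFile too).
    .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
            "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
    .config("spark.hadoop.fs.s3a.committer.name", "directory")
    # Needed for DataFrame/Dataset writes to bind to the committer:
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    .getOrCreate()
)

rdd = spark.sparkContext.parallelize(["a", "b", "c"])
rdd.saveAsTextFile("s3a://my-bucket/output/")  # placeholder path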

I am running Spark on a Google Cloud Dataproc cluster. Dataset write to GCS gets stuck with 1 pending task which never ends

I am running Spark on a Google Cloud Dataproc cluster. While writing a Dataset to a GCS bucket (Google Cloud Storage), it gets stuck on the last partition, which never finishes.
It shows 799/800 tasks completed, but the one pending task never ends.
This occurs mainly due to data skew.
Also, if you are doing joins, check whether the columns used for the join contain null values. The null values may effectively be causing a cross join over those rows.
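A quick way to sanity-check both suggestions (skewed keys and null join keys); the DataFrames and column names below are small stand-ins for the real inputs:

# Hypothetical sketch: inspect the join-key distribution for skew and drop
# null join keys before joining. Inputs here are tiny placeholder DataFrames.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-check").getOrCreate()
df_left = spark.createDataFrame([(1, "a"), (None, "b"), (1, "c")], ["join_key", "payload"])
df_right = spark.createDataFrame([(1, "x")], ["join_key", "extra"])

# How unbalanced are the join keys? A handful of very large counts points to skew.
df_left.groupBy("join_key").count().orderBy(F.desc("count")).show()

# Drop null join keys so they cannot pile into a single straggler partition.
joined = (
    df_left.filter(F.col("join_key").isNotNull())
    .join(df_right, on="join_key", how="inner")
)
joined.show()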
