Databricks: create a new job while maintaining history

Every time I created a new job, instead of overwriting the existing one it created another job with the same name.
The first thing that came to my mind was to delete all the jobs with a certain name and then create the job again. The problem with this is that I can't keep the deleted job's history (logs etc.) associated with the new job. Is there a way to overwrite the job while maintaining all the info associated with the previous job using the Databricks CLI?
databricks jobs list --output json | jq '.jobs[] | select(.settings.name == "test_job") | .job_id' | xargs -n 1 databricks jobs delete --job-id
databricks jobs create --json-file ./jobs_jsons/test_job.json

You need to use databricks jobs reset for that - it will overwrite the job definition without creating a new job (see the docs):
databricks jobs reset --job-id 246 --json-file reset-job.json
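For reference, here is a rough sketch of what reset-job.json might contain, assuming the legacy CLI takes the same Jobs API 2.0 job-settings payload as jobs create (some CLI versions instead expect the full {"job_id": ..., "new_settings": {...}} body, so check the docs for your version); all field values below are illustrative placeholders:
{
  "name": "test_job",
  "new_cluster": {
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2
  },
  "notebook_task": {
    "notebook_path": "/Users/someone@example.com/test_notebook"
  }
}
Because reset keeps the same job_id, the run history and logs stay attached to the job.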

Related

Databricks - wait for delta table change and run job

I am a data analyst, so I only have access to Databricks (creating notebooks and jobs).
I have a Delta table that is updated (new data merged in) every day; the pipeline is maintained by administrators. After this table is updated, I'd like to trigger a notebook or job.
Is that even possible with my level of access? I tried creating a change data feed, but then what?
I don't understand how to go from the change data feed to actually triggering something else.
My code snippet (based on different questions):
df = spark.readStream \
    .format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", "latest") \
    .table(tableName) \
    .filter("_change_type != 'update_preimage'")
After I query df, the stream reader reads data, but I want to run another notebook or job instead.
Secondly, I don't want to keep this readStream running all the time, because data is merged only once a day (around 7-9 AM).
How can I do this?
You can use tasks inside the same job to execute another notebook after your Delta table changes.
Here is sample code for a notebook named Delta_Notebook:
def isdeltaupdated():
    # Your code to check whether the Delta table has been updated.
    # If it has been updated, return True so the next notebook can run.
    # If not, raise an error below so the next notebook does not run.
    return False  # placeholder for the sample

if isdeltaupdated():
    print("ok")
else:
    # Raising an error so that the next task (notebook) won't run
    raise ValueError("Not updated")
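As a concrete (hypothetical) way to fill in isdeltaupdated(), you could look at the table's Delta history and treat a commit made today as "updated"; the table name below is just a placeholder:
import datetime
from delta.tables import DeltaTable

def isdeltaupdated(table_name="my_delta_table"):  # placeholder table name
    # Most recent commit in the Delta transaction history
    last = DeltaTable.forName(spark, table_name).history(1).collect()
    if not last:
        return False
    # Treat the table as "updated" if the last commit happened today
    return last[0]["timestamp"].date() == datetime.date.today()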
First, create a job for this notebook, open the job, and go to its Tasks tab.
You can see that a task has been created for Delta_Notebook.
Click the + icon and create another task for the next notebook, giving it the first task as its dependency.
Run this job and you can see that the second task does not execute when the first one raises the ValueError (Delta table not updated).
The second task will be executed if the first one finishes without an error (Delta table updated).
You can schedule this job once or twice a day at a particular time, and when the Delta table has been updated it will execute the other notebook.

Auto Loader with Merge Into for multiple tables

I am trying to implement Auto Loader with MERGE INTO on multiple tables, using the code below, as described in the documentation:
from delta.tables import DeltaTable

def upsert_data(df, epoch_id):
    deltaTable = DeltaTable.forPath(spark, target_location)
    deltaTable.alias("t").merge(df.alias("s"), "t.xx = s.xx and t.xx1 = s.xx1") \
        .whenMatchedUpdateAll() \
        .whenNotMatchedInsertAll() \
        .execute()

for i in range(len(list_of_files()[0])):
    schema = list_of_files()[2][i]
    raw_data = list_of_files()[1][i]
    checkpoint = list_of_files()[3][i]
    target_location = list_of_files()[4][i]

    dfSource = list_of_files(raw_data)
    dfMergedSchema = dfSource.where("1=0")
    dfMergedSchema.createOrReplaceGlobalTempView("test1")
    dfMergedSchema.write.option("mergeSchema", "true").mode("append").format("delta") \
        .save(target_location)

    stream = spark.readStream \
        .format("cloudFiles") \
        .option("cloudFiles.format", "parquet") \
        .option("header", "true") \
        .schema(schema) \
        .load(raw_data)

    stream.writeStream.format("delta") \
        .outputMode("append") \
        .foreachBatch(upsert_data) \
        .option("dataChange", "false") \
        .trigger(once=True) \
        .option("checkpointLocation", checkpoint) \
        .start()
My scenario:
We have a Landing Zone where Parquet files are appended into multiple folders for example as shown below:
Landing Zone ---|
                |-----folder 0 ---|----parquet1
                |                 |----parquet2
                |                 |----parquet3
                |
                |-----folder 1 ---|----parquet1
                                  |----parquet2
                                  |----parquet3
Then I need Auto Loader to create the tables with their checkpoints as shown below:
Staging Zone ---|
                |-----folder 0 ---|----checkpoint
                |                 |----table
                |
                |-----folder 1 ---|----checkpoint
                                  |----table
I notice that without the foreachBatch option in the writeStream, but with trigger once, the code works as expected and inserts into the multiple tables shown above. The code also works when both foreachBatch and the trigger are used on individual tables without the for loop. However, when I enable both options (foreachBatch and trigger once) for multiple tables inside the for loop, Auto Loader merges all the table contents into one table: folder 0 in the Staging Zone gets a checkpoint but no table contents, while folder 1 gets a checkpoint plus the Delta files for the contents of both folder 0 and folder 1 in its table folder. It merges both tables into one.
I also get the ConcurrentAppendException.
I read about the ConcurrentAppendException in the documentation, and what I found is that you either use partitioning or a disjoint condition in the upsert_data function passed into the foreachBatch option of the writeStream. I tried both and neither works.
How can one isolate the streams for the different folders in this scenario for the Staging Zone, while using foreachBatch and the Trigger Once in this for loop? There is something I am definitely missing with the foreachBatch option here because without it, Auto Loader is able to isolate the streams to folder 0 and folder 1, but with it, it's not.
I spoke with a Databricks Solution Architect today, and he mentioned that I needed to use a ThreadPoolExecutor, which is outside Auto Loader and Databricks itself but native to Python. It goes in a helper function that specifies the number of streams that handle the tables in parallel with Auto Loader. That way a single Auto Loader notebook can serve multiple tables, which meets my use case. Thanks!
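For what it's worth, here is a rough sketch of that idea, reusing the question's list_of_files() helper (so treat those indices and the merge condition as assumptions carried over from above). Each table gets its own function scope, so the foreachBatch callback captures its own target_location instead of reading the loop's shared global, which is likely what caused every batch to merge into the last table:
from concurrent.futures import ThreadPoolExecutor
from delta.tables import DeltaTable

def run_table_stream(i):
    # Per-table paths, resolved once inside this function's scope
    schema = list_of_files()[2][i]
    raw_data = list_of_files()[1][i]
    checkpoint = list_of_files()[3][i]
    target_location = list_of_files()[4][i]

    def upsert_data(df, epoch_id):
        # target_location comes from the enclosing scope, not a global
        DeltaTable.forPath(spark, target_location).alias("t") \
            .merge(df.alias("s"), "t.xx = s.xx and t.xx1 = s.xx1") \
            .whenMatchedUpdateAll() \
            .whenNotMatchedInsertAll() \
            .execute()

    stream = spark.readStream \
        .format("cloudFiles") \
        .option("cloudFiles.format", "parquet") \
        .schema(schema) \
        .load(raw_data)

    stream.writeStream \
        .foreachBatch(upsert_data) \
        .trigger(once=True) \
        .option("checkpointLocation", checkpoint) \
        .start() \
        .awaitTermination()

# Run a handful of per-table streams in parallel; tune max_workers to the cluster
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(run_table_stream, range(len(list_of_files()[0]))))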

Running Python code (Boto3) remotely (AWS)

I have code that moves items from one s3 bucket to another. I am running it locally on my computer. However, it will take a long time to finish running as there are many items in the bucket.
import boto3

# Get the S3 resource
s3 = boto3.resource('s3')

# Get references to the source and destination buckets
src = s3.Bucket('src')
dst = s3.Bucket('dst')

# Iterate through the items in the source bucket
for item in src.objects.all():
    # Identify the source object to copy
    copy_source = {
        'Bucket': 'src',
        'Key': item.key
    }
    # Place a copy of the item in the destination bucket
    dst.copy(copy_source, 'Images/' + item.key)
Is there any way I can run this code remotely so that I would not have to monitor it? I tried AWS Lambda, but it has a maximum run time of 15 minutes. Is there something like that I could use, but for a longer time?
You could use a Data Pipeline.
A data pipeline spawns an EC2 instance where you can run your job.
You can schedule the pipeline to run as often as every 15 minutes (but not more frequently).
There is also the option to create a pipeline that you can run on demand.
It also offers a console where you can view the jobs and their outcome and have the opportunity to rerun failed jobs.
For this kind of activity you should probably use this:
https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-shellcommandactivity.html
Another option is to simply start an EC2 instance, run your job on it, and then stop it.
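As a small, hypothetical extension of that last option: if the copy script runs on the instance itself (with an instance profile that allows ec2:StopInstances), it can shut the instance down when it finishes. The region and instance ID below are placeholders:
import boto3

def stop_this_instance(instance_id, region="us-east-1"):  # placeholder region
    # Stop the EC2 instance once the copy job has finished
    ec2 = boto3.client('ec2', region_name=region)
    ec2.stop_instances(InstanceIds=[instance_id])

# ... run the bucket-to-bucket copy from the question, then:
stop_this_instance("i-0123456789abcdef0")  # placeholder instance ID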

What is the best way to cleanup and recreate databricks delta table?

I am trying to cleanup and recreate databricks delta table for integration tests.
I want to run the tests on a DevOps agent, so I am using JDBC (the Simba driver), but it says statement type "DELETE" is not supported.
When I clean up the underlying DBFS location using the DBFS API "rm -r", it removes the table, but the next read after recreating it gives an error: "A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table DELETE statement."
Also, if I simply run DELETE FROM on the Delta table, I still see the underlying DBFS directory and files intact. How can I clean up the Delta table as well as the underlying files gracefully?
You can use the VACUUM command to do the cleanup, though I haven't used it myself yet.
If you are using Spark, you can use the overwriteSchema option to reload the data.
It would help if you could provide more details on how you are using the table.
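For what it's worth, a minimal sketch of that overwriteSchema approach, assuming a DataFrame df with the new data and the managed table schema.Tablename used later in this thread (it rewrites the table in place instead of deleting its files by hand):
# Overwrite both the data and the schema of the existing Delta table
(df.write
   .format("delta")
   .mode("overwrite")
   .option("overwriteSchema", "true")
   .saveAsTable("schema.Tablename"))  # or .save("/mnt/path/to/table") for a path-based table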
The steps that work are as follows.
When you run DROP TABLE or DELETE FROM on the table, the following happens:
DROP TABLE: drops your table, but the data still resides (and you can't create a new table definition with schema changes at the same location).
DELETE FROM: deletes data from the table, but the transaction log still resides.
So:
Step 1: DROP TABLE schema.Tablename
Step 2: %fs rm -r /mnt/path/where/your/table/definition/is/pointed/fileNames.parquet
Step 3: %fs ls to make sure there is no data and no transaction log left at that location
Step 4: re-run your CREATE TABLE statement with any changes you desire, USING DELTA LOCATION /mnt/path/where/your/table/definition/is/pointed/fileNames.parquet
Step 5: start using the table and verify with %sql DESC FORMATTED schema.Tablename
Make sure that you are not creating an external table. There are two types of tables:
1) Managed tables
2) External tables (the dataset location is specified explicitly)
When you delete a managed table, Spark is responsible for cleaning up both the table's metadata stored in the metastore and the data (files) in the table.
But for an external table, Spark does not own the data, so when you delete an external table only the metadata in the metastore is deleted; the data (files) that were in the table are not deleted.
If you confirm that your tables are managed tables and dropping a table still does not delete its files, then you can use the VACUUM command:
VACUUM <databaseName>.<TableName> [RETAIN NUM HOURS]
This will clean up all the uncommitted files from the table's folder.
I hope this helps you.
import os

path = "<Your Azure Databricks Delta Lake Folder Path>"  # use the /dbfs/... form so os.listdir can see it
for delta_table in os.listdir(path):
    # Second argument True deletes each table folder recursively
    dbutils.fs.rm("<Your Azure Databricks Delta Lake Folder Path>" + delta_table, True)
How to find your <Your Azure Databricks Delta Lake Folder Path>:
Step 1: Go to Databricks.
Step 2: Click Data > Create Table > DBFS. There you will find your Delta tables.

Zeppelin - run paragraphs in order

I have a Spark 2.1 standalone cluster installed on 2 hosts.
There are two Zeppelin (0.7.1) notebooks:
the first one prepares data, performs aggregations, and saves the output to files with:
data.write.option("header", "false").csv(file)
the second one is a notebook with shell paragraphs that merge all part* files from the Spark output into one file.
I would like to ask about 2 cases:
How to configure Spark to write the output to one file
After notebook 1 completes, how to add a dependency so that all paragraphs in notebook 2 run, e.g.:
NOTEBOOK 1:
data.write.option("header", "false").csv(file)
"run notebook2"
NOTEBOOK2:
shell code
Have you tried adding a paragraph at the end of note1 that executes note2 through the Zeppelin API? You can optionally add a loop that checks whether all paragraphs have finished executing, also through the API.
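For example (a sketch only: the host and note ID are placeholders, and the endpoints are from the Zeppelin 0.7 REST API docs, so verify them for your version), the last paragraph of notebook 1 could be a shell paragraph like:
%sh
# Run all paragraphs of notebook 2 asynchronously via Zeppelin's REST API
curl -X POST http://zeppelin-host:8080/api/notebook/job/2ABCDEFGH

# Optionally poll until every paragraph of notebook 2 reports FINISHED
curl http://zeppelin-host:8080/api/notebook/job/2ABCDEFGH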
