I am a data analyst, so I only have access to Databricks (creating notebooks and jobs).
I have a Delta table which is updated (new data merged in) every day. The pipeline is created by administrators. After this table is updated, I'd like to trigger a notebook or job.
Is that even possible with my access? I tried to create a change data feed, and then what?
I don't understand how to get from the change data feed to actually triggering something else.
My code snippet (based on different questions):
df = spark.readStream \
    .format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", "latest") \
    .table(tableName) \
    .filter("_change_type != 'update_preimage'")
After I query df, the stream reader reads data, but what I actually want is to run another notebook or job.
Secondly, I don't want to keep this readStream running all the time, because data is merged only once a day (around 7-9 AM).
How to do it?
You can use tasks inside the same job to execute another notebook after your Delta table changes.
This is sample code for a notebook named Delta_Notebook:
def isdeltaupdated():
    # Your code to check whether the Delta table has been updated or not.
    # If it has been updated, return True so the next notebook can run.
    # If not, raise an error in this notebook so the next task does not run.
    return False  # placeholder for the sample

if isdeltaupdated():
    print("ok")
else:
    # Raising an error so that the next task (notebook) won't run
    raise ValueError("Not updated")
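If you need a concrete starting point, here is a minimal sketch of one way isdeltaupdated() could be implemented (the table name and freshness window are placeholders, not part of the original answer); it checks the latest commit in the Delta transaction history:
from datetime import datetime, timedelta

def isdeltaupdated(table_name="my_schema.my_delta_table", hours=24):
    # DESCRIBE HISTORY returns commits newest-first, so LIMIT 1 is the latest one
    last_commit = (
        spark.sql(f"DESCRIBE HISTORY {table_name} LIMIT 1")
        .collect()[0]["timestamp"]
    )
    # Treat the table as updated if the latest commit is recent enough
    # (the timestamp comes back as a naive datetime in the session time zone)
    return last_commit >= datetime.now() - timedelta(hours=hours)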
First, create a job for this notebook and go to its Tasks tab.
You will see that a task has been created for Delta_Notebook.
Click on the + icon and create another task for the next notebook, selecting that notebook and setting a dependency on the first task.
Run this job and you will see that the second task is not executed when the first one raises the ValueError (Delta table not updated).
The second task will be executed if the first one doesn't raise any error (Delta table updated).
You can schedule this job once or twice a day at a particular time, and whenever the Delta table has been updated it will execute the other notebook.
I'm ingesting data with API calls and would like to parameterize it with widgets. In Azure I have the following setup:
I have a list of attribute_codes; I read them with a Lookup activity and pass each one as a parameter into the Databricks notebook code. Code inside the Databricks notebook:
data, response = get_data_url(url=f"https://p.cloud.com/api/rest/v1/attributes/{attribute_code}/options",access_token=access_token)
#Removing the folder in Data Lake
dbutils.fs.rm(f'/mnt/bronze/attribute_code/{day}',True)
#Creating the folder in the Data Lake
dbutils.fs.mkdirs(f'/mnt/bronze/attribute_code/{day}')
count = 0
#Putting the response inside of the Data Lake folder
dbutils.fs.put(f'/mnt/bronze/attribute_code/{day}/data_{count}.json', response.text)
My problem is that, since it's in a ForEach loop, every time a new parameter is passed it deletes the entire folder along with the previously loaded data. Someone might suggest removing the lines where I drop and recreate the daily folder, but the pipeline runs multiple times a day and I need to drop the data previously loaded that day and load the new data.
My goal is to iterate over the entire list of attribute_codes and load them all into one folder with the name "data_{count}.json".
Instead of using dbutils.fs.rm in your notebook, you can use a Delete activity before the ForEach activity to get the desired result.
With dbutils.fs.rm, the folder is deleted every time the notebook is triggered inside the ForEach loop, which also deletes previously created files.
So, by using a Delete activity only once before the ForEach loop to delete the folder (it deletes only if it exists), you can load the data as required.
For the path, I used the following dynamic content:
attribute/#{formatDateTime(utcNow(),'yyyy-MM-dd')}
And using the following code in my databricks notebook:
#I used similar code
data, response = get_data_url(url=f"https://p.cloud.com/api/rest/v1/attributes/{attribute_code}/options",access_token=access_token)
#Creating the folder in the Data Lake
dbutils.fs.mkdirs(f'/mnt/bronze/attribute_code/{day}')
count = 0
#Putting the response inside of the Data Lake folder
dbutils.fs.put(f'/mnt/bronze/attribute_code/{day}/data_{count}.json', response.text)
Let's say I have the following output from my Lookup activity:
When I run the pipeline, it runs successfully and only the latest lookup data is loaded.
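On the question's other goal of getting a distinct data_{count}.json per iteration, one possible sketch (reusing get_data_url, access_token and day from the question's own code; the widget name is illustrative, not from the original pipeline) is to pass the attribute code from each ForEach iteration into the notebook and key the file name on it, or on a counter passed the same way, so each run adds its own file instead of overwriting data_0.json:
# Parameter passed as a base parameter from the ADF Notebook activity inside ForEach
attribute_code = dbutils.widgets.get("attribute_code")

data, response = get_data_url(
    url=f"https://p.cloud.com/api/rest/v1/attributes/{attribute_code}/options",
    access_token=access_token,
)

# mkdirs is a no-op when the folder already exists, so every iteration
# only adds its own file to the daily folder
dbutils.fs.mkdirs(f"/mnt/bronze/attribute_code/{day}")

# overwrite=True so re-running the same attribute on the same day replaces its file
dbutils.fs.put(f"/mnt/bronze/attribute_code/{day}/data_{attribute_code}.json", response.text, True)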
I am writing "delta" format file in AWS s3.
Due to some corrupt data I need to delete data , I am using enterprise databricks which can access AWS S3 path, which has delete permission.
While I am trying to delete using below script
val p="s3a://bucket/path1/table_name"
import io.delta.tables.*;
import org.apache.spark.sql.functions;
DeltaTable deltaTable = DeltaTable.forPath(spark, p);
deltaTable.delete("date > '2023-01-01'");
But it is not deleting the data in the S3 path where "date > '2023-01-01'".
I waited for an hour but I still see the data; I have run the above script multiple times.
So what is wrong here? How do I fix it?
If you want to delete the data physically from S3 you can use dbutils.fs.rm("path").
If you just want to delete the data, run spark.sql("delete from table_name where <condition>") or use the %sql magic command and run the DELETE statement.
You can also try the VACUUM command, but the default retention period is 7 days. If you want to delete data that is less than 7 days old, set SET spark.databricks.delta.retentionDurationCheck.enabled = false; and then execute the VACUUM command.
The DELETE operation only removes the data logically from the Delta table; it just dereferences it from the latest version. To delete the data physically from storage you have to run a VACUUM command:
Check: https://docs.databricks.com/sql/language-manual/delta-vacuum.html
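Putting both answers together, a minimal sketch of the full sequence (using the path from the question; the retention override is only needed if the deleted files are younger than the 7-day default) could look like this:
path = "s3a://bucket/path1/table_name"

# Logical delete: removes the matching rows from the current table version
spark.sql(f"DELETE FROM delta.`{path}` WHERE date > '2023-01-01'")

# Only if you really need to purge files younger than the 7-day retention default;
# note this can break time travel and concurrent readers
spark.sql("SET spark.databricks.delta.retentionDurationCheck.enabled = false")

# Physical delete: removes data files no longer referenced by any retained version
spark.sql(f"VACUUM delta.`{path}` RETAIN 0 HOURS")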
Main topic
I am facing a problem that I am struggling a lot to solve:
Ingest files that already have been captured by Autoloader but were
overwritten with new data.
Detailed problem description
I have a landing folder in a data lake where every day a new file is posted. You can check the image example below:
Each day an automation posts a file with new data. The file is named with a suffix indicating the year and month of the current posting period.
This naming convention results in a file that is overwritten each day with the accumulated data extraction of the current month. The number of files in the folder only increases when the current month is closed and a new month starts.
To deal with that I have implemented the following PySpark code using the Autoloader feature from Databricks:
# Import functions
from pyspark.sql.functions import input_file_name, current_timestamp, col

# Define variables used in the code below
checkpoint_directory = "abfss://gpdi-files@hgbsprodgbsflastorage01.dfs.core.windows.net/RAW/Test/_checkpoint/sapex_ap_posted"
data_source = "abfss://gpdi-files@hgbsprodgbsflastorage01.dfs.core.windows.net/RAW/Test"
source_format = "csv"
table_name = "prod_gbs_gpdi.bronze_data.sapex_ap_posted"

# Configure Auto Loader to ingest csv data into a Delta table
query = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", source_format)
    .option("cloudFiles.schemaLocation", checkpoint_directory)
    .option("header", "true")
    .option("delimiter", ";")
    .option("skipRows", 7)
    .option("modifiedAfter", "2022-10-15 11:34:00.000000 UTC-3")  # Only ingest files with a modification timestamp after the provided timestamp
    .option("pathGlobFilter", "AP_SAPEX_KPI_001 - Posted Invoices in *.CSV")  # A glob pattern for choosing files
    .load(data_source)
    .select(
        "*",
        current_timestamp().alias("_JOB_UPDATED_TIME"),
        input_file_name().alias("_JOB_SOURCE_FILE"),
        col("_metadata.file_modification_time").alias("_MODIFICATION_TIME"),
    )
    .writeStream
    .option("checkpointLocation", checkpoint_directory)
    .option("mergeSchema", "true")
    .trigger(availableNow=True)
    .toTable(table_name)
)
This code allows me to capture each new file and ingest it into a Raw Table.
The problem is that it works fine ONLY when a new file arrives. If the desired file is overwritten in the landing folder, Autoloader does nothing because it assumes the file has already been ingested, even though the file's modification time has changed.
Failed attempt
I tried to use the modifiedAfter option in the code, but it appears to serve only as a filter that prevents files from being ingested when their modification timestamp is before the threshold given in the timestamp string. It does not re-ingest files that have already been captured.
.option("modifiedAfter", "2022-10-15 14:10:00.000000 UTC-3")
Question
Does anyone know how to detect a file that has already been ingested but has a different modification date, and how to reprocess it so it is loaded into the table?
I have figured out a solution to this problem. In the Auto Loader options list in the Databricks documentation there is an option called cloudFiles.allowOverwrites. If you enable it in the streaming query, then whenever a file is overwritten in the lake the query will ingest it into the target table. Note that this option will likely duplicate data whenever a file is overwritten, so downstream treatment will be necessary.
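For reference, the only change needed to the streaming query from the question is the extra option; an abbreviated sketch (reusing the variables defined above):
query = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", source_format)
    .option("cloudFiles.schemaLocation", checkpoint_directory)
    .option("cloudFiles.allowOverwrites", "true")  # re-ingest files that are overwritten in place
    .option("header", "true")
    .option("delimiter", ";")
    .option("skipRows", 7)
    .load(data_source)
    .writeStream
    .option("checkpointLocation", checkpoint_directory)
    .option("mergeSchema", "true")
    .trigger(availableNow=True)
    .toTable(table_name)
)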
I have created a Python notebook in Databricks. I have Python logic and need to execute a %sql command cell.
Say I want to execute that cell, cmd2, based on a Python variable:
cmd1
EXECUTE_SQL = True
cmd2
if condition:
    %sql .....
As mentioned, you can use the following Python code (or Scala) to get behavior similar to the %sql cell:
if condition:
    display(spark.sql("your-query"))
One advantage of this approach is that you can embed variables into the query text.
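For example, with an illustrative table name and date variable (not from the original question):
EXECUTE_SQL = True
table_name = "my_db.my_table"   # illustrative
min_date = "2023-01-01"         # illustrative

if EXECUTE_SQL:
    # The query is just a Python string, so variables can be interpolated into it
    display(spark.sql(f"SELECT * FROM {table_name} WHERE load_date >= '{min_date}'"))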
Another alternative I used:
Extract the SQL to a different notebook. In my case I don't want any results back; I am just cleaning up the Delta tables and deleting their contents.
/clean-deltatable-notebook (a SQL notebook):
delete from <database>.<table>
Then call it with dbutils.notebook.run() from the Python notebook.
cmd2
if condition:
    result = dbutils.notebook.run('<path-of-clean-deltatable-notebook>', timeout_seconds=30)
    print(result)
Link to dbutils.notebook.run() in the Databricks documentation.
I have two Spark streams set up in a notebook to run in parallel like so.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")

df1 = spark \
    .readStream.format("delta") \
    .table("test_db.table1") \
    .select('foo', 'bar')

writer_df1 = df1.writeStream.option("checkpointLocation", checkpoint_location_1) \
    .foreachBatch(
        lambda batch_df, batch_epoch:
            process_batch(batch_df, batch_epoch)
    ) \
    .start()

spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")

df2 = spark \
    .readStream.format("delta") \
    .table("test_db.table2") \
    .select('foo', 'bar')

writer_df2 = df2.writeStream.option("checkpointLocation", checkpoint_location_2) \
    .foreachBatch(
        lambda batch_df, batch_epoch:
            process_batch(batch_df, batch_epoch)
    ) \
    .start()
These dataframes then get processed row by row, with each row being sent to an API. If the API call reports an error, I convert the row into JSON and append it to a common failures table in Databricks.
columns = ['table_name', 'record', 'time_of_failure', 'error_or_status_code']
vals = [(table_name, json.dumps(row.asDict()), datetime.now(), str(error_or_http_code))]
error_df = spark.createDataFrame(vals, columns)
error_df.select('table_name', 'record', 'time_of_failure', 'error_or_status_code').write.format('delta').mode('append').saveAsTable("failures_db.failures_db")
When attempting to add the row to this table, the saveAsTable() call here throws the following exception.
py4j.protocol.Py4JJavaError: An error occurred while calling o3578.saveAsTable.
: java.lang.IllegalStateException: Cannot find the REPL id in Spark local properties. Spark-submit and R doesn't support transactional writes from different clusters. If you are using R, please switch to Scala or Python. If you are using spark-submit , please convert it to Databricks JAR job. Or you can disable multi-cluster writes by setting 'spark.databricks.delta.multiClusterWrites.enabled' to 'false'. If this is disabled, writes to a single table must originate from a single cluster. Please check https://docs.databricks.com/delta/delta-intro.html#frequently-asked-questions-faq for more details.
If I comment out one of the streams and re-run the notebook, any errors from the API calls get inserted into the table with no issues. I feel like there's some configuration I need to add but am not sure of where to go from here.
Not sure if this is the best solution, but I believe the problem comes from both streams writing to the same table at the same time. I split the failures table into separate tables, one per stream, and it worked after that.
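One way to implement that split, sketched here with illustrative names and derived from the question's own insert logic (not necessarily how the answerer did it), is to route each failure to a per-stream table:
import json
from datetime import datetime

def log_failure(table_name, row, error_or_http_code):
    # One failures table per source stream, e.g. failures_db.failures_table1,
    # so the two streams never write to the same Delta table concurrently
    target = f"failures_db.failures_{table_name.split('.')[-1]}"
    columns = ['table_name', 'record', 'time_of_failure', 'error_or_status_code']
    vals = [(table_name, json.dumps(row.asDict()), datetime.now(), str(error_or_http_code))]
    spark.createDataFrame(vals, columns).write.format('delta').mode('append').saveAsTable(target)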