External Table in Databricks is showing only future date data - apache-spark

I have a Delta table in Databricks whose data lives in ADLS. The data is partitioned by a date column, and Parquet files are available in ADLS from 01-06-2022 onwards. However, when I query the table in Databricks I only see the most recent day's data each day; the older data is not displayed. Every day the data is overwritten to the table path, partitioned by the date column.

df.write.format('delta').mode('overwrite').save('{}/{}'.format(DELTALAKE_PATH, table))
Using Overwrite mode will delete past data and add new data. This is the reason for your issue.
df.write.format('delta').mode('append').save('{}/{}'.format(DELTALAKE_PATH, table))
Using append mode will add the new data to the existing data. This keeps your existing data, and when you query the table it will return the past records as well.
You need to use append mode in place of overwrite mode.
Append Mode - Only the new rows appended in the Result Table since the last trigger will be written to the external storage. This is applicable only to queries where existing rows in the Result Table are not expected to change.
Reference - https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts
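For the daily partitioned load described in the question, a minimal sketch of the append-based write might look like this; the partition column name 'date' is an assumption (it is not given in the question) and must match how the table was originally created:
# Append today's data instead of overwriting the whole table path.
# 'date' is an assumed partition column name; adjust to the real one.
df.write \
  .format('delta') \
  .mode('append') \
  .partitionBy('date') \
  .save('{}/{}'.format(DELTALAKE_PATH, table))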

Related

Azure Delta Load won't recognize Epoch timestamp (ms) as Watermark Column Name

I am trying to perform a delta load (incremental load) with Azure Data Factory from SQL Server to Blob Storage. My tables have an updateStamp column that is epoch time in milliseconds, with a numeric(19,0) data type. When I go to select the watermark column name in the configuration section of the Copy Data tool in ADF, it is not one of the options and it does not let me manually enter the column name. It looks like it only wants a datetime data type or a key integer data type. I have tried with the Metadata-driven copy task and the Delta copy from Database template with no luck. Is there a workaround, or a way of converting the max value and using that (instead of adding another column to hundreds of millions of rows)? Any help or guidance is appreciated.
I'm expecting to be able to use a data type that indicates a point in time as the watermark for an incremental load, even though that data type is not datetime.
I have tried to reproduce this in my environment using the Delta copy from a database template in ADF, with an epoch timestamp column as the watermark. Below are the steps.
The input table and watermark table are set up as in the image below. (The initial watermark value in the watermark table is set to 1657238400000 so that all records are copied on the first run.)
A stored procedure for updating the watermark value is created in SQL Server, as in the script below.
CREATE PROCEDURE update_watermark @LastModifyDate numeric(19,0)
AS
BEGIN
    UPDATE watermarktable
    SET [WatermarkValue] = @LastModifyDate
END
In ADF, the Delta copy from database template is selected, linked services for the source, sink and control table are provided, and then Use template is selected.
The configuration of the LookupLastWaterMark, LookupCurrentWaterMark and DeltaCopyfromDB activities is left unchanged.
In the UpdateWaterMark activity, the stored procedure name is selected and its parameter is imported. The type of the LastModifyDate parameter is set to Int64.
Debug is clicked to run the pipeline, and the pipeline parameters for the source, sink and control table are provided.
Once the file is copied, the watermark is updated with the latest value.
Sink File:
A new line item is added to the source (the 4th record in the image below is newly added).
The pipeline is rerun to check whether the delta rows are copied.
The delta records are copied to the sink when an epoch timestamp column is used as the watermark.
Reference: MS doc on Delta copy from a database template.
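For illustration only, the delta-load logic that this template implements (read the stored watermark, copy only rows with a larger updateStamp, then advance the watermark) can be sketched in PySpark roughly as follows; the JDBC URL and table names are placeholders, not values from the walkthrough above:
# Hypothetical connection details; authentication options omitted for brevity.
jdbc_url = "jdbc:sqlserver://<server>;databaseName=<db>"

# 1. Read the last watermark (epoch time in milliseconds).
last_wm = (spark.read.format("jdbc")
           .option("url", jdbc_url)
           .option("query", "SELECT WatermarkValue FROM watermarktable")
           .load()
           .collect()[0][0])

# 2. Copy only the rows changed since that watermark.
delta_df = (spark.read.format("jdbc")
            .option("url", jdbc_url)
            .option("query",
                    "SELECT * FROM source_table WHERE updateStamp > {}".format(last_wm))
            .load())

# 3. After writing delta_df to the sink, the new max(updateStamp) becomes the
#    watermark for the next run (this is what the update_watermark procedure stores).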

Azure Data flow - presql script, dynamic content

I want to run a pre-SQL script on the sink in a Data Flow. I want to delete existing records for a particular year.
This particular year comes from the source Excel file.
Say I have a file for 2021 data and have loaded that data into the DB. When I rerun the pipeline for the same Excel file, I want to delete the 2021-related records in the DB and insert fresh ones. The table may contain data for multiple years, so every time a new file arrives for a particular year, I want to delete that year's records and reload the new data.
I can read the year value from a source file column and keep it as a derived column. How can I write a pre-SQL script to do the delete?
delete from where year = <sourcefile.year>
How can I do this in a Data Flow?
Please help!
You can use a temporary table in sink 1 and delete the records in sink 2.
Please follow the demonstration below:
This is my SQL table with some sample data.
Create two SQL sinks for it from the same source using a new branch.
In the first sink, provide a dataset with the edit table name option for the temporary table.
In that sink, check Recreate table.
In sink 2, use your SQL table with the pre-SQL script below.
DELETE FROM exceltable WHERE year IN (SELECT year FROM dbo.temp1);
DROP TABLE dbo.temp1;
In the Data Flow settings, set the write order of the sinks.
The temporary table sink should execute first.
This is the result in the SQL table after the records are deleted.
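For comparison only, outside of Data Flow (for example from a Databricks notebook) the same delete-then-reload idea could be sketched with a driver-side connection; the connection string and table name below are placeholders, not part of the Data Flow setup above:
import pyodbc

# Hypothetical connection details.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server>;DATABASE=<db>;UID=<user>;PWD=<password>")

year_to_reload = 2021  # value read from the source file's year column

# Delete the existing rows for that year, then append the fresh data
# (for example with a JDBC write from the dataframe).
cursor = conn.cursor()
cursor.execute("DELETE FROM exceltable WHERE year = ?", year_to_reload)
conn.commit()
cursor.close()
conn.close()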

How do I update a large SQL table using pyspark?

I am building an ETL process using PySpark in Databricks.
I have a source SQL table with roughly 10 million rows of data which I want to load into a SQL staging table.
I have two basic requirements:
When a row is added to the source table, it must be inserted into the staging table.
When a row is updated in the source table, it must be updated in the staging table.
Source data
Thankfully the source table has two timestamp columns for created and updated time. I can query the new and updated data using these two columns and put it into a dataframe called source_df.
Target data
I load all the keys (IDs) from the staging table into a dataframe called target_df.
Working out changes
I join the two dataframes together based on the key to work out which rows already exist (which form updates) and which rows don't exist (which form inserts). This gives me two new dataframes, inserts_df and updates_df.
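A minimal sketch of that join, assuming a key column named id (the real key name is not given in the question):
# Rows whose key does not yet exist in the staging table become inserts.
inserts_df = source_df.join(target_df, on='id', how='left_anti')

# Rows whose key already exists in the staging table become updates.
updates_df = source_df.join(target_df, on='id', how='left_semi')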
Inserting new rows
This is easy because I can just use inserts_df.write to directly write into the staging table. Job done!
Updating existing rows
This is what I can't figure out, as there is little in the way of existing examples. I am led to believe that you can't do updates using PySpark. I could use the "overwrite" mode to replace the SQL table, but it doesn't make a lot of sense to replace 10 million rows when I only want to update half a dozen.
How can I effectively get the rows from updates_df into SQL without overwriting the whole table?

Insert or Update a delta table from a dataframe in Pyspark

I currently have a PySpark dataframe from which I initially created a Delta table using the code below:
df.write.format("delta").saveAsTable("events")
Now, since in my requirement the above dataframe is populated with data on a daily basis, I used the syntax below to append new records into the Delta table:
df.write.format("delta").mode("append").saveAsTable("events")
I did all of this in Databricks on my cluster. I want to know how I can write generic PySpark code in Python that will create the Delta table if it does not exist and append records if it does exist. I want this because if I give my Python package to someone, they will not have the same Delta table in their environment, so it should get created dynamically from the code.
If you don't have the Delta table yet, it will be created when you use append mode. So you don't need to write any special code to handle the case when the table doesn't exist yet and when it exists.
P.S. You'll need such code only if you're performing a merge into the table, not an append. In that case the code will look like this:
if table_exists:
    do_merge
else:
    df.write....
P.S. here is a generic implementation of that pattern
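A minimal concrete sketch of that check-then-merge pattern, assuming a key column named id (not named in the question) and Spark 3.3+ for spark.catalog.tableExists:
from delta.tables import DeltaTable

if spark.catalog.tableExists("events"):
    # Table exists: merge the new records into it.
    (DeltaTable.forName(spark, "events").alias("t")
        .merge(df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
else:
    # Table does not exist yet: create it from the dataframe.
    df.write.format("delta").saveAsTable("events")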
There are essentially two operations available with Spark:
saveAsTable: creates or replaces the table, whether or not it is already present, with the current DataFrame.
insertInto: succeeds only if the table is present, and performs the operation based on the mode ('overwrite' or 'append'). It requires the table to already exist in the database.
.saveAsTable("events") basically rewrites the table every time you call it, which means that whether or not a table was present earlier, it will be replaced with the current DataFrame's contents. Instead, you can perform the operation below to be on the safer side:
Step 1: Create the table if it is not already present (the where 1=2 clause creates it empty), then append the new dataframe records to it; if it is already present, the records are simply appended.
df.createOrReplaceTempView('df_table')
spark.sql("create table IF NOT EXISTS table_name using delta select * from df_table where 1=2")
df.write.format("delta").mode("append").insertInto("events")
So every time it will check whether the table is available; if not, it will create the table and move to the next step, and if the table is available, it will simply append the data into it.

Altering CSV Rows in Azure Data Factory

I've tried to use the 'Alter Rows' function within a Data Flow in Azure Data Factory to remove rows that match a condition from a CSV dataset.
The Data Preview shows that the matched rows will be deleted; however, the next step, the sink, seems to ignore that and writes the original rows to the CSV file output.
Is it not possible to use alter rows on a CSV dataset and if not, is there a work around?
Alter Row policies such as delete only take effect against database-style sinks, so they are ignored when the sink is a CSV file; filter the unwanted rows out instead.
First, use 'union' to combine your CSV files as the source.
Then, use 'filter' to filter your data with the date time stamps at the source.
