Approach to handling [update date] columns in ADF pipelines - Azure

New to ADF and trying to build a solution to handle the simple scenario below:
Source and sink are both Azure SQL DB
New rows, identified by a unique ID, are inserted with a NULL [update date] column
Any change to an existing row sets its [update date] column to the current date
Would the Copy data activity be enough to cater for the above?
Thanks

You can use Upsert in the Copy activity. Upsert means "update if the row exists, insert if it doesn't", based on a key column.
Follow the repro below for reference:
My source table in SQL:
Target table:
Select Upsert in the sink:
After selecting Upsert, specify your unique ID column under Key columns.
The upsert will be performed based on this column.
Target table after execution:
You can see that the existing rows are updated, and new rows are inserted.
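Note that the Copy activity's upsert writes whatever values the source provides, so stamping [update date] with the current date on updates typically needs logic on the database side. A rough T-SQL sketch of that combined behavior (table and column names here are placeholders, not from the pipeline above):

```sql
-- Sketch of upsert semantics plus the [update date] rule, assuming a
-- staging table dbo.Staging and a target dbo.Target keyed on UniqueId.
MERGE dbo.Target AS t
USING dbo.Staging AS s
    ON t.UniqueId = s.UniqueId
WHEN MATCHED THEN
    UPDATE SET t.SomeValue     = s.SomeValue,
               t.[update date] = GETDATE()   -- changed rows get the current date
WHEN NOT MATCHED THEN
    INSERT (UniqueId, SomeValue, [update date])
    VALUES (s.UniqueId, s.SomeValue, NULL);  -- new rows keep [update date] NULL
```

If you need the [update date] rule exactly as described, landing the data in a staging table and running a MERGE like this (or a trigger on the target) is one way to do it; a plain Copy-activity upsert alone only replicates the source values.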

Related

Incremental load in Azure Data Factory

I am replicating data from Azure SQL DB to Azure SQL DB. Some tables have date columns, while others only have an ID column that serves as the primary key. When performing an incremental load in ADF, I can select the date column as the watermark for the tables that have one, and the ID column for the tables that don't. The problem is that my IDs are GUID values. Can I use a GUID as the watermark column? When I try, the Copy activity gives me the following error in ADF:
Please see the image above for reference.
How can I overcome this issue? Help is appreciated.
Thank you
Gp
I have tried dynamic mapping (https://martinschoombee.com/2022/03/22/dynamic-column-mapping-in-azure-data-factory/) but it does not work; it still gives me the same error.
Regarding your question about the watermark:
A watermark is a column that holds the last-updated timestamp or an ever-incrementing key.
So a GUID column would not be a good fit.
Try to find a date column, or an ever-incrementing integer identity, to use as the watermark.
Since your source is SQL server, you can also use change data capture.
Links:
Incremental loading in ADF
Change data capture
Regards,
Chen
The watermark logic takes advantage of the fact that only records inserted after the last saved watermark need to be considered for copying from source A to B; essentially, we are using the ">=" operator to our advantage here.
With a GUID you cannot use that logic: a GUID may well be unique, but it has no meaningful ordering, so ">=" or "<=" comparisons will not work.
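The incremental pattern the answers describe can be sketched in T-SQL; the table, column, and watermark-store names below are illustrative, not from the question:

```sql
-- 1. Read the last saved watermark for this table
DECLARE @last datetime2 =
    (SELECT WatermarkValue FROM dbo.WatermarkTable
     WHERE TableName = 'MyTable');

-- 2. Copy only rows changed since then; this comparison is exactly
--    why an unordered GUID cannot serve as the watermark column
SELECT * FROM dbo.MyTable WHERE LastModified > @last;

-- 3. After a successful copy, advance the watermark
UPDATE dbo.WatermarkTable
SET WatermarkValue = (SELECT MAX(LastModified) FROM dbo.MyTable)
WHERE TableName = 'MyTable';
```

In ADF these three steps usually map to a Lookup activity, a Copy activity with a parameterized source query, and a Stored procedure activity, respectively.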

How to add a timestamp column to an existing table in Athena?

According to the Athena docs, I cannot add a date column to an existing table, so I am trying the workaround they propose with the timestamp data type.
But when I run the ALTER TABLE my_table ADD COLUMNS (date_column TIMESTAMP) query, I still get the following error:
Parquet does not support date. See HIVE-6384
Is there any option to add date or timestamp columns to an existing table?
Thanks
UPD: I found out that I can still add timestamp columns through the Glue UI/API.
UPD 2: The issue occurs only with one specific table; it works for the others.
The syntax for adding a timestamp column to an existing table is:
ALTER TABLE my_table ADD COLUMNS (date_column TIMESTAMP);
TIMESTAMP is supported for both Parquet and ORC tables; the HIVE-6384 error refers to the DATE type, which Athena's Parquet tables do not support. If the error still appears for a TIMESTAMP column on one specific table, that table's metadata is likely the problem, and updating the schema through the Glue console or API, as you found, is a reasonable workaround.
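To make the distinction concrete (my_table is the placeholder name from the question):

```sql
-- Supported: a TIMESTAMP column can be added to a Parquet-backed table
ALTER TABLE my_table ADD COLUMNS (event_ts TIMESTAMP);

-- Not supported: a DATE column on a Parquet table fails with
-- "Parquet does not support date. See HIVE-6384"
-- ALTER TABLE my_table ADD COLUMNS (event_dt DATE);
```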

Pivoting based on Row Number in Azure Data Factory - Mapping Data Flow

I am new to Azure and am trying to see if the below result is achievable with data factory / mapping data flow without Databricks.
I have my csv file with this sample data :
I have following data in my table :
My expected data/ result:
Which transformations would be helpful to achieve this?
Thanks.
Now that you have the RowNumber column, you can use the Pivot transformation to do row-to-column pivoting.
I used your sample data to make a test as follows:
My Projection tab is like this:
My DataPreview is like this:
In the Pivot1 transformation, we select the Table_Name and Row_Number columns to group by. If you don't want the Table_Name column, you can delete it here.
On the Pivot key tab, we select the Col_Name column.
Under Pivoted columns, we must select an aggregate function to aggregate the Value column; here I use max().
The result shows:
Please correct me if I have misunderstood your question.
Update:
The data source looks like this:
The result shows, as you said, that ADF sorts the columns alphabetically. There seems to be no way to customize the sort order:
But when the sink activity runs, it will auto-map the columns into your SQL result table.
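For reference, the same row-to-column pivot can be expressed in T-SQL. The column names below are illustrative (modeled on the pivot settings described above), and the pivoted columns come out in the order you list them rather than alphabetically:

```sql
-- Sketch: pivot Col_Name/Value pairs into one column per Col_Name,
-- grouped by Table_Name and Row_Number, using MAX() as the aggregate.
SELECT Table_Name, [Row_Number], [Col1], [Col2], [Col3]
FROM (
    SELECT Table_Name, [Row_Number], Col_Name, [Value]
    FROM dbo.SourceTable
) AS src
PIVOT (
    MAX([Value]) FOR Col_Name IN ([Col1], [Col2], [Col3])
) AS p;
```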

Cassandra Altering the table

I have a table in Cassandra say employee(id, email, role, name, password) with only id as my primary key.
I want to ...
1. Add another column (manager_id) in with a default value in it
I know that I can add a column to the table, but there is no way to provide a default value for that column through CQL. I also cannot update manager_id later, since I would need to know the id (the partition key, whose values are randomly generated and unknown to me) to update a row. Is there any way I can achieve this?
2. Rename this table to all_employee.
I also know that renaming a table is not allowed in Cassandra. So I am copying the data from the old table (employee) to CSV, copying from the CSV into the new table (all_employee), and deleting the old table. I am doing this through an automated script with CQL queries in it. The script works fine, but it will fail if it is executed again (which I cannot prevent), since the employee table will no longer exist once it has been deleted. Essentially I am looking for an "IF EXISTS" clause in the COPY command, which CQL does not support. Is there any other way to achieve this outcome?
Please note that the amount of data in the table is very small, so performance is not an issue.
For #1
I don't think Cassandra supports default column values. You need to handle that in your application: write the default value every time you insert a row.
For #2
You can check whether the table exists before trying to copy from it:
SELECT table_name FROM system_schema.tables WHERE keyspace_name = 'your_keyspace_name' AND table_name = 'employee';
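Putting the rename workaround together as a cqlsh script sketch (keyspace name, file path, and the uuid type for id are assumptions; COPY is a cqlsh command, not CQL, so this must run in cqlsh):

```sql
-- Export the old table, recreate under the new name, re-import, drop the old one.
COPY my_keyspace.employee TO '/tmp/employee.csv' WITH HEADER = true;

-- Same schema as employee (from the question), plus the new manager_id column;
-- IF NOT EXISTS makes this step safe to rerun.
CREATE TABLE IF NOT EXISTS my_keyspace.all_employee (
    id         uuid PRIMARY KEY,
    email      text,
    role       text,
    name       text,
    password   text,
    manager_id uuid
);

COPY my_keyspace.all_employee FROM '/tmp/employee.csv' WITH HEADER = true;

-- IF EXISTS keeps the drop from failing on a rerun.
DROP TABLE IF EXISTS my_keyspace.employee;
```

Note that CREATE TABLE and DROP TABLE do support IF NOT EXISTS / IF EXISTS, but COPY ... TO will still fail on a rerun once employee is gone, which is why the existence check against system_schema.tables above is still needed before the export step.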

subsonic timestamp issue on insert record

My problem is with the timestamp column in one of my database tables. This field is required because SQL Server generates it automatically, and since the column is required, SubSonic includes it in the INSERT query.
I'm using SubSonic 3.0.
info.Save();
error message:
Cannot insert an explicit value into a timestamp column. Use INSERT with a column list to exclude the timestamp column, or insert a DEFAULT into the timestamp column.
Have a look at the following answer for a workaround for this:
Subsonic: How to exclude a table column so that its not included in the generated SQL query
