Cache Lookup Properties in Azure Data Factory

I have a requirement where a source file in a Mapping Data Flow contains one or more table names. For each table name in the file, a dynamic query needs to retrieve column metadata, along with some other properties, from the data dictionary tables and insert it into a different sink table. The table name from the file would be used as the filter in the WHERE clause.
Since the input file can list multiple tables (let's assume it's a CSV with a single column containing the table names), if we decide to use a cache sink for the file:
1. Is it possible to use the results of that cached sink as a lookup in the Source transformation query (where the column metadata is retrieved) in the same mapping data flow, and if so, how?
2. What would be the best way to restrict the data returned by the metadata table query based on this table name?
3. I thought of achieving this instead with a pipeline that uses a ForEach and passes the table name as a parameter to the data flow, but with 100 tables in the file there would be 100 iterations and the cluster would need to be spun up 100 times. Please advise if this is wrong or if there are better ways to achieve this.

You would need to use option 3. Loop through the table names and pass each in as a parameter to the data flow to set the table name in the dataset.
ADF handles the cluster creation and teardown. All you have to worry about is whether you want to execute each sequentially or in parallel and how many. There are concurrency limits in ADF, so you should consider a batch count of 20 if you run in parallel.
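As a rough sketch of the loop described above, the Python below builds a ForEach activity definition that runs in parallel with a batch count of 20 and passes each table name to a data flow. The activity names (`LookupTableNames`, `RunMetadataDataFlow`), the data flow parameter `tableName`, and the `TableName` column are all hypothetical, and the exact JSON shape should be checked against what the ADF authoring UI generates for your pipeline.

```python
import json


def build_foreach_activity(data_flow_name: str, batch_count: int = 20) -> dict:
    """Return a ForEach activity definition that runs one data flow per table name."""
    return {
        "name": "ForEachTableName",  # hypothetical activity name
        "type": "ForEach",
        "typeProperties": {
            # Assumes a Lookup activity named LookupTableNames read the CSV of table names.
            "items": {
                "value": "@activity('LookupTableNames').output.value",
                "type": "Expression",
            },
            # Run iterations in parallel, at most batch_count at a time.
            "isSequential": False,
            "batchCount": batch_count,
            "activities": [
                {
                    "name": "RunMetadataDataFlow",  # hypothetical activity name
                    "type": "ExecuteDataFlow",
                    "typeProperties": {
                        "dataflow": {
                            "referenceName": data_flow_name,
                            "type": "DataFlowReference",
                            # Hypothetical string parameter on the data flow.
                            "parameters": {
                                "tableName": {
                                    "value": "'@{item().TableName}'",
                                    "type": "Expression",
                                }
                            },
                        }
                    },
                }
            ],
        },
    }


activity = build_foreach_activity("MetadataDataFlow")
print(json.dumps(activity, indent=2))
```

Setting `isSequential` to true instead would run the tables one at a time, which is slower but avoids hitting the concurrency limits mentioned above.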

Related

How to delete records from a sql database using azure data factory

I am setting up a pipeline in data factory where the first part of the pipeline needs some pre-processing cleaning. I currently have a script set up to query these rows that need to be deleted, and export these results into a csv.
What I am looking for is essentially the opposite of an upsert copy activity. I would like the procedure to delete the rows in my table based on a matching row.
Apologies in advance if this has an easy solution; I am fairly new to Data Factory and just need help looking in the right direction.
Assuming the source from which you initially get the rows is different from the sink, there are multiple ways to achieve this.
If the number of rows is small, you can use a Script activity or a Lookup activity to delete the records from the destination table.
For a larger dataset, where you would run into the limits of the Lookup activity, you can copy the data into a staging table in the destination and use a Script activity to delete the matching rows.
If your organization supports the use of data flows, you can use one to achieve this.
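The staging-table approach can be illustrated with a small, self-contained sketch. It uses Python's built-in `sqlite3` in place of a real SQL database, and the table names (`orders`, `orders_to_delete`) and key column (`id`) are made up for the example. In ADF, the Copy activity would load the staging table and the Script activity would run the DELETE statement.

```python
import sqlite3


def delete_matching_rows(conn, target, staging, key):
    """Delete rows from target whose key also appears in staging; return the count."""
    # Identifiers cannot be bound as parameters, so they must come from trusted code.
    sql = (
        f"DELETE FROM {target} "
        f"WHERE EXISTS (SELECT 1 FROM {staging} s WHERE s.{key} = {target}.{key})"
    )
    cur = conn.execute(sql)
    conn.commit()
    return cur.rowcount


# In-memory database standing in for the destination.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE orders_to_delete (id INTEGER PRIMARY KEY);
    INSERT INTO orders VALUES (1, 'ok'), (2, 'stale'), (3, 'ok'), (4, 'stale');
    INSERT INTO orders_to_delete VALUES (2), (4);
    """
)

deleted = delete_matching_rows(conn, "orders", "orders_to_delete", "id")
remaining = [row[0] for row in conn.execute("SELECT id FROM orders ORDER BY id")]
print(deleted, remaining)
```

The same `DELETE ... WHERE EXISTS` pattern should carry over to T-SQL, though the syntax details are worth verifying against your target database.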

How to rename column names from lookup in ADF?

I have metadata in my Azure SQL DB / CSV file, as below, containing the old column names and data types and the new column names.
I want to rename oldfieldname and change its data type in ADF based on that metadata.
The idea is to store the metadata file in a cache and use it in a lookup, but I am not able to do this in the data flow expression builder. Any idea which transformation to use, or how I should do it?
I reproduced the above and was able to change the column names and data types as shown below.
This is the sample CSV file I took from blob storage, which holds the metadata of the table.
In your case, take care with the new data types: if a new type is not compatible with the data already in the table, the change will fail with an error.
Create a dataset for this file, use it in a Lookup activity, and leave the "First row only" option unchecked.
This is my sample SQL table:
Pass the Lookup output array to a ForEach activity.
Inside the ForEach, use a Script activity to execute the script that changes the column name and data type.
Script:
EXEC SP_RENAME 'mytable2.@{item().OldName}', '@{item().NewName}', 'COLUMN';
ALTER TABLE mytable2
ALTER COLUMN @{item().NewName} @{item().Newtype};
Execute this; below is my SQL table with the changes.
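To make it clearer what the Script activity ends up running on each iteration, here is a small sketch that renders the same statements in Python for a list of metadata rows. The sample rows and the helper name `build_rename_script` are made up for illustration; the keys `OldName`, `NewName`, and `Newtype` match the ones used in the script above. No quoting or escaping is done, so the names are assumed to come from a trusted metadata file.

```python
def build_rename_script(table, old_name, new_name, new_type):
    """Render the rename and retype statements for one metadata row."""
    return (
        f"EXEC sp_rename '{table}.{old_name}', '{new_name}', 'COLUMN';\n"
        f"ALTER TABLE {table}\n"
        f"ALTER COLUMN {new_name} {new_type};"
    )


# Hypothetical rows, shaped like the Lookup output array.
metadata = [
    {"OldName": "id", "NewName": "customer_id", "Newtype": "INT"},
    {"OldName": "nm", "NewName": "customer_name", "Newtype": "VARCHAR(100)"},
]

scripts = [
    build_rename_script("mytable2", m["OldName"], m["NewName"], m["Newtype"])
    for m in metadata
]

for script in scripts:
    print(script)
    print()
```

Printing the rendered statements like this before running the pipeline is a cheap way to catch an incompatible type in the metadata.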

Azure Data Factory DataFlow Filter is taking a lot of time

I have an ADF pipeline which executes a Data Flow.
The Data Flow has a Source, table A, which has around 1 million rows,
a Filter, which has a query to select only yesterday's records from the source table,
Alter Row settings, which use upsert,
and a Sink, which is the archival table into which the records are upserted.
This whole pipeline is taking around 2 hours or so which is not acceptable. Actually, the records being transferred / upserted are around 3000 only.
Core count is 16. Tried the partitioning with round robin and 20 partitions.
Similar archival doesn't take more than 15 minutes for another table which has around 100K records.
I thought of creating a source that would select only yesterday's records, but in the dataset we can only select a table.
Please suggest if I am missing anything to optimize it.
The table of the Data Set really doesn't matter. Whichever activity you use to access that Data Set can be toggled to use a query instead of the whole table, so that you can pass in a value to select only yesterday's data from the database.
Of course, if you have the ability to create a stored procedure on the source, you could also do that.
When migrating really large sets of data, you'll get much better performance using a Copy activity to stage the data in Azure Blob Storage before using another Copy activity to pull from that blob into the destination. But for what you're describing here, that doesn't seem necessary.
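As an illustration of the query option, the sketch below builds a source query that selects only yesterday's rows, so the filtering happens in the database instead of in the Data Flow. The table name `dbo.SourceA` and the column `ModifiedDate` are assumptions, since the question does not name them. In ADF you would normally build the same string with pipeline or data flow expressions and pass it as the source query.

```python
from datetime import date, timedelta


def yesterday_query(table, date_column, today=None):
    """Build a query for rows dated yesterday, as a half-open range [yesterday, today)."""
    today = today or date.today()
    start = today - timedelta(days=1)
    return (
        f"SELECT * FROM {table} "
        f"WHERE {date_column} >= '{start.isoformat()}' "
        f"AND {date_column} < '{today.isoformat()}'"
    )


# Hypothetical table and column names.
print(yesterday_query("dbo.SourceA", "ModifiedDate"))
```

A half-open date range is used here instead of wrapping the column in a function, which should let the database use an index on the date column if one exists.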

Ingesting a CSV file thru Polybase without knowing the sequence of columns

I am trying to ingest a few CSV files from Azure Data Lake into Azure Synapse using Polybase.
Each CSV file has a fixed set of columns, and the column names are given on the first line. However, the columns can appear in a different order from file to file.
In Polybase, I need to declare an external table, which requires knowing the exact sequence of columns at design time, so I cannot create the external table. Are there other ways to ingest the CSV files?
I don't believe you can do this directly with Polybase because as you noted the CREATE EXTERNAL TABLE statement requires the column declarations. At runtime, the CSV data is then mapped to those column names.
You could accomplish this easily with Azure Data Factory and Data Flow (which uses Polybase under the covers to move the data to Synapse) by allowing the Data Flow to generate the table. This works because the table is generated after the data has been read, rather than before, as with CREATE EXTERNAL TABLE.
For the sink Data Set, create it with parameterized table name [and optionally schema]:
In the Sink activity, specify "Recreate table":
Pass the desired table name to the sink Data Set from the Pipeline:
Be aware that all string-based columns will be defined as VARCHAR(MAX).

How to perform Incremental Load with date or key column using Azure data factory

I want to achieve an incremental load from Oracle to Azure SQL Data Warehouse using Azure Data Factory. The issue I am facing is that I don't have any date column or key column to drive the incremental load. Is there any other way to achieve this?
You will either have to:
A. Identify a field in each table you want to use to determine if the row has changed
B. Implement some kind of change capture feature on the source data
Those are really the only two ways to limit the amount of data you pull from the source.
It wouldn't be very efficient, but if you are just trying not to update rows that haven't changed in your destination, you can hash your source values and hash the values in the destination, and only insert/update rows where the hashes don't match. Here's an example of how this works in T-SQL.
There is a section of the Data Factory documentation dedicated to incrementally loading data. Please check it out if you haven't.
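The hash comparison mentioned above can be sketched outside of T-SQL as well. The Python below is only an illustration under a few assumptions: rows are dictionaries, the column list (`name`, `qty`) is invented, and since there is no key column it compares whole-row hashes, so a changed row simply shows up as a new row to insert. It still reads every source row, which is why this approach is not very efficient.

```python
import hashlib


def row_hash(row, columns):
    """Hash the selected column values of one row."""
    # A separator that is unlikely to appear in data keeps ('ab', 'c') distinct from ('a', 'bc').
    # Note that None and the empty string hash the same in this simple version.
    payload = "\x1f".join("" if row[c] is None else str(row[c]) for c in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def changed_rows(source_rows, dest_rows, columns):
    """Return source rows whose hash does not appear among the destination hashes."""
    dest_hashes = {row_hash(r, columns) for r in dest_rows}
    return [r for r in source_rows if row_hash(r, columns) not in dest_hashes]


# Hypothetical sample data.
columns = ["name", "qty"]
source = [
    {"name": "a", "qty": 1},
    {"name": "b", "qty": 2},
    {"name": "c", "qty": 3},
]
dest = [
    {"name": "a", "qty": 1},
    {"name": "b", "qty": 5},
]

to_load = changed_rows(source, dest, columns)
print(to_load)
```

With a key column available, the same idea extends to updates: store the hash per key in the destination and update only the keys whose hash differs.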
