Adding Extraction DateTime in Azure Data Factory - azure

I want to write a generic DataFactory in V2 with below scenario.
Source ---> Extracted (Salesforce or some other way), which don't have
extraction timestamp. ---> I want to write it to Blob with extraction
Time Stamp.
I want it to be generic, so I don't want to give column mapping anywhere.
Is there any way to use expression or system variable in Custom activity to append a column in output dataset? I like to have a very simple solution to make implementation realistic.

To do that you should change the query to add the column you need, with the query property in the copy activity of the pipeline. https://learn.microsoft.com/en-us/azure/data-factory/connector-salesforce#copy-activity-properties
I dont know much about Salesforce, but in SQL Server you can do the following:
SELECT *, CURRENT_TIMESTAMP as AddedTimeStamp from [schema].[table]
This will give you every field on your table and will add a column named AddedTimeStamp with the CURRENT_TIMESTAMP value in every row of the result.
Hope this helped!

Related

Incremental load in Azure Data Factory

I am replicating my data from Azure SQl DB TO Azure SQL DB. I have some tables with date columns and some tables with just the ID columns which are assigning primary key. While performing incremental load in ADF, I can select date as watermark column for the tables which have date column and id as watermark column for the tables which has id column, But the issue is my id has guid values, So can I i take that as my watermark column ? and if yes while copy activity process it gives me following error in ADF
Please see the image for above reference
How can I overcome this issue. Help is appreciated
Thank you
Gp
I have tried dynamic mapping https://martinschoombee.com/2022/03/22/dynamic-column-mapping-in-azure-data-factory/ from here but it does not work it still gives me same error.
Regarding your question about watermak:
A watermark is a column that has the last updated time stamp or an incrementing key
So GUID column would not be a good fit.
Try to find a date column, or an integer identity which is ever incrementing, to use as watermark.
Since your source is SQL server, you can also use change data capture.
Links:
Incremental loading in ADF
Change data capture
Regards,
Chen
The watermark logic takes advantange of the fact that all the new records which are inserted after the last watermark saved should only be considered for copying from source A to B , basically we are using ">=" operator to our advantage here .
In case of guid you cannot use that logic as guid cann surely be unique but not ">=" or "=<" will not work.

Time function in Azure Data Factory - Expression Builder

I only need to take the time part from the 'Timestamp type source attribute' and load it into a dedicated SQL pool table (Time datatype column). But I don't find a time function within the expression builder in ADF, is there a way I can do it?
-What did I do?
-I took the time part from the source attribute using substring and then tried to load the same into the destination table, when I do the destination table inserted null values as the column at the destination table is set to time datatype.
I tried to reproduce this and got the same issue. The following is a demonstration of the same. I have a table called mydemo as shown below.
CREATE TABLE [dbo].[mydemo]
(
id int NOT NULL,
my_date date,
my_time time
)
WITH
(
DISTRIBUTION = HASH (id),
CLUSTERED COLUMNSTORE INDEX
)
GO
The following is my source data in my dataflow.
time is not a recognized datatype in azure dataflow (date and timestamp are accepted). Therefore, dataflow fails to convert string (substring(<timestamp_col>,12,5)) into time type.
For better understanding, you can load your sink table as source in dataflow. The time column will be read as 1900-01-01 12:34:56 when time value in the table row is 12:34:56.
#my table row
insert into mydemo values(200,'2022-08-18','12:34:56')
So, instead of using substring(<timestamp_col>,12,5) to return 00:01, use concat('1900-01-01 ',substring(<timestamp_col>,12,8)) which returns 1900-01-01 00:01:00.
Configure the sink, mapping and look at the resulting data in data preview. Now, azure dataflow will be able to successfully insert the values and give desired results.
The following is the output after successful insertion of record into dedicated pool table.
NOTE: You can construct valid yyyy-MM-dd hh:mm:ss as a value using concat('yyyy-MM-dd ',substring(<timestamp_col>,12,8)) in place of 1900-01-01 hh:mm:ss in derived column transformation.

Not able to change datatype of Additional Column in Copy Activity - Azure Data Factory

I am facing very simple problem of not able to change the datatype of additional column in copy activity in ADF pipeline from String to Datetime
I am trying to change source datatype for additional column in mapping using JSON but still it doesn't work with polybase cmd
When I run my pipeline it gives same error
Is it not possible to change datatype of additional column, by default it takes string only
Dynamic columns return string.
Try to put the value [Ex. utcnow()] in the dynamic content of query and cast it to the required target datatype.
Otherwise you can use data-flow-derived-column :
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-derived-column
Since your source is a query, you can choose to bring current date in source SQL query itself in the desired format rather than adding it in the additional column.
Thanks
Try to use formatDateTime as shown below and define the desired Date format:
Here since format given is ‘yyyy-dd-MM’, the result will look as below:
Note: The output here will be of string format only as in Copy activity we could not cast data type as of the date.
We could either create current date in the Source sql query or use above way so that the data would load into the sink in expected format.

Azure Data Factory - Exists transformation in Data Flow with generic dataset

I'm having issues using the Exists Transformation within a Data Flow with a generic dataset.
I have two sources (one from staging table "sourceStg", one from DWH table "sourceDwh") and want to compare if the UniqueIdentifier-Column in the staging table is existing in the UniqueIdentifier-Column in the DWH table. For that I have a generic data set which I query with a SQL statement containing parameters.
When I open the "Exists settings" I cannot choose any Column from the source in the conditions since the source is generic and has no Projection until I run the data flow. However, I have a parameter which I get from the parent pipeline which provides me the name of the Column containing the UniqueIdentifier (both column names in staging / DWH are the same).
I tried to add following statement "byName($UniqueIdentifier)" in the left and right column field but the engine resolves them both as the sourceStg-Column since the prefix of the source-transformations is missing and it defaults to the first one. What I basically now try to achieve is having some statement as followed defining the correct source-transformation and the column containing the unique identifier with a parameter.
exists(sourceStg#$UniqueIdentifier == sourceDwh#$UniqueIdentifier)
But either the expression cannot be parsed or the result does not retrieve the actual UniqueIdentifier value from the column but writes the statement (e.g. sourceStg#$UniqueIdentifier) as column value.
The only workaround I found so far is having two derived columns which adds a suffix to the UniqueIdentifier-Column in one source and a new parameter $UniqueIdentiferDwh which is populate with the parameter $UniqueIdentifier and the same suffix as used in the derived column.
Any Azure Data Factory experts out there to help?
Thanks in advance!

Copy data every 1 minute from DataLake by DataFactory

I have a Data Lake storage with the following folder structure:
{YEAR}
- {MONTH}
- {DAY}
- {HOUR}
- {sometext}_{YEAR}_{MONTH}_{DAY}_{HOUR}_{Minute}_{someuuid}.json
example
Could you please help me to configure Data Factory Copy data action?
I need to run Trigger every 1 minute - to copy data from Data Lake by previous minute to Cosmos DB
I've tried this
where the first expresion is
#formatDateTime(utcnow(),'yyyy/MM/dd/HH')
and the second one
#{formatDateTime(utcnow(),'yyyy')}_#{formatDateTime(utcnow(),'MM')}_#{formatDateTime(utcnow(),'dd')}_#{formatDateTime(utcnow(),'HH')}_#{formatDateTime(addMinutes(utcnow(), -1),'mm')}*.json
But it can skip some data, especially when Hour changes.
I'm a new in Data Factory and don't know what is the more efficient way how to do that. Please help
The Pipeline Expression Language has a number of Date functions built in. You can use the addMinutes function to add 1 minute.
To avoid clock skew, I would capture the utcnow() value and store it without any formatting:
In another variable, add a minute to the captured value rather than executing utcnow() again:
Once you have those variables, just use them to format the date string(s).
Result:
NOTE: use concat with the formatDateString to get the wildcard value you want:
Result:

Resources