ADF: get property "status": "Succeeded" and use an If Condition for validation - Azure

I have a pipeline that pulls data from an external source and sinks it into a SQL Server staging table. Getting the raw data already succeeds, using four 'Copy data' activities; because there are so many columns (250), I split them across the four copies.
The next requirement is to validate those four 'Copy data' activities by checking for a succeeded status. The output of a 'Copy data' activity looks like this:
{
  "dataRead": 4772214,
  "dataWritten": 106918,
  "sourcePeakConnections": 1,
  "sinkPeakConnections": 1,
  "rowsRead": 1366,
  "rowsCopied": 1366,
  "copyDuration": 8,
  "throughput": 582.546,
  "errors": [],
  "effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (Southeast Asia)",
  "usedDataIntegrationUnits": 4,
  "billingReference": {
    "activityType": "DataMovement",
    "billableDuration": [
      {
        "meterType": "AzureIR",
        "duration": 0.016666666666666666,
        "unit": "DIUHours"
      }
    ]
  },
  "usedParallelCopies": 1,
  "executionDetails": [
    {
      "source": {
        "type": "RestService"
      },
      "sink": {
        "type": "AzureSqlDatabase",
        "region": "Southeast Asia"
      },
      "status": "Succeeded",
      "start": "2022-04-13T07:16:48.5905628Z",
      "duration": 8,
      "usedDataIntegrationUnits": 4,
      "usedParallelCopies": 1,
      "profile": {
        "queue": {
          "status": "Completed",
          "duration": 4
        },
        "transfer": {
          "status": "Completed",
          "duration": 4,
          "details": {
            "readingFromSource": {
              "type": "RestService",
              "workingDuration": 1,
              "timeToFirstByte": 1
            },
            "writingToSink": {
              "type": "AzureSqlDatabase",
              "workingDuration": 0
            }
          }
        }
      },
      "detailedDurations": {
        "queuingDuration": 4,
        "timeToFirstByte": 1,
        "transferDuration": 3
      }
    }
  ],
  "dataConsistencyVerification": {
    "VerificationResult": "NotVerified"
  },
  "durationInQueue": {
    "integrationRuntimeQueue": 0
  }
}
Now I want to get "status": "Succeeded" from the JSON output and validate it in an 'If Condition'. So I set the variable's Value in the dynamic content to @activity('copy_data_Kobo_MBS').output,
but when it runs, I get this error:
The variable 'copy_Kobo_MBS' of type 'Boolean' cannot be initialized
or updated with value of type 'Object'. The variable 'copy_Kobo_MBS'
only supports values of types 'Boolean'.
The question is: how do I get "status": "Succeeded" from the JSON output into the variable's value, so the 'If Condition' can examine it?

You can use the expression below to pull the run status from the Copy data activity. Because your variable is of Boolean type, you need to evaluate the status with the @equals() function, which returns true or false.
@equals(activity('Copy data1').output.executionDetails[0].status,'Succeeded')
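Since your pipeline has four Copy data activities, the If Condition can check all of them in a single expression by nesting and() (which takes exactly two arguments). A minimal sketch, assuming the activities are named 'Copy data1' through 'Copy data4' (hypothetical names):
@and(
    and(
        equals(activity('Copy data1').output.executionDetails[0].status, 'Succeeded'),
        equals(activity('Copy data2').output.executionDetails[0].status, 'Succeeded')
    ),
    and(
        equals(activity('Copy data3').output.executionDetails[0].status, 'Succeeded'),
        equals(activity('Copy data4').output.executionDetails[0].status, 'Succeeded')
    )
)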
Note that you don't actually have to extract the status from the Copy data activity, because you are connecting the Copy activity to the Set Variable activity on its success output.
That means your Set Variable activity runs only when your Copy data activity has run successfully.
Also note that:
If the Copy data activity (or any other activity) fails, then the activities chained to its success output will not run.
If you connect the output of more than one activity to a single activity, it runs only when all the connected activities have run successfully.
You can chain activities on the failure or completion outputs to process further.
Example:
In the snip below, the Set Variable activity did not run because the Copy data activity was not successful, and the Wait2 activity did not run because not all of its input activities ran successfully.

Related

Get Stored Procedure Output DataSet from Azure Data Factory

I have a stored procedure that accepts a single input parameter and returns a data set. I want to invoke this stored procedure from my ADF pipeline and, with the stored proc's data, call another proc whose result I want to process further.
I tried the Stored Procedure activity, but its output doesn't contain the actual data set:
{
  "effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (Australia East)",
  "executionDuration": 0,
  "durationInQueue": {
    "integrationRuntimeQueue": 0
  },
  "billingReference": {
    "activityType": "ExternalActivity",
    "billableDuration": [
      {
        "meterType": "AzureIR",
        "duration": 0.016666666666666666,
        "unit": "Hours"
      }
    ]
  }
}
I also tried the Lookup activity, but its result contains only the first row of the resultant data set:
{
  "firstRow": {
    "CountryID": 1411,
    "CountryName": "Maldives",
    "PresidentName": "XXXX"
  },
  "effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (Australia East)",
  "billingReference": {
    "activityType": "PipelineActivity",
    "billableDuration": [
      {
        "meterType": "AzureIR",
        "duration": 0.016666666666666666,
        "unit": "DIUHours"
      }
    ]
  },
  "durationInQueue": {
    "integrationRuntimeQueue": 0
  }
}
My main intention behind using ADF is to reduce the huge amount of time an existing API (.NET Core) takes for the same steps. What else can be done? Should I consider any other Azure service(s)?

Avoid duplicates during parallel job run in Azure Synapse

Need your suggestions on developing code in Azure Synapse.
We have a requirement where our jobs will run in parallel at the same time and insert data into the same table.
During these inserts there is a chance that duplicate entries will be inserted into the same table.
For example: if Job A and Job B run at the same time, both with the same values, then "not exists" or "not in" checks fail to work, and I get duplicates from both jobs. Primary key and unique constraints allow duplicates in Azure Synapse, since they are not enforced. Is there a good way to lock tables during data insert, so that if Job A is running, Job B cannot insert into the same table? Please share your suggestions, as I am new to this. Note: we load the data through a stored procedure called from ADF V2.
Thanks,
Nandini
Duplicates must be handled within each job before inserting data into Azure Synapse. If the duplicates exist between two jobs, deduplicate after both jobs complete. It really depends on how you are loading the data. You can manage this easily by loading into a temp (staging) table instead of directly into the final table; make sure the staging table has the same structure as the final table (distribution, partitioning, constraints, nullability of the columns). You can then use bcp / INSERT INTO ... SELECT / CTAS, or CTAS with partition switching, to move data from the staging table to the final table.
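For illustration, a minimal T-SQL sketch of the staging-then-dedup pattern (table and column names are hypothetical):
-- Clone the final table's structure into a staging table (same distribution).
CREATE TABLE dbo.stg_Sales
WITH (DISTRIBUTION = HASH(SaleId), HEAP)
AS SELECT * FROM dbo.Sales WHERE 1 = 2;

-- ... the job loads dbo.stg_Sales here ...

-- Insert only rows not already in the final table, keeping one row per key.
INSERT INTO dbo.Sales (SaleId, Amount, LoadDate)
SELECT SaleId, Amount, LoadDate
FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (PARTITION BY s.SaleId ORDER BY s.LoadDate DESC) AS rn
    FROM dbo.stg_Sales AS s
) d
WHERE d.rn = 1
  AND NOT EXISTS (SELECT 1 FROM dbo.Sales t WHERE t.SaleId = d.SaleId);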
If you can share the specific scenario, it will be easier to give suggestions relevant to your use case.
I just hit the same case, and I solved it with Pipeline Runs - Query By Factory.
Use an Until activity before the Data Flow activity that writes the values to the table, with the expression @equals(activity('pingPL').output.value[0].runId, pipeline().RunId), as follows:
Inside the Until activity, put a Web activity and a Wait activity:
a. Web activity body - following the docs:
{
  "lastUpdatedAfter": "@{addminutes(utcnow(), -30)}",
  "lastUpdatedBefore": "@{utcnow()}",
  "filters": [
    {
      "operand": "PipelineName",
      "operator": "Equals",
      "values": [
        "pipeline_name_where_writeInSynapse_is_located"
      ]
    },
    {
      "operand": "Status",
      "operator": "Equals",
      "values": [
        "InProgress"
      ]
    }
  ]
}
b. Wait activity: 30 seconds, or whatever makes sense.
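For reference, a hedged sketch of the Web activity settings calling the Pipeline Runs - Query By Factory REST endpoint (subscription, resource group, and factory names are placeholders; managed-identity authentication is one option, and the body is the filter JSON shown above):
{
  "name": "pingPL",
  "type": "WebActivity",
  "typeProperties": {
    "url": "https://management.azure.com/subscriptions/<subscriptionId>/resourceGroups/<resourceGroup>/providers/Microsoft.DataFactory/factories/<factoryName>/queryPipelineRuns?api-version=2018-06-01",
    "method": "POST",
    "authentication": {
      "type": "MSI",
      "resource": "https://management.azure.com/"
    },
    "body": "... the filter body shown above ..."
  }
}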
What happens is: if you trigger the same pipeline several times in parallel, the Web activity filters for each pipeline run whose status is InProgress. The response looks like this:
{
  "value": [
    {
      "id": "...",
      "runId": "52004775-5ef5-493b-8a44-ee3fff6bff7b",
      "debugRunId": null,
      "runGroupId": "52004775-5ef5-493b-8a44-ee3fff6bff7b",
      "pipelineName": "synapse_writting",
      "parameters": {
        "region": "NW",
        "unique_item": "a"
      },
      "invokedBy": {
        "id": "80efce4dbda74636878bc99472978ccf",
        "name": "Manual",
        "invokedByType": "Manual"
      },
      "runStart": "2021-10-13T17:24:01.0210945Z",
      "runEnd": "2021-10-13T17:25:06.9692394Z",
      "durationInMs": 65948,
      "status": "InProgress",
      "message": "",
      "output": null,
      "lastUpdated": "2021-10-13T17:25:06.9704432Z",
      "annotations": [],
      "runDimension": {},
      "isLatest": true
    },
    {
      "id": "...",
      "runId": "cf3f5038-ba10-44c3-b8f5-df8ad4c85819",
      "debugRunId": null,
      "runGroupId": "cf3f5038-ba10-44c3-b8f5-df8ad4c85819",
      "pipelineName": "synapse_writting",
      "parameters": {
        "region": "NW",
        "unique_item": "a"
      },
      "invokedBy": {
        "id": "08205e0eda0b41f6b5a90a8dda06a7f6",
        "name": "Manual",
        "invokedByType": "Manual"
      },
      "runStart": "2021-10-13T17:28:58.219611Z",
      "runEnd": null,
      "durationInMs": null,
      "status": "InProgress",
      "message": "",
      "output": null,
      "lastUpdated": "2021-10-13T17:29:00.9860175Z",
      "annotations": [],
      "runDimension": {},
      "isLatest": true
    }
  ],
  "ADFWebActivityResponseHeaders": {
    "Pragma": "no-cache",
    "Strict-Transport-Security": "max-age=31536000; includeSubDomains",
    "X-Content-Type-Options": "nosniff",
    "x-ms-ratelimit-remaining-subscription-reads": "11999",
    "x-ms-request-id": "188508ef-8897-4c21-8c37-ccdd4adc6d81",
    "x-ms-correlation-request-id": "188508ef-8897-4c21-8c37-ccdd4adc6d81",
    "x-ms-routing-request-id": "WESTUS2:20211013T172902Z:188508ef-8897-4c21-8c37-ccdd4adc6d81",
    "Cache-Control": "no-cache",
    "Date": "Wed, 13 Oct 2021 17:29:02 GMT",
    "Server": "Microsoft-IIS/10.0",
    "X-Powered-By": "ASP.NET",
    "Content-Length": "1492",
    "Content-Type": "application/json; charset=utf-8",
    "Expires": "-1"
  },
  "effectiveIntegrationRuntime": "NCAP-Simple-DataMovement (West US 2)",
  "executionDuration": 0,
  "durationInQueue": {
    "integrationRuntimeQueue": 0
  },
  "billingReference": {
    "activityType": "ExternalActivity",
    "billableDuration": [
      {
        "meterType": "AzureIR",
        "duration": 0.016666666666666666,
        "unit": "Hours"
      }
    ]
  }
}
The Until expression then checks whether the first element, value[0], has runId == pipeline().RunId; when it does, the Until activity stops and the data flow that writes to Synapse runs. Once that pipeline run ends, its status becomes Succeeded, so the Web activity in another run picks up the next value[0] with status InProgress and continues with the next write. This makes the parallel jobs wait their turn before the data flow validates and writes to the table, if needed.
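Putting it together, a hedged sketch of the Until activity shell (the activity name and timeout are assumptions; the Web activity 'pingPL' and the Wait activity from steps a and b go inside the activities array, elided here):
{
  "name": "WaitForMyTurn",
  "type": "Until",
  "typeProperties": {
    "expression": {
      "value": "@equals(activity('pingPL').output.value[0].runId, pipeline().RunId)",
      "type": "Expression"
    },
    "timeout": "0.02:00:00",
    "activities": []
  }
}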

Azure Data Factory copy pipeline: get a file with yesterday's date that arrived today

I am getting a txt file (on today's date) with yesterday's date in its name, and I want to resolve this filename dynamically in my Data Factory pipeline.
The file is placed automatically on a file system, and I want to copy it to blob storage. In my example below I simulate this by copying from blob to blob.
For example:
filename_2018-02-11.txt arrives today (2018-02-12) with yesterday's date (2018-02-11) in its name. How can I pick this file up on today's date?
Yesterday's slice did run, but the file was not there yet.
Here is my example:
{"$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.Pipeline.json",
"name": "CopyPipeline-fromBlobToBlob",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": true
},
"sink": {
"type": "BlobSink",
"copyBehavior": "",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
},
"enableSkipIncompatibleRow": true
},
"inputs": [
{
"name": "InputDataset-1"
}
],
"outputs": [
{
"name": "OutputDataset-1"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Day",
"interval": 1,
"offset": "05:00:00"
},
"name": "activity_00"
}
],
"start": "2018-03-07T00:00:00Z",
"end": "2020-03-08T00:00:00Z",
"isPaused": false,
"pipelineMode": "Scheduled"
}
}
You can use EndOfInterval instead of StartOfInterval in the policy. That executes the slice at the end of the day instead of the start. You may also want to set an appropriate offset if the file is not available at midnight.
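For clarity, the policy block from the pipeline above with only the style changed (a sketch of the same activity):
"policy": {
  "timeout": "1.00:00:00",
  "concurrency": 1,
  "executionPriorityOrder": "NewestFirst",
  "style": "EndOfInterval",
  "retry": 3,
  "longRetry": 0,
  "longRetryInterval": "00:00:00"
}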
In ADF v2 you can use built-in system variables (@pipeline().TriggerTime):
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-system-variables
And in the source dataset (InputDataset-1) set the file path/file name to something like this:
@concat('YOUR BASE PATH IN BLOB/', 'filename_',
    adddays(pipeline().TriggerTime, -1, 'yyyy'), '-',
    adddays(pipeline().TriggerTime, -1, 'MM'), '-',
    adddays(pipeline().TriggerTime, -1, 'dd'), '.txt')
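If you prefer a single call, formatDateTime with adddays gives the same result (a sketch with the same placeholder base path):
@concat('YOUR BASE PATH IN BLOB/', 'filename_',
    formatDateTime(adddays(pipeline().TriggerTime, -1), 'yyyy-MM-dd'), '.txt')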
You can also use @trigger().scheduledTime, so that you always get the same date even if, for example, the pipeline fails and is rerun.
But remember that it is only available in trigger scope; in my tests it was only evaluated for a Schedule trigger.

Azure ADF sliceIdentifierColumnName is not populating correctly

I've set up an ADF pipeline using sliceIdentifierColumnName, which worked well: it populated the field with a GUID as expected. Recently, however, this field stopped being populated; the refresh would work, but the sliceIdentifierColumnName field would have a value of null, or occasionally the load would fail as it attempted to populate the field with a value of 1, which causes the slice load to fail.
This change occurred at a single point in time: before it, everything worked perfectly; after it, the load repeatedly failed to populate the field correctly. I'm sure no changes were made to the pipeline that could cause this. Any pointers where I should be looking?
Here is an extract of the pipeline source; I'm reading from a table in Amazon Redshift and writing to an Azure SQL table.
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "$$Text.Format('select * from mytable where eventtime >= \\'{0:yyyy-MM-ddTHH:mm:ssZ}\\' and eventtime < \\'{1:yyyy-MM-ddTHH:mm:ssZ}\\' ' , SliceStart, SliceEnd)"
},
"sink": {
"type": "SqlSink",
"sliceIdentifierColumnName": "ColumnForADFuseOnly",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "AmazonRedshiftSomeName"
}
],
"outputs": [
{
"name": "AzureSQLDatasetSomeName"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 10,
"style": "StartOfInterval",
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Hour",
"interval": 2
},
"name": "Activity-somename2Hour"
}
],
Also, here is the error output text
Copy activity encountered a user error at Sink:.database.windows.net side: ErrorCode=UserErrorInvalidDataValue,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Column 'ColumnForADFuseOnly' contains an invalid value '1'.,Source=Microsoft.DataTransfer.Common,''Type=System.ArgumentException,Message=Type of value has a mismatch with column typeCouldn't store <1> in ColumnForADFuseOnly Column.
Expected type is Byte[].,Source=System.Data,''Type=System.ArgumentException,Message=Type of value has a mismatch with column type,Source=System.Data,'.
Here is part of the source dataset; it's a table with all data types as strings.
{
  "name": "AmazonRedshiftsomename_2hourly",
  "properties": {
    "structure": [
      {
        "name": "eventid",
        "type": "String"
      },
      {
        "name": "visitorid",
        "type": "String"
      },
      {
        "name": "eventtime",
        "type": "Datetime"
      }
    ]
  }
}
Finally, the target table is identical to the source table, mapping each column name to its counterpart in Azure, with the exception of an additional column in Azure named
[ColumnForADFuseOnly] binary NULL,
It is this column which is now being populated with either NULLs or 1.
Thanks,
You need to define [ColumnForADFuseOnly] as binary(32); binary with no length modifier defaults to a length of 1 and thus truncates your slice identifier.
"When n is not specified in a data definition or variable declaration statement, the default length is 1. When n is not specified with the CAST function, the default length is 30." See here.
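A minimal T-SQL sketch of the fix (the table name is hypothetical); widening the column lets the GUID-based slice identifier fit:
-- binary with no length is binary(1), which cannot hold the slice identifier.
ALTER TABLE dbo.MyTargetTable
ALTER COLUMN ColumnForADFuseOnly binary(32) NULL;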

Call stored procedure using ADF

I am loading a SQL Server table using ADF, and after the insert is over I have to do a little manipulation. I tried the approaches below:
Trigger (after insert) - failed; SQL Server does not detect the inserted records that I push using ADF. Seems to be a bug.
Stored procedure using a user-defined table type - getting this error:
Error Number '156'. Error message from database execution: Incorrect
syntax near the keyword 'select'. Must declare the table variable
"@a".
I have created the pipeline below:
{
  "name": "CopyPipeline-xxx",
  "properties": {
    "activities": [
      {
        "type": "Copy",
        "typeProperties": {
          "source": {
            "type": "AzureDataLakeStoreSource",
            "recursive": false
          },
          "sink": {
            "type": "SqlSink",
            "sqlWriterStoredProcedureName": "sp_xxx",
            "storedProcedureParameters": {
              "stringProductData": {
                "value": "str1"
              }
            },
            "writeBatchSize": 0,
            "writeBatchTimeout": "00:00:00"
          },
          "translator": {
            "type": "TabularTranslator",
            "columnMappings": "col1:col1,col2:col2"
          }
        },
        "inputs": [
          {
            "name": "InputDataset-3jg"
          }
        ],
        "outputs": [
          {
            "name": "OutputDataset-3jg"
          }
        ],
        "policy": {
          "timeout": "1.00:00:00",
          "concurrency": 1,
          "executionPriorityOrder": "NewestFirst",
          "style": "StartOfInterval",
          "retry": 3,
          "longRetry": 0,
          "longRetryInterval": "00:00:00"
        },
        "scheduler": {
          "frequency": "Hour",
          "interval": 8
        },
        "name": "Activity-0-xxx_csv->[dbo]_[xxx_staging]"
      }
    ],
    "start": "2017-01-09T21:48:53.348Z",
    "end": "2099-12-30T18:30:00Z",
    "isPaused": false,
    "hubName": "hub",
    "pipelineMode": "Scheduled"
  }
}
and this stored procedure:
create procedure [dbo].[sp_xxx] @xxx1 [dbo].[ut_xxx] READONLY, @str1 varchar(100) AS
MERGE xxx_dummy AS a
USING @xxx1 AS b
    ON (a.col1 = b.col1)
WHEN NOT MATCHED
    THEN INSERT (col1, col2)
    VALUES (b.col1, b.col2)
WHEN MATCHED
    THEN UPDATE SET a.col2 = b.col2;
Please help me to resolve the issue.
I can reproduce your first error. Inserting into a SQL Server table with Azure Data Factory (ADF) appears to use a bulk-insert method (similar to BULK INSERT, bcp, SSIS etc.), and by default these methods do not fire triggers:
insert bulk [dbo].[testADF] ([col1] Int, [col2] Int, [col3] Int, [col4] Int)
with (TABLOCK, CHECK_CONSTRAINTS)
With bcp and BULK INSERT there is a flag (FIRE_TRIGGERS) to make triggers fire, but there appears to be no way to change this setting for ADF. As a workaround, move the logic from your trigger into the stored proc.
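For example, a hedged sketch of folding the former trigger logic into your existing proc (the audit table and columns are hypothetical):
alter procedure [dbo].[sp_xxx] @xxx1 [dbo].[ut_xxx] READONLY, @str1 varchar(100) AS
BEGIN
    MERGE xxx_dummy AS a
    USING @xxx1 AS b
        ON (a.col1 = b.col1)
    WHEN NOT MATCHED
        THEN INSERT (col1, col2) VALUES (b.col1, b.col2)
    WHEN MATCHED
        THEN UPDATE SET a.col2 = b.col2;

    -- Former AFTER INSERT trigger logic runs here, against the same rows.
    INSERT INTO dbo.xxx_audit (col1, loaded_at)  -- hypothetical audit table
    SELECT col1, SYSUTCDATETIME() FROM @xxx1;
END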
If you believe this flag is important, consider creating a feedback item.
