Avoid duplicates during parallel job run in Azure Synapse

Avoid duplicates during parallel job run in Azure Synapse - azure

Need your suggestions in developing code in Azure Synapse.
We have a requirement where our jobs will run in parallel at same time and insert data to the same table.
During this insert there are changes that duplicate entries will be inserted to the same table.
For Example: If Job A and Job B run at same time both with same values then "not exists" or "not in" will fail to work. In this case I will get duplicates from both the job. Primary key or Unique constraint allows duplicates in Azure synapse. Is there any best way to lock tables during data insert. Like if Job A is running then JOB B should not insert the data to same table. Please pour your suggestions as I am new to this. Note: We use stored Procedure to load the data through ADF V2
Thanks,
Nandini

Duplicates must be handled within jobs before inserting data into Azure Synapse. If the duplicates exists between two jobs, then do it after completion of both jobs. It depends really how you are loading data. You can easily manage by creating a temp table instead of directly loading data to final table. Please make sure the structure of temp table should be same as final table (Distribution, Partition, constraints, nullability of the columns) You can use SQL BCP/INSERT TO/CTAS/CTAS with partition switching with stage table to final table.
If you can share specific scenario, it will be helpful to give suggestions relevant to your use case.

I just got the same case and I solved it with Pipeline Runs - Query By Factory
Use a Until activity before the DataFlow activity that writes the values in the table with this expression #equals(activity('pingPL').output.value[0].runId, pipeline().RunId) as follow:
Into the Until activities put a web activity and a wait time:
a. Web activity body - follow docs:
{
"lastUpdatedAfter": "#{addminutes(utcnow(), -30)}",
"lastUpdatedBefore": "#{utcnow()}",
"filters": [
{
"operand": "PipelineName",
"operator": "Equals",
"values": [
"pipeline_name_where_writeInSynapse_is_located"
]
},
{
"operand": "Status",
"operator": "Equals",
"values": [
"InProgress"
]
}
]
}
b. Wait activity 30 sec or whatever make sense
What is happening is, if you trigger several times the same pipeline in parallel the web activity is going to filter each PL status InProgress. It will look like this:
{
"value": [
{
"id": "...",
"runId": "52004775-5ef5-493b-8a44-ee3fff6bff7b",
"debugRunId": null,
"runGroupId": "52004775-5ef5-493b-8a44-ee3fff6bff7b",
"pipelineName": "synapse_writting",
"parameters": {
"region": "NW",
"unique_item": "a"
},
"invokedBy": {
"id": "80efce4dbda74636878bc99472978ccf",
"name": "Manual",
"invokedByType": "Manual"
},
"runStart": "2021-10-13T17:24:01.0210945Z",
"runEnd": "2021-10-13T17:25:06.9692394Z",
"durationInMs": 65948,
"status": "InProgress",
"message": "",
"output": null,
"lastUpdated": "2021-10-13T17:25:06.9704432Z",
"annotations": [],
"runDimension": {},
"isLatest": true
},
{
"id": "...",
"runId": "cf3f5038-ba10-44c3-b8f5-df8ad4c85819",
"debugRunId": null,
"runGroupId": "cf3f5038-ba10-44c3-b8f5-df8ad4c85819",
"pipelineName": "synapse_writting",
"parameters": {
"region": "NW",
"unique_item": "a"
},
"invokedBy": {
"id": "08205e0eda0b41f6b5a90a8dda06a7f6",
"name": "Manual",
"invokedByType": "Manual"
},
"runStart": "2021-10-13T17:28:58.219611Z",
"runEnd": null,
"durationInMs": null,
"status": "InProgress",
"message": "",
"output": null,
"lastUpdated": "2021-10-13T17:29:00.9860175Z",
"annotations": [],
"runDimension": {},
"isLatest": true
}
],
"ADFWebActivityResponseHeaders": {
"Pragma": "no-cache",
"Strict-Transport-Security": "max-age=31536000; includeSubDomains",
"X-Content-Type-Options": "nosniff",
"x-ms-ratelimit-remaining-subscription-reads": "11999",
"x-ms-request-id": "188508ef-8897-4c21-8c37-ccdd4adc6d81",
"x-ms-correlation-request-id": "188508ef-8897-4c21-8c37-ccdd4adc6d81",
"x-ms-routing-request-id": "WESTUS2:20211013T172902Z:188508ef-8897-4c21-8c37-ccdd4adc6d81",
"Cache-Control": "no-cache",
"Date": "Wed, 13 Oct 2021 17:29:02 GMT",
"Server": "Microsoft-IIS/10.0",
"X-Powered-By": "ASP.NET",
"Content-Length": "1492",
"Content-Type": "application/json; charset=utf-8",
"Expires": "-1"
},
"effectiveIntegrationRuntime": "NCAP-Simple-DataMovement (West US 2)",
"executionDuration": 0,
"durationInQueue": {
"integrationRuntimeQueue": 0
},
"billingReference": {
"activityType": "ExternalActivity",
"billableDuration": [
{
"meterType": "AzureIR",
"duration": 0.016666666666666666,
"unit": "Hours"
}
]
}
}
Then the Until expression will evaluate if the first value[0] has runId == pipeline_runid to stop the until activity and run the dataflow that writes in Synapse. Once the PL ends the status will be Succeeded and the Web activity in another job will get the next value[0] with the status InProgress and continue with the next write. This creates a dependency to the parallel jobs to wait until the dataflow validates and writes in table if need it.

Related

How do I retrieve multiple callbacks on translation progress

I have created a webhook that is using the even extraction.updated that should trigger when a job is in progress. I want to retrieve multiple calls on the progress of the translation so that I can show it in my progress bar. Unfortunately I only retrieve a callback when the job translation is finished. When I create the job I set the misc.workflow parameter and same goes for the hook. Am I missing some parameters when creating a webhook or posting a job?
I was following this tutorial: https://forge.autodesk.com/en/docs/webhooks/v1/tutorials/create-a-hook-model-derivative/
The job payload takes the input which is my urn, output which is the filetype(svf2) and views(2d,3d), and misc which is the workflow(testworkflowname)
Callback result:
{{
"version": "1.0",
"resourceUrn": "<my-resourceUrn>",
"hook": {
"hookId": "<my-hookId>",
"tenant": "testworkflowname",
"callbackUrl": "<my-callbackUrl>",
"createdBy": "<my-createdBy>",
"event": "extraction.updated",
"createdDate": "<my-createdDate>",
"lastUpdatedDate": "<my-lastUpdatedDate>",
"system": "derivative",
"creatorType": "Application",
"status": "active",
"scope": {
"workflow": "testworkflowname"
},
"hookAttribute": {
"progress": "test"
},
"autoReactivateHook": false,
"urn": "<my-urn>"
},
"payload": {
"TimeStamp": <my-timestamp>,
"Env": "production",
"URN": "<my-urn>",
"EventType": "UPDATED",
"Payload": {
"status": "success",
"bubble": {
"guid":"<my-guid>",
"owner": "<my-owner>",
"hasThumbnail": "true",
"startedAt": "my-startedAt>",
"type": "design",
"urn":"<my-urn>",
"success": "100%",
"progress": "complete",
"region": "US",
"status": "success",
"children": []
},
"scope": "<my-scope>",
"registerKey": []
},
"WorkflowAttributes": null
}
}}

You've got your webhooks setup correctly. I'm afraid this is a limitation on the Model Derivative service side. The service can translate over 60 different file formats today, and as you can imagine, different formats must be converted using different libraries. And while some of the converters support progress reporting, others may not, so being able to get notified of translation progress really depends on the file format you're processing.

kafka-connect : Error in sink cassandra connector

I get a runtime Error for a sink cassandra connector, I try to pick data form kafka and store it in cassandra, You find the error stack below :
{
"name": "cassandraSinkConnector2",
"connector": {
"state": "RUNNING",
"worker_id": "localhost:8083"
},
"tasks": [
{
"id": 0,
"state": "FAILED",
"worker_id": "localhost:8083",
"trace": "org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:560)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:321)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:224)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:192)\n\tat org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)\n\tat org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\nCaused by: org.apache.kafka.connect.errors.DataException: Key must be a struct or map. This connector requires that records from Kafka contain the keys for the Cassandra table. Please use a transformation like org.apache.kafka.connect.transforms.ValueToKey to create a key with the proper fields.\n\tat io.confluent.connect.cassandra.CassandraSinkTask.put(CassandraSinkTask.java:94)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:538)\n\t... 10 more\n"
}
],
"type": "sink"
}
I used the distributed configuration below for my connector:
{
"name": "cassandraSinkConnector2",
"config": {
"connector.class": "io.confluent.connect.cassandra.CassandraSinkConnector",
"tasks.max": "1",
"topics": "appartenance_de",
"cassandra.contact.points": "localhost",
"cassandra.kcql": "INSERT INTO app_test SELECT * FROM app_de",
"cassandra.port": "9042",
"cassandra.keyspace": "dev_dkks",
"cassandra.username": "superuser",
"cassandra.password": "superuser",
"cassandra.write.mode": "upsert",
"value.converter.schemas.enable": "true",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://localhost:8081",
"transforms": "createKey,extractInt",
"transforms.createKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createKey.fields": "id",
"transforms.extractInt.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractInt.field": "id",
"name": "cassandraSinkConnector2"
},
"tasks": [
{
"connector": "cassandraSinkConnector2",
"task": 0
}
],
"type": "sink"

Per my answer here, the error you're seeing is
org.apache.kafka.connect.errors.DataException:
Record with a null key was encountered. This connector requires that records from Kafka contain the keys for the Cassandra table.
Please use a transformation like org.apache.kafka.connect.transforms.ValueToKey to create a key with the proper fields.
I'd suggest using a Single Message Transform as suggested in the error to correctly key your data. You can see an example of doing this here and the documentation for the transform here.

Azure ADF sliceIdentifierColumnName is not populating correctly

I've set up a ADF pipeline using a sliceIdentifierColumnName which has worked well as it populated the field with a GUID as expected. Recently however this field stopped being populated, the refresh would work but the sliceIdentifierColumnName field would have a value of null, or occasionally the load would fail as it attempted to populate this field with a value of 1 which causes the slice load to fail.
This change occurred at a point in time, before it worked perfectly, after it repeatedly failed to populate the field correctly. I'm sure no changes were made to the Pipeline which caused this to suddenly fail. Any pointers where I should be looking?
Here an extract of the pipeline source, I'm reading from a table in Amazon Redshift and writing to an Azure SQL table.
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "$$Text.Format('select * from mytable where eventtime >= \\'{0:yyyy-MM-ddTHH:mm:ssZ}\\' and eventtime < \\'{1:yyyy-MM-ddTHH:mm:ssZ}\\' ' , SliceStart, SliceEnd)"
},
"sink": {
"type": "SqlSink",
"sliceIdentifierColumnName": "ColumnForADFuseOnly",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "AmazonRedshiftSomeName"
}
],
"outputs": [
{
"name": "AzureSQLDatasetSomeName"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 10,
"style": "StartOfInterval",
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Hour",
"interval": 2
},
"name": "Activity-somename2Hour"
}
],
Also, here is the error output text
Copy activity encountered a user error at Sink:.database.windows.net side: ErrorCode=UserErrorInvalidDataValue,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Column 'ColumnForADFuseOnly' contains an invalid value '1'.,Source=Microsoft.DataTransfer.Common,''Type=System.ArgumentException,Message=Type of value has a mismatch with column typeCouldn't store <1> in ColumnForADFuseOnly Column.
Expected type is Byte[].,Source=System.Data,''Type=System.ArgumentException,Message=Type of value has a mismatch with column type,Source=System.Data,'.
Here is part of the source dataset, it's a table with all datatypes as Strings.
{
"name": "AmazonRedshiftsomename_2hourly",
"properties": {
"structure": [
{
"name": "eventid",
"type": "String"
},
{
"name": "visitorid",
"type": "String"
},
{
"name": "eventtime",
"type": "Datetime"
}
}
Finally, the target table is identical to the source table, mapping each column name to its counterpart in Azure, with the exception of the additional column in Azure named
[ColumnForADFuseOnly] binary NULL,
It is this column which is now either being populated with NULLs or 1.
thanks,

You need to define [ColumnForADFuseOnly] as binary(32), binary with no length modifier is defaulting to a length of 1 and thus truncating your sliceIdentifier...
When n is not specified in a data definition or variable declaration statement, the default length is 1. When n is not specified with the CAST function, the default length is 30. See here

How to create a periodic copy of data from OData to Azure DocumentDB

I am trying to make a periodic copy of all the data returning from an OData query into a documentDB collection, on a daily basis.
The copy works fine using the copy wizard, which is A REALLY GREAT option for simple tasks.  Thanks for that.
What isn't working for me though:  The copy just adds data each time, and I have NO WAY that I can SEE with a documentDB sink to "pre-delete" the data in the collection (compare to the SQL sink which has sqlWriterCleanupScript, which I could set to something like Delete * from 'table').
I know I can create an Azure Batch and do what I need, but at this point, I'm not sure that it isn't better to do a function and forego the Azure Data Factory (ADF) for this move.  I'm using ADF for replicating on-prem SQL stuff just fine, because it has the writer cleanup script.
At this point, I'd like to just use DocumentDB but I don't see a way to do it given the way my data works.
Here's a look at my pipeline:
{
"name": "R-------ProjectToDocDB",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": " "
},
"sink": {
"type": "DocumentDbCollectionSink",
"nestingSeparator": ".",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
/// this is where a cleanup script would be great.
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "ProjectId:ProjectId,.....:CostClassification"
}
},
"inputs": [
{
"name": "InputDataset-shc"
}
],
"outputs": [
{
"name": "OutputDataset-shc"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "Activity-0-_Custom query_->---Project"
}
],
"start": "2017-04-26T20:13:27.683Z",
"end": "2099-12-31T05:00:00Z",
"isPaused": false,
"hubName": "r-----datafactory01_hub",
"pipelineMode": "Scheduled"
}
}
Perhaps there's an update in the pipeline that creates parity between SQL output and DocumentDB

Azure Data Factory did not support clean up script for DocDB today. It's something in our backlog. If you can describe a little bit more for the E2E scenario, could help us priorities. For example, why append to the same collection not work? Is that because there's no way to identify the incremental records after each run? For the clean up requirement, will that always be delete * or it might be based on time stamp, etc. Thanks. Before the support for clean up script was there, custom activity was the only way to workaround now, sorry.

You could use a Logic App that runs on a Timer Trigger.

Call stored procedure using ADF

I am loading SQL server table using ADF and after insertion is over, I have to do little manipulation using below approach
Trigger (After insert) - Failed, SQL server not able to detect inserted record that I push using ADF.. **Seems to be a bug**.
Stored procedure using user defined table type - Getting error
Error Number '156'. Error message from database execution : Incorrect
syntax near the keyword 'select'. Must declare the table variable
"#a".
I have created below pipeline
{
"name": "CopyPipeline-xxx",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "AzureDataLakeStoreSource",
"recursive": false
},
"sink": {
"type": "SqlSink",
"sqlWriterStoredProcedureName": "sp_xxx",
"storedProcedureParameters": {
"stringProductData": {
"value": "str1"
}
},
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "col1:col1,col2:col2"
}
},
"inputs": [
{
"name": "InputDataset-3jg"
}
],
"outputs": [
{
"name": "OutputDataset-3jg"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Hour",
"interval": 8
},
"name": "Activity-0-xxx_csv->[dbo]_[xxx_staging]"
}
],
"start": "2017-01-09T21:48:53.348Z",
"end": "2099-12-30T18:30:00Z",
"isPaused": false,
"hubName": "hub",
"pipelineMode": "Scheduled"
}
}
and using below stored procedure
create procedure [dbo].[sp_xxx] #xxx1 [dbo].[ut_xxx] READONLY, #str1 varchar(100) AS
MERGE xxx_dummy AS a
USING #xxx1 AS b
ON (a.col1 = b.col1)
WHEN NOT MATCHED
THEN INSERT(col1, col2)
VALUES(b.col1, b.col2)
WHEN MATCHED
THEN UPDATE SET a.col2 = b.col2;
Please help me to resolve the issue.

I can reproduce your first error. Inserting to a SQL Server table with Azure Data Factory (ADF) appears to use a bulk insert method (similar to BULK INSERT, bcp, SSIS etc) and by default these methods do not fire triggers:
insert bulk [dbo].[testADF] ([col1] Int, [col2] Int, [col3] Int, [col4] Int)
with (TABLOCK, CHECK_CONSTRAINTS)
With bcp, BULK INSERT there is a flag to change to say 'fire triggers' but it appears there is no way to change this setting for ADF. As a workaround, move the logic from your trigger into the stored proc.
If you believe this flag is important, consider creating a feedback item.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Avoid duplicates during parallel job run in Azure Synapse - azure

Related

How do I retrieve multiple callbacks on translation progress

kafka-connect : Error in sink cassandra connector

Azure ADF sliceIdentifierColumnName is not populating correctly

How to create a periodic copy of data from OData to Azure DocumentDB

Call stored procedure using ADF

Categories

Resources