Azure Data Factory - Order of actions inside the Copy Activity

I have a tricky question about the "Copy Activity" in ADF. Assume the following scenario:
Source: an external API or a non-Azure database accessed through a self-hosted integration runtime.
Sink: an Azure SQL database.
The "Pre-copy script" field contains a command that deletes some data from the sink table (why the data is deleted is out of scope here).
When the pipeline runs, the connection to the source fails (due to a timeout, network issue, authentication error, etc.).
The question is: will the pre-copy script run in this case? Or does the script only run after ADF has successfully connected to the source data store? I couldn't find any reference about it.
I could just try to simulate it and see what happens, but I'm hoping someone can save me the time. :)
Thanks in advance!

In my experience with Data Factory, the pre-copy script won't run.
As I understand it, the Copy Activity behaves like a workflow: connect to the source --> read data from the source --> connect to the sink --> run the pre-copy script --> write data to the sink. No matter which step fails, Data Factory stops the run at that point.
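To make the ordering concrete, here is a purely conceptual sketch (not ADF's actual implementation), with hypothetical source/sink objects: each step raises on failure, so a source connection error aborts the run before the pre-copy script ever reaches the sink.

```python
# Conceptual sketch only, with hypothetical source/sink objects; it models the
# ordering described above, not ADF internals.
def run_copy_activity(source, sink, pre_copy_script):
    src_conn = source.connect()          # 1. connect to source (a timeout/auth error stops here)
    data = src_conn.read()               # 2. read data from the source
    sink_conn = sink.connect()           # 3. connect to the sink
    sink_conn.execute(pre_copy_script)   # 4. run the pre-copy script
    sink_conn.write(data)                # 5. write data to the sink
```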

Related

How to set up an ADF pipeline that isolates every pipeline run and creates its own compute resources?

I have a simple pipeline in ADF that is triggered by a Logic App every time someone submits a file as a response in a Microsoft Form. The pipeline creates a cluster based on a Docker image and then uses a Databricks notebook to run some calculations that can take several minutes.
The problem is that every time the pipeline is running and someone submits a new response to the form, it triggers another pipeline run that, for some reason, causes the previous runs to fail.
The last pipeline will always work fine, but earlier runs will show this error:
 > Operation on target "notebook" failed: Cluster 0202-171614-fxvtfurn does not exist 
However, checking the parameters of the last pipeline run, it uses a different cluster ID, 0202-171917-e616dsng for example.
It seems that, for some reason, the compute resources for the first run are reallocated to the new pipeline run, even though the cluster IDs are different.
I have set the concurrency to 5 in the pipeline's general settings tab, but I still get the same error.
Concurrency setup screenshot
Also, in the first connector, which looks up the Docker image files, I have the concurrency set to 15, but this doesn't fix the issue either.
look up concurrency screenshot
To me, this seems like a very simple and common task when it comes to automation and data workflows, but I cannot figure it out.
I really appreciate any help and suggestions. Thanks in advance.
The best way would be to use an existing pool rather than recreating the compute every time.
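A minimal sketch of what the Azure Databricks linked service could look like when job clusters draw compute from an existing instance pool instead of provisioning brand-new compute per run. The workspace URL, pool ID, runtime version and Key Vault reference are placeholders, and the property names follow the AzureDatabricks linked service schema as I recall it, so double-check against the current docs.

```python
# Sketch of an Azure Databricks linked service payload that points job clusters
# at an existing instance pool. All identifiers below are placeholders.
databricks_linked_service = {
    "name": "AzureDatabricksWithPool",
    "properties": {
        "type": "AzureDatabricks",
        "typeProperties": {
            "domain": "https://adb-0000000000000000.0.azuredatabricks.net",
            "accessToken": {
                "type": "AzureKeyVaultSecret",
                "store": {"referenceName": "MyKeyVault", "type": "LinkedServiceReference"},
                "secretName": "databricks-token",
            },
            # Existing instance pool: job clusters draw idle VMs from the pool,
            # so concurrent runs don't tear each other's compute down.
            "instancePoolId": "0000-000000-pool0000",
            "newClusterVersion": "10.4.x-scala2.12",
            "newClusterNumOfWorker": "2",
        },
    },
}
```

With a warm pool, each concurrent pipeline run gets its own job cluster from idle pool capacity, so one run no longer appears to take over another run's compute.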

Syncing incremental data with ADF

In Synapse I've set up 3 different pipelines. They all gather data from different sources (SQL, REST and CSV) and sink it into the same SQL database.
Currently they all run during the night, but I already know the question of running them more frequently is coming. I want to prevent my pipelines from running through all the sources while nothing has changed in them.
Therefore I would like to store the last successful sync run of each pipeline (or pipeline activity). Before the next start of each pipeline, I want a new, fourth pipeline to check whether something has changed in the sources. If so, it triggers the execution of one, two or all three of the pipelines.
I still see some complications in doing this, so I'm not fully convinced of how to do it. All help and thoughts are welcome; does anyone have experience with this?
This is (at least in part) the subject of the following Microsoft tutorial:
Incrementally load data from Azure SQL Database to Azure Blob storage using the Azure portal
You're on the correct path - the crux of the issue is creating and persisting "watermarks" for each source from which you can determine if there have been any changes. The approach you use may be different for different source types. In the above tutorial, they create a stored procedure that can store and retrieve a "last run date", and use this to intelligently query tables for only rows modified after this last run date. Of course this requires the cooperation of the data source to take note of when data is inserted or modified.
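As a rough illustration of that pattern outside the tutorial, here is a minimal sketch assuming a hypothetical dbo.Watermark control table, an illustrative LastModifiedTime column on the source table, and pyodbc-style connections created elsewhere:

```python
# Rough sketch of the watermark pattern for a SQL source. dbo.Watermark and
# LastModifiedTime are illustrative names; source_conn/control_conn are assumed
# to be pyodbc-style connections created elsewhere.
def incremental_load(source_conn, control_conn, table_name):
    # 1. Read the last successful watermark for this table
    last_watermark = control_conn.execute(
        "SELECT WatermarkValue FROM dbo.Watermark WHERE TableName = ?", table_name
    ).fetchone()[0]

    # 2. Pull only rows modified after the watermark
    changed_rows = source_conn.execute(
        f"SELECT * FROM {table_name} WHERE LastModifiedTime > ?", last_watermark
    ).fetchall()

    if changed_rows:
        # ... copy changed_rows to the sink here ...

        # 3. Advance the watermark only after the copy succeeded
        new_watermark = max(row.LastModifiedTime for row in changed_rows)
        control_conn.execute(
            "UPDATE dbo.Watermark SET WatermarkValue = ? WHERE TableName = ?",
            new_watermark, table_name,
        )
        control_conn.commit()
    return changed_rows
```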
If you have a source that cannot be intelligently queried in part (e.g. a CSV file), you still have options, such as using the Get Metadata activity to query, for example, the lastModified property of a source file (or even its contentMD5 if using Blob or ADLS Gen2) and compare it to a value saved during your last run (you would have to pick a place to store this, e.g. an operational DB, an Azure Table, or a small blob file) to determine whether it needs to be reprocessed.
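The same idea, sketched with the Azure Storage SDK for Python rather than the Get Metadata activity itself; the container/blob names are placeholders, and load_saved_timestamp / save_timestamp / run_csv_pipeline stand in for whatever store and trigger mechanism you pick:

```python
# Sketch of the lastModified check for a file source, using the azure-storage-blob
# SDK. Container/blob names are placeholders; load_saved_timestamp, save_timestamp
# and run_csv_pipeline are hypothetical helpers for your chosen store and trigger.
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str="<storage-connection-string>",
    container_name="landing",
    blob_name="daily_extract.csv",
)

props = blob.get_blob_properties()
if props.last_modified > load_saved_timestamp("daily_extract.csv"):
    # File changed since the previous run: process it, then persist the new
    # timestamp so the next check can skip it if nothing changes again.
    run_csv_pipeline()
    save_timestamp("daily_extract.csv", props.last_modified)
```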
If you want to go crazy, you can look into streaming patterns (which might require dabbling in HDInsight or getting your hands dirty with Azure Event Hubs to trigger ADF) to move from a scheduled trigger to automatic ingestion as new data appears at the sources.

How to pass session parameters to Oracle in an Azure Data Factory copy activity (Oracle linked service)

I'm copying data from an Oracle instance in AWS, with the self-hosted integration runtime service running on a VM in the source network.
The issue is: while copying data from the Oracle database using the Copy Data activity in Azure, how do I pass session parameters such as NLS_DATE_FORMAT and NLS_TIMESTAMP_FORMAT to the Oracle session so that timestamp strings come out in a particular format?
The copy activity sink is CSV. Files written in CSV format with nanosecond timestamp precision aren't parseable by Spark's CSV reader.
Hence it seemed best to bring only second-level precision from Oracle to Azure by setting the NLS_TIMESTAMP_FORMAT parameter to YYYY-MM-DD HH24:MI:SS.
Please suggest how to do this.
My other question on this topic is here: Parse Micro/Nano Seconds timestamp in spark-csv Dataframe reader: Inconsistent results
Providing it under the connection properties parameters was of no help. See attached screenshot.
You need to set the parameters (NLS_DATE_FORMAT, etc.) as system properties (i.e. system environment variables) on the VM that is hosting the self-hosted integration runtime.
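As a sketch of one way to apply that (assuming a Windows VM and administrator rights), the variables can be written machine-wide with setx and the integration runtime service restarted so it picks them up; the service name below is what I believe the self-hosted IR uses, so verify it on your VM:

```python
# Sketch only, assuming a Windows VM and administrator rights. setx /M writes the
# variables to the machine (system) environment; the integration runtime service
# is then restarted so it picks them up. "DIAHostService" is, as far as I know,
# the self-hosted IR service name; verify it on your VM before relying on it.
import subprocess

nls_settings = {
    "NLS_DATE_FORMAT": "YYYY-MM-DD HH24:MI:SS",
    "NLS_TIMESTAMP_FORMAT": "YYYY-MM-DD HH24:MI:SS",
}

for name, value in nls_settings.items():
    subprocess.run(["setx", name, value, "/M"], check=True)

subprocess.run(
    ["powershell", "-Command", "Restart-Service DIAHostService"], check=True
)
```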

PDI slow loading into azure databases

I have an Azure VM with Pentaho Data Integration installed. I'm trying to build some ETL which loads a dimensional model from the staging area, but when I start a transformation, the load speed of PDI into any Azure database is painfully slow.
Is it possible to have PDI working in the cloud with Azure databases? Is there some configuration step needed to achieve a reasonable loading speed?
PS:
VM and databases are in the same region
There is a firewall rule to allow port access
Reading speed is working just fine
PDI 8.1, using the Table output step
I've been experiencing the same speed problem, but I'll tell you my workarounds.
First of all: download and install the latest JDBC driver that lets you connect to Azure SQL Database. The documentation links to it here, but the way I do it is to keep it synced from here on GitHub; either will let you use the latest driver in PDI.
Second workaround: for large files, what I've found most powerful is using the BCP utility driven by PowerShell or a Linux batch script. It doesn't matter whether the files are local or in Azure Blob Storage, but you might need credentials for this.
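A rough sketch of such a BCP load (driven from Python here only to keep the examples in one language; the same flags work from PowerShell or a shell script, and the server, database, table, credentials and file path are placeholders):

```python
# Rough sketch of a bulk load with the bcp utility. Server, database, table,
# credentials and file path are placeholders; the same flags work when bcp is
# called from PowerShell or a shell script instead of Python.
import subprocess

bcp_command = [
    "bcp", "dbo.FactSales", "in", r"C:\staging\fact_sales.csv",
    "-S", "myserver.database.windows.net",  # Azure SQL logical server
    "-d", "MyWarehouse",                    # target database
    "-U", "loader_user", "-P", "<password>",
    "-c",           # character mode
    "-t,",          # comma field terminator
    "-F", "2",      # skip the header row
    "-b", "50000",  # commit in batches to keep the transaction log small
]
subprocess.run(bcp_command, check=True)
```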
Last but not least: use Azure Data Factory V2 to move and load the files (if you're like me, you'll try to keep everything in PDI until you have to load it; the HTTP GET step will let you trigger an ADF pipeline, as sketched below).
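Triggering the ADF pipeline from outside boils down to a single REST call against the Data Factory createRun endpoint. A minimal sketch, with the subscription, resource group, factory and pipeline names as placeholders and the bearer token assumed to be obtained separately:

```python
# Minimal sketch of starting an ADF V2 pipeline run through the REST API.
# Subscription, resource group, factory and pipeline names are placeholders,
# and the bearer token is assumed to be obtained separately (e.g. an Azure AD
# client-credentials flow).
import requests

url = (
    "https://management.azure.com/subscriptions/<subscription-id>"
    "/resourceGroups/<resource-group>/providers/Microsoft.DataFactory"
    "/factories/<factory-name>/pipelines/<pipeline-name>/createRun"
    "?api-version=2018-06-01"
)
response = requests.post(url, headers={"Authorization": "Bearer <token>"}, json={})
response.raise_for_status()
print(response.json()["runId"])  # id of the run that was just started
```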
Good luck and let me know if you get it.

Using ODBC Driver in Azure to connect to external database

I am working in a business in New Zealand. We currently use a remote server (Plexus) to store a large amount of data (some tables > 2 billion rows). We have started down the SharePoint route, and I have created a number of databases and apps in SharePoint that use this data. Currently, I have to run a program in New Zealand that downloads the data to our local server and then pushes up that data into an Azure database, which the web apps connect to. I would like to remove this middle step for many reasons but the biggest reason is that the web connection between NZ and the US tends to result in a lot of time outs and long pulls due to having to pull large data sets across the Pacific. The remote database we are using is Plexus.
Ideally, I would like to have my C# code sitting in Azure and have it connect to the remote server directly. This way I could simply send the SQL request to Plexus and have the data go directly into the Azure databases. The major advantage is that it would all be based in the US, which would make things a lot faster.
The major hurdle is that we need to install an ODBC driver given to us by the remote server into Azure so it recognises the calls as genuine. Our systems administrator has said he has looked into it and it seems this can't be done?
I was hoping someone in the Stack Overflow community has encountered a similar issue and resolved it?
Note: please don't think I am asking whether Azure has an ODBC connection, because I know it does. I am not asking if I can connect TO Azure; I am asking if I can connect Azure to another external data source.
In a Worker Role/Cloud Service in Azure you can install the ODBC driver in a startup task using PowerShell's ODBC cmdlets.
More info here: PowerShell Add-OdbcDsn and here: PowerShell startup task in cloud services
One option is to create a virtual machine in the same Azure data center as your database and install your ODBC driver and your C# app.
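Whichever hosting option you pick, once the driver (and optionally a DSN) is installed on the Azure host, your code can query the external source directly. A minimal sketch with pyodbc, used here only to keep the examples in one language (the C# equivalent lives in System.Data.Odbc); the DSN name, credentials and query are placeholders:

```python
# Sketch only: once the ODBC driver and a DSN are installed on the Azure host,
# code can query the external source directly. "PlexusDSN", the credentials and
# the query are placeholders; in C# the equivalent lives in System.Data.Odbc.
import pyodbc

conn = pyodbc.connect("DSN=PlexusDSN;UID=<user>;PWD=<password>", timeout=30)
cursor = conn.cursor()
cursor.execute("SELECT * FROM some_large_table")  # placeholder query
for row in cursor.fetchall():
    pass  # push rows into the Azure SQL database from here, all within the US
conn.close()
```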
