Copy data from a self-hosted Integration Runtime to Azure Data Lake?

I'm trying to copy data, using the Copy activity in a Synapse pipeline, from a REST API call on a self-hosted integration runtime to Azure Data Lake Gen2. Using preview I can see the data from the REST API call, but when I run the Copy activity it is queued endlessly. Any idea why this happens? The source uses a self-hosted integration runtime and the sink uses the Azure integration runtime. Could this be the problem? Otherwise, both connections are tested and working...
Edit: When trying the web call, it tells me it's processing for a long time, but I know I can connect to the REST API source, since the preview feature in the Copy activity shows me the response...
Running the diagnostic tool, I receive the following error:
It seems to be a problem with the certificate. Any ideas?

Integration runtime
If you use a Self-hosted Integration Runtime (IR) and the copy activity waits a long time in the queue until the IR has resources available to execute it, we suggest scaling your IR out/up.
If you use an Azure Integration Runtime that is in a non-optimal region, resulting in slow reads/writes, we suggest configuring an IR in another region.
You can also try the performance tuning tips.
How do I scale up the IR?
Scale Considerations

Related

Azure monitoring tool

I am trying to find the reason why an Azure Data Factory pipeline is running for a long time. It may be due to the network, the database, the copy activity, or some other reason.
Is there a tool or plugin which can show all the metrics in one dashboard?
Datadog will work for this.
Azure Monitor is not showing all the required metrics in one dashboard, so I need a tool for this.
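If no single dashboard covers it, one DIY option is to pull pipeline run durations yourself through the Data Factory "Query Pipeline Runs" REST API and chart them in whatever tool you prefer. A minimal Python sketch, assuming azure-identity for authentication; the subscription, resource group, and factory names are placeholders:
```python
# Sketch: list recent pipeline runs and their durations via the ADF
# "Query Pipeline Runs" REST API. All resource names are placeholders.
from datetime import datetime, timedelta, timezone

import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
FACTORY = "<factory-name>"

token = DefaultAzureCredential().get_token(
    "https://management.azure.com/.default").token

url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION}"
    f"/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.DataFactory"
    f"/factories/{FACTORY}/queryPipelineRuns?api-version=2018-06-01"
)

now = datetime.now(timezone.utc)
body = {
    "lastUpdatedAfter": (now - timedelta(days=1)).isoformat(),
    "lastUpdatedBefore": now.isoformat(),
}

resp = requests.post(url, json=body,
                     headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()

# Print the slowest runs first to see where the time is going.
for run in sorted(resp.json().get("value", []),
                  key=lambda r: r.get("durationInMs", 0), reverse=True):
    print(run["pipelineName"], run["status"], run.get("durationInMs"))
```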

Self-Hosted Integration Runtime Copy Activity Timeout

I’m trying to implement a pipeline in ADF where I copy data from a Function App to an on-prem SQL Server. I have installed the Self-Hosted Integration Runtime to access the on-prem database and set my copy activity to use the self-hosted IR.
First I was getting a firewall error, so I added a rule to allow the node where the IR is installed to call the Function App, but now I am getting a timeout error.
Any ideas why the timeout?
Please check the General parameters of the copy activity and try increasing its timeout. By default it is 7 days.
Also, try increasing the retry count on the copy activity. The default is zero (no retry); increasing the count and the retry interval should allow it to attempt to regain the connection.
Please refer to this Microsoft documentation: Troubleshoot copy activity on Self-hosted IR.
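For reference, these settings live in the copy activity's policy block. A minimal sketch of that block, written as a Python dict here; the activity name and the concrete values are illustrative, not recommendations:
```python
# Illustrative sketch of the copy activity "policy" block where timeout and
# retry are configured. The activity name and values are examples only.
import json

copy_activity = {
    "name": "CopyFromFunctionAppToSql",   # hypothetical activity name
    "type": "Copy",
    "policy": {
        "timeout": "0.02:00:00",          # d.hh:mm:ss, e.g. 2 hours
        "retry": 3,                       # default is 0 (no retry)
        "retryIntervalInSeconds": 60,     # wait between retry attempts
    },
    # source/sink/dataset references omitted for brevity
}

print(json.dumps(copy_activity, indent=2))
```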

Is it possible to Monitor Azure Integration Runtime?

I am running a few data pipelines in Azure Data Factory, and they use the Azure Integration Runtime for compute.
I am trying to monitor the CPU/memory usage the pipelines consume on the Azure IR.
I have checked Azure Monitor, but I think the CPU/memory metrics there are for the Self-Hosted Integration Runtime.
Also, with the diagnostic setting enabled, I tried to verify the details in the logs too, but these details are not available.
Can anyone suggest more options?
If you are referring to the Azure AutoResolveIntegrationRuntime, then no, there is not, and this is why (from https://www.cathrinewilhelmsen.net/integration-runtimes-azure-data-factory/):
Microsoft has massive elastic pools across the various locations/regions where Azure is offered, and at runtime ADF determines what pool/hardware it will use to perform the pipeline activities. So there is really no way (and no need) to monitor the Azure AutoResolve IR. But if you are interested in monitoring Self-Hosted IRs, there are many ways to do it.
One simple and straightforward way is to create Azure dashboards from the Metrics section of Azure Monitor. As you can see from the screenshot below, it provides a good visual representation of usage/resources over time.
As you can see, I'm visualizing the Integration Runtime itself (CPU/memory) as well as the Azure VM that hosts it. On top of this, you can set up alerts in the Metrics dashboard when certain conditions are met (e.g. average CPU % usage is over 75% for the last 15 minutes). These alerts can send you a text message or an email, and can even do things as complicated as triggering a Logic App or webhook for automated scaling up/out, advanced notifications, etc.
This, in my opinion, is the best way to monitor, but another option is to call the Azure Data Factory REST API to get monitoring data for the integration runtimes:
POST https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.DataFactory/factories/{factoryName}/integrationRuntimes/{integrationRuntimeName}/monitoringData?api-version=2018-06-01
But this method would require you to incrementally pull in the data, store it, parse it, and then visualize or act upon it, when all of that is already very well built in for you. Sometimes it's fun to reinvent wheels, though.
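If you do want to go the REST API route, here is a minimal Python sketch of calling the monitoringData endpoint quoted above, assuming azure-identity for authentication and placeholder resource names; inspect the raw payload before building on it, since the exact response shape depends on the IR type:
```python
# Sketch: call the "Get Monitoring Data" endpoint for an integration runtime.
# Subscription, resource group, factory, and IR names are placeholders.
import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
FACTORY = "<factory-name>"
IR_NAME = "<integration-runtime-name>"

token = DefaultAzureCredential().get_token(
    "https://management.azure.com/.default").token

url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION}"
    f"/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.DataFactory"
    f"/factories/{FACTORY}/integrationRuntimes/{IR_NAME}"
    f"/monitoringData?api-version=2018-06-01"
)

resp = requests.post(url, headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()

# For a self-hosted IR the payload contains per-node statistics
# (CPU, memory, concurrent jobs); print it and pick what you need.
print(resp.json())
```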
Yes, it is possible to monitor the Azure Integration Runtime.
"Pipeline runs" in Monitoring has the option to check the CPU utilization for a specific pipeline, integration runtime, and more specific filters. You can find how it's done here.

Azure Data Factory: use two Integration Runtimes for failover

I have an Azure Data Factory V2 with an Integration Runtime installed on our internal cloud server and connected to our Java web platform API. This passes data one way into ADF on a scheduled trigger via a request to the IR API.
The Java web platform also has a DR solution at another site, which is a mirror build of the same servers and platforms. If I were to install another IR on this DR platform and link it to ADF as a secondary IR, is there a way for ADF to detect that the primary is down and automatically fail over to the secondary IR?
Thanks
For your question "Is there a way for ADF to detect if the primary is down and auto failover to the secondary IR?", the answer is no; Data Factory doesn't have that failover feature. The shared integration runtime nodes don't affect each other.
For the other question in the comment: the IR can't be stopped/paused automatically; you must do it manually on the machine.
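There is no built-in failover, but if a manual workaround is acceptable, one possibility is to check the primary IR's state through the Data Factory "Get Status" REST API and decide yourself which IR-bound linked services to run against. A hedged Python sketch; all resource names are placeholders and the actual switching logic is up to you:
```python
# Hedged sketch: ADF has no built-in IR failover, but a caller can check the
# primary self-hosted IR's state via the "Get Status" REST API and choose
# which IR to use. All names below are placeholders.
import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
FACTORY = "<factory-name>"

def ir_state(ir_name: str) -> str:
    """Return the state reported by the Get Status API (e.g. 'Online')."""
    token = DefaultAzureCredential().get_token(
        "https://management.azure.com/.default").token
    url = (
        f"https://management.azure.com/subscriptions/{SUBSCRIPTION}"
        f"/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.DataFactory"
        f"/factories/{FACTORY}/integrationRuntimes/{ir_name}"
        f"/getStatus?api-version=2018-06-01"
    )
    resp = requests.post(url, headers={"Authorization": f"Bearer {token}"})
    resp.raise_for_status()
    return resp.json()["properties"]["state"]

# Pick the DR IR's pipeline variant when the primary is not online.
use_ir = "primary-ir" if ir_state("primary-ir") == "Online" else "dr-ir"
print("Run the pipeline variant bound to:", use_ir)
```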

Azure Data Factory (ADF) vs Azure Functions: how to choose?

Currently we are using a Blob-triggered Azure Function to move JSON data into Cosmos DB. We are planning to replace the Azure Function with an Azure Data Factory (ADF) pipeline.
I am new to Azure Data Factory (ADF), so I'm not sure: would an ADF pipeline be the better option or not?
Though my answer is a bit late, I would like to add that I would not recommend replacing your current setup with ADF. Reasons:
It is too expensive. ADF costs way more than Azure Functions.
Custom logic: ADF is not built to perform cleansing logic or run any custom code. Its primary goal is data integration from external systems using its vast connector pool.
Latency: ADF has much higher latency due to the large overhead of its job framework.
Based on your requirements, Azure Data Factory is your perfect option. You could follow this tutorial to configure a Cosmos DB output and an Azure Blob Storage input.
The advantage over an Azure Function is that you don't need to write any custom code unless data cleaning is involved; Azure Data Factory is the recommended option, and even if you need an Azure Function for other purposes, you can add it within the pipeline.
The fundamental use of Azure Data Factory is data ingestion. Azure Functions are serverless (Function as a Service) and are best suited to short-lived work; Azure Functions that execute for multiple seconds are far more expensive. Azure Functions are good for event-driven microservices. For data ingestion, Azure Data Factory is the better option, as its running cost for huge volumes of data will be lower than Azure Functions. You can also integrate Spark processing pipelines into ADF for more advanced data ingestion pipelines.
Moreover, it depends on your situation. Azure Functions are serverless, lightweight processes meant for quick responses to an event, not the volumetric responses that batch processes are meant for.
So, if your requirement is to quickly respond to an event with a little information, stay with Azure Functions; if you need a batch process, switch to ADF.
Cost
I got the images from here.
Let's calculate the cost:
If your file is large (7.514 GB), the copy takes about 43:51 hours = 43.867 (h):
4 (DIU) × 43.867 (h) × 0.25 ($/DIU-hour) = 43.867 $
43.867 $ / 7.514 GB = 5.838 $/GB
If your file is small (2.497 MB), the copy takes about 45 seconds (billed as roughly 1/60 h):
4 (DIU) × 1/60 (h) × 0.25 ($/DIU-hour) = 0.0167 $
2.497 MB / 1024 MB = 0.00244 GB
0.0167 $ / 0.00244 GB = 6.844 $/GB
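The same arithmetic, re-run as a quick sanity check (same numbers as above, nothing new):
```python
# Re-run of the arithmetic above: cost = DIUs * hours * price per DIU-hour,
# then divided by data volume to get $/GB. Numbers come from the answer.
DIU = 4
PRICE_PER_DIU_HOUR = 0.25  # $/DIU-hour

# Large file: 7.514 GB copied in about 43.867 hours.
large_cost = DIU * 43.867 * PRICE_PER_DIU_HOUR        # = 43.867 $
print(round(large_cost / 7.514, 3), "$/GB")           # ~5.838

# Small file: 2.497 MB copied in about 45 s, billed here as 1/60 h.
small_cost = DIU * (1 / 60) * PRICE_PER_DIU_HOUR      # ~0.0167 $
small_gb = 2.497 / 1024                               # ~0.00244 GB
print(round(small_cost / small_gb, 3), "$/GB")        # ~6.83 (the answer's
# 6.844 comes from rounding the intermediate values first)
```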
Scale
The maximum number of instances an Azure Function can scale out to is 200.
ADF can run 3,000 concurrent external activities, and in my test only 1,500 copy activities ran in parallel. (This test wasted a lot of money.)

Resources