Right way to access private data from Azure Data Factory


I am trying to understand the right architecture for accessing data on servers hosted in a private network (still running on Azure, but not publicly accessible) from the Azure Data Factory service.
In some documentation Microsoft mentions the Integration Runtime as the solution:
https://learn.microsoft.com/en-us/azure/data-factory/concepts-integration-runtime
While other documentation refers to a Data Gateway:
https://learn.microsoft.com/en-us/azure/analysis-services/analysis-services-gateway
Both articles seem fairly recent. The two applications have different recommended requirements (one mentions 8 CPU cores, which is overkill for my requirement of shipping a few hundred megabytes per night).
Given that the data sources are running on Azure, just not publicly accessible, is there a way to connect Azure Data Factory to them directly?

The self-hosted integration runtime in ADF should meet your requirement. This link gives a complete example of accessing data inside an Azure VNet or private network.
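As a rough illustration of how a linked service is routed through a self-hosted integration runtime, the sketch below builds the JSON definition in Python. The `connectVia` reference is what ties the linked service to the runtime. The linked service name, runtime name, and connection string are hypothetical placeholders, and credentials are deliberately left out.

```python
import json


def sql_linked_service(name: str, connection_string: str, ir_name: str) -> dict:
    """Build a SQL Server linked service definition that connects through
    a named self-hosted integration runtime."""
    return {
        "name": name,
        "properties": {
            "type": "SqlServer",
            "typeProperties": {"connectionString": connection_string},
            # This reference makes ADF reach the server via the runtime
            # installed inside the private network.
            "connectVia": {
                "referenceName": ir_name,
                "type": "IntegrationRuntimeReference",
            },
        },
    }


# All values below are hypothetical placeholders.
definition = sql_linked_service(
    name="PrivateSqlServer",
    connection_string="Server=10.0.0.4;Database=sales;",
    ir_name="SelfHostedIR",
)
print(json.dumps(definition, indent=2))
```

The resulting JSON can be pasted into the linked service code view in the ADF UI or deployed with your usual tooling; the runtime named in `connectVia` has to be registered in the factory first.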

Related

Monitoring Azure Data Factory access

Kind of a simple question, but puzzling...
Is there a stat in Azure services to monitor how many times Data Factory is or was accessed?
As an example, if an automated system is set up to make persistent API calls to ADF with the malicious intent to exhaust it, is there a way to monitor for that and gather some kind of stats?

The monitoring built into the Azure Data Factory PaaS itself only covers legitimate, authenticated usage. You can see this on the https://adf.azure.com/en/monitoring/pipelineruns?factory=%2Fsubscriptions%... dashboard.
Notice how the root domain is adf.azure.com; this is the same for all tenants using Data Factory around the world. Your specific subscription and instance are mere query parameters in the URL. Microsoft fully manages the hosting of this PaaS, which means it is entirely responsible for mitigating any DDoS or similar bad-actor attempts on the service. It's not something you have to worry about, and therefore not something you have much visibility into.
If you ever need or want to check how Microsoft is doing with this, head over to https://status.azure.com/status and search for the "Azure Data Factory" row.
This is really one of the biggest selling points of using a fully hosted cloud PaaS such as Data Factory. You are no longer responsible for the hardware, or even the range of IP addresses, that back this service, any more than you have to worry about someone DDoS'ing outlook.office.com, which probably serves your entire organisation's email. It could happen, but if it did, it would affect all of Microsoft's customers around the world, not just you, so there should be no expectation that you personally do anything special to mitigate it.
Note that if you want to monitor network traffic within your NSGs, interfaces, VNets and so on in Azure generally, the thing to use is Azure Monitor's Network Insights at https://portal.azure.com/#view/Microsoft_Azure_Monitoring/AzureMonitoringBrowseBlade/~/networkInsights
That applies to all provisioned resources and services on Azure, though; it is not specific to Azure Data Factory.

Trying to find out Azure latency between on-premises client and Azure cloud application

I am trying to accomplish the task described below.
What I am doing:
All my users are on-premises.
The application is hosted on an Azure VM (IaaS).
Question =>
The Azure cloud application talks to the Internet, downloads huge packages, and shares them with clients that are on-premises. So I am trying to understand the risk and latency matrix between on-premises users and the Azure cloud application.
Has anyone done something similar and encountered latency issues, and what are the possible fixes?
Note => I can't migrate users to the Azure cloud as of now.

To address latency issues, please try the following:
To reduce the latency between the on-premises client and the Azure cloud application, make use of Azure HPC Cache.
Azure HPC Cache reduces latency for applications where data may be tethered to existing infrastructure because of dataset sizes and operational scale.
Azure HPC Cache automatically caches active data that is present both on-premises and in Azure.
You can make use of Accelerated Networking, which speeds up communication.
Try eliminating network congestion.
Try reducing the number of network nodes that traffic has to traverse from one stage to another.
Make use of Azure ExpressRoute and Azure Analysis Services to reduce network latency.
Azure ExpressRoute creates a private connection between on-premises sources and Azure.
Azure Analysis Services avoids the need for an on-premises data gateway and generally eliminates network latency.
For more detail, please refer to the links below:
https://azure.microsoft.com/en-us/blog/azure-hpc-cache-reducing-latency-between-azure-and-on-premises-storage/
https://blogit.create.pt/gustavobrito/2017/11/27/latency-test-between-azure-and-on-premises-intro/
https://viniciusdeschamps.com.br/3-ways-to-reduce-network-latency-in-azure/#how-can-I-measure-network-latency
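Before trying any of the fixes above, it helps to have baseline numbers. The following is a minimal sketch, not a full benchmarking tool: it times TCP connection setup from an on-premises machine to the application's VM and summarises the samples. The host name and port in the usage note are hypothetical, and connection time is only a rough proxy for round-trip latency; it says nothing about throughput for large downloads.

```python
import socket
import statistics
import time


def tcp_connect_ms(host: str, port: int, timeout: float = 3.0) -> float:
    """Return the time in milliseconds taken to open a TCP connection."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection established; close it immediately
    return (time.perf_counter() - start) * 1000.0


def summarize(samples: list[float]) -> dict[str, float]:
    """Summarise latency samples (milliseconds) as min, median and max."""
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


def measure(host: str, port: int, count: int = 10) -> dict[str, float]:
    """Take `count` connection-time samples and summarise them."""
    return summarize([tcp_connect_ms(host, port) for _ in range(count)])
```

Run from an on-premises client, something like `measure("app.example.internal", 443)` (a placeholder address) gives a quick before/after comparison when testing ExpressRoute or other changes.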

Linked service between two or more data factories

Is it possible to configure a linked service between 2 or more data factories?
I read the documentation but didn't find it.
Thanks

In my experience we can't do that, and I have never heard of such a configuration.
As far as I know, we can only share the integration runtime between 2 or more data factories.
But we still need to create a linked service in each factory to connect to the same on-premises data source through the shared self-hosted integration runtime.
In short, it's impossible to configure a linked service between 2 or more data factories.

Parameterised datasets in Azure Data Factory

I'm wondering if anyone has any experience in calling datasets dynamically in Azure Data Factory. The situation we have is that we dynamically sweep all tables in from IaaS (on-premise SQL Server installations on an Azure VM) application systems to a data lake. We want to have one pipeline that can pass server name, database name, user name and password to the pipeline's activities. The pipelines will then sweep whatever source they've been told to read from the parameters. The source systems are currently within a separate subscription and domain within our Enterprise Agreement.
We have looked into using the AutoResolveIntegrationRuntime on a generic SQL Server dataset but, as it is Azure and the runtimes on the VMs are self-hosted, it can't resolve and we get 'cannot connect' errors. So,
i) I don't know if this problem goes away if they are in the same subscription and domain?
That leaves whether anyone can assist with:
ii) A way of getting a dynamic runtime to resolve which SQL Server runtime it should use (we have one per VM for resilience purposes, but they can all see each other's instances). We don't want to parameterise a linked service on a particular VM as it places reliance for other VMs on that single VM.
iii) Ability to parameterise a dataset to call a runtime (doesn't look possible in the UI).
iv) Ability to parameterise the source and sink connections with pipeline activities to call a dataset parameter.

Server, database and table names can be made dynamic by using parameters. The key problem here is that none of the references in ADF can be parameterized, such as the linked service reference in a dataset or the integration runtime reference in a linked service. If you don’t have too many self-hosted integration runtimes, maybe you can try setting up different pipelines for different networks?
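To make that split concrete, here is a hedged sketch in Python that generates the JSON definitions: the server and database are linked service parameters, while the `connectVia` runtime reference is a fixed name, so one linked service is generated per network. All names (`NetworkA`, `SelfHostedIR-NetworkA`, and so on) are hypothetical, and credentials are deliberately omitted from this sketch.

```python
import json


def parameterised_linked_service(name: str, ir_name: str) -> dict:
    """Linked service whose server and database are parameters, but whose
    integration runtime reference is a fixed name."""
    return {
        "name": name,
        "properties": {
            "type": "SqlServer",
            "parameters": {
                "serverName": {"type": "String"},
                "databaseName": {"type": "String"},
            },
            "typeProperties": {
                "connectionString": (
                    "Server=@{linkedService().serverName};"
                    "Database=@{linkedService().databaseName};"
                )
            },
            # Fixed reference: this is the part that cannot be an expression.
            "connectVia": {
                "referenceName": ir_name,
                "type": "IntegrationRuntimeReference",
            },
        },
    }


def parameterised_dataset(name: str, linked_service_name: str) -> dict:
    """Dataset that forwards its own parameters to the linked service."""
    params = ("serverName", "databaseName")
    return {
        "name": name,
        "properties": {
            "type": "SqlServerTable",
            "parameters": {p: {"type": "String"} for p in (*params, "tableName")},
            "linkedServiceName": {
                "referenceName": linked_service_name,  # fixed name as well
                "type": "LinkedServiceReference",
                "parameters": {
                    p: {"value": f"@dataset().{p}", "type": "Expression"}
                    for p in params
                },
            },
            "typeProperties": {
                "tableName": {"value": "@dataset().tableName", "type": "Expression"}
            },
        },
    }


# One linked service per network, each pinned to its own runtime.
linked_services = [
    parameterised_linked_service(f"SqlServer{net}", ir_name=f"SelfHostedIR-{net}")
    for net in ("NetworkA", "NetworkB")
]
datasets = [
    parameterised_dataset(f"SqlTable{net}", f"SqlServer{net}")
    for net in ("NetworkA", "NetworkB")
]
print(json.dumps(linked_services[0], indent=2))
```

A pipeline per network can then pass `serverName`, `databaseName` and `tableName` into the matching dataset, which is the "different pipelines for different networks" arrangement described above.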

Achieving master data deduplication on Azure

I am looking at achieving master data deduplication based on match percentages in AzureDB. I was looking for something equivalent to Master Data Services / DQS (Data Quality Services) in SQL Server 2012:
https://channel9.msdn.com/posts/SQL11UPD05-REC-06
Broadly, I am looking for controls over match rules (exact, close match, etc.), dependency handling, and an audit trail (undo capability, etc.).
I reckon this must be available in the Azure cloud if it is available in SQL Server. Could you please point me to how I get this done on AzureDB?
Please note: I am NOT looking for data sources like Melissa Data or D&B that are listed on the Azure Marketplace.

Master Data Services is not just a database process: it also centrally involves a website component, which still (as of 2021) requires some Windows server running IIS.
This can be an Azure Virtual Machine (link to documentation) but there is no serverless offering for this at this time.
The database itself can be hosted on an Azure SQL Managed Instance (link to documentation) but not on a standalone Azure SQL DB, as far as I can tell. This is presumably because some of the essential components of MDS sit outside the database, much like other services like SSIS are more than just a database.
Data Quality Services is a similar story: it uses three databases (link to documentation) and seemingly some components outside the databases, so wouldn't be possible to deploy in standalone Azure SQL DBs. It may be possible to run on a Managed Instance, I couldn't find a clear answer to that. And again, there is no fully-serverless offering at this time.
Of course, all of this can easily be run via IaaS (Infrastructure as a Service) using an Azure virtual machine running SQL Server.
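If only the match-percentage part is needed and running MDS or DQS is not an option, the idea of exact versus close matches can be prototyped in application code. The sketch below is not how MDS or DQS work internally; it swaps in Python's standard `difflib.SequenceMatcher` as a simple similarity measure, and the 85% threshold and sample names are arbitrary assumptions. It does not cover dependencies or audit trails.

```python
from difflib import SequenceMatcher


def match_percent(a: str, b: str) -> float:
    """Similarity of two strings as a percentage (0-100), ignoring case
    and surrounding whitespace."""
    left, right = a.strip().lower(), b.strip().lower()
    return 100.0 * SequenceMatcher(None, left, right).ratio()


def classify(a: str, b: str, close_threshold: float = 85.0) -> str:
    """Label a pair as 'exact', 'close' or 'no match'.
    The threshold is an arbitrary assumption, not a DQS default."""
    score = match_percent(a, b)
    if score == 100.0:
        return "exact"
    if score >= close_threshold:
        return "close"
    return "no match"


# Hypothetical sample records.
print(classify("Contoso Ltd", "contoso ltd"))
print(classify("Contoso Ltd", "Contoso Ltd."))
print(classify("Contoso Ltd", "Fabrikam Inc"))
```

Pairwise comparison like this scales quadratically with the number of records, so it is only suitable for small sets or as a way to explore thresholds before choosing a proper tool.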
