Parameterised datasets in Azure Data Factory

I'm wondering if anyone has any experience in calling datasets dynamically in Azure Data Factory. The situation we have is that we dynamically sweep all tables from IaaS (on-premises SQL Server installations on Azure VMs) application systems into a data lake. We want to have one pipeline that can pass server name, database name, user name and password to the pipeline's activities; the pipeline then sweeps whatever source the parameters tell it to read. The source systems are currently within a separate subscription and domain within our Enterprise Agreement.
We have looked into using the AutoResolveIntegrationRuntime on a generic SQL Server dataset but, as that runtime lives in Azure and the runtimes on the VMs are self-hosted, it can't resolve the source and we get 'cannot connect' errors. So:
i) Does this problem go away if the source systems are in the same subscription and domain?
That leaves whether anyone can assist with:
ii) A way of getting a dynamic runtime to resolve which SQL Server runtime it should use (we have one per VM for resilience purposes, but they can all see each other's instances). We don't want to parameterise a linked service tied to a particular VM, as that makes the other VMs reliant on that single VM.
iii) Ability to parameterise a dataset to call a runtime (doesn't look possible in the UI).
iv) Ability to parameterise the source and sink connections in pipeline activities to call a dataset parameter.

Server, database and table names can be made dynamic by using parameters. The key problem is that references in ADF can't be parameterized: the linked service reference in a dataset, or the integration runtime reference in a linked service. If you don't have too many self-hosted integration runtimes, you could try setting up different pipelines for different networks.
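To make that concrete, here is a minimal sketch of a dataset with parameterized schema/table names, expressed as a Python dict mirroring the ADF JSON (all names are illustrative, not from the question):

```python
# Sketch only: an ADF dataset whose table name is parameterized.
# Note the linkedServiceName block is a fixed, literal reference -
# exactly the part that cannot be parameterized.
dataset = {
    "name": "DS_SqlServer_Generic",  # hypothetical name
    "properties": {
        "type": "SqlServerTable",
        "linkedServiceName": {
            "referenceName": "LS_SqlServer_VM1",  # must be a literal reference
            "type": "LinkedServiceReference",
        },
        "parameters": {
            "schemaName": {"type": "String"},
            "tableName": {"type": "String"},
        },
        "typeProperties": {
            "tableName": {
                "value": "[@{dataset().schemaName}].[@{dataset().tableName}]",
                "type": "Expression",
            }
        },
    },
}
```

A Copy activity can then supply schemaName and tableName values per run, but the run still binds to whichever integration runtime the referenced linked service names.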

Related

Monitoring Azure Data Factory access

Kind of a simple question, but puzzling...
Is there a stat in Azure services to monitor how many times Data Factory is / was accessed?
So, as an example: if an automated system is set up to make persistent API calls to ADF with the malicious intent to exhaust it, is there a way to monitor for that and gather some kind of stats?
The monitoring built into the Azure Data Factory PaaS itself only monitors legitimate, authenticated usage. You can see this on the https://adf.azure.com/en/monitoring/pipelineruns?factory=%2Fsubscriptions%... dashboard.
Notice how the root domain is adf.azure.com - this is the same for all tenants using Data Factory around the world. Your specific subscription / instance are mere query parameters in the URL. Microsoft Azure fully manages the actual hosting of this PaaS, which means they are entirely responsible for mitigating any DDoS or similar bad-actor attempts on this service. It's not something you have to worry about, and therefore not something you have much visibility into.
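What you can observe is that legitimate, authenticated usage. As a minimal sketch (subscription, resource group and factory names are placeholders), the run history shown on that dashboard can also be pulled with the azure-mgmt-datafactory Python SDK:

```python
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Query all pipeline runs in the last 24 hours - the same data the
# adf.azure.com monitoring dashboard displays.
now = datetime.now(timezone.utc)
runs = client.pipeline_runs.query_by_factory(
    "<resource-group>",
    "<factory-name>",
    RunFilterParameters(last_updated_after=now - timedelta(days=1),
                        last_updated_before=now),
)
for run in runs.value:
    print(run.pipeline_name, run.status, run.run_start)
```

Unauthenticated probes against the service never show up here; they are handled before they reach your factory.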
If you ever need or want to check in on how Microsoft is doing with this, head on over to https://status.azure.com/status and search for the "Azure Data Factory" row.
This is really one of the biggest selling points of using a fully-hosted cloud PaaS such as Data Factory. You are no longer responsible for the hardware, or even the range of IP addresses, that back this service - no more than you have to worry about someone DDoSing outlook.office.com, which probably services your entire organisation's email. It could happen, but if it did, it would affect all of Microsoft's customers around the world, not just you personally, so there should be no expectation that you personally do anything special to mitigate against it.
Note that, more generically, if you want to monitor network traffic within your NSGs, interfaces, VNets, etc. on Azure, the thing to use is Azure Monitor's Network Insights at https://portal.azure.com/#view/Microsoft_Azure_Monitoring/AzureMonitoringBrowseBlade/~/networkInsights
This is applicable to all provisioned resources and services on Azure, though - it is not something specific to Azure Data Factory.

What is the good way of working when moving from on-premise to Azure PaaS model?

We plan to migrate to the Azure cloud and would like to know the recommended way of working, especially when the teams are used to working with dedicated UAT/DEV servers.
We have about 10 UAT physical database servers, each used for a different purpose: some servers are used for integration testing, a few for user acceptance testing, chain testing, etc.
We plan to use services such as Azure Data Factory (SSIS IR), Azure SQL Managed Instance, and virtual machines for custom applications.
Creating the same number of database servers in the cloud would be too expensive. How can we best deal with such a scenario, and how is it handled within your teams? Any pointers are appreciated.
As the use case for the migration is lift-and-shift, and since you have multiple cross-database queries, SQL Managed Instance would be the best fit.
As SQL MI is very costly, having different MI instances for different environments would prove very expensive (especially with fewer than 10 databases per instance).
I would suggest having two SQL MI instances:
one for the lower environments (Dev, Test, UAT) and one for the Prod environment.
Name the databases xyz_env, create SQLCMD variables within the DACPAC, and parameterize each deployment to avoid manual code changes per database, as sketched below.
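A hedged sketch of what that parameterized deployment can look like, driving SqlPackage from Python (the server names, the DACPAC and the Env variable are all made up for illustration):

```python
# Sketch: publish one DACPAC per environment, overriding the SQLCMD
# variable that the database project uses to vary names per environment.
import subprocess

ENVIRONMENTS = {
    "dev":  "sqlmi-lower.example.net",   # hypothetical lower-env instance
    "uat":  "sqlmi-lower.example.net",
    "prod": "sqlmi-prod.example.net",    # hypothetical prod instance
}

for env, server in ENVIRONMENTS.items():
    subprocess.run(
        [
            "sqlpackage",
            "/Action:Publish",
            "/SourceFile:warehouse.dacpac",          # same artifact everywhere
            f"/TargetServerName:{server}",
            f"/TargetDatabaseName:warehouse_{env}",  # the xyz_env convention
            f"/Variables:Env={env}",                 # SQLCMD variable in the DACPAC
        ],
        check=True,
    )
```

The same compiled artifact is deployed to every environment; only the variables differ, which is what removes the manual code changes.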
Alternatively, you could leverage Azure SQL Database and the concept of elastic queries (for cross-database queries). That takes some additional effort, but it is worth it for the complete benefits of PaaS: there won't be any VNet to manage, the databases are directly accessible to other Azure resources, and it would be very cost-effective.
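For the elastic-query route, the setup is plain T-SQL run once in the database that needs to query across; a rough sketch via pyodbc, with every server, database, credential and table name invented for illustration:

```python
# Sketch: create an external table in app_db that points at a table in
# reference_db on the same logical server (elastic query, TYPE = RDBMS).
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myserver.database.windows.net;Database=app_db;"
    "UID=admin_user;PWD=<password>"
)
conn.autocommit = True  # the DDL below should not run inside a transaction
cur = conn.cursor()

cur.execute("CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong-password>';")
cur.execute("""
    CREATE DATABASE SCOPED CREDENTIAL ElasticCred
        WITH IDENTITY = 'admin_user', SECRET = '<password>';
""")
cur.execute("""
    CREATE EXTERNAL DATA SOURCE RefDb WITH (
        TYPE = RDBMS,
        LOCATION = 'myserver.database.windows.net',
        DATABASE_NAME = 'reference_db',
        CREDENTIAL = ElasticCred
    );
""")
cur.execute("""
    CREATE EXTERNAL TABLE dbo.Customers (CustomerId INT, Name NVARCHAR(100))
        WITH (DATA_SOURCE = RefDb);
""")

# Cross-database reads now look local:
for row in cur.execute("SELECT TOP 5 CustomerId, Name FROM dbo.Customers"):
    print(row.CustomerId, row.Name)
```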

Azure Storage Account for Tables

So first of all I'd like to say I'm neither a DBA nor a coder; I'm just a regular IT person who works in network and infrastructure support. However, I like to get familiar with technologies in general and understand the basics of them - let's say how they work and how they're implemented, without the additional specific details.
I've been reading about Azure Storage Accounts with regard to tables. As IT, I had to implement simple file shares via SMB 3.0 in order to have them mapped on our network, and I've come across other options such as blobs, tables and queues. I've read about them; however, I'm trying to grasp the main functionality of tables for a coder.
Correct me if I am wrong: when you code an app with a database, you can put the database on the same or a different server, which can be on-premises or in the cloud, and you kind of link the two together.
And as far as I can tell from what I was able to find out investigating on the web, these tables are NoSQL with no constraints; you create the tables and data through Visual Studio thanks to an API, and that information is then reflected in your storage.
How is this useful when using it for the app you're developing?
I've been reading about Azure Storage Accounts with regard to tables. As IT, I had to implement simple file shares via SMB 3.0 in order to have them mapped on our network, and I've come across other options such as blobs, tables and queues. I've read about them; however, I'm trying to grasp the main functionality of tables for a coder.
And as far as I can tell from what I was able to find out investigating on the web, these tables are NoSQL with no constraints; you create the tables and data through Visual Studio thanks to an API, and that information is then reflected in your storage.
An Azure Storage Account is a "box" that keeps your Blobs, Tables, Queues and Files organised from the management point of view and for access control. Each storage type is good for its specific tasks.
If the world had just one super-storage that solved all our possible cases for storing, querying and managing data, then there would not be such a variety of different databases, storage types, etc. available.
If you need to share the files as a "network folder" - try Azure Files.
If your coders need database storage, then the first question would be: what requirements do they have for the database? What would the purpose of that database be, etc.? Azure in particular has a lot of different database solutions and, again, each of them is good for some specific task and can be a poor choice for other tasks.
As to Azure Tables, from the official docs:
Azure Table storage is a service that stores structured NoSQL data in the cloud, providing a key/attribute store with a schemaless design.
So, if your coders do need to store such data, then yes, that would be one of the possible choices.
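As a minimal sketch of what that key/attribute store looks like to a coder (using the azure-data-tables Python package; the account, table and entity values here are invented):

```python
from azure.data.tables import TableServiceClient

# Connect with the storage account's connection string (placeholder).
service = TableServiceClient.from_connection_string("<storage-connection-string>")
table = service.create_table_if_not_exists("devices")

# Entities are schemaless: a PartitionKey, a RowKey, and any properties you like.
table.create_entity({
    "PartitionKey": "building-1",
    "RowKey": "switch-042",
    "model": "Catalyst 9300",
    "ports": 48,
})

# Query by key/attribute - no joins, no foreign keys, no constraints.
for entity in table.query_entities("PartitionKey eq 'building-1'"):
    print(entity["RowKey"], entity["model"])
```

Note there is nothing Visual Studio-specific about this: any language with an SDK or an HTTP client can do the same, since it is all a REST API underneath.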
Correct me if I am wrong: when you code an app with a database, you can put the database on the same or a different server, which can be on-premises or in the cloud, and you kind of link the two together.
Correct. But you can also have your own server with the database, which you need to manage yourself, or you can choose a cloud service which provides the database for you but keeps the underlying server and other maintenance activity managed for you, so you don't need to worry about or spend your time on that.
How is this useful when using it for the app you're developing?
It is important to understand what your requirements are for data storage in order to pick a proper one. This question should perhaps be addressed not to you, but to your coders, who are building the app and can consolidate their requirements for the database store. Usually, they will tell you exactly what they need, and you may give them some ideas or advice about alternatives, if any (that may be a similar solution with extra functionality, a different way the data is stored or processed, more built-in integrations that may be important for you, or a decision on whether to keep your own installation or use a cloud-managed service).
For your possible follow-up question about "When should I use a NoSQL database instead of a relational database? Is it okay to use both on the same site?", see this thread.
Update based on further questions:
If I develop an application with a database whose tables are on Azure, can I call, let's say, functions or data from it in my main application that is hosted on-premises? What's the benefit of doing that versus hosting the tables on-premises, other than it being largely scalable and highly available?
Perhaps you need to better understand the relationship between the app (application) and the DB (database). The database is a standalone system which stores the data and replies to incoming queries (receives a request, processes it, returns the result). Overall, it does not matter to the DB who is requesting the data; it is a "passive" system. (There are some cases where a DB can trigger further processes in data-processing pipelines, but that is beyond this scope.)
The app, in contrast, is the active system in the app<->DB relationship. (Also leaving aside more advanced designs where the app is not just one system.) The app receives requests, processes them (making external requests to other "services" if necessary), and gives a response (with or without data) to the requester. In the app<->DB relationship, those external requests are what happens: at some point the app needs some data from the DB, so it makes a request to the DB, obtains the response and continues its own logic.
Where the app server and the DB server are placed is not that important (for simplicity). The important part is whether the DB server is accessible for requests. The DB can be on-prem with a public static IP address; it can be in the cloud on your own server with a public static IP address (sometimes that is achieved in different ways, but we skip that for simplicity); or it can be a Database-as-a-Service cloud solution, where you do not need to have a server and configure the database, but instead have a URL endpoint which you use to query the DB.
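Applied to Azure Tables, that last case looks roughly like this from an app hosted anywhere, on-prem included (account name, key and table are placeholders); all the app needs is the public endpoint and a credential:

```python
from azure.core.credentials import AzureNamedKeyCredential
from azure.data.tables import TableClient

# The DB sits behind a public HTTPS endpoint; where this code runs
# (on-prem or in the cloud) makes no difference to the query itself.
table = TableClient(
    endpoint="https://mystorageacct.table.core.windows.net",
    table_name="devices",
    credential=AzureNamedKeyCredential("mystorageacct", "<account-key>"),
)

entity = table.get_entity(partition_key="building-1", row_key="switch-042")
print(entity["model"])
```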
I appreciate the answer, and I pretty much agree with what you're saying.
But my question goes beyond what the requirements are for the developers.
I'll modify the question. If I develop an application with a database whose tables are on Azure, can I call, let's say, functions or data from it in my main application that is hosted on-premises? What's the benefit of doing that versus hosting the tables on-premises, other than it being largely scalable and highly available?
Azure Storage Tables are the "Notepad" of NoSQL databases. If you want quick and easy key/value pairs, Tables is the way to go. If you are looking for the "Word" of NoSQL in Azure, then Cosmos DB is where it's at. Cosmos DB offers global distribution, better features and a better SLA (see comparison). Tables are cheaper too.
Azure also supports MySQL, PostgreSQL, MariaDB and MSSQL as PaaS offerings if you wish to use a traditional database.

Use parameters in place of linked service name in Azure Data Factory V2

My question is slightly similar to this question, but sufficiently different to merit a new thread.
I have a requirement to extract data from a number of different on-premises SQL Server instances over the internet. I am using Azure Data Factory V2 and the Integration Runtime to access data from these servers.
The problem is that I will have many pipelines to manage and update. I want to have a single Data Factory process which uses parameters for linked service names.
Is it possible to have 1 pipeline which uses a parameter to reference a linked service name which is updated before re-executing the pipeline?
I am struggling to find a useful article on how this can be achieved.
Reference names can't be parameterized. But making linked services support parameters is in the plan, as the post you mentioned says.
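For context, a sketch of roughly what that planned feature would enable, expressed as a Python dict mirroring the ADF JSON (the service name, parameters and connection string are all illustrative):

```python
# Sketch only: a linked service whose connection string is assembled from
# its own parameters. Note the integration runtime is still a fixed reference.
linked_service = {
    "name": "LS_SqlServer_Dynamic",  # hypothetical name
    "properties": {
        "type": "SqlServer",
        "parameters": {
            "serverName": {"type": "String"},
            "databaseName": {"type": "String"},
        },
        "typeProperties": {
            "connectionString": (
                "Server=@{linkedService().serverName};"
                "Database=@{linkedService().databaseName};"
                "Integrated Security=True;"
            )
        },
        "connectVia": {
            "referenceName": "SelfHostedIR-VM1",  # still not parameterizable
            "type": "IntegrationRuntimeReference",
        },
    },
}
```

Until then, one linked service (and often one pipeline) per source network, as the earlier answer suggests, is the practical workaround.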

Achieving MasterData deduplication on Azure

I am looking at achieving Master Data deduplication based on match percentages in AzureDB... I was looking at something equivalent to Master Data Services / DQS (Data Quality Services) in SQL Server 2012:
https://channel9.msdn.com/posts/SQL11UPD05-REC-06
Broadly, I'm looking for controls on match rules (exact, close match, etc.), handling of dependencies, and an audit trail (undo capability, etc.).
I reckon this must be available in the Azure cloud if it is available in SQL Server. Could you please point me to how I can get this done on AzureDB?
Please note: I am NOT looking for data sources like MelissaData or D&B that are listed on the Azure Marketplace.
Master Data Services is not just a database process: it also centrally involves a website component, which still (as of 2021) requires some Windows server running IIS.
This can be an Azure Virtual Machine (link to documentation) but there is no serverless offering for this at this time.
The database itself can be hosted on an Azure SQL Managed Instance (link to documentation) but not on a standalone Azure SQL DB, as far as I can tell. This is presumably because some of the essential components of MDS sit outside the database, much as services like SSIS are more than just a database.
Data Quality Services is a similar story: it uses three databases (link to documentation) and seemingly some components outside the databases, so it wouldn't be possible to deploy in standalone Azure SQL DBs. It may be possible to run it on a Managed Instance; I couldn't find a clear answer to that. And again, there is no fully-serverless offering at this time.
Of course, all of this can easily be run via IaaS (Infrastructure as a Service) using an Azure virtual machine running SQL Server.
