Azure Data Factory: multiple data factories vs a single factory

I'm building an Azure data lake using data factory at the moment, and am after some advice on having multiple data factories vs just one.
I have one data factory at the moment that sources data from one EBS instance, for one specific company under an enterprise. In the future, though, there might be other EBS instances and other companies (with other applications as sources) to incorporate into the factory, and I'm thinking the diagram might get a bit messy.
I've searched around and found this site, which recommends keeping everything in a single data factory so that linked services can be reused. I guess that is a good thing; however, as I have scripted the build for one data factory, it would be pretty easy to build the linked services again to point at the same data lake, for instance.
https://www.purplefrogsystems.com/paul/2017/08/chaining-azure-data-factory-activities-and-datasets/
Pros for having only one instance of data factory:
Only have to create the datasets and linked services once
Can see overall lineage in one diagram
Cons:
Could get messy over time
Could get so big that it's hard to even find the pipeline you are after
Has anyone out there got some large deployments of Azure Data Factory that bring in potentially thousands of data sources, mix them together and transform them? I'd be interested in hearing your thoughts.

My suggestion is to have only one, as it makes it easier to configure multiple integration runtimes (gateways). If you decide to have more than one data factory, take into consideration that a PC can only have one integration runtime installed, and that an integration runtime can only be registered to one data factory instance.
I think the cons you are listing are both fixed by having naming rules. It's not messy to find the pipeline you want if you name pipelines like Pipeline_[Database name][db schema][table name], for example.
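Purely as an illustration of that kind of convention (a variant of the pattern above, with underscores as separators; the exact pattern is up to you), a tiny Python sketch of generating names so a large factory stays searchable:

```python
# Hypothetical helper: build predictable pipeline names from source metadata,
# a variant of the Pipeline_[Database name][db schema][table name] pattern above.
def pipeline_name(database: str, schema: str, table: str) -> str:
    return f"Pipeline_{database}_{schema}_{table}"

print(pipeline_name("SalesDB", "dbo", "Orders"))  # Pipeline_SalesDB_dbo_Orders
```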
I have a project with thousands of datasets and pipelines, and it's not harder to handle than smaller projects.
Hope this helped!

I'd initially have agreed that an integration runtime being tied to a single data factory is a restriction; however, I suspect it is no longer, or will soon no longer be, one.
In the March 13th update to AzureRm.DataFactories, there is a comment stating "Enable integration runtime to be shared across data factory".
I think it will depend on the complexity of the data factory and if there are inter-dependencies between the various sources and destinations.
The UI particularly (even more so in V2) makes managing a large data factory easy.
However if you choose an ARM deployment technique the data factory JSON can soon become unwieldy in even a modestly complex data factory. And in that sense I'd recommend splitting them.
You can of course mitigate maintainability issues, as people have mentioned, by breaking your ARM templates into nested deployments, using ARM parameterisation or Data Factory V2 parameterisation, or using the SDK directly with separate files. Or even just use the UI (now with git support :-) ).
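As one concrete example of the "SDK direct with separate files" option, here is a minimal, hedged sketch using the Python management SDK (azure-mgmt-datafactory); the subscription, resource group, factory and pipeline names are placeholders, and the pipeline itself is deliberately trivial:

```python
# Sketch only: deploy one pipeline definition at a time via the ADF management SDK,
# rather than maintaining a single monolithic ARM template.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    ParameterSpecification,
    PipelineResource,
    WaitActivity,
)

subscription_id = "<subscription-id>"  # placeholder
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# A trivial parameterised pipeline; in practice each pipeline would live in its own file.
pipeline = PipelineResource(
    parameters={"TableName": ParameterSpecification(type="String")},
    activities=[WaitActivity(name="Placeholder", wait_time_in_seconds=1)],
)

adf_client.pipelines.create_or_update(
    "my-resource-group",            # placeholder resource group
    "my-data-factory",              # placeholder factory name
    "Pipeline_SalesDB_dbo_Orders",  # pipeline name following the convention above
    pipeline,
)
```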
Perhaps more importantly, particularly as you mention separate companies being sourced from: it sounds like the data isn't related, and if it isn't, should it be isolated to avoid any coding errors? Or perhaps even to allow segregated roles and responsibilities for the data factories.
On the other hand, if the data is interrelated, having it in one data factory makes it far easier for Data Factory to manage data dependencies and to re-run failed slices in one go.

After the March release, you can share integration runtimes among different data factories.
The other thing to do is to create different folders for the various pipelines and datasets.

My suggestion is to create one Data Factory service per project. If you need to transfer data from two sources to one destination, and each transformation needs several pipelines, linked services and other components, I suggest creating a separate ADF service for each source. In that case I would treat each source as a separate integration project.
You will also have a separate CI/CD process for each project.
In source control you will also need two separate repositories.

If you are using ADF v1 then it will get messy. At a client of ours we have over 1000 pipelines in one Data Factory. If you are starting fresh, I would recommend looking at v2 because it allows you to parameterize things and should make your scripts more reusable.

Related

How can I migrate from Azure DevOps Services to Azure DevOps Server?

I have my project collection running in Azure DevOps Online (Services), and I would like to migrate it to an on-premises Azure DevOps Server.
Help me out here with the incompatibility issues I will be facing and how to overcome them.
What are the options to migrate from Azure DevOps Online (Services) to Azure DevOps Server (on-premises)?
Is there any service available in Azure to successfully achieve the above migration without any data loss?
Must I use a third-party tool to do the above migration without any data loss?
Help me out here with the downtime required for a 100 GB project collection with multiple repositories.
Project collection size: 100 GB
One of the previous answers (since deleted?) captured most of the critical points, including that no tool can migrate 100% of the data with zero loss. (A truly lossless migration is not feasible, as some automatically generated and configuration values, such as work item IDs, will inherently differ between the two instances.) The only way to get a zero-data-loss migration would therefore be to lift and shift the complete project collection image from Azure DevOps Services to Azure DevOps Server, which is not supported by the official Azure DevOps migration tool. Given that, the only way left to migrate data is using the Azure DevOps APIs.
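To make the API route concrete, here is a minimal sketch (nowhere near a complete migrator) of pulling work item IDs out of Azure DevOps Services with the REST API and a personal access token; the organization, project and query are placeholders:

```python
# Sketch: query work item IDs from Azure DevOps Services via the WIQL REST endpoint.
# A real migration would page through results, fetch full work items with history and
# attachments, and replay them against the on-premises server's APIs.
import requests

ORG = "my-org"         # placeholder organization
PROJECT = "MyProject"  # placeholder project
PAT = "<personal-access-token>"

url = f"https://dev.azure.com/{ORG}/{PROJECT}/_apis/wit/wiql?api-version=7.0"
wiql = {"query": "SELECT [System.Id] FROM WorkItems WHERE [System.TeamProject] = @project"}

resp = requests.post(url, json=wiql, auth=("", PAT))  # PAT as the basic-auth password
resp.raise_for_status()

ids = [item["id"] for item in resp.json()["workItems"]]
print(f"{len(ids)} work items to migrate")
```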
So, the best approach is to understand what data cannot be migrated by the migration tools that you are evaluating, and then decide what works best for you. Also, it will not be a black and white selection when it comes to choosing a migration solution. First, you need to define the must-haves you expect from migration and then evaluate the different migrators available in the market. Here are a few common selection criteria:
Data Loss:
Understand what data can and cannot be migrated by the migration solution. Ideally, the tool should be able to migrate work items (along with history, attachments, mentions, and inline images), test management (including test results), source code, dashboards, and areas and iterations. For builds and pipelines you can use the native export-import feature, as they require manual changes to tweak the connections.
Zero Downtime:
Downtime adds operational costs and impacts development operations, as teams cannot use the Azure DevOps tooling. Understand thoroughly whether there is any scenario in which downtime will be required for any type of data.
Ease of Use:
Some tools are a collection of unsupported scripts (e.g. Naked Agility's) that require a very high degree of sophistication to use. These can be extremely expensive (even though the scripts are open source), error prone, and a hindrance to operations.
Project Consolidation or Customized Templates:
Analyze if you want to consolidate multiple projects into one project while migrating or if the templates need to be customized. If that is the need, evaluate if the migration tool can support such configuration with ease and has a UI to do so. Manually configuring mappings for each project can be tedious and highly error prone.
Migration Time:
Many migration tools migrate projects one by one, which consumes a lot of effort and time when data is spread across multiple projects. Understand how many projects can be migrated in parallel to keep the migration speedy.
Reverse Synchronization:
Do you want to keep the data in sync between Services and Server for some time post-migration? Will data be integrated bidirectionally or unidirectionally? Answer these questions and then evaluate whether the migration solution will meet those requirements.
Commercial Support:
Migration can be tricky and time-consuming, as, over time, different teams have created all the odd stuff in there. Better to have a team of experts do the migration for you while you focus on defining requirements and validating the completeness of migration.
I hope this helps. Full disclosure: I work for OpsHub, where we are experts at data migration and using OpsHub Azure DevOps migrator have migrated multiple organizations to and from Azure DevOps Services and Server over the last decade. Contact us if you need more help.

Mapping Dataflow vs SQL Stored Procedure in ADF pipeline

I have a requirement where I need to choose between Mapping Data Flow vs SQL Stored Procedures in an ADF pipeline to implement some business scenarios. The data volume is not too huge now but might get larger at a later stage.
The business logic is at times complex: I have to join multiple tables, write subqueries, and use window functions, nested CASE statements, etc.
All of my business requirements could be easily implemented through a SP but there is a slight inclination towards mapping data flow considering that it runs spark underneath and can scale up as required.
Does ADF Mapping Data Flow have an upper hand over SQL stored procedures when used in an ADF pipeline?
Some of the concerns I have with Mapping Data Flow are below:
The time taken to implement complex logic using data flows is much greater than with a stored procedure
The execution time for a mapping data flow is much higher, considering the time it takes to spin up the Spark cluster
Now, if I decide to use SQL SPs in the pipeline, what could be the disadvantages?
Would there be issues with the scalability if the data volume grows rapidly at some point in time?
This is kind of an opinion question, which doesn't tend to do well on Stack Overflow, but the fact that you're comparing Mapping Data Flows with stored procs tells me that you have Azure SQL Database (or similar) and Azure Data Factory (ADF) in your architecture.
If you think about the fact Mapping Data Flows is backed by Spark clusters, and you already have Azure SQL DB, then what you really have is two types of compute. So why have both? There's nothing better than SQL at doing joins, nested queries etc. Azure SQL DB can easily be scaled up and down (eg via its REST API) - that seemed to be one of your points.
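On that scaling point, as one illustration (the portal and REST API work just as well), the service objective can also be changed with plain T-SQL; the server, database and target tier below are placeholders:

```python
# Sketch: scale an Azure SQL Database by changing its service objective with T-SQL.
# Run against the logical server's master database; the resize happens asynchronously.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=master;"  # placeholder server
    "UID=admin_user;PWD=<password>",
    autocommit=True,  # ALTER DATABASE cannot run inside a user transaction
)
conn.execute("ALTER DATABASE [MyAppDb] MODIFY (SERVICE_OBJECTIVE = 'S3');")  # placeholder tier
conn.close()
```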
Having said that, Mapping Data Flows is powerful and offers a nice low-code experience. So if your requirement is low-code with powerful transforms, then it could be a good choice. Just bear in mind that if your data is already in a database and you're using Mapping Data Flows, what you're doing is taking data out of SQL, up into a Spark cluster, processing it, then pushing it back down. That seems like duplication to me, so I reserve Mapping Data Flows (and Databricks notebooks) for things I cannot already do in SQL; advanced analytics, hard maths and complex string manipulation might be good candidates. Another use case might be work offloading, where you deliberately want to offload work from your db. Just remember the cost implication of having two types of compute running at the same time.
I also saw an example recently where someone had implemented a slowly changing dimension type 2 (SCD2) using Mapping Data Flows but had used 20+ different MDF components to do it. To me that is low-code in name only: highly complex and hard to maintain and debug. The same process can be done with a single MERGE statement in SQL.
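For reference, a hedged sketch of that single-statement pattern (strictly an INSERT wrapping the MERGE's OUTPUT, which is the usual way to do SCD2 in one statement); all table and column names are hypothetical:

```python
# Sketch: SCD2 in one statement. The MERGE inserts brand-new members, expires the
# current row when a tracked attribute changes, and its OUTPUT feeds the outer INSERT,
# which creates the new current row for the changed members.
import pyodbc

SCD2_MERGE = """
INSERT INTO dbo.DimCustomer (CustomerId, Name, City, ValidFrom, ValidTo, IsCurrent)
SELECT CustomerId, Name, City, SYSUTCDATETIME(), NULL, 1
FROM (
    MERGE dbo.DimCustomer AS tgt
    USING stg.Customer AS src
        ON tgt.CustomerId = src.CustomerId AND tgt.IsCurrent = 1
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerId, Name, City, ValidFrom, ValidTo, IsCurrent)
        VALUES (src.CustomerId, src.Name, src.City, SYSUTCDATETIME(), NULL, 1)
    WHEN MATCHED AND (tgt.Name <> src.Name OR tgt.City <> src.City) THEN
        UPDATE SET tgt.IsCurrent = 0, tgt.ValidTo = SYSUTCDATETIME()
    OUTPUT $action AS merge_action, src.CustomerId, src.Name, src.City
) AS changes
WHERE merge_action = 'UPDATE';
"""

conn = pyodbc.connect("DSN=MyAzureSqlDb")  # placeholder connection
conn.execute(SCD2_MERGE)
conn.commit()
conn.close()
```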
So my personal view is, use Mapping Data Flows for things that you can't already do with SQL, particularly when you already have SQL databases in your architecture. I personally prefer an ELT pattern, using ADF for orchestration (not MDF) which I regard as easier to maintain.
Some other questions you might ask are:
what skills do your team have? SQL is a fairly common skill. MDF is still low-code but niche.
what skills do your support team have? Are you going to train them on MDF when you hand this over?
how would you rate the complexity and maintainability of the two approaches, given the above?
HTH
One disadvantage of using SPs in your pipeline is that the SP will run directly against the database server. So if you have any other queries, transactions or jobs running against the DB at the same time your SP is executing, you may experience longer run times for each (depending on query complexity, records read, etc.). This issue could compound as data volume grows.
We have decided to use SPs in our organization instead of Mapping Data Flows. The cluster spin-up time was an issue for us as we scaled up. To address the issue I mentioned above with SPs, we stagger our workload and schedule jobs to run during off-peak hours.

Using SQL Stored Procedure vs Databricks in Azure Data Factory

I have a requirement to write up to 500k records daily to Azure SQL DB using an ADF pipeline.
I have simple calculations as part of the data transformation that can be performed in a SQL stored procedure activity. I've also observed Databricks notebooks being used commonly, especially for the scalability benefits going forward. But there is the overhead of placing files in another location after transformation, managing authentication, etc., and I want to avoid any over-engineering unless absolutely required.
I've tested SQL Stored Proc and it's working quite well for ~50k records (not yet tested with higher volumes).
But I'd still like to know the general recommendation between the 2 options, esp. from experienced Azure or data engineers.
Thanks
I'm not sure there is enough information to make a solid recommendation. What is the source of the data? Why is ADF part of the solution? Is this 500K rows once per day or a constant stream? Are you loading to a Staging table then using SPROC to move and transform the data to another table?
Here are a couple of thoughts:
If the data operation is SQL to SQL [meaning the same SQL instance for both source and sink], then use Stored Procedures. This allows you to stay close to the metal and will perform the best. An exception would be if the computational load is really complicated, but that doesn't appear to be the case here.
Generally speaking, the only reason to call Databricks from ADF is if you already have that expertise and the resources to support it already exist.
Since ADF is part of the story, there is a middle ground between your two scenarios - Data Flows. Data Flows are a low-code abstraction over Databricks. They are ideal for in-flight data transforms and perform very well at high loads. You do not author or deploy notebooks, nor do you have to manage the Databricks configuration, and they are first-class citizens in ADF pipelines.
As an experienced (former) DBA, data engineer and data architect, I cannot see what Databricks adds in this situation. The piece of the architecture you might need to scale is the target for the INSERTs, i.e. Azure SQL Database, which is ridiculously easy to scale either manually via the portal or via the REST API, if that is even required. Consider techniques such as loading into heaps and partition switching if you need to tune the insert.
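As a rough illustration of the heap-plus-partition-switch idea (the DSN, tables and partition number are all placeholders, and the staging heap must match the target's schema and carry a trusted check constraint for the partition boundary):

```python
# Sketch: bulk-load into an unindexed staging heap, then switch it into the
# partitioned fact table; the switch itself is a metadata-only operation.
import pyodbc

conn = pyodbc.connect("DSN=MyAzureSqlDb")  # placeholder connection
cur = conn.cursor()
cur.fast_executemany = True  # batches the parameterised inserts for speed

rows = [(20240101, 42, 9.99)]  # (DateKey, ProductKey, Amount) - sample data
cur.executemany(
    "INSERT INTO dbo.FactSales_Stage (DateKey, ProductKey, Amount) VALUES (?, ?, ?)",
    rows,
)

# Move the loaded rows into partition 5 of the target table.
cur.execute("ALTER TABLE dbo.FactSales_Stage SWITCH TO dbo.FactSales PARTITION 5;")
conn.commit()
conn.close()
```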
The overhead of adding an additional component to your architecture and pushing your data through it would have to be worth it, plus there is the additional cost of spinning up Spark clusters at the same time your database is running.
Databricks is a superb tool and has a number of great use cases, eg advanced data transforms (ie things you cannot do with SQL), machine learning, streaming and others. Have a look at this free resource for a few ideas:
https://databricks.com/p/ebook/the-big-book-of-data-science-use-cases

Azure Table Storage design question: Is it a good idea to use 1 table to store multiple types?

I'm just wondering if anyone who has experience on Azure Table Storage could comment on if it is a good idea to use 1 table to store multiple types?
The reason I want to do this is so I can do transactions. However, I also want to get a sense, in terms of development, of whether this approach would be easy or messy to handle. So far I'm using Azure Storage Explorer to assist development, and viewing multiple types in one table has been messy.
To give an example, say I'm designing a community site of blogs. If I store all blog posts, categories and comments in one table, what problems would I encounter? On the other hand, if I don't, then how do I ensure some consistency between category and post (assume 1 post can have only 1 category)?
Or are there any other different approaches people take to get around this problem using table storage?
Thank you.
If your goal is to have perfect consistency, then using a single table is a good way to go about it. However, I think you are probably going to make things more difficult for yourself and get very little reward. The reason I say this is that table storage is extremely reliable. Transactions are great and all if you are dealing with very, very important data, but in most cases, such as a blog, I think you would be better off either 1) allowing for some very small percentage of inconsistent data or 2) handling failures in a more manual way.
The biggest issue you will have with storing multiple types in the same table is serialization. Most of the current table storage SDKs and utilities were designed to handle a single type. That being said, you can certainly handle multiple schemas, either manually (i.e. deserializing your objects into a master object that contains all possible properties) or by interacting directly with the REST services (i.e. not going through the Azure SDK). If you used the REST services directly, you would have to handle serialization yourself and thus could handle the multiple types more efficiently, but the trade-off is that you are doing everything manually that is normally handled by the Azure SDK.
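For what it's worth, here is a minimal sketch of the "discriminator property" approach using the current Python azure-data-tables SDK (which postdates this answer); the table, keys and properties are made up, and the batch only works because both entities share a PartitionKey:

```python
# Sketch: store two entity types in one table, distinguished by an EntityType
# property, and write them atomically in a single entity-group transaction.
from azure.data.tables import TableClient

client = TableClient.from_connection_string(
    "<connection-string>", table_name="CommunityContent"  # placeholders; table must exist
)

post = {
    "PartitionKey": "blog-123",  # everything for one blog shares a partition
    "RowKey": "post-001",
    "EntityType": "Post",        # discriminator used when deserializing
    "Title": "Hello world",
    "Category": "azure",
}
comment = {
    "PartitionKey": "blog-123",
    "RowKey": "post-001-comment-001",
    "EntityType": "Comment",
    "Text": "Nice post!",
}

# Entity-group transaction: both writes succeed or fail together.
client.submit_transaction([("upsert", post), ("upsert", comment)])

# On read, branch on the discriminator to rebuild the right type.
for entity in client.query_entities("PartitionKey eq 'blog-123'"):
    print(entity["EntityType"], entity["RowKey"])
```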
There really is no right or wrong way to do this. Both situations will work, it is just a matter of what is most practical. I personally tend to put a single schema per table unless there is a very good reason to do otherwise. I think you will find table storage to be reliable enough without the use of transactions.
You may want to check out the Windows Azure Toolkit. We designed that toolkit to simplify some of the more common Azure tasks.

What design decisions can I make today, that would make a migration to Azure and Azure Tables easier later?

I'm rebuilding an application from the ground up. At some point in the future (not sure if it's near or far yet) I'd like to move it to Azure. What decisions can I make today that will make that migration easier?
I'm going to be dealing with large amounts of data, and I like the idea of Azure Tables. Are there some specific persistence choices I can make now that will mimic Azure Tables, so that when the time comes the pain of migration will be lessened?
A good place to start is the Windows Azure Guidance
If you want to use Azure Tables eventually, you could design your database so that every table is just a primary key plus a field containing XML data.
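A hedged sketch of what that shape might look like (all names invented), so each row already resembles a table-storage entity of keys plus a property bag:

```python
# Sketch: a relational table whose shape mimics an Azure Table entity -
# two key columns plus a single XML "property bag" column for everything else.
import pyodbc

DDL = """
CREATE TABLE dbo.BlogEntities (
    PartitionKey NVARCHAR(128) NOT NULL,  -- e.g. the blog id
    RowKey       NVARCHAR(128) NOT NULL,  -- e.g. post or comment id
    Payload      XML           NOT NULL,  -- all other properties serialized here
    CONSTRAINT PK_BlogEntities PRIMARY KEY (PartitionKey, RowKey)
);
"""

conn = pyodbc.connect("DSN=MyAppDb")  # placeholder connection
conn.execute(DDL)
conn.commit()
conn.close()
```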
I would advise planning along the lines of almost-infinitely scalable solutions (see Pat Helland's paper Life beyond Distributed Transactions) and the CQRS approach in general. This way you'll be able to avoid the common pitfalls of distributed apps in general and the peculiarities of Azure Table Storage.
This approach really helps us at Lokad to work with Azure and cloud computing (our datasets are quite large, and various levels of scalability are needed).
