Azure Data Factory: integration dataset vs inline dataset

I want to know the difference between an integration dataset and an inline dataset in ADF.
I know that when multiple people on the team, and multiple pipelines, need the same dataset, we can go for an integration dataset, since it is shareable across branches.
What are the pros and cons of each? Which dataset type will cost more?
Individual branches were created from the main branch and people are already working on their respective branches.
Now, if I need an integration dataset, how do I proceed? Will creating a dataset in the main branch resolve it?
Should I publish it? Will it be visible/accessible to other branches? Will it sync from main to the individual branches?

Azure Data Factory - Dashboard log query - Filter out failed pipelines that successfully rerun

I've been tasked with reducing the monitoring overhead of a data lake (~80 TiB) with multiple ADF pipelines running (~2k daily). Currently we log failed pipeline runs by querying ADFPipelineRun. I do not own these pipelines, nor do I know the inner workings of existing and future ones, so I cannot make assumptions about how to filter them with custom logic in my queries. The team is experiencing alert fatigue, since most failed pipeline runs are resolved by their reruns.
How can I filter out these failures so they don't show up when a rerun succeeds?
The logs expose a few IDs that initially look interesting, like Id, PipelineRunId, CorrelationId, and RunId, but none of these link a failed run to a successful one.
The logs do, however, show an interesting column, UserProperties, which apparently can be populated dynamically during the pipeline run. There may be a solution to be found here, but it would require time and friction to reconfigure all existing factories.
Are there any obvious solutions I have overlooked, preferably Azure-native ones? I can see that reruns and failures are linked inside ADF Studio, but I cannot see a way to query that externally.
After a discussion with the owner of the ADF pipelines, we realized the naming convention of the pipelines would allow me to filter out the noisy failing pipelines that later succeed. It's not a universal solution, but it will work for us, as the naming convention is enforced across the business unit I am supporting.
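For anyone in a similar situation, a minimal sketch of that idea is below, assuming the factory's diagnostic logs land in a Log Analytics workspace: it links runs by PipelineName rather than by any of the run IDs, and drops failures that are followed by a later success of the same pipeline. The workspace ID, lookback window, and exact column usage are assumptions to adapt to your environment.

from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "<log-analytics-workspace-id>"  # placeholder

# Keep only failed runs that have no later success for the same pipeline name.
QUERY = """
ADFPipelineRun
| where Status == 'Failed'
| join kind=leftouter (
    ADFPipelineRun
    | where Status == 'Succeeded'
    | summarize LastSuccess = max(TimeGenerated) by PipelineName
  ) on PipelineName
| where isnull(LastSuccess) or LastSuccess < TimeGenerated
| project TimeGenerated, PipelineName, RunId, Status
"""

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(WORKSPACE_ID, QUERY, timespan=timedelta(days=1))
for table in response.tables:
    for row in table.rows:
        print(row)

An enforced naming convention can then be applied on top of this, for example with an extra where PipelineName startswith '<prefix>' clause in the query.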

Creating a pipeline/mechanism with an approval gateway and if/else logic in Azure to detect delta data quality

Imagine we have a data source, which can be blob storage or a table.
When new data arrives in the data source, the main objective is to create a mechanism that first checks the data quality of the new data using certain statistical tests; if it passes these tests, we should be able to combine the new data with the previous data. The data source must be versioned.
If the new data fails the statistical tests, we should have a mechanism that alerts a developer; if the developer decides to override, we should then still be able to combine the new data with the previous data.
The starting point, where we check the new delta, must be triggered manually. After doing so we need to trigger an Azure DevOps pipeline.
What tools can we use for this? Are there any reference guides we can follow? I need to implement this in Azure.
Key Concerns:
Dataset: Being able to version.
Way to detect delta and store it in a separate place before the tests.
Way to allow a developer to have an override.
Performing statistical tests.
Assuming the steps in your workflow can be broken down into discrete steps, are relatively idempotent or can be checkpointed at each step, and are not long-running, then yes, you could explore using Durable Functions, an advanced orchestration framework for Azure Functions.
Suggestions to match your goals:
Dataset: Being able to version - You should version this explicitly in your dataset during generation. If that's not feasible, you may derive the version from a hash of a combination of the dataset's metadata.
Way to detect delta and store it in a separate place before the tests - this depends on what delta means for your dataset. You can have the code read the previous hash from an entry in table storage and compare it with the current hash.
Way to allow a developer to have an override - Yes, see Human interaction in Durable Functions.
Performing statistical tests - If there are multiple tests to run for each pass, consider using the fan-out/fan-in pattern; a sketch combining this with the override step follows below.
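To make the last two points concrete, here is a minimal orchestrator sketch using the Durable Functions Python programming model. The activity names (RunStatisticalTest, NotifyDeveloper, MergeDelta), the test names, and the OverrideApproval event name are illustrative placeholders, not anything prescribed by Durable Functions.

import azure.durable_functions as df
from datetime import timedelta

def orchestrator_function(context: df.DurableOrchestrationContext):
    dataset = context.get_input()

    # Fan-out/fan-in: run each statistical test as its own activity function.
    tests = ["null_check", "distribution_check", "schema_check"]  # hypothetical test names
    results = yield context.task_all(
        [context.call_activity("RunStatisticalTest", {"dataset": dataset, "test": t}) for t in tests]
    )

    if all(results):
        yield context.call_activity("MergeDelta", dataset)
        return "merged"

    # Human interaction: alert a developer, then wait for an override or a timeout.
    yield context.call_activity("NotifyDeveloper", dataset)
    approval = context.wait_for_external_event("OverrideApproval")
    timeout = context.create_timer(context.current_utc_datetime + timedelta(hours=24))
    winner = yield context.task_any([approval, timeout])

    if winner == approval and approval.result:
        timeout.cancel()
        yield context.call_activity("MergeDelta", dataset)
        return "merged-after-override"
    return "rejected"

main = df.Orchestrator.create(orchestrator_function)

The developer's override would be delivered by raising the OverrideApproval event against the running orchestration instance, for example from a small HTTP-triggered function, a Logic App, or an Azure DevOps pipeline step.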

How can I migrate from Azure DevOps Services to Azure DevOps Server

I have my project collection running in Azure DevOps Online (Services), and I would like to migrate it to an on-premises Azure DevOps Server.
Help me out with the incompatibility issues I will face and how to overcome them, and with the options to migrate from Azure DevOps Online (Services) to Azure DevOps Server (on-premises).
Is there any service available in Azure to achieve the above migration without any data loss?
Must I use a third-party tool to do the migration without any data loss?
Also, help me out with the downtime required for a 100 GB project collection with multiple repositories.
Project collection size: 100 GB
One of the previous answers (since deleted?) captured most of the critical points, including that no tool can migrate 100% of the data with zero loss. (In fact, 100% migration with no loss is not feasible, as some automatically generated and configuration values, like work item IDs, will inherently differ between two instances.) Therefore, the only way to get zero-data-loss migration is to lift and shift the complete project collection image from Azure DevOps Services to Azure DevOps Server, which is not supported by the official Azure DevOps migration tool. Given that, the only way left to migrate data is using the Azure DevOps APIs.
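To give a feel for what an API-based migration involves, here is a minimal Python sketch, using only documented REST endpoints, that enumerates the projects and work item IDs you would have to walk and recreate on the target server. The organization name, PAT environment variable, and WIQL query are placeholders; a real migration also has to carry history, attachments, links, and the other artifacts discussed below.

import os
import requests

ORG = "<your-organization>"          # placeholder
PAT = os.environ["AZDO_PAT"]         # assumption: personal access token stored in an env var
AUTH = ("", PAT)
API = "api-version=7.0"

# List the projects in the source organization.
projects = requests.get(f"https://dev.azure.com/{ORG}/_apis/projects?{API}", auth=AUTH).json()["value"]

for project in projects:
    name = project["name"]
    # List work item IDs per project via a WIQL query; each item then has to be
    # fetched in detail (plus revisions, attachments, links) and recreated on the target.
    wiql = {"query": f"SELECT [System.Id] FROM WorkItems WHERE [System.TeamProject] = '{name}'"}
    result = requests.post(
        f"https://dev.azure.com/{ORG}/{name}/_apis/wit/wiql?{API}",
        json=wiql,
        auth=AUTH,
    ).json()
    ids = [wi["id"] for wi in result.get("workItems", [])]
    print(name, len(ids), "work items")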
So, the best approach is to understand what data cannot be migrated by the migration tools that you are evaluating, and then decide what works best for you. Also, it will not be a black and white selection when it comes to choosing a migration solution. First, you need to define the must-haves you expect from migration and then evaluate the different migrators available in the market. Here are a few common selection criteria:
Data Loss:
Understand what data can and cannot be migrated by the migration solution. Ideally, the tool should be able to migrate work items (along with history, attachments, mentions, and inline images), test management (including test results), source code, dashboards, and areas and iterations. For builds and pipelines, you can use the native export/import feature, as they require manual changes to tweak the connections anyway.
Zero Downtime:
Downtime adds operational costs and impacts development operations, as teams cannot use the Azure DevOps tools during it. Understand thoroughly whether there is any scenario in which downtime will be required, and for which types of data.
Ease of Use:
Some tools are a collection of unsupported scripts (Naked Agility) that require a very high degree of sophistication to use. These can be extremely expensive (even though the scripts are open source), error prone, and can hinder operations.
Project Consolidation or Customized Templates:
Analyze whether you want to consolidate multiple projects into one project while migrating, or whether the templates need to be customized. If so, evaluate whether the migration tool can support such configuration with ease and has a UI to do so. Manually configuring mappings for each project can be tedious and highly error prone.
Migration Time:
Many migration tools migrate projects one by one, consuming a lot of effort and time when data is spread across multiple projects. Understand how many projects can be migrated in parallel to get a speedy migration.
Reverse Synchronization:
Do you want to keep the data in sync between Services and Server for some time post-migration? Will data be synchronized bidirectionally or unidirectionally? Answer these questions and then evaluate whether the migration solution meets those requirements.
Commercial Support:
Migration can be tricky and time-consuming, as, over time, different teams will have created all sorts of odd things in there. It is better to have a team of experts do the migration for you while you focus on defining requirements and validating the completeness of the migration.
I hope this helps. Full disclosure: I work for OpsHub, where we are experts at data migration and, using the OpsHub Azure DevOps migrator, have migrated multiple organizations to and from Azure DevOps Services and Server over the last decade. Contact us if you need more help.

Polymorphic data transformation techniques / data lake / big data

Background: We are working on a solution to ingest huge sets of telemetry data from various clients. The data is in XML format and contains multiple independent groups of information with many nested elements. Clients run different versions, and as a result the data is ingested into different but similar schemas in the data lake. For instance, a startDate field can be a string or an object containing a date. Our goal is to visualise the accumulated information in a BI tool.
Questions:
What are the best practices for dealing with polymorphic data? For example:
Process and transform the required piece of data (a reduced version) into a uni-schema file using a programming language, then process it with Spark and Databricks and consume it in a BI tool.
Decompose the data into meaningful groups, process them, and join them (using the data relationships) with Spark and Databricks.
I'd appreciate your comments, opinions, and experiences on this topic, especially from subject matter experts and data engineers. It would be super nice if you could also share some useful resources about it.
Cheers!
One of the tags you have selected for this thread indicates that you would like to use Databricks for this transformation. Databricks is one of the tools I am using, and I think it is powerful and effective enough for this kind of data processing. Since the data processing platforms I have used the most are Azure and Cloudera, my answer will rely on the Azure stack because it is integrated with Databricks. Here is what I would recommend based on my experience.
The first thing you have to do is define data layers and create a platform for them. For your case, it should have Landing Zone, Staging, ODS, and Data Warehouse layers.
Landing Zone
This will be used for polymorphic data ingestion from your clients. It can be done with just Azure Data Factory (ADF) connecting the client and Azure Blob Storage. I recommend that, in this layer, we don't put any transformation into the ADF pipeline, so that we can create a common pipeline for ingesting raw files. If you have many clients that can send data into Azure Storage, that is fine; you can create dedicated folders for them as well.
Normally, I create folders aligned with client types. For example, if I have three types of clients, Oracle, SQL Server, and SAP, the root folders on my Azure Storage will be oracle, sql_server, and sap, followed by server/database/client names.
Additionally, it seems you may have to set up Azure IoT Hub if you are going to ingest data from IoT devices. If that is the case, this page would be helpful.
Staging Area
This will be an area for schema cleanup. I would have multiple ADF pipelines that transform polymorphic data from the Landing Zone and ingest it into the Staging area. You will have to create a schema (Delta table) for each of your decomposed datasets and data sources. I recommend utilizing Delta Lake, as it makes the data easy to manage and retrieve.
The transformation options you will have are:
Use only ADF transformations. They will allow you to unnest some nested XML columns as well as do some data cleansing and wrangling from the Landing Zone, so that the same dataset can be inserted into the same table.
For your case, you may have to create a particular ADF pipeline for each dataset multiplied by client version.
Use an additional common ADF pipeline that runs a Databricks transformation based on the dataset and client version. This allows more complex transformations that ADF transformations are not capable of (see the sketch below).
For your case, there would also be a particular Databricks notebook for each dataset multiplied by client version.
As a result, the different versions of one particular dataset will be extracted from the raw files, cleaned up in terms of schema, and ingested into one table per data source. There will be some duplicated data for master datasets across different data sources.
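As an illustration of the Databricks option, here is a minimal PySpark sketch of normalizing two hypothetical schema versions of the same field into one staging column; the landing path, table name, and field names are placeholders, and a real notebook would be parameterized by dataset and client version.

from pyspark.sql import functions as F

# 'spark' is the session provided by the Databricks notebook runtime.
raw = spark.read.json("/mnt/landing/telemetry/")  # hypothetical landing-zone path

# One client version exposes the value as a top-level field, another nests it;
# coalesce picks whichever is present for each record.
unified = (
    raw.withColumn("property", F.coalesce(F.col("theProperty"), F.col("data.myProperty")))
       .drop("theProperty", "data")
)

# Assumes a 'staging' database already exists in the metastore.
unified.write.format("delta").mode("append").saveAsTable("staging.telemetry")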
ODS Area
This will be the area for the so-called single source of truth of your data. Multiple data sources are merged into one, so all duplicated data gets eliminated and the relationships between datasets get clarified, which addresses the second item in your question. If you have just one data source, this will also be the area for applying more data cleansing, such as validation and consistency checks. As a result, one dataset is stored in one table.
I recommend using ADF running Databricks again, but this time we can use a SQL notebook instead of Python, because the data is already well structured in the Staging tables.
The data at this stage can be consumed by Power BI. Read more about the Power BI integration with Databricks.
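A minimal sketch of that merge into the ODS layer, assuming Delta tables and made-up table and column names (staging.telemetry, ods.telemetry, a business key called source_id):

# The same statement can be run as-is in a Databricks SQL notebook cell;
# it is wrapped in spark.sql() here only to keep the sketch in Python.
spark.sql("""
    MERGE INTO ods.telemetry AS target
    USING staging.telemetry AS source
    ON target.source_id = source.source_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")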
Furthermore, if you still want a data warehouse or a star schema for advanced analytics, you can do further transformation (again via ADF and Databricks) and utilize Azure Synapse.
Source Control
Fortunately, the tools I mentioned above are already integrated with source control, thanks to Microsoft's acquisition of GitHub. Databricks notebooks and ADF pipeline source code can be version-controlled. Check Azure DevOps.
Many thanks for your comprehensive answer, PhuriChal! Indeed the data sources are always the same software, but with various versions, and unfortunately the data properties do not always remain consistent across those versions. Would it be an option to process the raw data after ingestion, in order to unify and resolve the mismatched properties using a high-level programming language, before processing it further in Databricks? (We may end up with a lot of this processing code to refine the raw data for specific purposes.) I have added an example in the original post:
Version 1:
{
  "theProperty": 8
}

Version 2:
{
  "data": {
    "myProperty": 10
  }
}

Processing =>

Refined version:
[
  { "property": 8 },
  { "property": 10 }
]
That way the inconsistencies are resolved before the data reaches Databricks for further processing. Can this also be an option?

Azure data factories vs factory

I'm building an Azure data lake using data factory at the moment, and am after some advice on having multiple data factories vs just one.
I have one data factory at the moment, that is sourcing data from one EBS instance, for one specific company under an enterprise. In the future though there might be other EBS instances, and other companies (with other applications as sources) to incorporate into the factory - and I'm thinking the diagram might get a bit messy.
I've searched around and found this site, which recommends keeping everything in a single data factory to reuse linked services. I guess that is a good thing; however, as I have scripted the build for one data factory, it would be pretty easy to build the linked services again to point at the same data lake, for instance.
https://www.purplefrogsystems.com/paul/2017/08/chaining-azure-data-factory-activities-and-datasets/
Pros for having only one instance of data factory:
Only have to create the datasets and linked services once
Can see the overall lineage in one diagram
Cons:
Could get messy over time
Could get quite big, making it hard to even find the pipeline you are after
Has anyone got some large deployments of Azure Data Factory out there that bring in potentially thousands of data sources, mix them together and transform them? I'd be interested in hearing your thoughts.
My suggestion is to have only one, as it makes it easier to configure multiple integration runtimes (gateways). If you decide to have more than one data factory, take into consideration that a machine can only have one integration runtime installed, and that the integration runtime can only be registered to one data factory instance.
I think the cons you are listing are both fixed by having naming rules. It's not messy to find the pipeline you want if you name them like Pipeline_[Database name][db schema][table name], for example.
I have a project with thousands of datasets and pipelines, and it's not harder to handle than smaller projects.
Hope this helped!
I'd initially agree that an integration runtime being tied to a single data factory is a restriction; however, I suspect it is no longer, or will soon no longer be, a restriction.
In the March 13th update to AzureRm.DataFactories, there is a comment stating "Enable integration runtime to be shared across data factory".
I think it will depend on the complexity of the data factory and whether there are inter-dependencies between the various sources and destinations.
The UI in particular (even more so in V2) makes managing a large data factory easy.
However, if you choose an ARM deployment technique, the data factory JSON can soon become unwieldy in even a modestly complex data factory, and in that sense I'd recommend splitting them.
You can of course mitigate maintainability issues, as people have mentioned, by breaking your ARM templates into nested deployments, using ARM parameterisation or Data Factory V2 parameterisation, or using the SDK directly with separate files. Or even just use the UI (now with git support :-) ).
Perhaps more importantly, particularly as you mention separate companies being sourced from: it sounds like the data isn't related, and if it isn't, should it be isolated to avoid any coding errors? Or perhaps even to allow segregated roles and responsibilities per data factory?
On the other hand, if the data is interrelated, having it in one data factory makes things far easier, as it allows Data Factory to manage data dependencies and re-run failed slices in one go.
After the March release, you can share integration runtimes among different factories.
The other thing to do is to create different folders for the various pipelines and datasets.
My suggestion is to create one Data Factory service per project. If you need to transfer data from two sources to one destination, and for each transformation you need several pipelines, linked services and other components, I suggest creating two separate ADF services, one per source. In that case I would treat each source as a separate integration project.
You will then also have two separate CI/CD processes, one per project.
In your source control you will likewise need two separate repositories.
If you are using ADF v1 then it will get messy. At one of our clients we have over 1000 pipelines in a single data factory. If you are starting fresh, I would recommend looking at v2, because it allows you to parameterize things and should make your scripts more reusable.
