Creating a Pipeline/Mechanism with a Approval gateway with If Else Logic in Azure to Detect Delta Data Quality

Creating a Pipeline/Mechanism with a Approval gateway with If Else Logic in Azure to Detect Delta Data Quality - azure

Imagine we have a Data source, which can be a blob storage or table.
When new data comes into the data source, the main objective is to create a mechanism so that we can first check the data quality of the new data using certain statistical tests, then if it passes these tests, we should be able to combine the new data with the previous data source. The Data source must be versioned.
Also if the new data fails the statistical tests, then we should have a mechanism where it alerts a developer, then if the developer decides to override and then we should be able to combine the new data with the previous data source.
This specific part must be triggered manually, the starting point where we check the new delta. After doing so we need to trigger an Azure DevOps Pipeline.
What tools can we use for this? Are there any reference guides we can follow for this? I need to implement this in Azure.
Key Concerns:
Dataset: Being able to version.
Way to detect delta and store it in a separate place before the tests.
Way to allow a developer to have an override.
Performing statistical tests.

Assuming that the steps in your entire workflow can be broken down into discreet steps, are relatively idempotent or can be checkpointed at each step, and are not long running, then yes, you could explore using Durable Functions, an advanced orchestration framework for Azure Functions.
Suggestions to match your goals:
Dataset: Being able to version - You should version this explicitly in your dataset during generation. If that's not feasible, you may derive the version based on a hash of a combination of metadata for the dataset.
Way to detect delta and store it in a separate place before the tests - depends on what delta means to your dataset. You can make the code check the previous hash from an entry in table storage, and compare with the current hash.
Way to allow a developer to have an override - Yes, see Human interaction in Durable Functions.
Performing statistical tests - If there are multiple tests to run for each pass, then consider using the fan-out/fan-in pattern.

Related

Polymorphic data transformation techniques / data lake/ big data

Background: We are working on a solution to ingest huge sets of telemetry data from various clients. The data is in xml format and contains multiple independent groups of information which have a lot of nested elements. Clients have different versions and as a result the data is ingested in different but similar schema in the data lake. For instance a startDate field can be string or an object containing date. ) Our goal is to visualise accumulated information in a BI tool.
Questions:
What are the best practices for dealing with polymorphic data?
Process and transform required piece of data (reduced version) to a uni-schema file using a programming language and then process it in spark and databricks and consume in a BI tool.
Decompose data to the meaningful groups and process and join (using data relationships) them with spark and databricks.
I appreciate your comments and sharing opinions and experiences on this topic especially from subject matter experts and data engineers. That would be siper nice if you could also share some useful resources about this particular topic.
Cheers!

One of the tags that you have selected for this thread is pointing out that you would like to use Databricks for this transformation. Databricks is one of the tools that I am using and think is powerful enough and effective to do this kind of data processing. Since, the data processing platforms that I have been using the most are Azure and Cloudera, my answer will rely on Azure stack because it is integrated with Databricks. here is what I would recommend based on my experience.
The first think you have to do is to define data layers and create a platform for them. Particularly, for your case, it should have Landing Zone, Staging, ODS, and Data Warehouse layers.
Landing Zone
Will be used for polymorphic data ingestion from your clients. This can be done by only Azure Data Factory (ADF) connecting between the client and Azure Blob Storage. I recommend ,in this layer, we don't put any transformation into ADF pipeline so that we can create a common one for ingesting raw files. If you have many clients that can send data into Azure Storage, this is fine. You can create some dedicated folders for them as well.
Normally, I create folders aligning with client types. For example, if I have 3 types of clients, Oracle, SQL Server, and SAP, the root folders on my Azure Storage will be oracle, sql_server, and sap followed by server/database/client names.
Additionally, it seems you may have to set up Azure IoT hub if you are going to ingest data from IoT devices. If that is the case, this page would be helpful.
Staging Area
Will be an area for schema cleanup. I will have multiple ADF pipelines that transform polymorphic data from Landing Zone and ingest it into Staging Area. You will have to create schema (delta table) for each of your decomposed datasets and data sources. I recommend utilizing Delta Lake as it will be easy to manage and retrieve data.
The transformation options you will have are:
Use only ADF transformation. It will allow you to unnest some nested XML columns as well as do some data cleansing and wrangling from Landing Zone so that the same dataset can be inserted into the same table.
For your case, you may have to create particular ADF pipelines for each of datasets multiplied by client versions.
Use an additional common ADF pipeline that ran Databricks transformation base on datasets and client versions. This will allow more complex transformations that ADF transformation is not capable of.
For your case, there will also be a particular Databricks notebook for each of datasets multiplied by client versions.
As a result, different versions of one particular dataset will be extracted from raw files, cleaned up in terms of schema, and ingested into one table for each data source. There will be some duplicated data for master datasets across different data sources.
ODS Area
Will be an area for so-called single source of truth of your data. Multiple data sources will be merge into one. Therefore, all duplicated data gets eliminated and relationships between dataset get clarified resulting in the second item per your question. If you have just one data source, this will also be an area for applying more data cleansing, such as, validation and consistency. As a result, one dataset will be stored in one table.
I recommend using ADF running Databricks, but for this time, we can use SQL notebook instead of Python because data is well inserted into the table in Staging area already.
The data at this stage can be consumed by Power BI. Read more about Power BI integration with Databricks.
Furthermore, if you still want a data warehouse or star schema for advance analytics, you can do further transformation (via again ADF and Databricks) and utilize Azure Synapse.
Source Control
Fortunately, the tools that I mentioned above are already integrated with source code version control thanks to acquisition of Github by Microsoft. The Databricks notebook and ADF pipeline source codes can be versioning. Check Azure DevOps.

Many thanks for your comprehensive answer PhuriChal! Indeed the data sources are always the same software, but with various different versions and unfortunately data properties are not always remain steady among those versions. Would it be an option to process the raw data after ingestion in order to unify and resolve unmatched properties using a high level programming language before processing them further in databricks?(We may have many of this processing code to refine the raw data for specific proposes)I have added an example in the original post.
Version1:{
'theProperty': 8
}
Version2:{
'data':{
'myProperty': 10
}
}
Processing =>
Refined version: [{
'property: 8
},
{
'property: 10
}]
So that the inconsistencies are resolved before the data comes to databricks for further processing. Can this also be an option?

Mapping Dataflow vs SQL Stored Procedure in ADF pipeline

I have a requirement where I need to choose between Mapping Data Flow vs SQL Stored Procedures in an ADF pipeline to implement some business scenarios. The data volume is not too huge now but might get larger at a later stage.
The business logic are at times complex where I will have to join multiple tables, write sub queries, use windows functions, nested case statements, etc.
All of my business requirements could be easily implemented through a SP but there is a slight inclination towards mapping data flow considering that it runs spark underneath and can scale up as required.
Does ADF Mapping data flow has an upper hand over SQL Stored Procedures when used in an ADF pipeline?
Some of the concerns that I have with the mapping data flow are as below.
Time taken to implement complex logic using data flows is much more
than a stored procedure
The execution time for a mapping data flow is
much higher considering the time it takes to spin up the spark
cluster.
Now, if I decide to use SQL SPs in the pipeline, what could be the disadvantages?
Would there be issues with the scalability if the data volume grows rapidly at some point in time?

This is kind of an opinion question which doesn't tend to do well on stackoverflow, but the fact you're comparing Mapping Data Flows with stored procs tells me that you have Azure SQL Database (or similar) and Azure Data Factory (ADF) in your architecture.
If you think about the fact Mapping Data Flows is backed by Spark clusters, and you already have Azure SQL DB, then what you really have is two types of compute. So why have both? There's nothing better than SQL at doing joins, nested queries etc. Azure SQL DB can easily be scaled up and down (eg via its REST API) - that seemed to be one of your points.
Having said that, Mapping Data Flows is powerful and offers a nice low-code experience. So if your requirement is to have low-code with powerful transforms then it could be a good choice. Just bear in mind that if your data is already in a database and you're using Mapping Data Flows, that what you're doing is taking data out of SQL, up into a Spark cluster, processing it, then pushing it back down. This seems like duplication to me, and I reserve Mapping Data Flows (and Databricks notebooks) for things I cannot already do in SQL, eg advanced analytics, hard maths, complex string manipulation might be good candidates. Another use case might be work offloading, where you deliberately want to offload work from your db. Just remember the cost implication of having two types of compute running at the same time.
I also saw an example recently where someone had implemented a slowly changing dimension type 2 (SCD2) using Mapping Data Flows but had used 20+ different MDF components to do it. This is low-code in name only to me, with high complexity, hard to maintain and debug. The same process can be done with a single MERGE statement in SQL.
So my personal view is, use Mapping Data Flows for things that you can't already do with SQL, particularly when you already have SQL databases in your architecture. I personally prefer an ELT pattern, using ADF for orchestration (not MDF) which I regard as easier to maintain.
Some other questions you might ask are:
what skills do your team have? SQL is a fairly common skill. MDF is still low-code but niche.
what skills do your support team have? Are you going to train them on MDF when you hand this over?
how would you rate the complexity and maintainability of the two approaches, given the above?
HTH

One disadvantage with using SP's in your pipeline, is that your SP will run directly against the database server. So if you have any other queries/transactions or jobs running against the DB at the same time that your SP is executing you may experience longer run times for each (depending on query complexity, records read, etc.). This issue could compound as data volume grows.
We have decided to use SP's in our organization instead of Mapping Data Flows. The cluster spin up time was an issue for us as we scaled up. To address the issue I mentioned previously with SP's, we stagger our workload, and schedule jobs to run during off-peak hours.

Using SQL Stored Procedure vs Databricks in Azure Data Factory

I have a requirement to write upto 500k records daily to Azure SQL DB using an ADF pipeline.
I had simple calculations as part of the data transformation that can performed in a SQL Stored procedure activity. I've also observed Databricks Notebooks being used commonly, esp. due to benefits of scalability going forward. But there is an overhead activity of placing files in another location after transformation, managing authentication etc. and I want to avoid any over-engineering unless absolutely required.
I've tested SQL Stored Proc and it's working quite well for ~50k records (not yet tested with higher volumes).
But I'd still like to know the general recommendation between the 2 options, esp. from experienced Azure or data engineers.
Thanks

I'm not sure there is enough information to make a solid recommendation. What is the source of the data? Why is ADF part of the solution? Is this 500K rows once per day or a constant stream? Are you loading to a Staging table then using SPROC to move and transform the data to another table?
Here are a couple thoughts:
If the data operation is SQL to SQL [meaning the same SQL instance for both source and sink], then use Stored Procedures. This allows you to stay close to the metal and will perform the best. An exception would be if the computational load is really complicated, but that doesn't appear to be the case here.
Generally speaking, the only reason to call Data Bricks from ADF is if you already have that expertise and the resources already exist to support it.
Since ADF is part of the story, there is a middle ground between your two scenarios - Data Flows. Data Flows are a low-code abstraction over Data Bricks. They are ideal for in-flight data transforms and perform very well at high loads. You do not author or deploy notebooks, nor do you have to manage the Data Bricks configuration. And they are first class citizens in ADF pipelines.

As an experienced (former) DBA, Data Engineer and data architect, I cannot see what Databricks adds in this situation. This piece of the architecture you might need to scale is the target for the INSERTs, ie Azure SQL Database which is ridiculously easy to scale either manually via the portal or via the REST API, if even required. Consider techniques such as loading into heaps and partition switching if you need to tune the insert.
The overhead of adding an additional component to your architecture and then taking your data through would have to be worth it, plus the additional cost of spinning up Spark clusters at the same time your db is running.
Databricks is a superb tool and has a number of great use cases, eg advanced data transforms (ie things you cannot do with SQL), machine learning, streaming and others. Have a look at this free resource for a few ideas:
https://databricks.com/p/ebook/the-big-book-of-data-science-use-cases

Azure data factories vs factory

I'm building an Azure data lake using data factory at the moment, and am after some advice on having multiple data factories vs just one.
I have one data factory at the moment, that is sourcing data from one EBS instance, for one specific company under an enterprise. In the future though there might be other EBS instances, and other companies (with other applications as sources) to incorporate into the factory - and I'm thinking the diagram might get a bit messy.
I've searched around, and I found this site, that recommends to keep everything in a single data factory to reuse linked services. I guess that is a good thing, however as I have scripted the build for one data factory, it would be pretty easy to build the linked services again to point at the same data lake for instance.
https://www.purplefrogsystems.com/paul/2017/08/chaining-azure-data-factory-activities-and-datasets/
Pros for having only one instance of data factory:
have to only create the data sets, linked services once
Can see overall lineage in one diagram
Cons
Could get messy over time
Could get quite big to even find the pipeline you are after
Has anyone got some large deployments of Azure Data Factories out there, that bring in potentially thousands of data sources, mix them together and transform? Would be interested in hearing your thoughts.

My suggestion is to have only one, as it makes it easier to configure multiple integration runtimes (gateways). If you decide to have more than one data factory, take into consideration that a pc can only have 1 integration runtime installed, and that the integration runtime can only be registered to only 1 data factory instance.
I think the cons you are listing are both fixed by having a naming rules. Its not messy to find a pipeline you want if you name them like: Pipeline_[Database name][db schema][table name] for example.
I have a project with thousands of datasets and pipelines, and its not harder to handle than smaller projects.
Hope this helped!

I'd initially agree with an integration runtime being tied to a single data factory being a restriction, however I suspect it is no longer or soon to be no longer a restriction.
In the March 13th update to AzureRm.DataFactories, there is a comment stating "Enable integration runtime to be shared across data factory".
I think it will depend on the complexity of the data factory and if there are inter-dependencies between the various sources and destinations.
The UI particularly (even more so in V2) makes managing a large data factory easy.
However if you choose an ARM deployment technique the data factory JSON can soon become unwieldy in even a modestly complex data factory. And in that sense I'd recommend splitting them.
You can of course mitigate maintainability issues as people have mentioned, by breaking your ARM templates into nested deployments, ARM parameterisation or data factory V2 parameterisation, using the SDK direct with separate files. Or even just use the UI (now with git support :-) )
Perhaps more importantly particularly as you mention separate companies being sourced from; it perhaps sounds like the data isn't related and if it isn't - should it be isolated to avoid any coding errors? Or perhaps even to have segregated roles and responsibilities for the data factories.
On the other hand if the data is interrelated, having it in one data factory makes things far easier for allowing data factory to manage data dependencies and re-running failed slices in one go.

After the March release, you can link integration runtimes among different factories.
The other thing to do is to create different folders for the various pipelines and datasets

My suggestion is to create one DataFactory service per each project. If you need to transfer data from two source to one destination and for each transformation you need several Pipelines and Linked Services and other stuffs, I suggest to create two separated ADF services for each Source. In this case I will see each source as an integration project separated.
You will have two separated CI/CD for each project also.
In your source controller also you need to have two separated repositories.

If you are using ADF v1 then it will get messy. At a client of ours we have over 1000 pipelines in one Data Factory. If you are starting fresh, I would recommend looking at v2 because it allows you to parameterize things and should make your scripts more reusable.

How to desing the component which take textfile,csv, or any other custom file and load it in AZURE DB tables

I am looking for the proper kind of architecture to get the data loading done from text files, csv etc... (format is not decided yet) and load it into Azure SQL Database table(s).
It should not be done manually by the user, I want this workflow to happen automatically, provide input for the components, and I need use a service so that it will be a fail safe process.
Please provide approach or architecture which does the similar job.

There are a number of ways to do this an it really depends on your goals and how much you want to invest.
Since you mention CSV by default you could script a SQL bcp (bulk copy).
https://msdn.microsoft.com/en-us/library/ms162802.aspx
You could script it so it isn't manual, but you mentioned it should be fail safe. Any group insert of row into an existing table could fail for a number of reasons (referential integrity checks, hardware issues, locks, etc).
Most integration architectures would bulk copy the data to a clean staging location (empty or new table in the same or other database) using either BCP or SqlBulkCopy (https://msdn.microsoft.com/en-us/library/7ek5da1a(v=vs.110).aspx). Then they would loop through those rows inserting them to the final destination table. That way you can build in either retry or report failures on a row by row basis, and handle accordingly.
Since this is in Azure, there are a number of ways to do the final loop through rows process, like Azure Batch, or Azure Jobs, or even Azure Logic Apps (this is basically a BizTalk like workflow/integration support engine), or custom worker role.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string