How do I scale Azure Data Factory Dataflow? - azure

I was able to set up an SCD Type 2 process quite easily using the ADF UI for one table, but I don't see an easy way to scale it to the thousands of data sources we have. I don't see any Java APIs that would allow me to write ADF pipelines and data flows and configure and trigger them dynamically. There is no UI for choosing which tables to ingest from a particular database, etc. I looked at Azure Data Lake Storage Gen2, Azure Databricks, etc. I don't see any tool in Azure that would allow us to replace the UI-driven data lake ingestion process we've built in-house. Am I missing something?
On a side note, we have an old data lake application that ingests data from thousands of data sources such as databases, log files, web applications, etc. and stores the data on HDFS (a typical architecture) using technologies such as Java, Spark, Kafka, etc. We're evaluating Azure Data Factory to replace it.

There is a generic SCD (Type 1, but you can retrofit to Type 2) example built into ADF. Go to New > Pipeline from template > Transform with Data flows > Generic SCD Type 1.
This pattern is outlined here: https://techcommunity.microsoft.com/t5/azure-data-factory/create-generic-scd-pattern-in-adf-mapping-data-flows/ba-p/918519.
You can also iterate over a list of tables with a ForEach activity inside a pipeline, using a schemaless dataset and calling the same data flow on every iteration.
Lastly, if you still wish to stamp out data flows programmatically, the .NET and PowerShell SDKs are listed in the references section of the online Azure docs.

You could leverage the REST API from Java to build out pipelines using code.
https://learn.microsoft.com/en-us/azure/data-factory/quickstart-create-data-factory-rest-api
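As a rough illustration of building pipelines in code, the sketch below (in Python rather than Java, for brevity) assembles the management URL and a JSON body for a pipeline that runs one data flow per table inside a ForEach activity. The function names, activity names, and the `tables` parameter are hypothetical. The property names follow the pipeline JSON format as I understand it and should be checked against the REST reference before use. Authentication, the actual PUT request, and passing each table name into the data flow's dataset parameters are left out.

```python
import json

API_VERSION = "2018-06-01"  # API version used by the REST quickstart linked above


def pipeline_url(subscription_id, resource_group, factory, pipeline_name):
    """Build the management URL that a PUT request for the pipeline would target."""
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.DataFactory/factories/{factory}"
        f"/pipelines/{pipeline_name}?api-version={API_VERSION}"
    )


def scd_pipeline_body(data_flow_name, tables):
    """Return a pipeline definition that calls one data flow for every table."""
    run_data_flow = {
        "name": "RunScdDataFlow",  # hypothetical activity name
        "type": "ExecuteDataFlow",
        "typeProperties": {
            "dataflow": {
                "referenceName": data_flow_name,
                "type": "DataFlowReference",
            }
        },
    }
    for_each = {
        "name": "ForEachTable",  # hypothetical activity name
        "type": "ForEach",
        "typeProperties": {
            "isSequential": False,  # let iterations run in parallel
            "items": {
                "value": "@pipeline().parameters.tables",
                "type": "Expression",
            },
            "activities": [run_data_flow],
        },
    }
    return {
        "properties": {
            "parameters": {
                "tables": {"type": "Array", "defaultValue": list(tables)}
            },
            "activities": [for_each],
        }
    }


if __name__ == "__main__":
    body = scd_pipeline_body("GenericScd", ["dbo.Customer", "dbo.Product"])
    print(pipeline_url("<subscription-id>", "<resource-group>", "<factory>", "ScdAllTables"))
    print(json.dumps(body, indent=2))
```

Because the table list is just a pipeline parameter, the same definition can be reused for any number of tables by supplying a different list at trigger time.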

Related

Azure Architecture Implementation Ideas

We've designed a data architecture for our client on Azure in which we ingest the sources into a raw layer consisting of an Azure SQL Database. This Azure SQL Database acts as a source mirror and has near-real-time sync enabled.
We also have an ODS layer, which is populated from the previously mentioned Azure SQL Database (source mirror) according to the given data model. This layer should ideally take between 30 minutes and 1 hour to load.
How do I handle the concurrent writes to and reads from the raw layer (Azure SQL Database, source mirror)? It syncs with the sources every 5 minutes but is also read every 30 minutes to 1 hour for the ODS layer.
I have to use Azure Data Factory to implement my data loads.
Yes, the Azure Data Factory platform is the best fit for such scenarios. It's a cloud-based ETL and data integration service that lets you build data-driven workflows for orchestrating data movement and transforming data at scale. You can use Azure Data Factory to create and schedule data-driven workflows (called pipelines) that ingest data from a variety of sources. You can build complex ETL processes that transform data visually with data flows, or by using compute services such as Azure HDInsight Hadoop, Azure Databricks, and Azure SQL Database.
When using control flow, you can use a GetMetadata activity to get a
list of files in a storage account, then pass that list to a for each
activity with the Sequential flag set to false to process all files
concurrently (in parallel) up to the maximum batch size according to
the activities defined in the for each loop.
Here is the official Microsoft documentation: Azure Data Factory connector overview.
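The Get Metadata and ForEach pattern quoted above could be sketched as a pipeline definition like the one below, written as a Python dictionary that serializes to pipeline JSON. The function name, activity names, and dataset name are hypothetical, and the inner Wait activity is only a placeholder for the real per-file work. The sketch assumes the documented ForEach batch count limit of 50.

```python
import json


def parallel_file_pipeline(folder_dataset, inner_activities, batch_count=20):
    """List the files in a folder, then process them concurrently in a ForEach."""
    if not 1 <= batch_count <= 50:
        # Assumption: ForEach allows at most 50 parallel iterations.
        raise ValueError("batch_count must be between 1 and 50")
    get_files = {
        "name": "GetFileList",
        "type": "GetMetadata",
        "typeProperties": {
            "dataset": {"referenceName": folder_dataset, "type": "DatasetReference"},
            "fieldList": ["childItems"],  # ask for the folder's files and subfolders
        },
    }
    for_each = {
        "name": "ForEachFile",
        "type": "ForEach",
        "dependsOn": [
            {"activity": "GetFileList", "dependencyConditions": ["Succeeded"]}
        ],
        "typeProperties": {
            "isSequential": False,  # the "Sequential" flag set to false
            "batchCount": batch_count,  # upper bound on parallel iterations
            "items": {
                "value": "@activity('GetFileList').output.childItems",
                "type": "Expression",
            },
            "activities": list(inner_activities),
        },
    }
    return {"properties": {"activities": [get_files, for_each]}}


if __name__ == "__main__":
    # Placeholder inner activity; a real pipeline would copy or transform the file.
    placeholder = {
        "name": "ProcessFile",
        "type": "Wait",
        "typeProperties": {"waitTimeInSeconds": 1},
    }
    print(json.dumps(parallel_file_pipeline("RawFolder", [placeholder]), indent=2))
```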

Azure data factory data flow task cannot take on prem as source

Hey,
I am having trouble creating a Data Flow task that uses an on-prem source. Is this not possible in the preview version?
I have created a self-hosted IR to connect ADF to my laptop, and that is what I want to use. In the pic below I am trying to create a dataset on the self-hosted IR. It works great in the Copy task, but for Data Flows it is greyed out.
I asked Azure support for help with this question, and they replied with this answer:
Answer:
This means an on-premises SQL Server is not supported as a dataset in data flows at the current stage.
Update:
Data flows currently support only the Azure IR, so they do not support on-premises datasets.
Refer to Integration runtime types.
Hope this helps.
If your goal is to use visual data transformations in ADF using Mapping Data Flows with on-prem data, then build a pipeline with a Copy Activity first. Use the Self-Hosted Integration Runtime with the Copy Activity to stage your data in Blob Store. Then add a subsequent Execute Data Flow activity to transform that data.
I made a video on how to do this:
https://www.youtube.com/watch?v=IN-4v0e7UIs
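A minimal sketch of that two-step pipeline, expressed as a Python dictionary that serializes to pipeline JSON, might look like the following. All names are hypothetical, and the source and sink types are assumptions that depend on the actual datasets (a real sink usually also needs store and format settings). Note that the self-hosted IR is not set on the Copy activity itself; it is referenced by the linked service behind the on-prem dataset.

```python
import json


def stage_then_transform(onprem_dataset, staged_dataset, data_flow_name):
    """Copy on-prem data to staging storage, then run a data flow on the staged copy."""
    copy = {
        "name": "StageOnPremData",
        "type": "Copy",
        # The on-prem dataset's linked service is what points at the self-hosted IR.
        "inputs": [{"referenceName": onprem_dataset, "type": "DatasetReference"}],
        "outputs": [{"referenceName": staged_dataset, "type": "DatasetReference"}],
        "typeProperties": {
            "source": {"type": "SqlServerSource"},  # assumed: SQL Server source
            "sink": {"type": "DelimitedTextSink"},  # assumed: CSV files in Blob storage
        },
    }
    transform = {
        "name": "TransformStagedData",
        "type": "ExecuteDataFlow",
        # Run only after the staging copy has succeeded.
        "dependsOn": [
            {"activity": copy["name"], "dependencyConditions": ["Succeeded"]}
        ],
        "typeProperties": {
            "dataflow": {"referenceName": data_flow_name, "type": "DataFlowReference"}
        },
    }
    return {"properties": {"activities": [copy, transform]}}


if __name__ == "__main__":
    pipeline = stage_then_transform("OnPremSqlTable", "StagedBlobCsv", "TransformFlow")
    print(json.dumps(pipeline, indent=2))
```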
I reproduced your issue on my side; however, the official documentation does not mention this feature. Everywhere it describes data flows, it states that:
You can submit feedback here:
I also found a feedback item for data flows in ADF for your reference. If you want to push its progress, you can vote it up. I would also suggest reading the comments at the link:
For access to the 80+ ADF connectors, use Copy Activity to stage data
for transformation.
Data Flows will access data in your lake (Blob, ADB, ADW, ADLS) for
transformation.
Think of Copy Activity as your data heavy-lifting activity and Data
Flow as your data transformation engine.

How to import your example files into my Azure Data Factory

I believe many of you have experience with ADF and may have seen Mark Kromer's examples of Azure data flows (https://github.com/kromerm/adfdataflowdocs).
I am a beginner with ADF, and with Azure data flows especially. I am very curious about these examples, and I would really like to import all of the example files (JSON) into my newly created data factory. There must be an easier way than creating all the activities, connections, datasets, and so on manually. The templates are of course good, but I want to test the example code in my Azure portal and in my data factory.
And a second question: I am an SSIS person, used to control flows with master packages executing other packages. Now that I am building a data warehouse in ADF with many dimension tables and fact tables, is it best practice to have separate data flows, or should I build general data flows that have many parallel upserts to different dimensions? I think I need some guidance here.
Thank you
Regards Geir
I've been publishing those example data flows as ADF pipeline templates.
In your browser, go to ADF and then click Factory Resources > Pipeline from Template. You should see a Data Flow category. Many of the examples are in there.
In ADF, you can also have master/child pipelines. You'll use the Execute Pipeline activity in ADF instead of Execute Package.
ADF supports re-use and generalization of data flow patterns, so that you could process multiple dimension tables in a single data flow. You'll have to weigh the value of doing that vs. supporting it long-term as well as the danger of creating a very complex data flow.
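To make the master/child idea concrete, here is a small sketch of a master pipeline definition that calls child pipelines one after another with Execute Pipeline activities. It is written as a Python dictionary that serializes to pipeline JSON. The function name and the child pipeline names are hypothetical, and the strictly sequential ordering is just one possible design.

```python
import json


def master_pipeline(child_pipelines):
    """Chain Execute Pipeline activities so each child runs after the previous one."""
    activities = []
    previous = None
    for child in child_pipelines:
        activity = {
            "name": f"Run_{child}",
            "type": "ExecutePipeline",
            "typeProperties": {
                "pipeline": {"referenceName": child, "type": "PipelineReference"},
                "waitOnCompletion": True,  # block until the child pipeline finishes
            },
        }
        if previous is not None:
            # Start this child only after the previous one has succeeded.
            activity["dependsOn"] = [
                {"activity": previous, "dependencyConditions": ["Succeeded"]}
            ]
        activities.append(activity)
        previous = activity["name"]
    return {"properties": {"activities": activities}}


if __name__ == "__main__":
    # Hypothetical children: load dimensions first, then facts.
    print(json.dumps(master_pipeline(["LoadDimensions", "LoadFacts"]), indent=2))
```

Dropping the `dependsOn` entries would let the children run in parallel instead, which is closer to a master package that fires independent child packages.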

Which Azure products are needed for a staging database?

I have several external data APIs that I access using some Python scripts. My scripts run from an on-premises server, transform the data, and store it in a SQL Server database on the same server. I suppose it's a rudimentary ETL system run with Python and T-SQL.
The system is about to grow quite a bit with new APIs and will require more complex data pipelines (for example, some of the API data will be spun off to more than one table). I think this would be a good time to move the system onto Azure (we are heavily integrated with Microsoft so it will have to be Azure!).
I have spent a few days researching the Azure products that would let me run Python scripts to access data from web APIs and store the processed data in a cloud database. I'm looking for advice on what sort of Azure products other people have used for similar jobs. At the moment it seems I will need:
Azure SQL Database to hold the processed data that can be accessed by various colleagues.
Azure Data Factory to manage, log, and schedule the pipeline jobs and to run my custom Python scripts (is this even possible?).
Azure Batch to run the aforementioned Python scripts but I'm not sure about this.
I want to put together a proposal basically and start thinking about costs but it would be good to hear from someone who has done something similar - am I on the right track or completely off? Should I just stay on-premises? Thank you in advance.
Azure SQL Database and Azure SQL Data Warehouse are good for relational data. If you want to use NoSQL, you could go with Azure Cosmos DB. If you want to store data as files, you could use Azure Data Lake.
For Python scripts, you could use a Custom activity or a Databricks activity in Azure Data Factory.
Azure SQL Data Warehouse should be used if the amount of data you want to load is in the petabyte range. Also, Azure SQL Data Warehouse is not meant for complex transformations. I would recommend it for plain data loads with PolyBase.
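For the Custom activity route, the activity definition might look roughly like the sketch below, which runs a Python script on an Azure Batch pool. It is written as a Python dictionary that serializes to activity JSON. The linked service names, script name, and folder path are hypothetical, and it assumes Python is installed on the Batch nodes.

```python
import json


def python_custom_activity(script, batch_linked_service, storage_linked_service, folder_path):
    """Describe a Custom activity that runs a Python script on Azure Batch."""
    return {
        "name": "RunPythonScript",
        "type": "Custom",
        # The Azure Batch linked service decides which pool executes the command.
        "linkedServiceName": {
            "referenceName": batch_linked_service,
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "command": f"python {script}",
            # Storage location that holds the script and any files it needs.
            "resourceLinkedService": {
                "referenceName": storage_linked_service,
                "type": "LinkedServiceReference",
            },
            "folderPath": folder_path,
        },
    }


if __name__ == "__main__":
    activity = python_custom_activity(
        "fetch_api_data.py", "AzureBatchPool", "ScriptStorage", "scripts/etl"
    )
    print(json.dumps(activity, indent=2))
```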

Is Azure Data Factory suitable for downloading data from non-Azure REST APIs?

Consider a data processing pipeline as follows:
1 - Fetch a large amount of data from a REST API that's hosted somewhere on the internet and persist it to a data store.
2 - Perform some complex data transformations on the persisted data.
3 - Persist the results of the data transformations to a data store.
If I implement such a pipeline in Azure, steps 2 and 3 seem like a good fit for Azure Data Factory activities.
My question is: does it make sense to implement step 1 as an Azure Data Factory activity as well?
Technically, it might be possible to code a .NET activity that performs the data download and persistence.
No - do not implement step 1 in an Azure Data Factory activity.
Technically it is possible to run the entire process from ADF but I would argue that the choice is more costly (relatively) than other options available to you because you will pay for each activity in Azure Data Factory.
For instance, what if the REST API does not have any new data to offer when you initiate the (scheduled) activity? You'll pay for that.
You might consider the following as an easy-to-implement alternative:
1 - Create a .NET console app, publish it as a WebJob, and schedule it to run daily.
2 - The long-running console app can query the REST API, persist the data into Azure Storage / DocumentDB, and push a message onto a queue, which triggers ADF steps 2/3 to run against the saved data.
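The control flow of that alternative can be sketched in a few lines. The answer proposes a .NET console app; the version below uses Python only to show the logic. The three callables are hypothetical stand-ins for the REST call, the storage write, and the queue message, so no Azure SDK calls are shown. The point is that nothing downstream is triggered when the API has no new data.

```python
def ingest_once(fetch, persist, notify):
    """Fetch records; persist them and signal downstream only if there is new data."""
    records = fetch()
    if not records:
        return False  # nothing new, so no message is queued and ADF does not run
    location = persist(records)  # e.g. write to storage and return where it landed
    notify(location)  # e.g. enqueue a message that starts steps 2/3
    return True


if __name__ == "__main__":
    queued = []
    ran = ingest_once(
        fetch=lambda: [{"id": 1}],  # stand-in for the REST API call
        persist=lambda records: "raw/batch-0001.json",  # stand-in for the storage write
        notify=queued.append,  # stand-in for pushing a queue message
    )
    print(ran, queued)
```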
I have done exactly that using a .NET activity. I needed to fetch data from the Salesforce API. This has been working well for my needs. Here is a post I wrote up about creating a .NET activity and storing the data in Azure Data Lake.
As in Newport99's answer, yes, you will incur costs for that activity, but I am not sure how cost effective it would be to run a separate web app to host a WebJob and also run the Azure Data Factory pipeline. When I was originally designing a solution, the WebJob was my first choice, but in the end I prefer to have the whole solution use one Azure service instead of several.
Hope that helps.
There have been a lot of improvements to ADF in the years since this question was posted, including a REST connector.
Here's the approach recommended by ADF at this time...
Copy data from a REST endpoint by using Azure Data Factory
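With the REST connector, step 1 becomes an ordinary Copy activity. A minimal sketch of such an activity is shown below as a Python dictionary that serializes to activity JSON. The dataset names are hypothetical, and the JSON sink is an assumption (any supported sink would do). Pagination, authentication, and request headers are configured on the REST dataset, linked service, and source, and are left out here.

```python
import json


def rest_copy_activity(rest_dataset, sink_dataset):
    """Describe a Copy activity that pulls from a REST endpoint into a JSON sink."""
    return {
        "name": "CopyFromRestApi",
        "type": "Copy",
        "inputs": [{"referenceName": rest_dataset, "type": "DatasetReference"}],
        "outputs": [{"referenceName": sink_dataset, "type": "DatasetReference"}],
        "typeProperties": {
            "source": {"type": "RestSource", "requestMethod": "GET"},
            "sink": {"type": "JsonSink"},  # assumed: landing raw JSON files
        },
    }


if __name__ == "__main__":
    print(json.dumps(rest_copy_activity("ExternalApi", "RawJsonFiles"), indent=2))
```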
