How to import your example files into my Azure Data Factory - azure

I believe many of you have ADF experiences, and may be have seen Mark Kromer's example of azure data flows (https://github.com/kromerm/adfdataflowdocs)
I am a beginner using ADF and azure Data Flows especially. I am very curious of these examples, and I really want to import all your examples files (json) to my newly created data factory. It must be an easier way than creating all activities, connections, datasets and others manually. The templates are of course good but I want to test out your example code in my azure portal and in my data factory.
And just a second question: I am a SSIS man used control flows with master packages executing other packages. When I am now building a data warehouse in ADF with many dimension tables and fact tables, is it best practice to have separate data flows or should I build general data flows that either have many parallel upserts to different dimensions? I think I need some guiding here
Thank you
Regards Geir

I've been publishing those example data flows as ADF pipeline templates.
In your browser, go to ADF and then click Factory Resources > Pipeline from Template. You should see a Data Flow category. Many of the examples are in there.
In ADF, you can also have master / child packages. You'll use Execute Pipeline in ADF instead of execute package.
ADF supports re-use and generalization of data flow patterns, so that you could process multiple dimension tables in a single data flow. You'll have to weigh the value of doing that vs. supporting it long-term as well as the danger of creating a very complex data flow.

Related

Relationships in Azure synapse (DWH)

I'm currently working in Azure synapse DWH and I have some theoretical questions:
How I can create relationships between tables (Dim's and Fact's) and what implications I would have If I want to create those relationships.
I read that To create a primary key, I would need to set a nonclustered table, but what that means?
Azure Synapse Analytics (ASA) has three engines:
serverless SQL pools (was SQL on-demand)
dedicated SQL pools (the next step on from Azure SQL Data Warehouse)
Apache Spark pools
None of these currently support database relationships, as at today. I suspect you mean dedicated SQL pools and just to confirm it does not support the FOREIGN KEY syntax. Relationships is more of an OLTP concept and not common in big data platforms, which ASA is.
Therefore your options are to enforce these relationships downstream or on import to your warehouse. A common method is to identify unknown values and substitute them with a -1 / Unknown value on import. This will ensure there are no NULLs in your key columns.
Additionally, enforce your relationships downstream eg in an Azure Analysis Services tabular model or Power BI model.
If you really need relationships then depending on your data volumes you might consider Azure SQL Database which supports data volumes up to 4TB alongside columnstore indexes which give great compression.
having a similar issue:
I cannot find an automated solution thus far;
I'm importing 'entities' from D365 to datalake; and it does NOT come with the relationships.
it will also NOT suggest the "Related Tables"
Introduce; ETL of 'entities' using T-SQL and Spark.
Governance of:: py.spark, notebooks, Schema, linting T-SQL. orchestration of activities and pipelines, workflows. Etc...
OR
For small datasets and projects:
Reverse look-up each table needed.
In Azure Synapse create a new DataFlow; and download the .PBIX ;
Do your ETL: Create Primary fact and dimension tables; (by whatever means), such as Using PowerPivot Unique/distinct DAX expression on a Customer.Table).
Once complete; if you like; import the newly ETL primary tables to the datalake.
Repeat step 2.
Create the relationships with PowerBI. (Ideally if ETL is done correctly PBI will auto find the relationships)
RE-Publish the .PBIX with the relationships as a “DataFlow”.
a. You must create relationships for every Dataflow; dataflows cannot be combined.
Measures and Dataflows will consume resources and require performance analysis if they grow.
at some point 'dataverse' may allow D365 data making this easier.
depending on your 'cost/spend' cloning all of D365 still doesn't solve your relationship needs.
Two solutions I'm aware of thus far:
Import the serverless DBO's to PowerBI; Model and Create the Dataset there. you can do massive ETL, including Foreign Key creation, and Filtering of NULL values to create primary keys for Dimensions. Aggregate data and create Fact tables, etc...
Its far easier then using the Synapse GUI. Drawbacks are PBI licensing related.
Create a "Lake Database" (map as you go, great for 5 or less entities.tables.) ETL is low-code. But I'm skeptical that after 40 hours of training; I should have just learned how to scrip this in Workbook/Spark.
Do BOTH; use PowerBI to develop your model and test it. Then go back to synapse and deploy the working model as a pipeline or lake database.
Points of Clarity from the top posted solution:
Do not trust the auto-relationship of PowerBI; stay away from pre-made REFID relationships in PBI unless you know for sure this is what you want. (step 6: original poster; if ETL is correct its a 1:M)
Publishing with .PBIX has its limitations with sharing and other issues the OP mentioned. Lake Database might be the workaround if you have Tabelau, Python, or Qlik as your solution.
DataVerse is coming; and PBI Analytics as well as predictive analysis with HD Insights will be embedded into D365. You will also be able to create drag and drop dashboards. As of 08-05-2022 this is already working in its infancy; even thought they want you to go modular; with hybrid serverless setup you can STILL Pull the aggregate measures from D365 into synapse and Reverse engineer them.

How do I scale Azure Data Factory Dataflow?

I was able to setup a SCD Type 2 process quite easily using the ADF UI for one table BUT I don't see an easy way to scale to the 1000s of datasources we've. I don't see any Java APIs that will allow me to write ADF Pipelines/Dataflow and configure & trigger them dynamically. No UI to allow which tables to choose from a particular database etc. I looked at Azure Datalake Gen 2, Azure Databricks etc. I don't see any tool in Azure that will allow us to replace the UI driven Data Lake ingestion process we've built in house. Am I missing something?
On a side note, we've an old Data lake application that ingests data from thousands of datasources such as Databases, log files, web applications etc and stores data on HDFS (a typical architecture) using technologies as Java, Spark, Kafka etc. We're evaluating Azure Active Data Factory to replace it.
There is a generic SCD (Type 1, but you can retrofit to Type 2) example built into ADF. Go to New > Pipeline from template > Transform with Data flows > Generic SCD Type 1.
This pattern is outlined here: https://techcommunity.microsoft.com/t5/azure-data-factory/create-generic-scd-pattern-in-adf-mapping-data-flows/ba-p/918519.
You can also iterate over schemaless table datasets for Foreach inside a pipeline, calling the same data flow on every iteration.
Lastly, if you still wish to stamp-out data flows programmatically, the .NET and PowerShell SDKs are listed in the references section of the online Azure docs.
You could leverage the REST API from Java to build out pipelines using code.
https://learn.microsoft.com/en-us/azure/data-factory/quickstart-create-data-factory-rest-api

Azure data factory data flow task cannot take on prem as source

Hey,
I am having trouble in creating a Data Flows task which uses an on-prem source. Is this not possible in the Preview version?
I have created a self hosted IR to connect ADF to my laptop, and that is what I want to use. In the pic below I am trying to create a dataset off self hosted IR. It works great in Copy task but for Data Flows it is greyed out.
For this question, I asked Azure support for help and they replied me with the answer:
Answer:
This means on-premise SQL server is not supported as dataset in data flow in current stage.
Screen shot:
Update:
Data flow now only support Azure IR so it doesn’t support on-premise dataset.
Refer to Integration runtime types.
Hope this helps.
If your goal is to use visual data transformations in ADF using Mapping Data Flows with on-prem data, then build a pipeline with a Copy Activity first. Use the Self-Hosted Integration Runtime with the Copy Activity to stage your data in Blob Store. Then add a subsequent Execute Data Flow activity to transform that data.
I made video on how to do this:
https://www.youtube.com/watch?v=IN-4v0e7UIs
Reproduce your issue on my side ,however nothing about this feature is claimed on the official document. As you can see everywhere about data flow cliams that:
You could submit any voice here:
Also found a feedback for data flow in ADF for your reference.If you need push the progress of it,you could vote up it. Also,i would suggest you referring to the comments in the link:
For access to the 80+ ADF connectors, use Copy Activity to stage data
for transformation.
Data Flows will access data in your lake (Blob, ADB, ADW, ADLS) for
transformation.
Think of Copy Activity as your data heavy-lifting activity and Data
Flow as your data transformation engine.

How to Migrate Azure DataFactory(v2) Job to another DataFactory(2)

How to move all existing job to another Azure Datafactory.
I am trying to move existing job from one data Factory to another but not able to find the solution any suggestions, please.
As far as I know there is no easy import/export facility.
I recommend connecting your Data Factories to source control (GIT). You can then copy and paste the JSON definitions between the two repo's using a text editor.
For propagating pipelines between environments, you can look in to the documentation for CI/CD in Azure Data Factory.

Is Azure Data Factory suitable for downloading data from non-Azure REST APIs?

Consider a data processing pipeline as follows:
Fetch a large amount of data from a REST API that's hosted somewhere on the internet and persist it to a data store.
Perform some complex data transformations on the persisted data.
Persist the results of the data transformations on a data store.
Aiming to implement such a pipeline in Azure, steps 2 and 3 seem like a good fit for implementation as Azure Data Factory activities.
My questions is - Does it make sense to implement step 1 in an Azure Data Factory activity as well?
Technically it might be possible to code a .Net activity that perform the data download and persistence.
No - do not implement step 1 in an Azure Data Factory activity.
Technically it is possible to run the entire process from ADF but I would argue that the choice is more costly (relatively) than other options available to you because you will pay for each activity in Azure Data Factory.
For instance, what if the rest api does not have any new data to offer when you initiate the (scheduled) activity? You'll pay for that.
You might consider the following as an easy to implement alternative:
1 - Create a .NET console app, publish as a WebJob, schedule to run daily.
2 - The long-running console app can query the rest api, persist data into azure storage / documentdb, push a message into queue which triggers ADF steps 2/3 to run against the saved data.
I have done exactly that using .Net Activity. I had a need to fetch data from Salesforce api. This has been working well for my needs. Here is a post I wrote up about creating a .net activity and storing the data in azure data lake.
As in Newport99's answer yes you will incur costs for that activity but I am not sure how cost effect it would be to be running a separate web app to host a web job and also run the Azure Data Factory pipeline. When I was originally designing a solution the WebJob was my first choice but in the end I prefer to have the whole solution utilizing one azure service instead of multiple.
Hope that helps.
There have been a lot of improvements to ADF in the years since this question was posted, including a REST connector.
Here's the approach recommended by ADF at this time...
Copy data from a REST endpoint by using Azure Data Factory

Resources