Is it possible to iterate multiple Azure data factory pipeline run ids and activity runs metadata onto Azure dashboard using fewer variables - azure-adf

Is it possible to iterate over multiple Azure Data Factory pipeline run IDs and activity-run metadata and show them on an Azure dashboard using fewer variables?
For example, could a single variable such as 'bytesread' be filled by a loop that captures the 'bytesread' value from each of the activity runs?

Related

Iterating through the parameters in csv file and run pipeline with each parameter

Hi, I have a scenario where I have a CSV file in Azure Data Lake Storage. While running an Azure pipeline, the parameters from the file have to be picked up one by one, iteratively. For each parameter, a Databricks notebook should be run.
Is there any solution for this - how to iterate through the values in csv file?
If you are in Azure, you should consider Azure Data Factory (ADF) or Azure Synapse Analytics, both of which have pipelines. Both are good for moving data from place to place and for data orchestration. For example, you could have an ADF pipeline with a Lookup activity which reads your .csv, then calls a For Each activity with a parameterised Databricks notebook inside.
Interestingly, the For Each activity can run its iterations in parallel, so it could deal with multiple lines at once, depending on your Databricks cluster size etc.
You could try to do all this within a single Databricks notebook, which I'm sure is possible, but I would say that is a more code-heavy approach and you would still have questions around scheduling, doing tasks in parallel, orchestration etc.
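As a rough illustration of the Lookup then For Each pattern described above, here is a minimal plain-Python sketch. It is not ADF code: `read_parameters`, `run_notebook`, `for_each`, the column names and the sample rows are all hypothetical stand-ins, and the thread pool only mimics the way For Each can run several iterations at once.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def read_parameters(csv_text):
    """Play the role of the Lookup activity: return one dict per CSV row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def run_notebook(params):
    """Hypothetical stand-in for one parameterised Databricks notebook run."""
    return f"ran notebook for {params['customer']} on {params['run_date']}"


def for_each(rows, worker, batch_count=4):
    """Play the role of For Each: call worker per row, batch_count at a time."""
    with ThreadPoolExecutor(max_workers=batch_count) as pool:
        # map() returns results in input order even though calls overlap.
        return list(pool.map(worker, rows))


# Assumed sample file contents; the real file would come from the data lake.
sample_csv = "customer,run_date\nacme,2021-01-01\nglobex,2021-01-02\n"
results = for_each(read_parameters(sample_csv), run_notebook)
for line in results:
    print(line)
```

In ADF itself the same shape would be configured in the pipeline designer rather than written as code; the sketch is only meant to show how the rows flow from the lookup into the loop.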

azure data factory pipeline and activities Monitor

I have about 120 pipelines with almost 400 activities altogether, and I would like to log them in our data lake storage so we can report on performance using Power BI. I came across "How to get output parameter from Executed Pipeline in ADF?", but that seems to work with a single pipeline. I am wondering whether I could get all the pipelines in my ADF, and their activities, in one single call.
Thanks
Assuming the source in these pipelines varies, it is difficult to apply one monitoring logic to all of them.
One way is to store the logs individually for each pipeline by running some queries with pipeline parameters. Refer to Option 2 in this tutorial.
However, the most practical way to monitor ADF pipelines and activities is to use Azure Data Factory Analytics.
This solution provides you a summary of overall health of your Data Factory, with options to drill into details and to troubleshoot unexpected behavior patterns. With rich, out of the box views you can get insights into key processing including:
At a glance summary of data factory pipeline, activity and trigger runs
Ability to drill into data factory activity runs by type
Summary of data factory top pipeline, activity errors
Go to Azure Marketplace, choose the Analytics filter, and search for Azure Data Factory Analytics (Preview).
Select Create and then create or select the Log Analytics workspace.
Installing this solution creates a default set of views inside the workbooks section of the chosen Log Analytics workspace. As a result, the following metrics become enabled:
ADF Runs - 1) Pipeline Runs by Data Factory
ADF Runs - 2) Activity Runs by Data Factory
ADF Runs - 3) Trigger Runs by Data Factory
ADF Errors - 1) Top 10 Pipeline Errors by Data Factory
ADF Errors - 2) Top 10 Activity Runs by Data Factory
ADF Errors - 3) Top 10 Trigger Errors by Data Factory
ADF Statistics - 1) Activity Runs by Type
ADF Statistics - 2) Trigger Runs by Type
ADF Statistics - 3) Max Pipeline Runs Duration
You can visualize the preceding metrics, look at the queries behind these metrics, edit the queries, create alerts, and take other actions.
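For anyone who ends up exporting run records to the data lake instead, two of the summaries listed above (activity runs by type and maximum pipeline run duration) can be approximated in a few lines. This is only a sketch: the field names `activityType`, `pipelineName` and `durationInMs` are assumptions about the shape of the exported records, and the sample rows are made up.

```python
from collections import Counter, defaultdict


def activity_runs_by_type(activity_runs):
    """Count activity runs per activity type."""
    return Counter(run["activityType"] for run in activity_runs)


def max_pipeline_duration_ms(pipeline_runs):
    """Return the longest observed run duration for each pipeline."""
    longest = defaultdict(int)
    for run in pipeline_runs:
        name = run["pipelineName"]
        longest[name] = max(longest[name], run["durationInMs"])
    return dict(longest)


# Made-up sample records; real ones would be read from the exported logs.
sample_activity_runs = [
    {"activityType": "Copy", "pipelineName": "load_sales"},
    {"activityType": "Copy", "pipelineName": "load_stock"},
    {"activityType": "Lookup", "pipelineName": "load_sales"},
]
sample_pipeline_runs = [
    {"pipelineName": "load_sales", "durationInMs": 42000},
    {"pipelineName": "load_sales", "durationInMs": 61000},
    {"pipelineName": "load_stock", "durationInMs": 15000},
]

print(activity_runs_by_type(sample_activity_runs))
print(max_pipeline_duration_ms(sample_pipeline_runs))
```

The same aggregations could equally be done in Power BI once the flat records are in storage; the point is only that one flat table of runs is enough to drive these views.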

How to use the same pipeline in different environments with varying number of customers inside Azure Data Factory?

I have a copy data pipeline in Azure Data Factory. I need to deploy the same Data Factory instance to multiple environments (DEV, QA, PROD) using a release pipeline.
The pipeline transfers data from a customer storage account (blob container) to a centralized data lake. So we can say it's a many-to-one flow (many customers > one data lake).
Now, suppose I am in the DEV environment and I have one demo customer there. I have defined an ADF pipeline for Copy Data. But in the PROD environment, the number of customers will grow, and I don't want to create multiple copies of the same pipeline in the production Data Factory.
I am looking for a solution that lets me keep one copy pipeline in Data Factory and deploy/promote the same Data Factory from one environment to another. This should work even if the number of customers varies from one environment to the next.
I am also doing CI/CD in Azure Data Factory using Git integration with Azure Repos.
You will have to create additional linked services and datasets, which do not exist in the non-production environments, to ensure any new "customer" storage account is mapped to the pipeline instance.
With CI/CD routines, you can deliver this incrementally, i.e. parameterize your release pipeline with variable groups and update the data factory instance with newer pipelines and new datasets/linked services.
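One way to picture the per-environment part is a small generator that turns an environment's customer list into linked service definitions for the release step to deploy. This is a simplified sketch under several assumptions: the `CUSTOMERS_BY_ENV` mapping, the `ls_blob_` naming scheme and the storage account names are hypothetical, and the definition is trimmed down to an endpoint only, with no authentication settings.

```python
# Hypothetical per-environment customer lists (customer -> storage account).
# In practice these could come from release pipeline variable groups.
CUSTOMERS_BY_ENV = {
    "DEV": {"demo": "demostorage"},
    "PROD": {"acme": "acmestorage", "globex": "globexstorage"},
}


def linked_service_for(customer, storage_account):
    """Build a trimmed-down blob linked service definition for one customer."""
    return {
        "name": f"ls_blob_{customer}",  # assumed naming convention
        "properties": {
            "type": "AzureBlobStorage",
            "typeProperties": {
                "serviceEndpoint": f"https://{storage_account}.blob.core.windows.net/"
            },
        },
    }


def linked_services_for_env(env):
    """Return one linked service definition per customer in the environment."""
    customers = CUSTOMERS_BY_ENV[env]
    return [linked_service_for(c, acct) for c, acct in sorted(customers.items())]


for definition in linked_services_for_env("PROD"):
    print(definition["name"])
```

The pipeline itself stays the same across environments; only the generated list grows as customers are added.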

Is it possible to download a million files in parallel from Rest API endpoint using Azure Data Factory into Blob?

I am fairly new to Azure, and I have a task in hand: use any Azure service (or a group of Azure services integrated together) to download a million files in parallel from a third-party REST API endpoint, which returns one file at a time, into Blob Storage, for example using Azure Data Factory.
WHAT I RESEARCHED :
From what I researched, my task has these requirements in a nutshell:
Parallel runs in millions: for this I deduced that Azure Batch would be a good option, as it lets you run a large number of tasks in parallel on VMs (it uses that concept for graphics rendering or machine learning tasks).
Save the response from the REST API to Blob Storage: I found that Azure Data Factory can handle this ETL type of operation in a source/sink style, where I could set the REST API as the source and the blob as the target.
WHAT I HAVE TRIED:
Here are some things to note:
I added the REST API and Blob as linked services.
The API endpoint takes a query string parameter named fileName.
I am passing the whole URL with the query string.
The REST API is protected by a Bearer token, which I am trying to pass using additional headers.
THE MAIN PROBLEM:
When publishing the pipeline I get an error message saying that the model is not appropriate. It is just that one line, and it gives no insight into what is wrong.
OTHER QUERIES:
Is it possible to pass query string values dynamically from a SQL table, so that each file name is picked up as a single row/column item from single-column rows returned by a stored procedure or inline query?
Is it possible to make this pipeline run in parallel using Azure Batch somehow? How can we integrate this process?
Is it possible to achieve the million parallel downloads without Data Factory, using just Batch?
It is hard to help with your main problem; you need to provide more examples of your code.
In relation to your other queries:
You can use a "Lookup activity" to fetch a list of files from a database (with either sproc or inline query). The next step would be a ForEach activity that iterates over the array and copies the file from the REST endpoint to the storage account. You can adjust the parallelism on the ForEach activity to match your requirement but around 20 concurrent executions is what you normally see.
Using Azure Batch just to download a file seems a bit overkill, as it should be a fairly quick operation. If you want to see an example of an Azure Batch job written in C#, this example (https://github.com/Azure-Samples/batch-dotnet-quickstart/blob/master/BatchDotnetQuickstart) is one I can recommend. In terms of parallelism, I think you will manage to achieve a higher degree on Azure Batch than on Azure Data Factory.
If you need to actually download 1M files in parallel, I don't think you have any option other than Azure Batch to get close to such numbers. But you must have a pretty beefy API if it can handle 1M requests within a second or two.
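To make the "list of file names, then a bounded parallel loop" idea concrete, here is a minimal Python sketch. It is not ADF or Azure Batch code. The base URL, `fake_fetch` and the sample file names are hypothetical; a real `fetch` would perform an authenticated HTTP GET with the Bearer token and write the body to Blob Storage. The default of 20 workers simply mirrors the concurrency figure mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode


def build_url(base_url, file_name):
    """Append the fileName query string parameter described in the question."""
    return f"{base_url}?{urlencode({'fileName': file_name})}"


def download_all(file_names, fetch, base_url, max_workers=20):
    """Fetch every file with at most max_workers requests in flight."""
    urls = [build_url(base_url, name) for name in file_names]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields bodies in the same order as the input file names.
        return dict(zip(file_names, pool.map(fetch, urls)))


def fake_fetch(url):
    """Hypothetical stand-in for an authenticated HTTP GET."""
    return f"contents of {url}".encode()


# In the real scenario the names would come from the SQL lookup.
files = download_all(
    ["a.csv", "b c.csv"], fake_fetch, "https://api.example.com/files"
)
for name, body in files.items():
    print(name, len(body))
```

Raising `max_workers` only helps as long as the API and the network keep up, which is the same limit the answer points out for the 1M case.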

Create a generic data factory with multiple linked services

Use case: create a generic data factory which can read flat files from different Azure blob containers into Azure SQL. I have created a data pipeline which uses stored procedures to populate the Azure SQL tables.
Issue: I want to execute this data factory from my code, change the database and blob container on the fly, and execute the same data factory with these new parameters. The table names will remain the same on the Azure SQL side, and the file name will also remain the same in blob storage. The only change will be the container, or the folder name inside the container, which will be known beforehand.
Please help me out or point me in the right direction: what could help me achieve this, and can it be achieved at all?
You would need to use parameterized datasets and linked services. Define parameters on your data factory pipeline for the values you want to pass from your code (e.g. the container name or folder name, the connection string for Azure SQL and the connection string for blob storage). Once these are defined, you need to pass the values downstream all the way to the linked service,
i.e. something like this
Pipeline Parameters > Dataset Parameters > Linked Service Parameters
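The pass-down chain can be modelled in plain Python to show which value ends up where. This is only a conceptual sketch, not ADF JSON: the function names, the parameter keys, the fixed file name `data.csv` and the table name `dbo.Customers` are all hypothetical.

```python
def linked_service(connection_string):
    """Innermost level: the linked service only knows its connection string."""
    return {"connectionString": connection_string}


def blob_dataset(container, folder, connection_string):
    """Dataset level: receives its own parameters and forwards one of them."""
    return {
        # The file name stays fixed; only container and folder vary.
        "location": f"{container}/{folder}/data.csv",
        "linkedService": linked_service(connection_string),
    }


def sql_dataset(connection_string):
    """Dataset level for the sink: the table name stays fixed."""
    return {
        "table": "dbo.Customers",
        "linkedService": linked_service(connection_string),
    }


def pipeline(params):
    """Outermost level: the caller supplies everything as pipeline parameters."""
    return {
        "source": blob_dataset(
            params["container"], params["folder"], params["blobConnectionString"]
        ),
        "sink": sql_dataset(params["sqlConnectionString"]),
    }


run = pipeline({
    "container": "customer-a",
    "folder": "inbound",
    "blobConnectionString": "<blob connection string>",
    "sqlConnectionString": "<sql connection string>",
})
print(run["source"]["location"])
```

Calling `pipeline` again with a different container or connection string reuses every definition unchanged, which is the point of parameterizing all three levels.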
