Azure Data Factory (ADF) vs Azure Functions: How to choose?

Currently we are using blob-triggered Azure Functions to move JSON data into Cosmos DB. We are planning to replace the Azure Functions with an Azure Data Factory (ADF) pipeline.
I am new to Azure Data Factory (ADF), so I am not sure: would an ADF pipeline be the better option or not?

Though my answer is a bit late, I would like to add that I would not recommend replacing your current setup with ADF. Reasons:
Cost: ADF is expensive and costs considerably more than Azure Functions for this kind of workload.
Custom logic: ADF is not built to perform cleansing logic or run custom code. Its primary goal is data integration from external systems using its vast connector pool.
Latency: ADF has much higher latency due to the large overhead of its job framework.

Based on your requirements, Azure Data Factory is a perfect option. You could follow this tutorial to configure Azure Blob Storage as the input and Cosmos DB as the output.
The advantage over Azure Functions is that you don't need to write any custom code unless data cleansing is involved. Azure Data Factory is the recommended option here, and even if you want an Azure Function for other purposes, you can call it from within the pipeline.

The fundamental use of Azure Data Factory is data ingestion. Azure Functions are serverless (Function as a Service) and are best suited to short-lived executions; functions that run for many seconds become far more expensive. Azure Functions are good for event-driven microservices. For data ingestion, Azure Data Factory is the better option, as its running cost for large data volumes will be lower than Azure Functions. You can also integrate Spark processing pipelines into ADF for more advanced data ingestion.

Moreover, it depends upon your situation. Azure Functions are serverless, lightweight processes meant for quick responses to an event, rather than for volumetric, batch-oriented workloads.
So, if your requirement is to respond quickly to an event with a small amount of data, stay with Azure Functions; if you need batch processing, switch to ADF.

Cost
Let's calculate the cost (data movement on the Azure integration runtime is billed at $0.25 per DIU-hour; see the Data Factory pricing page):
If your file is large (7.514 GB, copy duration 43:51, i.e. about 43.85 h):
4 (DIU) × 43.85 (h) × 0.25 ($/DIU-hour) ≈ $43.85
$43.85 / 7.514 GB ≈ $5.84/GB
If your file is small (2.497 MB), the copy takes about 45 seconds, billed as one minute:
4 (DIU) × 1/60 (h) × 0.25 ($/DIU-hour) ≈ $0.0167
2.497 MB / 1024 ≈ 0.00244 GB
$0.0167 / 0.00244 GB ≈ $6.84/GB
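To make the comparison easy to rerun with your own numbers, here is a minimal sketch of the same arithmetic; the DIU count, rate, and file sizes are just the example values above.

```python
def adf_copy_cost_per_gb(diu: int, hours: float, size_gb: float,
                         rate_per_diu_hour: float = 0.25) -> float:
    """Rough copy-activity cost per GB: DIUs * billed hours * rate / data size."""
    return diu * hours * rate_per_diu_hour / size_gb

# Large file: 7.514 GB copied in ~43.85 h with 4 DIUs
print(adf_copy_cost_per_gb(4, 43.85, 7.514))          # ~5.84 $/GB

# Small file: 2.497 MB, 45 s copy billed as one minute
print(adf_copy_cost_per_gb(4, 1 / 60, 2.497 / 1024))  # ~6.84 $/GB
```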
Scale
The maximum number of instances an Azure Functions app can scale out to is 200.
ADF can run 3,000 concurrent external activities. In my test, only 1,500 copy activities actually ran in parallel. (This test wasted a lot of money.)

Related

How to invoke Job/Task in Azure Databricks from Azure Function

I need to develop an event-driven pipeline that should be triggered on file arrival in ADLS Gen2 (i.e. ABFS). On file arrival I need to trigger 4 subsequent Spark jobs on an Azure Databricks cluster.
For orchestrating the Spark jobs I can use Databricks Jobs, so that the jobs run as a pipeline.
But the first job should be triggered only after the file arrives.
I am currently exploring ways to achieve this but need expert advice to design it in the best possible manner with respect to cost.
One solution could be to use Azure Data Factory to orchestrate the entire flow with its Storage Event Trigger, but adopting ADF just for an event-based trigger doesn't look feasible to me, because the rest of the application, i.e. the Spark jobs, can be pipelined with the Databricks Jobs feature. Also, in terms of cost, ADF can be expensive. Another solution could be to use an Azure Functions blob trigger to detect the file arrival, but I am not able to understand how I can trigger Azure Databricks jobs from an Azure Function. Going with Functions could be cost effective, as the function would not be running/active until the file has arrived.
Note: there can be multiple files arriving in an hour; there is no fixed schedule for file arrival.
Also, the trigger file is different from the data files, i.e. on arrival of a trigger file, the Spark pipeline would consume the actual data files.
Data files and trigger files have different extensions, and both arrive in ABFS.
Your worry about ADF cost is misplaced. The pipelines themselves are extremely cheap; the activities that actually move data and use CPU are where most of the cost is. For instance, Data Flows run on managed Spark clusters, which is reflected in the pricing. See Data Factory pricing. Using a pipeline to orchestrate Databricks jobs is a common, simple, and (at least on the ADF side) very inexpensive pattern.
If you want to kick off a Databricks job from an Azure Function, there's an API (the Databricks Jobs API). Also check out Databricks Auto Loader, but keep in mind that running your Databricks cluster continuously can be expensive.
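As a minimal sketch of that API call (not the poster's implementation), an Event Grid triggered Azure Function could start an existing Databricks job via the Jobs API run-now endpoint. The workspace URL setting, token setting, job ID, and the ".trigger" extension below are assumptions for illustration.

```python
import logging
import os

import azure.functions as func
import requests

app = func.FunctionApp()

# Assumed app settings: DATABRICKS_HOST (e.g. https://adb-123.4.azuredatabricks.net),
# DATABRICKS_TOKEN (a PAT or AAD token), DATABRICKS_JOB_ID (a pre-created multi-task job).
@app.event_grid_trigger(arg_name="event")
def start_databricks_job(event: func.EventGridEvent) -> None:
    blob_url = event.get_json().get("url", "")
    if not blob_url.endswith(".trigger"):   # react only to trigger files (hypothetical extension)
        return

    resp = requests.post(
        f"{os.environ['DATABRICKS_HOST']}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
        json={
            "job_id": int(os.environ["DATABRICKS_JOB_ID"]),
            # pass the arriving trigger file through to the notebook/task as a parameter
            "notebook_params": {"trigger_file": blob_url},
        },
        timeout=30,
    )
    resp.raise_for_status()
    logging.info("Started Databricks run %s", resp.json().get("run_id"))
```

The four dependent Spark jobs can then be chained inside the Databricks job itself as tasks, so the function only has to start the first run.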

File transfer in Azure Integration Services

I have a requirement to transfer files (20-150 MB) between two systems. For this requirement, is it better to use a Durable Function instead of Azure Data Factory (ADF)? As per my understanding, ADF execution will be costlier compared to Durable Functions. Note: the Durable Function would use an Event Grid trigger. Any suggestion will be helpful. The file transfer is a simple pass-through; no transformation is involved.
Also, for my requirement, would even a simple Azure Function work instead of a Durable Function? There is no need for function orchestration, as the files are not processed in batch; each file is processed based on the event trigger.
In my experience, using Azure Functions over ADF here is a good idea, for the following reasons:
Cost: Azure Data Factory is expensive and costs considerably more than Azure Functions for this workload.
Custom logic: ADF is not built to perform cleansing logic or run custom code. Its primary goal is data integration from external systems using its vast connector pool.
Latency: ADF has much higher latency due to the large overhead of its job framework.
The Durable Functions question mostly comes down to the maximum execution time of a single call. For "out of the box" functions on the Consumption plan that timeout is at most 10 minutes; for Durable Functions this limitation is effectively removed. In your case, where you simply need to copy the data, there might be a timeout issue with large files, so you could consider a Durable Function; otherwise a simple function should also work fine. Moreover, Durable Functions and normal functions share the same billing model.
For more details: https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-overview?tabs=csharp
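As a minimal sketch (my illustration, not the poster's setup), a plain Event Grid triggered function can sidestep both the timeout and the memory concerns by asking the storage service to perform a server-side copy instead of streaming the bytes through the function; the destination connection-string and container settings are assumptions.

```python
import logging
import os

import azure.functions as func
from azure.storage.blob import BlobClient

app = func.FunctionApp()

@app.event_grid_trigger(arg_name="event")
def copy_incoming_file(event: func.EventGridEvent) -> None:
    # Event Grid "BlobCreated" events carry the source blob URL in the payload.
    source_url = event.get_json()["url"]
    blob_name = source_url.rsplit("/", 1)[-1]

    # DEST_CONNECTION_STRING / DEST_CONTAINER are hypothetical app settings.
    dest = BlobClient.from_connection_string(
        os.environ["DEST_CONNECTION_STRING"],
        container_name=os.environ["DEST_CONTAINER"],
        blob_name=blob_name,
    )
    # Server-side copy: the 20-150 MB payload never passes through the function,
    # so execution stays well under the Consumption-plan timeout.
    # (For a private source account, append a SAS token to source_url.)
    props = dest.start_copy_from_url(source_url)
    logging.info("Copy of %s started, status: %s", blob_name, props["copy_status"])
```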

How to Trigger ADF Pipeline from Synapse Pipelines

Problem
Due to internal requirements, I need to run a Synapse pipeline and then trigger an ADF pipeline. It does not seem that there is a Microsoft-approved method of doing this. The pipelines run infrequently (every week or month) and the ADF pipeline must run after the Synapse pipeline.
Options
It seems that other answers pose several options:
Azure Functions. Create an Azure function that calls the CreatePipelineRun function on the ADF pipeline. At the end of the Synapse pipeline, insert a block that calls the Azure function.
Use the REST API and Web Activity. Use the REST API to make a call to run the ADF pipeline. Insert a Web Activity block at the end of the Synapse pipeline to make the API call. (See the sketch after this list for what that call looks like.)
Tables and polling. Insert a record into a table in a managed database with data about the Synapse pipeline run. Have regular polling from the ADF pipeline to check for new records and run when ready.
Storage Event. Create a timestamped blob file at the end of the Synapse run. Use the "storage event trigger" within ADF to trigger the ADF pipeline.
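For options 1 and 2, the call in question is the Data Factory CreatePipelineRun REST operation. A minimal sketch, assuming a managed identity or service principal that has been granted permission to run pipelines on the factory; the subscription, resource group, factory, and pipeline names are placeholders:

```python
import requests
from azure.identity import DefaultAzureCredential

# Placeholder identifiers - replace with your own.
SUBSCRIPTION = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
FACTORY = "<data-factory-name>"
PIPELINE = "<adf-pipeline-name>"

def run_adf_pipeline(parameters: dict | None = None) -> str:
    """Start an ADF pipeline run via CreatePipelineRun and return its run ID."""
    token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
    url = (
        f"https://management.azure.com/subscriptions/{SUBSCRIPTION}"
        f"/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.DataFactory"
        f"/factories/{FACTORY}/pipelines/{PIPELINE}/createRun"
        "?api-version=2018-06-01"
    )
    resp = requests.post(url, headers={"Authorization": f"Bearer {token}"},
                         json=parameters or {})
    resp.raise_for_status()
    return resp.json()["runId"]

if __name__ == "__main__":
    print("Started ADF run:", run_adf_pipeline())
```

The same URL is what a Synapse Web Activity would POST to (option 2), using the workspace's managed identity for authentication.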
Question
Which of these would be closest to the "approved" option? Are there any clear disadvantages to any of these?
As you mentioned, there is no "approved" solution for this problem. All the approaches you mentioned have pros and cons and should work. For me, Option #3 has been very successful. We have built a Queue Manager based on Tables & Stored Procedures in Azure SQL. We use Logic Apps to process the Triggers which can be Scheduled, Blob Events, or REST calls. Those Logic Apps insert jobs in the Queue table via Stored Procedure. That Stored Procedure can be called directly by virtually any system, so your Synapse pipeline could insert a Queue job to execute the ADF pipeline. Other benefits include a log of all the pipeline runs, support for multiple Data Factories (and now Synapse Workspaces), and a web interface we wrapped around the database for management and tracking.
We have 2 other Logic Apps that process the Queue (a Status manager and an Executor). These run constantly (every 1 minute and every 3 minutes). The actions to check status and create pipeline runs are both implemented as .NET Azure Functions [you'll need different SDKs for Synapse vs. ADF]. This system runs thousands of pipelines a month, sometimes more, across numerous Data Factories and Synapse Workspaces.
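As a rough illustration of the queue-insert step described above (every name here is hypothetical; the actual table and stored procedure are whatever your Queue Manager database defines), any caller just needs to execute the stored procedure:

```python
import pyodbc

# Hypothetical connection string and stored procedure for the queue database.
CONN_STR = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:<server>.database.windows.net;Database=<queue-db>;"
    "Authentication=ActiveDirectoryMsi;"
)

def enqueue_pipeline_run(factory_name: str, pipeline_name: str, requested_by: str) -> None:
    """Insert a job row so the Executor Logic App picks it up on its next poll."""
    with pyodbc.connect(CONN_STR) as conn:
        conn.execute(
            "EXEC dbo.usp_EnqueuePipelineRun @FactoryName=?, @PipelineName=?, @RequestedBy=?",
            factory_name, pipeline_name, requested_by,
        )
        conn.commit()

enqueue_pipeline_run("<data-factory-name>", "<adf-pipeline-name>", "synapse-weekly-load")
```

From a Synapse pipeline itself you would typically call the stored procedure with the built-in Stored Procedure activity rather than from code.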
The PROs here are many, but this disconnected approach permits facets of your system to operate in isolation. And it is flexible, in that you can tie virtually any system into the Queue. Your example of a pipeline that needs to execute another pipeline in a different system is a perfect example.
The CON here is that this is the most involved approach. If this is a one-off problem you are trying to solve, choose one of the other options.

What is the best way to execute long-running and high memory usage tasks on Azure?

I need to find the best way to perform long-running tasks on Azure.
Scenario:
User picks the dataset and filters on the web app (Azure App Service)
Based on the requirements we create a SQL query
The query is executed against one or more Azure SQL databases
Data is exported in .csv format and uploaded to Azure Blob Storage
The main issues are that execution of some SQL queries can last for 2+ hours and the result set can have 100M+ rows.
I believe that Azure Functions (and subsequently Durable Functions) are not an option because of the timeout and memory usage.
"I believe that Azure Functions (and subsequently Durable Functions) are not an option because of the timeout and memory usage."
The timeout limit only applies to the Consumption plan. If you want to get rid of the timeout limit and want more memory, you can use the Premium plan or an App Service plan. And because Azure Functions can scale out horizontally, you can split the work into multiple chunks and feed each one to a separate function invocation.
You can also use WebJobs.
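A sketch of one way to keep memory flat regardless of plan (my own illustration, not from the answer above): stream the query results in chunks and stage them to Blob Storage as CSV blocks instead of materializing 100M+ rows in memory. The connection strings, container, and query are placeholders.

```python
import csv
import io
import os
import uuid

import pyodbc
from azure.storage.blob import BlobBlock, BlobClient

def export_query_to_csv_blob(sql: str, blob: BlobClient, batch_size: int = 50_000) -> None:
    """Stream rows from Azure SQL into a block blob without holding the full result set in memory."""
    block_ids: list[str] = []

    def stage(text: str) -> None:
        block_id = uuid.uuid4().hex
        blob.stage_block(block_id, text.encode("utf-8"))
        block_ids.append(block_id)

    with pyodbc.connect(os.environ["SQL_CONNECTION_STRING"]) as conn:
        cursor = conn.cursor()
        cursor.execute(sql)

        header = io.StringIO()
        csv.writer(header).writerow([col[0] for col in cursor.description])
        stage(header.getvalue())                     # first block: CSV header

        while True:
            rows = cursor.fetchmany(batch_size)      # only batch_size rows in memory at a time
            if not rows:
                break
            buf = io.StringIO()
            csv.writer(buf).writerows(rows)
            stage(buf.getvalue())

    blob.commit_block_list([BlobBlock(block_id=b) for b in block_ids])

# Placeholder usage: container "exports" and the query are illustrative only.
target = BlobClient.from_connection_string(
    os.environ["BLOB_CONNECTION_STRING"], container_name="exports", blob_name="result.csv")
export_query_to_csv_blob("SELECT * FROM dbo.SomeLargeTable", target)
```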

Pulling data asynchronously from third-party web service on Windows Azure Platform

I want to pull a large amount of data, frequently, from different third-party API web services and store it in a staging area (this is what I want to decide right now), from where it will then be moved, one piece at a time as required, into my application's database.
I wanted to know whether I can use the Azure platform to achieve the above. How well suited is the Azure platform to this task?
What if the data to be pulled is large and the pull frequency is high, i.e. maybe half-hourly or hourly for 2,000 different users?
I assume that if this is possible at all, then bandwidth, data storage, server capacity, etc. will not be something for me to worry about, but rather for Microsoft. And obviously, I should be able to access the data whenever I need it.
If I had to implement it on Windows servers, I know I would use a Windows service to do this. But I don't know how it can be done on the Windows Azure platform, if it is possible at all.
As Rinat stated, you can use Lokad's solution. If you choose to do it yourself, you can run a timed task in your worker role - maybe spawn a thread that sleeps, waking every 30 minutes to perform its task. It can then reach out to the web services in question (or maybe one thread per web service?) and fetch data. You can store it temporarily in Azure Table Storage, which is a fraction of the cost of SQL Azure ($0.15 per GB), and then easily read it out of Table Storage on demand and transfer it to SQL Azure.
Assuming your services, storage, and SQL Azure are hosted in the same data center (by setting the affinity appropriately), you'd only pay for bandwidth when pulling data from the external web service. There'd be no bandwidth charges to read from Table Storage or insert into SQL Azure.
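The timed fetch-and-stage loop described above looks roughly like this (a sketch only; the original answer assumes a Worker Role thread, and the endpoint, table name, connection-string setting, and response fields here are placeholders):

```python
import os
import time
from datetime import datetime, timezone

import requests
from azure.core.exceptions import ResourceExistsError
from azure.data.tables import TableClient

POLL_INTERVAL_SECONDS = 30 * 60                             # wake every 30 minutes
THIRD_PARTY_ENDPOINT = "https://api.example.com/usage"      # placeholder web service

table = TableClient.from_connection_string(
    os.environ["STORAGE_CONNECTION_STRING"], table_name="staging")
try:
    table.create_table()
except ResourceExistsError:
    pass                                                    # table already exists

while True:
    for record in requests.get(THIRD_PARTY_ENDPOINT, timeout=60).json():
        table.upsert_entity({
            "PartitionKey": str(record["user_id"]),         # assumed field in the API response
            "RowKey": datetime.now(timezone.utc).isoformat(),
            "Body": str(record),                            # raw payload staged for later load into SQL
        })
    time.sleep(POLL_INTERVAL_SECONDS)
```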
In Windows Azure, a Worker Role is usually what hosts this kind of cloud processing. In order to accomplish your tasks you'll either need to implement the messaging/scheduling infrastructure yourself or use something like the Lokad.Cloud or Lokad.CQRS open-source projects for Azure.
We use Lokad.Cloud for distributed BI processing of hundreds of thousands of series, and Lokad.CQRS allows us to reliably retrieve and synchronize millions of products on a schedule.
There are samples, docs, and a community for both projects to get you started.
