Azure Data Factory Pricing - Activity Count

I'm thinking of using Data Factory to copy data from a Blob Storage container to an SQL table, but I'm not quite sure I understand how the pricing works, specifically how the activities are counted.
If I have a pipeline with 3 activities that copies the data from a CSV with 1000 lines, will the total activity count be 3*1 or 3*1000? In other words, will I be charged based on the number of files it processes or the total number of lines it copies?

That's 3 activity runs. Activity runs are billed by the thousand, at $1 per 1,000 runs. Since these are Copy activities, they also consume Data Integration Units (DIUs) at $0.25 per DIU-hour, and pipeline activity execution time is billed at $0.005 per hour. If you add all that up for 1 pipeline with 3 Copy activities that runs for 1 hour, your total bill is something like 27 cents.
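To make that estimate concrete, here is a rough back-of-the-envelope in Python (the DIU-hours figure is an assumption, not something ADF reports up front; the rates are the list prices quoted above):

# Rough back-of-the-envelope for the 1-pipeline / 3-Copy-activity / ~1-hour example above.
activity_runs = 3
orchestration = activity_runs / 1000 * 1.00   # $1 per 1,000 activity runs
copy_diu_hours = 1                            # assume roughly 1 DIU-hour of copy time in total
data_movement = copy_diu_hours * 0.25         # $0.25 per DIU-hour
pipeline_hours = 1
execution = pipeline_hours * 0.005            # $0.005 per hour of pipeline activity execution
total = orchestration + data_movement + execution
print(f"~${total:.2f}")                       # ~$0.26, in the same ballpark as the answer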
We run THOUSANDS of pipelines a month, all with many activities including quite a few Copy activities. Our Data Factory billing is still so low that it looks like a rounding error in our total Azure spend.
The exception to this is Data Flow. Data Flow is a Spark wrapper, so you have to pay for Cluster time, which can get expensive quickly if you aren't careful.

Actually, you have to pay for 2 important metrics: Orchestration and Execution. Please refer to this document for more details.
1. Orchestration: $1 per 1,000 runs. With 3 activity runs, that is $3/1000 = $0.003.
2. Execution: depends on the DIUs you configure, which determine the throughput of the data movement.

Related

Azure Data Factory ForEach activity pricing

I'm having a hard time understanding how ADF will be charged in the following scenario:
1 pipeline with 2 activities, one of them a ForEach that loops 1000+ times. The activity inside the ForEach is a stored procedure.
So will this be 2 activity runs or more than 1000?
You can see the result in the ADF monitor - but it will be 1002 activity runs at $1/1000 (or whatever the current rate is).
It's much cheaper (if you mind the dollar) if you can pass the list into your proc; the lookup.output field is just JSON with your table list in it, which you could parse in the proc.
https://learn.microsoft.com/en-us/azure/data-factory/pricing-concepts
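For completeness, a quick sketch of the activity-run arithmetic (it assumes a Lookup feeds the ForEach, which is how the 1002 figure above comes about):

# Counting billable activity runs for the ForEach scenario above.
items = 1000                          # rows returned by the Lookup
runs_with_foreach = 1 + 1 + items     # Lookup + ForEach + one stored-proc call per item
runs_single_proc = 1 + 1              # Lookup + one proc call that parses the JSON itself
rate = 1.00 / 1000                    # $1 per 1,000 activity runs (orchestration only)
print(runs_with_foreach * rate)       # ~$1.00
print(runs_single_proc * rate)        # ~$0.002 (execution time is billed separately)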

Azure cost analysis for a particular subscription using Python SDK

So I'm trying to automate fetching the current cost and cost forecast (Like it is shown under cost analysis for a particular subscription) for a particular subscription using python SDK but I haven't been able to find a single API that does this yet.
I've tried using UsageAggregate and Rate card but I haven't really figured out a way to find the cost for the current month to date. If there is an API that I'm missing or if I need to calculate monthly costs myself, I'd appreciate any code snippets or help.
If you already have the usage and the ratecard data, then you must combine them.
Take the meterId of the usage data and get the related ratecard data.
The ratecard data contains the MeterRates and the IncludedQuantity, which you need.
There may be multiple meter rates and an included quantity because costs often differ per usage tier (e.g. the first 10 calls free, the first 3 GB free, ...).
The consumption starts/is reset on the 14th of the month. That's why you have to read the data for the whole billing period (which begins on the 14th of each month); it's the only way to get the correct consumption.
So, if you are using e.g. Azure Functions, you use 100,000 units per day, and you want the costs from the 20th to the 30th, the calculation works as follows:
Read the data from the 14th to the 30th. That is 17 days, so 1,700,000 units were used. The first 400,000 are free (the IncludedQuantity), which in this sample covers the first 4 days.
From unit 400,001 on, you take the meter rate (€0.0000134928) and calculate the costs: 1,300,000 * 0.0000134928 ≈ €17.54.
Fortunately, Azure Functions has only one rate. If the rate changed (e.g. after 5,000,000 units), you would have to take that into account as well. Once you have the costs for the whole period, you can filter down to your dates (the 20th to the 30th) to get the result.
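A minimal Python sketch of the same IncludedQuantity / MeterRates arithmetic, using the sample numbers from above (the tier handling is an assumption; real ratecard data may need more care):

# Hedged sketch: subtract IncludedQuantity first, then walk the MeterRates tiers.
# Numbers are the sample values from the answer above, not real ratecard data.
included_quantity = 400_000
meter_rates = {0: 0.0000134928}   # {tier start quantity: price per unit}

def cost_for(units_used: float) -> float:
    billable = max(units_used - included_quantity, 0)
    thresholds = sorted(meter_rates)
    cost = 0.0
    for i, start in enumerate(thresholds):
        end = thresholds[i + 1] if i + 1 < len(thresholds) else float("inf")
        cost += max(min(billable, end) - start, 0) * meter_rates[start]
    return cost

# 17 days * 100,000 units/day = 1,700,000 units -> 1,300,000 billable -> ~17.54
print(round(cost_for(1_700_000), 2))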
I implemented this calculation in C# and published it as a NuGet package here. It also contains a sample console application you could use to export the data.
I know I am a bit late to the party, but after struggling with the same problem, I managed to create the code for getting the cost of a resource group using
azure.mgmt.costmanagement
Link to cost management API
Code sample is in my answer here
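For a month-to-date subscription cost, a minimal sketch with azure-mgmt-costmanagement might look roughly like this (the QueryDefinition/query.usage surface, the scope string, and the aggregation name are assumptions and may differ between package versions):

# Hedged sketch: query month-to-date actual cost for a subscription scope.
from azure.identity import DefaultAzureCredential
from azure.mgmt.costmanagement import CostManagementClient
from azure.mgmt.costmanagement.models import QueryAggregation, QueryDataset, QueryDefinition

subscription_id = "<subscription-id>"        # placeholder
scope = f"/subscriptions/{subscription_id}"

# Client construction may differ slightly between versions of the package.
client = CostManagementClient(DefaultAzureCredential())
result = client.query.usage(
    scope,
    QueryDefinition(
        type="ActualCost",
        timeframe="MonthToDate",
        dataset=QueryDataset(
            granularity="None",
            aggregation={"totalCost": QueryAggregation(name="Cost", function="Sum")},
        ),
    ),
)
# The response comes back as columns + rows; the aggregated cost is in the rows.
for row in result.rows:
    print(row)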

How do I identify the source of Cosmos DB RU consumption?

Context
I have an Azure Function which executes daily; it change-tracks a pricing feed, with the results stored in a Cosmos DB. Each time it runs, it compares the latest price from the feed against the most recent values in the DB collection and writes a new item if there is a difference. It is scheduled to run at 23:55 each day. The overall setup is tiny, with 10 items in the DB, and changes seen usually once a week. My consumption is 3.84 RUs for the daily execution when the price hasn't changed.
Question
In addition to the expected activity at 23:55 each day, there is additional activity appearing almost every 3 hrs 45 mins. This unexpected activity consumes 4 RUs each time (around 24 RUs per day). How can I identify the source of the additional activity?
Other info
I noticed that the backups were scheduled to 4 hours, so I changed that to daily. That didn't help.
Since noticing the additional usage I have added diagnostic settings to send all logs to a Log Analytics workspace. I can see that the RUs are Read operations, Type = AzureDiagnostics, requestResourceType = Collection, Source IP = 51.11.40.180, and 1 of the 4 is a read against the __Cosmos/colls/__Query collection. This suggests to me that it's the diagnostics causing the cost. I disabled the diagnostic settings to see if that reduced my RU consumption, but it did not.
Is it just the case that these diagnostics are run by Microsoft and are simply part of having a Cosmos DB?
And that they are usually a small proportion of the overall cost, and therefore not an issue?
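One way to chase the extra RUs, since the diagnostics are already flowing to Log Analytics, is to group the request charge by operation and caller. This is only a sketch: the AzureDiagnostics column names (requestCharge_s, clientIpAddress_s) and the Category value are assumptions about the Cosmos DB diagnostics schema.

# Hedged sketch: summarize Cosmos DB RU charge by operation and source IP.
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

query = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.DOCUMENTDB" and Category == "DataPlaneRequests"
| summarize TotalRU = sum(todouble(requestCharge_s)), Requests = count()
    by OperationName, clientIpAddress_s, bin(TimeGenerated, 15m)
| order by TimeGenerated asc
"""

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace("<workspace-id>", query, timespan=timedelta(days=1))
for table in response.tables:
    for row in table.rows:
        print(row)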

TTL configuration and billing in Azure Integration Runtime

Doing some tests, I could see that having an Azure Integration Runtime (AIR) allowed us to considerably reduce the amount of time required to finish a pipeline.
To fully understand the use of this configuration and its billing as well, I've got these questions. Let's assume I've got two independent pipelines, all of their Data Flow activities use the same AIR with a TTL = 10 minutes.
The first pipeline takes 7 minutes to finish. The billing will be (if I understand correctly):
billing: time to acquire cluster + job execution time + TTL (7 + 10)
Five minutes later, I trigger the second pipeline. It takes only 3 minutes to finish (I understand it will also reuse the same pool as the first one). After it concludes, is the TTL reset to 10 minutes again, or is it now 2 minutes,
i.e. 10 - 5 - 3 (original TTL - start offset of the second pipeline - its runtime)? In that case, what will happen if I trigger a third pipeline that takes more than 2 minutes?
And what about the billing, how is it going to be calculated?
Regards.
Look at the ADF pipeline monitoring view and find all of your data flow activity executions.
Add up that total data flow activity execution time.
Now add the TTL value for that Azure IR you were using to that total.
That is the total time you will be billed for.
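Read literally, that rule gives something like the following for the two pipelines in the question. This is my reading of the answer, so treat it as an assumption; the vCore count is a placeholder, not a quoted default.

# Hedged sketch: billed time = sum of Data Flow execution times + the Azure IR's TTL.
def billed_vcore_hours(data_flow_minutes, ttl_minutes, vcores=8):
    # vcores=8 is an assumed compute size for the Azure IR, not an official default.
    return (sum(data_flow_minutes) + ttl_minutes) / 60 * vcores

# Two pipelines (7 min and 3 min) sharing one Azure IR with a 10-minute TTL:
print(billed_vcore_hours([7, 3], 10))   # 20 billable minutes -> ~2.67 vCore-hours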

Azure Durable Functions - fanout-fanin scalability

We are a skills-based development company that creates competitions. The players in a competition can upload photos and rank each other's photos to earn points. One of the key requirements is to update the competition leaderboard regularly to keep the players interested. We are looking at a fan-out/fan-in architecture to implement the leaderboard. A typical workflow is attached.
From our analysis, Durable Functions seems to be the best option.
However, we have the following constraints:
Each competition has about 500 players
A player will be ranking up to 500 photos
I have been trying to read through the documentation, but could not find anything on the scalability of this approach with Durable Functions. Any comments or insights are highly appreciated.
You can find the performance targets for Durable Functions here: https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-perf-and-scale#performance-targets
Parallel activity execution (fan-out): 100 activities per second, per instance
Parallel response processing (fan-in): 150 responses per second, per instance
If you run on an Azure Functions Consumption plan, the scale controller will scale out to more instances as more messages appear in the work item queue.
This is the queue used to start activities (the ones you would use to calculate a single player's score).
You can also improve fan-in performance by doing what they say in the docs:
Unlike fan-out, fan-in operations are limited to a single VM. If your application uses the fan-out, fan-in pattern and you are concerned about fan-in performance, consider sub-dividing the activity function fan-out across multiple sub-orchestrations.
So you'd have:
Main orchestrator
  Batch 0 sub-orchestrator
    Activity for user 0 in batch 0
    Activity for user 1 in batch 0
    ...
  Batch 1 sub-orchestrator
    Activity for user 0 in batch 1
    Activity for user 1 in batch 1
    ...
  ...
The reason this kind of sub-orchestrator batching makes things faster is that your orchestrator's history table gets more and more rows as activities complete,
and the orchestrator has to load those rows every time a result comes in.
So by capping how many rows any single orchestration accumulates, you keep performance up.
TL;DR: I think the fan-out will scale well, but you may want to do sub-orchestrator batching to improve fan-in performance.
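As an illustration only, sub-orchestrator batching in Python Durable Functions could look roughly like this (the function names, batch size, and leaderboard activity are invented for the sketch; each orchestrator and activity still needs its own function registration):

# Hedged sketch of the fan-out / sub-orchestrator batching / fan-in layout described above.
import azure.durable_functions as df

BATCH_SIZE = 100  # assumed batch size: ~5 sub-orchestrations for 500 players

def main_orchestrator(context: df.DurableOrchestrationContext):
    players = context.get_input() or []   # e.g. the ~500 player ids
    batches = [players[i:i + BATCH_SIZE] for i in range(0, len(players), BATCH_SIZE)]
    # Fan out to one sub-orchestrator per batch, then fan in on their results.
    results = yield context.task_all(
        [context.call_sub_orchestrator("score_batch", batch) for batch in batches])
    # Flatten the per-batch score lists and hand them to a leaderboard activity.
    scores = [score for batch in results for score in batch]
    yield context.call_activity("update_leaderboard", scores)

def score_batch_orchestrator(context: df.DurableOrchestrationContext):
    players = context.get_input()
    # Fanning out inside the batch keeps each orchestration's history table small.
    return (yield context.task_all(
        [context.call_activity("calculate_player_score", p) for p in players]))

main = df.Orchestrator.create(main_orchestrator)
score_batch = df.Orchestrator.create(score_batch_orchestrator)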
