Does the per-activity cost count each data factory individually? (Azure)

We are setting up a data factory to help with our global failover scenario. The pipeline copies data from our on-premises SQL Server into Azure Table Storage.
We are using Data Factory V2 and have set up the CI/CD pipeline as described in the ADF documentation.
Therefore, our dev and test instances only copy data from the SQL Server to one region, but production needs to copy data to multiple regions. My thought, to simplify things, was to have one factory per region that only copies data to that region (so that production and dev can share the exact same pipelines).
However, this means we will have multiple pipelines, all with rather low usage. There are only 3 activities that run once a day, so we will only have 90 activity runs per month per factory. Looking at the Data Factory pricing, you are charged for every 1,000 activities.
My question is: since each of these factories will have fewer than 1,000 activity runs, will we be charged the minimum of $1.50 for each factory, or will we be charged just once, since all of them together still total fewer than 1,000 activity runs?

Great question! The pricing is calculated per Data Factory instance, not per pipeline. You can have as many pipelines as you like in a single Data Factory instance; you are charged based on the number of activity runs within that Data Factory instance.
In your case, since you are planning to have multiple Data Factory instances, you will be billed for each of them. E.g., if you have 3 data factories (whether or not they are in different regions) and each ADF has 90 activity runs a month, you will be charged 3 × $1.50 = $4.50.
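A rough sketch of that math, assuming (as stated above) that each factory is billed for at least one block of 1,000 activity runs at the $1.50 rate quoted in the question; confirm the actual behaviour with the pricing calculator, since real billing may prorate differently:

import math

RATE_PER_1000_RUNS = 1.50  # $ per 1,000 activity runs (figure taken from the question)

def monthly_orchestration_cost(activity_runs):
    # Assumption: each factory pays for whole 1,000-run blocks, minimum one block
    blocks = max(1, math.ceil(activity_runs / 1000))
    return blocks * RATE_PER_1000_RUNS

# One factory per region, 3 activities once a day -> about 90 runs/month per factory
print(3 * monthly_orchestration_cost(90))    # 3 x $1.50 = $4.50
# A single factory serving all regions would stay inside one block
print(monthly_orchestration_cost(3 * 90))    # $1.50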
For an accurate estimate of pricing, please refer to:
https://azure.microsoft.com/en-in/pricing/calculator/
Hope this helps!

Related

Azure data factory copy activity waiting for so long at Time to first byte

I was trying to load data from on-premises to Azure using the ADF copy activity, with the query below. My source table is very large.
select acid,
mbid,
actid,
actdttm,
crettm,
rslvid,
hsid,
cdcflag,
cdcts
from df_lake.acity
where cdcts>'2022-06-06'
The ADF activity is taking a long time to load only a few records from the source. Below is a screenshot of the activity details.
I can see that most of the time is spent at "time to first byte"; after that, the data loads in seconds.
Kindly suggest how I can make this faster.
As per the official doc:
Check if there is any throttling error on the source or if your data store is under high utilization. If so, either reduce your workloads on the data store or increase the throttling limit or available resources.
Please follow this reference by Ahmad Yaseen for performance boosting and for connecting an on-premises data store to Azure Data Factory through the self-hosted integration runtime.
If you are already connected through the self-hosted integration runtime, try following this article to troubleshoot the copy activity on the self-hosted IR.

Data Factory cost reduction

How can we reduce the costs of Azure Data Factory (we have pipelines for data movement between tables, plus triggers, datasets, and alerts)?
Thanks in advance,
Before reducing costs, we need to monitor what the costs actually are, so we can avoid unnecessary spending. We could:
First, at the beginning of the ETL project, conduct a proof of concept and use a combination of per-pipeline consumption and the pricing calculator to estimate costs.
After you have deployed your pipelines to production, use the cost management features to set budgets and monitor costs. You can also review the forecasted costs and identify spending trends.
In addition, you can view per-pipeline consumption and per-activity consumption information to understand which pipelines and which activities are the costliest, and identify candidates for cost reduction.
Please refer to the document Plan and manage costs for Azure Data Factory for more details.
In addition, we can save costs on running ADF pipelines by using triggers.
Check this thread for the details.
If you are using many copy activities to move data, then under the Settings tab you can change the Data Integration Unit from the default of 4 down to 2. This can make a big difference if you have many copy activities, although it will reduce the performance of each copy activity. A rough sketch of where the setting lives follows below.
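Illustrative only: a trimmed copy-activity fragment (written here as a Python dict) showing where the DIU setting would sit; the activity name is a hypothetical placeholder, and in the portal the same value is exposed on the copy activity's Settings tab.

copy_activity = {
    "name": "CopyTableToRegion",              # hypothetical activity name
    "type": "Copy",
    "typeProperties": {
        "source": {"type": "SqlServerSource"},
        "sink": {"type": "AzureTableSink"},
        "dataIntegrationUnits": 2,            # lowered from the default of 4, per the suggestion above
    },
}
# Data movement is billed per DIU-hour, so a lower DIU setting reduces the
# per-hour data-movement charge, at the cost of slower copies.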

Azure Data Factory Pipeline Cost

I'm using Azure Data Factory for a migration project, and while doing it I came across something I need to clarify. I just want to know: if I keep a pipeline in ADF without using it, will there be a cost for that? I need to run some pipelines on a schedule, such as weekly or monthly. Please help.
Yes: $0.80 / month / inactive pipeline. According to Data Pipeline Pricing, under the heading Inactive pipelines,
A pipeline is considered inactive if it has no associated trigger or any runs within the month. An inactive pipeline is charged at $0.80 per month.
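A quick worked example of that rate, assuming the $0.80 per inactive pipeline per month figure quoted above applies to your account:

INACTIVE_PIPELINE_RATE = 0.80   # $ per month, from the quote above

idle_pipelines = 5              # hypothetical count of pipelines with no trigger or runs this month
print(idle_pipelines * INACTIVE_PIPELINE_RATE)   # $4.00 per month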
I just want to know if I keep a pipeline in ADF without using it will there be a cost for that?
The quick answer is NO. Based on the ADF pricing document, the billing consists of Data Pipelines and SQL Server Integration Services.
Your account only needs to pay when you execute your activities (in the pipelines), or for the SQL Server Integration Services part when migrating SQL Server databases.

Azure Data Factory(ADF) vs Azure Functions: How to choose?

Currently we are using blob-triggered Azure Functions to move JSON data into Cosmos DB. We are planning to replace the Azure Functions with an Azure Data Factory (ADF) pipeline.
I am new to Azure Data Factory (ADF), so I'm not sure: would an Azure Data Factory (ADF) pipeline be the better option or not?
Though my answer is a bit late, I would like to add that I would not recommend replacing your current setup with ADF. Reasons:
It is too expensive. ADF costs way more than Azure Functions.
Custom logic: ADF is not built to perform cleansing logic or run any custom code. Its primary goal is data integration from external systems using its vast connector pool.
Latency: ADF has much higher latency due to the large overhead of its job framework.
Based on your requirements, Azure Data Factory is your perfect option. You could follow this tutorial to configure the Cosmos DB output and the Azure Blob Storage input.
The advantage over an Azure Function is that you don't need to write any custom code unless data cleaning is involved; Azure Data Factory is the recommended option, and even if you want an Azure Function for other purposes, you can add it within the pipeline.
The fundamental use of Azure Data Factory is data ingestion. Azure Functions are serverless (Function as a Service) and are best used for short-lived work; Azure Functions that execute for multiple seconds are far more expensive. Azure Functions are good for event-driven microservices. For data ingestion, Azure Data Factory is the better option, as its running cost for huge volumes of data will be lower than Azure Functions. You can also integrate Spark processing pipelines in ADF for more advanced data ingestion pipelines.
Moreover, it depends upon your situation. Azure Functions are serverless, lightweight processes meant for quick responses to an event, rather than the volumetric responses that belong to batch processes.
So, if your requirement is to respond quickly to an event with a little information, stay with Azure Functions; if you need a batch process, switch to ADF.
Cost
Let's calculate the cost.
If your file is large (7.514 GB, taking 43:51 hours):
43:51 hours = 43.867 h
4 (DIU) × 43.867 (h) × 0.25 ($/DIU-hour) = $43.867
$43.867 / 7.514 GB = 5.838 ($/GB)
If your file is small (2.497 MB, taking about 45 seconds):
4 (DIU) × 1/60 (h) × 0.25 ($/DIU-hour) = $0.0167
2.497 MB / 1024 = 0.00244 GB
$0.0167 / 0.00244 GB = 6.844 ($/GB)
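The same arithmetic as a small sketch; it assumes the 4 DIU and $0.25 per DIU-hour figures used in this answer (rates vary by region and integration runtime):

DIU = 4
RATE_PER_DIU_HOUR = 0.25   # $ per DIU-hour, as used above

def cost_per_gb(hours, gigabytes):
    # Data-movement cost divided by the volume moved
    return DIU * hours * RATE_PER_DIU_HOUR / gigabytes

print(cost_per_gb(43.867, 7.514))        # large file: ~5.84 $/GB
print(cost_per_gb(1 / 60, 2.497 / 1024)) # small file: ~6.8 $/GB (the ~45 s run rounded up to a minute)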
Scale
The maximum number of instances an Azure Function can scale out to is 200.
ADF can run 3,000 concurrent external activities. In my test, only 1,500 copy activities were running in parallel. (This test wasted a lot of money.)

DTU utilization is not going above 56%

I'm doing a data load into Azure SQL Database using Azure Data Factory V2. I started the data load with the database on the Standard pricing tier with 800 DTUs. It was slow, so I increased the DTUs to 1600. (My pipeline has now been running for 7 hours.)
I then decided to change the pricing tier. I changed it to Premium with 1000 DTUs. (I didn't make any additional changes.)
The pipeline failed because it lost the connection, so I reran it.
Now, when I monitor the pipeline, it is working fine; but when I monitor the database, the DTU usage on average is not going above 56%.
I am dealing with a tremendous amount of data. How can I speed up the process?
I expect the DTUs to max out, but the average utilization is around 56%.
Please follow the document Copy activity performance and scalability guide.
It walks through the performance tuning steps.
One of the ways is to increase the Azure SQL Database tier for more DTUs. You have already increased it to 1000 DTUs, yet the average utilization is around 56%, so I don't think you need a higher pricing tier.
You need to think about other ways to improve the performance, such as setting more Data Integration Units (DIUs).
A Data Integration Unit is a measure that represents the power (a combination of CPU, memory, and network resource allocation) of a single unit in Azure Data Factory. Data Integration Units only apply to the Azure integration runtime, not the self-hosted integration runtime.
Hope this helps.
The standard answer from Microsoft seems to be that you need to tune the target database or scale up to a higher tier. This suggests that Azure Data Factory is not a limiting factor in the copy performance.
However, we've done some testing with a single table, a single copy activity, and ~15 GB of data. The table did not contain varchar(max) or high-precision columns, just simple, plain data.
Conclusion: it barely matters which tier you choose (not too low, of course); roughly above S7 / 800 DTU / 8 vCores, the performance of the copy activity is ~10 MB/s and does not go up. The load on the target database is 50%-75%.
Our assumption is that, since we could keep throwing higher database tiers at this problem without seeing any improvement in copy activity performance, this is Azure Data Factory related.
Our solution, since we are loading a lot of separate tables, is to scale out instead of scale up, via a ForEach loop with the batch count set to at least 4 (a sketch of this approach is included below).
The approach of increasing the DIUs only applies in some cases:
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance#data-integration-units
Setting of DIUs larger than four currently applies only when you copy multiple files from Azure Storage, Azure Data Lake Storage, Amazon S3, Google Cloud Storage, cloud FTP, or cloud SFTP to any other cloud data stores.
In our case, we are copying data from relational databases.
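For reference, a sketch of the scale-out approach mentioned above, written as a Python dict in the shape of a ForEach activity definition; the names and the tableList parameter are hypothetical placeholders, not a definition taken from our factory:

foreach_activity = {
    "name": "CopyTablesInParallel",
    "type": "ForEach",
    "typeProperties": {
        "isSequential": False,   # run the iterations in parallel
        "batchCount": 4,         # at least 4 parallel copies, as described above
        "items": {
            "value": "@pipeline().parameters.tableList",   # hypothetical pipeline parameter
            "type": "Expression",
        },
        "activities": [
            {"name": "CopySingleTable", "type": "Copy"}    # one copy activity per table
        ],
    },
}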
