How can we reduce the costs of Azure Data Factory? (We have pipelines for data movement between tables, plus triggers, datasets, and alerts.)
Thanks in advance,
Before reducing costs, we need to monitor where the costs come from, so we can avoid unnecessary spend.
We could:
First, at the beginning of the ETL project, conduct a proof of concept and use a combination of per-pipeline consumption and the pricing calculator to estimate costs.
After you have deployed your pipelines to production, use the cost management features to set budgets and monitor costs. You can also review the forecasted costs and identify spending trends.
In addition, you can view per-pipeline consumption and per-activity consumption information to understand which pipelines and which activities are costliest, and identify candidates for cost reduction.
Please refer to the document Plan and manage costs for Azure Data Factory for more details.
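As a rough illustration of the kind of estimate this produces, here is a minimal sketch that combines per-pipeline consumption figures with list prices. All rates and run counts below are assumptions for illustration; use the pricing calculator for real, region-specific numbers.

```python
# Back-of-the-envelope monthly estimate from per-pipeline consumption.
# All rates and volumes below are illustrative assumptions; use the Azure
# pricing calculator for current, region-specific prices.
ORCHESTRATION_PER_1000_RUNS = 1.00   # $ per 1,000 activity/trigger runs (assumed)
COPY_PER_DIU_HOUR = 0.25             # $ per DIU-hour for copy on Azure IR (assumed)

activity_runs_per_month = 30 * 10        # e.g. 10 activity runs per day
copy_diu_hours_per_month = 30 * 4 * 0.5  # e.g. one 30-minute copy at 4 DIUs per day

estimate = (activity_runs_per_month / 1000) * ORCHESTRATION_PER_1000_RUNS \
    + copy_diu_hours_per_month * COPY_PER_DIU_HOUR
print(f"Estimated monthly spend: ${estimate:.2f}")  # ~$15.30 with these assumptions
```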
In addition, we can save on the cost of running ADF pipelines by using triggers.
Check this thread for the details.
If you are using many copy activities to move data, then under the copy activity's Settings tab you can lower the Data Integration Unit setting to 2 (the default is 4). This can make a big difference if you have many copy activities.
Bear in mind that this will reduce the performance of the copy activity, however.
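As a rough illustration of why this matters (and of the trade-off just mentioned), copy activities are billed as DIUs × duration × a per-DIU-hour rate. The $0.25/DIU-hour rate and the durations below are assumptions for illustration.

```python
# Rough illustration: copy activity cost = DIUs x duration (hours) x rate.
# The $0.25/DIU-hour rate and the durations are assumptions for illustration.
RATE_PER_DIU_HOUR = 0.25

def copy_cost(dius: int, duration_hours: float) -> float:
    return dius * duration_hours * RATE_PER_DIU_HOUR

print(copy_cost(4, 0.5))   # default 4 DIUs, 30-minute copy           -> $0.50 per run
print(copy_cost(2, 0.5))   # 2 DIUs, same duration                    -> $0.25 per run
print(copy_cost(2, 1.0))   # but if lowering DIUs doubles the duration -> $0.50 again
```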
Related
I am working on modernizing a reporting solution where the data sources are on-premises on the customers' SQL Servers (2014) and the reports are displayed as Power BI reports on the customer's Power BI Service portal. Today I use SSIS to build a data warehouse, as well as an on-premises data gateway to transport the data up to an Azure Analysis Services instance, which in turn is used by the Power BI reports.
I have been wondering if I could use Azure Synapse to connect to the customer data and, in the most cost-effective way, transport the data to Azure and link it to the Power BI workspace as a shared dataset. There are many possibilities, but it is important that the customers experience the reports as fast and stable, and if possible near real time.
I find SSIS cumbersome and expensive in Azure. Are there mechanisms that make it cheap and fast to get data into Azure? Do I need a data warehouse (Azure SQL Database), or is it better to use a data lake as storage for the data? I need to do incremental loads too. And what if I need to do some transformations? Should I use Power BI dataflows, or do I need to create Azure data flows to achieve this?
Does anyone have good experience using Synapse (also with DevOps in mind) to get good DEV, TEST and PROD environments for this? Or is Synapse a cost driver and a simpler implementation will do? Give me your opinions, and if you have links to good articles, please share them. Looking forward to hearing from you.
Regards, Geir
The honest answer is it depends on a lot of different things and I don't know that I can give you a solid answer. What I can do is try to focus down which services might be the best option.
It is worth noting that a Power BI dataset is essentially an Analysis Services database behind the scenes, so unless you are using a feature that is specifically only available in AAS and using a live connection, you may be able to eliminate that step. Refresh options are one of the things that are more limited in Power BI though, so the separate AAS DB might be necessary for your scenario.
There is a good chance that Power BI dataflows will work just fine for you if you can eliminate the AAS instance, and they have the added advantage of having incremental refresh as a core feature. In that case, Power BI will store the data in a data lake for you.
Synapse is an option, but probably not the best one for your scenario unless your dataset is large. SQL pools can get quite expensive, especially if you aren't making use of any of the compute options to do transformations.
Data Factory (also available as Synapse pipelines) without the SSIS integration is generally the least expensive option for moving large amounts of data. It allows you to use data flows to do some transformations and has features like incremental load. Outputting to a data lake is probably fine and the most cost-effective, though in some scenarios something like an Azure SQL instance could be required if you specifically need some of its features.
If they want true real time, it can be done, but none of those tools really are built for it. In most cases the 48 refreshes per day (aka every 30 minutes) available on a Premium capacity are close enough to real time once you dig into the underlying purpose of a given report.
For true real-time reporting, you would look at push and/or streaming datasets in Power BI and feed them with something like a Logic App or possibly Stream Analytics. There are a lot of limitations with push datasets though; more than likely you would want to set up a regular Power BI report and dataset and then add the real-time dataset as a separate entity in addition to that.
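For reference, pushing rows into a Power BI push dataset is a single REST call against the Power BI API. The sketch below assumes you already have an AAD access token and an existing push dataset; the dataset ID, table name, and row schema are placeholders.

```python
# Minimal sketch: push rows into an existing Power BI push dataset.
# The token, dataset id, table name, and row schema are placeholders and
# must match your own push dataset definition.
import requests

access_token = "<AAD access token for the Power BI API>"   # placeholder
dataset_id = "<push dataset id>"                           # placeholder

url = (
    "https://api.powerbi.com/v1.0/myorg/datasets/"
    f"{dataset_id}/tables/RealtimeReadings/rows"
)
rows = {"rows": [{"timestamp": "2024-01-01T12:00:00Z", "sensor": "line-1", "value": 42.0}]}

resp = requests.post(url, json=rows, headers={"Authorization": f"Bearer {access_token}"})
resp.raise_for_status()
```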
As far as DevOps goes, pretty much any Azure service can be integrated with a pipeline. In addition to any code, any service or service settings can be deployed via an ARM template or CLI script.
Power BI has improved in the past couple of years to have much better support for DevOps and dev/test/prod environments. Current best practices can be found in the Power BI documentation: https://learn.microsoft.com/en-us/power-bi/create-reports/deployment-pipelines-best-practices
I am looking for help reading data from 100+ subscriptions in one go. I have a requirement to create a dashboard which can read the data from all the subscriptions at once and show the trend in graphical format.
For example: CPU utilization of VMs.
Can we read the CPU utilization from all the VMs across all the subscriptions and put the highest ones in a graph, so that it would be easy for the monitoring team to monitor the platform? It is always easier to see the data in a dashboard rather than going through 1,000 emails on a daily basis.
You can use a metrics step to query many resources at once, if you need true metrics.
But there isn't generally a "top X metric value across N subscriptions" functionality in metrics (yet); you are querying a specific metric for specific resources. You could use workbooks to find the resources in a subscription and use those resources in a metrics step.
Most of these things are limited, though; I know metrics is limited to 200 resources at a time.
If you were using something like log analytics, and had all the VMs emitting metrics to the same workspace, then you could do this as a single log analytics query, though.
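A minimal sketch of that approach using the azure-monitor-query SDK, assuming all the VMs report the classic Perf table to one Log Analytics workspace (the workspace ID is a placeholder, and the exact table and counter names depend on which agent you use):

```python
# Minimal sketch: top 10 VMs by average CPU over the last day, from one
# Log Analytics workspace. Assumes the VMs emit the classic Perf table;
# with the Azure Monitor agent the data may land in InsightsMetrics instead.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

query = """
Perf
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| summarize AvgCpu = avg(CounterValue) by Computer
| top 10 by AvgCpu desc
"""

response = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",  # placeholder
    query=query,
    timespan=timedelta(days=1),
)

for table in response.tables:
    for row in table.rows:
        print(row)
```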
Currently we are using blob-triggered Azure Functions to move JSON data into Cosmos DB. We are planning to replace the Azure Functions with an Azure Data Factory (ADF) pipeline.
I am new to Azure Data Factory, so I am not sure: would an ADF pipeline be the better option or not?
Though my answer is a bit late, I would like to add that I would not recommend replacing your current setup with ADF. Reasons:
Cost: it is too expensive. ADF costs way more than Azure Functions.
Custom logic: ADF is not built to perform cleansing logic or run custom code. Its primary goal is data integration from external systems using its vast connector pool.
Latency: ADF has much higher latency due to the large overhead of its job framework.
Based on your requirements, Azure Data Factory is your perfect option. You could follow this tutorial to configure the Cosmos DB output and the Azure Blob Storage input.
The advantage over Azure Functions is that you don't need to write any custom code unless data cleansing is involved, and Azure Data Factory is the recommended option; even if you want an Azure Function for other purposes, you can add it as an activity within the pipeline.
The fundamental use of Azure Data Factory is data ingestion. Azure Functions are serverless (Functions as a Service) and are best used for short-lived executions; functions that run for many seconds become far more expensive. Azure Functions are good for event-driven microservices. For data ingestion, Azure Data Factory is the better option, as its running cost for huge amounts of data will be lower than Azure Functions. You can also integrate Spark processing pipelines into ADF for more advanced data ingestion pipelines.
That said, it depends on your situation. Azure Functions are serverless, lightweight processes meant for quick responses to an event, rather than for the volumetric workloads handled by batch processes.
So, if your requirement is to respond quickly to an event with a small amount of data, stay with Azure Functions; if you need a batch process, switch to ADF.
Cost
Let's calculate the cost:
If your file is large (≈7.514 GB, copy duration of about 43:51, i.e. ≈43.867 h):
4 DIU × 43.867 h × $0.25/DIU-hour ≈ $43.87
$43.87 / 7.514 GB ≈ $5.84 per GB
If your file is small (2.497 MB, taking about 45 seconds, billed as roughly one minute):
4 DIU × (1/60) h × $0.25/DIU-hour ≈ $0.0167
2.497 MB / 1024 ≈ 0.00244 GB
$0.0167 / 0.00244 GB ≈ $6.84 per GB
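The same arithmetic as a small script, using the $0.25/DIU-hour rate and the durations from the figures above:

```python
# Reproduces the per-GB arithmetic above: cost = DIUs x hours x $/DIU-hour.
RATE = 0.25  # $/DIU-hour, the rate used in the figures above

def cost_per_gb(dius: int, hours: float, gigabytes: float) -> float:
    return dius * hours * RATE / gigabytes

# Large file: ~7.514 GB copied in ~43.87 hours at 4 DIUs.
print(cost_per_gb(4, 43.867, 7.514))          # ~5.84 $/GB
# Small file: 2.497 MB, billed as roughly one minute at 4 DIUs.
print(cost_per_gb(4, 1 / 60, 2.497 / 1024))   # ~6.84 $/GB
```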
Scale
The maximum number of instances an Azure Function can scale out to is 200.
ADF can run 3,000 concurrent external activities, and in my test only 1,500 copy activities were actually running in parallel. (This test wasted a lot of money.)
I'm doing a data load into Azure SQL Database using Azure Data Factory v2. I started the data load with the database set to the Standard pricing tier with 800 DTUs. It was slow, so I increased the DTUs to 1600. (My pipeline has now been running for 7 hours.)
I then decided to change the pricing tier to Premium, with the DTUs set to 1000. (I didn't make any additional changes.)
The pipeline failed as it lost the connection, so I reran it.
Now, when I monitor the pipeline, it is working fine, but when I monitor the database, the DTU usage on average is not going above 56%.
I am dealing with a huge amount of data. How can I speed up the process?
I expected the DTUs to max out, but the average utilization is around 56%.
Please follow the document Copy activity performance and scalability guide.
It walks through the performance tuning steps.
One way is to increase the Azure SQL Database tier to get more DTUs. You have already increased the tier to 1000 DTUs, but the average utilization is around 56%, so I don't think you need a higher pricing tier.
You need to think about other ways to improve performance, such as setting more Data Integration Units (DIUs).
A Data Integration Unit is a measure that represents the power (a combination of CPU, memory, and network resource allocation) of a single unit in Azure Data Factory. Data Integration Unit only applies to Azure integration runtime, but not self-hosted integration runtime.
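For reference, DIUs (and the parallel copy setting) are configured on the copy activity itself. Below is a minimal sketch of the relevant fragment, shown as a Python dict for readability; the source and sink details are placeholders.

```python
# Minimal sketch of the relevant part of a copy activity definition,
# expressed as a Python dict. Source and sink settings are placeholders.
copy_activity = {
    "name": "CopyToAzureSql",
    "type": "Copy",
    "typeProperties": {
        "source": {"type": "<your source type>"},  # placeholder
        "sink": {"type": "<your sink type>"},      # placeholder
        "dataIntegrationUnits": 8,  # more DIUs = more copy power (and cost)
        "parallelCopies": 4,        # degree of parallelism within one copy
    },
}
```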
Hope this helps.
The standard answer from Microsoft seems to be that you need to tune the target database or scale up to a higher tier. This suggests that Azure Data Factory is not a limiting factor in the copy performance.
However, we've done some testing with a single table, a single copy activity, and ~15 GB of data. The table did not contain varchar(max) or high-precision columns, just plain and simple data.
Conclusion: it barely matters what tier you choose (not too low, of course); roughly above S7 / 800 DTU / 8 vCores, the performance of the copy activity stays at ~10 MB/s and does not go up, while the load on the target database is 50%-75%.
Our assumption is that, since we could keep throwing higher database tiers at this problem without seeing any improvement in copy activity performance, this is Azure Data Factory related.
Our solution, since we are loading a lot of separate tables, is to scale out instead of up: run the copies from a ForEach loop with the batch count set to at least 4.
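A minimal sketch (again as a Python dict) of what that ForEach scale-out looks like; the pipeline parameter and the inner copy activity are placeholders.

```python
# Minimal sketch of a ForEach activity that copies several tables in
# parallel. The pipeline parameter and the inner copy activity are placeholders.
for_each_activity = {
    "name": "LoadTablesInParallel",
    "type": "ForEach",
    "typeProperties": {
        "items": {"value": "@pipeline().parameters.tableList", "type": "Expression"},
        "isSequential": False,  # run iterations in parallel (scale out)
        "batchCount": 4,        # at most 4 copy activities at a time
        "activities": [
            {"name": "CopyOneTable", "type": "Copy", "typeProperties": {}}  # placeholder
        ],
    },
}
```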
Note that the approach of increasing the DIUs is only applicable in some cases:
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance#data-integration-units
Setting of DIUs larger than four currently applies only when you copy multiple files from Azure Storage, Azure Data Lake Storage, Amazon S3, Google Cloud Storage, cloud FTP, or cloud SFTP to any other cloud data stores.
In our case we are copying data from relational databases.
We are setting up a data factory to help with our global failover scenario. The pipeline copies data from our on-premises SQL Server into Azure Table Storage.
We are using Data Factory V2 and have set up the CI/CD pipeline as described in the ADF documentation.
Therefore, our dev and test instances only copy data from SQL Server to one region, but production needs to copy data to multiple regions. My thought, to simplify things, would be to have one factory per region that only copies data to that region (so that production and dev can share the exact same pipelines).
However, this will mean that we will have multiple pipelines and all of them will have a rather low usage. There are only 3 activities that run once a day, so we will only have 90 activities per month. Looking at the data factory pricing, you are charged for every 1,000 activities.
My question is, since each of these factories will have less than 1,000 activities, will we be charged the minimum of $1.50 for each factory or will the pricing just charge us once since all of them together will still be less than 1,000 activities?
Great question! The pricing is calculated per Data Factory instance and not per pipeline. You can have as many pipelines as you like in a single Data Factory instance. You will be charged based on the number of activity runs within a Data Factory instance.
In your case, since you are planning on having multiple Data Factory instances, you will be billed multiple times. E.g., if you have 3 data factories (whether or not they are in different regions) and each ADF has 90 activity runs a month, you will be charged 3 × $1.50 = $4.50.
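Following the figures quoted in this answer (the $1.50 per 1,000 activity runs is the number used above, not necessarily current pricing), the arithmetic is simply per factory:

```python
# Activity runs are billed per Data Factory instance, so three low-usage
# factories each incur their own charge. The $1.50 rate is the figure
# quoted in this answer; check the pricing calculator for current numbers.
rate_per_1000_runs = 1.50
factories = 3
print(factories * rate_per_1000_runs)  # 4.5 -> $4.50 per month in this example
```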
For an accurate estimate of pricing, please refer to:
https://azure.microsoft.com/en-in/pricing/calculator/
Hope this helps!