How was the pipeline execution time calculated in the official guide? - azure

I'm trying to understand how the price estimation works for Azure Data Factory from the official guide, section "Estimating Price" of "Use Azure Data Factory to migrate data from Amazon S3 to Azure Storage".
I managed to understand everything except the 292 hours that are required to complete the migration.
Could you please explain to me how they got that number?

First, feel free to submit feedback to the MS docs team so they can clarify with an official response.
Meanwhile, since they mention "In total, it takes 292 hours to complete the migration", that figure would include listing the source, reading from the source, writing to the sink, and other activities, besides the data movement itself.
As a rough approximation, a data volume of 2 PB at an aggregate throughput of 2 GBps gives:
2 PB = 2,097,152 GB (binary)
2,097,152 GB / 2 GBps = 1,048,576 seconds
1,048,576 seconds / 3600 ≈ 291.27 hours
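For anyone who wants to reproduce the arithmetic, here is a minimal Python sketch of the same estimate (it simply restates the figures above; the quoted 292 hours presumably rounds up to cover listing and other overhead):
# Back-of-the-envelope estimate: 2 PB moved at an aggregate throughput of 2 GBps.
data_volume_gb = 2 * 1024 * 1024   # 2 PB expressed in GB, using binary units as above
throughput_gbps = 2                # aggregate throughput in GB per second

seconds = data_volume_gb / throughput_gbps   # 1,048,576 seconds
hours = seconds / 3600
print(f"{hours:.2f} hours")                  # ~291.27 hours, close to the quoted 292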
Again, these figures are hypothetical. For further reading, you can refer to Plan to manage costs for Azure Data Factory and Understanding Data Factory pricing through examples.

Related

Azure cost analysis for a particular subscription using Python SDK

So I'm trying to automate fetching the current cost and cost forecast (like it is shown under Cost analysis for a particular subscription) for a particular subscription using the Python SDK, but I haven't been able to find a single API that does this yet.
I've tried using UsageAggregate and Rate card but I haven't really figured out a way to find the cost for the current month to date. If there is an API that I'm missing or if I need to calculate monthly costs myself, I'd appreciate any code snippets or help.
If you already have the usage and the ratecard data, then you must combine them.
Take the meterId of the usage data and get the related ratecard data.
The ratecard data contains the MeterRates and the IncludedQuantity which you must take.
There are probably multiple meter rates and an included quantity, because the cost usually varies with usage (e.g. the first 10 calls are free, the first 3 GB are free, ...).
The consumption resets on the 14th of the month. That's why you have to read the data for the whole billing period (which begins on the 14th of each month); it's the only way to get the correct consumption.
So, if you are using e.g. Azure Functions with a usage of 100,000 units per day and you want the costs for the 20th to the 30th, the calculation works as follows:
Read the data from the 14th to the 30th. That is 17 days, so 1,700,000 units were consumed. The first 400,000 are free, i.e. the IncludedQuantity (so in this sample the first 4 days).
From the 400,001st unit on, you have to apply the meter rate (€0.0000134928) and calculate the costs: 1,300,000 * 0.0000134928 ≈ €17.54.
Fortunately, Azure Functions has only one rate. If the rate changes, e.g. after 5,000,000 units, then you also have to take that into account. Once you have the costs for the whole period, you can filter down to your date range (the 20th to the 30th) and you will get the result.
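A minimal Python sketch of that calculation, assuming a single included quantity and a tiered dictionary of meter rates (the values mirror the example above and are illustrative, not real ratecard data):
# Illustrative tiered-rate calculation: nothing is charged up to the included quantity,
# then the meter rate(s) are applied to the remainder. Values mirror the example above.
included_quantity = 400_000            # free units per billing period (IncludedQuantity)
meter_rates = {0: 0.0000134928}        # billable-units threshold -> EUR per unit (MeterRates)

def billable_cost(total_units: float) -> float:
    """Cost in EUR for the units consumed in one billing period."""
    billable = max(total_units - included_quantity, 0)
    cost = 0.0
    # Walk the rate tiers from the highest threshold down and price each slice.
    for threshold in sorted(meter_rates, reverse=True):
        if billable > threshold:
            cost += (billable - threshold) * meter_rates[threshold]
            billable = threshold
    return cost

units_per_day = 100_000
print(round(billable_cost(17 * units_per_day), 2))   # 17 days of usage -> ~17.54 EUR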
I implemented this calculation in C# and published it as a NuGet package here. It also contains a sample console application which you could use to export the data.
I know I am a bit late to the party, but after struggling with the same problem, I managed to create the code for getting the cost of a resource group using
azure.mgmt.costmanagement
Link to cost management API
Code sample is in my answer here
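For reference, a rough sketch of what that can look like with azure.mgmt.costmanagement (the client constructor, model names, and parameters below are based on my reading of recent versions of the package and may differ between versions; the scope string is a placeholder):
# Rough sketch: month-to-date actual cost for a resource group via the Cost Management SDK.
# Model/parameter names are assumptions for azure-mgmt-costmanagement and may vary by version.
from azure.identity import DefaultAzureCredential
from azure.mgmt.costmanagement import CostManagementClient
from azure.mgmt.costmanagement.models import QueryAggregation, QueryDataset, QueryDefinition

credential = DefaultAzureCredential()
client = CostManagementClient(credential)   # older package versions may need extra arguments

# Placeholder scope: subscription and resource group names are hypothetical.
scope = "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"

query = QueryDefinition(
    type="ActualCost",
    timeframe="MonthToDate",
    dataset=QueryDataset(
        granularity="Daily",
        aggregation={"totalCost": QueryAggregation(name="Cost", function="Sum")},
    ),
)

result = client.query.usage(scope, query)
for row in result.rows:
    print(row)   # typically [cost, date, currency]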

How do I identify the source of Cosmos DB RU consumption?

Context
I have an Azure Function which executes daily, it is change tracking a pricing feed with the results stored in a Cosmos DB. Each time it runs it compares the latest price from the feed, against the most recent values in the DB collection and writes a new item if there is a difference. It is scheduled to run at 23:55 each day. The overall setup is tiny, with 10 items in the DB, and changes seen usually once a week. My consumption is 3.84 RUs for the daily execution when the price hasn't changed.
Question
In addition to the expected activity at 23:55 each day, there is additional activity appearing almost every 3 hrs 45 mins. This unexpected activity consumes 4 RUs each time (around 24 RUs per day). How can I identify the source of the additional activity?
Other info
I noticed that the backups were scheduled to 4 hours, so I changed that to daily. That didn't help.
Since noticing the additional usage I have added diagnostic settings to save all logs into a Log Analytics workspace. I can see that the RUs are Read operations, Type = AzureDiagnostics, requestResourceType = Collection, Source IP = 51.11.40.180, and 1 of the 4 is a read against the __Cosmos/colls/__Query collection. This suggests to me that it's the diagnostics causing the cost. I disabled the diagnostic settings to see if that reduced my RU consumption, but it did not.
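One way to narrow this down from the Log Analytics workspace is to group the request charges by operation; below is a rough Python sketch using the azure-monitor-query package (the workspace ID is a placeholder, and the AzureDiagnostics column names such as requestCharge_s are assumptions that depend on the diagnostics schema in use):
# Sketch: sum Cosmos DB request charges per operation and hour from a Log Analytics workspace.
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())
kql = """
AzureDiagnostics
| where Category == "DataPlaneRequests"
| summarize TotalRUs = sum(todouble(requestCharge_s)) by OperationName, bin(TimeGenerated, 1h)
| order by TimeGenerated asc
"""
response = client.query_workspace(
    workspace_id="<workspace-id>",   # placeholder
    query=kql,
    timespan=timedelta(days=1),
)
for table in response.tables:
    for row in table.rows:
        print(list(row))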
Is it just the case that these diagnostic reads are run by Microsoft and are simply part of having a Cosmos DB?
And that they are usually a small proportion of the overall cost, and therefore not an issue?

Azure Data Factory Pricing - Activity Count

I'm thinking of using Data Factory in order to copy data from a Blob Storage container to an SQL table but I'm not quite sure I understand how the pricing works, specifically how the activities are counted.
So if I have a pipeline with 3 activities that copies the data from a CSV with 1000 lines, will the total activity count be 3*1 or 3*1000? In other words, will I be charged based on the number of files it processes or the total number of lines it copies?
That's 3 activity runs. Activity runs are billed per thousand, at $1 per 1,000 runs. Since these are Copy activities, they consume Data Integration Units (DIU) at $0.25 per DIU-hour. Pipeline activity execution time is billed at $0.005 per hour. If you add all this up for 1 pipeline with 3 Copy activities that runs for 1 hour, your total bill is like 27 cents.
We run THOUSANDS of pipelines a month, all with many activities including quite a few Copy activities. Our Data Factory billing is still so low that it looks like a rounding error in our total Azure spend.
The exception to this is Data Flow. Data Flow is a Spark wrapper, so you have to pay for Cluster time, which can get expensive quickly if you aren't careful.
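A rough sketch of that arithmetic in Python (the single DIU-hour for the copy work is an illustrative assumption; actual prices vary by region and integration runtime):
# Back-of-the-envelope ADF bill for 1 pipeline with 3 Copy activities running about 1 hour.
activity_runs = 3
orchestration = activity_runs / 1000 * 1.00          # $1 per 1,000 activity runs
copy_execution = 1 * 0.25                            # ~1 DIU-hour at $0.25 per DIU-hour (assumed)
activity_execution_time = activity_runs * 1 * 0.005  # $0.005 per activity execution hour

total = orchestration + copy_execution + activity_execution_time
print(f"${total:.3f}")   # roughly $0.27, i.e. "like 27 cents"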
Actually, you have to pay for two important metrics: orchestration and execution. Please refer to this document for more details.
1. Orchestration: $1 per 1,000 runs. You have 3 activity runs, so that is $3/1,000 = $0.003.
2. Execution: this depends on the DIUs you configure, which determine the performance of your data movement.

Understanding Azure SQL Performance

The facts:
1 Azure SQL S0 instance
a few tables, one of them containing ~8.6 million rows and 1 PK
Running a Count-query on this table takes nearly 30 minutes (!) to complete.
Upscaling the instance from S0 to S1 reduces the query time to 13 minutes:
Looking into Azure Portal (new version) the resource-usage-monitor shows the following:
Questions:
Does anyone else consider even 13 minutes ridiculous for a simple COUNT()?
Does the second screenshot mean that during the 100% period my instance isn't responding to other requests?
Why are my metrics limited to 100% in both S0 and S1? (See under "Which Service Tier is Right for My Database?", which states: "These values can be above 100% (a big improvement over the values in the preview that were limited to a maximum of 100).") I'd expect the S0 to be at 150% or so if the quoted statement is true.
I'm interested in other people's experiences with databases of more than 1,000 records or so. I don't see how an S*-scaled Azure SQL database at €22 - €55 per month could help me with an upscaling strategy at the moment.
Azure SQL Database editions provide increasing levels of DTUs from Basic -> Standard -> Premium levels (CPU,IO,Memory and other resources - see https://msdn.microsoft.com/en-us/library/azure/dn741336.aspx). Once your query reaches its limits of DTU (100%) in any of these resource dimensions, it will continue to receive these resources at that level (but not more) and that may increase the latency in completing the request. It looks like in your scenario above, the query is hitting its DTU limit (10 DTUs for S0 and 20 for S1). You can see the individual resource usage percentages (CPU, Data IO or Log IO) by adding these metrics to the same graph, or by querying the DMV sys.dm_db_resource_stats.
Here is a blog that provides more information on appropriately sizing your database performance levels. http://azure.microsoft.com/blog/2014/09/11/azure-sql-database-introduces-new-near-real-time-performance-metrics/
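As a rough illustration of reading that DMV from code (this assumes the pyodbc package and an installed ODBC driver; server, database, and credentials are placeholders):
# Minimal sketch: read recent resource usage from sys.dm_db_resource_stats,
# which keeps roughly the last hour of data at 15-second intervals.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:<your-server>.database.windows.net,1433;"
    "Database=<your-database>;Uid=<user>;Pwd=<password>;Encrypt=yes;"
)
cursor = conn.cursor()
cursor.execute("""
    SELECT TOP 20 end_time, avg_cpu_percent, avg_data_io_percent, avg_log_write_percent
    FROM sys.dm_db_resource_stats
    ORDER BY end_time DESC
""")
for row in cursor.fetchall():
    print(row.end_time, row.avg_cpu_percent, row.avg_data_io_percent, row.avg_log_write_percent)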
To your specific questions
1) As you have 8.6 million rows, the database needs to scan the index entries to get the count back. So it may be hitting the IO limit for the edition here.
2) If you have multiple concurrent queries running against your DB, they will be scheduled appropriately to not starve one request or the other. But latencies may increase further for all queries since you will be hitting the available resource limits.
3) For older Web/Business editions, you may see the metric values going beyond 100% (they are normalized to the limits of an S2 level), as those editions don't have any specific limits and run in a resource-shared environment with other customer loads. For the new editions, metrics will never exceed 100%, because the system guarantees you resources up to 100% of that edition's limits, but no more. This provides a predictable, guaranteed amount of resources for your DB, unlike Web/Business editions, where you may get very little or a lot more at different times depending on other competing customer DB workloads running on the same machine.
Hope this helps.
-- Srini

Comparing the new SQL Azure tiers to old ones [closed]

Now that Microsoft made the new SQL Azure service tiers available (Basic, Standard, Premium) we are trying to figure out how they map to the existing ones (Web and Business).
Essentially, there are six performance levels in the new tier breakdown: Basic, S1, S2, P1, P2 and P3 (details here: http://msdn.microsoft.com/library/dn741336.aspx)
Does anyone know how the old database tiers map to those six levels? For instance, is Business equivalent of an S1? S2?
We need to be able to answer this question in order to figure out what service tiers/levels to migrate our existing databases to.
We just finished a performance comparison.
I can't publish our SQL queries, but we used 3 different test cases that match our normal activity. In each test case, we performed several queries with table joins and aggregate calculations (SUM, AVG, etc) for a few thousand rows. Our test database is modest - about 5GB in size with a few million rows.
A few notes: for each test case, we tested my local machine, which is a 5-year-old iMac running Windows/SQL Server in a virtual machine ("Local"), SQL Azure Business ("Business"), SQL Azure Premium P1, SQL Azure Standard S2, and SQL Azure Standard S1. The Basic tier seemed so slow that we didn't test it. All of these tests were done with no other activity on the system. The queries did not return data, so network performance was hopefully not a factor.
Here were our results:
Test One
Local: 1 second
Business: 2 seconds
P1: 2 seconds
S2: 4 seconds
S1: 14 seconds
Test Two
Local: 2 seconds
Business: 5 seconds
P1: 5 seconds
S2: 10 seconds
S1: 30 seconds
Test Three
Local: 5 seconds
Business: 12 seconds
P1: 13 seconds
S2: 25 seconds
S1: 77 seconds
Conclusions:
After working with the different tiers for a few days, our team concluded a few things:
P1 appears to perform at the same level as SQL Azure Business. (P1 is 10x the price)
Basic and S1 are way too slow for anything but a starter database.
The Business tier is a shared service so performance depends on what other users are on your server. Our database shows a max of 4.01% CPU, 0.77% Data IO, 0.14% Log IO and we're experiencing major performance problems and timeouts. Microsoft Support confirmed that we are "just on a really busy server."
The Business tier delivers inconsistent service across servers and regions. In our case, we moved to a different server in a different region and our service is back to normal. (we view that as a temporary solution)
S1, S2, P1 tiers seem to provide the same performance across regions. We tested West and North Central.
Considering the results above, we're generally worried about the future of SQL Azure. The Business tier has been great for us for a few years, but it's scheduled to go out of service in 12 months. The new tiers seem overpriced compared to the Business tier.
I'm sure there are 100 ways this could be more scientific, but I'm hoping those stats help others getting ready to evaluate.
UPDATE:
Microsoft Support sent us a very helpful query to assess your database usage.
SELECT
avg(avg_cpu_percent) AS 'Average CPU Percentage Used',
max(avg_cpu_percent) AS 'Maximum CPU Percentage Used',
avg(avg_physical_data_read_percent) AS 'Average Physical IOPS Percentage',
max(avg_physical_data_read_percent) AS 'Maximum Physical IOPS Percentage',
avg(avg_log_write_percent) AS 'Average Log Write Percentage',
max(avg_log_write_percent) AS 'Maximum Log Write Percentage',
--avg(avg_memory_percent) AS 'Average Memory Used Percentage',
--max(avg_memory_percent) AS 'Maximum Memory Used Percentage',
avg(active_worker_count) AS 'Average # of Workers',
max(active_worker_count) AS 'Maximum # of Workers'
FROM sys.resource_stats
WHERE database_name = 'YOUR_DATABASE_NAME' AND
start_time > DATEADD(day, -7, GETDATE())
The most useful part is that the percentages represent a percentage of an S2 instance. According to Microsoft Support, if you're at 100%, you're using 100% of an S2; 200% would be equivalent to a P1 instance.
We're having very good luck with P1 instances now, although the price difference has been a shocker.
I am the author of the Azure SQL Database Performance Testing blog posts mentioned above.
Making IOPS to DTU comparisons is quite difficult for Azure SQL Database, which is why I focussed on row counts and throughput rates (in MB per second) in my tests.
I would be cautious about using the Transaction Rates quoted by Microsoft - their benchmark databases are rather small, e.g. for the Standard tier, which has a capacity of 250 GB, their benchmark databases for S1 and S2 are only 2 GB and 7 GB respectively. At these sizes I suggest SQL Server is caching much/most of the database, and as such their benchmarks avoid the worst of the read throttling that is likely to impact real-world databases.
I have added a new post regarding the new Service Tiers hitting General Availability and making some estimates of the changes in performance around S0 and S1 at GA.
http://cbailiss.wordpress.com/2014/09/16/performance-in-new-azure-sql-database-performance-tiers/
There is not really any kind of mapping between the old and new offerings, as near as I can tell. With the old offerings, the only thing that was really different between the "Web" and "Business" editions was the size the database was limited to.
However, with the new offerings each tier has performance metrics associated with it. So in order to decide what offering you need to move your existing databases to, you need to figure out what kind of performance your application needs.
In terms of size, Web and Business appear to fall between Basic and S1. Here's a link to a chart comparing the new and old tiers. It seems a little apples-to-oranges honestly, so there isn't a direct mapping. Here's also a link specifically addressed to people currently on the Web and Business tiers.
Comparison of Tiers
Web and Business Edition Sunset FAQ
