Link: https://azure.microsoft.com/en-in/pricing/details/storage/data-lake/
Under transaction pricing there is "Write operations (every 4MB, per 10,000)".
What is the 10,000? And what does "every 4MB, per 10,000" mean?
Transactions are incurred any time you read data from or write data to the service.
Each transaction covers at most 4MB of data.
So, if you write 8MB of data to the service, that is counted as 2 transactions. Similarly, if one read operation retrieves 10MB of data, it is counted as 3 transactions (4+4+2).
If you write only 256KB of data, it is still counted as a single transaction. (Anything up to 4MB counts as 1 transaction.)
Coming back to your question:
As per the above logic, write operations are billed per 10,000 transactions (with 4MB as the maximum data size for each transaction): Rs. 4.296 for the hot tier and Rs. 8.592 for cold.
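To make the counting concrete, here is a minimal sketch (Python) of the billing math using the figures above; the per-10,000 price is just the hot-tier number quoted here, so treat it as a placeholder for whatever your region's pricing page shows:

```python
import math

FOUR_MB = 4 * 1024 * 1024

def write_transactions(bytes_written: int) -> int:
    # Each write is billed in 4MB units; anything up to 4MB is one transaction.
    return max(1, math.ceil(bytes_written / FOUR_MB))

def write_cost(bytes_written: int, price_per_10k: float = 4.296) -> float:
    # Price is per 10,000 transactions (hot-tier figure quoted above, in Rs.).
    return write_transactions(bytes_written) * price_per_10k / 10_000

print(write_transactions(8 * 1024 * 1024))    # 2 transactions for an 8MB write
print(write_transactions(10 * 1024 * 1024))   # 3 transactions for a 10MB read/write
print(write_transactions(256 * 1024))         # 1 transaction for a 256KB write
print(write_cost(10 * 1024 * 1024))           # 3 * 4.296 / 10,000 ≈ Rs. 0.0013
```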
I might be misunderstanding this explanation, but it seems wrong:
If it's saying that the charge is for 10,000 transactions with up to 4MB per transaction (i.e. 40GB across 10k transactions), then this is wrong.
The charges are per transaction, up to a max of 4MB per transaction; e.g. a single read/write of 10MB of data incurs 3 transaction charges.
The ADLS Storage team at Microsoft maintains an FAQ that explains this better, although it's still not entirely clear: https://azure.github.io/Storage/docs/analytics/azure-storage-data-lake-gen2-billing-faq/
The FAQ seems to suggest that reading/writing data from/to a file is based on the "per 4MB approach", whilst metadata operations (Copy File, Rename, Set Properties, etc.) are charged on a per 10k operations basis.
So effectively data read/writes are charged per transaction up to 4MB (single transactions >4MB are charged as multiple 4MB transactions), whilst metadata operations are charged on a per 10,000 operations basis.
How you're ever supposed to work out how much this is going to cost upfront is beyond me...
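For a first-order estimate, though, you can at least mechanically apply the two rules above. A rough sketch of that split; the prices here are made-up placeholders, only the counting rules come from the FAQ:

```python
import math

FOUR_MB = 4 * 1024 * 1024

def billable_units(op_type: str, size_bytes: int = 0) -> int:
    # Data reads/writes are counted in 4MB units; metadata ops count once each.
    if op_type in ("read", "write"):
        return max(1, math.ceil(size_bytes / FOUR_MB))
    return 1

# Placeholder per-10,000 prices -- look up the real ones on the pricing page.
PRICE_PER_10K = {"read": 0.004, "write": 0.05, "metadata": 0.05}

ops = [("write", 10 * 1024 * 1024), ("read", 256 * 1024), ("metadata", 0)]
estimate = sum(
    billable_units(op, size) * PRICE_PER_10K[op] / 10_000
    for op, size in ops
)
print(f"estimated charge: {estimate:.6f}")
```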
I have a use case where I need to send a very large number of messages to an Azure Service Bus queue. From https://github.com/Huachao/azure-content/blob/master/articles/service-bus/service-bus-azure-and-service-bus-queues-compared-contrasted.md I learned that an Azure Service Bus queue supports 2,000 put operations per second (for 1KB messages).
But my writes will be more than 2000 per second.
From Microsoft's docs
https://learn.microsoft.com/en-us/azure/service-bus-messaging/enable-partitions#:~:text=Service%20Bus%20partitions%20enable%20queues,message%20broker%20or%20messaging%20store.
I have seen that we can create a partitioned queue, which will create 16 partitions and increase the size of the queue by 16 times. But I am not able to find out whether this has any impact on the put operations. Will the put operations also be increased by 16 times, resulting in 32,000 writes per second?
You are looking at a very outdated document; the current limit for the Standard SKU is 1,000 credits per second (per namespace). Take a look at this doc for more info on how credits work.
Regarding your question, what partitioned entities do is divide your entity into multiple logical components in order to achieve higher resiliency. When you send a message to a partitioned entity, an internal load-balancing mechanism distributes messages across all partitions. This is not counted as additional operations, hence if you send 1,000 messages per second that is equivalent to 1,000 credits.
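If the concern is the raw number of send operations, one common mitigation (independent of partitioning) is to batch messages so that many go out in a single send call. A minimal sketch with the Python azure-servicebus SDK, assuming a placeholder connection string and queue name; whether and how a batched send maps to fewer credits is something to verify against the credits documentation linked above:

```python
from azure.servicebus import ServiceBusClient, ServiceBusMessage

CONN_STR = "<namespace-connection-string>"   # placeholder
QUEUE_NAME = "<partitioned-queue-name>"      # placeholder

messages = [ServiceBusMessage(f"event-{i}") for i in range(5000)]

with ServiceBusClient.from_connection_string(CONN_STR) as client:
    with client.get_queue_sender(QUEUE_NAME) as sender:
        batch = sender.create_message_batch()
        in_batch = 0
        for msg in messages:
            try:
                batch.add_message(msg)
                in_batch += 1
            except ValueError:
                # Current batch is full: one send_messages call sends it all at once.
                sender.send_messages(batch)
                batch = sender.create_message_batch()
                batch.add_message(msg)
                in_batch = 1
        if in_batch:
            sender.send_messages(batch)
```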
Microsoft's documentation on Cosmos DB says that stored procedures and UDF programming are good when you have a batch save or submit, but it doesn't say anything regarding batch size/record count.
Batching – Developers can group operations like inserts and submit them in bulk. The network traffic latency cost and the store overhead to create separate transactions are reduced significantly.
Are there any limits? What is the best practice?
For example, let's say I have a million records that I'd like to save and each record is 2-4KB. I think it is not a good idea to call the SP with 3GB of data. :)
Should I go for 1,000 rows in 1 call (~3MB), or is that still too big/small?
P.S.: Since it has been promised that a write action completes in less than 15 milliseconds, I would assume that 1,000 records should take less than 15 seconds and 5,000 records less than 75 seconds, both of which are still acceptable durations.
I would say you should experiment to come up with the correct batch size.
However, remember sprocs can run only for 5 seconds. See https://learn.microsoft.com/en-us/azure/cosmos-db/programming#bounded-execution for how to handle this from code.
Hope this helps.
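To give an idea of what "handle this from code" can look like, here is a minimal sketch (Python, azure-cosmos SDK). It assumes a hypothetical 'bulkImport' stored procedure that inserts as many of the passed documents as it can before the time limit and returns the count, so the client can resubmit only the remainder:

```python
from azure.cosmos import CosmosClient

# Placeholders -- endpoint, key, database/container names and the 'bulkImport'
# stored procedure are all assumptions for this sketch.
client = CosmosClient("<account-endpoint>", credential="<account-key>")
container = client.get_database_client("<db>").get_container_client("<container>")

def bulk_insert(docs, partition_key_value, batch_size=1000):
    """Send documents in chunks; resubmit whatever the sproc did not get to."""
    for start in range(0, len(docs), batch_size):
        remaining = docs[start:start + batch_size]
        while remaining:
            inserted = container.scripts.execute_stored_procedure(
                sproc="bulkImport",
                params=[remaining],                 # sproc receives the array of docs
                partition_key=partition_key_value,  # sprocs are scoped to one partition
            )
            remaining = remaining[int(inserted):]   # retry only what was not inserted
```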
There are a few things you need to consider when doing batching.
When you use a stored procedure to do a batch upsert, it can only work within a single partition.
If each of your records is 4KB, then a single write would consume around 4 × 6 RU = 24 RU.
A single physical partition can have a maximum of 10K RU/s, which means at best you could insert around 416 documents/sec.
This assumes there is no additional cost for indexing and there are no other writes happening to the same physical partition.
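As a back-of-the-envelope check of those numbers (the ~6 RU per KB figure is the rough estimate used above, not a measured value):

```python
ru_per_write = 4 * 6          # ~6 RU per KB x 4KB document = ~24 RU per write
partition_ru_budget = 10_000  # max RU/s for a single physical partition
print(partition_ru_budget // ru_per_write)   # ~416 documents/sec, ignoring indexing overhead
```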
Batching definitely saves on the network hops you make.
But you should consider the following when you are using batching:
Executing a stored procedure consumes some extra RUs, which come out of the RUs allocated to your partition.
If a stored procedure throws an unhandled error, the whole transaction is rolled back, which means the RUs are used up without adding any data.
So you need good exception handling, and if there is a failure after executing half of the batch, retry only the rest.
The code of the stored procedure does not necessarily run as efficiently as DocumentDB's internal code.
Also, there is a bounded-execution limit of 5 seconds before the transaction is killed.
We migrated our mobile app (still being developed) from Parse to Azure. Everything is running, but the price of DocumentDB is so high that we can't continue with Azure without fixing that. We're probably doing something wrong.
1) The price bottleneck seems to be the DocumentDB requests.
While running a process to load the data (about 0.5 million documents), memory and CPU were fine, but the DocumentDB request limit was a bottleneck, and the price charged was very high.
2) Even after the end of this data migration (a few days of processing), Azure continues to charge us every day.
We can't understand what is going on here. The usage graphs are flat, but the price is still climbing, as you can see in the images.
Any ideas?
Thanks!
From your screenshots, you have 15 collections under the Parse database. With Parse, aside from the system classes, each of your user-defined classes gets stored in its own collection. And given that each (non-partitioned) collection has a starting run-rate of ~$24/month (for an S1 collection), you can see where the baseline cost would be for 15 collections (around $360).
You're paying for reserved storage and RU capacity. Regardless of RU utilization, you pay whatever the cost is for that capacity (e.g. S2 runs around $50/month per collection, even if you don't execute a single query). Similar to spinning up a VM of a certain CPU capacity and then running nothing on it.
The default throughput setting for the Parse collections is 1,000 RU/s. This will cost $60 per collection (at the rate of $6 per 100 RU/s). Once you finish the Parse migration, the throughput can be lowered if you believe the workload has decreased. This will reduce the charge.
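Using just the figures in this answer, the arithmetic for the migration period looks roughly like this (actual prices depend on region and change over time):

```python
collections = 15
default_ru_per_collection = 1000
price_per_100_ru = 6.0    # $ per 100 RU/s, per the figures quoted above (monthly run-rate)

monthly = collections * (default_ru_per_collection / 100) * price_per_100_ru
print(monthly)            # 900.0 -> drops if throughput is lowered after the migration
```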
To learn how to do this, take a look at https://azure.microsoft.com/en-us/documentation/articles/documentdb-performance-levels/ (Changing the throughput of a Collection).
The key thing to note is that DocumentDB delivers predictable performance by reserving resources to satisfy your application's throughput needs. Because application load and access patterns change over time, DocumentDB allows you to easily increase or decrease the amount of reserved throughput available to your application.
Azure is a "pay-for-what-you-use" model, especially around resources like DocumentDB and SQL Database where you pay for the level of performance required along with required storage space. So if your requirements are that all queries/transactions have sub-second response times, you may pay more to get that performance guarantee (ignoring optimizations, etc.)
One thing I would seriously look into is the DocumentDB cost estimation tool; this allows you to estimate throughput costs based on transaction types and sample JSON documents you provide:
So in this example, I have an 8KB JSON document, I expect to store 500K of them (to get an approximate storage cost), and I specify that I need throughput to create 100 documents/sec, read 10/sec, and update 100/sec (I used the same document as an example of what the update will look like).
NOTE this needs to be done PER DOCUMENT TYPE -- if you're storing documents that do not necessarily conform to a given "schema" or structure in the same collection, then you'll need to repeat this process for EVERY type of document.
Based on this information, I can use those values as inputs to the pricing calculator. This tells me that I can estimate about $450/mo for DocumentDB services alone (if this was my anticipated usage pattern).
There are additional ways you can optimize the Request Units (RUs -- metric used to measure the cost of the given request/transaction -- and what you're getting billed for): optimizing index strategies, optimizing queries, etc. Review the documentation on Request Units for more details.
I'm looking into moving to the new partitioned collections for DocumentDB and have a few questions that the documentation and pricing calculator seem to be a little unclear on.
PRICING:
In the below scenario my partitioned collection will be charged $30.02/mo at 1GB of data with a constant hourly RU use of 500:
So does this mean that if my users only hit the data at an average of 500 RUs for about 12 hours per day (which means that HALF the time my collection goes UNUSED, but is still RUNNING and AVAILABLE, not shut down), the price goes down to $15.13/mo, as the calculator indicates here:
Or will I be billed the full $30.01/mo since my collection was up and running?
I get confused when I go to the portal and see an estimate for $606/mo with no details behind it when I attempt to spin up the lowest options on a partition collection:
Is the portal just indicating the MAXIMUM that I COULD be billed that month if I use all of my allotted 10,100 RUs every second for 744 consecutive hours?
If billing is based on hourly use, and the average RUs used goes down to 100 in some of the hours in the second scenario, does the cost go down even further? Does Azure billing for partitioned collections fluctuate based on hourly usage rather than on total uptime, as with the existing S1/S2/S3 tiers?
If so, how does the system determine what is billed for that hour? If for most of the hour the RUs used are 100/sec, but for a few seconds it spikes to 1,000, does it average out across that entire hour and only charge me for something like 200-300 RUs for that hour, or will I be billed for the highest RUs used that hour?
PERFORMANCE:
Will I see a performance hit by moving to this scenario since my data will be on separate partitions and require partition id/key to access? If so what can I expect, or will it be so minimal that it would be undetected by my users?
RETRIES & FAULT HANDLING:
I'm assuming the TransientFaultHandling NuGet package I use in my current scenario will still work in the new scenario, though it may not be needed as much since my RU capacity is much larger. Or do I need to rethink how I handle requests that go over the RU cap?
So the way that pricing works for Azure DocumentDB is that you pay to reserve a certain amount of data storage (in GB) and/or throughput (in Request Units, RU). These charges apply for every hour that the reservation is in place (usage is not required). Additionally, just having a DocumentDB account active is deemed to be an active S1 subscription until a database gets created, at which point the pricing of your database takes over. There are two options available:
Option 1 (Original Pricing)
You can choose between S1, S2, or S3. Each offers the same 10GB of storage but they vary in throughput: 250 RU / 1,000 RU / 2,500 RU.
Option 2 (User-defined performance)
This is the new pricing structure, which better decouples size and throughput. This option additionally provides for partitioning. Note that with user-defined performance you are charged per GB of data storage used (pay-as-you-go storage).
With user-defined performance levels, storage is metered based on
consumption, but with pre-defined performance levels, 10 GB of storage
is reserved at the time of collection creation.
Single Partition Collection
The minimum is set at 400RU and 1GB of data storage.
The maximum is set at 10,000RU and 10GB of data storage.
Partitioned Collections
The minimum is set at 10,000RU and 1GB of data storage.
The maximum is set at 250,000RU and 250GB of data storage (EDIT: you can request more).
So at a minimum you will be paying the cost per hour related to the option you selected. The only way to not pay for an hour would be to delete the db and the account, unfortunately.
Cost of Varying RU
In terms of varying your RU within a single hour, you are charged for that hour at the rate of the peak reserved RU you requested. So if you were at 400 RU and you raise it to 1,000 RU for 1 second, you will be charged at the 1,000 RU rate for that hour, even if for the other 59 minutes and 59 seconds you set it back to 400 RU.
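A small sketch of that rule, assuming a made-up hourly rate derived from a monthly price (the $ per RU figure and the 744-hour month are assumptions for illustration only):

```python
def hourly_charge(ru_levels_during_hour, price_per_100_ru_month=6.0, hours_per_month=744):
    # You are billed for the hour at the highest RU level provisioned at any point in it.
    peak_ru = max(ru_levels_during_hour)
    return (peak_ru / 100) * price_per_100_ru_month / hours_per_month

print(hourly_charge([400, 1000, 400]))   # charged as 1,000 RU for the whole hour
print(hourly_charge([400]))              # charged as 400 RU
```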
Will I see a performance hit by moving to this scenario since my data will be on separate partitions and require partition id/key to access?
On the topic of a performance hit there are a few things to think about, but in general, no.
If you have a sane partition key with enough values, you should not see a performance penalty. This means that you need to partition your data so that you have the partition key available when querying, and you need to keep the data you want from a query in the same partition by using the same partition key.
If you run queries without a partition key, you will see a severe penalty, as the query is parsed and executed per partition.
One thing to keep in mind when selecting a partition key is the limits for each partition, which are 10GB and 10K RU/s. This means that you want an even distribution over the partitions in order to avoid a "hot" partition, where even if you scale to more than enough RU in total, you may receive 429s for a specific partition.
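To illustrate the difference in practice, a minimal sketch using the current azure-cosmos Python SDK; the endpoint, key, names, and the 'userId' property are placeholders for this example:

```python
from azure.cosmos import CosmosClient

client = CosmosClient("<account-endpoint>", credential="<account-key>")   # placeholders
container = client.get_database_client("<db>").get_container_client("<container>")

# In-partition query: the partition key is supplied, so only one partition is hit.
in_partition = container.query_items(
    query="SELECT * FROM c WHERE c.userId = @uid",
    parameters=[{"name": "@uid", "value": "user-42"}],
    partition_key="user-42",
)

# Cross-partition query: no partition key, so it is fanned out to every partition
# (and charged against each of them).
fan_out = container.query_items(
    query="SELECT * FROM c WHERE c.email = @mail",
    parameters=[{"name": "@mail", "value": "someone@example.com"}],
    enable_cross_partition_query=True,
)
```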
Microsoft changed the architecture of Azure Storage to use, e.g., SSDs for journaling and a 10 Gbps network (instead of standard hard drives and a 1 Gbps network). See http://blogs.msdn.com/b/windowsazure/archive/2012/11/02/windows-azure-s-flat-network-storage-and-2012-scalability-targets.aspx
Here you can read that the storage is designed for "Up to 20,000 entities/messages/blobs per second".
My concern is that 20,000 entities (or rows in Table Storage) is actually not a lot.
We have a rather small solution with a table with 1,000,000,000 rows. With only 20,000 entities per second it will take more than half a day to read all rows.
I really hope that the 20,000 entities actually means that you can do up to 20,000 requests per second.
I'm pretty sure the 1st generation allowed up to 5,000 requests per second.
So my question is: are there any scenarios where the 1st-generation Azure Storage is actually more scalable than the second generation?
Are there any other reasons we should not upgrade (move our data to new storage)? E.g. we tried to get ~100 rows per partition, because that was what gave us the best performance characteristics. Are there different characteristics for the 2nd generation? Or have there been any changes that might introduce bugs if we change?
You have to read more carefully. The exact quote from the mentioned post is:
Transactions – Up to 20,000 entities/messages/blobs per second
That is 20k transactions per second, which is what you correctly hope for. I surely do not expect 20k 1MB files to be uploaded to blob storage per second, but I do expect to be able to execute 20k REST calls.
As for tables and table entities, you can combine them in batches. Given the volume you have, I expect that you are already using batches. A single Entity Group Transaction is counted as a single transaction, but it may contain more than one entity. Now, rather than assessing whether it is a low or high figure, you really need a good setup and enough bandwidth to utilize these 20k transactions per second.
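For reference, an Entity Group Transaction looks like this with today's azure-data-tables Python SDK (connection string and table name are placeholders); all entities in the batch must share the same PartitionKey, and a batch holds at most 100 entities:

```python
from azure.data.tables import TableClient

table = TableClient.from_connection_string(
    "<storage-connection-string>", table_name="mytable")   # placeholders

entities = [{"PartitionKey": "device-001", "RowKey": str(i), "value": i}
            for i in range(100)]

# One submit_transaction call is a single Entity Group Transaction,
# counted as one transaction against the scalability target.
table.submit_transaction([("upsert", e) for e in entities])
```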
Also, the first generation scalability target was around that 5k requests/sec you mention. I don't see a configuration/scenario where Gen 1 would be more scalable than Gen 2 storage.
Are there different characteristics for the 2nd generation?
The differences are outlined in the blog post you refer to.
As for your last concern:
Or have there been any changes that might introduce bugs if we change?
Rest assured, there are no such changes. Azure Storage service behavior is defined in the REST API reference. The API is not any different based on the storage service generation; it is versioned based on features.