Recently I have been using DynamoDB to build my service, and I use provisioned mode for my DynamoDB table.
To test how DynamoDB reacts, I set both the Read Capacity Units and Write Capacity Units to only 1, and I inserted 20 items totaling about 27 KB into the table. I then called the Scan operation with the ReturnConsumedCapacity parameter. Testing it from Postman, the result shows that the scan consumed 2.5 capacity units!
Why does DynamoDB not reject my request? I assigned only 1 to both RCU and WCU! Doesn't that mean it should only be able to read at most 4 KB of data per second?
Here is a screenshot of the Postman result.
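For reference, the request I am sending is roughly equivalent to this boto3 call (the table name and region are just placeholders):

import boto3

client = boto3.client("dynamodb", region_name="us-east-1")

# Scan the whole table and ask DynamoDB to report the capacity it consumed
response = client.scan(
    TableName="TestTable",              # placeholder table name
    ReturnConsumedCapacity="TOTAL",
)

print(response["Count"])                              # 20 items
print(response["ConsumedCapacity"]["CapacityUnits"])  # 2.5 in my test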
Reference -
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadWriteCapacityMode.html#HowItWorks.ProvisionedThroughput.Manual
One read request unit represents one strongly consistent read request, or two eventually consistent read requests, for an item up to 4 KB in size. Transactional read requests require 2 read request units to perform one read for items up to 4 KB. If you need to read an item that is larger than 4 KB, DynamoDB needs additional read request units. The total number of read request units required depends on the item size, and whether you want an eventually consistent or strongly consistent read. For example, if your item size is 8 KB, you require 2 read request units to sustain one strongly consistent read, 1 read request unit if you choose eventually consistent reads, or 4 read request units for a transactional read request.
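To make the arithmetic in that passage concrete, here is a tiny sketch of the rounding rule it describes (not an official calculator, just the 4 KB rounding with the half and double multipliers):

import math

def read_request_units(item_size_kb, mode="strong"):
    # Items are charged in 4 KB increments; eventually consistent reads
    # cost half a unit per increment, transactional reads cost two.
    blocks = math.ceil(item_size_kb / 4)
    if mode == "eventual":
        return blocks * 0.5
    if mode == "transactional":
        return blocks * 2
    return blocks

# The 8 KB example from the documentation:
print(read_request_units(8, "strong"))         # 2
print(read_request_units(8, "eventual"))       # 1.0
print(read_request_units(8, "transactional"))  # 4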
Related
Link: https://azure.microsoft.com/en-in/pricing/details/storage/data-lake/
Under transaction pricing there is a line item "Write operations (every 4MB, per 10,000)".
What is the 10,000? And what does "every 4MB, per 10,000" mean?
Transactions are incurred any time you perform read or write operations against the service.
Each transaction can cover a maximum of 4 MB of data.
So let us assume that you are writing 8 MB of data to the service: that will be counted as 2 transactions. Similarly, if one read operation fetches 10 MB of data, it will be counted as 3 transactions (4 + 4 + 2).
Now let us assume that you are writing only 256 KB of data: it will still be counted as a single transaction (anything up to 4 MB counts as 1 transaction).
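You can express that counting rule as a small sketch (purely illustrative, not an official pricing calculator):

import math

MB = 1024 * 1024

def billable_transactions(io_size_bytes):
    # Reads/writes are billed in 4 MB chunks; anything up to 4 MB is one transaction.
    return max(1, math.ceil(io_size_bytes / (4 * MB)))

print(billable_transactions(256 * 1024))  # 256 KB -> 1 transaction
print(billable_transactions(8 * MB))      # 8 MB   -> 2 transactions
print(billable_transactions(10 * MB))     # 10 MB  -> 3 transactions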
Coming back to your question:
As per the above logic, write operations for 10,000 transactions, with 4 MB as the maximum data size for each transaction, would cost Rs. 4.296 for the hot tier and Rs. 8.592 for the cold tier.
I might be misunderstanding this explanation but it seems wrong:
If it's saying that the charge is for 10,000 transactions with up to 4 MB per transaction (i.e. 40 GB across 10k transactions), then this is wrong.
The charges are per transaction, up to a max of 4 MB per transaction; e.g. a single read/write of 10 MB of data incurs 3 transaction charges.
The ADLS Storage team at Microsoft maintains an FAQ that explains this better, although it's still not entirely clear: https://azure.github.io/Storage/docs/analytics/azure-storage-data-lake-gen2-billing-faq/
The FAQ seems to suggest that reading/writing data from/to a file is based on the "per 4MB approach" whilst metadata operations (Copy File, Rename, Set Properties, etc) are charged on a per 10k operations basis.
So effectively data read/writes are charged per transaction up to 4MB (single transactions >4MB are charged as multiple 4MB transactions), whilst metadata operations are charged on a per 10,000 operations basis.
How you're ever supposed to work out how much this is going to cost upfront is beyond me...
I am trying to understand Cassandra's concurrent reads and writes. I came across the property
concurrent_reads (default: 8)
A good rule of thumb is 4 concurrent_reads per processor core. You may increase the value for systems with fast I/O storage.
As per that definition (correct me if I am wrong), 4 threads can access the database concurrently. So let's say I am trying to run the following query:
SELECT max(column1) FROM testtable WHERE duration = 'month';
What is the role of concurrent reads in executing this query?
That's how many active reads can run at a single time per host. This is viewable if you type nodetool tpstats, under the read stage. If the active count is pegged at the number of concurrent readers and you have a pending queue, it may be worth trying to increase this. It's pretty normal for people to have this at ~128 when using decent sized heaps and SSDs. This is very hardware dependent, so the defaults are conservative.
Keep in mind that the activity on this thread is very fast, usually measured in sub-ms, but even assuming each read takes 1 ms, with only 4 concurrent reads Little's law gives you a maximum of 4,000 (local) reads per second per node (1000/1 * 4). With RF=3 and quorum consistency you're doing a minimum of 2 replica reads per request, so you can divide by 2 to get a theoretical (real life is ickier) max throughput.
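Spelling out that back-of-the-envelope calculation (the latency and replication numbers are just the illustrative assumptions above):

concurrent_reads = 4     # threads in the read stage per node (assumed)
avg_read_ms = 1.0        # assumed local read latency
reads_per_request = 2    # quorum with RF=3 touches at least 2 replicas

local_reads_per_sec = concurrent_reads * (1000 / avg_read_ms)           # 4000 per node
theoretical_requests_per_sec = local_reads_per_sec / reads_per_request  # ~2000

print(local_reads_per_sec, theoretical_requests_per_sec)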
The aggregation functions (i.e. max) are processed on the coordinator after fetching the data from the replicas (each doing a local read and sending a response), so they are not directly impacted by concurrent_reads, since they are handled in the native transport and request/response stages.
From Cassandra 2.2 onward, the standard aggregate functions min, max, avg, sum and count are built in. So I don't think concurrent_reads will have any effect on your query.
Microsoft's documentation on Cosmos DB says that stored procedure and UDF programming is a good fit when you have a batch save or submit, but it doesn't say anything about batch size or record count.
Batching – Developers can group operations like inserts and submit them in bulk. The network traffic latency cost and the store overhead to create separate transactions are reduced significantly.
Are there any limits? What is the best practice?
For example, let's say I have a million records that I'd like to save, and each record is 2-4 KB. I don't think it is a good idea to call the SP with 3 GB of data. :)
Should I go for 1,000 rows per call (~3 MB), or is that still too big or too small?
P.S.: Since a write is promised to complete in less than 15 milliseconds, I would assume that 1,000 records should take less than 15 seconds and 5,000 records less than 75 seconds, both of which are still acceptable durations.
I would say you should experiment to come up with the correct batch size.
However, remember that sprocs can only run for 5 seconds. See https://learn.microsoft.com/en-us/azure/cosmos-db/programming#bounded-execution for how to handle this from your code.
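As a rough sketch of what that client-side handling can look like (this assumes a stored procedure named bulkImport that follows the common sample pattern of returning how many documents it accepted before hitting the time limit; the account details are placeholders):

from azure.cosmos import CosmosClient

client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="mykey")
container = client.get_database_client("mydb").get_container_client("mycoll")

def bulk_insert(docs, partition_key_value):
    remaining = docs
    while remaining:
        # The sproc inserts as many documents as it can within the time limit
        # and returns the count it actually accepted.
        accepted = container.scripts.execute_stored_procedure(
            sproc="bulkImport",
            partition_key=partition_key_value,
            params=[remaining],
        )
        remaining = remaining[int(accepted):]  # resume where the sproc stopped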
Hope this helps.
There are a few things you need to consider when doing batching.
When you use a stored procedure to do a batch upsert, it can only work on a single partition.
If each of your records is 4 KB, then a single write operation would consume around 4 × 6 RUs = 24 RUs.
A single physical partition can only have a maximum of 10K RU/s, which means at best you could insert about 416 documents/sec.
This assumes there is no additional cost for indexing and there are no other writes happening to the same physical partition.
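The rough arithmetic behind those numbers (illustrative assumptions only):

record_size_kb = 4
ru_per_kb_write = 6                              # assumed per-KB write cost
ru_per_write = record_size_kb * ru_per_kb_write  # 24 RUs per document
partition_ru_limit = 10000                       # max RU/s for one physical partition

max_writes_per_sec = partition_ru_limit // ru_per_write  # ~416 documents/sec
print(ru_per_write, max_writes_per_sec)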
Batching definitely saves on the network hops you make.
But you should consider the below when you are using batching:
Executing a stored procedure consumes some extra RUs, which come out of the RUs allocated to your partition.
If a stored procedure throws an unhandled error, the whole transaction will be rolled back, which means the RUs are used up without adding any data.
So you need to do good exception handling, and if there are failures after executing half of the batch, retry only the rest.
The code of a stored procedure does not necessarily run as efficiently as DocumentDB's internal code.
Also, there is a bounded execution limit of 5 seconds before the transaction is killed.
Each index batch is limited to between 1 and 1,000 documents. When I call it from my local machine or an Azure VM, I get 800 ms to 3,000 ms per 1,000-document batch. If I submit multiple batches asynchronously, the time spent is roughly the same. That means it would take 15-20 hours for my ~50M document collection.
Is there a way I can make it faster?
It looks like you are using our Standard S1 search service. Although there are a lot of things that can impact how fast data can be ingested, I would expect to see ingestion into a single-partition search service at a rate of about 700 docs/second for an average index, so I think your numbers are not far off from what I would expect. Please note that these are purely rough estimates and you may see different results based on any number of factors (such as the number of fields, the quantity of facets, etc.).
It is possible that some of the extra time you are seeing is due to the latency of uploading the content from your local machine to Azure, and it would likely be faster if you did this directly from Azure, but if this is just a one-time upload it is probably not worth the effort.
You can slightly increase the speed of data ingestion by increasing the number of partitions you have, and the S2 search service will also ingest data faster, although both of these come at a cost.
By the way, if you have 50M documents, please make sure that you allocate enough partitions: a single S1 partition can handle 15M documents or 25 GB, so you will definitely need extra partitions for this service.
Also, as another side note, when you are uploading your content (and especially if you choose to do parallelized uploads), keep an eye on the HTTP responses, because if the search service exceeds the available resources you could get HTTP 207 (indicating one or more items failed to apply) or 503s indicating the whole batch failed due to throttling. If throttling occurs, you should back off a bit to let the service catch up.
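Here is a rough sketch of what that retry/back-off handling could look like with the azure-search-documents SDK (the endpoint, key, index name and the "id" key field are placeholders, and exact partial-failure behavior may differ between SDK versions):

import time
from azure.core.credentials import AzureKeyCredential
from azure.core.exceptions import HttpResponseError
from azure.search.documents import SearchClient

client = SearchClient(
    endpoint="https://myservice.search.windows.net",   # placeholder
    index_name="my-index",                             # placeholder
    credential=AzureKeyCredential("my-admin-key"),     # placeholder
)

def upload_batch(docs, max_retries=5):
    delay = 1.0
    for _ in range(max_retries):
        try:
            results = client.upload_documents(documents=docs)
            # Partial failure (the HTTP 207 case): keep only the documents whose key failed.
            failed_keys = {r.key for r in results if not r.succeeded}
            docs = [d for d in docs if d["id"] in failed_keys]  # "id" assumed to be the key field
            if not docs:
                return
        except HttpResponseError:
            pass  # e.g. 503: the whole batch was throttled
        time.sleep(delay)  # back off and let the service catch up
        delay *= 2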
I think you're reaching the request capacity:
https://azure.microsoft.com/en-us/documentation/articles/search-limits-quotas-capacity/
I would try another tier (S1, S2). If you still face the same problem, try getting in touch with the support team.
Another option:
Instead of pushing data, try adding your data to Blob Storage, DocumentDB, or SQL Database, and then use the pull approach:
https://azure.microsoft.com/en-us/documentation/articles/search-howto-indexing-azure-blob-storage/
How many transactions are fired for retrieving 1,200 entities from Azure Storage Tables, keeping continuation tokens in mind?
I have read that "Windows Azure Tables returns up to a maximum of 1000 entities in a single request and returns a continuation token when more results (the remaining 200 entities) are available." See http://blog.smarx.com/posts/windows-azure-tables-expect-continuation-tokens-seriously.
Because Azure charges on the basis of the number of transactions we perform against the service,
I just want to know: how many transactions will be executed for a single request that returns, say, 1,200 entities (rows), with a continuation token after the 1,000th entity (row)?
How many transactions will be executed for a single request that returns say 1,200 entities (rows) with a continuation token after the 1,000th entity (row) result?
It actually depends. As the documentation states, Windows Azure Tables returns up to a maximum of 1,000 entities in a single request. What that means is that in your case the minimum number of transactions would be 2; however, the maximum number of transactions could be 1,200. It all depends on how your data is partitioned and the load on your storage account. The more partitions you have, the more likely you are to receive less data per request and thus need more transactions. Request execution time (server side) also needs to be taken into account, because if execution takes more than the allocated time, the service will return partial data.
Based on the documentation here: http://msdn.microsoft.com/en-us/library/windowsazure/dd179421.aspx, you can expect a continuation token if one or more of the following conditions are true:
If the number of entities to be returned exceeds 1000.
If the server timeout interval is exceeded.
If the query crosses the partition boundary.
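If you want to observe the count for yourself, here is a small sketch using the current azure-data-tables SDK (the connection string, table name and filter are placeholders): the SDK follows continuation tokens for you, and each page it fetches corresponds to one billable transaction.

from azure.data.tables import TableClient

table = TableClient.from_connection_string("my-connection-string", table_name="mytable")

transactions = 0
entities = 0
for page in table.query_entities("PartitionKey eq 'p1'").by_page():
    transactions += 1                  # one request (and continuation token) per page
    entities += sum(1 for _ in page)

print(entities, "entities fetched in", transactions, "transactions")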