Best batch size choice - Azure

Microsoft's documentation on Cosmos DB says that stored procedure and UDF programming are a good fit when you have a batch save or submit, but it says nothing about batch size or record count.
Batching – Developers can group operations like inserts and submit them in bulk. The network traffic latency cost and the store overhead to create separate transactions are reduced significantly.
Are there any limits? What is the best practice?
For example, let's say I have a million records that I'd like to save, and each record is 2-4 KB. I don't think it is a good idea to call the SP with 3 GB of data. :)
Should I go with 1000 rows per call (~3 MB), or is that still too big/small?
P.S.: Since writes are promised to complete in less than 15 milliseconds, I would assume that 1000 records should take less than 15 seconds and 5000 records less than 75 seconds, both of which are still acceptable durations.
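For scale, splitting the upload into fixed-size batches is straightforward; a minimal sketch (the 1000-record batch size and ~3 KB record size are just the figures from the question):

```python
def chunk(records, batch_size=1000):
    """Yield successive fixed-size batches from a list of records."""
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]

records = [{"id": n, "payload": "x" * 3000} for n in range(10_000)]  # ~3 KB each
batches = list(chunk(records))
# 10,000 records in 1000-record batches -> 10 calls of roughly 3 MB each
```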

I will say you should experiment to come up with the correct batch size.
However, remember that sprocs can only run for 5 seconds. See https://learn.microsoft.com/en-us/azure/cosmos-db/programming#bounded-execution for how to handle this from code.
Hope this helps.

There are a few things you need to consider while doing batching.
When you use a stored procedure to do a batch upsert, it can only work on a single partition.
If each of your records is 4 KB, then a write operation would consume around 4 × 6 RUs = 24 RUs for a single write.
A single physical partition can have a maximum of 10K RUs/sec, which means you could at best insert about 416 documents/sec.
This assumes there is no additional cost for indexing and no other writes are happening to the same physical partition.
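The arithmetic above works out as follows (a back-of-the-envelope sketch; the ~6 RU/KB write cost is the answer's assumption, and real costs vary with indexing):

```python
RU_PER_KB_WRITE = 6          # assumed write cost per KB (from the answer above)
DOC_SIZE_KB = 4
PARTITION_RU_LIMIT = 10_000  # max RU/sec for a single physical partition

ru_per_write = DOC_SIZE_KB * RU_PER_KB_WRITE           # 24 RU per document
max_docs_per_sec = PARTITION_RU_LIMIT // ru_per_write  # ~416 documents/sec
```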
Batching definitely saves on the network hops you make.
But you should consider the following when you are using batching:
Executing a stored procedure consumes some extra RUs, which come out of the RUs allocated to your partition.
If a stored procedure throws an unhandled error, the whole transaction is rolled back, which means the RUs are used up without any data being added.
So you need good exception handling, and if there are failures after executing half of the batch, retry only the rest.
The code of the stored procedure does not necessarily run as efficiently as Cosmos DB's internal code.
There is also a bounded-execution limit of 5 seconds, after which the transaction is killed.
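A common client-side pattern for the bounded-execution limit is to have the sproc report how many documents it managed to insert, then resume from that offset on the next call. A hedged sketch of the loop (execute_sproc below is a hypothetical stand-in for the SDK's stored-procedure call, not a real API):

```python
def execute_sproc(docs, max_per_call=400):
    """Hypothetical stand-in for a Cosmos DB stored-procedure call.
    Real sprocs stop early when nearing the 5-second bound and
    report back how many documents they actually inserted."""
    return min(len(docs), max_per_call)

def bulk_insert(docs):
    """Keep re-invoking the sproc from where the last call left off."""
    inserted = 0
    while inserted < len(docs):
        count = execute_sproc(docs[inserted:])
        if count == 0:
            raise RuntimeError("sproc made no progress; aborting")
        inserted += count
    return inserted
```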

Related

What does write operations (every 4MB, per 10000) mean?

Link: https://azure.microsoft.com/en-in/pricing/details/storage/data-lake/
Under transaction pricing there is "Write operations (every 4 MB, per 10,000)".
What is 10,000, and what does "every 4 MB, per 10,000" mean?
Transactions are incurred any time you read or write data to the service.
Each transaction can consist of at most 4 MB of data.
So if you write 8 MB of data to the service, it will be counted as 2 transactions. Similarly, if one read operation gets 10 MB of data, it will be counted as 3 transactions (4 + 4 + 2).
And if you write only 256 KB of data, it will still be counted as a single transaction (anything up to 4 MB counts as 1 transaction).
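In other words, each operation is billed as ceil(size / 4 MB) transactions, with a minimum of 1. A quick sketch of that rule:

```python
import math

MB = 1024 * 1024

def billed_transactions(size_bytes):
    """Each read/write is billed as one transaction per started 4 MB."""
    return max(1, math.ceil(size_bytes / (4 * MB)))

billed_transactions(256 * 1024)  # 256 KB -> 1 transaction
billed_transactions(8 * MB)      # 8 MB   -> 2 transactions
billed_transactions(10 * MB)     # 10 MB  -> 3 transactions (4 + 4 + 2)
```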
Coming back to your question: as per the above logic, write operations for 10,000 transactions, with 4 MB as the maximum data size per transaction, would cost Rs. 4.296 for the hot tier and Rs. 8.592 for the cool tier.
I might be misunderstanding this explanation, but it seems wrong:
If it's saying that the charge is for 10,000 transactions with up to 4 MB per transaction (i.e. 40 GB across 10k transactions), then this is wrong.
The charges are per transaction up to a max of 4MB per transaction e.g. a single read/write of 10MB data incurs 3 transaction charges.
The ADLS Storage team at Microsoft maintains an FAQ that explains this better, although it's still not entirely clear: https://azure.github.io/Storage/docs/analytics/azure-storage-data-lake-gen2-billing-faq/
The FAQ seems to suggest that reading/writing data from/to a file is based on the "per 4MB approach" whilst metadata operations (Copy File, Rename, Set Properties, etc) are charged on a per 10k operations basis.
So effectively data read/writes are charged per transaction up to 4MB (single transactions >4MB are charged as multiple 4MB transactions), whilst metadata operations are charged on a per 10,000 operations basis.
How you're ever supposed to work out how much this is going to cost upfront is beyond me...

How to optimize RUs for Replace of a document in Azure Cosmos DB?

I'm a beginner with Azure. I'm using the monitoring logs to view the logs for a Cosmos DB resource, and I can see one log with a Replace operation that is consuming a lot of average RUs.
Generally operation names are CREATE/DELETE/UPDATE/READ, so I don't understand where this REPLACE operation came from, or why it is consuming so many RUs.
What can I try next?
Updates in Cosmos are full replacement operations rather than in-place updates, and as such they consume more RU/s than inserts. Also, the larger the document, the more throughput the update requires.
Strategies to optimize throughput consumption on updates typically center around splitting each document in two: the properties that don't change go into one (typically larger) document, while the properties that change frequently go into another, smaller document. Updates can then be made against the smaller document, which reduces the RU/s consumed by the operation.
All that said, 12 RU/s is not an inordinate amount of RU/s for a replace operation. I don't think you will get much, if any throughput reduction doing this. But you can certainly try.
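A minimal sketch of that split (the field names and the set of volatile properties are made up for illustration):

```python
order = {
    "id": "order-42",
    "customer": "contoso",           # rarely changes
    "lineItems": ["sku1", "sku2"],   # rarely changes
    "status": "shipped",             # changes often
    "lastUpdated": "2023-01-01",     # changes often
}

VOLATILE = {"status", "lastUpdated"}

# Static document: written once, rarely replaced.
static_doc = {k: v for k, v in order.items() if k not in VOLATILE}

# Volatile document: small, so it is cheap to replace on every status change.
volatile_doc = {"id": order["id"] + "-state",
                **{k: v for k, v in order.items() if k in VOLATILE}}
```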

Cassandra concurrent read and write

I am trying to understand Cassandra concurrent reads and writes. I came across the property
concurrent_reads (default: 8)
A good rule of thumb is 4 concurrent_reads per processor core. May increase the value for systems with fast I/O storage
So as per the definition (correct me if I am wrong), 4 threads per core can access the database concurrently. Let's say I am trying to run the following query:
SELECT max(column1) FROM testtable WHERE duration = 'month';
I am just trying to execute this query. What will be the use of concurrent reads in executing it?
That's how many active reads can run at a single time per host. This is viewable under the read stage if you run nodetool tpstats. If active is pegged at the number of concurrent readers and you have a pending queue, it may be worth trying to increase this. It's pretty normal for people to have this at ~128 when using decent-sized heaps and SSDs. This is very hardware dependent, so the defaults are conservative.
Keep in mind that the activity on this thread is very fast, usually measured in sub-millisecond times. But even assuming each read takes 1 ms, with only 4 concurrent reads Little's law gives you a maximum of 4000 local reads per second per node (1000/1 × 4). With RF=3 and quorum consistency, you are doing a minimum of 2 replica reads per request, so you can divide by 2 to get a theoretical max throughput (real life is messier).
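The back-of-the-envelope math from the answer, as a sketch (the 1 ms service time is an assumption; real local reads are usually faster):

```python
concurrent_reads = 4
service_time_ms = 1.0  # assumed time per local read

# Little's law: throughput = concurrency / service time
local_reads_per_sec = concurrent_reads * (1000 / service_time_ms)  # 4000

# With RF=3 and QUORUM, each request triggers at least 2 replica reads,
# so per-node request throughput is roughly halved.
max_requests_per_sec = local_reads_per_sec / 2                     # 2000
```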
The aggregation functions (e.g. max) are processed on the coordinator after the data has been fetched from the replicas (each doing a local read and sending a response), so they are not directly impacted by concurrent_reads, since they are handled in the native transport and request/response stages.
From Cassandra 2.2 onward, the standard aggregate functions min, max, avg, sum, and count are built in. So I don't think concurrent_reads will have any effect on your query.

How fast is Azure Search Indexer and how I can index faster?

Each index batch is limited to 1 to 1000 documents. When I call it from my local machine or an Azure VM, I get 800 ms to 3000 ms per 1000-doc batch. If I submit multiple batches asynchronously, the time spent is roughly the same. That means it would take 15-20 hours for my ~50M document collection.
Is there a way I can make it faster?
It looks like you are using our Standard S1 search service. Although there are a lot of things that can impact how fast data can be ingested, I would expect to see ingestion to a single-partition search service at a rate of about 700 docs/second for an average index. So I think your numbers are not far off from what I would expect, although please note that these are purely rough estimates and you may see different results based on any number of factors (such as number of fields, quantity of facets, etc.).
It is possible that some of the extra time you are seeing is due to the latency of uploading the content from your local machine to Azure; it would likely be faster if you did this directly from Azure, but if this is a one-time upload it is probably not worth the effort.
You can slightly increase the speed of data ingestion by increasing the number of partitions you have, and the S2 search service will also ingest data faster, although both of these come at a cost.
By the way, if you have 50M documents, please make sure that you allocate enough partitions: a single S1 partition can handle 15M documents or 25 GB, so you will definitely need extra partitions for this service.
As another side note, when you are uploading your content (especially if you choose to do parallelized uploads), keep an eye on the HTTP responses, because if the search service exceeds the resources available you could get HTTP 207 (indicating one or more items failed to apply) or 503s (indicating the whole batch failed due to throttling). If throttling occurs, you should back off a bit to let the service catch up.
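A hedged sketch of the backoff loop described above (upload_batch is a hypothetical stand-in for whatever SDK or REST call you use to submit a batch; it is assumed to return an HTTP status code):

```python
import time

def upload_with_backoff(batch, upload_batch, max_retries=5, initial_delay=1.0):
    """Retry a batch upload with exponential backoff on throttling (503)
    or partial failure (207)."""
    delay = initial_delay
    for attempt in range(max_retries):
        status = upload_batch(batch)
        if status == 200:
            return True
        if status in (207, 503):  # throttled or partially failed: back off
            time.sleep(delay)
            delay *= 2
        else:
            raise RuntimeError(f"unexpected status {status}")
    return False  # still failing after max_retries attempts
```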
I think you're reaching the request capacity:
https://azure.microsoft.com/en-us/documentation/articles/search-limits-quotas-capacity/
I would try another tier (s1, s2). If you still face the same problem, try get in touch with support team.
Another option:
Instead of pushing data, try to add your data to the blob storage, documentDb or Sql Database, and then, use the pull approach:
https://azure.microsoft.com/en-us/documentation/articles/search-howto-indexing-azure-blob-storage/

Improving performance for Azure Table Storage bulk loads

I'm trying to bulk load about 25 million rows from an Azure SQL table into three different tables in Azure Table Storage. I'm currently managing to process about 50-100 rows / second, which means that at current speeds, it'll take me about 70-140 hours to finish the load. That's a long time, and it seems like it ought to be possible to speed that up.
Here's what I'm doing:
Kick off 10 separate tasks
For each task, read the next 10,000 unprocessed records from the SQL DB
For each of the three destination ATS tables, group the 10,000 records by that table's partition key
In parallel (up to 10 simultaneously), for each partition key, segment the partition into (max) 100-row segments
In parallel (up to 10 simultaneously), for each segment, create a new TableBatchOperation
For each row in the segment, add a batch.InsertOrReplace() operation (because some of the data has already been loaded, and I don't know which)
Execute the batch asynchronously
Rinse and repeat (with lots of flow control, error checking, etc.)
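The grouping and segmenting steps above can be sketched in plain Python (ATS entity-group transactions require a single partition key and at most 100 entities per batch, which is why the 100-row segments exist):

```python
from itertools import groupby

MAX_BATCH = 100  # Azure Table Storage batch limit: 100 entities, one partition key

def make_batches(rows, partition_key):
    """Group rows by partition key, then split each group into <=100-row segments."""
    rows = sorted(rows, key=partition_key)
    for pk, group in groupby(rows, key=partition_key):
        group = list(group)
        for i in range(0, len(group), MAX_BATCH):
            yield pk, group[i:i + MAX_BATCH]

rows = [{"pk": str(n % 3), "val": n} for n in range(450)]
batches = list(make_batches(rows, partition_key=lambda r: r["pk"]))
# 3 partitions x 150 rows -> 2 batches per partition (100 + 50), 6 total
```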
Some notes:
I've tried this several different ways, with lots of different parameters for the various numbers up above, and I'm still not getting it down to less than 10-20 ms / event.
It doesn't seem to be CPU bound, as the VM doing the load is averaging about 10-20% CPU.
It doesn't seem to be SQL-bound, as the SQL select statement is the fastest part of the operation by at least two orders of magnitude.
It's presumably not network-bound, as the VM executing the batch is in the same data center (US West).
I'm getting reasonable partition density, i.e., each 10K set of records is getting broken up into a couple hundred partitions for each table.
With perfect partition density, I'd have up to 3000 tasks running simultaneously (10 master tasks * 3 tables * 10 partitions * 10 segments). But they're executing asynchronously, and they're nearly all I/O bound (by ATS), so I don't think we're hitting any threading limits on the VM executing the process.
The only other obvious idea I can come up with is one that I tried earlier on, namely, to do an ORDER BY partition key in the SQL select statements, so that we can get perfect partition density for the batch inserts. For various reasons that has proven to be difficult, as the table's indexes aren't quite set up for that. And while I would expect some speedup on the ATS side using that approach, given that I'm already grouping the 10K records by their partition keys, I wouldn't expect to get that much additional performance improvement.
Any other suggestions for speeding this up? Or is this about as fast as anybody else has been able to get?
Still open to other suggestions, but I found this page here quite helpful:
http://blogs.msmvps.com/nunogodinho/2013/11/20/windows-azure-storage-performance-best-practices/
Specifically, these:
ServicePointManager.Expect100Continue = false;    // skip the Expect: 100-Continue handshake on every request
ServicePointManager.UseNagleAlgorithm = false;    // disable Nagle buffering, which hurts latency for small payloads
ServicePointManager.DefaultConnectionLimit = 100; // the default is only 2 concurrent connections per host
With those, I was able to drop the average processing time from ~10-20 ms / event down to ~2 ms.
Much better.
But as I said, still open to other suggestions. I've read about other folks getting upwards of 20,000 operations per second on ATS, and I'm still stuck around 500.
What about your partition keys? If they are incremental numbers, then Azure will optimize them onto one storage node, so you should use completely different partition keys ("A1", "B2", etc.) instead of "1", "2", etc.
In this situation all of your partitions will be handled by different storage nodes, and performance will be multiplied.
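One common way to do this is to prefix each sequential key with a short, stable hash so that consecutive values land in different partition ranges. A sketch (the 16-bucket prefix width is an arbitrary choice):

```python
import hashlib

def spread_key(key, buckets=16):
    """Prefix a sequential key with a stable hash bucket so consecutive
    keys are spread across partition ranges instead of one hot node."""
    digest = hashlib.md5(key.encode()).hexdigest()
    bucket = int(digest, 16) % buckets
    return f"{bucket:02d}-{key}"

spread_key("1")  # e.g. a bucketed key like "NN-1"
spread_key("2")  # consecutive inputs map to unrelated prefixes
```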
