How expensive is it to "index everything" in Cosmos DB? (default behaviour)

By default, all data in Azure Cosmos DB is indexed, i.e. every property of every document/item gets consistent, automatic indexing.
However, the cost of storing this index data is not clearly visible to end users.
How would you calculate or track the cost you are bearing for storing and using the index-related data?
As billing is only related to RU/s and data storage, it is not clear how the indexing strategy affects billing.
I also wonder whether the RU/s needed for write-intensive workloads may be increased because of the indexes.
If so, unnecessary indexes in Cosmos DB should be excluded and only the necessary properties indexed, thus reducing the overall cost.

As billing is only related to RU/s and data storage, it is not clear how the indexing strategy affects billing.
Indexing strategy will affect billing: if you index everything, you consume more storage, which in turn increases your bill. In addition, when an item is written to Cosmos DB, part of your RU/s is spent indexing that item, so you end up consuming more RU/s, which also increases your bill.
You may find these links helpful in optimizing the costs as far as indexing is concerned:
Optimize cost with indexing: https://learn.microsoft.com/en-us/azure/cosmos-db/optimize-cost-storage#optimize-cost-with-indexing.
Optimize by changing indexing policy: https://learn.microsoft.com/en-us/azure/cosmos-db/optimize-cost-throughput#optimize-by-changing-indexing-policy.
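To make the opt-out approach concrete, here is a minimal sketch using the Python azure-cosmos (v4) SDK; the account, database, container, and property names are placeholders, and the exact policy depends on which properties your queries actually use:

```python
from azure.cosmos import CosmosClient, PartitionKey

# Placeholder endpoint/key -- substitute your own account details.
client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<primary-key>")
database = client.create_database_if_not_exists("appdb")

# Index only the properties that queries filter or sort on; exclude everything
# else so writes don't spend RU/s (or storage) indexing properties never queried.
indexing_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    "includedPaths": [
        {"path": "/category/?"},
        {"path": "/timestamp/?"},
    ],
    "excludedPaths": [
        {"path": "/*"},
    ],
}

container = database.create_container_if_not_exists(
    id="orders",
    partition_key=PartitionKey(path="/customerId"),
    indexing_policy=indexing_policy,
)
```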
I also wonder whether the RU/s needed for write-intensive workloads may be increased because of the indexes.
That is correct.
If so, unnecessary indexes in Cosmos DB should be excluded and only the necessary properties indexed, thus reducing the overall cost.
For bulk writes, it is recommended that you turn off indexing completely before doing the bulk writes and re-enable it once the write operations are complete. You can also use lazy indexing, as described here.
UPDATE - Adding a comment from @MarkBrown (part of the Cosmos DB team):
I don't recommend using lazy indexing as there is no way to know when it is caught up and the data can be queried consistently. Also, a well-tuned indexing policy can be fine in bulk scenarios. Generally best reserved for massive one-time data loads. Not recommended for regular, batch-type data ingestion.
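If you do go the one-time-load route (keeping the caveat above in mind), the shape of it is roughly the sketch below, again with the Python v4 SDK and placeholder names; treat the exact policy fields as an assumption and verify them against the current indexing-policy documentation:

```python
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<primary-key>")
database = client.get_database_client("appdb")
container = database.get_container_client("orders")
pk = PartitionKey(path="/customerId")   # must match the container's existing partition key

# 1) Turn indexing off for the duration of the one-time load.
database.replace_container(container, partition_key=pk,
                           indexing_policy={"indexingMode": "none", "automatic": False})

# 2) ... run the bulk import here ...

# 3) Re-enable consistent indexing once the load has finished; the index is
#    rebuilt in the background.
database.replace_container(container, partition_key=pk,
                           indexing_policy={"indexingMode": "consistent", "automatic": True})
```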

Related

CosmosDB: Can I reduce the consumption of RUs when I split the document (in case of updating)?

I have a question regarding Cosmos DB and how to deal with costs and RUs. Let's say I have a JSON document that is around 100 KB. When I want to update a property and use an upsert, Cosmos DB will do a replace, which is essentially a delete and create, and that will result in relatively high RU consumption, right?
My first question is: can I reduce the number of RUs consumed per update by splitting the document into smaller parts, say 10 documents of 10 KB each? Then a smaller document has to be replaced, which needs less CPU etc.
That would be the case for upserts. But now there is a game changer for Cosmos DB called partial document update.
How does it behave in this case? Would smaller documents lead to lower RU consumption? In the background, Cosmos DB has to parse the document and insert the new property, so does a bigger document mean more parsing and more RU consumption?
My last question: will splitting into more documents lead to higher overall RU consumption, because I have to make 10 requests instead of one?
I'm going to preface my answer with the comment that anything performance-related is something users need to test themselves, because the benefits or trade-offs can vary widely. There is, however, some general guidance around this.
In scenarios where you have very large documents with frequent updates to a small number of properties, it is often better to shred that document: one document holds the frequently updated properties and another holds the static properties. Smaller documents consume fewer RU/s to update and also reduce the load on the client and the network payload.
Partial updates provide zero RU/s benefit over Update or Upsert, regardless of whether you shred the document or not: the service still needs to patch and merge the entire document. They only reduce CPU consumption and network payload, thanks to the smaller amount of data sent.
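If you want to verify this for your own documents, here is a rough sketch (Python azure-cosmos SDK, recent enough to support patch_item; all names are invented) that compares the RU charge of a full replace with a partial update by reading the x-ms-request-charge response header:

```python
from azure.cosmos import CosmosClient

client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<primary-key>")
container = client.get_database_client("appdb").get_container_client("orders")

def last_charge() -> float:
    # RU cost of the most recent operation, taken from the response header.
    return float(container.client_connection.last_response_headers["x-ms-request-charge"])

doc = container.read_item(item="order-1", partition_key="customer-42")

# Full replace: the whole document is sent and rewritten.
doc["status"] = "shipped"
container.replace_item(item=doc["id"], body=doc)
print("replace RU:", last_charge())

# Partial document update: only the patch operations go over the wire, but as
# noted above the RU charge is comparable -- the savings are client CPU and
# network payload, not throughput.
container.patch_item(
    item="order-1",
    partition_key="customer-42",
    patch_operations=[{"op": "set", "path": "/status", "value": "delivered"}],
)
print("patch RU:", last_charge())
```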

How to optimize RUs for Replace of a document in Azure Cosmos DB?

I'm a beginner with Azure. I'm using log monitoring to view the logs for a Cosmos DB resource, and I can see one log entry for a Replace operation that is consuming a lot of average RUs.
I expected the operation names to be CREATE/DELETE/UPDATE/READ, so I don't understand why a REPLACE operation appears here, or why it is consuming so many RUs.
What can I try next?
Updates in Cosmos DB are full replacement operations rather than in-place updates, so they consume more RU/s than inserts. Also, the larger the document, the more throughput the update requires.
Strategies to optimize throughput consumption on updates typically center around splitting the document in two: the properties that don't change go into one document, which is typically larger, and the frequently changing properties go into another, smaller document. Updates can then be made against the smaller document, which reduces the RU/s the operation consumes.
All that said, 12 RU/s is not an inordinate amount for a replace operation. I don't think you will get much, if any, throughput reduction by doing this, but you can certainly try.
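As a concrete illustration of that split, a sketch only (Python azure-cosmos SDK, with invented document shapes and names):

```python
from azure.cosmos import CosmosClient

client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<primary-key>")
container = client.get_database_client("appdb").get_container_client("orders")

# Large, mostly static document: written once, replaced rarely.
order_static = {
    "id": "order-1",
    "customerId": "customer-42",
    "lineItems": ["<large, rarely-changing payload>"],
}

# Small, volatile document: only the properties that change frequently.
order_status = {
    "id": "order-1-status",
    "customerId": "customer-42",   # same partition key keeps the pair co-located
    "orderId": "order-1",
    "status": "shipped",
}

container.upsert_item(order_static)
container.upsert_item(order_status)   # frequent updates only ever replace this small doc
```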

QLDB - Indexed Storage vs Journal Storage

As I am new to QLDB, I have some doubts:
1) Why does indexed storage consume more space than journal storage?
2) During a large data insertion, the journal storage size keeps increasing even after the insertion has stopped for a certain period. Can you explain why?
3) What tables are present in indexed storage along with the current and history tables?
Q1: the Journal is conceptually similar to tape. It's really good (performance and cost) for sequential reads and writes, but poor for random access. This is why we use it to record transactions. Indexed storage is designed for random access, but this is simply more expensive to implement. We bill for them separately to keep the cost of QLDB as low as we can make it.
Q2: the Journal size will not (ever) increase unless you are committing transactions. You may be noticing this in CloudWatch Metrics, which may be delayed by a few minutes (or longer, depending on how you are aggregating the metrics).
Q3: that's it, just your data. QLDB also offers some virtual tables (such as information_schema.user_tables), but these are not "in indexed storage" - i.e. you're not paying for them.

Writing to Azure Cosmos, Throughput RU

We are planning to write 10,000 JSON documents to Azure Cosmos DB (MongoDB API). Do the throughput units matter? If so, can we increase them for the batch load and set them back to a low number afterwards?
Yes, you can do that. The lowest the throughput can be set is 400 RU/s. Scale up before you're about to do your insert and then turn it down again. As always, that part can be automated if you know when the documents are going to be inserted.
Check out the DocumentClient documentation and more specifically ReplaceOfferAsync.
You can scale the RU/sec allocation up or down at any time. You'll want to look at your insertion cost (RU cost is returned in a header) for a typical document, to get an idea of how many documents you might be able to write, per second, before getting throttled.
Also keep in mind: if you scale your RU out beyond what an underlying physical partition can provide, Cosmos DB will scale out your collection to have additional physical partitions. This means you might not be able to scale your RU back down to the bare minimum later (though you will be able to scale down).
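The ReplaceOfferAsync reference above is from the older .NET DocumentClient SDK; as a rough equivalent, here is a sketch with the current Python azure-cosmos (v4) SDK against the Core (SQL) API, assuming the container has its own manually provisioned throughput (names, documents, and the "/pk" partition key path are placeholders):

```python
from azure.cosmos import CosmosClient

client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<primary-key>")
container = client.get_database_client("appdb").get_container_client("imports")

# Stand-in for the 10,000 JSON documents; "/pk" is assumed to be the partition key path.
documents_to_insert = [{"id": str(i), "pk": "batch-1", "value": i} for i in range(10_000)]

# Scale up just before the batch load ...
container.replace_throughput(4000)

for doc in documents_to_insert:
    container.create_item(doc)
    # Per-document RU cost comes back in a response header; use it to estimate
    # how many writes per second your provisioned RU/s can sustain.
    charge = container.client_connection.last_response_headers["x-ms-request-charge"]

# ... and scale back down once the load is finished (400 RU/s is the minimum,
# subject to the physical-partition caveat mentioned above).
container.replace_throughput(400)
```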

We migrated our app from Parse to Azure, but the cost of DocumentDB is so high. Are we doing something wrong?

We migrated our mobile app (still in development) from Parse to Azure. Everything is running, but the price of DocumentDB is so high that we can't continue with Azure without fixing that. We are probably doing something wrong.
1) The price seems to be bottlenecked by DocumentDB requests.
Running a process to load the data (about 0.5 million documents), memory and CPU were fine, but the DocumentDB request limit was a bottleneck, and the price charged was very high.
2) Even after the end of this data migration (a few days of processing), Azure continues to charge us every day.
We can't understand what is going on here. The usage graphs are flat, but the price is still climbing, as you can see in the images.
Any ideas?
Thanks!
From your screenshots, you have 15 collections under the Parse database. With Parse: Aside from the system classes, each of your user-defined classes gets stored in its own collection. And given that each (non-partitioned) collection has a starting run-rate of ~$24/month (for an S1 collection), you can see where the baseline cost would be for 15 collections (around $360).
You're paying for reserved storage and RU capacity. Regardless of RU utilization, you pay whatever the cost is for that capacity (e.g. S2 runs around $50/month / collection, even if you don't execute a single query). Similar to spinning up a VM of a certain CPU capacity and then running nothing on it.
The default throughput setting for the Parse collections is 1000 RU/s. This will cost $60 per collection (at the rate of $6 per 100 RU/s). Once you finish the Parse migration, the throughput can be lowered if you believe the workload has decreased, which will reduce the charge.
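To put numbers on that: 15 collections at the default 1000 RU/s each, at $6 per 100 RU/s per month, is 15 × $60 = $900/month in reserved throughput alone; lowering each collection to the 400 RU/s minimum (~$24/month each) brings that baseline down to roughly the $360/month figure mentioned above.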
To learn how to do this, take a look at https://azure.microsoft.com/en-us/documentation/articles/documentdb-performance-levels/ (Changing the throughput of a Collection).
The key thing to note is that DocumentDB delivers predictable performance by reserving resources to satisfy your application's throughput needs. Because application load and access patterns change over time, DocumentDB allows you to easily increase or decrease the amount of reserved throughput available to your application.
Azure is a "pay-for-what-you-use" model, especially around resources like DocumentDB and SQL Database where you pay for the level of performance required along with required storage space. So if your requirements are that all queries/transactions have sub-second response times, you may pay more to get that performance guarantee (ignoring optimizations, etc.)
One thing I would seriously look into is the DocumentDB Cost Estimation tool; this allows you to get estimates of throughput costs based upon transaction types based on sample JSON documents you provide:
So in this example, I have an 8KB JSON document, where I expect to store 500K of them (to get an approx. storage cost) and specifying I need throughput to create 100 documents/sec, read 10/sec, and update 100/sec (I used the same document as an example of what the update will look like).
NOTE this needs to be done PER DOCUMENT -- if you're storing documents that do not necessarily conform to a given "schema" or structure in the same collection, then you'll need to repeat this process for EVERY type of document.
Based on this information, I can use those values as inputs into the pricing calculator. This tells me that I can estimate about $450/mo for DocumentDB services alone (if this was my anticipated usage pattern).
There are additional ways you can optimize the Request Units (RUs -- metric used to measure the cost of the given request/transaction -- and what you're getting billed for): optimizing index strategies, optimizing queries, etc. Review the documentation on Request Units for more details.
