CosmosDB: Optimizing RUs with a time-triggered FunctionApp?

We have a FunctionApp which inserts around 8k documents into a CosmosDB every 6 minutes. Currently we have Cosmos set to autoscale, but since our RU usage is very predictable, I have the feeling we could save some money, because autoscale is quite expensive.
I found out it's possible to set the throughput manually, and according to this article I could decrease/increase the RUs with a timer. But now I'm wondering if it's a good idea, because our time interval is small, and even if I time the FunctionApp correctly (error prone?) there are maybe 3 minutes during which I can decrease the throughput. Another thing is that manual throughput costs 50% less.
What do you think: is it worth implementing a time-triggered FunctionApp which increases/decreases the throughput, or is it not a good idea in terms of error-proneness etc.? Do you have any experience with it?

The timer with manual throughput will likely save you money, because throughput is billed at the highest amount of RU/s provisioned in each hour. Since your workload needs to scale up every 6 minutes, your cost for the hour is that highest RU/s. Given that autoscale is 50% more expensive, you'd save by manually scaling up and down.
However, if you were able to stream this data to Cosmos rather than batch it, you would save even more. Throughput is measured per second, so the more you can amortize your usage over a longer period of time, the less throughput you need at any given point. If you put, say, a message queue in front of Cosmos to do load-leveling and streamed the changes in, you would have better throughput utilization overall and thus a lower total cost. Of course, you'd need to weigh that against the cost of running the message queue, but in general streaming is more cost-effective than batching.
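If you do go the timer route, the scaling call itself is very small. Below is a minimal sketch, assuming the Python SDK (azure-cosmos); the RU values, environment variable names, and database/container names are made up, and each helper would be wired to whatever timer schedule brackets the 6-minute ingest:

import os

from azure.cosmos import CosmosClient

HIGH_RU = 4000  # throughput needed during the bulk insert (example value)
LOW_RU = 400    # idle throughput (400 is the minimum for manual throughput)


def _container():
    client = CosmosClient(os.environ["COSMOS_URI"], credential=os.environ["COSMOS_KEY"])
    return client.get_database_client("mydb").get_container_client("docs")


def scale_up():
    # Run this shortly before the ingest FunctionApp fires.
    _container().replace_throughput(HIGH_RU)


def scale_down():
    # Run this once the ingest is known to have finished.
    _container().replace_throughput(LOW_RU)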

Related

Writing to Azure Cosmos, Throughput RU

We are planning to write 10,000 JSON documents to Azure Cosmos DB (MongoDB API). Do the Throughput Units matter? If so, can we increase them for the batch load and set them back to a low number afterwards?
Yes, you can do that. The lowest the RUs can go is 400. Scale up before you're about to do your insert and then turn it down again afterwards. As always, that part can be automated if you know when the documents are going to be inserted.
Check out the DocumentClient documentation, and more specifically ReplaceOfferAsync.
You can scale the RU/sec allocation up or down at any time. You'll want to look at your insertion cost (RU cost is returned in a header) for a typical document, to get an idea of how many documents you might be able to write, per second, before getting throttled.
Also keep in mind: if you scale your RU out beyond what an underlying physical partition can provide, Cosmos DB will scale out your collection to have additional physical partitions. This means you might not be able to scale your RU back down to the bare minimum later (though you will be able to scale down).
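The answers above point at the .NET DocumentClient API (ReplaceOfferAsync) and at reading the RU charge from the response headers. Purely as an illustration of the same idea (measure the cost of a typical insert, raise throughput for the batch, then lower it), here is a rough sketch with the Python SDK (azure-cosmos); the names, document shape and RU numbers are invented:

import os
import uuid

from azure.cosmos import CosmosClient

client = CosmosClient(os.environ["COSMOS_URI"], credential=os.environ["COSMOS_KEY"])
container = client.get_database_client("mydb").get_container_client("docs")

# 1. Insert one representative document and read its RU charge from the response headers.
#    (Assumes the container is partitioned on /pk.)
container.create_item(body={"id": str(uuid.uuid4()), "pk": "sample", "payload": "..."})
charge = float(container.client_connection.last_response_headers["x-ms-request-charge"])
print(f"one insert costs ~{charge} RU")

# 2. Scale up before the batch (e.g. 10,000 docs at ~5 RU each spread over a minute is roughly 830 RU/s).
container.replace_throughput(2000)

# ... run the batch insert here ...

# 3. Scale back down once the batch has finished (400 is the floor for manual throughput).
container.replace_throughput(400)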

Serverless - DynamoDB (terrible) performance compared to RethinkDB + AWS Lambda

In the process of migrating an existing Node.js (Hapi.js) + RethinkDB stack from an OVH VPS (their smallest VPS) to AWS Lambda (Node) + DynamoDB, I've recently come across a huge performance issue.
The usage is rather simple: people use an online tool, and "stuff" gets saved in the DB, passing through a Node.js server/Lambda. That "stuff" takes some space, around 3 KB non-gzipped (a complex object with lots of keys and children, hence why a NoSQL solution makes sense).
There is no issue with the saving itself (for now...); not many people use the tool and there isn't much simultaneous writing, which makes a Lambda a better fit than a 24/7 running VPS.
The real issue is when I want to download those results.
Using Node + RethinkDB, it takes about 3 sec to scan the whole table and generate a CSV file to download.
AWS Lambda + DynamoDB times out after 30 sec. Even if I paginate the results to download only 1000 items, it still takes 20 sec (no timeout this time, just very slow). There are 2200 items in that table, so we can deduce that we'd need around 45 sec to download the whole table if AWS Lambda didn't time out after 30 sec.
So the operation takes around 3 sec with RethinkDB and would theoretically take 45 sec with DynamoDB, for the same amount of fetched data.
Let's look at that data now. There are 2200 items in the table, for a total of 5 MB; here are the DynamoDB stats:
Provisioned read capacity units: 29 (Auto Scaling enabled)
Provisioned write capacity units: 25 (Auto Scaling enabled)
Last decrease time: October 24, 2018 at 2:34:34 AM UTC
Last increase time: October 24, 2018 at 10:22:07 AM UTC
Storage size: 5.05 MB
Item count: 2,195
There are 5 provisioned read/write capacity units, with an autoscaling maximum of 300. But the autoscaling doesn't seem to scale as I'd expect: it went from 5 to 29, and could go up to 300, which would be enough to download 5 MB in 30 sec, but it doesn't use them (I'm just getting started with autoscaling, so I guess it's misconfigured?).
Here we can see the effect of the autoscaling: it does increase the number of read capacity units, but it does so too late, after the timeout has already happened. I've tried to download the data several times in a row and didn't really see much improvement, even with 29 units.
The Lambda itself is configured with 128 MB of RAM; increasing it to 1024 MB has no effect (as I'd expect, which confirms the issue comes from the DynamoDB scan duration).
So all this makes me wonder why DynamoDB can't do in 30 sec what RethinkDB does in 3 sec. It's not related to any kind of indexing, since the operation is a "scan" and therefore has to go through all items in the DB in any order.
I wonder how I am supposed to fetch that HUGE dataset (5 MB!) from DynamoDB to generate a CSV.
And I really wonder if DynamoDB is the right tool for the job; I wasn't expecting such low performance compared to what I've used in the past (Mongo, Rethink, Postgres, etc.).
I guess it all comes down to proper configuration (and there probably are many things to improve there), but even so, why is it such a pain to download a bunch of data? 5 MB is not a big deal, but it feels like it requires a lot of effort and attention, while exporting a single table is a common operation (stats, dump for backup, etc.).
Edit: Since I created this question, I read https://hackernoon.com/the-problems-with-dynamodb-auto-scaling-and-how-it-might-be-improved-a92029c8c10b, which explains in depth the issue I've run into. Basically, autoscaling is slow to trigger, which explains why it doesn't keep up with my use case. This article is a must-read if you want to understand how DynamoDB auto-scaling works.
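For context, the paginated export described above boils down to a Scan loop like the following (a boto3 sketch; the table name and CSV handling are made up):

import csv
import io

import boto3

table = boto3.resource("dynamodb").Table("results")  # placeholder table name


def export_csv() -> str:
    # Scan the whole table page by page; each page consumes read capacity.
    rows, start_key = [], None
    while True:
        kwargs = {"ExclusiveStartKey": start_key} if start_key else {}
        page = table.scan(**kwargs)
        rows.extend(page["Items"])
        start_key = page.get("LastEvaluatedKey")
        if not start_key:  # no more pages
            break
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=sorted(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()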
DynamoDB is not designed for that kind of usage. It's not like a traditional DB that you can just query however you wish, and it especially does not do well with large datasets at a time, such as the one you are requesting.
For this type of scenario, I actually use DynamoDB Streams to create a projection into an S3 bucket and then do large exports that way. It will probably even be faster than the RethinkDB export you reference.
In short, DynamoDB is best as a transactional key-value store for known queries.
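As a rough illustration of that Streams-to-S3 projection (the bucket name, key layout and the string 'id' attribute are assumptions), the stream-triggered Lambda can be quite small:

import json

import boto3

s3 = boto3.client("s3")
BUCKET = "my-export-projection"  # placeholder bucket name


def handler(event, context):
    # Mirror every inserted/updated item into S3 so exports never have to scan the table.
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        image = record["dynamodb"]["NewImage"]  # DynamoDB attribute-value map
        key = image["id"]["S"]                  # assumes a string 'id' attribute
        s3.put_object(Bucket=BUCKET, Key=f"items/{key}.json", Body=json.dumps(image))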
I have come across exactly the same problem in my application (i.e. DynamoDB autoscaling does not kick in fast enough for an on-demand high-intensity job).
I was pretty committed to DynamoDB by the time I came across the problem, so I worked around it. Here is what I did.
When I'm about to start a high-intensity job, I programmatically increase the RCUs and WCUs on my DynamoDB table. In your case you could probably have one Lambda to increase the throughput, then have that Lambda kick off another one to do the high-intensity job. Note that increasing the provision can take a few seconds, hence splitting this into a separate Lambda is probably a good idea.
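A minimal sketch of that two-Lambda arrangement, assuming boto3; the table name, capacity values and worker function name are placeholders:

import boto3

dynamodb = boto3.client("dynamodb")
lambda_client = boto3.client("lambda")


def handler(event, context):
    # Bump provisioned capacity first; the increase takes effect within seconds.
    dynamodb.update_table(
        TableName="Invoices",
        ProvisionedThroughput={"ReadCapacityUnits": 50, "WriteCapacityUnits": 500},
    )
    # Then hand off to the worker Lambda that does the high-intensity job.
    lambda_client.invoke(FunctionName="invoice-extract-worker", InvocationType="Event")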
I will paste my personal notes on the problem I faced below. Apologies but I can't be bothered to format them into stackoverflow markup.
We want enough throughput provisioned all the time so that users have a fast experience, and even more importantly, don't get any failed operations. However, we only want to provision enough throughput to serve our needs, as it costs us money.
For the most part we can use Autoscaling on our tables, which should adapt our provisioned throughput to the amount actually being consumed (i.e. more users = more throughput automatically provisioned). This fails in two key aspects for us:
Autoscaling only increases throughput about 10 minutes after the provisioned throughput threshold is breached, and when it does start scaling up, it is not very aggressive about it. There is a great blog post on this: https://hackernoon.com/the-problems-with-dynamodb-auto-scaling-and-how-it-might-be-improved-a92029c8c10b.
When there is literally zero consumption of throughput, DynamoDB does not decrease the provisioned throughput (see "AWS Dynamo not auto-scaling back down").
The place we really need to manage throughput is on the Invoice table WCUs. RCUs are a lot cheaper than WCUs, so reads are less of a worry to provision. For most tables, provisioning a few RCU and WCU should be plenty. However, when we do an extract from the source, our write capacity on the Invoices table is high for a 30 minute period.
Let's imagine we just relied on Autoscaling. When a user kicked off an extract, we would have 5 minutes of burst capacity, which may or may not be enough throughput. Autoscaling would kick in after around 10 minutes (at best), but it would do so ponderously, not scaling up as fast as we needed. Our provision would not be high enough, we would get throttled, and we would fail to get the data we wanted. If several processes were running concurrently, this problem would be even worse; we just couldn't handle multiple extracts at the same time.
Fortunately we know when we are about to hammer the Invoices table, so we can programmatically increase throughput on it. Increasing throughput programmatically seems to take effect very quickly, probably within seconds. I noticed in testing that the Metrics view in DynamoDB is pretty useless: it's really slow to update and I think sometimes it just showed the wrong information. You can use the AWS CLI to describe the table and see what throughput is provisioned in real time:
aws dynamodb describe-table --table-name DEV_Invoices
In theory we could just increase throughput when an extract started and then reduce it again when we finished. However, whilst you can increase the throughput provision as often as you like, you can only decrease it 4 times in a day; after that you can decrease it once every hour (i.e. up to 27 times in 24 hours). https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html#default-limits-throughput. This approach is not going to work, as our decrease in provision might well fail.
Even if Autoscaling is in play, it still has to abide by the provisioning decrease rules. So if we've decreased 4 times, Autoscaling will have to wait an hour before decreasing again, and that's for both read and write values.
Increasing throughput provision programmatically is a good idea: we can do it fast (much faster than Autoscaling), so it works for our infrequent high workloads. We can't decrease throughput programmatically after an extract (see above), but there are a couple of other options.
Autoscaling for throughput decrease
Note that even when Autoscaling is set, we can programmatically change the provision to anything we like (e.g. higher than the maximum Autoscaling level).
We can just rely on Autoscaling to bring the capacity back down an hour or two after the extract has finished, that's not going to cost us too much.
There is another problem though. If our consumed capacity drops right down to zero after an extract, which is likely, no consumption data is sent to CloudWatch and Autoscaling doesn't do anything to reduce provisioned capacity, leaving us stuck on a high capacity.
There are two fudge options to fix this, though. Firstly, we can set the minimum and maximum throughput provision to the same value. For example, setting the minimum and maximum provisioned RCUs within Autoscaling to 20 will ensure that the provisioned capacity returns to 20, even if there is zero consumed capacity. I'm not sure why, but this works (I've tested it, and it does); AWS acknowledges the workaround here:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/AutoScaling.html
The other option is to create a Lambda function that attempts a (failing) read and delete operation on the table every minute. Failed operations still consume capacity, which is why this works. This job ensures data is sent to CloudWatch regularly, even when our 'real' consumption is zero, and therefore Autoscaling will reduce capacity correctly.
Note that read and write data are sent separately to CloudWatch. So if we want WCUs to decrease when real consumed WCUs are zero, we need to use a write operation (i.e. a delete); similarly, we need a read operation to make sure RCUs are updated. Note that failed reads (if the item does not exist) and failed deletes (if the item does not exist) still consume throughput.
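A sketch of that keep-alive poller, assuming boto3 and a made-up table and key schema:

import boto3

dynamodb = boto3.client("dynamodb")
TABLE = "Invoices"                                            # placeholder table name
GHOST_KEY = {"InvoiceId": {"S": "__autoscale_keepalive__"}}   # a key that never exists


def handler(event, context):
    # Both calls consume capacity even though no item is found or deleted,
    # which keeps consumption metrics flowing to CloudWatch.
    dynamodb.get_item(TableName=TABLE, Key=GHOST_KEY)      # consumes RCUs
    dynamodb.delete_item(TableName=TABLE, Key=GHOST_KEY)   # consumes WCUs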
Lambda for throughput decrease
In the previous solution we used a Lambda function to continuously 'poll' the table, thus creating the CloudWatch data which enables DynamoDB Autoscaling to function. As an alternative, we could just have a Lambda which runs regularly and scales down the throughput when required. When you 'describe' a DynamoDB table, you get the current provisioned throughput as well as the last increase and last decrease datetimes. So the Lambda could say: if the provisioned WCUs are over a threshold and the last throughput increase was more than half an hour ago (i.e. we are not in the middle of an extract), let's decrease the throughput right down.
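For reference, that check could look something like this (a boto3 sketch; the threshold, idle values and table name are arbitrary):

from datetime import datetime, timedelta, timezone

import boto3

dynamodb = boto3.client("dynamodb")
TABLE = "Invoices"
IDLE_RCU, IDLE_WCU = 5, 5
HIGH_WCU_THRESHOLD = 50
COOL_OFF = timedelta(minutes=30)


def handler(event, context):
    throughput = dynamodb.describe_table(TableName=TABLE)["Table"]["ProvisionedThroughput"]
    last_increase = throughput.get(
        "LastIncreaseDateTime", datetime.min.replace(tzinfo=timezone.utc)
    )
    # Only scale down if provision is high and we are not in the middle of an extract.
    if (throughput["WriteCapacityUnits"] > HIGH_WCU_THRESHOLD
            and datetime.now(timezone.utc) - last_increase > COOL_OFF):
        dynamodb.update_table(
            TableName=TABLE,
            ProvisionedThroughput={
                "ReadCapacityUnits": IDLE_RCU,
                "WriteCapacityUnits": IDLE_WCU,
            },
        )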
Given that this is more code than the Autoscaling option, I'm not inclined to do this one.

Where is the 'break-even' point for scaling a collection up / down

I've been thinking about scaling my collection up before initiating a large write or bulk import.
However, I am struck by the fact that I don't know how many RUs it costs to perform the scaling operation. It's possible that it could cost more to scale up, execute, and scale back down than it would to just leave the throughput at a constant level.
Naturally, there are concerns around how long between writes, how long the process takes, etc, but I can't really approach the question without knowing the cost of scaling. I'm curious if anyone has a policy or rule-of-thumb they use to control this.
The cost of the scaling operation itself is the same as the cost of updating any other Resource in CosmosDB.
What you need to know is that everything from the Database down to the Document inherits from a single type: the Resource.
What you are talking about is updating an Offer, which is the Resource that holds the collection's offer data, such as the throughput. Updating the Offer costs the same as updating any other document of that size (somewhere around 5-10 RUs).
Keep in mind however that CosmosDB charges you on an hourly basis based on the maximum provisioned throughput of the collection for that hour. This means that even if you upscale and instantly downscale the throughput, you will still be charged for one hour worth of that maximum throughput.

We migrated our app from Parse to Azure but the cost of DocumentDB is so high. Are we doing something wrong?

We migrated our mobile app (still being developed) from Parse to Azure. Everything is running, but the price of DocumentDB is so high that we can't continue with Azure without fixing that. We're probably doing something wrong.
1) The price seems to be dominated by the DocumentDB requests.
Running a process to load the data (about 0.5 million documents), memory and CPU were OK, but the DocumentDB request limit was a bottleneck, and the price charged was very high.
2) Even after the end of this data migration (a few days of processing), Azure continues to charge us every day.
We can't understand what is going on here. The usage graphs are flat, but the price is still climbing, as you can see in the images.
Any ideas?
Thanks!
From your screenshots, you have 15 collections under the Parse database. With Parse: Aside from the system classes, each of your user-defined classes gets stored in its own collection. And given that each (non-partitioned) collection has a starting run-rate of ~$24/month (for an S1 collection), you can see where the baseline cost would be for 15 collections (around $360).
You're paying for reserved storage and RU capacity. Regardless of RU utilization, you pay whatever the cost is for that capacity (e.g. S2 runs around $50/month / collection, even if you don't execute a single query). Similar to spinning up a VM of a certain CPU capacity and then running nothing on it.
The default throughput setting for the parse collections is set to 1000 RUPS. This will cost $60 per collection (at the rate of $6 per 100 RUPS). Once you finish the parse migration, the throughput can be lowered if you believe the workload decreased. This will reduce the charge.
To learn how to do this, take a look at https://azure.microsoft.com/en-us/documentation/articles/documentdb-performance-levels/ (Changing the throughput of a Collection).
The key thing to note is that DocumentDB delivers predictable performance by reserving resources to satisfy your application's throughput needs. Because application load and access patterns change over time, DocumentDB allows you to easily increase or decrease the amount of reserved throughput available to your application.
Azure is a "pay-for-what-you-use" model, especially around resources like DocumentDB and SQL Database where you pay for the level of performance required along with required storage space. So if your requirements are that all queries/transactions have sub-second response times, you may pay more to get that performance guarantee (ignoring optimizations, etc.)
One thing I would seriously look into is the DocumentDB Cost Estimation tool; this allows you to get estimates of throughput costs based upon transaction types based on sample JSON documents you provide:
So in this example, I have an 8KB JSON document, where I expect to store 500K of them (to get an approx. storage cost) and specifying I need throughput to create 100 documents/sec, read 10/sec, and update 100/sec (I used the same document as an example of what the update will look like).
NOTE this needs to be done PER DOCUMENT -- if you're storing documents that do not necessarily conform to a given "schema" or structure in the same collection, then you'll need to repeat this process for EVERY type of document.
Based on this information, I can use those values as inputs to the pricing calculator. This tells me that I can estimate about $450/mo for DocumentDB services alone (if this was my anticipated usage pattern).
There are additional ways you can optimize the Request Units (RUs -- metric used to measure the cost of the given request/transaction -- and what you're getting billed for): optimizing index strategies, optimizing queries, etc. Review the documentation on Request Units for more details.

DocumentDB User Defined Performance Pricing

I'm looking into moving to the new partitioned collections for DocumentDB and have a few questions that the documentation and pricing calculator seems to be a little unclear on.
PRICING:
In the below scenario my partitioned collection will be charged $30.02/mo at 1GB of data with a constant hourly RU use of 500:
So does this mean that if my users only hit the data at an average of 500 RUs for about 12 hours per day, which means that HALF the time my collection goes UNUSED but is still RUNNING and AVAILABLE (not shut down), the price goes down to $15.13/mo as the calculator indicates here:
Or will I be billed the full $30.01/mo since my collection was up and running?
I get confused when I go to the portal and see an estimate for $606/mo with no details behind it when I attempt to spin up the lowest options on a partition collection:
Is the portal just indicating the MAXIMUM that I COULD be billed that month if I use all my allotted 10,100 RU's a second every second of the hour for 744 consecutive hours?
If billing is based on hourly use and the average RUs used drops to 100 for some of the hours in the second scenario, does the cost go down even further? Does Azure billing for partitioned collections fluctuate based on hourly usage, rather than total uptime like the existing S1/S2/S3 tiers?
If so, how does the system determine what is billed for a given hour? If for most of the hour the RUs used are 100/sec, but for a few seconds it spikes to 1,000, does it average out across that entire hour and only charge me something like 200-300 RUs for that hour, or will I be billed for the highest RU level used that hour?
PERFORMANCE:
Will I see a performance hit by moving to this scenario since my data will be on separate partitions and require partition id/key to access? If so what can I expect, or will it be so minimal that it would be undetected by my users?
RETRIES & FAULT HANDLING:
I'm assuming the TransientFaultHandling Nuget package I use in my current scenario will still work on the new scenario, but may not be used as much since my RU capacity is much larger, or do I need to rethink how I handle requests that go over the RU cap?
So the way that pricing works for Azure DocumentDB is that you pay to reserve a certain amount of data storage (in GB) and/or throughput (in Request Units, RU). These charges are applied per hour that the reservation is in place (usage is not required). Additionally, just having a DocumentDB account active is deemed to be an active S1 subscription until a DocumentDB database gets created; then the pricing of your db takes over. There are two options available:
Option 1 (Original Pricing)
You can choose between S1, S2 or S3, each offering the same 10 GB of storage but varying in throughput: 250 RU / 1000 RU / 2500 RU.
Option 2 (User-defined performance)
This is the new pricing structure, which better decouples size and throughput. This option additionally provides for partitioning. Note that with user-defined performance you are charged per GB of data storage used (pay-as-you-go storage).
With user-defined performance levels, storage is metered based on consumption, but with pre-defined performance levels, 10 GB of storage is reserved at the time of collection creation.
Single Partition Collection
The minimum is 400 RU and 1 GB of data storage.
The maximum is 10,000 RU and 250 GB of data storage.
Partitioned Collections
The minimum is 10,000 RU and 1 GB of data storage.
The maximum is 250,000 RU and 250 GB of data storage (EDIT: you can request a higher limit).
So at a minimum you will be paying the cost per hour related to the option you selected. The only way to not pay for an hour would be to delete the db and the account, unfortunately.
Cost of Varying RU
In terms of varying your RU within the time frame of one hour, you are charged for that hour at the cost of the peak reserved RU you requested. So if you were at 400 RU and you up it to 1000 RU for one second, you will be charged at the 1000 RU rate for that hour, even if for the other 59 minutes and 59 seconds you set it back to 400 RU.
Will I see a performance hit by moving to this scenario since my data will be on separate partitions and require partition id/key to access?
On the topic of a performance hit, there are a few things to think about, but in general, no.
If you have a sane partition key with enough distinct values, you should not see a performance penalty. This means that you need to partition the data so that the partition key is available when querying, and you need to keep the data you want from a query in the same partition by using the same partition key.
If you do queries without a partition key, you will see a severe penalty, as the query is parsed and executed per partition.
One thing to keep in mind when selecting a partition key is the limits for each partition, which are 10 GB and 10K RU. This means that you want an even distribution over the partitions in order to avoid a "hot" partition, which means that even if you scale to more than enough RU in total, you may receive 429s for a specific partition.
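To make the last two points concrete, here is roughly what the two query shapes look like in the Python SDK (the container, property names and values are invented):

import os

from azure.cosmos import CosmosClient

client = CosmosClient(os.environ["COSMOS_URI"], credential=os.environ["COSMOS_KEY"])
container = client.get_database_client("mydb").get_container_client("orders")

# Cheap: the SDK routes this to the single partition that owns customerId "c-42".
cheap = container.query_items(
    query="SELECT * FROM c WHERE c.customerId = 'c-42'",
    partition_key="c-42",
)

# Expensive: no partition key, so the query fans out to every physical partition.
expensive = container.query_items(
    query="SELECT * FROM c WHERE c.total > 100",
    enable_cross_partition_query=True,
)

print(len(list(cheap)), len(list(expensive)))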
