Serverless - DynamoDB (terrible) performance compared to RethinkDB + AWS Lambda - node.js

In the process of migrating an existing Node.js (Hapi.js) + RethinkDB setup from an OVH VPS (the smallest VPS) to AWS Lambda (Node) + DynamoDB, I've recently come across a huge performance issue.
The usage is rather simple: people use an online tool, and "stuff" gets saved in the DB, passing through a Node.js server/Lambda. That "stuff" takes up some space, around 3 KB non-gzipped (a complex object with lots of keys and children, which is why a NoSQL solution makes sense).
There is no issue with the saving itself (for now...): not many people use the tool and there isn't much simultaneous writing, which makes a Lambda more sensible than a 24/7 running VPS.
The real issue is when I want to download those results.
Using Node + RethinkDB it takes about 3 seconds to scan the whole table and generate a CSV file to download.
AWS Lambda + DynamoDB times out after 30 seconds. Even if I paginate the results and download only 1,000 items, it still takes 20 seconds (no timeout this time, just very slow). There are 2,200 items in that table, so we can deduce that downloading the whole table would take around 45 seconds, if AWS Lambda didn't time out after 30.
So the operation takes around 3 seconds with RethinkDB, and would theoretically take 45 seconds with DynamoDB, for the same amount of fetched data.
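For reference, the export on the Lambda side boils down to a paginated Scan along these lines (a minimal sketch with the AWS SDK v2 DocumentClient; the table name and the 1,000-item page size are illustrative, and generating the CSV from the items is omitted):

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

// Scan the whole table, following LastEvaluatedKey until every page has been read.
async function scanAll(tableName) {
  const items = [];
  let lastKey;
  do {
    const params = { TableName: tableName, Limit: 1000 };
    if (lastKey) params.ExclusiveStartKey = lastKey;
    const page = await docClient.scan(params).promise();
    items.push(...page.Items);
    lastKey = page.LastEvaluatedKey; // undefined once the last page has been read
  } while (lastKey);
  return items;
}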
Let's look at the data now. There are 2,200 items in the table, for a total of 5 MB. Here are the DynamoDB stats:
Provisioned read capacity units: 29 (auto scaling enabled)
Provisioned write capacity units: 25 (auto scaling enabled)
Last decrease time: October 24, 2018 at 2:34:34 AM UTC
Last increase time: October 24, 2018 at 10:22:07 AM UTC
Storage size: 5.05 MB
Item count: 2,195
There are 5 provisioned read/write capacity units, with an auto scaling maximum of 300. But auto scaling doesn't seem to scale the way I'd expect: it went from 5 to 29, and it could go up to 300, which would be enough to download 5 MB in 30 seconds, but it doesn't use that headroom (I'm just getting started with auto scaling, so I guess it's misconfigured?).
Here we can see the effect of auto scaling, which does increase the number of read capacity units, but it does so too late: the timeout has already happened. I've tried to download the data several times in a row and didn't really see much improvement, even with 29 units.
The Lambda itself is configured with 128 MB of RAM; increasing it to 1024 MB has no effect (as I'd expect, which confirms the issue comes from the DynamoDB scan duration).
So all this makes me wonder why DynamoDB can't do in 30 seconds what RethinkDB does in 3. It's not related to any kind of indexing, since the operation is a scan and therefore has to go through all the items in the DB in any order anyway.
I wonder how I'm supposed to fetch that HUGE dataset (5 MB!) with DynamoDB to generate a CSV.
And I really wonder whether DynamoDB is the right tool for the job; I really wasn't expecting such low performance compared to what I've used in the past (Mongo, Rethink, Postgres, etc.).
I guess it all comes down to proper configuration (and there are probably many things to improve there), but even so, why is it such a pain to download a bunch of data? 5 MB is not a big deal, yet it feels like it requires a lot of effort and attention, while exporting a single table is a very common operation (stats, dump for backup, etc.).
Edit: Since I created this question, I have read https://hackernoon.com/the-problems-with-dynamodb-auto-scaling-and-how-it-might-be-improved-a92029c8c10b, which explains in depth the issue I've hit. Basically, auto scaling is slow to trigger, which explains why it doesn't scale right for my use case. This article is a must-read if you want to understand how DynamoDB auto scaling works.

DynamoDB is not designed for that kind of usage. It's not like a traditional DB that you can query however you wish, and it especially does not do well with retrieving large datasets at once, such as the one you are requesting.
For this type of scenario, I actually use DynamoDB Streams to create a projection into an S3 bucket and then do large exports that way. It will probably even be faster than the RethinkDB export you reference.
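A minimal sketch of that stream-to-S3 projection, assuming a Lambda subscribed to the table's stream with NEW_IMAGE enabled (the bucket name, key layout and id attribute are illustrative):

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// Triggered by the DynamoDB stream: mirror each new/updated item into S3 as JSON.
exports.handler = async (event) => {
  const puts = event.Records
    .filter((record) => record.eventName === 'INSERT' || record.eventName === 'MODIFY')
    .map((record) => {
      const item = AWS.DynamoDB.Converter.unmarshall(record.dynamodb.NewImage);
      return s3.putObject({
        Bucket: 'my-table-projection',      // illustrative bucket name
        Key: `items/${item.id}.json`,       // assumes an "id" attribute
        Body: JSON.stringify(item),
      }).promise();
    });
  await Promise.all(puts);
};

Large exports then read straight from S3 instead of scanning DynamoDB.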
In short, DynamoDB is best as a transactional key-value store for known queries.

I have come across exactly the same problem in my application (i.e. DynamoDB autoscaling does not kick in fast enough for an on-demand high intensity job).
I was pretty committed to DynamoDB by the time I came across the problem, so I worked around it. Here is what I did.
When I'm about to start a high-intensity job, I programmatically increase the RCUs and WCUs on my DynamoDB table. In your case you could probably have one Lambda increase the throughput, then have that Lambda kick off another one to do the high-intensity job. Note that increasing the provision can take a few seconds, which is another reason splitting this into a separate Lambda is probably a good idea.
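Something along these lines in the "bump the throughput" Lambda, as a sketch with the AWS SDK v2 UpdateTable call (table name and capacity values are illustrative):

const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB();

// Raise the provisioned throughput before kicking off the heavy job.
async function raiseThroughput(tableName, readUnits, writeUnits) {
  await dynamodb.updateTable({
    TableName: tableName,
    ProvisionedThroughput: {
      ReadCapacityUnits: readUnits,
      WriteCapacityUnits: writeUnits,
    },
  }).promise();
}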
I will paste my personal notes on the problem I faced below. Apologies but I can't be bothered to format them into stackoverflow markup.
We want enough throughput provisioned all the time so that users have a fast experience, and even more importantly, don't get any failed operations. However, we only want to provision enough throughput to serve our needs, as it costs us money.
For the most part we can use Autoscaling on our tables, which should adapt our provisioned throughput to the amount actually being consumed (i.e. more users = more throughput automatically provisioned). This fails in two key aspects for us:
Autoscaling only increases throughput about 10 minutes after the throughput provision threshold is breached, and when it does start scaling up, it is not very aggressive about it. There is a great blog post on this here: https://hackernoon.com/the-problems-with-dynamodb-auto-scaling-and-how-it-might-be-improved-a92029c8c10b.
When there is literally zero consumption of throughput, DynamoDB does not decrease the provisioned throughput (see the Stack Overflow question "AWS Dynamo not auto-scaling back down").
The place we really need to manage throughput is the WCUs on the Invoices table. RCUs are a lot cheaper than WCUs, so reads are less of a worry to provision. For most tables, provisioning a few RCUs and WCUs should be plenty. However, when we do an extract from the source, our write capacity on the Invoices table is high for a 30-minute period.
Let's imagine we just relied on Autoscaling. When a user kicked off an extract, we would have 5 minutes of burst capacity, which may or may not be enough throughput. Autoscaling would kick in after around 10 minutes (at best), but it would do so ponderously, not scaling up as fast as we needed. Our provision would not be high enough, we would get throttled, and we would fail to get the data we wanted. If several processes were running concurrently, this problem would be even worse: we just couldn't handle multiple extracts at the same time.
Fortunately we know when we are about to hammer the Invoices table, so we can programmatically increase its throughput. Increasing throughput programmatically seems to take effect very quickly, probably within seconds. I noticed in testing that the Metrics view in DynamoDB is pretty useless: it's really slow to update, and I think it sometimes just showed the wrong information. You can use the AWS CLI to describe the table and see what throughput is provisioned in real time:
aws dynamodb describe-table --table-name DEV_Invoices
In theory we could just increase throughput when an extract started, and then reduce it again when we finished. However, whilst you can increase throughput provision as often as you like, you can only decrease it 4 times in a day, after which you can decrease it once every hour (i.e. up to 27 times in 24 hours): https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html#default-limits-throughput. This approach is not going to work, as our decrease in provision might well fail.
Even if Autoscaling is in play, it still has to abide by the provisioning decrease rules. So if we've decreased 4 times, Autoscaling will have to wait an hour before decreasing again, and that's for both read and write values.
Increasing throughput provision programmatically is a good idea: we can do it fast (much faster than Autoscaling), so it works for our infrequent high workloads. We can't decrease throughput programmatically right after an extract (see above), but there are a couple of other options.
Autoscaling for throughput decrease
Note that even when Autoscaling is set, we can programmatically change the provision to anything we like (e.g. higher than the maximum Autoscaling level).
We can just rely on Autoscaling to bring the capacity back down an hour or two after the extract has finished; that's not going to cost us too much.
There is another problem though. If our consumed capacity drops right down to zero after an extract, which is likely, no consumption data is sent to CloudWatch and Autoscaling doesn't do anything to reduce provisioned capacity, leaving us stuck on a high capacity.
There are two fudge options to fix this though. Firstly, we can set the minimum and maximum throughput provision to the same value. For example, setting the minimum and maximum provisioned RCUs within Autoscaling to 20 will ensure that the provisioned capacity returns to 20, even if there is zero consumed capacity. I'm not sure why, but this works (I've tested it, and it does); AWS acknowledge the workaround here:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/AutoScaling.html
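For the record, pinning the minimum and maximum together can also be done from code, via the Application Auto Scaling API. A rough sketch with the AWS SDK v2 (table name and capacity value are illustrative):

const AWS = require('aws-sdk');
const autoscaling = new AWS.ApplicationAutoScaling();

// Pin min and max RCUs to the same value so the provision always returns to it.
async function pinReadCapacity(tableName, units) {
  await autoscaling.registerScalableTarget({
    ServiceNamespace: 'dynamodb',
    ResourceId: `table/${tableName}`,
    ScalableDimension: 'dynamodb:table:ReadCapacityUnits',
    MinCapacity: units,
    MaxCapacity: units,
  }).promise();
}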
The other option is to create a Lambda function that attempts to execute a (failed) read and delete operation on the table every minute. Failed operations still consume capacity, which is why this works. This job ensures data is sent to CloudWatch regularly, even when our 'real' consumption is zero, and therefore Autoscaling will reduce capacity correctly.
Note that read and write data is sent to CloudWatch separately. So if we want WCUs to decrease when real consumed WCUs are zero, we need to use a write operation (i.e. a delete). Similarly, we need a read operation to make sure RCUs are updated. Note that failed reads (if the item does not exist) and failed deletes (if the item does not exist) still consume throughput.
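A sketch of such a poller Lambda, assuming the AWS SDK v2 DocumentClient, a simple partition key called id, and a sentinel key that never exists (all names illustrative):

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

// Run every minute (e.g. from a CloudWatch Events schedule) so that consumption
// metrics keep flowing to CloudWatch even when real traffic is zero.
exports.handler = async () => {
  const params = { TableName: 'Invoices', Key: { id: '__autoscaling-probe__' } };
  await docClient.get(params).promise();    // consumes read capacity even though nothing is found
  await docClient.delete(params).promise(); // consumes write capacity even though nothing is deleted
};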
Lambda for throughput decrease
In the previous solution we used a Lambda function to continuously 'poll' the table, thus creating the CloudWatch data which enables DynamoDB Autoscaling to function. As an alternative, we could just have a Lambda which runs regularly and scales down the throughput when required. When you 'describe' a DynamoDB table, you get the current provisioned throughput as well as the datetimes of the last increase and last decrease. So the Lambda could say: if the provisioned WCUs are over a threshold and the last throughput increase was more than half an hour ago (i.e. we are not in the middle of an extract), let's decrease the throughput right down.
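A rough sketch of that scheduled scale-down Lambda with the AWS SDK v2 (the table name, the WCU threshold of 25, the target of 5 and the 30-minute window are all illustrative):

const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB();

// Scheduled (e.g. hourly): if WCUs are still high and the last increase was a
// while ago, assume the extract has finished and bring the provision back down.
exports.handler = async () => {
  const { Table } = await dynamodb.describeTable({ TableName: 'DEV_Invoices' }).promise();
  const { ReadCapacityUnits, WriteCapacityUnits, LastIncreaseDateTime } = Table.ProvisionedThroughput;

  const halfHourAgo = Date.now() - 30 * 60 * 1000;
  if (WriteCapacityUnits > 25 && new Date(LastIncreaseDateTime).getTime() < halfHourAgo) {
    await dynamodb.updateTable({
      TableName: 'DEV_Invoices',
      ProvisionedThroughput: { ReadCapacityUnits, WriteCapacityUnits: 5 },
    }).promise();
  }
};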
Given that this is more code than the Autoscaling option, I'm not inclined to do this one.

Related

CosmosDB: Optimizing RUs with a time-triggered FunctionApp?

We have a FunctionApp which inserts around 8k documents into a CosmosDB every 6 minutes. Currently we have Cosmos set to autoscale, but since our RUs are very predictable, I have the feeling we could save some money, because it's quite expensive.
I found out it's possible to set the throughput manually, and according to this article I could decrease/increase the RUs with a timer. But now I'm wondering if it's a good idea, because we have a small time interval, and even if I time the FunctionApp correctly (error prone?) there are maybe only 3 minutes where I can decrease the throughput. Another thing is that manual throughput costs 50% less per RU.
What do you think: is it worth implementing a time-triggered FunctionApp which increases/decreases the throughput, or is it not a good idea in terms of error proneness etc.? Do you have any experience with it?
The timer with manual throughput will likely save you money, because throughput is billed as the highest amount of RU/s provisioned during each hour. Since your workload needs to scale up every 6 minutes, your cost is driven by the highest RU/s in that hour either way. Given that autoscale is 50% more expensive per RU, you'd save by manually scaling up and down.
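If you do go the timer route, here is a rough sketch of the scale-up function (Node.js, timer-triggered via a schedule such as "0 */6 * * * *" in function.json). It assumes the offer-replace pattern of the @azure/cosmos v3 SDK; the endpoint, key, database/container names and RU values are placeholders, and you should double-check the throughput API against the SDK version you actually use:

const { CosmosClient } = require('@azure/cosmos');

const client = new CosmosClient({
  endpoint: process.env.COSMOS_ENDPOINT,
  key: process.env.COSMOS_KEY,
});

// Replace the manual throughput (RU/s) provisioned on a container.
async function setThroughput(databaseId, containerId, requestUnits) {
  const { resource: containerDef } = await client.database(databaseId).container(containerId).read();
  const { resources: offers } = await client.offers.readAll().fetchAll();
  const offer = offers.find((o) => o.offerResourceId === containerDef._rid);
  offer.content.offerThroughput = requestUnits;
  await client.offer(offer.id).replace(offer);
}

module.exports = async function (context, myTimer) {
  // Scale up just before the batch insert; a second timer (or the insert
  // function itself, once it finishes) would scale back down the same way.
  await setThroughput('mydb', 'documents', 4000);
};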
However, if you were able to stream this data into Cosmos rather than batch it, you would save even more. Throughput is measured per second, so the more you can amortize your usage over a longer period of time, the less throughput you need at any given point. If you were able to put, say, a message queue in front of Cosmos to do load levelling and stream the changes in, you would get better throughput utilization overall and thus a lower total cost. Of course, you'd need to weigh in the cost of running the message queue itself, but in general streaming is more cost-effective than batching.

DocumentDB is using way more IOPS than it should

My team and I recently switched a relatively large MongoDB deployment (0.5 TB originally) onto AWS DocumentDB. Strangely, DocumentDB is using way more resources than I think it should. On our old cluster we never went over 1,500 IOPS, but now we're well over 50k in AWS, which is terrifically expensive. Also, the storage size has ballooned to 2 TB, almost as if the data were no longer compressed at all (WiredTiger would have been compressing things on the old cluster).
What could be happening here? As far as I know there were no code changes. Are there any tools I could use to figure out what the IOPS are being used for? Are there other unique sources of I/O usage that I might not be thinking of? A >10x jump seems totally crazy to me.
If you mean the VolumeWriteIOPs reported by your DocumentDB cluster, then the reported value is actually not the number of write operations per second.
In AWS, VolumeWriteIOPs is reported as a total for a 5-minute interval, not an average. So to get the actual billed volume write IOPS (per second) you have to divide the reported VolumeWriteIOPs by 300.
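For example, a VolumeWriteIOPs reading of 300,000 for a 5-minute window (numbers purely illustrative) corresponds to 300,000 / 300 = 1,000 actual write operations per second.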
You can scroll down this table to find the explanation for the VolumeWriteIOPs. https://docs.aws.amazon.com/documentdb/latest/developerguide/cloud_watch.html#w150aac29c23b9c11
After discovering the above documentation I thought you might have had the same problem. My original question:
AWS DocumentDB: What is the difference between instance writeIOPS and cluster volumeWriteIOPS and why is the volumeWriteIOPS 100x the writeIOPS?

We migrated our app from Parse to Azure but the costs of DocumentDB is so high. Are we doing something wrong?

We migrated our mobile app (still being developed) from Parse to Azure. Everything is running, but the price of DocumentDB is so high that we can't continue with Azure without fix that. Probably we're doing something wrong.
1) The price seems to be driven mainly by the DocumentDB requests.
Running a process to load the data (about 0.5 million documents), memory and CPU was ok, but the DocumentDB request limit was a bottleneck, and the price charged was very high.
2) Even after the end of this data migration (a few days of processing), Azure continues to charge us every day.
We can't understand what is going on here. The usage graphs are flat, but the cost keeps climbing, as you can see in the images.
Any ideas?
Thanks!
From your screenshots, you have 15 collections under the Parse database. With Parse: Aside from the system classes, each of your user-defined classes gets stored in its own collection. And given that each (non-partitioned) collection has a starting run-rate of ~$24/month (for an S1 collection), you can see where the baseline cost would be for 15 collections (around $360).
You're paying for reserved storage and RU capacity. Regardless of RU utilization, you pay whatever the cost is for that capacity (e.g. S2 runs around $50/month / collection, even if you don't execute a single query). Similar to spinning up a VM of a certain CPU capacity and then running nothing on it.
The default throughput setting for the parse collections is set to 1000 RUPS. This will cost $60 per collection (at the rate of $6 per 100 RUPS). Once you finish the parse migration, the throughput can be lowered if you believe the workload decreased. This will reduce the charge.
To learn how to do this, take a look at https://azure.microsoft.com/en-us/documentation/articles/documentdb-performance-levels/ (Changing the throughput of a Collection).
The key thing to note is that DocumentDB delivers predictable performance by reserving resources to satisfy your application's throughput needs. Because application load and access patterns change over time, DocumentDB allows you to easily increase or decrease the amount of reserved throughput available to your application.
Azure is a "pay-for-what-you-use" model, especially around resources like DocumentDB and SQL Database where you pay for the level of performance required along with required storage space. So if your requirements are that all queries/transactions have sub-second response times, you may pay more to get that performance guarantee (ignoring optimizations, etc.)
One thing I would seriously look into is the DocumentDB cost estimation tool; this lets you estimate throughput costs based on the transaction types and sample JSON documents you provide:
So in this example, I have an 8KB JSON document, where I expect to store 500K of them (to get an approx. storage cost) and specifying I need throughput to create 100 documents/sec, read 10/sec, and update 100/sec (I used the same document as an example of what the update will look like).
NOTE this needs to be done PER DOCUMENT -- if you're storing documents that do not necessarily conform to a given "schema" or structure in the same collection, then you'll need to repeat this process for EVERY type of document.
Based on this information, I can use those values as inputs into the pricing calculator. This tells me I can estimate about $450/month for the DocumentDB service alone (if this were my anticipated usage pattern).
There are additional ways you can optimize the Request Units (RUs -- metric used to measure the cost of the given request/transaction -- and what you're getting billed for): optimizing index strategies, optimizing queries, etc. Review the documentation on Request Units for more details.

How fast is the Azure Search indexer and how can I index faster?

Each indexing batch is limited to between 1 and 1,000 documents. When I call it from my local machine or an Azure VM, I get 800 ms to 3,000 ms per 1,000-document batch. If I submit multiple batches asynchronously, the time spent is roughly the same. That means it would take 15-20 hours for my ~50M document collection.
Is there a way I can make it faster?
It looks like you are using our Standard S1 search service. Although there are a lot of things that can impact how fast data can be ingested, I would expect ingestion to a single-partition search service at a rate of about 700 docs/second for an average index, so I think your numbers are not far off from what I would expect. Please note that these are purely rough estimates, and you may see different results based on any number of factors (such as the number of fields, quantity of facets, etc.).
It is possible that some of the extra time you are seeing is due to the latency of uploading the content from your local machine to Azure, and it would likely be faster if you did this directly from Azure but if this is just a one time-upload that probably is not worth the effort.
You can slightly increase the speed of data ingestion by increasing the number of partitions you have, and an S2 search service will also ingest data faster, although both of these come at a cost.
By the way, if you have 50M documents, please make sure that you allocate enough partitions: a single S1 partition can handle 15M documents or 25 GB, so you will definitely need extra partitions for this service.
Also, as another side note, when you are uploading your content (and especially if you choose to do parallelized uploads), keep an eye on the HTTP responses, because if the search service exceeds the resources available you could get HTTP 207 (indicating one or more items failed to apply) or 503 (indicating the whole batch failed due to throttling). If throttling occurs, you should back off a bit to let the service catch up.
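To make the back-off idea concrete, here is a rough sketch of pushing 1,000-document batches to the REST indexing endpoint with a retry on throttling (Node.js with node-fetch; the service name, index name, api-version and admin key are placeholders):

const fetch = require('node-fetch');

const ENDPOINT = 'https://my-service.search.windows.net/indexes/my-index/docs/index?api-version=2015-02-28';

async function pushBatch(docs, attempt = 0) {
  const actions = docs.map((d) => ({ '@search.action': 'upload', ...d }));
  const res = await fetch(ENDPOINT, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', 'api-key': process.env.SEARCH_ADMIN_KEY },
    body: JSON.stringify({ value: actions }),
  });
  if (res.status === 503 && attempt < 5) {
    // Whole batch throttled: back off exponentially and retry.
    await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** attempt));
    return pushBatch(docs, attempt + 1);
  }
  if (res.status === 207) {
    // Partial failure: inspect per-document statuses and re-queue the failed keys.
    const body = await res.json();
    const failed = body.value.filter((v) => !v.status).map((v) => v.key);
    // ...retry "failed" here
  }
  return res.status;
}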
I think you're reaching the request capacity:
https://azure.microsoft.com/en-us/documentation/articles/search-limits-quotas-capacity/
I would try another tier (S1, S2). If you still face the same problem, try to get in touch with the support team.
Another option:
Instead of pushing data, try adding your data to Blob storage, DocumentDB or SQL Database, and then use the pull approach:
https://azure.microsoft.com/en-us/documentation/articles/search-howto-indexing-azure-blob-storage/

Simple select count(id) uses 100% of Azure SQL DTUs

This started off as this question, but it now seems more appropriate to ask it separately, since I realised it is a DTU-related question.
Basically, running:
select count(id) from mytable
EDIT: Adding a where clause does not seem to help.
Is taking between 8 and 30 minutes to run (whereas the same query on a local copy of SQL Server takes about 4 seconds).
Below is a screenshot of the MONITOR tab in the Azure portal when I run this query. Note I did this after not touching the database for about a week, with Azure reporting I had only used 1% of my DTUs.
A couple of extra things:
In this particular test, the query took 8:27 (min:sec) to run.
While it was running, the above chart actually showed the DTU line at 100% for a period.
The database is configured Standard Service Tier with S1 performance level.
The database is about 3.3GB and this is the largest table (the count is returning approx 2,000,000).
I appreciate it might just be my limited understanding but if somebody could clarify if this is really the expected behaviour (i.e. a simple count taking so long to run and maxing out my DTUs) it would be much appreciated.
From the query stats in your previous question we can see:
300ms CPU time
8000 physical reads
8:30 is about 500 seconds. We are certainly not CPU bound: 300 ms of CPU over 500 seconds is almost no utilization. We get 16 physical reads per second, which is far below what any physical disk can deliver. Also, the table is not fully cached, as evidenced by the presence of physical IO.
I'd say you are throttled. S1 corresponds to
934 transactions per minute
for some definition of transaction. That's about 15 transactions/sec. Maybe you are hitting a limit of one physical IO per transaction?! 15 and 16 are suspiciously similar numbers.
Test this theory by upgrading the instance to a higher scale factor. You might find that SQL Azure Database cannot deliver the performance you want at an acceptable price.
You also should find that repeatedly scanning half of the table results in a fast query because the allotted buffer pool seems to fit most of the table (just not all of it).
I had the same issue. Updating the statistics with fullscan on the table solved it:
update statistics mytable with fullscan
select count(id) should perform a clustered index scan if one is available and its statistics are up to date. Azure SQL should update statistics automatically, but it does not rebuild indexes automatically if they are completely out of date.
If there's a lot of INSERT/UPDATE/DELETE traffic on that table, I suggest manually rebuilding the indexes every once in a while.
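For example, a full rebuild of all indexes on the table is a one-liner (standard T-SQL; best run during a quiet period):

ALTER INDEX ALL ON mytable REBUILD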
http://blogs.msdn.com/b/dilkushp/archive/2013/07/28/fragmentation-in-sql-azure.aspx
and this SO post for more info: SQL Azure and Indexes
