How to do capacity control in this case? - node.js

My app reads data from DynamoDB, which has a pre-configured read capacity that limits read throughput. I'd like to throttle my queries so they don't reach the limit. Here is how I'm doing this now:
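const READ_CAPACITY = 80

// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

async function query(params) {
  const consumed = await getConsumedReadCapacity()
  if (consumed > READ_CAPACITY) {
    // Wait in proportion to how far over the limit the current second is.
    await sleep((consumed - READ_CAPACITY) * 1000 / READ_CAPACITY)
  }
  // ConsumedCapacity is only returned when params.ReturnConsumedCapacity is set (e.g. 'TOTAL').
  const result = await dynamoDB.query(params).promise()
  await addConsumedReadCapacity(result.ConsumedCapacity.CapacityUnits)
  return result.Items
}

// Capacity consumed so far in the current one-second window (0 if the key does not exist yet).
async function getConsumedReadCapacity() {
  return Number(await redis.get(`read-capacity:${Math.floor(Date.now() / 1000)}`)) || 0
}

// INCRBY only accepts integers, so fractional capacity units are rounded up.
async function addConsumedReadCapacity(n) {
  return redis.incrby(`read-capacity:${Math.floor(Date.now() / 1000)}`, Math.ceil(n))
}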
As you can see, a query first checks the currently consumed read capacity. If it doesn't exceed READ_CAPACITY, the query runs and the capacity it consumed is added to the running total.
The problem is that the code runs on several servers, so there are race conditions: the consumed > READ_CAPACITY check passes, but before dynamoDB.query executes, DynamoDB reaches its capacity limit because of queries from other processes on other servers. How can I improve this?

Some things to try instead of avoiding hitting capacity limits...
Try, then back-off
From DynamoDB error handling:
ProvisionedThroughputExceededException: The AWS SDKs for DynamoDB automatically retry requests that receive this exception. Your request is eventually successful, unless your retry queue is too large to finish. Reduce the frequency of requests, using Error Retries and Exponential Backoff.
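As a rough illustration of "try, then back off", here is a minimal sketch of an application-level retry wrapper with exponential backoff and random jitter. The helper name `withBackoff` and its options are made up for this example; it assumes the thrown error carries a `code` property, as errors from the AWS SDK for JavaScript v2 do. The SDK already retries internally, so a wrapper like this would only matter after the SDK's own retries are exhausted.

```javascript
// Hypothetical helper (not part of the AWS SDK): retry an async operation
// with exponential backoff and full jitter.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const isThrottled = (err) => err.code === 'ProvisionedThroughputExceededException';

async function withBackoff(operation, { maxRetries = 5, baseMs = 50, isRetryable = isThrottled } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      // Give up on non-throttling errors or once the retry budget is spent.
      if (attempt >= maxRetries || !isRetryable(err)) throw err;
      // Wait a random time between 0 and baseMs * 2^attempt.
      await sleep(Math.random() * baseMs * 2 ** attempt);
    }
  }
}

// Usage (assuming `dynamoDB` is a DocumentClient as in the question):
//   const result = await withBackoff(() => dynamoDB.query(params).promise());
```

The jitter spreads retries from several servers over time, so they are less likely to hit the table in the same instant again.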
Burst
From Best Practices for Tables:
DynamoDB provides some flexibility in the per-partition throughput provisioning. When you are not fully utilizing a partition's throughput, DynamoDB retains a portion of your unused capacity for later bursts of throughput usage. DynamoDB currently retains up to five minutes (300 seconds) of unused read and write capacity. During an occasional burst of read or write activity, these extra capacity units can be consumed very quickly—even faster than the per-second provisioned throughput capacity that you've defined for your table.
DynamoDB Auto Scaling
From Managing Throughput Capacity Automatically with DynamoDB Auto Scaling:
DynamoDB auto scaling uses the AWS Application Auto Scaling service to dynamically adjust provisioned throughput capacity on your behalf, in response to actual traffic patterns. This enables a table or a global secondary index to increase its provisioned read and write capacity to handle sudden increases in traffic, without throttling. When the workload decreases, Application Auto Scaling decreases the throughput so that you don't pay for unused provisioned capacity.
Cache in SQS
Some AWS customers have implemented a system where, if throughput is exceeded, they store the data in an Amazon SQS queue. They then have a process that retrieves the data from the queue and inserts into the table later, when there is less demand on throughput. This allows the DynamoDB table to be provisioned based on average throughput rather than peak throughput.
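A minimal sketch of that buffering idea, with the storage calls injected so the logic stays independent of any client library. `putItem`, `enqueue` and `dequeue` are hypothetical wrappers (for example around a DynamoDB put, an SQS send and an SQS receive); the sketch assumes `dequeue` resolves to `undefined` when the queue is empty and that throttling errors carry `code === 'ProvisionedThroughputExceededException'`.

```javascript
const THROTTLED = 'ProvisionedThroughputExceededException';

// Try the table first; if the write is throttled, park the item in the queue.
async function writeOrBuffer(item, { putItem, enqueue }) {
  try {
    await putItem(item);
    return 'written';
  } catch (err) {
    if (err.code !== THROTTLED) throw err;
    await enqueue(JSON.stringify(item));
    return 'buffered';
  }
}

// Run later, when demand is lower: move buffered items into the table.
// With a real queue you would delete each message only after its write succeeds.
async function drainBuffer({ dequeue, putItem }) {
  let written = 0;
  for (let body = await dequeue(); body !== undefined; body = await dequeue()) {
    await putItem(JSON.parse(body));
    written++;
  }
  return written;
}
```

Note that this pattern applies to writes. Reads, as in the question, cannot be deferred the same way unless the caller can wait for the result.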

Related

Cosmos DB metrics report 100x more requests than expected

I'm comparing the service side metrics of my app with the metrics emitted by Cosmos DB and I can see a 100x difference in request counts.
Is my container misconfigured? Am I querying the wrong way? Is Cosmos performing multiple requests internally for each query I'm running against it?
The metric I'm looking at in Cosmos is TotalRequests/Count/5min.
The container has indexes on all attributes + a few composite indexes.
The query I'm running is:
SELECT *
FROM x
WHERE x.partitionKey = 0
and x.index1 = 1
and x.index2 = 2
The container is suffering from a VERY hot partition.
Each request consumes about 5 RUs.
The consistency level is BOUNDED_STALENESS.
I tried changing the consistency level to EVENTUAL which brought the consumed RUs down, but I'm still seeing a huge amount of requests that aren't accounted for.
The Total Requests metric includes every request between the SDK and the service. The SDK makes frequent calls to the service when an SDK instance is first created, then makes regular calls for metadata and other information. If you want to see just requests made by user, apply a filter for OperationType and select the operations you want to monitor.
It's not clear why you were using Bounded Staleness. Reads using Strong and Bounded Staleness consume twice the RUs because they read from 2 replicas rather than the 1 replica used by the weaker consistency models. In addition to differences in cost, there are of course differences in whether you may read stale data or not. They also play a big role in your RTO and RPO in multi-region scenarios.
A hot partition has no impact on throughput consumption. 5 RUs for a query is actually very good.

Azure Event Hub throughput

In the service high-level description Microsoft mentions that I can stream millions of events per second and it is highly scalable
Event Hubs is a fully managed, real-time data ingestion service that’s simple, trusted, and scalable. Stream millions of events per second from any source
https://azure.microsoft.com/en-us/services/event-hubs/
But when I go to the official documentation, the maximum number of throughput units (TUs) is 20, which translates into 1000 events per second per TU * 20 TUs = 20,000 events per second:
Event Hubs traffic is controlled by throughput units. A single throughput unit allows 1 MB per second or 1000 events per second of ingress and twice that amount of egress. Standard Event Hubs can be configured with 1-20 throughput units, and you can purchase more with a quota increase support request.
https://azure.microsoft.com/en-us/services/event-hubs/
How do 20 TUs translate into streaming millions of events?
You can go beyond 20 TUs by raising a support request.
But if you need to go very high, you can also use Dedicated Clusters for Event Hubs.
Two important notes from the docs:
A Dedicated cluster guarantees capacity at full scale, and can ingress up to gigabytes of streaming data with fully durable storage and sub-second latency to accommodate any burst in traffic.
At high ingress volumes (>100 TUs), a cluster costs significantly less per hour than purchasing a comparable quantity of throughput units in the Standard offering.
https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-dedicated-overview
The throughput capacity of Event Hubs is controlled by throughput units. Throughput units are pre-purchased units of capacity. A single throughput unit lets you: Ingress: Up to 1 MB per second or 1000 events per second (whichever comes first). Egress: Up to 2 MB per second or 4096 events per second.
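To make the arithmetic concrete, here is a small sketch that estimates how many throughput units a given ingress rate needs, using only the per-TU ingress limits quoted above (1 MB per second or 1000 events per second, whichever comes first). The function name is made up, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```javascript
// Ingress limits per throughput unit, as quoted from the docs above.
const INGRESS_EVENTS_PER_TU = 1000;
const INGRESS_BYTES_PER_TU = 1024 * 1024; // assumption: 1 MB = 1024 * 1024 bytes

// Whichever limit is hit first (event count or bytes) determines the TU count.
function requiredThroughputUnits(eventsPerSecond, avgEventBytes) {
  const byCount = eventsPerSecond / INGRESS_EVENTS_PER_TU;
  const bySize = (eventsPerSecond * avgEventBytes) / INGRESS_BYTES_PER_TU;
  return Math.ceil(Math.max(byCount, bySize));
}

console.log(requiredThroughputUnits(20000, 512));   // 20
console.log(requiredThroughputUnits(1000000, 100)); // 1000
```

By this estimate, one million small events per second would need on the order of 1000 TUs, which is far beyond the 20 TU Standard limit and in the range where the answer suggests a Dedicated cluster.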

Request Timeout in Azure Cosmos DB in sdk v3

I am inserting data into Azure Cosmos DB. After some time it throws an error (Request Timeout: 408). I have increased the Request Timeout to 10 minutes.
Also, I iterate over each item from the API and call the CreateItemAsync() method for each one instead of using the bulk executor.
Data To Insert = 430 K Items
Microsoft.Azure.Cosmos SDK used = v3
Container Throughput = 400
Can anyone help me fix this issue?
Just increase your throughput. But it's going to cost you a lot of money if you leave it increased. 400 RU/s isn't going to cut it unless you batch your operation to the point where it's going to take a long time to insert 400k items.
If this is a one-time deal, increase your RU/s to 2000+, then start slowly inserting items. I would say, depending on the size of your documents, maybe do 50 at a time, then wait 250 milliseconds, then do 50 more until you are done. You will have to play with this though.
Once you are done, move your RU/s back down to 400.
Cosmos DB can be ridiculously expensive, so be careful.
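The batch-and-pause approach described above (50 items, wait 250 milliseconds, repeat) could look roughly like the sketch below. The question uses the .NET SDK, so this is only an illustration of the pattern in JavaScript; `insertInBatches` and `insertOne` are hypothetical names, where `insertOne` stands for whatever call creates a single item. The batch size and pause are starting values to tune, as the answer says.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Insert items in fixed-size batches, pausing between batches so the
// container's provisioned throughput has time to recover.
async function insertInBatches(items, insertOne, { batchSize = 50, pauseMs = 250 } = {}) {
  let inserted = 0;
  for (let start = 0; start < items.length; start += batchSize) {
    const batch = items.slice(start, start + batchSize);
    // Items within one batch are sent concurrently.
    await Promise.all(batch.map((item) => insertOne(item)));
    inserted += batch.length;
    // No need to pause after the final batch.
    if (start + batchSize < items.length) await sleep(pauseMs);
  }
  return inserted;
}

// Usage (hypothetical): await insertInBatches(allItems, (item) => createItem(item));
```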
ETA:
This is from some documentation:
Increase throughput: The duration of your data migration depends on the amount of throughput you set up for an individual collection or a set of collections. Be sure to increase the throughput for larger data migrations. After you've completed the migration, decrease the throughput to save costs. For more information about increasing throughput in the Azure portal, see performance levels and pricing tiers in Azure Cosmos DB.
The documentation page for 408 timeouts lists a number of possible causes to investigate.
Aside from addressing the root cause with the SDK client app or increasing throughput, you might also consider leveraging Azure Data Factory to ingest the data as in this example. This assumes your data load is an initialization process and your data can be made available as a blob file.

DynamoDB queries take 2000ms for no obvious reason

I am using DynamoDB to query some data. At the bottom you can see the number of milliseconds for a certain percentage of requests.
Most of the time DynamoDB works fine, with about 100ms response time (there are only queries on the primary key or on indexes). About 0.4% of requests take more than 800ms (which is the limit within which the service needs to respond), even at "calm" times. It's not perfect, but good enough.
However, a certain load (which is not even big) triggers DynamoDB behaviour that causes about 5% of requests to have 2000ms response times. There are several queries in one request, but if one of these queries takes more than 800ms, the whole request is terminated by the caller.
We are using Node.js and this is how we use the AWS library:
This is how we initiate the library
const dynamodb = new AWS.DynamoDB.DocumentClient({
region: config.server.region,
});
/// ...
dynamodb.query(params).promise()
Example of query
{
"TableName":"prod-UserDeviceTable",
"KeyConditionExpression":"#userId = :userId",
"ExpressionAttributeNames":{
"#userId":"userId"
},
"ExpressionAttributeValues":{
":userId":"e3hs8etjse3t13se8h7eh4"
},
"ConsistentRead":false
}
Few more notes:
CPU of the service is currently autoscaled to stay below 10%. Higher load or CPU usage does not make the service perform worse; we even tried reaching 50% with the same response times.
There is no gradual growth in response time. It's usually less than 100ms and then, out of nowhere, it's 2000ms for 1-5% of requests, as you can see in the graph.
The queries do not use ConsistentRead
We are using Global Tables in 4 regions. This happens in all regions (the service is deployed independently in all 4 regions as well)
DynamoDB metrics for tables used do not show any spikes in response times
Service is deployed in ECS Fargate, but it was deployed before in ECS EC2 with same results
All tables in all regions are on-demand
The 2000ms delay happens anywhere in the service during the request lifetime when querying DynamoDB (it happens for different queries, even for different tables)
When I repeat the same request that took more than 2000ms again, it takes only 100ms or less

What should be done when the provisioned throughput is exceeded?

I'm using the AWS SDK for JavaScript (Node.js) to read data from a DynamoDB table. The auto scaling feature does a great job most of the time, and the consumed Read Capacity Units (RCU) are really low for most of the day. However, there's a scheduled job that runs around midnight and consumes about 10x the provisioned RCU, and since auto scaling takes some time to adjust the capacity, there are a lot of throttled read requests. Furthermore, I suspect my requests are not being completed (though I can't find any exceptions in my error log).
In order to handle this situation, I've considered increasing the provisioned RCU using the AWS API (updateTable) but calculating the number of RCU my application needs may not be straightforward.
So my second guess was to retry failed requests and simply wait for auto scaling to increase the provisioned RCU. As pointed out by the AWS docs and some Stack Overflow answers (particularly about ProvisionedThroughputExceededException):
The AWS SDKs for Amazon DynamoDB automatically retry requests that receive this exception. So, your request is eventually successful, unless the request is too large or your retry queue is too large to finish.
I've read similar questions (this one, this one and this one) but I'm still confused: is this exception raised if the request is too large or the retry queue is too large to finish (therefore after the automatic retries) or actually before the retries?
Most important: is that the exception I should be expecting in my context? (so I can catch it and retry until auto scale increases the RCU?)
Yes.
Every time your application sends a request that exceeds your capacity, you get a ProvisionedThroughputExceededException from DynamoDB. However, your SDK handles this for you and retries. The default DynamoDB retry delay starts at 50ms, the default number of retries is 10, and backoff is exponential by default.
This means you get retries at:
50ms
100ms
200ms
400ms
800ms
1.6s
3.2s
6.4s
12.8s
25.6s
If after the 10th retry your request has still not succeeded, the SDK passes the ProvisionedThroughputExceededException back to your application and you can handle it how you like.
You could handle it by increasing throughput provision but another option would be to change the default retry times when you create the Dynamo connection. For example
new AWS.DynamoDB({maxRetries: 13, retryDelayOptions: {base: 200}});
This would mean you retry 13 times, with an initial delay of 200ms. The delay before the final retry would then be 819.2s rather than 25.6s, and the cumulative wait across all retries about 1,638s rather than about 51s.
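The retry schedule can be checked with a few lines of arithmetic. This sketch assumes plain doubling from the base delay, as in the list above; `retryDelaysMs` is a made-up helper, not an SDK function. As far as I know, the SDK also randomizes each delay, in which case these figures are upper bounds rather than exact waits.

```javascript
// Delay before each retry, assuming delay = baseMs * 2^retryIndex with no jitter.
function retryDelaysMs(maxRetries, baseMs) {
  return Array.from({ length: maxRetries }, (_, i) => baseMs * 2 ** i);
}

const sum = (values) => values.reduce((total, v) => total + v, 0);

const defaults = retryDelaysMs(10, 50);
const custom = retryDelaysMs(13, 200);

console.log(defaults[defaults.length - 1], sum(defaults)); // 25600 51150
console.log(custom[custom.length - 1], sum(custom));       // 819200 1638200
```

So the 25.6s and 819.2s figures are the longest single delays, while the cumulative waits are about 51s and about 1,638s respectively.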

Resources