DynamoDB Accelerator should reduce the response time to microseconds - node.js

I am using an AWS Lambda function, DynamoDB, and API Gateway for my application, and I ran a load test against it using Apache Bench with 1000 requests at a concurrency of 100. The test completed successfully, and here is the result:
Test #1
Concurrency Level: 100
Time taken for tests: 0.920 seconds
Complete requests: 1000
Failed requests: 0
Requests per second: 1086.60 [#/sec] (mean)
Time per request: 92.030 [ms] (mean)
Time per request: 0.920 [ms] (mean, across all concurrent requests)
After that I added DAX (DynamoDB Accelerator), expecting it to reduce the response time to microseconds, but I got essentially the same results:
Test #2
Concurrency Level: 100
Time taken for tests: 0.853 seconds
Complete requests: 1000
Failed requests: 0
Requests per second: 1172.12 [#/sec] (mean)
Time per request: 85.315 [ms] (mean)
Time per request: 0.853 [ms] (mean, across all concurrent requests)

Lambda and API Gateway have significant overhead themselves, which likely accounts for most of that 85ms. The only part that DAX can speed up is reading from DynamoDB.
For example, say a regular read (GetItem) from DynamoDB takes 2.5ms, and a cached read from DAX takes 500µs, and your Lambda does 5 sequential GetItems. In that case DynamoDB would take 12.5ms, and DAX would require 2.5ms, saving 10ms of time in the Lambda - but you still have to pay the cost of API Gateway and Lambda, which could easily be 50+ ms.
(I recommend reading up on Amdahl's Law if you're not familiar with it to understand the limitations of performance optimization.)
It may still make sense to use DAX for your use case because it may let you reduce your DynamoDB provisioned throughput or on-demand requests, but when using Lambda the latency improvement will only be noticeable if each invocation makes many DynamoDB requests.
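For reference, a minimal sketch of what the swap looks like in Node.js, assuming the amazon-dax-client package with AWS SDK v2; the DAX endpoint, table name, and key below are placeholders, not values from the question:

// Sketch: routing DocumentClient reads through DAX (AWS SDK v2).
// Endpoint, table, and key are placeholders.
const AWS = require('aws-sdk');
const AmazonDaxClient = require('amazon-dax-client');

const dax = new AmazonDaxClient({
  endpoints: [process.env.DAX_ENDPOINT], // e.g. mycluster.xxxx.dax-clusters.us-east-1.amazonaws.com:8111
  region: process.env.AWS_REGION,
});

// Same DocumentClient API as before; only the transport changes.
const docClient = new AWS.DynamoDB.DocumentClient({ service: dax });

exports.handler = (event, context, callback) => {
  docClient.get({ TableName: 'my-table', Key: { id: 'some-id' } }, (err, data) => {
    if (err) return callback(err);
    callback(null, data.Item);
  });
};

Even with this in place, the end-to-end number reported by ab is still dominated by the API Gateway and Lambda overhead described above.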

Related

Node+Express+MongoDB Native Client Performance issue

I am testing the performance of Node.js (ExpressJS/Fastify), Python (Flask) and Java (Spring Boot with WebFlux) with MongoDB. I hosted all these sample applications on the same server, one after another, so all services had the same environment. I used two different tools, loadtest and the Apache Benchmark CLI, to measure the performance.
All the code for the Node sample is present in this repository:
benchmark-nodejs-mongodb
I have executed multiple tests with various combinations of total requests and concurrent requests using both tools.
Apache Benchmark Total 1K requests and 100 concurrent
ab -k -n 1000 -c 100 http://{{server}}:7102/api/case1/1000
Load-Test Total 100 requests and 10 concurrent
loadtest http://{{server}}:7102/api/case1/1000 -n 100 -c 10
The results are also attached to the GitHub repository and are shocking for Node.js compared to the other technologies: either requests break in the middle of the test, or the test takes far too long to complete.
Server configuration (not a dedicated server):
CPU: Core i7 8th Gen 12 Core
RAM: 32GB
Storage: 2TB HDD
Network Bandwidth: 30Mbps
Mongo server: different nodes on different networks, connected over the Internet
Please help me understand this issue in detail. I do understand how the event loop works in Node.js, but I cannot identify the cause of this problem.
Reproduced
Setup:
Mongodb Atlas M30
AWS c4xlarge in the same region
Results:
No failures
Document Path: /api/case1/1000
Document Length: 37 bytes
Concurrency Level: 100
Time taken for tests: 33.915 seconds
Complete requests: 1000
Failed requests: 0
Keep-Alive requests: 1000
Total transferred: 265000 bytes
HTML transferred: 37000 bytes
Requests per second: 29.49 [#/sec] (mean)
Time per request: 3391.491 [ms] (mean)
Time per request: 33.915 [ms] (mean, across all concurrent requests)
Transfer rate: 7.63 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 3.1 0 12
Processing: 194 3299 1263.1 3019 8976
Waiting: 190 3299 1263.1 3019 8976
Total: 195 3300 1264.0 3019 8976
Length failures on heavier load:
Document Path: /api/case1/5000
Document Length: 37 bytes
Concurrency Level: 100
Time taken for tests: 176.851 seconds
Complete requests: 1000
Failed requests: 22
(Connect: 0, Receive: 0, Length: 22, Exceptions: 0)
Keep-Alive requests: 978
Total transferred: 259170 bytes
HTML transferred: 36186 bytes
Requests per second: 5.65 [#/sec] (mean)
Time per request: 17685.149 [ms] (mean)
Time per request: 176.851 [ms] (mean, across all concurrent requests)
Transfer rate: 1.43 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.9 0 4
Processing: 654 17081 5544.0 16660 37911
Waiting: 650 17323 5290.9 16925 37911
Total: 654 17081 5544.1 16660 37911
I copied the results of your tests from the GitHub repo for completeness:
Python
Java Spring Webflux
Node Native Mongo
So, there are 3 problems.
Upload bandwidth
ab -k -n 1000 -c 100 http://{{server}}:7102/api/case1/1000 uploads circa 700 MB of bson data over the wire.
30 Mb/s is less than 4 MB/s, which means transferring that much data takes at least ~175 seconds even at top speed. If you test from home, consumer-grade ISPs do not always give you the maximum speed, especially for uploads.
It's usually less of a problem for servers, especially if the application is hosted close to the database. I put some stats for the app and Mongo servers hosted on AWS in the same zone in the question itself.
Failed requests
All I could see were "Length" failures - the number of bytes actually received does not match the expected length.
It happens only in the last batch (100 requests) because of a race condition in the Node.js cluster module - the master closes connections to the workers before the worker's http.response.end() writes the data to the socket. On the TCP level it looks like this:
After 46 seconds of struggle there is no HTTP 200 OK, only FIN, ACK.
This is easy to fix by putting an nginx reverse proxy in front of a number of Node.js workers started manually instead of using the built-in cluster module, or by letting k8s do the resource management; a minimal sketch of the manual-worker approach follows.
In short - don't use the Node.js cluster module for network-intensive tasks.
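As an illustration only - the base port, worker count, and server.js entry point are assumptions, and nginx would be configured separately to load-balance across these ports - the manual-worker approach amounts to starting independent Node.js server processes instead of cluster workers:

// start-workers.js - sketch: spawn N independent server processes, each on its own port,
// instead of using the cluster module; nginx then balances across the ports.
const { spawn } = require('child_process');

const BASE_PORT = 7102;  // assumed base port
const NUM_WORKERS = 4;   // assumed worker count, e.g. one per core

for (let i = 0; i < NUM_WORKERS; i++) {
  const port = BASE_PORT + i;
  // server.js is the existing Express/Fastify app, reading its port from the environment.
  const child = spawn(process.execPath, ['server.js'], {
    env: { ...process.env, PORT: String(port) },
    stdio: 'inherit',
  });
  child.on('exit', (code) => {
    console.log(`worker on port ${port} exited with code ${code}`);
  });
}

Each process has its own event loop and its own sockets, so there is no master in the response path to close a connection early.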
Timeout
It's the ab timeout. When the network is the limiting factor and you increase the payload 5x, increase the default timeout (30 sec) by at least 4x:
ab -s 120 -k -n 1000 -c 100 http://{{server}}:7102/api/case1/5000
I am sure you did this for the other tests, since you report 99 sec/request for Java and 81 sec/request for Python.
Conclusion
There is nothing shockingly bad about Node.js here. There are some bugs in the cluster module, but it's a very niche use case to start from, and it's trivial to work around.
The flamechart: most of the CPU time is spent serialising/deserialising BSON and sending data to the stream, with some 10% spent in the most CPU-intensive part, bson serialiseInto.
If you are using only a single server, you can cache the database operations on the app side and get rid of database latency almost entirely, only committing to the database on an interval or when the cache expires.
If there are multiple servers, a scalable cache such as Redis can help. Redis also has client-side caching, and you can still layer your own cache on top of Redis to boost performance further.
A plain LRU cache written in Node.js can do at least 3-5 million lookups per second, and even more if key access is based on integers (so it can be sharded like an n-way associative LRU cache); a minimal sketch follows below.
If you group multiple clients into a single cache request, then getting help from a C++ app can reach hundreds of millions to billions of lookups per second, depending on the data type.
You can also try sharding the DB onto extra disk drives, such as a ramdisk, if the DB data is temporary.
The event loop can also be offloaded with a task queue for database operations and another queue for incoming requests. That way the event loop can overlap I/O more, instead of making each client wait for its own DB operation.
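As an illustration only - the capacity, TTL, and loadFromDb callback are placeholders, not part of the original answer - a minimal in-process LRU read-through cache in front of the database could look like this:

// Sketch of an LRU read-through cache. Map preserves insertion order,
// so the first key in the map is the least recently used one.
class LruCache {
  constructor(capacity = 10000, ttlMs = 60000) {
    this.capacity = capacity;
    this.ttlMs = ttlMs;
    this.map = new Map();
  }
  get(key) {
    const entry = this.map.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expires) {
      this.map.delete(key);
      return undefined;
    }
    // Re-insert to mark the key as most recently used.
    this.map.delete(key);
    this.map.set(key, entry);
    return entry.value;
  }
  set(key, value) {
    if (this.map.has(key)) this.map.delete(key);
    else if (this.map.size >= this.capacity) {
      this.map.delete(this.map.keys().next().value); // evict least recently used
    }
    this.map.set(key, { value, expires: Date.now() + this.ttlMs });
  }
}

const cache = new LruCache();

// loadFromDb is a placeholder for the actual MongoDB query.
async function getCase(id, loadFromDb) {
  const cached = cache.get(id);
  if (cached !== undefined) return cached;
  const doc = await loadFromDb(id);
  cache.set(id, doc);
  return doc;
}

Whether this is safe depends on how stale the data is allowed to be; writes still have to go to MongoDB, immediately or batched on an interval as suggested above.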

sub-second latency causing delay in spark application

I have a Spark batch job that runs every minute and processes ~200k records per batch. The usual processing delay of the app is ~30 seconds. In the app, for each record, we make a write request to DynamoDB. At times, the server-side DDB write latency is ~5 ms instead of 3.5 ms (roughly a 40% increase over the usual 3.5 ms latency). This causes the overall delay of the app to jump by a factor of 6 (~3 minutes).
How does sub-second latency of DDB call impact the overall latency of the app by 6 times?
PS: I have verified the root cause through overlapping the cloud-watch graphs of DDB put latency and the spark app processing delay.
Thanks,
Vinod.
Just a ballpark estimate:
If the average latency is 3.5 ms and about half of your 200k records are processed in 5 ms instead of 3.5 ms, this would leave us with:
200,000 × 0.5 × (5 ms − 3.5 ms) = 150,000 ms
of total delay, which is 150 seconds or 2.5 minutes. I don't know how well the process is parallelized, but this seems to be within the expected delay.

Calculating limit in Cosmos DB [duplicate]

I have a Cosmos DB Gremlin API account set up with 400 RU/s. If I have to run a query that needs 800 RUs, does that mean this query takes 2 seconds to execute? If I increase the throughput to 1600 RU/s, does this query execute in half a second? I am not seeing any significant change in query performance by playing around with the RUs.
As I explained in a different, but somewhat related answer here, Request Units are allocated on a per-second basis. In the event a given query will cost more than the number of Request Units available in that one-second window:
The query will be executed
You will now be in "debt" by the overage in Request Units
You will be throttled until your "debt" is paid off
Let's say you had 400 RU/sec, and you executed a query that cost 800 RU. It would complete, but then you'd be in debt for around 2 seconds (400 RU per second, times two seconds). At this point, you wouldn't be throttled anymore.
The speed in which a query executes does not depend on the number of RU allocated. Whether you had 1,000 RU/second OR 100,000 RU/second, a query would run in the same amount of time (aside from any throttle time preventing the query from running initially). So, aside from throttling, your 800 RU query would run consistently, regardless of RU count.
A single query is charged a given amount of request units, so it's not quite accurate to say "query needs 800 RU/s". A 1KB doc read is 1 RU, and writing is more expensive starting around 10 RU each. Generally you should avoid any requests that would individually be more than say 50, and that is probably high. In my experience, I try to keep the individual charge for each operation as low as possible, usually under 20-30 for large list queries.
The upshot is that 400 RU/s is more than enough to at least complete one query. It's when multiple requests combine to exceed the allowance within the time window that Cosmos tells you to wait some time before being allowed to succeed again. This is dynamic and based on a more or less black-box formula; it's not necessarily a simple division of allowance by charge, and no individual request is faster or slower because of the limit.
You can see whether you're being throttled by inspecting the response, or by checking the metrics in the Azure dashboard.
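As an illustration only - this uses the core @azure/cosmos (SQL API) SDK rather than the Gremlin driver from the question, and the endpoint, key, database, container, and query are placeholders - the per-operation RU charge and 429 throttling responses can be inspected like this:

// Sketch: reading the RU charge of a query and detecting throttling with @azure/cosmos.
const { CosmosClient } = require('@azure/cosmos');

async function main() {
  const client = new CosmosClient({
    endpoint: process.env.COSMOS_ENDPOINT,
    key: process.env.COSMOS_KEY,
  });
  const container = client.database('mydb').container('mycontainer');

  try {
    const { resources, requestCharge } = await container.items
      .query('SELECT * FROM c WHERE c.type = "vertex"')
      .fetchAll();
    console.log(`returned ${resources.length} items, charged ${requestCharge} RU`);
  } catch (err) {
    // A 429 status means the request was throttled rather than slow.
    if (err.code === 429) {
      console.log('throttled (429); back off and retry');
    } else {
      throw err;
    }
  }
}

main();

For the Gremlin API the same RU accounting applies; the charge is exposed in the response's status attributes.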

Loadtesting - High concurrency = High latency while using Lambda?

I created a Lambda function that fetches a record from DynamoDB.
Now I am trying to get some numbers on the performance of the architecture (which will have DAX enabled in a later iteration).
For the test I am using the loadtest package. Below are the details of 2 of my tests.
Test #1
AWS Lambda Configuration
Timeout: 30 sec
Memory: 1024 MB
Reserve concurrency: 900
Test Inputs
Max Requests:1000
Concurrency : 100
Test Result
totalRequests:1000
totalTimeSeconds:15.028303200999997
meanLatencyMs:1385.2
maxLatencyMs:6536
minLatencyMs:197
Test #2
AWS Lambda Configuration
Timeout: 30 sec
Memory: 1024 MB
Reserve concurrency: 900
Test Inputs
Max Requests: 1000
Concurrency : 1000
Test Result
totalRequests:1000
totalTimeSeconds:19.298303200999997
meanLatencyMs:8648.2
maxLatencyMs:18749
minLatencyMs:832
Questions
Why does the mean latency rise so much when I change the concurrency level from 100 to 1000, given that I have configured the reserved concurrency of the Lambda function to run 900 parallel instances?
Am I missing any AWS configuration that could improve the numbers?
Test 1 has 10x as many requests as concurrent executions, which helps to amortize the cost of any cold starts. Test 2, on the other hand, is worse because it consists almost entirely of cold starts.
Right now, your tests are not necessarily a fair comparison (depending on what you're trying to measure). You could try repeating Test 2 with the number of requests set to 10x the concurrency, as sketched below, to see if you still get results similar to Test 1.
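For example - a sketch only, with a placeholder URL; 10,000 requests is simply 10x the concurrency of 1,000 - the loadtest package can be driven from a small script:

// Sketch: re-running Test 2 with maxRequests = 10x concurrency so cold starts are amortized.
const loadtest = require('loadtest');

const options = {
  url: 'https://example.execute-api.us-east-1.amazonaws.com/prod/item/123', // placeholder endpoint
  maxRequests: 10000,
  concurrency: 1000,
};

loadtest.loadTest(options, (error, result) => {
  if (error) return console.error('load test failed:', error);
  console.log('totalRequests:', result.totalRequests);
  console.log('totalTimeSeconds:', result.totalTimeSeconds);
  console.log('meanLatencyMs:', result.meanLatencyMs);
  console.log('maxLatencyMs:', result.maxLatencyMs);
  console.log('minLatencyMs:', result.minLatencyMs);
});

If the mean latency drops toward the Test 1 numbers, the difference was mostly cold starts rather than DynamoDB.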
Have you checked whether Lambda was being throttled?
There is a default account-level concurrency limit for Lambda of around 1000, which is what you are using in the load test.
Are there any HTTP errors from API Gateway or in Lambda?
From the AWS docs:
"AWS Lambda will keep the unreserved concurrency pool at a minimum of 100 concurrent executions, so that functions that do not have specific limits set can still process requests. So, in practice, if your total account limit is 1000, you are limited to allocating 900 to individual functions."
Check :
https://itnext.io/the-everything-guide-to-lambda-throttling-reserved-concurrency-and-execution-limits-d64f144129e5

Why does a Node.js ORM perform worse than just running a simple query?

Just doing some performance testing using ORM2, and it seems to be about 4 times slower than just querying directly with SQL. Any thoughts?
https://github.com/gmaggiotti/rule-restApi/tree/orm-poc
Benchmark using ORM2
Document Path: /rules/
Document Length: 6355 bytes
Concurrency Level: 100
Time taken for tests: 5.745 seconds
Complete requests: 1000
Failed requests: 0
Total transferred: 6484000 bytes
HTML transferred: 6355000 bytes
Requests per second: 174.06 [#/sec] (mean)
Time per request: 574.526 [ms] (mean)
Time per request: 5.745 [ms] (mean, across all concurrent requests)
Transfer rate: 1102.13 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.3 0 2
Processing: 118 552 83.1 555 857
Waiting: 116 552 83.1 555 857
Total: 119 552 83.0 555 857
Benchmark using just sql
Document Path: /rules/
Document Length: 6355 bytes
Concurrency Level: 100
Time taken for tests: 1.630 seconds
Complete requests: 1000
Failed requests: 0
Total transferred: 6484000 bytes
HTML transferred: 6355000 bytes
Requests per second: 613.38 [#/sec] (mean)
Time per request: 163.032 [ms] (mean)
Time per request: 1.630 [ms] (mean, across all concurrent requests)
Transfer rate: 3883.92 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.3 0 2
Processing: 98 158 49.2 137 361
Waiting: 98 158 49.2 137 361
Total: 98 158 49.4 137 362
Not sure this is worthy of an answer, but it's too long for a comment.
This is (in my experience) true in every language/platform, for every ORM. In general, you don't use ORMs for query performance, you use them for code maintenance optimization and developer speed.
Why is this the case? Well, as a rule, ORMs have to translate what you say in language X into SQL, and in doing so they won't always come up with the most optimized query. They typically generate the query on the fly, so the actual building of the string of (ideally parameterized) SQL takes some small amount of time, as can reflection on the structure of the native code objects to figure out what the right column names are, and so on.
Many ORMs are also not completely deterministic in how they do this, which means the underlying DB has a harder time caching the query plan than it otherwise would. Also, I couldn't find your actual benchmark tests in the link you provided; it's possible that you're not actually measuring apples to apples.
So I can't answer specifically for the particular module you're using without spending more time on it than I care to, but in general I would discourage this line of questioning for the reasons stated above. The workflow I've often used is to do all my development with the ORM and worry about optimizing queries once I can do some production profiling; at that point I replace the worst offenders with direct SQL, or possibly stored procedures or views (depending on the DB engine), to improve performance where it actually matters.
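As an illustration only - this uses the mysql2 driver as an example, and the pool settings, table, and column names are placeholders that may not match the rule-restApi project - replacing an ORM call on a hot path with a direct parameterized query looks roughly like this:

// Sketch: a hot endpoint bypassing the ORM with a parameterized query.
const mysql = require('mysql2/promise');

const pool = mysql.createPool({
  host: process.env.DB_HOST,
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
  database: process.env.DB_NAME,
  connectionLimit: 10,
});

// Instead of something like Rule.find({ active: true }) through the ORM,
// issue the SQL directly and skip the query-building and object-mapping overhead.
async function getActiveRules() {
  const [rows] = await pool.execute(
    'SELECT id, name, body FROM rules WHERE active = ?',
    [1]
  );
  return rows;
}

The rest of the application can keep using the ORM; only the profiled hot spots need this treatment.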
