Abnormally high memory consumption MongoDB - node.js

Comrades, there is a problem. Short description: there is a Node.js application (the cluster module is used, running on 2 processes) and a replicated MongoDB. The total size of all databases is ~5 GB. Over a day of operation mongod exceeds the 8 GB limit and hangs. There are not many requests to the application, about 5,000 per hour. The question is how to limit MongoDB's memory usage so that it does not fall over.
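For reference, the usual knob here (assuming the WiredTiger storage engine, the default since MongoDB 3.2) is the WiredTiger cache size, settable on the command line or via storage.wiredTiger.engineConfig.cacheSizeGB in mongod.conf; the 2 GB below is only an example value, not a recommendation:

mongod --wiredTigerCacheSizeGB 2

Note that this caps only the cache: connections, in-memory sorts and replication buffers come on top of it, so the process will still use somewhat more RAM than this value.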

Related

Node+Express+MongoDB Native Client Performance issue

I am testing the performance of Node.js (ExpressJS/Fastify), Python (Flask) and Java (Spring Boot with WebFlux) against MongoDB. I hosted all these sample applications on the same server, one after another, so all services have the same environment. I used two different tools, loadtest and the Apache Benchmark CLI, for measuring the performance.
All the code for the Node sample is present in this repository:
benchmark-nodejs-mongodb
I executed multiple tests with both tools, using various combinations of total requests and concurrent requests.
Apache Benchmark Total 1K requests and 100 concurrent
ab -k -n 1000 -c 100 http://{{server}}:7102/api/case1/1000
Load-Test Total 100 requests and 10 concurrent
loadtest http://{{server}}:7102/api/case1/1000 -n 100 -c 10
The results are also attached to the GitHub repository and are shocking for Node.js compared to the other technologies: either requests break in the middle of the test, or the test takes far too long to complete.
Server Configuration: Not dedicated but
CPU: Core i7 8th Gen 12 Core
RAM: 32GB
Storage: 2TB HDD
Network Bandwidth: 30Mbps
Mongo Server: different nodes on different networks, connected through the Internet
Please help me understand this issue in detail. I do understand how the event loop works in Node.js, but I cannot identify what is causing this problem.
Reproduced
Setup:
MongoDB Atlas M30
AWS c4.xlarge in the same region
Results:
No failures
Document Path: /api/case1/1000
Document Length: 37 bytes
Concurrency Level: 100
Time taken for tests: 33.915 seconds
Complete requests: 1000
Failed requests: 0
Keep-Alive requests: 1000
Total transferred: 265000 bytes
HTML transferred: 37000 bytes
Requests per second: 29.49 [#/sec] (mean)
Time per request: 3391.491 [ms] (mean)
Time per request: 33.915 [ms] (mean, across all concurrent requests)
Transfer rate: 7.63 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 3.1 0 12
Processing: 194 3299 1263.1 3019 8976
Waiting: 190 3299 1263.1 3019 8976
Total: 195 3300 1264.0 3019 8976
Length failures on heavier load:
Document Path: /api/case1/5000
Document Length: 37 bytes
Concurrency Level: 100
Time taken for tests: 176.851 seconds
Complete requests: 1000
Failed requests: 22
(Connect: 0, Receive: 0, Length: 22, Exceptions: 0)
Keep-Alive requests: 978
Total transferred: 259170 bytes
HTML transferred: 36186 bytes
Requests per second: 5.65 [#/sec] (mean)
Time per request: 17685.149 [ms] (mean)
Time per request: 176.851 [ms] (mean, across all concurrent requests)
Transfer rate: 1.43 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.9 0 4
Processing: 654 17081 5544.0 16660 37911
Waiting: 650 17323 5290.9 16925 37911
Total: 654 17081 5544.1 16660 37911
I copied the results of your tests from the GitHub repo for completeness: Python, Java Spring WebFlux, and Node native Mongo driver (result screenshots are in the repo).
So, there are 3 problems.
Upload bandwidth
ab -k -n 1000 -c 100 http://{{server}}:7102/api/case1/1000 uploads circa 700 MB of BSON data over the wire.
30 Mbit/s is less than 4 MB/s, so just transferring that much data takes at least ~175 seconds even at top speed. If you test from home, consumer-grade ISPs do not always give you the maximum speed, especially for upload.
It's usually less of a problem for servers, especially if the application is hosted close to the database. I put some stats for the app and Mongo servers hosted on AWS in the same zone in the question itself.
Failed requests
All I could notice are "Length" failures - the number of bytes actually received does not match the expected length.
They happen only in the last batch (100 requests) because of a race condition in the Node.js cluster module - the master closes the connections to the worker processes before the worker's http.response.end() has written the data to the socket. At the TCP level it looks like this: after 46 seconds of struggling there is no HTTP 200 OK, only FIN, ACK.
This is very easy to fix by using an nginx reverse proxy plus a number of Node.js workers started manually instead of the built-in cluster module (as sketched below), or by letting k8s do the resource management.
In short: don't use the Node.js cluster module for network-intensive tasks.
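A minimal sketch of the manual-workers approach, with hypothetical ports, database and collection names (nothing here is taken from the benchmark repository): run the same plain HTTP server script several times, each process on its own port, and put nginx (or any reverse proxy) in front with an upstream listing those ports.

// worker.js - one independent process per port, no cluster master in the middle
const http = require('http');
const { MongoClient } = require('mongodb');

const port = Number(process.env.PORT) || 7103;            // e.g. 7103, 7104, ...
const url = process.env.MONGO_URL || 'mongodb://localhost:27017';

MongoClient.connect(url).then((client) => {
  const col = client.db('bench').collection('cases');     // hypothetical db/collection
  http.createServer(async (req, res) => {
    // each worker owns its own sockets, so nothing can close the connection
    // before res.end() has flushed the response
    const docs = await col.find({}).limit(1000).toArray();
    res.setHeader('Content-Type', 'application/json');
    res.end(JSON.stringify(docs));
  }).listen(port, () => console.log(`worker listening on ${port}`));
});

Start it as PORT=7103 node worker.js, PORT=7104 node worker.js, and so on, and list those ports in an nginx upstream block.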
Timeout
It's the ab timeout. When the network is the limiting factor and you increase the payload 5x, increase the default timeout (30 seconds) by at least 4x:
ab -s 120 -k -n 1000 -c 100 http://{{server}}:7102/api/case1/5000
I am sure you did this for the other tests, since you report 99 sec/request for Java and 81 sec/request for Python.
Conclusion
There is nothing shockingly bad about Node.js. There are some bugs in the cluster module, but it's a very niche use case to start from, and it's trivial to work around.
The flame chart shows that most of the CPU time is spent serialising/deserialising BSON and sending data to the stream, with some 10% spent in the most CPU-intensive part, BSON's serializeInto.
If you are using only a single server, then you can cache the database operations on the app side, get rid of the database latency almost entirely, and only commit to the database on an interval or when the cache expires; a rough sketch follows.
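For illustration only, assuming a single app instance and a hypothetical items collection (the TTL and flush interval are made-up numbers, not from the question):

// App-side cache with periodic write-back; a sketch, not the answerer's code
const cache = new Map();   // id -> { doc, expires }
const dirty = new Map();   // id -> doc waiting to be committed to MongoDB
const TTL_MS = 60000;

async function readItem(col, id) {
  const hit = cache.get(id);
  if (hit && hit.expires > Date.now()) return hit.doc;    // served without touching MongoDB
  const doc = await col.findOne({ _id: id });
  cache.set(id, { doc, expires: Date.now() + TTL_MS });
  return doc;
}

function writeItem(id, doc) {
  cache.set(id, { doc, expires: Date.now() + TTL_MS });
  dirty.set(id, doc);                                      // committed by the flusher, not per request
}

function startFlusher(col, intervalMs = 5000) {
  setInterval(async () => {
    // for brevity this ignores writes that arrive while a flush is in progress
    for (const [id, doc] of dirty) {
      await col.replaceOne({ _id: id }, doc, { upsert: true });
    }
    dirty.clear();
  }, intervalMs);
}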
If there are multiple servers, you may get help from a scalable cache such as Redis. Redis also has client-side caching, and you can still apply your own cache on top of Redis to boost performance further.
A plain LRU cache written in Node.js can do at least 3-5 million lookups per second, and even more if key access is based on integers (so it can be sharded like an n-way associative LRU cache). For example:
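A minimal sketch of such a cache using a plain Map (insertion order doubles as recency); the capacity is arbitrary, and the throughput figures above are the answerer's claim, not something this sketch guarantees:

class LRU {
  constructor(capacity = 10000) {
    this.capacity = capacity;
    this.map = new Map();
  }
  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    this.map.delete(key);          // re-insert to mark as most recently used
    this.map.set(key, value);
    return value;
  }
  set(key, value) {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.capacity) {
      // the first key in insertion order is the least recently used
      this.map.delete(this.map.keys().next().value);
    }
  }
}

Usage: const docs = new LRU(); docs.set(42, { title: 'article' }); docs.get(42);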
If you group multiple clients into a single cache request, then getting help from a C++ app can reach hundreds of millions to billions of lookups per second, depending on the data type.
You can also try sharding the database onto extra drives, such as a ramdisk, if the data is temporary.
The event loop can be offloaded with a task queue for database operations and another queue for incoming requests. This way the event loop can harness I/O overlapping more, instead of making each client wait for its own database operation; one possible interpretation is sketched below.
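One way to read that suggestion (an interpretation, not the answerer's code) is to micro-batch lookups: queue the keys that arrive within a few milliseconds and resolve them with a single $in query, so many clients share one round trip. The 5 ms window and the collection are assumptions:

let pending = [];                // { id, resolve, reject }
let timer = null;

function queuedFind(col, id) {
  return new Promise((resolve, reject) => {
    pending.push({ id, resolve, reject });
    if (!timer) timer = setTimeout(() => flush(col), 5);   // wait up to 5 ms for more requests
  });
}

async function flush(col) {
  const batch = pending;
  pending = [];
  timer = null;
  try {
    const ids = batch.map((p) => p.id);
    const docs = await col.find({ _id: { $in: ids } }).toArray();  // one round trip for the batch
    const byId = new Map(docs.map((d) => [String(d._id), d]));
    for (const p of batch) p.resolve(byId.get(String(p.id)) || null);
  } catch (err) {
    for (const p of batch) p.reject(err);
  }
}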

sub-second latency causing delay in spark application

I have a Spark batch job that runs every minute and processes ~200k records per batch. The usual processing delay of the app is ~30 seconds. In the app, for each record, we make a write request to DynamoDB. At times, the server-side DDB write latency is ~5 ms instead of 3.5 ms (~30% increase w.r.t. the usual 3.5 ms latency). This causes the overall delay of the app to increase six-fold (~3 minutes).
How does the sub-second latency of a DDB call impact the overall latency of the app by 6 times?
PS: I have verified the root cause by overlaying the CloudWatch graphs of DDB put latency and the Spark app processing delay.
Thanks,
Vinod.
Just a ballpark estimate:
If the average is 3.5 ms latency and about half of your 200k records are processed in 5ms instead of 3.5ms, this would leave us with:
200,000 * 0.5 * (5 - 3.5) = 150,000 (ms)
of total delay, which is 150 seconds or 2.5 minutes. I don't know how well the process is parallelized, but this seems to be within the expected delay.

azure what is Time Aggregation Max or AVG

I am using SQL Azure (SQL Server) for my app. My app was working perfectly until recently; the MAX DTU usage has been 100% but the AVG DTU usage is around 50%.
Which value should I monitor to scale the service, MAX or AVG?
I found this on the net after lots of searching:
CPU max/min and average are computed within that 1 minute. As 1 minute (60 seconds) is the finest granularity, if you choose max, for example, and the CPU has touched 100% even for 1 second, it will be shown as 100% for that entire minute. Perhaps the best is to use the average; in that case the average CPU utilization over those 60 seconds will be shown under that 1-minute metric.
which sort of helped me understand what it all meant - thanks to bradbury9 too for your input.

Memory management scenario with MongoDB & Node.JS

I'm implementing a medium-scale marketing e-commerce affiliation site, which has the following estimates:
Total Size of Data: 5 - 10 GB
Indexes on Data: 1 GB approx (which I wanted to be in memory)
Disk Size (fast I/O): 20-25 GB
Memory: 2 GB
App development: node.js
Working set estimation per query: average 1-2 KB, maximum 20-30 KB of text-based article
I'm trying to understand whether MongoDB would be the right choice of database or not. The index will be considerably smaller than memory, but I have noticed that after querying MongoDB it keeps the result set in memory as a query cache. Within 8 hours I expect the queries to have touched almost 95% of the data. In that scenario, how will MongoDB manage such limited memory, with the node.js app instance also running on the same server?
Would MongoDB be the right choice for this scenario, or should I go for another JSON-based NoSQL database?

Solr Indexing Time

Solr 1.4 is doing great with respect to indexing on a dedicated physical server (Windows Server 2008). Indexing around 1 million full-text documents (around 4 GB in size) takes around 20 minutes with heap size = 512 MB - 1 GB and 4 GB RAM.
However, while using Solr on a VM with 4 GB RAM, it took 50 minutes to index the first time. Note that there are no network delays and no RAM issues. When I then increased the RAM to 8 GB and increased the heap size, the indexing time increased to 2 hours. That was really strange. Note that except for SQL Server there is no other process running, and there are no network delays. However, I have not checked file I/O - can that be the bottleneck? Does Solr have any issues running in a virtualized environment?
I read a paper today by Brian & Harry, "ON THE RESPONSE TIME OF A SOLR SEARCH ENGINE IN A VIRTUALIZED ENVIRONMENT", and they claim that performance deteriorates when RAM is increased while Solr is running on a VM, but that is with respect to query times, not indexing times.
I am a bit confused as to why it took longer on the VM when I repeated the same test a second time with increased heap size and RAM.
I/O on a VM will always be slower than on dedicated hardware. This is because the disk is virtualized and I/O operations must pass through an extra abstraction layer. Indexing requires intensive I/O operations, so it's not surprising that it runs more slowly on a VM. I don't know why adding RAM causes a slowdown though.
