How to use a lot of memory in node.js? - node.js

I am working on a system designed to keep processes behaving properly, and I need to test its memory-usage kill switch. I have yet to find a way to quickly and reliably consume a lot of system memory with node.js. I have tried using arrays, but it seems as if the garbage collector compresses them. Basically, I need to deliberately cause a fast memory leak.
EDIT: I need to reach around 4 gigabytes of used system memory by one process.

If the garbage collector is compressing your array, maybe you are always pushing the same value into it.
You should try something like:
const a = []
for (let i = 0; i < 9000000000; i++) {
  a.push(i)
}
You should also run your process with a flag to increase the memory limit. Note that with 9 billion iterations the array should grow well past 4GB, so the process will eventually exit with a non-zero status code (out of memory).
node --max_old_space_size=4096 yourscript.js
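If the goal is simply to drive one process to around 4 GB as fast as possible, another option is to hold references to large Buffer allocations, since Buffer data is allocated outside the V8 heap and is never compacted. This is a minimal sketch of my own, not part of the original answer; the 256 MB chunk size and the variable names are arbitrary:
const chunks = []
const chunkSize = 256 * 1024 * 1024          // 256 MB per allocation
const target = 4 * 1024 * 1024 * 1024        // roughly 4 GB in total
while (chunks.length * chunkSize < target) {
  // Buffer.alloc zero-fills, which forces the pages to actually be committed
  chunks.push(Buffer.alloc(chunkSize))
  console.log(Math.round(process.memoryUsage().rss / 1024 / 1024) + ' MB RSS')
}
Because Buffer memory is counted as external/RSS rather than V8 heap, it is not governed by --max_old_space_size, which should make it a convenient way to trip an RSS-based kill switch.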

Related

Node.js never Garbage Collects Buffers once they fall out of scope

I have researched a lot before posting this. This is a collection of all the things I have discovered about garbage collecting and at the end I'm asking for a better solution than the one I found.
Summary
I am hosting a Node.js app on Heroku. When a particular endpoint of my server is hit, one that uses a lot of buffers for image manipulation (using sharp, but this is a buffer issue, not a sharp one), it takes only a few requests for the buffers to occupy all of the external and RSS memory (I used process.memoryUsage() for diagnostics), because even though those variables have fallen out of scope, or have been set to null, they are never garbage collected. The outcome is that external and RSS memory keep growing, and after a few requests my 512 MB dyno quota is reached and my dyno crashes.
Now, I have made a minimal reproducible example, which shows that simply declaring a new buffer within a function, and calling that function 10 times, results in the buffers never being garbage collected, even after the functions have finished executing.
I'm writing to find a better way to make sure Node garbage collects the unreferenced buffers, and to understand why it doesn't do so by default. The only solution I have found so far is to call global.gc().
NOTE
In the minimal reproducible example I simply use a buffer, with no external libraries, and it is enough to recreate the issue I am having with sharp, because it is really an issue with Node.js buffers themselves.
Also note that what increases is the external memory and the RSS. The arrayBuffers, heapUsed and heapTotal figures are not affected. I have not yet found a way to trigger the garbage collector when a certain threshold of external memory is used.
Finally, my Heroku server has been running with no incoming requests for up to 8 hours now, and the garbage collector still hasn't cleared out the external memory and the RSS, so it is not a matter of waiting. The same holds true for the minimal reproducible example: even with timers, the garbage collector doesn't do its job.
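For the threshold idea mentioned above, here is a rough sketch of my own (not from the question or any answer): poll process.memoryUsage() and force a collection whenever external memory crosses a limit. The 200 MB limit and the 10-second interval are arbitrary, and it requires starting the process with --expose-gc:
// Requires: node --expose-gc index.js
const EXTERNAL_LIMIT = 200 * 1024 * 1024
setInterval(() => {
  const { external } = process.memoryUsage()
  if (global.gc && external > EXTERNAL_LIMIT) {
    global.gc()
    console.log('forced gc:', process.memoryUsage())
  }
}, 10000).unref() // unref() so this timer alone doesn't keep the process alive
The collection is still a blocking pause; this only bounds how much external memory accumulates before one happens.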
Minimal reproducible example - garbage collection is not triggered
This is the snippet of code that logs out the memory used after each function call, where the external memory and RSS memory keep building up without being freed:
async function getPrintFile() {
  let buffer = Buffer.alloc(1000000);
  return
}

async function test() {
  for (let i = 0; i < 10; i++) {
    console.log(process.memoryUsage())
    await getPrintFile()
  }
}

test()
console.log(process.memoryUsage())
Below I will share the endless list of things that I have tried in order to make sure those buffers get garbage collected, but without succeeding. First I'll share the only working solution that is not optimal.
Minimal reproducible example - garbage collection is triggered through code
To make this work, I have to call global.gc() in two places in the code. For some weird reason, which I hope someone can explain to me, if I call global.gc() only at the end of the function that creates the buffer, or only just after calling that function, it won't garbage collect. However, if I call global.gc() in both places, it will.
This is the only solution that has worked for me so far, but it is obviously not ideal, as global.gc() is blocking.
async function getPrintFile() {
  let buffer = Buffer.alloc(1000000);
  global.gc()
  return
}

async function test() {
  for (let i = 0; i < 10; i++) {
    console.log(process.memoryUsage())
    await getPrintFile()
    global.gc()
  }
}

test()
console.log(process.memoryUsage())
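(Note that global.gc is only defined when the process is started with garbage collection exposed, as the answers further down also point out:)
node --expose-gc index.js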
What I have tried
I tried setting the buffers to null; logically, if they are not referenced anymore they should be garbage collected, but apparently the garbage collector is very lazy.
I tried "delete buffer", and looked for a way to resize or reallocate buffer memory, but apparently nothing like that exists in Node.js.
I tried buffer.fill(0), but that simply overwrites the contents with zeros; it doesn't resize or release anything.
I tried installing a different memory allocator (jemalloc) on my Heroku server, following this guide: Jemalloc Heroku Buildpack, but it was pointless.
I tried running my script with node --max-old-space-size=4 index.js, but again, pointless; it didn't work even with the old space size set to 4 MB.
I thought maybe it was because the functions were asynchronous, or because I was using a loop. Nope: I wrote 5 different versions of that snippet, and every one of them had the same issue of the external memory growing like crazy.
Questions
By any remote chance, is there something super easy I'm missing, like a keyword, a function to use, that would easily sort this out? Or does anyone have anything that has worked for them so far? A library, a snippet, anything?
Why the hell do I have to call global.gc() TWO TIMES, from within and outside the function, for the garbage collector to work, and why is once not enough??
Why is it that the garbage collector for Buffers in Node.js is such dog s**t?
How is this not an issue? Literally, every single buffer ever declared in a running application NEVER gets garbage collected, and you won't notice it on your laptop because of the large amount of memory available, but as soon as you deploy a snippet online, by the time you realise it, it's probably too late. What's up with that?
I hope someone can give me a hand, as running global.gc() twice for every such request is not very efficient, and I'm not sure what repercussions it might have on my code.

Running garbage collection manually in node

I am using node and am considering manually running garbage collection. Are there any drawbacks to this? The reason is that it looks like node is not running garbage collection frequently enough. Does anyone know how often V8 runs its garbage collection routine in node?
Thanks!
I actually had the same problem running node on heroku with 1GB instances.
When running the node server on production traffic, the memory would grow constantly until it exceeded the memory limit, which caused it to run slowly.
This was probably caused by the app generating a lot of garbage; it mostly serves JSON API responses. But it wasn't a memory leak, just uncollected garbage.
It seems that node doesn't prioritize doing enough garbage collections on old object space for my app, so memory would constantly grow.
Running global.gc() manually (enabled with node --expose_gc) would reduce memory usage by 50MB every time and would pause the app for about 400ms.
What I ended up doing is running gc manually on a randomized schedule (so that heroku instances wouldn't do GC all at once). This decreased the memory usage and stopped the memory quota exceeded errors.
A simplified version would be something like this:
function scheduleGc() {
  if (!global.gc) {
    console.log('Garbage collection is not exposed');
    return;
  }

  // schedule next gc within a random interval (e.g. 15-45 minutes)
  // tweak this based on your app's memory usage
  var nextMinutes = Math.random() * 30 + 15;

  setTimeout(function () {
    global.gc();
    console.log('Manual gc', process.memoryUsage());
    scheduleGc();
  }, nextMinutes * 60 * 1000);
}
// call this in the startup script of your app (once per process)
scheduleGc();
You need to run your app with garbage collection exposed:
node --expose_gc app.js
I know this may be a bit of a tardy reply to help the OP, but I thought I would share my recent experiences with Node.js memory allocation and garbage collection.
We are currently working on a Node.js server running on a Raspberry Pi 3. Every so often it would crash due to running out of memory. I initially thought this was a memory leak, and after a week and a half of searching through my code and coming up with nothing, I concluded that the problem was being made worse by the fact that Node.js will allocate more memory than is available on the Pi for its processes before it runs a GC.
I have been running new instances of my server with the following commands:
node --max-executable-size=96 --max-old-space-size=128 --max-semi-space-size=2 server.js
(The V8 flags have to come before the script name, otherwise they are passed to the script instead of to node.) This effectively limits the total amount of space that node is allowed to take up on the machine and forces garbage collections to be done more frequently. So far we are seeing constant memory usage, which confirms to me that my code was not leaking in the first place; rather, node was allocating more memory than was actually available.
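As a side note of my own (not part of the original answer), the heap limit a given set of flags actually produces can be checked at startup with the built-in v8 module:
// Print the heap limit V8 was configured with.
const v8 = require('v8');
const stats = v8.getHeapStatistics();
console.log('heap_size_limit:', Math.round(stats.heap_size_limit / 1024 / 1024), 'MB');
console.log('total_available_size:', Math.round(stats.total_available_size / 1024 / 1024), 'MB');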
EDIT: These links outline in more specific terms the issue I was dealing with:
- nodejs decrease v8 garbage collector memory usage
- https://github.com/nodejs/node/issues/2738
V8 runs garbage collection when it thinks it is useful; there is no fixed interval. You can read this article to learn about garbage collection in V8: https://strongloop.com/strongblog/node-js-performance-garbage-collection/
Anyway, it's a bad idea to run the garbage collector manually in your project, because it completely blocks the node process. During the garbage collection, your program won't handle any requests.
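If you do decide to force collections anyway, it is worth measuring the pause on your own workload rather than relying on the ~400 ms figure above. A small sketch of my own, assuming the process is started with --expose-gc:
// Measure how long one forced full GC blocks the event loop.
// Run with: node --expose-gc yourscript.js
if (global.gc) {
  const start = process.hrtime.bigint();
  global.gc();
  const pauseMs = Number(process.hrtime.bigint() - start) / 1e6;
  console.log('Forced GC pause: ' + pauseMs.toFixed(1) + ' ms');
}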

Memory leak in a node.js crawler application

For over a month I have been struggling with a very annoying memory leak and I have no clue how to solve it.
I'm writing a general-purpose web crawler based on: http, async, cheerio and nano. From the very beginning I've been struggling with a memory leak that was very difficult to isolate.
I know it's possible to do a heapdump and analyse it with Google Chrome, but I can't understand the output. It's usually a bunch of meaningless strings and objects leading to some anonymous functions, telling me exactly nothing (it might be a lack of experience on my side).
Eventually I came to the conclusion that the library I had been using at the time (jQuery) had issues, and I replaced it with Cheerio. I had the impression that Cheerio solved the problem, but now I'm sure it only made it less dramatic.
You can find my code at: https://github.com/lukaszkujawa/node-web-crawler. I understand it might be a lot of code to analyse, but perhaps I'm doing something stupid which is obvious straight away. I'm suspecting the main agent class, which makes the HTTP requests https://github.com/lukaszkujawa/node-web-crawler/blob/master/webcrawler/agent.js from multiple "threads" (with async.queue).
If you would like to run the code it requires CouchDB and after npm install do:
$ node crawler.js -c conf.example.json
I know that Node doesn't go crazy with garbage collection, but after 10 minutes of heavy crawling the used memory can easily go over 1GB.
(tested with v0.10.21 and v0.10.22)
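As an aside (not available on the v0.10 line used above), modern Node versions can write a heap snapshot programmatically through the built-in v8 module, which can then be loaded into Chrome DevTools. A minimal sketch of my own; the choice of SIGUSR2 is arbitrary:
// Write a .heapsnapshot file whenever the process receives SIGUSR2 (Node 11.13+).
// Open the file in Chrome DevTools > Memory; comparing two snapshots taken a few
// minutes of crawling apart makes the retained objects easier to spot.
const v8 = require('v8');
process.on('SIGUSR2', () => {
  const file = v8.writeHeapSnapshot();
  console.log('Heap snapshot written to', file);
});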
For what it's worth, Node's memory usage will grow and grow even if your actual used memory isn't very large. This is an optimization on the part of the V8 engine. To see your real memory usage (and determine whether there is actually a memory leak), consider dropping this code (or something like it) into your application:
setInterval(function () {
  if (typeof gc === 'function') {
    gc();
  }
  // applog is this answer's logger; console.log works just as well
  applog.debug('Memory Usage', process.memoryUsage());
}, 60000);
Run node --expose-gc yourApp.js. Every minute there will be a log line indicating real memory usage immediately after a forced garbage collection. I've found that watching the output of this over time is a good way to determine if there is a leak.
If you do find a leak, the best way I've found to debug it is to eliminate large sections of your code at a time. If the leak goes away, put it back and eliminate a smaller section of it. Use this method to narrow it down to where the problem is occurring. Closures are a common source, but also check for anywhere else references may not be cleaned up. Many network applications will attach handlers for sockets that aren't immediately destroyed.
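To illustrate that last point with a sketch of my own (the emitter, names and sizes here are hypothetical, not taken from the crawler): a listener attached per request and never removed keeps its closure, and everything the closure references, alive:
const EventEmitter = require('events');
const source = new EventEmitter();

function handleRequest(reqId) {
  const bigBuffer = Buffer.alloc(10 * 1024 * 1024);

  function onData(chunk) {
    // referencing bigBuffer here is what keeps the whole 10 MB alive
    console.log('request', reqId, 'buffer', bigBuffer.length, 'chunk', chunk.length);
  }

  // Leak: a fresh listener per request is never removed, so each closure
  // (and its 10 MB buffer) stays reachable for the emitter's lifetime.
  source.on('data', onData);
  // Fix: use source.once('data', onData), or call
  // source.removeListener('data', onData) when the request is finished.
}

for (let i = 0; i < 20; i++) handleRequest(i);
// Node prints a MaxListenersExceededWarning once more than 10 listeners pile up,
// which is itself a useful tell-tale for this kind of leak.
console.log('data listeners:', source.listenerCount('data'));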

Need thoughts on profiling of multi-threading in C on Linux

My application scenario is like this: I want to evaluate the performance gain one can achieve on a quad-core machine for processing the same amount of data. I have following two configurations:
i) 1-Process: a program without any threading that processes data from 1M to 1G, while the system is assumed to be running on only a single one of its 4 cores.
ii) 4-thread process: a program with 4 threads (all threads performing the same operation), each processing 25% of the input data.
In my program for creating the 4 threads, I used pthread's default options (i.e., without any specific pthread_attr_t). I believe the performance gain of the 4-thread configuration compared to the 1-process configuration should be close to 400% (or somewhere between 350% and 400%).
I profiled the time spent in creation of threads just like this below:
timer_start(&threadCreationTimer);
pthread_create(&thread0, NULL, fun0, NULL);
pthread_create(&thread1, NULL, fun1, NULL);
pthread_create(&thread2, NULL, fun2, NULL);
pthread_create(&thread3, NULL, fun3, NULL);
threadCreationTime = timer_stop(&threadCreationTimer);

pthread_join(thread0, NULL);
pthread_join(thread1, NULL);
pthread_join(thread2, NULL);
pthread_join(thread3, NULL);
Since an increase in the size of the input data also increases the memory requirement of each thread, loading all the data in advance is definitely not a workable option. Therefore, to avoid increasing the memory requirement of each thread, each thread reads data in small chunks: it processes a chunk, reads the next chunk, processes it, and so on. Hence, the structure of the code of the functions run by my threads is like this:
timer_start(&threadTimer[i]);
while (!dataFinished[i])
{
    threadTime[i] += timer_stop(&threadTimer[i]);
    data_source();
    timer_start(&threadTimer[i]);
    process();
}
threadTime[i] += timer_stop(&threadTimer[i]);
The variable dataFinished[i] is set to true by process() when it has received and processed all the data it needs. process() knows when to do that :-)
In the main function, I am calculating the time taken by 4-threaded configuration as below:
execTime4Thread = max(threadTime[0], threadTime[1], threadTime[2], threadTime[3]) + threadCreationTime.
And performance gain is calculated by simply
gain = execTime1process / execTime4Thread * 100
Issue:
On small data sizes of around 1M to 4M, the performance gain is generally good (between 350% and 400%). However, the performance gain drops off sharply as the input size increases. It keeps decreasing until a data size of about 50M or so, and then stabilises around 200%. Once it has reached that point, it remains almost constant even for 1GB of data.
My question is: can anybody suggest the main reason for this behaviour (i.e., the performance drop at the start, which then levels off)?
And any suggestions on how to fix it?
For your information, I also investigated the behaviour of threadCreationTime and threadTime for each thread to see what is happening. For 1M of data the values of these variables are small, but as the data size increases both variables grow rapidly (even though threadCreationTime should remain roughly the same regardless of data size, and threadTime should only increase at a rate corresponding to the data being processed). After increasing until about 50M or so, threadCreationTime becomes stable (just as the performance drop does), while threadTime keeps increasing at a constant rate corresponding to the amount of data to be processed (which is understandable).
Do you think increasing the stack size of each thread, changing the process priority, or customising other scheduler-related parameters (using pthread_attr_init) could help?
PS: The results were obtained while running the programs in Linux's failsafe mode as root (i.e., a minimal OS running without the GUI and networking stuff).
"Since an increase in the size of the input data also increases the memory requirement of each thread, loading all the data in advance is definitely not a workable option. Therefore, to avoid increasing the memory requirement of each thread, each thread reads data in small chunks: it processes a chunk, reads the next chunk, processes it, and so on."
Just this, alone, can cause a drastic speed decrease.
If there is sufficient memory, reading one large chunk of input data will always be faster than reading data in small chunks, especially from each thread. Any I/O benefit from chunking (caching effects) disappears when you break it down into pieces. Even allocating one big chunk of memory is much cheaper than allocating small chunks many, many times.
As a sanity check, you can run htop to ensure that all your cores are maxed out during the run. If not, your bottleneck could be outside of your multi-threading code.
Within the threading,
threading context switches due to many threads can cause sub-optimal speedup
as mentioned by others, a cold cache due to not reading memory contiguously can cause slowdowns
But re-reading your OP, I suspect the slowdown has something to do with your data input/memory allocation. Where exactly are you reading your data from? Some kind of socket? Are you sure you need to allocate memory more than once in your thread?
Some algorithm in your worker threads is likely to be suboptimal/expensive.
Are your threads starting on creation? If that is the case, then the following happens: while your parent thread is still creating threads, the threads already created start to run. When you hit timer_stop(&threadCreationTimer), the four threads have already been running for some time, so threadCreationTime overlaps threadTime[i].
As it is now, you don't know what you are measuring. Fixing that won't solve your problem, because you obviously still have one (threadTime does not grow linearly), but at least you won't be adding overlapping times.
To get more information you can use the perf tool if it is available on your distro.
For example:
perf stat -e cache-misses <your_prog>
and see what happens with a two-thread version, a three-thread version, etc.

Speed Up with multithreading

I have a parse method in my program which first reads a file from disk, then parses the lines and creates an object for every line. Afterwards, for every file, a collection of the objects created from its lines is saved. The files are about 300MB.
This takes about 2.5-3 minutes to complete.
My question: can I expect a significant speed-up if I split the work across three threads, one just reading files from disk, another parsing the lines, and a third saving the collections? Or would this maybe slow the process down?
How long does it typically take a modern notebook hard disk to read 300MB? I think the bottleneck in my task is the CPU, because when I execute the method one CPU core is always at 100% while the disk is idle more than half the time.
greetings, rain
EDIT:
private CANMessage parseLine(String line)
{
    try
    {
        CANMessage canMsg = new CANMessage();
        int offset = 0;
        int offset_add = 0;
        char[] delimiterChars = { ' ', '\t' };
        string[] elements = line.Split(delimiterChars);
        if (!isMessageLine(ref elements))
        {
            return canMsg = null;
        }
        offset = getPositionOfFirstWord(ref elements);
        canMsg.TimeStamp = Double.Parse(elements[offset]);
        offset += 3;
        offset_add = getOffsetForShortId(ref elements, ref offset);
        canMsg.ID = UInt16.Parse(elements[offset], System.Globalization.NumberStyles.HexNumber);
        offset += 17; // for signs between identifier and data length number
        canMsg.DataLength = Convert.ToInt16(elements[offset + offset_add]);
        offset += 1;
        parseDataBytes(ref elements, ref offset, ref offset_add, ref canMsg);
        return canMsg;
    }
    catch (Exception exp)
    {
        MessageBox.Show(line);
        MessageBox.Show(exp.Message + "\n\n" + exp.StackTrace);
        return null;
    }
}
So this is the parse method. It works this way, but maybe you are right and it is inefficient. I have .NET Framework 4.0 and I am on Windows 7. I have a Core i7 where every core has HyperThreading, so I am only using about 1/8 of the CPU.
EDIT2: I am using Visual Studio 2010 Professional. It looks like the performance profiling tools are not available in this edition (according to the MSDN Beginners Guide to Performance Profiling).
EDIT3: I changed the code now to use threads. It looks now like this:
foreach (string str in checkedListBoxImport.CheckedItems)
{
    toImport.Add(str);
}

for (int i = 0; i < toImport.Count; i++)
{
    String newString = new String(toImport.ElementAt(i).ToArray());
    Thread t = new Thread(() => importOperation(newString));
    t.Start();
}
The parsing you saw above is called from within importOperation(...).
With this code it was possible to reduce the time from about 2.5 minutes to "only" 40 seconds. I have some concurrency problems I still have to track down, but at least this is much faster than before.
Thank you for your advice.
It's unlikely that you are going to get consistent metrics for laptop hard disk performance, as we have no idea how old your laptop is, nor whether it is solid state or spinning.
Considering you have already done some basic profiling, I'd wager the CPU really is your bottleneck, as it is impossible for a single-threaded application to use more than 100% of a single CPU. This is of course ignoring your operating system splitting the process over multiple cores and other oddities. If you were getting 5% CPU usage instead, you would most likely be bottlenecked on IO.
That said, your best bet would be to create a new thread task for each file you are processing and send it to a pooled thread manager. Your thread manager should limit the number of threads you are running to either the number of cores you have available or, if memory is an issue (you did say you were working with 300MB files after all), the maximum amount of RAM you can use for the process.
Finally, to answer why you don't want to use a separate thread for each operation, consider what you already know about your performance bottlenecks. You are bottlenecked on CPU processing, not IO. This means that if you split your application into separate threads, your read and write threads would be starved most of the time waiting for your processing thread to finish. Additionally, even if you made them process asynchronously, you run the very real risk of running out of memory as your read thread keeps consuming data that your processing thread can't keep up with.
Thus, be careful not to start each thread immediately and let them instead be managed by some form of blocking queue. Otherwise you run the risk of slowing your system to a crawl as you spend more time in context switches than processing. This is of course assuming you don't crash first.
It's unclear how many of these 300MB files you've got. A single 300MB file takes about 5 or 6 seconds to read on my netbook, with a quick test. It does indeed sound like you're CPU-bound.
It's possible that threading will help, although it's likely to complicate things significantly of course. You should also profile your current code - it may well be that you're just parsing inefficiently. (For example, if you're using C# or Java and you're concatenating strings in a loop, that's frequently a performance "gotcha" which can be easily remedied.)
If you do opt for a multi-threaded approach, then to avoid thrashing the disk, you may want to have one thread read each file into memory (one at a time) and then pass that data to a pool of parsing threads. Of course, that assumes you've also got enough memory to do so.
If you could specify the platform and provide your parsing code, we may be able to help you optimize it. At the moment all we can really say is that yes, it sounds like you're CPU bound.
That long for only 300 MB is bad.
There are different things that could be impacting performance depending on the situation, but typically reading from the hard disk would be the biggest bottleneck, unless you have something intense going on during the parsing, which seems to be the case here, because it only takes several seconds to read 300MB from a hard disk (unless it's badly fragmented, maybe).
If you have some inefficient algorithm in the parsing, then picking or coming up with a better algorithm would probably be more beneficial. If you absolutely need that algorithm and there's no algorithmic improvement available, it sounds like you might be stuck.
Also, don't try to read and write at the same time with the multithreading; you'll likely slow things way down due to increased seeking.
Given that you think this is a CPU bound task, you should see some overall increase in throughput with separate IO threads (since otherwise your only processing thread would block waiting for IO during disk read/write operations).
Interestingly I had a similar issue recently and did see a significant net improvement by running separate IO threads (and enough calculation threads to load all CPU cores).
You don't state your platform, but I used the Task Parallel Library and a BlockingCollection for my .NET solution and the implementation was almost trivial. MSDN provides a good example.
UPDATE:
As Jon notes, the time spent on IO is probably small compared to the time spent calculating, so while you can expect some improvement, the best use of your time may be profiling and improving the calculation itself. Using multiple threads for the calculation may then speed things up significantly.
Hmm.. 300MB of lines that have to be split up into a lot of CAN message objects - nasty! I suspect the trick might be to thread off the message assembly while avoiding excessive disk-thrashing between the read and write operations.
If I was doing this as a 'fresh' requirement, (and of course, with my 20/20 hindsight, knowing that CPU was going to be the problem), I would probably use just one thread for reading, one for writing the disk and, initially at least, one thread for the message object assembly. Using more than one thread for message assembly means the complication of resequencing the objects after processing to prevent the output file being written out-of-order.
I would define a nice disk-friendly sized chunk class of lines and message-object array instances, say 1024 of them, and create a pool of chunks at startup, 16 say, and shove them onto a storage queue. This controls and caps memory use, greatly reduces new/dispose/malloc/free (it looks like you have a lot of this at the moment!), improves the efficiency of the disk r/w operations as only large reads/writes are performed (except for the last chunk which will, in general, be only partly filled), provides inherent flow control (the read thread cannot 'run away' because the pool will run out of chunks and the read thread will block on the pool until the write thread returns some chunks), and inhibits excess context switching because only large chunks are processed.
The read thread opens the file, gets a chunk from the queue, reads the disk, parses into lines and shoves the lines into the chunk. It then queues the whole chunk to the processing thread and loops around to get another chunk from the pool. Possibly, the read thread could, on start or when idle, wait on its own input queue for a message class instance that contains the read/write filespecs. The write filespec could be propagated through a field of the chunks, supplying the write thread with everything it needs via the chunks. This makes a nice subsystem to which filespecs can be queued, and it will process them all without any further intervention.
The processing thread gets chunks from its input queue, splits the lines up into the message objects in the chunk, and then queues the completed, whole chunks to the write thread.
The write thread writes the message objects to the output file and then requeues the chunk to the storage pool queue for re-use by the read thread.
All the queues should be blocking producer-consumer queues.
One issue with threaded subsystems is completion notification. When the write thread has written the last chunk of a file, it probably needs to do something. I would probably fire an event with the last chunk as a parameter so that the event handler knows which file has been completely written. I would probably do something similar with error notifications.
If this is not fast enough, you could try:
1) Ensure that the read and write threads cannot be preempted in favour of each other during chunk disk I/O by using a mutex. If your chunks are big enough, this probably won't make much difference.
2) Use more than one processing thread. If you do this, chunks may arrive at the write-thread 'out-of-order'. You would maybe need a local list and perhaps some sort of sequence-number in the chunks to ensure that the disk writes are correctly ordered.
Good luck, whatever design you come up with..
Rgds,
Martin
