Node.js + Redis memory leak. Am I doing something wrong? - node.js

var redis = require("redis"),
    client = redis.createClient();

for (var i = 0; i < 1000000; i++) {
    client.publish('channel_1', 'hello!');
}
After the code is executed, the Node process consumes 1.2 GB of memory and stays there; GC does not reduce the allocated memory. If I simulate 2 million messages, or 4 batches of 500,000, Node crashes with an out-of-memory error.
Node: 0.8.*; I later tried 4.1.1, but nothing changed.
Redis: 2.8, which works well (1 MB of allocated memory).
My server will be publishing more than 1 million messages per hour, so this is absolutely not acceptable (the process crashing every hour).
Updated test:
var redis = require("redis"),
    client = redis.createClient();

var count = 0;
var x;

function loop() {
    count++;
    console.log(count);
    if (count > 2000) {
        console.log('cleared');
        clearInterval(x);
    }
    for (var i = 0; i < 100000; i++) {
        client.set('channel_' + i, 'hello!');
    }
}

x = setInterval(loop, 3000);
This allocates ~50 MB, with a peak at 200 MB, and now GC drops memory back down to 50 MB.

If you take a look at the node_redis client source, you'll see that every send operation returns a boolean that indicates whether the command queue has passed the high water mark (1000 by default). If you were to log this return value (or, alternatively, enable redis.debug_mode), there is a good chance you'll see false a lot, an indication that you're sending more requests than Redis can handle all at once.
If this turns out not to be the case, then the command queue is indeed being cleared regularly, which means GC is most likely the issue.
Either way, try jfriend00's suggestion. Sending 1M+ async messages with no delay (so basically all at once) is not a good test. The queue needs time to clear and GC needs time to do its thing.
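As an illustration only (not taken from the linked sources), here is a minimal sketch of publishing in batches, yielding to the event loop between batches and logging the return value described above:

var redis = require("redis"),
    client = redis.createClient();

var TOTAL = 1000000,
    BATCH = 1000,
    sent = 0;

function publishBatch() {
    for (var i = 0; i < BATCH && sent < TOTAL; i++, sent++) {
        // Per the answer above, false here means the command queue has
        // passed the high water mark and is falling behind.
        var ok = client.publish('channel_1', 'hello!');
        if (!ok) {
            console.log('queue above high water mark at message ' + sent);
        }
    }
    if (sent < TOTAL) {
        setImmediate(publishBatch); // let the queue flush and GC run before the next batch
    } else {
        client.quit();
    }
}

publishBatch();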
Sources:
Backpressure and Unbounded Concurrency & Node-redis client return values

Related

Optimizing file parse and SNS publish of large record set

I have an 85 MB data file with 110k text records in it. I need to parse each of these records and publish an SNS message to a topic for each one. I am doing this successfully, but the Lambda function requires a lot of time to run, as well as a large amount of memory. Consider the following:
const parse = async (key) => {
    // get the 85 MB file from S3. This takes 3 seconds.
    // I could probably do this via a stream to cut down on memory...
    let file = await getFile(key);

    // parse the data by new line
    const rows = file.split("\n");

    // free some memory now
    // this freed up ~300 MB of memory in my tests
    file = null;

    // collect the publish promises
    const requests = [];
    for (let i = 0; i < rows.length; i++) {
        // ... parse the row and build a small JS object from it
        // publish to SNS. Assume publishMsg returns a promise after a successful SNS push
        requests.push(publishMsg(data));
    }

    // wait for all to finish
    await Promise.all(requests);
    return 1;
};
The Lambda function will timeout with this code at 90 seconds (the current limit I have set). I could raise this limit, as well as the memory (currently at 1024 MB), and likely solve my issue. But none of the SNS publish calls take place when the function hits the timeout. Why?
Let's say 10k rows process before the function hits the timeout. Since I am submitting the publishes asynchronously, shouldn't several of them complete regardless of the timeout? It seems they only run if the entire function completes.
I have run a test where I cut the data down to 15k rows, and it runs without any issue in roughly 15 seconds.
So the question: why are the async calls not firing prior to the function timeout, and is there anything I can do to optimize this without moving away from Lambda?
Lambda config: Node.js 10.x, 1024 MB, 90 second timeout
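For illustration only (this is not part of the original question, and buildMessage is a made-up stand-in for the row parsing), one common way to bound memory and concurrency is to await the publishes in fixed-size chunks instead of collecting 110k promises and awaiting them all at the end:

// Hypothetical sketch: publish rows in chunks so only `size` SNS requests
// are in flight at a time. Assumes publishMsg(data) returns a promise.
const publishInChunks = async (rows, size = 100) => {
    for (let i = 0; i < rows.length; i += size) {
        const batch = rows
            .slice(i, i + size)
            .map(row => buildMessage(row))  // illustrative stand-in for the row parsing
            .map(data => publishMsg(data));
        await Promise.all(batch);           // finish this chunk before starting the next
    }
};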

DynamoDB PutItem using all heap memory - NodeJS

I have a csv with over a million lines, and I want to import all the lines into DynamoDB. I'm able to loop through the csv just fine; however, when I try to call DynamoDB PutItem on these lines, I run out of heap memory after about 18k calls.
I don't understand why this memory is being used or how I can get around this issue. Here is my code:
// Requires assumed from usage (not shown in the original question):
const { createInterface } = require('readline');
const { createReadStream } = require('fs');
const { once } = require('events');
const parse = require('csv-parse/lib/sync');
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB();

let insertIntoDynamoDB = async () => {
    const file = './file.csv';
    let index = 0;
    const readLine = createInterface({
        input: createReadStream(file),
        crlfDelay: Infinity
    });

    readLine.on('line', async (line) => {
        let record = parse(`${line}`, {
            delimiter: ',',
            skip_empty_lines: true,
            skip_lines_with_empty_values: false
        });
        await dynamodb.putItem({
            Item: {
                "Id": {
                    S: record[0][2]
                },
                "newId": {
                    S: record[0][0]
                }
            },
            TableName: "My-Table-Name"
        }).promise();
        index++;
        if (index % 1000 === 0) {
            console.log(index);
        }
    });

    // halts process until all lines have been processed
    await once(readLine, 'close');
    console.log('FINAL: ' + index);
}
If I comment out the DynamoDB call, I can loop through the file just fine and read every line. Where is this memory usage coming from? My DynamoDB write throughput is at 500; adjusting this value has no effect.
For anyone who is trudging through the internet trying to find out why DynamoDB is consuming all the heap memory, there is a GitHub bug report here: https://github.com/aws/aws-sdk-js/issues/1777#issuecomment-339398912
Basically, the AWS SDK only has 50 sockets to make HTTP requests. If all sockets are consumed, events are queued until a socket becomes available. When processing millions of requests, these sockets get consumed immediately, and then the queue builds up until it blows up the heap.
So how do you get around this?
1. Increase the heap size
2. Increase the number of sockets
3. Control how many "events" you are queueing
Options 1 and 2 are the easy way out, but they do not scale. They might work for your scenario if you are doing a one-off thing, but if you are trying to build a robust solution, then you will want to go with number 3.
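For reference, option 2 can be configured through the SDK's HTTP options (a sketch, assuming the v2 aws-sdk used above; the socket count here is arbitrary):

const AWS = require('aws-sdk');
const https = require('https');

// Raise the socket cap for DynamoDB requests (option 2 above).
// 50 is the SDK default; pick a value your memory and table throughput can tolerate.
const dynamodb = new AWS.DynamoDB({
    httpOptions: {
        agent: new https.Agent({ keepAlive: true, maxSockets: 256 })
    }
});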
To do number 3, I determine the max heap size and divide it by how large I think an "event" will be in memory. For example, I assume an updateItem event for DynamoDB would be 100,000 bytes. My heap size was 4 GB, so 4,000,000,000 B / 100,000 B = 40,000 events. However, I only take 50% of that many events, to leave room on the heap for whatever else the Node application might be doing. This percentage can be lowered or increased depending on your preference. Once I have the number of events, I read a line from the csv and consume an event; when the request has completed, I release the event back into the pool. If no events are available, I pause the input stream from the csv until an event becomes available.
Now I can upload millions of entries to dynamodb without any worry of blowing up the heap.
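A rough sketch of that pooling idea, assuming the same readline/putItem setup as in the question and Node 10+ for Promise.prototype.finally (the limit below is illustrative; derive it from your own heap math):

const MAX_IN_FLIGHT = 20000;   // e.g. 50% of (heap size / estimated bytes per request)
let inFlight = 0;

readLine.on('line', (line) => {
    const record = parse(line, { delimiter: ',', skip_empty_lines: true });

    inFlight++;                // consume an "event" from the pool
    if (inFlight >= MAX_IN_FLIGHT) {
        readLine.pause();      // stop reading until some requests finish
    }

    dynamodb.putItem({
        Item: {
            "Id": { S: record[0][2] },
            "newId": { S: record[0][0] }
        },
        TableName: "My-Table-Name"
    }).promise()
        .catch(err => console.error(err))
        .finally(() => {
            inFlight--;        // release the event back into the pool
            if (inFlight < MAX_IN_FLIGHT) {
                readLine.resume();
            }
        });
});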

Why nodejs memory is not consumed for specific loop count

I was trying to find a memory leak in my code. I found that when 1 < n < 257 it shows 0 KB consumed, but as soon as I set n to 257 it consumes 304 KB, and memory then increases proportionally with n.
function somefunction()
{
    var n = 256;
    var x = {};
    for (var i = 0; i < n; i++) {
        x['some' + i] = {"abc": ("abc#yxy.com" + i)};
    }
}

// Memory Leak
var init = process.memoryUsage();
somefunction();
var end = process.memoryUsage();
console.log("memory consumed 2nd Call : " + ((end.rss - init.rss) / 1024) + " KB");
It's probably not a leak. You cannot always expect the GC to purge everything so soon.
See Why does nodejs have incremental memory usage?
If you want to force garbage collection, see
How to request the Garbage Collector in node.js to run?
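For what it's worth, a minimal sketch of forcing a collection before measuring (run with the --expose-gc flag, which makes global.gc() available):

// Run with: node --expose-gc test.js
function somefunction() {
    var n = 257;
    var x = {};
    for (var i = 0; i < n; i++) {
        x['some' + i] = {"abc": ("abc#yxy.com" + i)};
    }
}

var init = process.memoryUsage();
somefunction();
global.gc();   // force a full collection so the measurement is not dominated by lazy GC
var end = process.memoryUsage();
console.log("memory consumed: " + ((end.rss - init.rss) / 1024) + " KB");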

Is my understanding of libuv threadpool in node.js correct?

I wrote the following node.js program (node version 6.2.0 on Ubuntu 14.04) to understand more about the libuv threadpool in node.js. In the program, I am reading two text files, each 10 KB in size. After the files are read successfully, I do a computing-intensive task in each callback.
var log4js = require('log4js'); // For logging output with timestamp
var logger = log4js.getLogger();
var fs = require('fs');

fs.readFile('testFile0.txt', function (err, data) { // read testFile0.txt
    logger.debug('data read of testFile0.txt');
    for (var i = 0; i < 10000; i++) { // Computing-intensive task. Looping 10^10 times
        for (var j = 0; j < 10000; j++) {
            for (var k = 0; k < 100; k++) {
            }
        }
    }
});

fs.readFile('testFile1.txt', function (err, data) { // read testFile1.txt
    logger.debug('data read of testFile1.txt');
    for (var i = 0; i < 10000; i++) { // Computing-intensive task. Looping 10^10 times
        for (var j = 0; j < 10000; j++) {
            for (var k = 0; k < 100; k++) {
            }
        }
    }
});
As per my understanding of the libuv threadpool, the two files should be read immediately and the time difference between the printing of "data read of testFile0.txt" and "data read of testFile1.txt" should be very small (milliseconds, or a second at most), since the default thread pool size is 4 and there are only two async requests (the file read operations). But the time difference between the two statements is quite large (10 seconds). Can someone explain why the time difference is so large? Does the computing-intensive task being done in the callback contribute to the large time difference?
libuv has a threadpool of size 4 (by default) so that part is correct. Now, let's see how that is actually used.
When some operation is queued in the threadpool, it's run on one of the threads, and then the result is posted to the "main" thread, that being the thread where the loop is running. Results are processed in FIFO style.
In your case, reading the files happens in parallel, but processing the results will be serialized. This means that while bytes are read from the disk in parallel, the callbacks will always run one after another.
You see the delay, because the second callback can only run after the first one is finished, but that takes ~10s, hence the delay.
One way to make this truly parallel would be to do the computation in the threadpool itself, though you'd need an addon for that which uses uv_queue_work, or else use the child_process module.
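As an illustration of the child_process route (a sketch only; the file names and message shapes are made up), the heavy loop can be moved into a forked worker so the two callbacks no longer serialize behind each other:

// compute.js - does the CPU-heavy work and reports back to the parent
process.on('message', function (fileName) {
    for (var i = 0; i < 10000; i++) {
        for (var j = 0; j < 10000; j++) {
            for (var k = 0; k < 100; k++) {
            }
        }
    }
    process.send('computation finished for ' + fileName);
    process.exit(0);
});

// main.js - read both files, then fork a worker per file for the computation
var fs = require('fs');
var fork = require('child_process').fork;

['testFile0.txt', 'testFile1.txt'].forEach(function (fileName) {
    fs.readFile(fileName, function (err, data) {
        console.log('data read of ' + fileName);
        var worker = fork('./compute.js');
        worker.on('message', function (msg) {
            console.log(msg);
        });
        worker.send(fileName);
    });
});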

possible socket.io memory leak

Using socket.io 0.9.17 with a redis store, over time the memory usage grows from ~150 MB at startup to beyond 1.0 GB.
I took two heap snapshots using node-heapdump, one right after app start and another a day later, and compared the results; it looks like the biggest offender is string objects.
Below are the screenshots of the comparison.
When I expand the string objects, all I see is some trace and an uncaughtException.
The app doesn't crash, and there are no exceptions when running the same code on dev environments. These strings are events that are passed to socket.io and distributed to the nodes via the redis store. The relevant code for this is below:
var result = {
    posts: [postData],
    privileges: {
        'topics:reply': true
    },
    'reputation:disabled': parseInt(meta.config['reputation:disabled'], 10) === 1,
    'downvote:disabled': parseInt(meta.config['downvote:disabled'], 10) === 1
};

for (var i = 0; i < uids.length; ++i) {
    if (parseInt(uids[i], 10) !== socket.uid) {
        websockets.in('uid_' + uids[i]).emit('event:new_post', result);
    }
}
Upgrading to socket.io 1.x got rid of the memory leak.
