DynamoDB PutItem using all heap memory - NodeJS

I have a csv with over a million lines, I want to import all the lines into DynamoDB. I'm able to loop through the csv just fine, however, when I try to call DynamoDB PutItem on these lines, I run out of heap memory after about 18k calls.
I don't understand why this memory is being used or how I can get around this issue. Here is my code:
const { createReadStream } = require('fs');
const { createInterface } = require('readline');
const { once } = require('events');
const parse = require('csv-parse/lib/sync'); // sync csv-parse API (path may differ by version)
const AWS = require('aws-sdk');

const dynamodb = new AWS.DynamoDB();

let insertIntoDynamoDB = async () => {
    const file = './file.csv';
    let index = 0;
    const readLine = createInterface({
        input: createReadStream(file),
        crlfDelay: Infinity
    });
    readLine.on('line', async (line) => {
        let record = parse(`${line}`, {
            delimiter: ',',
            skip_empty_lines: true,
            skip_lines_with_empty_values: false
        });
        await dynamodb.putItem({
            Item: {
                "Id": {
                    S: record[0][2]
                },
                "newId": {
                    S: record[0][0]
                }
            },
            TableName: "My-Table-Name"
        }).promise();
        index++;
        if (index % 1000 === 0) {
            console.log(index);
        }
    });
    // halts process until all lines have been processed
    await once(readLine, 'close');
    console.log('FINAL: ' + index);
}
If I comment out the DynamoDB call, I can loop through the file just fine and read every line. Where is this memory usage coming from? My DynamoDB write throughput is at 500, and adjusting this value has no effect.

For anyone trudging through the internet trying to find out why DynamoDB is consuming all the heap memory, there is a GitHub bug report here: https://github.com/aws/aws-sdk-js/issues/1777#issuecomment-339398912
Basically, the AWS SDK only has 50 sockets for making HTTP requests; if all sockets are consumed, events are queued until a socket becomes available. When processing millions of requests, the sockets get consumed immediately, and the queue builds up until it blows up the heap.
So how do you get around this?
1. Increase the heap size
2. Increase the number of sockets
3. Control how many "events" you are queueing
Options 1 and 2 are the easy way out, but do not scale. They might work if you are doing a one-off import, but if you are trying to build a robust solution, then you will want to go with option 3 (a quick sketch of option 2 follows anyway).
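For reference, option 2 with the AWS SDK for JavaScript v2 looks roughly like the sketch below; the maxSockets value is an arbitrary example, not a recommendation from the original post.
const AWS = require('aws-sdk');
const https = require('https');

// Option 2 (sketch): give the SDK a larger connection pool than the default.
// keepAlive avoids re-opening a TLS connection for every PutItem call.
AWS.config.update({
    httpOptions: {
        agent: new https.Agent({ keepAlive: true, maxSockets: 256 })
    }
});

const dynamodb = new AWS.DynamoDB();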
To do number 3, I determine the max heap size and divide it by how large I think an "event" will be in memory. For example, I assume an updateItem event for DynamoDB takes about 100,000 bytes. My heap size was 4 GB, so 4,000,000,000 B / 100,000 B = 40,000 events. However, I only take 50% of that to leave room on the heap for whatever else the Node application might be doing; this percentage can be lowered or raised depending on your preference. Once I have the number of events, I read a line from the CSV and consume an event; when the event completes, I release it back into the pool. If no events are available, I pause the CSV input stream until one becomes available.
Now I can upload millions of entries to DynamoDB without any worry of blowing up the heap.
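A minimal sketch of that pooling idea, assuming AWS SDK v2 and the same table layout as the question; MAX_IN_FLIGHT stands in for the event budget computed above, and the line parsing is simplified to a split:
const { createReadStream } = require('fs');
const { createInterface } = require('readline');
const { once } = require('events');
const AWS = require('aws-sdk');

const dynamodb = new AWS.DynamoDB();
const MAX_IN_FLIGHT = 20000; // e.g. 50% of the 40,000-event budget from above
let inFlight = 0;

let insertIntoDynamoDB = async () => {
    const readLine = createInterface({
        input: createReadStream('./file.csv'),
        crlfDelay: Infinity
    });

    readLine.on('line', (line) => {
        const record = line.split(','); // stand-in for the csv-parse call
        inFlight++;
        if (inFlight >= MAX_IN_FLIGHT) {
            readLine.pause(); // stop consuming lines until a slot frees up
        }
        dynamodb.putItem({
            Item: {
                "Id": { S: record[2] },
                "newId": { S: record[0] }
            },
            TableName: "My-Table-Name"
        }).promise()
            .catch(console.error)
            .finally(() => {
                inFlight--; // release the "event" back into the pool
                if (inFlight < MAX_IN_FLIGHT) {
                    readLine.resume();
                }
            });
    });

    await once(readLine, 'close');
    // drain whatever is still in flight after the file has been read
    while (inFlight > 0) {
        await new Promise(resolve => setTimeout(resolve, 100));
    }
};
Pausing the readline interface is what keeps the SDK's request queue bounded; the inFlight counter plays the role of the event pool described above.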

Related

Pass large array of objects to RabbitMQ exchange

I receive a large array of objects from an external source (more than 10,000 objects), and then I pass it to an exchange in order to notify other microservices about new entries to handle.
this._rmqClient.publishToExchange({
    exchange: 'my-exchange',
    exchangeOptions: {
        type: 'fanout',
        durable: true,
    },
    data: myData, // [object1, object2, object3, ...]
    pattern: 'myPattern',
})
The problem is that it's bad practice to push such a large message to an exchange, and I'd like to resolve this issue. I've read articles and Stack Overflow posts looking for code examples or information about streaming the data, but with no success.
The only way I've found is to divide my large array into chunks and publish each one to the exchange in a for ... loop. Is that good practice? How do I determine how long each chunk (number of objects) should be? Or is there another approach?
It really depends on the object size, and that's something you'll have to figure out yourself. Take your 10k objects and calculate an average size (dump them as JSON into a file and take fileSize / 10,000). Maybe a request body size of 50-100 kB is a good target, but that's still up to you.
Start with a chunk size of 50 and run tests. Check the time taken, the bandwidth, and anything else that makes sense. Try chunk sizes between 1 and 5000 and keep testing; at some point you'll get a feeling for which number works best.
Here's some example code of looping through the elements:
// send function for showcasing the idea
// (takes the rmq client explicitly so it works outside the original class)
function send(rmqClient, data) {
    return rmqClient.publishToExchange({
        exchange: 'my-exchange',
        exchangeOptions: {
            type: 'fanout',
            durable: true,
        },
        data: data,
        pattern: 'myPattern',
    })
}

// sends the chunks one by one
async function sendLargeDataPacket(rmqClient, data, chunkSize) {
    // work on a copy so the caller's array is not mutated
    const mutated = [...data]
    // send full packets as long as possible
    while (mutated.length >= chunkSize) {
        // send a packet of chunkSize length
        await send(rmqClient, mutated.splice(0, chunkSize))
    }
    // send the remaining elements, if there are any
    if (mutated.length > 0) {
        await send(rmqClient, mutated)
    }
}
And you would call it like:
// that's your 10k+ items array
var myData = [/**...**/]
// let's start with 50, but try out all numbers!
const chunkSize = 50
sendLargeDataPacket(this._rmqClient, myData, chunkSize)
    .then(() => console.log('done'))
    .catch(console.error)
This approach sends one packet after the other and may take some time, since it is not done in parallel. I do not know your requirements, but I can help you write a parallel approach if you need one.
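As a hedged illustration (not from the original answer), a parallel variant could publish several chunks at once with Promise.all; the concurrency of 5 below is just a placeholder to tune:
// Parallel variant (sketch): publish up to `concurrency` chunks at a time.
async function sendLargeDataPacketParallel(rmqClient, data, chunkSize, concurrency = 5) {
    // build the list of chunks up front
    const chunks = []
    for (let i = 0; i < data.length; i += chunkSize) {
        chunks.push(data.slice(i, i + chunkSize))
    }
    // send them in waves of `concurrency` chunks
    for (let i = 0; i < chunks.length; i += concurrency) {
        await Promise.all(
            chunks.slice(i, i + concurrency).map(chunk => send(rmqClient, chunk))
        )
    }
}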

Does redis.pipeline.exec() "reset" the pipeline?

I am trying to bulk insert around 500k records into Redis using a pipeline, and my code looks roughly like
const seedCache = async (
    products: Array<Item>,
    chunkSize,
) => {
    for (const chunk of _.chunk(products, chunkSize)) {
        chunk.forEach((item) => {
            pipeline.set(item.id, item.data);
        });
        await pipeline.exec();
        console.log(pipeline.length);
    }
    redis.quit();
};
Essentially, I load chunkSize items into the pipeline, then wait until pipeline.exec() returns, then continue.
I expected that "console.log(pipeline.length)" would be printing "0" every time, since it is only getting run after the pipeline has been flushed to Redis. However, I'm finding that pipeline.length is never getting reset to 0; instead, it just grows and grows until its length is equal to products.length by the end. This is causing my machine to run out of memory for large datasets.
Does anybody know why this is happening? Also, is this even the correct way to bulk insert records into Redis? Running this script with 5000 products and batch size 100 only inserts 200 into the cache, whereas it does successfully insert an array of 1000 products with the same batch size. The documents are quite large (~5 kB), so it needs to be done in batches somehow.
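One common pattern (a sketch, assuming ioredis and the lodash _.chunk helper from the question) is to create a fresh pipeline for every chunk rather than reusing a single pipeline instance, so each exec() flushes only that chunk:
const Redis = require('ioredis');
const _ = require('lodash');

const redis = new Redis();

// Sketch: build a brand-new pipeline for each chunk instead of reusing one.
const seedCache = async (products, chunkSize) => {
    for (const chunk of _.chunk(products, chunkSize)) {
        const pipeline = redis.pipeline(); // fresh pipeline per batch
        chunk.forEach((item) => {
            pipeline.set(item.id, item.data);
        });
        await pipeline.exec();             // flush this batch only
    }
    redis.quit();
};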

How can I limit the size of WriteStream buffer in NodeJS?

I'm using a WriteStream in NodeJS to write several GB of data, and I've identified the write loop as eating up ~2 GB of virtual memory during runtime (which is then GC'd about 30 seconds after the loop finishes). I'm wondering how I can limit the size of the buffer Node is using when writing the stream so that Node doesn't use up so much memory during that part of the code.
I've reduced it to this trivial loop:
let ofd = fs.openSync(fn, 'w')
let ws = fs.createWriteStream('', { fd: ofd })
:
while { /*..write ~4GB of binary formatted 32bit floats and uint32s...*/ }
:
:
ws.end()
The stream.write function returns a boolean value which indicates whether the internal buffer is full. The buffer size is controlled by the highWaterMark option. However, this option is a threshold rather than a hard limit, which means you can still call stream.write even if the internal buffer is full, and memory will keep growing if you write code like this:
while (foo) {
    ws.write(bar);
}
To solve this, you have to handle a false return value from ws.write and wait until the stream's 'drain' event is emitted, as in the following example:
async function write() {
    while (foo) {
        if (!ws.write(bar)) {
            await new Promise(resolve => ws.once('drain', resolve));
        }
    }
}
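Tying that back to the question's loop, a rough sketch of writing 32-bit floats with backpressure might look like this (the chunk size and data values are placeholders, not from the original post):
const fs = require('fs');

// Sketch: write float32 values in 64k-value chunks, honouring backpressure
// so the stream's internal buffer stays near highWaterMark.
async function writeFloats(path, totalValues) {
    const ws = fs.createWriteStream(path, { highWaterMark: 64 * 1024 });
    const chunkValues = 65536;
    const buf = Buffer.alloc(chunkValues * 4);

    for (let written = 0; written < totalValues; written += chunkValues) {
        for (let i = 0; i < chunkValues; i++) {
            buf.writeFloatLE(Math.random(), i * 4); // placeholder data
        }
        // copy the chunk so the next iteration doesn't mutate queued data
        if (!ws.write(Buffer.from(buf))) {
            await new Promise(resolve => ws.once('drain', resolve));
        }
    }
    await new Promise(resolve => ws.end(resolve));
}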

Optimizing file parse and SNS publish of large record set

I have an 85 MB data file with 110k text records in it. I need to parse each of these records and publish an SNS message to a topic for each record. I am doing this successfully, but the Lambda function requires a lot of time to run, as well as a large amount of memory. Consider the following:
const parse = async (key) => {
    // get the 85 MB file from S3. this takes 3 seconds
    // I could probably do this via a stream to cut down on memory...
    let file = await getFile( key );

    // parse the data by new line
    const rows = file.split("\n");

    // free some memory now
    // this freed up ~300 MB of memory in my tests
    file = null;

    const requests = [];
    for( let i = 0; i < rows.length; i++ ) {
        // ... parse the row and build a small JS object (`data`) from it

        // publish to SNS. assume publishMsg returns a promise after a successful SNS push
        requests.push( publishMsg(data) );
    }

    // wait for all to finish
    await Promise.all(requests);
    return 1;
};
The Lambda function times out with this code at 90 seconds (the current limit I have set). I could raise this limit, as well as the memory (currently at 1024 MB), and likely solve my issue. But none of the SNS publish calls take place when the function hits the timeout. Why?
Let's say 10k rows are processed before the function hits the timeout. Since I am submitting the publishes asynchronously, shouldn't several of them complete regardless of the timeout? It seems they only run if the entire function completes.
I have run a test where I cut the data down to 15k rows, and it runs without any issue, in roughly 15 seconds.
So the question: why are the async calls not firing prior to the function timeout, and is there any way to optimize this without moving away from Lambda?
Lambda config: Node.js 10.x, 1024 MB, 90 second timeout
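One common way to bound both memory and pending work (a sketch, not an answer from this thread; it assumes the AWS SDK v2 SNS client plus a hypothetical TOPIC_ARN and buildMessage helper) is to await the publishes in fixed-size batches instead of collecting 110k promises:
const AWS = require('aws-sdk');
const sns = new AWS.SNS();
const TOPIC_ARN = 'arn:aws:sns:...'; // hypothetical placeholder

// Publish rows in batches of `batchSize` so only that many requests
// are in flight (and held in memory) at any one time.
async function publishRows(rows, batchSize = 100) {
    for (let i = 0; i < rows.length; i += batchSize) {
        const batch = rows.slice(i, i + batchSize);
        await Promise.all(batch.map(row =>
            sns.publish({
                TopicArn: TOPIC_ARN,
                Message: JSON.stringify(buildMessage(row)) // buildMessage: your row parser (hypothetical)
            }).promise()
        ));
    }
}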

Nodejs: How can I optimize writing many files?

I'm working in a Node environment on Windows. My code is receiving 30 Buffer objects (~500-900kb each) each second, and I need to save this data to the file system as quickly as possible, without engaging in any work that blocks the receipt of the following Buffer (i.e. the goal is to save the data from every buffer, for ~30-45 minutes). For what it's worth, the data is sequential depth frames from a Kinect sensor.
My question is: What is the most performant way to write files in Node?
Here's pseudocode:
const fs = require('fs')

let num = 0
async function writeFile(filename, data) {
    fs.writeFileSync(filename, data)
}

// This fires 30 times/sec and runs for 30-45 min
dataSender.on('gotData', function(data){
    let filename = 'file-' + num++
    // Do anything with data here to optimize write?
    writeFile(filename, data)
})
fs.writeFileSync seems much faster than fs.writeFile, which is why I'm using that above. But are there any other ways to operate on the data or write to file that could speed up each save?
First off, you never want to use fs.writeFileSync() when handling real-time data, because it blocks the entire node.js event loop until the file write is done.
OK, since you are writing each block of data to a different file, you want to allow multiple disk writes to be in process at the same time, but not an unlimited number. So it's still appropriate to use a queue, but this time the queue doesn't have just one write in process at a time; it has some number of writes in process at the same time:
const fs = require('fs');
const EventEmitter = require('events');

class Queue extends EventEmitter {
    constructor(basePath, baseIndex, concurrent = 5) {
        super();
        this.basePath = basePath;
        this.q = [];
        this.paused = false;
        this.inFlightCntr = 0;
        this.fileCntr = baseIndex;
        this.maxConcurrent = concurrent;
    }

    // add item to the queue and write (if not already at max concurrency)
    add(data) {
        this.q.push(data);
        this.write();
    }

    // write next blocks from the queue (up to maxConcurrent at a time)
    write() {
        while (!this.paused && this.q.length && this.inFlightCntr < this.maxConcurrent) {
            this.inFlightCntr++;
            let buf = this.q.shift();
            try {
                fs.writeFile(this.basePath + this.fileCntr++, buf, err => {
                    this.inFlightCntr--;
                    if (err) {
                        this.err(err);
                    } else {
                        // write more data
                        this.write();
                    }
                });
            } catch(e) {
                this.err(e);
            }
        }
    }

    err(e) {
        this.pause();
        this.emit('error', e)
    }

    pause() {
        this.paused = true;
    }

    resume() {
        this.paused = false;
        this.write();
    }
}

let q = new Queue("file-", 0, 5);

// This fires 30 times/sec and runs for 30-45 min
dataSender.on('gotData', function(data){
    q.add(data);
});

q.on('error', function(e) {
    // got some sort of write error here
    console.log(e);
});
Things to consider:
Experiment with the concurrent value you pass to the Queue constructor. Start with a value of 5. Then see if raising that value any higher gives you better or worse performance. The node.js file I/O subsystem uses a thread pool to implement asynchronous disk writes, so there is a maximum number of concurrent writes it will allow; cranking the concurrent number up really high probably does not make things go faster.
You can experiment with increasing the size of the disk I/O thread pool by setting the UV_THREADPOOL_SIZE environment variable before you start your node.js app.
Your biggest friend here is disk write speed. So, make sure you have a fast disk with a good disk controller. A fast SSD on a fast bus would be best.
If you can spread the writes out across multiple actual physical disks, that will likely also increase write throughput (more disk heads at work).
This is a prior answer based on the initial interpretation of the question (before editing that changed it).
Since it appears you need to do your disk writes in order (all to the same file), I'd suggest that you either use a write stream and let the stream object serialize and buffer the data for you, or create a queue yourself like this:
const fs = require('fs');
const EventEmitter = require('events');

class Queue extends EventEmitter {
    // takes an already opened file handle
    constructor(fileHandle) {
        super();
        this.f = fileHandle;
        this.q = [];
        this.nowWriting = false;
        this.paused = false;
    }

    // add item to the queue and write (if not already writing)
    add(data) {
        this.q.push(data);
        this.write();
    }

    // write next block from the queue (if not already writing)
    write() {
        if (!this.nowWriting && !this.paused && this.q.length) {
            this.nowWriting = true;
            let buf = this.q.shift();
            fs.write(this.f, buf, (err, bytesWritten) => {
                this.nowWriting = false;
                if (err) {
                    this.pause();
                    this.emit('error', err);
                } else {
                    // write next block
                    this.write();
                }
            });
        }
    }

    pause() {
        this.paused = true;
    }

    resume() {
        this.paused = false;
        this.write();
    }
}

// pass an already opened file handle
let q = new Queue(fileHandle);

// This fires 30 times/sec and runs for 30-45 min
dataSender.on('gotData', function(data){
    q.add(data);
});

q.on('error', function(err) {
    // got a disk write error here
});
You could use a writeStream instead of this custom Queue class, but the problem with that is that the writeStream may fill up and then you'd have to have a separate buffer as a place to put the data anyway. Using your own custom queue like above takes care of both issues at once.
Other Scalability/Performance Comments
Because you appear to be writing the data serially to the same file, your disk writing won't benefit from clustering or running multiple operations in parallel because they basically have to be serialized.
If your node.js server has other things to do besides just doing these writes, there might be a slight advantage (would have to be verified with testing) to creating a second node.js process and doing all the disk writing in that other process. Your main node.js process would receive the data and then pass it to the child process that would maintain the queue and do the writing.
Another thing you could experiment with is coalescing writes. When you have more than one item in the queue, you could combine them together into a single write. If the writes are already sizable, this probably doesn't make much difference, but if the writes were small this could make a big difference (combining lots of small disk writes into one larger write is usually more efficient).
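As a sketch of that idea (an assumption, not code from the original answer), the queued buffers could be combined with Buffer.concat and written with a single fs.write; something like the helper below could slot into the Queue's write() path:
const fs = require('fs');

// Sketch: combine all pending Buffers into one larger buffer and issue a
// single fs.write for it (fileHandle is an already opened fd).
function writeCoalesced(fileHandle, pendingBuffers, callback) {
    // take everything queued so far and empty the array in place
    const combined = Buffer.concat(pendingBuffers.splice(0, pendingBuffers.length));
    fs.write(fileHandle, combined, callback);
}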
Your biggest friend here is disk write speed. So, make sure you have a fast disk with a good disk controller. A fast SSD would be best.
I have written a service that does this extensively and the best thing you can do is to pipe the input data directly to the file (if you have an input stream as well).
A simple example where you download a file in such a way:
const http = require('http')
const fs = require('fs')

const ostream = fs.createWriteStream('./output')

http.get('http://nodejs.org/dist/index.json', (res) => {
    res.pipe(ostream)
}).on('error', (e) => {
    console.error(`Got error: ${e.message}`);
})
So in this example there is no intermediate copying of the whole file. As the file is read in chunks from the remote HTTP server, it is written to the file on disk. This is much more efficient than downloading the whole file from the server, holding it in memory, and then writing it to a file on disk.
Streams are a basis of many operations in Node.js so you should study those as well.
One other thing you should investigate, depending on your scenario, is UV_THREADPOOL_SIZE, as I/O operations use the libuv thread pool, which is set to 4 by default, and you might fill that up if you do a lot of writing.
