I am trying to bulk insert around 500k records into Redis using a pipeline, and my code looks roughly like this:
const seedCache = async (
  products: Array<Item>,
  chunkSize,
) => {
  for (const chunk of _.chunk(products, chunkSize)) {
    chunk.forEach((item) => {
      pipeline.set(item.id, item.data);
    });
    await pipeline.exec();
    console.log(pipeline.length);
  }
  redis.quit();
};
Essentially, I load chunkSize items into the pipeline, then wait until pipeline.exec() returns, then continue.
I expected that "console.log(pipeline.length)" would be printing "0" every time, since it is only getting run after the pipeline has been flushed to Redis. However, I'm finding that pipeline.length is never getting reset to 0; instead, it just grows and grows until its length is equal to products.length by the end. This is causing my machine to run out of memory for large datasets.
Does anybody know why this is happening? Also, is this even the correct way to bulk insert records into Redis? I ask because running this script with 5000 products and a batch size of 100 only inserts 200 into the cache, whereas it successfully inserts an array of 1000 products with the same batch size. The documents are quite large (~5 kB), so the insert needs to be done in batches somehow.
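(A minimal sketch of one alternative, assuming the client is ioredis and the lodash _.chunk helper from the question: build a fresh pipeline per chunk instead of reusing one, so each exec() only ever flushes that chunk's commands.)
const Redis = require('ioredis');
const _ = require('lodash');

const redis = new Redis();

const seedCache = async (products, chunkSize) => {
  for (const chunk of _.chunk(products, chunkSize)) {
    const pipeline = redis.pipeline();                        // new pipeline per chunk
    chunk.forEach((item) => pipeline.set(item.id, item.data));
    await pipeline.exec();                                    // flushes only this chunk
  }
  await redis.quit();
};
With this pattern, memory stays bounded at roughly one chunk's worth of queued commands at a time.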
Related
Here is my function where I am trying to save data extracted from an Excel file. I am using the XLSX npm package to extract the data from the Excel file.
function myFunction() {
  const excelFilePath = "/ExcelFile2.xlsx"
  if (fs.existsSync(path.join('uploads', excelFilePath))) {
    const workbook = XLSX.readFile(`./uploads${excelFilePath}`)
    const [firstSheetName] = workbook.SheetNames;
    const worksheet = workbook.Sheets[firstSheetName];
    const rows = XLSX.utils.sheet_to_json(worksheet, {
      raw: false, // Use raw values (true) or formatted strings (false)
      // header: 1, // Generate an array of arrays ("2D Array")
    });
    // res.send({rows})
    const serviceAccount = require('./*******-d75****7a06.json');
    admin.initializeApp({
      credential: admin.credential.cert(serviceAccount)
    });
    const db = admin.firestore()
    rows.forEach((value) => {
      db.collection('users').doc().onSnapshot((snapShot) => {
        docRef.set(value).then((respo) => {
          console.log("Written")
        })
        .catch((reason) => {
          console.log(reason.note)
        })
      })
    })
    console.log(rows.length)
  }
}
Here is the error I am getting; this process also uses up all of my system memory:
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
It's pretty normal in Firebase/Firestore land to see errors like this when trying to add too much data at once.
Firebase Functions tend to time out, and even if you configure them to run all the way to 9 minutes, they'll still time out eventually and you end up with partial data and/or errors.
Here's how I do things like this (a rough sketch follows after these steps):
Write a function that writes 500 entries at a time (using a batch write).
Use an entry identifier (let's call it userId), so the function knows which user was last recorded to the database. Let's call that value lastUserRecorded.
After each iteration (batch write of 500 entries), have your function record the value of lastUserRecorded inside a temporary document in the database.
When the function runs again, it should first read the value of lastUserRecorded from the db, then write a new batch of 500 users, starting AFTER that value (it would select a new set of 500 users from your Excel file, but start after lastUserRecorded).
To avoid running into function timeout issues, I would schedule the function to run every minute (Cloud Scheduler trigger). That way it's very likely that the function can handle a batch of 500 writes without timing out and recording partial data.
If you do it this way, 100k entries will take around 3 hours and 20 minutes to finish (200 batches at one batch per minute).
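Here's a minimal sketch of one such batch iteration, assuming firebase-admin is already initialized and the rows array comes from XLSX as in the question; the importState/progress checkpoint document and the lastIndex field are just placeholder names:
const admin = require('firebase-admin');

async function writeNextBatch(rows) {
  const db = admin.firestore();
  const stateRef = db.collection('importState').doc('progress'); // temporary checkpoint doc

  // Read where the previous run stopped (-1 means nothing written yet).
  const stateSnap = await stateRef.get();
  const lastIndex = stateSnap.exists ? stateSnap.data().lastIndex : -1;

  // Take the next 500 rows (Firestore's per-batch write limit).
  const slice = rows.slice(lastIndex + 1, lastIndex + 501);
  if (slice.length === 0) return; // nothing left to import

  const batch = db.batch();
  slice.forEach((row) => {
    batch.set(db.collection('users').doc(), row);
  });
  await batch.commit();

  // After the batch succeeds, record how far we got so the next run can resume.
  await stateRef.set({ lastIndex: lastIndex + slice.length }, { merge: true });
}
Run on a Cloud Scheduler trigger, this would execute once a minute until the slice comes back empty.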
I have an 85 MB data file with 110k text records in it. I need to parse each of these records and publish an SNS message to a topic for each one. I am doing this successfully, but the Lambda function requires a lot of time to run, as well as a large amount of memory. Consider the following:
const parse = async (key) => {
  // get the 85 MB file from S3. this takes 3 seconds
  // I could probably do this via a stream to cut down on memory...
  let file = await getFile( key );

  // parse the data by new line
  const rows = file.split("\n");

  // free some memory now
  // this freed up ~300mb of memory in my tests
  file = null;

  const requests = [];
  for( let i = 0; i < rows.length; i++ ) {
    // ... parse the row and build a small JS object (data) from it

    // publish to SNS. assume publishMsg returns a promise after a successful SNS push
    requests.push( publishMsg(data) );
  }

  // wait for all to finish
  await Promise.all(requests);
  return 1;
};
The Lambda function will time out with this code at 90 seconds (the current limit I have set). I could raise this limit, as well as the memory (currently at 1024 MB), and likely solve my issue. But none of the SNS publish calls take place when the function hits the timeout. Why?
Let's say 10k rows process before the function hits the timeout. Since I am submitting the publishes asynchronously, shouldn't several of these complete regardless of the timeout? It seems they only run if the entire function completes.
I have run a test where I cut the data down to 15k rows, and it runs without any issue, in roughly 15 seconds.
So the question: why are the async calls not firing prior to the function timeout? And any input on how I can optimize this without moving away from Lambda would be appreciated.
Lambda config: Node.js 10.x, 1024 MB, 90-second timeout.
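(One common way to bound both memory and in-flight requests, not from the original post, is to publish in fixed-size chunks and await each chunk before reading on, rather than collecting 110k promises for one giant Promise.all at the end. A rough sketch, reusing the question's getFile and publishMsg helpers; buildMessage and the chunk size are illustrative:)
const parse = async (key) => {
  let file = await getFile(key);
  const rows = file.split("\n");
  file = null;

  const CHUNK_SIZE = 200; // illustrative; tune to stay within SNS/Lambda limits

  for (let i = 0; i < rows.length; i += CHUNK_SIZE) {
    const chunk = rows.slice(i, i + CHUNK_SIZE);
    // Only CHUNK_SIZE publishes are in flight at once, so requests actually
    // complete as we go instead of piling up behind a single Promise.all.
    await Promise.all(chunk.map((row) => publishMsg(buildMessage(row))));
  }
  return 1;
};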
I have a CSV with over a million lines, and I want to import all the lines into DynamoDB. I'm able to loop through the CSV just fine; however, when I try to call DynamoDB PutItem on these lines, I run out of heap memory after about 18k calls.
I don't understand why this memory is being used or how I can get around this issue. Here is my code:
// implied requires (exact csv-parse / aws-sdk paths may vary by version):
const { createReadStream } = require('fs');
const { createInterface } = require('readline');
const { once } = require('events');
const parse = require('csv-parse/lib/sync');
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB();

let insertIntoDynamoDB = async () => {
  const file = './file.csv';
  let index = 0;
  const readLine = createInterface({
    input: createReadStream(file),
    crlfDelay: Infinity
  });
  readLine.on('line', async (line) => {
    let record = parse(`${line}`, {
      delimiter: ',',
      skip_empty_lines: true,
      skip_lines_with_empty_values: false
    });
    await dynamodb.putItem({
      Item: {
        "Id": {
          S: record[0][2]
        },
        "newId": {
          S: record[0][0]
        }
      },
      TableName: "My-Table-Name"
    }).promise();
    index++;
    if (index % 1000 === 0) {
      console.log(index);
    }
  });
  // halts process until all lines have been processed
  await once(readLine, 'close');
  console.log('FINAL: ' + index);
}
If I comment out the DynamoDB call, I can loop through the file just fine and read every line. Where is this memory usage coming from? My DynamoDB write throughput is at 500; adjusting this value has no effect.
For anyone who is trudging through the internet trying to find out why DynamoDB is consuming all the heap memory, there is a GitHub bug report here: https://github.com/aws/aws-sdk-js/issues/1777#issuecomment-339398912
Basically, the AWS SDK only has 50 sockets to make HTTP requests; if all sockets are consumed, events are queued until a socket becomes available. When processing millions of requests, these sockets get consumed immediately, and then the queue builds up until it blows up the heap.
So, then how do you get around this?
1. Increase heap size
2. Increase number of sockets
3. Control how many "events" you are queueing
Options 1 and 2 are the easy way out, but do not scale. They might work for your scenario if you are doing a one-off thing, but if you are trying to build a robust solution, then you will want to go with number 3. (A quick sketch of option 2 follows below.)
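(If you do just want the quick fix of option 2, the aws-sdk v2 socket pool can be widened via httpOptions; the 500 here is illustrative, not a recommendation from the original answer:)
const https = require('https');
const AWS = require('aws-sdk');

AWS.config.update({
  httpOptions: {
    // A larger keep-alive agent so more requests can be in flight before queueing.
    agent: new https.Agent({ keepAlive: true, maxSockets: 500 })
  }
});

const dynamodb = new AWS.DynamoDB();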
To do number 3, I determine the max heap size and divide it by how large I think an "event" will be in memory. For example: I assume an updateItem event for DynamoDB would be 100,000 bytes. My heap size was 4 GB, so 4,000,000,000 B / 100,000 B = 40,000 events. However, I only take 50% of that many events, to leave room on the heap for other things the node application might be doing. This percentage can be lowered/increased depending on your preference. Once I have the number of events, I read a line from the CSV and consume an event; when the event has completed, I release the event back into the pool. If there are no events available, I pause the input stream to the CSV until an event becomes available.
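A minimal sketch of that pool, assuming the same readline/DynamoDB setup as the question; the pool size and the buildPutRequest helper are illustrative placeholders:
const { createReadStream } = require('fs');
const { createInterface } = require('readline');
const AWS = require('aws-sdk');

const dynamodb = new AWS.DynamoDB();
const MAX_IN_FLIGHT = 20000; // e.g. 50% of (heap size / estimated bytes per request)
let inFlight = 0;
let paused = false;

const readLine = createInterface({
  input: createReadStream('./file.csv'),
  crlfDelay: Infinity
});

readLine.on('line', (line) => {
  inFlight++; // consume an event from the pool
  if (inFlight >= MAX_IN_FLIGHT && !paused) {
    readLine.pause(); // pool exhausted: stop reading until writes drain
    paused = true;
  }

  // buildPutRequest is a hypothetical helper that turns a CSV line into putItem params.
  dynamodb.putItem(buildPutRequest(line)).promise()
    .catch(console.error)
    .finally(() => {
      inFlight--; // release the event back into the pool
      if (paused && inFlight < MAX_IN_FLIGHT / 2) {
        readLine.resume(); // refill once there is headroom again
        paused = false;
      }
    });
});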
Now I can upload millions of entries to dynamodb without any worry of blowing up the heap.
I am trying to insert more than 100 records using the batch() method.
client.batch(batchQuery, { prepare: true }, function (err, result) {
  if (err) {
    res.status(404).json({ msg: err });
  } else {
    res.json([result.rows][0]);
  }
});
batchQuery has more than 100 insert queries. It works if there are fewer than 7 records; if there are more than 10, I get "Batch too large".
You shouldn't use batches for bulk inserts into Cassandra (in contrast to an RDBMS). The error you're getting means that you're inserting data into different partitions, which puts additional load on the node that receives the query. You should use batches only when you're inserting into the same partition; in that case they are applied as a single mutation.
Otherwise, sending individual insert queries via async execute will be much faster (see the sketch below). You just need to avoid sending too many requests at the same time (see this answer).
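A minimal sketch of that approach, assuming the same cassandra-driver client as in the question and a hypothetical users(id, name) table; the concurrency limit is illustrative:
const insertQuery = 'INSERT INTO users (id, name) VALUES (?, ?)';
const CONCURRENCY = 100; // how many inserts to keep in flight at once; tune for your cluster

async function insertAll(rows) {
  for (let i = 0; i < rows.length; i += CONCURRENCY) {
    const slice = rows.slice(i, i + CONCURRENCY);
    // Each row becomes its own prepared, single-partition insert.
    await Promise.all(
      slice.map((row) =>
        client.execute(insertQuery, [row.id, row.name], { prepare: true })
      )
    );
  }
}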
You can read more about good and bad uses of batches in the documentation and in related answers on SO.
I've been testing the limits of MongoDB to see whether it will work for an upcoming project, and I've noticed that upserts are quite slow compared to inserts.
Of course, I'd expect them to be slower, but not (almost) an order of magnitude slower (7400 vs 55000 ops/sec). Here's the (Node.js native driver) benchmarking code that I used:
(async function() {
  let db = await require('mongodb').MongoClient.connect('mongodb://localhost:27017/mongo-benchmark-8764824692947');
  db.collection('text').createIndex({text:1},{unique:true})
  let batch = db.collection('text').initializeOrderedBulkOp();
  let totalOpCount = 0;
  let batchOpCount = 0;
  let start = Date.now();
  while(1) {
    totalOpCount++;
    batchOpCount++;
    if(batchOpCount === 1000) { // batch 1000 ops at a time
      await batch.execute();
      batch = db.collection('text').initializeOrderedBulkOp();
      batchOpCount = 0;
      let secondsElapsed = (Date.now() - start)/1000;
      console.log(`(${Math.round(totalOpCount/secondsElapsed)} ops per sec) (${totalOpCount} total ops)`)
    }

    ///////// INSERT TEST ///////// (~55000 ops/sec)
    // batch.insert({text:totalOpCount});

    ///////// UPSERT TEST ///////// (~7400 ops/sec)
    let text = Math.floor(Math.random()*1000000);
    batch.find({text}).upsert().updateOne({$setOnInsert:{text}});

    if(totalOpCount > 500000) {
      console.log("<< finished >>");
      await db.dropCollection('text');
      db.close();
      break;
    }
  }
})();
You can easily run it yourself by pasting it into index.js, running npm init -y and npm install --save mongodb, and then running node .
When we upsert a document, the mongo engine has to check whether there's an existing document that matches it. This might explain some of the slowdown, but doesn't an insert command on a unique index require the same collision checking? Thanks!
Edit: Turns out $setOnInsert is needed else we get duplicate key errors.
I made each batch bigger by changing if(batchOpCount === 1000) to if(batchOpCount === 50000), and got ~90000 ops/sec for insert and ~35000 ops/sec for upsert. I'm not exactly sure why the batch size makes a relative difference. It makes sense that smaller batches would result in fewer ops/sec (communication overhead), but I'm not sure why upserts suffer more than inserts, though this is probably a topic for a separate question.
This (roughly) 3-fold speed difference is definitely closer to the performance difference between the two operations that I'd expect, but it still seems like upserts are a little slower than they should be, given that inserts must also check for collisions (due to the unique index). This is really only a half-answer, so I'm going to leave this question open in the hope that a mongo pro will come along and provide a more complete answer.