I have tried several solutions to get this working, but they all failed. I am reading documents from MongoDB using cursor.eachAsync() and converting some of their fields. I need to move these docs to another collection after conversion. My idea is that after every 1000 docs are processed, they should be bulk-inserted into the destination collection. This works well until the last batch, which has fewer than 1000 records. To phrase the same problem differently: if the number of remaining records is < 1000, they are never inserted.
1. First version - bulk insert after eachAsync()
Like any other code, I expected the remaining docs (< 1000) to still be in the bulk object after eachAsync() finishes, so that I could insert them there. But I find bulk.length is 0. (I have removed those statements from the code snippet below.)
```js
async function run() {
  await mongoose.connect(dbPath, dbOptions);
  const cursor = events.streamEvents(query, 10);
  let successCounter = 0;
  let bulkBatchSize = 1000;
  let bulkSizeCounter = 0;
  let sourceDocsCount = 80;
  var bulk = eventsConvertedModel.collection.initializeOrderedBulkOp();
  await cursor.eachAsync((doc) => {
    let pPan = new Promise((resolve, reject) => {
      getTokenSwap(doc.panTokenIdentifier, doc._id)
        .then((swap) => {
          resolve(swap);
        });
    });
    let pXml = new Promise((resolve, reject) => {
      getXmlObject(doc)
        .then(getXmlObjectToken)
        .then((newXmlString) => {
          resolve(newXmlString);
        })
        .catch((error) => {
          reject(error);
        });
    });
    Promise.all([pPan, pXml])
      .then(([panSwap, xml]) => {
        doc.panTokenIdentifier = panSwap;
        doc.eventRecordTokenText = xml;
        return doc;
      })
      .then((newDoc) => {
        successCounter++;
        bulkSizeCounter++;
        bulk.insert(newDoc);
        if (bulkSizeCounter % bulkBatchSize === 0) {
          bulk.execute()
            .then((result) => {
              bulkSizeCounter = 0;
              let msg = "Conversion- bulk insert =" + result.nInserted;
              console.log(msg);
              bulk = eventsConvertedModel.collection.initializeOrderedBulkOp();
            })
            .catch((bulkErr) => {
              logger.error(bulkErr);
            });
        }
      })
      .catch((err) => {
        console.log(err);
      });
  });
  console.log("outside-async=" + bulk.length); // always 0
  console.log("run()- Docs converted in this run =" + successCounter);
  process.exit(0);
}
```
2. Second version - track the expected number of iterations and, after all iterations, change the batch size (to, say, 10).
Result - The batch size value changes, but it's not reflected in bulk.insert. The records are lost.
3. Same as the 2nd, but insert one record at a time after the bulk inserts are done.
```js
let d = eventsConvertedModel(newDoc);
d.isNew = true;
d._id = mongoose.Types.ObjectId();
d.save()
  .then((saved) => {
    console.log(saved._id);
    Promise.resolve();
  })
  .catch((saveFailed) => {
    console.log(saveFailed);
    Promise.resolve();
  });
```
Result - I was getting a DocumentNotFound error, so I added d.isNew = true. But for some reason only a few records get inserted and many of them get lost.
I have also tried other variations based on the expected number of bulk-insert iterations. Finally, I changed the code to write to a file (one doc at a time), but I am still wondering if there is any way to make the writes to the DB work.
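What I was aiming for is roughly the following pattern: make eachAsync() wait on each document by passing it an async callback, and flush whatever is left in the bulk once the loop ends. A minimal sketch, assuming the same models and helpers (getTokenSwap, getXmlObject, getXmlObjectToken) as above:
```js
// Sketch only - getTokenSwap, getXmlObject, getXmlObjectToken and the models
// are the same as in the snippets above.
async function run() {
  await mongoose.connect(dbPath, dbOptions);
  const cursor = events.streamEvents(query, 10);
  const bulkBatchSize = 1000;
  let bulk = eventsConvertedModel.collection.initializeOrderedBulkOp();
  let pending = 0;

  // Passing an async function makes eachAsync wait for each doc to finish.
  await cursor.eachAsync(async (doc) => {
    const [panSwap, xml] = await Promise.all([
      getTokenSwap(doc.panTokenIdentifier, doc._id),
      getXmlObject(doc).then(getXmlObjectToken),
    ]);
    doc.panTokenIdentifier = panSwap;
    doc.eventRecordTokenText = xml;
    bulk.insert(doc);
    pending++;
    if (pending === bulkBatchSize) {
      await bulk.execute();
      bulk = eventsConvertedModel.collection.initializeOrderedBulkOp();
      pending = 0;
    }
  });

  // Flush the final partial batch (fewer than 1000 docs).
  if (pending > 0) {
    await bulk.execute();
  }
  process.exit(0);
}
```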
Dependencies:
Node v8.0.0
Mongoose 5.2.2
Related
I'm using Node.js, mongoose.js, and MongoDB on an M0 Sandbox (General) Atlas tier. I want to use streaming to make better use of resources. Here's my code:
```js
let counter = 0;
let batchCounter = 0;
const batchSize = 500;
const tagFollowToBeSaved = [];
let startTime;

USER.find({}, { _id: 1 }).cursor()
  .on('data', (id) => {
    if (counter === 0) {
      startTime = process.hrtime();
    }
    if (counter === batchSize) {
      counter = 0;
      const [s2] = process.hrtime();
      console.log('finding ', batchSize, ' data took: ', s2 - startTime[0]);
      TAG_FOLLOW.insertMany(tagFollowToBeSaved, { ordered: false })
        .then((ok) => {
          batchCounter += 1;
          const [s3] = process.hrtime();
          console.log(`batch ${batchCounter} took: ${s3 - s2}`);
        })
        .catch((error) => {
          console.log('error in inserting following: ', error);
        });
      // deleting array
      tagFollowToBeSaved.splice(0, tagFollowToBeSaved.length);
    }
    else {
      counter += 1;
      tagFollowToBeSaved.push({
        tagId: ObjectId('5e81ba5a5c86d7215cdc9c88'),
        followedBy: id,
      });
    }
  })
  .on('error', (error) => {
    console.log('error while streaming: ', error);
  })
  .on('end', () => {
    console.log('END');
  });
```
What I'm doing:
First, I read the _ids of 200k users from the USER collection, using the streams API with a cursor.
Then I check whether the number of _ids read has reached the batch size limit; if so, I insert those documents into the TAG_FOLLOW model.
All the console.logs are for analysis purposes, and you can ignore them.
However, insertMany does not work properly. Because I'm monitoring the database while inserting data, I can see that it inserts about 100 documents instead of 500 per batch, without throwing any error, so I'm losing about 400 documents per batch, which is unacceptable!
Changing the batchSize value does not help; I still lose some data per batch.
So what is the problem? What is a better approach for saving a large amount of data in an optimized and reliable way?
Thanks in advance.
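Two things stand out in the snippet above: the final partial batch (fewer than batchSize ids when 'end' fires) is never inserted at all, and the array may be spliced while insertMany is still working on it. A minimal sketch of a safer pattern, assuming the same USER/TAG_FOLLOW models and tag id from the question:
```js
// Sketch, using the USER and TAG_FOLLOW models from the question.
// Hand the filled array off instead of splicing it, pause the stream while
// the insert is in flight, and flush the remainder in the 'end' handler.
const batchSize = 500;
let batch = [];

const stream = USER.find({}, { _id: 1 }).cursor();
stream
  .on('data', (user) => {
    batch.push({ tagId: ObjectId('5e81ba5a5c86d7215cdc9c88'), followedBy: user._id });
    if (batch.length === batchSize) {
      const toInsert = batch; // hand off the reference; never splice it later
      batch = [];
      stream.pause();
      TAG_FOLLOW.insertMany(toInsert, { ordered: false })
        .then(() => stream.resume())
        .catch((error) => console.log('error in inserting following: ', error));
    }
  })
  .on('end', () => {
    if (batch.length > 0) {
      TAG_FOLLOW.insertMany(batch, { ordered: false })
        .catch((error) => console.log('error in inserting following: ', error));
    }
    console.log('END');
  })
  .on('error', (error) => console.log('error while streaming: ', error));
```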
I am trying to write items to AWS DynamoDB using the Node SDK. The problem I am facing is that when I write batch items to AWS in parallel using threads, some of the items are not written to the database. The number of items written is random: if I run my code 3 times, one run might write 150, the next 200, and the third 135. In addition, even when I write the items sequentially without threads, some items are still missing, although in this case fewer of them: if the total number of items is 300, then 298 get written, for instance. I investigated whether there were any unprocessed items, but the batchWrite method returns nothing, which suggests all the items are being processed correctly. Please note that I have on-demand provisioning for the database in question, so I do not expect any throttling issues. So here is my code.
```js
exports.run = async function() {
  // This is the function which runs first !!!!!
  const data = await getArrayOfObjects();
  console.log("TOTAL PRICE CHANGES");
  console.log(data.length);
  const batchesOfData = makeBatches(data);
  const threads = new Set();
  console.log("**********");
  console.log(batchesOfData.length);
  console.log("**********");
  for (let i = 0; i < batchesOfData.length; i++) {
    console.log("BATCH!!!!!");
    console.log(i);
    console.log(batchesOfData[i].length);
    // Sequential approach
    const response = await compensationHelper.createItems(batchesOfData[i]);
    console.log("RESPONSE");
    console.log(response);
    // Parallel approach
    // const workerResult = await runService(batchesOfData[i])
    // console.log("WORKER RESUULT!!!!")
    // console.log(workerResult);
  }
};

exports.updateItemsInBatch = async function(data, tableName) {
  console.log("WRITING DATA");
  console.log(data.length);
  const batchItems = {
    RequestItems: {},
  };
  batchItems.RequestItems[tableName] = data;
  try {
    const result = await documentClient.batchWrite(batchItems).promise();
    console.log("UNPROCESSED ITEMS");
    console.log(result);
    if (result instanceof Error) {
      console.log(`[Error]: ${JSON.stringify(Error)}`);
      throw new Error(result);
    }
    return Promise.resolve(true);
  } catch (err) {
    console.error(`[Error]: ${JSON.stringify(err.message)}`);
    return Promise.reject(new Error(err));
  }
};

exports.convertToAWSCompatibleFormat = function(data) {
  const awsCompatibleData = [];
  data.forEach(record => awsCompatibleData.push({ PutRequest: { Item: record } }));
  return awsCompatibleData;
};

const createItems = async function(itemList) {
  try {
    const objectsList = [];
    for (let index = 0; index < itemList.length; index++) {
      try {
        const itemListObj = itemList[index];
        const ObjToBeInserted = {
          // some data assignments here
        };
        objectsList.push(ObjToBeInserted);
        if (
          objectsList.length >= AWS_BATCH_SIZE ||
          index === itemList.length - 1
        ) {
          const awsCompatiableFormat = convertToAWSCompatibleFormat(
            objectsList
          );
          await updateItemsInBatch(
            awsCompatiableFormat,
            process.env.myTableName
          );
        }
      } catch (error) {
        console.log(`[Error]: ${JSON.stringify(error)}`);
      }
    }
    return Promise.resolve(true);
  } catch (err) {
    return Promise.reject(new Error(err));
  }
};

const makeBatches = products => {
  const productBatches = [];
  let countr = -1;
  for (let index = 0; index < products.length; index++) {
    if (index % AWS_BATCH_SIZE === 0) {
      countr++;
      productBatches[countr] = [];
      if (countr === MAX_BATCHES) {
        break;
      }
    }
    try {
      productBatches[countr].push(products[index]);
    } catch (error) {
      continue;
    }
  }
  return productBatches;
};

async function runService(workerData) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(path.join(__dirname, './worker.js'), { workerData });
    worker.on('message', resolve);
    worker.on('error', reject);
    worker.on('exit', (code) => {
      if (code !== 0)
        reject(new Error(`Worker stopped with exit code ${code}`));
    });
  });
}

// My worker file
'use strict';
const { workerData, parentPort } = require('worker_threads');
const createItems = require('myscripts');
// You can do any heavy stuff here, in a synchronous way
// without blocking the "main thread"
console.log("I AM A NEW THREAD");
createItems(workerData);
// console.log('Going to write tons of content on file '+workerData);
parentPort.postMessage({ fileName: workerData, status: 'Done' });
```
From boto3 documentation:
If one or more of the following is true, DynamoDB rejects the entire batch write operation:
One or more tables specified in the BatchWriteItem request does not exist.
Primary key attributes specified on an item in the request do not match those in the corresponding table's primary key schema.
You try to perform multiple operations on the same item in the same BatchWriteItem request. For example, you cannot put and delete the same item in the same BatchWriteItem request.
Your request contains at least two items with identical hash and range keys (which essentially is two put operations).
There are more than 25 requests in the batch.
Any individual item in a batch exceeds 400 KB.
The total request size exceeds 16 MB.
To me, it looks like some of this is true. At my job, we also had a problem where one batch contained two items with identical primary and sort keys, so the whole batch was discarded. I know it's not Node.js, but we used boto3's batch_writer(overwrite_by_pkeys) to overcome that problem.
It de-duplicates the batch, keeping only the last occurrence of each primary key/sort key pair. If only a small portion of your data is duplicated and you do not need to keep it, you can use this. BUT if you need to save all your data, I do not advise using this functionality.
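In Node.js the same de-duplication can be done by hand before calling batchWrite. A sketch, where pk and sk are placeholder attribute names for your table's partition and sort keys:
```js
// Sketch: keep only the last occurrence of each (pk, sk) pair in a batch,
// mirroring boto3's overwrite_by_pkeys. 'pk' and 'sk' are placeholders;
// substitute your table's actual key attribute names.
function dedupeByKeys(putRequests) {
  const byKey = new Map();
  for (const req of putRequests) {
    const item = req.PutRequest.Item;
    byKey.set(`${item.pk}|${item.sk}`, req); // later items overwrite earlier ones
  }
  return [...byKey.values()];
}
```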
I don't see where you are checking the response for UnprocessedItems. Batch operations will often return a list of items they didn't process. As documented, BatchWriteItem "can write up to 16 MB of data, which can comprise as many as 25 put or delete requests."
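A minimal sketch of such a check-and-retry loop (the delay and the retry cap are arbitrary choices, not part of the SDK):
```js
// Sketch: re-submit whatever batchWrite reports back as unprocessed.
// The linear delay and retry cap are arbitrary; production code would
// typically use exponential backoff with jitter.
async function batchWriteWithRetry(documentClient, tableName, requests, maxRetries = 5) {
  let pending = { [tableName]: requests };
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const result = await documentClient.batchWrite({ RequestItems: pending }).promise();
    if (!result.UnprocessedItems || Object.keys(result.UnprocessedItems).length === 0) {
      return; // everything was written
    }
    pending = result.UnprocessedItems;
    await new Promise((resolve) => setTimeout(resolve, 100 * (attempt + 1)));
  }
  throw new Error('Items still unprocessed after retries');
}
```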
I had a duplicate-keys issue, meaning the primary key and the sort key had duplicate values within the batch. In my case, however, AWS's BatchWrite method did not return this error when my timestamps had fractional seconds like 2020-02-09T08:02:36.71, which was a bit surprising. I resolved the issue by making my createdAt (sort key) more granular, like this: 2020-02-09T08:02:36.7187, thus making it non-repeating.
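For instance, one hypothetical way to guarantee non-repeating sort keys within a batch is to append a rolling counter to the timestamp:
```js
// Hypothetical sketch: append a rolling counter to the ISO timestamp so that
// two items created in the same millisecond still get distinct sort keys.
let seq = 0;
function granularCreatedAt() {
  const iso = new Date().toISOString(); // e.g. 2020-02-09T08:02:36.718Z
  const suffix = String(seq++ % 1000).padStart(3, '0');
  return iso.replace('Z', suffix); // e.g. 2020-02-09T08:02:36.718042
}
```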
I am trying to use PostgreSQL's COPY FROM API to stream potentially thousands of records into a database as they are dynamically generated in Node.js code. To do so, I wrote this generic wrapper function:
```js
// copyFrom comes from the pg-copy-streams package:
// const { from: copyFrom } = require('pg-copy-streams');
function streamRows(client, { table, columns, data }) {
  return new Promise((resolve, reject) => {
    const sqlStream = client.query(
      copyFrom(`COPY ${ table } (${ columns.join(', ') }) FROM STDIN`));
    const rowStream = new Readable();
    rowStream.pipe(sqlStream)
      .on('finish', resolve)
      .on('error', reject);
    for (const row of data) {
      rowStream.push(`${ row.join('\t') }\n`);
    }
    rowStream.push('\\.\n');
    rowStream.push(null);
  });
}
```
The database table I'm writing into looks like this:
```sql
CREATE TABLE devices (
    id            SERIAL PRIMARY KEY,
    group_id      INTEGER REFERENCES groups(id),
    serial_number CHAR(12) NOT NULL,
    status        INTEGER NOT NULL
);
```
And I am calling it as follows:
```js
function *genRows(id, devices) {
  let count = 0;
  for (const serial of devices) {
    yield [ id, serial, UNSTARTED ];
    count++;
    if (count % 10 === 0) log.info(`Streamed ${ count } rows...`);
  }
  log.info(`Streamed ${ count } rows.`);
}

await streamRows(client, {
  table: 'devices',
  columns: [ 'group_id', 'serial_number', 'status' ],
  data: genRows(id, devices),
});
```
The log statements in my generator function that's producing the per-row data all run as expected, and the output indicates that it is in fact always running the generator to completion, and streaming all the data rows I want. No errors are ever thrown. But if I wait for it to complete, the table sometimes ends up with 0 rows added to it--i.e., it looks like I sent all that data to Postgres, but none of it was actually inserted. What am I doing wrong?
I do not know exactly what parts of this made the difference and what is purely stylistic, but after playing around with a bunch of different examples from across the web, I managed to cobble together this function which works:
```js
function streamRows(client, { table, columns, data }) {
  return new Promise((resolve, reject) => {
    const iterator = data[Symbol.iterator]();
    const rs = new Readable();
    const ws = client.query(copyFrom(`COPY ${ table } (${ columns.join(', ') }) FROM STDIN`));
    rs._read = function() {
      const { value, done } = iterator.next();
      rs.push(done ? null : `${ value.join('\t') }\n`);
    };
    rs.on('error', reject);
    ws.on('error', reject);
    ws.on('end', resolve);
    rs.pipe(ws);
  });
}
```
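For what it's worth, on newer Node versions (12.3+ for Readable.from) the same function can be written with stream.pipeline(), which wires up backpressure and error propagation across both streams. A sketch under that assumption:
```js
// Sketch, assuming Node 12.3+ (Readable.from) and the same pg-copy-streams setup.
const { Readable, pipeline } = require('stream');
const { from: copyFrom } = require('pg-copy-streams');

function streamRows(client, { table, columns, data }) {
  return new Promise((resolve, reject) => {
    const ws = client.query(copyFrom(`COPY ${table} (${columns.join(', ')}) FROM STDIN`));
    const rs = Readable.from(
      (function* () {
        for (const row of data) yield `${row.join('\t')}\n`;
      })()
    );
    // pipeline propagates errors from either stream into the callback.
    pipeline(rs, ws, (err) => (err ? reject(err) : resolve()));
  });
}
```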
I'm trying to write Node.js code that does the following:
Connect to a Salesforce instance.
Get the past 7 days and loop through them.
Run 2 queries for each day and push the results to an array.
Display the values in another function.
Here is my JS code:
```js
var jsforce = require("jsforce");
var moment = require('moment');

function connectToEP() {
  var main_Obj = {};
  var response_Obj = {};
  var pastSevenDaysArray = [];
  var conn = new jsforce.Connection();
  var beforeSevenDays = moment().subtract(7, 'days').format('YYYY-MM-DD');
  var today = moment().startOf('day');
  var i = 0;
  conn.login("myUid", "myPwd").then(() => {
    console.log("Connected To Dashboard");
    for (var m = moment(beforeSevenDays); m.diff(today, 'days') <= 0; m.add(1, 'days')) {
      conn.query("SELECT SUM(Total_ETA_of_all_tasks__c), SUM(Total_ETA__C) from Daily_Update__c where DAY_ONLY(createddate)= " + m.format('YYYY-MM-DD')).then(() => {
        console.log("B1");
        var z = response_Obj.aggrRes;
        response_Obj.aggrRes = res;
        pastSevenDaysArray.push({ z: res });
        console.log("B1 Exit");
      }).then(() => {
        conn.query("SELECT count(Id), Task_Type__c FROM Daily_Task__c where DAY_ONLY(createddate) = " + m.format('YYYY-MM-DD') + " group by Task_Type__c").then(() => {
          console.log("B2");
          var z = response_Obj.aggrRes;
          response_Obj.aggrRes = res;
          pastSevenDaysArray.push({ z: res });
          console.log("B2 Exit");
        })
      })
    }
    return Promise.resolve(pastSevenDaysArray);
  }).then((data) => {
    console.log(typeof data);
    updateMessage(JSON.stringify(data));
    console.log(typeof data);
  });
}

function updateMessage(message) {
  console.log("XXXXXXXXXXXX");
  console.log(message);
  console.log("XXXXXXXXXXXX");
}

function socketNotificationReceived() {
  console.log("socket salesforce rec");
  connectToEP();
}

socketNotificationReceived();
```
When I run this code, the output I get is:
```
socket salesforce rec
Connected To Dashboard
object
XXXXXXXXXXXX
[]
XXXXXXXXXXXX
object
B1
B1
B1
B1
B1
B1
B1
B1
```
I'm very new to this JS platform and unable to get my head around the promises concept :(. Please let me know where I am going wrong and how I can fix it.
An explanation of what's going on would be very helpful for my future projects.
Thanks
The thing I always do when I get confused is to decompose. Build the pieces one by one, and make sure each works. Trying to understand your code, I get something like this...
A function for each piece: logging in, getting a "task sum" from the db, and getting a "task count" from the db. (Task sum/count is what I guessed the queries were up to. Rename as you see fit.)
```js
var jsforce = require("jsforce");
var moment = require('moment');

function login(conn) {
  return conn.login("myUid", "myPwd");
}

function queryTaskSumForDay(conn, m) {
  return conn.query("SELECT SUM(Total_ETA_of_all_tasks__c), SUM(Total_ETA__C) from Daily_Update__c where DAY_ONLY(createddate)= " + m.format('YYYY-MM-DD'));
}

function queryTaskCountForDay(conn, m) {
  return conn.query("SELECT count(Id), Task_Type__c FROM Daily_Task__c where DAY_ONLY(createddate) = " + m.format('YYYY-MM-DD') + " group by Task_Type__c");
}
```
With those working, it should be easy to get a sum and a count for a given day. Rather than returning these in an array (containing two objects that each have a "z" property, as your code did), I opted for a simpler single object that has a sum and a count property. You may need to change this to suit your design. Notice the use of Promise.all() to resolve two promises together...
```js
function sumAndCountForDay(conn, m) {
  let sum = queryTaskSumForDay(conn, m);
  let count = queryTaskCountForDay(conn, m);
  return Promise.all([sum, count]).then(results => {
    return { sum: results[0], count: results[1] };
  });
}
```
With that working, it should be easy to get an array of sum-count objects for a period of seven days using your moment logic and the Promise.all() idea...
```js
function sumAndCountForPriorWeek(conn) {
  let promises = [];
  let beforeSevenDays = moment().subtract(7, 'days').format('YYYY-MM-DD');
  let today = moment().startOf('day');
  for (let m = moment(beforeSevenDays); m.diff(today, 'days') <= 0; m.add(1, 'days')) {
    promises.push(sumAndCountForDay(conn, m));
  }
  return Promise.all(promises);
}
```
With that working (notice the pattern here?), your OP function is tiny and nearly fully tested, because we've tested all of its parts...
```js
function connectToEP() {
  let conn = new jsforce.Connection();
  return login(conn).then(() => {
    return sumAndCountForPriorWeek(conn);
  }).then(result => {
    console.log(JSON.stringify(result));
    return result;
  }).catch(error => {
    console.log('error: ' + JSON.stringify(error));
    return error;
  });
}
```
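Since connectToEP() now returns a promise, the socketNotificationReceived() function from the question can simply chain on it, e.g.:
```js
// connectToEP() now returns a promise, so callers can chain on it.
function socketNotificationReceived() {
  console.log("socket salesforce rec");
  connectToEP().then((result) => updateMessage(JSON.stringify(result)));
}
```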
I think your general structure should be something like this. The biggest issue is not returning promises when you need to. A "for loop" of promises is a little difficult to step into, but if you can run them in parallel, then the easiest thing to do is Promise.all. If you need to aggregate the data before you can perform the next query, then you need multiple Promise.all().then()'s. The reason you get an empty array [] is that your for loop creates the promises but doesn't wait until they finish.
```js
var jsforce = require("jsforce");
var moment = require('moment');

function connectToEP() {
  var conn = new jsforce.Connection();
  // connectToEP now returns a promise
  return conn.login("myUid", "myPwd").then(() => {
    console.log("Connected To Dashboard");
    let myQueries = [];
    for (start; condition; incrementer) {
      myQueries.push( // add all these query promises to the parallel queue
        conn.query(someQuery)
          .then((res) => {
            return res;
          })
          .then((res) => {
            return conn.query(someQuery).then((res) => {
              return someData;
            });
          })
      );
    }
    return Promise.all(myQueries); // waits for all queries to finish...
  }).then((allData) => { // allData is an array of all the promise results
    return updateMessage(JSON.stringify(allData));
  });
}
```
I am new to promises. I want to read two CSV files at the same time using promises. How do I read 2 CSV files in parallel and then proceed with the chain? I have gone through this, but they used a library called RSVP. Can anyone help me do this without using any library functions? I want to read the 2 files at the same time and be able to access both in the next .then().
```js
const file = require('fs');

// Reading file
let readFile = (filename) => {
  return new Promise(function(resolve, reject) {
    file.readFile(filename, 'utf8', function(err, data) {
      if (err) {
        reject(err);
      } else {
        resolve(data);
      }
    });
  });
};

// Get the match ids for season 2016
getMatchId = (matches) => {
  let allRows = matches.split(/\r?\n|\r/);
  let match_id = [];
  for (let i = 1; i < allRows.length - 1; i++) {
    let x = allRows[i].split(',');
    if (x[1] === '2016') {
      match_id.push(parseInt(x[0]));
    }
  }
  return match_id;
};

// Final calculation to get extra runs per team in 2016
getExtraRunsin2016 = (deliveries, match_id) => {
  let eachRow = deliveries.split(/\r?\n|\r/);
  result = {};
  let total = 0;
  for (let i = 1; i < eachRow.length - 1; i++) {
    let x = eachRow[i].split(',');
    let team_name = x[3];
    let runs = parseInt(x[16]);
    let id = parseInt(x[0]);
    if (match_id.indexOf(id)) {
      total += runs;
      result[team_name] += runs;
    } else {
      result[team_name] = 0;
    }
  }
  console.log(result, total);
};

// Promise
readFile('fileOne.csv')
  .then((matchesFile) => {
    return getMatchId(matchesFile);
  })
  .then((data) => {
    readFile('fileTwo.csv')
      .then((deliveries) => {
        getExtraRunsin2016(deliveries, data);
      });
  })
  .catch(err => {
    console.log(err);
  });
```
You can use Promise.all() to combine things without using any other libraries:
```js
"use strict";

Promise.all([
  readFile('fileOne.csv'),
  readFile('fileTwo.csv'),
]).then((results) => {
  let [rawFileOneValues, rawFileTwoValues] = results;
  // Your other function to process them
}).catch(err => {
  console.log(err);
});
```
You want to use Promise.all().
```js
// Promise
Promise.all([
  readFile('fileOne.csv').then(getMatchId),
  readFile('fileTwo.csv'),
])
  .then(([data, deliveries]) => getExtraRunsin2016(deliveries, data))
  .catch(err => console.log(err));
```
I also recommend using fs-extra, which would replace your readFile implementation.
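With fs-extra, readFile already returns a promise when no callback is passed, so the hand-rolled wrapper becomes unnecessary:
```js
// fs-extra promisifies the fs API, so readFile returns a promise
// when no callback is passed; the hand-rolled wrapper goes away.
const fse = require('fs-extra');

Promise.all([
  fse.readFile('fileOne.csv', 'utf8'),
  fse.readFile('fileTwo.csv', 'utf8'),
]).then(([matches, deliveries]) => {
  getExtraRunsin2016(deliveries, getMatchId(matches));
}).catch(err => console.log(err));
```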