Azure DocumentDB does not support UPSERT. Is there a reasonable work around to achieve the same functionality?
Is using a stored procedure which checks if the document exists to determine whether an insert or an update should be performed an effective strategy?
What if I need to perform thousands of these in bulk?
Vote for the feature here:
http://feedback.azure.com/forums/263030-documentdb/suggestions/7075256-provide-for-upsert
Update - Here is my attempt at a bulk upsert stored procedure.
function bulkImport(docs) {
    var collection = getContext().getCollection();
    var collectionLink = collection.getSelfLink();
    var count = 0;

    if (!docs) throw new Error('Docs parameter is null');

    var docsLength = docs.length;
    if (docsLength == 0) {
        getContext().getResponse().setBody(0);
        return;
    }

    tryUpsert(docs[count], callback);

    function tryUpsert(doc, callback) {
        var query = { query: 'select * from root r where r.id = @id', parameters: [{ name: '@id', value: doc.id }] };

        var isAccepted = collection.queryDocuments(collectionLink, query, function (err, resources, options) {
            if (err) throw err;

            if (resources.length > 0) {
                // Perform a replace
                var isAccepted = collection.replaceDocument(resources[0]._self, doc, callback);
                if (!isAccepted) getContext().getResponse().setBody(count);
            }
            else {
                // Perform a create
                var isAccepted = collection.createDocument(collectionLink, doc, callback);
                if (!isAccepted) getContext().getResponse().setBody(count);
            }
        });

        if (!isAccepted) getContext().getResponse().setBody(count);
    }

    function callback(err, doc, options) {
        if (err) throw err;

        // One more document has been upserted; increment the count.
        count++;

        if (count >= docsLength) {
            // If we have upserted all documents, we are done. Just set the response.
            getContext().getResponse().setBody(count);
        } else {
            // Upsert the next document.
            tryUpsert(docs[count], callback);
        }
    }
}
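For reference, a hedged sketch of invoking a sproc like this from the Node.js documentdb SDK; the endpoint, key, and sproc link below are placeholders, not real values:

var DocumentClient = require('documentdb').DocumentClient;

// Hypothetical endpoint, key and sproc link for illustration only.
var client = new DocumentClient('https://myaccount.documents.azure.com:443/', { masterKey: 'myKey' });
var sprocLink = 'dbs/mydb/colls/mycoll/sprocs/bulkImport';

var docs = [
    { id: '1', value: 'a' },
    { id: '2', value: 'b' }
];

// Sproc parameters are passed as an array; the first element maps to the docs parameter.
client.executeStoredProcedure(sprocLink, [docs], function (err, result) {
    if (err) throw err;
    console.log('Upserted ' + result + ' documents');
});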
Update (2015-10-06): Atomic upsert is now supported by Azure DocumentDB.
Yes, a stored procedure would work great for upsert.
There are even code samples available on DocumentDB's GitHub:
Upsert (Optimized for Insert): https://github.com/aliuy/azure-node-samples/blob/master/documentdb-server-side-js/stored-procedures/upsert.js
Upsert (Optimized for Replace): https://github.com/aliuy/azure-node-samples/blob/master/documentdb-server-side-js/stored-procedures/upsertOptimizedForReplace.js
Bulk Import / Upsert:
https://github.com/Azure/azure-documentdb-hadoop/blob/master/src/BulkImportScript.js
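Since upsert is now natively supported, a stored procedure is no longer required for the single-document case. Below is a minimal sketch using the Node.js documentdb SDK's upsertDocument call; the client setup and collection link are assumptions:

// Hypothetical client and collection link for illustration only.
var DocumentClient = require('documentdb').DocumentClient;
var client = new DocumentClient('https://myaccount.documents.azure.com:443/', { masterKey: 'myKey' });
var collLink = 'dbs/mydb/colls/mycoll';

// upsertDocument creates the document if its id does not exist, otherwise replaces it.
client.upsertDocument(collLink, { id: 'myDocId', value: 42 }, function (err, upserted) {
    if (err) throw err;
    console.log('Upserted document with id ' + upserted.id);
});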
Let's say we have a container with the following items:
{
    "location": "CA",
    "bool": false
}
{
    "location": "CA",
    "bool": false
}
{
    "location": "CA",
    "bool": false
}
How do we write a stored procedure to query all the items and update every item's bool field from false to true? I have not found any reference yet, and I thought CosmosDB stored procedures only support querying, not updating or deleting.
The following stored procedure code does that:
// SAMPLE STORED PROCEDURE
function sample() {
    var collection = getContext().getCollection();

    // Query all documents in one logical partition
    var isAccepted = collection.queryDocuments(
        collection.getSelfLink(),
        'SELECT * FROM root r',
        function (err, feed, options) {
            if (err) throw err;

            if (!feed || !feed.length) {
                var response = getContext().getResponse();
                response.setBody('no docs found');
            }
            else {
                feed.forEach(element => {
                    element.bool = true;
                    collection.replaceDocument(element._self, element, function (err) {
                        if (err) throw err;
                    });
                });
                var response = getContext().getResponse();
                response.setBody("replace success");
            }
        });

    if (!isAccepted) throw new Error('The query was not accepted by the server.');
}
Execute the stored procedure (screenshot omitted):
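As a hedged sketch, this is roughly how the sproc can be invoked against a single logical partition with the @azure/cosmos Node.js SDK; the endpoint, key, database, container and sproc names are assumptions:

const { CosmosClient } = require("@azure/cosmos");

// Hypothetical endpoint, key and names for illustration only.
const client = new CosmosClient({ endpoint: "https://myaccount.documents.azure.com:443/", key: "myKey" });
const container = client.database("mydb").container("mycontainer");

async function run() {
    // Execute the sproc against the logical partition whose partition key value is "CA".
    const { resource } = await container.scripts.storedProcedure("sample").execute("CA");
    console.log(resource); // "replace success"
}

run().catch(console.error);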
BTW, your partition key should be /location.
Here is a simple stored procedure using the readDocument function in CosmosDB/DocumentDB, but it does not work.
function testRead() {
    var collection = getContext().getCollection();
    var docId = collection.getSelfLink() + 'docs/myDocId';

    // Read the document by id.
    var isAccepted = collection.readDocument(docId, {}, function (err, doc, options) {
        if (err) throw err;
        response.setBody(JSON.stringify(doc));
    });
    if (!isAccepted) throw new Error('The query was not accepted by the server.');
}
It always gets error code 400:
{"code":400,"body":"{\"code\":\"BadRequest\",\"message\":\"Message:
{\\"Errors\\":[\\"Encountered exception while executing Javascript.
Exception = Error: Error creating request message\\r\\nStack
trace: Error: Error creating request message\\n at readDocument
(testRead.js:512:17)\\n at testRead (testRead.js:8:5)\\n at
__docDbMain (testRead.js:18:5)\\n at Global code (testRead.js:1:2)\\"]}\r\nActivityId:
2fb0f7ef-c192-4b56-b8bb-9681c9f8fa6e, Request URI:
/apps/DocDbApp/services/DocDbServer22/partitions/a4cb4962-38c8-11e6-8106-8cdcd42c33be/replicas/1p/,
RequestStats: , SDK:
Microsoft.Azure.Documents.Common/1.22.0.0\"}","activityId":"2fb0f7ef-c192-4b56-b8bb-9681c9f8fa6e","substatus":400}
Can anyone help me?
Can you try this: var docId = collection.getAltLink() + '/docs/myDocId';
-- the self link is not meant for name-based routing.
Following Michael's suggestion, my sample works now. Here is the code:
function testRead() {
    var collection = getContext().getCollection();
    var response = getContext().getResponse();
    var docId = collection.getAltLink() + '/docs/myDocId';

    // Read the document by id and return it in the response.
    var isAccepted = collection.readDocument(docId, {}, function (err, doc, options) {
        if (err) throw err;
        response.setBody(JSON.stringify(doc));
    });
    if (!isAccepted) throw new Error('The query was not accepted by the server.');
}
You could modify your code like this:
function testRead() {
    var collection = getContext().getCollection();
    var response = getContext().getResponse();
    var docId = collection.getAltLink() + '/docs/myDocId';
    console.log(collection.getSelfLink() + 'docs/myDocId');

    var isAccepted = collection.readDocument(docId, {}, function (err, doc, options) {
        if (err) throw err;
        response.setBody(JSON.stringify(doc));
    });
    if (!isAccepted) throw new Error('The query was not accepted by the server.');
}
Or you could use the following sample code to query the document; the result also contains all of the fields.
function testRead() {
    var collection = getContext().getCollection();
    var query = "select * from c where c.id = '1'";

    var isAccepted = collection.queryDocuments(collection.getSelfLink(), query, function (err, doc, options) {
        if (err) throw err;
        var response = getContext().getResponse();
        response.setBody(JSON.stringify(doc));
    });
    if (!isAccepted) throw new Error('The query was not accepted by the server.');
}
I've been following along with the JavaScript stored proc examples shown here.
The code below is an attempt at writing a modified version of the update stored proc sample. Here's what I'm trying to do:
Instead of operating on a single document, I'd like to perform the update on the set of documents returned by a provided query.
(Optional) Return a count of updated documents in the response body.
Here's the code:
function updateSproc(query, update) {
    var collection = getContext().getCollection();
    var collectionLink = collection.getSelfLink();
    var response = getContext().getResponse();
    var responseBody = {
        updated: 0,
        continuation: false
    };

    // Validate input.
    if (!query) throw new Error("The query is undefined or null.");
    if (!update) throw new Error("The update is undefined or null.");

    tryQueryAndUpdate();

    // Recursively queries for a document by id w/ support for continuation tokens.
    // Calls tryUpdate(document) as soon as the query returns a document.
    function tryQueryAndUpdate(continuation) {
        var requestOptions = { continuation: continuation };

        var isAccepted = collection.queryDocuments(collectionLink, query, requestOptions, function (err, documents, responseOptions) {
            if (err) throw err;

            if (documents.length > 0) {
                tryUpdate(documents);
            }
            else if (responseOptions.continuation) {
                // Else if the query came back empty, but with a continuation token; repeat the query w/ the token.
                tryQueryAndUpdate(responseOptions.continuation);
            }
            else {
                // Else if there are no more documents and no continuation token - we are finished updating documents.
                responseBody.continuation = false;
                response.setBody(responseBody);
            }
        });

        // If we hit execution bounds - return continuation: true.
        if (!isAccepted) {
            response.setBody(responseBody);
        }
    }

    // Updates the supplied document according to the update object passed in to the sproc.
    function tryUpdate(documents) {
        if (documents.length > 0) {
            var requestOptions = { etag: documents[0]._etag };

            // Rename!
            rename(documents[0], update);

            // Update the document.
            var isAccepted = collection.replaceDocument(
                documents[0]._self,
                documents[0],
                requestOptions,
                function (err, updatedDocument, responseOptions) {
                    if (err) throw err;

                    responseBody.updated++;
                    documents.shift();
                    // Try updating the next document in the array.
                    tryUpdate(documents);
                }
            );

            if (!isAccepted) {
                response.setBody(responseBody);
            }
        }
        else {
            tryQueryAndUpdate();
        }
    }

    // The $rename operator renames a field.
    function rename(document, update) {
        var fields, i, existingFieldName, newFieldName;

        if (update.$rename) {
            fields = Object.keys(update.$rename);
            for (i = 0; i < fields.length; i++) {
                existingFieldName = fields[i];
                newFieldName = update.$rename[fields[i]];

                if (existingFieldName == newFieldName) {
                    throw new Error("Bad $rename parameter: The new field name must differ from the existing field name.");
                } else if (document[existingFieldName]) {
                    // If the field exists, set/overwrite the new field name and unset the existing field name.
                    document[newFieldName] = document[existingFieldName];
                    delete document[existingFieldName];
                } else {
                    // Otherwise this is a noop.
                }
            }
        }
    }
}
I'm running this sproc via the azure web portal, and these are my input parameters:
SELECT * FROM root r
{$rename: {A: "B"}}
My documents look something like this:
{ id: someId, A: "ChangeThisField" }
After the field rename, I would like them to look like this:
{ id: someId, B: "ChangeThisField" }
I'm trying to debug two issues with this code:
The updated count is wildly inaccurate. I suspect I'm doing something really stupid with the continuation token - part of the problem is that I'm not really sure about what to do with it.
The rename itself is not occurring. console.log() debugging shows that I'm never getting into the if (update.$rename) block in the rename function.
I modified your stored procedure code as below and it works for me. I didn't use an object or array as the $rename parameter; I used oldKey and newKey instead. If you are concerned about the shape of the parameters, you can change the rename method back; it does not affect the other logic. Please refer to my code:
function updateSproc(query, oldKey, newKey) {
    var collection = getContext().getCollection();
    var collectionLink = collection.getSelfLink();
    var response = getContext().getResponse();
    var responseBody = {
        updated: 0,
        continuation: ""
    };

    // Validate input.
    if (!query) throw new Error("The query is undefined or null.");
    if (!oldKey) throw new Error("The oldKey is undefined or null.");
    if (!newKey) throw new Error("The newKey is undefined or null.");

    tryQueryAndUpdate();

    function tryQueryAndUpdate(continuation) {
        var requestOptions = {
            continuation: continuation,
            pageSize: 1
        };

        var isAccepted = collection.queryDocuments(collectionLink, query, requestOptions, function (err, documents, responseOptions) {
            if (err) throw err;

            if (documents.length > 0) {
                tryUpdate(documents);
                if (responseOptions.continuation) {
                    tryQueryAndUpdate(responseOptions.continuation);
                } else {
                    response.setBody(responseBody);
                }
            }
        });

        if (!isAccepted) {
            response.setBody(responseBody);
        }
    }

    function tryUpdate(documents) {
        if (documents.length > 0) {
            var requestOptions = { etag: documents[0]._etag };

            // Rename!
            rename(documents[0]);

            // Update the document.
            var isAccepted = collection.replaceDocument(
                documents[0]._self,
                documents[0],
                requestOptions,
                function (err, updatedDocument, responseOptions) {
                    if (err) throw err;

                    responseBody.updated++;
                    documents.shift();
                    // Try updating the next document in the array.
                    tryUpdate(documents);
                }
            );

            if (!isAccepted) {
                response.setBody(responseBody);
            }
        }
    }

    // The $rename operator renames a field.
    function rename(document) {
        if (oldKey && newKey) {
            if (oldKey == newKey) {
                throw new Error("Bad $rename parameter: The new field name must differ from the existing field name.");
            } else if (document[oldKey]) {
                document[newKey] = document[oldKey];
                delete document[oldKey];
            }
        }
    }
}
I only have 3 test documents, so I set the pageSize to 1 to test the usage of continuation.
Test documents and output: (screenshots omitted)
Hope it helps you. If you have any concerns, please let me know.
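For reference, a hedged sketch of calling this sproc with its three parameters via the @azure/cosmos SDK; the account details, names and partition key value are assumptions:

const { CosmosClient } = require("@azure/cosmos");

// Hypothetical endpoint, key and names for illustration only.
const client = new CosmosClient({ endpoint: "https://myaccount.documents.azure.com:443/", key: "myKey" });
const container = client.database("mydb").container("mycontainer");

async function runRename() {
    // Parameters map to updateSproc(query, oldKey, newKey).
    const { resource } = await container.scripts
        .storedProcedure("updateSproc")
        .execute("myPartitionKeyValue", ["SELECT * FROM root r", "A", "B"]);
    console.log(resource); // e.g. { updated: 3, continuation: "" }
}

runRename().catch(console.error);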
I'm trying to read all records in a sqlite3 table and return them via a callback. But it seems that despite using serialize, these calls are still asynchronous. Here is my code:
var readRecordsFromMediaTable = function(callback){
    var db = new sqlite3.Database(file, sqlite3.OPEN_READWRITE | sqlite3.OPEN_CREATE);
    var allRecords = [];
    db.serialize(function() {
        db.each("SELECT * FROM MediaTable", function(err, row) {
            myLib.generateLog(levelDebug, util.inspect(row));
            allRecords.push(row);
        });
        callback(allRecords);
        db.close();
    });
}
When the callback gets fired the array prints '[]'.
Is there another call that I can make (instead of db.each) that will give me all rows in one shot? I have no need to iterate through each row here.
If there isn't, how do I read all records and only then call the callback with results?
I was able to find the answer to this question. Here it is for anyone who is looking:
var sqlite3 = require("sqlite3").verbose();

var readRecordsFromMediaTable = function(callback){
    var db = new sqlite3.Database(file, sqlite3.OPEN_READONLY);
    db.serialize(function() {
        db.all("SELECT * FROM MediaTable", function(err, allRows) {
            if (err != null) {
                console.log(err);
                callback(err);
                return;
            }
            console.log(util.inspect(allRows));
            callback(allRows);
            db.close();
        });
    });
}
A promise-based method:
var readRecordsFromMediaTable = function(){
    return new Promise(function (resolve, reject) {
        var responseObj;
        // Assumes `db` is an already-open sqlite3 Database instance.
        db.all("SELECT * FROM MediaTable", null, function cb(err, rows) {
            if (err) {
                responseObj = {
                    'error': err
                };
                reject(responseObj);
            } else {
                responseObj = {
                    statement: this,
                    rows: rows
                };
                resolve(responseObj);
            }
            db.close();
        });
    });
}
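Usage of the promise-based version might then look like this (a sketch based on the shape of responseObj above):

readRecordsFromMediaTable()
    .then(function (result) {
        console.log('Rows:', result.rows);
    })
    .catch(function (errorObj) {
        console.error('Query failed:', errorObj.error);
    });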
The accepted answer using db.all with a callback is correct since db.each wasn't actually needed. However, if db.each was needed, the solution is provided in the node-sqlite3 API documentation, https://github.com/mapbox/node-sqlite3/wiki/API#databaseeachsql-param--callback-complete:
Database#each(sql, [param, ...], [callback], [complete])
...
After all row callbacks were called, the completion callback will be called if present. The first argument is an error object, and the second argument is the number of retrieved rows
So, where you end the first callback, instead of just } put }, function() {...}. Something like this:
var readRecordsFromMediaTable = function(callback){
    var db = new sqlite3.Database(file, sqlite3.OPEN_READWRITE | sqlite3.OPEN_CREATE);
    var allRecords = [];
    db.serialize(function() {
        db.each("SELECT * FROM MediaTable", function(err, row) {
            myLib.generateLog(levelDebug, util.inspect(row));
            allRecords.push(row);
        }, function(err, count) {
            callback(allRecords);
            db.close();
        });
    });
}
I know I'm kinda late, but since you're here, please consider this:
Note that it first retrieves all result rows and stores them in memory. For queries that have potentially large result sets, use the Database#each function to retrieve all rows or Database#prepare followed by multiple Statement#get calls to retrieve a previously unknown amount of rows.
As described in the node-sqlite3 docs, you should use .each() if you're after a very large or unknown number of rows, since .all() stores the whole result set in memory before handing it back.
That being said, take a look at Colin Keenan's answer.
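For completeness, here is a rough sketch of the Database#prepare / Statement#get pattern the docs mention for result sets of unknown size, pulling one row at a time; the file variable and table name are taken from the question and the sketch is untested:

var sqlite3 = require('sqlite3').verbose();
var db = new sqlite3.Database(file, sqlite3.OPEN_READONLY);

var stmt = db.prepare("SELECT * FROM MediaTable");

// Repeated Statement#get calls step through the result set one row at a time.
function nextRow(done) {
    stmt.get(function (err, row) {
        if (err) return done(err);
        if (row === undefined) return done(null); // no more rows
        console.log(row); // process a single row here
        nextRow(done);
    });
}

nextRow(function (err) {
    stmt.finalize();
    db.close();
    if (err) console.error(err);
});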
I tackled this differently. Since these calls are asynchronous, you need to wait until they complete before returning their data. I did it with setInterval(), kind of like throwing pizza dough up into the air and waiting for it to come back down.
var reply = '';

db.all(query, [], function(err, rows){
    if (err != null) {
        reply = err;
    } else {
        reply = rows;
    }
});

var callbacker = setInterval(function(){
    // check that our reply has been modified yet
    if (reply !== '') {
        // clear the interval
        clearInterval(callbacker);
        // do work
    }
}, 10); // every ten milliseconds
Old question, but I came across the issue with a different approach to solve the problem. The Promise option works, though it is a little too verbose for my taste, in the case of a db.all(...) call.
I am instead using Node's events module:
var EventEmitter = require('events')
var eventHandler = new EventEmitter()
In your Sqlite function:
function queryWhatever(eventHandler) {
    db.serialize(() => {
        db.all('SELECT * FROM myTable', (err, row) => {
            // At this point, the query is completed
            // You can emit a signal
            eventHandler.emit('done', 'The query is completed')
        })
    })
}
Then attach your callback function to the eventHandler so that it reacts to the 'done' event:
eventHandler.on('done', () => {
    // Do something
})
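Putting it together, a minimal sketch (assuming db is an already-open sqlite3 Database):

var EventEmitter = require('events')
var eventHandler = new EventEmitter()

// React to the 'done' signal emitted inside queryWhatever
eventHandler.on('done', (message) => {
    console.log(message) // 'The query is completed'
})

queryWhatever(eventHandler)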
I have a huge collection of documents in my DB and I'm wondering how I can run through all the documents and update them, each document with a different value.
The answer depends on the driver you're using. All MongoDB drivers I know have cursor.forEach() implemented one way or another.
Here are some examples:
node-mongodb-native
collection.find(query).forEach(function(doc) {
    // handle
}, function(err) {
    // done or error
});
mongojs
db.collection.find(query).forEach(function(err, doc) {
    // handle
});
monk
collection.find(query, { stream: true })
    .each(function(doc){
        // handle doc
    })
    .error(function(err){
        // handle error
    })
    .success(function(){
        // final callback
    });
mongoose
collection.find(query).stream()
    .on('data', function(doc){
        // handle doc
    })
    .on('error', function(err){
        // handle error
    })
    .on('end', function(){
        // final callback
    });
Updating documents inside of .forEach callback
The only problem with updating documents inside of .forEach callback is that you have no idea when all documents are updated.
To solve this problem you should use some asynchronous control flow solution. Here are some options:
async
promises (when.js, bluebird)
Here is an example of using async, using its queue feature:
var q = async.queue(function (doc, callback) {
    // code for your update
    collection.update({
        _id: doc._id
    }, {
        $set: {hi: 'there'}
    }, {
        w: 1
    }, callback);
}, Infinity);

var cursor = collection.find(query);
cursor.each(function(err, doc) {
    if (err) throw err;
    if (doc) q.push(doc); // dispatching doc to async.queue
});

q.drain = function() {
    if (cursor.isClosed()) {
        console.log('all items have been processed');
        db.close();
    }
}
Using the mongodb driver, and modern NodeJS with async/await, a good solution is to use next():
const collection = db.collection('things')

const cursor = collection.find({
    bla: 42 // find all things where bla is 42
});

let document;
while ((document = await cursor.next())) {
    await collection.findOneAndUpdate({
        _id: document._id
    }, {
        $set: {
            blu: 43
        }
    });
}
This results in only one document at a time being required in memory, as opposed to e.g. the accepted answer, where many documents get sucked into memory, before processing of the documents starts. In cases of "huge collections" (as per the question) this may be important.
If documents are large, this can be improved further by using a projection, so that only those fields of documents that are required are fetched from the database.
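For example, a minimal sketch of such a projection with a recent driver, assuming only _id is needed to build the update filter:

// Fetch only the _id field of matching documents to keep each batch small.
const cursor = collection.find(
    { bla: 42 },
    { projection: { _id: 1 } }
);

let document;
while ((document = await cursor.next())) {
    await collection.findOneAndUpdate(
        { _id: document._id },
        { $set: { blu: 43 } }
    );
}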
var MongoClient = require('mongodb').MongoClient,
    assert = require('assert');

MongoClient.connect('mongodb://localhost:27017/crunchbase', function(err, db) {
    assert.equal(err, null);
    console.log("Successfully connected to MongoDB.");

    var query = {
        "category_code": "biotech"
    };

    db.collection('companies').find(query).toArray(function(err, docs) {
        assert.equal(err, null);
        assert.notEqual(docs.length, 0);

        docs.forEach(function(doc) {
            console.log(doc.name + " is a " + doc.category_code + " company.");
        });

        db.close();
    });
});
Notice that the .toArray call makes the application fetch the entire dataset.
var MongoClient = require('mongodb').MongoClient,
    assert = require('assert');

MongoClient.connect('mongodb://localhost:27017/crunchbase', function(err, db) {
    assert.equal(err, null);
    console.log("Successfully connected to MongoDB.");

    var query = {
        "category_code": "biotech"
    };

    var cursor = db.collection('companies').find(query);

    cursor.forEach(
        function(doc) {
            console.log(doc.name + " is a " + doc.category_code + " company.");
        },
        function(err) {
            assert.equal(err, null);
            return db.close();
        }
    );
});
Notice that the cursor returned by find() is assigned to var cursor. With this approach, instead of fetching all data into memory and consuming it at once, we're streaming the data to our application. find() can create a cursor immediately because it doesn't actually make a request to the database until we try to use some of the documents it will provide. The point of the cursor is to describe our query. The 2nd parameter to cursor.forEach shows what to do when the cursor is exhausted or an error occurs.
In the initial version of the above code, it was toArray() which forced the database call. It meant we needed ALL the documents and wanted them to be in an array.
Also, MongoDB returns data in batches: as the cursor is iterated, the driver issues successive batch requests from the application to MongoDB (the illustrating image is omitted here).
forEach is better than toArray because we can process documents as they come in, until we reach the end. Contrast that with toArray, where we wait for ALL the documents to be retrieved and the entire array to be built, which means we get no advantage from the fact that the driver and the database system work together to batch results to your application. Batching is meant to provide efficiency in terms of memory overhead and execution time. Take advantage of it if you can in your application.
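If you want to tune this explicitly, the cursor exposes a batchSize setting; a small sketch (the value 100 is just an example, and the exact forEach signature depends on your driver version):

// Ask the driver to pull documents from MongoDB in batches of 100.
var cursor = db.collection('companies')
    .find({ "category_code": "biotech" })
    .batchSize(100);

cursor.forEach(
    function(doc) {
        console.log(doc.name);
    },
    function(err) {
        if (err) console.error(err);
        db.close();
    }
);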
None of the previous answers mentions batching the updates. That makes them extremely slow 🐌 - tens or hundreds of times slower than a solution using bulkWrite.
Let's say you want to double the value of a field in each document. Here's how to do that fast 💨 and with fixed memory consumption:
// Double the value of the 'foo' field in all documents
let bulkWrites = [];
const bulkDocumentsSize = 100; // how many documents to write at once
let i = 0;

db.collection.find({ ... }).forEach(doc => {
    i++;

    // Update the document...
    doc.foo = doc.foo * 2;

    // Add the update to an array of bulk operations to execute later
    bulkWrites.push({
        replaceOne: {
            filter: { _id: doc._id },
            replacement: doc,
        },
    });

    // Update the documents and log progress every `bulkDocumentsSize` documents
    if (i % bulkDocumentsSize === 0) {
        db.collection.bulkWrite(bulkWrites);
        bulkWrites = [];
        print(`Updated ${i} documents`);
    }
});

// Flush the last <100 bulk writes (skip if there is nothing left to write)
if (bulkWrites.length > 0) {
    db.collection.bulkWrite(bulkWrites);
}
And here is an example of using a Mongoose cursor asynchronously with promises:
new Promise(function (resolve, reject) {
    collection.find(query).cursor()
        .on('data', function(doc) {
            // ...
        })
        .on('error', reject)
        .on('end', resolve);
})
.then(function () {
    // ...
});
Reference:
Mongoose cursors
Streams and promises
Leonid's answer is great, but I want to reinforce the importance of using async/promises and to give a different solution with a promises example.
The simplest solution to this problem is to loop over the documents with forEach and call an update for each one. Usually, you don't need to close the db connection after each request, but if you do need to close the connection, be careful: you must only close it once you are sure that all updates have finished executing.
A common mistake here is to call db.close() after all updates are dispatched without knowing if they have completed. If you do that, you'll get errors.
Wrong implementation:
collection.find(query).each(function(err, doc) {
if (err) throw err;
if (doc) {
collection.update(query, update, function(err, updated) {
// handle
});
}
else {
db.close(); // if there is any pending update, it will throw an error there
}
});
However, as db.close() is also an async operation (its signature has a callback option), you may be lucky and this code can finish without errors. It may work only when you need to update just a few docs in a small collection (so, don't try it).
Correct solution:
As a solution with async was already proposed by Leonid, below follows a solution using Q promises.
var Q = require('q');
var client = require('mongodb').MongoClient;
var url = 'mongodb://localhost:27017/test';

client.connect(url, function(err, db) {
    if (err) throw err;

    var promises = [];
    var query = {}; // select all docs
    var collection = db.collection('demo');
    var cursor = collection.find(query);

    // read all docs
    cursor.each(function(err, doc) {
        if (err) throw err;

        if (doc) {
            // create a promise to update the doc
            var query = doc;
            var update = { $set: {hi: 'there'} };

            var promise =
                Q.npost(collection, 'update', [query, update])
                .then(function(updated){
                    console.log('Updated: ' + updated);
                });

            promises.push(promise);
        } else {
            // close the connection after executing all promises
            Q.all(promises)
                .then(function() {
                    if (cursor.isClosed()) {
                        console.log('all items have been processed');
                        db.close();
                    }
                })
                .fail(console.error);
        }
    });
});
The node-mongodb-native driver now supports an endCallback parameter to cursor.forEach so one can handle the event AFTER the whole iteration; refer to the official documentation for details: http://mongodb.github.io/node-mongodb-native/2.2/api/Cursor.html#forEach.
Also note that .each is deprecated in the nodejs native driver now.
You can now use (in an async function, of course):
for await (let doc of collection.find(query)) {
    await updateDoc(doc);
}
// all done
which nicely serializes all updates.
Let's assume that we have the following MongoDB data in place.

Database name: users
Collection name: jobs

Documents:
{ "_id" : ObjectId("1"), "job" : "Security", "name" : "Jack", "age" : 35 }
{ "_id" : ObjectId("2"), "job" : "Development", "name" : "Tito" }
{ "_id" : ObjectId("3"), "job" : "Design", "name" : "Ben", "age" : 45 }
{ "_id" : ObjectId("4"), "job" : "Programming", "name" : "John", "age" : 25 }
{ "_id" : ObjectId("5"), "job" : "IT", "name" : "ricko", "age" : 45 }
This code:
var MongoClient = require('mongodb').MongoClient;
var dbURL = 'mongodb://localhost/users';

MongoClient.connect(dbURL, (err, db) => {
    if (err) {
        throw err;
    } else {
        console.log('Connection successful');
        var dataBase = db.db();

        // loop with forEach
        dataBase.collection('jobs').find().forEach(function(myDoc) {
            console.log('There is a job called ' + myDoc.job + ' in the database');
        });
    }
});
I looked for a solution with good performance and ended up creating a mix of what I found, which I think works well:
/**
 * This method will read the documents from the cursor in batches and invoke the callback
 * for each batch in parallel.
 * IT IS HIGHLY RECOMMENDED TO CREATE THE CURSOR WITH A batchSize OPTION THAT MATCHES
 * THE VALUE OF batchSize. This way the performance benefits are maxed out, since
 * the mongo instance will send into our process memory the same number of documents
 * that we handle concurrently each time, so no memory space is wasted
 * and the memory usage is limited.
 *
 * Example of usage:
 * const cursor = await collection.aggregate([
 *     {...}, ...],
 *     {
 *         cursor: {batchSize: BATCH_SIZE} // Limiting memory use
 *     });
 * DbUtil.concurrentCursorBatchProcessing(cursor, BATCH_SIZE, async (doc) => ...)
 *
 * @param cursor - A cursor to batch process on.
 * We can get this from our collection.js API by either using aggregateCursor/findCursor
 * @param batchSize - The batch size, should match the batchSize of the cursor option.
 * @param callback - Callback that should be async, will be called in parallel for each batch.
 * @return {Promise<void>}
 */
static async concurrentCursorBatchProcessing(cursor, batchSize, callback) {
    let doc;
    const docsBatch = [];

    while ((doc = await cursor.next())) {
        docsBatch.push(doc);

        if (docsBatch.length >= batchSize) {
            await PromiseUtils.concurrentPromiseAll(docsBatch, async (currDoc) => {
                return callback(currDoc);
            });

            // Emptying the batch array
            docsBatch.splice(0, docsBatch.length);
        }
    }

    // Checking if there is a last batch remaining since it was smaller than batchSize
    if (docsBatch.length > 0) {
        await PromiseUtils.concurrentPromiseAll(docsBatch, async (currDoc) => {
            return callback(currDoc);
        });
    }
}
An example of usage for reading many big documents and updating them:
const cursor = await collection.aggregate([
    {
        ...
    }
], {
    cursor: {batchSize: BATCH_SIZE}, // Limiting memory use
    allowDiskUse: true
});

const bulkUpdates = [];

await DbUtil.concurrentCursorBatchProcessing(cursor, BATCH_SIZE, async (doc: any) => {
    const update: any = {
        updateOne: {
            filter: {
                ...
            },
            update: {
                ...
            }
        }
    };

    bulkUpdates.push(update);

    // Updating if we read too many docs to clear space in memory
    await this.bulkWriteIfNeeded(bulkUpdates, collection);
});

// Making sure we updated everything
await this.bulkWriteIfNeeded(bulkUpdates, collection, true);
...
private async bulkWriteIfNeeded(
    bulkUpdates: any[], collection: any,
    forceUpdate = false, flushBatchSize) {

    if (bulkUpdates.length >= flushBatchSize || forceUpdate) {
        // concurrentPromiseChunked is a method that loops over an array in a concurrent way using lodash.chunk and Promise.map
        await PromiseUtils.concurrentPromiseChunked(bulkUpdates, (updateChunk: any) => {
            return collection.bulkWrite(updateChunk);
        });

        // Emptying the array
        bulkUpdates.splice(0, bulkUpdates.length);
    }
}