improve mongo query performance - node.js

I'm using a node based CMS system called Keystone, which uses MongoDB for a data store, giving fairly liberal control over data and access. I have a very complex model called Family, which has about 250 fields, a bunch of relationships, and a dozen or so methods. I have a form on my site which allows the user to enter in the required information to create a new Family record, however the processing time is running long (12s on localhost and over 30s on my Heroku instance). The issue I'm running into is that Heroku emits an application error for any processes that run over 30s, which means I need to optimize my query. All processing happens very quickly except one function. Below is the offending function:
const Family = keystone.list( 'Family' );

exports.getNextRegistrationNumber = ( req, res, done ) => {
    console.time( 'get registration number' );

    const locals = res.locals;

    Family.model.find()
        .select( 'registrationNumber' )
        .exec()
        .then( families => {
            // get an array of registration numbers
            const registrationNumbers = families.map( family => family.get( 'registrationNumber' ) );
            // get the largest registration number
            locals.newRegistrationNumber = Math.max( ...registrationNumbers ) + 1;

            console.timeEnd( 'get registration number' );
            done();
        }, err => {
            console.timeEnd( 'get registration number' );
            console.log( 'error setting registration number' );
            console.log( err );
            done();
        });
};
The processing in my .then() happens in milliseconds; however, the Family.model.find() itself takes far too long to execute. Any advice on how to speed things up would be greatly appreciated. There are about 40,000 Family records the query has to dig through, and there is already an index on the registrationNumber field.

It makes sense that the then() executes quickly but the find() takes a while; finding the largest value in an array you already hold in memory is a quick operation, while fetching the whole set of records from the database can be very time-consuming depending on a number of factors.
If you are simply reading the data and presenting it to the user via REST or some sort of visual interface, you can make use of lean(), which returns plain JavaScript objects. By default you get back mongoose.Document instances, which in your case is unnecessary overhead: there does not appear to be any data manipulation after your read query; you are just getting the data.
More importantly, it appears that all you need is one record: the one with the largest registrationNumber. Use findOne() combined with a descending sort on registrationNumber, so the existing index can hand back just that single document; in general, prefer findOne() whenever you only need one record from a set.
See this previous answer detailing the use of findOne in a Node.js implementation, or see the MongoDB documentation for general information about this collection method.
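Putting both suggestions together, a minimal, untested sketch (assuming the same Keystone/Mongoose Family list from the question) might look like this; the descending sort lets the existing registrationNumber index satisfy the query, so only one small document ever leaves the database:

const Family = keystone.list( 'Family' );

exports.getNextRegistrationNumber = ( req, res, done ) => {
    const locals = res.locals;

    Family.model
        .findOne()
        .sort( '-registrationNumber' )   // highest registration number first
        .select( 'registrationNumber' )
        .lean()                          // plain object instead of a mongoose.Document
        .exec()
        .then( family => {
            // if the collection is empty, start numbering at 1
            locals.newRegistrationNumber = ( family ? family.registrationNumber : 0 ) + 1;
            done();
        }, err => {
            console.log( 'error setting registration number', err );
            done();
        });
};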

Related

Approach for changing fields in documents that are related to other documents

I am building an API and came across an issue that I have a few ideas about how to solve, but I was wondering which is the most optimal. The issue is the following:
I have a Product model which has, for the sake of simplicity, one field called totalValue.
I have another model called InventoryItem; whenever an InventoryItem is updated, the totalValue of the related Product must also be updated.
For example, if the current totalValue of a product is say $1000, when someone purchases 10 screws at a cost of $1 each, a new InventoryItem record will be created:
InventoryItem: {
    costPerItem: 1,
    quantity: 10,
    relatedToProduct: "ProductXYZ"
}
When that item is created, the totalValue of the respective ProductXYZ must be updated to $1100.
The question is what is the most efficient and user-friendly way to do this?
Two ways come to my mind (and keep in mind that the code below is somewhat pseudo-code; I have intentionally omitted parts of it that are irrelevant to the problem at hand):
When the new InventoryItem is created, it also queries the database for the product and updates it, so both things happen in the same function that creates the inventory item:
async function createInventoryItem(req, res) {
    const item = { ...req.body };
    const newInventoryItem = await new InventoryItem({ ...item }).save();
    // findOne, since we want a single product document rather than an array
    const foundProduct = await Product.findOne({ name: item.relatedToProduct }).exec();
    foundProduct.totalValue = foundProduct.totalValue + item.costPerItem * item.quantity;
    await foundProduct.save();
    res.json({ newInventoryItem, newTotalOfProduct: foundProduct.totalValue });
}
That would work, but my problem with it is that I will no longer have "a single source of truth": the logic for updating a given Product will be scattered all over the project, which makes the code hard to maintain.
The second approach that comes to my mind is that, when I receive the request to create the item, I do create the item, and then I make an internal request to the other endpoint that handles product updates, something like:
async function createInventoryItem(req, res) {
    const item = { ...req.body };
    const newInventoryItem = await new InventoryItem({ ...item }).save();
    const totalCostOfNewInventoryItem = item.costPerItem * item.quantity;
    // THIS is the part that I don't know how to do
    const putResponse = putrequest("/api/product/update", {
        product: item.relatedToProduct,
        addToTotalValue: totalCostOfNewInventoryItem
    });
    res.json({ newInventoryItem, newTotalOfProduct: putResponse.totalValue });
}
This second approach solves the problem of the first one, but I don't know how to implement it; I'm guessing it is some form of request chaining or rerouting? I am also guessing that the second approach carries no real performance penalty, since Node will be sending requests to itself, so no time is lost reaching servers across the world.
I am pretty sure that the second approach is the one I have to take (or is there another way that I am currently not aware of? I am open to any suggestions; I am aiming for performance), but I am unsure of exactly how to implement it.
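One way to sidestep the dilemma, sketched below purely as a hypothetical illustration (the helper name and the use of Mongoose's findOneAndUpdate with $inc are assumptions, not from the original post), is to put the product update in one shared service function that every caller reuses, which keeps a single source of truth without an internal HTTP round trip:

// hypothetical shared helper: the only place in the project that touches totalValue
async function addToProductTotal(productName, amount) {
    // $inc is applied atomically by MongoDB, so concurrent inventory inserts
    // cannot overwrite each other's running total
    return Product.findOneAndUpdate(
        { name: productName },
        { $inc: { totalValue: amount } },
        { new: true }
    ).exec();
}

async function createInventoryItem(req, res) {
    const item = { ...req.body };
    const newInventoryItem = await new InventoryItem({ ...item }).save();
    const updatedProduct = await addToProductTotal(item.relatedToProduct, item.costPerItem * item.quantity);
    res.json({ newInventoryItem, newTotalOfProduct: updatedProduct.totalValue });
}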

Firebase cloud function to count and update collections

I have three collections in my Firebase project, one contains locations that users have checked in from, and the other two are intended to hold leaderboards with the cities and suburbs with the most check ins.
However, as a bit of a newbie to NOSQL databases, I'm not quite sure how to do the queries I need to get and set the data I want.
Currently, my checkins collection has this structure:
{
    Suburb: ...,
    City: ...,
    Leaderboard: ...
}
The leaderboard entry is a boolean to mark if the check in has already been added to the leaderboard.
What I want to do is query for all results where leaderboard is false, count the entries for all cities, count the entries for all suburbs, then add the city and suburb data to a separate collection, then update the leaderboard boolean to indicate they've been counted.
exports.updateLeaderboard = functions.pubsub.schedule('30 * * * *').onRun(async context => {
    db.collection('Bears')
        .where('Leaderboard', '==', 'false')
        .get()
        .then(snap => {
            snap.forEach(x => {
                // Count unique cities and return object
                // SELECT cities, COUNT(*) AS `count` FROM Bears GROUP BY cities
            })
        })
        .then(() => {
            console.log({ result: 'success' });
        })
        .catch(error => {
            console.error(error);
        });
})
Unfortunately, I've come to about the limit of my knowledge here and would love some help.
Firebase is meant to be a real-time platform, and most of your business logic is going to be expressed in Functions. Because the ability to query is so limited, lots of problems like this are usually solved with triggers and data denormalization.
For instance, if you want a count of all mentions of a city, then you have to maintain that count at event-time.
// On document create
await firestore()
    .collection("city-count")
    .doc(doc.city)
    .set({
        count: firebase.firestore.FieldValue.increment(1),
    }, { merge: true });
Since it's a serverless platform, it's built to run a lot of very small, very fast functions like this. Firebase is very bad at doing large computations -- you can quickly run into MB/minute and document/minute write limits.
Edit: Here is how Firebase solved this exact problem from the perspective of a SQL trained developer https://www.youtube.com/watch?v=vKqXSZLLnHA
As clarified in this other post from the Community here, Firestore doesn't have a built-in API for counting documents via query. You will need to read the whole collection into a variable and work with the data from there, counting how many documents have false in their Leaderboard field. While doing this, you can start adding those cities and suburbs to arrays that will afterwards be written to the database, updating the other two collections.
The sample code below (untested) returns the documents from the database where Leaderboard is false, increments a count, and shows where you need to copy the City and Suburb values to the other collections. I have basically reordered parts of your code and renamed the variables to generic ones for better understanding, adding a comment where the values should be copied to the other collections.
...
// Create a reference to the check-in collection
let checkinRef = db.collection('cities');
// Create a query against the collection
let queryRef = checkinRef.where('Leaderboard', '==', false);

var count = 0;

queryRef.get()
    .then(snap => {
        snap.forEach(x => {
            // add the cities and suburbs to their collections here and update the counter
            count++;
        })
    })
...
You are very close to the solution; you just need to copy the values from one collection to the others once you have all the documents with false in Leaderboard. You can find some good examples of copying documents from one collection to another in this other post from the Community: Cloud Functions: How to copy Firestore Collection to a new document?
Let me know if the information helped you!
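For what it's worth, a rough, untested sketch of that copy step (the 'cityLeaderboard' collection name is an assumption, and admin refers to the firebase-admin SDK) could live inside the scheduled function above: group the matching check-ins by city, write the totals to the leaderboard collection, and flag each check-in as counted. Note that a single Firestore batch is limited to 500 writes, so a large backlog would need to be split across several batches.

const snap = await db.collection('Bears').where('Leaderboard', '==', false).get();

const cityCounts = {};
const batch = db.batch();

snap.forEach(doc => {
    const { City } = doc.data();
    cityCounts[City] = (cityCounts[City] || 0) + 1;
    // mark this check-in as counted so it is skipped on the next run
    batch.update(doc.ref, { Leaderboard: true });
});

Object.keys(cityCounts).forEach(city => {
    const ref = db.collection('cityLeaderboard').doc(city);
    // increment rather than overwrite, in case the city already has a count
    batch.set(ref, { count: admin.firestore.FieldValue.increment(cityCounts[city]) }, { merge: true });
});

await batch.commit();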

MongoDB: findOneAndUpdate seems to be not atomic

I am using nodejs + mongodb as a backend for a largely distributed web application. I have a series of events, that need to be in a specific order. There are multiple services generating these events and my application should process and store them as they come in and at any given time I want to have them in the correct order.
I cannot rely on timestamps since JavaScript only provides timestamps in milliseconds, which is not accurate enough for my case.
I have two collections in my database: one that stores the events and one that stores an index representing my event order. I have tried using findOneAndUpdate in order to increase my index atomically. This, however, does not seem to be working.
console.log('Adding');
console.log(event.type);

this._db.collection('evtidx').findOneAndUpdate({ id: 'index' }, { $inc: { value: 1 } }, (err, res) => {
    console.log('For ' + event.type);
    console.log('Got value: ' + res.value.value);
    event.index = res.value.value;

    this._db.collection('events').insertOne(event, (err, evtres) => {
        if (err) {
            throw err;
        }
    });
});
When I check the output of the code above I see:
Adding
Event1
Adding
Event2
Adding
Event3
Adding
Event4
For Event1
Got value: 1
For Event3
Got value: 4
For Event2
Got value: 2
For Event4
Got value: 3
From this I conclude that my code is not working atomically.
The events come in in the correct order, but don't have the correct index attached to them after findOneAndUpdate. Could anyone help me out here?
Atomic database operations do not mean that the database is locked while the request is running. You may well be receiving the requests in order, but they are not executed sequentially, neither in your backend nor in the database.
What you need to do is read the last document index from the 'events' collection. If it is one less than your current request's index, then insert; otherwise wait and retry.
This can cause problems, though: if one event fails because of a network error or something else, your request processing would stall. See the sketch below.
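A minimal, hypothetical sketch of that wait-and-retry idea (the function name, retry delay, and retry limit are assumptions, not from the original answer), using the plain Node.js MongoDB driver:

// insert the event only once all preceding indexes are already stored
async function insertInOrder(db, event, attempt = 0) {
    const last = await db.collection('events')
        .find()
        .sort({ index: -1 })
        .limit(1)
        .toArray();
    const lastIndex = last.length ? last[0].index : 0;

    if (event.index === lastIndex + 1) {
        // it is this event's turn
        return db.collection('events').insertOne(event);
    }
    if (attempt > 50) {
        // give up eventually, e.g. when a preceding event was lost to a network error
        throw new Error('gave up waiting for event ' + (event.index - 1));
    }
    // a preceding event has not been stored yet -- wait briefly and retry
    await new Promise(resolve => setTimeout(resolve, 100));
    return insertInOrder(db, event, attempt + 1);
}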

How to read an individual column from Dynamo-Db without using Scan in Node-js?

I have 4.5 million records in my DynamoDB.
I want to read the id of each record in batches.
I am expecting something like offset and limit, similar to how we can read in MongoDB.
Are there any suggestions for doing this in Node.js without the scan method?
I have done plenty of research, but I can only find the scan method, which buffers the complete records from DynamoDB and then starts scanning them, which is not effective performance-wise.
Please do give me suggestions.
From my point of view, there's no problem doing scans because (according to the Scan doc):
DynamoDB paginates the results from Scan operations
You can use the ProjectionExpression parameter so that Scan only returns some of the attributes, rather than all of them
The default size for pages is 1MB, but you can also specify the max number of items per page with the Limit parameter.
So it's just basic pagination, the same thing MongoDB does with offset and limit.
Here is an example from the docs of how to perform Scan with the node.js SDK.
Now, if you want to get all the IDs batchwise, you could wrap the whole thing in a Promise and resolve when there's no LastEvaluatedKey.
Below is pseudo-code of what you could do:
const performScan = () => new Promise((resolve, reject) => {
    const docClient = new AWS.DynamoDB.DocumentClient();

    let params = {
        TableName: "YOUR_TABLE_NAME",
        ProjectionExpression: "id",
        Limit: 100 // only if you want something other than the default 1MB page; 100 means 100 items
    };

    let items = [];

    const scanExecute = () => {
        docClient.scan(params, (err, result) => {
            if (err) return reject(err);

            items = items.concat(result.Items);

            if (result.LastEvaluatedKey) {
                // more pages to read: continue from where this page stopped
                params.ExclusiveStartKey = result.LastEvaluatedKey;
                return scanExecute();
            }

            return resolve(items);
        });
    };

    scanExecute();
});

performScan().then(items => {
    // deal with it
});
The first thing to know about DynamoDB is that it is a key-value store with support for secondary indexes.
DynamoDB is a bad choice if the application often has to iterate over the entire data set without using indexes (primary or secondary), because the only way to do that is to use the Scan API.
DynamoDB table scans are (a few things I can think of):
Expensive(I mean $$$)
Slow for big data sets
Might use up the provisioned throughput
If you know the primary key of all the items in DynamoDB (through some external knowledge, such as the primary key being an auto-incremented value or being referenced in another DB), then you can use BatchGetItem or Query.
So if it is a one-off thing then Scan is your only option; otherwise you should look into refactoring your application to remove this scenario.
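If the ids really are known up front, a hedged sketch of the BatchGetItem alternative mentioned above could look like this (the table name, the 'id' key attribute, and the knownIds array are all assumptions for illustration):

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

const params = {
    RequestItems: {
        "YOUR_TABLE_NAME": {
            // BatchGetItem accepts at most 100 keys per request,
            // so larger id lists have to be chunked
            Keys: knownIds.slice(0, 100).map(id => ({ id })),
            ProjectionExpression: "id"
        }
    }
};

docClient.batchGet(params, (err, data) => {
    if (err) return console.error(err);
    console.log(data.Responses["YOUR_TABLE_NAME"]);
});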

Inserting records without failing on duplicate

I'm inserting a lot of documents in bulk with the latest node.js native driver (2.0).
My collection has an index on the URL field, and I'm bound to get duplicates out of the thousands of lines I insert. Is there a way for MongoDB to not crash when it encounters a duplicate?
Right now I'm batching records 1000 at a time and using insertMany. I've tried various things, including adding {continueOnError: true}. I tried inserting my records one by one, but it's just too slow; I have thousands of workers in a queue and can't really afford the delay.
Collection definition :
self.prods = db.collection('products');
self.prods.ensureIndex({url:1},{unique:true}, function() {});
Insert :
MongoProcessor.prototype._batchInsert = function(coll, items) {
    var self = this;
    if (items.length > 0) {
        var batch = [];
        var l = items.length;
        for (var i = 0; i < 999; i++) {
            if (i < l) {
                batch.push(items.shift());
            }
            if (i === 998) {
                coll.insertMany(batch, { continueOnError: true }, function(err, res) {
                    if (err) console.log(err);
                    if (res) console.log('Inserted products: ' + res.insertedCount + ' / ' + batch.length);
                    self._batchInsert(coll, items);
                });
            }
        }
    } else {
        self._terminate();
    }
};
I was thinking of dropping the index before the insert, then reindexing using dropDups, but it seems a bit hacky, my workers are clustered and I have no idea what would happen if they try to insert records while another process is reindexing... Does anyone have a better idea?
Edit :
I forgot to mention one thing. The items I insert have a 'processed' field which is set to 'false'. However the items already in the db may have been processed, so the field can be 'true'. Therefore I can't upsert... Or can I select a field to be untouched by upsert?
The 2.6 Bulk API is what you're looking for, which will require MongoDB 2.6+* and node driver 1.4+.
There are 2 types of bulk operations:
Ordered bulk operations. These execute all the operations in order and error out on the first write error.
Unordered bulk operations. These execute all the operations in parallel and aggregate all the errors. Unordered bulk operations do not guarantee the order of execution.
So in your case Unordered is what you want. The previous link provides an example:
MongoClient.connect("mongodb://localhost:27017/test", function(err, db) {
    // Get the collection
    var col = db.collection('batch_write_ordered_ops');
    // Initialize the Unordered Batch
    var batch = col.initializeUnorderedBulkOp();

    // Add some operations to be executed
    batch.insert({ a: 1 });
    batch.find({ a: 1 }).updateOne({ $set: { b: 1 } });
    batch.find({ a: 2 }).upsert().updateOne({ $set: { b: 2 } });
    batch.insert({ a: 3 });
    batch.find({ a: 3 }).remove({ a: 3 });

    // Execute the operations
    batch.execute(function(err, result) {
        console.dir(err);
        console.dir(result);
        db.close();
    });
});
*The docs do state that: "for older servers than 2.6 the API will downconvert the operations. However it’s not possible to downconvert 100% so there might be slight edge cases where it cannot correctly report the right numbers."
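As an aside (not part of the original answer, and assuming a newer driver version than the 2.0 mentioned in the question): more recent releases of the native driver expose the same unordered behaviour directly on insertMany via the ordered option, so duplicate-key errors are collected instead of aborting the batch. A minimal sketch:

coll.insertMany(batch, { ordered: false }, function(err, res) {
    if (err) {
        // duplicates surface as write errors (code 11000); err.result still reports
        // how many documents were actually inserted
        console.log('Inserted with duplicates skipped:', err.result ? err.result.nInserted : 0);
    } else {
        console.log('Inserted products: ' + res.insertedCount + ' / ' + batch.length);
    }
});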
