To perform a join-like operation, we can use either GraphQL (with DataLoader) or Mongoose's populate.
Before asking my question, I would like to give the following Task/Activities example (none of this code is tested; it is given just for the example's sake):
Task {
  _id,
  title,
  description,
  activities: [{ // of Activity type
    _id,
    title
  }]
}
In mongoose, we can retrieve the activities related to a task with the populate method, with something like this:
const task = await TaskModel.findById(taskId).populate('activities');
Using GraphQL and DataLoader, we can get the same result with something like:
const DataLoader = require('dataloader');
// A DataLoader batch function receives an array of keys and must
// return results in the same order as those keys.
const getActivitiesByTask = async (taskIds) => {
  const activities = await ActivityModel.find({ task: { $in: taskIds } });
  return taskIds.map((id) => activities.filter((a) => String(a.task) === String(id)));
};
const dataloaders = () => ({
  activitiesByTask: new DataLoader(getActivitiesByTask),
});
// ...
// Set the dataloaders in the GraphQL context
// ...
//------------------------------------------
// In another file
const resolvers = {
  Query: {
    // Returning the promise directly, so no async/await is needed here
    Task: (_, { id }) => TaskModel.findById(id),
  },
  Task: {
    activities: (task, _, context) => context.dataloaders.activitiesByTask.load(task._id),
  },
};
I tried to find an article that compares the two approaches in terms of performance, resource exhaustion, etc., but I could not find any such comparison.
Any insight would be helpful, thanks!
It's important to note that dataloaders are not just an interface for your data models. While dataloaders are touted as a "simplified and consistent API over various remote data sources" -- their main benefit when coupled with GraphQL comes from being able to implement caching and batching within the context of a single request. This sort of functionality is important in APIs that deal with potentially redundant data (think about querying users and each user's friends -- there's a huge chance of refetching the same user multiple times).
On the other hand, mongoose's populate method is really just a way of aggregating multiple MongoDB requests. In that sense, comparing the two is like comparing apples and oranges.
A fairer comparison might be using populate as illustrated in your question as opposed to adding a resolver for activities along the lines of:
activities: (task, _, context) => Activity.find().where('_id').in(task.activities)
Either way, the question comes down to whether you load all the data in the parent resolver, or let the resolvers further down do some of the work. Because resolvers are only called for fields that are included in the request, there is a potentially major impact on performance between these two approaches.
If the activities field is requested, both approaches will make the same number of roundtrips between the server and the database -- the difference in performance will probably be marginal. However, your request might not include the activities field at all. In that case, the activities resolver will never be called and we can save one or more database requests by creating a separate activities resolver and doing the work there.
On a related note...
From what I understand, aggregating queries in MongoDB using something like $lookup is generally less performant than just using populate (some conversation on that point can be found here). In the context of relational databases, however, there are additional considerations when weighing the above approaches. That's because your initial fetch in the parent resolver could be done using joins, which will generally be much faster than making separate db requests. That means that, at the expense of making the no-activities-field queries slower, you can make the other queries significantly faster.
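For reference, here is a rough, untested sketch of what that $lookup-based aggregation might look like for the Task/Activity example above (the 'activities' collection name and the 'task' foreign field are assumptions):
const tasksWithActivities = await TaskModel.aggregate([
  // taskId may need to be cast to an ObjectId for the $match stage
  { $match: { _id: taskId } },
  {
    $lookup: {
      from: 'activities',    // physical collection name (assumption)
      localField: '_id',
      foreignField: 'task',  // field on Activity pointing back to its Task (assumption)
      as: 'activities',
    },
  },
]);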
Related
I have a single /update-user endpoint on my server that triggers an updateUser query on mongo.
Basically, I retrieve the user id thanks to the cookie, and inject the received form, which can contain any key allowed in the User model, into the mongo query.
It looks like:
const form = {
  friends: [
    { id: "1", name: "paul", thumbnail: "www.imglink.com" },
    { id: "2", name: "joe", thumbnail: "www.imglink2.com" }
  ],
  locale: "en",
  age: 77
}
function updateUser(form, _id) {
  // $set expects a plain object of fields to update, not a JSON string
  return UserDB.findOneAndUpdate({ _id }, { $set: form })
}
So each time, I erase the necessary data and replace it with brand new data. Sometimes, this data can be an array of 50 objects (let's say I've removed two persons from a 36-friend array as described above).
It is very convenient, because I can abstract all the logic both in the front and back with a single update function. However, is this a pure heresy from a performance point of view? Should I rather use 10 endpoints to update each part of the form?
The form is dynamic; I never know what is going to be inside, except that it belongs to the User model, which is why I've used this strategy.
From MongoDB's point of view, it doesn't matter much. MongoDB is a journalled database (particularly with the WiredTiger storage engine), and it probably (re)writes a large part of the document on update. It might make a minor difference under very heavy loads when replicating the oplog between primary and replicas, but if you have performance constraints like these, you'll know. If in doubt, benchmark and monitor your production system - don't over-optimize.
Focus on what's best for the business domain. Is your application collaborative? Do multiple users edit the same documents at the same time? What happens when they overwrite one another's changes? Are the JSONs that the client sends to the back-end large, or do they not clog up the network? These are the most important questions you should ask, and performance should only be optimized once you have the UX, the interaction model and the concurrency issues nailed.
I'm at a crossroads trying to decide what methodology to use. Basically, I have a MongoDB collection and I want to query it with specific params provided by the user, then I want to group the response according to the value of some of those parameters. For example, let's say my collection is animals; if I query all animals I get something like this:
[
  { type: "Dog", age: 3, name: "Kahla" },
  { type: "Cat", age: 6, name: "mimi" },
  ...
]
Now I would like to return to the user a response that is grouped by the animal type, so that I end up with something like:
{
  Dogs: [...dog docs],
  Cats: [...cat docs],
  Cows: [...],
}
So basically I have 2 ways of doing this. One is to just use Model.find() to fetch all the animals that match my specific queries, such as age or any other field, and then manually filter and format my JSON response before sending it back to the user with res.json({}) (I'm using Express, btw).
Or I can use Mongo's aggregation framework and $group to do this at the query level, returning from the DB an already grouped response to my request. The only inconvenience I've found with this so far is how the response is formatted; it ends up looking more like this:
[
  {
    "_id": "Dog",
    "docs": [{ dog docs... }]
  },
  {
    "_id": "Cat",
    "docs": [{ ... }]
  }
]
The overall result is BASICALLY the same, but the formatting of the response is quite different, and my front-end client needs to adjust to how I'm sending the response. I don't really like the array of objects from the aggregation, and prefer a JSON-like object response with key names corresponding to the arrays as I see fit.
So the real question here is whether there is one significant advantage of one way over the other? Is the aggregation framework so fast that it will scale well if my collection grows to huge numbers? Is filtering through the data with javascript and mapping the response so I can shape it to my liking a very inefficient process, and hence it's better to use aggregation and adapt the front end to this response shape?
I'm assuming that by faster you mean the least time needed to serve a request. With that in mind, let's divide the time required to process your request into:
Asynchronous Operations (Network Operations, File read/write etc)
Synchronous Operations
Synchronous operations are usually much faster than asynchronous ones. (This also depends on the nature of the operation and the amount of data being processed.) For example, if you loop over an iterable (e.g. an Array or Map) with fewer than 1000 elements, it won't take more than a few milliseconds.
On the other hand, asynchronous operations take more time. For example, an HTTP request will take a couple of milliseconds to get a response.
When you query MongoDB with mongoose, it's an asynchronous call and it will take more time. So, the more queries you run against the database, the slower your API gets. MongoDB aggregation can help you reduce the total number of queries, which may help you make your APIs faster. The catch is that aggregations are usually slower than plain find requests.
In summary, if you can filter the data manually without adding an extra DB query, it is going to be faster.
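For illustration, here is a minimal sketch of the single-find-plus-in-memory-grouping approach (the Animal model and its fields follow the example in the question; everything else is an assumption):
// Fetch once, then group in application code
const animals = await Animal.find({ age: { $gte: 2 } }).lean();

const grouped = animals.reduce((acc, doc) => {
  (acc[doc.type] = acc[doc.type] || []).push(doc);
  return acc;
}, {});

res.json(grouped); // e.g. { Dog: [...], Cat: [...] }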
What is the difference between these two methods, and which should I use?
What is the difference between initializeUnorderedBulkOp and bulkWrite with ordered: false?
What is the difference between initializeOrderedBulkOp and the default (ordered) bulkWrite?
https://docs.mongodb.com/manual/reference/method/db.collection.initializeUnorderedBulkOp/
https://docs.mongodb.com/manual/reference/method/db.collection.initializeOrderedBulkOp/
https://docs.mongodb.com/manual/core/bulk-write-operations/
TL;DR
The difference is mainly in the usage. bulkWrite takes in an array of operations and executes it immediately.
initializeOrderedBulkOp and initializeUnorderedBulkOp return an instance which can be used to build queries gradually and execute them at the end using the execute function.
Late to the party but I had a similar confusion so did some digging up.
The difference lies in the API implementation and usage.
bulkWrite
According to the API reference,
Perform a bulkWrite operation without a fluent API
In this method, you directly pass in an array of "write operations" as the first argument. See here for examples. I think by fluent API, they mean you don't exactly separate your update operations from your insert operations or delete operations. Every operation is in one array.
One crucial point is that these operations are executed immediately.
As noted in the question, the execution is ordered by default but can be changed to unordered by setting { ordered: false } in the second argument, which is a set of options.
The return value of the function is BulkWriteResult which contains information about the executed bulk operation.
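For reference, a minimal sketch of what such a call might look like (the collection and field names are made up for illustration):
const result = await db.collection('items').bulkWrite(
  [
    { insertOne: { document: { sku: 'abc', qty: 100 } } },
    { updateOne: { filter: { sku: 'def' }, update: { $set: { qty: 5 } } } },
    { deleteOne: { filter: { sku: 'ghi' } } },
  ],
  { ordered: false } // allow the server to continue past individual errors
);
console.log(result.insertedCount, result.modifiedCount, result.deletedCount);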
initializeOrderedBulkOp and initializeUnorderedBulkOp
Referring to the API reference again,
Initiate an In order bulk write operation
As it says here, these methods initialize/return an instance which provides an API for building bulk operations. Those instances are of the class OrderedBulkOperation and UnorderedBulkOperation respectively.
const bulk = db.items.initializeUnorderedBulkOp();
// `bulk` is of the type UnorderedBulkOperation
This bulk variable provides a "fluent API" which allows you to build your queries across the application:
bulk.find( { /** foo **/ } ).update( { $set: { /** bar **/ } } );
Bear in mind, these queries are not executed in the above code. You can keep building up the whole operation, and once all the write operations have been added, you finally execute the query:
bulk.execute();
This execute function returns a BulkWriteResult instance which is basically what bulkWrite returns. Our database is finally changed.
Which one should you use?
It depends on your requirements.
If you want to update a lot of documents with separate queries and values from an existing array, bulkWrite seems a good fit. If you want to build your bulk operation through a fairly complex business logic, the other options make sense. Note that you can achieve the same by constructing a global array gradually and passing it in the end to bulkWrite.
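As a rough, untested sketch of that last point, you might accumulate the operations across your business logic and execute them once at the end (collection and field names are hypothetical):
const operations = [];

// ...somewhere in the business logic:
operations.push({ updateOne: { filter: { sku: 'abc' }, update: { $set: { qty: 10 } } } });
// ...somewhere else:
operations.push({ deleteOne: { filter: { sku: 'xyz' } } });

// Finally, send everything to the server in a single call:
const result = await db.collection('items').bulkWrite(operations, { ordered: false });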
We've been using mongoose in Node.js/Express for some time, and one thing we are not clear about is what happens when you have a query using find and a large result set of documents. For example, let's say you wanted to iterate through all your users to do some low-priority background processing.
let cursor = User.find({}).cursor();
cursor.on('data', function(user) {
  // do some processing here
});
My understanding is that cursor.on('data') doesn't block. Therefore, if you have let's say 100,000 users, you would overwhelm the system trying to process 100,000 people nearly simultaneously. There does not seem to be a "next" or other method to regulate our ability to consume the documents.
How do you process large document result sets?
Mongoose actually does have a .next() method for cursors! Check out the Mongoose documentation. Here is a snapshot of the Example section as of this answer:
// There are 2 ways to use a cursor. First, as a stream:
Thing.
  find({ name: /^hello/ }).
  cursor().
  on('data', function(doc) { console.log(doc); }).
  on('end', function() { console.log('Done!'); });

// Or you can use `.next()` to manually get the next doc in the stream.
// `.next()` returns a promise, so you can use promises or callbacks.
var cursor = Thing.find({ name: /^hello/ }).cursor();
cursor.next(function(error, doc) {
  console.log(doc);
});

// Because `.next()` returns a promise, you can use co
// to easily iterate through all documents without loading them
// all into memory.
co(function*() {
  const cursor = Thing.find({ name: /^hello/ }).cursor();
  for (let doc = yield cursor.next(); doc != null; doc = yield cursor.next()) {
    console.log(doc);
  }
});
With the above in mind, it's possible that your data set could grow to be quite large and difficult to work with. It may be a good idea for you to consider using MongoDB's aggregation pipeline for simplifying the processing of large data sets. If you use a replica set, you can even set a readPreference to direct your large aggregation queries to secondary nodes, ensuring that the primary node's performance remains largely unaffected. This would shift the burden from your server to less-critical secondary database nodes.
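As a hedged sketch of that idea in Mongoose (the pipeline stages and field names are assumptions, and a replica set is presumed):
// 'secondaryPreferred' sends the query to a secondary when one is available
const countsByCountry = await User.aggregate([
  { $match: { active: true } },
  { $group: { _id: '$country', count: { $sum: 1 } } },
]).read('secondaryPreferred');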
If your data set is particularly large and you perform the same calculations on the same documents repeatedly, you could even consider storing precomputed aggregation results in a "base" document and then apply all unprocessed documents on top of that "base" as a "delta"--that is, you can reduce your computations down to "every change since the last saved computation".
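A rough, untested sketch of that base-plus-delta idea (the Summary model, its fields, and the createdAt timestamp are all assumptions):
// Load the previously saved "base" computation, or start from scratch
const base = (await Summary.findOne({ name: 'userCount' })) || { total: 0, lastRun: new Date(0) };

// Aggregate only the "delta": documents created since the last run
const [delta] = await User.aggregate([
  { $match: { createdAt: { $gt: base.lastRun } } },
  { $group: { _id: null, added: { $sum: 1 } } },
]);

// Fold the delta back into the stored base
await Summary.updateOne(
  { name: 'userCount' },
  { $set: { total: base.total + (delta ? delta.added : 0), lastRun: new Date() } },
  { upsert: true }
);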
Finally, there's also the option of load balancing. You could have multiple application servers for processing and have a load balancer distributing requests roughly evenly between them to prevent any one server from becoming overwhelmed.
There are quite a few options available to you for avoiding a scenario where your system becomes overwhelmed by all of the data processing. The strategies you should employ will depend largely on your particular use case. In this case, however, it seems as though this is a hypothetical question, so the additional strategies noted probably will not be things you will need to concern yourself with. For now, stick with the .next() calls and you should be fine.
I just found a "modern" way of doing this using for await:
for await (const doc of User.find().cursor()) {
  console.log(doc.name);
}
I am using this for my 4M+ docs in one single collection, and it worked fine for me.
Here is the mongoose documentation if you want to refer to it.
With async/await this has become easy. We can now write:
const cursor = model.find({}).cursor();
for await (const doc of cursor) {
  // carry out any operation on each document
  console.log(doc)
}
I'm new to Node.js and Cloud Functions for Firebase, so I'll try to be specific with my question.
I have a Firebase database with objects that include a "score" field. I want the data to be retrieved based on that, and that can be done easily on the client side.
The issue is that, if the database grows big, I'm worried that it will either take too long to return or consume a lot of resources. That's why I was thinking of an HTTP service using Cloud Functions to store a cache with the top N objects, updating itself via a listener whenever the score of any object changes.
Then, client side just has to call something like https://myexampleprojectroute/givemethetoplevels to receive a Json with the top N levels.
Is it reasonable? If so, how can I approach that? Which structures do I need for this cache, and how do I return them in JSON format via HTTP?
At the moment I'll keep doing it client side, but I'd really like to have that, both for performance and for learning purposes.
Thanks in advance.
EDIT:
In the end I did not implement the optimization. The reasons are, first, that the Firebase database does not expose a "child count", so with my newbie JavaScript knowledge I didn't find a way to implement that. Second, and most important, I'm pretty sure it won't scale up to millions, having at most 10K entries, and Firebase has rules for optimizing sorted reads. For more information please check out this link.
Also, I'll post a simple code snippet to retrieve data from your database via http request using cloud-functions in case someone is looking for it. Hope this helps!
// Simple test function to retrieve a JSON object from the DB
// Warning: no security measures are used, such as authentication, request-method checks, etc.
exports.request_all_levels = functions.https.onRequest((req, res) => {
  const ref = admin.database().ref('CustomLevels');
  ref.once('value').then(function(snapshot) {
    res.status(200).send(JSON.stringify(snapshot.val()));
  });
});
You're duplicating data upon writes, to gain better read performance. That's a completely reasonable approach. In fact, it is so common in NoSQL databases to keep such derived data structures that it even has a name: denormalization.
A few things to keep in mind:
While Cloud Functions run in a more predictable environment than the average client, the resources are still limited. So reading a huge list of items to determine the latest 10 items is still a suboptimal approach. Instead, you'll want to keep the derived data structure up to date with every write operation.
So if you have a "latest 10" list and a new item comes in, you remove the oldest item and add the new one. With this approach you have at most 11 items to consider, compared to having your Cloud Function query the whole list of items for the latest 10 upon every write, which is an O(something-with-n) operation.
The same goes for an averaging operation: you'll find a moving average to be the most performant, because it doesn't require rereading all of the previous data.
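For illustration, here is a rough, untested sketch of the "top 10" idea as a Realtime Database trigger (the /CustomLevels and /topLevels paths and the score field are assumptions, and a transaction would be needed to make this safe under concurrent writes):
exports.updateTopLevels = functions.database
  .ref('/CustomLevels/{levelId}')
  .onCreate(async (snapshot, context) => {
    const newLevel = { id: context.params.levelId, ...snapshot.val() };
    const topRef = admin.database().ref('topLevels');
    const current = (await topRef.once('value')).val() || [];
    // At most 11 items to consider: the existing top 10 plus the new one
    const updated = [...current, newLevel]
      .sort((a, b) => b.score - a.score)
      .slice(0, 10);
    return topRef.set(updated);
  });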