Speeding up my cloudant query - node.js

I was wondering whether someone could provide some advice on my Cloudant query below. It is now taking upwards of 20 seconds to execute against a DB of 50,000 documents, and I suspect I could be getting better speed than this.
The purpose of the query is to find all of my documents with the attribute "searchCode" equal to a specific value, plus a further list of specific IDs.
Both searchCode and _id are indexed. Any ideas why my query would be taking so long, or what I could do to speed it up?
mydb.find({selector: {"$or": [{"searchCode": searchCode}, {"_id": {"$in": idList}}]}}, function (err, result) {
    if (!err) {
        fulfill(result.docs);
    } else {
        console.error(err);
    }
});
Thanks,
James

You could try doing separate calls for the two queries:
find me documents where searchCode = 'some value'
find me documents whose ids match a list of ids
The first can be achieved with a find call and a query like so:
{ selector: {"searchCode": searchCode} }
The second can be achieved by hitting the database's _all_docs endpoint, passing in the list of ids as a keys parameter, e.g.
GET /db/_all_docs?keys=["a","b","c"]
You might find that running both requests in parallel and merging the results gives you better performance.
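As a rough illustration only, here is a sketch of running both requests in parallel and merging the results, in the style of the question's code. It assumes a nano/Cloudant-style client where fetch() posts a list of keys to _all_docs (check your client library for the equivalent call), and that fulfill, searchCode, and idList are in scope as in the question:
function bySearchCode(code) {
    return new Promise(function (resolve, reject) {
        mydb.find({ selector: { "searchCode": code } }, function (err, result) {
            if (err) { return reject(err); }
            resolve(result.docs);
        });
    });
}

function byIds(ids) {
    return new Promise(function (resolve, reject) {
        // Assumption: fetch() POSTs the keys to _all_docs with include_docs=true.
        mydb.fetch({ keys: ids }, function (err, result) {
            if (err) { return reject(err); }
            // Skip rows with no doc (e.g. ids that were not found).
            resolve(result.rows.filter(function (row) { return row.doc; })
                               .map(function (row) { return row.doc; }));
        });
    });
}

Promise.all([bySearchCode(searchCode), byIds(idList)])
    .then(function (results) {
        // Merge the two result sets and de-duplicate by _id, since a document
        // could match both the searchCode query and the id list.
        var seen = {};
        var docs = results[0].concat(results[1]).filter(function (doc) {
            if (seen[doc._id]) { return false; }
            seen[doc._id] = true;
            return true;
        });
        fulfill(docs);
    })
    .catch(console.error);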

Related

Use an array of values to query Firestore and setup a snapshot listener

Here is my problem:
I have a Firestore collection that has a number of documents. About 500 documents are generated/updated every hour and saved to the collection.
I would like to query the collection and set up a real-time snapshot listener for a subset of document IDs that are provided by the client.
I think maybe I could do something like this (this syntax is likely not correct... just trying to get a feel for whether it's even possible... but isn't the "in" operator limited to an array of 10 items?):
const subbedDocs = ["doc1","doc2","doc3","doc4","doc5"]
docsRef.where('docID', 'in', subbedDocs).onSnapshot((doc) => {
    handleSnapshot(doc);
});
I'm sorry, that code probably doesn't make sense... I'm still trying to learn all the ins and outs of Firestore.
Essentially, what I am trying to do is take an array of IDs and set up an .onSnapshot listener for those IDs. This list of IDs could be upwards of 40-50 items. Is this even possible? I am trying to avoid setting up a listener on the whole collection and filtering out things I am not "subscribed" to, as that seems wasteful from a resources perspective.
If you have the doc IDs in your array (it looks like you do), you can loop over them and start a listener for each one:
const subbedDocs = ["doc1", "doc2", "doc3", "doc4", "doc5"];
for (let i = 0; i < subbedDocs.length; i++) {
    const docID = subbedDocs[i];
    docsRef.doc(docID).onSnapshot((doc) => {
        handleSnapshot(doc);
    });
}
It would be better to listen to a single query that covers all the filtered docs at once, but if you want to listen to each of them with an explicit listener, this will do the trick.
As you've discovered, Firestore's in operator only allows up to 10 entries in the array. I'm also guessing you've added the docID as a field in the document, since I don't believe 'docID' references the actual document ID.
I would not take this approach because of the 10-entry limitation. What I would do instead is, as the client selects documents to follow, write a unique client ID into a field in each of those documents, so your query avoids the limitation entirely. If you store the client IDs in an array field (called something like "ListenerArray"), you can support an unlimited number of client listeners (up to the implementation limits of Firestore). Your query would look more like:
docsRef.where('ListenerArray', 'array-contains', clientID).onSnapshot((doc) => {
    handleSnapshot(doc);
})
array-contains checks a single value against all entries in a document array, without limit. Every client can mark any number of documents to subscribe to.
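As a rough sketch of the subscribe step (assuming the Firebase v8 web SDK, and that docsRef and the "ListenerArray" field name match the answer above), the client could add and remove its own ID like this:
// Hypothetical sketch: mark a document as "followed" by this client.
// arrayUnion only adds the value if it is not already present.
function subscribeTo(docID, clientID) {
    return docsRef.doc(docID).update({
        ListenerArray: firebase.firestore.FieldValue.arrayUnion(clientID)
    });
}

// And to unsubscribe, remove the client ID again.
function unsubscribeFrom(docID, clientID) {
    return docsRef.doc(docID).update({
        ListenerArray: firebase.firestore.FieldValue.arrayRemove(clientID)
    });
}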

mongoose query using sort and skip on populate is too slow

I'm using an ajax request from the front end to load more comments for a post from the back end, which uses Node.js and mongoose. I won't bore you with the front-end code or the route code, but here's the query code:
Post.findById(req.params.postId).populate({
    path: type, // type will either contain "comments" or "answers"
    populate: {
        path: 'author',
        model: 'User'
    },
    options: {
        sort: sortBy, // sortBy contains either "-date" or "-votes"
        skip: parseInt(req.params.numberLoaded), // how many are already shown
        limit: 25 // I only load this many new comments at a time
    }
}).exec(function(err, foundPost){
    console.log("query executed"); // code takes too long to get to this line
    if (err){
        res.send("database error, please try again later");
    } else {
        res.send(foundPost[type]);
    }
});
As mentioned in the title, everything works fine; my problem is just that this is too slow: the request takes about 1.5-2.5 seconds. Surely mongoose has a way of doing this that loads faster. I poked around the mongoose docs and Stack Overflow, but didn't really find anything useful.
The skip-and-limit approach is slow by nature in MongoDB, because the server normally needs to retrieve all matching documents, sort them, and only then return the desired segment of the results.
What you need to do to make it faster is to define indexes on your collections.
According to MongoDB's official documents:
Indexes support the efficient execution of queries in MongoDB. Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a collection, to select those documents that match the query statement. If an appropriate index exists for a query, MongoDB can use the index to limit the number of documents it must inspect.
-- https://docs.mongodb.com/manual/indexes/
Indexes increase the storage size of a collection, but they improve query efficiency a lot.
Indexes are commonly defined on fields which are frequently used in queries. In this case, you may want to define indexes on the date and/or votes fields.
Read the mongoose documentation to find out how to define indexes in your schemas:
http://mongoosejs.com/docs/guide.html#indexes
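As an illustration, a minimal sketch of declaring such indexes in a Mongoose schema (the schema and field names here are assumptions for the example, not the OP's actual models):
const mongoose = require('mongoose');

// Hypothetical comment schema; only the indexed fields matter here.
const commentSchema = new mongoose.Schema({
    author: { type: mongoose.Schema.Types.ObjectId, ref: 'User' },
    date: Date,
    votes: Number,
    body: String
});

// Descending indexes to match sorts like "-date" and "-votes".
commentSchema.index({ date: -1 });
commentSchema.index({ votes: -1 });

module.exports = mongoose.model('Comment', commentSchema);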

How to fetch/count millions of records in mongodb with nodejs

We have a collection with millions of records in MongoDB. It's taking a lot of time, and timing out, to count these records and build pagination over them. What's the best way to do this with Node.js? I want to create a page where I can see records with pagination, a count, delete, and search. Below is the code that queries Mongo with different conditions.
crowdResult.find({ "auditId":args.audit_id,"isDeleted":false})
.skip(args.skip)
.limit(args.limit)
.exec(function (err, data) {
if (err)
return callback(err,null);
console.log(data);
return callback(null,data);
})
If the goal is to get through a large dataset without timing out, then I use the following approach: get pages one after another and process each paged result set as soon as it becomes available:
https://gist.github.com/pulkitsinghal/2f3806670439fa137210fc26b134237f
Please focus on the following lines to get a quick idea of what the code is doing before diving deeper:
Let getPage() handle the work, you can set the pageSize and query to your liking:
https://gist.github.com/pulkitsinghal/2f3806670439fa137210fc26b134237f#file-sample-js-L68
Method signature:
https://gist.github.com/pulkitsinghal/2f3806670439fa137210fc26b134237f#file-sample-js-L29
Process pagedResults as soon as they become available:
https://gist.github.com/pulkitsinghal/2f3806670439fa137210fc26b134237f#file-sample-js-L49
Move on to the next page:
https://gist.github.com/pulkitsinghal/2f3806670439fa137210fc26b134237f#file-sample-js-L53
The code will stop when there is no more data left:
https://gist.github.com/pulkitsinghal/2f3806670439fa137210fc26b134237f#file-sample-js-L41
Or it will stop when working on the last page of data:
https://gist.github.com/pulkitsinghal/2f3806670439fa137210fc26b134237f#file-sample-js-L46
I hope this offers some inspiration, even if it's not an exact solution for your needs.
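For illustration only, a minimal sketch of that paging pattern in the question's own style (the query is taken from the question; the recursive getPage helper and the processPage placeholder are assumptions, not the gist's exact code):
// Hypothetical sketch: walk through the collection one page at a time,
// processing each page as soon as it arrives instead of loading everything.
function getPage(skip, pageSize, done) {
    crowdResult.find({ "auditId": args.audit_id, "isDeleted": false })
        .skip(skip)
        .limit(pageSize)
        .exec(function (err, data) {
            if (err) return done(err);
            if (data.length === 0) return done(null); // no more data left, stop

            processPage(data); // placeholder: handle this page (render, count, etc.)

            if (data.length < pageSize) return done(null); // this was the last page
            getPage(skip + pageSize, pageSize, done); // move on to the next page
        });
}

getPage(0, 500, function (err) {
    if (err) return console.error(err);
    console.log('done');
});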

Mongoose: How to slice the entire query?

I'm looking for a way to get M documents out of a particular query, starting at the Nth document, without loading the entire collection in the exec() callback and then splicing an array from there. I'm well aware of .limit(x), which works fine and dandy to get documents 0 through x, but to my knowledge there is no way to choose where the query starts limiting the number of documents, something like limit(10) starting from document 5.
I tried something like this:
Model.find().sort({creationDate: -1}).where("_id").splice([5,10]).exec(function(err, data) {
    if (err) res.send(502, "ERROR IN DB DATABASE");
    res.send(data);
});
But the resulting data consists of the entire collection.
Any ideas on how to achieve this?
.skip() is what you are looking for:
Model.find(...).sort(...).skip(5).limit(10).exec(....)
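Applied to the query from the question, that would look something like the sketch below (the error handling simply mirrors the question's own):
// Skip the first 5 documents, then return the next 10.
Model.find().sort({ creationDate: -1 }).skip(5).limit(10).exec(function (err, data) {
    if (err) return res.send(502, "ERROR IN DB DATABASE");
    res.send(data);
});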

Referencing external doc in CouchDB view

I am scraping a 90K-record database using JSON-RPC and I am trying to put in some basic error checking. I want to start by scraping the database twice using two different settings and adding a prefix to the second scrape. This way I can check that the two settings are not producing different records (due to dropped updates, etc). I wanted to implement the comparison using a view which compares each document from the first scrape with its twin produced by the second scrape, and then emits the names of records with a difference between them.
However, I cannot quite figure out how to pull another doc into the view; everything I have read only discusses referencing external docs via the emit() function, which is too late to let me compare them. In the example below, the (imaginary) lookup() function would grab the referenced document.
Is this just not possible?
function(doc) {
    if (doc._id.slice(0,1) !== '$' && doc._id.slice(0,1) !== '_') {
        var otherDoc = lookup('$test' + doc._id);
        if (otherDoc) {
            var keys = doc.value.keys();
            var same = true;
            keys.forEach(function(key) {
                if ((key.slice(0,1) !== '_') && (key.slice(0,1) !== '$') && (key !== 'expires')) {
                    if (!Object.equal(otherDoc[key], doc[key])) {
                        same = false;
                    }
                }
            });
            if (!same) {
                emit(doc._id, 1);
            }
        }
    }
}
Context
You are correct that this is not possible in CouchDB. The whole point of the map function is that it must be idempotent, otherwise you lose all the other nice benefits of a pre-calculated index.
This is why you cannot access external resources in the map function, whether they be other records or the clock. Any time you run a map you must always get the same result if you put the same record into it. Since there are no relationships between records in CouchDB, you cannot promise that this is possible.
Solution
However, you can still achieve your end goal, just by different means. Some possibilities...
Assuming there is some meaningful numeric value in each doc, you could use a view to take the sum of all those values and group them by which import you did ({key: <batch id>, value: <meaningful number>}). Then compare the two numbers in your client or the browser to see if they match.
A brute force approach would be to use a view to pair the docs that should match. Each doc is on a different row, but they're grouped by a common field. Then iterate through the entire index comparing the pairs. This would certainly be the quickest to code and doesn't depend on your application or data.
Implement a validation function to enforce a schema on your data. Just be warned that this will reduce your write throughput since each written record will be piped out of Erlang and into the JS engine. Also, this is only applicable if you're worried about properly formed records instead of their precise content, which might not be the case.
Instead of your different batch jobs creating different docs, have them place their data into the same doc. The structure might look like this: { "_id": "something meaningful", "batch_one": { ..data.. }, "batch_two": { ..data.. } }. Then your validation function could compare them, or you could create a view that indexes all the docs that don't match. It all depends on where in your pipeline you want to do the error checking and correction.
Personally I like the last option best, but only if you don't plan to use the database as-is in production; i.e., you wouldn't want to carry around all that extra data in each record.
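For what it's worth, a minimal sketch of such a view for the combined-doc layout above (the field names are the ones from the example structure; the JSON.stringify comparison is a naive stand-in for a real deep-equality check and assumes stable key order):
// Map function: emit the id of any doc whose two batches disagree.
function (doc) {
    if (doc.batch_one && doc.batch_two) {
        var same = JSON.stringify(doc.batch_one) === JSON.stringify(doc.batch_two);
        if (!same) {
            emit(doc._id, 1);
        }
    }
}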
Hope that helps.
Cheers.
