I am writing an aggregate function for mongo using the native driver, v2.1.
My code looks something like this:
db.collection("whatever").aggregate(...).each(function(err, doc) {
// cursor processing
})
My question is: where is the cursor processing executed? On the client or on the server?
I am assuming that it's executed on the client side (Node), and if so, is there any way to run cursor processing (or some other sort of data processing) on the server?
I am working with many gigabytes of data, and I don't want to keep transferring it back and forth between the Mongo server and the client.
Thx!
A little bit about the internals of the 'mongodb' driver's Cursor constructor.
When each() (a prototype method of the Cursor constructor) is invoked with a callback function, it fires the given query against the database, receives the complete result set returned by the database over the wire, and pushes it into an array in memory on the client side (the Node application's end).
It then invokes the callback given to each() with every element of that array as an argument, Node style: callback(err, doc).
So the point to note is that once the data is received from the database, building the array and iterating through it happen at the application's end. Loading and iterating an array can be memory intensive; it is the caller's responsibility to make sure that the entire result set fits in memory. Beyond that, the amount of data transferred over the wire should also be considered.
So here are my 2 cents:
When dealing with a substantial amount of data through the mongodb driver,
it is better to set the cursor's batch size, for example cursor.batchSize(100). When a batch size is set, the cursor gets the data from the database in batches (of 100 docs in the example above) instead of trying to fetch the complete result set in one go. Working in batches consumes relatively less memory and keeps less data in flight over the wire at once, so performance is better.
Use projections in the query wherever possible. By using projections in the right place in the right way, we stop unnecessary fields from being transferred to the client side. Less data to process means less memory and better performance.
Please be careful about calling sort() on cursors. A sort operates on the complete list of documents matched by the find query; if that list is big, it can slow down query execution. When you need to sort, check whether you can filter in the query before you sort. This is not exactly a client-side issue, but our queries should execute as fast as possible. A short sketch combining the batch size and projection points follows.
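Here is a minimal sketch putting those two points together, assuming the ~2.x Node.js driver and an already-connected db; the collection, filter and field names are made up for illustration:

var cursor = db.collection('whatever')
  .find({ status: 'active' }, { fields: { name: 1, total: 1, _id: 0 } }) // projection: fetch only the fields you need
  .batchSize(100);                                                       // pull 100 docs per round trip

cursor.each(function (err, doc) {
  if (err) throw err;
  if (doc === null) {
    // each() signals the end of the result set with a null doc
    return db.close();
  }
  // per-document processing happens here, on the client
});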
Hope this information is useful.
Thank you.
Related
We are doing manipulation and insertion of data in MongoDB. A single insert takes about 28 ms, and I have to insert twice per request. If I get 6000 requests at a time, I have to insert each document individually and it takes a lot more time. How can I optimize this? Kindly help me with this.
var obj = new gnModel({
  id: data.EID,
  val: data.MID,
});
let response = await insertIntoMongo(obj);
If it is not vital for the data to be stored immediately, you can implement some form of batching.
For example, you can have a service which queues operations and commits them to the database every X seconds. In the service itself, you can use Mongo's Bulk API, and more specifically for insertion, Bulk.insert(). It lets you queue operations to be executed as a single query (or at least a minimal number of queries/round trips).
It would also be a good idea to serialize and store this operation log/cache somewhere, as a server restart will wipe it out if it is stored entirely in memory. A possible solution is Redis, since it can persist data to disk and is also distributed, enabling you to queue operations from different application instances.
You'll achieve even better performance if the operations are unrelated and not dependent on each other. In that case you can use db.collection.initializeUnorderedBulkOp(), which allows Mongo to execute the operations in parallel instead of sequentially, and a single failed operation won't affect the execution of the rest of the set (contrary to an ordered bulk op). A rough sketch of the queue-and-flush idea is below.
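Here is a rough illustration of that idea, not the original code; the collection name ('gn'), the flush interval and the error handling are assumptions:

var queue = [];

function enqueue(doc) {
  queue.push(doc); // called instead of inserting immediately
}

// Flush everything queued so far every 5 seconds as one unordered bulk operation.
setInterval(function () {
  if (queue.length === 0) return;
  var batch = queue.splice(0, queue.length);
  var bulk = db.collection('gn').initializeUnorderedBulkOp();
  batch.forEach(function (doc) { bulk.insert(doc); });
  bulk.execute(function (err, result) {
    if (err) console.error('bulk insert failed', err); // re-queue or persist the batch if this matters
  });
}, 5000);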
We are troubled by occasionally occurring "cursor not found" exceptions for some Morphia queries' asList, and I've found a hint on SO that this might be quite memory consumptive.
Now I'd like to know a bit more about the background: can somebody explain (in English) what a cursor (in MongoDB) actually is? Why can it be kept open, or not be found?
The documentation defines a cursor as:
A pointer to the result set of a query. Clients can iterate through a cursor to retrieve results. By default, cursors timeout after 10 minutes of inactivity
But this is not very telling. Maybe it would help to also define what a batch of query results is, because the documentation also states:
The MongoDB server returns the query results in batches. Batch size will not exceed the maximum BSON document size. For most queries, the first batch returns 101 documents or just enough documents to exceed 1 megabyte. Subsequent batch size is 4 megabytes. [...] For queries that include a sort operation without an index, the server must load all the documents in memory to perform the sort before returning any results.
Note: the queries in question don't use sort statements at all, and no limit or offset either.
Here's a comparison between toArray() and cursors after a find() in the Node.js MongoDB driver. Common code:
var MongoClient = require('mongodb').MongoClient,
    assert = require('assert');

MongoClient.connect('mongodb://localhost:27017/crunchbase', function (err, db) {
    assert.equal(err, null);
    console.log('Successfully connected to MongoDB.');

    const query = { category_code: "biotech" };

    // toArray() vs. cursor code goes here
});
Here's the toArray() code that goes in the section above.
db.collection('companies').find(query).toArray(function (err, docs) {
    assert.equal(err, null);
    assert.notEqual(docs.length, 0);

    docs.forEach(doc => {
        console.log(`${doc.name} is a ${doc.category_code} company.`);
    });

    db.close();
});
Per the documentation:
The caller is responsible for making sure that there is enough memory to store the results.
Here's the cursor-based approach, using the cursor.forEach() method:
const cursor = db.collection('companies').find(query);

cursor.forEach(
    function (doc) {
        console.log(`${doc.name} is a ${doc.category_code} company.`);
    },
    function (err) {
        assert.equal(err, null);
        return db.close();
    }
);
With the forEach() approach, instead of fetching all the data into memory, we're streaming it to our application. find() returns a cursor immediately because it doesn't actually make a request to the database until we try to use some of the documents it will provide; the point of the cursor is to describe our query. The second parameter to cursor.forEach() shows what to do when an error occurs.
In the initial version of the code above, it was toArray() that forced the database call: it meant we needed ALL the documents and wanted them in an array.
Note that MongoDB returns data in batches: as the application iterates the cursor, the driver issues successive requests to MongoDB for the next batch.
forEach() scales better than toArray() because we can process documents as they come in, all the way to the end. Contrast that with toArray(), where we wait for ALL the documents to be retrieved and the entire array to be built. This means we're not getting any advantage from the fact that the driver and the database work together to batch results to the application. Batching is meant to provide efficiency in terms of memory overhead and execution time. Take advantage of it in your application if you can.
I am by no means a MongoDB expert, but I just want to add some observations from working on a medium-sized Mongo system for the last year. Also thanks to @xameeramir for the excellent walkthrough of how cursors work in general.
The causes of a "cursor not found" exception may be several. One that I have noticed is explained in this answer.
The cursor lives server side. It is not distributed over a replica set but exists on the instance that is primary at the time of creation. This means that if another instance takes over as primary, the cursor will be lost to the client. If the old primary is still up and around, it may still be there, but of no use. I guess it is garbage collected away after a while. So if your Mongo replica set is unstable, or you have a shaky network in front of it, you are out of luck when doing any long-running queries.
If the full content of what the cursor wants to return does not fit in memory on the server, the query may be very slow. The RAM on your servers needs to be larger than the largest query you run.
All this can partly be avoided by better design. For a use case with large, long-running queries, you may be better off with several smaller collections instead of one big one.
The collection's find() method returns a cursor, which points to the set of documents (called the result set) that match the query filter. The result set consists of the actual documents returned by the query, but it lives on the database server.
The client program, for example the mongo shell, gets a cursor. You can think of the cursor as an API, or a little program, for working with the result set. The cursor has many methods which can be run to perform actions on the result set: some of them affect the result set's data and some provide status or information about it.
As the cursor maintains information about the result set, some of that information can change as you consume the result set's data by applying other cursor methods. You use these methods and this information to suit your application, i.e., how and what you want to do with the queried data.
Working with the result set through the cursor, here are some of its commonly used methods and features in the mongo shell:
The count() method returns the number of documents in the result set as produced by the original query. It is constant at any point in the life of the cursor; this piece of information stays the same even after the cursor is closed or exhausted.
As you read documents from the result set, the result set gets exhausted; once it is completely exhausted you cannot read any more. hasNext() tells you whether there are any documents still available to be read, returning true or false. next() returns a document if one is available (you first check with hasNext(), then call next()). These two methods are commonly used to iterate over the result set's data. Another iteration method is forEach().
The data is retrieved from the server in batches, which have a default size. With the first batch you read documents, and when all of its documents are read, the following next() call retrieves the next batch, and so on until all documents are read from the result set. This batch size can be configured, and you can also query its status (a small snippet after these points shows how).
If you apply toArray() to the cursor, all the remaining documents in the result set are loaded into your client computer's memory and made available as a JavaScript array, and the result set is exhausted. A subsequent hasNext() will return false, and next() will throw an error (once you exhaust the cursor, you cannot read data from it). Since this method loads all the result set data into your client's memory (the array), it can be memory consuming for large result sets.
itcount() returns the count of the remaining documents in the result set and exhausts the cursor.
There are cursor methods like isClosed(), isExhausted() and size() which give status information about the cursor and its underlying result set as you work with your data.
Those are the basic features of cursors and result sets. There are many more cursor methods; you can try them out to get a better understanding.
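For instance, in the mongo shell (a tiny illustrative snippet, not part of the example further below):

var cur = db.test.find().batchSize(50); // ask the server for 50 documents per batch
cur.next();                             // the first batch is fetched here
cur.objsLeftInBatch();                  // how many documents remain in the current batch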
Reference:
mongo shell cursor methods
Cursor behavior with the aggregate method (the collection's aggregate method also returns a cursor)
Example usage in mongo shell:
Assume the test collection has 200 documents (run the commands in the same sequence).
var cur = db.test.find( { } ).limit(25) creates a result set with 25 documents only.
But, cur.count() will show 200, which is the actual count of documents matched by the query's filter.
hasNext() will return true.
next() will return a document.
itcount() will return 24 (and exhausts the cursor).
itcount() again will return 0.
cur.count() will still show 200.
This error also occurs when you have a large data set and are doing batch processing on it, and each batch takes long enough that the total time exceeds the default cursor lifetime.
In that case you need to change that default, telling Mongo not to expire the cursor until the processing is done.
Do check the noCursorTimeout documentation; a sketch for the Node.js driver follows.
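For example, with the Node.js driver the flag can be set on the cursor. This is only a hedged sketch (the collection name 'bigdata' and the processing are placeholders), and once the timeout is disabled you must close the cursor yourself:

var cursor = db.collection('bigdata')
  .find({})
  .addCursorFlag('noCursorTimeout', true);

cursor.forEach(
  function (doc) {
    // long-running per-document / per-batch processing
  },
  function (err) {
    if (err) console.error(err);
    cursor.close(); // important: the server will never time this cursor out, so release it explicitly
    db.close();
  }
);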
A cursor is an object returned by calling db.collection.find() which enables iterating through the documents (the NoSQL equivalent of SQL "rows") of a MongoDB collection (the NoSQL equivalent of a "table").
In case your cluster is stable and no members were down or changing state, the most likely reason for not finding the cursor is this:
The default idle cursor timeout is 10 minutes, but in versions >= 3.6 the cursor is also associated with a session, which has a default session timeout of 30 minutes. So even if you set the cursor to not expire with the noCursorTimeout() option, you are still limited by the 30-minute session timeout. To avoid your cursor being killed by the session timeout, you will need to check periodically in your code and execute the refreshSessions command:
db.adminCommand({"refreshSessions" : [sessionId]})
This extends the session by another 30 minutes so your cursor is not killed while you do something with the data before fetching the next batch (a rough shell sketch follows the docs link below).
check the docs here for detail how to do it:
https://docs.mongodb.com/manual/reference/method/cursor.noCursorTimeout/
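A rough mongo shell sketch of that pattern, with the database and collection names as placeholders and the refresh interval chosen arbitrarily (this follows the idea in the docs, not verbatim):

// Keep both the cursor and its session alive during slow batch processing (MongoDB >= 3.6).
var session = db.getMongo().startSession();
var sessionId = session.getSessionId().id;
var cursor = session.getDatabase("mydb").getCollection("mycoll").find({}).noCursorTimeout();

var lastRefresh = new Date();
while (cursor.hasNext()) {
  var doc = cursor.next();
  // ... slow processing of doc ...
  if (new Date() - lastRefresh > 5 * 60 * 1000) {        // roughly every 5 minutes
    db.adminCommand({ "refreshSessions": [sessionId] });  // extend the session by another 30 minutes
    lastRefresh = new Date();
  }
}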
Part of my Node.js app includes reading a file and, after some lightweight, row-by-row processing, inserting these records into the database.
The original code did just that. The problem is that the file may contain a crazy number of records, which are inserted row by row. According to some tests I did, a file of 10,000 rows blocks the app completely for some 10 seconds.
My considerations were:
Bulk create the whole object at once. This means reading the file, preparing the object by doing some calculation for each row, pushing it to the final object and in the end using Sequelize's bulkCreate. There were two downsides:
A huge insert can be as blocking as thousands of single-row inserts.
It may be hard to generate reports for rows that were not inserted.
Bulk create in smaller, reasonable chunks. This means reading the file, iterating over every n (e.g. 2000) rows while doing the calculations and adding them to an object, then using Sequelize's bulkCreate for that object. Object preparation and the bulkCreate would run asynchronously; there's a rough sketch of what I mean at the end of this question. The downsides:
Setting the chunk size seems arbitrary.
It also feels like an artifice on my side, while there might be existing and proven solutions for this particular situation.
Moving this part of the code to another process, ideally limiting CPU usage of that process to reasonable levels (I don't know if this can be done or if it is smart).
Simply creating a new process for this (and other blocking parts of the code).
This is not the 'help me write some code' type of question. I have already looked around, and there seems to be enough documentation. But I would like to invest in an efficient solution, using the proper tools. Other ideas are welcome.
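For clarity, here is roughly what I have in mind for the chunked approach; the chunk size, the model and the mapRowToRecord helper are just placeholders:

const CHUNK = 2000;

async function importRows(rows, Model) {
  for (let i = 0; i < rows.length; i += CHUNK) {
    const records = rows.slice(i, i + CHUNK).map(mapRowToRecord); // per-row calculation
    await Model.bulkCreate(records);                              // one insert per chunk instead of per row
    await new Promise(resolve => setImmediate(resolve));          // yield to the event loop between chunks
  }
}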
I'm running MongoDB on an instance with 512 MB of RAM (along with some other web apps), so every megabyte counts.
MongoDb documentation states that out: { inline: 1 }
Perform the map-reduce operation in memory and return the result.
which suggests that the other out types don't operate in memory. Would it be more memory efficient to return mapReduce results into another collection, given that in the end I would still need to read that collection's data to return it to the client?
Considering that inline can only really be useful when map-reduce is called from an application, I should state that map-reduce is not designed to be run inline in your application, and you should try to convert to the aggregation framework or something else if you can.
The inline out is limited to 16 MB (a single BSON document). Given that, you may find that it is actually slower to write out to a collection and then read that collection again rather than just do it all in memory.
Writing to another collection and then later reading from it does not require holding the whole dataset in memory (where by "dataset" I mean the set of aggregated data). It can be safely written to and read from disk.
When you use out: inline, the result is stored in memory in its entirety. If there's not enough memory, some swapping/paging out will occur.
Anyway, to get good performance numbers you should have enough memory to hold all hot data. Disk is slow.
In light of all that, inline or not, it probably makes little difference.
I would say: yes, it is definitely more memory efficient to have mapReduce store the results in a collection.
If you use inline, the results of the MR operation are returned as an array, which means all results are held in memory before being handed to the calling code. If you store the results in a collection instead, you can use cursors or streams to avoid reading everything into memory before returning it to the client. A rough sketch is below.
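A hedged sketch of that approach with the Node.js driver; the map/reduce functions and the names 'events', 'userId' and 'event_counts' are made up for illustration:

// Write the map-reduce output to a collection, then stream it back with a cursor
// instead of materializing everything in memory at once.
var map = function () { emit(this.userId, 1); };
var reduce = function (key, values) { return Array.sum(values); };

db.collection('events').mapReduce(map, reduce, { out: { replace: 'event_counts' } }, function (err, resultCollection) {
  if (err) throw err;
  // Only one batch at a time is held in memory while iterating the cursor.
  resultCollection.find().forEach(
    function (doc) { console.log(doc._id, doc.value); },
    function (err) { if (err) throw err; db.close(); }
  );
});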
I ran across a mention somewhere that doing an emit(key, doc) will increase the amount of time an index takes to build (or something to that effect).
Is there any merit to it, and is there any reason not to just always do emit(key, null) and then include_docs = true?
Yes, it will increase the size of your index, because CouchDB effectively copies the entire document into the index in that case. Where you can, use include_docs=true.
There is, however, a race condition to be aware of when using this that is mentioned in the wiki. It is possible, during the time between reading the view data and fetching the document, that said document has changed (or has been deleted, in which case _deleted will be true). This is documented here under "Querying Options".
This is a classic time/space tradeoff.
Emitting document data into your index will increase the size of the index file on disk, because CouchDB includes the emitted data directly in the index file. However, this means that when querying your data, CouchDB can stream the content directly from the index file on disk, which is obviously quite fast.
Relying instead on include_docs=true will decrease the size of your on-disk index, it's true. However, when querying, CouchDB must perform a document read for every returned row. This involves essentially random document lookups in the main data file, meaning that the cost and time of returning data increase significantly.
While the per-query time difference is small for small numbers of documents, it adds up over every call made by the application. For me, therefore, emitting the needed fields from a document into the index is usually the right call: disk is cheap, users' attention spans less so. This is broadly similar to using covering indexes in a relational database, another widely echoed piece of advice. A sketch of the two approaches follows.
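To make the two options concrete, here are two illustrative CouchDB map functions (the field names are made up):

// Option A: emit the needed fields into the index (bigger index, faster reads).
function (doc) {
  emit(doc.type, { name: doc.name, created_at: doc.created_at });
}

// Option B: emit only the key and rely on ?include_docs=true at query time
// (smaller index, but every returned row costs an extra document lookup).
function (doc) {
  emit(doc.type, null);
}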
I did a totally unscientific test to get a feel for the difference. I found about an 8x increase in response time and a 50% increase in CPU when using include_docs=true to read 100,000 documents from a view, compared to a view where the documents were emitted directly into the index itself.