I have been thinking about how to make my Node.js app faster, so I tried querying for only some fields versus fetching the entire document, because the MongoDB documentation says it is faster to query for only certain fields. The problem is that my results seem to show the opposite, so where am I going wrong? Here is the code I am using; it writes the timings to a CSV so I can build a chart in LibreOffice:
http://pastebin.com/G8KRRY3n
The first option (A) is to get the entire document.
The second option (B) is to get only some fields.
Here is the graph I took from it (every operation in milliseconds):
http://prntscr.com/5oofoz
I process almost 9,500 users. As you can see, for the first (0~200) items processed the two are the same, but then the second option starts to grow in time... I tried switching the order of the options in case the garbage collector had something to do with it, but the results are almost the same.
Yes, the first option is faster for the first elements. So the question is: in a high-traffic web app, which option is recommended, and why? I am a newbie in the performance field, so I am pretty sure I'm doing something wrong...
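For context, this is roughly what the two options compare with the native Node.js driver (the collection and field names are placeholders, since the actual script lives in the pastebin above):

// Option A: fetch the entire document
const fullDoc = await db.collection('users').findOne({ _id: userId });

// Option B: fetch only a few fields via a projection
// (older driver versions take { fields: { name: 1, email: 1 } } instead)
const partialDoc = await db.collection('users').findOne(
    { _id: userId },
    { projection: { name: 1, email: 1 } }
);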
I want to delete more than 1 million users' information in Kentico 10.
I tried deleting them with UserInfoProvider.DeleteUser() (see the following documentation), but a rough calculation suggests it would take nearly a year.
https://docs.kentico.com/api10/configuration/users#Users-Deletingauser
Since that is only a rough calculation, the real figure is probably somewhat shorter, but it would still take far too long.
Is there any other way to delete users in a short time?
Of course make sure you have a backup of your database before you do any of this.
Depending on the features you're using, you could get away with a SQL statement. Because a user is referenced from many other tables, though, the SQL can get pretty complex, and you need to make sure you remove those other references before removing the actual user record.
I'd highly recommend an API approach and deleting users through the API so it removes all the references for you automatically. In your API calls, make sure you wrap the delete action in the following so it skips event logging and other labor-intensive activities that aren't needed:
using (var context = new CMSActionContext())
{
    // Disable event logging, search tasks and other expensive side effects
    context.DisableAll();

    // delete your user, for example:
    UserInfoProvider.DeleteUser(user);
}
In your code, I'd select only the top 100 or so at a time and delete them in batches. Assuming you don't need this done all in one run, you could let a scheduled task run your custom code for a week and see where you're at.
If all else fails, figure out how to delete the user and the 70+ foreign key references and you'll be golden.
Why don't you delete them with a SQL query? I believe it will be much faster.
Bulk delete functionality exists starting from version 10.
UserInfoProvider has a BulkDelete method. In fact, any info provider inherited from AbstractInfoProvider has a BulkDelete method.
I'm developing a web service that is being load tested. So that I don't waste anyone's time, I won't go into detail about the REST API because it's too complicated to explain.
I'm developing with Node and using MongoDB's native Node driver. I use the following query:
let mediaIdQuery = {_id: {$in: media}};
let result = await db.collection(COLLECTION_MEDIA).findOne(mediaIdQuery);
'media' is an array of strings. I'm essentially searching the database to see if I have a document with an _id that is in the array.
I add these documents elsewhere. In testing, it seems to work fine when the creation of the documents and search query for them are far apart in time. However, if I send a request that creates a document and a search request within ~100 ms of each other it seems that this query fails to find the document.
This is actually an attempt at optimization on my part. In the past, I had a for loop that went through each individual array element and sent a separate search query each time. This should be able to do it all in one query, so I figured it would be faster. When I used the for-loop method, everything worked properly.
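Roughly, these are the two approaches being compared (the surrounding handling code is omitted):

// Old approach: one findOne per id, i.e. one round trip per array element
for (const id of media) {
    const doc = await db.collection(COLLECTION_MEDIA).findOne({ _id: id });
    // ...handle doc
}

// New approach: a single findOne over the whole array
let result = await db.collection(COLLECTION_MEDIA).findOne({ _id: { $in: media } });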
I'm just really confused about why the database can't find the documents when the requests are close in time. I'm 100% sure the insert queries occur before the search request, because inserting the documents takes pretty much 0 ms and the insert queries are sent first. Is there some sort of delay in MongoDB before the documents become visible? I've never experienced issues like that before, so I'm hesitant to blame it on a delay in the system. Any ideas?
Edit:
My database is not sharded. Everything is stored on 1 instance.
Part of my Node.js app involves reading a file and, after some lightweight, row-by-row processing, inserting those records into the database.
The original code did just that. The problem is that the file may contain a huge number of records, which are inserted row by row. According to some tests I did, a file of 10,000 rows completely blocks the app for about 10 seconds.
My considerations were:
Bulk create the whole object at once. This means reading the file, preparing the object by doing some calculation for each row and pushing the result onto the final object, and at the end using Sequelize's bulkCreate. There were two downsides:
A huge insert can be as blocking as thousands of single-row inserts.
This may make it hard to generate reports for rows that were not inserted.
Bulk create in smaller, reasonable chunks. This means reading the file, iterating over every n rows (e.g. 2,000) while doing the calculations and adding them to an object, then using Sequelize's bulkCreate for that object; object preparation and the bulkCreate would run asynchronously (a sketch of this approach follows the list below). The downsides:
Choosing the chunk size seems arbitrary.
It also feels like a workaround on my part, when there might already be existing, proven solutions for this particular situation.
Moving this part of the code to another process, ideally limiting CPU usage to reasonable levels for that process (I don't know if that can be done or if it is smart).
Simply creating a new process for this (and other blocking parts of the code).
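A minimal sketch of the chunked bulkCreate idea from option 2, assuming a Record Sequelize model, a newline-delimited input file and a transformRow helper for the per-row calculation (all three names are placeholders):

const fs = require('fs');
const readline = require('readline');

async function importFile(path, chunkSize = 2000) {
    const rl = readline.createInterface({ input: fs.createReadStream(path) });
    let chunk = [];

    for await (const line of rl) {
        chunk.push(transformRow(line)); // per-row calculation goes here

        if (chunk.length >= chunkSize) {
            await Record.bulkCreate(chunk); // insert this chunk before building the next
            chunk = [];
        }
    }

    if (chunk.length > 0) {
        await Record.bulkCreate(chunk); // flush the remainder
    }
}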
This is not a 'help me write some code' type of question. I have already looked around and there seems to be enough documentation. But I would like to invest in an efficient solution, using the proper tools. Other ideas are welcome.
I am trying to request a large number of documents from my database (which has over 400k documents). I started with the _all_docs built-in view. I first tried this query:
http://database:port/databasename/_all_docs?limit=100&include_docs=true
No problem. Completes as expected. Now to ramp it up:
http://database:port/databasename/_all_docs?limit=1000&include_docs=true
Still fine. Took longer, more data, etc. as expected. Ramp it up again:
http://database:port/databasename/_all_docs?limit=10000&include_docs=true
The request never completes. The dev tools in Chrome show Size = 5.3 MB (which seems significant), and this happens no matter what value over roughly 6,500 I use for the limit parameter. Whether I specify 6,500 or 10,000, it always shows 5.3 MB downloaded and the request stalls.
I have also tried other combinations, such as skip, and it seems that limit + skip must be < 6,500 or I get the same stall.
My environment: CouchDB 1.6.1, Ubuntu 14.04.3 LTS, Azure A1 Standard.
You have to pre-warm your queries; just throwing a request for 100K or more docs at CouchDB and expecting to get them all out at once won't work.
When you ask for some items from a view (in your case the default view), CouchDB will notice on the first read that the B-tree for the view doesn't exist yet, so it goes ahead and builds it. Depending on how many documents you have in your database, that can take a while and puts a good workload on your database.
On every subsequent read, CouchDB will check which documents have changed since the view was last updated and run the changed documents through the map and reduce functions. So if you only query a view from time to time, but have lots of changes in between, expect some delay on the next read.
There are two ways to handle this situation:
1. Pre-warm your view: run a cron job that reads from the view regularly so its B-tree is already built when you need it.
2. Prepare your view in advance for a particular query before inserting the data into CouchDB.
And for now, if you really want to read all your docs, don't read them all at once; instead, page through them with skip and limit range queries.
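For example, a minimal paging sketch in Node (using the built-in fetch available from Node 18, with the base URL and page size as placeholders):

const BASE = 'http://database:port/databasename';
const PAGE_SIZE = 1000;

async function fetchAllDocs() {
    const docs = [];
    let skip = 0;

    while (true) {
        const res = await fetch(`${BASE}/_all_docs?include_docs=true&limit=${PAGE_SIZE}&skip=${skip}`);
        const { rows } = await res.json();
        if (rows.length === 0) break; // no more documents

        docs.push(...rows.map(r => r.doc));
        skip += rows.length; // advance to the next page
    }

    return docs;
}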
I have a collection in a Mongo database that I append logging-type information to. I'm trying to figure out the most efficient/simplest way to "tail -f" that in a Meteor app: as a new document is added to the collection, it should be sent to the client, which should append it to the end of the current set of documents in the collection.
The client isn't going to be sent, nor keep, all of the documents in the collection; likely just the last ~100 or so.
Now, from a Mongo perspective, I don't see a way of saying "the last N documents in the collection" such that we wouldn't need to apply any sort at all. It seems like the best option available is a descending natural sort followed by a limit call, so something like what's listed in the Mongo docs on $natural:
db.collection.find().sort( { $natural: -1 } )
So, on the server side AFAICT the way of publishing this 'last 100 documents' Meteor collection would be something like:
Meteor.publish('logmessages', function () {
    return LogMessages.find({}, { sort: { $natural: -1 }, limit: 100 });
});
Now, from a 'tail -f' perspective, this seems to have the right effect of sending the 'last 100 documents' to the client, but it does so in the wrong order (the newest document would be at the start of the Meteor collection instead of at the end).
On the client side, this seems to mean needing to (unfortunately) reverse the collection. Now, I don't see a reverse() in the Meteor Collection docs and sorting by $natural: 1 doesn't work on the client (which seems reasonable, since there's no real Mongo context). In some cases, the messages will have timestamps within the documents and the client could sort by that to get the 'natural order' back, but that seems kind of hacky.
In any case, it feels like I'm likely missing a much simpler way to have a live 'last 100 documents inserted into the collection' collection published from Mongo through Meteor. :)
Thanks!
EDIT - looks like if I change the collection in Mongo to a capped collection, then the server could create a tailable cursor to efficiently (and quickly) get notified of new documents added to the collection. However, it's not clear to me if/how to get the server to do so through a Meteor collection.
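Outside of Meteor, the capped-collection idea would look roughly like this with the plain Node driver (collection name and size are placeholders; how to wire this into a Meteor publication is exactly the open question):

// Create a capped collection, then tail it with a tailable, awaitData cursor
await db.createCollection('logmessages', { capped: true, size: 1024 * 1024 });

const cursor = db.collection('logmessages').find({}, { tailable: true, awaitData: true });
for await (const doc of cursor) {
    console.log('new log entry:', doc); // fires as new documents are appended
}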
An alternative that seems a little less efficient but doesn't require switching to a capped collection (AFAICT) is using Smart Collections, which tails the oplog, so at least it's event-driven instead of polling; and since all the operations in the source collection will be inserts, it seems like it'd still be pretty efficient. Unfortunately, AFAICT I'm still left with the sorting issues, since I don't see how to define the server-side collection as 'the last 100 documents inserted'. :(
If there is a way of creating a collection in Mongo as a query of another ("materialized view" of sorts), then maybe I could create a log-last-100 "collection view" in Mongo, and then Meteor would be able to just publish/subscribe the entire pseudo-collection?
For insert-only data, $natural should give you the same results as indexing on a timestamp and sorting, so that's a good idea. The reverse thing is unfortunate; I think you have a couple of choices:
use $natural and do the reverse yourself
add timestamp, still use $natural
add timestamp, index by time, sort
#1 - For 100 items, doing the reverse client-side should be no problem even for mobile devices, and it offloads the work from the server. You can use .fetch() to convert the cursor to an array and then reverse it to maintain order without needing timestamps. You'll be playing in normal array-land, though; no more nice mini-mongo features, so do any filtering before reversing.
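A rough client-side sketch of #1 (the template name is just an example):

Template.logViewer.helpers({
    logMessages: function () {
        // Fetch the ~100 published documents and reverse them locally so the
        // newest message ends up last, i.e. in 'tail -f' order.
        return LogMessages.find({}, { limit: 100 }).fetch().reverse();
    }
});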
#2 - This one is interesting because you don't have to use an index, but you can still use the timestamp on the client to sort the records. This gives you the benefit of staying in mini-mongo-land.
#3 - Costs space in the db, but it's the most straightforward.
If you don't need the capabilities of mini-mongo (or are comfortable doing array filtering yourself) then #1 is probably best.
Unfortunately, MongoDB doesn't have views, so you can't do your log-last-100 view idea (although that would be a nice feature).
Beyond the above, keep an eye on your subscription life-cycle so users don't continually pull down log updates in the background when not viewing the log. I could see that quickly becoming a performance killer.
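For instance, a template-level subscription (names are illustrative) is stopped automatically when the log view is torn down, so backgrounded clients stop pulling updates:

Template.logViewer.onCreated(function () {
    // Tied to the template's lifecycle: the subscription stops when the template is destroyed
    this.subscribe('logmessages');
});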