Any method to get the size of an ArangoDB collection?

I'm looking for a way to measure the total size of a collection.
Is there an equivalent method in ArangoDB to MongoDB's db.collection.totalSize() or db.collection.stats()?

pyArango.collection.Collection.figures() solves the problem.
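The same collection figures are also available from the ArangoDB shell (arangosh). A minimal sketch, assuming a collection named mycollection exists (the exact fields returned vary with the ArangoDB version and storage engine):

// arangosh: ask the server for the collection's storage statistics
var figures = db.mycollection.figures();
print(figures);
// e.g. figures.documentsSize or figures.datafiles.fileSize, depending on the storage engine

If all you need is the number of documents, db.mycollection.count() returns that directly.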

Related

How slow is mongodb's skip() in mongoose

I've seen a lot of questions about this, and all the answers seem to imply that the .skip() method is expensive and that I should use ranges instead. What I'm trying to ask is: when does it get slow? I have a database of 11,000 records; will pagination with a limit of 25 be slow?
Also, is there an alternative or better way of doing pagination in Mongoose? Is the mongoose-paginate module fast? How does it work (skip or something else)?
Thanks :)

Mongoose populate option and query time

I am working on a platform where I use Mongoose's .populate() a number of times in all my queries. I turned on Mongoose's debug mode and found that there is hardly any difference in query execution time with and without populate (for 100 documents now; there will be 100,000 documents in the future).
I know that populate is basically doing a findOne query internally. My question is: will using .populate() increase my query time, or will it affect my performance once the number of records reaches millions? Also, is there any alternative I can choose to increase performance?
In general, you're correct: you want to avoid using populate(), since it will issue another query for each row. Keep in mind that this is a full round trip to the server. Mongo doesn't have any concept of a join, so when you use populate you're issuing an additional query for each row in your returned set.
The technique to work around this is to denormalize your data: don't design a Mongo database like a relational one. The Mongo docs have lots of information on how to do this: https://docs.mongodb.org/manual/core/data-model-design/ One important thing to keep in mind with Mongo design is that you never want a subdocument with unbounded growth. Due to the way Mongo's space allocation and paging work, this can cause severe performance problems, so if you're in that situation it's best to normalize.
Another very common technique is subdocument caching. This is where you take partial data from the "joined" collection and cache it on the collection you're querying. In this case, you're trading space for performance because you have duplicate data. Also, you'll have to make sure you keep the data updated whenever there's a change. With mongoose, it is easy to do this as a post-save hook on the model of the foreign collection.
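A minimal sketch of such a post-save caching hook in Mongoose (the Author/Post models and the authorName field are illustrative assumptions, not from the question; assumes Mongoose 5+ with promise-based middleware):

const mongoose = require('mongoose');

// The collection you query: it carries a cached copy of the "joined" data.
const postSchema = new mongoose.Schema({
  title: String,
  author: { type: mongoose.Schema.Types.ObjectId, ref: 'Author' },
  authorName: String   // duplicated from Author so reads can skip populate()
});
const Post = mongoose.model('Post', postSchema);

// The foreign collection: its post-save hook keeps the cached copies in sync.
const authorSchema = new mongoose.Schema({ name: String });
authorSchema.post('save', async function (doc) {
  await Post.updateMany({ author: doc._id }, { $set: { authorName: doc.name } });
});
const Author = mongoose.model('Author', authorSchema); // compile after registering the hook

The trade-off is exactly as described above: duplicate data and an extra write whenever the foreign document changes, in exchange for single-query reads.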

Heterogeneous Data Storage in CouchDB

I would like to know the best practices for storing heterogeneous data in CouchDB. In MongoDB you have collections, which help with modelling the data (i.e. typical usage is one document type per collection). What is the best way to handle this kind of requirement in CouchDB? Tagging documents with a _type field? Or is there some other method that I am not aware of?
The main benefit of Mongo's collections is that indexes are defined and calculated per collection. In CouchDB you have even more freedom and flexibility here: each index is defined by a view, written in map/reduce style, and you limit the data that goes into the index by filtering it in the map function. Because of this flexibility, it is up to you to decide which documents belong to which view.
If you really like the fixed, Mongo-like style of dividing documents into a set of distinct partitions with separate indexes, just create a collection field and never mix two different collections in a single view. In my opinion, though, rejecting one of the only benefits of Couch over Mongo (where Mongo is in general the more powerful and flexible system) does not seem like a good idea.
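For example, a view that emulates a Mongo-style collection simply filters on that field in its map function (the type field and the "article" value are assumptions for illustration):

// CouchDB map function: only documents tagged as articles end up in this view's index
function (doc) {
  if (doc.type === "article") {
    emit(doc._id, null);
  }
}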

What is the best practice for MongoDB to handle 1-n and n-n relationships?

In a relational database, 1-n and n-n relationships require two or more tables.
But in MongoDB, it is possible to store those things directly in one model, like this:
Article {
  content: String,
  uid: String,
  comments: [Comment]
}
I am getting confused about how to manage those relations. For example, in the article-comments model, should I store all the comments directly in the article model and then read the entire article object out as JSON every time? But what if the comments grow really large? If there are 1,000 comments in an article object, will such a strategy make every GET request very slow?
I am by no means an expert on this; however, I've worked through similar situations before.
From the few demos I've seen, yes, you should store all the comments directly inline. This will give you the best performance (unless you're expecting a ridiculous number of comments), and this way you have everything in your document.
In the future, if things take off and you notice loading getting slower, you could do a few things. You could store only the latest (insert arbitrary number) comments along with a reference to where the older comments are kept, then map-reduce the old comments out into a "bucket" to keep loading times quick.
Initially, however, I'd store it all in one document.
So you would have a model that looks something like this:
Article {
  content: String,
  uid: String,
  comments: [
    {"comment": "hi", "user": "jack"},
    {"comment": "hi", "user": "jack"}
  ],
  "oldCommentsIdentifier": 12345
}
Then only populate oldCommentsIdentifier if you did move comments out of the embedded array. However, I really wouldn't do this for fewer than 1,000 comments, and maybe even more. It would take a bit of testing to see where the "sweet spot" is.
I think a large part of the answer depends on how many comments you are expecting. Having a document that contains an array that could grow to an arbitrarily large size is a bad idea, for a couple reasons. First, the $push operator tends to be slow because it often increases the size of the document, forcing it to be moved. Second, there is a maximum BSON size of 16MB, so eventually you will not be able to grow the array any more.
If you expect each article to have a large number of comments, you could create a separate "comments" collection, where each document has an "article_id" field that contains the _id of the article that it is tied to (or the uid, or some other field unique to the article). This would make retrieving all comments for a specific article easy, by querying the "comments" collection for any documents whose "article_id" field matches the article's _id. Indexing this field would make the query very fast.
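A sketch of that layout in the mongo shell (articleId is a placeholder for an existing article's _id; assumes a reasonably recent shell):

// Index the linking field so per-article lookups stay fast
db.comments.createIndex({ article_id: 1 });

// Each comment is its own document, tied to the article by article_id
db.comments.insertOne({ article_id: articleId, user: "jack", comment: "hi" });

// Fetch all comments for one article; add skip/limit or a range on _id for paging
db.comments.find({ article_id: articleId });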
The link that limelights posted as a comment on your question is also a great reference for general tips about schema design.
But if I solve this problem by linking articles and comments with an _id, won't that go back to the relational database design, and somehow lose the essence of being NoSQL?
Not really; NoSQL isn't all about embedding models. In fact, embedding should be considered carefully for your scenario.
It is true that the aggregation framework solves quite a few of the problems you can get from embedding objects that you need to use as documents themselves. I define subdocuments that need to be used as documents as:
Documents that need to be paged in the interface
Documents that might exist across multiple root documents
Documents that require advanced sorting within their group
Documents that, as a group, will exceed the root document's 16 MB limit
As I said, the aggregation framework does solve this a little; however, you're still looking at performing a query that, in real time or close to it, would be much like performing the same query in SQL over the same number of documents.
This effect is not always desirable.
You can achieve paging (of a sort) of subdocuments with normal querying using the $slice operator, but then this carries pretty much the same problems as using skip() and limit() over large result sets, which again is undesirable, since you cannot fix it as easily with a range query (the aggregation framework would be required again). Even with 1,000 subdocuments I have seen speed problems, and not just me but other people too.
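For reference, a sketch of that kind of subdocument paging with $slice in the mongo shell (articleId is a placeholder; [20, 10] skips the first 20 embedded comments and returns the next 10):

// Project only a window of the embedded comments array
db.articles.find(
  { _id: articleId },
  { comments: { $slice: [20, 10] } }
);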
So let's get back to the original question: how to manage the schema.
Now the answer, which you're not going to like, is: it all depends.
Do your comments satisfy the need to be separate? If so, then that is probably a good bet.
There is no single best way to do this. In MongoDB you should design your collections according to the application that is going to use them.
If your application needs to display comments together with the article, then it is better to embed the comments in the article collection. Otherwise, you will end up with several round trips to your database.
There is one scenario where embedding does not work: as far as I know, document size is limited to 16 MB in MongoDB. That is quite large, actually, but if you think your document size can exceed this limit it is better to have a separate collection.

How to have more than 100 results?

My question is simple, but I can't find the answer. Is there a way to configure Lucene to retrieve more than 100 results for a query?
I'm using Lucene 2.4.0 at the moment.
Thanks all.
Functionality for controlling the result size and iterating over larger result sets has been vastly improved in later versions of Lucene. If you have the option, consider upgrading to 2.9 or even 3.0.
That being said, I cannot tell from your post how exactly you retrieve your search results. Perhaps you use a Hits object? In that case, you should consider using a TopDocCollector instead. The TopDocCollector(int) constructor allows you to specify the maximum number of hits you would like returned from your search.
In 2.4.0 you can use the Searcher.search(Query query, int n) method to retrieve the desired number of results. The method returns a TopDocs object.
