What are best practices for partitioning data in MongoDB? - node.js

I'm building a social site using the MEAN stack and I need some suggestions regarding MongoDB and Mongoose.
I'm part of a startup and we decided to use these amazing technologies to get the job done.
Basically, I need some suggestions.
So far I have finished a simple CRUD implementation and set up local authentication with Passport.js. I currently have a single collection in my MongoDB database called users.
Our social site will have a blog, a marketplace, and many other pages (features) that will all be related to a single user.
Since I have never worked with MongoDB before, I'm wondering whether I should keep everything in one collection per user or have a separate collection for each feature.
To clarify, let's say I use a User model for user registration, a Blog model for blogs, and so on.
It would mean a lot to me if you could briefly explain how to structure my Mongoose models: should all data live in one collection, or should one user's data be spread across separate collections for the different features? And if you recommend multiple collections, how do I link them together and make sure that all the data stays associated with one user?
Thanks a lot in advance!

I will explain partitioning/dividing at two levels.
First, you are of course going to create different collections for different models, such as Users, Blogs, Messages, etc.
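Here is a minimal Mongoose sketch of that layout (the field names are illustrative, not from the question): one model per feature, with each feature document holding an ObjectId reference back to its user, which is also how the collections get linked together.

```js
const mongoose = require('mongoose');

const userSchema = new mongoose.Schema({
  username: String,
  email: String
});

const blogSchema = new mongoose.Schema({
  title: String,
  body: String,
  // Each blog post references its author in the users collection.
  author: { type: mongoose.Schema.Types.ObjectId, ref: 'User' }
});

const User = mongoose.model('User', userSchema);
const Blog = mongoose.model('Blog', blogSchema);

// populate() follows the reference and loads the linked user document:
// Blog.find({ author: someUserId }).populate('author');
```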
Now comes the second part: if we are talking about millions of documents, how do you partition them for faster lookups?
For example, say you have 1M users in one big Users collection. If you look for a user whose first name is 'Imdad' and whose age is 28, the query has to look through all 1M documents in that single Users collection, which will take a good amount of time.
To solve this problem, the users collection can be divided into multiple collections through horizontal partitioning (Users1 for ages 10-20, Users2 for ages 20-30, Users3 for ages 30-40). Based on your query predicate, MongoDB then looks up only the relevant collection(s). This is the same idea that SQL databases apply with partitioning; in MongoDB it is called sharding. You don't have to explicitly direct your query to the right chunk; MongoDB itself takes care of that.
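For illustration, here is a minimal mongo-shell sketch of letting MongoDB do this partitioning for you via sharding, assuming a database named mydb (the name is made up). Queries that include the shard key are routed only to the relevant chunks:

```js
sh.enableSharding("mydb")                     // enable sharding on the database
db.users.createIndex({ age: 1 })              // the shard key must be indexed
sh.shardCollection("mydb.users", { age: 1 })  // range-shard users on age

// Routed only to the chunk(s) covering age 28:
db.users.find({ firstName: "Imdad", age: 28 })
```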
Further reading: shard key generation, Mongoose shard keys.

If you are using MongoDB as the backend for a REST interface, the best practice is to create one collection per resource. For example, if you intend to have a /api/users endpoint, you should have a users collection, and it should contain anything and everything you intend to return from that endpoint.
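As a sketch of what that looks like in practice (an assumed Express + Mongoose setup, not code from the answer):

```js
const express = require('express');
const mongoose = require('mongoose');

// The users collection backs the /api/users resource one-to-one.
const User = mongoose.model('User', new mongoose.Schema({
  username: String,
  email: String
}));

const app = express();

app.get('/api/users', async (req, res) => {
  // Everything this endpoint returns lives in the users collection.
  const users = await User.find().lean();
  res.json(users);
});
```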
If you are using Node to render server-side templates, the structure can be more flexible. The above still applies (as you will probably want to expose a REST service eventually), but there is more room to maneuver. In fact, if a many-to-many style relationship is appropriate, it is easier to keep these collections separate and load them together on the same page.
As an aside, you mention having users and a marketplace. A bigger issue than how data is separated into collections is the use of transactions. Any time you intend to perform a transactional exchange of data, it should happen within a SQL transaction; MongoDB (prior to version 4.0) had no notion of multi-document transactions. This is by design, as MongoDB is built to be a fast, scalable data store. It is not unreasonable to combine SQL and NoSQL data stores in a case like this.
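For example, here is a hedged sketch of that mixed approach: the marketplace's money movement runs in a SQL transaction (using the 'pg' client here; the table and column names are invented), while profiles, blogs, and the rest stay in MongoDB:

```js
const { Pool } = require('pg');
const pool = new Pool();

async function purchase(buyerId, sellerId, amount) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    // Both balance updates commit together or not at all.
    await client.query(
      'UPDATE accounts SET balance = balance - $1 WHERE user_id = $2',
      [amount, buyerId]
    );
    await client.query(
      'UPDATE accounts SET balance = balance + $1 WHERE user_id = $2',
      [amount, sellerId]
    );
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}
```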

Related

Would it be a good idea to store Mongoose documents in AsyncLocalStorage?

I have a web server built specifically for an app. Some of the services get very complex, calling lots of functions across multiple services, each of which queries multiple models. Would it be a good idea to save the queried documents of each request in AsyncLocalStorage to reduce the time spent querying?
That is, I would check whether the document is present in ALS; if yes, use it, otherwise fetch it and save it there before using it.
I tried to find references on this but didn't find anything, which led me to think that maybe it is not such a good idea after all. But then, querying the same document again and again across different services doesn't make sense either.
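For reference, a minimal sketch of the scheme being described (an illustration of the idea, not tested production code): a per-request cache kept in AsyncLocalStorage, consulted before hitting the database:

```js
const { AsyncLocalStorage } = require('async_hooks');

const requestContext = new AsyncLocalStorage();

// Middleware: give every request its own cache Map.
function contextMiddleware(req, res, next) {
  requestContext.run({ cache: new Map() }, next);
}

// Check ALS first; fall back to the database and memoize the result.
// `Model` can be any Mongoose model.
async function findByIdCached(Model, id) {
  const store = requestContext.getStore();
  const key = `${Model.modelName}:${id}`;
  if (store && store.cache.has(key)) {
    return store.cache.get(key); // served from the request-local cache
  }
  const doc = await Model.findById(id);
  if (store) store.cache.set(key, doc);
  return doc;
}
```

The usual caveat with this pattern is staleness within a request: if one service updates a document, a cached copy held elsewhere no longer reflects the database.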

How to design a Firestore data model for categories?

I want to create a collection of categories where every category can have as many children as it wants.
For example, if the user wants to add an additional category under a child, they can. How do I design that? What would the queries be to add a category to the last child?
Firestore is a NoSQL database which, by design, does not have a fixed schema, so you do not need to design one up front. It works just the way you described your needs: every document can have a different structure.
I suggest reading the documentation about the data model.
Looking at your description, you might also be interested in the Realtime Database, also available in Firebase (but not in the Google Cloud Console). It is a NoSQL database as well, but it is essentially one large, easy-to-manage JSON tree. There is a nice article about choosing a database in the Firebase documentation (and you can find lots of articles on the web as well).
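One common way to get arbitrarily deep nesting (a sketch of one possible model, not the only one) is a flat categories collection where each document points at its parent; adding a category under the last child is then just another write with that child's id as parentId:

```js
const admin = require('firebase-admin');
admin.initializeApp();
const db = admin.firestore();

// Add a category under an existing parent (parentId === null for roots).
async function addCategory(name, parentId = null) {
  const ref = await db.collection('categories').add({ name, parentId });
  return ref.id;
}

// Fetch the direct children of a category.
async function getChildren(parentId) {
  const snap = await db
    .collection('categories')
    .where('parentId', '==', parentId)
    .get();
  return snap.docs.map(d => ({ id: d.id, ...d.data() }));
}
```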

The _replicator database is not scalable or my design needs tweaking

I think it is important that I elaborate on where I am coming from so that you can understand my use case, so please bear with me.
Background: I'm looking to migrate my app from CouchDB 1 to 2 and this migration is going to take a decent amount of work. I just want to double-check that I'm not reinventing the wheel and make sure that there isn't a better design than what I elaborate on below, especially since CouchDB 2 appears to have some awesome new features.
Consider the following simplified use case for an app that allows students to submit quiz answers digitally. Each student should be able to submit her/his quiz answers and the teacher should be able to view all the answers. This design needs to work with PouchDB, as PouchDB speaks directly to the DB; this saves us a lot of time, since otherwise an elaborate set of APIs would need to be written.
My chosen design consists of one database per student and one database per teacher, i.e. a database per user. Only the owner of a database can edit it, and this is enforced via CouchDB roles. When a student submits an answer, it is synced to her/his database via PouchDB. The answers are then replicated to the teacher's database. This allows the students to quickly load their answers in the app and the teachers to load all the answers for all their students. Of course, there are views in the teacher databases that segment the answers by class, quiz, etc… so that the teacher doesn't have to load the answers for all their students at once. If we didn't have the teacher database, a teacher would need access to all of their students' databases and would have to sync with every one of them.
At first glance, the _replicator database appears to be the obvious way to replicate the data from the student databases to a single teacher database. The big gotcha is that continuous replication consumes a file handle and a database connection, which means you can very quickly starve a database server of its resources. For example, if we have, say, 10,000 students, then we need 10,000 concurrent file handles and database connections just for the replications. This is pretty crazy considering that it is unlikely that even 100 of those 10,000 students would be using the app simultaneously.
Instead, I developed a service that listens to the _db_updates feed and then only replicates a database when there is a change to that specific database. With this method, we only worry about consuming resources when there are changes and as a result we end up with plenty of free file handles and database connections.
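Below is a simplified sketch of that kind of service (not the author's actual implementation): it tails the _db_updates continuous feed and posts a one-off replication only when a student database changes. It assumes Node 18+ with a global fetch, a local CouchDB at COUCH, and a student_ naming convention, all of which are invented for the example:

```js
const COUCH = 'http://localhost:5984';

async function watchAndReplicate() {
  const res = await fetch(`${COUCH}/_db_updates?feed=continuous&since=now`);
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    let newline;
    while ((newline = buffer.indexOf('\n')) >= 0) {
      const line = buffer.slice(0, newline).trim();
      buffer = buffer.slice(newline + 1);
      if (!line) continue;
      const update = JSON.parse(line); // e.g. { db_name, type, seq }
      if (update.type === 'updated' && update.db_name.startsWith('student_')) {
        // One-off (non-continuous) replication for just this database,
        // so resources are only consumed while there is work to do.
        await fetch(`${COUCH}/_replicate`, {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify({ source: update.db_name, target: 'teacher_answers' })
        });
      }
    }
  }
}
```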
I’ve briefly experimented with CouchDB 2 and it appears that the _replicator database is just as greedy with resources as it was in CouchDB 1.
Is this database-per-user design for both students and teachers the best solution or is there a better solution? If it is the best solution, is there a better way of replicating this data that doesn’t consume as many resources?
I've open-sourced my solution, called Spiegel, which provides the missing piece: scalable CouchDB replication and change listening. Spiegel is currently being used in production with a db-per-user design and is efficiently handling the replication of over 10,000 databases for Quizster.

Cloudant number of database limitation

I'm planning on having my database stored in Cloudant.
Our application is multi-tenant. We currently separate tenants based on a value in some of our tables, which naturally translates to a value in a document. Another way would be to have a database per tenant. We currently have around 100 tenants and hope to grow to 500-2000 in our best projections.
What are the pros and cons of having all tenants in one DB vs. a DB per tenant?
Is there a limitation on the number of databases we can create and work with concurrently?
This is a good and involved question. There are pros and cons to both models. The main advantage to one large database is that you can analyze (search, mapreduce, etc) across all users very easily. The main advantage of one-db-per-user is that every user has their own data "sandbox", which may be nice for your SLA. Additionally, that means that the amount of data in each user database can be relatively small.
If you can provide more details about the data you are storing, the relational modeling, and the queries you hope to be able to do, I can probably give you a more satisfying answer.
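To make the first advantage concrete, here is a hedged sketch of the single-database model in CouchDB/Cloudant terms (the tenant_id field and view names are invented): every document carries its tenant, and one map/reduce view supports both per-tenant and cross-tenant analysis:

```js
const designDoc = {
  _id: '_design/tenants',
  views: {
    docs_by_tenant: {
      // Key every document by its tenant so one view serves all tenants.
      map: `function (doc) {
        if (doc.tenant_id) {
          emit(doc.tenant_id, null);
        }
      }`,
      reduce: '_count' // built-in reducer: document count per tenant
    }
  }
};

// One tenant:   GET /mydb/_design/tenants/_view/docs_by_tenant?key="acme"
// All tenants:  GET /mydb/_design/tenants/_view/docs_by_tenant?group=true
```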

Restricting resource access in CouchDB to exactly 2 users

Currently I'm in the process of evaluating CouchDB for a new project.
Key constraint for this project is strong privacy. There need to be resources that are readable by exactly two users.
One use case may be something similar to Direct Messages (DMs) on Twitter. Another use case would be User / SuperUser access levels.
I currently don't have any ideas about how to solve this kind of problem with CouchDB, other than creating one database that is accessible only by these two users. I also wonder how I would then build views aggregating data from several databases.
Do you have any hints / suggestions for me?
I've asked this question several times on the CouchDB mailing lists and never got an answer.
There are a number of things that CouchDB is missing.
One of them is document-level security, which would:
allow only certain users to view a doc
filter the documents indexed in a view based on per-user permissions
I don't think there is a solution to these permission considerations with the current CouchDB implementation.
One workaround would be to use an external indexing tool like Lucene: tag your documents with user rights, then issue a Lucene query that includes the user-right definition in order to get the docs. This also implies extra load on your server(s) (Lucene requires a JVM) and an extra delay before the data becomes available (Lucene indexing time ...).
As for the several-databases solution, there are language framework implementations that simply don't allow using more than one database (for instance, couch_potato for Ruby).
Having several databases also means that you'll have several replication processes if your databases are replicated.
It also means that the views will be updated separately for each database. In some cases this is better than indexing huge views in a single database, but it also means that distinct users might not be equally up to date for a single source of information (i.e. some will have their views updated, others won't), so you cannot guarantee that the data is consistent across all users.
So unless something is implemented in the Couch core to manage document-level authorization, CouchDB does not seem appropriate for managing data with privacy constraints.
There are a bunch of details missing about what you are trying to accomplish and what the data looks like, so it's hard to make a specific recommendation. You may be able to create a database per user and copy items into each user's database (for the DM use case you described). Each user would only be able to access their own database, and you could then have an admin user that can access all databases. If you need to update those records later, copying them into multiple databases might not be a good idea, and then you might consider controlling permissions at a different level from storage.
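A minimal sketch of that copy-per-user approach (the database naming scheme and fields are invented for illustration): write each DM into both participants' databases, and lock every user database down to its owner via CouchDB's _security object:

```js
const COUCH = 'http://admin:password@localhost:5984';

async function put(path, body) {
  const res = await fetch(`${COUCH}/${path}`, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body)
  });
  return res.json();
}

// Copy the message into both users' private databases.
async function sendDm(fromUser, toUser, text) {
  const msg = { type: 'dm', from: fromUser, to: toUser, text, sent_at: Date.now() };
  const id = `dm:${Date.now()}`;
  await put(`userdb_${fromUser}/${id}`, msg);
  await put(`userdb_${toUser}/${id}`, msg);
}

// Restrict a user database so only its owner (and server admins) can read it.
async function lockDown(user) {
  await put(`userdb_${user}/_security`, {
    admins: { names: [], roles: [] },
    members: { names: [user], roles: [] }
  });
}
```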
For views that aggregate data from several databases, I recommend looking at lounge and bigcouch, which take different approaches.
http://tilgovi.github.com/couchdb-lounge/
http://support.cloudant.com/faqs/views/chained-mapreduce-views
