SaaS, central database, database per user, or combination? - couchdb

The problem at hand is as follows:
SaaS to keep maintenance records
95% of the data would be specific to each user, i.e. it does not need to be accessed by other users
5% of the data would be shared (and contributed to by all users), such as parts used in maintenance
The SaaS is to be delivered as a CouchApp, i.e. with a public-facing CouchDB
So I am torn between a database per user and a single database for all users.
A database per user seems to offer much easier backup and maintenance, smaller data sets, and easier access control. On the negative side, how would I handle the shared data?
Is it possible to have a database per user plus one common database for shared information (parts)? I could then replicate parts documents from all user databases to the central one, and from there back to all user databases. How would I handle conflicts in that case (or, even better, avoid them if possible)?
Or is there a much simpler approach? Or should I bite the bullet and go with just one central database?

It depends on the nature of the shared data, I guess. It seems natural to have filtered replication flowing from the user databases to the shared database and unfiltered replication flowing from the shared database back to the user databases; I think that covers your requirements. It means each user only has to read from and write to their own database, while you can still distribute the shared docs to everyone.
It may be easier to query the shared database directly instead of replicating it back into the user databases, but that really depends on what kind of data would be in there.
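As a rough sketch of that two-way setup: a filter in the user database selects only the shared docs, and two _replicator documents wire up a filtered, continuous replication into a central parts database plus an unfiltered one back. The database names (userdb-alice, shared-parts), the admin URL, and the doc.type === 'part' convention are assumptions for the example, not something from the question.

```typescript
// Sketch: route shared "parts" docs from a user database into a central
// shared database (filtered), and replicate the whole shared database back
// (unfiltered). Names, URL, and the doc.type convention are assumptions.

const COUCH = "http://admin:secret@localhost:5984";

async function put(path: string, body?: unknown): Promise<void> {
  const res = await fetch(`${COUCH}/${path}`, {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: body === undefined ? undefined : JSON.stringify(body),
  });
  // 409/412 mean the doc or database already exists; fine for a sketch.
  if (!res.ok && res.status !== 409 && res.status !== 412) {
    throw new Error(`PUT ${path} failed: ${res.status}`);
  }
}

async function setUpSharedPartsSync(userDb: string): Promise<void> {
  await put("shared-parts");                 // central database for parts
  await put(userDb);                         // this user's own database

  // The filter lives in the *source* (user) database and selects shared docs.
  await put(`${userDb}/_design/sync`, {
    filters: { parts: "function (doc, req) { return doc.type === 'part'; }" },
  });

  // user db -> shared db: only docs passing the filter, continuously.
  await put(`_replicator/${userDb}-to-shared-parts`, {
    source: `${COUCH}/${userDb}`,
    target: `${COUCH}/shared-parts`,
    filter: "sync/parts",
    continuous: true,
  });

  // shared db -> user db: everything, so every user sees all parts.
  await put(`_replicator/shared-parts-to-${userDb}`, {
    source: `${COUCH}/shared-parts`,
    target: `${COUCH}/${userDb}`,
    continuous: true,
  });
}

setUpSharedPartsSync("userdb-alice").catch(console.error);
```

Note that conflicts on shared parts docs can still occur once several users edit the same part; filtered replication only routes the documents, so you would still need a convention for resolving the winning revision in the shared database.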

Related

The _replicator database is not scalable or my design needs tweaking

I think it is important that I elaborate on where I am coming from so that you can understand my use case; please bear with me.
Background: I'm looking to migrate my app from CouchDB 1 to 2, and this migration is going to take a decent amount of work. I just want to double-check that I'm not reinventing the wheel and make sure that there isn't a better design than the one I elaborate on below, especially since CouchDB 2 appears to have some awesome new features.
Consider the following simplified use case for an app that allows students to submit quiz answers digitally. Each student should be able to submit her/his quiz answers, and the teacher should be able to view all the answers. This design needs to work with PouchDB, as PouchDB speaks directly to the DB; this saves us a lot of time, since otherwise an elaborate set of APIs would need to be written.
My chosen design consists of one database per student and one database per teacher, i.e. a database per user. Only the owner of a database can edit it, and this is enforced via CouchDB roles. When a student submits an answer, it is synced with her/his database via PouchDB. The answers are then replicated to the teacher's database. This in turn allows the students to quickly load their answers in the app and the teachers to load all the answers for all their students. Of course, there are views in the teacher databases that segment the answers by class, quiz, etc., so that the teacher doesn't have to load the answers for all their students at once. If we didn't have the teacher database, then a teacher would need access to all the students' databases and would have to sync with every one of them.
At first glance, the _replicator database appears to be the obvious way to replicate the data from the student databases to a single teacher database. The big gotcha is that each continuous replication consumes a file handle and a database connection, which means you can very quickly starve the database server of resources. For example, if we have, say, 10,000 students in our system, then we need 10,000 concurrent file handles and database connections just for the replications. This is pretty crazy considering that it is unlikely that even, say, 100 of these 10,000 students would be using the app simultaneously.
Instead, I developed a service that listens to the _db_updates feed and then only replicates a database when there is a change to that specific database. With this method, we only worry about consuming resources when there are changes and as a result we end up with plenty of free file handles and database connections.
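A minimal sketch of that kind of listener, assuming CouchDB 2.x: it long-polls _db_updates and, whenever a student database reports a change, kicks off a one-shot POST /_replicate into the corresponding teacher database. The naming scheme, the student-to-teacher mapping, and the credentials are made up for illustration, and this is not the actual Spiegel implementation.

```typescript
// Sketch of a _db_updates-based listener, assuming CouchDB 2.x: react to
// database changes with one-shot replications instead of keeping thousands
// of continuous replications alive.

const COUCH = "http://admin:secret@localhost:5984";

// Hypothetical mapping from a student database to its teacher database.
function teacherDbFor(dbName: string): string | null {
  return dbName.startsWith("student_") ? "teacher_7" : null;
}

async function replicateOnce(source: string, target: string): Promise<void> {
  await fetch(`${COUCH}/_replicate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      source: `${COUCH}/${source}`,
      target: `${COUCH}/${target}`,
      // no "continuous": the replication runs once and releases its resources
    }),
  });
}

async function listen(): Promise<void> {
  let since: string | undefined;
  for (;;) {
    const params = new URLSearchParams({ feed: "longpoll", timeout: "60000" });
    if (since) params.set("since", since);
    const res = await fetch(`${COUCH}/_db_updates?${params}`);
    const { results, last_seq } = await res.json();
    since = last_seq;
    for (const update of results ?? []) {
      const teacher = teacherDbFor(update.db_name);
      if (teacher && (update.type === "updated" || update.type === "created")) {
        await replicateOnce(update.db_name, teacher);
      }
    }
  }
}

listen().catch(console.error);
```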
I’ve briefly experimented with CouchDB 2 and it appears that the _replicator database is just as greedy with resources as it was in CouchDB 1.
Is this database-per-user design for both students and teachers the best solution or is there a better solution? If it is the best solution, is there a better way of replicating this data that doesn’t consume as many resources?
I've open sourced my solution, called Spiegel, which provides the missing piece: scalable CouchDB replication and change listening. Spiegel is currently being used in production with a db-per-user design and is efficiently handling the replication of over 10,000 databases for Quizster.

Per document user access control for PouchDB / CouchDB

I wish to use PouchDB - CouchDB for saving user data for my web application, but I cannot find a way to control access on a per-user basis. My DB would simply consist of documents using the user id as the key. I know there are some solutions:
One database per user - however, this requires monitoring for when a new user wants to save data so that a new DB can be created, and it may result in a lot of DBs;
A proxy between the client and CouchDB - however, I don't want PouchDB to sync changes for the whole DB, including other users' documents, in those _all_docs and _revs_diff requests.
Are there any suggestions for user access control in PouchDB for a user base of around 1 million (with only around 10 thousand active users)?
The topic of a million or more databases has come up on the mailing list in the past. The conclusion was that it depends on how your operating system deals with that many files. CouchDB is just accessing parts of the .couch file when requested. Performance is related to how quickly it can find, open, access, and close that file.
There are tricks for some file systems, like putting / delimiters in the database name, which causes CouchDB to store the files in matching directory structures such as groupA/userA.couch, or using email-style database names like com/bigbluehat/byoung.couch (or something similar).
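For example (a sketch; the group/user names and credentials are placeholders), the / has to be URL-encoded as %2F whenever the database is created or accessed over HTTP:

```typescript
// Sketch of the "/ in the database name" trick: CouchDB allows "/" in database
// names, but it must be sent URL-encoded as %2F over HTTP. On CouchDB 1.x the
// database then lives in a matching directory, e.g. groupa/usera.couch.
// Names and credentials are placeholders.

const COUCH = "http://admin:secret@localhost:5984";

async function createUserDb(group: string, user: string): Promise<void> {
  const name = encodeURIComponent(`${group}/${user}`); // "groupa%2Fusera"
  const res = await fetch(`${COUCH}/${name}`, { method: "PUT" });
  if (!res.ok && res.status !== 412) {                 // 412 = already exists
    throw new Error(`create ${group}/${user} failed: ${res.status}`);
  }
}

createUserDb("groupa", "usera").catch(console.error);
```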
If that's not sufficient, Apache CouchDB 2.0 brings in the BigCouch code (which IBM Cloudant uses) to provide a fully auto-sharded CouchDB. It's not done yet, but it will provide scalability across multiple nodes using an Amazon Dynamo-style sharding system.
Another option is to do your own username-based partitioning between multiple CouchDB servers or use IBM Cloudant (which is built for this level of scale).
All these options provide the same Apache CouchDB replication protocol and will work just fine with PouchDB sitting on the user's computer, phone, or tablet.
The user's device would then have their own database, plus or minus any shared databases. The apps on those million user devices would only have the scalability of their own content (i.e. hard drive space) to be concerned about. The app would replicate directly to the "cloud"-side user database for backup, web use, etc.
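On the device side, that setup is just a local PouchDB live-syncing with the user's cloud-side database; a sketch, with the database naming scheme and the remote URL being assumptions:

```typescript
// Sketch of the device-side setup: a local PouchDB per user that live-syncs
// with that user's "cloud"-side CouchDB database. The naming scheme and the
// remote URL are assumptions.
import PouchDB from "pouchdb";

const userId = "byoung";                                  // hypothetical user
const local = new PouchDB(`userdata-${userId}`);          // on-device database
const remote = `https://couch.example.com/userdb-${userId}`;

// Two-way, continuous sync; "retry" keeps it alive across connectivity drops.
const sync = local.sync(remote, { live: true, retry: true })
  .on("error", (err) => console.error("sync error", err));

// Later, e.g. on logout:
// sync.cancel();
```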
Hopefully something in there sounds promising. :)

1000+ users, 30+ private data collections

I'm working on a management/planning application that will have 1,000+ users, each with 30+ data collections.
For instance, each user might have a collection of, say, client contacts with as few as 10 and as many as several hundred items/records.
Would ArangoDB be a suitable choice for this application?
Is there a better choice?
Many thanks,
LRP
I assume that all the user databases should be kept separate, so user 1 should not see any data of user 2 etc.
If so, there is the option to create a separate database for each user or, as you mentioned, 30+ databases for each user. That would result in 30,000+ databases. I don't think this would be an ideal use of ArangoDB, as each database incurs some overhead, and you will want to keep the total number of databases in an ArangoDB instance relatively small; you certainly wouldn't want to create 30,000+ databases in it.
The alternative is to create not that many databases but that many collections, perhaps all in the same database. While this provides good separation of user data, from a resource-usage point of view it would also be rather expensive (as each collection may need a separate storage file once it contains data). I think it could work if not all users/collections need to be active at the same time and the server has plenty of resources (or you split the data across multiple servers).
The solution that would use the least resources in ArangoDB would be to put the data of multiple users into just a few collections. For each record you could store a user id and have your application use that user id in every query.
This would ensure the application only accesses records of one specific user at a time. Additionally, as this would use just a few collections, there would be no need to create empty or mostly empty databases/collections for users with little data. From a resource-usage point of view, this should be relatively efficient.
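A sketch of that last model using the arangojs driver (v7-style configuration assumed); the database and collection names and the userId field are placeholders:

```typescript
// Sketch of the "few collections, user id on every record" model using the
// arangojs driver. Database/collection names and the userId field are
// placeholders; authentication is omitted.
import { Database, aql } from "arangojs";

const db = new Database({ url: "http://localhost:8529", databaseName: "planner" });

// Every document in "contacts" carries a userId; every query filters on it,
// so one user's queries never touch another user's records.
async function contactsFor(userId: string) {
  const cursor = await db.query(aql`
    FOR c IN contacts
      FILTER c.userId == ${userId}
      RETURN c
  `);
  return cursor.all();
}

contactsFor("user-42").then((docs) => console.log(docs.length));

// A persistent index on userId keeps these per-user lookups cheap, e.g.:
//   db.collection("contacts").ensureIndex({ type: "persistent", fields: ["userId"] });
```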

Cloudant number of database limitation

I'm planning on having my database stored in Cloudant.
Our application is multi-tenant. We currently separate tenants based on a value in some of our tables, which would naturally translate to a value in a document. Another way is to have a database per tenant. We currently have around 100 tenants and hope to grow to 500-2000 in our best projections.
What are the pros and cons of keeping all tenants in one DB vs. a DB per tenant?
Is there a limitation on the number of databases we can create and work with concurrently?
This is a good and involved question. There are pros and cons to both models. The main advantage of one large database is that you can analyze (search, MapReduce, etc.) across all users very easily. The main advantage of one DB per user is that every user has their own data "sandbox", which may be nice for your SLA. Additionally, it means that the amount of data in each user database can be relatively small.
If you can provide more details about the data you are storing, the relational modeling, and the queries you hope to be able to do, I can probably give you a more satisfying answer.
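For the single-database model the question describes (a tenant value stored in each document), a sketch of what the per-tenant separation could look like; the account URL, database name, design doc, and tenant_id field are assumptions:

```typescript
// Sketch of the "all tenants in one database" model: every document carries a
// tenant_id, and a view keyed by [tenant_id, type] serves per-tenant queries
// while still allowing cross-tenant analysis. Account, database, and field
// names are placeholders; authentication is omitted.

const CLOUDANT = "https://ACCOUNT.cloudant.com";

async function createTenantView(): Promise<void> {
  const res = await fetch(`${CLOUDANT}/app-data/_design/tenants`, {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      views: {
        by_tenant: {
          map: `function (doc) {
            if (doc.tenant_id) emit([doc.tenant_id, doc.type], null);
          }`,
        },
      },
    }),
  });
  if (!res.ok) throw new Error(`design doc failed: ${res.status}`);
}

// One tenant's documents (URL-encoding of the key parameters omitted):
//   GET /app-data/_design/tenants/_view/by_tenant
//       ?startkey=["tenant-100"]&endkey=["tenant-100",{}]&include_docs=true
createTenantView().catch(console.error);
```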

Restricting resource access in CouchDB to exactly 2 users

Currently I'm in the process of evaluating CouchDB for a new project.
A key constraint for this project is strong privacy. There need to be resources that are readable by exactly two users.
One use case may be something similar to Direct Messages (DMs) on Twitter. Another use case would be a User / SuperUser access level.
I currently don't have any idea how to solve this kind of problem with CouchDB other than creating one database that is accessible only by these two users. I also wonder how I would then build views aggregating data from several databases.
Do you have any hints / suggestions for me?
I've asked this question several times on the CouchDB mailing lists and never got an answer.
There are a number of things that CouchDB is missing.
One of them is document-level security, which would:
allow only certain users to view a doc
filter the documents indexed in a view based on user-level permissions
I don't think there is a solution to these permission considerations with the current CouchDB implementation.
One solution would be to use an external indexing tool like Lucene, tag your documents with user rights, and then issue a Lucene query that includes the user-right definition in order to get the docs. This also implies extra load on your server(s) (Lucene requires a JVM) and an extra delay before the data becomes available (Lucene indexing time...).
As for the several-databases solution, there are language framework implementations that simply don't allow using more than one database (for instance couch_potato for Ruby).
Having several databases also means that you'll have several replication processes if your databases are replicated.
Also, it means that the views will be updated in each of the databases. In some cases this is better than having huge views indexed in a single database, but it also means that distinct users might not be up to date with respect to a single source of information (i.e. some will have their views updated, others won't). So you cannot guarantee that the data is consistent for all users.
So unless something is implemented in the CouchDB core to manage document-level authorizations, CouchDB does not seem appropriate for managing data with privacy constraints.
There are a bunch of details missing about what you are trying to accomplish and what the data looks like, so it's hard to make a specific recommendation. You may be able to create a database per user and copy items into each user's database (for the DM use case you described). Each user would only be able to access their own database, and you could then have an admin user that can access all databases. If you need to update those records later, copying them to multiple databases might not be a good idea, and you might then consider whether you want to control permissions at a different level from storage.
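For the DM-style case where a resource should be readable by exactly two users, the question's own idea of one small database per pair can be expressed with CouchDB's _security object, which limits members to exactly those two names (server admins can always see everything). A sketch, with the naming scheme and admin role as assumptions:

```typescript
// Sketch: one database per pair of DM participants, restricted to exactly
// those two member names via the _security object. Database naming and the
// "superuser" admin role are assumptions.

const COUCH = "http://admin:secret@localhost:5984";

async function createDmDb(userA: string, userB: string): Promise<void> {
  const [a, b] = [userA, userB].sort();            // stable name for the pair
  const name = `dm-${a}-${b}`;

  let res = await fetch(`${COUCH}/${name}`, { method: "PUT" });
  if (!res.ok && res.status !== 412) throw new Error(`create: ${res.status}`);

  // Only these two users (plus server admins) can read or write the database.
  res = await fetch(`${COUCH}/${name}/_security`, {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      admins: { names: [], roles: ["superuser"] },   // hypothetical admin role
      members: { names: [a, b], roles: [] },
    }),
  });
  if (!res.ok) throw new Error(`_security: ${res.status}`);
}

createDmDb("alice", "bob").catch(console.error);
```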
For views that aggregate data from several databases, I recommend looking at lounge and bigcouch, which take different approaches.
http://tilgovi.github.com/couchdb-lounge/
http://support.cloudant.com/faqs/views/chained-mapreduce-views
