Couchdb database design options - couchdb

Is it recommended to have a separate database for each document type in CouchDB, or to place all types of documents in a single database?
Is there any limit on the number of databases that we can create in CouchDB?
Are there any drawbacks to creating a large number of databases in CouchDB?

There is no firm answer. Here are some guidelines:
If two documents must be visible to different sets of users, they must be in different DBs (read/write privs are per-DB, not per-doc).
If two documents must be included in the same view, they must be in the same DB (views are for a single DB only).
If two types of documents will be numerous and never be included in the same view, they might as well be in different DBs (so that accessing a view over one type won't need to process all of the docs of the other type).
It's cheap to drop a database, but expensive to delete all of the documents out of a database. Keep this in mind when designing your data expiration plan.
On the number of databases: nothing is hardcoded, but you will eventually start running into resource constraints, depending on the hardware you have available.
On the drawbacks: it depends on what you mean by "large numbers." Thousands are fine; billions probably not (though with the Cloudant changes coming in v2.0.0 I'd guess that the reasonable cap on DB count probably goes up).
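To illustrate the expiration guideline above (dropping a database is cheap, deleting its documents one by one is not), here is a minimal sketch that assumes documents are written to one database per month. The naming scheme `<prefix>_YYYY_MM`, the function names and the retention window are all hypothetical choices for the example, not something CouchDB prescribes.

```python
from datetime import date

def bucket_name(prefix, day):
    """Name of the (hypothetical) monthly database for documents created on `day`."""
    return f"{prefix}_{day.year:04d}_{day.month:02d}"

def expired_buckets(existing, prefix, today, keep_months):
    """Return the monthly databases that fall outside the retention window."""
    # Months are compared as a single integer: year * 12 + zero-based month.
    oldest_kept = today.year * 12 + (today.month - 1) - (keep_months - 1)
    expired = []
    for name in existing:
        if not name.startswith(prefix + "_"):
            continue  # not one of our monthly buckets
        parts = name[len(prefix) + 1:].split("_")
        if len(parts) != 2 or not all(p.isdigit() for p in parts):
            continue  # does not match <prefix>_YYYY_MM
        year, month = int(parts[0]), int(parts[1])
        if year * 12 + (month - 1) < oldest_kept:
            expired.append(name)
    return expired

dbs = ["events_2023_12", "events_2024_01", "events_2024_02", "events_2024_03", "users"]
to_drop = expired_buckets(dbs, "events", date(2024, 3, 15), keep_months=2)
```

Each name in `to_drop` can then be removed with a single `DELETE /{db}` request, instead of deleting every expired document individually.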

Related

is it good to use different collections in a database in mongodb

I am starting a project using Node.js and MongoDB. We are designing the database schema and are not sure whether to store the data in separate collections or in a single collection, because each approach has its own pros and cons.
If we use a single collection, then whenever the database is queried the whole collection will be loaded into memory, which reduces the available RAM. If we use different collections, we need to write different queries to retrieve the data. With one collection, retrieval is easy; with different collections, the application is faster. We are confused about whether to use a single collection or multiple collections. Please advise which one is better.
Usually you use different collections for different things. For example, when you have users and articles in the system, you usually create a "users" collection for users and an "articles" collection for articles. You could create one collection called "objects" or something like that and put everything there, but then you would have to add a type field to every document and use it whenever you search or store data. You can use a single collection in the database, but it would make usage more complicated. Of course it would let you load the entire collection at once, but whether that matters for the performance of your application is something you would have to profile and test for your particular use case.
Usually, developers create a different collection for each kind of thing. For post management, for example, people create a 'post' collection and save the posts there, and do the same for users and everything else.
Using different collections for different purposes is good practice.
MongoDB is great at scaling horizontally. It can shard a collection across a dynamic cluster to produce a fast, queryable collection of your data.
So having a smaller collection size is not really a pro, and I am not sure where the theory that it is comes from; it isn't in SQL and it isn't in MongoDB. The performance of sharding, if done well, should be comparable to the performance of querying a single small collection of data (with a small overhead). If it isn't, then you have set up your sharding wrong.
MongoDB is not great at scaling vertically. As @Sushant quoted, the ns size of MongoDB would be a serious limitation here. One thing that quote does not mention is that index size and count also affect the ns size, which is why the documentation says:
By default MongoDB has a limit of approximately 24,000 namespaces per
database. Each namespace is 628 bytes, the .ns file is 16MB by
default.
Each collection counts as a namespace, as does each index. Thus if
every collection had one index, we can create up to 12,000
collections. The --nssize parameter allows you to increase this limit
(see below).
Be aware that there is a certain minimum overhead per collection -- a
few KB. Further, any index will require at least 8KB of data space as
the b-tree page size is 8KB. Certain operations can get slow if there
are a lot of collections and the meta data gets paged out.
So you won't be able to gracefully handle it if your users exceed the namespace limit. Also it won't be high on performance with the growth of your userbase.
UPDATE
For MongoDB 3.0 or above using the WiredTiger storage engine, this namespace limit no longer applies.
Yes personally I think having multiple collections in a DB keeps it nice and clean. The only thing I would worry about is the size of the collections. Collections are used by a lot of developers to cut up their db into, for example, posts, comments, users.
Sorry about my grammar and lack of explanation I'm on my phone

Potential issue with Couchbase paging

It may be too much turkey over the holidays, but I've been thinking about a potential problem that we could have with Couchbase.
Currently we paginate based on time, but I think a similar issue could occur with other values used for paging, for example the atomic counter. I'll try to explain as best I can. This would only occur in a load-balanced environment.
For example, say we have 4 load-balanced servers storing data to our Couchbase cluster. We currently sort our records based on timestamps. If any of the 4 servers writing the data starts to lag behind the others, then our pagination could miss records when they are retrieved client side. In a SQL DB, auto-increment values and timestamps can be created at the moment the record is stored, which avoids this kind of issue. With a NoSQL DB like Couchbase, you define the value you will retrieve on before the record is stored to the DB. So what I am getting at is this: if there is a delay in storing to the DB and you are retrieving in a paginated fashion while this delay occurs, you run the real possibility of missing data. Since we are paging, that data may never be viewed.
Interested in what other thoughts people have on this.
EDIT**
Response to Andrew:
Example: a Facebook or Pinterest type app is storing data to a DB, with many load-balanced frontend servers writing to the DB. If writing is delayed for some reason, it's a non-issue with a SQL DB, because the timestamp or auto-increment value is assigned when the data is actually stored to the DB. There will be no missing data when paging. Asking for 1-7 will give you only data that is already stored in the DB; 7-* will contain anything that was delayed, because an auto-increment value has not been created for that record until it is actually stored.
In Couchbase it's different: you first get your auto-increment value (atomic counter) and then save the record. So, for example, say a record is going to be stored as atomic counter number 4. For some reason it is delayed in storing to the DB. Other servers are grabbing 5, 6, 7 and storing that data just fine. The client now asks for all data between 1 and 7; 4 is still not stored. Then the next paging request is 7 to *. 4 will never be viewed.
Is there a way around this? Can it be modelled differently in CB, or is this just a potential weakness in CB when needing to page results? As I mentioned, our paging is timestamp sensitive.
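The scenario above can be reproduced with a small in-memory simulation. This is only a sketch of the described race, using a plain Python set in place of the cluster; it does not call any Couchbase API, and the function names are made up for the illustration.

```python
def page(stored, low, high):
    """Return the stored counter values in the inclusive range [low, high], in order."""
    return sorted(n for n in stored if low <= n <= high)

def simulate():
    # Counter value 4 has been handed out, but its write is delayed.
    stored = {1, 2, 3, 5, 6, 7}
    seen = []
    seen += page(stored, 1, 7)        # first page: client asks for 1..7
    stored.add(4)                     # the delayed write finally lands
    stored.update({8, 9})             # newer records keep arriving
    seen += page(stored, 8, 10**9)    # second page: everything after 7
    missed = sorted(stored - set(seen))
    return seen, missed

seen, missed = simulate()
```

After both pages have been fetched, record 4 is in the store but was never returned to the client, which is exactly the gap the question describes.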
Michael,
Couchbase is an eventually consistent database with respect to views. It is ACID with respect to documents. There are durability interfaces that let you manage this. This means that you can rest assured you won't lose data and that indexes will catch up eventually.
In my experience with Couchbase, you need to expect that the nodes will never be in sync. There are many things the database is doing, such as compaction and replication. The most important thing you can do to enhance performance is to put your views on a separate spindle from the data. You also need to ensure that the main data spindles across your cluster can sustain 3 to 4 times your ingestion bandwidth. Finally, make sure your main document key hashes appropriately to distribute the load.
It sounds like you are discussing a situation where the data exists in your system for less time than it takes to be processed through the view system. If you are removing data that fast, you need either a bigger cluster or faster disk arrays. Of the two choices, I would expand the size of your cluster. I like to think of Couchbase as building a RAIS, Redundant Array of Independent Servers. By expanding the cluster, you reduce the coincidence of hotspots and gain disk bandwidth. My ideal node has two local drives, one each for data and views, and enough RAM for my working set.
Anon,
Andrew

CouchDB "virtual" database, that combines 2 databases into 1

Is there a feature in CouchDB to see 2 (or more) databases as 1?
For example, when querying all documents in this "virtual" database, it would show all documents from both "real" databases.
For the case where documents with the same _id exist in different databases, 2 logical resolutions are possible:
to take the document from the 1st database (the order of databases is specified)
to take the document with the bigger revision number
Either resolution would be OK. I just need it to be predictable.
There is no such feature, so you'll have to code it yourself.
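As a rough idea of what "coding it yourself" could look like, here is a sketch of the two resolution rules from the question, applied to documents that have already been fetched from each database (for example via `_all_docs` with `include_docs=true`). The function names are hypothetical, and "bigger revision number" is interpreted here as the numeric prefix of `_rev`, which is an assumption.

```python
def rev_number(rev):
    """Numeric prefix of a CouchDB revision string such as '3-a1b2c3'."""
    return int(rev.split("-", 1)[0])

def merge_first_wins(databases):
    """Union of all docs; on a duplicate _id the earliest database in the list wins."""
    merged = {}
    for docs in databases:
        for doc in docs:
            merged.setdefault(doc["_id"], doc)
    return merged

def merge_highest_rev(databases):
    """Union of all docs; on a duplicate _id the higher revision number wins.
    Ties go to the earlier database, so the result stays predictable."""
    merged = {}
    for docs in databases:
        for doc in docs:
            current = merged.get(doc["_id"])
            if current is None or rev_number(doc["_rev"]) > rev_number(current["_rev"]):
                merged[doc["_id"]] = doc
    return merged

db_a = [{"_id": "x", "_rev": "1-aaa", "src": "a"}, {"_id": "y", "_rev": "2-bbb", "src": "a"}]
db_b = [{"_id": "x", "_rev": "3-ccc", "src": "b"}, {"_id": "z", "_rev": "1-ddd", "src": "b"}]
first = merge_first_wins([db_a, db_b])
newest = merge_highest_rev([db_a, db_b])
```

Both functions work on whole result sets in memory, so for large databases the same rules would have to be applied page by page.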

Can CouchDB handle thousands of separate databases?

Can CouchDB handle thousands of separate databases on the same machine?
Imagine you have a collection of BankTransactions. There are many thousands of records. (EDIT: not actually storing transactions--just think of a very large number of very small, frequently updating records. It's basically a join table from SQL-land.)
Each day you want a summary view of transactions that occurred only at your local bank branch. If all the records are in a single database, regenerating the view will process all of the transactions from all of the branches. This is a much bigger chunk of work, and unnecessary for the user who cares only about his particular subset of documents.
This makes it seem like each bank branch should be partitioned into its own database, in order for the views to be generated in smaller chunks, and independently of each other. But I've never heard of anyone doing this, and it seems like an anti-pattern (e.g. duplicating the same design document across thousands of different databases).
Is there a different way I should be modeling this problem? (Should the partitioning happen between separate machines, not separate databases on the same machine?) If not, can CouchDB handle the thousands of databases it will take to keep the partitions small?
(Thanks!)
[Warning, I'm assuming you're running this in some sort of production environment. Just go with the short answer if this is for a school or pet project.]
The short answer is "yes".
The longer answer is that there are some things you need to watch out for...
You're going to be playing whack-a-mole with a lot of system settings like max file descriptors.
You'll also be playing whack-a-mole with Erlang VM settings.
CouchDB has a "max open databases" option. Increase this or you're going to have pending requests piling up.
It's going to be a PITA to aggregate multiple databases to generate reports. You can do it by polling each database's _changes feed, modifying the data, and then throwing it back into a central/aggregating database. The tooling to make this easier is just not there yet in CouchDB's API. Almost, but not quite.
However, the biggest problem that you're going to run into if you try to do this is that CouchDB does not horizontally scale [well] by itself. If you add more CouchDB servers they're all going to have duplicates of the data. Sure, your max open dbs count will scale linearly with each node added, but other things like view build time won't (ex., they'll all need to do their own view builds).
Whereas I've seen thousands of open databases on a BigCouch cluster. Anecdotally that's because of dynamo clustering: more nodes doing different things in parallel, versus walled off CouchDB servers replicating to one another.
Cheers.
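The aggregation step mentioned above (poll each database's `_changes` feed, modify the data, write it to a central database) might be sketched as follows. The function only transforms a response that has already been fetched from `GET /{db}/_changes?include_docs=true`; the `source_db` field and the `<db>:<id>` naming are hypothetical conventions chosen for the example.

```python
def to_aggregate_docs(source_db, changes):
    """Turn one _changes response into documents for a central aggregating database.

    Returns (docs, last_seq); last_seq is the checkpoint to pass as `since`
    on the next poll of this source database.
    """
    docs = []
    for row in changes["results"]:
        if row.get("deleted") or "doc" not in row:
            continue  # this sketch ignores deletions
        doc = dict(row["doc"])
        doc.pop("_rev", None)                      # the source revision is meaningless elsewhere
        doc["_id"] = f"{source_db}:{doc['_id']}"   # avoid _id clashes between branches
        doc["source_db"] = source_db               # lets aggregate views group by origin
        docs.append(doc)
    return docs, changes["last_seq"]

response = {
    "results": [
        {"seq": "1-x", "id": "t1", "changes": [{"rev": "1-a"}],
         "doc": {"_id": "t1", "_rev": "1-a", "amount": 5}},
        {"seq": "2-x", "id": "t2", "changes": [{"rev": "2-b"}], "deleted": True},
    ],
    "last_seq": "2-x",
}
docs, checkpoint = to_aggregate_docs("branch_0042", response)
```

Writing `docs` to the central database is left out on purpose: updating a document that already exists there requires its current `_rev` in the aggregate database, which the caller would have to look up first.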
I know this question is old, but wanted to note that now with more recent versions of CouchDB (3.0+), partitioned databases are supported, which addresses this situation.
So you can have a single database for transactions, and partition them by bank branch. You can then query all transactions as you would before, or query just for those from a specific branch, and only the shards where that branch's data is stored will be accessed.
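As a small illustration of the partitioned layout, the sketch below builds document IDs and request paths for a database partitioned by branch. In a partitioned database the part of `_id` before the first colon is the partition key, the database is created with `?partitioned=true`, and partition-scoped queries go through `/_partition/{partition}/`. The database, design document and view names used here are made up.

```python
from urllib.parse import quote

def partitioned_id(branch, txn_id):
    """Build a document _id of the form '<partition>:<doc id>'."""
    if not branch or ":" in branch:
        raise ValueError("partition key must be non-empty and contain no colon")
    return f"{branch}:{txn_id}"

def create_db_path(db):
    """Path for the PUT request that creates a partitioned database."""
    return f"/{quote(db, safe='')}?partitioned=true"

def branch_view_path(db, branch, ddoc, view):
    """Path for querying a view inside a single partition (one branch)."""
    return (f"/{quote(db, safe='')}/_partition/{quote(branch, safe='')}"
            f"/_design/{quote(ddoc, safe='')}/_view/{quote(view, safe='')}")

doc_id = partitioned_id("branch42", "txn-0001")
path = branch_view_path("transactions", "branch42", "reports", "daily_summary")
```

Only the paths are constructed here; sending the requests is left to whichever HTTP client the application already uses.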
Multiple databases are possible, but for most cases I think the aggregate database will actually give better performance to your branches. Keep in mind that you're only optimizing when a document is updated into the view; each document will only be parsed once per view.
For end-of-day polling in an aggregate database, the first branch will cause 100% of the new docs to be processed, and pay 100% of the delay. All other branches will pay 0%. So most branches benefit. For end-of-day polling in separate databases, all branches pay a portion of the penalty proportional to their volume, so most come out slightly behind.
For frequent view updates throughout the day, active branches prefer the aggregate and low-volume branches prefer separate. If one branch in 10 adds 99% of the documents, most of the update work will be done on other branches' polls, so 9 out of 10 prefer separate dbs.
If this latency matters, and assuming couch has some clock cycles going unused, you could write a 3-line loop/view/sleep shell script that updates some documents before any user is waiting.
I would add that having a large number of databases creates issues around compaction and replication. Not only do things like continuous replication need to be triggered on a per-database basis (meaning you will have to write custom logic to loop over all the databases), but they also spawn replication daemons per database. This can quickly become prohibitive.

Cassandra design pattern for shared record (m:n)

We have two entities, User and Role. One User can have multiple Roles, and a single Role can be shared by many Users: a typical m:n relation.
Roles are also dynamic and we expect a large number of them (millions).
It is quite simple to model such data in a relational DB. I would like to find out whether it is possible in Cassandra.
Currently I see two solutions:
A) Use a normalized model and create something similar to an inner join
Create each single Role in a separate CF and store foreign keys to the referenced Roles in the User record.
pro: Roles are not replicated and maintenance is simple
contra: In order to get all Roles for a single User, multiple network calls are necessary. The User record contains only FKs, and Roles are stored using the random partitioner, so each Role could be stored on a different Cassandra node.
B) Denormalize the model and replicate Roles to avoid round trips
In this scenario the User record in Cassandra contains a copy of all of the user's Roles.
pro: It is possible to read a User with all Roles in a single query. This guarantees short load times.
contra: Each shared Role is copied multiple times, once per related User. Maintaining Roles is very difficult, especially with a large amount of data. For example: one Role is shared by 1000 Users. Changes to this Role require updates to 1000 User records. For very large data sets such updates have to be executed as an asynchronous job.
The solutions above are very limited; maybe Cassandra is not the right solution for m:n relations? Do you know any Cassandra design pattern for this problem?
Thanks,
Maciej
The way you want to design a data store in Cassandra is to start with the queries you plan to execute and make it so you can get all the information you need at once. Denormalization is the name of the game here; if you're not replicating that role information in each user node, you're not going to avoid disk seeks, and your read performance will suffer. Joins do not make sense; if you want a relational database, use a relational database.
At a guess, you're going to ask a lot of questions about what roles a user has and what they should be doing with them, so you definitely want to have role information duplicated in each user entry - probably with each role getting its own column (role-ROLE_KEY => serialized-capability-info instead of roles => [serialized array of capability info]). Your application will need some way to iterate over all those columns itself.
You will probably want to look at what users are in a role, and so you should probably store all the user information you'll need for that view in the role column family as well (though a subset of the full user record will do).
When you run updates, and add/remove users from roles, you will need to make sure that you update both the role's list of users and the user's roles at the same time. Because you're using a column for each relation, instead of a single shared serialized blob, this should work even if you're editing two different roles that share the same user at the same time: Cassandra can merge the updates, including the deletes.
If the query needs to be asynchronous, then go make your application handle it. Remember that Cassandra is an eventual-consistency data store and you shouldn't expect updates to be visible everywhere immediately anyway.
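The column-per-relation layout described above can be sketched with an in-memory stand-in. The two dictionaries below only mimic two column families (row key mapped to named columns); nothing here talks to Cassandra, and the class, method and column names are hypothetical.

```python
from collections import defaultdict

class RoleStore:
    """In-memory stand-in for two denormalized column families."""

    def __init__(self):
        self.users = defaultdict(dict)  # user id -> {"role-<id>": capability info}
        self.roles = defaultdict(dict)  # role id -> {"user-<id>": user summary}

    def grant(self, user_id, user_summary, role_id, capabilities):
        # Both sides of the relation are written together, one column each.
        self.users[user_id][f"role-{role_id}"] = capabilities
        self.roles[role_id][f"user-{user_id}"] = user_summary

    def revoke(self, user_id, role_id):
        self.users[user_id].pop(f"role-{role_id}", None)
        self.roles[role_id].pop(f"user-{user_id}", None)

    def roles_of(self, user_id):
        """Single-row read: every role of a user comes from the user's own row."""
        prefix = "role-"
        return {col[len(prefix):]: value
                for col, value in self.users[user_id].items()
                if col.startswith(prefix)}

    def update_role(self, role_id, capabilities):
        """Fan-out write: refresh the role's copy in every user row; returns rows touched."""
        touched = 0
        for col in self.roles[role_id]:
            user_id = col[len("user-"):]
            self.users[user_id][f"role-{role_id}"] = capabilities
            touched += 1
        return touched

store = RoleStore()
store.grant("alice", {"name": "Alice"}, "admin", "rw")
store.grant("bob", {"name": "Bob"}, "admin", "rw")
store.grant("alice", {"name": "Alice"}, "editor", "w")
touched = store.update_role("admin", "rwx")
```

Reads stay single-row, while `update_role` shows the cost the question worries about: one write per user that holds the role, which is the part that may need to run as an asynchronous job.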
Another option these days is to use playORM, which can do joins for you ;). You just decide how to partition your data. It uses Scalable JQL, which is a simple addition on JQL, as follows:
@NoSqlQuery(name="findJoinOnNullPartition", query="PARTITIONS t('account', :partId) select t FROM Trade as t INNER JOIN t.security as s where s.securityType = :type and t.numShares = :shares")
So we can finally normalize our data on a NoSQL system AND scale at the same time. We don't need to give up normalization, which has certain benefits.
Dean

Resources