CouchDB "virtual" database, that combines 2 databases into 1 - couchdb

Is there a feature in CouchDB to see 2 (or more) databases as 1?
For example, when querying all documents in this "virtual" database, it would show all documents from both "real" databases.
For the case where documents with the same _id exist in different databases, two logical resolutions are possible:
take the document from the 1st database (the database order is specified)
take the document with the higher revision number
Either resolution would be OK. I just need it to be predictable.

There is no such feature, so you'll have to code it yourself.
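If you do roll your own, here is a minimal sketch of the first resolution rule (first database wins on _id collisions); the database names, the local CouchDB URL, and the use of the Python requests library are all assumptions, not anything CouchDB provides:
import requests

COUCH = "http://localhost:5984"      # assumed CouchDB endpoint
DB_ORDER = ["db_a", "db_b"]          # hypothetical databases; the first one wins on _id conflicts

def fetch_all_docs(db):
    # _all_docs with include_docs=true returns every document in the database
    resp = requests.get(f"{COUCH}/{db}/_all_docs", params={"include_docs": "true"})
    resp.raise_for_status()
    return (row["doc"] for row in resp.json()["rows"])

def virtual_all_docs():
    merged = {}
    # iterate in reverse priority so earlier databases overwrite later ones
    for db in reversed(DB_ORDER):
        for doc in fetch_all_docs(db):
            merged[doc["_id"]] = doc
    return list(merged.values())

for doc in virtual_all_docs():
    print(doc["_id"])
The "higher revision number" rule would work the same way, except the merge step would compare the integer prefix of each document's _rev before deciding which copy to keep.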

Related

Is it good to use different collections in a database in MongoDB

I am going to do a project using Node.js and MongoDB. We are designing the database schema and are not sure whether we should use different collections or the same collection to store the data, because each has its own pros and cons.
If we use a single collection, whenever the database is invoked the whole collection will be loaded into memory, which eats into RAM capacity. If we use different collections, then to retrieve data we need to write different queries. With one collection retrieval will be easy, and with different collections the application will become faster. We are confused about whether to use a single collection or multiple collections. Please guide me on which one is better.
Usually you use different collections for different things. For example, when you have users and articles in the system, you usually create a "users" collection for users and an "articles" collection for articles. You could create one collection called "objects" or something like that and put everything there, but that would mean adding some type field and using it for searches and storage. You can use a single collection in the database, but it makes usage more complicated. It would of course let you load the entire collection at once, but whether or not that is relevant to the performance of your application is something that would have to be profiled and tested to know the performance impact for your particular use case.
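To make the trade-off concrete, here is a hedged sketch of both options using pymongo; the database and collection names are made up for illustration:
from pymongo import MongoClient

db = MongoClient()["myapp"]   # assumed local MongoDB and database name

# Option 1: separate collections per kind of thing (the usual approach)
db.users.insert_one({"name": "alice"})
db.articles.insert_one({"title": "Hello", "author": "alice"})
articles = list(db.articles.find({"author": "alice"}))

# Option 2: one catch-all collection with an explicit type field
db.objects.insert_one({"type": "user", "name": "alice"})
db.objects.insert_one({"type": "article", "title": "Hello", "author": "alice"})
# every query and index now has to account for the type field as well
articles = list(db.objects.find({"type": "article", "author": "alice"}))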
Usually, developers create different collections for different things. For post management, for example, people create a 'posts' collection and save posts there, and likewise for users and so on.
Using a different collection for each purpose is good practice.
MongoDB is great at scaling horizontally. It can shard a collection across a dynamic cluster to produce a fast, queryable collection of your data.
So having a smaller collection size is not really a pro, and I am not sure where the theory that it is comes from; it isn't true in SQL and it isn't in MongoDB. The performance of sharding, if done well, should be comparable to the performance of querying a single small collection of data (with a small overhead). If it isn't, then you have set up your sharding wrong.
MongoDB is not so great at scaling vertically; as @Sushant quoted, the ns size of MongoDB would be a serious limitation here. One thing that quote does not mention is that index size and count also affect the ns size, which is why it states:
By default MongoDB has a limit of approximately 24,000 namespaces per database. Each namespace is 628 bytes, the .ns file is 16MB by default.
Each collection counts as a namespace, as does each index. Thus if every collection had one index, we can create up to 12,000 collections. The --nssize parameter allows you to increase this limit (see below).
Be aware that there is a certain minimum overhead per collection -- a few KB. Further, any index will require at least 8KB of data space as the b-tree page size is 8KB. Certain operations can get slow if there are a lot of collections and the meta data gets paged out.
So you won't be able to handle it gracefully if you exceed the namespace limit. It also won't perform well as your userbase grows.
UPDATE
For MongoDB 3.0 or above using the WiredTiger storage engine, this limit no longer applies.
Yes, personally I think having multiple collections in a DB keeps it nice and clean. The only thing I would worry about is the size of the collections. Collections are used by a lot of developers to cut their DB up into, for example, posts, comments, and users.

How to reduce reserved RUs to reduce cost of DocumentDB

We are using DocumentDB on Azure. We have a single database with 7 collections, each having at most 15 records. It does not require much storage.
Only a few developers are using this DB instance, so traffic is also below average.
Still, this server is using 67,600 RUs per day. There must be some problem with the DocumentDB settings, so I'm looking for direction on how to analyse exactly how these RUs are charged and how to reduce them.
There's no problem with DocumentDB settings. You provisioned 7 collections. By default, via the portal, each collection is assigned 1,000 RU (which you have at your disposal, regardless of whether you use 0 RU or all 1,000 RU). The minimum RU setting for a non-partitioned collection is 400.
EDIT - I misread - if you're at 67,000 RU, then you have likely provisioned several partitioned collections (which start at 10,100 RU). For initial dev/test, with only 15 documents, you've grossly over-allocated capacity.
Since you provisioned seven collections (which are likely partitioned, based on your RU sizing), you have a ~70,000 RU deployment, regardless of what you actually consume; you're essentially reserving capacity.
I have no idea what your app's needs are, or whether you need 7 collections for some specific reason. But... objectively speaking, there is no rule that says you need to separate different document types into different collections. You can easily store heterogeneous data within a single collection. How you query for specific types is really up to you, but it's trivial to add something like a type property to each document.
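As a hedged illustration of that single-collection approach (the account URL, database/container names, and the current azure-cosmos Python SDK are assumptions, not part of the original question):
from azure.cosmos import CosmosClient

client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("appdb").get_container_client("everything")

# heterogeneous documents in one collection, distinguished by a type property
container.upsert_item({"id": "u1", "type": "user", "name": "alice"})
container.upsert_item({"id": "o1", "type": "order", "userId": "u1", "total": 42})

# querying a single document type is just a filter on that property
orders = list(container.query_items(
    query="SELECT * FROM c WHERE c.type = 'order'",
    enable_cross_partition_query=True,
))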
Note, since I now believe you're using partitioned collections: you cannot convert these to non-partitioned collections; you'll need to create new non-partitioned collections and move your data from your partitioned collections. (Given that you have 15 total documents, this should be trivial.)
Note that a single non-partitioned collection may be scaled down to 400 RU. If you then combine your 7 collections into 1 collection, you should be able to reduce your provisioned throughput from ~70,000 RU to 400 RU (at least during dev/test).
EDIT As of February 2017, the minimum RU for partitioned collections dropped to 2,500 (from the original 10,100 minimum). In December 2017, it dropped again, to 1,000.
It's common for people new to DocumentDB to think of a collection as similar to a table in SQL, or to what MongoDB calls a "collection". However, DocumentDB is designed differently. It's best to use a single partitioned collection to store all document types and partition on something like geography, tenant, or user. You can distinguish document types with a type = <MyType> field, or, as I actually prefer, a myType = true approach so I can model inheritance and mixins.
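A hedged sketch of that flag style (the property names are invented): because a document can carry several boolean flags at once, it can participate in more than one "type", which a single type field can't express.
# one document can "be" several types at once (mixin-style modeling)
teacher = {"id": "p1", "isPerson": True, "isEmployee": True, "isTeacher": True,
           "name": "alice", "salary": 50000}
guest_lecturer = {"id": "p2", "isPerson": True, "isTeacher": True, "name": "bob"}

# "all teachers", regardless of which other flags a document carries:
query = "SELECT * FROM c WHERE c.isTeacher = true"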
This means, you'll only need to pay for a single partitioned collection. A single partitioned collection may still end up costing you more than table storage, but if you want DocumentDB's near infinite scalability later on, then I highly recommend you start out the way I'm describing.
One more note about David's suggestion to go with non-partitioned collections: that was the only option when DocumentDB first launched, but it's now recommended to use partitioned collections, and I suspect the non-partitioned option may be phased out at some point. You interact with them slightly differently, and as David pointed out, there is currently no conversion assistance (especially if you use multiple non-partitioned collections), so transitioning later from non-partitioned collections to a partitioned collection is not hard, but it's not as simple as changing your partition type and will cost you development effort. It'll cost a little more to have a single partitioned collection than a single non-partitioned one, but IMHO it's worth it to save transition costs later, and it'll cost you less than having seven non-partitioned collections.

Couchdb database design options

Is it recommended to have a separate database for each document type in couchdb or place all types of documents in a single database?
Is there any limitation on the number of databases that we can create on couchdb?
Are there any drawbacks in creating large number of databases in couchdb?
There is no firm answer. Here are some guidelines:
If two documents must be visible to different sets of users, they must be in different DBs (read/write privs are per-DB, not per-doc).
If two documents must be included in the same view, they must be in the same DB (views are for a single DB only).
If two types of documents will be numerous and never be included in the same view, they might as well be in different DBs (so that accessing a view over one type won't need to process all of the docs of the other type).
It's cheap to drop a database, but expensive to delete all of the documents out of a database (a quick sketch of the difference follows these guidelines). Keep this in mind when designing your data expiration plan.
Nothing hardcoded, but you will eventually start running into resource constraints, depending on the hardware you have available.
Depends on what you mean by "large numbers." Thousands are fine; billions probably not (though with the Cloudant changes coming in v2.0.0 I'd guess that the reasonable cap on DB count probably goes up).
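To illustrate the expiration guideline above, a hedged sketch (the local CouchDB URL, database names, and the requests library are assumptions): dropping a database is one request, while emptying one means fetching every doc and writing deletion stubs, which linger until compaction.
import requests

COUCH = "http://localhost:5984"   # assumed local CouchDB; authentication omitted for brevity

# Cheap: drop the whole database in one request
requests.delete(f"{COUCH}/logs_2011_08")

# Expensive: delete every document individually (leaves tombstones behind until compaction)
rows = requests.get(f"{COUCH}/logs/_all_docs").json()["rows"]
stubs = [{"_id": r["id"], "_rev": r["value"]["rev"], "_deleted": True} for r in rows]
requests.post(f"{COUCH}/logs/_bulk_docs", json={"docs": stubs})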

Can CouchDB handle thousands of separate databases?

Can CouchDB handle thousands of separate databases on the same machine?
Imagine you have a collection of BankTransactions. There are many thousands of records. (EDIT: not actually storing transactions--just think of a very large number of very small, frequently updating records. It's basically a join table from SQL-land.)
Each day you want a summary view of transactions that occurred only at your local bank branch. If all the records are in a single database, regenerating the view will process all of the transactions from all of the branches. This is a much bigger chunk of work, and unnecessary for the user who cares only about his particular subset of documents.
This makes it seem like each bank branch should be partitioned into its own database, in order for the views to be generated in smaller chunks, and independently of each other. But I've never heard of anyone doing this, and it seems like an anti-pattern (e.g. duplicating the same design document across thousands of different databases).
Is there a different way I should be modeling this problem? (Should the partitioning happen between separate machines, not separate databases on the same machine?) If not, can CouchDB handle the thousands of databases it will take to keep the partitions small?
(Thanks!)
[Warning, I'm assuming you're running this in some sort of production environment. Just go with the short answer if this is for a school or pet project.]
The short answer is "yes".
The longer answer is that there are some things you need to watch out for...
You're going to be playing whack-a-mole with a lot of system settings like max file descriptors.
You'll also be playing whack-a-mole with erlang vm settings.
CouchDB has a "max open databases" option. Increase this or you're going to have pending requests piling up.
It's going to be a PITA to aggregate multiple databases to generate reports. You can do it by polling each database's _changes feed, modifying the data, and then throwing it back into a central/aggregating database. The tooling to make this easier is just not there yet in CouchDB's API. Almost, but not quite.
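A hedged sketch of that polling approach (the database names, the aggregate database, and the requests library are assumptions); it copies new changes from each branch database into a central one, re-keying the docs so they can't collide:
import requests

COUCH = "http://localhost:5984"                  # assumed local CouchDB
BRANCH_DBS = ["branch_001", "branch_002"]        # hypothetical per-branch databases
AGGREGATE = "all_branches"                       # hypothetical central database

def pull_changes(db, since=0):
    # poll this database's _changes feed and copy new docs into the aggregate DB
    feed = requests.get(f"{COUCH}/{db}/_changes",
                        params={"since": since, "include_docs": "true"}).json()
    docs = []
    for change in feed["results"]:
        doc = dict(change["doc"])
        doc["_id"] = f"{db}::{doc['_id']}"   # namespace the _id by source database
        doc.pop("_rev", None)                # let the aggregate DB assign its own revisions
        doc["source_db"] = db
        docs.append(doc)
    if docs:
        requests.post(f"{COUCH}/{AGGREGATE}/_bulk_docs", json={"docs": docs})
    return feed["last_seq"]   # persist this and pass it as `since` on the next poll

for db in BRANCH_DBS:
    pull_changes(db)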
However, the biggest problem that you're going to run into if you try to do this is that CouchDB does not horizontally scale [well] by itself. If you add more CouchDB servers, they're all going to have duplicates of the data. Sure, your max open DBs count will scale linearly with each node added, but other things like view build time won't (e.g., they'll all need to do their own view builds).
By contrast, I've seen thousands of open databases on a BigCouch cluster. Anecdotally, that's because of its Dynamo-style clustering: more nodes doing different things in parallel, versus walled-off CouchDB servers replicating to one another.
Cheers.
I know this question is old, but wanted to note that now with more recent versions of CouchDB (3.0+), partitioned databases are supported, which addresses this situation.
So you can have a single database for transactions, and partition them by bank branch. You can then query all transactions as you would before, or query just for those from a specific branch, and only the shards where that branch's data is stored will be accessed.
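For illustration, a hedged sketch against the CouchDB 3.x HTTP API (the database and branch names are made up, and authentication is omitted): the database is created with partitioned=true, each _id is prefixed with its partition key, and a per-partition request only touches the shard holding that partition.
import requests

COUCH = "http://localhost:5984"   # assumed local CouchDB 3.x; admin credentials omitted

# create a partitioned database; the partition key is encoded in each doc's _id
requests.put(f"{COUCH}/transactions", params={"partitioned": "true"})

# _id format is "<partition>:<doc id>", here partitioned by branch
requests.post(f"{COUCH}/transactions",
              json={"_id": "branch_042:txn_0001", "amount": 12.50})

# query a single partition: only the shard holding branch_042 is read
resp = requests.get(f"{COUCH}/transactions/_partition/branch_042/_all_docs",
                    params={"include_docs": "true"})
print(resp.json()["rows"])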
Multiple databases are possible, but for most cases I think the aggregate database will actually give better performance to your branches. Keep in mind that you're only optimizing when a document is updated into the view; each document will only be parsed once per view.
For end-of-day polling in an aggregate database, the first branch will cause 100% of the new docs to be processed, and pay 100% of the delay. All other branches will pay 0%. So most branches benefit. For end-of-day polling in separate databases, all branches pay a portion of the penalty proportional to their volume, so most come out slightly behind.
For frequent view updates throughout the day, active branches prefer the aggregate and low-volume branches prefer separate. If one branch in 10 adds 99% of the documents, most of the update work will be done on other branches' polls, so 9 out of 10 prefer separate DBs.
If this latency matters, and assuming couch has some clock cycles going unused, you could write a 3-line loop/view/sleep shell script that updates some documents before any user is waiting.
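A hedged Python equivalent of that loop/view/sleep idea (the view path and interval are made up); requesting the view periodically folds new documents into the index so no user request pays the full rebuild cost:
import time
import requests

COUCH = "http://localhost:5984"                            # assumed local CouchDB
VIEW = "transactions/_design/reports/_view/by_branch"      # hypothetical view

while True:
    # any read of the view triggers CouchDB to index documents added since the last read
    requests.get(f"{COUCH}/{VIEW}", params={"limit": 1})
    time.sleep(60)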
I would add that having a large number of databases creates issues around compaction and replication. Not only do things like continuous replication need to be triggered on a per-database basis (meaning you will have to write custom logic to loop over all the databases), but they also spawn replication daemons per database. This can quickly become prohibitive.

Solr Search Across Multiple Cores

I have two Solr cores.
Core0 imports data from an Oracle table called items. Each item has a unique id (item_id) and is either a video item or an audio item (item_type). Other fields contain searchable text (description, comments, etc.).
Core1 imports data from two tables (from a different database) called video_item_dates and audio_item_dates, which record occurrence dates of an item in a specific market. The fields are item_id, item_market and dates. A single row would look like ('item_001', 'Europe', '2011/08/15, 2011/08/17, 2011/08/20'). The unique key in these two tables is the combination of item_id and item_market. I have flattened the data into a single index for Core1.
My problem now is searching both cores to produce a single result. A typical query would be something like 'What are the items that have the word Hurricane in the description field and ran in the North American market during the month of August 2011?'. I could split this into two queries, run each against a different core, and then merge the results. But given that each query may produce millions of rows, that approach is very inefficient.
I tried Solr Distributed Search. I created a third core (called Core2) with fields from Core0 and Core1. I added a request handler with a shards attribute to the third core, like this:
<requestHandler name="shard" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="shards">localhost/solr/core0/,localhost/solr/core1/</str>
  </lst>
</requestHandler>
If I run a query against this third core, it forwards the query to both Core0 and Core1, and since neither of them has all the fields, one of them reports "undefined field" and the response is a bad request error message.
Any help would be greatly appreciated.
Please note I have no control over the structure of the database tables.
This does not seem to be a case for multiple cores. You should look into designing a single schema that supports the desired search.
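For example, here is a hedged sketch of one flattened document posted to such a single core (the core name, URL, and field names are assumptions): each item/market pair becomes one document carrying both the searchable text and the occurrence dates, so the whole question can be answered with a single query.
import requests

SOLR_UPDATE = "http://localhost:8983/solr/items/update"   # assumed single combined core

doc = {
    "id": "item_001_europe",        # item_id + market as the unique key
    "item_id": "item_001",
    "item_type": "video",
    "description": "Hurricane coverage from the coast",
    "item_market": "Europe",
    "occurrence_dates": ["2011-08-15T00:00:00Z", "2011-08-17T00:00:00Z"],
}
requests.post(SOLR_UPDATE, json=[doc], params={"commit": "true"})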
Sharding is used when a core gets huge and tough to handle as a single entity. The core would be broken into smaller chunks, and you can then search across the multiple cores.
Usually they share the same configuration.
You would need to define the fields in both cores to keep them in sync, so that you don't get the undefined field error.
The fields irrelevant to a core would simply be blank, so they should not affect the results.
Sharding doesn't require you to create a new core; you can work with Core0 and Core1 directly.
More on it at http://wiki.apache.org/solr/DistributedSearch
Also check the limitations of distributed search.
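As a hedged illustration (host, port, and field names are assumptions, and as noted above the fields must be defined in both cores), the shards parameter can also be passed directly on a query against one of the existing cores instead of living in a dedicated handler:
import requests

resp = requests.get("http://localhost:8983/solr/core0/select", params={
    "q": "description:Hurricane",
    "shards": "localhost:8983/solr/core0,localhost:8983/solr/core1",
    "wt": "json",
})
print(resp.json()["response"]["numFound"])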
If the sharding performance is not satisfactory to you, you can create a single core with both datasets, or check the merge option, which combines the cores into a single core.
You can merge the indexes from the different cores into a new index using CoreAdmin:
http://wiki.apache.org/solr/MergingSolrIndexes

Resources