CouchDB - One design doc with multiple views vs multiple design docs with split views

I am trying to figure out the tradeoffs between these two.
It seems that using one design doc with multiple views is faster to index, because each document is passed through every view in the design doc in a single pass.
But a tradeoff would be that if I change any one view in the design doc, all of the views in that doc need to be rebuilt.
Does this seem correct? Is there something else someone could add to this understanding?

More detailed information can be found here:
Views are organized into design docs. Theoretically, you can have as many design docs as you want in a database, and as many views as you want in a single design doc. Theoretically, each view can emit arbitrarily many b-tree nodes per document, and your map/reduce code can be arbitrarily complex. But keep in mind:
Having many views degrades performance, because each view must be run on every document change
All views in the same design doc are indexed together; changing, adding, or removing any view requires all of them to be reindexed
Having many emits per document in a view can degrade performance (though it is still slightly more performant than putting each emit in its own view)
Complex map and reduce code degrades performance
Emitting values other than null degrades performance
Using reduce code other than the _sum, _count, _stats built-ins degrades performance
As a side note, CouchDB and Cloudant differ on exactly when views are updated:
CouchDB updates views lazily, that is when they are queried. This can lead to long wait times for infrequently accessed views.
Cloudant updates views asynchronously in the background. This means that views that are no longer being accessed are still consuming system resources.
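
To make the checklist above concrete, here is a minimal sketch of a design doc that follows it. The design doc name ("app") and the document fields (type, author) are assumptions for illustration: both views live in one design doc so they are built in a single pass, both emit null values, and the only reduce used is the built-in _count.

```javascript
// Minimal sketch of a design doc following the guidance above.
// "app", "type" and "author" are illustrative names, not from the question.
const designDoc = {
  "_id": "_design/app",
  "views": {
    "by_type": {
      "map": "function (doc) { if (doc.type) { emit(doc.type, null); } }",
      "reduce": "_count"
    },
    "by_author": {
      "map": "function (doc) { if (doc.author) { emit(doc.author, null); } }"
    }
  }
};
```

The flip side, per the second point above, is that editing either map function invalidates and rebuilds both views; if one of them changes frequently while the other must stay responsive, that is the argument for splitting them into separate design docs.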

Related

MongoDb slow aggregation with many collections (lookup)

I'm working on a MEAN stack project. I use too many collections in my aggregation, so I use a lot of $lookup stages, and that impacts performance negatively and makes the aggregation very slow. I was wondering if you have any suggestions. I found that we can reduce the number of $lookup stages by creating, for each collection I need, an array of objects in a global collection; however, I'm looking for an optimal and secure solution.
For reference, I have defined indexes on all collections in MongoDB.
Thanks for sharing your ideas!
This is a very involved question. Even if you gave all your schemas and queries, it would take too long to answer, and the answer would be very specific to your case (i.e. not useful to anyone else coming along later).
Instead, as a general answer, I'd advise you to read up on denormalization and consider some database redesign if this query is core to your project.
Here is a good article to get you started.
Denormalization allows you to avoid some application-level joins, at the expense of having more complex and expensive updates. Denormalizing one or more fields makes sense if those fields are read much more often than they are updated.
A simple example to outline it:
Say you have a blog with a comments collection and a users collection.
You want to display each comment with the name of its author, so you have to load the user for every comment.
Instead, you could store the username on the comment documents as well as in the users collection.
Then you will have a fast query to show comments, as you don't need to load the users too. But if the user changes their name, then you will have to update all of the comments with the new name. This is the main tradeoff.
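
A hedged sketch of that tradeoff with the Node.js MongoDB driver (the database, collection, and field names are all assumptions for illustration): the read path avoids the join entirely, and the last two calls show the extra work a rename now costs.

```javascript
// Minimal sketch; "blog", "users", "comments" and the fields are illustrative.
const { MongoClient } = require('mongodb');

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const db = client.db('blog');
  const users = db.collection('users');
  const comments = db.collection('comments');

  // Write path: copy the username onto the comment when it is created.
  await comments.insertOne({
    postId: 'post-1',
    userId: 'user-1',
    username: 'alice',        // denormalized copy of users.name
    text: 'Nice post!',
    createdAt: new Date()
  });

  // Read path: no $lookup needed, each comment already carries its author's name.
  const page = await comments.find({ postId: 'post-1' }).sort({ createdAt: -1 }).toArray();
  console.log(page);

  // The cost of denormalization: renaming a user now touches all of their comments too.
  await users.updateOne({ _id: 'user-1' }, { $set: { name: 'Alice' } });
  await comments.updateMany({ userId: 'user-1' }, { $set: { username: 'Alice' } });

  await client.close();
}

main().catch(console.error);
```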
If a DB redesign is too difficult, I suggest splitting the work into multiple aggregations and combining the results in memory (i.e. in your Node server-side code).

MongoDB schema design

I'm planning to implement this schema in MongoDB. I have been doing some reading about schema design, and the recurring notion was that whenever you structure your data like a relational database, you must be doing something wrong.
My questions:
What should I do when a collection's size grows larger than the 16MB limit?
app_log (in the server_log collection) might in some cases grow larger than 16MB, depending on how busy the server is.
I'm aware of capped collections, which I could use, but the requirement is to store all logs for 90 days.
Do you see any potential issues with my design?
Is it good practice to have the application check the collection size and create a new collection per day/hour etc. to accommodate log growth?
Thanks
Your collection size is not restricted to 16MB. As one of the comments pointed out, and as you can check in the MongoDB manual, 16MB is the maximum document size, not the collection size. So there is no need to split the same class of data across different collections; in fact, doing so would be a major headache for you :) Use one collection for users, one for your servers, and one for your server_logs. You can then create references from one collection to the next by using the id field.
Whether this is a good design or not will depend on your queries. In general, you want to avoid using joins in Mongo (they're still possible, but if you're doing a bunch of joins, you're using it wrong, and really should use a relational DB :-)
For example, if most of your queries are on the server_log collection and only use the fields in that collection, then you'll be fine. OTOH, if your server_log queries always need to pull in data from the server collection as well (say for example the name and userId fields), then it might be worth selectively denormalizing that data. That's a fancy way of saying, you may wish to copy the name and userId fields into your server_log documents, so that your queries can avoid having to join with the server collection. Of course, every time you denormalize, you add complexity to your application which must now ensure that the data is consistent across multiple collections (e.g., when you change the server name, you have to make sure you change it in the server_logs, too).
You may wish to make a list of the queries you expect to perform, and see if they can be done with a minimum of joins with your current schema. If not, see if a little denormalization will help. If you're getting to the point where either you need to do a bunch of joins or a lot of manual management of denormalized data in order to satisfy your queries, then you may need to rethink your schema or even your choice of DB.
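
As a sketch of what that selective denormalization could look like here (all field names are assumptions based on the question), each server_log entry carries copies of the server fields the log queries actually read:

```javascript
// Hypothetical document shapes; field names are illustrative only.
const server = {
  _id: "srv-42",
  name: "web-01",
  userId: "u-7"
};

const logEntry = {
  _id: "log-123",
  serverId: "srv-42",    // reference back to the server document
  serverName: "web-01",  // denormalized copy of server.name
  userId: "u-7",         // denormalized copy of server.userId
  level: "error",
  message: "disk almost full",
  ts: new Date()
};

// A log query can now filter and display without touching the servers collection:
//   db.server_log.find({ serverName: "web-01", level: "error" })
```

The caveat from the answer still applies: renaming a server now means updating its copies in server_log as well.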
what should I do when collection size gets larger than 16MB limit
In MongoDB there is no limit on collection size. The limit applies to each document: a single document must not exceed 16 MB.
Do you see any potential issues with my design?
No issues with the above design.

Paging among multiple aggregate root

I'm new to DDD, so please excuse me if some terms or my understanding are a bit off. But please correct me; any advice is appreciated.
Let's say I'm doing a social job board site, and I've identified my aggregate roots: Candidates, Jobs, and Companies. They are very different things/contexts, so each has its own database table, repository, and service. But now I have to build a Pinterest-style homepage where data blocks show data for either a Candidate, a Job, or a Company.
Now the tricky part is that the data blocks have to be ordered by the last time something happened to the aggregate each one represents (a company is liked/commented on, a job was updated, etc.), and paging occurs in the form of infinite scrolling, again just like Pinterest. Since things happen to these aggregates independently, I do not have a way to know how many of which aggregate are on any particular page. (But if I did, by the way, say via a table that tracks each aggregate's last update time, would I have no choice but to promote that to another aggregate root, with its own repository?)
Where would I implement the paging logic? I read somewhere that there should be one service per repository per aggregate root, so should I sort and page in the controller (I'm using MVC, by the way)? Or should there be an independent Application Service that does cross-boundary work like this? In either case, do I have to fetch ALL entities for ALL aggregates from the db?
That's too many questions already but I'm basically asking:
Is paging presentation, business, or persistence logic? Which horizontal layer?
Where should cross boundary code reside in DDD? Which vertical stack?
Several things come to mind.
How fresh does this aggregated data need to be? I doubt realtime is going to add much value. Talk to a business person and bargain for some latency. This will allow you to build a simpler solution to the problem.
Why not have some process do the scanning, aggregation and sorting, and store the result asynchronously? It doesn't even need to be in a database (Redis would do). The bargained latency could be the interval at which you run that process.
Paging is hardly a business decision concern in your example. You just need to provide infinite scrolling and some ajax calls that fetch the cached, aggregated, sorted information. This has little to do with DDD.
Your UI artifacts and the aggregation/sorting process seem to be very much a thing of their own, working against the data or, better yet, against a data component of each context that provides the data in the desired format.
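
Here is a rough sketch of that shape in Node/Express (everything named here — the loadRecent* helpers, the refresh interval, the /feed route — is hypothetical): a background job rebuilds a merged, sorted feed at the bargained interval, and the infinite-scroll endpoint only slices the cached, already-sorted array.

```javascript
const express = require('express');

// Hypothetical loaders, one per aggregate/context; each is assumed to return
// [{ type, id, title, lastActivityAt }, ...] from its own persistence.
const { loadRecentCandidates, loadRecentJobs, loadRecentCompanies } = require('./feedSources');

let cachedFeed = [];      // aggregated, pre-sorted blocks
const REFRESH_MS = 60000; // the "bargained latency"

async function rebuildFeed() {
  const [candidates, jobs, companies] = await Promise.all([
    loadRecentCandidates(), loadRecentJobs(), loadRecentCompanies()
  ]);
  cachedFeed = [...candidates, ...jobs, ...companies]
    .sort((a, b) => b.lastActivityAt - a.lastActivityAt);
}

rebuildFeed();
setInterval(rebuildFeed, REFRESH_MS);

const app = express();

// Infinite scrolling just asks for the next slice of the cached feed.
app.get('/feed', (req, res) => {
  const offset = parseInt(req.query.offset, 10) || 0;
  const limit = Math.min(parseInt(req.query.limit, 10) || 20, 100);
  res.json(cachedFeed.slice(offset, offset + limit));
});

app.listen(3000);
```

Swapping the in-memory array for Redis or another store only changes where cachedFeed lives; the paging logic stays the same.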

In CouchDB, are there ways to improve performance of the View index process?

I have some basic views and some map/reduce views with logic. Nothing too complex. Not too many documents. I've tried with 250k, 75k, and 10k documents. Seems like I'm always waiting for view indexing.
Does better, more efficient code in the view help? I'm assuming it's basically processing the view at all levels of aggregation. So there must be some improvement there.
Does emit()-ing less data help? emit(doc.id, doc) vs specifying fewer fields?
Do more or less complex keys impact view indexing?
Or is it all about memory, CPU cores, and processor speed?
There must be some documentation out there, but I can't find anything referencing ways to improve performance.
I would take a deeper look into the reduce function. Try to use the built-in Erlang functions like _sum and _count instead of writing JavaScript.
Complex views can take hours and more, that's normal.
Maybe post that not-too-complex map/reduce code here.
And don't forget: indexing all docs is only done once after changing the view (or pushing a whole bunch of new docs). Subsequent new docs are indexed incrementally.
Use a view with &stale=ok to retrieve the "old" data instantly, so you don't have to wait. (But pay attention: you always have to call a view without stale=ok at least once to trigger the indexing process). Or better: use stale=update_after.
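
For example (the database, design doc, and view names are placeholders), a stale query against CouchDB's HTTP API looks like this; update_after answers from whatever is already in the index and then kicks off reindexing in the background:

```javascript
// Placeholders: mydb / _design/app / by_type.
// stale=update_after: answer from the current index, then trigger an index update.
// stale=ok:           answer from the current index, never trigger an update.
fetch('http://localhost:5984/mydb/_design/app/_view/by_type?stale=update_after')
  .then(res => res.json())
  .then(body => console.log(body.rows));
```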
The code you write in views is more like CREATE INDEX than SELECT. It should be irrelevant how long it takes, as long as the view builds keep up with the document change rate. Building a view is a sunk (one-time) cost.
When you query the view, that is always a binary tree scan, which operates against a static data set in logarithmic time. That is usually the performance people care about more (in production.)
If you are not seeing behavior like I describe, perhaps we could discuss your view functions and your general approach to your problem. CouchDB is very different from relational databases. In the latter, you have highly structured data and free-form queries. In CouchDB, you have free-form data but highly structured index definitions (views). Except during development, changing and rebuilding views should be rare.
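
On the "emit less data" part of the question specifically, a common pattern (sketched here with assumed document fields and view name) is to emit null as the value and attach the documents at query time with include_docs=true, instead of emitting the whole doc into the index:

```javascript
// View map function: emit only the key, keep the value null so the index stays small.
// (doc.type, doc.created_at and the view name are assumptions for illustration.)
function (doc) {
  if (doc.type === 'post') {
    emit(doc.created_at, null);
  }
}

// At query time, ask CouchDB to join each row to its document instead:
//   GET /mydb/_design/app/_view/posts_by_date?include_docs=true
// You trade one extra document lookup per row at read time for a much smaller
// index and faster view builds.
```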
Not emitting anything at all helps the most, though sometimes that can't be avoided. Beyond that, doing the view creation in smaller batches (there are scripts that do this automagically) helps more than anything else.

Which is better - auto-generated id or manual id assignment in couchdb documents?

Should I be generating the id of the documents in a CouchDB or should I depend on CouchDB to generate it? What are the advantages or disadvantages in these approaches? Is there any performance implications on any of these options?
There is no difference as far as CouchDB is concerned. Frederick is right that sequential ids are slightly faster. If you query /_uuids?count=10 you will notice that the UUIDs are sequential (by default).
However, even with random IDs, once you run compaction, they will all be in the "right" order internally in the .couch file and at that point there is no difference. So in the long run, I don't usually worry about it.
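
For instance, against a local CouchDB (default port assumed) you can see the sequential behaviour yourself; the returned UUIDs share a long common prefix and only the tail varies:

```javascript
// GET /_uuids?count=10 returns { "uuids": [ ... ] } generated with the
// server's configured algorithm (sequential by default, per the answer above).
fetch('http://localhost:5984/_uuids?count=10')
  .then(res => res.json())
  .then(({ uuids }) => console.log(uuids));
```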
The main thing is that you should use mostly sequential ids. As this article and this bit of the couchdb book explain, using random ids results in a much less efficient structure internally, both speed wise and in terms of space used on disc.
Self-generated ids are almost impossible to deal with if you have two or more separate instances of your app, because synchronisation between the different instances is not instantaneous. A solution for this can be to have one server dedicated to generating (or checking the availability of) the ids, for example using a SQL database, acting as a gate for document creation.
On the other hand, if you have only one server and will never need more, there is one advantage of self-generated ids that I find interesting: since they have to be unique, you can use them in URLs. For instance, take the slug of a blog post's title as the _id.
Performance-wise, CouchDB's generated ids are pretty long, so if your own ids are shorter you will save significant disk space (assuming you have a lot of documents).
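
As a sketch of the slug-as-_id idea (the database name and fields are assumptions), creating the post is a single PUT, and the _id doubles as the human-readable URL segment:

```javascript
// Hypothetical "blog" database; the slug is derived from the post title.
const slug = 'my-first-post';

fetch(`http://localhost:5984/blog/${slug}`, {
  method: 'PUT',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ title: 'My first post', body: '...', type: 'post' })
})
  .then(res => res.json())
  .then(console.log); // e.g. { ok: true, id: 'my-first-post', rev: '1-...' }
```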
Both answers above describe the pros of sequential IDs.
Here is a major problem introduced by sequential IDs: given a single ID, the other IDs in the database become predictable.
Because of this, we can't use sequential IDs as identifiers in application URLs, and relying on the URL itself as a form of authentication (as file-sharing services do) is also not possible.