Replicating pre-calculated views in CouchDB/Couchbase

When I first query a CouchDB/Couchbase view, it needs to be calculated. This can take a good while if there is a large number of docs, and it happens for every single view.
Is there any way to replicate an already calculated view from one Couch to another?

Not directly through CouchDB replication, no. There are all sorts of practical complexities in how that would have to be implemented that make it impractical, I'm afraid.
For starters, it means the CouchDBs would have to carefully replicate view calculations in sync with the actual data (so you never get newer view calculations than data). That gets further complicated by the fact that views only get updated when requested, so view data on either end could be out of date (and if users are querying with stale=ok, it might even be required to stay out of date).
I believe you can do it by directly copying the view index files (in /var/lib/couchdb/.DBNAME_design/SOMEHASH.view by default, I think), if you just need a one-off view sync. I'd recommend against doing that frequently as a general solution though, since it's not officially supported AFAIK and is likely to be pretty fragile.

This isn't a direct answer to your question, but as PimTerry pointed out, replicating the view index is not supported, especially between different implementations.
What you can do instead is follow the procedure described here:
http://wiki.apache.org/couchdb/How_to_deploy_view_changes_in_a_live_environment
This way you can have CouchDB calculate the new index in the background without blocking your application.
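For illustration, here is a rough Python sketch of the kind of sequence that procedure boils down to, as I understand it: upload the changed design doc under a temporary name, query one of its views so the index gets built, then copy it over the live design doc. The function name, the "-new" suffix and the example values are made up for this sketch. It only builds the list of HTTP steps and sends nothing, so treat the wiki page as the authoritative version.

```python
from urllib.parse import quote


def staged_view_deploy_plan(db, ddoc, view, live_rev=None):
    """Return the HTTP steps as (method, path, headers) tuples. Nothing is sent."""
    tmp = f"{ddoc}-new"  # hypothetical temporary design doc name
    base = f"/{quote(db, safe='')}"
    destination = f"_design/{ddoc}"
    if live_rev:
        # Overwriting an existing design doc needs its current revision.
        destination += f"?rev={live_rev}"
    return [
        # 1. Upload the changed design doc under the temporary name.
        ("PUT", f"{base}/_design/{tmp}", {"Content-Type": "application/json"}),
        # 2. Query one view of the temporary doc; this triggers the index build
        #    while the live design doc keeps serving the application.
        ("GET", f"{base}/_design/{tmp}/_view/{view}?limit=1", {}),
        # 3. Once the build has finished, copy the temporary doc over the live one.
        ("COPY", f"{base}/_design/{tmp}", {"Destination": destination}),
    ]


# Example values only; "mydb", "app" and "by_date" are placeholders.
plan = staged_view_deploy_plan("mydb", "app", "by_date", live_rev="3-abc")
for method, path, headers in plan:
    print(method, path, headers)
```

The idea is that the slow part (step 2) happens against the temporary design doc, so the live one is only swapped once the index already exists.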
Hope this helps.

Related

full database table update

I currently have a REST endpoint with basic CRUD operations for a sqlite database.
But my application updates whole tables at a time (with a "save" button).
My current idea/solution is to query the data first, compare the data, and update only the "rows" that changed.
The solution is a bit complex because there are several different types of changes that can be done:
Add row
Remove row
Row content changed (similar to content moving up or down)
Is there a simpler solution?
The simplest solution is a bit dirty: delete the table, create the table, and add each row back.

The solution is a bit complex because there are several different types of changes that can be done:
Add row
Remove row
Row content changed (similar to content moving up or down)
Is there a simpler solution?
The simple answer is
Yes, you are correct.
That is exactly how you do it.
There is literally no easy way to do this.
Be aware that, for example, Firebase entirely exists to do this.
Firebase is worth billions, is far from perfect, and was created by the smartest minds around. It literally exists to do exactly what you ask.
Again there is literally no easy solution to what you ask!
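To make the "query, compare, update only what changed" approach concrete, here is a minimal Python sketch of the comparison step. It assumes every row has a stable primary key and that both the stored table and the submitted table can be loaded as {primary_key: row} dicts; the function name and the sample rows are made up for illustration.

```python
def diff_rows(stored, submitted):
    """Compare two {primary_key: row} mappings.

    Returns (added, removed, changed):
      added   - rows present only in the submitted table
      removed - sorted keys present only in the stored table
      changed - rows present in both whose content differs
    """
    stored_keys = set(stored)
    submitted_keys = set(submitted)
    added = {k: submitted[k] for k in submitted_keys - stored_keys}
    removed = sorted(stored_keys - submitted_keys)
    changed = {
        k: submitted[k]
        for k in stored_keys & submitted_keys
        if stored[k] != submitted[k]
    }
    return added, removed, changed


# Sample data only.
stored = {1: {"name": "a"}, 2: {"name": "b"}, 3: {"name": "c"}}
submitted = {1: {"name": "a"}, 2: {"name": "B"}, 4: {"name": "d"}}
added, removed, changed = diff_rows(stored, submitted)
print(added, removed, changed)
```

The three results map onto INSERT, DELETE and UPDATE statements, which you would normally run inside one transaction so a failed save doesn't leave the table half updated.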
Some general reading:
One of the handful of decent articles on this:
https://www.objc.io/issues/10-syncing-data/data-synchronization/
Secondly, you will want to familiarize yourself with Firebase, since a normal part of computing now is either using BaaS sync solutions (e.g. Firebase, usually some NoSQL solution) or indeed doing it by hand.
http://firebase.google.com/docs/ios/setup/
(I don't especially recommend Firebase, but you have to know how to use it in as much as you have to know how to do regex and you have to know how to write sql calls.)
Finally you can't make realistic iOS apps without Core Data,
https://developer.apple.com/library/archive/documentation/Cocoa/Conceptual/CoreData/index.html
By no means does Core Data solve the problem you describe, but realistically you will use it while you solve the problem conceptually.
You may enjoy Realm,
https://realm.io
which again is - precisely - a solution to the problem you describe. (Which is basically the basic problem in typical modern app development.) As with Firebase, even if you don't like it or decide not to go with it on a particular project, one needs to be familiar with it.

Scratch couchdb document

Is it possible to "scratch" a CouchDB document? By that I mean to delete a document and make sure that the document and its history are completely removed from the database.
I do not want to perform a database compaction, I just want to fully wipe out a single document. And I am looking for a solution that guarantees that there is no trace of the document in the database, without needing to wait for internal database processes to eventually remove the document.
(A Python solution is appreciated.)

When you delete a document in CouchDB, generally only the _id, _rev, and a deleted flag are preserved. These are preserved to allow for eventual consistency through replication. Forcing an immediate delete across an entire group of nodes isn't really consistent with the architecture.
The closest thing would be to use purge; once you do that, all traces of the doc will be gone after the next compaction. I realize this isn't exactly what you're asking for, but it's the closest thing off the top of my head.
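Since a Python solution was asked for, here is a minimal sketch of a purge call using only the standard library. It assumes the server exposes POST /{db}/_purge with a JSON body that maps each document ID to the list of revisions to purge; check that your CouchDB version supports it. The helper names, URL, database and revision are placeholders, and nothing is sent unless you call purge() yourself.

```python
import json
import urllib.request


def build_purge_request(base_url, db, doc_id, revs):
    """Build (but do not send) a purge request for one document."""
    # Body format: {"<doc id>": ["<rev>", ...]}
    body = json.dumps({doc_id: list(revs)}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/{db}/_purge",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def purge(base_url, db, doc_id, revs):
    """Send the purge request and return the decoded JSON response."""
    request = build_purge_request(base_url, db, doc_id, revs)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Placeholder values; this only builds the request object.
request = build_purge_request("http://localhost:5984", "mydb", "doc1", ["2-abc"])
print(request.get_method(), request.full_url, request.data)
```

Keep in mind the caveat above: the purged revisions only disappear from disk once compaction has run.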
Here's a nice article explaining the basis behind the various delete methods available.

Deleting anything from a file system for sure is a difficult and usually quite expensive problem, even more so with databases in general. Depending on what "for sure" means to you, you may end up with a custom DB, custom OS and custom hardware. It is a bit like saying "I want a fault-tolerant system": everyone would like to have one, but only a few can afford it. The good news is that most can settle for less. The same goes for deleting for sure. I assume you are trying to address some security or privacy issue, so try to see if there is some other way to get what you need, perhaps encrypting the document or the sensitive parts of it.

How can I alter the incoming documents on replication in CouchDB

I need to replicate in CouchDB data from one database to another but in the process I want to alter the documents being replicated over,
mostly stripping out particular fields (but other applications mentioned in comments).
The replication would always be 100% one way (but other applications mentioned in comments could use bi-directional and sync)
I would prefer if this process did not increment their revision ID but that might be asking for too much.
But I don't see any of the design document functions that do what I am trying to do.
As it seems CouchDB doesn't do this, what plans are there for adding it? And meanwhile, what workarounds are there?

No, there is no out-of-the-box solution, as this would defy the whole purpose and logic of multi-master, MVCC logic.
The only option I can see here is to create your own solution, but I would not call this a replication, but rather ETL (Extract, Transform, Load). And for ETL there are tools available that will let you do the trick, like (mixing open source and commercial here):
Scriptella
CloverETL
Pentaho Data Integration, or to be more specific Kettle
Jaspersoft ETL
Talend has some tools as well
There are plenty more ETL tools on the market.

I believe the best approach here would be to break out the fields you want to filter out into a separate document and then filter out the document during replication.

Of course the best way would be to have built-in support for this. A workaround that occurs to me would be to skip the built-in replication and write a custom replication that does the additional alterations/transformations, while still using the other built-ins rather than going beneath them. With good coding, in many situations (especially if each master can push to its slaves), it feels like this could be nearly as efficient.
This requires efficient triggers on each source/master to detect any changes, which I believe CouchDB does offer (or at least PouchDB appears to). The trigger would then copy the changes to another location, applying the full alterations along the way.
If the source of the change is unable to push the change to the final destination, this intermediate store may have to be local to the source, where the destination can pull from it -- which could get pretty expensive, especially in multi-master, as each location has to store and maintain not only its own data but also the data (being sent) of everyone it sends to.
This replication would also place each source document's revision ID in the document's copy...
...that is the ideal, and it is essential if the copy is also to be updated (that is, a master too).
...in the form of either:
ideally the normal "_rev" property. This looks quite possible, since it ("preserve their revisions ID") is already done by the normal replication algorithm using the built-in "Bulk Docs API", which our variant would seemingly use too
otherwise a new copy object (with its own _rev) plus another field such as "_rev_original" telling the original rev. But would that work?
Clearly such a copy could be created, no problem.
Probably no big deal if the destination is just reading the data.
Seems hairy if the destination is also writing the data, as we'd now have to merge with these non-standard revisions. But doable.
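As a starting point for the first option (keeping the normal "_rev"), here is a small Python sketch of the transform step. It assumes the documents have already been read from the source (for example from its changes feed) and will be written to the target's _bulk_docs endpoint with "new_edits": false, which asks CouchDB to store the given revisions as they are instead of generating new ones. The function names and the sample document are made up; the sketch builds the request body only and does no HTTP.

```python
import copy


def strip_fields(doc, fields):
    """Return a deep copy of doc without the given top-level fields."""
    cleaned = copy.deepcopy(doc)
    for name in fields:
        cleaned.pop(name, None)
    return cleaned


def bulk_docs_payload(docs, fields_to_strip):
    """Build a _bulk_docs body that keeps each document's source _rev."""
    return {
        # Assumption: the target honours new_edits=false and keeps the
        # supplied _rev instead of assigning a new one.
        "new_edits": False,
        "docs": [strip_fields(doc, fields_to_strip) for doc in docs],
    }


# Sample data only.
source_docs = [{"_id": "a", "_rev": "2-xyz", "name": "n", "ssn": "secret"}]
payload = bulk_docs_payload(source_docs, ["ssn"])
print(payload)
```

Note the catch this makes visible: the target then holds a document with the same "_rev" as the source but different content, which is exactly why writing on the destination side gets hairy.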
Relevant to this (coding a custom/improved replication to provide this apparently missing functionality, ideally without altering Pouch and especially Couch source code), as starter material (the standard method), here's the normal Couch replication algorithm, which unfortunately doesn't clearly say it only uses built-in ops, but it looks like it, and also the official overview of what it does. I suspect Pouch implements this, likely in Pouch's replicate.js (latest release as of 2014.07).
Further implementation particulars? Those who would know, please put them here.
This is a "community wiki" answer so please extend it.
Also please comment links & details of anyone/system already doing or trying to do this or similar.

In CouchDB, are there ways to improve performance of the View index process?

I have some basic views and some map/reduce views with logic. Nothing too complex. Not too many documents. I've tried with 250k, 75k, and 10k documents. Seems like I'm always waiting for view indexing.
Does better, more efficient code in the view help? I'm assuming it's basically processing the view at all levels of aggregation. So there must be some improvement there.
Does emit()-ing less data help? emit(doc.id, doc) vs specifying fewer fields?
Do more or less complex keys impact view indexing?
Or is it all about memory, CPU cores, and processor speed?
There must be some documentation out there, but I can't find anything referencing ways to improve performance.

I would take a deeper look into the reduce function. Try to use the built-in Erlang functions like _sum, _count, instead of writing Javascript.
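As a rough illustration, this is what a design document using a built-in reduce could look like when assembled in Python. The helper name, design doc name, view name and map function are invented for the example; the only real CouchDB pieces are the design document layout and the built-in reduce names.

```python
import json

# Built-in reduce functions that run inside CouchDB itself.
BUILTIN_REDUCES = {"_sum", "_count", "_stats"}

# Example map function: emits a small value (1) instead of the whole doc.
MAP_SOURCE = "function (doc) { if (doc.type) { emit(doc.type, 1); } }"


def build_design_doc(name, view_name, map_source, reduce="_count"):
    """Return a design document dict with one view and a built-in reduce."""
    if reduce not in BUILTIN_REDUCES:
        raise ValueError(f"not a built-in reduce: {reduce!r}")
    return {
        "_id": f"_design/{name}",
        "language": "javascript",
        "views": {view_name: {"map": map_source, "reduce": reduce}},
    }


design_doc = build_design_doc("stats", "by_type", MAP_SOURCE, reduce="_sum")
print(json.dumps(design_doc, indent=2))
```

The resulting JSON is what you would PUT to the database under _design/stats.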
Complex views can take hours and more, that's normal.
Maybe post one of those not-too-complex map/reduce views.
And don't forget: indexing all docs is only done once after changing the view (or pushing a whole bunch of new docs). Subsequent new docs are indexed incrementally.
Use a view with &stale=ok to retrieve the "old" data instantly, so you don't have to wait. (But pay attention: you always have to call a view without stale=ok at least once to trigger the indexing process). Or better: use stale=update_after.
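A tiny Python helper shows how the stale parameter ends up in the view URL. The helper name, host, database and view names are placeholders, and it only builds the URL.

```python
from urllib.parse import urlencode


def view_url(base_url, db, ddoc, view, stale=None):
    """Build a view query URL, optionally with stale=ok or stale=update_after."""
    if stale not in (None, "ok", "update_after"):
        raise ValueError(f"unsupported stale value: {stale!r}")
    url = f"{base_url}/{db}/_design/{ddoc}/_view/{view}"
    if stale:
        url += "?" + urlencode({"stale": stale})
    return url


# Placeholder names throughout.
fresh = view_url("http://localhost:5984", "mydb", "stats", "by_type")
quick = view_url("http://localhost:5984", "mydb", "stats", "by_type", stale="update_after")
print(fresh)
print(quick)
```

The first URL waits for the index to be up to date; the second returns whatever is already indexed and lets the update run afterwards.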

The code you write in views is more like CREATE INDEX than SELECT. It should be irrelevant how long it takes, as long as the view builds keep up with the document change rate. Building a view is a sunk (one-time) cost.
When you query the view, that is always a B-tree scan, which operates against a static data set in logarithmic time. That is usually the performance people care about more (in production).
If you are not seeing behavior like I describe, perhaps we could discuss your view functions and your general approach to your problem. CouchDB is very different from relational databases. In the latter, you have highly structured data and free-form queries. In CouchDB, you have free-form data but highly structured index definitions (views). Except during development, changing and rebuilding views should be rare.

Not emitting anything will help, but building the view in smaller batches (there are scripts that do this automagically) helps more than anything else short of not emitting at all, which sometimes can't be avoided.

Strategies for search across disparate data sources

I am building a tool that searches people based on a number of attributes. The values for these attributes are scattered across several systems.
As an example, dateOfBirth is stored in a SQL Server database as part of system ABC. That person's sales region assignment is stored in some horrible legacy database. Other attributes are stored in a system only accessible over an XML web service.
To make matters worse, the legacy database and the web service can be really slow.
What strategies and tips should I consider for implementing a search across all these systems?
Note: Although I posted an answer, I'm not confident it's a great answer. I don't intend to accept my own answer unless no one else gives better insight.

You could consider using an indexing mechanism to retrieve and locally index the data across all the systems, and then perform your searches against the index. Searches would be an awful lot faster and more reliable.
Of course, this just shifts the problem from one part of your system to another - now your indexing mechanism has to handle failures and heterogeneous systems, but that may be an easier problem to solve.
Another factor is how often the data changes. If you have to query data in real-time that goes stale very quickly, then indexing may not be practical.

If you can get away with a restrictive search, start by returning a list based on the search criteria corresponding to the fastest data source. Then join up those records with the other systems and remove records which don't match the search criteria.
If you have to implement OR logic, this approach is not going to work.
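Here is a minimal Python sketch of that filter-then-join idea. It assumes AND-only criteria given as {attribute: expected_value}, a fast source already loaded as {person_id: attributes}, and slow sources wrapped as functions that take a person ID and return a dict of attributes. All names and the sample data are made up.

```python
def restrictive_search(fast_rows, slow_lookups, criteria):
    """Filter on the fast source first, then check survivors against slow sources."""
    fast_attrs = {attr for row in fast_rows.values() for attr in row}
    fast_criteria = {k: v for k, v in criteria.items() if k in fast_attrs}
    results = {}
    for person_id, row in fast_rows.items():
        # Cheap check first: skip anyone the fast source already rules out.
        if any(row.get(k) != v for k, v in fast_criteria.items()):
            continue
        merged = dict(row)
        for lookup in slow_lookups:
            merged.update(lookup(person_id))  # slow call, survivors only
        if all(merged.get(k) == v for k, v in criteria.items()):
            results[person_id] = merged
    return results


# In-memory stand-ins for the real systems.
fast = {
    1: {"dateOfBirth": "1980-01-01"},
    2: {"dateOfBirth": "1990-05-05"},
    3: {"dateOfBirth": "1980-01-01"},
}
regions = {1: "EMEA", 2: "EMEA", 3: "APAC"}
slow_calls = []


def region_lookup(person_id):
    slow_calls.append(person_id)  # record which IDs hit the slow source
    return {"salesRegion": regions[person_id]}


found = restrictive_search(
    fast, [region_lookup], {"dateOfBirth": "1980-01-01", "salesRegion": "EMEA"}
)
print(found, slow_calls)
```

In this example the slow source is only asked about the two people who passed the date-of-birth filter, which is the whole point of starting with the fastest system.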

While not an actual answer, this might at least get you partway to a workable solution. We had a similar situation at a previous employer - lots of data sources, different ways of accessing those data sources, different access permissions, military/government/civilian sources, etc. We used Mule, which is built around the Enterprise Service Bus concept, to connect these data sources to our application. My details are a bit sketchy, as I wasn't the actual implementor, just an integrator, but what we did was define a channel in Mule. Then you write a simple integration piece to go between the channel and the data source, and between the application and the channel. The integration piece does the work of making the actual query and formatting the results, so we had a generic SQL integration piece for accessing a database, and for things like web services we had some base classes that implemented common functionality, so the actual customization of the integration pieces was a lot less work than it sounds like. The application could then query the channel, which would handle accessing the various data sources, transforming their results into a normalized bit of XML, and returning the results to the application.
This had a lot of advantages for our situation. We could include new data sources for existing queries by simply connecting them to the channel - the application didn't have to know or care what data sources were there, as it only looked at the data from the channel. Since data can be pushed or pulled from the channel, we could have a data source update the application when, for example, it was updated.
It took a while to get it configured and working, but once we got it going, we were pretty successful with it. In our demo setup, we ended up with 4 or 5 applications acting as both producers and consumers of data, and connecting to maybe 10 data sources.

Have you thought of moving the data into a separate structure?
For example, Lucene stores data to be searched in a schema-less inverted index. You could have a separate program that retrieves data from all your different sources and puts it in a Lucene index. Your search could work against this index, and the search results could contain a unique identifier and the system it came from.
http://lucene.apache.org/java/docs/
(There are implementations in other languages as well)
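To show the idea without pulling in Lucene itself, here is a toy inverted index in Python. It is only a stand-in and not Lucene's API. It assumes each source can be dumped as (system, record_id, text) tuples; the names and sample records are invented.

```python
from collections import defaultdict


def build_index(records):
    """Map each lower-cased token to the set of (system, record_id) pairs containing it."""
    index = defaultdict(set)
    for system, record_id, text in records:
        for token in text.lower().split():
            index[token].add((system, record_id))
    return index


def search(index, query):
    """Return the (system, record_id) pairs that contain every query token."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    hits = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        hits &= index.get(token, set())
    return hits


# Sample records from two imaginary systems.
index = build_index([
    ("ABC", 1, "Jane Smith"),
    ("legacy", 7, "Jane Doe EMEA"),
])
print(search(index, "jane"))
print(search(index, "jane emea"))
```

Each hit carries the system name and the identifier, so the application can go back to the original source for the full record.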

Have you taken a look at YQL? It may not be the perfect solution, but it might give you a starting point to work from.

Well, for starters I'd parallelize the queries to the different systems. That way we can minimize the query time.
You might also want to think about caching and aggregating the search attributes for subsequent queries in order to speed things up.
You have the option of creating an aggregation service or middleware that aggregates all the different systems so that you can provide a single interface for querying. If you do that, this is where I'd do the previously mentioned caching and parallelization optimizations.
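A minimal Python sketch of the fan-out part, using a thread pool from the standard library. It assumes each system is wrapped in a function that takes the search criteria and returns its results; the function name and the sample sources are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def query_all(sources, criteria):
    """Query every source in parallel.

    sources maps a name to a callable taking the criteria. The result maps
    each name to that source's return value, or to the exception it raised,
    so one slow or broken system does not hide the others.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        futures = {name: pool.submit(fn, criteria) for name, fn in sources.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:
                results[name] = exc
    return results


# Placeholder sources: one that works, one that fails.
sample = query_all(
    {"sql": lambda c: [c["name"]], "legacy": lambda c: 1 / 0},
    {"name": "jane"},
)
print(sample)
```

With this shape the total wait is roughly that of the slowest system instead of the sum of all of them.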
However, with all of that you will need to weigh the development time, deployment time and long-term benefits of the effort against migrating the old legacy database to a faster, more modern one. You haven't said how tied into other systems those databases are, so it may not be a very viable option in the short term.
EDIT: in response to data going out of date. You can consider caching if you don't need the data to always match the database in real time. Also, if some data doesn't change very often (e.g. dates of birth) then you should cache it. If you employ caching then you could make your system configurable as to what tables/columns to include or exclude from the cache, and you could give each table/column its own cache timeout with an overall default.
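One way to sketch the per-column timeout idea in Python is below. The class name, the column names and the injectable clock are all assumptions made for the example; a real system would also need eviction and thread safety.

```python
import time


class TTLCache:
    """Cache with a default timeout and optional per-column overrides (in seconds)."""

    def __init__(self, default_ttl, ttl_overrides=None, clock=time.monotonic):
        self.default_ttl = default_ttl
        self.ttl_overrides = dict(ttl_overrides or {})
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}

    def get(self, column, key, loader):
        """Return the cached value for (column, key), calling loader(key) on a miss."""
        now = self.clock()
        hit = self._store.get((column, key))
        if hit is not None and hit[0] > now:
            return hit[1]
        value = loader(key)
        ttl = self.ttl_overrides.get(column, self.default_ttl)
        self._store[(column, key)] = (now + ttl, value)
        return value


# Dates of birth rarely change, so they get a much longer timeout here.
cache = TTLCache(default_ttl=60, ttl_overrides={"dateOfBirth": 86400})
print(cache.get("dateOfBirth", 42, lambda person_id: "1980-01-01"))
```

The loader is whatever function fetches the value from the slow system, so the cache can sit in front of the legacy database and the web service alike.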

Use Pentaho/Kettle to copy all of the data fields that you can search on and display into a local MySQL database
http://www.pentaho.com/products/data_integration/
Create a batch script to run nightly and update your local copy. Maybe even every hour. Then, write your query against your local MySQL database and display the results.
