CouchDB, how to get document changes only

Using /_changes?filter=_design I can get all the changes for design documents.
How do I get all the changes for documents only?
Is there something like /_changes?filter=_docs_only ???

There is no built-in filter for this. You will need to write your own filter function (http://couchdb.readthedocs.org/en/latest/couchapp/ddocs.html#filterfun) that excludes design documents (check the doc's _id for "_design/", etc.) from the feed. You then reference this filter function when you query the changes feed (http://couchdb.readthedocs.org/en/latest/api/database/changes.html?highlight=changes). However, most applications don't run into this very often, since design documents are typically only updated when the application itself changes.
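For illustration, a filter along these lines would do it. This is only a sketch: the design doc name app, the filter name docs_only, and the database name db are invented, and any HTTP client works in place of requests.

import requests

# Store a filter function in a design doc (the JavaScript runs inside CouchDB).
ddoc = {
    "filters": {
        "docs_only": "function(doc, req) { return doc._id.indexOf('_design/') !== 0; }"
    }
}
requests.put("http://localhost:5984/db/_design/app", json=ddoc)

# Reference the filter as designdocname/filtername when reading _changes.
changes = requests.get("http://localhost:5984/db/_changes",
                       params={"filter": "app/docs_only"}).json()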
It would probably be more efficient to implement this filter on the client side instead of streaming all your changes through the couchjs process (which is always inefficient). As your application loops through the changes, simply skip any change whose id marks it as a design doc.
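A minimal sketch of that client-side variant (again assuming requests, a database named db, and a hypothetical process() handler):

import requests

changes = requests.get("http://localhost:5984/db/_changes").json()
for change in changes["results"]:
    if change["id"].startswith("_design/"):
        continue  # skip design documents on the client side
    process(change)  # your application's change handler (hypothetical)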
Cheers.

Related

If I drop the same document into ElasticSearch again, is it going to reindex it?

This is obviously a question about ES internals.
What I have is a custom search engine built on top of ES, fed with data from multiple vendors. To find out whether a particular document has changed since the last indexing (e.g. during a periodic re-pull of documents from vendors - some vendors offer no way to ask "give me only documents changed since that date"), I'd have to check it somehow for modification and drop it into ES for indexing only if the document has changed.
Question: does ES keep track of document checksums internally to see if it actually needs to re-index it? (of course I'm presuming that it's not some HTML where some fields, timestamps, etc. are updated dynamically on each GET).
If it did (that is, re-indexing identical documents has negligible amortized cost), that would simplify updates for me, obviously.
If you use the Update API, you can detect no-ops: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html#_detecting_noop_updates. You can see the source code for the no-op handling here: https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/action/update/UpdateRequestBuilder. Note the "extra work" comment; that's definitely something to consider.
Keep in mind the update API tends to be a lot slower than plain vanilla bulk inserts. Regular inserts in which you let ES increment the _version number when you index a document in the same index with the same id will be faster... but they'll also create GC and indexing pressure.
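For illustration, a no-op-detecting update with the Python client could look roughly like this; the index, type, and id are made up, and the response shape varies by ES version:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# detect_noop makes ES skip the write when the merged document is unchanged.
resp = es.update(
    index='myindex',
    doc_type='mytype',
    id='42',
    body={'doc': {'title': 'same value as before'}, 'detect_noop': True},
)
# On recent versions a skipped write is reported as resp['result'] == 'noop';
# either way, _version is not incremented for a no-op.
print(resp)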

Applying "tag" to millions of documents, using bulk/update methods

We have about 55,000,000 documents in our ElasticSearch instance. We have CSV files of user_ids; the biggest CSV has 9M entries. Our documents use user_id as the key, so this is convenient.
I am posting the question because I want to discuss the options and find the best way to get this done, as there are different ways to address this problem. We need to add a new "label" to a document if the user document doesn't have it yet, e.g. tagging the user with "stackoverflow" or "github".
There is the classic partial update endpoint. This sounds very slow, as we would need to iterate over 9M user_ids and issue an API call for each of them.
There is the bulk request, which provides better performance, but is limited to roughly 1,000-5,000 documents per call, and knowing when a batch is too large is know-how we would have to learn as we go.
Then there is the official open issue for the /update_by_query endpoint, which has lots of traffic but no confirmation that it was implemented in a standard release.
On this open issue there is a mention of an update_by_query plugin, which should provide better handling, but there are old and still-open issues where users complain of performance problems and memory issues.
I am not sure whether it's doable in ES, but I thought I could load all the CSV entries into a separate index, somehow join the two indexes, and apply a script that adds the tag if it doesn't exist yet.
So the question remains: what's the best way to do this? And if some of you have done this in the past, please share your numbers/performance and what you would do differently this time.
While waiting for update by query support, I have opted for:
Use the scan/scroll API to loop over the document IDs you want to tag (related answer).
Use the bulk API to perform partial updates to set the tag on every matching doc.
Additionally I store the tag data (your CSV) in a separate doc type, and query from that and tag all new docs as they are created, i.e., to not have to first index and then update.
Python snippet to illustrate the approach:
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

def actiongen():
    # myquery, myindex and tags are defined elsewhere in the script.
    docs = helpers.scan(es, query=myquery, index=myindex, fields=['_id'])
    for doc in docs:
        yield {
            '_op_type': 'update',
            '_index': doc['_index'],
            '_type': doc['_type'],
            '_id': doc['_id'],
            'doc': {'tags': tags},
        }

# stats_only=True returns just the counts of successes and failures.
helpers.bulk(es, actiongen(), index=myindex, stats_only=True)
Using the aforementioned update-by-query plugin, you would simply call:
curl -XPOST localhost:9200/index/type/_update_by_query -d '{
    "query": {"filtered": {"filter": {
        "not": {"term": {"tag": "github"}}
    }}},
    "script": "ctx._source.tag = \"github\""
}'
The update-by-query plugin only accepts a script, not partial documents.
As for performance and memory issues, I guess the best thing is to give it a try.
I'd go with the bulk API with the caveat that you should try to update each document the minimal number of times. Updates are just atomic deletes and adds and leave behind the deleted document as a tombstone until it can be merged out.
Sending a Groovy script to execute the update probably makes the most sense here, so you don't have to fetch the document first.
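As a rough sketch (not the poster's code), a scripted bulk update with the Python client might look like the following on a Groovy-enabled ES 1.x cluster; the index/type/field names and user_ids_from_csv are invented, and the exact script syntax varies by version:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

def script_actions(user_ids):
    for uid in user_ids:
        yield {
            '_op_type': 'update',
            '_index': 'users',
            '_type': 'user',
            '_id': uid,
            # The script runs server-side, so the document never travels to the client.
            'script': 'if (!ctx._source.tags.contains(tag)) { ctx._source.tags += tag }',
            'params': {'tag': 'github'},
        }

helpers.bulk(es, script_actions(user_ids_from_csv))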
You could create a Parent/Child relationship, adding a 'tags' type which references your 'posts' type as its parent. This way you wouldn't need to perform a full reindex of your data - simply index each of the appropriate tags against the appropriate post ID.
A very old thread. I landed here through the GitHub page for "update by query" to see if it was implemented in 2.0, but unluckily it is not. Thanks to the plugin from Teka, small updates are very much doable from Sense, but our use case was to update millions of documents daily based on certain complex queries. In the end, we moved to the es-hadoop connector. Although the infrastructure is a big overhead here, parallelizing the fetching/updating/inserting of documents through Spark helped us a lot. If anyone has discovered :) any other suggestion in the past year, I would love to hear about it.

CouchDB _changes, view related

Simple question: I would like to react to some changes in a database, but only to those changes which cause modifications in a certain view, view1. That is, I am not interested in all changes in the database, just those which affect view1. I am not talking about filters here, just about view+changes. Something like this (although this is probably not correct):
http://localhost:5984/db/_design/doc1/_view/view1/_changes
Is this at all supported by CouchDB? Does this makes sense at all?
It's possible, but in a slightly different way. Since the 1.1.0 release, CouchDB can use a map function as a filter for the changes feed. This works like regular filters: if a key-value pair was emitted at least once for a changed document, the document passes the filter and _changes yields a record for it. If you only need new updates for a specific view, specify a starting since sequence number - it can easily be retrieved from the _design/ddoc-name/_info resource, in the view_index/update_seq field. Since the 1.3 release you may also specify since=now to listen for updates from the current point in time.
Note that these view filters don't use the view index and don't update it when new changes occur. There is also a set of patches that improves view filters in ways you may find interesting.
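Putting both steps together, a sketch using the names from the question (_design/doc1, view1, database db; any HTTP client works in place of requests):

import requests

base = "http://localhost:5984/db"

# 1. Read the view's current update_seq from the design doc's _info resource.
info = requests.get(base + "/_design/doc1/_info").json()
since = info["view_index"]["update_seq"]

# 2. Ask _changes to run view1's map function as the filter (CouchDB 1.1+).
changes = requests.get(base + "/_changes", params={
    "filter": "_view",
    "view": "doc1/view1",
    "since": since,
}).json()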

How does solr work with data split into different services and therefore not synchronously available?

Take, for instance, an ecommerce store with catalog and price data in different web services. Now, we know that Solr does not allow partial updates to a document field (JIRA bug), so how do you index these two services?
I came up with three possibilities, but I'm not sure which one is correct:
Partial update - not possible
Solr join - keep price and catalog in separate indexes and join them in Solr. You can't join them in your client-side code without screwing up pagination and facet counts. I don't know if this is possible pre-Solr 4.0.
Have some sort of intermediate indexing service which composes an entire document based on the results from both these services and sends it for indexing. There are two ways to do this, each with its own problem:
3.1 You can compose documents partially, and then, when a document is complete, set a flag indicating that it is a complete document. However, this means that each time a document has to be indexed, the service first has to check whether the document exists in the index, edit it, and push it back. So: big performance hit.
3.2 Your intermediate service checks whether a particular id is available from all services - if not, it silently drops the document and hopes that by the time the id appears in the other service, the first service will already be populated. This is OK, but it means an item is not available in search until all its fields are available (not always desirable - if you don't have a price, you could simply mark the item out-of-stock and still have it available).
Of all these methods, only #3.2 looks viable to me - does anyone know how you do this kind of thing with DIH? Because now you have two different entry points (two different web services) into indexing, and each has to check the other.
The usual way to solve this is close to your 3.2: write code that creates the document you want to index from the different available services. The usual flow would be to fetch all the items from the catalog, then fetch the prices while indexing. Whether you want catalog items without prices to appear in search depends on your business rules for the service. If you want to speed up the process (fetch product, fetch price, repeat), expand the API to fetch 1,000 products at a time and then the prices for all of those products in one call, as sketched below.
There is no reason to drop an item from the index just because it doesn't have a price, unless you don't want items without prices in your index. It's up to you and your particular needs which information must be available before indexing the document.
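A sketch of that batched flow; catalog_service and price_service are hypothetical stand-ins for your two web services, and solr could be e.g. a pysolr client:

BATCH = 1000

def index_all(product_ids, catalog_service, price_service, solr):
    for start in range(0, len(product_ids), BATCH):
        ids = product_ids[start:start + BATCH]
        products = catalog_service.get_products(ids)  # hypothetical bulk API
        prices = price_service.get_prices(ids)        # hypothetical bulk API
        docs = []
        for p in products:
            doc = {"id": p["id"], "name": p["name"]}
            if p["id"] in prices:
                doc["price"] = prices[p["id"]]
            else:
                doc["in_stock"] = False  # index it anyway, per the rule above
            docs.append(doc)
        solr.add(docs)  # one batched round trip per 1000 products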
As far as I remember 4.0 will probably support partial updates as it moves to the new abstraction layer for the index files, although I'm not sure it'll make your situation that much more flexible.
Approach 3.2 is the most common, though I think about it slightly differently. First, think about what you want in your search results, then create one Solr document for each potential result, with as much information as you can get. If it is OK to have a missing price, then add the document that way.
You may also want to match the documents in Solr, but get the latest data for display from the web services. That gives fresh results and avoids skew between the batch updates to Solr and the live data.
Don't hold your breath for fine-grained updates to be added to Solr and Lucene. It gets a lot of its speed from not having record-level locking and update.

How to implement CQS with in memory changes?

Having watched this video by Greg Young on DDD,
http://www.infoq.com/interviews/greg-young-ddd
I was wondering how you could implement Command-Query Separation (CQS) with DDD when you have in memory changes?
With CQS you have two repositories, one for commands, one for queries.
As well as two object groups, command objects and query objects.
Command objects only have methods, and no properties that could expose the shape of the objects, and aren't to be used to display data on the screen.
Query objects on the other hand are used to display data to the screen.
In the video the commands always go to the database, and so you can use the query repository to fetch the updated data and redisplay on the screen.
Could you use CQS with something like an edit screen in ASP.NET, where changes are made in memory and the screen needs to be updated several times with the changes before they are persisted to the database?
For example
I fetch a query object from the query repository and display it on the screen
I click edit
I refetch a query object from the query object repository and display it on the form in edit mode
I change a value on the form, which autoposts back and fetches the command object and issues the relevant command
WHAT TO DO: I now need to display the updated object, as the command made changes to the calculated fields. Since the command object has not been saved to the database, I can't use the query repository. And with CQS I'm not meant to expose the shape of the command object to display on the screen. How would you get a query object back with the updated changes to display on the screen?
A couple of possible solutions I can think of are to have a session repository, or a way of getting a query object from the command object.
Or does CQS not apply to this type of scenario?
It seems to me that in the video changes get persisted straight away to the database, and I haven't found an example of DDD with CQS that addresses the issue of batching changes to a domain object and updating the view of the modified domain object before finally issuing a command to save the domain object.
So what it sounds like you want here is a more granular command.
E.g.: the user interacts with the web page (let's say doing a checkout with a shopping cart).
The multiple pages gathering information are building up a command. The command does not get sent until the user actually checks out, at which point all the information is sent up in a single command to the domain - let's call it a "CheckOut" command.
Presentation models are quite helpful at abstracting this type of interaction.
Hope this helps.
Greg
If you really want to use CQS for this, I would say that the Query repo and the Write repo both have a reference to the same backing store. Usually this reference is via an external database - but in your case it could be a List<T> or similar.
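A toy sketch of that arrangement (all names invented; a dict stands in for the List<T>):

class InMemoryStore:
    # Shared backing store referenced by both repositories.
    def __init__(self):
        self.items = {}

class CommandRepository:
    def __init__(self, store):
        self._store = store

    def save(self, entity):
        self._store.items[entity.id] = entity  # commands mutate the shared store

class QueryRepository:
    def __init__(self, store):
        self._store = store

    def get_view(self, entity_id):
        e = self._store.items[entity_id]
        # Return a flat read model, never the command object itself.
        return {"id": e.id, "total": e.total}

store = InMemoryStore()
commands, queries = CommandRepository(store), QueryRepository(store)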
Also for the rest of your concerns ...
These are more concerns with eventual consistency than with CQRS. You do not need to be eventually consistent with CQRS: you can make the processing of the command also write to the reporting store (or use the same physical store for both, as mentioned) in a consistent fashion. I actually recommend people do this as their base architecture and later come through and introduce eventual consistency where needed, as there are costs associated with it.
In memory, you would usually use the Observer design pattern.
Actually, you always want to use this pattern but most databases don't offer an efficient way to call a method in your app when something in the DB changes.
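A minimal sketch of the Observer pattern for this (hypothetical names), where the edit screen subscribes to in-memory changes and refreshes itself on each notification:

class ChangeNotifier:
    def __init__(self):
        self._observers = []

    def subscribe(self, callback):
        self._observers.append(callback)

    def publish(self, change):
        for callback in self._observers:
            callback(change)  # push each in-memory change to all subscribers

notifier = ChangeNotifier()
notifier.subscribe(lambda change: print("refresh view:", change))
notifier.publish({"field": "total", "new_value": 42})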
The Unit of Work design pattern from Patterns of Enterprise Application Architecture matches CQS very well - it is basically a big Command that persist stuff in the database.
JdonFramework is a CQRS DDD Java framework; it supplies a domain events + asynchronous pattern. More details: https://jdon.dev.java.net/
