Can we use TrimFieldUpdateProcessorFactory to trim the values of a particular attribute (or all attributes) directly in Solr, without resyncing the full data?
No, update processors are applied when an update is performed (as the name implies). Since indexing is not necessarily a lossless process, processors can't be applied after the document has been indexed.
Reindex the document and apply the update chain with the filter active.
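For reference, such a chain in solrconfig.xml might look like the sketch below (the chain name is illustrative; with no field selector configured, TrimFieldUpdateProcessorFactory applies to all fields). You would then reindex with update.chain=trim on the update request:

```xml
<updateRequestProcessorChain name="trim">
  <!-- Trim leading/trailing whitespace; no selector means all fields are trimmed -->
  <processor class="solr.TrimFieldUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```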
This is obviously a question about ES internals.
What I have is a custom search engine built on top of ES, fed with data from multiple vendors. To find out whether a particular document has changed since the last indexing (e.g. during a periodic re-pull of documents from vendors - there's no way to ask some vendors "give me only documents changed since that date"), I'd have to check it for modification somehow and push it to ES for indexing iff the document changed.
Question: does ES keep track of document checksums internally to see if it actually needs to re-index it? (of course I'm presuming that it's not some HTML where some fields, timestamps, etc. are updated dynamically on each GET).
If it did (that is, re-indexing identical documents has negligible amortized cost), that would simplify updates for me, obviously.
If you use the Update API, you can detect no-ops: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html#_detecting_noop_updates. You can see the source code for the no-op handling here: https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/action/update/UpdateRequestBuilder. Note the "extra work" comment. That's definitely something to consider.
Keep in mind the update API tends to be a lot slower than plain vanilla bulk inserts. Regular inserts in which you let ES increment the _version number when you index a document in the same index with the same id will be faster... but they'll also create GC and indexing pressure.
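Since ES doesn't keep content checksums for you, another option is to track them client-side and only send documents whose hash has changed. A minimal sketch (how you persist the seen hashes between runs is up to you):

```python
import hashlib
import json

def content_hash(doc: dict) -> str:
    """Stable hash of a document's content (sorted keys for determinism)."""
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def changed_docs(pulled: dict, seen_hashes: dict) -> list:
    """Return IDs of pulled documents whose content differs from the last pull,
    updating seen_hashes in place."""
    out = []
    for doc_id, doc in pulled.items():
        h = content_hash(doc)
        if seen_hashes.get(doc_id) != h:
            out.append(doc_id)
            seen_hashes[doc_id] = h
    return out
```

Only the IDs returned by changed_docs then need to go into a bulk request, which sidesteps the update API's read-then-write overhead entirely.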
I'm storing documents of several different types (entity types?) in a single collection. What would be the best way to get all documents of a certain type (like you would do with select * from a table)?
Options I see so far:
Include the type as a property. But wouldn't that mean looking into every document when retrieving them?
Prepend the type name to the document id and try searching by id with typename*.
Is there a better way to do this?
There's no built-in entity-type property, but you can certainly create your own, and ensure that it's indexed. At this point, it's as straightforward as adding a WHERE clause:
WHERE docs.docType = "SomeType"
Assuming it's a hash-based index, this should provide efficient lookups and filter out unwanted document types.
While you can embed the type into a property (such as document id), you'd then have to do partial string matches, which won't be as efficient as an indexed-property comparison.
If you're curious to know what this query is costing you, the RU value is displayed both in the portal and via the x-ms-request-charge response header.
I agree with David's answer and using a single docType field is what I did when I first started using DocumentDB. However, there is another option that I started using after doing some experiments. That is to create an is<Type> field and setting its value to true. This is slightly more efficient for queries than using a single string field, because the indexes themselves are smaller partial indexes, but could potentially take up slightly more storage space.
The other advantage to this approach is that it provides advantages for inheritance and mixins. For example, I have both isLookup=true and isState=true on certain entities. I also have other lookup types. Then in my application code, some behaviors are common for all lookup fields and other behaviors are only applicable to the State type.
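For illustration, a document using this pattern might look like the following (the field names are made up for the example):

```json
{
  "id": "state-wa",
  "name": "Washington",
  "isLookup": true,
  "isState": true
}
```

A query such as `SELECT * FROM docs WHERE docs.isState = true` then hits a small partial index, and a mixin query like `WHERE docs.isLookup = true` picks up every lookup-like type at once.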
If you index the type property on the collection, it will not be a complete scan.
CouchDB has a special _all_docs view, which returns documents sorted by ID. But as IDs are random by default, that sort order is meaningless.
I always need to sort by 'date added'. Now I have two options:
Generate my own IDs and make sure they start with a timestamp
Use standard GUIDs, but add a timestamp to the JSON, and sort on that
Now the second solution is less hackish, but I suspect the first solution to be much more efficient and faster, because all queries will be done on the real row id, which is indexed.
Is it true that both solutions differ in performance? And if it's true, which one is likely to be faster or preferred?
Is it true that both solutions differ in performance?
Your examples describe the primary and secondary index approaches in CouchDB.
_all_docs is the only primary index and is always up to date. Secondary indexes (views), as in your second solution, are updated when they are requested.
That's the reason why, from the requester's point of view, _all_docs might seem "faster". In reality there is no difference when requesting an already up-to-date index. Two workarounds for potentially outdated views (secondary indexes) are the query parameter stale=update_after (return the current index immediately and update the view after responding) or so-called "view heaters" (a simple HTTP GET to the view to trigger the update process).
And if it's true, which one is [...] preferred?
The capabilities for building a useful index and response payload are significantly greater with secondary indexes.
When you want to use the primary index, you have to "design" your IDs as you have described. You can imagine that this heavily pre-determines what else can be done with the docs and their IDs.
My recommendation would be to use secondary indexes (views). Only if you need data in real time, or in high-concurrency scenarios, should you also consider the primary index as the best fit for requesting data.
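As a sketch of the view approach, assuming each document carries an added_at timestamp field (the design doc, view, and field names are illustrative):

```json
{
  "_id": "_design/sorting",
  "views": {
    "by_added": {
      "map": "function (doc) { if (doc.added_at) { emit(doc.added_at, null); } }"
    }
  }
}
```

Querying /db/_design/sorting/_view/by_added?include_docs=true then returns documents sorted by date added, with descending=true for newest-first.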
Using /_changes?filter=_design I can get all the changes for design documents.
How do I get all the changes for documents only?
Is there such a thing like /_changes?filter=_docs_only ???
There is no built in filter for this. You will need to write your own filter function (http://couchdb.readthedocs.org/en/latest/couchapp/ddocs.html#filterfun) that excludes design documents (check the doc's _id for "_design/", etc.) from the feed. You then reference this filter function when you query the changes feed (http://couchdb.readthedocs.org/en/latest/api/database/changes.html?highlight=changes). However, most applications don't run into this too often since design documents are typically only updated when there is an application change.
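A minimal filter function of that kind might look like this, stored in a design document (the design doc and filter names here are made up):

```json
{
  "_id": "_design/app",
  "filters": {
    "docs_only": "function (doc, req) { return doc._id.indexOf('_design/') !== 0; }"
  }
}
```

You would then query the feed as /db/_changes?filter=app/docs_only.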
It would probably be more efficient to implement this filter on the client side instead of streaming all your changes through the couchjs process (always inefficient). As your application loops through the changes, simply check whether each change refers to a design doc.
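The client-side version of that check is a one-liner over the rows the _changes feed returns (a sketch; rows are the parsed JSON "results" entries):

```python
def is_design_doc(change_row: dict) -> bool:
    """True if a _changes feed row refers to a design document."""
    return change_row["id"].startswith("_design/")

def doc_changes(rows: list) -> list:
    """Keep only the rows for regular (non-design) documents."""
    return [row for row in rows if not is_design_doc(row)]
```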
Cheers.
Take, for instance, an e-commerce store with catalog and price data in different web services. Now, we know that Solr does not allow partial updates to a document field (JIRA bug), so how do you index these two services?
I see three possibilities, but I'm not sure which one is correct:
Partial update - not possible
Solr join - have price and catalog in separate indexes and join them in Solr. You can't join them in your client-side code without screwing up pagination and facet counts. I don't know if this is possible pre-Solr 4.0.
Have some sort of intermediate indexing service, which composes an entire document based on the results from both these services and sends it for indexing. However, there are two problems with this approach:
3.1 You can still compose documents partially, and then, when the document is complete, set a flag indicating that this is a complete document. However, to do this, each time a document has to be indexed it first has to check whether the document exists in the index, edit it, and push it back. So, big performance hit.
3.2 Your intermediate service checks whether a particular id is available from all services - if not, it silently drops it and hopes that by the time it appears in the other service, the first service will already be populated. This is OK, but it means that an item is not available in search until all fields are available (not always desirable - if you don't have a price, you can simply set it to out-of-stock and still have it available).
Of all these methods, only #3.2 looks viable to me - does anyone know how you do this kind of thing with DIH? Because now, you have two different entry points (2 different web services) into indexing and each has to check the other
The usual way to solve this is close to your 3.2: write code that creates the document you want to index from the different available services. The usual flow would be to fetch all the items from the catalog, then fetch the prices when indexing. Whether you want catalog items without prices to appear in the search depends on your business rules for the service. If you want to speed up the process (fetch product, fetch price, repeat), expand the API to fetch 1000 products and then prices for all those products at the same time.
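A sketch of that compose-then-index flow, with the two service calls replaced by plain dicts and illustrative field names (the out-of-stock flag follows the suggestion above of indexing items even when the price feed hasn't caught up):

```python
def compose_documents(catalog: dict, prices: dict) -> list:
    """Merge catalog and price feeds into complete, Solr-ready documents.

    Items without a price are still indexed, flagged as out of stock,
    rather than being dropped from the index.
    """
    docs = []
    for item_id, item in catalog.items():
        doc = {"id": item_id, **item}
        if item_id in prices:
            doc["price"] = prices[item_id]
            doc["in_stock"] = True
        else:
            doc["in_stock"] = False
        docs.append(doc)
    return docs
```

The resulting list can be pushed to Solr in one batch, so there is a single entry point into indexing and neither service has to check the other.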
There is no reason why you should drop an item from the index if it doesn't have price, unless you don't want items without prices in your index. It's up to you and your particular need what kind of information you need to have available before indexing the document.
As far as I remember 4.0 will probably support partial updates as it moves to the new abstraction layer for the index files, although I'm not sure it'll make your situation that much more flexible.
Approach 3.2 is the most common, though I think about it slightly differently. First, think about what you want in your search results, then create one Solr document for each potential result, with as much information as you can get. If it is OK to have a missing price, then add the document that way.
You may also want to match the documents in Solr, but get the latest data for display from the web services. That gives fresh results and avoids skew between the batch updates to Solr and the live data.
Don't hold your breath for fine-grained updates to be added to Solr and Lucene. They get a lot of their speed from not having record-level locking and updates.