Lifecycle of IndexSearcher in Solr

I would like to better understand the lifecycle of IndexSearcher in Solr. For Lucene, the IndexSearcher documentation recommends: "For performance reasons, if your index is unchanging, you should share a single IndexSearcher instance across multiple searches instead of creating a new one per-search." (Lucene 4.6.1).
But in the Solr world, which is a Java webapp with a servlet dispatcher, do we also keep reusing the same IndexSearcher instance as long as the index is not changing?
I see that "Hossman" gave a talk on the lifecycle of a Solr search request, but he doesn't mention how the IndexSearcher is handled or cleaned up.

The reason is that you don't interact with Lucene's IndexSearcher directly. That is Solr's responsibility, and it is kept away from you as a user of Solr.
The only place you indirectly interact with the searcher when using Solr is through the auto commit settings.
When issuing a commit or an optimize, the current index reader is closed and a new one is opened. When issuing those commands, you have the option of telling Solr to wait until a new searcher has been opened before the call returns (through waitSearcher), so you can be sure that your changes have become visible.
Other than that, the lifecycle of the searcher (which can be summed up as "stay alive until changes to the index have been made, then open a new one") is completely hidden from the outside world.
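For illustration, an explicit hard commit that waits for the new searcher can be issued like this (the core name 'mycore' is a placeholder), and the same visibility behaviour can be configured in solrconfig.xml via autoCommit, where openSearcher controls whether the commit opens a new searcher:

/solr/mycore/update?commit=true&waitSearcher=true

<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>true</openSearcher>
</autoCommit>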

Related

Is it possible to cut over to a new Solr index in sub-second time?

Something I have been thinking about for a while. Let's say I have a Solr implementation with a very large index, and the index has to be rebuilt nightly because new data is imported daily. Can I have a job that indexes the new data into an index that is "off-line" and then cut over to the new index once it has been fully built? This would essentially mean my search index would only ever be searched and never updated in real time; it would only change when the new index was cut over.
Thanks in advance for any/all replies.
-- MG
Let's look at the two main possible scenarios:
Single Solr Instance
You create two cores: A and B.
A is online.
You re-index B (offline).
When ready, you swap them [1]:
/solr/admin/cores?action=SWAP&core=A&other=B
N.B. your search client will always point to A.
SolrCloud architecture
You create two collections: A and B.
You assign an alias to A [2]:
/admin/collections?action=CREATEALIAS&name=online_search&collections=A
N.B. your search client will access the 'online_search' endpoint.
You re-index collection B.
When ready, you assign the alias to B [2]:
/admin/collections?action=CREATEALIAS&name=online_search&collections=B
Now collection A is offline.
[1] https://cwiki.apache.org/confluence/display/solr/CoreAdmin+API#CoreAdminAPI-SWAP
[2] https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-CREATEALIAS:CreateorModifyanAliasforaCollection
In that case you need to create two cores:
SearchCore - for searching
IndexingCore - for indexing
When indexing has completed successfully in IndexingCore, you swap IndexingCore with SearchCore:
http://localhost:8983/solr/admin/cores?action=SWAP&core=IndexingCore&other=SearchCore
After this, SearchCore will point to IndexingCore's data directory and vice versa. You can then unload IndexingCore so that it does not consume memory:
http://localhost:8983/solr/admin/cores?action=UNLOAD&core=IndexingCore
I would address this with aliases (assuming you are using SolrCloud):
Say your collection is called 'current'.
You create an alias 'a_current' pointing to 'current'.
Your client code does not call the 'current' collection itself; instead it just calls 'a_current'.
Whenever you need to, you create a new collection with the new data, say 'current_2'.
In one single operation, without downtime, using the same CREATEALIAS command as before, you point 'a_current' to 'current_2'.
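For illustration, the two alias calls would look like this (using the collection and alias names from the steps above):
/admin/collections?action=CREATEALIAS&name=a_current&collections=current
/admin/collections?action=CREATEALIAS&name=a_current&collections=current_2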
If you're not talking about cores or collections on the same cluster or Solr server, don't use Solr to distribute the requests (that would require you to keep a dedicated Solr server online just to use it as a sharding endpoint without doing anything useful).
Use a regular HTTP load balancer and point it to the active Solr server. Be sure to use proper warming queries on your Solr server with the fresh index before switching the load over to it (to avoid slow queries just when the server comes online). A load balancer might also be able to send queries to both nodes (but only return the response from your primary server) to let you dynamically warm the new server while still serving requests from the old one.
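A hedged sketch of such warming queries, configured as a firstSearcher/newSearcher listener in solrconfig.xml (the query below is just a placeholder; use queries that exercise your real filters and sorts):

<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="rows">10</str></lst>
  </arr>
</listener>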

How to use a mongodb cursor with node.js

Let's say that I have a collection in my database called rabbits. My app uses this database and currently there are multiple users using my app. The users want to see the rabbits one by one; when they start the app they see 1 rabbit and then they press 'next' to see the next one, and so on.
I don't want to query the database every time the user presses next, so I decided to use cursors. I am thinking of creating a simple map data structure (working as a cache) that maps a user to its cursor. So before querying the database again we simply check in the map first.
Is this good practice? Should I perhaps use Redis here instead?
There are probably a million answers to this question and most would be correct. Just some possibilities:
Of course you can use Redis and read it from memory.
You can also downgrade a bit and use something like node-cache, which has less overhead and is simpler to implement.
You can take the cursor --> array --> JSON route: if you are not worried about constant new rabbits (after all, rabbits do multiply fast :) then you can write the rabbits out to a JSON file and pick it up as the client pages through it.
You can of course aggregate your MongoDB cursor... or have a cron job run every few minutes to create a new rabbit-pick cursor.
On and on it goes.
The critical thing is to match what you decide with the services, memory and cores on your server(s).

Updating lucene index frequently causing performance degrade

I am trying to add Lucene.net to my project, where searching involves increasingly complicated data, but the underlying tables are modified frequently (new data is inserted, or fields that are used in the Lucene index are changed).
Is it a good idea to use Lucene.net for searching here?
How can I find the modified fields and update the specific Lucene documents that were already indexed? When the Lucene index contains documents that have been deleted from the table, how can I remove them from the index?
Right now, while the page loads, I:
remove indexed documents that are no longer present in the table, matched on a unique field
insert a document if it does not exist in the index, otherwise update every indexed document matching the table's unique field
Loading the page now takes more time than normal because of these remove/insert/update calls.
How should I proceed?
Lucene is absolutely suited for this type of feature. It is completely thread-safe... IF you use it the right way.
Solution pointers
Create a single IndexWriter and keep it in a globally accessible singleton (either a global static variable or via dependency injection). IWs are completely threadsafe. NEVER open multiple IWs on the same folder.
Perform all updates/deletes via this singleton. (I had one project doing 100's of ops/second with no issues, even on slightly crappy hardware).
Depending on the frequency of change and the latency acceptable to the app, you could:
Send an update/delete to the index every time you update the DB
Keep a "transaction log" or queue (probably in the same DB) of changed rows and deletions (which are are to track otherwise). Then update the index by consuming the log/queue.
To search, create your IndexSearcher with searcher = new IndexSearcher(writer.GetReader()). This is part of the NRT (near real time) pattern. NEVER create a separate IndexReader on an index folder that is also open by an IW.
Depending on your pattern of usage you may wish to introduce a period of "latency" between changes happening and those changes being "visible" to the searches...
Instances of IS are also threadsafe. So you can also keep an instance of an IS through which all your searches go, then recreate it periodically (e.g. with a timer) and swap it in using Interlocked.Exchange (see the sketch after this list).
I previously created a small framework to isolate this from the app and make it reusable.
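A minimal sketch of the singleton-writer plus NRT-searcher pattern, written here in Java Lucene syntax (the Lucene.NET 4.8 API mirrors these classes closely); the index path, analyzer and timer-driven refresh are assumptions, and SearcherManager does the swap bookkeeping the list above describes doing by hand:

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherFactory;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.store.FSDirectory;

public final class SearchService {
    // One IndexWriter per index folder for the whole process; never open a second one.
    private final IndexWriter writer;
    // SearcherManager hands out near-real-time searchers backed by the writer.
    private final SearcherManager searcherManager;

    public SearchService(String indexPath) throws Exception {
        writer = new IndexWriter(FSDirectory.open(Paths.get(indexPath)),
                                 new IndexWriterConfig(new StandardAnalyzer()));
        searcherManager = new SearcherManager(writer, new SearcherFactory());
    }

    // Call from a timer; this controls how quickly index changes become visible.
    public void refresh() throws Exception {
        searcherManager.maybeRefresh();
    }

    // Acquire a searcher for a query; always pair with release().
    public IndexSearcher acquire() throws Exception {
        return searcherManager.acquire();
    }

    public void release(IndexSearcher searcher) throws Exception {
        searcherManager.release(searcher);
    }

    public void close() throws Exception {
        searcherManager.close();
        writer.close();
    }
}

All updates and deletes go through the shared writer; all searches acquire and release a searcher from the manager.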
Caveat
Having said that... hosting this inside IIS does raise some problems. IIS will occasionally restart your app. It will also (by default) start the new instance before stopping the existing one, then swap them (so you don't see the startup time of the new one).
So, for a short time there will be two instances of the writer (which is bad!)
You can tell IIS to disable "overlapping" or increase the time between restarts. But this will cause other side-effects.
So you are actually better off creating a separate service to host your Lucene bits. A simple self-hosted WebAPI Windows service is ideal and pretty simple. This also gives you better control over where the index folder goes and the ability to host it on a different machine (which isolates the disk IO load). And it means that the service can be accessed from other parts of your system, tested separately, and so on.
Why is this "better" than one of the other services suggested?
It's a matter of choice. I am a huge fan of ElasticSearch. It solves a lot of problems around scale and resilience. It also uses the latest version of Java Lucene which is far, far ahead of lucene.net in terms of capability and performance. (The same goes for the other two).
BUT, ES and Solr are Java (which may or may not be an issue for you). AzureSearch is hosted in Azure which again may or may not be an issue.
All three will require climbing a learning curve and will require infrastructure support or external third party SaaS commitment.
If you keep the service in-house and in C#, it stays simple, you keep control over its capabilities, and the shape of the API can be tuned to your needs.
No "right" answer. You'll have to make choices based on your situation.
You should preferably be indexing according to some schedule (periodically). The easiest approach is to keep the date of the last index run and then query for all the changes since then, indexing new records and updating or removing existing ones. In order to keep track of removed entries in the database you will need a log of deleted records with the date each was removed. You can then query using that date to find what needs to be removed from the Lucene index.
Now simply run that job every 2 minutes or so.
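A hedged outline of such a sync job, again in Java Lucene terms; the 'id' field name and the fetchChangedRowsSince/fetchDeletedIdsSince helpers are hypothetical stand-ins for queries against your "last modified" and "deleted log" tables:

import java.time.Instant;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class IndexSyncJob {
    private final IndexWriter writer;   // the shared singleton writer
    private Instant lastRun = Instant.EPOCH;

    public IndexSyncJob(IndexWriter writer) { this.writer = writer; }

    // Run this from a scheduler every couple of minutes.
    public void run() throws Exception {
        Instant now = Instant.now();
        for (Document doc : fetchChangedRowsSince(lastRun)) {
            // updateDocument inserts the doc, or replaces any doc with the same unique id.
            writer.updateDocument(new Term("id", doc.get("id")), doc);
        }
        for (String deletedId : fetchDeletedIdsSince(lastRun)) {
            writer.deleteDocuments(new Term("id", deletedId));
        }
        writer.commit();
        lastRun = now;
    }

    // Hypothetical DAO calls; replace with queries against your own tables.
    private List<Document> fetchChangedRowsSince(Instant since) { return List.of(); }
    private List<String> fetchDeletedIdsSince(Instant since) { return List.of(); }
}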
That said, Lucene.net is not really suited for web applications; you should consider using ElasticSearch, Solr or AzureSearch. Basically, a server that can handle load and multi-threading better.

Most efficient way to create couchdb views

My CouchDB view indexes are being created slower than I would like. Writing the documents is not such a problem but the users can edit them offline and then bulk update, which seems to slow things right down.
This answer helped, but I was just wondering: is it better to separate the various views into different design documents (Eg. 1) or to store them all in one (Eg. 2)?
Eg. 1
_design/posts/_view/id
_design/comments/_view/id
_design/tags/_view/id
Eg.2
_design/webresources/_view/_id?key="posts"
_design/webresources/_view/_id?key="comments"
_design/webresources/_view/_id?key="tags"
*This example is just for illustration purposes. I am only concerned with the time it takes to build the indexes.
You will gain better performance if you read often. CouchDB views are updated and built at read time, so you can read the view every time a document updates to keep it hot*.
Or maybe listen to the changes feed and keep track of updated documents. Once they reach a certain threshold, read the view.
Another option is to use the stale parameter:
If stale=ok is set, CouchDB will not refresh the view even if it is stale; the benefit is improved query latency. If stale=update_after is set, CouchDB will update the view after the stale result is returned.
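For illustration, the stale parameter goes on the view query string (the database name 'mydb' is a placeholder; the design document and view names match Eg. 1 above):

GET /mydb/_design/posts/_view/id?stale=ok
GET /mydb/_design/posts/_view/id?stale=update_after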
Every design document is a separate Erlang process, so separating your views across different design documents will cause them to be built concurrently. However, each view will still be built in a blocking manner. That is, two views in different design documents can start updating at the same time, but the time it takes to update an individual view will be the same as if they were in the same design document.
*You don't necessarily have to care about the result. The goal here is to trick CouchDB into updating the view, so you can fire off the request in a separate async process and be done with it.

How ManifoldCF job scheduling behaves?

I am working on integrating ManifoldCF (MCF) with Alfresco CMS as a repository connector using a CMIS query, and using Solr as the output channel where all the indexes are stored. I am able to do this fine and can search documents in the Solr index.
Now, as part of the implementation, I am planning to introduce multiple repositories such as SharePoint, file systems etc., so I will have three document repositories: Alfresco, SharePoint and filesystem. I am planning to have scheduled jobs which run through each repository and crawl it at particular intervals. But I have the following concerns.
Although I am scheduling jobs at frequent intervals, I want to make sure that MCF jobs pick up only content that is newly added or updated. Say I have 100 docs during the current job run but 110 at the next job run; I only want the job to process the 10 new docs, not the entire 110.
As there are relatively few MCF tutorials available, I have no way to confirm that MCF jobs behave this way; I assume MCF is intelligent enough to do so, but again I have no proof to substantiate it.
I want to know more about the MCF job schedule types: scan every document once / rescan documents directly. Similarly, I want to know more about job invocation: complete/minimal. Sorry for being a newbie.
I am also considering doing some custom coding to ensure that only the latest/updated docs are eligible for processing, but again that means going through the code, since little documentation is available.
Is it wise to do custom coding in this case, or does MCF provide all these features OOTB?
Many thanks in advance.
ManifoldCF schedules the job based on what you have configured for the Job.
It depends on how your repository connector is written. Usually, when a job runs, it calls the repository connector's getDocumentVersions(); if the version of a document is different from the earlier version, ManifoldCF indexes that document, otherwise it does not. Usually your document version string is the last-modified date of the document.
Unfortunately, ManifoldCF does not have much documentation from the developer's perspective; your best bet is probably to go through the code, which is quite self-explanatory.
This is how "minimal" is presented in the MCF documentation on jobs:
Using the "minimal" variant of the listed actions will perform the minimum possible amount of work, given the model that the connection type for the job uses. In some cases, this will mean that additions and modifications are indexed, but deletions are not detected.
You should implement your logic in public String[] getDocumentVersions(..).
The OOTB features are quite enough. One additional thing to consider is document permissions: if the permissions of a document change, you can choose to change the version of the document so that it gets re-indexed.
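As a hedged illustration of that idea, the version string your connector returns from getDocumentVersions(..) can simply combine the last-modified timestamp with a hash of the document's permissions, so a permission change also changes the version (the helper below is hypothetical, not part of the MCF API):

// Hypothetical helper used inside a repository connector's getDocumentVersions(..).
// A document is re-indexed whenever the string returned here differs from the
// version string ManifoldCF stored for it on the previous crawl.
static String buildVersionString(long lastModifiedMillis, java.util.List<String> acls) {
    // Sort so that ACL ordering does not spuriously change the version.
    java.util.List<String> sorted = new java.util.ArrayList<>(acls);
    java.util.Collections.sort(sorted);
    return lastModifiedMillis + "+" + String.join(",", sorted).hashCode();
}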

Resources