Lucene.NET indexing files - search

What's the best way of indexing pages? I'm adding about 50-60 new pages a day to my website.
Should I index each page when it's created, or run a scheduled job every 15 minutes and index in bulk?

I would say it depends on whether you are also updating the pages. If you can handle indexing them as they change, that is fine, but at 50-60 pages a day that volume of files shouldn't cause any problems for a scheduled index.
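For what it's worth, a scheduled bulk pass can be as simple as a timer that picks up the pages created since the last run and commits them as one batch. The sketch below uses Java Lucene purely for illustration (the Lucene.NET API mirrors it closely); the index path, the Page type, and fetchPagesCreatedSinceLastRun() are placeholders, not anything from the question.

```java
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

public class ScheduledIndexer {

    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/data/index")),       // hypothetical index path
                new IndexWriterConfig(new StandardAnalyzer()));

        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Every 15 minutes, index whatever pages were created since the last run.
        scheduler.scheduleAtFixedRate(() -> {
            try {
                for (Page page : fetchPagesCreatedSinceLastRun()) {  // hypothetical DB query
                    Document doc = new Document();
                    doc.add(new StringField("id", page.id, Field.Store.YES));
                    doc.add(new TextField("body", page.body, Field.Store.NO));
                    // updateDocument makes the job safe to re-run on the same page.
                    writer.updateDocument(new Term("id", page.id), doc);
                }
                writer.commit();   // one commit per batch, not per page
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, 0, 15, TimeUnit.MINUTES);
    }

    // Placeholders standing in for your own page model and data access.
    static class Page { String id; String body; }
    static List<Page> fetchPagesCreatedSinceLastRun() { return List.of(); }
}
```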

Related

Can StormCrawler crawl a file system rather than URLs?

Is there a way to use StormCrawler to index files on the file system rather than URLs? We have 5+ million files that need to be crawled and indexed (with Elasticsearch), and the index needs to be updated daily or more frequently. Other crawlers take 50+ hours to crawl the full file set, which makes daily (or more frequent) update cycles impossible.
There is a File protocol available in StormCrawler. If you represent the files as URIs using file://, SC should be able to handle them out of the box.
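As a rough sketch of what feeding the file protocol can look like, the snippet below walks a directory tree and writes file:// URIs to a seed list that a topology with the file protocol enabled could then inject. The root directory and the seeds.txt output path are assumptions, not part of the answer above.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class FileSeedGenerator {
    public static void main(String[] args) throws IOException {
        Path root = Paths.get("/data/documents");          // hypothetical root of the file share
        try (Stream<Path> files = Files.walk(root);
             PrintWriter out = new PrintWriter("seeds.txt")) {
            files.filter(Files::isRegularFile)
                 .map(p -> p.toUri().toString())           // yields file:///data/documents/...
                 .forEach(out::println);
        }
    }
}
```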

Lotus notes agent runs slower in server compared to development PC

I have an attendance recording system that has 2 databases, one for current, another for archiving. The server processes attendance records, and puts records marked completed into the archive. There is no processing done in the archive database.
Here's the issue. One of the requirements was to build a blank record for each staff member every day, into which attendance records are put. The agent that does this calls a few procedures and does some checking within the database. Currently, roughly 1,800 blank records are created daily. On the development PC, processing each record takes roughly 2 to 3 seconds, which translates to an average of about an hour and a half. However, when we deployed it on the server, processing each record takes roughly 7 seconds, which translates to roughly 3 and a half hours to complete. We have had instances where the agent takes 4.5 to 5 hours to complete.
Note that in both instances the agents are scheduled. There are no other Lotus applications on the server, and the server is free and idle most of the time (nothing running except Windows Server and Lotus Notes). Is there anything that could cause the additional processing time on the server compared to the development PC?
Your process is generating 1800 new documents every day, and you have said that you are also archiving documents regularly, so I presume that means that you are deleting them after you archive them. Performance problems can build up over time in applications like this. You probably have a large number of deletion stubs in the database, and the NSF file is probably highly fragmented (internally and/or externally).
You should use the free NotesPeek utility to examine the database and see how many deletion stubs it contains. Then you should check the purge interval setting and consider lowering it to the smallest value that you are comfortable with. (I.e., big enough so you know that all servers and users will replicate within that time, but small enough to avoid allowing a large buildup of deletion stubs.) If you change the purge interval, you can wait 24 hours for the stubs to be purged, or you can manually run updall against the database on the server console to force it.
Then you should run compact -c on the NSF file, and also run a defrag on the server disk volume where the NSF lives.
If these steps do improve your performance, then you may want to take steps in your code to prevent recurrence of the problem by using coding techniques that minimize deletion stubs, database growth and fragmentation.
I.e., go into your code for archiving and change it so it doesn't delete documents after archiving. Instead, have your code mark them with a field such as FreeDocList := "1". Then add a hidden view called (FreeDocList) with a selection formula of FreeDocList = "1". Also go into every other view in the database and add & (!(FreeDocList = "1")) to the selection formulas. Then change the code that adds the new blank documents so that, instead of creating new docs, it goes to the FreeDocList view, finds the first document, sets FreeDocList = "0", and clears all the previous field values. Of course, if there aren't enough documents in the FreeDocList view, your code should revert to the old behavior and create a new document.
With the above changes, you will be re-using your existing documents whenever possible instead of deleting and creating new ones. I've run benchmarks on code like this and found that it can help; but I can't guarantee it in all cases. Much would depend on what else is going on in the application.

Using Solr with frequently updated data

I have a site search I would like to implement using Solr. Unfortunately, I also have a lot of frequently updated dynamic data in my MySQL database from cron jobs, which I would also like to be searchable.
I would automatically assume that constantly updating records in Solr is not a good idea, so is there a workable solution that gives me the text-search power of Solr as well as the ability to filter on these frequently updated fields?
I think this depends on what "frequently" means and how long your tolerated Solr lag is.
In my case, I update Solr twice every minute, based on a MySQL DB that gets a few hundred updates a minute, and that works fine.
In this situation it's important NOT to run an optimize on every Solr update/commit. Better to run an optimize every n hours.
So in the end, all the new MySQL data is visible in Solr with at most a 30-second delay.
It depends on your situation whether that is acceptable.
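A minimal SolrJ sketch of that pattern, assuming a local core named mycore and a helper that returns the rows changed since the last run (both assumptions), looks like this: commit on every incremental pass, but leave optimize to a separate, much less frequent job.

```java
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IncrementalUpdater {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        for (SolrInputDocument doc : fetchRowsChangedSinceLastRun()) {   // hypothetical delta query
            solr.add(doc);
        }
        solr.commit();          // cheap enough to run every 30 seconds
        // solr.optimize();     // do NOT optimize here; schedule it every few hours instead
        solr.close();
    }

    static List<SolrInputDocument> fetchRowsChangedSinceLastRun() { return List.of(); }
}
```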

Frequent Updates to Solr Documents - Efficiency/Scalability concerns

I have a Solr index with document fields something like:
id, body_text, date, num_upvotes, num_downvotes
In my application, a document is created with some integer id and some body_text (500 chars max). The date is set to the time of input, and num_upvotes and num_downvotes begin at 0.
My application gives users the ability to upvote and downvote the content mentioned above, and the reason I want to keep track of this in Solr instead of just the DB is that I want to be able to factor the number of upvotes and downvotes into my search.
This is a problem because you can't simply update a Solr document (i.e. increment the number of upvotes); you must replace the entire document, which is probably fairly inefficient considering it would require hitting my DB to grab all the relevant data again.
I realize the solution may require a different layout of data, or possibly multiple indexes (although I don't know if you can query/score across Solr cores).
Is anyone able to offer any recommendations on how to tackle this?
A solution that I use for a similar problem is to update that information in the database and do Solr updates/inserts every ten minutes, using only the documents that were modified since the last update.
Also, every night, when I don't have much traffic, I run an index optimize.
I have also set up some warm-up queries in the Solr config that run after each import.
My Solr index holds around 1.5 million documents; each document has 24 fields and around 2,000 characters in total.
Every 10 minutes I update around 500 documents (without optimizing the index) and run around 50 warm-up queries comprised of the most common facets, the most-used filter queries, and free-text searches.
I see no negative impact on performance (at least none that is visible): my queries average 0.1 seconds (before the 10-minute updates, they averaged 0.09 seconds).
LATER EDIT:
I didn't encounter any problems during these updates. I always take the documents from the database and insert them into Solr with a unique key; if the document already exists in Solr it is replaced (this is what I mean by update).
It never takes more than 3 minutes to update Solr. I actually take a 10-minute break after each update: I start the index update, wait for it to finish, and then wait another 10 minutes before starting again.
I did not look at performance overnight, but that is not relevant for me, as I want the data to be fresh during peak visiting hours.
The Join feature would help you here: you could store the up/down votes in a separate document.
The bad news is that you need to wait until Solr 4 unless you're comfortable running with a trunk build.
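For illustration, a join query along those lines might look like the sketch below, assuming votes live in companion documents with a content_id field pointing back at the content document's id; the field names and core name are made up, not from the question's schema.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class JoinQueryExample {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/content").build();

        SolrQuery q = new SolrQuery("body_text:lucene");
        // Keep only content documents whose companion vote document has at least one upvote.
        q.addFilterQuery("{!join from=content_id to=id}type:vote AND num_upvotes:[1 TO *]");

        QueryResponse rsp = solr.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
        solr.close();
    }
}
```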
If you are only going to be updating the up/down votes, then instead of going back to the database, just use the appropriate Solr client for your application to pull the document from the index, set the up/down values as needed, and reinsert the document into the index.
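A hedged sketch of that read-modify-reinsert cycle with SolrJ, using the field names from the question (the core name and the document id are assumptions):

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class VoteUpdater {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/content").build();

        // Pull the current document by its unique key via the /get (real-time get) handler.
        SolrDocument existing = solr.getById("42");

        // Copy it into an input document, skipping Solr's internal version field.
        SolrInputDocument updated = new SolrInputDocument();
        for (String field : existing.getFieldNames()) {
            if ("_version_".equals(field)) continue;
            updated.addField(field, existing.getFieldValue(field));
        }

        // Bump the counter and reinsert; the matching id replaces the old document.
        long upvotes = ((Number) existing.getFieldValue("num_upvotes")).longValue();
        updated.setField("num_upvotes", upvotes + 1);

        solr.add(updated);
        solr.commit();
        solr.close();
    }
}
```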
There is no solution to your problem within Solr; you have a database problem and you are trying to solve it with a search engine.
The best way to deal with this is to keep a Redis database that records the document id from Solr and the up/down vote counts. Your app can then merge the data from both sources before displaying it.
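As a sketch of that split, assuming a Jedis client and a votes:<docId> hash key (both are assumptions, not part of the answer): votes are incremented in Redis as they happen, and the counts are fetched at display time for the ids Solr returned.

```java
import java.util.Map;
import redis.clients.jedis.Jedis;

public class VoteStore {
    public static void main(String[] args) {
        try (Jedis redis = new Jedis("localhost", 6379)) {
            String docId = "42";   // the Solr document id

            // Record votes as they happen; no Solr write involved.
            redis.hincrBy("votes:" + docId, "up", 1);
            redis.hincrBy("votes:" + docId, "down", 1);

            // At display time, fetch the counts for the ids Solr returned and merge.
            Map<String, String> counts = redis.hgetAll("votes:" + docId);
            System.out.println("up=" + counts.get("up") + " down=" + counts.get("down"));
        }
    }
}
```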

Solr for constantly updating index

I have a news site with 150,000 news articles. About 250 new articles are added to the database daily, at intervals of 5-15 minutes. I understand that Solr is optimized for millions of records and my 150K won't be a problem for it, but I am worried that the frequent updates will be a problem, since the cache gets invalidated with every update. On my dev server, a cold page load takes 5-7 seconds (since every page runs a few MLT queries).
Would it help if I split my index into two, an archive index and a latest index, where the archive index is updated only once a day?
Can anyone suggest ways to optimize my installation for a constantly updating index?
Thanks
My answer is: test it! Don't try to optimize yet if you don't know how it performs. Like you said, 150K is not a lot; it should be quick to build an index of that size for your tests. After that, run a couple of MLT queries from different concurrent threads (to simulate users) while you index more documents, to see how it behaves.
One setting that you should keep an eye on is auto-commit. Since you are indexing constantly, you can't commit on each document (you would bring Solr down). The value you choose for this setting lets you tune the latency of the system (how long it takes for new documents to be returned in results) while keeping the system responsive.
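The autoCommit interval itself lives in solrconfig.xml; as a sketch of the per-request equivalent, commitWithin bounds how long a newly added document can stay invisible without forcing a commit on every add. The core name and field values below are assumptions.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CommitWithinExample {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/articles").build();

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "article-12345");
        doc.addField("title", "Breaking news");

        // No explicit commit: Solr guarantees visibility within at most 60 seconds.
        solr.add(doc, 60_000);
        solr.close();
    }
}
```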
Consider using mlt=true in the main query instead of issuing per-result MoreLikeThis queries. You'll save the roundtrips and so it will be faster.
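A small SolrJ sketch of that, with made-up field and core names: mlt=true asks the MoreLikeThis component to compute similar documents for each hit inside the same request, so no per-result round trips are needed.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class InlineMltExample {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/articles").build();

        SolrQuery q = new SolrQuery("title:election");
        q.set("mlt", true);
        q.set("mlt.fl", "title,body");   // fields used to compute similarity
        q.set("mlt.count", 3);           // similar articles returned per hit

        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getResults().getNumFound() + " hits, MLT section present: "
                + (rsp.getResponse().get("moreLikeThis") != null));
        solr.close();
    }
}
```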
