solr or alternatives to index xml annotated text files

Is it possible to index annotated XML files in Lucene and search them in Solr?
Thanks

Well, if your question is whether either Solr or Lucene can parse XML and index it taking into account its XML structure (making a distinction between the text of the tags and the text inside the body of those tags), then the answer is no, they cannot.
What you need to do, if you want to use either of them, is to create your own XML parser, extract the needed data from the XML file, and index it as Lucene or Solr documents. Once you do that, the documents will be searchable on the fields you declared.
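As a rough illustration of that approach, here is a minimal sketch assuming Lucene 3.x and a simple flat XML layout; the file name, tag names, and field names are made up for the example:

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XmlIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35));
        IndexWriter writer = new IndexWriter(FSDirectory.open(new File("index")), cfg);

        // Parse the annotated XML yourself; Lucene only ever sees plain field values.
        org.w3c.dom.Document xml = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File("annotated.xml"));   // hypothetical file
        NodeList entries = xml.getElementsByTagName("entry");             // hypothetical tag name

        for (int i = 0; i < entries.getLength(); i++) {
            Element entry = (Element) entries.item(i);
            Document doc = new Document();
            // One Lucene field per piece of XML structure you care about.
            doc.add(new Field("annotation", entry.getAttribute("type"),
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("body", entry.getTextContent(),
                    Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
        }
        writer.close();
    }
}
```

Once the tag text and body text live in separate fields like this, they can be queried independently, which is the effect the question is really after.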
I recommend using Solr. It uses a bit more resources than a direct Lucene implementation (a bit more RAM, though all of this is configurable through Solr's parameters), but it is far easier to develop against than raw Lucene.

Related

Field.Store and Field.Index both set to `NO` in a Lucene document?

I am aware of what Field.Store and Field.Index mean in a Lucene document, and of the use cases where either Field.Store or Field.Index is set to NO.
But recently I came across a piece of code where both are set to NO. Could anybody explain, with an example, the use case where we need to set both of them to NO?
PS: I referred to this SO question, which explains with good use cases why one is set to NO and the other to YES, but it doesn't answer my question.
Lucene is a generic full-text indexing and search library; it is not a framework in itself like Elasticsearch or Solr.
So, if you are developing your search application and using Lucene directly, then you have full control from your app over which fields to index and/or which fields to store in the Lucene inverted index.
Frameworks like Elasticsearch or Solr, which are built on top of Lucene, may use a schema for indexing, or they may be schemaless.
I think that in the schemaless case it makes sense to explicitly ignore the fields that we want neither to index nor to store.
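For reference, a small Lucene 3.x-style sketch of what the two flags control individually (field names are illustrative); a field that is neither indexed nor stored contributes nothing searchable or retrievable, which is why it normally only appears as an explicit "ignore this input" instruction in schema-driven or schemaless setups:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class StoreIndexExamples {
    public static Document example(String title, String body, String thumbnailPath) {
        Document doc = new Document();

        // Indexed and stored: searchable, and the original value comes back with hits.
        doc.add(new Field("title", title,
                Field.Store.YES, Field.Index.ANALYZED));

        // Indexed but not stored: searchable, but the text cannot be read back
        // from the index (typical for large bodies kept elsewhere).
        doc.add(new Field("body", body,
                Field.Store.NO, Field.Index.ANALYZED));

        // Stored but not indexed: retrievable with the hit, but never searched on.
        doc.add(new Field("thumbnailPath", thumbnailPath,
                Field.Store.YES, Field.Index.NO));

        return doc;
    }
}
```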

Solr - Enriching the TermsComponent answer

I'm using Solr 3.5.0 (with WebSphere Commerce). While performing a search, Commerce uses the suggestion tool to suggest (auto-complete) search terms based on the letters already typed in the search box.
Currently WebSphere Commerce is using Solr's TermsComponent, but one of my new requirements is to be able to enrich the list of suggested terms.
Do you know if there is any way to do that, for example by creating a plain-text dictionary, using another Solr component, ...?
Thanks for reading,
and for your help.
Regards,
Dekx.
I think a plain-text dictionary probably wouldn't be a usable data source (even if you could use it, searching linearly through a plain-text file would probably be too slow). If you create an index from your dictionary, you could probably incorporate it into the TermsComponent as a shard (see the TermsComponent documentation, under the heading "Distributed Search Support").
I don't believe TermsComponent supports searching multiple fields, so you'll want to make sure the same field name is used for the terms in the dictionary that you want to use (that is, if you are looking at the "name" field in the index, then create a "name" field in your indexed dictionary as well, rather than a "dictionaryentry" field).
To my mind, though, I fail to understand what the value of this would be. TermsComponent is generally intended to expose the terms actually available in the index for that field; "enriching" it with more data would just be offering suggestions that won't actually be found when searching. Of course, I don't really know your search implementation, but in most cases that would be my concern.
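If it helps, here is a hedged SolrJ sketch of the shard idea: query the TermsComponent handler and list both the main core and the indexed dictionary core as shards. The URLs, core names, and the "name" field are assumptions for illustration, and exact SolrJ client classes vary a little between 3.x releases:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.TermsResponse;

public class SuggestTerms {
    public static void main(String[] args) throws Exception {
        // Assumed Solr URL and core name.
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr/catalog");

        SolrQuery query = new SolrQuery();
        query.set("qt", "/terms");            // request handler exposing TermsComponent
        query.set("terms", true);
        query.set("terms.fl", "name");        // same field name in main index and dictionary
        query.set("terms.prefix", "cha");     // the letters typed so far
        query.set("terms.limit", 10);
        // Pull suggestions from the main index AND the indexed dictionary core.
        query.set("shards", "localhost:8983/solr/catalog,localhost:8983/solr/dictionary");
        query.set("shards.qt", "/terms");

        QueryResponse rsp = solr.query(query);
        TermsResponse terms = rsp.getTermsResponse();
        for (TermsResponse.Term t : terms.getTerms("name")) {
            System.out.println(t.getTerm() + " (" + t.getFrequency() + ")");
        }
    }
}
```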

Creating a web indexer in Java?

I'm supposed to write a web crawler in Java. The crawling part is easy, but the indexing part is difficult. I need to be able to query the indexer and have it return matches (multiple word queries). What would be the best data structure for doing such a thing?
Use an indexing tool such as Lucene, Solr or Compass.
The solution to the index & search step is to use an inverted index data structure, and the best available open-source package that implements this for indexing & search is Lucene.
There are also open-source projects that provide a composite solution to the crawling, indexing & searching steps which may be of interest, e.g. Nutch.
This free online book on information retrieval may help you (see chapter on constructing an inverted index).
If you're building this from scratch you should look at the inverted index data structure. If you can use something off the shelf, then look at the Nutch project.
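If you do build it from scratch, a minimal in-memory inverted index is just a map from term to the set of documents containing it, and a multi-word query intersects those sets. A rough Java sketch (the tokenization is deliberately naive):

```java
import java.util.*;

public class TinyInvertedIndex {
    // term -> ids of the documents containing that term (the "posting list")
    private final Map<String, Set<Integer>> postings = new HashMap<String, Set<Integer>>();
    private final List<String> docs = new ArrayList<String>();

    public int add(String text) {
        int docId = docs.size();
        docs.add(text);
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) continue;
            Set<Integer> list = postings.get(term);
            if (list == null) {
                list = new TreeSet<Integer>();
                postings.put(term, list);
            }
            list.add(docId);
        }
        return docId;
    }

    /** AND query: documents containing every word of the query. */
    public Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : query.toLowerCase().split("\\W+")) {
            Set<Integer> list = postings.get(term);
            if (list == null) return Collections.emptySet();
            if (result == null) result = new TreeSet<Integer>(list);
            else result.retainAll(list);       // intersect posting lists
        }
        return result == null ? Collections.<Integer>emptySet() : result;
    }

    public static void main(String[] args) {
        TinyInvertedIndex idx = new TinyInvertedIndex();
        idx.add("the quick brown fox");
        idx.add("the lazy brown dog");
        System.out.println(idx.search("brown dog"));   // prints [1]
    }
}
```

Lucene adds tokenization, ranking, and on-disk storage on top of exactly this kind of structure, which is why the answers above point you there rather than at a hand-rolled map.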

Using Lucene like a relational database

I am just wondering if we could achieve some RDBMS capabilities in Lucene.
Example:
1) I have 10,000 project documents (pdf files) which have to be indexed with their content to make them available for search.
2) Every document is related to a SINGLE PROJECT. The project can contain details like project name, number, start date, end date, location, type etc.
I have to search in the contents of the pdf files for a given keyword, but while displaying the results I want to display the project meta data as mentioned in point (2).
My idea is to associate a field called projectId with each pdf file while indexing. Once we have that, we can fire a second search to get the project metadata.
This way we avoid duplicating data. Also, if we want to update the project metadata, we update it in a SINGLE PLACE only. Otherwise, if we stored this metadata with every pdf document's index entry, we would end up updating all of the documents, which is not what I am looking for.
Please advise.
If I understand you correctly, you have two questions:
Can I store a project id in Lucene and use it for further searches? Yes, you can. This is a common practice.
Can I use this project id to search Lucene for project meta data? Yes, you can. I do not know if this is a good idea. It depends on the frequency of your meta data updates and your access pattern. If the meta data is relatively static, and you only access it by id, Lucene may be a good place to store it. Otherwise, you can use the project id as a primary key to a database table, which could be a better fit.
Sounds like a perfectly good thing to do. The only limitation you'll have (by storing a reference to the project in Lucene rather than the project data itself) is that you won't be able to query both the document text and the project metadata at the same time, for example "documentText:foo OR projectName:bar". If you have no such requirement, then storing an ID in Lucene that refers to a database row is a fine thing to do.
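A hedged sketch of that pattern in Lucene 3.x terms (the field names and the `loadProject` helper and its data source are made up for illustration): index the extracted PDF text with an un-analyzed projectId field, then resolve the metadata by id after the full-text search.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

public class ProjectSearch {

    static void indexPdf(IndexWriter writer, String projectId, String extractedText) throws Exception {
        Document doc = new Document();
        // Keyword-style reference back to the project record; not tokenized.
        doc.add(new Field("projectId", projectId, Field.Store.YES, Field.Index.NOT_ANALYZED));
        // The searchable PDF text; no need to store the full content.
        doc.add(new Field("content", extractedText, Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
    }

    static void search(Directory dir, String keyword) throws Exception {
        IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
        QueryParser parser = new QueryParser(Version.LUCENE_35, "content",
                new StandardAnalyzer(Version.LUCENE_35));
        ScoreDoc[] hits = searcher.search(parser.parse(keyword), 10).scoreDocs;
        for (ScoreDoc hit : hits) {
            String projectId = searcher.doc(hit.doc).get("projectId");
            // Second lookup: the metadata lives in ONE place (a DB row or a separate index).
            System.out.println(loadProject(projectId));   // hypothetical helper
        }
        searcher.close();
    }

    static String loadProject(String projectId) {
        return "project " + projectId;   // placeholder for the real metadata lookup
    }
}
```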
I am not sure on your overall setup, but maybe Hibernate Search is for you. It would allow you to combine the benefits of a relational database with the power of a fulltext search engine like Lucene. The meta data could live in the database, maybe together with the original pdf documents, while the Lucene documents just contain the searchable data.
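To make that concrete, a rough Hibernate Search mapping sketch (entity and field names are made up, and annotation details differ slightly between versions): only the extracted text is pushed into Lucene, while the project metadata stays purely relational.

```java
import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.search.annotations.DocumentId;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;

@Entity
@Indexed
public class ProjectDocument {

    @Id @DocumentId
    private Long id;

    // Only this field goes into the Lucene index (tokenized, not stored by default).
    @Field
    private String extractedText;

    // Project metadata stays in the database; it is not mapped into the index.
    private String projectName;
    private String location;
}
```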
This is definitely possible. But always be aware that you're using Lucene for something it was not intended for. In general, Lucene is designed for full-text search, not for mapping relational content. So the more complex your relational content becomes, the more you'll see a decrease in performance.
In particular, there are a few areas to keep a close eye on:
Storing the value of each field in your index will decrease performance. If you are not overly concerned with sub-second search results, or if your index is relatively small, then this may not be a problem.
Also, be aware that if you are not using the default ranking algorithm, and your custom algorithm requires information about the project in order to calculate the score for each document, this will have a dramatic impact on search performance, as well.
If you need a more powerful index that was designed for relational content, there are hierarchical indexing tools out there (one developed by Apache, called Jackrabbit) that are worth looking into.
As your project continues to grow, you might also check out Solr, also developed by Apache, which provides some added functionality, such as multi-faceted search.
You can use Lucene that way:
Pros:
Full-text search is easy to implement, which is not the case in an RDBMS.
Cons:
Referential integrity: you get it for free in an RDBMS, but in Lucene, you must implement it yourself.

Does StackOverflow use Lucene for tagged searches?

How has SO implemented tagged search? Is it using Lucene or another open-source search-engine library for tagged searching?
What is the best way to search documents (PDF, XML, HTML, MS Word) or a database?
Searching tags is very different from searching text. A tagged search looks for an association: all questions associated with a particular tag. This can be implemented with a full-text engine where the tags are all appended into a single field, but a relational database will probably be best in this situation (assuming the tagged data is in a relational database to start with).
For searching other documents like PDF, XLS, or HTML, you need full-text search like Lucene. You'll need a parser that can extract just the relevant text from each source (i.e., separate text from markup).
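For the full-text variant mentioned above (all tags appended into one field), a small Lucene 3.x-style sketch with made-up field names:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class TagSearchSketch {

    // At index time: append all tags of a question into a single field.
    static Document questionDoc(String title, String... tags) {
        Document doc = new Document();
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
        StringBuilder allTags = new StringBuilder();
        for (String tag : tags) allTags.append(tag).append(' ');
        // Assumes the writer's analyzer splits on whitespace so each tag becomes one term.
        doc.add(new Field("tags", allTags.toString(), Field.Store.YES, Field.Index.ANALYZED));
        return doc;
    }

    // At query time: a tagged search is just an exact term lookup on that field.
    static TopDocs questionsTagged(IndexSearcher searcher, String tag) throws Exception {
        return searcher.search(new TermQuery(new Term("tags", tag)), 50);
    }
}
```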
So, yes, it is using Lucene.NET, though I'm not sure exactly how. The "best" way is a whole 'nother story.
The last time this was discussed (on the podcast) it was mentioned that Stackoverflow uses SQL Server's full-text search feature, not Lucene.
SO doesn't use Lucene.
If you want to index documents and are running Windows, then IFilters would be my first choice.
