Adding fields in the Xapian C++ library - xapian

Hello, I am trying out the Xapian C++ library. I come from Java, where I used Lucene, but for now I have no choice other than Xapian, so I am using it.
In Lucene we can do this:
Document doc = new Document();
doc.add(new Field("title", "stackoverflow", Field.Store.YES, Field.Index.ANALYZED));
w.addDocument(doc);
So the field "title" contains the value. But in this Xapian example there is no field name:
Xapian::Document newdocument;
newdocument.set_data(string("stackoverflow"));
How can I do the same thing in Xapian?

Xapian, unlike Lucene, does not constrain how you use document data; it simply lets you store arbitrary binary data with each document. In one sense this is a missing feature, but it also gives you more flexibility: some people use JSON, some a simple key-value serialization, and so on. The downside, of course, is that you have to decide how to serialize your data.
There is code in Omega which uses a simple key-value serialization that may be helpful. Alternatively, you could look at something like restpose, which gives a higher-level approach to search built on top of Xapian, and is more comparable to Solr.
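As a rough illustration of the "simple key-value serialization" idea, here is a minimal sketch. It is written in Java only because that is the asker's background; the same logic carries over to C++ with std::string, and the resulting string is what you would pass to set_data(). The class FieldData and the one-"name=value"-per-line layout are assumptions made for this example, not Omega's actual code. Note that document data is only stored, so making the title searchable is a separate indexing step.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: packs named fields into one string, one "name=value" per line.
final class FieldData {
    // Serialize fields as "name=value" lines. Assumes names contain no '=' or
    // newline and values contain no newline; a real format would need escaping.
    static String serialize(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    // Parse the string back into fields, splitting each line at the first '='.
    static Map<String, String> parse(String data) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : data.split("\n")) {
            int eq = line.indexOf('=');
            if (eq > 0) {
                fields.put(line.substring(0, eq), line.substring(eq + 1));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "stackoverflow");
        String data = serialize(fields); // this string would be the document data
        System.out.println(parse(data).get("title"));
    }
}
```

When displaying a search result you would read the document data back and call something like parse() to recover the "title" field.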

Related

Hazelcast Portable serialization

I want to use Portable serialization for objects stored in an IMap to achieve:
fast indexing during insertion (without deserializing objects and without reflection)
class evolution (versioning)
Is it possible to store my classes without implementing the Portable interface?
Is it possible to store 3rd party classes like Date or BigDecimal (or classes with a nested structure) which cannot implement the Portable interface, while still being indexable?
You can achieve fast indexing using Portable, yes. You'll also see benefits when you're querying on non-indexed fields, since there'll be no full deserialization. VersionedPortable supports versioning as well, but:
You must implement the Portable interface.
For types that aren't supported by Portable, you need to convert the data to a supported format, for example Long for a date. You also need to code serialization/deserialization for each property and handle versioning yourself.
Portable is backward compatible only for reads. If you update the data from an app that has an older version, you'll lose the updates to new fields made previously by an app that has a newer version of the Portable object.
So, depending on your exact requirements, you need to choose the correct serialization format.
If versioning is not so important or you can handle it manually, but query performance is, then yes, Portable makes sense. But if you're planning to use versioning heavily, I would suggest using a backward/forward compatible serialization format like Google Protocol Buffers.
You can check this example to get an idea: https://github.com/gokhanoner/data-versioning-protobuf
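The "convert the data to a supported format" step mentioned above could look roughly like this. It is a plain-Java sketch with no Hazelcast calls in it; the class name PortableFriendlyValues is hypothetical, and storing BigDecimal as text is one assumed option among several. A value stored as text compares lexicographically, so it would not behave like a number in range queries.

```java
import java.math.BigDecimal;
import java.util.Date;

// Hypothetical converters between rich types and simple types (long, String)
// that a field-by-field serializer can write directly.
final class PortableFriendlyValues {
    // Date -> milliseconds since the epoch; keeps ordering, so range queries still work.
    static long toEpochMillis(Date date) {
        return date.getTime();
    }

    static Date fromEpochMillis(long millis) {
        return new Date(millis);
    }

    // BigDecimal -> canonical string; BigDecimal(String) recovers value and scale.
    static String toText(BigDecimal value) {
        return value.toString();
    }

    static BigDecimal fromText(String text) {
        return new BigDecimal(text);
    }

    public static void main(String[] args) {
        Date now = new Date();
        System.out.println(fromEpochMillis(toEpochMillis(now)).equals(now));
        System.out.println(fromText(toText(new BigDecimal("12.50"))));
    }
}
```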

Field.Store and Field.Index both set to `NO` in a Lucene document?

I am aware of what Field.Store and Field.Index mean in a Lucene document, and of the use cases where either Field.Store or Field.Index is set to NO.
But recently I came across a piece of code where both are set to NO. Could anybody explain, with an example, the use case where we need to set both of them to NO?
PS: I referred to this SO question, which explains why one is set to NO and another is set to Yes, with good use-cases, but it doesn't give answer to my question.
Lucene is a generic full-text indexing and search library; it is not a framework in itself like Elasticsearch or Solr.
So, if you are developing your own search application and using Lucene directly, you have full control over which fields from your app are indexed and/or stored in the Lucene index.
Frameworks like Elasticsearch or Solr, which are built on top of Lucene, may use a schema for indexing or may be schemaless.
I think in cases where it's schemaless, it makes sense to explicitly ignore the fields that we want neither to index nor to store.

Solr - Enriching the TermsComponent answer

I'm using Solr 3.5.0 (with WebSphere Commerce). While the user performs a search, Commerce uses the suggestion tool to suggest (auto-complete) search terms based on the letters already typed in the search box.
Currently WebSphere Commerce is using Solr's TermsComponent. But one of my new requirements is to be able to enrich the list of suggested terms.
Do you know if there is any way to do that, for example by creating a plain text dictionary or by using another Solr component?
Thanks for reading,
and for your help.
Regards,
Dekx.
I think a plain-text dictionary probably wouldn't be a usable data source (even if you could use it, searching linearly through a plain-text file would probably be too slow). If you create an index from your dictionary, you could probably incorporate it in the TermsComponent as a shard (see the TermsComponent documentation, under the heading "Distributed Search Support").
I don't believe TermsComponent supports searching multiple fields, so you'll want to make sure the same field name is used for the terms in the dictionary that you want to use (that is, if you are looking at the "name" field in the index, then create a "name" field in your indexed dictionary as well, rather than a "dictionaryentry" field).
To my mind, though, I fail to see what the value of this would be. Generally, TermsComponent is intended to look at the terms available in the index on that field. "Enriching" it with more data would just provide suggestions that the search won't actually be able to find. Of course, I don't really know about your search implementation, but in most cases that would certainly be my thought.

Does a B Tree work well for auto suggest/auto complete web forms?

Auto suggest/complete fields are used all over the web. Google appears to have mastered it, given that as soon as one types in a search query, suggestions are returned almost instantaneously.
I'm assuming the framework for achieving this involves a fast, in-memory data store on the web tier. We're building a Grails app based around retail products, so a user may search for Can, which should suggest things like Canon, Cancun, etc., and we are wondering if a Java B-tree cached in memory would suffice for quick auto completes returned as JSON over AJAX. Outside of the jQuery AutoComplete field, do any frameworks and/or libraries exist to facilitate the development of this solution?
Autocomplete is a text matching, information retrieval problem. Implementing your own B-tree and writing your own logic to match words to other words is something you could do. But then you would have to implement Porter Stemming, a Vector Space Model, and a String-edit distance calculation.
...or you could use Lucene and its derivatives, which do a lot of this stuff already. If you really care about the data structures used to store this stuff, you could dive into its source. But I highly doubt writing your own and doing it all yourself would be more maintainable and efficient in the long run.
One of the more popular Grails ecosystem plugins for this is Searchable, which was mentioned in Ledbrook & Smith's Grails in Action. It uses Lucene under the covers, and makes it pretty easy to add full-text search to your domain classes. (For example, check out chapter 8 in GinA or the Searchable docs).
The Grails Richui plugin has an autocomplete that I've used in the past. We had it hooked up to hit the database on every keystroke (which I would not suggest, but our data changed often enough that real-time data was required). If your list of things is pretty static, though, then it could probably work well for you.
http://grails.org/plugin/richui#AutoComplete
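For a sense of what the in-memory option from the question involves, here is a minimal prefix-lookup sketch over a sorted set. It uses java.util.TreeSet, which is a red-black tree rather than a B-tree, and it does plain prefix matching only: no stemming, ranking, or typo tolerance, which is the part Lucene-based tools add. The class name PrefixSuggester and its methods are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical in-memory suggester: keeps terms sorted and scans from the prefix onward.
final class PrefixSuggester {
    private final TreeSet<String> terms = new TreeSet<>();

    // Store terms lower-cased so lookups are case-insensitive.
    void add(String term) {
        terms.add(term.toLowerCase(Locale.ROOT));
    }

    // Return up to 'limit' terms starting with 'prefix', in alphabetical order.
    // In a sorted set all terms sharing a prefix are adjacent, so the scan can
    // stop at the first term that no longer matches.
    List<String> suggest(String prefix, int limit) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (String term : terms.tailSet(p, true)) {
            if (!term.startsWith(p) || matches.size() >= limit) {
                break;
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        PrefixSuggester suggester = new PrefixSuggester();
        suggester.add("Canon");
        suggester.add("Cancun");
        suggester.add("Camera");
        System.out.println(suggester.suggest("Can", 10));
    }
}
```

The returned list could then be serialized to JSON by whatever the web tier already uses for AJAX responses.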

Using Lucene like a relational database

I am just wondering if we could achieve some RDBMS capabilities in lucene.
Example:
1) I have 10,000 project documents (pdf files) which have to be indexed with their content to make them available for search.
2) Every document is related to a SINGLE PROJECT. The project can contain details like project name, number, start date, end date, location, type etc.
I have to search in the contents of the pdf files for a given keyword, but while displaying the results I want to display the project meta data as mentioned in point (2).
My idea is to associate a field called projectId with each pdf file while indexing. Once we get that back from a search, we will run a second search to get the project meta data.
This way we could avoid duplicated data. Also, if we want to update the project meta data, we will end up updating it in a SINGLE PLACE only. Otherwise, if we store this meta data with every pdf document in the index, we will end up updating all of the documents, which is not what I am looking for.
Please advise.
If I understand you correctly, you have two questions:
Can I store a project id in Lucene and use it for further searches? Yes, you can. This is a common practice.
Can I use this project id to search Lucene for project meta data? Yes, you can. I do not know if this is a good idea. It depends on the frequency of your meta data updates and your access pattern. If the meta data is relatively static, and you only access it by id, Lucene may be a good place to store it. Otherwise, you can use the project id as a primary key to a database table, which could be a better fit.
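The two-step pattern discussed here (search hits carry only a project id, and the metadata is fetched by that id afterwards) can be sketched as follows. No Lucene or database calls are used: the Hit class stands in for a search result and the map stands in for the metadata store, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of resolving project metadata by id after a search.
final class ProjectLookup {
    // Stand-in for one search hit: the matching PDF and the project it belongs to.
    static final class Hit {
        final String fileName;
        final String projectId;

        Hit(String fileName, String projectId) {
            this.fileName = fileName;
            this.projectId = projectId;
        }
    }

    // Second step: look up each hit's project name in the metadata store.
    // The metadata lives in one place, so updating a project touches one entry.
    static List<String> describe(List<Hit> hits, Map<String, String> projectNames) {
        List<String> lines = new ArrayList<>();
        for (Hit hit : hits) {
            String name = projectNames.getOrDefault(hit.projectId, "unknown project");
            lines.add(hit.fileName + " [" + name + "]");
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<String, String> projectNames = new HashMap<>();
        projectNames.put("P1", "Bridge repair");
        List<Hit> hits = Arrays.asList(new Hit("report.pdf", "P1"));
        System.out.println(describe(hits, projectNames));
    }
}
```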
Sounds like a perfectly good thing to do. The only limitation you'll have (by storing a reference to the project in Lucene rather than the project data itself) is that you won't be able to query both the document text and project metadata at the same time, for example "documentText:foo OR projectName:bar". If you have no such requirement, then storing an ID in Lucene which refers to a database row seems like a fine thing to do.
I am not sure about your overall setup, but maybe Hibernate Search is for you. It would allow you to combine the benefits of a relational database with the power of a full-text search engine like Lucene. The meta data could live in the database, maybe together with the original pdf documents, while the Lucene documents just contain the searchable data.
This is definitely possible. But always be aware of the fact that you're using Lucene for something that it was not intended for. In general, Lucene is designed for full-text search, not for mapping relational content. So the more complex your relational content becomes, the more you'll see a decrease in performance.
In particular, there are a few areas to keep a close eye on:
Storing the value of each field in your index will decrease performance. If you are not overly concerned with sub-second search results, or if your index is relatively small, then this may not be a problem.
Also, be aware that if you are not using the default ranking algorithm, and your custom algorithm requires information about the project in order to calculate the score for each document, this will have a dramatic impact on search performance, as well.
If you need a more powerful index that was designed for relational content, there are hierarchical indexing tools out there (one developed by Apache, called Jackrabbit) that are worth looking into.
As your project continues to grow, you might also check out Solr, also developed by Apache, which provides some added functionality, such as multi-faceted search.
You can use Lucene that way.
Pros:
Full-text search is easy to implement, which is not the case in an RDBMS.
Cons:
Referential integrity: you get it for free in an RDBMS, but in Lucene, you must implement it yourself.

Resources