Boost score to different fields in Lucene

I have two fields, "title" and "body", in my document, and I want to give more weight to the title field. In the latest Lucene (8.6.x), setBoost is no longer available on the Field object, so I want to know the best way to do this. I read that PerFieldSimilarityWrapper can be used, but I didn't understand how to use it, and its documentation carries the warning "WARNING: This API is experimental and might change in incompatible ways in the next release." So is it safe to use in a product that will be supported for a long time?

I too wish I understood more about why Lucene sometimes marks portions of its API as "experimental". But one thing I can tell you, which should be fairly reassuring, is that there are large portions of the API marked this way, and many haven't changed much in years and years.
For example, the class you are interested in, PerFieldSimilarityWrapper, was marked as experimental at least as far back as Lucene 4.8; see the tag on the 4.8 version on GitHub.
So, I don't think I'd be too concerned, especially because if the Lucene team ever decides to change the API, they won't immediately remove it. They typically mark the old API as deprecated (but still callable) for at least one major version.
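As a side note: since index-time field boosting (Field.setBoost) was removed in Lucene 7, the commonly recommended replacement is query-time boosting, which sidesteps the experimental similarity API entirely. Here is a minimal sketch against the Lucene 8.x Java API; the field names, search term, and the 4x boost factor are purely illustrative:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.BoostQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    // Matches in "title" count four times as much as matches in "body".
    Query title = new BoostQuery(new TermQuery(new Term("title", "lucene")), 4.0f);
    Query body = new TermQuery(new Term("body", "lucene"));
    Query combined = new BooleanQuery.Builder()
        .add(title, BooleanClause.Occur.SHOULD)
        .add(body, BooleanClause.Occur.SHOULD)
        .build();

A document matching either field is returned, but title matches contribute a boosted score, which is usually what "give more weight to the title field" means in practice.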

Related

DDD / Aggregate Root / Versioning

How do we usually deal with versioning of an aggregate root?
I was thinking along these lines (I'm in a survey-design domain).
One way to have versioning is to have an explicit method that creates a new version based on the existing one. For example, take Study (an aggregate root).
So initially we have an aggregate root, whose root-entity is Study with (business) key "ABC", version "1".
Invoking the method "newVersion()" on the Study creates a copy of that Study and of all the other entities that belong to the same aggregate root.
So basically, versioning is done by creating a separate instance (of the aggregate root). The ID is composite (business key + version).
How do we know if it's a branch, or just one version up (1.1? or 2)? I guess this simple rule would work: if there's no further version associated, then it's "one version up" (2); if there's already another version, then it's a branch (1.1).
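To make the copy-on-newVersion idea concrete, here is a hypothetical Java sketch with the composite ID (business key + version); all names are invented for illustration, and the list of questions stands in for the rest of the aggregate:

    import java.util.ArrayList;
    import java.util.List;

    public class Study {
        private final String businessKey;     // e.g. "ABC"
        private final int version;            // 1, 2, 3, ...
        private final List<String> questions; // stand-in for the other entities

        public Study(String businessKey, int version, List<String> questions) {
            this.businessKey = businessKey;
            this.version = version;
            this.questions = new ArrayList<>(questions); // copy, don't share
        }

        // Creates a brand-new instance of the aggregate; the original is untouched.
        public Study newVersion(int nextVersion) {
            return new Study(businessKey, nextVersion, questions);
        }

        public String id() { return businessKey + ":" + version; } // composite ID
    }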
Another concern: noise.
But that means we cannot work on / modify an existing version. We'd have to create a new version every time we want to make modifications to our object. Every time??? Hmmm... Doesn't sound right.
Or... we can make a rule like this, based on a flag (active / not-active, or published / un-published): if the flag is "not-active", we can modify the AR directly, without creating a new version. If the flag is "active", we have to either (a) set it to "not-active" first and then modify, or (b) create a new version and work on that version (initially set to "not-active").
Any thoughts / experience you want to share on this matter?
I think you will find things a bit confusing in researching this question, because there are two very different concepts at play:
Versioning as a concurrency control mechanism to support optimistic concurrency
Versioning as an explicit domain concept
Versioning to support Optimistic Concurrency
Optimistic concurrency is when two simultaneous transactions are allowed to start, but if they both try to modify the same data item, only the first one is permitted to proceed. See Concurrency Control for an overview of different locking strategies.
In summary, you leave versioning up to the persistence technology, because the purpose of the version is to detect simultaneous writes to the persistence layer.
When using this pattern, it's common not to keep copies of old versions at all; however, it's certainly possible to do so as an audit trail/change log.
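A common way to picture the mechanism is a guarded UPDATE: the write only succeeds if the row still carries the version you originally read. A hedged JDBC sketch; the table and column names are made up for illustration:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public final class OptimisticUpdate {
        // Returns false when someone else modified the row since we read it,
        // i.e. a concurrent write was detected and ours must be retried.
        static boolean updateTitle(Connection conn, long id, long expectedVersion,
                                   String newTitle) throws SQLException {
            String sql = "UPDATE study SET title = ?, version = version + 1 "
                       + "WHERE id = ? AND version = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, newTitle);
                ps.setLong(2, id);
                ps.setLong(3, expectedVersion);
                return ps.executeUpdate() == 1;
            }
        }
    }

ORMs such as JPA automate exactly this check when you mark a field with @Version.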
Versioning as an explicit domain concept
Based on your question, and the need to support potential branching strategies, it sounds like versioning is an explicit domain concept in your domain - i.e. the concept of a "Version" is something that your domain experts talk about, and working with versions is an important part of the ubiquitous language.
However, you raise a few different concepts which indicate that the domain needs further exploration:
Version branching
User-defined version naming/tagging (but still connected to a 'chain' of versions)
Explicit version changes (user requested) vs implicit version changes (automatic on every change)
If I understand your intent correctly, with explicit versioning the current 'active'/'live'/'tip' version is mutable and can be modified without tracking the change, until the user 'commits' it, at which point it becomes immutable and a new mutable 'live' version is created.
Some other concepts that may come up if you explore this further:
Branch merging (once you have split two branches, what happens if you want to bring them back together?)
Rolling back - if you have an old version, do you support 'undoing' one or more changes?
Given the above, you may also find some insights from the way that version control systems work both centralised (e.g. subversion) and distributed (e.g. git and mercurial), as they present an active working model of version tracking with a mixture of mutable and immutable elements.
The open questions here suggest to me that you need to explore this in more detail with your domain experts. With DDD sometimes it's easy to get lost in what you can do, but I strongly encourage you to try and understand what you need to do.
How do your users/domain experts think about the world? What kind of operations do they want to be able to do? What is the purpose of these operations towards their initial goal? Your aim is to distill the answers to these questions into a model that effectively encapsulates the processes they work with.
Edit: Considering the Modelling
Based on your comment - my first response would be to challenge the interpretation of the word 'version' when thinking about the modified questionnaire. In fact, I'd be tempted to challenge the modelling of the template/survey relationship. Consider a possible set of entities:
Template
Defines the set of questions in the questionnaire
Supports operations:
StartSurvey
Various operations to modify the questions and options in the template etc.
Survey
Rather than referencing a 'live' template, the survey would own its own questionnaire
When you call Template.StartSurvey, it returns a Survey that is prefilled with the list of questions from the template
A survey also supports modifying the questions - but this doesn't change the template it was created from
Unlike a template, a survey also maintains a list of recorded answers, and offers operations to set the answers
It probably also includes a lifecycle state: in some states answering questions is permitted, but once 'submitted' you can't modify the answers (just guessing on this one).
In this world, the survey is 'stamped out' from the template, but then lives an independent life. You can modify the questionnaire in the survey all you like, and it won't affect the template.
The trade-off here is that if you do modify the template, none of the surveys that have already been created from it would get updated - but it sounds like that might be safer for you anyway?
You could also support operations to convert a survey back into a template so that if you like the look of a modified survey, you could 'templatize' it so it could be used for future surveys.
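A hypothetical Java sketch of that template/survey split, just to show the copy semantics; every name here is invented for illustration:

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    class Template {
        private final List<String> questions = new ArrayList<>();

        void addQuestion(String q) { questions.add(q); }

        // "Stamps out" an independent survey: a copy, not a reference.
        Survey startSurvey() {
            return new Survey(new ArrayList<>(questions));
        }
    }

    class Survey {
        private final List<String> questions;                     // owned by the survey
        private final Map<String, String> answers = new LinkedHashMap<>();
        private boolean submitted = false;

        Survey(List<String> questions) { this.questions = questions; }

        // Editing the survey's questionnaire never touches the template.
        void replaceQuestion(int index, String q) { questions.set(index, q); }

        void answer(String question, String value) {
            if (submitted) throw new IllegalStateException("already submitted");
            answers.put(question, value);
        }

        void submit() { submitted = true; } // answers become immutable
    }

The key line is the defensive copy in startSurvey: after that point, template and survey evolve independently, which is exactly the trade-off described above.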

Is It Possible To Reference TFS Work Item Fields More Than Once Within The Same Work Item?

We are currently in the process of upgrading from TFS 2008 to TFS 2012. When TFS 2008 was set up, the people involved didn't understand a lot of what the work item fields were for, and we ended up with very heavily customised templates and in fact lost a lot of default fields. As part of the upgrade to 2012 we are trying to return to the out of the box templates as much as possible to ensure we get to use as many of the features as possible, however there are a small number of custom fields that we need to include for reporting purposes.
Our product development process involves a roadmap for upcoming releases which includes new work as well as bug fixes. When a bug is assigned to be worked on by the developers, we would like to be able to choose which release we're targeting the fix for - as far as I can see, Iteration is best suited for this. At the point the bug is closed, though, we would also like to track what release it was actually fixed in, since things often get bumped from one release to the next if higher-priority bugs or change requests come in. This is where we come unstuck, since I can't seem to assign Iteration to both fields such that the two show different values.
If possible we would prefer not to have global lists that have to be constantly updated with release numbers across our product range (we have around 8 different products which are constantly in development, each with their own release numbers), and leaving one of them as a text field leaves open the possibility that we will get inconsistencies in what people enter, e.g. 1.01 versus 1.1, which will show up in reporting as two different releases. As the fields are just looking up a set of values in the background, is there no way that the iteration list can be used twice? Or does someone have an alternative suggestion as to how we get round this?
What I think I'd suggest in this case is using a COPY rule on a state change event, so that when you move your work item into the Closed state, it would populate your custom field with the value currently in your Iteration field.
This would give you a snapshot of the value at the right point in time which then wouldn't be altered if the iteration was later changed, along with a history entry if it was opened & closed multiple times over its lifetime.
As iteration is time-limited and release is perpetual, there is an inherent mismatch of purpose in using iteration here. Iteration is for planning.
You would be better creating a release list with the version that you release.
If you are sprinting, for example, you may not know up front which release you will end up in before you start. If you are not sprinting, then you are just kidding yourself that you know.

what algorithm does freebase use to match by name?

I'm trying to build a local version of the Freebase search API using their quad dumps, and I'm wondering what algorithm they use to match names. As an example, if you go to freebase.com and type in "Hiking" you get
"Apo Hiking Society"
"Hiking"
"Hiking Georgia"
"Hiking Virginia's national forests"
"Hiking trail"
Wow, a lot of guesses! I hope I don't muddy the waters too much by not guessing too.
The auto-complete box is basically powered by Freebase Suggest, which is powered, in turn, by the Freebase Search service. Strings which are indexed by the search service for matching include: 1) the name, 2) all aliases in the given language, 3) link anchor text from the associated Wikipedia articles and 4) identifiers (called keys by Freebase), which include things like Wikipedia article titles (and redirects).
How the various things are weighted/boosted hasn't been disclosed, but you can get a feel for things by playing with it for a while. As you can see from the API, there's also the ability to do filtering/weighting by types and other criteria, and this can come into play depending on the context. For example, if you're adding a record label to an album, topics which are typed as record labels will get a boost relative to things which aren't (but you can still get to things of other types, to allow for the use case where your target topic hasn't had the appropriate type applied yet).
So that gives you a little insight into how their service works, but why not build a search service that does what you need since you're starting from scratch anyway?
BTW, pre-Google, the Metaweb search implementation was built on top of Lucene, so you could definitely do worse than using that as your starting point. You can read some of the details in the mailing-list archive.
They probably use an inverted index over selected fields, such as the English name, aliases and the Wikipedia snippet displayed. In your application you can achieve that using something like Lucene.
For the algorithm side, I find the following paper a good overview
Zobel and Moffat (2006): "Inverted Files for Text Search Engines".
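To make the inverted-index suggestion concrete, here is a minimal Lucene sketch (8.x Java API) that indexes a name field and serves prefix lookups of the kind an autocomplete box needs; this is a generic illustration, not Freebase's actual pipeline, and the field names and index path are invented:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.PrefixQuery;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class NameIndex {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(Paths.get("name-index"));
            try (IndexWriter writer =
                     new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                // One document per topic; aliases would go in further fields.
                doc.add(new TextField("name", "Hiking trail", Field.Store.YES));
                writer.addDocument(doc);
            }
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // StandardAnalyzer lowercases tokens, so the prefix "hik" hits "hiking".
                TopDocs hits = searcher.search(new PrefixQuery(new Term("name", "hik")), 10);
                for (ScoreDoc sd : hits.scoreDocs) {
                    System.out.println(searcher.doc(sd.doc).get("name"));
                }
            }
        }
    }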
Most likely it's a trie with lexicographical order.
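For reference, here is a minimal trie with lexicographically ordered children (a TreeMap keeps siblings sorted), which is one classic way to back a type-ahead box; a generic sketch, not Freebase's actual code:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeMap;

    class Trie {
        private final TreeMap<Character, Trie> children = new TreeMap<>();
        private boolean terminal; // true if a word ends at this node

        void insert(String word) {
            Trie node = this;
            for (char c : word.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Trie());
            }
            node.terminal = true;
        }

        // All stored words starting with the prefix, in lexicographical order.
        List<String> complete(String prefix) {
            Trie node = this;
            for (char c : prefix.toCharArray()) {
                node = node.children.get(c);
                if (node == null) return List.of();
            }
            List<String> out = new ArrayList<>();
            node.collect(new StringBuilder(prefix), out);
            return out;
        }

        private void collect(StringBuilder path, List<String> out) {
            if (terminal) out.add(path.toString());
            children.forEach((c, child) -> {
                path.append(c);
                child.collect(path, out);
                path.deleteCharAt(path.length() - 1);
            });
        }
    }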
There are a number of algorithms available: Boyer-Moore, Smith-Waterman-Gotoh, Knuth-Morris-Pratt, etc. You might also want to read up on edit-distance algorithms such as Levenshtein. You will need to play around to see which best suits your purpose.
An implementation of such algorithms is the Simmetrics library by the University of Sheffield.
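As an illustration of the edit-distance family, here is the classic two-row dynamic-programming Levenshtein implementation (a generic sketch, not taken from Simmetrics):

    // Number of single-character insertions, deletions and substitutions
    // needed to turn a into b; e.g. levenshtein("kitten", "sitting") == 3.
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }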

The best IR software for my use?

I want to take what people chat about in a chat room and do the following information retrieval:
Get the keywords
Ignore all noise words, keep mainly verbs and nouns
Perform stemming on the keywords so that I don't store the same keyword in many forms
If a synonym keyword is already stored in my storage then the existing synonym should be used instead of the new keyword
Store the processed keyword in persistent storage with a reference to the chat message it was located in and the user who uttered it
With this processed information I want to slowly get an idea of what people are talking about in chatrooms, and then use this to automatically find related chatrooms etc. based on these keywords.
My question to you is as follows: what are the best C/C++ or .NET tools for doing the above?
I partially agree with @larsmans' comment. Your question, in practice, may indeed be more complex than the question you posted.
However, simplifying the question/problem, I guess the answer could be one of Lucene's implementations: Lucene (Java), Lucene.Net (C#) or CLucene (C++).
Following the points in your question:
Lucene would take care of point 1 by using string tokenizers (you can customize them or write your own).
For point 2 you could use a TokenFilter like StopFilter so Lucene can read a list of stopwords ("the", "a", "an"...) that it should not use.
For point 3 you could use PorterStemFilter.
Point 4 is a little bit trickier, but could be done using a customized TokenFilter.
Points 1 to 4 are performed in the analysis/tokenization phase, for which an Analyzer is responsible; see the sketch below.
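A hedged sketch of such an Analyzer against a recent Lucene Java API (8.x; older versions spell some of these classes slightly differently), chaining the filters just mentioned:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.en.PorterStemFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    // Tokenize -> lowercase -> drop stopwords -> Porter-stem.
    Analyzer chatAnalyzer = new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();                               // point 1
            TokenStream stream = new LowerCaseFilter(source);
            stream = new StopFilter(stream, EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);  // point 2
            stream = new PorterStemFilter(stream);                                    // point 3
            // Point 4 (synonym folding) would slot in here, e.g. a SynonymGraphFilter
            // built from your own synonym map.
            return new TokenStreamComponents(source, stream);
        }
    };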
Regarding point 5: in Lucene you can store Documents with fields, and a document can have an arbitrary number and mix of fields. So you could create a single Document for each chat room with all its text concatenated, and have another field of the document reference the chat room it was extracted from. You will end up with a bunch of Lucene documents that you can compare, so you can compare your current chat room with others to see which is most similar to the one you are in.
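For that comparison step, Lucene also ships a MoreLikeThis helper (in the queries module) that turns an existing document's salient terms into a query. A hedged sketch, assuming an index laid out as above, with one document per chat room, a "content" field, and a stored "chatroomId" field (both names invented):

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.queries.mlt.MoreLikeThis;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;

    // Given the Lucene doc id of the current chat room, print the rooms
    // whose concatenated text is most similar to it.
    static void similarRooms(Directory dir, Analyzer analyzer, int currentRoomDocId)
            throws IOException {
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            MoreLikeThis mlt = new MoreLikeThis(reader);
            mlt.setAnalyzer(analyzer);      // needed when term vectors aren't stored
            mlt.setFieldNames(new String[] {"content"});
            mlt.setMinTermFreq(1);
            mlt.setMinDocFreq(1);
            TopDocs hits = searcher.search(mlt.like(currentRoomDocId), 10);
            for (ScoreDoc sd : hits.scoreDocs) {
                System.out.println(searcher.doc(sd.doc).get("chatroomId") + "  " + sd.score);
            }
        }
    }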
If all you want is a set of the best keywords to describe a chatroom, your needs are closer to an information extraction/automatic summarization/topic-spotting task, as @larsmans said. But you can still use Lucene for the parsing/tokenization phase.
*I referenced the Java docs, but CLucene and Lucene.Net have very similar APIs, so it won't be much trouble to figure out the differences.

Drupal search engine does not index my custom nodes!

Somebody posted a question an hour or so ago about the Drupal search engine; it went roughly like this:
I know Drupal should index anything that is returned by node_view(), but this is not happening for my custom content. Also: are there better alternatives to Drupal's built-in functionality?
As the question was removed while I was answering, and I didn't want to throw away 20 minutes of my life for nothing ;) I thought I'd re-create it here. Hope this is fine by the rules of SO! :)
The Drupal search engine is probably not the most celebrated feature of Drupal, but it is fairly solid, sophisticated and reliable. There are plenty of modules that enhance or substitute for it, but - at least in my experience - there is no commonly accepted "better way" to manage searching and indexing.
However, for very big and busy sites, people prefer to use external tools altogether, like a Google search box or even dedicated software or hardware, such as Solr/Lucene or the Google Search Appliance (GSA).
The link I provided above - however - sorts the search-related modules by descending usage statistics, so you will find the most commonly used ones on the first page. One that I personally like for English-language sites is the porter-stemmer module, which indexes words by their stem (e.g.: highness, highest and higher will all be returned as matches for the word "high").
That was the general information on search and Drupal. As for your problem, there are a number of things you could check to track it down:
Has your cron.php been executed lately? Indexing is done as part of the cron run, so if you do not have a crontab set, or if you haven't executed it by hand, your node has likely not been indexed yet.
Are the settings correct? Settings for the search module are located at http://example.com/admin/settings/search : is your minimum word length sufficient for your needs (the default is 3 letters)?
Has 100% of the site been indexed? (You can check that from the settings page.) If it has not, and running cron.php doesn't solve the matter, look further down.
Does a re-index solve the problem? Especially if you inserted data by means of SQL queries directly on the Drupal tables, chances are Drupal hasn't realised the content of the node has changed and therefore doesn't update the index.
Is the node you are trying to find visible? Search results for unpublished nodes, or nodes that require higher-than-yours permissions to be viewed, are not returned, AFAIK.
As for the "stuck indexing": that happened to me once as well. It turned out to be some PHP code within a node body that triggered a PHP exception when the node was being indexed; as a result, the indexing process would halt, and all the following nodes would not be indexed either.
Hope this helps. Good luck!
