I'm trying to do text search over dictionary definitions, which are stored as an array for each word. I want an entry to rank higher under the following conditions:
If the keyword appears early in the definitions, the entry should be ranked higher.
If the definition fully matches the keyword, the entry should be ranked higher.
If the keyword appears more often in the definitions, the entry should be ranked higher.
For example, search for "car".
Word 1:
Car
Vehicle
Bus
Word 2:
Parking a car
Carpark
Word 3:
Small car
The ranking should return Word 1, then Word 3, then Word 2.
How would I implement this with Solr, if at all possible? If not, what other options do I have for this kind of search ranking?
I have only analyzed this academically, so I cannot back my answer with experience.
Have a look at Payloads in Solr. Grant Ingersoll explains it at a basic level, and the article has tests that look similar to your use case.
SpanQuery is also worth checking out, but I am not sure whether it fits the use case you describe.
Do post back with your experiments/experiences.
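In case it helps with experimenting, here is a minimal sketch of the ranking described in the question, done in plain Python rather than in Solr. The scoring formula is purely an assumption made for illustration: each definition contributes the fraction of its tokens that equal the keyword (so an exact match contributes 1.0), divided by its position in the array, and the contributions are summed so that repeated matches add up. The function names are hypothetical, and substring matches such as "Carpark" are deliberately not counted.

```python
def definition_score(definition, keyword, rank):
    """Score one definition: keyword coverage, discounted by list position."""
    tokens = definition.lower().split()
    if not tokens:
        return 0.0
    hits = tokens.count(keyword.lower())
    coverage = hits / len(tokens)   # 1.0 when the definition is exactly the keyword
    return coverage / (rank + 1)    # earlier definitions weigh more


def entry_score(definitions, keyword):
    """Sum the per-definition scores, so more matching definitions rank higher."""
    return sum(definition_score(d, keyword, i) for i, d in enumerate(definitions))


entries = {
    "Word 1": ["Car", "Vehicle", "Bus"],
    "Word 2": ["Parking a car", "Carpark"],
    "Word 3": ["Small car"],
}
ranking = sorted(entries, key=lambda w: entry_score(entries[w], "car"), reverse=True)
print(ranking)  # ['Word 1', 'Word 3', 'Word 2']
```

With these assumed weights the example from the question comes out in the requested order. Different weights would be needed if substring matches should count as well.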
I have two documents indexed in Azure Search (among many others):
Document A contains only one instance of "BRIG" in the whole document.
Document B contains 40 instances of "BRIG".
When I do a simple search for "BRIG" in the Azure Search Explorer via Azure Portal, I see Document A returned first with "@search.score": 7.93229 and Document B returned second with "@search.score": 4.6097126.
There is a scoring profile on the index that adds a boost of 10 for the "title" field and a boost of 5 for the "summary" field, but this doesn't affect these results as neither have "BRIG" in either of those fields.
There's also a "freshness" scoring function with a boost of 15 over 365 days with a quadratic function profile. Again, this shouldn't apply to either of these documents as both were created over a year ago.
I can't figure out why Document A is scoring higher than Document B.
It's possible that Document A is "newer" than Document B and that this is why it is displayed first (has a higher score). Besides term relevance, freshness can also impact the score.
EDIT:
After some research, it looks like newer Azure Cognitive Search services use the BM25 algorithm by default. (source: https://learn.microsoft.com/en-us/azure/search/index-similarity-and-scoring#scoring-algorithms-in-search)
Document length and field length also play a role in the BM25 algorithm. Longer documents and fields are given less weight in the relevance score calculation. Therefore, a document that contains a single instance of the search term in a shorter field may receive a higher relevance score than a document that contains the search term multiple times in a longer field.
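To make the length effect concrete, here is a rough sketch of the textbook BM25 score for a single term. This is not Azure's actual implementation, and the term frequencies, document lengths and average length below are made-up numbers chosen only to mirror the situation in the question. k1 = 1.2 and b = 0.75 are the commonly cited default values.

```python
def bm25_term_score(tf, doc_len, avg_doc_len, idf=1.0, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one term to one document's score."""
    # b controls how strongly the score is normalized by document length.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences of the term saturate.
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * length_norm)


# Hypothetical numbers: A is short with 1 hit, B is very long with 40 hits.
score_a = bm25_term_score(tf=1, doc_len=50, avg_doc_len=1000)
score_b = bm25_term_score(tf=40, doc_len=20000, avg_doc_len=1000)
print(score_a > score_b)  # True: the short document wins

# With b = 0 length normalization is switched off and the order flips.
flat_a = bm25_term_score(tf=1, doc_len=50, avg_doc_len=1000, b=0.0)
flat_b = bm25_term_score(tf=40, doc_len=20000, avg_doc_len=1000, b=0.0)
print(flat_a < flat_b)  # True
```

The point is only that term frequency saturates while the length penalty keeps growing, so one hit in a short field can outscore many hits in a long one.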
Test your scoring profile configurations. Perhaps try issuing queries without scoring profiles first and see if that meets your needs.
The "searchMode" parameter controls precision and recall. If you want more recall, use the default "any" value, which returns a result if any part of the query string is matched. If you favor precision, where all parts of the string must be matched, change searchMode to "all". Try the above query both ways to see how searchMode changes the outcome. See Simple Query Examples.
If you are using the BM25 algorithm, you also may want to tune your k1 and b values. See Set BM25 Parameters.
Lastly, you may want to explore the new Semantic search preview feature for enhanced relevance.
I have been looking over the different text analyzers which Azure Cognitive Search offers with this api.
The data I have is generic: it can be an email address or a name, to give just two examples.
Which is the best analyzer to use on this type of data (generic)?
Also, does the text analyzer in use affect how search works when looking for more than 1 word?
What is the best way to run a fuzzy search on more than one word, e.g. "joe blogs", with every word fuzzy?
I don't want "somename blogs" to show up, because somename is not a fuzzy match for joe.
I do want "joe clogs" to show up, because joe fuzzy-matches joe and clogs fuzzy-matches blogs.
What is the best practice to do fuzzy search with more than 1 word, which would give the end user fewer hits as they give more words?
If you have generic content and don't want to use linguistic processing, you can use a generic analyzer like the Whitespace analyzer. See
https://learn.microsoft.com/en-us/azure/search/index-add-custom-analyzers#built-in-analyzers
How searches for single or multiple words work is determined by the searchMode parameter. The recommended practice is to use all, instead of any. When you specify more terms, you are more specific and you want fewer (more precise) results.
You can specify multi-word queries where individual search terms are fuzzy by using the tilde syntax. E.g. to do a fuzzy search for joe but an exact match on blogs, you could use something like:
joe~ blogs
You can also control how fuzzy you want it to be. See
https://learn.microsoft.com/en-us/azure/search/query-lucene-syntax#bkmk_fuzzy
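For the "all words fuzzy" case from the question, one option is to append the tilde to every term and combine that with searchMode "all". Below is a small sketch that builds such a request body. The helper name build_fuzzy_request is hypothetical; "search", "queryType" and "searchMode" are the query parameters of the search API, and the fuzzy operator needs the full Lucene syntax (queryType "full"). The sketch does not escape special characters in the input.

```python
def build_fuzzy_request(text, max_edits=None):
    """Build a search request body in which every term is fuzzy."""
    if max_edits is not None and not 0 <= max_edits <= 2:
        raise ValueError("max_edits must be between 0 and 2")
    # "term~" uses the default distance, "term~1" sets an explicit edit distance.
    suffix = "~" if max_edits is None else f"~{max_edits}"
    fuzzy_terms = [term + suffix for term in text.split()]
    return {
        "search": " ".join(fuzzy_terms),
        "queryType": "full",   # the tilde operator is part of the full Lucene syntax
        "searchMode": "all",   # every term must match, so more words mean fewer hits
    }


print(build_fuzzy_request("joe blogs"))
# {'search': 'joe~ blogs~', 'queryType': 'full', 'searchMode': 'all'}
```

With searchMode "all", "somename blogs" should drop out because somename is not within the edit distance of joe, while "joe clogs" can still match both fuzzy terms.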
PS: From your use case it sounds like proximity matching is also something you could consider using:
https://learn.microsoft.com/en-us/azure/search/query-lucene-syntax#bkmk_proximity
For example, if I am searching an index of book titles with the term "harry", "Dirty Harry" is scored equally to "Harry Potter", and when two items are equally scored, the order is random. I'd like to weight the one that begins with my search term (Harry Potter) higher.
I would rather not use TermPositionVector, as it seems that this is something I can read only after the search and scoring have been completed.
Thanks for your time/consideration.
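Look up SpanQuery, in particular SpanFirstQuery, which matches a term only when it occurs within the first N positions of a field.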
I have to build a search facility capable of searching members by their first name/last name and maybe some other search parameters (e.g. address).
The search should provide a list of match candidates so that the user can select whatever he/she deems the "correct" match.
The search should be smart enough that the "correct" result is among the first few items on the list. The search should also be tolerant of typos and misspellings and, perhaps, even be aware of name short forms, e.g. Bob vs. Robert or Bill vs. William.
I started investigating Lucene and its family (such as Elasticsearch) as a tool for the job. While it has an impressive array of features addressing similar problems for full-text search, I am not so sure how to use them for my task, up to the point that maybe Lucene is not the right tool here at all.
What do you guys think: how can I harness Elasticsearch to solve my problem? Or should I look elsewhere?
Lucene supports edit distance queries, so your search query will tolerate some typos. You define the tolerance per term in the query.
For instance:
name:johnni~0.8
would match "johnny".
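To show what "edit distance" means here, below is a short sketch of plain Levenshtein distance. It is an illustration, not Lucene's code. The similarity helper is an assumed approximation of how a fractional value such as 0.8 relates to the distance (one minus the distance divided by the shorter length). As far as I know, newer Lucene versions take a maximum number of edits (0 to 2) after the tilde instead of a fraction.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed approximation of a 0..1 fuzzy similarity value."""
    shortest = min(len(a), len(b))
    if shortest == 0:
        return 1.0 if a == b else 0.0
    return 1.0 - edit_distance(a, b) / shortest


print(edit_distance("johnni", "johnny"))      # 1
print(similarity("johnni", "johnny") >= 0.8)  # True
```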
Also, Solr provides a wide array of ready-made search filters and analyzers you can use for search.
In your case I would probably chain several filter factories together:
TrimFilterFactory - trim the query
LowerCaseFilterFactory - to get rid of case differences
ISOLatin1AccentFilterFactory - to remove accents from letters (most people don't search with the accent anyway)
PhoneticFilterFactory - for sounds-like matching, e.g. kris -> chris
Look at the documentation under the link; it is pretty straightforward to set up a new Solr instance with an analyzer that uses all the above filters. I used something similar for searching city names and it worked fairly well.
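To give a feel for what the first three filters in that chain do to a query, here is a small Python imitation. It is only a sketch of the idea, not Solr code, and it leaves out the phonetic step. The function name normalize is hypothetical. I believe newer Solr versions use ASCIIFoldingFilterFactory in place of ISOLatin1AccentFilterFactory.

```python
import unicodedata


def normalize(text):
    """Imitate a trim -> lowercase -> accent-folding filter chain."""
    trimmed = text.strip()    # what TrimFilterFactory does
    lowered = trimmed.lower() # what LowerCaseFilterFactory does
    # Decompose accented letters, then drop the combining accent marks.
    decomposed = unicodedata.normalize("NFKD", lowered)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(normalize("  Z\u00fcrich "))   # zurich
print(normalize("S\u00e3o Paulo"))   # sao paulo
```

Applying the same normalization at index time and at query time is what makes "Zurich" and the accented spelling land on the same term.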
Lucene can be made tolerant of typos and misspellings, and can use synonyms. As for
The search should be smart enough so that the "correct" result would be among the first few items on the list
Are there any search engines which don't try to do this?
As far as Bob/Robert goes, that can be done with synonyms, but you need to get the synonym data from some reliable source.
In addition to what @Asaf mentioned, you might try to use N-gram indexing to deal with spelling variants. See the CJKAnalyzer for an example of how to do that.
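As a rough illustration of why N-grams help with spelling variants, the sketch below splits names into character bigrams and measures how many they share. It is a toy in plain Python, not how Lucene scores N-gram fields, and the Jaccard overlap measure is my own choice for the example.

```python
def char_ngrams(text, n=2):
    """Return the set of overlapping character n-grams of a string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_overlap(a, b, n=2):
    """Jaccard overlap of the n-gram sets: 1.0 is identical, 0.0 is disjoint."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(sorted(char_ngrams("johnny")))      # ['hn', 'jo', 'nn', 'ny', 'oh']
print(ngram_overlap("johnny", "johnni"))  # 4 shared bigrams out of 6
print(ngram_overlap("johnny", "robert"))  # 0.0
```

A misspelled name still shares most of its bigrams with the correct one, so it can match even though the whole-word terms differ.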
Is there any open source/free software available that gives you semantically related keywords for a given word? For example, for the word dog it should give keywords like: animal, mammal, ...
Or for the word France it should give you keywords like: country, Europe, ...
Basically, a set of keywords related to the given word.
If there is not, does anybody have an idea of how this could be implemented and how complex it would be?
Best regards
WordNet might be what you need. WordNet groups English words into sets of synonyms, provides general definitions, and records the various semantic relations between these groups.
There are tons of projects out there using WordNet; here is a list:
http://wordnet.princeton.edu/wordnet/related-projects/
Look at this one; you might find it particularly useful (http://kylescholz.com).
You can see the live demo here:
http://kylescholz.com/projects/wordnet/?text=dog
I hope this helps.
Yes. A company named Saplo in Sweden specializes in this. I believe you can use their API for this, and if you ask nicely you might be able to use it for free (if it's not for commercial purposes, of course).
Saplo
Yes. What you are looking for is similar to the vector space model for searching, which is an efficient way of doing this. There are some open source libraries available for latent semantic indexing/searching (a special case of the vector space model). Apache Lucene is one of the most popular open source search libraries. Or look for something on Google Code.
If you are looking for online resources, there are several to consider (at least in 2017; the OP is dated 2010).
Semantic Link (http://www.semantic-link.com): The creator of Semantic Link offers an interface to the results of a computation of a metric called "mutual information" on pairs of words over all of English Wikipedia. Only words occurring more than 1000 times in Wikipedia are available.
"Dog" gets you, for example: purebred, breeds, canine, pet, puppies.
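For the "how could this be implemented" part of the question, a statistic of this kind can be sketched in a few lines. The example below computes pointwise mutual information from document-level co-occurrence counts. That this matches what Semantic Link actually computes is an assumption on my part, and the five-sentence corpus is invented just to have something to count.

```python
import math


def pmi(word_a, word_b, documents):
    """Pointwise mutual information of two words co-occurring in a document."""
    token_sets = [set(doc.lower().split()) for doc in documents]
    total = len(token_sets)
    count_a = sum(word_a in tokens for tokens in token_sets)
    count_b = sum(word_b in tokens for tokens in token_sets)
    count_ab = sum(word_a in tokens and word_b in tokens for tokens in token_sets)
    if count_ab == 0:
        return float("-inf")  # the words never appear together
    p_a, p_b, p_ab = count_a / total, count_b / total, count_ab / total
    return math.log2(p_ab / (p_a * p_b))


# Toy corpus, invented for illustration only.
corpus = [
    "the dog is a loyal pet",
    "a puppy is a young dog",
    "the pet dog chased the cat",
    "france is a country in europe",
    "paris is the capital of france",
]
print(pmi("dog", "pet", corpus) > 0)  # True: they co-occur more than chance
print(pmi("dog", "france", corpus))   # -inf
```

On a real corpus you would also need a minimum frequency cutoff, like the 1000-occurrence threshold mentioned above, because rare words produce unreliable scores.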
It seems, however, you are really looking for an online tool that gives hyponyms and hypernyms. From the Wikipedia page for "Hyponymy and hypernymy":
In linguistics, a hyponym (from Greek hupó, "under" and ónoma, "name") is a word or phrase whose semantic field is included within that of another word, its hyperonym or hypernym (from Greek hupér, "over" and ónoma, "name"). In simpler terms, a hyponym shares a type-of relationship with its hypernym. For example, pigeon, crow, eagle and seagull are all hyponyms of bird (their hyperonym); which, in turn, is a hyponym of animal.
WordNet (https://wordnet.princeton.edu) has this information and has an online search tool. With this tool, if you enter a word, you'll get one or more entries with an "S" beside them. If you click the "S", you can browse the "Synset (semantic) relations" of the word with that meaning or usage, and this includes direct hyper- and hyponyms. It's incredibly rich!
For example: "dog" (as in "domestic dog") --> "canine" --> "carnivore" --> "placental mammal" --> "vertebrate" --> "chordate" --> etc. or "dog" --> "domestic animal" --> "animal" --> "organism" --> "living thing" -->
There is also Wordnik, which lists hypernyms and reverse dictionary words (words with the given word in their definition). Hypernyms for "France" include "european country/nation", and the reverse dictionary includes regions and cities in France, names of certain rulers, etc. "Dog" gets the hypernym "domesticated animal" (and others).