Lucene phrase query with terms in OR - search

Suppose I have documents whose text field is as follows:
the red house is beautiful
the house is little
the red fish
the red and yellow house is big
What kind of query should I use so that, when I search for "red house", the documents are ranked as follows:
the red house is beautiful and big [matching: red house]
the red and yellow house is big [matching: red x x house]
the house is little [matching: house]
the red fish [matching: red]
What I need is to give a high rank to the documents that match the phrase I've searched, and a lower score to the documents that contain only part of it.
Notice that the query string could also contain more than two terms.
It is like a PhraseQuery in which each term may or may not appear, and in which the closer together the terms are, the higher the score.
I've tried to combine a PhraseQuery with a TermQuery, but the result is not what I need.
How can I do this?
Thanks

Try creating a BooleanQuery composed of TermQuery objects, combined with OR (BooleanClause.Occur.SHOULD). This will match documents where only one of the terms appears, but will give a higher score to those where both appear.
Query term1 = new TermQuery(new Term("text", "red"));
Query term2 = new TermQuery(new Term("text", "house"));
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(term1, BooleanClause.Occur.SHOULD);
booleanQuery.add(term2, BooleanClause.Occur.SHOULD);

I think a PhraseQuery with a positive slop (setSlop), SHOULD-combined with a TermQuery for every term, should get you there. Maybe with a boost for the PhraseQuery.
"I've tried to combine a PhraseQuery with a TermQuery, but the result is not what I need."

What do you get with this combination, and in what way is it not what you need?
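Assuming the classic query-parser syntax and the text field from the question, the PhraseQuery-plus-TermQuery combination suggested above could also be written as a single query string (the slop of 3 and the boost of 2 are illustrative values, not tuned ones):

```
text:"red house"~3^2 text:red text:house
```

With the default OR operator, any document containing either term matches, but documents where both terms occur within the slop also match the boosted phrase clause and therefore score higher.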

Related

MarkLogic - Tokenize Search Phrase based on XML Field as a dictionary of phrases

I have a list of "known phrases" stored in an XML document under an element named label. I am trying to figure out how to write a function that can tokenize a search phrase into all of its label pieces (if available).
For instance, I have a label for North Korea and one for ICBM.
If the user types in North Korea ICBM, I would expect to get back two tokens, one for each label, as opposed to North, Korea, and ICBM.
In another example, if the user types in New York City, I would expect only one token (label) of "New York City".
If no labels are found, it would return the default tokenization of each word.
I tried to start writing this, but am not sure how to do it properly without a while-loop facility, and am pretty new to XQuery in general.
The code below is how I started, but I quickly realized it would not scale to longer search terms.
Basically, it checks whether the full phrase is in the label fields. If it is not, it strips words from the back of the search phrase, checking what's left for a label.
let $label-query := cts:element-value-query(fn:QName('', 'label'), $searchTerm, ('case-insensitive', 'whitespace-sensitive'))
let $results := cts:search(fn:collection('typea'), $label-query)
let $test :=
  if (fn:empty($results)) then
    let $tokens := fn:tokenize($searchTerm, " ")
    let $tokenCount := fn:count($tokens)
    let $lastWord := $tokens[last()]
    let $firstPhrase := $tokens[position() ne last()]
    let $_ :=
      if (fn:count($firstPhrase) = 1) then
        ()
      else
        (: join the remaining tokens back into a single phrase string;
           cts:element-value-query expects a string, not a token sequence :)
        let $label-query2 := cts:element-value-query(fn:QName('', 'label'), fn:string-join($firstPhrase, ' '), ('case-insensitive', 'whitespace-sensitive'))
        let $results2 := cts:search(fn:collection('typea'), $label-query2)
        return
          if (fn:empty($results2)) then
            xdmp:log('second empty')
          else
            xdmp:log($results2)
    let $l := xdmp:log($firstPhrase)
    return $tokens
  else
    let $_ := xdmp:log('full')
    return element {'result'} {$results}
Does anyone have any advice on how I could implement this recursively, or perhaps any alternative strategies? I am essentially trying to say: break this sentence up into all of the phrases found in the label fields of the typea collection. If no labels are found, tokenize by word.
Thanks, I look forward to your guidance.
Update to help clarify my ultimate intention.
Below is the document referring to North Korea.
The goal is to parse the search phrase, and use extra information found in these documents to aid in search.
Meaning if the person types in DPRK or North Korea they should both search the same way. It should also include Narrower Labels as an Or Condition on the search, and will more likely than not be updated to include other relationships that will also be included in search. (IE: Kim Jong Un is Notably Associated with North Korea.)
So, in short, I would like to reconcile the multi-word search terms using the label field, and then, if a match is found, use the information from all labels plus the narrower labels from that document.
Edit 2: Trying to use cts:highlight to get the phrases. Once I have the phrases I will do an element lookup to get to the right document, and then get the associated documents data for submission to query building.
The issue now is that the cts:highlight does not always return the full phrase under one <phrase> tag.
let $phrases := cts:highlight(<nod>New York City FC</nod>, cts:or-query((//label)), <phrase>{ $cts:text }</phrase>)
A possible alternative approach, if you are using MarkLogic 9, is to set up a custom tokenization dictionary. See the custom dictionary API documentation and the Search Developer's Guide for details.
But the gist is, if you add an entry "North Korea" in your tokenization dictionary for a language, you'll get it back as a single token for that language. This will apply everywhere: in content or searches.
That said, it isn't clear from your code what you are ultimately trying to achieve with this. If it is more accuracy with phrase searches, there are better ways to achieve this (enabling fast-phrase for 2-word phrases, or word positions for longer ones).
If this is about search parsing only, you could still use the tokenization dictionary approach, but you probably want to use a special language code for it so it doesn't mess up your actual content, and then use cts:tokenize, e.g. cts:tokenize("North Korea ICBM", "xen") where "xen" is your special language code.
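For comparison, the greedy longest-match strategy the question describes can also be sketched outside of MarkLogic entirely. A hypothetical Python version, assuming the labels are available as a plain list of strings:

```python
def tokenize_with_labels(search: str, labels: list[str]) -> list[str]:
    """Split a search string into known multi-word labels where possible,
    falling back to single words (the default tokenization)."""
    known = {label.lower() for label in labels}
    words = search.split()
    tokens = []
    i = 0
    while i < len(words):
        match = None
        # Try the longest remaining span first, then shrink from the back.
        for j in range(len(words), i, -1):
            candidate = " ".join(words[i:j])
            if candidate.lower() in known:
                match, i = candidate, j
                break
        if match is None:
            match, i = words[i], i + 1
        tokens.append(match)
    return tokens

tokenize_with_labels("North Korea ICBM", ["North Korea", "ICBM"])
# -> ['North Korea', 'ICBM']
```

The same shrink-from-the-back loop translates to XQuery as a recursive function, since XQuery has no while-loop facility.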
Another approach is to use cts:highlight to apply markup to matches to your phrases in the string and go from there:
cts:highlight(<node>North Korea ICBM</node>,
cts:or-query((//label)),
<phrase>{$cts:text}</phrase>)
That would embed the markup for any matching phrase: <node><phrase>North Korea</phrase></node>
Some care would have to be taken around overlaps, if you want to force a particular winner, by applying the set you want to win first, and then running a second pass with the others.
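Once cts:highlight has applied the markup, the wrapped phrases can be recovered with any XML parser. A Python sketch (the node and phrase element names match the example above; anything else is an assumption):

```python
import xml.etree.ElementTree as ET

def tokens_from_highlight(xml_string: str) -> list[str]:
    """Collect <phrase> wrappers as whole tokens and split any
    unwrapped text into single-word tokens."""
    root = ET.fromstring(xml_string)
    tokens = []
    if root.text:
        tokens.extend(root.text.split())
    for child in root:
        if child.tag == "phrase" and child.text:
            tokens.append(child.text)
        if child.tail:
            tokens.extend(child.tail.split())
    return tokens

tokens_from_highlight("<node><phrase>North Korea</phrase> ICBM</node>")
# -> ['North Korea', 'ICBM']
```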

Spotipy: How do I search by both artist and song

Given a song title and an artist name, I am trying to find the correct song using Spotipy. However, I do not see a way to search by both song title and artist: it's either one or the other:
sp.search(q="Money", type="track", limit=10)
sp.search(q="Pink Floyd", type="artist", limit=10)
The problem with this is that I get a bunch of irrelevant results, especially if I search by track (for example, the top result for Money is "Rent Money" by Future, not "Money" by Pink Floyd). I could extend the limit and filter out irrelevant results, but considering I'll be doing this at a large scale, I'd rather query Spotify correctly, take the first result, and move on. Is there any way to query on both track name and artist at the same time using Spotipy?
Try looking at https://developer.spotify.com/web-api/search-item/
I think that you're misunderstanding type: it defines the type of entities to be returned. I always want a track list back, so type is track.
The query filter can be completely generic, like Money, or can be focused with field filters like artist:Floyd track:Money. This can be immensely powerful, as you can filter on albums, date fields, popularity, and all sorts.
I commonly use
let q = String(format: "artist:%@ track:%@", artistName, trackName)
Don't forget to %-encode the string!
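Since the question is about Spotipy, the same idea in Python: build one field-filtered query string and keep type="track". A sketch (the helper name is mine; the client setup is the usual Spotipy boilerplate):

```python
def track_query(artist: str, track: str) -> str:
    # Spotify's search supports field filters inside the q string;
    # artist: and track: narrow the match without changing `type`.
    return f"artist:{artist} track:{track}"

# Usage (assumes an authenticated spotipy.Spotify client `sp`):
# results = sp.search(q=track_query("Pink Floyd", "Money"), type="track", limit=1)
# top = results["tracks"]["items"][0]
```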

How do I perform a partial search while preserving order using solr?

Using Solr 4.0, I have a field Title_t (to store titles of books) which is of type TextField. Assuming that these are the titles stored in my index:
Physics Guide
Tutorial on Theoretical Physics
The General Physics
Book
If one wants to search for the title "Physics Guide", one could use
Title_t:physics G*
but this returns all of the following results:
Physics Guide
Tutorial on Theoretical Physics
The General Physics Book
Now, to my question:
Why isn't the filter showing only the "Physics Guide" result?
Since the search criterion is "physics G*" and not "*physics G *", only one result should be displayed. Is there a way to preserve order in the search keyword?
After parsing, your query Title_t:physics G* becomes something like
Title_t:physics df:G*
where df is the default field (usually text; check your config files) and the default operator is OR. So it returns documents whose Title_t contains the term 'physics', together with documents whose default field (filled via copyField) contains words starting with G.
Try with ComplexPhraseQueryParser
q={!complexphrase inOrder=true}Title_t:"physics G*"
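With a client such as Python's urllib, the same request could be assembled as follows (the field name matches the question; the rest is standard Solr request plumbing):

```python
from urllib.parse import urlencode

def complexphrase_params(field: str, phrase: str) -> dict:
    # {!complexphrase inOrder=true} keeps the wildcard phrase in order,
    # so "physics G*" only matches titles where the terms are adjacent.
    return {
        "q": f'{{!complexphrase inOrder=true}}{field}:"{phrase}"',
        "wt": "json",
    }

params = complexphrase_params("Title_t", "physics G*")
query_string = urlencode(params)  # ready to append to .../select?
```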

Implementing a location search in ElasticSearch

I have a problem with location queries returning erroneous results in ElasticSearch.
In our system, a business search engine, every search takes two inputs: a location, and a query-string, e.g.
q=sushi
location=Greenwich Village, New York, New York
I want the search to show me sushi in Greenwich Village first, then sushi outside of Greenwich Village, but to never show me non-sushi results.
The problem is, because of the location query, anything in Greenwich Village gets matched: lawyers, doctors, whatever. I'd like to say the following to ElasticSearch:
If q matches, then location doesn't have to (it's OK to return sushi outside of Greenwich Village), but if location matches, don't return it unless q matches also (not OK to return non-sushi businesses in Greenwich Village).
Anyone have any thoughts on how to do this?
It sounds like you want to search for "sushi" (you don't want non-sushi results) but sort your results by location (you want Greenwich Village results first).
If you are storing locations as geo points, you can simply use distance to sort your results.
If location is just a field, and you can only know if the business is inside or outside of a location, you can use Custom Filters Score query to boost relevancy of the results in the desired location. The query part should contain the search for "sushi" and the filters part should contain the search for location.
I incorporated the information in this post and elsewhere to come up with the following solution.
1. Index every 'place' (neighborhood, city, etc.) with a center point, and also index the coordinates of every business.
2. On each business, index the ids of the places that contain it.
3. Use a sub-search to convert the text entered into the location bar into a place record.
4. Use a CustomScoreQuery to modify every result's score by the following formula, which was worked out by trial and error:
   new_score = old_score / (1 + distance_between_place_centerpoint_and_result)^3
5. Also query the place id that results from step 3 against the place_ids field as a 'should' boolean clause. This gives a flat boost to everything that actually falls within the confines of the specified place.
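The CustomScoreQuery formula above is simple enough to sketch directly (the exponent of 3 is the trial-and-error value from the post; units are whatever the distance is measured in):

```python
def rescore(old_score: float, distance: float) -> float:
    # Decay the text-relevance score by distance from the place's
    # center point; the +1 keeps the divisor sane at distance 0.
    return old_score / (1 + distance) ** 3

rescore(8.0, 1.0)
# -> 1.0
```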
A side effect of this strategy is that businesses near the center point of the place are considered more relevant -- it is arguable, in my opinion, whether this is correct or not. But other than that it has worked quite well.
Thanks to imitov for his insight that helped me come up with this solution.

Solr searching within a dictionary file

I was wondering if there is something out of the box in Solr which would allow me to search within a dictionary file (containing words and phrases) and return all the phrases that contain my search terms.
For example, my dictionary file could have:
red car
blue bike
big bike tires
When I search 'bike', I expect to see
blue bike
big bike tires
And when I search 'big tires', I expect to see
big bike tires
Is there anything from Solr that could support this? I was looking into the SpellCheckComponent but it would only support prefix searches.
Basically, I would like to achieve solr searches (token searching) but against a dictionary file (this same file would also be used for autosuggest).
Any advice or direction would be appreciated.
Why not store such phrases in the index itself? The schema can be:
type: suggest_phrase #other types are product or review_article
phrase: big bike tires
So your search for big tires would be:
fq=type:suggest_phrase&q=phrase:(big tires)
