Solr - Why are scores of documents different although the query has not differentiated between them

I have run the queries below and got this response -
"response":{"numFound":200,"start":0,"maxScore":20.458012,"docs":[
{
"food_group":"Dairy",
"carbs":"13.635",
"protein":"2.625",
"name":"Apple Milkshake",
"fat":"3.814",
"id":"109",
"calories":99.0,
"_version_":1565386306583789568,
"score":20.458012},
{
"food_group":"Proteins",
"carbs":"4.79",
"protein":"4.574",
"name":"Chettinad Egg Curry",
"fat":"6.876",
"id":"526",
"calories":99.0,
"_version_":1565386306489417728,
"score":19.107327}
.....//other documents...
]}
Queries -
q = (food_group:"Proteins" OR
food_group:"Dairy" OR
food_group:"Grains")
bf = div(1,abs(sub(100,calories)))^15
bq = food_group:"Proteins" + food_group:"Dairy" + food_group:"Grains"
My question is: even though I have not given "Dairy" any boost relative to "Proteins" in bq, why does the "Dairy" document have a higher score?

because "Dairy" is a more rare term in your corpus. Lucene will give a higher score to a match with a term that is rare vs a match with a very common term.
If you want to get into the detials, look up how BM25 similarity is computed. BM25 is what Lucene (thus Solr) uses now by default, before it was TD-IDF, but they are very similar.
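As a sanity check, the bf function cannot be the cause here: both documents shown have calories of 99.0, so div(1, abs(sub(100, 99))) evaluates to 1 for each of them. The difference comes entirely from the term statistics. Roughly, the IDF component of Lucene's BM25 is

idf(t) = ln(1 + (N - n_t + 0.5) / (n_t + 0.5))

where N is the total number of documents and n_t is the number of documents containing term t. The fewer documents a food_group value appears in, the larger its idf, so a match on the rarer value "Dairy" outscores a match on the more common "Proteins", even though the query treats all three values identically.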

Related

Azure search - custom function for result boosting

I'm trying to move a "complex" function to Azure Search. This function calculates a score for each result element based on filter data (from the search query) and data stored in the result element. The score is used for result boosting. Based on my research, Azure Search provides result boosting, but it's too simple for my requirements.
Example function:
// filterElementsIds - ids taken from the search query filter
public double Score(IEnumerable<string> filterElementsIds, ResultElement element)
{
    double score = 0;
    // Add the stored weight of every filter id that matches one of the element's keys
    foreach (var elem in element.ScoreForFilters)
    {
        if (filterElementsIds.Any(x => x == elem.Key))
            score += elem.Value * 1.5;
    }
    return score;
}
Currently, I iterate through each result returned by Azure Search, calculate the score, and sort the elements inside my application.
Is it possible to implement such a function in Azure Search to improve the process of boosting results?
I'm not sure I fully understand your question, but it appears you are trying to boost the score of certain documents if their key is equal to any of the IDs in your collection of "filterElements". If that's so, you could use the Lucene query language to craft a query that does that:
https://learn.microsoft.com/en-us/azure/search/search-query-lucene-examples
You could do a search that looks like this:
OriginalSearchTerm OR (OriginalSearchTerm AND key:("filterID1" OR "filterID2" OR "filterID3"))
That way, documents that match both the original search term and have one of the filter IDs in the "key" field will score higher than documents that only match the original search term. You can also use term boosting to give a specific boost to the key field in this case:
https://learn.microsoft.com/en-us/azure/search/search-query-lucene-examples#example-5-term-boosting
OriginalSearchTerm OR (OriginalSearchTerm AND key:("filterID1" OR "filterID2" OR "filterID3")^2)

friends of friend Query in ArangoDB 3.0

I want to write a 'friends of a friend' traversal using AQL.
I have a collection named User and an edge collection named Contact.
My Contact documents:
I also read this article that implements friends of friends in ArangoDB, but that post uses functions from an older version of ArangoDB, namely the GRAPH_NEIGHBORS() function.
In ArangoDB 3.0 (the latest version), the GRAPH_NEIGHBORS() function has been removed!
Now, how can I implement friends of friends using AQL in ArangoDB 3.0?
Thanks a lot
The graph functions have been removed because there is the more powerful, flexible, and performant native AQL traversal, which was introduced with 2.8 and extended and optimized for version 3.0.
To retrieve friends of friends, you need a traversal starting at the user in question with a depth of exactly 2:
LET user = DOCUMENT("User/#9302796301")
LET foaf = (
  FOR v IN 2..2 ANY user Contact
    RETURN v // you might want to return the name only here
)
RETURN MERGE(user, { foaf })
The document for the user with _key #9302796301 is loaded and assigned to the variable user. It is used as the start vertex for a traversal with min and max depth of 2, using the edges of the collection Contact and ignoring their direction (ANY; this can also be INBOUND or OUTBOUND). The friends-of-friends documents are returned in full in this example (v) and merged with the user document under the attribute key "foaf", with the value of the variable foaf.
This is just one simple example how to traverse graphs and how to construct result sets. There are many more options of course.

Setting a df threshold beyond which query terms should be ignored

I am using Solr to search and index products from a database. Products have two interesting fields: a name and a description. Product names are normally unique, but sometimes contain common words, which serve as a pre-description of the product. One example would be "UltraScrew - a motor powered screwdriver". Names are generally much shorter than descriptions.
The problem is that when one searches for a common term, documents that contain it in the name get an unwanted boost, over those that contain it only in the description. This is due to the fact that names are shorter, and even with the normalization added afterwards, it is quite visible.
I was wondering if it is possible to filter terms out of the name, not with a dictionary of stop words, but based on the relative document frequency of the term. That means, if a term appears in more than 10% of the available documents, it should be ignored when the name field is queried. The description field should be left untouched.
Is this generally possible?
Maybe you could use your own Similarity:
import org.apache.lucene.search.DefaultSimilarity;

public class MySimilarity extends DefaultSimilarity {
    @Override
    public float idf(int docFreq, int numDocs) {
        // Drop the idf contribution of terms that occur in 10% or more of all documents
        float freq = ((float) docFreq) / ((float) numDocs);
        if (freq >= 0.1f) return 0;
        // Otherwise use the standard Lucene idf formula
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }
}
and use that one instead of the default one.
You can set the similarity for an IndexSearcher at the Lucene level; see this other answer to a related question.
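For illustration, a minimal sketch of plugging this in at query time (the searcher setup is assumed, not from the original; exact package locations vary across Lucene versions):

// Assumes an IndexReader has already been opened on the index.
IndexSearcher searcher = new IndexSearcher(reader);
searcher.setSimilarity(new MySimilarity()); // scoring now uses the custom idf()

Note that a Similarity set this way applies to every field; restricting the cutoff to the name field alone would need a per-field similarity (e.g. PerFieldSimilarityWrapper in Lucene 4+).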
I am not sure if I understood the question correctly, but you could run two separate queries. Pseudo code:
SearchResults nameSearchResults = search("name:X");
if (nameSearchResults.size() * 10 >= corpusSize) { // name-based search useless?
    return search("description:X"); // use description-based search
} else {
    return search("name:X description:X"); // search both fields
}
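In concrete Lucene terms, the check can be done without running the first search at all: for a single query term, the index already stores its document frequency. A sketch of the same idea (the field names, query term, and searcher are illustrative, not from the original):

// Compare the term's document frequency in the name field to the corpus size.
IndexReader reader = searcher.getIndexReader();
int df = reader.docFreq(new Term("name", queryTerm));
int corpusSize = reader.numDocs();

Query query;
if (df * 10 >= corpusSize) {
    // Too common in names: fall back to the description field only.
    query = new TermQuery(new Term("description", queryTerm));
} else {
    // Selective enough: search both fields.
    BooleanQuery both = new BooleanQuery();
    both.add(new TermQuery(new Term("name", queryTerm)), BooleanClause.Occur.SHOULD);
    both.add(new TermQuery(new Term("description", queryTerm)), BooleanClause.Occur.SHOULD);
    query = both;
}
TopDocs results = searcher.search(query, 20);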

Lucene wild card search

How can I perform a wildcard search in Lucene?
I have the text: "1997_titanic"
If I search for "1997_titanic", it returns a result, but I am not able to do the two searches below:
1) If I search with only "1997", it does not return any results.
2) Also, if there is a space, such as in "spider man", it does not find any results.
I retrieve all movie information from a DB and store it in Lucene Documents:
public Document createMovieDoc(Movie m){
    Document document = new Document();
    document.add(new StoredField("moviename", m.getName()));
    TextField field = new TextField("movienameSearch", m.getName().toLowerCase(), Store.NO);
    field.setBoost(5.0f);
    document.add(field);
    return document;
}
And to search, I have this method:
public List<String> search(String txt){
    List<String> list = new ArrayList<>();
    PhraseQuery phQuery = new PhraseQuery();
    Term term = new Term("movienameSearch", txt.toLowerCase());
    BooleanQuery b = new BooleanQuery();
    b.add(phQuery, Occur.SHOULD);
    TopFieldDocs tp = searcher.search(b, 20, ..);
    for(int i = 0; i < tp.scoreDocs.length; i++)
    {
        int mId = tp.scoreDocs[i].doc;
        Document d = searcher.doc(mId);
        String moviename = d.get("moviename");
        list.add(moviename);
    }
    return list;
}
I'm not sure which analyzer you are using to index. It sounds like it might be WhitespaceAnalyzer: with it, "1997_titanic" remains a single token at index time, while "spider man" is split into the tokens "spider" and "man".
It could also be SimpleAnalyzer, which uses a LetterTokenizer. That would make it impossible to search for "1997", since the tokenizer eliminates all numbers from the indexed representation of the text.
Your search method doesn't look right. You aren't adding any terms to your PhraseQuery, so I wouldn't expect it to find anything; you must add some terms for anything to be found. You create a Term in the code you've provided, but nothing is ever done with it. Maybe this has something to do with how you've picked your excerpts? I'm a bit confused by that.
In order to manually construct a PhraseQuery you must add each term individually, so to search for "spider man", you would do something like:
PhraseQuery phQuery= new PhraseQuery();
phQuery.add(new Term("movienameSearch", "spider"));
phQuery.add(new Term("movienameSearch", "man"));
This requires you to know what the analyzer was doing at index time, and tokenize the input yourself to suit. The simpler solution is to just use the QueryParser:
//With whatever analyzer you like to use.
QueryParser parser = new QueryParser(Version.LUCENE_46, "defaultField", analyzer);
Query query = parser.parse("movienameSearch:\"" + txt.toLowerCase() + "\"");
TopDocs tp = searcher.search(query, 20);
This allows you to rely on the same analyzer to index and query, so you don't have to know how to tokenize your phrases to suit.
As far as finding "1997" and "titanic" individually, I would recommend just using StandardAnalyzer. It will tokenize those into discrete tokens, allowing them to be searched very easily, with a simple query like: movienameSearch:1997.
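If in doubt, it is easy to check what any analyzer actually emits for a given title. A small sketch (the Version constant mirrors the snippet above; the field name and input text are illustrative):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Print each token the analyzer produces for the given text.
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
TokenStream ts = analyzer.tokenStream("movienameSearch", new StringReader("1997_titanic"));
CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
    System.out.println(term.toString());
}
ts.end();
ts.close();

Whatever this prints is exactly what ends up in the index, and therefore what your queries have to match.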

ElasticSearch default scoring mechanism

What I am looking for is a plain, clear explanation of how the default scoring mechanism of ElasticSearch (Lucene) really works. I mean, does it use Lucene scoring, or does it use scoring of its own?
For example, I want to search for documents by a "Name" field. I use the .NET NEST client to write my queries. Consider this type of query:
IQueryResponse<SomeEntity> queryResult = client.Search<SomeEntity>(s =>
s.From(0)
.Size(300)
.Explain()
.Query(q => q.Match(a => a.OnField(q.Resolve(f => f.Name)).QueryString("ExampleName")))
);
which is translated to such JSON query:
{
  "from": 0,
  "size": 300,
  "explain": true,
  "query": {
    "match": {
      "Name": {
        "query": "ExampleName"
      }
    }
  }
}
There are about 1.1 million documents that the search is performed on. What I get in return is (this is only part of the result, formatted on my own):
650 "ExampleName" 7,313398
651 "ExampleName" 7,313398
652 "ExampleName" 7,313398
653 "ExampleName" 7,239194
654 "ExampleName" 7,239194
860 "ExampleName of Something" 4,5708737
where the first field is just an Id, the second is the Name field on which ElasticSearch performed its search, and the third is the score.
As you can see, there are many duplicates in the ES index. Since some of the found documents have different scores despite being exactly the same (differing only in Id), I concluded that different shards performed the search on different parts of the whole dataset, which leads me to suspect that the score is somehow based on the overall data in a given shard, not exclusively on the document actually being considered by the search engine.
The question is, how exactly does this scoring work? I mean, could you tell me/show me/point me to the exact formula used to calculate the score for each document found by ES? And ultimately, how can this scoring mechanism be changed?
The default scoring is the DefaultSimilarity algorithm in core Lucene, largely documented here. You can customize scoring by configuring your own Similarity, or by using something like a custom_score query.
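For reference, the practical scoring function that DefaultSimilarity (a TF-IDF model) implements has this overall shape, as documented for Lucene's TFIDFSimilarity:

score(q,d) = coord(q,d) * queryNorm(q) * SUM over t in q of [ tf(t,d) * idf(t)^2 * boost(t) * norm(t,d) ]

where tf(t,d) is typically sqrt(frequency of t in d), idf(t) = 1 + ln(numDocs / (docFreq + 1)), boost(t) is the query-time boost of term t, and norm(t,d) folds in index-time boosts and field length. Note that docFreq and numDocs are per-shard statistics, which is exactly why identical documents living on different shards can receive different scores.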
The odd score variation in the first five results shown seems small enough that it doesn't concern me much, as far as the validity of the query results and their ordering, but if you want to understand the cause of it, the explain api can show you exactly what is going on there.
The score variation is based on the data in a given shard (as you suspected). By default, ES uses a search type called 'query then fetch', which sends the query to each shard and finds all matching documents, scoring them with local TF-IDF statistics (these vary with the data on a given shard; here's your problem).
You can change this by using the 'dfs query then fetch' search type: it pre-queries each shard, asking for term and document frequencies, and only then sends the actual query to each shard.
You can set it in the url
$ curl -XGET '/index/type/_search?pretty=true&search_type=dfs_query_then_fetch' -d '{
  "from": 0,
  "size": 300,
  "explain": true,
  "query": {
    "match": {
      "Name": {
        "query": "ExampleName"
      }
    }
  }
}'
There is a great explanation in the ElasticSearch documentation:
What is relevance:
https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-intro.html
Theory behind relevance scoring:
https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
