ElasticSearch default scoring mechanism - search

What I am looking for is a plain, clear explanation of how the default scoring mechanism of ElasticSearch (Lucene) really works. Does it use Lucene's scoring, or does it use scoring of its own?
For example, I want to search for documents by the "Name" field. I use the .NET NEST client to write my queries. Consider this type of query:
IQueryResponse<SomeEntity> queryResult = client.Search<SomeEntity>(s =>
s.From(0)
.Size(300)
.Explain()
.Query(q => q.Match(a => a.OnField(q.Resolve(f => f.Name)).QueryString("ExampleName")))
);
which is translated to the following JSON query:
{
    "from": 0,
    "size": 300,
    "explain": true,
    "query": {
        "match": {
            "Name": {
                "query": "ExampleName"
            }
        }
    }
}
There are about 1.1 million documents that the search is performed on. What I get in return is the following (this is only part of the result, formatted by me):
650 "ExampleName" 7,313398
651 "ExampleName" 7,313398
652 "ExampleName" 7,313398
653 "ExampleName" 7,239194
654 "ExampleName" 7,239194
860 "ExampleName of Something" 4,5708737
where the first field is just an Id, the second is the Name field on which ElasticSearch performed its search, and the third is the score.
As you can see, there are many duplicates in the ES index. Since some of the found documents have different scores despite being exactly the same (differing only in Id), I concluded that different shards performed the search on different parts of the whole dataset, which leads me to believe that the score is somewhat based on the overall data in a given shard, not exclusively on the document that is actually being considered by the search engine.
The question is, how exactly does this scoring work? Could you tell me/show me/point me to the exact formula used to calculate the score for each document found by ES? And finally, how can this scoring mechanism be changed?

The default scoring is the DefaultSimilarity algorithm in core Lucene, largely documented here. You can customize scoring by configuring your own Similarity, or by using something like a custom_score query.
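For reference, DefaultSimilarity implements Lucene's TF/IDF "practical scoring function". In simplified form (a sketch, not the exact implementation) it computes, for a query q and document d:

score(q,d) = coord(q,d) * queryNorm(q) * sum over terms t in q of [ tf(t in d) * idf(t)^2 * t.getBoost() * norm(t,d) ]

Here tf rewards repeated occurrences of the term in the field, idf rewards terms that are rare in the index, and norm(t,d) favours shorter fields. Since idf and norm are computed from index statistics, identical documents can end up with slightly different scores when those statistics differ, e.g. per shard.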
The odd score variation in the first five results shown seems small enough that it doesn't concern me much as far as the validity of the query results and their ordering is concerned, but if you want to understand the cause of it, the explain API can show you exactly what is going on there.
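For example, with the Python elasticsearch client this is a minimal sketch (the index and type names are placeholders; the document id 650 and the match query are taken from the question above):

from elasticsearch import Elasticsearch

es = Elasticsearch(['localhost:9200'])

# Ask ES to explain the score of one specific document for the given query.
# The response breaks the score down into its tf, idf and norm components.
explanation = es.explain(
    index='someindex',
    doc_type='someentity',
    id='650',
    body={"query": {"match": {"Name": "ExampleName"}}}
)
print(explanation['explanation'])

Comparing this output for two of the "duplicate" documents shows exactly which component differs.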

The score variation is based on the data in a given shard (as you suspected). By default ES uses a search type called 'query then fetch', which sends the query to each shard and finds all the matching documents, scoring them with local term/document frequencies (TF/IDF) - these statistics vary with the data on a given shard, and that is your problem.
You can change this by using the 'dfs query then fetch' search type - it pre-queries each shard for term and document frequencies and only then sends the actual query to each shard, etc.
You can set it in the URL:
$ curl -XGET 'localhost:9200/index/type/_search?pretty=true&search_type=dfs_query_then_fetch' -d '{
    "from": 0,
    "size": 300,
    "explain": true,
    "query": {
        "match": {
            "Name": {
                "query": "ExampleName"
            }
        }
    }
}'
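If you use a client rather than curl, the search type can usually be passed as a parameter. A minimal sketch with the Python client (the index and type names are placeholders; the query body is the same as above):

from elasticsearch import Elasticsearch

es = Elasticsearch(['localhost:9200'])

# dfs_query_then_fetch first gathers global term/document frequencies from
# all shards and then scores with those global statistics, so identical
# documents get identical scores regardless of which shard they live on.
res = es.search(
    index='index',
    doc_type='type',
    search_type='dfs_query_then_fetch',
    body={
        "from": 0,
        "size": 300,
        "explain": True,
        "query": {"match": {"Name": {"query": "ExampleName"}}}
    }
)

Keep in mind that dfs adds an extra round-trip to every shard, so it is usually only worth it when scoring consistency across shards really matters.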

Great explanation in ElasticSearch documentation:
What is relevance:
https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-intro.html
Theory behind relevance scoring:
https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html

Related

Searching for a particular phrase in _all fields returns fewer records than doing the same thing on a small number of fields

I wanted to search for a particular phrase using elasticsearch, both on _all fields and on only 2 fields. The phrase is taken from a file listing more than 10000 keywords. Here is the code:
from elasticsearch import Elasticsearch
import json

es = Elasticsearch(['localhost:9200/'])

# collect the keywords from the file
keyword_array = []
with open('localDrive\\extract_keywords\\t2.txt') as my_keywordfile:
    for keyword in my_keywordfile.readlines():
        keyword_array.append(keyword.strip().strip("'"))

# run a phrase multi_match for each keyword and dump the raw responses
with open('LocalFile\\_description_Results2.txt', 'w', encoding="utf-8") as f:
    for x in keyword_array:
        doc = {
            "query": {
                "multi_match": {
                    "query": x,
                    "type": "phrase",
                    "fields": ["title", "description"],
                }
            }
        }
        res = es.search(index='xxx_062617', body=doc)
        json.dump(res, f, ensure_ascii=False)
        f.write("\n")
Also, the query that matches _all fields is:
"multi_match": {
"query": x,
"type": "phrase",
"fields":"_all",
}
Now what happens is that I get 101 returned records if I query only on title and description, but only 100 returned records if I use _all fields. And if I try to get unique IDs by combining the ids of all records and removing duplicate ones, I see that there are only 86 duplicate records!
My questions are:
Does using type:phrase work differently if I use _all fields?
Shouldn't I get more records if I use _all fields?
If _all includes all fields, including title and description, then why does using _all not cover all the records that were returned by querying title and description?
Thanks,

Solr - Why are scores of documents different although the query has not differentiated between them

I ran the queries below and got this response -
"response":{"numFound":200,"start":0,"maxScore":20.458012,"docs":[
{
"food_group":"Dairy",
"carbs":"13.635",
"protein":"2.625",
"name":"Apple Milkshake",
"fat":"3.814",
"id":"109",
"calories":99.0,
"_version_":1565386306583789568,
"score":20.458012},
{
"food_group":"Proteins",
"carbs":"4.79",
"protein":"4.574",
"name":"Chettinad Egg Curry",
"fat":"6.876",
"id":"526",
"calories":99.0,
"_version_":1565386306489417728,
"score":19.107327}
.....//other documents...
]}
Queries -
q = (food_group:"Proteins" OR
food_group:"Dairy" OR
food_group:"Grains")
bf = div(1,abs(sub(100,calories)))^15
bq = food_group:"Proteins" + food_group:"Dairy" + food_group:"Grains"
My question is: even though I have not provided any boost to "Dairy" with respect to "Proteins" in bq, why does the "Dairy" document have a higher score?
because "Dairy" is a more rare term in your corpus. Lucene will give a higher score to a match with a term that is rare vs a match with a very common term.
If you want to get into the detials, look up how BM25 similarity is computed. BM25 is what Lucene (thus Solr) uses now by default, before it was TD-IDF, but they are very similar.
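Note that the bf function is identical for both documents shown (both have calories 99.0, so div(1,abs(sub(100,99)))^15 contributes the same amount to each), so the gap has to come from the text scoring. To see how rarity alone creates it, here is a small Python sketch of the idf component Lucene's BM25 uses; the document frequencies below are made up, purely to illustrate the effect:

import math

def bm25_idf(doc_freq, doc_count):
    # Lucene's BM25 idf: the fewer documents contain the term, the larger the value
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# hypothetical numbers: "Dairy" in 20 of 200 docs, "Proteins" in 80 of 200
print(bm25_idf(doc_freq=20, doc_count=200))   # ~2.28
print(bm25_idf(doc_freq=80, doc_count=200))   # ~0.92

Everything else being equal, the bq clause matching the rarer term contributes more to the final score.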

Elastic Search input analysis

Can Elastic Search split an input string into categorized words? I.e., if the input is
4star wi-fi 99$
and we are searching hotels with ES, is it possible to analyze/tokenize this string as
4star - hotel level, wi-fi - hotel amenities, 99$ - price?
yep, it's a noob question :)
Yes and no.
By default, query_string searches will work against the automatically created _all field. The contents of the _all field come from literally and naively combining all fields into a single analyzed string.
As such, if you have a "4star" rating, a "wi-fi" amenity, and a "99$" price, then all of those values would be inside of the _all field and you should get relevant hits against it. For example:
{
"level" : "4star",
"amenity" : ["pool", "wi-fi"],
"price" : 99.99
}
The problem is that, without client-side effort, you will not know what field(s) matched when searching against _all. It won't tell you the breakdown of where each value came from; rather, it will simply report a score that determines the overall relevance.
If you have some way of knowing which field each term (or terms) is meant to search against, then you can easily do this yourself (quotes aren't required, but they're good to have to avoid mistakes with spaces). This would be the input that you might provide to the query_string query linked above:
level:"4star" amenity:"wi-fi" price:(* TO 100)
You could further complicate this by using a spelled out query:
{
    "query" : {
        "bool" : {
            "must" : [
                { "match" : { "level" : "4star" } },
                { "match" : { "amenity" : "wi-fi" } },
                {
                    "range" : {
                        "price" : {
                            "lt" : 100
                        }
                    }
                }
            ]
        }
    }
}
Naturally, the last two requests would require advance knowledge about what each search term references. You could certainly use the $ in "99$" as a tipoff for price, but not for the others. Chances are you wouldn't have users typing in "4 stars", I hope; rather you'd have some checkboxes or other form-based selections, so this should be quite realistic.
Technically, you could create a custom analyzer that recognized each term based on their position, but that's not really a good or useful idea.

Redundant query trigger when creating a graph?

Whenever I try to create a new graph with 700,000 to 2 million edges, it takes a long time. Thanks to the great new feature in the API
/_api/query/current
I observed that the graph creation possibly triggers some kind of cache loading automatically - but twice?
[
{
"id": "70",
"query": "FOR x IN GRAPH_VERTICES(#graph, {}) SORT RAND() LIMIT #limit RETURN x",
"started": "2015-03-31T19:06:59Z",
"runTime": 41.95919394493103
},
{
"id": "71",
"query": "FOR x IN GRAPH_VERTICES(#graph, {}) SORT RAND() LIMIT #limit RETURN x",
"started": "2015-03-31T19:06:59Z",
"runTime": 41.95719385147095
}
]
Is this correct? Is there a more efficient way?
Thanks in Advance!
The graph viewer issued the mentioned RAND() query two times:
- one instance is fired to determine a random vertex from the graph
- the other instance is fired to determine the attributes of some random vertices of the graph, in order to populate the search input field
The AQL that was used by the graph viewer was inefficient. It built a big list, sorted it randomly and returned 1 (first query) or 10 (second query) documents from it. This has been fixed in commit c28575f202a58d5c93e6c36883effda48c2a7159, so it is much more efficient now.
The fix will be included in the next build (i.e. 2.5.2).

couchdb - Map Reduce - How to Join different documents and group results within a Reduce Function

I am struggling to implement a map / reduce function that joins two documents and sums the result with reduce.
First document type is Categories. Each category has an ID and within the attributes I stored a detail category, a main category and a division ("Bereich").
{
"_id": "a124",
"_rev": "8-089da95f148b446bd3b33a3182de709f",
"detCat": "Life_Ausgehen",
"mainCat": "COL_LEBEN",
"mainBereich": "COL",
"type": "Cash",
"dtCAT": true
}
The second document type is a transaction. The attributes show all the details for each transaction, including the field "newCat" which is a reference to the category ID.
{
"_id": "7568a6de86e5e7c6de0535d025069084",
"_rev": "2-501cd4eaf5f4dc56e906ea9f7ac05865",
"Value": 133.23,
"Sender": "Comtech",
"Booking Date": "11.02.2013",
"Detail": "Oki Drucker",
"newCat": "a124",
"dtTRA": true
}
Now if I want to develop a map/reduce to get the result in the form:
e.g.: "Name of Main Category", "Sum of all values in transactions".
I figured out that I could reference another document with "_ID:" and ?include_docs=true, but in that case I cannot use a reduce function.
I looked in other postings here, but couldn't find a suitable example.
Would be great if somebody has an idea how to solve this issue.
I understand that multiple Category documents may have the same mainCat value. The technique called view collation is suitable for some cases where a single join would be used in a relational model. In your case it will not help: although you use two document schemes, you really have a three-level structure: main-category <- category <- transaction. I think you should consider changing the DB design a bit.
Duplicating the data, by storing the mainCat value in the transaction document as well, would help. I suggest using a meaningful ID for the transaction instead of a generated one. You could consider, for example, "COL_LEBEN-7568a6de86e5e" (the mainCat concatenated with some random value, where the - delimiter is never present in the mainCat). Then, with a simple parser in the map function, you emit ["COL_LEBEN", "7568a6de86e5e"] for transactions and ["COL_LEBEN"] for categories, and reduce to get the sum.
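Just to illustrate the grouping and summing, here is a tiny Python sketch of that logic (the real CouchDB map/reduce functions would be written in JavaScript; the first transaction value comes from the question, the second is made up):

# "map" step: emit (mainCat, Value) for each transaction, parsing the
# mainCat out of a "COL_LEBEN-7568a6de86e5e" style _id
transactions = [
    {"_id": "COL_LEBEN-7568a6de86e5e", "Value": 133.23, "dtTRA": True},
    {"_id": "COL_LEBEN-9f3ab21c77d01", "Value": 20.00, "dtTRA": True},
]
emitted = [(t["_id"].split("-", 1)[0], t["Value"]) for t in transactions if t.get("dtTRA")]

# "reduce" step: sum the values per main category
totals = {}
for main_cat, value in emitted:
    totals[main_cat] = totals.get(main_cat, 0.0) + value

print(totals)   # {'COL_LEBEN': 153.23}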
