ElasticSearch -- boosting relevance based on field value

Need to find a way in ElasticSearch to boost the relevance of a document based on a particular value of a field. Specifically, there is a special field in all my documents where the higher the field value is, the more relevant the doc that contains it should be, regardless of the search.
Consider the following document structure:
{
"_all" : {"enabled" : "true"},
"properties" : {
"_id": {"type" : "string", "store" : "yes", "index" : "not_analyzed"},
"first_name": {"type" : "string", "store" : "yes", "index" : "yes"},
"last_name": {"type" : "string", "store" : "yes", "index" : "yes"},
"boosting_field": {"type" : "integer", "store" : "yes", "index" : "yes"}
}
}
I'd like documents with a higher boosting_field value to be inherently more relevant than those with a lower boosting_field value. This is just a starting point -- the matching between the query and the other fields will also be taken into account in determining the final relevance score of each doc in the search. But, all else being equal, the higher the boosting field, the more relevant the document.
Anyone have an idea on how to do this?
Thanks a lot!

You can boost either at index time or at query time. I usually prefer query-time boosting even though it makes queries a little slower; otherwise I'd need to reindex every time I want to change my boosting factors, which usually need fine-tuning and need to be pretty flexible.
There are different ways to apply query time boosting using the elasticsearch query DSL:
Boosting Query
Custom Filters Score Query
Custom Boost Factor Query
Custom Score Query
The first three queries are useful if you want to give a specific boost to the documents which match specific queries or filters. For example, if you want to boost only the documents published during the last month. You could use this approach with your boosting_field but you'd need to manually define some boosting_field intervals and give them a different boost, which isn't that great.
The best solution would be to use a Custom Score Query, which allows you to run a query and customize its score using a script. It's quite powerful: with the script you can directly modify the score itself. First of all, I'd scale the boosting_field values to a range from 0 to 1, for example, so that your final score doesn't become a big number. To do that you need to estimate the minimum and maximum values the field can contain, say 0 and 100000. Once the boosting_field value is scaled to a number between 0 and 1, you can add the result to the actual score like this:
{
"query" : {
"custom_score" : {
"query" : {
"match_all" : {}
},
"script" : "_score + (1 * doc.boosting_field.doubleValue / 100000)"
}
}
}
You can also consider using the boosting_field as a boost factor (_score * rather than _score +), but then you'd need to scale it to an interval with a minimum value of 1 (just add a +1).
You can even tune the result to change its importance by adding a weight to the value that you use to influence the score. You will need this even more if you combine multiple boosting factors and want to give each of them a different weight.
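The arithmetic behind the additive and multiplicative variants can be sketched in plain JavaScript, outside Elasticsearch. This is only an illustration: the function names, the weight parameter and the clamping are assumptions, and the 100000 maximum is the example value used above.

```javascript
// Illustrative only: mirrors the scoring arithmetic, not an Elasticsearch API.
const MAX_BOOST = 100000; // assumed upper bound of boosting_field

// Scale the raw field value into the range [0, 1], clamping outliers.
function scaleBoost(value, max = MAX_BOOST) {
  return Math.min(Math.max(value, 0), max) / max;
}

// Additive variant: _score + weight * scaled value.
function additiveScore(score, value, weight = 1) {
  return score + weight * scaleBoost(value);
}

// Multiplicative variant: the factor is shifted by +1 so it is never below 1
// and therefore never lowers the original score.
function multiplicativeScore(score, value, weight = 1) {
  return score * (1 + weight * scaleBoost(value));
}

console.log(additiveScore(2, 50000));       // 2.5
console.log(multiplicativeScore(2, 50000)); // 3
```

A larger weight makes the field dominate the text match; a smaller one turns it into a tie-breaker.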

With a recent version of Elasticsearch (version 1.3+) you'll want to use "function score queries":
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html
A scored query_string search looks like this:
{
"query": {
"function_score": {
"query": { "query_string": { "query": "my search terms" } },
"functions": [{ "field_value_factor": { "field": "my_boost" } }]
}
}
}
"my_boost" is a numeric field in your search index that contains the boost factor for individual documents. May look like this:
{ "my_boost": { "type": "float", "index": "not_analyzed" } }
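A minimal sketch of building that request body in JavaScript follows. The helper name is hypothetical, and the factor, modifier and boost_mode values are assumptions chosen for illustration; they are options of field_value_factor and function_score, but check the reference for your Elasticsearch version before relying on them.

```javascript
// Builds a function_score request body. "my_boost" comes from the answer above;
// the factor, modifier and boost_mode values are illustrative assumptions.
function buildBoostedQuery(searchTerms, boostField = "my_boost") {
  return {
    query: {
      function_score: {
        query: { query_string: { query: searchTerms } },
        functions: [
          {
            field_value_factor: {
              field: boostField,
              factor: 1.2,      // multiplier applied to the field value
              modifier: "log1p" // dampens very large field values
            }
          }
        ],
        boost_mode: "multiply"  // combine the function result with _score
      }
    }
  };
}

console.log(JSON.stringify(buildBoostedQuery("my search terms"), null, 2));
```

The resulting object can be sent as the body of a search request with whichever client you use.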

If you want to avoid doing the boosting inside every query, you might consider adding it directly to your mapping with "boost": factor.
So your mapping may then look like this:
{
"_all" : {"enabled" : "true"},
"properties" : {
"_id": {"type" : "string", "store" : "yes", "index" : "not_analyzed"},
"first_name": {"type" : "string", "store" : "yes", "index" : "yes"},
"last_name": {"type" : "string", "store" : "yes", "index" : "yes"},
"boosting_field": {"type" : "integer", "store" : "yes", "index" : "yes", "boost" : 10.0}
}
}

If you are using Nest, you should use this syntax:
.Query(q => q
.Bool(b => b
.Should(s => s
.FunctionScore(fs => fs
.Functions(fn => fn
.FieldValueFactor(fvf => fvf
.Field(f => f.Significance)
.Weight(2)
.Missing(1)
))))
.Must(m => m
.Match(ma => ma
.Field(f => f.MySearchData)
.Query(query)
))))

Related

How to sort on multiple fields individually using a single index

I am trying to declare multiple fields in a single index, as shown below, and then sort on a single field only. Is that possible?
Is there any way to sort on individual fields dynamically using a single combined-fields index?
{
"index": {
"fields": ["name","createdDate","updatedDate"]
},
"name" : "multi-filter",
"ddoc" : "MultiFilter",
"type" : "json"
}
After that, I can sort on the same fields in the same order, like
{
"selector": {"name": "Robert De Niro"},
"sort": [{"name": "asc"}, {"createdDate": "asc"},{"updatedDate": "asc"}]
}
BUT if I change the sequence or want to use a filter/sort on a single field like
{
"selector": {"name": "Robert De Niro"},
"sort": [{"name": "asc"}]
}
it gives the following error:
{"error":"no_usable_index","reason":"No index exists for this sort, try indexing by the sort fields."}
My aim is to use a single index but sort on individual fields. It looks like this is a limitation of CouchDB and I need to create three separate indexes to make it work, but I am still hoping for a better option.

I found this answer here: "Unknown Error: mango_idx :: {no_usable_index,missing_sort_index}"}
You could define an index containing only the field you sort on, e.g.:
{
"index": {
"fields": ["name"]
},
"name" : "name_sort",
"type" : "json"
}
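If you end up needing one index per sortable field, the definitions can be generated rather than written by hand. A minimal sketch, assuming a hypothetical "<field>_sort" naming scheme; each resulting object would be POSTed to the database's _index endpoint:

```javascript
// Builds one Mango index definition per sortable field.
// The "<field>_sort" naming scheme is an assumption for illustration.
function buildSortIndexes(fields) {
  return fields.map((field) => ({
    index: { fields: [field] },
    name: `${field}_sort`,
    type: "json"
  }));
}

const indexes = buildSortIndexes(["name", "createdDate", "updatedDate"]);
console.log(JSON.stringify(indexes, null, 2));
```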

How to do field mapping in azure search for complex json objects for example nested array

I have the following problem: I need to add a field mapping to an index, and the payload is complex. I have:
{
"type": "abc",
"Party": [{
"Type": "abc",
"Id": "123",
"Name": "manasa",
"Phone": [{
"Type": "Office",
"Number": "12345"
}]
}]
}
Now I want to create a field in the index. The field name is phonenumber, of type Collection(Edm.String),
and the mapping is
{
"sourceFieldName" : "/Party/Phone/Number",
"targetFieldName" : "phonenumber",
"mappingFunction" : { "name" : "jsonArrayToStringCollection" }
}
This goes in the HTTP POST body.
But after indexing, the phone number still comes back as null, which means the mapping went wrong. As you can see in the source JSON, the phone number is inside a JSON array (Phone) that is itself inside another array (Party), and the result needs to be stored in a collection of strings. Is this possible, and how can I achieve it?
If this is not possible, I at least want the field mapping down to the phone array, i.e. /Party/Phone/
If I index the complete Party array as text, I get an error while indexing saying:
"Field 'partydetails' contains a term that is too large to process. The max length for UTF-8 encoded terms is 32766 bytes. The most likely cause of this error is that filtering, sorting, and/or faceting are enabled on this field, which causes the entire field value to be indexed as a single term. Please avoid the use of these options for large fields."
Can someone please help!

If Party had been a JSON object rather than an array, and Phone had been a plain string array, for example
{
"type": "abc",
"Party": {
"Type": "abc",
"Id": "123",
"Name": "manasa",
"Phone": ["12345", "23463"]
}
}
Then I could have mapped
{
"sourceFieldName" : "/Party/Phone",
"targetFieldName" : "phonenumbers",
"mappingFunction" : { "name" : "jsonArrayToStringCollection" }
}
This maps to a collection of OData type Edm.String.
To put this more plainly, you can either:
1) transform your JSON into something flatter (like the example I gave above), or
2) index into the specific array element, if you know the position beforehand, as @Luis Cabrera said:
"sourceFieldName": "/Party/0/Phone/0/Type"
This is a limitation on the Azure Search side.
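The flattening option can be done before the document reaches the indexer. A minimal sketch in JavaScript, assuming you control the step that produces the JSON; the function name is hypothetical and phonenumber is the target field named in the question:

```javascript
// Collects every Phone.Number across all Party entries into a flat string array
// and adds it to the document as a top-level "phonenumber" field.
function flattenPhoneNumbers(doc) {
  const parties = Array.isArray(doc.Party) ? doc.Party : [];
  const numbers = [];
  for (const party of parties) {
    for (const phone of party.Phone || []) {
      numbers.push(String(phone.Number));
    }
  }
  return { ...doc, phonenumber: numbers };
}

const source = {
  type: "abc",
  Party: [
    { Type: "abc", Id: "123", Name: "manasa", Phone: [{ Type: "Office", Number: "12345" }] }
  ]
};

console.log(flattenPhoneNumbers(source).phonenumber); // [ '12345' ]
```

With a top-level string array in the source, the field no longer needs a path through nested arrays.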

Note that Party and Phone are arrays, so the field mapping you mention won't work.
You will need to index into the specific element. For example:
{
"sourceFieldName": "/Party/0/Phone/0/Type",
"targetFieldName": "firstPhoneNumberTypeOfFirstParty"
}
You may want to give that a shot.
Thanks!
Luis Cabrera | Program Manager | Azure Search

Sort documents by a present field and a calculated value

How would I go about displaying the best review and the worst review at the top of the page?
I think the user's "useful" and "notUseful" votes should have an effect on the result.
I have reviews, and if people click the useful or notUseful button their id gets added to the appropriate array (useful or notUseful).
You can tell whether a review is positive or negative by its "overall" score, which runs from 1 through 5, so 1 would be the worst and 5 would be the best.
I guess if someone gave a review with an overall score of 5 but got only one useful vote, while someone else gave a review with an overall score of 4 and 100 people clicked "useful", the one with 100 votes should be shown as the best positive?
I only want to show 2 reviews at the top of the page, the best and the worst. If there are ties on the overall score, the deciding factor should be usefulness. So if there are 2 reviews with the same overall score, and one has 5 usefuls and 10 notUsefuls (a net of -5) while the other has 5 usefuls and 4 notUsefuls (a net of 1), the second one would be shown to break the tie.
I'm hoping to do it with one mongoose query and not aggregation, but I think the answer will be aggregation.
I guess there could be a cutoff, e.g. a score greater than 3 is a positive review and anything lower is a negative review.
I use mongoose.
Thanks in advance for your help.
some sample data.
{
"_id" : ObjectId("5929f89a54aa92274c4e4677"),
"compId" : ObjectId("58d94c441eb9e52454932db6"),
"anonId" : ObjectId("5929f88154aa92274c4e4675"),
"overall" : 3,
"titleReview" : "53",
"reviewText" : "53",
"companyName" : "store1",
"replies" : [],
"version" : 2,
"notUseful" : [ObjectId("58d94c441eb9e52454932db6")],
"useful" : [],
"dateCreated" : ISODate("2017-05-27T22:07:22.207Z"),
"images" : [],
"__v" : 0
}
{
"_id" : ObjectId("5929f8dfa1435135fc5e904b"),
"compId" : ObjectId("58d94c441eb9e52454932db6"),
"anonId" : ObjectId("5929f8bab0bc8834f41e9cf8"),
"overall" : 3,
"titleReview" : "54",
"reviewText" : "54",
"companyName" : "store1",
"replies" : [],
"version" : 1,
"notUseful" : [ObjectId("5929f83bf371672714bb8d44"), ObjectId("5929f853f371672714bb8d46")],
"useful" : [],
"dateCreated" : ISODate("2017-05-27T22:08:31.516Z"),
"images" : [],
"__v" : 0
}
{
"_id" : ObjectId("5929f956a692e82398aaa2f2"),
"compId" : ObjectId("58d94c441eb9e52454932db6"),
"anonId" : ObjectId("5929f93da692e82398aaa2f0"),
"overall" : 3,
"titleReview" : "56",
"reviewText" : "56",
"companyName" : "store1",
"replies" : [],
"version" : 1,
"notUseful" : [],
"useful" : [],
"dateCreated" : ISODate("2017-05-27T22:10:30.608Z"),
"images" : [],
"__v" : 0
}

If I am reading your question correctly, you want the calculated difference between the "useful" and "notUseful" votes to also be taken into account when sorting on the "overall" score of the documents.
The better option here is to include that calculation in your stored documents, but for completeness we will cover both options.
Aggregation
Without changes to your schema and other logic, aggregation is indeed required to do that calculation. This is best presented as:
Model.aggregate([
{ "$addFields": {
"netUseful": {
"$subtract": [
{ "$size": "$useful" },
{ "$size": "$notUseful" }
]
}
}},
{ "$sort": { "overall": 1, "netUseful": -1 } }
],function(err, result) {
})
So you are basically getting the difference between the sizes of the two arrays, where more "useful" responses have a positive impact, boosting the ranking, and more "notUseful" responses reduce it. Depending on the MongoDB version you have available, you use either $addFields with only the additional field or $project with all the fields you need to return.
The $sort is then performed on the combination of the "overall" score in ascending order as per your rules, and the new field of "netUseful" in descending order ranking "positive" to "negative".
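To see how the tie-break behaves, the same calculation can be reproduced in memory on data shaped like the samples in the question. This is only a sketch for checking the ordering on small hand-made data, not a replacement for the pipeline; the function name and the shortened user ids are placeholders.

```javascript
// In-memory equivalent of the $addFields + $sort stages, for small test data only.
function rankReviews(reviews) {
  return reviews
    .map((r) => ({ ...r, netUseful: r.useful.length - r.notUseful.length }))
    // "overall" ascending, then "netUseful" descending (same as the $sort stage)
    .sort((a, b) => a.overall - b.overall || b.netUseful - a.netUseful);
}

const sample = [
  { titleReview: "53", overall: 3, useful: [], notUseful: ["u1"] },
  { titleReview: "54", overall: 3, useful: [], notUseful: ["u1", "u2"] },
  { titleReview: "56", overall: 3, useful: [], notUseful: [] }
];

console.log(rankReviews(sample).map((r) => r.titleReview)); // [ '56', '53', '54' ]
```

All three sample reviews share the same "overall" score, so the order is decided entirely by the net vote count.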
Re-Modelling
Foregoing aggregation altogether, you get a faster result from the plain query. But this of course means maintaining that "score" in the document as you add members to the array.
At its simplest, you use the $inc update operator along with $push to change the score.
So for a "useful" entry, you would do something like this:
Model.update(
{ "_id": docId, "useful": { "$ne": userId } },
{
"$push": { "useful": userId },
"$inc": { "netUseful": 1 }
},
function(err, status) {
}
)
And for a "notUseful" you do the opposite by "decrementing" with a negative value to $inc:
Model.update(
{ "_id": docId, "notUseful": { "$ne": userId } },
{
"$push": { "notUseful": userId },
"$inc": { "netUseful": -1 }
},
function(err, status) {
}
)
To cover all cases, including where a vote is changed from "useful" to "notUseful", you would expand on the logic and implement the appropriate reverse actions with $pull. But this should give the general idea.
N.B The reason we do not use the $addToSet operation here is because we want to make sure the user id is not present in the array when "incrementing" or "decrementing". Thus instead the $ne operator is used to test the value does not exist. If it does, then we do not attempt to modify the array or affect the "netUseful" value. The same applies to the reverse case of "removing" the user from those votes.
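As one possible shape for the "changed vote" case, the filter and update documents could be built as below. The helper name is hypothetical and this is a sketch rather than a complete voting implementation; the operators themselves ($pull, $push, $inc, $ne) are standard MongoDB.

```javascript
// Builds the filter and update for switching a vote from "useful" to "notUseful".
function buildSwitchToNotUseful(docId, userId) {
  return {
    // Match only when the user currently has a "useful" vote and no "notUseful" vote.
    filter: { _id: docId, useful: userId, notUseful: { $ne: userId } },
    update: {
      $pull: { useful: userId },
      $push: { notUseful: userId },
      // -1 for the removed "useful" vote, -1 for the added "notUseful" vote
      $inc: { netUseful: -2 }
    }
  };
}

const { filter, update } = buildSwitchToNotUseful("docId", "userId");
// With Mongoose these would be passed as Model.update(filter, update, callback).
console.log(JSON.stringify(filter));
console.log(JSON.stringify(update));
```

Because the filter requires the existing "useful" vote, the update is a no-op for users who never voted, which keeps "netUseful" consistent with the arrays.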
Since the calculation is always maintained with each update, you simply perform a query with a standard .sort():
Model.find().sort({ "overall": 1, "netUseful": -1 }).exec(function(err,results) {
})
So by moving the "cost" into the maintenance of the "votes", you remove the overhead of running the aggregation later. For my money, where this is a regular operation and the "sort" does not rely on other run-time parameters which force the calculation to be dynamic, then you use the stored result instead.

I18n search and filtering in Elasticsearch

tldr;
How to match and filter localized search with a localized index ?
long version
I have an application where the user's search must be done in the context of their language.
In the Elasticsearch index, I want documents with both i18n properties and non-i18n properties (I want to avoid creating multiple indices, one for each language).
The mapping of the document should look like :
'entry': {
'properties': {
'name' : {'type': 'string'}, /* unlocalized properties */
'category': { /* localized properties */
"properties" : {
"lang_fr" : {
"type" : "string"
},
"lang_de" : {
"type" : "string"
}
}
},}}
having that, I have two requirements:
1) Matching: when doing a search, exclude the localized fields that do not match the user's language (say the user's language is 'fr'; I want to exclude the 'de' fields from the search). How can I do this without specifying the entire list of fields I want to search on? To start simple, I tried this, but it doesn't work:
{
"query": {
"match": {
"*.lang_fr": "full_text"
}
}
}
However, "categories.lang_fr": "full_text" works well. But I don't want to maintain the list of fields in the query. I want a general rule like you can do in SolR.
2) Filtering: when I retrieve my results, I want to filter out all localized fields that don't correspond to my user's language. In other words, using the source filter, I'd like to keep all unlocalized fields, exclude all fields starting with "lang", but include all "lang_fr" fields. I tried the following, but it doesn't work:
{
"_source": {
"include": [ "*", "*.lang_fr" ],
"exclude": [ "*.lang_*" ],
}
...}
The wildcard operator doesn't seem to work. I get part of what I want if I specify "categories.lang_de", but again, I don't want to maintain the list of fields; I want a generic rule. The include/exclude operation doesn't work as I would like. The only thing that actually works is a query where I explicitly list every language to exclude for every field, such as:
{
"_source": {
"exclude": [ "categories.lang_de", "categories.lang_en", "categories.lang_it",
"another_field.lang_de", "another_field.lang_en", "another_field.lang_it" ]
}
...}
for 'fr' search.
I'm quite surprised I couldn't find anything on Google. I see it as a very standard case of i18n applied to Elasticsearch. Maybe I'm modeling i18n the wrong way in ES?
thank you in advance !

You can achieve the first one using a query_string query, which takes advantage of the powerful Lucene expression language and allows you to specify wildcards in field names:
{
"query": {
"query_string": {
"query": "\\*.lang_fr:full_text"
}
}
}
or you can also specify the field name in the fields parameter, like this
{
"query": {
"query_string": {
"query": "full_text",
"fields": ["*.lang_fr"]
}
}
}
As for your second one, source filtering is indeed the way to go but I suggest simply excluding all languages but the one you're searching for. For instance, if the search is in French, you'd simply exclude all other languages without necessarily having to enumerate all the fields, just all the languages that you don't want (which would be much less). That would allow you to add localized fields as you go without having to change the query.
{
"_source": {
"exclude": [ "*.lang_de", "*.lang_it" ]
}
...}
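The exclusion list can be derived from the user's language instead of being written out per query. A minimal sketch, assuming a hypothetical list of supported languages and the lang_<code> field naming used in the question:

```javascript
// The list of supported languages is an assumption; adjust it to your index.
const SUPPORTED_LANGS = ["fr", "de", "it", "en"];

// Builds a _source filter that excludes every localized field
// except those of the user's language.
function buildSourceFilter(userLang, langs = SUPPORTED_LANGS) {
  return {
    _source: {
      exclude: langs
        .filter((lang) => lang !== userLang)
        .map((lang) => `*.lang_${lang}`)
    }
  };
}

console.log(JSON.stringify(buildSourceFilter("fr")));
// {"_source":{"exclude":["*.lang_de","*.lang_it","*.lang_en"]}}
```

Adding a new localized field then requires no query change; only adding a new language touches the list.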

Using near with elemMatch in Mongoose

I am searching within a collection of Stores. Stores have an embedded collection of outlets with locations. My goal is to return the set of stores that have outlets near a geolocation, and also only return those Outlets within that location.
I can successfully restrict the query to only return Stores that have an Outlet at a particular location using 'near':
Store
.where('isActive').equals(true)
.where('outlets.location')
.near({ center: [153.027117, -27.468515], maxDistance: 1000 / 6378137, spherical: true })
.where('outlets.isActive').equals(true)
.where('products.productType').equals('53433f1f3e02e39addde1954')
.where('products.isActive').equals(true)
.select('name outlets')
.select({'products': {$elemMatch: {'isActive': true, 'productType': '53433f1f3e02e39addde1954'}}})
.select('name outlets')
.execQ()
.then(function (results) {
console.log(results);
})
.fail(function (err) {
console.log(err);
})
.done();
The problem I have is that the store document returns all the outlets, not just the outlet that matched the geolocation. I've tried using elemMatch within a select like I did with the products:
.select({'outlets': {$elemMatch: {'location': {near:{ center: [153.027117, -27.468515], maxDistance: 10000 / 6378137, spherical: true }}}}})
However, it returns an empty array. Can I use the near operator in an elemMatch clause? Am I doing it incorrectly? Is there a more efficient/faster/better way to achieve the goal?

I see what you are trying to do here, but there seem to be a few flaws in this sort of design. Though it is not exactly your document structure, I see you are trying to do something like this:
{
"_id" : ObjectId("5344badd519563414f23fdf8"),
"store" : "Mine",
"outlets" : [
{
"name" : "somewhere",
"loc" : {
"type" : "Point",
"coordinates" : [
150.975131,
-33.8440366
]
}
},
{
"name" : "else",
"loc" : {
"type" : "Point",
"coordinates" : [
151.3651524,
-33.8389783
]
}
}
]
}
{
"_id" : ObjectId("5344be6f519563414f23fdf9"),
"store" : "Another",
"outlets" : [
{
"name" : "else",
"loc" : {
"type" : "Point",
"coordinates" : [
151.3651524,
-33.8389783
]
}
},
{
"name" : "somewhere",
"loc" : {
"type" : "Point",
"coordinates" : [
150.975131,
-33.8440366
]
}
}
]
}
So basically you appear to be attempting to nest the outlet locations within an array in a top level document.
What I am referring to as a flaw here is that, by design, any type of "near" based query is going to return more than 1 result. That does seem logical when you look at the purpose. You can of course restrict the results with "maxDistance", but generally there will be more than 1.
So the only way is to .limit() the results returned by the cursor to a single "nearest" response. Also note that with some operations those results are not necessarily "sorted" with the "nearest" response first.
Now as these results are actually contained within an array of the document, remember that .find() itself does not actually "filter" the results of an array, so of course the document will contain all of the array contents.
If you tried to "project" with a positional $ operator, then the problem falls back to the original point because there is no singular actual match, so it is not possible to return an "index" value for the matching element. If you in fact did try this, you would always get the default index value of 0, so just returning the first element.
If you then thought you could turn to aggregation and try to "de-normalize" the array entries, you would be out of luck, because the geospatial match needs to use the index and so has to happen in the first stage of the aggregation pipeline.
So the summary of this is that embedded entries like this are not suited to this design where you need to do geo-spatial matching on those store locations. The locations are better off in a separate collection:
{
"_id" : ObjectId("5344bec7519563414f23fdfa"),
"store": "Mine",
"name" : "else",
"loc" : {
"type" : "Point",
"coordinates" : [
151.3651524,
-33.8389783
]
}
}
{
"_id" : ObjectId("5344bed5519563414f23fdfb"),
"store": "Mine",
"name" : "somewhere",
"loc" : {
"type" : "Point",
"coordinates" : [
150.975131,
-33.8440366
]
}
}
So that would allow you to "limit" the result to the "nearest" match by setting the limit to 1. You can also include any necessary values such as the "store" to be used in your filtering. If you need to you can include other information aside from what you need to filter or otherwise just put the ObjectId values within the array of the original object, or possibly even duplicate for both collections.
But since these queries are by nature intended to return more than 1 match, there is no way you are going to get this to work on embedded documents. So your solution will require some changes to your overall schema.
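With outlets in their own collection, the "nearest outlet" lookup becomes an ordinary filter plus a limit. A minimal sketch of building that filter, assuming a 2dsphere index on "loc" and the document layout shown above; the helper name and the Outlet model mentioned in the comment are hypothetical:

```javascript
// Builds a filter for a separate outlets collection with a 2dsphere index on "loc".
function buildNearestOutletFilter(store, longitude, latitude, maxMeters) {
  return {
    store: store,
    loc: {
      $near: {
        // GeoJSON points are [longitude, latitude]
        $geometry: { type: "Point", coordinates: [longitude, latitude] },
        // with GeoJSON and a 2dsphere index the distance is given in meters
        $maxDistance: maxMeters
      }
    }
  };
}

const nearFilter = buildNearestOutletFilter("Mine", 153.027117, -27.468515, 1000);
// e.g. Outlet.find(nearFilter).limit(1) with a hypothetical Outlet model
console.log(JSON.stringify(nearFilter, null, 2));
```

Using GeoJSON with meters also avoids the manual division by the Earth's radius that the legacy radians form in the question needs.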
