Azure search - custom function for result boosting - azure

I'm trying to move a "complex" function to Azure Search. This function calculates a score for each result element based on filter data (from the search query) and data stored in the result element. The score is used for result boosting. Based on my research, Azure Search provides result boosting, but it's too simple for my requirements.
Example function:
// filterElementsIds - IDs taken from the search query filter
public double Score(IEnumerable<string> filterElementsIds, ResultElement element)
{
    double score = 0;
    foreach (var elem in element.ScoreForFilters)
        if (filterElementsIds.Any(x => x == elem.Key))
            score += elem.Value * 1.5;
    return score;
}
Currently, I iterate through each result returned by Azure Search, calculate the score, and sort the elements inside my application.
Is it possible to implement such a function in Azure Search to improve the process of boosting results?

I'm not sure I fully understand your question, but it appears that you are trying to boost the score of certain documents if their key is equal to any of the IDs in your collection of "filterElements". If so, you could use the Lucene query language to craft a query that does that:
https://learn.microsoft.com/en-us/azure/search/search-query-lucene-examples
You could do a search that looks like this:
OriginalSearchTerm OR (OriginalSearchTerm AND key:("filterID1" OR "filterID2" OR "filterID3"))
That way, documents that match the original search term and also have one of the filter IDs in the "key" field will score higher than documents that only match the original search term. You can also use term boosting to give a specific boost to the key field in this case:
https://learn.microsoft.com/en-us/azure/search/search-query-lucene-examples#example-5-term-boosting
OriginalSearchTerm OR (OriginalSearchTerm AND key:("filterID1" OR "filterID2" OR "filterID3")^2)
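Since the filter IDs come from the search request, the boosted query string would normally be assembled in application code. A minimal sketch in Python follows, assuming the full Lucene syntax is enabled on the request (queryType=full). The helper name build_boosted_query and its field and boost parameters are hypothetical, and no escaping of special characters is attempted.

```python
def build_boosted_query(search_term, filter_ids, field="key", boost=2):
    """Build a Lucene query string that boosts documents matching any filter ID.

    Hypothetical helper: the field name and boost factor are illustrative defaults.
    """
    if not filter_ids:
        # Nothing to boost on, so fall back to the plain search term.
        return search_term
    quoted_ids = " OR ".join('"{}"'.format(fid) for fid in filter_ids)
    return "{term} OR ({term} AND {field}:({ids})^{boost})".format(
        term=search_term, field=field, ids=quoted_ids, boost=boost
    )


query = build_boosted_query(
    "OriginalSearchTerm", ["filterID1", "filterID2", "filterID3"]
)
print(query)
```

The printed string is the same term-boosting query shown above, so the per-request work in the application shrinks to building one string.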


Timeseries differencing - ArangoDB (AQL or Python)

I have a collection which holds documents, with each document having a data observation and the time that the data was captured.
e.g.
{
_key:....,
"data":26,
"timecaptured":1643488638.946702
}
where timecaptured for now is a UTC timestamp.
What I want to do is get the duration between consecutive observations, that is, the difference in timestamps between two documents in time order. With SQL I could do this with LAG, for example, but with ArangoDB and AQL I am struggling to see how to do this in the database. I have a lot of data and I don't really want to pull it all into pandas.
Any help really appreciated.
Although the solution provided by CodeManX works, I prefer a different one:
FOR d IN docs
SORT d.timecaptured
WINDOW { preceding: 1 } AGGREGATE s = SUM(d.timecaptured), cnt = COUNT(1)
LET timediff = cnt == 1 ? null : d.timecaptured - (s - d.timecaptured)
RETURN timediff
We simply calculate the sum of timecaptured over the previous and the current document. Subtracting the current document's timecaptured from that sum gives the timecaptured of the previous document, so we can easily calculate the requested difference.
I only use the COUNT to return null for the first document (which has no predecessor). If you are fine with having a difference of zero for the first document, you can simply remove it.
However, neither approach is very straightforward or obvious. I have put it on my TODO list to add an APPEND aggregate function that could be used in WINDOW and COLLECT operations.
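To see why the sum-based trick works, the arithmetic can be replayed outside the database. This is only a sketch in Python that imitates a window of one preceding row over a sorted list; the function name window_sum_diffs and the sample timestamps are made up for illustration.

```python
def window_sum_diffs(timestamps):
    """Imitate WINDOW { preceding: 1 } with SUM and COUNT over sorted values."""
    ordered = sorted(timestamps)
    diffs = []
    for i, current in enumerate(ordered):
        # The window holds the previous row (if any) plus the current row.
        window = ordered[max(0, i - 1): i + 1]
        s, cnt = sum(window), len(window)
        # s - current recovers the previous timestamp; the first row has none.
        diffs.append(None if cnt == 1 else current - (s - current))
    return diffs


diffs = window_sum_diffs([30.0, 10.0, 25.0])
print(diffs)  # [None, 15.0, 5.0]
```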
The WINDOW function doesn't give you direct access to the data in the sliding window but here is a rather clever workaround:
FOR doc IN collection
SORT doc.timecaptured
WINDOW { preceding: 1 }
AGGREGATE d = UNIQUE(KEEP(doc, "_key", "timecaptured"))
LET timediff = doc.timecaptured - d[0].timecaptured
RETURN MERGE(doc, {timediff})
The UNIQUE() function is available for window aggregations and can be used to get at the desired data (previous document). Aggregating full documents might be inefficient, so a projection should do, but remember that UNIQUE() will remove duplicate values. A document _key is unique within a collection, so we can add it to the projection to make sure that UNIQUE() doesn't remove anything.
The time difference is calculated by subtracting the previous document's timecaptured value from the current document's one. In the case of the first record, d[0] is actually equal to the current document and the difference ends up being 0, which I think is sensible. You could also write d[-1].timecaptured - d[0].timecaptured to achieve the same. d[1].timecaptured - d[0].timecaptured, on the other hand, will give you the inverted timestamp for the first record because d[1] is null (no previous document) and evaluates to 0.
There is one risk: UNIQUE() may alter the order of the documents. You could use a subquery to sort by timecaptured again:
LET timediff = doc.timecaptured - (
FOR dd IN d SORT dd.timecaptured LIMIT 1 RETURN dd.timecaptured
)[0]
But it's not great for performance to use a subquery. Instead, you can use the aggregation variable d to access both documents and calculate the absolute value of the subtraction so that the order doesn't matter:
LET timediff = ABS(d[-1].timecaptured - d[0].timecaptured)

Solr - Why are scores of documents different although the query has not differentiated between them

I ran the queries below and got this response:
"response":{"numFound":200,"start":0,"maxScore":20.458012,"docs":[
{
"food_group":"Dairy",
"carbs":"13.635",
"protein":"2.625",
"name":"Apple Milkshake",
"fat":"3.814",
"id":"109",
"calories":99.0,
"_version_":1565386306583789568,
"score":20.458012},
{
"food_group":"Proteins",
"carbs":"4.79",
"protein":"4.574",
"name":"Chettinad Egg Curry",
"fat":"6.876",
"id":"526",
"calories":99.0,
"_version_":1565386306489417728,
"score":19.107327}
.....//other documents...
]}
Queries:
q = (food_group:"Proteins" OR
food_group:"Dairy" OR
food_group:"Grains")
bf = div(1,abs(sub(100,calories)))^15
bq = food_group:"Proteins" + food_group:"Dairy" + food_group:"Grains"
My question is: even though I have not given "Dairy" any boost relative to "Proteins" in bq, why does the "Dairy" document have a higher score?
Because "Dairy" is a rarer term in your corpus. Lucene gives a higher score to a match on a rare term than to a match on a very common term.
If you want to get into the details, look up how BM25 similarity is computed. BM25 is what Lucene (and thus Solr) now uses by default; before that it was TF-IDF, but they are very similar.
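The effect of term rarity can be illustrated with the inverse document frequency part of BM25. The sketch below is in Python and uses the common formulation idf = ln(1 + (N - n + 0.5) / (n + 0.5)), where N is the number of documents and n the number containing the term. The corpus size and document frequencies are hypothetical, not taken from the index in the question.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """IDF component of BM25: rarer terms (small doc_freq) get larger values."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


total_docs = 200  # hypothetical corpus size
idf_dairy = bm25_idf(total_docs, 20)     # assumed: few documents are "Dairy"
idf_proteins = bm25_idf(total_docs, 90)  # assumed: many documents are "Proteins"

print(idf_dairy > idf_proteins)  # True
```

With identical query clauses, the document matching the rarer term therefore starts from a higher base score before any bf or bq contribution is added.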

Elastic Search Java API Multi match query prefix query on tokens

I am looking for a way to search my index with NativeSearchQueryBuilder from the Elasticsearch Java API, with a few additions to the search.
Index details:
Filter type EdgeNgram
White space tokenizer
I am looking for autocomplete functionality, so I want to apply the search keyword to multiple fields, but as a prefix match to improve performance. I also want the results to be returned as soon as my specified page limit is reached, instead of the search continuing through the index after it has found enough results.
Ex: "albert einstein" is in my index; if I search "alb" or "ein", it should return that result.
NativeSearchQueryBuilder sb = new NativeSearchQueryBuilder()
.withIndices(Constants.ES_INDEX_NAME)
//.withPageable(pageable)
.withSourceFilter(new FetchSourceFilterBuilder().withIncludes("id").build())
.withTypes(Constants.USERS_TYPE)
.withQuery(multiMatchQuery("alb", new String[]{"userFirstName","userLastName","userMobile", "userEmail"}))
.withFilter(boolQuery()
.must(termQuery("userCityName", "Chicago")));
Could someone please help me add a prefix and a limit to my multi-match query builder?
What you are looking for is match_phrase_prefix:
int limit = 100; // Set your limit
NativeSearchQueryBuilder sb = new NativeSearchQueryBuilder()
    .withIndices(Constants.ES_INDEX_NAME)
    .withPageable(new PageRequest(0, limit)) // first page, at most `limit` hits
    .withSourceFilter(new FetchSourceFilterBuilder().withIncludes("id").build())
    .withTypes(Constants.USERS_TYPE)
    // multi_match of type phrase_prefix treats the last term as a prefix
    .withQuery(QueryBuilders.multiMatchQuery("alb", "userFirstName", "userLastName", "userMobile", "userEmail")
        .type(MultiMatchQueryBuilder.Type.PHRASE_PREFIX))
    .withFilter(boolQuery()
        .must(termQuery("userCityName", "Chicago")));
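For context on why "alb" and "ein" can match "albert einstein" at all with the index described in the question, here is a rough imitation in Python of a whitespace tokenizer followed by an edge n-gram filter. It is a sketch, not Elasticsearch's implementation: the function name edge_ngrams, the min_gram and max_gram defaults, and the lowercasing step are all assumptions.

```python
def edge_ngrams(text, min_gram=1, max_gram=10):
    """Split on whitespace, then emit the leading prefixes of every token."""
    grams = set()
    for token in text.lower().split():
        longest = min(max_gram, len(token))
        for n in range(min_gram, longest + 1):
            grams.add(token[:n])
    return grams


indexed = edge_ngrams("albert einstein")
print("alb" in indexed, "ein" in indexed, "bert" in indexed)  # True True False
```

Only prefixes of each whitespace-separated token are indexed, which is why a search for the start of either word matches while a fragment from the middle of a word does not.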

Couchdb query for values calculated from key input

Suppose I have the following data in my database:
[1,2],[2,1],[1,3],[3,1]...
where the numbers represent the a and b values of the formula a*x+b.
What I now want is a query that returns the difference to a given point x,y.
For example: the point [2,6] is given. I want my query to return
[1,2] = -2 (1*2+2=4 4-6=-2)
[2,1] = -1 (2*2+1=5 5-6=-1)
[1,3] = -1 (1*2+3=5 5-6=-1)
[3,1] = 1 (3*2+1=7 7-6=1)
I know how to do this in SQL but the data is already in a couchdb. I'm quite new to the NoSQL world and was wondering if something like this would be possible in couchdb.
What you can do is use the standard MapReduce functionality of CouchDB.
Map is a function you put in a view; it selects your data, and you can use various criteria to pick out the docs you need. Next, if a reduce function is defined and the query runs with reduce=true, the reduce function is executed over the rows emitted by the map function. You can use JavaScript to perform various operations on the documents' values.
In your case, the map can look something like this:
function(doc) {
if(doc.a && doc.b) {
emit(doc._id,[doc.a, doc.b]);
}
}
then, the reduce gets called, like this:
function(keys, values, rereduce) {
var res;
//do something with values...
return res;
}
In your case, keys will be a list of document IDs and values will be the list of your [a, b] arrays.
When you query the view (how depends on the method you use to access the DB), you should specify reduce=true.
Good resources on MapReduce (and on views, sorting and list functions) are:
http://guide.couchdb.org/draft/views.html
http://www.slideshare.net/okurow/couchdb-mapreduce-13321353
Another way to go is to use a list function on the map result, if you want to output the result in HTML. A good reason to use a list function is that you can pass arguments to it in the query string; in your case that could be the point for which you want to calculate the differences.
For a detailed description of list functions, have a look here:
http://guide.couchdb.org/draft/transforming.html
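The calculation itself is small. As a sketch of what would run per emitted [a, b] row once the point is passed in as a parameter, here it is in Python (a real CouchDB list function would be written in JavaScript). The function name differences is hypothetical, and the data is the example from the question.

```python
def differences(lines, point):
    """For each (a, b), evaluate a*x + b at the point's x and subtract its y."""
    x, y = point
    return {(a, b): a * x + b - y for a, b in lines}


result = differences([(1, 2), (2, 1), (1, 3), (3, 1)], (2, 6))
print(result)  # {(1, 2): -2, (2, 1): -1, (1, 3): -1, (3, 1): 1}
```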
Hope this helps.

Lucene wild card search

How can I perform a wildcard search in Lucene ?
I have the text: "1997_titanic"
If I search like "1997_titanic", it is returning a result, but I am not able to do below two searches:
1) If I search with only 1997 it is not returning any results.
2) Also if there is a space, such as in "spider man", that is not finding any results.
I retrieve all movie information from a DB and store it in Lucene Documents:
public Document createMovieDoc(Movie m){
document.add(new StoredField("moviename", m.getName()));
TextField field = new TextField("movienameSearch", m.getName().toLowerCase(), Store.NO);
field.setBoost(5.0f);
document.add(field);
}
And to search, I have this method:
public List searh(String txt){
PhraseQuery phQuery= new PhraseQuery();
Term term = new Term("movienameSearch", txt.toLowerCase());
BooleanQuery b = new BooleanQuery();
b.add(phQuery, Occur.SHOULD);
TopFieldDocs tp= searcher.search(b, 20, ..);
for(int i=0;i<tp.length;i++)
{
int mId = tp[i].doc;
Document d = searcher.doc(mId);
String moviename = d.get("moviename");
list.add(moviename);
}
return list;
}
I'm not sure what analyzer you are using to index. Sounds like maybe WhitespaceAnalyzer? It sounds like, when indexing, "1997_titanic" remains a single token, while "spider man" is split into the tokens "spider" and "man".
Could also be SimpleAnalyzer which uses a LetterTokenizer. This would make it impossible to search for "1997", since that tokenizer will eliminate all numbers for the indexed representation of the text.
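The difference between the two tokenization styles can be imitated in a few lines of Python. This is a simplified sketch, not Lucene's code: whitespace_tokens splits on whitespace only, and letter_tokens keeps runs of ASCII letters (an assumption made for brevity; the real tokenizer works on Unicode letters).

```python
import re


def whitespace_tokens(text):
    """Split on whitespace only, so punctuation and digits stay in the token."""
    return text.split()


def letter_tokens(text):
    """Keep maximal runs of ASCII letters, dropping digits and underscores."""
    return re.findall(r"[A-Za-z]+", text)


print(whitespace_tokens("1997_titanic"))  # ['1997_titanic']
print(whitespace_tokens("spider man"))    # ['spider', 'man']
print(letter_tokens("1997_titanic"))      # ['titanic']
```

In the first case "1997" never exists as a token of its own, and in the last case it is discarded entirely, so a search for "1997" finds nothing either way.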
Your search method doesn't look right. You aren't adding any terms to your PhraseQuery, so I wouldn't expect it to find anything. You must add some terms in order for anything to be found. You create a Term in what you've provided, but nothing is ever done with that Term. Maybe this has something to do with how you've picked your excerpts? Not sure, I'm a bit confused by that.
In order to manually construct a PhraseQuery you must add each term individually, so to search for "spider man", you would do something like:
PhraseQuery phQuery= new PhraseQuery();
phQuery.add(new Term("movienameSearch", "spider"));
phQuery.add(new Term("movienameSearch", "man"));
This requires you to know what the analyzer was doing at index time, and tokenize the input yourself to suit. The simpler solution is to just use the QueryParser:
//With whatever analyzer you like to use.
QueryParser parser = new QueryParser(Version.LUCENE_46, "defaultField", analyzer);
Query query = parser.parse("movienameSearch:\"" + txt.toLowerCase() + "\"");
TopDocs tp = searcher.search(query, 20); // search(Query, int) returns TopDocs
This allows you to rely on the same analyzer to index and query, so you don't have to know how to tokenize your phrases to suit.
As far as finding "1997" and "titanic" individually, I would recommend just using StandardAnalyzer. It will tokenize those into discrete tokens, allowing them to be searched very easily, with a simple query like: movienameSearch:1997.
