How can I exclude scores that equal 0 in a Solr function query and maintain the actual score? - search

My goal is to round score to group similar items and then sort by another field (let's use price as an example).
I'm able to accomplish this with the following query:
/select?defType=func&q=rint(product(query({!v=the search term}),100))&fl=score,price&sort=score%20desc,price
However, this query returns every document indexed in Solr.
How can I filter this query so that items with a score of 0 are excluded?
I've tried adding {!frange l=1} to the query which kind of worked... but it made all of the scores equal to 1. This obviously isn't good because I need to show the most relevant results first.
Thanks in advance for any help.
Alex

I spent hours trying to filter out values with a relevance score of 0. I couldn't find any straight forward way to do this. I ended up accomplishing this with a workaround that assigns the query function to a local param. I call this local param in both the query ("q=") and the filter query ("fq=").
Example
Let's say you have a query like:
q={!func}sum(*your arguments*)
First, make the function component its own parameter:
q={!func}$localParam
&localParam={!func}sum(*your arguments*)
Now to only return results with scores between 1 and 10 simply add a filter query on that localParam:
q={!func}$localParam
&localParam={!func}sum(*your arguments*)
&fq={!frange l=1 u=10 inclusive=true}$localParam

Related

PySpark Design Pattern for Combining Values Based on Criteria

Hi I am new to PySpark and want to create a function that takes a table of duplicate rows and a dict of {field_names : ["the source" : "the approach for getting the record"]} as an input and creates a new record. The new record will be equal to the first non-null value in the priority list where each "approach" is a function.
For example, the input table looks like this for a specific component:
And given this priority dict:
The output record should look like this:
The new record looks like this because for each field there is a function selected that dictates how the value is selected. (e.g. phone is equal to 0.75 as Amazon's most complete record is null so you coalesce to the next approach in the list which is the value of phone for the most complete record for Google = 0.75)
Essentially, I want to write a pyspark function that groups by components and then applies the appropriate function for each column to get the correct value. While I have a function that "works" the time complexity is terrible as I am naively looping through each component then each column, then each approach in the list to build the record.
Any help is much appreciated!
I think you can solve this using pyspark.sql.functions.when . See this blog post for some complicated usage patterns. You're going to want to group by id, and then use when statements to implement your logic. For example, 'title': {'source': 'Google', 'approach': 'first record'} can be implemented as
(df.groupBy('id').agg(
when(col("source") == lit("Google"), first("title") ).otherwise("null").alias("title" )
)
'Most recent' and 'most complete' are more complicated and may require some self-joins, but you should still be able to use when clauses to get the aggregates you need.

How to sort this Solr query by distance?

I am querying results like this:
?q=*&wt=json&rows=1000&fq={!geofilt%20pt=36.722484,-4.371908%20sfield=location%20d=50}
This is using the geofilt function to find all results within 50km of a given point. But the results are returning in a strange order. I want to sort them by proximity to the given point, ascending. How can I add that to the above query?
I'd bet you rather need to apply additional sorting param which is described here: Spatial Search.
So in your case it would look like:
?q=*&wt=json&rows=1000&fq={!geofilt%20pt=36.722484,-4.371908%20sfield=location%20d=50}&sort=geodist()+asc
If the values you're getting isn't a field value, but the results of, say, a calculation (i.e. dynamically generated values, such as the distance between points) then you can sort the output of a function. Have a look at this link:
https://wiki.apache.org/solr/FunctionQuery#Sort_By_Function

Query Couchdb by date while maintaining sort order

I am new to couchdb, i have looked at the docs and SO posts but for some reason this simple query is still eluding me.
SELECT TOP 10 * FROM x WHERE DATE BETWEEN startdate AND enddate ORDER BY score
UPDATE: It cannot be done. This is unfortunate since to get this type
of data you have to pull back potentially millions of records (a few
fields) from couch then do either filtering, sorting or limiting
yourself to get the desired results. I am now going back to my
original solution of using _changes to capture and store elsewhere the data i do need to perform that query on.
Here is my updated view (thanks to Dominic):
emit([d.getUTCFullYear(), d.getUTCMonth() + 1, d.getUTCDate(), score], doc.name);
What I need to do is:
Always sort by score descending
Optionally filter by date range (for instance, TODAY only)
Limit by x
Update: Thanks to Dominic I am much closer - but still having an
issue.
?startkey=[2017,1,13,{}]&endkey=[2017,1,10]&descending=true&limit=10&include_docs=true
This brings back documents between the dates sorted by score
However if i want top 10 regardless of date then i only get back top 10 sorted by date (and not score)
For starters, when using complex keys in CouchDB, you can only sort from left to right. This is a common misconception, but read up on Views Collation for a more in-depth explanation. (while you're at it, read the entire Guide to Views as well since you're getting started)
If you want to be able to sort by score, but filter by date only, you can accomplish this by breaking down your timestamp to only show the degree you care about.
function (doc) {
var d = new Date(doc.date)
emit([ d.getUTCFullYear(), d.getUTCMonth() + 1, d.getUTCDate(), score ])
}
You'll end up outputting a more complex key than what you currently have, but you query it like so:
startkey=[2017,1,1]&endkey=[2017,1,1,{}]
This will pick out all the documents on 1-1-2017, and it'll be sorted by score already! (in ascending order, simply swap startkey and endkey to get descending order, no change to the view needed)
As an aside, avoid emitting the entire doc as the value in your view. It is likely more efficient to leverage the include_docs=true parameter, and leaving the value of your emit empty. (please refer to this SO question for more information)
With this exact setup, you'd need separate views in order to query by different precisions. For example, to query by month you just use the year/month and so on.
However, if you are willing/able to sort your scores in your application, you can use a single view to get all the date precision you want. For example:
function (doc) {
var d = new Date(doc.date)
emit([ d.getUTCFullYear(), d.getUTCMonth() + 1, d.getUTCDate(), d.getUTCHour(), d.getUTCMinutes(), d.getUTCSeconds(), d.getUTCMilliseconds() ])
}
With this view and the group_level parameter, you can get all the scores by year, month, date, hour, etc. As I mentioned, in this case it won't be sorted by score yet, but maybe this opens up other queries to you. (eg: what users participated this month?)

Solr Query - Set a threshold match percentage for my search

I am using Solr - Lucene 4.0. I am trying to run a query to search a field called Names.
An example of a query would be:
Names:George
When I execute the search with the amount of rows to return to 1000 it returns 1000 results. I expect it to return way less than that. The last results aren't similar at all. Is there a way to set a threshold for my results so that it only returns matches of a certain similarity?
actually you cant not create a minimum matching score. because the matching score is relative and depends on a lot of things (ex. number of overall documents, number of matching terms found).
i do not know what is your case exactly. but you may consider use paging. like get results in 20 document at a time and check the score of the last document and then stop if its lower than a threshold that you specify.

Order SharePoint search results by more columns

I'm using a FullTextSqlQuery in SharePoint 2007 (MOSS) and need to order the results by two columns:
SELECT WorkId FROM SCOPE() ORDER BY Author ASC, Rank DESC
However it seems that only the first column from ORDER BY is taken into account when returning results. In this case the results are ordered correctly by Author, but not by Rank. If I change the order the results will be ordered by Rank, but not by Author.
I had to resort to my own sorting of the results, which I don't like very much. Has anybody a solution to this?
Edit: Unfortunately it also doesn't accept expressions in the ORDER BY clause (SharePoint throws an exception). My guess is that even if the query looks like legitimate SQL it is parsed somehow before being served to the SQL server.
I tried to catch the query with SQL Profiler, but to no avail.
Edit 2: In the end I used ordering by a single column (Author in my case, since it's the most important) and did the second ordering in code on the TOP N of the results. Works good enough for the project, but leaves a bad feeling of kludgy code.
Microsoft finally posted a knowledge base article about this issue.
"When using RANK in the ORDER BY clause of a SharePoint Search query, no other properties should be used"
http://support.microsoft.com/kb/970830
Symptom: When using RANK in the ORDER BY clause of a SharePoint Search query only the first ORDER BY column is used in the results.
Cause: RANK is a special property that is ranked in the full text index and hence cannot be used with other managed properties.
Resolution: Do not use multiple properties in conjunction with the RANK property.
Rank is a special column in MOSS FullTextSqlQuery that give a numeric value to the rank of each result. That value will be different for each query, and is relative to the other results for that particular query. Because of this rank should have a unique value for each result, and sorting by rank then author would be the same as just sorting by rank. I would try sorting on another column instead of rank to see if results come back as you expect, if so your trouble could be related to the way MOSS is ranking the results, which will vary for each unique query.
Also you are right, the query looks like SQL, but it is not the query actually passed to the SQL server, it is special Microsoft Enterprise Search SQL Query syntax.
I, too, am experiencing the same problem with FullTextSqlQuery and MOSS 2007 where only the first column in a multi-column "ORDER BY" is respected.
I entered this topic in the MSDN Forums for SharePoint Search, but have not received any replies:
http://social.msdn.microsoft.com/Forums/en-US/sharepointsearch/thread/489b4f29-4155-4c3b-b493-b2fad687ee56
I have no experience in SharePoint, but if it is the case where only one ORDER BY clause is being honored I would change it to an expression rather than a column. Assuming "Rank" is a numeric column with a maximum value of 10 the following may work:
SELECT WorkId FROM SCOPE() ORDER BY AUTHOR + (10 - Rank) ASC

Resources