Currently I'm using Lucene 4.10.3.
I don't care about the score or weight during search; I just need the documents that match the query. So I only want the posting list to be returned, in order to speed up the search.
I tried a few ways to ignore the score, such as rewriting TermQuery to disable TermWeight and TermScorer. That works and the search speeds up a lot, but there are too many query types to rewrite, and the approach is incompatible with QueryParser.
I also tried ConstantScoreQuery, and tried rewriting it, but failed. It seems that ConstantScoreQuery uses the Weight and Scorer of the Query it wraps (e.g. TermWeight and TermScorer), so it can't speed up the search, because the weight and score are still calculated. Maybe I did something wrong, but in my tests the search time of new ConstantScoreQuery(new TermQuery(...)) was the same as new TermQuery(...) alone.
I also tried writing a custom collector to replace TopScoreDocCollector, but the score and weight are still calculated, so the new collector doesn't speed things up much.
So is there any way to completely disable scoring during search to speed it up? I think rewriting ConstantScoreQuery might work (making a TrueConstantScoreQuery that wraps other queries), but I don't know how to do that.
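Not an authoritative answer, but one thing worth trying on Lucene 4.x is to push the query through a Filter instead of wrapping it directly: QueryWrapperFilter exposes the query purely as a doc-id set, and ConstantScoreQuery over a Filter just iterates that set and gives every hit the same constant score, so score() is never called on the underlying TermScorer. I can't promise a dramatic speedup (much of the cost is simply iterating the postings), but it does remove per-document score computation and it stays compatible with QueryParser, since whatever Query the parser produces can be wrapped the same way. A minimal sketch (the class and method names are my own, and the index/searcher setup is assumed):

import java.io.IOException;

import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TopDocs;

// Hypothetical helper, not part of Lucene: runs any query as a pure
// match/no-match lookup with a constant score per hit.
public final class UnscoredSearch {

    private UnscoredSearch() {}

    // QueryWrapperFilter turns the query into a doc-id set; ConstantScoreQuery
    // over a Filter only iterates that set, so score() is never called on the
    // scorer the wrapped query would normally use.
    public static TopDocs searchUnscored(IndexSearcher searcher, Query userQuery, int n)
            throws IOException {
        Query unscored = new ConstantScoreQuery(new QueryWrapperFilter(userQuery));
        return searcher.search(unscored, n);
    }
}

// Usage sketch (reader/searcher setup omitted):
//   TopDocs hits = UnscoredSearch.searchUnscored(searcher,
//           new TermQuery(new Term("body", "lucene")), 1000);

If you only need a hit count rather than the top N documents, TotalHitCountCollector avoids the priority-queue work as well; for collecting raw doc IDs you would still write your own collector that simply never calls Scorer.score().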
Related
Case in point: say we have a search query that returns 2000 results ranging from very relevant to hardly relevant at all. When sorted by relevance this is fine, as the most relevant results are listed on the first page.
However, when sorting by another field (e.g. user rating) the results on the first page are full of hardly-relevant results, which is a problem for our client. Somehow we need to only show the 'relevant' results with highest ratings.
I can only think of a few solutions, all of which have problems:
1 - Filter out listings on the Solr side if the relevancy score is under a threshold. I'm not sure how to do this, and from what I've read it isn't a good idea anyway; e.g. if a query returns only 10 listings I would want to display them all instead of filtering any out. It seems impossible to determine a threshold that would work across the board. If anyone can show me otherwise, please show me how!
2 - Filter out listings on the application side based on score. This I can do without a problem, except that now I can't implement pagination, because I have no way to determine the total number of filtered results without returning the whole set, which would affect performance/bandwidth etc. It also has the same problems as the first point.
3 - Create a sort of 'combined' sort that aggregates a score between relevancy and user rating, which the results would then be sorted on. Firstly, I'm not sure if this is even possible, and secondly, it would be weird for the user if the results aren't actually listed in order of rating.
How has this been solved before? I'm open to any ideas!
Thanks
If they're not relevant, they should be excluded from the result set. Since you want to order by a dedicated field (i.e. user rating), you'll have to tweak how you decide which documents to include in the result at all.
In any case you'll have to define what "relevant enough" means, since scores aren't really comparable between queries and don't express anything absolute like "this was X% relevant".
You'll have to decide why the documents that are included aren't relevant and exclude them based on that criterion. Then you can either use the review score to boost documents further up (if you want the search to appear organic / ordered by relevance), or simply exclude the irrelevant ones and sort by user score. But remember that user score, as an experience for the user, is usually a harder problem to make relevant than just ordering by the average of the votes.
Usually the client can choose among different ordering options, by relevance or by rating for example. But you are right that ordering by rating alone is probably not useful enough. What you could do is take the rating into account in the relevance scoring, for example by multiplying an "organic" score with the rating transformed into a small boost. In Solr you can do this with Function Queries. It is not hard science, and some magic is involved; much of it is common sense. And it requires some very good evaluation and testing to see what works best.
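A rough sketch of how that boost can be expressed with the edismax parser (the field name user_rating and the query text are just examples, not taken from the question):

q=beach villas&defType=edismax&boost=log(sum(user_rating,1))

The boost parameter multiplies the organic text score by the function value, and the log keeps a few extra rating points from completely drowning out text relevance; bf is the additive alternative if you want an even gentler effect.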
Alternatively, if you do not want to treat it as a retrieval problem, you can apply faceting and let users filter the results by rating. Let users help themselves. But I can imagine this does not work in all domains.
Engineers can define what relevancy is. Content similarity scoring is not the only thing that constitutes relevancy. Many Information Retrieval researchers and engineers agree that contextual information should be used in addition to content similarity. This opens up a plethora of possibilities for defining a retrieval model. For example, Learning to Rank (LTR) approaches have become popular, where different features are learnt from search logs to deliver more relevant documents to users given their user profiles and prior search behaviour. Solr offers this as a module.
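For what it's worth, assuming the features and a model (here called myModel, purely illustrative) have already been uploaded to the LTR module, it is typically invoked as a rerank query on top of the normal request, along these lines:

q=beach villas&rq={!ltr model=myModel reRankDocs=100 efi.user_query='beach villas'}&fl=id,score

Only the top reRankDocs results of the original query are rescored by the model, so the extra cost stays bounded.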
I've got a search system where I allow users to set their preferences, and then I boost results in a Solr search according to these preferences. I'd like to give the users some visual feedback when a result has been boosted, but to do this I need to find a way to tell if a particular result has been boosted.
So far, I've thought of using the score value and if the score is above a certain threshold then I know it's been boosted, however the score seems to change quite a bit from query to query so I don't know how to set such a threshold.
If I had access to the pre-boost score somehow in the result, then I could compare this with the final post-boost score and know that the result had been boosted but I don't think the pre-boost score is available (please correct me if I'm wrong).
Does anyone have any other ideas how to achieve this?
You add this to your request:
&debugQuery=true
Then you will get a debug element in your response. Among other things, it contains an explain section where you can see, for each doc id returned, how its score is built. If you parse that info, you can see where the score comes from, including boosting info.
The explain info is quite convoluted to parse; there is even a page to help you with that.
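If you do end up parsing it programmatically, it may help to also ask for the explain data in structured form rather than as a plain-text tree (this parameter should be available in Solr 4.x and later):

&debugQuery=true&debug.explain.structured=true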
I'm having an issue when I search documents in the index: I use keywords as the search parameter and distance as the order-by clause in the API parameters.
The results come back sorted by distance, but the best keyword matches never show up in the results.
https://****/indexes/IndexName/docs?api-version=2014-10-20-Preview&$filter= geo.distance(geolocation, geography'POINT(-157.825459241867 21.2753200113279)') le 16091.8615317766&search=the beach villas &$orderby=geo.distance(geolocation, geography'POINT(-157.825459241867 21.2753200113279)')&$skip=0&$top=10&$count=true
It is very possible that there is an issue, but I would like to step back and make sure you actually want to use sorting as opposed to scoring profiles. Based on the query, it seems that what you want to do is boost items that are close to the user. A good way to do this is to use our distance scoring profile, which allows you to give additional weight to documents that are closer to the location specified by the user. You can also apply an exponential or linear interpolation to this scoring: with exponential, the villas closest to the location get a really large boost and the farther ones get a small boost; with linear, the weighted boost degrades more gradually as documents get farther from the point.
Liam
Please see this page for more details on this: https://msdn.microsoft.com/en-us/library/azure/dn798928.aspx
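To make that concrete, a distance scoring profile is declared in the index definition and then referenced by name at query time. Very roughly, and with purely illustrative names (boostNearby, currentLocation; check the linked page for the exact schema of your API version), the index fragment looks something like this:

"scoringProfiles": [
  {
    "name": "boostNearby",
    "functions": [
      {
        "type": "distance",
        "fieldName": "geolocation",
        "boost": 5,
        "interpolation": "linear",
        "distance": {
          "referencePointParameter": "currentLocation",
          "boostingDistance": 16
        }
      }
    ]
  }
]

The query would then drop the $orderby on distance, keep the keyword search, and pass scoringProfile=boostNearby plus the user's location as the scoring parameter, so proximity boosts the ranking without overriding text relevance.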
I ran the keyword and prefix search for some generic keywords like it, there, he, etc.
The most surprising part was that it gave wrong results and took around 10 times longer to process the request than for named entities like Nokia, Samsung, or McDonald's.
Can anyone explain the weird results I get for these keywords?
it ====> http://dbpedia.org/resource/United_States
there ====> http://dbpedia.org/resource/United_States
Why are the results wrong and why does it take so much time to process these requests?
I wonder what kind of results you were looking for with a query like "there" or "it"?
In the context of search engines these are often referred to as stop words, and they are sometimes ignored completely because they are so common that they add very little relevance to the search query or result. I think this is actually what the lookup tool does now, as I do not get the same results you mentioned.
Why did the query take longer? This is likely because the words are very frequent and a query for them returns many more results. This means the search engine has more work to do in figuring out the most relevant result.
Why is United_States the top result? Probably because the wiki page for United_States is the highest ranked in terms of inbound links from other Wikipedia pages. This is the heart of the relevance algorithm used within the lookup tool. Essentially, there are more links containing the words "there", "it", etc. pointing to United_States than to any other page, so it is judged to be the most relevant for those terms.
I have a set of approximately 10 million search queries. The goal is to collect the number of hits returned by a search engine for each of them. For example, Google returns about 47,500,000 hits for the query "stackoverflow".
The problem is that:
1 - The Google API is limited to 100 queries per day. This is far from useful for my task, since I would have to collect a huge number of counts.
2 - I used the Bing API, but it does not return an accurate number, accurate in the sense of matching the number of hits shown in the Bing UI. Has anyone come across this issue before?
3 - Issuing search queries to a search engine and parsing the HTML is one solution, but it results in CAPTCHAs and does not scale to this number of queries.
All I care about is the number of hits, and I am open to any suggestions.
Well, I was really hoping that someone would answer this, since it is something I was also interested in finding out, but since it doesn't look like anyone will, I will throw in these suggestions.
You could set up a series of proxies that change their IP every 100 requests so that you can query Google as seemingly different people (which seems like a lot of work). Or you could download Wikipedia and write something to parse the data so that when you search a term you can see how many pages it appears in. Of course that is a much smaller dataset than the whole web, but it should get you started. Another possible data source is the Google n-grams data, which you can download and parse to see how many books and pages the search terms appear in. Maybe a combination of these methods could boost the accuracy for any given search term.
Certainly none of these methods is as good as getting the Google page counts directly, but understandably that is data they don't want to give out for free.
I see this is a very old question but I was trying to do the same thing which brought me here. I'll add some info and my progress to date:
Firstly, the reason you get an estimate that can change wildly is because search engines use probabilistic algorithms to calculate relevance. This means that during a query they do not need to examine all possible matches in order to calculate the top N hits by relevance with a fair degree of confidence. That means that when the search concludes, for a large result set, the search engine actually doesn't know the total number of hits. It has seen a representative sample though, and it can use some statistics about the terms used in your query to set an upper limit on the possible number of hits. That's why you only get an estimate for large result sets. Running the query in such a way that you got an exact count would be much more computationally intensive.
The best I've been able to achieve is to refine the estimate by tricking the search engine into looking at more results. To do this you need to go to page 2 of the results and then modify the 'first' parameter in the URL to a much higher value. Doing this may allow you to find the end of the result set (this definitely worked for me last year, although today it only worked up to the first few thousand). Even if it doesn't allow you to reach the end of the result set, you will see that the estimate gets better as the query engine considers more hits.
I found Bing slightly easier to use in this way, but I was still unable to get an exact count for the site I was considering. Google seems to be actively preventing this use of their engine, which isn't that surprising. Bing also seems to hit limits, although those looked more like defects.
For my use case I was able to get both search engines to fairly similar estimates (148k for Bing, 149k for Google) using the above technique. The highest hit count I was able to get from Google was 323 whereas Bing went up to 700 - both wildly inaccurate but not surprising since this is not their intended use of the product.
If you want to do it for your own site you can use the search engine's webmaster tools to view indexed page count. For other sites I think you'd need to use the search engine API (at some cost).