Default proximity search in solr - search

I was wondering if there is any way to set up a default search proximity within Solr via solrconfig.xml.
Currently, if I want to perform a proximity search, I have to do the following:
q="red cars"~10
Is there a way to set the 10-word proximity by default, so that all queries are proximity searches with a 10-word proximity range?

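By using eDismax, you can set the proximity as a default slop.
Proximity matching is then enabled by default, since the search looks for words that are at most slop positions apart.
Check the Query Phrase Slop (qs) and Phrase Slop (ps) parameters, which set the slop for the queries.
Query Phrase Slop (qs) is applied to explicit phrase queries in q, such as "red cars".
Phrase Slop (ps) is applied to the implicit phrase queries built on the pf fields, which boost documents where the terms of a normal (non-phrase) query appear close together.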
<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="echoParams">explicit</str>
    <str name="qf">field</str>
    <str name="qs">10</str>
    <str name="pf">field</str>
    <str name="ps">10</str>
    <str name="q.alt">*:*</str>
  </lst>
</requestHandler>
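The same parameters can also be sent per request instead of being set as handler defaults. A minimal sketch of building such a request URL follows; the host, port and core name (localhost:8983, mycore) and the helper name build_proximity_query are placeholders for illustration, not values from the question.

```python
from urllib.parse import urlencode

def build_proximity_query(user_query, field="field", slop=10,
                          base_url="http://localhost:8983/solr/mycore/select"):
    """Build a Solr select URL that applies edismax slop settings.

    base_url and field are hypothetical placeholders; adjust them to
    your own Solr host, core and schema.
    """
    params = {
        "defType": "edismax",
        "q": user_query,
        "qf": field,   # field(s) to search
        "qs": slop,    # slop for explicit phrase queries in q
        "pf": field,   # field(s) used for implicit phrase boosting
        "ps": slop,    # slop for the implicit pf phrase queries
    }
    return base_url + "?" + urlencode(params)

url = build_proximity_query("red cars")
print(url)
```

Parameters passed on the request override the values in the defaults list, so this is a convenient way to try different slop values before committing one to solrconfig.xml.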

Related

Azure Cognitive Text Analytics

I am using the new Text Analytics library with v3.0-preview for Sentiment Analysis. I passed a text with multiple sentences as a document to get the sentiment of the whole text.
I have received the following warning in the response.
"warnings":["Sentence was truncated because it exceeded the maximum token count."]}
First result on your favorite search engine: https://learn.microsoft.com/en-us/azure/cognitive-services/text-analytics/overview#data-limits
Maximum size of a single document: 5,120 characters as measured by StringInfo.LengthInTextElements.
This is by design: a document is split internally into sentences, but we have a maximum length limit for the number of words in a sentence.

Clustering Components

When clustering, I receive the following warning:
UserWarning: A component contained 77760 elements.
Components larger than 30000 are re-filtered.
The threshold for this filtering is 4.08109134074e-15
What does this mean?
My original threshold specification was 0.191, as below:
clustered_dupes = deduper.match(data,threshold=0.191)
The threshold is for the cophenetic similarity of a cluster, not the pairwise similarity.

Setting different priority for each term in a sentence in Solr

I have a search sentence like the following:
philips led bulb
I want to set different priorities for each word: bulb priority is 8, led priority is 7 and philips priority is 6. How can I do that with Solr?
The standard query parser (and edismax, etc.) supports giving weights for each term, using the ^<weight> syntax: bulb^8 led^7 philips^6.
If you want to apply different weights to different "categories" of words, index the words to different fields and use qf to query all the fields. qf also supports the ^<weight> syntax, so you can apply a query as categories^8 manufacturer^6 etc.
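As a small illustration, the boosted query string can be assembled from a term-to-weight mapping. The helper name boosted_query is made up for this sketch, and it assumes the terms contain no characters that need escaping in the Solr query syntax.

```python
def boosted_query(weights):
    """Join terms into a Solr query string using the term^weight syntax.

    Assumes plain terms with no special query characters to escape.
    """
    return " ".join(f"{term}^{boost}" for term, boost in weights.items())

q = boosted_query({"bulb": 8, "led": 7, "philips": 6})
print(q)  # bulb^8 led^7 philips^6
```

The same function works for a qf value, for example boosted_query({"categories": 8, "manufacturer": 6}).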

How to tweak Solr Matching Parameters?

I'm using Solr for the search component of my application, and am looking to play around with different factors to see how they affect results.
Specifically Solr docs make mention of the basic scoring factors:
tf --> term frequency
idf --> inverse document frequency
coord --> coordination factor
lengthNorm --> matches based on length of field
Could anyone tell me how to "adjust" whatever numerical factors are being used for these values? (If that's possible; I haven't found much documentation saying yea or nay.)
After I've played around with these I'll move on to methods such as boosting and so on.
Thanks guys!
You can start with a custom Similarity class.
This would allow you to modify the above parameters and scoring factors.
Check the Lucene DefaultSimilarity class for reference, as it is the actual implementation.
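To get a feel for what there is to adjust, here is a rough Python sketch of the four factors as I understand the formulas documented for Lucene's DefaultSimilarity. It is only an illustration for experimenting with the numbers: the real customization is done in Java by subclassing the Similarity class, and the exact formulas may differ between Lucene versions.

```python
import math

# Illustrative re-statement of the DefaultSimilarity factors (assumed
# from the Lucene documentation; not a Solr or Lucene API).

def tf(freq):
    """Term frequency factor: square root of the raw count."""
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    """Inverse document frequency: rarer terms score higher."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def length_norm(num_terms):
    """Field length normalization: shorter fields score higher."""
    return 1.0 / math.sqrt(num_terms)

def coord(overlap, max_overlap):
    """Coordination factor: fraction of query terms that matched."""
    return overlap / max_overlap

# Example: a term occurring 4 times in a 16-term field,
# in a document that matches 1 of 2 query terms.
print(tf(4), length_norm(16), coord(1, 2))  # 2.0 0.25 0.5
```

"Adjusting" a factor then amounts to overriding the corresponding method, for example returning a constant from tf so that repeated occurrences of a term no longer raise the score.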

tf-idf: Does using it help to weigh documents that share the terms higher than a document that doesn't?

I'm working on a customized search feature for a website, and I was curious whether using only tf-idf to rank documents in my corpus would also help to weigh documents that have multiple search terms higher than documents with only one search term.
Example: Search = "poland spring water"
Theoretically, would the above query weigh (using traditional tf-idf) a document higher if the document contained "poland" 100 times and "water" zero times? Or would it weigh a document higher if it contained "poland" 10 times and "water" 10 times?
I'm aware that it all depends on the tf-idf value of "poland" and "water", but theoretically, on an even playing field, would the algorithm help bring documents to the top of the results more if there were multiple terms in the document, or is it really term independent?
It is term independent. Remember, the tf-idf weighting scheme treats the query as a bag of words, and each document is seen as a vector. For the above example, suppose the tf for poland is 100 in doc x and its idf is 1. Also suppose the tf for poland is 10 and the tf for water is 2 in doc y, and the idf of water is 1.
score of doc x = 100
score of doc y = 12
Doc x is ranked higher even though it contains only one of the terms.
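The arithmetic above can be reproduced with a few lines of Python. This sketch assumes the plain additive scoring used in this answer (the sum of tf times idf over the query terms, with no length normalization), and the function and variable names are made up for the example.

```python
def score(doc_tf, idf, query_terms):
    """Sum tf * idf over the query terms; absent terms contribute 0."""
    return sum(doc_tf.get(t, 0) * idf.get(t, 0) for t in query_terms)

idf = {"poland": 1, "water": 1}      # idf values assumed in the answer
query = ["poland", "water"]

doc_x = {"poland": 100}              # contains only one query term
doc_y = {"poland": 10, "water": 2}   # contains both query terms

score_x = score(doc_x, idf, query)
score_y = score(doc_y, idf, query)
print(score_x, score_y)  # 100 12
```

Each term's contribution is added independently, so a very high count for one term can outweigh the presence of a second term.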
It's term independent. The result depends on the ratio of how many documents contain poland to how many contain water. If it's half-half, then the second document wins. If the ratio is 100:1, then the first document wins, since that ratio is closer to the in-document distribution of the words.

Resources