In sphinx search, is that possible set weight or pr somelike to specified record(#id) for corresponding query keyword? What i expect to do is raise/decrease some specified rank of records for specified queries. eg:doc A has 2 weights for 2 different words
There is Payloads
http://sphinxsearch.com/docs/current.html#conf-sql-joined-field
you attach specific weights to words in a document.
Related
I have a document which has title, stockCode, category fields.
I have different field types (and analysis chains) for each. For instance title has EdgeNGram 2 to 20, category has EdgeNGram 3 to 10 with different range and stockCode just has lowercase filter.
So that, I don't want to search from documents with keyword "sample" with building the query like title:sample OR stockCode:sample OR category:sample.
I'd like to search with just "q=sample".
I copied my fields to text but It does not work. Because all fields analyzed as same. But I don't want to index stockCode as EdgeNGram or any other filters. I'd like to index my fields as I configured and I'd like to search a keyword over them base on my indexes.
I've been researching about that for three days, and Solr has a little bit poor documentation.
You can use the edismax handler, as this will allow you to give a list of fields to query and supply the query by itself. You can also give separate weights to each field for scoring them differently.
defType=edismax&q=sample&qf=title^10 stockCode category
.. will search for sample in each of the three fields, giving a 10x boost to any hits in the title field.
You can find the documentation about the edismax query parser under Searching in the reference guide.
Consider a scenario where all documents have following fields
The requirement is that for email the score should be either 100 (if exact match) or 0.
For remaining fields, it is 0 to 100 based on edit distance .
Suppose in an index the records are like the following
1.abcd#gmail.com,Peterr,Parker,Developer
2.xyz#yahoo.com,Steve,Smith,Manager
The query is made on fuzzy search of all the fields and parameters are like
abcd#gmail.com,Pet,Par,Devl
The search result should have a score for first record like
score for email + score of last name +score of first name+score of title
=100+50(approx edit distance of 'Peterr and Pet')+50(approx edit distance of 'Peterr and Parker')+44(approx edit distance of 'Devl and Developer')
=244
Similarly ,the search result should have a score in similar way.
I just checked Azure search scoring has weights but those I don't think would be of much helpful in scenarios like this .The main thing we are looking for is to find a way where the search score returned for each record by Azure search would be in accordance with the score I discussed above
To clarify, it seems what you need is the scoring formula to be a function of the edit distance between the query term and the indexed term - the shorter the distance, the higher the score. Unfortunately, this is not possible in Azure Search.
Azure Search engine executes the search query in two phases: retrieval and scoring.
During retrieval search query terms processed by the lexical analyzer are looked up in the inverted index. Documents that had those terms are returned. When you use fuzzy search we expand your search query by adding terms from the inverted index that are within edit distance from a given query term - fuzzy expansion. This way your query can match more documents.
During scoring we assign a relevance score to retrieved documents using the Lucene scoring formula. This formula is based on TF/IDF. Practically, it means that documents that matched terms that are rare will be ranked higher up in the results set.
It's important to know that the Lucene scoring formula only applies to documents that matched the original query terms and terms added through fuzzy expansion. Documents that matched terms added through prefix expansion or regex/wildcard expansion are given constant score 1. This way those documents will be in the results set but won't have impact on ranking that's based on frequency of terms.
Hope that helps
I understood the reason for having search profile and boosting results based on some fields e.g. distance, rating, etc. To me, that's most likely applicable to structured documents like json files. The scenario that I cannot make sense of it is when indexer gets search service index let's say a MS Word or PDF document in azure blob. We have two entries of "id" and "content" which I don't know how the search score would apply to it.
For e.g. there are two documents with different contents. I searched for a keyword and the same keyword found in two documents resulted into getting two different scores for two MS Word documents. My challenge is why this score should be different while both documents contain the same keyword?
The score is determined by many factors, for example, the count of terms in each document, and the number of searchable fields in which query terms were found. In your example, the documents have different lengths, so naturally they'll have different scores. HTH.
I know that ElasticSearch uses relevance ranking algorithms such as Lucene's tf/idf, length normalization and couple of more algorithms to rank term queries applied on textual fields (i.e. searching words "medical" AND "journal" in the "title" and "body" fields).
My question is how does ElasticSearch rank and retrieve results of a filter or range query (i.e. age=25, or weight>60)?
I know these types of queries are just filtering documents based on the condition(s). But lets say I have 200 documents which their age field value is 25. Which of those documents will be retrieved as top 10 results?
Does ElasticSearch retrieve them by the order it indexed them?
From the Elasticsearch documentation:
Filters: As a general rule, filters should be used instead of queries:
for binary yes/no searches
for queries on exact values
Queries: As a general rule, queries should be used instead of filters:
for full text search
where the result depends on a relevance score
So when running a search such as "age=25, or weight>60" you should be using a filter.
However - Filters do not affect the scoring - i.e. if you only used a filter your search results would all have the same score.
There is a range query - this is a query that would affect score and I would guess that it scores documents based on things like the document timestamp (most recent gets a higher score).
You'd need to explore the documentation further and dig into Lucene documentation to understand exactly how and why the a document got its score - but as above, you may be better using Filters that don't affect scoring.
I have a set of text files in a particular domain. I need to rank the files based on some metric.
Please help me out with a few metrics that can be used to rank my text files (term frequency, size, frequency of use, etc..). I would then like to use text mining techniques to rank the files based on one of these techniques.
The major issue that i had come across is to rank the documents according to thier relevance or some other metric .
Now i have come to a conclusion that documents ranked based on their content(relevance) provides better results.
I am making use of a vector based approach to rank documents based on the search words given in the query . I am not sure if that is the best approach but it provides results with average accuracy