Information search in a user-defined (restricted or limited) subset of the search space (feasible region) - search

Is there a particular term/phrase for the following:
Searching for information where the search space (feasible region) is restricted or limited to a particular set of user-defined documents only.
Example: I am searching for an answer in a set of 100 documents only, within a given space of 10000 documents. These 100 documents are user-defined. The search engine cannot refer to any other documents.
Please let me know. Thanks.
I am NOT looking for proximity search. A good reference is here. If I can reframe the question, let's say that I want to restrict my search to "STOP SEARCH AFTER number DOCUMENT | DOCUMENTS". Does this kind of search have a particular term/phrase?
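To make the restriction concrete, here is a minimal in-memory sketch of the behaviour described above: the search only ever looks at documents whose IDs are in a user-supplied set. The function name `search_within`, the toy corpus, and the "all query terms must appear" matching rule are assumptions made for illustration, not part of any particular search engine.

```python
from typing import Dict, List, Set


def search_within(corpus: Dict[str, str], allowed_ids: Set[str], query: str) -> List[str]:
    """Return the IDs of allowed documents that contain every query term.

    Documents outside `allowed_ids` are never inspected, which is the
    restriction the question describes.
    """
    terms = query.lower().split()
    hits = []
    for doc_id in sorted(allowed_ids):
        text = corpus.get(doc_id)
        if text is None:
            continue  # an allowed ID that is not in the corpus is skipped
        words = set(text.lower().split())
        if all(term in words for term in terms):
            hits.append(doc_id)
    return hits


# Hypothetical toy corpus: three documents, only two of which are allowed.
corpus = {
    "d1": "restricted search over a subset",
    "d2": "full text search engine",
    "d3": "search restricted to user documents",
}
allowed = {"d1", "d2"}

print(search_within(corpus, allowed, "restricted search"))
```

Here `d3` matches the query but is never returned, because it is not in the allowed set.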

Related

Why does Azure Search give higher score to less relevant document?

I have two documents indexed in Azure Search (among many others):
Document A contains only one instance of "BRIG" in the whole document.
Document B contains 40 instances of "BRIG".
When I do a simple search for "BRIG" in the Azure Search Explorer via the Azure Portal, Document A is returned first with "@search.score": 7.93229 and Document B is returned second with "@search.score": 4.6097126.
There is a scoring profile on the index that adds a boost of 10 for the "title" field and a boost of 5 for the "summary" field, but this doesn't affect these results, as neither document has "BRIG" in either of those fields.
There's also a "freshness" scoring function with a boost of 15 over 365 days with a quadratic function profile. Again, this shouldn't apply to either of these documents as both were created over a year ago.
I can't figure out why Document A is scoring higher than Document B.
It's possible that Document A is 'newer' than Document B and that's why it is displayed first (has a higher score). Besides term relevance, freshness can also impact the score.
EDIT:
After some research, it looks like newly created Azure Cognitive Search services use the BM25 algorithm by default. (source: https://learn.microsoft.com/en-us/azure/search/index-similarity-and-scoring#scoring-algorithms-in-search)
Document length and field length also play a role in the BM25 algorithm. Longer documents and fields are given less weight in the relevance score calculation. Therefore, a document that contains a single instance of the search term in a shorter field may receive a higher relevance score than a document that contains the search term multiple times in a longer field.
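The length effect can be checked with a small worked example. The sketch below implements the standard BM25 per-term formula with the commonly used defaults k1 = 1.2 and b = 0.75; the field lengths (50 and 20000 tokens, average 1000) are made-up numbers chosen for illustration, and the IDF factor is set to 1.0 because it is the same for both documents. It is not Azure's actual scoring code.

```python
def bm25_term_score(tf: float, field_len: float, avg_field_len: float,
                    k1: float = 1.2, b: float = 0.75, idf: float = 1.0) -> float:
    """Standard BM25 contribution of one term to one document's score."""
    # Length normalisation: fields longer than average are penalised when b > 0.
    norm = k1 * (1.0 - b + b * field_len / avg_field_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)


# Hypothetical numbers: A mentions the term once in a short field,
# B mentions it 40 times in a very long field.
score_a = bm25_term_score(tf=1, field_len=50, avg_field_len=1000)
score_b = bm25_term_score(tf=40, field_len=20000, avg_field_len=1000)
print("A:", round(score_a, 3), "B:", round(score_b, 3))

# With b = 0 length normalisation is switched off and term frequency wins.
score_a_no_norm = bm25_term_score(tf=1, field_len=50, avg_field_len=1000, b=0.0)
score_b_no_norm = bm25_term_score(tf=40, field_len=20000, avg_field_len=1000, b=0.0)
print("b=0 -> A:", round(score_a_no_norm, 3), "B:", round(score_b_no_norm, 3))
```

Under these assumed lengths the single-occurrence document scores higher, and setting b to 0 reverses the order, which is why the b parameter is the one to look at if length normalisation is not wanted.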
Test your scoring profile configurations. Perhaps try issuing queries without scoring profiles first and see if that meets your needs.
The "searchMode" parameter controls precision and recall. If you want more recall, use the default "any" value, which returns a result if any part of the query string is matched. If you favor precision, where all parts of the string must be matched, change searchMode to "all". Try the above query both ways to see how searchMode changes the outcome. See Simple Query Examples.
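The difference between the two modes can be sketched on a toy corpus. This is only an illustration of the "any term" versus "all terms" semantics described above, using whitespace tokenisation; the function `matches` and the sample documents are hypothetical and do not reproduce how the service parses or analyses queries.

```python
def matches(doc_text: str, query: str, search_mode: str = "any") -> bool:
    """Return True if the document satisfies the query under the given mode."""
    doc_terms = set(doc_text.lower().split())
    query_terms = query.lower().split()
    if search_mode == "any":
        # Recall-oriented: one matching term is enough.
        return any(term in doc_terms for term in query_terms)
    if search_mode == "all":
        # Precision-oriented: every term must be present.
        return all(term in doc_terms for term in query_terms)
    raise ValueError("search_mode must be 'any' or 'all'")


docs = {
    "A": "brig sails at dawn",
    "B": "the harbour at dawn",
}

any_hits = [doc_id for doc_id, text in docs.items() if matches(text, "brig dawn", "any")]
all_hits = [doc_id for doc_id, text in docs.items() if matches(text, "brig dawn", "all")]
print(any_hits, all_hits)
```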
If you are using the BM25 algorithm, you also may want to tune your k1 and b values. See Set BM25 Parameters.
Lastly, you may want to explore the new Semantic search preview feature for enhanced relevance.

What is the best text analyzer to use on generic data in Azure Cognitive Search when searching for more than 1 word?

I have been looking over the different text analyzers which Azure Cognitive Search offers with this API.
The data I have is generic: a value can be an email address or a name, to give just two examples.
Which is the best analyzer to use on this type of data (generic)?
Also, does the text analyzer in use affect how search works when looking for more than 1 word?
What is the best way to make it do a fuzzy search for more than 1 word, e.g. "joe blogs", with every word fuzzy?
I don't want "somename blogs" to show up, because somename is not a fuzzy match on joe.
I do want "joe clogs" to show up, because joe would fuzzy match to joe and clogs would fuzzy match on blogs.
What is the best practice to do fuzzy search with more than 1 word, which would give the end user fewer hits as they give more words?
If you have generic content and don't want to use linguistic processing, you can use a generic analyzer like the Whitespace analyzer. See
https://learn.microsoft.com/en-us/azure/search/index-add-custom-analyzers#built-in-analyzers
How searches for single or multiple words work is determined by the searchMode parameter. The recommended practice is to use all, instead of any. When you specify more terms, you are more specific and you want fewer (more precise) results.
You can specify multi-word queries where individual search terms are fuzzy by using the tilde syntax. E.g. to do a fuzzy search for joe but exact match on blogs you could use something like:
joe~ blogs
You can also control how fuzzy you want it to be. See
https://learn.microsoft.com/en-us/azure/search/query-lucene-syntax#bkmk_fuzzy
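For the "all words fuzzy" case in the question, the tilde can be appended to every term before the query is sent. The sketch below builds such a query string and a request body; the helper names are hypothetical, the input is assumed to consist of plain words (no Lucene special characters that would need escaping), and the body uses the `search`, `queryType`, and `searchMode` parameters of the Search Documents REST API. Nothing is sent over the network.

```python
from typing import Dict


def fuzzy_all_terms(user_input: str, max_edits: int = 1) -> str:
    """Append ~N to every whitespace-separated term, e.g. 'joe~1 blogs~1'."""
    if max_edits not in (0, 1, 2):
        raise ValueError("Lucene fuzzy search accepts an edit distance of 0, 1 or 2")
    return " ".join(f"{term}~{max_edits}" for term in user_input.split())


def build_request_body(user_input: str, max_edits: int = 1) -> Dict[str, str]:
    """Request body for a full-Lucene query where every term must match fuzzily."""
    return {
        "search": fuzzy_all_terms(user_input, max_edits),
        "queryType": "full",   # tilde syntax needs the full Lucene parser
        "searchMode": "all",   # more words should narrow the results
    }


print(build_request_body("joe blogs"))
```

With `searchMode` set to `all`, a document has to match each fuzzy term, so "somename blogs" would not be expected to match while "joe clogs" would.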
PS: From your use case it sounds like proximity matching is also something you could consider using:
https://learn.microsoft.com/en-us/azure/search/query-lucene-syntax#bkmk_proximity

Azure Search - return only certain number of words from a field

We have a database that stores news stories from many websites. The entire text of each article is stored in one field of unstructured data as nvarchar(max). Our clients query on a person's name to see if it appears anywhere in any of our articles. But in order to be compliant with requirements in our industry, and to avoid infringing any copyrights, we're only allowed to return the 25 words which surround the person's name that was queried on. Along with that we give the client the URL of the article and they can take it from there.
Is this something that can be accomplished in Azure Search? The ability to only display a subset of words from the field which is being queried on?
You can use the hit highlighting feature in Azure Search. Please see highlight= parameter in https://learn.microsoft.com/en-us/rest/api/searchservice/search-documents.
Azure Search returns up to five text fragments per field, usually sentences, that contain the search hits. It currently does not allow you to configure the size of the window/fragment, so you can't get a specific number of words surrounding the search terms.
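Since the fragment size cannot be configured, one possible workaround is to cut the window yourself on the server side before anything is returned to the client. The sketch below is an assumption-laden illustration: the function name `context_window` is hypothetical, matching ignores case and punctuation, and how the 25 surrounding words are split (12 before, 13 after) is a guess at the requirement rather than something stated in the question.

```python
import re
from typing import List


def _normalise(word: str) -> str:
    """Lower-case a token and drop punctuation so 'Doe,' matches 'doe'."""
    return re.sub(r"\W+", "", word).lower()


def context_window(text: str, name: str, before: int = 12, after: int = 13) -> List[str]:
    """Return one snippet per occurrence of `name`, with surrounding words."""
    words = text.split()
    name_words = [_normalise(w) for w in name.split()]
    n = len(name_words)
    snippets = []
    for i in range(len(words) - n + 1):
        if [_normalise(w) for w in words[i:i + n]] == name_words:
            start = max(0, i - before)
            end = min(len(words), i + n + after)
            snippets.append(" ".join(words[start:end]))
    return snippets


sample = "one two three Jane Doe, four five six"
print(context_window(sample, "Jane Doe", before=2, after=1))
```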

Is there a way to get all complete sentences that a search engine (e.g. Google) has indexed that contain two search terms?

As the question says: "Is there a way to get all complete sentences that a search engine (e.g. Google) has indexed that contain two search terms?"
I would like to use the (e.g. Google) search syntax: BMW AND Toyota. (<-- this is just an example)
And I would then like to have returned all sentences that mention BMW and Toyota. They must be in a single (ideally: short) sentence though.
Is that possible?
Many thanks!
PS.: Sorry - I have difficulties finding the right tags for my question... Please feel free to suggest more appropriate ones and I will update the question.
PPS.: Let me rephrase my question: If it is not readily possible with an existing search engine, are there any programmatical ways to do that? Would one have to write a crawler for that purpose?
No, this may not be possible, as Google stores this information based on keywords and other algorithms.
For any given keyword or set of keywords, Google must be maintaining a reference to one or many matching (some accurate, some not so accurate) titles.
I do not work for Google, but that could be one way they maintain their search results.
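On the programmatic route raised in the PPS: once page text has been obtained by some other means (a crawler, for example, which is not shown here), picking out the sentences that mention both terms is the easy part. The sketch below is a minimal illustration; the function name `sentences_with_all`, the naive punctuation-based sentence splitter, and the sample text are all assumptions.

```python
import re
from typing import List, Optional


def sentences_with_all(text: str, terms: List[str],
                       max_words: Optional[int] = None) -> List[str]:
    """Return sentences that contain every term, optionally capped in length."""
    # Naive split on ., ! or ? followed by whitespace; real text needs more care.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    wanted = [term.lower() for term in terms]
    result = []
    for sentence in sentences:
        tokens = set(re.findall(r"\w+", sentence.lower()))
        short_enough = max_words is None or len(sentence.split()) <= max_words
        if short_enough and all(term in tokens for term in wanted):
            result.append(sentence)
    return result


page_text = (
    "BMW and Toyota both sell hybrids. BMW is German. "
    "Toyota competes with BMW in some markets."
)
print(sentences_with_all(page_text, ["BMW", "Toyota"]))
print(sentences_with_all(page_text, ["BMW", "Toyota"], max_words=6))
```

The `max_words` cap is one way to express the "ideally short" preference from the question.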

Improving type-ahead suggestions in Search using SOLR

What are the possible ways you could improve the type-ahead (auto-complete) suggestions that appear in a free-form search?
From my understanding, all the suggestions that appear for keywords are stored in a SOLR table.
How do you ensure that it covers all the industry specific relevant type-ahead suggestions?
Can you automatically include the most recent user-generated queries that currently return no search results, so that they lead to relevant ones?
In preprocessing, the documents fed into the search engine need to be enriched with whatever is sensible and helps users find them. For example, a document containing the string paris may be enriched with french capital, capital of france, ile-de-france, … You will need a dictionary to do so. You can take data from dbpedia.org or (for English only) WordNet. To avoid over-generalizing, you will need to implement some disambiguation (meaning discovery) in the first step, since paris, for example, could equally be expanded with alexandros, alaksandu of wilusa, king of troy, depending on the context.
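The enrichment step with a simple disambiguation could look roughly like the sketch below. Everything in it is hypothetical: the hand-written `SENSES` dictionary stands in for data that would come from a source such as DBpedia or WordNet, and the cue-word overlap is only one crude way to pick a sense. The expansion strings are the ones from the example above.

```python
import re
from typing import Dict, List

# Hypothetical sense inventory: each sense has context cue words and expansions.
SENSES: Dict[str, List[dict]] = {
    "paris": [
        {
            "cues": {"france", "seine", "louvre"},
            "expansions": ["french capital", "capital of france", "ile-de-france"],
        },
        {
            "cues": {"troy", "helen", "iliad"},
            "expansions": ["alexandros", "alaksandu of wilusa", "king of troy"],
        },
    ],
}


def enrich(text: str) -> List[str]:
    """Return extra index terms for ambiguous words, chosen by context cues."""
    tokens = set(re.findall(r"\w+", text.lower()))
    extra: List[str] = []
    for term, senses in SENSES.items():
        if term not in tokens:
            continue
        # Pick the sense whose cue words overlap most with the document.
        best = max(senses, key=lambda sense: len(sense["cues"] & tokens))
        if best["cues"] & tokens:  # with no cue at all, do not expand
            extra.extend(best["expansions"])
    return extra


print(enrich("Helen left Sparta with Paris for Troy"))
print(enrich("The Louvre in Paris"))
```

Skipping expansion when no cue word is present is a deliberate choice in this sketch, to avoid the over-generalization mentioned above.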
