Azure Search - Using the Microsoft English analyzer increases the size of the index

My index was previously using the Lucene analyzer. I changed it to the Microsoft analyzer, and the index size has increased significantly. Why does the size increase so much? P.S. See the attachment.

A difference in index size is expected. For each word in your documents, a Microsoft analyzer produces the original word and the base form of that word. For example, if your document has the word "running", Azure Search will index two terms: running and run. See my answer in the following post for more details: Azure Search: Searching for singular version of a word, but still include plural version in results
Lucene analyzers stem words, which results in fewer unique terms in the index.
You can learn more about the differences here: https://learn.microsoft.com/en-us/rest/api/searchservice/Language-support?redirectedfrom=MSDN
Depending on the analyzer and language, the impact on the index size will be different. You can test the behavior of the analyzer you are using with the Analyze API: https://learn.microsoft.com/en-us/rest/api/searchservice/test-analyzer.
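For example, a quick way to compare the two analyzers is to send the same text to the Analyze API with each analyzer name. Here is a minimal sketch in Python against the REST endpoint; the service name, index name, key, and API version are placeholders you would replace with your own:

```python
import requests

# Placeholders - replace with your own service, index, and admin key.
SERVICE = "https://<your-service>.search.windows.net"
INDEX = "<your-index>"
API_KEY = "<admin-api-key>"
API_VERSION = "2020-06-30"  # use the API version your service supports

def analyze(text, analyzer):
    """Return the tokens the given analyzer produces for `text`."""
    url = f"{SERVICE}/indexes/{INDEX}/analyze?api-version={API_VERSION}"
    body = {"text": text, "analyzer": analyzer}
    resp = requests.post(url, json=body, headers={"api-key": API_KEY})
    resp.raise_for_status()
    return [t["token"] for t in resp.json()["tokens"]]

# The Microsoft analyzer typically emits both the inflected form and its lemma,
# while the Lucene English analyzer emits a single stemmed term.
print(analyze("running", "en.microsoft"))  # e.g. ['running', 'run']
print(analyze("running", "en.lucene"))     # e.g. ['run']
```

Comparing the token lists for representative text from your documents gives a rough feel for how many extra terms the Microsoft analyzer will add to the index.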
That being said, the difference you are seeing is more than I would expect. Please reach out to me at janusz.lembicz at microsoft to discuss the details of your scenario.

Related

Why does Azure Search give higher score to less relevant document?

I have two documents indexed in Azure Search (among many others):
Document A contains only one instance of "BRIG" in the whole document.
Document B contains 40 instances of "BRIG".
When I do a simple search for "BRIG" in the Azure Search Explorer via the Azure Portal, I see Document A returned first with "@search.score": 7.93229 and Document B returned second with "@search.score": 4.6097126.
There is a scoring profile on the index that adds a boost of 10 for the "title" field and a boost of 5 for the "summary" field, but this doesn't affect these results as neither have "BRIG" in either of those fields.
There's also a "freshness" scoring function with a boost of 15 over 365 days with a quadratic function profile. Again, this shouldn't apply to either of these documents as both were created over a year ago.
I can't figure out why Document A is scoring higher than Document B.
It's possible that document A is 'newer' than document B and that's the reason why it's being displayed first (has a higher score). Besides term relevance, freshness can also impact the score.
EDIT:
After some research, it looks like newly created Azure Cognitive Search services use the BM25 algorithm by default. (source: https://learn.microsoft.com/en-us/azure/search/index-similarity-and-scoring#scoring-algorithms-in-search)
Document length and field length also play a role in the BM25 algorithm. Longer documents and fields are given less weight in the relevance score calculation. Therefore, a document that contains a single instance of the search term in a shorter field may receive a higher relevance score than a document that contains the search term multiple times in a longer field.
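To see how length normalization can outweigh raw term frequency, here is a small illustrative sketch of the BM25 term-frequency component. The document lengths and parameter values below are made up for the example; they are not your index's actual statistics:

```python
def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 term-frequency component: saturates as tf grows and
    penalizes documents that are much longer than average."""
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + norm)

avg_len = 500  # hypothetical average field length, in terms

# Document A: one occurrence of "BRIG" in a short field.
score_a = bm25_tf(tf=1, doc_len=100, avg_doc_len=avg_len)

# Document B: 40 occurrences of "BRIG" in a very long field.
score_b = bm25_tf(tf=40, doc_len=15000, avg_doc_len=avg_len)

# With these made-up numbers, document A's component is higher than document B's,
# because the saturated benefit of 40 occurrences is outweighed by the length penalty.
print(score_a, score_b)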
Test your scoring profile configurations. Perhaps try issuing queries without scoring profiles first and see if that meets your needs.
The "searchMode" parameter controls precision and recall. If you want more recall, use the default "any" value, which returns a result if any part of the query string is matched. If you favor precision, where all parts of the string must be matched, change searchMode to "all". Try the above query both ways to see how searchMode changes the outcome. See Simple Query Examples.
If you are using the BM25 algorithm, you also may want to tune your k1 and b values. See Set BM25 Parameters.
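The k1 (term-frequency saturation) and b (length normalization) values are set on the index definition via the similarity property. A hedged sketch of that fragment, expressed as a Python dict you would merge into the index JSON; the values are examples only, not recommendations:

```python
# Hypothetical fragment of an index definition overriding the BM25 parameters.
# Raising k1 lets repeated terms keep adding to the score;
# lowering b reduces the penalty for long documents/fields.
similarity_settings = {
    "similarity": {
        "@odata.type": "#Microsoft.Azure.Search.BM25Similarity",
        "k1": 1.3,  # example value (service default is 1.2)
        "b": 0.5,   # example value (service default is 0.75)
    }
}
```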
Lastly, you may want to explore the new Semantic search preview feature for enhanced relevance.

Azure Cognitive Search analyzer vs. normalizer: when should you use each

I am learning Azure Cognitive Search and got a bit confused about analyzers and normalizers.
https://learn.microsoft.com/en-us/azure/search/search-analyzers
https://learn.microsoft.com/en-us/azure/search/search-normalizers
As far as I understand, the only difference is that analyzers perform tokenization.
Could someone provide a good example of when I should use one over the other?
What are the benefits of an analyzer over a normalizer, and vice versa?
Which is more efficient performance-wise?
Thank you for your time!
The simplest explanation is to use an analyzer for properties containing blocks of text. The normalizer is more suitable for properties with short content that you would typically use for filtering or sorting, like City, Country, Name, etc.
A block of text will have content in a specific language. A language-specific analyzer will do a better job of producing good tokens for internal use by the search engine. You will find that you get better recall for textual content that is correctly processed using a relevant analyzer.
The values that you pass in a filter, sort, or facet can't be analyzed, so "normalizers" were created to fill that gap. They don't do everything that an analyzer can do (i.e., there is a smaller set of allowed tokenizers, token filters, and character filters) but they take care of the bigger issues, like normalizing text casing and getting rid of punctuation.
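As a concrete sketch, an index might assign a language analyzer to the long text field and a predefined normalizer to the short filterable field. The field names here are hypothetical and only the relevant attributes are shown:

```python
# Hypothetical field definitions for an Azure Cognitive Search index.
fields = [
    {
        # Long block of text: searchable, processed by a language analyzer.
        "name": "description",
        "type": "Edm.String",
        "searchable": True,
        "analyzer": "en.microsoft",
    },
    {
        # Short value used for filtering/sorting: normalized but not tokenized,
        # so "Paris", "paris", and "PARIS" all match the same filter value.
        "name": "city",
        "type": "Edm.String",
        "filterable": True,
        "sortable": True,
        "normalizer": "lowercase",
    },
]
```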

Web Crawling and Pagerank

I'm a computer science student and I am a bit inexperienced when it comes to web crawling and building search engines. At this time, I am using the latest version of OpenSearchServer and am crawling several thousand domains. When using the built-in search engine creation tool, I get search results that are related to my query, but they are ranked using a vector-space model of the documents as opposed to the PageRank algorithm or something similar. As a result, the top results are only marginally helpful, whereas higher-quality results from sites such as Wikipedia are buried on the second page.
Is there some way to run a crude PageRank algorithm in OpenSearchServer? If not, is there a similarly easy-to-use open source package that does this?
Thanks for the help! This is my first time doing anything like this so any feedback is greatly appreciated.
I am not familiar with OpenSearchServer, but I know that most students working on search engines use Lucene or Indri. Reading papers on novel approaches to document search, you will find that the majority of them use one of these two APIs. Lucene is more flexible than Indri in terms of defining different ranking algorithms. I suggest taking a look at these two and seeing whether they are convenient for your purpose.
As you mention, the web crawl template of OpenSearchServer uses a search query with relevance based on the vector space model. But if you use the latest version (v1.5.11), it also mixes in the number of backlinks.
You may change the weight of the backlink-based score; by default it is set to 1.
We are currently working on providing more control on the relevance. This will be visible in future versions of OpenSearchServer.
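If you want to experiment with the idea outside OpenSearchServer, a crude PageRank pass over a link graph you have already extracted is only a few lines. This is a generic sketch with an invented link graph, not anything built into OSS:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterative PageRank over a dict mapping each page to the pages it links to.
    Crude version: pages with no outlinks simply leak their rank mass."""
    pages = set(links) | {dst for dsts in links.values() for dst in dsts}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for src, dsts in links.items():
            if not dsts:
                continue
            share = damping * rank[src] / len(dsts)
            for dst in dsts:
                new_rank[dst] += share
        rank = new_rank
    return rank

# Made-up link graph: wikipedia.org is linked to by both other sites.
graph = {
    "wikipedia.org": ["example.com"],
    "example.com": ["wikipedia.org"],
    "blog.example": ["wikipedia.org", "example.com"],
}
print(pagerank(graph))
```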

find the top words, relative to all documents

I have some 100,000+ text documents. I'd like to find a way to answer this (somewhat ambiguous) question:
For a given subset of documents, what are the n most frequent words - related to the full set of documents?
I'd like to present trends, e.g. a word cloud showing something like "these are the topics that are especially hot in the given date range". (Yes, I know that this is an oversimplification: words != topics, etc.)
It seems that I could possibly calculate something like tf-idf values for all words in all documents, and then do some number crunching, but I don't want to reinvent any wheels here.
I'm planning on possibly using Lucene or Solr for indexing the documents. Would they help me with this question - how? Or would you recommend some other tools in addition / instead?
This should work: http://lucene.apache.org/java/3_1_0/api/contrib-misc/org/apache/lucene/misc/HighFreqTerms.html
This Stack Overflow question also covers term frequencies in general with Lucene.
If you were not using Lucene already, the operation you are talking about is a classic introductory problem for Hadoop (the "word count" problem).
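If you end up doing the number crunching outside Lucene/Solr, the core idea (score terms that are frequent in the subset but rare across the full set, roughly tf-idf) fits in a short script. A plain-Python sketch with made-up documents, just to show the shape of the computation:

```python
import math
from collections import Counter

def top_relative_terms(subset, corpus, n=10):
    """Terms frequent in `subset` but rare in `corpus` overall (a rough tf-idf)."""
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc.lower().split()))
    subset_tf = Counter(w for doc in subset for w in doc.lower().split())
    scores = {
        w: tf * math.log(len(corpus) / (1 + doc_freq[w]))
        for w, tf in subset_tf.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Invented documents for illustration only.
corpus = ["the service was fine today", "routine maintenance completed",
          "major outage reported", "outage affects many users"]
recent = ["major outage reported", "outage affects many users"]
print(top_relative_terms(recent, corpus, n=3))
```

For 100,000+ documents you would want the counts to come from an index (Lucene term vectors, Solr facets) rather than re-tokenizing everything per query, but the scoring step stays the same.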

associated words

I am developing a program but am stuck on a particular hurdle. I need to find words associated with other words. E.g. "green" might be associated with "environment", "leaf", "earth", "wind", "electric", "hybrid", etc. All I can find is Google Sets. Is there any other resource that is better?
If you have a large text collection (say Wikipedia or Project Gutenberg), you can use co-occurrence scores to extract this kind of data. See e.g. Padó and Lapata and the references therein.
I recently built a tool that mines this kind of associations from Wikipedia database dumps by another method. It requires a lot of memory though; other folks have tried to do the same using randomized methods.
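A minimal version of the co-occurrence idea counts which words appear near the query word and scores them by pointwise mutual information. The sketch below uses a tiny invented corpus purely for illustration; for real use you would feed it something like a Wikipedia dump:

```python
import math
from collections import Counter

def associated_words(corpus, query, window=5, top_n=5):
    """Rank words by PMI with `query`, based on co-occurrence within a window."""
    word_counts = Counter()
    pair_counts = Counter()
    total = 0
    for sentence in corpus:
        tokens = sentence.lower().split()
        word_counts.update(tokens)
        total += len(tokens)
        for i, w in enumerate(tokens):
            if w != query:
                continue
            for other in tokens[max(0, i - window): i + window + 1]:
                if other != query:
                    pair_counts[other] += 1
    # PMI = log( P(w, query) / (P(query) * P(w)) ), with simple count-based estimates.
    scores = {
        w: math.log((c / total) / ((word_counts[query] / total) * (word_counts[w] / total)))
        for w, c in pair_counts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

corpus = [
    "green energy and wind power reduce emissions",
    "the hybrid car is an electric vehicle",
    "green leaf on the tree",
    "green hybrid and electric cars help the environment",
]
print(associated_words(corpus, "green"))
```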
If you're still looking for a resource of semantically related words, I've just recently developed an API that takes a query and returns semantically related words. It offers parts of speech, relationships to the query word, and a word similarity measurement.
https://kiingo.co/rapid-associations-api
Disclaimer: I'm the developer of this API.
