What is the best text analyzer to use on generic data in Azure Cognitive Search when searching for more than 1 word? - azure

I have been looking over the different text analyzers which Azure Cognitive Search offers with this API.
The data I have is generic: it can be an email address or a name, to give just two examples.
Which is the best analyzer to use on this type of data (generic)?
Also, does the text analyzer in use affect how search works when looking for more than 1 word?
What is the best way to make it do a fuzzy search for more than 1 word, e.g. "joe blogs", with every word matched fuzzily?
I don't want "somename blogs" to show up, because somename is not a fuzzy match for joe.
I do want "joe clogs" to show up, because joe matches joe and clogs is a fuzzy match for blogs.
What is the best practice for fuzzy search with more than 1 word, so that the end user gets fewer hits as they supply more words?

If you have generic content and don't want to use linguistic processing, you can use a generic analyzer like the Whitespace analyzer. See
https://learn.microsoft.com/en-us/azure/search/index-add-custom-analyzers#built-in-analyzers
How searches for single or multiple words work is determined by the searchMode parameter. The recommended practice is to use all, instead of any. When you specify more terms, you are more specific and you want fewer (more precise) results.
You can specify multi-word queries where individual search terms are fuzzy by using the tilde syntax. E.g. to do a fuzzy search for joe but an exact match on blogs you could use something like:
joe~ blogs
You can also control how fuzzy you want it to be. See
https://learn.microsoft.com/en-us/azure/search/query-lucene-syntax#bkmk_fuzzy
PS: From your use case it sounds like proximity matching is also something you could consider using:
https://learn.microsoft.com/en-us/azure/search/query-lucene-syntax#bkmk_proximity
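To make every word fuzzy (the "joe blogs" case from the question), one option is to append the tilde to each term and combine that with searchMode=all. The sketch below only builds the query string and a request body; the helper names fuzzy_all_terms and build_search_body are made up for illustration, the edit distance of 1 is an assumed default, and sending the request to the service is left out. The fuzzy syntax needs the full Lucene parser, hence queryType=full.

```python
import re

# Characters that have special meaning in the full Lucene query syntax.
_LUCENE_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/])')


def fuzzy_all_terms(text, distance=1):
    """Escape each whitespace-separated term and append ~distance to it."""
    terms = [_LUCENE_SPECIAL.sub(r"\\\1", term) for term in text.split()]
    return " ".join(f"{term}~{distance}" for term in terms)


def build_search_body(text, distance=1):
    """Build a request body (hypothetical helper) for a documents search call."""
    return {
        "search": fuzzy_all_terms(text, distance),
        "queryType": "full",   # fuzzy operators need the full Lucene syntax
        "searchMode": "all",   # every term must match, so more words = fewer hits
    }


body = build_search_body("joe blogs")
```

With searchMode set to all, "somename blogs" is excluded because somename is not within the edit distance of joe, while "joe clogs" can still match.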

Related

Azure Search Suggestions don't catch missing prefix

When sending a phrase to the Azure Search service using the Suggest method,
the results are only phrases that start with the search term,
even when using "FuzzyMatching".
For example, "ap" will return "aplle" and "april" but not "rap".
Is it possible to get any phrase that contains the search term?
You are correct that Azure Search does not allow for the ability to do this type of contain (or wildcard) search for suggestions. However, one thing that we will be releasing (hopefully towards the end of next week) is something called custom analyzers. Custom analyzers allow you to do not only this, but other types of analysis on your data. For example, you can create a field and tell us that it should allow for prefix or suffix matching. You can also choose to do regex style queries against your field.
I do want to caveat this with a bit of a warning though. If you set your field to allow for prefix or suffix search we can get results quite quickly because if we know that you want us to either look at the start or end of the word, we can build our inverted index appropriately to handle this very quickly. However, for things like generic contain (or even regex) it is more of a brute force type of search and if you have significant content, this could have an impact on the latency of your queries.
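The difference in cost can be illustrated with a toy example. This is only a sketch of the general idea, not how Azure Search is implemented: with a sorted term list, a prefix lookup can jump straight to the right place with a binary search, whereas a generic "contains" lookup has to inspect every term.

```python
import bisect


def prefix_matches(sorted_terms, prefix):
    """Binary-search to the first candidate, then walk while the prefix holds."""
    matches = []
    i = bisect.bisect_left(sorted_terms, prefix)
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        matches.append(sorted_terms[i])
        i += 1
    return matches


def contains_matches(terms, fragment):
    """Brute force: every term has to be checked."""
    return [term for term in terms if fragment in term]


terms = sorted(["april", "apple", "rap", "wrap", "banana"])
starts_with_ap = prefix_matches(terms, "ap")
contains_ap = contains_matches(terms, "ap")
```

Suffix matching can use the same trick on reversed terms, which is why prefix and suffix search can both be served quickly.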
Hopefully that will help you do what you need here and if you want to keep an eye out for this, we will be posting content on this at our documentation page: https://azure.microsoft.com/en-us/documentation/services/search/
Liam

Is there a way to get all complete sentences that a search engine (e.g. Google) has indexed that contain two search terms?

as the question says: "Is there a way to get all complete sentences that a search engine (e.g. Google) has indexed that contain two search terms?"
I would like to use the (e.g. Google) search syntax: BMW AND Toyota. (<-- this is just an example)
And I would then like to have returned all sentences that mention BMW and Toyota. They must be in a single (ideally: short) sentence though.
Is that possible?
Many thanks!
PS.: Sorry - I have difficulties finding the right tags for my question... Please feel free to suggest more appropriate ones and I will update the question.
PPS.: Let me rephrase my question: If it is not readily possible with an existing search engine, are there any programmatic ways to do that? Would one have to write a crawler for that purpose?
No, this is probably not possible, as Google stores this information based on keywords and other algorithms.
For any given keyword or set of keywords, Google is presumably maintaining a reference to one or many matching (some accurate, some not so accurate) titles.
I do not work for Google, but that could be one way they are maintaining their search results.

Improving type-ahead suggestions in Search using SOLR

What are the possible ways you could improve the type-ahead (auto-complete) suggestions that appear in a free-form search?
From my understanding, all the suggestions that appear for keywords are stored in a SOLR table.
How do you ensure that it covers all the industry specific relevant type-ahead suggestions?
Can you automate including most recent user generated queries that are not currently providing search results to lead to relevant ones?
In preprocessing, the documents fed into the search engine need to be enriched with whatever is sensible and helps users find them. E.g. a document containing the string paris may be enriched with french capital, capital of france, ile-de-france, … You will need a dictionary to do so. You can take data from dbpedia.org or (for English only) WordNet. To avoid over-generalizing, you will need to implement some disambiguation (meaning discovery) as a first step, since paris, for example, could equally be expanded with alexandros, alaksandu of wilusa, king of troy, depending on the context.
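A minimal sketch of the enrichment step, assuming a tiny hand-written dictionary in place of real data from dbpedia.org or WordNet. The EXPANSIONS table and the enrich function are hypothetical, and no disambiguation is attempted here.

```python
# Hypothetical expansion dictionary; real data would come from an external source.
EXPANSIONS = {
    "paris": ["french capital", "capital of france", "ile-de-france"],
}


def enrich(text, expansions=EXPANSIONS):
    """Return the document text together with extra phrases to index alongside it."""
    extra = []
    for word in text.lower().split():
        extra.extend(expansions.get(word, []))
    return {"text": text, "enrichment": extra}


doc = enrich("Flights to Paris")
```

The enrichment field would then be indexed next to the original text so that a query such as "french capital" can find the document.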

smart search by first/last name

I have to build a search facility capable of searching members by their first name/last name and maybe some other search parameters (e.g. address).
The search should provide a list of match candidates so that the user can select whatever he/she deems the "correct" match.
The search should be smart enough so that the "correct" result would be among the first few items on the list. The search should also be tolerant to typos and misspellings and, may be, even be aware of name shortcuts i.e. Bob vs. Robert or Bill vs. William.
I started investigating Lucene and the family (like elastic search) as a tool for the job. While it has an impressive array of features addressing similar problems for the full text search, I am not so sure how to use them for my task - up to the point that maybe Lucene is not the right tool here at all.
What do you guys think - how can I harness Elastic Search to solve my problem? Or should I look elsewhere?
Lucene supports edit distance queries, so your search query can tolerate some typos. You define the tolerance as the allowed edit distance for a term.
For instance:
name:johnni~0.8
would return "johnny"
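For reference, edit distance here means the number of single-character insertions, deletions, or substitutions needed to turn one string into another. The function below is a plain Levenshtein sketch for illustration, not Lucene's implementation; note also that in older Lucene versions the number after the tilde is a similarity threshold rather than a raw edit count.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


distance = edit_distance("johnni", "johnny")
```

Here johnni and johnny differ by a single substitution, which is why a fuzzy query on johnni can return johnny.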
Also Solr provides a wide array of ready made search filters and analyzers you can use for search.
In your case I would probably chain several filter factories together:
TrimFilterFactory - trim the query
LowerCaseFilterFactory - to get rid of case differences
ISOLatin1AccentFilterFactory - to remove accents from letters (most people don't search with the accent anyway)
PhoneticFilterFactory - for sounds-like matching, e.g. kris -> chris
Look at the documentation under the link; it is pretty straightforward to set up a new Solr instance with an Analyzer that uses all the above filters. I used something similar for searching city names and it worked fairly well.
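To show what the first three filters in such a chain do to a name, here is a rough standalone equivalent in Python. It is an illustration only, not Solr configuration; the function name normalize_name is made up, and the phonetic step is left out.

```python
import unicodedata


def normalize_name(text):
    """Trim, lowercase, and strip accents, mirroring the first three filters."""
    text = text.strip().lower()
    # Decompose accented letters, then drop the combining accent marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


normalized = normalize_name("  José ")
```

Applying the same normalization at index time and at query time is what lets "jose" find "José".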
Lucene can be made tolerant of typos and misspellings, and can use synonyms. As for
The search should be smart enough so that the "correct" result would be among the first few items on the list
Are there any search engines which don't try to do this?
As far as Bob/Robert goes, that can be done with synonyms, but you need to get the synonym data from some reliable source.
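A minimal sketch of the synonym idea, assuming a small hand-made nickname table. The NICKNAMES data and the canonical_name helper are hypothetical; a real system would load the mapping from a reliable source.

```python
# Hypothetical nickname table mapping short forms to a canonical given name.
NICKNAMES = {
    "bob": "robert",
    "bill": "william",
}


def canonical_name(name):
    """Map a nickname to its canonical form; unknown names pass through."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


same_person = canonical_name("Bob") == canonical_name("Robert")
```

Indexing and querying on the canonical form makes Bob and Robert land on the same term.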
In addition to what @Asaf mentioned, you might try to use N-gram indexing to deal with spelling variants. See the CJKAnalyzer for an example of how to do that.
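The idea behind N-gram matching can be sketched as follows: split each name into overlapping character pairs and compare the sets. This is a toy similarity measure for illustration, not what the CJKAnalyzer or any Lucene scorer computes.

```python
def char_ngrams(text, n=2):
    """Set of overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a, b, n=2):
    """Jaccard overlap of the two n-gram sets, between 0.0 and 1.0."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


close = ngram_similarity("johnni", "johnny")
far = ngram_similarity("johnni", "robert")
```

Spelling variants share most of their n-grams, so they score high, while unrelated names share few or none.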

iOS: Search on a main word(noun), not its pronoun

I am writing a TableView app where people can search for a word in a foreign language. In this language, the article is important as it tells the word's gender.
A reasonable English example is "The Book".
I want to search for "Book", not "The".
Any ideas on the best way to do this?
Many thanks
You need a secondary index free from noise words and do a search against this. There are also some full text search libraries for iOS, or you can build your own version of Sqlite with full text module turned on.
Also you may consider preprocessing the query, for example using an algorithm to reduce it to its word root and then searching that with a wildcard (e.g. 'consideration' >> 'consider*').
Locayta http://www.locayta.com/iOS-search-engine/locayta-search-mobile/register-for-download
Building Sqlite with Fulltext on iOS http://longweekendmobile.com/2010/06/16/sqlite-full-text-search-for-iphone-ipadyour-own-sqlite-for-iphone-and-ipad/
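The secondary-index idea from this answer can be sketched in a few lines. Python is used here for brevity rather than an iOS language; the NOISE_WORDS set stands in for the articles of the actual language, and build_index and search are hypothetical names.

```python
from collections import defaultdict

# Hypothetical noise words; replace with the articles of the target language.
NOISE_WORDS = {"the", "a", "an"}


def build_index(entries):
    """Map every non-noise word to the full entries that contain it."""
    index = defaultdict(list)
    for entry in entries:
        for word in entry.lower().split():
            if word not in NOISE_WORDS:
                index[word].append(entry)
    return index


def search(index, query):
    """Look the query up in the noise-free index; results keep their article."""
    return index.get(query.strip().lower(), [])


index = build_index(["The Book", "The Table"])
hits = search(index, "book")
```

The full entry, article included, is what gets displayed, so the gender information is preserved even though the article is not searchable.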
Are you talking about looking something up in a database, eg? SQLite can be built with Full Text Search extensions that allow you to search for individual words in text. Even without the FTS extensions you can use a LIKE match in SQLite to find a word in a phrase, though the FTS extensions are much faster and more flexible.
You can also implement your own poor-man's Key Word In Context (KWIC) scheme -- basically just enter each item in the database N times for an N-word phrase, each time rotated one word.
And there are variations on the KWIC scheme that work for large numbers of phrases with less duplication -- using a tree structure to access the data. With such approaches it's practical to implement a search without need for a keyboard, just by successively refining the table contents.
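The basic rotation scheme described above can be sketched like this, in Python for illustration. Each phrase is stored once per word, rotated so that every word appears at the front of some entry, and a sorted prefix lookup then finds phrases by any of their words. The function names are made up.

```python
import bisect


def kwic_rotations(phrase):
    """All rotations of the phrase, one starting at each word."""
    words = phrase.split()
    return [" ".join(words[i:] + words[:i]) for i in range(len(words))]


def build_kwic(phrases):
    """Sorted list of (lowercased rotation, original phrase) pairs."""
    return sorted((rot.lower(), p) for p in phrases for rot in kwic_rotations(p))


def kwic_search(table, word):
    """Return the original phrases that have a rotation starting with word."""
    word = word.lower()
    i = bisect.bisect_left(table, (word,))
    found = []
    while i < len(table) and table[i][0].startswith(word):
        found.append(table[i][1])
        i += 1
    return found


table = build_kwic(["The Book", "The Red Table"])
result = kwic_search(table, "red")
```

The cost is N stored rows for an N-word phrase, which is the duplication the tree-based variants try to reduce.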
