Best way to support wildcard search on a large dictionary? - search

I am working on a project to search in a large dictionary (100k~1m words). The dictionary items look like {key,value,freq}. Myy task is the development of an incremental search algoritm to support exact match, prefix match and wildcard match. The results should be ordered by freq.
For example:
the dictionary looks like
key1=a,value1=v1,freq1=4
key2=ab,value2=v2,freq2=2
key3=abc,value3=v3 freq3=1
key4=abcd,value4=v4,freq4=3
when a user types 'a', return v1,v4,v2,v3
when a user types 'a?c', return v4,v3
Now my best choice is a suffix tree represented by DAWG data struct, but this method does not support wildcard matches effectively.
Any suggestion?

You need to look at n-grams for indexing your content. If you want to something Out-of-the box, you might want to look at Apache Solr which does a lot of the hard work for you. It also supports prefix, wildcard queries etc.

Related

Azure Cognitive Search - When would you use different search and index analyzers?

I'm trying to understand what is the purpose of configuring a different analyzer for searching and indexing in Azure Search. See: https://learn.microsoft.com/en-us/rest/api/searchservice/create-index#-field-definitions-
According to my understanding, the job of the indexing analyzer is to breakup the input document into individual tokens. Through this process, it might apply multiple transformations like lower-casing the content, removing punctuation and white-spaces, and even removing entire words.
If the tokens are already processed, what is the use of the search analyzer?
Initially, I thought it would apply a similar process on the search query itself, but wouldn't setting a different analyzer than the one used to index the document at this stage completely breaks the search results? If the indexing analyzer lower-cased everything, but the search analyzer doesn't lower-case the query, wouldn't that means you'll never get matches for queries with upper case characters? What if the search analyzer doesn't split tokens on white-spaces? Won't you ever get a match the moment the query includes a space?
Assuming that this is indeed how the two analyzers works together, then why would you ever want to set two different ones?
Your understanding of the difference between index and search analyzer is correct. An example scenario where that's valuable is using ngrams for indexing but not for search terms. So this would allow a document with "cat" to produce "c", "ca", "cat" but you wouldn't necessarily want to apply ngrams on the search term as that would make the query less performant and isn't necessary since the documents already produced the ngrams. Hopefully that makes sense!

In Azure Search, how do you run a "contains" search with multiple terms in searchtext?

I am using Azure Search in full query mode on top of CosmosDB and I want to run a query for any documents with a field that contains the string "azy do". This should match, for example, a document containing "lazy dog".
Reading the Azure Search documentation, it looks like this is impossible due to the term-based indexes it uses.
Rejected solutions
0 matches since it is looking for whole words:"azy do"
Doesn't work since regexes are not allowed to span multiple terms:/.azy do./
This "works", to the extent that it will match "lazy dog", but this does not respect the ordering of the query and will also match "dog lazy", for example /.azy./ AND /.do./
Is there any way of doing this correctly in Azure Search?
If you cannot achieve that via regular expression in the Lucene Query syntax, then is not possible. You may want to vote for supporting contains here.
It should be /.lazy|dog./
So, split the terms based on whitespaces and add a pipe(|) delimiter which stands for OR.
Shortly, Azure Search is not designed to support this scenario. You might be better off using the CONTAINS function in Cosmos DB or its equivalent, depending on what query language you use.
Azure Search is designed for finding terms or phrases that occur in unstructured content (documents) and returning the most relevant documents. The process of extracting and indexing those searchable terms is customizable and described here: How full text search works in Azure Search.

How to Index and Search multiple terms and phrases with Lucene

I am using Lucene.NET to index the contents of a set of documents. My index contains several fields, but I'm mainly concerned with querying the "contents" field. I'm trying to figure out the best way of indexing, as well as creating the query, to meet the requirements.
Here are the current requirements:
Able to search multiple keywords, such as "planes trains automobiles" (minus the quotes). This should give me all documents that contain ANY of the terms, but the documents that contain all three should be at the top
Able to search for phrases, such as "planes, trains, and automobiles" (with quotes) which would only match if they were together in that order.
As for stop words, I would be ok with either ignoring them altogether, or including them.
As for punctuation or special characters, same deal. I can either ignore them completely, or include them.
The last two just need to be consistent, not necessarily with each other, but with how the indexer and searcher handles them. So I just don't want to have a case where the user searches for "planes and trains" but it doesn't match a document that does contain that phrase, because the indexer took out the "and" but the searcher is trying to search for that particular phrase.
Some of the documents are large, so I think we don't want to do Field.Store.Yes, right? Unless we have to for what we need to do.
The requirements you've listed should be handled just fine by using lucene's standard analyzer and queryparser. Make sure to use the same analyzer in the IndexWriter and the QueryParser. Stop words are eliminated. Punctuation is generally ignored, though the rules are a bit more involved that just ignoring every punctuation character (see UAX #29, section 4, if you are interested in the details)
If you try running the Lucene demo, you should find it works just about as you've specified here.
As far as storing the field, you have it right, yes. Store the field if you need to retrieve it from the index. Large fields that you don't need to retrieve do not need to be stored.

What indexer do I use to find the list in the collection that is most similar to my list?

Lets say I have my list of ingredients:
{'potato','rice','carrot','corn'}
and I want to return lists from a database that are most similar to mine:
{'beans','potato','oranges','lettuce'},
{'carrot','rice','corn','apple'}
{'onion','garlic','radish','eggs'}
My query would return this first:
{'carrot','rice','corn','apple'}
I've used Solr, and have looked at CloudSearch, ElasticSearch, Algolia, Searchify and Swiftype. These engines only seem to let me put in one query string and then filter by other facets.
In a real scenario my search list will be about 200 items long and will be matching against about a million lists in my database.
What technology should I use to accomplish what I want to do?
Should I look away from search indexers and more towards database-esque things like mongo, map reduce, hadoop... All I know are the names of other technologies and I just need someone to point me in the right direction on what technology path I should be exploring for this.
With so much data I can't really loop through it, I need to query everything at once.
I wonder what keeps you from trying it with Solr, as Solr provides much of what you need. You can declare the field as type="string" multiValued="true and save each list item as a value. Then, when querying, you specify each of the items in the list to look for as a search term for that field, and Solr will – by default – return the closest match.
If you need exact control over what will be regarded as a match (e.g. at least 40% of the terms from the search list have to be in a matching list) you can use the mm EDisMax parameter, cf. Solr Wiki
Having said that, I must add that I’ve never searched for 200 query terms (do I unerstand correctly that the list whose contents should be searched will contain about 200 items?) and do not know how well that performs. But I guess that setting up a test core and filling it with random lists using a script should not take more than a few hours, so it should be possible to evaluate the performance of this approach without investing too much time.

Fuzzy String Matching

I have a requirement within my application to fuzzy match a string value inputted by the user, against a datastore.
I am basically attempting to find possible duplicates in the process in which data is added to the system.
I have looked at Metaphone, Double Metaphone, and SoundEx, and the conclusion I have came to is they are all well and good when dealing with a single word input string; however I am trying to match against a undefined number of words (they are actually place names).
I did consider actually splitting each of the words from the string (removing any I define as noise words), then implementing some logic which would determine which place names within my data store, best matched (based on the keys from the algorithm I choose); the advantage I see in this, would be I could selectively tighten up, or loosen the match criteria to suit the application: however this does seem a little dirty to me.
So my question(s) are:
1: Am I approaching this problem in the right way, yes I understand it will be quite expensive; however (without going to deeply into the implementation) this information will be coming from a memcache database.
2: Are there any algorithms out there, that already specialise in phonetically matching multiple words? If so, could you please provide me with some information on them, and if possible their strengths and limitations.
You may want to look into a Locality-sensitive Hash such as the Nilsimsa Hash. I have used Nilsimsa to "hash" craigslists posts across various cities to search for duplicates (NOTE: I'm not a CL employee, just a personal project I was working on).
Most of these methods aren't as tunable as you may want (basically you can get some loosely-defined "edit distance" metric) and they're not phonetic, solely character based.

Resources