Lucene - Fuzzy search on a word containing a space

The case I am facing seems very simple, but I can't see a clear solution:
Imagine I want to index a text containing "Summertime, and the living is easy" in a Lucene index.
I want a search for "summer time" in my UI to find the indexed document containing "Summertime", while keeping all the benefits of a StandardAnalyzer.
I thought a FuzzyQuery would suffice (since the edit distance is 1), but because the tokenizer I use splits on spaces, the query becomes two separate tokens and that solution doesn't work.
Which analyzer should I use to allow this, while keeping all the benefits of a StandardAnalyzer-like analyzer (stop words, the possibility of adding synonyms, ...)?
Maybe it's simpler than I think (at least it seems so), but I really can't see a solution for now... :(

You can use a ShingleFilter to make Lucene (or Solr) combine multiple tokens into one, with a user-defined separator.
That way you'll get "summer time" as a single token, as well as "summer" and "time" (unless you disable outputUnigrams). The combined token is then within a small edit distance of "summertime", and the fuzzy search should work as you want it to.
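To make the idea concrete, here is a minimal sketch in plain Python rather than Lucene itself. The `shingles` and `edit_distance` helpers are hypothetical stand-ins written for illustration. They only mimic what a shingle filter and a fuzzy match do conceptually, and the parameter names merely echo the filter's options.

```python
def shingles(tokens, max_size=2, separator=" ", output_unigrams=True):
    """Return word n-grams (shingles) of up to max_size tokens, joined by separator.

    Illustrative stand-in for a shingle filter, not Lucene's implementation.
    """
    out = []
    start_size = 1 if output_unigrams else 2
    for i in range(len(tokens)):
        for size in range(start_size, max_size + 1):
            if i + size <= len(tokens):
                out.append(separator.join(tokens[i:i + size]))
    return out


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# The query "summer time" after whitespace tokenization and shingling.
query_tokens = shingles(["summer", "time"])
indexed_term = "summertime"
distances = {t: edit_distance(t, indexed_term) for t in query_tokens}
```

Only the shingle "summer time" ends up one edit away from "summertime"; the unigrams "summer" and "time" are much further away, which is why the fuzzy query fails without the combined token.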

Related

Azure Cognitive Search - When would you use different search and index analyzers?

I'm trying to understand what is the purpose of configuring a different analyzer for searching and indexing in Azure Search. See: https://learn.microsoft.com/en-us/rest/api/searchservice/create-index#-field-definitions-
According to my understanding, the job of the indexing analyzer is to break up the input document into individual tokens. Through this process, it might apply multiple transformations like lower-casing the content, removing punctuation and white-space, and even removing entire words.
If the tokens are already processed, what is the use of the search analyzer?
Initially, I thought it would apply a similar process to the search query itself, but wouldn't setting a different analyzer at this stage than the one used to index the document completely break the search results? If the indexing analyzer lower-cased everything, but the search analyzer doesn't lower-case the query, wouldn't that mean you'll never get matches for queries with upper-case characters? What if the search analyzer doesn't split tokens on white-space? Wouldn't you stop getting matches the moment the query includes a space?
Assuming that this is indeed how the two analyzers work together, why would you ever want to set two different ones?
Your understanding of the difference between the index and search analyzers is correct. One scenario where using different analyzers is valuable is applying n-grams at indexing time but not to search terms. A document containing "cat" would be indexed as "c", "ca", and "cat", but you wouldn't necessarily want to apply n-grams to the search term, since that would make the query less performant and isn't necessary: the documents already produced the n-grams. Hopefully that makes sense!
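A rough way to picture this, as a plain Python sketch rather than Azure Search configuration: the `index_analyzer`, `search_analyzer`, `add_document`, and `search` names below are hypothetical, and the in-memory dictionary stands in for a real inverted index.

```python
def index_analyzer(text):
    """Index side: lowercase, split on whitespace, emit every prefix of each word."""
    terms = set()
    for word in text.lower().split():
        for end in range(1, len(word) + 1):
            terms.add(word[:end])
    return terms


def search_analyzer(text):
    """Search side: lowercase and split only, with no n-gram expansion."""
    return set(text.lower().split())


index = {}  # term -> set of document ids (toy inverted index)


def add_document(doc_id, text):
    for term in index_analyzer(text):
        index.setdefault(term, set()).add(doc_id)


def search(query):
    """Return ids of documents that contain every analyzed query term."""
    terms = search_analyzer(query)
    if not terms:
        return set()
    results = None
    for term in terms:
        ids = index.get(term, set())
        results = set(ids) if results is None else results & ids
    return results


add_document(1, "Cat")
add_document(2, "Dog")
```

Here a query for "CA" is lowercased to the single term "ca" and looked up directly, which matches document 1 because the prefixes were already generated at indexing time. Both analyzers still lowercase, so the shared normalization stays consistent while only the expansion step differs.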

How to Index and Search multiple terms and phrases with Lucene

I am using Lucene.NET to index the contents of a set of documents. My index contains several fields, but I'm mainly concerned with querying the "contents" field. I'm trying to figure out the best way of indexing, as well as creating the query, to meet the requirements.
Here are the current requirements:
Able to search multiple keywords, such as "planes trains automobiles" (minus the quotes). This should give me all documents that contain ANY of the terms, but the documents that contain all three should be at the top
Able to search for phrases, such as "planes, trains, and automobiles" (with quotes) which would only match if they were together in that order.
As for stop words, I would be ok with either ignoring them altogether, or including them.
As for punctuation or special characters, same deal. I can either ignore them completely, or include them.
The last two just need to be consistent, not necessarily with each other, but in how the indexer and searcher handle them. I just don't want a case where the user searches for "planes and trains" but it doesn't match a document that does contain that phrase, because the indexer took out the "and" while the searcher is still trying to match that exact phrase.
Some of the documents are large, so I think we don't want to do Field.Store.Yes, right? Unless we have to for what we need to do.
The requirements you've listed should be handled just fine by using Lucene's standard analyzer and QueryParser. Make sure to use the same analyzer in the IndexWriter and the QueryParser. Stop words are eliminated. Punctuation is generally ignored, though the rules are a bit more involved than just ignoring every punctuation character (see UAX #29, section 4, if you are interested in the details).
If you try running the Lucene demo, you should find it works just about as you've specified here.
As far as storing the field, you have it right, yes. Store the field if you need to retrieve it from the index. Large fields that you don't need to retrieve do not need to be stored.
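As a rough illustration of why sharing one analyzer matters, here is a small Python sketch. It is not Lucene.NET: the `analyze`, `keyword_search`, and `phrase_search` functions and the stop-word list are assumptions made up for the example, and scoring is reduced to counting matching terms, which is far simpler than what Lucene actually does. It also ignores the position gaps that Lucene keeps where stop words were removed.

```python
import re

# Assumed stop-word list for the example only.
STOP_WORDS = {"a", "an", "and", "the", "of"}


def analyze(text):
    """Shared analyzer: lowercase, split on non-alphanumerics, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOP_WORDS]


def keyword_search(docs, query):
    """Match documents containing ANY query term; more matching terms rank higher."""
    query_terms = set(analyze(query))
    scored = []
    for doc_id, text in docs.items():
        hits = len(query_terms & set(analyze(text)))
        if hits:
            scored.append((hits, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


def phrase_search(docs, phrase):
    """Match documents whose analyzed tokens contain the analyzed phrase in order."""
    wanted = analyze(phrase)
    n = len(wanted)
    matches = []
    for doc_id, text in docs.items():
        tokens = analyze(text)
        if n and any(tokens[i:i + n] == wanted
                     for i in range(len(tokens) - n + 1)):
            matches.append(doc_id)
    return matches


docs = {
    1: "Planes, Trains and Automobiles",
    2: "Trains are late",
    3: "Automobiles and planes",
}
```

Because the same `analyze` function runs on both the documents and the query, a phrase search for "planes and trains" still finds document 1: the "and" is dropped on both sides, so the two token sequences line up.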

Fastest way to search a SQL Server table (or indexed view) column with "like '%search%'"?

Suppose there's a table with columns (UserID, FieldID, Value), with half a million records. I want to see if some search term T(N) occurs anywhere in each Value (i.e. Value.Contains( T(N) ) ).
I think I'm just hitting a wall volume-wise: there are just too many values to sift through. I don't think a full-text index will help, because it's only useful for StartsWith queries that look at individual words, not for occurrences anywhere within the string.
Is there a good approach to indexing this kind of data for such a search in SQL Server?
A half-million records is not terribly large, although I don't know the size of the field contents. Here are a couple of ideas; this was too long for a comment, or else I would have posted it as one.
You could use a full-text search engine like Elastic or Solr as a sidecar. If your text searches don't otherwise make much use of the other data, this might be easy enough. Note that you could put other data for searching into Elastic or Solr, but I'm not sure you'd want to duplicate all your data, and those tools aren't really great as a transactional data store.
Another option for volumes this small, assuming you only need basic "contains" searching: create two more tables, keywords and keyword_index (or whatever). When saving, tokenize your text content, write any new keywords to the keywords table, and then add the links to the join table. Index everything, and then do your search off the keywords table, joining back to the master table via the intermediate keyword_index table.
This is fairly hackish, and getting your keyword handling really dialed in (for stemming, etc) may be a pain. It is a reasonable quick & dirty solution for smaller-scale needs though.
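A minimal sketch of the keyword-table idea, using Python's built-in SQLite instead of SQL Server so that it is self-contained. The `user_values` table name, the column names, and the `save_value` and `search` helpers are all hypothetical. Running the "contains" match as a LIKE over the keywords table is one possible reading of "search off the keywords table", on the assumption that the set of distinct keywords is much smaller than the set of values.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_values (
    id INTEGER PRIMARY KEY, user_id INTEGER, field_id INTEGER, value TEXT);
CREATE TABLE keywords (
    id INTEGER PRIMARY KEY, keyword TEXT UNIQUE);
CREATE TABLE keyword_index (
    keyword_id INTEGER, value_id INTEGER,
    PRIMARY KEY (keyword_id, value_id));
""")


def save_value(user_id, field_id, value):
    """Insert a value, then record each distinct lowercase word as a keyword link."""
    cur = conn.execute(
        "INSERT INTO user_values (user_id, field_id, value) VALUES (?, ?, ?)",
        (user_id, field_id, value))
    value_id = cur.lastrowid
    for word in set(re.findall(r"[a-z0-9]+", value.lower())):
        conn.execute(
            "INSERT OR IGNORE INTO keywords (keyword) VALUES (?)", (word,))
        keyword_id = conn.execute(
            "SELECT id FROM keywords WHERE keyword = ?", (word,)).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO keyword_index (keyword_id, value_id) "
            "VALUES (?, ?)", (keyword_id, value_id))
    return value_id


def search(term):
    """Find value ids whose keywords contain `term` (LIKE wildcards not escaped)."""
    rows = conn.execute("""
        SELECT DISTINCT v.id
        FROM keywords k
        JOIN keyword_index ki ON ki.keyword_id = k.id
        JOIN user_values v ON v.id = ki.value_id
        WHERE k.keyword LIKE ?
        ORDER BY v.id
    """, ("%" + term.lower() + "%",))
    return [row[0] for row in rows]


first = save_value(1, 10, "Summertime blues")
second = save_value(2, 10, "Winter is coming")
```

With this layout, `search("mert")` only has to scan the distinct keywords, not every stored value. One limitation of the sketch is that a search term spanning two words (for example "time blues") will not match, since each keyword is a single token.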

Fuzzy String Matching

I have a requirement within my application to fuzzy match a string value inputted by the user, against a datastore.
I am basically attempting to find possible duplicates in the process in which data is added to the system.
I have looked at Metaphone, Double Metaphone, and Soundex, and the conclusion I have come to is that they are all well and good when dealing with a single-word input string; however, I am trying to match against an undefined number of words (they are actually place names).
I did consider splitting the string into words (removing any I define as noise words), then implementing some logic to determine which place names in my data store match best, based on the keys from the algorithm I choose. The advantage I see in this is that I could selectively tighten or loosen the match criteria to suit the application; however, this does seem a little dirty to me.
So my question(s) are:
1: Am I approaching this problem in the right way? Yes, I understand it will be quite expensive; however (without going too deeply into the implementation), this information will be coming from a memcache database.
2: Are there any algorithms out there, that already specialise in phonetically matching multiple words? If so, could you please provide me with some information on them, and if possible their strengths and limitations.
You may want to look into a locality-sensitive hash such as the Nilsimsa hash. I have used Nilsimsa to "hash" Craigslist posts across various cities to search for duplicates (NOTE: I'm not a CL employee; this was just a personal project I was working on).
Most of these methods aren't as tunable as you may want (basically you get some loosely defined "edit distance" metric), and they're not phonetic; they are solely character-based.
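To show what "solely character-based" looks like in practice, here is a small Python sketch. It is deliberately not Nilsimsa or any locality-sensitive hash: it swaps in a much simpler character-trigram overlap (Jaccard similarity), and the `trigrams`, `similarity`, and `possible_duplicates` names and the 0.5 threshold are assumptions chosen for illustration.

```python
def trigrams(text):
    """Set of character trigrams of the lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    padded = "  " + normalized + " "  # padding lets word boundaries count
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Jaccard overlap of trigram sets: 1.0 means identical sets, 0.0 disjoint."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def possible_duplicates(candidate, existing, threshold=0.5):
    """Return existing names scoring at or above threshold, best match first."""
    scored = [(similarity(candidate, name), name) for name in existing]
    return [name for score, name in sorted(scored, reverse=True)
            if score >= threshold]


places = ["Newcastle upon Tyne", "London"]
matches = possible_duplicates("Newcastel upon Tyne", places)
```

The threshold is the only tuning knob here. A transposed pair of letters still scores well, but two names that merely sound alike and are spelled differently would not, which is the limitation the answer points out.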

Solr - Enriching the TermsComponent answer

I'm using Solr 3.5.0 (with WebSphere Commerce). While performing a search, Commerce uses the suggestion tool to suggest (auto-complete) search terms based on the letters already typed in the search box.
Currently WebSphere Commerce is using Solr's TermsComponent. But one of my new requirements is to be able to enrich the list of suggested terms.
Do you know if there is any way to do that, for example by creating a plain-text dictionary or using another Solr component?
Thanks for reading,
and for your help.
Regards,
Dekx.
I think a plain-text dictionary probably wouldn't be a usable data source (even if you could use it, searching linearly through a plain-text file would probably be too slow). If you create an index from your dictionary, you could probably incorporate it into the TermsComponent as a shard (see the TermsComponent documentation, under the heading "Distributed Search Support").
I don't believe TermsComponent supports searching multiple fields, so you'll want to make sure the dictionary uses the same field name as the terms you want to suggest from (that is, if you are looking at the "name" field in the index, create a "name" field in your indexed dictionary as well, rather than a "dictionaryentry" field).
To my mind, though, I fail to see what the value of this would be. Generally, the TermsComponent is intended to look at the terms available in the index on that field. "Enriching" it with more data would just provide suggestions that a search won't actually be able to find. Of course, I don't really know your search implementation, but in most cases that would be my concern.
