elasticsearch: should ngrams cover the entire query? (compound-word query)

Suppose a user searches for "koreanpop"
when they really mean "korean pop".
I don't think I can build a dictionary to recognize "korean" and "pop" as separate words.
I'm planning to use an nGram query analyzer. (Is this a horrible idea?)
I'd like to try out all the possible splits:
"ko/reanpop"
"kor/eanpop"
"kore/anpop"
"korea/npop"
"korean/pop"
"koreanp/op"
and find the documents that contain both "korean" and "pop"
(using edge-ngrams with min=2).
Is this an OK strategy in practice?
(I know that Koreans often don't use whitespace to separate words the way they should, because Korean search engines can handle queries without it.)
How do I accomplish this with elasticsearch?
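For reference, here is a rough sketch of what such an experiment could start from: an Elasticsearch index whose field uses a custom edge_ngram analyzer, created over plain HTTP from Java. The index name "music", the field "title", and the analyzer names are all made up. As written it applies the n-grams at index time and leaves the query alone; moving "edge_2gram_analyzer" into "search_analyzer" instead would n-gram the query, which is closer to what is described above.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class CreateNgramIndex {
        public static void main(String[] args) throws Exception {
            // Custom analyzer: an edge_ngram tokenizer with min_gram=2, so "koreanpop"
            // is indexed as "ko", "kor", "kore", ..., "koreanpop".
            String body = """
                {
                  "settings": {
                    "analysis": {
                      "tokenizer": {
                        "edge_2gram_tokenizer": {
                          "type": "edge_ngram", "min_gram": 2, "max_gram": 15,
                          "token_chars": ["letter", "digit"]
                        }
                      },
                      "analyzer": {
                        "edge_2gram_analyzer": {
                          "tokenizer": "edge_2gram_tokenizer", "filter": ["lowercase"]
                        }
                      }
                    }
                  },
                  "mappings": {
                    "properties": {
                      "title": {
                        "type": "text",
                        "analyzer": "edge_2gram_analyzer",
                        "search_analyzer": "standard"
                      }
                    }
                  }
                }""";

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:9200/music"))   // hypothetical index name
                    .header("Content-Type", "application/json")
                    .PUT(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        }
    }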

Related

Azure Cognitive Search - When would you use different search and index analyzers?

I'm trying to understand the purpose of configuring different analyzers for searching and indexing in Azure Search. See: https://learn.microsoft.com/en-us/rest/api/searchservice/create-index#-field-definitions-
According to my understanding, the job of the indexing analyzer is to break up the input document into individual tokens. Through this process, it might apply multiple transformations like lower-casing the content, removing punctuation and whitespace, and even removing entire words.
If the tokens are already processed, what is the use of the search analyzer?
Initially, I thought it would apply a similar process to the search query itself, but wouldn't setting a different analyzer at this stage than the one used to index the document completely break the search results? If the indexing analyzer lower-cased everything but the search analyzer doesn't lower-case the query, wouldn't that mean you'd never get matches for queries with upper-case characters? And what if the search analyzer doesn't split tokens on whitespace? Wouldn't you then fail to get a match the moment the query includes a space?
Assuming that this is indeed how the two analyzers work together, why would you ever want to set two different ones?
Your understanding of the difference between the index and search analyzers is correct. An example scenario where that's valuable is using n-grams for indexing but not for search terms. This allows a document containing "cat" to produce the terms "c", "ca", "cat", but you wouldn't necessarily want to apply n-grams to the search term as well: that would make the query less performant and isn't necessary, since the indexed documents already produced the n-grams. Hopefully that makes sense!
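To make the "cat" example concrete, here is a tiny sketch in plain Java (deliberately not the Azure API) of what the two analyzers conceptually do: the index-side analyzer expands a token into edge n-grams, while the search-side analyzer leaves the query term alone, so a prefix typed by the user already matches an indexed term.

    import java.util.ArrayList;
    import java.util.List;

    public class EdgeNgramIdea {
        // Edge n-grams of a token: "cat" -> [c, ca, cat].
        static List<String> edgeNgrams(String token, int min, int max) {
            List<String> grams = new ArrayList<>();
            for (int len = min; len <= Math.min(max, token.length()); len++)
                grams.add(token.substring(0, len));
            return grams;
        }

        public static void main(String[] args) {
            // "Index analyzer": expand the document token into its edge n-grams.
            List<String> indexedTerms = edgeNgrams("cat", 1, 20);

            // "Search analyzer": keep the user's term as-is; the prefix "ca"
            // already matches one of the indexed n-grams, no query-side expansion needed.
            String queryTerm = "ca";
            System.out.println(indexedTerms);                     // [c, ca, cat]
            System.out.println(indexedTerms.contains(queryTerm)); // true
        }
    }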

How to Index and Search multiple terms and phrases with Lucene

I am using Lucene.NET to index the contents of a set of documents. My index contains several fields, but I'm mainly concerned with querying the "contents" field. I'm trying to figure out the best way of indexing, as well as creating the query, to meet the requirements.
Here are the current requirements:
Able to search multiple keywords, such as "planes trains automobiles" (minus the quotes). This should give me all documents that contain ANY of the terms, but the documents that contain all three should be at the top
Able to search for phrases, such as "planes, trains, and automobiles" (with quotes) which would only match if they were together in that order.
As for stop words, I would be ok with either ignoring them altogether, or including them.
As for punctuation or special characters, same deal. I can either ignore them completely, or include them.
The last two just need to be consistent, not necessarily with each other, but with how the indexer and searcher handles them. So I just don't want to have a case where the user searches for "planes and trains" but it doesn't match a document that does contain that phrase, because the indexer took out the "and" but the searcher is trying to search for that particular phrase.
Some of the documents are large, so I think we don't want to do Field.Store.Yes, right? Unless we have to for what we need to do.
The requirements you've listed should be handled just fine by using Lucene's standard analyzer and query parser. Make sure to use the same analyzer in the IndexWriter and the QueryParser. Stop words are eliminated, and punctuation is generally ignored, though the rules are a bit more involved than just dropping every punctuation character (see UAX #29, section 4, if you are interested in the details).
If you try running the Lucene demo, you should find it works just about as you've specified here.
As far as storing the field, you have it right, yes. Store the field if you need to retrieve it from the index. Large fields that you don't need to retrieve do not need to be stored.
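The question is about Lucene.NET, whose API closely mirrors Java Lucene; as a rough sketch under that assumption (field name and sample text are made up), the key point is simply passing the same analyzer to both the IndexWriter and the QueryParser, with Store.NO for the large contents field:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class SameAnalyzerDemo {
        public static void main(String[] args) throws Exception {
            Analyzer analyzer = new StandardAnalyzer();
            Directory dir = new ByteBuffersDirectory();

            // Index with the same analyzer the QueryParser will use below.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                Document doc = new Document();
                // Store.NO: the large "contents" field is indexed (searchable) but not stored.
                doc.add(new TextField("contents", "Planes, trains, automobiles.", Field.Store.NO));
                writer.addDocument(doc);
            }

            QueryParser parser = new QueryParser("contents", analyzer);
            // Default OR semantics: any term matches, documents with more terms rank higher.
            Query anyTerm = parser.parse("planes trains automobiles");
            // Quotes produce a phrase query: the terms must appear together, in order.
            Query phrase = parser.parse("\"planes trains automobiles\"");

            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                System.out.println("any-term: " + searcher.search(anyTerm, 10).totalHits);
                System.out.println("phrase:   " + searcher.search(phrase, 10).totalHits);
            }
        }
    }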

Finding possibly matching strings in a large dataset

I'm in the middle of a project where I have to process text documents and enhance them with Wikipedia links. Preprocessing a document includes locating all the possible target articles, so I extract all ngrams and do a comparison against a database containing all the article names. The current algorithm is a simple caseless string comparison preceded by simple trimming. However, I'd like it to be more flexible and tolerant to errors or little text modifications like prefixes etc. Besides, the database is pretty huge and I have a feeling that string comparison in such a large database is not the best idea...
What I thought of is a hashing function that would assign a unique hash (I'd rather avoid collisions) to any article or n-gram, so that I could compare hashes instead of strings. The difference between two hashes would let me know whether the words are similar, so that I could gather all the possible target articles.
Theoretically, I could use cosine similarity to calculate the similarity between words, but this doesn't seem right to me because comparing the characters multiple times sounds like a performance issue.
Is there any recommended way to do it? Is it a good idea at all? Maybe the string comparison with proper indexing isn't that bad and the hashing won't help me here?
I have looked around at hashing functions and text-similarity algorithms, but I haven't found a solution yet...
Consider using the Apache Lucene API. It provides functionality for searching, stemming, tokenization, indexing, and document-similarity scoring. It's an open-source implementation of basic best practices in Information Retrieval.
The Lucene functionality that seems most useful to you is MoreLikeThis, which picks out the most significant (tf-idf-weighted) terms of a document and uses them to locate similar documents.
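As a rough illustration (the index path, field name, and document number are made up), MoreLikeThis can be pointed at a document already in the index and asked for a query that finds similar ones:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queries.mlt.MoreLikeThis;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.store.FSDirectory;

    import java.nio.file.Paths;

    public class SimilarArticles {
        public static void main(String[] args) throws Exception {
            // Assumes an existing index of article names/text in a "title" field.
            IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("article-index")));
            IndexSearcher searcher = new IndexSearcher(reader);

            MoreLikeThis mlt = new MoreLikeThis(reader);
            mlt.setAnalyzer(new StandardAnalyzer());
            mlt.setFieldNames(new String[] {"title"});
            mlt.setMinTermFreq(1);  // article names are short, keep every term
            mlt.setMinDocFreq(1);

            // Build a "more documents like internal doc #42" query and run it.
            Query likeThis = mlt.like(42);
            System.out.println(searcher.search(likeThis, 10).totalHits);
            reader.close();
        }
    }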

Can Solr index sentences instead of web pages?

I've just set up Solr, indexed some pages (crawled using Nutch) and I can now search.
I now need to change it to index sentences instead of web pages. The result I need is, for example, to do a search for "one word" and get a list of all sentences that contain "one" and/or "word".
I'm new to Solr so any pointers to where I should start from to achieve this would be extremely helpful. Is it at all possible? Or is there an easy way of doing this I've missed?
Yes. Solr indexes 'documents'. You define what a document is by what you post to it via the REST-ful endpoint. If you push one sentence at a time, it indexes one sentence at a time.
If you meant 'can I push a whole document, have Solr split it into sentences, and index each one individually?', then the answer is, I think, not very easily inside Solr. If you are using Nutch, I'd recommend putting the splitting into Nutch so that it presents Solr with one sentence at a time.
Neither the analysis chain nor update request processors provide for splitting a document into smaller documents. You might also consider the Elasticsearch alternative, though I have no concrete knowledge that there's an easy path to your solution there either.
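A rough sketch of the "split before you post" approach, assuming a plain Solr core reachable over HTTP (the core name, field names, and document ids are made up, and the JSON escaping is deliberately crude):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.text.BreakIterator;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;

    public class PostSentences {
        public static void main(String[] args) throws Exception {
            String pageText = "One word here. Another sentence there.";  // text from a crawled page

            // Split the page into sentences before it ever reaches Solr.
            BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
            it.setText(pageText);
            List<String> sentences = new ArrayList<>();
            for (int start = it.first(), end = it.next(); end != BreakIterator.DONE;
                 start = end, end = it.next()) {
                sentences.add(pageText.substring(start, end).trim());
            }

            // Post each sentence as its own Solr document.
            HttpClient client = HttpClient.newHttpClient();
            for (int i = 0; i < sentences.size(); i++) {
                String doc = String.format("[{\"id\":\"page42_%d\",\"sentence_txt\":\"%s\"}]",
                        i, sentences.get(i).replace("\"", "\\\""));
                HttpRequest req = HttpRequest.newBuilder()
                        .uri(URI.create("http://localhost:8983/solr/sentences/update?commit=true"))
                        .header("Content-Type", "application/json")
                        .POST(HttpRequest.BodyPublishers.ofString(doc))
                        .build();
                HttpResponse<String> resp = client.send(req, HttpResponse.BodyHandlers.ofString());
                System.out.println(resp.statusCode());
            }
        }
    }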

Best way to support wildcard search on a large dictionary?

I am working on a project to search a large dictionary (100k to 1m words). The dictionary items look like {key, value, freq}. My task is to develop an incremental search algorithm that supports exact match, prefix match, and wildcard match. The results should be ordered by freq.
For example:
the dictionary looks like
key1=a,value1=v1,freq1=4
key2=ab,value2=v2,freq2=2
key3=abc,value3=v3,freq3=1
key4=abcd,value4=v4,freq4=3
when a user types 'a', return v1,v4,v2,v3
when a user types 'a?c', return v4,v3
Right now my best candidate is a suffix tree represented as a DAWG (directed acyclic word graph), but this structure does not support wildcard matching effectively.
Any suggestion?
You should look at n-grams for indexing your content; a rough sketch of the idea is below. If you want something out of the box, have a look at Apache Solr, which does a lot of the hard work for you. It also supports prefix queries, wildcard queries, etc.
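To show the n-gram idea on the question's own data, here is a self-contained sketch (names and thresholds are arbitrary): index every 2-gram of each key, use the literal 2-grams of a pattern to narrow the candidates, verify the survivors with a regex where '?' matches exactly one character and the pattern matches as a prefix (incremental search), and sort by freq. A real system would delegate this to Solr/Lucene, but the mechanics are the same.

    import java.util.Collection;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import java.util.regex.Pattern;
    import java.util.stream.Collectors;

    public class NgramWildcardSketch {
        // Mirrors the {key, value, freq} items in the question.
        record Entry(String key, String value, int freq) {}

        static final List<Entry> ALL = List.of(
                new Entry("a", "v1", 4), new Entry("ab", "v2", 2),
                new Entry("abc", "v3", 1), new Entry("abcd", "v4", 3));

        // 2-gram -> entries whose key contains that 2-gram.
        static final Map<String, Set<Entry>> INDEX = new HashMap<>();

        public static void main(String[] args) {
            for (Entry e : ALL)
                for (int i = 0; i + 2 <= e.key().length(); i++)
                    INDEX.computeIfAbsent(e.key().substring(i, i + 2), k -> new HashSet<>()).add(e);

            System.out.println(search("a"));    // [v1, v4, v2, v3]
            System.out.println(search("a?c"));  // [v4, v3]
        }

        static List<String> search(String pattern) {
            // Intersect posting lists of the pattern's literal 2-grams to narrow candidates;
            // fall back to scanning everything when the pattern has no literal 2-grams.
            Set<Entry> candidates = null;
            for (String literal : pattern.split("\\?"))
                for (int i = 0; i + 2 <= literal.length(); i++) {
                    Set<Entry> postings = INDEX.getOrDefault(literal.substring(i, i + 2), Set.of());
                    if (candidates == null) candidates = new HashSet<>(postings);
                    else candidates.retainAll(postings);
                }
            Collection<Entry> toVerify = (candidates == null) ? ALL : candidates;

            // Verify: '?' matches exactly one character, and the pattern is matched as a prefix.
            Pattern p = Pattern.compile(Pattern.quote(pattern).replace("?", "\\E.\\Q"));
            return toVerify.stream()
                    .filter(e -> p.matcher(e.key()).lookingAt())
                    .sorted(Comparator.comparingInt(Entry::freq).reversed())
                    .map(Entry::value)
                    .collect(Collectors.toList());
        }
    }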
