I'm writing a Java program that needs to find possible matches for specified strings. Strings will generally be in the form of
onetwothree one.two.three
onesomethingtwoblah onesomething
where one, two, and three are parts of an actual title. Candidate matches from the database are in the form one+two+three. The method I have come up with is to compare each token from the database candidates with the entire specified string using a regex. A counter for the number of database token matches will be used to rank the possible matches.
My concern is the accuracy of the matches presented, and the method's ability to find matches when they do exist. Is this method efficient?
It depends. If you have a lot of database records and large strings to compare against, the search may end up being quite expensive: it would need to scan the entire input string once for each record.
You could consider instead doing a single pass over the input string and looking tokens up against the database as you go. A smart search index could help speed this up.
When pairing multiple tokens you would need a way of knowing when to stop scanning and advance to the next token. Partial matches could help here: store one+two+three also as the separate tokens one, two, and three. Or, if order matters, store it as one, one+two, and one+two+three.
Basically, as you scan, you have a list of candidate DB entries that gets smaller and smaller, comparable to a facet search.
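As a rough sketch of that idea in Java, assuming the database entries fit in memory (the class and method names below are made up for illustration): index each entry once under its tokens, then do one substring test per distinct token instead of a regex pass per record.

import java.util.*;

public class CandidateRanker {
    // Inverted index: token -> the DB entries that contain it.
    private final Map<String, Set<String>> tokenToEntries = new HashMap<>();

    // Index a DB entry like "one+two+three" under each of its tokens.
    public void add(String dbEntry) {
        for (String token : dbEntry.split("\\+")) {
            tokenToEntries.computeIfAbsent(token, t -> new HashSet<>()).add(dbEntry);
        }
    }

    // One contains() test per distinct token; each hit bumps the counter of
    // every entry that owns that token. Sort by count to get the ranking.
    public Map<String, Integer> rank(String input) {
        Map<String, Integer> hits = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : tokenToEntries.entrySet()) {
            if (input.contains(e.getKey())) {
                for (String entry : e.getValue()) {
                    hits.merge(entry, 1, Integer::sum);
                }
            }
        }
        return hits;
    }
}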
I'm trying to understand the purpose of configuring a different analyzer for searching versus indexing in Azure Search. See: https://learn.microsoft.com/en-us/rest/api/searchservice/create-index#-field-definitions-
According to my understanding, the job of the indexing analyzer is to break the input document up into individual tokens. Through this process, it might apply multiple transformations, like lower-casing the content, removing punctuation and whitespace, and even removing entire words.
If the tokens are already processed, what is the use of the search analyzer?
Initially, I thought it would apply a similar process to the search query itself, but wouldn't setting a different analyzer than the one used to index the document completely break the search results? If the indexing analyzer lower-cased everything but the search analyzer doesn't lower-case the query, wouldn't that mean you'd never get matches for queries with upper-case characters? And if the search analyzer doesn't split tokens on whitespace, wouldn't you stop getting matches the moment the query includes a space?
Assuming that this is indeed how the two analyzers work together, why would you ever want to set two different ones?
Your understanding of the difference between the index and search analyzers is correct. An example scenario where that's valuable is using ngrams for indexing but not for search terms. This would allow a document with "cat" to produce "c", "ca", and "cat", but you wouldn't necessarily want to apply ngrams to the search term: that would make the query less performant, and it isn't necessary since the documents already produced the ngrams. Hopefully that makes sense!
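To make that concrete, here is a rough sketch of what it might look like in a create-index request body: the field indexes with a custom edge-ngram analyzer but searches with the standard one. The names ngram_analyzer and edge_ngram_filter, and the gram sizes, are made up for illustration; check the REST docs for the exact shapes.

"fields": [
  {
    "name": "title",
    "type": "Edm.String",
    "searchable": true,
    "indexAnalyzer": "ngram_analyzer",
    "searchAnalyzer": "standard.lucene"
  }
],
"analyzers": [
  {
    "name": "ngram_analyzer",
    "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
    "tokenizer": "standard_v2",
    "tokenFilters": [ "lowercase", "edge_ngram_filter" ]
  }
],
"tokenFilters": [
  {
    "name": "edge_ngram_filter",
    "@odata.type": "#Microsoft.Azure.Search.EdgeNGramTokenFilterV2",
    "minGram": 1,
    "maxGram": 10
  }
]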
I have a collection field with values like:
["city of god"]
["god of war", "city of war"]
I want to perform a search on the field with 'city' AND 'god' and I want only 'city of god' to be returned.
Yet the second document is also returned, even though the terms appear in two different strings within the collection.
Is there any way to make the search strict to within individual strings, rather than the entire collection?
Each searchable field in the index is treated as a bag of terms, so for “city AND god” you’re matching on all terms of that field in the whole document, not only the terms within sub-documents (in this case individual strings in the collection).
One way to get around this is to specify a reasonable estimate of the distance between these terms within a single string of the collection, and use that to issue a proximity search query. For your specific example, assuming the terms would be within 5 words of each other, the following query should work -
&queryType=full&search=fieldName:"city god"~5
Proximity search is also handy because, with a large enough proximity value, the words don't have to appear in the order given in the phrase query; e.g., “city god”~5 would also match “god bless the city”.
Make sure to include queryType=full in your query string: proximity search is part of the full Lucene query syntax and will not work otherwise.
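Putting it together, the full request might look something like this (service, index, and field names are placeholders, and the search value should be URL-encoded in practice):

GET https://[service].search.windows.net/indexes/[index]/docs?api-version=2020-06-30&queryType=full&search=fieldName:"city god"~5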
I am using Lucene.NET to index the contents of a set of documents. My index contains several fields, but I'm mainly concerned with querying the "contents" field. I'm trying to figure out the best way of indexing, as well as creating the query, to meet the requirements.
Here are the current requirements:
Able to search multiple keywords, such as "planes trains automobiles" (minus the quotes). This should give me all documents that contain ANY of the terms, but the documents that contain all three should be at the top
Able to search for phrases, such as "planes, trains, and automobiles" (with quotes) which would only match if they were together in that order.
As for stop words, I would be ok with either ignoring them altogether, or including them.
As for punctuation or special characters, same deal. I can either ignore them completely, or include them.
The last two just need to be consistent, not necessarily with each other, but in how the indexer and searcher handle them. I just don't want a case where the user searches for "planes and trains" but it doesn't match a document that does contain that phrase, because the indexer took out the "and" while the searcher is trying to search for it.
Some of the documents are large, so I think we don't want to do Field.Store.Yes, right? Unless we have to for what we need to do.
The requirements you've listed should be handled just fine by Lucene's StandardAnalyzer and QueryParser. Make sure to use the same analyzer in the IndexWriter and the QueryParser. Stop words are eliminated. Punctuation is generally ignored, though the rules are a bit more involved than just ignoring every punctuation character (see UAX #29, section 4, if you are interested in the details).
If you try running the Lucene demo, you should find it works just about as you've specified here.
As far as storing the field goes, you have it right, yes. Store a field if you need to retrieve its contents from the index. Large fields that you don't need to retrieve do not need to be stored.
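To illustrate, here is a minimal Java Lucene sketch (the Lucene.NET API mirrors it closely). ByteBuffersDirectory assumes a recent Lucene; also note that whether StandardAnalyzer strips English stop words by default has changed across versions, so pass a stop word set explicitly if you rely on that behavior.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class SameAnalyzerDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(); // shared by writer and parser
        Directory dir = new ByteBuffersDirectory();

        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            // Large contents field: indexed for search, but not stored.
            doc.add(new TextField("contents", "Planes, trains and automobiles.", Field.Store.NO));
            writer.addDocument(doc);
        }

        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            QueryParser parser = new QueryParser("contents", analyzer);

            // Unquoted terms default to OR: any term matches, and documents
            // containing more of the terms score higher.
            Query any = parser.parse("planes trains automobiles");
            // Quoted input becomes a phrase query; the comma in the document
            // was stripped at index time, so "planes trains" still matches.
            Query phrase = parser.parse("\"planes trains\"");

            System.out.println(searcher.search(any, 10).totalHits);
            System.out.println(searcher.search(phrase, 10).totalHits);
        }
    }
}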
I have a requirement within my application to fuzzy match a string value inputted by the user, against a datastore.
I am basically attempting to find possible duplicates in the process in which data is added to the system.
I have looked at Metaphone, Double Metaphone, and Soundex, and the conclusion I have come to is that they are all well and good when dealing with a single-word input string; however, I am trying to match against an undefined number of words (they are actually place names).
I did consider actually splitting each of the words out of the string (removing any I define as noise words), then implementing some logic to determine which place names within my data store best match, based on the keys from whichever algorithm I choose. The advantage I see is that I could selectively tighten up, or loosen, the match criteria to suit the application; however, this does seem a little dirty to me.
So my question(s) are:
1: Am I approaching this problem in the right way? Yes, I understand it will be quite expensive; however (without going too deeply into the implementation) this information will be coming from a memcache database.
2: Are there any algorithms out there that already specialise in phonetically matching multiple words? If so, could you please provide some information on them and, if possible, their strengths and limitations?
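For what it's worth, a minimal Java sketch of the splitting-and-keying idea from the question, using Apache Commons Codec's DoubleMetaphone; the noise-word list and scoring scheme are made up for illustration.

import java.util.*;
import org.apache.commons.codec.language.DoubleMetaphone;

public class PlaceNameMatcher {
    private final DoubleMetaphone encoder = new DoubleMetaphone();
    private static final Set<String> NOISE = Set.of("of", "the", "upon");

    // Encode a multi-word place name as phonetic keys, skipping noise words.
    List<String> keys(String name) {
        List<String> keys = new ArrayList<>();
        for (String word : name.toLowerCase().split("\\W+")) {
            if (!word.isEmpty() && !NOISE.contains(word)) {
                keys.add(encoder.doubleMetaphone(word));
            }
        }
        return keys;
    }

    // Fraction of phonetic keys the two names share; the threshold for
    // calling something a duplicate can then be tightened or loosened.
    double score(String input, String candidate) {
        Set<String> a = new HashSet<>(keys(input));
        Set<String> b = new HashSet<>(keys(candidate));
        if (a.isEmpty() || b.isEmpty()) return 0.0;
        int max = Math.max(a.size(), b.size());
        a.retainAll(b);
        return (double) a.size() / max;
    }
}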
You may want to look into a locality-sensitive hash such as the Nilsimsa hash. I have used Nilsimsa to "hash" Craigslist posts across various cities to search for duplicates (note: I'm not a CL employee, this was just a personal project I was working on).
Most of these methods aren't as tunable as you may want (basically you get some loosely defined "edit distance" metric), and they're not phonetic, just character-based.
I have a large number of strings (potentially 1,000,000+), and I want to search another string (a document) to see which of these search strings appears in the document.
Not all of the search strings are a single word, so it's not just a case of searching for each word in the document in the list of search strings.
What's the most efficient way of doing this?
I will be doing this for a large number of documents (coming from a feed), and I need to do it fast enough that I can process the documents quicker than they're coming in (a second or two per document at most, ideally).
I can potentially come up with a list of stop words that won't appear in the search strings (e.g. 'the', 'and').
Ideally the solution will be in Java, but that's not a requirement as I can always port the code into Java. If it makes any difference, the search strings are currently stored in a MongoDB.
Take a look at Radix trees and Suffix trees.
The concurrent-trees project includes an example of how to scan unseen documents efficiently for large numbers of keywords stored in its inverted radix tree.
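A rough sketch of that approach (class and method names are from the concurrent-trees API as I recall it, so verify against the project's documentation):

import com.googlecode.concurrenttrees.radix.node.concrete.DefaultCharArrayNodeFactory;
import com.googlecode.concurrenttrees.radixinverted.ConcurrentInvertedRadixTree;
import com.googlecode.concurrenttrees.radixinverted.InvertedRadixTree;

public class KeywordScanner {
    public static void main(String[] args) {
        // Load the search strings (e.g. from MongoDB) into the tree once, up front.
        InvertedRadixTree<Integer> tree =
                new ConcurrentInvertedRadixTree<>(new DefaultCharArrayNodeFactory());
        tree.put("big apple", 1);
        tree.put("new york city", 2);

        // Each incoming document is then scanned in a single pass,
        // reporting every stored keyword that occurs in it.
        String document = "flights from the big apple to chicago";
        for (CharSequence keyword : tree.getKeysContainedIn(document)) {
            System.out.println("matched: " + keyword);
        }
    }
}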
Also check out High-performance pattern matching algorithms in Java.