Fuzzy String Matching - string

I have a requirement within my application to fuzzy match a string value inputted by the user, against a datastore.
I am basically attempting to find possible duplicates in the process in which data is added to the system.
I have looked at Metaphone, Double Metaphone, and SoundEx, and the conclusion I have came to is they are all well and good when dealing with a single word input string; however I am trying to match against a undefined number of words (they are actually place names).
I did consider actually splitting each of the words from the string (removing any I define as noise words), then implementing some logic which would determine which place names within my data store, best matched (based on the keys from the algorithm I choose); the advantage I see in this, would be I could selectively tighten up, or loosen the match criteria to suit the application: however this does seem a little dirty to me.
So my question(s) are:
1: Am I approaching this problem in the right way, yes I understand it will be quite expensive; however (without going to deeply into the implementation) this information will be coming from a memcache database.
2: Are there any algorithms out there, that already specialise in phonetically matching multiple words? If so, could you please provide me with some information on them, and if possible their strengths and limitations.

You may want to look into a Locality-sensitive Hash such as the Nilsimsa Hash. I have used Nilsimsa to "hash" craigslists posts across various cities to search for duplicates (NOTE: I'm not a CL employee, just a personal project I was working on).
Most of these methods aren't as tunable as you may want (basically you can get some loosely-defined "edit distance" metric) and they're not phonetic, solely character based.

Related

LUCENE - Fuzzy Search on a word containing Space

The case i am facing seems very simple, but truly i can't imagine a clear solution:
Imagine i want to indexed a text containing "Summertime, and the living is easy" on a Lucene Index.
I want that the search on my ui of "summer time" finds the document indexed containing my text with Summertime, while maintaining all the benefits of a StandardAnalyser standard data.
I imagine that using a fuzzyQuery will suffice (since the distance is 1). since the tokenizer i use split based on the spaces, the solution isn't revlevant
I don't know wich analyzer to use to allow this possibility? while keeping all the benefits of a StandardAnalyzer'like (Stopwords, possibility to add synonyms,...).
Maybe it's simpler than i think (at least it seems so), but i really can't see any solution for now .... :(
You can use a ShingleFilter to make Solr combine multiple tokens into one, with a user define separator.
That way you'll get "summer time" as a single token, as well as "summer" and "time" (unless you disable outputUnigrams). When you do this you'll get tokens with a small edit distance, and the fuzzy search should work as you want it to.

How to Index and Search multiple terms and phrases with Lucene

I am using Lucene.NET to index the contents of a set of documents. My index contains several fields, but I'm mainly concerned with querying the "contents" field. I'm trying to figure out the best way of indexing, as well as creating the query, to meet the requirements.
Here are the current requirements:
Able to search multiple keywords, such as "planes trains automobiles" (minus the quotes). This should give me all documents that contain ANY of the terms, but the documents that contain all three should be at the top
Able to search for phrases, such as "planes, trains, and automobiles" (with quotes) which would only match if they were together in that order.
As for stop words, I would be ok with either ignoring them altogether, or including them.
As for punctuation or special characters, same deal. I can either ignore them completely, or include them.
The last two just need to be consistent, not necessarily with each other, but with how the indexer and searcher handles them. So I just don't want to have a case where the user searches for "planes and trains" but it doesn't match a document that does contain that phrase, because the indexer took out the "and" but the searcher is trying to search for that particular phrase.
Some of the documents are large, so I think we don't want to do Field.Store.Yes, right? Unless we have to for what we need to do.
The requirements you've listed should be handled just fine by using lucene's standard analyzer and queryparser. Make sure to use the same analyzer in the IndexWriter and the QueryParser. Stop words are eliminated. Punctuation is generally ignored, though the rules are a bit more involved that just ignoring every punctuation character (see UAX #29, section 4, if you are interested in the details)
If you try running the Lucene demo, you should find it works just about as you've specified here.
As far as storing the field, you have it right, yes. Store the field if you need to retrieve it from the index. Large fields that you don't need to retrieve do not need to be stored.

Human-readable short one-way hash for obscuring strings in logfiles?

I need to do large-scale anonymisation of database log-files.
Part of this will involve obscuring various field names (strings), as well as IP addresses.
1. Field Names
For example, we might have the string BusinessLogic.Categorisation.ExternalDeals. In the anonymised version, we would want it to be something like Jerrycan.Doorway.Fodmap (or something gibberish, but still "pronounceable")
The purpose is simply to obscure the original strings - however, we still want to be able to matchup occurrence of those strings across different logfiles.
The requirements of the hash are:
Repeatable - that is, the same inputs passed in each time would always produce the same outputs. We need to be able to match-up fields between different logfiles (all we're trying to prevent is somebody deriving the original string).
One-way - there is no way of reversing the outputs to product the inputs.
Low chance of collision - it will mess up our analysis if two fields are mapped to the same output.
Human readable (or pronounceable) - somebody scanning through logfiles by hand should be able to make out fields, and visually match them up. Or if need be, read them over the phone.
Short strings - I do understand there's a tradeoff between this and available entropy, however, ideally a string like HumanReadable should map to something like LizzyNasbeth.
I had a look around, and I found https://github.com/zacharyvoase/humanhash (output hash is a bit longer than what I want) and https://www.rfc-editor.org/rfc/rfc1751 (not really "pronouceable" - ideally, we'd want something that looks like a English-language human word, but isn't actually - and, once again, a bit long).
What algorithms or approaches are there to this problem? Or any libraries or implementations you could recommend?
2. IP Addresses
For the IP addresses, we need a way to mask them (i.e. not possible for an outside observer to derive the original IP address), but still have it be repeatable across different logfiles (i.e. the same input always produces the same output).
Ideally, the output would still "look" like an IP address. For example, maybe 192.168.1.55 would map to 33.41.22.44 (or we can use alphabetical codes as well, if that's easier).
Any thoughts on how to do this?
You could use codenamize :
$ codenamize BusinessLogic -j "" -c
AbsorbedUpper
You can use this from command line or as a Python library.
(Disclaimer, I wrote it).
I was discussing with a colleague, and he suggested one approach.
Take the field name - and pass it through a standard one-way hash (e.g. MD5).
Use the resulting digest as a index to map to a dictionary of English words (e.g. using mod).
That solves the issue of it always being repeatable - the same word hashed each time will always map to the same English word (assuming your dictionary list does not change).
If individuals companies were worried about dictionary attacks (i.e. the field name "firstname" would always map to say "Paris"), then we could also use a company-specific keyfile to salt the hash. This means that it would be repeatable for anonymised logfiles from them (i.e. "firstname" might always map to "Toulouse" for them), but it would not be the same as for other companies who use other keyfiles.
I'm still very keen to see what other people can suggest, or whether they might have any thoughts on the above.

Search strategy

I'm writing a java program that needs to find possible matches for specified strings. Strings will generally be in the form of
onetwothree one.two.three
onesomethingtwoblah onesomething
where one two and three are parts of an actual title. Candidate matches from the database are in the form one+two+three. The method i have come up with is to compare each token from database candidates with the entire specified string using regex. A counter for the number of database token matches will be used to determine the rank of possible matches.
My concern is the accuracy of matches presented and the method's ability to successfully find matches if they do exist. Is this method efficient?
Depends, if you have a lot of database records and large strings to compare against the search may end up being quite expensive. It would need to pass the entire input string for each record.
You could consider doing a single pass over the input string and search tokens against the database. Some smart search indexed could help speed this up.
When pairing multiple tokens you would need to figure out a way knowing when to stop scanning and advance to a next token. Partial matches could help here; store one+two+three also as seperate one, two and three. Or if the order matters store it also as one, one+two and one+two+three.
Basically as you scan you have a list of candidate DB entries that gets smaller and smaller, comparable to a facet search.

Quick Filter List

Everyone is familiar with this functionality. If you open up the the outlook address book and start typing a name, the list below the searchbox instantly filters to only contain items that match your query. .NET Reflector has a similar feature when you're browsing types ... you start typing, and regardless of how large the underlying assembly is that you're browsing, it's near instantaneous.
I've always kind of wondered what the secret sauce was here. How is it so fast? I imagine there are also different algorithms if the data is present in-memory, or if they need to be fetched from some external source (ie. DB, searching some file, etc.).
I'm not sure if this would be relevant, but if there are resources out there, I'm particularly interested how one might do this with WinForms ... but if you know of general resources, I'm interested in those as well :-)
What is the most common use of the trie data structure?
A Trie is basically a tree-structure for storing a large list of similar strings, which provides fast lookup of strings (like a hashtable) and allows you to iterate over them in alphabetical order.
Image from: http://en.wikipedia.org/wiki/Trie:
In this case, the Trie stores the strings:
i
in
inn
to
tea
ten
For any prefix that you enter (for example, 't', or 'te'), you can easily lookup all of the words that start with that prefix. More importantly, lookups are dependent on the length of the string, not on how many strings are stored in the Trie. Read the wikipedia article I referenced to learn more.
The process is called full text indexing/search.
If you want to play with the algorithms and data structures for this I would recommend you read Programming Collective Intelligence for a good introduction to the field, if you just want the functionality I would recommend lucene.

Resources