Data-structure to use for sentence comparison - string

I'll try to be as succinct as possible as this will require some explanation. I am in a situation where I have to match strings, and extract values from those strings based on template strings that we define.
For example, the template string would be:
I want to go to the $websiteurl homepage
And the other string could be
I want to go to the google.com homepage
We've had initial success checking that these strings match by measuring Levenshtein distances and by using "creative" regular expressions, but we're trying to make this more fault tolerant and accurate, and the amount of complicated code is growing more than we would like.
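To give a feel for the kind of check we do today, here is a rough, simplified sketch (the names, threshold and similarity measure are made up for this question, not our real code):

    import re
    from difflib import SequenceMatcher   # stand-in here for our Levenshtein measure

    TEMPLATE = re.compile(r"I want to go to the (?P<websiteurl>\S+) homepage")

    def extract(sentence, threshold=0.8):
        m = TEMPLATE.search(sentence)
        if not m:
            return None
        # Check the whole sentence is still close enough to the template text.
        filled = TEMPLATE.pattern.replace(r"(?P<websiteurl>\S+)", m.group("websiteurl"))
        if SequenceMatcher(None, filled, sentence).ratio() < threshold:
            return None
        return m.group("websiteurl")

    print(extract("I want to go to the google.com homepage"))                # google.com
    print(extract("I want to immediately go to the google.com home page"))   # None: needs another special case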
In some scenarios we need to handle compound words, or we need to ignore extra adjectives/descriptive words, etc.
Adjective/Descriptive-word Example:
I want to immediately go to the google.com homepage
Compound Word Example (homepage is broken into two words):
I want to go to the google.com home page
And you can probably imagine even more complicated real-world sentences that should match this one but won't, without extra cases or additional checks for the sentence.
Obviously our current setup isn't ideal: we need a separate pass over the string for every individual case we check, which not only slows things down but also makes maintenance and debugging more complicated than they should be.
Is there a data structure that would be ideal for holding and comparing sentences in this manner? Something that would be fairly tolerant of extra or even missing words (within reason, obviously)? I imagine some kind of tree, but I don't know what type of tree, or data structure in general, would be best for this situation.
Thanks to any/all in advance!

Related

Internal Search optimization for relevance

My team is using Solr and I have a question regarding it.
Some search terms don't give relevant results, or miss results that should have been displayed. For example:
Searching for Macy's without the apostrophe, i.e. "Macys", doesn't give back any results for Macy's.
Searching for JPMorgan vs. JP Morgan gives different results.
Searching for IBM doesn't show results that contain its full name, i.e. International Business Machines.
How can we improve and optimize these cases in a way that applies generally, even to cases we haven't caught beyond the three above?
Any suggestions?
All these issues are related to how you process the incoming text for those fields. You'll have to create a filter chain for the field that processes the input values to do what you want - and possibly use multiple fields for different use cases and prioritize those using qf.
Your first case can be solved by using a PatternReplaceFilter to remove any apostrophes - depending on your use case and tokenizer you might want to use the CharFilter version, as it processes the text before it's split into multiple tokens.
Your second case is a straightforward synonym filter or a WordDelimiterFilter: either expand JPMorgan to "JP Morgan" as a synonym, or use the WordDelimiterFilter to split on case changes into separate tokens. That will also allow you to search for JP and get JPMorgan-related entries. These might have different effects on score; use debugQuery=true to see exactly how each term in your query contributes to the score.
The third case is in general the same as the second. You'll have to create a decent synonym list for the terms used, and this is usually something you build up from user feedback, existing dictionaries, and domain knowledge. There's also the option of preprocessing text using NLP, or, in this case, something as primitive as indexing the initials of consecutive capitalized words could help.
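The heavy lifting happens in Solr's analysis chain, configured per field in your schema, but conceptually the filters amount to something like this rough Python illustration (not Solr code; the synonym table is made up):

    import re

    SYNONYMS = {
        "jpmorgan": ["jp", "morgan"],                         # compound / case-change expansion
        "ibm": ["international", "business", "machines"],     # initialism -> full name
    }

    def analyze(text):
        # Rough stand-in for a char filter + tokenizer + token filters pipeline.
        text = text.replace("'", "")                  # char filter: drop apostrophes (Macy's -> Macys)
        tokens = re.findall(r"\w+", text.lower())     # tokenizer + lowercase filter
        expanded = []
        for tok in tokens:
            expanded.append(tok)
            expanded.extend(SYNONYMS.get(tok, []))    # synonym filter: index both forms
        return expanded

    print(analyze("Macy's"))      # ['macys']
    print(analyze("JPMorgan"))    # ['jpmorgan', 'jp', 'morgan']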

Match snippets in an HTML document

Somebody has presented me with a very large list of copyedits to make to a long HTML document. The edits are in the format:
"religious" should be "religions"
"their" should be "there"
"you must persistent" should be "you must be persistent"
The copyedits were typed by hand; in some cases, the "actual" value on the left is not an exact match for the content in the document. The order of edits is usually correct, but even that is not guaranteed.
It's a straightforward but very large task to apply these edits by hand to the document. I'd like to automate the process as much as possible, e.g. by automatically searching for the snippets.
In a long document like this, I can't just search for all instances of "their" and replace them with "there." Sometimes "their" was used correctly, just not in one particular instance.
In other words, I'm looking for a fuzzy text match, where the order of the edits influences the search.
What's a good approach to a problem like this? I'm hoping that there's some off-the-shelf open source project that can search for the snippets in a fuzzy order.
I am not aware of any tool. But I would use edit distance for both:
for non-exact string match: probably standard Levenshtein + swap (i.e. Damerau-Levenshtein distance)
for non-exact sequence match: this time probably only with Match and Swap operations. You can use free (zero-cost) Insert to get the words that should not be edited.
It should not be hard to implement, but the computational complexity will be quite high. I would use some heuristics to skip hopeless matches. Preprocessing the words in the document and the edit list could help (e.g., store the set of characters in each word to allow a quick comparison before calculating the full edit distance).
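A rough sketch of both pieces, under my own assumptions (word-level Damerau-Levenshtein, plus a scan that treats extra document words as free inserts; the window size and thresholds are arbitrary):

    def damerau_levenshtein(a, b):
        # Restricted edit distance with adjacent transposition (swap).
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i
        for j in range(len(b) + 1):
            d[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # delete
                              d[i][j - 1] + 1,          # insert
                              d[i - 1][j - 1] + cost)   # substitute
                if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)   # swap
        return d[len(a)][len(b)]

    def find_snippet(snippet, doc_words, max_word_dist=1):
        # Slide over the document; extra document words inside the window are "free" inserts.
        target = snippet.lower().split()
        for start in range(len(doc_words) - len(target) + 1):
            if damerau_levenshtein(target[0], doc_words[start].lower()) > max_word_dist:
                continue                    # anchor the window on a fuzzy match of the first word
            i, j = 0, start
            while i < len(target) and j < len(doc_words) and j - start <= len(target) + 3:
                if damerau_levenshtein(target[i], doc_words[j].lower()) <= max_word_dist:
                    i += 1                  # matched this snippet word
                j += 1                      # skipping an extra document word costs nothing
            if i == len(target):
                return start                # all snippet words found, in order, near here
        return None

    doc = "so you must be persistent about there religions beliefs".split()
    print(find_snippet("you must persistent", doc))   # 1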

Most efficient algorithm for comparing a string with a group of strings

I have a scenario where a user can post a number of responses or phrases via a form field. I would like to be able to take the response and determine what they are asking for. For instance, if the user types in car, train, bike, jet, ... I can assume they are talking about a vehicle and respond accordingly. I understand that I could use a switch statement or perhaps a regexp, but the larger the number of possible responses, the less efficient that computation becomes. I'm wondering if there is an efficient algorithm for comparing a string with a group of strings. Any info would be great.
You may want to look into the Aho-Corasick algorithm. If you have a collection of strings that you want to search for, you can spend linear time doing preprocessing on those strings and from that point forward can, in O(n) time, check for all possible matches of those strings in a text corpus of length n. In other words, with a small preprocessing time to set up the algorithm once, you can extremely efficiently scan over numerous inputs again and again searching for those keywords.
Interestingly enough, the algorithm was specifically invented to build a fast index (that is, to look for a lot of different keywords in a huge body of text), and allegedly outperformed other methods by a factor of ten. I think it would work great in your application.
Hope this helps!
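As a rough sketch of what that looks like in practice, assuming the third-party pyahocorasick package (the keyword-to-category mapping below is made up):

    import ahocorasick   # third-party package: pyahocorasick

    keywords = {"car": "vehicle", "train": "vehicle", "bike": "vehicle", "jet": "vehicle"}

    automaton = ahocorasick.Automaton()
    for word, category in keywords.items():
        automaton.add_word(word, (word, category))
    automaton.make_automaton()                 # one-time preprocessing of the keyword set

    text = "i think i will take the train and maybe rent a bike"
    for end_index, (word, category) in automaton.iter(text):
        print(word, "->", category)            # train -> vehicle, bike -> vehicle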
If you have a large number of "magic" words, I would suggest splitting the query into words, and using a hash-based lookup to check whether the words are recognized.
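For instance, something along these lines (the word-to-category table is invented for illustration):

    CATEGORIES = {
        "car": "vehicle", "train": "vehicle", "bike": "vehicle", "jet": "vehicle",
        "apple": "fruit", "pear": "fruit",
    }

    def classify(query):
        # Split the query into words and look each one up in a hash map.
        found = {CATEGORIES[w] for w in query.lower().split() if w in CATEGORIES}
        return found or None

    print(classify("maybe a jet or a train"))   # {'vehicle'}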
You can check out the trie structure. I think it's one of the best solutions for your problem.

matching common strings between two data sets

I am working on a website conversion. I have a dump of the database backend as an sql file. I also have a scrape of the website from wget.
What I'm wanting to do is map database tables and columns to directories, pages, and sections of pages in the scrape. I'd like to automate this.
Is there some tool or script out there that could pull strings from one source and look for them in the other? Ideally, it would return a set of results that would say something like
string "piece of website content here" on line 453 in table.sql matches string in website.com/subdirectory/certain_page.asp on line 56.
I don't want to do line comparisons because lines from the database dump (INSERT INTO table VALUES (...)) aren't going to match lines in the page where the content actually appears (<div id='left_column'><div id='left_content'>...</div></div>).
I realize this is a computationally intensive task, but even letting it run over the weekend is fine.
I've found similar questions, but I don't have enough CS background to know if they are identical to my problem or not. SO kindly suggested this question, but it appears to be dealing with a known set of needles to match against the haystack. In my case, I need to compare haystack to haystack, and see matching straws of hay.
Is there a command-line script or command out there, or is this something I need to build? If I build it, should I use the Aho–Corasick algorithm, as suggested in the other question?
So your two questions are 1) Is there already a solution that will do what you want, and 2) Should you use the Aho-Corasick algorithm.
The first answer is that I doubt you'll find a ready-built tool that will meet your needs.
The second answer is that, since you don't care about performance and have a limited CS background, you should use whatever algorithm you find simplest to implement.
I will go one step further and propose an architecture.
First, you need to be able to parse the .sql files in a meaningful way, with something that goes line by line and returns tablename, column_name, and value. A StreamReader is probably best for this.
Second, you need a parser for your webpages that will go element by element and return each text node, the name of each parent element all the way up to the html element, and its parent filename. An XmlTextReader or similar streaming XML parser, such as SAXON, is probably best, as long as it will operate on non-valid XML.
You would need to tie these two parsers together with a mutual search algorithm of some sort. You will have to customize it to suit your needs. Aho-Corasick will apparently get you the best performance if you can pull it off. A naive algorithm is easy to implement, though, and here's how:
Assuming you have your two parsers that loop through each field (on the one hand) and each text node (on the other hand), pick one of the two parsers and have it go through each of the strings in its data source, calling the other parser to search the other data source for all possible matches, and logging the ones it finds.
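A minimal sketch of that naive nested loop in Python; the value extraction and tag stripping here are deliberately crude placeholders and would need adapting to the real dump and scrape:

    import re
    from pathlib import Path

    def sql_values(sql_file):
        # Crudely pull quoted string literals out of INSERT statements.
        for lineno, line in enumerate(Path(sql_file).read_text(errors="ignore").splitlines(), 1):
            if "INSERT INTO" in line.upper():
                for value in re.findall(r"'((?:[^']|'')*)'", line):
                    if len(value) > 20:        # skip short, ambiguous values (ids, flags, ...)
                        yield lineno, value

    def page_text(html_file):
        # Crudely strip tags and return whatever visible text is left, line by line.
        for lineno, line in enumerate(Path(html_file).read_text(errors="ignore").splitlines(), 1):
            text = re.sub(r"<[^>]+>", " ", line).strip()
            if text:
                yield lineno, text

    def cross_match(sql_file, html_files):
        for sql_line, value in sql_values(sql_file):
            for page in html_files:
                for page_line, text in page_text(page):
                    if value in text:
                        print(f'"{value}" on line {sql_line} in {sql_file} '
                              f'matches string in {page} on line {page_line}')

    # cross_match("table.sql", list(Path("website.com").rglob("*.asp")))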
This cannot work, at least not reliably. In the best case you would match every piece of data to its counterpart in your HTML files, but you would also get many false positives, for example user names that happen to be ordinary words.
Furthermore, text is often manipulated before it is displayed: sites often capitalize titles, truncate text for previews, and so on.
AFAIK there is no such tool, and in my opinion there cannot exist one that solves your problem adequately.
Your best option is to get the source code the site uses/used and analyze it. If that fails or is not possible, you will have to analyze the database manually: get as much content as possible from the URLs and try to fit the puzzle together.

A string searching algorithm to quickly match an abbreviation in a large list of unabbreviated strings?

I am having a lot of trouble finding a string matching algorithm that fits my requirements.
I have a very large database of strings in unabbreviated form that need to be matched against an arbitrary abbreviation. A query that is an actual substring, with no letters skipped between its characters, should also match, and with a higher score.
Example: if the word to be matched within was "download" and I searched "down", "ownl", and then "dl", I would get the highest matching score for "down", followed by "ownl" and then "dl".
The algorithm would have to be optimized for speed and for a large number of strings to search through, and should allow me to pull back a list of matching strings (if I had added both "download" and "upload" to the database, searching "load" should return both). Memory is still important, but not as important as speed.
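To make the scoring requirement concrete, this toy ordering is roughly what I have in mind (just the ranking behaviour I want, not a proposed implementation):

    def is_subsequence(abbr, word):
        # True if abbr's letters appear in word, in order (e.g. "dl" in "download").
        letters = iter(word)
        return all(ch in letters for ch in abbr)

    def score(abbr, word):
        # Toy ranking: prefix > contiguous substring > scattered abbreviation > no match.
        if word.startswith(abbr):
            return 2.0 + len(abbr) / len(word)
        if abbr in word:
            return 1.0 + len(abbr) / len(word)
        if is_subsequence(abbr, word):
            return len(abbr) / len(word)
        return 0.0

    for query in ("down", "ownl", "dl"):
        print(query, score(query, "download"))   # down highest, then ownl, then dl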
Any ideas? I've done a bunch of research on some of these algorithms but I haven't found any that even touch abbreviations, let alone with all these conditions!
I wonder if Peter Norvig's spell checker could be adapted in some way to this problem.
It's a stretch that I haven't begun to work out, but it's such an elegant solution that it's worth knowing about.
