How to improve comparison quality when using utl_match.jaro_winkler?

I use utl_match.jaro_winkler to compare company names. In most cases it works fine, but sometimes I get pretty weird results.
This, for example, returns 0.62:
utl_match.jaro_winkler('ГОРОДСКАЯ КЛИНИЧЕСКАЯ БОЛЬНИЦА 18','ДИНА');
Those are completely different names, both in length and in characters! How can the similarity be 62%?
Another example:
SELECT utl_match.jaro_winkler('ООО МЕГИ', 'МЕГИ')
This returns 0, even though those are very similar strings!
It feels like I need something more sophisticated than just upper() and utl_match.jaro_winkler(), but I have no idea what exactly.
What would you recommend? What are the best practices for comparing two strings, and where can I read about them?
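Both results, odd as they look, follow from the algorithm itself. Jaro-Winkler counts characters the two strings share within a matching window of floor(max(len1, len2)/2) - 1 positions, so a short name like ДИНА can pick up stray Д, И, Н and А characters from a long one and still score 0.62. Conversely, for 'ООО МЕГИ' vs 'МЕГИ' the window is 3, and every character of МЕГИ sits exactly 4 positions later in the longer string, so nothing matches at all and the score is 0. The usual first step is therefore to normalize names before scoring: upper-case them, strip punctuation, and drop legal-form tokens such as ООО. A minimal Python sketch of that idea (the legal-form list is an assumption, and difflib's ratio merely stands in for Jaro-Winkler; the actual scoring can stay in Oracle):

import difflib
import re

# Legal-form tokens that carry no identity information. This list is an
# assumption; extend it for your data (ООО = LLC, ОАО/ЗАО/ПАО = JSC, etc.).
LEGAL_FORMS = {"ООО", "ОАО", "ЗАО", "ПАО", "АО", "ИП"}

def normalize(name: str) -> str:
    """Upper-case, drop punctuation and legal-form tokens, collapse spaces."""
    tokens = re.findall(r"\w+", name.upper())
    return " ".join(t for t in tokens if t not in LEGAL_FORMS)

def similarity(a: str, b: str) -> float:
    # difflib's Ratcliff/Obershelp ratio stands in for Jaro-Winkler here;
    # the point is the normalization step, not the particular metric.
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()

print(similarity("ООО МЕГИ", "МЕГИ"))  # 1.0 after normalization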

Related

Hypothesis search tree

I have an object with many fields, and each field has a different range of values. I want to use Hypothesis to generate different instances of this object.
Is there a limit to the number of combinations of field values Hypothesis can handle? And what does the search tree Hypothesis creates look like? I don't need all the combinations, but I want to be sure I get a fair number of them, testing many different values for each field. In particular, I want to be sure Hypothesis is not doing a DFS until it hits the maximum number of examples to generate.
TL;DR: don't worry, this is a common use case, and even a naive strategy works very well.
The actual search process used by Hypothesis is complicated (as in, "lead author's PhD topic"), but it's definitely not a depth-first search! Briefly, it's a uniform distribution layered on a pseudo-random number generator, with a coverage-guided fuzzer biasing that towards less-explored code paths, and strategy-specific heuristics on top of that.
In general, I trust this process to pick good examples far more than I trust my own judgement, or that of anyone without years of experience in QA or testing research!
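For reference, the naive strategy for a many-field object is just st.builds with one strategy per field. A minimal sketch (the class, field names, and ranges below are made up):

from dataclasses import dataclass

from hypothesis import given, strategies as st

@dataclass
class Config:
    # Hypothetical object whose fields have different ranges of values.
    retries: int
    timeout: float
    mode: str

config_strategy = st.builds(
    Config,
    retries=st.integers(min_value=0, max_value=10),
    timeout=st.floats(min_value=0.1, max_value=60.0),
    mode=st.sampled_from(["fast", "safe", "debug"]),
)

@given(config_strategy)
def test_any_combination_is_valid(cfg):
    # Hypothesis draws diverse combinations of all three fields rather
    # than enumerating them depth-first.
    assert 0 <= cfg.retries <= 10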

Data-structure to use for sentence comparison

I'll try to be as succinct as possible as this will require some explanation. I am in a situation where I have to match strings, and extract values from those strings based on template strings that we define.
For example, the template string would be:
I want to go to the $websiteurl homepage
And the other string could be
I want to go to the google.com homepage
We've had initial success checking that these strings match by measuring Levenshtein distances and by using "creative" regular expressions, but we're trying to make this more fault-tolerant and accurate, and the amount of complicated code is growing more than we would like.
In some scenarios we need to handle compound words, or to ignore extra adjectives/descriptive words, etc.
Adjective/Descriptive-word Example:
I want to immediately go to the google.com homepage
Compound Word Example (homepage is broken into two words):
I want to go to the google.com home page
And you can probably imagine even more complicated real-world sentences that should match this one, but that won't without extra cases or additional checks.
Our current setup clearly isn't ideal: we need to make multiple passes over each string for every individual case we check, which not only slows things down but also makes maintenance and debugging more complicated than they should be.
Is there a data structure that would be well suited to holding and comparing sentences in this manner? Something that would be fairly fault-tolerant to extra or even missing words (within reason, obviously)? I imagine some kind of tree, but I don't know what type of tree, or what data structure in general, would be best for this situation.
Thanks to any/all in advance!
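One common starting point (a sketch under stated assumptions, not a recommendation of a specific library) is to work at the token level rather than the character level: run an edit distance over words, where a placeholder token such as $websiteurl matches any single token. Extra words like "immediately" then cost one insertion instead of breaking the match, and a split compound word costs one insertion rather than failing outright:

def token_distance(template: str, sentence: str) -> int:
    """Levenshtein distance over word tokens; $placeholders match any token."""
    t, s = template.split(), sentence.split()
    # dp[i][j] = cost of matching the first i template tokens
    # against the first j sentence tokens
    dp = [[i + j if i == 0 or j == 0 else 0 for j in range(len(s) + 1)]
          for i in range(len(t) + 1)]
    for i in range(1, len(t) + 1):
        for j in range(1, len(s) + 1):
            match = t[i - 1].startswith("$") or t[i - 1].lower() == s[j - 1].lower()
            dp[i][j] = min(
                dp[i - 1][j - 1] + (0 if match else 1),  # substitution
                dp[i - 1][j] + 1,                        # missing template word
                dp[i][j - 1] + 1,                        # extra word in sentence
            )
    return dp[len(t)][len(s)]

tpl = "I want to go to the $websiteurl homepage"
print(token_distance(tpl, "I want to go to the google.com homepage"))              # 0
print(token_distance(tpl, "I want to immediately go to the google.com homepage"))  # 1
print(token_distance(tpl, "I want to go to the google.com home page"))             # 2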

Use of INDEX MATCH to find absolute closest value

I've long sought a method for using INDEX MATCH in Excel to return the absolute closest number in an array without reorganizing my data (since MATCH requires lookup_array to be in descending order to find the closest value greater than lookup_value, but ascending order to find the closest value less than lookup_value).
I found the answer in this post. XOR LX's solution:
=INDEX(B4:B10,MATCH(TRUE,INDEX(ABS(A4:A10-B1)=MIN(INDEX(ABS(A4:A10-B1),,)),,),0))
worked perfectly for me, but I don't know why. I can rationalize most of it, but I can't figure out this part:
INDEX(ABS(A4:A10-B1)=MIN(INDEX(ABS(A4:A10-B1),,)),,)
Can anyone please explain this part?
I guess it makes sense for me to explain it, then!
Actually, it didn't help that I was employing a technique designed to circumvent having to enter a formula as an array formula, i.e. with CSE (Ctrl+Shift+Enter). Although that could be considered a plus by some accounts, I think I was wrong to employ it here, and probably wouldn't do so again.
The technique involves inserting extra INDEX functions at appropriate places within the formula. This forces the other functions, which without array-entry would normally act upon only the first element of any array passed to them, to instead operate over all elements within that array.
However, whilst inserting a single INDEX function for the purpose of avoiding CSE is, in my opinion, perfectly fine, I think that once you get to the point of using two or three (or even more) such coercions, you should probably rethink whether it's all worth it (the few tests I've done suggest that, in many cases, performance is actually worse in the non-array, INDEX-heavy version than in the equivalent CSE set-up). Besides, the use of array formulas is something to be encouraged, not avoided.
Sorry for the ramble, but it's actually rather to the point: had I given you the array version, you might well not have come back looking for an explanation, since that version would look like:
=INDEX(B4:B10,MATCH(TRUE,ABS(A4:A10-B1)=MIN(ABS(A4:A10-B1)),0))
which is syntactically far easier to understand than the other version.
Let me know if that helps and/or whether you'd still like me to go through a fuller breakdown of either solution, which I'd be happy to do. As a start, here is the part you asked about, step by step:
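ABS(A4:A10-B1) builds the array of absolute differences between each value in A4:A10 and the target in B1; comparing that array to MIN(...) of the same array yields TRUE exactly where the difference is smallest; MATCH(TRUE,...,0) finds the position of the first TRUE; and the outer INDEX returns the value from B4:B10 at that position. A rough Python equivalent, with made-up sample data standing in for the ranges:

# Hypothetical data standing in for A4:A10 (lookup column),
# B4:B10 (result column) and B1 (the target value).
col_a = [3.0, 7.5, 9.1, 12.0, 15.2, 18.0, 21.4]
col_b = ["a", "b", "c", "d", "e", "f", "g"]
target = 10.0

diffs = [abs(a - target) for a in col_a]   # ABS(A4:A10-B1)
flags = [d == min(diffs) for d in diffs]   # ...=MIN(...)
row = flags.index(True)                    # MATCH(TRUE,...,0), zero-based
print(col_b[row])                          # INDEX(B4:B10,...) -> "c"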
You may also find the following links of interest (I hope that I'm not breaking any of this site's rules by posting these):
https://excelxor.com/2014/09/01/index-an-alternative-to-array-cse-formulas
https://excelxor.com/2014/08/18/index-returning-entire-rowscolumns
Regards

Matching common strings between two data sets

I am working on a website conversion. I have a dump of the database backend as an sql file. I also have a scrape of the website from wget.
What I want to do is map database tables and columns to directories, pages, and sections of pages in the scrape. I'd like to automate this.
Is there a tool or script out there that could pull strings from one source and look for them in the other? Ideally, it would return a set of results saying something like
string "piece of website content here" on line 453 in table.sql matches string in website.com/subdirectory/certain_page.asp on line 56.
I don't want to do line comparisons because lines from the database dump (INSERT INTO table VALUES (...) ) aren't going to match lines in the page where it actually populates (<div id='left_column'><div id='left_content'>...</div></div>).
I realize this is a computationally intensive task, but even letting it run over the weekend is fine.
I've found similar questions, but I don't have enough CS background to know if they are identical to my problem or not. SO kindly suggested this question, but it appears to be dealing with a known set of needles to match against the haystack. In my case, I need to compare haystack to haystack, and see matching straws of hay.
Is there a command-line script or command out there, or is this something I need to build? If I build it, should I use the Aho–Corasick algorithm, as suggested in the other question?
So your two questions are 1) Is there already a solution that will do what you want, and 2) Should you use the Aho-Corasick algorithm.
The first answer is that I doubt you'll find a ready-built tool that will meet your needs.
The second answer is that, since you don't care about performance and have a limited CS background, you should use whatever algorithm you find simplest to implement.
I will go one step further and propose an architecture.
First, you need to be able to parse the .sql files in a meaningful way: something that goes line by line and returns the table name, column name, and value. A StreamReader is probably best for this.
Second, you need a parser for your web pages that will go element by element and return each text node, the name of each parent element all the way up to the html element, and the parent filename. An XmlTextReader or a similar streaming XML parser, such as Saxon, is probably best, as long as it can cope with HTML that is not well-formed XML.
You would need to tie these two parsers together with a mutual search algorithm of some sort. You will have to customize it to suit your needs. Aho-Corasick will apparently get you the best performance if you can pull it off. A naive algorithm is easy to implement, though, and here's how:
Assuming you have your two parsers that loop through each field (on the one hand) and each text node (on the other hand), pick one of the two parsers and have it go through each of the strings in its data source, calling the other parser to search the other data source for all possible matches, and logging the ones it finds.
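A minimal sketch of that naive approach, using only the Python standard library (the regex for INSERT values is deliberately crude and assumes single-quoted literals; the file names echo the question):

import re
from html.parser import HTMLParser
from pathlib import Path

class TextExtractor(HTMLParser):
    """Collects (line_number, text) for every non-empty text node."""
    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        if data.strip():
            self.texts.append((self.getpos()[0], data.strip()))

# Crude extraction of quoted string literals from INSERT statements.
sql_values = []
for lineno, line in enumerate(Path("table.sql").read_text().splitlines(), 1):
    if line.lstrip().upper().startswith("INSERT"):
        for value in re.findall(r"'((?:[^']|'')+)'", line):
            if len(value) > 20:  # skip short, collision-prone values
                sql_values.append((lineno, value.replace("''", "'")))

extractor = TextExtractor()
extractor.feed(Path("certain_page.asp").read_text())

for sql_line, value in sql_values:
    for html_line, text in extractor.texts:
        if value in text or text in value:
            print(f"line {sql_line} in table.sql matches line {html_line} in page: {value[:60]}")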
This cannot work, at least not reliably. In the best case you would fit every piece of data to its counterpart in your HTML files, but you would still get many false positives: user names that happen to be ordinary words, for example.
Furthermore, text is often manipulated before it is displayed: sites capitalize titles, truncate text for previews, and so on.
AFAIK there is no such tool, and in my opinion there cannot exist one that solves your problem adequately.
Your best option is to get the source code the site uses (or used) and analyze it. If that fails or is not possible, you will have to analyze the database manually: get as much content as possible from the URLs and try to fit the pieces of the puzzle together.

Data clean up: are there libraries of common permutations that we can use? Or is there a better approach?

We are working on clean-up and analysis of a lot of human-entered customer data. We need to decide programmatically whether two addresses (for example) are the same, even though the data was entered with slight variations.
Right now we run each address through fairly simplistic string replacement (replacing avenue with ave, for example), concatenate the fields and compare the results. We are doing something similar with names.
At the very least, it seems like our list of search-replace values should already exist somewhere.
Or perhaps you can suggest a totally different and superior way to detect matches?
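For US addresses, the list of search-replace values you're describing does exist: USPS Publication 28 defines the standard street-suffix and directional abbreviations. The current approach then amounts to something like this sketch (the table below is a tiny stand-in for the real list):

# Tiny stand-in for the USPS Publication 28 abbreviation table.
REPLACEMENTS = {"AVENUE": "AVE", "STREET": "ST", "BOULEVARD": "BLVD",
                "ROAD": "RD", "NORTH": "N", "SOUTH": "S"}

def normalize_address(*fields: str) -> str:
    """Upper-case, strip punctuation, abbreviate, and concatenate the fields."""
    joined = " ".join(fields).upper().replace(".", "").replace(",", " ")
    return " ".join(REPLACEMENTS.get(t, t) for t in joined.split())

print(normalize_address("123 N. Main Street", "Springfield") ==
      normalize_address("123 North Main St", "Springfield"))  # True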
For the addresses, you should run them through Google's Maps API and get a geocode for each one. Then, if the geocodes are the same, the place is the same. I believe they allow 10k hits/day/IP for free.
It's unlikely that you'd come up with anything better on your own.
http://code.google.com/apis/maps/
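A sketch of that idea against the current Google Geocoding API (this endpoint postdates the link above; the requests package is assumed, and comparing place_id values is one reasonable "same place" test):

import requests

def geocode_place_id(address: str, api_key: str):
    """Return Google's place_id for an address, or None if nothing matched."""
    resp = requests.get(
        "https://maps.googleapis.com/maps/api/geocode/json",
        params={"address": address, "key": api_key},
        timeout=10,
    )
    results = resp.json().get("results", [])
    return results[0]["place_id"] if results else None

# Two human-entered variants are treated as the same address if they
# resolve to the same place_id.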
Soundex and its variants might be a good start, as are the other approaches suggested by that Wikipedia page.
Essentially you're trying to find how similar two strings are, and there are a lot of different ways to measure that. The Dice coefficient could work fairly well for what you're doing, although it is a somewhat costly operation.
http://en.wikipedia.org/wiki/Dice_coefficient
If you want a more comprehensive list of string similarity measures, try here:
http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
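As a concrete illustration, here is a minimal sketch of the bigram formulation of the Dice coefficient described on that Wikipedia page:

def dice(a: str, b: str) -> float:
    """Sørensen-Dice coefficient over sets of character bigrams."""
    def bigrams(s):
        return {s[i:i + 2] for i in range(len(s) - 1)}
    x, y = bigrams(a.upper()), bigrams(b.upper())
    if not x and not y:
        return 1.0  # both strings too short to have bigrams
    return 2 * len(x & y) / (len(x) + len(y))

print(dice("123 Main Street", "123 Main St"))  # ~0.83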
At work I help write software that verifies addresses (for SmartyStreets).
Address validation is a really tricky operation; in fact, the USPS has designated certain companies that are certified to provide this service. Even if I were in your shoes, I would not recommend attempting this on your own. As mentioned, Google does some address parsing, but it only approximates the address. Google, Yahoo, and similar services will not verify the accuracy of the address data.
So you'll need a CASS-Certified approach to this problem. I would suggest something like the LiveAddress API (for point-of-entry validation) or Certified Scrubbing (for existing lists or databases of addresses). Both are CASS-Certified by the USPS and will do what you require.
