Efficient way for multiple string pattern matching?

Suppose now we have:
String[] = {
"Name:John, State:MA, City:Boston, Degree:Master",
"Name:Alex, State:CA, City:San Diego, Degree:PhD",
"Name:Aaron, State:NY, City:NYC, Degree:Master",
"Name:Lily, State:MA, City:Worcester, Degree:Master",
}
I'd like to find ALL the strings that contain both "State:MA" and "Degree:Master"; here that would be lines 1 and 4.
It looks like a SQL database query, but I need to implement it in Java or Python.
Also, the input data is expected to be very large, so I'm considering more efficient approaches, such as storing the information in a trie.
But a trie is usually meant for prefix questions; say, given a list of strings, we'd like to find all strings that start with the pattern he,
so the final list could be:
he, hell, help, hello....
In my question, however, the two patterns are not contiguous, although a trie can indeed save a lot of space for big input.
So, any ideas on how to solve this kind of multiple pattern matching using a trie?
Or are there other data structures I don't know about?
Thanks

For inspiration you may look at these classes. It's best to start with the samples. The approach is a kind of hybrid of a trie and an FSA.
You'll have to implement the logic for preparing patterns on your own. You'd also have to take care of the order of results when multiple patterns match your string.
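
As a point of comparison, the query in the question can also be sketched with an inverted index rather than the trie/FSA hybrid mentioned above. This is only a minimal sketch under assumptions that the question does not confirm: every record is a comma-separated list of "Key:Value" tokens, and every pattern is one whole token. The names build_index and find_all are made up for illustration.

```python
from collections import defaultdict

def build_index(records):
    """Map each 'Key:Value' token to the set of record positions containing it."""
    index = defaultdict(set)
    for pos, record in enumerate(records):
        for token in record.split(","):
            index[token.strip()].add(pos)
    return index

def find_all(index, patterns):
    """Return the sorted positions of records that contain every pattern."""
    if not patterns:
        return []
    # Intersect the smallest sets first to keep intermediate results small.
    sets = sorted((index.get(p, set()) for p in patterns), key=len)
    result = set(sets[0])
    for s in sets[1:]:
        result &= s
    return sorted(result)

records = [
    "Name:John, State:MA, City:Boston, Degree:Master",
    "Name:Alex, State:CA, City:San Diego, Degree:PhD",
    "Name:Aaron, State:NY, City:NYC, Degree:Master",
    "Name:Lily, State:MA, City:Worcester, Degree:Master",
]
index = build_index(records)
matches = find_all(index, ["State:MA", "Degree:Master"])  # zero-based positions
```

Positions are zero-based, so the question's "line 1 and 4" correspond to positions 0 and 3. The index is built once, and each query then costs a few set intersections instead of a scan over all records.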

Related

Meta-information in DAWG/DAFSA

I would like to implement a string look-up data structure, for dynamic strings, that will support efficient search and insertion. Currently, I am using a trie but I would like to reduce the memory footprint if possible. This Wikipedia article describes a DAWG/DAFSA, which will obviously save a lot of space over a trie by compressing suffixes. However, while it will clearly test whether a string is legal, it is not obvious to me if there is any way to exclude illegal strings. For example, using the words "cite" and "cat" where the "t" and "e" are terminal states, a DAWG/DAFSA would look like this:
  c
 / \
a   i
 \ /
  t
  |
  e
and "cit" and "cate" will be incorrectly recognized as legal strings without some meta-information.
Questions:
1) Is there a preferred way to store meta-information about strings/paths (such as legality) in a DAWG/DAFSA?
2) If a DAWG/DAFSA is incompatible with the requirements (efficient search/insertion and storing meta-information) what's the best data structure to use? A minimal memory footprint would be nice, but perhaps not absolutely necessary.

In a DAWG, you only compress states together if they're completely indistinguishable from one another. This means that you actually wouldn't combine the T nodes for CAT and CITE together for precisely the reason you've noted - that gives you either a false positive on CIT or a false negative on CAT.
DAWGs are typically most effective for static dictionaries when you have a huge number of words with common suffixes. A DAWG for all of English, for example, could save a lot of space by combining all the suffix "s"'s at the end of plural words and most of the "ING" suffixes from gerunds. If you're going to be doing a lot of insertions or deletions, DAWGs are almost certainly the wrong data structure for the job because adding or removing a single word from a DAWG can cause ripple effects that require lots of branches that were previously combined to be split or vice-versa.
Quite honestly, for reasonably-sized data sets, a trie isn't a bad call. A trie for all of English would only use up something like 26MB, which isn't very much. I would only go with the DAWG if space usage really is at a premium and you aren't doing many insertions or deletions.
Hope this helps!
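
To make the terminal-state point concrete, here is a minimal sketch of a plain trie in which every node carries its own terminal flag. The class and method names are illustrative, not taken from any library. Because "cat" and "cite" do not share nodes past the "c", the flag on each node is unambiguous.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # maps a character to the child TrieNode
        self.terminal = False   # True if a stored word ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        # Reaching a node is not enough; a word must end here.
        return node.terminal

trie = Trie()
for w in ("cat", "cite"):
    trie.insert(w)
```

With this structure, contains("cit") and contains("cate") are both False, while "cat" and "cite" are accepted.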

Efficient algorithm for grouping array of strings by prefixes

I wonder what is the best way to group an array of strings according to a list of prefixes (of arbitrary length).
For example, if we have this:
prefixes = ['GENERAL', 'COMMON', 'HY-PHE-NATED', 'UNDERSCORED_']
Then
tasks = ['COMMONA', 'COMMONB', 'GENERALA', 'HY-PHE-NATEDA', 'UNDERSCORED_A', 'HY-PHE-NATEDB']
Should be grouped this way:
[['GENERALA'], ['COMMONA', 'COMMONB'], ['HY-PHE-NATEDA', 'HY-PHE-NATEDB'], ['UNDERSCORED_A'] ]
The naïve approach is to loop through all the tasks with an inner loop through the prefixes (or vice versa) and test each task against each prefix.
Can someone give me a hint on how to do this more efficiently?

It depends a bit on the size of your problem, of course, but your naive approach should be okay if you sort both your prefixes and your tasks and then build your sub-arrays by traversing both sorted lists only forwards.
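
A minimal sketch of that sort-and-merge idea follows. It assumes that no prefix is itself a prefix of another prefix (otherwise the shorter one would claim all the tasks); the function name group_by_prefix is made up for illustration.

```python
def group_by_prefix(prefixes, tasks):
    """Group tasks under their prefix with one forward pass over sorted lists."""
    groups = {p: [] for p in prefixes}
    sorted_tasks = sorted(tasks)
    i = 0
    for p in sorted(prefixes):
        # Tasks that sort before p cannot start with p, so skip them.
        while i < len(sorted_tasks) and sorted_tasks[i] < p:
            i += 1
        # Tasks starting with p form one contiguous run in sorted order.
        while i < len(sorted_tasks) and sorted_tasks[i].startswith(p):
            groups[p].append(sorted_tasks[i])
            i += 1
    # Report the groups in the caller's original prefix order.
    return [groups[p] for p in prefixes]

prefixes = ["GENERAL", "COMMON", "HY-PHE-NATED", "UNDERSCORED_"]
tasks = ["COMMONA", "COMMONB", "GENERALA", "HY-PHE-NATEDA",
         "UNDERSCORED_A", "HY-PHE-NATEDB"]
grouped = group_by_prefix(prefixes, tasks)
```

After the two sorts, the index i only moves forward, so the grouping pass makes a linear number of string comparisons.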

There are a few options, but you might be interested in looking into the trie data structure.
http://en.wikipedia.org/wiki/Trie
The trie data structure is easy to understand and implement and works well for this type of problem. If you find that this works for your situation, you can also look at Patricia tries, which achieve similar performance characteristics but typically have better memory utilization. They are a little more involved to implement but not overly complex.
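
For illustration, a trie of the prefixes can be sketched with nested dictionaries. Each task is then walked down the trie once, and the longest stored prefix it passes through decides its group. The sentinel key END and the function names are hypothetical choices for this sketch, not part of any library.

```python
END = "$prefix"  # hypothetical sentinel key; cannot collide with a single character

def build_prefix_trie(prefixes):
    root = {}
    for p in prefixes:
        node = root
        for ch in p:
            node = node.setdefault(ch, {})
        node[END] = p  # mark that a complete prefix ends at this node
    return root

def longest_prefix(trie, word):
    """Return the longest stored prefix of word, or None if there is none."""
    node, best = trie, None
    for ch in word:
        node = node.get(ch)
        if node is None:
            break
        best = node.get(END, best)
    return best

def group_tasks(prefixes, tasks):
    trie = build_prefix_trie(prefixes)
    groups = {p: [] for p in prefixes}
    for t in tasks:
        p = longest_prefix(trie, t)
        if p is not None:
            groups[p].append(t)
    return [groups[p] for p in prefixes]

grouped = group_tasks(
    ["GENERAL", "COMMON", "HY-PHE-NATED", "UNDERSCORED_"],
    ["COMMONA", "COMMONB", "GENERALA", "HY-PHE-NATEDA",
     "UNDERSCORED_A", "HY-PHE-NATEDB"],
)
```

Each task costs at most one step per character, regardless of how many prefixes there are.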

Search with attribute values correspondence in Lucene

Here's a text with ambiguous words:
"A man saw an elephant."
Each word has attributes: lemma, part of speech, and various grammatical attributes depending on its part of speech.
For "saw" it is like:
{lemma: see, pos: verb, tense: past}, {lemma: saw, pos: noun, number: singular}
All these attributes come from third-party tools; Lucene itself is not involved in the word disambiguation.
I want to perform a query like "pos=verb & number=singular" and NOT to get "saw" in the result.
I thought of encoding distinct grammatical annotations into strings like "l:see;pos:verb;t:past|l:saw;pos:noun;n:sg" and searching for regexp "pos\:verb[^\|]+n\:sg", but I definitely can't afford regexp queries due to performance issues.
Maybe some hacks with posting list payloads can be applied?
UPD: A draft of my solution
Here are the specifics of my project: there is a fixed maximum of parses a word can have (say, 8).
So, I thought of inserting the parse number in each attribute's payload and using this payload at the posting list intersection stage.
E.g., we have a posting list for 'pos = Verb' like ...|...|1.1234|...|..., and a posting list for 'number = Singular': ...|...|2.1234|...|...
While processing a query like 'pos = Verb AND number = singular' at all stages of posting list processing the 'x.1234' entries would be accepted until the intersection stage where they would be rejected because of non-corresponding parse numbers.
I think this is a pretty compact solution, but how hard would it be to incorporate it into Lucene?

So... the cheater way of doing this is (indeed) to control how you build the Lucene index.
When constructing the lucene index, modify each word before Lucene indexes it so that it includes all the necessary attributes of the word. If you index things this way, you must do a lookup in the same way.
One way:
This means for each type of query you do, you must also build an index in the same way.
Example:
saw becomes noun-saw -- index it as that.
saw also becomes noun-past-see -- index it as that.
saw also becomes noun-past-singular-see -- index it as that.
The other way:
If you want attribute based lookup in a single index, you'd probably have to do something like permutation completion on the word 'saw' so that instead of noun-saw, you'd have all possible permutations of the attributes necessary in a big logic statement.
Not sure if this is a good answer, but that's all I could think of.

mapping strings

I want to map some strings (words) to numbers: the more similar the strings, the nearer their mapped values. The positions of the letters should also affect the mapping. The mapping function should be a function of the letters, their positions (the position of a letter gives it a priority, so that pit and tip map to different values), and the number of letters.
Here are some examples: starter, stater, stapler, startler, tstarter. These words have the format "(*optional)sta(*opt)*er", where * denotes some sort of variable; in our case it is either 't' or 'l' (i.e. in the case of starter and staler). Each of these should be mapped INDIVIDUALLY, without reference to the others, such that their values do not differ much. Later on, when creating groups, I can assign appropriate ranges of numbers to differentiate the groups.
So similar strings should be mapped to similar values. There are many words, so comparing each with every other would be too complex. Instead, I want to map each word to a numeric value independently, put similar strings (which have similar values) in a group, and then later find these patterns by other means.
So, for now I need to look for existing mapping methods such that similar strings (I guess I have to clarify the term 'similar' for my context) have similar values, and these values differ from those of dissimilar strings. Again, I emphasize that the number of strings would be huge, and comparing each with every other is practically impossible (or computationally expensive and very slow). So what I think I need is to devise an algorithm (with help from existing ones) for mapping a word (string) on its own.
Have I made myself clear? Please give me some ideas to start with, or some terms to search for and research.
I think I need some type of "bad" hash function to hash the strings and then put them in buckets according to that hash value. At least some ideas or algorithm names would help.

Seems like it would be best to use a known algorithm like Levenshtein distance.
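
For reference, a minimal sketch of Levenshtein distance using the standard two-row dynamic programming formulation. Note that it compares one pair of strings at a time, so on its own it does not avoid the pairwise comparisons the question is worried about.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the rows as short as the shorter string
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]
```

For the question's examples, levenshtein("starter", "stater") is 1, while levenshtein("pit", "tip") is 2, so letter positions do affect the result.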

This search on StackOverflow reveals this question about finding-groups-of-similar-strings-in-a-large-set-of-strings, which links to this article describing SimHash, which sounds exactly like what you want.
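
A rough sketch of the SimHash idea, to show how each word can be fingerprinted on its own. The choices here are assumptions made for illustration and are not taken from the linked article: the features are character bigrams, each feature is hashed with BLAKE2b from hashlib, and the fingerprint is 64 bits wide (bits must be a multiple of 8).

```python
import hashlib

def simhash(text, bits=64, ngram=2):
    """Fingerprint text so that inputs sharing many n-grams tend to
    agree on many fingerprint bits."""
    counts = [0] * bits
    features = [text[i:i + ngram] for i in range(max(1, len(text) - ngram + 1))]
    for feature in features:
        digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        # Every feature votes +1 or -1 on each bit position.
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    fingerprint = 0
    for b in range(bits):
        if counts[b] > 0:
            fingerprint |= 1 << b
    return fingerprint

def hamming(x, y):
    """Number of bit positions in which two fingerprints differ."""
    return bin(x ^ y).count("1")
```

Each word is hashed independently, and similarity is then estimated from the Hamming distance between fingerprints. Grouping could bucket words by fingerprint bits instead of comparing every pair, though how well this separates very short words would need to be checked on real data.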

Identifying frequent formulas in a codebase

My company maintains a domain-specific language that syntactically resembles the Excel formula language. We're considering adding new builtins to the language. One way to do this is to identify verbose commands that are repeatedly used in our codebase. For example, if we see people always write the same 100-character command to trim whitespace from the beginning and end of a string, that suggests we should add a trim function.
Seeing a list of frequent substrings in the codebase would be a good start (though sometimes the frequently used commands differ by a few characters because of different variable names used).
I know there are well-established algorithms for doing this, but first I want to see if I can avoid reinventing the wheel. For example, I know this concept is the basis of many compression algorithms, so is there a compression module that lets me retrieve the dictionary of frequent substrings? Any other ideas would be appreciated.

The string matching is just the low hanging fruit, the obvious cases. The harder cases are where you're doing similar things but in different order. For example suppose you have:
X+Y
Y+X
Your string matching approach won't realize that those are effectively the same. If you want to go a bit deeper, I think you need to parse the formulas into ASTs and actually compare the ASTs. If you did that, you could see that the trees are actually the same, since the binary operator '+' is commutative.
You could also apply reduction rules so you could evaluate complex functions into simpler ones, for example:
(X * A) + ( X * B)
X * ( A + B )
Those are also the same! String matching won't help you there.
1. Parse into AST
2. Reduce and Optimize the functions
3. Compare the resulting AST to other ASTs
4. If you find a match then replace them with a call to a shared function.
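
The parse-and-compare steps can be sketched as follows. Python's own ast module stands in for the DSL's parser here, which is purely an assumption for illustration; a real implementation would use the formula language's grammar. The sketch canonicalizes only commutative operators and does not apply reduction rules such as factoring.

```python
import ast

COMMUTATIVE = (ast.Add, ast.Mult)

def canonical(node):
    """Return a string form in which operands of + and * are sorted."""
    if isinstance(node, ast.Expression):
        return canonical(node.body)
    if isinstance(node, ast.BinOp):
        left, right = canonical(node.left), canonical(node.right)
        if isinstance(node.op, COMMUTATIVE):
            # Operand order does not matter, so pick one fixed order.
            left, right = sorted((left, right))
        return f"({type(node.op).__name__} {left} {right})"
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    raise ValueError(f"unsupported node: {type(node).__name__}")

def same_formula(a, b):
    """True if the two expressions are equal up to commutativity."""
    return canonical(ast.parse(a, mode="eval")) == canonical(ast.parse(b, mode="eval"))
```

Here same_formula("X+Y", "Y+X") is True, while "(X * A) + (X * B)" and "X * (A + B)" still compare as different, because recognizing them as equal requires the reduction step. Counting the canonical strings across a codebase would then surface frequently repeated formulas.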

I would think you could use an existing full-text indexer like Lucene, and implement your own Analyzer and Tokenizer that is specific to your formula language.
You then would be able to run queries, and be able to see the most used formulas, which ones appear next to each other, etc.
Here's a quick article to get you started:
Lucene Analyzer, Tokenizer and TokenFilter

You might want to look into tag-cloud generators. I couldn't find any source in the minute that I spent looking, but here's an online one:
http://tagcloud.oclc.org/tagcloud/TagCloudDemo which probably won't work since it uses spaces as delimiters.
