NLP: Stemming on opcodes data set - nlp

I have a dataset of 27 files, each containing opcodes. I want to use stemming to map all variants of an opcode to the same base opcode. For example, push, pusha, pushb, etc. would all be mapped to push; addf and addi to add; multi and multf to mult; and so on. How can I do this? I tried PorterStemmer with the NLTK extensions, but it does not work on my dataset. I think it only works on ordinary natural-language words (like played, playing --> play) and not on opcodes (like pusha, pushb --> push).

I don't think stemming is what you want to do here. Stemmers are language specific and are based on the common inflectional morphological patterns in that language. For example, in English, you have the infinitival forms of verbs (e.g., "to walk") which become inflected for tense, aspect, and person/number: I walk vs. she walks (walk+s), I walk vs. I walked (walk+ed), also walk+ing, etc. Stemmers codify these stochastic distributions into "rules" that are then applied to a "word" to reduce it to its stem. In other words, an off-the-shelf stemmer does not exist for your opcodes.
You have two possible solutions: (1) create a dictionary or (2) write your own stemmer. If you don't have too many variants to map, it is probably quickest to just create a custom dictionary where you use all your word variants as keys and the lemma/stem/canonical-form is the value.
addi -> add
addf -> add
multi -> mult
multf -> mult
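As a minimal sketch of the dictionary approach in Python (the language NLTK is used from), assuming the variants can be listed by hand. The names OPCODE_STEMS and stem_opcode are made up for illustration, and unknown opcodes are simply returned unchanged:

```python
# Hand-written variant -> stem table (only the examples from the question).
OPCODE_STEMS = {
    "addi": "add",
    "addf": "add",
    "multi": "mult",
    "multf": "mult",
}


def stem_opcode(opcode, table=OPCODE_STEMS):
    """Return the canonical opcode, or the opcode itself if it is not in the table."""
    return table.get(opcode, opcode)


print([stem_opcode(op) for op in ["addi", "multf", "push"]])
```

Leaving unknown opcodes untouched is a design choice; raising an error instead would make gaps in the table visible.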
If your potential mappings are too numerous to do by hand, then you could write a custom regex stemmer to do the mapping and conversion. Here is how you might do it in R. The following function takes an input word and tries to match it to a pattern representing all the variants of a stem, for all the n stems in your collection. It returns a 1 x n data.frame with 1 indicating presence or 0 indicating absence of variant match.
#' Return word's stem data.frame with each column indicating presence (1) or
#' absence (0) of stem in that word.
map_to_stem_df <- function(word) {
  ## named list of patterns to match
  stem_regex <- c(add = "^add[if]$",
                  mult = "^mult[if]$")
  ## iterate across the stem names
  res <- lapply(names(stem_regex), function(stem) {
    pat <- stem_regex[stem]
    ## if pattern matches word, then 1 else 0
    if (grepl(pattern = pat, x = word)) {
      pat_match <- 1
    } else {
      pat_match <- 0
    }
    ## create 1x1 data.frame for stem
    df <- data.frame(pat_match)
    names(df) <- stem
    return(df)
  })
  ## bind all cols into single row data.frame 1 x length(stem_regex) & return
  data.frame(res)
}

map_to_stem_df("addi")
# add mult
# 1 0
map_to_stem_df("additional")
# add mult
# 0 0
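The function above reports which pattern matched. If the goal is to get the stem itself, the same idea can be sketched in Python. STEM_PATTERNS and regex_stem are hypothetical names, and the patterns only cover the add/mult examples from the question:

```python
import re

# One anchored pattern per stem; extend this table for the real opcode set.
STEM_PATTERNS = {
    "add": re.compile(r"add[if]"),
    "mult": re.compile(r"mult[if]"),
}


def regex_stem(word, patterns=STEM_PATTERNS):
    """Return the first stem whose pattern matches the whole word, else the word itself."""
    for stem, pattern in patterns.items():
        if pattern.fullmatch(word):
            return stem
    return word


print(regex_stem("addi"), regex_stem("additional"))
```

Using fullmatch plays the role of the ^...$ anchors in the R patterns, so "additional" is not mistaken for a variant of add.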

Related

Is L = {ww^Ru | w, u ∈ {0,1}+} regular language?

Let L = {ww^Ru | w, u ∈ {0,1}+}. Is L a regular language? Note that w and u cannot be empty.
I've tried to prove that it is not regular using the pumping lemma, but I failed with w = 0^p1^p, 01^p, and (01)^p. Once I take y = 0^p or 1^p, xyyz will be 00.../11.../01^n0... etc.
And I cannot draw a DFA/NFA or write a regular expression for it to prove that it is regular.
So is L regular or not? How can I prove it?
The language is not regular, and we can prove it using the Myhill-Nerode theorem.
Consider the sequence of strings 01, 0101, ..., (01)^n, ...
First, notice that none of these strings are in the language. Any prefix of any of these strings which has even length is of the form (01)^m for some m, and therefore just a shorter string in the sequence; splitting such a prefix in two either has both substrings start with 0 and end with 1, or else it has the first substring start and end with 0 and the second start and end with 1. In either case, these strings are not of the form w(w^R)u for any w or u.
Next, notice that the shortest possible string which we can append to any of these strings, to produce a string in the language, is always the reverse of itself followed by either 0 or 1. That is, to turn 01 into a string in the language, we must append 100 or 101; there are no shorter strings we can append to 01 to get a string in the language. The same holds true for 0101: 10100 and 10101 are the shortest possible strings that take 0101 to a string in L. And so on for each string of the form (01)^n.
This means that the strings of the form (01)^n are pairwise distinguishable with respect to the target language w(w^R)u: for n < m, appending (10)^n 0 takes (01)^n into the language, but it is shorter than the shortest suffix that takes (01)^m into the language. The Myhill-Nerode theorem tells us that a minimal DFA for a regular language has exactly as many states as there are equivalence classes under the indistinguishability relation. Because we have infinitely many distinguishable strings with respect to our language, a minimal DFA for this language must have infinitely many states. But a DFA cannot have infinitely many states; this is a contradiction. This means that our language cannot be regular.
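The claims in this argument can be spot-checked by brute force. The sketch below assumes nothing beyond the definition of L; in_L and check are hypothetical helpers, and the loop only covers small n and m, so this is a sanity check and not a proof:

```python
from itertools import product


def in_L(s):
    """Return True if s = w + reverse(w) + u for some nonempty w and u."""
    # w has length k, so w + reverse(w) uses 2k characters and u needs at least one more.
    for k in range(1, (len(s) - 1) // 2 + 1):
        w = s[:k]
        if s[k:2 * k] == w[::-1]:
            return True
    return False


def check(limit=5):
    for n in range(1, limit):
        prefix = "01" * n
        suffix = "10" * n + "0"
        assert not in_L(prefix)           # (01)^n itself is not in L
        assert in_L(prefix + suffix)      # reverse plus one symbol works
        # No suffix shorter than 2n + 1 symbols takes (01)^n into L.
        for length in range(2 * n + 1):
            for z in product("01", repeat=length):
                assert not in_L(prefix + "".join(z))
        # The same suffix does not take any longer (01)^m into L.
        for m in range(n + 1, limit + 1):
            assert not in_L("01" * m + suffix)
    return True


print(check())
```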
The language is REGULAR:
L = 00(0+1)+ + 11(0+1)+ + 0(11)+0(0+1)+ + 1(00)+1(0+1)+

Haskell: can I use laziness to "abort early" and gain performance?

I'm writing a Haskell program that reads a wordlist of the English language and a rectangular grid of letters such as:
I T O L
I H W S
N H I S
K T S I
and then finds a Hamiltonian path through the grid from the top-left corner that spells out a sequence of English words, such as:
--> $ runghc unpacking.hs < 4x4grid.txt
I THINK THIS IS SLOW
(If there are multiple solutions, it can just print any one it finds and stop looking.)
The naïve, strict approach is to generate a full path and then try to split it up into words. However, assuming that I'm doing this (and currently I am forcing myself to -- see below) I'm spending a lot of time finding paths like:
IINHHTOL...
IINHHTOW...
IINHHWOL...
These are obviously never going to turn out to be words, looking at the first few letters ("IINH" can't be split into words, and no English word contains "NHH".) So, say, in the above grid, I don't want to look at the many[1] paths that begin with IINHH.
Now, my functions look like this:
paths :: Coord -> Coord -> [[Coord]]
paths (w, h) (1, 1) = [[(1, 1), (1, 2), ... (x, y)], ...]
lexes :: Set String -> String -> [[String]]
lexes englishWordset "ITHINKTHISWILLWORK" = [["I", "THINK", "THIS", ...], ...]
paths just finds all the paths worth considering on a (w, h) grid. lexes finds all the ways to chop a phrase up, and is defined as:
lexes language [] = [[]]
lexes language phrase = let
        splits = tail $ zip (inits phrase) (tails phrase)
    in concat [map (w:) (lexes language p') | (w, p') <- splits,
                                              w `S.member` language]
Given "SAMPLESTRING", it looks at "S", then "SA", then "SAM"... as soon as it finds a valid word, it recurses and tries to "lex" the rest of the string. (First it will recurse on "PLESTRING" and try to make phrases with "SAM", but find no way to chop "plestring" up into words, and fail; then it will find ["SAMPLE", "STRING"].)
Of course, for an invalid string above, any hope of being "lazy" is lost by following this approach: in the example from earlier we need to still search beyond a ridiculous phrase like "ITOLSHINHISIST", because maybe "ITOLSHINHISISTK" (one letter longer) might form a valid single word.
I feel like somehow I could use laziness here to improve performance throughout the entire program: if the first few characters of phrase aren't a prefix of any word, we can bail out entirely, stop evaluating the rest of phrase, and thus the rest of the path.[2] Does this make sense at all? Is there some tree-like data structure that will help me check not for set membership, but set "prefix-ness", thereby making checking validity lazier?
[1] Obviously, for a 4x4 grid there are very few of these, but this argument is about the general case: for bigger grids I could skip hundreds of thousands of paths the moment I see they start with "JX".
[2] phrase is just map (grid M.!) path for some Map Coord Char grid read from the input file.
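The "prefix-ness" check asked about here is what a trie (prefix tree) provides. The following is a minimal sketch, written in Python for brevity and not in Haskell; build_trie and viable are hypothetical names, and the five-word vocabulary is only there to exercise the code. viable answers whether the letters read so far can still be split into whole words followed by a prefix of some word, which is the condition under which a partial path is worth extending:

```python
END = None  # key marking "a word ends here"; never collides with a letter


def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def viable(trie, letters):
    """True if letters = word1 + ... + wordN + (possibly empty) prefix of a word."""
    nodes = [trie]  # every trie position we could be at after the letters so far
    for ch in letters:
        next_nodes = []
        word_ended = False
        for node in nodes:
            child = node.get(ch)
            if child is not None:
                next_nodes.append(child)
                word_ended = word_ended or END in child
        if word_ended:
            next_nodes.append(trie)  # a word just finished: a new one may start
        if not next_nodes:
            return False  # no continuation can ever become valid: prune here
        nodes = next_nodes
    return True


trie = build_trie(["I", "THINK", "THIS", "IS", "SLOW"])
print(viable(trie, "ITHINKTH"), viable(trie, "IINHH"))
```

In a path search, calling such a check after each step (or carrying the list of trie positions along with the partial path) lets the search drop a whole subtree of paths as soon as the check fails.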

Data Structure for Subsequence Queries

In a program I need to efficiently answer queries of the following form:
Given a set of strings A and a query string q return all s ∈ A such that q is a subsequence of s
For example, given A = {"abcdef", "aaaaaa", "ddca"} and q = "acd" exactly "abcdef" should be returned.
The following is what I have considered so far:
For each possible character, make a sorted list of all string/locations where it appears. For querying interleave the lists of the involved characters, and scan through it looking for matches within string boundaries.
This would probably be more efficient for words instead of characters, since the limited number of different characters will make the return lists very dense.
For each n-prefix q might have, store the list of all matching strings. n might realistically be close to 3. For query strings longer than that we brute force the initial list.
This might speed things up a bit, but one could easily imagine some n-subsequences being present close to all strings in A, which means worst case is the same as just brute forcing the entire set.
Do you know of any data structures, algorithms or preprocessing tricks which might be helpful for performing the above task efficiently for large As? (My ss will be around 100 characters)
Update: Some people have suggested using LCS to check if q is a subsequence of s. I just want to remind that this can be done using a simple function such as:
def isSub(q,s):
    i, j = 0, 0
    while i != len(q) and j != len(s):
        if q[i] == s[j]:
            i += 1
            j += 1
        else:
            j += 1
    return i == len(q)
Update 2: I've been asked to give more details on the nature of q, A and its elements. While I'd prefer something that works as generally as possible, I assume A will have length around 10^6 and will need to support insertion. The elements s will be shorter with an average length of 64. The queries q will only be 1 to 20 characters and be used for a live search, so the query "ab" will be sent just before the query "abc". Again, I'd much prefer the solution to use the above as little as possible.
Update 3: It has occurred to me, that a data-structure with O(n^{1-epsilon}) lookups, would allow you to solve OVP / disprove the SETH conjecture. That is probably the reason for our suffering. The only options are then to disprove the conjecture, use approximation, or take advantage of the dataset. I imagine quadlets and tries would do the last in different settings.
It could be done by building an automaton. You can start with an NFA (nondeterministic finite automaton, which is like a nondeterministic directed graph) that allows edges labeled with an epsilon character, meaning that during processing you can jump from one node to another without consuming any character. I'll try to reduce your A. Let's say your A is:
A = {'ab', 'bc'}
If you build NFA for ab string you should get something like this:
     +--(1)--+
  e  | a|    |e
(S)--+--(2)--+--(F)
     | b|    |
     +--(3)--+
Above drawing is not the best looking automaton. But there are a few points to consider:
S state is the starting state and F is the ending state.
If you are at F state it means your string qualifies as a subsequence.
The rule of propagating within an automaton is that you can consume e (epsilon) to jump forward, therefore you can be in more than one state at each point in time. This is called e closure.
Now if given b, starting at state S I can jump one epsilon, reach 2, and consume b and reach 3. Now given end string I consume epsilon and reach F, thus b qualifies as a sub-sequence of ab. So does a or ab, which you can try yourself using the above automaton.
The good thing about NFA is that they have one start state and one final state. Two NFA could be easily connected using epsilons. There are various algorithms that could help you to convert NFA to DFA. DFA is a directed graph which can follow precise path given a character -- in particular, it is always in exactly one state at any point in time. (For any NFA, there is a corresponding DFA whose states correspond to sets of states in the NFA.)
So, for A = {'ab', 'bc'}, we would need to build the NFA for ab, then the NFA for bc, then join the two NFAs and build the DFA of the entire big NFA.
EDIT
NFA of subsequence of abc would be a?b?c?, so you can build your NFA as:
Now, consider the input acd. To query whether ab is a subsequence of anything in {'abc', 'acd'}, you can use this NFA: (a?b?c?)|(a?c?d?). Once you have the NFA you can convert it to a DFA where each state will record whether the input is a subsequence of abc or acd or maybe both.
I used link below to make NFA graphic from regular expression:
http://hackingoff.com/images/re2nfa/2013-08-04_21-56-03_-0700-nfa.svg
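The a?b?c? idea can be tried out directly with a regular-expression library. The sketch below uses Python's re module, which is a backtracking matcher and not a compiled DFA, so it only illustrates which queries the automaton accepts and says nothing about the speed of the DFA approach. subsequence_pattern and containing are hypothetical helper names:

```python
import re


def subsequence_pattern(s):
    """'abc' -> a?b?c? : each character of s may be used or skipped, in order."""
    return re.compile("".join(re.escape(ch) + "?" for ch in s))


def containing(strings, q):
    """Return the strings that have q as a subsequence."""
    return [s for s in strings if subsequence_pattern(s).fullmatch(q)]


A = ["abc", "acd"]
for q in ["ab", "ac", "ad", "ca"]:
    print(q, containing(A, q))
```

Keeping one pattern per string, as done here, also keeps track of which element of A matched; the combined expression (a?b?c?)|(a?c?d?) alone only says that some element matched.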
EDIT 2
You're right! Consider the case where you have 10,000 unique characters in A. By unique I mean A is something like this: {'abc', 'def'}, i.e. the intersection of the elements of A is the empty set. Then your DFA would be worst case in terms of states, i.e. 2^10000. But I'm not sure when that would be possible, given that there can never be 10,000 unique characters. Even if you have 10,000 characters in A, there will still be repetitions, and that might reduce the number of states a lot, since e-closures might eventually merge. I cannot really estimate how much it might reduce. But even with 10 million states, you will only consume less than 10 MB worth of space to construct a DFA. You can even use the NFA and find e-closures at run-time, but that would add to the run-time complexity. You can search different papers on how large regexes are converted to DFAs.
EDIT 3
For regex (a?b?c?)|(e?d?a?)|(a?b?m?)
If you convert above NFA to DFA you get:
It actually has a lot fewer states than the NFA.
Reference:
http://hackingoff.com/compilers/regular-expression-to-nfa-dfa
EDIT 4
After fiddling with that website some more, I found that the worst case would be something like this: A = {'aaaa', 'bbbbb', 'cccc' ....}. But even in this case there are fewer states than in the NFA.
Tests
There have been four main proposals in this thread:
Shivam Kalra suggested creating an automaton based on all the strings in A. This approach has been tried slightly in the literature, normally under the name "Directed Acyclic Subsequence Graph" (DASG).
J Random Hacker suggested extending my 'prefix list' idea to all 'n choose 3' triplets in the query string, and merging them all using a heap.
In the note "Efficient Subsequence Search in Databases" Rohit Jain, Mukesh K. Mohania and Sunil Prabhakar suggest using a Trie structure with some optimizations and recursively search the tree for the query. They also have a suggestion similar to the triplet idea.
Finally there is the 'naive' approach, which wanghq suggested optimizing by storing an index for each element of A.
To get a better idea of what's worth putting continued effort into, I have implemented the above four approaches in Python and benchmarked them on two sets of data. The implementations could all be made a couple of orders of magnitude faster with a well done implementation in C or Java; and I haven't included the optimizations suggested for the 'trie' and 'naive' versions.
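For reference, the unoptimized trie search among the four proposals can be sketched as follows. This is an illustration of the general idea only, not the benchmarked implementation and not the optimized structure from the note cited above; build_trie and search are hypothetical names:

```python
def build_trie(strings):
    root = {"children": {}, "ends": []}
    for s in strings:
        node = root
        for ch in s:
            node = node["children"].setdefault(ch, {"children": {}, "ends": []})
        node["ends"].append(s)  # strings that end exactly at this node
    return root


def search(node, q, i=0, out=None):
    """Collect all strings in the trie that have q as a subsequence."""
    if out is None:
        out = []
    if i == len(q):
        # All of q is matched: every string at or below this node qualifies.
        stack = [node]
        while stack:
            n = stack.pop()
            out.extend(n["ends"])
            stack.extend(n["children"].values())
        return out
    for ch, child in node["children"].items():
        # Greedily consume q[i] when the edge label matches, else just descend.
        search(child, q, i + 1 if ch == q[i] else i, out)
    return out


trie = build_trie(["abcdef", "aaaaaa", "ddca"])
print(sorted(search(trie, "acd")))
```

The benefit over a plain scan comes from shared prefixes: the matching work along a common prefix is done once for all strings below it.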
Test 1
A consists of random paths from my filesystem. q are 100 random [a-z] strings of average length 7. As the alphabet is large (and Python is slow) I was only able to use duplets for method 3.
Construction times in seconds as a function of A size:
Query times in seconds as a function of A size:
Test 2
A consists of randomly sampled [a-b] strings of length 20. q are 100 random [a-b] strings of average length 7. As the alphabet is small we can use quadlets for method 3.
Construction times in seconds as a function of A size:
Query times in seconds as a function of A size:
Conclusions
The double logarithmic plot is a bit hard to read, but from the data we can draw the following conclusions:
Automatons are very fast at querying (constant time), however they are impossible to create and store for |A| >= 256. It might be possible that a closer analysis could yield a better time/memory balance, or some tricks applicable for the remaining methods.
The dup-/trip-/quadlet method is about twice as fast as my trie implementation and four times as fast as the 'naive' implementation. I used only a linear amount of lists for the merge, instead of n^3 as suggested by j_random_hacker. It might be possible to tune the method better, but in general it was disappointing.
My trie implementation consistently does better than the naive approach by around a factor of two. By incorporating more preprocessing (like "where are the next 'c's in this subtree") or perhaps merging it with the triplet method, this seems like today's winner.
If you can do with an order of magnitude less performance, the naive method does comparatively just fine for very little cost.
As you point out, it might be that all strings in A contain q as a subsequence, in which case you can't hope to do better than O(|A|). (That said, you might still be able to do better than the time taken to run LCS on (q, A[i]) for each string i in A, but I won't focus on that here.)
TTBOMK there are no magic, fast ways to answer this question (in the way that suffix trees are the magic, fast way to answer the corresponding question involving substrings instead of subsequences). Nevertheless if you expect the set of answers for most queries to be small on average then it's worth looking at ways to speed up these queries (the ones yielding small-size answers).
I suggest filtering based on a generalisation of your heuristic (2): if some database sequence A[i] contains q as a subsequence, then it must also contain every subsequence of q. (The reverse direction is not true unfortunately!) So for some small k, e.g. 3 as you suggest, you can preprocess by building an array of lists telling you, for every length-k string s, the list of database sequences containing s as a subsequence. I.e. c[s] will contain a list of the ID numbers of database sequences containing s as a subsequence. Keep each list in numeric order to enable fast intersections later.
Now the basic idea (which we'll improve in a moment) for each query q is: Find all k-sized subsequences of q, look up each in the array of lists c[], and intersect these lists to find the set of sequences in A that might possibly contain q as a subsequence. Then for each possible sequence A[i] in this (hopefully small) intersection, perform an O(n^2) LCS calculation with q to see whether it really does contain q.
A few observations:
The intersection of 2 sorted lists of size m and n can be found in O(m+n) time. To find the intersection of r lists, perform r-1 pairwise intersections in any order. Since taking intersections can only produce sets that are smaller or of the same size, time can be saved by intersecting the smallest pair of lists first, then the next smallest pair (this will necessarily include the result of the first operation), and so on. In particular: sort lists in increasing size order, then always intersect the next list with the "current" intersection.
It is actually faster to find the intersection a different way, by adding the first element (sequence number) of each of the r lists into a heap data structure, then repeatedly pulling out the minimum value and replenishing the heap with the next value from the list that the most recent minimum came from. This will produce a list of sequence numbers in nondecreasing order; any value that appears fewer than r times in a row can be discarded, since it cannot be a member of all r sets.
If a k-string s has only a few sequences in c[s], then it is in some sense discriminating. For most datasets, not all k-strings will be equally discriminating, and this can be used to our advantage. After preprocessing, consider throwing away all lists having more than some fixed number (or some fixed fraction of the total) of sequences, for 3 reasons:
They take a lot of space to store
They take a lot of time to intersect during query processing
Intersecting them will usually not shrink the overall intersection much
It is not necessary to consider every k-subsequence of q. Although this will produce the smallest intersection, it involves merging (|q| choose k) lists, and it might well be possible to produce an intersection that is nearly as small using just a fraction of these k-subsequences. E.g. you could limit yourself to trying all (or a few) k-substrings of q. As a further filter, consider just those k-subsequences whose sequence lists in c[s] are below some value. (Note: if your threshold is the same for every query, you might as well delete all such lists from the database instead, since this will have the same effect, and saves space.)
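The filtering scheme described in this answer can be sketched as follows, assuming k = 3 and a database small enough to index every k-subsequence. build_index and query are hypothetical names, the final verification uses the simple linear scan from the question in place of an LCS computation, and none of the pruning ideas (dropping long lists, using only some k-subsequences) are included:

```python
import heapq
from itertools import combinations


def is_subsequence(q, s):
    it = iter(s)
    return all(ch in it for ch in q)  # each `in` resumes where the last one stopped


def build_index(strings, k=3):
    """Map each length-k subsequence to the increasing list of IDs containing it."""
    index = {}
    for i, s in enumerate(strings):
        for sub in set(combinations(s, k)):  # set(): list each ID once per key
            index.setdefault(sub, []).append(i)
    return index


def query(strings, index, q, k=3):
    if len(q) < k:  # too short to use the index: fall back to a full scan
        return [s for s in strings if is_subsequence(q, s)]
    lists = [index.get(sub, []) for sub in set(combinations(q, k))]
    r = len(lists)
    candidates, run_id, run_len = [], None, 0
    # Heap-based merge of the sorted lists; an ID seen r times in a row is in all of them.
    for i in heapq.merge(*lists):
        run_id, run_len = (i, run_len + 1) if i == run_id else (i, 1)
        if run_len == r:
            candidates.append(i)
    return [strings[i] for i in candidates if is_subsequence(q, strings[i])]


A = ["abcdef", "aaaaaa", "ddca"]
index = build_index(A)
print(query(A, index, "acd"))
```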
One thought;
if q tends to be short, maybe reducing A and q to a set will help?
So for the example, derive to { (a,b,c,d,e,f), (a), (a,c,d) }. Looking up possible candidates for any q should be faster than the original problem (that's a guess actually, not sure how exactly. maybe sort them and "group" similar ones in bloom filters?), then use bruteforce to weed out false positives.
If A strings are lengthy, you could make the characters unique based on their occurrence, so that would be {(a1,b1,c1,d1,e1,f1),(a1,a2,a3,a4,a5,a6),(a1,c1,d1,d2)}. This is fine, because if you search for "ddca" you only want to match the second d to a second d. The size of your alphabet would go up (bad for bloom or bitmap style operations) and would be different every time you get new A's, but the number of false positives would go down.
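One way to read the first suggestion is as a bitmask prefilter: reduce every string to the set of characters it contains, keep only strings whose set covers the query's set, and brute-force the survivors. The sketch below is an assumption about how that could look and does not include the occurrence-numbered variant; char_mask and candidates are hypothetical names:

```python
def char_mask(s):
    """Bitmask with one bit per distinct character of s."""
    mask = 0
    for ch in set(s):
        mask |= 1 << ord(ch)
    return mask


def is_subsequence(q, s):
    it = iter(s)
    return all(ch in it for ch in q)


def candidates(strings, masks, q):
    """Strings containing every character of q (order not yet checked)."""
    q_mask = char_mask(q)
    return [s for s, m in zip(strings, masks) if q_mask & ~m == 0]


A = ["abcdef", "aaaaaa", "ddca"]
masks = [char_mask(s) for s in A]   # precomputed once per database
maybe = candidates(A, masks, "acd")  # may contain false positives
print(maybe, [s for s in maybe if is_subsequence("acd", s)])
```

On the example from the question, "ddca" survives the filter because it contains a, c and d, and is only rejected by the brute-force step.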
First let me make sure my understanding/abstraction is correct. The following two requirements should be met:
if A is a subsequence of B, then all characters in A should appear in B.
for those characters in B, their positions should be in an ascending order.
Note that, a char in A might appear more than once in B.
To solve 1), a map/set can be used. The key is the character in string B, and the value doesn't matter.
To solve 2), we need to maintain the position of each characters. Since a character might appear more than once, the position should be a collection.
So the structure is like:
Map<Character, List<Integer>>
e.g.
abcdefab
a: [0, 6]
b: [1, 7]
c: [2]
d: [3]
e: [4]
f: [5]
Once we have the structure, how do we know whether the characters are in the same order as they are in string A? If A is acd, we should check the a at position 0 (but not 6), c at position 2 and d at position 3.
The strategy here is to choose the position that's after and close to the previous chosen position. TreeSet is a good candidate for this operation.
public E higher(E e)
Returns the least element in this set strictly greater than the given element, or null if there is no such element.
The runtime complexity is O(s * (n1 + n2) * log(m)).
s: number of strings in the set
n1: number of chars in string (B)
n2: number of chars in query string (A)
m: number of duplicates in string (B), e.g. there are 5 a.
Below is the implementation with some test data.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class SubsequenceStr {

    public static void main(String[] args) {
        String[] testSet = new String[] {
            "abcdefgh", //right one
            "adcefgh", //has all chars, but not the right order
            "bcdefh", //missing one char
            "", //empty
            "acdh",//exact match
            "acd",
            "acdehacdeh"
        };
        List<String> subseqenceStrs = subsequenceStrs(testSet, "acdh");
        for (String str : subseqenceStrs) {
            System.out.println(str);
        }
        //duplicates in query
        subseqenceStrs = subsequenceStrs(testSet, "aa");
        for (String str : subseqenceStrs) {
            System.out.println(str);
        }
        subseqenceStrs = subsequenceStrs(testSet, "aaa");
        for (String str : subseqenceStrs) {
            System.out.println(str);
        }
    }

    public static List<String> subsequenceStrs(String[] strSet, String q) {
        System.out.println("find strings whose subsequence string is " + q);
        List<String> results = new ArrayList<String>();
        for (String str : strSet) {
            char[] chars = str.toCharArray();
            Map<Character, TreeSet<Integer>> charPositions = new HashMap<Character, TreeSet<Integer>>();
            for (int i = 0; i < chars.length; i++) {
                TreeSet<Integer> positions = charPositions.get(chars[i]);
                if (positions == null) {
                    positions = new TreeSet<Integer>();
                    charPositions.put(chars[i], positions);
                }
                positions.add(i);
            }
            char[] qChars = q.toCharArray();
            int lowestPosition = -1;
            boolean isSubsequence = false;
            for (int i = 0; i < qChars.length; i++) {
                TreeSet<Integer> positions = charPositions.get(qChars[i]);
                if (positions == null || positions.size() == 0) {
                    break;
                } else {
                    Integer position = positions.higher(lowestPosition);
                    if (position == null) {
                        break;
                    } else {
                        lowestPosition = position;
                        if (i == qChars.length - 1) {
                            isSubsequence = true;
                        }
                    }
                }
            }
            if (isSubsequence) {
                results.add(str);
            }
        }
        return results;
    }
}
Output:
find strings whose subsequence string is acdh
abcdefgh
acdh
acdehacdeh
find strings whose subsequence string is aa
acdehacdeh
find strings whose subsequence string is aaa
As always, I might be totally wrong :)
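The same position-index idea can be sketched in Python, with the index built once per string and kept for later queries. Sorted lists and bisect_right stand in for TreeSet.higher here; build_positions and is_subsequence_indexed are hypothetical names:

```python
from bisect import bisect_right
from collections import defaultdict


def build_positions(s):
    """Map each character of s to the increasing list of positions where it occurs."""
    positions = defaultdict(list)
    for i, ch in enumerate(s):
        positions[ch].append(i)
    return positions


def is_subsequence_indexed(q, positions):
    prev = -1  # position chosen for the previous query character
    for ch in q:
        plist = positions.get(ch)
        if not plist:
            return False
        j = bisect_right(plist, prev)  # first occurrence strictly after prev
        if j == len(plist):
            return False
        prev = plist[j]
    return True


test_set = ["abcdefgh", "adcefgh", "bcdefh", "", "acdh", "acd", "acdehacdeh"]
indexes = [build_positions(s) for s in test_set]  # precomputed once
for q in ["acdh", "aa", "aaa"]:
    print(q, [s for s, idx in zip(test_set, indexes) if is_subsequence_indexed(q, idx)])
```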
You might want to have a look at the book Algorithms on Strings, Trees, and Sequences by Dan Gusfield. As it turns out, part of it is available on the internet. You might also want to read Gusfield's Introduction to Suffix Trees. The book covers many approaches for your kind of question and is considered one of the standard publications in this field.
Get a fast longest common subsequence algorithm implementation. Actually, it suffices to determine the length of the LCS. Notice that Gusfield's book has very good algorithms and also points to more sources for such algorithms.
Return all s ∈ A with length(LCS(s,q)) == length(q)
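A minimal sketch of this filter, using the textbook dynamic-programming LCS length and not any of the faster algorithms the book points to. lcs_length and matches are hypothetical names. For the pure subsequence test, the linear scan shown in the question gives the same answer with less work:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b, O(len(a) * len(b))."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def matches(strings, q):
    """All s with length(LCS(s, q)) == length(q), i.e. q is a subsequence of s."""
    return [s for s in strings if lcs_length(s, q) == len(q)]


print(matches(["abcdef", "aaaaaa", "ddca"], "acd"))
```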

replace within boxed structure

I have the following (for example) data
'a';'b';'c';'a';'b';'a'
┌─┬─┬─┬─┬─┬─┐
│a│b│c│a│b│a│
└─┴─┴─┴─┴─┴─┘
and I'd like to replace all 'a' with a number, 3, and 'b' with another number 4, and get back
┌─┬─┬─┬─┬─┬─┐
│3│4│c│3│4│3│
└─┴─┴─┴─┴─┴─┘
how can I do that?
Thanks for help.
rplc
If that was a string (like 'abcaba') there would be the easy solution of rplc:
'abcaba' rplc 'a';'3';'b';'4'
34c343
amend }
If you need to have it like boxed data (if, for example, 'a' represents something more complex than a character or atom), then maybe you can use amend }:
L =: 'a';'b';'c';'a';'b';'a'
p =: I. (<'a') = L NB. positions of 'a' in L
0 3 5
(<'3') p } L NB. 'amend' "3" on those positions
putting the above into a dyad:
f =: 4 :'({.x) (I.({:x) = y) } y' NB. amend '{.x' in positions where '{:x' = y
('3';'a') f L
┌─┬─┬─┬─┬─┬─┐
│3│b│c│3│b│3│
└─┴─┴─┴─┴─┴─┘
which you can use in more complex settings:
]L =: (i.5);'abc';(i.3);'hello world';(<1;2)
┌─────────┬───┬─────┬───────────┬─────┐
│0 1 2 3 4│abc│0 1 2│hello world│┌─┬─┐│
│ │ │ │ ││1│2││
│ │ │ │ │└─┴─┘│
└─────────┴───┴─────┴───────────┴─────┘
((1;2);(i.3)) f L
┌─────────┬───┬─────┬───────────┬─────┐
│0 1 2 3 4│abc│┌─┬─┐│hello world│┌─┬─┐│
│ │ ││1│2││ ││1│2││
│ │ │└─┴─┘│ │└─┴─┘│
└─────────┴───┴─────┴───────────┴─────┘
btw, {.y is the first item of y; {:y is the last item of y
bottom line
Here's a little utility you can put in your toolbox:
tr =: dyad def '(y i.~ ({." 1 x),y) { ({:" 1 x) , y'
] MAP =: _2 ]\ 'a';3; 'b';4
+-+-+
|a|3|
+-+-+
|b|4|
+-+-+
MAP tr 'a';'b';'c';'a';'b';'a'
+-+-+-+-+-+-+
|3|4|c|3|4|3|
+-+-+-+-+-+-+
just above the bottom line
The utility tr is a verb which takes two arguments (a dyad): the right argument is the target, and the left argument is the mapping table. The table must have two columns, and each row represents a single mapping. To make just a single replacement, a vector of two items is acceptable (i.e. 1D list instead of 2D table, so long as the list is two items long).
Note that the table must have the same datatype as the target (so, if you're replacing boxes, it must be a table of boxes; if characters, then a table of characters; numbers for numbers, etc).
And, since we're doing like-for-like mapping, the cells of the mapping table must have the same shape as the items of the target, so it's not suitable for tasks like string substitution, which may require shape-shifting. For example, ('pony';'horse') tr 'I want a pony for christmas' won't work (though, amusingly, 'pony horse' tr&.;: 'I want a pony for christmas' would, for reasons I won't get into).
way above the bottom line
There's no one, standard answer to your question. That said, there is a very common idiom to do translation (in the tr, or mapping 1:1, sense):
FROM =: ;: 'cat dog bird'
TO =: ;: 'tiger wolf pterodactyl'
input=: ;: 'cat bird dog bird bird cat'
(FROM i. input) { TO
+-----+-----------+----+-----------+-----------+-----+
|tiger|pterodactyl|wolf|pterodactyl|pterodactyl|tiger|
+-----+-----------+----+-----------+-----------+-----+
To break this down, the primitive i. is the lookup function and the primitive { is the selection function (mnemonic: i. gives you the *i*ndex of the elements you're looking for).
But the simplistic formulation above only applies when you want to replace literally everything in the input, and FROM is guaranteed to be total (i.e. the items of the input are constrained to whatever is in FROM).
These constraints make the simple formulation appropriate for tasks like case conversion of strings, where you want to replace all the letters, and we know the total universe of letters in advance (i.e. the alphabet is finite).
But what happens if we don't have a finite universe? What should we do with unrecognized items? Well, anything we want. This need for flexibility is the reason that there is no one, single translation function in J: instead, the language gives you the tools to craft a solution specific to your needs.
For example, one very common extension to the pattern above is the concept of substitution-with-default (for unrecognized items). And, because i. is defined to return #FROM (the length of the lookup list, i.e. one past its last valid index) for items not found in the lookup, the extension is surprisingly simple: we just extend the replacement list by one item, i.e. just append the default!
DEFAULT =: <'yeti'
input=: ;: 'cat bird dog horse bird monkey cat iguana'
(FROM i. input) { TO,DEFAULT
+-----+-----------+----+----+-----------+----+-----+----+
|tiger|pterodactyl|wolf|yeti|pterodactyl|yeti|tiger|yeti|
+-----+-----------+----+----+-----------+----+-----+----+
Of course, this is destructive in the sense it's not invertible: it leaves no information about the input. Sometimes, as in your question, if you don't know how to replace something, it's best to leave it alone.
Again, this kind of extension is surprisingly simple, and, once you see it, obvious: you extend the lookup table by appending the input. That way, you're guaranteed to find all the items of the input. And replacement is similarly simple: you extend the replacement list by appending the input. So you end up replacing all unknown items with themselves.
( (FROM,input) i. input) { TO,input
+-----+-----------+----+-----+-----------+------+-----+------+
|tiger|pterodactyl|wolf|horse|pterodactyl|monkey|tiger|iguana|
+-----+-----------+----+-----+-----------+------+-----+------+
This is the strategy embodied in tr.
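For readers less familiar with J, the same lookup-then-select strategy can be sketched in Python. This is only an analogue to show the idea, not how tr is implemented; translate is a hypothetical name, and frm and to are assumed to have the same length:

```python
def translate(frm, to, items):
    """Replace each item found in frm by the matching entry of to; keep the rest."""
    lookup = list(frm) + list(items)        # (FROM,input)
    replacements = list(to) + list(items)   # TO,input
    # list.index returns the first match, like J's i. ; every item is found,
    # at worst at its own position in the appended input.
    return [replacements[lookup.index(x)] for x in items]


FROM = "cat dog bird".split()
TO = "tiger wolf pterodactyl".split()
words = "cat bird dog horse bird monkey cat iguana".split()
print(translate(FROM, TO, words))
```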
above the top line: an extension
BTW, when writing utilities like tr, J programmers will often consider the N-dimensional case, because that's the spirit of the language. As it stands, tr requires a 2-dimensional mapping table (and, by accident, will accept a 1-dimensional list of two items, which can be convenient). But there may come a day when we want to replace a plane inside a cube, or a cube inside a hypercube, etc (common in business intelligence applications). We may wish to extend the utility to cover these cases, should they ever arise.
But how? Well, we know the mapping table must have at least two dimensions: one to hold multiple simultaneous substitutions, and another to hold the rules for replacement (i.e. one "row" per substitution and two "columns" to identify an item and its replacement). The key here is that's all we need. To generalize tr, we merely need to say we don't care about what's beneath those dimensions. It could be an Nx2 table of single characters, or an Nx2 table of fixed-length strings, or an Nx2 table of matrices for some linear algebra purpose, or ... who cares? Not our problem. We only care about the frame, not the contents.
So let's say that, in tr:
NB. Original
tr =: dyad def '(y i.~ ({." 1 x),y) { ({:" 1 x) , y'
NB. New, laissez-faire definition
tr =: dyad def '(y i.~ ({."_1 x),y) { ({:"_1 x) , y'
A taxing change, as you can see ;). Less glibly: the rank operator " can take positive or negative arguments. A positive argument lets the verb address the content of its input, whereas a negative argument lets the verb address the frame of its input. Here, "1 (positive) applies {. to the rows of x, whereas "_1 (negative) applies it to the "rows" of x, where "rows" in scare-quotes simply means the items along the first dimension, even if they happen to be 37-dimensional hyperrectangles. Who cares?
Well, one guy cares. The original definition of tr let the laziest programmer write ('dog';'cat') tr ;: 'a dog makes the best pet' instead of (,:'dog';'cat') tr ;: 'a dog makes the best pet'. That is, the original tr (completely accidentally) allowed a simple list as a mapping table, which of course isn't a Nx2 table, even in an abstract, virtual sense (because it doesn't have at least two dimensions). Maybe we'd like to retain this convenience. If so, we'd have to promote degenerate arguments on the user's behalf:
tr =: dyad define
x=.,:^:(1=##$) x
(y i.~ ({."_1 x),y) { ({:"_1 x) , y
)
After all, laziness is a prime virtue of a programmer.
Here's the simplest way I can think of to accomplish what you have asked for:
(3;3;3;4;4) 0 3 5 1 4} 'a';'b';'c';'a';'b';'a'
┌─┬─┬─┬─┬─┬─┐
│3│4│c│3│4│3│
└─┴─┴─┴─┴─┴─┘
here's another approach
(<3) 0 3 5} (<4) 1 4} 'a';'b';'c';'a';'b';'a'
┌─┬─┬─┬─┬─┬─┐
│3│4│c│3│4│3│
└─┴─┴─┴─┴─┴─┘
Hypothetically speaking, you might want to generalize this kind of expression, or you might want an alternative. I think the other posters here have pointed out ways of doing that. But sometimes just seeing the simplest form can be interesting?
By the way, here's how I got my above indices (with some but not all of the irrelevancies removed):
I. (<'a') = 'a';'b';'c';'a';'b';'a'
0 3 5
('a') =S:0 'a';'b';'c';'a';'b';'a'
1 0 0 1 0 1
('a') -:S:0 'a';'b';'c';'a';'b';'a'
1 0 0 1 0 1
I.('a') -:S:0 'a';'b';'c';'a';'b';'a'
0 3 5
I.('b') -:S:0 'a';'b';'c';'a';'b';'a'
1 4

Efficient mass string search problem

The Problem: A large static list of strings is provided, along with a pattern string composed of data and wildcard elements (* and ?). The idea is to return all the strings that match the pattern - simple enough.
Current Solution: I'm currently using a linear approach of scanning the large list and globbing each entry against the pattern.
My Question: Are there any suitable data structures that I can store the large list into such that the search's complexity is less than O(n)?
Perhaps something akin to a suffix-trie? I've also considered using bi- and tri-grams in a hashtable, but the logic required in evaluating a match based on a merge of the list of words returned and the pattern is a nightmare; furthermore, I'm not convinced it's the correct approach.
I agree that a suffix trie is a good idea to try, except that the sheer size of your dataset might make its construction use up just as much time as its usage would save. They're best if you've got to query them multiple times to amortize the construction cost. Perhaps a few hundred queries.
Also note that this is a good excuse for parallelism. Cut the list in two and give it to two different processors and have your job done twice as fast.
You could build a regular trie and add wildcard edges. Then your lookup complexity would be O(n), where n is the length of the pattern. You would have to replace runs of ** with * in the pattern first (also an O(n) operation).
If the list of words were I am an ox then the trie would look a bit like this:
(I ($ [I])
 a (m ($ [am])
    n ($ [an])
    ? ($ [am an])
    * ($ [am an]))
 o (x ($ [ox])
    ? ($ [ox])
    * ($ [ox]))
 ? ($ [I]
    m ($ [am])
    n ($ [an])
    x ($ [ox])
    ? ($ [am an ox])
    * ($ [I am an ox]
       m ($ [am]) ...)
 * ($ [I am an ox]
    I ...
    ...
And here is a sample python program:
import sys

def addWord(root, word):
    add(root, word, word, '')

def add(root, word, tail, prev):
    # Insert `word` under every pattern path that could match it.
    if tail == '':
        addLeaf(root, word)
    else:
        head = tail[0]
        tail2 = tail[1:]
        add(addEdge(root, head), word, tail2, head)   # literal character
        add(addEdge(root, '?'), word, tail2, head)    # '?' stands in for one character
    if prev != '*':
        # '*' may swallow any number of the remaining characters (including none)
        for l in range(len(tail)+1):
            add(addEdge(root, '*'), word, tail[l:], '*')

def addEdge(root, char):
    if char not in root:    # dict.has_key() no longer exists in Python 3
        root[char] = {}
    return root[char]

def addLeaf(root, word):
    # Matching words are stored in a list under the '$' key.
    if '$' not in root:
        root['$'] = []
    leaf = root['$']
    if word not in leaf:
        leaf.append(word)

def findWord(root, pattern):
    prev = ''
    for p in pattern:
        if p == '*' and prev == '*':
            continue        # collapse runs of '*' into a single '*'
        prev = p
        if p not in root:
            return []
        root = root[p]
    if '$' not in root:
        return []
    return root['$']

def run():
    print("Enter words, one per line terminate with a . on a line")
    root = {}
    while True:
        line = sys.stdin.readline()
        if line in ('', '.\n', '.'): break    # '' means end of input
        addWord(root, line.rstrip('\n'))
    print(repr(root))
    print("Now enter search patterns. Do not use multiple sequential '*'s")
    while True:
        line = sys.stdin.readline()
        if line in ('', '.\n', '.'): break
        print(findWord(root, line.rstrip('\n')))

run()
If you don't care about memory and you can afford to pre-process the list, create a sorted array of every suffix, pointing to the original word, e.g., for ['hello', 'world'], store this:
[('d' , 'world'),
('ello' , 'hello'),
('hello', 'hello'),
('ld' , 'world'),
('llo' , 'hello'),
('lo' , 'hello'),
('o' , 'hello'),
('orld' , 'world'),
('rld' , 'world'),
('world', 'world')]
Use this array to build sets of candidate matches using pieces of the pattern.
For instance, if the pattern is *or*, find the candidate match ('orld' , 'world') using a binary chop on the substring or, then confirm the match using a normal globbing approach.
If the wildcard is more complex, e.g., h*o, build sets of candidates for h and o and find their intersection before the final linear glob.
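A minimal sketch of this sorted-suffix approach might look as follows. The function names (`build_suffix_index`, `candidates`, `search`) are invented for illustration, and the sketch assumes patterns use only * and ? as wildcards; the final confirmation uses the standard library's `fnmatch.fnmatchcase` as the "normal globbing" step.

```python
import bisect
import re
from fnmatch import fnmatchcase


def build_suffix_index(words):
    """Sorted list of (suffix, word) pairs for every non-empty suffix of every word."""
    return sorted({(w[i:], w) for w in words for i in range(len(w))})


def candidates(index, piece):
    """Words containing the literal `piece`, found by a binary chop on the suffixes."""
    # (piece,) sorts just before any (piece..., word) pair, so this lands on the
    # first suffix that could start with `piece`.
    pos = bisect.bisect_left(index, (piece,))
    found = set()
    for suffix, word in index[pos:]:
        if not suffix.startswith(piece):
            break  # matching suffixes are contiguous in sorted order
        found.add(word)
    return found


def search(index, pattern):
    """Glob `pattern` against only the candidates suggested by its literal pieces."""
    pieces = [p for p in re.split(r"[*?]", pattern) if p]
    if pieces:
        pool = set.intersection(*(candidates(index, p) for p in pieces))
    else:
        pool = {word for _, word in index}  # all wildcards: nothing to narrow down
    return sorted(w for w in pool if fnmatchcase(w, pattern))


index = build_suffix_index(["hello", "world"])
print(search(index, "*or*"))
print(search(index, "h*o"))
```

The index costs memory proportional to the total number of suffixes, which is the trade-off the answer warns about.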
You say you're currently doing linear search. Does this give you any data on the most frequently performed query patterns? E.g., is blah* much more common than bl?h (which I'd assume it is) among your current users?
With that kind of prior knowledge you can focus your indexing efforts on the commonly used cases and get them down to O(1), rather than trying to solve the much more difficult, and yet much less worthwhile, problem of making every possible query equally fast.
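As one possible illustration, suppose prefix queries such as blah* turn out to dominate. The sketch below special-cases exactly that shape and falls back to a linear glob for everything else. The class name `PrefixIndex` is hypothetical, and the fast path here is a binary search over a sorted list (logarithmic, not the O(1) mentioned above), which is an assumption made to keep the example short.

```python
import bisect
from fnmatch import fnmatchcase


class PrefixIndex:
    """Sorted word list with a fast path for 'literal prefix + trailing *' patterns."""

    def __init__(self, words):
        self.words = sorted(words)

    def search(self, pattern):
        head = pattern[:-1]
        # Fast path: a literal prefix followed by a single trailing '*'.
        if pattern.endswith("*") and not any(c in "*?" for c in head):
            lo = bisect.bisect_left(self.words, head)
            hi = lo
            # Words sharing a prefix sit next to each other in sorted order.
            while hi < len(self.words) and self.words[hi].startswith(head):
                hi += 1
            return self.words[lo:hi]
        # Slow path: ordinary linear glob over every word.
        return [w for w in self.words if fnmatchcase(w, pattern)]


idx = PrefixIndex(["blah", "blahs", "bleh", "apple"])
print(idx.search("blah*"))
print(idx.search("bl?h"))
```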
You can achieve a simple speedup by keeping counts of the characters in your strings. A string with no bs or a single b can never match the query abba*, so there is no point in testing it. This works much better on whole words, if your strings are made of those, since there are many more words than characters; plus, there are plenty of libraries that can build the indexes for you. On the other hand, it is very similar to the n-gram approach you mentioned.
If you do not use a library that does it for you, you can optimize queries by looking up the most globally infrequent characters (or words, or n-grams) first in your indexes. This allows you to discard more non-matching strings up front.
In general, all speedups will be based on the idea of discarding things that cannot possibly match. What and how much to index depends on your data. For example, if the typical pattern length is near to the string length, you can simply check to see if the string is long enough to hold the pattern.
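The character-count and length checks can be sketched as a cheap pre-filter in front of the real glob. The helper names below are made up for this example, and the scan is still linear over the list; the saving comes only from skipping the more expensive pattern match for strings that cannot possibly match.

```python
from collections import Counter
from fnmatch import fnmatchcase


def required_chars(pattern):
    """Count the literal characters that any matching string must contain."""
    return Counter(c for c in pattern if c not in "*?")


def min_length(pattern):
    """Every pattern character except '*' consumes exactly one character."""
    return sum(1 for c in pattern if c != "*")


def build_counts(strings):
    """Pre-compute character counts once, since the list is static."""
    return [(s, Counter(s)) for s in strings]


def search(indexed, pattern):
    need = required_chars(pattern)
    shortest = min_length(pattern)
    hits = []
    for s, have in indexed:
        if len(s) < shortest:
            continue  # too short to hold the pattern
        if any(have[c] < n for c, n in need.items()):
            continue  # missing a required character, e.g. fewer than two b's
        if fnmatchcase(s, pattern):
            hits.append(s)
    return hits


indexed = build_counts(["abba", "abbatoir", "cab", "banana"])
print(search(indexed, "abba*"))
```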
There are plenty of good algorithms for multi-string search. Google "Navarro string search" and you'll see a good analysis of multi-string options. A number of algorithms are extremely good for "normal" cases (search strings that are fairly long: Wu-Manber; search strings with characters that are modestly rare in the text to be searched: parallel Horspool). Aho-Corasick is an algorithm that guarantees a (tiny) bounded amount of work per input character, no matter how the input text is tuned to create worst-case behaviour in the search. For programs like Snort, that's really important, in the face of denial-of-service attacks. If you are interested in how a really efficient Aho-Corasick search can be implemented, take a look at ACISM - an Aho-Corasick Interleaved State Matrix.
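To give a feel for Aho-Corasick, here is a small dictionary-based sketch. It handles literal patterns only (no * or ? wildcards), it is not the interleaved state matrix used by ACISM, and the function names `build_automaton` and `find_all` are invented for this example.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure, and output tables for a set of literal patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix that is a trie prefix
    out = [[]]    # out[state] -> patterns that end when this state is reached
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first pass: shallower states are finished before deeper ones.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit matches from the suffix state
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every occurrence of every pattern in text."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits


automaton = build_automaton(["he", "she", "his", "hers"])
print(find_all("ushers", automaton))
```

The text is scanned once, left to right, regardless of how many patterns are loaded, which is where the bounded work per input character comes from.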

Resources