How to find similar patterns in lists/arrays of strings

I am looking for ways to find matching patterns in lists or arrays of strings, specifically in .NET, but algorithms or logic from other languages would be helpful.
Say I have 3 arrays (or in this specific case List(Of String))
Array1
"Do"
"Re"
"Mi"
"Fa"
"So"
"La"
"Ti"
Array2
"Mi"
"Fa"
"Jim"
"Bob"
"So"
Array3
"Jim"
"Bob"
"So"
"La"
"Ti"
I want to report on the occurrences of the matches of
("Mi", "Fa") In Arrays (1,2)
("So") In Arrays (1,2,3)
("Jim", "Bob", "So") in Arrays (2,3)
("So", "La", "Ti") in Arrays (1, 3)
...and any others.
I am using this to troubleshoot an issue, not to make a commercial product of it specifically, and would rather not do it by hand (there are 110 lists of about 100-200 items).
Are there any algorithms, existing code, or ideas that will help me accomplish finding the results described?

The simplest approach to code is to build a Dictionary that maps each item to the lists containing it, then loop through every item in every array. For each item:
If the item is already in the dictionary, add the current list to that item's entry.
If the item is not in the dictionary, add it along with the current list.
Since, as you said, this is non-production code, performance doesn't matter, so this approach should work fine.
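A minimal Python sketch of the dictionary idea above. The names index_items, lists and shared are illustrative and not from the answer, and the lists are assumed to be numbered from 1. Note that this only reports single shared items, not runs of consecutive items.

```python
from collections import defaultdict

def index_items(lists):
    """Map each item to the sorted 1-based numbers of the lists containing it."""
    found = defaultdict(set)
    for number, items in enumerate(lists, start=1):
        for item in items:
            found[item].add(number)
    return {item: sorted(numbers) for item, numbers in found.items()}

lists = [
    ["Do", "Re", "Mi", "Fa", "So", "La", "Ti"],
    ["Mi", "Fa", "Jim", "Bob", "So"],
    ["Jim", "Bob", "So", "La", "Ti"],
]

# Keep only the items that appear in more than one list.
shared = {item: where for item, where in index_items(lists).items() if len(where) > 1}
```

Using a set per item means an item repeated inside one list is still counted once for that list.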

Here's a solution using SuffixTree module to locate subsequences:
#!/usr/bin/env python
from SuffixTree import SubstringDict
from collections import defaultdict
from itertools import groupby
from operator import itemgetter
import sys
def main(stdout=sys.stdout):
    """
    >>> import StringIO
    >>> s = StringIO.StringIO()
    >>> main(stdout=s)
    >>> print s.getvalue()
    [['Mi', 'Fa']] In Arrays (1, 2)
    [['So', 'La', 'Ti']] In Arrays (1, 3)
    [['Jim', 'Bob', 'So']] In Arrays (2, 3)
    [['So']] In Arrays (1, 2, 3)
    <BLANKLINE>
    """
    # array of arrays of strings
    arr = [
        ["Do", "Re", "Mi", "Fa", "So", "La", "Ti",],
        ["Mi", "Fa", "Jim", "Bob", "So",],
        ["Jim", "Bob", "So", "La", "Ti",],
    ]
    #### # 28 seconds (27 seconds without lesser substrs inspection (see below))
    #### N, M = 100, 100
    #### import random
    #### arr = [[random.randrange(100) for _ in range(M)] for _ in range(N)]
    # convert to ASCII alphabet (for SubstringDict)
    letter2item = {}
    item2letter = {}
    c = 1
    for item in (i for a in arr for i in a):
        if item not in item2letter:
            c += 1
            if c == 128:
                raise ValueError("too many unique items; "
                                 "use a less restrictive alphabet for SuffixTree")
            letter = chr(c)
            letter2item[letter] = item
            item2letter[item] = letter
    arr_ascii = [''.join(item2letter[item] for item in a) for a in arr]
    # populate substring dict (based on SuffixTree)
    substring_dict = SubstringDict()
    for i, s in enumerate(arr_ascii):
        substring_dict[s] = i+1
    # enumerate all substrings, save those that occur more than once
    substr2indices = {}
    indices2substr = defaultdict(list)
    for str_ in arr_ascii:
        for start in range(len(str_)):
            for size in reversed(range(1, len(str_) - start + 1)):
                substr = str_[start:start + size]
                if substr not in substr2indices:
                    indices = substring_dict[substr] # O(n) SuffixTree
                    if len(indices) > 1:
                        substr2indices[substr] = indices
                        indices2substr[tuple(indices)].append(substr)
                        #### # inspect all lesser substrs
                        #### # it could diminish size of indices2substr[ind] list
                        #### # but it has no effect for input 100x100x100 (see above)
                        #### for i in reversed(range(len(substr))):
                        ####     s = substr[:i]
                        ####     if s in substr2indices: continue
                        ####     ind = substring_dict[s]
                        ####     if len(ind) > len(indices):
                        ####         substr2indices[s] = ind
                        ####         indices2substr[tuple(ind)].append(s)
                        ####         indices = ind
                        ####     else:
                        ####         assert set(ind) == set(indices), (ind, indices)
                        ####         substr2indices[s] = None
                        #### break # all sizes inspected, move to next `start`
    for indices, substrs in indices2substr.iteritems():
        # remove substrs that are substrs of other substrs
        substrs = sorted(substrs, key=len) # sort by size
        substrs = [p for i, p in enumerate(substrs)
                   if not any(p in q for q in substrs[i+1:])]
        # convert letters to items and print
        items = [map(letter2item.get, substr) for substr in substrs]
        print >>stdout, "%s In Arrays %s" % (items, indices)
if __name__=="__main__":
    # test
    import doctest; doctest.testmod()
    # measure performance
    import timeit
    t = timeit.Timer(stmt='main(stdout=s)',
                     setup='from __main__ import main; from cStringIO import StringIO as S; s = S()')
    N = 1000
    milliseconds = min(t.repeat(repeat=3, number=N))
    print("%.3g milliseconds" % (1e3*milliseconds/N))
It takes about 30 seconds to process 100 lists of 100 items each. SubstringDict in the above code might be emulated by grep -F -f.
Old solution:
In Python (save it to 'group_patterns.py' file):
#!/usr/bin/env python
from collections import defaultdict
from itertools import groupby
def issubseq(p, q):
    """Return whether `p` is a subsequence of `q`."""
    return any(p == q[i:i + len(p)] for i in range(len(q) - len(p) + 1))
arr = (("Do", "Re", "Mi", "Fa", "So", "La", "Ti",),
       ("Mi", "Fa", "Jim", "Bob", "So",),
       ("Jim", "Bob", "So", "La", "Ti",))
# store all patterns that occur at least twice
d = defaultdict(list) # a map: pattern -> indexes of arrays it's within
for i, a in enumerate(arr[:-1]):
    for j, q in enumerate(arr[i+1:]):
        for k in range(len(a)):
            for size in range(1, len(a)+1-k):
                p = a[k:k + size] # a pattern
                if issubseq(p, q): # `p` occurs at least twice
                    d[p] += [i+1, i+2+j]
# group patterns by arrays they are within
inarrays = lambda pair: sorted(set(pair[1]))
for key, group in groupby(sorted(d.iteritems(), key=inarrays), key=inarrays):
    patterns = sorted((pair[0] for pair in group), key=len) # sort by size
    # remove patterns that are subsequences of other patterns
    patterns = [p for i, p in enumerate(patterns)
                if not any(issubseq(p, q) for q in patterns[i+1:])]
    print "%s In Arrays %s" % (patterns, key)
The following command:
$ python group_patterns.py
prints:
[('Mi', 'Fa')] In Arrays [1, 2]
[('So',)] In Arrays [1, 2, 3]
[('So', 'La', 'Ti')] In Arrays [1, 3]
[('Jim', 'Bob', 'So')] In Arrays [2, 3]
The solution is terribly inefficient.

As others have mentioned, the function you want is Intersect. If you are using .NET 3.5 or later, consider using LINQ's Intersect function.
See the following post for more information
Consider using LINQPad to experiment.
www.linqpad.net

I hacked the program below in about 10 minutes of Perl. It's not perfect, it uses a global variable, and it just prints out the counts of every element seen by the program in each list, but it's a good approximation to what you want to do that's super-easy to code.
Do you actually want all combinations of all subsets of the elements common to each array? You could enumerate all of the elements in a smarter way if you wanted, but if you just wanted all elements that exist at least once in each array you could use the Unix command "grep -v 0" on the output below and that would show you the intersection of all elements common to all arrays. Your question is missing a little bit of detail, so I can't perfectly implement something that solves your problem.
If you do more data analysis than programming, scripting can be very useful for asking questions from textual data like this. If you don't know how to code in a scripting language like this, I would spend a month or two reading about how to code in Perl, Python or Ruby. They can be wonderful for one-off hacks such as this, especially in cases when you don't really know what you want. The time and brain cost of writing a program like this is really low, so that (if you're fast) you can write and re-write it several times while still exploring the definition of your question.
#!/usr/bin/perl -w
use strict;
my @Array1 = ( "Do", "Re", "Mi", "Fa", "So", "La", "Ti");
my @Array2 = ( "Mi", "Fa", "Jim", "Bob", "So" );
my @Array3 = ( "Jim", "Bob", "So", "La", "Ti" );
my %counts;
sub count_array {
    my $array = shift;
    my $name = shift;
    foreach my $e (@$array) {
        $counts{$e}{$name}++;
    }
}
count_array( \@Array1, "Array1" );
count_array( \@Array2, "Array2" );
count_array( \@Array3, "Array3" );
my @names = qw/ Array1 Array2 Array3 /;
print join ' ', ('element',@names);
print "\n";
my @unique_names = keys %counts;
foreach my $unique_name (@unique_names) {
    my @counts = map {
        if ( exists $counts{$unique_name}{$_} ) {
            $counts{$unique_name}{$_};
        } else {
            0;
        }
    }
    @names;
    print join ' ', ($unique_name,@counts);
    print "\n";
}
The program's output is:
element Array1 Array2 Array3
Ti 1 0 1
La 1 0 1
So 1 1 1
Mi 1 1 0
Fa 1 1 0
Do 1 0 0
Bob 0 1 1
Jim 0 1 1
Re 1 0 0

Looks like you want to use an intersection function on sets of data. Intersection picks out the elements that are common to two (or more) sets.
The problem with this viewpoint is that a set cannot contain more than one of each element (i.e. no more than one Jim per set). It also cannot recognize several elements in a row as a pattern; you can, however, modify a comparison function to look further ahead and detect just that.
There may be functions like intersect that work on bags (which are like sets, but tolerate identical elements).
These functions should be standard in most languages, or pretty easy to write yourself.
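To illustrate the difference between set and bag intersection, here is a small Python sketch. It uses the built-in set type and collections.Counter as the bag; the sample bags are made up for the example. Neither form knows anything about the order of elements, so neither detects runs.

```python
from collections import Counter

array2 = ["Mi", "Fa", "Jim", "Bob", "So"]
array3 = ["Jim", "Bob", "So", "La", "Ti"]

# Set intersection: each common element appears once; order and repeats are lost.
common_set = set(array2) & set(array3)

# Bag (multiset) intersection: keeps the minimum count of each shared element.
bag_a = Counter(["Jim", "Jim", "So"])
bag_b = Counter(["Jim", "Jim", "Jim", "Bob"])
common_bag = bag_a & bag_b
```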

I'm sure there's a MUCH more elegant way, but...
Since this isn't production code, why not just hack it and convert each array into a delimited string, then search each string for the pattern you want? i.e.
private void button1_Click(object sender, EventArgs e)
{
    string[] array1 = { "do", "re", "mi", "fa", "so" };
    string[] array2 = { "mi", "fa", "jim", "bob", "so" };
    string[] pattern1 = { "mi", "fa" };
    MessageBox.Show(FindPatternInArray(array1, pattern1).ToString());
    MessageBox.Show(FindPatternInArray(array2, pattern1).ToString());
}
private bool FindPatternInArray(string[] AArray, string[] APattern)
{
    return string.Join("~", AArray).IndexOf(string.Join("~", APattern)) >= 0;
}
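The same delimited-string trick can be sketched in Python. One tweak here is not in the answer: both strings are wrapped in delimiters so that a pattern cannot match part of a longer item (for example "mi" inside "semi"). The sketch assumes "~" never occurs in the data; contains_run and SEP are illustrative names.

```python
SEP = "~"  # assumed never to appear inside any item

def contains_run(items, pattern):
    """Return True if `pattern` occurs as a run of whole, consecutive items."""
    haystack = SEP + SEP.join(items) + SEP
    needle = SEP + SEP.join(pattern) + SEP
    return needle in haystack

in_array1 = contains_run(["do", "re", "mi", "fa", "so"], ["mi", "fa"])
in_array2 = contains_run(["mi", "fa", "jim", "bob", "so"], ["mi", "fa"])
partial_item = contains_run(["semi", "fa"], ["mi", "fa"])
```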

First, start by counting each item.
You make a temp list: "Do" = 1, "Mi" = 2, "So" = 3, etc.
You can remove from the temp list every item whose count is 1 (e.g. "Do").
The temp list now contains the non-unique items (save it somewhere).
Now, try to make lists of two items by taking one item from the temp list and the item that follows it in the original lists.
"So" + "La" = 2, "Bob" + "So" = 2, etc.
Remove the ones with a count of 1.
You now have the pairs that appear at least twice (save them somewhere).
Now, try to make lists of 3 items by taking a pair from the temp list and the item that follows it in the original lists.
("Mi", "Fa") + "So" = 1, ("Mi", "Fa") + "Jim" = 1, ("So", "La") + "Ti" = 2
Remove the ones with a count of 1.
You now have the lists of 3 items that appear at least twice (save them).
Continue like that until the temp list is empty.
At the end, take all the saved lists and merge them.
This algorithm is not optimal (I think we can do better with suitable data structures), but it is easy to implement :)
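The steps above could be sketched in Python roughly as follows. This is one possible reading, not the answerer's code: it assumes a run "appears at least twice" when it occurs in at least two different lists, and the name repeated_runs is made up for the example.

```python
def repeated_runs(lists):
    """Return {run: sorted 1-based list numbers} for runs found in 2+ lists."""
    saved = {}
    size = 1
    while True:
        where = {}
        for number, items in enumerate(lists, start=1):
            for start in range(len(items) - size + 1):
                run = tuple(items[start:start + size])
                # Only extend runs whose prefix survived the previous round.
                if size > 1 and run[:-1] not in saved:
                    continue
                where.setdefault(run, set()).add(number)
        survivors = {run: sorted(nums) for run, nums in where.items() if len(nums) > 1}
        if not survivors:
            return saved  # the "temp list" is empty, so stop
        saved.update(survivors)
        size += 1

lists = [
    ["Do", "Re", "Mi", "Fa", "So", "La", "Ti"],
    ["Mi", "Fa", "Jim", "Bob", "So"],
    ["Jim", "Bob", "So", "La", "Ti"],
]
runs = repeated_runs(lists)
```

The result still contains shorter runs that are part of longer ones (for example ("So", "La") alongside ("So", "La", "Ti")); filtering those out would be the final merge step.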


Related

Pick a struct element by field name and some non-sequential index

I mean to use a struct to hold a "table":
% Sample data
% idx idxstr var1 var2 var3
% 1 i01 3.5 21.0 5
% 12 i12 6.5 1.0 3
The first row contains the field names.
Assume I created a struct
ds2 = struct( ...
'idx', { 1, 12 }, ...
'idxstr', { 'i01', 'i12' }, ...
'var1', { 3.5, 6.5 }, ...
'var2', { 21, 1 }, ...
'var3', { 5, 3 } ...
);
How can I retrieve the value for field var2, for the row corresponding to idxstr equal to 'i01'?
Notes:
I cannot ensure the length of idxstr elements will always be 3.
Ideally, I would have a method that also works for columns var2 containing strings, or any other type of variable.
PS: I think https://stackoverflow.com/a/35976320/2707864 can help.
As I mentioned in the comments, I believe you have the wrong kind of struct for this work. Instead of an array of (effectively single-row) structs, you should instead have a single struct with 'array' fields. (numeric or cell, as appropriate).
E.g.
d = struct(
'idx', [1, 12 ],
'idxstr', {{'i01', 'i12'}},
'var1', [3.5, 6.5],
'var2', [21, 1],
'var3', [5, 3]
);
With this structure, your problem becomes infinitely easier to deal with:
d.var2( strcmp( 'i01', d.idxstr ) )
% ans = 21
This is also far more comparable to R / pandas dataframes functionality (which are also effectively initialised via names and equally-sized arrays like this).
PS. Note carefully the syntax used for the 'idxstr' field: there is an 'outer' cell array with a single element, meaning you're only creating a single struct, rather than an array of structs. This single element happens to be a cell array of strings, where this cell array is of the same size (i.e. has the same number of 'rows') as the numeric arrays.
UPDATE
In response to the comment, adding 'rows' should be fairly straightforward. Here is one approach:
function S = addrow( S, R )
  FieldNames = fieldnames( S ).'; NumFields = length( FieldNames );
  for i = 1 : NumFields,
    S.( FieldNames{i} ) = horzcat( S.( FieldNames{i} ), R{i} );
  end
end
Then you can simply do:
d = addrow( d, {5, 'i011', 2.7, 10, 11} );
Assuming that idxstr can be more than 3 characters (there is a shorter version if it's always 3 chars), this is what I came up with (tested on MATLAB):
logical_index=~cellfun(@isempty,strfind({ds2(:).idxstr},'i01'))
You can access the variables as:
ds2(~cellfun(@isempty,strfind({ds2(:).idxstr},'i01'))).var2;
% using above variable
ds2(logical_index).var2;
You can understand now why MATLAB introduced tables hehe.
Maybe you can try the code like below using strcmp
>> [ds2.var2](strcmp('i01',{ds2.idxstr}))
ans = 21
I put together a function
function el = struct_pick(s, cdata, cnames, rname)
  % Pick an element from a struct by column and row name
  coldata = vertcat(s.(cdata));
  colnames = mat2cell(vertcat(s.(cnames)), ones(1, length(s)));
  % This assumes rname is a string
  flt = strcmp(colnames, rname);
  el = coldata(logical(flt));
endfunction
which is called with
% Pick an element by column and row name
cdata = 'var3';
cnames = 'idxstr';
rname = 'i01';
elem = struct_pick(ds2, cdata, cnames, rname);
and it seems to do the job.
I don't know if it is an unnecessarily contrived way of doing it.
Still have to deal with the possibility that the row names are not strings, as with
cnames = 'idx';
rname = 1;
EDIT: If the strings in idxstr are not all of the same length, this throws error: vertcat: cat: dimension mismatch.
The answer by Ander Biguri can handle this case.

Number of elements in a nested sublist (starting from the first Index)

My code is as follows :
Nums=[['D'],['A','B'],['A','C'],['C','A']]
Output should be D=0
A=2
C=1
B=0
I have tried as follows:
nums=[['D'],['A','B'],['A','C'],['C','A']]
d=dict()
for i in (nums):
    for j in i:
        if(len(i)==1):
            d[j]=0
        else:
            d[j]=1
print(d)
Am I on the right path to choose a dictionary to count the path?
Please post your suggestion in any data-structure
import collections
seen_dict = collections.Counter([x[0] for x in Nums if len(x) > 1])
To obtain a dictionary with the sum minus one of the occurrences you can perform a dictionary comprehension using the library Counter:
from collections import Counter # import library
flat = sum(nums, []) # returns a flat list
count = Counter(flat).items() # counts the elements (returns (element, count) pairs)
result = {c[0]:c[1]-1 for c in count} # dictionary comprehension returning the count minus one
Compressed form:
result = {c[0]:c[1]-1 for c in Counter(sum(nums, [])).items()}

Replace string characters with their word index

Note the two consecutive spaces in this string:
string = "Hello there  everyone!"
for i, c in enumerate(string):
    print(i, c)
0 H
1 e
2 l
3 l
4 o
5
6 t
7 h
8 e
9 r
10 e
11
12
13 e
14 v
15 e
16 r
17 y
18 o
19 n
20 e
21 !
How can I make a list len(string) long, with each value containing the word count up to that point in the string?
Expected output: 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2
The only way I could do it was by looping through each character, setting a space=True flag and increasing a counter each time I hit a non-space character when space == True. This is probably because I'm most proficient with C, but I would like to learn a more Pythonic way to solve this.
I feel like your solution is not that far from being pythonic. Maybe you can use the zip operator to iterate your string two by two and then just detect local changes (from a space to a letter -> this is a new word):
string = "Hello there  everyone!"
def word_index(phrase):
    nb_words = 0
    for a, b in zip(phrase, phrase[1:]):
        if a == " " and b != " ":
            nb_words += 1
        yield nb_words
print(list(word_index(string)))
This also makes use of generators, which are quite common in Python (see the documentation for the yield keyword). You can probably do the same by using itertools.accumulate instead of the for loop, but I'm not sure it wouldn't obfuscate the code (see the third item from The Zen of Python). Here is what it would look like. Note that I used a lambda function here, not because I think it's the best choice, but simply because I couldn't find a meaningful function name:
import itertools
def word_index(phrase):
    char_pairs = zip(phrase, phrase[1:])
    new_words = map(lambda p: int(p[0] == " " and p[1] != " "), char_pairs)
    return itertools.accumulate(new_words)
This second version, like the first one, returns an iterator. Using iterators is usually a good idea, as it makes no assumption about whether your user wants to materialize the values. If the user wants to turn the iterator into a list, they can always call list(it), as I did in the first piece of code. Iterators simply give you the values one by one: at any point in time, there is only a single value in memory:
for index in word_index(string):
    print(index)
Note that phrase[1:] makes a copy of the phrase, which means it doubles the memory used. This can be improved by using itertools.islice, which returns an iterator (and therefore only uses constant memory). The second version would, for example, look like this:
def word_index(phrase):
    char_pairs = zip(phrase, itertools.islice(phrase, 1, None))
    new_words = map(lambda p: int(p[0] == " " and p[1] != " "), char_pairs)
    return itertools.accumulate(new_words)
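One detail worth noting: pairing the string with itself shifted by one produces len(string) - 1 pairs, so the zip-based versions yield one value fewer than the question asks for. A possible variant that returns exactly one value per character is sketched below; word_index_full is an illustrative name, and the sketch assumes words are separated by plain spaces only.

```python
from itertools import accumulate

def word_index_full(phrase):
    """Return one word index per character of `phrase`."""
    # A word starts where a non-space follows a space. Position 0 never counts
    # as a start, so the first word gets index 0 as in the expected output.
    starts = [
        int(i > 0 and phrase[i - 1] == " " and c != " ")
        for i, c in enumerate(phrase)
    ]
    return list(accumulate(starts))

string = "Hello there  everyone!"  # two spaces before "everyone!"
result = word_index_full(string)
```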

Pythonic way to write for loops with nested if statements

Let's say I had a simple Python list that contained types of expenses, and I wanted to iterate over these expenses with a for loop. At each iteration, if the index points to the correct expense type, a counter is advanced by 1. I can easily write this with the code below, but it is not using a fast-running for loop.
array = ['Groceries', 'Restaurant', 'Groceries', 'Misc', 'Bills']
sum = 0
for i in range(len(array)):
    if array[i] == 'Groceries':
        sum += 1
Is there a more pythonic way to write this loop that accelerates the execution? I have seen examples that would look something like the code snippet below. NOTE: The code snippet below does not work, it is just an example of an accelerated format that I have seen before, but do not fully understand.
sum = [sum + 1 for i in array if array[i] == 'Groceries']
If it's just about counts, try collections.Counter:
from collections import Counter
array = ['Groceries', 'Restaurant', 'Groceries', 'Misc', 'Bills']
counts = Counter(array)
print(counts)
# Counter({'Groceries': 2, 'Bills': 1, 'Restaurant': 1, 'Misc': 1})
print(counts['Groceries'])
# 2
for i in range(len(array)):
is definitely NOT the Python-ic way of iterating over an array. It is VisualBasic thinking, from which you should free yourself.
If you want to iterate over an array, just iterate over it as follows:
array = ['Groceries', 'Restaurant', 'Groceries', 'Misc', 'Bills']
for eachItem in array:
    ...
What you do in the loop is up to you. If you want to count how many groceries are in the list, then you can do this:
array = ['Groceries', 'Restaurant', 'Groceries', 'Misc', 'Bills']
groceriesTotal = 0
for eachItem in array:
    if eachItem == 'Groceries':
        groceriesTotal = groceriesTotal + 1
This is simple, clear and pythonic enough to be readable by others.
You seem to think you need a list comprehension for this. But list comprehensions produce lists, and you want a scalar. Try array.count("Groceries").
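For reference, a short sketch of the scalar-producing forms mentioned above. The variable names groceries_total and groceries_total_alt are illustrative.

```python
array = ['Groceries', 'Restaurant', 'Groceries', 'Misc', 'Bills']

# list.count scans the list once and returns how many items equal the argument.
groceries_total = array.count('Groceries')

# Equivalent generator expression, handy when the condition is not plain equality.
groceries_total_alt = sum(1 for item in array if item == 'Groceries')
```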

Algorithm for generating all string combinations

Say I have a list of strings, like so:
strings = ["abc", "def", "ghij"]
Note that the length of a string in the list can vary.
The way you generate a new string is to take one letter from each element of the list, in order. Examples: "adg" and "bfi", but not "dch" because the letters are not in the same order in which they appear in the list. So in this case where I know that there are only three elements in the list, I could fairly easily generate all possible combinations with a nested for loop structure, something like this:
for a in strings[0]:
    for b in strings[1]:
        for c in strings[2]:
            print(a + b + c)
The issue arises for me when I don't know how long the list of strings is going to be beforehand. If the list is n elements long, then my solution requires n for loops to succeed.
Can any one point me towards a relatively simple solution? I was thinking of a DFS based solution where I turn each letter into a node and creating a connection between all letters in adjacent strings, but this seems like too much effort.
In python, you would use itertools.product
eg.:
>>> import itertools
>>> for comb in itertools.product("abc", "def", "ghij"):
...     print(''.join(comb))
...
adg
adh
adi
adj
aeg
aeh
...
Or, using an unpack:
>>> words = ["abc", "def", "ghij"]
>>> print('\n'.join(''.join(comb) for comb in itertools.product(*words)))
(same output)
The algorithm used by product is quite simple, as can be seen in its source code (Look particularly at function product_next). It basically enumerates all possible numbers in a mixed base system (where the multiplier for each digit position is the length of the corresponding word). A simple implementation which only works with strings and which does not implement the repeat keyword argument might be:
def product(words):
    if words and all(len(w) for w in words):
        indices = [0] * len(words)
        while True:
            # Change ''.join to tuple for a more accurate implementation
            yield ''.join(w[indices[i]] for i, w in enumerate(words))
            for i in range(len(indices), 0, -1):
                if indices[i - 1] == len(words[i - 1]) - 1:
                    indices[i - 1] = 0
                else:
                    indices[i - 1] += 1
                    break
            else:
                break
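The mixed-base view can also be made explicit by decoding a plain integer into one "digit" per word. The sketch below is only an illustration of that idea, not how itertools is implemented; nth_combination is a hypothetical helper name.

```python
import itertools

def nth_combination(words, n):
    """Return the n-th string (0-based) in the order itertools.product yields."""
    letters = []
    # The last word is the least significant digit, so peel digits off from the right.
    for word in reversed(words):
        n, digit = divmod(n, len(word))
        letters.append(word[digit])
    return ''.join(reversed(letters))

words = ["abc", "def", "ghij"]
total = 3 * 3 * 4  # product of the word lengths
all_combos = [nth_combination(words, n) for n in range(total)]
expected = [''.join(comb) for comb in itertools.product(*words)]
```

Counting n from 0 up to the product of the lengths then visits every combination exactly once, in the same order as the nested loops.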
From your solution it seems that you need as many for loops as there are strings. For each character you generate in the final string, you need a for loop to go through the list of possible characters. To do that, you can write a recursive solution. Every time you go one level deeper in the recursion, you run just one for loop. You have as many levels of recursion as there are strings.
Here is an example in python:
strings = ["abc", "def", "ghij"]
def rec(generated, k):
    if k==len(strings):
        print(generated)
        return
    for c in strings[k]:
        rec(generated + c, k+1)
rec("", 0)
Here's how I would do it in Javascript (I assume that every string contains no duplicate characters):
function getPermutations(arr)
{
    return getPermutationsHelper(arr, 0, "");
}
function getPermutationsHelper(arr, idx, prefix)
{
    var foundInCurrent = [];
    for(var i = 0; i < arr[idx].length; i++)
    {
        var str = prefix + arr[idx].charAt(i);
        if(idx < arr.length - 1)
        {
            foundInCurrent = foundInCurrent.concat(getPermutationsHelper(arr, idx + 1, str));
        }
        else
        {
            foundInCurrent.push(str);
        }
    }
    return foundInCurrent;
}
Basically, I'm using a recursive approach. My base case is when I have no more words left in my array, in which case I simply add prefix + c to my array for every c (character) in my last word.
Otherwise, I try each letter in the current word, and pass the prefix I've constructed on to the next word recursively.
For your example array, I got:
adg adh adi adj aeg aeh aei aej afg afh afi afj bdg bdh bdi
bdj beg beh bei bej bfg bfh bfi bfj cdg cdh cdi cdj ceg ceh
cei cej cfg cfh cfi cfj
