Generate A Random String With A Set of Banned Substrings

I want to generate a random string of a fixed length L. However, there is a set of "banned" substrings all of length b that cannot appear in the string. Is there a way to algorithmically generate this parent string?
Here is a small example:
I want a string that is 10 characters long -> XXXXXXXXXX
The banned substrings are {'AA', 'CC', 'AD'}
The string ABCDEFGHIJ is a valid string, but AABCDEFGHI is not.
For a small example it is relatively easy to randomly generate and then check the string, but as the set of banned substrings gets larger (or the length of the banned substrings gets smaller), the probability of randomly generating a valid string rapidly decreases.

This will be a fairly efficient approach, but it requires a lot of theory.
First, take your list of banned strings; this answer uses AA, BB, AD as its example. This can trivially be turned into a regular expression that matches any of them, namely /AA|BB|AD/, which you can then turn into a Nondeterministic Finite Automaton (NFA) and then a Deterministic Finite Automaton (DFA) for matching the regular expression. See these lecture notes for an example of how to do that. In this case the DFA will have the following states:
1. Matched nothing (yet)
2. Matched A
3. Matched B
4. End of match
And the transition rules will be:
1. If A go to state 2, if B go to state 3, else go to state 1.
2. If A or D go to state 4, if B go to state 3, else go to state 1.
3. If B go to state 4, if A go to state 2, else go to state 1.
4. Match complete, we're done. (Stay in state 4 forever.)
Now normally a DFA is used to find a match. We're going to use it as a way to find ways to avoid a match. So how will we do that?
The trick here is dynamic programming.
What we will do is create a table of:
by position in the string
by state of the match
how many ways there are to get here
how many ways we got here from the previous (position, state) pairs
In other words we go forward and create a table which starts like this:
[
    {1: {'count': 1}},
    {1: {'count': 24, 'prev': {1: 24}},
     2: {'count': 1, 'prev': {1: 1}},
     3: {'count': 1, 'prev': {1: 1}},
    },
    {1: {'count': 623, 'prev': {1: 576, 2: 23, 3: 24}},
     2: {'count': 25, 'prev': {1: 24, 3: 1}},
     3: {'count': 25, 'prev': {1: 24, 2: 1}},
     4: {'count': 3, 'prev': {2: 2, 3: 1}},
    },
    ...
]
By the time this is done, we know exactly how many ways we can wind up at the end of the string with a match (state 4), partial match (states 2 or 3) or not currently matching (state 1).
Now we generate the random string backwards. First we randomly pick the final state with odds based on the count, choosing only among the non-matching states (1, 2 or 3), since ending in state 4 means a banned substring appeared. Then from that final state's prev entry we pick the state we were in before the final one, again with odds based on the counts. Then we randomly pick a letter that causes that transition. This way we are picking the letter/state combination uniformly at random from all solutions.
Now we know what state we were in at the second to last letter. Pick the previous state, then pick a letter in the same way.
Continue back until finally we know the state we're in after we've picked the first letter, and then pick that letter.
It is a lot of logic to write, but it is all deterministic in time, and will let you pick random strings all day long once you've done the initial analysis.
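The idea above can be sketched in Python. This is a minimal sketch, not the answer's exact procedure: it counts completions backwards and samples letters forwards, which is the mirror image of the forward-count/backward-sample scheme described above and should give the same uniform distribution over valid strings. The function names (build_transitions, completion_counts, random_avoiding) are made up for illustration, and states are represented as prefixes of banned strings rather than as numbers.

```python
import random
import string

def build_transitions(banned, alphabet):
    """DFA over 'safe' states: each state is a proper prefix of a banned string."""
    prefixes = {""} | {w[:k] for w in banned for k in range(len(w))}
    trans = {}
    for p in prefixes:
        for ch in alphabet:
            s = p + ch
            if any(s.endswith(w) for w in banned):
                continue  # this letter would complete a banned substring
            while s not in prefixes:  # keep the longest suffix that is still a prefix
                s = s[1:]
            trans[p, ch] = s
    return prefixes, trans

def completion_counts(length, banned, alphabet):
    """ways[i][p] = number of valid ways to append i more letters from state p."""
    prefixes, trans = build_transitions(banned, alphabet)
    ways = [{p: 1 for p in prefixes}]
    for _ in range(length):
        prev = ways[-1]
        ways.append({p: sum(prev[trans[p, ch]] for ch in alphabet if (p, ch) in trans)
                     for p in prefixes})
    return trans, ways

def random_avoiding(length, banned, alphabet=string.ascii_uppercase, rng=random):
    trans, ways = completion_counts(length, banned, alphabet)
    if ways[length][""] == 0:
        raise ValueError("no valid string of this length exists")
    state, out = "", []
    for remaining in range(length, 0, -1):
        # Weight each allowed letter by how many valid completions follow it.
        options = [(ch, ways[remaining - 1][trans[state, ch]])
                   for ch in alphabet if (state, ch) in trans]
        r = rng.randrange(sum(w for _, w in options))  # exact integer arithmetic
        for ch, w in options:
            if r < w:
                break
            r -= w
        out.append(ch)
        state = trans[state, ch]
    return "".join(out)
```

For example, random_avoiding(10, ["AA", "BB", "AD"]) returns a 10-letter string containing none of the three banned pairs. For length 2 the count table reports 673 valid strings out of 676, which matches the three non-matching states of the table above (623 + 25 + 25).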

There are two ways.
The first way: keep generating a random string until the ban string does not appear in it.
#include <bits/stdc++.h>
using namespace std;
const int N = 1110000;
char ban[] = "AA";
char rnd[N];
int main() {
srand(time(0));
int n = 100;
do {
for (int i = 0; i < n; i++) rnd[i] = 'A'+rand()%26;
rnd[n] = 0;
} while (strstr(rnd, ban));
cout << rnd << endl;
}
I think this is the easiest way to implement.
However, this method has expected complexity up to O((26/25)^n*(n+b)) if the string to be created is very long and the ban string is very short.
For example, if ban="A" and n=10000, it will exceed any reasonable time limit!
The second way: build the string one character at a time, checking for the ban string as you go.
If you want to use this approach, you must know the KMP algorithm.
With this approach, we cannot use the standard library search function strstr.
#include <bits/stdc++.h>
using namespace std;
const int N = 1110000;
char ban[] = "A";
char rnd[N];
int b = strlen(ban);
int n = 10;
int *pre_kmp() {
int *pie=new int [b];
pie[0] = 0;
int k=0;
for(int i=1;i<b;i++)
{
while(k>0 && ban[k] != ban[i] )
{
k=pie[k-1];
}
if( ban[k] == ban[i] )
{
k=k+1;
}
pie[i]=k;
}
return pie;
}
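int main() {
    srand(time(0));
    int *pie = pre_kmp();
    int matched_pos = 0;  // length of the ban prefix matched so far
    for (int cur = 0; cur < n; cur++) {
        int next_pos;
        do {
            rnd[cur] = 'A' + rand() % 26;
            // Advance a copy of the KMP state, so a rejected letter
            // leaves matched_pos untouched for the retry.
            next_pos = matched_pos;
            while (next_pos > 0 && ban[next_pos] != rnd[cur])
                next_pos = pie[next_pos - 1];
            if (ban[next_pos] == rnd[cur])
                next_pos = next_pos + 1;
        } while (next_pos == b);  // this letter would complete the ban string
        matched_pos = next_pos;
    }
    rnd[n] = 0;
    cout << rnd << endl;
    delete[] pie;
}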
This algorithm's time complexity will be O(26 * n * b).
Of course you could also search manually without using the KMP algorithm. However, in this case, the time complexity becomes O(n*b), so the running time becomes very large when both the ban string and the generated string are long.
Hope this helps you.

Simple approach:
Create a map where the keys are the banned substrings excluding the final letter, and the values are the lists of allowed letters. Alternatively you can have the values be the lists of banned letters with slight modification.
Generate random letters, using the set of allowed letters for the (b-1)-length substring at the end of the letters already generated, or the full alphabet if that substring doesn't match a key.
This is O(n*b) unless there's a way to update the substring of the last b-1 characters in O(1).
Ruby Code
def randstr(strlen, banned_strings)
b = banned_strings[0].length # All banned strings are length b
alphabet = ('a'..'z').to_a
prefix_to_legal_letters = Hash.new {|h, k| h[k] = ('a'..'z').to_a}
banned_strings.each do |s|
prefix_to_legal_letters[s[0..-2]].delete(s[-1])
end
str = ''
while str.length < strlen
letters_to_use = alphabet
if str.length >= b-1
str_end = str[(-b+1)..-1]
if prefix_to_legal_letters.has_key?(str_end)
letters_to_use = prefix_to_legal_letters[str_end]
end
end
str += letters_to_use.sample()
end
return str
end
A silly example to show this works. The only legal letter after 'a' is 'z'. Every 'a' in the output is followed by a 'z'.
randstr(1000, ['aa', 'ab', 'ac', 'ad', 'ae', 'af', 'ag', 'ah', 'ai', 'aj', 'ak', 'al', 'am', 'an', 'ao',
'ap', 'aq', 'ar', 'as', 'at', 'au', 'av', 'aw', 'ax', 'ay'])
=> "gkowhxkhrknrxkbjxbjwiqohvvazwrjxjdekrujdyprjnmbjuklqsjdwzidhpgzzmnfbyjuptbpyezfeeydgdkpznvjfwziazrzohwvnitnfupdqxivtvkbazpvqzdzzsneslrazmhbjojyqowhvjhsrdgpbejicitprxzmkhgsuvvlyfizmhohorazemyhtbazvhvazdmnjmjzoggwmjjnrqxcmrdhxozbsjjdqgfjorazmtwtvvujpgivdxijowgxnkuxovncnivazmtykliqiielsfixuflfsgqbpevazozfsvfynhxyjpxtuolqooowazpyoukssknxdntzjjbqazxjttdblepsjzqgxmxvtrmjbgvuyfvspdrrohmtwhtdxfcvidswxtzbznszsqorpxdywbytsitxeziudmvlnluwmcqtfydxlocltozovhusbblfutbqjfjeslverzctxazyprazxzmazxwbdfkwxdwdqxnqhbcliwuitsnnpscbsjitoftblgjycpnxqsikpjqysmqiazdazwwjmeazxcbejthnlsskhazxazlrceyjtbmcpazscazvsjkqhiqfbjygjhyqazsbjymsovojfxynygzwmlhkmpvswpweqkkvmbrxhazpmiqrazcgprlbywmqpyvtphydniazovrkolzbslsosjvdqkgrjmcorqtgeazfwskjuhndszliiirtncmzrzhocyazyrhhpbcsmneuiktyswvgqwkzswkjnyuazggnreeccyidvrbxuskrlchjxnrrpljilogxmicjvmoeequbpkursrqsisqtfkruswnyftdgbjhwvcrlcnfecyfdnslmxztlbfxjhgeslqedrflthlhnlwopmsdjgochxwxhfhvqcixvxdjixcazggmexidtlhymkiyyfuhxufvxyfazmmwsbrlooqwfphgfhvthspvmyiazdazggpeuhnpjmzsazfxmsukpd"
Note that this code and approach need modification if the OP's statement that all banned substrings are length b is modified, either directly (by giving banned substrings of varying lengths) or implicitly (by having a (b-1)-length substring for which every letter is banned, in which case the (b-1)-length substring is effectively banned).
The modification here is the obvious one of checking all the possible key lengths in the map; it's still O(n*b) assuming b is the longest banned substring.

Related

Algorithm for generating all string combinations

Say I have a list of strings, like so:
strings = ["abc", "def", "ghij"]
Note that the length of a string in the list can vary.
The way you generate a new string is to take one letter from each element of the list, in order. Examples: "adg" and "bfi", but not "dch" because the letters are not in the same order in which they appear in the list. So in this case where I know that there are only three elements in the list, I could fairly easily generate all possible combinations with a nested for loop structure, something like this:
for i in strings[0]:
    for ii in strings[1]:
        for iii in strings[2]:
            print(i + ii + iii)
The issue arises for me when I don't know how long the list of strings is going to be beforehand. If the list is n elements long, then my solution requires n for loops to succeed.
Can any one point me towards a relatively simple solution? I was thinking of a DFS based solution where I turn each letter into a node and creating a connection between all letters in adjacent strings, but this seems like too much effort.
In Python, you would use itertools.product
e.g.:
>>> import itertools
>>> for comb in itertools.product("abc", "def", "ghij"):
...     print(''.join(comb))
adg
adh
adi
adj
aeg
aeh
...
Or, using an unpack:
>>> words = ["abc", "def", "ghij"]
>>> print('\n'.join(''.join(comb) for comb in itertools.product(*words)))
(same output)
The algorithm used by product is quite simple, as can be seen in its source code (Look particularly at function product_next). It basically enumerates all possible numbers in a mixed base system (where the multiplier for each digit position is the length of the corresponding word). A simple implementation which only works with strings and which does not implement the repeat keyword argument might be:
def product(words):
    if words and all(len(w) for w in words):
        indices = [0] * len(words)
        while True:
            # Change ''.join to tuple for a more accurate implementation
            yield ''.join(w[indices[i]] for i, w in enumerate(words))
            for i in range(len(indices), 0, -1):
                if indices[i - 1] == len(words[i - 1]) - 1:
                    indices[i - 1] = 0
                else:
                    indices[i - 1] += 1
                    break
            else:
                break
From your solution it seems that you need as many for loops as there are strings. For each character you generate in the final string, you need a for loop to go through the list of possible characters. To do that you can write a recursive solution. Every time you go one level deeper in the recursion, you run just one for loop. You have as many levels of recursion as there are strings.
Here is an example in python:
strings = ["abc", "def", "ghij"]
def rec(generated, k):
    if k == len(strings):
        print(generated)
        return
    for c in strings[k]:
        rec(generated + c, k + 1)
rec("", 0)
Here's how I would do it in Javascript (I assume that every string contains no duplicate characters):
function getPermutations(arr)
{
return getPermutationsHelper(arr, 0, "");
}
function getPermutationsHelper(arr, idx, prefix)
{
var foundInCurrent = [];
for(var i = 0; i < arr[idx].length; i++)
{
var str = prefix + arr[idx].charAt(i);
if(idx < arr.length - 1)
{
foundInCurrent = foundInCurrent.concat(getPermutationsHelper(arr, idx + 1, str));
}
else
{
foundInCurrent.push(str);
}
}
return foundInCurrent;
}
Basically, I'm using a recursive approach. My base case is when I have no more words left in my array, in which case I simply add prefix + c to my array for every c (character) in my last word.
Otherwise, I try each letter in the current word, and pass the prefix I've constructed on to the next word recursively.
For your example array, I got:
adg adh adi adj aeg aeh aei aej afg afh afi afj bdg bdh bdi
bdj beg beh bei bej bfg bfh bfi bfj cdg cdh cdi cdj ceg ceh
cei cej cfg cfh cfi cfj

Generating a mutation frequency on a DNA Strand using Python

I would like to input a DNA sequence and make some sort of generator that yields sequences that have a certain frequency of mutations. For instance, say I have the DNA strand "ATGTCGTCACACACCGCAGATCCGTGTTTGAC", and I want to create mutations with a T->A frequency of 5%. How would I go about creating this? I know that creating random mutations can be done with code like this:
import random
def mutate(string, mutation, threshold):
    dna = list(string)
    for index, char in enumerate(dna):
        if char in mutation:
            if random.random() < threshold:
                dna[index] = mutation[char]
    return ''.join(dna)
But what I am truly not sure how to do is make a fixed mutation frequency. Anybody know how to do that? Thanks.
EDIT:
So should the formatting look like this if I'm using a byte array, because I'm getting an error:
import random
dna = "ATGTCGTACGTTTGACGTAGAG"
def mutate(dna, mutation, threshold):
dna = bytearray(dna) #if you don't want to modify the original
for index in range(len(dna)):
if dna[index] in mutation and random.random() < threshold:
dna[index] = mutation[char]
return dna
mutate(dna, {"A": "T"}, 0.05)
print("my dna now:", dna)
error: "TypeError: string argument without an encoding"
EDIT 2:
import random
myDNA = bytearray("ATGTCGTCACACACCGCAGATCCGTGTTTGAC")
def mutate(dna, mutation, threshold):
dna = myDNA # if you don't want to modify the original
for index in range(len(dna)):
if dna[index] in mutation and random.random() < threshold:
dna[index] = mutation[char]
return dna
mutate(dna, {"A": "T"}, 0.05)
print("my dna now:", dna)
yields an error
You asked me about a function that prints all possible mutations, here it is. The number of outputs grows exponentially with your input data length, so the function only prints the possibilities and does not store them somehow (that could consume very much memory). I created a recursive function, this function should not be used with very large input, I also will add a non-recursive function that should work without problems or limits.
def print_all_possibilities(dna, mutations, index = 0, print = print):
    if index < 0: return #invalid value for index
    while index < len(dna):
        if chr(dna[index]) in mutations:
            print_all_possibilities(dna, mutations, index + 1)
            dnaCopy = bytearray(dna)
            dnaCopy[index] = ord(mutations[chr(dna[index])])
            print_all_possibilities(dnaCopy, mutations, index + 1)
            return
        index += 1
    print(dna.decode("ascii"))
# for testing
print_all_possibilities(bytearray(b"AAAATTTT"), {"A": "T"})
This works for me on python 3, I also can explain the code if you want.
Note: This function requires a bytearray as given in the function test.
Explanation:
This function searches for a place in dna where a mutation can happen. It starts at index, so it normally begins at 0 and goes to the end. That is what the while-loop is for: it increases index on every pass (it's basically a normal iteration, like a for loop). If the function finds a place where a mutation can happen (if chr(dna[index]) in mutations:), it copies the dna and lets the copy mutate (dnaCopy[index] = ord(mutations[chr(dna[index])]); note that a bytearray is an array of numeric values, so I use chr and ord all the time to convert between string and int). After that the function calls itself on both possible dna's to look for more possible mutations, but skips the part already scanned, so both calls begin at index + 1. The job of printing is thereby passed on to the called functions, so we have nothing more to do and stop with return. If we don't find any more possible mutations, we print our dna, because we don't call the function again, so no one else would do it.
It may sound complicated, but it is a more or less elegant solution. Also, to understand a recursion you have to understand a recursion, so don't bother yourself if you don't understand it for now. It could help if you try this out on a sheet of paper: Take an easy dna string "TTATTATTA" with the possible mutation "A" -> "T" (so we have 8 possible mutations) and do this: Go through the string from left to right and if you find a position, where the sequence can mutate (here it is just the "A"'s), write this string down again, this time let the string mutate at the given position, so that your second string is slightly different from the original. In the original and the copy, mark how far you came (maybe put a "|" after the letter you let mutate) and repeat this procedure with the copy as new original. If you don't find any possible mutation, then underline the string (This is the equivalent to printing it). At the end you should have 8 different strings all underlined. I hope that can help to understand it.
EDIT: Here is the non-recursive function:
def print_all_possibilities(dna, mutations, printings = -1, print = print):
    mut_possible = []
    for index in range(len(dna)):
        if chr(dna[index]) in mutations: mut_possible.append(index)
    if printings < 0: printings = 1 << len(mut_possible)
    for number in range(min(printings, 1 << len(mut_possible))):
        dnaCopy = bytearray(dna) # don't change the original
        counter = 0
        while number:
            if number & (1 << counter):
                index = mut_possible[counter]
                dnaCopy[index] = ord(mutations[chr(dna[index])])
                number &= ~(1 << counter)
            counter += 1
        print(dnaCopy.decode("ascii"))
# for testing
print_all_possibilities(bytearray(b"AAAATTTT"), {"A": "T"})
This function comes with an additional parameter, which can control the number of maximum outputs, e.g.
print_all_possibilities(bytearray(b"AAAATTTT"), {"A": "T"}, 5)
will only print 5 results.
Explanation:
If your dna has x possible positions where it can mutate, you have 2 ^ x possible mutations, because at every place the dna can mutate or not. This function finds all positions where your dna can mutate and stores them in mut_possible (that's the code of the first for-loop). Now mut_possible contains all positions where the dna can mutate, so we have 2 ^ len(mut_possible) possible mutations (len(mut_possible) is the number of elements in mut_possible). I wrote 1 << len(mut_possible), which is the same value computed with a bit shift.
If printings is a negative number, the function will print all possibilities and sets printings to the number of possibilities. If printings is non-negative but lower than the number of possibilities, the function will print only printings mutations, because min(printings, 1 << len(mut_possible)) returns the smaller number, which is printings. Otherwise, the function will print out all possibilities. The variable number then goes through range(...), so this loop, which prints one mutation per iteration, executes the desired number of times, and number increases by one each time (e.g., range(4) yields 0, 1, 2, 3).
Next we use number to create a mutation. To understand this step you have to understand binary numbers. If our number is 10, it is 1010 in binary. The bits, read from the least significant end, tell us at which places we have to modify our copy of the dna (dnaCopy). The first bit is a 0, so we don't modify the first position where a mutation can happen; the next bit is a 1, so we modify that position; after that there is a 0, and so on. To "read" the bits we use the variable counter. number & (1 << counter) returns a non-zero value if bit number counter is set, so if this bit is set we modify our dna at the corresponding position where a mutation can happen. That position is stored in mut_possible, so our desired position is mut_possible[counter]. After we mutate our dna at that position, we set the bit to 0 to show that we already modified this position. That is done with number &= ~(1 << counter). After that we increase counter to look at the other bits. The while-loop only continues while number is not 0, that is, while number has at least one bit set (while we still have to modify at least one position of the dna). Once dnaCopy is fully modified, the while-loop is finished and we print our result.
I hope these explanations could help. I see that you are new to python, so take yourself time to let that sink in and contact me if you have any further questions.
After what I read this question seems easy to answer. The chance is high that I misunderstood something, so please correct me if I am wrong.
If you want each T to have a 5% chance of changing into an A, then you should write
mutate(yourString, {"T": "A"}, 0.05)
I also suggest using a bytearray instead of a string. A bytearray is similar to a string; it can only contain bytes (values from 0 to 255) while a string can contain more characters, but a bytearray is mutable. By using a bytearray you don't need to create your temporary list or to join it in the end. If you do that, your code looks like this:
import random
def mutate(dna, mutation, threshold):
    if isinstance(dna, str):
        dna = bytearray(dna, "utf-8")
    else:
        dna = bytearray(dna)
    for index in range(len(dna)):
        if chr(dna[index]) in mutation and random.random() < threshold:
            dna[index] = ord(mutation[chr(dna[index])])
    return dna
dna = "ATGTCGTACGTTTGACGTAGAG"
print("DNA first:", dna)
newDNA = mutate(dna, {"A": "T"}, 0.05)
print("DNA now:", newDNA.decode("ascii")) #use decode to make newDNA a string
After all the stupid problems I had with the bytearray version, here is the version that operates on strings:
import random
def mutate(string, mutation, threshold):
    dna = list(string)
    for index, char in enumerate(dna):
        if char in mutation:
            if random.random() < threshold:
                dna[index] = mutation[char]
    return ''.join(dna)
dna = "ATGTCGTACGTTTGACGTAGAG"
print("DNA first:", dna)
newDNA = mutate(dna, {"A": "T"}, 0.05)
print("DNA now:", newDNA)
If you use the string version with larger input the computation time will be bigger as well as the memory used. The bytearray-version will be the best when you want to do this with much larger input.
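The versions above mutate each eligible letter independently, so the realized mutation rate varies from run to run. If "fixed mutation frequency" is taken to mean that exactly the given fraction of eligible positions changes, one possible sketch is to pick that many positions with random.sample. The name mutate_exact and the rounding rule are assumptions made for this illustration, not something the question specifies.

```python
import random

def mutate_exact(dna, mutation, fraction, rng=random):
    """Mutate exactly round(fraction * eligible) of the eligible positions."""
    chars = list(dna)
    # Positions whose letter has an entry in the mutation map.
    eligible = [i for i, c in enumerate(chars) if c in mutation]
    k = round(fraction * len(eligible))
    # rng.sample picks k distinct positions uniformly at random.
    for i in rng.sample(eligible, k):
        chars[i] = mutation[chars[i]]
    return "".join(chars)

dna = "ATGTCGTCACACACCGCAGATCCGTGTTTGAC"
print(mutate_exact(dna, {"T": "A"}, 0.5))
```

Note that with a short strand a small fraction can round down to zero mutations: the strand above has 8 T's, so a 5% frequency gives round(0.4) = 0 changes.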

Use of io.read() to reference tables. Lua 5.1

Alright, so here's what I'm trying to do, and I'm almost sure I don't know what phrase to use to find what I'm looking for so I'll do my best to be as clear as possible with limited terminology knowledge.
I'm using lua (or at least attempting to) to generate race tracks/segments for a D&D game.
So here's what I've done, but I can't figure out how to get one table to reference another. And no matter how hard I research or play around it just won't work.
Table Dump:
--Track Tables
local raceClass = { 'SS', 'S', 'A', 'B', 'C', 'D', 'N' }
local trackLength = { 50, 30, 25, 20, 15, 10, 5 }
local trackDifficulty = { 3, 3, 3, 2, 2, 1, 1 }
local trackTypes = { 'Straightaway', 'Curve', 'Hill', 'Water', 'Jump' }
So, just to explain a little here. First off, we have the class of the race. N for novice, SS for most difficult. Next, we have the length of the resulting track. SS is a 50 segment track. N is a 5 segment track. Each race class has a difficulty cap on each segment of track. SS, S and A all have a cap of 3. D and N have a cap of 1. Then, each segment of track is further broken down into its type. Those are generated using this slab of code:
--Track Generation
math.randomseed(os.time())
for i = 1, trackLength do
local trackTypeIndex = math.random(1, #trackTypes)
local SP = math.random(1, trackDifficulty) --SP is Stamina Cost for that segment.
print(tracktypes[trackTypeIndex]..' of SP '..SP)
end
io.read() --So it doesn't just close the window but waits for some user input.
Now it gets into the part that I start to lose myself in. I want the DM to be able to input the selected race class and get a generated list of the resulting race track.
--DM Input
print('Race Class? N, D, C, B, A, S, SS')
io.flush()
local classChoice = io.read()
So, the DM puts in the class choice, lets go with N. What I can't find is a piece of code that'll take the value for classChoice and pair it to raceClass. Then use that position to select the positions in trackLength, and trackDifficulty and finally run the remainder of the code segment Track Generation extrapolating the proper variables and print the results getting something along the lines of;
Straightaway of SP 1
Curve of SP 1
Water of SP 1
Water of SP 1
Jump of SP 1
For a low end novice race, which is only 5 segments long and has a max difficulty of 1. But with the higher classes would still generate the longer, more difficult tracks. I'm trying to be as specific as I can to minimize any confusion my inexperience in code could cost.
I think you'd be better off with a different table structure:
local raceClass = {
    SS = {50, 3},
    S = {30, 3},
    A = {25, 3},
    B = {20, 2},
    C = {15, 2},
    D = {10, 1},
    N = {5, 1},
}
Now, you can access all the data for a raceClass easily. The code would be like:
print "Race Class? N, D, C, B, A, S, SS"
io.flush()
local classChoice = (io.read "*line"):upper() -- To convert the input to upper case characters
if not raceClass[classChoice] then
-- Wrong input was given
end
local SP, Length = raceClass[classChoice][2], raceClass[classChoice][1]

Interesting strings algorithm

Given two finite sequences of strings, A and B, of length n each,
for example:
A1: "kk", A2: "ka", A3: "kkk", A4: "a"
B1: "ka", B2: "kakk", B3: "ak", B4: "k"
Give a finite sequence of indexes so that their concatenation for A
and B gives the same string. Repetitions allowed.
In this example I can't find the solution but for example if the list (1,2,2,4) is a solution then A1 + A2 + A2 + A4 = B1 + B2 + B2 + B4. In this example there are only two characters but it's already very difficult. Actually it's not even trivial to find the shortest solution with one character!
I tried to think of things... for example, the total sum of the lengths of the strings must be equal, and for the first and last string we need corresponding characters. But nothing else. I suppose for some sets of strings it's simply impossible. Can anyone think of a good algorithm?
EDIT: Apparently, this is the Post Correspondence Problem
There is no algorithm that can decide whether such an instance has a solution or not. If there were, the halting problem could be solved. Dirty trick...
Very tough question, but I'll give it a shot. This is more of a stream of consciousness than an answer, apologies in advance.
If I understand this correctly, you're given 2 equal sized sequences of strings, A and B, indexed from 1..n, say. You then have to find a sequence of indices i(1)..i(m) such that the concatenation of strings A(i(1))..A(i(m)) equals the concatenation of strings B(i(1))..B(i(m)), where m is the length of the sequence of indices.
The first thing I would observe is that there could be an infinite number of solutions. For example, given:
A { "x", "xx" }
B { "xx", "x" }
Possible solutions are:
{ 1, 2 }
{ 2, 1 }
{ 1, 2, 1, 2 }
{ 1, 2, 2, 1 }
{ 2, 1, 1, 2 }
{ 2, 1, 2, 1 }
{ 1, 2, 1, 2, 1, 2}
...
So how would you know when to stop? As soon as you had one solution? As soon as one of the solutions is a superset of another solution?
One place you could start would be by taking all the strings of minimum common length from both sets (in my example above, you would take the "x" from both) and searching for 2 equal strings that share a common index. You can then repeat this for strings of the next size up. For example, if the first set has 3 strings of length 1, 2 and 3 respectively, and the second set has strings of length 1, 3 and 3 respectively, you would take the strings of length 3. You would do this until you have no more strings. If you find any, then you have a solution to the problem.
It then gets harder when you have to start combining several strings as in my example above. The naive, brute force approach would be to start permuting all strings from both sets that, when concatenated, result in strings of the same length, then compare them. So in the below example:
A { "ga", "bag", "ac", "a" }
B { "ba", "g", "ag", "gac" }
You would start with sequences of length 2:
A { "ga", "ac" }, B { "ba", "ag" } (indices 1, 3)
A { "bag", "a" }, B { "g", "gac" } (indices 2, 4)
Comparing these gives "gaac" vs "baag" and "baga" vs "ggac", neither of which are equal, so there are no solutions there. Next, we would go for sequences of length 3:
A { "ga", "bag", "a" }, B { "ba", "g", "gac" } (indices 1, 2, 4)
A { "bag", "ac", "a" }, B { "g", "ag", "gac" } (indices 2, 3, 4)
Again, no solutions, so then we end up with sequences of size 4, of which we have no solutions.
Now it gets even trickier, as we have to start thinking about perhaps repeating some indices, and now my brain is melting.
I'm thinking looking for common subsequences in the strings might be helpful, and then using the remaining parts in the strings that were not matched. But I don't quite know how.
A very simple way is to just use something like a breadth-first search. This also has the advantage that the first solution found will have minimal size.
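A breadth-first search along these lines might look like the sketch below. Since the problem is undecidable in general, the search needs a depth cap; the parameter max_len and the function name pcp_bfs are assumptions for this illustration. Each search state keeps only the "overhang" (the part of one side not yet matched by the other), and branches where neither side is a prefix of the other are pruned.

```python
from collections import deque

def pcp_bfs(A, B, max_len):
    """Return a shortest 1-based index list with equal concatenations, or None."""
    # Queue entries: (indices so far, unmatched rest of A side, unmatched rest of B side).
    # At most one of the two rests is non-empty.
    queue = deque([((), "", "")])
    seen = set()
    while queue:
        indices, rest_a, rest_b = queue.popleft()
        if len(indices) == max_len:
            continue  # depth cap reached, do not extend further
        for i, (a, b) in enumerate(zip(A, B), start=1):
            sa, sb = rest_a + a, rest_b + b
            # Strip the common prefix; prune if the two sides disagree.
            if sa.startswith(sb):
                sa, sb = sa[len(sb):], ""
            elif sb.startswith(sa):
                sa, sb = "", sb[len(sa):]
            else:
                continue
            new_indices = indices + (i,)
            if not sa and not sb:
                return list(new_indices)  # both sides spell the same string
            if (sa, sb) not in seen:  # first visit is the shallowest one in BFS
                seen.add((sa, sb))
                queue.append((new_indices, sa, sb))
    return None

print(pcp_bfs(["x", "xx"], ["xx", "x"], max_len=4))
```

A return value of None only means no solution exists within max_len indices; it says nothing about longer solutions.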
It is not clear what the 'solution' you are looking for is, the longest solution? the shortest? all solutions?
Since you allow repetition there will be an infinite number of solutions for some inputs, so I will work on:
Find all sequences under a fixed length.
Written as pseudocode, in a manner very similar to F# sequence expressions:
// assumed true/false functions
let Eq aList bList =
// eg Eq "ab"::"c" "a" :: "bc" -> true
// Eq {} {} is _false_
let EitherStartsWith aList bList =
// eg "ab"::"c" "a" :: "b" -> true
// eg "a" "ab" -> true
// {} {} is _true_
let rec FindMatches A B aList bList level
= seq {
if level > 0
if Eq aList bList
yield aList
else if EitherStartsWith aList bList
Seq.zip3 A B seq {1..}
|> Seq.iter (func (a,b,i) ->
yield! FindMatches A B aList::(a,i) bList::(b,i) level - 1) }
let solution (A:seq<string>) (B:seq<string>) length =
FindMatches A B {} {} length
Some trivial constraints to reduce the problem:
The first selection pair must have a common start section.
the final selection pair must have a common end section.
Based on this we can quickly eliminate many inputs with no solution
let solution (A:seq<string>) (B:seq<string>) length =
let starts = {}
let ends = {}
Seq.zip3 A B seq {1..}
|> Seq.iter(fun (a,b,i) ->
if (a.StartsWith(b) or b.StartsWith(a))
start = starts :: (a,b,i)
if (a.EndsWith(b) or b.EndsWith(a))
ends = ends :: (a,b,i))
if List.is_empty starts || List.is_empty ends
Seq.empty // no solution
else
Seq.map (fun (a,b,i) ->
FindMatches A B {} :: (a,i) {} :: (b,i) length - 1)
starts
|> Seq.concat
Here's a suggestion for a brute force search. First generate number sequences bounded to the length of your list:
[0,0,..]
[1,0,..]
[2,0,..]
[3,0,..]
[0,1,..]
...
The number sequence length determines how many strings are going to be in any solution found.
Then generate A and B strings by using the numbers as indexes into your string lists:
public class FitSequence
{
private readonly string[] a;
private readonly string[] b;
public FitSequence(string[] a, string[] b)
{
this.a = a;
this.b = b;
}
private static string BuildString(string[] source, int[] indexes)
{
var s = new StringBuilder();
for (int i = 0; i < indexes.Length; ++i)
{
s.Append(source[indexes[i]]);
}
return s.ToString();
}
public IEnumerable<int[]> GetSequences(int length)
{
foreach (var numberSequence in new NumberSequence(length).GetNumbers(a.Length - 1))
{
string a1 = BuildString(a, numberSequence);
string b1 = BuildString(b, numberSequence);
if (a1 == b1)
yield return numberSequence;
}
}
}
This algorithm assumes equal lengths for A and B.
I tested your example with
static void Main(string[] args)
{
var a = new[] {"kk", "ka", "kkk", "a"};
var b = new[] {"ka", "kakk", "ak", "k"};
for (int i = 0; i < 100; ++i)
foreach (var sequence in new FitSequence(a, b).GetSequences(i))
{
foreach (int x in sequence)
Console.Write("{0} ", x);
Console.WriteLine();
}
}
but could not find any solutions, though it seemed to work for simple tests.

How to find similar patterns in lists/arrays of strings

I am looking for ways to find matching patterns in lists or arrays of strings, specifically in .NET, but algorithms or logic from other languages would be helpful.
Say I have 3 arrays (or in this specific case List(Of String))
Array1
"Do"
"Re"
"Mi"
"Fa"
"So"
"La"
"Ti"
Array2
"Mi"
"Fa"
"Jim"
"Bob"
"So"
Array3
"Jim"
"Bob"
"So"
"La"
"Ti"
I want to report on the occurrences of the matches of
("Mi", "Fa") In Arrays (1,2)
("So") In Arrays (1,2,3)
("Jim", "Bob", "So") in Arrays (2,3)
("So", "La", "Ti") in Arrays (1, 3)
...and any others.
I am using this to troubleshoot an issue, not to make a commercial product of it specifically, and would rather not do it by hand (there are 110 lists of about 100-200 items).
Are there any algorithms, existing code, or ideas that will help me accomplish finding the results described?
The simplest approach to code is to build a Dictionary that maps each item to the lists containing it, then loop through each item in each array. For each item, do this:
If the item is already in the dictionary, add the current list to its entry.
If the item is not in the dictionary, add it along with the current list.
Since, as you said, this is non-production code, performance doesn't matter, so this approach should work fine.
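A rough Python 3 sketch of that dictionary approach, using the three arrays from the question. The function name index_items is just an illustrative choice. It only reports single items shared between arrays, not runs of consecutive items.

```python
from collections import defaultdict

def index_items(arrays):
    """Map each item to the sorted 1-based numbers of the arrays that contain it."""
    seen = defaultdict(set)
    for number, array in enumerate(arrays, start=1):
        for item in array:
            seen[item].add(number)
    return {item: sorted(numbers) for item, numbers in seen.items()}

arrays = [
    ["Do", "Re", "Mi", "Fa", "So", "La", "Ti"],
    ["Mi", "Fa", "Jim", "Bob", "So"],
    ["Jim", "Bob", "So", "La", "Ti"],
]

# Keep only the items that occur in more than one array.
shared = {item: nums for item, nums in index_items(arrays).items() if len(nums) > 1}
for item, nums in shared.items():
    print(item, nums)
```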
Here's a solution using SuffixTree module to locate subsequences:
#!/usr/bin/env python
from SuffixTree import SubstringDict
from collections import defaultdict
from itertools import groupby
from operator import itemgetter
import sys

def main(stdout=sys.stdout):
    """
    >>> import StringIO
    >>> s = StringIO.StringIO()
    >>> main(stdout=s)
    >>> print s.getvalue()
    [['Mi', 'Fa']] In Arrays (1, 2)
    [['So', 'La', 'Ti']] In Arrays (1, 3)
    [['Jim', 'Bob', 'So']] In Arrays (2, 3)
    [['So']] In Arrays (1, 2, 3)
    <BLANKLINE>
    """
    # array of arrays of strings
    arr = [
        ["Do", "Re", "Mi", "Fa", "So", "La", "Ti",],
        ["Mi", "Fa", "Jim", "Bob", "So",],
        ["Jim", "Bob", "So", "La", "Ti",],
    ]
    #### # 28 seconds (27 seconds without lesser substrs inspection (see below))
    #### N, M = 100, 100
    #### import random
    #### arr = [[random.randrange(100) for _ in range(M)] for _ in range(N)]

    # convert to ASCII alphabet (for SubstringDict)
    letter2item = {}
    item2letter = {}
    c = 1
    for item in (i for a in arr for i in a):
        if item not in item2letter:
            c += 1
            if c == 128:
                raise ValueError("too many unique items; "
                                 "use a less restrictive alphabet for SuffixTree")
            letter = chr(c)
            letter2item[letter] = item
            item2letter[item] = letter
    arr_ascii = [''.join(item2letter[item] for item in a) for a in arr]

    # populate substring dict (based on SuffixTree)
    substring_dict = SubstringDict()
    for i, s in enumerate(arr_ascii):
        substring_dict[s] = i+1

    # enumerate all substrings, save those that occur more than once
    substr2indices = {}
    indices2substr = defaultdict(list)
    for str_ in arr_ascii:
        for start in range(len(str_)):
            for size in reversed(range(1, len(str_) - start + 1)):
                substr = str_[start:start + size]
                if substr not in substr2indices:
                    indices = substring_dict[substr] # O(n) SuffixTree
                    if len(indices) > 1:
                        substr2indices[substr] = indices
                        indices2substr[tuple(indices)].append(substr)
                        #### # inspect all lesser substrs
                        #### # it could diminish size of indices2substr[ind] list
                        #### # but it has no effect for input 100x100x100 (see above)
                        #### for i in reversed(range(len(substr))):
                        ####     s = substr[:i]
                        ####     if s in substr2indices: continue
                        ####     ind = substring_dict[s]
                        ####     if len(ind) > len(indices):
                        ####         substr2indices[s] = ind
                        ####         indices2substr[tuple(ind)].append(s)
                        ####         indices = ind
                        ####     else:
                        ####         assert set(ind) == set(indices), (ind, indices)
                        ####         substr2indices[s] = None
                        break # all sizes inspected, move to next `start`

    for indices, substrs in indices2substr.iteritems():
        # remove substrs that are substrs of other substrs
        substrs = sorted(substrs, key=len) # sort by size
        substrs = [p for i, p in enumerate(substrs)
                   if not any(p in q for q in substrs[i+1:])]
        # convert letters to items and print
        items = [map(letter2item.get, substr) for substr in substrs]
        print >>stdout, "%s In Arrays %s" % (items, indices)

if __name__=="__main__":
    # test
    import doctest; doctest.testmod()
    # measure performance
    import timeit
    t = timeit.Timer(stmt='main(stdout=s)',
                     setup='from __main__ import main; from cStringIO import StringIO as S; s = S()')
    N = 1000
    milliseconds = min(t.repeat(repeat=3, number=N))
    print("%.3g milliseconds" % (1e3*milliseconds/N))
It takes about 30 seconds to process 100 lists of 100 items each. SubstringDict in the above code might be emulated by grep -F -f.
Old solution:
In Python (save it to 'group_patterns.py' file):
#!/usr/bin/env python
from collections import defaultdict
from itertools import groupby

def issubseq(p, q):
    """Return whether `p` is a contiguous subsequence of `q`."""
    return any(p == q[i:i + len(p)] for i in range(len(q) - len(p) + 1))

arr = (("Do", "Re", "Mi", "Fa", "So", "La", "Ti",),
       ("Mi", "Fa", "Jim", "Bob", "So",),
       ("Jim", "Bob", "So", "La", "Ti",))

# store all patterns that occur at least twice
d = defaultdict(list) # a map: pattern -> indexes of arrays it's within
for i, a in enumerate(arr[:-1]):
    for j, q in enumerate(arr[i+1:]):
        for k in range(len(a)):
            for size in range(1, len(a)+1-k):
                p = a[k:k + size] # a pattern
                if issubseq(p, q): # `p` occurs at least twice
                    d[p] += [i+1, i+2+j]

# group patterns by arrays they are within
inarrays = lambda pair: sorted(set(pair[1]))
# d.items() and print(...) with one argument work in both Python 2 and 3
for key, group in groupby(sorted(d.items(), key=inarrays), key=inarrays):
    patterns = sorted((pair[0] for pair in group), key=len) # sort by size
    # remove patterns that are subsequences of other patterns
    patterns = [p for i, p in enumerate(patterns)
                if not any(issubseq(p, q) for q in patterns[i+1:])]
    print("%s In Arrays %s" % (patterns, key))
The following command:
$ python group_patterns.py
prints:
[('Mi', 'Fa')] In Arrays [1, 2]
[('So',)] In Arrays [1, 2, 3]
[('So', 'La', 'Ti')] In Arrays [1, 3]
[('Jim', 'Bob', 'So')] In Arrays [2, 3]
The solution is terribly inefficient.
As others have mentioned, the function you want is Intersect. If you are using .NET 3.5 or later, consider using LINQ's Intersect function.
See the following post for more information
Consider using LinqPAD to experiment.
www.linqpad.net
I hacked the program below in about 10 minutes of Perl. It's not perfect, it uses a global variable, and it just prints out the counts of every element seen by the program in each list, but it's a good approximation to what you want to do that's super-easy to code.
Do you actually want all combinations of all subsets of the elements common to each array? You could enumerate all of the elements in a smarter way if you wanted, but if you just wanted all elements that exist at least once in each array you could use the Unix command "grep -v 0" on the output below and that would show you the intersection of all elements common to all arrays. Your question is missing a little bit of detail, so I can't perfectly implement something that solves your problem.
If you do more data analysis than programming, scripting can be very useful for asking questions from textual data like this. If you don't know how to code in a scripting language like this, I would spend a month or two reading about how to code in Perl, Python or Ruby. They can be wonderful for one-off hacks such as this, especially in cases when you don't really know what you want. The time and brain cost of writing a program like this is really low, so that (if you're fast) you can write and re-write it several times while still exploring the definition of your question.
#!/usr/bin/perl -w
use strict;
my @Array1 = ( "Do", "Re", "Mi", "Fa", "So", "La", "Ti");
my @Array2 = ( "Mi", "Fa", "Jim", "Bob", "So" );
my @Array3 = ( "Jim", "Bob", "So", "La", "Ti" );
my %counts;
sub count_array {
    my $array = shift;
    my $name = shift;
    foreach my $e (@$array) {
        $counts{$e}{$name}++;
    }
}
count_array( \@Array1, "Array1" );
count_array( \@Array2, "Array2" );
count_array( \@Array3, "Array3" );
my @names = qw/ Array1 Array2 Array3 /;
print join ' ', ('element',@names);
print "\n";
my @unique_names = keys %counts;
foreach my $unique_name (@unique_names) {
    my @counts = map {
        if ( exists $counts{$unique_name}{$_} ) {
            $counts{$unique_name}{$_};
        } else {
            0;
        }
    }
    @names;
    print join ' ', ($unique_name,@counts);
    print "\n";
}
The program's output is:
element Array1 Array2 Array3
Ti 1 0 1
La 1 0 1
So 1 1 1
Mi 1 1 0
Fa 1 1 0
Do 1 0 0
Bob 0 1 1
Jim 0 1 1
Re 1 0 0
Looks like you want to use an intersection function on sets of data. Intersection picks out elements that are common in both (or more) sets.
The problem with this viewpoint is that a set cannot contain more than one of each element, i.e. no more than one Jim per set. It also cannot recognize several elements in a row as a pattern, although you can modify the comparison function to look further ahead and detect just that.
There may be functions like intersect that work on bags (which are like sets, but tolerate identical elements).
These functions should be standard in most languages or pretty easy to write yourself.
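As one example of a bag-style intersection, Python's collections.Counter supports the & operator, which keeps the minimum count of each element. The two lists below are made-up data, chosen only to show duplicates surviving the intersection. Like a set, a bag still ignores the order of elements, so it cannot detect runs.

```python
from collections import Counter

# Hypothetical data with repeated elements.
bag1 = Counter(["Jim", "Jim", "Bob", "So"])
bag2 = Counter(["Jim", "Jim", "Jim", "So", "La"])

# Multiset intersection: each element keeps the smaller of its two counts.
common = bag1 & bag2
print(sorted(common.elements()))  # ['Jim', 'Jim', 'So']
```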
I'm sure there's a MUCH more elegant way, but...
Since this isn't production code, why not just hack it and convert each array into a delimited string, then search each string for the pattern you want? i.e.
private void button1_Click(object sender, EventArgs e)
{
string[] array1 = { "do", "re", "mi", "fa", "so" };
string[] array2 = { "mi", "fa", "jim", "bob", "so" };
string[] pattern1 = { "mi", "fa" };
MessageBox.Show(FindPatternInArray(array1, pattern1).ToString());
MessageBox.Show(FindPatternInArray(array2, pattern1).ToString());
}
private bool FindPatternInArray(string[] AArray, string[] APattern)
{
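// Wrap both strings in the delimiter so only whole items match ("im" must not match inside "jim"); assumes no item contains "~".
return ("~" + string.Join("~", AArray) + "~").IndexOf("~" + string.Join("~", APattern) + "~") >= 0;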
}
First, start by counting how many times each item occurs.
You make a temp list: "Do" = 1, "Mi" = 2, "So" = 3, etc.
You can remove from the temp list every item whose count is 1 (e.g. "Do").
The temp list now contains the non-unique items (save it somewhere).
Now build lists of two items: take an item from the temp list and the item that follows it in the original lists.
"So" + "La" = 2, "Bob" + "So" = 2, etc.
Remove the ones with a count of 1.
You now have the pairs that appear at least twice (save them somewhere).
Now build lists of three items: take a pair from the temp list and the item that follows it in the original lists.
("Mi", "Fa") + "So" = 1, ("Mi", "Fa") + "Jim" = 1, ("So", "La") + "Ti" = 2
Remove the ones with a count of 1.
You now have the lists of three items that appear at least twice (save them).
Continue like this until the temp list is empty.
At the end, take all the saved lists and merge them.
This algorithm is not optimal (I think we can do better with suitable data structures), but it is easy to implement :)
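The steps above could be sketched in Python 3 roughly as follows. The function name repeated_runs is an illustrative choice, and one simplification is assumed: instead of counting raw occurrences, the sketch records which arrays each run appears in and keeps runs found in at least two arrays.

```python
from collections import defaultdict

def repeated_runs(arrays):
    """Return {run (tuple): sorted 1-based array numbers} for runs found in 2+ arrays."""
    # Level 1: single items and the arrays that contain them.
    current = defaultdict(set)
    for number, array in enumerate(arrays, start=1):
        for item in array:
            current[(item,)].add(number)
    saved = {}
    size = 1
    while True:
        # Drop runs that appear in only one array; stop when nothing is left.
        current = {run: nums for run, nums in current.items() if len(nums) > 1}
        if not current:
            break
        saved.update(current)
        # Extend each surviving run by the item that follows it in the arrays.
        size += 1
        longer = defaultdict(set)
        for number, array in enumerate(arrays, start=1):
            for start in range(len(array) - size + 1):
                run = tuple(array[start:start + size])
                if run[:-1] in current:
                    longer[run].add(number)
        current = longer
    return {run: sorted(nums) for run, nums in saved.items()}

arrays = [
    ["Do", "Re", "Mi", "Fa", "So", "La", "Ti"],
    ["Mi", "Fa", "Jim", "Bob", "So"],
    ["Jim", "Bob", "So", "La", "Ti"],
]
for run, nums in repeated_runs(arrays).items():
    print(run, nums)
```

The result also contains the shorter runs, such as ("So", "La"), that are part of longer ones. Filtering those out is the merge step and is left out of the sketch.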
Suppose a password consisted of a string of nine characters from the English alphabet (26 characters). If each possible password could be tested in a millisecond, how long would it take to test all possible passwords?
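The arithmetic can be checked with a few lines of Python 3. This is only a sketch: it assumes a single letter case (26 symbols per position) and a 365.25-day year, and the constant names are made up here.

```python
ALPHABET_SIZE = 26      # assumption: one letter case only
PASSWORD_LENGTH = 9
MS_PER_TEST = 1         # one millisecond per tested password

total = ALPHABET_SIZE ** PASSWORD_LENGTH            # number of possible passwords
seconds = total * MS_PER_TEST / 1000
years = seconds / (365.25 * 24 * 3600)

print(total)            # 5429503678976
print(round(years, 1))  # roughly 172 years
```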

Resources