I need to generate a random string of a certain length – say ten characters, for the sake of argument – composed of the characters a, b, c, (, ), with the rule that parentheses must be matched.
So for example aaaaaaaaaa, abba()abba and ((((())))) are valid strings, but )aaaabbbb( is not.
What algorithm would generate a random string, uniformly sampled from the set of all strings consistent with those rules? (And run faster than 'keep generating strings without regard to the balancing rule, discard the ones that fail it', which could end up generating very many invalid strings before finding a valid one.)
A string consisting only of balanced parentheses (for any arbitrary pair of characters representing an open and a close) is called a "Dyck string", and the number of such strings with p pairs of parentheses is the pth Catalan number, which can be computed as C(2p, p)/(p+1), where C(a, b) is the binomial coefficient "a choose b"; a formula which would be much easier to make readable if only SO allowed MathJax. If you want to also allow k other non-parenthetic characters, you need to consider, for each number p ≤ n of pairs of balanced parentheses, the number of different combinations of the non-parentheses characters, k^(2n-2p), and the number of ways you can interpolate 2n-2p characters in a string of total length 2n, C(2n, 2p). If you sum all these counts for each possible value of p, you'll get the count of the total universe of possibilities, and you can then choose a random number in that range and select whichever of the individual p counts corresponds. Then you can select a random placement of random non-parentheses characters.
Finally, you need to get a uniformly distributed Dyck string; a simple procedure is to decompose the Dyck string into its shortest balanced prefix and the remainder (i.e. (A)B, where A and B are balanced subsequences). Select a random length for (A), weighted by the number of strings each length allows, then recursively generate a random A and a random B.
Precomputing the tables of counts (or memoising the function which generates them) will produce a speedup if you expect to generate a lot of random strings.
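As a rough illustration of the counting step described above, here is a minimal sketch. The names catalan and count_balanced are made up for this example, and it takes the total string length directly (the text above calls it 2n) rather than n.

```python
from math import comb

def catalan(p):
    # Number of Dyck strings with p pairs of parentheses: C(2p, p) / (p + 1).
    return comb(2 * p, p) // (p + 1)

def count_balanced(length, k):
    # Sum over the number of parenthesis pairs p:
    #   C(length, 2p)     ways to choose which positions hold parentheses,
    #   catalan(p)        ways to arrange them so that they balance,
    #   k ** (length-2p)  ways to fill the remaining positions with letters.
    return sum(
        comb(length, 2 * p) * catalan(p) * k ** (length - 2 * p)
        for p in range(length // 2 + 1)
    )

# Size of the universe for ten characters over a, b, c, (, ).
print(count_balanced(10, 3))
```

The individual terms of the sum are the per-p counts you would pick from when choosing how many pairs of parentheses the random string gets.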
Use dynamic programming to generate a data structure that knows how many there are for each choice, recursively. Then use that data structure to find a random choice.
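I seem to be the only person who uses the technique. And I always write it from scratch. But here is working code that hopefully explains it. Building the table takes time O(length_of_string^2 * (length_of_alphabet + 2)), because each string position has to track up to length_of_string possible paren depths, and it uses a similar amount of memory.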
import random
class DPPath:
    def __init__ (self):
        self.count = 0
        self.next = None
    def add_option(self, transition, tail):
        if self.next is None:
            self.next = {}
        self.next[transition] = tail
        self.count += tail.count
    def random (self):
        if 0 == self.count:
            return None
        else:
            # randrange stays exact even when count is a very large integer.
            return self.find(random.randrange(self.count))
    def find (self, pos):
        result = self._find(pos)
        return "".join(reversed(result))
    def _find (self, pos):
        if self.next is None:
            return []
        for transition, tail in self.next.items():
            if pos < tail.count:
                result = tail._find(pos)
                result.append(transition)
                return result
            else:
                pos -= tail.count
        raise IndexError("find out of range")
def balanced_dp (n, alphabet):
    # Record that there is 1 empty string with balanced parens.
    base_dp = DPPath()
    base_dp.count = 1
    dps = [base_dp]
    for _ in range(n):
        # We are working backwards towards the start.
        prev_dps = [DPPath()]
        for i in range(len(dps)):
            # prev_dps needs to be bigger in case of closed paren.
            prev_dps.append(DPPath())
            # If there are closed parens, we can open one.
            if 0 < i:
                prev_dps[i-1].add_option('(', dps[i])
            # alphabet chars don't change paren balance.
            for char in alphabet:
                prev_dps[i].add_option(char, dps[i])
            # Add a closed paren.
            prev_dps[i+1].add_option(")", dps[i])
        # And we are done with this string position.
        dps = prev_dps
    # Return the one that wound up balanced.
    return dps[0]
# And a quick demo of several random strings.
for _ in range(10):
    print(balanced_dp(10, "abc").random())
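One way to gain some confidence in a table like this is to compare its total count against brute-force enumeration for small lengths. The sketch below is only a checking aid; is_balanced and brute_force_count are hypothetical helper names, and the enumeration is exponential, so it is only practical for short strings.

```python
from itertools import product

def is_balanced(s):
    # Track the nesting depth; it must never go negative and must end at 0.
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

def brute_force_count(n, alphabet):
    # Enumerate every string of length n over alphabet plus "(" and ")"
    # and count the balanced ones.
    symbols = alphabet + "()"
    return sum(is_balanced(candidate) for candidate in product(symbols, repeat=n))

print(brute_force_count(4, "abc"))
```

For small n the result should equal the count attribute of the object returned by balanced_dp(n, alphabet), and every string returned by random() should pass is_balanced.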
I have some code (a few hundred lines) and I would like to reproduce it on a "real" controller.
I would like to predict how long the code would take to run by counting how many instructions it executes, broken down by type of operation (basic arithmetic, floating point, binary, etc.).
I wonder whether this is possible in Python (and if so, how? I haven't found anything yet).
I know there is a time feature to measure how long the code takes to run, but the computing power of my PC and that of the controller I plan to use are not the same.
I also tried counting by hand, but it is quite a pain and error-prone.
Ideal result would be like:
X number of basic arithmetic operation using INT
Y number of basic arithmetic operation using FLOAT
Z binary operation
etc ...
Thank you
Your question got me thinking. I wrote a little framework for how you might implement something like this. Basically you create your own number class and a collection to hold them all. Then you override the default operators and increment a counter every time you enter those functions. Note that this is NOT robust: there is no error checking, and it assumes that all operations are done with the custom class objects.
from collections import defaultdict # Acts like a dictionary, but every time you add a key, the value defaults to a specified value
class Collection(object): # Use this to hold your custom types
    def __init__(self):
        self.items = []
        return
    def add_item(self, item):
        self.items.append(item)
class myFloat(object): # Your custom float class
    def __init__(self, val, collection):
        """ val is the value, collection is the Collections object where we will place your object """
        self.val = float(val)
        self.op_counts = defaultdict(int) # a dictionary where values default to an integer, 0.
        collection.add_item(self) # Add this object to the collection
    def __add__(self, other): # Called when you use + on two myFloat
        self.op_counts["+"] += 1 # Adds 1 to the number of "+" used
        return self.val + other.val # returns the result.
    def __sub__(self, other): # Called when you use - on two myFloat
        self.op_counts["-"] += 1
        return self.val - other.val
    def __mul__(self, other): # Called when you use * on two myFloat
        self.op_counts["*"] += 1
        return self.val * other.val
    def __truediv__(self, other): # Called when you use / on two myFloat
        self.op_counts["/"] += 1
        return self.val / other.val
### EXAMPLE
import random
ops = ["+", "-", "*", "/"]
# We should create a separate Collection object for each custom type we have.
# Since we only have myFloat, we make one Collection object to hold the myFloats.
float_collection = Collection()
# This instantiates a myFloat object with val=7.12 and uses your float_collection
y = myFloat(7.12, float_collection)
for x in range(1, 1000):
    op = random.choice(ops) # Pick a random operation
    xx = myFloat(x, float_collection) # Instantiate another myFloat
    # Now perform the operation on xx and y. eval evaluates the string but
    # opens the door for security holes if you are worried about hackers. CAREFUL.
    eval(f"y{op}xx") # Remove this line and use the one below if your python < 3.6
    # eval("y{}xx".format(op))
print("### RESULTS ###")
result_op_counts = defaultdict(int) # We use this to count up our results
# Sorry for the confusing syntax. The items parameter of the Collection object
# is NOT the same as the items() method for dictionaries.
# float_collection.items is a list of your myFloats.
# the items() method for dictionary returns a dict_items object that you can iterate through.
# This loop tallies up all the results
for myFloatObj in float_collection.items:
    for op, ct in myFloatObj.op_counts.items():
        result_op_counts[op] += ct
# And this loop prints them.
for k,v in result_op_counts.items():
    print(f"{k}: {v} operations") # Remove this line and use the one below if your python < 3.6
    # print("{}: {} operations".format(k, v))
This outputs
### RESULTS ###
*: 227 operations
/: 247 operations
+: 275 operations
-: 250 operations
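One limitation of the framework above is that each operator returns a plain float, so counting stops as soon as you chain operations. A possible variation, sketched here under the assumption that a single class-wide tally is enough, wraps every result again. CountingFloat and its _apply helper are names invented for this sketch, not part of the code above.

```python
import operator
from collections import Counter

class CountingFloat:
    # One tally shared by all instances (an assumption of this sketch).
    op_counts = Counter()

    def __init__(self, val):
        self.val = float(val)

    def _apply(self, symbol, other, func):
        # Record the operation, then wrap the result so later operations count too.
        type(self).op_counts[symbol] += 1
        other_val = other.val if isinstance(other, CountingFloat) else float(other)
        return CountingFloat(func(self.val, other_val))

    def __add__(self, other):
        return self._apply("+", other, operator.add)

    def __sub__(self, other):
        return self._apply("-", other, operator.sub)

    def __mul__(self, other):
        return self._apply("*", other, operator.mul)

    def __truediv__(self, other):
        return self._apply("/", other, operator.truediv)

a = CountingFloat(2)
b = CountingFloat(3)
c = (a + b) * a - 1   # three operations, all counted
print(c.val, dict(CountingFloat.op_counts))
```

This still only covers the four operators shown and only expressions whose left operand is a CountingFloat, so treat it as a starting point rather than a complete profiler.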
def similarity(dna1, dna2):
    count = 0
    for i in range(len(dna1)):
        if dna1.lower()[i] == dna2.lower()[i]:
            count += 1
    return count / len(dna1)
def best_match(dna_list, dna):
    for dna_seq in dna_list:
        dna1 = dna_seq
        dna2 = dna
        dict = {dna_seq: similarity(dna1, dna2)}
    return dict
In best_match I am given a list that contains dna sequences (dna_list). Using the above function I need to compare each sequence with the given dna (dna). Then return the dna sequence with the highest similarity. I was trying to create a dictionary to store the dna sequence with their similarity value and then comparing similarities and then returning the corresponding sequence. However, I am stuck. When I run this it returns only one dna sequence and that similarity value; however, I am given three dna sequences. I'm also having trouble because the list of given dna sequences (in dna_list) can vary.
You are creating a new dictionary at every iteration. It doesn't stop iterating, it is just returning the value from the last iteration, ignoring the previous ones.
What you want is this:
result = dict()
for dna_seq in dna_list:
    dna1 = dna_seq
    dna2 = dna
    result[dna_seq] = similarity(dna1, dna2)
return result
which can be written more briefly with dictionary comprehension:
return {dna_seq:similarity(dna_seq, dna) for dna_seq in dna_list}
Apart from that, you shouldn't call a variable dict because it shadows the built-in type dict.
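If the end goal is just the sequence with the highest similarity, the dictionary may not be needed at all. The sketch below is one possible way to do it with the built-in max and a key function. It assumes all sequences have the same length as dna (as the original similarity function does) and that dna_list is not empty.

```python
def similarity(dna1, dna2):
    # Fraction of positions at which the two sequences agree, ignoring case.
    a, b = dna1.lower(), dna2.lower()
    return sum(x == y for x, y in zip(a, b)) / len(a)

def best_match(dna_list, dna):
    # max compares the sequences by their similarity to dna and returns
    # the sequence itself, however many sequences the list contains.
    return max(dna_list, key=lambda dna_seq: similarity(dna_seq, dna))

print(best_match(["AAAA", "ACGA", "ACGT"], "acgt"))
```

If several sequences tie for the highest similarity, max returns the first of them.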
I have an object that represents a list of objects. Each of these represents a word and its frequency of occurrence in a file.
Each object in the list has a word and the frequency with which it shows up in a file. Currently I'm getting an error that says "object is not iterable".
#each object in the list looks like this
#word = "hello", 4
def max(self):
    max_list = [None, 0]
    for item in WordList:
        if item.get_freq() > max_list[1]:
            max_list[0] = item.get_word()
            max_list[1] = item.get_freq()
    return max_list
How do I find the max and min frequency of these objects?
Note: this is in a class WordList, and get_word and get_freq are in the class that created the objects in the list.
Your question is not clear to me. Using 'object' in the title is at least once too many. The function does not use self. If WordList is a class, you cannot iterate it. Etc. However, I will try to give you an answer to what I think you are asking, which you might be able to adapt.
def minmax(items):
    """Return min and max frequency words in iterable items.
    Items represent a word and frequency accessed as indicated.
    """
    it = iter(items)
    # Initialize result variables
    try:
        item = next(it)
        min_item = max_item = item.get_word(), item.get_freq()
    except StopIteration:
        raise ValueError('cannot minmax empty iterable')
    # Update result variables
    for item in it:
        word = item.get_word()
        freq = item.get_freq()
        if freq < min_item[1]:
            min_item = word, freq
        elif freq > max_item[1]:
            max_item = word, freq
    return min_item, max_item
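For comparison, the same result can probably be had from the built-in min and max with a key function, at the cost of two passes over the data. In the sketch below, Word is a hypothetical stand-in for the asker's class (only get_word and get_freq are taken from the question), and minmax_builtin is a made-up name.

```python
class Word:
    # Hypothetical stand-in for the class that holds a word and its frequency.
    def __init__(self, word, freq):
        self._word = word
        self._freq = freq

    def get_word(self):
        return self._word

    def get_freq(self):
        return self._freq

def minmax_builtin(items):
    # Materialize the iterable so it can be scanned twice.
    items = list(items)
    if not items:
        raise ValueError('cannot minmax empty iterable')
    lowest = min(items, key=lambda item: item.get_freq())
    highest = max(items, key=lambda item: item.get_freq())
    return ((lowest.get_word(), lowest.get_freq()),
            (highest.get_word(), highest.get_freq()))

words = [Word("hello", 4), Word("a", 9), Word("zebra", 1)]
print(minmax_builtin(words))
```

Inside a WordList class you would pass whatever attribute actually holds the list of word objects, not the class itself.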
I would like to input a DNA sequence and make some sort of generator that yields sequences that have a certain frequency of mutations. For instance, say I have the DNA strand "ATGTCGTCACACACCGCAGATCCGTGTTTGAC", and I want to create mutations with a T->A frequency of 5%. How would I go about creating this? I know that creating random mutations can be done with code like this:
import random
def mutate(string, mutation, threshold):
    dna = list(string)
    for index, char in enumerate(dna):
        if char in mutation:
            if random.random() < threshold:
                dna[index] = mutation[char]
    return ''.join(dna)
But what I am truly not sure how to do is make a fixed mutation frequency. Anybody know how to do that? Thanks.
EDIT:
So should the formatting look like this if I'm using a byte array, because I'm getting an error:
import random
dna = "ATGTCGTACGTTTGACGTAGAG"
def mutate(dna, mutation, threshold):
    dna = bytearray(dna) #if you don't want to modify the original
    for index in range(len(dna)):
        if dna[index] in mutation and random.random() < threshold:
            dna[index] = mutation[char]
    return dna
mutate(dna, {"A": "T"}, 0.05)
print("my dna now:", dna)
error: "TypeError: string argument without an encoding"
EDIT 2:
import random
myDNA = bytearray("ATGTCGTCACACACCGCAGATCCGTGTTTGAC")
def mutate(dna, mutation, threshold):
    dna = myDNA # if you don't want to modify the original
    for index in range(len(dna)):
        if dna[index] in mutation and random.random() < threshold:
            dna[index] = mutation[char]
    return dna
mutate(dna, {"A": "T"}, 0.05)
print("my dna now:", dna)
yields an error
You asked me about a function that prints all possible mutations; here it is. The number of outputs grows exponentially with the length of your input, so the function only prints the possibilities and does not store them (that could consume a great deal of memory). I wrote a recursive function first; it should not be used with very large input. I will also add a non-recursive function that should work without those limits.
def print_all_possibilities(dna, mutations, index = 0, print = print):
    if index < 0: return #invalid value for index
    while index < len(dna):
        if chr(dna[index]) in mutations:
            print_all_possibilities(dna, mutations, index + 1)
            dnaCopy = bytearray(dna)
            dnaCopy[index] = ord(mutations[chr(dna[index])])
            print_all_possibilities(dnaCopy, mutations, index + 1)
            return
        index += 1
    print(dna.decode("ascii"))
# for testing
print_all_possibilities(bytearray(b"AAAATTTT"), {"A": "T"})
This works for me on python 3, I also can explain the code if you want.
Note: This function requires a bytearray as given in the function test.
Explanation:
This function searches for a place in dna where a mutation can happen. It starts at index, so it normally begins at 0 and goes to the end. That is what the while loop is for: it increases index every time it runs (it is basically a normal iteration, like a for loop). If the function finds a place where a mutation can happen (if chr(dna[index]) in mutations:), it copies the dna and lets the copy mutate (dnaCopy[index] = ord(mutations[chr(dna[index])])). Note that a bytearray is an array of numeric values, so I use chr and ord all the time to convert between string and int. The function then calls itself on both the original and the mutated copy to look for more possible mutations, skipping the part already scanned, so both calls begin at index + 1. The job of printing is thereby passed on to the called functions, so there is nothing left to do and we stop with return. If we find no further possible mutation, we print the dna ourselves, because we do not call the function again and no one else would print it.
It may sound complicated, but it is a more or less elegant solution. Also, to understand a recursion you have to understand a recursion, so don't bother yourself if you don't understand it for now. It could help if you try this out on a sheet of paper: Take an easy dna string "TTATTATTA" with the possible mutation "A" -> "T" (so we have 8 possible mutations) and do this: Go through the string from left to right and if you find a position, where the sequence can mutate (here it is just the "A"'s), write this string down again, this time let the string mutate at the given position, so that your second string is slightly different from the original. In the original and the copy, mark how far you came (maybe put a "|" after the letter you let mutate) and repeat this procedure with the copy as new original. If you don't find any possible mutation, then underline the string (This is the equivalent to printing it). At the end you should have 8 different strings all underlined. I hope that can help to understand it.
EDIT: Here is the non-recursive function:
def print_all_possibilities(dna, mutations, printings = -1, print = print):
    mut_possible = []
    for index in range(len(dna)):
        if chr(dna[index]) in mutations: mut_possible.append(index)
    if printings < 0: printings = 1 << len(mut_possible)
    for number in range(min(printings, 1 << len(mut_possible))):
        dnaCopy = bytearray(dna) # don't change the original
        counter = 0
        while number:
            if number & (1 << counter):
                index = mut_possible[counter]
                dnaCopy[index] = ord(mutations[chr(dna[index])])
                number &= ~(1 << counter)
            counter += 1
        print(dnaCopy.decode("ascii"))
# for testing
print_all_possibilities(bytearray(b"AAAATTTT"), {"A": "T"})
This function comes with an additional parameter, which can control the number of maximum outputs, e.g.
print_all_possibilities(bytearray(b"AAAATTTT"), {"A": "T"}, 5)
will only print 5 results.
Explanation:
If your dna has x possible positions where it can mutate, you have 2 ^ x (2 to the power x) possible results, because at every such place the dna can either mutate or not. This function finds all positions where your dna can mutate and stores them in mut_possible (that's the code of the first for-loop). So we have 2 ^ len(mut_possible) possible results (len(mut_possible) is the number of elements in mut_possible). I wrote 1 << len(mut_possible); it's the same value, computed with a bit shift.
If printings is a negative number, the function decides to print all possibilities and sets printings to the number of possibilities. If printings is positive but lower than the number of possibilities, the function prints only printings results, because min(printings, 1 << len(mut_possible)) returns the smaller number, which is printings. Otherwise the function prints all possibilities. The loop over range(...) prints one result per pass, so it executes the desired number of times, and number increases by one on every pass (e.g., range(4) yields 0, 1, 2, 3).
Next we use number to create a mutation. To understand this step you have to understand binary numbers. If our number is 10, it is 1010 in binary. These bits tell us at which places we have to modify our copy of the dna (dnaCopy). Reading from the lowest (rightmost) bit: the first bit is a 0, so we don't modify the first position where a mutation can happen; the next bit is a 1, so we modify the second such position; after that there is a 0, and so on. To "read" the bits we use the variable counter. number & (1 << counter) returns a non-zero value if bit number counter is set, and in that case we modify our dna at the corresponding position where a mutation can happen. That position is stored in mut_possible, so our desired position is mut_possible[counter]. After we have mutated our dna at that position, we set the bit to 0 to show that we have already handled it. That is done with number &= ~(1 << counter).
After that we increase counter to look at the next bit. The while-loop only continues while number is not 0, that is, while number still has at least one bit set (while there is at least one more position of dna to modify). When the while-loop is finished, our dnaCopy is fully modified and we print the result.
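The same "mutate or not at every position" idea can also be written with itertools.product, which may be easier to follow than the bit manipulation. This is only a sketch of an alternative: it works on plain strings rather than a bytearray, yields the results instead of printing them, and all_mutations is a name made up for it.

```python
from itertools import product

def all_mutations(dna, mutations):
    # For every character, list the options: keep it, or (if it can mutate)
    # replace it. Characters that cannot mutate have exactly one option.
    choices = [(c, mutations[c]) if c in mutations else (c,) for c in dna]
    # product walks through every combination of one option per position.
    for combination in product(*choices):
        yield "".join(combination)

for variant in all_mutations("TAAT", {"A": "T"}):
    print(variant)
```

Because it is a generator, you can stop early (for example with itertools.islice) instead of passing a printings limit.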
I hope these explanations help. I see that you are new to python, so take your time to let that sink in and contact me if you have any further questions.
After what I read this question seems easy to answer. The chance is high that I misunderstood something, so please correct me if I am wrong.
If you want a 5% chance of replacing each T with an A, then you should write (the dictionary maps the original character to its replacement)
mutate(yourString, {"T": "A"}, 0.05)
I also suggest that you use a bytearray instead of a string. A bytearray is similar to a string: it can only contain bytes (values from 0 to 255), while a string can contain more characters, but a bytearray is mutable. By using a bytearray you don't need to create your temporary list or to join it in the end. If you do that, your code looks like this:
import random
def mutate(dna, mutation, threshold):
    if isinstance(dna, str):
        dna = bytearray(dna, "utf-8")
    else:
        dna = bytearray(dna)
    for index in range(len(dna)):
        if chr(dna[index]) in mutation and random.random() < threshold:
            dna[index] = ord(mutation[chr(dna[index])])
    return dna
dna = "ATGTCGTACGTTTGACGTAGAG"
print("DNA first:", dna)
newDNA = mutate(dna, {"A": "T"}, 0.05)
print("DNA now:", newDNA.decode("ascii")) #use decode to make newDNA a string
After all the stupid problems I had with the bytearray version, here is the version that operates on strings:
import random
def mutate(string, mutation, threshold):
    dna = list(string)
    for index, char in enumerate(dna):
        if char in mutation:
            if random.random() < threshold:
                dna[index] = mutation[char]
    return ''.join(dna)
dna = "ATGTCGTACGTTTGACGTAGAG"
print("DNA first:", dna)
newDNA = mutate(dna, {"A": "T"}, 0.05)
print("DNA now:", newDNA)
If you use the string version with larger input, both the computation time and the memory used will grow. The bytearray version should be the better choice when you want to do this with much larger input.
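Both versions above mutate each eligible character independently with the given probability, so the number of mutations varies from run to run. If "fixed mutation frequency" is meant as "exactly this fraction of the eligible positions mutates", one possible approach is to pick the positions with random.sample. The following is a sketch under that assumption; mutate_fixed is a made-up name.

```python
import random

def mutate_fixed(dna, mutation, fraction):
    # Positions whose character is allowed to mutate.
    positions = [i for i, char in enumerate(dna) if char in mutation]
    # Exact number of positions to change, rounded to a whole number.
    how_many = round(fraction * len(positions))
    chars = list(dna)
    # random.sample picks how_many distinct positions, each equally likely.
    for i in random.sample(positions, how_many):
        chars[i] = mutation[chars[i]]
    return "".join(chars)

dna = "ATGTCGTCACACACCGCAGATCCGTGTTTGAC"
print("DNA first:", dna)
print("DNA now:  ", mutate_fixed(dna, {"T": "A"}, 0.25))
```

Note that on a short strand a small fraction rounds down to zero mutations: the example strand has 8 T's, so 5% of them rounds to 0, which is why the demo uses 25%.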