finding DNA codon starting with a or t with regular expression - python-3.x

Given a DNA sequence of codons, I want to get the precentage of codons starting with A or T.
The DNA sequence would be something like: dna = "atgagtgaaagttaacgt". Eeach sequence starting in the 0,3,6 etc. positions <-and that's the source of the problem as far as my intentions goes
What we wrote and works:
import re
DNA = "atgagtgaaagttaacgt"
def atPct(dna):
'''
gets a dna sequence and returns the %
of sequences that are starting with a or t
'''
numOfCodons = re.findall(r'[a|t|c|g]{3}',dna) # [a|t][a|t|c|g]{2} won't give neceseraly in the pos % 3==0 subseq
count = 0
for x in numOfCodons:
if str(x)[0]== 'a' or str(x)[0]== 't':
count+=1
print(str(x))
return 100*count/len(numOfCodons)
print(atPct(DNA))
My goal is to find it without that for loop, somehow I feel there's a way more elegant way to do this just with regular expressions but I might be wrong, if there's a better way i would be glad to learn how! is there a way to cross the location and "[a|t][a|t|c|g]{2}" as a regular expression?
p.s question assume it's a valid dna sequence that's why i haven't even checked that

A loop will be faster than doing it another way. Still, you can use sum and a generator expression (another SO answer) to improve readability:
import re
def atPct(dna):
# Find all sequences
numSeqs = re.findall('[atgc]{3}', DNA)
# Count all sequences that start with 'a' or 't'
atSeqs = sum(1 for seq in numSeqs if re.match('[at]', seq))
# Return the calculation
return 100 * len(numSeqs) / atSeqs
DNA = "atgagtgaaagttaacgt"
print( atPct(DNA) )

So you just want to find out the percentage of times a or t appear in the first of every three characters in the string? Use the step parameter of a slice:
def atPct(dna):
starts = dna[::3] # Every third character of dna, starting with the first
return (starts.count('a') + starts.count('t')) / len(starts)

Related

Generating a random string with matched brackets

I need to generate a random string of a certain length – say ten characters, for the sake of argument – composed of the characters a, b, c, (, ), with the rule that parentheses must be matched.
So for example aaaaaaaaaa, abba()abba and ((((())))) are valid strings, but )aaaabbbb( is not.
What algorithm would generate a random string, uniformly sampled from the set of all strings consistent with those rules? (And run faster than 'keep generating strings without regard to the balancing rule, discard the ones that fail it', which could end up generating very many invalid strings before finding a valid one.)
A string consisting only of balanced parentheses (for any arbitrary pair of characters representing an open and a close) is called a "Dyck string", and the number of such strings with p pairs of parentheses is the pth Catalan number, which can be computed as (2pCp)/(p+1), a formula which would be much easier to make readable if only SO allowed MathJax. If you want to also allow k other non-parenthetic characters, you need to consider, for each number p ≤ n of pairs of balanced parentheses, the number of different combinations of the non-parentheses characters (k(2n-2p)) and the number of ways you can interpolate 2n-2p characters in a string of total length 2n (2nC2p). If you sum all these counts for each possible value of p, you'll get the count of the total universe of possibilities, and you can then choose a random number in that range and select whichever of the individual p counts corresponds. Then you can select a random placement of random non-parentheses characters.
Finally, you need to get a uniformly distributed Dyck string; a simple procedure is to decompose the Dyck string into it's shortest balanced prefix and the remainder (i.e. (A)B, where A and B are balanced subsequences). Select a random length for (A), then recursively generate a random A and a random B.
Precomputing the tables of counts (or memoising the function which generates them) will produce a speedup if you expect to generate a lot of random strings.
Use dynamic programming to generate a data structure that knows how many there are for each choice, recursively. Then use that data structure to find a random choice.
I seem to be the only person who uses the technique. And I always write it from scratch. But here is working code that hopefully explains it. It will take time O(length_of_string * (length_of_alphabet + 2)) and similar data.
import random
class DPPath:
def __init__ (self):
self.count = 0
self.next = None
def add_option(self, transition, tail):
if self.next is None:
self.next = {}
self.next[transition] = tail
self.count += tail.count
def random (self):
if 0 == self.count:
return None
else:
return self.find(int(random.random() * self.count))
def find (self, pos):
result = self._find(pos)
return "".join(reversed(result))
def _find (self, pos):
if self.next is None:
return []
for transition, tail in self.next.items():
if pos < tail.count:
result = tail._find(pos)
result.append(transition)
return result
else:
pos -= tail.count
raise IndexException("find out of range")
def balanced_dp (n, alphabet):
# Record that there is 1 empty string with balanced parens.
base_dp = DPPath()
base_dp.count = 1
dps = [base_dp]
for _ in range(n):
# We are working backwards towards the start.
prev_dps = [DPPath()]
for i in range(len(dps)):
# prev_dps needs to be bigger in case of closed paren.
prev_dps.append(DPPath())
# If there are closed parens, we can open one.
if 0 < i:
prev_dps[i-1].add_option('(', dps[i])
# alphabet chars don't change paren balance.
for char in alphabet:
prev_dps[i].add_option(char, dps[i])
# Add a closed paren.
prev_dps[i+1].add_option(")", dps[i])
# And we are done with this string position.
dps = prev_dps
# Return the one that wound up balanced.
return dps[0]
# And a quick demo of several random strings.
for _ in range(10):
print(balanced_dp(10, "abc").random())

Palindrome problem - Trying to check 2 lists for equality python3.9

I'm writing a program to check if a given user input is a palindrome or not. if it is the program should print "Yes", if not "no". I realize that this program is entirely too complex since I actually only needed to check the whole word using the reversed() function, but I ended up making it quite complex by splitting the word into two lists and then checking the lists against each other.
Despite that, I'm not clear why the last conditional isn't returning the expected "Yes" when I pass it "racecar" as an input. When I print the lists in line 23 and 24, I get two lists that are identical, but then when I compare them in the conditional, I always get "No" meaning they are not equal to each other. can anyone explain why this is? I've tried to convert the lists to strings but no luck.
def odd_or_even(a): # function for determining if odd or even
if len(a) % 2 == 0:
return True
else:
return False
the_string = input("How about a word?\n")
x = int(len(the_string))
odd_or_even(the_string) # find out if the word has an odd or an even number of characters
if odd_or_even(the_string) == True: # if even
for i in range(x):
first_half = the_string[0:int((x/2))] #create a list with part 1
second_half = the_string[(x-(int((x/2)))):x] #create a list with part 2
else: #if odd
for i in range(x):
first_half = the_string[:(int((x-1)/2))] #create a list with part 1 without the middle index
second_half = the_string[int(int(x-1)/2)+1:] #create a list with part 2 without the middle index
print(list(reversed(second_half)))
print(list(first_half))
if first_half == reversed(second_half): ##### NOT WORKING BUT DONT KNOW WHY #####
print("Yes")
else:
print("No")
Despite your comments first_half and second_half are substrings of your input, not lists. When you print them out, you're converting them to lists, but in the comparison, you do not convert first_half or reversed(second_half). Thus you are comparing a string to an iterator (returned by reversed), which will always be false.
So a basic fix is to do the conversion for the if, just like you did when printing the lists out:
if list(first_half) == list(reversed(second_half)):
A better fix might be to compare as strings, by making one of the slices use a step of -1, so you don't need to use reversed. Try second_half = the_string[-1:x//2:-1] (or similar, you probably need to tweak either the even or odd case by one). Or you could use the "alien smiley" slice to reverse the string after you slice it out of the input: second_half = second_half[::-1].
There are a few other oddities in your code, like your for i in range(x) loop that overwrites all of its results except the last one. Just use x - 1 in the slicing code and you don't need that loop at all. You're also calling int a lot more often than you need to (if you used // instead of /, you could get rid of literally all of the int calls).

Conditional string splitting in Python using regex or loops

I have a string c that has a respective repetitive pattern of:
integer from 0 to 10,
character S, D, or T,
special character * or # (optional)
For instance, c could look like 1D2S#10S, or 1D#2S*3S, or so on.
I have a further calculation to make with c, but in order to do so I thought splitting c into substrings that include integer, character, and a possible special character would be helpful. Hence, for example, 1D2S#10S would be split into 1D, 2S#, 10S. 1D#2S*3S would be split into 1D#, 2S*, 3S.
I am aware that such string split can be concisely done with re.split(), but since this is quite conditional, I wasn't able to find an optimal way to split this. Instead, I tried using a for loop:
clist = []
n = 0
for i in range(len(c)):
if type(c[i]) != 'int':
if type(c[i+1]) == 'int':
clist.append(c[n:i+1])
n = i
else:
clist.append(c[n:i+2])
n = i
This raises an indexing issue, but even despite that I can tell it isn't optimal. Is there a way to use re to split it accordingly?
Use re.findall():
>>> re.findall(r'\d*[SDT][\*#]?', '1D2S#10S')
['1D', '2S#', '10S']
>>> re.findall(r'\d*[SDT][\*#]?', '1D#2S*3S')
['1D#', '2S*', '3S']

How to count number of substrings in python, if substrings overlap?

The count() function returns the number of times a substring occurs in a string, but it fails in case of overlapping strings.
Let's say my input is:
^_^_^-_-
I want to find how many times ^_^ occurs in the string.
mystr=input()
happy=mystr.count('^_^')
sad=mystr.count('-_-')
print(happy)
print(sad)
Output is:
1
1
I am expecting:
2
1
How can I achieve the desired result?
New Version
You can solve this problem without writing any explicit loops using regex. As #abhijith-pk's answer cleverly suggests, you can search for the first character only, with the remainder being placed in a positive lookahead, which will allow you to make the match with overlaps:
def count_overlapping(string, pattern):
regex = '{}(?={})'.format(re.escape(pattern[:1]), re.escape(pattern[1:]))
# Consume iterator, get count with minimal memory usage
return sum(1 for _ in re.finditer(regex, string))
[IDEOne Link]
Using [:1] and [1:] for the indices allows the function to handle the empty string without special processing, while using [0] and [1:] for the indices would not.
Old Version
You can always write your own routine using the fact that str.find allows you to specify a starting index. This routine will not be very efficient, but it should work:
def count_overlapping(string, pattern):
count = 0
start = -1
while True:
start = string.find(pattern, start + 1)
if start < 0:
return count
count += 1
[IDEOne Link]
Usage
Both versions return identical results. A sample usage would be:
>>> mystr = '^_^_^-_-'
>>> count_overlapping(mystr, '^_^')
2
>>> count_overlapping(mystr, '-_-')
1
>>> count_overlapping(mystr, '')
9
>>> count_overlapping(mystr, 'x')
0
Notice that the empty string is found len(mystr) + 1 times. I consider this to be intuitively correct because it is effectively between and around every character.
you can use regex for a quick and dirty solution :
import re
mystr='^_^_^-_-'
print(len(re.findall('\^(?=_\^)',mystr)))
You need something like this
def count_substr(string,substr):
n=len(substr)
count=0
for i in range(len(string)-len(substr)+1):
if(string[i:i+len(substr)] == substr):
count+=1
return count
mystr=input()
print(count_substr(mystr,'121'))
Input: 12121990
Output: 2

Generating a mutation frequency on a DNA Strand using Python

I would like to input a DNA sequence and make some sort of generator that yields sequences that have a certain frequency of mutations. For instance, say I have the DNA strand "ATGTCGTCACACACCGCAGATCCGTGTTTGAC", and I want to create mutations with a T->A frequency of 5%. How would I go about to creating this? I know that creating random mutations can be done with a code like this:
import random
def mutate(string, mutation, threshold):
dna = list(string)
for index, char in enumerate(dna):
if char in mutation:
if random.random() < threshold:
dna[index] = mutation[char]
return ''.join(dna)
But what I am truly not sure how to do is make a fixed mutation frequency. Anybody know how to do that? Thanks.
EDIT:
So should the formatting look like this if I'm using a byte array, because I'm getting an error:
import random
dna = "ATGTCGTACGTTTGACGTAGAG"
def mutate(dna, mutation, threshold):
dna = bytearray(dna) #if you don't want to modify the original
for index in range(len(dna)):
if dna[index] in mutation and random.random() < threshold:
dna[index] = mutation[char]
return dna
mutate(dna, {"A": "T"}, 0.05)
print("my dna now:", dna)
error: "TypeError: string argument without an encoding"
EDIT 2:
import random
myDNA = bytearray("ATGTCGTCACACACCGCAGATCCGTGTTTGAC")
def mutate(dna, mutation, threshold):
dna = myDNA # if you don't want to modify the original
for index in range(len(dna)):
if dna[index] in mutation and random.random() < threshold:
dna[index] = mutation[char]
return dna
mutate(dna, {"A": "T"}, 0.05)
print("my dna now:", dna)
yields an error
You asked me about a function that prints all possible mutations, here it is. The number of outputs grows exponentially with your input data length, so the function only prints the possibilities and does not store them somehow (that could consume very much memory). I created a recursive function, this function should not be used with very large input, I also will add a non-recursive function that should work without problems or limits.
def print_all_possibilities(dna, mutations, index = 0, print = print):
if index < 0: return #invalid value for index
while index < len(dna):
if chr(dna[index]) in mutations:
print_all_possibilities(dna, mutations, index + 1)
dnaCopy = bytearray(dna)
dnaCopy[index] = ord(mutations[chr(dna[index])])
print_all_possibilities(dnaCopy, mutations, index + 1)
return
index += 1
print(dna.decode("ascii"))
# for testing
print_all_possibilities(bytearray(b"AAAATTTT"), {"A": "T"})
This works for me on python 3, I also can explain the code if you want.
Note: This function requires a bytearray as given in the function test.
Explanation:
This function searches for a place in dna where a mutation can happen, it starts at index, so it normally begins with 0 and goes to the end. That's why the while-loop, which increases index every time the loop is executed, is for (it's basically a normal iteration like a for loop). If the function finds a place where a mutation can happen (if chr(dna[index]) in mutations:), then it copies the dna and lets the second one mutate (dnaCopy[index] = ord(mutations[chr(dna[index])]), Note that a bytearray is an array of numeric values, so I use chr and ord all the time to change between string and int). After that the function is called again to look for more possible mutations, so the functions look again for possible mutations in both possible dna's, but they skip the point they have already scanned, so they begin at index + 1. After that the order to print is passed to the called functions print_all_possibilities, so we don't have to do anything anymore and quit the executioning with return. If we don't find any mutations anymore we print our possible dna, because we don't call the function again, so no one else would do it.
It may sound complicated, but it is a more or less elegant solution. Also, to understand a recursion you have to understand a recursion, so don't bother yourself if you don't understand it for now. It could help if you try this out on a sheet of paper: Take an easy dna string "TTATTATTA" with the possible mutation "A" -> "T" (so we have 8 possible mutations) and do this: Go through the string from left to right and if you find a position, where the sequence can mutate (here it is just the "A"'s), write this string down again, this time let the string mutate at the given position, so that your second string is slightly different from the original. In the original and the copy, mark how far you came (maybe put a "|" after the letter you let mutate) and repeat this procedure with the copy as new original. If you don't find any possible mutation, then underline the string (This is the equivalent to printing it). At the end you should have 8 different strings all underlined. I hope that can help to understand it.
EDIT: Here is the non-recursive function:
def print_all_possibilities(dna, mutations, printings = -1, print = print):
mut_possible = []
for index in range(len(dna)):
if chr(dna[index]) in mutations: mut_possible.append(index)
if printings < 0: printings = 1 << len(mut_possible)
for number in range(min(printings, 1 << len(mut_possible)):
dnaCopy = bytearray(dna) # don't change the original
counter = 0
while number:
if number & (1 << counter):
index = mut_possible[counter]
dnaCopy[index] = ord(mutations[chr(dna[index])])
number &= ~(1 << counter)
counter += 1
print(dnaCopy.decode("ascii"))
# for testing
print_all_possibilities(bytearray(b"AAAATTTT"), {"A": "T"})
This function comes with an additional parameter, which can control the number of maximum outputs, e.g.
print_all_possibilities(bytearray(b"AAAATTTT"), {"A": "T"}, 5)
will only print 5 results.
Explanation:
If your dna has x possible positions where it can mutate, you have 2 ^ x possible mutations, because at every place the dna can mutate or not. This function finds all positions where your dna can mutate and stores them in mut_possible (that's the code of the for-loop). Now mut_possible contains all positions where the dna can mutate and so we have 2 ^ len(mut_possible) (len(mut_possible) is the number of elements in mut_possible) possible mutations. I wrote 1 << len(mut_possible), it's the same, but faster. If printings is a negative number the function will decide to print all possibilities and set printings to the number of possibilities. If printings is positive, but lower than the number of possibilities, then the function will print only printings mutations, because min(printings, 1 << len(mut_possible)) will return the smaller number, which is printings. Else, the function will print out all possibilities. Now we have number to go through range(...) and so this loop, which prints one mutation every time, will execute the desired number of times. Also, number will increase by one every time. (e.g., range(4) is similar! to [0, 1, 2, 3]). Next we use number to create a mutation. To understand this step you have to understand a binary number. If our number is 10, it's in binary 1010. These numbers tell us at which places we have to modify out code of dna (dnaCopy). The first bit is a 0, so we don't modify the first position where a mutation can happen, the next bit is a 1, so we modify this position, after that there is a 0 and so on... To "read" the bits we use the variable counter. number & (1 << counter) will return a non-zero value if the counterth bit is set, so if this bit is set we modify our dna at the counterth position where a mutation can happen. This is written in mut_possible, so our desired position is mut_possible[counter]. After we mutated our dna at that position we set the bit to 0 to show that we already modified this position. That is done with number &= ~(1 << counter). After that we increase counter to look at the other bits. The while-loop will only continue to execute if number is not 0, so if number has at least one bit set (if we have to modify at least one position of dna). After we modified our dnaCopy the while-loop is finished and we print our result.
I hope these explanations could help. I see that you are new to python, so take yourself time to let that sink in and contact me if you have any further questions.
After what I read this question seems easy to answer. The chance is high that I misunderstood something, so please correct me if I am wrong.
If you want a chance of 5% to change a T with an A, then you should write
mutate(yourString, {"A": "T"}, 0.05)
I also suggest you to use a bytearray instead of a string. A bytearray is similar to a string, it can only contain bytes (values from 0 to 255) while a string can contain more characters, but a bytearray is mutable. By using a bytearray you don't need to create you temporary list or to join it in the end. If you do that, your code looks like this:
import random
def mutate(dna, mutation, threshold):
if isinstance(dna, str):
dna = bytearray(dna, "utf-8")
else:
dna = bytearray(dna)
for index in range(len(dna)):
if chr(dna[index]) in mutation and random.random() < threshold:
dna[index] = ord(mutation[chr(dna[index])])
return dna
dna = "ATGTCGTACGTTTGACGTAGAG"
print("DNA first:", dna)
newDNA = mutate(dna, {"A": "T"}, 0.05)
print("DNA now:", newDNA.decode("ascii")) #use decode to make newDNA a string
After all the stupid problems I had with the bytearray version, here is the version that operates on strings:
import random
def mutate(string, mutation, threshold):
dna = list(string)
for index, char in enumerate(dna):
if char in mutation:
if random.random() < threshold:
dna[index] = mutation[char]
return ''.join(dna)
dna = "ATGTCGTACGTTTGACGTAGAG"
print("DNA first:", dna)
newDNA = mutate(dna, {"A": "T"}, 0.05)
print("DNA now:", newDNA)
If you use the string version with larger input the computation time will be bigger as well as the memory used. The bytearray-version will be the best when you want to do this with much larger input.

Resources