Sliding Window and Recognizing Specific Characters in a List - python-3.x

Instructions: Write a script that will calculate the %GC of a dna string
based on a sliding window of adjustable size. So say the length of
the window is L = 10 bases, then you will move the window along
the dna strand from position 0 to the end (careful, not too far...)
and 'extract' the bases into a substring and analyze GC content.
Put the numbers in a list. The dna string may be very large so you
will want to read the string in from an infile, and print the results
to a comma-delimited outfile that can be ported into Excel to plot.
For the final data analysis, use a window of L = 100 and analyze the two genomes in files:
Bacillus_amyloliquefaciens_genome.txt
Deinococcus_radiodurans_R1_chromosome_1.txt
But first, to get your script functioning, use the following trainer data set. Let window L=4. Example input and output follow:
INPUT:
AACGGTT
OUTPUT:
0,0.50
1,0.75
2,0.75
3,0.50
My code:
dna = ['AACGGTT']
def slidingWindow(dna,winSize,step):
    """Returns a generator that will iterate through
    the defined chunks of input sequence. Input sequence
    must be iterable."""
    # Verify the inputs
    #try: it = iter(dna)
    # except TypeError:
    #raise Exception("**ERROR** sequence must be iterable.")
    if not ((type(winSize) == type(0)) and (type(step) == type(0))):
        raise Exception("**ERROR** type(winSize) and type(step) must be int.")
    if step > winSize:
        raise Exception("**ERROR** step must not be larger than winSize.")
    if winSize > len(dna):
        raise Exception("**ERROR** winSize must not be larger than sequence length.")
    # Pre-compute number of chunks to emit
    numOfwins = ((len(dna)-winSize)/step)+1
    # Do the work
    for i in range(0,numOfwins*step,step):
        yield dna[i:i+winSize]
chunks = slidingWindow(dna,len(dna),step)
for y in chunks:
    total = 1
    search = dna[y]
    percentage = (total/len(dna))
    if search == "C":
        total = total+1
        print ("#", y,percentage)
    elif search == "G":
        total = total+1
        print ("#", y,percentage)
    else:
        print ("#", y, "0.0")
"""
MAIN
calling the functions from here
"""
# YOUR WORK HERE
#print ("#", z,percentage)

When approaching a complex problem, it is helpful to divide it into simpler sub-problems. Here, you have at least two separate concepts: a window of bases, and statistics on such a window. Why don't you tackle them one at a time?
Here is a simple generator that produces chunks of the desired size:
def get_chunks(dna, window_size=4, stride=1):
    for i in range(0, len(dna) - window_size + 1, stride):
        chunk = dna[i:i + window_size]
        assert len(chunk) == window_size
        yield chunk

for chunk in get_chunks('AACGGTT'):
    print(chunk)
It displays this output:
AACG
ACGG
CGGT
GGTT
Now, with that in hand, could you write a simple function that accepts a four-character string and produces an appropriate statistical summary of it? [Please post it as a separate answer to your question. Yes, it might sound odd at first, but StackOverflow does encourage you to post answers to your questions, so you can share what you have learned.]
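For reference, here is one possible shape for that second piece, as a minimal sketch rather than the definitive answer. The names gc_fraction and gc_windows are made up for illustration, and it assumes the input is a plain string of bases and that only G and C count toward the statistic.

```python
def gc_fraction(chunk):
    """Fraction of bases in chunk that are G or C (case-insensitive)."""
    chunk = chunk.upper()
    return (chunk.count("G") + chunk.count("C")) / len(chunk)


def gc_windows(dna, window_size=4, stride=1):
    """Yield (start_index, gc_fraction) for every full window."""
    for start in range(0, len(dna) - window_size + 1, stride):
        yield start, gc_fraction(dna[start:start + window_size])


# Trainer data set from the question, window L = 4.
for start, gc in gc_windows("AACGGTT"):
    print("{},{:.2f}".format(start, gc))
# 0,0.50
# 1,0.75
# 2,0.75
# 3,0.50
```

Keeping the window generator and the statistic separate means the same loop can later write to a comma-delimited file instead of printing.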

Related

Generating a random string with matched brackets

I need to generate a random string of a certain length – say ten characters, for the sake of argument – composed of the characters a, b, c, (, ), with the rule that parentheses must be matched.
So for example aaaaaaaaaa, abba()abba and ((((())))) are valid strings, but )aaaabbbb( is not.
What algorithm would generate a random string, uniformly sampled from the set of all strings consistent with those rules? (And run faster than 'keep generating strings without regard to the balancing rule, discard the ones that fail it', which could end up generating very many invalid strings before finding a valid one.)
A string consisting only of balanced parentheses (for any arbitrary pair of characters representing an open and a close) is called a "Dyck string", and the number of such strings with p pairs of parentheses is the pth Catalan number, which can be computed as C(2p, p) / (p + 1), where C(a, b) is the binomial coefficient "a choose b" (a formula which would be much easier to make readable if only SO allowed MathJax). If you also want to allow k other non-parenthetic characters, you need to consider, for each number p ≤ n of pairs of balanced parentheses, the number of different combinations of the non-parenthesis characters, k^(2n-2p), and the number of ways you can interpolate 2n-2p characters in a string of total length 2n, C(2n, 2p). The product of these two with the pth Catalan number is the count for that p. If you sum these counts over every possible value of p, you get the size of the total universe of possibilities; you can then choose a random number in that range and select whichever p it falls under. Then you can select a random placement of random non-parenthesis characters.
Finally, you need to get a uniformly distributed Dyck string; a simple procedure is to decompose the Dyck string into its shortest balanced prefix and the remainder (i.e. (A)B, where A and B are balanced subsequences). Select a random length for (A), then recursively generate a random A and a random B.
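The counting argument above can be sketched in a few lines. This is an illustration, not code from the answer: the names catalan and count_balanced are made up, and the function takes the total string length directly (the 2n of the text) so that odd lengths work too.

```python
from math import comb


def catalan(p):
    """Number of Dyck strings with p pairs of parentheses."""
    return comb(2 * p, p) // (p + 1)


def count_balanced(length, k):
    """Count strings of the given length over k plain characters plus '(' and ')'
    in which the parentheses are balanced."""
    total = 0
    for p in range(length // 2 + 1):
        fillers = length - 2 * p              # positions holding plain characters
        total += (catalan(p)                  # ways to arrange the parentheses
                  * k ** fillers              # choices for the plain characters
                  * comb(length, 2 * p))      # which positions hold parentheses
    return total


print(count_balanced(2, 3))  # 10: nine two-letter strings plus "()"
```

The per-p terms of this sum are the weights to use when picking how many pairs of parentheses the random string should contain.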
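A sketch of that recursive decomposition follows. It reads "select a random length for (A)" as a weighted choice: a split with a pairs inside (A) is picked in proportion to the number of strings it admits, catalan(a) * catalan(p - 1 - a). That weighting is my interpretation, and the name random_dyck is hypothetical.

```python
import random
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def catalan(p):
    """Number of Dyck strings with p pairs of parentheses (memoised)."""
    return comb(2 * p, p) // (p + 1)


def random_dyck(p, rng=random):
    """Return a random balanced string with p pairs of parentheses."""
    if p == 0:
        return ""
    # Every Dyck string splits uniquely as (A)B; weight each size of A
    # by how many strings that split accounts for.
    weights = [catalan(a) * catalan(p - 1 - a) for a in range(p)]
    a = rng.choices(range(p), weights=weights)[0]
    return "(" + random_dyck(a, rng) + ")" + random_dyck(p - 1 - a, rng)


print(random_dyck(5))
```

Two caveats: random.choices works with floating-point weights, so for very large p the distribution is only approximately uniform, and deeply nested results can run into Python's recursion limit.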
Precomputing the tables of counts (or memoising the function which generates them) will produce a speedup if you expect to generate a lot of random strings.
Use dynamic programming to generate a data structure that knows how many there are for each choice, recursively. Then use that data structure to find a random choice.
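I seem to be the only person who uses this technique, and I always write it from scratch, but here is working code that hopefully explains it. Building the structure takes O(length_of_string^2 * (length_of_alphabet + 2)) time and similar memory, because each string position keeps one state per possible count of unmatched parentheses; drawing one random string from it then takes O(length_of_string * (length_of_alphabet + 2)).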
import random

class DPPath:
    def __init__ (self):
        self.count = 0
        self.next = None

    def add_option(self, transition, tail):
        if self.next is None:
            self.next = {}
        self.next[transition] = tail
        self.count += tail.count

    def random (self):
        if 0 == self.count:
            return None
        else:
            return self.find(int(random.random() * self.count))

    def find (self, pos):
        result = self._find(pos)
        return "".join(reversed(result))

    def _find (self, pos):
        if self.next is None:
            return []
        for transition, tail in self.next.items():
            if pos < tail.count:
                result = tail._find(pos)
                result.append(transition)
                return result
            else:
                pos -= tail.count
        # IndexError is the built-in exception; IndexException does not exist.
        raise IndexError("find out of range")

def balanced_dp (n, alphabet):
    # Record that there is 1 empty string with balanced parens.
    base_dp = DPPath()
    base_dp.count = 1
    dps = [base_dp]
    for _ in range(n):
        # We are working backwards towards the start.
        prev_dps = [DPPath()]
        for i in range(len(dps)):
            # prev_dps needs to be bigger in case of closed paren.
            prev_dps.append(DPPath())
            # If there are closed parens, we can open one.
            if 0 < i:
                prev_dps[i-1].add_option('(', dps[i])
            # alphabet chars don't change paren balance.
            for char in alphabet:
                prev_dps[i].add_option(char, dps[i])
            # Add a closed paren.
            prev_dps[i+1].add_option(")", dps[i])
        # And we are done with this string position.
        dps = prev_dps
    # Return the one that wound up balanced.
    return dps[0]

# And a quick demo of several random strings.
# The structure is built once and reused for every draw.
dp = balanced_dp(10, "abc")
for _ in range(10):
    print(dp.random())

MapReduce in python to calculate average characters

I am new to map-reduce and coding. I am trying to write Python code that calculates the average number of characters and the average number of "#" symbols in a tweet.
Sample data:
1469453965000;757570956625870854;RT #lasteven04: La jeune Rebecca
#Kpossi, nageuse, 18 ans à peine devrait être la porte-drapeau du #Togo à #Rio2016 hyperlink;Twitter for Android 1469453965000;757570957502394369;Over 30 million women
footballers in the world. Most of us would trade places with this lot
for #Rio2016 ⚽️ hyperlink;Twitter for iPhone
fields/columns details:
0: epoch_time 1: tweetId 2: tweet 3: device
Here is the code that I've written, I need help to calculate the average in the reducer function, any help/guidance would be really appreciated :-
updated as per the answer provided by #oneCricketeer
import re
from mrjob.job import MRJob
class Lab3(MRJob):
def mapper(self,_,line):
try:
fields=line.split(";")
if(len(fields)==4):
tweet=fields[2]
tweet_id=fields[0]
yield(None,tweet_id,("{},{}".format(len(tweet),tweet.count('#')))
except:
pass
def reduce(self,tweet_id,tweet_info):
total_tweet_length=0
total_tweet_hash=0
count=0
for v in tweet_info:
tweet_length,hashes = map(int,v.split())
tweet_length_sum+= tweet_length
total_tweet_hash+=hashes
count+=1
yield(total_tweet_length/(1.0*count),total_tweet_hash/(1.0*count))
if __name__=="__main__":
Lab3.run()
Your mapper needs to yield a key and a value, that is 2 elements, not 3. Ideally, the average length and the average hashtag count would be separate MapReduce jobs, but in this case you can combine them because you are processing the entire line rather than separate words:
# you could use the tweetId as the key, too, but would only help if tweets shared ids
yield (None, "{} {}".format(len(tweet), tweet.count('#')))
Note: len(tweet) includes spaces and emojis, which you may want to exclude as "characters"
Using _ as a parameter name is valid Python (it is the usual convention for an argument you ignore), so that part of the mapper signature is fine.
Your reduce function is syntactically incorrect. You cannot put a string as a function parameter, nor use += on a variable that wasn't already defined. Then, an average calculation requires you to divide after you've totalled and counted (so, one returned result per reducer, not one per value inside the loop). Also note that mrjob calls the method named reducer, so a method named reduce would never be invoked:
def reducer(self, key, tweet_info):
    total_tweet_length = 0
    total_tweet_hash = 0
    count = 0
    for v in tweet_info:
        # each value is the "length hashes" string produced by the mapper
        tweet_length, hashes = map(int, v.split())
        total_tweet_length += tweet_length
        total_tweet_hash += hashes
        count += 1
    # divide once, after the loop; 1.0 * forces a floating point output
    yield (total_tweet_length / (1.0 * count), total_tweet_hash / (1.0 * count))
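To check the mapper and reducer logic without running mrjob or Hadoop, the same two steps can be simulated in plain Python. This is only a sketch: the standalone mapper, reducer and run functions below are hypothetical stand-ins for the job's methods, and the three input lines are invented test data in the epoch_time;tweetId;tweet;device layout.

```python
def mapper(line):
    """Emit (None, "length hashes") for every well-formed 4-field line."""
    fields = line.split(";")
    if len(fields) == 4:
        tweet = fields[2]
        yield None, "{} {}".format(len(tweet), tweet.count("#"))


def reducer(key, values):
    """Average the lengths and hashtag counts of all values for one key."""
    total_length = total_hashes = count = 0
    for value in values:
        length, hashes = map(int, value.split())
        total_length += length
        total_hashes += hashes
        count += 1
    if count:  # avoid dividing by zero when no line was valid
        yield total_length / count, total_hashes / count


def run(lines):
    """Feed every mapper output to a single reducer call (one shared key)."""
    values = [value for line in lines for _, value in mapper(line)]
    return list(reducer(None, values))


lines = ["1;2;ab #c;dev", "1;3;#x #y;dev", "malformed line"]
print(run(lines))  # [(5.0, 1.5)]
```

Because every mapper output shares the key None, all values reach one reducer, which is what makes a single global average possible.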

Recursion Problem - breaking long sentence in multiple short strings

I'm trying to take a string and break it into small chunks if it is over a certain number of words.
I keep getting a RecursionError: maximum recursion depth exceeded in comparison
What in my code is making this happen?
import math
# Shorten Sentence into small pieces
def shorten(sentenceN):
    # If it is a string - and length over 6 - then shorten recursively
    if (isinstance(sentenceN, str)):
        sentence = sentenceN.split(' ')
        array = []
        length = len(sentenceN)
        halfed = math.floor(length / 2)
        if length < 6:
            return [sentenceN]
        # If sentence is long - break into two parts then rerun shorten on each part
        else:
            first = shorten(" ".join(sentence[:halfed]))
            second = shorten(" ".join(sentence[halfed:]))
            array.append(first)
            array.append(second)
            return array
    # If the object is an array (sentence is already broken up) - run shorten on each - append
    # result to array for returning
    if(isinstance(sentenceN, list)):
        array = []
        for sentence in sentenceN:
            array.append(shorten(sentence))
        return array
# example sentences to use
longSentence = "On offering to help the blind man, the man who then stole his car, had not, at that precise moment."
shortSentence = "On offering to help the blind man."
shorten(shortSentence)
shorten(longSentence)
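When a recursive function in Python nests more calls than the interpreter's recursion limit (1000 by default), you get a "maximum recursion depth exceeded" error.
Here is where you recurse:
    first = shorten(" ".join(sentence[:halfed]))
    second = shorten(" ".join(sentence[halfed:]))
Each of these calls has to be kept on the stack until it returns. In your case the depth does not come from the sentence being long: length is len(sentenceN), the number of characters, while sentence[:halfed] slices the list of words. For "On offering to help the blind man." that gives length 34 and halfed 17, but there are only 7 words, so sentence[:halfed] is the whole sentence again and shorten keeps calling itself with the same string until the limit is hit.
You have to fix the logic so that the check and the split use the same unit, for example by setting length to the number of words, len(sentence), before this test:
    if length < 6:
        return [sentenceN]
Increasing the recursion depth with
    import sys
    sys.setrecursionlimit(10**6)
only helps when the recursion is legitimately deep; it will not rescue a recursion that never terminates.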
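For comparison, here is a sketch of a version that does terminate. It assumes the intent is to split on word count rather than character count, and it returns a flat list of chunks instead of the nested lists the original builds; both choices, and the max_words parameter, are mine rather than the asker's.

```python
def shorten(sentence, max_words=6):
    """Split sentence into chunks that each have fewer than max_words words."""
    if max_words < 2:
        raise ValueError("max_words must be at least 2")
    words = sentence.split()
    if len(words) < max_words:
        return [sentence]
    half = len(words) // 2
    # Both halves are strictly shorter than the input, so the recursion ends.
    return (shorten(" ".join(words[:half]), max_words)
            + shorten(" ".join(words[half:]), max_words))


print(shorten("On offering to help the blind man."))
# ['On offering to', 'help the blind man.']
```

The recursion depth here grows only with the logarithm of the word count, so the default recursion limit is never a concern.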

Tips optimizing my python graph creation code

My programming task is to, given an input file words.txt, create a graph where the nodes are the words themselves, and edges are words with one letter different. (eg. "abcdef" and "abcdeg" would have an edge).
For later stages of the problem, I really need to use the adjacency list implementation (so no adjacency matrix). I'd prefer not to use external libraries if possible.
My code is functional, but takes about 7 seconds to run (for ~5000 words). I need help optimizing it further. The main optimization I did was using sets with {"index,char"} for each char in the string and compared the sets, which was a lot faster than something like sum(x!=y for x,y in zip(word1, word2))
This is the code I have so far:
def createGraph(filename):
    """Time taken to generate graph: {} seconds."""
    wordReference = {}  # Words and their smartsets (for comparisons)
    wordEdges = {}  # Words and their edges (Graph representation)
    edgeCount = 0  # Count of edges
    with open(filename) as f:
        for raw in f.readlines():  # For each word in the file (Stripping trailing \n)
            word = raw.strip()
            wordReference[word] = set("{}{}".format(letter, str(i)) for i, letter in enumerate(word))  # Create smartsets in wordReference
            wordEdges[word] = set()  # Create a set to contain the edges
            for parsed in wordEdges:  # For each of the words we've already parsed
                if len(wordReference[word] & wordReference[parsed]) == 5:  # Compare smartsets to see if there should be an edge
                    wordEdges[parsed].add(word)  # Add edge parsed -> word
                    wordEdges[word].add(parsed)  # Add edge word -> parsed
                    edgeCount += 1  # Add 1 to edgeCount
    return wordEdges, edgeCount, len(wordReference)  # Return dictionary of words and their edges
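One common alternative to comparing every pair of words, offered here as a sketch and not taken from the post, is to bucket words by "wildcard" patterns: each word is filed under every pattern obtained by blanking out one position, and any two words sharing a bucket differ in exactly that position. The function name build_graph is hypothetical, it takes a list of words instead of a filename to stay self-contained, and it assumes the character "_" never occurs in a word.

```python
from collections import defaultdict


def build_graph(words):
    """Return (adjacency dict of sets, edge count) for one-letter-apart words."""
    buckets = defaultdict(list)
    for word in words:
        for i in range(len(word)):
            # e.g. "abcdef" goes into "_bcdef", "a_cdef", ..., "abcde_"
            buckets[word[:i] + "_" + word[i + 1:]].append(word)

    edges = {word: set() for word in words}
    for group in buckets.values():
        for a in group:
            for b in group:
                if a != b:
                    edges[a].add(b)

    # Each undirected edge was stored once per endpoint.
    edge_count = sum(len(neighbours) for neighbours in edges.values()) // 2
    return edges, edge_count


graph, n_edges = build_graph(["abcdef", "abcdeg", "abcxeg", "zzzzzz"])
print(n_edges)  # 2
```

The work is proportional to the number of words times the word length, plus the size of the buckets, rather than to the square of the number of words. It still produces an adjacency-list structure and uses only the standard library.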

Generating a mutation frequency on a DNA Strand using Python

I would like to input a DNA sequence and make some sort of generator that yields sequences that have a certain frequency of mutations. For instance, say I have the DNA strand "ATGTCGTCACACACCGCAGATCCGTGTTTGAC", and I want to create mutations with a T->A frequency of 5%. How would I go about creating this? I know that random mutations can be created with code like this:
import random
def mutate(string, mutation, threshold):
    dna = list(string)
    for index, char in enumerate(dna):
        if char in mutation:
            if random.random() < threshold:
                dna[index] = mutation[char]
    return ''.join(dna)
But what I am truly not sure how to do is make a fixed mutation frequency. Anybody know how to do that? Thanks.
EDIT:
So should the formatting look like this if I'm using a byte array, because I'm getting an error:
import random
dna = "ATGTCGTACGTTTGACGTAGAG"
def mutate(dna, mutation, threshold):
    dna = bytearray(dna) #if you don't want to modify the original
    for index in range(len(dna)):
        if dna[index] in mutation and random.random() < threshold:
            dna[index] = mutation[char]
    return dna
mutate(dna, {"A": "T"}, 0.05)
print("my dna now:", dna)
error: "TypeError: string argument without an encoding"
EDIT 2:
import random
myDNA = bytearray("ATGTCGTCACACACCGCAGATCCGTGTTTGAC")
def mutate(dna, mutation, threshold):
    dna = myDNA # if you don't want to modify the original
    for index in range(len(dna)):
        if dna[index] in mutation and random.random() < threshold:
            dna[index] = mutation[char]
    return dna
mutate(dna, {"A": "T"}, 0.05)
print("my dna now:", dna)
yields an error
You asked me about a function that prints all possible mutations; here it is. The number of outputs grows exponentially with the length of your input, so the function only prints the possibilities and does not store them (that could consume a great deal of memory). I wrote a recursive function first; it should not be used with very large input, so I will also add a non-recursive function that does not run into the recursion limit.
def print_all_possibilities(dna, mutations, index = 0, print = print):
    if index < 0: return #invalid value for index
    while index < len(dna):
        if chr(dna[index]) in mutations:
            # branch 1: leave this position unchanged
            # (print is passed along so a custom print function is not lost)
            print_all_possibilities(dna, mutations, index + 1, print)
            # branch 2: mutate this position in a copy
            dnaCopy = bytearray(dna)
            dnaCopy[index] = ord(mutations[chr(dna[index])])
            print_all_possibilities(dnaCopy, mutations, index + 1, print)
            return
        index += 1
    print(dna.decode("ascii"))

# for testing
print_all_possibilities(bytearray(b"AAAATTTT"), {"A": "T"})
This works for me on python 3, I also can explain the code if you want.
Note: This function requires a bytearray as given in the function test.
Explanation:
This function searches dna for a place where a mutation can happen. It starts at index, so it normally begins at 0 and runs to the end; that is what the while-loop, which increases index on every pass, is for (it is basically a normal iteration, like a for loop). If the function finds such a place (if chr(dna[index]) in mutations:), it copies the dna and mutates the copy (dnaCopy[index] = ord(mutations[chr(dna[index])])). Note that a bytearray is an array of numeric values, so I use chr and ord throughout to convert between characters and ints. The function then calls itself on both versions, the unmutated original and the mutated copy, to look for further possible mutations; both calls skip the part that has already been scanned and begin at index + 1. From there on, printing is the job of those recursive calls, so the current call has nothing left to do and ends with return. If no further mutation site is found, the loop runs off the end and we print the dna ourselves, because no further call is made that would do it for us.
It may sound complicated, but it is a more or less elegant solution. Also, to understand recursion you have to understand recursion, so don't worry if it doesn't click right away. It can help to try it out on a sheet of paper: take a short dna string, "TTATTATTA", with the possible mutation "A" -> "T" (three mutable positions, so 8 possible results) and do this: go through the string from left to right, and when you find a position where the sequence can mutate (here, just the "A"s), write the string down again, this time mutated at that position, so that your second string is slightly different from the original. In both the original and the copy, mark how far you got (maybe put a "|" after the letter you mutated), then repeat the procedure on each of them from that mark onwards. If you don't find any further possible mutation, underline the string (this is the equivalent of printing it). At the end you should have 8 different strings, all underlined. I hope that helps to understand it.
EDIT: Here is the non-recursive function:
def print_all_possibilities(dna, mutations, printings = -1, print = print):
    # collect every index where a mutation is possible
    mut_possible = []
    for index in range(len(dna)):
        if chr(dna[index]) in mutations: mut_possible.append(index)
    if printings < 0: printings = 1 << len(mut_possible)
    # each number is a bit mask: bit i set means "mutate site i"
    for number in range(min(printings, 1 << len(mut_possible))):
        dnaCopy = bytearray(dna) # don't change the original
        counter = 0
        while number:
            if number & (1 << counter):
                index = mut_possible[counter]
                dnaCopy[index] = ord(mutations[chr(dna[index])])
                number &= ~(1 << counter)
            counter += 1
        print(dnaCopy.decode("ascii"))

# for testing
print_all_possibilities(bytearray(b"AAAATTTT"), {"A": "T"})
This function comes with an additional parameter, which can control the number of maximum outputs, e.g.
print_all_possibilities(bytearray(b"AAAATTTT"), {"A": "T"}, 5)
will only print 5 results.
Explanation:
If your dna has x possible positions where it can mutate, you have 2 ^ x possible results, because at every such place the dna can either mutate or not. This function finds all positions where your dna can mutate and stores them in mut_possible (that's the code of the first for-loop). Now mut_possible contains all positions where the dna can mutate, so we have 2 ^ len(mut_possible) possible results (len(mut_possible) is the number of elements in mut_possible). I wrote 1 << len(mut_possible); it's the same, but faster.
If printings is a negative number, the function prints all possibilities and sets printings to the number of possibilities. If printings is positive but lower than the number of possibilities, the function prints only printings results, because min(printings, 1 << len(mut_possible)) returns the smaller number, which is printings. Otherwise, the function prints all possibilities. number then runs through range(...), so this loop, which prints one result per pass, executes the desired number of times, with number increasing by one each time (range(4), for example, produces 0, 1, 2, 3).
Next we use number to create a mutation. To understand this step you have to understand binary numbers. If our number is 10, it is 1010 in binary. Read from the right, these bits tell us at which places we have to modify our copy of the dna (dnaCopy). The first bit is a 0, so we don't modify the first position where a mutation can happen; the next bit is a 1, so we modify that position; after that there is a 0, and so on. To "read" the bits we use the variable counter. number & (1 << counter) returns a non-zero value if bit number counter is set, and if it is, we modify our dna at the matching position where a mutation can happen. Those positions are stored in mut_possible, so the position we want is mut_possible[counter]. After we have mutated our dna at that position, we set the bit to 0 to show that we have already handled it. That is done with number &= ~(1 << counter). After that we increase counter to look at the other bits. The while-loop only continues while number is not 0, that is, while number still has at least one bit set (at least one more position of the dna to modify). Once dnaCopy has been fully modified, the while-loop is finished and we print our result.
I hope these explanations help. I see that you are new to python, so take your time to let that sink in, and contact me if you have any further questions.
From what I have read, this question seems easy to answer. The chance is high that I misunderstood something, so please correct me if I am wrong.
If you want each T to have a 5% chance of being changed to an A, then you should write
    mutate(yourString, {"T": "A"}, 0.05)
I also suggest you use a bytearray instead of a string. A bytearray is similar to a string, except that it can only contain bytes (values from 0 to 255) while a string can contain more characters, and that a bytearray is mutable. By using a bytearray you don't need to create your temporary list or to join it in the end. If you do that, your code looks like this:
import random
def mutate(dna, mutation, threshold):
    if isinstance(dna, str):
        dna = bytearray(dna, "utf-8")
    else:
        dna = bytearray(dna)
    for index in range(len(dna)):
        if chr(dna[index]) in mutation and random.random() < threshold:
            dna[index] = ord(mutation[chr(dna[index])])
    return dna

dna = "ATGTCGTACGTTTGACGTAGAG"
print("DNA first:", dna)
newDNA = mutate(dna, {"A": "T"}, 0.05)
print("DNA now:", newDNA.decode("ascii")) #use decode to make newDNA a string
After all the stupid problems I had with the bytearray version, here is the version that operates on strings:
import random
def mutate(string, mutation, threshold):
    dna = list(string)
    for index, char in enumerate(dna):
        if char in mutation:
            if random.random() < threshold:
                dna[index] = mutation[char]
    return ''.join(dna)

dna = "ATGTCGTACGTTTGACGTAGAG"
print("DNA first:", dna)
newDNA = mutate(dna, {"A": "T"}, 0.05)
print("DNA now:", newDNA)
With larger input, the string version needs more computation time and more memory. The bytearray version is the better choice when you want to do this with much larger input.
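Both versions above mutate each base independently, so the realised mutation rate varies from run to run. If "fixed mutation frequency" means that an exact fraction of the eligible bases must change, one possible approach (my reading of the question, not part of the answer above) is to pick that many positions with random.sample. The name mutate_fixed and its parameters are hypothetical.

```python
import random


def mutate_fixed(dna, source, target, fraction, rng=random):
    """Change exactly round(fraction * number of source bases) of them to target."""
    positions = [i for i, base in enumerate(dna) if base == source]
    n_mutations = round(len(positions) * fraction)
    bases = list(dna)
    # sample() picks distinct positions, so no site is mutated twice
    for i in rng.sample(positions, n_mutations):
        bases[i] = target
    return "".join(bases)


dna = "ATGTCGTCACACACCGCAGATCCGTGTTTGAC"
print(mutate_fixed(dna, "T", "A", 0.25))
```

Note that on a short strand a small fraction can round down to zero mutations; a 5% rate only becomes meaningful once there are enough T positions to choose from.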

Resources