Recursion Problem - breaking long sentence in multiple short strings - python-3.x

I'm trying to take a string and break it into small chunks if it is over certain number of words.
I keep on getting a RecursionError: maximum recursion depth exceeded in comparison
What in my code is making this happen?
import math
# Shorten Sentence into small pieces
def shorten(sentenceN):
# If it is a string - and length over 6 - then shorten recursively
if (isinstance(sentenceN, str)):
sentence = sentenceN.split(' ')
array = []
length = len(sentenceN)
halfed = math.floor(length / 2)
if length < 6:
return [sentenceN]
# If sentence is long - break into two parts then rerun shorten on each part
else:
first = shorten(" ".join(sentence[:halfed]))
second = shorten(" ".join(sentence[halfed:]))
array.append(first)
array.append(second)
return array
# If the object is an array (sentence is already broken up) - run shorten on each - append
# result to array for returning
if(isinstance(sentenceN, list)):
array = []
for sentence in sentenceN:
array.append(shorten(sentence))
return array
# example sentences to use
longSentence = "On offering to help the blind man, the man who then stole his car, had not, at that precise moment."
shortSentence = "On offering to help the blind man."
shorten(shortSentence)
shorten(longSentence)

When you execute a recursive function in Python on a large input ( > 10^4), you might encounter a “maximum recursion depth exceeded error”.
here you have recursion:
first = shorten(" ".join(sentence[:halfed]))
second = shorten(" ".join(sentence[halfed:]))
it means calling the same function over and over, it has to store in a stack to be returned in someplace, but it seems like your sentence is too long that stack is overflowing and hit maximum recursion depth.
you have to do something with the logic of the code like increase this 6 to a greater number
if length < 6:
return [sentenceN]
or just increase recursion depth with
import sys
sys.setrecursionlimit(10**6)

Related

list and dictionary: which one is faster

I have the following pieces of code doing the sorting of a list by swapping pairs of elements:
# Complete the minimumSwaps function below.
def minimumSwaps(arr):
counter = 0
val_2_indx = {val: arr.index(val) for val in arr}
for indx, x in enumerate(arr):
if x != indx+1:
arr[indx] = indx+1
s_indx = val_2_indx[indx+1]
arr[s_indx] = x
val_2_indx[indx+1] = indx
val_2_indx[x] = s_indx
counter += 1
return counter
def minimumSwaps(arr):
temp = [0] * (len(arr) + 1)
for pos, val in enumerate(arr):
temp[val] = pos
swaps = 0
for i in range(len(arr)):
if arr[i] != i+1:
swaps += 1
t = arr[i]
arr[i] = i+1
arr[temp[i+1]] = t
temp[t] = temp[i+1]
temp[i+1] = i
return swaps
The second function works much faster than the first one. However, I was told that dictionary is faster than list.
What's the reason here?
A list is a data structure, and a dictionary is a data structure. It doesn't make sense to say one is "faster" than the other, any more than you can say that an apple is faster than an orange. One might grow faster, you might be able to eat the other one faster, and they might both fall to the ground at the same speed when you drop them. It's not the fruit that's faster, it's what you do with it.
If your problem is that you have a sequence of strings and you want to know the position of a given string in the sequence, then consider these options:
You can store the sequence as a list. Finding the position of a given string using the .index method requires a linear search, iterating through the list in O(n) time.
You can store a dictionary mapping strings to their positions. Finding the position of a given string requires looking it up in the dictionary, in O(1) time.
So it is faster to solve that problem using a dictionary.
But note also that in your first function, you are building the dictionary using the list's .index method - which means doing n linear searches each in O(n) time, building the dictionary in O(n^2) time because you are using a list for something lists are slow at. If you build the dictionary without doing linear searches, then it will take O(n) time instead:
val_2_indx = { val: i for i, val in enumerate(arr) }
But now consider a different problem. You have a sequence of numbers, and they happen to be the numbers from 1 to n in some order. You want to be able to look up the position of a number in the sequence:
You can store the sequence as a list. Finding the position of a given number requires linear search again, in O(n) time.
You can store them in a dictionary like before, and do lookups in O(1) time.
You can store the inverse sequence in a list, so that lst[i] holds the position of the value i in the original sequence. This works because every permutation is invertible. Now getting the position of i is a simple list access, in O(1) time.
This is a different problem, so it can take a different amount of time to solve. In this case, both the list and the dictionary allow a solution in O(1) time, but it turns out it's more efficient to use a list. Getting by key in a dictionary has a higher constant time than getting by index in a list, because getting by key in a dictionary requires computing a hash, and then probing an array to find the right index. (Getting from a list just requires accessing an array at an already-known index.)
This second problem is the one in your second function. See this part:
temp = [0] * (len(arr) + 1)
for pos, val in enumerate(arr):
temp[val] = pos
This creates a list temp, where temp[val] = pos whenever arr[pos] == val. This means the list temp is the inverse permutation of arr. Later in the code, temp is used only to get these positions by index, which is an O(1) operation and happens to be faster than looking up a key in a dictionary.

How to count number of substrings in python, if substrings overlap?

The count() function returns the number of times a substring occurs in a string, but it fails in case of overlapping strings.
Let's say my input is:
^_^_^-_-
I want to find how many times ^_^ occurs in the string.
mystr=input()
happy=mystr.count('^_^')
sad=mystr.count('-_-')
print(happy)
print(sad)
Output is:
1
1
I am expecting:
2
1
How can I achieve the desired result?
New Version
You can solve this problem without writing any explicit loops using regex. As #abhijith-pk's answer cleverly suggests, you can search for the first character only, with the remainder being placed in a positive lookahead, which will allow you to make the match with overlaps:
def count_overlapping(string, pattern):
regex = '{}(?={})'.format(re.escape(pattern[:1]), re.escape(pattern[1:]))
# Consume iterator, get count with minimal memory usage
return sum(1 for _ in re.finditer(regex, string))
[IDEOne Link]
Using [:1] and [1:] for the indices allows the function to handle the empty string without special processing, while using [0] and [1:] for the indices would not.
Old Version
You can always write your own routine using the fact that str.find allows you to specify a starting index. This routine will not be very efficient, but it should work:
def count_overlapping(string, pattern):
count = 0
start = -1
while True:
start = string.find(pattern, start + 1)
if start < 0:
return count
count += 1
[IDEOne Link]
Usage
Both versions return identical results. A sample usage would be:
>>> mystr = '^_^_^-_-'
>>> count_overlapping(mystr, '^_^')
2
>>> count_overlapping(mystr, '-_-')
1
>>> count_overlapping(mystr, '')
9
>>> count_overlapping(mystr, 'x')
0
Notice that the empty string is found len(mystr) + 1 times. I consider this to be intuitively correct because it is effectively between and around every character.
you can use regex for a quick and dirty solution :
import re
mystr='^_^_^-_-'
print(len(re.findall('\^(?=_\^)',mystr)))
You need something like this
def count_substr(string,substr):
n=len(substr)
count=0
for i in range(len(string)-len(substr)+1):
if(string[i:i+len(substr)] == substr):
count+=1
return count
mystr=input()
print(count_substr(mystr,'121'))
Input: 12121990
Output: 2

finding DNA codon starting with a or t with regular expression

Given a DNA sequence of codons, I want to get the precentage of codons starting with A or T.
The DNA sequence would be something like: dna = "atgagtgaaagttaacgt". Eeach sequence starting in the 0,3,6 etc. positions <-and that's the source of the problem as far as my intentions goes
What we wrote and works:
import re
DNA = "atgagtgaaagttaacgt"
def atPct(dna):
'''
gets a dna sequence and returns the %
of sequences that are starting with a or t
'''
numOfCodons = re.findall(r'[a|t|c|g]{3}',dna) # [a|t][a|t|c|g]{2} won't give neceseraly in the pos % 3==0 subseq
count = 0
for x in numOfCodons:
if str(x)[0]== 'a' or str(x)[0]== 't':
count+=1
print(str(x))
return 100*count/len(numOfCodons)
print(atPct(DNA))
My goal is to find it without that for loop, somehow I feel there's a way more elegant way to do this just with regular expressions but I might be wrong, if there's a better way i would be glad to learn how! is there a way to cross the location and "[a|t][a|t|c|g]{2}" as a regular expression?
p.s question assume it's a valid dna sequence that's why i haven't even checked that
A loop will be faster than doing it another way. Still, you can use sum and a generator expression (another SO answer) to improve readability:
import re
def atPct(dna):
# Find all sequences
numSeqs = re.findall('[atgc]{3}', DNA)
# Count all sequences that start with 'a' or 't'
atSeqs = sum(1 for seq in numSeqs if re.match('[at]', seq))
# Return the calculation
return 100 * len(numSeqs) / atSeqs
DNA = "atgagtgaaagttaacgt"
print( atPct(DNA) )
So you just want to find out the percentage of times a or t appear in the first of every three characters in the string? Use the step parameter of a slice:
def atPct(dna):
starts = dna[::3] # Every third character of dna, starting with the first
return (starts.count('a') + starts.count('t')) / len(starts)

make a function that take an integer and reduces it down to an odd number

I'm working on my final for a class I'm taking(Python 3) im stuck at this part.
he gave us a file with numbers inside of it. we opened it and add those numbers to a list.
"Create a function called makeOdd() that returns an integer value. This function should take in any integer and reduce it down to an odd number by dividing it in half until it becomes an odd number.
o For example 10 would be cut in half to 5.
o 9 is already odd, so it would stay 9.
o But 12 would be cut in half to 6, and then cut in half again to 3.
o While 16 would be cut to 8 which gets cut to 4 which gets cut to 2 which gets cut to 1.
 Apply this function to every number in the array. "
I have tried to search the internet but i have not clue where to even begin with this one. any help would be nice.
Here my whole final so far:
#imports needed to run this code.
from Final_Functions import *
#Defines empty list
myList = []
sumthing = 0
sortList = []
oddList = []
count = 0
#Starts the Final Project with my name,class, and quarter
intro()
print("***************************************************************",'\n')
#Opens the data file and reads it then places the intrager into a list we can use later.
with open('FinalData.Data', 'r') as f:
myList = [line.strip() for line in f]
print("File Read Complete",'\n')
#Finds the Sum and Adverage of this list from FinalData.Data
print("*******************sum and avg*********************************")
for oneLine in myList:
tempNum = int(oneLine)
sumthing = sumthing + tempNum
avg = sumthing /1111
print("The Sum of the List is:",sumthing)
print("The Adverage of the List is:",avg,'\n')
print("***************************************************************",'\n')
#finds and prints off the first Ten and the last ten numbers in the list
firstTen(myList)
lastTen(myList)
print("***************************************************************",'\n')
#Lest sort the list then find the first and last ten numbers in this list
sortList = myList
sortList.sort()
firstTen(sortList)
lastTen(sortList)
print("****************************************************************",'\n')
Language:Python 3
I don't want to give you the answer outright, so I'm going to talk you through the process and let you generate your own code.
You can't solve this problem in a single step. You need to divide repeatedly and check the value every time to see if it's odd.
Broadly speaking, when you need to repeat a process there are two ways to proceed; looping and recursion. (Ok, there are lots, but those are the most common)
When looping, you'd check if the current number x is odd. If not, halve it and check again. Once the loop has completed, x will be your result.
If using recursion, have a function that takes x. If it's odd, simply return x, otherwise call the function again, passing in x/2.
Either of those methods will solve your problem and both are fundamental concepts.
adding to what #Basic said, never do import * is a bad practice and is a potential source of problem later on...
looks like you are still confuse in this simple matter, you want to given a number X reduce it to a odd number by dividing it by 2, right? then ask yourself how I do this by hand? the answer is what #Basic said you first ask "X is a even number?" if the answer is No then I and done reducing this number, but if the answer is Yes then the next step dividing it by 2 and save the result in X, then repeat this process until you get to the desire result. Hint: use a while
to answer your question about
for num in myList:
if num != 0:
num = float(num)
num / 2
the problem here is that you don't save the result of the division, to do that is as simple as this
for num in myList:
if num != 0:
num = float(num)
num = num / 2

Generating a mutation frequency on a DNA Strand using Python

I would like to input a DNA sequence and make some sort of generator that yields sequences that have a certain frequency of mutations. For instance, say I have the DNA strand "ATGTCGTCACACACCGCAGATCCGTGTTTGAC", and I want to create mutations with a T->A frequency of 5%. How would I go about to creating this? I know that creating random mutations can be done with a code like this:
import random
def mutate(string, mutation, threshold):
dna = list(string)
for index, char in enumerate(dna):
if char in mutation:
if random.random() < threshold:
dna[index] = mutation[char]
return ''.join(dna)
But what I am truly not sure how to do is make a fixed mutation frequency. Anybody know how to do that? Thanks.
EDIT:
So should the formatting look like this if I'm using a byte array, because I'm getting an error:
import random
dna = "ATGTCGTACGTTTGACGTAGAG"
def mutate(dna, mutation, threshold):
dna = bytearray(dna) #if you don't want to modify the original
for index in range(len(dna)):
if dna[index] in mutation and random.random() < threshold:
dna[index] = mutation[char]
return dna
mutate(dna, {"A": "T"}, 0.05)
print("my dna now:", dna)
error: "TypeError: string argument without an encoding"
EDIT 2:
import random
myDNA = bytearray("ATGTCGTCACACACCGCAGATCCGTGTTTGAC")
def mutate(dna, mutation, threshold):
dna = myDNA # if you don't want to modify the original
for index in range(len(dna)):
if dna[index] in mutation and random.random() < threshold:
dna[index] = mutation[char]
return dna
mutate(dna, {"A": "T"}, 0.05)
print("my dna now:", dna)
yields an error
You asked me about a function that prints all possible mutations, here it is. The number of outputs grows exponentially with your input data length, so the function only prints the possibilities and does not store them somehow (that could consume very much memory). I created a recursive function, this function should not be used with very large input, I also will add a non-recursive function that should work without problems or limits.
def print_all_possibilities(dna, mutations, index = 0, print = print):
if index < 0: return #invalid value for index
while index < len(dna):
if chr(dna[index]) in mutations:
print_all_possibilities(dna, mutations, index + 1)
dnaCopy = bytearray(dna)
dnaCopy[index] = ord(mutations[chr(dna[index])])
print_all_possibilities(dnaCopy, mutations, index + 1)
return
index += 1
print(dna.decode("ascii"))
# for testing
print_all_possibilities(bytearray(b"AAAATTTT"), {"A": "T"})
This works for me on python 3, I also can explain the code if you want.
Note: This function requires a bytearray as given in the function test.
Explanation:
This function searches for a place in dna where a mutation can happen, it starts at index, so it normally begins with 0 and goes to the end. That's why the while-loop, which increases index every time the loop is executed, is for (it's basically a normal iteration like a for loop). If the function finds a place where a mutation can happen (if chr(dna[index]) in mutations:), then it copies the dna and lets the second one mutate (dnaCopy[index] = ord(mutations[chr(dna[index])]), Note that a bytearray is an array of numeric values, so I use chr and ord all the time to change between string and int). After that the function is called again to look for more possible mutations, so the functions look again for possible mutations in both possible dna's, but they skip the point they have already scanned, so they begin at index + 1. After that the order to print is passed to the called functions print_all_possibilities, so we don't have to do anything anymore and quit the executioning with return. If we don't find any mutations anymore we print our possible dna, because we don't call the function again, so no one else would do it.
It may sound complicated, but it is a more or less elegant solution. Also, to understand a recursion you have to understand a recursion, so don't bother yourself if you don't understand it for now. It could help if you try this out on a sheet of paper: Take an easy dna string "TTATTATTA" with the possible mutation "A" -> "T" (so we have 8 possible mutations) and do this: Go through the string from left to right and if you find a position, where the sequence can mutate (here it is just the "A"'s), write this string down again, this time let the string mutate at the given position, so that your second string is slightly different from the original. In the original and the copy, mark how far you came (maybe put a "|" after the letter you let mutate) and repeat this procedure with the copy as new original. If you don't find any possible mutation, then underline the string (This is the equivalent to printing it). At the end you should have 8 different strings all underlined. I hope that can help to understand it.
EDIT: Here is the non-recursive function:
def print_all_possibilities(dna, mutations, printings = -1, print = print):
mut_possible = []
for index in range(len(dna)):
if chr(dna[index]) in mutations: mut_possible.append(index)
if printings < 0: printings = 1 << len(mut_possible)
for number in range(min(printings, 1 << len(mut_possible)):
dnaCopy = bytearray(dna) # don't change the original
counter = 0
while number:
if number & (1 << counter):
index = mut_possible[counter]
dnaCopy[index] = ord(mutations[chr(dna[index])])
number &= ~(1 << counter)
counter += 1
print(dnaCopy.decode("ascii"))
# for testing
print_all_possibilities(bytearray(b"AAAATTTT"), {"A": "T"})
This function comes with an additional parameter, which can control the number of maximum outputs, e.g.
print_all_possibilities(bytearray(b"AAAATTTT"), {"A": "T"}, 5)
will only print 5 results.
Explanation:
If your dna has x possible positions where it can mutate, you have 2 ^ x possible mutations, because at every place the dna can mutate or not. This function finds all positions where your dna can mutate and stores them in mut_possible (that's the code of the for-loop). Now mut_possible contains all positions where the dna can mutate and so we have 2 ^ len(mut_possible) (len(mut_possible) is the number of elements in mut_possible) possible mutations. I wrote 1 << len(mut_possible), it's the same, but faster. If printings is a negative number the function will decide to print all possibilities and set printings to the number of possibilities. If printings is positive, but lower than the number of possibilities, then the function will print only printings mutations, because min(printings, 1 << len(mut_possible)) will return the smaller number, which is printings. Else, the function will print out all possibilities. Now we have number to go through range(...) and so this loop, which prints one mutation every time, will execute the desired number of times. Also, number will increase by one every time. (e.g., range(4) is similar! to [0, 1, 2, 3]). Next we use number to create a mutation. To understand this step you have to understand a binary number. If our number is 10, it's in binary 1010. These numbers tell us at which places we have to modify out code of dna (dnaCopy). The first bit is a 0, so we don't modify the first position where a mutation can happen, the next bit is a 1, so we modify this position, after that there is a 0 and so on... To "read" the bits we use the variable counter. number & (1 << counter) will return a non-zero value if the counterth bit is set, so if this bit is set we modify our dna at the counterth position where a mutation can happen. This is written in mut_possible, so our desired position is mut_possible[counter]. After we mutated our dna at that position we set the bit to 0 to show that we already modified this position. That is done with number &= ~(1 << counter). After that we increase counter to look at the other bits. The while-loop will only continue to execute if number is not 0, so if number has at least one bit set (if we have to modify at least one position of dna). After we modified our dnaCopy the while-loop is finished and we print our result.
I hope these explanations could help. I see that you are new to python, so take yourself time to let that sink in and contact me if you have any further questions.
After what I read this question seems easy to answer. The chance is high that I misunderstood something, so please correct me if I am wrong.
If you want a chance of 5% to change a T with an A, then you should write
mutate(yourString, {"A": "T"}, 0.05)
I also suggest you to use a bytearray instead of a string. A bytearray is similar to a string, it can only contain bytes (values from 0 to 255) while a string can contain more characters, but a bytearray is mutable. By using a bytearray you don't need to create you temporary list or to join it in the end. If you do that, your code looks like this:
import random
def mutate(dna, mutation, threshold):
if isinstance(dna, str):
dna = bytearray(dna, "utf-8")
else:
dna = bytearray(dna)
for index in range(len(dna)):
if chr(dna[index]) in mutation and random.random() < threshold:
dna[index] = ord(mutation[chr(dna[index])])
return dna
dna = "ATGTCGTACGTTTGACGTAGAG"
print("DNA first:", dna)
newDNA = mutate(dna, {"A": "T"}, 0.05)
print("DNA now:", newDNA.decode("ascii")) #use decode to make newDNA a string
After all the stupid problems I had with the bytearray version, here is the version that operates on strings:
import random
def mutate(string, mutation, threshold):
dna = list(string)
for index, char in enumerate(dna):
if char in mutation:
if random.random() < threshold:
dna[index] = mutation[char]
return ''.join(dna)
dna = "ATGTCGTACGTTTGACGTAGAG"
print("DNA first:", dna)
newDNA = mutate(dna, {"A": "T"}, 0.05)
print("DNA now:", newDNA)
If you use the string version with larger input the computation time will be bigger as well as the memory used. The bytearray-version will be the best when you want to do this with much larger input.

Resources