Error when performing a pattern search on randomly generated characters (python-3.x)

I am trying to implement the Knuth-Morris-Pratt algorithm in Python. Below is my implementation:
import random

def KMP(Pattern, Chars):
    # Build the table K: K[k] is the length of the longest proper prefix of Pattern[:k] that is also a suffix of it.
    # K[0] is set to -1 as a sentinel, so K[1] is always 0.
    K = [] # K[n] stores the value so that if the mismatch happens at n, the comparison resumes at position K[n] of Pattern.
    n = -1
    K.append(n) # add the first element (the -1 sentinel); n stays -1.
    for k in range (1,len(Pattern) + 1):
        # traverse all the elements in Pattern, calculate the corresponding value for each element.
        while(n >=0 and Pattern[n] != Pattern[k - 1]): # if n >= 0 and the current suffix does not match, then try a shorter suffix
            n = K[n]
        n = n + 1 # if it matches, then the matching position should be one character ahead
        K.append(n) #record the matching position for k
    #match the string Chars with Pattern
    m = 0
    for i in range(0, len(Chars)): #traverse through the list one by one
        while(m >= 0 and Pattern[m] != Chars[i]): # if they do not match then move Pattern forward with K[m] characters and restart the comparison
            m = K[m]
        m = m + 1 #if position m matches, then move forward with the next position
        if m == len(Pattern): # if m is already the end of K (or Pattern), then a fully matched pattern is found. Continue the comparison by moving Pattern forward K[m] characters
            print(i - m + 1, i)
            m = K[m]

def main():
    Pattern = "abcba"
    letters = "abc"
    Chars = print ( ''.join(random.choice(letters) for i in range(1000)) )
    KMP(Pattern, Chars)

if __name__ == '__main__':
    main()
When I run this code on a string of randomly generated letters drawn from "abc", I get the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-25-c7bc734e5e35> in <module>
1 if __name__ == '__main__':
----> 2 main()
<ipython-input-24-2c3de20f253f> in main()
3 letters = "abc"
4 Chars = print ( ''.join(random.choice(letters) for i in range(1000)) )
----> 5 KMP(Pattern, Chars)
<ipython-input-21-edf1808c23d4> in KMP(Pattern, Chars)
14 #match the string Chars with Pattern
15 m = 0
---> 16 for i in range(0, len(Chars)): #traverse through the list one by one
17 while(m >= 0 and Pattern[m] != Chars[i]): # if they do not match then move Pattern forward with K[m] characters and restart the comparison
18 m = K[m]
TypeError: object of type 'NoneType' has no len()
I am not sure what I am doing wrong. Any help will be greatly appreciated.

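print() writes its argument to the console and returns None, so Chars was None and len(Chars) raised the TypeError. After I replaced
    Chars = print ( ''.join(random.choice(letters) for i in range(1000)) )
with
    Chars = ''.join(random.choice(letters) for i in range(1000))
it worked for me.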
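The cause can be checked directly: the return value of print() is None. As a minimal sketch (the names kmp_search and random_text are hypothetical, not from the question), the same algorithm can return the match positions instead of printing them, which makes it easy to compare the result against a brute-force scan:

```python
import random


def kmp_search(pattern, text):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    # fail[j] is the length of the longest proper border of pattern[:j];
    # fail[0] = -1 is a sentinel meaning "move on to the next text character".
    fail = [-1]
    n = -1
    for ch in pattern:
        while n >= 0 and pattern[n] != ch:
            n = fail[n]
        n += 1
        fail.append(n)

    matches = []
    m = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while m >= 0 and pattern[m] != ch:
            m = fail[m]
        m += 1
        if m == len(pattern):
            matches.append(i - m + 1)
            m = fail[m]  # keep going so overlapping matches are found
    return matches


def random_text(letters, length, seed=None):
    """Build a random string; note that nothing is printed here."""
    rng = random.Random(seed)
    return ''.join(rng.choice(letters) for _ in range(length))


returned = print("print returns:")  # prints the text, returns None
chars = random_text("abc", 1000, seed=0)
matches = kmp_search("abcba", chars)
```

Keeping the generation and the printing separate (print(chars) on its own line, if the text should be shown) avoids the original mistake.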

Related

Finding a substring that occurs k times in a long string

I'm trying to solve an algorithm task, but my solution does not pass the time limit.
The task is as follows:
You are given a long string consisting of lowercase Latin letters. Find all of its substrings of length n that occur at least k times.
Input format:
The first line contains two natural numbers n and k separated by a space.
The second line contains a string consisting of lowercase Latin letters. The string length L satisfies 1 ≤ L ≤ 10^6.
n ≤ L, k ≤ L.
Output Format:
For each found substring, print the index of the beginning of its first occurrence (numbering in the string starts from zero).
Output indexes in any order, in one line, separated by a space.
My final solution looks something like this:
def polinomial_hash(s: str, q: int, R: int) -> int:
    h = 0
    for c in s:
        h = (h * q + ord(c)) % R
    return h

def get_index_table(inp_str, n):
    q = 1000000007
    power = q ** (n-1)
    R = 2 ** 64
    M = len(inp_str)
    res_dict = {}
    cur_hash = polinomial_hash(inp_str[:n], q, R)
    res_dict[cur_hash] = [0]
    for i in range(n, M):
        first_char = inp_str[i-n]
        next_char = inp_str[i]
        cur_hash = (
            (cur_hash - ord(first_char)*(power))*q
            + ord(next_char)) % R
        try:
            d_val = res_dict[cur_hash]
            d_val += [i-n+1]
        except KeyError:
            res_dict[cur_hash] = [i-n+1]
    return res_dict

if __name__ == '__main__':
    n, k = [int(i) for i in input().split()]
    inp_str = input()
    for item in get_index_table(inp_str, n).values():
        if len(item) >= k:
            print(item[0], end=' ')
Is it possible to optimize this solution, or can you suggest an alternative approach?
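One likely bottleneck in the code above is power = q ** (n-1): for large n this is an exact integer with millions of digits, and every later multiplication by it is correspondingly slow. A minimal sketch of one possible fix, assuming the same q and R as the question, reduces the power modulo R with three-argument pow and keeps only a count and a first index per hash. The function name frequent_substring_starts is hypothetical, and like the original it treats equal hashes as equal substrings without guarding against collisions:

```python
def frequent_substring_starts(text, n, k, q=1000000007, R=2 ** 64):
    """Return first-occurrence indices of length-n substrings seen >= k times."""
    if n < 1 or n > len(text):
        return []
    power = pow(q, n - 1, R)  # modular power: stays below R

    # Hash of the first window.
    h = 0
    for c in text[:n]:
        h = (h * q + ord(c)) % R

    first_index = {h: 0}  # hash -> index of first occurrence
    count = {h: 1}        # hash -> number of occurrences
    for i in range(n, len(text)):
        # Drop the leftmost character, shift, and add the new character.
        h = ((h - ord(text[i - n]) * power) * q + ord(text[i])) % R
        if h in count:
            count[h] += 1
        else:
            count[h] = 1
            first_index[h] = i - n + 1
    return sorted(first_index[key] for key, c in count.items() if c >= k)


if __name__ == '__main__':
    print(*frequent_substring_starts("abababa", 2, 3))
```

Storing a counter instead of a list of every position also keeps memory proportional to the number of distinct hashes rather than the number of windows.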

Why isn't chr() outputting the correct character?

I'm working on a Caesar cipher in Python 3, where s is the input string and k is the amount by which each letter is shifted. I'm currently just trying to get a letter like 'z' to wrap around to 'B' (I know the case is wrong, I'll fix it later). However, when I run caesarCipher with the inputs s = 'z' and k = 2, the line s[n] = chr((122-ord(s[n]) + 64 + k)) causes s[n] to equal 'D'. If I adjust it down by two (logically, on the Unicode scale, this should equal 'B'), it makes s[n] = #. What am I doing wrong on that line that's causing 'B' not to be the output?
def caesarCipher(s, k):
    # Write your code here
    n = 0
    s = list(s)
    while n < len(s):
        if s[n].isalpha() == True:
            if (ord(s[n].lower())+k) < 123:
                s[n] = (chr(ord(s[n])+k))
                n += 1
            else:
                s[n] = chr((122-ord(s[n]) + 64 + k))
        else:
            n += 1
    s = ''.join(s)
    return s
You forgot to add 1 to n in the else branch, the one taken when (ord(s[n].lower())+k) < 123 is false, so the loop processes the same s[n] twice or more: 'z' first becomes 'B', and on the next pass 'B' is shifted again to 'D'.
Change it to
            else:
                s[n] = chr((122 - ord(s[n]) + 64 + k))
                n += 1
and if you input "z" and 2, you'll get "B"
print(caesarCipher("z", 2))
# B
and if you adjust it down two, you'll get "@" (chr(64)), which is the character two positions before "B" in ASCII.
...
            else:
                s[n] = chr((122 - ord(s[n]) + 62 + k))
                n += 1
...
print(caesarCipher("z", 2))
# @
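The wrap-around can also be expressed with modular arithmetic, which removes the special case and keeps the letter's case. This is only a sketch under the assumption that the input uses the 26 ASCII letters; caesar_shift is a hypothetical name, not part of the original code:

```python
def caesar_shift(text, k):
    """Shift ASCII letters by k positions, wrapping within their own case."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord('a') if ch.islower() else ord('A')
            # Position in the alphabet (0-25), shifted and wrapped with % 26.
            out.append(chr(base + (ord(ch) - base + k) % 26))
        else:
            out.append(ch)  # leave digits, spaces, punctuation unchanged
    return ''.join(out)


print(caesar_shift("z", 2))
print(caesar_shift("Hello, World!", 3))
```

Because Python's % always returns a non-negative result for a positive modulus, a negative k decodes what a positive k encoded.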

Speeding up my code for pset6 DNA in cs50x

I am currently doing the CS50 DNA pset. My code is complete, but it is slow on large files, which makes check50 mark it as wrong. My code and the check50 result are below.
import sys
import csv

def main():
    argc = len(sys.argv)
    if (argc != 3):
        print("Usage: python dna.py [database] [sequence]")
        exit()
    # Sets variable name for each argv argument
    arg_database = sys.argv[1]
    arg_sequence = sys.argv[2]
    # Converts sequence csv file to string, and returns as thus
    sequence = get_sequence(arg_sequence)
    seq_len = len(sequence)
    # Returns STR patterns as list
    STR_array = return_STRs(arg_database)
    STR_array_len = len(STR_array)
    # Counts highest instance of consecutively reoccurring STRs
    STR_values = STR_count(sequence, seq_len, STR_array, STR_array_len)
    DNA_match(STR_values, arg_database, STR_array_len)

# Reads argv2 (sequence), and returns text within as a string
def get_sequence(arg_sequence):
    with open(arg_sequence, 'r') as csv_sequence:
        sequence = csv_sequence.read()
    return sequence

# Reads STR headers from arg1 (database) and returns as list
def return_STRs(arg_database):
    with open(arg_database, 'r') as csv_database:
        database = csv.reader(csv_database)
        STR_array = []
        for row in database:
            for column in row:
                STR_array.append(column)
            break
        # Removes first column header (name)
        del STR_array[0]
        return STR_array

def STR_count(sequence, seq_len, STR_array, STR_array_len):
    # Creates a list to store max recurrence values for each STR
    STR_count_values = [0] * STR_array_len
    # Temp value to store current count of STR recurrence
    temp_value = 0
    # Iterates over each STR in STR_array
    for i in range(STR_array_len):
        STR_len = len(STR_array[i])
        # Iterates over each sequence element
        for j in range(seq_len):
            # Ensures it's still physically possible for STR to be present in sequence
            while (seq_len - j >= STR_len):
                # Gets sequence substring of length STR_len, starting from jth element
                sub = sequence[j:(j + (STR_len))]
                # Compares current substring to current STR
                if (sub == STR_array[i]):
                    temp_value += 1
                    j += STR_len
                else:
                    # Ensures current STR_count_value is highest
                    if (temp_value > STR_count_values[i]):
                        STR_count_values[i] = temp_value
                    # Resets temp_value to break count, and pushes j forward by 1
                    temp_value = 0
                    j += 1
        i += 1
    return STR_count_values

# Searches database file for DNA matches
def DNA_match(STR_values, arg_database, STR_array_len):
    with open(arg_database, 'r') as csv_database:
        database = csv.reader(csv_database)
        name_array = [] * (STR_array_len + 1)
        next(database)
        # Iterates over one row of database at a time
        for row in database:
            name_array.clear()
            # Copies entire row into name_array list
            for column in row:
                name_array.append(column)
            # Converts name_array number strings to actual ints
            for i in range(STR_array_len):
                name_array[i + 1] = int(name_array[i + 1])
            # Checks if a row's STR values match the sequence's values, prints the row name if match is found
            match = 0
            for i in range(0, STR_array_len, + 1):
                if (name_array[i + 1] == STR_values[i]):
                    match += 1
            if (match == STR_array_len):
                print(name_array[0])
                exit()
        print("No match")
        exit()

main()
Check50 error link:
https://submit.cs50.io/check50/fd890301a0dc9414cd29c2b4dcb27bd47e6d0a48
If you wait long enough, the program prints the correct answer, but because it runs so slowly check50 marks it as wrong.
Well, I solved it just by adding a break statement.
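The answer does not say where the break was added, so the following is not the author's fix. It is a separate sketch of one way to count the longest run of consecutive repeats in a single backward pass over the sequence; longest_run is a hypothetical helper name:

```python
def longest_run(sequence, pattern):
    """Return the longest number of back-to-back repeats of pattern."""
    step = len(pattern)
    if step == 0:
        return 0
    # runs[i] = number of consecutive copies of pattern starting at index i.
    runs = [0] * (len(sequence) + 1)
    best = 0
    # Walk from right to left so runs[i + step] is already known.
    for i in range(len(sequence) - step, -1, -1):
        if sequence.startswith(pattern, i):
            runs[i] = runs[i + step] + 1
            best = max(best, runs[i])
    return best


print(longest_run("AGATCAGATCAGATCTT", "AGATC"))
```

Each position is examined once per STR, so the work grows linearly with the sequence length instead of restarting a scan from every index.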

Get the nth occurrence of a letter in a string (python)

Let's say there is a string "abcd#abcd#a#".
How do I get the index of the 2nd occurrence of '#', so that the output is 9?
(The second '#' is at position 9.)
Using a generator expression:
text = "abcd#abcd#a#"
gen = (i for i, l in enumerate(text) if l == "#")
next(gen) # skip as many as you need
4
next(gen) # get result
9
As a function:
def index_for_occurrence(text, token, occurrence):
    gen = (i for i, l in enumerate(text) if l == token)
    for _ in range(occurrence - 1):
        next(gen)
    return next(gen)
Result:
index_for_occurrence(text, "#", 2)
9
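Alternatively, for the second occurrence specifically, pass the position just after the first match as the start argument of str.index:
s = 'abcd#abcd#a#'
s.index('#', s.index('#')+1)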
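The chained index call generalizes to any n by repeating the search from just past the previous hit. A minimal sketch, using str.find so that a missing occurrence gives -1 instead of raising; nth_index is a hypothetical name and the -1 convention is an assumption, not something the answers specify:

```python
def nth_index(text, token, occurrence):
    """Return the index of the occurrence-th (1-based) token, or -1."""
    if occurrence < 1 or not token:
        return -1
    pos = -1
    for _ in range(occurrence):
        # Resume the search one character after the previous match.
        pos = text.find(token, pos + 1)
        if pos == -1:
            return -1
    return pos


print(nth_index("abcd#abcd#a#", "#", 2))
```

Unlike the generator version, this also works when token is longer than one character.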

Multiplying all the digits of a number in python

If I have the number 101 and multiply all of its digits (1*0*1), the result is zero. Instead, how can I split the number into 10 and 1 and multiply those (10*1)?
Similar examples,
3003033 -> 300*30*3*3 or
2020049 -> 20*200*4*9
You could use a negative lookbehind to check that the position is not the start of the string, and a positive lookahead for a digit that is not 0, as your split point.
REGEX: essentially this says split where the next digit is not a 0 and the position is not the start of the line
/(?<!^)(?=[^0])/gm
Negative Lookbehind (?<!^)
  Assert that the regex below does not match
  ^ asserts position at start of a line
Positive Lookahead (?=[^0])
  Assert that the regex below matches
  Match a single character not present in the list below [^0]
  0 matches the character 0 literally (case sensitive)
CODE
import re
from functools import reduce

def sum_split_nums(num):
    # Splitting on an empty (zero-width) match requires Python 3.7+
    nums = re.split(r'(?<!^)(?=[^0])', str(num))
    # Start from 1 so a single group (e.g. 300) still yields an int
    total = reduce((lambda x, y: x * int(y)), nums, 1)
    return " * ".join(nums), total

nums = [3003033, 2020049, 101, 4040]
for num in nums:
    expression, total = sum_split_nums(num)
    print(f"{expression} = {total}")
OUTPUT
300 * 30 * 3 * 3 = 81000
20 * 200 * 4 * 9 = 144000
10 * 1 = 10
40 * 40 = 1600
Let a and b be two integers, and let c be the number formed by appending n zeros to the right of b. Then a * c = a * b * 10^n.
This simplifies the task: multiply the digits of your number together, but use 10 in place of every 0. So you don't actually need to split the number.
Here I define two functions. Both convert the number to a string and loop over its digits, branching on whether the digit is 0:
1) Multiply the running result by the digit if it is not 0, otherwise multiply by 10.
def multio1(x):
    s = str(x)
    ans = 1
    for i in range(len(s)):
        if s[i] != '0':
            ans *= int(s[i])
        else:
            ans *= 10
    return(ans)
2) Multiply the running result by the digit if it is not 0, otherwise add one to a count of zeros. At the end, append that many zeros to the right of the result.
def multio2(x):
    s = str(x)
    ans = 1
    number_of_zeros = 0
    for i in range(len(s)):
        if s[i] != '0':
            ans *= int(s[i])
        else:
            number_of_zeros += 1
    if number_of_zeros != 0:
        ans = str(ans)
        for i in range(number_of_zeros):
            ans += '0'
        ans = int(ans)
    return(ans)
For x = 101, 3003033, 2020049, multio1(x) and multio2(x) give the same results:
10, 81000, 144000
That's kind of odd, but this code will work:
a = '3003033'
num = ''
last_itr = 0
tot=1
for i in range(len(a)-1):
    if a[i]=='0' and a[i+1]<='9' and a[i+1]>'0':
        tot*=int(a[last_itr:i+1])
        last_itr=i+1
    elif a[i]>'0' and a[i]<='9' and a[i+1]<='9' and a[i+1]>'0':
        tot*=int(a[i])
        last_itr=i+1
tot*=int(a[last_itr:len(a)])
print(tot)
Just assign your number, as a string, to a.
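The same grouping (a nonzero digit followed by its trailing zeros) can also be collected directly with re.findall rather than by splitting. A minimal sketch, assuming a positive integer input and Python 3.8+ for math.prod; product_of_groups is a hypothetical name:

```python
import math
import re


def product_of_groups(number):
    """Multiply the groups formed by each nonzero digit and its trailing zeros."""
    # '[1-9]0*' matches one nonzero digit plus any zeros that follow it,
    # so 3003033 becomes ['300', '30', '3', '3'].
    groups = re.findall(r'[1-9]0*', str(number))
    # Note: for an input of 0 there are no groups and math.prod returns 1.
    return math.prod(int(g) for g in groups)


for value in (3003033, 2020049, 101, 4040):
    print(value, product_of_groups(value))
```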

Resources