String subpattern recognition optimization

String subpattern recognition optimization - python-3.x

In this kata you need to build a function to return either true/True or false/False if a string can be seen as the repetition of a simpler/shorter subpattern or not.
For example:
has_subpattern("a") == False #no repeated pattern
has_subpattern("aaaa") == True #created repeating "a"
has_subpattern("abcd") == False #no repeated pattern
has_subpattern("abababab") == True #created repeating "ab"
has_subpattern("ababababa") == False #cannot be entirely reproduced repeating a pattern
Strings will never be empty and can be composed of any character (just consider upper- and lowercase letters as different entities) and can be pretty long (keep an eye on performances!).
My solution is:
def has_subpattern(string):
string_size = len(string)
for i in range(1, string_size):
slice1 = string[:i]
appearence_count = string.count(slice1)
slice1_len = len(slice1)
if appearence_count > 0:
if appearence_count * slice1_len == string_size:
return True
return False
Obviously there are weak and too slow things like slice1 = string[:i] and string.count() in loop..
Is there better ways to solve an issue or ways to improve performance ?

Short regex approach:
import re
def has_subpattern_re(s):
return bool(re.search(r'^(\w+)\1+$', s))
It'll provide a close (to initial has_subpattern approach) performance on small strings:
import timeit
...
print(timeit.timeit('has_subpattern("abababab")', 'from __main__ import has_subpattern'))
0.7413144190068124
print(timeit.timeit('has_subpattern_re("abababab")', 'from __main__ import re, has_subpattern_re'))
0.856149295999785
But, a significant performance increase (in about 3-5 times faster) on long strings:
print(timeit.timeit('has_subpattern("ababababababababababababababababababababababababa")', 'from __main__ import has_subpattern'))
14.669428467008402
print(timeit.timeit('has_subpattern_re("ababababababababababababababababababababababababa")', 'from __main__ import re, has_subpattern_re'))
4.308312018998549
And one more test for a more longer string:
print(timeit.timeit('has_subpattern("ababababababababababababababababababababababababaababababababababababababababababababababababababab")', 'from __main__ import has_subpattern'))
35.998031173992786
print(timeit.timeit('has_subpattern_re("ababababababababababababababababababababababababaababababababababababababababababababababababababab")', 'from __main__ import re, has_subpattern_re'))
7.010367843002314

Within standard Python, the bottlenecks here will be count, which enjoys C speed implementation and the looping.
The looping itself may be hard to speed-up (althogh Cython may be of some help).
Hence, the most important optimization is to reduce the number of loopings.
One obvious way is to let range() do not exceed half the size of the input (+ 2: + 1 for rounding issues, + 1 for end extrema exclusion in range()):
Also, string is a standard Python module, so better not use it as a variable name.
def has_subpattern_loop(text):
for i in range(1, len(text) // 2 + 2):
subtext = text[:i]
num_subtext = text.count(subtext)
if num_subtext > 1 and num_subtext * len(subtext) == len(text):
return True
return False
A much more effective way of restricting the number of calls to count is to skip computation when i is not a multiple of the length of the input.
def has_subpattern_loop2(text):
for i in range(1, len(text) // 2 + 2):
if len(text) % i == 0:
subtext = text[:i]
num_subtext = text.count(subtext)
if num_subtext > 1 and num_subtext * len(subtext) == len(text):
return True
return False
Even better would be to generate only the divisors of the length of the input.
This could be done using sympy and the approach outlined here:
import sympy as sym
import functools
def get_divisors(n):
if n == 1:
yield 1
return
factors = list(sym.factor_.factorint(n).items())
nfactors = len(factors)
f = [0] * nfactors
while True:
yield functools.reduce(lambda x, y: x * y, [factors[x][0]**f[x] for x in range(nfactors)], 1)
i = 0
while True:
f[i] += 1
if f[i] <= factors[i][1]:
break
f[i] = 0
i += 1
if i >= nfactors:
return
def has_subpattern_divs(text):
for i in get_divisors(len(text)):
subtext = text[:i]
num_subtext = text.count(subtext)
if num_subtext > 1 and num_subtext * len(subtext) == len(text):
return True
return False
A completely different approach is the one proposed in #ВладДавидченко answer:
def has_subpattern_find(text):
return (text * 2).find(text, 1) != len(text)
or the more memory efficient (requires ~50% less additional memory compared to has_subpattern_find2()):
def has_subpattern_find2(text):
return (text + text[:len(text) // 2 + 2]).find(text, 1) > 0
and it is based on the idea that if there is a exactly self-repeating string, the string itself must be found in a circularly extended string:
Input: abab
Extension1: abababab
Found1: |-abab
Extension2: ababab
Found2: |-abab
Input: ababa
Extension1: ababaababa
Found1: |----ababa
Extension2: ababab
Found2: NOT FOUND!
The find-based method are the fastest, with has_subpattern_find() being fastest in the small input size regime, and has_subpattern_find2() gets generally faster in the intermediate and large input size regime (especially in the False case).
For shorter inputs, the direct looping approaches (especially has_subpattern_loop2()) are fastest, closely followed (but sometimes surpassed by has_subpattern_re()), but as soon as the input gets bigger (and especially for the False outcome), the has_subpattern_divs() method gets to be the fastest (aside of find-based ones) by far and large, as shown by the following benchmarks.
For the True outcome, has_subpattern_loop2() gets to be the fastest due to the very small number of loops required, which is independent of the input size.
The input is generated as a function of n using:
def gen_input(n, m=0):
g = string.ascii_lowercase
if not m:
m = n
offset = '!' if n % 2 else ''
return g[:n] * (m // min(n, len(g)) + 2) + offset
so that if n is even, the has_subpattern*() always return True and the opposite for odd n.
Note that, in general, the has_subpattern() function will depend not only on the raw size of the input but also on the length of the repeating string, if any. This is not explored in the benchmarks, except for the odd/even separation.
Even Inputs
Odd Inputs
(Full code available here).
(EDITED to include some more solutions as well as comparison with regex-based solution from #RomanPerekhrest)
(EDITED to include some more solutions based on the find from #ВладДавидченко)

Found another one solution, probably will be useful:
def has_subpattern(string):
return (string * 2).find(string, 1) != len(string)

Related

What should I do to get this code running (without changing the code)

I was solving the 3rd Question on Project Euler (Largest Prime Factor) and I'm a beginner at Python 3.
This is the solution I came up with, it works but not with very large numbers
x=int(input("Enter a number:"))
a=[]
for i in range(1,x+1):
cnt=0
if x%i==0:
for j in range(1,i+1):
if i%j==0:
cnt=cnt+1
if cnt==2:
a.append(i)
print(a[len(a)-1])
I understand its very basic, and its too slow to run large inputs, but is there any way a compiler could give me the output for this input - 600851475143. I tried using pypy3, it was taking too long as well.
Its my first time I'm using stackoverflow, so let me know if I'm doing anything wrong too.

I know you said you don't want to change the code but you would have to, if you want to solve it efficiently.
There actually a lib just for this eulerlib but the built-in math module can do it too.
If you want to use python with no modules you could try this but it is probably just as slow for large numbers
def Largest_Prime_Factor(n):
prime_factor = 1
i = 2
while i <= n / i:
if n % i == 0:
prime_factor = i
n /= i
else:
i += 1
prime_factor = max(prime_factor, n)
return prime_factor
The built-in math module can also do this and is far quicker. Since it is built-in you don't need any external libs like eulerlib
import math
# Getting input from user
n = int(input("Enter the number : "))
maxPrimeFactor = 0
# Checking and converting the number to odd
while n % 2 == 0:
maxPrimeFactor = 2
n = n/2
# Finding and dividing the number by all
# prime factors and replacing maxPrimeFactor
for i in range(3, int(math.sqrt(n)) + 1, 2):
while n % i == 0:
maxPrimeFactor = i
n = n / i
if n > 2:
maxPrimeFactor = n
print("The largest prime Factor of the number is ",int(maxPrimeFactor))

Find the point on the top or bottom of a sequence of numbers

I have a problem like this and i would like to write a snippet of code to solve this problem.
Sequences like: [1,2,3,2], [1,3,2], [1,3,2,1] -> i want to output 3 (maximum) because the sequence increases to 3 and then decreases again
Sequences like [3,2,1,2], [3,1,2], [3,1,2,3] -> i want to output 1 (minimum) because the sequence decreases to 1 and then increases again
Any idea on how to do this automatically?

Try getting local maximas and/or local minimas:
import numpy as np
from scipy.signal import argrelextrema
a = np.array([3,2,1,3])
res=a[np.hstack([argrelextrema(a, np.greater),argrelextrema(a, np.less)]).ravel()]
This will return both local maximas and minimas. You can mark them somehow separately, if it's better for your use case. From your question I assumed it can be just one extremum. Also - depending on your data you might consider using np.less_equal or np.greater_equal instead of np.less or np.greater respectively.

I found it interesting to implement this algorithm in Python 3.
The basic idea is practically to find the minimum and maximum given a sequence of numbers. A sequence, however, can have several maximum points and several minimum points to be taken into consideration. However this is the algorithm I implemented.
I hope it's useful.
sequence = [1,2,3,4,5,4,3,2,8,10,1]
index = 1
max_points = [] #A sequence may have multiple points of relatives max
relative_maximum = sequence[0]
for element in range(len(sequence)):
if(index == len(sequence)):
break
if(sequence[element] < sequence[index]):
relative_maximum = sequence[index]
change_direction = True
else:
if(change_direction == True):
max_points.append(relative_maximum)
change_direction = False
index = index + 1
index = 1
min_points = [] #A sequence may have multiple points of relatives min
relative_minimum = sequence[0]
for element in range(len(sequence)):
if(index == len(sequence)):
break
if(sequence[element] > sequence[index]):
relative_minimum = sequence[index]
change_direction = True
else:
if(change_direction == True):
min_points.append(relative_minimum)
change_direction = False
index = index + 1
print("The max points: " + str(max_points))
print("The min points: " + str(min_points))
Result: The max points: [5, 10] The min points: [2]

Optimize Sieve of Eratosthenes Further

I have written a Sieve of Eratosthenes--I think--but it seems like it's not as optimized as it could be. It works, and it gets all the primes up to N, but not as quickly as I'd hoped. I'm still learning Python--coming from two years of Java--so if something isn't particularly Pythonic then I apologize:
def sieve(self):
is_prime = [False, False, True, True] + [False, True] * ((self.lim - 4) // 2)
for i in range(3, self.lim, 2):
if i**2 > self.lim: break
if is_prime[i]:
for j in range(i * i, self.lim, i * 2):
is_prime[j] = False
return is_prime
I've looked at other questions similar to this one but I can't figure out how some of the more complicated optimizations would fit in to my code. Any suggestions?
EDIT: as requested, some of the other optimizations I've seen are stopping the iteration of the first for loop before the limit, and skipping by different numbers--which I think is wheel optimization?
EDIT 2: Here's the code that would utilize the method, for Padraic:
primes = sieve.sieve()
for i in range(0, len(primes)):
if primes[i]:
print("{:d} ".format(i), end = '')
print() # print a newline

a slightly different approach: use a bitarray to represent the odd numbers 3,5,7,... saving some space compared to a list of booleans.
this may save some space only and not help speedup...
from bitarray import bitarray
def index_to_number(i): return 2*i+3
def number_to_index(n): return (n-3)//2
LIMIT_NUMBER = 50
LIMIT_INDEX = number_to_index(LIMIT_NUMBER)+1
odd_primes = bitarray(LIMIT_INDEX)
# index 0 1 2 3
# number 3 5 7 9
odd_primes.setall(True)
for i in range(LIMIT_INDEX):
if odd_primes[i] is False:
continue
n = index_to_number(i)
for m in range(n**2, LIMIT_NUMBER, 2*n):
odd_primes[number_to_index(m)] = False
primes = [index_to_number(i) for i in range(LIMIT_INDEX)
if odd_primes[i] is True]
primes.insert(0,2)
print('primes: ', primes)
the same idea again; but this time let bitarray handle the inner loop using slice assignment. this may be faster.
for i in range(LIMIT_INDEX):
if odd_primes[i] is False:
continue
odd_primes[2*i**2 + 6*i + 3:LIMIT_INDEX:2*i+3] = False
(none of this code has been seriously checked! use with care)
in case you are looking for a primes generator based on a different method (wheel factorizaition) have a look at this excellent answer.

Trying to understand this simple python code

I was reading Jeff Knupp's blog and I came across this easy little script:
import math
def is_prime(n):
if n > 1:
if n == 2:
return True
if n % 2 == 0:
return False
for current in range(3, int(math.sqrt(n) + 1), 2):
if n % current == 0:
return False
return True
return False
print(is_prime(17))
(note: I added the import math at the beginning. You can see the original here:
http://www.jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained/)
This is all pretty straightforward and I get the majority of it, but I'm not sure what's going on with his use of the range function. I haven't ever used it this way or seen anyone else use it this way, but then I'm a beginner. What does it mean for the range function to have three parameters, and how does this accomplish testing for primeness?
Also (and apologies if this is a stupid question), but the very last 'return False' statement. That is there so that if a number is passed to the function that is less than one (and thus not able to be prime), the function won't even waste its time evaluating that number, right?

The third is the step. It iterates through every odd number less than or equal to the square root of the input (3, 5, 7, etc.).

import math #import math module
def is_prime(n): #define is_prime function and assign variable n to its argument (n = 17 in this example).
if n > 1: #check if n (its argument) is greater than one, if so, continue; else return False (this is the last return in the function).
if n == 2: #check if n equals 2, it so return True and exit.
return True
if n % 2 == 0: #check if the remainder of n divided by two equas 0, if so, return False (is not prime) and exit.
return False
for current in range(3, int(math.sqrt(n) + 1), 2): #use range function to generate a sequence starting with value 3 up to, but not including, the truncated value of the square root of n, plus 1. Once you have this secuence give me every other number ( 3, 5, 7, etc)
if n % current == 0: #Check every value from the above secuence and if the remainder of n divided by that value is 0, return False (it's not prime)
return False
return True #if not number in the secuence divided n with a zero remainder then n is prime, return True and exit.
return False
print(is_prime(17))

Repeat string to certain length

What is an efficient way to repeat a string to a certain length? Eg: repeat('abc', 7) -> 'abcabca'
Here is my current code:
def repeat(string, length):
cur, old = 1, string
while len(string) < length:
string += old[cur-1]
cur = (cur+1)%len(old)
return string
Is there a better (more pythonic) way to do this? Maybe using list comprehension?

Jason Scheirer's answer is correct but could use some more exposition.
First off, to repeat a string an integer number of times, you can use overloaded multiplication:
>>> 'abc' * 7
'abcabcabcabcabcabcabc'
So, to repeat a string until it's at least as long as the length you want, you calculate the appropriate number of repeats and put it on the right-hand side of that multiplication operator:
def repeat_to_at_least_length(s, wanted):
return s * (wanted//len(s) + 1)
>>> repeat_to_at_least_length('abc', 7)
'abcabcabc'
Then, you can trim it to the exact length you want with an array slice:
def repeat_to_length(s, wanted):
return (s * (wanted//len(s) + 1))[:wanted]
>>> repeat_to_length('abc', 7)
'abcabca'
Alternatively, as suggested in pillmod's answer that probably nobody scrolls down far enough to notice anymore, you can use divmod to compute the number of full repetitions needed, and the number of extra characters, all at once:
def pillmod_repeat_to_length(s, wanted):
a, b = divmod(wanted, len(s))
return s * a + s[:b]
Which is better? Let's benchmark it:
>>> import timeit
>>> timeit.repeat('scheirer_repeat_to_length("abcdefg", 129)', globals=globals())
[0.3964178159367293, 0.32557755894958973, 0.32851039397064596]
>>> timeit.repeat('pillmod_repeat_to_length("abcdefg", 129)', globals=globals())
[0.5276265419088304, 0.46511475392617285, 0.46291469305288047]
So, pillmod's version is something like 40% slower, which is too bad, since personally I think it's much more readable. There are several possible reasons for this, starting with its compiling to about 40% more bytecode instructions.
Note: these examples use the new-ish // operator for truncating integer division. This is often called a Python 3 feature, but according to PEP 238, it was introduced all the way back in Python 2.2. You only have to use it in Python 3 (or in modules that have from __future__ import division) but you can use it regardless.

def repeat_to_length(string_to_expand, length):
return (string_to_expand * ((length/len(string_to_expand))+1))[:length]
For python3:
def repeat_to_length(string_to_expand, length):
return (string_to_expand * (int(length/len(string_to_expand))+1))[:length]

This is pretty pythonic:
newstring = 'abc'*5
print newstring[0:6]

def rep(s, m):
a, b = divmod(m, len(s))
return s * a + s[:b]

from itertools import cycle, islice
def srepeat(string, n):
return ''.join(islice(cycle(string), n))

Perhaps not the most efficient solution, but certainly short & simple:
def repstr(string, length):
return (string * length)[0:length]
repstr("foobar", 14)
Gives "foobarfoobarfo". One thing about this version is that if length < len(string) then the output string will be truncated. For example:
repstr("foobar", 3)
Gives "foo".
Edit: actually to my surprise, this is faster than the currently accepted solution (the 'repeat_to_length' function), at least on short strings:
from timeit import Timer
t1 = Timer("repstr('foofoo', 30)", 'from __main__ import repstr')
t2 = Timer("repeat_to_length('foofoo', 30)", 'from __main__ import repeat_to_length')
t1.timeit() # gives ~0.35 secs
t2.timeit() # gives ~0.43 secs
Presumably if the string was long, or length was very high (that is, if the wastefulness of the string * length part was high) then it would perform poorly. And in fact we can modify the above to verify this:
from timeit import Timer
t1 = Timer("repstr('foofoo' * 10, 3000)", 'from __main__ import repstr')
t2 = Timer("repeat_to_length('foofoo' * 10, 3000)", 'from __main__ import repeat_to_length')
t1.timeit() # gives ~18.85 secs
t2.timeit() # gives ~1.13 secs

How about string * (length / len(string)) + string[0:(length % len(string))]

i use this:
def extend_string(s, l):
return (s*l)[:l]

Not that there haven't been enough answers to this question, but there is a repeat function; just need to make a list of and then join the output:
from itertools import repeat
def rep(s,n):
''.join(list(repeat(s,n))

Yay recursion!
def trunc(s,l):
if l > 0:
return s[:l] + trunc(s, l - len(s))
return ''
Won't scale forever, but it's fine for smaller strings. And it's pretty.
I admit I just read the Little Schemer and I like recursion right now.

This is one way to do it using a list comprehension, though it's increasingly wasteful as the length of the rpt string increases.
def repeat(rpt, length):
return ''.join([rpt for x in range(0, (len(rpt) % length))])[:length]

Another FP aproach:
def repeat_string(string_to_repeat, repetitions):
return ''.join([ string_to_repeat for n in range(repetitions)])

def extended_string (word, length) :
extra_long_word = word * (length//len(word) + 1)
required_string = extra_long_word[:length]
return required_string
print(extended_string("abc", 7))

c = s.count('a')
div=n//len(s)
if n%len(s)==0:
c= c*div
else:
m = n%len(s)
c = c*div+s[:m].count('a')
print(c)

Currently print(f"{'abc'*7}") generates:
abcabcabcabcabcabcabc

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

String subpattern recognition optimization - python-3.x

Found another one solution, probably will be useful: def has_subpattern(string): return (string * 2).find(string, 1) != len(string)

Related

What should I do to get this code running (without changing the code)

Find the point on the top or bottom of a sequence of numbers

Optimize Sieve of Eratosthenes Further

Trying to understand this simple python code

Repeat string to certain length

Categories

Resources