Explanation for the "loop rolling algorithm" - string

I'm trying to implement the algorithm given by Evgeny Kluev in his answer to loop rolling algorithm
but I'm having trouble getting it to work. Below is an example I tried to work out by hand following his instructions:
text: ababacababd
<STEP 1>
suffixes and LCPs:
ababacababd
4
ababd
3
abacababd
2
abd
1
acababd
0
babacababd
3
babd
2
bacababd
1
bd
0
cababd
0
d
<STEP 2>
sorted LCP array indices: 0,1,5,2,6,3,7,4,8,9
(values) : 4,3,3,2,2,1,1,1,0,0
<STEP 3>
LCP groups sorted by position in text (format => position: entry):
lcp 4:
0: ababacababd
6: ababd
lcp 3:
1: babacababd
2: abacababd
6: ababd
7: babd
lcp 2:
2: abacababd
3: bacababd
7: babd
8: abd
lcp 1:
3: bacababd
4: acababd
8: abd
9: bd
lcp 0:
0: ababacababd
1: babacababd
4: acababd
5: cababd
9: bd
10: d
<STEP 4>
entries remaining after filter (LCP == positional difference):
None! Only (abd),(bd) and (bacababd),(acababd) from the LCP 1 group
have a positional difference equal to 1, but they don't share prefixes
with each other. Shouldn't I have at least (ab) and (ba) here?
Can anybody tell me what I'm doing wrong in this process?
Also, he says at the end of step 4 we should have all possible sequences in the text, does he mean all possible repeating sequences?
Is this a known algorithm with a name that I can find more details on elsewhere?
(I'm also confused about his definition of intersecting sequences in step 5, but maybe it would make sense if I understood the preceding steps correctly).
EDIT: Here is what I have for step 4,5,6 after Evgeny's helpful clarification:
<STEP 4>
filter pseudocode:
results = {}
for (positions,lcp) in lcp_groups:
    results[lcp] = []
    while positions not empty:
        pos = positions.pop(0) #pop lowest element
        if (pos+lcp) in positions:
            common = prefix(input, pos, lcp)
            if common.size() < lcp:
                next
            i = 1
            while (pos+lcp*(i+1)) in positions:
                if common != prefix(input, pos+lcp*i, lcp):
                    break
                positions.delete(pos+lcp*i)
                i += 1
            results[lcp].append( (common, pos, i+1) )
application of filter logic:
lcp 4:
0: ababacababd # 4 not in {6}
6: ababd # 10 not in {}
lcp 3:
0: ababacababd # 3 not in {1,2,6,7}
1: babacababd # 4 not in {2,6,7}
2: abacababd # 5 not in {6,7}
6: ababd # 9 not in {7}
7: babd # 10 not in {}
lcp 2:
0: ababacababd # 2 in {1,2,3,6,7,8}, 4 not in {1,2,3,6,7,8} => ("ab", 0, 2)
1: babacababd # 3 in {2,3,6,7,8}, 5 not in {2,3,6,7,8} => ("ba", 1, 2)
2: abacababd # 4 not in {3,6,7,8}
3: bacababd # 5 not in {6,7,8}
6: ababd # 8 in {7,8}, 10 not in {7,8} => ("ab", 6, 2)
7: babd # 9 not in {8}
8: abd # 10 not in {}
lcp 1:
0: ababacababd # 1 in {1,2,3,4,6,7,8,9}, prefix is ""
1: babacababd # 2 in {2,3,4,6,7,8,9}, prefix is ""
2: abacababd # 3 in {3,4,6,7,8,9}, prefix is ""
3: bacababd # 4 in {4,6,7,8,9}, prefix is ""
4: acababd # 5 not in {6,7,8,9}
6: ababd # 7 in {7,8,9}, prefix is ""
7: babd # 8 in {8,9}, prefix is ""
8: abd # 9 in {9}, prefix is ""
9: bd # 10 not in {}
sequences: [("ab", 0, 2), ("ba", 1, 2), ("ab", 6, 2)]
<STEP 5>
add sequences in order of LCP grouping. sequences within an LCP group
are added according to which was generated first:
lcp 4: no sequences
lcp 3: no sequences
lcp 2: add ("ab", 0, 2)
lcp 2: don't add ("ba", 1, 2) because it intersects with ("ab", 0, 2)
lcp 2: add ("ab", 6, 2)
lcp 1: no sequences
collection = [("ab", 0, 2), ("ab", 6, 2)]
(order by position not by which one was added first)
<STEP 6>
recreate input by iterating through the collection in order and
filling in gaps with the normal input:
input = "ab"*2 + input[4..5] + "ab"*2 + input[10..10]
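Steps 5 and 6 as I understand them can be sketched in Python as below. This is only my reading, not Evgeny's definition: I assume two sequences "intersect" when the ranges of text they cover overlap, and select_and_rebuild is a name made up for this example.

```python
def select_and_rebuild(text, sequences):
    """Step 5: keep non-overlapping sequences; step 6: rebuild the text."""
    chosen = []
    for unit, start, count in sequences:
        end = start + len(unit) * count
        # Assumed meaning of "intersects": the covered ranges overlap.
        if all(end <= s or start >= s + len(u) * c for u, s, c in chosen):
            chosen.append((unit, start, count))
    chosen.sort(key=lambda seq: seq[1])  # order by position in the text

    parts = []
    cursor = 0
    for unit, start, count in chosen:
        parts.append(text[cursor:start])  # gap copied from the input
        parts.append(unit * count)        # rolled-up repetition
        cursor = start + len(unit) * count
    parts.append(text[cursor:])
    return chosen, "".join(parts)


sequences = [("ab", 0, 2), ("ba", 1, 2), ("ab", 6, 2)]
chosen, rebuilt = select_and_rebuild("ababacababd", sequences)
print(chosen)
print(rebuilt == "ababacababd")
```

The sequences are passed in the order they were generated (higher LCP groups first), so earlier ones win when two overlap.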
Evgeny, one quick question for you if you happen to look at this again:
Am I doing step 5 correctly? That is, do I add sequences according to which LCP group they were generated from (with higher LCP valued groups coming first)? Or is it something else related to LCP?
Also, if there is anything wrong with step 4 or 6 please let me know, but it seems what I have works very well for this example.

I have to clarify what is meant by "grouping by LCP value" in the original answer. In fact, the group for a given LCP value should also include all entries with larger LCP values.
This means that for your example, while processing LCP3, we need to merge the preceding entries 0 and 6 into this group. And while processing LCP2, we need to merge in all preceding entries from LCP3 and LCP4: 0, 1, 2, 6, 7.
As a result, two (ab) pairs as well as one (ba) pair remain after the filter. But since (ba) "intersects" with the first (ab) pair, it is rejected in step 5.
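To make the cumulative grouping concrete, here is a small Python sketch of steps 2 to 4. It is an illustration under simplifying assumptions, not the implementation: it builds the suffix array naively and compares substrings directly instead of using RMQ, and find_repeats and lcp_len are names made up for this example.

```python
def lcp_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def find_repeats(text):
    """Return (unit, start, count) for adjacent repetitions found in step 4."""
    # Naive suffix array: fine for a demo, far too slow for long inputs.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # lcp[k] = common prefix length of the suffixes at sa[k] and sa[k + 1].
    lcp = [lcp_len(text[sa[k]:], text[sa[k + 1]:]) for k in range(len(sa) - 1)]

    # Step 2: each suffix goes into the group of its larger neighbouring LCP.
    groups = {}
    for k, pos in enumerate(sa):
        left = lcp[k - 1] if k > 0 else 0
        right = lcp[k] if k < len(lcp) else 0
        groups.setdefault(max(left, right), []).append(pos)
    if not groups:
        return []

    results = []
    positions = set()
    for size in range(max(groups), 0, -1):      # step 4, largest LCP first
        positions.update(groups.get(size, []))  # merge in larger-LCP entries
        for start in sorted(positions):         # step 3: order by position
            unit = text[start:start + size]
            prev = start - size
            if prev in positions and text[prev:start] == unit:
                continue  # inside a chain that was already reported
            count = 1
            while (start + count * size in positions
                   and text[start + count * size:start + (count + 1) * size] == unit):
                count += 1
            if count > 1:
                results.append((unit, start, count))
    return results


print(find_repeats("ababacababd"))
```

For "ababacababd" this reports ("ab", 0, 2), ("ba", 1, 2) and ("ab", 6, 2), i.e. the two (ab) pairs and the one (ba) pair mentioned above.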
Also, he says at the end of step 4 we should have all possible sequences in the text, does he mean all possible repeating sequences?
That's right, I mean all possible repeating sequences.
Is this a known algorithm with a name that I can find more details on elsewhere?
I don't know. Never seen such algorithm before.
Here is how steps 2 .. 4 may be implemented (in pseudo-code):
for (in_sa, in_src) in suffix_array: # step 2
    lcp_groups[max(LCP[in_sa.left], LCP[in_sa.right])].append((in_sa, in_src))
apply(sort_by_position_in_src, lcp_groups) # step 3
for lcp from largest downto 1: # step 4
    # sorted by position in src array and unique:
    positions = merge_and_uniq(positions, lcp_groups[lcp])
    for start in positions:
        pos = start
        while ((next = positions[pos.in_src + lcp]).exists
               and LCP.RMQ(pos.in_sa, next.in_sa) >= lcp
               and not ((prev = positions[pos.in_src - lcp]).exists # to process each
                        and LCP.RMQ(pos.in_sa, prev.in_sa) >= lcp)): # chain only once
            pos = next
        if pos != start:
            pass_to_step5(start, lcp, pos + lcp)
Here I don't plan any particular data structure for positions. But for convenience an ordered associative array is assumed. RMQ is Range Minimum Query, so LCP array should be preprocessed accordingly.
This code is practically the same as the code in the OP. But instead of expensive string comparisons (like common != prefix(input, pos+lcp*i, lcp)) it uses RMQ, which (if properly implemented) works almost instantly. It has the same effect as the string comparison, because it lets us reject a substring that has too few leading characters in common with the preceding substring.
It has quadratic worst-case time complexity, so it should be slow for inputs like "aaaaaaaaa". Its time complexity for "better" strings is not easy to determine; it is probably sub-quadratic in the "average" case. The same problem could be solved with a much simpler quadratic-time algorithm:
def loop_rolling(begin, end):
    distance = (end - begin) / 2
    for d from distance downto 1:
        start = pos = begin
        while pos + d < end:
            while (pos + d < end) and (src[pos] == src[pos + d]):
                ++pos
            repeats = floor((pos - start) / d)
            if repeats > 0:
                pass_to_step5(start, d, start + d * (repeats + 1))
            start = pos
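For what it's worth, here is how the simpler scan above might look as runnable Python. It is a sketch under my own assumptions rather than a faithful transcription: periodic_runs is a hypothetical name, it collects (start, period, repeat count) tuples in place of calling pass_to_step5, and it explicitly steps past a mismatching position so that the outer loop always advances.

```python
def periodic_runs(src):
    """Collect (start, period, repeat_count) for adjacent repetitions in src."""
    found = []
    n = len(src)
    for d in range(n // 2, 0, -1):  # candidate period, longest first
        pos = 0
        while pos + d < n:
            start = pos
            # Extend the run while each character equals the one d steps ahead.
            while pos + d < n and src[pos] == src[pos + d]:
                pos += 1
            repeats = (pos - start) // d
            if repeats > 0:
                found.append((start, d, repeats + 1))
            pos += 1  # assumption: skip the mismatching position
    return found


print(periodic_runs("ababacababd"))
```

On "ababacababd" this finds the two "ab" runs at positions 0 and 6, each with period 2 repeated twice.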
Or it may be made even simpler by removing steps 5 and 6. But the variant below has a disadvantage. It is much too greedy, so instead of 5*(ab) it will find 2*(2*(ab))ab:
def loop_rolling(begin, end, distance):
    distance = min(distance, (end - begin) / 2)
    for d from distance downto 1:
        start = pos = begin
        while pos + d < end:
            while (pos + d < end) and (src[pos] == src[pos + d]):
                ++pos
            repeats = floor((pos - start) / d)
            if repeats > 0:
                loop_rolling(begin, start, d - 1)
                print repeats+1, "*("
                loop_rolling(start, start + d, d - 1) # "nested loops"
                print ')'
                loop_rolling(start + d * (repeats + 1), end, d)
                return
            else:
                if d == 1: print src[start .. pos]
                start = pos

Related

Check how many consecutive times appear in a string

I want to display the character (a digit or a letter) that appears the most times
consecutively in a given string of letters, numbers, or both.
Example:
s= 'aabskeeebadeeee'
output: e appears 4 consecutive times
My idea was to build a set of the string's characters and then, for each element of the set, loop over the string: if the current character equals the element, increment a counter; if the next character differs, store the counter in a list at the same index as the element has in the set, but only if it is bigger than the value already stored.
The problem is an "index out of range" error, although I think I am guarding against it.
s = 'aabskeeebadeeee'
c = 0
t = list(set(s)) # list of characters in s
li=[0,0,0,0,0,0] # list for counted repeats
print(t)
for x in t:
    h = t.index(x)
    for index, i in enumerate(s):
        maximus = len(s)
        if i == x:
            c += 1
            if index < maximus:
                if s[index +1] != x: # if next element is not x
                    if c > li[h]: #update c if bigger than existing
                        li[h] = c
                    c = 0
            else:
                if c > li[h]:
                    li[h] = c
for i in t:
    n = t.index(i)
    print(i,li[n])
print(f'{s[li.index(max(li))]} appears {max(li)} consecutive times')
Here is an O(n) time, O(1) space solution, that breaks ties by returning the earlier seen character:
def get_longest_consecutive_ch(s):
    count = max_count = 0
    longest_consecutive_ch = previous_ch = None
    for ch in s:
        if ch == previous_ch:
            count += 1
        else:
            previous_ch = ch
            count = 1
        if count > max_count:
            max_count = count
            longest_consecutive_ch = ch
    return longest_consecutive_ch, max_count
s = 'aabskeeebadeeee'
longest_consecutive_ch, count = get_longest_consecutive_ch(s)
print(f'{longest_consecutive_ch} appears {count} consecutive times in {s}')
Output:
e appears 4 consecutive times in aabskeeebadeeee
Regex offers a concise solution here:
import re
inp = "aabskeeebadeeee"
matches = [m.group(0) for m in re.finditer(r'([a-z])\1*', inp)]
print(matches)
matches.sort(key=len, reverse=True)
print(matches[0])
This prints:
['aa', 'b', 's', 'k', 'eee', 'b', 'a', 'd', 'eeee']
eeee
The strategy here is to find all islands of similar characters using re.finditer with the regex pattern ([a-z])\1*. Then, we sort the resulting list descending by length to find the longest sequence.
Alternatively, you can use itertools.groupby() for this type of problem; it groups runs of equal adjacent items, so each run is quick to count. (Note: this also applies to broader cases, e.g. sequences of numbers.)
>>> from itertools import groupby
>>> char_counts = [str(len(list(g)))+k for k, g in groupby(s)]
>>> char_counts
['2a', '1b', '1s', '1k', '3e', '1b', '1a', '1d', '4e']
>>> max(char_counts)
'4e'
# you can continue to do the rest of splitting, or printing for your needs...
>>> ans = '4e' # example
>>> print(f' the most frequent character is {ans[-1]}, it appears {ans[:-1]} ')
Output:
the most frequent character is e, it appears 4
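One caveat with max(char_counts) above: the entries are strings, so they are compared lexicographically, and a run of 10 or more would sort below '4e' ('10a' < '4e'). A hedged variant that keeps the count as an integer could look like this; longest_run is a name made up for the example.

```python
from itertools import groupby


def longest_run(s):
    """Return (length, character) of the longest run of equal adjacent characters."""
    runs = [(len(list(group)), ch) for ch, group in groupby(s)]
    # Compare by run length only; ties keep the earliest run.
    return max(runs, key=lambda run: run[0], default=(0, None))


length, ch = longest_run('aabskeeebadeeee')
print(f'{ch} appears {length} consecutive times')
```

The default=(0, None) argument is only there so that an empty string does not raise an error.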
This answer was posted as an edit to the question Check how many consecutive times appear in a string by the OP Ziggy Witkowski under CC BY-SA 4.0.
I did not want to use any libraries.
s = 'aabskaaaabadcccc'
lil = tuple(set(s)) # Set of characters in s to remove duplicates, then made into a tuple
li=[0,0,0,0,0,0] # list for counted repeats; the index of the repeat count for a character
# will be equal to the index of that character in the tuple
for i in lil: # iterate over the tuple of letters
    c = 0 # counter
    h= lil.index(i) # take an index
    for letter in s: # iterate over the string characters
        if letter == i: # check if equal to the character from the tuple
            c += 1 # if equal, counter +1
            if c > li[lil.index(letter)]: # Update the stored count if the current one is bigger.
                li[lil.index(letter)] = c
        else:
            c=0
            continue
m = max(li)
for index, j in enumerate(li): # Check if there are characters with the same max value
    if li[index] == m:
        print(f'{lil[index]} appears {m} consecutive times')
Output:
c appears 4 consecutive times
a appears 4 consecutive times

compare 2 strings, determine whether A contains all of the characters in B in python using list

question is :
Compare two strings A and B, determine whether A contains all of the characters in B.
The characters in string A and B are all Upper Case letters.
and I saw the solution is:
def compareStrings(self, A, B):
    if len(B) == 0:
        return True
    if len(A) == 0:
        return False
    trackTable = [0 for _ in range(26)]
    for i in A:
        trackTable[ord(i) - 65] += 1
    for i in B:
        if trackTable[ord(i) - 65] == 0:
            return False
        else:
            trackTable[ord(i) -65] -= 1
    return True
I do not understand :
1) why give 26 '0' in the list at beginning ?
2) what does trackTable[ord(i) - 65] += 1 do ?
what is ord(i)?
Thanks !
Min
This is an interesting solution for sure (it's pretty convoluted). It creates a 26-element array to count the occurrences of each letter in A, then makes sure that for each letter in B the count in A is greater than or equal to the count in B.
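If the fixed-size table feels opaque, the same counting idea can be sketched with collections.Counter from the standard library. This is an alternative illustration, not the solution you were shown, and contains_all is a made-up name.

```python
from collections import Counter


def contains_all(a, b):
    """True if a contains every character of b, counting duplicates."""
    have = Counter(a)  # how many of each character a provides
    need = Counter(b)  # how many of each character b requires
    # A Counter returns 0 for missing keys, so absent letters fail the check.
    return all(have[ch] >= count for ch, count in need.items())


print(contains_all("ABCD", "ACD"))   # A has every letter B needs
print(contains_all("ABCD", "AABC"))  # B needs two 'A's, A has only one
```

Unlike the 26-slot list, this does not rely on the input being upper-case A-Z.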
To directly answer your questions:
1) why give 26 '0' in the list at beginning ?
We are starting with a list of 26 0's, one for each letter A-Z. We will increment this in the first for loop.
2) what does trackTable[ord(i) - 65] += 1 do ?
This does the counting. Assuming the input is just capital letters A-Z, then ord('A')=65 and ord('Z')=90. We subtract 65 to make this range from 0-25.
3) what is ord(i)?
I would recommend searching online for this. https://www.programiz.com/python-programming/methods/built-in/ord
"The ord() method returns an integer representing Unicode code point for the given Unicode character."

My current loop which i use to sort elements by their digits produces an list index out of range

I'm a complete noob, and this is part of my first sorting algorithm (radix sort).
So far this way of sorting the numbers by their respective digits is working out, but I still get a "list index out of range" error. My theory is that the while loop takes an extra iteration, but I don't understand why.
def put_into_bucket(liste, iteration):
    iteration = int
    digit = len(liste[0]) - 1
    i = 0
    while i < len(liste):
        if int(liste[i][digit]) == 0:
            zero.append(liste[i])
            print(zero)
        if int(liste[i][digit]) == 1:
            one.append(liste[i])
            print(one)
        if int(liste[i][digit]) == 2:
            two.append(liste[i])
            print(two)
        if int(liste[i][digit]) == 3:
            three.append(liste[i])
            print(three)
        if int(liste[i][digit]) == 4:
            four.append(liste[i])
            print(four)
        if int(liste[i][digit]) == 5:
            five.append(liste[i])
            print(five)
        if int(liste[i][digit]) == 6:
            six.append(liste[i])
            print(six)
        if int(liste[i][digit]) == 7:
            seven.append(liste[i])
            print(seven)
        if int(liste[i][digit]) == 8:
            eight.append(liste[i])
            print(eight)
        if int(liste[i][digit]) == 9:
            nine.append(liste[i])
            print(nine)
        i = i + 1
    print(sorted_array)
put_into_bucket(liste=['0001', '0002', '0003', '0004', '0005', '0006'], iteration=0)
As jhnc mentioned, one problem is the bounds check in the while statement. It should be
while i < len(liste)
I still think the increment should be outside the if checks, but you say in the comments that moving it there doubles your output. Since this is not an MCVE, I can't help you with that.
Put the increment outside the if statements. After a successful if check, the next if statement is attempting to use i, which may no longer be valid.
Example
x = 1
if True:
    x += 1
if True:
    x += 1
print(x)
Output
3
Your other option is to put continue statements at the end of each if block, but you should really move the increment out of the if checks because then you only do it in one place.
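As a side note, the ten if blocks can be avoided by indexing a list of buckets with the digit itself, which also leaves no manual index to increment. A minimal sketch, assuming every entry is a string of digits of the same length; put_into_buckets is a hypothetical name, not your function.

```python
def put_into_buckets(items, digit):
    """Distribute digit strings into 10 buckets by the character at index `digit`."""
    buckets = [[] for _ in range(10)]  # one bucket per decimal digit 0-9
    for item in items:                 # a for loop needs no manual counter
        buckets[int(item[digit])].append(item)
    return buckets


numbers = ['0001', '0002', '0003', '0004', '0005', '0006']
buckets = put_into_buckets(numbers, digit=len(numbers[0]) - 1)
print(buckets)
```

Concatenating the buckets in order then gives the list sorted by that digit, which is one pass of the radix sort.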

Circular Walk in a string

You are given a string in which 'A' means you move one step clockwise, 'C' means you move one step anticlockwise, and '?' means you can move one step either clockwise or anticlockwise. Given such a string, find the maximum distance from the initial position at any point in time.
For example:
input: AACC?CC
output: 3
explanation: if ? is replaced with C, the max distance becomes 3
What is the optimal approach to solve this problem?
moves = "AACC?CC"
count = 0
extra = 0
for i in moves:
    if i == 'A':
        count -= 1
    elif i == 'C':
        count += 1
    else:
        extra += 1
dist = abs(count) + extra
if count < 0:
    # more 'A' (clockwise) moves than 'C' moves
    print("ClockWise:", dist)
else:
    print("AntiClockwise:", dist)
Try this out. The 'A' and 'C' moves are mandatory, so you have to follow them in both directions; only the '?' moves are free. You can simply count how many '?' there are and add that count to the final answer.
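Note that the snippet above measures the distance only at the end of the string. If "at any point in time" is meant literally, one possible sketch is to track the best value over every prefix. The assumptions here are mine: max_distance is a made-up name, and the circle is taken to be large enough that wrapping around never shortens the distance.

```python
def max_distance(moves):
    """Largest reachable distance from the start over all prefixes of moves."""
    net = 0   # forced displacement: +1 for 'A' (clockwise), -1 for 'C'
    free = 0  # number of '?' seen so far, each usable in either direction
    best = 0
    for move in moves:
        if move == 'A':
            net += 1
        elif move == 'C':
            net -= 1
        else:
            free += 1
        # Sending every '?' in the direction of the net displacement is optimal.
        best = max(best, abs(net) + free)
    return best


print(max_distance("AACC?CC"))
```

This is still a single O(n) pass with O(1) extra space.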

patterns with nested for

How do I achieve the following pattern if the input is 3?
AA
BBAA
AABBAA
The furthest I could get was:
AA
BBBB
AAAAAA
I have tried the following:
#mod operator used to alternate patterns
pattern_size = int (input ("Input height : "))
for level in range (1, pattern_size +1):
    for x in range (level):
        # print AA if remainder != 0
        if level % 2 != 0:
            print ("AA", end = '')
        # print BB if remainder = 0
        if level % 2 == 0:
            print ("BB", end = '')
I guess this is homework, and you will learn more if you find the solution on your own.
Firstly, if you want to alternate AA and BB when printing on the same level, the choice must depend on x (because x changes while level does not). Moreover, each level starts with a different pattern. So you may want to test (level + x) % 2 == 0 (choose whichever test is easiest). If the boolean expression is true, print one pattern, else print the other.
Do not forget to call print() without arguments after the x loop.
I prefer the simpler usage of range(), with a single argument. If pattern_size is 3, then the first loop can go through levels 0, 1, 2. However, the second for loop must run at least once, so it must go through range(level + 1).
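If you want to check your own attempt afterwards, the hints above could be put together roughly like this. It is only one possible sketch; pattern_lines is a name made up here, and it returns the rows as strings instead of printing piece by piece.

```python
def pattern_lines(height):
    """Build the rows of the alternating AA/BB pattern for the given height."""
    lines = []
    for level in range(height):         # levels 0 .. height - 1
        row = ""
        for x in range(level + 1):      # each level has level + 1 blocks
            # The block depends on both level and x, so rows start differently.
            row += "AA" if (level + x) % 2 == 0 else "BB"
        lines.append(row)
    return lines


for line in pattern_lines(3):
    print(line)
```

Printing each finished row plays the role of the print() without arguments after the x loop.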

Resources