Count number of occurrences of each string using regex - python-3.x

Given a pattern like this
pattern = re.compile(r'\b(A|B|C)\b')
And a huge_string, I would like to replace every substring matching the pattern with a string D and find the number of occurrences of each of the strings A, B and C. What is the most feasible approach?
One way is to split the pattern into 3 patterns, one for each string, and then use subn:
pattern_a = re.compile(r'\bA\b')
pattern_b = re.compile(r'\bB\b')
pattern_c = re.compile(r'\bC\b')
huge_string, no_a = re.subn(pattern_a, D, huge_string)
huge_string, no_b = re.subn(pattern_b, D, huge_string)
huge_string, no_c = re.subn(pattern_c, D, huge_string)
But it requires 3 passes through the huge_string. Is there a better way?

You may pass a callable as the replacement argument to re.sub and collect the necessary counting details during a single replacement pass:
import re
counter = {}
def repl(m):
    if m.group() in counter:
        counter[m.group()] += 1
    else:
        counter[m.group()] = 1
    return 'd'
text = "a;b o a;c a l l e d;a;c a b"
rx = re.compile(r'\b(a|b|c)\b')
result = rx.sub(repl, text)
print(counter, result, sep="\n")
Output:
{'a': 5, 'b': 2, 'c': 2}
d;d o d;d d l l e d;d;d d d
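The same single-pass idea can be wrapped in a reusable function. This is only a minimal sketch: the name replace_and_count is hypothetical, and it swaps the plain dict for collections.Counter so the if/else bookkeeping is not needed.

```python
import re
from collections import Counter

def replace_and_count(pattern, replacement, text):
    """Replace every match with `replacement` and count each matched string in one pass."""
    counts = Counter()

    def repl(match):
        counts[match.group()] += 1  # tally the exact text that matched
        return replacement

    return pattern.sub(repl, text), counts

rx = re.compile(r'\b(a|b|c)\b')
result, counts = replace_and_count(rx, 'd', "a;b o a;c a l l e d;a;c a b")
print(result)
print(dict(counts))
```

Because the counter lives inside the function, repeated calls do not share state the way a module-level dict would.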

You could do it in 2 passes: the first just counts, and the second does the substitution. This means that if your search space grows, like a|b|c|d|e etc., you will still only do 2 passes; the number of passes does not depend on the number of possible matches.
import re
from collections import Counter
string = " a j h s j a b c "
pattern = re.compile(r'\b(a|b|c)\b')
counts = Counter(pattern.findall(string))
string_update = pattern.sub('d', string)
print(counts, string, string_update, sep="\n")
OUTPUT
Counter({'a': 2, 'b': 1, 'c': 1})
a j h s j a b c
d j h s j d d d

Related

Check how many consecutive times appear in a string

I want to display the number or letter that appears the most times consecutively in a given string of letters, numbers, or both.
Example:
s= 'aabskeeebadeeee'
output: e appears 4 consecutive times
I thought about building a set from the string and then, for each element of the set, looping over the string: if a character equals the element, increment the counter (count += 1); if the next character is different, store the counter in a list at the same index the element has in the set, but only if the value is bigger than the one already stored.
The problem is an "index out of range" error, although I think I am guarding against it.
s = 'aabskeeebadeeee'
c = 0
t = list(set(s)) # list of characters in s
li=[0,0,0,0,0,0] # list for counted repeats
print(t)
for x in t:
    h = t.index(x)
    for index, i in enumerate(s):
        maximus = len(s)
        if i == x:
            c += 1
            if index < maximus:
                if s[index +1] != x: # if next element is not x
                    if c > li[h]: #update c if bigger than existing
                        li[h] = c
                    c = 0
            else:
                if c > li[h]:
                    li[h] = c
for i in t:
    n = t.index(i)
    print(i,li[n])
print(f'{s[li.index(max(li))]} appears {max(li)} consecutive times')
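A likely cause of the IndexError, assuming the s[index +1] lookup sits under the index < maximus check: enumerate(s) only yields indexes from 0 to len(s) - 1, so index < maximus is always true and s[index +1] is also evaluated for the last character. The sketch below shows a guard that tests index + 1 instead; the helper name next_differs is hypothetical and not part of the original code.

```python
s = 'aabskeeebadeeee'

def next_differs(s, index):
    """Return True if s[index] is the last character of its run (hypothetical helper)."""
    # index + 1 must be a valid position before s[index + 1] is read
    return index + 1 >= len(s) or s[index + 1] != s[index]

# enumerate() never yields len(s), so `index < len(s)` cannot protect s[index + 1]
print(all(index < len(s) for index, _ in enumerate(s)))

# positions where a run of equal characters ends
print([i for i in range(len(s)) if next_differs(s, i)])
```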
Here is an O(n) time, O(1) space solution, that breaks ties by returning the earlier seen character:
def get_longest_consecutive_ch(s):
    count = max_count = 0
    longest_consecutive_ch = previous_ch = None
    for ch in s:
        if ch == previous_ch:
            count += 1
        else:
            previous_ch = ch
            count = 1
        if count > max_count:
            max_count = count
            longest_consecutive_ch = ch
    return longest_consecutive_ch, max_count
s = 'aabskeeebadeeee'
longest_consecutive_ch, count = get_longest_consecutive_ch(s)
print(f'{longest_consecutive_ch} appears {count} consecutive times in {s}')
Output:
e appears 4 consecutive times in aabskeeebadeeee
Regex offers a concise solution here:
import re
inp = "aabskeeebadeeee"
matches = [m.group(0) for m in re.finditer(r'([a-z])\1*', inp)]
print(matches)
matches.sort(key=len, reverse=True)
print(matches[0])
This prints:
['aa', 'b', 's', 'k', 'eee', 'b', 'a', 'd', 'eeee']
eeee
The strategy here is to find all islands of similar characters using re.finditer with the regex pattern ([a-z])\1*. Then, we sort the resulting list descending by length to find the longest sequence.
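Sorting is not strictly required to find the longest island. As a minimal sketch (the function name longest_run is hypothetical), max() with key=len picks the first longest match in a single scan, and the pattern (.)\1* with re.DOTALL is an assumption that widens the match from lowercase letters to any character:

```python
import re

def longest_run(text):
    """Return the first longest run of one repeated character, or '' for empty input."""
    # (.)\1* matches any single character followed by repeats of that same character
    runs = (m.group(0) for m in re.finditer(r'(.)\1*', text, re.DOTALL))
    return max(runs, key=len, default='')

run = longest_run("aabskeeebadeeee")
print(f'{run[0]} appears {len(run)} consecutive times')
```

Ties are resolved in favor of the earlier run, because max() returns the first maximal item it encounters.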
Alternatively, you can leverage the power of itertools.groupby() to approach this type of problem (for quick counting of similar items in groups). Note that this can be applied to some broader cases, e.g. numbers.
>>> from itertools import groupby
>>> s = 'aabskeeebadeeee'
>>> char_counts = [str(len(list(g)))+k for k, g in groupby(s)]
>>> char_counts
['2a', '1b', '1s', '1k', '3e', '1b', '1a', '1d', '4e']
>>> max(char_counts)
'4e'
# you can continue to do the rest of splitting, or printing for your needs...
>>> ans = '4e' # example
>>> print(f' the most frequent character is {ans[-1]}, it appears {ans[:-1]} ')
Output:
the most frequent character is e, it appears 4
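One caveat with max(char_counts): the entries are strings, so they are compared lexicographically, and a run of 10 or more (for example '10a') would rank below '4e'. A minimal sketch that compares the run lengths as integers instead (the function name longest_group is hypothetical):

```python
from itertools import groupby

def longest_group(s):
    """Return (character, run_length) for the first longest run in s."""
    # sum(1 for _ in g) counts the items in each group without building a list
    runs = [(k, sum(1 for _ in g)) for k, g in groupby(s)]
    return max(runs, key=lambda run: run[1])

ch, n = longest_group('aabskeeebadeeee')
print(f'{ch} appears {n} consecutive times')
```

For an empty string the list of runs is empty and max() raises ValueError, so callers may want to handle that case separately.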
This answer was posted as an edit to the question Check how many consecutive times appear in a string by the OP Ziggy Witkowski under CC BY-SA 4.0.
I did not want to use any libraries.
s = 'aabskaaaabadcccc'
lil = tuple(set(s))  # set of characters in s to remove duplicates, then make a tuple
li = [0,0,0,0,0,0]  # list for counted repeats; the index of the repeat count for a character
                    # will be equal to the index of its character in the tuple
for i in lil:  # iterate over the tuple of letters
    c = 0  # counter
    h = lil.index(i)  # take an index
    for letter in s:  # iterate over the string characters
        if letter == i:  # check if equal to the character from the tuple
            c += 1  # if equal, counter +1
            if c > li[lil.index(letter)]:  # update the counter if the present one is bigger than the one stored
                li[lil.index(letter)] = c
        else:
            c = 0
            continue
m = max(li)
for index, j in enumerate(li):  # check if there are characters with the same max value
    if li[index] == m:
        print(f'{lil[index]} appears {m} consecutive times')
Output:
c appears 4 consecutive times
a appears 4 consecutive times

How to split string by odd length

Let's say I have a string = "AABBAAAAABBBBAAABBBBAA".
I want to split the string after each odd-length run of A's (i.e. when A = 5 or A = 3).
What I want returned is 1) AABBAAAAA 2) BBBBAAA 3) BBBBAA.
How can I do that?
I tried using the regex [A]+[B]+ for a slightly different case.
One option might be to regex iterate using re.finditer with the following pattern:
.*?(?:AAA(?:AA)?|$)
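This pattern will non-greedily consume characters until reaching either 3 A's, 5 A's, or the end of the string. Then, we can print out each complete match as we iterate.
import re
text = 'AABBAAAAABBBBAAABBBBAA'  # named text rather than input to avoid shadowing the built-in
pattern = r'.*?(?:AAA(?:AA)?|$)'
for match in re.finditer(pattern, text):
    if match.group():  # skip the empty match found at the very end of the string
        print(match.group())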
This prints:
AABBAAAAA
BBBBAAA
BBBBAA
You can use itertools.groupby:
s = 'BBAAAAABBBBAAABBBBAA'
from itertools import groupby
out = ['']
for v, g in groupby(s):
    l = [*g]
    out[-1] += ''.join(l)
    if v == 'A' and len(l) in (3, 5):
        out.append('')
print(out)
Prints:
['BBAAAAA', 'BBBBAAA', 'BBBBAA']
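The groupby approach can be generalized beyond the two hard-coded lengths. The sketch below is an assumption about what "odd lengths" should mean: it splits after any run of the chosen character whose length is odd and at least 3, so a single A does not trigger a split. The function name split_after_odd_runs is hypothetical.

```python
from itertools import groupby

def split_after_odd_runs(s, ch='A'):
    """Split s after every run of `ch` whose length is odd and greater than 1."""
    parts = ['']
    for value, group in groupby(s):
        run = ''.join(group)
        parts[-1] += run
        if value == ch and len(run) > 1 and len(run) % 2 == 1:
            parts.append('')  # start a new piece after an odd-length run
    # drop the trailing empty piece left when s ends with an odd-length run
    return [p for p in parts if p]

print(split_after_odd_runs('AABBAAAAABBBBAAABBBBAA'))
```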

Cut K sequences of length L to obtain the biggest number

We have a number of N digits (it can start with 0). We must find the biggest number that can be obtained by cutting out K disjoint sequences of length L.
N can be very big so our number should be stored as a string.
Example 1)
nr = 12122212212212121222
K = 2, L = 3
answer: 22212212221222
We can cut "121" (from 0th digit) and "121" (from 12th digit).
Example 2)
nr = 0739276145
K = 3, L = 3
answer: 9
We can cut "073", "276" and "145".
I have tried something like this:
void cut(string str, int K, int L) {
if (K == 0)
return;
// here we cut a single sequence of length L
// in a way that the new number is the biggest
cut(str, K - 1, L);
}
But this way I can cut 2 sequences that are not disjoint in the initial number, so my method is not correct. Please help me solve the problem!
You can define cuts recursively:
cuts(s, 0, L) = s
cuts(s, K, L) = max(s[:j] + cuts(s[j+L:], K-1, L) for j = 0..len(s)-K*L)
As is normal in these problems, you can use dynamic programming to avoid an exponential runtime. You can probably avoid so much string slicing and appending, but this is an example solution in Python:
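def cuts(s, K, L):
    # dp[i] is the best result for the suffix s[i:] with the cuts made so far
    dp = [s[i:] for i in range(len(s)+1)]
    for k in range(1, K+1):
        # candidates for a given i all have the same length, so max() on strings is numeric
        dp = [max(s[i:j] + dp[j+L] for j in range(i, len(dp)-L))
              for i in range(len(dp)-L)]
    return dp[0]
print(cuts('12122212212212121222', 2, 3))
print(cuts('0739276145', 3, 3))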
Output:
22212212221222
9
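The recursive definition can also be written down almost literally and memoized. This is a sketch rather than a tuned solution: the name cuts_recursive is hypothetical, it indexes into s instead of slicing the argument so that functools.lru_cache can key on (i, k), and it assumes K*L <= len(s).

```python
from functools import lru_cache

def cuts_recursive(s, K, L):
    """Largest digit string left after removing K disjoint blocks of length L from s."""
    n = len(s)

    @lru_cache(maxsize=None)
    def best(i, k):
        # best result for the suffix s[i:] when k blocks still have to be removed
        if k == 0:
            return s[i:]
        # j is the start of the next removed block; k blocks must still fit after j
        return max(s[i:j] + best(j + L, k - 1)
                   for j in range(i, n - k * L + 1))

    return best(0, K)

print(cuts_recursive('12122212212212121222', 2, 3))
print(cuts_recursive('0739276145', 3, 3))
```

All candidates compared inside one max() call have the same length, which is why comparing them as strings gives the same answer as comparing them as numbers.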

Python adding text to a line

a = open('testlines.csv', 'r')
b = a.readlines()
a.close()
for c in range(0,1):
    d = '<' + b[c] + '>'
    d = b[c].replace(',', '><')
    e = re.findall(r'<(.*?)>', d, re.DOTALL)
    print(d)
    print(e[0],e[1],e[2],e[3],e[4],e[5],e[6],e[7],e[8])
d does not print right: the < or > at the beginning and the end of the line doesn't show up. If I reverse the two lines that create and modify d, then the commas are not replaced. What am I doing wrong here? I want the replace, and I need to add the < > at the beginning and end so that I can do the findall and create a multidimensional array in the end, once everything has been split apart.
The problem is that after you assign d = '<'+b[c]+'>', you never use that value: the next line assigns d a new value built from b[c]. As a result the step where you add <...> is lost.
You can solve it by working on d instead of b[c], like:
for c in range(0,1):
    d = '<' + b[c] + '>'
    d = d.replace(',', '><') # use d instead of b[c]
    e = re.findall(r'<(.*?)>', d, re.DOTALL)
    print(d)
    print(e[0],e[1],e[2],e[3],e[4],e[5],e[6],e[7],e[8])
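One further detail worth checking: readlines() keeps the trailing newline of each line, so the closing > ends up after the line break and the last field contains a newline. A minimal sketch of the same wrapping with the newline stripped first (the helper name wrap_fields and the sample line are assumptions, since the contents of testlines.csv are not shown):

```python
import re

def wrap_fields(line):
    """Wrap each comma-separated field of one CSV line in <...>."""
    # readlines() keeps the trailing newline, so strip it before adding the closing '>'
    line = line.rstrip('\r\n')
    return '<' + line.replace(',', '><') + '>'

d = wrap_fields('a,b,c\n')  # sample line standing in for b[c]
e = re.findall(r'<(.*?)>', d, re.DOTALL)
print(d)
print(e)
```

If only the list of fields is needed, line.rstrip('\r\n').split(',') produces the same list without the intermediate markup.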

python3 sum in stings each letter value

I need to sum the values of the letters in a string, e.g.
a = 1
b = 2
c = 3
d = 4
alphabet = 'abcdefghijklmnopqrstuvwxyz'
v1
string = "abcd"
# #result = sum(string) so
if string[0] and string[1] and string[2] and string[3] in alphabet:
    if string[0] is alphabet[0] and string[1] is alphabet[1] and string[2] is alphabet[2] and string[3] is alphabet[3]:
        print(a+b+c+d)
v2
string = ("ab","aa","dc",)
if string[0][0] and string[0][1] and string[1][0] and string[1][1] and string[2][0] and string[2][1] in alphabet:
    if string[0] is alphabet[0] and string[1] is alphabet[1] and string[2] is alphabet[2] and string[3] is alphabet[3]:
        print(a+b+c+d)
What is the solution? Can you help me?
Use the sum() function and a generator expression; a dictionary built from string.ascii_lowercase can serve as a means to getting an integer value per letter:
from string import ascii_lowercase
letter_value = {c: i for i, c in enumerate(ascii_lowercase, 1)}
wordsum = sum(letter_value.get(c, 0) for c in word if c)
The enumerate(ascii_lowercase, 1) produces (index, letter) pairs when iterated over, starting at 1. That gives you (1, 'a'), (2, 'b'), etc. That can be converted to c: i letter pairs in a dictionary, mapping letter to integer number.
Next, using the dict.get() method lets you pick a default value; for any character in the input string, you get to look up the numeric value and map it to an integer, but if the character is not a lowercase letter, 0 is returned instead. The sum(...) part with the loop then simply adds those values up.
If you need to support sequences with words, just use sum() again. Put the above sum() call in a function, and apply that function to each word in a sequence:
from string import ascii_lowercase
letter_value = {c: i for i, c in enumerate(ascii_lowercase, 1)}
def sum_word(word):
    return sum(letter_value.get(c, 0) for c in word if c)
def sum_words(words):
    return sum(sum_word(word) for word in words)
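As a usage sketch, here are the two functions applied to the inputs from the question (v1 and v2). The if c filter is dropped here because every character of a string is non-empty, so it does not change the result.

```python
from string import ascii_lowercase

# map 'a' -> 1, 'b' -> 2, ..., 'z' -> 26
letter_value = {c: i for i, c in enumerate(ascii_lowercase, 1)}

def sum_word(word):
    # characters outside a-z contribute 0
    return sum(letter_value.get(c, 0) for c in word)

def sum_words(words):
    return sum(sum_word(word) for word in words)

print(sum_word("abcd"))               # v1 input from the question
print(sum_words(("ab", "aa", "dc")))  # v2 input from the question
```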
The old-fashioned way is to take advantage of the fact that lowercase letters are contiguous, so that ord("b") - ord("a") == 1:
data = "abcd"
print("Sum:", sum(ord(c)-ord("a")+1 for c in data))
Of course you could "optimize" it to reduce the number of computations, though it seems silly in this case:
ord_a = ord("a")
print("Sum:", sum(ord(c)-ord_a for c in data)+len(data))
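Note that the ord-based expressions assume the input contains only lowercase a-z; a space, for instance, would contribute ord(" ") - ord("a") + 1 == -64. A minimal sketch that skips other characters (the function name ord_sum is hypothetical):

```python
def ord_sum(data):
    """Sum letter values (a=1 ... z=26), ignoring anything that is not a-z."""
    ord_a = ord("a")
    # the range check keeps digits, spaces and uppercase letters out of the sum
    return sum(ord(c) - ord_a + 1 for c in data if "a" <= c <= "z")

print("Sum:", ord_sum("abcd"))
print("Sum:", ord_sum("a b!"))
```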
