Conditional string splitting in Python using regex or loops - python-3.x

I have a string c that has a respective repetitive pattern of:
integer from 0 to 10,
character S, D, or T,
special character * or # (optional)
For instance, c could look like 1D2S#10S, or 1D#2S*3S, or so on.
I have a further calculation to make with c, but in order to do so I thought splitting c into substrings that include integer, character, and a possible special character would be helpful. Hence, for example, 1D2S#10S would be split into 1D, 2S#, 10S. 1D#2S*3S would be split into 1D#, 2S*, 3S.
I am aware that such string split can be concisely done with re.split(), but since this is quite conditional, I wasn't able to find an optimal way to split this. Instead, I tried using a for loop:
clist = []
n = 0
for i in range(len(c)):
    if type(c[i]) != 'int':
        if type(c[i+1]) == 'int':
            clist.append(c[n:i+1])
            n = i
        else:
            clist.append(c[n:i+2])
            n = i
This raises an indexing issue, but even despite that I can tell it isn't optimal. Is there a way to use re to split it accordingly?

Use re.findall():
>>> re.findall(r'\d*[SDT][\*#]?', '1D2S#10S')
['1D', '2S#', '10S']
>>> re.findall(r'\d*[SDT][\*#]?', '1D#2S*3S')
['1D#', '2S*', '3S']
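If the follow-up calculation needs the integer, the letter, and the optional symbol separately, capture groups make re.findall() return tuples instead of whole tokens. A minimal sketch, assuming every token starts with at least one digit (hence \d+ rather than \d*); the helper name parse_tokens is made up for illustration:

```python
import re

# One token: digits, then S/D/T, then an optional * or #.
TOKEN = re.compile(r'(\d+)([SDT])([*#]?)')

def parse_tokens(c):
    """Split c into (number, letter, modifier) tuples; modifier is '' if absent."""
    return [(int(num), letter, mod) for num, letter, mod in TOKEN.findall(c)]

print(parse_tokens('1D2S#10S'))
print(parse_tokens('1D#2S*3S'))
```

Each tuple can then be unpacked directly in a loop, which avoids re-parsing the substrings later.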

Related

Remove double quotes from a string in python

I want my output not to be strings, but my code returns strings. Please look at my code below, in which y is my output. I tried regex, replace, strip, eval, and ast.literal_eval, but nothing has worked so far.
x = "'yyyymm'='202005','run_id'='51',drop_columns=run_id"
y = x.split(',')
print(y)
This will print:
["'yyyymm'='202005'","'run_id'='51'","drop_columns=run_id"]
But I want:
['yyyymm'='202005','run_id'='51',drop_columns=run_id]
x is a string, and if you split a string, you get a list of strings. Splitting simply cuts it into pieces.
Your question is not really clear on what you want to achieve. If you want to have key-value-pairs, you'd need to split each token at the =. This would give you something like this:
[('yyyymm', '202005'), ('run_id', '51'), ('drop_columns', 'run_id')]
But the items in the tuples would still be strings. If you want to have integers, you would need to cast them which is only possible if the strings consist of digits. It would not be possible to cast 'run_id' to integer.
You can refer to this example. I'm not sure if that is 100% what you are looking for, but it should give you the correct idea.
x = "yyyymm=202005,run_id=51,drop_columns=run_id"
y = x.split(',')
tmp = []
for e in y:
    tmp.append((e.split('=')[0], e.split('=')[1]))
out = []
for e in tmp:
    if str.isnumeric(e[1]):
        out.append((e[0], int(e[1])))
    else:
        out.append(e)
print(out)
This will give you:
[('yyyymm', 202005), ('run_id', 51), ('drop_columns', 'run_id')]
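The same idea can be written more compactly and applied to the quoted string from the question. A minimal sketch, assuming each item contains one = and that the single quotes should simply be dropped; parse_pairs is a hypothetical helper name:

```python
def parse_pairs(text):
    """Turn "k=v,k=v" into (key, value) tuples, converting digit-only values to int."""
    pairs = []
    for item in text.split(','):
        key, _, value = item.partition('=')
        key = key.strip("'")      # drop the quotes around 'yyyymm' etc.
        value = value.strip("'")
        pairs.append((key, int(value) if value.isdigit() else value))
    return pairs

x = "'yyyymm'='202005','run_id'='51',drop_columns=run_id"
print(parse_pairs(x))
```

Passing the result to dict() would give a key-value mapping if that is closer to what is needed.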

How to determine if two elements from a list appear consecutively in a string? Python

I am trying to solve a problem that can be modelled most simply as follows.
I have a large collection of letter sequences. The letters come from two lists: (1) member list (2) non-member list. The sequences are of different compositions and lengths (e.g. AQFG, CCPFAKXZ, HBODCSL, etc.). My goal is to insert the number '1' into these sequences when any 'member' is followed by any two 'non-members':
Rule 1: Insert '1' after the first member letter that is followed
by 2 or more non-members letters.
Rule 2: Insert not more than one '1' per sequence.
The 'Members': A, B, C, D
'Non-members': E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z
In other words, once a member letter is followed by 2 non-member letters, insert a '1'. In total, only one '1' is inserted per sequence. Examples of what I am trying to achieve are this:
AQFG ---> A1QFG
CCPFAKXZ ---> CC1PFAKXZ
BDDCCA ---> BDDCCA1
HBODCSL ---> HBODC1SL
ABFCC ---> ABFCC
ACKTBB ---> AC1KTBB # there is no '1' to be inserted after BB
I assume the code will be something like this:
members = ['A','B','C','D']
non_members = ['E','F','G','H','I','J','K','L','M','N',
               'O','P','Q','R','S','T','U','V','W','X','Y','Z']
strings = ['AQFG', 'CCPFAKXZ', 'BDDCCA', 'HBODCSL', 'ABFCC']
for i in members:
    if i in strings:
        if member is followed by 2 non-members:  # Struggling here
            i.insert(index_member, '1')
            return i
return ''
EDIT
I have found that one solution could be to generate a list of all permutations of two 'non-member' items using itertools.permutations(non_members, 2), and then test for their presence in the string.
But is there a more elegant solution for this problem?
Generating all permutations is going to explode the number of things you are checking. You need to change how you iterate, to something like:
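members = ...      # as defined in the question
non_members = ...  # as defined in the question
s = 'AQFG'
out = s  # left unchanged if no member is followed by two non-members
look = 2
for i in range(len(s) - look):
    if (s[i] in members
            and s[i+1] in non_members
            and s[i+2] in non_members):
        out = s[:i+1] + '1' + s[i+1:]
        break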
This way you only need to go through the target string once, and you don't need to generate permutations. This method can also be extended to look ahead much further than the permutation approach allows.
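The single-pass idea can be wrapped in a function so it runs over many sequences. A minimal sketch, assuming the member and non-member sets from the question; the names insert_marker, MEMBERS, and NON_MEMBERS are made up for illustration:

```python
MEMBERS = set('ABCD')
NON_MEMBERS = set('EFGHIJKLMNOPQRSTUVWXYZ')

def insert_marker(s, look=2):
    """Insert '1' after the first member followed by `look` non-members."""
    for i in range(len(s) - look):
        following = s[i + 1:i + 1 + look]
        if s[i] in MEMBERS and all(ch in NON_MEMBERS for ch in following):
            return s[:i + 1] + '1' + s[i + 1:]
    return s  # no qualifying member: leave the sequence unchanged

for seq in ['AQFG', 'CCPFAKXZ', 'HBODCSL', 'ABFCC', 'ACKTBB']:
    print(seq, '--->', insert_marker(seq))
```

The BDDCCA example from the question is left out here because no letters follow its final A, so the two stated rules on their own do not produce a '1' for it.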
I believe this can also be done with a regex.
import re
s = 'AQFG'
x = re.sub(r'([ABCD])([EFGHIJKLMNOPQRSTUVWXYZ])',r'\g<1>1\2',s)
print(x)
This will print A1QFG
Sorry. I missed that. re.sub can take an optional count parameter that can stop after the given number of replacements are made.
s = 'HBODCSL'
x = re.sub(r'([ABCD]+)([EFGHIJKLMNOPQRSTUVWXYZ])',r'\g<1>1\2',s,count=1)
print(x)
This will print HB1ODCSL
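Note that the pattern above only requires one non-member after the member, which is why HBODCSL becomes HB1ODCSL rather than the HBODC1SL asked for in the question. A lookahead can require two non-members without consuming them. A minimal sketch, assuming the member and non-member letters from the question; insert_marker_re is a made-up name:

```python
import re

# A member letter, matched only if the next two characters are both non-members.
PATTERN = re.compile(r'([A-D])(?=[E-Z]{2})')

def insert_marker_re(s):
    """Insert '1' after the first member followed by two non-members."""
    return PATTERN.sub(r'\g<1>1', s, count=1)

print(insert_marker_re('HBODCSL'))
print(insert_marker_re('ABFCC'))
```

Because the lookahead is zero-width, the replacement only has to re-emit the member letter and append the '1'.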

How to count number of substrings in python, if substrings overlap?

The count() function returns the number of times a substring occurs in a string, but it fails in case of overlapping strings.
Let's say my input is:
^_^_^-_-
I want to find how many times ^_^ occurs in the string.
mystr=input()
happy=mystr.count('^_^')
sad=mystr.count('-_-')
print(happy)
print(sad)
Output is:
1
1
I am expecting:
2
1
How can I achieve the desired result?
New Version
You can solve this problem without writing any explicit loops using regex. As @abhijith-pk's answer cleverly suggests, you can search for the first character only, with the remainder being placed in a positive lookahead, which will allow you to make the match with overlaps:
import re
def count_overlapping(string, pattern):
    regex = '{}(?={})'.format(re.escape(pattern[:1]), re.escape(pattern[1:]))
    # Consume iterator, get count with minimal memory usage
    return sum(1 for _ in re.finditer(regex, string))
Using [:1] and [1:] for the indices allows the function to handle the empty string without special processing, while using [0] and [1:] for the indices would not.
Old Version
You can always write your own routine using the fact that str.find allows you to specify a starting index. This routine will not be very efficient, but it should work:
def count_overlapping(string, pattern):
    count = 0
    start = -1
    while True:
        start = string.find(pattern, start + 1)
        if start < 0:
            return count
        count += 1
Usage
Both versions return identical results. A sample usage would be:
>>> mystr = '^_^_^-_-'
>>> count_overlapping(mystr, '^_^')
2
>>> count_overlapping(mystr, '-_-')
1
>>> count_overlapping(mystr, '')
9
>>> count_overlapping(mystr, 'x')
0
Notice that the empty string is found len(mystr) + 1 times. I consider this to be intuitively correct because it is effectively between and around every character.
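For comparison, the whole pattern can also be placed inside the lookahead, so every match is zero-width and the first character needs no separate handling. A minimal sketch of that variant; the name count_overlapping_lookahead is hypothetical:

```python
import re

def count_overlapping_lookahead(string, pattern):
    """Count occurrences of pattern in string, including overlapping ones."""
    # Every match is empty, so the scanner advances one character at a time.
    regex = '(?={})'.format(re.escape(pattern))
    return sum(1 for _ in re.finditer(regex, string))

mystr = '^_^_^-_-'
print(count_overlapping_lookahead(mystr, '^_^'))
print(count_overlapping_lookahead(mystr, '-_-'))
```

It treats the empty pattern the same way as the versions above, matching once at every position including the end of the string.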
You can use a regex for a quick and dirty solution:
import re
mystr='^_^_^-_-'
print(len(re.findall(r'\^(?=_\^)',mystr)))
You need something like this:
def count_substr(string, substr):
    n = len(substr)
    count = 0
    for i in range(len(string) - n + 1):
        if string[i:i+n] == substr:
            count += 1
    return count
mystr = input()
print(count_substr(mystr, '121'))
Input: 12121990
Output: 2

Algorithm for generating all string combinations

Say I have a list of strings, like so:
strings = ["abc", "def", "ghij"]
Note that the length of a string in the list can vary.
The way you generate a new string is to take one letter from each element of the list, in order. Examples: "adg" and "bfi", but not "dch" because the letters are not in the same order in which they appear in the list. So in this case where I know that there are only three elements in the list, I could fairly easily generate all possible combinations with a nested for loop structure, something like this:
for i in strings[0]:
    for ii in strings[1]:
        for iii in strings[2]:
            print(i + ii + iii)
The issue arises for me when I don't know how long the list of strings is going to be beforehand. If the list is n elements long, then my solution requires n for loops to succeed.
Can anyone point me towards a relatively simple solution? I was thinking of a DFS-based solution where I turn each letter into a node and create a connection between all letters in adjacent strings, but this seems like too much effort.
In Python, you would use itertools.product, e.g.:
>>> import itertools
>>> for comb in itertools.product("abc", "def", "ghij"):
...     print(''.join(comb))
adg
adh
adi
adj
aeg
aeh
...
Or, using an unpack:
>>> words = ["abc", "def", "ghij"]
>>> print('\n'.join(''.join(comb) for comb in itertools.product(*words)))
(same output)
The algorithm used by product is quite simple, as can be seen in its source code (Look particularly at function product_next). It basically enumerates all possible numbers in a mixed base system (where the multiplier for each digit position is the length of the corresponding word). A simple implementation which only works with strings and which does not implement the repeat keyword argument might be:
def product(words):
    if words and all(len(w) for w in words):
        indices = [0] * len(words)
        while True:
            # Change ''.join to tuple for a more accurate implementation
            yield ''.join(w[indices[i]] for i, w in enumerate(words))
            for i in range(len(indices), 0, -1):
                if indices[i - 1] == len(words[i - 1]) - 1:
                    indices[i - 1] = 0
                else:
                    indices[i - 1] += 1
                    break
            else:
                break
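The mixed-base idea described above can be made concrete by converting a plain counter into one index per word. A minimal sketch, not the actual itertools implementation; nth_combination is a hypothetical helper, and it assumes every word is non-empty:

```python
import itertools

def nth_combination(words, n):
    """Return the n-th string (0-based) in the order itertools.product yields."""
    chars = []
    # Peel off "digits" starting from the last word, which varies fastest.
    for w in reversed(words):
        n, digit = divmod(n, len(w))
        chars.append(w[digit])
    return ''.join(reversed(chars))

words = ["abc", "def", "ghij"]
total = 3 * 3 * 4  # product of the word lengths
by_index = [nth_combination(words, n) for n in range(total)]
by_product = [''.join(c) for c in itertools.product(*words)]
print(by_index == by_product)
```

Counting from 0 up to the product of the word lengths therefore visits every combination once, which is what the index-incrementing loop does in place.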
From your solution it seems that you need as many for loops as there are strings. For each character you generate in the final string, you need a for loop to go through the list of possible characters. To do that you can write a recursive solution. Every time you go one level deeper in the recursion, you run just one for loop. You have as many levels of recursion as there are strings.
Here is an example in python:
strings = ["abc", "def", "ghij"]
def rec(generated, k):
    if k==len(strings):
        print(generated)
        return
    for c in strings[k]:
        rec(generated + c, k+1)
rec("", 0)
Here's how I would do it in JavaScript (I assume that every string contains no duplicate characters):
function getPermutations(arr)
{
    return getPermutationsHelper(arr, 0, "");
}
function getPermutationsHelper(arr, idx, prefix)
{
    var foundInCurrent = [];
    for(var i = 0; i < arr[idx].length; i++)
    {
        var str = prefix + arr[idx].charAt(i);
        if(idx < arr.length - 1)
        {
            foundInCurrent = foundInCurrent.concat(getPermutationsHelper(arr, idx + 1, str));
        }
        else
        {
            foundInCurrent.push(str);
        }
    }
    return foundInCurrent;
}
Basically, I'm using a recursive approach. My base case is when I have no more words left in my array, in which case I simply add prefix + c to my array for every c (character) in my last word.
Otherwise, I try each letter in the current word, and pass the prefix I've constructed on to the next word recursively.
For your example array, I got:
adg adh adi adj aeg aeh aei aej afg afh afi afj bdg bdh bdi
bdj beg beh bei bej bfg bfh bfi bfj cdg cdh cdi cdj ceg ceh
cei cej cfg cfh cfi cfj

finding DNA codon starting with a or t with regular expression

Given a DNA sequence of codons, I want to get the percentage of codons starting with A or T.
The DNA sequence would be something like: dna = "atgagtgaaagttaacgt". Each codon starts at position 0, 3, 6, etc., and that is the source of the problem as far as my intentions go.
What we wrote and works:
import re
DNA = "atgagtgaaagttaacgt"
def atPct(dna):
    '''
    gets a dna sequence and returns the %
    of sequences that are starting with a or t
    '''
    numOfCodons = re.findall(r'[a|t|c|g]{3}',dna) # [a|t][a|t|c|g]{2} won't necessarily match only at positions where pos % 3 == 0
    count = 0
    for x in numOfCodons:
        if str(x)[0]== 'a' or str(x)[0]== 't':
            count+=1
            print(str(x))
    return 100*count/len(numOfCodons)
print(atPct(DNA))
My goal is to do this without the for loop. I feel there is a more elegant way to do it with regular expressions alone, but I might be wrong. If there is a better way, I would be glad to learn it. Is there a way to combine the position constraint with "[a|t][a|t|c|g]{2}" in a single regular expression?
P.S. The question assumes a valid DNA sequence, which is why I haven't checked for that.
A loop will be faster than doing it another way. Still, you can use sum and a generator expression (another SO answer) to improve readability:
import re
def atPct(dna):
    # Find all sequences
    numSeqs = re.findall('[atgc]{3}', dna)
    # Count all sequences that start with 'a' or 't'
    atSeqs = sum(1 for seq in numSeqs if re.match('[at]', seq))
    # Return the percentage of sequences that start with 'a' or 't'
    return 100 * atSeqs / len(numSeqs)
DNA = "atgagtgaaagttaacgt"
print( atPct(DNA) )
So you just want to find out the percentage of times a or t appear in the first of every three characters in the string? Use the step parameter of a slice:
def atPct(dna):
    starts = dna[::3]  # Every third character of dna, starting with the first
    # Multiply by 100 to return a percentage rather than a fraction
    return 100 * (starts.count('a') + starts.count('t')) / len(starts)
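If a regex-only count is still wanted, one option is an alternation in which every match consumes exactly one codon, so matches stay aligned to positions 0, 3, 6, and so on. A minimal sketch, assuming a valid lowercase sequence as the question does; the name at_pct is made up for illustration:

```python
import re

def at_pct(dna):
    """Percentage of codons (positions 0, 3, 6, ...) that start with 'a' or 't'."""
    # Group 1 is filled only when the codon starts with a/t; any other codon
    # is consumed by the second alternative and shows up as ''.
    codons = re.findall(r'([at][acgt]{2})|[acgt]{3}', dna)
    if not codons:
        return 0.0
    return 100 * sum(1 for c in codons if c) / len(codons)

print(at_pct("atgagtgaaagttaacgt"))
```

For the sample sequence, four of the six codons (atg, agt, agt, taa) start with a or t.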
