How to better code, when looking for substrings? - python-3.x

I want to extract the currency (along with the $ sign) from a list, and create two different currency lists which I have done. But is there a better way to code this?
The list is as below:
['\n\n\t\t\t\t\t$59.90\n\t\t\t\t\n\n\n\t\t\t\t\t\t$68.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$55.00\n\t\t\t\t\n\n\n\t\t\t\t\t\t$68.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$38.50\n\t\t\t\t\n\n\n\t\t\t\t\t\t$49.90\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$49.00\n\t\t\t\t\n\n\n\t\t\t\t\t\t$62.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$68.80\n\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$49.80\n\t\t\t\t\n\n\n\t\t\t\t\t\t$60.50\n\t\t\t\t\t\n\n']
Python code:
pp_list = []
up_list = []
for u in usual_price_list:
rep = u.replace("\n","")
rep = rep.replace("\t","")
s = rep.rsplit("$",1)
pp_list.append(s[0])
up_list.append("$"+s[1])

For this kind of problem, I tend to use a lot the re module, as it is more readable, more maintainble and does not depend on which character surround what you are looking for :
import re
pp_list = []
up_list = []
for u in usual_price_list:
prices = re.findall(r"\$\d{2}\.\d{2}", u)
length_prices = len(prices)
if length_prices > 0:
pp_list.append(prices[0])
if length_prices > 1:
up_list.append(prices[1])
Regular Expresion Breakdown
$ is the end of string character, so we need to escape it
\d matches any digit, so \d{2} matches exactly 2 digits
. matches any character, so we need to escape it
If you want it you can modify the number of digits for the cents with \d{1,2} for matches one or two digits, or \d* to match 0 digit or more

As already pointed for doing that task re module is useful - I would use re.split following way:
import re
data = ['\n\n\t\t\t\t\t$59.90\n\t\t\t\t\n\n\n\t\t\t\t\t\t$68.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$55.00\n\t\t\t\t\n\n\n\t\t\t\t\t\t$68.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$38.50\n\t\t\t\t\n\n\n\t\t\t\t\t\t$49.90\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$49.00\n\t\t\t\t\n\n\n\t\t\t\t\t\t$62.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$68.80\n\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$49.80\n\t\t\t\t\n\n\n\t\t\t\t\t\t$60.50\n\t\t\t\t\t\n\n']
prices = [re.split(r'[\n\t]+',i) for i in data]
prices0 = [i[1] for i in prices]
prices1 = [i[2] for i in prices]
print(prices0)
print(prices1)
Output:
['$59.90', '$55.00', '$38.50', '$49.00', '$68.80', '$49.80']
['$68.00', '$68.00', '$49.90', '$62.00', '', '$60.50']
Note that this will work assuming that there are solely \n and \t excluding prices and there is at least one \n or \t before first price and at least one \n or \t between prices.
[\n\t]+ denotes any string made from \n or \t with length 1 or greater, that is \n, \t, \n\n, \t\t, \n\t, \t\n and so on

Related

Split string with commas while keeping numeric parts

I'm using the following function to separate strings with commas right on the capitals, as long as it is not preceded by a blank space.
def func(x):
y = re.findall('[A-Z][^A-Z\s]+(?:\s+\S[^A-Z\s]*)*', x)
return ','.join(y)
However, when I try to separate the next string it removes the part with numbers.
Input = '49ersRiders Mapple'
Output = 'Riders Mapple'
I tried the following code but now it removes the 'ers' part.
def test(x):
y = re.findall(r'\d+[A-Z]*|[A-Z][^A-Z\s]+(?:\s+\S[^A-Z\s]*)*', x)
return ','.join(y)
Output = '49,Riders Mapple'
The output I'm looking for is this:
'49ers,Riders Mapple'
Is it possible to add this indication to my regex?
Thanks in advance
Maybe naive but why don't you use re.sub:
def func(x):
return re.sub(r'(?<!\s)([A-Z])', r',\1', x)
inp = '49ersRiders Mapple'
out = func(inp)
print(out)
# Output
49ers,Riders Mapple
Here is a regex re.findall approach:
inp = "49ersRiders"
output = ','.join(re.findall('(?:[A-Z]|[0-9])[^A-Z]+', inp))
print(output) # 49ers,Riders
The regex pattern used here says to match:
(?:
[A-Z] a leading uppercase letter (try to find this first)
| OR
[0-9] a leading number (fallback for no uppercase)
)
[^A-Z]+ one or more non capital letters following

Python regex multiple matches occurrences between two strings

I have a multi-line string with my start/end magic strings ("X" and "Y"). I'm trying to capture all occurrences but I'm experiencing some issues.
Here is the code
testString = '''AAAAAXBBBBBYCCCCCXDDDDDYEEEEEEXFFF
FFFYGGG
'''
pattern = re.compile(r'(.*)X(.*)Y(.*)', re.MULTILINE)
match = re.search(pattern, testString)
print match.group(1) # output: AAAAAXBBBBBYCCCCC
print match.group(2) # output: DDDDD
print match.group(3) # output: EEEEEEXFFF
Basically, I'm trying to capture all occurrences of the following (And I have to maintain text order):
Text before the magic start string (e.g.: AAAAA, CCCCC, EEEEEE)
Text between start/end magic strings (e.g.: BBBBB, DDDDD, FFF\nFFF)
Text after the magic start string (e.g.: CCCCC, GGG)
So I'm trying to print the following output: (what's in between brackets below is just a comment)
AAAAA (before magic string)
BBBBB (between magic strings)
CCCCC (before/after magic strings, it does not matter. Just the order matters.)
DDDDD (after magic string)
And so on. Printing them in that order would solve the issue. (Then I can pass each to other functions, ...etc.)
The code works nicely when the text is as simple as for example "AAXBBYCC", but with complicated strings I'm losing control.
Any ideas or alternative ways to do this?
You could match any character except X or Y in group 1 and then match X and do the same for Y. The "after the magic string" part you could capture in a lookahead with a third group.
The negated character class using [^ will also match an newline to match the FFFFFF part.
([^XY]+)X([^XY]+)Y(?=([^XY]+))
([^XY]+)X Capture group 1, match 1+ times any char except X or Y, then match X
([^XY]+)Y Capture group 2, match 1+ times any char except X or Y, then match Y
(?= Positive lookahead, assert what is directly to the right is
([^XY]+) Capture group 3, match 1+ times any char except X or Y
) Close lookahead
Regex demo | Python demo
import re
regex = r"([^XY]+)X([^XY]+)Y(?=([^XY]*))"
s = ("AAAAAXBBBBBYCCCCCXDDDDDYEEEEEEXFFF\n"
"FFFYGGG")
matches = re.findall(regex, s)
print(matches)
Output
[('AAAAA', 'BBBBB', 'CCCCC'), ('CCCCC', 'DDDDD', 'EEEEEE'), ('EEEEEE', 'FFF\nFFF', 'GGG')]
So I'm trying to print the following output: (what's in between brackets below is just a comment)
AAAAA (before magic string)
BBBBB (between magic strings)
CCCCC (before/after magic strings, it does not matter. Just the order matters.)
DDDDD (after magic string)
And so on.
Since it doesn't matter whether before or after start or end, it is as simple as:
import re
o = re.split("X|Y", testString)
print(*o, sep='\n')
Can't you just use:
pattern = re.compile(r'[^XY]+')
match = re.findall(pattern, testString)
print(match)
# ['AAAAA', 'BBBBB', 'CCCCC', 'DDDDD', 'EEEEEE', 'FFF\nFFF', 'GGG\n']

How can I slice and keep text in a list every specific character?

I used beautifulsoup and I got a result form .get_text(). The result contains a long text:
alpha = ['\n\n\n\nIntroduction!!\nGood\xa0morning.\n\n\n\nHow\xa0are\xa0you?\n\n']
It can be noticed that the number of \n is not the same, and there are \xa0 for spacing.
I want to slice every group of \n (\n\n or \n\n\n or \n\n\n\n ) and replace \xa0 with a space in a new list, to look like this:
beta = ['Introduction!!','Good morning.','How are you?']
How can I do it?
Thank you in advance.
I wrote a little script that solves your problem:
alpha = ['\n\n\n\nIntroduction!!\nGood\xa0morning.\n\n\n\nHow\xa0are\xa0you?\n\n']
beta = []
for s in alpha:
# Turning the \xa0 into spaces
s = s.replace('\xa0',' ')
# Breaking the string by \n
s = s.split('\n')
# Explanation 1
s = list(filter(lambda s: s!= '',s))
# Explanation 2
beta = beta + s
print(beta)
Explanation 1
As there is some sequences of \n inside the alpha string, the split() will generate some empty strings. The filter() that I wrote removes them from the list.
Explanation 2
When the s string got split, it turns into a list of strings. Then, we need to concatenate the lists.

How to calculate word co-occurence

I have a string of characters of length 50 say representing a sequence abbcda.... for alphabets taken from the set A={a,b,c,d}.
I want to calculate how many times b is followed by another b (n-grams) where n=2.
Similarly, how many times a particular character is repeated thrice n=3 consecutively, say in the input string abbbcbbb etc so here the number of times b occurs in a sequence of 3 letters is 2.
To find the number of non-overlapping 2-grams you can use
numel(regexp(str, 'b{2}'))
and for 3-grams
numel(regexp(str, 'b{3}'))
to count overlapping 2-grams use positive lookahead
numel(regexp(str, '(b)(?=b{1})'))
and for overlapping n-grams
numel(regexp(str, ['(b)(?=b{' num2str(n-1) '})']))
EDIT
In order to find number of occurrences of an arbitrary sequence use the first element in first parenthesis and the rest after equality sign, to find ba use
numel(regexp(str, '(b)(?=a)'))
to find bda use
numel(regexp(str, '(b)(?=da)'))
Building on the proposal by Magla:
str = 'abcdabbcdaabbbabbbb'; % for example
index_single = ismember(str, 'b');
index_digram = index_single(1:end-1)&index_single(2:end);
index_trigram = index_single(1:end-2)&index_single(2:end-1)&index_single(3:end);
You may try this piece of code that uses ismember (doc).
%generate string (50 char, 'a' to 'd')
str = char(floor(97 + (101-97).*rand(1,50)))
%digram case
index_digram = ismember(str, 'aa');
%trigram case
index_trigram = ismember(str, 'aaa');
EDIT
Probabilities can be computed with
proba = sum(index_digram)/length(index_digram);
this will find all n-grams and count them:
numberOfGrams = 5;
s = char(floor(rand(1,1000)*4)+double('a'));
ngrams = cell(1);
for n = 2:numberOfGrams
strLength = size(s,2)-n+1;
indices = repmat((1:strLength)',1,n)+repmat(1:n,strLength,1)-1;
grams = s(indices);
gramNumbers = (double(grams)-double('a'))*((ones(1,n)*n).^(0:n-1))';
[uniqueGrams, gramInd] = unique(gramNumbers);
count=hist(gramNumbers,uniqueGrams);
ngrams(n) = {struct('gram',grams(gramInd,:),'count',count)};
end
edit:
the result will be:
ngrams{n}.gram %a list of all n letter sequences in the string
ngrams{n}.count(x) %the number of times the sequence ngrams{n}.gram(x) appears

string matching in matlab

I have two short (S with the size of 1x10) and very long (L with the size of 1x1000) strings and I am going to find the locations in L which are matched with S.
In this specific matching, I am just interested to match some specific strings in S (the black strings). Is there any function or method in matlab that can match some specific strings (for example string numbers of 1, 5, 9 in S)?
If I understand your question correctly, you want to find substrings in L that contain the same letters (characters) as S in certain positions (let's say given by array idx). Regular expressions are ideal here, so I suggest using regexp.
In regular expressions, a dot (.) matches any character, and curly braces ({}) optionally specify the number of desired occurrences. For example, to match a string of length 6, where the second character is 'a' and the fifth is 'b', our regular expression could be any of the following syntaxes:
.a..b.
.a.{2}b.
.{1}a.{2}b.{1}
any of these is correct. So let's construct a regular expression pattern first:
in = num2cell(diff([0; idx(:); numel(S) + 1]) - 1); %// Intervals
ch = num2cell(S(idx(:))); %// Matched characters
C = [in(:)'; ch(:)', {''}];
pat = sprintf('.{%d}%c', C{:}); %// Pattern for regexp
Now all is left is to feed regexp with L and the desired pattern:
loc = regexp(L, pat)
and voila!
Example
Let's assume that:
S = 'wbzder'
L = 'gabcdexybhdef'
idx = [2 4 5]
First we build a pattern:
in = num2cell(diff([0; idx(:); numel(S) + 1]) - 1);
ch = num2cell(S(idx(:)));
C = [in(:)'; ch(:)', {''}];
pat = sprintf('.{%d}%c', C{:});
The pattern we get is:
pat =
.{1}b.{1}d.{0}e.{1}
Obviously we can add code that beautifies this pattern into .b.de., but this is really an unnecessary optimization (regexp can handle the former just as well).
After we do:
loc = regexp(L, pat)
we get the following result:
loc =
2 8
Seems correct.

Resources