regex | extract numbers preceded by defined strings - python-3.x

I have strings like:
Bla bla 0.75 oz. Bottle
Mugs, 8oz. White
Bowls, 4.4" dia x 2.5", 12ml. Natural
Ala bala 3.3" 30ml Bottle'
I want to extract the numeric value which occurs before my pre-defined lookaheads, in this case [oz, ml]
0.75 oz
8 oz
12 ml
30 ml
I have the below code:
import re
import pandas as pd
look_ahead = "oz|ml"
s = pd.Series(['Bla bla 0.75 oz. Bottle',
               'Mugs, 8oz. White',
               'Bowls, 4.4" dia x 2.5", 12ml. Natural',
               'Ala bala 3.3" 30ml Bottle'])
size_and_units = s.str.findall(
    rf"((?!,)[0-9]+.*[0-9]* *(?={look_ahead})[a-zA-Z]+)")
print(size_and_units)
Which outputs this:
0 [0.75 oz]
1 [8oz]
2 [4.4" dia x 2.5", 12ml]
3 [3.3" 30ml]
You can see there is a mismatch between what I want as output and what I am getting from my script. I think my regex is picking up everything between the first numeric value and my defined lookahead; however, I only want the last numeric value before the lookahead.
I am out of my depth with regex. Can someone help fix this?
Thank you!

Making as few changes to your regex as possible, so you can see what went wrong:
In [0-9]+.*[0-9]*, replace . with \.. An unescaped . means any character; \. means a literal period.
s = pd.Series(['Bla bla 0.75 oz. Bottle',
               'Mugs, 8oz. White',
               'Bowls, 4.4" dia x 2.5", 12ml. Natural',
               'Ala bala 3.3" 30ml Bottle'])
size_and_units = s.str.findall(
    rf"((?!,)[0-9]+\.*[0-9]* *(?={look_ahead})[a-zA-Z]+)")
gives:
0 [0.75 oz]
1 [8oz]
2 [12ml]
3 [30ml]
You don't need to use a lookahead at all though, since you also want to match the units. Just do
\d+\.*\d*\s*(?:oz|ml)
This gives the same result:
size_and_units = s.str.findall(
    rf"\d+\.*\d*\s*(?:{look_ahead})")

Some notes about the pattern that you tried:
You can omit the lookahead (?!,) as it is always true: the next character to match is a digit, which can never be a comma
In the part .*[0-9]* *(?=oz|ml)[a-zA-Z]+, the whole of .*[0-9]* * is optional and will first match until the end of the string. The engine then backtracks until it can match either oz or ml, and [a-zA-Z]+ matches 1 or more chars a-zA-Z, so it could also match 0.75 ozaaaaaaa
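A quick check of that claim, running the question's original pattern against a made-up input string:
import re

# The original pattern also accepts an invented unit made of extra letters
print(re.findall(r"((?!,)[0-9]+.*[0-9]* *(?=oz|ml)[a-zA-Z]+)", "Bla bla 0.75 ozaaaaaaa Bottle"))
# ['0.75 ozaaaaaaa']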
If you want the matches, you don't need a capture group or lookarounds. You can match:
\b\d+(?:\.\d+)*\s*(?:oz|ml)\b
\b A word boundary to prevent a partial word match
\d+(?:\.\d+)* Match 1+ digits with an optional decimal part
\s*(?:oz|ml) Match optional whitespace chars and either oz or ml
\b A word boundary
Regex demo
import pandas as pd
look_ahead = "oz|ml"
s = pd.Series(['Bla bla 0.75 oz. Bottle',
               'Mugs, 8oz. White',
               'Bowls, 4.4" dia x 2.5", 12ml. Natural',
               'Ala bala 3.3" 30ml Bottle'])
size_and_units = s.str.findall(
    rf"\b\d+(?:\.\d+)*\s*(?:{look_ahead})\b")
print(size_and_units)
Output
0 [0.75 oz]
1 [8oz]
2 [12ml]
3 [30ml]

I think this regex will work for you:
[0-9]+\.*[0-9]* *(oz|ml)
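A minimal sketch of applying that pattern with plain re. Note that re.findall would return only the captured unit because of the (oz|ml) group, so finditer is used here to keep the whole match:
import re

pattern = r"[0-9]+\.*[0-9]* *(oz|ml)"
text = 'Bowls, 4.4" dia x 2.5", 12ml. Natural'

# group(0) is the full match ('12ml'); group(1) would be only the unit ('ml')
print([m.group(0) for m in re.finditer(pattern, text)])
# ['12ml']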

Related

regular expression: How to match a list of words (allow combination)?

I'm trying to construct a regular expression to capture units and the corresponding values.
For example,
import re
candis = ['mmol','mm']
test_reg = '|'.join([ut+r"\-?[1-4]?" for ut in candis])
test_reg = r"\b(?:" + test_reg + r")\b"
test_reg = r"\d (?:" + test_reg + r"\s?){1,3}"
test_str = '3 mmol mm'
re.findall(test_reg,test_str)
the test_reg is constructed to capture the unit mmol mm and the corresponding value of 3.
However, as you can readily observe in the example, test_reg does not work for a string like 3 mmol2mm because of the \b.
How can I construct a regular expression that can also match 3 mmol2mm and 3 mmolmm, which only contains word combinations that are strictly from candis? (3 mmol mmb won't match)
You can use
\d+(?=((?:\s*(?:mmol|mm)-?[1-4]?){1,3}))\1\b
See the regex demo. Details:
\d+ - one or more digits
(?=((?:\s*(?:mmol|mm)-?[1-4]?){1,3})) - a positive lookahead with a capturing group inside, used to imitate an atomic group, that matches a location immediately followed by
(?:\s*(?:mmol|mm)-?[1-4]?){1,3} - one, two or three occurrences of
\s* - zero or more whitespaces
(?:mmol|mm) - a candis value
-? - an optional - char
[1-4]? - an optional digit from 1 to 4
\1 - Group 1 value (backreferences do not allow backtracking)
\b - word boundary.
See the Python demo:
import re
candis = ['mmol','mm']
test_reg = r"\d+(?=((?:\s*(?:{})-?[1-4]?){{1,3}}))\1\b".format('|'.join(candis))
test_str = '3 mmol mm 3 mmol2mm and 3 mmolmm AND NOT 3 mmol mmb'
print( [x.group() for x in re.finditer(test_reg,test_str)] )
Output:
['3 mmol mm', '3 mmol2mm', '3 mmolmm']
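If the third-party regex module is available (a separate install, pip install regex), the lookahead-plus-backreference trick is not needed, since that engine supports real atomic groups. A sketch of the equivalent pattern:
import regex

candis = ['mmol', 'mm']
# (?>...) is an atomic group: once it has matched, the engine will not backtrack into it
test_reg = r"\d+(?>(?:\s*(?:{})-?[1-4]?){{1,3}})\b".format('|'.join(candis))
test_str = '3 mmol mm 3 mmol2mm and 3 mmolmm AND NOT 3 mmol mmb'
print(regex.findall(test_reg, test_str))
# ['3 mmol mm', '3 mmol2mm', '3 mmolmm']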

Remove all spaces for chinese characters while keeping necessary spaces for english in Python regex

Let's say my dataframe has a column mixed with English and Chinese words or characters. I would like to remove all the whitespace between Chinese characters, and for English keep only one space between words.
I have found a solution for removing extra spaces between English words from here.
import re
import pandas as pd
s = pd.Series(['V e r y calm', 'Keen and a n a l y t i c a l',
'R a s h and careless', 'Always joyful', '你 好', '黑 石 公 司', 'FAN STUD1O', 'beauty face 店 铺'])
Code:
regex = re.compile('(?<![a-zA-Z]{2})(?<=[a-zA-Z]{1}) +(?=[a-zA-Z] |.$)')
s.str.replace(regex, '')
Out[87]:
0 Very calm
1 Keen and analytical
2 Rash and careless
3 Always joyful
4 你 好
5 黑 石 公 司
dtype: object
But as you see, it works for English but didn't remove the spaces between the Chinese characters. How could I get the expected result as follows:
Out[87]:
0 Very calm
1 Keen and analytical
2 Rash and careless
3 Always joyful
4 你好
5 黑石公司
dtype: object
Reference: Remove all spaces between Chinese words with regex
You could use the Chinese (well, CJK) Unicode property \p{script=Han} or \p{Han}.
However, this only works if the regex engine supports UTS#18 Unicode regular expressions. The default Python re module does not but you can use the alternative (much improved) regex engine:
import regex as re
rex = r"(?<![a-zA-Z]{2})(?<=[a-zA-Z]{1})[ ]+(?=[a-zA-Z] |.$)|(?<=\p{Han}) +"
test_str = ("V e r y calm\n"
"Keen and a n a l y t i c a l\n"
"R a s h and careless\n"
"Always joyful\n"
"你 好\n"
"黑 石 公 司")
result = re.sub(rex, "", test_str, 0, re.MULTILINE | re.UNICODE)
Results in
Very calm
Keen and analytical
Rash and careless
Always joyful
你好
黑石公司
Online Demo (the demo is using PCRE for demonstration purposes only)
This regex should get you what you want. See the full code snippet at the bottom.
regex = re.compile(
    r"((?<![a-zA-Z]{2})(?<=[a-zA-Z]{1})\s+(?=[a-zA-Z]\s|.$)|(?<=[\u4e00-\u9fff]{1})\s+)",
    re.UNICODE,
)
I made the following edits to your regex above:
Right now, the regex basically matches all spaces that appear after a single-letter word and before another single-letter word.
I added a part at the end of the regex that selects all spaces after a Chinese character (I used the Unicode range [\u4e00-\u9fff], which would cover Japanese and Korean as well).
I changed the spaces in the regex to the whitespace character class \s so we could catch other input like tabs.
I also added the re.UNICODE flag so that \s would cover unicode spaces as well.
import re
import pandas as pd
s = pd.Series(
    [
        "V e r y calm",
        "Keen and a n a l y t i c a l",
        "R a s h and careless",
        "Always joyful",
        "你 好",
        "黑 石 公 司",
    ]
)
regex = re.compile(
    r"((?<![a-zA-Z]{2})(?<=[a-zA-Z]{1})\s+(?=[a-zA-Z]\s|.$)|(?<=[\u4e00-\u9fff]{1})\s+)",
    re.UNICODE,
)
s.str.replace(regex, "")
Output:
0 Very calm
1 Keen and analytical
2 Rash and careless
3 Always joyful
4 你好
5 黑石公司
dtype: object
Use word boundaries \b in lookarounds:
(?<=\b\w\b) +(?=\b\w\b)
This matches spaces between solitary (bounded by word boundaries) "word characters", which includes Chinese characters.
Pre Python 3 (and in Java, for example), \w only matches English letters, so you would need to add the Unicode flag (?u) to the front of the regex.
import re

s = ['V e r y calm', 'Keen and a n a l y t i c a l',
     'R a s h and careless', 'Always joyful', '你 好', '黑 石 公 司']
regex = r'(?<=\b\w\b) +(?=\b\w\b)'
res = [re.sub(regex, '', line) for line in s]
print(res)
Output:
['Very calm', 'Keen and analytical', 'Rash and careless', 'Always joyful', '你好', '黑石公司']

Split pandas column on number with %

I have a df where one of the columns looks like:
**Share**
We are safe 25%
We are always safe 12.50% (India Aus, West)
We are ok (USA, EU)
We are not OK
What is this
Always wise 25.66%
I want to split this column such that the % values wherever applicable get split from the column into a new one.
So the output would be
Share                 Percent    LOCATION
We are safe           25%
We are always safe    12.50%     India Aus, West
We are ok                        USA, EU
We are not OK
What is this
Always wise           25.66%
I thought the below would split it from the right, but it is not working:
df['Percent'] = df['Share'].str.rsplit(r' \d',1).str[0]
You can extract those values:
df[['Share','Percent']] = df['Share'].str.split(r'\s+(?=\d+(?:\.\d+)?%\s*$)',expand=True).fillna("")
Pandas test:
import pandas as pd
df = pd.DataFrame({'Share':['We are safe 25%','We are ok', 'We are always safe 12.50%']})
df[['Share','Percent']] = df['Share'].str.split(r'\s+(?=\d+(?:\.\d+)?%\s*$)',expand=True).fillna("")
>>> df
                Share  Percent
0         We are safe      25%
1           We are ok
2  We are always safe   12.50%
See the regex demo. Details:
\s+ - one or more whitespaces
(?=\d+(?:\.\d+)?%\s*$) - a positive lookahead matching a location that is immediately followed by:
\d+ - one or more digits
(?:\.\d+)? - an optional sequence of . and one or more digits
% - a % symbol
\s* - 0 or more trailing (as $ comes next) whitespaces and
$ - end of string.
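The split above only produces the Percent column. If you also need the LOCATION column from the expected output, one possible approach is Series.str.extract with named groups; this is a sketch that assumes the location, when present, always sits in trailing parentheses:
import pandas as pd

df = pd.DataFrame({'Share': ['We are safe 25%',
                             'We are always safe 12.50% (India Aus, West)',
                             'We are ok (USA, EU)',
                             'We are not OK',
                             'What is this',
                             'Always wise 25.66%']})

# Named groups become the column names; rows without a percent or location get ''
pat = r'^(?P<Share>.*?)\s*(?P<Percent>\d+(?:\.\d+)?%)?\s*(?:\((?P<LOCATION>[^)]*)\))?\s*$'
df[['Share', 'Percent', 'LOCATION']] = df['Share'].str.extract(pat).fillna('')
print(df)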

Python regular expression substitution help needed

I may have any of the following input strings -
i/p 1) Required 16 pcs
i/p 2) Required7 units
i/p 3) Requesting 12each
I wish to do some regular expression based substitution so that I have the following outputs for the above 3 strings -
o/p 1) Required 16 units
o/p 2) Required 7 units
o/p 3) Requesting 12 units
Basically, if my string contains pcs/units/each, and an integer before that, I want to do the following -
#1. replace the string "pcs" / "each" with "units" &
#2. add spaces before and after the integer value
I am using re in Python 3.8. I guess I might have to use backreferencing and numbered capturing groups, but I am not able to figure out exactly how to make this work.
import re
txt = '''
Required 16 pcs
Required7 units
Requesting 12each
'''
print( re.sub(r'\s*(\d+)\s*(?:units|each|pcs)', r' \1 units', txt) )
Prints:
Required 16 units
Required 7 units
Requesting 12 units
import re
s = \
"""
Required 16 pcs
Required7 units
Requesting 12each
"""
s2 = re.sub(r'(\S*?)(\s*?)(\d*)(\s*?)(pcs|units|each)',r'\1 \3 units',s)
print(s2)
Explanation:
(\S*?) - \S non-space, non-greedy *? - capture group 1
(\d*) - capture group 3 is the digit(s)
re.sub replacement: group 1, a literal space, group 3, another space, and the literal text units. This restores the missing leading/trailing spaces. Groups 2 & 4 (the optional, non-greedy whitespace) are left out of the replacement, so whatever spacing originally surrounded the number is dropped.
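For example, applying that substitution to a single input line, as a minimal check with the standard re module:
import re

# Groups 2 and 4 absorb whatever whitespace (possibly none) surrounds the number,
# and the replacement re-inserts exactly one space on each side of group 3.
print(re.sub(r'(\S*?)(\s*?)(\d*)(\s*?)(pcs|units|each)', r'\1 \3 units', 'Requesting 12each'))
# Requesting 12 units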

How to better code, when looking for substrings?

I want to extract the currency (along with the $ sign) from a list, and create two different currency lists which I have done. But is there a better way to code this?
The list is as below:
['\n\n\t\t\t\t\t$59.90\n\t\t\t\t\n\n\n\t\t\t\t\t\t$68.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$55.00\n\t\t\t\t\n\n\n\t\t\t\t\t\t$68.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$38.50\n\t\t\t\t\n\n\n\t\t\t\t\t\t$49.90\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$49.00\n\t\t\t\t\n\n\n\t\t\t\t\t\t$62.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$68.80\n\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$49.80\n\t\t\t\t\n\n\n\t\t\t\t\t\t$60.50\n\t\t\t\t\t\n\n']
Python code:
pp_list = []
up_list = []
for u in usual_price_list:
    rep = u.replace("\n","")
    rep = rep.replace("\t","")
    s = rep.rsplit("$",1)
    pp_list.append(s[0])
    up_list.append("$"+s[1])
For this kind of problem, I tend to use the re module a lot, as it is more readable, more maintainable, and does not depend on which characters surround what you are looking for:
import re
pp_list = []
up_list = []
for u in usual_price_list:
    prices = re.findall(r"\$\d{2}\.\d{2}", u)
    length_prices = len(prices)
    if length_prices > 0:
        pp_list.append(prices[0])
    if length_prices > 1:
        up_list.append(prices[1])
Regular Expression Breakdown
$ is the end of string character, so we need to escape it
\d matches any digit, so \d{2} matches exactly 2 digits
. matches any character, so we need to escape it
If you want, you can change the number of digits allowed for the cents: \d{1,2} matches one or two digits, and \d* matches zero or more
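A quick illustration of that tweak on made-up price strings:
import re

# \d{1,2} accepts one- or two-digit cents, so "$59.9" is matched as well as "$49.90"
print(re.findall(r"\$\d{2}\.\d{1,2}", "was $59.9 now $49.90"))
# ['$59.9', '$49.90']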
As already pointed out, the re module is useful for this task - I would use re.split in the following way:
import re
data = ['\n\n\t\t\t\t\t$59.90\n\t\t\t\t\n\n\n\t\t\t\t\t\t$68.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$55.00\n\t\t\t\t\n\n\n\t\t\t\t\t\t$68.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$38.50\n\t\t\t\t\n\n\n\t\t\t\t\t\t$49.90\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$49.00\n\t\t\t\t\n\n\n\t\t\t\t\t\t$62.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$68.80\n\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$49.80\n\t\t\t\t\n\n\n\t\t\t\t\t\t$60.50\n\t\t\t\t\t\n\n']
prices = [re.split(r'[\n\t]+',i) for i in data]
prices0 = [i[1] for i in prices]
prices1 = [i[2] for i in prices]
print(prices0)
print(prices1)
Output:
['$59.90', '$55.00', '$38.50', '$49.00', '$68.80', '$49.80']
['$68.00', '$68.00', '$49.90', '$62.00', '', '$60.50']
Note that this will work assuming the strings contain only \n and \t characters apart from the prices, with at least one \n or \t before the first price and at least one \n or \t between prices.
[\n\t]+ denotes any string made from \n or \t with length 1 or greater, that is \n, \t, \n\n, \t\t, \n\t, \t\n and so on
