Python regular expression substitution help needed - python-3.x

I may have any of the following input strings -
i/p 1) Required 16 pcs
i/p 2) Required7 units
i/p 3) Requesting 12each
I wish to do some regular expression based substitution so that I have the following outputs for the above 3 strings -
o/p 1) Required 16 units
o/p 2) Required 7 units
o/p 3) Requesting 12 units
Basically, if my string contains pcs/units/each, and an integer before that, I want to do the following -
#1. replace the string "pcs" / "each" with "units" &
#2. add spaces before and after the integer value
I am using re in Python 3.8. I guess I might have to use backreferences and numbered capturing groups, but I am not able to figure out exactly how to make this work.

import re
txt = '''
Required 16 pcs
Required7 units
Requesting 12each
'''
print( re.sub(r'\s*(\d+)\s*(?:units|each|pcs)', r' \1 units', txt) )
Prints:
Required 16 units
Required 7 units
Requesting 12 units
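To check the behaviour one string at a time, the same substitution can be wrapped in a small helper. This is only a minimal sketch: the name normalize_units is hypothetical, and the trailing \b is an assumption added here (it is not part of the answer above) so that only whole unit words are rewritten.

```python
import re

# \s* on both sides of the number absorbs whatever spacing is present.
# The trailing \b is an added assumption: it stops the pattern from
# matching words that merely start with a unit, such as "eachway".
UNIT_RE = re.compile(r'\s*(\d+)\s*(?:units|each|pcs)\b')

def normalize_units(text):
    """Rewrite '<number><unit>' as ' <number> units' (hypothetical helper)."""
    return UNIT_RE.sub(r' \1 units', text)

for line in ['Required 16 pcs', 'Required7 units', 'Requesting 12each', 'No quantity here']:
    print(normalize_units(line))
```

Strings without a number in front of a unit word, such as the last one, are returned unchanged.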

import re
s = \
"""
Required 16 pcs
Required7 units
Requesting 12each
"""
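s2 = re.sub(r'(\S*?)(\s*?)(\d+)(\s*?)(pcs|units|each)',r'\1 \3 units',s)
print(s2)
Explanation:
(\S*?) - capture group 1: zero or more non-whitespace characters, non-greedy (the word before the number)
(\s*?) - capture groups 2 and 4: optional whitespace (zero or more characters, non-greedy) before and after the number
(\d+) - capture group 3: the digit(s)
(pcs|units|each) - capture group 5: the unit word to be replaced
The replacement is group 1, followed by a literal space, group 3 and the literal text ' units'. Because groups 2 and 4 are dropped and fixed spaces are written instead, missing spaces before or after the number are corrected.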

Related

"python" regex at least 1 letter 1 number, at least 4

I want to extract 'MME73KH/A' from the string below.
import re
pattern = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d]{4,}$")
findalled = pattern.findall('[최대10%혜택] Apple 에어팟 3세대 2021년형 (MME73KH/A) : 애플 공식 브랜드스토어')
print(findalled)
More than one example could have helped to understand your requirements. From what I read, you want a pattern of at least 4 characters, with at least one letter, one digit, and possibly a slash "/" char (from your example, MME73KH/A). This should do the trick:
import re
pattern = re.compile(r'[A-Za-z\d/]+[A-Za-z][\d][A-Za-z\d/]+|[A-Za-z\d/]+[\d][A-Za-z][A-Za-z\d/]+')
findalled = pattern.findall('[최대10%혜택] Apple 에어팟 3세대 2021년형 (MME73KH/A) : 애플 공식 브랜드스토어')
print(findalled)
# output: ['MME73KH/A']
Decomposition of the regex (adjacent string literals are concatenated, so this is the same pattern):
pattern = re.compile(
    r'[A-Za-z\d/]+'  # one or more letters, digits or "/", followed by
    r'[A-Za-z]'      # exactly one letter, followed by
    r'\d'            # exactly one digit, followed by
    r'[A-Za-z\d/]+'  # one or more letters, digits or "/" (so at least 4 chars in total)
    r'|'             # OR
    r'[A-Za-z\d/]+'  # one or more letters, digits or "/", followed by
    r'\d'            # exactly one digit, followed by
    r'[A-Za-z]'      # exactly one letter, followed by
    r'[A-Za-z\d/]+'  # one or more letters, digits or "/" (so at least 4 chars in total)
)
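This will retrieve strings like MME73KH/A, but also 32REGK2 or ABCD1234, while ignoring shorter strings and strings made only of letters or only of digits. Note that the pattern needs a letter and a digit next to each other, with at least one more allowed character on each side, so a string such as A123 is not matched.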
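If the requirement is simply "at least 4 characters, at least one letter and at least one digit, anywhere in the token", one alternative is to split the work into two steps: find candidate tokens first, then filter them. The sketch below is an assumption about what the asker wants rather than a drop-in replacement for the pattern above; the names TOKEN_RE and find_codes are hypothetical, and "/" is allowed only because it appears in the example.

```python
import re

# Step 1: candidate tokens are runs of at least 4 ASCII letters, digits or "/".
TOKEN_RE = re.compile(r'[A-Za-z0-9/]{4,}')

def find_codes(text):
    """Return tokens containing at least one letter and one digit (hypothetical helper)."""
    # Step 2: keep only the candidates that contain both a letter and a digit.
    return [
        token for token in TOKEN_RE.findall(text)
        if re.search(r'[A-Za-z]', token) and re.search(r'[0-9]', token)
    ]

text = '[최대10%혜택] Apple 에어팟 3세대 2021년형 (MME73KH/A) : 애플 공식 브랜드스토어'
print(find_codes(text))                    # ['MME73KH/A']
print(find_codes('A123 ab12 abcd 1234'))   # ['A123', 'ab12']
```

Here the letter and the digit do not have to be adjacent, and they may sit at either end of the token.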

Python - regex extract numbers from text that may contain thousands or millions separators and convert them to dot separated decimal floats

I'm trying to extract 'valid' numbers from text that may or may not contain thousands or millions separators and decimals. The problem is that the separator is sometimes ',' and sometimes '.', and the same applies to the decimal mark. To detect automatically whether a character is a decimal mark or a thousands separator, I should check whether a ',' or '.' occurs again later in the number, in addition to the \d{3} condition.
Another problem I have found is that the text contains dates in the format 'dd.mm.yyyy' or 'mm.dd.yy', which must not be matched.
The goal is to convert the 'valid' numbers to float: I need to make sure the match is not a date, then remove the millions/thousands separators, and finally replace ',' with '.' when the decimal separator is ','.
I have read other great answers, such as "Regular expression to match numbers with or without commas and decimals in text", which solve more specific problems. I would be happy with something robust (I don't need to do it in a single regex).
Here's what I've tried so far but the problem is well above my regex skills:
import re
p = r'\d+(?:[,.]\d{3})*(?:[.,]\d*)'
for s in ['blabla 1,25 10.587.256,25 euros', '6.010,12', '6.010', '6,010', '6,010.12', '6010,124', '05.12.2018', '12.05.18']:
    print(s, re.findall(p, s, re.IGNORECASE))
You can use
import re
p = r'\b\d{1,2}\.\d{1,2}\.\d{2}(?:\d{2})?\b|\b(?<!\d[.,])(\d{1,3}(?=([.,])?)(?:\2\d{3})*|\d+)(?:(?(2)(?!\2))[.,](\d+))?\b(?![,.]\d)'
def postprocess(x):
    if x.group(3):
        return f"{x.group(1).replace(',','').replace('.','')}.{x.group(3)}"
    elif x.group(2):
        return f"{x.group(1).replace(',','').replace('.','')}"
    else:
        return None
texts = ['blabla 1,25 10.587.256,25 euros', '6.010,12', '6.010', '6,010', '6,010.12', '6010,124', '05.12.2018', '12.05.18']
for s in texts:
    print(s, '=>', list(filter(None, [postprocess(x) for x in re.finditer(p, s)])) )
Output:
blabla 1,25 10.587.256,25 euros => ['1.25', '10587256.25']
6.010,12 => ['6010.12']
6.010 => ['6010']
6,010 => ['6010']
6,010.12 => ['6010.12']
6010,124 => ['6010.124']
05.12.2018 => []
12.05.18 => []
The regex is
\b\d{1,2}\.\d{1,2}\.\d{2}(?:\d{2})?\b|\b(?<!\d[.,])(\d{1,3}(?=([.,])?)(?:\2\d{3})*|\d+)(?:(?(2)(?!\2))[.,](\d+))?\b(?![,.]\d)
Details:
\b\d{1,2}\.\d{1,2}\.\d{2}(?:\d{2})?\b| - matches a whole word, 1-2 digits, ., 1-2 digits, ., 2 or 4 digits (this match will be skipped)
\b - a word boundary
(?<!\d[.,]) - a negative lookbehind failing the match if there is a digit and a . or , immediately on the left
(\d{1,3}(?=([.,])?)(?:\2\d{3})*|\d+) - Group 1:
    \d{1,3} - one, two or three digits
    (?=([.,])?) - a lookahead that optionally captures a . or , immediately on the right into Group 2
    (?:\2\d{3})* - zero or more sequences of the Group 2 value and then any three digits
    | - or
    \d+ - one or more digits
(?:(?(2)(?!\2))[.,](\d+))? - an optional sequence of
    (?(2)(?!\2)) - if Group 2 matched, the next char cannot be the Group 2 value
    [.,] - a comma or dot
    (\d+) - Group 3: one or more digits
\b - a word boundary
(?![,.]\d) - a negative lookahead failing the match if there is a , or . and a digit immediately on the right.
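The postprocess function returns None when neither Group 2 nor Group 3 matched (which is the case for dates, so they are filtered out). Otherwise it returns the number as a string with no commas or dots in the integer part and, if there is a fractional part, a dot as the decimal separator.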
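Since the question asks for floats rather than strings, the last step could look like the sketch below. It reuses the regex above unchanged (only split across lines); the names NUMBER_RE and extract_floats are hypothetical. One deliberate difference, which is an assumption about what is wanted: unlike postprocess, this version also keeps plain integers that contain no separator at all.

```python
import re

# Same pattern as above, split across adjacent string literals for readability.
NUMBER_RE = re.compile(
    r'\b\d{1,2}\.\d{1,2}\.\d{2}(?:\d{2})?\b|'              # dates: matched, never captured
    r'\b(?<!\d[.,])(\d{1,3}(?=([.,])?)(?:\2\d{3})*|\d+)'   # Group 1: integer part
    r'(?:(?(2)(?!\2))[.,](\d+))?\b(?![,.]\d)'              # Group 3: optional decimals
)

def extract_floats(text):
    """Return the numbers found in text as floats, skipping dates (hypothetical helper)."""
    values = []
    for m in NUMBER_RE.finditer(text):
        if m.group(1) is None:
            continue  # the date alternative matched, so nothing was captured
        integer_part = m.group(1).replace(',', '').replace('.', '')
        fraction = m.group(3)
        values.append(float(f'{integer_part}.{fraction}' if fraction else integer_part))
    return values

print(extract_floats('blabla 1,25 10.587.256,25 euros'))  # [1.25, 10587256.25]
print(extract_floats('05.12.2018'))                       # []
```

As in the answer, a single separator followed by exactly three digits (6.010 or 6,010) is read as a thousands separator, which is a convention rather than something the text itself can prove.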

Python reformatting strings based on contents

In a pandas dataframe I have rows with contents in the following format:
1) abc123-Target 4-ufs
2) abc123-target4-ufs
3) geo.4
4) j123T4
All of these should be simply: target 4
So far my cleaning procedure is as follows:
df["point_id"] = df["point_id"].str.lower()
df["point_id"] = df['point_id'].str.replace(r'^.*?(?=target)', '', regex=True)
This returns:
1) target 4-ufs
2) target4-ufs
3) geo.14
4) geo.2
5) j123T4
What I believe I need is:
a. Remove anything after the last number in the string, this solves 1
b. If 'target' does not have a space after it add a space, this with the above solves 2
c. If the string ends in a point and a number of any length remove everything before the point (incl. point) and replace with 'target ', this solves 3 and 4
d. If the string ends with a 't' followed by a number of any length remove everything before 't' and replace with 'target ', this solves 5
I'm looking at regex and re but the following is not having effect (add space before the last number)
df["point_id"] = re.sub(r'\D+$', '', df["point_id"])
Reading the rules, you might use 2 capture groups and check for the group values:
\btarget\s*(\d+)|.*[t.](\d+)$
\btarget\s*(\d+) Match target, optional whitespace chars and capture 1+ digits in group 1
| Or
.*[t.] Match 0+ characters followed by either t or a .
(\d+)$ Capture 1+ digits in group 2 at the end of the string
Python example:
import re
import pandas as pd
pattern = r"\btarget\s*(\d+)|.*[t.](\d+)$"
strings = [
    "abc123-Target 4-ufs",
    "abc123-target4-ufs",
    "geo.4",
    "j123T4"
]
df = pd.DataFrame(strings, columns=["point_id"])
def change(s):
    m = re.search(pattern, s, re.IGNORECASE)
    if m is None:
        return s  # leave values that match neither alternative unchanged
    # Group 2 is set by the second alternative, Group 1 by the first.
    return "target " + (m.group(2) if m.group(2) else m.group(1))
df["point_id"] = df["point_id"].apply(change)
print(df)
Output
   point_id
0  target 4
1  target 4
2  target 4
3  target 4
You can use
import pandas as pd
df = pd.DataFrame({'point_id':['abc123-Target 4-ufs','abc123-target4-ufs','geo.4','j123T4']})
df['point_id'] = df['point_id'].str.replace(r'(?i).*Target\s*(\d+).*', r'target \1', regex=True)
df.loc[df['point_id'].str.contains(r'(?i)\w[.t]\d+$'), 'point_id'] = 'target 4'
#    point_id
# 0  target 4
# 1  target 4
# 2  target 4
# 3  target 4
The first regex is (?i).*Target\s*(\d+).*:
(?i) - case insensitive matching
.* - any 0+ chars other than line break chars, as many as possible
Target\s*(\d+) - Target, zero or more whitespaces, and one or more digits captured into Group 1
.* - any 0+ chars other than line break chars, as many as possible
The second regex matches
(?i) - case insensitive matching
\w - a word char, then
[.t] - a . or t and then
\d+$ - one or more digits at the end of string.
The second regex is used as a mask, and the values in the point_id column are set to target 4 whenever the pattern matches the regex.
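The mask assignment writes the literal value target 4, which fits the sample data because every number there is 4. If the number should instead be taken from each value, one possible variant captures the trailing digits in the second pattern as well. The sketch below uses plain re so that it can be tried without a dataframe; to_target is a hypothetical name, and the anchored second pattern is an adaptation of the mask regex, not part of the answer above.

```python
import re

def to_target(point_id):
    """Normalize a point id to 'target <number>' (hypothetical helper)."""
    # Rule 1: "...target<optional space><digits>..." becomes "target <digits>".
    s = re.sub(r'(?i).*target\s*(\d+).*', r'target \1', point_id)
    # Rule 2: "...<word char><. or t><digits>" at the end becomes "target <digits>".
    # Values already rewritten by rule 1 have a space before the digits,
    # so they do not match this pattern and are left alone.
    return re.sub(r'(?i)^.*\w[.t](\d+)$', r'target \1', s)

for value in ['abc123-Target 4-ufs', 'abc123-target4-ufs', 'geo.14', 'j123T4']:
    print(value, '=>', to_target(value))
```

In pandas, the same two patterns could be passed to Series.str.replace with regex=True, one call after the other.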

Find a better way to extract text from a string that contains the same delimiter multiple times

I have the text below, in which each field between the "|" characters varies over time in both content and length; only the number of "|" characters is fixed. I can retrieve the field I want ("XYZGM"), but is there a better way to do it?
"{#BATCH|ABCDEF|01|12|1||XYZGM|210401113439|online|ATGHDGV03|QGH83826|RevA|||"
Current code I used:
text="{#BATCH|ABCDEF|01|12|1||XYZGM|210401113439|online|ATGHDGV03|QGH83826|RevA|||"
# get text from 6th position to 7th position of "|"
pos_count=0
z=0
for i in range(z,len(text)):
    pos=text.find('|', z, len(text))
    if pos>0:
        pos_count+=1
        z=pos+1
    if pos_count==6:
        x=pos+1
    if pos_count==7:
        y=pos
        break
print("X: {}, Y: {}".format(x,y))
result=text[x:y]
print(result)
and the result is : "XYZGM"
Another option could be using a pattern:
^{#(?:[^|]*\|){6}([^|]+)
^ Start of string
{# Match {#
(?:[^|]*\|){6} Repeat 6 times any char except | then match |
([^|]+) Capture group 1, match 1+ times any char except |
import re
pattern = r"^{#(?:[^|]*\|){6}([^|]+)"
s = "{#BATCH|ABCDEF|01|12|1||XYZGM|210401113439|online|ATGHDGV03|QGH83826|RevA|||"
match = re.match(pattern, s)
if match:
    print(match.group(1))
Output
XYZGM
There is no need for a regex:
text="{#BATCH|ABCDEF|01|12|1||XYZGM|210401113439|online|ATGHDGV03|QGH83826|RevA|||"
if text.startswith("{#"):
    print(text[2:].split("|")[6])
Make sure the text begins with {#, split the rest on |, and take the element at index 6 (the seventh field).
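If some records may be malformed or have fewer fields, the split approach can be wrapped in a small function that returns None instead of raising IndexError. This is a sketch only: the name field_at and its prefix and sep parameters are hypothetical.

```python
def field_at(text, index, prefix="{#", sep="|"):
    """Return the field at `index` after the prefix, or None if it is absent (hypothetical helper)."""
    if not text.startswith(prefix):
        return None
    fields = text[len(prefix):].split(sep)
    # Guard against records that have fewer fields than expected.
    return fields[index] if 0 <= index < len(fields) else None

text = "{#BATCH|ABCDEF|01|12|1||XYZGM|210401113439|online|ATGHDGV03|QGH83826|RevA|||"
print(field_at(text, 6))        # XYZGM
print(field_at("{#a|b", 6))     # None
```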

How to match different groups with different positions with one single pattern in Python3 Regex

I want to match several groups at different positions in a line using a single pattern.
Note that the last 5 digits (between the brackets) are at a different position from the timestamp; matching them is my actual question.
import re
line = "Jul 6 14:02:08 computer.name jam_tag=psim[29187]: (UUID:006)"
# I couldn't work out how to match a second group at a different position, i.e. the 5 digits between the brackets
pattern = r"(Jul\s\d\s\d+:+\d+:+\d+)"
result = re.search(pattern, line)
print(result) # output should be: Jul 6 14:02:08 29187
# my actual output: Jul 6 14:02:08 (I still don't know how to match a group at a different position using only one pattern)
You may use
import re
def show_time_of_pid(line):
    pattern = r"^(Jul\s+\d+\s+[\d:]*\d).*?\[(\d+)]"
    result = re.search(pattern, line)
    return "{} pid:{}".format(result.group(1),result.group(2)) if result else ""
Regex details
^ - start of string
(Jul\s+\d+\s+[\d:]*\d) - Group 1: Jul, 1+ whitespaces, 1+ digits, 1+ whitespaces, zero or more digits or colons and then a digit
.*? - any 0+ chars, other than line break chars, as few as possible
\[(\d+)] - [, Group 2 capturing 1 or more digits, and then a ].
Example usage:
print(show_time_of_pid("Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"))
# => Jul 6 14:01:23 pid:29440
print(show_time_of_pid("Jul 6 14:02:08 computer.name jam_tag=psim[29187]: (UUID:006)"))
# => Jul 6 14:02:08 pid:29187
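The pattern above is tied to lines that start with Jul. A slightly more general variant, sketched below, accepts any three-letter month abbreviation and uses named groups so the two parts can be read back by name. It assumes syslog-style lines shaped like the examples; LOG_RE, parse_log_line and the group names are hypothetical.

```python
import re

# Assumed line shape: "<Mon> <day> <HH:MM:SS> <host> <process>[<pid>]: ..."
LOG_RE = re.compile(
    r'^(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})'  # e.g. "Jul 6 14:02:08"
    r'.*?\[(?P<pid>\d+)\]'                                           # first "[digits]" after it
)

def parse_log_line(line):
    """Return (timestamp, pid) as strings, or None if the line does not match."""
    m = LOG_RE.search(line)
    return (m.group('timestamp'), m.group('pid')) if m else None

print(parse_log_line("Jul 6 14:02:08 computer.name jam_tag=psim[29187]: (UUID:006)"))
# ('Jul 6 14:02:08', '29187')
```

Returning None for non-matching lines leaves it to the caller to decide how to report them.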
