In a pandas dataframe I have rows with contents in the following format:
1) abc123-Target 4-ufs
2) abc123-target4-ufs
3) geo.4
4) j123T4
All of these should be simply: target 4
So far my cleaning procedure is as follows:
df["point_id"] = df["point_id"].str.lower()
df["point_id"] = df['point_id'].str.replace('^.*?(?=target)', '')
This returns:
1) target 4-ufs
2) target4-ufs
3) geo.14
4) geo.2
5) j123T4
What I believe I need is:
a. Remove anything after the last number in the string, this solves 1
b. If 'target' does not have a space after it add a space, this with the above solves 2
c. If the string ends in a point and a number of any length remove everything before the point (incl. point) and replace with 'target ', this solves 3 and 4
d. If the string ends with a 't' followed by a number of any length remove everything before 't' and replace with 'target ', this solves 5
I'm looking at regex and re but the following is not having effect (add space before the last number)
df["point_id"] = re.sub(r'\D+$', '', df["point_id"])
Reading the rules, you might use 2 capture groups and check for the group values:
\btarget\s*(\d+)|.*[t.](\d+)$
\btarget\s*(\d+) Match target, optional whitespace chars and capture 1+ digits in group 1
| Or
.*[t.] Match 0+ characters followed by either t or a .
(\d+)$ Capture 1+ digits in group 2 at the end of the string
Python example:
import re
import pandas as pd
pattern = r"\btarget\s*(\d+)|.*[t.](\d+)$"
strings = [
"abc123-Target 4-ufs",
"abc123-target4-ufs",
"geo.4",
"j123T4"
]
df = pd.DataFrame(strings, columns=["point_id"])
def change(s):
    m = re.search(pattern, s, re.IGNORECASE)
    return "target " + (m.group(2) if m.group(2) else m.group(1))
df["point_id"] = df["point_id"].apply(change)
print(df)
Output
point_id
0 target 4
1 target 4
2 target 4
3 target 4
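A slightly more defensive variant of change can be sketched as follows. It assumes that values matching neither alternative should be left unchanged (the function above would raise an AttributeError on them, because re.search returns None). The name normalize_point_id and the extra sample strings are made up for this illustration.

```python
import re

# Same pattern as above, compiled once with case-insensitive matching.
PATTERN = re.compile(r"\btarget\s*(\d+)|.*[t.](\d+)$", re.IGNORECASE)

def normalize_point_id(s):
    """Return 'target <n>' when the pattern matches, else the input unchanged."""
    m = PATTERN.search(s)
    if m is None:
        # Assumption: unknown formats are passed through untouched.
        return s
    # Exactly one of the two groups is set, depending on which alternative matched.
    number = m.group(1) or m.group(2)
    return f"target {number}"

samples = ["abc123-Target 4-ufs", "abc123-target4-ufs", "geo.14", "j123T4", "no match here"]
results = [normalize_point_id(s) for s in samples]
print(results)
```

With a dataframe, the same function could be passed to df["point_id"].apply(...) exactly like change.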
You can use
df = pd.DataFrame({'point_id':['abc123-Target 4-ufs','abc123-target4-ufs','geo.4','j123T4']})
df['point_id'] = df['point_id'].str.replace(r'(?i).*Target\s*(\d+).*', r'target \1', regex=True)
df.loc[df['point_id'].str.contains(r'(?i)\w[.t]\d+$'), 'point_id'] = 'target 4'
# point_id
# 0 target 4
# 1 target 4
# 2 target 4
# 3 target 4
The first regex is (?i).*Target\s*(\d+).*:
(?i) - case insensitive matching
.* - any 0+ chars other than line break chars, as many as possible
Target\s*(\d+) - Target, zero or more whitespaces, and one or more digits captured into Group 1
.* - any 0+ chars other than line break chars, as many as possible
The second regex matches
(?i) - case insensitive matching
\w - a word char, then
[.t] - a . or t and then
\d+$ - one or more digits at the end of string.
The second regex is used as a mask, and the values in the point_id column are set to target 4 whenever the pattern matches the regex.
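Note that the mask assignment writes the literal value target 4, so an input such as geo.14 would also become target 4. If the number should be kept, one possible sketch is to use a capture group in the second pattern as well. It is shown here with plain re.sub on single strings; the helper name clean_point_id and the sample geo.14 are assumptions for illustration, not part of the answer above.

```python
import re

def clean_point_id(s):
    """Two-pass cleanup that keeps whatever number was found."""
    # Pass 1: anything containing 'target' + optional whitespace + digits.
    s = re.sub(r"(?i).*target\s*(\d+).*", r"target \1", s)
    # Pass 2: strings ending in a word char, then '.' or 't', then digits.
    # Strings already rewritten by pass 1 ('target 4') do not match this,
    # because a space precedes the trailing digits.
    s = re.sub(r"(?i)^.*\w[.t](\d+)$", r"target \1", s)
    return s

for raw in ["abc123-Target 4-ufs", "abc123-target4-ufs", "geo.14", "j123T4"]:
    print(raw, "->", clean_point_id(raw))
```

In pandas, the second pass would presumably translate to another str.replace call with regex=True instead of the mask assignment.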
I have a Python list and I want a regular expression that removes any substring containing at least 5 uppercase letters, and another regex that removes the part of the string from '?' up to ':'.
INPUT : list = ['helLo/aPPle/BuTTeRfLY:Missed','bliss/ScIENCEs/brew?Dyna=skjdk:Nest','Self/NESTeDsd/hello/MiSSInG:Good']
Output : list = ['helLo/aPPle/:Missed','bliss//brew:Nest','Self//hello/:Good']
This uses two regexes:
(\w*[A-Z]\w*){5,} - matches a run of word characters that contains at least 5 uppercase letters
\?.*(?=:) - matches a substring that starts with ? and runs up to (but not including) the last :
Every match found is replaced with '' and the updated value is written back to the list.
import re
reg =r'(\w*[A-Z]\w*){5,}|\?.*(?=:)'
input_list = ["helLo/aPPle/BuTTeRfLY:Missed","bliss/ScIENCEs/brew?Dyna=skjdk:Nest","Self/NESTeDsd/hello/MiSSInG:Good"]
for data in input_list:
    match = re.finditer(reg, data)
    if match:
        for match_word in match:
            print(match_word)
            if match_word.group() in data:
                # remove the matched substring (5+ uppercase letters, or '?...' up to ':')
                final_str = data.replace(str(match_word.group()), '')
                # find the index of the current item
                index = input_list.index(data)
                # write the new value back to the list
                input_list[index] = data = final_str
print(input_list)
Output: ['helLo/aPPle/:Missed', 'bliss//brew:Nest', 'Self//hello/:Good']
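The same result can probably be reached more directly with re.sub, which replaces every match in one call and avoids the index bookkeeping. This is only a sketch using the same two alternatives; the names PATTERN and cleaned are introduced here for illustration.

```python
import re

# Same two alternatives as above; the group is non-capturing because
# only the overall match is needed.
PATTERN = re.compile(r"(?:\w*[A-Z]\w*){5,}|\?.*(?=:)")

input_list = [
    "helLo/aPPle/BuTTeRfLY:Missed",
    "bliss/ScIENCEs/brew?Dyna=skjdk:Nest",
    "Self/NESTeDsd/hello/MiSSInG:Good",
]

# Build a new list instead of mutating the one being iterated over.
cleaned = [PATTERN.sub("", item) for item in input_list]
print(cleaned)
```

Building a new list also sidesteps the edge case where two items in the list are equal and input_list.index(...) would return the first one.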
I want to choose 'MME73KH/A' in the below.
import re
pattern = re.compile("^(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d]{4,}$")
findalled = pattern.findall('[최대10%혜택] Apple 에어팟 3세대 2021년형 (MME73KH/A) : 애플 공식 브랜드스토어')
print(findalled)
More than one example could have helped to understand your requirements. From what I read, you want a pattern of at least 4 characters, with at least one letter, one digit, and possibly a slash "/" char (from your example, MME73KH/A). This should do the trick:
import re
pattern = re.compile(r'[A-Za-z\d/]+[A-Za-z][\d][A-Za-z\d/]+|[A-Za-z\d/]+[\d][A-Za-z][A-Za-z\d/]+')
findalled = pattern.findall('[최대10%혜택] Apple 에어팟 3세대 2021년형 (MME73KH/A) : 애플 공식 브랜드스토어')
print(findalled)
# output: ['MME73KH/A']
Decomposition of the regex:
pattern = re.compile(
    r'[A-Za-z\d/]+'  # one or more letters, digits or "/"
    r'[A-Za-z]'      # exactly one letter
    r'\d'            # exactly one digit
    r'[A-Za-z\d/]+'  # one or more letters, digits or "/" (so >= 4 chars in total)
    r'|'             # OR
    r'[A-Za-z\d/]+'  # one or more letters, digits or "/"
    r'\d'            # exactly one digit
    r'[A-Za-z]'      # exactly one letter
    r'[A-Za-z\d/]+'  # one or more letters, digits or "/" (so >= 4 chars in total)
)
This will retrieve strings like MME73KH/A, but also 32REGK2 or ABCD1234, while ignoring shorter strings or strings with only letters or only digits.
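A possible alternative, sketched under the assumption that the requirement really is "at least 4 characters from letters, digits and /, with at least one letter and one digit anywhere", is to split the work into two steps: find candidate tokens first, then filter them. The names TOKEN and find_codes are made up for this example. Unlike the pattern above, it does not require the letter and the digit to be adjacent, so it also accepts strings such as A123.

```python
import re

# Candidate tokens: runs of at least 4 ASCII letters, digits or "/".
TOKEN = re.compile(r"[A-Za-z\d/]{4,}")

def find_codes(text):
    """Return tokens that contain at least one letter and at least one digit."""
    return [
        token
        for token in TOKEN.findall(text)
        if re.search(r"[A-Za-z]", token) and re.search(r"\d", token)
    ]

text = '[최대10%혜택] Apple 에어팟 3세대 2021년형 (MME73KH/A) : 애플 공식 브랜드스토어'
print(find_codes(text))
```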
I have code that queries my network switches for errors on their interfaces. The output I get varies from time to time.
The output from the switches has this format (the numbers change from time to time):
I want to print only the lines of that output that end with a number greater than 0, like the line that starts with BAG16.
My code looks like this:
import re
kobi = '''
BAGG11 13917779236 10133016 16491979 64
BAGG15 30841323485 22747672 19201545 0
BAGG16 811970 0 811970 0
'''
err = re.findall (r'[BAGG]',kobi)
print(err)
I think this can be done without regex.
Try this:
kobi = '''
BAGG11 13917779236 10133016 16491979 64
BAGG15 30841323485 22747672 19201545 0
BAGG16 811970 0 811970 0
'''
lst = kobi.split()
lines = [lst[i:i+5] for i in range(0, len(lst), 5)]
for line in lines:
    if int(line[-1]) > 0:
        print(' '.join(line))
Output:
BAGG11 13917779236 10133016 16491979 64
I've made some assumptions about your input:
every line always has five columns
the last column is a numerical value
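If the five-column assumption is too strict, a line-by-line variant might be more forgiving. The sketch below splits the text into lines first and only looks at the last field of each line; the function name lines_with_errors is hypothetical, and lines whose last field is not a plain integer are simply skipped.

```python
kobi = '''
BAGG11 13917779236 10133016 16491979 64
BAGG15 30841323485 22747672 19201545 0
BAGG16 811970 0 811970 0
'''

def lines_with_errors(text):
    """Return the lines whose last whitespace-separated field is an integer > 0."""
    result = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and lines that do not end in a plain number.
        if fields and fields[-1].isdigit() and int(fields[-1]) > 0:
            result.append(line.strip())
    return result

for line in lines_with_errors(kobi):
    print(line)
```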
You might use a pattern to match BAGG and digits at the start and match a digit starting with 1-9 at the end.
^BAGG\d+[^\S\r\n].*[^\S\r\n][1-9]\d*$
If there should be 3 columns following, a bit more precise match could be using a quantifier {3} to match the number of "columns" in the middle.
^BAGG\d+(?:[^\S\r\n]+\d+){3}[^\S\r\n]+[1-9]\d*$
Explanation
^ Start of line
BAGG\d+ Match BAGG and 1+ digits
(?: Non capture group
[^\S\r\n]+\d+ Match 1+ whitespace chars without a newline followed by 1+ digits
){3} Close non capture group and repeat 3 times
[^\S\r\n]+ Match 1+ whitespace chars without newlines
[1-9]\d* Match a digit 1-9 followed by optional digits
$ End of line
For example
import re
kobi = '''
BAGG11 13917779236 10133016 16491979 64
BAGG15 30841323485 22747672 19201545 0
BAGG16 811970 0 811970 0
'''
err = re.findall (r'^BAGG\d+(?:[^\S\r\n]+\d+){3}[^\S\r\n]+[1-9]\d*$', kobi, re.MULTILINE)
print(err)
Output
['BAGG11 13917779236 10133016 16491979 64']
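If only the interface name and the error count are needed rather than the whole line, capture groups can be added around those two parts. This is a sketch built on the same pattern; the names ROW_RE and errors are assumptions for illustration.

```python
import re

kobi = '''
BAGG11 13917779236 10133016 16491979 64
BAGG15 30841323485 22747672 19201545 0
BAGG16 811970 0 811970 0
'''

# Group 1: interface name, group 2: non-zero value in the last column.
ROW_RE = re.compile(
    r'^(BAGG\d+)(?:[^\S\r\n]+\d+){3}[^\S\r\n]+([1-9]\d*)$',
    re.MULTILINE,
)

# With two capture groups, findall returns a list of (name, count) tuples.
errors = {name: int(count) for name, count in ROW_RE.findall(kobi)}
print(errors)
```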
I may have any of the following input strings -
i/p 1) Required 16 pcs
i/p 2) Required7 units
i/p 3) Requesting 12each
I wish to do some regular expression based substitution so that I have the following outputs for the above 3 strings -
o/p 1) Required 16 units
o/p 2) Required 7 units
o/p 3) Requesting 12 units
Basically, if my string contains pcs/units/each, and an integer before that, I want to do the following -
#1. replace the string "pcs" / "each" with "units" &
#2. add spaces before and after the integer value
I am using re in python 3.8. I guess I might have to use back referencing and numbered capturing groups, but not able to figure out how exactly do to make this work.
import re
txt = '''
Required 16 pcs
Required7 units
Requesting 12each
'''
print( re.sub(r'\s*(\d+)\s*(?:units|each|pcs)', r' \1 units', txt) )
Prints:
Required 16 units
Required 7 units
Requesting 12 units
import re
s = \
"""
Required 16 pcs
Required7 units
Requesting 12each
"""
s2 = re.sub(r'(\S*?)(\s*?)(\d*)(\s*?)(pcs|units|each)', r'\1 \3 units', s)
print(s2)
Explanation:
(\S*?) - capture group 1: zero or more non-whitespace characters, non-greedy
(\s*?) - capture groups 2 and 4: optional whitespace (zero or more, non-greedy) on either side of the number
(\d*) - capture group 3: the digit(s)
(pcs|units|each) - capture group 5: the unit word to be replaced
The replacement is group 1, a literal space, group 3, and the literal text ' units'. Groups 2 and 4 are dropped, which corrects missing or extra spaces around the number.
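Both patterns above would also fire on a longer word that merely starts with one of the unit names. A small variation, assuming that is not wanted, adds a word boundary after the unit. The names UNIT_RE and normalize_units are introduced here purely for illustration.

```python
import re

# \b after the unit prevents matches inside longer words such as "pcsx".
UNIT_RE = re.compile(r"\s*(\d+)\s*(?:pcs|units|each)\b")

def normalize_units(text):
    """Rewrite '<n> pcs|units|each' as ' <n> units' with single spaces."""
    return UNIT_RE.sub(r" \1 units", text)

for line in ["Required 16 pcs", "Required7 units", "Requesting 12each"]:
    print(normalize_units(line))
```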
I want to match groups at different positions in a line with a single pattern.
Notice that the 5 digits are at a different position in the line; this is my actual question.
import re
line = "Jul 6 14:02:08 computer.name jam_tag=psim[29187]: (UUID:006)"
pattern = r"(Jul\s\d\s\d+:+\d+:+\d+)"  # I couldn't work out how to match a second group at a different position: the 5 digits between the brackets
result = re.search(pattern, line)
print(result)  # output should be: Jul 6 14:02:08 29187
# my actual output: Jul 6 14:02:08
I still don't know how to match a group at a different position using one pattern only.
You may use
import re

def show_time_of_pid(line):
    pattern = r"^(Jul\s+\d+\s+[\d:]*\d).*?\[(\d+)]"
    result = re.search(pattern, line)
    return "{} pid:{}".format(result.group(1), result.group(2)) if result else ""
Regex details
^ - start of string
(Jul\s+\d+\s+[\d:]*\d) - Group 1: Jul, 1+ whitespaces, 1+ digits, 1+ whitespaces, zero or more digits or colons and then a digit
.*? - any 0+ chars, other than line break chars, as few as possible
\[(\d+)] - [, Group 2 capturing 1 or more digits, and then a ].
See Python demo:
print(show_time_of_pid("Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"))
# => Jul 6 14:01:23 pid:29440
print(show_time_of_pid("Jul 6 14:02:08 computer.name jam_tag=psim[29187]: (UUID:006)"))
# => Jul 6 14:02:08 pid:29187
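The pattern above is tied to the month Jul. If the log may contain other months, a variant with named groups could look like the sketch below. It assumes syslog-style three-letter month abbreviations and a timestamp of the form HH:MM:SS; the function name extract_time_and_pid and the Dec sample line are hypothetical.

```python
import re

# Named groups make it explicit which part of the line each value comes from.
LINE_RE = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2}\s+\d+\s+\d+:\d+:\d+)"  # e.g. "Jul 6 14:02:08"
    r".*?"                                            # anything, as few chars as possible
    r"\[(?P<pid>\d+)\]"                               # first "[digits]" in the line
)

def extract_time_and_pid(line):
    """Return (timestamp, pid) as strings, or None when the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return m.group("stamp"), m.group("pid")

print(extract_time_and_pid("Jul 6 14:02:08 computer.name jam_tag=psim[29187]: (UUID:006)"))
print(extract_time_and_pid("Dec 25 09:00:00 host sshd[123]: session opened"))
```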