I want to extract 'MME73KH/A' from the string below.
import re
pattern = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d]{4,}$")
findalled = pattern.findall('[최대10%혜택] Apple 에어팟 3세대 2021년형 (MME73KH/A) : 애플 공식 브랜드스토어')
print(findalled)
More than one example would have helped to clarify your requirements. As I read it, you want a token of at least 4 characters that contains at least one letter and one digit, and possibly a slash "/" (as in your example, MME73KH/A). This should do the trick:
import re
pattern = re.compile(r'[A-Za-z\d/]+[A-Za-z]\d[A-Za-z\d/]+|[A-Za-z\d/]+\d[A-Za-z][A-Za-z\d/]+')
findalled = pattern.findall('[최대10%혜택] Apple 에어팟 3세대 2021년형 (MME73KH/A) : 애플 공식 브랜드스토어')
print(findalled)
# output: ['MME73KH/A']
Decomposition of the regex:
pattern = re.compile(
    r'[A-Za-z\d/]+'  # one or more letters, digits or "/"
    r'[A-Za-z]'      # exactly one letter
    r'\d'            # exactly one digit
    r'[A-Za-z\d/]+'  # one or more letters, digits or "/" (so >= 4 chars in total)
    r'|'             # OR
    r'[A-Za-z\d/]+'  # one or more letters, digits or "/"
    r'\d'            # exactly one digit
    r'[A-Za-z]'      # exactly one letter
    r'[A-Za-z\d/]+'  # one or more letters, digits or "/" (so >= 4 chars in total)
)
This will retrieve strings like MME73KH/A, but also 32REGK2 or ABCD1234, while ignoring shorter strings or strings with only letters or only digits.
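Note that the regex above needs a letter and a digit sitting next to each other, with at least one more character on each side, so a token such as ABC1 is not matched. If the stated requirement (at least 4 characters, at least one letter, at least one digit) should be applied literally, a two-step approach is one option. The following is only a sketch: the names TOKEN, find_codes and min_len are made up for illustration, and it assumes codes consist only of ASCII letters, digits and "/".

```python
import re

# Candidate tokens: runs of ASCII letters, digits or "/"
TOKEN = re.compile(r"[A-Za-z0-9/]+")

def find_codes(text, min_len=4):
    """Return tokens of at least min_len chars that contain a letter and a digit."""
    return [
        tok
        for tok in TOKEN.findall(text)
        if len(tok) >= min_len
        and re.search(r"[A-Za-z]", tok)
        and re.search(r"[0-9]", tok)
    ]

title = '[최대10%혜택] Apple 에어팟 3세대 2021년형 (MME73KH/A) : 애플 공식 브랜드스토어'
print(find_codes(title))                     # ['MME73KH/A']
print(find_codes('ABC1 32REGK2 abcd 1234'))  # ['ABC1', '32REGK2']
```

Splitting the work into "find candidate tokens" and "filter them" keeps each check readable, at the cost of not being a single pattern.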
I'm trying to extract 'valid' numbers from text that may or may not contain thousands or millions separators and decimals. The problem is that the separator is sometimes ',' and sometimes '.', and the same applies to the decimal mark. In addition to the \d{3} condition, I should check whether a later ',' or '.' occurs in order to detect automatically whether a character is a decimal or a thousands separator.
Another problem is that the text contains dates in the format 'dd.mm.yyyy' or 'mm.dd.yy', which must not be matched.
The goal is to convert 'valid' numbers to float: I need to make sure the match is not a date, then remove the millions/thousands separators, and finally replace ',' with '.' when the decimal separator is ','.
I have read other great answers, such as "Regular expression to match numbers with or without commas and decimals in text", which solve more specific problems. I would be happy with something robust (it doesn't need to be a single regex).
Here's what I've tried so far but the problem is well above my regex skills:
p = r'\d+(?:[,.]\d{3})*(?:[.,]\d*)'
for s in ['blabla 1,25 10.587.256,25 euros', '6.010,12', '6.010', '6,010', '6,010.12', '6010,124', '05.12.2018', '12.05.18']:
    print(s, re.findall(p, s, re.IGNORECASE))
You can use
import re
p = r'\b\d{1,2}\.\d{1,2}\.\d{2}(?:\d{2})?\b|\b(?<!\d[.,])(\d{1,3}(?=([.,])?)(?:\2\d{3})*|\d+)(?:(?(2)(?!\2))[.,](\d+))?\b(?![,.]\d)'
def postprocess(x):
    if x.group(3):
        return f"{x.group(1).replace(',','').replace('.','')}.{x.group(3)}"
    elif x.group(2):
        return f"{x.group(1).replace(',','').replace('.','')}"
    else:
        return None
texts = ['blabla 1,25 10.587.256,25 euros', '6.010,12', '6.010', '6,010', '6,010.12', '6010,124', '05.12.2018', '12.05.18']
for s in texts:
    print(s, '=>', list(filter(None, [postprocess(x) for x in re.finditer(p, s)])) )
Output:
blabla 1,25 10.587.256,25 euros => ['1.25', '10587256.25']
6.010,12 => ['6010.12']
6.010 => ['6010']
6,010 => ['6010']
6,010.12 => ['6010.12']
6010,124 => ['6010.124']
05.12.2018 => []
12.05.18 => []
The regex is
\b\d{1,2}\.\d{1,2}\.\d{2}(?:\d{2})?\b|\b(?<!\d[.,])(\d{1,3}(?=([.,])?)(?:\2\d{3})*|\d+)(?:(?(2)(?!\2))[.,](\d+))?\b(?![,.]\d)
Details:
\b\d{1,2}\.\d{1,2}\.\d{2}(?:\d{2})?\b| - matches a whole word, 1-2 digits, ., 1-2 digits, ., 2 or 4 digits (this match will be skipped)
\b - a word boundary
(?<!\d[.,]) - a negative lookbehind failing the match if there is a digit and a . or , immediately on the left
(\d{1,3}(?=([.,])?)(?:\2\d{3})*|\d+) - Group 1:
    \d{1,3} - one, two or three digits
    (?=([.,])?) - a positive lookahead that optionally captures a . or , immediately on the right into Group 2
    (?:\2\d{3})* - zero or more sequences of Group 2 value and then any three digits
    | - or
    \d+ - one or more digits
(?:(?(2)(?!\2))[.,](\d+))? - an optional sequence of
    (?(2)(?!\2)) - if Group 2 matched, the next char cannot be Group 2 value
    [.,] - a comma or dot
    (\d+) - Group 3: one or more digits
\b - a word boundary
(?![,.]\d) - a negative lookahead failing the match if there is a , or . and a digit immediately on the right.
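The postprocess function returns None when neither Group 3 nor Group 2 matched, which is the case for the date alternative; otherwise it returns the number with commas and dots removed from the integer part. Note that an integer with no separators at all (e.g. 6010) also leaves Groups 2 and 3 unset, so it is filtered out too; check x.group(1) instead if such integers should be kept.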
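If the numeric tokens have already been isolated (for example by splitting on whitespace), the separator rules can also be written out without one large regex. This is a different technique from the pattern above, shown only as a sketch: to_float is a hypothetical helper, and it assumes the same conventions as the expected output, namely that a single separator followed by a clean group of three digits (6.010, 6,010) is a thousands separator.

```python
import re

def to_float(token):
    """Convert one numeric token such as '10.587.256,25' to a float.

    Raises ValueError for tokens that look like dates (e.g. '05.12.2018').
    """
    seps = [c for c in token if c in ".,"]
    if not seps:
        return float(token)
    if len(set(seps)) == 2:
        # Both separators occur: the last one is the decimal mark
        dec = seps[-1]
        thousands = "," if dec == "." else "."
        return float(token.replace(thousands, "").replace(dec, "."))
    sep = seps[0]
    if re.fullmatch(r"\d{1,3}(?:[.,]\d{3})+", token):
        # Clean groups of three digits: treat the separator as thousands
        return float(token.replace(sep, ""))
    if len(seps) == 1:
        # A single separator that is not a thousands separator: decimal mark
        return float(token.replace(sep, "."))
    raise ValueError(f"not a valid number: {token!r}")

for t in ['1,25', '10.587.256,25', '6.010,12', '6.010', '6,010', '6,010.12', '6010,124']:
    print(t, '=>', to_float(t))
```

Tokens such as '05.12.2018' or '12.05.18' repeat one separator without forming groups of three digits, so they end in the ValueError branch and can be skipped by the caller.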
In a pandas dataframe I have rows with contents in the following format:
1) abc123-Target 4-ufs
2) abc123-target4-ufs
3) geo.4
4) j123T4
All of these should be simply: target 4
So far my cleaning procedure is as follows:
df["point_id"] = df["point_id"].str.lower()
df["point_id"] = df['point_id'].str.replace(r'^.*?(?=target)', '', regex=True)
This returns:
1) target 4-ufs
2) target4-ufs
3) geo.14
4) geo.2
5) j123T4
What I believe I need is:
a. Remove anything after the last number in the string, this solves 1
b. If 'target' does not have a space after it add a space, this with the above solves 2
c. If the string ends in a point and a number of any length remove everything before the point (incl. point) and replace with 'target ', this solves 3 and 4
d. If the string ends with a 't' followed by a number of any length remove everything before 't' and replace with 'target ', this solves 5
I'm looking at regex and re but the following is not having effect (add space before the last number)
df["point_id"] = re.sub(r'\D+$', '', df["point_id"])
Reading the rules, you might use 2 capture groups and check for the group values:
\btarget\s*(\d+)|.*[t.](\d+)$
\btarget\s*(\d+) Match target, optional whitespace chars and capture 1+ digits in group 1
| Or
.*[t.] Match 0+ characters followed by either t or a .
(\d+)$ Capture 1+ digits in group 2 at the end of the string
Python example:
import re
import pandas as pd
pattern = r"\btarget\s*(\d+)|.*[t.](\d+)$"
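strings = [
    "abc123-Target 4-ufs",
    "abc123-target4-ufs",
    "geo.4",
    "j123T4"
]
df = pd.DataFrame(strings, columns=["point_id"])
def change(s):
    m = re.search(pattern, s, re.IGNORECASE)
    # Leave the value unchanged when neither alternative matches
    if m is None:
        return s
    # Group 2 is set by the trailing "t"/"." branch, group 1 by the "target" branch
    return "target " + (m.group(2) if m.group(2) else m.group(1))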
df["point_id"] = df["point_id"].apply(change)
print(df)
Output
point_id
0 target 4
1 target 4
2 target 4
3 target 4
You can use
df = pd.DataFrame({'point_id':['abc123-Target 4-ufs','abc123-target4-ufs','geo.4','j123T4']})
df['point_id'] = df['point_id'].str.replace(r'(?i).*Target\s*(\d+).*', r'target \1', regex=True)
df.loc[df['point_id'].str.contains(r'(?i)\w[.t]\d+$'), 'point_id'] = 'target 4'
# point_id
# 0 target 4
# 1 target 4
# 2 target 4
# 3 target 4
The first regex is (?i).*Target\s*(\d+).*:
(?i) - case insensitive matching
.* - any 0+ chars other than line break chars, as many as possible
Target\s*(\d+) - Target, zero or more whitespaces, and one or more digits captured into Group 1
.* - any 0+ chars other than line break chars, as many as possible
The second regex matches
(?i) - case insensitive matching
\w - a word char, then
[.t] - a . or t and then
\d+$ - one or more digits at the end of string.
The second regex is used as a mask, and the values in the point_id column are set to target 4 whenever the pattern matches the regex.
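Because the mask approach writes the literal value target 4, a row such as geo.14 would also become target 4. If the trailing number should be kept instead, both patterns can capture it. The following is a sketch only: normalize is a hypothetical helper name, and it uses plain re so that it can be tried without pandas.

```python
import re

def normalize(point_id):
    """Return 'target <n>' using the number found by either pattern."""
    m = (re.search(r"(?i)target\s*(\d+)", point_id)
         or re.search(r"(?i)\w[.t](\d+)$", point_id))
    # Leave values that match neither pattern untouched
    return f"target {m.group(1)}" if m else point_id

for value in ["abc123-Target 4-ufs", "abc123-target4-ufs", "geo.14", "j123T4"]:
    print(value, "=>", normalize(value))
```

With a DataFrame, the same function could be applied with df['point_id'].apply(normalize).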
I want to match different groups at different positions with a single pattern.
Notice that the last 5 digits are in a different position; this is my actual question.
import re
line = "Jul 6 14:02:08 computer.name jam_tag=psim[29187]: (UUID:006)"
pattern = r"(Jul\s\d\s\d+:+\d+:+\d+)" # but I couldn't work out how to match another group at a different position, namely the 5 digits between brackets
result = re.search(pattern, line)
print(result) # output should be: Jul 6 14:02:08 29187
# my actual output: Jul 6 14:02:08. I still don't know how to match a group at a different position using one pattern only
You may use
import re

def show_time_of_pid(line):
    pattern = r"^(Jul\s+\d+\s+[\d:]*\d).*?\[(\d+)]"
    result = re.search(pattern, line)
    return "{} pid:{}".format(result.group(1),result.group(2)) if result else ""
Regex details
^ - start of string
(Jul\s+\d+\s+[\d:]*\d) - Group 1: Jul, 1+ whitespaces, 1+ digits, 1+ whitespaces, zero or more digits or colons and then a digit
.*? - any 0+ chars, other than line break chars, as few as possible
\[(\d+)] - [, Group 2 capturing 1 or more digits, and then a ].
Example usage:
print(show_time_of_pid("Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"))
# => Jul 6 14:01:23 pid:29440
print(show_time_of_pid("Jul 6 14:02:08 computer.name jam_tag=psim[29187]: (UUID:006)"))
# => Jul 6 14:02:08 pid:29187
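The same idea can be written with named groups, which makes it clearer which part of the line each group refers to. This is a sketch under assumptions not stated in the question: LOG_RE and parse_line are illustrative names, and the month is generalized to any capitalized three-letter abbreviation rather than only Jul.

```python
import re

# Timestamp at the start of the line, then the first bracketed number after it
LOG_RE = re.compile(
    r"^(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})"
    r".*?\[(?P<pid>\d+)\]"
)

def parse_line(line):
    """Return (timestamp, pid) for a syslog-style line, or None if it does not match."""
    m = LOG_RE.search(line)
    return (m.group("timestamp"), int(m.group("pid"))) if m else None

print(parse_line("Jul 6 14:02:08 computer.name jam_tag=psim[29187]: (UUID:006)"))
# ('Jul 6 14:02:08', 29187)
```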
For example, If my string was 'HelloWorld'
I want the output to be ######orld
My Code:
myString = 'ThisIsAString'
hashedString = myString.replace(myString[:-4], '#')
print(hashedString)
Output >> #ring
I expected the output to have just one # symbol since it is replacing argument 1 with argument 2.
Can anyone help me with this?
You could repeat '#' (word length - 4) times and then append the last four characters using string slicing.
myString = 'HelloWorld'
print('#' * (len(myString) - 4) + myString[-4:])
myString = 'ThisIsAString'
print('#' * (len(myString) - 4) + myString[-4:])
string.replace(old, new) replaces all instances of old with new. So the code you provided is actually replacing the entire beginning of the string with a single pound sign.
You will also notice that input like abcdabcd will give the output ##, since you are replacing all 'abcd' substrings.
Using replace, you could do
hashes = '#' * len(string[:-4])
hashedString = string.replace(string[:-4], hashes, 1)
Note the string multiplication to get the right number of pound symbols, and the 1 passed to replace, which tells it to replace only the first occurrence it finds.
A better method would be to not use replace at all:
hashes = '#' * (len(string) - 4)
leftover = string[-4:]
hashedString = hashes + leftover
This time we do the same work with getting the pound sign string, but instead of replacing we just take the last 4 characters and add them after the pound signs.
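The slicing approach can be wrapped in a small helper that also copes with strings shorter than four characters. This is a sketch; the function name mask and its parameters visible and fill are assumptions made for illustration.

```python
def mask(text, visible=4, fill="#"):
    """Replace all but the last `visible` characters of text with `fill`."""
    hidden = max(len(text) - visible, 0)  # never negative for short strings
    return fill * hidden + text[hidden:]

print(mask("HelloWorld"))     # ######orld
print(mask("ThisIsAString"))  # #########ring
print(mask("abcdabcd"))       # ####abcd
print(mask("abc"))            # abc
```

Slicing from the index hidden rather than from -visible keeps the result correct when the string is shorter than visible or when visible is 0.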
I have a list containing string patterns for the digits 0-3. I am trying to print them on the same line, so that print(digits[1]+col+digits[2]+col+digits[3]) prints '1 2 3' using the # pattern strings at the respective list indexes, but I can only get the number patterns printed on their own, one below the other.
# Create strings for each number 0-3 and store in digits list.
zero = '#'*3+'\n'+'#'+' '+'#'+'\n'+'#'+' '+'#'+'\n'+'#'+' '+'#'+'\n'+'#'*3
one = '#\n'.rjust(4)*6
two = '#'*3+'\n'+'#'.rjust(3)+'\n'+'#'*3+'\n'+'#'.ljust(3)+'\n'+'#'*3
three = '#'*3+'\n'+'#'.rjust(3)+'\n'+'#'*3+'\n'+'#'.rjust(3)+'\n'+'#'*3
digits = [zero, one, two, three]
col = '\n'.ljust(1)*6 # A divider column between each printed digit.
print(digits[1]+col+digits[2]+col+digits[3],end='')
The result of the above code.
One way to solve this is to transpose the digits. Right now each entry in the digits list holds a complete digit, but if we split each digit into its horizontal rows, we can print the same row of every digit on one line.
I think it is better shown in code: https://repl.it/#pavanskipo/DirectTriangularSlash
# Split each digit into its horizontal rows
digits_rev = [digits[0].split("\n"),
              digits[1].split("\n"),
              digits[2].split("\n"),
              digits[3].split("\n")]
# Print row by row, up to the height of the shortest pattern
for i in range(min(len(rows) for rows in digits_rev)):
    print(digits_rev[0][i] + '\t' +
          digits_rev[1][i] + '\t' +
          digits_rev[2][i] + '\t' +
          digits_rev[3][i])
Click on the link and hit run, and let me know if it works.
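The row-by-row idea generalizes to any number of digits with zip. The sketch below is one possible way to do it, not the code behind the link: side_by_side and gap are illustrative names, and the three patterns are simplified 5-row stand-ins for the ones in the question.

```python
from itertools import zip_longest

def side_by_side(glyphs, gap="  "):
    """Join multi-line glyph strings row by row into one multi-line string."""
    columns = [g.splitlines() for g in glyphs]
    # Pad every row to the width of its glyph so the columns stay aligned
    widths = [max((len(row) for row in col), default=0) for col in columns]
    lines = []
    for rows in zip_longest(*columns, fillvalue=""):
        padded = (row.ljust(width) for row, width in zip(rows, widths))
        lines.append(gap.join(padded).rstrip())
    return "\n".join(lines)

one = "\n".join(["  #"] * 5)
two = "\n".join(["###", "  #", "###", "#  ", "###"])
three = "\n".join(["###", "  #", "###", "  #", "###"])
print(side_by_side([one, two, three]))
```

zip_longest is used instead of zip so that a taller pattern is not cut off; shorter patterns are padded with blank rows. splitlines also ignores a trailing newline, so a pattern that ends in '\n' does not produce an extra empty row.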