Regex to replace text between two alphanumeric digits

I have a timestamp like this: 2022-04-13T00':'21':'45Z','source':'natwest'.
I want to replace each ':' in the time part only: the one that follows T and two digits, and the one that sits between the next two digits and the two digits that precede Z.
Each of these ':' should be replaced with AND.
I have tried this regex:
'.*'
But it also replaces the ':' that comes after source, which I don't want.

Since the pattern description is quite strict, you can substitute both ':' occurrences in one go, using lookarounds for the surrounding context and a capture group to put back the two digits that sit between the two occurrences:
import re
s = "2022-04-13T00':'21':'45Z','source':'natwest'"
result = re.sub(r"(?<=T\d\d)':'(\d\d)':'(?=\d\dZ)", r"AND\1AND", s)
print(result)  # 2022-04-13T00AND21AND45Z','source':'natwest'
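The substitution above could be wrapped in a small helper, sketched here under the assumption that the time part always has the T, two digits, ':', two digits, ':', two digits, Z shape from the question. The names `TIME_SEP` and `fix_time_separators` are hypothetical and not part of the original answer.

```python
import re

# Assumed shape: T00':'21':'45Z (two digits, separator, two digits, separator, two digits, Z)
TIME_SEP = re.compile(r"(?<=T\d\d)':'(\d\d)':'(?=\d\dZ)")

def fix_time_separators(text):
    """Replace both ':' separators inside the time part with AND."""
    # \1 puts back the two digits captured between the two separators
    return TIME_SEP.sub(r"AND\1AND", text)

s = "2022-04-13T00':'21':'45Z','source':'natwest'"
fixed = fix_time_separators(s)
print(fixed)
```

Because the lookbehind and lookahead pin the match to the digits around T and Z, the ':' after source is left untouched.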

Related

Replace characters other than A-Za-z0-9 and decimal values with space using regex

I want to keep alphanumeric characters and also the decimal numbers present in my text string and replace all other characters with space.
For alphanumeric characters, I can use
def clean_up(text):
    return re.sub(r"[^A-Za-z0-9]", " ", text)
But this will replace all . whether they are between two digits or a fullstop or at random locations. I just want to keep the . if they come between two digits.
I thought of [^((A-Za-z0-9)|(\d\.\d))], but it doesn't seem to work.
You can match and capture the patterns you need to keep and just match any char otherwise. Then, using the lambda expression as the replacement argument, you can either replace with the captured substring or a space.
The patterns are:
[+-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)? - matches any number
[^\W_] - matches any alphanumeric, Unicode included
. - matches any char (with re.S or re.DOTALL).
The solution looks like
pattern = re.compile(r'([+-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?|[^\W_])|.', re.DOTALL)
def clean_up(text):
    return pattern.sub(lambda x: x.group(1) or " ", text)
See the online demo:
import re
pattern = re.compile(r'([+-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?|[^\W_])|.', re.DOTALL)
def clean_up(text):
    return pattern.sub(lambda x: x.group(1) or " ", text)
print( clean_up("+1.2E02 ANT01-TEXT_HERE!") )
Output: +1.2E02 ANT01 TEXT HERE
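You can use a negated character class with a negative lookahead:
[^A-Za-z0-9](?!\d)
Note that this keeps any non-alphanumeric character that is directly followed by a digit, not only a . between two digits.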
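A quick check of how the negated class with the negative lookahead behaves; this is only an illustrative sketch, and the sample string is made up for the demonstration.

```python
import re

# Replace every non-alphanumeric character unless a digit follows it
pattern = re.compile(r"[^A-Za-z0-9](?!\d)")

sample = "3.4 a.b #5"  # hypothetical input, not from the question
cleaned = pattern.sub(" ", sample)
print(cleaned)
```

The . in 3.4 survives as intended, but so does the # in #5, because the lookahead only tests the character after the match.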
[...] is a character set, which matches any one of the literal characters inside it, so [^((A-Za-z0-9)|(\d\.\d))] means "any character that is not in A-Za-z0-9, not a digit, and not one of the literals (, |, . and )".
You may try [^a-zA-Z0-9.]|(?<!\d)\.|\.(?!\d):
[^a-zA-Z0-9.] matches any non-alphanumeric character except .
(?<!\d)\. matches a . that is not preceded by a digit
\.(?!\d) matches a . that is not followed by a digit
Test run:
print(re.sub(r"[^a-zA-Z0-9.]|(?<!\d)\.|\.(?!\d)", " ", ".aBc.1e2#3.4F5$6.gHi."))
Output:
aBc 1e2 3.4F5 6 gHi

Regular expression to extract text between multiple hyphens

I have a string in the following format
----------some text-------------
How do I extract "some text" without the hyphens?
I have tried with this but it matches the whole string
(?<=-).*(?=-)
The pattern matches the whole line except the first and last hyphen: the lookarounds only require one hyphen on each side, and . also matches a hyphen, so .* runs across all the hyphens in between.
You can keep using the assertions, and match any char except a hyphen using [^-]+ which is a negated character class.
(?<=-)[^-]+(?=-)
Note: if you also want to prevent matching a newline you can use (?<=-)[^-\r\n]+(?=-)
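In Python, the lookaround version could be used with `re.search`, roughly as follows. This is a minimal sketch using the sample string from the question.

```python
import re

s = "----------some text-------------"

# Lookbehind and lookahead require a hyphen on each side without consuming it;
# [^-]+ then matches the run of non-hyphen characters in between
m = re.search(r"(?<=-)[^-]+(?=-)", s)
inner = m.group() if m else None
print(inner)
```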
With the samples shown, try the following regex:
^-+([^-]*)-+$
Explanation: ^-+ matches one or more dashes at the start, the single capturing group ([^-]*) captures everything up to the next -, and -+$ matches the remaining dashes up to the end of the string.
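A short sketch of the capture-group version in Python, assuming the text of interest is read from group 1. The helper name `between_hyphens` is hypothetical.

```python
import re

def between_hyphens(s):
    """Return the text enclosed by leading and trailing hyphens, or None."""
    m = re.match(r"^-+([^-]*)-+$", s)
    return m.group(1) if m else None

print(between_hyphens("----------some text-------------"))
print(between_hyphens("--a-b--"))  # an inner hyphen makes the whole match fail
```

Because the group uses [^-]*, a string with a hyphen inside the text does not match at all, which may or may not be what you want.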
You are using Python, right? Then you do not need regex.
Use
s = "----------some text-------------"
print(s.strip("-"))
Results: some text.
With regex, the same can be achieved using
re.sub(r'^-+|-+$', '', s)
EXPLANATION
^     the beginning of the string
-+    '-' (1 or more times, matching the most amount possible)
|     OR
-+    '-' (1 or more times, matching the most amount possible)
$     before an optional \n, and the end of the string

Regex to find compensations in text

I need to find mentions of compensations in emails. I am new to regex. Please see below the approach I am using.
sample_text = "Rate – $115k/yr. or $55/hr. - $60/hr"
My Python code to find this:
import re
PATTERN = r'((\$|\£) [0-9]*)|((\$|\£)[0-9]*)'
print(re.findall(PATTERN,sample_text))
The matches I am getting are
[('', '', '$115', '$'), ('', '', '$55', '$'), ('', '', '$60', '$')]
Expected match
["$115k/yr","$55/hr","$60/hr"]
Also, the $ sign can be written as USD. How do I handle this in the same regex?
You can use
[$£]\d+[^.\s]*
[$£] Match either $ or £
\d+ Match 1+ digits
[^.\s]* Repeat 0+ times matching any char except . or a whitespace
import re
sample_text = "Rate – $115k/yr. or $55/hr. - $60/hr"
PATTERN = r'[$£]\d+[^.\s]*'
print(re.findall(PATTERN,sample_text))
Output
['$115k/yr', '$55/hr', '$60/hr']
If there must be a / present, you might also use
[$£]\d+[^\s/.]*/\w+
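To illustrate the variant that requires a /, here is a small sketch. The extra "Bonus $500 paid" fragment is a made-up addition to the sample text, included to show an amount without a slash being skipped.

```python
import re

# Sample text from the question plus a hypothetical amount without a "/"
sample_text = "Rate – $115k/yr. or $55/hr. - $60/hr. Bonus $500 paid"

# Currency symbol, digits, optional suffix, then a mandatory "/" and a word
PATTERN = r'[$£]\d+[^\s/.]*/\w+'
matches = re.findall(PATTERN, sample_text)
print(matches)
```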
You can have something like:
[$£]\d+[^.]+
>>> PATTERN = r'[$£]\d+[^.]+'
>>> print(re.findall(PATTERN,sample_text))
['$115k/yr', '$55/hr', '$60/hr']
[$£] matches "$" or a "£"
\d+ matches one or more digits
[^.]+ matches everything that's not a "."
The parentheses in your regex are capturing groups, and re.findall reports the contents of each group instead of the whole match. You can use non-capturing parentheses (?:...) to avoid this, but of course, your expression can be rephrased to not have any parentheses at all:
PATTERN = r'[$£]\s*\d+'
Notice also how I changed the last quantifier to a + -- your attempt would also find isolated currency symbols with no numbers after them.
To point out the hopefully obvious, \s matches whitespace and \s* matches an arbitrary run of whitespace, including none at all; and \d matches a digit.
If you want to allow some text after the extracted match, add something like (?:/\w+)? to allow for a slash and one single word token as an optional expression after the main match. (Maybe adorn that with \s* on both sides of the slash, too.)
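One possible way to combine these ideas and also accept USD, as asked in the question, is sketched below. The `USD` alternative, the optional `[kK]` suffix, and the sample string are assumptions made for illustration, not part of the original answers.

```python
import re

# Hypothetical sample mixing "$", "USD" and "£", with and without spaces
sample_text = "Rate – $115k/yr. or USD 55/hr. - £ 60 / hr"

PATTERN = re.compile(
    r'(?:[$£]|USD)'      # currency symbol or the literal text USD (non-capturing)
    r'\s*\d+'            # optional whitespace, then one or more digits
    r'[kK]?'             # optional "k" suffix (assumption)
    r'(?:\s*/\s*\w+)?'   # optional "/unit", whitespace allowed around the slash
)

matches = PATTERN.findall(sample_text)
print(matches)
```

Since the pattern contains no capturing groups, `findall` returns the full matches. Note that a bare USD alternative would also match inside a longer word ending in USD; adding a word boundary before it may be worth considering.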

How to write a better regex in python?

I have two scenarios to match. The length should be exactly 16 characters, not counting the '-' separators.
The pattern should contain A-F, a-f, 0-9 and, in the first case, '-'.
AC-DE-48-23-45-67-AB-CD
ACDE48234567ABCD
I have tried r'^([0-9A-Fa-f]{16})$|(([0-9A-Fa-f]{2}\-){7}[0-9A-Fa-f]{2})$', which works fine, but I am looking for a better expression.
You can simplify the regex by considering the string to be a group of two hex digits followed by an optional -, followed by 6 similar groups (i.e. if the first group had a -, the subsequent ones must too), followed by a group of 2 hex digits:
^[0-9A-Fa-f]{2}(-?)([0-9A-Fa-f]{2}\1){6}[0-9A-Fa-f]{2}$
Use of the re.I flag allows you to remove the a-f from the character classes:
^[0-9A-F]{2}(-?)([0-9A-F]{2}\1){6}[0-9A-F]{2}$
You can also simplify slightly further by replacing 0-9 by \d in the character classes (although personally I find 0-9 easier to read):
^[\dA-F]{2}(-?)([\dA-F]{2}\1){6}[\dA-F]{2}$
Sample python code:
import re
strs = ['AC-DE-48-23-45-67-AB-CD',
        'ACDE48234567ABCD',
        'AC-DE48-23-45-67-AB-CD',
        'ACDE48234567ABC',
        'ACDE48234567ABCDE']
for s in strs:
    print(s + (' matched' if re.match(r'^[0-9A-F]{2}(-?)([0-9A-F]{2}\1){6}[0-9A-F]{2}$', s, re.I) else ' didn\'t match'))
Output
AC-DE-48-23-45-67-AB-CD matched
ACDE48234567ABCD matched
AC-DE48-23-45-67-AB-CD didn't match
ACDE48234567ABC didn't match
ACDE48234567ABCDE didn't match
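The same pattern could also be precompiled and used with `re.fullmatch`, which makes the ^ and $ anchors unnecessary. This is a sketch only; the names `HEX_ID` and `is_valid_id` are hypothetical.

```python
import re

# (-?) captures the optional separator; \1 forces every later pair to use the same one
HEX_ID = re.compile(r'[0-9A-F]{2}(-?)(?:[0-9A-F]{2}\1){6}[0-9A-F]{2}', re.I)

def is_valid_id(s):
    """True if s is 16 hex digits, either unseparated or as 8 pairs joined by '-'."""
    return HEX_ID.fullmatch(s) is not None

for s in ['AC-DE-48-23-45-67-AB-CD', 'ACDE48234567ABCD', 'AC-DE48-23-45-67-AB-CD']:
    print(s, is_valid_id(s))
```

Unlike a pattern ending in $, `fullmatch` also rejects a string with a trailing newline.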

What is the right pattern for numbers with a negative sign?

I have a string of numbers separated by spaces and I need to store them in a table, but for some reason the negative sign is not being recognized.
cord = "-53 2 -21"
map = {}
for num in cord:gmatch("%w+") do
  table.insert(map, num)
end
map[1], map[2], map[3] = tonumber(map[1]), tonumber(map[2]), tonumber(map[3])
print(map[1])
print(map[2])
print(map[3])
This is the output I'm getting:
53
2
21
I think the problem is with the pattern I'm using, what should I change?
The pattern "%w" matches alphanumeric characters, which do not include -. Use this pattern instead:
"%-?%w+"
or better:
"%-?%d+"
since numbers are all you need.
%w+ does not attempt to match only numbers, so try %S+ to get all "words", that is, all sequences of non-whitespace characters.
If you want to match only numbers, try %-?%d+. Note the optional minus sign in the pattern. Note also that you must escape the minus sign.
