Remove comma from substring in Python - python-3.x

The string is the following:
s = 'AUDC,AUDIOCODES COM,+55,27.49,26.47,"$1,455.85",($56.10),($56.10),-3.71%'
I would like the comma inside this substring "$1,455.85" to be removed but not the other commas.
I tried this but failed:
import re
pattern = r'$\d(,)'
re.sub(pattern, '', s)
Why doesn't this work?

You need a positive lookbehind assertion, i.e., match a comma if it is preceded by a $ (note that $ needs to be escaped as \$) followed by a digit (\d). Try:
>>> s = 'AUDC,AUDIOCODES COM,+55,27.49,26.47,"$1,455.85",($56.10),($56.10),-3.71%'
>>> pattern = r'(?<=\$\d),'
>>> re.sub(pattern, '', s)
'AUDC,AUDIOCODES COM,+55,27.49,26.47,"$1455.85",($56.10),($56.10),-3.71%'
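To see why the original attempt fails, here is a minimal sketch (the variable names are only for illustration). An unescaped $ is an end-of-string anchor, so the pattern can never match. Even with the $ escaped, re.sub would delete the whole match rather than just the comma.

```python
import re

s = 'AUDC,AUDIOCODES COM,+55,27.49,26.47,"$1,455.85",($56.10),($56.10),-3.71%'

# Unescaped $ anchors at the end of the string, so a digit can never follow it.
unescaped = re.search(r'$\d(,)', s)
print(unescaped)  # None

# Escaping $ makes the pattern match, but sub() then replaces the whole match "$1,".
escaped = re.sub(r'\$\d(,)', '', s)
print(escaped)

# A lookbehind keeps "$1" out of the match, so only the comma is removed.
lookbehind = re.sub(r'(?<=\$\d),', '', s)
print(lookbehind)
```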

import re
pattern = r"(\$\d+),"
s = 'AUDC,AUDIOCODES COM,+55,27.49,26.47,"$1,455.85",($56.10),($56.10),-3.71%'
print(s)
s = re.sub(pattern, r'\1', s)
print(s)
Output:
AUDC,AUDIOCODES COM,+55,27.49,26.47,"$1,455.85",($56.10),($56.10),-3.71%
AUDC,AUDIOCODES COM,+55,27.49,26.47,"$1455.85",($56.10),($56.10),-3.71%
Note that this removes only the first comma, so it does not fully handle an amount such as "$1,455,789.85".
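For amounts with more than one thousands separator, one possible approach is to match the whole quoted amount and strip its commas in a replacement function. This is only a sketch, assuming the amount is always wrapped in double quotes as in the sample; the helper name strip_amount_commas is hypothetical.

```python
import re

def strip_amount_commas(line):
    # Match a quoted dollar amount such as "$1,455,789.85" and drop its commas.
    return re.sub(r'"\$[\d,.]+"', lambda m: m.group().replace(',', ''), line)

s = 'AUDC,AUDIOCODES COM,+55,"$1,455,789.85",($56.10),-3.71%'
print(strip_amount_commas(s))
# AUDC,AUDIOCODES COM,+55,"$1455789.85",($56.10),-3.71%
```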

Related

regex in python: Can you filter string by deliminator with exceptions?

I am trying to parse a long string of 'objects' enclosed by quotes and delimited by commas. Example:
s='"12345","X","description of x","X,Y",,,"345355"'
output=['"12345"','"X"','"description of x"','"X,Y"','','','"345355"']
I am using split to break the string on commas:
s='"12345","X","description of x","X,Y",,,"345355"'
s.split(',')
This almost works, but for the segment ...,"X,Y",... the split breaks the quoted data into "X and Y". I need the split to ignore commas inside quotes.
Is there a way to split on commas except those inside quotes?
I tried using a regex, but it ignores the ...,,,... in the data because blank fields in the file I'm parsing have no quotes. I am not an expert with regex; the sample I used is from "Python split string on quotes". I do understand what this example is doing, but I'm not sure how to modify it to parse data that is not enclosed by quotes.
Thanks!
Split by " (quote) instead of by , (comma). This splits the string into a list with extra separator elements, which you can then filter out. Note that this also drops the empty fields:
s='"12345","X","description of x","X,Y",,,"345355"'
temp = s.split('"')
print(temp)
#> ['', '12345', ',', 'X', ',', 'description of x', ',', 'X,Y', ',,,', '345355', '']
values_to_remove = ['', ',', ',,,']
result = [val for val in temp if val not in values_to_remove]
print(result)
#> ['12345', 'X', 'description of x', 'X,Y', '345355']
This should work:
In [1]: import re
In [2]: s = '"12345","X","description of x","X,Y",,,"345355"'
In [3]: pattern = r"(?<=[\",]),(?=[\",])"
In [4]: re.split(pattern, s)
Out[4]: ['"12345"', '"X"', '"description of x"', '"X,Y"', '', '', '"345355"']
Explanation:
(?<=...) is a "positive lookbehind assertion". It causes your pattern (in this case, just a comma, ",") to match commas in the string only if they are preceded by the pattern given by .... Here, ... is [\",], which means "either a quotation mark or a comma".
(?=...) is a "positive lookahead assertion". It causes your pattern to match commas in the string only if they are followed by the pattern specified as ... (again, [\",]: either a quotation mark or a comma).
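Since both of these assertions must be satisfied for the pattern to match, it will still work correctly if one of your 'objects' begins or ends with a single comma. A field that is itself just a comma, or that begins or ends with two commas, would still be split.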
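As an alternative to a hand-written regex, the standard library's csv module already understands quoted fields. A minimal sketch, assuming it is acceptable for the surrounding quotes to be stripped from each field (unlike the desired output in the question, which keeps them):

```python
import csv

s = '"12345","X","description of x","X,Y",,,"345355"'

# csv.reader expects an iterable of lines, so wrap the single string in a list.
fields = next(csv.reader([s]))
print(fields)
# ['12345', 'X', 'description of x', 'X,Y', '', '', '345355']
```

The empty fields are preserved, and a comma inside a quoted field is never treated as a separator.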
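You can also walk through the string manually, dropping the quotes and joining the fields with ", " before splitting:
s='"12345","X","description of x","X,Y",,,"345355"'
n = ''
i = 0
while i < len(s):
    if s[i] == '"':
        # skip the opening quote, then copy everything up to the closing quote
        i+=1
        while i<len(s) and s[i] != '"':
            n+=s[i]
            i+=1
        # skip the closing quote
        i+=1
    if i < len(s) and s[i] == ",":
        # a comma outside quotes is a field separator
        n+=", "
        i+=1
print(n.split(", "))
output: ['12345', 'X', 'description of x', 'X,Y', '', '', '345355']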

How can we skip some part in the string using regex

I have a string:
st="[~620cc13778d079432b9bc7b1:Hello WorldGuest]"
I just want the part after ":" and before "]". The part in between can have a maximum length of 64 characters.
The part after "[~" is 24 character UUID.
So the resulting string would be "Hello WorldGuest".
I'm using the following regex:
r"(\[\~[a-z0-9]{24}:)(?=.{0,64})"
But that is only matching the string till ":", I also want to match the ending "]".
Given:
>>> import re
>>> st = "[~620cc13778d079432b9bc7b1:Hello WorldGuest]"
Two simple ways:
>>> re.sub(r'[^:]*:([^\]]*)\]',r'\1',st)
'Hello WorldGuest'
>>> st.partition(':')[-1].rstrip(']')
'Hello WorldGuest'
If you want to be super specific:
>>> re.sub(r'^\[~[a-z0-9]{24}:([^\]]{0,64})\]$',r'\1',st)
'Hello WorldGuest'
If you want to correct your pattern, you can do:
>>> m=re.search(r'(?:\[~[a-z0-9]{24}:)(?=([^\]]{0,64})\])', st)
>>> m.group(1)
'Hello WorldGuest'
Or with anchors:
>>> m=re.search(r'(?:^\[~[a-z0-9]{24}:)(?=([^\]]{0,64})\]$)', st)
>>> m.group(1)
'Hello WorldGuest'
Note:
I just used your regex for a UUID even though it is not correct. The correct regex for a UUID is:
[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}
But that would not match your example...
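The "super specific" variant can also be wrapped in a small helper that returns None for input that does not fit the expected shape. This is a sketch only: the names TOKEN_RE and extract_label are hypothetical, and the 24-character ID and 64-character limit are taken from the question.

```python
import re

# 24 lowercase alphanumerics after "[~", then ":", then up to 64 non-"]" characters.
TOKEN_RE = re.compile(r'\[~[a-z0-9]{24}:(?P<label>[^\]]{0,64})\]')

def extract_label(text):
    # fullmatch() requires the whole string to fit the pattern, like ^...$ anchors.
    m = TOKEN_RE.fullmatch(text)
    return m.group('label') if m else None

st = "[~620cc13778d079432b9bc7b1:Hello WorldGuest]"
print(extract_label(st))  # Hello WorldGuest
```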

Using the OR (|) function in regex [duplicate]

The source string is:
# Python 3.4.3
s = r'abc123d, hello 3.1415926, this is my book'
and here is my pattern:
pattern = r'-?[0-9]+(\\.[0-9]*)?|-?\\.[0-9]+'
re.search gives me the correct result:
m = re.search(pattern, s)
print(m) # output: <_sre.SRE_Match object; span=(3, 6), match='123'>
but re.findall just returns a list of empty strings:
L = re.findall(pattern, s)
print(L) # output: ['', '', '']
Why can't re.findall give me the expected list:
['123', '3.1415926']
There are two things to note here:
re.findall returns captured texts if the regex pattern contains capturing groups in it
the r'\\.' part in your pattern matches two consecutive chars, \ and any char other than a newline.
See findall reference:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
Note that to make re.findall return just match values, you may usually
remove redundant capturing groups (e.g. (a(b)c) -> abc)
convert all capturing groups into non-capturing (that is, replace ( with (?:) unless there are backreferences that refer to the group values in the pattern (then see below)
use re.finditer instead ([x.group() for x in re.finditer(pattern, s)])
In your case, findall returned all captured texts that were empty because you have \\ within r'' string literal that tried to match a literal \.
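The difference between findall and finditer can be seen in a short sketch. It uses the question's pattern with the backslashes corrected but the capturing group left in place; the variable names are only for illustration.

```python
import re

s = 'abc123d, hello 3.1415926, this is my book'
pattern = r'-?[0-9]+(\.[0-9]*)?|-?\.[0-9]+'

# With a capturing group, findall returns the group contents, not the matches.
groups_only = re.findall(pattern, s)
print(groups_only)  # ['', '.1415926']

# finditer yields match objects, so group() gives the full matched text.
full_matches = [m.group() for m in re.finditer(pattern, s)]
print(full_matches)  # ['123', '3.1415926']
```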
To match the numbers, you need to use
-?\d*\.?\d+
The regex matches:
-? - Optional minus sign
\d* - Optional digits
\.? - Optional decimal separator
\d+ - 1 or more digits.
Here is IDEONE demo:
import re
s = r'abc123d, hello 3.1415926, this is my book'
pattern = r'-?\d*\.?\d+'
L = re.findall(pattern, s)
print(L)
import re
s = r'abc123d, hello 3.1415926, this is my book'
print(re.findall(r'-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+', s))
You don't need to escape the backslash twice when you are using a raw string.
Output: ['123', '3.1415926']
The return type will be a list of strings. If you want integers and floats instead, use map:
import re, ast
s = r'abc123d, hello 3.1415926, this is my book'
print(list(map(ast.literal_eval, re.findall(r'-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+', s))))
Output: [123, 3.1415926]
To explain why search seemed to return what you want and findall didn't:
search returns an SRE_Match object that holds information such as:
string : attribute containing the string that was passed to the search function.
re : the REGEX object used in the search function.
groups() : tuple of the strings captured by the capturing groups inside the REGEX.
group(index) : retrieves the string captured by a group, using index > 0.
group(0) : returns the string matched by the whole REGEX.
search stops when it finds the first match, builds the SRE_Match object and returns it. Check this code:
import re
s = r'abc123d'
pattern = r'-?[0-9]+(\.[0-9]*)?|-?\.[0-9]+'
m = re.search(pattern, s)
print(m.string) # 'abc123d'
print(m.group(0)) # REGEX matched 123
print(m.groups()) # the only group in the REGEX, (\.[0-9]*), did not take part in the match, which is why it returns (None,)
s = ', hello 3.1415926, this is my book'
m2 = re.search(pattern, s)
print(m2.string) # ', hello 3.1415926, this is my book'
print(m2.group(0)) # REGEX matched 3.1415926
print(m2.groups()) # the group captured this part: ('.1415926',)
findall behaves differently because it doesn't stop at the first match; it keeps extracting until the end of the text. But if the REGEX contains at least one capturing group, findall returns the strings captured by the groups instead of the matched strings:
import re
s = r'abc123d , hello 3.1415926, this is my book'
pattern = r'-?[0-9]+(\.[0-9]*)?|-?\.[0-9]+'
m = re.findall(pattern, s)
print(m) # ['', '.1415926']
The first element comes from the first match, '123', where the capturing group captured nothing, so findall reports ''. The second element comes from the second match, '3.1415926', where the capturing group matched '.1415926'.
If you want findall to return the matched strings, make every capturing group () in your REGEX a non-capturing group (?:):
import re
s = r'abc123d , hello 3.1415926, this is my book'
pattern = r'-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+'
m = re.findall(pattern, s)
print(m) # ['123', '3.1415926']

Replace number into character from string using python

I have a string like this
convert_text = "tet1+tet2+tet34+tet12+tet3"
I want to replace the digits in the above string with characters. The mapping list is available separately. When I try to replace digit 1 with character 'g' using replace like below
import re
convert_text = convert_text.replace('1','g')
print(convert_text)
output is
"tetg+tet2+tet34+tetg2+tet3"
How can I differentiate single-digit and two-digit values? Is there any way to do this with a regexp or something else?
You can use a regular expression with a callable replacement argument to substitute consecutive runs of digits with a value in a lookup table, eg:
import re
# Input text
convert_text = "tet1+tet2+tet34+tet12+tet3"
# Mapping from digit strings to their replacement strings
replacements = {'1': 'A', '2': 'B', '3': 'C', '12': 'T', '34': 'X'}
# Do actual replacement of digits to string (raw string avoids an invalid-escape warning for \d)
converted_text = re.sub(r'(\d+)', lambda m: replacements[m.group()], convert_text)
Which gives you:
'tetA+tetB+tetX+tetT+tetC'
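If the text may contain a number that is missing from the lookup table, the replacement above raises a KeyError. One possible variation, shown here as a sketch with a made-up input containing the unmapped number 99, is to fall back to the original digits:

```python
import re

convert_text = "tet1+tet2+tet34+tet12+tet99"
replacements = {'1': 'A', '2': 'B', '3': 'C', '12': 'T', '34': 'X'}

# dict.get() returns the matched digits unchanged when there is no mapping for them.
converted_text = re.sub(
    r'\d+',
    lambda m: replacements.get(m.group(), m.group()),
    convert_text,
)
print(converted_text)  # tetA+tetB+tetX+tetT+tet99
```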
import re
convert_text = "tet1+tet2+tet34+tet12+tet3"
pattern = re.compile(r'((?<!\d)\d(?!\d))')
convert_text2=pattern.sub('g',convert_text)
convert_text2
Out[2]: 'tetg+tetg+tet34+tet12+tetg'
You have to use negative lookahead and negative lookbehind patterns, which are written in parentheses as (?!pat) and (?<!pat). The positive lookahead/lookbehind forms are the same with = instead of !.
EDIT: if you need to replace runs of digits, the regex is
pattern2 = re.compile(r'\d+')
In any pattern you can replace \d by a specific digit you need.
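Applied to the question, a sketch that replaces only a standalone 1 (and leaves the 1 in 12 alone) could look like this; the variable name result is only for illustration:

```python
import re

convert_text = "tet1+tet2+tet34+tet12+tet3"

# Match the digit 1 only when it has no digit directly before or after it.
result = re.sub(r'(?<!\d)1(?!\d)', 'g', convert_text)
print(result)  # tetg+tet2+tet34+tet12+tet3
```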

How can I print only integers/numbers from string

Hello I am fairly new at programming and python and I have a question.
How would I go about printing or returning only numbers from a string
For example:
"Hu765adjH665Sdjda"
output:
"765665"
You can use re.sub to remove any character that is not a number.
import re
string = "Hu765adjH665Sdjda"
string = re.sub('[^0-9]', '', string)
print(string)
#'765665'
re.sub scans the string from left to right. Every time it finds a character that is not a digit, it replaces it with the empty string (which for all practical purposes is the same as removing it).
>>> s = "Hu765adjH665Sdjda"
>>> ''.join(c for c in s if c in '0123456789')
'765665'
a='a345g5'
for i in a:
    # isnumeric() already returns a bool, so no int() conversion is needed
    if i.isnumeric():
        print(i,end=' ')
Try filter (in Python 3 it returns an iterator, so join the result back into a string):
>>> s='1qaz2wsx3edc4rfv5tgb6yhn7ujm8ik9ol'
>>> print(s)
1qaz2wsx3edc4rfv5tgb6yhn7ujm8ik9ol
>>> ''.join(filter(lambda x: '0' <= x <= '9', s))
'123456789'
sentence = "Hu765adjH665Sdjda"
for number in sentence:
    if number in "0123456789":
        print(number)
