Extract number from str with pattern - string

For example, given "3-5/description", I'd like to extract the numbers on either side of the dash -, which are 3 and 5 here. The description after the / is a str containing no numbers.
I want a function that extracts the numbers from strings like this, so that func("3-5/description") returns [3,5].

You can use a regex match to capture the digits before and after the '-':
import re
def func(s):
    # groups() returns the captured digits as strings, so convert them to int
    return [int(n) for n in re.match(r'(\d+)-(\d+)', s).groups()]
func("3-5/description")  # [3, 5]
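re.match returns None when the pattern is not found at the start of the string, so calling .groups() on the result can raise an AttributeError. A minimal sketch that guards against this is shown here. The name extract_pair, the use of re.search (which finds the pair anywhere in the string) and the choice to return an empty list on failure are assumptions for illustration, not part of the original answer.

```python
import re

# Two runs of digits separated by a single dash.
PAIR = re.compile(r'(\d+)-(\d+)')

def extract_pair(s):
    """Return the two integers around the first dash, or [] if there is no such pair."""
    m = PAIR.search(s)
    if m is None:
        return []
    return [int(g) for g in m.groups()]

print(extract_pair("3-5/description"))
print(extract_pair("description"))
```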

Related

Python regex multiple matches occurrences between two strings

I have a multi-line string with my start/end magic strings ("X" and "Y"). I'm trying to capture all occurrences but I'm experiencing some issues.
Here is the code
testString = '''AAAAAXBBBBBYCCCCCXDDDDDYEEEEEEXFFF
FFFYGGG
'''
pattern = re.compile(r'(.*)X(.*)Y(.*)', re.MULTILINE)
match = re.search(pattern, testString)
print match.group(1) # output: AAAAAXBBBBBYCCCCC
print match.group(2) # output: DDDDD
print match.group(3) # output: EEEEEEXFFF
Basically, I'm trying to capture all occurrences of the following (And I have to maintain text order):
Text before the magic start string (e.g.: AAAAA, CCCCC, EEEEEE)
Text between start/end magic strings (e.g.: BBBBB, DDDDD, FFF\nFFF)
Text after the magic end string (e.g.: CCCCC, GGG)
So I'm trying to print the following output: (what's in between brackets below is just a comment)
AAAAA (before magic string)
BBBBB (between magic strings)
CCCCC (before/after magic strings, it does not matter. Just the order matters.)
DDDDD (after magic string)
And so on. Printing them in that order would solve the issue. (Then I can pass each to other functions, ...etc.)
The code works nicely when the text is as simple as for example "AAXBBYCC", but with complicated strings I'm losing control.
Any ideas or alternative ways to do this?
You could match any character except X or Y in group 1 and then match X, and do the same in group 2 followed by Y. The "after the magic string" part can be captured in a lookahead with a third group, so it is not consumed and can still be matched as group 1 of the next occurrence.
A negated character class such as [^XY] also matches a newline, which is how the FFF\nFFF part is matched.
([^XY]+)X([^XY]+)Y(?=([^XY]+))
([^XY]+)X Capture group 1, match 1+ times any char except X or Y, then match X
([^XY]+)Y Capture group 2, match 1+ times any char except X or Y, then match Y
(?= Positive lookahead, assert that what is directly to the right is
([^XY]+) Capture group 3, match 1+ times any char except X or Y
) Close lookahead
The code below uses [^XY]* in the lookahead instead, so a match is still found when nothing follows the final Y.
import re
regex = r"([^XY]+)X([^XY]+)Y(?=([^XY]*))"
s = ("AAAAAXBBBBBYCCCCCXDDDDDYEEEEEEXFFF\n"
"FFFYGGG")
matches = re.findall(regex, s)
print(matches)
Output
[('AAAAA', 'BBBBB', 'CCCCC'), ('CCCCC', 'DDDDD', 'EEEEEE'), ('EEEEEE', 'FFF\nFFF', 'GGG')]
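Because group 3 of each match reappears as group 1 of the next one, the tuples overlap. To get the flat, ordered sequence the question asks for, one option is to take groups 1 and 2 from every match and append group 3 only from the last match. This is a sketch under that assumption; the function name ordered_segments is hypothetical.

```python
import re

REGEX = re.compile(r"([^XY]+)X([^XY]+)Y(?=([^XY]*))")

def ordered_segments(s):
    """Flatten the overlapping (before, between, after) tuples into one ordered list."""
    segments = []
    last_after = ""
    for before, between, after in REGEX.findall(s):
        segments.extend([before, between])
        last_after = after  # only the final "after" is not repeated by a later match
    if last_after:
        segments.append(last_after)
    return segments

s = "AAAAAXBBBBBYCCCCCXDDDDDYEEEEEEXFFF\nFFFYGGG"
for segment in ordered_segments(s):
    print(repr(segment))
```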
So I'm trying to print the following output: (what's in between brackets below is just a comment)
AAAAA (before magic string)
BBBBB (between magic strings)
CCCCC (before/after magic strings, it does not matter. Just the order matters.)
DDDDD (after magic string)
And so on.
Since it doesn't matter whether before or after start or end, it is as simple as:
import re
o = re.split("X|Y", testString)
print(*o, sep='\n')
Can't you just use:
pattern = re.compile(r'[^XY]+')
match = re.findall(pattern, testString)
print(match)
# ['AAAAA', 'BBBBB', 'CCCCC', 'DDDDD', 'EEEEEE', 'FFF\nFFF', 'GGG\n']

Can we replace an integer to English letter in a document python

I have a document that contains numbers in places. Is there a way to replace all the numbers with their English equivalents?
eg:
My age is 10. I am in my 7th grade.
expected-o/p :
My age is Ten and I am in my seventh grade.
Thanks in advance
You'll want to take a look at num2words.
You'll have to construct a regexp to catch the numbers you want to replace and pass them to num2words. Based on the example provided, you might also need the ordinal flag.
import re
from num2words import num2words
# this is just an example, NOT ready-to-use code
text = "My age is 10. I am in my 7th grade."
to_replace = set(re.findall(r'\d+', text))  # find numbers to replace
longest = sorted(to_replace, key=len, reverse=True)  # sort so longest are replaced first
for m in longest:
    n = int(m)  # convert from string to number
    result = num2words(n)  # generate text representation
    text = re.sub(m, result, text)  # substitute in the text
print(text)
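An alternative to looping over the numbers is a single re.sub call with a replacement function, which also makes it easy to treat suffixes such as "th" as ordinals. The sketch below is only illustrative: to_words and its two small lookup tables are a hypothetical stand-in so the example runs without third-party packages. With num2words installed, the body of to_words could presumably call num2words(n) or num2words(n, to='ordinal') instead.

```python
import re

# Hypothetical stand-in for num2words, covering only the numbers in the example.
CARDINALS = {7: "seven", 10: "ten"}
ORDINALS = {7: "seventh", 10: "tenth"}

def to_words(n, ordinal=False):
    table = ORDINALS if ordinal else CARDINALS
    return table.get(n, str(n))  # numbers missing from the table stay as digits

# A whole number, optionally followed by an ordinal suffix.
NUMBER = re.compile(r'\b(\d+)(st|nd|rd|th)?\b')

def spell_out(text):
    def repl(m):
        # group(2) is None when there is no ordinal suffix
        return to_words(int(m.group(1)), ordinal=m.group(2) is not None)
    return NUMBER.sub(repl, text)

print(spell_out("My age is 10. I am in my 7th grade."))
```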

How to better code, when looking for substrings?

I want to extract the prices (along with the $ sign) from a list and create two separate price lists, which I have done. Is there a better way to code this?
The list is as below:
['\n\n\t\t\t\t\t$59.90\n\t\t\t\t\n\n\n\t\t\t\t\t\t$68.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$55.00\n\t\t\t\t\n\n\n\t\t\t\t\t\t$68.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$38.50\n\t\t\t\t\n\n\n\t\t\t\t\t\t$49.90\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$49.00\n\t\t\t\t\n\n\n\t\t\t\t\t\t$62.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$68.80\n\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$49.80\n\t\t\t\t\n\n\n\t\t\t\t\t\t$60.50\n\t\t\t\t\t\n\n']
Python code:
pp_list = []
up_list = []
for u in usual_price_list:
    rep = u.replace("\n","")
    rep = rep.replace("\t","")
    s = rep.rsplit("$",1)
    pp_list.append(s[0])
    up_list.append("$"+s[1])
For this kind of problem, I tend to use the re module a lot, as it is more readable, more maintainable and does not depend on which characters surround what you are looking for:
import re
pp_list = []
up_list = []
for u in usual_price_list:
    prices = re.findall(r"\$\d{2}\.\d{2}", u)
    length_prices = len(prices)
    if length_prices > 0:
        pp_list.append(prices[0])
    if length_prices > 1:
        up_list.append(prices[1])
Regular Expression Breakdown
$ is the end-of-string anchor, so we need to escape it to match a literal dollar sign
\d matches any digit, so \d{2} matches exactly 2 digits
. matches any character, so we need to escape it to match a literal dot
If you want, you can change the number of digits for the cents: \d{1,2} matches one or two digits, and \d* matches zero or more digits
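One thing to watch with the loop above is that an entry with a single price adds to pp_list but not to up_list, so the two lists can end up with different lengths. A possible variation that keeps them aligned is sketched here. The function name split_prices, the None placeholder and the relaxed \d+ for the dollar part are assumptions for illustration.

```python
import re

# A dollar sign, one or more digits, a dot and exactly two digits.
PRICE = re.compile(r"\$\d+\.\d{2}")

def split_prices(items):
    """Return two lists of equal length, using None where a price is missing."""
    first, second = [], []
    for item in items:
        found = PRICE.findall(item)
        first.append(found[0] if len(found) > 0 else None)
        second.append(found[1] if len(found) > 1 else None)
    return first, second

sample = ['\n\n\t$59.90\n\t\n\n$68.00\n\t\n', '\n\n\t$68.80\n\t\n']
print(split_prices(sample))
```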
As already pointed out, the re module is useful for this task. I would use re.split in the following way:
import re
data = ['\n\n\t\t\t\t\t$59.90\n\t\t\t\t\n\n\n\t\t\t\t\t\t$68.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$55.00\n\t\t\t\t\n\n\n\t\t\t\t\t\t$68.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$38.50\n\t\t\t\t\n\n\n\t\t\t\t\t\t$49.90\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$49.00\n\t\t\t\t\n\n\n\t\t\t\t\t\t$62.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$68.80\n\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$49.80\n\t\t\t\t\n\n\n\t\t\t\t\t\t$60.50\n\t\t\t\t\t\n\n']
prices = [re.split(r'[\n\t]+',i) for i in data]
prices0 = [i[1] for i in prices]
prices1 = [i[2] for i in prices]
print(prices0)
print(prices1)
Output:
['$59.90', '$55.00', '$38.50', '$49.00', '$68.80', '$49.80']
['$68.00', '$68.00', '$49.90', '$62.00', '', '$60.50']
Note that this works only if the strings contain nothing but \n and \t besides the prices, with at least one \n or \t before the first price and at least one \n or \t between prices.
[\n\t]+ denotes any string of length 1 or greater made up of \n and \t, that is \n, \t, \n\n, \t\t, \n\t, \t\n and so on
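If the prices are separated only by whitespace, as in the data shown, str.split() with no arguments is another option: it splits on any run of whitespace and drops leading and trailing empty strings, so there is no empty first element to skip. A small sketch follows; the function name split_on_whitespace is hypothetical, and entries with one price yield a one-element list.

```python
def split_on_whitespace(items):
    """Split each entry on runs of whitespace (spaces, tabs, newlines)."""
    return [item.split() for item in items]

sample = ['\n\n\t\t$59.90\n\t\t\n\n\t$68.00\n\t\n\n', '\n\n\t\t$68.80\n\t\n\n']
print(split_on_whitespace(sample))
```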

How to split a string after a dot UNLESS the characters after the dot are numbers

I need to take only the letters and numbers at the beginning of a string, but some numbers are decimals. The strings are not all formatted the same. Here are a few examples of some of the data and what I would need returned:
HB61 .M16 1973 I need HB61 returned
HB97.52 .R6163 1982 I need HB97.52 returned
HB98.V38 1994 I need HB98 returned
HB 119.G74 A3 2007 I need HB119 returned
I'm very new to coding so I'm hoping there's some simple solution that I just don't know?
I was going to just split it at the first dot and then get rid of the spaces, but this wouldn't allow me to keep the decimals such as HB97.52 which I need. I currently have code written just to test one string at a time. The code is as follows:
data = input("Data: ")
components = data.split(".")
str(components)
print(components[0].replace(" ", ""))
This works as expected except for the strings with decimals. For HB97.52 .R6163 1982 I would like HB97.52 returned, but it only returns HB97.
The following regular expression extracts the letters at the beginning of a string, followed by optional spaces, followed by a [possibly floating point] number:
s = ['HB61 .M16 1973', 'HB97.52 .R6163 1982',
'HB98.V38 1994', 'HB 119.G74 A3 2007']
import re
pattern = r"^[a-z]+\s*\d+(?:\.\d+)?"
[re.findall(pattern, part, flags=re.I)[0] for part in s]
#['HB61', 'HB97.52', 'HB98', 'HB 119']
If you do not want the spaces in the output, this slightly different pattern extracts the letter part and the number part separately, and then they are joined:
pattern = r"(^[a-z]+)\s*(\d+(?:\.\d+)?)"
list(map("".join, [re.findall(pattern, part, flags=re.I)[0] for part in s]))
#['HB61', 'HB97.52', 'HB98', 'HB119']
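Indexing the result of re.findall with [0] raises an IndexError for a string that does not match. The same two-group pattern can be wrapped in a small function that returns None instead. This is a sketch; the name class_part and the None return value are assumptions for illustration.

```python
import re

# Letters, optional spaces, then a number with an optional decimal part.
CALL_NUMBER = re.compile(r"([a-z]+)\s*(\d+(?:\.\d+)?)", re.IGNORECASE)

def class_part(s):
    """Return the leading letters and number joined without spaces, or None."""
    m = CALL_NUMBER.match(s)  # match() anchors at the start of the string
    return "".join(m.groups()) if m else None

for example in ['HB61 .M16 1973', 'HB97.52 .R6163 1982',
                'HB98.V38 1994', 'HB 119.G74 A3 2007']:
    print(class_part(example))
```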
For something like HB61.45.78.R5000, what do you want? If you want HB61.45.78, use this first snippet:
data = data.replace(' ', '')
data = data.split('.')
wanted = data[0]
for i in range(1, len(data)):
    # stop at the first part that starts with a letter
    if data[i][0].isalpha():
        break
    else:
        wanted += '.' + data[i]
Otherwise, if you want only HB61.45, use:
data = data.replace(' ', '')
data = data.split('.')
wanted = data[0]
if not data[1][0].isalpha():
    wanted += '.' + data[1]

Get digits at end of string in a pythonic way

I'm using python 3.x. I'm trying to get the (int) number at the end of a string with format
string_example_1 = "l-45-98-567-567-12"
string_example_2 = "s-89-657"
or in general, a single lowercase letter followed by a number of integers separated by '-'. What I need is to get the last number (12 and 657 in these cases). I have achieved this with the function
def ending(the_string):
    out = ''
    while the_string[-1].isdigit():
        out = the_string[-1] + out
        the_string = the_string[:-1]
    return out
but I'm sure there must be a more pythonic way to do this. In a previous step I check manually that the string starts the way I like by doing something like
if st[0].isalpha() and st[1]=='-' and st[2].isdigit():
    statement...
I would just split the string on -, take the last of the splits and convert it to an integer.
string_example_1 = "l-45-98-567-567-12"
string_example_2 = "s-89-657"
def last_number(s):
    return int(s.split("-")[-1])
print(last_number(string_example_1))
# 12
print(last_number(string_example_2))
# 657
Without regular expressions, you could reverse the string, take elements from the string while they're still numbers, and then reverse the result. In Python:
from itertools import takewhile
def extract_final_digits(s):
    return int(''.join(reversed(list(takewhile(lambda c: c.isdigit(), reversed(s))))))
But the simplest is to just split on a delimiter and take the final element in the split list.
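Since the question also checks the format by hand, a regular expression can do the validation and the extraction in one step. The sketch below assumes the format described in the question (one lowercase letter followed by one or more '-'-separated integers). The function name last_number_checked and the choice to raise ValueError are assumptions for illustration.

```python
import re

# One lowercase letter, any number of "-digits" groups, then a final "-digits" that is captured.
FORMAT = re.compile(r"[a-z](?:-\d+)*-(\d+)")

def last_number_checked(s):
    """Validate the whole string and return its final integer."""
    m = FORMAT.fullmatch(s)
    if m is None:
        raise ValueError("unexpected format: %r" % s)
    return int(m.group(1))

print(last_number_checked("l-45-98-567-567-12"))
print(last_number_checked("s-89-657"))
```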

Resources