Split pandas column on number with % - python-3.x

I have a df with one of the columns that appears like:
**Share**
We are safe 25%
We are always safe 12.50% (India Aus, West)
We are ok (USA, EU)
We are not OK
What is this
Always wise 25.66%
I want to split this column such that the % values wherever applicable get split from the column into a new one.
So the output would be
Share                 Percent   LOCATION
We are safe           25%
We are always safe    12.50%    India Aus, West
We are ok                       USA, EU
We are not OK
What is this
Always wise           25.66%
I thought the below would split it from right, but it is not working
df['Percent'] = df['Share'].str.rsplit(r' \d',1).str[0]

You can extract those values:
df[['Share','Percent']] = df['Share'].str.split(r'\s+(?=\d+(?:\.\d+)?%\s*$)',expand=True).fillna("")
Pandas test:
import pandas as pd
df = pd.DataFrame({'Share':['We are safe 25%','We are ok', 'We are always safe 12.50%']})
df[['Share','Percent']] = df['Share'].str.split(r'\s+(?=\d+(?:\.\d+)?%\s*$)',expand=True).fillna("")
>>> df
                Share Percent
0         We are safe     25%
1           We are ok
2  We are always safe  12.50%
Details:
\s+ - one or more whitespaces
(?=\d+(?:\.\d+)?%\s*$) - a positive lookahead matching a location that is immediately followed with:
    \d+ - one or more digits
    (?:\.\d+)? - an optional sequence of . and one or more digits
    % - a % symbol
    \s* - 0 or more trailing (as $ comes next) whitespaces and
    $ - end of string.
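Note that the split above only fires when the percentage is the last thing in the string, so a row such as We are always safe 12.50% (India Aus, West) stays in Share unchanged. Below is a minimal sketch of one way to also separate the parenthesised location, using only the re module. The split_share helper, the SHARE_RE name and the named groups are illustrative and not part of the answer above; since the pattern uses named groups, it should also work with Series.str.extract.

```python
import re

# Hypothetical pattern: text, then an optional "12.5%" style number,
# then an optional "(...)" location at the end of the string.
SHARE_RE = re.compile(
    r"^(?P<Share>.*?)"                      # leading text, as short as possible
    r"(?:\s+(?P<Percent>\d+(?:\.\d+)?%))?"  # optional percentage
    r"(?:\s*\((?P<LOCATION>[^()]*)\))?"     # optional location in parentheses
    r"\s*$"
)


def split_share(text):
    """Return a dict with Share, Percent and LOCATION ('' when absent)."""
    match = SHARE_RE.match(text)
    if match is None:  # e.g. the text contains a line break
        return {"Share": text, "Percent": "", "LOCATION": ""}
    return {name: value or "" for name, value in match.groupdict().items()}


for row in ["We are safe 25%",
            "We are always safe 12.50% (India Aus, West)",
            "We are ok (USA, EU)",
            "We are not OK"]:
    print(split_share(row))
```

With a DataFrame, the helper could be applied row by row, for example with df['Share'].apply(split_share), and the resulting dicts expanded into columns.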

Related

Python - regex extract numbers from text that may contain thousands or millions separators and convert them to dot separated decimal floats

I'm trying to extract 'valid' numbers from text that may or may not contain thousands or millions separators and decimals. The problem is that sometimes separators are ',' and in other cases are '.', the same applies for decimals. I should check if there is a posterior occurrence of ',' or '.' in order to automatically detect whether the character is a decimal or thousand separator in addition to condition \d{3}.
Another problem I have found is that there are dates in the text with format 'dd.mm.yyyy' or 'mm.dd.yy' that don't have to be matched.
The target is converting 'valid' numbers to float: I need to make sure the match is not a date, then remove millions/thousands separators, and finally replace ',' with '.' when the decimal separator is ','.
I have read other great answers, such as "Regular expression to match numbers with or without commas and decimals in text", which solve more specific problems. I would be happy with something robust (I don't need to get it in one regex command).
Here's what I've tried so far but the problem is well above my regex skills:
import re
p = r'\d+(?:[,.]\d{3})*(?:[.,]\d*)'
for s in ['blabla 1,25 10.587.256,25 euros', '6.010,12', '6.010', '6,010', '6,010.12', '6010,124', '05.12.2018', '12.05.18']:
    print(s, re.findall(p, s, re.IGNORECASE))
You can use
import re
p = r'\b\d{1,2}\.\d{1,2}\.\d{2}(?:\d{2})?\b|\b(?<!\d[.,])(\d{1,3}(?=([.,])?)(?:\2\d{3})*|\d+)(?:(?(2)(?!\2))[.,](\d+))?\b(?![,.]\d)'
def postprocess(x):
    if x.group(3):
        return f"{x.group(1).replace(',','').replace('.','')}.{x.group(3)}"
    elif x.group(2):
        return f"{x.group(1).replace(',','').replace('.','')}"
    else:
        return None
texts = ['blabla 1,25 10.587.256,25 euros', '6.010,12', '6.010', '6,010', '6,010.12', '6010,124', '05.12.2018', '12.05.18']
for s in texts:
    print(s, '=>', list(filter(None, [postprocess(x) for x in re.finditer(p, s)])) )
Output:
blabla 1,25 10.587.256,25 euros => ['1.25', '10587256.25']
6.010,12 => ['6010.12']
6.010 => ['6010']
6,010 => ['6010']
6,010.12 => ['6010.12']
6010,124 => ['6010.124']
05.12.2018 => []
12.05.18 => []
The regex is
\b\d{1,2}\.\d{1,2}\.\d{2}(?:\d{2})?\b|\b(?<!\d[.,])(\d{1,3}(?=([.,])?)(?:\2\d{3})*|\d+)(?:(?(2)(?!\2))[.,](\d+))?\b(?![,.]\d)
Details:
\b\d{1,2}\.\d{1,2}\.\d{2}(?:\d{2})?\b| - matches a whole word, 1-2 digits, ., 1-2 digits, ., 2 or 4 digits (this match will be skipped)
\b - a word boundary
(?<!\d[.,]) - a negative lookbehind failing the match if there is a digit and a . or , immediately on the left
(\d{1,3}(?=([.,])?)(?:\2\d{3})*|\d+) - Group 1:
    \d{1,3} - one, two or three digits
    (?=([.,])?) - a positive lookahead that optionally captures a . or , immediately on the right into Group 2
    (?:\2\d{3})* - zero or more sequences of Group 2 value and then any three digits
    | - or
    \d+ - one or more digits
(?:(?(2)(?!\2))[.,](\d+))? - an optional sequence of
    (?(2)(?!\2)) - if Group 2 matched, the next char cannot be Group 2 value
    [.,] - a comma or dot
    (\d+) - Group 3: one or more digits
\b - a word boundary
(?![,.]\d) - a negative lookahead failing the match if there is a , or . and a digit immediately on the right.
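The postprocess function returns None when neither Group 2 nor Group 3 matched (this is the case for the skipped date-like matches, and also for plain integers without any separator). Otherwise it returns the number with the thousands separators removed from the integer part and a dot as the decimal separator.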
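The question also describes a heuristic: decide whether a separator is a decimal or a thousands mark by looking at which separators occur later in the number. A rough sketch of that idea as a plain function follows, assuming the token has already been isolated and is not a date (filtering dates is still left to the regex above). The normalize_number name and the exact rules for a single separator are assumptions made for illustration.

```python
def normalize_number(token):
    """Convert a token made of digits, '.' and ',' to a float.

    Assumed rules (illustrative, not from the answer above):
    - both '.' and ',' present: the last one seen is the decimal separator;
    - the same separator repeated: it is a thousands separator;
    - a single separator: thousands if it is followed by exactly three
      digits and preceded by one to three digits, otherwise decimal.
    """
    seps = [ch for ch in token if ch in ".,"]
    if not seps:
        return float(token)

    if len(set(seps)) == 2:
        decimal = seps[-1]
    elif len(seps) > 1:
        decimal = None
    else:
        head, tail = token.split(seps[0])
        is_thousands = len(tail) == 3 and 1 <= len(head) <= 3
        decimal = None if is_thousands else seps[0]

    if decimal is None:
        # Every separator is a thousands mark: drop them all.
        return float(token.replace(".", "").replace(",", ""))

    int_part, frac_part = token.rsplit(decimal, 1)
    int_part = int_part.replace(".", "").replace(",", "")
    return float(f"{int_part}.{frac_part}")


for tok in ["1,25", "10.587.256,25", "6.010", "6,010.12", "6010,124"]:
    print(tok, "=>", normalize_number(tok))
```

A token like 6.010 is inherently ambiguous; the sketch treats it as six thousand and ten, which agrees with the output listed above.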

Python reformatting strings based on contents

In a pandas dataframe I have rows with contents in the following format:
1) abc123-Target 4-ufs
2) abc123-target4-ufs
3) geo.4
4) j123T4
All of these should be simply: target 4
So far my cleaning procedure is as follows:
df["point_id"] = df["point_id"].str.lower()
df["point_id"] = df['point_id'].str.replace(r'^.*?(?=target)', '', regex=True)
This returns:
1) target 4-ufs
2) target4-ufs
3) geo.14
4) geo.2
5) j123T4
What I believe I need is:
a. Remove anything after the last number in the string, this solves 1
b. If 'target' does not have a space after it add a space, this with the above solves 2
c. If the string ends in a point and a number of any length remove everything before the point (incl. point) and replace with 'target ', this solves 3 and 4
d. If the string ends with a 't' followed by a number of any length remove everything before 't' and replace with 'target ', this solves 5
I'm looking at regex and re but the following is not having effect (add space before the last number)
df["point_id"] = re.sub(r'\D+$', '', df["point_id"])
Reading the rules, you might use 2 capture groups and check for the group values:
\btarget\s*(\d+)|.*[t.](\d+)$
\btarget\s*(\d+) Match target, optional whitespace chars and capture 1+ digits in group 1
| Or
.*[t.] Match 0+ characters followed by either t or a .
(\d+)$ Capture 1+ digits in group 2 at the end of the string
Python example:
import re
import pandas as pd
pattern = r"\btarget\s*(\d+)|.*[t.](\d+)$"
strings = [
    "abc123-Target 4-ufs",
    "abc123-target4-ufs",
    "geo.4",
    "j123T4"
]
df = pd.DataFrame(strings, columns=["point_id"])
def change(s):
    m = re.search(pattern, s, re.IGNORECASE)
    return "target " + (m.group(2) if m.group(2) else m.group(1))
df["point_id"] = df["point_id"].apply(change)
print(df)
Output
point_id
0 target 4
1 target 4
2 target 4
3 target 4
You can use
df = pd.DataFrame({'point_id':['abc123-Target 4-ufs','abc123-target4-ufs','geo.4','j123T4']})
df['point_id'] = df['point_id'].str.replace(r'(?i).*Target\s*(\d+).*', r'target \1', regex=True)
df.loc[df['point_id'].str.contains(r'(?i)\w[.t]\d+$'), 'point_id'] = 'target 4'
# point_id
# 0 target 4
# 1 target 4
# 2 target 4
# 3 target 4
The first regex is (?i).*Target\s*(\d+).*:
(?i) - case insensitive matching
.* - any 0+ chars other than line break chars, as many as possible
Target\s*(\d+) - Target, zero or more whitespaces, and one or more digits captured into Group 1
.* - any 0+ chars other than line break chars, as many as possible
The second regex matches
(?i) - case insensitive matching
\w - a word char, then
[.t] - a . or t and then
\d+$ - one or more digits at the end of string.
The second regex is used as a mask, and the values in the point_id column are set to target 4 whenever the pattern matches the regex.
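Because the mask assigns the literal string target 4, inputs such as geo.14 from the question would also become target 4. A small sketch that reuses the captured digits instead is shown here, written with the re module only. The normalize_point_id helper is a hypothetical name, and values that match neither rule are returned unchanged, which is an assumption about the desired behaviour.

```python
import re


def normalize_point_id(value):
    """Return 'target <n>' using the digits found in value, if any."""
    # Rule 1: the word "target" followed by optional whitespace and digits.
    match = re.search(r"target\s*(\d+)", value, flags=re.IGNORECASE)
    if match is None:
        # Rule 2: a word char, then "." or "t", then digits at the very end.
        match = re.search(r"\w[.t](\d+)$", value, flags=re.IGNORECASE)
    return f"target {match.group(1)}" if match else value


for raw in ["abc123-Target 4-ufs", "abc123-target4-ufs", "geo.14", "j123T4"]:
    print(raw, "=>", normalize_point_id(raw))
```

On a DataFrame this could be used as df['point_id'] = df['point_id'].apply(normalize_point_id).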

Python regular expression substitution help needed

I may have any of the following input strings -
i/p 1) Required 16 pcs
i/p 2) Required7 units
i/p 3) Requesting 12each
I wish to do some regular expression based substitution so that I have the following outputs for the above 3 strings -
o/p 1) Required 16 units
o/p 2) Required 7 units
o/p 3) Requesting 12 units
Basically, if my string contains pcs/units/each, and an integer before that, I want to do the following -
#1. replace the string "pcs" / "each" with "units" &
#2. add spaces before and after the integer value
I am using re in Python 3.8. I guess I might have to use backreferences and numbered capturing groups, but I am not able to figure out how exactly to make this work.
import re
txt = '''
Required 16 pcs
Required7 units
Requesting 12each
'''
print( re.sub(r'\s*(\d+)\s*(?:units|each|pcs)', r' \1 units', txt) )
Prints:
Required 16 units
Required 7 units
Requesting 12 units
import re
s = \
"""
Required 16 pcs
Required7 units
Requesting 12each
"""
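s2 = re.sub(r'(\S*?)(\s*?)(\d*)(\s*?)(pcs|units|each)',r'\1 \3 units',s)
print(s2)
Explanation:
(\S*?) - capture group 1: zero or more non-whitespace characters, matched non-greedily
(\s*?) - capture groups 2 and 4: optional whitespace (zero or more characters, matched non-greedily) before and after the number
(\d*) - capture group 3: the digit(s)
(pcs|units|each) - capture group 5: the unit word to be replaced
The replacement is group 1, followed by a literal space, group 3 and ' units' as literal text. Groups 2 and 4 are dropped, which corrects missing or extra spaces around the number.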
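Both substitutions can be wrapped in a small reusable function. The sketch below follows the first pattern and adds a trailing \b so that a longer word that merely starts with pcs, units or each is left alone. The word boundary and the normalize_units name are additions for illustration, not part of either answer.

```python
import re

# Digits with optional surrounding whitespace, followed by one of the
# unit words as a whole word (the \b is an assumption added here).
UNIT_RE = re.compile(r"\s*(\d+)\s*(?:pcs|units|each)\b")


def normalize_units(text):
    """Rewrite '<n> pcs', '<n>units', '<n>each' etc. as ' <n> units'."""
    return UNIT_RE.sub(r" \1 units", text)


for line in ["Required 16 pcs", "Required7 units", "Requesting 12each"]:
    print(normalize_units(line))
```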

How to better code, when looking for substrings?

I want to extract the currency (along with the $ sign) from a list, and create two different currency lists which I have done. But is there a better way to code this?
The list is as below:
['\n\n\t\t\t\t\t$59.90\n\t\t\t\t\n\n\n\t\t\t\t\t\t$68.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$55.00\n\t\t\t\t\n\n\n\t\t\t\t\t\t$68.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$38.50\n\t\t\t\t\n\n\n\t\t\t\t\t\t$49.90\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$49.00\n\t\t\t\t\n\n\n\t\t\t\t\t\t$62.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$68.80\n\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$49.80\n\t\t\t\t\n\n\n\t\t\t\t\t\t$60.50\n\t\t\t\t\t\n\n']
Python code:
pp_list = []
up_list = []
for u in usual_price_list:
    rep = u.replace("\n","")
    rep = rep.replace("\t","")
    s = rep.rsplit("$",1)
    pp_list.append(s[0])
    up_list.append("$"+s[1])
For this kind of problem, I tend to use the re module a lot, as it is more readable, more maintainable and does not depend on which characters surround what you are looking for:
import re
pp_list = []
up_list = []
for u in usual_price_list:
    prices = re.findall(r"\$\d{2}\.\d{2}", u)
    length_prices = len(prices)
    if length_prices > 0:
        pp_list.append(prices[0])
    if length_prices > 1:
        up_list.append(prices[1])
Regular expression breakdown:
\$ - $ normally means "end of string", so it is escaped to match a literal dollar sign
\d{2} - \d matches any digit, so \d{2} matches exactly 2 digits
\. - . normally matches any character, so it is escaped to match a literal dot
If needed, you can change the number of digits allowed for the cents: \d{1,2} matches one or two digits, and \d* matches zero or more digits.
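One thing to watch with the loop above: an entry that contains a single price adds an item to pp_list but nothing to up_list, so the two lists can end up with different lengths. A possible variation that keeps them aligned by padding with an empty string is sketched below; the split_prices name, the padding value and the shortened sample data are assumptions for illustration.

```python
import re

# A dollar sign, one or more digits, a literal dot and two digits.
PRICE_RE = re.compile(r"\$\d+\.\d{2}")


def split_prices(raw_items):
    """Return two lists of equal length: first and second price per item."""
    first, second = [], []
    for item in raw_items:
        found = PRICE_RE.findall(item)
        first.append(found[0] if len(found) > 0 else "")
        second.append(found[1] if len(found) > 1 else "")
    return first, second


sample = ["\n\n\t\t$59.90\n\t\t\n\n\t\t$68.00\n\t\n\n",
          "\n\n\t\t$68.80\n\t\t\n\n"]
pp_list, up_list = split_prices(sample)
print(pp_list)
print(up_list)
```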
As already pointed out, the re module is useful for this task. I would use re.split in the following way:
import re
data = ['\n\n\t\t\t\t\t$59.90\n\t\t\t\t\n\n\n\t\t\t\t\t\t$68.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$55.00\n\t\t\t\t\n\n\n\t\t\t\t\t\t$68.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$38.50\n\t\t\t\t\n\n\n\t\t\t\t\t\t$49.90\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$49.00\n\t\t\t\t\n\n\n\t\t\t\t\t\t$62.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$68.80\n\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$49.80\n\t\t\t\t\n\n\n\t\t\t\t\t\t$60.50\n\t\t\t\t\t\n\n']
prices = [re.split(r'[\n\t]+',i) for i in data]
prices0 = [i[1] for i in prices]
prices1 = [i[2] for i in prices]
print(prices0)
print(prices1)
Output:
['$59.90', '$55.00', '$38.50', '$49.00', '$68.80', '$49.80']
['$68.00', '$68.00', '$49.90', '$62.00', '', '$60.50']
Note that this works only if the strings contain nothing but \n, \t and the prices, with at least one \n or \t before the first price and at least one \n or \t between prices.
[\n\t]+ denotes any string of length 1 or greater made of \n and \t characters, that is \n, \t, \n\n, \t\t, \n\t, \t\n and so on.
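The reason indices 1 and 2 are used rather than 0 and 1 is that re.split yields an empty string whenever the input starts or ends with a separator. A short illustration on shortened sample strings (the sample values are made up for the example):

```python
import re

raw = "\n\n\t\t$59.90\n\t\t\n\n\t$68.00\n\t\n\n"
parts = re.split(r"[\n\t]+", raw)
print(parts)   # ['', '$59.90', '$68.00', '']

single = re.split(r"[\n\t]+", "\n\n\t$68.80\n\t\n")
print(single)  # ['', '$68.80', '']

# Dropping the empty strings leaves only the prices.
print([p for p in parts if p])  # ['$59.90', '$68.00']
```

This also shows why an entry with a single price produces '' at index 2.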

Add selected columns of a file as values to a dictionary

I am analyzing a small corpus, and want to create a dictionary based on 500k text files.
These text files consist of numbered lines which are tab-separated columns with some strings (or numbers), i.e.:
1 string1 string2 string3 # ...and so on, but I only need columns 2-4
2 string1 string2 string3 # ...and so on...
3 string1 string2 string3 # ...and so on...
4 string1 string2 string3 # ...and so on...
# ...and so on...
This is a simplification: the words are not necessarily the same in every line, but they do repeat across the whole corpus.
I want to create a dictionary with second column (with "string1") as a key and 3rd and 4th columns as values for that key, but also with a sum of all repetitions of a specific key within that corpus.
Should be something like this:
my_dict = {
"string1": [99, "string2", "string3"],
"string1": [51, "string2", "string3"],
# ...and so on...
}
So, "string1" stands for tokens, number is a counter for these tokens, "string2" stands for lemma, "string3" stands for category (some of them need to be omitted, as in the code below).
I've managed (with a big help of stackoverflow) to write-copy some code:
import os
import re
import operator
test_paths = ["path1", "path2", "path3"]
cat_to_omit = ["CAT1", "CAT2"]
tokens = {}
for path in test_paths:
    dirs = os.listdir(path)
    for file in dirs:
        file_path = os.path.join(path, file)  # join with a separator instead of path + file
        with open(file_path) as f:
            for line in f:
                if re.match(r"^\d+.*", line):  # selecting only lines starting with numbers, because some of them don't, and I don't need these
                    check = line.split()[3]
                    if check not in cat_to_omit:  # omitting some categories that I don't need
                        token_lst = line.lower().split()[1]
                        for token in token_lst.split():
                            tokens[token] = tokens.get(token, 0) + 1
print(tokens)
Now, unsurprisingly, I only get "string1" (the token) as a key and the counter of that token's occurrences within my corpus as the value. How can I store a list of 3 values for each key (token):
1. counter, which I already have as the only value for each key,
2. lemma, which should be taken from column 3 ("string2"),
3. category, which should be taken from column 4 ("string3").
It seems I just don't understand, how to turn my "key: value" dictionary into a "key: 3 values" one.
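One possible way to get a "key: 3 values" dictionary is to store a small list [counter, lemma, category] per token and increment only its first element. The sketch below works on an iterable of lines so that it can be tried without files. The build_token_table name and the sample lines are made up for illustration, and it assumes that the first lemma and category seen for a token are the ones to keep.

```python
import re


def build_token_table(lines, cat_to_omit=()):
    """Map token -> [count, lemma, category] for lines starting with a number."""
    tokens = {}
    for line in lines:
        if not re.match(r"\d", line):
            continue  # skip lines that do not start with a digit
        cols = line.split()
        if len(cols) < 4:
            continue  # not enough columns to read token, lemma and category
        token, lemma, category = cols[1].lower(), cols[2], cols[3]
        if category in cat_to_omit:
            continue
        # Create the entry on first sight, then bump its counter.
        entry = tokens.setdefault(token, [0, lemma, category])
        entry[0] += 1
    return tokens


sample_lines = [
    "1\tDogs\tdog\tNOUN",
    "2\tbark\tbark\tVERB",
    "some header line",
    "3\tdogs\tdog\tNOUN",
    "4\t.\t.\tCAT1",
]
print(build_token_table(sample_lines, cat_to_omit={"CAT1"}))
```

If the same token can occur with different lemmas or categories, keying the dictionary on the tuple (token, lemma, category) instead may be a better fit.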

Resources