How to use regex in Pandas? - python-3.x

I have extracted the column names from a .csv file, and now I want to use a regex to capitalise the first letter of each word and the first letter after each _ character.
Example: loan_status -> Loan_Status
Loan_ID
loan_status
Principal
terms
effective_date
due_date
paid_off_time
past_due_days
age
education
Gender
This is what I have come up with so far: (^[a-z])+\w+
UPDATE
Thanks to Wiktor Stribiżew, this is what I came up with.
I am wondering if there is a more compact way to do the below.
import pandas as pd

dataFrame = pd.read_csv('Loan_payments_data_2020_unclean.csv')
columnsDict = {"columnName": list(dataFrame.columns)}
columnsDataFrame = pd.DataFrame(columnsDict)
replacedColumns = columnsDataFrame['columnName'].str.replace(
    r'(?<![^_]).', lambda x: x.group().upper(), regex=True)
dataFrame.columns = list(replacedColumns)
print(dataFrame)
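A more compact version (a sketch, using an in-memory frame in place of the CSV, and regex=True, which recent pandas requires for pattern replacement) skips the intermediate DataFrame and works on the columns Index directly:

```python
import pandas as pd

# stand-in for the columns read from the CSV
dataFrame = pd.DataFrame(columns=['Loan_ID', 'loan_status', 'effective_date'])

# rename in place: the Index has the same .str accessor as a Series
dataFrame.columns = dataFrame.columns.str.replace(
    r'(?<![^_]).', lambda m: m.group().upper(), regex=True)

print(list(dataFrame.columns))  # ['Loan_ID', 'Loan_Status', 'Effective_Date']
```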

You may use
>>> df = pd.DataFrame({'Loan_ID': ['loan_status','Principal','terms','effective_date','due_date','paid_off_time','past_due_days','age','education','Gender']})
>>> df['Loan_ID'].str.replace(r'(?<![^_]).', lambda x: x.group().upper(), regex=True)
0 Loan_Status
1 Principal
2 Terms
3 Effective_Date
4 Due_Date
5 Paid_Off_Time
6 Past_Due_Days
7 Age
8 Education
9 Gender
Name: Loan_ID, dtype: object
The (?<![^_]). regex matches any char other than a line break char that is either at the start of the string or appears immediately after a _ char. It is equivalent to the (?:(?<=^)|(?<=_)). regex.
Since you cannot manipulate the matched value from within a string replacement pattern, a callable is required as the replacement argument. lambda x: x.group().upper() just grabs the match value and turns it to upper case.
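The same pattern works with the standard re module outside pandas; a quick sketch:

```python
import re

# the char at the start of the string, or right after '_', is uppercased
print(re.sub(r'(?<![^_]).', lambda m: m.group().upper(), 'paid_off_time'))
# Paid_Off_Time
```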

Related

sorting a pandas Series not working correctly

I am trying to sort a given series in pandas, but the result is not what I expect: as far as I can tell it should be [1, 3, 5, 10, python].
Can you please explain on what basis it is sorted this way?
s1 = pd.Series(['1','3','python','10','5'])
s1.sort_values(ascending=True)
As explained in the comments, you have strings so '5' is greater than '10' (strings are compared character by character and '5' > '1').
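You can reproduce this lexicographic ordering in plain Python to see it in isolation:

```python
# strings compare character by character, so '10' sorts before '3' and '5'
print(sorted(['1', '3', 'python', '10', '5']))
# ['1', '10', '3', '5', 'python']
```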
One workaround is to use natsort for natural sorting:
from natsort import natsort_keygen
s1.sort_values(ascending=True, key=natsort_keygen())
output:
0 1
1 3
4 5
3 10
2 python
dtype: object
alternative without natsort (numbers first, strings after):
key = lambda s: (pd.concat([pd.to_numeric(s, errors='coerce')
                            .fillna(float('inf')), s], axis=1)
                 .agg(tuple, axis=1))
s1.sort_values(ascending=True, key=key)

Replace items like A2 as AA in the dataframe

I have a list of items like "A2BCO6" and "ABC2O6". I want to replace them as A2BCO6 --> AABCO6 and ABC2O6 --> ABCCO6. There are many more items than shown here.
My dataframe is like:
listAB:
Finctional_Group
0 Ba2NbFeO6
1 Ba2ScIrO6
3 MnPb2WO6
I created a duplicate array and tried to replace the items in the following way:
B = ["Ba2", "Pb2"]
C = ["BaBa", "PbPb"]
for i, j in range(len(B)), range(len(C)):
    listAB["Finctional_Group"] = listAB["Finctional_Group"].str.strip().str.replace(B[i], C[j])
But it does not produce correct output. The output is like:
listAB:
Finctional_Group
0 PbPbNbFeO6
1 PbPbScIrO6
3 MnPb2WO6
Please suggest the necessary correction in the code.
Many thanks in advance.
For simplicity I used the chemparse package, which seems to suit your needs.
As always we import the required packages, in this case chemparse and pandas.
import chemparse
import pandas as pd
Then we create a pandas.DataFrame object with your example data.
df = pd.DataFrame(
    columns=["Finctional_Group"], data=["Ba2NbFeO6", "Ba2ScIrO6", "MnPb2WO6"]
)
Our parser function will use chemparse.parse_formula, which returns a dict mapping each element to its frequency in a molecular formula.
def parse_molecule(molecule: str) -> str:
    # initialize an empty string
    molecule_in_string = ""
    # iterate over all keys & values in the dict
    for key, value in chemparse.parse_formula(molecule).items():
        # append each element `value` times to the string
        molecule_in_string += key * int(value)
    return molecule_in_string
molecule_in_string contains the molecule formula without numbers now. We just need to map this function to all elements in our dataframe column. For that we can do
df = df.applymap(parse_molecule)
print(df)
which returns:
  Finctional_Group
0   BaBaNbFeOOOOOO
1   BaBaScIrOOOOOO
2    MnPbPbWOOOOOO
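If you would rather avoid the extra dependency, the same expansion can be sketched with the standard re module (assuming counts are plain integers and element symbols are a capital letter optionally followed by a lowercase one):

```python
import re

def expand_formula(formula: str) -> str:
    # replace each "Element<count>" with the element repeated <count> times
    return re.sub(r'([A-Z][a-z]?)(\d+)',
                  lambda m: m.group(1) * int(m.group(2)),
                  formula)

print(expand_formula('Ba2NbFeO6'))  # BaBaNbFeOOOOOO
print(expand_formula('MnPb2WO6'))   # MnPbPbWOOOOOO
```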
Source code for chemparse: https://gitlab.com/gmboyer/chemparse

masking string and phone number for dataframe in python pandas

Here I am trying to mask a data frame/dataset which has columns with both integer and string values, like this:
sno,Name,Type 1,Type 2,phonenumber
1,Bulbasaur,Grass,Poison,9876543212
2,Ivysaur,Grass,Poison,9876543212
3,Venusaur,Grass,Poison,9876543212
This is the code I am using. The code below works fine for string values (it masks them well), but it does not mask integers:
import pandas as pd
filename = "path/to/file"
columnname= "phonenumber"
valuetomask = "9876543212"
column_dataset1 = pd.read_csv(filename)
print(column_dataset1)
# if(choice == "True"):
#masking for particular string/number in a column
column_dataset1[columnname]=column_dataset1[columnname].mask(column_dataset1[columnname] == valuetomask,"XXXXXXXXXX")
print(column_dataset1)
# masking last four digits
column_dataset1[columnname]=column_dataset1[columnname].str[:-4]+"****"
print(column_dataset1)
The above code works perfectly for strings, but when I give it the "phonenumber" column (or any integer column) it does not work.
Note: I need to do full masking (the whole value masked) and partial masking (i.e. the last three or first three digits/characters masked) for any file that is given.
Convert to str and replace last four digits:
>>> df['phonenumber'].astype(str).str.replace(r'\d{4}$', '****', regex=True)
0 987654****
1 987654****
2 987654****
Name: phonenumber, dtype: object
Which is the same as what @babakfifoo suggested:
>>> df['phonenumber'].astype(str).str[:-4] + '****'
0 987654****
1 987654****
2 987654****
Name: phonenumber, dtype: object
Convert your phone numbers to string and then try masking:
mask_len = 5  # number of digits to mask from the right
column_dataset1['phonenumber'] = (
    column_dataset1['phonenumber'].astype(str)  # convert to string
    .str[:-mask_len] + "*" * mask_len           # mask digits
)
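Putting both ideas together, here is a small sketch (with hypothetical data standing in for the CSV) covering full and partial masking on a numeric column:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Bulbasaur', 'Ivysaur'],
                   'phonenumber': [9876543212, 9876543212]})

s = df['phonenumber'].astype(str)            # integers must become strings first
full = s.str.replace(r'.', 'X', regex=True)  # full masking: every char
last4 = s.str[:-4] + '****'                  # mask last four digits
first3 = '***' + s.str[3:]                   # mask first three digits

print(full.iloc[0], last4.iloc[0], first3.iloc[0])
# XXXXXXXXXX 987654**** ***6543212
```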

Python: Using Pandas and Regex to Clean Phone Numbers with Country Code

I'm attempting to use pandas to clean phone numbers so that it returns only the 10 digit phone number and removes the country code if it is present and any special characters.
Here's some sample code:
phone_series = pandas.Series(['+1(192) 167-4056', '123-345-4568', '1233455678', '(123) 123-1234'])
r1 = '[^0-9]+'
phone_series.str.replace(r1, '')
Returns
0 11921674056
1 1233454568
2 1233455678
3 1231231234
As you can see, this regex works well except for the country code. And unfortunately, the system I'm loading into cannot accept the country code. What I'm struggling with is finding a regex that will strip the country code as well. All the regexes I've found match the 10 digits I need, and in this case, using pandas, I need to not match them.
I could easily write a function and use .apply but I feel like there is likely a simple regex solution that I'm missing.
Thanks for any help!
I don't think regex is necessary here, which is nice because regex is a pain in the buns.
To extend your current solution:
phone_series = pandas.Series(['+1(192) 167-4056', '123-345-4568', '1233455678', '(123) 123-1234'])
r1 = '[^0-9]+'
phone_series = phone_series.str.replace(r1, '', regex=True)
phone_series = phone_series.apply(lambda x: x[-10:])
My lazier solution:
>>> phone_series = pd.Series(['+1(192) 167-4056', '123-345-4568', '1233455678', '(123) 123-1234'])
>>> p2 = phone_series.apply(lambda x: ''.join([i for i in x if str.isnumeric(i)])[-10:])
>>> p2
0 1921674056
1 1233454568
2 1233455678
3 1231231234
dtype: object
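If you do want a pure-regex route, one sketch is to strip non-digits first and then drop a leading 1 only when exactly ten digits follow it:

```python
import pandas as pd

phone_series = pd.Series(['+1(192) 167-4056', '123-345-4568',
                          '1233455678', '(123) 123-1234'])

cleaned = (phone_series
           .str.replace(r'[^0-9]+', '', regex=True)         # keep digits only
           .str.replace(r'^1(?=\d{10}$)', '', regex=True))  # drop US country code

print(cleaned.tolist())
# ['1921674056', '1233454568', '1233455678', '1231231234']
```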

Pandas select rows where a value in a columns does not starts with a string

I have data where I need to filter out any rows that start with certain values - emphasis on the plural:
Below the data exactly as it appears in file data.xlsx
Name Remains
GESDSRPPZ0161 TRUE
RT6000996 TRUE
RT6000994 TRUE
RT6000467 TRUE
RT6000431 TRUE
MCOPSR0034 FALSE
MCOPSR0033 FALSE
I need to be able to return a dataframe where name DOES NOT start with MCO, GE,etc.
import pandas as pd
import numpy as np
### data
file = r'C:\Users\user\Desktop\data.xlsx'
data = pd.read_excel(file, na_values = '')
data['name'] = data['name'].str.upper()
prefixes = ['IM%','JE%','GE%','GV%','CHE%','MCO%']
new_data = data.select(lambda x: x not in prefixes)
new_data.shape
the last call returns exactly the same dataset as I started with.
I tried:
pandas select from Dataframe using startswith
but it excludes data if the string is elsewhere (not only starts with)
df = df[df['Column Name'].isin(['Value']) == False]
The above answer would work if I knew exactly the string in question, however it changes (the common part is MCOxxxxx, GVxxxxxx, GExxxxx...)
The very same happens with this one:
How to implement 'in' and 'not in' for Pandas dataframe
because the values I have to pass have to be exact. Is there any way to do with using the same logic as here (Is there any equivalent for wildcard characters like SQL?):
How do I select rows where a column value starts with a certain string?
Thanks for the help! Can we expand on the below, please?
@jezrael - although I've chosen the other solution for simplicity (and my lack of understanding of yours), I'd like to ask for a bit of explanation. What does '^' + '|^' do in this code, and how is it different from Wen's solution? How does it compare performance-wise, given the list-comprehension construct as opposed to an operation on the series like map or apply? If I understand correctly, contains() is not bothered with the location, whereas startswith() specifically looks at the beginning of the string. Does the ^ tell contains() to match at the beginning?
And is | another special character for the method, or is it treated as a logical OR? I'd really like to learn this if you don't mind sharing. Thanks
You can use startswith; the ~ in front inverts the condition, turning in into not in:
prefixes = ['IM','JE','GE','GV','CHE','MCO']
df[~df.Name.str.startswith(tuple(prefixes))]
Out[424]:
Name Remains
1 RT6000996 True
2 RT6000994 True
3 RT6000467 True
4 RT6000431 True
Use str.contains with ^ for start of string and filter by boolean indexing:
prefixes = ['IM','JE','GE','GV','CHE','MCO']
pat = '|'.join([r'^{}'.format(x) for x in prefixes])
df = df[~df['Name'].str.contains(pat)]
print (df)
Name Remains
1 RT6000996 True
2 RT6000994 True
3 RT6000467 True
4 RT6000431 True
Thanks, @Zero, for another solution:
df = df[~df['Name'].str.contains('^' + '|^'.join(prefixes))]
print (df)
Name Remains
1 RT6000996 True
2 RT6000994 True
3 RT6000467 True
4 RT6000431 True
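If a prefix could ever contain a regex metacharacter, it is safer to escape each one when building the pattern; a small self-contained sketch:

```python
import re
import pandas as pd

df = pd.DataFrame({'Name': ['GESDSRPPZ0161', 'RT6000996', 'MCOPSR0034']})
prefixes = ['IM', 'JE', 'GE', 'GV', 'CHE', 'MCO']

# escape each prefix and anchor it to the start of the string
pat = '|'.join('^' + re.escape(p) for p in prefixes)
kept = df[~df['Name'].str.contains(pat)]
print(kept['Name'].tolist())  # ['RT6000996']
```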
