masking string and phone number for dataframe in python pandas - python-3.x

Here I am trying to mask a data frame/dataset whose columns contain both integer and string values, like this:
sno,Name,Type 1,Type 2,phonenumber
1,Bulbasaur,Grass,Poison,9876543212
2,Ivysaur,Grass,Poison,9876543212
3,Venusaur,Grass,Poison,9876543212
This is the code I am using. It works fine for string values (they are masked correctly), but it does not mask integers:
import pandas as pd

filename = "path/to/file"
columnname = "phonenumber"
valuetomask = "9876543212"

column_dataset1 = pd.read_csv(filename)
print(column_dataset1)

# if(choice == "True"):
# masking a particular string/number in a column
column_dataset1[columnname] = column_dataset1[columnname].mask(
    column_dataset1[columnname] == valuetomask, "XXXXXXXXXX"
)
print(column_dataset1)

# masking the last four digits
column_dataset1[columnname] = column_dataset1[columnname].str[:-4] + "****"
print(column_dataset1)
The above code works perfectly for strings, but when I pass the "phonenumber" column (integer values) it does not work.
Note: I need both full masking (the whole value masked) and partial masking (i.e. the last three or the first three digits/characters of the values in the file above) for any file that is given.

Convert to str and replace last four digits:
>>> df['phonenumber'].astype(str).str.replace(r'\d{4}$', '****', regex=True)
0 987654****
1 987654****
2 987654****
Name: phonenumber, dtype: object
Which is the same as what @babakfifoo suggested:
>>> df['phonenumber'].astype(str).str[:-4] + '****'
0 987654****
1 987654****
2 987654****
Name: phonenumber, dtype: object

Convert your phone numbers to string and then try masking:
mask_len = 5  # number of digits to mask from the right side
column_dataset1['phonenumber'] = (
    column_dataset1['phonenumber'].astype(str)  # convert to string
    .str[:-mask_len] + "*" * mask_len           # mask the digits
)
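Since the question also asks for full masking on an integer column, one possible approach is to convert the column to strings first and then build whichever masked variant is needed. A minimal sketch (the column name and sample value come from the question; the new column names are only for illustration):
import pandas as pd

df = pd.DataFrame({"phonenumber": [9876543212, 9876543212, 9876543212]})
s = df["phonenumber"].astype(str)  # work on strings so integer columns mask too

# full masking: replace every character with "X"
df["full_masked"] = s.str.replace(r".", "X", regex=True)

# partial masking: hide the last three characters
df["last3_masked"] = s.str[:-3] + "***"

# partial masking: hide the first three characters
df["first3_masked"] = "***" + s.str[3:]

print(df)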

Related

Replace items like A2 as AA in the dataframe

I have a list of items, like "A2BCO6" and "ABC2O6". I want to replace them like this: A2BCO6 --> AABCO6 and ABC2O6 --> ABCCO6. The number of items is much larger than shown here.
My dataframe is like:
listAB:
Finctional_Group
0 Ba2NbFeO6
1 Ba2ScIrO6
3 MnPb2WO6
I created a duplicate array and tried to do the replacement the following way:
B = ["Ba2", "Pb2"]
C = ["BaBa", "PbPb"]
for i, j in range(len(B)), range(len(C)):
    listAB["Finctional_Group"] = listAB["Finctional_Group"].str.strip().str.replace(B[i], C[j])
But it does not produce the correct output. The output is:
listAB:
Finctional_Group
0 PbPbNbFeO6
1 PbPbScIrO6
3 MnPb2WO6
Please suggest the necessary correction in the code.
Many thanks in advance.
For simplicity I used the chemparse package, which seems to suit your needs.
As always we import the required packages, in this case chemparse and pandas.
import chemparse
import pandas as pd
Then we create a pandas.DataFrame object with your example data.
df = pd.DataFrame(
    columns=["Finctional_Group"], data=["Ba2NbFeO6", "Ba2ScIrO6", "MnPb2WO6"]
)
Our parser function will use chemparse.parse_formula, which returns a dict of the elements and their frequencies in a molecular formula.
def parse_molecule(molecule: str) -> str:
    # initialize an empty string
    molecule_in_string = ""
    # iterate over all keys & values in the dict
    for key, value in chemparse.parse_formula(molecule).items():
        # append each element repeated by its count
        molecule_in_string += key * int(value)
    return molecule_in_string
molecule_in_string now contains the molecular formula without numbers. We just need to map this function to all elements in our dataframe column. For that we can do
df = df.applymap(parse_molecule)
print(df)
which returns:
  Finctional_Group
0   BaBaNbFeOOOOOO
1   BaBaScIrOOOOOO
2    MnPbPbWOOOOOO
Source code for chemparse: https://gitlab.com/gmboyer/chemparse
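If you only need the specific substitutions from the question rather than expanding every element, the original loop can also be repaired by pairing B and C with zip. The two-range unpacking in the question always yields i = 0 and j = 1, which is why every "Ba2" was replaced with "PbPb". A sketch reusing the names from the question:
B = ["Ba2", "Pb2"]
C = ["BaBa", "PbPb"]
for old, new in zip(B, C):
    listAB["Finctional_Group"] = listAB["Finctional_Group"].str.strip().str.replace(old, new, regex=False)
With the sample data this should give BaBaNbFeO6, BaBaScIrO6 and MnPbPbWO6.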

How to use regex in Pandas?

I have extracted the column names from a .csv file and now I want to use a regex expression in order to capitalise the first letter of the word and the first letter after the _ character.
Example: loan_status -> Loan_Status
Loan_ID
loan_status
Principal
terms
effective_date
due_date
paid_off_time
past_due_days
age
education
Gender
This is what I have come up with so far: (^[a-z])+\w+
UPDATE
Thanks to Wiktor Stribiżew, this is what I came up with.
I am wondering if there is a more compact way to do the below.
import csv
import pandas as pd
import re
dataFrame = pd.read_csv('Loan_payments_data_2020_unclean.csv')
columnsDict = {"columnName": list(dataFrame.columns)}
columnsDataFrame = pd.DataFrame(columnsDict)
replacedColumns = columnsDataFrame['columnName'].str.replace(r'(?<![^_]).', lambda x: x.group().upper(), regex=True)
dataFrame.columns = list(replacedColumns)
print(dataFrame)
You may use
>>> df = pd.DataFrame({'Loan_ID': ['loan_status','Principal','terms','effective_date','due_date','paid_off_time','past_due_days','age','education','Gender']})
>>> df['Loan_ID'].str.replace(r'(?<![^_]).', lambda x: x.group().upper(), regex=True)
0 Loan_Status
1 Principal
2 Terms
3 Effective_Date
4 Due_Date
5 Paid_Off_Time
6 Past_Due_Days
7 Age
8 Education
9 Gender
Name: Loan_ID, dtype: object
The (?<![^_]). regex matches any character other than a line break that is either at the start of the string or appears immediately after a _ character. It is equivalent to the (?:(?<=^)|(?<=_)). regex; see its demo online.
Since you cannot manipulate the matched value from within a string replacement pattern, a callable is required as the replacement argument. lambda x: x.group().upper() just grabs the match value and turns it to upper case.
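As for the "more compact way" asked about in the update: the intermediate columnsDataFrame can most likely be dropped by applying the same replacement to the column index directly, since an Index also exposes the .str accessor. A sketch reusing the question's file name (regex=True is spelled out because recent pandas versions require it when the replacement is a callable):
import pandas as pd

dataFrame = pd.read_csv('Loan_payments_data_2020_unclean.csv')
dataFrame.columns = dataFrame.columns.str.replace(
    r'(?<![^_]).', lambda m: m.group().upper(), regex=True
)
print(dataFrame)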

Using pandas manipulate number format

Just out of curiosity: I have a name list with phone numbers in a CSV file, and I want to change these phone numbers from ########### (11 digits) to the format ###-####-####, adding two hyphens, between the 3rd and 4th digits and between the 7th and 8th digits.
Is this possible?
If it's a DataFrame, you can use apply with a format string:
df
num
0 09187543839
1 08745763412
df.num = df.num.apply(lambda x: "{}-{}-{}".format(x[:3], x[3:7], x[7:]))
df
num
0 091-8754-3839
1 087-4576-3412
Yes, it is possible. Below is a code-snippet that accomplishes what you want:
phone = str(55512354567)
print(f'{phone[:3]}-{phone[3:7]}-{phone[7:]}')
You can adapt the above idea to your Pandas dataframe as shown below:
# Sample data
data_df = pd.DataFrame([[55512345678], [55587654321]], columns=['phone'])
# Create a string column
data_df['phone_str'] = data_df['phone'].map(lambda x: str(x))
# Convert the column values to the right format
data_df['phone_str'] = data_df['phone_str'].map(lambda x: f'{x[:3]}-{x[3:7]}-{x[7:]}')
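For reference, printing data_df after these two steps should show the original and the formatted column side by side (a quick sanity check on the sample data above):
print(data_df)
#          phone      phone_str
# 0  55512345678  555-1234-5678
# 1  55587654321  555-8765-4321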
I may not be using pandas but this could potentially work...
n = 3
n1 = 7
number = "12345678901"
l, m, r = number[:n], number[n:n1], number[n1:]
final = l + "-" + m + "-" + r
print(final)
Output:
123-4567-8901
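One more option, in case a single vectorized call is preferred: keep the numbers as strings so a leading zero is not lost (for example by passing dtype=str to read_csv) and let a regex insert the hyphens. A sketch using the sample values from the first answer:
import pandas as pd

df = pd.DataFrame({"num": ["09187543839", "08745763412"]})
df["num"] = df["num"].str.replace(
    r"^(\d{3})(\d{4})(\d{4})$", r"\1-\2-\3", regex=True
)
print(df)  # 091-8754-3839, 087-4576-3412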

How to convert Excel negative value to Pandas negative value

I am a beginner in Python pandas. I am working on a data set named fortune_company; the data set looks like the one below.
In this data set, the Profits_In_Million column has some negative values, which are indicated by red color and parentheses.
But in pandas they show up as in the screenshot below.
I was trying to convert the data type of the Profits_In_Million column using the code below:
import pandas as pd
fortune.Profits_In_Million = fortune.Profits_In_Million.str.replace("$","").str.replace(",","").str.replace(")","").str.replace("(","-").str.strip()
fortune.Profits_In_Million.astype("float")
But I am getting the error below. Could someone please help me with that? How can I convert this string datatype to float?
ValueError: could not convert string to float: '-'
Assuming you have no control over the cell format in Excel, the converters kwarg of read_excel can be used:
converters : dict, default None
Dict of functions for converting values in certain columns. Keys can
either be integers or column labels, values are functions that take
one input argument, the Excel cell content, and return the transformed
content.
From read_excel's docs.
def negative_converter(x):
    # a somewhat naive implementation
    if '(' in x:
        x = '-' + x.strip('()')
    return x
df = pd.read_excel('test.xlsx', converters={'Profits_In_Million': negative_converter})
print(df)
# Profits_In_Million
# 0 $1000
# 1 -$1000
Note however that the values of this column are still strings and not numbers (int/float). You can quite easily implement the conversion in negative_converter (remove the dollar sign, and most probably the comma as well), for example:
def negative_converter(x):
    # a somewhat naive implementation
    x = x.replace('$', '')
    if '(' in x:
        x = '-' + x.strip('()')
    return float(x)
df = pd.read_excel('test.xlsx', converters={'Profits_In_Million': negative_converter})
print(df)
# Profits_In_Million
# 0 1000.0
# 1 -1000.0
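Since the question's own replace chain also strips commas, the converter probably needs to handle thousands separators as well before calling float; a small extension of the sketch above:
def negative_converter(x):
    # strip the currency symbol and thousands separators before parsing
    x = x.replace('$', '').replace(',', '')
    if '(' in x:
        x = '-' + x.strip('()')
    return float(x)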

Use the startswith function for numbers in Python

I have a column vector with numbers and characters, like this:
Data
123456
789101
159482
Airplane
Car
Blue
159874
I need to filter just the numeric values.
I tried to use a Data.int.startswith function, but I believe that function doesn't exist.
Thanks.
Not sure exactly what you are asking, but if you mean that you want to filter out a list of ints from the string, you can do the following:
string = """Data
123456
789101
159482
Airplane
Car
Blue
159874""" #The data you provided
def isInt(s):  # returns True if the string is an int
    try:
        int(s)
        return True
    except ValueError:
        return False

print([int(i) for i in string.splitlines() if isInt(i)])  # loop through the lines, keeping those that are integers
This will return the following list:
[123456, 789101, 159482, 159874]
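Since the question mentions a DataFrame column, a pandas-only way to keep the numeric rows would be to test each value with str.isdigit (a sketch assuming the column is called Data, as in the question):
import pandas as pd

df = pd.DataFrame({"Data": ["123456", "789101", "159482",
                            "Airplane", "Car", "Blue", "159874"]})

# keep only the rows whose value consists of digits
numeric_rows = df[df["Data"].astype(str).str.isdigit()]
print(numeric_rows)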
