Hash each row of pandas dataframe column using apply - python-3.x

I'm trying to hash each value of a python 3.6 pandas dataframe column with the following algorithm on the dataframe-column ORIG:
HK_ORIG = base64.b64encode(hashlib.sha1(str(df.ORIG).encode("UTF-8")).digest())
However, the above-mentioned code does not hash each value of the column, so, in order to hash each value of the df-column ORIG, I need to use the apply function. Unfortunately, I don't seem to be good enough to get this done.
I imagine it to look like the following code:
df["HK_ORIG"] = str(df['ORIG']).encode("UTF-8")).apply(hashlib.sha1)
I'm looking very much forward to your answers!
Many thanks in advance!

You can either create a named function and apply it, or apply a lambda function. In either case, do as much processing as possible within the dataframe.
A lambda-based solution:
df['ORIG'].astype(str).str.encode('UTF-8')\
.apply(lambda x: base64.b64encode(hashlib.sha1(x).digest()))
A named function solution:
def hashme(x):
    return base64.b64encode(hashlib.sha1(x).digest())
df['ORIG'].astype(str).str.encode('UTF-8')\
    .apply(hashme)
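For reference, here is a minimal self-contained sketch of the same idea, assigning the result to the HK_ORIG column from the question. The sample values and the helper name sha1_b64 are made up for illustration; it also decodes the base64 bytes to text, which the snippets above do not do.

```python
import base64
import hashlib

import pandas as pd

# Hypothetical sample data; only the column name ORIG comes from the question.
df = pd.DataFrame({"ORIG": ["abc", "def", 123]})


def sha1_b64(value):
    """Return the base64-encoded SHA-1 digest of str(value) as text."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")


# apply() calls sha1_b64 once per cell, so every row gets its own hash.
df["HK_ORIG"] = df["ORIG"].apply(sha1_b64)
print(df)
```

Dropping the final .decode("ascii") keeps the values as bytes, matching the output of the answers above.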

Related

Sum all counts when their fuzz.WRatio > 90 otherwise leave intact

What I want to do is group all similar strings in one column and sum their
corresponding counts if they are similar; otherwise, leave them as they are.
This is a little similar to this post, but I have not been able to apply it to my case:
How to group Pandas data frame by column with regex match
I ended up with the following steps:
I wrote a function that prints out the fuzz.WRatio for each row's string.
Each row does a linear search from the top to check whether there are other similar
strings in the rest of the rows. If the WRatio > 90, I would like to sum these rows'
corresponding counts. Otherwise, leave them there.
I created test data that looks like this:
test_data=pd.DataFrame({
    'name':['Apple.Inc.','apple.inc','APPLE.INC','OMEGA'],
    'count':[4,3,2,6]
})
So what I want is a resulting dataframe like:
result=pd.DataFrame({
    'Nname':['Apple.Inc.','OMEGA'],
    'Ncount':[9,6]
})
My function so far only gives me the fuzz ratio for each row,
and my understanding is that
each row is compared three times (here we have four rows).
So my function output looks like:
pd.DataFrame({
    'Nname':['Apple.Inc.','Apple.Inc.','Apple.Inc.','apple.inc',\
             'apple.inc','apple.inc'],
    'Ncount':[4,4,4,3,3,3],
    'FRatio': [100,100,100,100,100,100] })
This is just one portion of the whole output from the function I wrote with this test data.
And the last row "OMEGA" would give me a fuzz ratio about 18.
My function is like this:
def checkDupTitle2(data):
    Nname=[]
    Ncount=[]
    f_ratio=[]
    for i in range(0, len(data)):
        current=0
        count=0
        space=0
        for space in range(0, len(data)-1-current):
            ratio=fuzz.WRatio(str(data.loc[i]['name']).strip(), \
                              str(data.loc[current+space]['name']).strip())
            Nname.append(str(data.loc[i]['name']).strip())
            Ncount.append(str(data.loc[i]['count']).strip())
            f_ratio.append(ratio)
    df=pd.DataFrame({
        'Nname': Nname,
        'Ncount': Ncount,
        'FRatio': f_ratio
    })
    return df
After running this function and getting the output,
I tried to get what I eventually want.
Here I tried a group by on the df created above:
output.groupby(output.FRatio>90).sum()
But this way, I still need a "name" in my dataframe.
How can I decide which name to use for the total count, say, 9 here:
"Apple.Inc" or "apple.inc" or "APPLE.INC"?
Or did I make it too complex?
Is there a way to group by "name" at the very start and treat "Apple.Inc.", "apple.inc" and "APPLE.INC" as the same? Then my problem would be solved. I have been stumped for quite a while. Any help would be highly
appreciated! Thanks!
The following code uses my library RapidFuzz instead of FuzzyWuzzy, since it is faster and has a process method, extractIndices, which helps here. This solution is quite a bit faster, but since I do not work with pandas regularly, I am sure there are still some things that could be improved :)
import pandas as pd
from rapidfuzz import process, utils
def checkDupTitle(data):
    values = data.values.tolist()
    companies = [company for company, _ in values]
    pcompanies = [utils.default_process(company) for company in companies]
    counts = [count for _, count in values]
    results = []
    while companies:
        company = companies.pop(0)
        pcompany = pcompanies.pop(0)
        count = counts.pop(0)
        duplicates = process.extractIndices(
            pcompany, pcompanies,
            processor=None, score_cutoff=90, limit=None)
        for (i, _) in sorted(duplicates, reverse=True):
            count += counts.pop(i)
            del pcompanies[i]
            del companies[i]
        results.append([company, count])
    return pd.DataFrame(results, columns=['Nname','Ncount'])
test_data=pd.DataFrame({
    'name':['Apple.Inc.','apple.inc','APPLE.INC','OMEGA'],
    'count':[4,3,2,6]
})
checkDupTitle(test_data)
The result is
pd.DataFrame({
    'Nname':['Apple.Inc.','OMEGA'],
    'Ncount':[9,6]
})
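If installing a fuzzy-matching library is not an option, the same greedy grouping can be sketched with the standard library. Note the swap: difflib.SequenceMatcher is used here in place of fuzz.WRatio, so scores are on a 0 to 1 scale and are not equivalent to WRatio scores. The function names normalize and group_similar and the 0.9 threshold are assumptions made for this illustration.

```python
from difflib import SequenceMatcher

import pandas as pd


def normalize(name):
    # Lowercase and drop surrounding whitespace and dots before comparing.
    return str(name).strip().strip(".").lower()


def group_similar(data, threshold=0.9):
    """Sum counts of rows whose normalized names are at least `threshold` similar."""
    groups = []  # each entry: [first name seen, normalized name, summed count]
    for name, count in zip(data["name"], data["count"]):
        key = normalize(name)
        for group in groups:
            if SequenceMatcher(None, key, group[1]).ratio() >= threshold:
                group[2] += count  # similar to an existing group: add the count
                break
        else:
            groups.append([name, key, count])  # no match: start a new group
    return pd.DataFrame(
        [(g[0], g[2]) for g in groups], columns=["Nname", "Ncount"]
    )


test_data = pd.DataFrame({
    "name": ["Apple.Inc.", "apple.inc", "APPLE.INC", "OMEGA"],
    "count": [4, 3, 2, 6],
})
result = group_similar(test_data)
print(result)
```

The first spelling encountered is kept as the group's name, which is one simple answer to the question of which name to show for the summed count.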

Iterate over 4 pandas data frame columns and store them into 4 lists with one for loop instead of 4 for loops

I am currently working on a pandas structure in Python. I wrote a function that extracts data from a pandas data frame and stores it in lists. The code is working, but I feel like there is a part that I could write in one for loop instead of four for loops. I will give you an example below. The idea of this part of the code is to extract four columns from a pandas data frame into four lists. I did it with 4 separate for loops, but I want to have one loop that does the same thing.
col1,col2,col3,col4 = [],[],[],[]
for j in abc['col1']:
    col1.append(j)
for k in abc['col2']:
    col2.append(k)
for l in abc['col3']:
    col3.append(l)
for n in abc['col4']:
    col4.append(n)
My idea is to write one for loop that does all of this. I tried to do something like the following, but it doesn't work.
col1,col2,col3,col4 = [],[],[],[]
for j,k,l,n in abc[['col1','col2','col3','col4']]:
    col1.append(j)
    col2.append(k)
    col3.append(l)
    col4.append(n)
Can you help me with this idea to wrap four for loops into the one? I would appreciate your help!
You don't need to use loops at all; you can just convert each column into a list directly.
list_1 = df["col"].to_list()
Have a look at this previous question.
Treating a pandas dataframe like a list usually works, but is very bad for performance. I'd consider using the iterrows() function instead.
This would work as in the following example:
col1,col2,col3,col4 = [],[],[],[]
for index, row in df.iterrows():
    col1.append(row['col1'])
    col2.append(row['col2'])
    col3.append(row['col3'])
    col4.append(row['col4'])
It's probably easier to use the .values attribute and then numpy.ndarray.tolist():
col = ['col1','col2','col3']
data = [None]*len(col)  # pre-size the list so data[i] can be assigned
for i in range(len(col)):
    data[i] = df[col[i]].values.tolist()
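As a compact sketch of the loop-free approach suggested in the answers above, the four lists can be built in a single statement. The sample values are hypothetical; only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample frame using the column names from the question.
df = pd.DataFrame({
    "col1": [1, 2],
    "col2": ["a", "b"],
    "col3": [0.1, 0.2],
    "col4": [True, False],
})
columns = ["col1", "col2", "col3", "col4"]

# One list per column, unpacked into four names in a single statement.
col1, col2, col3, col4 = (df[c].tolist() for c in columns)

# Alternative: a dict mapping each column name to its list of values.
as_dict = df[columns].to_dict(orient="list")
print(col1, col2, col3, col4)
```

The dict form tends to scale better when the number of columns is not fixed, since no variable names have to be written out by hand.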

replacing a special character in a pandas dataframe

I have a dataset that uses '?' instead of 'NaN' for missing values. I could have gone through each column using replace, but the only problem is that I have 22 columns. I am trying to create a loop to do it efficiently, but I am getting it wrong. Here is what I am doing:
for col in adult.columns:
    if adult[col]=='?':
        adult[col]=adult[col].str.replace('?', 'NaN')
The plan is to use the 'NaN' then use the fillna function or to drop them with dropna. The second problem is that not all the columns are categorical so the str function is also wrong. How can I easily deal with this situation?
If you're reading the data from a .csv or .xlsx file you can use the na_values parameter:
adult = pd.read_csv('path/to/file.csv', na_values=['?'])
Otherwise do what @MasonCaiby said and use adult.replace('?', float('nan'))
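A small sketch of the replace approach on a made-up frame follows. The column names and values are hypothetical, as is the "Unknown" fill value. DataFrame.replace works across all columns at once, so no loop is needed and non-string columns are left alone.

```python
import numpy as np
import pandas as pd

# Hypothetical data: '?' marks missing values in some text columns.
adult = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["Private", "?", "State-gov"],
    "occupation": ["?", "Sales", "Tech-support"],
})

# Replace every cell equal to '?' with a real missing value.
cleaned = adult.replace("?", np.nan)
print(cleaned.isna().sum())

# Afterwards the usual missing-data tools apply.
dropped = cleaned.dropna()          # keep only complete rows
filled = cleaned.fillna("Unknown")  # or fill with a placeholder
```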

How to extract several dataframes from dictionary

I am currently trying to extract several dataframes from a dictionary. The problem is that the number of dataframes will vary: sometimes I'll have two dataframes in there and sometimes 30.
At the beginning I create a dictionary (dict_of_exceptions) from a dataframe (exceptions_df). In this dictionary I'll have several dataframes depending on how many different 'Source Wells' I have. With the current code I can extract the first dataframe from the dictionary which is j:
dict_of_exceptions = {k: v for k, v in exceptions_df.groupby('Source Well') }
print (dict_of_exceptions)
for k in dict_of_exceptions.keys():
    j = dict_of_exceptions[k]
Could someone help me modify the last line to go through the dictionary and extract each dataframe (and name them like the corresponding key)?
I think I get your intention, although I could not really read it from your code. Currently, as @cyrilb38 stated in the comments, your loop overwrites j on every iteration, so you only ever see the result of the last iteration. Rather than transforming the data into a dict, work with the dataframe directly; I think (I may be wrong) that what you call a dataframe is really a subset of rows. Converting a groupby object into a dict is probably not what you want, or it just prolongs the process for nothing.
If you want to see the info of Well X only for example, try this
exceptions_df[exceptions_df['Source Well'] == 'Well X']
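If a separate dataframe per well really is needed, one option is to keep them in the dict and look them up by key rather than creating one variable per well. The following is a rough sketch; the Volume column and the well names are invented for illustration.

```python
import pandas as pd

# Hypothetical data; only the 'Source Well' column name comes from the question.
exceptions_df = pd.DataFrame({
    "Source Well": ["A1", "A1", "B2", "C3"],
    "Volume": [10, 20, 30, 40],
})

# One sub-dataframe per well, keyed by the well name.
dict_of_exceptions = {
    well: group for well, group in exceptions_df.groupby("Source Well")
}

# items() yields every key together with its dataframe, so nothing is overwritten.
for well, group in dict_of_exceptions.items():
    print(well, len(group))

# A single well is then just a dictionary lookup.
a1 = dict_of_exceptions["A1"]
```

This works unchanged whether the data contains two wells or thirty.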

Iterating through dataframe columns and using .apply() gives KeyError

I'm trying to normalize my features by using .apply() iteratively on all columns of the dataframe, but it gives a KeyError. Can someone help?
I've tried using the code below, but it doesn't work:
for x in df.columns:
    df[x+'_norm'] = df[x].apply(lambda x:(x-df[x].mean())/df[x].std())
I don't think it's a good idea to call mean and std inside the apply: you recalculate them for every row. Note also that the lambda parameter x shadows the loop variable x, so df[x] inside the lambda looks up a cell value as a column label, which is what raises the KeyError. Instead, calculate them once at the beginning of each loop iteration and use the results in the apply function, like below:
for x in df.columns:
    mean = df[x].mean()
    std = df[x].std()
    df[x+'_norm'] = df[x].apply(lambda y:(y-mean)/std)
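The same normalization can also be written without apply or an explicit loop, since pandas arithmetic aligns on column labels. This is a sketch on a hypothetical two-column numeric frame, assuming every column is numeric:

```python
import pandas as pd

# Hypothetical numeric data.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 40.0]})

# df.mean() and df.std() return one value per column; the subtraction and
# division are aligned by column label, so each column uses its own statistics.
normalized = (df - df.mean()) / df.std()

# Append the results under the '<column>_norm' names used above.
df = df.join(normalized.add_suffix("_norm"))
print(df)
```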
