pandas effective way to fuzzy match values in 2 DataFrames - python-3.x

What is a better approach for fuzzy matching values across 2 different DataFrames on 2 different columns? The current approach uses a nested loop, but it is very slow.
This is the code I use:
import re
import pandas as pd
from fuzzywuzzy import fuzz

# td_df - DataFrame with 12k+ rows and 46+ columns
# tr_df - DataFrame with 45k+ rows and 75+ columns
for td_id, td in td_df.iterrows():
    td_target = td['Target(s)']
    td_buyer = td['Buyer(s)']
    td_announcement_date = td['Announcement Date']
    for i, tr in tr_df.iterrows():
        tr_target = tr['Target Name']
        tr_buyer = tr['Acquiror Name']
        tr_announcement_date = tr[' Date\nAnnounced']
        date_delta = (td_announcement_date.date() - tr_announcement_date.date()).days
        date_delta = max(date_delta, date_delta * -1)  # absolute value
        if date_delta <= 30:
            target_ratio = fuzz.ratio(
                re.sub(r"\.|\,|:", '', str(td_target).lower()),
                re.sub(r"\.|\,|:", '', str(tr_target).lower()))
            buyer_ratio = fuzz.ratio(
                re.sub(r"\.|\,|:", '', str(td_buyer).lower()),
                re.sub(r"\.|\,|:", '', str(tr_buyer).lower()))
            if target_ratio > 90 and buyer_ratio > 90:
                match_df = match_df.append(tr, ignore_index=True)[['Target Name', 'Acquiror Name', ' Date\nAnnounced']]
                match_df['Matched TD ID'] = td_id
                match_df['target_ratio'] = target_ratio
                match_df['buyer_ratio'] = buyer_ratio
                match_df['tr_id'] = i
                match_df['TD Target Name'] = td_target
                match_df['TD Acquiror Name (Buyer)'] = td_buyer
                match_df.to_sql(name='matches', if_exists='append', con=conn)
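One common way to speed this up is to stop running the fuzzy comparison on every pair: filter the 45k tr rows down to the 30-day date window with a vectorized mask first, and clean each td string once per outer row, so fuzz.ratio only runs on the few surviving candidates. A rough sketch of that idea (column names are taken from the code above; the clean helper and matches list are illustrative):

import re
import pandas as pd
from fuzzywuzzy import fuzz

def clean(s):
    # same normalization as above: lowercase and drop . , :
    return re.sub(r"[.,:]", '', str(s).lower())

matches = []
tr_dates = tr_df[' Date\nAnnounced']
for td_id, td in td_df.iterrows():
    # vectorized filter: keep only tr rows within 30 days of this deal
    mask = (tr_dates - td['Announcement Date']).abs().dt.days <= 30
    td_target, td_buyer = clean(td['Target(s)']), clean(td['Buyer(s)'])
    for i, tr in tr_df[mask].iterrows():
        if (fuzz.ratio(td_target, clean(tr['Target Name'])) > 90 and
                fuzz.ratio(td_buyer, clean(tr['Acquiror Name'])) > 90):
            matches.append((td_id, i))

Swapping fuzzywuzzy for rapidfuzz, which exposes the same ratio API, is another common drop-in speed-up.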

Related

How to create variables based on multiple conditions?

In the code below, I need to create two variables, namely flag1 and flag2, based on multiple conditions. I used the np.select approach shown below, but I wonder what other ways there are to do this. In my real work situation there would be more conditions for creating the flags. Any advice or suggestions would be great.
import numpy as np
import pandas as pd
start_date = '2020-04-01'
end_date = '2020-05-01'
d1 = {'customer type': ['walk in', 'online app', 'phone app', 'referral'],
      'office visit': ['location1', 'location1', 'location1', 'location1'],
      'date1': ['2020-04-17', '2020-05-17', '2020-03-01', '2020-05-01'],
      'date2': ['2020-05-18', '2020-04-18', '2020-04-03', '2020-05-19']}
df1 = pd.DataFrame(data=d1)
con1 = [(df1['date1'] >= start_date) & (df1['date1'] < end_date)]
result1 = ['yes']
df1['flag1'] = np.select(con1, result1)
con2 = [(df1['date2'] >= start_date) & (df1['date2'] < end_date)]
result2 = ['yes']
df1['flag2'] = np.select(con2, result2)
You could use a dictionary whose keys are built dynamically from the variable names, and then look the column names up from the dictionary.
For example:
import numpy as np
import pandas as pd

start_date = '2020-04-01'
end_date = '2020-05-01'
flags = dict()
flag_string = 'flag'
# This creates the strings flag1 and flag2 automatically
for i in range(1, 3):
    # concatenate the flag_string with the index of the loop
    flags[flag_string + str(i)] = flag_string + str(i)
print(flags)
d1 = {'customer type': ['walk in', 'online app', 'phone app', 'referral'],
      'office visit': ['location1', 'location1', 'location1', 'location1'],
      'date1': ['2020-04-17', '2020-05-17', '2020-03-01', '2020-05-01'],
      'date2': ['2020-05-18', '2020-04-18', '2020-04-03', '2020-05-19']}
df1 = pd.DataFrame(data=d1)
con1 = [(df1['date1'] >= start_date) & (df1['date1'] < end_date)]
result1 = ['yes']
df1[flags['flag1']] = np.select(con1, result1)
con2 = [(df1['date2'] >= start_date) & (df1['date2'] < end_date)]
result2 = ['yes']
df1[flags['flag2']] = np.select(con2, result2)
This is how you can substitute dictionary values as variables. I've also included a for loop that builds your flag dictionary.
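Since both flags follow the same pattern, another option is to loop over the date columns directly and build each flag name on the fly; a sketch (not part of the original answer; '0' mirrors np.select's default):

for i, col in enumerate(['date1', 'date2'], start=1):
    # build the name flag1/flag2 and the condition from the matching date column
    df1[f'flag{i}'] = np.where(
        (df1[col] >= start_date) & (df1[col] < end_date), 'yes', '0')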

Format certain rows after writing to excel file

I have some code which compares two Excel files and determines any new rows (new_rows) added or any rows which were deleted (dropped_rows). It then uses xlsxwriter to write this to an Excel sheet. The bit of code I am having trouble with is supposed to iterate through the rows and, if the row was a new row or a dropped row, format it a certain way. For whatever reason this part of the code isn't working correctly and is being ignored.
I've tried a whole host of different syntax to make this work but no luck.
UPDATE
After some more trial and error, the issue seems to be caused by the index column. It is a Case Number column and the values have a prefix like "Case_123, Case_456, Case_789, etc.". This seems to be the root of the issue, but I'm not sure how to solve it.
grey_fmt = workbook.add_format({'font_color': '#E0E0E0'})
highlight_fmt = workbook.add_format({'font_color': '#FF0000', 'bg_color': '#B1B3B3'})
new_fmt = workbook.add_format({'font_color': '#32CD32', 'bold': True})
# set format over range
## highlight changed cells
worksheet.conditional_format('A1:J10000', {'type': 'text',
                                           'criteria': 'containing',
                                           'value': '→',
                                           'format': highlight_fmt})
# highlight new/changed rows
for row in range(dfDiff.shape[0]):
    if row + 1 in newRows:
        worksheet.set_row(row + 1, 15, new_fmt)
    if row + 1 in droppedRows:
        worksheet.set_row(row + 1, 15, grey_fmt)
The last part, # highlight new/changed rows, is the bit that is not working. The conditional format portion works fine.
The rest of the code:
import pandas as pd
from pathlib import Path

def excel_diff(path_OLD, path_NEW, index_col):
    df_OLD = pd.read_excel(path_OLD, index_col=index_col).fillna(0)
    df_NEW = pd.read_excel(path_NEW, index_col=index_col).fillna(0)
    # Perform Diff
    dfDiff = df_NEW.copy()
    droppedRows = []
    newRows = []
    cols_OLD = df_OLD.columns
    cols_NEW = df_NEW.columns
    sharedCols = list(set(cols_OLD).intersection(cols_NEW))
    for row in dfDiff.index:
        if (row in df_OLD.index) and (row in df_NEW.index):
            for col in sharedCols:
                value_OLD = df_OLD.loc[row, col]
                value_NEW = df_NEW.loc[row, col]
                if value_OLD == value_NEW:
                    dfDiff.loc[row, col] = df_NEW.loc[row, col]
                else:
                    dfDiff.loc[row, col] = '{}→{}'.format(value_OLD, value_NEW)
        else:
            newRows.append(row)
    for row in df_OLD.index:
        if row not in df_NEW.index:
            droppedRows.append(row)
            dfDiff = dfDiff.append(df_OLD.loc[row, :])
    dfDiff = dfDiff.sort_index().fillna('')
    print(dfDiff)
    print('\nNew Rows: {}'.format(newRows))
    print('Dropped Rows: {}'.format(droppedRows))
    # Save output and format
    fname = '{} vs {}.xlsx'.format(path_OLD.stem, path_NEW.stem)
    writer = pd.ExcelWriter(fname, engine='xlsxwriter')
    dfDiff.to_excel(writer, sheet_name='DIFF', index=True)
    df_NEW.to_excel(writer, sheet_name=path_NEW.stem, index=True)
    df_OLD.to_excel(writer, sheet_name=path_OLD.stem, index=True)
    # get xlsxwriter objects
    workbook = writer.book
    worksheet = writer.sheets['DIFF']
    worksheet.hide_gridlines(2)
    worksheet.set_default_row(15)
    # define formats
    date_fmt = workbook.add_format({'align': 'center', 'num_format': 'yyyy-mm-dd'})
    center_fmt = workbook.add_format({'align': 'center'})
    number_fmt = workbook.add_format({'align': 'center', 'num_format': '#,##0.00'})
    cur_fmt = workbook.add_format({'align': 'center', 'num_format': '$#,##0.00'})
    perc_fmt = workbook.add_format({'align': 'center', 'num_format': '0%'})
    grey_fmt = workbook.add_format({'font_color': '#E0E0E0'})
    highlight_fmt = workbook.add_format({'font_color': '#FF0000', 'bg_color': '#B1B3B3'})
    new_fmt = workbook.add_format({'font_color': '#32CD32', 'bold': True})
    # set format over range
    ## highlight changed cells
    worksheet.conditional_format('A1:J10000', {'type': 'text',
                                               'criteria': 'containing',
                                               'value': '→',
                                               'format': highlight_fmt})
    # highlight new/changed rows
    for row in range(dfDiff.shape[0]):
        if row + 1 in newRows:
            worksheet.set_row(row + 1, 15, new_fmt)
        if row + 1 in droppedRows:
            worksheet.set_row(row + 1, 15, grey_fmt)
    # save
    writer.save()
    print('\nDone.\n')

def main():
    path_OLD = Path('file1.xlsx')
    path_NEW = Path('file2.xlsx')
    # get index col from data
    df = pd.read_excel(path_NEW)
    index_col = df.columns[0]
    print('\nIndex column: {}\n'.format(index_col))
    excel_diff(path_OLD, path_NEW, index_col)

if __name__ == '__main__':
    main()
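Given the update above, a likely cause (and a sketch of a fix, not from the original post): newRows and droppedRows collect index labels such as 'Case_123', while the formatting loop compares them against integer positions, so row + 1 in newRows never matches. Translating labels to worksheet row positions first would look roughly like this:

# Hypothetical fix: look up each label's integer position in dfDiff.index,
# then format that worksheet row (+1 to skip the header row).
for label in newRows:
    worksheet.set_row(dfDiff.index.get_loc(label) + 1, 15, new_fmt)
for label in droppedRows:
    worksheet.set_row(dfDiff.index.get_loc(label) + 1, 15, grey_fmt)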

BeautifulSoup4 Returning Empty List when Attempting to Scrape a Table

I'm trying to pull the data from this url: https://www.winstonslab.com/players/player.php?id=98 and I keep getting the same error when I try to access the tables.
My scraping code is below. I run it, then hp = HTMLTableParser() and table = hp.parse_url('https://www.winstonslab.com/players/player.php?id=98')[0][1] raises the error 'index 0 is out of bounds for axis 0 with size 0'.
import requests
import pandas as pd
from bs4 import BeautifulSoup

class HTMLTableParser:
    def parse_url(self, url):
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'lxml')
        return [(table['id'], self.parse_html_table(table))
                for table in soup.find_all('table')]

    def parse_html_table(self, table):
        n_columns = 0
        n_rows = 0
        column_names = []
        # Find number of rows and columns
        # we also find the column titles if we can
        for row in table.find_all('tr'):
            # Determine the number of rows in the table
            td_tags = row.find_all('td')
            if len(td_tags) > 0:
                n_rows += 1
                if n_columns == 0:
                    # Set the number of columns for our table
                    n_columns = len(td_tags)
            # Handle column names if we find them
            th_tags = row.find_all('th')
            if len(th_tags) > 0 and len(column_names) == 0:
                for th in th_tags:
                    column_names.append(th.get_text())
        # Safeguard on Column Titles
        if len(column_names) > 0 and len(column_names) != n_columns:
            raise Exception("Column titles do not match the number of columns")
        columns = column_names if len(column_names) > 0 else range(0, n_columns)
        df = pd.DataFrame(columns=columns,
                          index=range(0, n_rows))
        row_marker = 0
        for row in table.find_all('tr'):
            column_marker = 0
            columns = row.find_all('td')
            for column in columns:
                df.iat[row_marker, column_marker] = column.get_text()
                column_marker += 1
            if len(columns) > 0:
                row_marker += 1
        # Convert to float if possible
        for col in df:
            try:
                df[col] = df[col].astype(float)
            except ValueError:
                pass
        return df
If the data that you need is just the table, you can accomplish that with the pandas.read_html() function.
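A minimal sketch of that approach (the URL is the one from the question; read_html() returns a list with one DataFrame per table it can parse):

import pandas as pd

# Each parsed <table> on the page becomes one DataFrame in the list
tables = pd.read_html('https://www.winstonslab.com/players/player.php?id=98')
print(len(tables))
print(tables[0].head())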

Unable to retrieve data from frame

I am trying to retrieve specific data from a DataFrame with a particular condition, but it shows an empty DataFrame. I am new to data science and trying to learn. Here is my code.
file = open('/home/jeet/files1/files/ch03/adult.data', 'r')

def chr_int(a):
    if a.isdigit():
        return int(a)
    else:
        return 0

data = []
for line in file:
    data1 = line.split(',')
    if len(data1) == 15:
        data.append([chr_int(data1[0]), data1[1],
                     chr_int(data1[2]), data1[3],
                     chr_int(data1[4]), data1[5],
                     data1[6], data1[7], data1[8],
                     data1[9], chr_int(data1[10]),
                     chr_int(data1[11]),
                     chr_int(data1[12]),
                     data1[13], data1[14]])

import pandas as pd
df = pd.DataFrame(data)
df.columns = ['age', 'type-employer', 'fnlwgt', 'education', 'education_num',
              'marital', 'occupation', 'relationship', 'race', 'sex',
              'capital_gain', 'capital_loss', 'hr_per_week', 'country', 'income']
ml = df[(df.sex == 'Male')]   # here I retrieve the rows that are male
ml1 = df[(df.sex == 'Male') & (df.income == '>50K\n')]
print(ml1.head())             # here I print that data
fm = df[(df.sex == 'Female')]
fm1 = df[(df.sex == 'Female') & (df.income == '>50K\n')]
output:
Empty DataFrame
Columns: [age, type-employer, fnlwgt, education, education_num, marital, occupation, relationship, race, sex, capital_gain, capital_loss, hr_per_week, country, income]
Index: []
What's wrong with the code? Why is the DataFrame empty?
If you check the values carefully, you may see the problem:
print(df.income.unique())
>>> [' <=50K\n' ' >50K\n']
There is a space in front of each value. So the values should either be processed to get rid of these spaces, or the code should be modified like this:
ml1 = df[(df.sex == 'Male') & (df.income == ' >50K\n')]
fm1 = df[(df.sex == 'Female') & (df.income == ' >50K\n')]
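Alternatively, strip the whitespace once up front; a sketch (note that every field after split(',') keeps its leading space, so the sex comparisons are likely affected the same way):

# Strip the leading spaces and the trailing '\n' on income
df['sex'] = df['sex'].str.strip()
df['income'] = df['income'].str.strip()
ml1 = df[(df.sex == 'Male') & (df.income == '>50K')]
fm1 = df[(df.sex == 'Female') & (df.income == '>50K')]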

Python3, with pandas.DataFrame, how to select certain data by some rules to show

I have a pandas.DataFrame, and I want to select certain data by some rules.
The following code generates the DataFrame:
import datetime
import pandas as pd
import numpy as np

today = datetime.date.today()
dates = list()
for k in range(10):
    a_day = today - datetime.timedelta(days=k)
    dates.append(np.datetime64(a_day))

np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(10, 3)),
                  columns=('other1', 'actual', 'other2'),
                  index=['{}'.format(i) for i in range(10)])
df.insert(0, 'dates', dates)
df['err_m'] = np.random.rand(10, 1) * 0.1
df['std'] = np.random.rand(10, 1) * 0.05
df['gain'] = np.random.rand(10, 1)
Now, I want to select data by the following rules:
1. compute the sum of 'err_m' and 'std', then sort the df so that the sum is descending
2. from the result of step 1, select the part where 'actual' is > 50
Thanks
Create a new column and then sort by this one:
df['errsum'] = df['err_m'] + df['std']
# Return a sorted dataframe (sort_values replaces the long-removed DataFrame.sort)
df_sorted = df.sort_values('errsum', ascending=False)
Then select the lines you want:
# Create a boolean mask that is True where the condition is met
selector = df_sorted['actual'] > 50
# Return a view of the sorted dataframe with only the lines you want
df_sorted[selector]
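The same two steps can also be chained into a single expression; a sketch (the errsum name comes from the answer above):

# assign() adds the helper column without mutating df, then sort and filter in one chain
result = (df.assign(errsum=df['err_m'] + df['std'])
            .sort_values('errsum', ascending=False)
            .query('actual > 50'))
print(result)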
