Fuzzy logic for Excel data - Pandas - python-3.x

I have two dataframes: DF (~100k rows), which is a raw data file, and DF1 (~15k rows), which is a mapping file. I'm trying to match DF.Address and DF.Name against DF1.Address and DF1.Name. Once a match is found, DF1.ID should be populated in DF.ID (if DF1.ID is not None); otherwise DF1.top_ID should be populated in DF.ID.
I'm able to match the address and name with the help of fuzzy logic, but I'm stuck on how to connect the result obtained to populate the ID.
DF1 - mapping file
DF - raw data file
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from operator import itemgetter

df = pd.read_excel("Test1", index=False)
df1 = pd.read_excel("Test2", index=False)

# Only work on rows that still have no ID
df = df[df['ID'].isnull()]
zip_code = df['Zip'].tolist()
Facility_city = df['City'].tolist()
Address = df['Address'].tolist()
Name_list = df['Name'].tolist()

def fuzzy_match(x, choice, scorer, cutoff):
    return (process.extractOne(x,
                               choices=choice,
                               scorer=scorer,
                               score_cutoff=cutoff))

for pin, city, Add, Name in zip(zip_code, Facility_city, Address, Name_list):
    #====Address Matching=====#
    choice = df1.loc[(df1['Zip'] == pin) & (df1['City'] == city), 'Address1']
    result = fuzzy_match(Add, choice, fuzz.ratio, 70)
    #====Name Matching========#
    if (result is not None):
        if (result[3] > 70):
            choice_1 = (df1.loc[(df1['Zip'] == pin) & (df1['City'] == city), 'Name'])
            result_1 = (fuzzy_match(Name, choice_1, fuzz.ratio, 95))
            print(ID)
            if (result_1 is not None):
                if (result_1[3] > 95):
                    #Here populating the matching ID
                    print("ok")
                else:
                    continue
            else:
                continue
        else:
            continue
    else:
        continue  # no address match for this row
IIUC: Here is a solution:
from fuzzywuzzy import fuzz
import pandas as pd
#Read raw data from clipboard
raw = pd.read_clipboard()
#Read map data from clipboard
mp = pd.read_clipboard()
#Merge raw data and mp data as follows
dfr = mp.merge(raw, on=['Hospital Name', 'City', 'Pincode'], how='outer')
#dfr will have many duplicate rows - eliminate duplicates
#To eliminate duplicates using token_sort_ratio, compare Address_x and Address_y
dfr['SCORE'] = dfr.apply(lambda x: fuzz.token_sort_ratio(x['Address_x'], x['Address_y']), axis=1)
#Filter only max ratio rows grouped by Address_x
dfr1 = dfr.iloc[dfr.groupby('Address_x').apply(lambda x: x['SCORE'].idxmax())]
#dfr1 shall have the desired result
This link has sample data to test the solution provided.
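To finish the original task of populating DF.ID, here is a minimal sketch of the last step. It assumes the merged result dfr1 ends up carrying the mapping file's ID and top_ID columns under those names (taken from the question); if both files contribute an ID column, pandas will suffix them (e.g. ID_x/ID_y) and the names below need adjusting.
# Assumed column names from the question: use the mapping ID where present,
# otherwise fall back to top_ID.
dfr1['ID'] = dfr1['ID'].fillna(dfr1['top_ID'])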

Related

Altair/Vega-Lite heatmap: Filter top k

I am attempting to create a heatmap and retain only the top 5 samples based on the mean of the relative abundance. I am able to sort the heatmap properly, but I can't figure out how to retain only the top 5, in this case samples c, e, b, y, a. I am pasting a subset of the df with the image. I've tried myriad permutations of the "Top K Items Tutorial" link at the altair-viz website. I'd prefer to use altair to do the filtering if possible, as opposed to filtering the df itself in the python code.
Dataframe:
,Lowest_Taxon,p1,p2,p3,p4,p5,p6,p7
0,a,0.03241281,0.0,0.467738067,3.14456785,0.589519651,13.5744323,0.0
1,b,0.680669,9.315121951,2.848404893,13.99058458,2.139737991,16.60779366,7.574639383
2,c,40.65862829,1.244878049,71.01223315,4.82197541,83.18777293,0.0,0.0
3,d,0.0,0.0,0.0,0.548471137,0.272925764,0.925147183,0.0
4,e,0.090755867,13.81853659,5.205085152,27.75721011,1.703056769,19.6691898,12.27775914
5,f,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,g,0.187994295,0.027317073,0.0,0.0,0.0,0.02242781,0.0
7,h,0.16854661,0.534634146,1.217318302,7.271813154,1.73580786,0.57751612,0.57027843
8,i,0.142616362,2.528780488,1.163348525,0.34279446,0.0,0.0,0.0
9,j,1.711396344,0.694634146,0.251858959,4.273504274,0.087336245,1.721334455,0.899027172
10,k,0.0,1.475121951,0.0,0.0,0.0,5.573310906,0.0
11,l,0.194476857,0.253658537,1.517150396,2.413273002,0.949781659,5.147182506,1.650452868
12,m,0.0,1.736585366,0.0,0.063988299,0.0,8.42724979,0.623951694
13,n,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14,o,4.68689226,0.12097561,0.0,0.0,0.0,0.0,0.0
15,p,0.0,0.885853659,0.0,0.0,0.0,0.913933277,0.046964106
16,q,0.252819914,0.050731707,0.023986568,0.0,0.087336245,0.0,0.0
17,r,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18,s,0.0,0.0,0.0,0.0,0.0,0.0,0.0
19,t,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20,u,0.0,0.058536585,0.089949628,0.356506239,0.0,0.285954584,1.17410265
21,v,0.0,0.0,0.0,0.0,0.0,0.0,0.0
22,w,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23,x,1.471541553,2.396097561,0.593667546,0.278806161,0.065502183,0.280347631,0.952700436
24,y,0.0,0.32,0.0,0.461629873,0.0,7.804878049,18.38980208
25,z,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Code block:
import pandas as pd
import numpy as np
import altair as alt
from vega_datasets import data
from altair_saver import save

# Read in the file and fill empty cells with zero
df = pd.read_excel("path\to\df")
doNotMelt = df.drop(df.iloc[:, 1:], axis=1)
df_melted = pd.melt(df, id_vars=doNotMelt, var_name='SampleID', value_name='Relative_abundance')

# Tell altair to plot as many rows as is necessary
alt.data_transformers.disable_max_rows()

alt.Chart(df_melted).mark_rect().encode(
    alt.X('SampleID:N'),
    alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
    alt.Color('Relative_abundance:Q')
)
If you know that what you want to show is the entries for c, e, b, y and a (and that this will not change later), you can simply apply a transform_filter on the field Lowest_Taxon.
If you want to calculate on the spot which entries make it into the top five, it takes a bit more effort, i.e. a combination of joinaggregate, window and filter transforms.
I paste an example of both below. By the way, I converted the original data that you pasted into a csv file, which is imported by the code snippets. You can make it easier for others to use your pandas toy data by providing it as a dict, which can then be read directly in the code.
Simple approach:
import pandas as pd
import altair as alt
import numpy as np

alt.data_transformers.disable_max_rows()

df = pd.read_csv('df.csv', index_col=0)
doNotMelt = df.drop(df.iloc[:, 1:], axis=1)
df_melted = pd.melt(df, id_vars=doNotMelt, var_name='SampleID', value_name='Relative_abundance')

alt.Chart(df_melted).mark_rect().encode(
    alt.X('SampleID:N'),
    alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
    alt.Color('Relative_abundance:Q')
).transform_filter(
    alt.FieldOneOfPredicate(field='Lowest_Taxon', oneOf=['c', 'e', 'b', 'y', 'a'])
)
Flexible approach:
Set n to however many of the top entries you want to see.
import pandas as pd
import altair as alt
import numpy as np

alt.data_transformers.disable_max_rows()

df = pd.read_csv('df.csv', index_col=0)
doNotMelt = df.drop(df.iloc[:, 1:], axis=1)
df_melted = pd.melt(df, id_vars=doNotMelt, var_name='SampleID', value_name='Relative_abundance')

n = 5  # number of entries to display

alt.Chart(df_melted).mark_rect().encode(
    alt.X('SampleID:N'),
    alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
    alt.Color('Relative_abundance:Q')
).transform_joinaggregate(
    mean_rel_ab='mean(Relative_abundance)',
    count_of_samples='valid(Relative_abundance)',
    groupby=['Lowest_Taxon']
).transform_window(
    rank='rank(mean_rel_ab)',
    sort=[alt.SortField('mean_rel_ab', order='descending')],
    frame=[None, None]
).transform_filter(
    # Each taxon occupies count_of_samples melted rows with the same mean, so the
    # top n taxa are exactly the rows ranked up to (n-1)*count_of_samples + 1.
    (alt.datum.rank <= (n - 1) * alt.datum.count_of_samples + 1)
)
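If you want to sanity-check which taxa the chart will keep, here is a small pandas cross-check (purely a verification step, not part of the chart; it assumes df_melted and n from the snippet above):
# Cross-check: taxa with the highest mean relative abundance across samples
top_n = (df_melted.groupby('Lowest_Taxon')['Relative_abundance']
                  .mean()
                  .nlargest(n)
                  .index
                  .tolist())
print(top_n)  # should come out as ['c', 'e', 'b', 'y', 'a'] for n = 5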

finding latest trip information from a large data frame

I have a requirement:
I have a dataframe df_input with 20M rows of trip details. The columns are "vehicle-no", "geolocation", "start" and "end".
For each vehicle number there are multiple rows, each with a different geolocation for a different trip.
Now I want to create a new dataframe df_final which will have only the first record for each vehicle-no. How can I do that in an efficient way?
I used something like the code below, which takes more than 5 hours to complete:
import pandas as pd
import dfply as dp
from dfply import X

output_df_columns = ["vehicle-no", "start", "end", "geolocations"]
df_final = pd.DataFrame(columns=output_df_columns)  # create empty dataframe
unique_vehicle_no = list(df_input["vehicle-no"].unique())
df_input.sort_values(["start"], inplace=True)
for each_vehicle in unique_vehicle_no:
    df_temp = (df_input >> dp.mask(X["vehicle-no"] == each_vehicle))
    df_final = df_final.append(df_temp.head(1), ignore_index=True, sort=False)
I think this will work out
import pandas as pd
import numpy as np

# Toy data: 1000 trips spread across ~100 vehicles
df_input = pd.DataFrame(np.random.randint(10, size=(1000, 3)), columns=['Geolocation', 'start', 'end'])
df_input['vehicle_number'] = np.random.randint(100, size=(1000))
print(df_input.shape)
print(df_input['vehicle_number'].nunique())

# Keep only the first row per vehicle
df_final = df_input.groupby('vehicle_number').apply(lambda x: x.head(1)).reset_index(drop=True)
print(df_final['vehicle_number'].nunique())
print(df_final.shape)
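If the groupby/apply step is still slow on 20M rows, a sort followed by drop_duplicates does the same job in one vectorized pass. A minimal sketch, assuming the real column names from the question ('vehicle-no' and 'start'):
import pandas as pd

# Earliest trip per vehicle: sort by start time, then keep the first
# occurrence of each vehicle number.
df_final = (df_input.sort_values('start')
                    .drop_duplicates(subset='vehicle-no', keep='first')
                    .reset_index(drop=True))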

Is there a loop function for automating data pull request from Google Trends?

I want to create a loop that helps me pull data from Google Trends via PyTrends. I need to iterate through a lot of keywords, but Google Trends only allows comparing five keywords at a time, so I need to iterate through the keywords manually and build a dataframe in pandas. However, something seems to be off.
I get data, but the dataframe pandas builds has the values shifted into different rows and duplicate "NaN" values.
Instead of 62 rows I get 372 rows (with duplicate "NaN" values).
from pytrends.request import TrendReq
import pandas as pd

pytrend = TrendReq()
kw_list = ['cool', 'fun', 'big', 'house', 'phone', 'garden']
df1 = pd.DataFrame()
for i in kw_list:
    kw_list = i
    pytrend.build_payload([kw_list], timeframe='2015-10-14 2015-12-14', geo='FR')
    df1 = df1.append(pytrend.interest_over_time())
print(df1.head)
I want one coherent dataframe with the columns 'cool', 'fun', 'big', 'house', 'phone', 'garden' and their respective values aligned on the same rows, e.g. a dataframe with 62 rows and 6 columns.
I'm probably going to lose some rep points because this is an old question (oldish, anyway), but I was struggling with the same problem and I solved it like this:
import pytrends
import pandas as pd
from pytrends.request import TrendReq

pytrend = TrendReq()
kw_list = ['cool', 'fun', 'big', 'house', 'phone', 'garden']
df_gtrends_kw = {}
df_gtrends = pd.DataFrame()

for kw in kw_list:
    # One payload per keyword; collect a per-keyword regional frame
    pytrend.build_payload(kw_list=[kw], timeframe='today 12-m')
    df_gtrends_kw[kw] = pytrend.interest_by_region(resolution='COUNTRY')

# Glue the per-keyword frames together column-wise on the shared region index
df_gtrends = pd.concat([df_gtrends_kw[key] for key in kw_list], join='inner', axis=1)
According to the official docs, one has to specify the axis along which to glue the dataframes; in this case the columns (axis=1), since the index is the same for each dataframe.
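If what you actually want is interest over time (one row per date, one column per keyword, i.e. the 62 x 6 frame described in the question), the same per-keyword pattern works with interest_over_time. A minimal sketch under that assumption:
import pandas as pd
from pytrends.request import TrendReq

pytrend = TrendReq()
kw_list = ['cool', 'fun', 'big', 'house', 'phone', 'garden']

frames = {}
for kw in kw_list:
    # One request per keyword, so each keyword keeps its own 0-100 scaling
    pytrend.build_payload([kw], timeframe='2015-10-14 2015-12-14', geo='FR')
    # interest_over_time() returns the keyword column plus an isPartial flag
    frames[kw] = pytrend.interest_over_time()[kw]

# Align on the shared date index: one row per date, one column per keyword
df1 = pd.concat(frames, axis=1)
print(df1.shape)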

Error Unorderable Types. Select rows where on hdf5 files

I work with Python 3.5 and I have the following problem importing data from HDF5 files.
I will show a very simple example which summarizes what happens. I created a small dataframe and inserted it into an HDF5 file. Then I tried to select from this HDF5 file the rows whose column "A" has a value less than 1, and I get the error:
"TypeError: unorderable types: str() < int()"
import pandas as pd
import numpy as np
import datetime
import time
import h5py
from pandas import DataFrame, HDFStore

def test_conected():
    hdf_nombre_archivo = "1_Archivo.h5"
    hdf = HDFStore(hdf_nombre_archivo)
    np.random.seed(1234)
    index = pd.date_range('1/1/2000', periods=3)
    df = pd.DataFrame(np.random.randn(3, 4), index=index,
                      columns=['A', 'B', 'C', 'F'])
    print(df)
    with h5py.File(hdf_nombre_archivo) as f:
        df.to_hdf(hdf_nombre_archivo, 'df', format='table')
    print("")
    with h5py.File(hdf_nombre_archivo) as f:
        df_nuevo = pd.read_hdf(hdf_nombre_archivo, 'df', where=['A' < 1])
        print(df_nuevo)

def Fin():
    print(" ")
    print("FIN")

if __name__ == "__main__":
    test_conected()
    Fin()
    print(time.strftime("%H:%M:%S"))
I have been investigating but I can't manage to solve this error. Any ideas?
Thanks,
Angel
where= ['A' < 1]
In this condition, 'A' is a string and 1 is an int, and Python evaluates the comparison 'A' < 1 itself before pandas ever sees it, which is why the two unorderable types raise the error. You would have to make both sides the same type (by typecasting, e.g.
str(1)
) for the comparison to even run, but what you really want is to hand the whole condition to pandas as a string, so that 'A' is interpreted as the column name rather than as a literal.
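For pandas to do the filtering itself, the condition has to reach read_hdf as a string, and the column being queried has to be stored as a data column. A minimal sketch of both sides, under those assumptions (reusing the file name and frame from the question):
import pandas as pd
import numpy as np

index = pd.date_range('1/1/2000', periods=3)
df = pd.DataFrame(np.random.randn(3, 4), index=index, columns=['A', 'B', 'C', 'F'])

# Write in table format and declare 'A' as a data column so it can be queried
df.to_hdf('1_Archivo.h5', 'df', format='table', data_columns=['A'])

# Pass the condition as a string; pandas parses it, no Python-level comparison happens
df_nuevo = pd.read_hdf('1_Archivo.h5', 'df', where='A < 1')
print(df_nuevo)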

Cannot get Pandas to concat/append

I'm trying to parse a website's tables and I'm still pretty noobish. For each link, only the second table/dataframe is appended to the spreadsheet. There are multiple links, and therefore it requires a while loop. Using what little I could find online, I am stuck with this, which I'm pretty sure is totally off:
import pandas as pd
from pandas import ExcelWriter

a = 1
alist = []
writer = ExcelWriter('name.xlsx')

def dffunc():
    dfs = pd.read_html('http://websitepath{}.htm'.format(a))
    df = dfs[1]
    alist.append(df, ignore_index=True)
    alist = pd.concat(df, axis=0)

while a < 9:
    dffunc()
    a += 1

alist.to_excel(writer, index=False)
writer.save()
df=dfs[1] takes the second table in the list. Is that what you want?
old:
df = dfs[1]
alist.append(df,ignore_index=True)
alist = pd.concat(df, axis=0)
- You're appending the 2nd table in the dfs collection to the global alist
- You're assigning the 2nd table in the dfs collection to alist, undoing all previous steps
- Operating on a global var that is written to file once at the end of the loop defeats the purpose of your loop given the second bullet; alist will only ever hold the 2nd table from the last query when you write to file
new:
import pandas as pd
from pandas import ExcelWriter

writer = ExcelWriter('name.xlsx')
writer_kwargs = {'index': False}
A = 9

def dffunc(a):
    dfs = pd.read_html('http://websitepath{}.htm'.format(a))
    return pd.concat(dfs, axis=0)

def dfhandler(df, writer, **kwargs):
    # sheet_name must be a string, so convert the page number
    df.to_excel(writer, sheet_name=str(a), **kwargs)

for a in range(1, A):
    dfhandler(dffunc(a), writer, **writer_kwargs)

writer.save()
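If the goal is one combined sheet rather than one sheet per link (which is how the original question reads), the frames can also be collected and concatenated once at the end. A minimal sketch, keeping the placeholder URL from the question:
import pandas as pd

frames = []
for a in range(1, 9):
    # Second table on each page, as in the original code
    dfs = pd.read_html('http://websitepath{}.htm'.format(a))
    frames.append(dfs[1])

# One combined frame, written to a single sheet
combined = pd.concat(frames, ignore_index=True)
combined.to_excel('name.xlsx', index=False)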
