Text Match combinations - text

My data is something like below
# dummy data
ID = c(1,2,3,4,5,6,7,8,9,10,11,12)
addrs = c("3 xx road sg" , "4 yy road sg" , "5 apt 04-3 sg" , "Bung 2 , kl road sg","4 yy road sg" , "3 xx road sg" ,"Bung 2 , kl road sg" ,"5 apt 04-3 sg","3 xx road sg","Bung 2 , sg kl road","3xx Road sg","4 yy sg")
library(data.table)
data.1 = data.table(ID, addrs)
data looks like
ID addrs
1: 1 3 xx road sg
2: 2 4 yy road sg
3: 3 5 apt 04-3 sg
4: 4 Bung 2 , kl road sg
5: 5 4 yy road sg
6: 6 3 xx road sg
7: 7 Bung 2 , kl road sg
8: 8 5 apt 04-3 sg
9: 9 3 xx road sg
I want to get the matching combinations (based on addrs). The required output is shown below (only for "3 xx road sg" as an example). If the address matches for A and B, the table should contain both the A-B match and the B-A match:
ID.1 ID.2 Match.1 Match.2 Accuracy
1 6 3 xx road sg 3 xx road sg 100%
1 9 3 xx road sg 3 xx road sg 100%
6 9 3 xx road sg 3 xx road sg 100%
9 6 3 xx road sg 3 xx road sg 100%
9 1 3 xx road sg 3 xx road sg 100%
6 1 3 xx road sg 3 xx road sg 100%
The output should also cover cases where the text differs by spaces, word order, or a few characters:
ID.1 ID.2 Match.1 Match.2 Accuracy
1 11 3 xx road sg 3xx Road sg 100 %
2 12 4 yy road sg 4 yy sg 70 %
4 10 Bung 2 , kl road sg Bung 2 , sg kl road 100 %
Any further inputs on how to deal with the text matching when the data may be similar but written differently ?

r <- merge(data.1, data.1, by="addrs", all=TRUE, suffixes = c(".1",".2"))
r[r$ID.1 != r$ID.2,]
addrs ID.1 ID.2
2 3 xx road sg 1 6
3 3 xx road sg 1 9
4 3 xx road sg 6 1
6 3 xx road sg 6 9
7 3 xx road sg 9 1
8 3 xx road sg 9 6
11 4 yy road sg 2 5
12 4 yy road sg 5 2
15 5 apt 04-3 sg 3 8
16 5 apt 04-3 sg 8 3
19 Bung 2 , kl road sg 7 4
20 Bung 2 , kl road sg 4 7
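
The merge above only pairs addresses that are exactly equal. For the follow-up about text that is similar but written differently, one possible approach (not part of the original answer) is to normalize each address and score pairs with a string-similarity ratio. The sketch below uses Python's standard-library `difflib`; the helper names `normalize`, `similarity` and `match_pairs`, and the 0.9 threshold, are assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher
from itertools import permutations

def normalize(addr):
    """Lowercase, drop punctuation, and sort the words so word order is ignored."""
    words = re.findall(r"[a-z0-9-]+", addr.lower())
    return " ".join(sorted(words))

def similarity(a, b):
    """Similarity in [0, 1] between two addresses after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def match_pairs(records, threshold=0.9):
    """Return (id1, id2, addr1, addr2, score) for every ordered pair at or above threshold."""
    out = []
    # permutations yields both (A, B) and (B, A), as the question requires
    for (id1, a1), (id2, a2) in permutations(records, 2):
        score = similarity(a1, a2)
        if score >= threshold:
            out.append((id1, id2, a1, a2, round(score, 2)))
    return out

# A few rows from the question's dummy data
records = [
    (1, "3 xx road sg"),
    (6, "3 xx road sg"),
    (4, "Bung 2 , kl road sg"),
    (10, "Bung 2 , sg kl road"),
    (2, "4 yy road sg"),
    (12, "4 yy sg"),
]
pairs = match_pairs(records)
for p in pairs:
    print(p)
```

Sorting the words makes "Bung 2 , kl road sg" and "Bung 2 , sg kl road" compare as identical. Character-based ratios can also score unrelated short addresses fairly high when they share words such as "road sg", so the threshold would need tuning on real data.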

Related

How to merge data with duplicates using pandas in Python

I have the two dataframes below, and I'd like to merge them to get the ID onto df1. However, with merge I cannot get the ID when a name appears more than once. df2 has unique names; df1 and df2 differ in their rows and columns. My data and code are below:
df1: Name Region
0 P Asia
1 Q Eur
2 R Africa
3 S NA
4 R Africa
5 R Africa
6 S NA
df2: Name Id
0 P 1234
1 Q 1244
2 R 1233
3 S 1111
code:
x = df1.assign(temp1=df1.groupby('Name').cumcount())
y = df2.assign(temp1=df2.groupby('Name').cumcount())
xy = x.merge(y, on=['Name', 'temp1'], how='left').drop(columns=['temp1'])
the output is:
df1: Name Region Id
0 P Asia 1234
1 Q Eur 1244
2 R Africa 1233
3 S NA 1111
4 R Africa NaN
5 R Africa NaN
6 S NA NaN
How do I find all the id for these duplicate names?
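
A likely explanation: `cumcount()` numbers repeated names 0, 1, 2, ... in df1, while every name in df2 only ever gets 0, so only the first occurrence of each name finds a partner in the merge. Since df2 has unique names, a plain left merge on `Name` alone should be enough. A minimal sketch, rebuilding the two frames from the tables in the question:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["P", "Q", "R", "S", "R", "R", "S"],
    "Region": ["Asia", "Eur", "Africa", "NA", "Africa", "Africa", "NA"],
})
df2 = pd.DataFrame({
    "Name": ["P", "Q", "R", "S"],
    "Id": [1234, 1244, 1233, 1111],
})

# Left merge on Name only: every row of df1 is kept, and each
# repeated name picks up the single matching Id from df2.
merged = df1.merge(df2, on="Name", how="left")
print(merged)

# Equivalent lookup without a merge: map names through a Name -> Id Series.
df1["Id"] = df1["Name"].map(df2.set_index("Name")["Id"])
```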

Calculate Percentage using Pandas DataFrame

Of all the medals won by these 5 countries across all Olympics,
what percentage of the medals was won by each of them?
I have combined all the Excel files into one pandas DataFrame, but I am now stuck on finding the percentage.
Country Gold Silver Bronze Total
0 USA 10 13 11 34
1 China 2 2 4 8
2 UK 1 0 1 2
3 Germany 12 16 8 36
4 Australia 2 0 0 2
0 USA 9 9 7 25
1 China 2 4 5 11
2 UK 0 1 0 1
3 Germany 11 12 6 29
4 Australia 1 0 1 2
0 USA 9 15 13 37
1 China 5 2 4 11
2 UK 1 0 0 1
3 Germany 10 13 7 30
4 Australia 2 1 0 3
Combined data sheet
Code that i have tried till now
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

files = ['E:\\olympics\\Olympics-2002.xlsx', 'E:\\olympics\\Olympics-2006.xlsx',
         'E:\\olympics\\Olympics-2010.xlsx', 'E:\\olympics\\Olympics-2014.xlsx',
         'E:\\olympics\\Olympics-2018.xlsx']
# DataFrame.append was removed in pandas 2.0, so concatenate the sheets instead
df = pd.concat([pd.read_excel(f, 'Sheet1') for f in files])
df.to_excel("E:\\olympics\\combineddata.xlsx")
data = pd.read_excel("E:\\olympics\\combineddata.xlsx")
print(data)

# Total medals per country, collected in a dict
final_Data = {}
for i in data['Country']:
    x = i
    t1 = (data[(data.Country == x)].Total).tolist()
    print("Name of Country=", i, int(sum(t1)))
    final_Data.update({i: int(sum(t1))})

t3 = data.groupby('Country').Total.sum()  # total per country
t2 = df['Total'].sum()                    # grand total
t4 = t3 / t2 * 100                        # percentage per country
print(t3)
print(t2)
print(t4)
This is how I got the answer. Now I need to plot it, and I want to show it as a pie chart.
Let's assume you have created the DataFrame as 'df'. Then you can do the following to first group by and then calculate percentages.
df = df.groupby('Country').sum()
df['Gold_percent'] = (df['Gold'] / df['Gold'].sum()) * 100
df['Silver_percent'] = (df['Silver'] / df['Silver'].sum()) * 100
df['Bronze_percent'] = (df['Bronze'] / df['Bronze'].sum()) * 100
df['Total_percent'] = (df['Total'] / df['Total'].sum()) * 100
df = df.round(2)
print(df)
The output will be as follows:
Gold Silver Bronze ... Silver_percent Bronze_percent Total_percent
Country ...
Australia 5 1 1 ... 1.14 1.49 3.02
China 9 8 13 ... 9.09 19.40 12.93
Germany 33 41 21 ... 46.59 31.34 40.95
UK 2 1 1 ... 1.14 1.49 1.72
USA 28 37 31 ... 42.05 46.27 41.38
I do not have your exact dataset, so I am explaining with a similar one. Sum the values across each row, then divide each row's sum by the grand total of those columns to get the percentage.
I am posting this as a model; check it:
import pandas as pd
cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4'],
'ExshowroomPrice': [21000,26000,28000,34000],'RTOPrice': [2200,250,2700,3500]}
df = pd.DataFrame(cars, columns = ['Brand', 'ExshowroomPrice','RTOPrice'])
Brand ExshowroomPrice RTOPrice
0 Honda Civic 21000 2200
1 Toyota Corolla 26000 250
2 Ford Focus 28000 2700
3 Audi A4 34000 3500
df['percentage'] = ((df.ExshowroomPrice + df.RTOPrice) * 100
                    / (df.ExshowroomPrice.sum() + df.RTOPrice.sum()))
print(df)
Brand ExshowroomPrice RTOPrice percentage
0 Honda Civic 21000 2200 19.719507
1 Toyota Corolla 26000 250 22.311942
2 Ford Focus 28000 2700 26.094348
3 Audi A4 34000 3500 31.874203
Hope that's clear.
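
Neither answer covers the pie chart the question asks for. A minimal sketch with matplotlib follows; the `totals` dict is an assumption, hard-coded here from summing the Total column per country in the question's combined table rather than read from the Excel files.

```python
import matplotlib.pyplot as plt

# Assumption: per-country totals summed by hand from the combined table
totals = {"Australia": 7, "China": 30, "Germany": 95, "UK": 4, "USA": 96}

grand_total = sum(totals.values())
shares = {country: 100 * t / grand_total for country, t in totals.items()}

fig, ax = plt.subplots()
# autopct prints each wedge's percentage on the chart
ax.pie(list(shares.values()), labels=list(shares.keys()), autopct="%1.1f%%")
ax.set_title("Share of total medals")
# Call plt.show() to display the chart, or fig.savefig(...) to write it to a file.
```

With the question's own variables, the same call should also work on the grouped Series, for example `ax.pie(t4, labels=t4.index, autopct="%1.1f%%")`.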

Groupby and create a new column by randomly assign multiple strings into it in Pandas

Let's say I have student info with id, age and class, as follows:
id age class
0 1 23 a
1 2 24 a
2 3 25 b
3 4 22 b
4 5 16 c
5 6 16 d
I want to group by class and create a new column named major by randomly assigning math, art, business or science to it, so that rows in the same class get the same major.
We may need something like apply(lambda x: random.choice..) for this, but I don't know how to do it. Thanks for your help.
Output expected:
id age major class
0 1 23 art a
1 2 24 art a
2 3 25 science b
3 4 22 science b
4 5 16 business c
5 6 16 math d
Use numpy.random.choice (with numpy imported as np), drawing as many values as the DataFrame has rows:
df['major'] = np.random.choice(['math', 'art', 'business', 'science'], size=len(df))
print (df)
id age major
0 1 23 business
1 2 24 art
2 3 25 science
3 4 22 math
4 5 16 science
5 6 16 business
EDIT: to get the same major within each group, use Series.map with a dictionary:
c = df['class'].unique()
vals = np.random.choice(['math', 'art', 'business', 'science'], size=len(c))
df['major'] = df['class'].map(dict(zip(c, vals)))
print (df)
id age class major
0 1 23 a business
1 2 24 a business
2 3 25 b art
3 4 22 b art
4 5 16 c science
5 6 16 d math

Building triples from two different data frames

I want to build the triples source --> target --> edge and store these triples in a new dataframe.
I have two data frames
Accident_ID Location CarID_1 CarID_2 DriverID_1 DriverID_2
0 1 Tartu 1000 1001 1 3
1 2 Tallin 1002 1003 2 5
2 3 Tartu 1004 1005 4 6
3 4 Tallin 1006 1007 7 8
User_ID First Name Last Name Age Address Accident_ID ROLE
0 1 Chester Murphy 25 Narva 108, Tartu 1 Driver
1 2 Walter Turner 26 Tilgi 49, Tartu 2 Driver
2 3 Daryl Fowler 25 Piik 67, Tartu 1 Driver
3 4 Ted Nelson 45 Herne 20, Tartu 3 Driver
4 5 Olivia Crawford 38 Kalevi 25, Tartu 2 Driver
5 1 Chester Murphy 25 Narva 108, Tartu 2 Witness
6 6 Amy Miller 27 Riia 408, Tartu 3 Driver
7 7 Tes Smith 25 Narva 108, Tartu 4 Driver
8 8 Josh Blake 36 Parnu 37, Tallin 4 Driver
9 3 Daryl Fowler 25 Piik 67, Tartu 4 Witness
The triples I have to form follow the source --> target --> edge pattern described above.
What is the Python code for this? I have written the code below, but I am getting the error "Witness is not defined":
df3 = df1.merge(df2,on='Accident_ID')
df3["train"] = df3.Accident_ID < 5
df3["train"].value_counts()
triples = []
for _, row in df3[df3["train"]].iterrows():
    if row["ROLE"] == "Driver":
        if row["User_ID"] == row["DriverID_1"]:
            Drives = (row["User_ID"],row["CarID_1"], "Drives")
        elif row["User_ID"] == row["DriverID_2"]:
            Drives = (row["User_ID"],row["CarID_2"], "Drives")
    else:
        Witness = (row["User_ID"],row["Accident_ID"], "Witness")
    Involved_in_first = (row["CarID_1"],row["Accident_ID"], "Involved in")
    Involved_in_second = (row["CarID_2"],row["Accident_ID"], "Involved in")
    Happened_in = (row["Accident_ID"],row["Location"], "Happened in")
    Lives_in = (row["User_ID"],row["Address"], "Lives in")
    triples.extend((Drives , Witness , Involved_in_first,Involved_in_second, Happened_in , Lives_in ))
triples_df = pd.DataFrame(triples, columns=["Source", "Target", "Edge"])
triples_df.shape
You should do something like this and follow the same process for the rest of the edges:
df = df2.merge(df1, on=['Accident_ID'], how='inner')
print(df)
columns = ['Source', 'Target', 'Edge']
rows = []
for i in range(0, df.shape[0]):
    row1 = [
        df.iloc[i]['First_Name'],
        df.iloc[i]['CarID_1'],
        'Drives'
    ]
    row2 = [
        df.iloc[i]['First_Name'],
        df.iloc[i]['Accident_ID'],
        'Witness'
    ]
    rows.append(row1)
    rows.append(row2)
df_g = pd.DataFrame(rows, columns=columns)
print(df_g)
Output:
Source Target Edge
0 Chester 1000 Drives
1 Chester 1 Witness
2 Daryl 1000 Drives
3 Daryl 1 Witness
4 Walter 1002 Drives
5 Walter 2 Witness
6 Olivia 1002 Drives
7 Olivia 2 Witness
8 Chester 1002 Drives
9 Chester 2 Witness
10 Ted 1004 Drives
11 Ted 3 Witness
12 Amy 1004 Drives
13 Amy 3 Witness
14 Tes 1006 Drives
15 Tes 4 Witness
16 Josh 1006 Drives
17 Josh 4 Witness
18 Daryl 1006 Drives
19 Daryl 4 Witness
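
Note that the output above gives every person both a Drives and a Witness edge. The error in the question most likely comes from `triples.extend` referring to `Witness` on a row where only the Driver branch ran, so the name was never assigned. One way around both issues is to append each triple only inside the branch where it applies. The sketch below is an illustration, not the answerer's code: `row_triples` is a hypothetical helper, and the two sample rows are hand-written dicts standing in for rows of the merged frame (a row from `iterrows()` can be indexed the same way).

```python
def row_triples(row):
    """Return the (source, target, edge) triples that apply to one merged row."""
    triples = []
    if row["ROLE"] == "Driver":
        # Pick the car that belongs to this driver, if any
        if row["User_ID"] == row["DriverID_1"]:
            triples.append((row["User_ID"], row["CarID_1"], "Drives"))
        elif row["User_ID"] == row["DriverID_2"]:
            triples.append((row["User_ID"], row["CarID_2"], "Drives"))
    else:
        triples.append((row["User_ID"], row["Accident_ID"], "Witness"))
    triples.append((row["CarID_1"], row["Accident_ID"], "Involved in"))
    triples.append((row["CarID_2"], row["Accident_ID"], "Involved in"))
    triples.append((row["Accident_ID"], row["Location"], "Happened in"))
    triples.append((row["User_ID"], row["Address"], "Lives in"))
    return triples

# Two merged rows for accident 2, taken from the tables in the question
accident = {"Accident_ID": 2, "Location": "Tallin", "CarID_1": 1002,
            "CarID_2": 1003, "DriverID_1": 2, "DriverID_2": 5}
rows = [
    {**accident, "User_ID": 5, "Address": "Kalevi 25, Tartu", "ROLE": "Driver"},
    {**accident, "User_ID": 1, "Address": "Narva 108, Tartu", "ROLE": "Witness"},
]

all_triples = []
for row in rows:
    all_triples.extend(row_triples(row))

# dict.fromkeys drops repeated triples while keeping first-seen order
unique_triples = list(dict.fromkeys(all_triples))
for t in unique_triples:
    print(t)
```

The de-duplication step matters because the car and location edges repeat once per person involved in the same accident.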

What Excel formula can be used to get the mean for the following data? And how to apply it?

My data is as follows. The values in the spreadsheet are quantities, while 'red', 'yellow' and 'green' are the categories:
items place red green yellow
a VA 1 7 9
b VA 3 0 19
c VA 5 1 0
d VA 11 3 4
e VA 2 2 1
a NJ 0 0 3
b NJ 3 0 9
c NJ 2 4 0
d NJ 0 5 6
e NJ 2 7 1
a MO 0 0 5
b MO 1 0 4
c MO 1 4 0
d MO 0 0 5
e MO 1 7 1
For each place-category combination, I would like to compute the mean of these quantities across all 5 items (a,b,c,d,e):
category place Avg_quantity
red VA ..
green VA . ..
yellow VA ..
red NJ ..
green NJ ..
yellow NJ ..
red MO ..
green MO ..
yellow MO ..
I tried using AVERAGEIFS, but it gives an error because the ranges I pass for category and place have different sizes.
Use this in J2 and drag down (as the formula's references imply, with the category in column H and the place in column I):
=AVERAGE(IF($B$2:$B$16=$I2,INDEX($C$2:$E$16, ,MATCH($H2,$C$1:$E$1,0))))
Enter it with Ctrl + Shift + Enter, i.e. as an array formula.
Excluding zeroes:
=AVERAGEIFS(INDEX($C$2:$E$16,,MATCH($H2,$C$1:$E$1,0)),INDEX($C$2:$E$16,,MATCH($H2,$C$1:$E$1,0)),">0",$B$2:$B$16,$I2)
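
To sanity-check the values these formulas should return, the same calculation can be sketched outside Excel. The Python snippet below is only a cross-check under that assumption: `average_quantity` is a hypothetical helper in which the header lookup plays the role of MATCH and the place filter plays the role of the IF / AVERAGEIFS condition.

```python
from statistics import mean

header = ["items", "place", "red", "green", "yellow"]
rows = [
    ("a", "VA", 1, 7, 9), ("b", "VA", 3, 0, 19), ("c", "VA", 5, 1, 0),
    ("d", "VA", 11, 3, 4), ("e", "VA", 2, 2, 1),
    ("a", "NJ", 0, 0, 3), ("b", "NJ", 3, 0, 9), ("c", "NJ", 2, 4, 0),
    ("d", "NJ", 0, 5, 6), ("e", "NJ", 2, 7, 1),
    ("a", "MO", 0, 0, 5), ("b", "MO", 1, 0, 4), ("c", "MO", 1, 4, 0),
    ("d", "MO", 0, 0, 5), ("e", "MO", 1, 7, 1),
]

def average_quantity(category, place, skip_zeros=False):
    """Mean of one category column over the rows for one place."""
    col = header.index(category)  # like MATCH on the header row
    values = [r[col] for r in rows if r[1] == place]
    if skip_zeros:
        # like the ">0" criterion in the AVERAGEIFS variant
        values = [v for v in values if v > 0]
    return mean(values)

for place in ["VA", "NJ", "MO"]:
    for category in ["red", "green", "yellow"]:
        print(category, place, average_quantity(category, place))
```

For example, red in VA averages (1 + 3 + 5 + 11 + 2) / 5 = 4.4, and green in VA excluding zeroes averages (7 + 1 + 3 + 2) / 4 = 3.25.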
