Text Match combinations


My data looks something like this:
# dummy data
library(data.table)
ID = c(1,2,3,4,5,6,7,8,9,10,11,12)
addrs = c("3 xx road sg" , "4 yy road sg" , "5 apt 04-3 sg" , "Bung 2 , kl road sg","4 yy road sg" , "3 xx road sg" ,"Bung 2 , kl road sg" ,"5 apt 04-3 sg","3 xx road sg","Bung 2 , sg kl road","3xx Road sg","4 yy sg")
data.1 = data.table(ID, addrs)
The data looks like this (first nine rows):
ID addrs
1: 1 3 xx road sg
2: 2 4 yy road sg
3: 3 5 apt 04-3 sg
4: 4 Bung 2 , kl road sg
5: 5 4 yy road sg
6: 6 3 xx road sg
7: 7 Bung 2 , kl road sg
8: 8 5 apt 04-3 sg
9: 9 3 xx road sg
I want to get all matching combinations based on addrs. The required output is shown below (only for "3 xx road sg" as an example). If the address matches for A and B, the table should contain both the A-B match and the B-A match.
ID.1 ID.2 Match.1 Match.2 Accuracy
1 6 3 xx road sg 3 xx road sg 100%
1 9 3 xx road sg 3 xx road sg 100%
6 9 3 xx road sg 3 xx road sg 100%
9 6 3 xx road sg 3 xx road sg 100%
9 1 3 xx road sg 3 xx road sg 100%
6 1 3 xx road sg 3 xx road sg 100%
The output should also cover cases where the text differs by spacing, word order, or individual characters:
ID.1 ID.2 Match.1 Match.2 Accuracy
1 11 3 xx road sg 3xx Road sg 100 %
2 12 4 yy road sg 4 yy sg 70 %
4 10 Bung 2 , kl road sg Bung 2 , sg kl road 100 %
Any further input on how to handle text matching when the data is similar but written differently?

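A self-merge on addrs pairs up every ID that shares exactly the same address; filtering out the rows where both IDs are equal leaves the A-B and B-A combinations. Note that this only finds exact matches:
r <- merge(data.1, data.1, by = "addrs", all = TRUE, suffixes = c(".1", ".2"))
r[r$ID.1 != r$ID.2, ]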
addrs ID.1 ID.2
2 3 xx road sg 1 6
3 3 xx road sg 1 9
4 3 xx road sg 6 1
6 3 xx road sg 6 9
7 3 xx road sg 9 1
8 3 xx road sg 9 6
11 4 yy road sg 2 5
12 4 yy road sg 5 2
15 5 apt 04-3 sg 3 8
16 5 apt 04-3 sg 8 3
19 Bung 2 , kl road sg 7 4
20 Bung 2 , kl road sg 4 7
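
The self-merge above does not cover the near matches the question asks about ("3xx Road sg", "Bung 2 , sg kl road", "4 yy sg"). One possible approach, sketched here in Python rather than R, is to normalise each address and score pairs by the characters they share. The helper names `normalise`, `similarity` and `match_pairs`, the 0.7 threshold and the scoring rule itself are assumptions made for illustration; they are not from the original post, and a bag-of-characters score is deliberately crude (it ignores order entirely, so anagrams score 100%).

```python
from collections import Counter
from itertools import permutations


def normalise(text):
    """Lower-case and keep only letters and digits, so spacing and punctuation are ignored."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def similarity(a, b):
    """Share of characters two normalised strings have in common (0.0 to 1.0), ignoring order."""
    ca, cb = Counter(normalise(a)), Counter(normalise(b))
    total = sum(ca.values()) + sum(cb.values())
    if total == 0:
        return 1.0
    common = sum((ca & cb).values())  # multiset intersection
    return 2 * common / total


def match_pairs(addrs, threshold=0.7):
    """Return (ID.1, ID.2, Match.1, Match.2, Accuracy %) for every ordered pair above the threshold."""
    rows = []
    for id1, id2 in permutations(addrs, 2):  # yields both A-B and B-A
        score = similarity(addrs[id1], addrs[id2])
        if score >= threshold:
            rows.append((id1, id2, addrs[id1], addrs[id2], round(score * 100)))
    return rows


# A subset of the dummy data from the question, keyed by ID.
sample = {
    1: "3 xx road sg", 11: "3xx Road sg",
    2: "4 yy road sg", 12: "4 yy sg",
    4: "Bung 2 , kl road sg", 10: "Bung 2 , sg kl road",
}

for row in match_pairs(sample):
    print(row)
```

With a real address list, an edit-distance or token-based score is likely to be more reliable; the threshold would need tuning either way.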

Related

How to merge data with duplicates using pandas in Python

I have the two DataFrames below and would like to merge them to get the Id onto df1. However, when I use merge, I cannot get the Id for names that occur more than once. Names in df2 are unique; df1 and df2 differ in their rows and columns. My data and code are below:
df1: Name Region
0 P Asia
1 Q Eur
2 R Africa
3 S NA
4 R Africa
5 R Africa
6 S NA
df2: Name Id
0 P 1234
1 Q 1244
2 R 1233
3 S 1111
Code:
x = df1.assign(temp1=df1.groupby('Name').cumcount())
y = df2.assign(temp1=df2.groupby('Name').cumcount())
xy = x.merge(y, on=['Name', 'temp1'], how='left').drop(columns=['temp1'])
The output is:
df1: Name Region Id
0 P Asia 1234
1 Q Eur 1244
2 R Africa 1233
3 S NA 1111
4 R Africa NaN
5 R Africa NaN
6 S NA NaN
How do I find the Id for all of these duplicate names?
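
The cumcount key is what leaves the repeats empty: the second "R" in df1 gets temp1 = 1, and df2 has no row with that value. A minimal sketch of one likely fix, assuming names in df2 really are unique as stated, is a plain left merge on Name alone. The frames below are rebuilt from the tables in the question; `validate="many_to_one"` is an optional safety check added here, not part of the original code.

```python
import pandas as pd

# Frames rebuilt from the question's tables.
df1 = pd.DataFrame({
    "Name": ["P", "Q", "R", "S", "R", "R", "S"],
    "Region": ["Asia", "Eur", "Africa", "NA", "Africa", "Africa", "NA"],
})
df2 = pd.DataFrame({
    "Name": ["P", "Q", "R", "S"],
    "Id": [1234, 1244, 1233, 1111],
})

# A left merge keeps every row of df1 and looks up its Id by Name.
# validate="many_to_one" raises an error if df2 ever contains a repeated Name.
merged = df1.merge(df2, on="Name", how="left", validate="many_to_one")
print(merged)
```

Because each Name maps to exactly one Id, every duplicate row in df1 receives the same Id and no helper column is needed.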

Calculate Percentage using Pandas DataFrame

Of all the medals won by these 5 countries across all Olympics, what percentage was won by each of them?
I have combined all the Excel files into one pandas DataFrame, but I am now stuck on computing the percentage.
Country Gold Silver Bronze Total
0 USA 10 13 11 34
1 China 2 2 4 8
2 UK 1 0 1 2
3 Germany 12 16 8 36
4 Australia 2 0 0 2
0 USA 9 9 7 25
1 China 2 4 5 11
2 UK 0 1 0 1
3 Germany 11 12 6 29
4 Australia 1 0 1 2
0 USA 9 15 13 37
1 China 5 2 4 11
2 UK 1 0 0 1
3 Germany 10 13 7 30
4 Australia 2 1 0 3
Combined data sheet
The code I have tried so far:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Combine the yearly workbooks into one DataFrame.
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
files = ['E:\\olympics\\Olympics-2002.xlsx', 'E:\\olympics\\Olympics-2006.xlsx',
         'E:\\olympics\\Olympics-2010.xlsx', 'E:\\olympics\\Olympics-2014.xlsx',
         'E:\\olympics\\Olympics-2018.xlsx']
df = pd.concat([pd.read_excel(f, 'Sheet1') for f in files])
df.to_excel("E:\\olympics\\combineddata.xlsx")
data = pd.read_excel("E:\\olympics\\combineddata.xlsx")
print(data)

final_Data = {}
for i in data['Country']:
    t1 = data[data.Country == i].Total.tolist()
    print("Name of Country=", i, int(sum(t1)))
    final_Data.update({i: int(sum(t1))})

t3 = data.groupby('Country').Total.sum()  # medals per country
t2 = df['Total'].sum()                    # medals overall
t4 = t3 / t2 * 100                        # percentage per country
print(t3)
print(t2)
print(t4)
This is how I got the answer. Now I need to plot it as a pie chart.

Let's assume you have created the DataFrame as df. You can then group by country first and calculate the percentages afterwards.
df = df.groupby('Country').sum()
df['Gold_percent'] = (df['Gold'] / df['Gold'].sum()) * 100
df['Silver_percent'] = (df['Silver'] / df['Silver'].sum()) * 100
df['Bronze_percent'] = (df['Bronze'] / df['Bronze'].sum()) * 100
df['Total_percent'] = (df['Total'] / df['Total'].sum()) * 100
df = df.round(2)  # round() returns a new DataFrame, so assign the result
print(df)
The output will be as follows:
Gold Silver Bronze ... Silver_percent Bronze_percent Total_percent
Country ...
Australia 5 1 1 ... 1.14 1.49 3.02
China 9 8 13 ... 9.09 19.40 12.93
Germany 33 41 21 ... 46.59 31.34 40.95
UK 2 1 1 ... 1.14 1.49 1.72
USA 28 37 31 ... 42.05 46.27 41.38
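
The question also asks for a pie chart of these shares. A minimal sketch follows, using the Total column of the fifteen rows shown in the question; the helper name `plot_share` and the output file name are assumptions for illustration, and matplotlib is imported inside the helper only so that the percentage calculation runs without it.

```python
import pandas as pd

# Total medals per country, one entry per spreadsheet row shown in the question.
data = pd.DataFrame({
    "Country": ["USA", "China", "UK", "Germany", "Australia"] * 3,
    "Total": [34, 8, 2, 36, 2,
              25, 11, 1, 29, 2,
              37, 11, 1, 30, 3],
})

totals = data.groupby("Country")["Total"].sum()
share = totals / totals.sum() * 100  # percentage of all medals per country
print(share.round(2))


def plot_share(share, path="medal_share.png"):
    """Draw the percentages as a pie chart and save it (hypothetical helper)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    ax = share.plot.pie(autopct="%.1f%%")
    ax.set_ylabel("")
    ax.set_title("Share of total medals")
    ax.figure.savefig(path)


# plot_share(share)  # uncomment to write medal_share.png
```

The rounded shares agree with the Total_percent column in the output above.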

I don't have your exact dataset, so I will explain with a similar one. Add up the values across each row, then find the percentage by dividing each row's sum by the sum over the entire table.
Here is a model to follow:
import pandas as pd
cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4'],
'ExshowroomPrice': [21000,26000,28000,34000],'RTOPrice': [2200,250,2700,3500]}
df = pd.DataFrame(cars, columns = ['Brand', 'ExshowroomPrice','RTOPrice'])
Brand ExshowroomPrice RTOPrice
0 Honda Civic 21000 2200
1 Toyota Corolla 26000 250
2 Ford Focus 28000 2700
3 Audi A4 34000 3500
df['percentage'] = (df.ExshowroomPrice + df.RTOPrice) * 100 / (df.ExshowroomPrice.sum() + df.RTOPrice.sum())
print(df)
Brand ExshowroomPrice RTOPrice percentage
0 Honda Civic 21000 2200 19.719507
1 Toyota Corolla 26000 250 22.311942
2 Ford Focus 28000 2700 26.094348
3 Audi A4 34000 3500 31.874203
Hope that's clear.

Groupby and create a new column by randomly assign multiple strings into it in Pandas

Let's say I have student info (id, age and class) as follows:
id age class
0 1 23 a
1 2 24 a
2 3 25 b
3 4 22 b
4 5 16 c
5 6 16 d
I want to group by class and create a new column named major by randomly assigning math, art, business or science to it, so that every row in the same class gets the same major.
We may need something like apply(lambda x: random.choice(...)) to do this, but I don't know how. Thanks for your help.
Expected output:
id age major class
0 1 23 art a
1 2 24 art a
2 3 25 science b
3 4 22 science b
4 5 16 business c
5 6 16 math d

Use numpy.random.choice, drawing as many values as the DataFrame has rows:
import numpy as np
df['major'] = np.random.choice(['math', 'art', 'business', 'science'], size=len(df))
print (df)
id age major
0 1 23 business
1 2 24 art
2 3 25 science
3 4 22 math
4 5 16 science
5 6 16 business
EDIT: To get the same major for every row in a group, use Series.map with a dictionary:
c = df['class'].unique()
vals = np.random.choice(['math', 'art', 'business', 'science'], size=len(c))
df['major'] = df['class'].map(dict(zip(c, vals)))
print (df)
id age class major
0 1 23 a business
1 2 24 a business
2 3 25 b art
3 4 22 b art
4 5 16 c science
5 6 16 d math
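
If the assignment needs to be reproducible, or no two classes should share a major (as in the expected output of the question), a seeded generator with `replace=False` is one option. This is only a sketch: the seed value and the no-repeat requirement are assumptions, not something the answer above states, and `replace=False` only works while there are at least as many majors as classes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "age": [23, 24, 25, 22, 16, 16],
    "class": ["a", "a", "b", "b", "c", "d"],
})
majors = ["math", "art", "business", "science"]

rng = np.random.default_rng(seed=0)  # fixed seed: the same draw on every run
classes = df["class"].unique()

# One draw per class; replace=False keeps the drawn majors distinct.
picked = rng.choice(majors, size=len(classes), replace=False)
df["major"] = df["class"].map(dict(zip(classes, picked)))
print(df)
```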

Building triples from two different data frames

I want to build triples of the form source --> target --> edge and store them in a new DataFrame.
I have two data frames:
Accident_ID Location CarID_1 CarID_2 DriverID_1 DriverID_2
0 1 Tartu 1000 1001 1 3
1 2 Tallin 1002 1003 2 5
2 3 Tartu 1004 1005 4 6
3 4 Tallin 1006 1007 7 8
User_ID First Name Last Name Age Address Accident_ID ROLE
0 1 Chester Murphy 25 Narva 108, Tartu 1 Driver
1 2 Walter Turner 26 Tilgi 49, Tartu 2 Driver
2 3 Daryl Fowler 25 Piik 67, Tartu 1 Driver
3 4 Ted Nelson 45 Herne 20, Tartu 3 Driver
4 5 Olivia Crawford 38 Kalevi 25, Tartu 2 Driver
5 1 Chester Murphy 25 Narva 108, Tartu 2 Witness
6 6 Amy Miller 27 Riia 408, Tartu 3 Driver
7 7 Tes Smith 25 Narva 108, Tartu 4 Driver
8 8 Josh Blake 36 Parnu 37, Tallin 4 Driver
9 3 Daryl Fowler 25 Piik 67, Tartu 4 Witness
The triples I have to form follow a pattern that was given as an image in the original post.
What is the Python code for this? I have written the code below, but I get the error that Witness is not defined:
df3 = df1.merge(df2, on='Accident_ID')
df3["train"] = df3.Accident_ID < 5
df3["train"].value_counts()
triples = []
for _, row in df3[df3["train"]].iterrows():
    if row["ROLE"] == "Driver":
        if row["User_ID"] == row["DriverID_1"]:
            Drives = (row["User_ID"], row["CarID_1"], "Drives")
        elif row["User_ID"] == row["DriverID_2"]:
            Drives = (row["User_ID"], row["CarID_2"], "Drives")
    else:
        Witness = (row["User_ID"], row["Accident_ID"], "Witness")
    Involved_in_first = (row["CarID_1"], row["Accident_ID"], "Involved in")
    Involved_in_second = (row["CarID_2"], row["Accident_ID"], "Involved in")
    Happened_in = (row["Accident_ID"], row["Location"], "Happened in")
    Lives_in = (row["User_ID"], row["Address"], "Lives in")
    triples.extend((Drives, Witness, Involved_in_first, Involved_in_second, Happened_in, Lives_in))
triples_df = pd.DataFrame(triples, columns=["Source", "Target", "Edge"])
triples_df.shape

You should do something like this, and follow the same process for the rest of the edges:
df = df2.merge(df1, on=['Accident_ID'], how='inner')
print(df)
columns = ['Source', 'Target', 'Edge']
rows = []
for i in range(0, df.shape[0]):
    row1 = [
        df.iloc[i]['First_Name'],
        df.iloc[i]['CarID_1'],
        'Drives'
    ]
    row2 = [
        df.iloc[i]['First_Name'],
        df.iloc[i]['Accident_ID'],
        'Witness'
    ]
    rows.append(row1)
    rows.append(row2)
df_g = pd.DataFrame(rows, columns=columns)
print(df_g)
Output:
Source Target Edge
0 Chester 1000 Drives
1 Chester 1 Witness
2 Daryl 1000 Drives
3 Daryl 1 Witness
4 Walter 1002 Drives
5 Walter 2 Witness
6 Olivia 1002 Drives
7 Olivia 2 Witness
8 Chester 1002 Drives
9 Chester 2 Witness
10 Ted 1004 Drives
11 Ted 3 Witness
12 Amy 1004 Drives
13 Amy 3 Witness
14 Tes 1006 Drives
15 Tes 4 Witness
16 Josh 1006 Drives
17 Josh 4 Witness
18 Daryl 1006 Drives
19 Daryl 4 Witness
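
The error in the question most likely comes from the loop adding both Drives and Witness on every row, although only one of them is assigned per row. A hedged sketch of one way to restructure it is to append each triple inside the branch that creates it. It uses a subset of the question's data (accidents 1 and 2), and treats every non-driver as a witness and any driver not listed as DriverID_1 as DriverID_2; both are assumptions.

```python
import pandas as pd

# Subset of the question's two tables (accidents 1 and 2 only).
accidents = pd.DataFrame({
    "Accident_ID": [1, 2],
    "Location": ["Tartu", "Tallin"],
    "CarID_1": [1000, 1002],
    "CarID_2": [1001, 1003],
    "DriverID_1": [1, 2],
    "DriverID_2": [3, 5],
})
people = pd.DataFrame({
    "User_ID": [1, 2, 3, 5, 1],
    "Address": ["Narva 108, Tartu", "Tilgi 49, Tartu", "Piik 67, Tartu",
                "Kalevi 25, Tartu", "Narva 108, Tartu"],
    "Accident_ID": [1, 2, 1, 2, 2],
    "ROLE": ["Driver", "Driver", "Driver", "Driver", "Witness"],
})

triples = []

# Accident-level edges: added once per accident, not once per person.
for acc in accidents.itertuples(index=False):
    triples.append((acc.CarID_1, acc.Accident_ID, "Involved in"))
    triples.append((acc.CarID_2, acc.Accident_ID, "Involved in"))
    triples.append((acc.Accident_ID, acc.Location, "Happened in"))

# Person-level edges: append only the triple that applies to this row.
for row in people.merge(accidents, on="Accident_ID").itertuples(index=False):
    if row.ROLE == "Driver":
        car = row.CarID_1 if row.User_ID == row.DriverID_1 else row.CarID_2
        triples.append((row.User_ID, car, "Drives"))
    else:  # assumption: anyone who is not a driver is a witness
        triples.append((row.User_ID, row.Accident_ID, "Witness"))
    triples.append((row.User_ID, row.Address, "Lives in"))

# A person who appears in several accidents would repeat "Lives in"; drop repeats.
triples_df = pd.DataFrame(triples, columns=["Source", "Target", "Edge"]).drop_duplicates()
print(triples_df)
```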

What excel formula can be used to get the mean for the following data? And how to apply it?

My data is as follows. The values in the spreadsheet are quantities, and 'red', 'green' and 'yellow' are the categories:
items place red green yellow
a VA 1 7 9
b VA 3 0 19
c VA 5 1 0
d VA 11 3 4
e VA 2 2 1
a NJ 0 0 3
b NJ 3 0 9
c NJ 2 4 0
d NJ 0 5 6
e NJ 2 7 1
a MO 0 0 5
b MO 1 0 4
c MO 1 4 0
d MO 0 0 5
e MO 1 7 1
For each place-category combination, I would like to compute the mean of these quantities across all 5 items (a, b, c, d, e):
category place Avg_quantity
red VA ..
green VA . ..
yellow VA ..
red NJ ..
green NJ ..
yellow NJ ..
red MO ..
green MO ..
yellow MO ..
I tried AVERAGEIFS, but it gives an error because the ranges I pass for category and place have different sizes.

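Use this in J2 and drag it down. The formula expects the data in A1:E16 with headers in row 1, the category in column H and the place in column I:
=AVERAGE(IF($B$2:$B$16=$I2,INDEX($C$2:$E$16, ,MATCH($H2,$C$1:$E$1,0))))
Enter it as an array formula, i.e. confirm with Ctrl + Shift + Enter.
To exclude zeroes: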
=AVERAGEIFS(INDEX($C$2:$E$16,,MATCH($H2,$C$1:$E$1,0)),INDEX($C$2:$E$16,,MATCH($H2,$C$1:$E$1,0)),">0",$B$2:$B$16,$I2)
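
To cross-check the spreadsheet results outside Excel, the same averages can be computed with pandas. This is only a verification sketch, not a replacement for the formulas above; it rebuilds the table from the question and reshapes it so that each quantity sits on its own row.

```python
import pandas as pd

df = pd.DataFrame({
    "items": list("abcde") * 3,
    "place": ["VA"] * 5 + ["NJ"] * 5 + ["MO"] * 5,
    "red":    [1, 3, 5, 11, 2,  0, 3, 2, 0, 2,  0, 1, 1, 0, 1],
    "green":  [7, 0, 1, 3, 2,   0, 0, 4, 5, 7,  0, 0, 4, 0, 7],
    "yellow": [9, 19, 0, 4, 1,  3, 9, 0, 6, 1,  5, 4, 0, 5, 1],
})

# Long format: one row per (item, place, category) with its quantity.
long = df.melt(id_vars=["items", "place"], var_name="category", value_name="quantity")

# Mean per place-category pair (the AVERAGE(IF(...)) formula).
avg = long.groupby(["place", "category"])["quantity"].mean()

# Same, but ignoring zero quantities (the AVERAGEIFS formula with ">0").
avg_nonzero = long[long["quantity"] > 0].groupby(["place", "category"])["quantity"].mean()

print(avg)
print(avg_nonzero)
```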
