How to fill gaps in rows by comparing with another dataframe using pandas? - python-3.x

I want to compare df1 with df2 and fill in only the blanks, without overwriting other values. I have no idea how to achieve this without overwriting values or creating extra columns.
Can I do this by converting df2 into a dictionary and mapping it onto df1?
df1 = pd.DataFrame({'players name': ['ram', 'john', 'ismael', 'sam', 'karan'],
                    'hobbies': ['jog', '', 'photos', '', 'studying'],
                    'sports': ['cricket', 'basketball', 'chess', 'kabadi', 'volleyball']})
df1:
  players name   hobbies      sports
0          ram       jog     cricket
1         john             basketball
2       ismael    photos       chess
3          sam                 kabadi
4        karan  studying  volleyball
And df2:
df2 = pd.DataFrame({'players name': ['jagan', 'mohan', 'john', 'sam', 'karan'],
                    'hobbies': ['riding', 'tv', 'sliding', 'jumping', 'studying']})
df2:
  players name   hobbies
0        jagan    riding
1        mohan        tv
2         john   sliding
3          sam   jumping
4        karan  studying
I want output like this:
  players name   hobbies      sports
0          ram       jog     cricket
1         john   sliding  basketball
2       ismael    photos       chess
3          sam   jumping      kabadi
4        karan  studying  volleyball
Try this:
df1['hobbies'] = (df1['players name'].map(df2.set_index('players name')['hobbies'])
.fillna(df1['hobbies']))
df1
Output:
players name hobbies sports
0 ram jog cricket
1 john sliding basketball
2 ismael photos chess
3 sam jumping kabadi
4 karan studying volleyball

If the blank cells are actually NaN values:
import numpy as np  # pd.np is deprecated (removed in pandas 2.0); use numpy directly

df1 = pd.DataFrame({"players name": ["ram", "john", "ismael", "sam", "karan"],
                    "hobbies": ["jog", np.nan, "photos", np.nan, "studying"],
                    "sports": ["cricket", "basketball", "chess", "kabadi", "volleyball"]})
then
dicts = df2.set_index("players name")['hobbies'].to_dict()
df1['hobbies'] = df1['hobbies'].fillna(df1['players name'].map(dicts))
output:
players name hobbies sports
0 ram jog cricket
1 john sliding basketball
2 ismael photos chess
3 sam jumping kabadi
4 karan studying volleyball
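Tying the two variants together, here is a self-contained sketch based on the answers above: if the blanks are empty strings rather than NaN, one option is to normalise them to NaN first and then apply the same fillna/map pattern.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'players name': ['ram', 'john', 'ismael', 'sam', 'karan'],
                    'hobbies': ['jog', '', 'photos', '', 'studying'],
                    'sports': ['cricket', 'basketball', 'chess', 'kabadi', 'volleyball']})
df2 = pd.DataFrame({'players name': ['jagan', 'mohan', 'john', 'sam', 'karan'],
                    'hobbies': ['riding', 'tv', 'sliding', 'jumping', 'studying']})

# Normalise empty strings to NaN so fillna can see the blanks
df1['hobbies'] = df1['hobbies'].replace('', np.nan)

# Fill only the missing hobbies from df2, leaving existing values untouched
mapping = df2.set_index('players name')['hobbies']
df1['hobbies'] = df1['hobbies'].fillna(df1['players name'].map(mapping))
```

Either answer works once the blanks are real NaN; replace('', np.nan) bridges the empty-string case.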

Related

Extract the mapping dictionary between two columns in pandas

I have a dataframe as shown below.
df:
id player country_code country
1 messi arg argentina
2 neymar bra brazil
3 tevez arg argentina
4 aguero arg argentina
5 rivaldo bra brazil
6 owen eng england
7 lampard eng england
8 gerrard eng england
9 ronaldo bra brazil
10 marria arg argentina
from the above df, I would like to extract the mapping dictionary that relates the country_code with country columns.
Expected Output:
d = {'arg':'argentina', 'bra':'brazil', 'eng':'england'}
A dictionary has unique keys, so it is possible to convert the Series (with the duplicated country_code values as its index) directly:
d = df.set_index('country_code')['country'].to_dict()
If a country_code can map to more than one country, the last value per code is kept.
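As a small follow-on sketch (column names taken from the question): if the first value per code should win instead of the last, drop the duplicated codes before building the dictionary, since drop_duplicates keeps the first occurrence by default.

```python
import pandas as pd

df = pd.DataFrame({'country_code': ['arg', 'bra', 'arg', 'eng'],
                   'country': ['argentina', 'brazil', 'argentina', 'england']})

# to_dict on a duplicated index keeps the LAST value per key;
# drop_duplicates first so that the FIRST value wins instead
d_first = (df.drop_duplicates('country_code')
             .set_index('country_code')['country']
             .to_dict())
```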

How to merge two data frames with duplicate rows?

I have two data frames, df1 and df2. df1 has repeated values in the name column while the hobby column changes; df2 also has repeated values in the name column. I want to merge both data frames and keep every row.
df1:
name hobby
mike cricket
mike football
jack chess
jack football
jack vollyball
pieter sleeping
pieter cyclying
my df2 is
df2:
name
mike
pieter
jack
mike
pieter
Now I have to merge df2 with df1 on the name column. My resultant df3 should look like this:
df3:
name hobby
mike cricket
mike football
pieter sleeping
pieter cyclying
jack chess
jack football
jack vollyball
mike cricket
mike football
pieter sleeping
pieter cyclying
You want to assign an order to df2, merge on name, then sort by that order:
import numpy as np

(df2.assign(rank=np.arange(len(df2)))
    .merge(df1, on='name')
    .sort_values('rank')
    .drop('rank', axis=1)
)
Output:
name hobby
0 mike cricket
1 mike football
4 pieter sleeping
5 pieter cyclying
8 jack chess
9 jack football
10 jack vollyball
2 mike cricket
3 mike football
6 pieter sleeping
7 pieter cyclying
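A runnable version of the rank trick, rebuilt from the sample frames above. The kind='stable' and the final reset_index are my additions: a stable sort keeps ties in merge order, and reset_index gives a clean 0..n index instead of the shuffled one shown in the output.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'name': ['mike', 'mike', 'jack', 'jack', 'jack', 'pieter', 'pieter'],
                    'hobby': ['cricket', 'football', 'chess', 'football', 'vollyball',
                              'sleeping', 'cyclying']})
df2 = pd.DataFrame({'name': ['mike', 'pieter', 'jack', 'mike', 'pieter']})

# Tag each df2 row with its position, merge in all matching hobbies,
# then restore df2's original row order
df3 = (df2.assign(rank=np.arange(len(df2)))
          .merge(df1, on='name')
          .sort_values('rank', kind='stable')
          .drop('rank', axis=1)
          .reset_index(drop=True))
```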

Pandas merge two dataframe and overwrite rows

I have two data frames that I am trying to combine -
Dataframe 1 -
Product Buyer Date Store
TV Person A 9/18/2018 Boston
DVD Person B 4/10/2018 New York
Blue-ray Player Person C 9/19/2018 Boston
Phone Person A 9/18/2018 Boston
Sound System Person C 3/05/2018 Washington
Dataframe 2 -
Product Type Buyer Date Store
TV Person B 5/29/2018 New York
Phone Person A 2/10/2018 Washington
The first dataframe has about 500k rows while the second dataframe has about 80k rows. At times the second dataframe has additional columns, but I am trying to get the final output to show the same columns as Dataframe 1, with the Dataframe 1 rows updated from Dataframe 2.
The output looks like this -
Product Buyer Date Store
TV Person B 5/29/2018 New York
DVD Person B 4/10/2018 New York
Blue-ray Player Person C 9/19/2018 Boston
Phone Person A 2/10/2018 Washington
Sound System Person C 3/05/2018 Washington
I tried the join but the columns are repeated. Is there an elegant solution to do this?
Edit 1-
I have already tried -
pd.merge(df,df_correction, left_on = ['Product'], right_on = ['Product Type'],how = 'outer')
Product Buyer_x Date_x Store_x Product Type Buyer_y Date_y Store_y
TV Person B 5/29/2018 New York TV Person B 5/29/2018 New York
DVD Person B 4/10/2018 New York NaN NaN NaN NaN
Blue-ray Player Person C 9/19/2018 Boston NaN NaN NaN NaN
Phone Person A 2/10/2018 Washington Phone Person A 2/10/2018 Washington
Sound System Person C 3/05/2018 Washington NaN NaN NaN NaN
I think combine_first is the function you are looking for: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.combine_first.html
Can you try:
df2.rename(columns={'Product Type': 'Product'}).set_index('Product').combine_first(df1.set_index('Product')).reset_index()
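Sketching the combine_first idea out on a trimmed version of the sample data (Date omitted for brevity). The caller's non-NaN values take priority, so the corrections frame goes first and df1 fills the rows it does not cover:

```python
import pandas as pd

df1 = pd.DataFrame({'Product': ['TV', 'DVD', 'Phone'],
                    'Buyer': ['Person A', 'Person B', 'Person A'],
                    'Store': ['Boston', 'New York', 'Boston']})
df2 = pd.DataFrame({'Product Type': ['TV', 'Phone'],
                    'Buyer': ['Person B', 'Person A'],
                    'Store': ['New York', 'Washington']})

# Align both frames on Product; df2's values win,
# df1 supplies the rows df2 does not have
updated = (df2.rename(columns={'Product Type': 'Product'})
              .set_index('Product')
              .combine_first(df1.set_index('Product'))
              .reset_index())
```

One caveat: combine_first returns the sorted union of the two indexes, so the row order changes relative to df1.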

Complex pandas aggregation

I have a table as below :
User_ID Cricket Football Chess Video_ID Category Time
1 200 150 100 111 A Morning
1 200 150 100 222 B Morning
1 200 150 100 111 A Afternoon
1 200 150 100 333 A Morning
2 100 160 80 444 C Evening
2 100 160 80 222 C Evening
2 100 160 80 333 A Morning
2 100 160 80 333 A Morning
The above is a transactional table; each entry represents one event of a user watching a video.
E.g. User_ID 1 has watched videos 4 times.
The videos watched are given in Video_ID: 111, 222, 111, 333.
NOTE:
Video_ID 111 was watched twice by this user.
Cricket, Football, Chess: the values are duplicated in each row, i.e. the number of times User_ID 1 played cricket, football and chess is 200, 150 and 100 (these values repeat in the other rows for that particular User_ID).
Category : Which Category that particular Video_ID belongs to.
Time : What time the Video_ID was watched.
I am trying to get the below information from the table :
User_ID Top_1_Game Top_2_Game Top_1_Cat Top_2_Cat Top_Time
1 Cricket Football A B Morning
2 Football Cricket C A Evening
NOTE: If the counts of two categories are the same, either one can be kept as Top_1_Cat.
It's a bit complex; can anyone help with this?
First get the top value per group of User_ID and Video_ID with Series.value_counts and index[0]:
df1 = df.groupby(['User_ID','Video_ID']).agg(lambda x: x.value_counts().index[0])
Then get second top Category by GroupBy.nth:
s = df1.groupby(level=0)['Category'].nth(1)
Remove duplicates by User_ID with DataFrame.drop_duplicates:
df1 = df1.reset_index().drop_duplicates('User_ID').drop('Video_ID', axis=1)
cols = ['User_ID','Category','Time']
cols1 = df1.columns.difference(cols)
Get top2 games by this solution:
df2 = pd.DataFrame(cols1[np.argsort(-df1[cols1].values, axis=1)[:, :2]],
                   columns=['Top_1_Game', 'Top_2_Game'],
                   index=df1['User_ID'])
Filter Category and Time and rename the columns:
df3 = (df1[cols].set_index('User_ID')
.rename(columns={'Category':'Top_1_Cat','Time':'Top_Time'}))
Join together with DataFrame.join and insert the Top_2_Cat values with DataFrame.insert:
df = df2.join(df3).reset_index()
df.insert(4, 'Top_2_Cat', s.values)
print (df)
User_ID Top_1_Game Top_2_Game Top_1_Cat Top_2_Cat Top_Time
0 1 Cricket Football A B Morning
1 2 Football Cricket C A Evening
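For just the Top_1_Game/Top_2_Game part there is a shorter route, since the game counts are constant within a user: take one row per user and rank the game columns with nlargest. A sketch on a trimmed sample (Video_ID, Category and Time omitted, as they do not affect the game ranking):

```python
import pandas as pd

df = pd.DataFrame({'User_ID': [1, 1, 1, 1, 2, 2, 2, 2],
                   'Cricket': [200, 200, 200, 200, 100, 100, 100, 100],
                   'Football': [150, 150, 150, 150, 160, 160, 160, 160],
                   'Chess': [100, 100, 100, 100, 80, 80, 80, 80]})

games = ['Cricket', 'Football', 'Chess']

# One row per user is enough because the counts repeat on every row
first = df.drop_duplicates('User_ID').set_index('User_ID')[games]

# nlargest(2) sorts the three counts and keeps the top two column names
top2 = first.apply(lambda r: tuple(r.nlargest(2).index), axis=1)
```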

How to apply a fuzzy matching function on the target and reference columns for pandas dataframes

******Edited with Solution Below*******
I have carefully read the guidelines, hope the question is acceptable.
I have two pandas dataframes. I need to apply a fuzzy matching function on the target and reference columns and merge the data based on the similarity score, preserving the original data.
I have checked similar questions, e.g.:
is it possible to do fuzzy match merge with python pandas?
but I am not able to use that solution.
So far I have:
df1 = pd.DataFrame({'NameId': [1,2,3], 'Type': ['Person','Person','Person'], 'RefName': ['robert johnes','lew malinsky','gioberto delle lanterne']})
df2 = pd.DataFrame({'NameId': [1,2,3], 'Type': ['Person','Person','Person'],'TarName': ['roberto johnes','lew malinosky','andreatta della blatta']})
import distance
fulldf = []
for name1 in df1['RefName']:
    for name2 in df2['TarName']:
        if distance.jaccard(name1, name2) < 0.6:
            fulldf.append({'RefName': name1, 'Score': distance.jaccard(name1, name2), 'TarName': name2})
pd_fulldf = pd.DataFrame(fulldf)
How can I include the 'NameId' and 'Type' (and eventual other columns) in the final output e.g.:
df1_NameId RefName df1_Type df1_NewColumn Score df2_NameId TarName df2_Type df2_NewColumn
1 robert johnes Person … 0.0000 1 roberto johnes Person …
Is there a way to code this so that it is easily scalable and can be performed on datasets with hundreds of thousands of rows?
I have solved the original problem by unpacking the dataframes in the loop:
import distance
import pandas as pd
#Create test Dataframes
df1 = pd.DataFrame({'NameId': [1,2,3], 'RefName': ['robert johnes','lew malinsky','gioberto delle lanterne']})
df2 = pd.DataFrame({'NameId': [1,2,3], 'TarName': ['roberto johnes','lew malinosky','andreatta della blatta']})
results = []
#Create two generator objects to loop through each dataframe one row at a time
#Call each dataframe element that you want to have in the final output in the loop
#Append results to the empty list you created
for a, b, c in df1.itertuples():
    for d, e, f in df2.itertuples():
        results.append((a, b, c, distance.jaccard(c, f), e, d, f))
result_df = pd.DataFrame(results)
print(result_df)
I believe what you need is the Cartesian product of TarName and RefName. Applying the distance function to that product gives the required result.
df1["mergekey"] = 0
df2["mergekey"] = 0
df_merged = pd.merge(df1, df2, on = "mergekey")
df_merged["Distance"] = df_merged.apply(lambda x: distance.jaccard(x.RefName, x.TarName), axis = 1)
Result:
NameId_x RefName Type_x mergekey NameId_y TarName Type_y Distance
0 1 robert johnes Person 0 1 roberto johnes Person 0.000000
1 1 robert johnes Person 0 2 lew malinosky Person 0.705882
2 1 robert johnes Person 0 3 andreatta della blatta Person 0.538462
3 2 lew malinsky Person 0 1 roberto johnes Person 0.764706
4 2 lew malinsky Person 0 2 lew malinosky Person 0.083333
5 2 lew malinsky Person 0 3 andreatta della blatta Person 0.666667
6 3 gioberto delle lanterne Person 0 1 roberto johnes Person 0.533333
7 3 gioberto delle lanterne Person 0 2 lew malinosky Person 0.588235
8 3 gioberto delle lanterne Person 0 3 andreatta della blatta Person 0.250000
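On newer pandas (>= 1.2) the dummy mergekey is unnecessary: merge supports how='cross' directly. A sketch below; since the distance package is third-party, its character-set Jaccard distance is reimplemented inline here (an assumption consistent with the 0.000000 score above for names sharing the same character set):

```python
import pandas as pd

def jaccard(a, b):
    # Jaccard distance on character sets: 1 - |A ∩ B| / |A ∪ B|
    sa, sb = set(a), set(b)
    return 1 - len(sa & sb) / len(sa | sb)

df1 = pd.DataFrame({'NameId': [1, 2], 'RefName': ['robert johnes', 'lew malinsky']})
df2 = pd.DataFrame({'NameId': [1, 2], 'TarName': ['roberto johnes', 'lew malinosky']})

# Cartesian product without a dummy key (pandas >= 1.2);
# suffixes disambiguate the overlapping NameId column
pairs = df1.merge(df2, how='cross', suffixes=('_ref', '_tar'))
pairs['Distance'] = pairs.apply(lambda x: jaccard(x.RefName, x.TarName), axis=1)
```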
