Updating multiple columns of df from another df - python-3.x

I have two dataframes, df1 and df2. I want to update some columns (not all) of df1 from the values in the matching df2 columns (the common columns have the same names in both dataframes), based on a key column. df1 can have multiple entries for a given key, but in df2 each key has only one entry.
df2:
   party_id  age person_name    col2
0         1   12       abdjc     abc
1         2   35       fAgBS     sfd
2         3   65        Afdc     shd
3         5   34      Afazbf   qfwjk
4         6   78      asgsdb    fdgd
5         7   35       sdgsd  dsfbds
df1:
   party_id  account_id         product_type  age         dob   status    col2
0         1           1              Current   25  28-01-1994   active    sdag
1         2           2              Savings   31  14-07-1988  pending    asdg
2         3           3                Loans   65  22-07-1954   frozen   sgsdf
3         3           4  Over Draft Facility   93  29-01-1927   active  dsfhgd
4         4           5             Mortgage   93  01-03-1926  pending  sdggsd
In this example I want to update age and col2 in df1 based on the values present in df2. The key column here is party_id.
I tried mapping df2 into a dict keyed by the key column (column-wise, one column at a time). Here key_name = 'party_id' and column_name = 'age':
dict_key = df2[key_name]
dict_value = df2[column_name]
temp_dict = dict(zip(dict_key, dict_value))
and then map it to df1
df1[column_name].map(temp_dict).fillna(df1[column_name])
But the issue is that it only maps one entry, not all entries for that key value. In this example party_id == 3 has multiple entries in df1.
Keys which are not in df2 should keep their original values for that column.
Can anyone help me with an efficient solution, ideally one that updates all the columns at the same time, since my df1 is large (more than 500k rows)?
df2 is of moderate size, around 3k rows.
Thanks

The idea is to use DataFrame.merge with a left join first, then collect the columns which are the same in both DataFrames into cols and replace the missing values with the original values using DataFrame.fillna:
df = df1.merge(df2.drop_duplicates('party_id'), on='party_id', suffixes=('','_'), how='left')
cols = df2.columns.intersection(df1.columns).difference(['party_id'])
df[cols] = df[cols + '_'].rename(columns=lambda x: x.strip('_')).fillna(df[cols])
df = df[df1.columns]
print(df)
   party_id  account_id         product_type   age         dob   status    col2
0         1           1              Current  12.0  28-01-1994   active     abc
1         2           2              Savings  35.0  14-07-1988  pending     sfd
2         3           3                Loans  65.0  22-07-1954   frozen     shd
3         3           4  Over Draft Facility  65.0  29-01-1927   active     shd
4         4           5             Mortgage  93.0  01-03-1926  pending  sdggsd
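For reference, the map-based attempt from the question can also be made to work, by mapping the key column (not the value column) through a per-column lookup. A minimal sketch on small frames mirroring the question's structure:

```python
import pandas as pd

# small frames mirroring the question's structure
df1 = pd.DataFrame({'party_id': [1, 2, 3, 3, 4],
                    'age': [25, 31, 65, 93, 93],
                    'col2': ['sdag', 'asdg', 'sgsdf', 'dsfhgd', 'sdggsd']})
df2 = pd.DataFrame({'party_id': [1, 2, 3, 5],
                    'age': [12, 35, 65, 34],
                    'col2': ['abc', 'sfd', 'shd', 'qfwjk']})

for c in ['age', 'col2']:
    lookup = df2.set_index('party_id')[c]
    # map on the key column so every df1 row with that key is updated;
    # keys missing from df2 fall back to the original df1 values
    df1[c] = df1['party_id'].map(lookup).fillna(df1[c])
```

This loops over the shared columns, so it is slower than the single merge above on very wide frames, but it updates all duplicated keys correctly.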

Related

Is there a Pandas way to group dates that are less than 2 minutes apart in a dataframe?

I would like to make a groupby on my data to put together dates that are close (less than 2 minutes apart).
Here is an example of what I have:
datas = [['A', 51, 'id1', '2020-05-27 05:50:43.346'],
         ['A', 51, 'id2', '2020-05-27 05:51:08.347'],
         ['B', 45, 'id3', '2020-05-24 17:23:55.142'],
         ['B', 45, 'id4', '2020-05-24 17:23:30.141'],
         ['C', 34, 'id5', '2020-05-23 17:31:10.341']]

df = pd.DataFrame(datas, columns=['col1', 'col2', 'cold_id', 'dates'])
The first 2 rows have close dates, the same goes for the 3rd and 4th rows; the 5th row stands alone.
I would like to get something like this:
datas = [['A', 51, 'id1 id2', 'date_1'],
         ['B', 45, 'id3 id4', 'date_2'],
         ['C', 34, 'id5', 'date_3']]

df = pd.DataFrame(datas, columns=['col1', 'col2', 'col_id', 'dates'])
Making it in a pythonic way is not that hard, but I have to do it on a big dataframe, so a pandas approach using the groupby method would be much more efficient.
After applying a datetime conversion to the dates column I tried:
df.groupby([df['dates'].dt.date]).agg(','.join)
but the .dt.date accessor gives one group per day, not one per 2 minutes.
Do you have a solution ?
Thank you
Looking at the output, it seems we are trying to group the dates into 2-minute bins together with col1 and col2.
Code
df['dates'] = pd.to_datetime(df.dates)
(df.groupby([pd.Grouper(key='dates', freq='2 min'), 'col1', 'col2'])
   .agg(','.join)
   .reset_index()
   .sort_values('col1')
   .reset_index(drop=True))
Output
                dates col1  col2  cold_id
0 2020-05-27 05:50:00    A    51  id1,id2
1 2020-05-24 17:22:00    B    45  id3,id4
2 2020-05-23 17:30:00    C    34      id5
Using only the dates, this is what I did to classify your rows:
First of all, convert the dates to timestamps to compare them easily:
from datetime import datetime
import time
df["dates"] = df["dates"].apply(
    lambda x: int(time.mktime(datetime.strptime(x, "%Y-%m-%d %H:%M:%S.%f").timetuple())))
Then, sort them by date :
df = df.sort_values("dates")
Finally, using this answer, I create a group column in order to identify close dates. The first line puts a 1 in the group column wherever the gap to the previous date exceeds the threshold (120 seconds in our case), marking the start of a new group. The second line fills in the group column to remove the NaN values and assign groups:
df.loc[(df.dates.shift() < df.dates - 120),"group"] = 1
df['group'] = df['group'].cumsum().ffill().fillna(0)
This gives me :
col1 col2 cold_id dates group
4 C 34 id5 1590247870 0.00
3 B 45 id4 1590333810 1.00
2 B 45 id3 1590333835 1.00
0 A 51 id1 1590551443 2.00
1 A 51 id2 1590551468 2.00
Now, to concatenate your cold_id values, you group by group and join the cold_id entries of each group using transform:
df["cold_id"] = df.groupby(["group"],as_index=False)["cold_id"].transform(lambda x: ','.join(x))
df = df.drop_duplicates(subset=["cold_id"])
This finally gives you this dataframe :
col1 col2 cold_id dates group
4 C 34 id5 1590247870 0.00
3 B 45 id4,id3 1590333810 1.00
0 A 51 id1,id2 1590551443 2.00
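The same shift-and-cumsum idea can be kept entirely in datetime space with diff and a Timedelta, avoiding the manual epoch conversion. A sketch on the question's data (the space-joined cold_id matches the desired output; which row's date survives per group is a simplification, taking the first):

```python
import pandas as pd

datas = [['A', 51, 'id1', '2020-05-27 05:50:43.346'],
         ['A', 51, 'id2', '2020-05-27 05:51:08.347'],
         ['B', 45, 'id3', '2020-05-24 17:23:55.142'],
         ['B', 45, 'id4', '2020-05-24 17:23:30.141'],
         ['C', 34, 'id5', '2020-05-23 17:31:10.341']]
df = pd.DataFrame(datas, columns=['col1', 'col2', 'cold_id', 'dates'])

df['dates'] = pd.to_datetime(df['dates'])
df = df.sort_values('dates')
# a new group starts wherever the gap to the previous row exceeds 2 minutes
df['group'] = (df['dates'].diff() > pd.Timedelta(minutes=2)).cumsum()
out = (df.groupby('group', as_index=False)
         .agg({'col1': 'first', 'col2': 'first',
               'cold_id': ' '.join, 'dates': 'first'}))
```

Note this groups by consecutive gaps after sorting, so two rows 1 minute apart always land in the same group, unlike the fixed 2-minute bins of pd.Grouper.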

How to replace a Dataframe with another dataframe?

I am passing a single dataframe through several data cleansing steps. One of those steps I am unable to complete without creating another dataframe.
data= {'ID':[1,2], '2020-11-01' :[10,15], '2020-11-02':[43,35]}
df1 = pd.DataFrame.from_dict(data)
df1.head()
ID 2020-11-01 2020-11-02
0 1 10 43
1 2 15 35
I would need to convert those dates to rows, so I used melt:
df2 = df1.melt(id_vars = ["ID"], var_name = "ReportDate", value_name= "Units")
df2.head()
ID ReportDate Units
0 1 2020-11-01 10
1 2 2020-11-01 15
2 1 2020-11-02 43
3 2 2020-11-02 35
Now I need to drop everything from df1 and capture the df2 details in df1.
I tried dropping all columns from df1 (using inplace=True) and then doing:
df1["ID"] = df2["ID"]
df1["ReportDate"] = df2["ReportDate"]
df1["Units"] = df2["Units"]
df1.head()
ID ReportDate Units
0 1 2020-11-01 10
1 2 2020-11-01 15
But I ended up with only 2 rows, since the previous shape of df1 was 2x3.
I need my output to look like
df1.head()
ID ReportDate Units
0 1 2020-11-01 10
1 2 2020-11-01 15
2 1 2020-11-02 43
3 2 2020-11-02 35
How do I get df1 to have all the contents of df2?
I understand the objective is to assign the content of df2 to df1 while making sure that id(df1) does not change through this operation. This seems to do it, though probably not in the most elegant way. The main difference from what you tried is dropping the index as well as the columns:
df1.drop(df1.columns, axis=1, inplace=True)
df1.drop(df1.index, inplace=True)
df1[df2.columns] = df2[df2.columns]
df1.head()
It may be better design to have a function process_data that can be used as such:
df1 = process_data(df1)
then df1 can be changed inside your function, and when returned from the function the result is assigned to the same variable.
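A minimal sketch of that process_data pattern (the function name is the answer's suggestion; the body is just the melt from the question):

```python
import pandas as pd

def process_data(df):
    # reshape the wide date columns into long (ReportDate, Units) rows
    return df.melt(id_vars=['ID'], var_name='ReportDate', value_name='Units')

data = {'ID': [1, 2], '2020-11-01': [10, 15], '2020-11-02': [43, 35]}
df1 = pd.DataFrame(data)
df1 = process_data(df1)  # rebind the name instead of mutating in place
```

Rebinding the name sidesteps the whole drop-and-refill dance, at the cost of df1 no longer being the same object, which only matters if other code holds a reference to the original frame.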

pandas get rows from one dataframe which are existed in other dataframe

I have two dataframes. The dataframes as follows:
df1 is:
            numbers
user_id
0        9154701244
1        9100913773
2        8639988041
3        8092118985
4        8143131334
5        9440609551
6        8309707235
7        8555033317
8        7095451372
9        8919206985
10       8688960416
11       9676230089
12       7036733390
13       9100914771
Its shape is (14, 1).
df2 is
user_id numbers names type duration date_time
0 9032095748 919182206378 ramesh incoming 23 233445445
1 9032095748 918919206983 suresh incoming 45 233445445
2 9032095748 919030785187 rahul incoming 45 233445445
3 9032095748 916281206641 jay incoming 67 233445445
4 jakfnka998nknk 9874654411 query incoming 25 8571228412
5 jakfnka998nknk 9874654112 form incoming 42 678565487
6 jakfnka998nknk 9848022238 json incoming 10 89547212765
7 ukajhj9417fka 9984741215 keert incoming 32 8548412664
8 ukajhj9417fka 9979501984 arun incoming 21 7541344646
9 ukajhj9417fka 95463241 paru incoming 42 945151215451
10 ukajknva939o 7864621215 hari outgoing 34 49829840920
and its shape is (10308, 6).
Here in df1, the numbers column holds multiple unique numbers. These numbers also appear in df2, where they are repeated depending on the duration. I want to get all the rows of df2 whose numbers appear in df1.
Here is the code I've tried, but I'm not able to figure out how to solve this using pandas:
df = pd.concat([df1, df2]) # concat dataframes
df = df.reset_index(drop=True) # reset the index
df_gpby = df.groupby(list(df.columns)) #group by
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1] #reindex
df = df.reindex(idx)
It gives me only the unique numbers column from df2. But I need all the data, including the other columns from the second dataframe.
It would be great if anyone could help me with this. Thanks in advance.
Here is a sample dataframe I created, keeping the gist the same:
df1=pd.DataFrame({"numbers":[123,1234,12345,5421]})
df2=pd.DataFrame({"numbers":[123,1234,12345,123,123,45643],"B":[1,2,3,4,5,6],"C":[2,3,4,5,6,7]})
final_df=df2[df2.numbers.isin(df1.numbers)]
Output DataFrame: the result is every row of df2 whose numbers value is present in df1:
   numbers  B  C
0      123  1  2
1     1234  2  3
2    12345  3  4
3      123  4  5
4      123  5  6
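The same filter can also be expressed as an inner merge, which can be handy when df1 carries extra columns you want to bring along. A sketch on the same sample frames (drop_duplicates guards against repeated keys in df1 multiplying the matches):

```python
import pandas as pd

df1 = pd.DataFrame({'numbers': [123, 1234, 12345, 5421]})
df2 = pd.DataFrame({'numbers': [123, 1234, 12345, 123, 123, 45643],
                    'B': [1, 2, 3, 4, 5, 6],
                    'C': [2, 3, 4, 5, 6, 7]})

# inner merge keeps only the df2 rows whose key appears in df1
final_df = df2.merge(df1.drop_duplicates('numbers'), on='numbers', how='inner')
```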

Python-Splitting dataframe

My goal is to group the dataframe based on the quantity column in the dataframes below.
My dataframe:
df
ordercode quantity
PMC21-AA1U1FBWBJA 1
PMP23-GR1M1FB3CJ 1
PMC11-AA1U1FJWWJA 1
PMC11-AA1U1FBWWJA+I7 2
PMC11-AA1U1FJWWJA 3
PMC11-AA1L1FJWWJA 3
df1:
ordercode quantity
PMC21-AA1U1FBWBJA 1
PMP23-GR1M1FB3CJ 1
PMC11-AA1U1FJWWJA 1
PMC11-AA1U1FBWWJA+I7 2
df2
ordercode quantity
My code:
df = pd.DataFrame(np.concatenate(df.apply(lambda x: [x[0]] * x[1], axis=1).to_numpy()),
                  columns=['ordercode'])
df['quantity'] = 1
df['group'] = sorted(list(range(0, len(df) // 3, 1)) * 4)[0:len(df)]
df = df.groupby(['group', 'ordercode']).sum()
print(df)
With the above code I got the following result in df:
group  ordercode             quantity
0      PMC21-AA1U1FBWBJA            1
       PMP23-GR1M1FB3CJ             1
       PMC11-AA1U1FJWWJA            1
       PMC11-AA1U1FBWWJA+I7         1
1      PMC11-AA1U1FBWWJA+I7         1
       PMC11-AA1U1FJWWJA            3
2      PMC11-AA1L1FJWWJA            3
In group 0 and group 1 the totals are 1+1+1+1 = 4 and 1+3 = 4, i.e. each group's quantity is capped at 4. In group 2 there are no values left to add, so the group is formed from the leftover (here it is 3). In group 0 and group 1 we can see that PMC11-AA1U1FBWWJA+I7's value is split.
No problem with that.
For df1 and df2 it raises a ValueError.
For df1:
ValueError: Length of values does not match length of index
For df2:
ValueError: need at least one array to concatenate
I understand that my df2 is empty and has no index. I tried pd.Series but got the same error again.
How can I solve this problem?
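Since df2 is empty, np.concatenate has nothing to concatenate; one way out is to short-circuit the empty case before expanding. A sketch under the assumption that groups are capped at 4 units (split_into_groups is a hypothetical helper, and its group-size logic is a simplification of the question's expression):

```python
import numpy as np
import pandas as pd

def split_into_groups(df, group_size=4):
    # an empty frame has nothing to expand, so return an empty result
    # instead of letting np.concatenate fail with "need at least one array"
    if df.empty:
        return pd.DataFrame(columns=['ordercode', 'quantity'])
    # repeat each ordercode once per unit of quantity
    expanded = pd.DataFrame({
        'ordercode': np.repeat(df['ordercode'].to_numpy(),
                               df['quantity'].to_numpy())})
    expanded['quantity'] = 1
    # chop the unit rows into fixed-size groups
    expanded['group'] = np.arange(len(expanded)) // group_size
    return expanded.groupby(['group', 'ordercode']).sum()
```

np.repeat also replaces the apply/concatenate step, so the length-mismatch error from the hand-built group list goes away as well.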

Pandas Pivot Table Conditional Counting

I have a simple dataframe:
df = pd.DataFrame({'id': ['a','a','a','b','b'],'value':[0,15,20,30,0]})
df
id value
0 a 0
1 a 15
2 a 20
3 b 30
4 b 0
And I want a pivot table with the number of values greater than zero.
I tried this:
raw = pd.pivot_table(df, index='id',values='value',aggfunc=lambda x:len(x>0))
But it returned this:
value
id
a 3
b 2
What I need:
value
id
a 2
b 1
I read lots of solutions using groupby and filter. Is it possible to achieve this with the pivot_table command alone? If not, what is the best approach?
Thanks in advance
UPDATE
Just to make it clearer why I am avoiding the filter solution: in my real, more complex df I have other columns, like this:
df = pd.DataFrame({'id': ['a','a','a','b','b'],'value':[0,15,20,30,0],'other':[2,3,4,5,6]})
df
id other value
0 a 2 0
1 a 3 15
2 a 4 20
3 b 5 30
4 b 6 0
I need to sum the 'other' column, but when I filter I get this:
df=df[df['value']>0]
raw = pd.pivot_table(df, index='id',values=['value','other'],aggfunc={'value':len,'other':sum})
other value
id
a 7 2
b 5 1
Instead of:
other value
id
a 9 2
b 11 1
Use sum to count the True values created by the condition x > 0:
raw = pd.pivot_table(df, index='id', values='value', aggfunc=lambda x: (x > 0).sum())
print(raw)
value
id
a 2
b 1
As #Wen mentioned, another solution is:
df = df[df['value'] > 0]
raw = pd.pivot_table(df, index='id',values='value',aggfunc=len)
You can filter the dataframe before pivoting:
pd.pivot_table(df.loc[df['value']>0], index='id',values='value',aggfunc='count')
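For the UPDATE case, the conditional count and the full sum of 'other' can be combined in a single pivot_table by giving each column its own aggfunc, so no rows need to be filtered away. A sketch on the extended frame:

```python
import pandas as pd

df = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b'],
                   'value': [0, 15, 20, 30, 0],
                   'other': [2, 3, 4, 5, 6]})

# count value > 0 per id without filtering, so 'other' still sums every row
raw = pd.pivot_table(df, index='id',
                     values=['value', 'other'],
                     aggfunc={'value': lambda x: (x > 0).sum(),
                              'other': 'sum'})
```

This yields other = 9, 11 and value = 2, 1 for ids a and b, matching the desired output in the UPDATE.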
