How to replace a Dataframe with another dataframe? - python-3.x

I am passing a single dataframe through various data-cleansing steps. One of those steps I cannot complete without creating another dataframe.
import pandas as pd
data = {'ID': [1, 2], '2020-11-01': [10, 15], '2020-11-02': [43, 35]}
df1 = pd.DataFrame.from_dict(data)
df1.head()
ID 2020-11-01 2020-11-02
0 1 10 43
1 2 15 35
I need to convert those date columns into rows, so I used melt:
df2 = df1.melt(id_vars = ["ID"], var_name = "ReportDate", value_name= "Units")
df2.head()
ID ReportDate Units
0 1 2020-11-01 10
1 2 2020-11-01 15
2 1 2020-11-02 43
3 2 2020-11-02 35
Now I need to drop everything from df1 and copy the df2 details into df1.
I tried to drop all columns from df1 (using inplace=True) and then do:
df1["ID"] = df2["ID"]
df1["ReportDate"] = df2["ReportDate"]
df1["Units"] = df2["Units"]
df1.head()
ID ReportDate Units
0 1 2020-11-01 10
1 2 2020-11-01 15
But I ended up with only 2 rows, since df1 kept its original 2-row index after its columns were dropped (its previous shape was 2x3).
I need my output to look like
df1.head()
ID ReportDate Units
0 1 2020-11-01 10
1 2 2020-11-01 15
2 1 2020-11-02 43
3 2 2020-11-02 35
How do I get df1 to have all the contents of df2?

I understand the objective is to assign the content of df2 to df1 while making sure that id(df1) does not change through this operation. This seems to do it, though probably not in the most elegant way. The main difference from what you tried is dropping the index as well as the columns:
df1.drop(df1.columns, axis=1, inplace=True)
df1.drop(df1.index, inplace=True)
df1[df2.columns] = df2[df2.columns]
df1.head()
It may be better design to have a function process_data that can be used as follows:
df1 = process_data(df1)
Then df1 can be changed inside your function, and when returned from the function it is assigned back to the same variable. A minimal sketch of such a function is shown below.
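A sketch of what process_data could look like here, assuming the melt step above is the whole processing you want to wrap (the function name comes from the answer; the body is illustrative):
def process_data(df):
    # reshape the date columns into ReportDate/Units rows, as with melt above
    return df.melt(id_vars=["ID"], var_name="ReportDate", value_name="Units")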

Related

How can I sort 3 columns and assign it to one python pandas

I have a dataframe:
df = pd.DataFrame({'A': [1,1,1], 'B': [2012,3014,3343], 'C': [12,13,45], 'D': [111,222,444]})
but I need to join the last 3 columns in consecutive order horizontally and assign the result to the first column, something like this:
df2 = {'A': [1,1,1,2,2,2], 'Fusion3': [2012,12,111,3014,13,222]}
I have tried with .melt, but I am struggling with some ideas and would be grateful for your comments.
From the desired output I'm making the assumption that the initial dataframe should have 1, 2, 3 in the A column rather than 1, 1, 1.
import pandas as pd
df= pd.DataFrame({'A':[1,2,3], 'B':[2012,3014,3343], 'C':[12,13,45], 'D':[111,222,444]})
df = df.set_index('A')
df = df.stack().droplevel(1)
will give you this series:
A
1 2012
1 12
1 111
2 3014
2 13
2 222
3 3343
3 45
3 444
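If you then need the two-column shape from the question (Fusion3 is the column name the OP asked for; this reset_index call is just one way to get there, assuming the series above is still named df):
out = df.reset_index(name='Fusion3')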
Check melt
out = df.melt('A').drop(columns='variable')
Out[15]:
A value
0 1 2012
1 2 3014
2 3 3343
3 1 12
4 2 13
5 3 45
6 1 111
7 2 222
8 3 444

Speed Up Pandas Iterations

I have a DataFrame consisting of 3 columns: CustomerID, Amount and Status (Success or Failed).
The DataFrame is not sorted in any way. A CustomerID can repeat multiple times in the DataFrame.
I want to introduce new columns into this DataFrame with below logic:
df['totalamount'] = sum of Amount for each customer where Status was Success.
I already have working code, but it uses df.iterrows and takes too much time. Could you suggest alternative methods such as pandas or NumPy vectorization?
For example, I want to create the 'totalamount' column from the first three columns:
CustomerID Amount Status totalamount
0 1 5 Success 105 # since both transactions were successful
1 2 10 Failed 80 # since one transaction was successful
2 3 50 Success 50
3 1 100 Success 105
4 2 80 Success 80
5 4 60 Failed 0
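For reference, a sketch that builds the sample frame above so the answers below can be run as-is (the values are taken from the table; the column order is an assumption):
import pandas as pd
df = pd.DataFrame({
    'CustomerID': [1, 2, 3, 1, 2, 4],
    'Amount': [5, 10, 50, 100, 80, 60],
    'Status': ['Success', 'Failed', 'Success', 'Success', 'Success', 'Failed'],
})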
Use where to mask the 'Failed' rows with NaN while preserving the length of the DataFrame, then groupby CustomerID and transform the sum of the 'Amount' column to broadcast the result back to every row.
df['totalamount'] = (df.where(df['Status'].eq('Success'))
                       .groupby(df['CustomerID'])['Amount']
                       .transform('sum'))
CustomerID Amount Status totalamount
0 1 5 Success 105.0
1 2 10 Failed 80.0
2 3 50 Success 50.0
3 1 100 Success 105.0
4 2 80 Success 80.0
5 4 60 Failed 0.0
The reason for using where (as opposed to subsetting the DataFrame) is that groupby + sum sums an entirely-NaN group to 0 by default, so we don't need anything extra to deal with CustomerID 4, for instance.
df_new = df.groupby(['CustomerID', 'Status'], sort=False)['Amount'].sum().reset_index()
df_new = (df_new[df_new['Status'] == 'Success']
          .drop(columns='Status')
          .rename(columns={'Amount': 'totalamount'}))
df = pd.merge(df, df_new, on=['CustomerID'], how='left')
I'm not sure at all but I think this may work
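One caveat with this merge approach: customers that have no Success row at all (CustomerID 4 here) end up with NaN in totalamount after the left merge rather than the 0 shown in the expected output, so an extra step would be needed, for example:
df['totalamount'] = df['totalamount'].fillna(0)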

pandas: get rows from one dataframe which exist in another dataframe

I have two dataframes. The dataframes as follows:
df1 is
numbers
user_id
0 9154701244
1 9100913773
2 8639988041
3 8092118985
4 8143131334
5 9440609551
6 8309707235
7 8555033317
8 7095451372
9 8919206985
10 8688960416
11 9676230089
12 7036733390
13 9100914771
its shape is (14, 1)
df2 is
user_id numbers names type duration date_time
0 9032095748 919182206378 ramesh incoming 23 233445445
1 9032095748 918919206983 suresh incoming 45 233445445
2 9032095748 919030785187 rahul incoming 45 233445445
3 9032095748 916281206641 jay incoming 67 233445445
4 jakfnka998nknk 9874654411 query incoming 25 8571228412
5 jakfnka998nknk 9874654112 form incoming 42 678565487
6 jakfnka998nknk 9848022238 json incoming 10 89547212765
7 ukajhj9417fka 9984741215 keert incoming 32 8548412664
8 ukajhj9417fka 9979501984 arun incoming 21 7541344646
9 ukajhj9417fka 95463241 paru incoming 42 945151215451
10 ukajknva939o 7864621215 hari outgoing 34 49829840920
and its shape is (10308, 6)
Here in df1, the numbers column holds multiple unique numbers. These numbers also appear in df2, where they can repeat depending on the duration. I want to get all the data from df2 whose numbers appear in df1.
Here is the code I've tried, but I'm not able to figure out how this can be solved using pandas.
df = pd.concat([df1, df2]) # concat dataframes
df = df.reset_index(drop=True) # reset the index
df_gpby = df.groupby(list(df.columns)) #group by
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1] #reindex
df = df.reindex(idx)
It gives me only the unique numbers that are also in df2, but I need all the data, including the other columns from the second dataframe.
It would be great that anyone can help me on this. Thanks in advance.
Here is a sample dataframe I created, keeping the gist the same.
df1=pd.DataFrame({"numbers":[123,1234,12345,5421]})
df2=pd.DataFrame({"numbers":[123,1234,12345,123,123,45643],"B":[1,2,3,4,5,6],"C":[2,3,4,5,6,7]})
final_df=df2[df2.numbers.isin(df1.numbers)]
Output DataFrame: all rows of df2 whose numbers are present in df1 are returned.
numbers B C
0 123 1 2
1 1234 2 3
2 12345 3 4
3 123 4 5
4 123 5 6
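For the sample frames above, an inner merge expresses the same filter (a sketch; drop_duplicates on df1 keeps each matching df2 row from being repeated if df1 ever contained duplicate numbers):
final_df = df2.merge(df1[['numbers']].drop_duplicates(), on='numbers', how='inner')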

Updating multiple columns of df from another df

I have two dataframes, df1 and df2. I want to update some columns (not all) of df1 from the values in df2's columns (the common columns have the same names in both dataframes), based on a key column. df1 can have multiple entries for a key, but in df2 each key has only one entry.
df2 :
party_id age person_name col2
0 1 12 abdjc abc
1 2 35 fAgBS sfd
2 3 65 Afdc shd
3 5 34 Afazbf qfwjk
4 6 78 asgsdb fdgd
5 7 35 sdgsd dsfbds
df1:
party_id account_id product_type age dob status col2
0 1 1 Current 25 28-01-1994 active sdag
1 2 2 Savings 31 14-07-1988 pending asdg
2 3 3 Loans 65 22-07-1954 frozen sgsdf
3 3 4 Over Draft Facility 93 29-01-1927 active dsfhgd
4 4 5 Mortgage 93 01-03-1926 pending sdggsd
In this example I want to update age and col2 in df1 based on the values present in df2. The key column here is party_id.
I tried mapping df2 into a dict keyed by party_id (column-wise, one column at a time). Here key_name = 'party_id' and column_name = 'age':
dict_key = df2[key_name]
dict_value = df2[column_name]
temp_dict = dict(zip(dict_key, dict_value))
and then map it to df1
df1[column_name].map(temp_dict).fillna(df1[column_name])
But the issue is that it only maps one entry, not all rows for that key value. In this example party_id == 3 has multiple entries in df1.
Keys which are not in df2 should keep their existing values for that column unchanged.
Can anyone help me with an efficient solution? My df1 is large (more than 500k rows), and I'd like all columns to be updated at the same time.
df2 is of moderate size, around 3k rows.
Thanks
The idea is to use DataFrame.merge with a left join first, then collect the columns that are the same in both DataFrames into cols and replace missing values with the original values using DataFrame.fillna:
df = df1.merge(df2.drop_duplicates('party_id'), on='party_id', suffixes=('','_'), how='left')
cols = df2.columns.intersection(df1.columns).difference(['party_id'])
df[cols] = df[cols + '_'].rename(columns=lambda x: x.strip('_')).fillna(df[cols])
df = df[df1.columns]
print (df)
party_id age person_name col2
0 1 25.0 abdjc sdag
1 2 31.0 fAgBS asdg
2 3 65.0 Afdc sgsdf
3 5 34.0 Afazbf qfwjk
4 6 78.0 asgsdb fdgd
5 7 35.0 sdgsd dsfbds
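As an alternative sketch (not part of the answer above), DataFrame.update can express the same intent when both frames are indexed by the key. This assumes party_id is unique in df2; it updates age and col2 for every matching row of df1, and keys missing from df2 are left untouched:
cols_to_update = ['age', 'col2']
df1 = df1.set_index('party_id')
df1.update(df2.set_index('party_id')[cols_to_update])  # aligns on the index, updates in place
df1 = df1.reset_index()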

Pandas: flag a value modification through columns

I have a dataframe like this:
In [24]: df = pd.DataFrame({'id': ['a','a','b','b','c','c'],'date':[201708,201709,201708,201709,201708,201709],'value':[0,15,20,30,20,0]})
In [25]: df
Out[25]:
date id value
0 201708 a 0
1 201709 a 15
2 201708 b 20
3 201709 b 30
4 201708 c 20
5 201709 c 0
And I have this derived pivot table:
In [26]: base=pd.pivot_table(df,index='id',columns='date',values='value',aggfunc='sum',fill_value=0,margins=False)
In [27]: base
Out[27]:
date 201708 201709
id
a 0 15
b 20 30
c 20 0
I need to create another df from this pivot table. In this new dataframe I need to show, for each id, the values that are larger than zero on date = t and were zero on the prior date (date = t-1). The result I need is this df:
date 201708 201709
id
a 0 15
b 0 0
c 0 0
Does anyone know how to achieve this?
Thanks in advance.
Assuming your pivot table above is the dataframe df, use pd.DataFrame.where:
df.where(
    df.gt(0) & df.shift(axis=1).eq(0),
    0
)
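Applied to the pivot table from the question (base is the variable used above; this is the same expression with that name substituted):
result = base.where(base.gt(0) & base.shift(axis=1).eq(0), 0)
print(result)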
