I have grouped and aggregated transactions per account number (to calculate monthly statistics) and now I want to merge the output with another dataframe on account numbers. However, the account numbers are no longer in the index/columns.
Group transactions per account and month and perform aggregated calculations
df1 = df.groupby(['AcctNr','Month']).sum().groupby(level=0).agg({'Amount': ['mean', 'median', 'max', 'std', percentile(75), iqr]})
df1.columns = ["_".join(x) for x in df1.columns.ravel()]
This produces the following output from df1.columns:
Index(['Amount_mean', 'Amount_median', 'Amount_max', 'Amount_std',
'Amount_percentile_75', 'Amount_iqr', 'UpperBP'],
dtype='object')
When I try to merge with another DF on AcctNr I get:
df3 = df1.merge(df2, on='AcctNr')
KeyError: 'AcctNr'
You need to keep the account number in your df1, because without it you cannot join. If the groupby left AcctNr in the index rather than in the columns, reset it back to a column before merging.
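A minimal sketch, assuming AcctNr is the (named) index of df1 after the aggregation:
# bring the group key back as a regular column, then merge as usual
df1 = df1.reset_index()
df3 = df1.merge(df2, on='AcctNr')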
If you don't need it in your final df you can drop it with
df3 = df3.drop("AcctNr", axis=1)
I have a client data df with 200+ columns, say A,B,C,D...X,Y,Z. This df has a CAMPAIGN_ID column. I have another dataframe, mapping_csv, that has CAMPAIGN_ID and the set of columns I need from df for each campaign. I need to split df into one csv file per campaign, containing only that campaign's rows and only the columns listed for it in mapping_csv.
I am getting type error as below.
TypeError: unhashable type: 'list'
This is what I tried.
for campaign in df['CAMPAIGN_ID'].unique():
    df2 = df[df['CAMPAIGN_ID']==campaign]
    # remove blank columns
    df2.dropna(how='all', axis=1, inplace=True)
    for column in df2.columns:
        if df2[column].unique()[0]=="0000-00-00" and df2[column].unique().shape[0]==1:
            df2 = df2.drop(column, axis=1)
    for column in df2.columns:
        if df2[column].unique()[0]=='0' and df2[column].unique().shape[0]==1:
            df2 = df2.drop(column, axis=1)
    # select required columns
    df2 = df2[mapping_csv.loc[mapping_csv['CAMPAIGN_ID']==campaign, 'Variable_List'].str.replace(" ","").str.split(",")]
    file_shape = df2.shape[0]
    filename = "cart_"+str(dt.date.today().strftime('%Y%m%d'))+"_"+campaign+"_rowcnt_"+str(file_shape)
    df2.to_csv(filename+".csv",index=False)
Any help will be appreciated.
This is what the data looks like -
This is what the mapping looks like -
This addresses your core problem.
import pandas as pd

df = pd.DataFrame(dict(id=['foo','foo','bar','bar'], a=[1,2,3,4], b=[5,6,7,8], c=[1,2,3,4]))
mapper = dict(foo=['a','b'], bar=['b','c'])

for each_id in df.id.unique():
    df_id = df.query(f'id.str.contains("{each_id}")').loc[:, mapper[each_id]]
    print(df_id)
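Applying the same idea back to your original frames, a sketch (assuming mapping_csv has one row per CAMPAIGN_ID with the comma-separated column names in Variable_List, as in the question). Passing a plain list of column names, instead of a Series of lists, is what avoids the unhashable-list error:
import datetime as dt
import pandas as pd

# build {campaign: [columns]} from the mapping file
col_map = {
    row['CAMPAIGN_ID']: [c.strip() for c in row['Variable_List'].split(',')]
    for _, row in mapping_csv.iterrows()
}

for campaign in df['CAMPAIGN_ID'].unique():
    # keep this campaign's rows and only its mapped columns
    df2 = df[df['CAMPAIGN_ID'] == campaign][col_map[campaign]]
    filename = "cart_" + dt.date.today().strftime('%Y%m%d') + "_" + str(campaign) + "_rowcnt_" + str(df2.shape[0])
    df2.to_csv(filename + ".csv", index=False)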
I have two dfs: df1 and df2.
dfs=[df1,df2]
df_final = reduce(lambda left,right: pd.merge(left,right,on='Serial_Nbr'), dfs)
While doing the merge, I want to select only one column from df1 apart from the merge column Serial_Nbr. How do I do this?
Filter the columns in df1, keeping Serial_Nbr plus the one extra column you need (col is a stand-in for its name):
dfs = [df1[['Serial_Nbr', 'col']], df2]
Or, if there are only 2 DataFrames, remove reduce:
df_final = pd.merge(df1[['Serial_Nbr', 'col']], df2, on='Serial_Nbr')
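With the trimmed list, the reduce call from the question runs unchanged; a sketch, with col again standing in for the extra column:
from functools import reduce
import pandas as pd

# merge every frame in the list pairwise on Serial_Nbr
dfs = [df1[['Serial_Nbr', 'col']], df2]
df_final = reduce(lambda left, right: pd.merge(left, right, on='Serial_Nbr'), dfs)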
I have two dataframes.
Dataframe 1
Dataframe 2
The ID column is not unique in either table. I want to compare all the columns in both tables except the IDs and print the unique rows.
Expected output
I tried the 'isin' function, but it is not working. Each dataframe has about 150000 rows and I removed duplicates in both tables. Please advise how to do this.
You can use df.append to combine the dataframes, then df.duplicated to flag the duplicated rows.
df3 = df1.append(df2, ignore_index=True)
df4 = df3.duplicated(subset=['Team', 'name', 'Country', 'Token'], keep=False)
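On newer pandas, where DataFrame.append has been removed, pd.concat does the same job; keeping the rows where the flag is False then leaves only the non-duplicated rows (a sketch, column names taken from the answer above):
import pandas as pd

# stack both frames, flag rows that occur in both, keep the rest
df3 = pd.concat([df1, df2], ignore_index=True)
mask = df3.duplicated(subset=['Team', 'name', 'Country', 'Token'], keep=False)
unique_rows = df3[~mask]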
My data fields are "DATE", "ITEM", "CURRVAL", and I have daily data since 2008 (about 4 million records). I want to transform this so that each "ITEM" is a column with the "DATE" as the index. This will result in about 1,000 columns (one for each "ITEM") and one row per "DATE".
I'm currently creating a new DF, iterating through the unique ITEMs, and merging a new column to the new DF for each. This is very slow. Any tips on how to improve? Thanks!
dfNew = pd.DataFrame()
dfNew["DATE"] = sorted(df["TRADE_DATE"].unique())
dfNew.set_index(["DATE"], inplace=True)

for item in df["ITEM"].unique():
    dfTemp = df[df["ITEM"] == item][["CURRVAL", "TRADE_DATE"]]
    dfTemp.set_index("TRADE_DATE", inplace=True)
    dfNew = dfNew.merge(dfTemp, how="left", left_index=True, right_index=True)
    dfNew.rename(columns={"CURRVAL": item}, inplace=True)
As per the comments, here are solutions using df.pivot() and df.unstack(). pivot does the job more directly:
dfNew = df.copy()
dfNew = dfNew.pivot(index="TRADE_DATE", columns="ITEM", values="CURRVAL")
Or with unstack:
dfNew = df.copy()
dfNew.set_index(["TRADE_DATE", "ITEM"], inplace=True)
dfNew = dfNew.unstack(-1)
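A minimal runnable check of the pivot approach on toy data (the values are made up; the column names come from the question):
import pandas as pd

df = pd.DataFrame({
    "TRADE_DATE": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"],
    "ITEM": ["A", "B", "A", "B"],
    "CURRVAL": [1.0, 2.0, 3.0, 4.0],
})

# one row per date, one column per item
wide = df.pivot(index="TRADE_DATE", columns="ITEM", values="CURRVAL")
print(wide)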
I have several dataframes that I have concatenated with pandas in the line:
xspc = pd.concat([df1,df2,df3], axis = 1, join_axes = [df3.index])
In df2 the index values read one day later than the values of df1 and df3. So, for instance, when the most current date is 7/1/19, the index values for df1 and df3 read '7/1/19' while df2 reads '7/2/19'. I would like to concatenate the series so that each dataframe is joined on the most recent date; in other words, I would like the dataframe values from df1 index value '7/1/19' to be concatenated with dataframe 2 index value '7/2/19' and dataframe 3 index value '7/1/19'. What methods can I use to shift the data around to join on these non-matching index values?
You can reset the index of each dataframe and then concat the dataframes:
df1=df1.reset_index()
df2=df2.reset_index()
df3=df3.reset_index()
df_final = pd.concat([df1,df2,df3],axis=1, join_axes=[df3.index])
This should work since you mentioned that the date in df2 will be one day after df1 or df3
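Note that join_axes was removed from pd.concat in pandas 1.0; reindexing to df3.index afterwards gives the same effect. An alternative, assuming the indexes are DatetimeIndex objects, is to shift df2's dates back one day so the labels line up before concatenating; a sketch:
import pandas as pd

# shift df2's dates back one day so they match df1/df3, then align on the index
df2_shifted = df2.copy()
df2_shifted.index = df2_shifted.index - pd.Timedelta(days=1)
xspc = pd.concat([df1, df2_shifted, df3], axis=1).reindex(df3.index)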