compare two dfs and delete duplicates from second - python-3.x

I have one df generated weekly; what I want to do is compare it with another dataframe and delete duplicates from the newly generated one.
I have tried this:
#adding master column to old df
df['master'] = 'master'
df.set_index('master', append=True, inplace=True)
#dropping duplicates from new df
new_df.drop_duplicates(keep=False, inplace=True)
#adding daily column to newly generated df
new_df['daily'] = 'daily'
new_df.set_index('daily', append=True, inplace=True)
#merging both dfs
merged = df.append(new_df)
#dropping duplicates from merged df
merged = merged.drop_duplicates().sort_index()
#updating new df with updated df with no duplicates
idx = pd.IndexSlice
new_df = merged.loc[idx[:, 'daily'], :]
But this is not working as expected and is not deleting the duplicates.

In case your rows are not identical, you need to set the column names in the subset. You can also use keep to keep the first, last, etc.
E.g.
df.drop_duplicates(subset=['brand', 'style'], keep='last')
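For the cross-DataFrame case in the question, a minimal sketch with made-up data: a left merge with indicator=True marks which rows of the new df already exist in the master, so only the genuinely new rows are kept.
import pandas as pd

# made-up frames standing in for the master df and the weekly one
master = pd.DataFrame({'id': [1, 2, 3], 'val': ['a', 'b', 'c']})
new_df = pd.DataFrame({'id': [2, 3, 4], 'val': ['b', 'c', 'd']})

# merge on all shared columns; the _merge column tells us where each row came from
marked = new_df.merge(master, how='left', indicator=True)
new_df = marked[marked['_merge'] == 'left_only'].drop(columns='_merge')
print(new_df)  # only the (4, 'd') row remains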

Related

How do I give col names for reduce way of merging data frames

I have two dfs, df1 and df2:
dfs=[df1,df2]
df_final = reduce(lambda left,right: pd.merge(left,right,on='Serial_Nbr'), dfs)
I want to select only one column apart from the merge column Serial_Nbr in df1 while doing the merge. How do I do this?
Filter the columns in df1, including any extra column you want to keep alongside the key:
dfs=[df1[['Serial_Nbr']],df2]
Or, with only 2 DataFrames, drop the reduce:
df_final = pd.merge(df1[['Serial_Nbr']], df2, on='Serial_Nbr')
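A runnable sketch of the reduce pattern, with a hypothetical extra column Qty standing in for the one df1 column to carry along (include it next to the key in the filter):
from functools import reduce
import pandas as pd

# made-up data; Qty is the single df1 column we want besides the key
df1 = pd.DataFrame({'Serial_Nbr': [1, 2], 'Qty': [10, 20], 'Other': ['x', 'y']})
df2 = pd.DataFrame({'Serial_Nbr': [1, 2], 'Price': [5.0, 7.5]})

dfs = [df1[['Serial_Nbr', 'Qty']], df2]
df_final = reduce(lambda left, right: pd.merge(left, right, on='Serial_Nbr'), dfs)
print(df_final)  # columns: Serial_Nbr, Qty, Price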

Pandas combining rows as header info

This is how I am reading and creating the dataframe with pandas:
def get_sheet_data(sheet_name='SomeName'):
    df = pd.read_excel(f'{full_q_name}',
                       sheet_name=sheet_name,
                       header=[0, 1],
                       index_col=0)  # .fillna(method='ffill')
    df = df.swapaxes(axis1="index", axis2="columns")
    return df.set_index('Product Code')
Printing this tabularized gives me the following (this will potentially have hundreds of columns):
I can't seem to add those first two rows into the header. I've tried:
How to combine first two rows of pandas dataframe to dataframe header? (https://stackoverflow.com/questions/59837241/combine-first-row-and-header-with-pandas)
and I'm failing at each point. I think it's because of the MultiIndex, not necessarily the axis swap. But the docs at https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.html are kind of going over my head right now. Please help me add those two rows into the header.
The output of df.columns is massive so I've cut it down a lot:
Index(['Product Code','Product Narrative\nHigh-level service description','Product Name','Huawei Product ID','Type','Bill Cycle Alignment',nan,'Stackable',nan,
and ends with:
nan], dtype='object')
We create new column names and set them to df.columns; the new names are generated by joining three pieces: the two MultiIndex header levels and the first row of the DataFrame.
df.columns = ['_'.join(parts) for parts in zip(
    df.columns.get_level_values(0).tolist(),
    df.columns.get_level_values(1).tolist(),
    df.iloc[0, :].replace(np.nan, '').tolist(),
)]
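A small self-contained demo of that line, on made-up two-level columns, including dropping the row that was consumed into the header:
import numpy as np
import pandas as pd

# made-up columns standing in for the two Excel header rows
cols = pd.MultiIndex.from_tuples([('Product', 'Code'), ('Product', 'Name')])
df = pd.DataFrame([['extra1', 'extra2'],        # first data row: third header piece
                   ['A100', 'Widget']], columns=cols)

df.columns = ['_'.join(parts) for parts in zip(
    df.columns.get_level_values(0),
    df.columns.get_level_values(1),
    df.iloc[0, :].replace(np.nan, ''),
)]
df = df.iloc[1:].reset_index(drop=True)  # remove the row now merged into the header
print(df.columns.tolist())  # ['Product_Code_extra1', 'Product_Name_extra2']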

Best way to transform dataframe into a date indexed one w/ many columns

My data fields are "DATE", "ITEM", "CURRVAL", and I have daily data since 2008 (about 4 million records). I want to transform this so that each "ITEM" is a column with the "DATE" as the index. This will result in about 1,000 columns (one for each "ITEM") and one row per "DATE".
I'm currently creating a new DF, iterating through the unique ITEMs, and merging a new column to the new DF for each. This is very slow. Any tips on how to improve? Thanks!
dfNew = pd.DataFrame()
dfNew["DATE"] = sorted(df["TRADE_DATE"].unique())
dfNew.set_index(["DATE"], inplace=True)
for item in df["ITEM"].unique():
    dfTemp = df[df["ITEM"] == item][["CURRVAL", "TRADE_DATE"]]
    dfTemp.set_index("TRADE_DATE", inplace=True)
    dfNew = dfNew.merge(dfTemp, how="left", left_index=True, right_index=True)
    dfNew.rename(columns={"CURRVAL": item}, inplace=True)
As per the comments, the solution uses df.pivot() or df.unstack(). Pivot does the job more directly:
dfNew = df.pivot(index="TRADE_DATE", columns="ITEM", values="CURRVAL")
The unstack equivalent:
dfNew = df.copy()
dfNew.set_index(["TRADE_DATE", "ITEM"], inplace=True)
dfNew = dfNew.unstack(-1)
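A quick check of pivot on made-up data shows the reshape; note that pivot raises a ValueError if a (date, item) pair occurs more than once, in which case pivot_table with an aggregation function is the fallback.
import pandas as pd

df = pd.DataFrame({
    'TRADE_DATE': ['2008-01-01', '2008-01-01', '2008-01-02'],
    'ITEM': ['AAA', 'BBB', 'AAA'],
    'CURRVAL': [1.0, 2.0, 1.5],
})
# one row per date, one column per item
print(df.pivot(index='TRADE_DATE', columns='ITEM', values='CURRVAL'))
# ITEM        AAA  BBB
# TRADE_DATE
# 2008-01-01  1.0  2.0
# 2008-01-02  1.5  NaN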

pandas drop duplicates doesn't return dataframe with duplicates removed

I have a dataframe:
df = pd.DataFrame({'src': ['A', 'B', 'C'], 'trg': ['A', 'C', 'B'], 'wgt': [1, 3, 7]})
I want to drop the duplicates from this dataframe for columns src and trg
df = df.drop_duplicates(subset=['src','trg'],keep='first',inplace=False)
I expected this to drop the first row, where src='A' and trg='A'. But this is not happening; there is no change in the dataframe. What am I doing wrong?
drop_duplicates only removes rows that duplicate another row across the subset columns; here every (src, trg) pair is unique, so nothing is dropped. To remove rows where src equals trg, filter instead:
df = df[df['src'] != df['trg']]
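Running both on the example frame makes the difference visible:
import pandas as pd

df = pd.DataFrame({'src': ['A', 'B', 'C'], 'trg': ['A', 'C', 'B'], 'wgt': [1, 3, 7]})
# no (src, trg) pair repeats, so drop_duplicates returns the frame unchanged
print(df.drop_duplicates(subset=['src', 'trg'], keep='first'))
# the boolean mask is what removes the src == trg row
print(df[df['src'] != df['trg']])  # the (A, A, 1) row is gone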

how to remove duplicates from all sheets in dataframe in python

I have an Excel file with a number of sheets, and I want to delete duplicates from all sheets. I used the code below:
df = df.drop_duplicates(subset='Month',keep='last')
After that I save this df:
df.to_excel(path,index=False)
but it removes duplicates only from the first sheet, and the saved file contains only one sheet.
I would suggest treating each sheet of your document as a separate data frame, then removing the duplicates from each one in a loop according to your criteria. This is a quick draft of the concept, for 2 sheets:
import pandas as pd

xls = pd.ExcelFile('myFile.xls')
xls_dfs = []
df1 = pd.read_excel(xls, 'Sheet1')
xls_dfs.append(df1)
df2 = pd.read_excel(xls, 'Sheet2')
xls_dfs.append(df2)

# each sheet must be written under its own name via ExcelWriter;
# calling to_excel with the file path would overwrite the file on every pass
with pd.ExcelWriter('myFile_dedup.xlsx') as writer:
    for sheet_name, df in zip(['Sheet1', 'Sheet2'], xls_dfs):
        df = df.drop_duplicates(subset='Month', keep='last')
        df.to_excel(writer, sheet_name=sheet_name, index=False)
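A more general sketch, assuming the same file layout (file names hypothetical): sheet_name=None reads every sheet into a dict of DataFrames, so the loop scales beyond two sheets:
import pandas as pd

# sheet_name=None returns a {sheet_name: DataFrame} dict covering every sheet
sheets = pd.read_excel('myFile.xls', sheet_name=None)

with pd.ExcelWriter('myFile_dedup.xlsx') as writer:
    for name, sheet_df in sheets.items():
        deduped = sheet_df.drop_duplicates(subset='Month', keep='last')
        deduped.to_excel(writer, sheet_name=name, index=False)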
