how to remove duplicates from all sheets in dataframe in python - python-3.x

I have an Excel file with a number of sheets, and I want to delete the duplicates from all sheets. I used the code below:
df = df.drop_duplicates(subset='Month', keep='last')
and then saved the df:
df.to_excel(path, index=False)
but it removes duplicates only from the first sheet, and the saved file contains only one sheet.

I would suggest treating each sheet of your document as a separate DataFrame, then removing the duplicates from each one in a loop according to your criteria. This is a quick draft of the concept I had in mind, for 2 sheets:
import pandas as pd

xls = pd.ExcelFile('myFile.xls')
xls_dfs = []
df1 = pd.read_excel(xls, 'Sheet1')
xls_dfs.append(df1)
df2 = pd.read_excel(xls, 'Sheet2')
xls_dfs.append(df2)

# A single ExcelWriter keeps every deduplicated sheet in the output;
# calling to_excel() with a plain path inside the loop would overwrite
# the file on each iteration and leave only the last sheet
with pd.ExcelWriter('myFile.xls') as writer:
    for i, df in enumerate(xls_dfs):
        df = df.drop_duplicates(subset='Month', keep='last')
        df.to_excel(writer, sheet_name=f'Sheet{i+1}', index=False)
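Alternatively, sheet_name=None reads every sheet in one call. A self-contained sketch (the two-sheet sample workbook and the 'Month'/'Sales' columns are stand-ins for your real file):

```python
import pandas as pd

# Build a small two-sheet workbook to demonstrate (stand-in for your file)
with pd.ExcelWriter('myFile.xlsx') as writer:
    pd.DataFrame({'Month': ['Jan', 'Jan', 'Feb'], 'Sales': [1, 2, 3]}
                 ).to_excel(writer, sheet_name='Sheet1', index=False)
    pd.DataFrame({'Month': ['Mar', 'Mar'], 'Sales': [4, 5]}
                 ).to_excel(writer, sheet_name='Sheet2', index=False)

# sheet_name=None loads every sheet into a dict of {sheet name: DataFrame}
sheets = pd.read_excel('myFile.xlsx', sheet_name=None)

# Drop duplicates per sheet and write each one back under its original name
with pd.ExcelWriter('myFile.xlsx') as writer:
    for name, df in sheets.items():
        df.drop_duplicates(subset='Month', keep='last').to_excel(
            writer, sheet_name=name, index=False)
```

This avoids hard-coding one read_excel call per sheet, so it scales to any number of sheets.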

Related

compare two dfs and delete duplicates from second

I have one df generated weekly; what I want to do is compare it with another dataframe and delete the duplicates from the newly generated one.
I have tried this:
#adding master column to old df
df['master'] = 'master'
df.set_index('master', append=True, inplace=True)
#dropping duplicates from new df
new_df.drop_duplicates(keep=False, inplace=True)
#adding daily column to newly generated df
new_df['daily'] = 'daily'
new_df.set_index('daily', append=True, inplace=True)
#merging both dfs (DataFrame.append is deprecated; use pd.concat instead)
merged = pd.concat([df, new_df])
#dropping duplicates from merged df
merged = merged.drop_duplicates().sort_index()
#updating new df with updated df with no duplicates
idx = pd.IndexSlice
new_df = merged.loc[idx[:, 'daily'], :]
but this is not working as expected and is not deleting duplicates
In case your rows are not identical, you need to specify the relevant column names in the subset argument. You can also use keep to retain the first or last occurrence, etc.
E.g.
df.drop_duplicates(subset=['brand', 'style'], keep='last')
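For the cross-DataFrame case in the question, one common sketch is an anti-join via merge with indicator=True, which keeps only the rows of the new frame that do not appear in the old one (the 'master'/'new_df' frames and their columns here are hypothetical, assuming both frames share the same columns):

```python
import pandas as pd

# Hypothetical weekly data: 'master' is the old frame, 'new_df' the fresh one
master = pd.DataFrame({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
new_df = pd.DataFrame({'id': [2, 3, 4], 'value': ['b', 'c', 'd']})

# Left merge on all shared columns; indicator=True adds a '_merge' column
# marking each row as 'both' or 'left_only'
merged = new_df.merge(master, how='left', indicator=True)

# Keep only rows that exist solely in new_df, then drop the helper column
new_only = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
```

This sidesteps the MultiIndex bookkeeping entirely: no marker columns or index slicing are needed.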

Split Pandas dataframe into multiple dataframes based on empty column delimiter

I'm reading the following excel sheet into a dataframe.
I want to split it into three dataframes by product. The tables will always be delimited by a single blank column in between, but each table can have a different number of columns.
Based on the article introduced in the comment, you can process it as follows.
import pandas as pd

#### Read excel file into a dataframe
df = pd.read_excel('test.xlsx', index_col=None, header=None)

#### Find the empty columns and list them
empcols = [col for col in df.columns if df[col].isnull().all()]
df.fillna('', inplace=True)

#### Split into consecutive columns of valid data
allcols = list(range(len(df.columns)))
start = 0
colslist = []
for sepcol in empcols:
    colslist.append(allcols[start:sepcol])
    start = sepcol + 1
colslist.append(allcols[start:])

#### Extract each block of valid columns and store it in a dictionary
dfdic = {}
for i in range(len(colslist)):
    wkdf = df.iloc[:, colslist[i]].copy()  # copy avoids SettingWithCopyWarning
    title = ''.join(wkdf.iloc[0].tolist())
    wkcols = wkdf.iloc[1].tolist()
    wkdf.drop(wkdf.index[[0, 1]], inplace=True)
    wkdf.columns = wkcols
    dfdic[title] = wkdf.reset_index(drop=True)

#### Display each DataFrame stored in the dictionary
for k in dfdic.keys():
    print(k)
    print(dfdic[k])
    print()

How to read and store names of all the columns from multiple sheets in excel using Python?

I have 25 sheets in the excel file and I want the list of column names(top row/header) from each of the sheets.
Can you specify how you want the answers collected? Do you want all the column names from each sheet in the same list or dataframe?
Assuming you want to collect the results into one DataFrame, where each row represents one sheet and each column holds one column name: the general idea is to call pd.read_excel() in a loop, specifying a different sheet each time.
import pandas as pd
import numpy as np

n_sheets = 25
int_sheet_names = np.arange(0, n_sheets, 1)
df = pd.DataFrame()
for i in int_sheet_names:
    sheet_i_col_names = pd.read_excel('file.xlsx', sheet_name=i, header=None, nrows=1)
    df = pd.concat([df, sheet_i_col_names])  # DataFrame.append is deprecated
The resulting DataFrame can be further manipulated based on your specific requirements.
Output from my example excel sheet, which only had 4 sheets
Alternatively, you can pass a list to the sheet_name argument. In this case you get back a dictionary, which I find to be less useful. Note that int_sheet_names must then be a list, not a numpy array.
n_sheets = 25
int_sheet_names = list(range(0, n_sheets))
sheet_dict = pd.read_excel('file.xlsx', sheet_name=int_sheet_names, header=None, nrows=1)
Output as a dictionary when passing a list to sheet_name kwarg
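If you would rather not hard-code the number of sheets at all, pd.ExcelFile exposes the sheet names directly. A self-contained sketch (the two-sheet sample file and its column names are placeholders):

```python
import pandas as pd

# Create a small sample workbook so the sketch runs on its own
with pd.ExcelWriter('file.xlsx') as writer:
    pd.DataFrame(columns=['a', 'b']).to_excel(writer, sheet_name='S1', index=False)
    pd.DataFrame(columns=['c']).to_excel(writer, sheet_name='S2', index=False)

# Discover the sheet names instead of hard-coding how many there are;
# nrows=0 reads only the header row of each sheet
xls = pd.ExcelFile('file.xlsx')
headers = {name: pd.read_excel(xls, sheet_name=name, nrows=0).columns.tolist()
           for name in xls.sheet_names}
```

The result maps each sheet name to its list of column names, which is easy to reshape further if needed.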

How to write to an existing excel file with openpyxl, while preserving pivot tables

I have an Excel file with multiple sheets. One sheet contains two pivot tables, a normal table based on the data from the pivots, and some graphs based on the pivots as well.
I am updating the sheets without pivots using the code below. The content for these sheets is generated as DataFrames, and I write each DataFrame straight away.
Method 1
book = xl.load_workbook(fn)
writer = pd.ExcelWriter(fn,engine='openpyxl')
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
DF.to_excel(writer, 'ABC', header=None, startrow=book.active.max_row)
writer.save()
But, when the file is written, the pivot table is converted to plain text. The solution I found to preserve the pivot table is to read and write the workbook using the method below.
Method 2
workbook = load_workbook(filename=updating_file)
sheet = workbook["Pivot"]
pivot = sheet._pivots[0]
# any will do as they share the same cache
pivot.cache.refreshOnLoad = True
workbook.save(filename=updating_file)
This adds an additional row to the pivot table as 'Value' which ruins the values of the tables based on the pivot.
According to here, using pd.ExcelWriter would not preserve pivot tables. The only example I found for updating an existing Excel file from a DataFrame requires pandas' ExcelWriter.
Some help would be highly appreciated, as I am unable to find a method to fulfill both requirements.
The only option I can see so far is to write the data parts with pandas, then drop the existing Pivot sheet and copy a sheet from the original file. But, again, I would have to find a way to clear the table based on the pivot and rewrite it with openpyxl using the 2nd method. (We can't copy sheets between workbooks.)
Stick with your Method 1: if you convert the df to a pivot table in pandas, and then export to excel, it will work.
An example:
import pandas as pd
import numpy as np

# create dataframe
df = pd.DataFrame({'A': ['John', 'Boby', 'Mina', 'Peter', 'Nicky'],
                   'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
                   'C': [27, 23, 21, 23, 24]})
table = pd.pivot_table(df, values='A', index=['B', 'C'],
                       columns=['B'], aggfunc=np.sum)
table.to_excel("filename.xlsx")
Outputs
I found a way to iterate the DataFrame as rows. If it were just a matter of adding rows to the end of the existing table, this would have been much easier. Since I have to insert rows into the middle, I followed the approach below to insert blank rows and write the cell values.
current_sheet.insert_rows(idx=11, amount=len(backend_report_df))
sheet_row_idx = 11
is_valid_row = False
for row in dataframe_to_rows(backend_report_df, index=True, header=True):
    is_valid_row = False
    for col_idx in range(0, len(row)):
        if col_idx == 0 and row[col_idx] is None:
            logger.info("Header row/blank row")
            break
        else:
            is_valid_row = True
            if col_idx != 0:
                current_sheet.cell(row=sheet_row_idx, column=col_idx).value = row[col_idx]
    if is_valid_row:
        sheet_row_idx = sheet_row_idx + 1
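As a side note on the "write to an existing file" part of this question: newer pandas versions (1.3+) can append to an existing workbook directly via ExcelWriter's mode='a'. This still goes through openpyxl, so on its own it does not preserve pivot caches, but it removes the manual writer.book/writer.sheets wiring. A sketch with placeholder file and sheet names:

```python
import pandas as pd

# Create a workbook, then append a second sheet to it
pd.DataFrame({'x': [1, 2]}).to_excel('report.xlsx', sheet_name='ABC', index=False)

# mode='a' opens the existing file; if_sheet_exists controls name clashes
# ('replace' rewrites a clashing sheet, 'overlay' writes into it in place)
with pd.ExcelWriter('report.xlsx', mode='a', engine='openpyxl',
                    if_sheet_exists='replace') as writer:
    pd.DataFrame({'y': [3]}).to_excel(writer, sheet_name='DEF', index=False)
```

Both sheets survive the append, so existing data sheets don't need to be re-written by hand.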

Python: Loop through Excel sheets, assign header info to columns on each sheet, then merge to one file

I am new to Python and trying to automate some tasks. I have an Excel file with 8 sheets, where each sheet has some identifiers at the top, followed by tabular data with headers. Every sheet has the identifiers of interest and the tables in the same locations.
What I want to do is extract some data from the top of each sheet and insert it as columns, remove unwanted rows (after I have assigned some of them to columns) and columns, and then merge everything into one CSV file as output.
The code I have written does the job. It reads in each sheet and performs the operations on it, and I repeat the same process for the next sheet (8 times) before using .concat to merge them.
import pandas as pd
import numpy as np

inputfile = "input.xlsx"
outputfile = "merged.csv"

##LN X: READ FIRST SHEET AND ASSIGN HEADER INFORMATION TO COLUMNS
df1 = pd.read_excel(inputfile, sheet_name=0, usecols="A:N", index=0)

#Define cell locations of fields in the header area to be assigned to columns
#THESE CELL LOCATIONS ARE THE SAME ON ALL SHEETS
A = df1.iloc[3,4]
B = df1.iloc[2,9]
C = df1.iloc[3,9]
D = df1.iloc[5,9]
E = df1.iloc[4,9]

#Insert well header info as columns in data for worksheet1
df1.insert(0,"column_name", A)
df1.insert(1,"column_name", B)
df1.insert(4,"column_name", E)

#Rename the columns in the worksheet1 DataFrame to reflect actual column headers
df1.rename(columns={'Unnamed: 0': 'Header1',
                    'Unnamed: 1': 'Header2', }, inplace=True)

df_merged = pd.concat([df1, df2, df3, df4, df5, df6, df7, df8],
                      ignore_index=True, sort=False)

#LN Y: Remove non-numerical entries
df_merged = df_merged.replace(np.nan, 0)

##Write results to CSV file
df_merged.to_csv(outputfile, index=False)
Since this code will be used on other Excel files with varying numbers of sheets, I am looking for pointers on how to put the repeated per-sheet operations in a loop, basically repeating the steps between LN X and LN Y for each sheet (8 times!!). I am struggling with how to write such a loop. Thanks in advance for your assistance.
df1 = pd.read_excel(inputfile, sheet_name=0, usecols="A:N", index=0)
You should change the argument sheet_name to
sheet_name=None
Then df1 will be a dictionary of DataFrames. Then you can loop over df1 using
for df in df1:
    df1[df].insert(0, "column_name", A)
    ....
Now perform your operations and merge the dfs. You can loop over them again and concatenate them to one final df.
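The loop described above can be sketched end-to-end like this. The sample workbook, cell positions, and column names are placeholders standing in for the real file; adapt them to your layout:

```python
import pandas as pd
import numpy as np

# Build a small sample workbook so the sketch runs on its own
# (in practice this is your real "input.xlsx" with 8 sheets)
raw = pd.DataFrame(np.arange(48).reshape(6, 8))
with pd.ExcelWriter('input.xlsx') as writer:
    raw.to_excel(writer, sheet_name='S1', index=False, header=False)
    raw.to_excel(writer, sheet_name='S2', index=False, header=False)

# sheet_name=None reads every sheet into a {name: DataFrame} dict
sheets = pd.read_excel('input.xlsx', sheet_name=None, header=None)

frames = []
for name, df in sheets.items():
    # Pull a header-area cell (the position is a placeholder)
    A = df.iloc[3, 4]
    df = df.iloc[4:].copy()      # drop the header-area rows
    df.insert(0, 'HeaderA', A)   # attach the header info as a column
    frames.append(df)

# Merge all sheets, replace NaN, and write the combined CSV
df_merged = pd.concat(frames, ignore_index=True).replace(np.nan, 0)
df_merged.to_csv('merged.csv', index=False)
```

Because the loop runs over whatever sheets the dict contains, the same code handles 8 sheets or 80 without modification.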
