Python: Merge multiple Excel sheets to form a summary sheet

I need to merge data from multiple sheets of an Excel workbook into a new summary sheet using Python. I am using pandas to read the source sheets and create the summary sheet. After concatenation the table formatting (header styling and borders) is lost.
Is there a way to read from the source sheets with their formatting and write it to the final sheet?
If that is not possible, how can I format the data after concatenation?
Python code to concatenate:
import pandas as pd

df = []
xlsFile = "some path excel"
sheetNames = ['Sheet1', 'Sheet2', 'Sheet3']
for nms in sheetNames:
    data = pd.read_excel(xlsFile, sheet_name=nms, header=None, skiprows=1)
    df.append(data)

final = "some other path excel"
df = pd.concat(df)
df.to_excel(final, index=False, header=None)
(Screenshots in the original post: Sheet 1 input data, Sheet 2 input data, Sheet 3 input data, and the summary sheet output.)

You can try the following code:
df = pd.concat(pd.read_excel('some path excel.xlsx', sheet_name=None), ignore_index=True)
If you set sheet_name=None you can read all the sheets in the workbook at one time; pandas returns them as a dict of DataFrames, which pd.concat then stitches together.
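If you also need the summary sheet to look formatted again, pandas alone will not carry the source styling across; one option is to re-apply basic formatting with openpyxl after writing. A minimal sketch (the output path, sheet name, and styling choices here are my own assumptions, not from the original answer):

import pandas as pd
from openpyxl.styles import Font, Border, Side

# read every sheet at once and stack them into one frame
combined = pd.concat(pd.read_excel('some path excel.xlsx', sheet_name=None),
                     ignore_index=True)

with pd.ExcelWriter('summary.xlsx', engine='openpyxl') as writer:
    combined.to_excel(writer, sheet_name='Summary', index=False)
    ws = writer.sheets['Summary']          # openpyxl worksheet just written
    thin = Side(style='thin')
    box = Border(left=thin, right=thin, top=thin, bottom=thin)
    for row in ws.iter_rows(min_row=1, max_row=ws.max_row, max_col=ws.max_column):
        for cell in row:
            cell.border = box              # simple borders on every cell
    for cell in ws[1]:                     # row 1 holds the header
        cell.font = Font(bold=True)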

I suggest the xlrd library
(https://secure.simplistix.co.uk/svn/xlrd/trunk/xlrd/doc/xlrd.html?p=4966
and https://github.com/python-excel/xlrd).
It is a good library for that.
from xlrd import open_workbook
path = '/Users/.../Desktop/Workbook1.xls'
wb = open_workbook(path, formatting_info=True)
sheet = wb.sheet_by_name("Sheet1")
cell = sheet.cell(0, 0) # The first cell
print("cell.xf_index is", cell.xf_index)
fmt = wb.xf_list[cell.xf_index]
print("type(fmt) is", type(fmt))
print("Dumped Info:")
fmt.dump()
See also:
Using XLRD module and Python to determine cell font style (italics or not)
and How to read excel cell and retain or detect its format in Python (the code above is adapted from that answer).
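Building on that, here is a short sketch of how to get from the cell's XF record to its font attributes (my own extension of the snippet above, not from the linked answer; note that formatting_info=True is only supported for legacy .xls files, not .xlsx):

from xlrd import open_workbook

wb = open_workbook('/Users/.../Desktop/Workbook1.xls', formatting_info=True)
sheet = wb.sheet_by_name('Sheet1')
cell = sheet.cell(0, 0)

fmt = wb.xf_list[cell.xf_index]       # extended-format (XF) record for the cell
font = wb.font_list[fmt.font_index]   # font record that the XF points to
print('name:', font.name, 'bold:', bool(font.bold), 'italic:', bool(font.italic))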

Related

How to save a new sheet to the beginning of an existing excel workbook?

I found part of the answer in this post, and it was very useful:
https://stackoverflow.com/a/42375263/13765378
However, every time I run this code with new data, a new sheet gets added to the end of the workbook.
After a while, it is quite an effort to get to the sheet that was just added.
Is there a way to add the sheet at the beginning of the workbook, so it will be the default sheet when the workbook is opened?
This will help you; use the second line. It uses the openpyxl module.
Help link: https://openpyxl.readthedocs.io/en/stable/tutorial.html
ws1 = wb.create_sheet("Mysheet") # insert at the end (default)
ws2 = wb.create_sheet("Mysheet", 0) # insert at first position
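A small usage sketch on top of that (the file name is hypothetical; setting wb.active additionally makes the new first sheet the one that opens by default):

from openpyxl import load_workbook

wb = load_workbook('workbook.xlsx')
ws = wb.create_sheet('Mysheet', 0)   # insert at the first position
wb.active = 0                        # open on this sheet by default
wb.save('workbook.xlsx')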
Thanks to Vignesh's answer, and to modifying the code from
writing pandas data frame to existing workbook,
I got the following code to work. (Every time the program is run, a new sheet is created at the beginning of the workbook and filled with new data; the rest of the code just tests the function append_df_to_excel().)
The function append_df_to_excel() seems like overkill for what I need to do, but for now I could not find a better and cleaner way to do it.
I also do not understand why saving the workbook at the end does not save the data.
import os
from openpyxl import load_workbook
import xlsxwriter
import pandas as pd
from datetime import datetime

filename = r'C:\test\test.xlsx'
if not os.path.exists(filename):
    wb = xlsxwriter.Workbook(filename)
    wb.close()

def append_df_to_excel(filename, df, sheet_name='Sheet1', startrow=None,
                       truncate_sheet=False,
                       **to_excel_kwargs):
    """
    Append a DataFrame [df] to existing Excel file [filename]
    into [sheet_name] Sheet.
    If [filename] doesn't exist, then this function will create it.

    Parameters:
      filename : File path or existing ExcelWriter
                 (Example: '/path/to/file.xlsx')
      df : dataframe to save to workbook
      sheet_name : Name of sheet which will contain DataFrame.
                   (default: 'Sheet1')
      startrow : upper left cell row to dump data frame.
                 Per default (startrow=None) calculate the last row
                 in the existing DF and write to the next row...
      truncate_sheet : truncate (remove and recreate) [sheet_name]
                       before writing DataFrame to Excel file
      to_excel_kwargs : arguments which will be passed to `DataFrame.to_excel()`
                        [can be dictionary]
    Returns: None
    """
    # ignore [engine] parameter if it was passed
    if 'engine' in to_excel_kwargs:
        to_excel_kwargs.pop('engine')

    writer = pd.ExcelWriter(filename, engine='openpyxl')

    if not os.path.exists(filename):
        wb = xlsxwriter.Workbook(filename)
        wb.close()

    try:
        # try to open an existing workbook
        writer.book = load_workbook(filename)

        # get the last row in the existing Excel sheet
        # if it was not specified explicitly
        if startrow is None and sheet_name in writer.book.sheetnames:
            startrow = writer.book[sheet_name].max_row

        # truncate sheet
        if truncate_sheet and sheet_name in writer.book.sheetnames:
            # index of [sheet_name] sheet
            idx = writer.book.sheetnames.index(sheet_name)
            # remove [sheet_name]
            writer.book.remove(writer.book.worksheets[idx])
            # create an empty sheet [sheet_name] using old index
            writer.book.create_sheet(sheet_name, idx)
            # writer.book.create_sheet(sheet_name, 0)  # not working

        # copy existing sheets
        writer.sheets = {ws.title: ws for ws in writer.book.worksheets}
    except FileNotFoundError:
        # file does not exist yet, we will create it
        pass

    if startrow is None:
        startrow = 0

    # write out the new sheet
    df.to_excel(writer, sheet_name, startrow=startrow, **to_excel_kwargs)

    # save the workbook
    writer.save()

A = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
df = pd.DataFrame(A, columns=list('XYZ'))
newSheet = "New_" + datetime.now().strftime('%Y-%m-%d_%H%M%S')

wb = load_workbook(filename)
ws = wb.create_sheet(newSheet, 0)
wb.save(filename)

append_df_to_excel(filename, df, sheet_name="Old2", startrow=1, startcol=1)
append_df_to_excel(filename, df, sheet_name="Old3", index=False)
append_df_to_excel(filename, df, sheet_name="Old1", startcol=2, index=False)
append_df_to_excel(filename, df, sheet_name=newSheet, columns=df.columns.values, startrow=0, startcol=0, index=False)
# wb.save(filename)  # Do not do this, or nothing gets written to the workbook

Updating excel sheet with Pandas without overwriting the file

I am trying to update an Excel sheet with Python. I read a specific cell and update it, but pandas overwrites the entire workbook, so I lose the other sheets as well as the formatting. Can anyone tell me how I can avoid that?
Record = pd.read_excel("Myfile.xlsx", sheet_name='Sheet1', index_col=False)
Record.loc[1, 'WORDS'] = int(self.New_Word_box.get())
Record.loc[1, 'STATUS'] = self.Stat.get()
Record.to_excel("Myfile.xlsx", sheet_name='Student_Data', index=False)
My code is above; as you can see, I only want to update a few cells, but it overwrites the entire Excel file. I searched for an answer but couldn't find anything specific.
I appreciate your help.
Update: added more clarifications.
Steps:
1) Read the sheet that needs changes into a dataframe and make the changes in that dataframe.
2) Now the changes are reflected in the dataframe but not in the sheet. Use the following function with the dataframe from step 1 and the name of the sheet to be modified. Use the truncate_sheet param to completely replace the sheet of concern.
The function call would look like this:
append_df_to_excel(filename, df, sheet_name, startrow=0, truncate_sheet=True)
from openpyxl import load_workbook
import pandas as pd

def append_df_to_excel(filename, df, sheet_name="Sheet1", startrow=None,
                       truncate_sheet=False,
                       **to_excel_kwargs):
    """
    Append a DataFrame [df] to existing Excel file [filename]
    into [sheet_name] Sheet.
    If [filename] doesn't exist, then this function will create it.

    Parameters:
      filename : File path or existing ExcelWriter
                 (Example: "/path/to/file.xlsx")
      df : dataframe to save to workbook
      sheet_name : Name of sheet which will contain DataFrame.
                   (default: "Sheet1")
      startrow : upper left cell row to dump data frame.
                 Per default (startrow=None) calculate the last row
                 in the existing DF and write to the next row...
      truncate_sheet : truncate (remove and recreate) [sheet_name]
                       before writing DataFrame to Excel file
      to_excel_kwargs : arguments which will be passed to `DataFrame.to_excel()`
                        [can be dictionary]
    Returns: None
    """
    # ignore [engine] parameter if it was passed
    if "engine" in to_excel_kwargs:
        to_excel_kwargs.pop("engine")

    writer = pd.ExcelWriter(filename, engine="openpyxl")

    # Python 2.x: define [FileNotFoundError] exception if it doesn't exist
    try:
        FileNotFoundError
    except NameError:
        FileNotFoundError = IOError

    if "index" not in to_excel_kwargs:
        to_excel_kwargs["index"] = False

    try:
        # try to open an existing workbook
        if "header" not in to_excel_kwargs:
            to_excel_kwargs["header"] = True
        writer.book = load_workbook(filename)

        # get the last row in the existing Excel sheet
        # if it was not specified explicitly
        if startrow is None and sheet_name in writer.book.sheetnames:
            startrow = writer.book[sheet_name].max_row
            to_excel_kwargs["header"] = False

        # truncate sheet
        if truncate_sheet and sheet_name in writer.book.sheetnames:
            # index of [sheet_name] sheet
            idx = writer.book.sheetnames.index(sheet_name)
            # remove [sheet_name]
            writer.book.remove(writer.book.worksheets[idx])
            # create an empty sheet [sheet_name] using old index
            writer.book.create_sheet(sheet_name, idx)

        # copy existing sheets
        writer.sheets = {ws.title: ws for ws in writer.book.worksheets}
    except FileNotFoundError:
        # file does not exist yet, we will create it
        to_excel_kwargs["header"] = True

    if startrow is None:
        startrow = 0

    # write out the new sheet
    df.to_excel(writer, sheet_name, startrow=startrow, **to_excel_kwargs)

    # save the workbook
    writer.save()
We can't replace the openpyxl engine with xlsxwriter here to write the Excel files, as was asked in a comment; see reference 2.
References:
1) https://stackoverflow.com/a/38075046/6741053
2) xlsxwriter: is there a way to open an existing worksheet in my workbook?
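As a lighter-weight alternative (my own suggestion, not part of the original answer): if the goal is only to change a couple of cells while keeping every other sheet and the formatting intact, the workbook can be loaded and edited in place with openpyxl. A minimal sketch, assuming hypothetical cell positions:

from openpyxl import load_workbook

wb = load_workbook('Myfile.xlsx')   # keeps the other sheets and their styles
ws = wb['Sheet1']
ws['B2'] = 42                       # hypothetical updates to a couple of cells
ws['C2'] = 'done'
wb.save('Myfile.xlsx')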

For Loop - Reading in all excel tabs into Panda Df's

I have an .xlsx workbook and I would like to write a function or loop that creates a pandas DataFrame for each tab in the file. For example, let's say I have a workbook called book.xlsx with tabs named sheet1 through sheet6. How can I read the file and create six DataFrames (sheet1 - sheet6) from a function or loop?
To load the file:
path = '../files_to_load/my_file.xlsx'
print(path)
excel_file = pd.ExcelFile(path)
print('File uploaded ✔')
To get a specific sheet:
# Get a specific sheet
raw_data = excel_file.parse('sheet1')
Here is an example with a loop. You will have all of your sheets stored in a list; each sheet will be a dataframe.
import pandas as pd

path = 'my_path/my_file.xlsx'
excel_file = pd.ExcelFile(path)
sheets = []
for sheet in excel_file.sheet_names:
    data = excel_file.parse(sheet)
    sheets.append(data)
You need to set the sheet_name argument to None; this creates an ordered dictionary of sheets stored as dataframes.
dataframes = pd.read_excel(file_name, sheet_name=None)
>>> type(dataframes)
<class 'collections.OrderedDict'>
>>> type(dataframes['first']) # `first` is the name a sheet
<class 'pandas.core.frame.DataFrame'>
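A short usage sketch following on from that (the file and sheet names are just the ones from the question, used as assumptions):

import pandas as pd

dataframes = pd.read_excel('book.xlsx', sheet_name=None)   # {sheet name: DataFrame}
for name, frame in dataframes.items():
    print(name, frame.shape)          # e.g. sheet1 ... sheet6

# or collapse everything into a single frame
combined = pd.concat(dataframes, ignore_index=True)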

Making my function that iterates through excel sheets more efficient

I have written the following function for a program that is supposed to search through an Excel file and manipulate data frames, but the function is very slow and I am not sure how to make it more efficient. Is there another way to iterate through Excel sheets that works better than this?
def read_masterfile(masterfile_path):
    sheets_dict = pd.ExcelFile(masterfile_path).sheet_names
    for sheet in sheets_dict:
        df = pd.read_excel(masterfile_path, sheet_name=sheet)
        print(sheet)
        print(df.columns)

user_input = input()
masterfile_dir = r"C:\Users\path\Desktop\July15\masterfile.xlsx"
if user_input == 'y':
    calculated = read_masterfile(masterfile_dir)
By doing the following:
for sheet in sheets_dict:
    df = pd.read_excel(masterfile_path, sheet_name=sheet)
you are reopening the Excel file from scratch for every sheet. I would guess that is what makes your code slow.
You can read all the sheets of one Excel file in a single call:
pd.read_excel(file, sheet_name=None)
This returns a dictionary where the keys are sheet names and the values are dataframes.
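A sketch of the same function reading the file only once (same prints as the original; returning the dictionary is my own addition):

import pandas as pd

def read_masterfile(masterfile_path):
    # one pass over the file: {sheet name: DataFrame}
    all_sheets = pd.read_excel(masterfile_path, sheet_name=None)
    for sheet, df in all_sheets.items():
        print(sheet)
        print(df.columns)
    return all_sheets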

Way to compare two excel files and CSV file

I need to compare two Excel files and a CSV file, then write some data from one Excel file to the other.
It looks like this:
CSV file with the name pairs I will compare, for example (spam, eggs).
First Excel file with a name and its value, for example (spam, 100).
Second Excel file with a name only, for example (eggs).
Now, when I feed the second file into the program, I need to establish via the CSV file that eggs corresponds to spam, and then save the value 100 next to eggs.
For working with the Excel files I'm using openpyxl, and for the CSV I'm using csv.
Can I count on your help? Maybe there are better libraries for this, because my attempts so far have been a total failure.
I figured it out myself. It is a somewhat convoluted approach, but it works the way I wanted. I would be glad to get tips on improving it.
import openpyxl
import numpy as np

# build a lookup from the CSV: first column -> second column
lines = np.genfromtxt("csvtest.csv", delimiter=";", dtype=None)
compdict = dict()
for i in range(len(lines)):
    compdict[lines[i][0]] = lines[i][1]

wb1 = openpyxl.load_workbook('inputtest.xlsx')
wb2 = openpyxl.load_workbook(filename='spistest.xlsx')
ws = wb1.get_sheet_by_name('Sheet1')
spis = wb2.get_sheet_by_name('Sheet1')

# for every matching name in the input sheet, find the mapped name
# in the second workbook and copy its value from column 2
for row in ws.iter_rows(min_row=1, max_row=ws.max_row, min_col=1):
    for cell in row:
        if cell.value in compdict:
            for wiersz in spis.iter_rows(min_row=1, max_row=spis.max_row, min_col=1):
                for komorka in wiersz:
                    if komorka.value == compdict[cell.value]:
                        cena = spis.cell(row=komorka.row, column=2)
                        ws.cell(row=cell.row, column=2, value=cena.value)

wb1.save('inputtest.xlsx')
wb2.close()
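One possible simplification, since tips were requested (this is my own sketch, not from the original post, and it assumes the same hypothetical file and sheet names): build a value lookup from the second workbook first, so the nested iter_rows scan becomes a dictionary lookup.

import openpyxl
import numpy as np

# name in the input file -> mapped name, from the CSV
lines = np.genfromtxt("csvtest.csv", delimiter=";", dtype=None)
compdict = {row[0]: row[1] for row in lines}

wb1 = openpyxl.load_workbook('inputtest.xlsx')
wb2 = openpyxl.load_workbook('spistest.xlsx')
ws = wb1['Sheet1']
spis = wb2['Sheet1']

# mapped name (column A of spistest) -> value (column B of the same row)
values = {row[0].value: row[1].value
          for row in spis.iter_rows(min_col=1, max_col=2)}

for row in ws.iter_rows(min_col=1, max_col=1):
    for cell in row:
        mapped = compdict.get(cell.value)
        if mapped is not None and mapped in values:
            ws.cell(row=cell.row, column=2, value=values[mapped])

wb1.save('inputtest.xlsx')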
