Making my function that iterates through excel sheets more efficient - python-3.x

I have written the following function for a program that is supposed to search through an Excel file and manipulate data frames, but the function is extremely slow and I am not sure how to make it more efficient. Is there a better way to iterate through Excel sheets than this?
import pandas as pd

def read_masterfile(masterfile_path):
    sheets_dict = pd.ExcelFile(masterfile_path).sheet_names
    for sheet in sheets_dict:
        df = pd.read_excel(masterfile_path, sheet_name = sheet)
        print(sheet)
        print(df.columns)

user_input= input()
masterfile_dir = (r"C:\Users\path\Desktop\July15\masterfile.xlsx")
if user_input == 'y':
    calculated = read_masterfile(masterfile_dir)

By doing the following:
for sheet in sheets_dict:
    df = pd.read_excel(masterfile_path, sheet_name = sheet)
you are reopening and re-parsing the Excel file from scratch for every sheet. I would guess this is what is making your code slow.
You can read all the sheets of an Excel file in one call:
pd.read_excel(file, sheet_name=None)
This returns a dictionary where the keys are sheet names and the values are DataFrames.
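As a minimal sketch of that suggestion (assuming pandas and an Excel engine such as openpyxl are installed; `summarize_sheets` is a hypothetical helper name, not part of pandas), the function below parses the workbook once and then loops over the returned dictionary:

```python
import pandas as pd


def summarize_sheets(sheets):
    """Map each sheet name to its list of column names.

    `sheets` is a dict of {sheet name: DataFrame}, the shape returned by
    pd.read_excel(..., sheet_name=None).
    """
    return {name: list(df.columns) for name, df in sheets.items()}


def read_masterfile(masterfile_path):
    # One call parses the whole workbook; the file is not reopened per sheet.
    sheets = pd.read_excel(masterfile_path, sheet_name=None)
    for name, columns in summarize_sheets(sheets).items():
        print(name)
        print(columns)
    return sheets
```

If only a few sheets are needed, `sheet_name` also accepts a list of names, so the remaining sheets are not turned into DataFrames.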

Related

Appending data from multiple excel files into a single excel file without overwriting using python pandas

Here is my current code below.
I have a specific range of cells (from a specific sheet) that I am pulling out of multiple (~30) excel files. I am trying to pull this information out of all these files to compile into a single new file appending to that file each time. I'm going to manually clean up the destination file for the time being as I will improve this script going forward.
What I currently have works fine for a single sheet, but the destination is overwritten every time I add a new file to the list of files to read.
I've tried adding mode = 'a' and a couple of different ways to concat at the end of my function.
import pandas as pd

def excel_loader(fname, sheet_name, new_file):
    xls = pd.ExcelFile(fname)
    df1 = pd.read_excel(xls, sheet_name, nrows = 20)
    print(df1[1:15])
    writer = pd.ExcelWriter(new_file)
    df1.insert(51, 'Original File', fname)
    df1.to_excel(new_file)

names = ['sheet1.xlsx', 'sheet2.xlsx']
destination = 'destination.xlsx'
for name in names:
    excel_loader(name, 'specific_sheet_name', destination)
Thanks in advance for any help; I can't seem to find an answer to this exact situation on here. Cheers.
Ideally you want to loop through the files and read the data into a list, then concatenate the individual DataFrames, then write the combined DataFrame once. This assumes the data being pulled has the same size/shape in every file and the sheet name is the same. If the sheet name changes between files, look into zip() to pair each filename with its sheet name.
This should get you started:
import pandas as pd
names = ['sheet1.xlsx', 'sheet2.xlsx']
destination = 'destination.xlsx'
sheet_name = 'specific_sheet_name'
# read all files first
df_hold_list = []
for name in names:
    df = pd.read_excel(name, sheet_name=sheet_name, nrows=20)
    df_hold_list.append(df)
# concatenate dfs
df1 = pd.concat(df_hold_list, axis=1)  # axis=1 places the frames side by side (horizontal); use axis=0 to stack rows (vertical)
# write the new file once, after everything has been combined
df1.to_excel(destination)
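If the goal is to append rows from each file and keep track of where they came from, a sketch along the following lines may help. It assumes every file has the same columns; `combine_frames` and `load_ranges` are hypothetical helper names, and the 'Original File' column name is borrowed from the question.

```python
import pandas as pd


def combine_frames(frames_by_file):
    """Stack DataFrames vertically, tagging each row with its source file."""
    tagged = [
        df.assign(**{"Original File": fname})
        for fname, df in frames_by_file.items()
    ]
    # axis=0 (the default) appends rows; ignore_index renumbers them 0..n-1.
    return pd.concat(tagged, ignore_index=True)


def load_ranges(names, sheet_name, nrows=20):
    """Read the same sheet from every file in `names` (needs an Excel engine)."""
    return {
        name: pd.read_excel(name, sheet_name=sheet_name, nrows=nrows)
        for name in names
    }
```

Calling `combine_frames(load_ranges(names, 'specific_sheet_name')).to_excel(destination, index=False)` would then write the destination file a single time, so nothing is overwritten between input files.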

For Loop - Reading in all excel tabs into Panda Df's

I have an .xlsx workbook and I would like to write a function or loop that creates a pandas DataFrame for each tab in it. For example, let's say I have a workbook called book.xlsx with tabs called sheet1 - sheet6. How can I read in the file and create 6 pandas DataFrames (sheet1 - sheet6) from a function or loop?
To load the file:
path = '../files_to_load/my_file.xlsx'
print(path)
excel_file = pd.ExcelFile(path)
print('File uploaded ✔')
To get a specific sheet:
# Get a specific sheet
raw_data = excel_file.parse('sheet1')
Here is an example of the loop:
All of your sheets will be stored in a list, each one as a DataFrame.
import pandas as pd
path = 'my_path/my_file.xlsx'
excel_file = pd.ExcelFile(path)
sheets = []
for sheet in excel_file.sheet_names:
    data = excel_file.parse(sheet)
    sheets.append(data)
You need to set the sheet_name argument to None. This returns a dictionary of sheets stored as DataFrames, keyed by sheet name (an OrderedDict in older pandas versions, as in the output below; newer versions return a plain dict).
>>> dataframes = pd.read_excel(file_name, sheet_name=None)
>>> type(dataframes)
<class 'collections.OrderedDict'>
>>> type(dataframes['first']) # `first` is the name of a sheet
<class 'pandas.core.frame.DataFrame'>

Python Merge Multiple Excel sheets to form a summary sheet

I need to merge data from multiple sheets of an Excel workbook into a new summary sheet using Python. I am using pandas to read the sheets and create the summary sheet. After concatenation the table formatting (header and borders) is lost.
Is there a way to read the source sheets with their formatting and write it to the final sheet?
If that is not possible, how can I format the data after concatenation?
Python Code to concatenate:
import pandas as pd
df = []
xlsFile = "some path excel"
sheetNames = ['Sheet1', 'Sheet2','Sheet3']
for nms in sheetNames:
    data = pd.read_excel(xlsFile, sheet_name = nms, header=None, skiprows=1)
    df.append(data)
final = "some other path excel "
df = pd.concat(df)
df.to_excel(final, index=False, header=None)
[Images: Sheet 1 input data, Sheet 2 input data, Sheet 3 input data, Summary sheet output]
You can try the following code:
df = pd.concat(pd.read_excel('some path excel.xlsx', sheet_name=None), ignore_index=True)
If you set sheet_name=None you can read all the sheets in the workbook at one time.
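Because `pd.concat` accepts a dictionary, the sheet names can also be kept as a column in the summary. The following is a sketch under that assumption; `merge_sheets` is a hypothetical helper name and the 'sheet' column name is chosen here purely for illustration.

```python
import pandas as pd


def merge_sheets(sheets):
    """Concatenate {sheet name: DataFrame} into one frame with a 'sheet' column."""
    # Passing a dict makes its keys the outer level of the resulting index.
    merged = pd.concat(sheets, names=["sheet", "row"])
    # Move the sheet name into a regular column and discard the old row numbers.
    return merged.reset_index(level="sheet").reset_index(drop=True)
```

With a workbook on disk this would be called as `merge_sheets(pd.read_excel(path, sheet_name=None))`. Cell formatting such as borders is not carried over by pandas either way, since it reads cell values only.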
I suggest the xlrd library
(https://secure.simplistix.co.uk/svn/xlrd/trunk/xlrd/doc/xlrd.html?p=4966
and https://github.com/python-excel/xlrd)
It is a good library for reading cell formatting. Note that formatting_info=True is only supported for .xls files, not .xlsx.
from xlrd import open_workbook
path = '/Users/.../Desktop/Workbook1.xls'
wb = open_workbook(path, formatting_info=True)
sheet = wb.sheet_by_name("Sheet1")
cell = sheet.cell(0, 0) # The first cell
print("cell.xf_index is", cell.xf_index)
fmt = wb.xf_list[cell.xf_index]
print("type(fmt) is", type(fmt))
print("Dumped Info:")
fmt.dump()
See also:
Using XLRD module and Python to determine cell font style (italics or not)
and How to read excel cell and retain or detect its format in Python (the code above is taken from that question)

Creating a dictionary from one excel workbook, matching the keys with another workbook, paste values

I hope someone can provide a little help. I'm attempting to pull data from one Excel workbook, titled DownTime, and create a dictionary of coil (product) numbers matched with the "codes" each coil has experienced. I have been able to accomplish this part; it's pretty straightforward.
The part that is tripping me up is how to match the coil numbers with a different Excel workbook and paste in the corresponding "codes".
So here is what I have so far:
import openpyxl
from collections import defaultdict
DT = openpyxl.load_workbook('DownTime.xlsm')
bl2 = DT.get_sheet_by_name('BL2')
CS = openpyxl.load_workbook('CoilSummary.xlsm')
line = CS.get_sheet_by_name('BL2')
#opening needed workbooks with specific worksheets
coil =[]
rc = []
code = defaultdict(set)
cnum = ''
next_row = 2
col = 32
for row in range(2, bl2.max_row + 1):
    coil = bl2['K' + str(row)].value
    rc = bl2['D' + str(row)].value
    code[coil].add(rc)
# Creating a dictionary that represents each coil with corresponding codes
for key,value in code.items():
    cnum = line['B' + str(row)].value
    if cnum == key:
        line.write(next_row, col, value)
        next_row+=1
# Attempting to match coil numbers with dictionary and column B
# if the key is present, paste the value in column AF
CS.close()
DT.close()
A sample output of the dictionary looks as follows:
{'M30434269': {106, 107, 173}, 'M30434270': {132, 424, 106, 173, 188}, 'M30434271': {194, 426, 202, 106, 173}}
Only there are about 22,000 entries.
So to reiterate what I want to accomplish:
I want to take this dictionary that I made from the workbook DownTime, match the keys with a column in CoilSummary, and if the keys match the cell entry, paste the value into a blank cell at the end of the table.
Example:
"CoilNum"    "Date"                 "Shift"  "info1"  "info2"  "Code"
M30322386    03/03/2017 06:48:30    3        1052     1722     ' '
M30322390    03/03/2017 05:18:26    3        703      1662     ' '
I would like to match the "CoilNum" with the keys in the dictionary, and paste the values into "Code".
I hope I explained that well enough. Any help with the code, or point to a website for reference, would be very much appreciated. I just don't want to have to type all of these codes in by hand!
Thank you!
After much research and trial and error, accidentally corrupting excel files and getting generally frustrated with python and excel, I figured it out. Here is what I have:
# -*- coding: utf-8 -*-
# importing tools needed for the code to work
import pyexcel as pe
from collections import defaultdict
import openpyxl as op
coil =''
rc = {}
code = defaultdict(list)
next_row = 2
col = 33
cnum = []
temp = ''
def write_data(code,cnum):
    ''' Used to open a given sheet in a workbook. The code will then compare values
    collected from one column in a specific sheet referred to as "coils" and compares it to a dictionary where the key's are also "coils."
    If the coil number matches, the code will then paste the values in a new workbook. From here the values can be copied by hand and pasted into the excel file of choice.'''
    sheet = pe.get_sheet(file_name="CoilSummaryTesting.xlsx")
    next_row = 2
    lst = []
    while next_row <= len(cnum):
        for key in code.keys():
            for step in cnum:
                if str(step) == str(key):
                    for val in code.values():
                        temp = val
                        lst.append(temp)
                        next_row+=1
                if step!=key:
                    break
        break
    for item in lst:
        sublist = (" ").join(str(item))
        sheet.row+= [sublist]
    sheet.save_as("CoilSummaryTest.xlsx")
    print("\nCoils Compared: ",next_row)
def open_downtime():
    ''' Pull data from a second excel file to obtain the coil numbers with corresponding downtime codes'''
    DT = op.load_workbook('DownTime.xlsm')
    bl2 = DT.get_sheet_by_name('BL2')
    n = 1
    for row in bl2.iter_cols(min_col=11,max_col=11):
        for colD in row:
            code[colD.offset(row=1,column=0).value].append(colD.offset(row=1,column=-7).value)
            n+=1
    print('\nNumber of rows in DownTime file: ',n)
    return code
def open_coil():
    '''Opens the first workbook and sheet to know how many rows are needed for coil comparison.'''
    i = 1
    CSR = op.load_workbook('CoilSummaryTesting.xlsx')
    line_read = CSR.get_sheet_by_name('BL2')
    for rows in line_read.iter_cols(min_col=2, max_col=2):
        for col in rows:
            cnum.append(col.offset(row=1,column=0).value)
            i+=1
    print('\nNumber of rows in CoilSummary file: ',i)
    return write_data(open_downtime(),cnum)

def main():
    sheet = open_coil()

if __name__ == "__main__":
    main()
I understand this is probably not the shortest version of this code and there are probably a lot of ways to get it to paste directly into the excel file of my choice, but I couldn't figure that part out yet.
What I did differently was use pyexcel. It proved to be the easiest option for simply pasting values into rows or columns. Using join, I broke the generated list of lists up so that each sublist is inserted in its own row. For now I save the generated rows to a different Excel workbook, because I kept corrupting workbooks during this exploration; however, if anyone knows how to change this code to eliminate the last step of copying the rows into the desired workbook, please let me know.
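The matching step can also be expressed as a direct dictionary lookup per coil number instead of nested loops. This is only a sketch of that idea, not the code used above: `codes_for_coils` is a hypothetical helper, joining the codes with spaces is an assumption about the desired cell content, and the codes are assumed to be mutually comparable (for example all integers).

```python
def codes_for_coils(code_map, coil_numbers):
    """Return one text cell per coil number, in the same order.

    code_map maps coil number -> iterable of downtime codes.
    Coils without any recorded code get an empty string.
    """
    cells = []
    for coil in coil_numbers:
        # .get() avoids creating new keys when code_map is a defaultdict.
        codes = code_map.get(coil, ())
        # Sort for a stable order (sets are unordered), then join as text.
        cells.append(" ".join(str(c) for c in sorted(codes)))
    return cells
```

With openpyxl, each returned string could then be assigned to the matching row, for example `ws.cell(row=r, column=32, value=text)`, before calling `wb.save(...)`; the column number 32 is taken from the question's `col` variable. When the target is an .xlsm file, loading it with `keep_vba=True` is usually needed to keep its macros on save.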

Way to compare two excel files and CSV file

I need to compare two excel files and a csv file, then write some data from one excel file to another.
It looks like this:
A CSV file with pairs of names to compare, for example (spam, eggs)
A first Excel file with names and their values, for example (spam, 100)
A second Excel file with names only, for example (eggs)
When I feed the second file into the program, I need to use the CSV file to confirm that eggs corresponds to spam, and then save the value 100 next to eggs.
For operating on excel files I'm using openpyxl and for csv I'm using csv.
Can I count on your help? Maybe there are better libraries to do that, because my trials proved to be a total failure.
I solved it myself. It's a somewhat convoluted approach, but it works the way I wanted. I would welcome any tips for improving it.
import openpyxl
import numpy as np
# Read the CSV pairs and map the first column to the second
lines = np.genfromtxt("csvtest.csv", delimiter=";", dtype=None)
compdict = dict()
for i in range(len(lines)):
    compdict[lines[i][0]] = lines[i][1]
wb1 = openpyxl.load_workbook('inputtest.xlsx')
wb2 = openpyxl.load_workbook(filename='spistest.xlsx')
ws = wb1.get_sheet_by_name('Sheet1')
spis = wb2.get_sheet_by_name('Sheet1')  # spis = the list of names and values
for row in ws.iter_rows(min_row=1, max_row=ws.max_row, min_col=1):
    for cell in row:
        if cell.value in compdict:
            # wiersz = row, komorka = cell, cena = price
            for wiersz in spis.iter_rows(min_row=1, max_row=spis.max_row, min_col=1):
                for komorka in wiersz:
                    if komorka.value == compdict[cell.value]:
                        cena = spis.cell(row=komorka.row, column=2)
                        # Copy the value into column 2 of the input workbook
                        ws.cell(row=cell.row, column=2, value=cena.value)
wb1.save('inputtest.xlsx')
wb2.close()
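One possible tip: the nested scan over the second workbook can be avoided by building plain dictionaries first. A sketch, assuming the CSV pairs have been read into `aliases` and the (name, value) rows of the value workbook into `prices`; those two names and `resolve_values` are hypothetical.

```python
def resolve_values(aliases, prices, names):
    """Look up a value for each name through an alias table.

    aliases: input name -> reference name (the CSV pairs)
    prices:  reference name -> value (the workbook that holds the values)
    names:   names found in the workbook being filled in
    Returns {name: value} for every name that can be resolved.
    """
    resolved = {}
    for name in names:
        reference = aliases.get(name)
        # Skip names with no alias or whose alias has no recorded value.
        if reference in prices:
            resolved[name] = prices[reference]
    return resolved
```

With openpyxl, `prices` could be built in one pass with `{row[0]: row[1] for row in spis.iter_rows(min_row=1, max_col=2, values_only=True)}`, so each lookup is a dictionary access rather than a full scan of the sheet.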
