Appending data from multiple Excel files into a single Excel file without overwriting using pandas (Python 3)

Here is my current code below.
I have a specific range of cells (from a specific sheet) that I am pulling out of multiple (~30) Excel files, and I am trying to compile that information into a single new file, appending to it each time. I'm going to manually clean up the destination file for the time being, as I will improve this script going forward.
What I currently have works fine for a single file, but I overwrite my destination every time I add a new file to the read-in list.
I've tried adding mode='a' and a couple of different ways to concat at the end of my function.
import pandas as pd

def excel_loader(fname, sheet_name, new_file):
    xls = pd.ExcelFile(fname)
    df1 = pd.read_excel(xls, sheet_name, nrows=20)
    print(df1[1:15])
    writer = pd.ExcelWriter(new_file)
    df1.insert(51, 'Original File', fname)
    df1.to_excel(new_file)

names = ['sheet1.xlsx', 'sheet2.xlsx']
destination = 'destination.xlsx'
for name in names:
    excel_loader(name, 'specific_sheet_name', destination)
Thanks for any help in advance; I can't seem to find an answer to this exact situation on here. Cheers.

Ideally you want to loop through the files and read the data into a list, then concatenate the individual dataframes, then write the resulting dataframe out once. This assumes the data being pulled is the same size/shape and the sheet name is the same. If the sheet name changes per file, look into the zip() function to pair filename/sheet-name tuples.
This should get you started:
names = ['sheet1.xlsx', 'sheet2.xlsx']
destination = 'destination.xlsx'
sheet_name = 'specific_sheet_name'

# read all files first
df_hold_list = []
for name in names:
    xls = pd.ExcelFile(name)
    df = pd.read_excel(xls, sheet_name, nrows=20)
    df_hold_list.append(df)

# concatenate dfs; axis=0 stacks vertically, axis=1 joins horizontally
df1 = pd.concat(df_hold_list, axis=0)

# write the combined frame to the destination once
df1.to_excel(destination)
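If each workbook keeps the data on a differently named sheet, the zip() suggestion above might look like the following sketch. The file names, sheet names, and sample data here are placeholders (not from the question); the example builds its own tiny workbooks so it runs standalone:

```python
import pandas as pd

# build two tiny sample workbooks so the sketch is self-contained
for fname, sheet in [('sheet1.xlsx', 'SheetA'), ('sheet2.xlsx', 'SheetB')]:
    pd.DataFrame({'val': [1, 2]}).to_excel(fname, sheet_name=sheet, index=False)

# zip() pairs each file with its own sheet name
pairs = zip(['sheet1.xlsx', 'sheet2.xlsx'], ['SheetA', 'SheetB'])

df_hold_list = []
for fname, sheet in pairs:
    df = pd.read_excel(fname, sheet_name=sheet, nrows=20)
    df['Original File'] = fname  # track provenance, as in the question
    df_hold_list.append(df)

# stack all frames vertically and write the destination once
combined = pd.concat(df_hold_list, axis=0)
combined.to_excel('destination.xlsx', index=False)
```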

Related

Read an Excel file from a URL

I am trying to read an excel file from the following URL: http://www.ssf.gob.sv/html_docs/boletinesweb/bdiciembre2020/III_Bancos/Cuadro_17.xlsx
I used the code:
ruta_indicadores = 'http://www.ssf.gob.sv/html_docs/boletinesweb/bdiciembre2020/III_Bancos/Cuadro_17.xlsx'
indicadores = pd.read_excel(ruta_indicadores)
But when I run the code the dataframe is empty, even though the file is not, so I don't know why it isn't reading the Excel file.
(A screenshot of the Excel file was attached to the original question.)
The problem is that pd.read_excel() reads the first sheet by default, but the table you want is on a sheet with a specific name, "HOJA1".
Here is the code that worked:
ruta_indicadores = 'http://www.ssf.gob.sv/html_docs/boletinesweb/bdiciembre2020/III_Bancos/Cuadro_17.xlsx'
indicadores = pd.read_excel(ruta_indicadores, sheet_name='HOJA1')
Furthermore, a more robust solution:
ruta_indicadores = 'http://www.ssf.gob.sv/html_docs/boletinesweb/bdiciembre2020/III_Bancos/Cuadro_17.xlsx'
indicadores_dict = pd.read_excel(ruta_indicadores, sheet_name=None)
# remove the empty sheet
sheetname_list = list(filter(lambda x: not indicadores_dict[x].empty, indicadores_dict.keys()))
df_list = [indicadores_dict[s] for s in sheetname_list]
ref. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
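sheet_name=None returns a dict of every sheet, which is handy whenever the sheet name isn't known in advance. A self-contained illustration of the same empty-sheet filtering (using a locally created demo file instead of the URL, with made-up sheet names):

```python
import pandas as pd

# build a workbook with one empty and one populated sheet
with pd.ExcelWriter('demo.xlsx') as writer:
    pd.DataFrame().to_excel(writer, sheet_name='Cache', index=False)
    pd.DataFrame({'a': [1, 2]}).to_excel(writer, sheet_name='HOJA1', index=False)

# sheet_name=None -> {sheet name: DataFrame} for all sheets
sheets = pd.read_excel('demo.xlsx', sheet_name=None)

# keep only the sheets that actually contain data
df_list = [df for name, df in sheets.items() if not df.empty]
```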
Let's first discuss why your code isn't able to print the output, then how to resolve it. The issue:
You are fetching the table directly from the URL, and the workbook contains a cache sheet, so pd.read_excel() does not pick up your primary sheet by default.
Here is how I found that there is another sheet in your data; kindly follow the code given below:
# Import all important-libraries
from openpyxl import load_workbook
# Load 'Cuadro_17.xlsx' Excel Sheet to Workbook
indicadores = load_workbook(filename = "Cuadro_17.xlsx")
# Print Sheet Names of 'Cuadro_17.xlsx'
indicadores.sheetnames
# Output of above Cell:-
['Cognos_Office_Connection_Cache', 'HOJA1']
As you can see, the first sheet is Cognos_Office_Connection_Cache, which holds no usable data.
The appropriate solution in this scenario:
Now we know that the data is stored in the HOJA1 sheet, so we can fetch that specific sheet. Another important thing is that your data uses multi-level indexing, so we have to fetch it accordingly. The code for this is below:
# Import all important-libraries
import pandas as pd
# Store 'URL' in 'ruta_indicadores' Variable
ruta_indicadores = 'http://www.ssf.gob.sv/html_docs/boletinesweb/bdiciembre2020/III_Bancos/Cuadro_17.xlsx'
# Read the Excel file from the URL with pd.read_excel(), specifying the sheet name,
# the rows to skip before the table starts, and a two-row header for the multi-level index
indicadores = pd.read_excel(ruta_indicadores, sheet_name = 'HOJA1', skiprows = 7, header = [0, 1])
# 'Drop' unnecessary 'Column'
indicadores.drop('Unnamed: 0_level_0', axis = 1, inplace = True)
# Rename Child level Column of 'Conceptos'
indicadores.rename(columns={'Unnamed: 1_level_1': ''}, inplace = True)
# Remove 'NaN' Entries from the 'indicadores' Data
indicadores = indicadores.fillna('')
# Print Few records of 'indicadores' Data
indicadores.head()
The full output is too big to print here, so a sample of the output of the above code was attached as an image.
As you can see, we have fetched the table successfully. Hope this solution helps you.

Pandas Copy Values from Rows to other files without disturbing the existing data

I have 20 csv files pertaining to different individuals.
And I have a Main csv file, which is based on the final row values in specific columns. Below are the sample for both kinds of files.
All Individual Files look like this:
alex.csv
name,day,calls,closed,commision($)
alex,25-05-2019,68,6,15
alex,27-05-2019,71,8,20
alex,28-05-2019,65,7,17.5
alex,29-05-2019,68,8,20
stacy.csv
name,day,calls,closed,commision($)
stacy,25-05-2019,82,16,56.00
stacy,27-05-2019,76,13,45.50
stacy,28-05-2019,80,19,66.50
stacy,29-05-2019,79,18,63.00
But the Main File(single day report), which is the output file, looks like this:
name,day,designation,calls,weekly_avg_calls,closed,commision($)
alex,29-05-2019,rep,68,67,8,20
stacy,29-05-2019,sme,79,81,18,63
madhu,29-05-2019,rep,74,77,16,56
gabrielle,29-05-2019,rep,59,61,6,15
I need to copy the values of the columns (calls, closed, commision($)) from each file's last line for the end-of-day report, and populate them into the main file (a template that already has columns like name, day, designation filled in).
How can I write a for or while loop over all the csv files in the "Employee_performance_DB" list?
Employee_performance_DB = ['alex.csv', 'stacy.csv', 'poduzav.csv', 'ankit.csv' .... .... .... 'gabrielle.csv']

for employee_db in Employee_performance_DB:
    read_object = pd.read_csv(employee_db)
    read_object2 = read_object.tail(1)
    read_object2.to_csv("Main_Report.csv", header=False, index=False,
                        columns=["calls", "closed", "commision($)"], mode='a')
How do I copy the values of {calls, closed, commision($)} from the 'Employee_performance_DB' list of files into the exact columns of 'Main_Report.csv' for those exact employees?
Well, as I had no answers for this, it took a while for me to find a solution.
The code below fixed my issue:
# list of all the individual files
employees_list = ['alex.csv', ......, 'stacy.csv']

for employees in employees_list:
    read_object = pd.read_csv(employees)
    read_object2 = read_object.tail(1)
    read_object2.to_csv("Employee_performance_DB.csv", index=False, mode='a', header=False)
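The self-answer above appends whole rows. If the goal is instead to update the existing rows of the main report in place, matching each employee by name as the question asks, one hedged approach is to index both frames by name and assign only the three target columns. The sample data below is made up so the sketch runs on its own:

```python
import pandas as pd

# build a tiny sample employee file and main report so the sketch is self-contained
pd.DataFrame({'name': ['alex', 'alex'], 'day': ['28-05-2019', '29-05-2019'],
              'calls': [65, 68], 'closed': [7, 8],
              'commision($)': [17.5, 20.0]}).to_csv('alex.csv', index=False)
main = pd.DataFrame({'name': ['alex'], 'day': ['29-05-2019'],
                     'designation': ['rep'],
                     'calls': [0], 'closed': [0], 'commision($)': [0.0]})

# take the last row of each employee file and index by name
tails = pd.concat(pd.read_csv(f).tail(1) for f in ['alex.csv']).set_index('name')

# overwrite only the three target columns, matched on name
main = main.set_index('name')
cols = ['calls', 'closed', 'commision($)']
main.loc[tails.index, cols] = tails[cols]
main.reset_index().to_csv('Main_Report.csv', index=False)
```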

Updating excel sheet with Pandas without overwriting the file

I am trying to update an Excel sheet with Python. I read a specific cell and update it accordingly, but pandas overwrites the entire workbook, so I lose the other sheets as well as the formatting. Can anyone tell me how to avoid this?
Record = pd.read_excel("Myfile.xlsx", sheet_name='Sheet1', index_col=False)
Record.loc[1, 'WORDS'] = int(self.New_Word_box.get())
Record.loc[1, 'STATUS'] = self.Stat.get()
Record.to_excel("Myfile.xlsx", sheet_name='Student_Data', index=False)
My code is above; as you can see, I only want to update a few cells, but it overwrites the entire Excel file. I tried to search for an answer but couldn't find one for this specific situation.
Appreciate your help.
Update: Added more clarifications
Steps:
1) Read the sheet which needs changes into a dataframe and make the changes in that dataframe.
2) Now the changes are reflected in the dataframe but not in the sheet. Use the following function with the dataframe from step 1 and the name of the sheet to be modified. Pass truncate_sheet=True to completely replace the sheet of concern.
The function call would be like so:
append_df_to_excel(filename, df, sheet_name, startrow=0, truncate_sheet=True)
from openpyxl import load_workbook
import pandas as pd


def append_df_to_excel(filename, df, sheet_name="Sheet1", startrow=None,
                       truncate_sheet=False,
                       **to_excel_kwargs):
    """
    Append a DataFrame [df] to existing Excel file [filename]
    into [sheet_name] Sheet.
    If [filename] doesn't exist, then this function will create it.

    Parameters:
      filename : File path or existing ExcelWriter
                 (Example: "/path/to/file.xlsx")
      df : dataframe to save to workbook
      sheet_name : Name of sheet which will contain DataFrame.
                   (default: "Sheet1")
      startrow : upper left cell row to dump data frame.
                 Per default (startrow=None) calculate the last row
                 in the existing DF and write to the next row...
      truncate_sheet : truncate (remove and recreate) [sheet_name]
                       before writing DataFrame to Excel file
      to_excel_kwargs : arguments which will be passed to `DataFrame.to_excel()`
                        [can be a dictionary]
    Returns: None
    """
    # ignore [engine] parameter if it was passed
    if "engine" in to_excel_kwargs:
        to_excel_kwargs.pop("engine")

    writer = pd.ExcelWriter(filename, engine="openpyxl")

    # Python 2.x: define [FileNotFoundError] exception if it doesn't exist
    try:
        FileNotFoundError
    except NameError:
        FileNotFoundError = IOError

    if "index" not in to_excel_kwargs:
        to_excel_kwargs["index"] = False

    try:
        # try to open an existing workbook
        if "header" not in to_excel_kwargs:
            to_excel_kwargs["header"] = True
        writer.book = load_workbook(filename)

        # get the last row in the existing Excel sheet
        # if it was not specified explicitly
        if startrow is None and sheet_name in writer.book.sheetnames:
            startrow = writer.book[sheet_name].max_row
            to_excel_kwargs["header"] = False

        # truncate sheet
        if truncate_sheet and sheet_name in writer.book.sheetnames:
            # index of [sheet_name] sheet
            idx = writer.book.sheetnames.index(sheet_name)
            # remove [sheet_name]
            writer.book.remove(writer.book.worksheets[idx])
            # create an empty sheet [sheet_name] using old index
            writer.book.create_sheet(sheet_name, idx)

        # copy existing sheets
        writer.sheets = {ws.title: ws for ws in writer.book.worksheets}
    except FileNotFoundError:
        # file does not exist yet, we will create it
        to_excel_kwargs["header"] = True

    if startrow is None:
        startrow = 0

    # write out the new sheet
    df.to_excel(writer, sheet_name, startrow=startrow, **to_excel_kwargs)

    # save the workbook
    writer.save()
Note that we can't swap in the xlsxwriter engine here to write the Excel file, as asked in the comment; see reference 2.
References:
1) https://stackoverflow.com/a/38075046/6741053
2) xlsxwriter: is there a way to open an existing worksheet in my workbook?
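As an aside (my addition, not part of the original answer): on pandas 1.3 and later, replacing a single sheet while keeping the others can also be done with ExcelWriter's built-in append mode, which avoids patching writer.book by hand. A minimal self-contained sketch with made-up data:

```python
import pandas as pd

# create a workbook with two sheets to demonstrate
with pd.ExcelWriter('Myfile.xlsx') as writer:
    pd.DataFrame({'WORDS': [1, 2]}).to_excel(writer, sheet_name='Sheet1', index=False)
    pd.DataFrame({'x': [9]}).to_excel(writer, sheet_name='Other', index=False)

df = pd.DataFrame({'WORDS': [10, 20]})

# mode='a' opens the existing file; if_sheet_exists='replace' (pandas >= 1.3)
# rewrites only Sheet1 and leaves the 'Other' sheet in place
with pd.ExcelWriter('Myfile.xlsx', engine='openpyxl', mode='a',
                    if_sheet_exists='replace') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False)
```

Cell formatting applied in the workbook is still handled by openpyxl here, so heavy custom formatting may not survive the rewrite of the replaced sheet.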

How to save a Dataframe into an excel sheet without deleting other sheets?

I am trying to pull some data from a stock market API and save it in different Excel files. Every stock trades in different timeframes like 1m, 3m, 5m, 15m and so on.
I want to create an Excel file for each stock, with a different sheet for each timeframe.
My code creates an Excel file for a stock (symbol), adds the sheets (1m, 3m, 5m...), saves the file, then pulls the data from the stock market API and saves it into the correct sheet. For example, for ETH/BTC it creates the file and sheets, pulls the "1m" data and saves it into the "1m" sheet.
The code creates the file and the sheets; I tested it.
The problem is that after the dataframe is written into the Excel file, all the other sheets are deleted. I tried to pull all the data for each symbol, but when I opened the Excel file only the last timeframe (1w) had been written and all other sheets were deleted. So please help.
I checked other questions but didn't find the same problem. In the last part I am not trying to add a new sheet; I am trying to save df to an existing sheet.
# the get_bars function pulls the data
def get_bars(symbol, interval):
    .
    .
    .
    return df

...

timeseries = ['1m','3m','5m','15m','30m','1h','2h','4h','6h','12h','1d','1w']

from pandas import ExcelWriter
from openpyxl import load_workbook

for symbol in symbols:
    file = ('C:/Users/mi/Desktop/Kripto/' + symbol + '.xlsx')
    workbook = xlsxwriter.Workbook(file)
    workbook.close()
    wb = load_workbook(file)
    for x in range(len(timeseries)):
        ws = wb.create_sheet(timeseries[x])
    print(wb.sheetnames)
    wb.save(file)
    workbook.close()

    xrpusdt = get_bars(symbol, interval='1m')
    writer = pd.ExcelWriter(file, engine='xlsxwriter')
    xrpusdt.to_excel(writer, sheet_name='1m')
    writer.save()
I think instead of defining the ExcelWriter as a variable, you need to use it in a with statement and use append mode, since you have already created the Excel file with xlsxwriter, like below:
for x in range(len(timeseries)):
    xrpusdt = get_bars(symbol, interval=timeseries[x])
    with pd.ExcelWriter(file, engine='openpyxl', mode='a') as writer:
        xrpusdt.to_excel(writer, sheet_name=timeseries[x])
Also note that in your code above you used a static "1m" interval for the xrpusdt variable; here it is replaced by the loop variable.
Resources:
Pandas ExcelWriter: here you can see the use-case of append mode https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.ExcelWriter.html#pandas.ExcelWriter
Pandas df.to_excel: here you can see how to write to more than one sheet
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html
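The to_excel reference above also documents the other common pattern: skip pre-creating empty sheets entirely and write every timeframe through a single writer, so no sheet ever overwrites another. A self-contained sketch, with a dummy get_bars standing in for the real API call and a made-up file name:

```python
import pandas as pd

timeseries = ['1m', '3m', '5m']

# stand-in for the real get_bars(symbol, interval) API call
def get_bars(symbol, interval):
    return pd.DataFrame({'close': [1.0, 2.0], 'interval': interval})

# one writer per symbol file; each timeframe lands on its own sheet
with pd.ExcelWriter('ETHBTC.xlsx') as writer:
    for interval in timeseries:
        get_bars('ETHBTC', interval).to_excel(writer, sheet_name=interval, index=False)
```

This sidesteps append mode altogether, at the cost of writing the whole workbook in one pass rather than incrementally.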

Way to compare two excel files and CSV file

I need to compare two Excel files and a CSV file, then write some data from one Excel file to the other.
It looks like this:
CSV file with the name pairs I will compare. For example (spam, eggs)
First Excel file with a name and its value. For example (spam, 100)
Second Excel file with a name. For example (eggs)
Now, when I input the second file into the program, I need to check against the CSV file that eggs corresponds to spam, and then save the value of 100 to eggs.
For operating on the Excel files I'm using openpyxl, and for the CSV I'm using csv.
Can I count on your help? Maybe there are better libraries for this, because my attempts so far have been a total failure.
Got it by myself. It's a somewhat convoluted way, but it works like I wanted it to. I'd be glad for tips on improving it.
import openpyxl
import numpy as np

lines = np.genfromtxt("csvtest.csv", delimiter=";", dtype=None)
compdict = dict()
for i in range(len(lines)):
    compdict[lines[i][0]] = lines[i][1]

wb1 = openpyxl.load_workbook('inputtest.xlsx')
wb2 = openpyxl.load_workbook(filename='spistest.xlsx')
ws = wb1.get_sheet_by_name('Sheet1')
spis = wb2.get_sheet_by_name('Sheet1')

for row in ws.iter_rows(min_row=1, max_row=ws.max_row, min_col=1):
    for cell in row:
        if cell.value in compdict:
            for wiersz in spis.iter_rows(min_row=1, max_row=spis.max_row, min_col=1):
                for komorka in wiersz:
                    if komorka.value == compdict[cell.value]:
                        cena = spis.cell(row=komorka.row, column=2)
                        ws.cell(row=cell.row, column=2, value=cena.value)

wb1.save('inputtest.xlsx')
wb2.close()
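Since the asker wondered about better libraries: the same name-to-alias lookup can be done in a few lines of pandas, with merge doing the matching instead of nested loops. The frames below are hypothetical stand-ins for the three input files (the real layouts aren't shown in the question), so this is a sketch of the technique, not a drop-in replacement:

```python
import pandas as pd

# hypothetical stand-ins for the three input files
mapping = pd.DataFrame({'name': ['spam'], 'alias': ['eggs']})   # the CSV pairs
values = pd.DataFrame({'name': ['spam'], 'value': [100]})       # first workbook
targets = pd.DataFrame({'name': ['eggs']})                      # second workbook

# translate each target name to its canonical name via the mapping...
resolved = targets.merge(mapping, left_on='name', right_on='alias', how='left')
# ...then pull the value that belongs to the canonical name
result = resolved.merge(values, left_on='name_y', right_on='name', how='left')
targets['value'] = result['value'].values
```

In a real script the three frames would come from pd.read_csv and pd.read_excel, and the final frame would be written back with to_excel.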
