How do I concatenate multiple xlsx files with the same sheet_names. For example,
I have 3 xlsx files, Rob_schedule.xlsx, Mike_schdule.xlsx and Jerome_schedule.xlsx.
Each file has the following sheet/tab names : home, office & school.
The code below generates the 3 xlsx files ( you can copy + paste and run to generate the excel files)
##############################Generating the data for Rob_schedule.xlsx########################
import pandas as pd
import numpy as np
df= {
'Date':[10232020,10242020,10252020,10262020],
'Class':['AP_Bio','AP_Chem','Physics','History'],
'Period':[3,1,2,4]}
school = pd.DataFrame(df,columns = ['Date','Class','Period'])
school
df2= {
'Date':[10232020,10242020,10252020,10262020],
'Meeting':['MQ1','MQ6','MQ2','MQ8'],
'Lunch':[1,1,1,3],
'code':['java','python','C','C++']}
office = pd.DataFrame(df2,columns = ['Date','Meeting','Lunch','code'])
office
df3= {
'cooking':['C','B','D','B'],
'Laundry':['color','white','White','color'],
'cleaning':['balcony','garage','restroom','bathroom']}
home = pd.DataFrame(df3,columns = ['cooking','Laundry','cleaning'])
home
import pandas as pd
#initialze the excel writer
writer = pd.ExcelWriter('Rob_schedule.xlsx', engine='xlsxwriter')
#store your dataframes in a dict, where the key is the sheet name you want
frames = {'home':home, 'office':office,
'school':school}
#now loop thru and put each on a specific sheet
for sheet, frame in frames.items():
frame.to_excel(writer, sheet_name = sheet,index = False)
#critical last step
writer.save()
################################ generating Mike_schedule.xlsx###################################
import pandas as pd
import numpy as np
df= {
'Date':[10232020,10242020,10252020,10262020],
'Class':['AP_Bio','AP_Chem','Physics','History'],
'Period':[3,1,2,4]}
school = pd.DataFrame(df,columns = ['Date','Class','Period'])
school
df2= {
'Date':[10232020,10242020,10252020,10262020],
'Meeting':['MQ1','MQ2','MQ4','MQ5'],
'Lunch':[1,1,1,3],
'code':['javascript','R','C','C++']}
office = pd.DataFrame(df2,columns = ['Date','Meeting','Lunch','code'])
office
df3= {
'cooking':['A','B','D','B'],
'Laundry':['color','white','white','color'],
'cleaning':['patio','garage','living_room','bathroom']}
home = pd.DataFrame(df3,columns = ['cooking','Laundry','cleaning'])
home
#initialze the excel writer
writer = pd.ExcelWriter('Mike_schedule.xlsx', engine='xlsxwriter')
#store your dataframes in a dict, where the key is the sheet name you want
frames = {'home':home, 'office':office,
'school':school}
#now loop thru and put each on a specific sheet
for sheet, frame in frames.items(): # .use .items for python 3.X
frame.to_excel(writer, sheet_name = sheet,index = False)
#critical last step
writer.save()
######################### Generate Jerome schedule###########################################
df= {
'Date':[10232020,10242020,10252020,10262020],
'Class':['French','Math','Physics','History'],
'Period':[3,1,2,4]}
school = pd.DataFrame(df,columns = ['Date','Class','Period'])
school
df2= {
'Date':[10232020,10242020,10252020,10262020],
'Meeting':['MQ1','MQ2','MQ4','MQ5'],
'Lunch':[1,1,1,3],
'code':['javascript','python','R','C++']}
office = pd.DataFrame(df2,columns = ['Date','Meeting','Lunch','code'])
office
df3= {
'cooking':['X','B','D','C'],
'Laundry':['color','white','white','color'],
'cleaning':['patio','garage','living_room','bathroom']}
home = pd.DataFrame(df3,columns = ['cooking','Laundry','cleaning'])
home
import pandas as pd
#initialze the excel writer
writer = pd.ExcelWriter('Jerome_schedule.xlsx', engine='xlsxwriter')
#store your dataframes in a dict, where the key is the sheet name you want
frames = {'home':home, 'office':office,
'school':school}
#now loop thru and put each on a specific sheet
for sheet, frame in frames.items(): # .use .items for python 3.X
frame.to_excel(writer, sheet_name = sheet,index = False)
#critical last step
writer.save()
I want to
concatenate the corresponding sheets/tabs :home, office, and school for Rob_schedule.xlsx,Mike_schedule.xlsx & Jerome_schedule.xlsx
export the concatenated dataframes as family_schedule.xlsx with home, office and school tabs
My attempt:
# This code concatenates all the tabs into one tab, but what I want is to concatenate all by their corresponding sheet/tab names
import pandas as pd
path = os.chdir(r'mypath\\')
files = os.listdir(path)
files
# pull files with `.xlsx` extension
excel_files = [file for file in files if '.xlsx' in file]
excel_files
def create_df_from_excel(file_name):
file = pd.ExcelFile(file_name)
names = file.sheet_names
return pd.concat([file.parse(name) for name in names])
df = pd.concat(
[create_df_from_excel(xl) for xl in excel_files]
)
# save the data frame
writer = pd.ExcelWriter('family_reschedule.xlsx')
df.to_excel(writer, '')
writer.save()
I would iterate over each file, and then over each worksheet, adding each sheet to a different list based on the sheet name.
Then you'll have a structure like...
{
'sheet1': [df_file1_sheet1, df_file2_sheet1, df_file3_sheet1],
'sheet2': [df_file1_sheet2, df_file2_sheet2, df_file3_sheet2],
'sheet3': [df_file1_sheet3, df_file2_sheet3, df_file3_sheet3],
}
Then concatenate each list in to a single dataframe, them write the three dataframes to an excel file.
# This part is just your own code, I've added it here because you
# couldn't figure out where `excel_files` came from
#################################################################
import os
import pandas as pd
path = os.chdir(r'mypath\\')
files = os.listdir(path)
files
# pull files with `.xlsx` extension
excel_files = [file for file in files if '.xlsx' in file]
excel_files
# This part is my actual answer
###############################
from collections import defaultdict
worksheet_lists = defaultdict(list)
for file_name in excel_files:
workbook = pd.ExcelFile(file_name)
for sheet_name in workbook.sheet_names:
worksheet = workbook.parse(sheet_name)
worksheet['source'] = file_name
worksheet_lists[sheet_name].append(worksheet)
worksheets = {
sheet_name: pd.concat(sheet_list)
for (sheet_name, sheet_list)
in worksheet_lists.items()
}
writer = pd.ExcelWriter('family_reschedule.xlsx')
for sheet_name, df in worksheets.items():
df.to_excel(writer, sheet_name=sheet_name, index=False)
writer.save()
Consider building a list of concatenated data frames with list/dict comprehensions by running an outer iteration across sheet names and inner iteration across workbooks:
import pandas as pd
path = "/path/to/workbooks"
workbooks = [f for f in os.listdir(path) if f.endswith(".xlsx")]
sheets = ["home", "office", "school"]
df_dicts = {
sh: pd.concat(
[pd.read_excel(os.path.join(path, wb), sheet_name=sh)
for wb in workbooks]
)
for sh in sheets
}
Then, export to single file:
with pd.ExcelWriter('family_reschedule.xlsx') as writer:
for sh, df in df_dict.items():
df.to_excel(writer, sheet_name=sh, index=False)
writer.save()
I am new to python and trying various things to learn the fundamentals. One of the things that i'm currently stuck on is for loops. I have the following code and am positive it can be built out more efficiently using a loop but i'm not sure exactly how.
import pandas as pd
import numpy as np
url1 = 'https://www.cbssports.com/nfl/stats/player/receiving/nfl/regular/qualifiers/?page=1'
url2 = 'https://www.cbssports.com/nfl/stats/player/receiving/nfl/regular/qualifiers/?page=2'
url3 = 'https://www.cbssports.com/nfl/stats/player/receiving/nfl/regular/qualifiers/?page=3'
df1 = pd.read_html(url1)
df1[0].to_csv ('NFL_Receiving_Page1.csv', index=False) #index false gets rid of index listing that appears as the very first column in the csv
df2 = pd.read_html(url2)
df2[0].to_csv ('NFL_Receiving_Page2.csv', index=False) #index false gets rid of index listing that appears as the very first column in the csv
df3 = pd.read_html(url3)
df3[0].to_csv ('NFL_Receiving_Page3.csv', index=False) #index false gets rid of index listing that appears as the very first column in the csv
df_receiving_agg = pd.concat([df1[0], df2[0], df3[0]])
df_receiving_agg.to_csv('NFL_Receiving_Combined.csv', index=False) #index false gets rid of index listing that appears as the very first column in the csv
I'm ultimately trying to combine the data in the above URL's into a single table in a csv file.
You can try this:
urls = [url1,url2,url3]
df_receiving_agg = pd.DataFrame()
for url in urls:
df = pd.read_html(url)
df_receiving_agg = pd.concat([df_receiving_agg, df])
df_receiving_agg.to_csv('filepath.csv',index=False)
You can do this:
base_url = 'https://www.cbssports.com/nfl/stats/player/receiving/nfl/regular/qualifiers/?page='
dfs = []
for page in range(1, 4):
url = f'{base_url}{page}'
df = pd.read_html(url)
df.to_csv(f'NFL_Receiving_Page{page}.csv', index=False)
dfs.append(df)
df_receiving_agg = pd.concat(dfs)
df_receiving_agg.to_csv('NFL_Receiving_Combined.csv', index=False)
So, i've been looking all over and i can't seem to figure out why i can't get the results from my scrape to write to a xlsx file.
I'm running a list of urls from a .csv file. I throw 10 urls in there, beautifulsoup scrapes them. If i just print the dataframe, it comes our right.
If i try and save the results as a xlsx(which is preferred) or csv, it will only give me the results from the last url.
If i run this, it prints out perfect
with open('G-Sauce_Urls.csv' , 'r') as csv_file:
csv_reader = csv.reader(csv_file)
for line in csv_reader:
r = requests.get(line[0]).text
soup = BeautifulSoup(r,'lxml')
business = soup.find('title')
companys = business.get_text()
phones = soup.find_all(text=re.compile("Call (.*)"))
Website = soup.select('head > link:nth-child(4)')
profile = (Website[0].attrs['href'])
data = {'Required':[companys], 'Required_no_Email':[phones], 'Business_Fax':[profile] }
df = pd.DataFrame(data, columns = ['Required','First', 'Last', 'Required_no_Email', 'Business_Fax'])
But i can't seem to get it to append to an xlsx file. I'm only getting the last result, which i figure is because it is just "writing" and not appending.
I've tried:
writer = pd.ExcelWriter("ProspectUploadSheetRob.xlsx", engine='xlsxwriter', mode='a')
df.to_excel(writer, sheet_name='Sheet1', index=False, startrow=4, header=3)
workbook = writer.book
worksheet = writer.sheets['Sheet1']
writer.save()
AND
with ExcelWriter('path_to_file.xlsx', mode='a') as writer:
df.to_excel(writer, sheet_name='Sheet1', index=False, startrow=4, header=3)
writer.save()
AND
df = pd.DataFrame(data, columns = ['Required','First', 'Last', 'Required_no_Email', 'Business_Fax'])
writer = pd.ExcelWriter("ProspectUploadSheetRob.xlsx", engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False, startrow=4, header=3)
writer.save()
AND
I started reading into openpyxl, but at this point I am so confused, i don't understand it.
Any and all help is appreciated
You are iterating over your csv data line-by-line, but you are recreating your dataframe at every iteration, so you are losing the value of the previous one each time. You will need to create the df first outside of the loop, and add data in your for loop.
df = pd.DataFrame(columns = ['Required','First', 'Last', 'Required_no_Email', 'Business_Fax'])
>>> df
Empty DataFrame
Columns: [Required, First, Last, Required_no_Email, Business_Fax]
Index: []
Your assumption of writing and not appending is correct, but you need to append the dataframe and then write it to excel, and not append data to the excel(if I understood correctly).
data = {'Required':[companys], 'Required_no_Email':[phones], 'Business_Fax':[profile] }
df = df.append(data, ignore_index=True) # use this instead of this part of your original code below:
# df = pd.DataFrame(data, columns = ['Required','First', 'Last', 'Required_no_Email', 'Business_Fax'])
# this will not be required as you have already defined the df outside the loop
The pd.ExcelWriter will only produce the output when you run:
writer.save()
I have a similar code that opens the file with the following parameters and it works:
writer = pd.ExcelWriter(r'path_to_file.xlsx', engine='xlsxwriter')
... all my modifications ...
writer.save()
Note that according to the documentation 'w' or Write is the default mode, also when modifying object, and although not explained greatly, append is referenced only when adding entirely new excel objects(Sheets, etc.), or "extending" the document with another dataframe with the exact same format to the document structure.
For it to be reproducable, you could add a template xlsx, but I hope it helps. Please let me know.