Delete null values on multiple worksheets and export to excel - python-3.x

I am trying to write code that deletes rows with null values in specific columns across multiple Excel sheets and then exports the file. Any help is appreciated!
Code below:
import pandas as pd
fileName = 'data.xls'
df = pd.ExcelFile(fileName)
arrayOf_SheetNames = df.sheet_names
for sheetName in arrayOf_SheetNames:
    masterdf = pd.read_excel(fileName, sheet_name=sheetName, header=4)
    masterdf = masterdf.dropna(subset=['Column 1', 'Column 2'], inplace=True)
masterdf.to_excel('file_path.xls')

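One problem is that you reassign masterdf for every sheet in the for loop, so only the last sheet survives. Another is that dropna(..., inplace=True) modifies the frame in place and returns None, so assigning its result back to masterdf throws the DataFrame away. Finally, you never write all the sheets through a single ExcelWriter and save it at the end (writer.save() in older pandas; since pandas 2.0, where save() was removed, use writer.close() or a with block):
dfs = pd.read_excel('/tmp/Untitled spreadsheet-2.xlsx', sheet_name=None, header=4)
with pd.ExcelWriter('/tmp/out.xlsx') as writer:  # saved automatically when the block exits
    for sheetname, df in dfs.items():
        df = df.dropna(subset=['Column 1', 'Column 2'])
        df.to_excel(writer, sheet_name=sheetname, index=False)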
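To see what dropna does with subset and inplace without touching any files, here is a minimal in-memory sketch. The frame `sheet` and its "Other" column are made-up stand-ins for one worksheet; only the "Column 1" and "Column 2" names come from the question.

```python
import pandas as pd

# Hypothetical stand-in for one worksheet; "Other" is an invented extra column.
sheet = pd.DataFrame({
    "Column 1": [1.0, None, 3.0],
    "Column 2": ["a", "b", None],
    "Other": [None, 10, 20],
})

# With inplace=True the frame itself is modified; the return value is not
# a DataFrame, so it should not be assigned back to the variable.
scratch = sheet.copy()
result = scratch.dropna(subset=["Column 1", "Column 2"], inplace=True)

# Without inplace, dropna returns a new, filtered DataFrame.
cleaned = sheet.dropna(subset=["Column 1", "Column 2"])

print(cleaned)
```

Only rows with a null in one of the subset columns are dropped; the null in "Other" on the first row does not matter.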

Related

Append values to Dataframe in loop and if conditions

Need help please.
I have code that reads rows from Excel sheets and appends them to a DataFrame if certain columns exist.
I need an additional DataFrame for the sheets where those columns don't exist: it should collect the file name and sheet name of each such sheet, and I want to write that list to an Excel file. The values should also be unique.
I tried adding them to dfErrorList, but the output Excel file only showed the last sheet name and file name, repeated many times.
from xlsxwriter import Workbook
import pandas as pd
import openpyxl
import glob
import os
path = 'filestoimport/*.xlsx'
list_of_dfs = []
list_of_dferror = []
dfErrorList = pd.DataFrame() #create empty df
for filepath in glob.glob(path):
    xl = pd.ExcelFile(filepath)
    # Define an empty list to store individual DataFrames
    for sheet_name in xl.sheet_names:
        df = pd.read_excel(filepath, sheet_name=sheet_name)
        df['sheetname'] = sheet_name
        file_name = os.path.basename(filepath)
        df['sourcefilename'] = file_name
        if "Project ID" in df.columns and "Status" in df.columns:
            print('')
        else:
            dfErrorList['sheetname'] = df['sheetname'] # adds `sheet_name` into the column
            dfErrorList['sourcefilename'] = df['sourcefilename']
            continue
        list_of_dferror.append((dfErrorList))
        df['Status'].fillna('', inplace=True)
        df['Added by'].fillna('', inplace=True)
        list_of_dfs.append(df)
# # Combine all DataFrames into one
data = pd.concat(list_of_dfs, ignore_index=True)
dataErrors = pd.concat(list_of_dferror, ignore_index=True)
dataErrors.to_excel(r'error.xlsx', index=False)
# data.to_excel("total_countries.xlsx", index=None)
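One possible way to get a unique list of the sheets that lack the required columns is to record one small dict per rejected sheet and build the error DataFrame once at the end, instead of overwriting columns on a shared dfErrorList. This is only a sketch: the function name `split_sheets` and the in-memory `workbooks` mapping are hypothetical, standing in for what pd.read_excel(filepath, sheet_name=None) returns for each file.

```python
import pandas as pd

REQUIRED = {"Project ID", "Status"}


def split_sheets(workbooks):
    """Split sheets into usable data and a list of rejected sheets.

    workbooks: {file_name: {sheet_name: DataFrame}} (hypothetical input shape).
    """
    good, errors = [], []
    for file_name, sheets in workbooks.items():
        for sheet_name, df in sheets.items():
            if REQUIRED.issubset(df.columns):
                # Tag every row with where it came from.
                good.append(df.assign(sheetname=sheet_name, sourcefilename=file_name))
            else:
                # One record per rejected sheet, not one per row.
                errors.append({"sheetname": sheet_name, "sourcefilename": file_name})
    data = pd.concat(good, ignore_index=True) if good else pd.DataFrame()
    data_errors = (
        pd.DataFrame(errors, columns=["sheetname", "sourcefilename"])
        .drop_duplicates()
        .reset_index(drop=True)
    )
    return data, data_errors


# Made-up example input.
workbooks = {
    "a.xlsx": {
        "S1": pd.DataFrame({"Project ID": [1], "Status": ["ok"]}),
        "S2": pd.DataFrame({"x": [1, 2]}),
    },
    "b.xlsx": {"S1": pd.DataFrame({"y": [1]})},
}
data, data_errors = split_sheets(workbooks)
print(data_errors)
```

The resulting data_errors frame can then be written with data_errors.to_excel('error.xlsx', index=False), as in the question.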

Creating new sheet overwrites existing sheet created via openpyxl

I'm trying to create a bar chart directly in Excel from a pandas DataFrame. In the same output workbook, I'd like to save the original csv data used for the bar chart in a separate sheet. My code:
import openpyxl
import pandas as pd
from openpyxl.chart import BarChart, Reference
from openpyxl.utils.dataframe import dataframe_to_rows
wb = openpyxl.Workbook()
ws = wb.active
for row in dataframe_to_rows(new_df, index=False, header=False):
    ws.append(row)
chart = BarChart()
values = Reference(ws, min_col=1, min_row=1, max_col=2, max_row=ws.max_row)
labels = Reference(ws, min_col=1, min_row=1, max_col=1, max_row=ws.max_row)
chart.add_data(values)
chart.set_categories(labels)
ws.add_chart(chart, "E2")
wb.save("~/barChart.xlsx")
writer = pd.ExcelWriter("~/barChart.xlsx", engine='openpyxl')
df.to_excel(writer, sheet_name="Source_data")
writer.save()
The problem is that the last three lines overwrite the workbook that contains the bar chart. How do I overcome this?
From the pandas documentation:
ExcelWriter can also be used to append to an existing Excel file:
with pd.ExcelWriter('output.xlsx', mode='a') as writer:
    df.to_excel(writer, sheet_name='Sheet_name_3')
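A different route, swapped in here as an alternative to append mode, is to add the source data as a second worksheet in the same openpyxl Workbook before saving it once. That way the file is never reopened, so nothing depends on how the chart survives a reload. The sketch below assumes openpyxl is installed; the sample `new_df`, the "Chart" sheet title, and the temporary output path are made up for illustration.

```python
import os
import tempfile

import pandas as pd
from openpyxl import Workbook, load_workbook
from openpyxl.chart import BarChart, Reference

# Hypothetical data standing in for the csv behind the chart.
new_df = pd.DataFrame({"label": ["a", "b", "c"], "value": [3, 5, 2]})

wb = Workbook()
ws = wb.active
ws.title = "Chart"
for row in new_df.itertuples(index=False):
    ws.append(list(row))

# Column 2 holds the values, column 1 the category labels.
chart = BarChart()
values = Reference(ws, min_col=2, min_row=1, max_row=ws.max_row)
labels = Reference(ws, min_col=1, min_row=1, max_row=ws.max_row)
chart.add_data(values)
chart.set_categories(labels)
ws.add_chart(chart, "E2")

# Second sheet in the same workbook: header row, then the data rows.
source = wb.create_sheet("Source_data")
source.append(list(new_df.columns))
for row in new_df.itertuples(index=False):
    source.append(list(row))

out_path = os.path.join(tempfile.mkdtemp(), "barChart.xlsx")
wb.save(out_path)  # single save, nothing is overwritten afterwards

saved_sheet_names = load_workbook(out_path).sheetnames
print(saved_sheet_names)
```

If append mode is preferred instead, note that pd.ExcelWriter only supports mode='a' with the openpyxl engine.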

How do you append rows to xlsx file when using beautifulsoup and pandas to scrape?

I've been looking all over and can't figure out why I can't get the results from my scrape to write to an xlsx file.
I'm running a list of URLs from a .csv file. I put 10 URLs in there and BeautifulSoup scrapes them. If I just print the DataFrame, it comes out right.
If I try to save the results as xlsx (which is preferred) or csv, it only gives me the results from the last URL.
If I run this, it prints out perfectly:
import csv
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
with open('G-Sauce_Urls.csv' , 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    for line in csv_reader:
        r = requests.get(line[0]).text
        soup = BeautifulSoup(r,'lxml')
        business = soup.find('title')
        companys = business.get_text()
        phones = soup.find_all(text=re.compile("Call (.*)"))
        Website = soup.select('head > link:nth-child(4)')
        profile = (Website[0].attrs['href'])
        data = {'Required':[companys], 'Required_no_Email':[phones], 'Business_Fax':[profile] }
        df = pd.DataFrame(data, columns = ['Required','First', 'Last', 'Required_no_Email', 'Business_Fax'])
But I can't seem to get it to append to an xlsx file. I'm only getting the last result, which I figure is because it is just "writing" and not appending.
I've tried:
writer = pd.ExcelWriter("ProspectUploadSheetRob.xlsx", engine='xlsxwriter', mode='a')
df.to_excel(writer, sheet_name='Sheet1', index=False, startrow=4, header=3)
workbook = writer.book
worksheet = writer.sheets['Sheet1']
writer.save()
AND
with ExcelWriter('path_to_file.xlsx', mode='a') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False, startrow=4, header=3)
    writer.save()
AND
df = pd.DataFrame(data, columns = ['Required','First', 'Last', 'Required_no_Email', 'Business_Fax'])
writer = pd.ExcelWriter("ProspectUploadSheetRob.xlsx", engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False, startrow=4, header=3)
writer.save()
I also started reading about openpyxl, but at this point I am so confused that I don't understand it.
Any and all help is appreciated.
You are iterating over your csv data line-by-line, but you are recreating your dataframe at every iteration, so you are losing the value of the previous one each time. You will need to create the df first outside of the loop, and add data in your for loop.
df = pd.DataFrame(columns = ['Required','First', 'Last', 'Required_no_Email', 'Business_Fax'])
>>> df
Empty DataFrame
Columns: [Required, First, Last, Required_no_Email, Business_Fax]
Index: []
Your assumption about writing versus appending is correct, but what you need is to append each result to the DataFrame and then write the DataFrame to Excel once, rather than appending data to the Excel file (if I understood correctly).
data = {'Required':[companys], 'Required_no_Email':[phones], 'Business_Fax':[profile] }
df = pd.concat([df, pd.DataFrame(data)], ignore_index=True) # use this instead of the line from your original code below
# (before pandas 2.0, df = df.append(data, ignore_index=True) did the same job; DataFrame.append was removed in 2.0)
# df = pd.DataFrame(data, columns = ['Required','First', 'Last', 'Required_no_Email', 'Business_Fax'])
# this is no longer required, as you have already defined the df outside the loop
pd.ExcelWriter only produces the output file when you save it (writer.save() in older pandas; writer.close() since pandas 2.0, where save() was removed):
writer.close()
I have similar code that opens the file with the following parameters, and it works:
writer = pd.ExcelWriter(r'path_to_file.xlsx', engine='xlsxwriter')
... all my modifications ...
writer.close()
Note that, according to the documentation, 'w' (write) is the default mode, and it replaces any existing file. Append mode ('a') is meant for adding new objects such as sheets to an existing workbook, not for adding rows to an existing sheet. It is also not supported by the xlsxwriter engine, so engine='xlsxwriter' cannot be combined with mode='a'.
To make this reproducible you could add a template xlsx, but I hope it helps. Please let me know.
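A common pattern for this kind of loop is to collect one dict per scraped page in a plain list and build the DataFrame once after the loop. The sketch below is self-contained: the `scraped` tuples are invented placeholders for the values the requests/BeautifulSoup step would produce, and the final write is left commented out because it needs an Excel engine.

```python
import pandas as pd

COLUMNS = ["Required", "First", "Last", "Required_no_Email", "Business_Fax"]

# Hypothetical results standing in for the scraping step (company, phone, profile).
scraped = [
    ("Acme Plumbing", "Call 555-0100", "https://example.com/acme"),
    ("Bolt Electric", "Call 555-0101", "https://example.com/bolt"),
    ("Cedar Roofing", "Call 555-0102", "https://example.com/cedar"),
]

rows = []
for company, phone, profile in scraped:
    # One dict per URL; columns with no data are simply left out here.
    rows.append({
        "Required": company,
        "Required_no_Email": phone,
        "Business_Fax": profile,
    })

# Build the frame once; missing columns ("First", "Last") are filled with NaN.
df = pd.DataFrame(rows, columns=COLUMNS)
print(df)

# Write everything in one go after the loop, for example:
# df.to_excel("ProspectUploadSheetRob.xlsx", sheet_name="Sheet1", index=False)
```

Building the frame once also avoids growing a DataFrame row by row, which gets slow for long URL lists.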

Saving loop output to multiple excel sheets

I have a csv file with multiple years of water data. I've broken up each water year into its own data frame. Now I want to do some math on those water years, then save each water year to its own Excel sheet.
The math part of the code is working, but I'm having trouble with the final step of naming and saving the output of the loop correctly. Right now it creates the Excel file and the sheet names correctly, but the loop saves only the final iteration to all the sheets. I've googled around, but I can't get the answers to any of the similar questions to work. This is my first Python program, so advice would be appreciated.
import pandas as pd
with open(r'wft.csv') as csvfile:
    tdata = pd.read_csv(csvfile)
tdata['date'] = pd.to_datetime(tdata['date'], format='%m/%d/%Y %H:%M')
tdata = tdata.set_index(['date'])
wy2015 = tdata.loc['2014-10-1 00:00' : '2015-7-1 00:00']
wy2016 = tdata.loc['2015-10-1 00:00' : '2016-7-1 00:00']
wy2017 = tdata.loc['2016-10-1 00:00' : '2017-7-1 00:00']
writer = pd.ExcelWriter('WFT.xlsx', engine='xlsxwriter')
wyID = [wy2014, wy2015, wy2016, wy2017]
seq = ['wy2014', 'wy2015', 'wy2016', 'wy2017']
for df in wyID:
    df = df.sort_values(by=['turbidity'], ascending=False)
    df['rank'] = df['turbidity'].rank(method = 'first', ascending=0)
    df['cunnanes'] = (df['rank'] - 0.4)/(len(df['rank']) + 0.2)*100
    for name in seq:
        df.to_excel(writer, sheet_name= name)
writer.save()
Issues in your code:
writer = pd.ExcelWriter('WFT.xlsx', engine='xlsxwriter')
wyID = [wy2014, wy2015, wy2016, wy2017]
seq = ['wy2014', 'wy2015', 'wy2016', 'wy2017']
for df in wyID: # outer loop that figures out wy20xx
    df = df.sort_values(by=['turbidity'], ascending=False)
    df['rank'] = df['turbidity'].rank(method = 'first', ascending=0)
    df['cunnanes'] = (df['rank'] - 0.4)/(len(df['rank']) + 0.2)*100
    for name in seq: # you loop through all the names and write all sheets every time. you want to be writing just one
        df.to_excel(writer, sheet_name= name)
writer.save()
Instead, try this:
for i, df in enumerate(wyID): # outer loop that figures out wy20xx
    df = df.sort_values(by=['turbidity'], ascending=False)
    df['rank'] = df['turbidity'].rank(method = 'first', ascending=0)
    df['cunnanes'] = (df['rank'] - 0.4)/(len(df['rank']) + 0.2)*100
    df.to_excel(writer, sheet_name= seq[i]) # writes to correct wy20xx sheet
writer.close() # Now you're done writing the Excel file (writer.save() before pandas 2.0)
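One way to keep each sheet name tied to its frame is to store them together in a dict, so the two lists cannot drift out of step. The sketch below uses made-up turbidity values in place of the slices from wft.csv; the helper names `rank_water_year` and `write_sheets` are hypothetical, and write_sheets is defined but not called because it needs an Excel engine such as xlsxwriter or openpyxl.

```python
import pandas as pd


def rank_water_year(df):
    """Sort by turbidity (descending), rank, and compute the plotting position."""
    out = df.sort_values(by="turbidity", ascending=False).copy()
    out["rank"] = out["turbidity"].rank(method="first", ascending=False)
    out["cunnanes"] = (out["rank"] - 0.4) / (len(out) + 0.2) * 100
    return out


# Hypothetical stand-ins for the wy20xx frames sliced from the csv.
water_years = {
    "wy2015": pd.DataFrame({"turbidity": [5.0, 9.0, 7.0]}),
    "wy2016": pd.DataFrame({"turbidity": [2.0, 4.0]}),
}

# Each name maps to its own processed frame.
sheets = {name: rank_water_year(df) for name, df in water_years.items()}


def write_sheets(sheets, path):
    """Write one worksheet per entry; the file is saved when the block exits."""
    with pd.ExcelWriter(path) as writer:
        for name, df in sheets.items():
            df.to_excel(writer, sheet_name=name)


print(sheets["wy2015"])
```

Calling write_sheets(sheets, 'WFT.xlsx') would then produce one sheet per water year.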

Appending Columns from several worksheets Python

I am trying to import certain columns of data from several different sheets in a workbook. However, only 'q2 survey' seems to get appended to the new workbook. How do I get this to append properly?
import sys, os
import pandas as pd
import xlrd
import xlwt
b = ['q1 survey', 'q2 survey','q3 survey'] #Sheet Names
df_t = pd.DataFrame(columns=["Month","Date", "Year"]) #column Name
xls = "path_to_file/R.xls"
sheet=[]
df_b=pd.DataFrame()
pd.read_excel(xls,sheet)
for sheet in b:
    df=pd.read_excel(xls,sheet)
    df.rename(columns=lambda x: x.strip().upper(), inplace=True)
    bill=df_b.append(df[df_t])
bill.to_excel('Survey.xlsx', index=False)
I think if you do:
b = ['q1 survey', 'q2 survey','q3 survey'] #Sheet Names
list_col = ["MONTH","DATE", "YEAR"] #column names, upper-cased to match the rename below
xls = "path_to_file/R.xls"
#create the empty df named bill to append after
bill= pd.DataFrame(columns = list_col)
for sheet in b:
    # read the sheet
    df=pd.read_excel(xls,sheet)
    df.rename(columns=lambda x: x.strip().upper(), inplace=True)
    # need to assign bill again
    # (DataFrame.append was removed in pandas 2.0; there, use pd.concat([bill, df[list_col]]))
    bill=bill.append(df[list_col])
# to excel
bill.to_excel('Survey.xlsx', index=False)
it should work and fix the errors in your code. Alternatively, you can collect the DataFrames in a list and combine them once with pd.concat:
list_sheet = ['q1 survey', 'q2 survey','q3 survey'] #Sheet Names
list_col = ["MONTH","DATE", "YEAR"] #column names, upper-cased to match the rename below
# read the xls file once and then access each sheet in the loop, which should be faster
xls_file = pd.ExcelFile("path_to_file/R.xls")
#create a list to append the df
list_df_to_concat = []
for sheet in list_sheet :
    # read the sheet
    df= pd.read_excel(xls_file, sheet)
    df.rename(columns=lambda x: x.strip().upper(), inplace=True)
    # append the df to the list
    list_df_to_concat.append(df[list_col])
# to excel
pd.concat(list_df_to_concat).to_excel('Survey.xlsx', index=False)
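The rename-then-select step is easy to check in memory. In this sketch the `sheets` dict and its contents are invented stand-ins for the three survey sheets; the point is that after upper-casing the headers, the columns must be selected by their upper-case names.

```python
import pandas as pd

list_col = ["MONTH", "DATE", "YEAR"]  # names as they look after strip().upper()

# Hypothetical sheets with inconsistent header spacing and case.
sheets = {
    "q1 survey": pd.DataFrame({" month ": [1], "Date": [15], "year": [2020], "extra": ["x"]}),
    "q2 survey": pd.DataFrame({"Month": [4], "DATE ": [16], "Year": [2020]}),
    "q3 survey": pd.DataFrame({"month": [7], "date": [17], "YEAR": [2020]}),
}

frames = []
for name, df in sheets.items():
    # Normalise the headers, then keep only the wanted columns.
    df = df.rename(columns=lambda c: c.strip().upper())
    frames.append(df[list_col])

# One concat at the end instead of appending inside the loop.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Selecting with the original mixed-case names ("Month", "Date", "Year") after the rename would raise a KeyError, since those labels no longer exist.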

Resources