I am trying to export data by running dynamically generated SQL queries and storing the results in dataframes, which I eventually export into an Excel file. However, although I am able to generate the different results by successfully running the dynamic SQL, I am not able to export them into different worksheets within the same Excel file. Each run overwrites the previous result, so only the last result survives.
for func_name in df_data['FUNCTION_NAME']:
    sheet_name = func_name
    sql = f"""select * from table({ev_dwh_name}.OVERDRAFT.""" + sheet_name + """())"""
    print(sql)
    dft_tf_data = pd.read_sql(sql, sf_conn)
    print('dft_tf_data')
    print(dft_tf_data)
    # dft.to_excel(writer, sheet_name=sheet_name, index=False)
    with tempfile.NamedTemporaryFile('w+b', suffix='.xlsx', delete=False) as fp:
        # dft_tf_data.to_excel(writer, sheet_name=sheet_name, index=False)
        print('Inside Temp File creation')
        temp_file = path + f'/fp.xlsx'
        writer = pd.ExcelWriter(temp_file, engine='xlsxwriter')
        dft_tf_data.to_excel(writer, sheet_name=sheet_name, index=False)
        writer.save()
        print(temp_file)
I am trying to achieve the scenario below.
Based on the FUNCTION_NAME, it should add a new sheet to the existing Excel file and then write the data from the query into that worksheet.
The final file should contain all the worksheets.
Is there a way to do this? Please suggest.
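One way to end up with every worksheet in a single file is to create the ExcelWriter once, before the loop, and write each query result to its own sheet inside the loop. The following is a minimal sketch of that idea, not the original code: it reuses df_data, ev_dwh_name, sf_conn and path from the question, and the output file name is made up.

import pandas as pd

output_file = path + '/all_functions.xlsx'  # hypothetical output name

# create the writer once so every sheet lands in the same workbook
with pd.ExcelWriter(output_file, engine='xlsxwriter') as writer:
    for func_name in df_data['FUNCTION_NAME']:
        sql = f"select * from table({ev_dwh_name}.OVERDRAFT.{func_name}())"
        dft_tf_data = pd.read_sql(sql, sf_conn)
        # one worksheet per function name
        dft_tf_data.to_excel(writer, sheet_name=func_name, index=False)

The context manager saves and closes the workbook when the block exits; note that Excel sheet names are limited to 31 characters, so very long function names would need to be truncated.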
I'd only expect a file-not-found error like that to happen once (on the first run) if fp.xlsx doesn't exist. fp.xlsx gets created on the writer = ... line if it doesn't exist, and since that line references the file, it must exist or the file-not-found error will occur. Once it exists, there should be no problems.
I'm not sure of the reasoning for creating a temp xlsx file. I don't see why it would be needed, and you don't appear to use it.
The following works fine for me, where fp.xlsx was initially saved as a blank workbook before running the code.
sheet_name = 'Sheet1'

with tempfile.NamedTemporaryFile('w+b', suffix='.xlsx', delete=False) as fp:
    print('Inside Temp File creation')
    temp_file = path + f'/fp.xlsx'
    writer = pd.ExcelWriter(temp_file,
                            mode='a',
                            if_sheet_exists='overlay',
                            engine='openpyxl')
    dft_tf_data.to_excel(writer,
                         sheet_name=sheet_name,
                         startrow=writer.sheets[sheet_name].max_row + 2,
                         index=False)
    writer.save()
    print(temp_file)
I am trying to create a function that:
gets all Excel files in a folder,
reads a specific sheet from each,
and adds that sheet to a new Excel workbook.
The function below runs, however I can't open the resulting file itself, as I get:
'Excel cannot open the file ’New.xlsx’ because the file format or file extension is not valid. Verify that the file has not been corrupted and that the file extension matches the format of the file.'
I have tried to understand why through lots of keyword searches, but I cannot figure it out.
def mul_sheet_mul_excel_combiner(path, sheet, newbookname):
    df = pd.DataFrame()
    files = []
    for i in os.listdir(path):
        if i.endswith('.csv') or i.endswith('.xlsx') or i.endswith('.xls'):
            files.append(i)
    out_path = newbookname + '.xlsx'
    writer = pd.ExcelWriter(out_path, engine='xlsxwriter')
    n = 0
    for file in files:
        print(file)
        df = pd.read_excel(file, sheet_name=sheet, engine='openpyxl')
        df.to_excel(writer, sheet_name=str(n))
        n += 1

path = "."
sheetname = 0
newbookname = 'New'
mul_sheet_mul_excel_combiner(path, sheetname, newbookname)
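A likely cause of the 'file format or file extension is not valid' error is that the ExcelWriter is never closed, so the workbook is never finalized on disk and New.xlsx stays empty. Below is a sketch of the same function with the writer closed via a context manager; the .csv branch is dropped here because pd.read_excel cannot read csv files, and os.path.join is added so files outside the working directory are found. Both of those are assumptions about the intent, not the original code.

import os
import pandas as pd

def mul_sheet_mul_excel_combiner(path, sheet, newbookname):
    # collect only the Excel files in the folder
    files = [i for i in os.listdir(path) if i.endswith(('.xlsx', '.xls'))]
    out_path = newbookname + '.xlsx'
    # the context manager closes (and finalizes) the workbook on exit
    with pd.ExcelWriter(out_path, engine='xlsxwriter') as writer:
        for n, file in enumerate(files):
            df = pd.read_excel(os.path.join(path, file), sheet_name=sheet)
            df.to_excel(writer, sheet_name=str(n), index=False)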
So I'm trying to move my csv files from the source folder to the dest folder after performing an action on each file, using nested for loops.
Below are the nested for loops.
What's happening now is that the first file gets copied into the table in the database, but it doesn't get moved to the destination folder after its contents are inserted into the SQL table, and then the loop breaks after the first run and prints the error from the try block.
If I remove the shutil statement, all rows from each csv file are successfully copied into the database.
Essentially I want that to happen, but I also want to move each file to the dest folder after I've copied all of its data into the table.
This script will be triggered by a Power Automate action that will run once a file is added to the folder, so I don't want to add/duplicate rows in my database from the same file.
I'm also adding the variables below this code so you can get an idea of what the function is doing as well.
Thanks for any help you can provide on this, and let me know if more clarification is needed.
My attempt:
for file in dir_list:
    source = r"C:\Users\username\source\{}".format(file)
    df = pd.read_csv(path2)
    df = df.dropna()
    rows = df_to_row_tuples(df)
    for row in rows:
        cursor.execute(sql, row)
        conn.commit()
    shutil.move(source, destination)
Variables:
def df_to_row_tuples(df):
    df = df.fillna('')
    rows = [tuple(cell) for cell in df.values]
    return rows

conn = sqlite3.connect(r'C:\Users\some.db')
cursor = conn.cursor()
sql = "INSERT INTO tblrandomtble VALUES(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)"
path = r'C:\Users\username\source'
dir_list = os.listdir(path)
source = ""
destination = r"C:\Users\username\destination"
df = pd.DataFrame()
rows = tuple()
If the file already exists in the destination, shutil.move will overwrite it, provided you pass the whole path, including the file name.
So add the file name to the destination argument of the shutil.move call.
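Applied to the loop above, that change could look like the sketch below; only the shutil.move line really differs, and os.path.join is used to build both full paths.

import os
import shutil

src_dir = r"C:\Users\username\source"
dest_dir = r"C:\Users\username\destination"

for file in dir_list:
    source = os.path.join(src_dir, file)
    # ... read the csv and insert its rows into the table as before ...
    # passing the destination including the file name lets shutil.move
    # overwrite an existing copy instead of raising an error
    shutil.move(source, os.path.join(dest_dir, file))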
My current code is below.
I have a specific range of cells (from a specific sheet) that I am pulling out of multiple (~30) Excel files. I am trying to pull this information out of all of these files and compile it into a single new file, appending to that file each time. I'm going to clean up the destination file manually for the time being, as I will improve this script going forward.
What I currently have works fine for a single sheet, but I overwrite my destination every time I add a new file to the read-in list.
I've tried adding mode='a' and a couple of different ways to concat at the end of my function.
import pandas as pd

def excel_loader(fname, sheet_name, new_file):
    xls = pd.ExcelFile(fname)
    df1 = pd.read_excel(xls, sheet_name, nrows=20)
    print(df1[1:15])
    writer = pd.ExcelWriter(new_file)
    df1.insert(51, 'Original File', fname)
    df1.to_excel(new_file)

names = ['sheet1.xlsx', 'sheet2.xlsx']
destination = 'destination.xlsx'

for name in names:
    excel_loader(name, 'specific_sheet_name', destination)
Thanks for any help in advance; I can't seem to find an answer to this exact situation on here. Cheers.
Ideally you want to loop through the files and read the data into a list, then concatenate the individual dataframes, then write the new dataframe. This assumes the data being pulled is the same size/shape and the sheet name is the same in every file. If the sheet name changes between files, look into the zip() function to pair each filename with its sheet name.
This should get you started:
names = ['sheet1.xlsx', 'sheet2.xlsx']
destination = 'destination.xlsx'
sheet_name = 'specific_sheet_name'

# read all files first
df_hold_list = []
for name in names:
    xls = pd.ExcelFile(name)
    df = pd.read_excel(xls, sheet_name, nrows=20)
    df_hold_list.append(df)

# concatenate dfs
df1 = pd.concat(df_hold_list, axis=1)  # axis is 1 or 0 depending on how you want to concatenate (horizontal vs vertical)

# write the new file; closing the writer finalizes the workbook
with pd.ExcelWriter(destination) as writer:
    df1.to_excel(writer)
I'm a struggling Python newbie. I would like to do the following:
(1) fix multiple corrupted Excel files in a folder by looping over them; save the restored/fixed files to a new location
(2) merge all (or a selection) of the fixed/restored Excel files into one pandas dataframe. If possible, I would like the code to be able to choose, say, only the first 10 files, due to low memory.
The code stops running at the very first file and reports no such file, while the file does exist in the directory. Assistance with both pieces of code would be highly appreciated. Thanks.
The code and the error message are below (I had issues pasting the code here).
file_dir = r"""C:\Users\Documents\corrupted_files"""

for filename in os.listdir(file_dir):
    print(filename)
    file = os.path.splitext(filename)[0]
    # Opening the file using 'utf-16' encoding
    file1 = io.open(filename, "r", encoding="utf-16")
    data = file1.readlines()
    xldoc = Workbook()
    # Adding a sheet to the workbook object
    sheet = xldoc.add_sheet("Sheet1", cell_overwrite_ok=True)
    # Iterating and saving the data to the sheet
    for i, row in enumerate(data):
        # Two things are done here:
        # removing the '\n' which comes from reading the file using io.open,
        # and getting the values after splitting on '\t'
        for j, val in enumerate(row.replace('\n', '').split('\t')):
            sheet.write(i, j, val)
    # Saving the file as an excel file
    xldoc.save(r"C:\\Users\\Documents\\restored_data\\" + file + ".xlsx", 51)

# Need assistance with code to loop over the fixed (restored) Excel files and
# combine all of them, or e.g. only the first 10, into one dataframe.
ERROR MESSAGE BELOW

20181124_file_01.csv
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-37-17a38b97f646> in <module>
      4 file = os.path.splitext(filename)[0]
      5 # Opening the file using 'utf-16' encoding
      6 file1 = io.open(filename, "r", encoding="utf-16")
      7 data = file1.readlines()
      8 xldoc = Workbook()

FileNotFoundError: [Errno 2] No such file or directory: '20181124_file01.csv'
Code should be:

for filename in os.listdir(file_dir):
    print(filename)
    file = os.path.join(file_dir, os.path.splitext(filename)[0])
    with open(os.path.join(file_dir, filename), "r", encoding="utf-16") as fh:
        xldoc = Workbook(fh)  # think you can use a file handle as a reference here
        # Adding a sheet to the workbook object
        sheet = xldoc.add_sheet("Sheet1", cell_overwrite_ok=True)
os.listdir returns only the file names, not the full file paths.
It is also worth putting a with statement around your call to io.open so that the file handle is released after use.
You shouldn't run low on memory, as the reference to each workbook is destroyed before you open the next one.
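For the second part of the original question, combining all (or just the first 10) restored files into one dataframe, a minimal sketch could look like the following; restored_dir matches the save path used above, and the 10-file limit is taken from the question.

import os
import pandas as pd

restored_dir = r"C:\Users\Documents\restored_data"

# take only the first 10 restored workbooks to keep memory usage low
restored_files = sorted(
    f for f in os.listdir(restored_dir) if f.endswith('.xlsx')
)[:10]

frames = []
for name in restored_files:
    # each restored workbook has a single sheet written by the repair loop
    frames.append(pd.read_excel(os.path.join(restored_dir, name)))

# stack all files vertically into one dataframe
combined = pd.concat(frames, ignore_index=True)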
I have a csv which I'm creating from a pandas dataframe.
But as soon as I try to append to it, it throws: OSError: [Errno 95] Operation not supported.
for single_date in [d for d in (start_date + timedelta(n) for n in range(day_count)) if d <= end_date]:
    currentDate = datetime.strftime(single_date, "%Y-%m-%d")
    # Send request for one day to the API and store it in a daily csv file
    response = requests.get(endpoint + f"?startDate={currentDate}&endDate={currentDate}", headers=headers)
    rawData = pd.read_csv(io.StringIO(response.content.decode('utf-8')))
    outFileName = 'test1.csv'
    outdir = '/dbfs/mnt/project/test2/'
    if not os.path.exists(outdir):
        os.mkdir(outdir)
    fullname = os.path.join(outdir, outFileName)
    pdf = pd.DataFrame(rawData)
    if not os.path.isfile(fullname):
        pdf.to_csv(fullname, header=True, index=False)
    else:  # else it exists so append without writing the header
        with open(fullname, 'a') as f:  # This part gives the error. If I use 'w' as the mode, it overwrites and works fine.
            pdf.to_csv(f, header=False, index=False, mode='a')
I am guessing it's because you opened the file in append mode and then you are passing mode='a' again in your call to to_csv. Can you simply try this?
pdf = pd.DataFrame(rawData)
if not os.path.isfile(fullname):
    pdf.to_csv(fullname, header=True, index=False)
else:  # else it exists so append without writing the header
    pdf.to_csv(fullname, header=False, index=False, mode='a')
It didn't work out with appending. So I created parquet files instead and then read them back as a dataframe.
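Roughly, that workaround could look like the sketch below; the per-day file naming and the final read-back are assumptions for illustration, reusing pdf, currentDate and the output directory from the loop above.

import glob
import os
import pandas as pd

outdir = '/dbfs/mnt/project/test2/'

# write one parquet file per day instead of appending to a shared csv
pdf.to_parquet(os.path.join(outdir, f'data_{currentDate}.parquet'), index=False)

# later, read all the daily parquet files back into a single dataframe
files = glob.glob(os.path.join(outdir, '*.parquet'))
combined = pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)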
I was having a similar issue, and the root cause was that Databricks Runtime > 6 does not support append or random write operations on files that live in DBFS. It was working fine for me until I updated my runtime from 5.5 to 6, as they suggested, because they were no longer supporting Runtime < 6 at that time.
I followed this workaround: read the existing file in code, append the new data to it, and overwrite the file.
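In terms of the code from the question, that read-then-overwrite workaround could be sketched like this, reusing fullname and pdf; it is an illustration of the idea, not the original poster's code.

import os
import pandas as pd

if not os.path.isfile(fullname):
    # first run: just write the file with a header
    pdf.to_csv(fullname, header=True, index=False)
else:
    # read what is already there, add the new rows, then overwrite the
    # whole file in write mode, which DBFS does support
    existing = pd.read_csv(fullname)
    combined = pd.concat([existing, pdf], ignore_index=True)
    combined.to_csv(fullname, header=True, index=False, mode='w')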