File corruption while writing using Pandas - Excel

I am reading data from a perfectly valid xlsx file and processing it using Pandas in Python 3.5. At the end I am writing the final dataframe to an Excel file using:
writer = pd.ExcelWriter(os.path.join(DATA_DIR, 'Data.xlsx'),
                        engine='xlsxwriter',
                        options={'strings_to_urls': False})
manual_labelling_data.to_excel(writer, 'Sheet_A', index=False)
writer.save()
While trying to open Data.xlsx, I get the error: "We found a problem with some content in 'Data.xlsx'...". On proceeding, the file loads into Excel with the info: "Removed Records: Formula from /xl/worksheets/sheet1.xml part".
I cannot find out what the problem is.

Thanks a lot to @jmcnamara for the help in the comments. The issue was that some strings in the data were wrongly being interpreted as formulas. The corrected code is:
options = {}
options['strings_to_formulas'] = False
options['strings_to_urls'] = False
writer = pd.ExcelWriter(os.path.join(DATA_DIR, 'Data.xlsx'),
                        engine='xlsxwriter',
                        options=options)
manual_labelling_data.to_excel(writer, 'Sheet_A', index=False)
writer.save()
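Note that on newer pandas (1.3 and later) the options= keyword to ExcelWriter is deprecated (and removed in 2.0) in favour of engine_kwargs, and writer.save() has been replaced by writer.close(). A minimal sketch of the same fix under those versions, assuming the same DATA_DIR and dataframe as above:
import os
import pandas as pd

# engine_kwargs forwards the dict to xlsxwriter.Workbook(); options= was the old spelling
with pd.ExcelWriter(os.path.join(DATA_DIR, 'Data.xlsx'),
                    engine='xlsxwriter',
                    engine_kwargs={'options': {'strings_to_formulas': False,
                                               'strings_to_urls': False}}) as writer:
    manual_labelling_data.to_excel(writer, sheet_name='Sheet_A', index=False)
# the context manager closes (and saves) the file, so no explicit save() call is needed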

Related

csv (;) to Excel and back to csv (;), comma disappears

This drives me crazy.
I have the following csv file:
Short name;Calculation;29221
peter;foster;1,755345
karin;paris;0,2343543
john;dee;0
lisa;long;1,434534
lauren;lovely;0,123124
linda;loss;0,0234
I read this file into pandas, print it, and everything looks fine. Then I write it to an existing Excel workbook and the values are partly corrupted.
This is my code:
import pandas as pd
import xlwings as xw

# open the csv
QTH = pd.read_csv(r"C:/Users/A692517/PhytonStuff/testCSVtoExcel.csv", sep=';')  # ,
# engine = 'python')
for idx, row in QTH.iterrows():
    # c = QoSFTTH[row[2]].at[idx]
    myString = str(row[2])
    row[2] = myString

# target workbook
fn = "C:/Users/A692517/PhytonStuff/myClist.xlsx"
wb = xw.Book(fn)
ws = wb.sheets["Tabelle1"]

# write the QoSFTTH dataframe into the target workbook
ws["A1"].options(pd.DataFrame, header=1, index=False, expand='table').value = QTH
wb.save(fn)
wb.close()
When I export the Excel result to a new csv (;), you can see what I mean:
Short name;Calculation;29221,00
peter;foster;1755345,00
karin;paris;0,2343543
john;dee;0,00
lisa;long;1434534,00
lauren;lovely;0,123124
linda;loss;0,0234
You may have stumbled on a pd.read_csv bug found via this Stack Overflow question. Change the engine to engine='c' and try thousands=',':
pd.read_csv('path', sep=';', thousands=',', engine='c')
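Worth noting: in this file the comma looks like a decimal separator (German locale), so decimal=',' may be the parameter you actually want, letting pandas parse the values as floats before xlwings writes them. A minimal sketch, assuming the same file path as the question:
import pandas as pd

# parse comma-as-decimal values as proper floats (e.g. "1,755345" -> 1.755345)
QTH = pd.read_csv(r"C:/Users/A692517/PhytonStuff/testCSVtoExcel.csv",
                  sep=';', decimal=',')
print(QTH.dtypes)  # the third column should now be float64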

Unable to add worksheets to an xlsx file in Python

I am trying to export data by running dynamically generated SQLs and storing the results in dataframes, which I eventually export into an Excel sheet. However, though I am able to generate the different results by successfully running the dynamic SQLs, I am not able to export them into different worksheets within the same Excel file. It eventually overwrites the previous result with the last resultant data.
for func_name in df_data['FUNCTION_NAME']:
    sheet_name = func_name
    sql = f"""select * from table({ev_dwh_name}.OVERDRAFT.""" + sheet_name + """())"""
    print(sql)
    dft_tf_data = pd.read_sql(sql, sf_conn)
    print('dft_tf_data')
    print(dft_tf_data)
    # dft.to_excel(writer, sheet_name=sheet_name, index=False)
    with tempfile.NamedTemporaryFile('w+b', suffix='.xlsx', delete=False) as fp:
        # dft_tf_data.to_excel(writer, sheet_name=sheet_name, index=False)
        print('Inside Temp File creation')
        temp_file = path + f'/fp.xlsx'
        writer = pd.ExcelWriter(temp_file, engine='xlsxwriter')
        dft_tf_data.to_excel(writer, sheet_name=sheet_name, index=False)
        writer.save()
        print(temp_file)
I am trying to achieve the below scenario:
Based on the FUNCTION_NAME, it should add a new sheet to the existing Excel file and then write the data from the query into that worksheet.
The final file should have all the worksheets.
Is there a way to do it? Please suggest.
I'd only expect a file-not-found error to happen once (on the first run), if fp.xlsx doesn't exist. The writer = ... line references that file, so with append mode it must already exist or the file-not-found error will occur. Once it exists, there should be no problems.
I'm not sure of the reasoning behind creating a temp xlsx file. I don't see why it would be needed, and you don't appear to use it.
The following works fine for me, where fp.xlsx is initially saved as a blank workbook before running the code:
sheet_name = 'Sheet1'
with tempfile.NamedTemporaryFile('w+b', suffix='.xlsx', delete=False) as fp:
    print('Inside Temp File creation')
    temp_file = path + f'/fp.xlsx'
    writer = pd.ExcelWriter(temp_file,
                            mode='a',
                            if_sheet_exists='overlay',
                            engine='openpyxl')
    dft_tf_data.to_excel(writer,
                         sheet_name=sheet_name,
                         startrow=writer.sheets[sheet_name].max_row + 2,
                         index=False)
    writer.save()
    print(temp_file)
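For the original goal (one sheet per FUNCTION_NAME in a single file), an alternative is to create one ExcelWriter outside the loop and add a sheet per query. A minimal sketch, assuming the same df_data, ev_dwh_name, sf_conn, and path from the question:
import pandas as pd

# one writer for the whole file; each to_excel call adds its own sheet
with pd.ExcelWriter(path + '/fp.xlsx', engine='xlsxwriter') as writer:
    for func_name in df_data['FUNCTION_NAME']:
        sql = f"select * from table({ev_dwh_name}.OVERDRAFT.{func_name}())"
        dft_tf_data = pd.read_sql(sql, sf_conn)
        # Excel caps sheet names at 31 characters
        dft_tf_data.to_excel(writer, sheet_name=func_name[:31], index=False)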

Python Tabula Library - Output File Is Empty

I am using the Tabula module in Python.
I am trying to output text from a PDF.
I am using this code:
pdf_read = tabula.read_pdf(
    input_path="Test File.pdf",
    pages=start_page_number,
    guess=False,
    area=(81.735, 18.55, 391.285, 273.61),
    relative_area=False,
    format="TSV",
    output_path="testing_area.tsv"
)
When I go to run my code, it says "The output file is empty."
Any idea why this could be?
Edit: If I remove everything except the input_path and pages, my data is getting read into pdf_read correctly; it just does not output into an external file.
Something is wrong with this option...hmm...
Edit #2: I figured out why the area part was not working and now it is, but I still can't get this to output a file for some reason.
Edit #3: I tried looking at this: How to convert PDF to CSV with tabula-py?
But I keep getting an error message: "build_options() got an unexpected keyword argument 'spreadsheet'".
Edit #4: I'm using the latest version of tabula-py, which doesn't have the spreadsheet option.
Still can't output a file with data though.
I never figured out why output_path wasn't working above. The output of pdf_read is a list, so I converted the list into a dataframe and then wrote the dataframe out using to_csv.
Code is below:
import pandas as pd

df = pd.DataFrame(pdf_read, columns=["column_a"])
output_df = df.to_csv(
    "alternative_attempt_1.txt",
    header=True,
    index=True,
    sep='\t',
    mode='w',
    encoding="cp1252"
)
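As an aside, tabula-py also ships convert_into(), which writes the extraction straight to a file and may sidestep the output_path issue entirely. A minimal sketch, assuming the same PDF, pages, and area from the question:
import tabula

# convert_into writes the extraction result directly to disk
tabula.convert_into(
    "Test File.pdf",
    "testing_area.tsv",
    output_format="tsv",
    pages=start_page_number,
    guess=False,
    area=(81.735, 18.55, 391.285, 273.61),
)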

How to extract the contents of the mth column of the nth row from a csv file using python

I created a CSV file and was able to add headers for it. I tried using loc to extract the contents but to no avail.
I want to get e as an output or use it in code for something.
The code I've used is as follows:
import pandas as pd
import csv

with open("boo.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow(('a', 'b', 'c'))

df = pd.read_csv("boo.csv", header=None)
df.to_csv("boo.csv", header=["alpha", "beta", "gamma"], index=False)

with open('boo.csv', 'a') as f:
    writer = csv.writer(f)
    writer.writerow(('c', 'd', 'e'))
    writer.writerow(('f', 'g', 'h'))

print(df.loc[(df["alpha"] == 'c')]["gamma"])
Upon running this code, I'm getting a KeyError for alpha. Please help with this. I'm pretty new to handling CSV files and pandas.
Thank you. :)
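The KeyError is likely because df was read with header=None (so its columns are 0, 1, 2, not "alpha"), and it was read before the extra rows were appended. A minimal sketch of one way to get 'e' out, re-reading the file after all the writes:
import pandas as pd

# re-read after appending, so the dataframe picks up the "alpha"/"beta"/"gamma"
# header row and the appended data rows
df = pd.read_csv("boo.csv")
print(df.loc[df["alpha"] == 'c', "gamma"])  # the row where alpha == 'c' has gamma == 'e'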

Update pandas data into existing csv

I have a csv which I'm creating from a pandas dataframe.
But as soon as I append to it, it throws: OSError: [Errno 95] Operation not supported
for single_date in [d for d in (start_date + timedelta(n) for n in range(day_count)) if d <= end_date]:
    currentDate = datetime.strftime(single_date, "%Y-%m-%d")
    # Send request for one day to the API and store it in a daily csv file
    response = requests.get(endpoint + f"?startDate={currentDate}&endDate={currentDate}", headers=headers)
    rawData = pd.read_csv(io.StringIO(response.content.decode('utf-8')))
    outFileName = 'test1.csv'
    outdir = '/dbfs/mnt/project/test2/'
    if not os.path.exists(outdir):
        os.mkdir(outdir)
    fullname = os.path.join(outdir, outFileName)
    pdf = pd.DataFrame(rawData)
    if not os.path.isfile(fullname):
        pdf.to_csv(fullname, header=True, index=False)
    else:  # else it exists so append without writing the header
        with open(fullname, 'a') as f:  # This part gives the error... If I use 'w' as the mode, it overwrites and works fine.
            pdf.to_csv(f, header=False, index=False, mode='a')
I am guessing it's because you opened the file in append mode and then passed mode='a' again in your call to to_csv. Can you try simply doing this?
pdf = pd.DataFrame(rawData)
if not os.path.isfile(fullname):
    pdf.to_csv(fullname, header=True, index=False)
else:  # else it exists so append without writing the header
    pdf.to_csv(fullname, header=False, index=False, mode='a')
Appending didn't work out, so I created parquet files instead and then read them back as a dataframe.
I was having a similar issue, and the root cause was that Databricks Runtime 6 and above does not support append or random write operations on files in DBFS. It was working fine for me until I updated my runtime from 5.5 to 6, as they suggested, because they were no longer supporting Runtime < 6 at that time.
I followed this workaround: read the file in code, append the data, and overwrite it.
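A minimal sketch of that workaround, assuming the same fullname and pdf from the question: read the whole file, concatenate in memory, then rewrite the file in 'w' mode (which DBFS does support):
import os
import pandas as pd

if os.path.isfile(fullname):
    # read the existing file and append the new rows in memory...
    existing = pd.read_csv(fullname)
    combined = pd.concat([existing, pdf], ignore_index=True)
else:
    combined = pdf

# ...then overwrite the whole file instead of appending to it
combined.to_csv(fullname, header=True, index=False, mode='w')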
