Update pandas data into existing csv - python-3.x

I have a CSV which I'm creating from a pandas DataFrame.
But as soon as I try to append to it, it throws: OSError: [Errno 95] Operation not supported
for single_date in [d for d in (start_date + timedelta(n) for n in range(day_count)) if d <= end_date]:
    currentDate = datetime.strftime(single_date, "%Y-%m-%d")
    # Send request for one day to the API and store it in a daily csv file
    response = requests.get(endpoint + f"?startDate={currentDate}&endDate={currentDate}", headers=headers)
    rawData = pd.read_csv(io.StringIO(response.content.decode('utf-8')))
    outFileName = 'test1.csv'
    outdir = '/dbfs/mnt/project/test2/'
    if not os.path.exists(outdir):
        os.mkdir(outdir)
    fullname = os.path.join(outdir, outFileName)
    pdf = pd.DataFrame(rawData)
    if not os.path.isfile(fullname):
        pdf.to_csv(fullname, header=True, index=False)
    else:  # else it exists so append without writing the header
        with open(fullname, 'a') as f:  # This part gives the error. If I use 'w' as the mode, it overwrites and works fine.
            pdf.to_csv(f, header=False, index=False, mode='a')

I am guessing it's because you opened the file in append mode and then you are passing mode='a' again in your call to to_csv. Can you try simply doing this?
pdf = pd.DataFrame(rawData)
if not os.path.isfile(fullname):
    pdf.to_csv(fullname, header=True, index=False)
else:  # else it exists so append without writing the header
    pdf.to_csv(fullname, header=False, index=False, mode='a')

It didn't work out with appending, so I created parquet files and then read them as a DataFrame.

I was having a similar issue, and the root cause was that Databricks Runtime > 6 does not support append or random write operations on files that live in DBFS. It was working fine for me until I updated my runtime from 5.5 to 6, as they suggested doing because they were no longer supporting Runtime < 6 at that time.
I followed this workaround: read the file in code, appended the data, and overwrote the file.
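A minimal sketch of that read-append-overwrite workaround, assuming the fullname path and the pdf DataFrame from the question above:
import os
import pandas as pd

# DBFS workaround sketch: instead of appending to the file in place,
# read the existing CSV, concatenate the new rows in memory, and overwrite.
if os.path.isfile(fullname):
    existing = pd.read_csv(fullname)                     # rows already written on previous days
    combined = pd.concat([existing, pdf], ignore_index=True)
else:
    combined = pdf
combined.to_csv(fullname, header=True, index=False)      # plain overwrite, no append mode needed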

Related

Unable to add worksheets to an xlsx file in Python

I am trying to export data by running dynamically generated SQL and storing the data into dataframes, which I eventually export into an Excel sheet. However, though I am able to generate the different results by successfully running the dynamic SQL, I am not able to export them into different worksheets within the same Excel file. It eventually overwrites the previous result with the last resultant data.
for func_name in df_data['FUNCTION_NAME']:
    sheet_name = func_name
    sql = f"""select * from table({ev_dwh_name}.OVERDRAFT.""" + sheet_name + """())"""
    print(sql)
    dft_tf_data = pd.read_sql(sql, sf_conn)
    print('dft_tf_data')
    print(dft_tf_data)
    # dft.to_excel(writer, sheet_name=sheet_name, index=False)
    with tempfile.NamedTemporaryFile('w+b', suffix='.xlsx', delete=False) as fp:
        # dft_tf_data.to_excel(writer, sheet_name=sheet_name, index=False)
        print('Inside Temp File creation')
        temp_file = path + f'/fp.xlsx'
        writer = pd.ExcelWriter(temp_file, engine='xlsxwriter')
        dft_tf_data.to_excel(writer, sheet_name=sheet_name, index=False)
        writer.save()
        print(temp_file)
I am trying to achieve the scenario below.
Based on the FUNCTION_NAME, it should add a new sheet in the existing Excel file and then write the data from the query into that worksheet.
The final file should have all the worksheets.
Is there a way to do it? Please suggest.
I'd only expect a file-not-found error to happen once (on the first run) if fp.xlsx doesn't exist. The writer = ... line references that file, so it must exist or the file-not-found error will occur. Once it exists, there should be no problems.
I'm not sure of the reasoning behind creating a temp xlsx file. I don't see why it would be needed, and you don't appear to use it.
The following works fine for me, where fp.xlsx is initially saved as a blank workbook before running the code.
sheet_name = 'Sheet1'
with tempfile.NamedTemporaryFile('w+b', suffix='.xlsx', delete=False) as fp:
    print('Inside Temp File creation')
    temp_file = path + f'/fp.xlsx'
    writer = pd.ExcelWriter(temp_file,
                            mode='a',
                            if_sheet_exists='overlay',
                            engine='openpyxl')
    dft_tf_data.to_excel(writer,
                         sheet_name=sheet_name,
                         startrow=writer.sheets[sheet_name].max_row + 2,
                         index=False)
    writer.save()
    print(temp_file)
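If the goal from the question is one sheet per FUNCTION_NAME in a single workbook, another option is to keep one ExcelWriter open for the whole loop so every to_excel call lands in the same file. A minimal sketch, assuming the df_data, ev_dwh_name, and sf_conn objects from the question; the output filename is made up:
import pandas as pd

# One writer for the whole loop: each query result becomes its own worksheet
# in the same workbook instead of overwriting the previous one.
with pd.ExcelWriter('all_functions.xlsx', engine='xlsxwriter') as writer:
    for func_name in df_data['FUNCTION_NAME']:
        sql = f"select * from table({ev_dwh_name}.OVERDRAFT.{func_name}())"
        dft_tf_data = pd.read_sql(sql, sf_conn)
        # Excel caps sheet names at 31 characters.
        dft_tf_data.to_excel(writer, sheet_name=func_name[:31], index=False)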

Python. Append CSV files in a folder into one big file

I am a little confused with the Pandas library and would really appreciate your help.
The task is to combine all *.csv files in a folder into one big file.
The CSV files don't have a header, so I just want to append all of them and add a header at the end.
Here is the code I use.
The final file is "ALBERTA GENERAL"; at the beginning I delete the old one before creating an updated version.
os.chdir(dataFolder)
with io.open("ALBERTA GENERAL.csv", "w+", encoding='utf8') as f:
    os.remove("ALBERTA GENERAL.csv")

extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
combined_csv = pd.concat([pd.read_csv(f, error_bad_lines=False) for f in all_filenames], axis=0, ignore_index=True)
print(combined_csv)

with io.open('ALBERTA GENERAL.csv', "w+", encoding='utf8') as outcsv:
    writer = csv.DictWriter(outcsv, fieldnames=["Brand, Name, Strain, Genre, Product type, Date"], delimiter=";")
    writer.writeheader()
    combined_csv.to_csv(outcsv, index=False, encoding='utf-8-sig')
But I get a confusing result that I don't know how to fix.
The final file doesn't append the intermediate files one below another; instead it adds new columns for each next file. I tried adding the same headers to the intermediate files, but it did not help.
Other than that, the header is not split into columns and is recognized as one field.
Can anyone help me fix my code, please?
Here is the link to the files
Just to fix the irregularities of the first file:
with open('ALBERTA GENERAL.csv', 'r') as f_in, open('ALBERTA GENERAL_fixed.csv', 'w') as f_out:
    for line in f_in:
        line = line.replace(',', ';')
        line = line.strip().rstrip(';')
        line = line.strip().lstrip(';')
        f_out.write(line + '\n')
os.remove('ALBERTA_GENERAL.csv')
We will import the first file separately because it has different requirements than the others:
df1 = pd.read_csv('ALBERTA GENERAL_fixed.csv',header=0,sep=';')
We can then do the other two:
df2 = pd.read_csv('file_ALBERTA_05.14.2020.csv',header=None,sep=';')
df3 = pd.read_csv('file_ALBERTA_05.18.2020.csv',header=None,sep=';')
df2.columns = df1.columns
df3.columns = df1.columns
Final steps:
combined = pd.concat([df1,df2,df3])
combined.to_csv('out.csv',index=False)
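If more dated files show up in the folder later, the same idea generalizes with glob; a minimal sketch, assuming the fixed first file and the file_ALBERTA_*.csv naming pattern from the question:
import glob
import pandas as pd

# The first (fixed) file carries the header; the remaining files are headerless.
df1 = pd.read_csv('ALBERTA GENERAL_fixed.csv', header=0, sep=';')
others = sorted(glob.glob('file_ALBERTA_*.csv'))
frames = [pd.read_csv(f, header=None, sep=';', names=df1.columns) for f in others]

combined = pd.concat([df1] + frames, ignore_index=True)
combined.to_csv('out.csv', index=False)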

Pandas scan directory to fix multiple corrupted excel files;merge all or few fixed files into one dataframe

A struggling Python newbie here. I would like to do the following:
(1) Fix multiple corrupted Excel files in a folder by looping over them; save the restored/fixed files to a new location.
(2) Merge all (or selected) fixed/restored Excel files into one pandas dataframe. If possible, I would like the code to be able to choose, say, the first 10 files, due to low memory.
The code stops running at the very first file and indicates no such file, while the file does exist in the directory. Assistance with both pieces of code would be highly appreciated. Thanks.
Please find attached the notepad containing the code and the error message (issues pasting code here).
file_dir = r"""C:\Users\Documents\corrupted_files"""
for filename in os.listdir(file_dir):
    print(filename)
    file = os.path.splitext(filename)[0]
    # Opening the file using 'utf-16' encoding
    file1 = io.open(filename, "r", encoding="utf-16")[0]
    data = file1.readlines()
    xldoc = Workbook()
    # Adding a sheet to the workbook object
    sheet = xldoc.add_sheet("Sheet1", cell_overwrite_ok=True)
    # Iterating and saving the data to sheet
    for i, row in enumerate(data):
        # Two things are done here
        # Removing the '\n' which comes while reading the file using io.open
        # Getting the values after splitting using '\t'
        for j, val in enumerate(row.replace('\n', '').split('\t')):
            sheet.write(i, j, val)
    # Saving the file as an excel file
    xldoc.save(r"C:\\Users\\Documents\\restored_data\\" + file + ".xlsx", 51)

# Need assistance with code to loop over the fixed (restored) excel files and combine, e.g. all or
# only the first 10, into one dataframe
ERROR MESSAGE BELOW
20181124_file_01.csv
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-37-17a38b97f646> in <module>
      4 file = os.path.splitext(filename)[0]
      5 # Opening the file using 'utf-16' encoding
      6 file1 = io.open(filename, "r", encoding="utf-16")[0]
      7 data = file1.readlines()
      8 xldoc = Workbook()

FileNotFoundError: [Errno 2] No such file or directory: '20181124_file01.csv'
Code should be:
for filename in os.listdir(file_dir):
    print(filename)
    file = os.path.join(file_dir, os.path.splitext(filename)[0])
    with open(os.path.join(file_dir, filename), "r", encoding="utf-16") as fh:
        xldoc = Workbook(fh)  # think you can use a file handle as a reference here
        # Adding a sheet to the workbook object
        sheet = xldoc.add_sheet("Sheet1", cell_overwrite_ok=True)
listdir returns only the filename, not the entire file path.
It might also be worth putting a with statement around your call to io.open so that the file handle is released after use.
You shouldn't run low on memory, as the reference to each workbook should be destroyed before you open the next.
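For the second part of the question (merging the restored files into one dataframe, capped at the first 10 for memory), a minimal sketch, assuming the restored_data folder used in the save step above:
import glob
import os
import pandas as pd

restored_dir = r"C:\Users\Documents\restored_data"                     # output folder from the fix step
files = sorted(glob.glob(os.path.join(restored_dir, "*.xlsx")))[:10]   # only the first 10 files

# Read the first sheet of each restored workbook and stack them into one dataframe.
merged = pd.concat((pd.read_excel(f) for f in files), ignore_index=True)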

add new row to numpy using realtime reading

I am using a microstacknode accelerometer and intend to save the readings into a csv file.
while True:
    numpy.loadtxt('foo.csv', delimiter=",")
    raw = accelerometer.get_xyz(raw=True)
    g = accelerometer.get_xyz()
    ms = accelerometer.get_xyz_ms2()
    a = numpy.asarray([[raw['x'], raw['y'], raw['z']]])
    numpy.savetxt("foo.csv", a, delimiter=",", newline="\n")
However, only one line ever gets saved. Any help is appreciated; still quite a noobie at Python.
NumPy is not the best solution for this type of things.
This should do what you intend:
while True:
    raw = accelerometer.get_xyz(raw=True)
    fobj = open('foo.csv', 'a')
    fobj.write('{},{},{}\n'.format(raw['x'], raw['y'], raw['z']))
    fobj.close()
Here fobj = open('foo.csv', 'a') opens the file in append mode, so if the file already exists, the next write will go to the end of the file, keeping the existing data.
Let's have a look at your code. This line:
numpy.loadtxt('foo.csv', delimiter=",")
reads the whole file but does not do anything with the data it reads, because you don't assign the result to a variable. You would need to do something like this:
data = numpy.loadtxt('foo.csv', delimiter=",")
This line:
numpy.savetxt("foo.csv", a, delimiter=",", newline="\n")
creates a new file with the name foo.csv, overwriting the existing one. Therefore, you see only one line, the last one written.
This should do the same but does not open and close the file all the time:
with open('foo.csv', 'a') as fobj:
    while True:
        raw = accelerometer.get_xyz(raw=True)
        fobj.write('{},{},{}\n'.format(raw['x'], raw['y'], raw['z']))
The with open() opens the file with the promise to close it even in case of an exception. For example, if you break out of the while True loop with Ctrl-C.
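If the rows should still go through NumPy, numpy.savetxt also accepts an already-open file handle, so opening foo.csv in append mode keeps the earlier rows; a minimal sketch, assuming the accelerometer object from the question:
import numpy as np

# savetxt can write to an open file object; an append-mode handle keeps earlier rows.
with open('foo.csv', 'ab') as fobj:          # binary append mode works across NumPy versions
    while True:
        raw = accelerometer.get_xyz(raw=True)
        a = np.asarray([[raw['x'], raw['y'], raw['z']]])
        np.savetxt(fobj, a, delimiter=',', newline='\n')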

Why is csvreader for Python starting good then producing NULL bytes?

Okay, so I am reading an Excel workbook. I read the file for a while and it started off as a .csv; after debugging and doing other things below the code I am showing you, it changed to an xlsx and I started getting IOError: no such file or directory. I figured out why, changed FFA.csv to FFA.xlsx, and it worked error-free. Then I did more debugging, got up this morning, and now I get the following error: line contains NULL byte. Weird, because the code started out fine and now it can't read. I put in the print repr() to debug, and it in fact now prints NULL bytes. So how do I fix this and prevent it in the future? Here are the first 200 bytes:
PK\x03\x04\x14\x00\x06\x00\x08\x00\x00\x00!\x00b\xee\x9dh^\x01\x00\x00\x90\x04\x00\x00\x13\x00\x08\x02[Content_Types].xml \xa2\x04\x02(\xa0\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
import csv

def readFile():
    count = 0
    print repr(open("FFA.xlsx", "rb").read(200))  # dump 1st 200 bytes
    with open("FFA.xlsx", "rb") as csvfile:
        FFAreader = csv.reader(csvfile, delimiter=",")
        for row in FFAreader:
            idd = row[0]
            name = row[1]
            pos = row[2]
            team = row[3]
            pts = row[4]
            oecr = row[5]
            oR = row[6]
            posR = row[7]
            up = row[8]
            low = row[9]
            risk = row[10]
            swing = row[11]

readFile()
The code you have posted has a small but dangerous mistake: you are leaking a file handle by opening the file twice.
1) You are opening the file and reading 200 bytes from it, but not closing it.
2) You are then opening the file the proper way, via a context manager, which in fact could read anything from it.
Some questions that may help you debug the problem:
Is the file you are opening stored on a networked resource (CIFS, NFS, etc.)?
Have you checked that the file is not opened by another process? lsof can help you check that.
Is this running on Windows or Linux? Can you test it under Linux if it happens on Windows, and vice versa?
I forgot to mention that you should not use the csv module for anything related to Excel, even when the file seems to be CSV data-wise. Use the xlrd module (https://pypi.python.org/pypi/xlrd); it's cross-platform and opens and reads both XLS and XLSX files perfectly fine since version 0.8.
This little piece of code will show you how to open the workbook and parse it in a basic manner:
import xlrd

def open_excel():
    with xlrd.open_workbook('FFA.xlsx') as wb:
        sh = wb.sheet_by_name('Sheet1')
        for rownum in xrange(sh.nrows):
            # Do whatever you need here, e.g. read the row's values:
            row = sh.row_values(rownum)
I agree with Marc. I did a training exercise importing an Excel file, and I think the pandas library would help in that case: you can import pandas as pd and use pd.read_excel(file_name) as part of a data-processing function like readFile() after the import.
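A minimal sketch of that pandas approach, assuming the FFA.xlsx workbook from the question (pandas needs an Excel engine such as openpyxl installed to read it):
import pandas as pd

def read_file():
    # read_excel parses the xlsx container directly, so there is no NULL-byte
    # problem like the one the csv module hits on zip-packed xlsx data.
    return pd.read_excel("FFA.xlsx")

players = read_file()
print(players.head())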
So this is what I did. But I am interested in learning the xlrd method; I have the module but no documentation. This works with no error messages. Still not sure why it changed from .csv to .xlsx, but it's working now. What would the script look like in xlrd?
import csv

def readFile():
    count = 0
    # print repr(open("FFA.csv", "rb").read(200))  # dump 1st 200 bytes to check if null values are produced
    with open("FFA.csv", "rb") as csvfile:
        FFAreader = csv.reader(csvfile, delimiter=",")
        for row in FFAreader:
            idd = row[0]
            name = row[1]
            pos = row[2]
            team = row[3]
            pts = row[4]
            oecr = row[5]
            oR = row[6]
            posR = row[7]
            up = row[8]
            low = row[9]
            risk = row[10]
            swing = row[11]

readFile()
