How to write on each iteration to a csv file - python-3.x

How do I write one line to a csv file on each iteration?
I would like to have this kind of behaviour.
import time
import csv

path = 'C:/Blender_Scripts/test.csv'

for i in range(0, 100):
    time.sleep(1)
    with open(path, 'a+', newline='') as Pt_file:
        Pt_writer = csv.writer(Pt_file)
        Pt_writer.writerow([i])
Is there a way to do this that is better for performance?
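One option (a minimal sketch, assuming the file handle can stay open for the whole loop) is to open the file once and flush after each row, so every iteration is persisted without re-opening the file each second:

import time
import csv

path = 'C:/Blender_Scripts/test.csv'

with open(path, 'a+', newline='') as Pt_file:
    Pt_writer = csv.writer(Pt_file)
    for i in range(0, 100):
        time.sleep(1)
        Pt_writer.writerow([i])
        Pt_file.flush()  # push each row to disk so it is visible immediately

This avoids the cost of opening and closing the file on every iteration while still writing one line at a time.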

Related

Extracting data from a UCI dataset online using Python if the file is compressed (.zip)

I want to use web scraping to get the data from this file:
https://archive.ics.uci.edu/ml/machine-learning-databases/00380/YouTube-Spam-Collection-v1.zip
How can I do that using requests in Python?
You can use this example to load the zip file using requests and the built-in zipfile module:
import requests
from io import BytesIO
from zipfile import ZipFile

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00380/YouTube-Spam-Collection-v1.zip"

with ZipFile(BytesIO(requests.get(url).content), "r") as myzip:
    # print the contents of the zip:
    # print(myzip.namelist())
    # print the contents of one of the files:
    with myzip.open("Youtube01-Psy.csv", "r") as f_in:
        print(f_in.read())
Prints:
b'COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS\n
...
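If you want the rows as a table rather than raw bytes, a minimal sketch (assuming pandas is installed; the member name comes from the zip listing above) is to pass the open zip member straight to pandas.read_csv, which accepts binary file-like objects:

import requests
import pandas as pd
from io import BytesIO
from zipfile import ZipFile

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00380/YouTube-Spam-Collection-v1.zip"

with ZipFile(BytesIO(requests.get(url).content)) as myzip:
    with myzip.open("Youtube01-Psy.csv") as f_in:
        df = pd.read_csv(f_in)  # pandas reads directly from the in-zip file object

print(df.head())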

Merge Multiple txt files and removing duplicates from resulting big file

I've been trying to get this one to work, without success because:
The files to be merged are large (up to 20MB each);
Duplicate lines appear across separate files, which is why I need to remove them from the resulting merged file;
Right now the code runs without output; it simply merges the files and does not deal with the duplicates.
import os
import io
import pandas as pd

merged_df = pd.DataFrame()
for file in os.listdir(r"C:\Users\username\Desktop\txt"):
    if file.endswith(".txt"):
        file_path = os.path.join(r"C:\Users\username\Desktop\txt", file)
        bytes = open(file_path, 'rb').read()
        merged_df = merged_df.append(pd.read_csv(io.StringIO(
            bytes.decode('latin-1')), sep=";", parse_dates=['Data']))

SellOutCombined = open('test.txt', 'a')
SellOutCombined.write(merged_df.to_string())
SellOutCombined.close()

print(len(merged_df))
Any help is appreciated.
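A minimal sketch of one approach (assuming duplicates are entire identical rows, and keeping the ';' separator, latin-1 encoding, and 'Data' date column from the question) is to concatenate the frames and call drop_duplicates before writing the result out once:

import os
import pandas as pd

folder = r"C:\Users\username\Desktop\txt"  # path taken from the question
frames = []
for file in os.listdir(folder):
    if file.endswith(".txt"):
        frames.append(pd.read_csv(os.path.join(folder, file),
                                   sep=";", encoding="latin-1",
                                   parse_dates=["Data"]))

# concatenate all files, then drop rows that are exact duplicates
merged_df = pd.concat(frames, ignore_index=True).drop_duplicates()

# write the deduplicated data back out as a single ';'-separated file
merged_df.to_csv("merged.csv", sep=";", index=False)
print(len(merged_df))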

Compress a CSV file written to a StringIO Buffer in Python3

I'm parsing text from pdf files into rows of ordered char metadata; I need to serialize these files to cloud storage, which is all working fine. However, due to their size I'd also like to gzip these files, but I've run into some issues there.
Here is my code:
import io
import csv
import zlib

# This data file is sent over Flask
page_position_data = pdf_parse_page_layouts(data_file)

field_order = ['char', 'position', 'page']
output_buffer = io.StringIO()
writer = csv.DictWriter(output_buffer, field_order)
writer.writeheader()
for page, rows in page_position_data.items():
    for text_char_data_row in rows:
        writer.writerow(text_char_data_row)
stored_format = zlib.compress(output_buffer)
This reads each row into the io.StringIO buffer successfully, but gzip/zlib only seem to work with bytes-like objects such as io.BytesIO, so the last line errors; I cannot write the csv into a BytesIO buffer because DictWriter/writer raise an error unless io.StringIO() is used.
Thank you for your help!
I figured this out and wanted to show my answer for anyone who runs into this:
The issue is that zlib.compress expects a bytes-like object; that doesn't mean either StringIO or BytesIO, as both of these are "file-like" objects which implement read(), like your normal Unix file handles.
All you have to do to fix this is write the csv to a StringIO(), then get the string from the StringIO() object and encode it into a bytestring; it can then be compressed by zlib.
import io
import csv
import zlib

# This data file is sent over Flask
page_position_data = pdf_parse_page_layouts(data_file)

field_order = ['char', 'position', 'page']
output_buffer = io.StringIO()
writer = csv.DictWriter(output_buffer, field_order)
writer.writeheader()
for page, rows in page_position_data.items():
    for text_char_data_row in rows:
        writer.writerow(text_char_data_row)

encoded = output_buffer.getvalue().encode()
stored_format = zlib.compress(encoded)
I have an alternative answer for anyone interested which should use less intermediate space; it needs Python 3.3 or later to use the getbuffer() method:
from io import BytesIO, TextIOWrapper
import csv
import zlib

def compress_csv(series):
    byte_buf = BytesIO()
    fp = TextIOWrapper(byte_buf, newline='', encoding='utf-8')
    writer = csv.writer(fp)
    for row in series:
        writer.writerow(row)
    fp.flush()  # make sure buffered text reaches the underlying BytesIO before reading it
    compressed = zlib.compress(byte_buf.getbuffer())
    fp.close()
    byte_buf.close()
    return compressed
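If the cloud store should hold a gzip file rather than raw zlib data, a similar sketch (same assumptions as above, just swapping in the standard gzip module) would be:

import gzip
import csv
from io import BytesIO, TextIOWrapper

def compress_csv_gzip(series):
    # same approach as compress_csv above, but produces a gzip container
    byte_buf = BytesIO()
    fp = TextIOWrapper(byte_buf, newline='', encoding='utf-8')
    writer = csv.writer(fp)
    for row in series:
        writer.writerow(row)
    fp.flush()  # flush the text wrapper so all rows reach the BytesIO buffer
    compressed = gzip.compress(byte_buf.getvalue())
    fp.close()
    return compressed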

Convert Dictionary to String and back in python

I am writing a dictionary to a file to save the data stored in it. When I read the file and try to convert it back, it converts to a list. I added print(type()) to see what type it is going into the file as and what type it is coming out as.
import ast
f = open("testfile.txt", "a+")
print (type(dic1))
f.write(str(dic1.items()) + "\n")
f.close()
This is me writing it to the file.
([('people', '1'), ('date', '01/01/1970'), ('t0', 'epoch'), ('time', '0'), ('p0', 'Tim Berners-Lee'), ('memory', 'This is the day time was created')])
This is what it looks like in the written file.
loadDict = ast.literal_eval(x)
print(type(loadDict))
This is the code for trying to convert it back to a dictionary.
Try using pickle; it is the preferred way to store and load Python objects:
To store:
import pickle

with open("testfile.txt", "wb") as f:  # pickle needs a binary-mode file
    pickle.dump(dic1, f)
To load:
import pickle

with open("testfile.txt", "rb") as f:  # read back in binary mode as well
    dic1 = pickle.load(f)
If you want to save multiple objects, you can save a list to the file, then load the list from the file, add what you want to it, and save it again.
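If you would rather keep a human-readable text file than switch to pickle, a minimal sketch (assuming dic1 only holds simple literals such as strings and numbers) is to write str(dic1) instead of str(dic1.items()), so that ast.literal_eval returns a dict rather than a list of tuples:

import ast

dic1 = {'people': '1', 'date': '01/01/1970', 't0': 'epoch'}  # sample values based on the question

with open("testfile.txt", "w") as f:
    f.write(str(dic1) + "\n")   # writes the dict literal, e.g. {'people': '1', ...}

with open("testfile.txt") as f:
    loadDict = ast.literal_eval(f.readline())

print(type(loadDict))  # <class 'dict'>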

Combine multiple csv files into a single xls workbook Python 3

We are in the transition at work from Python 2.7 to Python 3.5. It's a company-wide change and most of our current scripts were written in 2.7 with no additional libraries. I've taken advantage of the Anaconda distro we are using and have already changed most of our scripts over using the 2to3 module or by completely rewriting them. I am stuck on one piece of code though, which I did not write, and the original author is not here. He also did not supply comments, so I can only guess at the whole of the script. 95% of the script works correctly; at the end, after it creates 7 csv files with different parsed information, it has a custom function to combine the csv files into an xls workbook with each csv as a new tab.
import csv
import os
import xlwt
import glob
import openpyxl
from openpyxl import Workbook

# directory, wb, oshort and timestr come from earlier in the full script (not shown)
Parsefiles = glob.glob(directory + '/' + "Parsed*.csv")

def xlsmaker():
    for f in Parsefiles:
        (path, name) = os.path.split(f)
        (short_name, extension) = os.path.splitext(name)
        ws = wb.add_sheet(short_name)
        xreader = csv.reader(open(f, 'rb'))
        newdata = [line for line in xreader]
        for rowx, row in enumerate(newdata):
            for colx, value in enumerate(row):
                if value.isdigit():
                    ws.write(rowx, colx, value)

xlsmaker()

for f in Parsefiles:
    os.remove(f)

wb.save(directory + '/' + "Finished" + '' + oshort + '' + timestr + ".xls")
This was all written in Python 2.7 and still works correctly if I run it in Python 2.7. The issue is that it throws an error when run in Python 3.5.
File "parsetool.py", line 521, in (module)
xlsmaker()
File "parsetool.py", line 511, in xlsmaker
ws = wb.add_sheet(short_name)
File "c:\pythonscripts\workbook.py", line 168 in add_sheet
raise TypeError("The paramete you have given is not of the type '%s'"% self._worksheet_class.__name__)
TypeError: The parameter you have given is not of the type "Worksheet"
Any ideas about what should be done to fix the above error? I've tried multiple rewrites, but I get similar or new errors. I'm considering just figuring out a whole new method to create the xls, possibly using pandas instead.
Not sure why it errors. It is worth the effort to rewrite the code and use pandas instead. Pandas can read each csv file into a separate dataframe and save each dataframe as a separate sheet in an xls(x) file. This can be done by using pandas' ExcelWriter. E.g.
import pandas as pd
writer = pd.ExcelWriter('yourfile.xlsx', engine='xlsxwriter')
df = pd.read_csv('originalfile.csv')
df.to_excel(writer, sheet_name='sheetname')
writer.save()
Since you have multiple csv files, you would probably want to read all csv files and store them as a df in a dict. Then write each df to Excel with a new sheet name.
Multi-csv Example:
import pandas as pd
import sys
import os

writer = pd.ExcelWriter('default.xlsx')  # Arbitrary output name
for csvfilename in sys.argv[1:]:
    df = pd.read_csv(csvfilename)
    df.to_excel(writer, sheet_name=os.path.splitext(csvfilename)[0])
writer.save()
(Note that it may be necessary to pip install openpyxl to resolve errors about a missing xlsxwriter import.)
You can use the code below to read multiple .csv files into one big .xlsx Excel file.
I also added code for replacing ',' with '.' (or vice versa) for improved compatibility on Windows environments, depending on your locale settings.
import pandas as pd
import sys
import os
import glob
from pathlib import Path

extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

writer = pd.ExcelWriter('fc15.xlsx')  # Arbitrary output name
for csvfilename in all_filenames:
    txt = Path(csvfilename).read_text()
    txt = txt.replace(',', '.')
    text_file = open(csvfilename, "w")
    text_file.write(txt)
    text_file.close()
    print("Loading " + csvfilename)
    df = pd.read_csv(csvfilename, sep=';', encoding='utf-8')
    df.to_excel(writer, sheet_name=os.path.splitext(csvfilename)[0])
    print("done")
writer.save()
print("task completed")
Here's a slight extension to the accepted answer. Pandas 1.5 complains about the call to writer.save(). The fix is to use the writer as a context manager.
import sys
from pathlib import Path
import pandas as pd

with pd.ExcelWriter("default.xlsx") as writer:
    for csvfilename in sys.argv[1:]:
        p = Path(csvfilename)
        sheet_name = p.stem[:31]
        df = pd.read_csv(p)
        df.to_excel(writer, sheet_name=sheet_name)
This version also trims the sheet name down to fit in Excel's maximum sheet name length, which is 31 characters.
If your csv files are in Chinese with gbk encoding, you can use the following code:
import pandas as pd
import glob
import datetime
from pathlib import Path

now = datetime.datetime.now()
extension = "csv"
all_filenames = [i for i in glob.glob(f"*.{extension}")]

with pd.ExcelWriter(f"{now:%Y%m%d}.xlsx") as writer:
    for csvfilename in all_filenames:
        print("Loading " + csvfilename)
        df = pd.read_csv(csvfilename, encoding="gb18030")
        df.to_excel(writer, index=False, sheet_name=Path(csvfilename).stem)
        print("done")
print("task completed")
