How to open and append nested zip archives into dataframe without extracting? - python-3.x

I am trying to open a large number of csv files found in several layers of zip files. Given the nature of this project, I am trying to open each one, read it into a dataframe with read_csv, append that data to an aggregate dataframe, and then continue through the loop.
Example: Folder Directory/First Zip/Second Zip/Third Zip/csv file.csv
My existing code can loop through the contents of the second and third zip files and get the name of each csv file. I am aware that this code could probably be simplified by importing glob, but I'm unfamiliar with it.
import os
import pandas as pd
import zipfile, re, io

directory = 'C:/Test/'
os.chdir(directory)
fname = "test" + ".zip"

with zipfile.ZipFile(fname, 'r') as zfile:
    # second level of zip files
    for zipname in zfile.namelist():
        if re.search(r'\.zip$', zipname) != None:
            zfiledata = io.BytesIO(zfile.read(zipname))
            # third level of zip files
            with zipfile.ZipFile(zfiledata) as zfile2:
                for zipname2 in zfile2.namelist():
                    # this zipfile contains xml and csv contents. This filters out the xmls
                    if zipname2.find("csv") > 0:
                        zfiledata2 = io.BytesIO(zfile2.read(zipname2))
                        with zipfile.ZipFile(zfiledata2) as zfile3:
                            fullpath = directory + fname + "/" + zipname + "/" + zipname2 + "/"
                            # csv file names are always the same as their zips. this cleans the string.
                            csvf = zipname2.replace('_csv.zip', '.csv')
                            filehandle = open(fullpath, 'rb')
                            # the above statement is erroring: FileNotFoundError: [Errno 2] No such file or directory:
                            zfilehandle = zipfile.ZipFile(filehandle)
                            data = []
                            csvdata = StringIO.StringIO(zfilehandle.read(csvf))
                            df = pd.read_csv(csvdata)
                            data.append(df)
                            print(data.head())
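For what it's worth, the open() call fails because fullpath splices zip archive names into a disk path ("C:/Test/test.zip/...") and no such file exists on disk; the innermost archive is already in memory as zfile3, so the csv member can be read straight out of it. A minimal sketch of that in-memory approach, assuming the same layout as above and that each *_csv.zip wraps a single csv named after it (StringIO is also unnecessary; pandas can read the file object directly):

import io
import re
import zipfile
import pandas as pd

frames = []
with zipfile.ZipFile('C:/Test/test.zip') as zfile:
    # second level of zip files
    for zipname in zfile.namelist():
        if not re.search(r'\.zip$', zipname):
            continue
        with zipfile.ZipFile(io.BytesIO(zfile.read(zipname))) as zfile2:
            # third level of zip files, filtering out the xml members
            for zipname2 in zfile2.namelist():
                if 'csv' not in zipname2:
                    continue
                with zipfile.ZipFile(io.BytesIO(zfile2.read(zipname2))) as zfile3:
                    # csv member is assumed to be named after its zip
                    csvf = zipname2.replace('_csv.zip', '.csv')
                    with zfile3.open(csvf) as f:
                        frames.append(pd.read_csv(f))

combined = pd.concat(frames, ignore_index=True)
print(combined.head())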

Related

Loop over excel files' paths under a directory and pass them to data manipulation function in Python

I need to check the excel files under the directory /Users/x/Documents/test/ with the DataCheck function from data_check.py, so that I can do data manipulation on many excel files. data_check.py has the following code structure:
import pandas as pd

def DataCheck(filePath):
    df = pd.read_excel(filePath)
    try:
        df = df.dropna(subset=['building', 'floor', 'room'], how = 'all')
        ...
        ...
        ...
        df.to_excel(writer, 'Sheet1', index = False)

if __name__ == '__main__':
    status = True
    while status:
        rawPath = input(r"")
        filePath = rawPath.strip('\"')
        if filePath.strip() == "":
            status = False
        DataCheck(filePath)
In order to loop over all the excel files' paths under a directory, I use:
import os

directory = '/Users/x/Documents/test/'
for filename in os.listdir(directory):
    if filename.endswith(".xlsx") or filename.endswith(".xls"):
        print(os.path.join(directory, filename))
    else:
        pass
Out:
/Users/x/Documents/test/test 3.xlsx
/Users/x/Documents/test/test 2.xlsx
/Users/x/Documents/test/test 4.xlsx
/Users/x/Documents/test/test.xlsx
But I don't know how to combine the code above to pass the excel files' paths to DataCheck(filePath).
Thanks for your kind help in advance.
Call the function with the names instead of printing them:
import os

directory = '/Users/x/Documents/test/'
for filename in os.listdir(directory):
    if filename.endswith(".xlsx") or filename.endswith(".xls"):
        fullname = os.path.join(directory, filename)
        DataCheck(fullname)
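If this loop lives in a separate script from data_check.py, DataCheck also has to be imported first; a one-line sketch, assuming both files sit in the same directory (or that data_check.py is otherwise importable):

from data_check import DataCheck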

Rename by Appending a prefix to a file name

I would appreciate it if someone could give me a hint. I have to rename a batch of files by adding a prefix (a date) to each file name, so that the files are organized in an ordered manner in the folder: from older to newer.
The date itself is contained inside the file. Therefore, my script has to open the file, find the date, and use it as a "prefix" to add to the file name.
from datetime import datetime
import re
import os
file = open('blog_entry.txt', 'r', encoding='utf-8')
source_code = file.read()
<...>
# convert the date:
date = datetime.strptime(date_only, "%d-%b-%Y")
new_date = date.strftime('%Y_%m_%d')
The new_date variable should be used as a "prefix", so the new file name looks like "yyyy_mm_dd blog_entry.txt"
I cannot wrap my head around how to generate a "new name" using this prefix, so that I can apply the os.rename(old_name, new_name) command to the file.
Here is one way, using string concatenation to build the new filename you want:
from datetime import datetime
import re
import os
file = open('blog_entry.txt', 'r', encoding='utf-8')
source_code = file.read()
# read the date from the file contents
date = datetime.strptime(date_only, "%d-%b-%Y")
new_date = date.strftime('%Y_%m_%d')
path = "/path/to/your/file/"
os.rename(path + 'blog_entry.txt', path + new_date + ' ' + 'blog_entry.txt')
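As a side note, os.path.join is a slightly safer way to build those paths than plain concatenation, since it inserts the separator for you; the same rename, sketched with it:

os.rename(os.path.join(path, 'blog_entry.txt'),
          os.path.join(path, new_date + ' blog_entry.txt'))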

Python 3.x | FileNotFoundError: [Errno 2] No such file or directory | writing .csv from .xlsx

I was working on a file-converting function for the xlsx -> csv format. I was able to make the function work when I specified the exact file, but I'm running into issues when I try to iterate the process over a directory. Below is the code:
import os
import csv
import xlrd

def ExceltoCSV(excel_file, csv_file_base_path):
    workbook = xlrd.open_workbook(excel_file)
    ## get the worksheet names
    for sheet_name in workbook.sheet_names():
        print('processing - ' + sheet_name)
        ## extract the data from each worksheet
        worksheet = workbook.sheet_by_name(sheet_name)
        ## create a new csv file, with the name being the original Excel worksheet name; tidied up a bit by replacing spaces and dashes
        csv_file_full_path = csv_file_base_path + sheet_name.lower().replace(" - ", "_").replace(" ", "_") + '.csv'
        csvfile = open(csv_file_full_path, 'w')
        ## write into the new csv file, one row at a time
        writetocsv = csv.writer(csvfile, quoting = csv.QUOTE_ALL)
        for rownum in range(worksheet.nrows):
            writetocsv.writerow(
                list(x.encode('utf-8') if type(x) == type(u'') else x for x in worksheet.row_values(rownum))
            )
        csvfile.close()
        print(sheet_name + ' has been saved at - ' + csv_file_full_path)

## Paths as strings
p = r'//Network/TestingFolder/'
nf_p = r'//Network/TestingFolder/CSV_Only/'

## directory reference for the os.listdir() call below
directory = r'//Network/TestingFolder/'
file_list = []

## iterate over the directory and collect the path of each file, to be used in conjunction with ExceltoCSV()
for filename in os.listdir(directory):
    if filename.endswith(".xlsx"):  # or filename.endswith(".csv")
        file_path = os.path.join(directory, filename)
        file_list.append(file_path)
    else:
        continue

for paths in file_list:
    print(paths)
    ExceltoCSV(paths, nf_p)
My error is occurring with the line >> csvfile = open(csv_file_full_path, 'w')
Error is: FileNotFoundError: [Errno 2] No such file or directory
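One likely cause (a guess, since the paths are network shares): open(path, 'w') creates the file but not any missing directories, so if the CSV_Only folder does not exist, or a sheet name contains a path separator, every open fails with exactly this error. A minimal sketch of a guard, reusing the variable names from the function above:

import os

# create the output folder if it is missing
os.makedirs(csv_file_base_path, exist_ok=True)

# build the path with os.path.join, stripping path separators out of the sheet name
safe_name = sheet_name.lower().replace(" - ", "_").replace(" ", "_").replace("/", "_")
csv_file_full_path = os.path.join(csv_file_base_path, safe_name + '.csv')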

How can I loop through all the xls files in a folder to find the sheet names, and then replace them

I am trying to loop through all the XLS files in a folder and then replace each worksheet name with another string. This has to be done for all the files inside the folder.
I am relatively new to programming; here is my Python code. It runs okay (partially, when I do it for one file at a time), but I am unable to get it to work for all the files in the folder.
from xlutils.copy import copy
from xlrd import open_workbook
# open the file
direc = input('Enter file name: ')
rb = open_workbook(direc)
wb = copy(rb)
#index of a sheet
pointSheet = rb.sheet_names()
print(pointSheet)
idx = pointSheet.index(pointSheet)
wb.get_sheet(idx).name = u'RenamedSheet1'
wb.save(direc)
Error message:
Traceback (most recent call last):
File "./Rename.py", line 13, in <module>
idx = pointSheet.index(pointSheet)
ValueError: ['x xxx xxxx xxxxxx'] is not in list
My bad! The above code is for testing with a single file. Here is the loop:
import os
from pprint import pprint

files = []
for dirname, dirnames, filenames in os.walk(r'D:\Temp\Final'):
    # collect the paths of all subdirectories first.
    for subdirname in dirnames:
        files.append(os.path.join(dirname, subdirname))
    # collect the paths of all filenames.
    for filename in filenames:
        files.append(os.path.join(dirname, filename))
pprint(files)

for i in range(0, len(files)):
    rb = open_workbook(files[i])
    wb = copy(rb)
    idx = rb.sheet_names().index('5 new bulk rename')
    wb.get_sheet(idx).name = u'RenamedSheet1'
    wb.save(files[i])
print('Operation succeeded!')
Try something like this (untested) for a single file:
from xlutils.copy import copy
from xlrd import open_workbook

# open the file
direc = input('Enter file name: ')
rb = open_workbook(direc)
wb = copy(rb)
# enumerate yields each sheet's index along with its name, which is what get_sheet() needs
for idx, pointSheet in enumerate(rb.sheet_names()):
    print(pointSheet)
    wb.get_sheet(idx).name = u'RenamedSheet1'
wb.save(direc)
And wrap that in another loop using listdir (taken from here):
import os

for file in os.listdir("/mydir"):
    if file.endswith(".xls"):
        # <do what you did for a single file>
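Putting the two together, a minimal sketch; the folder path and new sheet name are placeholders, and it assumes one sheet per workbook (if a workbook has several sheets, give each a distinct name, since duplicate sheet names are not valid in Excel):

import os
from xlutils.copy import copy
from xlrd import open_workbook

folder = "/mydir"  # placeholder folder
for name in os.listdir(folder):
    if not name.endswith(".xls"):
        continue
    path = os.path.join(folder, name)
    rb = open_workbook(path)
    wb = copy(rb)
    # rename every sheet in this workbook
    for idx, sheet_name in enumerate(rb.sheet_names()):
        wb.get_sheet(idx).name = u'RenamedSheet1'  # placeholder new name
    wb.save(path)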

python 3.X concatenate zipped csv files to one non-zipped csv file

here is my python 3 code:
import zipfile
import os
import time
from timeit import default_timer as timer
import re
import glob
import pandas as pd

# local variables
# pc version
# the_dir = r'c:\ImpExpData'
# linux version
the_dir = '/home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95'

def main():
    """
    this is the function that controls the processing
    """
    start_time = timer()
    for root, dirs, files in os.walk(the_dir):
        for file in files:
            if file.endswith(".zip"):
                print("working dir is ...", the_dir)
                zipPath = os.path.join(root, file)
                z = zipfile.ZipFile(zipPath, "r")
                for filename in z.namelist():
                    if filename.endswith(".csv"):
                        # print(filename)
                        if re.match(r'^Trade-Geo.*\.csv$', filename):
                            pass  # do something with geo file
                        elif re.match(r'^Trade-Metadata.*\.csv$', filename):
                            pass  # do something with metadata file
                        else:
                            try:
                                with zipfile.ZipFile(zipPath) as z:
                                    with z.open(filename) as f:
                                        # print("send to test def...", filename)
                                        # print(zipPath)
                                        frame = pd.DataFrame()
                                        # EmptyDataError: No columns to parse from file -- how to deal with this error
                                        train_df = pd.read_csv(f, index_col=None, header=0, skiprows=1, encoding="cp1252")
                                        list_ = []
                                        list_.append(train_df)
                                        # print(list_)
                                        frame = pd.concat(list_, ignore_index=True)
                                        frame.to_csv('/home/ralph/Documents/lulumcusb/ImpExpData/Exports/concat_test.csv', encoding='cp1252')  # works
                            except:  # catches EmptyDataError: No columns to parse from file
                                print("EmptyDataError....", filename, "...", zipPath)
    # GetSubDirList(the_dir)
    end_time = timer()
    print("Elapsed time was %g seconds" % (end_time - start_time))

if __name__ == '__main__':
    main()
It mostly works, only it does not concatenate all the zipped csv files into one. There is one empty file, and all the csv files have the same field structure, varying only in the number of rows.
Here is what Spyder reports when I run it:
runfile('/home/ralph/Documents/lulumcusb/Sep15_cocncatCSV.py', wdir='/home/ralph/Documents/lulumcusb')
working dir is ... /home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95
EmptyDataError.... Trade-Exports-Chp-77.csv ... /home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95/Trade-Exports-Yr1992-1995.zip
/home/ralph/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py:688: DtypeWarning: Columns (1) have mixed types. Specify dtype option on import or set low_memory=False.
execfile(filename, namespace)
Elapsed time was 104.857 seconds
The final csv file is just the last zipped csv file processed; the csv file changes in size as the files are processed.
There are 99 csv files in the zipped file that I wish to concat into one non-zipped csv file.
The field or column names are:
colmNames = ["hs_code", "uom", "country", "state", "prov", "value", "quatity", "year", "month"]
The csv files are labeled chp01.csv, chp02.csv, etc., up to chp99.csv, with the "uom" (unit of measure) being either empty, an integer, or a string, depending on the hs_code.
Question: how do I get the zipped csv files concatenated into one large (estimated 100 MB uncompressed) csv file?
Added details:
I am trying not to unzip the csv files, since I would then have to go and delete them. I need to concat the files because I have additional processing to do. Extracting the zipped csv files is a viable option; I was just hoping to avoid it.
Is there any reason you don't want to do this with your shell?
Assuming the order in which you concatenate is irrelevant:
cd "/home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95"
unzip "Trade-Exports-Yr1992-1995.zip" -d unzipped && cd unzipped
for f in Trade-Exports-Chp*.csv; do tail --lines=+2 "$f" >> concat.csv; done
This removes the first line (column names) from each csv file before appending to concat.csv.
If you just did:
tail --lines=+2 "Trade-Exports-Chp*.csv" > concat.csv
You'd end up with:
==> Trade-Exports-Chp-1.csv <==
...
==> Trade-Exports-Chp-10.csv <==
...
==> Trade-Exports-Chp-2.csv <==
...
etc.
If you care about the order, change Trade-Exports-Chp-1.csv .. Trade-Exports-Chp-9.csv to Trade-Exports-Chp-01.csv .. Trade-Exports-Chp-09.csv.
Although it's doable in Python I don't think it's the right tool for the job in this case.
If you want to do the job in place without actually extracting the zip file:
for i in {1..99}; do
unzip -p "Trade-Exports-Yr1992-1995.zip" "Trade-Exports-Chp$i.csv" | tail --lines=+2 >> concat.csv
done
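For completeness, since the question asked for Python 3: the loop in the question overwrites concat_test.csv with a single-file frame on every iteration, which is why only the last csv survives. A minimal pandas-free sketch of the same concatenation, reading each csv member straight out of the archive and keeping only the first header; the archive path and member-name pattern are taken from the question and the Spyder output, so treat them as assumptions:

import re
import zipfile

zip_path = '/home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95/Trade-Exports-Yr1992-1995.zip'
out_path = '/home/ralph/Documents/lulumcusb/ImpExpData/Exports/concat_test.csv'

with zipfile.ZipFile(zip_path) as z, open(out_path, 'w', encoding='cp1252', newline='') as out:
    wrote_header = False
    # member names assumed to match the chapter files shown in the Spyder output;
    # sorted() gives lexicographic order (so Chp-10 sorts before Chp-2)
    members = sorted(n for n in z.namelist() if re.match(r'^Trade-Exports-Chp.*\.csv$', n))
    for name in members:
        with z.open(name) as f:
            lines = f.read().decode('cp1252').splitlines(keepends=True)
        if not lines:
            continue  # skip the one empty file (the source of the EmptyDataError)
        if not wrote_header:
            out.writelines(lines)      # keep the column header from the first file
            wrote_header = True
        else:
            out.writelines(lines[1:])  # drop the repeated header line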
