Zip Directory with Python - python-3.x

I'm trying to zip a bunch of folders individually. The folders contain files. I wrote a script that seems to work perfectly, except that the resulting zip files are not actually compressed. They're the same size as the original directory!
Here is my code:
import os, zipfile

workspace = "C:\\ziptest"
dirList = os.listdir(workspace)

def zipDir(path, zip):
    for root, dirs, files in os.walk(path):
        for file in files:
            zip.write(os.path.join(root, file))

for item in dirList:
    zip = zipfile.ZipFile('%s.zip' % item, 'w')
    zipDir('C:\\ziptest\\%s' % item, zip)
    zip.close()

I'm not a Python expert, but a quick lookup shows that zip.write takes an optional compression argument, such as zipfile.ZIP_DEFLATED. I grabbed that from here. I quote:
The third, optional argument to the write method controls what compression method to use. Or rather, it controls whether data should be compressed at all. The default is zipfile.ZIP_STORED, which stores the data in the archive without any compression at all. If the zlib module is installed, you can also use zipfile.ZIP_DEFLATED, which gives you “deflate” compression.
The reference is here. Look for the constant ZIP_DEFLATED; its definition:
The numeric constant for the usual ZIP compression method. This requires the zlib module. No other compression methods are currently supported.
I suppose that means that only default compression is supported... hope that helps!
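To make that concrete, here is a minimal sketch of the question's script with compression turned on. The only essential change is the third argument to ZipFile (its compression parameter); the rest is just tidied up:

import os
import zipfile

workspace = "C:\\ziptest"

def zipDir(path, zf):
    # add every file under path to the open archive
    for root, dirs, files in os.walk(path):
        for file in files:
            zf.write(os.path.join(root, file))

for item in os.listdir(workspace):
    # ZIP_DEFLATED enables real "deflate" compression (requires zlib)
    with zipfile.ZipFile('%s.zip' % item, 'w', zipfile.ZIP_DEFLATED) as zf:
        zipDir(os.path.join(workspace, item), zf)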

Is there any reason you don't just call the shell command, like
import subprocess

def zipDir(path, zip_name):
    # 7z expects the archive name before the paths to add
    subprocess.Popen(['7z', 'a', '-tzip', zip_name, path])

Related

Zip multiple directories and files in memory without storing zip on disk

I have a web application which should combine multiple directories and files into one ZIP file and then send this ZIP file to the end user. The end user then automatically receives a ZIP file which is downloaded.
This means that I shouldn't actually store the ZIP on the server disk and that I can generate it in memory so I can pass it along right away. My initial plan was to go with shutil.make_archive but this will store a ZIP file somewhere on the disk. The second problem with make_archive is that it takes a directory but doesn't seem to contain logic to combine multiple directories into one ZIP. Here's the minimal example code for reference:
import shutil
zipper = shutil.make_archive('/tmp/test123', 'zip', 'foldera')
I then started looking at zipfile, which is pretty powerful and can be used to loop over both directories and files recursively and add them to a ZIP. The answer from Jerr on another topic was a great start. The only problem with zipfile.ZipFile seems to be that it cannot create a ZIP without storing it on disk? The code now looks like this:
import os
import zipfile
import io

def zip_directory(path, ziph):
    # ziph is a zipfile handle
    for root, dirs, files in os.walk(path):
        for file in files:
            ziph.write(os.path.join(root, file),
                       os.path.relpath(os.path.join(root, file),
                                       os.path.join(path, '..')))

def zip_content(dir_list, zip_name):
    zipf = zipfile.ZipFile(zip_name, 'w', zipfile.ZIP_DEFLATED)
    for dir in dir_list:
        zip_directory(dir, zipf)
    zipf.close()

zip_content(['foldera', 'folderb'], 'test.zip')
My folder structure looks like this:
foldera
├── test.txt
└── test2.txt
folderb
├── test.py
└── folderc
    └── script.py
However, the issue is that this ZIP is being stored on the disk. I don't have any use in storing it and as I have to generate thousands of ZIP's a day it would fill up way too much storage.
Is there a way to not store the ZIP and convert the output to a BytesIO, str or other type where I can work with in memory and just dispose it when I'm done?
I've also noticed that a folder without any files in it will not be added to the ZIP (for example a folder 'folderd' with nothing in it). Is it possible to add a folder to the ZIP even if it is empty?
As far as I know, there is no way to create ZIPs without storing them on a disk (unless you figure out some hacky way of saving them to memory, but that would be a memory hog). I noticed you brought up that generating thousands of ZIPs a day would fill up your storage device, but you could simply delete each ZIP after it is sent back to the user. While this would create a file on your server, it would only be temporary and therefore would not require a ton of storage as long as you delete it after it is sent.
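If you do take the temporary-file route, the tempfile module can handle the cleanup automatically. A minimal sketch, reusing the zip_directory helper from the question (the folder name is a placeholder):

import tempfile
import zipfile

# The archive lives in an anonymous temporary file that is removed
# as soon as the with-block exits, so no storage accumulates.
with tempfile.TemporaryFile() as tmp:
    with zipfile.ZipFile(tmp, 'w', zipfile.ZIP_DEFLATED) as zipf:
        zip_directory('foldera', zipf)  # placeholder folder
    tmp.seek(0)
    data = tmp.read()  # bytes ready to send to the user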
Use BytesIO:
from io import BytesIO
import zipfile

filestream = BytesIO()
with zipfile.ZipFile(filestream, mode='w', compression=zipfile.ZIP_DEFLATED) as zipf:
    for dir in dir_list:
        zip_directory(dir, zipf)
you didn't ask how to send it, but for the record, the way I send it in my flask code is:
filestream.seek(0)
return send_file(filestream, attachment_filename=attachment_filename,
                 as_attachment=True, mimetype='application/zip')

Is it possible to remove characters from a compressed file without extracting it?

I have a compressed file that's about 200 MB, in the form of a tar.gz file. I understand that I can extract the XML files in it. It contains several small XML files and one 5 GB XML file. I'm trying to remove certain characters from the XML files.
So my very basic question is: is it even possible to accomplish this without ever extracting the content of the compressed file?
I'm trying to speed up the process of reading through xml files looking for characters to remove.
You will have to decompress, change, and then recompress the files. There's no way around that.
However, this does not necessarily mean writing the file to storage. You might be able to make the changes you want in a streaming fashion, i.e. everything is done in memory without ever having the complete decompressed file anywhere. Unix uses pipes for such tasks.
Here is an example on how to do it:
Create two random files:
echo "hello world" > a
echo "hello world" > b
Create a compressed archive containing both:
tar -c -z -f x.tgz a b
Pipe the contents of the uncompressed archive through a changer. Unfortunately I haven't found any shell-based way to do this, but you also specified Python in the tags, and with the tarfile module you can achieve this:
Here is the file tar.py:
#!/usr/bin/env python3
import sys
import tarfile

tar_in = tarfile.open(fileobj=sys.stdin.buffer, mode='r:gz')
tar_out = tarfile.open(fileobj=sys.stdout.buffer, mode='w:gz')
for tar_info in tar_in:
    reader = tar_in.extractfile(tar_info)
    if tar_info.path == 'a':  # my example file names are "a" and "b"
        # now comes the code which makes our change:
        # we just skip the first two bytes in each file:
        reader.read(2)       # skip two bytes
        tar_info.size -= 2   # reduce size in info object as well
    # add the (maybe changed) file to the output:
    tar_out.addfile(tar_info, reader)
tar_out.close()
tar_in.close()
This can be called like this:
./tar.py < x.tgz > y.tgz
y.tgz will contain both files again, but in file a the first two bytes will be skipped (so its contents will be llo world).
You will have noticed that you need to know the resulting size of your change beforehand. tar is designed to handle files, so it needs to write the size of each entry into the tar info header that precedes the entry's data in the output file; I see no way around this. With compressed output it also isn't possible to seek back after writing everything and adjust the size.
But as you phrased your question, this might be possible in your case.
All you will have to do is provide a file-like object (could be a Popen object's output stream) like reader in my simple example case.
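For instance, reusing reader and tar_info from the loop above, each entry could be piped through an external filter. Buffering the filter's output in memory conveniently yields the new size, though the 5 GB file from the question would need a streaming approach instead (tr and the deleted character are placeholders):

import io
import subprocess

# hypothetical filter: delete every 'x' byte via an external command
proc = subprocess.Popen(['tr', '-d', 'x'],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out, _ = proc.communicate(reader.read())
tar_info.size = len(out)  # the new size is known once the output is buffered
tar_out.addfile(tar_info, io.BytesIO(out))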

Python and Pandas - Reading the only CSV file in a directory without knowing the file name

I'm putting together a Python script that uses pandas to read data from a CSV file, sort and filter that data, then save the file to another location.
This is something I have to run regularly - at least weekly if not daily. The original file is updated every day and is placed in a folder but each day the file name changes and the old file is removed so there is only one file in the directory.
I am able to make all this work by specifying the file location and name in the script, but since the name of the file changes each day, I'd rather not have to edit the script every time I want to run it.
Is there a way to read that file based solely on the location? As I mentioned, it's the only file in the directory. Or is there a way to use a wildcard in the name? The name of the file is always something like: ABC_DEF_XXX_YYY.csv where XXX and YYY change daily.
I appreciate any help. Thanks!
from os import listdir

CSV_Files = [file for file in listdir('<path to folder>') if file.endswith('.csv')]
If there is only 1 CSV file in the folder, you can do
CSV_File = CSV_Files[0]
afterwards.
To get the file names solely based on the location:
import os, glob

os.chdir("/ParentDirectory")
for file in glob.glob("*.csv"):
    print(file)
Assume that dirName holds the directory holding your file.
A call to os.listdir(dirName) gives you the files and child directories in that directory (of course, you must import os first).
To limit the list to just files, we must write a little more, e.g.
[f for f in os.listdir(dirName) if os.path.isfile(os.path.join(dirName, f))]
So we have a full list of files. To get the first file, add [0] to the
above expression, so
fn = [f for f in os.listdir(dirName) if os.path.isfile(os.path.join(dirName, f))][0]
gives you the name of the first file, but without the directory.
To have the full path, use os.path.join(dirName, fn).
So the whole script, adding check for proper extension, can be:
import os

dirName = r"C:\Users\YourName\whatever_path_you_wish"
fn = [f for f in os.listdir(dirName)
      if f.endswith('.csv') and os.path.isfile(os.path.join(dirName, f))][0]
path = os.path.join(dirName, fn)
Then you can, e.g., open this file or make any use of it, as you need.
Edit
The above program will fail if the directory given does not contain any file
with the required extension. To make the program more robust, change it to
something like below:
fnList = [f for f in os.listdir(dirName)
          if f.endswith('.csv') and os.path.isfile(os.path.join(dirName, f))]
if len(fnList) > 0:
    fn = fnList[0]
    path = os.path.join(dirName, fn)
    print(path)
    # Process this file
else:
    print('No such file')
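Putting it together with pandas, a sketch of the daily job might look like this (the folder and output paths are placeholders; the wildcard matches the ABC_DEF_XXX_YYY.csv pattern from the question):

import glob
import os
import pandas as pd

folder = r"C:\path\to\input"  # placeholder path
matches = glob.glob(os.path.join(folder, 'ABC_DEF_*_*.csv'))
if matches:
    df = pd.read_csv(matches[0])
    # ... sort and filter df as needed ...
    df.to_csv(r"C:\path\to\output\result.csv", index=False)  # placeholder path
else:
    print('No matching CSV found')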

use glob module to get certain files with python?

The following is a piece of the code:
files = glob.iglob(studentDir + '/**/*.py', recursive=True)
for file in files:
    shutil.copy(file, newDir)
The thing is: I plan to get all the files with extension .py and also all the files whose names contain "write". Is there anything I can do to change my code? Many thanks for your time and attention.
If you want that recursive option, you could use :
patterns = ['/**/*write*', '/**/*.py']
for p in patterns:
    files = glob.iglob(studentDir + p, recursive=True)
    for file in files:
        shutil.copy(file, newDir)
If the wanted files are in the same directory, you could simply use:
# glob each pattern and flatten the matches into one list of file names
certainfiles = [f for pattern in ['*.py', '*write*'] for f in glob.glob(pattern)]
for file in certainfiles:
    shutil.copy(file, newDir)
I would suggest using pathlib, which has been available since version 3.4. It makes many things considerably easier.
In this case '**' stands for 'descend into the entire folder tree'.
'*.py' has its usual meaning.
path is a Path object; you can recover its string representation with the str function.
When you want the entire path name, use path.absolute and take the str of that.
Don't worry, you'll get used to it. :) If you look at the other goodies in pathlib you'll see it's worth it.
import shutil
from pathlib import Path

studentDir = <something>
newDir = <something else>

for path in Path(studentDir).glob('**/*.py'):
    if 'write' in str(path):
        shutil.copy(str(path.absolute()), newDir)

List files of specific file type and directories

I'm attempting to create a script that looks into a specific directory and then lists all the files of my chosen types in addition to all folders within the original location.
I have managed the first part, listing all the files of the chosen types, but I am encountering issues listing the folders.
The code I have is:
import datetime, os

now = datetime.datetime.now()
myFolder = 'F:\\'
textFile = 'myTextFile.txt'

outToFile = open(textFile, mode='w', encoding='utf-8')
filmDir = os.listdir(path=myFolder)
for file in filmDir:
    if file.endswith(('avi', 'mp4', 'mkv', 'pdf')):
        outToFile.write(os.path.splitext(file)[0] + '\n')
    if os.path.isdir(file):
        outToFile.write(os.path.splitext(file)[0] + '\n')
outToFile.close()
It is successfully listing all avi/mp4/mkv/pdf files, however it never enters the if os.path.isdir(file): branch, even though there are multiple folders in my F: directory.
Any help would be greatly appreciated. Even if it is suggesting a more effective/efficient method entirely that does the job.
Solution found thanks to Son of a Beach
if os.path.isdir(file):
changed to
if os.path.isdir(os.path.join(myFolder, file)):
os.listdir returns the names of the files, not the fully-qualified paths to the files.
You should use a fully qualified path name in os.path.isdir() (unless you've already told Python where to look).
Eg, instead of using if os.path.isdir(file): you could use:
if os.path.isdir(os.path.join(myFolder, file)):
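Since the question also welcomes alternative approaches, here is a pathlib-based sketch that sidesteps the join entirely and writes the same output as the original script (assuming file names without extra dots):

from pathlib import Path

myFolder = Path('F:\\')
with open('myTextFile.txt', 'w', encoding='utf-8') as outToFile:
    for entry in myFolder.iterdir():
        # Path objects carry their full location, so is_dir() just works
        if entry.is_dir() or entry.suffix.lower() in ('.avi', '.mp4', '.mkv', '.pdf'):
            outToFile.write(entry.stem + '\n')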
