I have a CSV file that is 350 MB in size. I want to zip it using Python so that I can mail the file.
I tried:
zipfile.ZipFile(file_name+'.zip', mode='w').write(file_name)
But this just renames the CSV file to a ZIP file; it doesn't reduce the file size.
Pass a compression method to the constructor:
zipfile.ZipFile(filename, mode='w', compression=zipfile.ZIP_LZMA)
By default, the library uses ZIP_STORED, which stores archive members uncompressed.
Source: Python Docs
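For example, a minimal sketch, assuming the CSV path is in file_name and the lzma module is available (otherwise ZIP_DEFLATED, which only needs zlib, works too):

import zipfile

file_name = 'data.csv'  # hypothetical input file

# Use a context manager so the archive is always closed properly.
with zipfile.ZipFile(file_name + '.zip', mode='w',
                     compression=zipfile.ZIP_LZMA) as zf:
    zf.write(file_name)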
I have issues trying to rename a compressed file using the processor CompressContent.
I need to rename a FlowFile to content.txt, then compress it using a different file name (e.g. compressed.gz).
To accomplish that, I'm using these processors:
ConvertRecord > UpdateAttribute > CompressContent > UpdateAttribute
(Screenshots of the flow diagram, the UpdateAttribute config, and the CompressContent config omitted.)
But there is a problem: if I rename the compressed file, it also changes the name of the txt file inside it (making it unusable).
Is it possible to rename the compressed file without changing the name of the txt file?
Thanks in advance.
I have an archive directory of zip files that I would like to open 'through' Spark in streaming, and write the unzipped files, also in streaming, to another directory that keeps the name of each zip file (one by one).
import io
import zipfile

def zip_extract(x):
    # x is a (path, bytes) pair, e.g. as produced by sparkContext.binaryFiles
    in_memory_data = io.BytesIO(x[1])
    file_obj = zipfile.ZipFile(in_memory_data, "r")
    files = file_obj.namelist()
    # Map each member name to its uncompressed bytes
    return dict(zip(files, [file_obj.open(file).read() for file in files]))
Is there an easy way to do this reading and writing in a streaming fashion? Thank you for your help.
As far as I know, Spark can't read archives out of the box.
A ZIP file both archives and compresses data. If you can, use a program like gzip to compress the data but keep each file separate, i.e. don't archive multiple files into a single one.
If the archive is a given and can't be changed, you can consider reading it with sparkContext.binaryFiles (https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html). This gives you each zipped file as a byte array in Spark, so you can write a mapper function which unzips it and returns the content of the files. You can then flatten that result to get an RDD of the files' contents.
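A rough sketch of that idea, using a variant of the zip_extract function from the question (the input path /data/zips is hypothetical):

import io
import zipfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Same idea as the question's zip_extract: (path, bytes) -> {member name: member bytes}
def zip_extract(x):
    file_obj = zipfile.ZipFile(io.BytesIO(x[1]), "r")
    return {name: file_obj.open(name).read() for name in file_obj.namelist()}

# One record per zip file: (path, raw bytes)
zips = sc.binaryFiles("/data/zips")

# Flatten into one record per archive member: (member name, member bytes)
members = zips.flatMap(lambda x: zip_extract(x).items())

This is plain batch processing rather than Structured Streaming, but each archive is handled independently inside the mapper, as described above.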
I have several files in my tar.gz archive. I want to read only one of them into a pandas DataFrame. Is there any way to do that?
Pandas can read a file inside a gz, but there seems to be no way to tell it to read a specific one when there are several files inside the archive.
Would appreciate any thoughts.
Babak
To read a specific file in a compressed archive, we just need to give its name or position. For example, to read a specific CSV file in a zipped folder, we can open that member and read its content.
from zipfile import ZipFile
import pandas as pd

# Open the zip file in READ mode
with ZipFile("results.zip") as z:
    # infolist()[2] is the third member of the archive (test.csv here)
    read = pd.read_csv(z.open(z.infolist()[2].filename))
    print(read)
Here is what the contents of results.zip look like (I want to read test.csv):
data_description.txt  sample_submission.csv  test.csv  train.csv
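Since the original question mentions a tar.gz archive rather than a zip, a similar sketch with the tarfile module, assuming a member named test.csv inside a hypothetical results.tar.gz, would be:

import tarfile
import pandas as pd

with tarfile.open("results.tar.gz", "r:gz") as tar:
    # extractfile returns a file-like object for the named member
    member = tar.extractfile("test.csv")
    df = pd.read_csv(member)
    print(df.head())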
If you use pardata, you can do this in one line:
import pardata
data = pardata.load_dataset_from_location('path-to-zip.zip')['table/csv']
The returned data variable should be a dictionary of all csv files in the zip archive.
Disclaimer: I'm one of the main co-authors of pardata.
I need a microcontroller to read a data file that's been stored in a Zip file (actually a custom OPC-based file format, but that's irrelevant). The data file has a known name, path, and size, and is guaranteed to be stored uncompressed. What is the simplest & fastest way to find the start of the file data within the Zip file without writing a complete Zip parser?
NOTE: I'm aware of the Zip comment section, but the data file in question is too big to fit in it.
I ended up writing a simple parser that finds the file data in question using the central directory.
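For reference, a minimal sketch of that approach in Python (a prototype of the logic rather than microcontroller code; it assumes a single-disk archive and ASCII/UTF-8 member names, and the function and variable names are made up for illustration):

import struct

def find_data_offset(path, target_name):
    # Return (offset, size) of a member's data inside the zip file.
    with open(path, "rb") as f:
        f.seek(0, 2)
        file_size = f.tell()
        # The End Of Central Directory record sits in the last 64 KiB + 22 bytes.
        tail_size = min(file_size, 65536 + 22)
        f.seek(file_size - tail_size)
        tail = f.read(tail_size)
        eocd = tail.rfind(b"PK\x05\x06")
        if eocd < 0:
            raise ValueError("EOCD record not found")
        cd_size, cd_offset = struct.unpack("<II", tail[eocd + 12:eocd + 20])

        # Walk the central directory looking for the target file name.
        f.seek(cd_offset)
        cd = f.read(cd_size)
        pos = 0
        while pos + 46 <= len(cd) and cd[pos:pos + 4] == b"PK\x01\x02":
            comp_size, = struct.unpack("<I", cd[pos + 20:pos + 24])
            name_len, extra_len, comment_len = struct.unpack("<HHH", cd[pos + 28:pos + 34])
            local_offset, = struct.unpack("<I", cd[pos + 42:pos + 46])
            name = cd[pos + 46:pos + 46 + name_len].decode("utf-8", errors="replace")
            if name == target_name:
                # Re-read the local header: its extra field length can differ
                # from the one stored in the central directory.
                f.seek(local_offset)
                local = f.read(30)
                lf_name_len, lf_extra_len = struct.unpack("<HH", local[26:30])
                data_offset = local_offset + 30 + lf_name_len + lf_extra_len
                return data_offset, comp_size
            pos += 46 + name_len + extra_len + comment_len
        raise KeyError(target_name)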
I'm trying to zip a bunch of folders individually. The folders contain files. I wrote a script that seems to work perfectly, except that the resulting zip files are not actually compressed; they're the same size as the original directory!
Here is my code:
import os, zipfile

workspace = "C:\\ziptest"
dirList = os.listdir(workspace)

def zipDir(path, zip):
    for root, dirs, files in os.walk(path):
        for file in files:
            zip.write(os.path.join(root, file))

for item in dirList:
    zip = zipfile.ZipFile('%s.zip' % item, 'w')
    zipDir('C:\\ziptest\\%s' % item, zip)
    zip.close()
I'm not a Python expert, but a quick lookup shows that there is another, optional argument to zip.write that controls compression: zipfile.ZIP_DEFLATED. I grabbed that from here. I quote:
The third, optional argument to the write method controls what compression method to use. Or rather, it controls whether data should be compressed at all. The default is zipfile.ZIP_STORED, which stores the data in the archive without any compression at all. If the zlib module is installed, you can also use zipfile.ZIP_DEFLATED, which gives you “deflate” compression.
The reference is here. Look for the constant ZIP_DEFLATED; its definition:
The numeric constant for the usual ZIP compression method. This requires the zlib module. No other compression methods are currently supported.
I suppose that means that only deflate compression is supported... hope that helps!
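Applied to the script above, a sketch of the fix, assuming zlib is available so ZIP_DEFLATED can be used:

import os, zipfile

def zipDir(path, zipf):
    for root, dirs, files in os.walk(path):
        for file in files:
            # compress_type overrides the archive's default for this member
            zipf.write(os.path.join(root, file), compress_type=zipfile.ZIP_DEFLATED)

# Or set the default for the whole archive when creating it:
# zipf = zipfile.ZipFile('example.zip', 'w', zipfile.ZIP_DEFLATED)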
Is there any reason you don't just call the shell command, like
import subprocess
def zipDir(path, zip):
    # 7z expects the archive name first, then the paths to add
    subprocess.Popen(['7z', 'a', '-tzip', zip, path])