I am using System.IO.Compression in .NET 4.5, with ZipArchive and ZipFile, to add .txt files to an archive. There are around 75 files.
When the files were put in a folder and the size measured, it was around 15 KB.
But when they were put into an archive using ZipArchive, the generated zip file was 21 KB.
Am I doing something wrong, or does ZipArchive just put the files into a single archive file rather than compressing them?
Is the Deflate algorithm used for the .zip creation?
This is what I have done. Is there any higher compression level that can be used? With 7-Zip the file size is even smaller, around 1 KB.
using (ZipArchive zippedFile = ZipFile.Open(zipFileName, ZipArchiveMode.Create))
{
    foreach (string file in filesTobeZipped)
    {
        zippedFile.CreateEntryFromFile(file, Path.GetFileName(file), CompressionLevel.Optimal);
    }
}
Each entry in a zip file carries an overhead of 76 bytes for the local and central headers, plus twice the length of the file name (with path), and the archive ends with a single 22-byte end record. For 75 files, each with, say, a three-character name, the total overhead would be about 6K. Each file averages about 200 bytes uncompressed, which is too short to compress effectively. If each file remained a 200-byte entry in the zip file, then you would end up with a 21K zip file, which is in fact what you got.
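To make the arithmetic concrete, here is a rough back-of-the-envelope estimate in Python (the 3-character names and 200-byte average are the assumptions from above, not measured values):

import zipfile  # not needed for the math; just noting the format in question

# Fixed per-entry and per-archive costs in the zip format:
LOCAL_HEADER = 30    # local file header (precedes each entry's data)
CENTRAL_HEADER = 46  # central directory header (one per entry, at the end)
END_RECORD = 22      # end-of-central-directory record (once per archive)

def zip_size_estimate(n_files, name_len, avg_entry_bytes):
    # The file name is stored twice: once in each header.
    per_entry = LOCAL_HEADER + CENTRAL_HEADER + 2 * name_len + avg_entry_bytes
    return n_files * per_entry + END_RECORD

print(zip_size_estimate(75, 3, 200))  # 21172 bytes, i.e. about 21K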
I have used exiftool to make .txt files with the EXIF data of some images and videos I am working with. I will also eventually need to create .csv manifests for these files. I know there are simple ways to convert .txt files to .csv files, but the instructions I've found describe how to do so when all of a record's information sits in different columns on the same line, while in my files it is spread across several lines. Is there a way to do this conversion with .txt files that are organized differently?
For example, I have seen instructions for how to convert something like this
filename date size
abc.JPG 1/1/2001 1MB
def.JPG 1/1/2001 1MB
hij.JPG 1/1/2001 1MB
to
filename,date,size
abc.JPG,1/1/2001,1MB
def.JPG,1/1/2001,1MB
hij.JPG,1/1/2001,1MB
The .txt files I have, on the other hand, are formatted like this:
========abc.JPG
File Name abc.JPG
Date/Time Original 2001:01:01 1:00:00
Size 1 MB
========def.JPG
File Name def.JPG
Date/Time Original 2001:01:01 1:01:00
Size 1 MB
========hij.JPG
File Name hij.JPG
Date/Time Original 2001:01:01 1:02:00
Size 1 MB
but I still need an output like
filename,date,size
abc.JPG,2001:01:01 1:00:00,1 MB
def.JPG,2001:01:01 1:01:00,1 MB
hij.JPG,2001:01:01 1:02:00,1 MB
Using sed and Miller (mlr), you can run
<input.txt sed -r 's/==.+//g;s/([a-zA-Z]) ([a-zA-Z])/\1-\2/' | mlr --x2c label filename,date,size >output.csv
Here sed empties the ======== separator lines (the blank lines that remain are what separate records in Miller's XTAB input format) and hyphenates the first letter-space-letter pair on each line, which for this input glues the two-word tag names into single keys; mlr --x2c then converts the XTAB records to CSV, and label renames the three fields. You get
filename,date,size
abc.JPG,2001:01:01 1:00:00,1 MB
def.JPG,2001:01:01 1:01:00,1 MB
hij.JPG,2001:01:01 1:02:00,1 MB
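Alternatively, a small Python sketch can do the same conversion; this is written against the sample shown above (input.txt, output.csv, and the tag list are assumptions about your data):

import csv

KEYS = ('File Name', 'Date/Time Original', 'Size')  # tags from the sample

def flush(record, writer):
    # write one per-file block as a CSV row, in KEYS order
    if record:
        writer.writerow([record.get(key, '') for key in KEYS])

with open('input.txt') as src, open('output.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    writer.writerow(['filename', 'date', 'size'])
    record = {}
    for line in src:
        if line.startswith('========'):
            flush(record, writer)  # a ======== line starts a new block
            record = {}
        else:
            for key in KEYS:
                if line.startswith(key):
                    record[key] = line[len(key):].strip()
                    break
    flush(record, writer)  # the last block has no ======== line after it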
I have a compressed file of about 200 MB, in the form of a tar.gz file. I understand that I can extract the XML files in it; it contains several small XML files and one 5 GB XML file. I'm trying to remove certain characters from the XML files.
So my very basic question is: is it even possible to accomplish this without ever extracting the content of the compressed file?
I'm trying to speed up the process of reading through xml files looking for characters to remove.
You will have to decompress, change, and then recompress the files. There's no way around that.
However, this does not necessarily mean writing the files to storage. You may be able to make your changes in a streaming fashion, i.e. everything is done in memory without ever having the complete decompressed file anywhere. Unix uses pipes for such tasks.
Here is an example of how to do it:
Create two small test files:
echo "hello world" > a
echo "hello world" > b
Create a compressed archive containing both:
tar -c -z -f x.tgz a b
Pipe the contents of the compressed archive through a changer. Unfortunately I haven't found a shell-based way to do this, but you also specified Python in the tags, and with the tarfile module you can achieve it:
Here is the file tar.py:
#!/usr/bin/env python3
import sys
import tarfile

# read a gzipped tar from stdin, write a new gzipped tar to stdout
tar_in = tarfile.open(fileobj=sys.stdin.buffer, mode='r:gz')
tar_out = tarfile.open(fileobj=sys.stdout.buffer, mode='w:gz')

for tar_info in tar_in:
    reader = tar_in.extractfile(tar_info)
    if tar_info.path == 'a':  # my example file names are "a" and "b"
        # now comes the code which makes our change:
        # we just skip the first two bytes of this file:
        reader.read(2)      # skip two bytes
        tar_info.size -= 2  # reduce the size in the info object as well
    # add the (maybe changed) file to the output:
    tar_out.addfile(tar_info, reader)

tar_out.close()
tar_in.close()
This can be called like this:
./tar.py < x.tgz > y.tgz
y.tgz will contain both files again, but in file a the first two bytes will have been skipped (so its contents will be llo world).
You will have noticed that you need to know the resulting size of your change beforehand. tar is designed to handle files, so it needs to write the size of each entry into the tar info header which precedes the entry's data in the resulting file; I see no way around this. With a compressed output it also isn't possible to seek back after writing everything and adjust the size field.
But given the way you phrased your question, knowing the size beforehand might be possible in your case.
All you have to do is provide a file-like object (it could be a Popen object's output stream) like reader in my simple example.
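For the character-removal use case specifically, a variation of the script above can read each member, strip the unwanted bytes, and fix up the size before re-adding it. This is only a sketch under stated assumptions: BAD is a placeholder for the bytes you want gone, and each member is buffered in memory, which is fine for the small files but not for the 5 GB one (that one would need a truly streaming filter plus a size known in advance):

#!/usr/bin/env python3
import io
import sys
import tarfile

BAD = b'\x00\x0b'  # placeholder: the bytes you want to remove

tar_in = tarfile.open(fileobj=sys.stdin.buffer, mode='r:gz')
tar_out = tarfile.open(fileobj=sys.stdout.buffer, mode='w:gz')

for tar_info in tar_in:
    reader = tar_in.extractfile(tar_info)  # None for non-regular members
    if reader is not None and tar_info.name.endswith('.xml'):
        data = reader.read().translate(None, BAD)  # delete the bad bytes
        tar_info.size = len(data)  # the header size must match the data
        tar_out.addfile(tar_info, io.BytesIO(data))
    else:
        tar_out.addfile(tar_info, reader)

tar_out.close()
tar_in.close()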
I have a CSV file which is 350 MB in size. I want to zip it using Python so that I can mail the file.
I tried:
zipfile.ZipFile(file_name+'.zip', mode='w').write(file_name)
But it just renames the CSV file to a ZIP file. It's not reducing the file size.
Pass a compression method to the constructor:
zipfile.ZipFile(filename, mode='w', compression=zipfile.ZIP_LZMA)
By default, the library uses ZIP_STORED, which stores archive members uncompressed.
Source: Python Docs
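For example, a minimal sketch using DEFLATE, which every unzip tool can read (ZIP_LZMA usually compresses better, but not all extractors support it; the file name here is a placeholder):

import zipfile

# 'report.csv' stands in for your file_name
with zipfile.ZipFile('report.csv.zip', mode='w',
                     compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write('report.csv')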
I am trying to use the bz2 and/or lzma packages in Python. I am trying to compress a database dump in CSV format and then put it into a zip file. I got it to work with one-shot compression with both packages.
The code for that looks like this:
with ZipFile('something.zip', 'w') as zf:
    content = bz2.compress(bytes(csv_string, 'UTF-8'))  # also with lzma
    zf.writestr(
        'something.csv' + '.bz2',
        content,
        compress_type=ZIP_DEFLATED
    )
When I try to use incremental compression, it creates a .zip file which, when I try to extract it, keeps yielding another archive file recursively.
The code for that looks like this:
with ZipFile('something.zip', 'w') as zf:
    compressor = bz2.BZ2Compressor()
    content = compressor.compress(bytes(csv_string, 'UTF-8'))  # also with lzma
    zf.writestr(
        'something.csv' + '.bz2',
        content,
        compress_type=ZIP_DEFLATED
    )
    compressor.flush()
I went through the documentation and also looked for information about these compression techniques, but there seems to be no comprehensive explanation of what one-shot and incremental compression are.
The difference between one-shot and incremental compression is that in one-shot mode you need to have the entire data in memory; if you are compressing a 100 gigabyte file, you ought to have loads of RAM.
With an incremental encoder, your code can feed the compressor 1 megabyte or 1 kilobyte at a time and write whatever data results to a file as soon as it is available. Another benefit is that an incremental compressor can be used to stream data: you can start writing compressed data before all of the uncompressed data is available!
Your second code is incorrect, and it will cause you to lose your data: the flush may return more data that you need to save as well. Here I am compressing a string of 1000 'a' characters in Python 3; the result from compress is an empty bytes object, and the actual compressed data is returned from flush:
>>> c = bz2.BZ2Compressor()
>>> c.compress(b'a' * 1000)
b''
>>> c.flush()
b'BZh91AY&SYI\xdcOc\x00\x00\x01\x81\x01\xa0\x00\x00\x80\x00\x08 \x00\xaamA\x98\xba\x83\xc5\xdc\x91N\x14$\x12w\x13\xd8\xc0'
Thus your second code should be:
compressor = bz2.BZ2Compressor()
content = compressor.compress(bytes(csv_string, 'UTF-8')) # also with lzma
content += compressor.flush()
But this way you're actually still doing one-shot compression, in a very complicated manner.
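A genuinely incremental version would feed the compressor chunk by chunk and write each piece of output as soon as it appears. A minimal sketch, with placeholder file names and an arbitrary chunk size; note that it writes a plain .bz2 file instead of wrapping it in a zip, which also avoids the nested-archive confusion:

import bz2

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; pick what fits your memory budget

compressor = bz2.BZ2Compressor()
with open('something.csv', 'rb') as src, \
        open('something.csv.bz2', 'wb') as dst:
    while True:
        chunk = src.read(CHUNK_SIZE)
        if not chunk:
            break
        # compress() may return b'' while data is buffered internally
        dst.write(compressor.compress(chunk))
    # flush() emits whatever the compressor is still holding back
    dst.write(compressor.flush())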
I am requesting a zip file from an API, and I'm trying to retrieve it by byte range (setting a Range header) and then parse each of the parts individually. After reading a bit about gzip and zip compression, I'm having a hard time figuring out:
Can I parse a portion of a zip file?
I know that gzip files usually compress a single file, so you can decompress and parse them in parts, but what about zip files?
I am using Node.js and have tried several libraries like adm-zip and zlib, but they don't seem to allow this kind of thing.
Zip files have a catalog at the end of the file (the central directory), in addition to the same basic information in a local header before each item; the catalog lists the file names and the location of each item in the zip file. Generally each item is compressed using deflate, which is the same algorithm that gzip uses (gzip just puts a different header before the deflate stream).
So yes, it's entirely feasible to extract the compressed byte stream for one item in a zip file and prepend a fabricated gzip header (the minimum size of that header is 10 bytes) to allow you to decompress just that item by passing it to gunzip. Strictly speaking, gzip also expects an 8-byte trailer containing the CRC-32 and uncompressed size, both of which are listed in the zip catalog.
If you want to write code to inflate the deflated stream yourself, I recommend you make a different plan. I've done it, and it's really not fun. Use zlib if you must do it; don't try to reimplement the decompression.
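In fact, zlib can inflate a raw deflate stream directly, with no gzip header at all, by passing a negative window-bits value (Node's zlib exposes this as inflateRaw). A Python sketch of the idea; here the raw deflate bytes are fabricated for the demo, whereas in your case you would slice them out of the archive just after the entry's local header:

import zlib

# fabricate a raw deflate stream standing in for one zip entry's data
comp = zlib.compressobj(wbits=-15)
raw_deflate = comp.compress(b'hello from one zip entry') + comp.flush()

# wbits=-15 means raw deflate: no zlib/gzip header or trailer expected
d = zlib.decompressobj(wbits=-15)
plain = d.decompress(raw_deflate) + d.flush()
print(plain)  # b'hello from one zip entry'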