I can use the DotNetZip library to unzip a single zip file:
Using Zip As ZipFile = ZipFile.Read(ZipFilePath)
    AddMessage("Unzipping " & Zip.Entries.Count & " entries")
End Using
However, if I try to pass in the first segment of a split archive produced using 7-Zip (e.g. Pubs.zip.001), it throws an error:
Could not read block - no data! (position 0x03210FCE)
The documentation seems to imply that you don't have to do anything special to read a split archive:
This property has no effect when reading a split archive. You can read
a split archive in the normal way with DotNetZip.
What am I doing wrong?
I have an archive of zip files that I would like to open through Spark in a streaming fashion, writing the unzipped files in streaming to another directory that keeps the name of each zip file (one by one).
import zipfile
import io

def zip_extract(x):
    # x is a (path, bytes) pair; open the zip from an in-memory buffer
    in_memory_data = io.BytesIO(x[1])
    file_obj = zipfile.ZipFile(in_memory_data, "r")
    files = file_obj.namelist()
    return dict(zip(files, [file_obj.open(file).read() for file in files]))
Is there an easy way to do this reading and writing in a streaming fashion? Thank you for your help.
As far as I know, Spark can't read archives out of the box.
A ZIP file both archives and compresses data. If you can, use a program like gzip to compress each file separately rather than archiving multiple files into a single one.
If the archive is a given and can't be changed, you can consider reading it with sparkContext.binaryFiles (https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html). This gives you each zipped file as a byte array in Spark, so you can write a mapper function which unzips it and returns the contents. You can then flatten that result to get an RDD of the files' contents.
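For example, here is a minimal PySpark sketch of that approach; the app name, the input path, and the shape of the emitted records are placeholder assumptions of mine, not from the question:

import io
import zipfile
from pyspark import SparkContext

sc = SparkContext(appName="unzip-example")  # placeholder app name

def zip_extract(pair):
    # binaryFiles yields (path, bytes) pairs; open each zip in memory
    path, data = pair
    with zipfile.ZipFile(io.BytesIO(data), "r") as file_obj:
        for name in file_obj.namelist():
            # emit one record per entry: (zip path, entry name, entry bytes)
            yield (path, name, file_obj.read(name))

# "/input/zips/*.zip" is a placeholder path
entries = sc.binaryFiles("/input/zips/*.zip").flatMap(zip_extract)

Note that each zip must fit in memory on one executor, since binaryFiles materialises the whole file as a byte array.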
I'm looking for a tool that can extract files by searching aggressively through a ZIP archive. The compressed files are preceded by local file headers (LFHs), but no central directory headers (CDHs) are present, so unzip just outputs an empty folder.
I found one called 'binwalk' but even though it finds the hidden files inside ZIP archives it seems not to know how to extract them.
Thank you in advance.
You can try sunzip. It reads the zip file as a stream, and will extract files as it encounters the local headers and compressed data.
Use the -r option to retain the files decompressed in the event of an error. You will be left with a temporary directory starting with _z containing the extracted files, but with temporary, random names.
I have a compressed file that's about 200 MB, in the form of a tar.gz archive. I understand that I can extract the XML files in it; it contains several small XML files and one 5 GB XML file. I'm trying to remove certain characters from the XML files.
So my very basic question is: is it even possible to accomplish this without ever extracting the content of the compressed file?
I'm trying to speed up the process of reading through the XML files looking for characters to remove.
You will have to decompress, change, and then recompress the files. There's no way around that.
However, this does not necessarily mean writing the files to storage. You might be able to make the changes you want in a streaming fashion, i.e. everything happens in memory without the complete decompressed file ever existing anywhere. Unix uses pipes for such tasks.
Here is an example on how to do it:
Create two random files:
echo "hello world" > a
echo "hello world" > b
Create a compressed archive containing both:
tar -c -z -f x.tgz a b
Pipe the contents of the uncompressed archive through a changer. Unfortunately I haven't found any shell-based way to do this, but you also specified Python in the tags, and with the tarfile module you can achieve it:
Here is the file tar.py:
#!/usr/bin/env python3
import sys
import tarfile

tar_in = tarfile.open(fileobj=sys.stdin.buffer, mode='r:gz')
tar_out = tarfile.open(fileobj=sys.stdout.buffer, mode='w:gz')
for tar_info in tar_in:
    reader = tar_in.extractfile(tar_info)
    if tar_info.path == 'a':  # my example file names are "a" and "b"
        # now comes the code which makes our change:
        # we just skip the first two bytes in each file:
        reader.read(2)      # skip two bytes
        tar_info.size -= 2  # reduce size in info object as well
    # add the (maybe changed) file to the output:
    tar_out.addfile(tar_info, reader)
tar_out.close()
tar_in.close()
This can be called like this:
./tar.py < x.tgz > y.tgz
y.tgz will contain both files again, but in a, the first two bytes will have been skipped (so its contents will be llo world).
You will have noticed that you need to know the resulting size of your change beforehand. tar is designed to handle files, so it needs to write the size of each entry into the header block that precedes the entry's data in the resulting file; I see no way around this. With compressed output it also isn't possible to seek back after writing everything and adjust the size field.
But as you phrased your question, this might be possible in your case.
All you have to do is provide a file-like object (it could be a Popen object's output stream) like reader in my simple example.
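As an illustration of the same pattern for your character-removal case, here is a hedged sketch; the byte set to remove is a placeholder of mine, it assumes regular-file entries like the example above, and it buffers each entry in memory (so the 5 GB file would need a chunked variant):

#!/usr/bin/env python3
import io
import sys
import tarfile

BAD = b"\x00"  # placeholder: the bytes you want removed

tar_in = tarfile.open(fileobj=sys.stdin.buffer, mode='r:gz')
tar_out = tarfile.open(fileobj=sys.stdout.buffer, mode='w:gz')
for tar_info in tar_in:
    reader = tar_in.extractfile(tar_info)
    data = reader.read().translate(None, BAD)  # drop the unwanted bytes
    tar_info.size = len(data)  # the entry header must match the new size
    tar_out.addfile(tar_info, io.BytesIO(data))
tar_out.close()
tar_in.close()

Because the size is set from the already-filtered data, the know-the-size-beforehand problem disappears, at the cost of holding one entry in memory at a time.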
I wanted to copy a file from one location to another using a Groovy script. I found that the copied file was orders of magnitude larger than the original after copying.
After some trial and error I found the correct way to copy but am still puzzled as to why it should be bigger.
def existingFile = new File("/x/y/x.zip")
def newFile1 = new File("/x/y/y.zip")
def newFile2 = new File("/x/y/z.zip")

newFile1 << existingFile.bytes
newFile2.bytes = existingFile.bytes
If you run this code, newFile1 will be much larger than existingFile, while newFile2 will be the same size as existingFile.
Note that both zip files are valid afterwards.
Does anyone know why this happens? Am I using the first copy incorrectly? Or is it something odd in my setup?
If the file already exists before this code is called then you'll get different behaviour from << and .bytes = ...
file << byteArray
will append the contents of byteArray to the end of the file, whereas
file.bytes = byteArray
will overwrite the file with the specified content. When the byteArray is ZIP data, both versions will give you a result that is a valid ZIP file, because the ZIP format can cope with arbitrary data prepended to the beginning of the file before the actual ZIP data without invalidating the file (typically this is used for things like self-extracting .exe stubs).
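You can convince yourself of that prepended-data property with a small Python sketch (the entry name and junk prefix are placeholders of mine):

import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", "hello world")

# prepend arbitrary junk, much as the accidental << append leaves the
# old file contents sitting before the freshly appended ZIP data
prefixed = io.BytesIO(b"JUNK" * 1000 + buf.getvalue())
with zipfile.ZipFile(prefixed) as zf:
    print(zf.read("hello.txt"))  # b'hello world', still readable

Readers locate the central directory from the end of the file, which is why the junk at the front is simply ignored.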
I am using the Minizip API to zip and unzip files to and from an archive. I have a requirement to delete each zip entry from the archive as soon as I have extracted it.
If the zip archive has multiple entries, I am able to delete a particular entry as soon as I extract it, and then rebuild the archive with the remaining entries. I achieve this using a temporary zip.
But when there is only a single file inside the zip archive, I am only able to delete it after complete extraction. Is there a more efficient way to extract and delete a zip entry in chunks? There are no direct APIs in Minizip for deletion; I am using raw write and read.
Thanks in advance,
JP
No, there is no way to delete part of an entry in a ZIP archive, short of extracting the whole entry and re-archiving the part you want to keep. (Which doesn't make sense here, since you're already trying to extract the file!)