Is it necessary to calculate crc32 when writing zip files? - zip

I'm designing my own tiny zip writer and RAM is pretty limited.
I'm also not using any compression...
So I'm wonder:
Is it possible to create a zip file that do not contain any crc32 checksums in either the entry header or the central directory and have it still be a valid zip file? is crc32 optional or a requirement?
can i set some flag to something to indicate that it dose not have crc32 checksum and set the crc bits to just 1 or 0?
I just want to avoid having to read each file and calculate the checksum and simply just do a kind of concatenation of multiple files

Related

Simplest format to archive a directory

I have a script that needs to work on multiple platforms and machines. Some of those machines don't have any available archiving software (e.g. zip, tar). I can't download any software onto these machines.
The script creates a directory containing output files. I need to package all those files into a single file so i can download it easily.
What is the simplest possible archiving format to implement, so I can easily roll my own impl in the script. It doesn't have to support compression.
I could make up something ad-hoc, e.g.
file1 base64EncodedContents
dir1/file1 base64EncodedContents
etc.
However if one already exists then that will save me having to roll my own packing and unpacking, only packing, which would be nice. Bonus points if it's zip compatible, so that I can try zipping it with compression if possible, and them impl my own without compression otherwise, and not have to worry about which it is on the other side.
The tar archive format is extremely simple - simple enough that I was able to implement a tar archiver in powershell in a couple of hours.
It consists of a sequence of file header, file data, file header, file data etc.
The header is pure ascii, so doesn't require any bit manipulation - you can literally append strings. Once you've written the header, you then append the file bytes, and pad it with nil chars till it's a multiple of 512 bytes. You then repeat for the next file.
Wikipedia has more details on the exact format: https://en.wikipedia.org/wiki/Tar_(computing).

Does ActiveStorage use the checksum for anything?

In a very real sense, my question is actually 'can I skip generating a checksum', but answering that question rests on the above question.
To give you some background, I'm (finally) converting from Paperclip to ActiveStorage, and one of the pains of my particular conversion process is that I'm storing a decent sized number of fairly large files -- in addition to normal sized thumbnail images, I'm also storing large multimedia files, some in excess of 10GBs (currently poking at a 15GB file).
The basic conversion process has me downloading the file to generate a checksum, and a few other minor details that could be done with a head request instead of downloading the full file. We also copy the file from it's old 'home' to its new 'home', but that is done as an S3 to S3 copy, and doesn't take as long as downloading and uploading.
I'd love to skip the download & generate checksum process -- or at least, put it off for another day, as a cleanup step that isn't important to what we're actually doing.
So the question is: does the checksum actually do anything in ActiveStorage, or is it just a 'nice-to-have' feature that would allow me to, for example, publish the checksum if someone wanted to verify their version?
Found in code Rails
Prior to uploading, we compute the checksum, which is sent to the
service for transit integrity validation. If the checksum does not
match what the service receives, an exception will be raised.
You can create your own checksum without downloading the file:
Found in code Rails
def compute_checksum_in_chunks(io)
OpenSSL::Digest::MD5.new.tap do |checksum|
while chunk = io.read(5.megabytes)
checksum << chunk
end
io.rewind
end.base64digest
end

ZIP file format. How to read file properly?

I'm currently working on one Node.js project. I want to have an ability to read, modify and write ZIP file without saving it into FS (we receive it by TCP and send it back after modifications were made), and so far it looks like possible bocause of simple ZIP file structure. Currently I refer to this documentation.
So ZIP file has simple structure:
File header 1
File data 1
File data descriptor 1
File header 2
File data 2
File data descriptor 2
...
[other not important yet]
First we need to read file header, which contains field compressed size, and it could be the perfect way to read file data 1 by it's length. But it's actually not. This field may contain '0' or '0xFFFFFFFF', and those values don't describe its actual length. In that case we have to read file data without information about it's length. But how?..
Compression/Decopression algorithm descriptions looks pretty complex to me, and I plan to use ZLIB for compression itself anyway. So if something useful described there, then I missed the point.
Can someone explain the proper way to read those files?
P.S. Please avoid suggesting npm modules. I do not want to only solve the problem, but also to understand how things work.
Note - I'm assuming you want to read and process the zip file as
it comes off the socket, rather than reading the complete zip file into
memory before processing. Both options are valid.
I'd initially ignore the use cases where the compressed size has a value of '0' or '0xFFFFFFFF'. The former is only present in zip files created in streaming mode, the latter for zip files larger than 4Gig.
Dealing with them adds a lot of complexity - you can add support for them later, if necessary. Whether you ever need to support the 0/0xFFFFFFFF use cases depends on the nature of the zip files you intend to process.
When the compression method is deflated (8), use zlib for compression/decompression. You also need to support compression method stored (0). It gets used for very small files where compression isn't appropriate.

Changing the head of a large Fortran binary file without dealing with the whole body

I have a large binary file (~ GB size) generated from a Fortran 90 program. I want to modify something in the head part of the file. The structure of the file is very complicated and contains many different variables, which I want to avoid going into. After reading and re-writing the head, is it possible to "copy and paste" the reminder of the file without knowing its detailed structure? Or even better, can I avoid re-writing the whole file altogether and just make changes on the original file? (Not sure if it matters, but the length of the header will be changed.)
Since you are changing the length of the header, I think that you have to write a new, revised file. You could avoid having to "understand" the records after the header by opening the file with stream access and just reading bytes (or perhaps four byte words if the file is a multiple of four bytes) until you reach EOF and copying them to the new file. But if the file was originally created as sequential access and you want to access it that way in the future, you will have to handle the record length information for the header record(s), including altering the value(s) to be consistent with the changed the length of the record(s). This record length information is typically a four-byte integer at beginning and end of each record, but it depends on the compiler.

How to find the position of Central Directory in a Zip file?

I am trying to find the position of the first Central Directory file header in a Zip file.
I'm reading these:
http://en.wikipedia.org/wiki/Zip_(file_format)
http://www.pkware.com/documents/casestudies/APPNOTE.TXT
As I see it, I can only scan through the Zip data, identify by the header what kind of section I am at, and then do that until I hit the Central Directory header. I would obviously read the File Headers before that and use the "compressed size" to skip the actual data, and not for-loop through every byte in the file...
If I do it like that, then I practically already know all the files and folders inside the Zip file in which case I don't see much use for the Central Directory anymore.
To my understanding the purpose of Central Directory is to list file metadata, and the position of the actual data in the Zip file so you wouldn't need to scan the whole file?
After reading about End Of Central Directory record, Wikipedia says:
This ordering allows a zip file to be created in one pass, but it is
usually decompressed by first reading the central directory at the
end.
How would I find End of Central Directory record easily? We need to remember that it can have an arbitrary sized comment there, so I may not know how many bytes from the end of the data stream it is located at. Do I just scan it?
P.S. I'm writing a Zip file reader.
Start at the end and scan towards the beginning, looking for the end of directory signature and counting the number of bytes you have scanned. When you find a candidate, get the byte 20 offset for the comment length (L). Check if L + 20 matches your current count. Then check that the start of the central directory (pointed to by the byte 12 offset) has an appropriate signature.
If you assumed the bits were pretty random when the signature check happened to be a wild guess (e.g. a guess landing into a data segment), the probability of getting all the signature bits correct is pretty low. You could refine this and figure out the chance of landing in a data segment and the chance of hitting a legitimate header (as a function of the number of such headers), but this is already sounded like a low likelihood to me. You could increase your confidence level by then checking the signature of the first file record listed, but be sure to handle the boundary case of an empty zip file.
I ended up looping through the bytes starting from the end. The loop stops if it finds a matching byte sequence, the index is below zero or if it already went through 64k bytes.
Just cross your fingers and hope that there isn't an entry with the CRC, timestamp or datestamp as 06054B50, or any other sequence of four bytes that happen to be 06054B50.

Resources