Zipping a folder into equal size parts - zip

I've been using 7Zip for a few years now and always liked that I could zip a folder into several parts of a specific size. For example, the website BOX only allows uploads under 100MB so anything I wanted to put into BOX, I just split the zip file into 95MB files. However, recently I've needed to do something similar except instead of breaking into a certain size, I need to split them up into a specific number of files but all equaling the same size. Right now, 7zip breaks them into the max size you allow and the last file is any remaining data ranging from 1KB up to the limit specified.
For example, say I have an 826MB file and I want it zipped into 5 files that are all the same size. Is there any program out there that will do this?
Thanks in advance!

I don't know of any program that does this, but if this is something that you're doing regularly, you could write a script that:
Finds out the size of the file
Calculates the maximum piece size to use if you want to split it into n pieces.
Constructs a corresponding 7zip command
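The steps above could be sketched roughly as follows. This is only an illustration: the function names are made up, it assumes a `7z` executable on the PATH, and it relies on 7-Zip's `-v` volume switch with a `b` (bytes) suffix. Note that the volume size applies to the compressed archive, so the size you feed in should be the size of the archive (for example, measured after compressing once without splitting), not the size of the uncompressed input.

```python
def volume_size(total_bytes, parts):
    """Smallest volume size (in bytes) that fits total_bytes into `parts` volumes."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    return -(-total_bytes // parts)  # ceiling division


def build_command(archive, source, volume_bytes):
    """Build a 7z command line; assumes `7z` is installed and on the PATH."""
    return ["7z", "a", f"-v{volume_bytes}b", archive, source]


# Hypothetical usage: an 826 MiB archive split into 5 volumes.
size = volume_size(826 * 1024 * 1024, 5)
print(build_command("out.7z", "folder", size))
```

The list can be passed to `subprocess.run`. The last volume may still be a few bytes shorter than the others, since the total is rarely an exact multiple of the part count.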

Related

ZIP file format. How to read file properly?

I'm currently working on a Node.js project. I want to be able to read, modify and write a ZIP file without saving it to the file system (we receive it over TCP and send it back after modifications are made), and so far it looks possible because of the simple ZIP file structure. Currently I refer to this documentation.
So ZIP file has simple structure:
File header 1
File data 1
File data descriptor 1
File header 2
File data 2
File data descriptor 2
...
[other not important yet]
First we need to read the file header, which contains the field compressed size, and it could be the perfect way to read file data 1 by its length. But it's actually not. This field may contain '0' or '0xFFFFFFFF', and those values don't describe its actual length. In that case we have to read the file data without information about its length. But how?..
The compression/decompression algorithm descriptions look pretty complex to me, and I plan to use ZLIB for compression itself anyway. So if something useful is described there, then I missed the point.
Can someone explain the proper way to read those files?
P.S. Please avoid suggesting npm modules. I do not want to only solve the problem, but also to understand how things work.
Note - I'm assuming you want to read and process the zip file as
it comes off the socket, rather than reading the complete zip file into
memory before processing. Both options are valid.
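I'd initially ignore the use cases where the compressed size has a value of '0' or '0xFFFFFFFF'. The former is only present in zip files created in streaming mode (general purpose flag bit 3 is set and the real sizes follow the data in the data descriptor); the latter marks Zip64 entries, used for zip files larger than 4 GiB.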
Dealing with them adds a lot of complexity - you can add support for them later, if necessary. Whether you ever need to support the 0/0xFFFFFFFF use cases depends on the nature of the zip files you intend to process.
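When the compression method is deflated (8), use zlib for compression/decompression. The data in a zip entry is a raw deflate stream without the zlib header, so use the raw variants (in Node.js, zlib.inflateRaw and zlib.deflateRaw). You also need to support compression method stored (0). It gets used for very small files where compression isn't appropriate.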
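To make the layout concrete, here is a minimal sketch (in Python rather than Node.js, purely for illustration) of reading one entry from a buffer. The 30-byte local file header layout comes from the ZIP specification; `read_entry` is a hypothetical helper, and it deliberately refuses the streamed and Zip64 cases discussed above.

```python
import struct
import zlib

# signature, version, flags, method, time, date, crc32,
# compressed size, uncompressed size, name length, extra length
LOCAL_HEADER = struct.Struct("<IHHHHHIIIHH")  # 30 bytes, little-endian
LOCAL_SIGNATURE = 0x04034B50


def read_entry(buf, offset=0):
    """Return (name, data, next_offset) for the entry starting at offset."""
    (sig, _version, flags, method, _time, _date, crc,
     csize, _usize, name_len, extra_len) = LOCAL_HEADER.unpack_from(buf, offset)
    if sig != LOCAL_SIGNATURE:
        raise ValueError("no local file header at this offset")
    if flags & 0x08 or csize == 0xFFFFFFFF:
        raise NotImplementedError("streamed or Zip64 entry")
    name_start = offset + LOCAL_HEADER.size
    data_start = name_start + name_len + extra_len
    # Flag bit 11 means the name is UTF-8; otherwise it is CP437.
    encoding = "utf-8" if flags & 0x800 else "cp437"
    name = bytes(buf[name_start:name_start + name_len]).decode(encoding)
    raw = bytes(buf[data_start:data_start + csize])
    if method == 0:        # stored
        data = raw
    elif method == 8:      # deflated: raw stream, hence negative wbits
        data = zlib.decompress(raw, -15)
    else:
        raise NotImplementedError(f"compression method {method}")
    if zlib.crc32(data) != crc:
        raise ValueError("CRC mismatch")
    return name, data, data_start + csize
```

Calling `read_entry` repeatedly with the returned `next_offset` walks the entries until the signature no longer matches, which is where the central directory begins.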

Changing the head of a large Fortran binary file without dealing with the whole body

I have a large binary file (~ GB size) generated from a Fortran 90 program. I want to modify something in the head part of the file. The structure of the file is very complicated and contains many different variables, which I want to avoid going into. After reading and re-writing the head, is it possible to "copy and paste" the remainder of the file without knowing its detailed structure? Or even better, can I avoid re-writing the whole file altogether and just make changes on the original file? (Not sure if it matters, but the length of the header will be changed.)
Since you are changing the length of the header, I think that you have to write a new, revised file. You could avoid having to "understand" the records after the header by opening the file with stream access and just reading bytes (or perhaps four-byte words if the file is a multiple of four bytes) until you reach EOF and copying them to the new file. But if the file was originally created as sequential access and you want to access it that way in the future, you will have to handle the record length information for the header record(s), including altering the value(s) to be consistent with the changed length of the record(s). This record length information is typically a four-byte integer at the beginning and end of each record, but it depends on the compiler.
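The same idea can be sketched outside Fortran. The Python below is only an illustration under explicit assumptions: the header is the first sequential record, and each record is framed by a four-byte little-endian length marker before and after the payload (common, but compiler dependent). `replace_first_record` is a made-up name.

```python
import shutil
import struct

MARKER = struct.Struct("<i")  # assumed: 4-byte little-endian record length


def replace_first_record(src, dst, new_payload):
    """Write new_payload as the first record, then copy the rest unchanged."""
    (old_len,) = MARKER.unpack(src.read(MARKER.size))
    src.seek(old_len, 1)  # skip the old header payload
    (trailer,) = MARKER.unpack(src.read(MARKER.size))
    if trailer != old_len:
        raise ValueError("record markers disagree; framing assumption is wrong")
    marker = MARKER.pack(len(new_payload))
    dst.write(marker + new_payload + marker)
    shutil.copyfileobj(src, dst)  # remainder copied in blocks, not parsed
```

Both `src` and `dst` are binary file objects, for example `open("old.dat", "rb")` and `open("new.dat", "wb")`. The body of the file is never interpreted, so its internal structure does not matter.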

Optimum directory structure for large number of files to display on a page

I currently have a single directory called "files" which contains 200,000 photos from about 100,000 members. When the number of members increases to millions, I would expect the number of files in the "files" directory to get very large. The names of the files are all random because the users named them. The only way I can organize them is by the name of the user who created them. In essence, each user will have their own sub-directory.
The server I am running is on Linux with the ext3 file system. I am wondering whether I should split up the files into sub-directories inside the "files" directory. Is there any benefit to splitting the files into many sub-directories? I saw some arguments that it doesn't matter.
If I do need to split, I am thinking of creating directories based on the first two characters of the user ID, then a third-level sub-directory with the user ID like this:
files/0/0/00024userid/ (so all user ids started with 00 will go in files/0/0/...)
files/0/1/01auser/
files/0/2/0242myuserid/
.
files/0/a/0auser/
files/0/b/0bsomeuser/
files/0/c/0comeuser/
.
files/0/z/0zero/
files/1/0/10293832/
files/1/1/11029user/
.
files/9/z/9zl34/
files/a/0/a023user2/
..
files/z/z/zztopuser/
I will be showing 50 photos at a time. What is the most efficient (fastest) way for the server to pick up the files for static display? All from the same directory or from 50 different sub-directories? Any comments or thoughts are appreciated. Thanks.
Depending on the file system, there might be an upper limit to how many files a directory can hold. This, and the performance impact of storing many files in one directory is also discussed at some length in another question.
Also keep in mind that your file names will likely not be truly random - quite a lot might start with "DSC", "IMG" and the like. In a similar vein, the different users (or, indeed, the same user) might try storing two images with the same name, necessitating a level of abstraction from the file name anyway.
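As a rough sketch of both points, the layout proposed in the question and a stored name that does not depend on what the user called the file could look like this. The helper names are hypothetical, user IDs are assumed to contain only characters that are safe in a path, and naming files by a content hash is just one possible abstraction, not something the question prescribes.

```python
import hashlib
from pathlib import PurePosixPath


def shard_dir(user_id, root="files"):
    """files/<first char>/<second char>/<user id>, as in the question."""
    if len(user_id) < 2:
        raise ValueError("user id needs at least two characters")
    return PurePosixPath(root) / user_id[0] / user_id[1] / user_id


def stored_name(original_name, content):
    """Name derived from the bytes, so two 'IMG_0001.jpg' uploads cannot clash."""
    suffix = PurePosixPath(original_name).suffix.lower()
    return hashlib.sha256(content).hexdigest()[:16] + suffix


print(shard_dir("00024userid"))  # files/0/0/00024userid
```

The original file name can then be kept in a database column for display, while the file system only ever sees the generated name.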

How to modify a gzip compressed file

I have a single gzip-compressed file (100GB uncompressed, 40GB compressed). Now I would like to modify some bytes or ranges of bytes, and I do NOT want to change the file's size.
For example:
Bytes 8 + 10 and Bytes 5000 - 40000
Is this possible without recompressing the whole file?
Stefan
Whether you want to change the file sizes makes no difference (since the resulting gzip isn't laid out according to the original file sizes anyway), but if you split the compressed file into parts so that the parts you want to modify are in isolated chunks, and use a multiple-file compression method instead of the single-file gzip method, you could update just the changed files without decompressing and compressing the entire file.
In your example:
bytes1-7.bin         \
bytes8-10.bin         \  bytes.zip
bytes11-4999.bin      /
bytes5000-40000.bin  /
Then you could update bytes8-10.bin and bytes5000-40000.bin but not the other two. But whether this will take less time is dubious.
In a word, no. It would be necessary to replace one or more deflate blocks with new blocks with exactly the same total number of bits, but with different contents. If the new data is less compressible with deflate, this becomes impossible. Even if it is more compressible, it would require a lot of bit twiddling by hand to try to get the bits to match. And it still might not be possible.
The man page for gzip says "If you wish to create a single archive file with multiple members so that members can later be extracted independently, use an archiver such as tar or zip." I believe that means that gzip compression continues through the files, therefore is context-sensitive, and therefore will not permit what you want.
Either decompress/patch/recompress, or switch to a different representation of your data (perhaps an uncompressed tar or zip of individually compressed files, so you only have to decompress/recompress the one you want to change.) The latter will not store your data as compactly, in general, but that's the tradeoff you have to make.
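To illustrate the second option, here is a small in-memory sketch in which the data is compressed in independent chunks, so a same-length patch only recompresses the chunks it touches. The chunk size and function names are arbitrary choices for the example, and a real 100GB file would need the chunks stored on disk with an index rather than in a list.

```python
import zlib


def compress_chunks(data, chunk_size):
    """Compress each fixed-size slice of data independently."""
    return [zlib.compress(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def patch(chunks, chunk_size, offset, new_bytes):
    """Overwrite bytes in place (same length); return indexes of recompressed chunks."""
    if not new_bytes:
        return []
    end = offset + len(new_bytes)
    touched = []
    for idx in range(offset // chunk_size, (end - 1) // chunk_size + 1):
        plain = bytearray(zlib.decompress(chunks[idx]))
        start = idx * chunk_size
        lo = max(offset, start)
        hi = min(end, start + len(plain))
        plain[lo - start:hi - start] = new_bytes[lo - offset:hi - offset]
        chunks[idx] = zlib.compress(bytes(plain))
        touched.append(idx)
    return touched
```

The price is the one mentioned above: independently compressed chunks cannot share context, so the total is usually larger than a single gzip stream.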

Splitting long input into multiple text files

I have some code which will generate an infinite number of lines of output, so I can't store those values in a single output file.
Instead, I split the output into several files according to the index numbers. My problem is that I don't know in advance how many lines the output will have. So is it possible to split the output into different files without specifying an index? For example:
first 100,000 lines in m.txt
from 100,001 to next 200,000 in n.txt
If you don't need to be able to find a particular line based on the file name, you can split the output based on the file size. Write lines to m1.txt until the next line will make it >1MB; then move to the next file - m2.txt.
split(1) appears to be exactly the tool for your job.
Generate files with a running index. Start by opening e.g. m_000001.txt. Write a fixed number of lines to that file. Close the file. Open the next file, e.g. m_000002.txt, and continue.
Making sure that you don't overflow the disk is a housekeeping task to be done separately. Here one can think of backups, compression, file rotation and so on.
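The running-index approach might look like this in Python. It is a sketch only: `write_rotating` is a made-up name, the `m_` prefix and six-digit index mirror the example names above, and it is written as a generator so that it also works when the input never ends.

```python
import itertools
from pathlib import Path


def write_rotating(lines, directory, lines_per_file, prefix="m_"):
    """Write lines to prefix000001.txt, prefix000002.txt, ...; yield each finished path."""
    source = iter(lines)
    for index in itertools.count(1):
        batch = list(itertools.islice(source, lines_per_file))
        if not batch:
            return  # input exhausted (never happens for an infinite source)
        path = Path(directory) / f"{prefix}{index:06d}.txt"
        with path.open("w") as handle:
            handle.writelines(line + "\n" for line in batch)
        yield path
```

A caller loops over it, e.g. `for path in write_rotating(generate(), "out", 100_000): ...`, where `generate()` stands in for whatever produces the lines. Only one batch is held in memory at a time, and the loop body is a natural place to hook in the housekeeping (compression, deletion of old files).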
You may want to use logrotate for this purpose. It has a lot of options: check out the man page.
Here's the introduction of the man page:
"logrotate is designed to ease administration of systems that generate
large numbers of log files. It allows automatic rotation, compression,
removal, and mailing of log files. Each log file may be handled daily,
weekly, monthly, or when it grows too large."
There are 4 ways to split while writing:
A) Fixed number of characters (size)
B) Fixed number of lines
C) Fixed interval of time between writes
D) Fixed number of calls to a function before calling a write
Based on the kind of split, you can name the output files.
