How to better merge .gz files? - linux

I want to merge two files ending with .gz. I have tried two approaches, among others. In the first, I concatenated the compressed files directly using cat; in the second, I decompressed each file with gunzip, concatenated the decompressed files, and then compressed the result again with gzip. Interestingly, the resulting files differ in size. Could anyone explain why?
Thank you in advance!

If your question is which is better, then concatenating is faster, but recompressing will give you better compression. So it depends on how you define "better".
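For what it's worth, here is a minimal sketch of both approaches in Python (the file names a.gz and b.gz are hypothetical). Direct concatenation works because the gzip format allows multiple members in one file, while recompressing everything as a single stream lets the compressor exploit redundancy across both inputs, which is why the resulting sizes differ.

    import gzip
    import shutil

    # Option 1: concatenate the compressed members directly.
    # gunzip will decompress merged_concat.gz into the combined contents.
    with open("merged_concat.gz", "wb") as out:
        for name in ("a.gz", "b.gz"):
            with open(name, "rb") as f:
                shutil.copyfileobj(f, out)

    # Option 2: decompress both inputs and recompress them as one stream.
    # A single deflate stream can usually compress the combined data better,
    # at the cost of extra CPU time.
    with gzip.open("merged_recompressed.gz", "wb") as out:
        for name in ("a.gz", "b.gz"):
            with gzip.open(name, "rb") as f:
                shutil.copyfileobj(f, out)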

Related

How can I combine many files into a single file without compression, keeping the same behavior across platforms?

I have a folder that includes many subfolders and files. I want to combine all of those files into one single large file, and that file should be expandable back into the original folder and files.
Another requirement is that the method should produce exactly the same output (the single large file) across different platforms (Node.js, Android, iOS). I've tried the ZIP utility's store mode; it does produce one file combining all input files without compressing them, which is good. However, when I try it with Node.js and with the Windows 7Zip software (ZIP format, store mode), the outputs are not exactly the same: the two large files differ slightly in size and, of course, have different md5 hashes. Although both can be expanded back into identical files, the single file doesn't meet my requirement.
Another option I tried is the tar file format. Node.js and 7Zip produce different outputs there as well.
Is there something I'm missing with ZIP store mode or tar, e.g. a specific version or a customized ZIP utility?
Or could you suggest another method to accomplish this?
I need a method of combining files that behaves exactly the same way across the Node.js, Android, and iOS platforms.
Thank you.
The problem is your requirement. You should only require that the files and directory structure be exactly reconstructed after extraction, not that the archive itself be exactly the same. Instead of running your MD5 on the archive, run it on the reconstructed files.
There is no way to assure the same zip result using different compressors, or different versions of the same compressor, or the same version of the same code with different settings. If you do not have complete control of the code creating and compressing the data, e.g., by virtue of having written it yourself and assuring portability across platforms, then you cannot guarantee that the archive files will be the same.
More importantly, there is no need to have that guarantee. If you want to assure the integrity of the transfer, check the result of extraction, not the intermediate archive file. Then your check is even better than checking the archive, since you are then also verifying that there were no bugs in the construction and extraction processes.
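As a rough sketch of that idea (the directory name and the helper are hypothetical), you can hash the extracted tree in a deterministic order instead of hashing the archive:

    import hashlib
    import os

    def tree_digest(root):
        """Digest of all files under root, independent of which tool or
        settings produced the archive they were extracted from."""
        h = hashlib.md5()
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()                      # deterministic traversal order
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                rel = os.path.relpath(path, root).replace(os.sep, "/")
                h.update(rel.encode("utf-8"))    # include the relative path
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 16), b""):
                        h.update(chunk)
        return h.hexdigest()

    print(tree_digest("extracted/"))

Two platforms then agree whenever extraction reproduced the same files, no matter how the intermediate archive was built.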

Can antlr4 be used to parse very large gzip compressed files?

I am trying to parse a very large gzip-compressed (10+ GB) file in python3. Instead of creating the parse tree, I used embedded actions, based on the suggestions in this answer.
However, looking at the FileStream code it wants to read the entire file and then parse it. This will not work for big files.
So, this is a two part question.
Can ANTLR4 use a file stream, probably custom, that allows it to read chunks of the file at a time? What should the class interface look like?
Assuming the answer to the above is "yes", would that class need to handle seek operations? That would be a problem if the underlying file is gzip compressed.
Short answer: no, not possible.
Long(er) answer: ANTLR4 can potentially use unlimited lookahead, so it relies on the stream being able to seek to any position with no delay; otherwise parsing speed drops to nearly a halt. For that reason, all runtimes use a normal file stream that reads in the entire file at once.
There were discussions/attempts in the past to create a stream that buffers only part of the input, but I haven't heard of anything that actually works.
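For illustration, this is roughly what the standard approach looks like with the Python runtime. MyLexer, MyParser, and startRule are hypothetical generated names; the point is simply that the entire decompressed input has to sit in memory as one InputStream:

    import gzip
    from antlr4 import InputStream, CommonTokenStream
    from MyLexer import MyLexer      # hypothetical ANTLR-generated classes
    from MyParser import MyParser

    with gzip.open("huge_input.txt.gz", "rt", encoding="utf-8") as f:
        data = f.read()              # a 10+ GB input must fit in RAM here

    lexer = MyLexer(InputStream(data))   # ANTLR wants random access to all of it
    parser = MyParser(CommonTokenStream(lexer))
    tree = parser.startRule()            # hypothetical start rule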

How to check compression type without decompressing?

I wrote code in nodejs to decompress different file types (like tar, tar.gz, etc.).
I do not have the filename available to me.
Currently I use brute force to decompress: the first one that succeeds wins.
I want to improve this by knowing the compression type beforehand.
Is there a way to do this?
Your "brute force" approach would actually work very well, since each decompressor would determine incredibly quickly, usually within the first few bytes, that it had been handed the wrong thing. Except, of course, for the one that will work.
You can see this answer for a list of prefix bytes for common formats. You would also need to detect the tar format inside a compressed stream, which is not covered there. Even if you find a matching prefix, you still need to proceed with decompressing and decoding to test the hypothesis, which is essentially your brute-force method.
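As a rough sketch, prefix sniffing might look like the following; the byte signatures are the well-known ones from the public format specifications, but double-check them against the formats you actually need to support. Note that a tar.gz still sniffs as gzip, so you only find the tar after decompressing, exactly as described above.

    def sniff(path):
        with open(path, "rb") as f:
            head = f.read(262)                    # 257 + 5 covers the tar magic
        if head.startswith(b"\x1f\x8b"):
            return "gzip"
        if head.startswith(b"BZh"):
            return "bzip2"
        if head.startswith(b"\xfd7zXZ\x00"):
            return "xz"
        if head.startswith(b"PK\x03\x04"):
            return "zip"
        if len(head) >= 262 and head[257:262] == b"ustar":
            return "tar"                          # uncompressed tar
        return "unknown"

    print(sniff("mystery.bin"))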

Is it possible to fix PNG files corrupted by ASCII conversion?

I've accidentally downloaded PNG images as ASCII files. The original files are already deleted so I have now only the downloaded files. Is it possible to fix PNG files corrupted by ASCII conversion?
It depends. What kind of conversion was done? (\r\n -> \n? Or the reverse?) If the image is really small, there is some probability of successful recovery by blindly doing the reverse conversion; see e.g. fixgz. Otherwise you would have to try all the alternatives, which can be a lot. The fact that PNG is structured in fixed-length chunks can help, but it would take some work.
Generally there are too many permutations; for example, if 3974 bytes have been altered, it could take up to 2^3974 attempts to reconstruct the image. It's much better to look for a similar image online and do a fuzzy comparison.
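If you do want to try the blind-reverse route, a sketch like the one below (assuming the transfer collapsed \r\n to \n, with a hypothetical file name) applies the reverse conversion and then uses the PNG signature and chunk CRCs to tell you whether it happened to work:

    import struct
    import zlib

    PNG_SIG = b"\x89PNG\r\n\x1a\n"

    def check_png(data):
        """True if the signature and every chunk CRC are intact."""
        if not data.startswith(PNG_SIG):
            return False
        pos = len(PNG_SIG)
        while pos + 12 <= len(data):
            (length,) = struct.unpack(">I", data[pos:pos + 4])
            if pos + 12 + length > len(data):
                return False
            chunk = data[pos + 4:pos + 8 + length]          # type + payload
            (crc,) = struct.unpack(">I", data[pos + 8 + length:pos + 12 + length])
            if zlib.crc32(chunk) != crc:
                return False
            if chunk[:4] == b"IEND":
                return True
            pos += 12 + length
        return False

    with open("corrupted.png", "rb") as f:
        corrupted = f.read()

    # The 8-byte signature lost exactly one byte (its \r\n became \n), so
    # restore it explicitly, then blindly expand \n back to \r\n in the rest.
    repaired = PNG_SIG + corrupted[7:].replace(b"\n", b"\r\n")
    print("recovered" if check_png(repaired) else "not recoverable this way")

The attempt only succeeds if the original image happened to contain no bare \n bytes outside the signature, which is exactly why the chance of recovery drops quickly as the file grows.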

Using a hashset to break up a .txt file

I am trying to write a simple plagiarism program that takes one file and compares it to other files by splitting up each file every six words and comparing those pieces to the other files, which are split up the same way. I was reading up on hashsets and figured I might try to split the files up with hashsets, but I have no idea how. Any advice would be appreciated.
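One way to approach it, sketched in Python (the file names are hypothetical, and the sketch uses overlapping six-word sequences; non-overlapping chunks work the same way with a step of six), is to build a set of six-word sequences per file and intersect the sets. Java's HashSet supports the same pattern via retainAll.

    def shingles(text, n=6):
        """All overlapping n-word sequences in the text."""
        words = text.split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def overlap(path_a, path_b):
        with open(path_a, encoding="utf-8") as fa, open(path_b, encoding="utf-8") as fb:
            a, b = shingles(fa.read()), shingles(fb.read())
        common = a & b                 # set intersection does the comparison
        return len(common) / max(1, min(len(a), len(b)))

    print(f"{overlap('essay1.txt', 'essay2.txt'):.1%} of six-word phrases shared")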
