md5 different after decompressing the file and compressing it again - linux

Host: Ubuntu 14.04
Command md5sum
File size before/after decompressing: 77.8 M / 323.9 M
I downloaded the file from the official Ubuntu website.
The downloaded file is device.tar.xz.
Before decompressing, I used md5sum to generate the checksum of the compressed file.
Then I decompressed the file without modifying any of its contents, and re-compressed it ( device2.tar.xz ).
Comparing the two md5 sums, they are different. I suspect my decompression may have changed something.
Is there any way to ensure that the content is exactly the same after re-compressing?
Thanks

You're hashing two different compressed representations of the same uncompressed data.
The xz file format includes some metadata, which you can see with xz -l foo.xz. So even if you used the same version of the same compression program with the same settings, you could get output files that aren't byte-for-byte identical.
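If what you want to verify is the data rather than the compressed bytes, hash the uncompressed stream on both sides instead: xz -dc device.tar.xz | md5sum and xz -dc device2.tar.xz | md5sum should print the same hash if the content survived the round trip unchanged.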

Related

Is it possible to remove characters from a compressed file without extracting it?

I have a compressed file that's about 200 MB, in the form of a tar.gz file. I understand that I can extract the xml files in it. It contains several small files and one 5 GB xml file. I'm trying to remove certain characters from the xml files.
So my very basic question is: is it even possible to accomplish this without ever extracting the content of the compressed file?
I'm trying to speed up the process of reading through xml files looking for characters to remove.
You will have to decompress, change, and then recompress the files. There's no way around that.
However, this does not necessarily mean writing the file to storage. You might be able to make the changes you want in a streaming fashion, i.e. everything is done in memory without ever having the complete decompressed file anywhere. Unix uses pipes for such tasks.
Here is an example on how to do it:
Create two example files:
echo "hello world" > a
echo "hello world" > b
Create a compressed archive containing both:
tar -c -z -f x.tgz a b
Pipe the contents of the uncompressed archive through a changer. Unfortunately I haven't found a shell-based way to do this, but you also tagged Python, and with its tarfile module you can achieve it:
Here is the file tar.py:
#!/usr/bin/env python3
import sys
import tarfile

# open both archives in streaming mode ('r|gz' / 'w|gz'),
# since stdin and stdout are not seekable
tar_in = tarfile.open(fileobj=sys.stdin.buffer, mode='r|gz')
tar_out = tarfile.open(fileobj=sys.stdout.buffer, mode='w|gz')
for tar_info in tar_in:
    reader = tar_in.extractfile(tar_info)
    if tar_info.path == 'a':  # my example file names are "a" and "b"
        # now comes the code which makes our change:
        # we just skip the first two bytes of this file:
        reader.read(2)       # skip two bytes
        tar_info.size -= 2   # reduce the size in the info object as well
    # add the (maybe changed) file to the output:
    tar_out.addfile(tar_info, reader)
tar_out.close()
tar_in.close()
This can be called like this:
./tar.py < x.tgz > y.tgz
y.tgz will contain both files again, but in file a the first two bytes will be skipped (so its content will be llo world).
You will have noticed that you need to know the resulting size of your change beforehand. tar is designed to handle files, and so it needs to write the size of each entry into the tar header block which precedes every entry in the resulting archive, so I see no way around this. With compressed output it also isn't possible to seek back after writing everything and adjust the size field.
But as you phrased your question, this might be possible in your case.
All you have to do is provide a file-like object (it could be a Popen object's output stream) like reader in my simple example.
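If your change is a character filter whose output size isn't known until the data has been read, one workaround is to read each entry fully into memory, apply the change, and set the size afterwards. A minimal sketch of that variant, replacing the skip-two-bytes branch above (the replace() call is a placeholder for whatever edit you need, and the 5 GB entry would need that much RAM):
import io
# read the whole entry, filter it, and recompute the size:
data = reader.read().replace(b'\r', b'')  # placeholder change
tar_info.size = len(data)
tar_out.addfile(tar_info, io.BytesIO(data))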

Reading Gzip File Footer with CURL or WGET

I have a gzip file on a web server. I want to download the file only if there is enough disk space to decompress it. Is it possible to know the decompressed size before downloading the file?
The decompressed size is encoded in the footer of the gzip file[1]. We can extract the decompressed size with the following command
gzip -l
But that requires the file to be downloaded first. I want to avoid downloading the file if I can learn the decompressed size beforehand.
You can hack your way there with the HTTP Range header, but it takes a couple of HTTP requests and your server needs to accept the Range header.
Send a first request with the HEAD method to figure out the total file size (Content-Length).
Send a second request with a Range header to get the last 4 bytes of the file. Decode these bytes to get the uncompressed size (the gzip ISIZE field).
If you have enough space available on disk (compressed size + uncompressed size), download the full file.
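A minimal sketch of those steps in Python (the URL is hypothetical; note that the gzip footer stores the uncompressed size modulo 2^32, so it is only reliable for files under 4 GiB):
import struct
import urllib.request

url = "https://example.com/file.gz"  # hypothetical URL
# step 1: HEAD request to learn the compressed size
head = urllib.request.Request(url, method="HEAD")
compressed_size = int(urllib.request.urlopen(head).headers["Content-Length"])
# step 2: fetch only the last 4 bytes, the little-endian ISIZE field (RFC 1952)
tail = urllib.request.Request(url, headers={"Range": "bytes=-4"})
uncompressed_size = struct.unpack("<I", urllib.request.urlopen(tail).read())[0]
print(compressed_size, uncompressed_size)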

Parse bytes of a zip file?

I am requesting a zip file from an API and I'm trying to retrieve it by byte range (setting a Range header) and then parse each of the parts individually. After reading a bit about gzip and zip compression, I'm having a hard time figuring out:
Can I parse a portion out of a zip file?
I know that gzip files usually compresses a single file so you can decompress and parse it in parts, but what about zip files?
I am using node-js and have tried several libraries like adm-zip or zlib, but it doesn't look like they allow this.
Zip files have a catalog (the central directory) at the end of the file, in addition to the same basic information before each item, which lists the file names and the location of each item in the zip file. Generally each item is compressed using deflate, which is the same algorithm that gzip uses (but gzip has a custom header before the deflate stream).
So yes, it's entirely feasible to extract the compressed byte stream for one item in a zip file, and prepend a fabricated gzip header (the minimum size of this header is 10 bytes) to allow you to decompress just that file by passing it to gunzip.
If you want to write code to inflate the deflated stream yourself, I recommend you make a different plan. I've done it, and it's really not fun. Use zlib if you must do it, don't try to reimplement the decompression.
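If you'd rather not fabricate a gzip header at all, zlib can inflate the raw deflate stream directly when told there is no header (wbits = -15). A minimal sketch in Python for illustration (file and member names are hypothetical; the same byte-level steps apply in node-js):
import struct
import zipfile
import zlib

def inflate_member(path, name):
    # use the central directory (via zipfile) to locate the member, then
    # skip its local file header to reach the raw deflate bytes
    info = zipfile.ZipFile(path).getinfo(name)
    with open(path, "rb") as f:
        f.seek(info.header_offset)
        fixed = f.read(30)  # fixed-size part of the local file header
        name_len, extra_len = struct.unpack("<HH", fixed[26:30])
        f.seek(info.header_offset + 30 + name_len + extra_len)
        raw = f.read(info.compress_size)
    return zlib.decompress(raw, -15)  # -15 = raw deflate, no header/trailer

# data = inflate_member("archive.zip", "member.xml")  # hypothetical names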

Tclsh convert base64 dump into zip file

I have written Tclsh code that fetches a zip file's content in base64 format through an XML-RPC method. I am dumping that base64 data into a file using the following snippet:
#!/usr/bin/tclsh
...
set mybase64Dump [myXmlRpcCallToReturnThisDump]
set zipFilePtr [open "xyz.zip" "w"]
puts $zipFilePtr $mybase64Dump
close $zipFilePtr
The zip file gets generated (X KB in size), but when I try to open it with 7zip it says "Is not archive". Yet when I copy-pasted the same base64 dump into an online converter, it gave me a proper extractable zip file.
Is there something I am doing wrong?
You probably need to configure the output file to be binary, not ASCII. The default translation for a newly opened file is "auto", which does system-specific translation of the end-of-line characters; that is not what you want for a .zip file. Configure this with fconfigure on the handle after opening it, or by adding the BINARY access flag to the open command.
See http://www.tcl.tk/man/tcl8.5/TclCmd/open.htm and http://www.tcl.tk/man/tcl8.5/TclCmd/fconfigure.htm for details on the syntax.
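In this case that means calling fconfigure $zipFilePtr -translation binary right after the open (or opening with open "xyz.zip" "wb"). Note also that puts appends a newline by default, so use puts -nonewline when writing binary data; and if your XML-RPC layer hands you the base64 text rather than decoded bytes, you will also need to decode it first (with binary decode base64 in Tcl 8.6, or the Tcllib base64 package in 8.5).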

How to detect if a file is encoded using mp3PRO?

I have a folder which contains a lot of MP3 files, some of them encoded using mp3PRO.
Since this format is now obsolete, I'd like to convert them back to MP3 (converters can be found easily).
Is there a way to detect programmatically if a file is encoded using the mp3PRO format (e.g. by looking at the file header or specific signatures with a hex editor)?
The official player is able to detect if a file is encoded using mp3PRO (its logo is highlighted or not), so I suppose this is technically possible.
What I found so far is that the bitrate of an mp3PRO file appears to be pretty low (50% of a non-encoded file): e.g. a 128 kbps file will appear as 64 kbps. However, a 320 kbps file will appear as 160 kbps (which is pretty common), so this cannot be used as a rule.
Here is what I found out and how I fixed it; I'm writing it here in case somebody needs it:
mp3PRO files do not contain any special flag in the mp3 header that would help to recognize them.
They are technically very similar to usual mp3 files, except they are encoded at half the bit rate and sample rate (e.g. a 128 kbps 44100 Hz file will be encoded as a 64 kbps 22050 Hz file, resulting in an mp3PRO file roughly half the size of the original).
This was done for compatibility, so standard players can play them without any change.
They also contain some SBR data, which allows the lost audio (the high frequencies) to be rebuilt synthetically so the file can be played as it was before the mp3PRO conversion.
Detecting the SBR data seems very hard if not impossible: it would require decoding the actual mp3 frames, and there is no documentation to be found about the mp3PRO format.
What I did (which works but requires some manual effort): I added all files to be checked to the playlist of an mp3 player (foobar2000 in my case), then sorted the files on the sample rate column: most 22050 Hz mp3 files were indeed mp3PRO files.
They were converted back to mp3 using Winamp + the mp3PRO plugin made for it, available here: http://www.wav-mp3.com/mp3pro-to-mp3.htm
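If you'd rather script that screen than sort a playlist, here is a minimal sketch using the third-party mutagen library (an assumption; any MP3 header reader works) to list files whose sample rate is 22050 Hz. It applies the same heuristic, so it flags candidates rather than proving mp3PRO:
import sys
from pathlib import Path
from mutagen.mp3 import MP3

# flag files whose sample rate is 22050 Hz; in this collection those
# were usually the mp3PRO encodes
for path in Path(sys.argv[1]).rglob("*.mp3"):
    info = MP3(path).info
    if info.sample_rate == 22050:
        print(path, info.bitrate // 1000, "kbps")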
