md5sum relationship between split files and combined large file [duplicate] - linux

I have a situation where I have one VERY large file that I'm using the linux "split" command to break into smaller parts. Later I use the linux "cat" command to bring the parts all back together again.
In the interim, however, I'm curious...
If I get an MD5 fingerprint of the large file before splitting it, and later get the MD5 fingerprints of all the independent parts produced by the split command, is there a way to take those independent fingerprints and somehow deduce that the sum or combination (or whatever you like to call it) of the parts is equal to the fingerprint of the single large file?
By (very) loose example...
bigoldfile.txt MD5 = 737da789
smallfile1.txt MD5 = 23489a89
smallfile2.txt MD5 = 1238g89d
smallfile3.txt MD5 = 01234cd7
someoperator(23489a89,1238g89d,01234cd7) = 737da789 (the fingerprint of the original file)

You likely can't do that - the MD5 of each part starts from a fixed initial state, while the MD5 of the combined file depends on the internal state carried across the part boundaries (plus the final length padding), so there is no operator that turns the parts' digests into the whole file's digest.
You could instead generate "incremental" hashes - hash of first part, hash of first plus second part, etc.
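A minimal sketch of that incremental idea with Python's hashlib, using the (hypothetical) part names from the example above: one MD5 object is fed the parts in order, and its digest after each part fingerprints everything read so far; after the last part it equals the md5sum of the original file.

import hashlib

# Hypothetical part names; the order must match the original byte order.
parts = ["smallfile1.txt", "smallfile2.txt", "smallfile3.txt"]

md5 = hashlib.md5()
for part in parts:
    with open(part, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            md5.update(block)
    # "Incremental" hash: digest of part 1, then parts 1+2, then 1+2+3, ...
    print(part, md5.hexdigest())
# The final value equals the MD5 of bigoldfile.txt.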

Not exactly but the next best thing would be to do this:
cat filepart1 filepart2 | md5sum
or
cat filepart* | md5sum
Be sure to cat them back together in the correct order.
By piping the output of cat you don't have to worry about creating a combined file that is too large.

Related

Comparing a big file on two servers

I have two servers and I want to move a backup tar.bz file (50 GB) from one to the other.
I used AXEL to download the file from the source server. But now when I want to extract it, it gives me an "unexpected EOF" error. The sizes are the same, so it seems there is a problem in the content.
I want to know if there is a program/app/script that can compare these two files and correct only the damaged parts, or do I need to split the file by hand and compare each part's hash?
The problem is that the source server has limited bandwidth and a low transfer speed, so I can't transfer it again from scratch.
You can use a checksum utility, such as md5 or sha, to see if the files are the same on either end. e.g.
$ md5 somefile
MD5 (somefile) = d41d8cd98f00b204e9800998ecf8427e
By running such a command on both ends and comparing the results, you can get some certainty as to whether the files are the same.
As for only downloading the erroneous portion of a file, this would require checksums on both sides for "pieces" of the data, such as with the bittorrent protocol.
Ok, I found that "rdiff" is the best way to solve this problem. Just do the following:
On the destination server:
rdiff signature destFile.tar.bz destFile.sig
Then transfer destFile.sig to the source server and run rdiff there:
rdiff delta destFile.sig srcFile.tar.bz delta.rdiff
Then transfer delta.rdiff back to the destination server and run rdiff once more:
rdiff patch destFile.tar.bz delta.rdiff fixedFile.tar.bz
This process really doesn't need a separate program; you can do it with a couple of simple commands. Split the file and take an md5sum of each piece on both sides; if any of the md5sums don't match, copy over the mismatched piece(s) and concatenate the pieces back together. To make comparing the md5sums easier, run diff on the two lists of sums (or md5sum the two lists to see whether there is any difference at all, without having to copy one list over).
split -b 1000000000 -d bigfile bigfile.
for i in bigfile.*
do
    md5sum "$i"
done
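If splitting the big file just to hash the pieces is undesirable, a rough alternative (a sketch, not a drop-in tool) is to hash fixed-size chunks in a single pass; run the same script on both servers and diff the output. The chunk size below mirrors the split -b value above, and "bigfile" is a placeholder name.

import hashlib

CHUNK = 1_000_000_000   # match `split -b 1000000000`
BLOCK = 1 << 20         # read 1 MiB at a time to keep memory use small

with open("bigfile", "rb") as f:
    index = 0
    while True:
        md5 = hashlib.md5()
        remaining = CHUNK
        while remaining:
            data = f.read(min(BLOCK, remaining))
            if not data:
                break
            md5.update(data)
            remaining -= len(data)
        if remaining == CHUNK:          # nothing left to read
            break
        print(f"bigfile.{index:02d}  {md5.hexdigest()}")
        index += 1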

How can the extension of a PCR value be replicated with e.g. sha1sum?

This is somewhat related to the post in:
Perform OR on two hash outputs of sha1sum
I have a sample set of TPM measurements, e.g. the following:
10 1ca03ef9cca98b0a04e5b01dabe1ff825ff0280a ima 0ea26e75253dc2fda7e4210980537d035e2fb9f8 boot_aggregate
10 7f36b991f8ae94141753bcb2cf78936476d82f1d ima d0eee5a3d35f0a6912b5c6e51d00a360e859a668 /init
10 8bc0209c604fd4d3b54b6089eac786a4e0cb1fbf ima cc57839b8e5c4c58612daaf6fff48abd4bac1bd7 /init
10 d30b96ced261df085c800968fe34abe5fa0e3f4d ima 1712b5017baec2d24c8165dfc1b98168cdf6aa25 ld-linux-x86-64.so.2
According to the TPM spec, also referred to in the above post, the PCR extend operation is PCR := SHA1(PCR || data), i.e. "concatenate the old value of PCR with the data, hash the concatenated string and store the hash in PCR". Also, the spec and multiple papers and presentations I have found mention that data is a hash of the software to be loaded.
However, when I do an operation like echo H(PCR)||H(data) | sha1sum, I do not obtain the correct resulting value. I.e., when calculating (using the above hashes) echo 1ca03ef9cca98b0a04e5b01dabe1ff825ff0280a0ea26e75253dc2fda7e4210980537d035e2fb9f8 | sha1sum, the resulting value is NOT 7f36b991f8ae94141753bcb2cf78936476d82f1d.
Is my understanding of the TPM_Extend operation correct? If so, why is the resulting hash different from the one in the sample measurement file?
Thanks!
To answer your very first question: your understanding of the extend operation is more or less correct. But you have two problems:
You are misinterpreting the things you have copied in here
You can't calculate hashes the way you are doing it on the shell
The log output you provided here is from Linux's IMA. According to the
documentation, the first hash is the template-hash, defined as
template-hash: SHA1(filedata-hash | filename-hint)
filedata-hash: SHA1(filedata)
So for the first line: SHA1(0ea26e75253dc2fda7e4210980537d035e2fb9f8 | "boot_aggregate")
results in 1ca03ef9cca98b0a04e5b01dabe1ff825ff0280a.
Note that the filename-hint is 256 bytes long - it is 0-padded at the end.
(thumbs up for digging this out of the kernel source ;))
So to make it clear: there are no PCR values in your log.
I wrote something in Ruby to verify my findings:
require 'digest/sha1'
filedata_hash = ["0ea26e75253dc2fda7e4210980537d035e2fb9f8"].pack('H*')
filename_hint = "boot_aggregate".ljust(256, "\x00")
puts Digest::SHA1.hexdigest(filedata_hash + filename_hint)
Now to your commands:
The way you are using it here, you are interpreting the hashes as ASCII-strings.
Also note that echo will add an additional new line character to the output.
The character sequence 1ca03ef9cca98b0a04e5b01dabe1ff825ff0280a is the hexadecimal
encoding of 160 bits of binary data - a SHA1 hash value. So basically you are right:
you have to concatenate the two values and calculate the SHA1 of the resulting
320 bits of data.
So the correct command for the command line would be something like
printf "\x1c\xa0\x3e\xf9\xcc\xa9\x8b\x0a\x04\xe5\xb0\x1d\xab\xe1\xff\x82\x5f\xf0\x28\x0a\x0e\xa2\x6e\x75\x25\x3d\xc2\xfd\xa7\xe4\x21\x09\x80\x53\x7d\x03\x5e\x2f\xb9\xf8" | sha1sum
The \xXX in the printf string will convert the hex code XX into one byte of
binary output.
This will result in the output of d14f958b2804cc930f2f5226494bd60ee5174cfa,
and that's fine.
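The same computation sketched in Python, for comparison: decode each hex string to its 20 raw bytes, concatenate, and hash the 40-byte result.

import hashlib

# Same operation as the printf | sha1sum command above.
template_hash = bytes.fromhex("1ca03ef9cca98b0a04e5b01dabe1ff825ff0280a")
filedata_hash = bytes.fromhex("0ea26e75253dc2fda7e4210980537d035e2fb9f8")
print(hashlib.sha1(template_hash + filedata_hash).hexdigest())

# A real PCR extend follows the same pattern - new_pcr = SHA1(old_pcr || data) -
# always over the raw bytes, never over their hex text representation.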

Using sed on a compressed file

I have written a file-processing program and now it needs to read from a zipped file (.gz; the unzipped file may get as large as 2 TB).
Is there a sed equivalent for zipped files, like zcat is to cat, or else what would be the best approach to do the following efficiently?
ONE=`zcat filename.gz| sed -n $counts`
$counts : counter to read (line by line)
The above method works, but it is quite slow for large files, as I need to read each line and perform the matching on certain fields.
Thanks
EDIT
Though not directly helpful, here is a set of z-commands:
http://www.cyberciti.biz/tips/decompress-and-expand-text-files.html
Well, you can either have more speed (i.e. use uncompressed files) or more free space (i.e. use compressed files and the pipe you showed)... sorry. Using compressed files will always have an overhead.
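If the slowness comes from re-running zcat | sed once per line, one alternative (a sketch under assumptions: comma-separated fields, made-up field index and match value) is a single pass over the decompressed stream, e.g. with Python's gzip module:

import gzip

# Stream filename.gz line by line; the decompressed text never touches the disk.
with gzip.open("filename.gz", "rt") as f:
    for lineno, line in enumerate(f, start=1):
        fields = line.rstrip("\n").split(",")
        # Placeholder condition: match on the third comma-separated field.
        if len(fields) > 2 and fields[2] == "some-value":
            print(lineno, line, end="")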
If you understand the internal structure of the compression format it is possible that you could write a pattern matcher that can operate on compressed data without fully decompressing it, but instead by simply determining from the compressed data if the pattern would be present in a given piece of decompressed data.
If the pattern has any complexity at all this sounds like quite a complicated project as you'd have to handle cases where the pattern could be satisfied by the combination of output from two (or more) separate pieces of decompression.

How can I compare two zip format(.tar,.gz,.Z) files in Unix

I have two gz files. I want to compare those files without extracting them. For example:
first file is number.txt.gz - inside that file:
1111,589,3698,
2222,598,4589,
3333,478,2695,
4444,258,3694,
second file - xxx.txt.gz:
1111,589,3698,
2222,598,4589,
I want to compare any column between those files. If column 1 in the first file is equal to column 1 of the second file, I want output like this:
1111,589,3698,
2222,598,4589,
You can't do this.
You can compare whole archives by comparing the archive files, but not part of the data inside the compressed files.
You can also compare selected files in an archive without unpacking, because the archive's metadata includes a CRC32 checksum for the content; comparing those sums tells you whether the files match without unpacking them.
If you need to check and compare your data after it's written to those huge files, and you have time and space constraints preventing you from doing this, then you're using the wrong storage format. If your data storage format doesn't support your process then that's what you need to change.
My suggestion would be to throw your data into a database rather than writing it to compressed files. With sensible keys, comparison of subsets of that data can be accomplished with a simple query, and deleting no longer needed data becomes similarly simple.
Transactionality and strict SQL compliance are probably not priorities here, so I'd go with MySQL (with the MyISAM driver) as a simple, fast DB.
EDIT: Alternatively, Blorgbeard's suggestion is perfectly reasonable and feasible. In any programming language that has access to (de)compression libraries, you can read your way sequentially through the compressed file without writing the expanded text to disk; and if you do this side-by-side for two input files, you can implement your comparison with no space problem at all.
As for the time problem, you will find that reading and uncompressing the file (but not writing it to disk) is much faster than writing to disk. I recently wrote a similar program that takes a .ZIPped file as input and creates a .ZIPped file as output without ever writing uncompressed data to file; and it runs much more quickly than an earlier version that unpacked, processed and re-packed the data.
You cannot compare the files directly while they remain compressed, particularly if they were compressed using different techniques.
You must first decompress the files, and then find the difference between the results.
Decompression can be done with gunzip, tar, and uncompress (or zcat).
Finding the difference can be done with the diff command.
I'm not 100% sure whether it's meant to match columns/fields or entire rows, but in the case of rows, something along these lines should work:
comm -12 <(zcat number.txt.gz) <(zcat xxx.txt.gz)
or if the shell doesn't support that, perhaps:
zcat number.txt.gz | { zcat xxx.txt.gz | comm -12 /dev/fd/3 - ; } 3<&0
The exact answer I want is this:
nawk -F"," 'NR==FNR {a[$1];next} ($3 in a)' <(gzcat file1.txt.gz) <(gzcat file2.txt.gz)
Instead of awk, nawk works perfectly, and since it's a gzip file, use gzcat.
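For the column-1 match described in the question, here is a rough Python equivalent of that awk pattern (file names taken from the question), reading both files without extracting them to disk:

import gzip

# Build a set of first-column keys from the second file (like NR==FNR {a[$1]}),
# then print the lines of the first file whose first column is in that set.
with gzip.open("xxx.txt.gz", "rt") as f:
    keys = {line.split(",")[0] for line in f}

with gzip.open("number.txt.gz", "rt") as f:
    for line in f:
        if line.split(",")[0] in keys:
            print(line, end="")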

File containing its own checksum

Is it possible to create a file that will contain its own checksum (MD5, SHA1, whatever)? And to upset jokers, I mean the checksum written out in plain form, not a function calculating it.
I created a piece of code in C, then ran a brute-force search for less than 2 minutes and got this wonder:
The CRC32 of this string is 4A1C449B
Note that there must be no characters (end of line, etc.) after the sentence.
You can check it here:
http://www.crc-online.com.ar/index.php?d=The+CRC32+of+this+string+is+4A1C449B&en=Calcular+CRC32
This one is also fun:
I killed 56e9dee4 cows and all I got was...
Source code (sorry it's a little messy) here: http://www.latinsud.com/pub/crc32/
Yes. It's possible, and it's common with simple checksums. Getting a file to include its own md5sum would be quite challenging.
In the most basic case, create a checksum value which will cause the summed modulus to equal zero. The check then becomes something like
(n1 + n2 ... + CRC) % 256 == 0
The checksum then becomes a part of the file, and is checked itself. A very common example of this is the Luhn algorithm used in credit card numbers: the last digit is a check digit, and is itself part of the 16-digit number.
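A toy illustration of that summed-modulus idea (not a real CRC, just a one-byte additive check chosen so the total is 0 mod 256):

def add_check_byte(data: bytes) -> bytes:
    # Append the byte that makes the sum of all bytes, including the check
    # byte itself, congruent to 0 modulo 256.
    return data + bytes([(-sum(data)) % 256])

def is_valid(blob: bytes) -> bool:
    return sum(blob) % 256 == 0

blob = add_check_byte(b"some file contents")
print(is_valid(blob))   # True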
Check this:
echo -e '#!/bin/bash\necho My cksum is 918329835' > magic
"I wish my crc32 was 802892ef..."
Well, I thought this was interesting, so today I coded a little Java program to find collisions. Thought I'd leave it here in case someone finds it useful:
import java.util.zip.CRC32;

public class Crc32_recurse2 {

    public static void main(String[] args) throws InterruptedException {
        long endval = Long.parseLong("ffffffff", 16);
        long startval = 0L;
        // startval = Long.parseLong("802892ef",16); //uncomment to save yourself some time

        float percent = 0;
        long time = System.currentTimeMillis();
        long updates = 10000000L; // how often to print some status info

        for (long i = startval; i < endval; i++) {
            String testval = Long.toHexString(i);
            String cmpval = getCRC("I wish my crc32 was " + testval + "...");
            if (testval.equals(cmpval)) {
                System.out.println("Match found!!! Message is:");
                System.out.println("I wish my crc32 was " + testval + "...");
                System.out.println("crc32 of message is " + testval);
                System.exit(0);
            }

            if (i % updates == 0) {
                if (i == 0) {
                    continue; // kludge to avoid divide by zero at the start
                }
                long timetaken = System.currentTimeMillis() - time;
                long speed = updates / timetaken * 1000;
                percent = (i * 100.0f) / endval;
                long timeleft = (endval - i) / speed; // in seconds
                System.out.println(percent + "% through - " + "done " + i / 1000000 + "M so far"
                        + " - " + speed + " tested per second - " + timeleft
                        + "s till the last value.");
                time = System.currentTimeMillis();
            }
        }
    }

    public static String getCRC(String input) {
        CRC32 crc = new CRC32();
        crc.update(input.getBytes());
        return Long.toHexString(crc.getValue());
    }
}
The output:
49.825756% through - done 2140M so far - 1731000 tested per second - 1244s till the last value.
50.05859% through - done 2150M so far - 1770000 tested per second - 1211s till the last value.
Match found!!! Message is:
I wish my crc32 was 802892ef...
crc32 of message is 802892ef
Note the dots at the end of the message are actually part of the message.
On my i5-2500 it was going to take ~40 minutes to search the whole crc32 space from 00000000 to ffffffff, doing about 1.8 million tests/second. It was maxing out one core.
I'm fairly new with java so any constructive comments on my code would be appreciated.
"My crc32 was c8cb204, and all I got was this lousy T-Shirt!"
Certainly, it is possible. But one of the uses of checksums is to detect tampering of a file - how would you know if a file has been modified, if the modifier can also replace the checksum?
Sure, you could concatenate the digest of the file itself to the end of the file. To check it, you would calculate the digest of all but the last part, then compare it to the value in the last part. Of course, without some form of encryption, anyone can recalculate the digest and replace it.
edit
I should add that this is not so unusual. One technique is to concatenate a CRC-32 so that the CRC-32 of the whole file (including that digest) is zero. This won't work with digests based on cryptographic hashes, though.
I don't know if I understand your question correctly, but you could make the first 16 bytes of the file the checksum of the rest of the file.
So before writing a file, you calculate the hash, write the hash value first and then write the file contents.
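A small sketch of that layout (file name and helper names are made up), storing the raw 16-byte MD5 digest as a header in front of the payload:

import hashlib

def write_with_checksum(path: str, payload: bytes) -> None:
    # First 16 bytes: raw MD5 digest of the payload; the rest is the payload.
    with open(path, "wb") as f:
        f.write(hashlib.md5(payload).digest() + payload)

def verify(path: str) -> bool:
    with open(path, "rb") as f:
        stored, payload = f.read(16), f.read()
    return hashlib.md5(payload).digest() == stored

write_with_checksum("example.bin", b"file contents go here")
print(verify("example.bin"))   # True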
There is a neat implementation of the Luhn Mod N algorithm in the python-stdnum library ( see luhn.py). The calc_check_digit function will calculate a digit or character which, when appended to the file (expressed as a string) will create a valid Luhn Mod N string. As noted in many answers above, this gives a sanity check on the validity of the file, but no significant security against tampering. The receiver will need to know what alphabet is being used to define Luhn mod N validity.
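For reference, a from-scratch sketch of the Luhn mod N check-digit calculation (python-stdnum wraps the same idea; the decimal alphabet below is just a default assumption):

def calc_check_digit(payload: str, alphabet: str = "0123456789") -> str:
    # Return the character that makes payload + check pass the Luhn mod N test.
    n = len(alphabet)
    total = 0
    # Walk the payload right to left; the character nearest the future check
    # digit is doubled, then every second one after that.
    for i, ch in enumerate(reversed(payload)):
        d = alphabet.index(ch)
        if i % 2 == 0:
            d *= 2
            if d >= n:
                d = d - n + 1   # same as summing the two base-n "digits"
        total += d
    return alphabet[(-total) % n]

def is_valid(s: str, alphabet: str = "0123456789") -> bool:
    return s[-1:] == calc_check_digit(s[:-1], alphabet)

print(calc_check_digit("7992739871"))   # classic Luhn example: prints 3
print(is_valid("79927398713"))          # True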
If the question is asking whether a file can contain its own checksum (in addition to other content), the answer is trivially yes for fixed-size checksums, because a file could contain all possible checksum values.
If the question is whether a file could consist of its own checksum (and nothing else), it's trivial to construct a checksum algorithm that would make such a file impossible: for an n-byte checksum, take the binary representation of the first n bytes of the file and add 1. Since it's also trivial to construct a checksum that always encodes itself (i.e. do the above without adding 1), clearly there are some checksums that can encode themselves, and some that cannot. It would probably be quite difficult to tell which of these a standard checksum is.
There are many ways to embed information in order to detect transmission errors etc. CRC checksums are good at detecting runs of consecutive bit-flips and might be added in such a way that the checksum is always e.g. 0. These kinds of checksums (including error-correcting codes) are, however, easy to recreate and don't stop malicious tampering.
It is impossible to embed something in the message so that the receiver can verify its authenticity if the receiver knows nothing else about/from the sender. The receiver could, for instance, share a secret key with the sender. The sender can then append an encrypted checksum (which needs to be cryptographically secure, such as md5/sha1). It is also possible to use asymmetric encryption, where the sender publishes his public key and signs the md5 checksum/hash with his private key. The hash and the signature can then be tagged onto the data as a new kind of checksum. This is done all the time on the internet nowadays.
The remaining problems then are 1. how can the receiver be sure that he got the right public key, and 2. how secure is all this stuff in reality? The answer to 1 might vary: on the internet it's common to have the public key signed by someone everyone trusts; another simple solution is that the receiver got the public key from a meeting in person... The answer to 2 might change from day to day, but what's costly to break today will probably be cheap to break some time in the future. By then, new algorithms and/or larger key sizes will hopefully have emerged.
You can, of course, but in that case the SHA digest of the whole file will not be the SHA you included, because it is a cryptographic hash function and changing a single bit in the file changes the whole hash. What you are looking for is a checksum calculated from the content of the file in a way that matches a set of criteria.
Sure.
The simplest way would be to run the file through an MD5 algorithm and embed that data within the file. You can split up the checksum and place it at known points of the file (based on portions of the file size, e.g. 30%, 50%, 75%) if you wish to try to hide it.
Similarly you could encrypt the file, or encrypt a portion of the file (along with the MD5 checksum) and embed that in the file.
Edit
I forgot to say that you would need to remove the checksum data before checking the file against it.
Of course if your file needs to be readily readable by another program e.g. Word then things become a little more complicated as you don't want to "corrupt" the file so that it is no longer readable.
