How does the client divide the file?

The way I understand the torrent file format, it contains a pieces field, which holds a list of each piece's SHA-1 hash. But does it specify how large each piece should be and at which bytes the divisions should occur? How does the client know how to divide the original file?

You are looking for the "piece length" key in the Info dictionary. Every piece is of equal length except for the final piece, which holds whatever remains. The number of pieces is thus ceil(total length / piece length).
https://wiki.theory.org/BitTorrentSpecification#Info_Dictionary
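As a quick illustration of that arithmetic, here is a minimal Python sketch (the numbers are made up; the field names length, piece length, and pieces come from the Info dictionary in the spec above):

    import math

    # Values as they would appear in a decoded info dictionary (made-up numbers).
    total_length = 1_000_000      # "length": total file size in bytes
    piece_length = 262_144        # "piece length": bytes per piece (except the last)

    num_pieces = math.ceil(total_length / piece_length)

    # Piece i covers the byte range [i * piece_length, min((i + 1) * piece_length, total_length)).
    for i in range(num_pieces):
        start = i * piece_length
        end = min(start + piece_length, total_length)
        print(f"piece {i}: bytes [{start}, {end})")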

Related

How to get pieces value from .torrent file

I am trying to build a .torrent file interpreter. The problem is that I can't seem to understand how to interpret the pieces value. I am aware that the pieces key contains a concatenation of the SHA-1 hashes of the pieces, and that a SHA-1 hash is 20 bytes, so the value's total length should be a multiple of 20 bytes. However, whether I count the bytes of the pieces value as a string or in hexadecimal form, it does not come out to a multiple of 20. How should I interpret the pieces key?
Use bencode/bdecode and the pieces value is easy to extract. I think you first need to read the BEP for more details; beyond that, look at an existing implementation and use it as an example.
From looking at a real torrent file, I found that the SHA-1 hashes had to be read from their hexadecimal string form. I previously thought that was wrong because the byte length of the hash was not a multiple of 20; it turns out I forgot to add a leading 0 to hex values that were only one character (e.g. a had to be changed to 0a).
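For anyone hitting the same wall, a short Python sketch of the correct interpretation: treat pieces as raw bytes and cut it into 20-byte hashes; Python's bytes.hex() zero-pads every byte, which avoids exactly the mistake described above (the function name is mine):

    import hashlib

    def split_piece_hashes(pieces):
        # Split the raw 'pieces' value into consecutive 20-byte SHA-1 hashes.
        assert len(pieces) % 20 == 0, "pieces must be a multiple of 20 bytes"
        return [pieces[i:i + 20] for i in range(0, len(pieces), 20)]

    # Hash one piece of made-up data; bytes.hex() zero-pads each byte,
    # so 0x0a comes out as "0a", never a bare "a".
    digest = hashlib.sha1(b"example piece data").digest()   # 20 raw bytes
    print(digest.hex())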

The .torrent file contains gibberish characters

Can someone explain the gibberish characters at the end of every .torrent file?
The picture shows the understandable information along with only part of the gibberish section. The comprehensible part just seems to end abruptly at the pink line I painted.
By the way, I am viewing it in Vim with UTF-8 encoding, which torrent files should be encoded with if I am not mistaken.
The data you are referring to is the value for the dictionary entry with a key of pieces. The 6:pieces129140: before your marked position indicates that the entry's key has a length of 6 characters, which allows us to determine that the key is pieces. The 129140 which follows the key is the length of the entry's value, in bytes. This data structure is a result of bencoding.
The pieces dictionary entry in the .torrent file contains the SHA1 hashes for all pieces concatenated into one long string. Hashes are important as they allow the user to ensure the pieces they have downloaded are valid. Using hashes for individual pieces is better than only having the hash for the whole file, as it reduces the wasted data; you don't have to download the whole file before your client realises that the data is invalid.
SHA1 hashes consist of 20 bytes, which are stored as raw bytes in the .torrent file. This is why the data appears malformed in your editor.
pieces maps to a string whose length is a multiple of 20. It is to be subdivided into strings of length 20, each of which is the SHA1 hash of the piece at the corresponding index.
Taken from this BitTorrent protocol specification document.
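For illustration, here is a minimal bdecoder sketch in Python, just enough to show how the length-prefixed form you marked (6:pieces, then 129140:<raw bytes>) is parsed. The d6:pieces20:...e input is a made-up 20-byte example, and a real parser would need error handling:

    def bdecode(data, i=0):
        # Minimal illustrative bdecoder; returns (value, next_index).
        c = data[i:i + 1]
        if c == b"i":                         # integer: i<digits>e
            end = data.index(b"e", i)
            return int(data[i + 1:end]), end + 1
        if c == b"l":                         # list: l<items>e
            i, items = i + 1, []
            while data[i:i + 1] != b"e":
                item, i = bdecode(data, i)
                items.append(item)
            return items, i + 1
        if c == b"d":                         # dictionary: d<key><value>...e
            i, d = i + 1, {}
            while data[i:i + 1] != b"e":
                key, i = bdecode(data, i)
                d[key], i = bdecode(data, i)
            return d, i + 1
        colon = data.index(b":", i)           # byte string: <length>:<raw bytes>
        length, start = int(data[i:colon]), colon + 1
        return data[start:start + length], start + length

    value, _ = bdecode(b"d6:pieces20:AAAAAAAAAAAAAAAAAAAAe")
    print(value)   # {b'pieces': b'AAAAAAAAAAAAAAAAAAAA'} -- raw bytes, not text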

SHA-1 Digest Reduction

I'm using QR Code barcodes to store UUIDs in my system and I need to check that the barcodes generated are mine and not someone else's. I also need to keep the encoded data short so that the QR Codes remain in the lower version range and remain easy to scan.
My approach is to take the UUID's raw value (a 128-bit number) plus a 16-bit checksum, and then Base64-encode that data before converting it to a QR code. So far so good, this works perfectly.
To generate the checksum I take the string version of the UUID, concatenate it with a long secret string, and compute a SHA-1 hash of the result. But this hash is too long, so I XOR all the odd bytes together to produce half the checksum, and likewise with the even bytes to produce the other half.
What worries me is that I have compromised the SHA-1 system needlessly by XORing it down. Would it be better to just take two unmanipulated bytes from somewhere within the result? I accept that a 16-bit checksum won't be as secure as a 160-bit checksum, but that is a price I have to pay for usability with the barcodes. What I really don't want to find is that I've now provided a checksum that is easy to crack as the UUID is transmitted in the clear.
If there is a better way of generating the checksum, that would also be a suitable answer to the question.
There's no reason to do any XORing. Simply taking the first two bytes will be as (in)secure.
To keep the QR code version as small as possible, you might want to convert the 144-bit value to a decimal string and encode that. QR codes have different character sets and encode digits efficiently. Base64 characters can only be stored as 8-bit values in a QR code, so you add about 30% overhead right there.
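Here's a hedged sketch of the "just truncate" suggestion in Python. I've used HMAC-SHA1 (the standard construction for keyed hashing) in place of the ad-hoc string concatenation from the question; the secret and all names are placeholders:

    import hashlib
    import hmac
    import uuid

    SECRET = b"long secret string"        # placeholder; keep this server-side

    def checksum16(u):
        # 16-bit tag: HMAC-SHA1 over the raw UUID bytes, truncated to 2 bytes.
        # Truncation just keeps the first bytes; XOR-folding the rest of the
        # digest would add no security.
        return hmac.new(SECRET, u.bytes, hashlib.sha1).digest()[:2]

    u = uuid.uuid4()
    payload = u.bytes + checksum16(u)     # 16 + 2 bytes = 144 bits
    # For QR numeric mode, encode the 144-bit value as a decimal string:
    print(int.from_bytes(payload, "big"))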

how can there be same md5 value for two different length strings

I have an MD5 function which I have confirmed to work well for both files and strings. But when I use it on variable-sized chunks of very large files, it generates MD5 values that are the same even though the chunks have different sizes.
I wonder whether there is a probability that two chunks with different lengths, but perhaps the same content, produce the same MD5 fingerprint.
The odds of that happening are 1/(2^128), since MD5 is a 128-bit hash. That is 1/(3.4 x 10^38), so it's very unlikely but not impossible.
It's more probable, I think, that you're doing something wrong and you are actually calculating the MD5 of the same text/file every time.
You have essentially no chance of getting the same MD5 hash by accident; collisions have to be deliberately constructed.
See here for more information about collisions: http://www.mscs.dal.ca/~selinger/md5collision/
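For what it's worth, a minimal Python sketch of a per-chunk MD5 done correctly; the function name and parameters are mine, and the comment notes the usual way this bug creeps in:

    import hashlib

    def md5_of_chunk(path, offset, size):
        # Hash exactly the bytes that were read. A common bug is hashing a
        # fixed-size reusable buffer: if the buffer isn't fully refilled (or
        # isn't refilled at all), different chunks hash identically.
        with open(path, "rb") as f:
            f.seek(offset)
            data = f.read(size)           # may be shorter than size at EOF
        return hashlib.md5(data).hexdigest()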

How safely can I assume unicity of a part of SHA1 hash?

I'm currently using a SHA1 to somewhat shorten an url:
Digest::SHA1.hexdigest("salt-" + url)
How safe is it to use only the first 8 characters of the SHA1 as a unique identifier, like GitHub does for commits apparently?
To calculate the probability of a collision with a given length and the number of hashes that you have, see the birthday problem. I don't know the number of hashes that you are going to have, but here are some examples. 8 hexadecimal characters is 32 bits, so for about 100 hashes the probability of a collision is about 1/1,000,000, for 10,000 hashes it's about 1/100, for 100,000 it's 3/4 etc.
See the table in the Birthday attack article on Wikipedia to find a good hash length that would satisfy your needs. For example if you want the collision to be less likely than 1/1,000,000,000 for a set of more than 100,000 hashes then use 64 bits, or 16 hexadecimal digits.
It all depends on how many hashes are you going to have and what probability of a collision are you willing to accept (because there is always some probability, even if insanely small).
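Those numbers come from the standard birthday-bound approximation p ~= 1 - exp(-n(n-1)/(2N)), where N is the size of the hash space; a quick Python check (the function name is mine):

    import math

    def collision_probability(n, bits):
        # Birthday approximation: P(collision) ~= 1 - exp(-n(n-1) / 2**(bits+1))
        return 1.0 - math.exp(-n * (n - 1) / (2.0 * 2.0 ** bits))

    for n in (100, 10_000, 100_000):
        print(f"{n:>7} hashes at 32 bits: p ~ {collision_probability(n, 32):.6f}")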
If you're talking about a SHA-1 in hexadecimal, then you're only getting 4 bits per character, so 8 characters give a total of 32 bits. By the birthday paradox, you can expect collisions once you have roughly the square root of that many values, i.e. around 2^16 = 65,536 entries. If your URL shortener gets used much, it probably won't take terribly long before you start to see collisions.
As for alternatives, the most obvious is probably to just maintain a counter. Since you need to store a table of URLs to translate your shortened URL back to the original, you basically just store each new URL in your table. If it was already present, you give its existing number. Otherwise, you insert it and give it a new number. Either way, you give that number to the user.
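A toy sketch of that counter scheme in Python (in-memory only; the class and method names are mine, and a real service would persist the table in a database):

    class UrlShortener:
        def __init__(self):
            self.by_url = {}      # url -> id
            self.by_id = []       # id  -> url

        def shorten(self, url):
            # Reuse the existing number if present; otherwise assign the next one.
            if url not in self.by_url:
                self.by_url[url] = len(self.by_id)
                self.by_id.append(url)
            return self.by_url[url]

        def expand(self, short_id):
            return self.by_id[short_id]

    s = UrlShortener()
    print(s.shorten("http://example.com/a"))   # 0
    print(s.shorten("http://example.com/a"))   # 0 again: same URL, same number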
It depends on what you are trying to accomplish. The output of SHA1 is effectively random with regards to the input (the output of a good hash function changes in half of its bits based on a one-bit change in the input, and SHA1, while not perfect, is pretty good), and by taking a 32-bit (assuming 8 hex digits) subset of the 160-bit output, you reduce the output space from 2^160 to 2^32 values. All things being equal, which they never are, this would significantly reduce the difficulty of finding a collision.
However, if the hash function's input must be a valid URL, that significantly reduces the number of possible inputs. @rsp points out the birthday problem, but given this restriction, I'm not sure exactly how applicable it is, at least in its simple form. It also largely assumes that no other precautions are in place.
I would be more interested in why you are doing this. Is this about URLs that the user will need to remember and type? If so, tacking on a bunch of random hexadecimal digits is probably a bad idea. Is it a URL or URL parameter that will just be passed around programmatically? Then, I wouldn't care much about length. Either way, there are probably better ways to do what you are trying to accomplish.
If you use a binary output for SHA1 and Base64 encode the result, you will get much higher information density per character; you can have the same 8-character names, but rather than only 16^8 (2^32) possibilities, you'll have 64^8 (2^48) possibilities.
Using the approximation that the number of inputs needed for a 50% collision probability scales as 1.177*sqrt(N), a Base64-style encoding will tolerate 256 times more inputs than the hex output before reaching a 50% chance of collision (sqrt(2^48)/sqrt(2^32) = 2^8 = 256).
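To make the density comparison concrete, here's a short Python sketch contrasting 8 hex characters with 8 Base64 characters of the same digest (the salt and URL are made up, and I've picked the URL-safe Base64 alphabet as an assumption):

    import base64
    import hashlib

    digest = hashlib.sha1(b"salt-" + b"http://example.com/").digest()

    hex_id = digest.hex()[:8]                                # 8 chars -> 32 bits
    b64_id = base64.urlsafe_b64encode(digest).decode()[:8]   # 8 chars -> 48 bits
    print(hex_id, b64_id)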
