I am trying to build a .torrent file interpreter. The problem is that I can't work out how to interpret the pieces value. I am aware that the pieces key contains a concatenation of the SHA-1 hashes for each piece and that a SHA-1 hash is 20 bytes long. As a result, the length of the value should be a multiple of 20 bytes. However, when I count the bytes of the pieces value, either as a string or in hexadecimal form, the count is not a multiple of 20. How should I interpret the pieces key?
You can use bencode and bdecode here; with them the pieces value is easy to obtain. I think you should first read the BEP for more details. You can also look at this and use it as an example.
From looking at a real torrent file, I found that the SHA-1 hashes had to be taken from its hexadecimal string format, but I previously thought that this was wrong because the byte length of the hash was not a multiple of 20. It turns out I forgot to add a leading 0 to hexadecimal values that were only 1 character long (e.g. a had to be changed to 0a).
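The padding problem described here can be illustrated with a minimal Python sketch. The 20-byte digest below is made up for the example, and to_hex is a hypothetical helper name, not part of any torrent library. The point is only that each byte must be formatted as exactly two hex digits, so a 20-byte hash always gives 40 hex characters.

```python
def to_hex(data: bytes) -> str:
    # "%02x" pads every byte to two hex digits, so 0x0a becomes "0a", not "a".
    return "".join("%02x" % b for b in data)

# Hypothetical 20-byte SHA-1 digest containing small byte values.
digest = bytes([0x0A, 0xFF, 0x00]) + bytes(17)

hex_text = to_hex(digest)      # starts with "0aff00"
unpadded = "".join("%x" % b for b in digest)  # the buggy variant: too short

print(len(hex_text))   # 40
print(len(unpadded))   # fewer than 40, which breaks the "multiple of 20 bytes" check
```

In Python, bytes.hex() already produces the padded form, so to_hex(digest) and digest.hex() agree.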
Related
The way I understand the torrent file format is that it contains a field pieces, which specifies a hash list of each piece's SHA-1 hash. But, does it specify how large each piece should be and at which byte the division should occur? How does the client know how to divide the original file?
Thanks
You are looking for the "piece length" in the Info dictionary. Every piece is of equal length except for the final piece, which is irregular. The number of pieces is thus determined by 'ceil( total length / piece size )'.
https://wiki.theory.org/BitTorrentSpecification#Info_Dictionary
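As a rough sketch of the arithmetic in this answer, the following Python computes the number of pieces and the size of the final piece. The function names and the sample sizes are made up for illustration, and the sketch assumes a single-file torrent with a total length greater than zero.

```python
def piece_count(total_length: int, piece_length: int) -> int:
    # Integer form of ceil(total_length / piece_length), avoiding float rounding.
    return (total_length + piece_length - 1) // piece_length

def last_piece_length(total_length: int, piece_length: int) -> int:
    # Every piece is piece_length bytes except possibly the last one.
    full_pieces = piece_count(total_length, piece_length) - 1
    return total_length - full_pieces * piece_length

# Hypothetical torrent: a 1,000,000-byte file with 256 KiB pieces.
count = piece_count(1_000_000, 262_144)        # 4
tail = last_piece_length(1_000_000, 262_144)   # 213568
```

Piece i then covers bytes i * piece_length up to, but not including, (i + 1) * piece_length, with the last piece cut off at the end of the data.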
Can someone explain the gibberish characters at the end of every .torrent file?
The picture shows the understandable information along with only a part of the gibberish section. It just seems like the comprehensible part ends so abruptly at the pink pipeline I painted.
By the way, I am viewing it in VIM with UTF-8 encoding, which torrent files should be encoded with if I am not mistaken.
The data you are referring to is the value for the dictionary entry with a key of pieces. The 6:pieces129140: before your marked position indicates that the entry's key has a length of 6 characters, which allows us to determine that the key is pieces. The 129140 which follows the key is the length of the entry's value, in bytes. This data structure is a result of bencoding.
The pieces dictionary entry in the .torrent file contains the SHA1 hashes for all pieces concatenated into one long string. Hashes are important as they allow the user to ensure the pieces they have downloaded are valid. Using hashes for individual pieces is better than only having the hash for the whole file, as it reduces the wasted data; you don't have to download the whole file before your client realises that the data is invalid.
SHA1 hashes consist of 20 bytes, which are stored as raw bytes in the .torrent file. This is why the data appears malformed in your editor.
pieces maps to a string whose length is a multiple of 20. It is to be subdivided into strings of length 20, each of which is the SHA1 hash of the piece at the corresponding index.
Taken from this BitTorrent protocol specification document.
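To make the layout concrete, here is a minimal Python sketch that reads a bencoded byte string of the form length:data and then cuts the pieces value into 20-byte hashes. The helper names and the sample data are invented for the example, and the sketch handles only bencoded strings, not integers, lists or dictionaries. For the file in the question, a 129140-byte value would give 129140 / 20 = 6457 hashes.

```python
def read_bencoded_string(data: bytes, pos: int = 0):
    """Read one 'length:data' byte string starting at pos.

    Returns the raw bytes and the position just past them.
    """
    colon = data.index(b":", pos)
    length = int(data[pos:colon])      # decimal length prefix, e.g. b"6"
    start = colon + 1
    return data[start:start + length], start + length

def split_pieces(pieces: bytes):
    """Split the raw pieces value into 20-byte SHA-1 digests."""
    if len(pieces) % 20 != 0:
        raise ValueError("pieces length is not a multiple of 20")
    return [pieces[i:i + 20] for i in range(0, len(pieces), 20)]

# Made-up fragment: the key "pieces" followed by two fake 20-byte hashes.
sample = b"6:pieces40:" + bytes(range(40))

key, pos = read_bencoded_string(sample)
value, pos = read_bencoded_string(sample, pos)
hashes = split_pieces(value)

print(key)             # b'pieces'
print(len(hashes))     # 2
print(hashes[0].hex()) # hex is only for display; the file stores raw bytes
```

The file has to be opened in binary mode for this to work; decoding it as UTF-8 text first would corrupt the raw hash bytes.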
I'm searching for the most appropriate encoding or method to compress bytes into characters that can be read with a ReadLine-like command that only recognizes readable characters and terminates on an end-of-line character. There is probably a common practice for this, but I don't know a lot about encoding.
Currently, I'm outputting bytes as a string of hex, so I need 2 bytes to represent 1 byte. It works well, but it is slow. Ex: a byte with the value 255 is represented as 'FF'.
I'm sure it could be 3 or 4 times smaller, though there's a limit since I'm outputting MP3 data, but I don't know how. Should I just ZIP my string, or would there be too much overhead?
Will ASCII85 output contain random null bytes and end-of-line characters, or am I safe with it?
Don't zip mp3 files, that will not gain much (or anything at all).
I'm a bit disappointed that you did not read up on Ascii85 before asking, as I think the Wikipedia article explains fairly clearly that it uses only printable ASCII characters; so, no line endings or null bytes. It is efficient, and the conversion is also fairly simple and quick: split your data into 4-byte ints; you convert each of these to just five Ascii85 digits by repeatedly dividing the int value by 85 and taking the ASCII character whose code is the remainder + 33. The remainders come out least significant digit first, so the five characters are written in reverse order.
You can also consider using Base64 or UUEncode. These are fairly popular (e.g. used in email attachments), so you will find many libraries that implement them. But they are less efficient.
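The conversion described in this answer can be sketched in Python as follows. The function a85_group is a hypothetical helper for one 4-byte group only; it assumes big-endian byte order and does not handle padding of a short final group or the 'z' shortcut that many Ascii85 encoders, including Python's base64.a85encode, use for four zero bytes.

```python
import base64

def a85_group(chunk: bytes) -> str:
    """Encode exactly 4 bytes as 5 Ascii85 characters."""
    if len(chunk) != 4:
        raise ValueError("expected a 4-byte group")
    value = int.from_bytes(chunk, "big")
    digits = []
    for _ in range(5):
        value, remainder = divmod(value, 85)
        digits.append(chr(remainder + 33))   # printable range '!' .. 'u'
    return "".join(reversed(digits))         # most significant digit first

group = a85_group(b"Man ")                   # "9jqo^"

# Size comparison on 100 sample bytes (no run of four zero bytes).
data = bytes(range(1, 101))
hex_len = len(data.hex())                    # 200
b64_len = len(base64.b64encode(data))        # 136
a85_len = len(base64.a85encode(data))        # 125
manual = "".join(a85_group(data[i:i + 4]) for i in range(0, len(data), 4))
```

So hex doubles the size, Base64 adds about a third, and Ascii85 adds a quarter. Note that the Ascii85 alphabet includes characters such as quotes and backslashes, which may matter depending on where the line is embedded.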
I'm using QR Code barcodes to store UUIDs in my system and I need to check that the barcodes generated are mine and not someone else's. I also need to keep the encoded data short so that the QR Codes remain in the lower version range and remain easy to scan.
My approach is to take the UUID raw value (a 128-bit number) and a 16-bit checksum and then Base64 encode that data before converting it to a QR code. So far so good, this works perfectly.
To generate the checksum I take the string version of the UUID and combine it with a long secret string to produce a SHA-1 hash. But this hash is too long, so I XOR all the odd bytes together to produce half the checksum, and likewise with the even bytes to produce the other half.
What worries me is that I have compromised the SHA-1 system needlessly by XORing it down. Would it be better to just take two unmanipulated bytes from somewhere within the result? I accept that a 16-bit checksum won't be as secure as a 160-bit checksum, but that is a price I have to pay for usability with the barcodes. What I really don't want to find is that I've now provided a checksum that is easy to crack as the UUID is transmitted in the clear.
If there is a better way of generating the checksum that would also be a suitable answer to the question. As always many thanks for your time or just reading this, double plus good thanks if you post an answer.
There's no reason to do any XORing. Simply taking the first two bytes will be as (in)secure.
To keep the code version as small as possible, you might want to convert the 144-bit value to a decimal string and encode that. QR Codes have different character sets and encode numbers efficiently. Base64 can only be encoded as 8-bit values in QR codes, so you add 30% right there.
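A minimal Python sketch of both suggestions, under stated assumptions: the secret, the function names and the way the UUID string and secret are concatenated are all made up for illustration, since the question does not give those details. The checksum is simply the first two bytes of the SHA-1 digest, and the 18 bytes (128-bit UUID plus 16-bit checksum) are written as a fixed-width decimal string.

```python
import hashlib
import uuid

SECRET = b"hypothetical-long-secret"   # assumption: stands in for the real secret

def checksum16(u: uuid.UUID, secret: bytes = SECRET) -> bytes:
    # SHA-1 over the UUID string plus the secret, truncated to 2 bytes.
    digest = hashlib.sha1(str(u).encode("ascii") + secret).digest()
    return digest[:2]

def qr_payload(u: uuid.UUID) -> str:
    value = int.from_bytes(u.bytes + checksum16(u), "big")   # 144-bit integer
    return str(value).zfill(44)        # 2**144 has 44 decimal digits

def verify(payload: str) -> bool:
    try:
        raw = int(payload).to_bytes(18, "big")
    except (ValueError, OverflowError):
        return False                   # not a number, or larger than 144 bits
    u = uuid.UUID(bytes=raw[:16])
    return raw[16:] == checksum16(u)

u = uuid.UUID("12345678-1234-5678-1234-567812345678")
payload = qr_payload(u)                # 44 digits, suitable for numeric mode
ok = verify(payload)                   # True
```

For comparison, Base64 of the same 18 bytes is 24 characters, each stored as 8 bits in byte mode. With only 16 bits of checksum, a forged code still passes about once in 65536 tries regardless of how the two bytes are derived.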
I have a sha256 hash of some data that is a product ID for a registration system. I want to give this information to the end user, and I wish it to contain only printable characters (preferably a-z, A-Z and 0-9). I tried regular hex and base64, but they both produce very long results that are not satisfactory. I wish to represent the data in as small a format as possible in alphanumeric characters, but without losing integrity. Note that the data does not need to be converted back, so it can be a one-way process as long as no security is lost.
I am working in C.
Thanks in advance for any help on this!
Kind regards,
Philip Bennefall
32 bytes of data is going to be very difficult to meaningfully provide to a user in a medium that doesn't support cut/paste, however you represent it.
Lessen the amount of data you're using for the product ID and you can use Base-64 and friends.
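To put numbers on this, here is a short sketch of how long a 32-byte hash becomes in a few common encodings, and what happens if the ID is cut down first. It is written in Python rather than C purely for brevity; the input data and the choice of keeping 10 bytes are assumptions for illustration, and truncating the hash reduces its collision resistance accordingly.

```python
import base64
import hashlib

digest = hashlib.sha256(b"example product data").digest()   # 32 bytes

hex_id = digest.hex()                                        # 64 characters
b64_id = base64.b64encode(digest).decode().rstrip("=")       # 43 characters
b32_id = base64.b32encode(digest).decode().rstrip("=")       # 52 characters

# Keeping only the first 10 bytes (80 bits) gives a 16-character ID.
# Base32 uses only A-Z and 2-7, so the result is purely alphanumeric.
short_id = base64.b32encode(digest[:10]).decode()

print(len(hex_id), len(b64_id), len(b32_id), len(short_id))  # 64 43 52 16
```

Standard Base64 also uses '+' and '/', so it does not meet the a-z, A-Z, 0-9 requirement on its own; Base32 does, at the cost of a longer string for the same number of bytes.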
If Base64 isn't adequate for your 32 bytes, MD5 it down to 16 bytes -- shazam, now it's half as long.
Why, yes it is absurd to hash a 32 byte hash down to 16 bytes, but that's basically what you're asking to do, whether it's 16 or any other number of bytes. You WILL lose information.
Or simply use MD5 to begin with, since it's a smaller hash.
If the user isn't going to key this number in, how important is the representation anyway? All of these long hash dumps are inscrutable. When I see them I just look at the last 3 characters anyway.