Content of torrent file - bittorrent

I have a question about the torrent file.
I know that it contains a list of servers (users) that I need to connect to in order to download parts of the whole file.
My question is: is that all a torrent file contains, or are there other important details?
thanks!

A torrent file is a specially formatted binary file. It always contains a list of files and integrity metadata about all the pieces, and optionally contains a list of trackers.
A torrent file is a bencoded dictionary with the following keys:
announce - the URL of the tracker
info - this maps to a dictionary whose keys are dependent on whether one or more files are being shared:
name - suggested file/directory name where the file(s) is/are to be saved
piece length - number of bytes per piece. This is commonly 2^8 KiB = 256 KiB = 262,144 B.
pieces - a hash list. That is, a concatenation of each piece's SHA-1 hash. As SHA-1 returns a 160-bit hash, pieces will be a string whose length is a multiple of 20 bytes (160 bits).
length - size of the file in bytes (only when one file is being shared)
files - a list of dictionaries each corresponding to a file (only when multiple files are being shared). Each dictionary has the following keys:
path - a list of strings corresponding to subdirectory names, the last of which is the actual file name
length - size of the file in bytes.
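If you want to see that structure concretely, here is a minimal sketch that decodes a .torrent file by hand and prints the keys described above. "example.torrent" is a placeholder path, and the tiny decoder only covers the basic bencoding grammar (integers, byte strings, lists, dictionaries):
import hashlib

def bdecode(data, i=0):
    c = data[i:i+1]
    if c == b'i':                          # integer: i<digits>e
        end = data.index(b'e', i)
        return int(data[i+1:end]), end + 1
    if c == b'l':                          # list: l<items>e
        i += 1
        items = []
        while data[i:i+1] != b'e':
            item, i = bdecode(data, i)
            items.append(item)
        return items, i + 1
    if c == b'd':                          # dictionary: d<key><value>...e
        i += 1
        d = {}
        while data[i:i+1] != b'e':
            key, i = bdecode(data, i)
            val, i = bdecode(data, i)
            d[key] = val
        return d, i + 1
    colon = data.index(b':', i)            # byte string: <length>:<bytes>
    length = int(data[i:colon])
    start = colon + 1
    return data[start:start+length], start + length

with open("example.torrent", "rb") as f:
    meta, _ = bdecode(f.read())

info = meta[b'info']
print(meta.get(b'announce'))               # tracker URL, if present
print(info[b'name'], info[b'piece length'])
print(len(info[b'pieces']) // 20, "pieces")  # each SHA-1 hash is 20 bytes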

Related

Speed up sha256 file fingerprinting [duplicate]

We need a file fingerprint for all uploaded files on the server. Currently, sha256 is chosen as the hash function.
For large files, each file is split into several chunks of equal size (except the last one) for transfer. The sha256 value of each file chunk is provided by the client, then re-calculated and checked by the server.
However, those sha256 values cannot be combined into the sha256 value for the whole file.
So I consider changing the definition of file fingerprint:
For files smaller than 1GB, the sha256 value is the fingerprint.
For files larger than 1GB, the file is sliced into 1GB chunks. Each chunk has its own sha256 value, denoted as s0, s1, s2 (all treated as integer values).
When the first chunk is received:
h0 = s0
When the second chunk is received:
h1 = SHA256((h0 << 256) + s1)
This is essentially concatenating the two hash values and hashing them again. The process is repeated until all chunks are received, and the final value hn is used as the file fingerprint.
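For concreteness, here is a small sketch of the scheme just described, working directly on the raw digests so the shift-and-add is simply byte concatenation (chunk_digests stands for the per-chunk digests s0, s1, ... in order):
import hashlib

def chained_fingerprint(chunk_digests):
    # Fold each new chunk hash into a running hash:
    # h_i = SHA256(h_{i-1} || s_i), starting from h0 = s0.
    h = chunk_digests[0]
    for s in chunk_digests[1:]:
        h = hashlib.sha256(h + s).digest()
    return h.hex()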
I have googled a lot and read a few articles on combine_hash functions in various languages and frameworks. Different authors choose different bit-mangling hash functions, and most of them are said to work well.
In my case, however, efficiency is not a concern, but the fingerprint is stored and used as the file content identifier system-wide.
My primary concern is whether the naive method listed above will introduce more collisions than sha256 itself.
If sha256 is not a good choice for combining hash values in our case, are there any recommendations?
You are essentially reinventing the Merkle tree.
What you'll have to do is split your large files into equally sized chunks (sans the last fragment), compute a hash for each of those chunks, and then combine them pairwise until there is a single ultimate hash value. Note that the "root" hash will not be equal to the hash of the original file, but that's not required to validate the integrity of the entire file.
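A minimal sketch of that pairwise combination, assuming chunk_hashes is the ordered list of raw per-chunk SHA-256 digests (duplicating the last hash on odd counts is one common convention, not the only one):
import hashlib

def merkle_root(chunk_hashes):
    level = list(chunk_hashes)
    if not level:
        return hashlib.sha256(b"").digest()
    while len(level) > 1:
        if len(level) % 2:                 # odd count: carry the last hash up
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

# Example with three chunks' digests (placeholders here):
s = [hashlib.sha256(b"chunk%d" % i).digest() for i in range(3)]
print(merkle_root(s).hex())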

Calculate hash value / checksum of complete file (all data inside file) in Pyspark

I am looking to validate the files: checking whether the content of a file is an exact duplicate of another file (with a different name) in the same folder. I have read the files using the below pyspark code:
import os

for file in os.listdir(fileDirectory):
    file_read = spark.read.csv(fileDirectory + '/' + file)
Now, I want to calculate the single-value checksum of whole file. Please advise.
The best route depends on your use case. Typically this could be done with MD5 file hashes or with row-wise checksums.
MD5 file-based hash with Spark
You could hash each file with MD5 and store the result somewhere for comparison. You then read a single file at a time and compare it to the stored results to identify duplicates.
MD5 isn't a splittable algorithm though. It must run sequentially through an entire file. This makes it somewhat less useful with Spark, and means that when opening a large file you can't benefit from Spark's data distribution capabilities.
Instead, to hash a whole file, if you must use Spark, you could use the wholeTextFiles method, and a map function which calculates the MD5 hash.
This reads the entire contents of one or more files into a single record in a partition. If you have several executors but only 1 file, then all executors but one will be idle. Since a record cannot be split across executors in Spark, you risk running out of memory if the largest file's contents don't fit in executor memory.
Anyway here is what it looks like:
import hashlib

rdd = spark.sparkContext.wholeTextFiles(location)

def map_hash_file(row):
    file_name = row[0]
    file_contents = row[1]
    md5_hash = hashlib.md5()
    md5_hash.update(file_contents.encode('utf-8'))
    return file_name, md5_hash.hexdigest()

rdd.map(map_hash_file).collect()
A benefit to this approach is that if you have many files in a folder you can compute the MD5 for each of them in parallel. You must ensure the largest possible file fits into a single record, i.e. into your executor memory.
With this approach you'd read all the files in your folder each time and compute the hashes in parallel, but don't need to store and retrieve the hashes from somewhere as would be the case if you were processing 1 file at a time.
If you only want to detect duplicates in a folder, and don't mind duplicates across folders, then perhaps this approach would work.
If you also want to detect duplicates across folders, you'd need to read in all of those files too, or just store their hashes somewhere if you already processed them.
MD5 file-based hash without Spark
If you want to store the hashes and process a single file at a time, and therefore want to avoid reading in all the files, or if you can't fit the largest file in memory, then you'd need a different approach than Spark.
Since you are using Pyspark, how about using regular Python to read and hash the file? You don't need to read the entire file into memory, you can read it in small chunks and MD5 it serially in that way.
From https://stackoverflow.com/a/1131238/2409299 :
import hashlib

with open("your_filename.txt", "rb") as f:
    file_hash = hashlib.md5()
    while chunk := f.read(8192):
        file_hash.update(chunk)

print(file_hash.digest())
print(file_hash.hexdigest())  # to get a printable str instead of bytes
Then compare the hexdigest with previously stored hashes to identify a duplicate file.
Row-wise checksum
Using the Spark CSV reader means you've already unpacked the file into rows and columns, so you can no longer compute an accurate file hash. You could instead do row-wise hashes: add a column with a hash per row, sort the dataset the same way across all the files, and hash down that column to come up with a deterministic result (a sketch follows the NULL example below). It would probably be easier to go the file-based MD5 route, though.
With Spark SQL and the built-in hash function:
spark.sql("SELECT *, HASH(*) AS row_hash FROM my_table")
Spark's Hash function is not an MD5 algorithm. In my opinion it may not be suitable for this use case. For example, it skips columns that are NULL which can cause hash collisions (false-positive duplicates).
Below hash values are the same:
spark.sql("SELECT HASH(NULL, 'a', 'b'), HASH('a', NULL , 'b')")
+----------------+----------------+
|hash(NULL, a, b)|hash(a, NULL, b)|
+----------------+----------------+
| 190734147| 190734147|
+----------------+----------------+
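Given that caveat, here is a rough sketch of the row-wise route using sha2 with explicit placeholders for NULLs, then reducing the sorted row hashes to one deterministic value per file. The file path, the "<NULL>" placeholder, the separator and the existing spark session are assumptions:
from pyspark.sql import functions as F

file_path = "/path/to/one_file.csv"   # placeholder
df = spark.read.csv(file_path, header=True)

# Per-row sha2 hash with NULLs made explicit so they can't collide silently.
row_hashed = df.select(
    F.sha2(
        F.concat_ws("\x1f", *[F.coalesce(F.col(c), F.lit("<NULL>")) for c in df.columns]),
        256,
    ).alias("row_hash")
)

# Sort the row hashes before combining so the result is order-independent
# and deterministic for identical file contents.
file_digest = row_hashed.agg(
    F.sha2(F.concat_ws("", F.sort_array(F.collect_list("row_hash"))), 256).alias("file_hash")
).first()["file_hash"]
print(file_digest)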
Other notes
Some data stores such as S3 have object metadata (the ETag) that acts like a hash. If you are using such a data store you could simply retrieve and compare these to identify duplicates, and avoid hashing any files yourself.
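For example, a hedged sketch using boto3 could group keys by ETag; the bucket and prefix are placeholders, and since ETags of multipart uploads are not plain MD5 digests, treat matches as duplicate candidates rather than proof:
import boto3
from collections import defaultdict

s3 = boto3.client("s3")
etags = defaultdict(list)

# Walk the bucket/prefix and bucket object keys by their ETag.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-bucket", Prefix="uploads/"):
    for obj in page.get("Contents", []):
        etags[obj["ETag"]].append(obj["Key"])

duplicates = {etag: keys for etag, keys in etags.items() if len(keys) > 1}
print(duplicates)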
I am able to do this part using the below code:
file_txt = ''.join(map(str, file_read))
return(sha256(file_txt.encode('utf-8')).hexdigest())

Locating EOCD in ZIP files by offset

I'm trying to write a collection of yara signatures that will tag zip files based on artifacts of their creation.
I understand the EOCD has a magic number of 0x06054b50, and that it is located at the end of the archive structure. It has a variable-length comment field with a max length of 0xFFFF, so the EOCD could be up to 0xFFFF + ~20 bytes long. However, there could be data after the zip structure that could throw off any offset-dependent scanning.
Is there any way to locate the record without scanning the whole file for the magic bytes? How do you validate that the magic bytes aren't there by coincidence if there can be data after the EOCD?
This is typically done by scanning backwards from the end of the file until you find the EOCD signature. Yes, it is possible to find the same signature embedded in the comment, so you need to check other parts of the EOCD record to see if they are consistent with the file you are reading.
For example, if the EOCD record isn't at the end of the file, the comment length field in the EOCD cannot be zero. It should match the number of bytes left in the file.
Similarly, if this is a single-disk archive, the offset of the start of the central directory needs to point somewhere within the zip archive. If you follow that offset you should find the signature of a central directory record.
And so on.
Note that I've ignored the complications of the Zip64 records and encryption records, but the principle is the same. You need to check the fields in the records are consistent with the file being read.
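As a rough illustration of the backward scan and the comment-length consistency check (in Python rather than a yara rule, ignoring Zip64 as above; "archive.zip" is a placeholder):
import struct

EOCD_SIG = b"\x50\x4b\x05\x06"   # 0x06054b50 as it appears on disk
EOCD_FIXED_LEN = 22              # fixed-size part of the EOCD record

def find_eocd(data):
    # Only the last EOCD_FIXED_LEN + 0xFFFF bytes can contain the record.
    search_start = max(0, len(data) - EOCD_FIXED_LEN - 0xFFFF)
    pos = data.rfind(EOCD_SIG, search_start)
    while pos != -1:
        comment_len = struct.unpack_from("<H", data, pos + 20)[0]
        # Consistency check: the declared comment must end exactly at EOF.
        if pos + EOCD_FIXED_LEN + comment_len == len(data):
            return pos
        pos = data.rfind(EOCD_SIG, search_start, pos)
    return -1

with open("archive.zip", "rb") as f:
    data = f.read()
print(find_eocd(data))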

How to add md5 sum to file metadata on a Linux file-system for the purpose of search and de-duplication?

I have a large number of files that occasionally have duplicates with different names, and I would like to add an fslint-like capability to the file system so that it can be de-duplicated and any new files created in specified locations can be checked against known md5 values. The intent is that after the initial summing of the entire file collection the overhead is smaller, as it then only requires the new file's md5 sum to be compared against the store of existing sums. This check could be a daily job or part of a file submission process.
checkandsave -f newfile -d destination
Does this utility already exist? What is the best way of storing fileid-md5sum pairs so that the search on a new file's sum is as fast as possible?
Re: using rmlint:
Where does rmlint store the checksums, or is that work repeated every run? I want to add the checksum to the file metadata (or some form of store that optimises search speed) so that when I have a new file I can generate its sum and check it against the existing pre-calculated sums for all files of the same size.
Yes rmlint can do this via the --xattr-read --xattr-write options.
The cron job would be something like:
/usr/bin/rmlint -T df -o sh:/home/foo/dupes.sh -c sh:link --xattr-read --xattr-write /path/to/files
-T df means just look for duplicate files
-o sh:/home/foo/newdupes.sh specifies where to put the output report / shell script (if you want one)
-c sh:link specifies that the shell script should replace duplicates with hardlinks or symlinks (or reflinks on btrfs)
Note that rmlint only calculates file checksums when necessary, for example if there is only one file with a given size then there is no chance of a duplicate so no checksum is calculated.
Edit: the checksums are stored in the file extended attributes metadata. The default uses SHA1 but you can switch this to md5 via -a md5
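If you wanted to roll a small checkandsave-style check yourself on top of extended attributes, a minimal sketch might look like this. It is Linux-only, and the user.md5sum attribute name is an assumption for illustration, not the attribute layout rmlint itself uses:
import hashlib, os

def file_md5(path, chunk_size=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def is_duplicate(new_file, existing_dir):
    new_size = os.path.getsize(new_file)
    new_sum = file_md5(new_file)
    for name in os.listdir(existing_dir):
        candidate = os.path.join(existing_dir, name)
        # Only compare against files of the same size.
        if not os.path.isfile(candidate) or os.path.getsize(candidate) != new_size:
            continue
        try:
            stored = os.getxattr(candidate, b"user.md5sum").decode()
        except OSError:                    # no xattr yet: compute and tag it
            stored = file_md5(candidate)
            os.setxattr(candidate, b"user.md5sum", stored.encode())
        if stored == new_sum:
            return True
    return False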

How to create an RDD with the whole content of files as values?

I have a directory with many files, and I want to create an RDD whose values are the contents of each file. How can I do that?
You can use the SparkContext.wholeTextFiles method, which reads:
a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
Just keep in mind that individual files have to fit into worker memory and generally speaking it is less efficient than using textFile.
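A minimal sketch, assuming sc is your SparkContext and the directory URI is a placeholder:
rdd = sc.wholeTextFiles("hdfs:///data/my_dir")
# keys are file paths, values are the full file contents
print(rdd.keys().take(5))
contents_only = rdd.values()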
