We need a file fingerprint for every file uploaded to the server. sha256 has been chosen as the hash function.
For large files, each file is split into several chunks of equal size (except the last one) for transfer. Clients provide the sha256 value of each chunk, and the server re-calculates and checks them.
However, those sha256 values cannot be combined into the sha256 value for the whole file.
So I consider changing the definition of file fingerprint:
For files smaller than 1GB, the sha256 value is the fingerprint.
For files larger than 1GB, the file is sliced into 1GB chunks. Each chunk has its own sha256 value, denoted s0, s1, s2, ... (all treated as integer values).
When the first chunk is received:
h0 = s0
When the second chunk is received:
h1 = SHA256((h0 << 256) + s1)
This is essentially concatenating the two hash values and hashing the result again. The process is repeated until all chunks are received, and the final value hn is used as the file fingerprint.
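As a concrete illustration, here is a minimal Python sketch of this chaining (the byte concatenation below is the same operation as (h << 256) + s on 256-bit values; the function name is just illustrative):

import hashlib

def chained_fingerprint(chunk_digests):
    """Fold per-chunk SHA-256 digests into one value: h_i = SHA256(h_{i-1} || s_i)."""
    if not chunk_digests:
        raise ValueError("need at least one chunk digest")
    h = chunk_digests[0]                        # h0 = s0
    for s in chunk_digests[1:]:
        h = hashlib.sha256(h + s).digest()      # concatenate previous value and next digest
    return h.hex()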
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I have googled a lot and read a few articles on combine_hash functions in various languages and frameworks. Different authors choose different bit-mangling hash functions, and most of them are said to work well.
In my case, however, efficiency is not a concern, but the fingerprint is stored and used as the file content identifier system-wide.
My primary concern is whether the naive method listed above will introduce more collisions than sha256 itself.
If sha256 is not a good choice for combining hash values in our case, are there any recommendations?
You are essentially reinventing the Merkle tree.
What you'll have to do is split your large files into equally-sized chunks (except the last fragment), compute a hash for each of those chunks, and then combine them pairwise until there is a single ultimate hash value. Note that the "root" hash will not be equal to the hash of the original file, but that's not required to validate the integrity of the entire file.
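A minimal sketch of that pairwise combination in Python, assuming the per-chunk SHA-256 digests are already available; carrying an odd leftover node up unchanged is one common convention, not the only one:

import hashlib

def merkle_root(leaf_digests):
    """Combine chunk digests pairwise, level by level, into a single root hash."""
    if not leaf_digests:
        raise ValueError("need at least one leaf digest")
    level = list(leaf_digests)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            next_level.append(hashlib.sha256(level[i] + level[i + 1]).digest())
        if len(level) % 2:                 # odd leaf left over: promote it unchanged
            next_level.append(level[-1])
        level = next_level
    return level[0].hex()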
Related
I am looking to validate the files, i.e. check whether the content of a file is an exact duplicate of another file (with a different name in the same folder). I have read the files using the PySpark code below:
import os

for file in os.listdir(fileDirectory):
    file_read = spark.read.csv(fileDirectory + '/' + file)
Now, I want to calculate a single-value checksum of the whole file. Please advise.
The best route depends on your use case. Typically this could be done with MD5 file hashes or with row-wise checksums.
MD5 file-based hash with Spark
You could hash each file with MD5 and store the result somewhere for comparison. You then read a single file at a time and compare it to the stored results to identify duplicates.
MD5 isn't a splittable algorithm though. It must run sequentially through an entire file. This makes it somewhat less useful with Spark, and means that when opening a large file you can't benefit from Spark's data distribution capabilities.
Instead, to hash a whole file, if you must use Spark, you could use the wholeTextFiles method, and a map function which calculates the MD5 hash.
This reads the entire contents of one or more files into a single record in a partition. If you have several executors but only one file, then all executors but one would be idle. With Spark, a record cannot be split across executors, so you risk running out of memory if the largest file's contents don't fit in the executor memory.
Anyway here is what it looks like:
import hashlib

# One (path, contents) record per file.
rdd = spark.sparkContext.wholeTextFiles(location)

def map_hash_file(row):
    file_name = row[0]
    file_contents = row[1]
    md5_hash = hashlib.md5()
    md5_hash.update(file_contents.encode('utf-8'))
    return file_name, md5_hash.hexdigest()

rdd.map(map_hash_file).collect()
A benefit to this approach is that if you have many files in a folder you can compute the MD5 for each of them in parallel. You must ensure the largest possible file fits into a single record, i.e. into your executor memory.
With this approach you'd read all the files in your folder each time and compute the hashes in parallel, but don't need to store and retrieve the hashes from somewhere as would be the case if you were processing 1 file at a time.
If you only want to detect duplicates in a folder, and don't mind duplicates across folders, then perhaps this approach would work.
If you also want to detect duplicates across folders, you'd need to read in all of those files too, or just store their hashes somewhere if you already processed them.
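For the within-folder case, the comparison step can be as simple as grouping the (file_name, digest) pairs returned by the collect() above; this is just a sketch of that bookkeeping:

from collections import defaultdict

# List of (file_name, md5 hex digest) tuples from the wholeTextFiles approach above.
hashes = rdd.map(map_hash_file).collect()

by_digest = defaultdict(list)
for file_name, digest in hashes:
    by_digest[digest].append(file_name)

# Any digest that maps to more than one file name is a group of duplicates.
duplicates = {d: names for d, names in by_digest.items() if len(names) > 1}
print(duplicates)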
MD5 file-based hash without Spark
If you want to store the hashes and process a single file at a time, and therefore want to avoid reading in all the files, or if you can't fit the largest file in memory, then you'd need a different approach than Spark.
Since you are using Pyspark, how about using regular Python to read and hash the file? You don't need to read the entire file into memory, you can read it in small chunks and MD5 it serially in that way.
From https://stackoverflow.com/a/1131238/2409299 :
import hashlib
with open("your_filename.txt", "rb") as f:
file_hash = hashlib.md5()
while chunk := f.read(8192):
file_hash.update(chunk)
print(file_hash.digest())
print(file_hash.hexdigest()) # to get a printable str instead of bytes
Then compare the hexdigest with previously stored hashes to identify a duplicate file.
Row-wise checksum
Using the Spark CSV reader means you've already unpacked the file into rows and columns, so you can no longer compute an accurate file hash. You could instead do row-wise hashes: add a column with the hash of each row, sort the dataset the same way across all the files, and hash down that column to come up with a deterministic result (see the sketch after the NULL example below). It would probably be easier to go the file-based MD5 route, though.
With Spark SQL and the built-in hash function:
spark.sql("SELECT *, HASH(*) AS row_hash FROM my_table")
Spark's Hash function is not an MD5 algorithm. In my opinion it may not be suitable for this use case. For example, it skips columns that are NULL which can cause hash collisions (false-positive duplicates).
Below hash values are the same:
spark.sql("SELECT HASH(NULL, 'a', 'b'), HASH('a', NULL , 'b')")
+----------------+----------------+
|hash(NULL, a, b)|hash(a, NULL, b)|
+----------------+----------------+
| 190734147| 190734147|
+----------------+----------------+
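As a sketch of the row-wise route using md5 from pyspark.sql.functions instead of HASH (the '||' separator, the '<NULL>' placeholder, and header=True are illustrative assumptions; concat_ws over an array column needs Spark 2.4+):

from pyspark.sql import functions as F

df = spark.read.csv(fileDirectory + '/' + file, header=True)

# Hash each row; coalesce NULLs to a marker so NULL and empty string don't collide.
cols = [F.coalesce(F.col(c).cast("string"), F.lit("<NULL>")) for c in df.columns]
rows = df.withColumn("row_hash", F.md5(F.concat_ws("||", *cols)))

# Deterministic file-level digest: sort the row hashes, concatenate, hash again.
file_hash = rows.agg(
    F.md5(F.concat_ws("", F.sort_array(F.collect_list("row_hash")))).alias("file_hash")
).first()["file_hash"]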
Other notes
Some data stores such as S3 have object metadata (E-Tag) that acts like a hash. If you are using such a data store you could simply retrieve and compare these to identify duplicates, and avoid hashing any files yourself.
I am able to do this part using the code below:
file_txt = ''.join(map(str, file_read))
return(sha256(file_txt.encode('utf-8')).hexdigest())
I have a question about torrent files.
I know that a torrent contains a list of servers (users) that I need to connect to in order to download parts of the whole file.
My question is: is this all that the torrent contains, or are there other important details?
Thanks!
A torrent file is a specially formatted binary file. It always contains a list of files and integrity metadata about all the pieces, and optionally contains a list of trackers.
A torrent file is a bencoded dictionary with the following keys:
announce - the URL of the tracker
info - this maps to a dictionary whose keys are dependent on whether one or more files are being shared:
name - suggested file/directory name where the file(s) is/are to be saved
piece length - number of bytes per piece. This is commonly 2^18 B = 256 KiB = 262,144 B.
pieces - a hash list, i.e. a concatenation of each piece's SHA-1 hash. As SHA-1 returns a 160-bit (20-byte) hash, pieces will be a byte string whose length is a multiple of 20 bytes.
length - size of the file in bytes (only when one file is being shared)
files - a list of dictionaries each corresponding to a file (only when multiple files are being shared). Each dictionary has the following keys:
path - a list of strings corresponding to subdirectory names, the last of which is the actual file name
length - size of the file in bytes.
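To make the pieces field concrete, here is a small Python sketch that verifies a downloaded single-file payload against that hash list; it assumes the info dictionary has already been decoded by a bencode parser (not shown), and the function name is illustrative:

import hashlib

def verify_pieces(data: bytes, piece_length: int, pieces: bytes) -> bool:
    """Check each piece of a single-file payload against the concatenated SHA-1 digests."""
    expected = [pieces[i:i + 20] for i in range(0, len(pieces), 20)]   # 20 bytes per SHA-1
    for index, piece_digest in enumerate(expected):
        piece = data[index * piece_length:(index + 1) * piece_length]
        if hashlib.sha1(piece).digest() != piece_digest:
            return False        # this piece is missing or corrupted
    return True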
I have a large set of strings, on the order of ~10^12 or so, and I need to choose an appropriate data structure so that, given a string, I can retrieve an associated integer value in something like O(log(n)) or O(m) time, where 'n' is the number of strings and 'm' is the length of each string.
We can expect that our set of strings, each of length 'm' and encoded over some alphabet of size 'q', covers nearly all possible strings of this length. For example, imagine we have 10^12 all-unique binary strings of length m = 39. This implies that we've covered ~54% of the set of all possible binary strings of this length.
As such, I'm concerned about finding an appropriate hashing function for the strings that avoids collisions. Is there a good one I can use? How long will it take me to index my set of n strings?
Or should I go with a suffix tree? We know that Ukkonen's algorithm allows linear-time construction, and my guess is that this will save space given the large number of similar strings.
Given your extremely large number of strings, your choice has to focus on several points:
1. Are your indexing structures going to fit in memory?
For hash tables the answer is clearly no, so the access time will be much slower than O(1). Still, you only need one disk access per lookup (and the whole insertion process would be O(N)).
For the B-tree I made some theoretical calculations, assuming a B+tree (to save space in interior nodes) and that interior nodes are fully occupied. By this analysis it won't fit in memory:
The usual disk page size is 4096 bytes. That is the size of one b-tree node.
The average size of your strings is 70 bytes (if it is less, even better).
A child node address takes 4 bytes.
An interior node holds d keys and has d+1 child addresses:
4096 B ≈ 4*(d+1) + 70*d  =>  d ≈ 4096/75  =>  d = 54

#internal nodes in memory -> #leaf nodes on disk -> #strings mapped
0 internal nodes -> 1 leaf node -> 53 strings mapped
1 internal node -> 54 leaf nodes (each holding 53 strings) -> 53² strings mapped
1+54 internal nodes -> 54² leaf nodes -> 53³ strings mapped
...
1+54+...+54⁵ internal nodes -> 54⁶ leaf nodes -> 53⁷ strings mapped
53⁷ > 10^12, but 54⁵ * 4096 bytes > 1 TB of memory
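A quick arithmetic check of that estimate (taking d = 54 as derived above; purely illustrative):

page_size = 4096                                   # one B-tree node per disk page
strings_mapped = 53 ** 7                           # reachable with 6 levels of interior nodes
interior_nodes = sum(54 ** i for i in range(6))    # 1 + 54 + ... + 54^5

print(strings_mapped > 10 ** 12)                   # True
print(interior_nodes * page_size / 2 ** 40)        # ~1.7 TiB of interior nodes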
If your strings are not uniformly distributed, you can explore common prefixes. This way an internal node might be able to address more children, allowing you to save memory. BerkeleyDB has that option.
2. What kind of access are you going to employ? Large or small number of reads?
If you have large number of reads, are they random or sequential?
If your access is sequential you might still benefit from the B-tree, because you will hit cached nodes a lot (avoiding disk access) and the leaves are sequentially linked (B+tree). This is also great for range queries (which I think is not your case). If your access is completely random, then the hash table is faster, as it always needs only one disk access while the B-tree needs one disk access for each level stored on disk.
If you are going to make a small number of accesses, the hash table is preferable, since insertion will always be faster.
Since you know the total number of your strings, you can indicate it to the hash table up front, and you will not lose time in bucket-scaling operations (which require all elements to be rehashed).
Note: I looked into Ukkonen's suffix tree. Insertion is linear, and access is sequential as well. However, I have only found it used with data sets of a few GB. Here are some references about suffix tree algorithms: [ref1], [ref2] and [ref3].
Hope this helps somehow...
Hash tables are useful when the keys are sparse, but when the keys are dense there is no need to hash; you can use the key (the string) itself as the index. To support simple membership queries, you could use a bit vector. If your data is 39-bit binary strings, you'd have a bit vector of length 2^39: 1 means the string is present, 0 means it's absent. The bit vector itself is 2^39 bits = 2^36 bytes = 64 GiB, which is sizable but still manageable on disk or on a large-memory machine.
To go from a string over a q-letter alphabet to an integer, you treat it as a number in base q. For example, if q=4 and the string is 3011, find the integer as 3*4^3 + 0*4^2 + 1*4^1 + 1*4^0, which equals 197.
The associated integer values will consume a lot of space. You could store them in an array indexed by the string, so in your example you'd have an array of 2^39 integers, with some slots empty. That is unlikely to fit in memory, though, since it would consume half a terabyte (2^39 bytes) even if each integer were only one byte. In that case, you could store them sequentially in a disk file.
You'll probably find it helpful to look up information on bit vectors/bit arrays: http://en.wikipedia.org/wiki/Bit_array
The wikipedia link talks about compression, which might be applicable.
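A minimal sketch of the bit-vector idea in plain Python (class and function names are illustrative; a real deployment would likely memory-map the vector or use an existing bit-array library):

def string_to_index(s: str, alphabet: str) -> int:
    """Interpret s as a number in base q = len(alphabet); '3011' over '0123' -> 197."""
    q = len(alphabet)
    index = 0
    for ch in s:
        index = index * q + alphabet.index(ch)
    return index

class BitVector:
    """One membership bit per possible string of a fixed length."""
    def __init__(self, num_bits: int):
        self.bits = bytearray((num_bits + 7) // 8)
    def set(self, i: int) -> None:
        self.bits[i // 8] |= 1 << (i % 8)
    def test(self, i: int) -> bool:
        return bool(self.bits[i // 8] & (1 << (i % 8)))

# Small demo with 20-bit binary strings; m = 39 would need 2^39 bits = 64 GiB.
bv = BitVector(2 ** 20)
bv.set(string_to_index("00000000010011000101", "01"))
print(bv.test(string_to_index("00000000010011000101", "01")))   # True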
...
Hi Bob,
long answer made short: the classic HASH+BTREE approach is strong and superfast.
Whether 10 million or 10 billion strings are to be stored in the above structure doesn't matter - you always have a very low maximum number of seeks.
Well, you need 10^12 = 1,000,000,000,000 strings - that is 1 trillion, which surprises me; even my heavy string corpora are in the range of 1 billion.
Just check my implementation in C at:
http://www.sanmayce.com/#Section13Level
As such, I'm concerned about finding an appropriate hashing function for the strings that
avoids collisions. Is there a good one I can use?
The fastest hash table-lookup function in C is here:
http://www.sanmayce.com/Fastest_Hash/index.html#KT_torture3
It is 300-500% faster than strong CRC32 8-slice variants (both Castagnoli's and Koopman's) while featuring a similar collision rate.
I want to create a unique ID for a device, so I have decided to compute SHA1(mac XOR timestamp XOR user_password). Is there any security problem with this? Would it be better to do SHA1(mac CONCATENATE timestamp CONCATENATE user_password)?
Thank you
Use concatenation - then you'll be basing your hash on all of the available source data.
If you use XOR then there's a risk that one piece of your source data will "cancel out" some (or all) of the bits of the remaining data before it's even passed to the hash function.
And concatenating rather than XORing won't affect the space required for storage of your hash - the generated SHA1 hash will always be 20 bytes regardless of the size of your source data.
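A tiny sketch of that cancellation risk (the byte values are made up purely for illustration): with XOR the three parts below cancel to all-zero bytes, so unrelated inputs hash identically, while concatenation keeps them distinct.

import hashlib

def xor_bytes(*parts: bytes) -> bytes:
    """XOR together equal-length byte strings."""
    out = bytearray(len(parts[0]))
    for part in parts:
        for i, value in enumerate(part):
            out[i] ^= value
    return bytes(out)

x = (b"\x11" * 8, b"\x22" * 8, b"\x33" * 8)
y = (b"\x22" * 8, b"\x11" * 8, b"\x33" * 8)   # same pieces playing different roles

# 0x11 ^ 0x22 ^ 0x33 == 0, so both tuples collapse to all zeros before hashing:
print(hashlib.sha1(xor_bytes(*x)).hexdigest() == hashlib.sha1(xor_bytes(*y)).hexdigest())  # True

# Concatenation preserves every bit of every part, so the hashes differ:
print(hashlib.sha1(b"".join(x)).hexdigest() == hashlib.sha1(b"".join(y)).hexdigest())      # False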
I've got millions of data records, each about 2MB in size. Every one of these pieces of data is stored in a file, and there is a set of other data associated with that record (stored in a database).
When my program runs I'll be presented, in memory, with one of the data records and need to produce the associated data. To do this I'm imagining taking an MD5 of the memory, then using this hash as a key into the database. The key will help me locate the other data.
What I need to know is whether an MD5 hash of the data contents is a suitable way to uniquely identify a 2MB piece of data, meaning: can I use an MD5 hash without worrying too much about collisions?
I realize there is a chance of collision; my concern is how likely a collision is across millions of 2MB data records. Is a collision a likely occurrence? What about when compared to hard disk failure or other computer failures? How much data can MD5 be used to safely identify? What about millions of GB-sized files?
I'm not worried about malice or data tampering. I've got protections in place so that I won't be receiving manipulated data.
This boils down to the so-called birthday paradox. The Wikipedia page has simplified formulas for evaluating the collision probability, and it will be some very small number.
The next question is how you deal with, say, a 10^-12 collision probability - see this very similar question.
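For a concrete sense of the scale, a back-of-the-envelope calculation using the standard birthday approximation p ≈ n² / 2^(b+1) (the record count below is an assumption standing in for "millions"):

n = 100_000_000        # say, 100 million 2 MB records
b = 128                # MD5 digest length in bits

p = n * n / 2 ** (b + 1)
print(p)               # ~1.5e-23, far below any realistic hardware failure rate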