How much data (many MB) can I uniquely identify using MD5 - security

I've got millions of data records that are each about 2MB in size. Each of these pieces of data is stored in a file, and there is a set of other data associated with that record (stored in a database).
When my program runs I'll be presented, in memory, with one of the data records and need to produce the associated data. To do this I'm imagining taking an MD5 of the memory, then using this hash as a key into the database. The key will help me locate the other data.
What I need to know is whether an MD5 hash of the data contents is a suitable way to uniquely identify a 2MB piece of data, meaning can I use an MD5 hash without worrying too much about collisions?
I realize there is a chance of collision; my concern is how likely a collision is across millions of 2MB data records. Is a collision a likely occurrence? What about when compared to hard disk failure or other computer failures? How much data can MD5 be used to safely identify? What about millions of GB-sized files?
I'm not worried about malice or data tampering. I've got protections such that I won't be receiving manipulated data.

This boils down to the so-called birthday paradox. That Wikipedia page has simplified formulas for evaluating the collision probability. It will be a very small number.
The next question is how you deal with, say, a 10^-12 collision probability - see this very similar question.
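For a feel of the numbers, here is a minimal back-of-envelope sketch in Python using the n^2 / (2 * 2^128) birthday approximation. The 100 million record count is just an assumed example, and the 2MB record size does not enter the formula at all:

```python
from decimal import Decimal, getcontext

# Birthday-paradox approximation: p ~= n^2 / (2 * H), where H = 2^128
# is the number of possible MD5 values.
getcontext().prec = 50

def collision_probability(n_records: int, hash_bits: int = 128) -> Decimal:
    space = Decimal(2) ** hash_bits
    return (Decimal(n_records) ** 2) / (2 * space)

# Assumed example: 100 million records.
print(collision_probability(100_000_000))   # roughly 1.5E-23
```

At around 1.5e-23 for 100 million records, the collision risk is many orders of magnitude below typical hardware failure rates, which answers the comparison asked about in the question.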

Related

Cassandra - Number of disk seeks in a read request

I'm trying to understand the maximum number of disk seeks required in a read operation in Cassandra. I looked at several online articles including this one: https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutReads.html
As per my understanding, two disk seeks are required in the worst case. One is for reading the partition index and another is to read the actual data from the compressed partition. The index of the data in compressed partitions is obtained from the compression offset tables (which is stored in memory). Am I on the right track here? Will there ever be a case when more than 1 disk seek is required to read the data?
I'm posting the answer here that I received from the Cassandra user community thread, in case someone else needs it:
You're right – one seek with a hit in the partition key cache and two if not.
That's the theory – but two things to mention:
First, you need two seeks per SSTable, not per entire read. So if your data is spread over multiple SSTables on disk you obviously need more than two seeks. Think of often-updated partition keys – in combination with memory pressure you can easily end up with many SSTables (OK, they will be compacted at some point in the future).
Second, there could be fragmentation on disk which leads to seeks during sequential reads.
Note: Each SSTable has its own partition index.

Single row hotspot

I built a Twitter clone, and the row that stores Justin Bieber’s profile (some very famous person with a lot of followers) is read incredibly often. The server that stores it seems to be overloaded. Can I buy a bigger server just for that row? By the way, it isn’t updated very often.
The short answer is that Cloud Spanner does not offer different server configurations, except to increase your number of nodes.
If you don't mind reading stale data, one way to increase read throughput is to use read-only, bounded-staleness transactions. This will ensure that your reads for these rows can be served from any replica of the split(s) that owns those rows.
If you wanted to go even further, you might consider a data modeling tradeoff that makes writes more expensive but reads cheaper. One way of doing that would be to manually shard that row (for example by creating N copies of it with different primary keys). When you want to read the row, a client can pick one to read at random. When you update it, just update all the copies atomically within a single transaction. Note that this approach is rarely used in practice, as very few workloads truly have the characteristics you are describing.
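As a rough sketch of that manual-sharding idea using the google-cloud-spanner Python client; the UserProfiles table, its columns, the key order, and N_COPIES are assumed placeholders, not anything Spanner prescribes:

```python
import datetime
import random

from google.cloud import spanner
from google.cloud.spanner_v1 import param_types

# Assumed schema: UserProfiles keyed by (CopyId INT64, UserId INT64), so the
# N_COPIES copies of one hot user land in different parts of the keyspace.
N_COPIES = 10

client = spanner.Client()
database = client.instance("my-instance").database("my-db")

def read_profile(user_id: int):
    copy_id = random.randrange(N_COPIES)           # pick any copy for reads
    staleness = datetime.timedelta(seconds=10)     # bounded-staleness read
    with database.snapshot(exact_staleness=staleness) as snap:
        rows = snap.execute_sql(
            "SELECT Bio, FollowerCount FROM UserProfiles "
            "WHERE CopyId = @cid AND UserId = @uid",
            params={"cid": copy_id, "uid": user_id},
            param_types={"cid": param_types.INT64, "uid": param_types.INT64},
        )
        return list(rows)

def update_profile_bio(user_id: int, new_bio: str):
    # Writes get more expensive: touch every copy, but inside one
    # atomic read-write transaction.
    def txn_fn(transaction):
        for cid in range(N_COPIES):
            transaction.execute_update(
                "UPDATE UserProfiles SET Bio = @bio "
                "WHERE CopyId = @cid AND UserId = @uid",
                params={"bio": new_bio, "cid": cid, "uid": user_id},
                param_types={
                    "bio": param_types.STRING,
                    "cid": param_types.INT64,
                    "uid": param_types.INT64,
                },
            )
    database.run_in_transaction(txn_fn)
```

The trade-off is exactly as described above: reads pick one copy at random (and can be served by any replica when stale reads are acceptable), while every write fans out to all N copies inside a single transaction.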

Out of memory [divide and conquer algorithm]

So I have a foo table, which is huge, and whenever I try to read all the data from that table Node.JS gives me an out-of-memory error! I can still get chunks of data using offset and limit, but I cannot merge all of the chunks and hold them in memory, because I run out of memory again! In my algorithm I have lots of ids and need to check whether each id exists in the foo table or not; what is the best solution (in terms of algorithmic complexity) when I cannot have all of the data in memory to see whether an id exists in the foo table or not?
PS: The naive solution is to get chunks of data and scan chunk by chunk for each id; but the complexity is n squared; there should be a better way, I believe...
Under the constraints you specified, you could create a hash table containing the IDs you are looking for as keys, with all values initialized to false.
Then, read the table chunk by chunk and, for each item in the table, look it up in the hash table. If found, mark the hash table entry with true.
After going through all the chunks, your hash table will hold values of true for the IDs found in the table.
Provided that the hash table lookup has constant time complexity, this algorithm has a time complexity of O(N).
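A minimal sketch of that approach in Python; fetch_chunk is an assumed helper standing in for however you page through foo (for example LIMIT/OFFSET queries):

```python
# fetch_chunk(offset, limit) is an assumed helper that returns the ids in
# one page of the foo table (e.g. via SELECT id FROM foo LIMIT ... OFFSET ...).
def find_existing_ids(ids_to_check, fetch_chunk, chunk_size=10_000):
    found = {id_: False for id_ in ids_to_check}   # the hash table
    offset = 0
    while True:
        chunk = fetch_chunk(offset, chunk_size)
        if not chunk:
            break
        for row_id in chunk:
            if row_id in found:                    # O(1) average lookup
                found[row_id] = True
        offset += chunk_size
    return found                                   # found[id] is True iff id is in foo
```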
You can sort your ids and break them into chunks. Then you can keep in memory the range of values in each chunk: (lowestId, highestId).
You can quickly find the chunk (if any) an id may be contained in using an in-memory binary search, then load that specific chunk into memory and do a binary search on it.
Complexity should be O(log N) for both. In general, read about the binary search algorithm.
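A rough sketch of the two-level binary search, assuming the per-chunk ranges are already sorted in memory and load_chunk is a placeholder for reading one chunk's sorted ids from disk:

```python
import bisect

# chunk_ranges: list of (lowest_id, highest_id, load_chunk) sorted by lowest_id,
# where load_chunk() is an assumed callable returning that chunk's sorted ids.
# lows = [low for low, _high, _load in chunk_ranges], precomputed once.
def id_exists(chunk_ranges, lows, target_id):
    i = bisect.bisect_right(lows, target_id) - 1   # binary search over ranges
    if i < 0:
        return False
    low, high, load_chunk = chunk_ranges[i]
    if target_id > high:
        return False
    chunk = load_chunk()                           # only this one chunk is loaded
    j = bisect.bisect_left(chunk, target_id)       # binary search inside it
    return j < len(chunk) and chunk[j] == target_id
```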
"PS: The naive solution is to get chuncks of data and see chunck by chunck for each id; but the complexity is n squared; there should be a better way I believe..."
Let's say you could load the whole table into your memory. In any case you'll need to check each ID whether or not it is in the DB. You can't do it any better than comparing.
Having said that, a hash table comes to mind. Lets say the IDs are integers, and they are randomly picked. you could hash the IDs you need to check by the last two digits (or the first two for that matter). Then checking the items you have in your memory will be quicker.
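A small sketch of that bucketing idea, purely illustrative, with 100 buckets keyed by the last two digits:

```python
from collections import defaultdict

# Group the ids you need to check by their last two digits; each row read
# from the table is then compared only against one small bucket.
def bucket_ids(ids_to_check):
    buckets = defaultdict(set)
    for id_ in ids_to_check:
        buckets[id_ % 100].add(id_)
    return buckets

def row_id_is_wanted(buckets, row_id):
    return row_id in buckets[row_id % 100]
```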

Maximum key size in Cassandra

I'm completely new to using Cassandra... Is there a maximum key size, and would that ever impact performance?
Thanks!
The key (and column names) must be under 64K bytes.
Routing is O(N) of the key size and querying and updating are O(N log N). In practice these factors are usually dwarfed by other overhead, but some users with very large "natural" keys use their hashes instead to cut down the size.
http://en.wikipedia.org/wiki/Apache_Cassandra claims (apparently incorrectly!) that:
The row key in a table is a string with no size restrictions, although typically 16 to 36 bytes long
See also:
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/value-size-is-there-a-suggested-limit-td4959690.html which suggests that there is some limit.
Clearly, very large keys could have some network performance impact if they need to be sent over the Thrift RPC interface - and they would cost storage. I'd suggest you try a quick benchmark to see what impact it has for your data.
One way to deal with this might be to pre-hash your keys and just use the hash value as the key, though this won't suit all use cases.
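A minimal sketch of that pre-hashing idea in Python; SHA-256 is used here as an example digest, not something Cassandra requires:

```python
import hashlib

# Replace a very large "natural" key with a fixed-size digest before using
# it as the row key. Any sufficiently collision-resistant hash would do.
def row_key_for(natural_key: bytes) -> str:
    return hashlib.sha256(natural_key).hexdigest()   # always 64 hex characters

print(row_key_for(b"some/very/long/composite/natural/key" * 100))
```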

Digitally Sign Data as it is Archived

I have an application that records data from a manufacturing process on a periodic basis (various sample rates, with a minimum of 1 sec; the usual max is 10 min or more). The customer would like to know if the data has been altered (changed in place, records added, or records deleted).
The data is recorded as a binary record. There can be multiple streams of data, each going to its own file, and each with its own data format. The data is written a record at a time, and if the monitoring PC or process goes down, manufacturing does not necessarily stop, so I can't guarantee the archiving process will stay up. Obviously, I can only authenticate what I actually record, but the recording might start and stop.
What methods can be used to authenticate that data? I'd prefer to use a separate 'logging' file to validate the data to maintain backwards compatibility, but I'm not sure that's possible. Barring direct answers, are there suggestions for search terms to find some suggestions?
Thanks!
I don't think you necessarily need digital signatures; secure hashes (say, SHA-256) should be sufficient.
As each record is written, compute a secure hash of it, and store the hash value in your logging file. If there's some sort of record ID, store that as well. You need some way to match up the hash with the corresponding record.
Now, as long as no one tampers with the logging file, any alteration of the records will be detectable. To make tampering difficult, periodically hash your log file and send that hash and the number of records in the log file somewhere secure. Ideally, send it multiple places, each under the control of a different person.
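A minimal sketch of the per-record hash log in Python; the file name, record-ID field, and CSV-style layout are assumed placeholders:

```python
import hashlib

# As each binary record is archived, append "record_id,hash" to a separate
# logging file, which keeps the existing archive format untouched.
def log_record_hash(record_id, record_bytes, log_path="stream1.hashlog"):
    digest = hashlib.sha256(record_bytes).hexdigest()
    with open(log_path, "a") as log:
        log.write(f"{record_id},{digest}\n")

def hash_of_log(log_path="stream1.hashlog"):
    # Periodically compute this and send it, plus the record count, to one
    # or more places outside the recording PC's control.
    with open(log_path, "rb") as log:
        return hashlib.sha256(log.read()).hexdigest()
```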
A slightly more sophisticated approach is to use a Merkle tree, essentially a binary tree of hashes, rather than just a single hash of the log file. Then store the whole tree (which isn't very large) and send the "root" hash to various places. The root hash allows you to verify the integrity of the tree and the tree allows you to verify the integrity of the log file -- and if the integrity check fails, it also enables you to determine which records were modified.
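And a rough sketch of computing a Merkle root over those per-record hashes; how odd leaves and domain separation are handled varies between implementations, so treat this as illustrative only:

```python
import hashlib

def merkle_root(leaf_hashes):
    """leaf_hashes: list of hex digests, one per archived record."""
    level = [bytes.fromhex(h) for h in leaf_hashes]
    if not level:
        return hashlib.sha256(b"").hexdigest()
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(hashlib.sha256(level[i] + level[i + 1]).digest())
        if len(level) % 2 == 1:          # odd node is carried up unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0].hex()
```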
You could look at digital timestamping instead. GuardTime has the technology to support massively scalable 1sec precision timestamping which guarantees information integrity.
