Digitally Sign Data as it is Archived

I have an application that records data from a manufacturing process on a periodic basis (various sample rates, minimum of 1 sec, the usual max is 10 min or more). The customer would like to know if the data has been altered (changed in place, records added to, or records deleted from).
The data is recorded as a binary record. There can be multiple streams of data, each going to its own file, and each with its own data format. The data is written a record at a time, and if the monitoring PC or process goes down, manufacturing does not necessarily stop, so I can't guarantee the archiving process will stay up. Obviously, I can only authenticate what I actually record, but the recording might start and stop.
What methods can be used to authenticate that data? I'd prefer to use a separate 'logging' file to validate the data, to maintain backwards compatibility, but I'm not sure that's possible. Barring direct answers, are there search terms that would point me toward suitable techniques?
Thanks!

I don't think you necessarily need digital signatures; secure hashes (say, SHA-256) should be sufficient.
As each record is written, compute a secure hash of it, and store the hash value in your logging file. If there's some sort of record ID, store that as well. You need some way to match up the hash with the corresponding record.
Now, as long as no one tampers with the logging file, any alteration of the records will be detectable. To make tampering difficult, periodically hash your log file and send that hash and the number of records in the log file somewhere secure. Ideally, send it multiple places, each under the control of a different person.
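A minimal sketch of that scheme in Python, assuming a tab-separated text log and a caller-supplied record ID (the file name, log format, and function names are illustrative, not prescribed above):

```python
import hashlib

LOG_PATH = "stream1.hashlog"   # separate logging file, one line per archived record

def log_record_hash(record_bytes, record_id):
    """Compute SHA-256 of the record as it is written and append it,
    together with the record ID, to the logging file."""
    digest = hashlib.sha256(record_bytes).hexdigest()
    with open(LOG_PATH, "a") as log:
        log.write(f"{record_id}\t{digest}\n")

def checkpoint_log():
    """Periodically hash the whole logging file; send the returned record
    count and hash to one or more independently controlled, secure places."""
    h = hashlib.sha256()
    count = 0
    with open(LOG_PATH, "rb") as log:
        for line in log:
            h.update(line)
            count += 1
    return count, h.hexdigest()
```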
A slightly more sophisticated approach is to use a Merkle tree, essentially a binary tree of hashes, rather than just a single hash of the log file. Then store the whole tree (which isn't very large) and send the "root" hash to various places. The root hash allows you to verify the integrity of the tree and the tree allows you to verify the integrity of the log file -- and if the integrity check fails, it also enables you to determine which records were modified.
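A sketch of the Merkle-tree variant, again using the per-record SHA-256 digests from the logging file as the leaves (duplicating the last node of an odd-sized level, as below, is one common convention, not the only one):

```python
import hashlib

def merkle_tree(leaf_hashes):
    """Build the tree bottom-up: levels[0] holds the leaf hashes,
    levels[-1][0] is the root hash to send off-site."""
    levels = [list(leaf_hashes)]
    while len(levels[-1]) > 1:
        prev, nxt = levels[-1], []
        for i in range(0, len(prev), 2):
            left = prev[i]
            right = prev[i + 1] if i + 1 < len(prev) else left  # duplicate odd node
            nxt.append(hashlib.sha256(left + right).digest())
        levels.append(nxt)
    return levels

# Example: the leaves would come from the per-record digests in the logging file.
leaves = [hashlib.sha256(f"record {i}".encode()).digest() for i in range(5)]
tree = merkle_tree(leaves)
root_hash = tree[-1][0]   # store the whole tree locally; send only root_hash away
```

If a later check of the root fails, you can walk down the tree comparing stored and recomputed hashes to locate the records that were modified.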

You could look at digital timestamping instead. GuardTime has technology that supports massively scalable timestamping with one-second precision, which can be used to guarantee data integrity.

Related

Cassandra counter usage

I am running into some difficulties in the data modeling of an application which may involve the use of counters.
The app is basically a messaging app. The number of messages is limited for free users, hence the initial plan of using a counter column to keep track of the total count.
I've discovered that batches (logged or not) cannot contain operations on both standard tables and counter ones. How do I ensure correctness if I cannot batch the operation I am trying to perform and the counter update together? Is the counter type really needed if there's basically no race condition on the column, given that it is associated with each individual user?
My second idea would be to use a standard int column and update it only inside batches. Is this a viable option?
Thank you
If you can absolutely guarantee that each user will produce only one update at a time, then you can rely on plain ints to do the job.
The problem, however, is that you will need a read-before-write, which is an anti-pattern in Cassandra. You could work around that too, e.g. by skipping the read: cache the counts and perform in-memory updates followed by writes only. This is viable if you couple your system with a caching server (e.g. Redis).
Thinking about it further, you will still need to read these counters at some point: if the number of messages a free user can send is bounded by some value, you have to check it when they log in, try to send a new message, look at the dashboard, etc., and block the action if the limit is reached.
Another option (if you store the messages sent by each user somewhere and don't want to add complexity to your system) would be to count them directly with a SELECT COUNT... type query, even though this can become pretty inefficient very quickly in the Cassandra world.
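To make the plain-int idea concrete, here is a minimal sketch using the Python cassandra-driver; the keyspace, table, and column names are invented for illustration, and the read-before-write caveat above still applies:

```python
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("messaging")   # hypothetical keyspace

insert_msg = session.prepare(
    "INSERT INTO messages (user_id, msg_id, body) VALUES (?, ?, ?)")
update_count = session.prepare(
    "UPDATE user_quota SET sent_count = ? WHERE user_id = ?")
read_count = session.prepare(
    "SELECT sent_count FROM user_quota WHERE user_id = ?")

def send_message(user_id, msg_id, body, limit=100):
    row = session.execute(read_count, (user_id,)).one()   # read-before-write
    current = row.sent_count if row else 0
    if current >= limit:
        return False   # free-tier limit reached

    # Both statements touch non-counter tables, so they can share a logged batch.
    batch = BatchStatement()
    batch.add(insert_msg, (user_id, msg_id, body))
    batch.add(update_count, (current + 1, user_id))
    session.execute(batch)
    return True
```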

Single row hotspot

I built a Twitter clone, and the row that stores Justin Bieber’s profile (some very famous person with a lot of followers) is read incredibly often. The server that stores it seems to be overloaded. Can I buy a bigger server just for that row? By the way, it isn’t updated very often.
The short answer is that Cloud Spanner does not offer different server configurations, other than increasing your number of nodes.
If you don't mind reading stale data, one way to increase read throughput is to use read-only, bounded-staleness transactions. This will ensure that your reads for these rows can be served from any replica of the split(s) that owns those rows.
If you wanted to go even further, you might consider a data modeling tradeoff that makes writes more expensive but reads cheaper. One way of doing that would be to manually shard that row (for example by creating N copies of it with different primary keys). When you want to read the row, a client can pick one to read at random. When you update it, just update all the copies atomically within a single transaction. Note that this approach is rarely used in practice, as very few workloads truly have the characteristics you are describing.
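A sketch of the manual-sharding idea using the google-cloud-spanner Python client; the table layout (a ShardId column prepended to the key), the shard count, and all names are assumptions made for illustration:

```python
import random
from google.cloud import spanner

N_SHARDS = 8
client = spanner.Client()
database = client.instance("my-instance").database("my-db")

def read_profile(user_id):
    """Read path: pick one of the N copies at random to spread the load."""
    shard = random.randrange(N_SHARDS)
    with database.snapshot() as snapshot:
        rows = snapshot.execute_sql(
            "SELECT Bio FROM Profiles WHERE ShardId = @shard AND UserId = @uid",
            params={"shard": shard, "uid": user_id},
            param_types={"shard": spanner.param_types.INT64,
                         "uid": spanner.param_types.STRING},
        )
        return list(rows)

def update_profile(user_id, new_bio):
    """Write path: update every copy atomically in a single transaction."""
    def txn_fn(transaction):
        for shard in range(N_SHARDS):
            transaction.execute_update(
                "UPDATE Profiles SET Bio = @bio "
                "WHERE ShardId = @shard AND UserId = @uid",
                params={"bio": new_bio, "shard": shard, "uid": user_id},
                param_types={"bio": spanner.param_types.STRING,
                             "shard": spanner.param_types.INT64,
                             "uid": spanner.param_types.STRING},
            )
    database.run_in_transaction(txn_fn)
```

This makes every write N times more expensive, which is why the approach only makes sense for rows that are read far more often than they are updated.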

On-disk lookup table with node.js bindings

For a project I am creating a queuing library and basically store URLs in a set (it's actually an object where I set keys to true, but one can see it as a set), so the queue only takes each URL once. This works really well, however I am facing the problem that there are many URLs, so the RAM usage becomes really high.
Therefore I want to use an on-disk key-value store (actually only keys are required; no idea whether there is some different approach) with the following requirements:
No need to load the whole data set into RAM
Speedy lookups
Node.js bindings
It doesn't have to be too safe (losing data once in a while isn't a huge problem; low RAM requirements are more important), and even though I use Node.js in this scenario, the lookup doesn't necessarily need to run async.
A side question would be whether there is some better approach than an on-disk key-value store; a search term would be nice. Searching for "lookup tables" somehow always leads me to data sets (IPs, ZIP codes, etc.).
I'd use a SQL table with a single column (to store the URL). It gives better control over memory usage than Redis (which pretty much stores everything in memory), and, as the sketch after this list shows, it is:
easy to check whether the same value is already there
easy to insert
easy to remove one element
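A sketch of that design using SQLite (shown here in Python with the standard sqlite3 module purely to illustrate the schema and queries; the same single-column table works from Node.js through any of the SQLite bindings on npm):

```python
import sqlite3

# A single-column table of seen URLs; the PRIMARY KEY gives uniqueness
# plus an index, so lookups don't need the whole set in RAM.
db = sqlite3.connect("seen_urls.db")
db.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")

def add_if_new(url):
    """Return True if the URL was not seen before (i.e. it should be queued)."""
    try:
        with db:
            db.execute("INSERT INTO seen (url) VALUES (?)", (url,))
        return True
    except sqlite3.IntegrityError:   # primary-key violation: already present
        return False

def remove(url):
    with db:
        db.execute("DELETE FROM seen WHERE url = ?", (url,))
```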
If it really "doesn't have to be too safe", another design would be to keep storing everything in memory but limit the number of URLs you store, for example by using an LRU cache.
You could either use an in-process cache in Node.js (easy to find via Google) or use a separate memcached server, possibly on the same machine.
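A sketch of the LRU idea (in Python, for consistency with the other examples here; the equivalent in Node.js would be any LRU cache package). The capacity is the knob that trades RAM for the chance of re-queuing an old URL:

```python
from collections import OrderedDict

class LRUSet:
    """Remembers at most `capacity` URLs; the least recently seen are evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def add(self, url):
        """Return True if the URL was not in the set (i.e. it should be queued)."""
        if url in self._items:
            self._items.move_to_end(url)          # mark as recently used
            return False
        self._items[url] = True
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)       # evict the oldest entry
        return True
```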

Data retrieval - Database VS Programming language

I have been working with databases recently and before that I was developing standalone components that do not use databases.
With all the DB work I have a few questions that sprang up.
Why is a database query faster than retrieving the same data from a file with a programming language?
To elaborate on my question:
Assume I have a table called Employee, with fields Name, ID, DOB, Email and Sex. For simplicity, we will also assume they are all fixed-length strings and that there are no indexes, primary keys, or any other constraints.
Imagine we have 1 million rows of data in the table. At the end of the day this table is going to be stored somewhere on the disk. When I write a query SELECT Name, ID FROM Employee WHERE DOB = '12/12/1985', the DBMS picks up the data from the file, processes it, filters it, and gives me a result which is a subset of the 1 million rows of data.
Now, assume I store the same 1 million rows in a flat file, each field similarly being fixed length string for simplicity. The data is available on a file in the disk.
When I write a program in C++ or C or C# or Java to do the same task of finding the Name and ID where DOB="12/12/1985", I will read the file record by record and check each row of data for DOB="12/12/1985"; if it matches, I present the row to the user.
Doing it this way in a program is far slower than the speed at which a SQL query returns the results.
I assume the DBMS is also written in some programming language and there is also an additional overhead of parsing the query and what not.
So what happens in a DBMS that makes it faster to retrieve data than through a programming language?
If this question is inappropriate for this forum, please delete it, but do give me some pointers to where I may find an answer.
I use SQL Server if that is of any help.
Why is a database query faster than retrieving the same data from a file with a programming language?
That depends on many things - network latency and disk seek speeds being two of the important ones. Sometimes it is faster to read from a file.
In your description of finding a row within a million rows, a database will normally be faster than seeking in a file because it employs indexing on the data.
If you pre-process your data file and provide index files for the different fields, you can speed up data lookups from the filesystem as well.
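As an illustration of that point, here is a sketch of a pre-computed index for the fixed-length flat file; the record size and the position of the DOB field are made-up values for the example:

```python
import pickle
from collections import defaultdict

RECORD_SIZE = 200              # fixed-length records (assumed layout)
DOB_OFFSET, DOB_LEN = 120, 10  # made-up position of the DOB field

def build_dob_index(data_path, index_path):
    """One sequential pass over the flat file; afterwards a query by DOB
    reads only the matching records instead of scanning the whole file."""
    index = defaultdict(list)
    with open(data_path, "rb") as f:
        pos = 0
        while True:
            record = f.read(RECORD_SIZE)
            if len(record) < RECORD_SIZE:
                break
            dob = record[DOB_OFFSET:DOB_OFFSET + DOB_LEN].decode().strip()
            index[dob].append(pos)
            pos += RECORD_SIZE
    with open(index_path, "wb") as f:
        pickle.dump(dict(index), f)

def find_by_dob(data_path, index_path, dob):
    with open(index_path, "rb") as f:
        index = pickle.load(f)
    with open(data_path, "rb") as f:
        for pos in index.get(dob, []):
            f.seek(pos)
            yield f.read(RECORD_SIZE)
```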
Note: databases are normally used not just for this feature, but because they are ACID compliant and therefore suitable for environments where you have multiple processes (normally many clients on many computers) querying the database at the same time.
There are lots of techniques to speed up various kinds of access. As @Oded says, indexing is the big solution to your specific example: if the database has been set up to maintain an index by date, it can go directly to the entries for that date, instead of reading through the entire file. (Note that maintaining an index does take up space and time, though -- it's not free!)
On the other hand, if such an index has not been set up, and the database has not been stored in date order, then a query by date will need to go through the entire database, just like your flat-file program.
Of course, you can write your own programs to maintain and use a date index for your file, which will speed up date queries just like a database. And, you might find that you want to add other indices, to speed up other kinds of queries -- or remove an index that turns out to use more resources than it is worth.
Eventually, managing all the features you've added to your file manager may become a complex task; you may want to store this kind of configuration in its own file, rather than hard-coding it into your program. At a minimum, you'll need features to make sure that changing your configuration will not corrupt your file...
In other words, you will have written your own database.
...an old question, I know... but in case somebody finds this: the question said "assume ... do not have any indexes"
...so it was about the sequential-read fight between the database and a flat file WITHOUT indexes, which the database wins...
And the answer is: if you read record by record from disk, you do lots of disk seeks, which are expensive performance-wise. A database, by design, always loads whole pages, i.e. a bunch of records at once, so it does far less seeking. If you did a memory-buffered read from the flat file, you could achieve the same or better read performance.
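A sketch of what a memory-buffered read of the fixed-length flat file could look like; the record and chunk sizes are arbitrary, and the point is simply that each I/O call fetches many records, much like a database reading whole pages:

```python
RECORD_SIZE = 200          # fixed-length records (assumed)
CHUNK_RECORDS = 4096       # ~800 KB per read() call, analogous to reading pages

def scan_buffered(path, predicate):
    """Sequential scan that pulls thousands of records per read instead of
    seeking and reading one record at a time."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(RECORD_SIZE * CHUNK_RECORDS)
            if not chunk:
                break
            for i in range(0, len(chunk) - RECORD_SIZE + 1, RECORD_SIZE):
                record = chunk[i:i + RECORD_SIZE]
                if predicate(record):
                    yield record
```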

How much data (many MB) can I uniquely identify using MD5

I've got millions of data records that are each about 2MB in size. Every one of these pieces of data is stored in a file, and there is a set of other data associated with that record (stored in a database).
When my program runs I'll be presented, in memory, with one of the data records and need to produce the associated data. To do this I'm imagining taking an MD5 of the memory, then using this hash as a key into the database. The key will help me locate the other data.
What I need to know is whether an MD5 hash of the data contents is a suitable way to uniquely identify a 2MB piece of data, meaning can I use an MD5 hash without worrying too much about collisions?
I realize there is a chance of collision; my concern is how likely a collision is across millions of 2MB data records. Is a collision a likely occurrence? What about when compared to hard disk failure or other computer failures? How much data can MD5 be used to safely identify? What about millions of GB files?
I'm not worried about malice or data tampering. I've got protections such that I won't be receiving manipulated data.
This boils down to the so-called birthday paradox. The Wikipedia page on it has simplified formulas for evaluating the collision probability; it will be a very, very small number.
The next question is how you deal with, say, a 10^-12 collision probability - see this very similar question.
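For a rough feel of the numbers, here is the standard birthday-paradox approximation p ≈ n^2 / (2 * 2^128) evaluated for the scenario in the question (n is assumed to be ten million records; MD5 produces 128-bit digests):

```python
n = 10_000_000            # assumed number of 2 MB records ("millions")
bits = 128                # MD5 digest size in bits

# Birthday-paradox approximation of the probability of at least one collision.
p = n * n / (2 * 2 ** bits)
print(p)                  # ~1.5e-25, far below any plausible hardware failure rate
```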
