How does a hash digest update function work? [duplicate] - node.js

I have an API route that proxies a file upload from the browser/client to AWS S3.
This API route attempts to stream the file as it is uploaded to avoid buffering the entire contents of the file in memory on the server.
However, the route also attempts to calculate an MD5 checksum of the file's body. As each part of the file is chunked, the hash.update() method is invoked with the chunk.
http://nodejs.org/api/crypto.html#crypto_hash_update_data_input_encoding
var crypto = require('crypto');
var hash = crypto.createHash('md5');

function write (chunk) {
  // invoked many times as file is uploaded
  hash.update(chunk);
}

function done() {
  // will hash buffer all chunks in memory at this point?
  hash.digest('hex');
}
Will the instance of Hash buffer all the contents of the file in order to perform the hash calculation (thus defeating the goal of avoiding buffering the entire file's contents in memory)? Or can an MD5 hash be calculated incrementally, without ever having the entire input available to perform the calculation?

MD5 and some other hash functions are based on the Merkle–Damgård construction, which supports incremental/progressive/streaming hashing of data. The data is transformed into an internal state (which has a fixed size) as it arrives, and a final finalization step generates the hash by padding and processing the last block and then simply returning the final state.
This is probably also why many hashing library APIs are designed with separate update and finalization steps.
To answer your question: no, the file content is not kept in a buffer, but is instead transformed into a fixed-size internal state.
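A quick way to convince yourself of this in Node.js is to compare an incremental digest against a one-shot digest; a minimal sketch (the chunk size of 7 is arbitrary):

var crypto = require('crypto');

var data = Buffer.from('hello world, this is a test of incremental hashing');

// One-shot digest of the whole input.
var oneShot = crypto.createHash('md5').update(data).digest('hex');

// Incremental digest over arbitrary chunk boundaries.
var incremental = crypto.createHash('md5');
for (var i = 0; i < data.length; i += 7) {
  incremental.update(data.subarray(i, i + 7));
}

console.log(oneShot === incremental.digest('hex')); // true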

All modern cryptographic hash functions are created in such a way that they can be updated incrementally.
To allow for incremental updates, the input message is first arranged in blocks. These blocks are processed in order. To do this, the implementation usually buffers the input internally until it has a full block, then processes that block together with the current state to produce a new state, using a so-called compression function. The initial state usually consists of predetermined constant values. During the call to digest, the last block is padded - usually with bit padding and an encoding of the message length - and the final state is calculated; this may require an additional block without any message data. A final operation may be performed, and the resulting hash value is returned. A sketch of this buffering logic follows below.
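To illustrate only the buffering logic described above (the compression function here is a meaningless stand-in, NOT real MD5), an update/digest pair can be sketched like this; note how update() never keeps more than one partial block in memory:

var BLOCK_SIZE = 64; // 512 bits, as in MD5

function ToyHash() {
  this.state = Buffer.alloc(16, 0x42); // fixed-size state, constant initial values
  this.pending = Buffer.alloc(0);      // holds at most BLOCK_SIZE - 1 bytes
}

// Stand-in for a real compression function: mixes one full block into the state.
ToyHash.prototype._compress = function (block) {
  for (var i = 0; i < BLOCK_SIZE; i++) {
    this.state[i % 16] ^= block[i];
  }
};

ToyHash.prototype.update = function (chunk) {
  var buf = Buffer.concat([this.pending, chunk]);
  while (buf.length >= BLOCK_SIZE) {        // process every full block...
    this._compress(buf.subarray(0, BLOCK_SIZE));
    buf = buf.subarray(BLOCK_SIZE);
  }
  this.pending = buf;                       // ...and keep only the remainder
  return this;
};

ToyHash.prototype.digest = function () {
  // Pad the final partial block; a real implementation also encodes the
  // message length here, possibly requiring one extra block.
  var pad = Buffer.alloc(BLOCK_SIZE - this.pending.length, 0);
  pad[0] = 0x80; // bit-padding marker
  this._compress(Buffer.concat([this.pending, pad]));
  return this.state.toString('hex');
};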
For MD5 the Merkle–Damgård construction is used. This common construction is also used for SHA-1 and SHA-2. SHA-2 is a family of hashes based on the algorithms for SHA-256 (SHA-224) and SHA-512 (SHA-384, SHA-512/224 and SHA-512/256). MD5 in particular uses a block size of 512 bits and an internal state of 128 bits. The internal state after the last block (including padding) is simply output directly, without any post-processing, for MD5, SHA-1, SHA-256 and SHA-512.
Keccak has been chosen to be SHA-3. It is based on a sponge construction rather than a compression function; it isn't a Merkle–Damgård hash, which is a big reason why it was chosen as SHA-3. It still has all the update properties of Merkle–Damgård hashes and has been designed to be compatible with SHA-2. It splits up and buffers blocks just like the previously mentioned hashes, but it has a larger internal state and performs final operations on the output, making it arguably more secure.
So when you use a modern hash construction such as MD5, you are unknowingly performing some additional buffering. Fortunately, buffering a single 512-bit block plus the 128-bit state will not make you run out of memory. The hash implementation is certainly not required to buffer the entire message before the final hash value can be calculated.
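In the context of the question, that means the S3 proxy route can hash the upload with constant memory; for example (a sketch, with 'bigfile.bin' as a stand-in for the uploaded stream):

var crypto = require('crypto');
var fs = require('fs');

var hash = crypto.createHash('md5');
var input = fs.createReadStream('bigfile.bin'); // stand-in for the upload stream

input.on('data', function (chunk) {
  hash.update(chunk); // only the hash state plus one partial block is retained
});
input.on('end', function () {
  console.log(hash.digest('hex')); // finalize: pad the last block, output the state
});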
Notes:
MD5 and SHA-1 are considered insecure w.r.t. collision resistance and they should preferably not be used anymore, especially when it comes to validating contents;
A "compression function" is a specific cryptographic notion; it is not
LSZIP or anything similar;
There may be specialized, theoretical hashes that calculate their values differently - theoretically speaking there is no requirement to split the input message into blocks and operate on the blocks sequentially. No worries, though: those are unlikely to be in the libraries you are using;
Similarly, implementations may decide to buffer more blocks at once, but that is fortunately extremely uncommon as well. Commonly only one block is used as a buffer, although in some cases it can be more performant to buffer a few blocks instead;
Some low level implementations may require you to supply the blocks yourself for reasons of efficiency.

Related

How to perform a digital signature on Smartcard with prestored hash

I am trying to use a smart card to perform a digital signature. My issue arises when I try this set of commands:
Select Application: 00A4040410E828BD080F*********
Verify Pin: 0020008506*******
Set SE for CRT HT: 002241AA03800110
Set SE for CRT DST: 002241b606800112840105
Store Hash: 002a90a00890008004AAAAAAAA // AAAAAAAA is just 4 random bytes for the card to compute on and then store
Sign: 002a9e9a00
I cannot sign by setting the security environment to either CRT-DST or CRT-HT: the former returns 6a88 (SE problem) and the latter returns 6a95 (hash not found).
I am following IAS_ECC_v1.0.1 by the book, but it is not clear which security environment to use when setting the hash and then signing. I tried the commands for SHA-256 as well, but got the same result.
I am used to setting the security environment then performing the digital signature but this is the first time I encounter the prestored hash type of card.
To clarify at least some issues: what you are describing is not a precomputed hash, but an intermediate hash value, as typically applied in schemes where the card has at least some influence on the hash computation. The card is supposed to update the given intermediate hash by processing the last data bytes given. This is a middle ground between the card hashing all the input data (possible, but seldom attractive due to limited I/O bandwidth) and providing the final hash value from the outside (no influence by the card).
Such an intermediate hash requires, in DO 90, the intermediate hash value concatenated with a bit counter, which is 8 bytes long. For SHA-256 this means 40 bytes (a 32-byte hash followed by the 8-byte bit counter). This is combined with DO 80 giving the final data.
Your example ("store hash" is at best a misleading term), however, provides DO 90 as empty, contradicting the intention of an intermediate hash.
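As a hypothetical sketch (in Node.js, matching the rest of this page; all values are placeholders, not a tested IAS-ECC exchange), the data field for a SHA-256 intermediate hash would be built along these lines:

// DO 90: intermediate hash (32 bytes for SHA-256) || 8-byte bit counter.
// DO 80: the final message bytes still to be hashed by the card.
function buildHashDataField(intermediateHash, bitCounter, finalBytes) {
  var do90 = Buffer.concat([
    Buffer.from([0x90, intermediateHash.length + bitCounter.length]),
    intermediateHash,
    bitCounter
  ]);
  var do80 = Buffer.concat([Buffer.from([0x80, finalBytes.length]), finalBytes]);
  return Buffer.concat([do90, do80]);
}

var data = buildHashDataField(
  Buffer.alloc(32, 0xaa), // placeholder intermediate SHA-256 state
  Buffer.alloc(8, 0x00),  // placeholder 8-byte bit counter
  Buffer.from([0xaa, 0xaa, 0xaa, 0xaa])
);
// PSO: HASH header 00 2A 90 A0, then Lc and the data field.
var apdu = Buffer.concat([Buffer.from([0x00, 0x2a, 0x90, 0xa0, data.length]), data]);
console.log(apdu.toString('hex'));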

OpenSSL data transmission using AES

I want to use OpenSSL for data transmission between a server and a client. I want to do it using EVP with AES in CBC mode. But when I try to decode the second message on the client, EVP_EncryptFinal_ex returns 0.
My scheme is shown in the attached picture.
I think this behavior occurs because I call EVP_EncryptFinal_ex (and EVP_DecryptFinal_ex) twice on one EVP context. How do I do this correctly?
You cannot call EVP_EncryptUpdate() after calling EVP_EncryptFinal_ex(), according to the EVP docs:
If padding is enabled (the default) then EVP_EncryptFinal_ex() encrypts the "final" data, that is any data that remains in a partial block. It uses standard block padding (aka PKCS padding) as described in the NOTES section, below. The encrypted final data is written to out which should have sufficient space for one cipher block. The number of bytes written is placed in outl. After this function is called the encryption operation is finished and no further calls to EVP_EncryptUpdate() should be made.
Instead, you should set up the cipher ctx for encryption again by calling EVP_EncryptInit_ex(). Note that unlike EVP_EncryptInit(), with EVP_EncryptInit_ex() you can keep reusing an existing context without allocating and freeing it on each call.
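The same rule holds in Node's crypto API (shown here since this page is Node-centric): a cipher object is finished once final() has been called, so create a fresh one per message. A minimal sketch:

var crypto = require('crypto');

var key = crypto.randomBytes(32); // AES-256 key, shared by both sides

function encryptMessage(plaintext) {
  var iv = crypto.randomBytes(16); // fresh IV for each message
  var cipher = crypto.createCipheriv('aes-256-cbc', key, iv);
  var ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return { iv: iv, ciphertext: ciphertext }; // cipher is now finished; don't reuse it
}

console.log(encryptMessage(Buffer.from('first message')));
console.log(encryptMessage(Buffer.from('second message')));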

nodejs crypto module, does hash.update() store all input in memory

DES and ICryptoTransform

This method works fine in a program I've made. However, I cannot really understand what is happening and where the decryption is actually performed. I read the related description on MSDN, but not much information is given.
Can someone explain what is happening in general, especially in lines 8 and 9, please?
public byte[] Decrypt(byte[] input, byte[] key, byte[] iv)
{
    DES des = new DESCryptoServiceProvider();
    des.Mode = CipherMode.ECB;
    des.Padding = PaddingMode.None;
    des.Key = key;
    ICryptoTransform ct = des.CreateDecryptor(key, iv);
    byte[] result = ct.TransformFinalBlock(input, 0, input.Length);
    return result;
}
If you want to understand what is going on, you should read about block cipher modes of operation here:
http://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Electronic_codebook_.28ECB.29
In a nutshell, block cipher chaining causes the output of one block operation to be fed into the next block operation. This obscures any block-level patterns in the ciphertext. Since there is a chaining structure, the last block gets an input from the second-to-last block, and so on... until the second block gets an input from the first block. Now the first block needs to get an input from something, but there are no preceding blocks, so we use something called an initialization vector (IV) to start it off. The IV does not need to be secret like the key, but it does need to have a low probability of reuse (otherwise the attacker can use it to correlate the first blocks of all your ciphertexts). Typically random numbers are used, or sometimes increasing sequence numbers.
In regard to the specific call:
Your method decrypts a single block using DES (which is nowadays considered out of date and insecure, by the way; please consider using AES instead - the block cipher structure remains the same, so all you need to do is swap the library). Anyway:
Since you're using the cipher in ECB mode, each block is decrypted independently; note that ECB mode does not actually make use of the initialization vector, even though CreateDecryptor accepts one. The call to CreateDecryptor initializes a decryption object using the provided secret key and initialization vector.
The actual decryption is performed using the call to TransformFinalBlock. The arguments are the input byte array, and then an offset and a length parameter (used for when you don't want to decrypt the entire byte array). In this case you do want to use the entire byte array so the starting offset is 0 and the size is the length of the whole byte array.
One thing you should probably add is a check that the input byte array is a multiple of the cipher's block size; otherwise it will throw an exception. In the case of DES, the block size is 64 bits. If you switch to AES as I recommended, it will be 128 bits.
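For comparison with the rest of this page, here is a sketch of the same operation in Node.js; it makes the unused IV explicit, since Node's 'des-ecb' cipher takes no IV at all:

var crypto = require('crypto');

function decrypt(input, key) {
  // 'des-ecb' takes no IV, hence null; disable padding to mirror PaddingMode.None.
  var decipher = crypto.createDecipheriv('des-ecb', key, null);
  decipher.setAutoPadding(false);
  return Buffer.concat([decipher.update(input), decipher.final()]);
}

var key = crypto.randomBytes(8); // DES uses an 8-byte key
var block = Buffer.alloc(8);     // input must be a multiple of the 8-byte block size
console.log(decrypt(block, key).toString('hex'));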

Audio, AES CBC and IVs

I'm currently working on a VoIP project and have a question about the implementation of AES in CBC mode. I know that for instant messaging based on text communication, it's important to generate an IV for every message to avoid possible guessing of the first block if it repeats during the communication.
But is it useful to do the same with audio data? Since audio data is much more complex than clear text, I'm wondering whether it would be wise to generate an IV for each audio chunk (that would mean a lot of IVs per second, more than 40), or would this just slow everything down for nothing? Or should just one IV generated at the start of the conversation be enough?
Thanks in advance,
Nolhian
You do not need to generate new IVs each time.
For example, in SSH and TLS only one IV is used for a whole data session, and rekeying is needed only after some gigabytes of data.
CBC requires a new IV for each message. However, nobody said that you had to send a message in one go.
Consider SSL/TLS. The connection begins with a complex procedure (the "handshake") which results in a shared "master key" from which symmetric encryption keys, MAC keys, and IVs are derived. From that point until the connection ends (or a new handshake occurs), the complete data sent by the client to the server is, as far as CBC is concerned, one unique big message which uses, quite logically, a unique IV.
In more detail, with CBC each block (of 16 bytes with AES) is first XORed with the previous encrypted block and then encrypted itself. The IV is needed only for the very first block, since there is no previous block at that point. One way of seeing it is that each encrypted block is the IV for the encryption of what follows. When, as part of the SSL/TLS dialog, the client sends some data (a "record" in SSL speak), it remembers the last encrypted block of that record, to be used as the IV for the next record.
In your case, I suppose that you have an audio stream to encrypt. You could handle it as SSL/TLS does, simply chopping the CBC stream between blocks. It has, however, a slight complication: usually, in VoIP protocols, some packets may be lost. If you receive a chunk of CBC-encrypted data and do not have the previous chunk, then you do not know the IV for that chunk (i.e. the last encrypted block of the previous chunk). You are then unable to properly decrypt the first block (16 bytes) of the chunk you receive. Whether recovery from that situation is easy or not depends on what data you are encrypting (in particular, with audio, what kind of compression algorithm you use). If that potential loss is a problem, then a workaround is to include the IV in each chunk: in CBC-speak, the last encrypted block of a chunk (in a packet) is repeated as first encrypted block in the next chunk (in the next packet).
Or, to state it briefly: you need an IV per chunk, but CBC generates these IVs "naturally", because all the IVs (except the very first) are blocks that you just encrypted.
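A short sketch of that last point (Node.js, whole blocks only, so padding is disabled): encrypting in two chunks, using the last ciphertext block of the first chunk as the IV of the second, yields exactly the same ciphertext as one continuous CBC encryption.

var crypto = require('crypto');

var key = crypto.randomBytes(16);
var iv = crypto.randomBytes(16);
var data = crypto.randomBytes(64); // four 16-byte blocks

function cbcEncrypt(buf, iv) {
  var c = crypto.createCipheriv('aes-128-cbc', key, iv);
  c.setAutoPadding(false); // whole blocks only in this demo
  return Buffer.concat([c.update(buf), c.final()]);
}

var whole = cbcEncrypt(data, iv);

var chunk1 = cbcEncrypt(data.subarray(0, 32), iv);
var nextIv = chunk1.subarray(16, 32); // last ciphertext block of chunk 1
var chunk2 = cbcEncrypt(data.subarray(32), nextIv);

console.log(whole.equals(Buffer.concat([chunk1, chunk2]))); // true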
