Efficient decryption of AES/CFB8 - Haskell

I currently use this function to decrypt a data stream encrypted with AES in CFB8 mode:
https://github.com/Lazersmoke/civskell/blob/ebf4d761362ee42935faeeac0fe447abe96db0b5/src/Civskell/Tech/Encrypt.hs#L167-L175
cfb8Decrypt :: AES128 -> BS.ByteString -> BS.ByteString -> (BS.ByteString, BS.ByteString)
cfb8Decrypt c i = BS.foldl magic (BS.empty, i)
  where
    magic (ds, iv) d = (ds `BS.snoc` pt, ivFinal)
      where
        pt = BS.head (ecbEncrypt c iv) `xor` d
        -- snoc on cipher always
        ivFinal = BS.tail iv `BS.snoc` d
In case you don't understand Haskell, here's a quick rundown of how I believe this code works (I did not write it):
Given an IV and a list of encrypted bytes
For every encrypted byte:
Encrypt the IV in ECB-mode.
Take the first byte of the encrypted IV and xor it with the encrypted byte. This is the next plaintext byte.
Remove the first byte from the IV
Append the encrypted byte to the IV
The next character will be decrypted using this new IV
Note that the ECB-mode encryption is handled by the cryptonite library. I could not find a library supporting CFB8.
Now, this works. However, with the amount of data I need to decrypt, it caps out one of my CPU cores and 80% of the time is just spent on decrypting.
The incoming data is not even that much, so this is not acceptable. Unfortunately, my knowledge of cryptography is rather limited and resources on CFB8 seem rather sparse. It appears that CFB8 is an uncommon mode of operation, also indicated by the lack of library support.
So, my question then is: How would I go about optimising this?
The incoming data is from a TCP stream, but the information is grouped into packets. The cfb8Decrypt function is called 2-5 times per packet, depending on the size. This is necessary because the length of the packet is transmitted at the beginning, but the length of this size information is itself variable. After 1-4 decryptions are used to decrypt the length, the rest of the packet is decrypted at once. I thought about trying to reduce this, but I am unsure whether it would have any effect on speed at all.
Edit: Profiling results: http://svgur.com/i/40b.svg

CFB8 was created to have good error propagation properties over a noisy channel. It is well known that it is not fast; it is roughly 16 times as slow as a full-block mode, since it requires a full block encryption for every byte instead of for every 16 bytes. It is not used much these days, as we tend to use CRCs at the data layer and MACs for integrity at the cryptographic level against willful attacks.
How can you speed it up? The only thing you can really do is to use a fast library. The library you are currently using seems to have support for AES-NI, so make sure that is enabled on your CPU and BIOS.
However, it is very likely that it won't speed up much if you have to call it block by block. You really want a native call that takes the whole packet and decrypts it. At its slowest, on an Atom implementing TLS, AES-NI still reaches about 20 MiB/s, while on server chips it often goes far beyond 1 GiB/s. Assembly or optimized C should be about six to seven times slower when AES-NI is not available.
Functional programming languages like Haskell are not really designed for fast I/O or fast bit operations. So you can bet that it will be much, much slower than e.g. Java or C#, which are themselves much slower than native code, let alone assembly or specialized instructions.
Memory nowadays is pretty fast; CPUs are, however, much faster still. So spurious memory allocations and copying should be avoided (again, not easy to do in a purely functional language, all the more reason to do as much as possible in native code). Do make sure there are no buffer overflow issues, though, or you will have fast AES/CFB inside an insecure application.
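For comparison, here is a sketch of what such a whole-packet native call looks like, using Python's cryptography package (which, unlike the library above, does ship a CFB8 mode); the key and IV values are placeholders, and a comparable one-shot primitive is what you would want to find or bind to from Haskell:

from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def cfb8_decrypt(key: bytes, iv: bytes, ciphertext: bytes) -> bytes:
    # One native call per packet instead of one ECB encryption per byte
    decryptor = Cipher(algorithms.AES(key), modes.CFB8(iv)).decryptor()
    return decryptor.update(ciphertext) + decryptor.finalize()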


Questions about Python3.6 os.urandom/os.getrandom/secrets

Referring to documentation for os and secrets:
os.getrandom(size, flags=0)
Get up to size random bytes. The function can return less bytes than requested.
getrandom() relies on entropy gathered from device drivers and other sources of environmental noise.
So does this mean it's from /dev/random?
On Linux, if the getrandom() syscall is available, it is used in blocking mode: block until the system urandom entropy pool is initialized (128 bits of entropy are collected by the kernel).
So to ensure a kernel CSPRNG with bad internal state is never used, I should use os.getrandom()? Since the function can return fewer bytes than requested, should I run the application-level CSPRNG as something like
def rng():
    r = bytearray()
    while len(r) < 32:
        r += os.getrandom(1)
    return bytes(r)
to ensure maximum security? I explicitly want all systems that do not support blocking until the urandom entropy pool is initialized to be unable to run the program, and systems that do support it to wait. This is because the software must be secure even if it's run from a live CD that has zero entropy at start.
Or does the blocking mean that if I do os.getrandom(32), the program waits, if necessary forever, until the 32 bytes are collected?
The flags argument is a bit mask that can contain zero or more of the following values ORed together: os.GRND_RANDOM and GRND_NONBLOCK.
Can someone please ELI5 how this works?
os.urandom(size)
On Linux, if the getrandom() syscall is available, it is used in blocking mode: block until the system urandom entropy pool is initialized (128 bits of entropy are collected by the kernel).
So urandom quietly falls back to a non-blocking CSPRNG that doesn't know its internal seeding state on older Linux kernel versions?
Changed in version 3.6.0: On Linux, getrandom() is now used in blocking mode to increase the security.
Does this have to do with os.getrandom()? Is it a lower level call? Are the two the same?
os.GRND_NONBLOCK
By default, when reading from /dev/random, getrandom() blocks if no random bytes are available, and when reading from /dev/urandom, it blocks if the entropy pool has not yet been initialized.
So it's the 0-flag in os.getrandom(size, flags=0)?
os.GRND_RANDOM
If this bit is set, then random bytes are drawn from the /dev/random pool instead of the /dev/urandom pool.
What does ORing the os.getrandom() flags mean? How does os.getrandom(flags=1) tell whether I meant to enable os.GRND_NONBLOCK or os.GRND_RANDOM? Or do I need to set it beforehand like this:
os.GRND_RANDOM = 1
os.getrandom(32) # or use the rng() defined above
secrets module
The secrets module is used for generating cryptographically strong random numbers suitable for managing data such as passwords, account authentication, security tokens, and related secrets.
The only clear way to generate random bytes is
secrets.token_bytes(32)
The secrets module provides access to the most secure source of randomness that your operating system provides.
So that should mean it's os.getrandom() with a fallback to os.urandom()? So it's not a good choice if you want a graceful exit when the internal state cannot be evaluated?
To be secure against brute-force attacks, tokens need to have sufficient randomness. Unfortunately, what is considered sufficient will necessarily increase as computers get more powerful and able to make more guesses in a shorter period. As of 2015, it is believed that 32 bytes (256 bits) of randomness is sufficient for the typical use-case expected for the secrets module.
Yet the blocking stops at 128 bits of internal state, not 256. Most symmetric ciphers have 256-bit versions for a reason.
So I should probably make sure /dev/random is used in blocking mode to ensure the internal state has reached 256 bits by the time the key is generated?
So tl;dr
What's the most secure way in Python3.6 to generate a 256-bit key on a Linux (3.17 or newer) live distro that has zero entropy in kernel CSPRNG internal state at the start of my program's execution?
After doing some research, I can answer my own question.
os.getrandom is a wrapper for the getrandom() syscall offered in Linux kernel 3.17 and newer. The flag is a number (0, 1, 2 or 3) that corresponds to a bitmask in the following way:
GETRANDOM with ChaCha20 DRNG
os.getrandom(32, flags=0)
GRND_NONBLOCK = 0 (=Block until the ChaCha20 DRNG seed level reaches 256 bits)
GRND_RANDOM = 0 (=Use ChaCha20 DRNG)
= 00 (=flag 0)
This is a good default to use with all Python 3.6 programs on all platforms (including live distros) when no backwards compatibility with Python 3.5 and pre-3.17 kernels is needed.
The PEP 524 is incorrect when it claims
On Linux, getrandom(0) blocks until the kernel initialized urandom with 128 bits of entropy.
According to page 84 of the BSI report, the 128-bit limit applies during boot time to callers of the kernel module's get_random_bytes() function, provided the code properly waits for the triggering of the add_random_ready_callback() function. (Not waiting means get_random_bytes() might return insecure random numbers.) According to page 112:
When reaching the state of being fully seeded and thus having the ChaCha20 DRNG seeded with 256 bits of entropy -- the getrandom system call unblocks and generates random numbers.
So, GETRANDOM() never returns random numbers until the ChaCha20 DRNG is fully seeded.
os.getrandom(32, flags=1)
GRND_NONBLOCK = 1 (=If the ChaCha20 DRNG is not fully seeded, raise BlockingIOError instead of blocking)
GRND_RANDOM = 0 (=Use ChaCha20 DRNG)
= 01 (=flag 1)
Useful if the application needs to do other tasks while it waits for the ChaCha20 DRNG to be fully seeded. The ChaCha20 DRNG is almost always fully seeded during boot, so flags=0 is most likely the better choice. Needs try-except logic around it (a sketch follows).
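A minimal sketch of that try-except logic; do_other_tasks() is a hypothetical placeholder:

import os
import time

def getrandom_when_seeded(n=32):
    # Poll with GRND_NONBLOCK so the program stays responsive while the
    # ChaCha20 DRNG finishes seeding; BlockingIOError means "not seeded yet".
    while True:
        try:
            return os.getrandom(n, os.GRND_NONBLOCK)
        except BlockingIOError:
            do_other_tasks()  # hypothetical placeholder for useful work
            time.sleep(0.1)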
GETRANDOM with blocking_pool
The blocking_pool is also accessible via the /dev/random device file. The pool was designed with the idea in mind that entropy runs out. This idea applies only when trying to create one-time pads (which strive for information-theoretic security). The quality of entropy in blocking_pool for that purpose is not clear, and the performance is really bad. For every other use, a properly seeded DRNG is enough.
The only situation where blocking_pool might be more secure is with pre-4.17 kernels that have the CONFIG_RANDOM_TRUST_CPU flag set at compile time, and if the CPU HWRNG happens to have a backdoor. Since in that case the ChaCha20 DRNG is initially seeded with the RDSEED/RDRAND instruction, a bad CPU HWRNG would be a problem. However, according to page 134 of the BSI report:
[As of kernel version 4.17] The Linux-RNG now considers the ChaCha20 DRNG fully seeded after it received 128 bit of entropy from the noise sources. Previously it was sufficient that it received at least 256 interrupts.
Thus the ChaCha20 DRNG wouldn't be considered fully seeded until entropy is also mixed in from the input_pool, which pools and mixes random events from all LRNG noise sources together.
By using os.getrandom() with flags 2 or 3, the entropy comes from the blocking_pool, which receives entropy from the input_pool, which in turn receives entropy from several additional noise sources. The ChaCha20 DRNG is also reseeded from the input_pool, so the CPU RNG does not have permanent control over the DRNG state. Once this happens, the ChaCha20 DRNG is as secure as the blocking_pool.
os.getrandom(32, flags=2)
GRND_NONBLOCK = 0 (=Return 32 bytes or less if entropy counter of blocking_pool is low. Block if no entropy is available.)
GRND_RANDOM = 1 (=Use blocking_pool)
= 10 (=flag 2)
This needs an external loop that calls the function and stores the returned bytes into a buffer until the buffer holds 32 bytes. The major problem here is that, due to the blocking behavior of the blocking_pool, obtaining the needed bytes might take a very long time, especially if other programs are also requesting random numbers from the same syscall or from /dev/random. Another issue is that a loop using os.getrandom(32, flags=2) spends more time idle waiting for random bytes than it would with flag 3 (see below).
os.getrandom(32, flags=3)
GRND_NONBLOCK = 1 (=Return 32 bytes or less if the entropy counter of blocking_pool is low. If no entropy is available, raise BlockingIOError instead of blocking.)
GRND_RANDOM = 1 (=Use blocking_pool)
= 11 (=flag 3)
Useful if the application needs to do other tasks while it waits for blocking_pool to gather some entropy. Needs try-except logic around it, plus an external loop that calls the function and buffers the returned bytes until the buffer holds 32 bytes (a sketch follows).
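A sketch of that loop, again with a hypothetical do_other_tasks() placeholder:

import os

def random_from_blocking_pool(n=32):
    # Accumulate bytes from blocking_pool without ever blocking the program
    buf = bytearray()
    while len(buf) < n:
        try:
            buf += os.getrandom(n - len(buf), os.GRND_RANDOM | os.GRND_NONBLOCK)
        except BlockingIOError:
            do_other_tasks()  # hypothetical placeholder for useful work
    return bytes(buf)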
Other
open('/dev/urandom', 'rb').read(32)
To ensure backwards compatibility, and unlike GETRANDOM() with the ChaCha20 DRNG, reading from the /dev/urandom device file never blocks. There is thus no guarantee about the quality of the random numbers, which is bad. This is the least recommended option.
os.urandom(32)
os.urandom(n) provides best effort security:
Python3.6
On Linux 3.17 and newer, os.urandom(32) is the equivalent of os.getrandom(32, flags=0). On older kernels it quietly falls back to the equivalent of open('/dev/urandom', 'rb').read(32) which is not good.
os.getrandom(32, flags=0) should be preferred as it can not fall back to insecure mode.
Python3.5 and earlier
Always the equivalent of open('/dev/urandom', 'rb').read(32) which is not good. As os.getrandom() is not available, Python3.5 should not be used.
secrets.token_bytes(32) (Python 3.6 only)
Wrapper for os.urandom(). Default length of keys is 32 bytes (256 bits). On Linux 3.17 and newer, secrets.token_bytes(32) is the equivalent of os.getrandom(32, flags=0). On older kernels it quietly falls back to the equivalent of open('/dev/urandom', 'rb').read(32) which is not good.
Again, os.getrandom(32, flags=0) should be preferred as it can not fall back to insecure mode.
tl;dr
Use os.getrandom(32, flags=0).
What about other RNG sources, random, SystemRandom() etc?
import random
random.<anything>()
is never safe for creating passwords, cryptographic keys etc.
import random
sys_rand = random.SystemRandom()
is safe for cryptographic use WITH EXCEPTIONS!
sys_rand.sample()
Generating a random password with sys_rand.sample(list_of_password_chars, password_length) is not safe because, to quote the documentation, the sample() method is used for "random sampling without replacement". This means that each consecutive character in the password is guaranteed to differ from all previous characters, which leads to passwords that are not uniformly random.
sys_rand.choices()
The sample() method was used for random sampling without replacement. The choices() method is used for random sampling with replacement. However, to quote the documentation on choices():
The algorithm used by choices() uses floating point arithmetic for internal consistency and speed. The algorithm used by choice() defaults to integer arithmetic with repeated selections to avoid small biases from round-off error.
The floating-point arithmetic that choices() uses thus introduces cryptographically non-negligible biases into the sampled passwords. Thus, random.choices() must not be used for password/key generation!
sys_rand.choice()
As per the previously quoted documentation, the choice() method uses integer arithmetic as opposed to floating-point arithmetic, so generating passwords/keys with repeated calls to sys_rand.choice() is safe.
secrets.choice()
secrets.choice() is a wrapper for random.SystemRandom().choice(), and the two can be used interchangeably: they are the same thing.
The recipe for best practice to generate a passphrase with secrets.choice() is
import secrets
# On standard Linux systems, use a convenient dictionary file.
# Other platforms may need to provide their own word-list.
with open('/usr/share/dict/words') as f:
    words = [word.strip() for word in f]
passphrase = ' '.join(secrets.choice(words) for i in range(4))
How can I ensure the generated passphrase meets some security level, e.g. 128 bits?
Here's a recipe for that
import math
import secrets

def generate_passphrase() -> str:
    PASSWORD_MIN_BIT_STRENGTH = 128  # Set desired minimum bit strength here
    with open('/usr/share/dict/words') as f:
        wordlist = [word.strip() for word in f]
    word_space = len(wordlist)
    word_count = math.ceil(math.log(2 ** PASSWORD_MIN_BIT_STRENGTH, word_space))
    passphrase = ' '.join(secrets.choice(wordlist) for _ in range(word_count))
    # pwd_bit_strength = math.floor(math.log2(word_space ** word_count))
    # print(f"Generated {pwd_bit_strength}-bit passphrase.")
    return passphrase
As @maqp suggested, using os.getrandom(32, flags=0) is the logical choice, unless you're using the new secrets module AND a Linux kernel (3.17 and newer) on which it does NOT fall back to open('/dev/urandom', 'rb').read(32).
Workaround, Secrets on Python 3.5.x
I installed the secrets backport (packaged for Python 2) even though I'm running Python 3, and at a glance it works in a Python 3.5.2 environment. Perhaps if I get time, or someone else does, we can learn whether this one also falls back; I suppose it may if the Linux kernel is below a certain version.
pip install python2-secrets
Once installed, you can import secrets just like you would with the Python 3 flavor.
Or just make sure to use Linux kernel 3.17 or newer. Knowing one's kernel is always good practice, but in reality we count on smart people like maqp to find and share these things. Great job.
Were we complacent, having a false sense of security?
1st, 2nd, 4th... where is the outrage? It's not about 'your' security; that would be selfish to assume. It's about the ability to spy on those who represent you in government, those who have skeletons and weaknesses (humans). Be sure to correct those selfish ones who say, "I've got nothing to hide".
How Bad Was It?
The strength of encryption increases exponentially as the length of the key increases, so is it reasonable to assume that a reduction of at least half, say from 256 bits down to 128, would equate to a decrease in strength by factors of tens, hundreds, thousands or more? Concretely, going from 256 to 128 bits cuts the brute-force work from 2^256 to 2^128 attempts, a factor of 2^128. Did it make big brother's job pretty easy, or just a tiny bit easier? I am leaning towards the former.
The Glass Half Full?
Oh well, at least Linux is open source and we can see the insides, for the most part. We still need chip hackers to find the secret stuff, and the chips and hardware drivers are where you'll probably find the things that will keep you from sleeping at night.

How to protect against Replay Attacks

I am trying to figure out a way to implement decent crypto on a micro-controller project. I have an ARMv4 based MCU that will control my garage door and receive commands over a WiFi module.
The MCU will run a TCP/IP server, that will listen for commands from Android clients that can connect from anywhere on the Internet, which is why I need to implement crypto.
I understand how to use AES with shared secret key to properly encrypt traffic, but I am finding it difficult to deal with Replay Attacks. All solutions I see so far have serious drawbacks.
There are two fundamental problems which prevent me from using well established methods like session tokens, timestamps or nonces:
The MCU has no reliable source of entropy, so I can't generate quality random numbers.
The attacker can reset the MCU by cutting power to the garage, thus erasing any stored state at will and resetting the time counter to zero (or they can just wait 49 days until it wraps around).
With these restrictions in mind, I can see only one approach that seems OK to me (i.e. I don't see how to break it yet). Unfortunately, it requires non-volatile storage, which means writing to external flash, which introduces some serious complexity due to a variety of technical details.
I would love to get some feedback on the following solution. Even better, is there a solution I am missing that does not require non-volatile storage?
Please note that the primary purpose of this project is education. I realize that I could simplify this problem by setting up a secure relay inside my firewall, and let that handle Internet traffic, instead of exposing the MCU directly. But what would be the fun in that? ;)
= Proposed Solution =
A pair of shared AES keys will be used: one key to turn a nonce into an IV for the CBC stage, and another to encrypt the messages themselves:
Shared message Key
Shared IV_Key
Here's a picture of what I am doing:
https://www.youtube.com/watch?v=JNsUrOVQKpE#t=10m11s
1) Android takes the current time in milliseconds (Ti) (a 64-bit long) and uses it as a nonce input into the CBC stage to encrypt the command:
a) IV(i) = AES_ECB(IV_Key, Ti)
b) Ci = AES_CBC(Key, IV(i), COMMAND)
2) Android utilizes /dev/random to generate the IV_Response that the MCU will use to answer the current request.
3) Android sends [<Ti, IV_Response, Ci>, <== HMAC(K)]
4) MCU receives and verifies integrity using the HMAC, so an attacker can't modify the plaintext Ti.
5) MCU checks that Ti > T(i-1) stored in flash. This ensures that recorded messages can't be replayed.
6) MCU calculates IV(i) = AES_ECB(IV_Key, Ti) and decrypts Ci.
7) MCU responds using AES_CBC(Key, IV_Response, RESPONSE)
8) MCU stores Ti in external flash memory.
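For concreteness, here is a sketch of the sender side (steps 1-3) in Python, using the cryptography package; the zero-padding of Ti to a full block, the PKCS7 padding, HMAC-SHA256, and os.urandom standing in for /dev/random are my illustrative assumptions, not a vetted design:

import os, struct, time, hmac, hashlib
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def build_request(command, key, iv_key, hmac_key):
    t_i = struct.pack('>Q', int(time.time() * 1000))   # step 1: 64-bit Ti
    # 1a) IV(i) = AES_ECB(IV_Key, Ti), Ti zero-padded to one 16-byte block
    ecb = Cipher(algorithms.AES(iv_key), modes.ECB()).encryptor()
    iv = ecb.update(t_i + b'\x00' * 8) + ecb.finalize()
    # 1b) Ci = AES_CBC(Key, IV(i), COMMAND), with PKCS7 padding
    padder = padding.PKCS7(128).padder()
    cbc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    c_i = cbc.update(padder.update(command) + padder.finalize()) + cbc.finalize()
    iv_response = os.urandom(16)                       # step 2: response IV
    msg = t_i + iv_response + c_i                      # step 3: HMAC covers plaintext Ti
    return msg + hmac.new(hmac_key, msg, hashlib.sha256).digest()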
Does this work? Is there a simpler approach?
EDIT: It was already shown in comments that this approach is vulnerable to a Delayed Playback Attack. If the attacker records and blocks messages from reaching the MCU, then the messages can be played back at any later time and still be considered valid, so this algorithm is no good.
As suggested by @owlstead, a challenge/response system is likely required. Unless I can find a way around that, I think I need to do the following:
Port or implement a decent PRGA. (Any recommendations?)
Pre-compute a lot of random seed values for the PRGA. A new seed will be used for every MCU restart. Assuming 128-bit seeds, 16K of storage buys me 1000 unique seeds, so the values won't loop until the MCU successfully uses at least one PRGA output value and restarts 1000 times. That doesn't seem too bad.
Use the output of PRGA to generate the challenges.
Does that sound about right?
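One sketch of a challenge generator built from a stored seed, using an HMAC-based counter construction as a stand-in for "a decent PRGA" (an illustration under those assumptions, not a vetted DRBG):

import hmac, hashlib, struct

class ChallengeGenerator:
    # One fresh 128-bit seed is consumed from flash on every MCU restart
    def __init__(self, seed):
        self.seed = seed
        self.counter = 0

    def next_challenge(self):
        # challenge = HMAC(seed, counter), truncated to one AES block
        self.counter += 1
        msg = struct.pack('>Q', self.counter)
        return hmac.new(self.seed, msg, hashlib.sha256).digest()[:16]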
Having an IV_Key is unnecessary. IVs (and similar constructs, such as salts) do not need to be encrypted, and if you look at the image you linked to in the YouTube video you'll see their description of the payload includes the IV in plaintext. IVs are used so that the same plaintext does not encrypt to the same ciphertext under the same key every time, which would leak information to an attacker. The protection against the IV being altered is the HMAC on the message, not the encryption. As such, you can remove that requirement. EDIT: This paragraph is incorrect, see discussion below. As noted though, your approach described above will work fine.
Your system does have a flaw though, namely the IV_Response. I assume, from the fact that you include it in your system, that it serves some purpose. However, because the expected response is not in any way encrypted, it allows an attacker to respond affirmatively to a device's request without the MCU ever receiving it. Let's say that your devices were instructing an MCU running a small robot to throw a ball. Your commands might look like:
1) Move to loc (x,y).
2) Lower anchor to secure bot table.
3) Throw ball
Our attacker can allow messages 1 and 3 to pass as expected, and block message 2 from getting to the MCU while still responding affirmatively, causing our bot to be damaged when it tosses the ball without being anchored. This has an imperfect solution: append the expected response (which should fit into a single block) to the command so that it is encrypted as well, and have the MCU respond with AES_ECB(Key, Response), which the device will verify. As such, the attacker will not be able to (feasibly) forge a valid response. Note that as you consider /dev/random untrustworthy, this could provide an attacker with plaintext-ciphertext pairs, which can be used for linear cryptanalysis of the key provided the attacker has a large set of pairs to work with. As such, you'll need to change the key with some regularity.
Other than that, your approach looks good. Just remember it is crucial that you use the stored Ti to protect against the replay attack, and not the MCU's clock. Otherwise you run into synchronization problems.

Securely transmit commands between PIC microcontrollers using nRF24L01 module

I have created a small wireless network using a few PIC microcontrollers and nRF24L01 wireless RF modules. One of the PICs is a PIC18F46K22, used as the main controller which sends commands to all other PICs. All other (slave) microcontrollers are PIC16F1454s; there are 5 of them so far. These slave controllers are attached to various devices (mostly lights). The main microcontroller transmits commands to those devices, such as turning lights on or off. The devices also report the status of the attached hardware back to the main controller, which then displays it on an LCD screen. This whole setup is working perfectly fine.
The problem is that anybody who has these cheap nRF24L01 modules could simply listen to the commands which are being sent by the main controller and then repeat them to control the devices.
Encrypting the commands wouldn't help by itself: these are simple instructions, so even encrypted they will always look the same, and one does not need to decrypt a message to be able to retransmit it.
So how would I implement a level of security in this system?
What you're trying to do is to prevent replay attacks. The general solution to this involves two things:
Include a timestamp and/or a running message number in all your messages. Reject messages that are too old or that arrive out of order.
Include a cryptographic message authentication code in each message. Reject any messages that don't have the correct MAC.
The MAC should be at least 64 bits long to prevent brute force forgery attempts. Yes, I know, that's a lot of bits for small messages, but try to resist the temptation to skimp on it. 48 bits might be tolerable, but 32 bits is definitely getting into risky territory, at least unless you implement some kind of rate limiting on incoming messages.
If you're also encrypting your messages, you may be able to save a few bytes by using an authenticated encryption mode such as SIV that combines the MAC with the initialization vector for the encryption. SIV is a pretty nice choice for encrypting small messages anyway, since it's designed to be quite "foolproof". If you don't need encryption, CMAC is a good choice for a MAC algorithm, and is also the MAC used internally by SIV.
Most MACs, including CMAC, are based on block ciphers such as AES, so you'll need to find an implementation of such a cipher for your microcontroller. A quick Google search turned up this question on electronics.SE about AES implementations for microcontrollers, as well as this blog post titled "Fast AES Implementation on PIC18F4550". There are also small block ciphers specifically designed for microcontrollers, but such ciphers tend to be less thoroughly analyzed than AES, and may harbor security weaknesses; if you can use AES, I would. Note that many MAC algorithms (as well as SIV mode) only use the block cipher in one direction; the decryption half of the block cipher is never used, and so need not be implemented.
The timestamp or message number should be long enough to keep it from wrapping around. However, there's a trick that can be used to avoid transmitting the entire number with each message: basically, you only send the lowest one or two bytes of the number, but you also include the higher bytes of the number in the MAC calculation (as associated data, if using SIV). When you receive a message, you reconstruct the higher bytes based on the transmitted value and the current time / last accepted message number and then verify the MAC to check that your reconstruction is correct and the message isn't stale.
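A sketch of that trick, using AES-CMAC from Python's cryptography package; the 2-byte wire counter, the 64-bit MAC, and the sender/recipient identifiers are illustrative assumptions:

import hmac, struct
from cryptography.hazmat.primitives.cmac import CMAC
from cryptography.hazmat.primitives.ciphers import algorithms

MAC_LEN = 8  # 64-bit MAC, as recommended above

def make_message(key, sender, recipient, msg_num, payload):
    low = struct.pack('>H', msg_num & 0xFFFF)   # only 2 counter bytes on the wire
    high = struct.pack('>Q', msg_num >> 16)     # high bytes are MACed but not sent
    c = CMAC(algorithms.AES(key))
    c.update(sender + recipient + high + low + payload)
    return low + payload + c.finalize()[:MAC_LEN]

def verify_message(key, sender, recipient, last_num, wire):
    low, payload, tag = wire[:2], wire[2:-MAC_LEN], wire[-MAC_LEN:]
    # Reconstruct the full counter: the smallest value > last_num ending in 'low'
    (low_val,) = struct.unpack('>H', low)
    candidate = (last_num & ~0xFFFF) | low_val
    if candidate <= last_num:
        candidate += 0x10000
    c = CMAC(algorithms.AES(key))
    c.update(sender + recipient + struct.pack('>Q', candidate >> 16) + low + payload)
    if hmac.compare_digest(c.finalize()[:MAC_LEN], tag):
        return candidate  # accept and store as the new last_num
    return None           # stale, corrupted, or forged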
If you do this, it's a good idea to have the devices regularly send synchronization messages that contain the full timestamp / message number. This allows them to recover e.g. from prolonged periods of message loss causing the truncated counter to wrap around. For schemes based on sequential message numbering, a typical synchronization message would include both the highest message number sent by the device so far as well as the lowest number they'll accept in return.
To guard against unexpected power loss, the message numbers should be regularly written to permanent storage, such as flash memory. Since you probably don't want to do this after every message, a common solution is to only save the number every, say, 1000 messages, and to add a safety margin of 1000 to the saved value (for the outgoing messages). You should also design your data storage patterns to avoid directly overwriting old data, both to minimize wear on the memory and to avoid data corruption if power is lost during a write. The details of this, however, are a bit outside the scope of this answer.
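A sketch of that save-with-margin pattern; the interval and the flash helpers are hypothetical:

SAVE_INTERVAL = 1000

def load_counter():
    # After a power loss, resume past any number that was used but never saved
    return read_counter_from_flash() + SAVE_INTERVAL   # hypothetical flash read

def next_counter(state):
    state.counter += 1
    if state.counter % SAVE_INTERVAL == 0:
        save_counter_to_flash(state.counter)           # hypothetical flash write
    return state.counter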
Ps. Of course, the MAC calculation should also always include the identities of the sender and the intended recipient, so that an attacker can't trick the devices by e.g. echoing a message back to its sender.

How long should a message header/prefix be?

I've worked with a few protocols, and written my own. I have written some message formats with only 1 char to identify the message, and some with 4 chars. I don't feel that I'm experienced enough to tell which is better, so I'm looking for an answer which describes in which scenario one might be better than the other.
For performance, you would imagine that sending 2 bytes (A%1i) is faster than sending 5 bytes (ABCD%1i). However, I have noticed that when writing the protocol with the 1-byte prefix, if you have a bug which causes your code to not read enough data from the socket, you might get garbage data coming into your system.
So is the purpose of a 4-byte prefix just to provide a guarantee that your message is clean? Is it worth it for the performance you sacrifice? Do you really sacrifice any performance at all? Maybe it's better to have a 2- or 3-byte prefix?
I'm not sure if this question should be specific to TCP, or whether it applies to all transport protocols. Advice on this would be interesting.
Update: For interest, I will mention that Synergy uses 4-byte message prefixes, so for a mouse move delta the header is the same size as the actual data. Some have suggested having just a 1- or 2-byte prefix to improve efficiency. I wonder what drawbacks this would have?
Update: Also, I wonder if only the handshake really matters, if you're worried about garbage data. Synergy has a long handshake (a few bytes), so are the 4-byte message prefixes needed? I made a protocol recently that has only a 1-byte handshake, and that turned out to be a bad idea, since incompatible protocols were spamming the system with bad data (off the back of this, I would recommend at least having a long handshake).
The purpose of the header is to make it easier to solve the frame synchronization problem (byte aligning in serial communication).
To synchronize, the receiver looks for anything in the data stream that "looks like" a start-of-message header.
If you have lots of different kinds of valid start-of-message headers, and all of them are 1 byte long, then you will inevitably get a lot of "false frame synchronizations" -- garbage from something that "looks like" a start-of-message header, but isn't.
It would be better to pick some other header that makes it "unlikely" that anything in the serial data stream "looks like" a valid start-of-message header.
It is inevitable that you will get garbage data coming into your system, no matter how you design the packet header.
Whatever you use to handle these other problems (such as occasional bit errors in the middle of the message) should also be adequate to handle the occasional "false frame synchronization" garbage.
In some systems, any bad data is quickly overwritten by fresh new good data, and if you blink you might never see the bad data.
Other systems need at least some sort of error detection in the footer to reject the bad data.
Yet other systems need to not only detect such errors, but somehow keep re-sending that message -- until both sides are convinced that an error-free version of that message has been successfully received.
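To make the resynchronization idea concrete, here is a sketch of a receiver that hunts for a 4-byte start-of-message marker and then validates a CRC footer; the marker value, 2-byte length field, and CRC-32 layout are illustrative choices, not any particular protocol:

import socket
import struct
import zlib

MAGIC = b'\xaaQRS'  # 4-byte marker chosen to be unlikely in the data stream

def read_exact(sock, n):
    buf = b''
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError('stream closed')
        buf += chunk
    return buf

def read_frame(sock):
    window = b''
    # Hunt for the magic, discarding garbage until the stream aligns
    while window != MAGIC:
        window = (window + read_exact(sock, 1))[-4:]
    (length,) = struct.unpack('>H', read_exact(sock, 2))
    payload = read_exact(sock, length)
    (crc,) = struct.unpack('>I', read_exact(sock, 4))
    if zlib.crc32(payload) != crc:
        raise ValueError('bad checksum: false synchronization or bit errors')
    return payload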
As Oleksi implied, in some systems the latency is not significantly different between sending a single binary bit (100 ms) and sending 10 bytes (102.4 ms).
So the advantage of using a tiny header (2.4% less latency!) may not be worth it compared to the advantages of using a more verbose header (easier debugging; easier to make backward-compatible and forward-compatible; easier to test the effect of minor changes "in isolation" without upgrading both sides in lockstep to a new protocol that is completely incompatible with the old one).
Perhaps you could get the best of both worlds by (a) keeping the verbose, easy-to-debug headers on messages that are so rarely used that the effect of tiny headers is too small to measure (which I suspect is nearly all messages), and (b) introducing a "tiny header" format for any kind of message where the effect of tiny headers is noticeably better, or at least measurable.
It looks like the Synergy protocol is flexible enough to add such a "tiny header" format in a way that is easily distinguishable from the other kinds of message headers.
I use Synergy between my laptop and a few desktop machines. I am glad someone is trying to make it even better.
The performance will depend on the content of the message you are sending. If your content is several kilobytes, it doesn't really matter how many bytes your header is. For now, I would choose the scheme that's easiest to work with, because the performance difference between sending one byte, or four bytes is going to be negligible compared to the actual data that you're sending.

Audio, AES CBC and IVs

I'm currently working on a VoIP project and have a question about the implementation of AES-CBC mode. I know that for instant messaging based on text communication, it's important to generate an IV for every message to prevent the first block from being guessable when it is redundant across the communication.
But is it useful to do the same with audio data? Since audio data is much more complex than clear text, I'm wondering whether it would be wise to generate an IV for each audio chunk (that would mean a lot of IVs per second, more than 40), or would this just slow everything down for nothing? Or should one IV generated at the start of the conversation be enough?
Thanks in advance,
Nolhian
You do not need to generate new IVs each time.
For example, in SSH and TLS only one IV is used for a whole data session, and rekeying is needed only after some gigabytes of data.
CBC requires a new IV for each message. However nobody said that you had to send a message in one go.
Consider SSL/TLS. The connection begins with a complex procedure (the "handshake") which results in a shared "master key" from which are derived symmetric encryption keys, MAC keys, and IVs. From that point and until the connection end (or new handshake), the complete data sent by the client to the server is, as far as CBC is concerned, one unique big message which uses, quite logically, a unique IV.
In more details, with CBC each block (of 16 bytes with AES) is first XORed with the previous encrypted block, then is itself encrypted. The IV is needed only for the very first block, since there is no previous block at that point. One way of seeing it is that each encrypted block is the IV for the encryption of what follows. When, as part of the SSL/TLS dialog, the client sends some data (a "record" in SSL speak), it remembers the last encrypted block of that record, to be used as IV for the next record.
In your case, I suppose that you have an audio stream to encrypt. You could handle it as SSL/TLS does, simply chopping the CBC stream between blocks. It has, however, a slight complication: usually, in VoIP protocols, some packets may be lost. If you receive a chunk of CBC-encrypted data and do not have the previous chunk, then you do not know the IV for that chunk (i.e. the last encrypted block of the previous chunk). You are then unable to properly decrypt the first block (16 bytes) of the chunk you receive. Whether recovery from that situation is easy or not depends on what data you are encrypting (in particular, with audio, what kind of compression algorithm you use). If that potential loss is a problem, then a workaround is to include the IV in each chunk: in CBC-speak, the last encrypted block of a chunk (in a packet) is repeated as first encrypted block in the next chunk (in the next packet).
Or, to state it briefly: you need an IV per chunk, but CBC generates these IV "naturally" because all the IV (except the very first) are blocks that you just encrypted.
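A sketch of that chaining in Python with the cryptography package; the class name is illustrative, and chunks are assumed to be block-aligned:

from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

class ChunkedCBC:
    # Carries the last ciphertext block forward as the next chunk's IV
    def __init__(self, key, iv):
        self.key = key
        self.iv = iv  # IV for the *next* chunk

    def encrypt_chunk(self, chunk):
        assert len(chunk) % 16 == 0  # audio chunks assumed AES-block-aligned
        enc = Cipher(algorithms.AES(self.key), modes.CBC(self.iv)).encryptor()
        ct = enc.update(chunk) + enc.finalize()
        self.iv = ct[-16:]  # each encrypted block is the IV for what follows
        return ct

If packets may be lost, the sender can instead prepend self.iv to each chunk, as described above.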
