Questions about Python 3.6 os.urandom/os.getrandom/secrets - Linux

Referring to the documentation for os and secrets:
os.getrandom(size, flags=0)
Get up to size random bytes. The function can return less bytes than requested.
getrandom() relies on entropy gathered from device drivers and other sources of environmental noise.
So does this mean it's from /dev/random?
On Linux, if the getrandom() syscall is available, it is used in blocking mode: block until the system urandom entropy pool is initialized (128 bits of entropy are collected by the kernel).
So, to ensure a kernel CSPRNG with a bad internal state is never used, I should use os.getrandom()? Since the function can return fewer bytes than requested, should I run the application-level CSPRNG as something like
import os

def rng():
    r = bytearray()
    while len(r) < 32:
        r += os.getrandom(1)
    return bytes(r)
to ensure maximum security? I explicitly want systems that do not support blocking until the urandom entropy pool is initialized to be unable to run the program, and systems that do support it to wait. This is because the software must be secure even if it is run from a live CD that has zero entropy at the start.
Or does the blocking mean that if I do os.getrandom(32), the program waits, if necessary forever, until the 32 bytes are collected?
The flags argument is a bit mask that can contain zero or more of the following values ORed together: os.GRND_RANDOM and GRND_NONBLOCK.
Can someone please ELI5 how this works?
os.urandom(size)
On Linux, if the getrandom() syscall is available, it is used in blocking mode: block until the system urandom entropy pool is initialized (128 bits of entropy are collected by the kernel).
So on older Linux kernel versions, urandom quietly falls back to a non-blocking CSPRNG that doesn't know its internal seeding state?
Changed in version 3.6.0: On Linux, getrandom() is now used in blocking mode to increase the security.
Does this have to do with os.getrandom()? Is it a lower level call? Are the two the same?
os.GRND_NONBLOCK
By default, when reading from /dev/random, getrandom() blocks if no random bytes are available, and when reading from /dev/urandom, it blocks if the entropy pool has not yet been initialized.
So it's the 0 flag in os.getrandom(size, flags=0)?
os.GRND_RANDOM
If this bit is set, then random bytes are drawn from the /dev/random pool instead of the /dev/urandom pool.
What does ORing the os.getrandom() flags mean? How does os.getrandom(flags=1) tell whether I meant to enable os.GRND_NONBLOCK or os.GRND_RANDOM? Or do I need to set it beforehand like this:
os.GRND_RANDOM = 1
os.getrandom(32) # or use the rng() defined above
secrets module
The secrets module is used for generating cryptographically strong random numbers suitable for managing data such as passwords, account authentication, security tokens, and related secrets.
The only clear way to generate random bytes is
secrets.token_bytes(32)
The secrets module provides access to the most secure source of randomness that your operating system provides.
So that should mean it's os.getrandom with a fallback to os.urandom? So it's not a good choice if you desire a 'graceful exit if the internal state cannot be evaluated'?
To be secure against brute-force attacks, tokens need to have sufficient randomness. Unfortunately, what is considered sufficient will necessarily increase as computers get more powerful and able to make more guesses in a shorter period. As of 2015, it is believed that 32 bytes (256 bits) of randomness is sufficient for the typical use-case expected for the secrets module.
Yet the blocking stops at 128 bits of internal state, not 256. Most symmetric ciphers have 256-bit versions for a reason.
So I should probably make sure the /dev/random is used in blocking mode to ensure internal state has reached 256 bits by the time the key is generated?
So tl;dr
What's the most secure way in Python3.6 to generate a 256-bit key on a Linux (3.17 or newer) live distro that has zero entropy in kernel CSPRNG internal state at the start of my program's execution?

After doing some research, I can answer my own question.
os.getrandom is a wrapper for the getrandom() syscall offered in Linux kernel 3.17 and newer. The flag is a number (0, 1, 2 or 3) that corresponds to the bit mask in the following way:
GETRANDOM with ChaCha20 DRNG
os.getrandom(32, flags=0)
GRND_NONBLOCK = 0 (=Block until the ChaCha20 DRNG seed level reaches 256 bits)
GRND_RANDOM = 0 (=Use ChaCha20 DRNG)
= 00 (=flag 0)
This is a good default to use with all Python 3.6 programs on all platforms (including live distros) when no backwards compatibility with Python 3.5 and pre-3.17 kernels is needed.
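For readability, the same numeric flags can be written with the named constants. A minimal sketch, assuming a Linux system where os.GRND_NONBLOCK is 1 and os.GRND_RANDOM is 2 (which is what the flag breakdowns in this answer rely on):
import os

# On Linux, os.GRND_NONBLOCK == 1 and os.GRND_RANDOM == 2, so ORing them
# reproduces the numeric flag values used in this answer.
assert os.GRND_NONBLOCK | os.GRND_RANDOM == 3

key = os.getrandom(32, flags=0)  # block until the ChaCha20 DRNG is fully seeded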
PEP 524 is incorrect when it claims:
On Linux, getrandom(0) blocks until the kernel initialized urandom with 128 bits of entropy.
According to page 84 of the BSI report, the 128-bit limit is used during boot time for callers of the kernel module's get_random_bytes() function, provided the calling code properly waits for the triggering of the add_random_ready_callback() function. (Not waiting means get_random_bytes() might return insecure random numbers.) According to page 112:
When reaching the state of being fully seeded and thus having the ChaCha20 DRNG seeded with 256 bits of entropy -- the getrandom system call unblocks and generates random numbers.
So, GETRANDOM() never returns random numbers until the ChaCha20 DRNG is fully seeded.
os.getrandom(32, flags=1)
GRND_NONBLOCK = 1 (=If the ChaCha20 DRNG is not fully seeded, raise BlockingIOError instead of blocking)
GRND_RANDOM = 0 (=Use ChaCha20 DRNG)
= 01 (=flag 1)
Useful if the application needs to do other tasks while it waits for the ChaCha20 DRNG to be fully seeded. The ChaCha20 DRNG is almost always fully seeded during boot time, so flags=0 is most likely a better choice. Needs the try-except logic around it.
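A minimal sketch of that try-except logic (the helper name and the decision to return None are illustrative, not part of the syscall's behavior):
import os

def try_getrandom(n=32):
    # Return n bytes, or None if the ChaCha20 DRNG is not yet fully seeded.
    try:
        return os.getrandom(n, os.GRND_NONBLOCK)
    except BlockingIOError:
        return None  # not seeded yet: do other work and call again later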
GETRANDOM with blocking_pool
The blocking_pool is also accessible via the /dev/random device file. The pool was designed with the idea in mind that entropy runs out. This idea applies only when trying to create one-time pads (which strive for information-theoretic security). The quality of entropy in the blocking_pool for that purpose is not clear, and the performance is really bad. For every other use, a properly seeded DRNG is enough.
The only situation where the blocking_pool might be more secure is with pre-4.17 kernels that have the CONFIG_RANDOM_TRUST_CPU flag set during compile time, and only if the CPU HWRNG happens to have a backdoor. Since in that case the ChaCha20 DRNG is initially seeded with the RDSEED/RDRAND instructions, a bad CPU HWRNG would be a problem. However, according to page 134 of the BSI report:
[As of kernel version 4.17] The Linux-RNG now considers the ChaCha20 DRNG fully seeded after it received 128 bit of entropy from the noise sources. Previously it was sufficient that it received at least 256 interrupts.
Thus the ChaCha20 DRNG wouldn't be considered fully seeded until entropy is also mixed in from the input_pool, which pools and mixes random events from all LRNG noise sources together.
When using os.getrandom() with flags 2 or 3, the entropy comes from the blocking_pool, which receives entropy from the input_pool, which in turn receives entropy from several additional noise sources. The ChaCha20 DRNG is also reseeded from the input_pool, so the CPU RNG does not have permanent control over the DRNG state. Once this happens, the ChaCha20 DRNG is as secure as the blocking_pool.
os.getrandom(32, flags=2)
GRND_NONBLOCK = 0 (=Return 32 bytes or less if entropy counter of blocking_pool is low. Block if no entropy is available.)
GRND_RANDOM = 1 (=Use blocking_pool)
= 10 (=flag 2)
This needs an external loop that runs the function and stores the returned bytes into a buffer until the buffer size is 32 bytes (a sketch of such a loop follows). The major problem here is that, due to the blocking behavior of the blocking_pool, obtaining the bytes needed might take a very long time, especially if other programs are also requesting random numbers from the same syscall or /dev/random. Another issue is that a loop using os.getrandom(32, flags=2) spends more time idle waiting for random bytes than it would with flag 3 (see below).
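A minimal sketch of such a collection loop (the function name is illustrative):
import os

def blocking_pool_bytes(n=32):
    # Accumulate n bytes from the blocking_pool; each call may return fewer bytes.
    buf = bytearray()
    while len(buf) < n:
        buf += os.getrandom(n - len(buf), os.GRND_RANDOM)
    return bytes(buf)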
os.getrandom(32, flags=3)
GRND_NONBLOCK = 1 (=return 32 bytes or less if entropy counter of blocking_pool is low. If no entropy is available, raise BlockingIOError instead of blocking).
GRND_RANDOM = 1 (=use blocking_pool)
= 11 (=flag 3)
Useful if the application needs to do other tasks while it waits for blocking_pool to have some amount of entropy. Needs the try-except logic around it plus an external loop that runs the function and stores returned bytes into a buffer until the buffer size is 32 bytes.
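A minimal sketch combining the try-except logic with the external loop (the sleep is only a placeholder for whatever other work the application would do while waiting):
import os
import time

def nonblocking_pool_bytes(n=32):
    # Accumulate n bytes from the blocking_pool without ever blocking inside the syscall.
    buf = bytearray()
    while len(buf) < n:
        try:
            buf += os.getrandom(n - len(buf), os.GRND_RANDOM | os.GRND_NONBLOCK)
        except BlockingIOError:
            time.sleep(0.1)  # no entropy available right now: do other work instead
    return bytes(buf)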
Other
open('/dev/urandom', 'rb').read(32)
Unlike GETRANDOM() with the ChaCha20 DRNG, reading from the /dev/urandom device file never blocks, to ensure backwards compatibility. There is no guarantee about the quality of the random numbers, which is bad. This is the least recommended option.
os.urandom(32)
os.urandom(n) provides best effort security:
Python3.6
On Linux 3.17 and newer, os.urandom(32) is the equivalent of os.getrandom(32, flags=0). On older kernels it quietly falls back to the equivalent of open('/dev/urandom', 'rb').read(32), which is not good.
os.getrandom(32, flags=0) should be preferred as it cannot fall back to an insecure mode.
Python3.5 and earlier
Always the equivalent of open('/dev/urandom', 'rb').read(32), which is not good. As os.getrandom() is not available, Python 3.5 should not be used.
secrets.token_bytes(32) (Python 3.6 only)
A wrapper for os.urandom(). The default token length is 32 bytes (256 bits). On Linux 3.17 and newer, secrets.token_bytes(32) is the equivalent of os.getrandom(32, flags=0). On older kernels it quietly falls back to the equivalent of open('/dev/urandom', 'rb').read(32), which is not good.
Again, os.getrandom(32, flags=0) should be preferred as it cannot fall back to an insecure mode.
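A hedged sketch of failing closed instead of silently degrading (the exact exception raised when getrandom() is unavailable depends on how Python was built, so both plausible cases are caught here):
import os
import sys

try:
    key = os.getrandom(32, flags=0)  # blocks until the ChaCha20 DRNG is fully seeded
except (AttributeError, OSError):
    # os.getrandom is missing or the getrandom() syscall is unavailable:
    # refuse to run rather than fall back to reading /dev/urandom.
    sys.exit("getrandom() not available: refusing to continue")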
tl;dr
Use os.getrandom(32, flags=0).
What about other RNG sources, random, SystemRandom() etc?
import random
random.<anything>()
is never safe for creating passwords, cryptographic keys etc.
import random
sys_rand = random.SystemRandom()
is safe for cryptographic use WITH EXCEPTIONS!
sys_rand.sample()
Generating a random password with sys_rand.sample(list_of_password_chars, password_length) is not safe because, to quote the documentation, the sample() method performs "random sampling without replacement". This means that each consecutive character in the password is guaranteed to differ from every previous character, which leads to passwords that are not uniformly random.
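A small illustration of the problem (the six-character alphabet is only for demonstration): sampling without replacement always yields a permutation, never a repeated character.
import random

sys_rand = random.SystemRandom()
chars = list('abcdef')
print(sys_rand.sample(chars, 6))  # always some ordering of all six characters,
                                  # never an output with a repeated character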
sys_rand.choices()
The sample() method performs random sampling without replacement; the choices() method performs random sampling with replacement. However, to quote the documentation on choices():
The algorithm used by choices() uses floating point arithmetic for internal consistency and speed. The algorithm used by choice() defaults to integer arithmetic with repeated selections to avoid small biases from round-off error.
The floating-point arithmetic used by choices() thus introduces cryptographically non-negligible biases into the sampled passwords. Therefore, random.choices() must not be used for password/key generation!
sys_rand.choice()
As per the previously quoted piece of documentation, the sys_rand.choice() method uses integer arithmetic as opposed to floating-point arithmetic, so generating passwords/keys with repeated calls to sys_rand.choice() is safe.
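A minimal sketch of password generation with repeated choice() calls (the alphabet and the 20-character length are illustrative choices):
import string
import random

sys_rand = random.SystemRandom()
alphabet = string.ascii_letters + string.digits + string.punctuation
password = ''.join(sys_rand.choice(alphabet) for _ in range(20))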
secrets.choice()
secrets.choice() is a wrapper for random.SystemRandom().choice(), and the two can be used interchangeably: they are the same thing.
The best-practice recipe for generating a passphrase with secrets.choice() is
import secrets
# On standard Linux systems, use a convenient dictionary file.
# Other platforms may need to provide their own word-list.
with open('/usr/share/dict/words') as f:
    words = [word.strip() for word in f]
    passphrase = ' '.join(secrets.choice(words) for i in range(4))
How can I ensure the generated passphrase meets some security level, e.g. 128 bits?
Here's a recipe for that
import math
import secrets
def generate_passphrase() -> str:
    PASSWORD_MIN_BIT_STRENGTH = 128  # Set desired minimum bit strength here
    with open('/usr/share/dict/words') as f:
        wordlist = [word.strip() for word in f]
    word_space = len(wordlist)
    word_count = math.ceil(math.log(2 ** PASSWORD_MIN_BIT_STRENGTH, word_space))
    passphrase = ' '.join(secrets.choice(wordlist) for _ in range(word_count))
    # pwd_bit_strength = math.floor(math.log2(word_space ** word_count))
    # print(f"Generated {pwd_bit_strength}-bit passphrase.")
    return passphrase

As #maqp suggested...
Using os.getrandom(32, flags=0) is the logical choice unless you're using the new secrets module AND the Linux kernel (3.17 or newer) does NOT fall back to open('/dev/urandom', 'rb').read(32).
Workaround: secrets on Python 3.5.x
I installed the secrets backport for Python 2 even though I'm running Python 3, and at a glance it works in a Python 3.5.2 environment. Perhaps if I get time, or someone else does, we can learn whether this one also falls back; I suppose it may if the Linux kernel is below a certain version.
pip install python2-secrets
Once that completes, you can import secrets just like you would with the Python 3 flavor.
Or just make sure to use Linux kernel 3.17 or newer. Knowing one's kernel is always good practice, but in reality we count on smart people like maqp to find and share these things. Great job.
Were we complacent, having a false sense of security?
1st, 2nd, 4th... where is the outrage? It's not about 'your' security; that would be selfish to assume. It's about the ability to spy on those who represent you in government, those who have skeletons and weaknesses (humans). Be sure to correct those selfish ones who say, "I got nothing to hide".
How Bad Was It?
The strength of encryption increases exponentially as the length of the key increases, so is it reasonable to assume that a reduction of at least half, say from 256 bits down to 128, would equate to a decrease in strength by factors of tens, hundreds, thousands or more? Did it make big brother's job pretty easy, or just a tiny bit easier? I am leaning towards the former.
The Glass Half Full?
Oh well, at least Linux is open source & we can see the insides for the most part. We still need chip hackers to find secret stuff, and the chips & hardware drivers are where you'll probably find stuff that will keep you from sleeping at night.

Related

Why does ECC signature verification need random numbers (sometimes taking a long time) in OpenSSL 1.1?

I was working on a Linux boot-time (kinit) signature checker using ECC certificates, changing over
from raw RSA signatures to CMS-format ECC signatures. In doing so, I found the
CMS_Verify() function stalling until the kernel printed "crng init done", indicating it needed to wait for there to be enough system entropy for cryptographically secure random number generation. Since nothing else is going on in the system, this took about 90 seconds on a Beaglebone Black.
This surprised me; I would have expected secure random numbers to be needed for certificate generation or maybe for signature generation, but there aren't any secrets to protect in public-key signature verification. So what gives?
(I figured it out but had not been able to find the solution elsewhere, so the answer is below for others).
Through a painstaking debug-by-printf process (my best option given it's a kinit), I found that a fundamental ECC operation uses random numbers as a defense against side-channel attacks. This is called "blinding" and helps prevent attackers from sussing out secrets based on how long computation takes, cache misses, power spikes, etc. by adding some indeterminacy.
From comments deep within the OpenSSL source:
/*-
* Computes the multiplicative inverse of a in GF(p), storing the result in r.
* If a is zero (or equivalent), you'll get a EC_R_CANNOT_INVERT error.
* Since we don't have a Mont structure here, SCA hardening is with blinding.
*/
int ec_GFp_simple_field_inv(const EC_GROUP *group, BIGNUM *r, const BIGNUM *a,
BN_CTX *ctx)
and that function goes on to call BN_priv_rand_range().
But in a public-key signature verification there are no secrets to protect. To solve the problem, in my kinit I just pre-seeded the OpenSSL random number generator with a fixed set of randomly-chosen data, as follows:
RAND_seed( "\xe5\xe3[...29 other characters...]\x9a", 32 );
DON'T DO THAT if your program works with secrets or generates any keys, signatures, or random numbers. In a signature-checking kinit it's OK. In a program that required more security I could have seeded with data from the on-chip RNG hardware (/dev/hw_random), or saved away some entropy in secure storage if I had any, or sucked it up and waited for crng init done.

Efficient decryption of AES/CFB8

I currently use this function to decrypt a data stream encrypted with AES in CFB8 mode:
https://github.com/Lazersmoke/civskell/blob/ebf4d761362ee42935faeeac0fe447abe96db0b5/src/Civskell/Tech/Encrypt.hs#L167-L175
cfb8Decrypt :: AES128 -> BS.ByteString -> BS.ByteString -> (BS.ByteString,BS.ByteString)
cfb8Decrypt c i = BS.foldl magic (BS.empty,i)
where
magic (ds,iv) d = (ds `BS.snoc` pt,ivFinal)
where
pt = BS.head (ecbEncrypt c iv) `xor` d
-- snoc on cipher always
ivFinal = BS.tail iv `BS.snoc` d
In case you don't understand Haskell, here's a quick rundown of how I believe this code works (I did not write it); a Python sketch of the same steps follows the list:
Given an IV and a list of encrypted bytes
For every encrypted byte:
Encrypt the IV in ECB-mode.
Take the first byte of the encrypted IV and xor it with the encrypted byte. This is the next plaintext byte.
Remove the first byte from the IV
Append the encrypted byte to the IV
The next character will be decrypted using this new IV
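A Python sketch of those same steps, assuming the third-party cryptography package for the raw AES block encryption (the function and variable names are illustrative):
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def cfb8_decrypt(key: bytes, iv: bytes, ciphertext: bytes) -> bytes:
    aes_ecb = Cipher(algorithms.AES(key), modes.ECB(), backend=default_backend()).encryptor()
    shift_register = bytearray(iv)
    plaintext = bytearray()
    for ct_byte in ciphertext:
        # Encrypt the current register and xor its first byte with the ciphertext byte.
        keystream_byte = aes_ecb.update(bytes(shift_register))[0]
        plaintext.append(keystream_byte ^ ct_byte)
        # Drop the oldest byte of the register and append the ciphertext byte.
        shift_register = shift_register[1:] + bytes([ct_byte])
    return bytes(plaintext)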
Note that the ECB-mode encryption is handled by the cryptonite library. I could not find a library supporting CFB8.
Now, this works. However, with the amount of data I need to decrypt, it caps out one of my CPU cores and 80% of the time is just spent on decrypting.
The incoming data is not even that much, so this is not acceptable. Unfortunately, my knowledge of cryptography is rather limited and resources on CFB8 seem rather sparse. It appears that CFB8 is an uncommon mode of operation, also indicated by the lack of library support.
So, my question then is: How would I go about optimising this?
The incoming data is from a TCP stream, but the information is grouped into packets. The cfb8Decrypt function is called 2-5 times per packet, depending on the size. This is necessary, because the length of the packet is transmitted at the beginning, but the length of this size information is variable. After 1-4 decryptions are used to decrypt the length, the entire packet will be decrypted at once. I thought about trying to reduce this, but I am unsure if it would have any effect on speed at all.
Edit: Profiling results: http://svgur.com/i/40b.svg
CFB8 was created to have good error-propagation properties over a noisy channel. It is well known that it is not fast; it is actually 16 times as slow as regular CFB, since it requires a block encryption for each byte. Nowadays it is not used much, as we tend to use CRCs at the data layer and a MAC for integrity at the cryptographic level against willful attacks.
How can you speed it up? The only thing you can really do is to use a fast library. The library you are currently using seems to have support for AES-NI, so make sure that is enabled on your CPU and BIOS.
However, it is very likely that it won't speed up much if you have to call it block by block. You really want to use a native call that takes the whole packet and decrypts it. AES-NI at its slowest, on an Atom implementing TLS, still reaches about 20 MiB/s, but on server chips AES-NI often goes far beyond 1 GiB/s. Assembly or optimized C should be about 6 to 7 times as slow when AES-NI is not available.
Functional programming languages like Haskell are not really built for fast I/O or fast bit operations. So you can bet that it will be much, much slower than e.g. Java or C#, and those are already much slower than native code, let alone assembly code or specialized instructions.
Memory nowadays is pretty fast; CPUs are, however, much faster. So spurious memory allocations and copying should be avoided (again, not that easy to do in a purely functional language, which is all the more reason to do as much as possible in native code). Do, however, make sure that there are no buffer overflow issues, or you will have fast AES/CFB8 within an insecure application.
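For comparison, here is what the one-call-per-packet approach looks like, sketched in Python with the cryptography package, which happens to expose CFB8 as a ready-made mode (the Haskell equivalent would be whatever native CFB8 implementation or binding is available):
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def cfb8_decrypt_packet(key: bytes, iv: bytes, ciphertext: bytes) -> bytes:
    # One native call processes the whole packet instead of one block encryption per byte.
    decryptor = Cipher(algorithms.AES(key), modes.CFB8(iv), backend=default_backend()).decryptor()
    return decryptor.update(ciphertext) + decryptor.finalize()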

What does the IS_ALIGNED macro in the linux kernel do?

I've been trying to read the implementation of a kernel module, and I'm stumbling on this piece of code.
unsigned long addr = (unsigned long) buf;
if (!IS_ALIGNED(addr, 1 << 9)) {
    DMCRIT("#%s in %s is not sector-aligned. I/O buffer must be sector-aligned.", name, caller);
    BUG();
}
The IS_ALIGNED macro is defined in the kernel source as follows:
#define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0)
I understand that data has to be aligned along the size of a datatype to work, but I still don't understand what the code does.
It left-shifts 1 by 9, then subtracts 1, which gives 111111111 (nine ones). Then 111111111 is bitwise-ANDed with x.
Why does this code work? How is this checking for byte alignment?
In systems programming it is common to need a memory address to be aligned to a certain number of bytes -- that is, several lowest-order bits are zero.
Basically, IS_ALIGNED(addr, 1 << 9) checks whether addr is on a 512-byte (2^9) boundary, i.e. whether its last 9 bits are zero, and the ! makes the branch fire when the buffer is misaligned. This is a common requirement when erasing flash locations because flash memory is split into large blocks which must be erased or written as a single unit.
Another application of this I ran into: I was working with a certain DMA controller which has a modulo feature. Basically, that means you can allow it to change only the last several bits of an address (the destination address in this case). This is useful for protecting memory from mistakes in the way you use a DMA controller. Problem is, I initially forgot to tell the compiler to align the DMA destination buffer to the modulo value. This caused some incredibly interesting bugs (random variables that have nothing to do with the thing using the DMA controller being overwritten... sometimes).
As far as "how does the macro code work?": if you subtract 1 from a number that ends in all zeroes, you get a number that ends in all ones. For example, 0b00010000 - 0b1 = 0b00001111. This is a way of creating a binary mask from the integer number of required-alignment bytes. This mask has ones only in the bits we are interested in checking for zero. After we AND the address with this mask, we get 0 if and only if the lowest 9 (in this case) bits are zero.
"Why does it need to be aligned?": This comes down to the internal makeup of flash memory. Erasing and writing flash is a much less straightforward process than reading it, and it typically requires higher-than-logic-level voltages to be supplied to the memory cells. The circuitry required to make write and erase operations possible with one-byte granularity would waste a great deal of silicon real estate only to be used rarely. Basically, designing a flash chip is a statistics and trade-off game (like anything else in engineering), and the statistics work out such that writing and erasing in groups gives the best bang for the buck.
At no extra charge, I will tell you that you will be seeing a lot of this type of thing if you are reading driver and kernel code. It may be helpful to familiarize yourself with the contents of this article (or at least keep it around as a reference): https://graphics.stanford.edu/~seander/bithacks.html
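The same mask trick, written out in Python rather than C purely for illustration:
def is_aligned(addr: int, alignment: int) -> bool:
    # alignment must be a power of two; alignment - 1 is then a mask of the low-order bits.
    return (addr & (alignment - 1)) == 0

assert is_aligned(0x1000, 1 << 9)       # 4096 is a multiple of 512
assert not is_aligned(0x1001, 1 << 9)   # 4097 has a low-order bit set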

High Speed Serial

I have a system which uses a UART clocked at 26 MHz. This is a 16850 UART on an i86 architecture. I have no problems accessing the port. The largest incoming message is about 56 bytes, the largest outgoing about 100. The baud rate divisor needs to be 1, so setserial /dev/ttyS4 baud_base 115200 is OK, and I open at 115200. There is no flow control. Specifying the part as a 16850 does NOT set the FIFOs to their deep mode; I was losing bytes. All the data is bytes (unsigned char).
I wrote a routine that uses ioperm to set the deep FIFOs to 64, and now reads/writes work, meaning that the deep FIFOs are NOT being enabled by serial_core.c or 8250.c, at least not to their full depth.
With the deep FIFO set by this brute-force method, after open(fd, "/dev/ttyS4", NO_BLOCKING, etc. I reliably get the correct number of bytes, but I tend to get the same word missing a bit. Not a byte, a bit.
All this same stuff runs fine under DOS so it is not a hardware issue.
I have opened the port for raw, no delays, no parity, 8 data bits, 2 stop bits.
Has anyone seen issues reading serial ports at relatively high speeds with short bursts of data?
Yes, I have tried custom baud rates, etc. The FIFO levels made the biggest improvement. This is an ISA bus card using IRQ7.
It appears the serial driver for Linux sucks and has way too much latency and far too many features for really basic raw operation.
Has anyone else tried very high-speed data without flow control, or had similar issues? As I stated, I get the correct number of bytes and all the data is correct except 1 bit in byte 4.
I am pretty stumped.

Securely transmit commands between PIC microcontrollers using nRF24L01 module

I have created a small wireless network using a few PIC microcontrollers and nRF24L01 wireless RF modules. One of the PICs is a PIC18F46K22, and it is used as the main controller which sends commands to all other PICs. All other (slave) microcontrollers are PIC16F1454; there are 5 of them so far. These slave controllers are attached to various devices (mostly lights). The main microcontroller is used to transmit commands to those devices, such as turning lights on or off. The slave controllers also report the status of the attached devices back to the main controller, which then displays it on an LCD screen. This whole setup is working perfectly fine.
The problem is that anybody who has these cheap nRF24L01 modules could simply listen to the commands which are being sent by the main controller and then repeat them to control the devices.
Encrypting the commands wouldn't help on its own: these are simple instructions, so when encrypted they will always look the same, and one does not need to decrypt a message to be able to retransmit it.
So how would I implement a level of security in this system?
What you're trying to do is to prevent replay attacks. The general solution to this involves two things:
Include a timestamp and/or a running message number in all your messages. Reject messages that are too old or that arrive out of order.
Include a cryptographic message authentication code in each message. Reject any messages that don't have the correct MAC.
The MAC should be at least 64 bits long to prevent brute force forgery attempts. Yes, I know, that's a lot of bits for small messages, but try to resist the temptation to skimp on it. 48 bits might be tolerable, but 32 bits is definitely getting into risky territory, at least unless you implement some kind of rate limiting on incoming messages.
If you're also encrypting your messages, you may be able to save a few bytes by using an authenticated encryption mode such as SIV that combines the MAC with the initialization vector for the encryption. SIV is a pretty nice choice for encrypting small messages anyway, since it's designed to be quite "foolproof". If you don't need encryption, CMAC is a good choice for a MAC algorithm, and is also the MAC used internally by SIV.
Most MACs, including CMAC, are based on block ciphers such as AES, so you'll need to find an implementation of such a cipher for your microcontroller. A quick Google search turned up this question on electronics.SE about AES implementations for microcontrollers, as well as this blog post titled "Fast AES Implementation on PIC18F4550". There are also small block ciphers specifically designed for microcontrollers, but such ciphers tend to be less thoroughly analyzed than AES, and may harbor security weaknesses; if you can use AES, I would. Note that many MAC algorithms (as well as SIV mode) only use the block cipher in one direction; the decryption half of the block cipher is never used, and so need not be implemented.
The timestamp or message number should be long enough to keep it from wrapping around. However, there's a trick that can be used to avoid transmitting the entire number with each message: basically, you only send the lowest one or two bytes of the number, but you also include the higher bytes of the number in the MAC calculation (as associated data, if using SIV). When you receive a message, you reconstruct the higher bytes based on the transmitted value and the current time / last accepted message number and then verify the MAC to check that your reconstruction is correct and the message isn't stale.
If you do this, it's a good idea to have the devices regularly send synchronization messages that contain the full timestamp / message number. This allows them to recover e.g. from prolonged periods of message loss causing the truncated counter to wrap around. For schemes based on sequential message numbering, a typical synchronization message would include both the highest message number sent by the device so far as well as the lowest number they'll accept in return.
To guard against unexpected power loss, the message numbers should be regularly written to permanent storage, such as flash memory. Since you probably don't want to do this after every message, a common solution is to only save the number every, say, 1000 messages, and to add a safety margin of 1000 to the saved value (for the outgoing messages). You should also design your data storage patterns to avoid directly overwriting old data, both to minimize wear on the memory and to avoid data corruption if power is lost during a write. The details of this, however, are a bit outside the scope of this answer.
Ps. Of course, the MAC calculation should also always include the identities of the sender and the intended recipient, so that an attacker can't trick the devices by e.g. echoing a message back to its sender.
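A minimal sketch of the framing described above, written in Python with the cryptography package's AES-CMAC (the field sizes, byte layout, and function name are illustrative assumptions, not part of the answer):
import struct
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives.cmac import CMAC
from cryptography.hazmat.primitives.ciphers import algorithms

def build_frame(key: bytes, sender: int, recipient: int, counter: int, payload: bytes) -> bytes:
    # Only the low 16 bits of the counter are transmitted...
    header = struct.pack('>BBH', sender, recipient, counter & 0xFFFF)
    # ...but the MAC covers the identities, the full counter and the payload,
    # so a stale or incorrectly reconstructed counter fails verification on the receiver.
    mac = CMAC(algorithms.AES(key), backend=default_backend())
    mac.update(struct.pack('>BBQ', sender, recipient, counter) + payload)
    tag = mac.finalize()[:8]   # truncated to the 64-bit MAC recommended above
    return header + payload + tag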
