Encryption algorithm in 4k memory - security

I am working on some hardware devices for domotic (home automation) applications. I now need to implement an algorithm to keep intruders out of the network formed by the devices. The only restriction I have is that each device has 4k of memory. Any suggestions for what I could implement?

4k is a huge amount of memory for a cryptographic implementation. A block cipher such as AES-128 encrypts 16-byte blocks at a time, and any mode of operation can be performed (nearly) in place, so you don't need to make a copy of the data you are encrypting/decrypting. CBC is a good choice of mode.
Symmetric cryptography is very lightweight; I am not sure why memory or even CPU requirements would be a concern. That would have been a concern 30 or more years ago, but certainly not when you have 4,096 bytes of memory!
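To make the in-place point concrete, here is a minimal sketch of CBC encryption in C. aes128_encrypt_block is a hypothetical stand-in for whatever compact AES-128 implementation you link in; the only extra state CBC needs is a 16-byte chaining value.

    #include <stdint.h>
    #include <string.h>

    #define BLOCK 16

    /* Hypothetical stand-in for your AES-128 block primitive;
     * link in any compact implementation here. */
    void aes128_encrypt_block(const uint8_t key[16], uint8_t block[16]);

    /* Encrypt buf in place with CBC. len must be a multiple of 16.
     * The only extra state is the 16-byte chaining value. */
    void cbc_encrypt_inplace(const uint8_t key[16], const uint8_t iv[16],
                             uint8_t *buf, size_t len)
    {
        uint8_t chain[BLOCK];
        memcpy(chain, iv, BLOCK);
        for (size_t off = 0; off < len; off += BLOCK) {
            for (int i = 0; i < BLOCK; i++)
                buf[off + i] ^= chain[i];         /* XOR with previous ciphertext (or IV) */
            aes128_encrypt_block(key, buf + off); /* encrypt the block in place */
            memcpy(chain, buf + off, BLOCK);      /* ciphertext becomes next chain value */
        }
    }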

Related

Storing and writing to text file in STM32 F4 internal flash memory

I need to store a text file in the STM32F446RE's internal flash memory. This text file will contain log data that needs to be written and updated frequently. I know there are a couple of ways of doing this, including embedding the text as constant string data in the source code, or implementing a file system such as FatFs (not well suited to the STM32F4's flash because of its sector organization). The flash has eight sectors that vary in size: sectors 0-3 are 16 kB each, sector 4 is 64 kB, and sectors 5-7 are 128 kB each, for a total of 512 kB. These options are not sufficient for what I'm looking for, and I was wondering if anyone has ideas? I'm using STM32CubeIDE.
Writing to FLASH memory requires an erase operation first. It seems that you already know that erase operations must be performed on whole sectors. Note also that FLASH memory wears out with repeated erase/write cycles.
I suggest one of three approaches depending upon how much data you must store and your coding abilities.
An "in-chip" approach is to implement a circular buffer in RAM and maintain your log there. If power is lost then you need code to commit that RAM buffer to FLASH. On power-up you need code to restore the RAM buffer from FLASH. This implies that your design does not suffer frequent power cycles and that you can maintain power to the microcontroller long enough to save the buffer from RAM to FLASH.
The next option is to use an external memory chip. EEPROMs are not terribly fast and are also subject to wear. FRAM is fast and survives trillions of writes before wear becomes an issue. It is available with I2C or SPI interfaces, so you can use a number of chips to provide a reasonably sized buffer and handle the chip-to-memory mapping in your handler code. FRAM is not cheap, though.
Finally, there is the option of adding an SSD. These devices include wear levelling to extend their working lives. However, you need a suitable interface, such as USB or PCIe.
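To illustrate the first approach, a minimal sketch. The flash_* stubs here are hypothetical; on the F4 you would implement them with the HAL flash erase/program routines or your own register-level code.

    #include <stdint.h>
    #include <string.h>

    #define LOG_SIZE 4096u              /* RAM log buffer size */

    static uint8_t  log_buf[LOG_SIZE];
    static uint32_t log_head;           /* next write position */

    /* Hypothetical stubs: implement with HAL_FLASHEx_Erase() /
     * HAL_FLASH_Program() or register-level code. */
    void flash_erase_sector(uint32_t sector);
    void flash_program(uint32_t addr, const uint8_t *data, uint32_t len);

    /* Append a message to the circular RAM buffer. */
    void log_append(const char *msg)
    {
        for (size_t i = 0, n = strlen(msg); i < n; i++) {
            log_buf[log_head] = (uint8_t)msg[i];
            log_head = (log_head + 1u) % LOG_SIZE;   /* wrap around */
        }
    }

    /* Commit the RAM buffer to one flash sector; call from a brown-out
     * or power-fail interrupt while the supply rails still hold. */
    void log_commit(uint32_t sector, uint32_t sector_addr)
    {
        flash_erase_sector(sector);
        flash_program(sector_addr, log_buf, LOG_SIZE);
        flash_program(sector_addr + LOG_SIZE, (const uint8_t *)&log_head,
                      sizeof log_head);
    }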
HTH

How do modern CPUs handle cross-page unaligned access?

I'm trying to understand how unaligned memory access (UMA) works on modern processors (namely x86-64 and ARM architectures). I understand that UMA can cause problems ranging from performance degradation to a CPU fault, and I've read about posix_memalign and cache lines.
What I cannot find is how modern systems/hardware handle the situation where my request crosses a page boundary.
Here is an example:
I malloc() an 8KB chunk of memory.
Let's say that malloc() doesn't have enough memory and sbrk()s 8KB for me.
The kernel gets two memory pages (4KB each) and maps them into my process's virtual address space (let's say that these two pages are not adjacent in physical memory).
With movq (offset + $0xffc), %rax I request 8 bytes starting at byte 4092, meaning that I want 4 bytes from the end of the first page and 4 bytes from the beginning of the second page.
Physical memory:

    ---|---------------|---------------|-->
       |        ... 4b | 4b ...        |
I need 8 bytes that are split at the page boundaries.
How do the MMUs on x86-64 and ARM handle this? Are there any mechanisms in the kernel's memory management to somehow prepare for this kind of request? Is there some kind of protection in malloc? What do processors do? Do they fetch two pages?
I mean, to complete such a request the MMU has to translate one virtual address range into two physical addresses. How does it handle that?
Should I care about such things if I'm a software programmer and why?
I'm reading a lot of links from Google and SO, Drepper's cpumemory.pdf, and Gorman's Linux VMM book at the moment, but it's an ocean of information. It would be great if you could at least provide me with some pointers or keywords to use.
I'm not overly familiar with the guts of the Intel architecture, but the ARM architecture sums this specific detail up in a single bullet point under "Unaligned data access restrictions":
An operation that performs an unaligned access can abort on any memory access that it makes, and can abort on more than one access. This means that an unaligned access that occurs across a page boundary can generate an abort on either side of the boundary.
So other than the potential to generate two page faults from a single operation, it's just another unaligned access. Of course, that still assumes all the caveats of "just another unaligned access" - namely it's only valid on normal (not device) memory, only for certain load/store instructions, has no guarantee of atomicity and may be slow - the microarchitecture will likely synthesise an unaligned access out of multiple aligned accesses[1], which means multiple MMU translations, potentially multiple cache misses if it crosses a line boundary, etc.
Looking at it the other way, if an unaligned access doesn't cross a page boundary, all that means is that if the aligned address for the first "sub-access" translates OK, the aligned addresses of any subsequent parts are sure to hit in the TLB. The MMU itself doesn't care - it just translates some addresses that the processor gives it. The kernel doesn't even come into the picture unless the MMU raises a page fault, and even then it's no different from any other page fault.
I've had a quick skim through the Intel manuals and their answer hasn't jumped out at me - however in the "Data Types" chapter they do state:
[...] the processor requires two memory accesses to make an unaligned access; aligned accesses require only one memory access.
so I'd be surprised if it wasn't broadly the same (i.e. one translation per aligned access).
Now, this is something most application-level programmers shouldn't have to worry about, provided they behave themselves - outside of assembly language, it's actually quite hard to make unaligned accesses happen. The likely culprits are type-punning pointers and messing with structure packing, both things that 99% of the time one has no reason to go near, and for the other 1% are still almost certainly the wrong thing to do.
[1] The ARM architecture pseudocode actually specifies unaligned accesses as a series of individual byte accesses, but I'd expect implementations actually optimise this into larger aligned accesses where appropriate.
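If you are curious, you can construct such an access yourself. A minimal sketch under Linux: memcpy is the portable way to express an unaligned load, and compilers typically turn it into a single unaligned load instruction on x86-64/ARMv8.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        long page = sysconf(_SC_PAGESIZE);      /* typically 4096 */
        /* Two contiguous virtual pages; the physical frames behind
         * them may be anywhere. */
        uint8_t *p = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;
        memset(p, 0xAB, 2 * page);

        /* 8-byte load starting 4 bytes before the page boundary:
         * 4 bytes from the first page, 4 from the second. */
        uint64_t v;
        memcpy(&v, p + page - 4, sizeof v);
        printf("cross-page value: 0x%016llx\n", (unsigned long long)v);

        munmap(p, 2 * page);
        return 0;
    }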
So the architecture doesn't really matter much, other than that x86 has traditionally tolerated unaligned accesses, whereas MIPS and ARM traditionally generate a data abort rather than trying to just make it work.
Where it doesn't matter is that all processors have a fixed number of pins, a fixed (maximum) data bus width, and a fixed (maximum) address bus width. "Modern processors" tend to have data buses more than 8 bits wide, but the unit of addressing is still the 8-bit byte, so the opportunity for unaligned accesses exists. Anything larger than one byte in a particular transfer has the opportunity to be unaligned, if the architecture allows it.
Transfers are typically in some units of bytes and/or bus widths. On an ARM AMBA/AXI bus, for example, the length field is in units of bus widths: 32 or 64 bits, i.e. 4 or 8 bytes. And no, it is not going to be in units of 4 kB.
(Yes, this is elementary; I assume you understand all of this.)
Whether the bus is 16 bits or 128 bits wide, the penalty for an unaligned access comes from the additional bus cycles, which these days means an extra bus clock per transfer. So for an ARM 16-bit unaligned transfer (which ARM will support on its newer cores without faulting), you need to read 128 bits instead of 64; reading 64 bits to get 16 is not a penalty, as 64 is the smallest size for a bus transfer. Each transfer, whether it is a single width of the data bus or multiple widths, has multiple clock cycles associated with it. Say there are 6 clock cycles to do an aligned 16-bit read; then ideally it is 7 cycles to do an unaligned one. That seems small, but it adds up.
Caches help a lot, because the DRAM side of the cache will be set up to use multiples of the bus width and will always do aligned accesses for cache fetches and evictions. Non-cached accesses follow the same pattern, except that on the DRAM side the overhead is not a handful of clocks but dozens to hundreds of clocks.
For random access, a single 16-bit read that not only spans a bus-width boundary but also happens to cross a cache-line boundary incurs not just the one additional clock on the processor side, but in the worst case an additional cache-line fetch, which is dozens to hundreds of additional clock cycles. If you were walking through an array of items that happen to not be aligned (structures/unions may be an example, depending on the compiler and code), that additional cache-line fetch would have happened anyway; but if the array spills over a little on one or both ends, you might still incur one or two more cache-line fetches that you would have avoided had the array been aligned.
That is really the key on reads: before or after an aligned region, you may have to incur one extra transfer for each side you spill into.
Writes are both good and bad. Random reads are slower because the transaction has to stall until the answer comes back. For a random write, the memory controller has all the information it needs: the address, data, byte mask, transfer type, etc. So it is fire-and-forget; the processor has done its job, can call the transaction complete from its perspective, and can move on. Naturally, queue up too many of these, or do a read of something just written, and the processor stalls waiting for the completion of a prior write in addition to the current transaction.
An unaligned 16-bit write, for example, not only incurs the additional read cycle but, assuming a 32- or 64-bit-wide bus, leaves one byte per location, so you have to do a read-modify-write on whatever the closest memory is (cache or DRAM). Depending on how the processor and memory controller implement it, that can be two individual read-modify-write transactions (unlikely, since that incurs twice the overhead), or a double-width read, modify both parts, and a double-width write, incurring two additional clocks over and above the overhead (and the overhead is doubled as well). Had it been an aligned bus-width write, no read-modify-write would be required; you save the read. If this read-modify-write hits in the cache, it is pretty fast, but still noticeable, up to a few clocks depending on what is queued up ahead of you.
I am also most familiar with ARM. ARM traditionally punished an unaligned access with an abort; you could turn that off, and you would instead get a rotation of the bus rather than a spill-over, which made for some nice freebie endian swaps. The more modern ARM cores will tolerate and implement an unaligned transfer. Understand, for example, that a store multiple of, say, 4 or more registers against a non-64-bit-aligned address is not considered an unaligned access, even though it is a 128-bit write to an address that is neither 64- nor 128-bit aligned. What the processor does in that case is break it into 3 writes: an aligned 32-bit write, an aligned 64-bit write, and an aligned 32-bit write. The memory controller never has to deal with anything unaligned. That is for legal things like store multiple. The core I am familiar with won't do a write burst longer than 2 anyway; an 8-register store multiple is not a single length-4 write but 2 separate length-2 writes. A load multiple of 8 registers, though, is a single length-4 transaction, so long as it is aligned on a 64-bit address. I am pretty sure that, since there is no byte masking on the bus side for a read (everything is in units of bus width), there is no reason to break, say, a 4-register load multiple at an address that is not 64-bit aligned into 3 transactions; it can simply do a length-3 read. When the processor reads a single byte, you can't tell that from the bus; all you see is a 64-bit read, AFAIK, and the processor strips out the byte lane. If the processor/bus does care, be it ARM, x86, MIPS, etc., then sure, you will hopefully see separate transfers.
Does everyone do this? No; older processors (thinking of neither ARM nor x86 here) would put more of the burden on the memory controller. I don't know what modern x86, MIPS, and such do.
Your malloc example: first off, you are not going to see single bus transfers of 4 kB; that 4k will be broken up into digestible pieces anyway. Also, an access has to make one to many bus cycles against the memory management unit to find the physical address and other properties (those answers can be cached to make them faster, but sometimes the walk has to go all the way out to slow DRAM). So for that example, the only transfer that matters is an aligned transfer that splits the 4k boundary, say a 16-bit transfer. For the MMU system to work at all, the only way for that to be supported is to turn it into two separate 8-bit transfers that happen in those physical address spaces, and yes, that literally doubles everything: the MMU lookup cycles, the cache/DRAM bus cycles, etc. Other than that boundary, there is nothing special about your 8k being split. The bulk of your cycles will be within one of the two 4k pages, so they look like any other random access, with repetitive/sequential accesses of course gaining the benefit of caching.
The short answer is that no matter what platform you are on, either 1) the platform will abort an unaligned transfer, or 2) somewhere in the path there will be one or more additional clock cycles (possibly dozens to hundreds) as a result of the unaligned access compared to an aligned one.
It doesn't matter whether the physical pages are adjacent or not. Modern CPUs use caches. Data is transferred to/from DRAM a full cache-line at a time. Thus, DRAM will never see a multi-byte read or write that crosses a 64B boundary, let alone a page boundary.
Stores that cross a page boundary are still slow (on modern x86). I assume the hardware handles the page-split case by detecting it at some later pipeline stage and triggering a re-do that does two TLB checks. I don't know whether Intel's designs insert extra uops into the pipeline to handle it, or what the impact is on the latency and throughput of page-splits, on the throughput of all memory accesses, and on that of other (e.g. non-memory) uops.
Normally there's no penalty at all for unaligned accesses within a cache-line (since about Nehalem), and a small penalty for cache-line splits that aren't page-splits. An even split is apparently cheaper than others. (e.g. a 16B load that takes 8B from one cache line and 8B from another).
Anyway, DRAM will never see an unaligned access directly. AFAIK, no sane modern design has only write-through caches, so DRAM only sees writes when a cache-line is flushed, at which point the fact that one unaligned access dirtied two cache lines is not available. Caches don't even record which bytes are dirty; they just burst-write the whole 64B to the next level down (or last-level to DRAM) when needed.
There are probably some CPU designs that don't work this way, but Intel's and AMD's designs do.
Caveat: loads/stores to uncacheable memory regions might produce smaller stores, but probably still only within a single cache line. (On x86, this probably applies to MOVNT non-temporal stores, which use write-combining store buffers but otherwise bypass the cache.)
Uncacheable unaligned stores that cross a page boundary are probably still split into separate stores (because each part needs a separate TLB translation).
Caveat 2: I didn't fact-check this. I'm certain about the whole-cache-line aligned access to DRAM for "normal" loads/stores to "normal" memory regions, though.
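If you want to see these penalties on your own machine, here is a rough timing sketch. Results, and even the presence of a penalty, vary widely by microarchitecture; treat it as a probe, not a benchmark suite.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <time.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Time `iters` 8-byte loads from `addr`; the volatile pointer forces
     * one real (possibly unaligned, possibly page-split) load per pass. */
    static double time_loads(const void *addr, long iters)
    {
        const volatile uint64_t *q = addr;
        struct timespec t0, t1;
        uint64_t sink = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++)
            sink += *q;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        if (sink == 1) putchar('\n');   /* keep `sink` observable */
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        long page = sysconf(_SC_PAGESIZE);  /* typically 4096 */
        long iters = 100000000L;
        uint8_t *p = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;
        memset(p, 1, 2 * page);
        printf("aligned:    %.3fs\n", time_loads(p + 64, iters));
        printf("line-split: %.3fs\n", time_loads(p + 61, iters));       /* crosses a 64B line */
        printf("page-split: %.3fs\n", time_loads(p + page - 4, iters)); /* crosses the page */
        munmap(p, 2 * page);
        return 0;
    }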

How can I exhaust /dev/urandom for testing?

I recently had a bug where I didn't properly handle the case where the entropy on my Linux server got too low and a read of /dev/urandom returned fewer bytes than expected.
How can I recreate this with a test? Is there a way to lower the entropy on a system or to reliably empty /dev/urandom?
I'd like to be able to have a regression test that will verify my fix. I'm using Ubuntu 12.04.
According to the random(4) man page, a "read from the /dev/urandom device will not block".
You should read a lot of bytes from /dev/random (without any "u") if you want it to block. (How many is hardware- and system-dependent.)
So you cannot "exhaust" /dev/urandom, since:

    A read from the /dev/urandom device will not block waiting for
    more entropy. As a result, if there is not sufficient entropy in
    the entropy pool, the returned values are theoretically vulnerable
    to a cryptographic attack on the algorithms used by the driver.
I believe you should use /dev/random, which indeed can be exhausted: it blocks when the kernel's entropy estimate runs out. But you should not read more than about 256 bits from it.
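As for the original bug, the robust fix is a read loop that tolerates short reads and EINTR whichever device is used. A sketch follows; pointed at /dev/random, it also serves to drain the blocking pool for a test (assuming the blocking-pool semantics of pre-4.8 kernels, such as Ubuntu 12.04's).

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Read exactly n bytes, retrying on short reads and EINTR. */
    ssize_t read_full(int fd, void *buf, size_t n)
    {
        size_t got = 0;
        while (got < n) {
            ssize_t r = read(fd, (char *)buf + got, n - got);
            if (r < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            if (r == 0)
                break;   /* EOF: shouldn't happen for these devices */
            got += r;
        }
        return (ssize_t)got;
    }

    int main(void)
    {
        unsigned char buf[4096];
        int fd = open("/dev/random", O_RDONLY);  /* the blocking pool */
        if (fd < 0)
            return 1;
        /* Draining loop: keep reading until the pool runs dry and
         * the read blocks. */
        for (;;) {
            ssize_t n = read_full(fd, buf, sizeof buf);
            printf("got %zd bytes\n", n);
            if (n <= 0)
                break;
        }
        close(fd);
        return 0;
    }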

Why can't splice with sockets improve performance without DMA?

In Wikipedia's introduction to splice, I found:
    When using splice() with sockets, the network controller (NIC) must
    support DMA.

    When the NIC does not support DMA then splice() will not deliver any
    performance improvement. The reason for this is that each page of the
    pipe will just fill up to frame size (1460 bytes of the available 4096
    bytes per page).
From what I understand, splice improves performance because:
there's less context switching
it minimizes the number of copies (down to a minimum of two DMA copies)
If the NIC does not support DMA, a CPU copy is used instead. That is still better than the normal read/write path, whose copies have to go through user space.
So, I don't understand why Wikipedia says there's no performance improvements without DMA support in NIC.
Maybe Wikipedia is wrong? That article is already flagged as being light on citations...
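For reference, the pattern under discussion looks roughly like this: a sketch of file-to-socket forwarding through a pipe, with error handling trimmed, and file_fd/sock_fd assumed to be an open file and a connected socket.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Forward `count` bytes from a file to a socket without copying
     * through user space. */
    int splice_file_to_socket(int file_fd, int sock_fd, size_t count)
    {
        int pipefd[2];
        if (pipe(pipefd) < 0)
            return -1;

        while (count > 0) {
            /* file -> pipe: pages are moved/mapped into the pipe buffer */
            ssize_t n = splice(file_fd, NULL, pipefd[1], NULL,
                               count, SPLICE_F_MOVE | SPLICE_F_MORE);
            if (n <= 0)
                break;
            /* pipe -> socket: with a DMA-capable NIC the device can pull
             * the pages directly; without it the CPU copies them here */
            ssize_t m = splice(pipefd[0], NULL, sock_fd, NULL,
                               (size_t)n, SPLICE_F_MOVE | SPLICE_F_MORE);
            if (m <= 0)
                break;
            count -= (size_t)m;
        }
        close(pipefd[0]);
        close(pipefd[1]);
        return 0;
    }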

Linux block device with odd (not even) size

Is it possible to create a Linux (2.6) block device (such as a loopback device) with an odd size? I couldn't make it happen: losetup seems to round down to a 512-byte boundary, and the ubd devices of User-mode Linux seem to round up to one. In struct request, we have sector_t __sector for the block offset of read/write operations.
I'm asking this question just for educational purposes. I can cope with the 512 byte boundary, but I'm still interested if it would be possible to bypass it. In this question I'm not interested in other layers of abstraction (such as using regular files or character devices).
No. The Linux 2.6 block layer doesn't comprehend anything smaller than 512 bytes. Anything smaller (especially not a power of 2) would require a major rewrite of an awful lot of code.
This is what makes it a block device instead of a character device: the block granularity. The dichotomy exists because it is vastly more efficient to model real hardware that works a block at a time with an abstraction that also deals in blocks. To do otherwise would turn every operation into a much more costly computation.
The way to bypass it is, as you mention, to use a character oriented device or abstraction. This is central to the Unix device model: everything is a series of octets, except for the things that can only be virtualized as one.
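You can observe the rounding from user space. A quick check, assuming /dev/loop0 is attached to a backing file with an odd byte length (e.g. 1025 bytes):

    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>   /* BLKGETSIZE64 */

    int main(void)
    {
        int fd = open("/dev/loop0", O_RDONLY);
        if (fd < 0)
            return 1;
        uint64_t bytes = 0;
        if (ioctl(fd, BLKGETSIZE64, &bytes) < 0)
            return 1;
        /* For a 1025-byte backing file this reports 1024: the block
         * layer only sees whole 512-byte sectors. */
        printf("device size: %llu bytes (%llu sectors)\n",
               (unsigned long long)bytes,
               (unsigned long long)(bytes / 512));
        close(fd);
        return 0;
    }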
