Safely zeroing buffers after working with crypto/* - security

Is there a way to zero buffers containing e.g. private keys after
using them, and to make sure that compilers don't delete the zeroing code as
unused? Something tells me that a simple:
copy(privateKey, make([]byte, keySize))
is not guaranteed to stay there.

Sounds like you want to prevent sensitive data from remaining in memory. But have you considered that the data might have been replicated, or swapped to disk?
For these reasons I use the https://github.com/awnumar/memguard package.
It provides features to destroy the data when it is no longer required, while keeping it safe in the meantime.
You can read about its background here: https://spacetime.dev/memory-security-go

How about checking (some of) the content of the buffer after zeroing it and passing it to another function? For example:
copy(privateKey, make([]byte, keySize))
if privateKey[0] != 0 {
    // If you pass the buffer to another function,
    // this check and the above copy() can't be optimized away:
    fmt.Println("Zeroing failed", privateKey[0])
}
To be absolutely safe you could XOR the passed buffer content with random bytes, but since the zeroing is not optimized away, the if body is never reached anyway.
You might think a very intelligent compiler could deduce that the above copy() zeros privateKey[0], conclude that the condition is always false, and still optimize the check away (although this is very unlikely). The solution to this is not to use make([]byte, keySize) as the source, but e.g. a slice coming from a global variable or a function argument (whose value can only be determined at runtime), so the compiler can't be smart enough to prove at compile time that the condition is always false.
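For what it's worth, C and C++ have exactly the same dead-store-elimination problem, and the usual workaround there illustrates the principle: make the stores observable side effects so the optimizer may not remove them. A minimal sketch (not Go, and secure_zero is just an illustrative name):

#include <stddef.h>

// Zero a buffer in a way the optimizer may not elide: accesses
// through a volatile lvalue are observable side effects, so the
// compiler has to emit every one of these stores.
void secure_zero(void *p, size_t n) {
    volatile unsigned char *vp = (volatile unsigned char *) p;
    while (n--) {
        *vp++ = 0;
    }
}

The Go-specific advice above (keep the buffer reachable through a function argument or global so its contents can't be proven dead) aims at the same goal by different means.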

Related

Memory barrier in the implementation of single producer single consumer

The following implementation from Wikipedia:
volatile unsigned int produceCount = 0, consumeCount = 0;
TokenType buffer[BUFFER_SIZE];

void producer(void) {
    while (1) {
        while (produceCount - consumeCount == BUFFER_SIZE)
            sched_yield(); // buffer is full

        buffer[produceCount % BUFFER_SIZE] = produceToken();
        // a memory_barrier should go here, see the explanation above
        ++produceCount;
    }
}

void consumer(void) {
    while (1) {
        while (produceCount - consumeCount == 0)
            sched_yield(); // buffer is empty

        consumeToken(buffer[consumeCount % BUFFER_SIZE]);
        // a memory_barrier should go here, the explanation above still applies
        ++consumeCount;
    }
}
says that a memory barrier must be used between the line that accesses the buffer and the line that updates the Count variable.
This is done to prevent the CPU from reordering the instructions above the fence with those below it. The Count variable shouldn't be incremented before it is used to index into the buffer.
If a fence is not used, won't this kind of reordering violate the correctness of code? The CPU shouldn't perform increment of Count before it is used to index into buffer. Does the CPU not take care of data dependency while instruction reordering?
Thanks
If a fence is not used, won't this kind of reordering violate the correctness of code? The CPU shouldn't perform increment of Count before it is used to index into buffer. Does the CPU not take care of data dependency while instruction reordering?
Good question.
In c++, unless some form of memory barrier is used (atomic, mutex, etc), the compiler assumes that the code is single-threaded. In which case, the as-if rule says that the compiler may emit whatever code it likes, provided that the overall observable effect is 'as if' your code was executed sequentially.
As mentioned in the comments, volatile does not necessarily alter this, being merely an implementation-defined hint that the variable may change between accesses (this is not the same as being modified by another thread).
So if you write multi-threaded code without memory barriers, you get no guarantees that changes to a variable in one thread will even be observed by another thread, because as far as the compiler is concerned that other thread should not be touching the same memory, ever.
What you will actually observe is undefined behaviour.
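A minimal sketch of this in C++: with a plain bool the compiler may assume no other thread touches the flag and hoist the load out of the loop; declaring it atomic makes the program well-defined (the names here are illustrative):

#include <atomic>
#include <thread>

// With a plain `bool`, the compiler may read the flag once and hoist
// the load out of the loop under the as-if rule, so the loop could
// spin forever. std::atomic makes the cross-thread write visible.
std::atomic<bool> done{false};

void worker() {
    while (!done.load(std::memory_order_acquire)) {
        // spin until the main thread publishes the flag
    }
}

int main() {
    std::thread t(worker);
    done.store(true, std::memory_order_release);
    t.join();
}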
It seems that your question is "can incrementing Count and the assignment to buffer be reordered without changing the code's behavior?".
Consider the following code transformation:
    int count1 = produceCount++;
    buffer[count1 % BUFFER_SIZE] = produceToken();
Notice that the code behaves exactly like the original: one read from the volatile variable, one write to it, and the read happens before the write, so the state of the program is the same. However, other threads will see a different picture regarding the order of the produceCount increment and the buffer modification.
Both the compiler and the CPU can do that transformation without memory fences, so you need to force those two operations into the correct order.
If a fence is not used, won't this kind of reordering violate the correctness of code?
Nope. Can you construct any portable code that can tell the difference?
The CPU shouldn't perform increment of Count before it is used to index into buffer. Does the CPU not take care of data dependency while instruction reordering?
Why shouldn't it? What would the payoff be for the costs incurred? Things like write combining and speculative fetching are huge optimizations and disabling them is a non-starter.
If you're thinking that volatile alone should do it, that's simply not true. The volatile keyword has no defined thread synchronization semantics in C or C++. It might happen to work on some platforms and it might happen not to work on others. In Java, volatile does have defined thread synchronization semantics, but they don't include providing ordering for accesses to non-volatiles.
However, memory barriers do have well-defined thread synchronization semantics. We need to make sure that no thread can see that data is available before it can see the data itself, and that a thread's marking of data as free to be overwritten is not seen before that thread has finished with the data.
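To make that concrete, here is a sketch of the Wikipedia loop rewritten with C++11 atomics, where the release/acquire pair supplies exactly the barrier the original comments ask for (TokenType, produceToken and consumeToken are assumed from the original snippet):

#include <atomic>
#include <sched.h>

enum { BUFFER_SIZE = 16 };
typedef int TokenType;        // stand-in for the original's TokenType
TokenType produceToken(void); // assumed from the original snippet
void consumeToken(TokenType); // assumed from the original snippet

TokenType buffer[BUFFER_SIZE];
std::atomic<unsigned int> produceCount{0}, consumeCount{0};

void producer(void) {
    while (1) {
        unsigned int p = produceCount.load(std::memory_order_relaxed);
        while (p - consumeCount.load(std::memory_order_acquire) == BUFFER_SIZE)
            sched_yield(); // buffer is full

        buffer[p % BUFFER_SIZE] = produceToken();
        // release store: the buffer write above cannot be reordered past
        // this, so a consumer that sees the new count also sees the token.
        produceCount.store(p + 1, std::memory_order_release);
    }
}

void consumer(void) {
    while (1) {
        unsigned int c = consumeCount.load(std::memory_order_relaxed);
        while (produceCount.load(std::memory_order_acquire) - c == 0)
            sched_yield(); // buffer is empty

        consumeToken(buffer[c % BUFFER_SIZE]);
        // release store: the buffer read above cannot be reordered past
        // this, so the producer only reuses the slot once the token has
        // really been consumed.
        consumeCount.store(c + 1, std::memory_order_release);
    }
}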

Are objects accessed indirectly in D?

From what I've read, all objects in D are fully location independent. How is this requirement achieved?
One thing that comes to my mind is that all references are not pointers to the objects, but to some proxy, so when you move an object (in memory) you just update that proxy, not all references used in the program.
But this is just my guess. How is it actually done in D?
edit: bottom line up front, no proxy object, objects are referenced directly through regular pointers. /edit
Structs aren't allowed to keep a pointer to themselves, so if they get copied, they should continue to just work. This isn't strictly enforced by the language though:

struct S {
    S* lol;
    void beBad() {
        lol = &this; // the compiler will allow this....
    }
}

S pain() {
    S s;
    s.beBad();
    return s;
}

void main() {
    S s;
    s = pain();
    assert(s.lol !is &s); // but it will also move the object without notice!
}
(EDIT: actually, I guess you could use a postblit to update internal pointers, so it isn't quite without notice. If you're careful enough you could make it work, but then again, if you're careful enough you can shoot between your toes without hitting your foot too. EDIT2: actually no, the compiler/runtime is still allowed to move the struct without even calling the postblit. One example of where this happens is when it copies a stack frame to the heap to make a closure: the struct data is moved to a new address without being informed. So yeah. /edit)
And actually, that assert isn't guaranteed to pass: the compiler might choose to run pain directly on the local object declared in main, in which case the pointer would still work (though I'm not able to force this optimization here for a demo; generally, when you return a struct from a function, it is done via a hidden pointer the caller passes in, the caller saying "put the return value right here", thus avoiding a copy/move in some cases).
But anyway, the point is just that the compiler is free to copy or not copy a struct at its leisure, so if you do keep the address of this around inside it, that address may become invalid without notice; keeping the pointer is not a compile error, but using it is undefined behavior.
The situation is different with classes. Classes are allowed to keep references to this internally, since a class is (in theory, as realized by the garbage collector implementation) an independent object with an infinite lifetime. While it may be moved (such as by a moving GC, which is not implemented in D today), if it is moved, all references to it, internal and external, would also be updated.
So classes can't have the memory pulled out from under them like structs can (unless you the programmer take matters into your own hands and bypass the GC...)
I'm pretty sure the location-independence claim refers only to structs, and only to the rule that they can't have pointers to themselves. There's no magic done with references or pointers - they indeed work with memory addresses, no proxy objects.

Assign string to zmq::message_t without copying

I need to do some high-performance C++ work, and that is why I need to avoid copying data whenever possible.
Therefore I want to directly assign a string buffer to a zmq::message_t object without copying it. But there seems to be some deallocation of the string which prevents successful sending.
Here is the piece of code:
for (pair<int, string> msg : l) {
    comm_out.send_int(msg.first);
    comm_out.send_int(t_id);
    int size = msg.second.size();
    zmq::message_t m((void *) std::move(msg.second).data(), size, NULL, NULL);
    comm_out.send_frame_msg(m, false); // some zmq-wrapper class
}
How can I prevent the string from being deallocated before the message is sent out? And when exactly is the string deallocated?
Regards
I think that zmq::message_t m((void *) std::move(msg.second).data()... is probably undefined behaviour, but is certainly the cause of your problem. In this instance, std::move isn't doing what I suspect you think it does.
Here std::move is only a cast; it doesn't create or destroy anything by itself. data() is simply called on msg.second, so the message_t constructor receives a raw pointer into msg.second's buffer. 0MQ assumes that pointer stays valid, but msg (and with it the string) is destroyed at the end of each loop iteration, while the actual send happens asynchronously, possibly later - i.e. the buffer is gone before 0MQ is finished with it.
Zero-copy is a complicated matter in 0MQ (see the 0MQ Guide for more details), but you have to ensure that the data that hasn't been copied remains valid until 0MQ tells you explicitly that it's finished with it.
Using C++ strings in this situation is hard and requires a lot of thought. Your question about how to prevent the string from being deallocated goes right to the heart of the issue. The only answer to that is "with great care".
In short, are you sure you need zero-copy at all?
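For illustration, a minimal sketch of the usual zero-copy pattern with zmq::message_t: move the string to the heap so its lifetime is decoupled from the loop iteration, and give 0MQ a deallocation callback that it invokes once it has finished with the data (send_string_zero_copy is an illustrative name; the question's send_frame_msg wrapper would sit where send() is):

#include <string>
#include <zmq.hpp>

// Called by 0MQ (possibly much later, from an I/O thread) when it no
// longer needs the message data; only then is the string freed.
static void free_string(void * /*data*/, void *hint) {
    delete static_cast<std::string *>(hint);
}

void send_string_zero_copy(zmq::socket_t &sock, std::string s) {
    // Move the string to the heap: its buffer must outlive this
    // function, staying valid until free_string runs.
    std::string *owned = new std::string(std::move(s));
    zmq::message_t m(const_cast<char *>(owned->data()),
                     owned->size(), free_string, owned);
    sock.send(m);
}

That said, the closing question above stands: for short strings the heap allocation and bookkeeping can easily cost more than the copy they avoid.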

When should the Win32 InterlockedExchange function be used?

I came across the function InterlockedExchange and was wondering when I should use this function. In my opinion, setting a 32-bit value on an x86 processor should always be atomic, shouldn't it?
In the case where I want to use the function, the new value does not depend on the old value (it is not an increment operation).
Could you provide an example where this method is mandatory? (I'm not looking for InterlockedCompareExchange.)
InterlockedExchange is both a write and a read -- it returns the previous value.
This lets you detect whether another thread wrote a different value just after you did. For example, say you're trying to increment a variable: you read the value, add 1, then set the new value with InterlockedExchange. If the value returned by InterlockedExchange doesn't match the value you originally read, another thread changed the variable in the meantime. (Note that actually recovering at that point calls for InterlockedCompareExchange, which only stores when the old value still matches; a plain exchange has already overwritten the other thread's update.)
As well as writing the new value, InterlockedExchange also reads and returns the previous value; this whole operation is atomic. This is useful for lock-free algorithms.
(Incidentally, 32-bit writes are not guaranteed to be atomic. Consider the case where the write is unaligned and straddles a cache-line boundary, for instance.)
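As a concrete illustration, a minimal sketch of the classic lock-free use, a spinlock: the value written never depends on the old value, yet the returned old value is essential (g_lock, acquire and release are illustrative names):

#include <windows.h>

static volatile LONG g_lock = 0; // 0 = free, 1 = held

void acquire(void) {
    // Atomically write 1 and get the previous value back in one step.
    // If the previous value was already 1, someone else holds the
    // lock; a separate read followed by a write would race here.
    while (InterlockedExchange(&g_lock, 1) != 0)
        Sleep(0); // yield the time slice and try again
}

void release(void) {
    // InterlockedExchange also acts as a full memory barrier, so all
    // writes made inside the critical section are visible to the next
    // owner before the lock appears free.
    InterlockedExchange(&g_lock, 0);
}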
In a multi-processor or multi-core machine each core has its own cache, so each core may have its own, potentially different, "view" of the content of system memory.
Thread synchronization mechanisms take care of synchronizing between cores; for more information, look at http://blogs.msdn.com/oldnewthing/archive/2008/10/03/8969397.aspx or search for acquire and release semantics.
Setting a 32-bit value is atomic, but only if you're storing a literal. An assignment like b = a is 2 operations:
    mov eax,dword ptr [a]
    mov dword ptr [b],eax
and the thread could be preempted between the first and the second.
Writing a value is never atomic by default. When you write a value to a variable, several machine instructions are generated. With modern, preemptive OSes, the OS might switch to another thread between the individual operations of the write.
This is even more a problem on multi-processor machines, where several threads could be executing at the same time, and trying to write to a single memory location simultaneously.
Interlocked operations avoid this by using specialized instructions to make the write (x86 has dedicated instructions for this kind of situation), which do the read-modify-write in one instruction. These instructions also lock the memory bus of all processors, to ensure that no other executing thread could be writing to the value at the same time.
InterlockedExchange makes sure that the change of a variable and the return of its original value are not interrupted by other threads.
So, if 'i' is an int, these calls (taken individually) do not need InterlockedExchange around 'i':
    a = i;
    i = 9;
    i = a;
    i = a + 9;
    a = i + 9;
    if (0 == i)
None of these statements rely upon BOTH the initial AND final values of 'i'. But the following DO rely on both, and so need an interlocked operation on 'i':
    a = i++; // a = InterlockedExchangeAdd(&i, 1);
(Note that InterlockedExchange(&i, i + 1) would not be equivalent: the read of i in "i + 1" happens separately, before the exchange, so InterlockedExchangeAdd is the right primitive for an atomic post-increment.) Without it, two threads running through this same code might get the same value of 'i' assigned to 'a', or 'a' may unexpectedly skip two or more numbers.
    if (0 == i++) // if (0 == InterlockedExchangeAdd(&i, 1))
Two threads may both execute the code that is only supposed to happen once.
etc.
Wow, so many conflicting answers. It's hard to sift through who's right, who's wrong, and what information is misleading.
I'm unsure of the answer too, given the above half-answers, but I think it works like this (I may be wrong, and it will be interesting to find out if I am):
32-bit reads & writes ARE atomic, but depending on your code, that may not mean much.
Don't worry about non-aligned reads/writes: the compiler aligns ordinary 32-bit variables, so plain accesses to them stay aligned.
Don't worry about a write straddling the end of a cache line or page: an aligned 32-bit write can't do that.
If you need to write-then-read on one thread, and you're writing on another thread, then you need to use InterlockedExchange. If you're simply reading the value on one thread, and writing it on another, then you don't need to use it, but those values may be wiggly because of multithreading.

When can DataInputStream.skipBytes(n) not skip n bytes?

The Sun Documentation for DataInput.skipBytes states that it "makes an attempt to skip over n bytes of data from the input stream, discarding the skipped bytes. However, it may skip over some smaller number of bytes, possibly zero. This may result from any of a number of conditions; reaching end of file before n bytes have been skipped is only one possibility."
Other than reaching end of file, why might skipBytes() not skip the right number of bytes? (The DataInputStream I am using will either be wrapping a FileInputStream or a PipedInputStream.)
If I definitely want to skip n bytes and throw an EOFException if this causes me to go to the end of the file, should I use readFully() and ignore the resulting byte array? Or is there a better way?
1) There might not be that much data available to read (the other end of the pipe might not have sent that much data yet), and the implementing class might be non-blocking (i.e. it will just return what it can, rather than waiting for enough data to fulfil the request).
I don't know if any implementations actually behave in this way, however, but the interface is designed to permit it.
Another option is simply that the file gets closed part-way through the read.
2) Either readFully() (which will always wait for enough input or else fail) or call skipBytes() in a loop. I think the former is probably better, unless the array is truly vast.
I came across this problem today. It was reading off a network connection on a virtual machine, so I imagine there could be a number of reasons for this happening. I solved it by simply forcing the input stream to skip bytes until it had skipped the number of bytes I wanted it to:

int byteOffsetX = someNumber; // n bytes to skip
int nSkipped = in.skipBytes(byteOffsetX);
while (nSkipped < byteOffsetX) {
    int n = in.skipBytes(byteOffsetX - nSkipped);
    if (n == 0) {
        // skipBytes() returns 0 at end of stream (among other
        // conditions); bail out rather than looping forever.
        throw new EOFException();
    }
    nSkipped += n;
}
It turns out that readFully() adds more performance overhead than I was willing to put up with.
In the end I compromised: I call skipBytes() once, and if that returns fewer than the right number of bytes, I call readFully() for the remaining bytes.
Josh Bloch has publicised this recently. It is at least consistent: InputStream.read is likewise not guaranteed to read as many bytes as it could. However, it is utterly pointless as an API method. InputStream should probably also have a readFully.
According to the docs, readFully() is the only way that both works and is guaranteed to work.
The actual Oracle implementation is... confusing:
public final int skipBytes(int n) throws IOException {
    int total = 0;
    int cur = 0;

    while ((total < n) && ((cur = (int) in.skip(n - total)) > 0)) {
        total += cur;
    }

    return total;
}
Why call skip() in a loop when skipBytes() has essentially the same contract as skip()? It would make perfect sense to implement it this way if both of the following were true: skipBytes() guaranteed to skip less than requested only on EOF, and skip() guaranteed to skip at least one byte if at all possible (just like read() does).
What's even worse is that skip() is actually implemented using read(), which means it actually does the job. It just doesn't promise to do it, which means other implementations may fail to do so (and even the Oracle one may potentially fail in the future if changed in future releases).
To be completely safe: call skipBytes(), and if it doesn't do the job, allocate a temp buffer and call readFully() (use a loop here if the ability to skip over arbitrarily large amounts of data is needed).
On the other hand, calling skipBytes() in a loop is pointless. At least with Oracle implementation, because it's already a loop.
