Safe alternative to _vscwprintf - string

In MSVC, a number of string functions come in three flavors: the original, a "safe" version, and a strsafe version. For example, sprintf, sprintf_s, and StringCchPrintf are all equivalents, increasing in safety (by some metric).
Now, I have a bit of code that does:
int bufsize = _vscwprintf(fmt, args) + 1;
std::vector<wchar_t> buffer(bufsize);
int len = _vsnwprintf_s(&buffer[0], bufsize, bufsize-1, fmt, args);
to allocate a buffer of the correct size.
While looking through the strsafe functions, I found an alternative for _vsnwprintf_s, but none for _vscwprintf. A check of Google didn't seem to return anything.
Is there a strsafe way of writing that bit of code, or alternate functions for both that I'm missing, or is mixing an original and a strsafe function acceptable? (No safety warnings are given about the current code on /W4 with PREfast, all rules.)

_vscwprintf() merely computes the size of the wchar_t[] array you need to safely format the string; it doesn't actually write anything to a buffer. Accordingly, you don't need a safe version of the function, and none exists.
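For completeness, here is a minimal sketch of the whole pattern with the writing half done by the strsafe formatter StringCchVPrintfW (the wrapper name FormatV is made up; note the va_copy, since the argument list is consumed twice):

#include <cstdarg>
#include <cstdio>
#include <string>
#include <vector>
#include <strsafe.h>

// Sketch: size the buffer with _vscwprintf, format with the strsafe
// StringCchVPrintfW. Mixing the two is fine because the sizing
// function never writes to a buffer.
std::wstring FormatV(const wchar_t* fmt, va_list args)
{
    va_list args2;
    va_copy(args2, args);                      // the list is walked twice
    int bufsize = _vscwprintf(fmt, args) + 1;  // +1 for the terminator
    std::vector<wchar_t> buffer(bufsize);
    StringCchVPrintfW(buffer.data(), bufsize, fmt, args2);
    va_end(args2);
    return std::wstring(buffer.data());
}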


simple_read_from_buffer/simple_write_to_buffer vs. copy_to_user/copy_from_user

I recently wrote a module implementing these functions.
What is the difference between the two? From my understanding, the copy_..._user functions are more secure. Please correct me if I'm mistaken.
Furthermore, is it a bad idea to mix the two functions in one program? For example, I used simple_read_from_buffer in my misc dev read function, and copy_from_user in my write function.
Edit: I believe I've found the answer to my question from reading fs/libfs.c (I wasn't aware that this was where the source code for these functions lives); from my understanding, the simple_...() functions are essentially wrappers around the copy_...() functions. I think it was appropriate in my case to use copy_from_user for the misc device write function, as I needed to validate that the input matched a specific string before returning it to the user buffer.
I will still leave this question open though in case someone has a better explanation or wants to correct me!
simple_read_from_buffer and simple_write_to_buffer are just convenience wrappers around copy_{to,from}_user for when all you need to do is service a read from userspace out of a kernel buffer, or service a write from userspace into a kernel buffer.
From my understanding, the copy_..._user functions are more secure.
Neither version is "more secure" than the other. Whether or not one might be more secure depends on the specific use case.
I would say that simple_{read,write}_... could in general be more secure, since they do all the appropriate checks for you before copying. If all you need to do is service a read/write to/from a kernel buffer, then using simple_{read,write}_... is surely simpler and less error-prone than manually checking and calling copy_{from,to}_user.
Here's a good example where those functions would be useful:
#define SZ 1024

static char kernel_buf[SZ];

static ssize_t dummy_read(struct file *filp, char __user *user_buf, size_t n, loff_t *off)
{
    /* copy at most n bytes of kernel_buf (starting at *off) to userspace */
    return simple_read_from_buffer(user_buf, n, off, kernel_buf, SZ);
}

static ssize_t dummy_write(struct file *filp, const char __user *user_buf, size_t n, loff_t *off)
{
    /* copy at most n bytes from userspace into kernel_buf at *off */
    return simple_write_to_buffer(kernel_buf, SZ, off, user_buf, n);
}
It's hard to tell what exactly you need without seeing your module's code, but I would say that you can either:
Use copy_{from,to}_user if you want to control the exact behavior of your function.
Use a direct return simple_{read,write}_... if you don't need such fine-grained control and you are OK with just returning the standard values produced by those wrappers.
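To make the wrapper relationship concrete, here is a sketch of roughly what the dummy_write above would look like open-coded with copy_from_user (simplified; the real simple_write_to_buffer in fs/libfs.c also handles partial copies):

#include <linux/uaccess.h>

static ssize_t dummy_write_manual(struct file *filp, const char __user *user_buf,
                                  size_t n, loff_t *off)
{
    loff_t pos = *off;

    if (pos < 0)
        return -EINVAL;
    if (pos >= SZ || n == 0)
        return 0;
    if (n > SZ - pos)
        n = SZ - pos;               /* clamp to the space left in kernel_buf */
    if (copy_from_user(kernel_buf + pos, user_buf, n))
        return -EFAULT;             /* some bytes could not be copied */
    *off = pos + n;
    return n;
}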

Function for unaligned memory access on ARM

I am working on a project where data is read from memory. Some of this data are integers, and there was a problem accessing them at unaligned addresses. My idea would be to use memcpy for that, i.e.
uint32_t readU32(const void* ptr)
{
uint32_t n;
memcpy(&n, ptr, sizeof(n));
return n;
}
The solution from the project source I found is similar to this code:
uint32_t readU32(const uint32_t* ptr)
{
union {
uint32_t n;
char data[4];
} tmp;
const char* cp=(const char*)ptr;
tmp.data[0] = *cp++;
tmp.data[1] = *cp++;
tmp.data[2] = *cp++;
tmp.data[3] = *cp;
return tmp.n;
}
So my questions:
1. Isn't the second version undefined behaviour? The C standard says in 6.3.2.3 Pointers, paragraph 7:
A pointer to an object or incomplete type may be converted to a pointer to a different object or incomplete type. If the resulting pointer is not correctly aligned for the pointed-to type, the behavior is undefined.
As the calling code has, at some point, used a char* to handle the memory, there must be some conversion from char* to uint32_t*. Isn't the result of that conversion undefined behaviour, then, if the uint32_t* is not correctly aligned? And if it is aligned, there is no point to the function, as you could just write *(uint32_t*)ptr to fetch the memory. Additionally, I think I read somewhere that the compiler may assume an int* is correctly aligned, and that any unaligned int* means undefined behaviour as well, so the generated code for this function might take shortcuts because it expects the function argument to be properly aligned.
2. The original code has volatile on the argument and all variables because the memory contents could change (it's a data buffer, not registers, inside a driver). Maybe that's why it does not use memcpy, since memcpy won't work on volatile data. But in which world would that make sense? If the underlying data can change at any time, all bets are off: the data could even change between those byte copy operations. So you would need some kind of mutex to synchronize access to this data. But if you have such synchronization, why would you need volatile?
3. Is there a canonical/accepted/better solution to this memory access problem? After some searching, I have come to the conclusion that you need a mutex, do not need volatile, and can use memcpy.
P.S.:
# cat /proc/cpuinfo
processor : 0
model name : ARMv7 Processor rev 10 (v7l)
BogoMIPS : 1581.05
Features : swp half thumb fastmult vfp edsp neon vfpv3 tls
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc09
CPU revision : 10
This code
uint32_t readU32(const uint32_t* ptr)
{
union {
uint32_t n;
char data[4];
} tmp;
const char* cp=(const char*)ptr;
tmp.data[0] = *cp++;
tmp.data[1] = *cp++;
tmp.data[2] = *cp++;
tmp.data[3] = *cp;
return tmp.n;
}
passes the pointer as a uint32_t *. If it's not actually the address of a properly aligned uint32_t, that's UB. The argument should probably be a const void *.
The use of a const char * in the conversion itself is not undefined behavior. Per 6.3.2.3 Pointers, paragraph 7 of the C Standard (emphasis mine):
A pointer to an object type may be converted to a pointer to a different object type. If the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined. Otherwise, when converted back again, the result shall compare equal to the original pointer. When a pointer to an object is converted to a pointer to a character type, the result points to the lowest addressed byte of the object. Successive increments of the result, up to the size of the object, yield pointers to the remaining bytes of the object.
As for volatile and the correct way to access memory/registers directly on your particular hardware: there is no canonical/accepted/best solution. Any solution for that is specific to your system and beyond the scope of standard C.
Implementations are allowed to define behaviors in cases where the Standard does not, and some implementations may specify that all pointer types have the same representation and may be freely cast among each other regardless of alignment, provided that pointers which are actually used to access things are suitably aligned.
Unfortunately, because some obtuse compilers compel the use of memcpy as an escape valve for aliasing issues even when pointers are known to be aligned, the only way compilers can efficiently process code which needs to make type-agnostic accesses to aligned storage is to assume that any pointer of a type requiring alignment will always be suitably aligned for that type. As a result, your instinct that the approach using uint32_t* is dangerous is spot on. It may be desirable to have compile-time checking to ensure that a function is passed either a void* or a uint32_t*, and not something like a uint16_t* or a double*, but there's no way to declare a function that way without allowing a compiler to "optimize" the function by consolidating the byte accesses into a 32-bit load that will fail if the pointer isn't aligned.
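Putting this together, here is a minimal sketch of an alignment-safe reader; the memcpy version is the portable one the question already suggests, and the byte-shift variant (an addition of mine) also pins down the byte order, assuming little-endian data:

#include <stdint.h>
#include <string.h>

/* Taking const void* means no misaligned uint32_t* is ever formed, so
 * there is no UB; on ARMv7, which allows unaligned word access to
 * normal memory, a good compiler folds the memcpy into a single LDR. */
static inline uint32_t read_u32(const void *p)
{
    uint32_t n;
    memcpy(&n, p, sizeof n);
    return n;
}

/* Byte-wise alternative that also makes the endianness explicit
 * (this variant assumes the data is stored little-endian). */
static inline uint32_t read_u32le(const void *p)
{
    const unsigned char *b = (const unsigned char *)p;
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}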

Fastest Way to Copy Buffer or C-String into a std::string

Let's say I have char buffer[64] and uint32_t length, and buffer might or might not be null-terminated. If it is null-terminated, the rest of the buffer is filled with nulls. The length variable holds the length of buffer.
I would like to copy it into a std::string without extra nulls at the end of the string object.
Originally, I tried:
std::string s(buffer, length);
which copies the extra nulls when buffer is filled with nulls at the end.
I can think of:
char buffer2[65];
strncpy(buffer2, buffer, 64);
buffer2[64] = '\0'; // strncpy does not terminate if no '\0' was found
const std::string s(buffer2);
But it is kind of wasteful because it copies twice.
I wonder whether there is a faster way. I know I need to benchmark to tell exactly which way is faster...but I would like to look at some other solutions and then benchmark...
Thanks in advance.
1. If you can, I'd simply add a '\0' at the end of your buffer and then use the c-string version of the string constructor.
2. If you can't, you need to determine if there's a '\0' in your buffer, and while you're at it, you might as well count the number of characters you encounter before the '\0'. You can then use that count with the (buffer, length) form of the string constructor:
#include <string.h>
//...
std::string s(buffer, strnlen(buffer, length));
3. If you can't do 1. and don't want to iterate over buffer twice (once to determine the length, once in the string constructor), you could do:
char last_char = buffer[length-1];
buffer[length-1] = '\0';
std::string s(buffer); // the c-string form, since we're now sure there's a '\0' in the buffer
buffer[length-1] = last_char; // undo the temporary modification
if(last_char!='\0' && s.length()==(length-1)) {
    // With good buffer sizes, this might not need to cause reallocation of the string's internal buffer
    s.push_back(last_char);
}
I leave the benchmarking to you. It is possible that the c-string version of the constructor uses something like strlen internally anyway to avoid reallocations so there might not be much to gain from using the c-string version of the string constructor.
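If strnlen isn't available to you (it's POSIX/C11 rather than core C++), memchr gives you the same single pass; a small sketch using the question's buffer and length:

#include <cstring>
#include <string>

// Find the first '\0' within length bytes, or use the whole buffer if
// there is none; exactly one scan, then exactly one copy.
const char* nul = static_cast<const char*>(std::memchr(buffer, '\0', length));
std::string s(buffer, nul ? static_cast<std::size_t>(nul - buffer) : length);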
You can use any of the canonical ways to do this.
A faster approach is to implement a smart pointer yourself (or use something ready-made such as std::shared_ptr).
Each smart pointer points to the first char of the array and holds a reference count.
Each time you "copy" the array, you don't make a true copy; you simply add a reference to that array.
So a "copy" takes O(1) instead of O(N).
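A minimal sketch of that idea using std::shared_ptr (the SharedBuf type and helper are made up; C++17 is needed for shared_ptr of arrays):

#include <cstddef>
#include <cstring>
#include <memory>

// Sketch: a reference-counted buffer; "copying" it only bumps a count,
// so a copy is O(1) instead of O(N).
struct SharedBuf {
    std::shared_ptr<char[]> data;
    std::size_t len = 0;
};

SharedBuf make_shared_buf(const char* src, std::size_t n)
{
    SharedBuf b;
    b.data = std::shared_ptr<char[]>(new char[n]);  // one real copy, here
    b.len = n;
    std::memcpy(b.data.get(), src, n);
    return b;
}

// SharedBuf b2 = b1;  // O(1): shares the same bytes, bumps the refcount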

What is the use of __iomem in linux while writing device drivers?

I have seen that __iomem is used on the return type of ioremap(), but I have used a plain u32 for it on the ARM architecture and it works well.
So what difference does __iomem make here? And in which circumstances should I use it exactly?
Lots of type casts are going to just "work well". However, this is not very strict. Nothing stops you from casting a u32 to a u32 * and dereferencing it, but this does not follow the kernel API and is prone to errors.
__iomem is a cookie used by Sparse, a tool used to find possible coding faults in the kernel. If you don't compile your kernel code with Sparse enabled, __iomem will be ignored anyway.
Use Sparse by first installing it, and then adding C=1 to your make call. For example, when building a module, use:
make -C $KPATH M=$PWD C=1 modules
__iomem is defined like this:
# define __iomem __attribute__((noderef, address_space(2)))
Adding (and requiring) a cookie like __iomem for all I/O accesses is a way to be stricter and avoid programming errors. You don't want to read/write from/to I/O memory regions with absolute addresses because you're usually using virtual memory. Thus,
void __iomem *ioremap(phys_addr_t offset, unsigned long size);
is usually called to get the virtual address of an I/O physical address offset, for a specified length size in bytes. ioremap() returns a pointer with an __iomem cookie, so this may now be used with inline functions like readl()/writel() (although it's now preferable to use the more explicit macros ioread32()/iowrite32(), for example), which accept __iomem addresses.
Also, the noderef attribute is used by Sparse to make sure you don't dereference an __iomem pointer. Dereferencing should work on some architecture where the I/O is really memory-mapped, but other architectures use special instructions for accessing I/Os and in this case, dereferencing won't work.
Let's look at an example:
void *io = ioremap(42, 4);
Sparse is not happy:
warning: incorrect type in initializer (different address spaces)
expected void *io
got void [noderef] <asn:2>*
Or:
u32 __iomem* io = ioremap(42, 4);
pr_info("%x\n", *io);
Sparse is not happy either:
warning: dereference of noderef expression
In the last example, the first line is correct, because the return value of ioremap() is assigned to an __iomem pointer. But then we dereference it, and we're not supposed to.
This makes Sparse happy:
void __iomem* io = ioremap(42, 4);
pr_info("%x\n", ioread32(io));
Bottom line: always use __iomem where it's required (as a return type or as a parameter type), and use Sparse to make sure you did so. Also: do not dereference an __iomem pointer.
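As a fuller sketch of the usual pattern (the base address, size, and register offsets below are made up for illustration):

#include <linux/errno.h>
#include <linux/io.h>
#include <linux/ioport.h>
#include <linux/printk.h>

#define MYDEV_BASE 0x10000000  /* made-up physical base address */
#define MYDEV_SIZE 0x100

static void __iomem *regs;

static int mydev_setup(void)
{
    if (!request_mem_region(MYDEV_BASE, MYDEV_SIZE, "mydev"))
        return -EBUSY;
    regs = ioremap(MYDEV_BASE, MYDEV_SIZE);
    if (!regs) {
        release_mem_region(MYDEV_BASE, MYDEV_SIZE);
        return -ENOMEM;
    }
    iowrite32(0x1, regs + 0x0);               /* e.g. a hypothetical enable register */
    pr_info("id=%x\n", ioread32(regs + 0x4)); /* read through the accessor, never *regs */
    return 0;
}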
Edit: Here's a great LWN article about the inception of __iomem and functions using it.
For a simple, straight, and short explanation, there is an article at https://lwn.net/Articles/653585/ with more details.

x86 equivalent for LWARX and STWCX

I'm looking for an equivalent of LWARX and STWCX (as found on PowerPC processors) or a way to implement similar functionality on the x86 platform. Also, where is the best place to find out about such things (i.e. good articles/websites/forums for lock/wait-free programming)?
Edit
I think I might need to give more details as it is being assumed that I'm just looking for a CAS (compare and swap) operation. What I'm trying to do is implement a lock-free reference counting system with smart pointers that can be accessed and changed by multiple threads. I basically need a way to implement the following function on an x86 processor.
int* IncrementAndRetrieve(int **ptr)
{
int val;
int *pval;
do
{
// fetch the pointer to the value
pval = *ptr;
// if its NULL, then just return NULL, the smart pointer
// will then become NULL as well
if(pval == NULL)
return NULL;
// Grab the reference count
val = lwarx(pval);
// make sure the pointer we grabbed the value from
// is still the same one referred to by 'ptr'
if(pval != *ptr)
continue;
// Increment the reference count via 'stwcx' if any other threads
// have done anything that could potentially break then it should
// fail and try again
} while(!stwcx(pval, val + 1));
return pval;
}
I really need something that mimics LWARX and STWCX fairly accurately to pull this off (I can't figure out a way to do this with the CompareExchange, swap or add functions I've so far found for the x86).
Thanks
As Michael mentioned, what you're probably looking for is the cmpxchg instruction.
It's important to point out though that the PPC method of accomplishing this is known as Load Link / Store Conditional (LL/SC), while the x86 architecture uses Compare And Swap (CAS). LL/SC has stronger semantics than CAS in that any change to the value at the conditioned address will cause the store to fail, even if the other change replaces the value with the same value that the load was conditioned on. CAS, on the other hand, would succeed in this case. This is known as the ABA problem (see the CAS link for more info).
If you need the stronger semantics on the x86 architecture, you can approximate it by using x86's double-width compare-and-swap (DWCAS) instruction cmpxchg8b, or cmpxchg16b under x86_64. This allows you to atomically swap two consecutive 'natural sized' words at once, instead of just the usual one. The basic idea is that one of the two words contains the value of interest, and the other one contains an always-incrementing 'mutation count'. Although this does not technically eliminate the problem, the likelihood of the mutation counter wrapping around between attempts is so low that it's a reasonable substitute for most purposes.
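Here is a minimal C11 sketch of that pointer-plus-counter pairing (the names are made up, and this is the general DWCAS pattern rather than the asker's exact refcount scheme; on x86_64, GCC/Clang typically lower the 16-byte compare-exchange to LOCK CMPXCHG16B and may need -mcx16 or -latomic):

#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Keep the pointer and a mutation counter side by side and swap both
 * atomically; the counter is bumped on every change to defeat ABA. */
typedef struct {
    int      *ptr;    /* the value of interest */
    uintptr_t count;  /* incremented on every successful change */
} versioned_ptr;

static _Atomic versioned_ptr g_head;

int *increment_and_retrieve(void)
{
    versioned_ptr old = atomic_load(&g_head);
    versioned_ptr next;
    do {
        if (old.ptr == NULL)
            return NULL;
        next.ptr = old.ptr;
        next.count = old.count + 1;
        /* on failure, atomic_compare_exchange_weak reloads old for us */
    } while (!atomic_compare_exchange_weak(&g_head, &old, next));
    return old.ptr;
}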
x86 does not directly support "optimistic concurrency" like PPC does -- rather, x86's support for concurrency is based on a "lock prefix", see here. (Some so-called "atomic" instructions such as XCHG actually get their atomicity by intrinsically asserting the LOCK prefix, whether the assembly code programmer has actually coded it or not). It's not exactly "bomb-proof", to put it diplomatically (indeed, it's rather accident-prone, I would say;-).
You're probably looking for the cmpxchg family of instructions.
You'll need to precede these with a lock prefix to get equivalent behaviour.
Have a look here for a quick overview of what's available.
You'll likely end up with something similar to this:
mov ecx, dword ptr [esp+4]        ; destination address
mov edx, dword ptr [esp+8]        ; new value to store
mov eax, dword ptr [esp+12]       ; expected (comparand) value
lock cmpxchg dword ptr [ecx], edx ; if [ecx]==eax then [ecx]:=edx, else eax:=[ecx]
ret 12
You should read this paper...
Edit
In response to the updated question, are you looking to do something like the Boost shared_ptr? If so, have a look at that code and the files in that directory - they'll definitely get you started.
If you are on 64 bits and limit yourself to, say, 1 TB of heap, you can pack the counter into the 24 unused top bits. If your pointers are suitably aligned, the low bits are free as well (e.g. 32-byte alignment frees the bottom 5 bits).
int* IncrementAndRetrieve(int **ptr)
{
    uintptr_t val;
    int *unpacked;
    do
    {
        val = (uintptr_t)*ptr;   // the packed word: pointer plus counter
        unpacked = unpack(val);  // mask the counter bits off to recover the pointer
        if(unpacked == NULL)
            return NULL;
        // with the counter kept in the spare low (alignment) bits,
        // val + 1 bumps the counter without touching the pointer
    } while(!cas((uintptr_t*)ptr, val, val + 1));
    return unpacked;
}
I don't know whether LWARX and STWCX invalidate the whole cache line; CAS and DCAS do. Meaning that unless you are willing to throw away a lot of memory (64 bytes for each independent "lockable" pointer), you won't see much improvement if you are really pushing your software into stress. The best results I've seen so far were when people consciously sacrificed 64 bytes, planned their structures around it (packing stuff that won't be subject to contention), kept everything aligned on 64-byte boundaries, and used explicit read and write data barriers. Cache-line invalidation can cost approximately 20 to 100 cycles, making it a bigger real performance issue than just lock avoidance.
Also, you'd have to plan a different memory allocation strategy to manage either controlled leaking (if you can partition code into logical "request processing": one request "leaks" and then releases all its memory in bulk at the end) or detailed allocation management, so that one structure under contention never receives memory released by elements of the same structure/collection (to prevent ABA). Some of that can be very counter-intuitive, but it's either that or paying the price for GC.
What you are trying to do will not work the way you expect. What you implemented above can be done with the InterlockedIncrement function (Win32 function; assembly: XADD).
The reason that your code does not do what you think it does is that another thread can still change the value of *ptr between the second read of *ptr and the stwcx, without invalidating the stwcx (which only watches pval, not ptr).
