Writing a custom malloc which stores informations in the pointer

Writing a custom malloc which stores informations in the pointer - garbage-collection

I have recently been reading about a family of automatic memory management techniques that rely on storing information in the pointer returned by the allocator, i.e. few bits of header e.g. to differentiate between pointers or to store thread-related information (note that I'm not talking about limited-field reference counting here, only immutable information).
I'd like to toy with these techniques. Now, to implement them, I need to be able to return pointers with a specific shape from my allocator. I suppose I could play with the least weight bits but this would require padding that looks extremely memory consuming, so I believe that I should play with the heaviest bits. However, I have no good idea on how to do this. Is there a way for me to, call malloc or malloc_create_zone or some related function and request a pointer that always starts with the given bits?
Thanks everyone!

The amount of information you can actually store in a pointer is pretty limited (typically one or two bits per pointer). And every attempt to dereference the pointer has to first mask out the magic information. The technique is often called tagging, BTW.
#define TAG_MASK 0x3
#define CONS_TAG 0x1
#define STRING_TAG 0x2
#define NUMBER_TAG 0x3
typedef uintptr_t value_t;
typedef struct cons {
value_t car;
value_t cdr;
} cons_t;
value_t
create_cons(value_t t1, value_t t2)
{
cons_t* pair = malloc(sizeof(cons_t));
value_t addr = (value_t)pair;
pair->car = t1;
pair->cdr = t2;
return addr | CONS_TAG;
}
value_t
car_of_cons(value_t v)
{
if ((v % TAG_MASK) != CONS_TAG) error("wrong type of argument");
return ((cons_t*) (v & ~TAG_MASK))->car;
}
One advantage of this technique is, that you can directly infer the type of the object from the pointer itself. You don't need to dereference it (say, in order to read a special type field or similar). Many language implementations using this scheme also have a special tag combination for "immediate" numbers and other small values, which can be represented direcly using the "pointer".
The disadvatage is, that the amount of information, which can be stored, is pretty limited. Also, as the example code shows, you have to be aware of the tagging in every access to the object, and need to "untag" the pointer before actually using it.
The use of the least significant bits for tagging stemms from the observation, that on most platforms, all pointer to malloced memory is actually aligned on a non-byte boundary (usually 8 bytes), so the least significant bits are always zero.

Related

Results of doing += on a double from multiple threads

Consider the following code:
void add(double& a, double b) {
a += b;
}
which according to godbolt compiles on a Skylake to:
add(double&, double):
vaddsd xmm0, xmm0, QWORD PTR [rdi]
vmovsd QWORD PTR [rdi], xmm0
ret
If I call add(a, 1.23) and add(a, 2.34) from different threads (for the same variable a), will a definitely end up as either a+1.23, a+2.34, or a+1.23+2.34?
That is, will one of these results definitely happen given this assembly, and a will not end up in some other state?

Here is a relevant questions to me:
Does the CPU fetch the word you are dealing with in a single operation?
Some processor might allow memory access to a variable that happens to be not aligned in memory by doing two fetches one after the other - non atomically of course.
In that case, problems would arise if another thread interjects writing on that area of memory while the first thread had fetched already the first part of the word and then fetches the second part when the other thread has already modified the word.
thread 1 fetches first part of a XXXX
thread 1 fetches second part of a YYYY
thread 2 fetches first part of a XXXX
thread 1 increments double represented as XXXXYYYY that becomes ZZZZWWWW by adding b
thread 1 writes back in memory ZZZZ
thread 1 writes back in memory WWWW
thread 2 fetches second part of a that is now WWWW
thread 2 increments double represented as XXXXWWWW that becomes VVVVPPPP by adding b
thread 2 writes back in memory VVVV
thread 2 writes back in memory PPPP
For keeping it compact I used one character to represent 8 bits.
Now XXXXWWWW and VVVVPPPP are going to be representation of total different floating point values than the one you would have expected. That is because you ended up mixing two parts of two different binary representation (IEEE-754) of double variables.
Said that, I know that in certain ARM based architectures data access are not allowed (that would cause a trap to be generated), but I suspect that Intel processors do allow that instead.
Therefore, if your variable a is aligned, your result can be any of
a+1.23, a+2.34, a+1.23+2.34
if your variable might be mis-aligned (i.e. has got an address that is not a multiple of 8) your result can be any of
a+1.23, a+2.34, a+1.23+2.34 or a rubbish value
As a further note, please bear in mind that even if your environment alignof(double) == 8 that is not necessarily enough to conclude you are not going to have misalignment issues. All depends from where your particular variable comes from. Consider the following (or run it here):
#pragma push()
#pragma pack(1)
struct Packet
{
unsigned char val1;
unsigned char val2;
double val3;
unsigned char val4;
unsigned char val5;
};
#pragma pop()
int main()
{
static_assert(alignof(double) == 8);
double d;
add(d,1.23); // your a parameter is aligned
Packet p;
add(p.val3,1.23); // your a parameter is now NOT aligned
return 0;
}
Therefore asserting alignof() doesn't necessarily guarantee your variable is aligned. If your variable is not involved in any packing then you should be OK.
Please allow me just a disclaimer for whoever else is reading this answer: using std::atomic<double> in these situations is the best compromise in term of implementation effort and performance to achieve thread safety. There are CPUs architectures that have special efficient instructions for dealing with atomic variables without injecting heavy fences. That might end up satisfying your performance requirements already.

Does std::ptr::write transfer the "uninitialized-ness" of the bytes it writes?

I'm working on a library that help transact types that fit in a pointer-size int over FFI boundaries. Suppose I have a struct like this:
use std::mem::{size_of, align_of};
struct PaddingDemo {
data: u8,
force_pad: [usize; 0]
}
assert_eq!(size_of::<PaddingDemo>(), size_of::<usize>());
assert_eq!(align_of::<PaddingDemo>(), align_of::<usize>());
This struct has 1 data byte and 7 padding bytes. I want to pack an instance of this struct into a usize and then unpack it on the other side of an FFI boundary. Because this library is generic, I'm using MaybeUninit and ptr::write:
use std::ptr;
use std::mem::MaybeUninit;
let data = PaddingDemo { data: 12, force_pad: [] };
// In order to ensure all the bytes are initialized,
// zero-initialize the buffer
let mut packed: MaybeUninit<usize> = MaybeUninit::zeroed();
let ptr = packed.as_mut_ptr() as *mut PaddingDemo;
let packed_int = unsafe {
std::ptr::write(ptr, data);
packed.assume_init()
};
// Attempt to trigger UB in Miri by reading the
// possibly uninitialized bytes
let copied = unsafe { ptr::read(&packed_int) };
Does that assume_init call triggered undefined behavior? In other words, when the ptr::write copies the struct into the buffer, does it copy the uninitialized-ness of the padding bytes, overwriting the initialized state as zero bytes?
Currently, when this or similar code is run in Miri, it doesn't detect any Undefined Behavior. However, per the discussion about this issue on github, ptr::write is supposedly allowed to copy those padding bytes, and furthermore to copy their uninitialized-ness. Is that true? The docs for ptr::write don't talk about this at all, nor does the nomicon section on uninitialized memory.

Does that assume_init call triggered undefined behavior?
Yes. "Uninitialized" is just another value that a byte in the Rust Abstract Machine can have, next to the usual 0x00 - 0xFF. Let us write this special byte as 0xUU. (See this blog post for a bit more background on this subject.) 0xUU is preserved by copies just like any other possible value a byte can have is preserved by copies.
But the details are a bit more complicated.
There are two ways to copy data around in memory in Rust.
Unfortunately, the details for this are also not explicitly specified by the Rust language team, so what follows is my personal interpretation. I think what I am saying is uncontroversial unless marked otherwise, but of course that could be a wrong impression.
Untyped / byte-wise copy
In general, when a range of bytes is being copied, the source range just overwrites the target range -- so if the source range was "0x00 0xUU 0xUU 0xUU", then after the copy the target range will have that exact list of bytes.
This is what memcpy/memmove in C behave like (in my interpretation of the standard, which is not very clear here unfortunately). In Rust, ptr::copy{,_nonoverlapping} probably performs a byte-wise copy, but it's not actually precisely specified right now and some people might want to say it is typed as well. This was discussed a bit in this issue.
Typed copy
The alternative is a "typed copy", which is what happens on every normal assignment (=) and when passing values to/from a function. A typed copy interprets the source memory at some type T, and then "re-serializes" that value of type T into the target memory.
The key difference to a byte-wise copy is that information which is not relevant at the type T is lost. This is basically a complicated way of saying that a typed copy "forgets" padding, and effectively resets it to uninitialized. Compared to an untyped copy, a typed copy loses more information. Untyped copies preserve the underlying representation, typed copies just preserve the represented value.
So even when you transmute 0usize to PaddingDemo, a typed copy of that value can reset this to "0x00 0xUU 0xUU 0xUU" (or any other possible bytes for the padding) -- assuming data sits at offset 0, which is not guaranteed (add #[repr(C)] if you want that guarantee).
In your case, ptr::write takes an argument of type PaddingDemo, and the argument is passed via a typed copy. So already at that point, the padding bytes may change arbitrarily, in particular they may become 0xUU.
Uninitialized usize
Whether your code has UB then depends on yet another factor, namely whether having an uninitialized byte in a usize is UB. The question is, does a (partially) uninitialized range of memory represent some integer? Currently, it does not and thus there is UB. However, whether that should be the case is heavily debated and it seems likely that we will eventually permit it.
Many other details are still unclear, though -- for example, transmuting "0x00 0xUU 0xUU 0xUU" to an integer may well result in a fully uninitialized integer, i.e., integers may not be able to preserve "partial initialization". To preserve partially initialized bytes in integers we would have to basically say that an integer has no abstract "value", it is just a sequence of (possibly uninitialized) bytes. This does not reflect how integers get used in operations like /. (Some of this also depends on LLVM decisions around poison and freeze; LLVM might decide that when doing a load at integer type, the result is fully poison if any input byte is poison.) So even if the code is not UB because we permit uninitialized integers, it may not behave as expected because the data you want to transfer is being lost.
If you want to transfer raw bytes around, I suggest to use a type suited for that, such as MaybeUninit. If you use an integer type, the goal should be to transfer integer values -- i.e., numbers.

Function for unaligned memory access on ARM

I am working on a project where data is read from memory. Some of this data are integers, and there was a problem accessing them at unaligned addresses. My idea would be to use memcpy for that, i.e.
uint32_t readU32(const void* ptr)
{
uint32_t n;
memcpy(&n, ptr, sizeof(n));
return n;
}
The solution from the project source I found is similar to this code:
uint32_t readU32(const uint32_t* ptr)
{
union {
uint32_t n;
char data[4];
} tmp;
const char* cp=(const char*)ptr;
tmp.data[0] = *cp++;
tmp.data[1] = *cp++;
tmp.data[2] = *cp++;
tmp.data[3] = *cp;
return tmp.n;
}
So my questions:
Isn't the second version undefined behaviour? The C standard says in 6.2.3.2 Pointers, at 7:
A pointer to an object or incomplete type may be converted to a pointer to a different
object or incomplete type. If the resulting pointer is not correctly aligned 57) for the
pointed-to type, the behavior is undefined.
As the calling code has, at some point, used a char* to handle the memory, there must be some conversion from char* to uint32_t*. Isn't the result of that undefined behaviour, then, if the uint32_t* is not corrently aligned? And if it is, there is no point for the function as you could write *(uint32_t*) to fetch the memory. Additionally, I think I read somewhere that the compiler may expect an int* to be aligned correctly and any unaligned int* would mean undefined behaviour as well, so the generated code for this function might make some shortcuts because it may expect the function argument to be aligned properly.
The original code has volatile on the argument and all variables because the memory contents could change (it's a data buffer (no registers) inside a driver). Maybe that's why it does not use memcpy since it won't work on volatile data. But, in which world would that make sense? If the underlying data can change at any time, all bets are off. The data could even change between those byte copy operations. So you would have to have some kind of mutex to synchronize access to this data. But if you have such a synchronization, why would you need volatile?
Is there a canonical/accepted/better solution to this memory access problem? After some searching I come to the conclusion that you need a mutex and do not need volatile and can use memcpy.
P.S.:
# cat /proc/cpuinfo
processor : 0
model name : ARMv7 Processor rev 10 (v7l)
BogoMIPS : 1581.05
Features : swp half thumb fastmult vfp edsp neon vfpv3 tls
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc09
CPU revision : 10

This code
uint32_t readU32(const uint32_t* ptr)
{
union {
uint32_t n;
char data[4];
} tmp;
const char* cp=(const char*)ptr;
tmp.data[0] = *cp++;
tmp.data[1] = *cp++;
tmp.data[2] = *cp++;
tmp.data[3] = *cp;
return tmp.n;
}
passes the pointer as a uint32_t *. If it's not actually a uint32_t, that's UB. The argument should probably be a const void *.
The use of a const char * in the conversion itself is not undefined behavior. Per 6.3.2.3 Pointers, paragraph 7 of the C Standard (emphasis mine):
A pointer to an object type may be converted to a pointer to a
different object type. If the resulting pointer is not correctly
aligned for the referenced type, the behavior is undefined.
Otherwise, when converted back again, the result shall compare
equal to the original pointer. When a pointer to an object is
converted to a pointer to a character type, the result points to the
lowest addressed byte of the object. Successive increments of the
result, up to the size of the object, yield pointers to the remaining
bytes of the object.
The use of volatile with respect to the correct way to access memory/registers directly on your particular hardware would have no canonical/accepted/best solution. Any solution for that would be specific to your system and beyond the scope of standard C.

Implementations are allowed to define behaviors in cases where the Standard does not, and some implementations may specify that all pointer types have the same representation and may be freely cast among each other regardless of alignment, provided that pointers which are actually used to access things are suitably aligned.
Unfortunately, because some obtuse compilers compel the use of "memcpy" as an
escape valve for aliasing issues even when pointers are known to be aligned,
the only way compilers can efficiently process code which needs to make
type-agnostic accesses to aligned storage is to assume that any pointer of a type requiring alignment will always be aligned suitably for such type. As a result, your instinct that approach using uint32_t* is dangerous is spot on. It may be desirable to have compile-time checking to ensure that a function is either passed a void* or a uint32_t*, and not something like a uint16_t* or a double*, but there's no way to declare a function that way without allowing a compiler to "optimize" the function by consolidating the byte accesses into a 32-bit load that will fail if the pointer isn't aligned.

What is the use of __iomem in linux while writing device drivers?

I have seen that __iomem is used to store the return type of ioremap(), but I have used u32 in ARM architecture for it and it works well.
So what difference does __iomem make here? And in which circumstances should I use it exactly?

Lots of type casts are going to just "work well". However, this is not very strict. Nothing stops you from casting a u32 to a u32 * and dereference it, but this is not following the kernel API and is prone to errors.
__iomem is a cookie used by Sparse, a tool used to find possible coding faults in the kernel. If you don't compile your kernel code with Sparse enabled, __iomem will be ignored anyway.
Use Sparse by first installing it, and then adding C=1 to your make call. For example, when building a module, use:
make -C $KPATH M=$PWD C=1 modules
__iomem is defined like this:
# define __iomem __attribute__((noderef, address_space(2)))
Adding (and requiring) a cookie like __iomem for all I/O accesses is a way to be stricter and avoid programming errors. You don't want to read/write from/to I/O memory regions with absolute addresses because you're usually using virtual memory. Thus,
void __iomem *ioremap(phys_addr_t offset, unsigned long size);
is usually called to get the virtual address of an I/O physical address offset, for a specified length size in bytes. ioremap() returns a pointer with an __iomem cookie, so this may now be used with inline functions like readl()/writel() (although it's now preferable to use the more explicit macros ioread32()/iowrite32(), for example), which accept __iomem addresses.
Also, the noderef attribute is used by Sparse to make sure you don't dereference an __iomem pointer. Dereferencing should work on some architecture where the I/O is really memory-mapped, but other architectures use special instructions for accessing I/Os and in this case, dereferencing won't work.
Let's look at an example:
void *io = ioremap(42, 4);
Sparse is not happy:
warning: incorrect type in initializer (different address spaces)
expected void *io
got void [noderef] <asn:2>*
Or:
u32 __iomem* io = ioremap(42, 4);
pr_info("%x\n", *io);
Sparse is not happy either:
warning: dereference of noderef expression
In the last example, the first line is correct, because ioremap() returns its value to an __iomem variable. But then, we deference it, and we're not supposed to.
This makes Sparse happy:
void __iomem* io = ioremap(42, 4);
pr_info("%x\n", ioread32(io));
Bottom line: always use __iomem where it's required (as a return type or as a parameter type), and use Sparse to make sure you did so. Also: do not dereference an __iomem pointer.
Edit: Here's a great LWN article about the inception of __iomem and functions using it.

Simple, Straight and Short (S3) Explanation.
There is an article https://lwn.net/Articles/653585/ for more details.

x86 equivalent for LWARX and STWCX

I'm looking for an equivalent of LWARX and STWCX (as found on the PowerPC processors) or a way to implement similar functionality on the x86 platform. Also, where would be the best place to find out about such things (i.e. good articles/web sites/forums for lock/wait-free programing).
Edit
I think I might need to give more details as it is being assumed that I'm just looking for a CAS (compare and swap) operation. What I'm trying to do is implement a lock-free reference counting system with smart pointers that can be accessed and changed by multiple threads. I basically need a way to implement the following function on an x86 processor.
int* IncrementAndRetrieve(int **ptr)
{
int val;
int *pval;
do
{
// fetch the pointer to the value
pval = *ptr;
// if its NULL, then just return NULL, the smart pointer
// will then become NULL as well
if(pval == NULL)
return NULL;
// Grab the reference count
val = lwarx(pval);
// make sure the pointer we grabbed the value from
// is still the same one referred to by 'ptr'
if(pval != *ptr)
continue;
// Increment the reference count via 'stwcx' if any other threads
// have done anything that could potentially break then it should
// fail and try again
} while(!stwcx(pval, val + 1));
return pval;
}
I really need something that mimics LWARX and STWCX fairly accurately to pull this off (I can't figure out a way to do this with the CompareExchange, swap or add functions I've so far found for the x86).
Thanks

As Michael mentioned, what you're probably looking for is the cmpxchg instruction.
It's important to point out though that the PPC method of accomplishing this is known as Load Link / Store Conditional (LL/SC), while the x86 architecture uses Compare And Swap (CAS). LL/SC has stronger semantics than CAS in that any change to the value at the conditioned address will cause the store to fail, even if the other change replaces the value with the same value that the load was conditioned on. CAS, on the other hand, would succeed in this case. This is known as the ABA problem (see the CAS link for more info).
If you need the stronger semantics on the x86 architecture, you can approximate it by using the x86s double-width compare-and-swap (DWCAS) instruction cmpxchg8b, or cmpxchg16b under x86_64. This allows you to atomically swap two consecutive 'natural sized' words at once, instead of just the usual one. The basic idea is one of the two words contains the value of interest, and the other one contains an always incrementing 'mutation count'. Although this does not technically eliminate the problem, the likelihood of the mutation counter to wrap between attempts is so low that it's a reasonable substitute for most purposes.

x86 does not directly support "optimistic concurrency" like PPC does -- rather, x86's support for concurrency is based on a "lock prefix", see here. (Some so-called "atomic" instructions such as XCHG actually get their atomicity by intrinsically asserting the LOCK prefix, whether the assembly code programmer has actually coded it or not). It's not exactly "bomb-proof", to put it diplomatically (indeed, it's rather accident-prone, I would say;-).

You're probably looking for the cmpxchg family of instructions.
You'll need to precede these with a lock instruction to get equivalent behaviour.
Have a look here for a quick overview of what's available.
You'll likely end up with something similar to this:
mov ecx,dword ptr [esp+4]
mov edx,dword ptr [esp+8]
mov eax,dword ptr [esp+12]
lock cmpxchg dword ptr [ecx],edx
ret 12
You should read this paper...
Edit
In response to the updated question, are you looking to do something like the Boost shared_ptr? If so, have a look at that code and the files in that directory - they'll definitely get you started.

if you are on 64 bits and limit yourself to say 1tb of heap, you can pack the counter into the 24 unused top bits. if you have word aligned pointers the bottom 5 bits are also available.
int* IncrementAndRetrieve(int **ptr)
{
int val;
int *unpacked;
do
{
val = *ptr;
unpacked = unpack(val);
if(unpacked == NULL)
return NULL;
// pointer is on the bottom
} while(!cas(unpacked, val, val + 1));
return unpacked;
}

Don't know if LWARX and STWCX invalidate the whole cache line, CAS and DCAS do. Meaning that unless you are willing to throw away a lot of memory (64 bytes for each independent "lockable" pointer) you won't see much improvement if you are really pushing your software into stress. The best results I've seen so far were when people consciously casrificed 64b, planed their structures around it (packing stuff that won't be subject of contention), kept everything alligned on 64b boundaries, and used explicit read and write data barriers. Cache line invalidation can cost approx 20 to 100 cycles, making it a bigger real perf issue then just lock avoidance.
Also, you'd have to plan different memory allocation strategy to manage either controlled leaking (if you can partition code into logical "request processing" - one request "leaks" and then releases all it's memory bulk at the end) or datailed allocation management so that one structure under contention never receives memory realesed by elements of the same structure/collection (to prevent ABA). Some of that can be very counter-intuitive but it's either that or paying the price for GC.

What you are trying to do will not work the way you expect. What you implemented above can be done with the InterlockedIncrement function (Win32 function; assembly: XADD).
The reason that your code does not do what you think it does is that another thread can still change the value between the second read of *ptr and stwcx without invalidating the stwcx.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string