virtual to physical address conversion in linux kernel - linux

The following is used to translate virtual address to physical address in linux kernel. But what does it mean?
I have very limited knowledge of assembly
163 #define __pv_stub(from,to,instr,type) \
164 __asm__("# __pv_stub\n" \
165 "1: " instr " %0, %1, %2\n" \
166 " .pushsection .pv_table,\"a\"\n" \
167 " .long 1b\n" \
168 " .popsection\n" \
169 : "=r" (to) \
170 : "r" (from), "I" (type))

It's not really "assembly" as there is no instruction in this macro per se.
It's just a macro which inserts instr (an instruction passed to the macro) which has one input operand from, one immediate (constant) input operand type and a output operand to.
There is also the part between pushsection and popsection which records in a specific binary section pv_table the address of this instruction. That allows the kernel to find these places in its code if it wishes to.
The last part is the asm constraints and operands. It lists what the compiler will replace %0, %1 and %2 with. %0 is the first listed ("=r"(to)), it means that %0 will be any general purpose register, that is an output operand that will be stored in the macro argument to. The other 2 are similar except they're input operands: from is a register so gets "r" but type is an immediate so is "i"
See http://gcc.gnu.org/onlinedocs/gcc-4.8.1/gcc/Extended-Asm.html#Extended-Asm for details
So consider this code from the kernel (http://lxr.linux.no/linux+v3.9.4/arch/arm/include/asm/memory.h#L172)
static inline unsigned long __virt_to_phys(unsigned long x)
{ unsigned long t;
__pv_stub(x, t, "add", __PV_BITS_31_24);
return t;
}
__pv_stub will be equivalent to t = x + __PV_BITS_31_24 (instr == add, from == x, to == t, type == __PV_BITS_31_24)
So you might wonder why anybody would do such a complicated thing instead of just writing t = x + __PV_BITS_31_24 in the code.
The reason is the pv_table I mentioned above. The address of all these statements is recorded in a specific elf section. Under some circumstances, the kernel patches these instructions at runtime (so needs to be able to easily find all of them) hence the need for a table.
The ARM port does exactly that here: http://lxr.linux.no/linux+v3.9.4/arch/arm/kernel/head.S#L541
It's used only if CONFIG_ARM_PATCH_PHYS_VIRT is used to compile the kernel:
CONFIG_ARM_PATCH_PHYS_VIRT:
Patch phys-to-virt and virt-to-phys translation functions at
boot and module load time according to the position of the
kernel in system memory.
This can only be used with non-XIP MMU kernels where the base
of physical memory is at a 16MB boundary, or theoretically 64K
for the MSM machine class.

Related

NtQueryObject returns wrong insufficient required size via WOW64, why?

I am using the NT native API NtQueryObject()/ZwQueryObject() from user mode (and I am aware of the risks in general and I have written kernel mode drivers for Windows in the past in my professional capacity).
Generally when one uses the typical "query information" function (of which there are a few) the protocol is first to ask with a too small buffer to retrieve the required size with STATUS_INFO_LENGTH_MISMATCH, then allocate a buffer of said size and query again -- this time using the buffer and previously returned size.
In order to get the list of object types (67 on my build) on the system I am doing just that:
ULONG Size = 0;
NTSTATUS Status = NtQueryObject(NULL, ObjectTypesInformation, &Size, sizeof(Size), &Size);
And in Size I get 8280 (WOW64) and 8968 (x64). I then proceed to allocate the buffer with calloc() and query again:
ULONG Size2 = 0;
BYTE* Buf = (BYTE*)::calloc(1, Size);
Status = NtQueryObject(NULL, ObjectTypesInformation, Buf, Size, &Size2);
NB: ObjectTypesInformation is 3. It isn't declared in winternl.h, but Nebbett (as ObjectAllTypesInformation) and others describe it. Since I am not querying for a particular object's traits but the system-wide list of object types, I pass NULL for the object handle.
Curiously on WOW64, i.e. 32-bit, the value in Size2 upon return from the second query is 16 Bytes (= 8296) bigger than the previously returned required size.
As far as alignment is concerned, I'd expect at most 8 Bytes for this sort of thing and indeed neither 8280 nor 8296 are at a 16 Byte alignment boundary, but on an 8 Byte one.
Certainly I can add some slack space on top of the returned required size (e.g. ALIGN_UP to the next 32 Byte alignment boundary), but this seems highly irregular to be honest. And I'd rather want to understand what's going on than to implement a workaround that breaks, because I miss something crucial.
The practical issue for the code is that in Debug configurations it tells me there's a corrupted heap somewhere, upon freeing Buf. Which suggests that NtQueryObject() was indeed writing these extra 16 Bytes beyond the buffer I provided.
Question: Any idea why it is doing that?
As usual for NT native API the sources of information are scarce. The x64 version of the exact same code returns the exact number of bytes required. So my thinking here is that WOW64 is the issue. A somewhat cursory look into wow64.dll with IDA didn't reveal any immediate points for suspicion regarding what goes wrong in translating the results to 32-bit here.
PS: Windows 10 (10.0.19043, ntdll.dll "timestamp" 77755782)
PPS: this may be related: https://wj32.org/wp/2012/11/30/obquerytypeinfo-and-ntqueryobject-buffer-overrun-in-windows-8/ Tested it, by checking that OBJECT_TYPE_INFORMATION::TypeName.Length + sizeof(WCHAR) == OBJECT_TYPE_INFORMATION::TypeName.MaximumLength in all returned items, which was the case.
The only part of ObjectTypesInformation that's public is the first field defined in winternl.h header in the Windows SDK:
typedef struct __PUBLIC_OBJECT_TYPE_INFORMATION {
UNICODE_STRING TypeName;
ULONG Reserved [22]; // reserved for internal use
} PUBLIC_OBJECT_TYPE_INFORMATION, *PPUBLIC_OBJECT_TYPE_INFORMATION;
For x86 this is 96 bytes, and for x64 this is 104 bytes (assuming you have the right packing mode enabled). The difference is the pointer in UNICODE_STRING which changes the alignment in x64.
Any additional memory space should be related to the TypeName buffer.
UNICODE_STRING accounts for 8 bytes of the difference between 8280 and 8296. The function uses the sizeof(ULONG_PTR) for alignment of the returned string plus an extra WCHAR, so that could easily account for the remaining 8 bytes.
AFAIK: The public use of NtQueryObject is supposed to be limited to kernel-mode use which of course means it always matches the OS native bitness (x86 code can't run as kernel in x64 native OS), so it's probably just a quirk of using the NT functions via the WOW64 thunk.
Alright, I think I figured out the issue with the help of WinDbg and a thorough look at wow64.dll using IDA.
NB: the wow64.dll I have has the same build number, but differs slightly in data only (checksum, security directory entry, pieces from version resources). The code is identical, which was to be expected, given deterministic builds and how they affect the PE timestamp.
There's an internal function called whNtQueryObject_SpecialQueryCase (according to PDBs), which covers the ObjectTypesInformation class queries.
For the above wow64.dll I used the following points of interest in WinDbg, from a 32 bit program which calls NtQueryObject(NULL, ObjectTypesInformation, ...) (the program itself is irrelevant, though):
0:000> .load wow64exts
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B0E0
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B14E
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B1A7
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B24A
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B252
Explanation of the above points of interest:
+B0E0: computing length required for 64 bit query, based on passed length for 32 bit
+B14E: call to NtQueryObject()
+B1A7: loop body for copying 64 to 32 bit buffer contents, after successful NtQueryObject() call
+B24A: computing written length by subtracting current (last + 1) entry from base buffer address
+B252: downsizing returned (64 bit) required length to 32 bit
The logic of this function in regards to just ObjectTypesInformation is roughly as follows:
Common steps
Take the ObjectInformationLength (32 bit query!) argument and size it up to fit the 64 bit info
Align the retrieved size up to the next 16 byte boundary
If necessary allocate the resulting amount from some PEB::ProcessHeap and store in TLS slot 3; otherwise using this as a scratch space
Call NtQueryObject() passing the buffer and length from the two previous steps
The length passed to NtQueryObject() is the one from step 1, not the one aligned to a 16 byte boundary. There seems to be some sort of header to this scratch space, so perhaps that's where the 16 byte alignment comes from?
Case 1: buffer size too small (here: 4), just querying required length
The up-sized length in this case equals 4, which is too small and consequently NtQueryObject() returns STATUS_INFO_LENGTH_MISMATCH. Required size is reported as 8968.
Down-size from the 64 bit required length to 32 bit and end up 16 bytes too short
Return the status from NtQueryObject() and the down-sized required length form the previous step
Case 2: buffer size supposedly (!) sufficient
Copy OBJECT_TYPES_INFORMATION::NumberOfTypes from queried buffer to 32 bit one
Step to the first entry (OBJECT_TYPE_INFORMATION) of source (64 bit) and target (32 bit) buffer, 8 and 4 byte aligned respectively
For for each entry up to OBJECT_TYPES_INFORMATION::NumberOfTypes:
Copy UNICODE_STRING::Length and UNICODE_STRING::MaximumLength for TypeName member
memcpy() UNICODE_STRING::Length bytes from source to target UNICODE_STRING::Buffer (target entry + sizeof(OBJECT_TYPE_INFORMATION32)
Add terminating zero (WCHAR) past the memcpy'd string
Copy the individual members past the TypeName from 64 to 32 bit struct
Compute pointer of next entry by aligning UNICODE_STRING::MaximumLength up to an 8 byte boundary (i.e. the ULONG_PTR alignment mentioned in the other answer) + sizeof(OBJECT_TYPE_INFORMATION64) (already 8 byte aligned!)
The next target entry (32 bit) gets 4 byte aligned instead
At the end compute required (32 bit) length by subtracting the value we arrived at for the "next" entry (i.e. one past the last) from the base address of the buffer passed by the WOW64 program (32 bit) to NtQueryObject()
In my debugged scenario these were: 0x008ce050 - 0x008cbfe8 = 0x00002068 (= 8296), which is 16 bytes larger than the buffer length we were told during case 1 (8280)!
The issue
That crucial last step differs between merely querying and actually getting the buffer filled. There is no further bounds checking in that loop I described for case 2.
And this means it will just overrun the passed buffer and return a written length bigger than the buffer length passed to it.
Possible solutions and workarounds
I'll have to approach this mathematically after some sleep, the workaround is obviously to top up the required length returned from case 1 in order to avoid the buffer overrun. The easiest method is to use my up_size_from_32bit() from the example below and use that on the returned required size. This way you are allocating enough for the 64 bit buffer, while querying the 32 bit one. This should never overrun during the copy loop.
However, the fix in wow64.dll is a little more involved, I guess. While adding bounds checking to the loop would help avert the overrun, it would mean that the caller would have to query for the required size twice, because the first time around it lies to us.
Which means the query-only case (1) would have to allocate that internal buffer after querying the required length for 64 bit, then get it filled and then walk the entries (just like the copy loop), skipping over the last entry to compute the required length the same as it is now done after the copy loop.
Example program demonstrating the "static" computation by wow64.dll
Build for x64, just the way wow64.dll was!
#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#include <cstdio>
typedef struct
{
ULONG JustPretending[24];
} OBJECT_TYPE_INFORMATION32;
typedef struct
{
ULONG JustPretending[26];
} OBJECT_TYPE_INFORMATION64;
constexpr ULONG size_delta_3264 = sizeof(OBJECT_TYPE_INFORMATION64) - sizeof(OBJECT_TYPE_INFORMATION32);
constexpr ULONG down_size_to_32bit(ULONG len)
{
return len - size_delta_3264 * ((len - 4) / sizeof(OBJECT_TYPE_INFORMATION64));
}
constexpr ULONG up_size_from_32bit(ULONG len)
{
return len + size_delta_3264 * ((len - 4) / sizeof(OBJECT_TYPE_INFORMATION32));
}
// Trying to mimic the wdm.h macro
constexpr size_t align_up_by(size_t address, size_t alignment)
{
return (address + (alignment - 1)) & ~(alignment - 1);
}
constexpr auto u32 = 8280UL;
constexpr auto u64 = 8968UL;
constexpr auto from_64 = down_size_to_32bit(u64);
constexpr auto from_32 = up_size_from_32bit(u32);
constexpr auto from_32_16_byte_aligned = (ULONG)align_up_by(from_32, 16);
int wmain()
{
wprintf(L"32 to 64 bit: %u -> %u -(16-byte-align)-> %u\n", u32, from_32, from_32_16_byte_aligned);
wprintf(L"64 to 32 bit: %u -> %u\n", u64, from_64);
return 0;
}
static_assert(sizeof(OBJECT_TYPE_INFORMATION32) == 96, "Size for 64 bit struct does not match.");
static_assert(sizeof(OBJECT_TYPE_INFORMATION64) == 104, "Size for 64 bit struct does not match.");
static_assert(u32 == from_64, "Must match (from 64 to 32 bit)");
static_assert(u64 == from_32, "Must match (from 32 to 64 bit)");
static_assert(from_32_16_byte_aligned % 16 == 0, "16 byte alignment failed");
static_assert(from_32_16_byte_aligned > from_32, "We're aligning up");
This does not mimic the computation that happens in case 2, though.

Why physical address of Aarch64 kernel image is nonnegative?

I'm recently learning about booting system of Linux kernel. (v4.6, with ARM64 arch.)
In the source code arch/arm64/kernel/head.S, definition of __PHYS_OFFSET is:
#define __PHYS_OFFSET (KERNEL_START - TEXT_OFFSET)
where KERNEL_START is simply defined to be _text section.
And if I'm right, TEXT_OFFSET is a random number determined during kernel compile, as /arch/arm64/Makefile says:
TEXT_OFFSET := $(shell awk 'BEGIN {srand(); printf "0x%03x000\n", int(512 * rand())}')
so that the kernel image file has random location, as the linker script /arch/arm64/kernel/vmlinux.lds.S includes:
. = KIMAGE_VADDR + TEXT_OFFSET;
.head.text : {
_text = .;
HEAD_TEXT
}
Here, KIMAGE_VADDR is a virtual address 0xFFFF000000000000 + 128M. Since TEXT_OFFSET is added, section _text will be randomly located.
Rest parts of head.S map KIMAGE_VADDR to __PHYS_OFFSET to enable MMU.
My question is this: is __PHYS_OFFSET = _text - TEXT_OFFSET always nonnegative?
I don't know where would be exact physical location of _text, but I think 512 * rand() might be as big as 512 * 32767 ~ 10MB.
Do I make sense? Is there any reason makes these codes safe?
vmlinux.lds.S does:
. = KIMAGE_VADDR + TEXT_OFFSET;
followed by
_text = .;
So _text = KIMAGE_VADDR + TEXT_OFFSET. When you then subtract TEXT_OFFSET, __PHYS_OFFSET will be the same as KIMAGE_VADDR.
Thus, if KIMAGE_VADDR is non-negative, so is __PHYS_OFFSET.

any way to stop unaligned access from c++ standard library on x86_64?

I am trying to check for any unaligned reads in my program. I enable unaligned access processor exception via (using x86_64 on g++ on linux kernel 3.19):
asm volatile("pushf \n"
"pop %%rax \n"
"or $0x40000, %%rax \n"
"push %%rax \n"
"popf \n" ::: "rax");
I do an optional forced unaligned read which triggers the exception so i know its working. After i disable that I get an error in a piece of code which otherwise seems fine :
char fullpath[eMaxPath];
snprintf(fullpath, eMaxPath, "%s/%s", "blah", "blah2");
the stacktrace shows a failure via __memcpy_sse2 which leads me to suspect that the standard library is using sse to fulfill my memcpy but it doesnt realize that i have now made unaligned reads unacceptable.
Is my thinking correct and is there any way around this (ie can i make the standard library use an unaligned safe sprintf/memcpy instead)?
thanks
While I hate to discourage an admirable notion, you're playing with fire, my friend.
It's not merely sse2 access but any unaligned access. Even a simple int fetch.
Here's a test program:
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <malloc.h>
void *intptr;
void
require_aligned(void)
{
asm volatile("pushf \n"
"pop %%rax \n"
"or $0x00040000, %%eax \n"
"push %%rax \n"
"popf \n" ::: "rax");
}
void
relax_aligned(void)
{
asm volatile("pushf \n"
"pop %%rax \n"
"andl $0xFFFBFFFF, %%eax \n"
"push %%rax \n"
"popf \n" ::: "rax");
}
void
msg(const char *str)
{
int len;
len = strlen(str);
write(1,str,len);
}
void
grab(void)
{
volatile int x = *(int *) intptr;
}
int
main(void)
{
setlinebuf(stdout);
// minimum alignment from malloc is [usually] 8
intptr = malloc(256);
printf("intptr=%p\n",intptr);
// normal access to aligned pointer
msg("normal\n");
grab();
// enable alignment check exception
require_aligned();
// access aligned pointer under check [will be okay]
msg("aligned_norm\n");
grab();
// this grab will generate a bus error
intptr += 1;
msg("aligned_except\n");
grab();
return 0;
}
The output of this is:
intptr=0x1996010
normal
aligned_norm
aligned_except
Bus error (core dumped)
The program generated this simply because of an attempted 4 byte int fetch from address 0x1996011 [which is odd and not a multiple of 4].
So, once you turn on the AC [alignment check] flag, even simple things will break.
IMO, if you truly have some things that are not aligned optimally and are trying to find them, using printf, instrumenting your code with debug asserts, or using gdb with some special watch commands or breakpoints with condition statements are a better/safer way to go
UPDATE:
I a using my own custom allocator am preparing my code to run on an architecture that doesnt suport unaligned read/writes so I want to make sure my code will not break on that architecture.
Fair enough.
Side note: My curiousity has gotten the better of me as the only [major] arches I can recall [at the moment] that have this issue are Motorola mc68000 and older IBM mainframes (e.g. IBM System 370).
One practical reason for my curiosity is that for certain arches (e.g. ARM/android, MIPS) there are emulators available. You could rebuild the emulator from source, adding any extra checks, if needed. Otherwise, doing your debugging under the emulator might be an option.
I can trap unaligned read/write using either the asm , or via gdb but both cause SIGBUS which i cant continue from in gdb and im getting too many false positives from std library (in the sense that their implementation would be aligned access only on the target).
I can tell you from experience that trying to resume from a signal handler after this doesn't work too well [if at all]. Using gdb is the best bet if you can eliminate the false positives by having AC off in the standard functions [see below].
Ideally i guess i would like to use something like perf to show me callstacks that have misaligned but so far no dice.
This is possible, but you'd have to verify that perf even reports them. To see, you could try perf against my original test program above. If it works, the "counter" should be zero before and one after.
The cleanest way may be to pepper your code with "assert" macros [that can be compiled in and out with a -DDEBUG switch].
However, since you've gone to the trouble of laying the groundwork, it may be worthwhile to see if the AC method can work.
Since you're trying to debug your memory allocator, you only need AC on in your functions. If one of your functions calls libc, disable AC, call the function, and then reenable AC.
A memory allocator is fairly low level, so it can't rely on too many standard functions. Most standard functions rely on being able to call malloc. So, you might also want to consider a vtable interface to the rest of the [standard] library.
I've coded some slightly different AC bit set/clear functions. I put them into a .S function to eliminate inline asm hassles.
I've coded up a simple sample usage in three files.
Here are the AC set/clear functions:
// acbit/acops.S -- low level AC [alignment check] operations
#define AC_ON $0x00040000
#define AC_OFF $0xFFFFFFFFFFFBFFFF
.text
// acpush -- turn on AC and return previous mask
.globl acpush
acpush:
// get old mask
pushfq
pop %rax
mov %rax,%rcx // save to temp
or AC_ON,%ecx // turn on AC bit
// set new mask
push %rcx
popfq
ret
// acpop -- restore previous mask
.globl acpop
acpop:
// get current mask
pushfq
pop %rax
and AC_OFF,%rax // clear current AC bit
and AC_ON,%edi // isolate the AC bit in argument
or %edi,%eax // lay it in
// set new mask
push %rax
popfq
ret
// acon -- turn on AC
.globl acon
acon:
jmp acpush
// acoff -- turn off AC
.globl acoff
acoff:
// get current mask
pushfq
pop %rax
and AC_OFF,%rax // clear current AC bit
// set new mask
push %rax
popfq
ret
Here is a header file that has the function prototypes and some "helper" macros:
// acbit/acbit.h -- common control
#ifndef _acbit_acbit_h_
#define _acbit_acbit_h_
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <malloc.h>
typedef unsigned long flags_t;
#define VARIABLE_USED(_sym) \
do { \
if (1) \
break; \
if (!! _sym) \
break; \
} while (0)
#ifdef ACDEBUG
#define ACPUSH \
do { \
flags_t acflags = acpush()
#define ACPOP \
acpop(acflags); \
} while (0)
#define ACEXEC(_expr) \
do { \
acoff(); \
_expr; \
acon(); \
} while (0)
#else
#define ACPUSH /**/
#define ACPOP /**/
#define ACEXEC(_expr) _expr
#endif
void *intptr;
flags_t
acpush(void);
void
acpop(flags_t omsk);
void
acon(void);
void
acoff(void);
#endif
Here is a sample program that uses all of the above:
// acbit/acbit2 -- sample allocator
#include <acbit.h>
// mymalloc1 -- allocation function [raw calls]
void *
mymalloc1(size_t len)
{
flags_t omsk;
void *vp;
// function prolog
// NOTE: do this on all "outer" (i.e. API) functions
omsk = acpush();
// do lots of stuff ...
vp = NULL;
// encapsulate standard library calls like this to prevent false positives:
acoff();
printf("%p\n",vp);
acon();
// function epilog
acpop(omsk);
return vp;
}
// mymalloc2 -- allocation function [using helper macros]
void *
mymalloc2(size_t len)
{
void *vp;
// function prolog
ACPUSH;
// do lots of stuff ...
vp = NULL;
// encapsulate standard library calls like this to prevent false positives:
ACEXEC(printf("%p\n",vp));
// function epilog
ACPOP;
return vp;
}
int
main(void)
{
int x;
setlinebuf(stdout);
// minimum alignment from malloc is [usually] 8
intptr = mymalloc1(256);
intptr = mymalloc2(256);
x = *(int *) intptr;
return x;
}
UPDATE #2:
I like the idea of disabling the check before any library calls.
If the AC H/W works and you wrap the library calls, this should yield no false positives. The only exception would be if the compiler made a call to its internal helper library (e.g. doing 64 bit divide on 32 bit machine, etc.).
Be aware/wary of the ELF loader (e.g. /lib64/ld-linux-x86-64.so.2) doing dynamic symbol resolution on "lazy" bindings of symbols. Shouldn't be a big problem. There are ways to force the relocations to occur at program start, if necessary.
I have given up on perf for this as it seems to show me garbage even for a simple program like the one you wrote.
The perf code in the kernel is complex enough that it may be more trouble than it's worth. It has to communicate with the perf program with a pipe [IIRC]. Also, doing the AC thing is [probably] uncommon enough that the kernel's code path for this isn't well tested.
Im using ocperf with misalign_mem_ref.loads and stores but either way the counters dont correlate at all. If i record and look at the callstacks i get completely unrecognizable callstacks for these counters so i suspect either the counter doesnt work on my hardware/perf or it doesnt actually count what i think it counts
I honestly don't know if perf handles process reschedules to different cores properly [or not]--it should [IMO]. But, using sched_setaffinity to lock your program to a single core might help.
But, using the AC bit is more direct and definitive, IMO. I think that's the better bet.
I've talked about adding "assert" macros in the code.
I've coded some up below. These are what I'd use. They are independent of the AC code. But, they can also be used in conjunction with the AC bit code in a "belt and suspenders" approach.
These macros have one distinct advantage. When properly [and liberally] inserted, they can check for bad pointer values at the time they're calculated. That is, much closer to the true source of the problem.
With AC, you may calculate a bad value, but AC only kicks in [sometime] later, when the pointer is dereferenced [which may not happen in your API code at all].
I've done a complete memory allocator before [with overrun checks and "guard" pages, etc.]. The macro approach is what I used. And, if I had only one tool for this, it is the one I'd use. So, I recommend it above all else.
But, as I said, it can be used with the AC code as well.
Here's the header file for the macros:
// acbit/acptr.h -- alignment check macros
#ifndef _acbit_acptr_h_
#define _acbit_acptr_h_
#include <stdio.h>
typedef unsigned int u32;
// bit mask for given width
#define ACMSKOFWID(_wid) \
((1u << (_wid)) - 1)
#ifdef ACDEBUG2
#define ACPTR_MSK(_ptr,_msk) \
acptrchk(_ptr,_msk,__FILE__,__LINE__)
#else
#define ACPTR_MSK(_ptr,_msk) /**/
#endif
#define ACPTR_WID(_ptr,_wid) \
ACPTR_MSK(_ptr,(_wid) - 1)
#define ACPTR_TYPE(_ptr,_typ) \
ACPTR_WID(_ptr,sizeof(_typ))
// acptrfault -- pointer alignment fault
void
acptrfault(const void *ptr,const char *file,int lno);
// acptrchk -- check pointer for given alignment
static inline void
acptrchk(const void *ptr,u32 msk,const char *file,int lno)
{
#ifdef ACDEBUG2
#if ACDEBUG2 >= 2
printf("acptrchk: TRACE ptr=%p msk=%8.8X file='%s' lno=%d\n",
ptr,msk,file,lno);
#endif
if (((unsigned long) ptr) & msk)
acptrfault(ptr,file,lno);
#endif
}
#endif
Here's the "fault" handler function:
// acbit/acptr -- alignment check macros
#include <acbit/acptr.h>
#include <acbit/acbit.h>
#include <stdlib.h>
// acptrfault -- pointer alignment fault
void
acptrfault(const void *ptr,const char *file,int lno)
{
// NOTE: it's easy to set a breakpoint on this function
printf("acptrfault: pointer fault -- ptr=%p file='%s' lno=%d\n",
ptr,file,lno);
exit(1);
}
And, here's a sample program that uses them:
// acbit/acbit3 -- sample allocator using check macros
#include <acbit.h>
#include <acptr.h>
static double static_array[20];
// mymalloc3 -- allocation function
void *
mymalloc3(size_t len)
{
void *vp;
// get something valid
vp = static_array;
// do lots of stuff ...
printf("BEF vp=%p\n",vp);
// check pointer
// NOTE: these can be peppered after every [significant] calculation
ACPTR_TYPE(vp,double);
// do something bad ...
vp += 1;
printf("AFT vp=%p\n",vp);
// check again -- this should fault
ACPTR_TYPE(vp,double);
return vp;
}
int
main(void)
{
int x;
setlinebuf(stdout);
// minimum alignment from malloc is [usually] 8
intptr = mymalloc3(256);
x = *(int *) intptr;
return x;
}
Here's the program output:
BEF vp=0x601080
acptrchk: TRACE ptr=0x601080 msk=00000007 file='acbit/acbit3.c' lno=22
AFT vp=0x601081
acptrchk: TRACE ptr=0x601081 msk=00000007 file='acbit/acbit3.c' lno=29
acptrfault: pointer fault -- ptr=0x601081 file='acbit/acbit3.c' lno=29
I left off the AC code in this example. On your real target system, the dereference of intptr in main would/should fault on an alignment, but notice how much later that is in the execution timeline.
Like I commented on the question, that asm isn't safe, because it steps on the red-zone. Instead, use
asm volatile ("add $-128, %rsp\n\t"
"pushf\n\t"
"orl $0x40000, (%rsp)\n\t"
"popf\n\t"
"sub $-128, %rsp\n\t"
);
(-128 fits in a sign-extended 8bit immediate, but 128 doesn't, hence using add $-128 to subtract 128.)
Or in this case, there are dedicated instructions for toggling that bit, like there are for the carry and direction flags:
asm("stac"); // Set AC flag
asm("clac"); // Clear AC flag
It's a good idea to have some idea when your code uses unaligned memory. It's not necessarily a good idea to change your code to avoid it in every case. Sometimes better locality from packing data closer together is more valuable.
Given that you shouldn't necessarily aim to eliminate all unaligned accesses anyway, I don't think this is the easiest way to find the ones you do have.
modern x86 hardware has fast hardware support for unaligned loads/stores. When they don't span a cache-line boundary, or lead to store-forwarding stalls, there's literally no penalty.
What you might try is looking at performance counters for some of these events:
misalign_mem_ref.loads [Speculative cache line split load uops dispatched to L1 cache]
misalign_mem_ref.stores [Speculative cache line split STA uops dispatched to L1 cache]
ld_blocks.store_forward [This event counts loads that followed a store to the same address, where the data could not be forwarded inside the pipeline from the store to the load.
The most common reason why store forwarding would be blocked is when a load's address range overlaps with a preceeding smaller uncompleted store.
See the table of not supported store forwards in the Intel? 64 and IA-32 Architectures Optimization Reference Manual.
The penalty for blocked store forwarding is that the load must wait for the store to complete before it can be issued.]
(from ocperf.py list output on my Sandybridge CPU).
There are probably other ways to detect unaligned memory access. Maybe valgrind? I searched on valgrind detect unaligned and found this mailing list discussion from 13 years ago. Probably still not implemented.
The hand-optimized library functions do use unaligned accesses because it's the fastest way for them to get their job done. e.g. copying bytes 6 to 13 of a string to somewhere else can and should be done with just a single 8-byte load/store.
So yes, you would need special slow&safe versions of library functions.
If your code would have to execute extra instructions to avoid using unaligned loads, it's often not worth it. Esp. if the input is usually aligned, having a loop that does the first up-to-alignment-boundary elements before starting the main loop may just slow things down. In the aligned case, everything works optimally, with no overhead of checking alignment. In the unaligned case, things might work a few percent slower, but as long as the unaligned cases are rare, it's not worth avoiding them.
Esp. if it's not SSE code, since non-AVX legacy SSE can only fold loads into memory operands for ALU instructions when alignment is guaranteed.
The benefit of having good-enough hardware support for unaligned memory ops is that software can be faster in the aligned case. It can leave alignment-handling to hardware, instead of running extra instructions to handle pointers that are probably aligned. (Linus Torvalds had some interesting posts about this on the http://realworldtech.com/ forums, but they're not searchable so I can't find it.
You're not going to like it, but there is only one answer: don't link against the standard libraries. By changing that setting you have changed the ABI and the standard library doesn't like it. memcpy and friends are hand-written assembly so it's not a matter of compiler options to convince the compiler to do something else.

Compiling a library on a 64 bit architecture: Incorrect register `%rax' used with `l' suffix

I have to compile a library on a 64 bit architecture, anyway I get that error.
The lines of code affected by the error are in assembler, here's an example (they are all very similar):
//=== get the index to write ===///
__asm__ __volatile__ ("lock; xaddl %0,%1"
: "=r" (indexToWrite), "=m" ( indexTable[entityId] )
: "0" (1), "m" ( indexTable[entityId] ));
can you help me out?
I am under linux 64bit (ubuntu) and I am using gcc.
Use the k operand modifier to select a 32-bit sub-register: xaddl %k0,%1.
The syntax: xaddl %k0,%k1 is also harmless, since %1 is mem addr anyway.
The operand modifiers for 8, 16, 32, and 64 bits are b, w, k, q respectively.
The second "m" in the input list seems suspect to me. I might be wrong, but I think it should be:
"1" (indexTable[entityId])
With xadd I don't suppose it matters, but this would technically be argument %3 otherwise.
Personally, I'd go with:
: "=r" (indexToWrite), "+m" (indexTable[entityId]) : "0" (1)
And yes, "+m" is perfectly legal. It has been for a long time, and was only recently corrected as a bug in the documentation of gcc!

Is it possible to store pointers in shared memory without using offsets?

When using shared memory, each process may mmap the shared region into a different area of its respective address space. This means that when storing pointers within the shared region, you need to store them as offsets of the start of the shared region. Unfortunately, this complicates use of atomic instructions (e.g. if you're trying to write a lock free algorithm). For example, say you have a bunch of reference counted nodes in shared memory, created by a single writer. The writer periodically atomically updates a pointer 'p' to point to a valid node with positive reference count. Readers want to atomically write to 'p' because it points to the beginning of a node (a struct) whose first element is a reference count. Since p always points to a valid node, incrementing the ref count is safe, and makes it safe to dereference 'p' and access other members. However, this all only works when everything is in the same address space. If the nodes and the 'p' pointer are stored in shared memory, then clients suffer a race condition:
x = read p
y = x + offset
Increment refcount at y
During step 2, p may change and x may no longer point to a valid node. The only workaround I can think of is somehow forcing all processes to agree on where to map the shared memory, so that real pointers rather than offsets can be stored in the mmap'd region. Is there any way to do that? I see MAP_FIXED in the mmap documentation, but I don't know how I could pick an address that would be safe.
Edit: Using inline assembly and the 'lock' prefix on x86 maybe it's possible to build a "increment ptr X with offset Y by value Z"? Equivalent options on other architectures? Haven't written a lot of assembly, don't know if the needed instructions exist.
On low level the x86 atomic inctruction can do all this tree steps at once:
x = read p
y = x + offset Increment
refcount at y
//
mov edi, Destination
mov edx, DataOffset
mov ecx, NewData
#Repeat:
mov eax, [edi + edx] //load OldData
//Here you can also increment eax and save to [edi + edx]
lock cmpxchg dword ptr [edi + edx], ecx
jnz #Repeat
//
This is trivial on a UNIX system; just use the shared memory functions:
shgmet, shmat, shmctl, shmdt
void *shmat(int shmid, const void *shmaddr, int shmflg);
shmat() attaches the shared memory
segment identified by shmid to the
address space of the calling process.
The attaching address is specified by
shmaddr with one of the following
criteria:
If shmaddr is NULL, the system chooses
a suitable (unused) address at which
to attach the segment.
Just specify your own address here; e.g. 0x20000000000
If you shmget() using the same key and size in every process, you will get the same shared memory segment. If you shmat() at the same address, the virtual addresses will be the same in all processes. The kernel doesn't care what address range you use, as long as it doesn't conflict with wherever it normally assigns things. (If you leave out the address, you can see the general region that it likes to put things; also, check addresses on the stack and returned from malloc() / new[] .)
On Linux, make sure root sets SHMMAX in /proc/sys/kernel/shmmax to a large enough number to accommodate your shared memory segments (default is 32MB).
As for atomic operations, you can get them all from the Linux kernel source, e.g.
include/asm-x86/atomic_64.h
/*
* Make sure gcc doesn't try to be clever and move things around
* on us. We need to use _exactly_ the address the user gave us,
* not some alias that contains the same information.
*/
typedef struct {
int counter;
} atomic_t;
/**
* atomic_read - read atomic variable
* #v: pointer of type atomic_t
*
* Atomically reads the value of #v.
*/
#define atomic_read(v) ((v)->counter)
/**
* atomic_set - set atomic variable
* #v: pointer of type atomic_t
* #i: required value
*
* Atomically sets the value of #v to #i.
*/
#define atomic_set(v, i) (((v)->counter) = (i))
/**
* atomic_add - add integer to atomic variable
* #i: integer value to add
* #v: pointer of type atomic_t
*
* Atomically adds #i to #v.
*/
static inline void atomic_add(int i, atomic_t *v)
{
asm volatile(LOCK_PREFIX "addl %1,%0"
: "=m" (v->counter)
: "ir" (i), "m" (v->counter));
}
64-bit version:
typedef struct {
long counter;
} atomic64_t;
/**
* atomic64_add - add integer to atomic64 variable
* #i: integer value to add
* #v: pointer to type atomic64_t
*
* Atomically adds #i to #v.
*/
static inline void atomic64_add(long i, atomic64_t *v)
{
asm volatile(LOCK_PREFIX "addq %1,%0"
: "=m" (v->counter)
: "er" (i), "m" (v->counter));
}
We have code that's similar to your problem description. We use a memory-mapped file, offsets, and file locking. We haven't found an alternative.
You shouldn't be afraid to make up an address at random, because the kernel will just reject addresses it doesn't like (ones that conflict). See my shmat() answer above, using 0x20000000000
With mmap:
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
If addr is not NULL, then the kernel
takes it as a hint about where to
place the mapping; on Linux, the
mapping will be created at the next
higher page boundary. The address of
the new mapping is returned as the
result of the call.
The flags argument determines whether
updates to the mapping are visible to
other processes mapping the same
region, and whether updates are
carried through to the underlying
file. This behavior is determined by
including exactly one of the following
values in flags:
MAP_SHARED Share this mapping.
Updates to the mapping are visible to
other processes that map this
file, and are carried through to the
underlying file. The file may not
actually be updated until msync(2) or
munmap() is called.
ERRORS
EINVAL We don’t like addr, length, or
offset (e.g., they are too large, or
not aligned on a page boundary).
Adding the offset to the pointer does not create the potential for a race, it already exists. Since at least neither ARM nor x86 can atomically read a pointer then access the memory it refers to you need to protect the pointer access with a lock regardless of whether you add an offset.

Resources