I'm porting some legacy assembly code to Rust and I need to call it through the asm! macro. However, the assembly code depends on some constants stored in a C header file. I'd like to keep it similar, define the constants in Rust and have the names of the constants in the asm macro.
Legacy C header:
#define HCR_VALUE 0xffff0000
Legacy ASM file:
.func
...
ldr x0, =HCR_VALUE
...
Rust code:
pub const HCR_VALUE: u32 = 0xffff0000;
unsafe { asm!("ldr x0, HCR_VALUE":::"x0"); }
Building the application ends up with a linker error:
lld-link: error: undefined symbol: HCR_VALUE
You need to pass the constant with a suitable constraint, like this:
unsafe { asm!("ldr x0, =${0:c}" : : "i" (HCR_VALUE) : "x0"); }
The right constraint depends on the architecture; on RISC CPUs, not all constants can be represented as immediate values. So you may have to use a register constraint instead and have LLVM materialize the constant there.
The current answers asm style was deprecated in favor of a more rustacean syntax. However, constants in assembly are still behind a feature flag and only in nightly if I recall correctly. Just add #![feature(asm_const)] at the top of your main file to use this amazing feature.
pub const HCR_VALUE: u32 = 0xffff0000;
// just assume we are in an unsafe block
asm!(
"ldr x0, ={hcr}",
"...",
hcr = const HCR_VALUE,
// this is technically necessary to tell the compiler that you have
// changed the value of x0 in your assembly code (called clobbering)
out ("x0") _
);
More information can be found in this issue
You normally don't want to use ldr with a const operand in inline asm: more normally you'd ask the compiler for the constant already in a register if you want that. Then you don't need to mark it as an out register, and the compiler will know about that value still being in a register after the asm statement. In a very large asm statement you might run out of registers, or in a naked function the compiler won't emit instructions for you.
In instructions like "orr x1, x2, #{hcr}" (hard-coded registers) or and {foo}, {bar}, #{hcr} (compiler's choice of registers) it makes sense to use HCR_VALUE directly as an immediate with boolean instructions. (Bitwise boolean instructions use a bit-pattern encoding for immediates that can encode constants with one contiguous block of set bits, or a repeating pattern, so 0xffff0000 is usable directly, unlike 0xffef0000 or similar.)
When you want the value in a register, preferably ask the compiler to do it before your asm instructions:
asm!(
"use {reg} here where you would have used x0",
reg = in(reg) HCR_VALUE
);
This will probably result in the same compiled result as above.
If you want to put a value in a specific register, you can:
asm!(
"...", // You don't even have to alter your code
in ("x0") HCR_VALUE
);
Related
In my application's InitInstance function, I have the following code to rewrite the location of the CHM Help Documentation:
CString strHelp = GetProgramPath();
strHelp += _T("MeetSchedAssist.CHM");
free((void*)m_pszHelpFilePath);
m_pszHelpFilePath = _tcsdup(strHelp);
It is all functional but it gives me a code analysis warning:
C26408 Avoid malloc() and free(), prefer the nothrow version of new with delete (r.10).
When you look at the official documentation for m_pszHelpFilePath it does state:
If you assign a value to m_pszHelpFilePath, it must be dynamically allocated on the heap. The CWinApp destructor calls free( ) with this pointer. You many want to use the _tcsdup( ) run-time library function to do the allocating. Also, free the memory associated with the current pointer before assigning a new value.
Is it possible to rewrite this code to avoid the code analysis warning, or must I add a __pragma?
You could (should?) use a smart pointer to wrap your reallocated m_pszHelpFilePath buffer. However, although this is not trivial, it can be accomplished without too much trouble.
First, declare an appropriate std::unique_ptr member in your derived application class:
class MyApp : public CWinApp // Presumably
{
// Add this member...
public:
std::unique_ptr<TCHAR[]> spHelpPath;
// ...
};
Then, you will need to modify the code that constructs and assigns the help path as follows (I've changed your C-style cast to an arguably better C++ cast):
// First three (almost) lines as before ...
CString strHelp = GetProgramPath();
strHelp += _T("MeetSchedAssist.CHM");
free(const_cast<TCHAR *>(m_pszHelpFilePath));
// Next, allocate the shared pointer data and copy the string...
size_t strSize = static_cast<size_t>(strHelp.GetLength() + 1);
spHelpPath std::make_unique<TCHAR[]>(strSize);
_tcscpy_s(spHelpPath.get(), strHelp.GetString()); // Use the "_s" 'safe' version!
// Now, we can use the embedded raw pointer for m_pszHelpFilePath ...
m_pszHelpFilePath = spHelpPath.get();
So far, so good. The data allocated in the smart pointer will be automatically freed when your application object is destroyed, and the code analysis warnings should disappear. However, there is one last modification we need to make, to prevent the MFC framework from attempting to free our assigned m_pszHelpFilePath pointer. This can be done by setting that to nullptr in the MyApp class override of ExitInstance:
int MyApp::ExitInstance()
{
// <your other exit-time code>
m_pszHelpFilePath = nullptr;
return CWinApp::ExitInstance(); // Call base class
}
However, this may seem like much ado about nothing and, as others have said, you may be justified in simply supressing the warning.
Technically, you can take advantage of the fact that new / delete map to usual malloc/free by default in Visual C++, and just go ahead and replace. The portability won't suffer much as MFC is not portable anyway. Sure you can use unique_ptr<TCHAR[]> instead of direct new / delete, like this:
CString strHelp = GetProgramPath();
strHelp += _T("MeetSchedAssist.CHM");
std::unique_ptr<TCHAR[]> str_old(m_pszHelpFilePath);
auto str_new = std::make_unique<TCHAR[]>(strHelp.GetLength() + 1);
_tcscpy_s(str_new.get(), strHelp.GetLength() + 1, strHelp.GetString());
m_pszHelpFilePath = str_new.release();
str_old.reset();
For robustness for replaced new operator, and for least surprise principle, you should keep free / strdup.
If you replace multiple of those CWinApp strings, suggest writing a function for them, so that there's a single place with free / strdup with suppressed warnings.
I am trying to check for any unaligned reads in my program. I enable unaligned access processor exception via (using x86_64 on g++ on linux kernel 3.19):
asm volatile("pushf \n"
"pop %%rax \n"
"or $0x40000, %%rax \n"
"push %%rax \n"
"popf \n" ::: "rax");
I do an optional forced unaligned read which triggers the exception so i know its working. After i disable that I get an error in a piece of code which otherwise seems fine :
char fullpath[eMaxPath];
snprintf(fullpath, eMaxPath, "%s/%s", "blah", "blah2");
the stacktrace shows a failure via __memcpy_sse2 which leads me to suspect that the standard library is using sse to fulfill my memcpy but it doesnt realize that i have now made unaligned reads unacceptable.
Is my thinking correct and is there any way around this (ie can i make the standard library use an unaligned safe sprintf/memcpy instead)?
thanks
While I hate to discourage an admirable notion, you're playing with fire, my friend.
It's not merely sse2 access but any unaligned access. Even a simple int fetch.
Here's a test program:
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <malloc.h>
void *intptr;
void
require_aligned(void)
{
asm volatile("pushf \n"
"pop %%rax \n"
"or $0x00040000, %%eax \n"
"push %%rax \n"
"popf \n" ::: "rax");
}
void
relax_aligned(void)
{
asm volatile("pushf \n"
"pop %%rax \n"
"andl $0xFFFBFFFF, %%eax \n"
"push %%rax \n"
"popf \n" ::: "rax");
}
void
msg(const char *str)
{
int len;
len = strlen(str);
write(1,str,len);
}
void
grab(void)
{
volatile int x = *(int *) intptr;
}
int
main(void)
{
setlinebuf(stdout);
// minimum alignment from malloc is [usually] 8
intptr = malloc(256);
printf("intptr=%p\n",intptr);
// normal access to aligned pointer
msg("normal\n");
grab();
// enable alignment check exception
require_aligned();
// access aligned pointer under check [will be okay]
msg("aligned_norm\n");
grab();
// this grab will generate a bus error
intptr += 1;
msg("aligned_except\n");
grab();
return 0;
}
The output of this is:
intptr=0x1996010
normal
aligned_norm
aligned_except
Bus error (core dumped)
The program generated this simply because of an attempted 4 byte int fetch from address 0x1996011 [which is odd and not a multiple of 4].
So, once you turn on the AC [alignment check] flag, even simple things will break.
IMO, if you truly have some things that are not aligned optimally and are trying to find them, using printf, instrumenting your code with debug asserts, or using gdb with some special watch commands or breakpoints with condition statements are a better/safer way to go
UPDATE:
I a using my own custom allocator am preparing my code to run on an architecture that doesnt suport unaligned read/writes so I want to make sure my code will not break on that architecture.
Fair enough.
Side note: My curiousity has gotten the better of me as the only [major] arches I can recall [at the moment] that have this issue are Motorola mc68000 and older IBM mainframes (e.g. IBM System 370).
One practical reason for my curiosity is that for certain arches (e.g. ARM/android, MIPS) there are emulators available. You could rebuild the emulator from source, adding any extra checks, if needed. Otherwise, doing your debugging under the emulator might be an option.
I can trap unaligned read/write using either the asm , or via gdb but both cause SIGBUS which i cant continue from in gdb and im getting too many false positives from std library (in the sense that their implementation would be aligned access only on the target).
I can tell you from experience that trying to resume from a signal handler after this doesn't work too well [if at all]. Using gdb is the best bet if you can eliminate the false positives by having AC off in the standard functions [see below].
Ideally i guess i would like to use something like perf to show me callstacks that have misaligned but so far no dice.
This is possible, but you'd have to verify that perf even reports them. To see, you could try perf against my original test program above. If it works, the "counter" should be zero before and one after.
The cleanest way may be to pepper your code with "assert" macros [that can be compiled in and out with a -DDEBUG switch].
However, since you've gone to the trouble of laying the groundwork, it may be worthwhile to see if the AC method can work.
Since you're trying to debug your memory allocator, you only need AC on in your functions. If one of your functions calls libc, disable AC, call the function, and then reenable AC.
A memory allocator is fairly low level, so it can't rely on too many standard functions. Most standard functions rely on being able to call malloc. So, you might also want to consider a vtable interface to the rest of the [standard] library.
I've coded some slightly different AC bit set/clear functions. I put them into a .S function to eliminate inline asm hassles.
I've coded up a simple sample usage in three files.
Here are the AC set/clear functions:
// acbit/acops.S -- low level AC [alignment check] operations
#define AC_ON $0x00040000
#define AC_OFF $0xFFFFFFFFFFFBFFFF
.text
// acpush -- turn on AC and return previous mask
.globl acpush
acpush:
// get old mask
pushfq
pop %rax
mov %rax,%rcx // save to temp
or AC_ON,%ecx // turn on AC bit
// set new mask
push %rcx
popfq
ret
// acpop -- restore previous mask
.globl acpop
acpop:
// get current mask
pushfq
pop %rax
and AC_OFF,%rax // clear current AC bit
and AC_ON,%edi // isolate the AC bit in argument
or %edi,%eax // lay it in
// set new mask
push %rax
popfq
ret
// acon -- turn on AC
.globl acon
acon:
jmp acpush
// acoff -- turn off AC
.globl acoff
acoff:
// get current mask
pushfq
pop %rax
and AC_OFF,%rax // clear current AC bit
// set new mask
push %rax
popfq
ret
Here is a header file that has the function prototypes and some "helper" macros:
// acbit/acbit.h -- common control
#ifndef _acbit_acbit_h_
#define _acbit_acbit_h_
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <malloc.h>
typedef unsigned long flags_t;
#define VARIABLE_USED(_sym) \
do { \
if (1) \
break; \
if (!! _sym) \
break; \
} while (0)
#ifdef ACDEBUG
#define ACPUSH \
do { \
flags_t acflags = acpush()
#define ACPOP \
acpop(acflags); \
} while (0)
#define ACEXEC(_expr) \
do { \
acoff(); \
_expr; \
acon(); \
} while (0)
#else
#define ACPUSH /**/
#define ACPOP /**/
#define ACEXEC(_expr) _expr
#endif
void *intptr;
flags_t
acpush(void);
void
acpop(flags_t omsk);
void
acon(void);
void
acoff(void);
#endif
Here is a sample program that uses all of the above:
// acbit/acbit2 -- sample allocator
#include <acbit.h>
// mymalloc1 -- allocation function [raw calls]
void *
mymalloc1(size_t len)
{
flags_t omsk;
void *vp;
// function prolog
// NOTE: do this on all "outer" (i.e. API) functions
omsk = acpush();
// do lots of stuff ...
vp = NULL;
// encapsulate standard library calls like this to prevent false positives:
acoff();
printf("%p\n",vp);
acon();
// function epilog
acpop(omsk);
return vp;
}
// mymalloc2 -- allocation function [using helper macros]
void *
mymalloc2(size_t len)
{
void *vp;
// function prolog
ACPUSH;
// do lots of stuff ...
vp = NULL;
// encapsulate standard library calls like this to prevent false positives:
ACEXEC(printf("%p\n",vp));
// function epilog
ACPOP;
return vp;
}
int
main(void)
{
int x;
setlinebuf(stdout);
// minimum alignment from malloc is [usually] 8
intptr = mymalloc1(256);
intptr = mymalloc2(256);
x = *(int *) intptr;
return x;
}
UPDATE #2:
I like the idea of disabling the check before any library calls.
If the AC H/W works and you wrap the library calls, this should yield no false positives. The only exception would be if the compiler made a call to its internal helper library (e.g. doing 64 bit divide on 32 bit machine, etc.).
Be aware/wary of the ELF loader (e.g. /lib64/ld-linux-x86-64.so.2) doing dynamic symbol resolution on "lazy" bindings of symbols. Shouldn't be a big problem. There are ways to force the relocations to occur at program start, if necessary.
I have given up on perf for this as it seems to show me garbage even for a simple program like the one you wrote.
The perf code in the kernel is complex enough that it may be more trouble than it's worth. It has to communicate with the perf program with a pipe [IIRC]. Also, doing the AC thing is [probably] uncommon enough that the kernel's code path for this isn't well tested.
Im using ocperf with misalign_mem_ref.loads and stores but either way the counters dont correlate at all. If i record and look at the callstacks i get completely unrecognizable callstacks for these counters so i suspect either the counter doesnt work on my hardware/perf or it doesnt actually count what i think it counts
I honestly don't know if perf handles process reschedules to different cores properly [or not]--it should [IMO]. But, using sched_setaffinity to lock your program to a single core might help.
But, using the AC bit is more direct and definitive, IMO. I think that's the better bet.
I've talked about adding "assert" macros in the code.
I've coded some up below. These are what I'd use. They are independent of the AC code. But, they can also be used in conjunction with the AC bit code in a "belt and suspenders" approach.
These macros have one distinct advantage. When properly [and liberally] inserted, they can check for bad pointer values at the time they're calculated. That is, much closer to the true source of the problem.
With AC, you may calculate a bad value, but AC only kicks in [sometime] later, when the pointer is dereferenced [which may not happen in your API code at all].
I've done a complete memory allocator before [with overrun checks and "guard" pages, etc.]. The macro approach is what I used. And, if I had only one tool for this, it is the one I'd use. So, I recommend it above all else.
But, as I said, it can be used with the AC code as well.
Here's the header file for the macros:
// acbit/acptr.h -- alignment check macros
#ifndef _acbit_acptr_h_
#define _acbit_acptr_h_
#include <stdio.h>
typedef unsigned int u32;
// bit mask for given width
#define ACMSKOFWID(_wid) \
((1u << (_wid)) - 1)
#ifdef ACDEBUG2
#define ACPTR_MSK(_ptr,_msk) \
acptrchk(_ptr,_msk,__FILE__,__LINE__)
#else
#define ACPTR_MSK(_ptr,_msk) /**/
#endif
#define ACPTR_WID(_ptr,_wid) \
ACPTR_MSK(_ptr,(_wid) - 1)
#define ACPTR_TYPE(_ptr,_typ) \
ACPTR_WID(_ptr,sizeof(_typ))
// acptrfault -- pointer alignment fault
void
acptrfault(const void *ptr,const char *file,int lno);
// acptrchk -- check pointer for given alignment
static inline void
acptrchk(const void *ptr,u32 msk,const char *file,int lno)
{
#ifdef ACDEBUG2
#if ACDEBUG2 >= 2
printf("acptrchk: TRACE ptr=%p msk=%8.8X file='%s' lno=%d\n",
ptr,msk,file,lno);
#endif
if (((unsigned long) ptr) & msk)
acptrfault(ptr,file,lno);
#endif
}
#endif
Here's the "fault" handler function:
// acbit/acptr -- alignment check macros
#include <acbit/acptr.h>
#include <acbit/acbit.h>
#include <stdlib.h>
// acptrfault -- pointer alignment fault
void
acptrfault(const void *ptr,const char *file,int lno)
{
// NOTE: it's easy to set a breakpoint on this function
printf("acptrfault: pointer fault -- ptr=%p file='%s' lno=%d\n",
ptr,file,lno);
exit(1);
}
And, here's a sample program that uses them:
// acbit/acbit3 -- sample allocator using check macros
#include <acbit.h>
#include <acptr.h>
static double static_array[20];
// mymalloc3 -- allocation function
void *
mymalloc3(size_t len)
{
void *vp;
// get something valid
vp = static_array;
// do lots of stuff ...
printf("BEF vp=%p\n",vp);
// check pointer
// NOTE: these can be peppered after every [significant] calculation
ACPTR_TYPE(vp,double);
// do something bad ...
vp += 1;
printf("AFT vp=%p\n",vp);
// check again -- this should fault
ACPTR_TYPE(vp,double);
return vp;
}
int
main(void)
{
int x;
setlinebuf(stdout);
// minimum alignment from malloc is [usually] 8
intptr = mymalloc3(256);
x = *(int *) intptr;
return x;
}
Here's the program output:
BEF vp=0x601080
acptrchk: TRACE ptr=0x601080 msk=00000007 file='acbit/acbit3.c' lno=22
AFT vp=0x601081
acptrchk: TRACE ptr=0x601081 msk=00000007 file='acbit/acbit3.c' lno=29
acptrfault: pointer fault -- ptr=0x601081 file='acbit/acbit3.c' lno=29
I left off the AC code in this example. On your real target system, the dereference of intptr in main would/should fault on an alignment, but notice how much later that is in the execution timeline.
Like I commented on the question, that asm isn't safe, because it steps on the red-zone. Instead, use
asm volatile ("add $-128, %rsp\n\t"
"pushf\n\t"
"orl $0x40000, (%rsp)\n\t"
"popf\n\t"
"sub $-128, %rsp\n\t"
);
(-128 fits in a sign-extended 8bit immediate, but 128 doesn't, hence using add $-128 to subtract 128.)
Or in this case, there are dedicated instructions for toggling that bit, like there are for the carry and direction flags:
asm("stac"); // Set AC flag
asm("clac"); // Clear AC flag
It's a good idea to have some idea when your code uses unaligned memory. It's not necessarily a good idea to change your code to avoid it in every case. Sometimes better locality from packing data closer together is more valuable.
Given that you shouldn't necessarily aim to eliminate all unaligned accesses anyway, I don't think this is the easiest way to find the ones you do have.
modern x86 hardware has fast hardware support for unaligned loads/stores. When they don't span a cache-line boundary, or lead to store-forwarding stalls, there's literally no penalty.
What you might try is looking at performance counters for some of these events:
misalign_mem_ref.loads [Speculative cache line split load uops dispatched to L1 cache]
misalign_mem_ref.stores [Speculative cache line split STA uops dispatched to L1 cache]
ld_blocks.store_forward [This event counts loads that followed a store to the same address, where the data could not be forwarded inside the pipeline from the store to the load.
The most common reason why store forwarding would be blocked is when a load's address range overlaps with a preceeding smaller uncompleted store.
See the table of not supported store forwards in the Intel? 64 and IA-32 Architectures Optimization Reference Manual.
The penalty for blocked store forwarding is that the load must wait for the store to complete before it can be issued.]
(from ocperf.py list output on my Sandybridge CPU).
There are probably other ways to detect unaligned memory access. Maybe valgrind? I searched on valgrind detect unaligned and found this mailing list discussion from 13 years ago. Probably still not implemented.
The hand-optimized library functions do use unaligned accesses because it's the fastest way for them to get their job done. e.g. copying bytes 6 to 13 of a string to somewhere else can and should be done with just a single 8-byte load/store.
So yes, you would need special slow&safe versions of library functions.
If your code would have to execute extra instructions to avoid using unaligned loads, it's often not worth it. Esp. if the input is usually aligned, having a loop that does the first up-to-alignment-boundary elements before starting the main loop may just slow things down. In the aligned case, everything works optimally, with no overhead of checking alignment. In the unaligned case, things might work a few percent slower, but as long as the unaligned cases are rare, it's not worth avoiding them.
Esp. if it's not SSE code, since non-AVX legacy SSE can only fold loads into memory operands for ALU instructions when alignment is guaranteed.
The benefit of having good-enough hardware support for unaligned memory ops is that software can be faster in the aligned case. It can leave alignment-handling to hardware, instead of running extra instructions to handle pointers that are probably aligned. (Linus Torvalds had some interesting posts about this on the http://realworldtech.com/ forums, but they're not searchable so I can't find it.
You're not going to like it, but there is only one answer: don't link against the standard libraries. By changing that setting you have changed the ABI and the standard library doesn't like it. memcpy and friends are hand-written assembly so it's not a matter of compiler options to convince the compiler to do something else.
I was writing programs to count the time of page faults in a linux system. More precisely, the time kernel execute the function __do_page_fault.
And somehow I wrote two global variables, named pfcount_at_beg and pfcount_at_end, which increase once when the function __do_page_fault is executed at different locations of the function.
To illustrate, the modified function goes as:
unsigned long pfcount_at_beg = 0;
unsigned long pfcount_at_end = 0;
static void __kprobes
__do_page_fault(...)
{
struct vm_area_sruct *vma;
... // VARIABLES DEFINITION
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
pfcount_at_beg++; // I add THIS
...
...
// ORIGINAL CODE OF THE FUNCTION
...
pfcount_at_end++; // I add THIS
}
I expected that the value of pfcount_at_end is smaller than the value of pfcount_at_beg.
Because, I think, every time kernel executes the instructions of code pfcount_at_end++, it must have executed pfcount_at_beg++(Every function starts at the very beginning of the code).
On the other hand, as there are many conditional return between these two lines of code.
However, the result turns out oppositely. The value of pfcount_at_end is larger than the value of pfcount_at_beg.
I use printk to print these kernel variables through a self-defined syscall. And I wrote the user level program to call the system call.
Here is my simple syscall and user-level program:
// syscall
asmlinkage int sys_mysyscall(void)
{
printk( KERN_INFO "total pf_at_beg%lu\ntotal pf_at_end%lu\n", pfcount_at_beg, pfcount_at_end)
return 0;
}
// user-level program
#include<linux/unistd.h>
#include<sys/syscall.h>
#define __NR_mysyscall 223
int main()
{
syscall(__NR_mysyscall);
return 0;
}
Is there anybody who knows what exactly happened during this?
Just now I modified the code, to make pfcount_at_beg and pfcount_at_end static. However the result did not change, i.e. the value of pfcount_at_end is larger than the value of pfcount_at_beg.
So possibly it might be caused by in-atomic operation of increment. Would it be better if I use read-write lock?
The ++ operator is not garanteed to be atomic, so your counters may suffer concurrent access and have incorrect values. You should protect your increment as a critical section, or use the atomic_t type defined in <asm/atomic.h>, and its related atomic_set() and atomic_add() functions (and a lot more).
Not directly connected to your issue, but using a specific syscall is overkill (but maybe it is an exercise). A lighter solution could be to use a /proc entry (also an interesting exercise).
I'm writing a multithreaded application and having a problem on the SPARC platform. Ultimately my question comes down to atomicity of this platform and how I could be obtaining this result.
Some pseudocode to help clarify my question:
// Global variable
typdef struct pkd_struct{
uint16_t a;
uint16_t b;
} __attribute__(packed) pkd_struct_t;
pkd_struct_t shared;
Thread 1:
swap_value() {
pkd_struct_t prev = shared;
printf("%d%d\n", prev.a, prev.b);
...
}
Thread 2:
use_value() {
pkd_struct_t next;
next.a = 0; next.b = 0;
shared = next;
printf("%d%d\n", shared.a, shared.b);
...
}
Thread 1 and 2 are accessing the shared variable "shared". One is setting, the other is getting. If Thread 2 is setting "shared" to zero, I'd expect Thread 1 to read count either before OR after the setting -- since "shared" is aligned on a 4-byte boundary. However, I will occasionally see Thread 1 reading the value of the form 0xFFFFFF00. That is the high-order 24 bits are OLD, but the low-order byte is NEW. It appears I've gotten an intermediate value.
Looking at the disassembly, the use_value function simply does an "ST" instruction. Given that the data is aligned and isn't crossing a word boundary, is there any explanation for this behavior? If ST is indeed NOT atomic to use this way, does this explain the result I see (only 1 byte changed?!?)? There is no problem on x86.
UPDATE 1:
I've found the problem, but not the cause. GCC appears to be generating assembly that reads the shared variably byte-by-byte (thus allowing a partial update to be obtained). Comments added, but I am not terribly comfortable with SPARC assembly. %i0 is a pointer to the shared variable.
xxx+0xc: ldub [%i0], %g1 // ld unsigned byte g1 = [i0] -- 0 padded
xxx+0x10: ...
xxx+0x14: ldub [%i0 + 0x1], %g5 // ld unsigned byte g5 = [i0+1] -- 0 padded
xxx+0x18: sllx %g1, 0x18, %g1 // g1 = [i0+0] left shifted by 24
xxx+0x1c: ldub [%i0 + 0x2], %g4 // ld unsigned byte g4 = [i0+2] -- 0 padded
xxx+0x20: sllx %g5, 0x10, %g5 // g5 = [i0+1] left shifted by 16
xxx+0x24: or %g5, %g1, %g5 // g5 = g5 OR g1
xxx+0x28: sllx %g4, 0x8, %g4 // g4 = [i0+2] left shifted by 8
xxx+0x2c: or %g4, %g5, %g4 // g4 = g4 OR g5
xxx+0x30: ldub [%i0 + 0x3], %g1 // ld unsigned byte g1 = [i0+3] -- 0 padded
xxx+0x34: or %g1, %g4, %g1 // g1 = g4 OR g1
xxx+0x38: ...
xxx+0x3c: st %g1, [%fp + 0x7df] // store g1 on the stack
Any idea why GCC is generating code like this?
UPDATE 2: Adding more info to the example code. Appologies -- I'm working with a mix of new and legacy code and it's difficult to separate what's relevant. Also, I understand sharing a variable like this is highly-discouraged in general. However, this is actually in a lock implementation where higher-level code will be using this to provide atomicity and using pthreads or platform-specific locking is not an option for this.
Because you've declared the type as packed, it gets one byte alignment, which means it must be read and written one byte at a time, as SPARC does not allow unaligned loads/stores. You need to give it 4-byte alignment if you want the compiler to use word load/store instructions:
typdef struct pkd_struct {
uint16_t a;
uint16_t b;
} __attribute__((packed, aligned(4))) pkd_struct_t;
Note that packed is essentially meaningless for this struct, so you could leave that out.
Answering my own question here -- this has bugged me for too long and hopefully I can save someone a bit of frustration at some point.
The problem is that although the shared data is aligned, because it is packed GCC reads it byte-by-byte.
There is some discussion here on how packing leading to load/store bloat on SPARC (and other RISC platforms I'd assume...), but in my case it has lead to a race.
I want to write a 64 bit integer to a particular memory location.
sample C++ code would look like this:
extern char* base;
extern uint64_t data;
((uint64_t *)base)[1] = data;
Now, here is my attempt to write the above as inline assembly:
uint64_t addr = (uint64_t)base + 8;
asm volatile (
"movq %0, (%1)\n\t"
:: "r" (data), "r"(addr) : "memory"
);
The above works in a small test program but in my application, I suspect that something here is off.
Do I need to specify any output operands or any other constraints in the above?
Thanks!
Just say:
asm("mov %1, %0\n\t" : "=m"(*(uint64_t*)(base + 8)) : "r"(data) : "memory");
The tricky thing is that when using the "m" constraint, you're (possibly counterintuitively so) not giving an address but instead what in C looks like the "value" of the variable you want to change.
That's why, in this case, the weird pointer-cast-dereference. The compiler for this makes sure to put the address of the operand in.