Compiling a library on a 64 bit architecture: Incorrect register `%rax' used with `l' suffix - linux

I have to compile a library on a 64-bit architecture, but I get the error quoted in the title.
The lines of code affected by the error are in assembler, here's an example (they are all very similar):
//=== get the index to write ===///
__asm__ __volatile__ ("lock; xaddl %0,%1"
: "=r" (indexToWrite), "=m" ( indexTable[entityId] )
: "0" (1), "m" ( indexTable[entityId] ));
Can you help me out?
I am on 64-bit Linux (Ubuntu) and I am using gcc.

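Use the k operand modifier to select the 32-bit sub-register: xaddl %k0,%1. The error presumably appears because the variable bound to %0 is 64 bits wide, so gcc substitutes a 64-bit register such as %rax, which does not match the l suffix.
Writing xaddl %k0,%k1 is also harmless, since %1 is a memory operand anyway.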
The operand modifiers for 8, 16, 32, and 64 bits are b, w, k, q respectively.
The second "m" in the input list seems suspect to me. I might be wrong, but I think it should be:
"1" (indexTable[entityId])
With xadd I don't suppose it matters, but this would technically be argument %3 otherwise.
Personally, I'd go with:
: "=r" (indexToWrite), "+m" (indexTable[entityId]) : "0" (1)
And yes, "+m" is perfectly legal. It has been for a long time; the gcc documentation that said otherwise was a documentation bug that was only recently corrected!


NtQueryObject returns wrong insufficient required size via WOW64, why?

I am using the NT native API NtQueryObject()/ZwQueryObject() from user mode (and I am aware of the risks in general and I have written kernel mode drivers for Windows in the past in my professional capacity).
Generally when one uses the typical "query information" function (of which there are a few) the protocol is first to ask with a too small buffer to retrieve the required size with STATUS_INFO_LENGTH_MISMATCH, then allocate a buffer of said size and query again -- this time using the buffer and previously returned size.
In order to get the list of object types (67 on my build) on the system I am doing just that:
ULONG Size = 0;
NTSTATUS Status = NtQueryObject(NULL, ObjectTypesInformation, &Size, sizeof(Size), &Size);
And in Size I get 8280 (WOW64) and 8968 (x64). I then proceed to allocate the buffer with calloc() and query again:
ULONG Size2 = 0;
BYTE* Buf = (BYTE*)::calloc(1, Size);
Status = NtQueryObject(NULL, ObjectTypesInformation, Buf, Size, &Size2);
NB: ObjectTypesInformation is 3. It isn't declared in winternl.h, but Nebbett (as ObjectAllTypesInformation) and others describe it. Since I am not querying for a particular object's traits but the system-wide list of object types, I pass NULL for the object handle.
Curiously on WOW64, i.e. 32-bit, the value in Size2 upon return from the second query is 16 Bytes (= 8296) bigger than the previously returned required size.
As far as alignment is concerned, I'd expect at most 8 Bytes for this sort of thing and indeed neither 8280 nor 8296 are at a 16 Byte alignment boundary, but on an 8 Byte one.
Certainly I can add some slack space on top of the returned required size (e.g. ALIGN_UP to the next 32 Byte alignment boundary), but this seems highly irregular to be honest. And I'd rather want to understand what's going on than to implement a workaround that breaks, because I miss something crucial.
The practical issue for the code is that in Debug configurations it tells me there's a corrupted heap somewhere, upon freeing Buf. Which suggests that NtQueryObject() was indeed writing these extra 16 Bytes beyond the buffer I provided.
Question: Any idea why it is doing that?
As usual for NT native API the sources of information are scarce. The x64 version of the exact same code returns the exact number of bytes required. So my thinking here is that WOW64 is the issue. A somewhat cursory look into wow64.dll with IDA didn't reveal any immediate points for suspicion regarding what goes wrong in translating the results to 32-bit here.
PS: Windows 10 (10.0.19043, ntdll.dll "timestamp" 77755782)
PPS: this may be related: https://wj32.org/wp/2012/11/30/obquerytypeinfo-and-ntqueryobject-buffer-overrun-in-windows-8/ Tested it, by checking that OBJECT_TYPE_INFORMATION::TypeName.Length + sizeof(WCHAR) == OBJECT_TYPE_INFORMATION::TypeName.MaximumLength in all returned items, which was the case.
The only part of ObjectTypesInformation that's public is the first field defined in winternl.h header in the Windows SDK:
typedef struct __PUBLIC_OBJECT_TYPE_INFORMATION {
UNICODE_STRING TypeName;
ULONG Reserved [22]; // reserved for internal use
} PUBLIC_OBJECT_TYPE_INFORMATION, *PPUBLIC_OBJECT_TYPE_INFORMATION;
For x86 this is 96 bytes, and for x64 this is 104 bytes (assuming you have the right packing mode enabled). The difference is the pointer in UNICODE_STRING which changes the alignment in x64.
Any additional memory space should be related to the TypeName buffer.
UNICODE_STRING accounts for 8 bytes of the difference between 8280 and 8296. The function uses the sizeof(ULONG_PTR) for alignment of the returned string plus an extra WCHAR, so that could easily account for the remaining 8 bytes.
AFAIK: The public use of NtQueryObject is supposed to be limited to kernel-mode use which of course means it always matches the OS native bitness (x86 code can't run as kernel in x64 native OS), so it's probably just a quirk of using the NT functions via the WOW64 thunk.
Alright, I think I figured out the issue with the help of WinDbg and a thorough look at wow64.dll using IDA.
NB: the wow64.dll I have has the same build number, but differs slightly in data only (checksum, security directory entry, pieces from version resources). The code is identical, which was to be expected, given deterministic builds and how they affect the PE timestamp.
There's an internal function called whNtQueryObject_SpecialQueryCase (according to PDBs), which covers the ObjectTypesInformation class queries.
For the above wow64.dll I used the following points of interest in WinDbg, from a 32 bit program which calls NtQueryObject(NULL, ObjectTypesInformation, ...) (the program itself is irrelevant, though):
0:000> .load wow64exts
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B0E0
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B14E
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B1A7
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B24A
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B252
Explanation of the above points of interest:
+B0E0: computing length required for 64 bit query, based on passed length for 32 bit
+B14E: call to NtQueryObject()
+B1A7: loop body for copying 64 to 32 bit buffer contents, after successful NtQueryObject() call
+B24A: computing written length by subtracting the base buffer address from the current (last + 1) entry address
+B252: downsizing returned (64 bit) required length to 32 bit
The logic of this function in regards to just ObjectTypesInformation is roughly as follows:
Common steps
1. Take the ObjectInformationLength (32 bit query!) argument and size it up to fit the 64 bit info
2. Align the retrieved size up to the next 16 byte boundary
3. If necessary allocate the resulting amount from some PEB::ProcessHeap and store in TLS slot 3; otherwise using this as a scratch space
4. Call NtQueryObject() passing the buffer and length from the two previous steps
The length passed to NtQueryObject() is the one from step 1, not the one aligned to a 16 byte boundary. There seems to be some sort of header to this scratch space, so perhaps that's where the 16 byte alignment comes from?
Case 1: buffer size too small (here: 4), just querying required length
1. The up-sized length in this case equals 4, which is too small and consequently NtQueryObject() returns STATUS_INFO_LENGTH_MISMATCH. Required size is reported as 8968.
2. Down-size from the 64 bit required length to 32 bit and end up 16 bytes too short
3. Return the status from NtQueryObject() and the down-sized required length from the previous step
Case 2: buffer size supposedly (!) sufficient
1. Copy OBJECT_TYPES_INFORMATION::NumberOfTypes from queried buffer to 32 bit one
2. Step to the first entry (OBJECT_TYPE_INFORMATION) of source (64 bit) and target (32 bit) buffer, 8 and 4 byte aligned respectively
3. For each entry up to OBJECT_TYPES_INFORMATION::NumberOfTypes:
   - Copy UNICODE_STRING::Length and UNICODE_STRING::MaximumLength for TypeName member
   - memcpy() UNICODE_STRING::Length bytes from source to target UNICODE_STRING::Buffer (target entry + sizeof(OBJECT_TYPE_INFORMATION32))
   - Add terminating zero (WCHAR) past the memcpy'd string
   - Copy the individual members past the TypeName from 64 to 32 bit struct
   - Compute pointer of next entry by aligning UNICODE_STRING::MaximumLength up to an 8 byte boundary (i.e. the ULONG_PTR alignment mentioned in the other answer) + sizeof(OBJECT_TYPE_INFORMATION64) (already 8 byte aligned!)
   - The next target entry (32 bit) gets 4 byte aligned instead
4. At the end compute the required (32 bit) length by subtracting the base address of the buffer passed by the WOW64 program (32 bit) to NtQueryObject() from the value we arrived at for the "next" entry (i.e. one past the last)
In my debugged scenario these were: 0x008ce050 - 0x008cbfe8 = 0x00002068 (= 8296), which is 16 bytes larger than the buffer length we were told during case 1 (8280)!
The issue
That crucial last step differs between merely querying and actually getting the buffer filled. There is no further bounds checking in that loop I described for case 2.
And this means it will just overrun the passed buffer and return a written length bigger than the buffer length passed to it.
Possible solutions and workarounds
I'll have to approach this mathematically after some sleep, but the workaround is obviously to top up the required length returned from case 1 in order to avoid the buffer overrun. The easiest method is to apply my up_size_from_32bit() from the example below to the returned required size. This way you are allocating enough for the 64 bit buffer, while querying the 32 bit one. This should never overrun during the copy loop.
However, the fix in wow64.dll is a little more involved, I guess. While adding bounds checking to the loop would help avert the overrun, it would mean that the caller would have to query for the required size twice, because the first time around it lies to us.
Which means the query-only case (1) would have to allocate that internal buffer after querying the required length for 64 bit, then get it filled and then walk the entries (just like the copy loop), skipping over the last entry to compute the required length the same as it is now done after the copy loop.
Example program demonstrating the "static" computation by wow64.dll
Build for x64, just the way wow64.dll was!
#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#include <cstdio>
typedef struct
{
ULONG JustPretending[24];
} OBJECT_TYPE_INFORMATION32;
typedef struct
{
ULONG JustPretending[26];
} OBJECT_TYPE_INFORMATION64;
constexpr ULONG size_delta_3264 = sizeof(OBJECT_TYPE_INFORMATION64) - sizeof(OBJECT_TYPE_INFORMATION32);
constexpr ULONG down_size_to_32bit(ULONG len)
{
return len - size_delta_3264 * ((len - 4) / sizeof(OBJECT_TYPE_INFORMATION64));
}
constexpr ULONG up_size_from_32bit(ULONG len)
{
return len + size_delta_3264 * ((len - 4) / sizeof(OBJECT_TYPE_INFORMATION32));
}
// Trying to mimic the wdm.h macro
constexpr size_t align_up_by(size_t address, size_t alignment)
{
return (address + (alignment - 1)) & ~(alignment - 1);
}
constexpr auto u32 = 8280UL;
constexpr auto u64 = 8968UL;
constexpr auto from_64 = down_size_to_32bit(u64);
constexpr auto from_32 = up_size_from_32bit(u32);
constexpr auto from_32_16_byte_aligned = (ULONG)align_up_by(from_32, 16);
int wmain()
{
wprintf(L"32 to 64 bit: %u -> %u -(16-byte-align)-> %u\n", u32, from_32, from_32_16_byte_aligned);
wprintf(L"64 to 32 bit: %u -> %u\n", u64, from_64);
return 0;
}
static_assert(sizeof(OBJECT_TYPE_INFORMATION32) == 96, "Size for 32 bit struct does not match.");
static_assert(sizeof(OBJECT_TYPE_INFORMATION64) == 104, "Size for 64 bit struct does not match.");
static_assert(u32 == from_64, "Must match (from 64 to 32 bit)");
static_assert(u64 == from_32, "Must match (from 32 to 64 bit)");
static_assert(from_32_16_byte_aligned % 16 == 0, "16 byte alignment failed");
static_assert(from_32_16_byte_aligned > from_32, "We're aligning up");
This does not mimic the computation that happens in case 2, though.

How to get right MIPS libc toolchain for embedded device

I've run into a problem (repeatedly) with various companies' embedded Linux products where the GPL source code they release does not match what is actually running on the device. It's "close", but not quite right, especially with respect to the standard C library they use.
Isn't that a violation of the GPL?
Often this mismatch results in a programmer (like me) cross compiling only to have the device reply cryptically "file not found" or something similar when the program is run.
I'm not alone with this kind of problem. Many people have threads directly and indirectly related to it, e.g.:
Compile parameters for MIPS based codesourcery toolchain?
And I've run into the problem on Sony devices, D-link, and many others. It's very common.
Making a new library is not a good solution, since most systems are ROMFS only, and LD_LIBRARY_PATH is sometimes broken -- so that installing a new library on the device wastes very limited memory and often won't work.
If I knew what the right source code version of the library was, I could go around the manufacturer's carelessness and compile it from the original developer's tree; but how can I find out which version I need when all I have is the binary of the library itself?
For example: I ran readelf -a libc.so.0 on a DSL modem's libc (see below); but I don't see anything here that could tell me exactly which libc it was...
How can I find the name of the source code, or an identifier from the library's binary so I can create a cross compiler using that library? eg: Can anyone tell me what source code this library came from, and how they know?
ELF Header:
Magic: 7f 45 4c 46 01 02 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, big endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: DYN (Shared object file)
Machine: MIPS R3000
Version: 0x1
Entry point address: 0x5a60
Start of program headers: 52 (bytes into file)
Start of section headers: 0 (bytes into file)
Flags: 0x1007, noreorder, pic, cpic, o32, mips1
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 4
Size of section headers: 0 (bytes)
Number of section headers: 0
Section header string table index: 0
There are no sections in this file.
There are no sections to group in this file.
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
REGINFO 0x0000b4 0x000000b4 0x000000b4 0x00018 0x00018 R 0x4
LOAD 0x000000 0x00000000 0x00000000 0x2c9ee 0x2c9ee R E 0x1000
LOAD 0x02c9f0 0x0006c9f0 0x0006c9f0 0x009a0 0x040b8 RW 0x1000
DYNAMIC 0x0000cc 0x000000cc 0x000000cc 0x0579a 0x0579a RWE 0x4
Dynamic section at offset 0xcc contains 19 entries:
Tag Type Name/Value
0x0000000e (SONAME) Library soname: [libc.so.0]
0x00000004 (HASH) 0x18c
0x00000005 (STRTAB) 0x3e9c
0x00000006 (SYMTAB) 0x144c
0x0000000a (STRSZ) 6602 (bytes)
0x0000000b (SYMENT) 16 (bytes)
0x00000015 (DEBUG) 0x0
0x00000003 (PLTGOT) 0x6ce20
0x00000011 (REL) 0x5868
0x00000012 (RELSZ) 504 (bytes)
0x00000013 (RELENT) 8 (bytes)
0x70000001 (MIPS_RLD_VERSION) 1
0x70000005 (MIPS_FLAGS) NOTPOT
0x70000006 (MIPS_BASE_ADDRESS) 0x0
0x7000000a (MIPS_LOCAL_GOTNO) 11
0x70000011 (MIPS_SYMTABNO) 677
0x70000012 (MIPS_UNREFEXTNO) 17
0x70000013 (MIPS_GOTSYM) 0x154
0x00000000 (NULL) 0x0
There are no relocations in this file.
The decoding of unwind sections for machine type MIPS R3000 is not currently supported.
Histogram for bucket list length (total of 521 buckets):
Length Number % of total Coverage
0 144 ( 27.6%)
1 181 ( 34.7%) 27.1%
2 130 ( 25.0%) 66.0%
3 47 ( 9.0%) 87.1%
4 12 ( 2.3%) 94.3%
5 5 ( 1.0%) 98.1%
6 1 ( 0.2%) 99.0%
7 1 ( 0.2%) 100.0%
No version information found in this file.
Primary GOT:
Canonical gp value: 00074e10
Reserved entries:
Address Access Initial Purpose
0006ce20 -32752(gp) 00000000 Lazy resolver
0006ce24 -32748(gp) 80000000 Module pointer (GNU extension)
Local entries:
Address Access Initial
0006ce28 -32744(gp) 00070000
0006ce2c -32740(gp) 00030000
0006ce30 -32736(gp) 00000000
0006ce34 -32732(gp) 00010000
0006ce38 -32728(gp) 0006d810
0006ce3c -32724(gp) 0006d814
0006ce40 -32720(gp) 00020000
0006ce44 -32716(gp) 00000000
0006ce48 -32712(gp) 00000000
Global entries:
Address Access Initial Sym.Val. Type Ndx Name
0006ce4c -32708(gp) 000186c0 000186c0 FUNC bad section index[ 6] __fputc_unlocked
0006ce50 -32704(gp) 000211a4 000211a4 FUNC bad section index[ 6] sigprocmask
0006ce54 -32700(gp) 0001e2b4 0001e2b4 FUNC bad section index[ 6] free
0006ce58 -32696(gp) 00026940 00026940 FUNC bad section index[ 6] raise
...
truncated listing
....
Note:
The rest of this post is a blog showing how I came to ask the question above and to put useful information about the subject in one place.
Don't bother reading it unless you want to know I actually did research the question... in gory detail... and how NOT to answer my question.
The proper (theoretical) way to get a libc program running on (for example) a D-link modem would simply be to get the TRUE source code for the product from the manufacturer, and compile against those libraries.... (It's GPL !? right, so the law is on our side, right?)
For example: I just bought a D-Link DSL-520B modem and a 526B modem -- but found out after the fact that the manufacturer "forgot" to supply linux source code for the 520B but does have it for the 526B. I checked all of the DSL-5xxB devices online for source code & toolchains, finding to my delight that ALL of them (including 526B) -- contain the SAME pre-compiled libc.so.0 with MD5sum of 6ed709113ce615e9f170aafa0eac04a6 . So in theory, all supported modems in the DSL-5xxB family seemed to use the same libc library... and I hoped I might be able to use that library.
But after I figured out how to get the DSL modem itself to send me a copy of the installed /lib/libc.so.0 library -- I found to my disgust that they ALL use a library with MD5 sum of b8d492decc8207e724a0822641205078 . In NEITHER of the modems I bought (supported or not) was found the same library as contained in the source code toolchain.
To verify the toolchain from D-link was defective, I didn't compile a program (the toolchain wouldn't run on my PC anyway as it was the wrong binary format) -- but I found the toolchain had some pre-compiled mips binaries in it already; so I simply downloaded one to the modem and chmod +x -- and (surprise) I got the message "file not found." when I tried to run it ... It won't run.
So, I knew the toolchains were no good immediately, but not exactly why.
I decided to get a newer version of MIPS GCC (binary version) that should have fewer bugs, more features and which is supported on most PC platforms. This is the way to GO!
See: Kernel.org pre-compiled binaries
I upgraded to gcc 4.9.0 after selecting the older "mips" version from the above site to get the right FTP page, and set my shell's PATH variable to include the /bin directory of the cross compiler once installed.
Then I copied all the header files and libraries from the D-link source code package into the new cross compiler just to verify that it could compile D-link libc binaries. And it did on the first try, compiling "hello world!" with no warnings or errors into a mips 32 big endian binary.
( START EDIT: ) #ChrisStratton points out in the comments (after this post) that my test of the toolchain is inadequate, and that using a newer GCC with an older library -- even though it links properly -- is flawed as a test. I wish there was a way to give him points for his comments -- I've become convinced that he's right; although that makes what D-link did even a worse problem -- for there's no way to know from the binaries on the modem which GCC they actually used. The GCC used for the kernel isn't necessarily the same used in user space.
In order to test the new compiler's compatibility with the modems and also make tools so I could get a copy of the actual libraries found on the modem: ( END EDIT ) I wrote a program that doesn't use the C library at all (but in two parts): It ran just fine... and the code is attached to show how it can be done.
The first listing is an assembly language program to bypass linking the standard C libraries on MIPS; and the second listing is a program meant to create an octal number dump of a binary file/stream using only the linux kernel. eg: It enables copying/pasting or scripting of binary data over telnet, netcat, etc... via ash/bash or busybox :) like a poor man's uucp.
// substart.S MIPS assembly language bypass of libc startup code
// it just calls main, and then jumps to the exit function
.text
.globl __start
__start: .ent __start
.frame $29, 32, $31
.set noreorder
.cpload $25
.set reorder
.cprestore 16
jal main
j exit
.end __start
// end substart.S
...and...
// octdump.c
// To compile w/o libc :
// mips-linux-gcc substart.S octdump.c -nostdlib -o octdump
// To compile with working libc (eg: x86 system) :
// gcc octdump.c -o octdump_x86
#include <syscall.h>
#include <errno.h>
#include <sys/types.h>
int* __errno_location(void) { return &errno; }
#ifdef _syscall1
// define three unix functions (exit,read,write) in terms of unix syscall macros.
_syscall1( void, exit, int, status );
_syscall3( ssize_t, read, int, fd, void*, buf, size_t, count );
_syscall3( ssize_t, write, int, fd, const void*, buf, size_t, count );
#endif
#include <unistd.h>
void oct( unsigned char c ) {
unsigned int n = c;
int m=6;
static unsigned char oval[6]={'\\','\\','0','0','0','0'};
if (n < 64) { m-=1; n <<= 3; }
if (n < 64) { m-=1; n <<= 3; }
if (n < 64) { m-=1; n <<= 3; }
oval[5]='0'+(n&7);
oval[4]='0'+((n>>3)&7);
oval[3]='0'+((n>>6)&7);
write( STDOUT_FILENO, oval, m );
}
int main(void) {
char buffer[255];
int count=1;
int i;
while (count>0) {
count=read( STDIN_FILENO, buffer, 17 );
if (count>0) write( STDOUT_FILENO, "echo -ne $'",11 );
for (i=0; i<count; ++i) oct( buffer[i] );
if (count>0) write( STDOUT_FILENO, "'\n", 2 );
}
write( STDOUT_FILENO,"#\n",2);
return 0;
}
Once mips' octdump was saved (chmod +x) as /var/octdump on the modem, it ran without errors.
(use your imagination about how I got it on there... Dlink's TFTP, & friends are broken.)
I was able to use octdump to copy all the dynamic libraries off the DSL modem and examine them, using an automated script to avoid copy/pasting by hand.
#!/usr/bin/env python
# octget.py (Python 2)
# A program to download a file from an embedded linux device via telnet
import socket
import time
import sys
import string
if len( sys.argv ) != 4 :
    raise ValueError, "Usage: octget.py IP_OF_MODEM passwd path_to_file_to_get"
o = socket.socket( socket.AF_INET, socket.SOCK_STREAM )
o.connect((sys.argv[1],23)) # The IP address of the DSL modem.
time.sleep(1)
sys.stderr.write( o.recv(1024) )
o.send("admin\r\n");
time.sleep(0.1)
sys.stderr.write( o.recv(1024) )
o.send(sys.argv[2]+"\r\n")
time.sleep(0.1)
o.send("sh\r\n")
time.sleep(0.1)
sys.stderr.write( o.recv(1024) )
o.send("cd /var\r\n")
time.sleep(0.1)
sys.stderr.write( o.recv(1024) )
o.send("./octdump.x < "+sys.argv[3]+"\r\n" );
sys.stderr.write( o.recv(21) )
get="y"
while get and not ('#' in get):
    get = o.recv(4096)
    get = get.translate( None, '\r' )
    sys.stdout.write( get )
    time.sleep(0.5)
o.close()
The DSL520B modem had the following libraries...
libcrypt.so.0 libpsi.so libutil.so.0 ld-uClibc.so.0 libc.so.0 libdl.so.0 libpsixml.so
... and I thought I might cross compile using these libraries since (at least in theory) -- GCC could link against them; and my problem might be solved.
I made very sure to erase all the incompatible .so libraries from gcc-4.9.0/mips-linux/mips-linux/lib, but kept the generic crt..o files; then I copied the modem's libraries into the cross compiler directory.
But even though the kernel version of the source code and the kernel version of the modem matched, GCC found undefined symbols in the crt files... So either the generic crt files or the modem libraries themselves are somehow defective, and I don't know why. Without knowing how to find the exact version of the (uClibc?) library, I'm not sure how I can get the CORRECT source code to recompile the libraries and the crt's from scratch.

GNU inline assembly to move data

I want to write a 64 bit integer to a particular memory location.
sample C++ code would look like this:
extern char* base;
extern uint64_t data;
((uint64_t *)base)[1] = data;
Now, here is my attempt to write the above as inline assembly:
uint64_t addr = (uint64_t)base + 8;
asm volatile (
"movq %0, (%1)\n\t"
:: "r" (data), "r"(addr) : "memory"
);
The above works in a small test program but in my application, I suspect that something here is off.
Do I need to specify any output operands or any other constraints in the above?
Thanks!
Just say:
asm("mov %1, %0\n\t" : "=m"(*(uint64_t*)(base + 8)) : "r"(data) : "memory");
The tricky thing is that with the "m" constraint you (perhaps counterintuitively) do not pass an address. Instead you pass what in C looks like the "value" of the object you want to change, i.e. an lvalue.
That is the reason for the odd pointer-cast-dereference here. The compiler takes care of substituting the address of that operand into the instruction.

virtual to physical address conversion in linux kernel

The following is used to translate virtual address to physical address in linux kernel. But what does it mean?
I have very limited knowledge of assembly
#define __pv_stub(from,to,instr,type) \
    __asm__("# __pv_stub\n" \
    "1: " instr " %0, %1, %2\n" \
    " .pushsection .pv_table,\"a\"\n" \
    " .long 1b\n" \
    " .popsection\n" \
    : "=r" (to) \
    : "r" (from), "I" (type))
It's not really "assembly" as there is no instruction in this macro per se.
It's just a macro which inserts instr (an instruction passed to the macro) which has one input operand from, one immediate (constant) input operand type and a output operand to.
There is also the part between pushsection and popsection which records in a specific binary section pv_table the address of this instruction. That allows the kernel to find these places in its code if it wishes to.
The last part is the asm constraints and operands. It lists what the compiler will replace %0, %1 and %2 with. %0 is the first one listed ("=r" (to)): it will be any general-purpose register, and as an output operand its value is stored in the macro argument to. The other two are similar except that they are input operands: from lives in a register, so it gets "r", while type is an immediate, so it gets "I" (on ARM, an immediate that is valid in a data-processing instruction).
See http://gcc.gnu.org/onlinedocs/gcc-4.8.1/gcc/Extended-Asm.html#Extended-Asm for details
So consider this code from the kernel (http://lxr.linux.no/linux+v3.9.4/arch/arm/include/asm/memory.h#L172)
static inline unsigned long __virt_to_phys(unsigned long x)
{ unsigned long t;
__pv_stub(x, t, "add", __PV_BITS_31_24);
return t;
}
__pv_stub will be equivalent to t = x + __PV_BITS_31_24 (instr == add, from == x, to == t, type == __PV_BITS_31_24)
So you might wonder why anybody would do such a complicated thing instead of just writing t = x + __PV_BITS_31_24 in the code.
The reason is the pv_table I mentioned above. The address of all these statements is recorded in a specific elf section. Under some circumstances, the kernel patches these instructions at runtime (so needs to be able to easily find all of them) hence the need for a table.
The ARM port does exactly that here: http://lxr.linux.no/linux+v3.9.4/arch/arm/kernel/head.S#L541
It's used only if CONFIG_ARM_PATCH_PHYS_VIRT is used to compile the kernel:
CONFIG_ARM_PATCH_PHYS_VIRT:
Patch phys-to-virt and virt-to-phys translation functions at
boot and module load time according to the position of the
kernel in system memory.
This can only be used with non-XIP MMU kernels where the base
of physical memory is at a 16MB boundary, or theoretically 64K
for the MSM machine class.

Linux assembler error "impossible constraint in ‘asm’"

I'm starting with assembler under Linux. I have saved the following code as testasm.c
and compiled it with: gcc testasm.c -otestasm
The compiler replies: "impossible constraint in ‘asm’".
#include <stdio.h>
int main(void)
{
int foo=10,bar=15;
__asm__ __volatile__ ("addl %%ebx,%%eax"
: "=eax"(foo)
: "eax"(foo), "ebx"(bar)
: "eax"
);
printf("foo = %d", foo);
return 0;
}
How can I resolve this problem?
(I've copied the example from here.)
Debian Lenny, kernel 2.6.26-2-amd64
gcc version 4.3.2 (Debian 4.3.2-1.1)
Resolution:
See the accepted answer - it seems the 'modified' clause is not supported any more.
__asm__ __volatile__ ("addl %%ebx,%%eax" : "=a"(foo) : "a"(foo), "b"(bar));
seems to work. I believe that the syntax for register constraints changed at some point, but it's not terribly well documented. I find it easier to write raw assembly and avoid the hassle.
The constraints are single letters (possibly with extra decorations), and you can specify several alternatives (e.g., an immediate operand or a register is "ir"). So the constraint "eax" means the constraints "e" (signed 32-bit integer constant), "a" (register eax), or "x" (any SSE register). That is a bit different from what OP meant, and an output to an "e" clearly doesn't make any sense. Also, if some operand (in this case an input and an output) must be the same as another, you refer to it by a number constraint. There is no need to say eax will be clobbered, since it is an output. You can refer to the arguments in the inline code by %0, %1, ..., so there is no need to use explicit register names. The correct version of the code as intended by OP would be:
#include <stdio.h>
int main(void)
{
int foo=10, bar=15;
__asm__ __volatile__ (
"addl %2, %0"
: "=a" (foo)
: "0" (foo), "b" (bar)
);
printf("foo = %d", foo);
return 0;
}
A better solution would be to allow %2 to be anything, and %0 a register (as x86 allows, but you'd have to check your machine manual):
#include <stdio.h>
int main(void)
{
int foo=10, bar=15;
__asm__ __volatile__ (
"addl %2, %0"
: "=r" (foo)
: "0" (foo), "g" (bar)
);
printf("foo = %d", foo);
return 0;
}
If you want to spread the instructions over several lines, this will also work:
__asm__ __volatile__ (
"addl %%ebx,%%eax; \
addl %%eax, %%eax;"
: "=a"(foo)
: "a"(foo), "b"(bar)
);
A '\' must be added at the end of each continued line so that the compiler accepts the multiline string (the instructions).
