This structure is used to store extended information about stack frames.
typedef struct _CallSnapshotEx {
DWORD dwReturnAddr;
DWORD dwFramePtr;
DWORD dwCurProc;
DWORD dwParams[4];
} CallSnapshotEx;
Does anybody know what dwParams is about?
The structure CallSnapshotEx gets filled by the function call GetThreadCallStack with STACKSNAP_EXTENDED_INFO flag. The dwParams contains the parameters for the function. I guess it is just a standard convention to use four parameters. Because, if you observe the documentation of another related structure CallSnapshot3, the number of parameters is four. I don't think there might be any other strong reason behind it other than the convention that a minimum of four dwords must always be allocated.
Related
I'm using the /QIfist compiler switch regularly, which causes the compiler to generate FISTP instructions to round floating point values to integers, instead of calling the _ftol helper function.
How can I make it use FIST(P) DWORD, instead of QWORD?
FIST QWORD requires the CPU to store the result on stack, then read stack into register and finally store to destination memory, while FIST DWORD just stores directly into destination memory.
FIST QWORD requires the CPU to store the result on stack, then read stack into register and finally store to destination memory, while FIST DWORD just stores directly into destination memory.
I don't understand what you are trying to say here.
The FIST and FISTP instructions differ from each other in exactly two ways:
FISTP pops the top value off of the floating point stack, while FIST does not. This is the obvious difference, and is reflected in the opcode naming: FISTP has that P suffix, which means "pop", just like ADDP, etc.
FISTP has an additional encoding that works with 64-bit (QWORD) operands. That means you can use FISTP to convert a floating point value to a 64-bit integer. FIST, on the other hand, maxes out at 32-bit (DWORD) operands.
(I don't think there's a technical reason for this. I certainly can't imagine it is related to the popping behavior. I assume that when the Intel engineers added support for 64-bit operands some time later, they figured there was no reason for a non-popping version. They were probably running out of opcode encodings.)
There are lots of online references for the x86 instruction set. For example, this site is the top hit for most Google searches. Or you can look in Intel's manuals (FIST/FISTP are on p. 365).
Where the two instructions read the value from, and where they store it to, is exactly the same. Both read the value from the top of the floating point stack, and both store the result to memory.
There would be absolutely no advantage to the compiler using FIST instead of FISTP. Remember that you always have to pop all values off of the floating point stack when exiting from a function, so if FIST is used, you'd have to follow it by an additional FSTP instruction. That might not be any slower, but it would needlessly inflate the code.
Besides, there's another reason that the compiler prefers FISTP: the support for 64-bit operands. It allows the code generator to be identical, regardless of what size integer you're rounding to.
The only time you might prefer to use FIST is if you're hand-writing assembly code and want to re-use the floating point value on the stack after rounding it. The compiler doesn't need to do that.
So anyway, all of that to say that the answer to your question is no. The compiler can't be made to generate FIST instructions automatically. If you're still insistent, you can write inline assembly that uses whatever instructions you want:
int32 RoundToNearestEven(float value)
{
int32 result;
__asm
{
fld DWORD PTR value
fist DWORD PTR result
// do something with the value on the floating point stack...
//
// ... but be sure to pop it off before returning
fstp st(0)
}
return result;
}
I am trying to understand what the following code segment from tls.h in glibc is doing and why:
/* Macros to load from and store into segment registers. */
# define TLS_GET_FS() \
({ int __seg; __asm ("movl %%fs, %0" : "=q" (__seg)); __seg; })
I think I understand the basic operation it is moving the value stored in the fs register to __seg. However, I have some questions:
My understanding is the fs is only 16-bits. Is this correct? What happens when the value gets moved to a quadword memory location? Does this mean the upper bits get set to 0?
More importantly I think that the scope of the variable __seg that gets declared at the start of the segment is limited to this segment. So how is __seg useful? I'm sure that the authors of glibc have a good reason for doing this but I can't figure out what it is from looking at the source code.
I tried generating assembly for this code and I got the following?
#APP
# 13 "fs-test.cpp" 1
movl %fs, %eax
# 0 "" 2
#NO_APP
So in my case it looks like eax was used for __seg. But I don't know if that is always what happens or if it was just what happened in the small test file that I compiled. If it is always going to use eax why wouldn't the assembly be written that way? If the compiler might pick other registers then how will the programmer know which one to access since __seg goes out of scope at the end of the macro? Finally I did not see this macro used anywhere when I grepped for it in the glibc source code, so that further adds to my confusion about what its purpose is. Any explanation about what the code is doing and why is appreciated.
My understanding is the fs is only 16-bits. Is this correct? What happens when the value gets moved to a quadword memory location? Does this mean the upper bits get set to 0?
Yes.
the variable __seg that gets declared at the start of the segment is limited to this segment. So how is __seg useful?
You have to read about GCC statement-expression extension. The value of statement expression is the value of the last expression in it. The __seg; at the end would be useless, unless one assigns it to something else, like this:
int foo = TLS_GET_FS();
Finally I did not see this macro used anywhere when I grepped for it in the glibc source code
The TLS_{GET,SET}_FS in fact do not appear to be used. They probably were used in some version, then accidentally left over when the code referencing them was removed.
I have recently been reading about a family of automatic memory management techniques that rely on storing information in the pointer returned by the allocator, i.e. few bits of header e.g. to differentiate between pointers or to store thread-related information (note that I'm not talking about limited-field reference counting here, only immutable information).
I'd like to toy with these techniques. Now, to implement them, I need to be able to return pointers with a specific shape from my allocator. I suppose I could play with the least weight bits but this would require padding that looks extremely memory consuming, so I believe that I should play with the heaviest bits. However, I have no good idea on how to do this. Is there a way for me to, call malloc or malloc_create_zone or some related function and request a pointer that always starts with the given bits?
Thanks everyone!
The amount of information you can actually store in a pointer is pretty limited (typically one or two bits per pointer). And every attempt to dereference the pointer has to first mask out the magic information. The technique is often called tagging, BTW.
#define TAG_MASK 0x3
#define CONS_TAG 0x1
#define STRING_TAG 0x2
#define NUMBER_TAG 0x3
typedef uintptr_t value_t;
typedef struct cons {
value_t car;
value_t cdr;
} cons_t;
value_t
create_cons(value_t t1, value_t t2)
{
cons_t* pair = malloc(sizeof(cons_t));
value_t addr = (value_t)pair;
pair->car = t1;
pair->cdr = t2;
return addr | CONS_TAG;
}
value_t
car_of_cons(value_t v)
{
if ((v % TAG_MASK) != CONS_TAG) error("wrong type of argument");
return ((cons_t*) (v & ~TAG_MASK))->car;
}
One advantage of this technique is, that you can directly infer the type of the object from the pointer itself. You don't need to dereference it (say, in order to read a special type field or similar). Many language implementations using this scheme also have a special tag combination for "immediate" numbers and other small values, which can be represented direcly using the "pointer".
The disadvatage is, that the amount of information, which can be stored, is pretty limited. Also, as the example code shows, you have to be aware of the tagging in every access to the object, and need to "untag" the pointer before actually using it.
The use of the least significant bits for tagging stemms from the observation, that on most platforms, all pointer to malloced memory is actually aligned on a non-byte boundary (usually 8 bytes), so the least significant bits are always zero.
I need some reference but a good one, possibly with some nice examples. I need it because I am starting to write code in assembly using the NASM assembler. I have this reference:
http://bluemaster.iu.hio.no/edu/dark/lin-asm/syscalls.html
which is quite nice and useful, but it's got a lot of limitations because it doesn't explain the fields in the other registers. For example, if I am using the write syscall, I know I should put 1 in the EAX register, and the ECX is probably a pointer to the string, but what about EBX and EDX? I would like that to be explained too, that EBX determines the input (0 for stdin, 1 for something else etc.) and EDX is the length of the string to be entered, etc. etc. I hope you understood me what I want, I couldn't find any such materials so that's why I am writing here.
Thanks in advance.
The standard programming language in Linux is C. Because of that, the best descriptions of the system calls will show them as C functions to be called. Given their description as a C function and a knowledge of how to map them to the actual system call in assembly, you will be able to use any system call you want easily.
First, you need a reference for all the system calls as they would appear to a C programmer. The best one I know of is the Linux man-pages project, in particular the system calls section.
Let's take the write system call as an example, since it is the one in your question. As you can see, the first parameter is a signed integer, which is usually a file descriptor returned by the open syscall. These file descriptors could also have been inherited from your parent process, as usually happens for the first three file descriptors (0=stdin, 1=stdout, 2=stderr). The second parameter is a pointer to a buffer, and the third parameter is the buffer's size (as an unsigned integer). Finally, the function returns a signed integer, which is the number of bytes written, or a negative number for an error.
Now, how to map this to the actual system call? There are many ways to do a system call on 32-bit x86 (which is probably what you are using, based on your register names); be careful that it is completely different on 64-bit x86 (be sure you are assembling in 32-bit mode and linking a 32-bit executable; see this question for an example of how things can go wrong otherwise). The oldest, simplest and slowest of them in the 32-bit x86 is the int $0x80 method.
For the int $0x80 method, you put the system call number in %eax, and the parameters in %ebx, %ecx, %edx, %esi, %edi, and %ebp, in that order. Then you call int $0x80, and the return value from the system call is on %eax. Note that this return value is different from what the reference says; the reference shows how the C library will return it, but the system call returns -errno on error (for instance -EINVAL). The C library will move this to errno and return -1 in that case. See syscalls(2) and intro(2) for more detail.
So, in the write example, you would put the write system call number in %eax, the first parameter (file descriptor number) in %ebx, the second parameter (pointer to the string) in %ecx, and the third parameter (length of the string) in %edx. The system call will return in %eax either the number of bytes written, or the error number negated (if the return value is between -1 and -4095, it is a negated error number).
Finally, how do you find the system call numbers? They can be found at /usr/include/linux/unistd.h. On my system, this just includes /usr/include/asm/unistd.h, which finally includes /usr/include/asm/unistd_32.h, so the numbers are there (for write, you can see __NR_write is 4). The same goes for the error numbers, which come from /usr/include/linux/errno.h (on my system, after chasing the inclusion chain I find the first ones at /usr/include/asm-generic/errno-base.h and the rest at /usr/include/asm-generic/errno.h). For the system calls which use other constants or structures, their documentation tells which headers you should look at to find the corresponding definitions.
Now, as I said, int $0x80 is the oldest and slowest method. Newer processors have special system call instructions which are faster. To use them, the kernel makes available a virtual dynamic shared object (the vDSO; it is like a shared library, but in memory only) with a function you can call to do a system call using the best method available for your hardware. It also makes available special functions to get the current time without even having to do a system call, and a few other things. Of course, it is a bit harder to use if you are not using a dynamic linker.
There is also another older method, the vsyscall, which is similar to the vDSO but uses a single page at a fixed address. This method is deprecated, will result in warnings on the system log if you are using recent kernels, can be disabled on boot on even more recent kernels, and might be removed in the future. Do not use it.
If you download that web page (like it suggests in the second paragraph) and download the kernel sources, you can click the links in the "Source" column, and go directly to the source file that implements the system calls. You can read their C signatures to see what each parameter is used for.
If you're just looking for a quick reference, each of those system calls has a C library interface with the same name minus the sys_. So, for example, you could check out man 2 lseek to get the information about the parameters forsys_lseek:
off_t lseek(int fd, off_t offset, int whence);
where, as you can see, the parameters match the ones from your HTML table:
%ebx %ecx %edx
unsigned int off_t unsigned int
I'm looking for an equivalent of LWARX and STWCX (as found on the PowerPC processors) or a way to implement similar functionality on the x86 platform. Also, where would be the best place to find out about such things (i.e. good articles/web sites/forums for lock/wait-free programing).
Edit
I think I might need to give more details as it is being assumed that I'm just looking for a CAS (compare and swap) operation. What I'm trying to do is implement a lock-free reference counting system with smart pointers that can be accessed and changed by multiple threads. I basically need a way to implement the following function on an x86 processor.
int* IncrementAndRetrieve(int **ptr)
{
int val;
int *pval;
do
{
// fetch the pointer to the value
pval = *ptr;
// if its NULL, then just return NULL, the smart pointer
// will then become NULL as well
if(pval == NULL)
return NULL;
// Grab the reference count
val = lwarx(pval);
// make sure the pointer we grabbed the value from
// is still the same one referred to by 'ptr'
if(pval != *ptr)
continue;
// Increment the reference count via 'stwcx' if any other threads
// have done anything that could potentially break then it should
// fail and try again
} while(!stwcx(pval, val + 1));
return pval;
}
I really need something that mimics LWARX and STWCX fairly accurately to pull this off (I can't figure out a way to do this with the CompareExchange, swap or add functions I've so far found for the x86).
Thanks
As Michael mentioned, what you're probably looking for is the cmpxchg instruction.
It's important to point out though that the PPC method of accomplishing this is known as Load Link / Store Conditional (LL/SC), while the x86 architecture uses Compare And Swap (CAS). LL/SC has stronger semantics than CAS in that any change to the value at the conditioned address will cause the store to fail, even if the other change replaces the value with the same value that the load was conditioned on. CAS, on the other hand, would succeed in this case. This is known as the ABA problem (see the CAS link for more info).
If you need the stronger semantics on the x86 architecture, you can approximate it by using the x86s double-width compare-and-swap (DWCAS) instruction cmpxchg8b, or cmpxchg16b under x86_64. This allows you to atomically swap two consecutive 'natural sized' words at once, instead of just the usual one. The basic idea is one of the two words contains the value of interest, and the other one contains an always incrementing 'mutation count'. Although this does not technically eliminate the problem, the likelihood of the mutation counter to wrap between attempts is so low that it's a reasonable substitute for most purposes.
x86 does not directly support "optimistic concurrency" like PPC does -- rather, x86's support for concurrency is based on a "lock prefix", see here. (Some so-called "atomic" instructions such as XCHG actually get their atomicity by intrinsically asserting the LOCK prefix, whether the assembly code programmer has actually coded it or not). It's not exactly "bomb-proof", to put it diplomatically (indeed, it's rather accident-prone, I would say;-).
You're probably looking for the cmpxchg family of instructions.
You'll need to precede these with a lock instruction to get equivalent behaviour.
Have a look here for a quick overview of what's available.
You'll likely end up with something similar to this:
mov ecx,dword ptr [esp+4]
mov edx,dword ptr [esp+8]
mov eax,dword ptr [esp+12]
lock cmpxchg dword ptr [ecx],edx
ret 12
You should read this paper...
Edit
In response to the updated question, are you looking to do something like the Boost shared_ptr? If so, have a look at that code and the files in that directory - they'll definitely get you started.
if you are on 64 bits and limit yourself to say 1tb of heap, you can pack the counter into the 24 unused top bits. if you have word aligned pointers the bottom 5 bits are also available.
int* IncrementAndRetrieve(int **ptr)
{
int val;
int *unpacked;
do
{
val = *ptr;
unpacked = unpack(val);
if(unpacked == NULL)
return NULL;
// pointer is on the bottom
} while(!cas(unpacked, val, val + 1));
return unpacked;
}
Don't know if LWARX and STWCX invalidate the whole cache line, CAS and DCAS do. Meaning that unless you are willing to throw away a lot of memory (64 bytes for each independent "lockable" pointer) you won't see much improvement if you are really pushing your software into stress. The best results I've seen so far were when people consciously casrificed 64b, planed their structures around it (packing stuff that won't be subject of contention), kept everything alligned on 64b boundaries, and used explicit read and write data barriers. Cache line invalidation can cost approx 20 to 100 cycles, making it a bigger real perf issue then just lock avoidance.
Also, you'd have to plan different memory allocation strategy to manage either controlled leaking (if you can partition code into logical "request processing" - one request "leaks" and then releases all it's memory bulk at the end) or datailed allocation management so that one structure under contention never receives memory realesed by elements of the same structure/collection (to prevent ABA). Some of that can be very counter-intuitive but it's either that or paying the price for GC.
What you are trying to do will not work the way you expect. What you implemented above can be done with the InterlockedIncrement function (Win32 function; assembly: XADD).
The reason that your code does not do what you think it does is that another thread can still change the value between the second read of *ptr and stwcx without invalidating the stwcx.