Thread local storage vs. repeat string instructions (rep movs) - multithreading

I wonder how compilers (GCC or LLVM, for example) behave when a variable is declared to be stored in Thread Local Storage (TLS) but the code using the variable is a candidate for generating repeat string instructions (rep movsd, rep stosd, etc.). My guess is that, since rep string instructions store data through the ES segment, the compiler suppresses them for TLS variables. But that is just my assumption. If you know the exact answer, you're welcome to share it.
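To make the scenario concrete, here is a sketch of the kind of code I mean (the names are made up, not from any real code base):

```cpp
#include <cstring>

// A thread-local buffer: each thread gets its own copy.
thread_local char tls_buf[4096];

// Filling it is a classic candidate for rep stosb / an inlined memset.
void fill_tls(char value) {
    // The compiler must first compute the buffer's linear address
    // (typically via the FS- or GS-relative TLS base) into a register;
    // after that, the block store itself uses ordinary addressing.
    std::memset(tls_buf, value, sizeof tls_buf);
}
```

The question is what the compiler emits for the memset here once the TLS address has been materialized in a general register.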
Thanks,
Andrei.

Related

Clarion 6.3 DLL, *CSTRING parameter exported function - adds an invisible parameter?

I need to negotiate a function call, from my Delphi app, into a provided DLL made in Clarion 6.3.
I need to pass one or two string parameters (either one function with two params, or two single-param functions).
We quickly settled on using 1-byte 0-ended strings (char* in C terms, CSTRING in Clarion terms, PAnsiChar in Delphi terms), and that is where things got a bit unpredictable and hard to understand.
The working solution we got was passing untyped pointers disguised as 32-bit integers, which the Clarion-made DLL then uses to traverse memory with something the Clarion programmer called "pick" or maybe "peek". There are also forum articles on interop between Clarion and Visual Basic which address passing strings from VB into Clarion; glancing at them over my shoulder, the Clarion developer said something like "I don't need a copy of it, I already know it, it is typical".
This however puts more burden on us long-term, as low-level untyped code is much "richer" in boilerplate and more error-prone. Typed code would feel like a better solution.
What I seek here is less of "this is the pattern to copy-paste and make things work without thinking" - we already have that - and more of an understanding of what is going on under the hood, how I can rely on it, and what I should expect from Clarion DLLs. To avoid eventually getting stuck with a "works by chance" solution.
As I was glancing into the Clarion 6.3 help from behind his shoulder, it was not helpful on low-level details. It was all about calling DLLs from Clarion, not about being called. I also don't have Clarion on my machine, nor do I want to, ahem, borrow it. And, as I've been told, sources of the Clarion 6.3 runtime are not available to developers either.
Articles like interop between Clarion and VB or between Clarion and C# are not helpful, because they fuse the idiosyncrasies of both languages and give even less information about the "bare metal" level.
Google Books pointed to "Clarion Tips & Techniques" by David Harms - it seems to have interesting insights for seasoned Clarion developers, but I am a Clarion zero. At least I was not able to figure out the low-level interop-enabling details from it.
Is there maybe a way to make Clarion 6.3 save "listing files" for the DLLs it makes - a standard *.H header file, maybe?
So, to repeat: what worked as expected was a function passing pointers, declared on the Delphi side as procedure ...(const param1, param2: PAnsiChar); stdcall; which should translate to C stdcall void ...(char* p1, char* p2) and which allegedly looks in Clarion something like (LONG, LONG), LONG, pascal, RAW.
This function takes two 32-bit parameters from the stack in reverse order, uses them, and exits, passing the return value (actually unused garbage) in the EAX register and clearing the parameters from the stack. Almost exactly stdcall, except that it seems to preserve the EBX register for some obscure reason.
Clarion function entry:
04E5D38C 83EC04 sub esp,$04 ' allocate local vars
04E5D38F 53 push ebx ' ????????
04E5D390 8B44240C mov eax,[esp+$0c]
04E5D394 BBB4DDEB04 mov ebx,$04ebddb4
04E5D399 B907010000 mov ecx,$00000107
04E5D39E E889A1FBFF call $04e1752c ' clear off local vars before use
And its exit
00B8D500 8B442406 mov eax,[esp+$06] ' pick return value
00B8D504 5B pop ebx ' ????
00B8D505 83C41C add esp,$1c ' remove local vars
00B8D508 C20800 ret $0008 ' remove two 32-bits params from stack
Except for the (to me) unexplainable manipulation of EBX and the garbage return value, it works as expected. But it requires untyped low-level operations in the Clarion sources.
Now the function that allegedly takes only one string parameter: on the Delphi side - procedure ...(const param1: PAnsiChar); stdcall; which should translate to C stdcall void ...(char* p1) and which allegedly looks in Clarion something like (*CSTRING), LONG, pascal, RAW.
Clarion function entry:
00B8D47C 83EC1C sub esp,$1c ' allocate local vars
00B8D47F 53 push ebx ' ????????
00B8D480 8D44240A lea eax,[esp+$0a]
00B8D484 BB16000000 mov ebx,$00000016
00B8D489 B990FEBD00 mov ecx,$00bdfe90
00B8D48E BA15000000 mov edx,$00000015
00B8D493 E82002FBFF call $00b3d6b8 ' clear off local vars before use
And its exit
04E5D492 8B442404 mov eax,[esp+$04] ' pick return value
04E5D496 5B pop ebx ' ????
04E5D497 83C404 add esp,$04 ' remove local vars
04E5D49A C20800 ret $0008 ' remove TWO 32-bits params from stack
What strikes me here is that the function somehow expects TWO parameters, and only the second one is used (I did not see any reference to the first parameter in the x86 asm code). The function seems to work fine if called as procedure ...(const garbage: integer; const param1: PAnsiChar); stdcall; which should translate to C stdcall void ...(int garbage, char* p1).
This "invisible" parameter looks much like a Self/This pointer in the methods of object-oriented languages, but the Clarion programmer told me with certainty that no objects were involved. More so, his 'double-LONG' function does not seem to expect an invisible parameter either.
The aforementioned 'Tips' book describes the &CSTRING and &STRING Clarion types as actually being two parameters under the hood: a pointer to the buffer and the buffer length. It gives no information on how exactly they are passed on the stack, though. And I was told Clarion refused to make a DLL with an exported &CSTRING-parameterized function.
I could suppose the invisible parameter is where Clarion wants to store the function's return value (had there been any assignment to it in the Clarion sources), crossing the stdcall/PASCAL convention - but the assembler epilogue shows clear use of the EAX register for that, and again, the 'double-LONG' function does not use it.
And so, while I made "works on my machine" quality code that successfully calls that Clarion function by voluntarily inserting a garbage parameter, I feel rather fuzzy, because I cannot understand what Clarion is doing there and why - and hence what it could suddenly start doing in the future after seemingly unrelated changes.
What is that invisible parameter? Why does it appear there? What should I expect from it?
If you are consuming a DLL from Clarion you can prototype with RAW - but procedures in a Clarion DLL cannot do this.
So in the Clarion DLL they can prototype as
Whatever Procedure(*Cstring parm1, *Cstring parm2),C,name('whatever')
And, as you note, from your side you should see this as 4 parameters: length, pointer, length, pointer. (Knowing explicit max lengths is not a bad thing from a safety point of view anyway.)
The alternative is
Whatever Procedure(Long parm1, Long parm2),C,name('whatever')
Then from your side it's just 2 addresses.
But there's a bit more code on his side turning those incoming addresses into memory pointers. (Yes, he can use PEEK and POKE, but that's a bit of overkill.)
(From memory, he could just declare local variables as
parm1String &cstring,over(parm1)
parm2String &cstring,over(parm2)
but it's been decades since I did this, so I'm not 100% sure that syntax is legal.)
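A hedged C-side sketch of the two prototyping options above (the function names, the toy bodies, and the length-before-pointer order are my assumptions; verify the actual order against the generated code):

```cpp
#include <cstring>

// Stand-in body for a routine exported from the Clarion DLL as
// "(*Cstring p1, *Cstring p2),C,name('whatever')": each *CSTRING
// crosses the boundary as a (length, pointer) pair, so the C-side
// import is four parameters. The body is a toy, just to make the
// sketch runnable.
extern "C" void whatever_cstring(long len1, char* p1, long len2, const char* p2) {
    long n = len1 < len2 ? len1 : len2;   // respect both declared sizes
    std::strncpy(p1, p2, (size_t)n);
}

// The "(Long p1, Long p2),C,name('whatever')" alternative: just two
// raw addresses, with interpretation left entirely to the two sides.
extern "C" void whatever_long(char* p1, const char* p2) {
    std::strcpy(p1, p2);
}
```

On this reading, the "invisible" parameter in the question would simply be the length half of such a pair.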

How to tell compiler to pad a specific amount of bytes on every C function?

I'm trying to practice some live instrumentation and I saw there was a linker option -call-nop=prefix-nop, but it has some restrictions, as it only works with GOT functions (I don't know how to force the compiler to generate GOT function calls, and I'm not sure it's a good idea for performance reasons). Also, -call-nop=* cannot pad more than 1 byte.
Ideally, I'd like a compiler option to pad any specific number of bytes, with the compiler still performing all the normal function alignment.
Once I have this padding area, I can reuse it at run time to store some values or redirect control flow.
P.S. I believe the Linux kernel uses a similar trick to dynamically enable some software tracepoints.
-pg is intended for gprof-style profiling (it makes the compiler emit a call to mcount at each function entry), not for reserving patch space. The correct option for this is -fpatchable-function-entry
-fpatchable-function-entry=N[,M]
Generate N NOPs right at the beginning of each function, with the function entry point before the Mth NOP. If M is omitted, it defaults to 0 so the function entry points to the address just at the first NOP. The NOP instructions reserve extra space which can be used to patch in any desired instrumentation at run time, provided that the code segment is writable. The amount of space is controllable indirectly via the number of NOPs; the NOP instruction used corresponds to the instruction emitted by the internal GCC back-end interface gen_nop. This behavior is target-specific and may also depend on the architecture variant and/or other compilation options.
It'll insert N single-byte 0x90 NOPs and doesn't make use of multi-byte NOPs, so performance isn't as good as it could be, but you probably don't care about that in this case, so the option should work fine.
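If you only want the padding on selected functions, there is also a per-function attribute form (a sketch assuming GCC 8+ or a recent Clang; the attribute name and argument order follow the GCC documentation):

```cpp
// Reserve 5 NOPs, with the entry point placed after the first 2 - the
// per-function equivalent of -fpatchable-function-entry=5,2.
__attribute__((patchable_function_entry(5, 2)))
int add_patchable(int a, int b) {
    return a + b;   // the NOPs sit around the entry, not in the body
}
```

The function behaves exactly as normal; the reserved NOP area only matters to whatever patches it at run time.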
I achieved this goal by implementing my own mcount function in an assembly file and compiling the code with -pg.

Forcing MSVC to generate FIST instructions with the /QIfist option

I'm using the /QIfist compiler switch regularly, which causes the compiler to generate FISTP instructions to round floating point values to integers, instead of calling the _ftol helper function.
How can I make it use FIST(P) DWORD, instead of QWORD?
FIST QWORD requires the CPU to store the result on the stack, then read the stack into a register and finally store to destination memory, while FIST DWORD just stores directly into destination memory.
FIST QWORD requires the CPU to store the result on the stack, then read the stack into a register and finally store to destination memory, while FIST DWORD just stores directly into destination memory.
I don't understand what you are trying to say here.
The FIST and FISTP instructions differ from each other in exactly two ways:
FISTP pops the top value off of the floating point stack, while FIST does not. This is the obvious difference, and is reflected in the opcode naming: FISTP has that P suffix, which means "pop", just like FADDP, etc.
FISTP has an additional encoding that works with 64-bit (QWORD) operands. That means you can use FISTP to convert a floating point value to a 64-bit integer. FIST, on the other hand, maxes out at 32-bit (DWORD) operands.
(I don't think there's a technical reason for this. I certainly can't imagine it is related to the popping behavior. I assume that when the Intel engineers added support for 64-bit operands some time later, they figured there was no reason for a non-popping version. They were probably running out of opcode encodings.)
There are lots of online references for the x86 instruction set. For example, this site is the top hit for most Google searches. Or you can look in Intel's manuals (FIST/FISTP are on p. 365).
Where the two instructions read the value from, and where they store it to, is exactly the same. Both read the value from the top of the floating point stack, and both store the result to memory.
There would be absolutely no advantage to the compiler using FIST instead of FISTP. Remember that you always have to pop all values off of the floating point stack when exiting from a function, so if FIST is used, you'd have to follow it by an additional FSTP instruction. That might not be any slower, but it would needlessly inflate the code.
Besides, there's another reason that the compiler prefers FISTP: the support for 64-bit operands. It allows the code generator to be identical, regardless of what size integer you're rounding to.
The only time you might prefer to use FIST is if you're hand-writing assembly code and want to re-use the floating point value on the stack after rounding it. The compiler doesn't need to do that.
So anyway, all of that to say that the answer to your question is no. The compiler can't be made to generate FIST instructions automatically. If you're still insistent, you can write inline assembly that uses whatever instructions you want:
int RoundToNearestEven(float value)
{
    int result;
    __asm
    {
        fld  DWORD PTR value
        fist DWORD PTR result
        // do something with the value on the floating point stack...
        //
        // ... but be sure to pop it off before returning
        fstp st(0)
    }
    return result;
}

Good references for the syscalls

I need a reference - a good one, possibly with some nice examples - because I am starting to write code in assembly using the NASM assembler. I have this reference:
http://bluemaster.iu.hio.no/edu/dark/lin-asm/syscalls.html
which is quite nice and useful, but it has a lot of limitations because it doesn't explain the fields in the other registers. For example, if I am using the write syscall, I know I should put 1 in the EAX register, and ECX is probably a pointer to the string, but what about EBX and EDX? I would like that to be explained too: that EBX determines the input (0 for stdin, 1 for something else, etc.) and EDX is the length of the string, etc. I hope you understand what I want; I couldn't find any such materials, so that's why I am writing here.
Thanks in advance.
The standard programming language in Linux is C. Because of that, the best descriptions of the system calls will show them as C functions to be called. Given their description as a C function and a knowledge of how to map them to the actual system call in assembly, you will be able to use any system call you want easily.
First, you need a reference for all the system calls as they would appear to a C programmer. The best one I know of is the Linux man-pages project, in particular the system calls section.
Let's take the write system call as an example, since it is the one in your question. As you can see, the first parameter is a signed integer, which is usually a file descriptor returned by the open syscall. These file descriptors could also have been inherited from your parent process, as usually happens for the first three file descriptors (0=stdin, 1=stdout, 2=stderr). The second parameter is a pointer to a buffer, and the third parameter is the buffer's size (as an unsigned integer). Finally, the function returns a signed integer, which is the number of bytes written, or a negative number for an error.
Now, how to map this to the actual system call? There are many ways to do a system call on 32-bit x86 (which is probably what you are using, based on your register names); be careful that it is completely different on 64-bit x86 (be sure you are assembling in 32-bit mode and linking a 32-bit executable; see this question for an example of how things can go wrong otherwise). The oldest, simplest and slowest of them in the 32-bit x86 is the int $0x80 method.
For the int $0x80 method, you put the system call number in %eax, and the parameters in %ebx, %ecx, %edx, %esi, %edi, and %ebp, in that order. Then you call int $0x80, and the return value from the system call is on %eax. Note that this return value is different from what the reference says; the reference shows how the C library will return it, but the system call returns -errno on error (for instance -EINVAL). The C library will move this to errno and return -1 in that case. See syscalls(2) and intro(2) for more detail.
So, in the write example, you would put the write system call number in %eax, the first parameter (file descriptor number) in %ebx, the second parameter (pointer to the string) in %ecx, and the third parameter (length of the string) in %edx. The system call will return in %eax either the number of bytes written, or the error number negated (if the return value is between -1 and -4095, it is a negated error number).
Finally, how do you find the system call numbers? They can be found at /usr/include/linux/unistd.h. On my system, this just includes /usr/include/asm/unistd.h, which finally includes /usr/include/asm/unistd_32.h, so the numbers are there (for write, you can see __NR_write is 4). The same goes for the error numbers, which come from /usr/include/linux/errno.h (on my system, after chasing the inclusion chain I find the first ones at /usr/include/asm-generic/errno-base.h and the rest at /usr/include/asm-generic/errno.h). For the system calls which use other constants or structures, their documentation tells which headers you should look at to find the corresponding definitions.
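To tie the write example together without dropping to assembly: the libc syscall(2) wrapper performs exactly the register setup described above, and SYS_write expands to the right number for whichever ABI you compile for (4 on 32-bit x86, 1 on x86-64). A minimal, Linux-specific sketch:

```cpp
#include <sys/syscall.h>
#include <unistd.h>

// Write a buffer to a file descriptor via a raw system call.
// Note: syscall() behaves like the rest of the C library - on failure
// it returns -1 and sets errno, hiding the kernel's raw -errno value.
long raw_write(int fd, const void* buf, unsigned long count) {
    // eax <- SYS_write, ebx <- fd, ecx <- buf, edx <- count
    // (on 32-bit x86 via int $0x80; other ABIs use other registers)
    return syscall(SYS_write, fd, buf, count);
}
```

On success the return value is the number of bytes written, matching the man-page description of write(2).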
Now, as I said, int $0x80 is the oldest and slowest method. Newer processors have special system call instructions which are faster. To use them, the kernel makes available a virtual dynamic shared object (the vDSO; it is like a shared library, but in memory only) with a function you can call to do a system call using the best method available for your hardware. It also makes available special functions to get the current time without even having to do a system call, and a few other things. Of course, it is a bit harder to use if you are not using a dynamic linker.
There is also another older method, the vsyscall, which is similar to the vDSO but uses a single page at a fixed address. This method is deprecated, will result in warnings on the system log if you are using recent kernels, can be disabled on boot on even more recent kernels, and might be removed in the future. Do not use it.
If you download that web page (like it suggests in the second paragraph) and download the kernel sources, you can click the links in the "Source" column, and go directly to the source file that implements the system calls. You can read their C signatures to see what each parameter is used for.
If you're just looking for a quick reference, each of those system calls has a C library interface with the same name minus the sys_ prefix. So, for example, you could check out man 2 lseek to get the information about the parameters for sys_lseek:
off_t lseek(int fd, off_t offset, int whence);
where, as you can see, the parameters match the ones from your HTML table:
%ebx          %ecx    %edx
unsigned int  off_t   unsigned int

Visual C++ App crashes before main in Release, but runs fine in Debug

When run in Release, it crashes with an unhandled exception: std::length error.
The call stack looks like this:
msvcr90.dll!__set_flsgetvalue() Line 256 + 0xc bytes C
msvcr90.dll!__set_flsgetvalue() Line 256 + 0xc bytes C
msvcr90.dll!_getptd_noexit() Line 616 + 0x7 bytes C
msvcr90.dll!_getptd() Line 641 + 0x5 bytes C
msvcr90.dll!rand() Line 68 C
NEM.exe!CGAL::Random::Random() + 0x34 bytes C++
msvcr90.dll!_initterm(void (void)* * pfbegin=0x00000003, void (void)* * pfend=0x00345560) Line 903 C
NEM.exe!__tmainCRTStartup() Line 582 + 0x17 bytes C
kernel32.dll!7c817067()
Has anyone got any clues?
Examining the stack dump:
_initterm is simply a function that walks a list of other functions and executes each in turn - it is used for, amongst other things, global constructors (on startup), global destructors (on shutdown) and atexit lists (also on shutdown).
You are linking with CGAL; the CGAL::Random::Random in your stack dump is there because CGAL defines a global variable called default_random of type CGAL::Random. That's why your error is happening before main: default_random is being constructed.
From the CGAL source, all it does is call the standard C srand(time(NULL)) followed by the local get_int which, in turn, calls the standard C rand() to get a random number.
However, you're not getting to the second stage, since your stack dump is still within srand().
It looks like it's converting your thread into a fiber lazily, i.e., this is the first time you've tried to do something in the thread and it has to set up fiber-local storage before continuing.
So, a couple of things to try and investigate.
1/ Are you running this code on pre-XP? I believe fiber-local storage (__set_flsgetvalue) was introduced in XP. This is a long shot, but we need to rule it out anyway.
2/ Do you need to link with CGAL? I'm assuming your application needs something in the CGAL libraries; otherwise, don't link with it. It may be a hangover from another project file.
3/ If you do use CGAL, make sure you're using the latest version. As of 3.3, it supports dynamic linking, which should prevent the possibility of mixing different library versions (both static/dynamic and debug/non-debug).
4/ Can you try compiling with VC8? The CGAL supported platforms do NOT yet include VC9 (VS2008). You may need to follow this up with the CGAL team itself to see if they're working on that support.
5/ And, finally, do you have Boost installed? That's another long shot, but worth a look anyway.
If none of those suggestions help, you'll have to wait for someone more knowledgeable than I to come along, I'm afraid.
Best of luck.
Crashes before main() are usually caused by a bad constructor in a global or static variable.
This looks like the constructor for class Random.
Do you have a global or static variable of type Random? Is it possible that you're trying to construct it before the library it's in has been properly initialized?
Note that the order of construction of global and static variables across translation units is not fixed and might change going from Debug to Release.
Could you be more specific about the error you're receiving? (An unhandled exception of "std::length" sounds weird - I've never heard of it.)
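A minimal sketch of this failure mode and the usual fix. The two globals are simulated here in one file; in real code they would sit in different translation units, where construction order is unspecified:

```cpp
#include <vector>

// "Construct on first use": a function-local static is initialized the
// first time the function is called, so code running during static
// initialization can safely depend on it - unlike a plain global,
// whose construction order relative to other translation units is
// unspecified and can change between Debug and Release builds.
std::vector<int>& registry() {
    static std::vector<int> v;
    return v;
}

struct AutoRegister {
    explicit AutoRegister(int id) { registry().push_back(id); }
};

// These constructors run before main(), just like CGAL's default_random.
AutoRegister reg_a(1), reg_b(2);
```

If registry() were instead a plain global vector, reg_a and reg_b could be constructed before it, which is exactly the class of before-main crash described above.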
To my knowledge, FlsGetValue automatically falls back to its TLS counterpart if the FLS API is not available.
If you're still stuck, take a .dmp of your process at the time of the crash and post it (use any of the numerous free upload services and give us a link). (Sounds like a missing feature in SO - source/data file exchange?)
