Delphi 11.2: CreateWindowEx fails thread on x64 - multithreading

I'm using Peter Below's PBThreadedSplashForm to display a splash window during application startup.
This component worked great for 10 years, but, since updating my Delphi to 11.2, I get an AV on the CreateWindowEx call.
This happens on the Win64 platform only; there are no problems on Win32.
Does anyone know what could be causing this?

This is one of the many issues that have surfaced in 11.2 due to the new default ASLR settings in the compiler and linker.
After a very quick glance at the source code I see this:
SetWindowLong( wnd, GWL_WNDPROC, Integer( thread.FCallstub ));
thread.FCallstub is defined as Pointer.
Just as I thought.
You see, pointers are of native size, so in 32-bit applications, pointers are 32 bits wide, while in 64-bit applications, pointers are 64 bits wide.
It was very common in the 32-bit world that pointer values were temporarily saved in Integers. This worked because a 32-bit pointer fits in a 32-bit Integer.
But in a 64-bit application, this is an obvious bug, since a 64-bit pointer doesn't fit in a 32-bit Integer. It's like taking a phone number like 5362417812 and truncating it to 17812, hoping that it will still "work".
Of course, in general, this causes bugs such as AVs and memory corruption.
However, until recently, there was a rather high probability that a pointer in a 64-bit Delphi application by "chance" didn't use its 32 upper bits (so it was like maybe $0000000000A3BE41, and so truncating it to $00A3BE41 didn't have any effect). So it seemed to work most of the time, but only by accident.
Now, recent versions of the Delphi compiler and linker enable ASLR by default, making such accidents much less likely.
And this is a good thing: If you have a serious bug in your code, it is better if you discover it right away and not "randomly" at your customers.
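So, to fix the issue, you need to go through the code and make sure you never store a pointer in a 32-bit Integer. Instead, use a native-sized NativeInt, Pointer, LPARAM, or whatever is semantically appropriate. For the window procedure in particular, that means calling SetWindowLongPtr with GWLP_WNDPROC and casting the pointer to a native-sized integer such as NativeInt instead of Integer.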
(Disabling ASLR will also make it work in "many" cases by accident again, but this is a very bad approach. Your software still has a very serious bug that may manifest itself at any time.)
Your code also contains the following casts, which truncate pointers in the same way:
Integer( Pchar( FStatusMessage ))
Integer( Pchar( msg ))
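To make the truncation concrete, here is a minimal sketch in C rather than Delphi. The function names are made up for illustration, and the second address used in the note below is just an invented value with non-zero upper bits, not one taken from the component.

```c
#include <stdbool.h>
#include <stdint.h>

/* Keeps only the low 32 bits of a 64-bit pointer value, which is what
   a cast such as Integer(ptr) effectively does in a 64-bit program.
   (Hypothetical helper, not part of any API.) */
uint64_t through_32bits(uint64_t pointer_value) {
    uint32_t narrowed = (uint32_t)pointer_value; /* upper 32 bits are lost here */
    return (uint64_t)narrowed;
}

/* True when the pointer value comes back unchanged, i.e. when the bug
   stays hidden because the upper 32 bits happened to be zero. */
bool survives_truncation(uint64_t pointer_value) {
    return through_32bits(pointer_value) == pointer_value;
}
```

With the low address from the answer, survives_truncation(0x0000000000A3BE41) is true, so the code appears to work. With a high address of the kind ASLR can hand out, say 0x00007FF6A3BE41C0, the round trip yields 0xA3BE41C0 and the call ends up with a pointer to somewhere else entirely.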


How to find the number of bits of OS/390 or z/OS?

What is the command for finding the number of bits of an OS/390 or z/OS system?
Since there didn't seem to be a "real" answer on this thread, I thought I'd provide one just in case anyone needs the information...
The definitive source of whether you're running in 64-bit mode is the STORE FACILITY LIST (STFL, or STFLE) hardware instruction. It sets two different bits: one to indicate that the 64-bit z/Architecture facility is installed, and one to indicate that the 64-bit z/Architecture facility is active (it was once possible to run in 31-bit mode on 64-bit hardware, so this would give you the "installed, but not active" case).
The operating system generously issues STFL/STFLE during IPL, saving the response in the PSA (that's low memory, starting at location 0). This is handy, since STFL/STFLE are privileged instructions, but testing low storage doesn't require anything special. You can check the value at absolute address 0xc8 (decimal 200) for the 0x20 bit to tell that the system is active in 64-bit mode, otherwise it's 31-bit mode.
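As a rough sketch of the low-storage test just described, the check comes down to one byte and one mask. The function and constant names are invented for illustration. So that the code can run anywhere, it takes a pointer to a copy of low storage (for example, bytes lifted from a dump); on a real system the base would be address 0.

```c
#include <stdbool.h>

enum {
    FACILITY_BYTE_OFFSET = 0xC8, /* decimal 200, where the STFL result is saved */
    ZARCH_ACTIVE_BIT     = 0x20  /* "64-bit mode is active" per the text above */
};

/* low_storage points at a copy of the first bytes of the PSA.
   Returns true when the 0x20 bit at offset 0xC8 is set (64-bit mode),
   false otherwise (31-bit mode). */
bool zarch_active(const unsigned char *low_storage) {
    return (low_storage[FACILITY_BYTE_OFFSET] & ZARCH_ACTIVE_BIT) != 0;
}
```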
Although I doubt there are any pre-MVS/XA systems alive anymore (that is, 24-bit), for completeness you can also test CVTDCB.CVTMVSE bit - if this bit is not set, then you have a pre-MVS/XA 24-bit mode system. Finding this bit is simple - but left as an exercise for the reader... :)
If you're not able to write a program to test the above, then there are a variety of ways to display storage, such as TSO TEST or any mainframe debugger, as well as by looking at a dump, etc.
While I was not able to find a command that gives this information, I think the following is what you're looking for:
According to this:
z/OS is OS/390 with various extensions including support for 64-bit architecture.
So if you're on a zSeries processor with z/OS, you're on 64-bit.
According to this:
OS/390 was installed on ESA/390 computers, which were 32-bit computers, but were 31-bit addressable.
For either z/OS or OS/390 I believe you can do a D IPLINFO and look for ARCHLEVEL. ARCHLEVEL 1 = 31 bit, ARCHLEVEL 2 = 64 bit. But it's been a very long time since I've been on an OS/390 system.

How to use sysenter for 64-bit userland programs linked against System V libraries?

Is it possible to use sysenter in a 64-bit program on Linux? Or is it impossible to adapt sysenter to the System V calling convention without crashing other dynamically linked libraries? (I know the 32-bit way won't work, but I just want to know whether it can be worked around, as with int 0x80.)
There is very little documentation on using sysenter in 32-bit code, so I couldn't find anything for 64-bit.
I know this is not recommended, but it's the only opcode I can use to trigger a system call as part of a bug bounty exploit, where the program needs to exit through a special function that can only be triggered from normal execution.
It is possible to use it, but it goes through the kernel's 32-bit entry point (check the code for more).
The actual location (and code) of this entry point depends on your kernel version.
For versions 4.2 and newer it is entry_SYSENTER_32.
For versions 4.1 and older it is ia32_sysenter_target.
Finally, SYSRET is not available in userspace (it can only be executed from ring 0). Check the Intel manual's description of the instruction.

MOVDQU instruction + page boundary

I have a simple test program that loads an xmm register with the
movdqu instruction accessing data across a page boundary (OS = Linux).
If the following page is mapped, this works just fine. If it's not
mapped then I get a SIGSEGV, which is probably expected.
However this diminishes the usefulness of the unaligned loads quite
a bit. Additionally SSE4.2 instructions (like pcmpistri) which
allow for unaligned memory references appear to exhibit this behavior
as well.
That's all fine -- except there's many an implementation of strcmp
using pcmpistri that I've found that don't seem to address this issue
at all -- and I've been able to contrive trivial testcases that will
cause these implementations to fail, while the byte-at-a-time trivial
strcmp implementation will work just fine with the same data layout.
One more note -- it appears that the GNU C library implementation for
64-bit Linux has a __strcmp_sse42 variant that appears to use the
pcmpistri instruction in a safer manner. The implementation of
this strcmp is fairly complex, but it appears to be carefully trying
to avoid the page boundary issue. I'm not sure if that's due to the
issue I describe above, or whether it's just a side-effect of trying to
get better performance by aligning the data.
Anyway the question I have is primarily -- where can I find out more
about this issue? I've typed in "movdqu crossing page boundary" and
every variant of that I can think of to Google, but haven't come across
anything particularly useful. If anyone can point me to further info
on this it would be greatly appreciated.
First, any algorithm which tries to access an unmapped address will cause a SegFault. If a non-AVX code flow used a 4 byte load to access the last byte of a page and the first 3 bytes of "the next page" which happened to not be mapped then it would also cause a SegFault. No? I believe that the "issue" is that the AVX(1/2/3) registers are so much bigger than "typical" that algorithms which were unsafe (but got away with it) get caught if they are trivially extended to the larger registers.
Aligned loads (MOVDQA) can never have this problem since they don't cross any boundaries of their own size or greater. Unaligned loads CAN have this problem (as you've noted) and "often" do. The reason for this is that the instruction is defined to load the full size of the target register. You need to look at the operand types in the instruction definitions quite carefully. It doesn't matter how much of the data you are interested in. It matters what the instruction is defined to do.
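The page-crossing condition can be written down in a few lines. This is only a sketch: the 4096-byte page size is an assumption (typical for x86 Linux, but not stated in the thread), and the function name is made up.

```c
#include <stdbool.h>
#include <stdint.h>

enum { ASSUMED_PAGE_SIZE = 4096 }; /* assumption: 4 KiB pages */

/* True when a load of `width` bytes starting at `address` touches the
   following page as well as the one it starts in. */
bool load_crosses_page(uint64_t address, unsigned width) {
    uint64_t offset_in_page = address % ASSUMED_PAGE_SIZE;
    return offset_in_page + width > ASSUMED_PAGE_SIZE;
}
```

A 16-byte load at a 16-byte-aligned address never satisfies this condition, which is the MOVDQA case. A 16-byte load starting in the last 15 bytes of a page always does, and so does the 4-byte load at the last byte of a page from the example above.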
AVX1 (Sandy Bridge) added a "masked move" capability which is slower than a movdqa or movdqu but will not (architecturally) access the unmapped page so long as the mask is not enabled for the portion of the access which would have fallen in that page. This is meant to address the issue. In general, moving forward, it appears that masked portions (see AVX-512) of loads/stores will not cause access violations on IA either.
(It is a bummer about PCMPxSTRx behavior. Perhaps you could add 15 bytes of padding to your "string" objects?)
Facing a similar problem with a library I was writing, I got some information from a very helpful contributor.
The core of the idea is to align the 16-byte reads to the end of the string, then handle the leftover bytes at the beginning. This works because the end of the string must live in an accessible page, and you are guaranteed that the 16-byte truncated starting address must also live in an accessible page.
Since we never read past the string we cannot potentially stray into a protected page.
To handle the initial set of bytes, I chose to use the PCMPxSTRM functions, which return the bitmask of matching bytes. Then it's simply a matter of shifting the result to ignore any mask bits that occur before the true beginning of the string.
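Here is a scalar sketch of the "round the address down, then shift away the mask bits before the true start" part, using strlen as the simplest case. It is not the library code referred to above. A plain loop stands in for the SIMD compare (PCMPxSTRM or similar), the blocks are read forward from the truncated start address rather than aligned to the end of the string, and all names are made up. Each read covers one 16-byte-aligned block, so no single read straddles a page, but the last block is read in full even where it extends past the terminator.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bit i is set when block[i] is NUL. Stand-in for the byte mask that a
   16-byte SIMD compare would produce. */
static unsigned nul_mask16(const unsigned char *block) {
    unsigned mask = 0;
    for (unsigned i = 0; i < 16; i++)
        if (block[i] == 0) mask |= 1u << i;
    return mask;
}

/* Index of the lowest set bit; the caller guarantees mask != 0. */
static unsigned lowest_set_bit(unsigned mask) {
    unsigned index = 0;
    while ((mask & 1u) == 0) { mask >>= 1; index++; }
    return index;
}

/* strlen that only ever reads whole 16-byte-aligned blocks. */
size_t aligned_strlen(const char *s) {
    uintptr_t address = (uintptr_t)s;
    unsigned skip = (unsigned)(address & 15);  /* bytes before the true start */
    const unsigned char *block = (const unsigned char *)(address - skip);

    /* First block: shift out the mask bits that belong to bytes before s. */
    unsigned mask = nul_mask16(block) >> skip;
    if (mask) return lowest_set_bit(mask);

    size_t length = 16 - skip;
    for (;;) {
        block += 16;
        mask = nul_mask16(block);
        if (mask) return length + lowest_set_bit(mask);
        length += 16;
    }
}

/* Demo helper: puts `text` at a chosen misalignment (0..15) inside an
   aligned buffer, with NUL bytes in front of it that would fool an
   unshifted mask. Assumes misalign + strlen(text) + 1 <= 128. */
size_t demo_length(unsigned misalign, const char *text) {
    static _Alignas(16) unsigned char buffer[128];
    memset(buffer, 0xAA, sizeof buffer);
    memset(buffer, 0, misalign);
    memcpy(buffer + misalign, text, strlen(text) + 1);
    return aligned_strlen((const char *)buffer + misalign);
}
```

For example, demo_length(13, "hello world") has to ignore thirteen leading NUL bytes in the first block and still comes out as 11.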

how come an x64 OS can run a code compiled for x86 machine

Basically, what I wonder is how an x86-64 OS can run code compiled for an x86 machine. I know that when the first x64 systems were introduced, none of them had this feature. After that, they somehow managed to do it.
Note that I know the x86 assembly language is a subset of the x86-64 assembly language, and that the ISAs are designed to be backward compatible. What confuses me is the stack calling conventions, which differ a lot between the architectures. For example, on x86, to back up the frame pointer a process pushes it onto the stack (RAM) and pops it when it is done. On x86-64, a process doesn't need to update the frame pointer at all, since all references are made via the stack pointer. Secondly, on x86 function arguments are passed on the stack, while on x86-64 registers are used for that purpose.
Maybe these differences between the calling conventions of x86 and x86-64 don't affect how the program stack grows, as long as the two conventions are not mixed, which is mostly the case because 32-bit functions are called by other 32-bit functions, and likewise for 64-bit. But at some point a function (probably a system function) will call a function compiled for an x86-64 machine with some arguments, and I am curious how the OS (or some other control unit) gets that call to work.
Thanks in advance.
Part of the way that the i386/x86-64 architecture is designed is that the CS and other segment registers refer to entries in the GDT. The GDT entries have a few special bits besides the base and limit that describe the operating mode and privilege level of the current running task.
If the CS register refers to a 32-bit code segment, the processor will run in what is essentially i386 compatibility mode. Likewise 64-bit code requires a 64-bit code segment.
So, putting this all together.
When the OS wants to run a 32-bit task, during the task switch into it, it loads a value into CS which refers to a 32-bit code segment. Interrupt handlers also have segment registers associated with them, so when a system call occurs or an interrupt occurs, the handler will switch back to the OS's 64-bit code segment, (allowing the 64-bit OS code to run correctly) and the OS then can do its work and continue scheduling new tasks.
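To show which "special bits" are involved, here is a small sketch that decodes a raw 8-byte code-segment descriptor. The bit positions are the architectural ones (L at bit 53, D at bit 54, DPL at bits 45-46); the function names are made up, and the descriptor values in the note below are illustrative flat-segment encodings, not values quoted in this thread.

```c
#include <stdint.h>

/* Returns 64, 32 or 16 depending on the mode that code runs in when CS
   refers to this descriptor (only meaningful while long mode is active). */
int code_segment_bits(uint64_t descriptor) {
    int long_mode_bit    = (int)((descriptor >> 53) & 1); /* L: 64-bit code segment */
    int default_size_bit = (int)((descriptor >> 54) & 1); /* D: 32-bit default operand size */
    if (long_mode_bit) return 64;
    return default_size_bit ? 32 : 16;
}

/* Descriptor privilege level: 0 for kernel segments, 3 for user segments. */
int descriptor_privilege_level(uint64_t descriptor) {
    return (int)((descriptor >> 45) & 3);
}
```

For instance, 0x00AF9B000000FFFF decodes as a ring 0, 64-bit code segment, while 0x00CFFB000000FFFF decodes as a ring 3, 32-bit one. A task switch into a 32-bit process amounts to loading CS with a selector for a descriptor of the second kind.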
As a follow-up with regard to calling conventions: neither i386 nor x86-64 requires the use of frame pointers. The code is free to do as it pleases. In fact, many compilers (gcc, clang, VS) offer the ability to compile 32-bit code without frame pointers. What is important is that the calling convention is implemented consistently. If all the code expects arguments to be passed on the stack, that's fine, but the called code had better agree with that. Likewise, passing via registers is fine too; everyone just has to agree (at least at the library interface level; internal functions can generally do as they please).
Beyond that, just keep in mind that the difference between the two isn't really an issue because every process gets its own private view of memory. A side consequence though is that 32-bit apps can't load 64-bit dlls, and 64-bit apps can't load 32-bit dlls, because a process either has a 32-bit code segment or a 64-bit code segment. It can't be both.
The processor is put into compatibility mode (often loosely called legacy mode), but that requires everything executing at that time to be 32-bit code. This switching is handled by the OS.
Windows: It uses WoW64. WoW64 is responsible for changing the processor mode; it also provides the compatible DLLs and registry functions.
Linux: Until recently Linux used to (like Windows) switch the processor to this mode whenever it started executing 32-bit code; you needed all the 32-bit glibc libraries installed, and it would break if it tried to work together with 64-bit code. Now they are implementing the x32 ABI, which should make everything run more smoothly and allow 32-bit applications to access x64 features such as the increased number of registers. See this article on the x32 ABI
PS : I am not very certain on the details of things, but it should give you a start.
Also, this answer combined with Evan Teran's answer probably give a rough picture of everything that is happening.

glibc fnmatch vulnerability: how to expose the vulnerability?

I have to validate a vulnerability on one of our 64-bit systems, which is running glibc-2.9.
The above link gives a script which when passed a magic number apparently leads to arbitrary code execution. But when I tried it on my system, nothing seems to be happening.
Am I doing something wrong? Does the system crash if the vulnerability exists? How do I detect if it's accidental code execution?
If you're running on a 64-bit machine then the original circumstances of the bug don't apply. As you can see in Chris' blog, he's using a 32-bit Ubuntu 9.04 system. The exploit relies on causing the stack pointer to wrap around the 32-bit address space, leading to stack corruption.
I gave it a quick try on a 64-bit system with glibc 2.5, but saw malloc() failures instead of crashes.
$ ./a.out 3000000000
a.out: malloc() failed.
You asked how to identify accidental code execution; with the toy program here, which doesn't carry an exploit / payload, we'd expect to see either a SIGSEGV, SIGILL, or SIGBUS as the CPU tried to "execute" junk parts of the stack, showing up as the respective error message from the shell.
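If you would rather detect this programmatically than read the shell's message, one possible approach (a sketch, POSIX only, with made-up function names) is to run the suspect code in a child process and look at how it terminated:

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs `suspect` in a forked child. Returns the number of the signal
   that killed the child (for example SIGSEGV, SIGILL or SIGBUS),
   0 if the child exited normally, or -1 if fork/waitpid failed. */
int terminating_signal(void (*suspect)(void)) {
    pid_t child = fork();
    if (child < 0) return -1;
    if (child == 0) {
        suspect();
        _exit(0); /* reached only if nothing crashed */
    }
    int status = 0;
    if (waitpid(child, &status, 0) < 0) return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/* Two stand-ins for the code under test, purely for demonstration. */
void crash_with_sigsegv(void) { raise(SIGSEGV); }
void exit_cleanly(void) { }
```

Here terminating_signal(crash_with_sigsegv) reports SIGSEGV and terminating_signal(exit_cleanly) reports 0. In a real check, the function passed in would be the call under suspicion.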
If you were to run into the problem on a 64-bit machine, you'd have to mimic the original code but provide a number that wraps the stack on a 64-bit machine. The original number provided was:
$ bc
So, one way of describing the input number is (ULONG_MAX + 1 - 112) / 4 with a 32-bit unsigned long, that is, (2^32 - 112) / 4.
The analogous number for a 64-bit machine is (2^64 - 112) / 4 = 4611686018427387876:
$ bc
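The same arithmetic can be checked in C with fixed-width unsigned integers, which wrap modulo 2^64 by definition. This is only a sketch of the number-crunching. The constants 4 and 112 come from the formula above, what they stand for is not spelled out here, and the function names are made up.

```c
#include <stdint.h>

/* (2^64 - 112) / 4, written so that it fits in 64 bits:
   UINT64_MAX - 111 is the same value as 2^64 - 112. */
uint64_t wrap_input_64(void) {
    return (UINT64_MAX - 111u) / 4u;
}

/* n * 4 + 112 in 64-bit unsigned arithmetic, i.e. reduced modulo 2^64. */
uint64_t scaled_plus_112(uint64_t n) {
    return n * 4u + 112u;
}
```

wrap_input_64() gives 4611686018427387876, and scaled_plus_112() of that value is 0: four times the number plus 112 is exactly 2^64, which is the wrap-around the input is chosen to hit.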
However, to stand a chance of this working, you'd have to modify the reported code to use strtoull() or something similar; atoi() is normally limited to 32-bit integers and would be no use on the 64-bit numbers above. The code also contains:
num_as = atoi(argv[1]);
if (num_as < 5) {
    errx(1, "Need 5.");
}
p = malloc(num_as);
Where num_as is a size_t and p is a char *. So, you'd have to be able to malloc() a gargantuan amount of space (almost 4 EiB). Most people don't have enough virtual memory on their machines, even with disk space for backing, to do that. Now, maybe, just maybe, Linux would allow you to over-commit (and let the OOM Killer swoop in later), but the malloc() would more likely fail.
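For the strtoull() change mentioned above, a replacement for the atoi() call might look like the following. This is a sketch: parse_count is a made-up helper, not part of the reported code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Parses a decimal count into *out. Returns false for empty input,
   trailing junk, or a value that does not fit in unsigned long long. */
bool parse_count(const char *text, unsigned long long *out) {
    char *end = NULL;
    errno = 0;
    unsigned long long value = strtoull(text, &end, 10);
    if (errno == ERANGE || end == text || *end != '\0')
        return false;
    *out = value;
    return true;
}
```

Unlike atoi(), this accepts 4611686018427387876 on a platform where unsigned long long is 64 bits wide, and it reports overflow instead of silently returning a wrong value.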
There were other relevant factors that affect 32-bit systems in a way that cannot affect 64-bit systems (yet).
If you're going to stand a chance of reproducing it on a 64-bit machine, you probably have to do a 32-bit compilation. Then, if the wind is behind you and you have appropriately old versions of the relevant software perhaps you can reproduce it.
