Is the rbp/ebp (x86-64) register still used in the conventional way? - linux

I have been writing a small kernel lately, based on the x86-64 architecture. While taking care of some user-space code, I realized I am virtually not using rbp. I then looked at some other things and found out that compilers are getting smarter these days and they really don't use rbp anymore. (I could be wrong here.)
I was wondering whether the conventional use of rbp/ebp is no longer required in many instances, or am I wrong here? If that kind of usage is not required, can it be used like a general-purpose register?
Thanks

It is only needed if you have variable-length arrays in your stack frames (the alternative, recording each array's length and recomputing offsets from it, would require more memory and more computation). It is no longer needed for unwinding, because there is now separate unwind metadata (e.g. the .eh_frame data) for that.
It is still useful if you are hand-writing entire assembly functions, but who does that? Assembly should only be used as glue to jump into a C (or whatever) function.
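To make the variable-length-array point concrete, here is a minimal C sketch (the function names are invented for illustration). With only fixed-size locals, current GCC and Clang with optimization enabled typically address everything relative to rsp and leave rbp free as an ordinary callee-saved register; a variable-length array makes the stack pointer move by an amount known only at run time, so the compiler generally brings a frame pointer (or an equivalent saved frame base) back:

```c
#include <stdio.h>

/* fixed_only: every local has a compile-time size, so all of them can be
 * addressed as rsp + constant and rbp is not needed as a frame pointer. */
long fixed_only(int n) {
    long buf[8] = {0};
    buf[n & 7] = n;
    return buf[0] + buf[n & 7];
}

/* with_vla: the array size is only known at run time, so rsp moves by a
 * dynamic amount; compilers typically keep rbp (or a copy of the frame
 * base) here so the fixed-offset locals stay reachable. */
long with_vla(int n) {
    long total = 0;
    long vla[n];                 /* C99 variable-length array */
    for (int i = 0; i < n; i++)
        vla[i] = i;
    for (int i = 0; i < n; i++)
        total += vla[i];
    return total;
}

int main(void) {
    printf("%ld %ld\n", fixed_only(3), with_vla(10));   /* prints: 3 45 */
    return 0;
}
```

Comparing the generated assembly of the two functions (e.g. with gcc -O2 -S) shows the difference directly.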

Related

How do I find out the number of CPU cores using cpuid?

I am interested in physical cores, not logical cores.
I am aware of https://crates.io/crates/num_cpus, but I want to get the number of cores using cpuid. I am mostly interested in a solution that works on Ubuntu, but cross-platform solutions are welcome.
I see mainly two ways for you to do this.
You could use the higher-level library cpuid. With this, it's as simple as cpuid::identify().unwrap().num_cores (of course, please do proper error handling). But since you know about the library num_cpus and still ask this question, I assume you don't want to use an external library.
The second way is to do it all on your own. This is mostly unrelated to Rust, as the main difficulty lies in understanding the CPUID instruction and what it returns. This is explained in this Q&A, for example. It's not trivial, so I won't repeat it here.
The only Rust-specific thing is how to actually execute that instruction in Rust. One way to do it is to use core::arch::x86_64::__cpuid_count. It's an unsafe function that returns the raw result (four registers). After calling it, you have to extract the information you want via bit shifting and masking, as described in the Q&A linked above. Consult core::arch for other architectures or more cpuid-related functions.
But again, doing this manually is not trivial, error-prone, and apparently hard to make work truly cross-CPU. So I would strongly recommend using a library like num_cpus in any real code.
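For reference, the raw mechanism itself is language-agnostic. Below is a minimal C sketch using GCC/Clang's <cpuid.h> helper, just to show the execute-then-mask pattern; it only reads the vendor string from leaf 0, and which leaf and bit field actually encode the core topology is vendor-specific and is exactly the part covered by the Q&A linked above:

```c
#include <stdio.h>
#include <string.h>
#include <cpuid.h>   /* GCC/Clang helper around the CPUID instruction */

int main(void) {
    unsigned int eax, ebx, ecx, edx;

    /* Leaf 0: highest supported leaf in EAX, vendor string in EBX/EDX/ECX. */
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx)) {
        fprintf(stderr, "CPUID not supported\n");
        return 1;
    }

    char vendor[13];
    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    vendor[12] = '\0';
    printf("vendor: %s, max leaf: %u\n", vendor, eax);

    /* Any other leaf works the same way: execute CPUID for the leaf (and
     * sub-leaf) you need, then shift and mask the field you are after.
     * Which leaf describes the core topology differs by vendor. */
    return 0;
}
```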

Does a plain read of a variable that is updated with interlocked functions always return the latest value?

If you only change a MyInt: Integer variable in one or more threads with one of the interlocked functions, let's say InterlockedIncrement, can we guarantee that after the InterlockedIncrement is executed, a plain read of the variable in any thread will return the latest updated value? Yes or no, and why?
If not, is it possible to achieve that in Delphi? Note that I'm talking about only one variable; there is no need to worry about consistency across two or more variables.
The root problem and doubt seem essentially the same as in this SO post, but that one is targeted at C#, and I'm using Delphi 2007, so I have access neither to volatile nor to newer versions of Delphi. In that discussion, two major problems that seem to affect Delphi as well were raised:
The cache of the processor reading the variable may not be updated.
The compiler may optimize the code in a way that causes problems for the read.
If this is really a problem, I'm very worried about using even a simple counter with InterlockedIncrement, or solutions like the lock-free initialization proposed here, and would fall back to plain critical sections or a MultiReaderSingleWriter lock for safety.
Initial analysis
This is what I've found so far, but feel free to address the problems in other ways if appropriate, or even to raise other, unknown problems, so that the objective of the question can be achieved:
For problem 1, I expected that the "full fence" would also force the caches of other processors to be updated... but reading around, it seems that is not the case. It looks like the cache would only be updated if a "read barrier" (or whatever it is called) were issued on the processor that will read the variable. If this is true, is there a way to issue such a "read barrier" in Delphi, just before reading the variable? A full fence seems to imply both read and write barriers, so that would also be OK. Since there is no InterlockedRead function, according to the discussion in the first post, could we (just speculating) work around this using something like InterlockedCompareExchange (ugh... writing the variable just to be able to read it smells bad), or maybe low-level assembly using the lock prefix (which could be encapsulated)?
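For what it's worth, the "write in order to read" idea speculated about above would look roughly like the C sketch below against the same Windows API that Windows.pas wraps (the ReadInterlocked helper name is invented). Exchanging a value with itself never modifies the variable but returns its current value with the same full-barrier semantics as the other Interlocked functions; whether a plain read would already be sufficient is exactly the open question here:

```c
#include <windows.h>
#include <stdio.h>

static volatile LONG Counter = 0;

/* Hypothetical helper: a no-op compare-exchange (comparand == exchange == 0,
 * so the value only changes when it is already 0, and then only to the same
 * 0).  The variable is never modified, but its current value comes back as
 * the return value with interlocked (full-barrier) semantics. */
static LONG ReadInterlocked(volatile LONG *Target)
{
    return InterlockedCompareExchange(Target, 0, 0);
}

int main(void)
{
    InterlockedIncrement(&Counter);
    printf("%ld\n", (long)ReadInterlocked(&Counter));   /* prints: 1 */
    return 0;
}
```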
For problem 2, would Delphi's optimizations have an impact in this matter? Is there any way to avoid that?
Edit: The solution must work in D2007, but I'd prefer not to make a possible future migration to newer Delphi harder, and to use the same piece of code on ARM as well (this became clear to me after David's comments). So, if possible, it would be nice if the solution were not coupled to the x86/x64 memory model. It would be nice if I only needed to replace the plain Windows.pas interlocked functions with whatever provides the same interlocked functionality in newer Delphi/ARM, without the need to review the logic for ARM (one less concern).
But do the interlocked functions have enough abstraction from the CPU architecture in this case? Problem 1 suggests that they don't, but I'm not sure whether it would affect ARM Delphi. Is there any way around it that keeps things simple and still allows significantly better performance than critical sections and similar sync objects?

Removing assembly instructions from ELF

I am working on an obfuscated binary as part of a crackme challenge. It has sequences of push, pop, and nop instructions (which repeat thousands of times). Functionally, these chunks have no effect on the program, but they make generating CFGs, and the process of reversing, very hard.
There are solutions on how to change the instructions to nop so that I can remove them. But in my case, I would like to completely strip those instructions out, so that I can get a better view of the CFG. If the instructions are stripped out, I understand that the memory offsets must be modified too. As far as I could see, there were no tools available to achieve this directly.
I am using the IDA Pro evaluation version. I am open to solutions using other reverse engineering frameworks too; it is preferable if they are scriptable.
I went through a similar question, but the proposed solution is not applicable in my case.
I would like to completely strip off those instructions ... I understand that the memory offsets must be modified too ...
In general, this is practically impossible:
If the binary exports any dynamic symbols, you would have to update the .dynsym (these are probably the offsets you are thinking of).
You would have to find every statically-assigned function pointer, and update it with the new address, but there is no effective way to find such pointers.
Computed GOTOs and switch statements create function pointer tables even when none are present in the program source (see the sketch below).
As Peter Cordes pointed out, it is possible to write programs that take the delta between two assembly labels and use such deltas (small immediate values encoded directly into instructions) to control program flow.
It's possible that your target program is free from all of the above complications, but spending much effort on a technique that only works for that one program seems wasteful.
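To illustrate the computed-GOTO point from the list above, here is a small C sketch using the GCC/Clang "labels as values" extension (the opcode names are invented). The dispatch table holds code addresses that exist only because of this construct; nothing in the symbol table or the relocations marks them, so a tool that deletes instructions has no reliable way to find and adjust them:

```c
#include <stdio.h>

/* Tiny dispatch loop built with the GNU "labels as values" extension.  The
 * table below contains raw code addresses (compilers often store label
 * *deltas* instead); none of them is visible as a function symbol. */
int run(int op)
{
    static void *dispatch[] = { &&op_add, &&op_sub, &&op_done };
    int acc = 10;
    goto *dispatch[op];

op_add:  acc += 1; goto *dispatch[2];
op_sub:  acc -= 1; goto *dispatch[2];
op_done: return acc;
}

int main(void)
{
    printf("%d %d\n", run(0), run(1));   /* prints: 11 9 */
    return 0;
}
```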

Copying garbage collection of generated code

Copying (generational) garbage collection offers the best performance of any form of automatic memory management, but it requires that pointers to relocated chunks of data be fixed up. This is made possible, in languages which support this memory management technique, by disallowing pointer arithmetic and making sure all pointers point to the beginning of identifiable objects.
If you're generating code at run time with a JIT compiler, things look a bit trickier, because return addresses on the call stack will point not to the beginning of code blocks but to locations within them, so fixup is a problem.
How is this typically solved?
Quite often, you don't relocate code. This is both because it is indeed complicated to fix the stack and other addresses (think jumps across code fragments), and because you don't actually need garbage collection for such code (as it is only manipulated by code you write anyway, so you can do manual memory management). You also don't expect to create a whole lot of machine code (compared to application objects), so fragmentation etc. is not a concern.
If you insist on moving machine code and fixing up the stack, there is a way, I think: Similar to Mark-Compact, build a "break table" (I have no idea where this name comes from; "relocation table" might be clearer) that tells you the amount by which you should adjust pointers to moved objects. Now, walk the stack for return addresses (highly platform-specific, of course) and fix them if they refer to relocated code. Instead of looking for exact matches, search for the highest address lower than the return address you're currently replacing. You can check that this address indeed refers to some machine code that moved by looking at the object size (you have a pointer to the start of the object, after all). This approach isn't feasible for all objects, for various reasons.
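A rough C sketch of that lookup, assuming a break table sorted by old start address (all names here are invented for illustration):

```c
#include <stdint.h>
#include <stdio.h>

/* One entry per relocated block of machine code, sorted by old_start. */
typedef struct {
    uintptr_t old_start;   /* where the block used to live  */
    size_t    size;        /* its length in bytes           */
    intptr_t  delta;       /* new_start - old_start         */
} reloc_entry;

/* Fix one return address: binary-search for the entry with the highest
 * old_start not above addr, check that addr really falls inside that moved
 * block (using the recorded size), and only then apply the delta. */
static uintptr_t fix_return_address(uintptr_t addr,
                                    const reloc_entry *table, size_t n)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {                      /* first entry with old_start > addr */
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].old_start <= addr) lo = mid + 1; else hi = mid;
    }
    if (lo == 0)
        return addr;                       /* below every moved block */
    const reloc_entry *e = &table[lo - 1];
    return (addr - e->old_start < e->size) ? addr + e->delta : addr;
}

int main(void)
{
    const reloc_entry table[] = {
        { 0x1000, 0x200, +0x4000 },        /* block moved from 0x1000 to 0x5000 */
        { 0x2000, 0x100, -0x0800 },        /* block moved from 0x2000 to 0x1800 */
    };
    printf("%#lx\n", (unsigned long)fix_return_address(0x1042, table, 2)); /* 0x5042 */
    printf("%#lx\n", (unsigned long)fix_return_address(0x3000, table, 2)); /* 0x3000 */
    return 0;
}
```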
There are other reasons to do something similar, though. Some JIT compilers feature on-stack replacement, which means creating a new version (e.g. more optimized, or less optimized) of some machine code and replacing all occurrences of the old version with it. This is far more complicated than just fixing the return addresses, though: you have to ensure that the new version logically continues where the old one left off. I am not familiar with how this is implemented, so I will not go into detail.

How to modify an ELF file in a way that changes data length of parts of the file?

I'm trying to modify the executable contents of my own ELF files to see if this is possible. I have written a program that reads and parses ELF files, searches for the code that it should update, changes it, then writes it back after updating the sh_size field in the section header.
However, this doesn't work. If I simply exchange some bytes with other bytes, it works. However, if I change the size, it fails. I'm aware that some sh_offsets are immediately adjacent to each other; however, this shouldn't matter when I'm reducing the size of the executable code.
Of course, there might be a bug in my program (or more than one), but I've already painstakingly gone through it.
Instead of asking for help with debugging my program, I'm just wondering: is there anything other than the sh_size field I need to update in order to make this work (when reducing the size)? Is there anything else that would make changing the length fail, other than that field?
Edit:
It seems that Andy Ross was perfectly correct. Even in this very simple program I have come across some indirect addressing in __libc_start_main that I cannot trivially modify to update the offset it will reach.
I am curious, though: what would be the best approach to still get as far as possible with this problem? I know I cannot solve it in every case, but for some simple programs it should be possible to update what is required to make them run. Should I try writing my own virtual machine, or try developing a "debugger" that would replace each suspected problem instruction with INT 3? Any ideas?
The text segment is likely internally linked with relative offsets. So one function might be trying to jump to, say, "current address plus 194 bytes". If you move things around such that the jump target is now only 190 bytes away, you will obviously break things. The same is true of constant data on some architectures (e.g. x86-64 but not i686). There is no simple way, short of a complete disassembly, to know where the internal references are, and in fact it's computationally undecidable to find them all (i.e. trying to figure out all possible jump targets of a runtime-computed branch is equivalent to the halting problem).
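For concreteness, this is roughly how such a relative reference behaves on x86-64 (the addresses and instruction bytes below are invented): a jmp or call rel32 stores only a signed displacement, which the CPU adds to the address of the next instruction, so deleting bytes between the jump and its target silently changes where it lands unless the displacement is rewritten too:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* A rel32 branch encodes "next instruction address + displacement"; the
 * target is recomputed from the instruction's own location every time. */
static uint64_t rel32_target(uint64_t insn_addr, size_t insn_len, int32_t disp)
{
    return insn_addr + insn_len + (uint64_t)(int64_t)disp;
}

int main(void)
{
    /* e9 c2 00 00 00 = jmp rel32 with displacement +194 (0xc2), length 5. */
    uint64_t jmp_at = 0x401000;
    printf("target: %#llx\n",
           (unsigned long long)rel32_target(jmp_at, 5, 0xc2));  /* 0x4010c7 */

    /* If 4 bytes between the jump and its target are deleted, the encoded
     * bytes of the instruction do not change, so it now lands 4 bytes past
     * the real (moved) target unless the displacement is patched as well. */
    return 0;
}
```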
Basically, this isn't solvable in the general case, so if you have an ELF binary from someone else you're trying to patch, you'll need to try other techniques. But with (great!) care it's possible to produce a library where all internal references go through the GOT/PLT which can be sliced up and relinked like this. What are you trying to accomplish?
is there anything else than the sh_size field I need to update in order to make this work
It sounds like you are patching a fully linked binary (ET_EXEC or ET_DYN). Please note that sh_size is not used for anything after the static link is done. You can strip the entire section table, and the binary will continue to work fine. What matters at runtime are the segments in the ELF file, not the sections.
ELF stands for Executable and Linking Format, and "executable" and "linking" reflect the dual nature of the format. Sections are used at (static) link time, where they are combined into segments; the segments are what is used at execution time (aka runtime, aka dynamic linking time).
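As a small illustration of that split, the C sketch below (assuming a 64-bit ELF and only basic error handling) prints the program header table, which is what the kernel and dynamic loader consume at execution time; it never touches the section headers or sh_size, which is why a binary keeps working with its section table stripped:

```c
#include <elf.h>
#include <stdio.h>
#include <string.h>

/* Dump the program headers (segments) of a 64-bit ELF file. */
int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <elf-file>\n", argv[0]); return 1; }

    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    Elf64_Ehdr eh;
    if (fread(&eh, sizeof eh, 1, f) != 1 ||
        memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0) {
        fprintf(stderr, "not an ELF file\n");
        return 1;
    }

    for (int i = 0; i < eh.e_phnum; i++) {
        Elf64_Phdr ph;
        fseek(f, (long)(eh.e_phoff + (Elf64_Off)i * eh.e_phentsize), SEEK_SET);
        if (fread(&ph, sizeof ph, 1, f) != 1) { perror("fread"); return 1; }
        printf("segment %d: type=%u offset=%#lx vaddr=%#lx filesz=%#lx memsz=%#lx\n",
               i, ph.p_type, (unsigned long)ph.p_offset,
               (unsigned long)ph.p_vaddr, (unsigned long)ph.p_filesz,
               (unsigned long)ph.p_memsz);
    }
    fclose(f);
    return 0;
}
```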
Of course, you haven't told us exactly what your patching strategy is when you are shrinking the binary, or in what way the result is broken. It is very likely that the issue described in Andy Ross's answer is the real cause of your breakage.
