Reduce the number of assembly instructions - Linux

I want to reduce (manually) the number of instructions in a Linux assembly file. This will basically be done by searching for predefined reduction patterns in an abstract syntax tree.
For example:
pushl <reg1>
popl <reg1>
Will be deleted, because the pair has no effect.
Or:
pushl <something1>
popl <something2>
Will become:
movl <something1>, <something2>
I'm looking for other optimizations that involve a fixed number of instructions. I don't want to search over dynamic ranges of instructions.
Could you suggest other similar patterns that can be replaced with fewer instructions?
Later Edit: Found out, thanks to Richard Pennington, that what I want is peephole optimization.
So I rephrase the question as: suggestions for peephole optimization on Linux assembly code.
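To make the plan concrete, here is a rough sketch of such a fixed-window pass in C, assuming the instructions have already been parsed into a simple array (the struct and helper names are made up for illustration, not taken from an existing tool):
#include <stdio.h>
#include <string.h>

/* Hypothetical parsed-instruction record: one mnemonic and up to two
 * operands (empty string when an operand is absent). */
typedef struct {
    char mnemonic[16];
    char op1[32];
    char op2[32];
} Insn;

/* Scan a fixed two-instruction window and apply the two reductions above.
 * Returns the new instruction count.  Note: this is only safe when nothing
 * relies on the stack slot or other side effects of the push/pop pair. */
static int peephole(Insn *code, int n)
{
    int out = 0;
    for (int i = 0; i < n; i++) {
        if (i + 1 < n &&
            strcmp(code[i].mnemonic, "pushl") == 0 &&
            strcmp(code[i + 1].mnemonic, "popl") == 0) {
            if (strcmp(code[i].op1, code[i + 1].op1) == 0) {
                i++;                      /* pushl X / popl X: drop both */
                continue;
            }
            Insn m;                       /* pushl A / popl B -> movl A, B */
            strcpy(m.mnemonic, "movl");
            strcpy(m.op1, code[i].op1);
            strcpy(m.op2, code[i + 1].op1);
            code[out++] = m;
            i++;                          /* consume the popl as well */
            continue;
        }
        code[out++] = code[i];            /* no match: copy through unchanged */
    }
    return out;
}

int main(void)
{
    Insn prog[] = {
        { "pushl", "%eax", "" },
        { "popl",  "%ebx", "" },          /* becomes movl %eax, %ebx */
        { "pushl", "%ecx", "" },
        { "popl",  "%ecx", "" },          /* pair is deleted */
    };
    int n = peephole(prog, 4);
    for (int i = 0; i < n; i++)
        printf("%s %s%s%s\n", prog[i].mnemonic, prog[i].op1,
               prog[i].op2[0] ? ", " : "", prog[i].op2);
    return 0;
}
On the example in main(), the pass rewrites pushl %eax / popl %ebx as movl %eax, %ebx and deletes the pushl %ecx / popl %ecx pair.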

Compilers already do such optimizations. Besides, it's not always a straightforward decision to make such optimizations, because:
push reg1
pop reg1
still leaves the value of reg1 at memory location [sp-nn] (where nn = size of reg1 in bytes). So although sp has moved past it, the code that follows can assume [sp-nn] contains the value of reg1.
The same caveat applies to the other optimization as well:
push some1
pop some2
And that sequence is usually emitted only when there is no equivalent movl some1, some2 instruction (for example, a memory-to-memory move).
If you're trying to optimize compiler-generated code, compilers usually already take most of those cases into account. If you're trying to optimize hand-written assembly code, then the assembly programmer should simply write better code in the first place.
I would suggest optimizing the compiler rather than the emitted assembly code; it would give you a better framework for dealing with the intent of the code, register usage, and so on.

To get more information about what you are trying to do, you might want to look for "peephole optimization".

pushl <something1>
popl <something2>
replaced with
mov <something1>, <something2>
actually increased the size of my program. Weird!
Could you provide some other possible peephole optimizations?

Related

RiscV assembler - output is not what I expected for register and immediate operands

I am assembling the following code (with an RV32I assembler), with no errors reported on the command line.
slt x15,x16,x17 # line a
slt x15,x16,22 # line b immediate operand
slti x15,x16,22 # line c
sltu x15,x16,x17 # line d
sltu x15,x16,22 # line e immediate operand
sltiu x15,x16,22 # line f
I notice that the machine code generated for line b is identical to the machine code generated for line c. And I notice the same situation with lines e and f - the machine code from these 2 lines is identical. The machine output for these specific instructions does not meet my expectations. Shouldn't the assembler throw an error or warning that the operands are not technically correct for "slt x15,x16,22" and that the immediate version of the instruction, "slti x15,x16,22", should be used instead? I invoke the assembler with the '-warn' option.
This result appears to defeat the purpose of having 2 different versions of these instructions. A version where all operands are registers and another version that has registers and one immediate operand. What if the intention was to use 'x22' instead of '22'?
As mentioned in a comment, I have moved this issue to GitHub as issue #79 on riscv/riscv-binutils-gdb.
The short answer to my original question is that the assembler has a feature that will convert an instruction like SLTU regX,regY,imm to the immediate version of the instruction - SLTIU regX,regY,imm. I have not seen any documentation that explains this feature.
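As a rough illustration (the names here are hypothetical, not taken from riscv-binutils), the selection the assembler appears to be doing internally amounts to an operand-based lookup like this:
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical: pick the opcode actually encoded, given the written
 * mnemonic and whether the last operand turned out to be an immediate. */
static const char *select_opcode(const char *mnemonic, bool last_is_imm)
{
    /* register/immediate pairs in the RV32I base set */
    static const char *pairs[][2] = {
        { "slt", "slti" }, { "sltu", "sltiu" }, { "add", "addi" },
        { "xor", "xori" }, { "or",  "ori"    }, { "and", "andi" },
        { "sll", "slli" }, { "srl", "srli"   }, { "sra", "srai" },
    };
    if (!last_is_imm)
        return mnemonic;
    for (int i = 0; i < (int)(sizeof pairs / sizeof pairs[0]); i++)
        if (strcmp(mnemonic, pairs[i][0]) == 0)
            return pairs[i][1];      /* silently switch to the I-type form */
    return NULL;                     /* e.g. "sub" has no immediate form */
}

int main(void)
{
    printf("%s\n", select_opcode("sltu", true));   /* sltiu */
    printf("%s\n", select_opcode("add",  true));   /* addi  */
    printf("%s\n", select_opcode("add",  false));  /* add   */
    return 0;
}
In other words, when the last operand parses as an immediate, the I-type form is silently encoded if one exists; only when there is no immediate partner (as with SUB, discussed below) does the assembler report an error.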
By experimenting, I have discovered the following list of instructions that get this treatment.
.text
slt x0,x0,-1 # bug
sltu x0,x0,0 # -> sltiu
add x0,x0,5 # -> addi
xor x0,x0,8 # -> xori
or x0,x0,12 # -> ori
and x0,x0,16 # -> andi
sll x0,x0,6 # -> slli
srl x0,x0,4 # -> srli
sra x0,x0,9 # -> srai
These instructions assemble with no errors or warnings, and I verified the machine code with the list file output below. (The task is simplified by using the x0 register.)
Disassembly of section .text:
0000000000000000 <.text>:
0: fff02013 slt x0,x0,-1
4: 00003013 sltiu x0,x0,0
8: 00500013 addi x0,x0,5
c: 00804013 xori x0,x0,8
10: 00c06013 ori x0,x0,12
14: 01007013 andi x0,x0,16
18: 00601013 slli x0,x0,0x6
1c: 00405013 srli x0,x0,0x4
20: 40905013 srai x0,x0,0x9
The SLT instruction gets the machine code for SLTI, but the list file still shows SLT - I consider this a bug. For the detailed arguments see GitHub issue #79. All the other instructions work as expected.
This approach works only where the base instruction set has register/immediate pairs, like ADD/ADDI or XOR/XORI. But alas, SUB does not have a SUBI instruction in the RiscV ISA. I confirmed this when I received an error trying to assemble SUB with an immediate operand. So if you are a lazy assembly programmer who doesn't want to use the correct operands for a base instruction, you now have to remember that this works fine except for SUB. Or add a SUBI instruction to your custom RiscV ISA.
What follows are some philosophy comments (so you can skip the rest of this answer if your RiscV project is due tomorrow). First, I feel guilty being critical of any open-source project. I am a long-time Linux user and have used many open-source tools, not just for hobby work but for products used by IBM, HP and Dell. I have used maybe 6 assemblers in the past, at various levels of expertise, starting way back with the 8080/8085, and I have taught assembly language/computer architecture at the college level. I have to admit there is a lot of expertise huddled around RiscV - but nonetheless, I do not consider myself a total noob in assemblers.
1) Assemblers should stay close to the base instructions - and therefore there should be very good reasons when they deviate. For things like this feature, where ADD is internally converted to ADDI inside the assembler, I feel the value is very small. IMO there may be some value when reading disassembly of C/C++ output - but I can't put my finger on it. If someone has details on why this approach was taken, please post them.
2) RiscV was touted as a fresh, new, open ISA. However, it is similar to MIPS, and the problem is that MIPS binutils baggage comes along with RiscV. It seems I have run head-on into the "it worked in MIPS so it has to work in RiscV" thinking on GitHub #79.
3) If you don't like the assembly mnemonics - or are too lazy to bother using the correct operands for an instruction - then please consider writing a macro. For example, you can write a macro for the SUB operation to handle immediate arguments. Resist the urge to carry the macro idea into the assembler - especially if it will not be well documented for new users. This feature I have discovered is very similar to a built-in macro in the assembler.
4) Bugs in list files are important - to some people they are critical to the verification task. They should be taken seriously and fixed. I am not certain if the SLT-to-SLTI bug in the list file is the fault of the assembler; it may be a problem in the binutils objdump utility.
5) Pseudoinstructions that are defined in the ISA are like built-in macros. I think they should be used sparingly, since they can add confusion. I write macros for my stack operations like PUSH and POP. I don't mind writing those macros - I don't feel I need many pseudoinstructions in the assembler or in the ISA. People who are familiar with gcc/gnu-style assembler syntax should be able to quickly code up some test code using only base instructions and not have to worry about discovering tricks in the assembler. I stumbled on the SLT trick by accident (a typo).
6) This trick of converting instructions in the RiscV assembler comes at the expense of 'strong typing' the operands. If you make a typo (like I did) - but you intended to use all register operands for the base instruction - you will get the immediate form of the instruction with no warnings posted. So consider this a friendly heads-up. I prefer to invoke the KIS principle in assemblers and lean toward strict enforcement of the correct operands. Or why not offer an assembler option to turn on/off this feature?
7) More and more it seems like assemblers are used mostly for debug and verification and not for general-purpose software development. If you need more abstract code tools, you typically move to C or C++ for embedded cores. Yes, you could go crazy writing many assembly macros, but it is much easier to code in C/C++. You might use some inline assembler to optimize time-critical code, and it certainly helps to disassemble and view compiled C/C++ code. But the C/C++ compilers have improved so much that for many projects this makes assembly optimization obsolete. Assembly is still used for startup code - e.g. if you port the Uboot bootloader to another processor you will probably have to deal with some startup files in assembler. So I think the purpose of assemblers has morphed over time into some startup-file duty, with most of the value in debug and verification. And that is why I think things like list files have to be correct. The list of instructions that have this feature (e.g. converting from ADD to ADDI based on operand type) means that the assembly programmer needs to master only one form of each instruction. But RiscV has a small list of base instructions anyway - as is apparent if you have had any experience with the old CISC processors. In fact, RISC processors by definition should have small instruction sets. So, to my question in my original post - why have the immediate version of the instruction? The answer is: for the instructions I have identified, you don't need them at the assembly-source level. You can write them with either all registers or registers and an immediate value, and the assembler will figure it out. But the HW implementation most definitely needs both versions (register-only operands, and register plus immediate operands) - e.g. the core needs to steer ALU input operands from either the register file output or the immediate value that was extracted from the instruction word.
So, the answer to my original question - "why does this create the exact same machine code?" - is "because that is how the assembler works". But as it stands today, this feature works most of the time.

Obfuscation of checksum guards

As part of my project, I have to insert small pieces of code, called checksum guards, into a C program. What these guards do is calculate the checksum value of a portion of code using a function (add, xor, etc.) which operates on the instruction opcodes. So, if somebody has tampered with the instructions (added, modified, deleted) in that region of code, the checksum value will change and the intrusion will be detected.
Here is the research paper which talks about this technique:
https://www.cerias.purdue.edu/assets/pdf/bibtex_archive/2001-49.pdf
Here is the guard template:
guard:
add ebp, -checksum
mov eax, client_addr
for:
cmp eax, client_end
jg end
mov ebx, dword[eax]
add ebp, ebx
add eax, 4
jmp for
end:
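For reference, here is a C model of what the template above computes (a word-by-word sum over the region from client_addr to client_end inclusive, folded into the ebp value; the function signature is just for illustration):
#include <stdint.h>

/* C model of the guard: sum the protected region as 32-bit words and fold in
 * the precomputed -checksum, the way the template folds it into ebp.  If the
 * region is untouched, the two cancel out and "ebp" comes back unchanged. */
uint32_t guard(uint32_t ebp, const uint32_t *client_addr,
               const uint32_t *client_end, uint32_t checksum)
{
    ebp += -checksum;                        /* add ebp, -checksum */
    for (const uint32_t *p = client_addr; p <= client_end; p++)
        ebp += *p;                           /* mov ebx,[eax]; add ebp,ebx; add eax,4 */
    return ebp;
}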
Now, I have two questions.
Would putting the guards in the assembly be better than putting them in the source program?
Assuming I put them in the assembly (at an appropriate place), what kind of obfuscation should I use to prevent the guard template from being easily visible? (When I have more than one guard, the attacker should not be able to easily find all the guard templates and disable all the guards together, as that would leave the code with no security.)
Thank you in advance.
From the attacker's point of view (without sources) the first question doesn't matter; he's tampering with the final binary machine code, and whether it was produced from .c or .s makes zero difference. So I would worry mainly about how to generate the correct binary with the appropriate checksums. I'm not aware of any way to get a proper checksum inside the C source. But I can imagine having some external tool run over the assembler files created by the C compiler, as a post-processing step, before the .s files are assembled into .o. But... keep in mind that some calls and addresses are just relative offsets, and the binary loaded into memory is patched by the OS loader according to the linker's relocation table, to make those point to real memory addresses. Thus the data bytes will change (the opcodes will stay fixed).
Your guard template doesn't take that into account, and checksums whole instructions, data bytes included (some advanced guards carry opcode definitions and checksum/encrypt/decrypt only the opcode bytes themselves, without the operand bytes).
Otherwise it's a neat touch that the result is a damaged ebp value, ruining any surrounding C code (*) that works with stack variables. But it's still an artificial test: an attacker can simply patch out both add ebp,-checksum and add ebp,ebx, making the guard harmless.
(*) Notice that you have to put the guard in between ordinary C code to get real runtime problems from the invalid ebp value. If you put it at the end of a subroutine that ends with pop ebp, everything would still work fine.
So to the second question:
You definitely want more malicious ways to react to a wrong checksum value than only ebp damage. Usually the hardest way (to remove) is to make the checksum value part of some calculation, eventually skewing results just slightly, so that serious use of the software becomes impossible, but it takes the user time to notice.
You can also piggy-back your checksumming onto some genuine code loop, so that simply skipping the whole loop also skips valid code (but I can imagine this only being added by hand into the assembly generated from C, so you would have to redo it after every new compilation of the particular C source).
Then the particular guard template can be obfuscated by any imaginable mutation (different registers, reordered instructions, alternative instruction variants); try searching for viruses with mutating encoders to get some ideas.
And I didn't read the whole paper, but from the figures I would say the main point is to make the guarded areas overlap, so that patching out one of them affects another, which sounds to me like the extra sugar needed to make this somewhat functional (although it still looks like a normal challenge for 8-bit game crackers ;), not even "hard" level). But that also means you would need either a very cunning external tool to calculate that cyclic tree of dependencies and insert the guard templates in the correct order, or to do it all manually again.
Of course, when doing it manually, you have to redo it after each new C compilation, so it's worth the effort only on something very precious and expensive, or rock-solid stable, where you will not produce another revision for the next 10 years or so... :D

How do I take operands as registers from the byte value?

I have a fairly simple program so far to start off my emulation experience. I load in an instruction and determine how many (if any) operands there are, then I grab those operands and use them. For things like jumps and pushes it's somewhat straightforward, until I get to registers. How do I know when an operand is a register? Or how can I tell if it's the value at an address instead of just an address (e.g. when something like ld (hl),a is used)?
I'm rather new to emulation and all, but I have a decent bit of experience with assembly, even for the z80.
Question
How do I tell the difference between what is meant as a register and what is meant as an address or dereference of an address?
Because you decode the instruction. For example, ld (hl), a is 0x77, or 0b01110111: the leading 01 tells you it's an ld reg8, reg8 and that you have to decode two 3-bit groups, each a reg8. Those are 110 and 111, and you look them up in the reg8 decoding table, where 110 means (hl) and 111 means a. Alternatively you could just make a Giant Switch of Death and directly decode 0x77 to ld (hl), a, but that's more of a difference in implementation than anything deep or significant.
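As an illustration, here is a small C sketch of that bit-field decode for the 01dddsss group (the table and names are generic, not from any particular emulator):
#include <stdio.h>
#include <stdint.h>

/* reg8 decoding table for the z80: index = 3-bit field from the opcode.
 * 110 is special: it means "(hl)", i.e. the byte at the address in HL. */
static const char *reg8[8] = { "b", "c", "d", "e", "h", "l", "(hl)", "a" };

static void decode(uint8_t opcode)
{
    if ((opcode >> 6) == 0x01) {               /* 01 dddsss -> ld reg8, reg8 */
        uint8_t dst = (opcode >> 3) & 0x07;    /* bits 5..3 */
        uint8_t src = opcode & 0x07;           /* bits 2..0 */
        printf("ld %s, %s\n", reg8[dst], reg8[src]);
    } else {
        printf("not an ld reg8, reg8 opcode\n");
    }
}

int main(void)
{
    decode(0x77);   /* 0b01110111 -> ld (hl), a */
    return 0;
}
(Note that 0x76, which this scheme would decode as ld (hl), (hl), is actually halt on the z80, so that one slot needs a special case.)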
The instruction completely specifies what the operands are, so this "how do I tell" question strikes me as a bit silly - the answer is already staring you right in the face when you're decoding the instruction.
See also: decoding z80 opcodes

Why is bounds checking not implemented in some languages?

According to Wikipedia (http://en.wikipedia.org/wiki/Buffer_overflow):
Programming languages commonly associated with buffer overflows include C and C++, which provide no built-in protection against accessing or overwriting data in any part of memory and do not automatically check that data written to an array (the built-in buffer type) is within the boundaries of that array. Bounds checking can prevent buffer overflows.
So why is bounds checking not implemented in languages like C and C++?
Basically, it's because bounds checking means that every time you access an array through an index, you have to do an if statement.
Let's consider a simple C for loop:
int ary[X] = {...}; // Purposefully leaving size and initializer unknown
for (int ix = 0; ix < 23; ix++) {
    printf("ary[%d]=%d\n", ix, ary[ix]);
}
If we have bounds checking, the generated code for the ary[ix] access has to be something like:
LD IX, 0        ; ix = 0
LOOP:
CMP IX, 23      ; while test
JGE END         ; exit the loop when ix reaches 23
CMP IX, X       ; bounds check: compare IX and X
JGE ERROR       ; if IX >= X jump to ERROR
LD R1, IX       ; put the value of IX into register 1
LD R2, ARY+IX   ; put the array value in R2
LA R3, Str42    ; STR42 is the format string
JSR PRINTF      ; now we call the printf routine
INC IX          ; add 1 to ix
J LOOP          ; go back to the top of the loop
;;; somewhere else in the code
ERROR:
HCF             ; halt and catch fire
If we don't have that bounds check, then we can write instead:
LD R1, 0        ; ix lives in R1
LOOP:
CMP R1, 23      ; while test
JGE END         ; exit the loop when ix reaches 23
LD R2, ARY+R1   ; put the array value in R2
LA R3, Str42    ; STR42 is the format string
JSR PRINTF      ; call printf
INC R1          ; add 1 to ix
J LOOP          ; back to the top of the loop
This saves 3-4 instructions in the loop, which (especially in the old days) meant a lot.
In fact, in the PDP-11 machines, it was even better, because there was something called "auto-increment addressing". On a PDP, all of the register stuff etc turned into something like
CZ -(IX), END ; compare IX to zero, then decrement; jump to END if zero
(And anyone who happens to remember the PDP better than I do, don't give me trouble about the precise syntax etc; you're an old fart like me, you know how these things slip away.)
It's all about the performance. However, the assertion that C and C++ have no bounds checking is not entirely correct. It is quite common to have "debug" and "optimized" versions of each library, and it is not uncommon to find bounds-checking enabled in the debugging versions of various libraries.
This has the advantage of quickly and painlessly finding out-of-bounds errors when developing the application, while at the same time eliminating the performance hit when running the program for realz.
I should also add that the performance hit is non-negligible, and many languages other than C++ will provide various high-level functions operating on buffers that are implemented directly in C and C++ specifically to avoid the bounds checking. For example, in Java, if you compare the speed of copying one array into another using pure Java vs. using System.arraycopy (which does bounds checking once, but then straight-up copies the array without bounds-checking each individual element), you will see a decently large difference in the performance of those two operations.
Leaving bounds checking out is easier to implement, and it is faster both to compile and at run time. It also simplifies the language definition (quite a few things can be left out if this is skipped).
Currently, when you do:
int *p = (int*)malloc(sizeof(int));
*p = 50;
C (and C++) just says, "Okey dokey! I'll put something in that spot in memory".
If bounds checking were required, C would have to say, "Ok, first let's see if I can put something there? Has it been allocated? Yes? Good. I'll insert now." By skipping the test to see whether there is something which can be written there, you are saving a very costly step. On the other hand, (she wore a glove), we now live in an era where "optimization is for those who cannot afford RAM," so the arguments about the speed are getting much weaker.
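To make that concrete, here is a hedged sketch of what a bounds-carrying allocation could look like if the checks were done in source, using a hypothetical fat-pointer wrapper (this is not a standard C facility):
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical "fat pointer": the raw pointer plus the number of valid ints. */
typedef struct {
    int    *base;
    size_t  count;
} checked_int_ptr;

static checked_int_ptr checked_alloc(size_t count)
{
    checked_int_ptr p = { malloc(count * sizeof(int)), count };
    if (p.base == NULL)
        abort();                     /* allocation-failure handling reduced to a bail-out */
    return p;
}

/* Every store now pays for a comparison and a possible branch. */
static void checked_store(checked_int_ptr p, size_t index, int value)
{
    if (index >= p.count) {
        fprintf(stderr, "out-of-bounds write at index %zu\n", index);
        abort();
    }
    p.base[index] = value;
}

int main(void)
{
    checked_int_ptr p = checked_alloc(1);
    checked_store(p, 0, 50);         /* fine: index 0 of a 1-element allocation */
    /* checked_store(p, 1, 50);         would abort instead of corrupting memory */
    free(p.base);
    return 0;
}
Every access now pays for a compare and a possible branch, which is exactly the cost the unchecked language avoids.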
The primary reason is the performance overhead of adding bounds checking to C or C++. While this overhead can be reduced substantially with state-of-the-art techniques (to 20-100% overhead, depending upon the application), it is still large enough to make many folks hesitate. I'm not sure whether that reaction is rational -- I sometimes suspect that people focus too much on performance, simply because performance is quantifiable and measurable -- but regardless, it is a fact of life. This fact reduces the incentive for major compilers to put effort into integrating the latest work on bounds checking into their compilers.
A secondary reason involves concerns that bounds checking might break your app. Particularly if you do funky stuff with pointer arithmetic and casting that violates the standard, bounds checking might block something your application is currently doing. Large applications sometimes do amazingly crufty and ugly things. If the compiler breaks the application, then there's no point in blaming the crufty code for the problem; people aren't going to keep using a compiler that breaks their application.
Another major reason is that bounds checking competes with ASLR + DEP. ASLR + DEP are perceived as solving, oh, 80% of the problem or so. That reduces the perceived need for full-fledged bounds checking.
Because it would cripple those general purpose languages for HPC requirements. There are plenty of applications where buffer overflows really do not matter one iota, simply because they do not happen. Such features are much better off in a library (where in fact you can already find examples for C/C++).
For domain specific languages it may make sense to bake such features into the language definition and trade the resulting performance hit for increased security.

How do I go about Power and Square root functions in Assembly(IA32)?

How do I go about power and square root functions in assembly language (with or without the stack) on Linux?
Edit 1: I'm programming for Intel x86.
In x86 assembly there is no instruction for a power operation, but you can build your own routine for calculating a power by expressing the power in terms of logarithms.
The following two instructions calculate logarithms:
FYL2X ; Replace ST(1) with (ST(1) * log2 ST(0)) and pop the register stack.
FYL2XP1 ; Replace ST(1) with (ST(1) * log2(ST(0) + 1.0)) and pop the register stack.
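The identity these are used for is x^y = 2^(y * log2(x)) (for x > 0). A small C sketch of the same approach, using the standard math library rather than hand-written x87 code (compile with -lm):
#include <math.h>
#include <stdio.h>

/* pow() built from the same identity FYL2X is used for:
 * x^y = 2^(y * log2(x)), valid for x > 0.  Special cases
 * (x <= 0, infinities, NaN) are deliberately ignored here. */
static double power(double x, double y)
{
    return exp2(y * log2(x));
}

int main(void)
{
    printf("%f\n", power(2.0, 10.0));   /* 1024.000000 */
    printf("%f\n", power(9.0, 0.5));    /* 3.000000 (a square root, too) */
    return 0;
}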
There are several ways to compute the square root:
(1) You can use the FPU instruction
FSQRT ; Computes square root of ST(0) and stores the result in ST(0).
(2) Alternatively, you can use the following SSE/SSE2 instructions:
SQRTPD xmm1, xmm2/m128 ;Compute Square Roots of Packed Double-Precision Floating-Point Values
SQRTPS xmm1, xmm2/m128 ;Compute Square Roots of Packed Single-Precision Floating-Point Values
SQRTSS xmm1, xmm2/m32 ;Compute Square Root of Scalar Single-Precision Floating-Point Value
SQRTSD xmm1, xmm2/m64 ;Compute Square Root of Scalar Double-Precision Floating-Point Value
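If you would rather stay in C and let the compiler emit those instructions, the corresponding intrinsics exist; a minimal sketch (plain sqrt()/sqrtf() from math.h will usually compile to the same instructions anyway):
#include <stdio.h>
#include <xmmintrin.h>   /* SSE:  _mm_sqrt_ss */
#include <emmintrin.h>   /* SSE2: _mm_sqrt_sd */

int main(void)
{
    /* scalar single precision -> compiles to SQRTSS */
    float f = _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(2.0f)));

    /* scalar double precision -> compiles to SQRTSD */
    __m128d x = _mm_set_sd(2.0);
    double d = _mm_cvtsd_f64(_mm_sqrt_sd(x, x));

    printf("%f %f\n", f, d);   /* both approximately 1.414214 */
    return 0;
}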
Write a simple few-line C program that performs the task you are interested in. Compile that to an object. Disassemble that object. Look at how the compiler prepares to call the math function and how it calls it, and take the disassembled code segments as your starting point for your assembler, then go from there.
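For example, something as small as this is enough to study (a hypothetical power.c; build and inspect with e.g. gcc -O2 -c power.c && objdump -d power.o):
/* power.c - minimal program to compile and disassemble.
 * The volatile keeps the compiler from folding the results at compile time. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    volatile double x = 3.0, y = 4.0;
    printf("%f %f\n", pow(x, y), sqrt(x));
    return 0;
}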
Now, if you are talking about some embedded system with no operating system, the problem is not the operating system but the C/math library. Those libraries, in these or other functions, may rely on operating system calls which won't be available. Ideally, though, it is the same exact mechanism: prepare for the function call by setting up the right registers, make the call to the function, use the results. With embedded, your problems come when you try to link your code with the library and/or when you try to execute it.
If you are asking how to re-create this functionality from discrete instructions, without using a pre-made library, that is a completely different topic, especially if you are using a processor without those instructions. You can learn a little by looking at the source code of the library for those functions, and/or the disassembly of the functions in question, but it is likely not obvious. Look for the book "Hacker's Delight" (or a similar book), which is packed full of things like performing math functions that are not natively supported by the language or processor.
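To give a flavor of that, here is a hedged sketch of a digit-by-digit integer square root, the kind of routine you end up writing when the processor has no sqrt instruction (a generic textbook version, not code taken from the book):
#include <stdint.h>
#include <stdio.h>

/* Digit-by-digit integer square root: returns floor(sqrt(x)) using only
 * shifts, compares and subtractions, so it works on cores with no FPU. */
static uint32_t isqrt(uint32_t x)
{
    uint32_t result = 0;
    uint32_t bit = 1u << 30;          /* highest power of four <= 2^31 */

    while (bit > x)
        bit >>= 2;

    while (bit != 0) {
        if (x >= result + bit) {
            x -= result + bit;
            result = (result >> 1) + bit;
        } else {
            result >>= 1;
        }
        bit >>= 2;
    }
    return result;
}

int main(void)
{
    printf("%u %u %u\n", isqrt(0), isqrt(17), isqrt(1u << 30));  /* 0 4 32768 */
    return 0;
}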

Resources