Trying to understand how jump and branch instructions calculate the target address when the program counter increments by different amounts - riscv

I am currently studying the RISC-V instruction set architecture (ISA) with the I, M, and C extensions. I have understood almost all of the instructions in the I, M, and C extensions, but I have not yet found out how jump and branch instructions work when the program counter advances by two different amounts, and how they calculate the address of the next instruction and the immediate value for the label that we give them.
Note: With the C extension the program counter increments by +2, because C means compressed and its instructions are 16 bits wide, while with the I and M extensions the program counter increments by +4 because their instructions are 32 bits wide.
I have two examples for which I want to understand how jumps and branches calculate the address of the next instruction and the immediate value of the given label. Can anyone explain the formula for calculating the target address when a jump or branch occurs? I am providing two examples of RISC-V assembly. Thanks in advance.
Example 1:
0x0 addi x5,x0,12 #x5 = x0 + 12
0x4 c.addi x6,0 #x6 = x6 + 0
l1:
0x6 c.addi x8,6 #x8 = x8 + 6
0x8 c.jal end # ?
0xA c.li x7,2 #x7 = 2
end:
0xC c.mv x6,x8 #x6 = x8
0xE bne x5,x6,l1 # ?
0x12 c.add x7,x6 # x7 = x7 + x6
0x14 add x8,x5,x7 # x8 = x5 + x7
0x18 c.jal end # ?
Example 2:
0x0 addi x5,x0,12 #x5 = x0 + 12
0x4 c.addi x6,1 #x6 = x6 + 1
l1:
0x6 c.li x7,1 #x7 = 1
0x8 beq x6,x7,end # ?
0xC c.add x7,x6 #x7 = x7 + x6
end:
0xE add x8,x5,x7 #x8 = x5 + x7
0x12 c.jal l1 # ?
0x14 sub x9,x8,x6 #x9 = x8 - x6

bne and beq are 32-bit instructions that allow for a 13-bit immediate byte offset, which only requires 12 bits to store since the low bit is always zero and thus not stored (all instructions are multiples of 2 bytes).
The 13-bit immediate is used in a pc-relative addressing mode, so when the conditional branch is taken, the hardware computes:
pc' := pc + signExtend(immediate12 ## 0)
where pc' is the next pc, and ## represents bitwise concatenation.  When the branch is not taken it computes the usual pc' := pc + 4 which is sequential flow.
Sign extension is done to interpret the immediate as signed, which means the immediate can be negative or positive, to jump backwards or forwards, respectively.
The 12 bits of the 13-bit branch target immediate are stored distributed among several fields throughout the instruction.  These fields are selected for good overlap with other immediates, and to allow the register fields to remain in the same place relative to other instruction formats.
The c.jal instruction encodes a 12-bit immediate, within a 16-bit instruction; The immediate is encoded in 11-bits (again because the low bit is always zero, so no need to represent it in the instruction).  The hardware takes the 11-bit encoded immediate, adds an extra 0 to the end to make it 12 bits, and then sign extends to full width (we could also say it first sign extends to full width, then multiplies by 2 — same result).  The operation is pc' := pc + signExtend(imm11 ## 0) where ## is concatenation.
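To make the formula concrete, here is a minimal Python sketch of the hardware-side computation for both cases (the field widths come from the descriptions above; the helper names are my own):

def sign_extend(value, bits):
    # interpret the low `bits` bits of `value` as a signed number
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)

def branch_target(pc, stored_imm12):
    # bne/beq: 12 stored bits, appending a 0 gives the 13-bit byte offset
    return pc + sign_extend(stored_imm12 << 1, 13)

def cjal_target(pc, stored_imm11):
    # c.jal: 11 stored bits, appending a 0 gives the 12-bit byte offset
    return pc + sign_extend(stored_imm11 << 1, 12)

print(hex(branch_target(0xE, 0xFFC)))  # 0x6  (bne at 0xE back to l1, stored field -4)
print(hex(cjal_target(0x8, 0x2)))      # 0xc  (c.jal at 0x8 forward to end, stored field 2)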
Once we know how the processor computes the branch target pc, pc', we simply reverse the computation when assembling instructions.  Take the difference between the target label (to) and the current pc (from), then divide by 2 and truncate to fit the field width.
If truncation changes the numeric value, then the immediate is too large for the field of the instruction and thus cannot be encoded.
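In other words, an assembler could compute and range-check the stored field roughly like this (a sketch only; 12 and 11 are the stored-field widths of bne/beq and c.jal from above):

def encode_offset(from_pc, to_addr, stored_bits):
    # byte offset between the branch and its target; the low bit is
    # always zero, so it is not stored
    offset = to_addr - from_pc
    assert offset % 2 == 0, "branch targets must be 2-byte aligned"
    imm = offset // 2
    # check the value fits in a signed field of `stored_bits` bits
    if not -(1 << (stored_bits - 1)) <= imm < (1 << (stored_bits - 1)):
        raise ValueError("immediate too large for the instruction field")
    return imm

print(encode_offset(0x8, 0xC, 11), encode_offset(0xE, 0x6, 12))  # 2 -4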
Inlining the encoded immediate field's value into each example:
Example 1:
0x0 addi x5,x0,12 #x5 = x0 + 12
0x4 c.addi x6,0 #x6 = x6 + 0
l1:
0x6 c.addi x8,6 #x8 = x8 + 6
0x8 c.jal end # ? **(to-from)/2=(0xC-0x8)/2=2**
0xA c.li x7,2 #x7 = 2
end:
0xC c.mv x6,x8 #x6 = x8
0xE bne x5,x6,l1 # ? **(0x6-0xE)/2=-4**
0x12 c.add x7,x6 # x7 = x7 + x6
0x14 add x8,x5,x7 # x8 = x5 + x7
0x18 c.jal end # ? **(0xC-0x18)/2=-6**
Example 2:
0x0 addi x5,x0,12 #x5 = x0 + 12
0x4 c.addi x6,1 #x6 = x6 + 1
l1:
0x6 c.li x7,1 #x7 = 1
0x8 beq x6,x7,end # ? **(0xE-0x8)/2=3**
0xC c.add x7,x6 #x7 = x7 + x6
end:
0xE add x8,x5,x7 #x8 = x5 + x7
0x12 c.jal l1 # ? **(0x6-0x12)/2=-6**
0x14 sub x9,x8,x6 #x9 = x8 - x6
Sign extension is used to make a short signed field into a full width value.
An 11-bit encoded immediate, as in c.jal, using -6 and +6 as examples, would look like this in binary:
# Example using -6
v                                    the bit under the v is the MSB
11111111010                          # -6 in 11 bits
11111111111111111111111111111010     # -6 in 32 bits
*********************                copied from the MSB of the 11-bit value
# Example using +6
v                                    the bit under the v is the MSB
00000000110                          # +6 in 11 bits
00000000000000000000000000000110     # +6 in 32 bits
*********************                copied from the MSB of the 11-bit value
The most significant bit (MSB) is the top bit; if it is 1, the number is negative.  In order to preserve the value when widening (e.g. from 11 to 32 bits), propagate the MSB of the shorter field across all the extra bits.
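The same widening can be checked numerically with a few lines of Python (a small sketch working directly on the bit patterns shown above):

def widen(value, from_bits, to_bits):
    # propagate the MSB of the narrow field across the extra width
    msb = (value >> (from_bits - 1)) & 1
    if msb:
        value |= ((1 << (to_bits - from_bits)) - 1) << from_bits
    return value

print(format(widen(0b11111111010, 11, 32), '032b'))  # -6 widened to 32 bits
print(format(widen(0b00000000110, 11, 32), '032b'))  # +6 widened to 32 bits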


Related

Why do cache misses happen more when more data is prefetched on ARM?

I'm using OProfile to profile the following function on a Raspberry Pi 3B+. (I'm using gcc version 10.2 on the Raspberry Pi (not cross-compiling) and the following compiler flags: -O1 -mfpu=neon -mneon-for-64bits. The generated assembly code is included at the end.)
void do_stuff_u32(const uint32_t* a, const uint32_t* b, uint32_t* c, size_t array_size)
{
    for (int i = 0; i < array_size; i++)
    {
        uint32_t tmp1 = b[i];
        uint32_t tmp2 = a[i];
        c[i] = tmp1 * tmp2;
    }
}
I'm looking at two CPU events: L1D_CACHE_REFILL and PREFETCH_LINEFILL. According to the docs, PREFETCH_LINEFILL counts the number of cache line fills due to prefetch, and L1D_CACHE_REFILL counts the number of cache line refills due to cache misses. I got the following results for the above loop:
array_size    array_size / L1D_CACHE_REFILL    array_size / PREFETCH_LINEFILL
16777216      18.24                            8.366
I would imagine the above loop is memory bound, which is somehow confirmed by the value 8.366: every loop instance needs 3 x uint32_t, which is 12B, so 8.366 loop instances need ~100B of data from memory. But the prefetcher can only fill 1 cache line into L1 every 8.366 loop instances, and a cache line is 64B according to the Cortex-A53 manual. So the rest of the cache accesses would contribute to cache misses, which is the 18.24. If you combine these two numbers, you get ~5.7, meaning 1 cache line fill from either prefetch or cache-miss refill every 5.7 loop instances. And 5.7 loop instances need 5.7 x 3 x 4 = 68B, more or less consistent with the cache line size.
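For reference, the arithmetic above can be reproduced with a few lines of Python (the 12 bytes per iteration and the 64-byte line size are the figures quoted above):

bytes_per_iter = 3 * 4                # a[i], b[i], c[i]: 4 bytes each
iters_per_refill = 18.24              # array_size / L1D_CACHE_REFILL
iters_per_prefetch = 8.366            # array_size / PREFETCH_LINEFILL

# iterations per cache line fill, counting refills and prefetches together
iters_per_linefill = 1 / (1 / iters_per_refill + 1 / iters_per_prefetch)
print(iters_per_linefill)                   # ~5.7
print(iters_per_linefill * bytes_per_iter)  # ~68 bytes, close to the 64B line size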
Then I added more stuff to the loop, which then becomes the following:
void do_more_stuff_u32(const uint32_t* a, const uint32_t* b, uint32_t* c, size_t array_size)
{
    for (int i = 0; i < array_size; i++)
    {
        uint32_t tmp1 = b[i];
        uint32_t tmp2 = a[i];
        tmp1 = tmp1 * 17;
        tmp1 = tmp1 + 59;
        tmp1 = tmp1 / 2;
        tmp2 = tmp2 * 27;
        tmp2 = tmp2 + 41;
        tmp2 = tmp2 / 11;
        tmp2 = tmp2 + tmp2;
        c[i] = tmp1 * tmp2;
    }
}
And the profiling data of the cpu events is something I don't understand:
array_size    array_size / L1D_CACHE_REFILL    array_size / PREFETCH_LINEFILL
16777216      11.24                            7.034
Since the loop takes longer to execute, the prefetcher now needs only 7.034 loop instances to fill 1 cache line. But what I don't understand is why cache misses also happen more frequently, reflected by the number 11.24 compared to 18.24 before. Can someone please shed some light on how all of this fits together?
Update to include the generated assembly
Loop1:
cbz x3, .L178
lsl x6, x3, 2
mov x3, 0
.L180:
ldr w4, [x1, x3]
ldr w5, [x0, x3]
mul w4, w4, w5
lsl w4, w4, 1
str w4, [x2, x3]
add x3, x3, 4
cmp x3, x6
bne .L180
.L178:
Loop2:
cbz x3, .L178
lsl x6, x3, 2
mov x5, 0
mov w8, 27
mov w7, 35747
movk w7, 0xba2e, lsl 16
.L180:
ldr w3, [x1, x5]
ldr w4, [x0, x5]
add w3, w3, w3, lsl 4
add w3, w3, 59
mul w4, w4, w8
add w4, w4, 41
lsr w3, w3, 1
umull x4, w4, w7
lsr x4, x4, 35
mul w3, w3, w4
lsl w3, w3, 1
str w3, [x2, x5]
add x5, x5, 4
cmp x5, x6
bne .L180
.L178:
I'll try to answer my own question based on more measurements and discussion with @artlessnoise.
I further measured the READ_ALLOC_ENTER event for the above 2 loops and had the following data:
Loop 1
Array Size    READ_ALLOC_ENTER
16777216      12494
Loop 2
Array Size    READ_ALLOC_ENTER
16777216      1933
So apparently the small loop (the 1st) enters read-allocate mode a lot more than the big one (the 2nd), which could be because the CPU can detect the consecutive write pattern more easily. In read-allocate mode, stores go directly to L2 (if they don't hit in L1). That's why L1D_CACHE_REFILL is lower for the 1st loop: it involves L1 less. For the 2nd loop, since it needs to involve L1 to update c[] more often than the 1st one, there could be more refills due to cache misses. Moreover, in the second case, since L1 is often occupied by more cache lines for c[], the cache hit rates for a[] and b[] suffer, hence more L1D_CACHE_REFILL.

Converting from IEEE-754 to Fixed Point with nearest rounding

I am implementing a converter from IEEE-754 binary32 to S15.16 fixed point in an FPGA. The IEEE-754 standard represents the number as:
x = (-1)**s * 1.m * 2**(exp - 127)
where s represents the sign, exp is the biased exponent and m is the mantissa. Each of these fields is itself a fixed-point value.
Well, the simplest way is to take the IEEE-754 value, multiply it by 2**16, and finally round to nearest to minimize the truncation error.
Problem: I'm doing this in an FPGA, so I can't do it that way.
Solution: use the binary representation of the value to perform the conversion via bitwise operations.
From the previous expression, and given that the exponent and mantissa are already in fixed point, it follows that I can compute the result directly from those fields.
Because multiplying by a power of two is a shift in fixed point, it is possible to rewrite the expression as (in Verilog notation):
x_fixed = ({1'b1, m[22:7]}) << (exp - 126)
OK, this works perfectly, but not all the time... The problem is: how can I apply round-to-nearest? I have performed experiments to see what happens in different ranges, where each range lies between consecutive powers of 2. That is:
For values from 0 < x < 1
For values from 1 <= x < 2
For values from 2 <= x < 4
And so on for the values contained in the following powers of two... When the values lie between 1 and 2, I have been able to round without problems by looking at the 2 bits that are discarded from the mantissa. These bits show that:
if 00: rounding is not necessary
if 01 or 10: add one to the shifted mantissa
if 11: add two to the shifted mantissa.
To perform the experiments I have implemented a minimal solution in Python using bitwise operations. The code is:
import struct

# Get the bits of sign, exponent and mantissa
def FLOAT_2_BIN(num):
    bits, = struct.unpack('!I', struct.pack('!f', num))
    N = "{:032b}".format(bits)
    a = N[0]          # sign
    b = N[1:9]        # exponent
    c = "1" + N[9:]   # mantissa with hidden bit
    return {'sign': a, 'exp': b, 'mantissa': c}

# Convert the floating point value to fixed point via
# bitwise operations
def FLOAT_2_FIXED(x):
    # Get the IEEE-754 bit representation
    IEEE754 = FLOAT_2_BIN(x)
    # Biased exponent minus 126 gives the shift amount
    shift = int(IEEE754['exp'], 2) - 126
    # Get 16 MSB from mantissa
    MSB_mnts = IEEE754['mantissa'][0:16]
    # Convert value from binary to int
    value = int(MSB_mnts, 2)
    # Get the rounding bits: similar to guard bits???
    rnd_bits = IEEE754['mantissa'][16:18]
    # Shift value by exponent
    value_shift = value << shift
    # Control for rounding to nearest
    # Only works with values from 1 to 2
    if rnd_bits == '00':
        rnd = 0
    elif rnd_bits == '01' or rnd_bits == '10':
        rnd = 1
    else:
        rnd = 2
    return value_shift + rnd
The test with values between 1 and 2 gives the following results:
Test for values from 1 <= x < 2
FLOAT 32 VALUE 16 MSB MANTISSA THEORICAL FIXED PRACTICAL FIXED RND BITS DIFS 4 LSB MANTISSA
---------------- ----------------- ----------------- ----------------- ---------- ------ ----------------
1 1000000000000000 65536 65536 00 0 0000
1.1 1000110011001100 72090 72090 11 0 1101
1.2 1001100110011001 78643 78643 10 0 1010
1.3 1010011001100110 85197 85197 01 0 0110
1.4 1011001100110011 91750 91750 00 0 0011
1.5 1100000000000000 98304 98304 00 0 0000
1.6 1100110011001100 104858 104858 11 0 1101
1.7 1101100110011001 111411 111411 10 0 1010
1.8 1110011001100110 117965 117965 01 0 0110
1.9 1111001100110011 124518 124518 00 0 0011
Obviously, if I take values whose fractional part is an exact multiple of a power of two, there is no need for rounding:
In this case the values have an increment of 1/32
FLOAT 32 VALUE 16 MSB MANTISSA THEORICAL FIXED PRACTICAL FIXED RND BITS DIFS 4 LSB MANTISSA
---------------- ----------------- ----------------- ----------------- ---------- ------ ----------------
10 1010000000000000 655360 655360 00 0 0000
10.0312 1010000010000000 657408 657408 00 0 0000
10.0625 1010000100000000 659456 659456 00 0 0000
10.0938 1010000110000000 661504 661504 00 0 0000
10.125 1010001000000000 663552 663552 00 0 0000
10.1562 1010001010000000 665600 665600 00 0 0000
10.1875 1010001100000000 667648 667648 00 0 0000
10.2188 1010001110000000 669696 669696 00 0 0000
10.25 1010010000000000 671744 671744 00 0 0000
10.2812 1010010010000000 673792 673792 00 0 0000
10.3125 1010010100000000 675840 675840 00 0 0000
10.3438 1010010110000000 677888 677888 00 0 0000
10.375 1010011000000000 679936 679936 00 0 0000
10.4062 1010011010000000 681984 681984 00 0 0000
10.4375 1010011100000000 684032 684032 00 0 0000
10.4688 1010011110000000 686080 686080 00 0 0000
10.5 1010100000000000 688128 688128 00 0 0000
10.5312 1010100010000000 690176 690176 00 0 0000
10.5625 1010100100000000 692224 692224 00 0 0000
10.5938 1010100110000000 694272 694272 00 0 0000
10.625 1010101000000000 696320 696320 00 0 0000
10.6562 1010101010000000 698368 698368 00 0 0000
10.6875 1010101100000000 700416 700416 00 0 0000
10.7188 1010101110000000 702464 702464 00 0 0000
10.75 1010110000000000 704512 704512 00 0 0000
10.7812 1010110010000000 706560 706560 00 0 0000
10.8125 1010110100000000 708608 708608 00 0 0000
10.8438 1010110110000000 710656 710656 00 0 0000
10.875 1010111000000000 712704 712704 00 0 0000
10.9062 1010111010000000 714752 714752 00 0 0000
10.9375 1010111100000000 716800 716800 00 0 0000
10.9688 1010111110000000 718848 718848 00 0 0000
But if 2 <= x < 4 and the increment is not a multiple of a power of two:
Test for values from 2 <= x < 4. Increment is 0.1
Here, I am not applying the rounding, in order to show how the rounding error
increases with the exponent, e.g. 2**shift - 1, where shift is exponent - 126
FLOAT 32 VALUE 16 MSB MANTISSA THEORICAL FIXED PRACTICAL FIXED RND BITS DIFS 4 LSB MANTISSA
---------------- ----------------- ----------------- ----------------- ---------- ------ ----------------
2 1000000000000000 131072 131072 00 0 0000
2.1 1000011001100110 137626 137624 01 -2 0110
2.2 1000110011001100 144179 144176 11 -3 1101
2.3 1001001100110011 150733 150732 00 -1 0011
2.4 1001100110011001 157286 157284 10 -2 1010
2.5 1010000000000000 163840 163840 00 0 0000
2.6 1010011001100110 170394 170392 01 -2 0110
2.7 1010110011001100 176947 176944 11 -3 1101
2.8 1011001100110011 183501 183500 00 -1 0011
2.9 1011100110011001 190054 190052 10 -2 1010
3 1100000000000000 196608 196608 00 0 0000
3.1 1100011001100110 203162 203160 01 -2 0110
3.2 1100110011001100 209715 209712 11 -3 1101
3.3 1101001100110011 216269 216268 00 -1 0011
3.4 1101100110011001 222822 222820 10 -2 1010
3.5 1110000000000000 229376 229376 00 0 0000
3.6 1110011001100110 235930 235928 01 -2 0110
3.7 1110110011001100 242483 242480 11 -3 1101
3.8 1111001100110011 249037 249036 00 -1 0011
3.9 1111100110011001 255590 255588 10 -2 1010
It is clear that the rounding is not correct, and I have also noticed that the maximum rounding error in fixed point is always 2**shift - 1.
Any idea or suggestion? I have thought that the problem is that I'm not taking the guard bits (G, R, S) into account, but on the other hand, if that really is the problem: what happens when the required rounding increment is larger than one, e.g. 2, 3, 4...?
The ISO-C99 code below demonstrates one possible way of doing the conversion. The significand (mantissa) bits of the binary32 argument form the bits of the s15.16 result. The exponent bits tell us whether we need to shift these bits right or left to move the least significant integer bit to bit 16. If a left shift is required, rounding is not needed. If a right shift is required, we need to capture any less significant bits that are discarded. The most significant discarded bit is the round bit; all others collectively represent the sticky bit. Using the literal definition of the rounding mode, we need to round up if (1) both the round bit and the sticky bit are set, or (2) the round bit is set and the sticky bit is clear (i.e., we have a tie case) but the least significant bit of the intermediate result is odd.
Note that real hardware implementations often deviate from such a literal application of the rounding-mode logic. One common scheme is to first increment the result when the round bit is set. Then, if such an increment occurred, clear the least significant bit of the result if the sticky bit is not set. It is easy to see that this achieves the same effect by enumerating all possible combinations of round bit, sticky bit, and result LSB.
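As a quick sanity check (not part of the answer's code, just a brute-force sketch in Python), the equivalence of the two schemes can be verified by enumerating every combination of round bit, sticky bit and result LSB:

# Check that the literal RNE rule and the increment-then-clear scheme agree.
for round_bit in (0, 1):
    for sticky in (0, 1):
        for result in range(4):             # low bits of the truncated result
            odd = result & 1
            # literal definition: round up if round & (sticky | odd)
            literal = result + (1 if (round_bit and (sticky or odd)) else 0)
            # alternate scheme: increment on round bit, then clear LSB if no sticky
            alt = result + round_bit
            if round_bit and not sticky:
                alt &= ~1
            assert literal == alt, (round_bit, sticky, result)
print("schemes agree")

The full ISO-C99 program follows.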
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

#define USE_LITERAL_RND_DEF (1)

uint32_t float_as_uint32 (float a)
{
    uint32_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

#define FP32_MANT_FRAC_BITS (23)
#define FP32_EXPO_BITS      (8)
#define FP32_EXPO_MASK      ((1u << FP32_EXPO_BITS) - 1)
#define FP32_MANT_MASK      ((1u << FP32_MANT_FRAC_BITS) - 1)
#define FP32_MANT_INT_BIT   (1u << FP32_MANT_FRAC_BITS)
#define FP32_SIGN_BIT       (1u << (FP32_MANT_FRAC_BITS + FP32_EXPO_BITS))
#define FP32_EXPO_BIAS      (127)
#define FX15P16_FRAC_BITS   (16)
#define FRAC_BITS_DIFF      (FP32_MANT_FRAC_BITS - FX15P16_FRAC_BITS)

int32_t fp32_to_fixed (float a)
{
    /* split binary32 operand into constituent parts */
    uint32_t ia = float_as_uint32 (a);
    uint32_t expo = (ia >> FP32_MANT_FRAC_BITS) & FP32_EXPO_MASK;
    uint32_t mant = expo ? ((ia & FP32_MANT_MASK) | FP32_MANT_INT_BIT) : 0;
    int32_t sign = ia & FP32_SIGN_BIT;
    /* compute and clamp shift count */
    int32_t shift = (expo - FP32_EXPO_BIAS) - FRAC_BITS_DIFF;
    shift = (shift < (-31)) ? (-31) : shift;
    shift = (shift > ( 31)) ? ( 31) : shift;
    /* shift left or right so least significant integer bit becomes bit 16 */
    uint32_t shifted_right = mant >> (-shift);
    uint32_t shifted_left = mant << shift;
    /* capture discarded bits if right shift */
    uint32_t discard = mant << (32 + shift);
    /* round to nearest or even if right shift */
    uint32_t round = (discard & 0x80000000) ? 1 : 0;
    uint32_t sticky = (discard & 0x7fffffff) ? 1 : 0;
#if USE_LITERAL_RND_DEF
    uint32_t odd = shifted_right & 1;
    shifted_right = (round & (sticky | odd)) ? (shifted_right + 1) : shifted_right;
#else // USE_LITERAL_RND_DEF
    shifted_right = (round) ? (shifted_right + 1) : shifted_right;
    shifted_right = (round & ~sticky) ? (shifted_right & ~1) : shifted_right;
#endif // USE_LITERAL_RND_DEF
    /* make final selection between left shifted and right shifted */
    int32_t res = (shift < 0) ? shifted_right : shifted_left;
    /* negate if negative */
    return (sign < 0) ? (-res) : res;
}

int main (void)
{
    int32_t res, ref;
    float x;
    printf ("IEEE-754 binary32 to S15.16 fixed-point conversion in RNE mode\n");
    printf ("use %s implementation of round to nearest or even\n",
            USE_LITERAL_RND_DEF ? "literal" : "alternate");
    /* test positive half-plane */
    x = 0.0f;
    while (x < 0x1.0p15f) {
        ref = (int32_t) rint ((double)x * 65536);
        res = fp32_to_fixed(x);
        if (res != ref) {
            printf ("error @ x = % 14.6a: res=%08x ref=%08x\n", x, res, ref);
            printf ("Test FAILED\n");
            return EXIT_FAILURE;
        }
        x = nextafterf (x, INFINITY);
    }
    /* test negative half-plane */
    x = -1.0f * 0.0f;
    while (x >= -0x1.0p15f) {
        ref = (int32_t) rint ((double)x * 65536);
        res = fp32_to_fixed(x);
        if (res != ref) {
            printf ("error @ x = % 14.6a: res=%08x ref=%08x\n", x, res, ref);
            printf ("Test FAILED\n");
            return EXIT_FAILURE;
        }
        x = nextafterf (x, -INFINITY);
    }
    printf ("Test PASSED\n");
    return EXIT_SUCCESS;
}

RISC V LD error - (.text+0xc4): relocation truncated to fit: R_RISCV_JAL against `*UND*'

Does anybody have a clue why I get the error below?
/tmp/cceP5axg.o: in function `.L0 ':
(.text+0xc4): relocation truncated to fit: R_RISCV_JAL against `*UND*'
collect2: error: ld returned 1 exit status
The R_RISCV_JAL relocation can represent an even signed 21-bit offset (-1MiB to +1MiB-2). If your symbol is further away than this limit, then you get this error.
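A quick way to check whether a given jump distance is even representable (a rough Python sketch; the bounds follow from the even, signed 21-bit offset described above):

def fits_riscv_jal(from_addr, to_addr):
    # JAL offsets are even and fit in a signed 21-bit field
    offset = to_addr - from_addr
    return offset % 2 == 0 and -(1 << 20) <= offset <= (1 << 20) - 2

print(fits_riscv_jal(0x80000000, 0x800FFFFE))  # True: within +/- 1 MiB
print(fits_riscv_jal(0x80000000, 0x80200000))  # False: too far for R_RISCV_JAL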
This error can also happen as an odd result of branch instructions that use hard-coded offsets. I was getting the exact same error on a program that was far smaller than 2 MiB. It turned out to be because I had several instructions that looked like bne rd, rs, offset, but the offset was a number literal like 0x8.
The solution was to remove the literal offset and replace it with a label from the code so it looks like
bne x7, x9, branch_to_here
[code to skip]
branch_to_here:
more code ...
instead of
bne x7, x9, 0x8
[code to skip]
more code ...
When I did that to every branch instruction, the error went away. Sorry to answer this 10 months late, but I hope it helps you, anonymous reader.
Since I've searched many resources to solve this issue, I think my attempt may help others.
There are 2 reasons that may trigger this issue:
The target address is odd:
bne ra, ra, <odd offset>
The target address is a specific value known at compile time (not resolved at link time):
bne ra, ra, 0x80003000
My attempt to solve:
label:
addi x0, x0, 0x0
addi x0, x0, 0x0
bne ra, ra, label + 6 // Jump to an address that relates to a label
// This can generate Instruction Address Misaligned exception
sub_label:
addi x0, x0, 0x0
beq ra, ra, sub_label // Jump to a label directly
addi x0, x0, 0x0
nop

Why does a fully static Rust ELF binary have a Global Offset Table (GOT) section?

This code, when compiled for the x86_64-unknown-linux-musl target, produces a .got section:
fn main() {
    println!("Hello, world!");
}
$ cargo build --release --target x86_64-unknown-linux-musl
$ readelf -S hello
There are 30 section headers, starting at offset 0x26dc08:
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
...
[12] .got PROGBITS 0000000000637b58 00037b58
00000000000004a8 0000000000000008 WA 0 0 8
...
According to this answer for analogous C code, the .got section is an artifact that can be safely removed. However, it segfaults for me:
$ objcopy -R.got hello hello_no_got
$ ./hello_no_got
[1] 3131 segmentation fault (core dumped) ./hello_no_got
Looking at the disassembly, I see that the GOT basically holds static function addresses:
$ objdump -d hello -M intel
...
0000000000400340 <_ZN5hello4main17h5d434a6e08b2e3b8E>:
...
40037c: ff 15 26 7a 23 00 call QWORD PTR [rip+0x237a26] # 637da8 <_GLOBAL_OFFSET_TABLE_+0x250>
...
$ objdump -s -j .got hello | grep 637da8
637da8 50434000 00000000 b0854000 00000000 PC#.......#.....
$ objdump -d hello -M intel | grep 404350
0000000000404350 <_ZN3std2io5stdio6_print17h522bda9f206d7fddE>:
404350: 41 57 push r15
The number 404350 comes from 50434000 00000000, which is the little-endian encoding of 0x0000000000404350 (this was not obvious; I had to run the binary under GDB to figure this out!)
This is perplexing, since Wikipedia says that
[GOT] is used by executed programs to find during runtime addresses of global variables, unknown in compile time. The global offset table is updated in process bootstrap by the dynamic linker.
Why is the GOT present? From the disassembly, it looks like the compiler knows all the needed addresses. As far as I know, there is no bootstrap done by the dynamic linker: there are neither INTERP nor DYNAMIC program headers present in my binary.
Why does the GOT store function pointers? Wikipedia says the GOT is only for global variables, and function pointers should be contained in the PLT.
TL;DR summary: the GOT is really a rudimentary build artifact, which I was able to get rid of via simple machine code manipulations.
Breakdown
If we look at
$ objdump -dj .text hello
and search for GLOBAL, we see only four distinct types of references to the GOT (constants differ):
40037c: ff 15 26 7a 23 00 call QWORD PTR [rip+0x237a26] # 637da8 <_GLOBAL_OFFSET_TABLE_+0x250>
425903: ff 25 5f 26 21 00 jmp QWORD PTR [rip+0x21265f] # 637f68 <_GLOBAL_OFFSET_TABLE_+0x410>
41d8b5: 48 3b 1d b4 a5 21 00 cmp rbx,QWORD PTR [rip+0x21a5b4] # 637e70 <_GLOBAL_OFFSET_TABLE_+0x318>
40b259: 48 83 3d 7f cb 22 00 cmp QWORD PTR [rip+0x22cb7f],0x0 # 637de0 <_GLOBAL_OFFSET_TABLE_+0x288>
40b260: 00
All of these instructions only read the GOT, which means that the GOT is not modified at runtime.  This in turn means that we can statically resolve the addresses that the GOT refers to!  Let's consider the reference types one by one:
call QWORD PTR [rip+0x2126be] simply says "go to address [rip+0x2126be], take 8 bytes from there, interpret them as a function address and call the function". We can simply replace this instruction with a direct call:
40037c: e8 cf 3f 00 00 call 404350 <_ZN3std2io5stdio6_print17h522bda9f206d7fddE>
400381: 90 nop
Notice the nop at the end: we need to replace all 6 bytes of the machine code that constitute the first instruction, but the instruction we replace it with is only 5 bytes, so we need to pad it. Fundamentally, as we are patching a compiled binary, we can replace an instruction with another one only if it is not longer.
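For illustration, the patched bytes can be reproduced with a short Python sketch (addresses taken from the listing above; the rel32 of a 5-byte e8 call is relative to the end of that call):

call_site = 0x40037c   # address of the original 6-byte indirect call
target = 0x404350      # function address resolved from the GOT entry
rel32 = (target - (call_site + 5)) & 0xffffffff
patch = bytes([0xe8]) + rel32.to_bytes(4, "little") + bytes([0x90])  # call rel32; nop
print(patch.hex())     # e8cf3f000090, matching the patched instruction above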
jmp QWORD PTR [rip+0x21265f] is the same as the previous one, but instead of calling an address it jumps to it. This turns into:
425903: e9 b8 f7 ff ff jmp 4250c0 <_ZN68_$LT$core..fmt..builders..PadAdapter$u20$as$u20$core..fmt..Write$GT$9write_str17hc384e51187942069E>
425908: 90 nop
cmp rbx,QWORD PTR [rip+0x21a5b4] - this takes 8 bytes from [rip+0x21a5b4] and compares them to the contents of the rbx register. This one is tricky, since cmp cannot compare register contents to a 64-bit immediate value. We could use another register for that, but we don't know which of the registers are in use around this instruction. A careful solution would be something like
push rax
mov rax,0x0000006363c0
cmp rbx,rax
pop rax
But that would be way beyond our limit of 7 bytes. The real solution stems from an observation that the GOT contains only addresses; our address space is (roughly) contained in range [0x400000; 0x650000], which can be seen in the program headers:
$ readelf -l hello
...
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0x0000000000400000 0x0000000000400000
0x0000000000035b50 0x0000000000035b50 R E 0x200000
LOAD 0x0000000000036380 0x0000000000636380 0x0000000000636380
0x0000000000001dd0 0x0000000000003918 RW 0x200000
...
It follows that we can (mostly) get away with only comparing 4 bytes of a GOT entry instead of 8. So the substitution is:
41d8b5: 81 fb c0 63 63 00 cmp ebx,0x6363c0
41d8bb: 90 nop
The last one consists of two lines of objdump output, since 8 bytes do not fit in one line:
40b259: 48 83 3d 7f cb 22 00 cmp QWORD PTR [rip+0x22cb7f],0x0 # 637de0 <_GLOBAL_OFFSET_TABLE_+0x288>
40b260: 00
It just compares 8 bytes of the GOT to a constant (in this case, 0x0). In fact, we can do the comparison statically; if the operands compare equal, we replace the comparison with
40b259: 48 39 c0 cmp rax,rax
40b25c: 90 nop
40b25d: 90 nop
40b25e: 90 nop
40b25f: 90 nop
40b260: 90 nop
Obviously, a register is always equal to itself. A lot of padding needed here!
If the left operand is greater than the right one, we replace the comparison with
40b259: 48 83 fc 00 cmp rsp,0x0
40b25d: 90 nop
40b25e: 90 nop
40b25f: 90 nop
40b260: 90 nop
In practice, rsp is always greater than zero.
If the left operand is smaller than the right one, things get a bit more complicated, but since we have a whole lot of bytes (8!) we can manage:
40b259: 50 push rax
40b25a: 31 c0 xor eax,eax
40b25c: 83 f8 01 cmp eax,0x1
40b25f: 58 pop rax
40b260: 90 nop
Notice that the second and the third instructions use eax instead of rax, since cmp and xor involving eax take one less byte than with rax.
Testing
I have written a Python script to do all these substitutions automatically (it's a bit hacky and relies on parsing of objdump output though):
#!/usr/bin/env python3
import re
import sys
import argparse
import subprocess

def read_u64(binary):
    return sum(binary[i] * 256 ** i for i in range(8))

def distance_u32(start, end):
    assert abs(end - start) < 2 ** 31
    diff = end - start
    if diff < 0:
        return 2 ** 32 + diff
    else:
        return diff

def to_u32(x):
    assert 0 <= x < 2 ** 32
    return bytes((x // (256 ** i)) % 256 for i in range(4))

class GotInstruction:
    def __init__(self, lines, symbol_address, symbol_offset):
        self.address = int(lines[0].split(":")[0].strip(), 16)
        self.offset = symbol_offset + (self.address - symbol_address)
        self.got_offset = int(lines[0].split("(File Offset: ")[1].strip().strip(")"), 16)
        self.got_offset = self.got_offset % 0x200000  # No idea why the offset is actually wrong
        self.bytes = []
        for line in lines:
            self.bytes += [int(x, 16) for x in line.split("\t")[1].split()]

class TextDump:
    symbol_regex = re.compile(r"^([0-9,a-f]{16}) <(.*)> \(File Offset: 0x([0-9,a-f]*)\):")

    def __init__(self, binary_path):
        self.got_instructions = []
        objdump_output = subprocess.check_output(["objdump", "-Fdj", ".text", "-M", "intel",
                                                  binary_path])
        lines = objdump_output.decode("utf-8").split("\n")
        current_symbol_address = 0
        current_symbol_offset = 0
        for line_group in self.group_lines(lines):
            match = self.symbol_regex.match(line_group[0])
            if match is not None:
                current_symbol_address = int(match.group(1), 16)
                current_symbol_offset = int(match.group(3), 16)
            elif "_GLOBAL_OFFSET_TABLE_" in line_group[0]:
                instruction = GotInstruction(line_group, current_symbol_address,
                                             current_symbol_offset)
                self.got_instructions.append(instruction)

    @staticmethod
    def group_lines(lines):
        if not lines:
            return
        line_group = [lines[0]]
        for line in lines[1:]:
            if line.count("\t") == 1:  # this line continues the previous one
                line_group.append(line)
            else:
                yield line_group
                line_group = [line]
        yield line_group

    def __iter__(self):
        return iter(self.got_instructions)

def read_binary_file(path):
    try:
        with open(path, "rb") as f:
            return f.read()
    except (IOError, OSError) as exc:
        print(f"Failed to open {path}: {exc.strerror}")
        sys.exit(1)

def write_binary_file(path, content):
    try:
        with open(path, "wb") as f:
            f.write(content)
    except (IOError, OSError) as exc:
        print(f"Failed to open {path}: {exc.strerror}")
        sys.exit(1)

def patch_got_reference(instruction, binary_content):
    got_data = read_u64(binary_content[instruction.got_offset:])
    code = instruction.bytes
    if code[0] == 0xff:
        assert len(code) == 6
        relative_address = distance_u32(instruction.address, got_data)
        if code[1] == 0x15:  # call QWORD PTR [rip+...]
            patch = b"\xe8" + to_u32(relative_address - 5) + b"\x90"
        elif code[1] == 0x25:  # jmp QWORD PTR [rip+...]
            patch = b"\xe9" + to_u32(relative_address - 5) + b"\x90"
        else:
            raise ValueError(f"unknown machine code: {code}")
    elif code[:3] == [0x48, 0x83, 0x3d]:  # cmp QWORD PTR [rip+...],<BYTE>
        assert len(code) == 8
        if got_data == code[7]:
            patch = b"\x48\x39\xc0" + b"\x90" * 5    # cmp rax,rax
        elif got_data > code[7]:
            patch = b"\x48\x83\xfc\x00" + b"\x90" * 3  # cmp rsp,0x0
        else:
            patch = b"\x50\x31\xc0\x83\xf8\x01\x90"  # push rax
                                                     # xor eax,eax
                                                     # cmp eax,0x1
                                                     # pop rax
    elif code[:3] == [0x48, 0x3b, 0x1d]:  # cmp rbx,QWORD PTR [rip+...]
        assert len(code) == 7
        patch = b"\x81\xfb" + to_u32(got_data) + b"\x90"  # cmp ebx,<DWORD>
    else:
        raise ValueError(f"unknown machine code: {code}")
    return dict(offset=instruction.offset, data=patch)

def make_got_patches(binary_path, binary_content):
    patches = []
    text_dump = TextDump(binary_path)
    for instruction in text_dump.got_instructions:
        patches.append(patch_got_reference(instruction, binary_content))
    return patches

def apply_patches(binary_content, patches):
    for patch in patches:
        offset = patch["offset"]
        data = patch["data"]
        binary_content = binary_content[:offset] + data + binary_content[offset + len(data):]
    return binary_content

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("binary_path", help="Path to ELF binary")
    parser.add_argument("-o", "--output", help="Output file path", required=True)
    args = parser.parse_args()
    binary_content = read_binary_file(args.binary_path)
    patches = make_got_patches(args.binary_path, binary_content)
    patched_content = apply_patches(binary_content, patches)
    write_binary_file(args.output, patched_content)

if __name__ == "__main__":
    main()
Now we can get rid of the GOT for real:
$ cargo build --release --target x86_64-unknown-linux-musl
$ ./resolve_got.py target/x86_64-unknown-linux-musl/release/hello -o hello_no_got
$ objcopy -R.got hello_no_got
$ readelf -e hello_no_got | grep .got
$ ./hello_no_got
Hello, world!
I have also tested it on my ~3k LOC app, and it seems to work alright.
P.S. I am not an expert in assembly, so some of the above might be inaccurate.

68K Assembly Math Formula

I need to write some lines of 68k assembly language for the math formula:
x^2-5x+6
I want to do it with the ADD, SUB and MOVE instructions, yet somehow I can't define the variable x: it says it is an undefined symbol and I can't figure out where my problem is.
ORG $1000
START: ; first instruction of program
MOVE X*X, D0
MOVE (-5X),D2
MOVE 6,D3
ADD D0, D3
SUB D2, D1
SIMHALT
Errors:
LINE 10 Invalid Syntax
LINE 11 Invalid Syntax
Something like this, assuming basic 68000 (and not 68020 or better).
You may have to fix details like whether X is a word or long word, and deal with matters such as sign extension, as it's a long time since I did 68k assembler. X is defined as a word constant at the end.
ORG $1000
START: ; first instruction of program
CLR.L D7 ; Clear D7 - alternatively MOVEQ #0,D7
MOVE.W X,D7 ; Read X
; Output initial value...
LEA S1,A1
MOVE.W #255,D1
MOVE.L D7,D1
MOVEQ #17,D0
TRAP #15
LEA SNUL,A1
MOVEQ #13,D0
TRAP #15
MOVE.L D7,D6 ; copy of X
ASL.L #2,D6 ; Multiply by 4
ADD.L D7,D6 ; 4X plus another X = 5X
MULU.W D7,D7 ; X^2
SUB.L D6,D7 ; Subtract 5X from X^2
ADDQ.L #6,D7 ; plus 6
; Output answer...
LEA S2,A1
MOVE.L D7,D1
MOVEQ #17,D0
TRAP #15
SIMHALT ; halt simulator
* Put variables and constants here
S1: DC.B 'Initial :',0
S2: DC.B 'Answer :',0
SNUL: DC.B 0
X: DC.W 1234 ; Initial (fixed) value of X
END START ; last line of source
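As a quick cross-check of the shift-and-add trick used above ((X << 2) + X = 5X), here is a tiny Python model of the same arithmetic (for verification only; it is not part of the 68k answer):

def poly(x):
    # mirror the 68k sequence: D6 = (x << 2) + x = 5x, D7 = x*x - 5x + 6
    d6 = (x << 2) + x
    d7 = x * x - d6 + 6
    return d7

print(poly(5))     # 25 - 25 + 6 = 6
print(poly(1234))  # result for the X defined in the listing above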
Declaring variables in assembly doesn't work like it would in C or other similar languages. Let's say you're trying to write the following C function:
int myFunction(int x)
{
    return (x * x) - (5 * x) + 6;
}
So what you would do is, you would choose a register, say D0, and let that be your input variable. It can also be where the output goes.
myFunction:
MOVE.L D0,D1
MULS D0,D1 ;D1 = x squared
MOVE.L D0,D2
ADD.L D0,D0
ADD.L D0,D0
ADD.L D2,D0 ;D0 = 5X
SUB.L D0,D1 ;D1 = (X^2) - 5X
ADD.L #6,D1 ;D1 = (X^2) - 5X + 6
MOVE.L D1,D0 ;return in D0
RTS
Now, if you wanted to use this function, you would first load the desired value of x into register D0 and then call the function:
MOVE.L #5,D0 ;as an example, calculate the function where x = 5.
JSR myFunction
;the program will resume here after the calculation is done,
;and the result will be in D0.
