Rocket chip simulation shows unexpected instruction count - riscv

The following two code snippets differ only in the value loaded into the x23
register, but the minstret instruction counts (reported by a Verilator
simulation of the Rocket chip) differ substantially. Is this a bug, or am I
doing something wrong?
The read_csr() function is from the RISC-V Frontend Server Library (https://github.com/riscv/riscv-fesvr/blob/master/fesvr/encoding.h), and the rest of the code [syscalls.c, crt.S, test.ld] is similar to the RISC-V benchmarks
(https://github.com/riscv/riscv-tests/tree/master/benchmarks/common).
I have checked that the compiled binaries contain the exact same instructions, except for the difference in the operands.
Dividing 0x0fffffff by 0xff, repeating 1024 times: 3260 instructions.
size_t instrs = 0 - read_csr(minstret);
asm volatile (
"mv x20, zero;"
"li x21, 1024;"
"li x22, 0xfffffff;"
"li x23, 0xff;"
"loop:"
"div x24, x22, x23;"
"addi x20, x20, 1;"
"bleu x20, x21, loop;"
::: "x20", "x21", "x22", "x23", "x24", "cc"
);
instrs += read_csr(minstret);
Dividing 0x0fffffff by 0xffff, repeating 1024 times: 3083 instructions.
size_t instrs = 0 - read_csr(minstret);
asm volatile (
"mv x20, zero;"
"li x21, 1024;"
"li x22, 0xfffffff;"
"li x23, 0xffff;"
"loop:"
"div x24, x22, x23;"
"addi x20, x20, 1;"
"bleu x20, x21, loop;"
::: "x20", "x21", "x22", "x23", "x24", "cc"
);
instrs += read_csr(minstret);
Here, 3083 instructions seems about right (1024 * 3 = 3072, plus setup). Since minstret counts retired instructions, it seems strange that the first example executed ~200 more instructions. The results are the same no matter how many times I run these two programs.

The problem was resolved at https://github.com/freechipsproject/rocket-chip/issues/1495.
Servicing the debug interrupt, which the simulation apparently uses to detect whether the benchmark has finished executing, caused the difference in the instruction counts. The verbose log produced by Verilator shows instructions from the debug address range (0x800 onwards) being injected at different points during execution.

Related

sse4 packed sum between int32_t and int16_t (sign extend to int32_t)

I have the following code snippet (a gist can be found here) where I am trying to add 4 negative int32_t values to 4 int16_t values (which will be sign-extended to int32_t).
extern exit
global _start
section .data
a: dd -76, -84, -84, -132
b: dw 406, 406, 406, 406
section .text
_start:
movdqa xmm0, [a]
pmovsxwd xmm2, [b]
paddq xmm0, xmm2
;Expected: 330, 322, 322, 274
;Results: 330, 323, 322, 275
call exit
However, when stepping through it in my debugger, I couldn't understand why the output differs from the expected results. Any idea?
paddq adds 64-bit qword chunks, so there's carry across two of the 32-bit element boundaries, leading to an off-by-one in the high half of each qword.
paddd adds 32-bit dword chunks, matching the dword element size that pmovsxwd produces. This is a SIMD operation with 4 separate adds, independent of each other.
BTW, you could have made this more efficient by folding the 16-byte aligned load into a memory operand for paddd, but yeah, for debugging it can help to see both inputs in registers with a separate load.
default rel ; use RIP-relative addressing modes when possible
_start:
pmovsxwd xmm0, [b]
paddd xmm0, [a]
Also you'd normally put read-only arrays in section .rodata.

RISC V LD error - (.text+0xc4): relocation truncated to fit: R_RISCV_JAL against `*UND*'

Does anybody have a clue why I get the error below?
/tmp/cceP5axg.o: in function `.L0 ':
(.text+0xc4): relocation truncated to fit: R_RISCV_JAL against `*UND*'
collect2: error: ld returned 1 exit status
The R_RISCV_JAL relocation can represent an even signed 21-bit offset (-1 MiB to +1 MiB - 2). If your symbol is farther away than this limit, you get this error.
This error can also happen as an odd result of branch instructions that use hard-coded offsets. I was getting the exact same error on a program that was far less than 2 MiB. It turned out to be because I had several branch instructions that looked like bne rd, rs, offset, where the offset was a number literal like 0x8.
The solution was to remove the literal offset and replace it with a label from the code so it looks like
bne x7, x9, branch_to_here
[code to skip]
branch_to_here:
more code ...
instead of
bne x7, x9, 0x8
[code to skip]
more code ...
When I did that to every branch instruction, the error went away. Sorry to answer this 10 months late, but I hope it helps you, anonymous reader.
Since I've searched many resources to solve this issue, I think my attempt may help others.
There are two reasons that may trigger this issue:
The target offset is odd:
bne ra, ra, <odd offset>
The target address is a specific absolute value at compile time (not resolved at link time):
bne ra, ra, 0x80003000
My attempt to solve it:
label:
addi x0, x0, 0x0
addi x0, x0, 0x0
bne ra, ra, label + 6  # jump to an address computed from a label;
                       # a misaligned target like this raises an
                       # instruction-address-misaligned exception
sub_label:
addi x0, x0, 0x0
beq ra, ra, sub_label  # jump to a label directly
addi x0, x0, 0x0
nop

x86 Linux ELF Loader Troubles

I'm trying to write an ELF executable loader for x86-64 Linux, similar to this, which was implemented on ARM. Chris Rossbach's advanced OS class includes a lab that does basically what I want to do. My goal is to load a simple (statically-linked) "hello world" type binary into my process's memory and run it without execveing. I have successfully mmap'd the ELF file, set up the stack, and jumped to the ELF's entry point (_start).
// put ELF file into memory. This is just one line of a complex
// for() loop that loads the binary from a file.
mmap((void*)program_header.p_vaddr, program_header.p_memsz, map, MAP_PRIVATE|MAP_FIXED, elffd, program_header.p_offset);
newstack = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0); // Map a page for the stack
if((long)newstack < 0) {
fprintf(stderr, "ERROR: mmap returned error when allocating stack, %s\n", strerror(errno));
exit(1);
}
topstack = (unsigned long*)((unsigned char*)newstack+4096); // Top of new stack
*((unsigned long*)topstack-1) = 0; // Set up the stack
*((unsigned long*)topstack-2) = 0; // with argc, argv[], etc.
*((unsigned long*)topstack-3) = 0;
*((unsigned long*)topstack-4) = (unsigned long)argv[1];
*((unsigned long*)topstack-5) = 1;
asm("mov %0,%%rsp\n" // Install new stack pointer
"xor %%rax, %%rax\n" // Zero registers
"xor %%rbx, %%rbx\n"
"xor %%rcx, %%rcx\n"
"xor %%rdx, %%rdx\n"
"xor %%rsi, %%rsi\n"
"xor %%rdi, %%rdi\n"
"xor %%r8, %%r8\n"
"xor %%r9, %%r9\n"
"xor %%r10, %%r10\n"
"xor %%r11, %%r11\n"
"xor %%r12, %%r12\n"
"xor %%r13, %%r13\n"
"xor %%r14, %%r14\n"
:
: "r"(topstack-5)
:"rax", "rbx", "rcx", "rdx", "rsi", "rdi", "r8", "r9", "r10", "r11", "r12", "r13", "r14");
asm("push %%rax\n"
"pop %%rax\n"
:
:
: "rax");
asm("mov %0,%%rax\n" // Jump to the entry point of the loaded ELF file
"jmp *%%rax\n"
:
: "r"(jump_target)
: );
I then step through this code in gdb. I've pasted the first few instructions of the startup code below. Everything works great until the first push instruction (starred). The push causes a segfault.
0x60026000 xor %ebp,%ebp
0x60026002 mov %rdx,%r9
0x60026005 pop %rsi
0x60026006 mov %rsp,%rdx
0x60026009 and $0xfffffffffffffff0,%rsp
0x6002600d * push %rax
0x6002600e push %rsp
0x6002600f mov $0x605f4990,%r8
I have tried:
Using the stack from the original process.
mmaping a new stack (as in the above code): (1) and (2) both cause segfaults.
pushing and popping to/from the stack before jmping to the loaded ELF file. This does not cause a segfault.
Changing the protection flags for the stack in the second mmap to PROT_READ | PROT_WRITE | PROT_EXEC. This doesn't make a difference.
I suspect this has something to do with the segment descriptors (maybe?). It seems like the code from the ELF file I'm loading does not have write access to the stack, no matter where the stack is located. I have not tried to modify the segment descriptor for the newly loaded binary or change the architectural segment registers. Is this necessary? Does anybody know how to fix this?
It turned out that when I was stepping through the loaded code in gdb, the debugger would consistently blow by the first push instruction when I typed nexti and instead continue execution. It was not in fact the push instruction that was causing the segfault but a much later instruction in the C library start code. The problem was caused by a failed call to mmap in the initial binary load that I didn't error check.
Regarding gdb randomly deciding to continue execution instead of stepping: this can be fixed by loading the symbols from the target executable after jumping to the newly loaded executable.

gnu C++ library stuck in loop during vector alloc

Running linux kernel 3.6.6-1, gcc 4.7.2-2, the following program:
1 #include <vector>
2 using namespace std;
3 int main ()
4 {
5 vector<size_t> a (1 << 24);
6 return 0;
7 }
never returns from line 5.
when I run in gdb, I see that it is stuck in stl_algobase.h at line 743/744:
0x000000000040101c in std::__fill_n_a<unsigned long*, unsigned long, unsigned long> (__first=0x7fffeffd8060, __n=16777216, __value=#0x7fffffffe0a8: 0)
at /usr/lib/gcc/x86_64-redhat-linux/4.7.2/../../../../include/c++/4.7.2/bits/stl_algobase.h:743
740 __fill_n_a(_OutputIterator __first, _Size __n, const _Tp& __value)
741 {
742 const _Tp __tmp = __value;
743 for (__decltype(__n + 0) __niter = __n;
744 __niter > 0; --__niter, ++__first)
745 *__first = __tmp;
746 return __first;
747 }
__niter just stays at the value 1 and never counts down to 0.
This behavior only occurs after my system has been running for a while. And when it occurs, the whole system seems borked. That is, the gui soon stops responding, but I can ssh in and do some stuff, but eventually the whole system becomes unusable and I reboot.
After I reboot, the above program behaves as expected.
Obviously, the problem is not with my program. It's just a symptom of some larger problem.
My question is: What do I do next?
I have checked all my error logs and found nothing. I'm not getting hardware exceptions or anything like that, so it's hard to tell exactly when my system goes into this state.
I'm out of ideas, so any help would be very appreciated.
edit:
I changed my compiler options to -g -Wall and get the same result.
Here is the disassembly for __fill_n_a (with new options):
1 0x00000000004010bd <+0>: push %rbp
2 0x00000000004010be <+1>: mov %rsp,%rbp
3 0x00000000004010c1 <+4>: mov %rdi,-0x18(%rbp)
4 0x00000000004010c5 <+8>: mov %rsi,-0x20(%rbp)
5 0x00000000004010c9 <+12>: mov %rdx,-0x28(%rbp)
6 0x00000000004010cd <+16>: mov -0x28(%rbp),%rax
7 0x00000000004010d1 <+20>: mov (%rax),%rax
8 0x00000000004010d4 <+23>: mov %rax,-0x10(%rbp)
9 0x00000000004010d8 <+27>: mov -0x20(%rbp),%rax
10 0x00000000004010dc <+31>: mov %rax,-0x8(%rbp)
11 0x00000000004010e0 <+35>: jmp 0x4010f7 <std::__fill_n_a<unsigned long*, unsigned long, unsigned long>(unsigned long*, unsigned long, unsigned long const&)+58>
12 0x00000000004010e2 <+37>: mov -0x18(%rbp),%rax
13 0x00000000004010e6 <+41>: mov -0x10(%rbp),%rdx
14 0x00000000004010ea <+45>: mov %rdx,(%rax)
15 0x00000000004010ed <+48>: subq $0x1,-0x8(%rbp)
16 0x00000000004010f2 <+53>: addq $0x8,-0x18(%rbp)
17 0x00000000004010f7 <+58>: cmpq $0x0,-0x8(%rbp)
18 0x00000000004010fc <+63>: setne %al
19 0x00000000004010ff <+66>: test %al,%al
20 0x0000000000401101 <+68>: jne 0x4010e2 <std::__fill_n_a<unsigned long*, unsigned long, unsigned long>(unsigned long*, unsigned long, unsigned long const&)+37>
21 0x0000000000401103 <+70>: mov -0x18(%rbp),%rax
22 0x0000000000401107 <+74>: pop %rbp
23 0x0000000000401108 <+75>: retq
I've also run my system's memory diagnostic tool with no errors and, as suggested by DL, ran memtest86 with no errors.
edit:
I have confirmed that this is not a hardware problem by running the same code on a different machine. The other machine has the same kernel and compiler software installed, and it fails in the same way.
I am suspicious of ImageMagick. This seems to occur only after I have run scripts that make a lot of ImageMagick convert calls. I had problems with ImageMagick previously and had to set the shell variable MAGICK_THREAD_LIMIT=1.
The overall symptoms you describe sound like the system is running out of memory. If memory use does not show as high, this may be due to some kind of RAM problem, as commenters have noted.
You say:
__niter just stays at the value 1 and never counts down to 0.
but this doesn't quite make sense: __niter should start at 16777216 and count down to 0. If you were to break into this program at a random point, it would almost certainly be in this loop, but the value of __niter would almost certainly not be 1 yet, and if you stepped through the loop it would seem to just keep looping.
I'm highly suspicious of the debugging info put out by gcc 4.7 (actually, it's been a problem pretty much since gcc 4.0): gdb frequently seems to print the wrong values for local variables, but if you inspect the code and look at memory/registers directly you can see the correct values. If that's what is happening here, your problem probably has nothing to do with this program; it's a system instability (possibly due to a hardware problem) that manifests as things hanging, such as this program.
Given what this program does, the hang probably occurs when it touches a previously untouched page (taking a page fault) and the kernel attempts to allocate a page. That suggests a memory problem, but you noted that you already ran memory diagnostics. Also make sure that you don't have anything overclocked or otherwise running out of spec.

Intel Nehalem single threaded peak performance

I am trying to reach the single-threaded FP peak performance of my Nehalem CPU in order to detect performance anomalies in my application, but I can't seem to reach it. The clock speed is 3.2 GHz, and I want to achieve peak FP performance without using SSE instructions or multi-threading.
As I understand it, one single-precision FP addition and one multiplication can be issued in parallel each clock cycle, yielding a peak performance of 2 * 3.2 = 6.4 GFLOPS.
However, I am not able to reach this performance with a simple piece of code:
int iterations = 1000000;
int flops_per_iteration = 2;
int num_flops = iterations * flops_per_iteration;
for(int i=0; i<iterations; i++)
{
a[i] = i;
b[i] = i*2;
c[i] = i*3;
}
tick(&start_time);
for(int i = 0; i < iterations; i++){
a[i] *= b[i];
c[i] += b[i];
}
time = tock(&start_time);
printf("Performance: %0.4f GFLOPS \n", num_flops/(time*pow(10,-3)*pow(10,9)));
This piece of code gives me a performance of: ~1.5 GFLOPS instead of 6.4 GFLOPS.
Does anybody have another example that can approach peak performance without MT or SSE, or any idea why my code doesn't?
Thanks in advance.
* Update: Added Assembly Code of the hot loop: *
Address Assembly
Block 17:
0x4013a5 movssl (%rdi,%rax,4), %xmm2
0x4013aa movssl (%r8,%rax,4), %xmm0
0x4013b0 movssl (%rsi,%rax,4), %xmm1
0x4013b5 mulss %xmm2, %xmm0
0x4013b9 addss %xmm1, %xmm2
0x4013bd movssl %xmm0, (%r8,%rax,4)
0x4013c3 movssl %xmm2, (%rsi,%rax,4)
0x4013c8 inc %rax
0x4013cb cmp %rcx, %rax
0x4013ce jb 0x4013a5 <Block 17>
To reach 6.4 GFLOPS with this loop, your CPU would have to sustain 10 instructions per clock (the loop body contains 10 instructions for 2 FLOPs), or about 7 if unrolled. That is just not possible: you cannot get more than 4 instructions/clock on this processor.
How large is your L3 cache? 4 MB? Three arrays of a million elements each likely exceed it, so give the cache a little more headroom: try decreasing the working-set size by 50%.
However, the "parallelism" in FP operations basically means that one FP operation can be started while others are still in flight. But you will hardly manage to get real parallelism without either
using a multithreaded approach and/or
using SSE registers.
Shouldn't you use loop unrolling to fill up the CPU pipeline?

Resources