RISC-V user level reference or reference implementation - riscv

Summary: What is the definitive reference or reference implementation for the RISC-V user-level ISA?
Context: The RISC-V website has "The RISC-V Instruction Set Manual" which explains the user-level instructions very well, but does not give an exact specification for them. I am trying to build a user-level ISA simulator now and intend to write an FPGA implementation later, so the exact behavior is important to me.
A reference implementation would be sufficient, but should preferably be as simple as possible -- i.e. I would try to understand a pipelined implementation only as a last resort. What is important is to have an understanding of the specified ISA and not of a single CPU implementation or compiler implementation.
One example to show my problem is the AUIPC instruction: The prose explanation says that "AUIPC forms a 32-bit offset from the 20-bit U-immediate, filling in the lowest 12 bits with zeros, adds this offset to the pc, then places the result in register rd." I wanted to know whether this refers to the old or new PC, i.e. the position of the AUIPC instruction or the next instruction. I looked at the "RISCV Angel" implementation, but that seems to mask out the lower bits of the (old) PC -- not just of the immediate -- which I could not find any reason for in the spec, not even in the change history of the spec (since Angel is a bit older). Instead of an answer, I now have two questions about AUIPC. Many other instructions pose similar problems to me.

AFAICT the RISC-V Instruction Set Manual you cite is the closest thing there is to a definitive reference. If there are things that are unclear or incorrect in there then you could open issues at the Github site where that document is maintained: https://github.com/riscv/riscv-isa-manual
As far as AUIPC is concerned, the answer is implied, but not stated explicitly, by this sentence at the bottom of page 9 in the current manual:
There is one additional user-visible register: the program counter pc holds the address of the current instruction.
Based on that statement I would expect that the pc value that is seen and manipulated by the AUIPC instruction is the address of the AUIPC instruction itself.
This interpretation is supported by the discussion of the JALR instruction:
The indirect jump instruction JALR (jump and link register) uses the I-type encoding. The target address is obtained by adding the 12-bit signed I-immediate to the register rs1, then setting the least-significant bit of the result to zero. The address of the instruction following the jump (pc+4) is written to register rd.
Given that the address of the following instruction is expressed as pc+4, it seems clear that the pc value visible during the execution of JALR is the address of the JALR instruction itself.
The latest draft of the manual (at https://github.com/riscv/riscv-isa-manual/releases/download/draft-20190321-ba17106/riscv-spec.pdf) makes the situation slightly clearer. In place of this in the current manual:
AUIPC appends 12 low-order zero bits to the 20-bit U-immediate, sign-extends the result to 64 bits, then adds it to the pc and places the result in register rd.
the latest draft says:
AUIPC forms a 32-bit offset from the 20-bit U-immediate, filling in the lowest 12 bits with zeros, adds this offset to the pc of the AUIPC instruction, then places the result in register rd.
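For a simulator, the reading above can be sketched as follows. This is a minimal RV32 illustration, not a reference implementation: the function names, the `regs` list, and the convention of returning the next pc are assumptions made for the example. The only points taken from the manual text are that AUIPC adds the offset to the address of the AUIPC instruction itself, and that JALR clears the least-significant bit of the target and writes pc+4 to rd.

```python
MASK32 = 0xFFFFFFFF

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def exec_auipc(insn, pc, regs):
    """Execute AUIPC; `pc` is the address of the AUIPC instruction itself."""
    rd = (insn >> 7) & 0x1F
    offset = insn & 0xFFFFF000          # U-immediate with the low 12 bits zero
    if rd != 0:                         # x0 is hard-wired to zero
        regs[rd] = (pc + offset) & MASK32
    return (pc + 4) & MASK32            # next pc (hypothetical convention)

def exec_jalr(insn, pc, regs):
    """Execute JALR; returns the jump target as the next pc."""
    rd = (insn >> 7) & 0x1F
    rs1 = (insn >> 15) & 0x1F
    imm = sign_extend(insn >> 20, 12)   # 12-bit signed I-immediate
    target = (regs[rs1] + imm) & MASK32 & ~1   # clear the least-significant bit
    if rd != 0:
        regs[rd] = (pc + 4) & MASK32    # address of the following instruction
    return target
```

Note that the target is computed before rd is written, so the sketch still behaves sensibly when rd and rs1 name the same register.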

Related

How to tell compiler to pad a specific amount of bytes on every C function?

I'm trying to practice some live instrumentation, and I saw there is a linker option -call-nop=prefix-nop, but it has some restrictions: it only works with GOT functions (I don't know how to force the compiler to generate GOT functions, and I'm not sure it's a good idea for performance reasons). Also, -call-nop=* cannot pad more than 1 byte.
Ideally, I'd like a compiler option that pads any specific number of bytes while the compiler still performs all the normal function alignment.
Once I have this padding area, I can reuse it at run time to store some values or redirect the control flow.
P.S. I believe the Linux kernel uses a similar trick to dynamically enable some software tracepoints.
-pg is intended for profiling with gprof. The correct option for this is -fpatchable-function-entry:
-fpatchable-function-entry=N[,M]
Generate N NOPs right at the beginning of each function, with the function entry point before the Mth NOP. If M is omitted, it defaults to 0 so the function entry points to the address just at the first NOP. The NOP instructions reserve extra space which can be used to patch in any desired instrumentation at run time, provided that the code segment is writable. The amount of space is controllable indirectly via the number of NOPs; the NOP instruction used corresponds to the instruction emitted by the internal GCC back-end interface gen_nop. This behavior is target-specific and may also depend on the architecture variant and/or other compilation options.
On x86 it inserts N single-byte 0x90 NOPs and doesn't make use of multi-byte NOPs, so performance isn't as good as it could be, but you probably don't care about that in this case, so the option should work fine.
I achieved this goal by implementing my own mcount function in an assembly file and compiling the code with -pg.

2 Questions about Risc-V-Privileged-Spec-v1.7

Page 16, Table 3.1:
Base field in mcpuid: RV32I RV32E RV64I RV128I
What is "RV32E"?
Is there an "E" extension?
ECALL (page 30) says nothing about the behavior of the pc.
Meanwhile, the descriptions of mepc (page 28) and mbadaddr (page 29) say that "mepc will point to the beginning of the instruction". I think ECALL should set mepc to the end of the causing instruction so that an ERET would go to the next instruction. Is that right?
As answered by CliffordVienna, RV32E ("embedded") is a new base ISA which uses 16 registers and makes some of the counter registers optional.
I would not recommend implementing a RV32E core, as it is probably an unnecessary over-optimization in core size that limits your ability to use a large body of RV*I code. But if performance is not needed, and you really need the core to be a tad smaller, and the core is not connected to a memory hierarchy that would dominate the area/power anyways, and you were willing to deal with the tool-chain headaches... then maybe an RV32E core is appropriate.
ECALL is treated like an exception, and will redirect the PC to the appropriate trap handler based on the current privilege level. MEPC will be set to the current PC of the ecall instruction.
You can verify this behavior by analyzing the Berkeley RV64G Rocket processor (https://github.com/ucb-bar/rocket/blob/master/src/main/scala/csr.scala), or by looking at the Spike ISA simulator (starting here: https://github.com/riscv/riscv-isa-sim/blob/master/riscv/insns/scall.h). Careful: as of 2015 Jun 27 the code is still in flux regarding the Privileged Spec.
If we look at how Spike handles eret ("sret": https://github.com/riscv/riscv-isa-sim/blob/master/riscv/insns/sret.h) for example, we have to be a bit careful. The PC is set to "mepc", but it's the trap handler's job to advance the PC by 4. We can see that done, for example, by the proxy kernel in some of the handler functions here (https://github.com/riscv/riscv-pk/blob/master/pk/handlers.c).
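The division of labour described above (hardware records the address of the ECALL, the handler advances it, ERET jumps to it) can be sketched like this. It is a toy model, not Spike or Rocket code: the `Hart` class, its method names, and the cause value are hypothetical, and privilege levels are left out entirely.

```python
ILLUSTRATIVE_ECALL_CAUSE = 11   # assumption: placeholder, not taken from the spec

class Hart:
    """Toy model of the pc/mepc interaction around ECALL and ERET."""

    def __init__(self, pc, trap_vector):
        self.pc = pc
        self.mepc = 0
        self.mcause = 0
        self.trap_vector = trap_vector   # hypothetical handler address

    def ecall(self):
        # Hardware side: mepc gets the address of the ECALL instruction itself.
        self.mepc = self.pc
        self.mcause = ILLUSTRATIVE_ECALL_CAUSE
        self.pc = self.trap_vector

    def handler_skip_ecall(self):
        # Software side: the trap handler advances mepc past the 4-byte ECALL.
        self.mepc += 4

    def eret(self):
        # Hardware side: return to whatever mepc now holds.
        self.pc = self.mepc
```

If the handler did not advance mepc, the ERET would return to the ECALL and trap again.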
A draft of the RV32E (embedded) spec can be found here (via isa-dev mailing list):
https://lists.riscv.org/lists/arc/isa-dev/2015-06/msg00022/rv32e.pdf
It's RV32I with 16 instead of 32 registers and without the counter instructions.

How to disassemble these instructions

I am writing a little disassembler using riscv-spec-v2.0 and have some questions about the following instructions and how to correctly disassemble them:
1. The FENCE instruction has "pred" and "succ" bit fields in imm.
2. AMO instructions have "aq" and "rl" bits in funct7.
3. Float instructions have an "rm" bit field in funct3.
All of these bit fields seem to lack mappings in the assembler. E.g. page 50 just says "FENCE" but not what to do with the immediate. Or page 33 has an example of putting .aq or .rl at the end, but not what to do if both are present.
4. SCALL and SBREAK are the same as ECALL and EBREAK, but there is also ERET. So why not drop SCALL and SBREAK and just use ECALL, EBREAK and ERET? Otherwise it is hard to disassemble these opcodes.
The current RISC-V assembler is terse for common defaults:
1. "FENCE" with no arguments is treated as a full fence (all pred and succ bits set).
2. It is OK to have both aq and rl set on the same instruction.
3. The rounding mode is not shown if it is not specified.
4. ECALL and EBREAK will be the new standard names (this will be clarified in the revised user ISA manual).
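A disassembler could extract and print these fields roughly as below. The bit positions follow the instruction formats (pred in bits 27:24, succ in bits 23:20, aq in bit 26, rl in bit 25, rm in bits 14:12). The output spellings are assumptions for illustration: the "iorw" letters, the combined ".aqrl" suffix, and the rounding-mode names match what current GNU tools use, but riscv-spec-v2.0 itself does not fix them, and the "0" shown for an empty fence set is purely a placeholder.

```python
def fence_set(bits):
    """Render a 4-bit I/O/R/W set, most significant bit first, e.g. 0b0011 -> 'rw'."""
    letters = "".join(n for i, n in enumerate("iorw") if bits & (8 >> i))
    return letters or "0"    # "0" for an empty set is a hypothetical placeholder

def disasm_fence(insn):
    pred = (insn >> 24) & 0xF
    succ = (insn >> 20) & 0xF
    if pred == 0xF and succ == 0xF:
        return "fence"       # terse form: a full fence takes no arguments
    return "fence %s, %s" % (fence_set(pred), fence_set(succ))

def amo_suffix(insn):
    """Ordering suffix for an AMO instruction, from the aq and rl bits."""
    aq = (insn >> 26) & 1
    rl = (insn >> 25) & 1
    return {(0, 0): "", (1, 0): ".aq", (0, 1): ".rl", (1, 1): ".aqrl"}[(aq, rl)]

ROUNDING_MODES = {0: "rne", 1: "rtz", 2: "rdn", 3: "rup", 4: "rmm", 7: "dyn"}

def rounding_mode(insn):
    """Name of the rm field of a floating-point instruction."""
    rm = (insn >> 12) & 0x7
    return ROUNDING_MODES.get(rm, "rm=%d (reserved)" % rm)
```

A disassembler that wants to stay terse could omit the rounding mode when it decodes to "dyn", on the assumption that the assembler encodes an unspecified rounding mode as dynamic.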

$gp, .cpload and position independence on MIPS

I'm reading about PIC implementation on MIPS on Linux here. It says:
The global pointer which is stored in the $gp register (aka $28) is a callee saved register.
The Wikipedia article about MIPS says the same.
However, according to them, when a .cpload directive is being used in function prologue, it clobbers the previous value of $gp without saving it first. When a .cprestore is used, it saves the current $gp to the stack frame, as opposed to the value of $gp that was there on function entrance. Same goes for the effect .cprestore has on jal/jalr: it restores $gp once the callee returns - assuming the callee might've clobbered it.
And finally, there's nothing in the function epilogue about $gp.
All in all, doesn't sound like a callee-saved register to me. Sounds like a caller-saved register. What am I misunderstanding here?
Linux programs on MIPS can be compiled as pic or not. If compiled as pic, then they must use "abicalls", and its behaviour is a little different from that of the no-abicalls convention.
From the "section Position-Independent Function Prologue" of the "SYSTEM V APPLICATION BINARY INTERFACE - MIPS Processor Supplement 3rd Edition" we can cite:
After calculating the gp, a function allocates the local stack space and saves the gp on the stack, so it can be restored after subsequent function calls. In other words, the gp is a caller saved register.
The code in the following figure illustrates a position-independent function prologue. _gp_disp represents the offset between the beginning of the function and the global offset table.
name:
la gp, _gp_disp
addu gp, gp, t9
addiu sp, sp, -64
sw gp, 32(sp)
So in summary, if you're using -mabicalls then gp is calculated at the beginning of all the functions needing global symbols (with some exceptions), and additionally any code (abi or not) that calls abi code will ensure that the called function address is stored in t9.
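The arithmetic done by the first two prologue instructions can be written out as a small sketch. The function name and the sample addresses are made up for illustration; the only facts used are that t9 holds the address of the called function and that _gp_disp is an offset relative to the start of that function.

```python
MASK32 = 0xFFFFFFFF

def compute_gp(t9, gp_disp):
    """Model 'la gp, _gp_disp' followed by 'addu gp, gp, t9' on a 32-bit target."""
    gp = gp_disp & MASK32        # la gp, _gp_disp
    gp = (gp + t9) & MASK32      # addu gp, gp, t9
    return gp
```

Because the result depends only on t9 and a link-time constant, the function can recompute gp on entry without relying on the value the caller left in the register, which is why the caller has to make sure t9 is correct before the call.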

operand of LIDT is displacement/absolute address

I stumbled upon a statement in Intel Software developers manual:
"For LGDT, LIDT, LLDT, LTR, SGDT, SIDT, SLDT, STR, the exit qualification receives the value of the instruction’s displacement field, which is sign-extended to 64 bits if necessary (32 bits on processors that do not support Intel 64 architecture). If the instruction has no displacement (for example, has a register operand), zero is stored into the exit qualification. "
Now if I have an instruction LIDT 0xf290, then is "0xf290" a displacement? I think answer is yes.
So my confusion is: what exactly counts as a displacement? I was under the impression that a displacement is something calculated relative to the current eip value.
For example, jmp xxx (in intrasegment jumps this will be a displacement, but for intersegment jumps it should be an absolute address). If that is the case, then why does LIDT load a relative address?
A displacement is just an offset from some origin, which may be a Base+Index*Scale, or 0. The other operand x86 has that can hold large values is immediate, which is useful for things like adding constants (e.g. ADD $42, %eax).
Incidentally, it appears that relative jumps use the immediate field, probably because they modify EIP by a constant.
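To make the "offset from some origin" point concrete, here is a simplified sketch of how a displacement enters an address computation and how the quoted rule about the exit qualification would apply. It ignores segment bases and address-size overrides, and the function names are hypothetical.

```python
MASK64 = (1 << 64) - 1

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def effective_address(disp, disp_bits=32, base=0, index=0, scale=1):
    """Base + Index*Scale + displacement; with no base or index the origin is 0."""
    return (base + index * scale + sign_extend(disp, disp_bits)) & MASK64

def exit_qualification(disp=None, disp_bits=32):
    """Displacement sign-extended to 64 bits, or zero if the instruction has none."""
    if disp is None:
        return 0
    return sign_extend(disp, disp_bits) & MASK64
```

For LIDT 0xf290 there is no base or index register, so `effective_address(0xf290)` is simply 0xf290: the displacement is measured from 0, not from eip.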
