Setting the mstatus register for RISC-V - riscv

I am trying to load mstatus with another register t1.
lw t1, mstatus # load mstatus register into t1
xori t1, t1, 0x8 # xor mstatus to set 3rd bit and leave everything else as is
lw mstatus, t1 # set mstatus
The initial lw t1, mstatus works just fine. However when trying to lw mstatus, t1 the assembler gives
Error: illegal operands 'lw mstatus, t1'
I have no idea what causes this error. The mstatus register is a read/write register, so it should work.

mstatus is not a memory location, so it can't be loaded or stored with the lw/sw instructions, which only move data between memory and the general-purpose registers (x0-x31).
mstatus is one of the CSRs (Control and Status Registers), which are accessed with the Control and Status Register instructions (see chapter 2.8 of riscv-spec).
To read mstatus, use csrrs or csrrc with x0 as the source register; to write it, use csrrw. Depending on what you want to do, you can also just set or clear individual bits of the register with csrrs/csrrc.
Write t1 to mstatus and discard the old value of mstatus (destination x0):
csrrw x0, mstatus, t1
Read mstatus into t1 without modifying mstatus:
csrrs t1, mstatus, x0
or
csrrc t1, mstatus, x0

In addition to @FabienM's answer, I would add a reference to the pseudo-instructions for handling CSRs, e.g. csrr rd, csr, which is short for csrrs rd, csr, x0 and simply reads the given CSR. Those can be found in chapter 9.1, "CSR Instructions", of The RISC-V Instruction Set Manual Volume I: Unprivileged ISA.
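To make the read/modify semantics of these instructions concrete, here is a minimal C model. It is only a sketch: the "CSR" is an ordinary variable, and the helper names (csrrw, csrrs, csrrc, csrr, enable_machine_interrupts) are hypothetical functions written for this illustration, not real intrinsics. The only hardware fact assumed is that MIE is bit 3 of mstatus.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 3 of mstatus is MIE, the machine-mode global interrupt enable. */
#define MSTATUS_MIE (UINT32_C(1) << 3)

/* Model of "csrrw rd, csr, rs1": returns the old value (rd), csr = rs1. */
static uint32_t csrrw(uint32_t *csr, uint32_t rs1) {
    uint32_t old = *csr;
    *csr = rs1;
    return old;
}

/* Model of "csrrs rd, csr, rs1": returns the old value, sets the bits in rs1. */
static uint32_t csrrs(uint32_t *csr, uint32_t rs1) {
    uint32_t old = *csr;
    *csr = old | rs1;
    return old;
}

/* Model of "csrrc rd, csr, rs1": returns the old value, clears the bits in rs1. */
static uint32_t csrrc(uint32_t *csr, uint32_t rs1) {
    uint32_t old = *csr;
    *csr = old & ~rs1;
    return old;
}

/* Model of the pseudo-instruction "csrr rd, csr" (csrrs rd, csr, x0). */
static uint32_t csrr(uint32_t *csr) {
    return csrrs(csr, 0);
}

/* Equivalent of "li t0, 0x8; csrrs x0, mstatus, t0": set MIE, keep the rest. */
static void enable_machine_interrupts(uint32_t *mstatus) {
    (void)csrrs(mstatus, MSTATUS_MIE);
    assert(csrr(mstatus) & MSTATUS_MIE);
}
```

In this model, setting a bit with csrrs is idempotent, whereas the xori in the question would toggle the bit and clear it again if it was already set.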

Related

Multi-threaded reference counting

I was just thinking about multi-threaded reference counting, searched for it and found many posts that basically only mention the problem of atomicity. Many answers, even here on Stack Overflow, miss the actual problems involved in multi-threaded reference counting. So what is the fundamental problem?
Let's assume an object type with a reference counter like
struct SomethingRefcounted {
int refcount;
// other stuff
} * sr;
Reference counting means, our reference counter shall equal the number of references to the object at all times, possibly being slightly higher temporarily during pointer assignments.
For all further code, I assume volatile and atomic operations.
Now, each time, we create a new reference, we do an implicit ++sr->refcount; each time we remove the reference, we do an implicit if (!--sr->refcount) free(sr);.
However, if one thread does the decrement and another thread tries the increment at the same time, we get a race, which can only be understood by considering CPU registers.
SomethingRefcounted * sr1, * sr2;
int thread_1() {
for (;;) sr1 = NULL;
}
int thread_2() {
for (;;) sr2 = sr1;
}
int thread_3() {
for (;;) sr1 = new SomethingRefcounted;
}
int main() {
// create the threads and let them run
}
The problem here is thread_2: the moment it reads the pointer sr1 into a CPU register, it violates the assumption that the refcounter correctly counts the number of references. The refcounter is 1 after the new assignment to sr1, but even ignoring thread_1 for a moment, once sr1 is read into a CPU register by thread_2, there are 2 references to the object (the variable sr1 and the CPU register of thread_2) while the refcounter is still only 1, violating our constraint from above. The following increment will fix it, but this fix comes too late if thread_1 decrements the counter to 0 first.
There are solutions involving locks (global locks, maybe hashed so that many objects share one lock out of a pool of locks; that way not every object needs its own lock, but there also doesn't need to be a single application-wide lock causing lock contention). One solution I came up with, however, is to xchg the pointer on read, so that the constraint refcounter >= number of references is enforced. thread_2 in assembly could then look like this:
LOAD 0xdeadbeef -> reg1
L1:
XCHG [sr1] <-> reg1
CMP reg1 <-> 0xdeadbeef
JUMPIFEQUAL L1
INC [reg1]
STORE reg1 -> [sr1]
STORE reg1 -> [sr2] ; Actually, here, we have to do the
; refcount decrement on sr2 first, but you get the point
So, this is a simple spinlock, waiting for other concurrently running accesses like this to complete. Once we have successfully XCHGed the pointer, no one else can get it, so we can be sure that the ref counter is at least 1 (it can't be 0, because we just found a reference to it) and that it can't be decremented down to zero (even if there are more references, the one we have now is held exclusively in our CPU register, and it contributes 1 to the refcounter, preventing it from reaching zero).
Similarly, thread_1 would look like this:
LOAD 0xdeadbeef -> reg1
L1:
XCHG [sr1] <-> reg1
CMP reg1 <-> 0xdeadbeef
JUMPIFEQUAL L1
LOAD 0 -> reg2
STORE reg2 -> [sr1]
DEC [reg1]
JUMPIFZERO free_it
Now, I am wondering whether there is a more efficient solution to that problem (or even whether I am missing something here).
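The exchange-based protocol from the pseudo-assembly above can be sketched with C11 atomics. This is a minimal illustration, not a vetted lock-free design: struct obj, BUSY, acquire_ref, release_ref and clear_slot are hypothetical names, 0xdeadbeef is reused as the "slot is busy" sentinel, and a NULL check is added that the pseudo-assembly leaves out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

struct obj {
    atomic_int refcount;
    /* other stuff */
};

/* Sentinel stored in the slot while one thread holds the pointer exclusively. */
#define BUSY ((struct obj *)(uintptr_t)0xdeadbeef)

/* Swap the pointer out of the slot, spinning while another thread holds it. */
static struct obj *take_exclusive(struct obj *_Atomic *slot) {
    struct obj *p;
    assert(slot != NULL);
    while ((p = atomic_exchange(slot, BUSY)) == BUSY)
        ; /* spin */
    return p;
}

/* thread_2: obtain a counted reference to whatever the slot points to. */
static struct obj *acquire_ref(struct obj *_Atomic *slot) {
    struct obj *p = take_exclusive(slot);
    if (p != NULL)
        atomic_fetch_add(&p->refcount, 1); /* count our copy first */
    atomic_store(slot, p);                 /* then put the pointer back */
    return p;
}

/* Drop one reference and free the object when the last one goes away. */
static void release_ref(struct obj *p) {
    if (p != NULL && atomic_fetch_sub(&p->refcount, 1) == 1)
        free(p);
}

/* thread_1: set the slot to NULL and drop the reference it held. */
static void clear_slot(struct obj *_Atomic *slot) {
    struct obj *p = take_exclusive(slot);
    atomic_store(slot, NULL);
    release_ref(p);
}
```

A caller would pair every acquire_ref with a later release_ref. The cost is the same as in the pseudo-assembly: readers of the same slot serialize on the exchange, so the slot itself behaves like a tiny spinlock.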

RISC-V interrupts, setting up MTIMECMP

I am trying to write a program in RISC-V assembly for the HiFive1 board that wakes up on a timer interrupt.
This is my interrupt setup routine
.section .text
.align 2
.globl setupINTERRUPT
.equ MTIMECMP, 0x2004000
setupINTERRUPT:
addi sp, sp, -16 # allocate a stack frame: moves the stack pointer down by 16 bytes
sw ra, 12(sp) # save return address on stack
li t0, 0x8 # time interval at which to trigger the interrupt
li t1, MTIMECMP # MTIMECMP register of the CLINT memory map
sw t0, 0(t1) # store the interval in MTIMECMP memory location
li t0, 0x800 # make a mask for 3rd bit
csrrs t1, mstatus, t0 # use CSR READ/SET instruction to set 3rd bit using previously defined mask
li t0, 0x3 # make a mask for 0th and 1st bit
csrrc t1, mtvec, t0 # use CSR READ/CLEAR instruction to clear 0th and 1st bit
li t0, 0x80 # make a mask for 7th bit
csrrs t1, mie, t0 # set 7th bit for MACHINE TIMER INTERRUPT ENABLE
lw ra, 12(sp) # restore the return address
addi sp, sp, 16 # deallocate the stack frame
ret
I am not too sure if I'm setting MTIMECMP correctly; I know it's a 64-bit memory location.
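On the 64-bit MTIMECMP point, here is a small C sketch of how such a register is usually written from 32-bit code. The registers are modelled as plain two-word arrays (low word first) so the logic can be run anywhere; set_mtimecmp, read_mtime and arm_timer are hypothetical helper names. The three-store order follows the sequence suggested in the RISC-V privileged specification for RV32, and the sketch assumes the usual behaviour that the timer interrupt is pending whenever mtime >= mtimecmp.

```c
#include <assert.h>
#include <stdint.h>

/* Write a 64-bit compare value as three 32-bit stores so that no
 * intermediate state is smaller than both the old and the new value. */
static void set_mtimecmp(volatile uint32_t *cmp, uint64_t value) {
    assert(cmp != 0);
    cmp[0] = UINT32_MAX;              /* low word: all ones for now */
    cmp[1] = (uint32_t)(value >> 32); /* high word */
    cmp[0] = (uint32_t)value;         /* final low word */
}

/* Read a 64-bit counter from two 32-bit halves, retrying if the
 * high half changed between the two reads (low half wrapped). */
static uint64_t read_mtime(const volatile uint32_t *mtime) {
    uint32_t hi, lo;
    do {
        hi = mtime[1];
        lo = mtime[0];
    } while (mtime[1] != hi);
    return ((uint64_t)hi << 32) | lo;
}

/* Arm the timer "interval" ticks from now: the compare value is an
 * absolute time, not the interval itself. */
static void arm_timer(volatile uint32_t *cmp,
                      const volatile uint32_t *mtime,
                      uint64_t interval) {
    set_mtimecmp(cmp, read_mtime(mtime) + interval);
}
```

Under that assumption, storing a small constant such as 0x8 directly into MTIMECMP leaves the interrupt pending as soon as mtime passes 8, which would be consistent with wfi appearing to fall straight through.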
I am trying to use this interrupt as a delay timer for a blinking LED (just trying to make sure the interrupt works before I move on to writing a handler).
Here is my setLED program (note that all the GPIO register setup was done previously and is known to work). I have a WFI instruction before each of the ON and OFF functions. The LED doesn't light up, even though in debug mode it does. I think setLED skips the WFI instruction as if the interrupt was already asserted.
.section .text
.align 2
.globl setLED
#include "memoryMap.inc"
#include "GPIO.inc"
.equ NOERROR, 0x0
.equ ERROR, 0x1
.equ LEDON, 0x1
# which LED to set comes into register a0
# desired On/Off state comes into a1
setLED:
addi sp, sp, -16 # allocate a stack frame: moves the stack pointer down by 16 bytes
sw ra, 12(sp) # save return address on stack
li t0, GPIO_CTRL_ADDR # load GPIO address
lw t1, GPIO_OUTPUT_VAL(t0) # get the current value of the pins
beqz a1, ledOff # Branch off to turn off led if a1 requests it
li t2, LEDON # load the value of LEDON into a temp register
beq a1, t2, ledOn # branch if on requested
li a0, ERROR # we got a bad status request, return an error
j exit
ledOn:
wfi
xor t1, t1, a0 # doing xor to only change the value of requested LED
sw t1, GPIO_OUTPUT_VAL(t0) # write the new output value to GPIO out
li a0, NOERROR # no error
j exit
ledOff:
wfi
xor a0, a0, 0xffffffff # invert everything so that all bits are one except the LED we are turning off
and t1, t1, a0 # and a0 and t1 to get the LED we want to turn off
sw t1, GPIO_OUTPUT_VAL(t0) # write the new output value
li a0, NOERROR
exit:
lw ra, 12(sp) # restore the return address
addi sp, sp, 16 # deallocate the stack frame
ret

What is the point of atomic.Load and atomic.Store

Go's memory model says nothing about atomics and their relation to memory fencing, although many internal packages seem to rely on the memory ordering that atomics would provide if they created memory fences around them. See this issue for details.
Not understanding how it really works, I went to the sources, in particular src/runtime/internal/atomic/atomic_amd64.go, and found the following implementations of Load and Store:
//go:nosplit
//go:noinline
func Load(ptr *uint32) uint32 {
return *ptr
}
Store is implemented in asm_amd64.s in the same package.
TEXT runtime∕internal∕atomic·Store(SB), NOSPLIT, $0-12
MOVQ ptr+0(FP), BX
MOVL val+8(FP), AX
XCHGL AX, 0(BX)
RET
Both look as if they had nothing to do with parallelism.
I did look into other architectures, but the implementations seem to be equivalent.
However, if atomics are indeed weak and provide no memory ordering guarantees, then the code below could fail, but it does not.
In addition, I tried replacing the atomic calls with simple assignments, but it still produces a consistent and "successful" result in both cases.
func try() {
var a, b int32
go func() {
// atomic.StoreInt32(&a, 1)
// atomic.StoreInt32(&b, 1)
a = 1
b = 1
}()
for {
// if n := atomic.LoadInt32(&b); n == 1 {
if n := b; n == 1 {
if a != 1 {
panic("fail")
}
break
}
runtime.Gosched()
}
}
func main() {
n := 1000000000
for i := 0; i < n ; i++ {
try()
}
}
The next thought was that the compiler does some magic to provide ordering guarantees. So below is the listing of the variant with atomic Store and Load not commented. Full listing is available on the pastebin.
// Anonymous function implementation with atomic calls inlined
TEXT %22%22.try.func1(SB) gofile../path/atomic.go
atomic.StoreInt32(&a, 1)
0x816 b801000000 MOVL $0x1, AX
0x81b 488b4c2408 MOVQ 0x8(SP), CX
0x820 8701 XCHGL AX, 0(CX)
atomic.StoreInt32(&b, 1)
0x822 b801000000 MOVL $0x1, AX
0x827 488b4c2410 MOVQ 0x10(SP), CX
0x82c 8701 XCHGL AX, 0(CX)
}()
0x82e c3 RET
// Important "cycle" part of try() function
0x6ca e800000000 CALL 0x6cf [1:5]R_CALL:runtime.newproc
for {
0x6cf eb12 JMP 0x6e3
runtime.Gosched()
0x6d1 90 NOPL
checkTimeouts()
0x6d2 90 NOPL
mcall(gosched_m)
0x6d3 488d0500000000 LEAQ 0(IP), AX [3:7]R_PCREL:runtime.gosched_m·f
0x6da 48890424 MOVQ AX, 0(SP)
0x6de e800000000 CALL 0x6e3 [1:5]R_CALL:runtime.mcall
if n := atomic.LoadInt32(&b); n == 1 {
0x6e3 488b442420 MOVQ 0x20(SP), AX
0x6e8 8b08 MOVL 0(AX), CX
0x6ea 83f901 CMPL $0x1, CX
0x6ed 75e2 JNE 0x6d1
if a != 1 {
0x6ef 488b442428 MOVQ 0x28(SP), AX
0x6f4 833801 CMPL $0x1, 0(AX)
0x6f7 750a JNE 0x703
0x6f9 488b6c2430 MOVQ 0x30(SP), BP
0x6fe 4883c438 ADDQ $0x38, SP
0x702 c3 RET
As you can see, no fences or locks are in place again.
Note: all tests are done on x86_64 and i5-8259U
The question:
So, is there any point of wrapping simple pointer dereference in a function call or is there some hidden meaning to it and why do these atomics still work as memory barriers? (if they do)
I don't know Go at all, but it looks like the x86-64 implementations of .load() and .store() are sequentially-consistent. Presumably on purpose / for a reason!
//go:noinline on the load means the compiler can't reorder around a blackbox non-inline function, I assume. On x86 that's all you need for the load side of sequential-consistency, or acq-rel. A plain x86 mov load is an acquire load.
The compiler-generated code gets to take advantage of x86's strongly-ordered memory model, which is sequential consistency + a store buffer (with store forwarding), i.e. acq/rel. To recover sequential consistency, you only need to drain the store buffer after a release-store.
.store() is written in asm, loading its stack args and using xchg as a seq-cst store.
XCHG with memory has an implicit lock prefix which is a full barrier; it's an efficient alternative to mov+mfence to implement what C++ would call a memory_order_seq_cst store.
It flushes the store buffer before later loads and stores are allowed to touch L1d cache. See also: Why does a std::atomic store with sequential consistency use XCHG?
See
https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
C/C++11 mappings to processors describes the sequences of instructions that implement relaxed load/store, acq/rel load/store, seq-cst load/store, and various barriers, on various ISAs, so you can recognize things like xchg with memory.
Does lock xchg have the same behavior as mfence? (TL:DR: yes except for maybe some corner cases with NT loads from WC memory, e.g. from video RAM). You may see a dummy lock add $0, (SP) used as an alternative to mfence in some code.
IIRC, AMD's optimization manual even recommends this. It's good on Intel as well, especially on Skylake where mfence was strengthened by microcode update to fully block out-of-order exec even of ALU instructions (like lfence) as well as memory reordering. (To fix an erratum with NT loads.)
https://preshing.com/20120913/acquire-and-release-semantics/
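For comparison, the publish/consume pattern from the question's try() can be written with C11 atomics, where the ordering is spelled out explicitly. This is a sketch with hypothetical names (payload, ready, publish, try_consume). On x86-64, GCC and Clang typically compile the seq_cst store to xchg (or mov plus mfence) and the acquire load to a plain mov, which matches the shape of the Go runtime code above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static int32_t payload;  /* plain data, plays the role of `a` */
static atomic_int ready; /* flag, plays the role of `b` */

/* Writer: plain store to the payload, then a sequentially consistent
 * store to the flag. The flag store also acts as a release, so the
 * payload write cannot become visible after it. Single-shot by design. */
static void publish(int32_t value) {
    assert(atomic_load_explicit(&ready, memory_order_relaxed) == 0);
    payload = value;
    atomic_store_explicit(&ready, 1, memory_order_seq_cst);
}

/* Reader: an acquire load of the flag. If it observes 1, the payload
 * written before the flag store is guaranteed to be visible. */
static bool try_consume(int32_t *out) {
    if (atomic_load_explicit(&ready, memory_order_acquire) != 1)
        return false;
    *out = payload;
    return true;
}
```

A reader thread would call try_consume in a loop, like the for loop in try(). With plain non-atomic accesses instead, the C compiler would be free to reorder or hoist the accesses, even though the x86 hardware itself would not reorder the two stores.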

Compare user-inputted string/character to another string/character

So I'm a bit of a beginner to ARM Assembly (assembly in general, too). Right now I'm writing a program and one of the biggest parts of it is that the user will need to type in a letter, and then I will compare that letter to some other pre-inputted letter to see if the user typed the same thing.
For instance, in my code I have
.balign 4 /* Forces the next data declaration to be on a 4 byte segment */
dime: .asciz "D\n"
at the top of the file and
addr_dime : .word dime
at the bottom of the file.
Also, based on what I've been reading online I put
.balign 4
inputChoice: .asciz "%d"
at the top of the file, and put
inputVal : .word 0
at the bottom of the file.
Near the middle of the file (just trust me that there is something wrong with this standalone code, and the rest of the file doesn't matter in this context) I have this block of code:
ldr r3, addr_dime
ldr r2, addr_inputChoice
cmp r2, r3 /*See if the user entered D*/
addeq r5, r5, #10 /*add 10 to the total if so*/
Which I THINK should load "D" into r3, load whatever String or character the user inputted into r2, and then add 10 to r5 if they are the same.
For some reason this doesn't work, and the add r5, r5, #10 only takes effect if I write it as addne instead of addeq.
addr_dime : .word dime is pointlessly over-complicated. The address is already a link-time constant. Storing the address in memory (at another location which has its own address) doesn't help you at all; it just adds another layer of indirection. (Which is actually the source of your problem.)
Anyway, cmp doesn't dereference its register operands, so you're comparing pointers. If you single-step with a debugger, you'll see that the values in registers are pointers.
To load the single byte at dime, zero-extended into r3, do
ldrb r3, dime
Using ldr to do a 32-bit load would also get the \n byte, and a 32-bit comparison would have to match that too for eq to be true.
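The difference between comparing addresses and comparing the bytes they point to can be shown in a few lines of C. This is only an illustration: same_address and same_first_byte are hypothetical helpers, standing in for cmp on two pointer registers and for ldrb followed by cmp, respectively.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static const char dime[] = "D\n";

/* What "cmp r2, r3" does when both registers hold addresses. */
static bool same_address(const char *a, const char *b) {
    return a == b;
}

/* What "ldrb" on each pointer followed by "cmp" does: compare one byte. */
static bool same_first_byte(const char *a, const char *b) {
    assert(a != NULL && b != NULL);
    return a[0] == b[0];
}
```

With a user buffer holding "D", same_address(dime, input) is false because the two strings live at different addresses, while same_first_byte(dime, input) is true. A full 32-bit comparison of the two buffers would still differ, since dime also contains the newline byte.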
But this can only work if dime is close enough for a PC-relative addressing mode to fit; like most RISC machines, ARM can't use arbitrary absolute addresses because the instruction-width is fixed.
For the constant, the easiest way to avoid that is not to store it in memory in the first place. Use .equ dime, 'D' to define a numeric constant, then you can use
cmp r2, #dime @ compare with immediate operand
Or ldr r3, =dime to ask the assembler to get the constant into a register for you. You can do this with addresses, so you could do
ldr r2, =inputVal @ r2 = &inputVal
ldrb r2, [r2] @ load first byte of inputVal
This is the generic way to handle loading from static data that might be too far away for a PC-relative addressing mode.
You could avoid that by using a stack address (sub sp, #16 / mov r5, sp or something). Then you already have the address in a register.
This is exactly what a C compiler does:
char dime[4] = "D\n";
char input[4] = "xyz";
int foo(int start) {
if (dime[0] == input[0])
start += 10;
return start;
}
From ARM32 gcc6.3 on the Godbolt compiler explorer:
foo:
ldr r3, .L4 @ load a pointer to the data section at dime / input
ldrb r2, [r3]
ldrb r3, [r3, #4]
cmp r2, r3
addeq r0, r0, #10
bx lr
.L4:
# gcc created this "literal pool" next to the code
# holding a pointer it can use to access the data section,
# wherever the linker ends up putting it.
.word .LANCHOR0
.section .data
.p2align 2
### These are in a different section, near each other.
### On Godbolt, click the .text button to see full assembler directives.
.LANCHOR0: @ actually defined with a .set directive, but same difference.
dime:
.ascii "D\012\000"
input:
.ascii "xyz\000"
Try changing the C to compare with a literal character instead of a global the compiler can't optimize into a constant, and see what you get.

virtual to physical address conversion in linux kernel

The following is used to translate a virtual address to a physical address in the Linux kernel. But what does it mean?
I have very limited knowledge of assembly.
#define __pv_stub(from,to,instr,type) \
__asm__("# __pv_stub\n" \
"1: " instr " %0, %1, %2\n" \
" .pushsection .pv_table,\"a\"\n" \
" .long 1b\n" \
" .popsection\n" \
: "=r" (to) \
: "r" (from), "I" (type))
It's not really "assembly" as there is no instruction in this macro per se.
It's just a macro which inserts instr (an instruction passed to the macro), which has one input operand from, one immediate (constant) input operand type and an output operand to.
There is also the part between pushsection and popsection, which records the address of this instruction in a specific binary section, pv_table. That allows the kernel to find these places in its code if it wishes to.
The last part is the asm constraints and operands. It lists what the compiler will replace %0, %1 and %2 with. %0 is the first operand listed ("=r" (to)): it will be any general-purpose register, used as an output operand whose value is stored in the macro argument to. The other two are similar except that they are input operands: from is a register so it gets "r", while type is an immediate so it gets "I" (on ARM, a constant that is valid as an immediate in a data-processing instruction).
See http://gcc.gnu.org/onlinedocs/gcc-4.8.1/gcc/Extended-Asm.html#Extended-Asm for details
So consider this code from the kernel (http://lxr.linux.no/linux+v3.9.4/arch/arm/include/asm/memory.h#L172)
static inline unsigned long __virt_to_phys(unsigned long x)
{ unsigned long t;
__pv_stub(x, t, "add", __PV_BITS_31_24);
return t;
}
__pv_stub will be equivalent to t = x + __PV_BITS_31_24 (instr == add, from == x, to == t, type == __PV_BITS_31_24)
So you might wonder why anybody would do such a complicated thing instead of just writing t = x + __PV_BITS_31_24 in the code.
The reason is the pv_table I mentioned above. The address of all these statements is recorded in a specific elf section. Under some circumstances, the kernel patches these instructions at runtime (so needs to be able to easily find all of them) hence the need for a table.
The ARM port does exactly that here: http://lxr.linux.no/linux+v3.9.4/arch/arm/kernel/head.S#L541
It's used only if CONFIG_ARM_PATCH_PHYS_VIRT is enabled when compiling the kernel:
CONFIG_ARM_PATCH_PHYS_VIRT:
Patch phys-to-virt and virt-to-phys translation functions at
boot and module load time according to the position of the
kernel in system memory.
This can only be used with non-XIP MMU kernels where the base
of physical memory is at a 16MB boundary, or theoretically 64K
for the MSM machine class.
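The idea of recording every translation site in a table and patching them all once the real offset is known can be modelled in portable C. This is only a sketch of the concept, not the kernel's implementation: the real code patches the immediate field of the add/sub instructions listed in .pv_table, whereas here each "site" is a struct holding the value to add, and pv_site, pv_table, pv_register, pv_fixup and virt_to_phys_model are hypothetical names. 32-bit arithmetic is used to mirror ARM32 wrap-around.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One patchable site: stands in for the immediate of an add instruction. */
struct pv_site {
    uint32_t imm;
};

#define PV_MAX_SITES 8
static struct pv_site *pv_table[PV_MAX_SITES]; /* stand-in for .pv_table */
static size_t pv_count;

/* Record a site, like the ".long 1b" entry emitted by __pv_stub. */
static void pv_register(struct pv_site *site) {
    assert(pv_count < PV_MAX_SITES);
    pv_table[pv_count++] = site;
}

/* Boot-time fixup: walk the table and patch every recorded site. */
static void pv_fixup(uint32_t delta) {
    for (size_t i = 0; i < pv_count; i++)
        pv_table[i]->imm = delta;
}

/* A translation site whose immediate is a placeholder until patched. */
static struct pv_site v2p_site;

static uint32_t virt_to_phys_model(uint32_t x) {
    return x + v2p_site.imm; /* "add %0, %1, %2" with a patched %2 */
}
```

For example, with a virtual base of 0xC0000000 and a physical base of 0x80000000 (both assumed values), calling pv_fixup(0x80000000u - 0xC0000000u) makes virt_to_phys_model(0xC0001000u) return 0x80001000 without the translation code ever loading the offset from memory.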

Resources