what is "wxd" in rocketcore? - riscv

In the rocket core bypass logic
val bypass_sources = IndexedSeq(
  (Bool(true), UInt(0), UInt(0)), // treat reading x0 as a bypass
  (ex_reg_valid && ex_ctrl.wxd, ex_waddr, mem_reg_wdata),
  (mem_reg_valid && mem_ctrl.wxd && !mem_ctrl.mem, mem_waddr, wb_reg_wdata),
  (mem_reg_valid && mem_ctrl.wxd, mem_waddr, dcache_bypass_data))
What do ex_ctrl.wxd and mem_ctrl.wxd stand for?

As I understand it, wxd (plausibly short for "write x-register destination", the x registers being RISC-V's integer registers) is set for an instruction that writes a value to a register, i.e. produces a result value and so writes to the register file. Some reasonably simple decode logic (e.g. a test for an R-type instruction) identifies whether each instruction is such a writer or not.
Also as I understand it, ex_ctrl and mem_ctrl refer to the instructions currently in the ex and mem pipeline stages, respectively, so ex_ctrl.wxd is set when the instruction in the ex stage is one that writes to a register (even though it won't do the write until the wb stage).
Background
The rocket microarchitecture suspends reading coprocessor results when wxd is asserted for the instruction in the wb pipeline stage: reading a coprocessor result means writing a processor register, i.e. a write to the processor's register file, so processor instructions get priority over coprocessor instructions. A coprocessor result value is only transferred into the processor register file when wxd is false (meaning the processor instruction in wb won't write).
This mechanism limits the number of ports needed to write the register file.
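For intuition, here is a minimal C model of how such a bypass network selects a source. The struct, the names, and the "later entries win" priority rule are illustrative assumptions, not the actual rocket implementation:

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;  /* stage holds a valid instruction with wxd set */
    uint8_t  waddr;  /* destination x-register number */
    uint64_t wdata;  /* value that stage can forward */
} bypass_source_t;

/* Read register 'raddr', taking the value from an in-flight producer
 * if one exists, else from the register file. We assume later entries
 * in 'sources' are the younger producers and therefore win. */
uint64_t bypass_read(const bypass_source_t *sources, int n,
                     uint8_t raddr, const uint64_t *regfile)
{
    if (raddr == 0)
        return 0;                      /* reading x0 is treated as a bypass of 0 */
    uint64_t value = regfile[raddr];
    for (int i = 0; i < n; i++)
        if (sources[i].valid && sources[i].waddr == raddr)
            value = sources[i].wdata;  /* last match (youngest producer) wins */
    return value;
}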

Related

Rust Multi-threading Memory Allocation on the RP Pico/RP2040

I'm working with the Raspberry Pi Pico to perform the basic task of reading data from a UART signal, modifying it, and writing it back out to a different UART address. However, I need to simultaneously be constantly monitoring an on-board sensor and sending the values it generates as well.
I found a good example at cortexm-threads but it performs some stack allocation like this:
let mut stack1 = [0xDEADBEEF; 512];
let mut stack2 = [0xDEADBEEF; 512];
How do I know (or find out) what memory addresses I can allocate the stacks to on the RP2040/Pico?
In the example, 0xDEADBEEF denotes the initial value of each element of the stack1 and stack2 arrays, and can be set to anything. Since those arrays are function-local non-constants/non-statics, they will end up on the (main thread) stack.
Just make sure that the arrays are large enough for your use case; otherwise you risk a stack overflow (How does a "stack overflow" occur and how do you prevent it?).
Regarding where those variables end up in the device memory space: cortex-m sets the initial SP value to the largest possible memory address (0x20040000 on the Pico; RP2040 SRAM is located between 0x20000000 and 0x20040000, sized 256 kB). See https://github.com/rust-embedded/cortex-m/blob/657af97d66b7157d6a6e5704d86dd59b398e7108/cortex-m-rt/link.x.in#L63. The location of those variables will therefore be close to the end of SRAM. See also https://docs.rust-embedded.org/embedonomicon/memory-layout.html
Regarding the multicore use case, see also https://github.com/rp-rs/rp-hal/blob/427344667e9f24f03d132fa08e2dfaa709bc805d/rp2040-hal/src/multicore.rs.
You could also achieve similar functionality (using only one core) with an interrupt-driven approach: store each incoming UART byte into a circular buffer from the receive interrupt, handle the on-board sensor read on a (DMA or timer) interrupt, and process the circular buffer (and any pending sensor values) in the idle loop, as sketched below. For more information, see https://en.wikipedia.org/wiki/Circular_buffer and https://rtic.rs/.
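A minimal single-producer/single-consumer ring buffer, sketched in C for brevity (the concept carries over directly to Rust; all names here are made up):

#include <stdint.h>

#define BUF_SIZE 256                  /* power of two keeps the index math cheap */

static volatile uint8_t  buf[BUF_SIZE];
static volatile uint16_t head, tail;  /* head: ISR writes, tail: idle loop reads */

/* Called from the UART receive interrupt with the freshly received byte. */
void uart_rx_isr(uint8_t byte)
{
    uint16_t next = (head + 1) & (BUF_SIZE - 1);
    if (next != tail) {               /* drop the byte if the buffer is full */
        buf[head] = byte;
        head = next;
    }
}

/* Called from the idle loop; returns -1 when no data is pending. */
int buffer_pop(void)
{
    if (tail == head)
        return -1;
    uint8_t byte = buf[tail];
    tail = (tail + 1) & (BUF_SIZE - 1);
    return byte;
}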

Is it possible to {activate|de-activate} SIGFPE generation only for some specific C++11 code segments?

I am writing a path tracer in C++11, on Linux, for the numerical simulation of light transport and I am using
#include <fenv.h>
...
feenableexcept( FE_INVALID   |
                FE_DIVBYZERO |
                FE_OVERFLOW  |
                FE_UNDERFLOW );
in order to catch and debug any numerical errors that may occur during execution.
At some point in the code I have to compute the intersection of rays (line segments) against axis-aligned bounding boxes (AABBs). For this computation I am using a very optimized and robust ray-box intersection algorithm which relies on the generation of some special values (e.g. NaN and inf) described in the IEEE 754 standard. Obviously, I am not interested in catching floating point exceptions generated specifically by this ray-box intersection routine.
Thus, my questions are:
Is it possible to deactivate the generation of floating point exception signals (SIGFPE) for only some sections of the code (i.e. for the ray-box intersection code section)?
When we are calculating simulations we are very concerned about performance. If it is possible to suppress exception signals only for specific code sections, can this be done at compile time (i.e. instrumenting/de-instrumenting code during its generation, so that we could avoid expensive function calls)?
Thank you for any help!
UPDATE
It is possible to instrument/de-instrument code through the feenableexcept and fedisableexcept function calls (actually, I posted this question because I was not aware of fedisableexcept, only feenableexcept... shame on me!). For instance:
#define _GNU_SOURCE  // feenableexcept/fedisableexcept are GNU extensions
#include <fenv.h>

int main() {
    volatile float a = 1.0f;  // volatile keeps the divisions from being constant-folded

    fedisableexcept(FE_DIVBYZERO);  // disable div by zero catching
    // generates an inf that **won't be** caught
    float c = a / 0.0f;

    feenableexcept(FE_DIVBYZERO);   // enable div by zero catching
    // generates an inf that **will be** caught
    float d = a / 0.0f;

    return 0;
}
Standard C++ does not provide any way to mark code at compile-time as to whether it should run with floating-point trapping enabled or disabled. In fact, support for manipulating the floating-point environment is not required by the standard, so whether an implementation has it at all is implementation-dependent. Any answer beyond standard C++ depends on the particular hardware and software you are using, but you have not reported that information.
On typical processors, enabling and disabling floating-point trapping is achieved by changing a processor control register. You do not need a function call to do this, but it is not the function call that is expensive, as you suggest in your question. The actual instruction may consume time as it may require the processor to serialize instruction execution. (Modern processors may have hundreds of instructions executing at the same time—some being decoded, some waiting for a subunit within the processor, some in various stages of calculation, some waiting to write their results to general registers, and so on. When changing a control register, the processor may have to wait for all currently executing instructions to finish, then change the register, then start executing new instructions.) If your hardware behaves this way, there is no way to get around it. (With such hardware, which is common, it is not possible to compile code to run with or without trapping without actually executing the run-time instruction to change the control register.)
You might be able to mitigate the time cost by batching path-tracing calculations, so they are performed in groups with only two changes to the floating-point control register (one to turn traps off, one to turn them on) for the entire group.
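A hedged sketch of that batching, using the same glibc calls as above (intersect_ray_box() is a stand-in for your actual routine):

#define _GNU_SOURCE
#include <fenv.h>

void intersect_ray_box(int i);  /* stand-in for the real intersection routine */

void trace_batch(int nrays)
{
    fedisableexcept(FE_ALL_EXCEPT);      /* first control-register write */
    for (int i = 0; i < nrays; i++)
        intersect_ray_box(i);            /* NaN/inf tricks won't trap here */
    feenableexcept(FE_INVALID | FE_DIVBYZERO |
                   FE_OVERFLOW | FE_UNDERFLOW);  /* second and last write */
}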

What happens when you execute an instruction that your CPU does not support?

What happens if a CPU attempts to execute a binary that has been compiled with some instructions that your CPU doesn't support. I'm specifically wondering about some of the new AVX instructions running on older processors.
I'm assuming this can be tested for, and a friendly message could in theory be displayed to a user. Presumably most low level libraries will check this on your behalf. Assuming you didn't make this check, what would you expect to happen? What signal would your process receive?
A new instruction can be designed to be "legacy compatible" or not.
To the former class belong instructions like tzcnt or xacquire, whose encodings are valid instructions on older architectures: tzcnt is encoded as rep bsf and xacquire is just repne. The semantics are different, of course.
To the latter class belong the majority of new instructions, AVX being one popular example.
When the CPU encounters an invalid or reserved encoding it generates the #UD (for UnDefined) exception - that's interrupt number 6.
The Linux kernel sets the IDT entry for #UD early in entry_64.S:
idtentry invalid_op do_invalid_op has_error_code=0
the entry points to do_invalid_op that is generated with a macro in traps.c:
DO_ERROR(X86_TRAP_UD, SIGILL, "invalid opcode", invalid_op)
the macro DO_ERROR generates a function that calls do_error_trap in the same file.
do_error_trap uses fill_trap_info (also in traps.c) to create a siginfo_t structure containing the Linux signal information:
case X86_TRAP_UD:
    sicode = ILL_ILLOPN;
    siaddr = uprobe_get_trap_addr(regs);
    break;
from there the following calls happen:
do_trap in traps.c
force_sig_info in signal.c
specific_send_sig_info in signal.c
that ultimately culminates in calling the signal handler for SIGILL of the offending process.
The following program is a very simple example that generates a #UD:

BITS 64
GLOBAL _start
SECTION .text

_start:
    ud2
we can use strace to check the signal received by running that program
--- SIGILL {si_signo=SIGILL, si_code=ILL_ILLOPN, si_addr=0x400080} ---
+++ killed by SIGILL +++
as expected.
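A process can also install its own handler and observe the signal itself. A minimal sketch in C; __builtin_trap() is a GCC/Clang builtin that emits ud2 on x86, and the handler just reports and exits, since returning would re-execute the faulting instruction:

#include <signal.h>
#include <unistd.h>

static void on_sigill(int sig)
{
    (void)sig;
    static const char msg[] = "caught SIGILL\n";
    write(STDERR_FILENO, msg, sizeof msg - 1);  /* async-signal-safe */
    _exit(1);  /* returning would retry the ud2 and fault again */
}

int main(void)
{
    signal(SIGILL, on_sigill);
    __builtin_trap();  /* GCC/Clang emit ud2 here, raising #UD */
    return 0;
}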
As Cody Gray commented, libraries don't usually rely on SIGILL; instead they use a CPU dispatcher or check for the presence of an instruction explicitly (see the sketch below).
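For illustration, a minimal sketch of that explicit-check-plus-dispatch pattern. The kernel names are made up; __builtin_cpu_init() and __builtin_cpu_supports() are real GCC/Clang builtins that query CPUID:

#include <stdio.h>

/* Two versions of the same routine; in a real build the AVX one would
 * contain AVX code and be compiled with -mavx in its own translation unit. */
static void kernel_scalar(void) { puts("scalar kernel"); }
static void kernel_avx(void)    { puts("AVX kernel"); }

int main(void)
{
    __builtin_cpu_init();  /* populate the CPU feature flags */

    /* Dispatch once, then call through the pointer ever after: the process
     * never executes an AVX instruction on a CPU that lacks AVX. */
    void (*kernel)(void) =
        __builtin_cpu_supports("avx") ? kernel_avx : kernel_scalar;
    kernel();
    return 0;
}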

instruction set emulator guide

I am interested in writing emulators like for gameboy and other handheld consoles, but I read the first step is to emulate the instruction set. I found a link here that said for beginners to emulate the Commodore 64 8-bit microprocessor, the thing is I don't know a thing about emulating instruction sets. I know mips instruction set, so I think I can manage understanding other instruction sets, but the problem is what is it meant by emulating them?
NOTE: If someone can provide me with a step-by-step guide to instruction set emulation for beginners, I would really appreciate it.
NOTE #2: I am planning to write in C.
NOTE #3: This is my first attempt at learning the whole emulation thing.
Thanks
EDIT: I found this site that is a detailed step-by-step guide to writing an emulator which seems promising. I'll start reading it, and hope it helps other people who are looking into writing emulators too.
Emulator 101
An instruction set emulator is a software program that reads binary data (the program image) and carries out the instructions that data contains, as if it were a physical microprocessor accessing physical memory.
The Commodore 64 used a 6502 Microprocessor. I wrote an emulator for this processor once. The first thing you need to do is read the datasheets on the processor and learn about its behavior. What sort of opcodes does it have, what about memory addressing, method of IO. What are its registers? How does it start executing? These are all questions you need to be able to answer before you can write an emulator.
Here is a general overview of what it would look like in C (not 100% accurate):
#include <stdint.h>

uint8_t RAM[65536];  //Declare a memory buffer for emulated RAM (64k)
uint16_t A;          //Declare Accumulator
uint16_t X;          //Declare X register
uint16_t Y;          //Declare Y register
uint16_t PC = 0;     //Declare Program counter, start executing at address 0
uint16_t FLAGS = 0;  //Start with all flags cleared

//Return 1 if the carry flag is set, 0 otherwise; in this example the 3rd bit
//is the carry flag (not true for the actual 6502)
#define CARRY_FLAG(flags) ((0x4 & (flags)) >> 2)

#define ADC 0x69      //Add with carry, immediate addressing
#define LDA 0xA9      //Load accumulator, immediate addressing
#define ADC_SIZE 2    //Opcode byte plus one operand byte
#define LDA_SIZE 2

//Inside the emulator's main loop ('executing' is a flag you maintain):
while (executing) {
    switch (RAM[PC]) {    //Grab the opcode at the program counter
    case ADC:             //Add with carry
        A = A + RAM[PC+1] + CARRY_FLAG(FLAGS);
        UpdateFlags(A);
        PC += ADC_SIZE;
        break;
    case LDA:             //Load accumulator
        A = RAM[PC+1];
        UpdateFlags(A);
        PC += LDA_SIZE;
        break;
    default:
        //Invalid opcode!
        break;
    }
}
According to this reference, ADC actually has 8 opcodes on the 6502, which means you will have 8 different ADC cases in your switch statement, one for each memory addressing scheme. You will have to deal with endianness (byte order), and of course pointers. I would get a solid understanding of pointers and type casting in C if you don't already have one. To manipulate the flags register you need a solid understanding of bitwise operations in C. If you are clever you can use C macros and even function pointers to save yourself some work, as in the CARRY_FLAG example above.
Every time you execute an instruction, you must advance the program counter by the size of that instruction, which is different for each opcode. Some opcodes don't take any arguments, so their size is just 1 byte, while others take a one-byte immediate (as in my LDA example above) or a 16-bit address. All this should be pretty well documented.
Branch instructions (JMP, JE, JNE, etc.) are simple: if some flag is set in the flags register, load the specified address into the PC. This is how "decisions" are made in a microprocessor, and emulating them is simply a matter of changing the PC, just as the real microprocessor would do.
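For example, here is how a 6502 conditional branch could look as another case in the switch above. 0xF0 is the real opcode for BEQ with relative addressing, but ZERO_FLAG is a hypothetical macro in the style of CARRY_FLAG:

#define BEQ 0xF0   //6502 "branch if equal", relative addressing

case BEQ: {
    int8_t offset = (int8_t)RAM[PC + 1];  //signed 8-bit displacement
    PC += 2;                              //advance past opcode + operand first
    if (ZERO_FLAG(FLAGS))                 //taken only when the zero flag is set
        PC += offset;
    break;
}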
The hardest part about writing an instruction set emulator is debugging. How do you know if everything is working as it should? There are plenty of resources to help you. People have written test programs that will help you debug every instruction. You can execute them one instruction at a time and compare against the reference output. If something is different, you know you have a bug somewhere and can fix it.
This should be enough to get you started. The important thing is that you have A) A good solid understanding of the instruction set you want to emulate and B) a solid understanding of low level data manipulation in C, including type casting, pointers, bitwise operations, byte order, etc.

How to create a large file on a VFAT partition efficiently in embedded Linux

I'm trying to create a large empty file on a VFAT partition by using the `dd' command in an embedded linux box:
dd if=/dev/zero of=/mnt/flash/file bs=1M count=1 seek=1023
The intention was to skip the first 1023 blocks and write only 1 block at the end of the file, which should be very quick on a native EXT3 partition, and it indeed is. However, this operation turned out to be quite slow on a VFAT partition, along with the following message:
lowmem_shrink:: nr_to_scan=128, gfp_mask=d0, other_free=6971, min_adj=16
// ... more `lowmem_shrink' messages
Another attempt was to fopen() a file on the VFAT partition and then fseek() to the end to write the data, which has also proved slow, along with the same messages from the kernel.
So basically, is there a quick way to create the file on the VFAT partition (without traversing the first 1023 blocks)?
Thanks.
Why are VFAT "skipping" writes so slow?
Unless the VFAT filesystem driver were made to "cheat" in this respect, creating large files on FAT-type filesystems will always take a long time. To comply with the FAT specification, the driver has to allocate all data blocks and zero-initialize them, even if you "skip" the writes. That's because of the "cluster chaining" FAT does.
The reason for that behaviour is FAT's inability to support either:
- UN*X-style "holes" in files (aka "sparse files"): that's what you're creating on ext3 with your testcase, a file with no data blocks allocated to the first 1GB-1MB of it, and a single 1MB chunk of actually committed, zero-initialized blocks at the end.
- NTFS-style "valid data length" information: on NTFS, a file can have uninitialized blocks allocated to it, but the file's metadata keeps two size fields, one for the total size of the file and another for the number of bytes actually written to it (from the beginning of the file).
Without support for either technique, the filesystem always has to allocate and zero-fill all "intermediate" data blocks if you skip a range.
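To make the ext3 behaviour concrete, here is a minimal sketch of what the dd testcase does under the hood, using lseek() to leave a hole (path and offset taken from the question; the final write is trimmed to one byte):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/flash/file", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Jump 1023 MiB past the start. On ext3 this leaves a hole with no
     * data blocks allocated; on VFAT the driver must zerofill the range. */
    if (lseek(fd, 1023L * 1024 * 1024, SEEK_SET) == (off_t)-1) {
        perror("lseek");
        return 1;
    }

    char byte = 0;        /* dd writes a whole 1 MiB block here; one byte */
    write(fd, &byte, 1);  /* is enough to extend the file to its full size */
    close(fd);
    return 0;
}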
Also remember that on ext3, the technique you used does not actually allocate blocks to the file (apart from the last 1MB). If you require the blocks preallocated (not just the size of the file set large), you'll have to perform a full write there as well.
How could the VFAT driver be modified to deal with this ?
At the moment, the driver uses the Linux kernel function cont_write_begin() to begin any write to a file, even an asynchronous one; this function looks like:
/*
 * For moronic filesystems that do not allow holes in file.
 * We may have to extend the file.
 */
int cont_write_begin(struct file *file, struct address_space *mapping,
                     loff_t pos, unsigned len, unsigned flags,
                     struct page **pagep, void **fsdata,
                     get_block_t *get_block, loff_t *bytes)
{
    struct inode *inode = mapping->host;
    unsigned blocksize = 1 << inode->i_blkbits;
    unsigned zerofrom;
    int err;

    err = cont_expand_zero(file, mapping, pos, bytes);
    if (err)
        return err;

    zerofrom = *bytes & ~PAGE_CACHE_MASK;
    if (pos+len > *bytes && zerofrom & (blocksize-1)) {
        *bytes |= (blocksize-1);
        (*bytes)++;
    }

    return block_write_begin(mapping, pos, len, flags, pagep, get_block);
}
That is a simple strategy but also a pagecache trasher (your log messages are a consequence of the call to cont_expand_zero(), which does all the work and is not asynchronous). If the filesystem were to split the two operations (one task to do the "real" write, another to do the zero filling), it'd appear snappier.
The way this could be achieved while still using the default Linux filesystem utility interfaces would be to internally create two "virtual" files: one for the to-be-zerofilled area and another for the actually-to-be-written data. The real file's directory entry and FAT cluster chain would only be updated once the background task is complete, by linking its last cluster with the first one of the "zerofill file" and the last cluster of that one with the first one of the "actual write file". One would also want to use a direct-IO write for the zerofilling, in order to avoid trashing the pagecache.
Note: while all this is technically possible for sure, the question is how worthwhile such a change would be. Who needs this operation all the time? What would the side effects be?
The existing (simple) code is perfectly acceptable for smaller skipping writes; you won't really notice its presence if you create a 1MB file and write a single byte at the end. It'll bite you only if you go for file sizes on the order of the limits of what the FAT filesystem allows you to do.
Other options ...
In some situations, the task at hand involves two (or more) steps:
- freshly format (e.g.) an SD card with FAT
- put one or more big files onto it to "pre-fill" the card
- (app-dependent, optional) pre-populate the files, or put a loopback filesystem image into them
In one of the cases I've worked on, we folded the first two steps, i.e. modified mkdosfs to pre-allocate/pre-create files when making the (FAT32) filesystem. That's pretty simple: when writing the FAT tables, just create allocated cluster chains instead of clusters filled with the "free" marker. It also has the advantage that the data blocks are guaranteed to be contiguous, in case your app benefits from this. And you can decide to make mkdosfs not clear the previous contents of the data blocks. If you know, for example, that one of your preparation steps involves writing the entire data anyway, or doing ext3-in-file-on-FAT (a pretty common thing: Linux appliance, SD card for data exchange with a Windows app/GUI), then there's no need to zero out anything / double-write (once with zeroes, once with whatever else). If your usecase fits this (i.e. formatting the card is a useful/normal step of the "initialize it for use" process anyway), then try it out; a suitably-modified mkdosfs is part of TomTom's dosfstools sources, see mkdosfs.c and search for the -N command line option handling.
When talking about preallocation, as mentioned, there's also posix_fallocate(). Currently on Linux with FAT, this does essentially the same as a manual dd ..., i.e. waits for the zerofill. But the specification of the function doesn't mandate that it be synchronous. The block allocation (FAT cluster chain generation) would have to be done synchronously, but the VFAT on-disk dirent size update and the data block zerofills could be backgrounded/delayed (i.e. either done at low priority in the background, or only done if explicitly requested via fdatasync()/sync(), so that the app can e.g. allocate blocks, then write the contents with non-zeroes itself...). That's technique/design though; I'm not aware of anyone having made that kernel modification yet, if only for experimenting.
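For completeness, this is how an application would request the preallocation (posix_fallocate() is a standard POSIX call; the path is the one from the question, and note that it returns an error number rather than setting errno):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/flash/file", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Reserve 1 GiB; on VFAT today this blocks while the driver
     * zerofills the whole cluster chain, just like the dd command. */
    int err = posix_fallocate(fd, 0, (off_t)1 << 30);
    if (err)
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));

    close(fd);
    return err ? 1 : 0;
}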
