I have this table of S-format instructions. Can you explain to me what imm[11:5] and funct3 are? I know the 3 in funct3 indicates its size in bits, and sometimes the field is 000 or 010, but I don't know exactly why it's there. Also, imm[11:5] is 7 bits of all 0s.
Please help!
imm[4:0] and imm[11:5] denote closed intervals into the bit representation of the immediate operand.
The S-format is used to encode store instructions, i.e.:
sX rs2, offset(rs1)
There are different types of store instructions, e.g. store-byte (sb), store-half-word (sh), store-word (sw), etc. The funct3 part is used to encode the type (e.g. 0b000 -> sb, 0b010 -> sw, 0b011 -> sd, etc.). This allows using just one (major) opcode while still having multiple types of store instructions, instead of having to waste several (major) opcodes. In other words, funct3 encodes the minor opcode of the instruction.
The immediate operand encodes the offset. If you ask yourself why it's split like this: it increases the similarity of the remaining parts of the encoding with the other instruction formats. For example, the opcode, rs1, and funct3 parts are located at the exact same place in the R-type, I-type, and B-type instruction formats. The rs2 part placement is shared with the R-type and B-type instruction formats. Those similarities help to simplify the instruction decoder.
That means the offset is 12 bits wide; in pseudo-code:
offset = sign_ext(imm[11:5] << 5 | imm[4:0])
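To make the field positions concrete, here is a minimal Rust sketch (not from the spec; the function name and return layout are just for illustration) that pulls funct3 and the two immediate pieces out of a raw S-format instruction word and reassembles the sign-extended offset:

    // Minimal sketch: extract the S-format fields of a 32-bit RISC-V instruction.
    // Bit positions follow the S-format layout described above.
    fn decode_s_format(insn: u32) -> (u32, u32, u32, i32) {
        let funct3   = (insn >> 12) & 0b111;      // minor opcode (sb/sh/sw/sd, ...)
        let rs1      = (insn >> 15) & 0b1_1111;   // base address register
        let rs2      = (insn >> 20) & 0b1_1111;   // register to be stored
        let imm_4_0  = (insn >> 7)  & 0b1_1111;   // imm[4:0]
        let imm_11_5 = (insn >> 25) & 0b111_1111; // imm[11:5]
        // Reassemble the 12-bit immediate and sign-extend it from bit 11.
        let imm12  = (imm_11_5 << 5) | imm_4_0;
        let offset = ((imm12 << 20) as i32) >> 20;
        (funct3, rs1, rs2, offset)
    }

    fn main() {
        // 0x0021A423 encodes "sw x2, 8(x3)": funct3 = 0b010, offset = 8.
        let (funct3, rs1, rs2, offset) = decode_s_format(0x0021A423);
        assert_eq!((funct3, rs1, rs2, offset), (0b010, 3, 2, 8));
    }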
See also the first figure in Section 2.6 (Load and Store Instructions) of the RISC-V Base specification (2019-06-08 ratified).
As the title implies, the question is regarding alignment of aggregate types in x86-64 on Linux.
In our lecture, the professor introduced alignment of structs (and of their elements) with the attached slide. Hence, I would assume (in accordance with Wikipedia and other lecture material) that for any aggregate type, the alignment is determined by its largest member. Unfortunately, this does not seem to be the case in a past exam question, which said:
"Assuming that each page table [4kB, each PTE 64b] is stored in memory
at a “naturally aligned” physical address (i.e. an address which is an
integer multiple of the size of the table), ..."
How come that for a page table (which afaik is basically an array of 8-byte values in memory), the alignment rule is not based on the largest element, but on the size of the whole table?
Clarification is greatly appreciated!
Felix
Why page tables are aligned on their size
For a given level of the virtual-address translation process, requiring the current page table to be aligned on its size in bytes speeds up the indexing operation.
The CPU doesn't need to perform an actual addition to find the base of the next-level page table; it can scale the index and then replace the lowest bits of the current-level base.
You can convince yourself this is indeed the case with a few examples.
It's no coincidence that x86 CPUs follow this alignment too.
For example, regarding the 4-level paging for 4KiB pages of the x86 CPUs, the Page Directory Pointer field of a 64-bit address is 9 bits wide.
Each entry in that table (a PDPTE) is 64 bits, so the table size is 512 * 8 = 4096 bytes (one 4 KiB page), and the last entry is at offset 511 * 8 = 4088 (0xff8 in hex, so at most 12 bits are used).
The address of a Page Directory Pointer table is given by a PML4 entry; these entries don't specify the lower 12 bits of the base (those bits are used for other purposes), only the upper bits.
The CPU can then simply replace the lower 12 bits of the base from the PML4 entry with the offset of the PDPTE, since we have seen that it fits in 12 bits.
This is fast and cheap to do in hardware (no carry, easy to do with registers).
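A toy model of that bit-replacement (plain Rust, arbitrary numbers, just to illustrate the claim above): with a 4 KiB-aligned table base, adding the scaled 9-bit index can never carry into the upper bits, so the addition degenerates into splicing in the low 12 bits.

    // Toy model: for a naturally aligned table, base + index * 8 equals
    // "replace the low 12 bits of base with index * 8" -- no carry propagates.
    fn entry_address(table_base: u64, index: u64) -> u64 {
        assert!(table_base % 4096 == 0); // naturally aligned 4 KiB table
        assert!(index < 512);            // 9-bit index, scaled by 8 -> 12 bits
        let by_addition   = table_base + index * 8;
        let by_bit_splice = (table_base & !0xFFF) | (index * 8);
        assert_eq!(by_addition, by_bit_splice);
        by_bit_splice
    }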
Assume that a country has ZIP codes made of two fields: a city code (C) and a block code (D), added together.
Also, assume that there can be at most 100 block codes for a given city, so D is 2 digits long.
Requiring that the city code is aligned on 100 (which means that the last two digits of C are zero) makes C + D like replacing the last two digits of C with D.
(1200 + 34 = 12|34).
Relation with the alignment of aggregates
A page table is not regarded as an aggregate, i.e. as an array of 8-byte elements. It is regarded as a type of its own, defined by the ISA of the CPU, that must satisfy the requirements of the particular part of the CPU that uses it.
The page walker finds it convenient to have page tables aligned on their size, so that is the requirement.
The alignment of aggregates is a set of rules used by the compiler to allocate objects in memory; it guarantees that every element's alignment requirement is satisfied, so that instructions can access any element without alignment penalties/faults.
The execution units for loads and stores are a different part of the CPU than the page walker, so they have different needs.
You should use the aggregate alignment rules to know how the compiler will align your structs, and then check whether that's enough for your use case.
Exceptions exist
Note that the professor went a long way in explaining what alignment on its natural boundary means for a page table.
Exceptions exist: if you are told that a datum must be aligned on X, you can assume there's some hardware trick/simplification involved and try to figure out which one, but in the end you just do the alignment and move on.
Margaret explained why page tables are special; I'm only answering this other part of the question:
"... according to the largest element."
That's not the rule for normal structs either. You want max(alignof(member)) not max(sizeof(member)). So "according to the most-aligned element" would be a better way to describe the required alignment of a normal struct.
e.g. in the i386 System V ABI, double has sizeof = 8 but alignof = 4, so alignof(struct S1) = 4 (see footnote 1).
Even if the char member had been last, sizeof(struct S1) still has to be padded to a multiple of its alignof(), so all the usual invariants are maintained (e.g. sizeof( array ) = N * sizeof(struct S1)), and so stepping by sizeof always gets you to a sufficiently-aligned boundary for the start of a new struct.
Footnote 1: That ABI was designed before CPUs could efficiently load/store 8 bytes at once. Modern compilers try to give double and [u]int64_t 8-byte alignment, e.g. as globals or locals outside of structs. But the ABI's struct layout rules fix the layout based on the minimum guaranteed alignment for any double or int64_t object, which is alignof(T) = 4 for those types.
x86-64 System V has alignof(T) = sizeof(T) for all the primitive types, including the 8-byte ones. This makes atomic operations on any properly-aligned int64_t possible, for example, simplifying the implementation of C++20 std::atomic_ref to not have to check for sufficient alignment. (Why is integer assignment on a naturally aligned variable atomic on x86?)
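As a quick check of the max(alignof(member)) rule on x86-64 System V, here is a small Rust sketch (repr(C) follows the platform C ABI; the members just assume S1 is roughly a char plus a double, as the discussion above implies):

    use std::mem::{align_of, size_of};

    // The struct's alignment is the max of its members' alignments, and its
    // size is padded up to a multiple of that alignment.
    #[repr(C)]
    struct S1 {
        c: u8,  // 1 byte, then 7 bytes of padding on x86-64
        d: f64, // alignof = 8 on x86-64 System V
    }

    fn main() {
        assert_eq!(align_of::<S1>(), align_of::<f64>()); // max(alignof(member)) = 8
        assert_eq!(size_of::<S1>(), 16);                 // padded to a multiple of 8
    }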
I'm writing some synthesizable Verilog. I need to create a value to use as a mask in a larger expression. This value is a sequence of 1's whose length is stored in some register:
buffer & {offset{1'h1}};
where buffer and offset are both registers. What I expect is for buffer to be ANDed with 11111... of width offset. However, the compiler says this is illegal in Verilog, since offset needs to be constant.
Instead, I wrote the following:
buffer & ~({WIDTH{1'h1}} << offset)
where WIDTH is a constant. This works. Both expressions are equivalent in terms of values, but obviously not in terms of the hardware that would be synthesized.
What's the difference?
The difference is that the rules for context-determined expressions (detailed in sections 11.6 and 11.8 of the IEEE 1800-2017 LRM) require the width of all operands of an expression to be known at compile time.
Your example is too simple to show where the complication arises, but let's say buffer was a 16-bit signed variable. To perform a bitwise-and (&), you first need to know the size of both operands, then extend the smaller operand to match the size of the larger operand. If we don't know what the size of {offset{1'h1}} is, we don't know whether it needs to be 0-extended, or buffer needs to be sign-extended.
Of course, the language could be defined to allow this to work, but synthesis tools would be creating a lot of unnecessary additional hardware. And if we start applying this to more complex expressions, trying to determine how the bit-widths propagate becomes unmanageable.
Both parts of your example use the replication operator. The operator requires a replication constant to specify how many times to replicate, so the first part of your example is illegal: offset must be a constant, not a reg.
Also, the constant is not the width of something, but the number of times the 1'h1 is repeated. So the second part of the example is syntactically correct.
I have a UTF-8 encoded string and I would like to iterate through it,
splitting it at one of multiple delimiters. I also need to know
which delimiter matched, as each delimiter has a specific meaning.
An example usage:
algorithm("one, two; three") => Match("one")
algorithm(", two; three") => Delimiter(",")
algorithm(" two; three") => Match(" two")
algorithm("; three") => Delimiter(";")
algorithm(" three") => Match(" three")
Additional information:
My delimiters are all single ASCII characters, so optimized
algorithms that require that are possible.
A solution that handles UTF-8 substrings would also be appreciated,
but isn't required.
I plan to call the method many times and potentially in a tight
loop, so an ideal algorithm would not need to allocate any memory.
The algorithm should return the first matching string or delimiter
and I can handle restarting the search on the next iteration.
An ideal algorithm would innately know if it is returning a match or
a delimiter, but it's possible to check that after the fact.
My target language is Rust, but I would appreciate answers in any
language with a similar lower-level focus. Pseudocode is fine as well,
as long as it recognizes the realities of UTF-8 text. Solutions that
use esoteric hex tricks or SIMD instructions are also suitable, but may require more explanation for me to understand ^_^.
For a processor-specific solution, x86-64 processors with SSE4.2 contain the PCMPxSTRx family of instructions. One of the modes available with these instructions is Equal Any:
arg1 is a character set, arg2 is the string to search in. IntRes1[i] is set to 1 if arg2[i] is in the set represented by arg1
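In scalar terms, the Equal Any aggregation computes something like the following (a plain Rust model of IntRes1 for one 16-byte chunk, not the actual instruction):

    // Scalar model of "Equal Any": bit i of the result is set when
    // haystack16[i] is one of the needle bytes (IntRes1 in Intel's docs).
    fn equal_any(needle: &[u8], haystack16: &[u8]) -> u16 {
        let mut int_res1 = 0u16;
        for (i, &b) in haystack16.iter().take(16).enumerate() {
            if needle.contains(&b) {
                int_res1 |= 1 << i;
            }
        }
        int_res1
    }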
The basic algorithm is straight-forward (a scalar sketch follows the list of steps):
Fill an XMM register with up to 16 single bytes to search for (the needle).
Set the count of needle bytes in rax.
Calculate the memory address of the start of the string, including an offset.
Set the count of haystack bytes in rdx.
Call PCMPxSTRx with the appropriate control byte.
Check the result in ecx or one of the condition-code flags.
If there was no match and there is still string left to search, increment the offset and loop.
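Here is the same loop sketched in scalar Rust, using the equal_any model above in place of the real PCMPxSTRx instruction (the function name is made up; it ignores the page-boundary issue discussed next):

    // Scalar sketch of the basic loop: scan 16 bytes at a time and return the
    // byte index of the first delimiter found, if any.
    fn find_first(needle: &[u8], haystack: &[u8]) -> Option<usize> {
        let mut offset = 0;
        while offset < haystack.len() {
            let end = haystack.len().min(offset + 16);
            let mask = equal_any(needle, &haystack[offset..end]);
            if mask != 0 {
                // Lowest set bit = first matching byte in this chunk.
                return Some(offset + mask.trailing_zeros() as usize);
            }
            offset += 16; // no match: advance and loop
        }
        None
    }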
There is a complication around page boundaries, however. Namely, the PCMPxSTRx instructions will always read 16 bytes of data. This can cause a segmentation fault if you read into a page of memory that is protected. A solution is to align all the reads to the end of the string, and handle the leftover bytes at the beginning. Before starting the above algorithm, use something like:
Mask the address of the start of the string with ~0xF. This clears the low four bits, rounding the address down to a 16-byte boundary.
Use a PCMPxSTRM instruction (with a similar setup as above algorithm) for the first 16 bytes. This returns a mask of matching characters. You can shift the mask to ignore leading characters that are not part of your string.
If there was no match and there is more string left to search, start the above algorithm.
You can see the complete example of this algorithm in my Rust library Jetscii. Inline assembly is used to call out to the PCMPxSTRx instructions.
This is my second question, one right after another, and it's also a problem with assembly (x86, 32-bit).
"Programming from the Ground Up" says that 4bytes are 32bits and that's a word.
But Intel's "Basic Architecture" guide says, that word is 16bits (2 bytes) and 4 bytes is a dualword.
Memory uses 4bytes words, to get to another word I have to skip next 4 bytes, on each word I can make 4 offsets (0-3) to read a byte, so it's wrong with Intel's name, but this memory definition goes from Intel, so what's there bad?
And how to operate on words, dualword, quadwords in assembly? How to define the number as quadword?
To answer your first question: the processor word size is a function of the architecture, so a 32-bit processor has a 32-bit word. In software types, including assembly, there is usually a need to identify sizes unambiguously, so for historical reasons the word type is 16 bits. So probably both sources are correct if you read them in context: the first one is referring to the processor word, while the Intel guide is referring to the word type.
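If it helps to make the Intel type names concrete, here is a tiny Rust snippet (the integer types are just stand-ins for the sizes; this is about terminology, not an API):

    use std::mem::size_of;

    fn main() {
        // Intel terminology: word = 2 bytes, doubleword = 4, quadword = 8.
        assert_eq!(size_of::<u16>(), 2); // word
        assert_eq!(size_of::<u32>(), 4); // doubleword
        assert_eq!(size_of::<u64>(), 8); // quadword
    }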
We've got different "word"s: program words, memory words, OS-specific words, architecture-specific words (program space word, flash word, eeprom word), even address words.
It's just a matter of convention what size the term "word" refers to.
I usually find the size of the word by looking at the number of hex digits the context is using to show them. Intel's most common type, 4 digits (0x0000), is two bytes.
And for further information, even byte is a convention. In many systems in the past bytes have been 7 or 9 bits. Most architectures nowadays have 8-bit bytes. The correct name for an always-8-bit structure is an octet.
I'm writing an interpreted 68k emulator as a personal/educational project. Right now I'm trying to develop a simple, general decoding mechanism.
As I understand it, the first two bytes of each instruction are enough to uniquely identify the operation (with two rare exceptions) and the number of words left to be read, if any.
Here is what I would like to accomplish in my decoding phase:
1. read two bytes
2. determine which instruction it is
3. extract the operands
4. pass the opcode and the operands on to the execute phase
I can't just pass the first two bytes into a lookup table like I could with the first few bits in a RISC arch, because operands are "in the way". How can I accomplish part 2 in a general way?
Broadly, my question is: How do I remove the variability of operands from the decoding process?
More background:
Here is a partial table from section 8.2 of the Programmer's Reference Manual:
Table 8.2. Operation Code Map
Bits 15-12 Operation
0000 Bit Manipulation/MOVEP/Immediate
0001 Move Byte
...
1110 Shift/Rotate/Bit Field
1111 Coprocessor Interface...
This made great sense to me, but then I looked at the bit patterns for each instruction and noticed that there isn't a single instruction where bits 15-12 are 0001, 0010, or 0011. There must be some big piece of the picture that I'm missing.
This Decoding Z80 Opcodes site explains decoding explicitly, which is something I haven't found in the 68k programmer's reference manual or by googling.
I've decided to simply create a look-up table with every possible pattern for each instruction. It was my first idea, but I discarded it as "wasteful, inelegant". Now, I'm accepting it as "really fast".
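For what it's worth, a sketch of that brute-force table in Rust (the Op variants and the decode_word helper are placeholders; a real build step would walk the whole opcode map from the Programmer's Reference Manual):

    // One table entry per possible first instruction word (2^16 = 65536).
    #[derive(Clone, Copy)]
    enum Op {
        Illegal,
        MoveByte,
        // ... one variant per instruction / addressing-mode combination
    }

    // Placeholder decoder: match on bits 15-12 first, then the remaining fields.
    fn decode_word(word: u16) -> Op {
        match word >> 12 {
            0b0001 => Op::MoveByte, // the "Move Byte" block from Table 8.2
            _ => Op::Illegal,
        }
    }

    fn build_table() -> Vec<Op> {
        (0u32..65536).map(|w| decode_word(w as u16)).collect()
    }

    fn main() {
        let table = build_table(); // slow-ish, but done only once at startup
        // Decode phase: step 2 becomes a single array index per instruction word.
        match table[0x1000] {
            Op::MoveByte => println!("move byte"),
            Op::Illegal => println!("illegal"),
        }
    }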