How is a RISC-V word operation performed?

In the case of a 64-bit processor, if a word (32-bit) operation is performed, will the 32nd bit be treated as a sign bit or as a magnitude bit?

The ADDW, SUBW instructions, for example, from riscv-spec-v2.2:
ADDW and SUBW are RV64I-only instructions that are defined analogously
to ADD and SUB but operate on 32-bit values and produce signed 32-bit
results. Overflows are ignored, and the low 32-bits of the result is
sign-extended to 64-bits and written to the destination register.
thus a sign bit. If you don’t want this, then don’t use the ‘w’ variants and make sure the upper register bits are zeros. It gets a bit unusual around the multiply instructions, where you have to manage the upper and lower halves explicitly.
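As a rough illustration, here is a minimal C sketch of the ADDW behaviour described in the spec excerpt (the function name addw_like and the sample values are mine, not from the spec):

#include <stdint.h>
#include <stdio.h>

/* Sketch of ADDW semantics: add the low 32 bits, ignore overflow,
   then sign-extend bit 31 into the upper half of the 64-bit result. */
static uint64_t addw_like(uint64_t rs1, uint64_t rs2)
{
    uint32_t low = (uint32_t)rs1 + (uint32_t)rs2;  /* wraps; overflow ignored */
    return (uint64_t)(int64_t)(int32_t)low;        /* bit 31 acts as the sign bit */
}

int main(void)
{
    /* 0x7fffffff + 1 wraps to 0x80000000; bit 31 is set, so the
       64-bit destination becomes 0xffffffff80000000. */
    printf("%016llx\n", (unsigned long long)addw_like(0x7fffffff, 1));
    return 0;
}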

Related

Conflicting alignment rules for structs vs. arrays

As the title implies, the question is regarding alignment of aggregate types in x86-64 on Linux.
In our lecture, the professor introduced alignment of structs (and the elements thereof) with the attached slide. Hence, I would assume (in accordance with Wikipedia and other lecture material) that for any aggregate type, the alignment is in accordance with its largest member. Unfortunately, this does not seem to be the case in a former exam question, which said:
"Assuming that each page table [4kB, each PTE 64b] is stored in memory
at a “naturally aligned” physical address (i.e. an address which is an
integer multiple of the size of the table), ..."
How come, for a page table (which afaik is basically an array of 8-byte values in memory), the alignment rule is not according to the largest element, but according to the size of the whole table?
Clarification is greatly appreciated!
Felix
Why page tables are aligned on their size
For a given level in the process of translating a virtual address, requiring the current page table to be aligned on its size in bytes speeds up the indexing operation.
The CPU doesn't need to perform an actual addition to find the base of the next-level page table; it can scale the index and then replace the lowest bits of the current level's base with it.
You can convince yourself this is indeed the case with a few examples.
It's no coincidence that x86 follows this alignment too.
For example, regarding the 4-level paging for 4KiB pages of the x86 CPUs, the Page Directory Pointer field of a 64-bit address is 9 bits wide.
Each entry in that table (a PDPTE) is 64 bits, so the table size is 512 * 8 = 4096 bytes (4 KiB) and the last entry has offset 511 * 8 = 4088 (0xff8 in hex, so at most 12 bits are used).
The address of a Page Directory Pointer table is given by a PML4 entry; these entries don't specify the lower 12 bits of the base (which are used for other purposes), only the upper bits.
The CPU can then simply replace the lower 12 bits of the base in the PML4 entry with the offset of the PDPTE, since, as we have seen, that offset fits in 12 bits.
This is fast and cheap to do in hardware (no carry, easy to do with registers).
Assume that a country has ZIP codes made of two fields: a city code (C) and a block code (D), added together.
Also, assume that there can be at most 100 block codes for a given city, so D is 2 digits long.
Requiring that the city code is aligned on 100 (which means that the last two digits of C are zero) makes C + D like replacing the last two digits of C with D.
(1200 + 34 = 12|34).
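To make the bit-replacement trick concrete, here is a small C sketch (the function name entry_address and the sample base address are mine, for illustration): with a 4 KiB-aligned table base and 8-byte entries, adding index * 8 is the same as OR-ing the scaled index into the low 12 bits, so no carry can propagate.

#include <stdint.h>
#include <stdio.h>

/* table_base is assumed 4096-byte aligned, index < 512, entries are 8 bytes. */
static uint64_t entry_address(uint64_t table_base, uint64_t index)
{
    return table_base | (index << 3);  /* same as table_base + index * 8 here */
}

int main(void)
{
    uint64_t base = 0x207ffe1000ULL;   /* example 4 KiB-aligned base */
    printf("%llx\n", (unsigned long long)entry_address(base, 511));  /* ...1ff8 */
    return 0;
}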
Relation with the alignment of aggregates
A page table is not regarded as an aggregate, i.e. as an array of 8-byte elements. It is regarded as a type of its own, defined by the CPU's ISA, and it must satisfy the requirements of the particular part of the CPU that uses it.
The page walker finds it convenient to have page tables aligned on their size, so this is the requirement.
The alignment of aggregates is a set of rules used by the compiler to lay out objects in memory; it guarantees that every element's alignment requirement is satisfied, so that instructions can access any element without alignment penalties/faults.
The execution units for loads and stores are a different part of the CPU than the page walker, so they have different needs.
You should use the aggregate alignment rules to know how the compiler will align your structs and then check whether that's enough for your use case.
Exceptions exist
Note that the professor went to some length to explain what alignment on their natural boundary means for page tables.
Exceptions exist: if you are told that a datum must be aligned on X, you can assume there's some hardware trick/simplification involved and try to figure out which one, but in the end you just do the alignment and move on.
Margaret explained why page tables are special; I'm only answering this other part of the question:
"... according to the largest element."
That's not the rule for normal structs either. You want max(alignof(member)) not max(sizeof(member)). So "according to the most-aligned element" would be a better way to describe the required alignment of a normal struct.
e.g. in the i386 System V ABI, double has sizeof = 8 but alignof = 4, so alignof(struct S1) = 4 (footnote 1).
Even if the char member had been last, sizeof(struct S1) still has to be padded to a multiple of its alignof(), so all the usual invariants are maintained (e.g. sizeof( array ) = N * sizeof(struct S1)), and so stepping by sizeof always gets you to a sufficiently-aligned boundary for the start of a new struct.
Footnote 1: That ABI was designed before CPUs could efficiently load/store 8 bytes at once. Modern compilers try to give double and [u]int64_t 8-byte alignment, e.g. as globals or locals outside of structs. But the ABI's struct layout rules fix the layout based on the minimum guaranteed alignment for any double or int64_t object, which is alignof(T) = 4 for those types.
x86-64 System V has alignof(T) = sizeof(T) for all the primitive types, including the 8-byte ones. This makes atomic operations on any properly-aligned int64_t possible, for example, simplifying the implementation of C++20 std::atomic_ref to not have to check for sufficient alignment. (Why is integer assignment on a naturally aligned variable atomic on x86?)
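If you want to see what your compiler/ABI actually does, a quick C check like the following works (struct S1 here is just a stand-in for the struct from the question, which isn't shown above):

#include <stdio.h>
#include <stdalign.h>

struct S1 { char c; double d; };  /* stand-in; not the question's exact S1 */

int main(void)
{
    /* i386 System V: alignof(double) = 4 even though sizeof(double) = 8.
       x86-64 System V: both are 8, so the struct numbers differ per ABI. */
    printf("double:    sizeof=%zu alignof=%zu\n", sizeof(double), alignof(double));
    printf("struct S1: sizeof=%zu alignof=%zu\n", sizeof(struct S1), alignof(struct S1));
    return 0;
}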

What is the maximum wire bit-width in verilog/system verilog

I am trying to synthesize a design using Intel Quartus software. During synthesis flow, I got a warning stating "Verilog declaration warning: vector has more than 2**16 bits". Due to project specifications, the wire length exceeds 2^16 bits. Do I really need to worry about this warning? Is there any constraint in Verilog/System Verilog regarding maximum bit-width of the wires?
Yes, you should worry about the message. It means you are probably miscoding something, as I do not believe you really need an integral vector (packed array) with 2^16 bits. That would mean you want to represent a number as large as 2^(2^16) - 1 ≈ 2.0 × 10^19728. You probably want an unpacked array of bits or bytes.
The Verilog/SystemVerilog languages themselves do not define a maximum vector width. I vaguely remember a minimum for simulations being mentioned somewhere in the IEEE1364 / IEEE1800 LRMs.
Synthesizers can have their own limitations for various reasons, e.g. software complexity, physical mapping/routing, business tactics (“buy this other tool”), etc. The fact that you are getting a warning probably points to physical mapping/routing, as most synthesizers try to keep all the bits of a vector together, thereby making overall routing and clock-tree buffering easier.
2^16 is a lot of bits and your design exceeds that. In theory you could split the whole vector into more manageable chunks (e.g. left and right halves), which might give you better routing and timing. If your design is meeting all requirements, then the warning can be ignored. Just be ready to make changes when it suddenly can no longer be ignored.
Yes, according to the SystemVerilog LRM the maximum width can be limited, and since your synthesis tool is throwing a warning, you should keep this in mind if any issues arise later in your project.
For packed arrays (Section 7.4.1), which you are probably referring to given the warning, the aforementioned LRM reads:
The maximum size of a packed array can be limited, but shall be at least 65 536 (2^16) bits.
For unpacked arrays (Section 7.4.2), it says:
Implementations may limit the maximum size of an array, but they shall allow at least 16 777 216 (2^24) elements.
But, do you actually need a 2^16 bit wide vector? If you have a 16 bit variable (e.g., logic [15:0]), you can handle unsigned values up to 2^16-1 = 65535. If you have a 2^16 bit wide vector (e.g., logic [65535:0]), you could display values up to 2^(2^16)-1 ≈ 2.00 × 10^19728.
Could you not split up your wire into a couple of smaller wires?

Flipping bits in the mxcsr register

I'm attempting to set bits in the mxcsr control register. I know how to do this with gcc (fesetenv), but I haven't been able to get this working with MSVC. The possibilities I've considered are:
Assembly, which isn't supported inline on MSVC x64.
The _controlfp function, which doesn't seem to match up one to one (note that I may be completely misunderstanding this function, but all of this is poorly documented).
Specifically, I want to set these bits:
"Denormals are zeros"
"Flush to zeros"
Any ideas for how I can do this?
The _controlfp routine is a generic routine that works across ARM, x86 and x64 - there is no reason to expect it to use the same bits the actual hardware registers use.
From the documentation for _controlfp, it appears that _controlfp does not allow for individual control of input and output denormals, but sets them together.
To set the x64 SSE2 to flush denormal operands and outputs to zero, use
_controlfp(_DN_FLUSH, _MCW_DN);
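For completeness, here is a minimal sketch showing both routes on MSVC x64. The SSE intrinsic route is my addition, not part of the answer above; it sets the MXCSR "flush to zero" and "denormals are zero" bits directly without needing inline assembly.

#include <float.h>       /* _controlfp, _DN_FLUSH, _MCW_DN (MSVC CRT) */
#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE, _mm_getcsr */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */
#include <stdio.h>

int main(void)
{
    /* CRT route: ask the runtime to flush denormals (inputs and outputs together). */
    unsigned int prev = _controlfp(_DN_FLUSH, _MCW_DN);

    /* Intrinsic route: set the FTZ and DAZ bits of MXCSR individually. */
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

    printf("previous control word: %#x, mxcsr: %#x\n", prev, _mm_getcsr());
    return 0;
}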

RISC-V: S-format instructions table

I have this table of S-format instructions. Can you explain to me what imm[11:5] and funct3 are? I know funct3 is given in bits and is sometimes 000 or 010, but I don't know exactly why it's there. Also, imm[11:5] is 7 bits of all 0s.
Please help!
imm[4:0] and imm[11:5] denote closed-intervals into the bit-representation of the immediate operand.
The S-format is used to encode store instructions, i.e.:
sX rs2, offset(rs1)
There are different types of store instructions, e.g. store-byte (sb), store-half-word (sh), store-word (sw), etc. The funct3 part is used to encode the type (i.e. 0b000 -> sb, 0b010 -> sw, 0b011 -> sd, etc.). This allows using just one (major) opcode while still having multiple types of store instructions, instead of having to waste several (major) opcodes. IOW, funct3 encodes the minor opcode of the instruction.
The immediate operand encodes the offset. If you ask yourself why it's split like this: it increases the similarity of the remaining parts of the encoding with other instruction formats. For example, the opcode, rs1, and funct3 parts are located at the exact same place in the R-type, I-type and B-type instruction formats. The rs2 part placement is shared with the R-type and B-type instruction formats. Those similarities help to simplify the instruction decoder.
That means the offset is 12 bits wide, and in pseudo-code:
offset = sign_ext(imm[11:5] << 5 | imm[4:0])
See also the first figure in Section 2.6 (Load and Store Instructions) of the RISC-V base specification (2019-06-08 ratified).
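As a concrete illustration, here is a C sketch I'm adding (not part of the answer) that decodes funct3 and the split offset from a raw 32-bit instruction word:

#include <stdint.h>
#include <stdio.h>

/* Extract the S-format fields from a 32-bit RISC-V instruction word. */
static void decode_s_format(uint32_t insn)
{
    uint32_t funct3  = (insn >> 12) & 0x7;    /* minor opcode: sb/sh/sw/sd */
    uint32_t imm4_0  = (insn >> 7)  & 0x1f;   /* imm[4:0]  */
    uint32_t imm11_5 = (insn >> 25) & 0x7f;   /* imm[11:5] */
    uint32_t imm12   = (imm11_5 << 5) | imm4_0;
    /* sign-extend the 12-bit immediate */
    int32_t offset = (int32_t)imm12 - ((imm12 & 0x800) ? 4096 : 0);
    printf("funct3=%u offset=%d\n", (unsigned)funct3, offset);
}

int main(void)
{
    decode_s_format(0xfe112e23u);  /* sw ra, -4(sp): prints funct3=2 offset=-4 */
    return 0;
}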

Does 64-bit floating point numbers behave identically on all modern PCs?

I would like to know whether I can assume that the same operations on the same 64-bit floating point numbers give exactly the same results on any modern PC and in most common programming languages (C++, Java, C#, etc.). We can assume that we are operating on numbers and the result is also a number (no NaNs, INFs and so on).
I know there are two very similar standards of computation using floating point numbers (IEEE 854-1987 and IEEE 754-2008). However, I don't know how it is in practice.
Modern processors that implement 64-bit floating-point typically implement something that is close to the IEEE 754-1985 standard, recently superseded by the 754-2008 standard.
The 754 standard specifies what result you should get from certain basic operations, notably addition, subtraction, multiplication, division, square root, and negation. In most cases, the numeric result is specified precisely: The result must be the representable number that is closest to the exact mathematical result in the direction specified by the rounding mode (to nearest, toward infinity, toward zero, or toward negative infinity). In "to nearest" mode, the standard also specifies how ties are broken.
Because of this, operations that do not involve exception conditions such as overflow will get the same results on different processors that conform to the standard.
However, there are several issues that interfere with getting identical results on different processors. One of them is that the compiler is often free to implement sequences of floating-point operations in a variety of ways. For example, if you write "a = b*c + d" in C, where all variables are declared double, the compiler is free to compute "b*c" in either double-precision arithmetic or something with more range or precision. If, for example, the processor has registers capable of holding extended-precision floating-point numbers and doing arithmetic with extended-precision does not take any more CPU time than doing arithmetic with double-precision, a compiler is likely to generate code using extended-precision. On such a processor, you might not get the same results as you would on another processor. Even if the compiler does this regularly, it might not in some circumstances because the registers are full during a complicated sequence, so it stores the intermediate results in memory temporarily. When it does that, it might write just the 64-bit double rather than the extended-precision number. So a routine containing floating-point arithmetic might give different results just because it was compiled with different code, perhaps inlined in one place, and the compiler needed registers for something else.
Some processors have instructions to compute a multiply and an add in one instruction, so "b*c + d" might be computed with no intermediate rounding and get a more accurate result than on a processor that first computes b*c and then adds d.
Your compiler might have switches to control behavior like this.
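A small C example of the effect (the values are chosen by me for illustration): fma() from math.h rounds only once, while the separate expression rounds the product first.

#include <stdio.h>
#include <math.h>

int main(void)
{
    double b = 1.0 + 0x1p-27;        /* 1 + 2^-27 */
    double c = 1.0 - 0x1p-27;        /* 1 - 2^-27 */
    double d = -1.0;
    double separate = b * c + d;     /* b*c rounds to 1.0 first, so this is 0
                                        (unless the compiler contracts it into an FMA) */
    double fused    = fma(b, c, d);  /* single rounding keeps -2^-54 */
    printf("separate = %g\nfused    = %g\n", separate, fused);
    return 0;                        /* link with -lm on many toolchains */
}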
There are some places where the 754-1985 standard does not require a unique result. For example, when determining whether underflow has occurred (a result is too small to be represented accurately), the standard allows an implementation to make the determination either before or after it rounds the significand (the fraction bits) to the target precision. So some implementations will tell you underflow has occurred when other implementations will not.
A common feature in processors is to have an "almost IEEE 754" mode that eliminates the difficulty of dealing with underflow by substituting zero instead of returning the very small number that the standard requires. Naturally, you will get different numbers when executing in such a mode than when executing in the more compliant mode. The non-compliant mode may be the default set by your compiler and/or operating system, for reasons of performance.
Note that an IEEE 754 implementation is typically not provided just by hardware but by a combination of hardware and software. The processor may do the bulk of the work but rely on the software to handle certain exceptions, set certain modes, and so on.
When you move beyond the basic arithmetic operations to things like sine and cosine, you are very dependent on the library you use. Transcendental functions are generally calculated with carefully engineered approximations. The implementations are developed independently by various engineers and get different results from each other. On one system, the sin function may give results accurate within an ULP (unit of least precision) for small arguments (less than pi or so) but larger errors for large arguments. On another system, the sin function might give results accurate within several ULP for all arguments. No current math library is known to produce correctly rounded results for all inputs. There is a project, crlibm (Correctly Rounded Libm), that has done some good work toward this goal, and they have developed implementations for significant parts of the math library that are correctly rounded and have good performance, but not all of the math library yet.
In summary, if you have a manageable set of calculations, understand your compiler implementation, and are very careful, you can rely on identical results on different processors. Otherwise, getting completely identical results is not something you can rely on.
If you mean getting exactly the same result, then the answer is no.
You might even get different results for debug (non-optimized) builds vs. release builds (optimized) on the same machine in some cases, so don't even assume that the results might be always identical on different machines.
(This can happen, e.g., on a computer with an Intel processor, if the optimizer keeps an intermediate result in a register that would be stored to memory in the unoptimized build. Since Intel FPU registers are 80 bits and double variables are 64 bits, the intermediate result is kept with greater precision in the optimized build, causing different values in later results.)
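Here is a hedged C sketch of the kind of code where this shows up; the values are mine, and the difference only appears when the compiler targets the x87 FPU (e.g. 32-bit x86 without SSE2). On x86-64 the result is the same at every optimization level.

#include <stdio.h>

/* The running sum may be kept in an 80-bit x87 register at -O2 but spilled
   to a 64-bit double at -O0, which changes the final result. */
static double sum3(const double *a)
{
    double s = 0.0;
    for (int i = 0; i < 3; ++i)
        s += a[i];
    return s;
}

int main(void)
{
    double a[3] = { 1e16, 1.0, -1e16 };
    printf("%.17g\n", sum3(a));  /* 0 with strict doubles, 1 with 80-bit intermediates */
    return 0;
}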
In practice, however, you may often get the same results, but you shouldn't rely on it.
Modern FPUs all implement IEEE754 floats in single and double formats, and some in extended format. A certain set of operations are supported (pretty much anything in math.h), with some special instructions floating around out there.
Assuming you are talking about applying multiple operations, I do not think you will get exactly the same numbers. CPU architecture, compiler choice, and optimization settings will change the results of your computations.
Even if you mean the exact order of operations (at the assembly level), I think you will still get variations. For example, Intel chips use extended precision (80 bits) internally, which may not be the case for other CPUs. (I do not think extended precision is mandated.)
The same C# program can produce different numerical results on the same PC, once compiled in debug mode without optimization, a second time compiled in release mode with optimization enabled. That's my personal experience. We did not take this into account when we set up an automatic regression test suite for one of our programs for the first time, and were completely surprised that a lot of our tests failed without any apparent reason.
For C# on x86, 80-bit FP registers are used.
The C# standard says that the processor must operate at the same precision as, or greater than, the type itself (i.e. 64-bit in the case of a 'double'). Promotions are allowed, except for storage. That means that locals and parameters could be at greater than 64-bit precision.
In other words, assigning a member variable to a local variable could (and in fact will under certain circumstances) be enough to give an inequality.
See also: Float/double precision in debug/release modes
For the 64-bit data type, I only know of "double precision" / "binary64" from IEEE 754 being used (the 1985 and 2008 revisions don't differ much here for common cases).
Note: The radix types defined in IEEE 854-1987 are superseded by IEEE 754-2008 anyway.
