RISCV RV32IM: MULHSU - which operand is the signed one?

Question
In RISC-V RV32IM, for the MULHSU instruction, which of the operands rs1 and rs2 is the signed one?
Background
The RISC-V Instruction Set Manual
Volume I: Unprivileged ISA
Document Version 20190608-Base-Ratified
says the following (near the bottom of page 43):
MULH, MULHU, and MULHSU perform the same multiplication but return the upper XLEN bits of the full 2 × XLEN-bit product, for signed × signed, unsigned × unsigned, and signed rs1 × unsigned rs2 multiplication, respectively.
So this states that the signed operand is rs1.
But the explanatory note (bottom of page 43) says:
MULHSU is used in multi-word signed multiplication to multiply the most-significant word of the multiplier (which contains the sign bit) with the less-significant words of the multiplicand (which are unsigned).
From the definition of the instruction (also page 43):
 31          25 24       20 19       15 14           12 11      7 6        0
+--------------+-----------+-----------+---------------+---------+----------+
|    funct7    |    rs2    |    rs1    |    funct3     |   rd    |  opcode  |
+--------------+-----------+-----------+---------------+---------+----------+
       7             5           5             3            5         7
    MULDIV     multiplier  multiplicand MUL/MULH[[S]U]    dest       OP
I see that the multiplier is rs2. So the explanatory note states that the signed operand is rs2.

I believe either the diagram or the explanatory note has a typo. All of my testing has shown rs1 to be signed and rs2 to be unsigned for MULHSU.
A much more comprehensive summary of instruction formats and pseudo-code can be found here. More detail on pseudo-instructions and other things to help write assembly code for RISC-V can be found here (same website). Its documentation expresses MULHSU as follows:
MULHSU rd, rs1, rs2 # rd ← (sx(rs1) × ux(rs2)) >> xlen
where sx(r) means signed version, and ux(r) means unsigned version.
If you find any evidence that this isn't the case, please let me know.
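For anyone implementing this, here is a minimal C++ sketch of the behavior described above (the function name mulhsu32 is mine, not from the spec): rs1 is sign-extended, rs2 is zero-extended, and rd receives the upper XLEN bits of the 64-bit product.

#include <cstdint>
#include <cstdio>

// Sketch only: rs1 is treated as signed, rs2 as unsigned (per the spec text).
uint32_t mulhsu32(uint32_t rs1, uint32_t rs2)
{
    // sign-extend rs1, zero-extend rs2; the full product fits in 64 bits
    int64_t product = (int64_t)(int32_t)rs1 * (int64_t)(uint32_t)rs2;
    return (uint32_t)((uint64_t)product >> 32); // upper XLEN bits
}

int main()
{
    // -10 (signed) x 16 (unsigned) = -160, so the upper word is all ones
    std::printf("%08x\n", mulhsu32(0xFFFFFFF6u, 0x10u)); // prints ffffffff
    return 0;
}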

Following communication with the RISC-V Foundation: rs1 is the signed operand.
See https://github.com/riscv/riscv-isa-manual/issues/463

Related

NtQueryObject returns wrong insufficient required size via WOW64, why?

I am using the NT native API NtQueryObject()/ZwQueryObject() from user mode (and I am aware of the risks in general and I have written kernel mode drivers for Windows in the past in my professional capacity).
Generally when one uses the typical "query information" function (of which there are a few) the protocol is first to ask with a too small buffer to retrieve the required size with STATUS_INFO_LENGTH_MISMATCH, then allocate a buffer of said size and query again -- this time using the buffer and previously returned size.
In order to get the list of object types (67 on my build) on the system I am doing just that:
ULONG Size = 0;
NTSTATUS Status = NtQueryObject(NULL, ObjectTypesInformation, &Size, sizeof(Size), &Size);
And in Size I get 8280 (WOW64) and 8968 (x64). I then proceed to allocate the buffer with calloc() and query again:
ULONG Size2 = 0;
BYTE* Buf = (BYTE*)::calloc(1, Size);
Status = NtQueryObject(NULL, ObjectTypesInformation, Buf, Size, &Size2);
NB: ObjectTypesInformation is 3. It isn't declared in winternl.h, but Nebbett (as ObjectAllTypesInformation) and others describe it. Since I am not querying for a particular object's traits but the system-wide list of object types, I pass NULL for the object handle.
Curiously on WOW64, i.e. 32-bit, the value in Size2 upon return from the second query is 16 bytes bigger (8296) than the previously returned required size (8280).
As far as alignment is concerned, I'd expect at most 8 Bytes for this sort of thing and indeed neither 8280 nor 8296 are at a 16 Byte alignment boundary, but on an 8 Byte one.
Certainly I can add some slack space on top of the returned required size (e.g. ALIGN_UP to the next 32 Byte alignment boundary), but this seems highly irregular to be honest. And I'd rather want to understand what's going on than to implement a workaround that breaks, because I miss something crucial.
The practical issue for the code is that in Debug configurations it tells me there's a corrupted heap somewhere, upon freeing Buf. Which suggests that NtQueryObject() was indeed writing these extra 16 Bytes beyond the buffer I provided.
Question: Any idea why it is doing that?
As usual for NT native API the sources of information are scarce. The x64 version of the exact same code returns the exact number of bytes required. So my thinking here is that WOW64 is the issue. A somewhat cursory look into wow64.dll with IDA didn't reveal any immediate points for suspicion regarding what goes wrong in translating the results to 32-bit here.
PS: Windows 10 (10.0.19043, ntdll.dll "timestamp" 77755782)
PPS: this may be related: https://wj32.org/wp/2012/11/30/obquerytypeinfo-and-ntqueryobject-buffer-overrun-in-windows-8/ Tested it, by checking that OBJECT_TYPE_INFORMATION::TypeName.Length + sizeof(WCHAR) == OBJECT_TYPE_INFORMATION::TypeName.MaximumLength in all returned items, which was the case.
The only part of ObjectTypesInformation that's public is the first field defined in winternl.h header in the Windows SDK:
typedef struct __PUBLIC_OBJECT_TYPE_INFORMATION {
    UNICODE_STRING TypeName;
    ULONG Reserved[22];    // reserved for internal use
} PUBLIC_OBJECT_TYPE_INFORMATION, *PPUBLIC_OBJECT_TYPE_INFORMATION;
For x86 this is 96 bytes, and for x64 this is 104 bytes (assuming you have the right packing mode enabled). The difference is the pointer in UNICODE_STRING which changes the alignment in x64.
Any additional memory space should be related to the TypeName buffer.
UNICODE_STRING accounts for 8 bytes of the difference between 8280 and 8296. The function uses the sizeof(ULONG_PTR) for alignment of the returned string plus an extra WCHAR, so that could easily account for the remaining 8 bytes.
AFAIK: The public use of NtQueryObject is supposed to be limited to kernel-mode use which of course means it always matches the OS native bitness (x86 code can't run as kernel in x64 native OS), so it's probably just a quirk of using the NT functions via the WOW64 thunk.
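If you want to double-check those sizes in code, a quick compile-time test against the SDK header looks like this (a sketch assuming the default packing mentioned above):

#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#include <winternl.h>

// 8-byte UNICODE_STRING + 88 bytes of Reserved on x86;
// 16-byte UNICODE_STRING + 88 bytes on x64 (Buffer pointer widens and aligns)
static_assert(sizeof(PUBLIC_OBJECT_TYPE_INFORMATION) ==
              (sizeof(void*) == 8 ? 104 : 96),
              "size depends on pointer width");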
Alright, I think I figured out the issue with the help of WinDbg and a thorough look at wow64.dll using IDA.
NB: the wow64.dll I have has the same build number, but differs slightly in data only (checksum, security directory entry, pieces from version resources). The code is identical, which was to be expected, given deterministic builds and how they affect the PE timestamp.
There's an internal function called whNtQueryObject_SpecialQueryCase (according to PDBs), which covers the ObjectTypesInformation class queries.
For the above wow64.dll I used the following points of interest in WinDbg, from a 32 bit program which calls NtQueryObject(NULL, ObjectTypesInformation, ...) (the program itself is irrelevant, though):
0:000> .load wow64exts
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B0E0
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B14E
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B1A7
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B24A
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B252
Explanation of the above points of interest:
+B0E0: computing length required for 64 bit query, based on passed length for 32 bit
+B14E: call to NtQueryObject()
+B1A7: loop body for copying 64 to 32 bit buffer contents, after successful NtQueryObject() call
+B24A: computing written length by subtracting current (last + 1) entry from base buffer address
+B252: downsizing returned (64 bit) required length to 32 bit
The logic of this function in regards to just ObjectTypesInformation is roughly as follows:
Common steps
Take the ObjectInformationLength (32 bit query!) argument and size it up to fit the 64 bit info
Align the retrieved size up to the next 16 byte boundary
If necessary, allocate the resulting amount from some PEB::ProcessHeap and store it in TLS slot 3; otherwise reuse the existing allocation as scratch space
Call NtQueryObject() passing the buffer and length from the two previous steps
The length passed to NtQueryObject() is the one from step 1, not the one aligned to a 16 byte boundary. There seems to be some sort of header to this scratch space, so perhaps that's where the 16 byte alignment comes from?
Case 1: buffer size too small (here: 4), just querying required length
The up-sized length in this case equals 4, which is too small and consequently NtQueryObject() returns STATUS_INFO_LENGTH_MISMATCH. Required size is reported as 8968.
Down-size from the 64 bit required length to 32 bit and end up 16 bytes too short
Return the status from NtQueryObject() and the down-sized required length from the previous step
Case 2: buffer size supposedly (!) sufficient
Copy OBJECT_TYPES_INFORMATION::NumberOfTypes from queried buffer to 32 bit one
Step to the first entry (OBJECT_TYPE_INFORMATION) of source (64 bit) and target (32 bit) buffer, 8 and 4 byte aligned respectively
For each entry up to OBJECT_TYPES_INFORMATION::NumberOfTypes:
Copy UNICODE_STRING::Length and UNICODE_STRING::MaximumLength for TypeName member
memcpy() UNICODE_STRING::Length bytes from the source to the target UNICODE_STRING::Buffer (target entry + sizeof(OBJECT_TYPE_INFORMATION32))
Add terminating zero (WCHAR) past the memcpy'd string
Copy the individual members past the TypeName from 64 to 32 bit struct
Compute pointer of next entry by aligning UNICODE_STRING::MaximumLength up to an 8 byte boundary (i.e. the ULONG_PTR alignment mentioned in the other answer) + sizeof(OBJECT_TYPE_INFORMATION64) (already 8 byte aligned!)
The next target entry (32 bit) gets 4 byte aligned instead
At the end compute required (32 bit) length by subtracting the value we arrived at for the "next" entry (i.e. one past the last) from the base address of the buffer passed by the WOW64 program (32 bit) to NtQueryObject()
In my debugged scenario these were: 0x008ce050 - 0x008cbfe8 = 0x00002068 (= 8296), which is 16 bytes larger than the buffer length we were told during case 1 (8280)!
The issue
That crucial last step differs between merely querying and actually getting the buffer filled. There is no further bounds checking in that loop I described for case 2.
And this means it will just overrun the passed buffer and return a written length bigger than the buffer length passed to it.
Possible solutions and workarounds
I'll have to approach this mathematically after some sleep; the workaround is obviously to top up the required length returned from case 1 in order to avoid the buffer overrun. The easiest method is to take my up_size_from_32bit() from the example below and apply it to the returned required size. This way you are allocating enough for the 64-bit buffer while querying the 32-bit one, which should never overrun during the copy loop.
However, the fix in wow64.dll is a little more involved, I guess. While adding bounds checking to the loop would help avert the overrun, it would mean that the caller would have to query for the required size twice, because the first time around it lies to us.
Which means the query-only case (1) would have to allocate that internal buffer after querying the required length for 64 bit, then get it filled and then walk the entries (just like the copy loop), skipping over the last entry to compute the required length the same as it is now done after the copy loop.
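To make the workaround concrete, here is a sketch of the query sequence from the question with the padding applied (up_size_from_32bit() is defined in the example program below; error handling elided):

ULONG Size = 0;
NTSTATUS Status = NtQueryObject(NULL, ObjectTypesInformation, &Size, sizeof(Size), &Size);
if (Status == STATUS_INFO_LENGTH_MISMATCH) // from ntstatus.h, or define it yourself
{
    // top up the lying 32-bit length to what the 64-bit layout would need
    ULONG Padded = up_size_from_32bit(Size);
    BYTE* Buf = (BYTE*)::calloc(1, Padded);
    Status = NtQueryObject(NULL, ObjectTypesInformation, Buf, Padded, &Size);
    // ... consume Buf, then ::free(Buf)
}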
Example program demonstrating the "static" computation by wow64.dll
Build for x64, just the way wow64.dll was!
#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#include <cstdio>

typedef struct
{
    ULONG JustPretending[24];
} OBJECT_TYPE_INFORMATION32;

typedef struct
{
    ULONG JustPretending[26];
} OBJECT_TYPE_INFORMATION64;

constexpr ULONG size_delta_3264 = sizeof(OBJECT_TYPE_INFORMATION64) - sizeof(OBJECT_TYPE_INFORMATION32);

constexpr ULONG down_size_to_32bit(ULONG len)
{
    return len - size_delta_3264 * ((len - 4) / sizeof(OBJECT_TYPE_INFORMATION64));
}

constexpr ULONG up_size_from_32bit(ULONG len)
{
    return len + size_delta_3264 * ((len - 4) / sizeof(OBJECT_TYPE_INFORMATION32));
}

// Trying to mimic the wdm.h macro
constexpr size_t align_up_by(size_t address, size_t alignment)
{
    return (address + (alignment - 1)) & ~(alignment - 1);
}

constexpr auto u32 = 8280UL;
constexpr auto u64 = 8968UL;
constexpr auto from_64 = down_size_to_32bit(u64);
constexpr auto from_32 = up_size_from_32bit(u32);
constexpr auto from_32_16_byte_aligned = (ULONG)align_up_by(from_32, 16);

int wmain()
{
    wprintf(L"32 to 64 bit: %u -> %u -(16-byte-align)-> %u\n", u32, from_32, from_32_16_byte_aligned);
    wprintf(L"64 to 32 bit: %u -> %u\n", u64, from_64);
    return 0;
}

static_assert(sizeof(OBJECT_TYPE_INFORMATION32) == 96, "Size for 32 bit struct does not match.");
static_assert(sizeof(OBJECT_TYPE_INFORMATION64) == 104, "Size for 64 bit struct does not match.");
static_assert(u32 == from_64, "Must match (from 64 to 32 bit)");
static_assert(u64 == from_32, "Must match (from 32 to 64 bit)");
static_assert(from_32_16_byte_aligned % 16 == 0, "16 byte alignment failed");
static_assert(from_32_16_byte_aligned > from_32, "We're aligning up");
This does not mimic the computation that happens in case 2, though.

In the risc-v architecture, what do the bits returned by the mulh[[s]u] operation look like?

TLDR: given 64 bit registers rs1(signed) = 0xffff'ffff'ffff'fff6 and rs2(unsigned) = 0x10 does the riscv mulhsu instruction return 0x0000'0000'0000'000f or 0xffff'ffff'ffff'ffff or something else entirely to rd?
I am working on implementing a simulated version of the RISC-V architecture and have run into a snag when implementing the RV64M mulh[[s]u] instruction. I'm not sure if mulhsu returns a signed or unsigned number. If it does return a signed number, then what is the difference between mulhsu and mulh?
Here is some pseudocode demonstrating the problem (s64 and u64 denote signed and unsigned 64-bit registers, respectively):
rs1.s64 = 0xffff'ffff'ffff'fff6; //-10
rs2.u64 = 0x10; // 16
execute(mulhsu(rs1, rs2));
// which of these is correct? Note: rd only returns the upper 64 bits of the product
EXPECT_EQ(0x0000'0000'0000'000f, rd);
EXPECT_EQ(0xffff'ffff'ffff'ffff, rd);
EXPECT_EQ(<some other value>, rd);
Should rd be signed? unsigned?
From the instruction manual:
MUL performs an XLEN-bit × XLEN-bit multiplication of rs1 by rs2 and places the lower XLEN bits
in the destination register. MULH, MULHU, and MULHSU perform the same multiplication but return
the upper XLEN bits of the full 2 × XLEN-bit product, for signed × signed, unsigned × unsigned,
and signed rs1×unsigned rs2 multiplication, respectively. If both the high and low bits of the same
product are required, then the recommended code sequence is: MULH[[S]U] rdh, rs1, rs2; MUL
rdl, rs1, rs2 (source register specifiers must be in same order and rdh cannot be the same as rs1 or
rs2). Microarchitectures can then fuse these into a single multiply operation instead of performing
two separate multiplies.
The answer to your question is: EXPECT_EQ(0xffff'ffff'ffff'ffff, rd);.
mulhsu multiplies a sign-extension of rs1.s64 by a zero-extension of rs2.u64.
You can see this in the compiler machine description riscv.md.
So mulhsu (64-bit) returns the equivalent of ((s128)rs1.s64 * (u128)rs2.u64) >> 64, where s128 is a signed 128-bit integer and u128 an unsigned 128-bit integer.
The difference between the three mul variants is:
mulhsu is a multiplication of a sign-extended register by a zero-extended register.
mulh is a multiplication of two sign-extended registers.
mulhu is a multiplication of two zero-extended registers.
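A minimal C++ sketch of that, using the GCC/Clang 128-bit integer extension (the function name mulhsu64 is illustrative):

#include <cstdint>

uint64_t mulhsu64(uint64_t rs1, uint64_t rs2)
{
    // sign-extend rs1 to 128 bits, zero-extend rs2; the product fits in s128
    __int128 product = (__int128)(int64_t)rs1 * (__int128)rs2;
    return (uint64_t)((unsigned __int128)product >> 64); // upper XLEN bits
}

// mulhsu64(0xFFFF'FFFF'FFFF'FFF6, 0x10) == 0xFFFF'FFFF'FFFF'FFFF,
// i.e. the second EXPECT_EQ from the question is the one that holds.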

How to use Arithmetic expression in Enum in system verilog?

`define REG_WIDTH 48
`define FIELD_WIDTH 32

typedef enum bit [`REG_WIDTH-1:0]
{
    BIN_MIN = 'h0,
    BIN_MID = BIN_MIN + `REG_WIDTH'(((1<<`FIELD_WIDTH)+2)/3),
    BIN_MAX = BIN_MID + `REG_WIDTH'(((1<<`FIELD_WIDTH)+2)/3)
} reg_cover;
In the above code I am getting a compilation error about duplicate enum values, because BIN_MID also takes the value {48{1'b0}} (i.e. zero). But when I $display "BIN_MIN + `REG_WIDTH'(((1<<`FIELD_WIDTH)+2)/3)", I don't get zero.
Since I have cast each enum value to 48 bits, why am I getting zero? I am new to SystemVerilog.
Typically, integer constants like 1 are treated as 32-bit values (the SystemVerilog LRM specifies them to be at least 32 bits, but most simulators/synthesis tools use exactly 32 bits). As such, since you are performing a shift by 32 first, you shift the one out completely and are left with 0 at compile time (32'd1 << 32 is zero). By extending the size of the integer constant to 48 bits first, you will not lose the value due to the shift:
`define REG_WIDTH 48
`define FIELD_WIDTH 32

typedef enum bit [`REG_WIDTH-1:0] {
    BIN_MIN = 'h0,
    BIN_MID = BIN_MIN + (((`REG_WIDTH'(1))<<`FIELD_WIDTH)+2)/3,
    BIN_MAX = BIN_MID + (((`REG_WIDTH'(1))<<`FIELD_WIDTH)+2)/3
} reg_cover;
As to why the same expression prints a non-zero value in a $display, I'm not sure. Some simulators I tried did print non-zero values, others printed 0. There might be some differences in compile-time optimizations and how they run the code, but casting first is the best thing to do.

GBZ80 - ADC instructions fail test

I've been running Blargg's CPU tests through my Game Boy emulator, and the op r,r test shows that my ADC instruction is not working properly, but that ADD is. My understanding is that the only difference between the two is adding the existing carry flag to the second operand before addition. As such, my ADC code is the following:
void Emu::add8To8Carry(BYTE &a, BYTE b) //4 cycles - 1 byte
{
    if((Flags >> FLAG_CARRY) & 1)
        b++;

    FLAGCLEAR_N;
    halfCarryAdd8_8(a, b); //generates H flag based on addition
    carryAdd8_8(a, b); //generates C flag appropriately
    a += b;

    if(a == 0)
        FLAGSET_Z;
    else
        FLAGCLEAR_Z;
}
I entered the following into a test ROM:
06 FE 3E 01 88
Which leaves A with the value 0 (Flags = B0) when the carry flag is set, and FF (Flags = 00) when it is not. This is how it should work, as far as my understanding goes. However, it still fails the test.
From my research, I believe that flags are affected in an identical manner to ADD. Literally the only change in my code from the working ADD instruction is the addition of the flag check/potential increment in the first two lines, which my test code seems to prove works.
Am I missing something? Perhaps there's a peculiarity with flag states between ADD/ADC? As a side note, SUB instructions also pass, but SBC fails in the same way.
Thanks
The problem is that b is an 8-bit value. If b is 0xff and carry is set, then adding 1 to b wraps it to 0, and the subsequent check won't generate a carry when added to an a >= 1. You get similar problems with the half-carry flag if the lower nybble is 0xf.
This might be fixed if you call halfCarryAdd8_8(a, b + 1); and carryAdd8_8(a, b + 1); when carry is set. However, I suspect that those routines also take byte operands, so you may have to change them internally, perhaps by adding the carry as a separate argument so that you can do tmp = a + b + carry; without overflowing b. But I can only speculate without the source to those functions.
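As an illustration, a standalone version of that idea might look like this (the name adc8 and the flag outputs are mine, not the asker's actual helpers):

#include <cstdint>

// Widen to unsigned int so a + b + carry can never wrap an 8-bit operand.
void adc8(uint8_t &a, uint8_t b, bool carry_in, bool &z, bool &h, bool &c)
{
    unsigned carry = carry_in ? 1u : 0u;
    unsigned sum = (unsigned)a + b + carry;
    h = ((a & 0x0Fu) + (b & 0x0Fu) + carry) > 0x0Fu; // carry out of bit 3
    c = sum > 0xFFu;                                 // carry out of bit 7
    a = (uint8_t)sum;
    z = (a == 0);
}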
On a somewhat related note, there's a fairly simple way to check for carry over all the bits:
int sum = a + b;
int no_carry_sum = a ^ b;
int carry_into = sum ^ no_carry_sum;
int half_carry = carry_into & 0x10;
int carry = carry_into & 0x100;
How does that work? Consider that bitwise xor gives the expected result of each bit if there is no carry going into that bit: 0 ^ 0 == 0, 1 ^ 0 == 0 ^ 1 == 1, and 1 ^ 1 == 0. By xoring sum with no_carry_sum we get the bits where the sum differs from the bit-by-bit addition, and the sum only differs where there is a carry into that particular bit position. Thus both the half-carry and carry bits can be obtained with almost no overhead.
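A quick sanity check of the trick with arbitrarily chosen values:

int a = 0x38, b = 0x29;
int sum = a + b;                 // 0x61
int carry_into = sum ^ (a ^ b);  // 0x61 ^ 0x11 = 0x70
// the low nybbles sum to 0x11, so a carry arrived at bit 4 but not bit 8:
// (carry_into & 0x10) != 0, (carry_into & 0x100) == 0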

Sign of expression in Verilog

Here is a small bit of Verilog code. I would expect it to return three identical results, all 8-bit representations of -1.
module trivial;
  reg we;
  reg [7:0] c;

  initial
  begin
    c = 8'd3;
    we = 1'b1;
    $display ("res(we) = %d", (we ? (-$signed(c)) / 8'sd2 : 8'd0));
    $display ("res(1) = %d", (1'b1 ? (-$signed(c)) / 8'sd2 : 8'd0));
    $display ("res = %d", (-$signed(c)) / 8'sd2);
  end
endmodule
Briefly, the version of the standard I have (1364-2001) says in section 4.1.5 that division rounds towards zero, so -3/2=-1. It also says in section 4.5 that operator sign only depends on the operands (edit: but only for "self determined expressions"; turns out it's necessary to read the part of the standard on signs together with the part on widths). So the sub-expression with the division should presumably be unaffected by the context it is used in, and similarly for the sub-expression involving $signed. So the results should all be the same?
Three different simulators disagree with me, and only two of them agree with each other. The apparent cause is that unsigned division is used instead of the signed division that I would expect (-3 is 253 as an 8-bit unsigned value, and 253/2 = 126 after truncation).
Can someone please tell me if any of the simulators are right and why? (see below) I clearly must be missing something, but what please? Many thanks. edit: see above for what I was missing. I now think there is a bug in Icarus and the other two simulators are right
NB: the unused value in the ternary choice does not seem to make any difference, whether signed or unsigned. edit: this is incorrect, perhaps I forgot to save the modified test before retrying with signed numbers
Altera edition of Modelsim:
$ vsim work.trivial -do 'run -all'
Reading C:/altera/12.1/modelsim_ase/tcl/vsim/pref.tcl
# 10.1b
# vsim -do {run -all} work.trivial
# Loading work.trivial
# run -all
# res(we) = 126
# res(1) = 126
# res = -1
GPL Cver
GPLCVER_2.12a of 05/16/07 (Cygwin32).
Copyright (c) 1991-2007 Pragmatic C Software Corp.
All Rights reserved. Licensed under the GNU General Public License (GPL).
See the 'COPYING' file for details. NO WARRANTY provided.
Today is Mon Jan 21 18:49:05 2013.
Compiling source file "trivial.v"
Highest level modules:
trivial
res(we) = 126
res(1) = 126
res = -1
Icarus Verilog 0.9.6
$ iverilog.exe trivial.v && vvp a.out
res(we) = 126
res(1) = -1
res = -1
NCSIM gives:
res(we) = 126
res(1) = 126
res = -1
But if all inputs to the mux are signed I get:
$display ("res(we) = %d", (we ? (-$signed(c)) / 8'sd2 : 8'sd0)); //last argument now signed
$display ("res(1) = %d", (1'b1 ? (-$signed(c)) / 8'sd2 : 8'sd0));
$display ("res = %d", (-$signed(c)) / 8'sd2);
res(we) = -1
res(1) = -1
res = -1
Remember that if we do any arithmetic with an unsigned number, the arithmetic is done as unsigned; the same happens when using bit selects:
reg signed [7:0] c;
c = c[7:0] + 7'sd1; //<-- this is unsigned
In the example the mux is part of a single-line expression; I assume this is logically flattened for optimisation, and therefore the signedness of all arguments is taken into consideration.
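The same "one unsigned operand makes the whole operation unsigned" rule exists in C and C++, which makes for a handy cross-check (an analogy only, not a statement about the Verilog LRM):

#include <cstdio>

int main()
{
    int      s = -3;
    unsigned u = 2;
    std::printf("%u\n", s / u); // s converts to 4294967293; prints 2147483646
    std::printf("%d\n", s / 2); // all-signed division; prints -1
    return 0;
}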
