GBZ80 - ADC instructions fail test - emulation

I've been running Blargg's CPU tests through my Game Boy emulator, and the op r,r test shows that my ADC instruction is not working properly, but that ADD is. My understanding is that the only difference between the two is adding the existing carry flag to the second operand before addition. As such, my ADC code is the following:
void Emu::add8To8Carry(BYTE &a, BYTE b) //4 cycles - 1 byte
{
    if((Flags >> FLAG_CARRY) & 1)
        b++;
    FLAGCLEAR_N;
    halfCarryAdd8_8(a, b); //generates H flag based on addition
    carryAdd8_8(a, b);     //generates C flag appropriately
    a += b;
    if(a == 0)
        FLAGSET_Z;
    else
        FLAGCLEAR_Z;
}
I entered the following into a test ROM:
06 FE 3E 01 88
This leaves A with the value 0 (Flags = B0) when the carry flag is set, and FF (Flags = 00) when it is not. This is how it should work, as far as my understanding goes. However, it still fails the test.
From my research, I believe that flags are affected in an identical manner to ADD. Literally the only change in my code from the working ADD instruction is the addition of the flag check/potential increment in the first two lines, which my test code seems to prove works.
Am I missing something? Perhaps there's a peculiarity with flag states between ADD/ADC? As a side note, SUB instructions also pass, but SBC fails in the same way.
Thanks

The problem is that b is an 8-bit value. If b is 0xFF and carry is set, then adding 1 to b wraps it to 0, so the subsequent addition won't generate a carry even though a + 0xFF + 1 always should. You get a similar problem with the half-carry flag when the lower nybble of b is 0xF.
This might be fixed by calling halfCarryAdd8_8(a, b + 1); and carryAdd8_8(a, b + 1); when carry is set. However, I suspect those routines also take byte operands, so you may have to change them internally. Perhaps pass the carry as a separate argument so that you can do tmp = a + b + carry; without overflowing b. But I can only speculate without the source to those functions.
On a somewhat related note, there's a fairly simple way to check for carry over all the bits:
int sum = a + b;
int no_carry_sum = a ^ b;
int carry_into = sum ^ no_carry_sum;
int half_carry = carry_into & 0x10;
int carry = carry_into & 0x100;
How does that work? Bitwise XOR gives the expected result of each bit if there is no carry going into that bit: 0 ^ 0 == 0, 1 ^ 0 == 0 ^ 1 == 1, and 1 ^ 1 == 0. By XORing sum with no_carry_sum we get the bits where the full sum differs from the bit-by-bit addition, and sum only differs where there is a carry into a particular bit position. Thus both the half-carry and carry bits can be obtained with almost no overhead.
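Putting the two suggestions together, here is a minimal C sketch of ADC that widens the addition to an int (so b can no longer wrap) and derives the flags from the carry-into bits. The FLAG_* masks and the free-standing function are placeholders, not the asker's actual macros:

#include <stdint.h>

typedef uint8_t BYTE;

/* Placeholder flag masks (GB layout) -- adapt to your own macros. */
#define FLAG_Z 0x80
#define FLAG_N 0x40
#define FLAG_H 0x20
#define FLAG_C 0x10

static BYTE Flags;

void adc8(BYTE *a, BYTE b)
{
    int carry = (Flags & FLAG_C) ? 1 : 0;
    int sum   = *a + b + carry;            /* widened: no 8-bit overflow here */
    int carry_into = sum ^ *a ^ b ^ carry; /* bits that received a carry */

    Flags = 0;                             /* ADC always clears N */
    if (carry_into & 0x10) Flags |= FLAG_H; /* carry into bit 4 */
    if (sum & 0x100)       Flags |= FLAG_C; /* carry out of bit 7 */

    *a = (BYTE)sum;
    if (*a == 0)           Flags |= FLAG_Z;
}

Run against the test bytes above (B = FE, A = 01), this gives A = 00 with Flags = B0 when carry is set, and A = FF with Flags = 00 when it is not, matching the expected behaviour.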

Related

What does applying XOR between the input instructions & account data accomplish in this Solana smart contract?

https://github.com/solana-labs/break/blob/master/program/src/lib.rs
use solana_program::{
    account_info::AccountInfo, entrypoint, entrypoint::ProgramResult, pubkey::Pubkey,
};

entrypoint!(process_instruction);

fn process_instruction<'a>(
    _program_id: &Pubkey,
    accounts: &'a [AccountInfo<'a>],
    instruction_data: &[u8],
) -> ProgramResult {
    // Assume a writable account is at index 0
    let mut account_data = accounts[0].try_borrow_mut_data()?;
    // xor with the account data using byte and bit from ix data
    let index = u16::from_be_bytes([instruction_data[0], instruction_data[1]]);
    let byte = index >> 3;
    let bit = (index & 0x7) as u8;
    account_data[byte as usize] ^= 1 << (7 - bit);
    Ok(())
}
This is from one of their example applications. I'm really not sure what to make of this, or where one might even begin to look to understand what the intent is here and how it functions.
Thanks in advance.
EDIT:
Is this done to create a program-derived address? I found this in their API docs, and the above would seem to make sense as an implementation of it.
1 << n sets the nth bit of what's called a mask; for example, 1 << 1 = 0010.
XOR is a useful operation for comparing bits, and this code takes advantage of that property: if the current bit is 0, it will be set to 1, and if it is 1, it will be set to 0.
Using the mask from above, we can select one specific bit and flip it based on its current value:
1111 ^ 0010 = 1101: the masked bit was 1, so it flips to 0.
1101 ^ 0010 = 1111: the masked bit was 0, so it flips back to 1.
In short, it toggles a bit; it is a common idiom in bit manipulation code.
bits ^= 1 << n
Related: https://stackoverflow.com/a/47990/15971564
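For comparison, a hypothetical C mirror of the toggle (the two-byte big-endian index and MSB-first bit order are taken from the snippet; none of this is Solana API):

#include <stdint.h>
#include <stddef.h>

/* Toggle one bit in a byte array. The first two instruction bytes form a
 * big-endian bit index; bits are numbered MSB-first within each byte,
 * mirroring the 1 << (7 - bit) in the Rust snippet. */
void toggle_bit(uint8_t *account_data, const uint8_t *ix_data)
{
    uint16_t index = (uint16_t)((ix_data[0] << 8) | ix_data[1]);
    size_t   byte  = index >> 3;   /* which byte: 8 bits per byte */
    unsigned bit   = index & 0x7;  /* position within that byte */

    account_data[byte] ^= (uint8_t)(1u << (7 - bit));
}

Calling it twice with the same index restores the original byte, which is exactly the toggle property described above.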

How to use an arithmetic expression in an enum in SystemVerilog?

`define REG_WIDTH 48
`define FIELD_WIDTH 32

typedef enum bit [`REG_WIDTH-1:0] {
    BIN_MIN = 'h0,
    BIN_MID = BIN_MIN + `REG_WIDTH'(((1<<`FIELD_WIDTH)+2)/3),
    BIN_MAX = BIN_MID + `REG_WIDTH'(((1<<`FIELD_WIDTH)+2)/3)
} reg_cover;
In the above code I am getting a compilation error about duplicate enum values, because BIN_MID also evaluates to {48{1'b0}} (i.e. zero). But when I $display "BIN_MIN + `REG_WIDTH'(((1<<`FIELD_WIDTH)+2)/3)", I don't get zero.
Since I have cast each enum value to 48 bits, why am I getting zero? I am new to SystemVerilog.
Typically, integer constants like 1 are treated as 32-bit values (the SystemVerilog LRM specifies them to be at least 32 bits, but most simulators/synthesis tools use exactly 32 bits). Since you are performing the shift by 32 first, you shift the one out completely and are left with 0 (32'd1 << 32 is zero); the cast to 48 bits is applied only after that. By extending the integer constant to 48 bits before the shift, you will not lose the value:
`define REG_WIDTH 48
`define FIELD_WIDTH 32

typedef enum bit [`REG_WIDTH-1:0] {
    BIN_MIN = 'h0,
    BIN_MID = BIN_MIN + (((`REG_WIDTH'(1))<<`FIELD_WIDTH)+2)/3,
    BIN_MAX = BIN_MID + (((`REG_WIDTH'(1))<<`FIELD_WIDTH)+2)/3
} reg_cover;
As to why the expression prints a non-zero value when put in a $display, I'm not sure. Some simulators I tried did print non-zero values; others printed 0. There might be differences in compile-time optimizations and how they evaluate the expression at run time, but casting first is the right thing to do.
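The same operand-sizing pitfall exists in C, which may make it easier to see; this analogy is my own, not from the original answer:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* The shift happens in the operand's own width unless widened first. */
    uint64_t wrong = (uint64_t)(1u << 31 << 1); /* done in 32 bits: 0 */
    uint64_t right = (uint64_t)1u << 32;        /* widened first: 0x100000000 */

    printf("wrong = 0x%" PRIx64 ", right = 0x%" PRIx64 "\n", wrong, right);
    return 0;
}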

JPEG Huffman "DECODE" Procedure

The JPEG standard defines the DECODE procedure as shown below. I'm confused about a few parts.
CODE > MAXCODE(I): if this is true, the procedure enters a loop and applies a left shift (<<) to CODE. AFAIK, if we apply a left shift to a non-zero number, the number gets larger than before. In the figure it applies SLL (shift left logical), so wouldn't CODE always remain greater than MAXCODE?
Probably I couldn't read the figure correctly.
What does + NEXTBIT mean? For instance, if the CODE bits are 10101 and NEXTBIT is 1, will the result be 101011 (like string appending)? Am I right?
Is the HUFFVAL list the same as the one defined in the DHT marker (the Vi,j values)? Do I need to build an extra lookup table or something? It seems the procedure uses that list directly.
Thanks for clarifications
EDIT:
My DECODE code (C):
uint8_t
jpg_decode(ImScan * __restrict scan,
           ImHuffTbl * __restrict huff) {
    int32_t i, j, code;

    i    = 1;
    code = jpg_nextbit(scan);

    /* TODO: infinite loop ? */
    while (code > huff->maxcode[i]) {
        i++;
        code = (code << 1) | jpg_nextbit(scan);
    }

    j = huff->valptr[i];
    j = code + huff->delta[i]; /* delta = j - mincode[i] */

    return huff->huffval[j];
}
It's not MAXCODE, it's MAXCODE(I), which is a different value each time I is incremented.
+NEXTBIT means literally adding the next bit from the input, which is a 0 or a 1. (NEXTBIT is not 00000001. It is only one bit.)
Once you've found the length of the current code, you use the Vi,j values to index into the HUFFVAL decoding table.
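On the third point: HUFFVAL is exactly the Vi,j list from the DHT marker, but DECODE also needs the MINCODE, MAXCODE and VALPTR tables derived from the 16 code-length counts. A sketch of that derivation, with names chosen to match the question's struct (the function itself is mine, not the standard's pseudocode verbatim):

#include <stdint.h>

/* Build MINCODE/MAXCODE/VALPTR from the DHT code-length counts, following
 * the canonical-Huffman construction in the JPEG spec (Annex F).
 * bits[i] is the number of codes of length i, for i = 1..16. */
void
jpg_build_decode_tables(const uint8_t bits[17],
                        int32_t mincode[17],
                        int32_t maxcode[17],
                        int32_t valptr[17]) {
    int32_t code, k, i;

    code = 0;
    k    = 0;

    for (i = 1; i <= 16; i++) {
        if (bits[i] == 0) {
            maxcode[i] = -1;       /* no codes of this length: keep looping */
        } else {
            valptr[i]  = k;        /* index of first value of length i */
            mincode[i] = code;     /* smallest code of length i */
            code      += bits[i];
            k         += bits[i];
            maxcode[i] = code - 1; /* largest code of length i */
        }
        code <<= 1;                /* codes of length i+1 start doubled */
    }
}

DECODE then returns huffval[valptr[i] + code - mincode[i]], which is the delta trick already present in your code.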

Controlling TI OMAP l138 frequency leads to "Division by zero in kernel"

My team is trying to control the frequency of a Texas Instruments OMAP-L138. The default frequency is 300 MHz and we want to raise it to 372 MHz in a "complete" way: we would like not only to change the default value to the desired one (or at least configure it at startup), but also to be able to change the value at run time.
Searching the web for how to do this, we found an article which says that one way is via an "echo" command:
echo 372000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
We did some tests with this command and it runs fine, with one problem: sometimes the first call to this echo command leads to an error message of "Division by zero in kernel":
In my personal tests, this error always appeared on the first call to the echo command. All later calls worked without error. If I then reset the processor and call the command again, the same problem occurs: the first call leads to this error and later calls work without problem.
So my questions are: what is causing this problem? And how could I solve it? (Obviously the answer "always type it twice" doesn't count!)
(Feel free to mention other ways of controlling the OMAP-L138's frequency at run time as well!)
Looks to me like you have a division by zero in the davinci_spi_cpufreq_transition() function. Somewhere in this function (or in some function called from it) there is a buggy division operation which divides by a variable that, in your case, has the value 0. This is obviously an error case which should be handled properly in the code, but in fact it isn't.
It's hard to tell which code exactly leads to this, because I don't know which kernel you are using. It would be much easier if you could provide a link to your kernel repository. Although I couldn't find davinci_spi_cpufreq_transition in the upstream kernel, I found it here.
The davinci_spi_cpufreq_transition() function appears to be in drivers/spi/davinci_spi.c. It calls the davinci_spi_calc_clk_div() function, which contains two division operations. The first is:
prescale = ((clk_rate / hz) - 1);
And second is:
if (hz < (clk_rate / (prescale + 1)))
One of them is probably causing the "division by zero" error. I propose tracing which one it is by modifying the davinci_spi_calc_clk_div() function as follows (just add the lines marked with "+"):
static void davinci_spi_calc_clk_div(struct davinci_spi *davinci_spi)
{
    struct davinci_spi_platform_data *pdata;
    unsigned long clk_rate;
    u32 hz, cs_num, prescale;

    pdata = davinci_spi->pdata;
    cs_num = davinci_spi->cs_num;
    hz = davinci_spi->speed;
    clk_rate = clk_get_rate(davinci_spi->clk);
+   printk(KERN_ERR "### hz = %u\n", hz);
    prescale = ((clk_rate / hz) - 1);
    if (prescale > 0xff)
        prescale = 0xff;
+   printk(KERN_ERR "### prescale + 1 = %lu\n", prescale + 1UL);
    if (hz < (clk_rate / (prescale + 1)))
        prescale++;
    if (prescale < 2) {
        pr_info("davinci SPI controller min. prescale value is 2\n");
        prescale = 2;
    }

    clear_fmt_bits(davinci_spi->base, 0x0000ff00, cs_num);
    set_fmt_bits(davinci_spi->base, prescale << 8, cs_num);
}
My guess: it's the "hz" variable that is 0 in your case. If so, you may also want to add the following debug line to the davinci_spi_setup_transfer() function:
    if (!hz)
        hz = spi->max_speed_hz;
+   printk(KERN_ERR "### setup_transfer: setting speed to %u\n", hz);
    davinci_spi->speed = hz;
    davinci_spi->cs_num = spi->chip_select;
With all those modifications made, rebuild your kernel and you will probably get a clue as to why you have that "div by zero" error: just look for lines starting with "###" in your kernel boot log. If you don't know what to do next, post those debug lines and I will try to help you.
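For illustration, if hz does turn out to be 0, a defensive rewrite of the divider calculation might look like this standalone sketch (plain C with stdint types and a hypothetical function name, not a drop-in kernel patch):

#include <stdint.h>

/* Hypothetical defensive version of the prescale computation: refuse to
 * divide when hz is zero instead of letting clk_rate / hz fault. */
static uint32_t calc_prescale_safe(unsigned long clk_rate, uint32_t hz)
{
    uint32_t prescale;

    if (hz == 0)
        return 0xff;              /* safe maximum divider, or WARN here */

    prescale = (uint32_t)(clk_rate / hz) - 1;
    if (prescale > 0xff)
        prescale = 0xff;
    if (hz < clk_rate / (prescale + 1))
        prescale++;
    if (prescale < 2)
        prescale = 2;             /* driver's stated hardware minimum */

    return prescale;
}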

Atomicity of a read on SPARC

I'm writing a multithreaded application and having a problem on the SPARC platform. Ultimately my question comes down to the atomicity guarantees of this platform and how I could be obtaining this result.
Some pseudocode to help clarify my question:
// Global variable
typedef struct pkd_struct {
    uint16_t a;
    uint16_t b;
} __attribute__((packed)) pkd_struct_t;

pkd_struct_t shared;
Thread 1:
swap_value() {
    pkd_struct_t prev = shared;
    printf("%d%d\n", prev.a, prev.b);
    ...
}
Thread 2:
use_value() {
    pkd_struct_t next;
    next.a = 0; next.b = 0;
    shared = next;
    printf("%d%d\n", shared.a, shared.b);
    ...
}
Thread 1 and Thread 2 are accessing the shared variable "shared": one is setting it, the other is getting it. If Thread 2 is setting "shared" to zero, I'd expect Thread 1 to read the value either before OR after the store, since "shared" is aligned on a 4-byte boundary. However, I will occasionally see Thread 1 read a value of the form 0xFFFFFF00. That is, the high-order 24 bits are OLD, but the low-order byte is NEW. It appears I've gotten an intermediate value.
Looking at the disassembly, the use_value function simply does an "ST" instruction. Given that the data is aligned and isn't crossing a word boundary, is there any explanation for this behavior? If ST is indeed NOT atomic when used this way, does that explain the result I see (only 1 byte changed?!?)? There is no problem on x86.
UPDATE 1:
I've found the problem, but not the cause. GCC appears to be generating assembly that reads the shared variable byte-by-byte (thus allowing a partial update to be observed). Comments added, but I am not terribly comfortable with SPARC assembly. %i0 is a pointer to the shared variable.
xxx+0xc: ldub [%i0], %g1 // ld unsigned byte g1 = [i0] -- 0 padded
xxx+0x10: ...
xxx+0x14: ldub [%i0 + 0x1], %g5 // ld unsigned byte g5 = [i0+1] -- 0 padded
xxx+0x18: sllx %g1, 0x18, %g1 // g1 = [i0+0] left shifted by 24
xxx+0x1c: ldub [%i0 + 0x2], %g4 // ld unsigned byte g4 = [i0+2] -- 0 padded
xxx+0x20: sllx %g5, 0x10, %g5 // g5 = [i0+1] left shifted by 16
xxx+0x24: or %g5, %g1, %g5 // g5 = g5 OR g1
xxx+0x28: sllx %g4, 0x8, %g4 // g4 = [i0+2] left shifted by 8
xxx+0x2c: or %g4, %g5, %g4 // g4 = g4 OR g5
xxx+0x30: ldub [%i0 + 0x3], %g1 // ld unsigned byte g1 = [i0+3] -- 0 padded
xxx+0x34: or %g1, %g4, %g1 // g1 = g4 OR g1
xxx+0x38: ...
xxx+0x3c: st %g1, [%fp + 0x7df] // store g1 on the stack
Any idea why GCC is generating code like this?
UPDATE 2: Added more info to the example code. Apologies -- I'm working with a mix of new and legacy code and it's difficult to separate what's relevant. Also, I understand that sharing a variable like this is highly discouraged in general. However, this is actually part of a lock implementation: higher-level code will use it to provide atomicity, and using pthreads or platform-specific locking is not an option here.
Because you've declared the type as packed, it gets one byte alignment, which means it must be read and written one byte at a time, as SPARC does not allow unaligned loads/stores. You need to give it 4-byte alignment if you want the compiler to use word load/store instructions:
typedef struct pkd_struct {
    uint16_t a;
    uint16_t b;
} __attribute__((packed, aligned(4))) pkd_struct_t;
Note that packed is essentially meaningless for this struct, so you could leave that out.
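For illustration, one way to make the whole-word access explicit instead of relying on code generation; the union and the __atomic builtins here are my own sketch, not part of the original answer:

#include <stdint.h>

typedef struct pkd_struct {
    uint16_t a;
    uint16_t b;
} __attribute__((aligned(4))) pkd_struct_t;

/* View the same 4 bytes either as the struct or as one 32-bit word. */
typedef union {
    pkd_struct_t s;
    uint32_t     word;
} shared_cell_t;

static shared_cell_t shared;

/* Reader: one 32-bit load, so the snapshot can never be torn. */
static pkd_struct_t load_shared(void) {
    shared_cell_t tmp;
    tmp.word = __atomic_load_n(&shared.word, __ATOMIC_ACQUIRE);
    return tmp.s;
}

/* Writer: one 32-bit store. */
static void store_shared(pkd_struct_t v) {
    shared_cell_t tmp;
    tmp.s = v;
    __atomic_store_n(&shared.word, tmp.word, __ATOMIC_RELEASE);
}

Since the cell is 4-byte aligned, GCC can honour these as single 32-bit loads and stores on SPARC, so a reader can never observe a partially updated struct.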
Answering my own question here -- this has bugged me for too long and hopefully I can save someone a bit of frustration at some point.
The problem is that although the shared data is aligned, because it is packed GCC reads it byte-by-byte.
There is some discussion here about how packing leads to load/store bloat on SPARC (and other RISC platforms, I'd assume), but in my case it has led to a race.
