6502 and little-endian conversion - emulation

For fun I'm implementing an NES emulator. I'm currently reading through documentation for the 6502 CPU and I'm a little confused.
I've seen documentation stating that because the 6502 is little-endian, when using absolute addressing mode you need to swap the bytes. I'm writing this on an x86 machine, which is also little-endian, so I don't understand why I couldn't simply cast to a uint16_t*, dereference that, and let the compiler work out the details.
I've written some simple tests in Google Test and they seem to agree with me.
// implementation of READ16
#include <cstdint>
#include <gtest/gtest.h>

#define READ16(addr) (*(uint16_t*)addr)

TEST(MemMacro, READ16) {
    uint8_t arr[] = {0xFF, 0xCC};
    uint8_t *mem = &arr[0];
    EXPECT_EQ(0xCCFF, READ16(mem));
}
This passes, so it appears my supposition is correct, but I thought I'd ask someone with more experience than I.
Is this correct for pulling out the operand in 6502 absolute addressing mode? Am I possibly missing something?

It will work for simple cases on little-endian systems, but tying your implementation to those feels unnecessary when the corresponding portable implementation is simple. Sticking to the macro, you could do this instead:
#define READ16(addr) (addr[0] + (addr[1] << 8))
(Just to be pedantic, you should also make sure that addr[1] can't be out-of-bounds, and would need to add some more parentheses if addr could be a complex expression.)
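If you prefer a function over a macro, a minimal sketch along the same lines (read16 is just a name for this example) sidesteps both the parenthesization and the double-evaluation issues:
#include <cstdint>

// Combines two consecutive bytes into a little-endian 16-bit value.
// The caller must ensure that p and p + 1 are in bounds.
static inline uint16_t read16(const uint8_t *p)
{
    return static_cast<uint16_t>(p[0] | (p[1] << 8));
}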
However, as you keep developing your emulator, you will find that it's most natural to use a pair of general-purpose read_mem() and write_mem() functions that operate on single bytes. Remember that the address space is split up into multiple regions (RAM, ROM, and memory-mapped registers from the PPU and APU), so having e.g. a single array that you index into won't work well. The fact that memory regions can be remapped by mappers also complicates things. (You won't have to worry about that for simple games though -- I recommend starting with Donkey Kong.)
What you need to do is figure out, inside your read_mem() and write_mem() functions, which region or memory-mapped register the address belongs to (this is called address decoding), and then do the right thing for that address.
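As a rough sketch (ram, ppu_read, apu_io_read and cart_read are placeholder names for your own pieces), the read side might look something like this:
#include <cstdint>

uint8_t read_mem(uint16_t addr)
{
    if (addr < 0x2000)                       // 2 KB internal RAM, mirrored every 0x800 bytes
        return ram[addr & 0x07FF];
    if (addr < 0x4000)                       // PPU registers, mirrored every 8 bytes
        return ppu_read(0x2000 + (addr & 0x0007));
    if (addr < 0x4020)                       // APU and I/O registers
        return apu_io_read(addr);
    return cart_read(addr);                  // cartridge space: PRG ROM/RAM, handled by the mapper
}
write_mem() decodes the same ranges but dispatches writes instead.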
Returning to the original question, the fact that you'll end up using read_mem() to read the individual bytes of the address anyway means that the uint16_t casting trickery is even less likely to be useful. Reading the two bytes separately is the simplest and most robust approach w.r.t. handling corner cases, and it's what every emulator I've seen does in practice (Nestopia, Nintendulator, and FCEUX).
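Concretely, fetching a 16-bit absolute operand then just becomes two read_mem() calls (fetch_abs_operand and pc are names made up for this sketch):
uint16_t fetch_abs_operand(uint16_t pc)
{
    uint8_t lo = read_mem(pc);               // low byte comes first (little-endian)
    uint8_t hi = read_mem(pc + 1);
    return static_cast<uint16_t>(lo | (hi << 8));
}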
In case you've missed it, the #nesdev channel on EFNet is very active and a good resource by the way. I assume you're already familiar with the NESDev wiki. :)
I've also been working on an emulator which can be found here.

Related

How do you approach creating a complete new datatype on the "bit-level"?

I would like to create a new data type in Rust on the "bit-level".
For example, a quadruple-precision float. I could create a structure that has two double-precision floats and arbitrarily increase the precision by splitting the quad into two doubles, but I don't want to do that (that's what I mean by on the "bit-level").
I thought about using a u8 array or a bool array, but in both cases I waste 7 bits of memory (because a bool is also a whole byte). I know there are several crates that implement something like bit-arrays or bit-vectors, but looking through their source code didn't help me understand their implementation.
How would I create such a bit-array without wasting memory, and is this the way I would want to choose when implementing something like a quad-precision type?
I don't know how to implement new data types that don't use the basic types or are structures that combine the basic types, and I haven't been able to find a solution on the internet yet; maybe I'm not searching with the right keywords.
The question you are asking has no direct answer: just like any other programming language, Rust has a basic set of rules for type layouts. This is because (most) real-world CPUs can't address individual bits, need certain alignments when referencing memory, have rules about how pointer arithmetic works, and so on.
For instance, if you create a type of just two bits, you'll still need an 8-bit byte to represent it, because there is simply no way to address two individual bits with most CPUs' opcodes; there is also no way to take the address of such a type, because addressing never goes below the byte level. More useful information on this can be found here, section 2, The Anatomy of a Type. Be aware that the non-wasting bit-level type you are thinking about would need to fulfil all the rules mentioned there.
A perfectly reasonable approach is to represent what you want as a single wrapped u128 and implement all the arithmetic on top of that type. Another, more generic approach would be to use a Vec<u8>. Either way you'll end up doing a fair amount of bit-masking, shifting and indirection.
Having a look at rust_decimal or similar crates might also be a good idea.

Direct3D 11/HLSL Texture3D<float3> False Error?

I am getting this error:
D3D11 ERROR: ID3D11DeviceContext::Dispatch: The Shader Resource View in slot 0 of the Compute Shader unit is using the Format (R32G32B32_FLOAT). This format does not support 'Sample', 'SampleLevel', 'SampleBias' or 'SampleGrad', at least one of which may being used on the Resource by the shader. This mismatch is invalid if the shader actually uses the view (e.g. it is not skipped due to shader code branching). [ EXECUTION ERROR #371: DEVICE_DRAW_RESOURCE_FORMAT_SAMPLE_UNSUPPORTED]
Here's my working code:
Texture3D< float4 > g_VectorField;
float3 ... = g_VectorField.SampleLevel( ... ).rgb;
Here's my code that is causing the error:
Texture3D< float3 > g_VectorField;
float3 ... = g_VectorField.SampleLevel( ... );
My application works just fine when I'm not catching Direct3D errors (D3D11_CREATE_DEVICE_DEBUG disabled). SampleLevel's behavior is not undefined as far as I can see. It's behaving exactly the same as in the first snippet, yet it is giving me this error. This format does not support 'SampleLevel' my ass.
It seems that this error can be ignored without causing undefined behavior, so why is it an error?
It's perfectly valid for "undefined behaviour" to give you the results you expected. The problem is that it's also perfectly valid for it to crash the graphics driver or reinstall your operating system, so to speak. The "warning" comes from DirectX, but the actual operation is performed by the graphics card. So you're doing an operation that's invalid for DirectX 11, but a specific GPU might support it. Others don't have to. Future cards don't have to either. And it might be doing something really stupid to allow the sampling, such as converting the resource to float4.
Take C++ for example - the order of evaluation of function parameters is undefined. That means that while it might be doing the thing you expect it to do on your compiler and your computer, it might be different in a different place (different optimizations possible, for example), or on a different compiler / computer. The code is no longer portable or reliable.
As for why this particular operation is undefined, it's hard to tell. Different texture formats behave in very different ways - some recalibrate gamma, some premultiply the alpha... It might very well be that no one expected your particular type to be widely used, so it wasn't worth the extra branch in the specification. It might be that they didn't include it in the specification because enough GPUs didn't support it. Does it actually give you a measurable performance benefit? If the support is implemented by translating the texture, for example, the float3 variant might actually be slower.
Of course, in general computing, powers of two are usually preferred when storing data, because they make some operations much easier. GPUs might very well still depend on the old-school bit-hacks to handle some operations - 128 is nice, 96... not so much. Maybe they still store data in 128-bit registers, so there's no point in using 96. Maybe, maybe, maybe all the way :D
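If you want to know whether the GPU you are actually running on supports sampling that format, you can ask the device at run time. A minimal sketch, assuming device is your ID3D11Device*:
#include <d3d11.h>

UINT support = 0;
if (SUCCEEDED(device->CheckFormatSupport(DXGI_FORMAT_R32G32B32_FLOAT, &support)) &&
    (support & D3D11_FORMAT_SUPPORT_SHADER_SAMPLE))
{
    // This particular device/driver can sample R32G32B32_FLOAT; others may not.
}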
EDIT: I found something relevant in the DirectX documentation:
A resource declared with the DXGI_FORMAT_R32G32B32 family of formats cannot be used simultaneously for vertex and texture data. That is, you may not create a buffer resource with the DXGI_FORMAT_R32G32B32 family of formats that uses any of the following bind flags: D3D10_BIND_VERTEX_BUFFER, D3D10_BIND_INDEX_BUFFER, D3D10_BIND_CONSTANT_BUFFER, or D3D10_BIND_STREAM_OUTPUT.
So it seems that it's fine to use R32G32B32 for an actual texture, but not for a vertex/index/constant buffer.

Atomic Compare And Swap with struct in Go

I am trying to create a non-blocking queue package for concurrent application using the algorithm by Maged M. Michael and Michael L. Scott as described here.
This requires the use of atomic CompareAndSwap which is offered by the "sync/atomic" package.
I am however not sure what the Go-equivalent to the following pseudocode would be:
E9: if CAS(&tail.ptr->next, next, <node, next.count+1>)
where tail and next are of type:
type pointer_t struct {
    ptr   *node_t
    count uint
}
and node is of type:
type node_t struct {
    value interface{}
    next  pointer_t
}
If I understood it correctly, it seems that I need to do a CAS with a struct (both a pointer and a uint). Is this even possible with the atomic-package?
Thanks for help!
If I understood it correctly, it seems that I need to do a CAS with a struct (both a pointer and a uint). Is this even possible with the atomic-package?
No, that is not possible. Most architectures only support atomic operations on a single word. A lot of academic papers, however, use more powerful CAS statements (e.g. a double-width compare-and-swap) that are not generally available today. Luckily there are a few tricks that are commonly used in such situations:
You could, for example, steal a couple of bits from the pointer (especially on 64-bit systems) and use them to encode your counter. Then you could simply use Go's CompareAndSwapPointer, but you would need to mask out the relevant bits of the pointer before you try to dereference it.
The other possibility is to work with pointers to your (immutable!) pointer_t struct. Whenever you want to modify an element of your pointer_t struct, you would have to create a copy, modify the copy, and atomically replace the pointer to the struct. This idiom is called COW (copy on write) and works with arbitrarily large structures. If you want to use this technique, you would have to change the next attribute to next *pointer_t.
I have recently written a lock-free list in Go for educational reasons. You can find the (imho well documented) source here: https://github.com/tux21b/goco/blob/master/list.go
This rather short example uses atomic.CompareAndSwapPointer extensively and also introduces an atomic type for marked pointers (the MarkAndRef struct). This type is very similar to your pointer_t struct (except that it stores a bool+pointer instead of an int+pointer). It's used to ensure that a node has not been marked as deleted while you are trying to insert an element directly after it. Feel free to use this source as a starting point for your own projects.
You can do something like this:
if atomic.CompareAndSwapPointer(
    (*unsafe.Pointer)(unsafe.Pointer(&tail.ptr.next)),
    unsafe.Pointer(&next),
    unsafe.Pointer(&pointer_t{&node, next.count + 1}),
) {
    // ...
}

What's going on in the 'offsetof' macro?

The Visual C++ 2008 C runtime offers an operator 'offsetof', which is actually a macro defined as:
#define offsetof(s,m) (size_t)&reinterpret_cast<const volatile char&>((((s *)0)->m))
This allows you to calculate the offset of the member variable m within the class s.
What I don't understand in this declaration is:
Why are we casting m to anything at all and then dereferencing it? Wouldn't this have worked just as well:
&(((s*)0)->m)
?
What's the reason for choosing char reference (char&) as the cast target?
Why use volatile? Is there a danger of the compiler optimizing the loading of m? If so, in what exact way could that happen?
An offset is in bytes. So to get a number expressed in bytes, you have to cast the addresses to char, because that is the same size as a byte (on this platform).
The use of volatile is perhaps a cautious step to ensure that no compiler optimisations (either that exist now or may be added in the future) will change the precise meaning of the cast.
Update:
If we look at the macro definition:
(size_t)&reinterpret_cast<const volatile char&>((((s *)0)->m))
With the cast-to-char removed it would be:
(size_t)&((((s *)0)->m))
In other words, get the address of member m in an object at address zero, which does look okay at first glance. So there must be some way that this would potentially cause a problem.
One thing that springs to mind is that operator & may be overloaded on whatever type m happens to be. If so, the simplified macro above would be executing arbitrary code on an "artificial" object that is somewhere quite close to address zero. This would probably cause an access violation.
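A hypothetical illustration (Evil and S are made-up names) of how an overloaded operator& would hijack the simplified form, while the char& cast falls back to the built-in address-of:
struct Evil
{
    int payload;
    Evil *operator&() { return 0; }          // pathological overload
};

struct S
{
    int a;
    Evil b;
};

// &(((S *)0)->b) would call Evil::operator&() on an object near address zero.
// (size_t)&reinterpret_cast<const volatile char&>(((S *)0)->b) applies the
// built-in & to a char lvalue instead, so the overload never runs.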
This kind of abuse may be outside the applicability of offsetof, which is supposed to only be used with POD types. Perhaps the idea is that it is better to return a junk value instead of crashing.
(Update 2: As Steve pointed out in the comments, there would be no similar problem with operator ->)
offsetof is something to be very careful with in C++. It's a relic from C. These days we are supposed to use member pointers. That said, I believe that member pointers to data members are overdesigned and broken - I actually prefer offsetof.
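For reference, a minimal sketch of the member-pointer alternative just mentioned (Point and field are names made up for the example):
#include <cassert>

struct Point { int x; int y; };

int main()
{
    int Point::*field = &Point::y;    // pointer to data member; no byte offsets involved
    Point p = { 1, 2 };
    assert(p.*field == 2);            // reads p.y through the member pointer
    return 0;
}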
Even so, offsetof is full of nasty surprises.
First, for your specific questions, I suspect the real issue is that they've adapted the traditional C macro (which I thought was mandated by the C++ standard). They probably use reinterpret_cast for "it's C++!" reasons (so why the (size_t) cast?), and a char& rather than a char* to try to simplify the expression a little.
Casting to char looks redundant in this form, but probably isn't. (size_t) is not equivalent to reinterpret_cast, and if you try to cast pointers to other types into integers, you run into problems. I don't think the compiler even allows it, but to be honest, I'm suffering memory failure ATM.
The fact that char is a single byte type has some relevance in the traditional form, but that may only be why the cast is correct again. To be honest, I seem to remember casting to void*, then char*.
Incidentally, having gone to the trouble of using C++-specific stuff, they really should be using std::ptrdiff_t for the final cast.
Anyway, coming back to the nasty surprises...
VC++ and GCC probably won't use that macro. IIRC, they have a compiler intrinsic, depending on options.
The reason is to do what offsetof is intended to do, rather than what the macro does, which is reliable in C but not in C++. To understand this, consider what would happen if your struct uses multiple or virtual inheritance. In the macro, when you dereference a null pointer, you end up trying to access a virtual table pointer that isn't there at address zero, meaning that your app probably crashes.
For this reason, some compilers have an intrinsic that just uses the specified struct's layout instead of trying to deduce a run-time type. But the C++ standard doesn't mandate or even suggest this - it's only there for C compatibility reasons. And you still have to be careful if you're working with class hierarchies, because as soon as you use multiple or virtual inheritance, you cannot assume that the layout of the derived class matches the layout of the base class - you have to ensure that the offset is valid for the exact run-time type, not just a particular base.
If you're working on a data structure library, maybe using single inheritance for nodes, but apps cannot see or use your nodes directly, offsetof works well. But strictly speaking, even then, there's a gotcha. If your data structure is in a template, the nodes may have fields with types from template parameters (the contained data type). If that isn't POD, technically your structs aren't POD either. And all the standard demands for offsetof is that it works for POD. In practice, it will work - your type hasn't gained a virtual table or anything just because it has a non-POD member - but you have no guarantees.
If you know the exact run-time type when you dereference using a field offset, you should be OK even with multiple and virtual inheritance, but ONLY if the compiler provides an intrinsic implementation of offsetof to derive that offset in the first place. My advice - don't do it.
Why use inheritance in a data structure library? Well, how about...
class node_base { ... };
class leaf_node : public node_base { ... };
class branch_node : public node_base { ... };
The fields in the node_base are automatically shared (with identical layout) in both the leaf and branch, avoiding a common error in C with accidentally different node layouts.
BTW - offsetof is avoidable with this kind of stuff. Even if you are using offsetof for some jobs, node_base can still have virtual methods and therefore a virtual table, so long as it isn't needed to dereference member variables. Therefore, node_base can have pure virtual getters, setters and other methods. Normally, that's exactly what you should do. Using offsetof (or member pointers) is a complication, and should only be used as an optimisation if you know you need it. If your data structure is in a disk file, for instance, you definitely don't need it - a few virtual call overheads will be insignificant compared with the disk access overheads, so any optimisation efforts should go into minimising disk accesses.
Hmmm - went off on a bit of a tangent there. Whoops.
char is guaranteed to be the smallest number of bits the architecture can "bite" off (a.k.a. a byte).
All pointers are really just numbers, so cast address 0 to that type, because it's the starting point.
Take the address of the member starting from 0 (resulting in 0 + location_of_m).
Cast that back to size_t.
1) I also do not know why it is done in this way.
2) The char type is special in two ways.
No other type has weaker alignment restrictions than the char type. This is important for reinterpret_cast between pointers, and between an expression and a reference.
It is also the only type (together with its unsigned variant) for which the specification defines the behavior when it is used to access the stored value of an object of a different type. I do not know whether that applies to this specific situation.
3) I think the volatile modifier is used to ensure that no compiler optimization will result in an attempt to read the memory.
2. What's the reason for choosing char reference (char&) as the cast target?
If the type involved has operator& overloaded, then we can't get an address using &, so we reinterpret_cast to the primitive type char, which doesn't have operator& overloaded; now we can get the address from that. In C, the reinterpret_cast is not required.
3. Why use volatile? Is there a danger of the compiler optimizing the loading of m? If so, in what exact way could that happen?
Here volatile is not about compiler optimization. If the type has const or volatile qualifiers (or both), then reinterpret_cast can't cast to plain char&, because reinterpret_cast can't remove cv-qualifiers. Casting to <const volatile char&> therefore makes the cast work for any combination.

Why doesn't a swap / exchange operator exist in imperative or OO languages like C/C++/C#/Java...?

I was always wondering why such a simple and basic operation like swapping the contents of two variables is not built-in for many languages.
It is one of the most basic programming exercises in computer science classes; it is heavily used in many algorithms (e.g. sorting); every now and then one needs it and one must use a temporary variable or use a template/generic function.
It is even a basic machine instruction on many processors, so that the standard scheme with a temporary variable will get optimized.
Many less obvious operators have been created, like the compound assignment operators (e.g. +=, which was probably created to mirror the underlying machine instructions, e.g. add ax,bx), or the ?? operator in C#.
So, what is the reason? Or does it actually exist, and I always missed it?
In my experience, it isn't that commonly needed in actual applications, apart from the already-mentioned sort algorithms and occasionally in low level hardware poking, so in my view it's a bit too special-purpose to have in a general-purpose language.
As has also been mentioned, not all processors support it as an instruction (and many do not support it for objects bigger than a word). So if it were supported with some useful additional semantics (e.g. being an atomic operation), it would be difficult to support on some processors, and if it didn't have the additional semantics then it's just (seldom used) syntactic sugar.
The assignment operators (+= etc.) were supported because they are much more common in real-world programs - so the syntactic sugar they provide was more useful, and it also served as an optimisation - remember C dates from the late 60s/early 70s, and compiler optimisation wasn't as advanced (and the machines were less capable, so you didn't want lengthy optimisation passes anyway).
Paul
C++ does have swapping.
#include <algorithm>
#include <cassert>

int main()
{
    using std::swap;
    int a(3), b(5);
    swap(a, b);
    assert(a == 5 && b == 3);
}
Furthermore, you can specialise swap for custom types too!
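For example, a minimal sketch (Buffer and mylib are made-up names) of the usual pattern: a non-member swap in the type's own namespace, which the using std::swap; swap(a, b); idiom above picks up via argument-dependent lookup:
#include <algorithm>
#include <cstddef>

namespace mylib
{
    struct Buffer
    {
        char *data;
        std::size_t size;
    };

    // Unqualified swap(a, b) after "using std::swap;" will prefer this
    // overload for Buffer arguments.
    void swap(Buffer &lhs, Buffer &rhs)
    {
        std::swap(lhs.data, rhs.data);
        std::swap(lhs.size, rhs.size);
    }
}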
It's a widely used example in computer science courses, but I almost never find myself needing it in real code - whereas I use += very frequently.
Yes, in sorting it would be handy - but you don't tend to need to implement sorting yourself, so the number of actual uses in source code would still be pretty low.
You do have the XOR trick, which swaps two variables of a primitive type without a temporary...
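For reference, a sketch of the trick being alluded to:
void xor_swap(int &a, int &b)
{
    // Works for integer types only, and breaks if a and b alias the same object.
    a ^= b;
    b ^= a;
    a ^= b;
}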
I think they just forgot to add it :-) Yes, not all CPUs have this kind of instruction - so what? We have a bunch of other things that most CPUs don't have instructions to compute. It would be much easier/clearer, and also faster (via an intrinsic), if we had it!
