Resolve union structure in Rust FFI - rust

I have problem with resolving c-union structure XEvent.
I'm experimenting with Xlib and X Record Extension in Rust. I'm generate ffi-bindings with rust-bindgen. All code hosted on github alxkolm/rust-xlib-record.
Trouble happen on line src/main.rs:106 when I try extract data from XEvent structure.
let key_event: *mut xlib::XKeyEvent = event.xkey();
println!("KeyPress {}", (*key_event).keycode); // this always print 128 on any key
My program listen key events and print out keycode. But it is always 128 on any key I press. I think this wrong conversion from C union type to Rust type.
Definition of XEvent starts here src/xlib.rs:1143. It's the c-union. Original C definition here.
Code from GitHub can be run by cargo run command. It's compile without errors.
What I do wrong?

Beware that rustbindgen generates binding to C union with as much safety as in C; as a result, when calling:
event.xkey(); // gets the C union 'xkey' field
There is no runtime check that xkey is the field currently containing a value.
This is because since C does not have tagged union (ie, the union knowing which field is currently in use), developers came up with various ways of encoding this information (*), the two that I know of being:
an external supplier; typically another field of the structure right before the union
the first field of each of the structures in the union
Here, you are in the latter case int type; is the first field of the union and each nested structure starts with int _type; to denote this. As a result, you need a two-steps approach:
consult type()
depending on the value, call the correct reinterpretation
The mapping from the value of type to the actual field being used should be part of the documentation of the C library, hopefully.
I invite you to come up with a wrapper around this low-level union that will make it safer to retrieve the result. At the very least, you could check it is the right type in the accessor; the full approach being to come up with a Rust enum that would wrap proxies to all the fields and allow pattern-matching.
(*) and actually sometimes disregard it altogether, for example in C99 to reinterpret a float as an int a union { float f; int i; } can be used.

Related

Extra [ ] Operator Added When Using 2 overloaded [ ] Operators

As a side project I'm writing a couple of classes to do matrix operations and operations on linear systems. The class LinearSystem holds pointers to Matrix class objects in a std::map. The Matrix class itself holds the 2d array ("matrix") in a double float pointer. To I wrote 2 overloaded [ ] operators, one to return a pointer to a matrix object directly from the LinearSystem object, and another to return a row (float *) from a matrix object.
Both of these operators work perfectly on their own. I'm able to use LinearSystem["keyString"] to get a matrix object pointer from the map. And I'm able to use Matrix[row] to get a row (float *) and Matrix[row][col] to get a specific float element from a matrix objects' 2d array.
The trouble comes when I put them together. My limited understanding (rising senior CompSci major) tells me that I should have no problem using LinearSystem["keyString"][row][col] to get an element from a specific array within the Linear system object. The return types should look like LinearSystem->Matrix->float *->float. But for some reason it only works when I place an extra [0] after the overloaded key operator so the call looks like this: LinearSystem["keyString"][0][row][col]. And it HAS to be 0, anything else and I segfault.
Another interesting thing to note is that CLion sees ["keyString"] as overloaded and [row] as overloaded, but not [0], as if its calling the standard index operator, but on what is the question that has me puzzled. LinearSystem["keyString"] is for sure returning Matrix * which only has an overloaded [ ] operator. See the attached screenshot.
screenshot
Here's the code, let me know if more is needed.
LinearSystem [ ] and map declaration:
Matrix *myNameSpace::LinearSystem::operator[](const std::string &name) {
return matrices[name];
}
std::map<std::string, Matrix *> matrices;
Matrix [ ] and array declaration:
inline float *myNameSpace::Matrix::operator[](const int row) {
return elements[row];
}
float **elements;
Note, the above function is inline'd because I'm challenging myself to make the code as fast as possible and even with compiler optimizations, the overloaded [ ] was 15% to 30% slower than using Matrix.elements[row].
Please let me know if any more info is needed, this is my first post so it I'm sure its not perfect.
Thank you!
You're writing C in C++. You need to not do that. It adds complexity. Those stars you keep putting after your types. Those are raw pointers, and they should almost always be avoided. linearSystem["foo"] is a Matrix*, i.e. a pointer to Matrix. It is not a Matrix. And pointers index as arrays, so linearSystem["foo"][0] gets the first element of the Matrix*, treating it as an array. Since there's actually only one element, this works out and seems to do what you want. If that sounds confusing, that's because it is.
Make sure you understand ownership semantics. Raw pointers are frowned upon because they don't convey any information. If you want a value to be owned by the data structure, it should be a T directly (or a std::unique_ptr<T> if you need virtual inheritance), not a T*. If you're taking a function argument or returning a value that you don't want to pass ownership of, you should take/return a const T&, or T& if you need to modify the referent.
I'm not sure exactly what your data looks like, but a reasonable representation of a Matrix class (if you're going for speed) is as an instance variable of type std::vector<double>. std::vector is a managed array of double. You can preallocate the array to whatever size you want and change it whenever you want. It also plays nice with Rule of Three/Five.
I'm not sure what LinearSystem is meant to represent, but I'm going to assume the matrices are meant to be owned by the linear system (i.e. when the system is freed, the matrices should have their memory freed as well), so I'm looking at something like
std::map<std::string, Matrix> matrices;
Again, you can wrap Matrix in a std::unique_ptr if you plan to inherit from it and do dynamic dispatch. Though I might question your design choices if your Matrix class is intended to be subclassed.
There's no reason for Matrix::operator[] to return a raw pointer to float (in fact, "raw pointer to float" is a pretty pointless type to begin with). I might suggest having two overloads.
float myNameSpace::Matrix::operator[](int row) const;
float& myNameSpace::Matrix::operator[](int row);
Likewise, LinearSystem::operator[] can have two overloads: one constant and one mutable.
const Matrix& myNameSpace::LinearSystem::operator[](const std::string& name) const;
Matrix& myNameSpace::LinearSystem::operator[](const std::string& name);
References (T& as opposed to T*) are smart and will effectively dereference when needed, so you can call Matrix::operator[] on a Matrix&, whereas you can't call that on a Matrix* without acknowledging the layer of indirection.
There's a lot of bad C++ advice out there. If the book / video / teacher you're learning from is telling you to allocate float** everywhere, then it's a bad book / video / teacher. C++-managed data is going to be far less error-prone and will perform comparably to raw pointers (the C++ compiler is smarter than either you or I when it comes to optimization, so let it do its thing).
If you do find yourself really feeling the need to go low-level and allocate raw pointers everywhere, then switch to C. C is a language designed for low-level pointer manipulation. C++ is a higher-level managed-memory language; it just so happens that, for historical reasons, C++ has several hundred footguns in the form of C-style allocations placed sporadically throughout the standard.
In summary: Modern C++ almost never uses T*, new, or delete. Start getting comfortable with smart pointers (std::unique_ptr and std::shared_ptr) and references. You'll thank yourself later.

Why do some struct types let us set members that can only be a certain value?

I was reading up on some vulkan struct types, this is one of many examples, but the one I will use is vkInstanceCreateInfo. The documentation states:
The VkInstanceCreateInfo structure is defined as:
typedef struct VkInstanceCreateInfo {
VkStructureType sType;
const void* pNext;
VkInstanceCreateFlags flags;
const VkApplicationInfo* pApplicationInfo;
uint32_t enabledLayerCount;
const char* const* ppEnabledLayerNames;
uint32_t enabledExtensionCount;
const char* const* ppEnabledExtensionNames;
} VkInstanceCreateInfo;
Then below in the options we see:
sType is the type of this structure
sType must be VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO
If we dont have any options anyway, why is this parameter not just set implicitly upon creation of the type?
Note: I realise this is not something specific to the vulkan API.
Update: I'm not just talking specifically about vulkan, just all parameters that can only be a certain type.
The design allows structures to be chained together so that extensions can create additional parameters to existing calls without interfering with the original API structures and without interfering with each other.
Nearly every struct in Vulkan has sType as it's first member, and pNext as it's second member. That means that if you have a void* and all you know is that it is some kind of Vulkan API struct, you can safely read the first 32 bits and it will be a VkStructureType and read the next 32 or 64 bits and it will tell you if there's another structure in the chain.
So for instance, there's a VkMemoryAllocateInfo structure for allocating memory that has (aside from sType and pNext the size of the allocation and the heap index it should come from. But what if I want to use the "dedicated allocation" extension. Then I also need to fill out a VkMemoryDedicatedAllocateInfo structure with extra information. But I still need to call the same vkAllocateMemory function that only takes a VkMemoryAllocateInfo... so where do I put the VkMemoryDedicatedAllocateInfo structure I filled out? I put a pointer to it in the pNext field of VkMemoryAllocateInfo.
Maybe I also want to share this memory with some OpenGL code. There's an extension that lets you do that, but you need to fill out a VkExportMemoryAllocateInfo structure and pass it in during the allocation as well. Well, I can do that by putting it in the pNext field of my VkMemoryDedicatedAllocateInfo structure. I can create a chain of structures like that as long as I want.
Here's the really important part. Since all structures have sType as their first field, an extension can navigate along this chain of structures and find the ones it cares about without knowing anything about the structures other than that they always start with sType and pNext.
All of this means that Vulkan can be extended in ways that alter the behavior of existing functions, but without changing the function itself, or the structures that are passed to it.
You might ask why all of the core structures have sType and pNext, even though you're passing them to functions with typed pointers, rather than void pointers. The reason is consistency, and because you never know when an existing structure might be needed as part of the chain for some new extension.
If we dont have any options anyway, why is this parameter not just set implicitly upon creation of the type?
Because C isn't C++. There's no way to declare a structure in C and say that this portion of the structure will always have this value. In C++ you can, by declaring something as const and providing the initial default value. In fact, one of the things I like about the Vulkan C++ bindings is that you can basically forget about sType forever. If you're using extensions you still need to populate pNext as appropriate.

How to map a structure from a buffer like in C with a pointer and cast

In C, I can define many structures and structure of structures.
From a buffer, I can just set the pointer at the beginning of this structure to say this buffer represents this structure.
Of course, I do not want to copy anything, just mapping, otherwise I loose the benefit of the speed.
Is it possible in NodeJs ? How can I do ? How can I be sure it's a mapping and not creating a new object and copy information inside ?
Example:
struct House = {
uint8 door,
uint16BE kitchen,
etc...
}
var mybuff = Buffer.allocate(10, 0)
var MyHouse = new House(mybuff) // same as `House* MyHouse = (House*) mybuff`
console.log(MyHouse.door) // will display the value of door
console.log(MyHouse.kitchen) // will display the value of kitchen with BE function.
This is wrong but explain well what I am looking for.
This without copying anything.
And if I do MyHouse.door=56, mybuff contains know the 56. I consider mybuff as a pointer.
Edit after question update below
Opposed to C/C++, javascript uses pionters by default, so you don't have to do anything. It's the other way around, actually: You have to put some effort in if you want a copy of the current object.
In C, a struct is nothing more than a compile-time reference to different parts of data in the struct. So:
struct X {
int foo;
int bar;
}
is nothing more than saying: if you want bar from a variable with type X, just add the length of foo (length of int) to the base pointer.
In Javascript, we do not even have such a type. We can just say:
var x = {
foo: 1,
bar: 2
}
The lookup of bar will automatically be a pointer (we call them references in javascript) lookup. Because javascript does not have types, you can view an object as a map/dictionary with pointers to mixed types.
If you, for any reason, want to create a copy of a datastructure, you would have to iterate through the entire datastructure (recursively) and create a copy of the datastructure manually. The basic types are not pointer based. These include number (Javascript automatically differentiates between int and float under the hood), string and boolean.
Edit after question update
Although I am not an expert on this area, I do not think it is possible. The problem is, the underlying data representation (as in how the data is represented as bytes in memory) is different, because javascript does not have compile-time information about data structures. As I said before, javascript doesn't have classes/structs, just objects with fields, which basically behave (and may be implemented as) maps/dictionaries.
There are, however, some third party libraries to cope with these problems. There are two general approaches:
Unpack everything to javascript objects. The data will be copied, but you can work with it as normal javascript objects. You should use this if you read/write the data intensively, because the performance increase you get when working with normal javascript objects outweighs the advantage of not having to unpack the data. Link to example library
Leave all data in the buffer. When you need some of the data, compute the location of the data in the buffer at runtime, and read/write at this location accordingly. Because the struct data location computations are done in runtime, you should use this only when you have loads of data and only a few reads/writes to it. In this case the performance decrease of unpacking all data outweighs the few runtime computations that have to be done. Link to example library
As a side-note, if the amount of data you have to process isn't that much, I'd recommend to just unpack the data. It saves you the headache of having to use the library as interface to your data. Computers are fast enough nowadays to copy/process some amount of data in memory. Also, these third party libraries are just some examples. I recommend you do a little more research for libraries to decide which one suits your needs.

Haskell FFI for C recursive struct and union

I am trying to write Haskell FFI binding for some C structs. An example is below:
typedef struct s0{int a;
union{unsigned char b;
struct s0*c;
struct{unsigned char d[1];
}; };}*S;
My question is how to write the binding for it in chs (for c2hs) or hsc (for hsc2hs) format? I looked into the tutorials for c2hs but either didn't get enough information, or didn't understand it in the way, that would have helped me write the chs file for above definition.
I can generate haskell bindings using HSFFIG tool but it uses custom access method HSFFIG.FieldAccess.FieldAccess to define bindings. I prefer to write bindings that use core haskell FFI libraries, not third-party libraries.
Hence, this question about how to write a binding for recursive struct above in hsc format, or chs format that uses only core FFI libraries.
The actual definition is more complex, but once I figure out how to write above struct definition for c2hs or hsc2hs tools, I can go from there. I know Storable instances need to be defined for inner union and struct as well, but I don't know how to write wrappers for recursive definition like above. Especially, how do you access inside struct/union from outer struct? I looked into HSFFIG definitions, but the access methods are HSFFIG defined access methods. So, I was unable to figure out how to translate it to chs definition that uses only core FFI libraries.
The questions I have seen in StackOverflow seem to be about simpler definitions. If there is a similar answer somewhere else, I will appreciate pointers.
You can't magic up an equiv data structure in either c2hs or hsc2hs. However, you can do your own marshaling in c2hs with only a little work.
data MyType = Next MyType | MyChar Char | MyString String | MyEnd
Then use hsc2hs' newtype pointer functionality to declare a pointer for MyType (ie. s0). Then, write an explicit function using hsc2hs' accessors to recursively walk your structure and build up your Haskell structure. At each step you test if you've hit a null pointer and if so return a MyEnd (or, depending on the data encoding, just check if the int indicating the type in the union is negative one or whatever), otherwise proceed to parse out whatever you have, and if that thing is a pointer, proceed recursively.
You could do nearly the same thing with hsc2hs as well.

What's going on in the 'offsetof' macro?

Visual C++ 2008 C runtime offers an operator 'offsetof', which is actually macro defined as this:
#define offsetof(s,m) (size_t)&reinterpret_cast<const volatile char&>((((s *)0)->m))
This allows you to calculate the offset of the member variable m within the class s.
What I don't understand in this declaration is:
Why are we casting m to anything at all and then dereferencing it? Wouldn't this have worked just as well:
&(((s*)0)->m)
?
What's the reason for choosing char reference (char&) as the cast target?
Why use volatile? Is there a danger of the compiler optimizing the loading of m? If so, in what exact way could that happen?
An offset is in bytes. So to get a number expressed in bytes, you have to cast the addresses to char, because that is the same size as a byte (on this platform).
The use of volatile is perhaps a cautious step to ensure that no compiler optimisations (either that exist now or may be added in the future) will change the precise meaning of the cast.
Update:
If we look at the macro definition:
(size_t)&reinterpret_cast<const volatile char&>((((s *)0)->m))
With the cast-to-char removed it would be:
(size_t)&((((s *)0)->m))
In other words, get the address of member m in an object at address zero, which does look okay at first glance. So there must be some way that this would potentially cause a problem.
One thing that springs to mind is that the operator & may be overloaded on whatever type m happens to be. If so, this macro would be executing arbitrary code on an "artificial" object that is somewhere quite close to address zero. This would probably cause an access violation.
This kind of abuse may be outside the applicability of offsetof, which is supposed to only be used with POD types. Perhaps the idea is that it is better to return a junk value instead of crashing.
(Update 2: As Steve pointed out in the comments, there would be no similar problem with operator ->)
offsetof is something to be very careful with in C++. It's a relic from C. These days we are supposed to use member pointers. That said, I believe that member pointers to data members are overdesigned and broken - I actually prefer offsetof.
Even so, offsetof is full of nasty surprises.
First, for your specific questions, I suspect the real issue is that they've adapted relative to the traditional C macro (which I thought was mandated in the C++ standard). They probably use reinterpret_cast for "it's C++!" reasons (so why the (size_t) cast?), and a char& rather than a char* to try to simplify the expression a little.
Casting to char looks redundant in this form, but probably isn't. (size_t) is not equivalent to reinterpret_cast, and if you try to cast pointers to other types into integers, you run into problems. I don't think the compiler even allows it, but to be honest, I'm suffering memory failure ATM.
The fact that char is a single byte type has some relevance in the traditional form, but that may only be why the cast is correct again. To be honest, I seem to remember casting to void*, then char*.
Incidentally, having gone to the trouble of using C++-specific stuff, they really should be using std::ptrdiff_t for the final cast.
Anyway, coming back to the nasty surprises...
VC++ and GCC probably won't use that macro. IIRC, they have a compiler intrinsic, depending on options.
The reason is to do what offsetof is intended to do, rather than what the macro does, which is reliable in C but not in C++. To understand this, consider what would happen if your struct uses multiple or virtual inheritance. In the macro, when you dereference a null pointer, you end up trying to access a virtual table pointer that isn't there at address zero, meaning that your app probably crashes.
For this reason, some compilers have an intrinsic that just uses the specified structs layout instead of trying to deduce a run-time type. But the C++ standard doesn't mandate or even suggest this - it's only there for C compatibility reasons. And you still have to be careful if you're working with class heirarchies, because as soon as you use multiple or virtual inheritance, you cannot assume that the layout of the derived class matches the layout of the base class - you have to ensure that the offset is valid for the exact run-time type, not just a particular base.
If you're working on a data structure library, maybe using single inheritance for nodes, but apps cannot see or use your nodes directly, offsetof works well. But strictly speaking, even then, there's a gotcha. If your data structure is in a template, the nodes may have fields with types from template parameters (the contained data type). If that isn't POD, technically your structs aren't POD either. And all the standard demands for offsetof is that it works for POD. In practice, it will work - your type hasn't gained a virtual table or anything just because it has a non-POD member - but you have no guarantees.
If you know the exact run-time type when you dereference using a field offset, you should be OK even with multiple and virtual inheritance, but ONLY if the compiler provides an intrinsic implementation of offsetof to derive that offset in the first place. My advice - don't do it.
Why use inheritance in a data structure library? Well, how about...
class node_base { ... };
class leaf_node : public node_base { ... };
class branch_node : public node_base { ... };
The fields in the node_base are automatically shared (with identical layout) in both the leaf and branch, avoiding a common error in C with accidentally different node layouts.
BTW - offsetof is avoidable with this kind of stuff. Even if you are using offsetof for some jobs, node_base can still have virtual methods and therefore a virtual table, so long as it isn't needed to dereference member variables. Therefore, node_base can have pure virtual getters, setters and other methods. Normally, that's exactly what you should do. Using offsetof (or member pointers) is a complication, and should only be used as an optimisation if you know you need it. If your data structure is in a disk file, for instance, you definitely don't need it - a few virtual call overheads will be insignificant compared with the disk access overheads, so any optimisation efforts should go into minimising disk accesses.
Hmmm - went off on a bit of a tangent there. Whoops.
char is guarenteed to be the smallest number of bits the architectural can "bite" (aka byte).
All pointers are actually numbers, so cast adress 0 to that type because it's the beginning.
Take the address of member starting from 0 (resulting into 0 + location_of_m).
Cast that back to size_t.
1) I also do not know why it is done in this way.
2) The char type is special in two ways.
No other type has weaker alignment restrictions than the char type. This is important for reinterpret cast between pointers and between expression and reference.
It is also the only type (together with its unsigned variant) for which the specification defines behavior in case the char is used to access stored value of variables of different type. I do not know if this applies to this specific situation.
3) I think that the volatile modifier is used to ensure that no compiler optimization will result in attempt to read the memory.
2 . What's the reason for choosing char reference (char&) as the cast target?
if type s has operator& overloaded then we can't get address using &s
so we reinterpret_cast the type s to primitive type char because primitive type char
doesn't have operator& overloaded
now we can get address from that
if in C then reinterpret_cast is not required
3 . Why use volatile? Is there a danger of the compiler optimizing the loading of m? If so, in what exact way could that happen?
here volatile is not relevant to compiler optimizing.
if type s have const or volatile or both qualifier(s) then
reinterpret_cast can't cast to char& because reinterpret_cast can't remove cv-qualifiers
so result is using <const volatile char&> for casting work from any combination

Resources