When debugging a Rust program is it possible to break execution when an Err() value is created?
This would serve the same purpose as breaking on exceptions in other languages (C++, Javascript, Java, etc.) in that it shows you the actual source of the error, rather than just the place where you unwrapped it, which is not usually very useful.
I'm using LLDB but interested in answers for any debugger. The Err I am interested in is generated deep in Serde so I cannot really modify any of the code.
I'll give this one a shot.
I believe what you want to accomplish is incompatible with how the current "one true Rust implementation" is constructed and its take on "enum constructors" -- at least without some serious hacks. I'll give my best inference about why (as of the time of writing, Thu Sep 22 2022), along with some ideas and options.
Breaking it down: finding definitions
"What happens when you "construct" an enum, anyways...?"
As Rust does not have a formal language standard or specification document, its "semantics" are not particularly precisely defined, so there is no "legal" text to really provide the "Word of God" or final authority on this topic.
So instead, let's refer to community materials and some code:
Constructors - The Rustonomicon
There is exactly one way to create an instance of a user-defined type: name it, and initialize all its fields at once:
...
That's it. Every other way you make an instance of a type is just calling a totally vanilla function that does some stuff and eventually bottoms out to The One True Constructor.
Unlike C++, Rust does not come with a slew of built-in kinds of constructor. There are no Copy, Default, Assignment, Move, or whatever constructors. The reasons for this are varied, but it largely boils down to Rust's philosophy of being explicit.
Move constructors are meaningless in Rust because we don't enable types to "care" about their location in memory. Every type must be ready for it to be blindly memcopied to somewhere else in memory. This means pure on-the-stack-but-still-movable intrusive linked lists are simply not happening in Rust (safely).
In comparison to C++, which has better-specified semantics for both constructors and std::variant<Ts...> (its closest analogue to a Rust enum), Rust does not really say anything specific about "enum constructors" except that they are part of "The One True Constructor."
"The One True Constructor" is not a well-specified Rust concept. The phrase does not appear in the Rust Reference or the other official books, and it is not a general programming-language-theory term (at least not by that exact name -- it most likely alludes to type constructors, which we'll get to). Still, you can eke out its meaning by reading further and comparing Rust to the languages it takes direct inspiration from.
In fact, where C++ might have move, copy, placement new and other types of constructors, Rust simply has a sort of universal "dumb value constructor" for all values (like struct and enum) that does not have special operational semantics besides something like "create the value, wherever it might be stored in memory".
But that's not very precise at all. What if we try to look at the definition of an enum?
Defining an Enum - The Rust Programming Language
...
We attach data to each variant of the enum directly, so there is no need for an extra struct. Here it’s also easier to see another detail of how enums work: the name of each enum variant that we define also becomes a function that constructs an instance of the enum. That is, IpAddr::V4() is a function call that takes a String argument and returns an instance of the IpAddr type. We automatically get this constructor function defined as a result of defining the enum.
Aha! They dropped the words "constructor function" -- so it's pretty much something like a fn(T, ...) -> U? So is it some sort of function? Well, as a generally introductory text, The Rust Programming Language book can be thought of as less "technical" and "precise" than The Rust Reference:
Enumerated types - The Rust Reference
An enumerated type is a nominal, heterogeneous disjoint union type, denoted by the name of an enum item. ^1 ...
...
Enum types cannot be denoted structurally as types, but must be denoted by named reference to an enum item.
...
Most of this is pretty standard -- most modern programming languages have "nominal types" (the type's name matters for type comparison) -- but the footnote here is the interesting part:
The enum type is analogous to a data constructor declaration in ML, or a pick ADT in Limbo.
This is a good lead! Rust is known for taking a large amount of inspiration from functional programming languages, which are much closer to the mathematical foundations of programming languages.
ML is a whole family of functional programming languages (e.g. OCaml, Standard ML, F#, and sometimes Haskell) and is considered one of the important defining language-families within the functional programming language space.
Limbo is an older concurrent programming language; its pick construct is one of its abstract data types and is the analogue of enum.
Both are strongly-rooted in the functional programming language space.
Summary: Rust enum in Functional Programming / Programming Language Theory
For brevity, I'll omit quotes and give a summary of the formal programming language theory behind Rust enums.
Rust enums are theoretically known as "tagged unions," "sum types," or "variants."
Functional programming and mathematical type theory place a strong emphasis on modeling computation as basically "changes in typed-value structure" versus "changes in data state".
So, where in object-oriented programming "everything is an [interactable] object" that sends messages to and interacts with other objects...
-- in functional programming, "everything is a pure [non-mutative] value" that is "transformed" without side effects by mathematically pure functions.
In fact, type theory goes as far as to say "everything is a type" -- it will do things like mock up the natural numbers by constructing a recursive type with the same properties as the natural numbers.
To construct "[typed] values" as "structures," mathematical type theory defines a fundamental concept called a "type constructor" -- and you can think of type constructors as being just a Rust () and compositions of such.
So functional/mathematical type constructors are not intended to "execute" or have any other behavior. They are simply there to "purely construct the structure of pure data."
Conclusion: "Rust doesn't want you to inject a breakpoint into data"
Per Rust's theoretical roots and inspiring influences, Rust enum type constructors are meant to be functional and only to wrap and create type-tagged data.
In other words, Rust doesn't really want to allow you to "inject" arbitrary logic into type constructors (unlike C++, which has a whole slew of semantics regarding side effects in constructors, such as throwing exceptions, etc.).
They want injecting a breakpoint into Err(T) to feel like injecting a breakpoint into the literal 1 or the cast x as i32. Err(T) is more of a "data primitive" than a "transforming function/computation" like a call to foo(123).
In Code: why it's probably hard to inject a breakpoint in Err().
Let's start by looking at the definition of Err(T) itself.
The Definition of std::result::Result::Err()
Here is the definition of Err() taken directly from rust-lang/rust/library/core/src/result.rs # v1.63.0 on GitHub:
/// `Result` is a type that represents either success ([`Ok`]) or failure ([`Err`]).
///
/// See the [module documentation](self) for details.
#[derive(Copy, PartialEq, PartialOrd, Eq, Ord, Debug, Hash)]
#[must_use = "this `Result` may be an `Err` variant, which should be handled"]
#[rustc_diagnostic_item = "Result"]
#[stable(feature = "rust1", since = "1.0.0")]
pub enum Result<T, E> {
/// Contains the success value
#[lang = "Ok"]
#[stable(feature = "rust1", since = "1.0.0")]
Ok(#[stable(feature = "rust1", since = "1.0.0")] T),
/// Contains the error value
#[lang = "Err"]
#[stable(feature = "rust1", since = "1.0.0")]
Err(#[stable(feature = "rust1", since = "1.0.0")] E),
}
Err() is just one case of the greater enum std::result::Result<T, E> -- which means Err() is not a function, but more like a "data-tagging constructor."
Err(T) in assembly is meant to be optimized out completely
Let's use Godbolt to break down a usage of std::result::Result::<T, E>::Err(E): https://rust.godbolt.org/z/oocqGj5cd
// Type your code here, or load an example.
pub fn swap_err_ok(r: Result<i32, i32>) -> Result<i32, i32> {
let swapped = match r {
Ok(i) => Err(i),
Err(e) => Ok(e),
};
return swapped;
}
example::swap_err_ok:
sub rsp, 16
mov dword ptr [rsp], edi
mov dword ptr [rsp + 4], esi
mov eax, dword ptr [rsp]
test rax, rax
je .LBB0_2
jmp .LBB0_5
.LBB0_5:
jmp .LBB0_3
ud2
.LBB0_2:
mov eax, dword ptr [rsp + 4]
mov dword ptr [rsp + 12], eax
mov dword ptr [rsp + 8], 1
jmp .LBB0_4
.LBB0_3:
mov eax, dword ptr [rsp + 4]
mov dword ptr [rsp + 12], eax
mov dword ptr [rsp + 8], 0
.LBB0_4:
mov eax, dword ptr [rsp + 8]
mov edx, dword ptr [rsp + 12]
add rsp, 16
ret
Here is the (unoptimized) assembly code corresponding to the line Ok(i) => Err(i), which constructs the Err:
mov dword ptr [rsp + 12], eax
mov dword ptr [rsp + 8], 1
and Err(e) is basically optimized out if you compile with -C opt-level=3:
example::swap_err_ok:
mov edx, esi
xor eax, eax
test edi, edi
sete al
ret
Unlike C++, which leaves room for injecting arbitrary logic into constructors (even actions like locking a mutex), Rust discourages this in the name of optimization.
Rust is designed to discourage inserting computation into type constructor calls -- and, in fact, when there is no computation associated with a constructor, it should have no operational cost at the machine-instruction level.
Is there any way this is possible?
If you're still here, you really want a way to do this even though it goes against Rust's philosophy.
"...And besides, how hard can it be? If gcc and MSVC can instrument ALL functions with tracing at the compiler-level, can't rustc do the same?..."
I answered a related StackOverflow question like this in the past: How to build a graph of specific function calls?
In general, you have 2 strategies:
Instrument your application with some sort of logging/tracing framework, and then try to replicate some sort of tracing mixin-like functionality to apply global/local tracing depending on which parts of code you apply the mixins.
Recompile your code with some sort of tracing instrumentation feature enabled for your compiler or runtime, and then use the associated tracing compiler/runtime-specific tools/frameworks to transform/sift through the data.
For 1, this will require you to manually insert more code (something like MSVC's _penter/_pexit hooks) or create some sort of ScopedLogger that would (hopefully!) log asynchronously to some external file/stream/process. This is not necessarily a bad thing, as having a separate process control the trace tracking would probably be better in the case where the traced process crashes. Regardless, you'd probably have to refactor your code, since C++ does not have great first-class support for metaprogramming to refactor/instrument code at a module/global level. However, this is not an uncommon pattern for larger applications; for example, AWS X-Ray is a commercial tracing service (though, typically, I believe it fits the use case of tracing network calls and RPC calls rather than in-process function calls).
For 2, you can try something like utrace or something compiler-specific: MSVC has various tools like Performance Explorer, LLVM has XRay, GCC has gprof. You essentially compile in a sort of "debug++" mode or there is some special OS/hardware/compiler magic to automatically insert tracing instructions or markers that help the runtime trace your desired code. These tracing-enabled programs/runtimes typically emit to some sort of unique tracing format that must then be read by a unique tracing format reader.
However, because Err(T) is a [data]type constructor and not really a first-class fn, this means that Err(T) will most likely NOT be instrumented like a usual fn call. Usually compilers with some sort of "instrumentation mode" will only inject "instrumentation code" at function-call boundaries, but not at data-creation points generically.
What about replacing std:: with an instrumented version such that I can instrument std::result::Result<T, E> itself? Can't I just link-in something?
Well, Err(T) simply does not represent any logical computation except the creation of a value, so there is no fn or function pointer to replace or swap out by relinking the standard library. Doing something like this is just not part of Rust's surface language-level interface.
So now what?
If you really, specifically need this, you would want a custom compiler flag or mode that injects custom instrumentation code every time you construct an Err(T) value -- and you would have to rebuild every piece of Rust code where you want it.
Possible Options
Do a text-replace or macro-replacement to turn every usage of /Err(.*)/ in your application code that you want to instrument into your own macro or fn call (to represent computation in the way Rust wants), and inject your own type of instrumentation (probably using either log or tracing crates).
Find or ask for a custom instrumentation flag on rustc that can generate specific assembly/machine-code to instrument per every usage of Err(T).
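The first option can be sketched concretely. The names traced_err and rust_err_breakpoint below are my own invention, not a standard API: the idea is to wrap the Err constructions you care about in a macro that routes through a real, never-inlined function, giving the debugger a stable symbol to break on (e.g. `b rust_err_breakpoint` in LLDB).

```rust
// A hypothetical helper for Option 1: make `Err` construction pass through
// a real function call so a debugger has something to break on.
#[inline(never)]
fn rust_err_breakpoint() {} // set your LLDB/GDB breakpoint on this symbol

macro_rules! traced_err {
    ($e:expr) => {{
        rust_err_breakpoint(); // visible call site at every traced Err
        Err($e)
    }};
}

fn parse_positive(n: i32) -> Result<i32, String> {
    if n < 0 {
        return traced_err!(format!("negative input: {n}"));
    }
    Ok(n)
}

fn main() {
    assert_eq!(parse_positive(5), Ok(5));
    assert!(parse_positive(-1).is_err());
}
```

This only covers Err sites you rewrite yourself, so it does not reach into dependencies like Serde -- which is exactly the limitation described above.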
In principle, you could ask the debugger to break on the Err constructor as if it were a function, then inspect the stack trace to find where the Err value was created -- but, as explained above, a direct Err(x) expression is compiled to inline stores rather than a call, so in practice there is usually no such symbol to break on.
Related
One way to construct and destruct C++ objects from Rust is to call the constructor and return an int64_t pointer to Rust. Then, Rust can call methods on the object by passing the int64_t which will be cast to the pointer again.
void do_something(int64_t pointer, char* methodName, ...) {
//cast pointer and call method here
}
However, this is extremely unsafe. Instead I tend to store the pointer into a map and pass the map key to Rust, so it can call C++ back:
void do_something(int id, char* methodName, ...) {
//retrieve pointer from id and call method on it
}
Now, imagine I create, from Rust, a C++ object that calls Rust back. I could do the same: give C++ an int64_t and then C++ calls Rust:
#[no_mangle]
pub fn do_something(pointer: i64, method_name: &CString, ...) {
}
but that's also insecure. Instead I'd do something similar as C++, using an id:
#[no_mangle]
pub fn do_something(id: u32, method_name: &CString, ...) {
//search id in map and get object
//execute method on the object
}
However, this isn't possible, as Rust does not have support for static variables like a map. And Rust's lazy_static is immutable.
The only way to do safe calls from C++ back to Rust is to pass the address of something static (the function do_something) so calling it will always point to something concrete. Passing pointers is insecure as it could stop existing. However there should be a way for this function to maintain a map of created objects and ids.
So, how to safely call Rust object functions from C++? (for a Rust program, not a C++ program)
Pointers or Handles
Ultimately, this is about object identity: you need to pass something that identifies one particular instance of an object.
The simplest interface is to return a Pointer. It is the most performant interface, although it requires trust between the parties and a clear ownership story.
When a Pointer is not suitable, the fallback is to use a Handle. This is, for example, typically what kernels do: a file descriptor, in Linux, is just an int.
Handles do not preclude strong typing.
C and Linux are poor examples, here. Just because a Handle is, often, an integral ID does not preclude encapsulating said integer into a strong type.
For example, you could struct FileDescriptor(i32); to represent a file descriptor handed over from Linux.
Handles do not preclude strongly typed functions.
Similarly, just because you have a Handle does not mean that you have a single syscall interface where the name of the function must be passed by ID (or worse string) and an unknown/untyped soup of arguments follow.
You can perfectly, and really should, use strongly typed functions:
int read(FileDescriptor fd, std::byte* buffer, std::size_t size);
Handles are complicated.
Handles are, to a degree, more complicated than pointers.
First of all, handles are meaningless without some repository: 33 has no intrinsic meaning; it is just a key used to look up the real instance.
The repository need not be a singleton. It can perfectly be passed along in the function call.
The repository should likely be thread-safe and re-entrant.
There may be data-races between usage and deletion of a handle.
The latter point is maybe the most surprising, and means that care must be taken when using the repository: accesses to the underlying values must also be thread-safe, and re-entrant.
(Non thread-safe or non re-entrant underlying values leave you open to Undefined Behavior)
Use Pointers.
In general, my recommendation is to use Pointers.
While Handles may feel safer, implementing a correct system is much more complicated than it looks. Furthermore, Handles do not intrinsically solve ownership issues. Instead of Undefined Behavior, you'll get Null Pointer Dangling Handle Exceptions... and have to reinvent the tooling to track them down.
If you cannot solve the ownership issues with Pointers, you are unlikely to solve them with Handles.
Rust doesn't have a "bit" data type; however, x86 instructions have fields whose sizes are measured in bits. Instead of using bit-wise operations, is there any data structure that can be directly compiled to the "memory/byte alignment" required by the x86 instruction set or any binary protocol?
OpCode        1 or 2 bytes
Mod-R/M       0 or 1 byte
  Mod         bits 7,6
  Reg/OpCode  bits 5,4,3
  R/M         bits 2,1,0
SIB           0 or 1 byte
  SS          bits 7,6
  Index       bits 5,4,3
  Base        bits 2,1,0
Displacement  0, 1, 2, or 4 bytes
Immediate     0, 1, 2, or 4 bytes
is there any data structure that can be directly compiled
No, there are no structures that correspond to this:
OpCode 1or2 byte
That is, you cannot have a struct that has a value that is either one or two bytes long. Structures have a fixed size at compile time.
Your main choices are:
Use pretty Rust features like enums and structs. This is likely to not match the bit pattern of the actual instructions.
Make something like struct Instruction([u8; 4]) and implement methods that use bitwise operations. This will allow you to match the bit patterns.
Since you don't want to use bitwise operations and must match the bit representation, I do not believe your problem can currently be solved in the fashion you'd like.
Personally, I'd probably go the enum route and implement methods to parse the raw instructions from a type implementing Read and Write back to bytes.
It's also possible you are interested in bitfields, like this C++ example:
struct S {
unsigned int b : 3;
};
There is no direct support for that in Rust, but a few crates appear to support macros to create them. Perhaps that would be useful.
Some languages, like Haskell, make no distinction between pass-by-value and pass-by-reference. The compiler can then approximately choose the most efficient calling convention with a heuristic. One example heuristic would be for the Linux x64 ABI: if the size of parameter is greater than 16 bytes, pass a pointer to the stack otherwise pass the value in registers.
What is the advantage of keeping both notions of pass-by-value and pass-by-reference (non-mutable of course) in Rust and forcing the user to choose?
Could it be the case that pass-by-value is syntactic sugar for pass-by-reference + copy if the value is seen to be modified?
Two things:
Rust will transform certain pass-by-value calls into pass-by-reference, based on a similar heuristic.
Pass-by-value indicates ownership transfer, while pass-by-reference indicates borrowing. These are very different, and totally orthogonal from the asm-level concern you're asking about.
In other words, in Rust, these two forms have different semantics. That doesn't preclude also doing the optimization, though.
[Edited: changed example to work in release mode]
It isn't syntactic sugar, as one can see by looking at the generated code.
Given these functions:
fn by_value(v: (u64, u64)) -> u64 {
v.0 + v.1
}
fn by_ref(v: &(u64, u64)) -> u64 {
v.0 + v.1
}
then if one were syntactic sugar for the other, we'd expect them to generate identical assembly code, or at least identical calling conventions. Instead, we find that by_ref passes v in the rdi and rsi registers, whereas by_value passes a pointer to v in the rdi register and has to follow that pointer to get the value (see the details on Godbolt; use release mode):
by_value:
movq 8(%rdi), %rax
addq (%rdi), %rax
retq
by_ref:
leaq (%rdi,%rsi), %rax
retq
I was recently reading this article on structs and classes in D, and at one point the author comments that
...this is a perfect candidate for a struct. The reason is that it contains only one member, a pointer to an ALLEGRO_CONFIG. This means I can pass it around by value without care, as it's only the size of a pointer.
This got me thinking; is that really the case? I can think of a few situations in which believing you're passing a struct around "for free" could have some hidden gotchas.
Consider the following code:
struct S
{
int* pointer;
}
void doStuff(S ptrStruct)
{
// Some code here
}
int n = 123;
auto s = S(&n);
doStuff(s);
When s is passed to doStuff(), is a single pointer (wrapped in a struct) really all that's being passed to the function? Off the top of my head, it seems that any pointers to member functions would also be passed, as well as the struct's type information.
This wouldn't be an issue with classes, of course, since they're always reference types, but a struct's pass by value semantics suggests to me that any extra "hidden" data such as described above would be passed to the function along with the struct's pointer to int. This could lead to a programmer thinking that they're passing around an (assuming a 64-bit machine) 8-byte pointer, when they're actually passing around an 8-byte pointer, plus several other 8-byte pointers to functions, plus however many bytes an object's typeinfo is. The unwary programmer is then allocating far more data on the stack than was intended.
Am I chasing shadows here, or is this a valid concern when passing a struct with a single reference, and thinking that you're getting a struct that is a pseudo reference type? Is there some mechanism in D that prevents this from being the case?
I think this question can be generalized to wrapping native types. E.g. you could make a SafeInt type which wraps and acts like an int, but throws on any integer overflow conditions.
There are two issues here:
Compilers may not optimize your code as well as with a native type.
For example, if you're wrapping an int, you'll likely implement overloaded arithmetic operators. A sufficiently smart compiler will inline those methods, and the resulting code will be no different from using a plain int. A dumb compiler, in your example, might compile a dereference in some clumsy way (e.g. get the address of the struct's start, add the offset of the pointer field (which is 0), then dereference that).
Additionally, when calling a function, the compiler may decide to pass the struct in some other way (due to e.g. poor optimization, or an ABI restriction). This could happen e.g. if the compiler doesn't pay attention to the struct's size, and treats all structs in the same way.
struct types in D may indeed have a hidden member, if you declare them inside a function.
For example, the following code works:
import std.stdio;
void main()
{
string str = "I am on the stack of main()";
struct S
{
string toString() const { return str; }
}
S s;
writeln(s);
}
It works because S saves a hidden pointer to main()'s stack frame. You can force a struct to not have any hidden pointers by prefixing static to the declaration (e.g. static struct S).
There is no hidden data being passed. A struct consists exactly of what's declared in it (and any padding bytes if necessary), nothing else. There is no need to pass type information and member function information along because it's all static. Since a struct cannot inherit from another struct, there is no polymorphism.
I am trying to create a non-blocking queue package for concurrent application using the algorithm by Maged M. Michael and Michael L. Scott as described here.
This requires the use of atomic CompareAndSwap which is offered by the "sync/atomic" package.
I am however not sure what the Go-equivalent to the following pseudocode would be:
E9: if CAS(&tail.ptr->next, next, <node, next.count+1>)
where tail and next is of type:
type pointer_t struct {
ptr *node_t
count uint
}
and node is of type:
type node_t struct {
value interface{}
next pointer_t
}
If I understood it correctly, it seems that I need to do a CAS with a struct (both a pointer and a uint). Is this even possible with the atomic-package?
Thanks for help!
If I understood it correctly, it seems that I need to do a CAS with a struct (both a pointer and a uint). Is this even possible with the atomic-package?
No, that is not possible. Most architectures only support atomic operations on a single word. A lot of academic papers, however, use more powerful CAS primitives (e.g. double-word compare-and-swap) that are not available on today's hardware. Luckily, there are a few tricks that are commonly used in such situations:
You could for example steal a couple of bits from the pointer (especially on 64bit systems) and use them, to encode your counter. Then you could simply use Go's CompareAndSwapPointer, but you need to mask the relevant bits of the pointer before you try to dereference it.
The other possibility is to work with pointers to your (immutable!) pointer_t struct. Whenever you want to modify an element from your pointer_t struct, you would have to create a copy, modify the copy and atomically replace the pointer to your struct. This idiom is called COW (copy on write) and works with arbitrary large structures. If you want to use this technique, you would have to change the next attribute to next *pointer_t.
I have recently written a lock-free list in Go for educational reasons. You can find the (imho well documented) source here: https://github.com/tux21b/goco/blob/master/list.go
This rather short example uses atomic.CompareAndSwapPointer excessively and also introduces an atomic type for marked pointers (the MarkAndRef struct). This type is very similar to your pointer_t struct (except that it stores a bool+pointer instead of an int+pointer). It's used to ensure that a node has not been marked as deleted while you are trying to insert an element directly afterwards. Feel free to use this source as starting point for your own projects.
You can do something like this (note: this assumes next has been changed to a *pointer_t, as suggested above, so that the field itself can be swapped atomically; note also the & needed to take the address of the field):
if atomic.CompareAndSwapPointer(
    (*unsafe.Pointer)(unsafe.Pointer(&tail.ptr.next)),
    unsafe.Pointer(next),
    unsafe.Pointer(&pointer_t{&node, next.count + 1}),
) {
    // CAS succeeded
}