When to use "cold" built-in codegen attribute in Rust?

There isn't much information on this attribute in the reference document other than:
"The cold attribute suggests that the attributed function is unlikely to be called."
How does it work internally, and when should a Rust developer use it?

It tells LLVM to mark a function as cold (i.e. not called often), which changes how the function is optimized such that calls to this code are potentially slower, and calls to non-cold code are potentially faster.
Mandatory disclaimer about performance tweaking:
You really should have some benchmarks in place before you start marking various bits of code as cold. You may have some ideas about whether something is in the hot path or not, but unless you test it, you can't know for sure.
FWIW, there are also the perma-unstable LLVM intrinsics likely and unlikely, which do a similar thing, but these have been known to actually hurt performance, even when used correctly, by preventing other optimizations from happening. Here's the tracking issue for the RFC: https://github.com/rust-lang/rust/issues/26179
As always: benchmark, benchmark, benchmark! And then benchmark some more.
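For illustration, here is a minimal sketch of how the attribute is typically applied: the rarely taken error path is split out into its own function and marked `#[cold]`. The names `report_invalid` and `parse_digits` are made up for this example, and pairing the attribute with `#[inline(never)]` is just one common choice, not a requirement. Whether it helps at all is something only a benchmark can tell you.

```rust
/// Hypothetical error path that is expected to run rarely.
/// `#[cold]` hints that calls to this function are unlikely.
#[cold]
#[inline(never)] // optional: keeps the cold code out of the caller's body
fn report_invalid(byte: u8) -> Result<u32, String> {
    Err(format!("invalid digit: {:?}", byte as char))
}

/// Parses ASCII decimal digits. The hot loop stays small because the
/// failure branch only contains a call to the cold function.
fn parse_digits(input: &[u8]) -> Result<u32, String> {
    let mut value: u32 = 0;
    for &b in input {
        if !b.is_ascii_digit() {
            return report_invalid(b);
        }
        // Wrapping arithmetic keeps the sketch panic-free on long inputs.
        value = value.wrapping_mul(10).wrapping_add(u32::from(b - b'0'));
    }
    Ok(value)
}
```

Calling `parse_digits(b"1234")` takes only the hot path, while an input such as `b"12x4"` goes through `report_invalid`.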


Which nodejs v8 flags for benchmarking?

For comparison of different libraries with the same functionality, we compare their execution time. This works great. However, there are v8 flags that impact execution time and skew results.
Some flags that are relevant are: --predictable, --always-opt, --no-opt, --minimal.
Question: Which v8 flags should typically be set for running meaningful benchmarks? What are the tradeoffs?
Edit: The problem is that a benchmark typically runs the same code over and over to get a good average. This might lead to v8 optimizing code, which it would typically not optimize.
V8 developer here. You should definitely run benchmarks with the default configuration. It is the responsibility of the benchmark to be realistic. An unrealistic benchmark cannot be made meaningful with engine flags. (And yes, there are many many unrealistic and/or otherwise meaningless snippets of code out there that people call "benchmarks". Remember, if you can't measure a difference with a realistic benchmark, then any unmeasurable difference that might exist is irrelevant.)
In particular:
--predictable: Absolutely not. Detrimental to performance. Changes behavior in unrealistic ways. Meant for debugging certain things, and for helping fuzzers find reproducible test cases (at the expense of being somewhat unrealistic), not for anything related to performance testing.
--always-opt: Absolutely not. Contrary to what a naive reader of this flag's name might think, this does not improve performance, on the contrary; it mostly causes V8 to waste a bunch of CPU cycles on useless work. This flag is barely ever useful at all; it can sometimes flush out weird corner case bugs in the compilation pipeline, but most of the time it just creates pointless work for V8 developers by creating artificial situations that never occur in practice.
--no-opt: Absolutely not. Turns off all optimizations. Totally unrealistic.
--minimal: That's not a V8 flag I've ever heard of. So yeah, sure, pass it along, it won't do anything (beyond printing an "unknown flag" warning), so at least it won't break anything.
Using default flags seems like the best way to me, since that's what most people will use.

How to make a Python "rust-Extension" module that behaves exactly like C-Extensions in terms of call overhead and processing speed?

The closest option I have found is pyo3, but it isn't clear to me whether it adds any extra overhead compared to traditional C-extensions.
From here it seems like such C-extension behavior is possible through borrowed objects (I still have to understand this concept in detail).
Part of my question comes from the fact the build process (Section Python with Rust here) is entirely managed by cargo which references both cpython and pyo3.
For an example of approach that adds some overhead, but not rust-based, see this comparison.
A related question is about portability, since it seems there is a kind of overhead-portability tradeoff.
For those who prefer to know the concrete case, it is about small hash-like operations that are used millions of times in unpredictable order. So neither a pure Python nor a batch native approach is going to help here. Additionally, a first attempt using a C-extension already shows gains compared to pure Python. Now, I'd like to implement it in Rust before writing the remaining functions.

Does a plain read of a variable that is updated with interlocked functions always return the latest value?

If you only change a MyInt: Integer variable in one or more threads with one of the interlocked functions, let's say InterlockedIncrement, can we guarantee that after the InterlockedIncrement is executed, a plain read of the variable in any thread will return the latest updated value? Yes or no, and why?
If not, is it possible to achieve that in Delphi? Note that I'm talking about only one variable, no need to worry about consistency about two or more variables.
The root problem and doubt seem essentially the same as in this SO post, but that one targets C#, and I'm using Delphi 2007, so I have no access to volatile or to newer versions of Delphi. In that discussion, two major problems were raised that seem to affect Delphi as well:
The cache of the processor reading the variable may not be updated.
The compiler may optimize the code in a way that causes problems to read.
If this is really a problem, I'd be very worried about using even a simple counter with InterlockedIncrement, or solutions like the lock-free initialization proposed here, and would fall back to plain Critical Sections or MultiReaderSingleWriter for safety.
Initial analysis
This is what I've found so far, but feel free to address the problems in other ways if appropriate, or even to raise other unknown problems, so the objective of the question can be achieved:
For problem 1, I expected that the "full fence" would also force the caches of other processors to be updated, but from what I have read that does not seem to be the case. It looks like the cache is only updated if a "read barrier" (or whatever it is called) is issued on the processor that will read the variable. If this is true, is there a way to issue such a "read barrier" in Delphi just before reading the variable? A full fence seems to imply both read and write barriers, so that would also be OK. Since there is no InterlockedRead function according to the discussion in the first post, could we (just speculating) work around this using something like InterlockedCompareExchange (ugh... writing the variable to be able to read it smells bad), or maybe low-level "lock" assembly calls (that could be encapsulated)?
For problem 2, would Delphi's optimizations have an impact here? Is there any way to avoid it?
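To make the speculated workaround concrete, here is a small sketch of the "read through a compare-exchange" idea. It is written in Rust rather than Delphi, using `std::sync::atomic` as a stand-in for the Win32 interlocked functions, so it only illustrates the pattern and says nothing about what Delphi 2007 actually guarantees. The helper names `interlocked_read` and `count_in_parallel` are invented for this example.

```rust
use std::sync::atomic::{AtomicI32, Ordering};
use std::sync::Arc;
use std::thread;

/// Reads the counter through a read-modify-write operation, analogous to
/// calling InterlockedCompareExchange(var, 0, 0): if the value is 0 it is
/// "replaced" by 0, otherwise nothing is written. Either way the current
/// value comes back.
fn interlocked_read(counter: &AtomicI32) -> i32 {
    match counter.compare_exchange(0, 0, Ordering::SeqCst, Ordering::SeqCst) {
        Ok(v) | Err(v) => v,
    }
}

/// Spawns `threads` workers that each increment a shared counter
/// `per_thread` times (the InterlockedIncrement analogue), then reads it.
fn count_in_parallel(threads: usize, per_thread: i32) -> i32 {
    let counter = Arc::new(AtomicI32::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    c.fetch_add(1, Ordering::SeqCst);
                }
            })
        })
        .collect();
    // Joining already orders the workers' writes before the read below;
    // the compare-exchange read is shown only to illustrate the pattern.
    for h in handles {
        h.join().unwrap();
    }
    interlocked_read(&counter)
}
```

In Rust the same read would normally be written as `counter.load(Ordering::SeqCst)`; the compare-exchange form is used here only because it mirrors the workaround being asked about.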
Edit: The solution must work in D2007, but I'd prefer not to make a possible future migration to newer Delphi harder, and I'd like to use the same code on ARM as well (this became clear to me after David's comments). So, if possible, it would be nice if the solution were not coupled to the x86/64 memory model. It would be nice if I only needed to replace the plain Windows.pas interlocked functions with whatever provides the same interlocked functionality in newer Delphi/ARM, without having to review the logic for ARM (one less concern).
But do the interlocked functions abstract enough from the CPU architecture in this case? Problem 1 suggests that they don't, but I'm not sure whether it would affect Delphi on ARM. Is there any way around it that keeps things simple and still gives a relevant performance gain over critical sections and similar sync objects?

Expression trees vs IL.Emit for runtime code specialization

I recently learned that it is possible to generate C# code at runtime and I would like to put this feature to use. I have code that does some very basic geometric calculations like computing line-plane intersections and I think I could gain some performance benefits by generating specialized code for some of the methods because many of the calculations are performed for the same plane or the same line over and over again. By specializing the code that computes the intersections I think I should be able to gain some performance benefits.
The problem is that I'm not sure where to begin. From reading a few blog posts and browsing MSDN documentation I've come across two possible strategies for generating code at runtime: Expression trees and IL.Emit. Using expression trees seems much easier because there is no need to learn anything about OpCodes and various other MSIL related intricacies but I'm not sure if expression trees are as fast as manually generated MSIL. So are there any suggestions on which method I should go with?
The performance of both is generally the same, as expression trees are internally traversed and emitted as IL using the same underlying system functions that you would be using yourself. It is theoretically possible to emit more efficient IL using the low-level functions, but I doubt there would be any practically important performance gain. That would depend on the task, but I have not come across any practical optimisation of hand-emitted IL compared to the IL emitted by expression trees.
I highly suggest getting the tool called ILSpy that reverse-compiles CLR assemblies. With that you can look at the code actually traversing the expression trees and actually emitting IL.
Finally, a caveat. I have used expression trees in a language parser, where function calls are bound to grammar rules that are compiled from a file at runtime. Compiled is a key here. For many problems I came across, when what you want to achieve is known at compile time, then you would not gain much performance by runtime code generation. Some CLR JIT optimizations might be also unavailable to dynamic code. This is only an opinion from my practice, and your domain would be different, but if performance is critical, I would rather look at native code, highly optimized libraries. Some of the work I have done would be snail slow if not using LAPACK/MKL. But that is only a piece of the advice not asked for, so take it with a grain of salt.
If I were in your situation, I would try alternatives from high level to low level, in increasing "needed time & effort" and decreasing reusability order, and I would stop as soon as the performance is good enough for the time being, i.e.:
first, I'd check to see if Math.NET, LAPACK or some similar numeric library already has similar functionality, or I can adapt/extend the code to my needs;
second, I'd try Expression Trees;
third, I'd check Roslyn Project (even though it is in prerelease version);
fourth, I'd think about writing common routines with unsafe C code;
[fifth, I'd think about quitting and starting a new career in a different profession :) ],
and only if none of these work out, would I be so hopeless to try emitting IL at run time.
But perhaps I'm biased against low level approaches; your expertise, experience and point of view might be different.

Future Protections in Managed Languages and Runtimes

In the future, will managed runtimes provide additional protections against subtle data corruption issues?
Managed runtimes such as Java and the .NET CLR reduce or eliminate the possibility of many memory corruption bugs common in native languages like C++. Nonetheless, they are surprisingly not immune from all memory corruption problems. One intuitively expects that a method that validates its input, has no bugs, and robustly handles exceptions will always transform its object from one valid state to another, but this is not the case. (It is more accurate to say that it is not the case using prevailing programming conventions--object implementors need to go out of their way to avoid the problems I describe.)
Consider the following scenarios:
Threading. The caller might share the object with other threads and make concurrent calls on it. If the object does not implement locking, the fields might be corrupted. (Perhaps--unless notified that the object is thread-safe--runtimes should use an interlock on every method call to throw an exception if any method on the same object is executing concurrently on another thread. This would be a protection feature and, just like other well-accepted safety features of managed runtimes, it has some cost.)
Re-entrancy. The method makes a callout to an arbitrary function (such as an event handler) that ultimately calls methods on the object that are not designed to be called at that point. This is even trickier than thread safety and many class libraries do not get this right. (Worse yet, class libraries are known to poorly document what re-entrancy is allowed.)
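As a rough sketch of what such a per-object interlock could look like if written by hand, the following Rust code guards a method with an atomic "busy" flag and reports an error instead of proceeding when the object is entered again, whether from another thread or through a re-entrant callback. The `EntryGuard` and `Account` types are hypothetical and exist only for this illustration; no current runtime is claimed to work this way.

```rust
use std::sync::atomic::{AtomicBool, AtomicI64, Ordering};

/// Hypothetical guard: marks an object as busy while one of its methods runs.
struct EntryGuard<'a> {
    busy: &'a AtomicBool,
}

impl<'a> EntryGuard<'a> {
    /// Succeeds only if no other method call on the object is in progress.
    fn enter(busy: &'a AtomicBool) -> Result<Self, &'static str> {
        busy.compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .map(|_| EntryGuard { busy })
            .map_err(|_| "object is already executing a method")
    }
}

impl Drop for EntryGuard<'_> {
    /// Clears the busy flag when the guarded method returns.
    fn drop(&mut self) {
        self.busy.store(false, Ordering::Release);
    }
}

/// Hypothetical object whose methods are protected by the interlock.
struct Account {
    busy: AtomicBool,
    balance: AtomicI64,
}

impl Account {
    fn new(balance: i64) -> Self {
        Account { busy: AtomicBool::new(false), balance: AtomicI64::new(balance) }
    }

    fn balance(&self) -> i64 {
        self.balance.load(Ordering::Relaxed)
    }

    /// Updates the balance and then calls out to arbitrary code, the way an
    /// event handler would be invoked in the middle of a method.
    fn deposit<F: FnOnce(&Account)>(&self, amount: i64, callback: F) -> Result<(), &'static str> {
        let _guard = EntryGuard::enter(&self.busy)?;
        self.balance.fetch_add(amount, Ordering::Relaxed);
        callback(self); // a nested deposit from here is rejected
        Ok(())
    }
}
```

If the callback passed to `deposit` calls `deposit` on the same account, the inner call returns an error and leaves the balance untouched, while the outer call completes normally. The cost is one atomic operation on entry and one on exit for every guarded call.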
For all of these cases, it can be argued that thorough documentation is a solution. However, documentation also can prescribe how to allocate and deallocate memory in unmanaged languages. We know from experience (e.g., with memory allocation) that the difference between documentation and language/runtime enforcement is night and day.
What can we expect from languages and runtimes in the future to protect us from these problems and other subtle problems like them?
I think languages and runtimes will keep moving forward, keep abstracting away issues from the developer, and keep making our lives easier and more productive.
Take your example - threading. There are some great new features on the horizon in the .NET world to simplify the threading model we use daily. STM.NET may eventually make shared state much, much safer to handle, for example. The parallel extensions in .NET 4 make life very easy for threading compared to current technologies.
I think that transactional memory is promising for addressing some of these issues. I'm not sure if this answers your question in some way but this is an interesting topic in any event:
There was an episode of Software Engineering Radio on the topic a year or so ago maybe.
First of all, "managed" is a bit of a misnomer: languages like OCaml, Haskell, and SML achieve such protections and safety while being fully compiled. All relevant "management" occurs at compile time through static analysis, which aids optimization and speed.
Anyway, to answer your question: if you look at languages like Erlang and Haskell, state is isolated and immutable by default. With this kind of system, threading and reentrancy are safe by default, and because you have to go out of your way to break these rules, it is easy to see where unsafe code can arise.
By starting with safe defaults but leaving room for advanced unsafe usage, you get the best of both worlds. It seems reasonable that future systems that are safe by your definition may follow some of these practices as well.
What can we expect in the future?
Nothing. Thread-state and re-entrancy are not problems I see tools/runtimes solving. Instead I think in the future people will move to styles that avoid programming with mutable state to bypass these issues. Languages and libraries can help make these styles of programming more attractive, but the tools are not the solution - changing the way we write code is the solution.
