How do I find out the number of CPU cores using cpuid? - rust

I am interested in physical cores, not logical cores.
I am aware of https://crates.io/crates/num_cpus, but I want to get the number of cores using cpuid. I am mostly interested in a solution that works on Ubuntu, but cross-platform solutions are welcome.

I see mainly two ways for you to do this.
You could use the higher level library cpuid. With this, it's as simple as cpuid::identify().unwrap().num_cores (of course, please do proper error handling). But since you know about the library num_cpus and still ask this question, I assume you don't want to use an external library.
The second way is to do it all on your own. But this approach is mostly unrelated to Rust, as the main difficulty lies in understanding the CPUID instruction and what it returns. This is explained in this Q&A, for example. It's not trivial, so I won't repeat it here.
The only Rust-specific thing is how to actually execute that instruction in Rust. One way to do it is to use core::arch::x86_64::__cpuid_count. It's an unsafe function that returns the raw result (four registers). After calling it, you have to extract the information you want via bit shifting and masking as described in the Q&A I linked above. Consult core::arch for other architectures or more cpuid-related functions.
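To make the "call the instruction, then shift and mask" step concrete, here is a minimal sketch. It assumes an Intel-style CPU where CPUID leaf 4 is valid, and it decodes only EAX bits 31:26 of that leaf, which give the maximum number of addressable core IDs per package minus one. That is an upper bound, not necessarily the real physical core count, and other vendors need different leaves. The function names are made up for illustration.

```rust
/// Decode EAX from CPUID leaf 4: bits 31:26 hold
/// "maximum addressable core IDs per package" minus one.
fn max_core_ids_per_package(leaf4_eax: u32) -> u32 {
    ((leaf4_eax >> 26) & 0x3f) + 1
}

/// Query leaf 4, subleaf 0. Returns None when the leaf is unsupported or empty.
#[cfg(target_arch = "x86_64")]
#[allow(unused_unsafe)] // the intrinsics are `unsafe fn` on older toolchains
fn query_leaf4_eax() -> Option<u32> {
    use core::arch::x86_64::{__cpuid, __cpuid_count};
    // Leaf 0 reports the highest supported basic leaf in EAX.
    let max_leaf = unsafe { __cpuid(0) }.eax;
    if max_leaf < 4 {
        return None;
    }
    let eax = unsafe { __cpuid_count(4, 0) }.eax;
    // Bits 4:0 are the cache type; 0 means the leaf carries no data.
    if eax & 0x1f == 0 {
        None
    } else {
        Some(eax)
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn query_leaf4_eax() -> Option<u32> {
    None
}

fn main() {
    match query_leaf4_eax() {
        Some(eax) => println!(
            "up to {} addressable core IDs per package",
            max_core_ids_per_package(eax)
        ),
        None => println!("CPUID leaf 4 not available on this CPU"),
    }
}
```

Keeping the decoding in a pure function makes it checkable without touching the hardware. Turning the result into a real physical core count still requires the topology handling from the linked Q&A.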
But again, doing this manually is not trivial, error prone and apparently hard to make work truly cross-CPU. So I would strongly recommend using a library like num_cpus in any real code.

Related

How to make a Python "rust-Extension" module that behaves exactly like C-Extensions in terms of call overhead and processing speed?

The closest option I have found is pyo3, but it isn't clear to me whether it adds any extra overhead compared to traditional C-extensions.
From here it seems like such C-extension behavior is possible through borrowed objects (I still have to understand this concept in detail).
Part of my question comes from the fact the build process (Section Python with Rust here) is entirely managed by cargo which references both cpython and pyo3.
For an example of an approach that adds some overhead, though not a Rust-based one, see this comparison.
A related question is about portability, since it seems there is a kind of overhead-portability tradeoff.
For those who prefer to know the concrete case, it is about small hash-like operations that are used millions of times in unpredictable order. So neither a pure Python nor a batch native approach are going to help here. Additionally, there are already gains in a first attempt using a C-extension when compared to pure Python. Now, I'd like to implement it in Rust before writing the remaining functions.

Does a plain read of a variable that is updated with interlocked functions always return the latest value?

If you only change a MyInt: Integer variable in one or more threads with one of the interlocked functions, let's say InterlockedIncrement, can we guarantee that after the InterlockedIncrement is executed, a plain read of the variable in any thread will return the latest updated value? Yes or no, and why?
If not, is it possible to achieve that in Delphi? Note that I'm talking about only one variable, no need to worry about consistency about two or more variables.
The root problem and doubt seem essentially the same as in this SO post, but that one targets C#, and I'm using Delphi 2007, so I have no access to volatile or to features of newer Delphi versions. In that discussion, two major problems were raised that seem to affect Delphi as well:
The cache of the processor reading the variable may not be updated.
The compiler may optimize the code in a way that causes problems to read.
If this is really a problem, I'd be very worried about using even a simple counter with InterlockedIncrement, or solutions like the lock-free initialization proposed in here, and would fall back to plain Critical Sections or MultiReaderSingleWriter for safety.
Initial analysis
This is what I've found so far, but feel free to address the problems in other ways if appropriate, or even to raise other unknown problems, so that the objective of the question can be achieved:
For problem 1, I expected that the "full fence" would also force the caches of other processors to be updated... but from reading around, that seems not to be the case. It looks like the cache would only be updated if a "read barrier" (or whatever it is called) were issued on the processor that will read the variable. If this is true, is there a way to issue such a "read barrier" in Delphi, just before reading the variable? A full fence seems to imply both read and write barriers, so that would also be OK. Since there is no InterlockedRead function according to the discussion in the first post, could we try (just speculating) to work around it using something like InterlockedCompareExchange (ugh... writing the variable to be able to read it smells bad), or maybe "lock" low-level assembly calls (that could be encapsulated)?
For problem 2, would Delphi optimizations have an impact here? Is there any way to avoid it?
Edit: The solution must work in D2007, but I'd preferably like not to make a possible future migration to newer Delphi harder, and to use the same piece of code on ARM as well (this became clear to me after David's comments). So, if possible, it would be nice if the solution were not coupled to the x86/64 memory model. It would be nice if I only needed to replace the plain Windows.pas interlocked functions with whatever provides the same interlocked functionality in newer Delphi/ARM, without having to review the logic for ARM (one less concern).
But do the interlocked functions abstract enough from the CPU architecture in this case? Problem 1 suggests that they don't, but I'm not sure whether that would affect ARM Delphi. Is there any way around it that keeps things simple and still allows a relevant performance gain over critical sections and similar sync objects?
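The "read through InterlockedCompareExchange" idea speculated about above can be illustrated outside Delphi. The following is a minimal sketch in Rust, using std atomics as a stand-in for the Windows interlocked functions. It is not Delphi 2007 code and says nothing about what the Delphi compiler does. The names interlocked_read and run_increments are hypothetical. In Rust itself a plain counter.load(Ordering::SeqCst) would be the usual way to read; the compare-exchange version is shown only because it mirrors the workaround in the question.

```rust
use std::sync::atomic::{AtomicI32, Ordering};
use std::sync::Arc;
use std::thread;

/// Read through a compare-and-swap that never changes the value:
/// if the value is 0 it is "replaced" by 0, otherwise the CAS fails
/// and reports the current value. Either way we get what is stored.
fn interlocked_read(v: &AtomicI32) -> i32 {
    match v.compare_exchange(0, 0, Ordering::SeqCst, Ordering::SeqCst) {
        Ok(current) | Err(current) => current,
    }
}

/// Increment a shared counter from several threads, then read it back.
fn run_increments(threads: usize, per_thread: i32) -> i32 {
    let counter = Arc::new(AtomicI32::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    // Rough analogue of InterlockedIncrement.
                    counter.fetch_add(1, Ordering::SeqCst);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    interlocked_read(&counter)
}

fn main() {
    println!("final count: {}", run_increments(4, 1000));
}
```

Because every access goes through an atomic operation with an explicit ordering, the code does not depend on the x86 memory model, which is the kind of portability the edit above asks for.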

What happened to libgreen?

As far as I understand, libgreen is not a part of the Rust standard library anymore. Also, I can't find a separate libgreen package. There are a few alternatives: coroutine, which does not provide actual green threads for now, and green-rs, which is broken. Do I understand correctly that for now there are no lightweight Go-like processes in Rust?
You are correct that there's no lightweight tasking library in std (or the rest of the main distribution), that green doesn't compile and that coroutine doesn't seem to fully handle the threading aspect yet. I do not know of any other library in this space.
As for what happened: the RFC linked to by that issue—RFC 230—is the canonical source of information. The summary is that it was found that the method by which green threading/IO was handled (std tried to abstract across both models, allowing them to be used interoperably automagically) was not worth the downsides. Now, std aims to just provide a minimum base-line of useful support: for IO/threading, that means "thin", safe wrappers for operating system functionality.
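For readers who have not used it, the 1:1 model that std provides today looks roughly like this. It is a small sketch with a hypothetical helper name, in which each std::thread::spawn call creates one operating system thread:

```rust
use std::thread;

/// Sum each chunk on its own OS thread, then combine the partial sums.
fn sum_in_threads(chunks: Vec<Vec<u64>>) -> u64 {
    let handles: Vec<thread::JoinHandle<u64>> = chunks
        .into_iter()
        .map(|chunk| thread::spawn(move || chunk.iter().sum::<u64>()))
        .collect();
    handles
        .into_iter()
        .map(|handle| handle.join().expect("worker thread panicked"))
        .sum()
}

fn main() {
    let total = sum_in_threads(vec![vec![1, 2, 3], vec![4, 5], vec![6]]);
    println!("total = {}", total);
}
```

No user-space scheduler is involved here, which is the trade-off the RFC settled on.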
Read this https://aturon.github.io/blog/2016/08/11/futures/ and also:
Steve Klabnik's response in the comments:
In the beginning, Rust had only green threads. Eventually, it was
decided that a systems language without systems threads is... strange.
So we needed to add them. Why not add choice? Since the interfaces
could be the same, why not abstract over them, and you could just
choose which one you wanted?
At the same time, the problems with green threads by default were
becoming issues. Segmented stacks cause slow C interop. You need a
runtime to manage them, etc. Furthermore, the overall abstraction was
causing an unacceptable cost. The green threads weren't very green.
Plus, with the need to actually release someday looming, decisions
needed to be made regarding tradeoffs. And since Rust is supposed to
be a systems language, having 1:1 threads and basically no runtime
makes more sense than N:M threads and a runtime. So libgreen was
removed, the interface was re-done to be 1:1 thread centric.
The 'release someday looming' is a big part of it. We want to be
really stable with Rust, and with all the things to do to actually
ship a 1.0, we didn't want to crystallize an interface we weren't
happy with. Heck, we pulled out a lot of libraries that are even less
important for similar reasons, like rand. Engineering is all about
tradeoffs, and we decided to choose minimalism.
mio is a non starter for us, as are most of the other async i/o frameworks for Rust, because we need Windows and besides we don't want
to get locked into an expensive to replace library which may get
orphaned.
Totally understood here, especially in the general case. In the
specific case, mio is going to either have Windows support, or a
windows-specific version of mio is going to be released, with a
higher-level package providing the features for all platforms. And in
this case, it's maintained by one of the people who's currently using
Rust heavily in production, so it's not likely to go away anytime
soon. But, unless you're actively involved, it's hard to know things
like that, which is, of itself an issue.
One of the reasons we were comfortable removing libgreen is that you
can write your own libraries to do different kinds of IO. 1.0 is a
strong core that we feel good about stabilizing forever, not the final
bit. Libraries like https://github.com/carllerche/mio can test out
different ways of handling things like async IO, and, when they're
mature enough, we can always pull them back in the standard library if
need be. But in the meantime, it's just one line to your Cargo.toml to
add them in.
And this text from reddit:
Unfortunately they ended up canning the greenlet support because
theirs were slower than kernel threads which in turn demonstrates
someone didn’t understand how to get a language compiler to generate
stackless coroutines effectively (not surprising, the number of
engineers wired the right way is not many in this world, but see
http://www.reddit.com/r/rust/comments/2l0a4b/do_rust_web_servers_use_libuv_through_libgreen_or/
for more detail). And they canned the async i/o because libuv is
“slow” (which it is only because it is single threaded only, plus
forces a malloc + free per async operation as the buffers must last
until completion occurs, plus it enforces a penalty over synchronous
i/o see
http://blog.kazuhooku.com/2014/09/the-reasons-why-i-stopped-using-libuv.html),
which was a real shame - they should have taken the opportunity to
replace libuv with something better (hint: ASIO + AFIO, and yes I know
they are both C++, but Rust could do with much better C++ interop than
the presently none it currently has) instead of canning
always-async-everything in what could have been an amazing step up
from C++ with most of the benefits of Erlang without the disadvantages
of Erlang.
For newcomers, there is now may, a crate that implements green threads similar to goroutines.

Is the rbp/ebp (x86-64) register still used in the conventional way?

I have been writing a small kernel lately based on the x86-64 architecture. When taking care of some user space code, I realized I am virtually not using rbp. I then looked at some other things and found out that compilers are getting smarter these days and they really don't use rbp anymore. (I could be wrong here.)
I was wondering whether the conventional use of rbp/ebp is no longer required in many instances, or am I wrong here? If that kind of usage is not required, can it be used like a general purpose register?
Thanks
It is only needed if you have variable-length arrays in your stack frames (recording the array length would require more memory and more computations). It is no longer needed for unwinding because there is now metadata for that.
It is still useful if you are hand-writing entire assembly functions, but who does that? Assembly should only be used as glue to jump into a C (or whatever) function.

How do I implement the ABA solution?

I am trying to implement the Michael-Scott FIFO queue from here. I'm unable to implement their solution for the ABA problem. I get this error.
error: incompatible type for argument 1 of '__sync_val_compare_and_swap'
For reference, I am using a linux box to compile this on an intel architecture. If you need more information on my setup please ask.
It seems that sync_val_CAS handles only up to 32 bit values. So when I remove their counter which is used to eliminate the ABA problem, everything compiles and runs fine.
Does anyone know of the relevant 64-bit CAS instruction I should be using here?
As an additional question, are there better (faster) implementations of lock-free fifo queues out there? I came across this by Nir Shavit et al which seems interesting. I am wondering if others have seen similar efforts? Thanks.
Assuming gcc, try using the "march" switch. Something like this: -march=i686
There is also __sync_bool_compare_and_swap. I don't know if it's faster or not.
GCC, last I looked in 2009, does not support contiguous double-word CAS. I had to implement it in inline assembly.
You can find my implementation of the M&S queue (including, in the abstraction layer, the assembly implementation of DCAS) and other lock-free data structures here:
http://www.liblfds.org
Briefly looking at the Nir Shavit et al paper, the queue requires Safe Memory Reclamation, which I suspect you'll need to implement yourself, since it won't be built into the queue. An SMR API will be available in the next release (a couple of weeks).
Lock-free may not be what you want, since lock-free is not necessarily wait-free. If you need a fast thread-safe queue (not lock-free!), then consider using Threading Building Blocks concurrent_queue.
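For reference, the counter trick from the question (pairing the value with a modification count so that a stale compare-and-swap fails) can be sketched without double-word CAS, provided both halves fit into one machine word. This is only an illustration under that assumption: it uses 32-bit slot indices instead of raw pointers, it is written in Rust rather than with the GCC __sync builtins, and all names are hypothetical. It is not the code from the paper or from liblfds.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Pack a 32-bit slot index (low half) and a 32-bit counter (high half).
fn pack(index: u32, count: u32) -> u64 {
    ((count as u64) << 32) | index as u64
}

/// Split a packed word back into (index, count).
fn unpack(word: u64) -> (u32, u32) {
    (word as u32, (word >> 32) as u32)
}

/// Try to move `slot` from the snapshot `expected` to `new_index`,
/// bumping the counter so every successful update yields a fresh word.
fn cas_tagged(slot: &AtomicU64, expected: u64, new_index: u32) -> bool {
    let (_, count) = unpack(expected);
    let desired = pack(new_index, count.wrapping_add(1));
    slot.compare_exchange(expected, desired, Ordering::AcqRel, Ordering::Acquire)
        .is_ok()
}

fn main() {
    let head = AtomicU64::new(pack(1, 0));
    // A slow thread takes a snapshot: index 1, count 0.
    let stale = head.load(Ordering::Acquire);

    // Meanwhile other threads change the index 1 -> 2 -> 1 (the "ABA" pattern).
    let snapshot = head.load(Ordering::Acquire);
    assert!(cas_tagged(&head, snapshot, 2));
    let snapshot = head.load(Ordering::Acquire);
    assert!(cas_tagged(&head, snapshot, 1));

    // The index is 1 again, but the count is now 2, so the stale CAS fails.
    assert!(!cas_tagged(&head, stale, 3));
    println!("head = {:?}", unpack(head.load(Ordering::Acquire)));
}
```

With real 64-bit pointers the pair no longer fits in 64 bits, which is where the double-word CAS discussed above comes in.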
