How many bytes will a thread local variable in Rust use?

I want to use a thread-local variable of type Option<usize> in a Rust library. How many bytes will this use per thread in crates that depend on my library? I'm interested in Rust 1.39, targeting Linux on three architectures: amd64, x86 and arm32.
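For concreteness, the declaration I have in mind looks something like this (the name CACHED and the Cell wrapper are just illustrative):
use std::cell::Cell;
thread_local! {
    // One Option<usize> slot per thread that touches this variable.
    static CACHED: Cell<Option<usize>> = Cell::new(None);
}
fn main() {
    CACHED.with(|c| c.set(Some(42)));
    CACHED.with(|c| println!("{:?}", c.get())); // prints Some(42)
}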

Related

What is the target when I use -C target-cpu=native with Rust on an i7-1165G7 processor?

When I compile the following function with the Rust 1.65 toolchain installed using Rustup, I get assembly that doesn't use the popcnt instruction:
pub fn f(x: u64) -> u32 {
    x.count_ones()
}
To make it generate popcnt, I need to pass -C target-cpu=native.
rustup show says:
$ rustup show
Default host: x86_64-unknown-linux-gnu
and lscpu says:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
Model name: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
So, in this case, what is the actual target used when I pass -C target-cpu=native, if it's not x86_64-unknown-linux-gnu?
The target defined for your architecture is still x86_64-unknown-linux-gnu. That triple basically means that your CPU is x86_64 (that is, an x86-64), the vendor is unknown (which is typically the case unless there's a relevant set of vendors with functional differences), and the OS is linux-gnu (that is, a Linux kernel with the GNU toolchain and userland, including glibc).
Now, there are a wide variety of x86-64 processors with varying capabilities. In fact, I dare say x86-64 spans the widest variety of optional instruction-set extensions of any common architecture. All of those machines running Linux with the GNU userland are x86_64-unknown-linux-gnu. By default, Rust targets a generic CPU; that is, it emits code that will work on every processor that meets the x86-64 architectural definition. That means it will use SSE2, which is part of that baseline, but not POPCNT, which came along later and is not.
In most cases, that's exactly what you want. It is almost always far more important to have a binary that just works than one tuned to the local system, and that's usually what distro maintainers and packagers want as well.
However, if you need the POPCNT instruction, then -C target-cpu=native can work. If you run rustc --print target-cpus | head -n2, you'll see something like this:
Available CPUs for this target:
native - Select the CPU of the current host (currently alderlake).
In my case, this is alderlake. You will probably see tigerlake there, but I don't have such a system, so you'll have to look for yourself.
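If you want the speedup without tuning the whole binary to one machine, another option (just a sketch, not something the question requires) is to detect the feature at run time on an x86-64 target and enable popcnt for a single function:
pub fn popcount(x: u64) -> u32 {
    if std::is_x86_feature_detected!("popcnt") {
        // Safe on this CPU because we just checked for the feature.
        unsafe { popcount_popcnt(x) }
    } else {
        x.count_ones() // generic fallback, works on any x86-64 CPU
    }
}

#[target_feature(enable = "popcnt")]
unsafe fn popcount_popcnt(x: u64) -> u32 {
    // With the feature enabled just for this function, count_ones() lowers to popcnt.
    x.count_ones()
}

fn main() {
    println!("{}", popcount(0b1011)); // 3
}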

How to work with dynamically allocated memory in Assembly (x64/Linux)?

I'm trying to build a toy language compiler (that generates assembly for NASM), and so far so good, but I've gotten really stuck on the topic of dynamic memory allocation. It's the only part of the assembly side that's stopping me from starting my implementation. The goal is just to learn how things work at the low level.
Is there a good and comprehensive guide/tutorial/book about how to dynamically allocate, use and free memory using Assembly (preferably x64/Linux)? I have found some tips here and there mentioning brk, sbrk and mmap, but I don't know how to use them and I feel that there is more to it than just checking the arguments and the return value of these syscalls. How do they work exactly?
For example, in this post, it is mentioned that sbrk moves the border of the data segment. Can I know where the border is initially/after calling sbrk? Can I just use my initial data segment for the first dynamic allocations (and how)?
This other post explains how free works in C, but it does not explain how C actually gives the memory back to the OS. I have also started to read some books on assembly, but somehow they seem to ignore this topic (perhaps because it's OS specific).
Are there some working assembly code examples? I really couldn't find enough information.
I know one way is to use glibc's malloc, but I wanted to know how it is done from assembly. How do compiled languages or even LLVM do it? Do they just use C's malloc?
malloc is an interface provided to userspace programs. It has different implementations, such as ptmalloc, tcmalloc and jemalloc. Depending on the environment, you can choose between allocators, or even implement your own. As far as I know, jemalloc manages memory for userspace programs by mmap-ing a block of the demanded size, and it controls when that block is returned to the kernel/system. (jemalloc is used in Android, for example.) jemalloc also uses sbrk, depending on the state of system memory. For more detailed information, I think you have to read the code of the allocators you want to learn about.
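To get a feel for the call sequence an allocator makes underneath, here is a minimal sketch. It's written in Rust with the libc crate rather than NASM (an assumption purely for illustration), but the arguments and the return-value check are the same ones you'd do around the raw syscalls (mmap is syscall 9 and munmap is syscall 11 on x86-64 Linux):
use libc::{mmap, munmap, MAP_ANONYMOUS, MAP_FAILED, MAP_PRIVATE, PROT_READ, PROT_WRITE};
use std::ptr;

fn main() {
    let len = 4096; // one page
    // Ask the kernel for an anonymous, private, read/write mapping.
    let block = unsafe {
        mmap(ptr::null_mut(), len, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)
    };
    assert_ne!(block, MAP_FAILED, "mmap failed");

    // An allocator would now carve this block into smaller allocations.
    unsafe { *(block as *mut u8) = 42 };

    // Return the whole mapping to the kernel when the allocator decides to release it.
    let rc = unsafe { munmap(block, len) };
    assert_eq!(rc, 0, "munmap failed");
}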

Find total size of struct at runtime

Is there any way to calculate the total stack and heap size of a struct at runtime?
As far as I can tell, std::mem::{size_of, size_of_val} only report a value's shallow (inline) size, but a struct might also own heap-allocated buffers (e.g. a Vec).
Servo was using the heapsize crate to measure the size of heap allocations at runtime.
You can call its heap_size_of function to measure the size of a heap allocation made through jemalloc.
Be aware that you can get different results with different allocators.
From its GitHub page: "This crate is not maintained and is no longer used by Servo. At the time of writing, Servo uses internal malloc_size_of instead."
So you can either use the heapsize crate or look at the implementation details of Servo's malloc_size_of.
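If you'd rather not pull in a crate, a common hand-rolled pattern is to add a method to your own type that sums the shallow size and the heap buffers it knows it owns. A minimal sketch (the struct and the deep_size name are made up for illustration; it ignores allocator overhead and nested heap data):
use std::mem::{size_of, size_of_val};

struct Document {
    id: usize,
    body: String,
    scores: Vec<f64>,
}

impl Document {
    // Shallow size plus the heap buffers this type knows it owns.
    fn deep_size(&self) -> usize {
        size_of_val(self)
            + self.body.capacity()                      // String's heap buffer (bytes)
            + self.scores.capacity() * size_of::<f64>() // Vec's heap buffer
    }
}

fn main() {
    let d = Document { id: 1, body: "hello".to_string(), scores: vec![0.0; 100] };
    println!("{} bytes", d.deep_size());
}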

Does hyper threading change the binary machine code of a compiled program?

Does hyper threading change the binary code sequence of a compiled program?
If we had a compiled binary code of say:
10100011100011100101010010011111100011111110010111
If hyper threading were enabled, what would a thread represent?
Would it be just some section of this binary code?
How does the operating system allocate time intervals for these threads?
For parallelism:
Would the compiled binary code be any different?
How do the cores handle this binary sequence?
Do they just execute some section of the code on different cores?
How does the operating system allocate parallel tasks? Is there any specific structure?
Most programs are compiled to be run by a particular operating system, on any member of a broad family of processors (e.g., on Windows, for the x86-64 family). Within a given CPU family, there may be CPUs with different numbers of cores, and there may be cores with or without hyperthreading.
None of that changes the binary code. The same program typically will run on any member of the CPU family, without any changes.
The operating system in which the program runs may or may not be differently configured for different members of the processor family.
Sometimes, a program can be compiled to exploit the features of a specific CPU, but programs compiled in that way are unsuitable for distribution to different sites and/or PCs.
If we had a compiled binary code of say: 101000111000... If hyper threading were enabled, what would a thread represent? Would it be just some section of this binary code?
That's an unanswerable question. You can learn more about what "binary code" means by reading an introductory book on computer architecture. E.g., https://www.amazon.com/Elements-Computing-Systems-Building-Principles/dp/0262640686/

memory fences/barriers in C++: does boost or other libraries have them?

I am reading these days about memory fences and barriers as a way to synchronize multithreaded code and avoid code reordering.
I usually develop in C++ under Linux and I use the boost libraries heavily, but I'm not able to find any class related to this. Do you know whether memory barriers or fences are present in boost, or whether there is a way to achieve the same effect? If not, what good library should I look at?
There are no low-level memory barriers in boost yet, but there is a proposed boost.atomic library that provides them.
Compilers provide their own either as intrinsics or library functions, such as gcc's __sync_synchronize() or _mm_mfence() for Visual Studio.
The C++0x library provides atomic operations, including memory fences in the form of std::atomic_thread_fence. Though gcc has supplied various forms of the C++0x atomics since V4.4, neither V4.4 nor V4.5 includes this form of fence. My (commercial) just::thread library provides a full implementation of C++0x atomics, including fences, for g++ 4.3 and 4.4, and Microsoft Visual Studio 2005, 2008 and 2010.
The place where memory barriers are required is when you avoid using kernel synchronisation mechanisms in an SMP environment, usually for performance reasons.
There is an implicit memory barrier in any kernel synchronisation operation (e.g. signalling semaphores, locking and unlocking mutexes) and in context switching, to guard against data coherence hazards.
I have just found myself needing (moderately) portable memory barrier implementations (ARM and x86), and found the Linux source tree to be the best source for this. Linux has SMP variants of the mb(), rmb() and wmb() macros, which on some platforms result in more specific (and possibly less costly) barriers than the non-SMP variants.
This doesn't appear to be a concern on x86, and particularly on ARM, where both are implemented the same way.
This is what I've cribbed together from the Linux header files (suitable for ARMv7 and non-ancient x86/x64 processors):
#if defined(__i386__) || defined(__x86_64__)
/* x86/x86-64: full, read and write barriers */
#define smp_mb()  asm volatile("mfence" ::: "memory")
#define smp_rmb() asm volatile("lfence" ::: "memory")
#define smp_wmb() asm volatile("sfence" ::: "memory")
#endif
#if defined(__arm__)
/* ARMv7: dmb orders all memory accesses, so it serves for all three */
#define dmb() __asm__ __volatile__ ("dmb" : : : "memory")
#define smp_mb()  dmb()
#define smp_rmb() dmb()
#define smp_wmb() dmb()
#endif
Naturally, dabbling with memory barriers has the attendant risk that the resulting code is practically impossible to test, and any resulting bugs will be obscure and difficult to reproduce race conditions :/
There is, incidentally, a very good description of memory barriers in the Linux kernel documentation.
There is a boost::barrier class/concept but it's a bit high level. Speaking of which, why do you need a low level barrier? Synchronization primitives should be enough, shouldn't they? And they should be using memory barriers where necessary, either directly or indirectly through other, lower level primitives.
If you still think you need a low-level implementation, I know of no classes or libraries that implement barriers, but there's some implementation-specific code in the Linux kernel. Search for mb(), rmb() or wmb() in include/asm-{arch}/system.h.
Now, in 2019, the C++11 fences should be available on nearly every implementation of the C++ standard library. The header is <atomic>.
You can issue a fence by calling std::atomic_thread_fence. In short:
std::atomic_thread_fence(std::memory_order_release); guarantees that no store operations are moved past the call. (All side effects will be made visible to other threads.)
std::atomic_thread_fence(std::memory_order_acquire); guarantees that no load operations are moved before the call. (All side effects of other threads will be made visible.)
There are more details in the documentation.
