I have 10 threads and a Vec of length 100.
Can I have thread 0 work on elements 0-9 (sort them, for example), while thread 1 is working on elements 10-19, etc.?
Or do I have to use a Vec<Vec<>> for this? (Which I would rather avoid, because the elements would no longer be contiguous in memory)
Yes, you can. You asked about the mutable case, but I'll preface by saying that if the Vec is read-only (e.g. for a reduction) you can safely send an immutable reference to the specific slice you want to each thread, simply by using something like &my_vec[idx1..idx2] in a loop.
The mutable case is a bit trickier, since the borrow checker cannot see that indexed borrows of a Vec don't overlap. However, there are several methods, notably split_at_mut, that you can call to get these non-overlapping subslices. By far the easiest is the chunks_mut iterator documented here. (Note that there is a matching chunks iterator for the immutable case, so you only need minor changes to handle either case.)
Be aware that the chunks and chunks_mut functions take the size of each chunk, not the number of chunks. However, deriving one from the other is fairly straightforward.
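As a rough illustration of that approach (not from the original answer), here is a minimal sketch using chunks_mut together with scoped threads (std::thread::scope, available in stable Rust since 1.63); the 100-element Vec, 10 chunks, and the sort are just the question's example:

```rust
use std::thread;

fn main() {
    let mut data: Vec<i32> = (0..100).rev().collect();
    let chunk_size = data.len() / 10; // 10 threads -> chunks of 10 elements

    // Scoped threads may borrow `data` directly, so no Arc/Mutex is needed.
    thread::scope(|s| {
        for chunk in data.chunks_mut(chunk_size) {
            s.spawn(move || {
                // Each thread gets exclusive access to its own &mut [i32].
                chunk.sort_unstable();
            });
        }
    });

    // Each 10-element chunk is now sorted, and the Vec stays contiguous.
    assert!(data
        .chunks(chunk_size)
        .all(|c| c.windows(2).all(|w| w[0] <= w[1])));
}
```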
A few words of caution about the mutable case, however. If you split the data evenly you may get abysmal performance. The reason is that the CPU doesn't work on individual addresses; it works on blocks of memory known as cache lines, typically 64 bytes long. If multiple threads write to the same cache line, the line has to bounce between the cores' caches to stay coherent, which is far slower than each core working on its own lines.
Unfortunately, in safe Rust there's no easy way to determine where on a cache line a Vec's buffer starts (the buffer may well begin in the middle of a cache line); most of the methods I know of to detect this involve twiddling with the lower bits of the actual pointer address. The easiest way to handle this is to simply add a 64-byte pad of throwaway data between the chunks you want to use. So, for instance, if you have a Vec containing 1000 32-bit floats and 10 threads, you add 16 dummy floats (32 bits = 4 bytes, and 16 * 4 = 64 = one cache line) between each run of 100 of your "real" floats and ignore the dummies during computation.
This is known as false sharing, and I encourage you to look up other references to learn other methods of dealing with this.
Note that the 64-byte line size holds on essentially all current x86 processors. If you're compiling for ARM, PowerPC, MIPS, or something else, this value can and will vary.
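To make the padding idea concrete (this sketch is mine, not the original answer's; the 100 real + 16 dummy f32 split comes from the example above, and the += 1.0 loop is just a placeholder workload):

```rust
use std::thread;

const REAL: usize = 100;   // "real" floats per thread
const PAD: usize = 16;     // 16 * 4 bytes = 64 bytes = one assumed cache line
const STRIDE: usize = REAL + PAD;

fn main() {
    let threads = 10;
    // Layout: [100 real | 16 dummy | 100 real | 16 dummy | ...]
    let mut data = vec![0.0f32; threads * STRIDE];

    thread::scope(|s| {
        for chunk in data.chunks_mut(STRIDE) {
            s.spawn(move || {
                // Work only on the real prefix; the trailing PAD floats are
                // never touched and just keep neighbouring threads' data on
                // different cache lines.
                let (real, _pad) = chunk.split_at_mut(REAL);
                for x in real.iter_mut() {
                    *x += 1.0;
                }
            });
        }
    });
}
```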
I am looking for a way to directly read the content of a file into the provided uninitialized byte array.
Currently, I have code like the following:
```rust
use std::fs::File;
use std::io::Read;
use std::mem::MaybeUninit;

let mut buf: MaybeUninit<[u8; 4096]> = MaybeUninit::zeroed();
let mut f = File::open("some_file")?;
f.read(unsafe { buf.as_mut_ptr().as_mut().unwrap() })?;
```
The code does work, except that it unnecessarily initializes the byte array with 0. I would like to replace MaybeUninit::zeroed() with MaybeUninit::uninit(), but doing so will trigger undefined behavior according to the documentation of MaybeUninit. Is there a way to initialize an uninitialized memory region with the content of the file without first reading the data to somewhere else, by only using the standard library? Or do we need to go for the OS-specific API?
The previous shot at the answer is kept below for posterity. Let's deal with the actual elephant in the room:
Is there a way to initialize an uninitialized memory region with the content of the file without first reading the data to somewhere else, by only using the standard library? Or do we need to go for the OS-specific API?
There is: Read::read_to_end(&mut self, buf: &mut Vec<u8>) -> io::Result<usize>
This function will drain your impl Read object to EOF and, depending on the underlying implementation, will do one or more reads, growing the provided Vec as it goes and appending all bytes to it.
It returns the number of bytes read; I/O errors are propagated and need to be handled (the default implementation retries on ErrorKind::Interrupted).
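A minimal usage sketch (my own, not part of the original answer; the metadata-based capacity reservation is optional):

```rust
use std::fs::File;
use std::io::{self, Read};

fn read_whole_file(path: &str) -> io::Result<Vec<u8>> {
    let mut f = File::open(path)?;
    // Reserving up front avoids repeated reallocation; read_to_end grows the
    // Vec as needed either way and handles the buffer management internally.
    let mut buf = Vec::with_capacity(f.metadata()?.len() as usize);
    let n = f.read_to_end(&mut buf)?;
    debug_assert_eq!(n, buf.len());
    Ok(buf)
}
```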
You are trying to micro-optimize something based on assumptions that don't actually hold.
The initialization of the array is done in one go, about as low-level as it gets, with memset, all in one chunk. Both calloc and malloc+memset are highly optimized, and calloc relies on a trick or two to make it even more performant. Somebody on Code Review pitted "highly optimized code" against a naive implementation and lost as a result.
The takeaway is that second-guessing the compiler is typically fraught with issues and, overall, not worth micro-optimizing for unless you can put some real numbers on the issues.
The second takeaway is about how memory is organized. As I am sure you are aware, allocating memory can be dramatically faster or slower depending on the position and size of the contiguous chunk you are allocating, because memory is managed in fixed-size units (pages). This is a much more impactful factor, to the point that, under the hood, the allocator will often align your request to an entire page to avoid fragmenting it, particularly where it interacts with the L1/L2 caches.
If anything isn't clear, let me know and I'll generate some small benchmarks for you.
Finally, MaybeUninit is not the tool you want for this job in any case. Its point isn't to skip a memset or two; you end up doing that initialization yourself anyway, because the contract of assume_init requires you to guarantee that the values are valid. There are use cases for it, but they're rare.
In larger cases
There is a performance difference between uninitialized and initialized memory, and we're going to show it in an absolutely ideal scenario: we make ourselves a 64M buffer in memory and wrap it in a Cursor so we get a Read type. This Read type has far, far lower latency than most I/O sources you will encounter in the wild, since the data is almost guaranteed to sit in L2 cache during the benchmark cycle (due to its size) or in L3 cache (because we're single-threaded). This should let us notice the performance cost of the memset.
We're going to run three versions for each case (the code):
One where we define our buffer as [MaybeUninit::uninit().assume_init(); N], i.e. we're taking N chunks of MaybeUninit<u8>
One where our MaybeUninit is a contiguous N-element-long chunk
One where we're just mapping straight into an initialized buffer (a stripped-down sketch of this case follows the list)
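(The full benchmark is behind the link above and, judging by the output below, uses criterion; purely as an illustration, the "small reads / safe" case looks roughly like this sketch, with the sizes taken from the description above.)

```rust
use std::io::{Cursor, Read};

fn main() {
    // 64 MiB of pretend "I/O" data that stays cache-resident during the run.
    let data = vec![0u8; 64 * 1024 * 1024];
    let mut reader = Cursor::new(&data[..]);

    // The "safe" variant: 4096-byte reads into a plain, zero-initialized buffer.
    let mut buf = [0u8; 4096];
    let mut total = 0usize;
    loop {
        let n = reader.read(&mut buf).expect("reads from a Cursor do not fail");
        if n == 0 {
            break;
        }
        total += n;
    }
    assert_eq!(total, data.len());
}
```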
The results (on a core i9-9900HK laptop):
```
large reads/one uninit              time: [1.6720 us 1.7314 us 1.7848 us]
large reads/small uninit elements   time: [2.1539 us 2.1597 us 2.1656 us]
large reads/safe                    time: [2.0627 us 2.0697 us 2.0771 us]
small reads/one uninit              time: [4.5579 us 4.5722 us 4.5893 us]
small reads/small uninit elements   time: [5.1050 us 5.1219 us 5.1383 us]
small reads/safe                    time: [7.9654 us 7.9782 us 7.9889 us]
```
The results are as expected:
Allocating N MaybeUninit is slower than one huge chunk; this is completely expected and should not come as a surprise.
Small, iterative 4096-byte reads are slower than a huge, single, 128M read even when the buffer only contains 64M
There is a small performance loss in reading using initialized memory, of about 30%
Opening anything else on the laptop while testing causes a 50%+ increase in benchmarked time
The last point is particularly important, and it becomes even more important when dealing with real I/O as opposed to a buffer in memory. The more layers of cache you have to traverse, the more side-effects you get from other processes impacting your own processing. If you are reading a file, you will typically encounter:
The filesystem cache (may or may not be swapped)
L3 cache (if on the same core)
L2 cache
L1 cache
Depending on the level of the cache that produces a cache miss, you're more or less likely to have your performance gain from using uninitialized memory dwarfed by the performance loss in having a cache miss.
So, the (perhaps unexpected) TL;DR:
Small, iterative reads are slower
There is a performance gain in using MaybeUninit, but it is typically an order of magnitude smaller than the cost of any real I/O operation
One popular solution to the ABA problem in lock-free data structures is to tag pointers with an additional monotonically incrementing tag.
```c
struct aba {
    void    *ptr;
    uint32_t tag;
};
```
However, this approach has a problem. It is really slow and has huge cache problems. I can obtain roughly a 2x speed-up if I ditch the tag field, but isn't that unsafe?
So my next attempt, for 64-bit platforms, stuffs the tag bits into the pointer field itself.
```c
struct aba {
    uintptr_t __ptr;
};

uint32_t get_tag(struct aba aba) { return aba.__ptr >> 48U; }
```
But someone told me that only 16 bits for the tag is unsafe. My new plan is to use pointer alignment to cache lines to stuff in more tag bits, but I want to know if that will work.
If that fails to work, my next plan is to use Linux's MAP_32BIT mmap flag to allocate data so I only need 32 bits of pointer space.
How many bits do I need for the ABA tag in lock-free data-structures?
The number of tag bits that is practically safe can be estimated from the preemption time and the frequency of pointer modifications.
As a reminder, the ABA problem happens when a thread reads the value it wants to change with compare-and-swap, gets preempted, and, by the time it resumes, the actual value of the pointer happens to be equal to what the thread read before. The compare-and-swap may therefore succeed despite modifications to the data structure possibly made by other threads during the preemption.
The idea of adding the monotonically incremented tag is to make each modification of the pointer unique. For this to work, increments must produce unique tag values during the time a modifying thread might be preempted; i.e. for guaranteed correctness the tag may not wrap around during the whole preemption time.
Let's assume that preemption lasts a single OS scheduling time slice, which is typically tens to hundreds of milliseconds. The latency of CAS on modern systems is tens to hundreds of nanoseconds. So a rough worst-case estimate is that there might be millions of pointer modifications while a thread is preempted, and so there should be 20+ bits in the tag for it not to wrap around.
In practice it may be possible to make a better estimate for a particular real use case, based on the known frequency of CAS operations. One also needs to estimate the worst-case preemption time more accurately; for example, a low-priority thread preempted by a higher-priority job might end up with a much longer preemption time.
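To make that back-of-the-envelope arithmetic concrete, here it is as a tiny Rust snippet (the 100 ms and 100 ns figures are the rough assumptions from above, not measurements):

```rust
fn main() {
    // Assumed worst-case numbers; adjust for your own system and workload.
    let preemption_ns: u64 = 100_000_000; // ~100 ms spent preempted
    let cas_period_ns: u64 = 100;         // ~100 ns per modification of the pointer

    let max_updates = preemption_ns / cas_period_ns;       // 1_000_000 updates
    let tag_bits = 64 - (max_updates - 1).leading_zeros(); // smallest tag that can't wrap
    println!("{max_updates} updates -> at least {tag_bits} tag bits"); // prints 20
}
```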
According to the paper
http://web.cecs.pdx.edu/~walpole/class/cs510/papers/11.pdf
Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects (IEEE Transactions on Parallel and Distributed Systems, Vol. 15, No. 6, June 2004, p. 491) by Maged M. Michael,
tag bits should be sized so that wraparound is impossible in realistic lock-free scenarios (I read this as: if you may have N threads running and each may access the structure, you should have at least N+1 distinct tag values):
6.1.1 IBM ABA-Prevention Tags
The earliest and simplest lock-free method for node reuse is
the tag (update counter) method introduced with the
documentation of CAS on the IBM System 370 [11]. It
requires associating a tag with each location that is the
target of ABA-prone comparison operations. By incrementing
the tag when the value of the associated location is
written, comparison operations (e.g., CAS) can determine if
the location was written since it was last accessed by the
same thread, thus preventing the ABA problem.
The method requires that the tag contains enough bits to make
full wraparound impossible during the execution of any
single lock-free attempt. This method is very efficient and
allows the immediate reuse of retired nodes.
Depending on your data structure you may be able to steal some extra bits from the pointers. For example, if the objects are 64 bytes and always aligned on 64-byte boundaries, the lower 6 bits of each pointer could be used for the tag (but that's probably what you already suggested for your new plan).
Another option would be to use an index into your objects instead of pointers.
In the case of contiguous objects that would of course simply be an index into an array or vector. In the case of lists or trees with objects allocated on the heap, you could use a custom allocator and an index into your allocated block(s).
For, say, 16M objects you would need only 24 bits, leaving 40 bits for the tag.
This would need some (small and fast) extra calculation to get the address, but if the alignment is a power of 2 only a shift and an addition are needed.
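Not part of the answer above, but here is a hedged sketch of that index-plus-tag packing in Rust (the 24/40-bit split mirrors the example; pack, unpack, and try_replace_head are invented names for illustration):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Low 24 bits: index into a pre-allocated block of nodes.
// High 40 bits: ABA tag.
const INDEX_BITS: u32 = 24;
const INDEX_MASK: u64 = (1 << INDEX_BITS) - 1;

fn pack(index: u32, tag: u64) -> u64 {
    (tag << INDEX_BITS) | (index as u64 & INDEX_MASK)
}

fn unpack(word: u64) -> (u32, u64) {
    ((word & INDEX_MASK) as u32, word >> INDEX_BITS)
}

fn try_replace_head(head: &AtomicU64, new_index: u32) -> bool {
    let old = head.load(Ordering::Acquire);
    let (_old_index, old_tag) = unpack(old);
    // Bump the tag on every swap so a recycled index still compares unequal
    // to the value we originally read.
    let new = pack(new_index, old_tag.wrapping_add(1));
    head.compare_exchange(old, new, Ordering::AcqRel, Ordering::Acquire)
        .is_ok()
}
```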
I have recently read bits and pieces about garbage collection (mostly in Java) and one question still remains unanswered: how does a JVM (or a runtime system in general) keep track of CURRENTLY live objects?
I understand these objects are the ones currently reachable from the stack, i.e. all the local variables and function parameters that ARE object references. The problem with this approach is that when the runtime system checks what is currently on the stack, how would it differentiate between a reference variable and a plain int? It can't, can it?
Therefore, there must be some mechanism that allows the runtime to build the initial list of live objects (the root set) to pass to the mark-sweep phase...
I believe the answer provided by greyfairer is wrong. The JVM runtime does not gather the root set from the stack by looking at which bytecodes were used to push data onto it. The stack frame consists of 4-byte slots (on a 32-bit architecture). Each slot may hold a reference to a heap object or a primitive value such as an int. When a GC is needed, the runtime scans the stack from top to bottom, and a slot is treated as containing a reference if:
a. It is aligned on a 4-byte boundary.
b. The value in the slot points into the heap (between its lower and upper bound).
c. The allocbit is set. The allocbit is a flag indicating whether the corresponding memory location is allocated or not.
Here is my reference: http://www.ibm.com/developerworks/ibm/library/i-garbage2/.
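Purely as an illustration of that per-slot test (the field names and bitmap layout are invented; a real VM does this on raw stack memory, not on a Rust slice):

```rust
struct Heap {
    lower: usize,          // inclusive lower bound of the heap
    upper: usize,          // exclusive upper bound of the heap
    alloc_bits: Vec<bool>, // one "allocbit" per 4-byte granule of the heap
}

impl Heap {
    fn slot_is_reference(&self, slot_value: usize) -> bool {
        slot_value % 4 == 0                                        // (a) 4-byte aligned
            && slot_value >= self.lower && slot_value < self.upper // (b) points into the heap
            && self.alloc_bits[(slot_value - self.lower) / 4]      // (c) allocbit is set
    }

    fn root_set(&self, stack_slots: &[usize]) -> Vec<usize> {
        // Anything that passes the test is conservatively treated as a root.
        stack_slots
            .iter()
            .copied()
            .filter(|&v| self.slot_is_reference(v))
            .collect()
    }
}
```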
There are other techniques to find the root set (not used in Java). For example, because pointers are usually aligned on 4/8-byte boundaries, the lowest bit can be used to indicate whether a slot is a primitive value or a pointer: for primitive values, that bit is set to 1. The disadvantage is that you only have 31 bits (on a 32-bit architecture) to represent an integer, and every operation on primitive values involves shifting, which is obviously an overhead.
Also, you can allocate all types, including int, on the heap; that is, everything is an object. Then every slot in a stack frame is a reference.
The runtime can perfectly differentiate between reference variables and primitives, because that's in the compiled bytecode.
For example if a function f1 calls a function f2(int i, Object o, long l), the calling function f1 will push 4 bytes on the stack (or in a register) representing i, 4 (or 8?) bytes for the reference to o, and 8 bytes for l. The called function f2 knows where to find these bytes on the stack, and could potentially copy the reference to some object on the heap, or not. When the function f2 returns, the calling function will drop the parameters from the stack.
The runtime interprets the bytecode and keeps track of what it pushes onto or drops from the stack, so it knows what is a reference and what is a primitive value.
According to http://www.javacoffeebreak.com/articles/thinkinginjava/abitaboutgarbagecollection.html, Java uses a tracing garbage collector, not a reference counting algorithm.
The HotSpot VM generates a GC map for each subroutine it compiles, containing information about where the roots are. For example, suppose it has compiled a subroutine to machine code (the principle is the same for bytecode) that is 120 bytes long; then the GC map for it could look something like this:
0 : [RAX, RBX]
4 : [RAX, [RSP+0]]
10 : [RBX, RSI, [RSP+0]]
...
120 : [[RSP+0],[RSP+8]]
Here [RSP+x] indicates stack locations and R?? registers. So if the thread is stopped at the assembly instruction at offset 10 and a GC cycle runs, then HotSpot knows that the three roots are in RBX, RSI and [RSP+0]. It traces those roots and updates the pointers if it has to move the objects.
The format I've described for the GC map is just for demonstrating the principle and obviously not the one HotSpot actually uses. It is not complete because it doesn't contain information about registers and stack slots which contain primitive live values and it is not space efficient to use a list for every instruction offset. There are many ways in which you can pack the information in a much more efficient way.
I came across an interview question which asks:
while searching for a value in an array using 2 parallel threads,
which method would be more efficient:
(1) reading each half of the array on a different thread (splitting it in half), or
(2) reading the array at odd and even positions (one thread reads the odd positions
and the other reads the even positions in the array)?
I don't understand why one would be more efficient than the other.
I'd appreciate it if someone could clarify this for me.
Thanks in advance.
Splitting the array in half is almost certainly the way to go. It will almost never be slower, and may be substantially faster.
The reason is fairly simple: when you're reading data from memory, the processor will normally read an entire cache line at a time. The exact size varies between processors, but doesn't matter a whole lot (though, in case you care, something like 64 bytes would be in the ballpark) -- the point is that it reads a contiguous chunk of several bytes at a time.
That means that with the odd/even version, each of the two processors running the threads has to read all the data. By splitting the data in half, each core reads only half the data. If your split doesn't happen to fall on a cache line boundary, each will read a little extra (what it needs, rounded up to the size of a cache line); on average that only adds half a cache line to what each needs to read, though.
If the "processors" involved are really two cores on the same processor die, chances are that it won't make a whole lot of difference either way though. In this case, the bottleneck will normally be reading the data from main memory into the lowest-level processor cache. Even with only one thread, you'll (probably) be able to search through the data as fast as you can read it from memory, and adding more threads (no matter how you arrange their use of the data) isn't going to improve things much (if at all).
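To make the two strategies concrete, a small Rust sketch (the function names and the linear search are mine; the question itself is language-agnostic):

```rust
use std::thread;

// Strategy 1: split the array in half; each thread scans a contiguous range.
fn contains_contiguous(data: &[i32], target: i32) -> bool {
    let (left, right) = data.split_at(data.len() / 2);
    thread::scope(|s| {
        let l = s.spawn(|| left.contains(&target));  // elements 0 .. N/2
        let r = s.spawn(|| right.contains(&target)); // elements N/2 .. N
        l.join().unwrap() || r.join().unwrap()
    })
}

// Strategy 2: one thread scans even indices, the other odd indices. Each
// thread strides over the whole array, so it still pulls in every cache line
// even though it only looks at half the elements.
fn contains_interleaved(data: &[i32], target: i32) -> bool {
    thread::scope(|s| {
        let even = s.spawn(|| data.iter().step_by(2).any(|&x| x == target));
        let odd = s.spawn(|| data.iter().skip(1).step_by(2).any(|&x| x == target));
        even.join().unwrap() || odd.join().unwrap()
    })
}
```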
The difference is that in the half-split case, each thread accesses memory linearly from left to right, searching indices 0 -> N/2 and N/2 -> N respectively, which maximizes cache usage, since hardware prefetching works on linear access patterns.
In the second case (even/odd), cache performance would be worse, not only because each thread fetches cache lines containing elements it never uses (thread 0 loads the lines holding elements 0, 1, 2, ... but only looks at every other one), but also because of cache ping-pong effects (in the case of writing, which your example doesn't do).
Are there any modern, common CPUs where it is unsafe to write to adjacent elements of an array concurrently from different threads? I'm especially interested in x86. You may assume that the compiler doesn't do anything obviously ridiculous to increase memory granularity, even if it's technically within the standard.
I'm interested in the case of writing arbitrarily large structs, not just native types.
Note:
Please don't mention the performance issues with regard to false sharing. I'm well aware of these, but they're of no practical importance for my use cases. I'm also aware of visibility issues with regard to data written from threads other than the reader. This is addressed in my code.
Clarification: This issue came up because on some processors (for example, old DEC Alphas) memory could only be addressed at word level. Therefore, writing to memory in non-word size increments (for example, single bytes) actually involved read-modify-write of the byte to be written plus some adjacent bytes under the hood. To visualize this, think about what's involved in writing to a single bit. You read the byte or word in, perform a bitwise operation on the whole thing, then write the whole thing back. Therefore, you can't safely write to adjacent bits concurrently from different threads.
It's also theoretically possible, though utterly silly, for a compiler to implement memory writes this way when the hardware doesn't require it. x86 can address single bytes, so it's mostly not an issue, but I'm trying to figure out if there's any weird corner case where it is. More generally, I want to know if writing to adjacent elements of an array from different threads is still a practical issue or mostly just a theoretical one that only applies to obscure/ancient hardware and/or really strange compilers.
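To visualize the read-modify-write being described, a single-threaded sketch (deliberately single-threaded; doing this non-atomically from two threads is exactly the hazard in question):

```rust
// On hardware that can only store whole words, updating one byte means
// reading the containing word, splicing the byte in, and writing the whole
// word back.
fn store_byte_via_word(word: &mut u32, byte_index: usize, value: u8) {
    assert!(byte_index < 4);
    let shift = byte_index * 8;
    let mut w = *word;            // read the whole word
    w &= !(0xFFu32 << shift);     // clear the target byte
    w |= (value as u32) << shift; // splice in the new byte
    *word = w;                    // write the whole word back
}

fn main() {
    let mut word = 0x1122_3344u32;
    store_byte_via_word(&mut word, 0, 0xAB);
    assert_eq!(word, 0x1122_33AB);
    // If another thread had updated a different byte of the same word between
    // our read and our write-back, that update would be silently lost.
}
```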
Yet another edit: Here's a good reference that describes the issue I'm talking about:
http://my.safaribooksonline.com/book/programming/java/0321246780/threads-and-locks/ch17lev1sec6
Writing a naturally aligned, native-sized value (i.e. 1, 2, 4, or 8 bytes) is atomic (well, 8 bytes is only atomic on 64-bit machines). So, no, it isn't a problem: writing a native type will always write as expected.
If you're writing multiple native types (i.e. looping to write an array) then it's possible to have an error if there's a bug in the operating system kernel or an interrupt handler that doesn't preserve the required registers.
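For what it's worth, here is what the "adjacent elements, different threads" case looks like in safe Rust (my sketch, not part of the answer): bytes 7 and 8 sit side by side, almost certainly in the same cache line, and are written by different threads, which is well-defined on the hardware discussed here; false sharing only costs performance, not correctness.

```rust
use std::thread;

fn main() {
    let mut bytes = [0u8; 16];
    // Two non-overlapping halves of the same small array.
    let (lo, hi) = bytes.split_at_mut(8);

    thread::scope(|s| {
        s.spawn(|| lo.fill(0xAA)); // writes bytes 0..8
        s.spawn(|| hi.fill(0xBB)); // writes bytes 8..16
    });

    assert!(bytes[..8].iter().all(|&b| b == 0xAA));
    assert!(bytes[8..].iter().all(|&b| b == 0xBB));
}
```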
Yes, definitely: writing a mis-aligned word that straddles a CPU cache line boundary is not atomic.