Efficiency issue: searching an array on two parallel threads

I came across an interview question which asks:
when searching for a value in an array using 2 parallel threads,
which method would be more efficient?
(1) Read each half of the array on a different thread (splitting it in half).
(2) Read the array at odd and even positions (one thread reads the odd positions
and one reads the even positions in the array).
I don't understand why one would be more efficient than the other.
I'd appreciate it if someone would clarify this for me.
Thanks in advance.

Splitting the array in half is almost certainly the way to go. It will almost never be slower, and may be substantially faster.
The reason is fairly simple: when you're reading data from memory, the processor will normally read an entire cache line at a time. The exact size varies between processors, but doesn't matter a whole lot (though, in case you care, something like 64 bytes would be in the ballpark) -- the point is that it reads a contiguous chunk of several bytes at a time.
That means with the odd/even version, the processors running the two threads will each have to read all the data. By splitting the data in half, each core reads only half the data. If your split doesn't happen to fall on a cache line boundary, each will read a little extra (what it needs, rounded up to the size of a cache line). On average, though, that adds only half a cache line to what each needs to read.
If the "processors" involved are really two cores on the same processor die, chances are that it won't make a whole lot of difference either way though. In this case, the bottleneck will normally be reading the data from main memory into the lowest-level processor cache. Even with only one thread, you'll (probably) be able to search through the data as fast as you can read it from memory, and adding more threads (no matter how you arrange their use of the data) isn't going to improve things much (if at all).

The difference is that in the half-split case, the memory is accessed linearly by each thread from left to right, searching indexes 0 -> N/2 and N/2 -> N respectively, which maximizes cache usage, since memory is prefetched linearly ahead.
In the second case (even/odd), cache performance would be worse, not only because you would be prefetching items that you don't use (thread 0 fetches the lines containing elements 0, 1, etc. but uses only half of them), but also because of cache ping-pong effects (these apply when writing, which your example doesn't do).
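To make the two strategies concrete, here is a minimal sketch, assuming POSIX threads (the names array, N, and target are illustrative). It runs both division schemes; timing the two phases is a simple way to observe the difference described above.

#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

#define N (1 << 20)

static int array[N];
static int target = -1;                 /* value being searched for */

struct task { size_t start, stop, step; long found; };

static void *search(void *arg)
{
        struct task *t = arg;
        t->found = -1;
        for (size_t i = t->start; i < t->stop; i += t->step)
                if (array[i] == target) {
                        t->found = (long)i;
                        break;
                }
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        /* (1) Contiguous halves: each thread streams through one half,
           so each core touches only the cache lines it actually scans. */
        struct task lo = { 0, N / 2, 1, -1 };
        struct task hi = { N / 2, N, 1, -1 };
        pthread_create(&a, NULL, search, &lo);
        pthread_create(&b, NULL, search, &hi);
        pthread_join(a, NULL);
        pthread_join(b, NULL);

        /* (2) Interleaved (even/odd indices): both threads touch every
           cache line, so each core ends up reading the whole array. */
        struct task even = { 0, N, 2, -1 };
        struct task odd  = { 1, N, 2, -1 };
        pthread_create(&a, NULL, search, &even);
        pthread_create(&b, NULL, search, &odd);
        pthread_join(a, NULL);
        pthread_join(b, NULL);

        printf("halves: %ld %ld, interleaved: %ld %ld\n",
               lo.found, hi.found, even.found, odd.found);
        return 0;
}

Compile with -pthread. Note that both searches are correct; the half-split version simply lets each core pull in half as many cache lines.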

Fastest way to copy a large file locally

I was asked this in an interview.
I said let's just use cp. Then I was asked to mimic the implementation of cp itself.
So I thought, okay, let's open the file, read it byte by byte and write each byte to another file.
Then I was asked to optimize it further. I thought, let's read in chunks and write those chunks. I didn't have a good answer for what a good chunk size would be. Please help me out with that.
Then I was asked to optimize even further. I thought maybe we could read from different threads in parallel and write in parallel.
But I quickly realized that reading in parallel is OK, but writing will not work in parallel (without locking, I mean), since data from one thread might overwrite another's.
So I thought, okay, let's read in parallel, put the chunks in a queue, and then have a single thread take them off the queue and write them to the file one by one.
Does that even improve performance? (I mean not for small files - it would be more overhead - but for large files.)
Also, is there an OS trick where I could just point two files at the same data on disk? I know there are symlinks, but apart from that?
"The fastest way to copy a file" is going to depend on the system - all the way from the storage media to the CPUs. The most likely bottleneck will be the storage media - but it doesn't have to be. Imagine high-end storage that can move data faster than your system can create physical page mappings to read the data into...
In general, the fastest way to move a lot of data is to make as few copies of it as possible, and to avoid any extra operations, especially S-L-O-W ones such as physical disk head seeks.
So for a local copy on a common single-rotating-disk workstation/desktop/laptop system, the biggest thing to do is minimize physical disk seeks. That means read and write single-threaded, in large chunks (1 MB, for example) so the system can do whatever optimization it can, such as read-ahead or write coalescing.
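Here is a minimal sketch of that single-threaded, large-chunk approach, using POSIX read()/write() with a 1 MB buffer (error handling abbreviated; the function name is illustrative):

#include <fcntl.h>
#include <unistd.h>

#define CHUNK (1 << 20)                 /* 1 MB chunks, as suggested above */

int copy_file(const char *src, const char *dst)
{
        int in = open(src, O_RDONLY);
        int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (in < 0 || out < 0)
                return -1;

        static char buf[CHUNK];         /* static: keeps it off the stack */
        ssize_t n;
        while ((n = read(in, buf, sizeof buf)) > 0) {
                /* A real implementation would also handle short writes. */
                if (write(out, buf, (size_t)n) != n) {
                        n = -1;
                        break;
                }
        }
        close(in);
        close(out);
        return n < 0 ? -1 : 0;
}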
That will likely get you to 95% or better of the system's maximum copy performance. Even standard C buffered fopen()/fread()/fwrite() probably gets at least 80-90% of the best possible performance.
You can get the last few percentage points in a few ways. First, by matching your IO block size to a multiple of the file system's block size so that you're always reading full blocks from the filesystem. Second, you can use direct IO to bypass copying your data through the page cache. It will be faster to go disk->userspace or userspace->disk than it is to go disk->page cache->userspace and userspace->page cache->disk, but for single-spinning-disk copy that's not going to matter much, if it's even measurable.
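If you want to experiment with direct IO, here is a hedged sketch of what that looks like on Linux. O_DIRECT requires the buffer, offset, and transfer length to be aligned; 4096 is a commonly safe alignment, but the true requirement is device- and filesystem-specific:

#define _GNU_SOURCE                     /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int copy_direct(const char *src, const char *dst)
{
        int in = open(src, O_RDONLY | O_DIRECT);
        int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
        if (in < 0 || out < 0)
                return -1;

        void *buf;
        if (posix_memalign(&buf, 4096, 1 << 20))   /* aligned 1 MB buffer */
                return -1;

        ssize_t n;
        while ((n = read(in, buf, 1 << 20)) > 0) {
                /* Caveat: with O_DIRECT, the final chunk of a file whose
                   size is not a multiple of the block size needs special
                   handling (the write length must stay aligned). */
                if (write(out, buf, (size_t)n) != n) {
                        n = -1;
                        break;
                }
        }
        free(buf);
        close(in);
        close(out);
        return n < 0 ? -1 : 0;
}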
You can use various dd options to test copying a file like this. Try using iflag=direct/oflag=direct, or conv=notrunc.
You can also try using sendfile() to avoid copying data into userspace entirely. Depending on the implementation, that might be faster than using direct IO.
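A sketch of the sendfile() approach (on Linux 2.6.33 and later the destination may be a regular file; error handling abbreviated):

#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

int copy_sendfile(const char *src, const char *dst)
{
        int in = open(src, O_RDONLY);
        int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        struct stat st;
        if (in < 0 || out < 0 || fstat(in, &st) < 0)
                return -1;

        /* The kernel moves the data; it never enters userspace. */
        off_t offset = 0;
        while (offset < st.st_size) {
                ssize_t n = sendfile(out, in, &offset,
                                     (size_t)(st.st_size - offset));
                if (n <= 0)
                        return -1;
        }
        close(in);
        close(out);
        return 0;
}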
Pre-allocating the destination file may or may not improve copy performance - that will depend on the filesystem. If the filesystem doesn't support sparse files, though, preallocating the file to a specific length might very well be very, very slow.
There just isn't all that much you can do to dramatically improve performance of a copy from and to the same single spinning physical disk - those disk heads will dance, and that will take time.
SSDs are much easier - to get maximal IO rates, just use parallel IO via multiple threads. But again, the "normal" IO will probably be at 80-90% of maximal.
Things get a lot more interesting and complex optimizing IO performance for other types of storage systems such as large RAID arrays and/or complex filesystems that can stripe single files across multiple underlying storage devices. Maximizing IO on such systems involves matching the software's IO patterns to the characteristics of the storage, and that can be quite complex.
Finally, one important part of maximizing IO rates is not doing things that dramatically slow things down. It's really easy to drag a physical disk down to a few KB/sec IO rates - read/write small chunks from/to random locations all over the disk. If your write process drops 16-byte chunks to random locations, the disk will spend almost all its time seeking and it won't move much data at all while doing that.
In fact, not "killing yourself" with bad IO patterns is a lot more important than spending a lot of effort attempting to get four or five percentage points faster in optimal cases.
Because if IO is a bottleneck on a simple system, just go buy a faster disk.
But I quickly realized that reading in parallel is OK, but writing will not work in parallel (without locking, I mean), since data from one thread might overwrite another's.
Multithreading is not normally going to speed up a process like this. Any performance benefit you may gain could be wiped out by the synchronization overhead.
So I thought, okay, let's read in parallel, put the chunks in a queue, and then have a single thread take them off the queue and write them to the file one by one.
That's only going to give an advantage on a system that supports asynchronous I/O.
To get the maximum speed you'd want to write in buffer sizes that are multiples of the cluster size of the disk's file system. This can be sped up on systems that support queued asynchronous I/O (as, say, Windows does).
You'd also want to create the output file with its initial size equal to the input file's size. That way your write operations never have to extend the file.
Probably the fastest file copy possible would be to memory-map the input and output files and do a memory copy. This is especially efficient on systems that treat mapped files as page files.
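A minimal sketch of that mmap approach, assuming POSIX mmap() and a non-empty regular input file (error handling abbreviated):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int copy_mmap(const char *src, const char *dst)
{
        int in = open(src, O_RDONLY);
        int out = open(dst, O_RDWR | O_CREAT | O_TRUNC, 0644);
        struct stat st;
        if (in < 0 || out < 0 || fstat(in, &st) < 0)
                return -1;
        /* The destination must be extended to the source length first,
           so the writable mapping has real backing. */
        if (ftruncate(out, st.st_size) < 0)
                return -1;

        void *s = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, in, 0);
        void *d = mmap(NULL, st.st_size, PROT_WRITE, MAP_SHARED, out, 0);
        if (s == MAP_FAILED || d == MAP_FAILED)
                return -1;

        memcpy(d, s, st.st_size);       /* page faults drive the actual I/O */

        munmap(s, st.st_size);
        munmap(d, st.st_size);
        close(in);
        close(out);
        return 0;
}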

Linux slab allocator and cache performance

From the guide Understanding the Linux Kernel, 3rd edition, chapter 8.2.10, "Slab coloring":
We know from Chapter 2 that the same hardware cache line maps many different blocks of RAM. In this chapter, we have also seen that objects of the same size end up being stored at the same offset within a cache. Objects that have the same offset within different slabs will, with a relatively high probability, end up mapped in the same cache line. The cache hardware might therefore waste memory cycles transferring two objects from the same cache line back and forth to different RAM locations, while other cache lines go underutilized. The slab allocator tries to reduce this unpleasant cache behavior by a policy called slab coloring: different arbitrary values called colors are assigned to the slabs.
(1) I am unable to understand the issue that slab coloring tries to solve. When a normal process accesses data and a cache miss occurs, the data is fetched into the cache along with data from the surrounding addresses to boost performance. How can a situation arise in which the same specific cache lines keep getting swapped? The probability that a process keeps accessing two addresses at the same offset within two different memory areas is very low. And even if it does happen, cache replacement policies usually choose lines to evict according to some scheme such as LRU or random; no policy chooses lines to evict based on a match in the least significant bits of the addresses being accessed.
(2) I am unable to understand how slab coloring, which moves free bytes from the end of the slab to the beginning so that different slabs place their first objects at different offsets, solves this cache-swapping issue.
[SOLVED] After a small investigation I believe I found an answer to my question. The answer has been posted.
After more study and thought, I have arrived at an explanation that seems more reasonable, and that doesn't rely on specific address examples.
First, you need some basic background: caches, tags, sets, and line allocation.
From the Linux kernel code it is clear that colour_off's unit is cache_line_size: colour_off is the basic offset unit, and colour is the number of colour_off units, both stored in struct kmem_cache.
int __kmem_cache_create(struct kmem_cache *cachep, unsigned long flags)
{
        ...
        cachep->align = ralign;
        cachep->colour_off = cache_line_size(); /* colour_off's unit is cache_line_size */
        /* Offset must be a multiple of the alignment. */
        if (cachep->colour_off < cachep->align)
                cachep->colour_off = cachep->align;
        ...
        err = setup_cpu_cache(cachep, gfp);
        ...
}
https://elixir.bootlin.com/linux/v4.6/source/mm/slab.c#L2056
So we can analyse it in two cases.
The first case is cache > slab.
Here slab 1, slab 2, slab 3, ... mostly cannot collide, because the cache is big enough - except for slabs far enough apart to wrap around the cache (slab 1 vs slab 5, say), which can collide. So the benefit of the colouring mechanism is less clear in this case; the slab 1 vs slab 5 collisions are handled by the same reasoning as the second case below.
The second case is slab > cache.
Here every slab wraps around the cache, so adjacent slabs would collide line for line if they all started at the same offset. With colouring, slab 1 and slab 2 cannot collide on the lines covered by their colour offsets, and likewise slab 2 and slab 3.
So the colouring mechanism keeps two lines collision-free between two adjacent slabs, and more lines between slabs further apart: slab 1 vs slab 3 get 2 + 2 = 4 collision-free lines; you can count it.
To summarize, the colouring mechanism improves cache performance by using otherwise-wasted slack memory as far as it can: in detail, it only removes collisions on the colour_off lines at the beginning and end of each slab, while the other lines can still collide.
I think I got it; the answer is related to associativity.
A cache is divided into sets, and each set can only cache certain memory blocks. For example, set 0 holds blocks whose address modulo the number of sets equals 0, set 1 holds those where it equals 1, and so on. The reason for this is to boost cache performance by avoiding a search through the whole cache for every address; only one set needs to be searched.
Now, from the link Understanding CPU Caching and performance
From page 377 of Hennessy and Patterson, the cache placement formula is as follows:
(Block address) MOD (Number of sets in cache)
Let's take memory block address 0x10000008 (from slab X with colour C) and memory block address 0x20000009 (from slab Y with colour Z). For most values of N (the number of sets in the cache), the calculation <address> MOD <N> will yield different values, hence different sets to cache the data in. If the addresses had the same least-significant bits (for example 0x10000008 and 0x20000008), then for most values of N the calculation would yield the same value, and the blocks would collide in the same cache set.
So, by keeping a different offset (colour) for the objects in different slabs, the slab objects will potentially reach different sets in the cache and will not collide in the same set, and overall cache performance is increased.
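Here is a small worked example of that placement formula, assuming an illustrative 64-byte line and 2048 sets, with the colour shifting slab Y's object by one full cache line (matching colour_off = cache_line_size() in the excerpt above; all addresses are hypothetical):

#include <stdio.h>

int main(void)
{
        unsigned long line = 64, sets = 2048;

        unsigned long a = 0x10000008;   /* object in slab X            */
        unsigned long b = 0x20000008;   /* same offset in slab Y       */
        unsigned long c = 0x20000048;   /* slab Y coloured by one line */

        /* set index = (byte address / line size) mod number of sets */
        printf("set(a) = %lu\n", (a / line) % sets);   /* collides with b */
        printf("set(b) = %lu\n", (b / line) % sets);
        printf("set(c) = %lu\n", (c / line) % sets);   /* different set   */
        return 0;
}

Running it shows a and b landing in set 0 while the coloured address c lands in set 1.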
EDIT: Furthermore, if the cache is direct-mapped, then according to the Wikipedia article CPU Cache, no cache replacement policy exists and the modulo calculation yields the cache block in which the memory block will be stored:
Direct-mapped cache
In this cache organization, each location in main memory can go in only one entry in the cache. Therefore, a direct-mapped cache can also be called a "one-way set associative" cache. It does not have a replacement policy as such, since there is no choice of which cache entry's contents to evict. This means that if two locations map to the same entry, they may continually knock each other out. Although simpler, a direct-mapped cache needs to be much larger than an associative one to give comparable performance, and it is more unpredictable. Let x be the block number in the cache, y be the block number in memory, and n be the number of blocks in the cache; then mapping is done with the help of the equation x = y mod n.
Say you have a 256 KB direct-mapped cache and it uses a super-simple algorithm where it does cache line = (real address AND 0x3FFFF).
Now if you have slabs starting on each megabyte boundary then item 20 in Slab 1 will kick Item 20 of Slab 2 out of cache because they use the same cache line tag.
By offsetting the slabs it becomes less likely that different slabs will share the same cache line tag. If Slab 1 and Slab 2 both hold 32 byte objects and Slab 2 is offset 8 bytes, its cache tags will never be exactly equal to Slab 1's.
I'm sure I have some details wrong, but take it for what it's worth.

Linux Buddy page frames allocation and freeing

While reading about the Linux kernel using the guide
http://www.johnchukwuma.com/training/UnderstandingTheLinuxKernel3rdEdition.pdf
there is something I'm trying to understand in the buddy system for page allocation and freeing.
The technique adopted by Linux to solve the external fragmentation problem is based on the well-known buddy system algorithm. All free page frames are grouped into 11 lists of blocks that contain groups of 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 contiguous page frames, respectively. [chapter 8.1.7]
This makes perfect sense, as Linux can now serve allocation requests quickly, with chunks of different sizes ready for requests of different sizes.
Now, say the system starts up and all available pages are free and grouped into those 11 lists as described above. Consider a scenario in which a process allocates one block of order 1 and then frees it. According to the freeing algorithm -
while (order < 10) {
        /* The buddy of a block is found by flipping bit `order`
           of its page index. */
        buddy_idx = page_idx ^ (1 << order);
        buddy = base + buddy_idx;
        /* Stop merging if the buddy is not a free block of the
           same order. */
        if (!page_is_buddy(buddy, order))
                break;
        /* Remove the buddy from its free list and merge. */
        list_del(&buddy->lru);
        zone->free_area[order].nr_free--;
        ClearPagePrivate(buddy);
        buddy->private = 0;
        /* Index of the merged, higher-order block. */
        page_idx &= buddy_idx;
        order++;
}
So according to this and my scenario, the order-1 chunk (the first ever allocated) will be merged with another order-1 chunk into an order-2 chunk, even though the two order-1 chunks were never split from an order-2 chunk during allocation.
This way, if I keep allocating and then freeing a single chunk, the system will quite quickly reach a state in which all memory chunks are of the biggest order, which seems inefficient. I would have expected two buddies to be merged only if they had previously been split from a higher-order chunk; that way the initial default state would be preserved as much as possible and the whole system would stay efficient.
Am I missing something? Is it possible that this code is wrong? Am I unaware of another advantage this code provides?
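For concreteness, here is a standalone sketch of the buddy-index arithmetic the excerpt above performs, starting from a hypothetical free of the block at page index 0 with order 1 (in the real kernel the loop stops as soon as page_is_buddy() fails):

#include <stdio.h>

int main(void)
{
        unsigned long page_idx = 0;     /* index of the block being freed */
        unsigned int order = 1;         /* block size = 2^order pages     */

        while (order < 10) {
                /* Flipping bit `order` of the index locates the buddy. */
                unsigned long buddy_idx = page_idx ^ (1UL << order);
                printf("order %u: block at %lu, buddy at %lu\n",
                       order, page_idx, buddy_idx);
                /* In the kernel, merging continues only if page_is_buddy()
                   confirms the buddy is a free block of the same order. */
                page_idx &= buddy_idx;  /* index of the merged block */
                order++;
        }
        return 0;
}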
The assumption of such an initial state when plenty of smallest order blocks are available might be slightly dubious. If I recall correctly, allocation of a memory block should commence with looking at the smallest (the same size) group and then looking at greater-order groups to find a free block. If the block found is larger than needed, it will be split, and the corresponding groups will be updated. It's not so obvious, but this whole process might commence with the initial state when only highest-order blocks are available.
A handful of examples which I can meet in the literature draw almost the same picture of the initial state. An eloquent example may be found on Wikipedia: https://en.wikipedia.org/wiki/Buddy_memory_allocation#In_practice . The diagram and the description shed some light on a typical situation.
All in all, I can't find any proof for the assumption of the "initial default state". The idea of decrease in efficiency when splitting a smaller chunk from a larger one is the most murky and probably deserves a separate talk.
EDIT:
The initial state from the kernel's point of view might not be the same as the initial state seen by your hypothetical process. Say the system starts, and at a certain point there is one large chunk of memory. However, your hypothetical process will certainly not be alone. The memory distribution will likely change a lot before the process is able to begin allocating and freeing chunks of memory. The kernel, or, to be precise, most of its subsystems, will request plenty of memory chunks of different sizes during initialisation, and a significant share of those chunks will be owned by the kernel for a considerable time (perhaps throughout the whole uptime). So the point is that by the time your process starts, the buddy system will likely be "warmed up" and, indeed, enough small chunks will be available. However, the buddies of the pages acquired by your process will still be owned by various subsystems, and, as soon as your process frees its pages, those buddies will not be approved as ready for merging, i.e. page_is_buddy() from the excerpt will yield false. Of course, this whole scenario assumes your process indeed manages to take existing free chunks without splitting higher-order blocks.
So the point is that your assumed initial buddy distribution might not be the true initial state. It could only be a "warmed-up" state in which you may indeed have small chunks, but their buddies will be busy and will thus prevent the uncontrollable merging you describe.
P.S. Here is the description of what page_is_buddy() is for.

How to prevent two processess from fighting for a common cache?

I was asked this question on an exam. We have two CPUs, or two cores in the same CPU, that share a common cache (for example, L3). On each CPU there is an MPI process (or a thread of one common process). How can we ensure that these two processes don't interfere, meaning that they don't push each other's entries out - for example, by each using half of the cache, or something similar? The goal is to improve the speed of memory access here.
The OS is some sort of Unix, if that is important.
Based on your comments, it seems that a "textbook answer" is expected, so I would suggest partitioning the cache between the processes. This way you guarantee that they don't compete over the same cache sets and thrash each other. This is assuming you don't want to actually share anything between the 2 processes, in which case this approach would fail (although a possible fix would be to split the cache space in 3 - one range for each process, and one for shared data).
Since you're probably not expected to redesign the cache and provide a HW partitioning scheme (unless the question comes in the scope of a computer architecture course), the simplest way to achieve this is by inspecting the cache size and associativity, figuring out the number of sets, and aligning the data sets of each process/thread to a different part.
For example, if your shared cache is 2MB big, and has 16 ways and 64B lines, you have 2k sets. In that case, each process would want to align its physical addresses (assuming the cache is physically mapped) to a different half of the sets (1k sets each), i.e. a different 0x10000 out of each 0x20000. In other words, P0 would be free to use any physical address with bit 16 equal to 0, and P1 would use the addresses with bit 16 equal to 1.
Note that since this exceeds the size of a basic 4k page (alignment of 0x1000), you would either need to hack your OS to assign pages to the appropriate physical addresses for each process, or simply use larger pages (2M would be enough).
Also note that by keeping a contiguous 0x10000 per allocation, we still enjoy spatial locality and efficient hardware prefetching (otherwise you could pick any other split, even odd/even sets by using bit 6, but that would leave your data fragmented).
The last issue is data sets larger than this 0x10000 quota - to make them align you'd simply have to break them into chunks of up to 0x10000 and align each separately. There's also the issue of code/stack/page-map and other types of OS/system data over which you have less control (actually code can also be aligned, or, more likely in this case, shared) - I'm assuming this has negligible impact on thrashing.
Again - this attempts to answer without knowing what system you work with, what you need to achieve, or even what is the context of the course. With more context we can probably focus this to a simpler solution.
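A small sketch of the arithmetic in this answer (2MB cache, 16 ways, 64B lines; all numbers are the answer's illustrative values):

#include <stdio.h>

int main(void)
{
        unsigned long cache = 2UL << 20, ways = 16, line = 64;
        unsigned long sets = cache / (ways * line);     /* 2048 sets */
        unsigned long span = sets * line;               /* 0x20000   */

        printf("sets = %lu, set-index span = 0x%lx\n", sets, span);

        /* Physical bit 16 is the top set-index bit, so it selects the
           half: P0 keeps it 0, P1 keeps it 1. */
        unsigned long addr = 0x12345678;                /* hypothetical */
        printf("addr 0x%lx goes to P%lu's half\n", addr, (addr >> 16) & 1);
        return 0;
}

Only bit 16 matters for the split because the line offset occupies bits 0-5 and the set index occupies bits 6-16 here.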
How large is a way in the cache?
For example, if you have a cache where each way is 128KiB in size, you partition your memory in such a way that, for each address modulo 128KiB, process A uses the 0-64KiB region and process B uses the 64KiB-128KiB region. (This assumes a private L1 per core.)
If your physical page size is 4KiB (and your CPU uses physical addresses for caching, not virtual - which does occur on some CPUs), you can make this much nicer. Let's say you're mapping the same amount of memory into virtual address space for each core - 16KiB. Pages 0, 2, 4, 6 go to process A's memory map, and pages 1, 3, 5, 7 go to process B's memory map. As long as you only address memory in that carefully laid out region, the caches should never fight. Of course, you've effectively halved the size of your cache-ways by doing so, but you have multiple ways...
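A tiny sketch of that page-interleaving mapping, assuming 4KiB pages (even pages to process A, odd pages to process B, as in the answer's illustrative scheme):

#include <stdio.h>

int main(void)
{
        unsigned long page_size = 4096;

        /* Pages 0, 2, 4, 6 -> process A; pages 1, 3, 5, 7 -> process B. */
        for (unsigned long page = 0; page < 8; page++)
                printf("page %lu (addr 0x%lx) -> process %c\n",
                       page, page * page_size, (page & 1) ? 'B' : 'A');
        return 0;
}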
You'll want to use a lock for the multithreaded case. It's hard to provide an example without knowing your specific situation.
When one process has access, lock all other processes out until the accessing process is finished with the resource.

Writing to adjacent array elements from different threads?

Are there any modern, common CPUs where it is unsafe to write to adjacent elements of an array concurrently from different threads? I'm especially interested in x86. You may assume that the compiler doesn't do anything obviously ridiculous to increase memory granularity, even if it's technically within the standard.
I'm interested in the case of writing arbitrarily large structs, not just native types.
Note:
Please don't mention the performance issues with regard to false sharing. I'm well aware of these, but they're of no practical importance for my use cases. I'm also aware of visibility issues with regard to data written from threads other than the reader. This is addressed in my code.
Clarification: This issue came up because on some processors (for example, old DEC Alphas) memory could only be addressed at word level. Therefore, writing to memory in non-word size increments (for example, single bytes) actually involved read-modify-write of the byte to be written plus some adjacent bytes under the hood. To visualize this, think about what's involved in writing to a single bit. You read the byte or word in, perform a bitwise operation on the whole thing, then write the whole thing back. Therefore, you can't safely write to adjacent bits concurrently from different threads.
It's also theoretically possible, though utterly silly, for a compiler to implement memory writes this way when the hardware doesn't require it. x86 can address single bytes, so it's mostly not an issue, but I'm trying to figure out if there's any weird corner case where it is. More generally, I want to know if writing to adjacent elements of an array from different threads is still a practical issue or mostly just a theoretical one that only applies to obscure/ancient hardware and/or really strange compilers.
Yet another edit: Here's a good reference that describes the issue I'm talking about:
http://my.safaribooksonline.com/book/programming/java/0321246780/threads-and-locks/ch17lev1sec6
Writing a naturally aligned, native-sized value (i.e. 1, 2, 4, or 8 bytes) is atomic (well, 8 bytes is only atomic on 64-bit machines). So, no. Writing a native type will always write as expected.
If you're writing multiple native types (i.e. looping to write an array) then it's possible to have an error if there's a bug in the operating system kernel or an interrupt handler that doesn't preserve the required registers.
Yes, definitely: writing a misaligned word that straddles a CPU cache-line boundary is not atomic.
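As a concrete illustration of the pattern under discussion, here is a minimal POSIX-threads sketch: two threads storing into adjacent char elements. Under the C11 memory model distinct array elements are distinct memory locations, and on x86 a byte store does not read-modify-write its neighbours, so this is race-free despite the two bytes sharing one cache line:

#include <pthread.h>
#include <stdio.h>

static char data[2];            /* two adjacent one-byte elements */

static void *writer(void *arg)
{
        size_t i = (size_t)arg;
        data[i] = 'A' + (char)i; /* plain byte store; no RMW on x86 */
        return NULL;
}

int main(void)
{
        pthread_t t0, t1;
        pthread_create(&t0, NULL, writer, (void *)0);
        pthread_create(&t1, NULL, writer, (void *)1);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        printf("%c %c\n", data[0], data[1]);    /* always "A B" */
        return 0;
}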
