How to optimize an algorithm for a given multi-core architecture - multithreading

I would like to know which techniques I should look into for optimizing a given algorithm for a given architecture. How do I improve performance through better use of the caches? How do I reduce cache-coherency traffic, and which access patterns should I avoid in my algorithm/program so that cache coherency doesn't hurt my performance?
I understand a few standard techniques for reusing recently cached data in L1, but how would I effectively use data in a shared cache (say L2) on a multi-core so that I avoid a main-memory access, which is even more costly?
Basically, I am interested in which data access patterns I should try to exploit or avoid for a better mapping to my given architecture, and in which data structures I could use, in which scenarios, for which architectures (with different levels of private and shared cache) to improve performance. Thanks.

Which techniques should I look into for optimizing a given algorithm for a given architecture?
Micro-architectures vary, so learn the details of your specific processor. Intel provides good documentation in their optimization guide. If you are using an Intel processor you'll want to read sections 8.3 and 8.6:
8.3 OPTIMIZATION GUIDELINES
This section summarizes optimization guidelines for tuning multithreaded applications. Five areas are listed (in order of importance):
Thread synchronization
Bus utilization
Memory optimization
Front end optimization
Execution resource optimization
Practices associated with each area are listed in this section. Guidelines for each area are discussed in greater depth in sections that follow. Most of the coding recommendations improve performance scaling with processor cores; and scaling due to HT Technology. Techniques that apply to only one environment are noted.
8.6 MEMORY OPTIMIZATION
Efficient operation of caches is a critical aspect of memory optimization. Efficient operation of caches needs to address the following:
Cache blocking
Shared memory optimization
Eliminating 64-KByte aliased data accesses
Preventing excessive evictions in first-level cache
Which data access patterns should I try to exploit or avoid for a better mapping to my given architecture?
Exploit
When caches are full and an access misses in the cache, the cache must evict something to make room for the new data/code; what is evicted is usually chosen by an approximation of least-recently-used (LRU). If possible, your code should have strong locality of reference:
Try to pack data that is used close in time in the algorithm such that it is close in space (address)
Pack data tightly, don't use a 64-bit integer when a 32-bit integer will do, for example
Sometimes the alignment of an "object" (related data) relative to a cache line matters. For example, if there is an array of objects, each of 64 bytes, and they are accessed randomly, then aligning at a 64-byte boundary improves cache efficiency by not bringing in data that is not used. If an object isn't aligned, then every object touched brings in two cache lines, but only 64 bytes are needed, so 50% of the data transferred isn't used (this assumes cache lines are 64 bytes; see the sketch after this list).
As @PaulA.Clayton pointed out in the comments, pre-fetching data is very important, as it hides part or all of the memory latency. "Also, exploiting stride-based hardware prefetching can be quite beneficial. (Software prefetching can also be useful in some cases.) Getting pointers early helps increase memory-level parallelism."
In order to facilitate the hardware pre-fetcher and to increase the utilization of the data that is brought into the cache, pay careful attention to how matrices and other large structures are stored and accessed... see the Wikipedia article on row-major order.
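As a rough illustration of the alignment and row-major points above, here is a minimal C sketch; the 64-byte cache-line size, the struct layout, and the GCC/Clang aligned attribute are assumptions for illustration, not a recipe:

    #include <stddef.h>

    /* Hot fields packed together; the whole object is sized and aligned to an
     * assumed 64-byte cache line (GCC/Clang extension), so a random access
     * touches exactly one line instead of straddling two. */
    struct particle {
        float x, y, z;      /* data used together sits together        */
        float vx, vy, vz;
        int   id;           /* 32-bit is enough here; don't use 64-bit */
    } __attribute__((aligned(64)));

    /* Row-major traversal of a rows x cols matrix stored in one flat array:
     * the inner loop walks consecutive addresses, so the hardware prefetcher
     * can stream the data and every fetched cache line is fully used. */
    double sum_row_major(const double *m, size_t rows, size_t cols)
    {
        double sum = 0.0;
        for (size_t r = 0; r < rows; ++r)        /* outer loop over rows     */
            for (size_t c = 0; c < cols; ++c)    /* inner loop is contiguous */
                sum += m[r * cols + c];
        return sum;
    }

Swapping the two loops (column-major traversal of the same row-major array) strides through memory and typically misses in the cache on almost every access.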
Avoid
Data that you don't use often shouldn't be close to data that you use frequently
Avoid false sharing. If two or more threads access the same cache line but are not sharing the same data within the cache line, and at least one of them is a writer, you have false sharing... there will be an unnecessary burden and latency hit associated with the cache-coherency protocol (a sketch follows this list).
Try not to use new data until you are done with the older data
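To make the false-sharing point concrete, here is a minimal pthreads sketch; the 64-byte line size and the counter layout are assumptions for illustration. Each thread increments its own counter, and the padding keeps the counters on separate cache lines:

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4
    #define ITERS       10000000L

    /* Each thread gets its own counter on its own (assumed 64-byte) cache
     * line.  Without the padding all four counters would share one line and
     * every increment would bounce that line between cores (false sharing). */
    struct padded_counter {
        long value;
        char pad[64 - sizeof(long)];
    } __attribute__((aligned(64)));

    static struct padded_counter counters[NUM_THREADS];

    static void *worker(void *arg)
    {
        struct padded_counter *c = arg;
        for (long i = 0; i < ITERS; ++i)
            c->value++;                  /* touches only this thread's line */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NUM_THREADS];
        long total = 0;

        for (int i = 0; i < NUM_THREADS; ++i)
            pthread_create(&tid[i], NULL, worker, &counters[i]);
        for (int i = 0; i < NUM_THREADS; ++i) {
            pthread_join(tid[i], NULL);
            total += counters[i].value;
        }
        printf("total = %ld\n", total);
        return 0;
    }

Removing the pad member puts all the counters on one cache line and typically makes the parallel version dramatically slower, even though the threads never logically share data.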
Measure
As Andrei Alexandrescu said in this talk - when it comes to performance tuning the only intuition that is right is "I should measure this." Familiarize yourself with cache performance monitoring tools, for example:
perf
Cachegrind

The key principle is locality: when you have the choice, process nearby data first (avoid sparse accesses), and perform data reuse as soon as possible (regroup successive passes over the same data).
For multithreaded programs, the principle is separate locality: ensure that the threads work on disjoint data sets (use distinct copies if necessary/possible).
Unless you have very good reasons to do so, stay away from the peculiarities of the hardware.

It should be mentioned that code is also cached in the same way as data. Small, dense code with a lot of inlining and few jumps/calls will put less strain on the L1 instruction cache and, ultimately, on L2, L3 and RAM, where collisions with data fetches will occur.
If you are using hyperthreading, there appears to be evidence that two hyperthreads in a core running at a lower optimization level (O1) will overall get more work done than a single, highly optimized (O2 and higher) thread.

Related

Why can't we have a safe ISA?

According to this paper: https://doi.org/10.1109/SP.2013.13, memory corruption bugs are one of the oldest problems in computer security. The lack of memory safety and type safety has caused countless bugs, costing billions of dollars and huge effort to fix.
But the root of C/C++'s memory vulnerability can be traced down to the ISA level. At the ISA level, every instruction can access any memory address without any fine-grained safety check (only coarse-grained checks like page faults). Sure, we can implement memory safety at a higher software level, like Java (JVM), but this comes at a significant performance cost. In a word, we can't have both safety and performance at the same time on existing CPUs.
My question is, why can't we implement the safety at the hardware level? If the CPU had a safe ISA, which ensured memory safety by, I don't know, taking over the responsibilities of malloc and free, then maybe we could get rid of the performance penalty of software safety checking. If anyone professional in microelectronics can tell me: is this idea realistic?
Depending on what you mean, it could make it impossible to implement memory-unsafe languages like C in a normal way. E.g. every memory access would have to be to some object that has a known size? I'd guess an operating system for such a machine might have to work around that "feature" by telling it that the entire address space was one large array object. Or else you'd need some mechanism for a read system call to know the proper bounds of the object it's writing to in the copy_to_user() part of its job. And then there's other OS stuff like accessing the same physical page from different virtual pages.
The OP (via asking on Reddit) found the CHERI project which is an attempt at this idea, involving "... revisit fundamental design choices in hardware and software to dramatically improve system security." Changing hardware alone can't work; compilers need to change, too. But they were able to adapt "Clang/LLVM, FreeBSD, FreeRTOS, and applications such as WebKit," so their approach could be practical. (Unlike the hypothetical versions I was imagining when writing other parts of this answer.)
CHERI uses "fine-grained memory protection", and "Language and compiler extensions" to implement memory-safe C and C++, and higher-level languages.
So it's not a drop-in replacement, and it sounds like you have to actively use the features to gain safety. As I argue in the rest of the answer, hardware can't do it alone, and it's highly non-trivial even with software cooperation. It's easy to come up with ways that wouldn't work. :P
For hardware-enforced memory safety to be possible, hardware would have to know about every object and its size, and be able to cache that structure in a way that allows efficient lookups to find the bounds. Page tables (4k granularity, or larger in more modern ISAs) are already hard enough for hardware to cache efficiently for large programs, and that's without even considering which pointer goes with which object.
Checking a TLB as part of every load and store can be done efficiently, but checking another structure in parallel with that might be problematic, especially when the ranges don't have power-of-2 sizes and natural alignment the way pages do. That power-of-2 structure is what makes it possible to build a TLB from content-addressable memory that checks for a match against each of several possible values for the high bits (e.g. a page is 4k in size, always starting at a 4k alignment boundary).
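As a purely conceptual C sketch (not a description of any real hardware) of why power-of-2, naturally aligned pages are CAM-friendly while arbitrary object bounds are not:

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_SHIFT 12   /* 4 KiB pages: power-of-2 size, naturally aligned */

    /* Page check: drop the offset bits and compare a single tag.  One equality
     * test per entry is exactly what a CAM-based TLB can do in parallel. */
    static bool page_matches(uint64_t vaddr, uint64_t cached_tag)
    {
        return (vaddr >> PAGE_SHIFT) == cached_tag;
    }

    /* Arbitrary object bounds: there is no single tag to compare, so each
     * cached entry would need a base comparison AND a limit comparison on
     * every load/store, which is much harder to do fast and in parallel. */
    static bool object_contains(uint64_t addr, uint64_t base, uint64_t size)
    {
        return addr >= base && addr - base < size;
    }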
You mean it may cost too much at the hardware level, like die area?
Die area might not even be the biggest problem, especially these days. It would cost power, and/or cost latency in very important critical paths such as L1d hit load-use latency. Even if you could come up with some plausible way for software to make tables that hardware could check, or otherwise solve the other parts of this problem.
Modifying a page-table entry requires invalidating the entry, including TLB shootdown for other cores. If every free (and some malloc) cost inter-core communication to do similar things for object tables, that would be very expensive.
I think inventing a way for software to tell the hardware about objects would be an even bigger problem. malloc and free aren't something you can just build into a CPU where memory addressing works anything like it does in existing CPUs, or like it does in C. Software needs to manage memory; it doesn't make sense to try to build that into a CPU. So then malloc and free (and mmap with file-backed mappings and shared memory...) would need a way to tell the CPU about objects. Seems like a mess.
I think at best an ISA could provide more tools software can use to make bounds-checks cheaper. Perhaps some kind of extra semantics on loads/stores, like an extra operand for indexed addressing modes for load or store that takes a max?
At least if we want an ISA to work anything like current ones, rather than work like a JVM or a Transmeta Crusoe and internally recompile for some real ISA.
Intel's MPX ISA extension to x86 was an attempt to let software set up bound ranges, but it has been mostly abandoned due to lower performance than pure-software approaches. Intel even dropped it from their recent CPUs (not present in 10th-gen CPUs using 10 nm lithography, or later).
This is all just off the top of my head; I haven't searched for any serious proposals for how a system could plausibly work.
I don't think memory safety is something you can easily add after the fact to languages like C that weren't originally designed with it.
Have a look at "Code for malloc and free" on SO. Those functions are very, very far away from even being defined within an instruction set.

How to make the OS schedule disk accesses optimally?

Suppose that a process needs to access the file system in many (1000+) places, and the order is not important to the program logic. However, the order obviously matters for performance if the file system is stored on a (spinning) hard disk.
How can the application programmer communicate to the OS that it should schedule the accesses optimally? Launching 1000+ threads does not seem practical. Does database management software accomplish this, and if so, then how?
Additional details: I had a large (1TB+) mmapped file where I needed to read 1000+ chunks of about 1KB, each time in new, unpredictable places.
In the early days, when parameters like Wikipedia: Hard disk drive performance characteristics → Seek time were very expensive and thus very important, database vendors paid attention to the on-disk data representation and layout, as can be seen e.g. in Oracle8i: Designing and Tuning for Performance → Tuning I/O.
The important optimization parameters changed with appearance of Solid-state drives (SSD) where the seek time is 0 (or at least constant) as there is nothing to rotate. Some of the new parameters are addressed by Wikipedia: Solid-state drive (SSD) → optimized file systems.
But even those optimization parameters go away with the use of Wikipedia: In-memory databases. The list of vendors is pretty long, with all the big players on it.
So how to schedule your accesses optimally depends a lot on the use case (1000 concurrent hits is not a sufficient problem description); buying some RAM is one of the options, and "how can the programmer communicate with the OS" will be one of the last (not the first) questions.
Files and their transactions are cached in various devices in your computer; RAM and the HD cache are the most usual places. The file system driver may also implement IO transaction queues, defragmentation, and error-correction logic that makes things complicated for the developer who wants to control every aspect of file access. This level of complexity is ultimately designed to provide integrity, security, performance, and coordination of file access across all processes of your system.
Optimization efforts should not interfere with the system's own caching and prediction algorithms, not just for IO but for all caches. Trying to second-guess your system is a waste of your time and your processors' time.
Most probably your IO operations and data will stay on caches and later be committed to your storage devices when your OS sees fit.
That said, there are always options like database suites, mmap, readahead mechanisms, and direct IO to your drive. You will need to invest time benchmarking any of your efforts. I advise against multiple IO threads because cache contention will make things even slower than one thread.
The kernel will already reorder the read/write requests (e.g. to fit the spin of a mechanical disk), if they come from various processes or threads. BTW, most of the reads & writes would go to the kernel file system cache, not to the disk.
You might consider using posix_fadvise(2) and perhaps (in a separate thread) readahead(2). If, instead of read(2)-ing, you use mmap(2) to map some file portion into virtual memory, you might also use madvise(2).
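For example, a minimal sketch of batching such hints with posix_fadvise(2); the file name and offsets are placeholders and error handling is trimmed:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDONLY);        /* placeholder file name */
        if (fd < 0) { perror("open"); return 1; }

        /* The overall pattern is random: tell the kernel not to waste effort
         * on sequential readahead of data we will never touch. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

        /* For offsets known in advance, ask the kernel to start fetching them
         * asynchronously; it is free to batch and reorder these requests. */
        off_t offsets[] = { 4096, 1 << 20, 7 << 20 };  /* placeholder offsets */
        for (size_t i = 0; i < sizeof offsets / sizeof offsets[0]; ++i)
            posix_fadvise(fd, offsets[i], 1024, POSIX_FADV_WILLNEED);

        /* ... later, pread(2) the same ranges; most should already be in the
         * page cache, so the reads complete without waiting on the disk. */
        close(fd);
        return 0;
    }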
Of course, the file system does not usually guarantee that a sequential portion of a file is physically sequentially located on the disk (and even the disk firmware might reorder sectors). See picture in Ext2 wikipage, also relevant for Ext4. Some file systems might be better in that respect, and you could tune their block size (at mkfs time).
I would not recommend having thousands of threads (only at most a few dozens).
At last, it might be worth buying some SSD or some more RAM (for the file cache). See http://linuxatemyram.com/
Actual performance would depend a lot on the particular system and hardware.
Perhaps using an indexed file library like GDBM or a database library like Sqlite (or a real database like PostGreSQL) might be worthwhile! Perhaps having fewer but bigger files could help.
BTW, you are mmap-ing and reading small chunks of 1K (smaller than the page size of 4K). You could use madvise (if possible, in advance), but you should try to read larger chunks, since every file access will bring in at least a whole page.
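A hedged sketch of that idea for the mmap-ed case (placeholder file name and offsets): issue madvise(MADV_WILLNEED) for all the regions up front so the kernel can schedule the underlying reads as it sees fit, then touch the chunks.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("huge.dat", O_RDONLY);          /* placeholder name */
        if (fd < 0) { perror("open"); return 1; }
        struct stat st;
        fstat(fd, &st);

        char *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        long page = sysconf(_SC_PAGESIZE);
        off_t chunks[] = { 12345, 987654321, 55555555 };  /* placeholder offsets */
        size_t n = sizeof chunks / sizeof chunks[0];

        /* Pass 1: advise the kernel about every region up front, so it can
         * schedule the underlying disk reads in whatever order it likes. */
        for (size_t i = 0; i < n; ++i) {
            off_t start = chunks[i] & ~((off_t)page - 1);  /* page-align down */
            madvise(base + start, 1024 + (size_t)(chunks[i] - start),
                    MADV_WILLNEED);
        }

        /* Pass 2: actually touch the 1 KB chunks; most page faults should now
         * be soft faults served from the page cache. */
        volatile char sink = 0;
        for (size_t i = 0; i < n; ++i)
            sink ^= base[chunks[i]];
        (void)sink;

        munmap(base, st.st_size);
        close(fd);
        return 0;
    }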
You really should benchmark!

Tool to identify app's data/code most susceptible to memory performance

Context:
-- embedded platform running Linux with some static RAM which is declared to be about 3 times faster than the rest of the RAM (which is dynamic). The amount of this fast memory is 512kB and the official name is eSRAM. (Details not important for this post: Galileo board, information on eSRAM and relevant kernel API: https://communities.intel.com/servlet/JiveServlet/previewBody/22488-102-1-26046/Quark_SWDevManLx_330235_001.pdf)
-- eSRAM can be used by an application with some support from the kernel---a simple driver that allocates kernel memory on its behalf, overlays the memory with eSRAM (this is done in physical space) and mmaps it to app's virtual memory space. This was tested and confirmed to work as expected.
Problem:
Identify which sections of app's data (and possibly code) to map into eSRAM to achieve optimum performance gain. A suitable analysis tool is required.
After some search I'm not sure if any existing tool is actually suited to this task. Currently my best bet is to develop a specialized Valgrind tool. But maybe there is already something in the ecosystem to start with. Any advice/information is welcome even if, for instance, a tool is kind of partially suited etc.
P.S.
Full analysis should probably take a lot of factors into account, like:
-- memory access patterns (cache performance)
-- changes over time (one could consider eSRAM paging)
...
I have taken a look at Valgrind Cachegrind. It can collect data about data-cache reads and data-cache writes, and cg_annotate can report line-by-line counts for your program. Could it be useful for finding the variables in your program that cause the most data-cache operations, and in this way identifying the data that would benefit most from moving to the quick memory? http://valgrind.org/docs/manual/cg-manual.html#cg-manual.line-by-line
Probably you are interested in D-cache reads (Dr) and D-cache writes (Dw), or even (Dr+Dw). That way you can find the places in your code that do the most (Dr+Dw) and try to move their data into your quick memory.

Performance gain issue in multicore application

I have a serial (non-parallel) application written in C. I have modified and rewritten it using Intel Threading Building Blocks. When I run this parallel version on an AMD Phenom II, which is a quad-core machine, I get a performance gain of more than 4x, which conflicts with Amdahl's law. Can anyone give me a reason why this is happening?
Thanks,
Rakesh.
If you rewrite the program, you could make it more efficient. Amdahl's law only limits the amount of speedup due to parallelism, not how much faster you can make your code by improving it.
You could be realizing the effects of having 4x the cache, since now you can use all four procs. Or maybe having less contention with other processes running on your machine. Or you accidentally fixed a mispredicted branch.
TL/DR: it happens.
It's known as "super linear speedup", and can occur for a variety of reasons, though the most common root cause is probably cache behaviour. Usually when superlinear speedup occurs, it's a clue that you could make the sequential version more efficient.
For example, suppose you have a processor where some of the cores share an L2 cache (a common architecture these days), and suppose your algorithm makes multiple traversals of a large data structure. If you perform the traversals in sequence, then each traversal will have to pull the data into the L2 cache afresh, whereas if you perform the traversals in parallel then you may well avoid a large number of those misses, as long as the traversals run in step (getting out of step is a good source of unpredictable performance here). To make the sequential version more efficient you could interleave the traversals, thereby improving locality (sketched below).
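To make the "interleave the traversals" idea concrete, here is a hedged C sketch; the two passes and the array are invented for illustration. Fusing the passes lets each element be reused while it is still in cache, which is the same locality the lock-step parallel traversals got by sharing the L2:

    #include <stddef.h>

    /* Two separate traversals: each pass streams the whole array through the
     * cache, so the second pass re-fetches everything from memory. */
    void two_passes(const double *a, size_t n, double *sum, double *sumsq)
    {
        double s = 0.0, sq = 0.0;
        for (size_t i = 0; i < n; ++i) s  += a[i];
        for (size_t i = 0; i < n; ++i) sq += a[i] * a[i];
        *sum = s; *sumsq = sq;
    }

    /* Interleaved (fused) traversal: each element is loaded once and reused
     * immediately, so the data is touched while it is still resident. */
    void fused_pass(const double *a, size_t n, double *sum, double *sumsq)
    {
        double s = 0.0, sq = 0.0;
        for (size_t i = 0; i < n; ++i) {
            double x = a[i];
            s  += x;
            sq += x * x;
        }
        *sum = s; *sumsq = sq;
    }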
Can anyone give me a reason why this is happening?
In a word, caches.
Each core has its own L1 cache and, therefore, simply by using more cores you have increased the amount of cache in-play which has, in turn, brought more of your data closer to where it will be processed. That alone can improve performance significantly (as if you had a bigger cache on a single core). When combined with near linear speedup from effective parallelization, you can see superlinear performance improvements overall.

Do you expect that future CPU generations are not cache coherent?

I'm designing a program and I found that assuming implicit cache coherency makes the design much, much easier. For example, my single-writer (always the same thread) multiple-reader (always other threads) scenarios are not using any mutexes.
It's not a problem for current Intel CPUs. But I want this program to generate income for at least the next ten years (a short time for software), so I wonder if you think this could be a problem for future CPU architectures.
I suspect that future CPU generations will still handle cache coherence for you. Without this, most mainstream programming methodologies would fail. I doubt any CPU architecture that will be used widely in the next ten years will invalidate the current programming model - it may extend it, but it's difficult to drop something so widely assumed.
That being said, programming with the assumption of implicit cache coherency is not always a good idea. There are many issues with false sharing that can easily be avoided if you purposefully try to isolate your data. Handling this properly can lead to huge performance boosts (rather, a lack of huge performance losses) on current generation CPUs. Granted, it's more work in the design, but it is often required.
We are already there. Computers claim cache coherency, but at the same time they have a temporary store buffer for writes, reads can be completed via this buffer instead of the cache (i.e. the store buffer has effectively become an incoherent cache), and invalidate requests are also queued, allowing the processor to temporarily use cache lines it knows are stale.
x86 doesn't use many of these techniques, but it does use some. As long as memory stays significantly slower than the CPU, expect to see more of these techniques, and others yet to be devised, in use. Even Itanium, failed as it is, uses many of these ideas, so expect Intel to migrate them into x86 over time.
As for avoiding locks, etc.: it is always hard to gauge people's level of expertise over the Internet, so either you are misguided about what you think might work, or you are on the cutting edge of lock-free programming. Hard to tell.
Do you understand the MESI protocol, memory barriers and visibility? Have you read stuff from Paul McKenney, etc?
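For reference, a minimal C11 sketch of the single-writer / multiple-reader pattern done with an explicit release/acquire pair instead of relying on implicit coherency alone; the flag and payload names are invented for illustration:

    #include <stdatomic.h>
    #include <stdbool.h>

    /* One writer thread publishes a payload; reader threads consume it.  The
     * release store / acquire load pair guarantees that once a reader sees
     * ready == true it also sees the fully written payload, no matter how the
     * compiler or hardware would otherwise reorder the plain accesses. */
    static int payload;               /* written by the single writer only */
    static atomic_bool ready = false;

    void writer_publish(int value)
    {
        payload = value;                               /* plain write */
        atomic_store_explicit(&ready, true,
                              memory_order_release);   /* publish     */
    }

    bool reader_try_consume(int *out)
    {
        if (atomic_load_explicit(&ready, memory_order_acquire)) {
            *out = payload;                            /* safe to read here */
            return true;
        }
        return false;
    }

On x86 these atomics typically compile to plain loads and stores (the hardware already provides the ordering), so they cost essentially nothing, but the code now states its ordering requirements instead of silently assuming them.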
I don't know per se. But I'd like to see a trend toward non-cache coherent modes.
The conceptual mind shift is significant (can't just pass data in a method call, must pass it through a queue to an async method), but it's required as we move more and more into a multicore world anyway. The closer we get to one processor per memory bank the better. Because then we're working in a world of network message routing, where data is just not available rather than having threads that can silently stomp on data.
However, as Reed Copsey points out, the whole x86 world of computing is built on the assumption of cache coherency (which is even bigger than Microsoft's market share!). So it won't go away any time soon!
Here is a paper from reputed authors in computer architecture area which argues that cache coherence is here to stay.
http://acg.cis.upenn.edu/papers/cacm12_why_coherence.pdf
"Why On-Chip Cache Coherence Is Here to Stay" -By Martin, Hill and Sorin
You are making a strange request. You are asking for our (the SO community's) assumptions about future CPU architectures - a very dangerous proposition. Are you willing to put your money where our mouth is? Because if we're wrong and your application fails, it will be you who's not making any money.
Anyway, I would suspect things are not going to change that dramatically because of all the legacy code that was written for single threaded execution but that's just my opinion.
The question seems misleading to me. The CPU architecture is not that important, what is important is the memory model of the platform you are working for.
You are developing the application in some environment, with some defined memory model. E.g. if you are currently targeting x86, you can be pretty sure any future platform will implement the same memory model when it is running x86 code. The same is true for Java or .NET VMs and other execution platforms.
If you expect to port your current application to some other platforms and the platform memory model is different, you will have to adjust for it, but in that case you are the one doing the port and you have complete control over how you do it. This is, however, true even for current platforms; e.g. the PowerPC memory model allows many more reorderings than the x86 one.

Resources