How to speed up compilation time in linux - linux

While compiling under linux I use flag -j16 as i have 16 cores. I am just wondering if it makes any sense to use sth like -j32. Actually this is a quesiton about scheduling of processor time and if it is possible to put more pressure on particular process than any other this way (let say i have like to pararell compilations each with -j16 and what if one would be -j32?).
I think it does not make much sense but I am not sure as do not know how kernel solves such things.
Kind regards,

I use a non-recursive build system based on GNU make and I was wondering how well it scales.
I ran benchmarks on a 6-core Intel CPU with hyper-threading. I measured compile times using -j1 to -j20. For each -j option make ran three times and the shortest time was recorded. Using -j9 results in shortest compile time, 11% better than -j6.
In other words, hyper-threading does help a little, and an optimal formula for Intel processors with hyper-threading is number_of_cores * 1.5:
Chart data is here.

The rule of thumb is to use the number of processors+1. Hyper-Thready counts, so a quad core CPU with HT should have -j9
Setting the value too high is counter-productive, if you do want to speed up compile times consider ccache to cache compiled objects that do not change in each compilation, and distcc to distribute the compilation across several machines.

We have a machine in our shop with the following characteristics:
256 core sparc solaris
~64gb RAM
Some of that memory used for a ram drive for /tmp
Back when it was originally setup, before other users discovered its existence, I ran some timing tests to see how far I could push it. The build in question is non-recursive, so all jobs are kicked off from a single make process. I also cloned my repo into /tmp to take advantage of the ram drive.
I saw improvements up to -j56. Beyond that my results flat lined much like Maxim's graph, until somewhere above (roughly) -j75 where performance began to degrade. Running multiple parallel builds I could push it beyond the apparent cap of -j56.
The primary make process is single-threaded; after running some tests I realized the ceiling I was hitting had to do with how many child processes the primary thread could service -- which was further hampered by anything in the makefiles that either required extra time to parse (eg., using = instead of := to avoid unnecessary delayed evaluation, complex user defined macros, etc) or used things like $(shell).
These are the things I've been able to do to speed up builds that have a noticeable impact:
Use := wherever possible
If you assign to a variable once with :=, then later with +=, it'll continue to use immediate evaluation. However, ?= and +=, when a variable hasn't been assigned previously, will always delay evaluation.
Delayed evaluation doesn't seem like a big deal until you have a large enough build. If a variable (like CFLAGS) doesn't change after all the makefiles have been parsed, then you probably don't want to use delayed evaluation on it (and if you do, you probably already know enough about what I'm talking about anyway to ignore my advice).
If you create macros you execute with the $(call) facility, try to do as much of the evaluation ahead of time as possible
I once got it in my head to create macros of the form:
IFLINUX = $(strip $(if $(filter Linux,$(shell uname)),$(1),$(2)))
IFCLANG = $(strip $(if $(filter-out undefined,$(origin CLANG_BUILD)),$(1),$(2)))
...
# an example of how I might have made the worst use of it
CXXFLAGS = ${whatever flags} $(call IFCLANG,-fsanitize=undefined)
This build produces over 10,000 object files, about 8,000 of which are from C++ code. Had I used CXXFLAGS := (...), it would only need to immediately replace ${CXXFLAGS} in all of the compile steps with the already evaluated text. Instead it must re-evaluate the text of that variable once for each compile step.
An alternative implementation that can at least help mitigate some of the re-evaluation if you have no choice:
ifneq 'undefined' '$(origin CLANG_BUILD)'
IFCLANG = $(strip $(1))
else
IFCLANG = $(strip $(2))
endif
... though that only helps avoid the repeated $(origin) and $(if) calls; you'd still have to follow the advice about using := wherever possible.
Where possible, avoid using custom macros inside recipes
The reasoning should be pretty obvious here after the above; anything that requires a variable or macro to be repeatedly evaluated for every compile/link step will degrade your build speed. Every macro/variable evaluation occurs in the same thread as what kicks off new jobs, so any time spent parsing is time make delays kicking off another parallel job.
I put some recipes in custom macros whenever it promotes code re-use and/or improves readability, but I try to keep it to a minimum.

Related

Compare-And-Swap not working on many Cores

When I discovered the "CAS" instruction, I remember that I well understood that it could work for threads running on one single CPU but I was surprised that it could for many CPUs
Yesterday, I had my first opportunity to test it on one of my developments. I implemented it and it really worked fine; all my unit-tests was green. Perfect.
But today, I ran my unit-tests on another machine and they are now failing. Less perfect
The main difference on the two machines is that the first one (the one on which the unit-tests are green) is a quit old laptop, with only one core! The second one is more recent i7, and more powerfull...
Now, on my i7, if I force my unit-tests to run on one single core, they become successful. I do this by running
taskset -c <cpu-id> my-unit-test
Legitimately, my original question comes back: is CAS working on many cores? OK, according to what I read, I would be surprised if it didn't...
So what? I hope it comes from a bug in my code. To give you more information, I have a class with a critical section. I added an attribute
bool m_isBeingModified;
It is initialized to false. Moreover, at the beginning of my critical section, I run the function
inline void waitForClassBeingModified()
{
while (!__sync_bool_compare_and_swap(&m_isBeingModified, false, true))
{} /// I concider that I can to such a loop as my critical section is very light/short
}
Finally, at the end of my critical section, I reset my boolean variable
m_isBeingModified = false;
I tried to set my attribute as volatile but it did not change anything: my unit-tests are still failing
Last information:
gcc --version
gcc (Ubuntu 6.2.0-5ubuntu12) 6.2.0 20161005
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Thank you for your help
Also use __sync_bool_compare_and_swap to unset the variable instead of just m_isBeingModified = false;. Also, don't implement your own mutex...
Both the compiler and the CPU can reorder code in unintended ways. The __sync primitives are marked in such a way to prevent this reordering from happining. Thus, with m_isBeingModified = false; it could very well the case that the compiler would first set the variable to false and only then generate the code for whatever you intended to be inside of the critical region.
Thanks to Uli's precious help, I think that I have now all the elements to answer to my question.
First of all, I may not be clear until there but the function I want to protect against concurrent access is very light. It takes around 80 cpu cycles to complete (TSC). That's why I prefer to implement my own 'light' concurrent mutex based one CAS than using pthread_mutex.
I found this interesting page that explains how to 'temporarily' disable the code-reordering thanks to the following instruction:
__asm__ __volatile__("":::"memory");
Using it, I really boost my concurrency-protection and, of course all my tests are still successful.
To get a summary, the following list reports the impact on performance of different solutions I tried:
Original code (without protection): around 80 TSC
Double CAS (set & unset variable): around 105 TSC
Mutexes based solution: around 120 TSC
Single CAS + disable reordering: around 85 TSC

Is there a standard constant *nix benchmark, and if not, how to make a `bogobench`?

Ordinary single-threaded *nix programs can be benchmarked with utils like time, i.e.:
# how long does `seq` take to count to 100,000,000
/usr/bin/time seq 100000000 > /dev/null
Outputs:
1.16user 0.06system 0:01.23elapsed 100%CPU (0avgtext+0avgdata 1944maxresident)k
0inputs+0outputs (0major+80minor)pagefaults 0swaps
...but numbers returned are always system dependent, which in a sense also measures the user's hardware.
Is there some non-relative benchmarking method or command-line util which would return approximately the same virtual timing numbers on any system, (or at least a reasonably large subset of systems)? Just like grep -m1 bogo /proc/cpuinfo returns a roughly approximate but stable unit, such a benchmark should also return a somewhat similar unit of duration.
Suppose for benchmarking ordinary commands we have a magic util bogobench (where "bogo" is an adjective signifying "a somewhat bogus status", but not necessarily having algorithms in common with BogoMIPs):
bogobench foo bar.data
And we run this on two physically separate systems:
a 1996 Pentium II
a 2015 Xeon
Desired output would be something like:
21 bogo-seconds
So bogobench should return about the same number in both cases, even though it probably would finish in much less time on the 2nd system.
A hardware emulator like qemu might be one approach, but not necessarily the only approach:
Insert the code to benchmark into a wrapper script bogo.sh
Copy bogo.sh to a bootable Linux disk image bootimage.iso, within a directory where bogo.sh would autorun then promptly shutdown the emulator. During which it outputs some form of timing data to parse into bogo-seconds.
Run bootimage.iso using one of qemu's more minimal -machine options:
qemu-system-i386 -machine type=isapc bootimage.iso
But I'm not sure how to make qemu use a virtual clock, rather than the host CPU's clock, and qemu itself seems like a heavy tool for a seemingly simple task. (Really MAME or MESS would be more versatile emulators than qemu for such a task -- but I'm not adept with MAME, although MAME currently has some capacity for 80486 PC emulation.)
Online we sometimes compare and contrast timing-based benchmarks made on machine X with one made on machine Y. Whereas I'd like both user X and Y to be able to do their benchmark on a virtual machine Z, with bonus points for emulating X or Y (like MAME) if need be, except with no consideration of X or Y's real run-time, (unlike MAME where emulations are often playable). In this way users could report how programs perform in interesting cases without the programmer having to worry that the results were biased by idiosyncrasies of a user's hardware, such as CPU quirks, background processes hogging resources, etc.
Indeed, even on the user's own hardware, a time based benchmark can be unreliable, as often the user can't be sure some background process, (or bug, or hardware error like a bad sector, or virus), might not be degrading some aspect of performance. Whereas a more virtual benchmark ought to be less susceptible to such influences.
The only sane way I see to implement this is with a cycle-accurate simulator for some kind of hardware design.
AFAIK, no publicly-available cycle-accurate simulators for modern x86 hardware exist, because it's extremely complex and despite a lot of stuff being known about x86 microarchitecture internals (Agner Fog's stuff, Intel's and AMD's own optimization guides, and other stuff in the x86 tag wiki), enough of the behaviour is still a black box full of CPU-design trade-secrets that it's at best possible to simulate something similar. (E.g. branch prediction is definitely one of the most secret but highly important parts).
While it should be possible to come close to simulating Intel Sandybridge or Haswell's actual pipeline and out-of-order core / ROB / RS (at far slower than realtime), nobody has done it that I know of.
But cycle-accurate simulators for other hardware designs do exist: Donald Knuth's MMIX architecture is a clean RISC design that could actually be built in silicon, but currently only exists on paper.
From that link:
Of particular interest is the MMMIX meta-simulator, which is able to do dynamic scheduling of a complex pipeline, allowing superscalar execution with any number of functional units and with many varieties of caching and branch prediction, etc., including a detailed implementation of both hard and soft interrupts.
So you could use this as a reference machine for everyone to run their benchmarks on, and everyone could get comparable results that will tell you how fast something runs on MMIX (after compiling for MMIX with gcc). But not how fast it runs on x86 (presumably also compiling with gcc), which may differ by a significant factor even for two programs that do the same job a different way.
For [fastest-code] challenges over on the Programming puzzles and Code Golf site, #orlp created the GOLF architecture with a simulator that prints timing results, designed for exactly this purpose. It's a toy architecture with stuff like print to stdout by storing to 0xffffffffffffffff, so it's not necessarily going to tell you anything about how fast something will run on any real hardware.
There isn't a full C implementation for GOLF, AFAIK, so you can only really use it with hand-written asm. This is a big difference from MMIX, which optimizing compilers do target.
One practical approach that could (maybe?) be extended to be more accurate over time is to use existing tools to measure some hardware invariant performance metric(s) for the code under test, and then apply a formula to come up with your bogoseconds score.
Unfortunately most easily measurable hardware metrics are not invariant - rather, they depend on the hardware. An obvious one that should be invariant, however, would be "instructions retired". If the code is taking the same code paths every time it is run, the instructions retired count should be the same on all hardware1.
Then you apply some kind of nominal clock speed (let's say 1 GHz) and nominal CPI (let's say 1.0) to get your bogoseconds - if you measure 15e9 instructions, you output a result of 15 bogoseconds.
The primary flaw here is that the nominal CPI may be way off from the actual CPI! While most programs hover around 1 CPI, it's easy to find examples where they can approach 0.25 or whatever the inverse of the width is, or alternately be 10 or more if there are many lengthy stalls. Of course such extreme programs may be what you'd want to benchmark - and even if not you have the issue that if you are using your benchmark to evaluate code changes, it will ignore any improvements or regressions in CPI and look only at instruction count.
Still, it satisfies your requirement in as much as it effectively emulates a machine that executes exactly 1 instruction every cycle, and maybe it's a reasonable broad-picture approach. It is pretty easy to implement with tools like perf stat -e instructions (like one-liner easy).
To patch the holes then you could try to make the formula better - let's say you could add in a factor for cache misses to account for that large source of stalls. Unfortunately, how are you going to measure cache-misses in a hardware invariant way? Performance counters won't help - they rely on the behavior and sizes of your local caches. Well, you could use cachegrind to emulate the caches in a machine-independent way. As it turns out, cachegrind even covers branch prediction. So maybe you could plug your instruction count, cache miss and branch miss numbers into a better formula (e.g., use typical L2, L3, RAM latencies, and a typical cost for branch misses).
That's about as far as this simple approach will take you, I think. After that, you might as well just rip apart any of the existing x862 emulators and add your simple machine model right in there. You don't need to cycle accurate, just pick a nominal width and model it. Probably whatever underlying emulation cachegrind is going might be a good match and you get the cache and branch prediction modeling already for free.
1 Of course, this doesn't rule out bugs or inaccuracies in the instruction counting mechanism.
2 You didn't tag your question x86 - but I'm going to assume that's your target since you mentioned only Intel chips.

How to prevent compiler & linker from taking over entire machine?

When I compile a large project the compiler slows down the machine tremendously, virtually freezes it out. If I'm lucky a keystroke in vim takes a few seconds to register. If I'm not I may as well go for a walk since nothing can be done on my workstation at all.
Is there any way to prevent compiler and linker from consuming the entire machine? More generally, is it possible to limit a family of processes to a portion of computing resources, such as threads, memory, disk access bandwidth?
Something like limiting the resources available to the process tree that originates from the shell that runs the build would be ideal.
Most linux distros have a package called cpulimit. You can use this to limit the CPU usage for the gcc tool chain binaries.
It's mention as an answer to this question.
Limiting certain processes to CPU % - Linux
I'm not an expert about it but you could try starting the compilation with a specific cgroup that has limited resources. I don't know exactly how complicated it is to do it but it shouldn't be too hard.
You could also try changing the nice of the process to give it a lower priority so that it does take the entire machine but will be easily bumped by another process.

Accurate way of measuring overhead in kernel space

I recently implemented a security mechanism for Linux which hooks into system calls. Now I have to measure the overhead caused by it. The project requires to compare the execution time of typical Linux apps with and without the mechanism. By typical Linux apps I assume ex. gzipping 1G file, doing 'find /', grepping files. The main goal is to show the overhead in different types of tasks: CPU bound, I/O bound etc.
The question is: how to organise the test so that they will be reliable? The first important thing is the fact that my mechanism works only in kernel space, so it is relevant to compare systime. I can use 'time' command for it, but is it the most accurate way of measuring systime? Another idea is to run those apps in long loops to minimize error. Then the loops should be inside or outside time command? If they are outside I will get many results - should I choose min, max, median, average?
Thanks for any suggestions.
I think you want more to measure a typical application payload (as Ninjajl's comment suggests, the compilation of the kernel could be a good payload). You probably don't want to measure the overhead inside each syscall itself, or even inside the kernel as a whole.
The reason for this is that most applications spend much more time and resource in user-space than in kernel-land (i.e. syscalls), so overhead inside syscalls is a "second-order" effect and probably don't matter as much. Of course, there are probable exceptions.
Perhaps phoronix test suite might be relevant.
You might be interested by oprofile
See also this answer and this question

Reducing memory usage in an extended Mathematica session

I'm doing some rather long computations, which can easily span a few days. In the course of these computations, sometimes Mathematica will run out of memory. To this end, I've ended up resorting to something along the lines of:
ParallelEvaluate[$KernelID]; (* Force the kernels to launch *)
kernels = Kernels[];
Do[
If[Mod[iteration, n] == 0,
CloseKernels[kernels];
LaunchKernels[kernels];
ClearSystemCache[]];
(* Complicated stuff here *)
Export[...], (* If a computation ends early I don't want to lose past results *)
{iteration, min, max}]
This is great and all, but over time the main kernel accumulates memory. Currently, my main kernel is eating up roughly 1.4 GB of RAM. Is there any way I can force Mathematica to clear out the memory it's using? I've tried littering Share and Clear throughout the many Modules I'm using in my code, but the memory still seems to build up over time.
I've tried also to make sure I have nothing big and complicated running outside of a Module, so that something doesn't stay in scope too long. But even with this I still have my memory issues.
Is there anything I can do about this? I'm always going to have a large amount of memory being used, since most of my calculations involve several large and dense matrices (usually 1200 x 1200, but it can be more), so I'm wary about using MemoryConstrained.
Update:
The problem was exactly what Alexey Popkov stated in his answer. If you use Module, memory will leak slowly over time. It happened to be exacerbated in this case because I had multiple Module[..] statements. The "main" Module was within a ParallelTable where 8 kernels were running at once. Tack on the (relatively) large number of iterations, and this was a breeding ground for lots of memory leaks due to the bug with Module.
Since you are using Module extensively, I think you may be interested in knowing this bug with non-deleting temporary Module variables.
Example (non-deleting unlinked temporary variables with their definitions):
In[1]:= $HistoryLength=0;
a[b_]:=Module[{c,d},d:=9;d/;b===1];
Length#Names[$Context<>"*"]
Out[3]= 6
In[4]:= lst=Table[a[1],{1000}];
Length#Names[$Context<>"*"]
Out[5]= 1007
In[6]:= lst=.
Length#Names[$Context<>"*"]
Out[7]= 1007
In[8]:= Definition#d$999
Out[8]= Attributes[d$999]={Temporary}
d$999:=9
Note that in the above code I set $HistoryLength = 0; to stress this buggy behavior of Module. If you do not do this, temporary variables can still be linked from history variables (In and Out) and will not be removed with their definitions due to this reason in more broad set of cases (it is not a bug but a feature, as Leonid mentioned).
UPDATE: Just for the record. There is another old bug with non-deleting unreferenced Module variables after Part assignments to them in v.5.2 which is not completely fixed even in version 7.0.1:
In[1]:= $HistoryLength=0;$Version
Module[{L=Array[0&,10^7]},L[[#]]++&/#Range[100];];
Names["L$*"]
ByteCount#Symbol##&/#Names["L$*"]
Out[1]= 7.0 for Microsoft Windows (32-bit) (February 18, 2009)
Out[3]= {L$111}
Out[4]= {40000084}
Have you tried to evaluate $HistoryLength=0; in all subkernels and as well as in the master kernel? History tracking is the most common source for going out of memory.
Have you tried do not use slow and memory-consuming Export and use fast and efficient Put instead?
It is not clear from your post where you evaluate ClearSystemCache[] - in the master kernel or in subkernels? It looks like you evaluate it in the master kernel only. Try to evaluate it in all subkernels too before each iteration.

Resources