I noticed that as my project grows, the release build time gets slower at a faster pace than I expected (and hoped for). I decided to look into what I could do to improve compilation speed. I am not talking about the initial build, which involves compiling all the dependencies and is largely irrelevant here.
One thing that seems to help significantly is the incremental = true profile setting. On my project, it seems to shorten build time by ~40% on 4+ cores. With fewer cores the gains are even larger, as builds with incremental = true don't seem to use (much) parallelization. With the default for --release, incremental = false, build times are 3-4 times slower on a single core compared to 4+ cores.
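For reference, a rough way to reproduce this kind of comparison without editing Cargo.toml is the CARGO_INCREMENTAL environment override (a sketch rather than my exact commands; src/lib.rs stands in for any source file in the crate):
cargo clean
CARGO_INCREMENTAL=0 cargo build --release          # clean non-incremental build
touch src/lib.rs                                   # simulate a small change
time CARGO_INCREMENTAL=0 cargo build --release     # non-incremental rebuild
cargo clean
CARGO_INCREMENTAL=1 cargo build --release          # clean incremental build
touch src/lib.rs
time CARGO_INCREMENTAL=1 cargo build --release     # incremental rebuild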
What are the reasons to refrain from using incremental = true for production builds? I don't see any (significant) increase in binary size or in the storage size of cached objects. I read somewhere that incremental builds can lead to slightly worse performance of the built binary. Is that the only reason to consider, or are there others, like stability, etc.?
I know this could vary, but is there any data available on how much of a performance impact might be expected on real-world applications?
Don't use an incremental build for production releases, because it is:
not reproducible (i.e. you can't get the exact same binary by compiling it again) and
quite possibly subtly broken (incremental compilation is way more complex and way less tested than clean compilation, in particular with optimizations turned on).
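A quick way to test the first point on your own project is to build twice from scratch and compare hashes (a sketch; target/release/myapp stands in for your binary name):
cargo clean && cargo build --release
sha256sum target/release/myapp > first.sha256
cargo clean && cargo build --release
sha256sum target/release/myapp | diff - first.sha256 && echo "bit-identical" || echo "binaries differ"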
When I upgraded to version 3.3.5 of ArangoDB, I got the following warning:
2018-05-24T10:25:32Z [26942] WARNING {memory} maximum number of memory mappings per process is 65530, which seems too low. it is recommended to set it to at least 512000
2018-05-24T10:25:32Z [26942] WARNING {memory} execute 'sudo sysctl -w "vm.max_map_count=512000"'
Is it safe to tinker with this system setting (which is what it is, as I understand it)? And what does increasing max_map_count mean for ArangoDB in particular?
It is safe to increase this value. It allows applications to create more memory mappings and thus, in practice, to use more RAM. The preset values on different distributions are sane defaults for interactive use. However, when you operate inherently memory-heavy applications like databases, you might have to relax such limits to fit your needs. Having said that, if you have a malicious program on your system, it is also allowed to allocate more memory.
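If you do decide to raise it, the change can be applied immediately and made persistent across reboots like this (the file name under /etc/sysctl.d/ is an arbitrary choice):
sudo sysctl -w vm.max_map_count=512000                                      # takes effect immediately
echo 'vm.max_map_count = 512000' | sudo tee /etc/sysctl.d/99-arangodb.conf
sudo sysctl --system                                                         # reload sysctl configuration files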
But let's not forget that the warning is only a warning. So as long as your database is not huge, you are not working with lots of open cursors and you don't experience any performance issues, you might not have to make any changes for now. Just keep it in the back of your head, so that you know what to tweak when performance suddenly goes down.
Also, the MMFiles storage engine is more affected than the RocksDB engine.
Ordinary single-threaded *nix programs can be benchmarked with utils like time, e.g.:
# how long does `seq` take to count to 100,000,000
/usr/bin/time seq 100000000 > /dev/null
Outputs:
1.16user 0.06system 0:01.23elapsed 100%CPU (0avgtext+0avgdata 1944maxresident)k
0inputs+0outputs (0major+80minor)pagefaults 0swaps
...but the numbers returned are always system-dependent, so in a sense they also measure the user's hardware.
Is there some non-relative benchmarking method or command-line util which would return approximately the same virtual timing numbers on any system, (or at least a reasonably large subset of systems)? Just like grep -m1 bogo /proc/cpuinfo returns a roughly approximate but stable unit, such a benchmark should also return a somewhat similar unit of duration.
Suppose for benchmarking ordinary commands we have a magic util bogobench (where "bogo" is an adjective signifying "a somewhat bogus status", but not necessarily having algorithms in common with BogoMIPS):
bogobench foo bar.data
And we run this on two physically separate systems:
a 1996 Pentium II
a 2015 Xeon
Desired output would be something like:
21 bogo-seconds
So bogobench should return about the same number in both cases, even though it probably would finish in much less time on the 2nd system.
A hardware emulator like qemu might be one approach, but not necessarily the only approach:
Insert the code to benchmark into a wrapper script bogo.sh
Copy bogo.sh to a bootable Linux disk image bootimage.iso, within a directory where bogo.sh would autorun and then promptly shut down the emulator, outputting along the way some form of timing data to parse into bogo-seconds.
Run bootimage.iso using one of qemu's more minimal -machine options:
qemu-system-i386 -machine type=isapc bootimage.iso
But I'm not sure how to make qemu use a virtual clock, rather than the host CPU's clock, and qemu itself seems like a heavy tool for a seemingly simple task. (Really MAME or MESS would be more versatile emulators than qemu for such a task -- but I'm not adept with MAME, although MAME currently has some capacity for 80486 PC emulation.)
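One possibly relevant knob I haven't fully explored is qemu's -icount option, which (as I understand it) derives the guest's virtual clock from the number of executed instructions rather than from the host clock, e.g.:
qemu-system-i386 -machine type=isapc -icount shift=0,align=off -nographic bootimage.iso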
Online we sometimes compare and contrast timing-based benchmarks made on machine X with ones made on machine Y. I'd like both user X and user Y to be able to run their benchmark on a virtual machine Z, with bonus points for emulating X or Y (as MAME does) if need be, except with no consideration of X's or Y's real run time (unlike MAME, where emulations are often playable). This way users could report how programs perform in interesting cases without the programmer having to worry that the results were biased by idiosyncrasies of a user's hardware, such as CPU quirks, background processes hogging resources, etc.
Indeed, even on the user's own hardware, a time-based benchmark can be unreliable, as the user often can't be sure that some background process (or bug, or hardware error like a bad sector, or virus) isn't degrading some aspect of performance. A more virtual benchmark ought to be less susceptible to such influences.
The only sane way I see to implement this is with a cycle-accurate simulator for some kind of hardware design.
AFAIK, no publicly-available cycle-accurate simulators for modern x86 hardware exist, because it's extremely complex and despite a lot of stuff being known about x86 microarchitecture internals (Agner Fog's stuff, Intel's and AMD's own optimization guides, and other stuff in the x86 tag wiki), enough of the behaviour is still a black box full of CPU-design trade-secrets that it's at best possible to simulate something similar. (E.g. branch prediction is definitely one of the most secret but highly important parts).
While it should be possible to come close to simulating Intel Sandybridge or Haswell's actual pipeline and out-of-order core / ROB / RS (at far slower than realtime), nobody has done it that I know of.
But cycle-accurate simulators for other hardware designs do exist: Donald Knuth's MMIX architecture is a clean RISC design that could actually be built in silicon, but currently only exists on paper.
From that link:
Of particular interest is the MMMIX meta-simulator, which is able to do dynamic scheduling of a complex pipeline, allowing superscalar execution with any number of functional units and with many varieties of caching and branch prediction, etc., including a detailed implementation of both hard and soft interrupts.
So you could use this as a reference machine for everyone to run their benchmarks on, and everyone could get comparable results that will tell you how fast something runs on MMIX (after compiling for MMIX with gcc). But that won't tell you how fast it runs on x86 (presumably also compiling with gcc): the relative speed of two programs that do the same job in different ways may differ by a significant factor between the two architectures.
For [fastest-code] challenges over on the Programming Puzzles and Code Golf site, orlp created the GOLF architecture with a simulator that prints timing results, designed for exactly this purpose. It's a toy architecture with features like printing to stdout by storing to 0xffffffffffffffff, so it's not necessarily going to tell you anything about how fast something will run on any real hardware.
There isn't a full C implementation for GOLF, AFAIK, so you can only really use it with hand-written asm. This is a big difference from MMIX, which optimizing compilers do target.
One practical approach that could (maybe?) be extended to be more accurate over time is to use existing tools to measure some hardware-invariant performance metric(s) for the code under test, and then apply a formula to come up with your bogoseconds score.
Unfortunately, most easily measurable hardware metrics are not invariant - rather, they depend on the hardware. An obvious one that should be invariant, however, is "instructions retired". If the code takes the same code paths every time it is run, the instructions retired count should be the same on all hardware [1].
Then you apply some kind of nominal clock speed (let's say 1 GHz) and nominal CPI (let's say 1.0) to get your bogoseconds - if you measure 15e9 instructions, you output a result of 15 bogoseconds.
The primary flaw here is that the nominal CPI may be way off from the actual CPI! While most programs hover around 1 CPI, it's easy to find examples where they can approach 0.25 or whatever the inverse of the width is, or alternately be 10 or more if there are many lengthy stalls. Of course such extreme programs may be what you'd want to benchmark - and even if not you have the issue that if you are using your benchmark to evaluate code changes, it will ignore any improvements or regressions in CPI and look only at instruction count.
Still, it satisfies your requirement in as much as it effectively emulates a machine that executes exactly 1 instruction every cycle, and maybe it's a reasonable broad-picture approach. It is pretty easy to implement with tools like perf stat -e instructions (like one-liner easy).
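For instance, a minimal bogobench along those lines might look like this (a sketch assuming Linux perf; instructions:u counts user-space instructions retired, and perf stat writes its report to stderr):
#!/bin/sh
# usage: ./bogobench.sh foo bar.data
count=$(perf stat -e instructions:u -x, -- "$@" 2>&1 >/dev/null | awk -F, '/instructions/ {print $1; exit}')
# nominal 1 GHz at CPI 1.0: 1e9 retired instructions == 1 bogo-second
echo "$count" | awk '{printf "%.2f bogo-seconds\n", $1 / 1e9}'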
To patch the holes then you could try to make the formula better - let's say you could add in a factor for cache misses to account for that large source of stalls. Unfortunately, how are you going to measure cache-misses in a hardware invariant way? Performance counters won't help - they rely on the behavior and sizes of your local caches. Well, you could use cachegrind to emulate the caches in a machine-independent way. As it turns out, cachegrind even covers branch prediction. So maybe you could plug your instruction count, cache miss and branch miss numbers into a better formula (e.g., use typical L2, L3, RAM latencies, and a typical cost for branch misses).
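Concretely, that part could look something like this (the cost weights in the comment are illustrative guesses, not measured values):
valgrind --tool=cachegrind --cache-sim=yes --branch-sim=yes ./foo bar.data
# The summary printed on exit reports Ir (instructions), D1/LL cache misses and
# mispredicted branches; feed those into whatever cost formula you settle on, e.g.
#   bogo_cycles = Ir + 10*D1_misses + 100*LL_misses + 15*branch_mispredicts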
That's about as far as this simple approach will take you, I think. After that, you might as well just rip apart any of the existing x86 [2] emulators and add your simple machine model right in there. You don't need to be cycle-accurate; just pick a nominal width and model it. Whatever underlying emulation cachegrind is doing would probably be a good match, and you get the cache and branch prediction modeling for free already.
[1] Of course, this doesn't rule out bugs or inaccuracies in the instruction counting mechanism.
[2] You didn't tag your question x86 - but I'm going to assume that's your target since you mentioned only Intel chips.
I have written a script for a project that stress tests the CPU, virtual memory and I/O while running vmstat, iostat and sar. The scripts all run for 30 seconds. My tutor, however, has asked me to ensure that the results are accurate. How can I ever be sure? Surely I just take the machine's word for it after running a few tests? The tests have been run for 60 seconds each, and so have the monitoring commands, to try to ensure a fair test, but how can I be sure that they are accurate, given my tutor's concerns? Any ideas?
The systems are server versions of Ubuntu 12.04, Debian 7 and Suse 11
There is no way for us to know what your tutor's concerns are, so you should ask him!
"accuracy" usually means that your test results should not be offset by a factor you're not taking into account, like some CPU features being disabled or not used, differences in software configuration, etc.
What is it that you evaluate, anyway? Evaluating CPU performance is not the same as evaluating a particular hardware system, which is yet different if you consider the software as well. Basically, you need to eliminate all differences which are not part of your evaluation, and make sure the rest of the configuration is representative (e.g. installing a modern OS which supports all the features the CPU provides).
And remember that in the end you will always take the machine's word for it, there's just no other way. All you can say is that you have considered all factors you're aware of, and hope that the factors remaining unknown don't have a big influence.
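One cheap sanity check is run-to-run variance: repeat each test several times and look at the spread, since a large spread hints at an uncontrolled factor (background load, thermal throttling, and so on). A sketch, with ./cpu_stress_test.sh standing in for one of your scripts:
rm -f runtimes.txt
for i in 1 2 3 4 5; do
    /usr/bin/time -f '%e' -a -o runtimes.txt ./cpu_stress_test.sh >/dev/null
done
awk '{s+=$1; ss+=$1*$1; n++} END {m=s/n; printf "mean=%.2fs sd=%.2fs\n", m, sqrt(ss/n - m*m)}' runtimes.txt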
While compiling under Linux I use the flag -j16, as I have 16 cores. I am just wondering whether it makes any sense to use something like -j32. Actually, this is a question about scheduling of processor time and whether it is possible to put more pressure on one particular process than on the others this way (let's say I have two parallel compilations, each with -j16; what if one of them used -j32 instead?).
I think it does not make much sense, but I am not sure, as I do not know how the kernel handles such things.
I use a non-recursive build system based on GNU make and I was wondering how well it scales.
I ran benchmarks on a 6-core Intel CPU with hyper-threading. I measured compile times using -j1 to -j20. For each -j option, make ran three times and the shortest time was recorded. Using -j9 gives the shortest compile time, 11% better than -j6.
In other words, hyper-threading does help a little, and an optimal formula for Intel processors with hyper-threading is number_of_cores * 1.5:
Chart data is here.
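For anyone repeating the measurement on their own project, the procedure boils down to something like this (a sketch, not the exact script used; GNU time's -o writes the elapsed seconds to a file instead of stderr):
for j in $(seq 1 20); do
    best=
    for run in 1 2 3; do
        make clean >/dev/null 2>&1
        /usr/bin/time -f '%e' -o /tmp/elapsed make -j"$j" >/dev/null 2>&1
        t=$(cat /tmp/elapsed)
        # keep the shortest of the three runs
        if [ -z "$best" ] || awk -v a="$t" -v b="$best" 'BEGIN{exit !(a+0 < b+0)}'; then
            best=$t
        fi
    done
    echo "-j$j: ${best}s"
done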
The rule of thumb is to use the number of processors + 1. Hyper-threading counts, so a quad-core CPU with HT should get -j9.
Setting the value too high is counter-productive. If you do want to speed up compile times, consider ccache to cache compiled objects that do not change between compilations, and distcc to distribute the compilation across several machines.
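A minimal illustration of both, assuming a plain make-based project that honours CC/CXX (distcc hosts come from DISTCC_HOSTS or ~/.distcc/hosts):
make -j9 CC="ccache gcc" CXX="ccache g++"
# ccache checks its cache first and hands cache misses to distcc for remote
# compilation; CCACHE_PREFIX is the documented way to chain the two.
CCACHE_PREFIX=distcc make -j24 CC="ccache gcc" CXX="ccache g++"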
We have a machine in our shop with the following characteristics:
256-core SPARC running Solaris
~64 GB RAM
Some of that memory is used for a RAM drive for /tmp
Back when it was originally set up, before other users discovered its existence, I ran some timing tests to see how far I could push it. The build in question is non-recursive, so all jobs are kicked off from a single make process. I also cloned my repo into /tmp to take advantage of the RAM drive.
I saw improvements up to -j56. Beyond that my results flatlined, much like Maxim's graph, until somewhere above (roughly) -j75, where performance began to degrade. Running multiple parallel builds, I could push it beyond the apparent cap of -j56.
The primary make process is single-threaded; after running some tests I realized the ceiling I was hitting had to do with how many child processes the primary thread could service -- which was further hampered by anything in the makefiles that either required extra time to parse (e.g., using = instead of := and thus incurring unnecessary delayed evaluation, complex user-defined macros, etc.) or used things like $(shell).
These are the things I've been able to do to speed up builds that have a noticeable impact:
Use := wherever possible
If you assign to a variable once with :=, then later with +=, it'll continue to use immediate evaluation. However, ?= and +=, when a variable hasn't been assigned previously, will always delay evaluation.
Delayed evaluation doesn't seem like a big deal until you have a large enough build. If a variable (like CFLAGS) doesn't change after all the makefiles have been parsed, then you probably don't want to use delayed evaluation on it (and if you do, you probably already know enough about what I'm talking about anyway to ignore my advice).
If you create macros you execute with the $(call) facility, try to do as much of the evaluation ahead of time as possible
I once got it in my head to create macros of the form:
IFLINUX = $(strip $(if $(filter Linux,$(shell uname)),$(1),$(2)))
IFCLANG = $(strip $(if $(filter-out undefined,$(origin CLANG_BUILD)),$(1),$(2)))
...
# an example of how I might have made the worst use of it
CXXFLAGS = ${whatever flags} $(call IFCLANG,-fsanitize=undefined)
This build produces over 10,000 object files, about 8,000 of which are from C++ code. Had I used CXXFLAGS := (...), it would only need to immediately replace ${CXXFLAGS} in all of the compile steps with the already evaluated text. Instead it must re-evaluate the text of that variable once for each compile step.
An alternative implementation that can at least help mitigate some of the re-evaluation if you have no choice:
ifneq 'undefined' '$(origin CLANG_BUILD)'
IFCLANG = $(strip $(1))
else
IFCLANG = $(strip $(2))
endif
... though that only helps avoid the repeated $(origin) and $(if) calls; you'd still have to follow the advice about using := wherever possible.
Where possible, avoid using custom macros inside recipes
The reasoning should be pretty obvious here after the above; anything that requires a variable or macro to be repeatedly evaluated for every compile/link step will degrade your build speed. Every macro/variable evaluation occurs in the same thread as what kicks off new jobs, so any time spent parsing is time make delays kicking off another parallel job.
I put some recipes in custom macros whenever it promotes code re-use and/or improves readability, but I try to keep it to a minimum.
I'm doing some rather long computations, which can easily span a few days. In the course of these computations, sometimes Mathematica will run out of memory. To this end, I've ended up resorting to something along the lines of:
ParallelEvaluate[$KernelID]; (* Force the kernels to launch *)
kernels = Kernels[];
Do[
If[Mod[iteration, n] == 0,
CloseKernels[kernels];
LaunchKernels[kernels];
ClearSystemCache[]];
(* Complicated stuff here *)
Export[...], (* If a computation ends early I don't want to lose past results *)
{iteration, min, max}]
This is great and all, but over time the main kernel accumulates memory. Currently, my main kernel is eating up roughly 1.4 GB of RAM. Is there any way I can force Mathematica to clear out the memory it's using? I've tried littering Share and Clear throughout the many Modules I'm using in my code, but the memory still seems to build up over time.
I've also tried to make sure I have nothing big and complicated running outside of a Module, so that nothing stays in scope too long. But even with this I still have my memory issues.
Is there anything I can do about this? I'm always going to have a large amount of memory being used, since most of my calculations involve several large and dense matrices (usually 1200 x 1200, but it can be more), so I'm wary about using MemoryConstrained.
Update:
The problem was exactly what Alexey Popkov stated in his answer. If you use Module, memory will leak slowly over time. It happened to be exacerbated in this case because I had multiple Module[..] statements. The "main" Module was within a ParallelTable where 8 kernels were running at once. Tack on the (relatively) large number of iterations, and this was a breeding ground for lots of memory leaks due to the bug with Module.
Since you are using Module extensively, I think you may be interested to know about this bug where temporary Module variables are not deleted.
Example (unlinked temporary variables are not deleted together with their definitions):
In[1]:= $HistoryLength=0;
a[b_]:=Module[{c,d},d:=9;d/;b===1];
Length@Names[$Context<>"*"]
Out[3]= 6
In[4]:= lst=Table[a[1],{1000}];
Length@Names[$Context<>"*"]
Out[5]= 1007
In[6]:= lst=.
Length@Names[$Context<>"*"]
Out[7]= 1007
In[8]:= Definition@d$999
Out[8]= Attributes[d$999]={Temporary}
d$999:=9
Note that in the above code I set $HistoryLength = 0; to stress this buggy behavior of Module. If you do not do this, temporary variables can still be linked from history variables (In and Out) and, for this reason, will not be removed along with their definitions in a broader set of cases (that is not a bug but a feature, as Leonid mentioned).
UPDATE: Just for the record, there is another old bug where unreferenced Module variables are not deleted after Part assignments to them; it appeared in v5.2 and is not completely fixed even in version 7.0.1:
In[1]:= $HistoryLength=0;$Version
Module[{L=Array[0&,10^7]},L[[#]]++&/@Range[100];];
Names["L$*"]
ByteCount@Symbol@#&/@Names["L$*"]
Out[1]= 7.0 for Microsoft Windows (32-bit) (February 18, 2009)
Out[3]= {L$111}
Out[4]= {40000084}
Have you tried evaluating $HistoryLength = 0; in all subkernels as well as in the master kernel? History tracking is the most common source of running out of memory.
Have you tried not using the slow and memory-consuming Export, and using the fast and efficient Put instead?
It is not clear from your post where you evaluate ClearSystemCache[] - in the master kernel or in the subkernels? It looks like you evaluate it in the master kernel only. Try evaluating it in all subkernels too, before each iteration.