Profiling a Rust application in CLion takes very long

I have a Rust program that I wish to profile to find potential performance bottlenecks.
To do this, I use perf as my profiler (I'm on Linux with the 5.15.0-52-generic kernel) and the built-in profiling option in CLion.
I ran the program with the --release flag enabled and attached perf to it directly inside the IDE. The program ran successfully (taking approx. 1 minute on 8 threads (4 hyperthreaded cores) with almost 100% CPU usage most of the time), but after it finished, CLion started taking ages to read the profiling data (it has been at it for about 20 hours now).
Why is this taking so long, and is there a way to speed it up? The project lives on an older HDD, which CLion probably also uses to store the profiling data. Is that the bottleneck here?
For comparison, I used flamegraph to see how long profiling the same program would take. It also took quite long, but less than an hour, not 20 hours.

Related

Take full advantage of processor speed on Windows machine

New member here, but a long-time Perl programmer.
I have a process that I run on a Windows machine that iterates through combinations of records from arrays/lists to identify a maximum combination, following a set of criteria.
On an old Intel i3 machine, an example would take about 45 mins to run. I purchased a new AMD Ryzen 7 machine that on benchmarks is about 7 or 8 times faster than the old machine. But the execution time was only reduced from 45 to 22 minutes.
This new machine has crazy processor capabilities, but it does not appear that Perl takes advantage of these.
Are there Perl settings or ways of coding to take advantage of all of the processor speed that I have on my new machine? Threads, etc?
thanks
Perl by default will only use a single thread and thus only a single CPU core. This means it will only use a small part of what current multi-core systems offer. Perl can make use of multiple threads, and thus multiple CPU cores, but this has to be done explicitly, i.e. the implementation needs to be adapted for parallel execution. This can involve major changes to the algorithm used to solve your problem, and not all problems can be easily parallelized.
Apart from that, Perl is not the preferred language if performance is the goal. There is a lot of overhead from being a dynamically typed language with no explicit control over memory allocation. Languages like C, C++ or Rust, which are closer to the hardware, start with significantly less overhead and then allow even more low-level control to reduce it further. But they don't magically parallelize either.
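To make the "needs to be explicitly done" point concrete, here is a minimal sketch (in Rust, one of the lower-overhead languages mentioned above, which keeps the example short) of splitting a brute-force maximum search across threads. The score() function and the candidate range are made-up placeholders standing in for your real records and criteria; the same partition-and-combine idea applies to Perl threads or a fork-based worker pool.

    // Hypothetical sketch: split a brute-force "best combination" search
    // across threads and combine the per-thread maxima at the end.
    use std::thread;

    // Placeholder scoring function; stands in for the real criteria.
    fn score(candidate: u64) -> u64 {
        candidate % 9973
    }

    fn main() {
        let n_candidates: u64 = 10_000_000;
        let n_threads: u64 = 8;
        let chunk = n_candidates / n_threads;

        let best = thread::scope(|s| {
            let handles: Vec<_> = (0..n_threads)
                .map(|t| {
                    s.spawn(move || {
                        // Each thread scans its own slice of the search space.
                        let start = t * chunk;
                        let end = if t == n_threads - 1 { n_candidates } else { start + chunk };
                        (start..end).map(score).max().unwrap_or(0)
                    })
                })
                .collect();
            // Combine the per-thread results into the overall maximum.
            handles.into_iter().map(|h| h.join().unwrap()).max().unwrap_or(0)
        });

        println!("best score: {best}");
    }

The point is that the partitioning is done by the programmer; nothing parallelizes automatically, and whether the chunks are independent depends entirely on the algorithm.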

How do I make linux swap more eagerly?

I have a use case with bursts of allocations in the range of 5-6 GB, specifically when Visual Studio Code compiles my D project while I'm typing. (The compiler doesn't release memory at all, in order to be as fast as possible.)
DMD does memory allocation in a bit of a sneaky way. Since compilers are short-lived programs, and speed is of the essence, DMD just mallocs away, and never frees. This eliminates the scaffolding and complexity of figuring out who owns the memory and when it should be released. (It has the downside of consuming all the resources of your machine if the module being compiled is big enough.)
source
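Purely as an illustration of the quoted strategy (this is not DMD's actual code), an "allocate and never free" scheme looks roughly like this; the Node type and alloc() helper are hypothetical:

    // Leak every allocation on purpose: no ownership tracking, no frees.
    // The OS reclaims everything when the short-lived process exits,
    // which is also why a large enough compile can exhaust memory.
    struct Node {
        value: u32,
        next: Option<&'static Node>,
    }

    fn alloc<T>(v: T) -> &'static T {
        Box::leak(Box::new(v))
    }

    fn main() {
        let a = alloc(Node { value: 1, next: None });
        let b = alloc(Node { value: 2, next: Some(a) });
        println!("{} -> {}", b.value, b.next.unwrap().value);
    }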
The machine is a Dell XPS 13 running Manjaro 64-bit with 16 GB of memory -- and I'm hitting that ceiling. The system seizes up completely, REISUB may or may not work, etc. I can leave it for an hour and it's still hung, not slowly resolving itself. The times I've been able to get to a tty, dmesg has had all kinds of jovial messages. So I thought I'd enable a big swap partition to alleviate the pressure, but it isn't helping.
I realise that swap won't be used until it's needed, but by then it's too late. Even with the swap, when I run out of memory everything segfaults: Qt, zsh, fuse-ntfs, Xorg. At that point it will report a typical 70 MB of swap in use.
vm.swappiness is at 100. swapon reports the swap as being active, automatically enabled by systemd.
NAME TYPE SIZE USED PRIO
/dev/nvme0n1p8 partition 17.6G 0B -2
What can I do to make it swap more?
Try the approach in the link below. Also, consider posting this question on Super User or Server Fault; Stack Overflow is only for programming questions.
https://askubuntu.com/questions/371302/make-my-ubuntu-use-more-swap-than-ram

Linux: CPU benchmark requiring longer time and different CPU utilization levels

For my research I need a CPU benchmark to do some experiments on my Ubuntu laptop (Ubuntu 15.10, 7.7 GiB memory, Intel Core i7-4500U CPU @ 1.80GHz x 4, 64-bit). In an ideal world, I would like a benchmark satisfying the following:
1. The benchmark should be an established, official one rather than something I wrote myself, for transparency.
2. The time needed to execute the benchmark on my laptop should be at least 5 minutes (the more the better).
3. The benchmark should produce different levels of CPU utilization throughout its execution. For example, I don't want a benchmark which permanently keeps CPU utilization at around 100%; I want one where utilization varies over time.
Points 2 and 3 in particular are key for my research. However, I haven't found any suitable benchmarks so far. The ones I found include sysbench, CPU Fibonacci, CPU Blowfish, CPU Cryptofish and CPU N-Queens, but all of them need only a couple of seconds to complete and keep the utilization on my laptop at a constant 100%.
Question: Does anyone know about a suitable benchmark for me? I am also happy to hear any other comments/questions you have. Thank you!
To choose a benchmark, you need to know exactly what you're trying to measure. Your question doesn't include that, so there's not much anyone can tell you without taking a wild guess.
If you're trying to measure how well Turbo clock speed works to make a power-limited CPU like your laptop run faster for bursty workloads (e.g. to compare Haswell against Skylake's new and improved power management), you could just run something trivial that's 1 second on, 2 seconds off, and count how many loop iterations it manages.
The duty cycle and cycle length should be benchmark parameters, so you can make plots. e.g. with very fast on/off cycles, Skylake's faster-reacting Turbo will ramp up faster and drop down to min power faster (leaving more headroom in the bank for the next burst).
The speaker in that talk (the lead architect for power management on Intel CPUs) says that JavaScript benchmarks are actually bursty enough for Skylake's power management to give a measurable speedup, unlike most other benchmarks, which just peg the CPU at 100% the whole time. So maybe have a look at JavaScript benchmarks if you want to use well-known off-the-shelf benchmarks.
If rolling your own, put a loop-carried dependency chain in the loop, preferably with something that's not too variable in latency across microarchitectures. A long chain of integer adds would work, and Fibonacci is a good way to stop the compiler from optimizing it away. Either pick a fixed iteration count that works well for current CPU speeds, or check the clock every 10M iterations.
Or set a timer that will fire after some time, and have it set a flag that you check inside the loop (e.g. from a signal handler). Specifically, alarm(2) may be a good choice. Record how many iterations you did in this burst of work.
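A rough sketch of that roll-your-own approach (written in Rust here for brevity; it uses a plain clock check rather than alarm(2), and the burst count, on-time and off-time are arbitrary parameters you would vary to make plots):

    // Loop-carried Fibonacci-style dependency chain run in bursts
    // (1 s on, 2 s off), counting how many iterations each burst manages.
    use std::time::{Duration, Instant};

    fn main() {
        let on = Duration::from_secs(1);
        let off = Duration::from_secs(2);

        for burst in 0..10 {
            let start = Instant::now();
            let (mut a, mut b): (u64, u64) = (0, 1);
            let mut iters: u64 = 0;

            // Check the clock only every 10M iterations so timing overhead
            // doesn't disturb the dependency chain itself.
            while start.elapsed() < on {
                for _ in 0..10_000_000u64 {
                    let next = a.wrapping_add(b); // serial add chain the CPU can't skip
                    a = b;
                    b = next;
                }
                iters += 10_000_000;
            }

            // Printing b keeps the optimizer from deleting the computation.
            println!("burst {burst}: {iters} iterations (checksum {b})");
            std::thread::sleep(off); // idle phase of the duty cycle
        }
    }

Iterations per burst is then the figure of merit; comparing short and long off-phases shows how quickly the CPU ramps its clocks back up.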

How do I measure CPU, memory and disk usage during a build?

I'm trying to improve my build times and want to have some feedback in place to measure where my problems are.
I'm using GNU Make on a Linux CentOS system to build the Linux kernel along with some application code. I can run Make with 'time' to get an overall time for the complete build, but that doesn't tell me where the bottlenecks are.
I used -j with Make to run it on multiple cores on my build machine, but I ran top during the build and noticed the CPU cores were often idle.
Any suggestions for the best way to measure disk and memory usage during the build?
Anything else I should be measuring?
No preference on text-based or GUI - whatever gives me some data I can use.
For real-time measurement I use the text-based htop (from third-party repositories). It is like top but better: it shows CPU load graphically (each core separately) as well as RAM usage.
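If you want numbers you can log to a file rather than watch, a small sampler along the following lines (a sketch, not a standard tool) can run alongside make -j and print CPU and memory figures once per second. It reads the documented /proc/stat and /proc/meminfo fields (see proc(5)) and covers CPU and RAM only, not disk:

    // Print overall CPU busy % and MemAvailable once per second.
    use std::{fs, thread, time::Duration};

    // Returns (total, idle) jiffies from the aggregate "cpu" line of /proc/stat.
    fn cpu_times() -> (u64, u64) {
        let stat = fs::read_to_string("/proc/stat").expect("read /proc/stat");
        let fields: Vec<u64> = stat
            .lines()
            .next()
            .unwrap()
            .split_whitespace()
            .skip(1)
            .filter_map(|f| f.parse().ok())
            .collect();
        let total: u64 = fields.iter().sum();
        let idle = fields[3] + fields.get(4).copied().unwrap_or(0); // idle + iowait
        (total, idle)
    }

    fn mem_available_kb() -> u64 {
        let meminfo = fs::read_to_string("/proc/meminfo").expect("read /proc/meminfo");
        meminfo
            .lines()
            .find(|l| l.starts_with("MemAvailable:"))
            .and_then(|l| l.split_whitespace().nth(1))
            .and_then(|v| v.parse().ok())
            .unwrap_or(0)
    }

    fn main() {
        let (mut prev_total, mut prev_idle) = cpu_times();
        loop {
            thread::sleep(Duration::from_secs(1));
            let (total, idle) = cpu_times();
            let busy = 100.0 * (1.0 - (idle - prev_idle) as f64 / (total - prev_total) as f64);
            println!("cpu {:5.1}%  MemAvailable {} kB", busy, mem_available_kb());
            prev_total = total; prev_idle = idle;
        }
    }

For per-process detail and disk I/O, tools like pidstat, iostat or dstat run with the same sampling interval are the usual complement.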

Why doesn't WD Velociraptor speed up my VC++-compilation significantly?

Several people around here recommended switching to the new WD Velociraptor 10,000 rpm hard disk, and magazine articles also praise its performance.
I bought one and mirrored my old system onto it. The resulting increase in compilation speed is somewhat disappointing:
On my old Samsung drive (SATA, 7200 rpm), the compilation time was 16:02.
On the Velociraptor, the build takes 15:23.
I have an E6600 with 1.5 GB of RAM. It's a C++ project with 1200 files, and the build is done in Visual Studio 2005. Acoustic management is switched off (no big difference anyway).
Did something go wrong, or is this modest acceleration really all I can expect?
Edit:
Some recommended increasing the RAM. I have now done so and got a minimal gain (3-5%) by doubling my RAM to 3 GB.
Are you using the /MP option (undocumented; you have to add it manually to your compiler options) to enable source-level parallel builds? That will speed up your compile much more than a faster hard disk alone; gains from the disk are marginal.
Visual Studio 2005 can build multiple projects in parallel, and will do so by default on a multi-core machine, but depending on how your projects depend on each other it may be unable to parallel build them.
If your 1200 cpp files are in a single project, you're probably not using all of your CPU. If I'm not mistaken a C6600 is a quad-core CPU.
Dave
I imagine that hard disk reading was not your bottleneck in compilation. Realistically, few things need to be read/written from/to the hard disk. You would likely see more performance increase from more ram or a faster processor.
I'd suggest from the results that either HDD latency wasn't the bottleneck you were looking for, or your project is already building close to as fast as possible. Other items to consider would be:
HDD access time (although you may not be able to do much about this due to bus speed limitations)
RAM access speed and size
Processor speed
Reducing background processes
A ~6% increase in speed just from swapping the hard drive. Just like Howler said: grab some faster RAM and a faster CPU.
As many have already pointed out, you probably didn't attack the real bottleneck. Randomly changing parts (or code, for that matter) is, as one could say, "bass ackwards".
You first identify the performance bottleneck and then you change something.
Perfmon can help you get a good overview of whether you're CPU- or I/O-bound. You want to look at CPU utilization, disk queue length and I/O bytes to get a first glimpse of what's going on.
That is actually a pretty big bump in speed for just replacing a hard disk. You are probably memory or CPU bound at this point. 1.5GB is light these days, and RAM is very cheap. You might see some pretty big improvements with more memory.
Just as a recommendation, if you have more than one drive installed, you could try setting your build directory to be somewhere on a different disk than your source files.
As for this comment:
If your 1200 cpp files are in a single project, you're probably not using all of your CPU. If I'm not mistaken a C6600 is a quad-core CPU.
Actually, a C6600 isn't anything. There is an E6600 and a Q6600. The E6600 is a dual-core and the Q6600 is a quad-core. On my dev machine I use a quad-core CPU, and although our project has more than 1200 files, it is still EASILY processor-limited during compile time (although a faster hard drive would still help speed things up!).
1200 source files is a lot, but none of them is likely to be more than a couple hundred KB, so while they all need to be read into memory, it won't take long to do so.
Bumping your system memory to 4 GB (yes, yes, I know about the 3.somethingorother limit that 32-bit OSes have), and maybe looking at your CPU, will provide a lot more performance improvement than merely using a faster disk drive could.
VC 2005 does not compile more than one file at a time per project, so either move to VC 2008 to use both of your CPU cores, or break your solution into multiple library sub-projects to get multiple compilations going.
I halved my compilation time by putting all my source onto a ram drive.
I tried these guys http://www.superspeed.com/desktop/ramdisk.php, installed a 1 GB RAM drive, then copied all my source onto it. If you build directly from RAM, the I/O overhead is vastly reduced.
To give you an idea of what I'm compiling, and on what;
WinXP 64-bit
4GB ram
2.? GHz dual-core processors
62 C# projects
approx 250kloc.
My build went from about 135s to 65s.
Downsides are that your source files are living in RAM, so you need to be more vigilant about source control. If your machine lost power, you'd lose all unversioned changes. This is mitigated slightly by the fact that some RAM drives will save themselves to disk when you shut the machine down, but you'll still lose everything since either your last checkout or the last time you shut down.
Also, you have to pay for the software. But since you're shelling out for hard drives, maybe this isn't that big a deal.
Upsides are the reduced compilation time, and the fact that the exes are already living in memory, so startup and debugging times are a bit better. The real benefit is the compilation time, though.

Resources