How do I monitor the amount of SIMD instruction usage

How do I monitor the amount of SIMD instruction usage - linux

How can I monitor the amount of SIMD (SSE, AVX, AVX2, AVX-512) instruction usage of a process? For example, htop can be used to monitor general CPU usage, but not specifically SIMD instruction usage.

I think the only reliable way to count all SIMD instructions (not just FP math) is dynamic instrumentation (e.g. via something like Intel PIN / SDE).
See How to characterize a workload by obtaining the instruction type breakdown? and How do I determine the number of x86 machine instructions executed in a C program? specifically sde64 -mix -- ./my_program to print the instruction mix for your program for that run, example output in libsvm compiled with AVX vs no AVX
I don't think there's a good way to make this like top / htop, if it's even possible to safely attach to already-running processes, especially multi-threaded once.
It might also be possible to get dynamic instruction counts using last-branch-record stuff to record / reconstruct the path of execution and count everything, but I don't know of tools for that. In theory that could attach to already-running programs without much danger, but it would take a lot of computation (disassembling and counting instructions) to do it on the fly for all running processes. Not like just asking the kernel for CPU usage stats that it tracks anyway on context switches.
You'd need hardware instruction-counting support for this to be really efficient the way top is.
For SIMD floating point math specifically (not FP shuffles, just real FP math like vaddps), there are perf counter events.
e.g. from perf list output:
fp_arith_inst_retired.128b_packed_single
[Number of SSE/AVX computational 128-bit packed single precision
floating-point instructions retired. Each count represents 4
computations. Applies to SSE* and AVX* packed single precision
floating-point instructions: ADD SUB MUL DIV MIN MAX RCP RSQRT SQRT
DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as
they perform multiple calculations per element]
So it's not even counting uops, it's counting FLOPS. There are other events for ...pd packed double, and 256-bit versions of each. (I assume on CPUs with AVX512, there are also 512-bit vector versions of these events.)
You can use perf to count their execution globally across processes and on all cores. Or for a single process
## count math instructions only, not SIMD integer, load/store, or anything else
perf stat -e cycles:u,instructions:u,fp_arith_inst_retired.{128,256}b_packed_{double,single}:u ./my_program
# fixme: that brace-expansion doesn't expand properly; it separates with spaces not commas.
(Intentionally omitting fp_arith_inst_retired.scalar_{double,single} because you only asked about SIMD and scalar instructions on XMM registers don't count, IMO.)
(You can attach perf to a running process by using -p PID instead of a command. Or use perf top as suggested in
See Ubuntu - how to tell if AVX or SSE, is current being used by CPU app?
You can run perf stat -a to monitor globally across all cores, regardless of what process is executing. But again, this only counts FP math, not SIMD in general.
Still, it is hardware-supported and thus could be cheap enough for something like htop to use without wasting a lot of CPU time if you leave it running long-term.

Related

Why do register renaming, when we can increase the number of registers in the architecture?

In processors, why can't we simply increase the number of registers instead of having a huge reorder buffer and mapping the register for resolving name dependencies?

Lots of reasons.
first, we are often designing micro-architectures to execute programs for an existing architecture. Adding registers would change the architecture. At best, existing binaries would not benefit from the new registers, at worst they won't run at all without some kind of JIT compilation.
there is the problem of encoding. Adding new registers means increasing the number of bit dedicated to encode the registers, probably increasing the instruction size with effects on the cache and elsewhere.
there is the issue of the size of the visible state. Context swapping would have to save all the visible registers. Taking more time. Taking more place (and thus an effect on the cache, thus more time again).
there is the effect that dynamic renaming can be applied at places where static renaming and register allocation is impossible, or at least hard to do; and when they are possible, that takes more instructions thus increasing the cache pressure.
In conclusion there is a sweet spot which is usually considered at 16 or 32 registers for the integer/general purpose case. For floating point and vector registers, there are arguments to consider more registers (ISTR that Fujitsu was at a time using 128 or 256 floating point registers for its own extended SPARC).
Related question on electronics.se.
An additional note, the mill architecture takes another approach to statically scheduled processors and avoid some of the drawbacks, apparently changing the trade-off. But AFAIK, it is not yet know if there will ever be available silicon for it.

Because static scheduling at compile time is hard (software pipelining) and inflexible to variable timings like cache misses. Having the CPU able to find and exploit ILP (Instruction Level Parallelism) in more cases is very useful for hiding latency of cache misses and FP or integer math.
Also, instruction-encoding considerations. For example, Haswell's 168-entry integer register file would need about 8 bits per operand to encode if we had that many architectural registers. vs. 3 or 4 for actual x86 machine code.
Related:
http://www.lighterra.com/papers/modernmicroprocessors/ great intro to CPU design and how smarter CPUs can find more ILP
Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths shows how OoO exec can overlap exec of two dependency chains, unless you block it.
http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ has some specific examples of how much OoO exec can do to hide cache-miss or other latency
this Q&A about how superscalar execution works.

Register identifier encoding space will be a problem. Indeed, many more registers has been tried. For example, SPARC has register windows, 72 to 640 registers of which 32 are visible at one time.
Instead, from Computer Organization And Design: RISC-V Edition.
Smaller is faster. The desire for speed is the reason that RISC-V has 32 registers rather than many more.
BTW, ROB size has to do with the processor being out-of-order, superscalar, rather than renaming and providing lots of general purpose registers.

Paralellization vs vectorization performance bottlenec: Does AVX and MT compete?

I tried to compute the sum of all elements in a large matrix. Here are the test cases:
MT and AVX takes 37 s
MT and no AVX takes 40 s
AVX and no MT takes 49 s
Neither AVX or MT 105 s
In all cases, the CPU clock is fixed to 3.0 GHz (claimed by cpufreq-info):
current policy: frequency should be within 1.60 GHz and 3.40 GHz.
The governor "userspace" may decide which speed to use
within this range.
current CPU frequency is 3.00 GHz.
The matrix has 25000000 elements of type double and value 1.0. And the sum is computed repeatedly 4096 times in a loop. Without AVX, the speed improvement when using MT is 2.6. With AVX it is only 1.3. When running MT, the matrix is divided into 4 blocks, one per thread. If I reduce the CPU frequency, the MT improvement is larger for AVX, so there might be some issue with cache misses also, but that cannot explain the difference between (4)/(2) and (3)/(1). Does AVX and MT compete with each other in some way? The chip is i3570K.

It's quite possible that your baseline performance was bounded by execution latency, but either form of parallelization (MT or vectorization) allowed you to break that and reach the next bottleneck which is the memory BW of your CPU.
Check the peak BW your CPU can reach and compare with your data, looks like you're simply saturating at 20.5GB/s (25000000 elements * 4096 loops * 8Bytes assuming that's what your system uses for double / ~40 seconds), which seems a little low as this link says it should reach 25GB/s, but around the same ballpark so it could be due to other inefficiencies, like type of DDR, other apps / OS working in the background, frequency scaling done by the CPU to save power / reduce heat, etc..
You could also try running some memory benchmarks (lmbench, sandra, ..) and see if they do better under the same environment.

MT should not compete with MT, they are two different things. Although the summation idea is simple but depending on your implementation you can get very different numbers. I suggest you use the Stream benchmarks to test performance as they are the standard. I don't see your code but there are some issues:
you are initializing the matrix with 1.0 for all the elements. I think that is not a good idea. You should use random numbers or at lease initialize based on the index (e.g. (i%10)/10.0).
How do you measure time? you should place your timers out side the repetition loop and take the average over the number of repetition. Also do you use accurate timers?
Did you make sure that your code is actually vectorized? did you enable any compiler flags to display this information? Did you make sure that the AVX version of your code is used? maybe the compiler chose to use the scalar version.
you mentioned that the frequency is fixed, are you sure that the turbo mode is not enabled at any point of time?
What about thread affinity when measuring with MT?

Getting cpu cycles using RDTSC - why does the value of RDTSC always increase?

I want to get the CPU cycles at a specific point. I use this function at that point:
static __inline__ unsigned long long rdtsc(void)
{
unsigned long long int x;
__asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
// broken for 64-bit builds; don't copy this code
return x;
}
(editor's note: "=A" is wrong for x86-64; it picks either RDX or RAX. Only in 32-bit mode will it pick the EDX:EAX output you want. See How to get the CPU cycle count in x86_64 from C++?.)
The problem is that it returns always an increasing number (in every run). It's as if it is referring to the absolute time.
Am I using the functions incorrectly?

As long as your thread stays on the same CPU core, the RDTSC instruction will keep returning an increasing number until it wraps around. For a 2GHz CPU, this happens after 292 years, so it is not a real issue. You probably won't see it happen. If you expect to live that long, make sure your computer reboots, say, every 50 years.
The problem with RDTSC is that you have no guarantee that it starts at the same point in time on all cores of an elderly multicore CPU and no guarantee that it starts at the same point in time time on all CPUs on an elderly multi-CPU board.
Modern systems usually do not have such problems, but the problem can also be worked around on older systems by setting a thread's affinity so it only runs on one CPU. This is not good for application performance, so one should not generally do it, but for measuring ticks, it's just fine.
(Another "problem" is that many people use RDTSC for measuring time, which is not what it does, but you wrote that you want CPU cycles, so that is fine. If you do use RDTSC to measure time, you may have surprises when power saving or hyperboost or whatever the multitude of frequency-changing techniques are called kicks in. For actual time, the clock_gettime syscall is surprisingly good under Linux.)
I would just write rdtsc inside the asm statement, which works just fine for me and is more readable than some obscure hex code. Assuming it's the correct hex code (and since it neither crashes and returns an ever-increasing number, it seems so), your code is good.
If you want to measure the number of ticks a piece of code takes, you want a tick difference, you just need to subtract two values of the ever-increasing counter. Something like uint64_t t0 = rdtsc(); ... uint64_t t1 = rdtsc() - t0;
Note that for if very accurate measurements isolated from surrounding code are necessary, you need to serialize, that is stall the pipeline, prior to calling rdtsc (or use rdtscp which is only supported on newer processors). The one serializing instruction that can be used at every privilegue level is cpuid.
In reply to the further question in the comment:
The TSC starts at zero when you turn on the computer (and the BIOS resets all counters on all CPUs to the same value, though some BIOSes a few years ago did not do so reliably).
Thus, from your program's point of view, the counter started "some unknown time in the past", and it always increases with every clock tick the CPU sees. Therefore if you execute the instruction returning that counter now and any time later in a different process, it will return a greater value (unless the CPU was suspended or turned off in between). Different runs of the same program get bigger numbers, because the counter keeps growing. Always.
Now, clock_gettime(CLOCK_PROCESS_CPUTIME_ID) is a different matter. This is the CPU time that the OS has given to the process. It starts at zero when your process starts. A new process starts at zero, too. Thus, two processes running after each other will get very similar or identical numbers, not ever growing ones.
clock_gettime(CLOCK_MONOTONIC_RAW) is closer to how RDTSC works (and on some older systems is implemented with it). It returns a value that ever increases. Nowadays, this is typically a HPET. However, this is really time, and not ticks. If your computer goes into low power state (e.g. running at 1/2 normal frequency), it will still advance at the same pace.

There's lots of confusing and/or wrong information about the TSC out there, so I thought I'd try to clear some of it up.
When Intel first introduced the TSC (in original Pentium CPUs) it was clearly documented to count cycles (and not time). However, back then CPUs mostly ran at a fixed frequency, so some people ignored the documented behaviour and used it to measure time instead (most notably, Linux kernel developers). Their code broke in later CPUs that don't run at a fixed frequency (due to power management, etc). Around that time other CPU manufacturers (AMD, Cyrix, Transmeta, etc) were confused and some implemented TSC to measure cycles and some implemented it so it measured time, and some made it configurable (via. an MSR).
Then "multi-chip" systems became more common for servers; and even later multi-core was introduced. This led to minor differences between TSC values on different cores (due to different startup times); but more importantly it also led to major differences between TSC values on different CPUs caused by CPUs running at different speeds (due to power management and/or other factors).
People that were trying to use it wrong from the start (people who used it to measure time and not cycles) complained a lot, and eventually convinced CPU manufacturers to standardise on making the TSC measure time and not cycles.
Of course this was a mess - e.g. it takes a lot of code just to determine what the TSC actually measures if you support all 80x86 CPUs; and different power management technologies (including things like SpeedStep, but also things like sleep states) may effect TSC in different ways on different CPUs; so AMD introduced a "TSC invariant" flag in CPUID to tell the OS that the TSC can be used to measure time correctly.
All recent Intel and AMD CPUs have been like this for a while now - TSC counts time and doesn't measure cycles at all. This means if you want to measure cycles you had to use (model specific) performance monitoring counters. Unfortunately the performance monitoring counters are an even worse mess (due to their model specific nature and convoluted configuration).

good answers already, and Damon already mentioned this in a way in his answer, but I'll add this from the actual x86 manual (volume 2, 4-301) entry for RDTSC:
Loads the current value of the processor's time-stamp counter (a 64-bit MSR) into the EDX:EAX registers. The EDX register is loaded with the high-order 32 bits of the MSR and the EAX register is loaded with the low-order 32 bits. (On processors that support the Intel 64 architecture, the high-order 32 bits of each of RAX and RDX are cleared.)
The processor monotonically increments the time-stamp counter MSR every clock cycle and resets it to 0 whenever the processor is reset. See "Time Stamp Counter" in Chapter 17 of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B, for specific details of the time stamp counter behavior.

Time Stamp Counter

I am using time stamp counter in my C++ programme by querying the register. However, one problem I encounter is that the function to acquire the time stamp would acquire from different CPU. How could I ensure that my function would always acquire the timestamp from the same CPU or is there anyway to synchronize the CPU? By the way, my programme is running on 4 cores server in Fedora 13 64 bit.
Thanks.

Look at the following excerpt from Intel manual. According to section 16.12, I think the "newer processors" below refers to any processor newer than pentium 4. You can simultaneously and atomically determine the tsc value and the core ID using the rdtscp instruction if it is supported. I haven't tried it though. Good Luck.
Intel 64 and IA-32 Architectures Software Developer's Manual
Volume 3 (3A & 3B): System Programming Guide:
Chapter 16.12.1 Invariant TSC
The time stamp counter in newer processors may support an enhancement, referred
to as invariant TSC. Processor’s support for invariant TSC is indicated by
CPUID.80000007H:EDX[8].
The invariant TSC will run at a constant rate in all ACPI P-, C-. and T-states. This is
the architectural behavior moving forward. On processors with invariant TSC
support, the OS may use the TSC for wall clock timer services (instead of ACPI or
HPET timers). TSC reads are much more efficient and do not incur the overhead
associated with a ring transition or access to a platform resource.
Intel also has a guide on code execution benchmarking that discusses cpu association with rdtsc - http://download.intel.com/embedded/software/IA/324264.pdf

In my experience, it is wise to avoid TSC altogether, unless you really want to measure individual clock cycles on individual cores/CPUs.
Potential problems with TSC:
Frequency scaling. Counter does not increment linearly with time...
Different clocks on different CPUs/cores (I would not rule out different frequency scaling on different CPUs, or even differently clocked CPUs - though the latter should be rare).
Unsynchronized counters on different CPUs/cores (even if they use the same frequency).
This basically boils down to that you can only use the TSC to measure elapsed CPU cycles (not elapsed time) on a single CPU in a single threaded application, if you force the affinity for the thread.
The preferred alternative is to use system functions. The most portable (on Unix/Mac) is gettimeofday(), which is usually very accurate. A more appropriate function might be clock_gettime(), but check if it is supported on your system first. Under Windows you can safely use QueryPerformanceCounter().

You can use sched_setaffinity or cpuset feature that lets you create a cpuset and assign tasks to the set.

Is gettimeofday() guaranteed to be of microsecond resolution?

I am porting a game, that was originally written for the Win32 API, to Linux (well, porting the OS X port of the Win32 port to Linux).
I have implemented QueryPerformanceCounter by giving the uSeconds since the process start up:
BOOL QueryPerformanceCounter(LARGE_INTEGER* performanceCount)
{
gettimeofday(&currentTimeVal, NULL);
performanceCount->QuadPart = (currentTimeVal.tv_sec - startTimeVal.tv_sec);
performanceCount->QuadPart *= (1000 * 1000);
performanceCount->QuadPart += (currentTimeVal.tv_usec - startTimeVal.tv_usec);
return true;
}
This, coupled with QueryPerformanceFrequency() giving a constant 1000000 as the frequency, works well on my machine, giving me a 64-bit variable that contains uSeconds since the program's start-up.
So is this portable? I don't want to discover it works differently if the kernel was compiled in a certain way or anything like that. I am fine with it being non-portable to something other than Linux, however.

Maybe. But you have bigger problems. gettimeofday() can result in incorrect timings if there are processes on your system that change the timer (ie, ntpd). On a "normal" linux, though, I believe the resolution of gettimeofday() is 10us. It can jump forward and backward and time, consequently, based on the processes running on your system. This effectively makes the answer to your question no.
You should look into clock_gettime(CLOCK_MONOTONIC) for timing intervals. It suffers from several less issues due to things like multi-core systems and external clock settings.
Also, look into the clock_getres() function.

High Resolution, Low Overhead Timing for Intel Processors
If you're on Intel hardware, here's how to read the CPU real-time instruction counter. It will tell you the number of CPU cycles executed since the processor was booted. This is probably the finest-grained counter you can get for performance measurement.
Note that this is the number of CPU cycles. On linux you can get the CPU speed from /proc/cpuinfo and divide to get the number of seconds. Converting this to a double is quite handy.
When I run this on my box, I get
11867927879484732
11867927879692217
it took this long to call printf: 207485
Here's the Intel developer's guide that gives tons of detail.
#include <stdio.h>
#include <stdint.h>
inline uint64_t rdtsc() {
uint32_t lo, hi;
__asm__ __volatile__ (
"xorl %%eax, %%eax\n"
"cpuid\n"
"rdtsc\n"
: "=a" (lo), "=d" (hi)
:
: "%ebx", "%ecx");
return (uint64_t)hi << 32 | lo;
}
main()
{
unsigned long long x;
unsigned long long y;
x = rdtsc();
printf("%lld\n",x);
y = rdtsc();
printf("%lld\n",y);
printf("it took this long to call printf: %lld\n",y-x);
}

#Bernard:
I have to admit, most of your example went straight over my head. It does compile, and seems to work, though. Is this safe for SMP systems or SpeedStep?
That's a good question... I think the code's ok.
From a practical standpoint, we use it in my company every day,
and we run on a pretty wide array of boxes, everything from 2-8 cores.
Of course, YMMV, etc, but it seems to be a reliable and low-overhead
(because it doesn't make a context switch into system-space) method
of timing.
Generally how it works is:
declare the block of code to be assembler (and volatile, so the
optimizer will leave it alone).
execute the CPUID instruction. In addition to getting some CPU information
(which we don't do anything with) it synchronizes the CPU's execution buffer
so that the timings aren't affected by out-of-order execution.
execute the rdtsc (read timestamp) execution. This fetches the number of
machine cycles executed since the processor was reset. This is a 64-bit
value, so with current CPU speeds it will wrap around every 194 years or so.
Interestingly, in the original Pentium reference, they note it wraps around every
5800 years or so.
the last couple of lines store the values from the registers into
the variables hi and lo, and put that into the 64-bit return value.
Specific notes:
out-of-order execution can cause incorrect results, so we execute the
"cpuid" instruction which in addition to giving you some information
about the cpu also synchronizes any out-of-order instruction execution.
Most OS's synchronize the counters on the CPUs when they start, so
the answer is good to within a couple of nano-seconds.
The hibernating comment is probably true, but in practice you
probably don't care about timings across hibernation boundaries.
regarding speedstep: Newer Intel CPUs compensate for the speed
changes and returns an adjusted count. I did a quick scan over
some of the boxes on our network and found only one box that
didn't have it: a Pentium 3 running some old database server.
(these are linux boxes, so I checked with: grep constant_tsc /proc/cpuinfo)
I'm not sure about the AMD CPUs, we're primarily an Intel shop,
although I know some of our low-level systems gurus did an
AMD evaluation.
Hope this satisfies your curiosity, it's an interesting and (IMHO)
under-studied area of programming. You know when Jeff and Joel were
talking about whether or not a programmer should know C? I was
shouting at them, "hey forget that high-level C stuff... assembler
is what you should learn if you want to know what the computer is
doing!"

You may be interested in Linux FAQ for clock_gettime(CLOCK_REALTIME)

Wine is actually using gettimeofday() to implement QueryPerformanceCounter() and it is known to make many Windows games work on Linux and Mac.
Starts http://source.winehq.org/source/dlls/kernel32/cpu.c#L312
leads to http://source.winehq.org/source/dlls/ntdll/time.c#L448

So it says microseconds explicitly, but says the resolution of the system clock is unspecified. I suppose resolution in this context means how the smallest amount it will ever be incremented?
The data structure is defined as having microseconds as a unit of measurement, but that doesn't mean that the clock or operating system is actually capable of measuring that finely.
Like other people have suggested, gettimeofday() is bad because setting the time can cause clock skew and throw off your calculation. clock_gettime(CLOCK_MONOTONIC) is what you want, and clock_getres() will tell you the precision of your clock.

The actual resolution of gettimeofday() depends on the hardware architecture. Intel processors as well as SPARC machines offer high resolution timers that measure microseconds. Other hardware architectures fall back to the system’s timer, which is typically set to 100 Hz. In such cases, the time resolution will be less accurate.
I obtained this answer from High Resolution Time Measurement and Timers, Part I

This answer mentions problems with the clock being adjusted. Both your problems guaranteeing tick units and the problems with the time being adjusted are solved in C++11 with the <chrono> library.
The clock std::chrono::steady_clock is guaranteed not to be adjusted, and furthermore it will advance at a constant rate relative to real time, so technologies like SpeedStep must not affect it.
You can get typesafe units by converting to one of the std::chrono::duration specializations, such as std::chrono::microseconds. With this type there's no ambiguity about the units used by the tick value. However, keep in mind that the clock doesn't necessarily have this resolution. You can convert a duration to attoseconds without actually having a clock that accurate.

From my experience, and from what I've read across the internet, the answer is "No," it is not guaranteed. It depends on CPU speed, operating system, flavor of Linux, etc.

Reading the RDTSC is not reliable in SMP systems, since each CPU maintains their own counter and each counter is not guaranteed to by synchronized with respect to another CPU.
I might suggest trying clock_gettime(CLOCK_REALTIME). The posix manual indicates that this should be implemented on all compliant systems. It can provide a nanosecond count, but you probably will want to check clock_getres(CLOCK_REALTIME) on your system to see what the actual resolution is.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string