Does the output of running turbostat --debug show max turbo per core per socket?
Below is an example output. If I have, say, a 2-socket server, does that mean 2 active cores per socket can boost to 4 GHz (4 cores in total), or is the limit over both sockets, so only 1 core per socket = 2 cores in total?
cpu0: MSR_NHM_TURBO_RATIO_LIMIT: 0x25262727
37 * 100 = 3700 MHz max turbo 4 active cores
38 * 100 = 3800 MHz max turbo 3 active cores
39 * 100 = 3900 MHz max turbo 2 active cores
40 * 100 = 4000 MHz max turbo 1 active cores
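For reference, here is a rough sketch of how those lines can be reproduced by hand, assuming the msr-tools package (rdmsr) is installed and the msr kernel module is loaded; MSR 0x1ad (the turbo ratio limit MSR turbostat is reading) holds one byte per "N active cores" bucket:
sudo modprobe msr
val=$(sudo rdmsr -p 0 0x1ad)                  # raw hex value, without the 0x prefix
for n in 1 2 3 4; do
    ratio=$(( (0x$val >> ((n - 1) * 8)) & 0xff ))
    echo "max turbo with $n active cores: $(( ratio * 100 )) MHz"
done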
I also presume a core here is made up of 2 CPUs (hyper-threads), i.e. a dual-socket machine with 20 cores per socket has 40 CPUs per socket?
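For what it's worth, lscpu (part of util-linux) summarizes the thread/core/socket breakdown, which should make that easy to confirm:
lscpu | grep -E 'Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)'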
Any help is much appreciated - thanks!
I'm trying to understand what the node distances in numactl --hardware mean.
On our cluster, it outputs the following:
numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 12 13 14 15 16 17
node 0 size: 32143 MB
node 0 free: 188 MB
node 1 cpus: 6 7 8 9 10 11 18 19 20 21 22 23
node 1 size: 32254 MB
node 1 free: 69 MB
node distances:
node 0 1
0: 10 21
1: 21 10
This is what I understood so far:
We have 24 virtual CPUs, and each node has 32 GB of DRAM.
On a NUMA system, we have to make a "hop" to access the memory on the other node, and this incurs a higher latency.
In this context, do the numbers 10 and 21 indicate the latencies for those "hops"? How do I find the latency in ns? Is that specified somewhere?
This and this didn't help me much.
EDIT: This link says that the distances are not in ns, but are relative distances. How do I get the absolute latency in ns?
Any help will be appreciated.
numactl --hardware gives you stats about the architecture of your hardware, not about its performance.
If you want the performance characteristics of your hardware, you will have to measure them yourself, either by finding an existing benchmark online or by writing your own.
https://stackoverflow.com/a/47815885/1411628 will give you an idea on how to get started on writing your own bench.
To get absolute latency numbers on an Intel system, you can use Intel's Memory Latency Checker tool on the specific machine: https://software.intel.com/en-us/articles/intel-memory-latency-checker
It prefers to run with root/admin privileges so it can disable the hardware prefetchers, which otherwise skew the numbers. If you don't have that, the docs point out that you can ask it to access random elements on the other nodes instead, which gets very close to the true numbers, e.g.:
./mlc --latency_matrix -e -l128 -r
Intel(R) Memory Latency Checker - v3.5
Command line parameters: --latency_matrix -e -l128 -r
Using buffer size of 200.000MB
Measuring idle latencies (in ns)...
                Numa node
Numa node            0       1
       0         112.5   180.3
       1         180.8   112.4
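If mlc is not an option, a rough local-vs-remote comparison can also be made by pinning any memory-bound test of your own under numactl and comparing the run times (./membench below is just a placeholder for such a benchmark):
numactl --cpunodebind=0 --membind=0 ./membench   # CPU and memory both on node 0
numactl --cpunodebind=0 --membind=1 ./membench   # CPU on node 0, memory on node 1 (one hop)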
Server: I have servers with two Intel CPUs each, with either 10 or 8 cores per CPU. So with Intel HT enabled, some have 40 logical cores and some have 32.
Background: I am running our application, which isolates CPUs. Currently I isolate the last 32 cores (cores 8-39) for that application and keep 4 cores (cores 4-7) for other use (normally they run at about 50% sys CPU). I want to assign cores 0-3 for system IRQ handling, because right now, when the application runs, system response is very slow; I think some IRQ requests are being dispatched to cores 4-7, which causes the slow response.
Do you think it is possible to use just 4 cores to handle the system IRQs?
If you have more than one socket ("stone"), that means you have a NUMA system.
Here is a link to get more info: https://en.wikipedia.org/wiki/Non-uniform_memory_access
Try to use CPUs on the same socket. Below I will explain why and how to do that.
Determine exactly which CPU IDs are located on each socket:
% numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22
node 0 size: 24565 MB
node 0 free: 2069 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23
node 1 size: 24575 MB
node 1 free: 1806 MB
node distances:
node 0 1
0: 10 20
1: 20 10
Here "node" means "socket" (stone). So 0,2,4,6 CPUs are located on the same node.
And it makes sense to move all IRQs into one node to use L3 cache for set of CPUs.
Isolate all CPUs except 0,2,4,6.
You need to add an argument to the Linux kernel boot command line:
isolcpus=cpu_number[,cpu_number,...]
for example
isolcpus=1,3,5,7-23
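After rebooting with that parameter, you can check what actually took effect (the isolated sysfs file is present on reasonably recent kernels):
cat /proc/cmdline                           # the isolcpus= argument should appear here
cat /sys/devices/system/cpu/isolated        # should list 1,3,5,7-23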
Check which IRQs are running on which CPUs:
cat /proc/interrupts
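To actually steer an IRQ onto CPUs 0,2,4,6 you write a CPU bitmask into its smp_affinity file (a sketch; IRQ 24 is just an example number, and irqbalance must be stopped first or it will rewrite the mask). As root:
systemctl stop irqbalance
echo 55 > /proc/irq/24/smp_affinity        # bitmask 0x55 = CPUs 0,2,4,6
cat /proc/irq/24/smp_affinity_list         # should now show 0,2,4,6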
Start your application with the numactl command to align it to specific CPUs and memory.
(Here you need to understand what NUMA and alignment are; please follow the link at the beginning of this answer.)
numactl [--membind=nodes] [--cpunodebind=nodes]
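For example, to keep both the CPUs and the memory of your application on node 0 (./your_application is a placeholder for your binary):
numactl --cpunodebind=0 --membind=0 ./your_application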
Your question is much bigger than what I have covered here.
If you see that the system is slow, you need to understand the bottleneck.
Try to gather raw info with top, vmstat and iostat to find the weak point.
Provide some stats from your system and I will help you tune it the right way.
I currently use a Standard DS15 v2 on Azure, but I experience huge lag because the game I want to run (Minecraft) does not support multiple cores (Minecraft does its work on 1 core).
I was advised that a beefy 20-core CPU with a low clock speed is worse than a dual-core with a high clock speed.
So, which VM size should I choose for a high-clock-speed server?
FYI, the Standard DS15 v2 offers me a Xeon E5.
I've found benchmarks from Microsoft; according to the article they are run on all cores, but you can deduce single-core performance from them.
Also this table:
SKU Family          ACU/Core
Standard_A0         50
Standard_A1-4       100
Standard_A5-7       100
Standard_A1-8v2     100
Standard_A2m-8mv2   100
A8-A11              225*
D1-14               160
D1-15v2             210-250*
DS1-14              160
DS1-15v2            210-250*
F1-F16              210-250*
F1s-F16s            210-250*
G1-5                180-240*
GS1-5               180-240*
H                   290-300*
ACUs marked with a * use Intel® Turbo technology to increase CPU frequency and provide a performance boost. The amount of the boost can vary based on the VM size, workload, and other workloads running on the same host.
I am using Ubuntu 15.04 on a two-socket POWER8 machine; each socket has 10 cores. numactl -H outputs:
available: 4 nodes (0-3)
node 0 cpus: 0 8 16 24 32
node 0 size: 30359 MB
node 0 free: 26501 MB
node 1 cpus: 40 48 56 64 72
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 80 88 96 104 112
node 2 size: 30425 MB
node 2 free: 27884 MB
node 3 cpus: 120 128 136 144 152
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node 0 1 2 3
0: 10 20 40 40
1: 20 10 40 40
2: 40 40 10 20
3: 40 40 20 10
The problem is: are there two NUMA nodes on each POWER8 processor? And why does one node have memory while the other has none? I can't find any documentation about this. Any information would be appreciated.
A further question: if there are two nodes on a socket, are their last-level caches shared like between NUMA nodes (the same data can reside in all of the caches), or like within a single socket (only one copy can exist)?
Scale-out POWER8 systems use Dual-Chip Modules (DCMs). As the name suggests, a DCM packages two multi-core chips with some additional logic within the same physical package. There is an on-package cache-coherent 32 GBps interconnect (misleadingly called an SMP bus) between the two chips and two separate paths to the external memory buffers, one for each chip. Thus, each socket is a dual-node NUMA system itself, similar to e.g., the multi-module AMD Opterons. In your case, all of the memory local to a given socket is probably installed in the slots belonging to the first chip of that socket only, therefore the second NUMA domain shows up as 0 MB.
Both the on-package (X bus) and inter-package (A bus) interconnects are cache-coherent, i.e. the L3 caches are kept in sync. Within a multi-core chip, each core is directly connected to a region of L3 cache and through the chip interconnect has access to all other L3 caches of the same chip, i.e. a NUCA (Non-Uniform Cache Architecture).
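If you want to verify the sharing sets on your own machine, the kernel exposes them in sysfs (cpu0 below is just taken as an example):
for idx in /sys/devices/system/cpu/cpu0/cache/index*; do
    echo "L$(cat $idx/level) $(cat $idx/type): shared with CPUs $(cat $idx/shared_cpu_list)"
done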
For more information, see the logical diagram of an S824 system in this Redpaper.
I have a computer with 2 Intel Xeon CPUs and 48 GB of RAM. The RAM is divided between the CPUs into two parts of 24 GB each. How can I check how much of each specific part is used?
So I need something like htop, which shows how heavily each core is used (see this example), but for memory rather than for cores. Or something that would show which parts (addresses) of memory are used and which are not.
The information is in /proc/zoneinfo; it contains very similar information to /proc/vmstat, except broken down by "Node" (NUMA ID). I don't have a NUMA system here to test it for you and provide a sample output for a multi-node config; it looks like this on a one-node machine:
Node 0, zone DMA
pages free 2122
min 16
low 20
high 24
scanned 0
spanned 4096
present 3963
[ ... followed by /proc/vmstat-like nr_* values ]
Node 0, zone Normal
pages free 17899
min 932
low 1165
high 1398
scanned 0
spanned 223230
present 221486
nr_free_pages 17899
nr_inactive_anon 3028
nr_active_anon 0
nr_inactive_file 48744
nr_active_file 118142
nr_unevictable 0
nr_mlock 0
nr_anon_pages 2956
nr_mapped 96
nr_file_pages 166957
[ ... more of those ... ]
Node 0, zone HighMem
pages free 5177
min 128
low 435
high 743
scanned 0
spanned 294547
present 292245
[ ... ]
I.e. a small statistic on total usage/availability, followed by the nr_* values also found at the system-global level in /proc/vmstat (which then allow a further breakdown of what exactly the memory is used for).
If you have more than one memory node, aka NUMA, you'll see these zones for all nodes.
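For a quick per-node total without reading all of that by eye, the zone counters can be summed up, for example with awk (a rough sketch, assuming the usual 4 KiB page size):
awk '/^Node/ { gsub(",", "", $2); node = $2 }
     $1 == "nr_free_pages" { free[node] += $2 }
     END { for (n in free) printf "node %s: %d MB free\n", n, free[n] * 4 / 1024 }' /proc/zoneinfo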
Edit
I'm not aware of a frontend for this (i.e. a per-NUMA-node vmstat, or a "numa-top" in the way that htop is a top), but please comment if anyone knows of one!
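Not a full frontend either, but the per-node totals are also exposed directly in sysfs, and numastat from the numactl package prints a readable per-node memory breakdown:
cat /sys/devices/system/node/node0/meminfo   # MemTotal/MemFree/MemUsed for node 0
numastat -m                                  # per-node meminfo for all nodes, in MB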
The numactl --hardware command will give you a short answer like this:
node 0 cpus: 0 1 2 3 4 5
node 0 size: 49140 MB
node 0 free: 25293 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 49152 MB
node 1 free: 20758 MB
node distances:
node 0 1
0: 10 21
1: 21 10