The below UNIX command:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
shows,
CPU(s): 4
Thread(s) per core: 2
which I read as 4 × 2 = 8 logical processors. Correct me if I'm wrong.
Below is another Linux command:
$ cat /proc/cpuinfo
processor : 0
....
cpu cores : 2
.....
processor : 1
.....
cpu cores : 2
.....
processor : 2
.....
cpu cores : 2
.....
processor : 3
.....
cpu cores : 2
.....
$
But the below program shows only 4 logical processors:
package main
import (
"fmt"
"runtime"
)
func main() {
fmt.Println(runtime.GOMAXPROCS(0)) // gives 4
fmt.Println(runtime.NumCPU()) // gives 4
}
Output:
$ go install github.com/myhub/cs61a
$ bin/cs61a
4
4
code$
More details:
$ go version
go version go1.14.1 linux/amd64
$ uname -a
Linux mohet01-ubuntu 4.15.0-99-generic #100-Ubuntu SMP Wed Apr 22 20:32:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Documentation says,
NumCPU returns the number of logical CPUs usable by the current process.
My understanding is,
The Go scheduler creates OS threads (M), whose number should match the number of logical processors.
Why does the runtime API not give the value 8?
According to all of your listings above, you have:
two cores per socket
and one socket
which means you have only two CPU cores. However, you also have:
two threads per core
which means your two cores can run up to four threads simultaneously (with some potential drawbacks, but in most cases this should offer significantly more computing power than using two threads total).
Since that's the actual hardware limit, the fact that Go computes the number of threads to use as 4 seems correct.
(I think the reason you are counting to eight is that you are assuming that each of the "cpus" that Linux reports supports two threads. That is not the case: there are only two physical cores, but each supports two threads, so Linux reports this as four "cpus".)
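Putting the lscpu numbers together:
Socket(s) × Core(s) per socket × Thread(s) per core = 1 × 2 × 2 = 4 logical CPUs
which is exactly the value runtime.NumCPU() reports. Thread(s) per core multiplies the core count, not the CPU(s) total, which already includes it.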
Related
I'm running a very "simple" test with:
@Fork(value = 1, jvmArgs = { "--illegal-access=permit", "-Xms10G", "-XX:+UnlockDiagnosticVMOptions", "-XX:+DebugNonSafepoints", "-XX:ActiveProcessorCount=7",
"-XX:+UseNUMA",
"-XX:+UnlockDiagnosticVMOptions", "-XX:DisableIntrinsic=_currentTimeMillis,_nanoTime",
"-Xmx10G", "-XX:+UnlockExperimentalVMOptions", "-XX:ConcGCThreads=5", "-XX:ParallelGCThreads=10", "-XX:+UseZGC", "-XX:+UsePerfData", "-XX:MaxMetaspaceSize=10G", "-XX:MetaspaceSize=256M" }
)
@Benchmark
public String generateRandom() {
return UUID.randomUUID().toString();
}
Maybe it's not that simple, because it uses random, but the same issue shows up in any other Java test.
On my home desktop:
Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz, 12 threads (hyperthreading enabled), 64 GB RAM, "Ubuntu" VERSION="20.04.2 LTS (Focal Fossa)"
Linux homepc 5.8.0-59-generic #66~20.04.1-Ubuntu SMP Thu Jun 17 11:14:10 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Performance with 7 threads:
Benchmark Mode Cnt Score Error Units
RulesBenchmark.generateRandom thrpt 5 1312295.357 ± 27853.707 ops/s
Flame graph (async-profiler) result with 7 threads, at home
I have an issue on Oracle Linux:
Linux 5.4.17-2102.201.3.el8uek.x86_64 #2 SMP Fri Apr 23 09:05:57 PDT 2021 x86_64 x86_64 x86_64 GNU/Linux
Intel(R) Xeon(R) Gold 6258R CPU @ 2.70GHz with 56 threads (hyperthreading disabled; the same when it is enabled and there are 112 CPU threads) and 1 TB RAM, NAME="Oracle Linux Server" VERSION="8.4". Here I get half the performance, even when increasing the thread count.
With 1 thread, I get very good performance:
Benchmark Mode Cnt Score Error Units
RulesBenchmark.generateRandom thrpt 5 2377471.113 ± 8049.532 ops/s
Flame graph (async-profiler) result with 1 thread
But with 7 threads:
Benchmark Mode Cnt Score Error Units
RulesBenchmark.generateRandom thrpt 5 688612.296 ± 70895.058 ops/s
Flame graph (async-profiler) result with 7 threads
Maybe it's a NUMA issue, because there are 2 sockets and the system is configured with only 1 NUMA node:
numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
node 0 size: 1030835 MB
node 0 free: 1011029 MB
node distances:
node 0
0: 10
But after disabling some CPU threads using:
for i in {12..55}
do
  # take CPU $i offline
  echo '0' | sudo tee /sys/devices/system/cpu/cpu$i/online
done
Performance improved a little, but not much.
This is just a very "simple" test. On complex tests with real code, it's even worse.
It spends a lot of time on .annobin___pthread_cond_signal.start
I also deployed a Vagrant image with the same Oracle Linux and kernel version on my home desktop and ran it with 10 CPU threads; performance was nearly the same (~1M ops/sec) as on my desktop. So it's not about the OS or kernel, but about some configuration.
Tested with several JDK versions and vendors (JDK 11 and above). Performance differs a little when using OpenJDK 11 from the YUM distribution, but not significantly.
Can you suggest some advice?
Thanks in advance.
In essence, your benchmark tests the throughput of SecureRandom. The default implementation is synchronized (more precisely, the default NativePRNG implementation mixes input from /dev/urandom with the output of an internal SHA1PRNG, all under a global lock).
The paradox is, more threads result in more contention, and thus lower overall performance, as the main part of the algorithm is under a global lock anyway. Async-profiler indeed shows that the bottleneck is the synchronization on a Java monitor: __lll_unlock_wake, __pthread_cond_wait, __pthread_cond_signal - all come from that synchronization.
The contention overhead definitely depends on the hardware, the firmware, and the OS configuration. Instead of trying to reduce this overhead (which can be hard; some day yet another security patch may arrive that makes syscalls 2x slower, for example), I'd suggest getting rid of the contention in the first place.
This can be achieved by installing a different, non-blocking SecureRandom provider as shown in this answer. I won't give a recommendation on a particular SecureRandomSpi, as it depends on your specific requirements (throughput/scalability/security). I'll just mention that an implementation can be based on (a rough sketch follows the list):
rdrand
OpenSSL
raw /dev/urandom access, etc.
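As an illustration only (this helper class is mine, not the answer's recommendation): one way to remove the contention without installing a provider is to build version-4 UUIDs from a per-thread SHA1PRNG, which reads the seed source once per thread and never takes the shared /dev/urandom lock afterwards.
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import java.util.UUID;

public final class ThreadLocalUuid {
    // One SHA1PRNG per thread; unlike NativePRNG it does not go back
    // to /dev/urandom (and its global lock) after the initial seeding.
    private static final ThreadLocal<SecureRandom> RNG =
            ThreadLocal.withInitial(() -> {
                try {
                    return SecureRandom.getInstance("SHA1PRNG");
                } catch (NoSuchAlgorithmException e) {
                    throw new IllegalStateException(e);
                }
            });

    public static UUID randomUuid() {
        byte[] b = new byte[16];
        RNG.get().nextBytes(b);
        b[6] = (byte) ((b[6] & 0x0f) | 0x40); // version 4
        b[8] = (byte) ((b[8] & 0x3f) | 0x80); // IETF variant
        long msb = 0, lsb = 0;
        for (int i = 0; i < 8; i++)  msb = (msb << 8) | (b[i] & 0xff);
        for (int i = 8; i < 16; i++) lsb = (lsb << 8) | (b[i] & 0xff);
        return new UUID(msb, lsb);
    }
}
Whether SHA1PRNG is acceptable depends on your security requirements; the point is only that per-thread state eliminates the monitor the profiler flagged.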
In the Presto documentation, the definition of task.max-worker-threads is:
Type: integer
Default value: Node CPUs * 2
Sets the number of threads used by workers to process splits. Increasing this number can improve throughput if worker CPU utilization is low and all the threads are in use.
Does it mean this:
{number of CPU} X {Thread(s) per core}
for example
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16 <--
On-line CPU(s) list: 0-15
Thread(s) per core: 1 <--
Core(s) per socket: 16
Socket(s): 1
So (according to lscpu) do we need to set:
task.max-worker-threads=16
Another example:
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16 <--
On-line CPU(s) list: 0-15
Thread(s) per core: 2 <--
Core(s) per socket: 16
Socket(s): 1
In this case my guess is we need to set (16 × 2):
task.max-worker-threads=32
Am I right?
The configuration file with task.max-worker-threads:
more config.properties
coordinator=false
http-server.http.port=1010
task.max-worker-threads=$VAL
discovery.uri=http://coordinator01:1010
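One thing worth noting (my assumption, not something the Presto docs spell out): since Presto runs on the JVM, "Node CPUs" most likely means the logical CPUs the JVM sees, which on Linux equals the CPU(s) line of lscpu and already includes hardware threads. In that case there is nothing to multiply by Thread(s) per core again. A quick way to check what the JVM sees:
public class NodeCpus {
    public static void main(String[] args) {
        // Logical CPUs visible to the JVM; on Linux this matches the
        // CPU(s) value reported by lscpu (hardware threads included).
        System.out.println(Runtime.getRuntime().availableProcessors());
    }
}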
Server: I have servers with two Intel CPUs of 10 cores or 8 cores each. So some have 40 logical cores and some have 32 (with Intel HT enabled).
Background: I am running our application, which isolates CPUs. Currently I isolate the last 32 cores (cores 8-39) for that application, and 4 cores (cores 4-7) for other use (normally around 50% sys CPU). And I want to assign cores 0-3 for system IRQ usage, since currently, when I run the application, system response is very slow; I think some IRQ requests have been dispatched to cores 4-7, which causes the slow response.
Do you think it is possible to use just 4 cores to handle system IRQs?
If you have more than one socket ("stone"), that means you have a NUMA system.
Here is a link with more info: https://en.wikipedia.org/wiki/Non-uniform_memory_access
Try to use CPUs on the same socket. Below I will explain why and how to do that.
Determine exactly which CPU IDs are located on each socket:
% numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22
node 0 size: 24565 MB
node 0 free: 2069 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23
node 1 size: 24575 MB
node 1 free: 1806 MB
node distances:
node 0 1
0: 10 20
1: 20 10
Here "node" means "socket" (stone). So 0,2,4,6 CPUs are located on the same node.
And it makes sense to move all IRQs into one node to use L3 cache for set of CPUs.
Isolate all CPUs except 0,2,4,6.
You need to add an argument to the Linux kernel command line:
isolcpus=cpu_number[,cpu_number,...]
for example
isolcpus=1,3,5,7-31
Check which IRQs are running on which CPUs:
cat /proc/interrupts
(To steer an IRQ onto specific CPUs, write a CPU mask to /proc/irq/<N>/smp_affinity.)
Start your application with the numactl command to align it to CPUs and memory.
(Here you need to understand what NUMA and alignment are; please follow the link at the beginning of this answer.)
numactl [--membind=nodes] [--cpunodebind=nodes], for example: numactl --cpunodebind=0 --membind=0 ./your_app
Your question is much bigger than what I covered here.
If you see the system is slow, you need to find the bottleneck.
Try to gather raw info with top, vmstat, and iostat to find the weak point.
Provide some stats from your system and I will help you tune it the right way.
The number of columns in /proc/stat differs across platforms. Will the meaning of the i-th column change?
For example, the man page http://man7.org/linux/man-pages/man5/proc.5.html shows ten value columns, but my Linux shows only eight.
[root@localhost ~]# cat /proc/stat
cpu 148894509 5214 64962656 4534478045 18407228 6482288 24487520 0
cpu0 71026365 2633 34928452 2246110103 18371398 6482288 21933024 0
cpu1 77868143 2580 30034204 2288367942 35829 0 2554495 0
The link clearly states the kernel versions that introduced those columns:
steal (since Linux 2.6.11)
guest (since Linux 2.6.24)
guest_nice (since Linux 2.6.33)
If your kernel version is at least 2.6.11 but older than 2.6.24, you will see 8 columns. The order and meaning of the existing columns remain the same; newer kernels only append new columns at the end.
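For completeness, a small sketch (the class name is mine) that labels whichever of those columns the running kernel emits, relying on the fixed ordering just described:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ProcStatColumns {
    // Column names in kernel order; older kernels simply omit the tail.
    private static final String[] NAMES = {
        "user", "nice", "system", "idle", "iowait",
        "irq", "softirq", "steal", "guest", "guest_nice"
    };

    public static void main(String[] args) throws IOException {
        for (String line : Files.readAllLines(Paths.get("/proc/stat"))) {
            if (!line.startsWith("cpu ")) continue; // aggregate line only
            String[] f = line.trim().split("\\s+");
            for (int i = 1; i < f.length && i <= NAMES.length; i++) {
                System.out.printf("%-10s %s%n", NAMES[i - 1], f[i]);
            }
        }
    }
}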
Trying to determine the Processor Queue Length (the number of processes that are ready to run but currently aren't running) on a Linux machine. There is a WMI call in Windows for this metric, but not knowing much about Linux, I'm trying to mine /proc and top for the information. Is there a way to determine the queue length for the CPU?
Edit to add: Microsoft's words concerning their metric: "The collection of one or more threads that is ready but not able to run on the processor due to another active thread that is currently running is called the processor queue."
sar -q will report queue length, task list length and three load averages.
Example:
matli@tornado:~$ sar -q 1 0
Linux 2.6.27-9-generic (tornado) 01/13/2009 _i686_
11:38:32 PM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
11:38:33 PM 0 305 1.26 0.95 0.54
11:38:34 PM 4 305 1.26 0.95 0.54
11:38:35 PM 1 306 1.26 0.95 0.54
11:38:36 PM 1 306 1.26 0.95 0.54
^C
vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 0 256368 53764 75980 220564 2 28 60 54 774 1343 15 4 78 2
The first column (r) is the run queue - 2 on my machine right now
Edit: Surprised there isn't a way to just get the number
Quick 'n' dirty way to get the number (the field position may vary a little on different machines, so awk is more robust than cut here):
vmstat | tail -1 | awk '{print $1}'
The metrics you seek exist in /proc/schedstat.
The format of this file is described in sched-stats.txt in the kernel source. Specifically, the cpu<N> lines are what you want:
CPU statistics
--------------
cpu<N> 1 2 3 4 5 6 7 8 9
First field is a sched_yield() statistic:
1) # of times sched_yield() was called
Next three are schedule() statistics:
2) This field is a legacy array expiration count field used in the O(1)
scheduler. We kept it for ABI compatibility, but it is always set to zero.
3) # of times schedule() was called
4) # of times schedule() left the processor idle
Next two are try_to_wake_up() statistics:
5) # of times try_to_wake_up() was called
6) # of times try_to_wake_up() was called to wake up the local cpu
Next three are statistics describing scheduling latency:
7) sum of all time spent running by tasks on this processor (in jiffies)
8) sum of all time spent waiting to run by tasks on this processor (in
jiffies)
9) # of timeslices run on this cpu
In particular, field 8. To find the run queue length, you would:
Observe field 8 for each CPU and record the value.
Wait for some interval.
Observe field 8 for each CPU again, and calculate how much the value has increased.
Dividing that difference by the length of the time interval waited (the documentation says it's in jiffies, but it's actually in nanoseconds since the addition of CFS), by Little's Law, yields the mean length of the scheduler run queue over the interval.
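A minimal sketch of that procedure (my own illustration; it assumes the nanosecond units of CFS-era kernels and sums field 8 across all cpu<N> lines):
import java.nio.file.Files;
import java.nio.file.Paths;

public class RunQueueLength {
    // Sum of field 8 (time tasks spent waiting to run) over all cpu<N> lines.
    static long totalWaitNanos() throws Exception {
        long sum = 0;
        for (String line : Files.readAllLines(Paths.get("/proc/schedstat"))) {
            String[] f = line.trim().split("\\s+");
            if (f[0].startsWith("cpu") && f.length >= 9) {
                sum += Long.parseLong(f[8]);
            }
        }
        return sum;
    }

    public static void main(String[] args) throws Exception {
        long intervalMs = 1000;
        long before = totalWaitNanos();
        Thread.sleep(intervalMs);
        long delta = totalWaitNanos() - before;
        // Little's Law: mean number waiting = accumulated wait time / interval.
        System.out.printf("mean run queue length: %.2f%n",
                delta / (intervalMs * 1_000_000.0));
    }
}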
Unfortunately, I'm not aware of any utility that automates this process and is commonly installed, or even packaged, in a Linux distribution. I've not used it, but the kernel documentation suggests http://eaglet.rain.com/rick/linux/schedstat/v12/latency.c, which unfortunately refers to a domain that is no longer resolvable. Fortunately, it's available on the wayback machine.
Why not sar or vmstat?
These tools report the number of currently runnable processes. Certainly if this number is greater than the number of CPUs, some of them must be waiting. However, processes can still be waiting even when the number of processes is less than the number of CPUs, for a variety of reasons:
A process may be pinned to a particular CPU.
The scheduler may decide to schedule a process on a particular CPU to make better utilization of cache, or for NUMA optimization reasons.
The scheduler may intentionally idle a CPU to allow more time to a competing, higher priority process on another CPU that shares the same execution core (a hyperthreading optimization).
Hardware interrupts may be processable only on particular CPUs for a variety of hardware and software reasons.
Moreover, the number of runnable processes is only sampled at an instant in time. In many cases this number may fluctuate rapidly, and the contention may be occurring between the times the metric is being sampled.
These things mean the number of runnable processes minus the number of CPUs is not a reliable indicator of CPU contention.
uptime will give you the recent load average, which is approximately the average number of active processes. uptime reports the load average over the last 1, 5, and 15 minutes. It's a per-system measurement, not per-CPU.
Not sure what the processor queue length in Windows is, hopefully it's close enough to this?