I'm running a very "simple" JMH test:
@Fork(value = 1, jvmArgs = {
        "--illegal-access=permit", "-Xms10G", "-Xmx10G",
        "-XX:+UnlockDiagnosticVMOptions", "-XX:+DebugNonSafepoints",
        "-XX:ActiveProcessorCount=7", "-XX:+UseNUMA",
        "-XX:DisableIntrinsic=_currentTimeMillis,_nanoTime",
        "-XX:+UnlockExperimentalVMOptions", "-XX:ConcGCThreads=5", "-XX:ParallelGCThreads=10",
        "-XX:+UseZGC", "-XX:+UsePerfData",
        "-XX:MaxMetaspaceSize=10G", "-XX:MetaspaceSize=256M"})
@Benchmark
public String generateRandom() {
    return UUID.randomUUID().toString();
}
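For reference, here is roughly the complete class as I run it. The class name matches the results below; the warmup/measurement settings, thread count and JMH runner boilerplate are a simplified reconstruction, and the JVM flags are trimmed to the essentials (built with the usual JMH annotation-processor setup):

import java.util.UUID;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@Threads(7)
@Fork(value = 1, jvmArgs = {"-Xms10G", "-Xmx10G",
        "-XX:+UnlockExperimentalVMOptions", "-XX:+UseZGC",
        "-XX:ActiveProcessorCount=7"})
public class RulesBenchmark {

    @Benchmark
    public String generateRandom() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) throws Exception {
        // Run just this benchmark class through the JMH runner.
        Options opt = new OptionsBuilder()
                .include(RulesBenchmark.class.getSimpleName())
                .build();
        new Runner(opt).run();
    }
}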
Maybe it's not that simple, because it uses random, but the same issue shows up in any other Java test.
On my home desktop
Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz, 12 threads (hyperthreading enabled), 64 GB RAM, Ubuntu 20.04.2 LTS (Focal Fossa)
Linux homepc 5.8.0-59-generic #66~20.04.1-Ubuntu SMP Thu Jun 17 11:14:10 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Performance with 7 threads:
Benchmark Mode Cnt Score Error Units
RulesBenchmark.generateRandom thrpt 5 1312295.357 ± 27853.707 ops/s
Flame graph (async-profiler) result with 7 threads at home
I have an issue on Oracle Linux
Linux 5.4.17-2102.201.3.el8uek.x86_64 #2 SMP Fri Apr 23 09:05:57 PDT 2021 x86_64 x86_64 x86_64 GNU/Linux
Intel(R) Xeon(R) Gold 6258R CPU @ 2.70GHz with 56 threads (hyperthreading disabled; the same happens when it's enabled and there are 112 CPU threads) and 1 TB RAM, Oracle Linux Server 8.4. There I get half the performance, even when increasing the thread count.
With 1 thread, performance is very good:
Benchmark Mode Cnt Score Error Units
RulesBenchmark.generateRandom thrpt 5 2377471.113 ± 8049.532 ops/s
Flame graph (async-profiler) result with 1 thread
But with 7 threads:
Benchmark Mode Cnt Score Error Units
RulesBenchmark.generateRandom thrpt 5 688612.296 ± 70895.058 ops/s
Flame graph (async-profiler) result with 7 threads
Maybe it's a NUMA issue, because there are 2 sockets but the system is configured with only 1 NUMA node:
numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
node 0 size: 1030835 MB
node 0 free: 1011029 MB
node distances:
node 0
0: 10
But after disabling some CPU threads using:
for i in {12..55}
do
  # take CPU thread $i offline
  echo '0' | sudo tee /sys/devices/system/cpu/cpu$i/online
done
performance improved a little, but not much.
This is just a very "simple" test. On a complex test with real code it's even worse:
it spends a lot of time in .annobin___pthread_cond_signal.start.
I also deployed a Vagrant image with the same Oracle Linux and kernel version on my home desktop and ran it with 10 CPU threads; performance was nearly the same (~1M ops/sec) as on my desktop. So it's not about the OS or the kernel, but about some configuration.
Tested with several JDK versions and vendors (JDK 11 and above). Performance is slightly lower with the OpenJDK 11 build from the YUM repository, but the difference is not significant.
Can you suggest some advice?
Thanks in advance
In essence, your benchmark tests the throughput of SecureRandom. The default implementation is synchronized (more precisely, it mixes the input from /dev/urandom with the underlying PRNG provider).
The paradox is that more threads result in more contention, and thus lower overall performance, since the main part of the algorithm runs under a global lock anyway. async-profiler indeed shows that the bottleneck is synchronization on a Java monitor: __lll_unlock_wake, __pthread_cond_wait and __pthread_cond_signal all come from that synchronization.
The contention overhead definitely depends on the hardware, the firmware, and the OS configuration. Instead of trying to reduce this overhead (which can be hard: some day yet another security patch will arrive and make syscalls 2x slower, for example), I'd suggest getting rid of the contention in the first place.
This can be achieved by installing a different, non-blocking SecureRandom provider, as shown in this answer. I won't recommend a particular SecureRandomSpi, as it depends on your specific requirements (throughput/scalability/security). I'll just mention that an implementation can be based on:
rdrand
OpenSSL
raw /dev/urandom access, etc.
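For illustration only, here is a minimal sketch of the last option: a provider backed by raw /dev/urandom reads through a thread-local stream, so engineNextBytes() runs without a shared lock. The provider and algorithm names, the registration in main() and the simplistic error handling are my own choices, not a recommendation:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.security.Provider;
import java.security.SecureRandom;
import java.security.SecureRandomSpi;
import java.security.Security;

public class UrandomProvider extends Provider {

    public UrandomProvider() {
        super("Urandom", "1.0", "Sketch: thread-local /dev/urandom SecureRandom");
        // Legacy-style registration of the SPI under algorithm name "Urandom".
        put("SecureRandom.Urandom", UrandomSpi.class.getName());
    }

    public static final class UrandomSpi extends SecureRandomSpi {
        // One stream per thread -> no shared monitor on the hot path.
        private static final ThreadLocal<FileInputStream> URANDOM =
                ThreadLocal.withInitial(() -> {
                    try {
                        return new FileInputStream("/dev/urandom");
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });

        @Override
        protected void engineSetSeed(byte[] seed) {
            // /dev/urandom is seeded by the kernel; extra seed material is ignored here.
        }

        @Override
        protected void engineNextBytes(byte[] bytes) {
            try {
                FileInputStream in = URANDOM.get();
                int read = 0;
                while (read < bytes.length) {
                    int n = in.read(bytes, read, bytes.length - read);
                    if (n < 0) throw new IOException("EOF on /dev/urandom");
                    read += n;
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        @Override
        protected byte[] engineGenerateSeed(int numBytes) {
            byte[] seed = new byte[numBytes];
            engineNextBytes(seed);
            return seed;
        }
    }

    public static void main(String[] args) throws Exception {
        // Most-preferred position; new SecureRandom() (and thus UUID.randomUUID())
        // will only pick this up if it is registered before first use.
        Security.insertProviderAt(new UrandomProvider(), 1);
        SecureRandom random = SecureRandom.getInstance("Urandom");
        System.out.println(random.nextLong());
    }
}

Whether this actually helps depends on provider ordering and your security requirements, so measure it with your own benchmark before adopting anything like it.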
I'm trying to understand what the node distances reported by numactl --hardware mean.
On our cluster, it outputs the following:
numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 12 13 14 15 16 17
node 0 size: 32143 MB
node 0 free: 188 MB
node 1 cpus: 6 7 8 9 10 11 18 19 20 21 22 23
node 1 size: 32254 MB
node 1 free: 69 MB
node distances:
node 0 1
0: 10 21
1: 21 10
This is what I understand so far:
We have 24 virtual CPUs, and each node has 32 GB of DRAM.
On a NUMA system, we have to make a "hop" to access memory on the other node, and this incurs higher latency.
In this context, do the numbers 10 and 21 indicate the latencies for these "hops"? How do I find the latency in ns? Is that specified somewhere?
This and this didn't help me much.
EDIT: This link says that the distances are not in ns but are relative distances. How do I get the absolute latency in ns?
Any help will be appreciated.
numactl --hardware gives you stats about the architecture of your hardware, not about its performance.
If you want the performance characteristics of your hardware, you will have to measure them yourself, either by finding an existing benchmark online or by writing your own.
https://stackoverflow.com/a/47815885/1411628 will give you an idea of how to get started on writing your own benchmark.
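If you do want to roll your own, here is a rough, hypothetical sketch of the usual pointer-chasing approach: walk a buffer much larger than the caches in a single random cycle, so each load depends on the previous one and the average time per load approximates memory latency. The class name and sizes are arbitrary; pin it with e.g. numactl --cpunodebind=0 --membind=1 java LatencyProbe and compare against a local-node run:

import java.util.Random;

public class LatencyProbe {
    public static void main(String[] args) {
        int n = 64 * 1024 * 1024;          // 64M ints = 256 MB, far beyond L3 (use -Xmx1g or more)
        int[] next = new int[n];

        // Build a single random cycle (Sattolo's algorithm) so the chase
        // visits every slot exactly once in an unpredictable order.
        for (int i = 0; i < n; i++) next[i] = i;
        Random rnd = new Random(42);
        for (int i = n - 1; i > 0; i--) {
            int j = rnd.nextInt(i);
            int tmp = next[i]; next[i] = next[j]; next[j] = tmp;
        }

        int p = 0;
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) p = next[p];   // dependent loads, no overlap
        long elapsed = System.nanoTime() - start;

        System.out.printf("avg load latency: %.1f ns (checksum %d)%n",
                (double) elapsed / n, p);
    }
}

This includes TLB-miss overhead and doesn't disable the prefetchers for you, so treat the result as a rough upper bound rather than the clean figures MLC reports.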
To get absolute latency numbers on an Intel system, you can use Intel's Memory Latency Checker tool: https://software.intel.com/en-us/articles/intel-memory-latency-checker
It prefers to run with root/admin privileges so it can disable the hardware prefetchers, which otherwise skew the numbers. If you don't have that, the docs point out that you can ask it to access random elements across the nodes to get very close to the true numbers, e.g.:
./mlc --latency_matrix -e -l128 -r
Intel(R) Memory Latency Checker - v3.5
Command line parameters: --latency_matrix -e -l128 -r
Using buffer size of 200.000MB
Measuring idle latencies (in ns)...
Numa node
Numa node 0 1
0 112.5 180.3
1 180.8 112.4
Server: I have servers with two Intel CPUs, either 10-core or 8-core, so with Intel HT enabled some have 40 logical cores and some have 32.
Background: I am running our application, which isolates CPUs. Currently I isolate the last 32 cores (cores 8-39) for that application and keep 4 cores (cores 4-7) for other use (normally they run at about 50% system CPU), and I want to assign cores 0-3 for system IRQ handling. Right now, when I run the application, the system response is very slow; I think some IRQ requests get dispatched to cores 4-7, which causes the slow response.
Do you think it's possible to use just 4 cores to handle system IRQs?
If you have more than one socket ("stone"), that means you have a NUMA system.
Here is a link with more info: https://en.wikipedia.org/wiki/Non-uniform_memory_access
Try to use CPUs on the same socket. Below I will explain why and how to do that.
Determine exactly which CPU IDs are located on each socket:
% numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22
node 0 size: 24565 MB
node 0 free: 2069 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23
node 1 size: 24575 MB
node 1 free: 1806 MB
node distances:
node 0 1
0: 10 20
1: 20 10
Here "node" means "socket" (stone). So 0,2,4,6 CPUs are located on the same node.
And it makes sense to move all IRQs into one node to use L3 cache for set of CPUs.
Isolate all CPUs except 0,2,4,6. You need to add an argument to the Linux kernel command line:
isolcpus= cpu_number [, cpu_number ,...]
for example
isolcpus=1,3,5,7-31
Check which IRQs are running on which CPUs:
cat /proc/interrupts
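If you then want to force the IRQs onto CPUs 0-3 (what the question asks for), the usual way is to stop irqbalance and write a CPU mask into /proc/irq/<n>/smp_affinity. Here is a rough sketch that does the same thing as a shell loop over those files, run as root; some per-CPU interrupts will refuse the write, which is expected:

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PinIrqs {
    public static void main(String[] args) throws IOException {
        String mask = "f";  // hex bitmask, bits 0-3 set -> CPUs 0,1,2,3
        try (DirectoryStream<Path> irqs =
                 Files.newDirectoryStream(Paths.get("/proc/irq"), "[0-9]*")) {
            for (Path irq : irqs) {
                try {
                    Files.write(irq.resolve("smp_affinity"), (mask + "\n").getBytes());
                    System.out.println(irq.getFileName() + ": pinned to CPUs 0-3");
                } catch (IOException e) {
                    // Some interrupts (timers, IPIs) cannot be re-routed; skip them.
                    System.out.println(irq.getFileName() + ": skipped (" + e.getMessage() + ")");
                }
            }
        }
    }
}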
Start your application with the numactl command to align it to CPUs and memory.
(Here you need to understand what NUMA and alignment mean; please follow the link at the beginning of this answer.)
numactl [--membind=nodes] [--cpunodebind=nodes]
Your question is much bigger than what I have covered here.
If you see that the system is slow, you need to identify the bottleneck.
Try to gather raw info with top, vmstat and iostat to find the weak point.
Provide some stats for your system and I will help you tune it the right way.
I am using Ubuntu 15.04 on a two-socket POWER8 machine; each socket has 10 cores. "numactl -H" outputs:
available: 4 nodes (0-3)
node 0 cpus: 0 8 16 24 32
node 0 size: 30359 MB
node 0 free: 26501 MB
node 1 cpus: 40 48 56 64 72
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 80 88 96 104 112
node 2 size: 30425 MB
node 2 free: 27884 MB
node 3 cpus: 120 128 136 144 152
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node 0 1 2 3
0: 10 20 40 40
1: 20 10 40 40
2: 40 40 10 20
3: 40 40 20 10
The question is: are there two NUMA nodes on each POWER8 processor? And why does one node have memory while the other has none? I can't find any documentation about this. Any information would be appreciated.
A further question: if there are two nodes on a socket, are their last-level caches shared like NUMA nodes (a piece of data can reside in all of the caches) or like caches on the same socket (only one copy can exist)?
Scale-out POWER8 systems use Dual-Chip Modules (DCMs). As the name suggests, a DCM packages two multi-core chips with some additional logic within the same physical package. There is an on-package cache-coherent 32 GBps interconnect (misleadingly called an SMP bus) between the two chips and two separate paths to the external memory buffers, one for each chip. Thus, each socket is a dual-node NUMA system itself, similar to e.g., the multi-module AMD Opterons. In your case, all of the memory local to a given socket is probably installed in the slots belonging to the first chip of that socket only, therefore the second NUMA domain shows up as 0 MB.
Both the on-package (X bus) and inter-package (A bus) interconnects are cache-coherent, i.e. the L3 caches are kept in sync. Within a multi-core chip, each core is directly connected to a region of L3 cache and through the chip interconnect has access to all other L3 caches of the same chip, i.e. a NUCA (Non-Uniform Cache Architecture).
For more information, see the logical diagram of an S824 system in this Redpaper.
My Ubuntu machine's performance is terrible for R kmeans {stats}, whereas Windows 7 shows no problems.
X is a 5000 x 5 matrix (numerical variables).
k = 6
My desktop machine is a Dell Precision T3500 with an Intel Xeon CPU W3530 @ 2.80GHz x 8 (i.e., 8 cores) and 24 GB RAM, running Ubuntu 12.04.4 LTS (GNU/Linux 3.2.0-58-generic x86_64).
R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
> system.time(X.km <- kmeans(X, centers=k, nstart=25))
user system elapsed
49.763 52.347 103.426
Compared to a Windows 7 64-bit laptop with an Intel Core i5-2430M @ 2.40GHz, 2 cores, 8 GB RAM, R 3.0.1, and the same data:
> system.time(X.km <- kmeans(X, centers=k, nstart=25))
user system elapsed
0.36 0.00 0.37
Much, much faster. The problem still exists with nstart=1; I just wanted to amplify the execution time.
Is there something obvious I'm missing?
Try it for yourselves, see what times you achieve:
set.seed(101)
k <- 6
n <- as.integer(10)
el.time <- vector(length=n)
X <- matrix(rnorm(25000, mean=0.5, sd=1), ncol=5)
for (i in 1:n) { # sorry, not clever enough to vectorise
el.time[i] <- system.time(kmeans(X, centers=k, nstart=i))[[3]]
}
print(el.time)
plot(el.time, type="b")
My results (ubuntu machine):
> print(el.time)
[1] 0.056 0.243 0.288 0.489 0.510 0.572 0.623 0.707 0.830 0.846
Windows machine:
> print(el.time)
[1] 0.01 0.12 0.14 0.19 0.20 0.21 0.22 0.25 0.28 0.30
Are you running Ubuntu in a virtual machine? If that were the case, I could see the results being much slower, depending on how much memory, how many processors, and how much disk space were allocated to the VM. If it isn't running in a VM, then the results are puzzling. I would want to see performance counters (CPU usage, memory usage, etc.) on both systems for each of the runs. Otherwise, the only thing I can think of is that the code "fits" in the L1 cache of your Windows system but doesn't on the Linux system. The Xeon has 8 MB of (L3?) cache where the Core i5 only has 3 MB - but I'm assuming that's L3. I don't know what the L1 and L2 cache structures look like.
My guess is that it's a BLAS issue. R might use its internal BLAS if it was compiled that way. In addition to this, different BLAS versions can show significant performance differences (OpenBLAS vs. MKL).
I have a computer with 2 Intel Xeon CPUs and 48 GB of RAM. The RAM is divided between the CPUs - two parts of 24 GB + 24 GB. How can I check how much of each specific part is used?
So I need something like htop, which shows how heavily each core is used (see this example), but for memory rather than for cores. Or something that would show which parts (addresses) of memory are used and which are not.
The information is in /proc/zoneinfo, which contains very similar information to /proc/vmstat except broken down by "Node" (NUMA ID). I don't have a NUMA system here to test it and provide sample output for a multi-node config; it looks like this on a one-node machine:
Node 0, zone DMA
pages free 2122
min 16
low 20
high 24
scanned 0
spanned 4096
present 3963
[ ... followed by /proc/vmstat-like nr_* values ]
Node 0, zone Normal
pages free 17899
min 932
low 1165
high 1398
scanned 0
spanned 223230
present 221486
nr_free_pages 17899
nr_inactive_anon 3028
nr_active_anon 0
nr_inactive_file 48744
nr_active_file 118142
nr_unevictable 0
nr_mlock 0
nr_anon_pages 2956
nr_mapped 96
nr_file_pages 166957
[ ... more of those ... ]
Node 0, zone HighMem
pages free 5177
min 128
low 435
high 743
scanned 0
spanned 294547
present 292245
[ ... ]
I.e. a small statistic on total usage/availability, followed by the nr_* values that are also found at the system-global level in /proc/vmstat (which then allow a further breakdown of what exactly the memory is used for).
If you have more than one memory node, a.k.a. NUMA, you'll see these zones for all nodes.
Edit: I'm not aware of a frontend for this (i.e. a NUMA vmstat, in the same way htop is a fancier top), but please comment if anyone knows of one!
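As a stopgap, a per-node "free memory" view is easy to script on top of this file. Here is a minimal, hypothetical sketch that just sums the "pages free" counters per node (it assumes 4 KiB pages and the layout shown above):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

public class NumaFree {
    public static void main(String[] args) throws IOException {
        Map<String, Long> freePagesPerNode = new LinkedHashMap<>();
        String node = null;
        for (String line : Files.readAllLines(Paths.get("/proc/zoneinfo"))) {
            line = line.trim();
            if (line.startsWith("Node ")) {
                node = line.split(",")[0];                       // e.g. "Node 0"
            } else if (line.startsWith("pages free") && node != null) {
                long pages = Long.parseLong(line.replaceAll("\\D+", ""));
                freePagesPerNode.merge(node, pages, Long::sum);  // sum all zones of a node
            }
        }
        freePagesPerNode.forEach((n, pages) ->
                System.out.printf("%s: %d MB free%n", n, pages * 4096 / (1024 * 1024)));
    }
}

The totals should roughly match what numactl --hardware reports (see the next answer).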
The numactl --hardware command will give you a short answer like this:
node 0 cpus: 0 1 2 3 4 5
node 0 size: 49140 MB
node 0 free: 25293 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 49152 MB
node 1 free: 20758 MB
node distances:
node 0 1
0: 10 21
1: 21 10