When I run the !finalizequeue command on a dump file, it shows several heaps and the total objects to be finalized under each. What are the different heaps (Heap 0, Heap 1, and so on) in the result below? My understanding is that there is one heap per process; is that correct?
0:000> !finalizequeue
SyncBlocks to be cleaned up: 0
Free-Threaded Interfaces to be released: 0
MTA Interfaces to be released: 0
STA Interfaces to be released: 0
----------------------------------
------------------------------
Heap 0
generation 0 has 19 finalizable objects (41c7ed80->41c7edcc)
generation 1 has 19 finalizable objects (41c7ed34->41c7ed80)
generation 2 has 2283 finalizable objects (41c7c988->41c7ed34)
Ready for finalization 0 objects (41c7edcc->41c7edcc)
------------------------------
Heap 1
generation 0 has 101 finalizable objects (41ccc27c->41ccc410)
generation 1 has 25 finalizable objects (41ccc218->41ccc27c)
generation 2 has 2636 finalizable objects (41cc98e8->41ccc218)
Ready for finalization 0 objects (41ccc410->41ccc410)
------------------------------
Heap 2
generation 0 has 6 finalizable objects (41d4195c->41d41974)
generation 1 has 11 finalizable objects (41d41930->41d4195c)
generation 2 has 2328 finalizable objects (41d3f4d0->41d41930)
Ready for finalization 0 objects (41d41974->41d41974)
------------------------------
Heap 3
generation 0 has 21 finalizable objects (41c96188->41c961dc)
generation 1 has 16 finalizable objects (41c96148->41c96188)
generation 2 has 2584 finalizable objects (41c938e8->41c96148)
Ready for finalization 0 objects (41c961dc->41c961dc)
Those are the GC heaps. The GC in this process is running in server mode, which creates one GC heap per logical processor available to the process, so the four heaps above presumably mean the dump came from a process with four processors available. The output shows the locations of the finalizable objects, broken down by generation and by heap.
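For a .NET Framework application, server GC is normally selected through configuration, e.g. <gcServer enabled="true"/> under the <runtime> element of the application's config file (some hosts, such as ASP.NET, choose it automatically).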
I wrote this simple program that multiplies matrices. I can specify how many OS threads run it with the OMP_NUM_THREADS environment variable. It slows down a lot when the thread count gets larger than the number of hardware threads my CPU has.
Here's the program.
#define DIMENSION 4096   /* varied between 1024 and 8192 in the runs below */

static double a[DIMENSION][DIMENSION], b[DIMENSION][DIMENSION],
              c[DIMENSION][DIMENSION];

int main(void)
{
    #pragma omp parallel for schedule(static)
    for (unsigned i = 0; i < DIMENSION; i++)
        for (unsigned j = 0; j < DIMENSION; j++)
            for (unsigned k = 0; k < DIMENSION; k++)
                c[i][k] += a[i][j] * b[j][k];
}
My CPU is an i7-8750H with 12 hardware threads. When the matrices are large enough, the program is fastest at around 11 threads, is about 4 times slower by the time the thread count reaches 17, and then the run time stays roughly constant as I increase the thread count further.
Here are the results. The top row is DIMENSION, the left column is the thread count, and times are in seconds. The column marked * was compiled with -fno-loop-unroll-and-jam.
1024 2048 4096 4096* 8192
1 0.2473 3.39 33.80 35.94 272.39
2 0.1253 2.22 18.35 18.88 141.23
3 0.0891 1.50 12.64 13.41 100.31
4 0.0733 1.13 10.34 10.70 82.73
5 0.0641 0.95 8.20 8.90 62.57
6 0.0581 0.81 6.97 8.05 53.73
7 0.0497 0.70 6.11 7.03 95.39
8 0.0426 0.63 5.28 6.79 81.27
9 0.0390 0.56 4.67 6.10 77.27
10 0.0368 0.52 4.49 5.13 55.49
11 0.0389 0.48 4.40 4.70 60.63
12 0.0406 0.49 6.25 5.94 68.75
13 0.0504 0.63 6.81 8.06 114.53
14 0.0521 0.63 9.17 10.89 170.46
15 0.0505 0.68 11.46 14.08 230.30
16 0.0488 0.70 13.03 20.06 241.15
17 0.0469 0.75 20.67 20.97 245.84
18 0.0462 0.79 21.82 22.86 247.29
19 0.0465 0.68 24.04 22.91 249.92
20 0.0467 0.74 23.65 23.34 247.39
21 0.0458 1.01 22.93 24.93 248.62
22 0.0453 0.80 23.11 25.71 251.22
23 0.0451 1.16 20.24 25.35 255.27
24 0.0443 1.16 25.58 26.32 253.47
25 0.0463 1.05 26.04 25.74 255.05
26 0.0470 1.31 27.76 26.87 253.86
27 0.0461 1.52 28.69 26.74 256.55
28 0.0454 1.15 28.47 26.75 256.23
29 0.0456 1.27 27.05 26.52 256.95
30 0.0452 1.46 28.86 26.45 258.95
The code inside the inner loop compiles to this with gcc 9.3.1 and -O3 -march=native -fopenmp. rax starts at 0 and increases by 64 each iteration; rdx points to c[i], rsi points to b[j], and rdi points to b[j+1].
# ymm3 and ymm2 presumably hold broadcasts of a[i][j] and a[i][j+1]
vmovapd (%rsi,%rax), %ymm1              # ymm1 = b[j][k..k+3]
vmovapd 32(%rsi,%rax), %ymm0            # ymm0 = b[j][k+4..k+7]
vfmadd213pd (%rdx,%rax), %ymm3, %ymm1   # ymm1 = ymm3*ymm1 + c[i][k..k+3]
vfmadd213pd 32(%rdx,%rax), %ymm3, %ymm0 # ymm0 = ymm3*ymm0 + c[i][k+4..k+7]
vfmadd231pd (%rdi,%rax), %ymm2, %ymm1   # ymm1 += ymm2 * b[j+1][k..k+3]
vfmadd231pd 32(%rdi,%rax), %ymm2, %ymm0 # ymm0 += ymm2 * b[j+1][k+4..k+7]
vmovapd %ymm1, (%rdx,%rax)              # store back to c[i][k..k+3]
vmovapd %ymm0, 32(%rdx,%rax)            # store back to c[i][k+4..k+7]
I wonder why the run time increases so much when the thread count
increases.
My estimate says this shouldn't be the case when DIMENSION is 4096.

Here is what I thought before I remembered that the compiler processes two j iterations at a time. Each iteration of the j loop needs the rows c[i] and b[j], which are 64 KB in total (two rows of 4096 doubles). My CPU has a 32 KB L1 data cache and a 256 KB L2 cache per core (shared by 2 hardware threads). The four rows the two hardware threads are working on don't fit in L1 but do fit in L2, so when j advances, c[i] is read from L2. When the program is run on 24 OS threads, the number of involuntary context switches is around 29371, and each thread gets interrupted before it has a chance to finish one iteration of the j loop. Since 8 matrix rows fit in the L2 cache, the other software thread's 2 rows are probably still in L2 when it resumes, so the execution time shouldn't be much different from the 12-thread case. However, measurements say it's 4 times as slow.
Now I have realized that two j iterations are done at a time, so each pass of the j loop works on 96 KB of memory (the rows c[i], b[j] and b[j+1]), and four such working sets cannot fit in the 256 KB L2 cache. To verify that this is what slows the program down, I compiled the program with -fno-loop-unroll-and-jam and got:
# here rdx presumably points to c[i], rcx to b[j], and ymm1 holds a[i][j]
vmovapd ymm0, YMMWORD PTR [rcx+rax]            # ymm0 = b[j][k..k+3]
vfmadd213pd ymm0, ymm1, YMMWORD PTR [rdx+rax]  # ymm0 = ymm1*ymm0 + c[i][k..k+3]
vmovapd YMMWORD PTR [rdx+rax], ymm0            # store back to c[i][k..k+3]
The results are in the table (the 4096* column). They look much like the results when two rows of b are processed at a time, which makes me wonder even more. When DIMENSION is 4096, four software threads' 8 rows fit in the L2 cache when each thread works on one row of b at a time, but their 12 rows don't fit when each thread works on two rows of b at a time. Why are the run times similar?
I thought maybe the CPU had warmed up while running with fewer threads and had to slow down, but I ran the tests multiple times, both in order of increasing and of decreasing thread count, and they yield similar results. dmesg also doesn't contain anything related to thermals or clocking.
I tried separately changing 4096 columns to 4104 columns and setting
OMP_PROC_BIND=true OMP_PLACES=cores, and the results are similar.
This problem seems to come from either the CPU caches (poor memory locality) or the OS scheduler (more threads than the hardware can execute simultaneously).
I cannot reproduce exactly the same effect on my i5-9600KF processor (6 cores, 6 threads) with a 4096x4096 matrix, but similar effects occur.
Here are performance results (with GCC 9.3 using -O3 -march=native -fopenmp on Linux 5.6):
#threads | time (in seconds)
----------------------------
1 | 16.726885
2 | 9.062372
3 | 6.397651
4 | 5.494580
5 | 4.054391
6 | 5.724844 <-- maximum number of hardware threads
7 | 6.113844
8 | 7.351382
9 | 8.992128
10 | 10.789389
11 | 10.993626
12 | 11.099117
24 | 11.283873
48 | 11.412288
We can see that the computation time starts to grow significantly between 5 and 12 threads.
This problem is due to much more data being fetched from RAM: 161.6 GiB are loaded from memory with 6 threads, while 424.7 GiB are loaded with 12 threads! In both cases, 3.3 GiB are written to RAM. Because my memory throughput is roughly 40 GiB/s, the loads alone take about 424.7 / 40 ≈ 10.6 s of the ~11 s run, so the RAM accesses represent more than 96% of the overall execution time with 12 threads!
If we dig deeper, we can see that the number of L1 cache references and L1 cache misses is the same regardless of the number of threads used, while there are many more L3 cache misses (and more references). Here are the L3-cache statistics:
With 6 threads:  4.4 G loads, 1.1 G load-misses (25% of all LL-cache hits)
With 12 threads: 6.1 G loads, 4.5 G load-misses (74% of all LL-cache hits)
This means that the locality of the memory accesses is clearly worse with more threads. I guess this is because the compiler is not clever enough to do the high-level, cache-aware optimizations that could reduce RAM pressure (especially when the number of threads is high). You have to do the tiling yourself in order to improve memory locality. You can find a good guide here.
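As an illustration only (a sketch, not the code that was benchmarked), one way to tile the question's loop nest is to block the j and k loops so that the tile of b and the strip of c being worked on stay cache-resident. BLOCK is an assumed tile size that must divide DIMENSION and should be tuned to the cache sizes:

#define BLOCK 64   /* assumed tile size: a 64x64 block of b is 32 KB */

#pragma omp parallel for schedule(static)
for (unsigned i = 0; i < DIMENSION; i++)
    for (unsigned jj = 0; jj < DIMENSION; jj += BLOCK)
        for (unsigned kk = 0; kk < DIMENSION; kk += BLOCK)
            /* same update as before, restricted to a BLOCK x BLOCK tile of b */
            for (unsigned j = jj; j < jj + BLOCK; j++)
                for (unsigned k = kk; k < kk + BLOCK; k++)
                    c[i][k] += a[i][j] * b[j][k];

Blocking over i as well (plus register tiling) is what optimized BLAS kernels do much more carefully.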
Finally, note that using more threads than the hardware can execute simultaneously is generally not efficient. One problem is that the OS scheduler often places threads on cores poorly and frequently moves them; the usual fix is to bind software threads to hardware threads with OMP_PROC_BIND=TRUE and by setting the OMP_PLACES environment variable. Another problem is that the threads are executed with preemptive multitasking and shared resources (e.g. caches).
PS: please note that BLAS libraries (e.g. OpenBLAS, BLIS, Intel MKL) are much more optimized than this code, as most of them already include clever optimizations such as manual vectorization for the target hardware, loop unrolling, multithreading, tiling, and fast matrix transposition when needed. For a 4096x4096 matrix, they are about 10 times faster.
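For reference, a minimal sketch of handing the same multiplication to a CBLAS implementation such as OpenBLAS (linking with something like -lopenblas is an assumption here), using the question's static arrays:

#include <cblas.h>

#define DIMENSION 4096
static double a[DIMENSION][DIMENSION], b[DIMENSION][DIMENSION],
              c[DIMENSION][DIMENSION];

int main(void)
{
    /* c = 1.0*a*b + 1.0*c, row-major, no transposition */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                DIMENSION, DIMENSION, DIMENSION,
                1.0, &a[0][0], DIMENSION,
                     &b[0][0], DIMENSION,
                1.0, &c[0][0], DIMENSION);
    return 0;
}

The library then picks the tiling, vectorization and thread count for the target machine.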
When I type vmstat -m on the command line, it shows:
Cache Num Total Size Pages
fuse_request 0 0 424 9
fuse_inode 0 0 768 5
pid_2 0 0 128 30
nfs_direct_cache 0 0 200 19
nfs_commit_data 0 0 704 11
nfs_write_data 36 36 960 4
nfs_read_data 0 0 896 4
nfs_inode_cache 8224 8265 1048 3
nfs_page 0 0 128 30
fscache_cookie_jar 2 48 80 48
rpc_buffers 8 8 2048 2
rpc_tasks 8 15 256 15
rpc_inode_cache 17 24 832 4
bridge_fdb_cache 14 59 64 59
nf_conntrack_expect 0 0 240 16
For the nfs_write_data line (line 7), why is "Pages" less than "Total"? For some of the other caches, "Total" is always equal to "Pages".
Taken from the vmstat man page:
...
The -m switch displays slabinfo.
...
Field Description For Slab Mode
cache: Cache name
num: Number of currently active objects
total: Total number of available objects
size: Size of each object
pages: Number of pages with at least one active object
totpages: Total number of allocated pages
pslab: Number of pages per slab
Thus, "total" is the number of slab objects (objects used by the OS as inodes, buffers and so on), and a page can contain more than one object.
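For example, assuming the usual 4 KiB page size: an nfs_write_data object is 960 bytes, so four of them fit in a single slab page; with several objects packed per page, a cache's object count can easily exceed its page figure.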
Hi, I am a Linux programmer.
I have been asked to monitor process CPU usage, so I use fields 14 and 15 of /proc/[pid]/stat. Those values are called utime and stime.
Example [/proc/[pid]/stat]:
30182 (TTTTest) R 30124 30182 30124 34845 30182 4218880 142 0 0 0 5274 0 0 0 20 0 1 0 55611251 17408000 386 18446744073709551615 4194304 4260634 140733397159392 140733397158504 4203154 0 0 0 0 0 0 0 17 2 0 0 0 0 0 6360520 6361584 33239040 140733397167447 140733397167457 140733397167457 140733397168110 0
State after 5 sec
30182 (TTTTest) R 30124 30182 30124 34845 30182 4218880 142 0 0 0 5440 0 0 0 20 0 1 0 55611251 17408000 386 18446744073709551615 4194304 4260634 140733397159392 140733397158504 4203154 0 0 0 0 0 0 0 17 2 0 0 0 0 0 6360520 6361584 33239040 140733397167447 140733397167457 140733397167457 140733397168110 0
In my test environment this file appeared to refresh every 1-2 seconds, so I assumed it is updated by the system at least about once per second.
So I use this calculation:
process_cpu_usage = ((utime - old_utime) + (stime - old_stime)) / period
With the values above:
33.2 = ((5440 - 5274) + (0 - 0)) / 5
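A minimal, self-contained sketch of this sampling approach looks like the following (simplified; not my actual monitoring code). Note that utime and stime are in clock ticks, so the tick delta divided by the period is a percentage only when the tick rate is the usual 100 Hz; the sketch divides by sysconf(_SC_CLK_TCK) explicitly.

#include <stdio.h>
#include <unistd.h>

/* Read utime (field 14) + stime (field 15) from /proc/<pid>/stat, in clock ticks.
   Assumes the comm field (field 2) contains no spaces, as in the example above. */
static long cpu_ticks(const char *pid)
{
    char path[64];
    unsigned long utime = 0, stime = 0;
    snprintf(path, sizeof path, "/proc/%s/stat", pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    fscanf(f, "%*d %*s %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
           &utime, &stime);
    fclose(f);
    return (long)(utime + stime);
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    const int period = 5;                    /* seconds between the two samples */
    long before = cpu_ticks(argv[1]);
    sleep(period);
    long after = cpu_ticks(argv[1]);
    double usage = 100.0 * (after - before) / (sysconf(_SC_CLK_TCK) * period);
    printf("cpu usage over %d s: %.1f%%\n", period, usage);
    return 0;
}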
But in our commercial server environment, where processes run under high load (CPU and file I/O), the /proc/[pid]/stat update period grows to as much as 20-60 seconds!
So the top/htop utilities can't measure the correct process usage.
Why is this phenomenon occurring?
Our system is CentOS Linux release 7.1.1503 (Core).
Most (if not all) files in the /proc filesystem are special files; their content at any given moment reflects the actual OS/kernel data at that very moment, and they are not files whose contents are periodically updated. See the /proc filesystem documentation.
In particular, the /proc/[pid]/stat content changes whenever the respective process's state changes (for example after every scheduling event): for mostly sleeping processes the file will appear to be "updated" at a slower rate, while for active/running processes on lightly loaded systems it is "updated" at a higher rate. Compare, for example, the corresponding files for an idle shell process and for a browser process playing a video stream.
On heavily loaded systems with many processes in the ready state (like the one mentioned in this Q&A, for example), there can be scheduling delays that make the file content "updates" appear less often despite the processes being ready/active. Such conditions seem to be encountered more often in commercial/enterprise environments (debatable, I agree).
On Android shell:
/data/local/valgrind/bin/valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --log-file=/sdcard/valgrind.log /data/local/Test
========================================================================
Contents of valgrind.log:
==21314== HEAP SUMMARY:
==21314== in use at exit: 2,098,270 bytes in 6 blocks
==21314== total heap usage: 6 allocs, 0 frees, 2,098,270 bytes allocated
==21314==
==21314== 4 bytes in 1 blocks are definitely lost in loss record 1 of 6
==21314== at 0x482BAEC: malloc (vg_replace_malloc.c:291)
==21314== by 0x864B: ??? (in /data/local/Test)
==21314==
==21314== 10 bytes in 1 blocks are definitely lost in loss record 2 of 6
==21314== at 0x482BAEC: malloc (vg_replace_malloc.c:291)
==21314== by 0x863B: ??? (in /data/local/Test)
==21314==
==21314== 80 bytes in 1 blocks are definitely lost in loss record 3 of 6
==21314== at 0x482C2E4: operator new[](unsigned int) (vg_replace_malloc.c:378)
==21314== by 0x85DF: ??? (in /data/local/Test)
==21314==
==21314== 1,024 bytes in 1 blocks are still reachable in loss record 4 of 6
==21314== at 0x482BAEC: malloc (vg_replace_malloc.c:291)
==21314== by 0x4852DB3: __smakebuf (in /system/lib/libc.so)
==21314==
==21314== 1,048,576 bytes in 1 blocks are possibly lost in loss record 5 of 6
==21314== at 0x482BAEC: malloc (vg_replace_malloc.c:291)
==21314== by 0x86C3: ??? (in /data/local/Test)
==21314==
==21314== 1,048,576 bytes in 1 blocks are definitely lost in loss record 6 of 6
==21314== at 0x482BAEC: malloc (vg_replace_malloc.c:291)
==21314== by 0x869F: ??? (in /data/local/Test)
==21314==
==21314== LEAK SUMMARY:
==21314== definitely lost: 1,048,670 bytes in 4 blocks
==21314== indirectly lost: 0 bytes in 0 blocks
==21314== possibly lost: 1,048,576 bytes in 1 blocks
==21314== still reachable: 1,024 bytes in 1 blocks
==21314== suppressed: 0 bytes in 0 blocks
==21314==
==21314== For counts of detected and suppressed errors, rerun with: -v
==21314== ERROR SUMMARY: 137 errors from 18 contexts (suppressed: 0 from 0)
The Android.mk:
LOCAL_PATH := $(call my-dir)
include $(CLEAR_VARS)
LOCAL_MODULE := Test
LOCAL_SRC_FILES := helloworld.cpp
APP_CPPFLAGS += -O0
include $(BUILD_EXECUTABLE)
On Linux I run ndk-build NDK_DEBUG=1 so this Test binary is debuggable. Why doesn't the valgrind log show the line numbers?
If a process is interrupted by a hardware interrupt (First Level Interrupt Handler), does the CPU scheduler become aware of that (e.g. does the scheduler count execution time for hardware interrupts separately from the interrupted process)?
More details:
I am trying to troubleshoot an issue where the CPU utilization shown in htop is way too low for the packet encryption task in question: the CPU is at <10% while encrypting packets at 400 Mbps, yet the raw encryption speed is only 1.6 Gbps, so at 400 Mbps I would expect encryption alone to need roughly a quarter of a core, and packet encryption certainly should not go any faster than raw encryption.
Explanation:
My hypothesis is that packet encapsulation happens in hardware interrupt context, giving me the illusion of low CPU usage in htop. Usually FLIHs are implemented to finish their task as quickly as possible and defer the rest of their work to SLIHs (Second Level Interrupt Handlers, which I guess run on behalf of ksoftirqd/X). But what happens if a FLIH interrupts a process for a very long time? Does that introduce some kind of OS jitter?
I am using Ubuntu 10.04.1 on the x86-64 platform.
Additional debugging info:
while [ 1 ]; do cat /proc/stat | grep "cpu "; sleep 1; done;
cpu 288 1 1677 356408 1145 0 20863 0 0
cpu 288 1 1677 356772 1145 0 20899 0 0
cpu 288 1 1677 357108 1145 0 20968 0 0
cpu 288 1 1677 357392 1145 0 21083 0 0
cpu 288 1 1677 357620 1145 0 21259 0 0
cpu 288 1 1677 357972 1145 0 21310 0 0
cpu 288 1 1677 358289 1145 0 21398 0 0
cpu 288 1 1677 358517 1145 0 21525 0 0
cpu 288 1 1678 358838 1145 0 21652 0 0
cpu 289 1 1678 359141 1145 0 21704 0 0
cpu 289 1 1678 359563 1145 0 21729 0 0
cpu 290 1 1678 359886 1145 0 21758 0 0
cpu 290 1 1678 360296 1145 0 21801 0 0
The seventh column (or sixth numeric column) here is, I guess, the time spent inside hardware interrupt handlers (htop uses this proc file to get its statistics). I am wondering whether this will turn out to be a bug in Linux or in the driver. When I took these /proc/stat snapshots, the traffic was 500 Mbps in and 500 Mbps out.
The time spent in interrupt handlers is accounted for.
htop shows it as "si" (soft interrupt) and "hi" (hard interrupt); "ni" is nice time and "wa" is I/O wait.
Edit:
From man proc:
the sixth column is hardware irq time,
the seventh column is softirq time,
the eighth is stolen time,
the ninth is guest time;
the latter two are only meaningful for virtualized systems.
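Applied to the /proc/stat samples above: the sixth numeric column after "cpu" (hardware irq) stays at 0, while the seventh (softirq) grows from 20863 to 21801 across the snapshots, so the deferred packet-processing work is being accounted as softirq time rather than against the encrypting process.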
Do you have a kernel built with the CONFIG_IRQ_TIME_ACCOUNTING (Processor type and features/Fine granularity task level IRQ time accounting) option set?