How do I get 4 MB huge pages on Linux - linux

According to:
$ ls -l /sys/kernel/mm/hugepages
drwxr-xr-x 2 root root 0 Dec 6 10:38 hugepages-1048576kB
drwxr-xr-x 2 root root 0 Dec 6 10:38 hugepages-2048kB
There is a choice of 2 MB and 1 GB sizes of huge pages on my system which is running a 5.4.17 kernel
However according to:
$ cpuid | grep -i tlb |sort| uniq
0x03: data TLB: 4K pages, 4-way, 64 entries
0x63: data TLB: 2M/4M pages, 4-way, 32 entries
0x76: instruction TLB: 2M/4M pages, fully, 8 entries
0xb5: instruction TLB: 4K, 8-way, 64 entries
0xc3: L2 TLB: 4K/2M pages, 6-way, 1536 entries
cache and TLB information (2):
data TLB: 1G pages, 4-way, 4 entries
L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):
the TLBs on my Skylake also support 4 MB pages. The same information can be found at
https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)
So the question is: can I really have 4 MB pages, and if so what do I need to do to set up my system to have that option?

The best answer is probably to install and/or use libhugetlbfs
If it's already installed, you can check status of huge pages in the OS with a command like:
$ hugeadm --pool-list
Size Minimum Current Maximum Default
2097152 0 1 257388 *
4194304 0 0 128694
8388608 0 0 64347
16777216 0 0 32173
33554432 0 0 16086
67108864 0 0 8043
134217728 0 0 4021
268435456 0 0 2010
536870912 0 0 1005
1073741824 0 0 502
2147483648 0 0 251
The same hugeadm command can also be run as sudo with various options to configure the available huge memory pools. See the hugeadm man page for details.

Related

How to unfreeze a user's memory limit?

For these two days, I have met a weird question.
STAR from https://github.com/alexdobin/STAR is a program used to build suffix array indexes. I have been used this program for years. It works OK until recently.
For these days, when I run STAR, it will always be killed.
root#localhost:STAR --runMode genomeGenerate --runThreadN 10 --limitGenomeGenerateRAM 31800833920 --genomeDir star_GRCh38 --genomeFastaFiles GRCh38.fa --sjdbGTFfile GRCh38.gtf --sjdbOverhang 100
.
.
.
Killed
root#localhost:STAR --runMode genomeGenerate --runThreadN 10 --genomeDir star_GRCh38 --genomeFastaFiles GRCh38.fa --sjdbGTFfile GRCh38.gtf --sjdbOverhang 100
Jun 03 10:15:08 ..... started STAR run
Jun 03 10:15:08 ... starting to generate Genome files
Jun 03 10:17:24 ... starting to sort Suffix Array. This may take a long time...
Jun 03 10:17:51 ... sorting Suffix Array chunks and saving them to disk...
Killed
A month ago, the same command with same inputs and same parameters runs OK. It does cost some memory, but not a lot.
I have tried 3 recently released version of this program, all failed. So I do not think it is the question of STAR program but my sever configuration.
I also tried to run this program as both root and normal user, no lucky for each.
I suspect there is a limitation of memory usage in my server.
But I do not know how the memory is limited? I wonder if some one can give me some hints.
Thanks!
Tong
The following is my debug process and system info.
Command dmesg -T| grep -E -i -B5 'killed process' showing it is Out of memory problem.
But before the STAR program is killed, top command showing only 5% mem is occupied by this porgram.
[一 6 1 23:43:00 2020] [40479] 1002 40479 101523 18680 112 487 0 /anaconda2/bin/
[一 6 1 23:43:00 2020] [40480] 1002 40480 101526 18681 112 486 0 /anaconda2/bin/
[一 6 1 23:43:00 2020] [40481] 1002 40481 101529 18682 112 485 0 /anaconda2/bin/
[一 6 1 23:43:00 2020] [40482] 1002 40482 101531 18673 111 493 0 /anaconda2/bin/
[一 6 1 23:43:00 2020] Out of memory: Kill process 33822 (STAR) score 36 or sacrifice child
[一 6 1 23:43:00 2020] Killed process 33822 (STAR) total-vm:23885188kB, anon-rss:10895128kB, file-rss:4kB, shmem-rss:0kB
[三 6 3 10:02:13 2020] [12296] 1002 12296 101652 18681 113 486 0 /anaconda2/bin/
[三 6 3 10:02:13 2020] [12330] 1002 12330 101679 18855 112 486 0 /anaconda2/bin/
[三 6 3 10:02:13 2020] [12335] 1002 12335 101688 18682 112 486 0 /anaconda2/bin/
[三 6 3 10:02:13 2020] [12365] 1349 12365 30067 1262 11 0 0 bash
[三 6 3 10:02:13 2020] Out of memory: Kill process 7713 (STAR) score 40 or sacrifice child
[三 6 3 10:02:13 2020] Killed process 7713 (STAR) total-vm:19751792kB, anon-rss:12392428kB, file-rss:0kB, shmem-rss:0kB
--
[三 6月 3 10:42:17 2020] [ 4697] 1002 4697 101526 18681 112 486 0 /anaconda2/bin/
[三 6月 3 10:42:17 2020] [ 4698] 1002 4698 101529 18682 112 485 0 /anaconda2/bin/
[三 6月 3 10:42:17 2020] [ 4699] 1002 4699 101532 18680 112 487 0 /anaconda2/bin/
[三 6月 3 10:42:17 2020] [ 4701] 1002 4701 101534 18673 110 493 0 /anaconda2/bin/
[三 6月 3 10:42:17 2020] Out of memory: Kill process 21097 (STAR) score 38 or sacrifice child
[三 6月 3 10:42:17 2020] Killed process 21097 (STAR) total-vm:19769500kB, anon-rss:11622928kB, file-rss:884kB, shmem-rss:0kB
Command free -hl showing I have enough memory.
total used free shared buff/cache available
Mem: 251G 10G 11G 227G 229G 12G
Low: 251G 240G 11G
High: 0B 0B 0B
Swap: 29G 29G 0B
Also as showed by ulimit -a, no virtual memory limitation is set.
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1030545
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 65536
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 1030545
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Here is the version of my Centos and Kernel (output by hostnamectl):
hostnamectl
Static hostname: localhost.localdomain
Icon name: computer-server
Chassis: server
Operating System: CentOS Linux 7 (Core)
CPE OS Name: cpe:/o:centos:centos:7
Kernel: Linux 3.10.0-514.26.2.el7.x86_64
Architecture: x86-64
Here shows the content of cat /etc/security/limits.conf:
#* soft core 0
#* hard rss 10000
##student hard nproc 20
##faculty soft nproc 20
##faculty hard nproc 50
#ftp hard nproc 0
##student - maxlogins 4
* soft nofile 65536
* hard nofile 65536
##intern hard as 162400000
##intern hard nproc 150
# End of file
As suggested, I have updated the output of df -h:
Filesystem All Used Available (Used)% Mount
devtmpfs 126G 0 126G 0% /dev
tmpfs 126G 1.3M 126G 1% /dev/shm
tmpfs 126G 4.0G 122G 4% /run
tmpfs 126G 0 126G 0% /sys/fs/cgroup
/dev/mapper/cl-root 528G 271G 257G 52% /
/dev/sda1 492M 246M 246M 51% /boot
tmpfs 26G 0 26G 0% /run/user/0
tmpfs 26G 0 26G 0% /run/user/1002
tmpfs 26G 0 26G 0% /run/user/1349
tmpfs 26G 0 26G 0% /run/user/1855
ls -a /dev/shm/
. ..
grep Shmem /proc/meminfo
Shmem: 238640272 kB
Several tmpfs costs 126G memory. I am googleing this, but still not sure what should be done?
That is the problem of shared memory due to program terminated abnormally.
ipcrm is used to clear all shared memory and then STAR running is fine.
$ ipcrm
.....
$ free -h
total used free shared buff/cache available
Mem: 251G 11G 226G 3.9G 14G 235G
Swap: 29G 382M 29G
It looks like the problem is with shared memory: you have 227G of memory eaten up by shared objects.
Shared memory files are persistent. Have a look in /dev/shm and any other tmpfs mounts to see if there are large files that can be removed to free up more physical memory (RAM+swap).
$ ls -l /dev/shm
...
$ df -h | grep '^Filesystem\|^tmpfs'
...
When I run a program called STAR, it will always be killed.
It probably has some memory leak. Even old programs may have residual bugs, and they could appear in some very particular cases.
Check with strace(1) or ltrace(1) and pmap(1). Learn also to query /proc/, see proc(5), top(1), htop(1). See LinuxAteMyRam and read about memory over-commitment and virtual address space and perhaps a textbook on operating systems.
If you have access to the source code of your STAR, consider recompiling it with all warnings and debug info (with GCC, you would pass -Wall -Wextra -g to gcc or g++) then use valgrind and/or some address sanitizer. If you don't have legal access to the source code of STAR, contact the entity (person or organization) which provided it to you.
You could be interested in that draft report and by the Clang static analyzer or Frama-C (or coding your own GCC plugin).
So I do not think it is the question of STAR program but my server configuration.
I recommend using valgrind or gdb and inspect your /proc/ to validate that optimistic hypothesis.

In ss -s, what is the kernel counter actually counting?

While troubleshooting a problem on an OEL 7 server (3.10.0-1062.9.1.el7.x86_64), I ran the command
sudo ss -s
Which gave me the output of:
Total: 601 (kernel 1071)
TCP: 8 (estab 2, closed 0, orphaned 0, synrecv 0, timewait 0/0), ports 0
Transport Total IP IPv6
1071 - -
RAW 2 0 2
UDP 6 4 2
TCP 8 5 3
INET 16 9 7
FRAG 0 0 0
Doing an ss -a | wc -l came back with 225 entries.
It leads me to the question, what is kernel 1071 actually counting?
Looking through the various man pages did not provide an answer.
Using strace, I can see where ss reads:
/proc/net/sockstat
/proc/net/sockstat6
/proc/net/snmp
/proc/slabinfo
Looking through those files and the docs, it looks like the value is coming from /proc/slabinfo.
Grepping through /proc/slabinfo for 1071 came back with one entry:
sock_inode_cache 1071 1071 640 51 8 : tunables 0 0 0 : slabdata 21 21 0
Looking through the files and docs on sock_inode_cache has not helped so far. I am hoping someone here knows what the kernel counter is actually counting, or can point me in the right direction.
what is kernel 1071 actually counting?
sock_inode_cache represents Linux kernel Slab statistics. It shows how many socket inodes (active objects) are there.
struct socket_alloc corresponds to the sock_inode_cache slab cache and contains the struct socket and struct inode, so it is connected to VFS.

How to identify, what is stalling the system in Linux?

I have an embedded system, when I do the user i/o operations, the system just stalls. It does the action after a long time. This system is quite complex and has many process running. My question is how can I identify what is making the system stall - it does nothing literally for 5 minutes. After 5 minutes, I see the outcome. I really don't know what is stalling the system. Any inputs on how to debug this issue. I have run the top on the system. However, it doesn't lead to any issue. See here, the jup_render is just taking 30% of CPU, which is not enough to stall the system. So, I am not sure whether top is useful here or not.
~ # top
top - 12:01:05 up 21 min, 1 user, load average: 1.49, 1.26, 0.87
Tasks: 116 total, 2 running, 114 sleeping, 0 stopped, 0 zombie
Cpu(s): 44.4%us, 13.9%sy, 0.0%ni, 40.3%id, 0.0%wa, 0.0%hi, 1.4%si, 0.0%st
Mem: 822572k total, 389640k used, 432932k free, 1980k buffers
Swap: 0k total, 0k used, 0k free, 227324k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
850 root 20 0 309m 32m 16m S 30 4.0 3:10.88 jup_render
870 root 20 0 221m 13m 10m S 27 1.7 2:28.78 jup_render
688 root 20 0 1156m 4092 3688 S 11 0.5 1:25.49 rxserver
9 root 20 0 0 0 0 S 2 0.0 0:06.81 ksoftirqd/1
16 root 20 0 0 0 0 S 1 0.0 0:06.87 ksoftirqd/3
9294 root 20 0 1904 616 508 R 1 0.1 0:00.10 top
812 root 20 0 865m 85m 46m S 1 10.7 1:21.17 lippo_main
13 root 20 0 0 0 0 S 1 0.0 0:06.59 ksoftirqd/2
800 root 20 0 223m 8316 6268 S 1 1.0 0:08.30 rat-cadaemon
3 root 20 0 0 0 0 S 1 0.0 0:05.94 ksoftirqd/0
1456 root 20 0 80060 10m 8208 S 1 1.2 0:04.82 jup_render
1330 root 20 0 202m 10m 8456 S 0 1.3 0:06.08 jup_render
8905 root 20 0 1868 556 424 S 0 0.1 0:02.91 dropbear
1561 root 20 0 80084 10m 8204 S 0 1.2 0:04.92 jup_render
753 root 20 0 61500 7376 6184 S 0 0.9 0:04.06 ale_app
1329 root 20 0 79908 9m 8208 S 0 1.2 0:04.77 jup_render
631 dbus 20 0 3248 1636 676 S 0 0.2 0:13.10 dbus-daemon
1654 root 20 0 80068 10m 8204 S 0 1.2 0:04.82 jup_render
760 root 20 0 116m 15m 12m S 0 1.9 0:10.19 jup_server
8 root 20 0 0 0 0 S 0 0.0 0:00.00 kworker/1:0
2 root 20 0 0 0 0 S 0 0.0 0:00.00 kthreadd
7 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/1
170 root 0 -20 0 0 0 S 0 0.0 0:00.00 kblockd
6 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/0
167 root 20 0 0 0 0 S 0 0.0 0:00.00 sync_supers
281 root 0 -20 0 0 0 S 0 0.0 0:00.00 nfsiod
For an embedded system that has many process running, there can be multitude of reasons. You may need to investigate in all perspective.
Check code for race conditions and deadlock.The kernel might be busy looping in a certain condition . There can be scenario where your application is waiting on a select call or the CPU resource is used up (This choice of CPU resource usage is ruled out based on the output of top command shared by you) or blocked on a read.
If you are performing a blocking I/O operations, the process shall get into wait queue and only move back to the execution path(ready queue) after the completion of the request. That is, it is moved out of the scheduler run queue and put with a special state. It shall be put back into the run queue only if they wake from the sleep or the resource waited for is made available.
Immediate step shall be to try out 'strace'. It shall intercept/record system calls that are called by a process and also the signals that are received by a process. It will be able to show the order of events and all the return/resumption paths of calls. This can take you almost closer to the area of problem.
There are other many handy tools that can be tried based on your development environment/setup. Key tools are as below :
'iotop' - It shall provide you a table of current I/O usage by processes or threads on the system by monitoring the I/O usage information output by the kernel.
'LTTng' - Makes tracing of race conditions and interrupt cascades possible. It is the successor to LTT. It is a combination of kprobes, tracepoint and perf functionalities.
'Ftrace' - This is a Linux kernel internal tracer with which you can analyze/debug latency and performance related issues.
If your system is based on TI processor, the CCS(Trace analyzer) provides capability to perform non-intrusive debug and analysis of system activity. So, note that based on your setup, you may also need to use the relevant tool .
Came across few more ideas :
magic SysRq key is another option in linux. If the driver is stuck, the command SysRq p can take you to the exact routine that is causing the problem.
Profiling of data can tell where exactly the time is being spent by the kernel. There are couple of tools like Readprofile and Oprofile. Oprofile can be enabled by configuring with CONFIG_PROFILING and CONFIG_OPROFILE. Another option is to rebuild the kernel by enabling the profiling option and reading the profile counters using Readprofile utility by booting up with profile=2 via command line.
mpstat can give 'the percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request' via 'iowait' argument.
You said you run the top app. Did you find out which programme gets the biggest CPU time and how much is a percentage for it?
If you run the top you should see another screen in there, which you neither provided nor mentioned a cpu load percentage (or other relevant info).
I advise you to include what you can find interesting/relevant or suspicious through top. If it was already done you should discover it in your question more distinctively because now it's not obvious what is the CPU maximum load.

How do I know if my server has NUMA?

Hopping from Java Garbage Collection, I came across JVM settings for NUMA. Curiously I wanted to check if my CentOS server has NUMA capabilities or not. Is there a *ix command or utility that could grab this info?
I'm no expert here, but here's something:
Box 1, no NUMA:
~$ dmesg | grep -i numa
[ 0.000000] No NUMA configuration found
Box 2, some NUMA:
~$ dmesg | grep -i numa
[ 0.000000] NUMA: Initialized distance table, cnt=8
[ 0.000000] NUMA: Node 4 [0,80000000) + [100000000,280000000) -> [0,280000000)
I think this previous question is similar: How to confirm NUMA?
In particular, you can review the NUMA man page here:
http://man7.org/linux/man-pages/man7/numa.7.html
And from there you'll see:
$ find /proc -name numa_maps
/proc/1/task/1/numa_maps
/proc/1/numa_maps
/proc/2/task/2/numa_maps
/proc/2/numa_maps
/proc/3/task/3/numa_maps
[etc if you have numa]
And you can get more detail like so:
$ grep NUMA=y /boot/config-`uname -r`
CONFIG_NUMA=y
CONFIG_K8_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_ACPI_NUMA=y
$ numactl --hardware
available: 2 nodes (0-1)
node 0 size: 18156 MB
node 0 free: 9053 MB
node 1 size: 18180 MB
node 1 free: 6853 MB
node distances:
node 0 1
0: 10 20
1: 20 10
For Redhat 4,5,6 and 7 systems, one can try the following to determine if NUMA configuration is disabled:
numactl --show does not show multiple nodes
# numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
cpubind: 0
nodebind: 0
membind: 0
or numactl --hardware does not list multiple nodes
# numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
node 0 size: 524163 MB
node 0 free: 505253 MB
node distances:
node 0
0: 10
You can also get this info from lscpu command:
lscpu | grep -i numa
NUMA node(s): 2
NUMA node0 CPU(s): 0-19,40-59
NUMA node1 CPU(s): 20-39,60-79
You can also just query the information from /sys (this is what tools like numactl do underneath). As others pointed out, using dmesg will be unreliable since it usually does not have unlimited buffering.
To find out how many NUMA nodes are currently available, do:
cat /sys/devices/system/node/online
0-3

Why the SWAP listed in the detail list of the TOP command is greater than in the summary?

The TOP command results:
Mem: 3991840k total, 1496328k used, 2495512k free, 156752k buffers
**Swap**: 3905528k total, **3980k** used, 3901548k free, 447860k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ **SWAP** COMMAND
28250 www-data 20 0 430m 210m 21m R 63 5.4 0:07.29 **219m** apache2
28266 www-data 20 0 256m 40m 21m S 30 1.0 0:01.94 **216m** apache2
28206 www-data 20 0 260m 44m 21m S 27 1.1 0:10.27 **215m** apache2
28259 www-data 20 0 256m 40m 21m S 26 1.0 0:02.21 **216m** apache2
The details list shows a group of apache2 processes are using SWAP memory about 210m+ each, but the summary reports only 3980k is used. The total SWAP memory in the detail list is much greater than in the summary. Do the two swap refer the same thing?
Quoted from http://www.linuxforums.org/articles/using-top-more-efficiently_89.html :
VIRT=RES+SWAP
As explained previously, VIRT includes anything inside task's
address space, no matter it is in RAM,
swapped out or still not loaded from
disk. While RES represents total RAM
consumed by this task. So, SWAP here
means it represents the total amount
of data being swapped out OR still not
loaded from disk. Don't be fooled by
the name, it doesn't just represent
the swapped out data.

Resources