
Why KernelStack > ThreadCount*16k
Every thread has a kernel stack with a size of 16k, so I tried to count the number of threads in the system with
[root@docker31 ~]# ps -eT | wc -l
714
and got KernelStack from /proc/meminfo with
[root@docker31 ~]# cat /proc/meminfo | grep KernelStack
KernelStack: 12640 kB
If each thread has a 16k kernel stack,
the total kernel stack size should be 714*16k = 11424k,
but the KernelStack value in /proc/meminfo is 12640k, which is 1216k (76*16k) more than the thread count would suggest.
What is the extra 1216k? Is it the interrupt stacks, one per CPU?
I searched the source code of 3.10.0-975.el7 and found that the KernelStack counter in /proc/meminfo is updated in the do_fork->copy_process->dup_task_struct->account_kernel_stack operation, so I think it should match the thread count,
but in fact they are not equal. Why?
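For reference, a quick sketch to compare the two numbers directly (this assumes every per-thread stack is 16k; the variable names are just illustrative):
threads=$(ps -eT --no-headers | wc -l)   # thread count without the ps header line
expected=$((threads * 16))
actual=$(awk '/^KernelStack:/ {print $2}' /proc/meminfo)
echo "expected ${expected} kB, actual ${actual} kB, diff $((actual - expected)) kB"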

Related

Is there an equivalent for time([some command]) for checking peak memory usage of a bash command?

I want to figure out how much memory a specific command uses but I'm not sure how to check for the peak memory of the command. Is there anything like the time([command]) usage but for memory?
Basically, I'm going to have to run an interactive queue using SLURM, then test a command for a program I need to use for a single sample, see how much memory was used, then submit a bunch of jobs using that info.
Yes, time is the program that monitors other programs and shows the Maximum resident set size. It is not to be confused with the shell's time keyword, which only shows real/user/sys times. On Arch Linux you have to install time with pacman -S time; it's a separate package.
$ command time -v echo 1
1
Command being timed: "echo 1"
User time (seconds): 0.00
System time (seconds): 0.00
Percent of CPU this job got: 0%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.00
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1968
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 90
Voluntary context switches: 1
Involuntary context switches: 1
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Note:
$ type time
time is a shell keyword
$ time -V
bash: -V: command not found
real 0m0.002s
user 0m0.000s
sys 0m0.002s
$ command time -V
time (GNU Time) 1.9
$ /bin/time -V
time (GNU Time) 1.9
$ /usr/bin/time -V
time (GNU Time) 1.9
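If you only need the peak figure, GNU time can also print a custom format; for example, %M is the maximum resident set size in kilobytes (./my_program is just a placeholder):
$ /usr/bin/time -f "%M" ./my_program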

How to select huge page sizes for DPDK and malloc?

We develop a Linux application that uses DPDK and must also be heavily optimised for speed.
We must specify huge pages for use by DPDK and also for general dynamic memory allocation. For the latter we use the libhugetlbfs library.
sudo mount -t hugetlbfs none /mnt/hugetlbfs
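For what it's worth, the usual way to back malloc with huge pages through libhugetlbfs is its morecore preload; a rough sketch, where ./our_app is a placeholder rather than our actual setup:
HUGETLB_MORECORE=yes LD_PRELOAD=libhugetlbfs.so ./our_app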
We specify the huge pages in the bootcmd line as follows:
hugepagesz=1G hugepages=20 default_hugepagesz=1G
We are using CentOS 7 and the full boot cmdline is:
$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-957.el7.x86_64 root=/dev/mapper/centos-root ro crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet hugepagesz=1G hugepages=20 default_hugepagesz=1G irqaffinity=0,1 isolcpus=4-23 nosoftlockup mce=ignore_ce idle=poll
These values are fairly arbitrary. With these values, I see:
$ free -m
total used free shared buff/cache available
Mem: 47797 45041 2468 9 287 2393
Swap: 23999 0 23999
So 2.468GB of RAM is free out of 48GB: a very large amount of memory is allocated to huge pages, and I want to reduce this.
My question is what would be sensible values for them?
I am confused by the interpretation of the parameters. I see:
$ cat /proc/meminfo
<snip>
HugePages_Total: 43
HugePages_Free: 43
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB
and also:
$ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/free_hugepages
43
Why are 43 pages reported when my parameters only specify 20 pages of 1G?
I would like some guidelines on:
huge page size/quantity that I might need for DPDK?
huge page size/quantity that I might need for malloc?
I know these are highly application dependent but some guidelines would be helpful. Also, how could I detect if the huge pages were insufficient for the application?
Additional info:
$ cat /proc/mounts | grep huge
cgroup /sys/fs/cgroup/hugetlb cgroup rw,seclabel,nosuid,nodev,noexec,relatime,hugetlb 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,seclabel,relatime 0 0
Update 4 March:
My boot cmdline is now:
$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-957.el7.x86_64 root=/dev/mapper/centos-root ro crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet hugepagesz=1G hugepages=20 default_hugepagesz=1G irqaffinity=0,1 isolcpus=4-23 nosoftlockup mce=ignore_ce idle=poll transparent_hugepage=never
and transparent hugepages are disabled (I activated a custom tuned profile):
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
but I still see 43 hugepages:
$ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/free_hugepages
43
whereas I have only specified 20 on the cmdline. Why is this?
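For reference, the per-size pool can also be inspected at runtime through sysfs and, on kernels that allow resizing it there, changed independently of the boot parameters:
cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
echo 20 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages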

Why is using a pipe with sort (Linux command) slow?

I have a large text file of ~8GB on which I need to do some simple filtering and then sort all the rows. I am on a 28-core machine with an SSD and 128GB RAM. I have tried
Method 1
awk '...' myBigFile | sort --parallel=56 > myBigFile.sorted
Method 2
awk '...' myBigFile > myBigFile.tmp
sort --parallel 56 myBigFile.tmp > myBigFile.sorted
Surprisingly, method 1 takes 11.5 min while method 2 only takes (0.75 + 1 < 2) min. Why is sorting so slow when piped? Is it not parallelized?
EDIT
awk and myBigFile are not important; this experiment is repeatable by simply using seq 1 10000000 | sort --parallel 56 (thanks to @Sergei Kurenkov), and I also observed a six-fold speed improvement using the un-piped version on my machine.
When reading from a pipe, sort assumes that the file is small, and for small files parallelism isn't helpful. To get sort to utilize parallelism you need to tell it to allocate a large main memory buffer using -S. In this case the data file is about 8GB, so you can use -S8G. However, at least on your system with 128GB of main memory, method 2 may still be faster.
This is because sort in method 2 can know from the size of the file that it is huge, and it can seek in the file (neither of which is possible for a pipe). Further, since you have so much memory compared to these file sizes, the data for myBigFile.tmp need not be written to disc before awk exits, and sort will be able to read the file from cache rather than disc. So the principal difference between method 1 and method 2 (on a machine like yours with lots of memory) is that sort in method 2 knows the file is huge and can easily divide up the work (possibly using seek, but I haven't looked at the implementation), whereas in method 1 sort has to discover the data is huge, and it cannot use any parallelism in reading the input since it can't seek in the pipe.
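Applying that to method 1, something along these lines should let the piped sort parallelize (the awk filter is elided as in the question):
awk '...' myBigFile | sort -S8G --parallel=56 > myBigFile.sorted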
I think sort does not use threads when reading from a pipe.
I have used this command for your first case, and it shows that sort uses only 1 CPU even though it is told to use 4. atop also shows that there is only one thread in sort:
/usr/bin/time -v bash -c "seq 1 1000000 | sort --parallel 4 > bf.txt"
I have used this command for your second case, and it shows that sort uses 2 CPUs. atop also shows that there are four threads in sort:
/usr/bin/time -v bash -c "seq 1 1000000 > tmp.bf.txt && sort --parallel 4 tmp.bf.txt > bf.txt"
In your first scenario, sort is an I/O-bound task; it does lots of read syscalls from stdin. In your second scenario, sort uses mmap syscalls to read the file and avoids being an I/O-bound task.
Below are results for the first and second scenarios:
$ /usr/bin/time -v bash -c "seq 1 10000000 | sort --parallel 4 > bf.txt"
Command being timed: "bash -c seq 1 10000000 | sort --parallel 4 > bf.txt"
User time (seconds): 35.85
System time (seconds): 0.84
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:37.43
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 9320
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 2899
Voluntary context switches: 1920
Involuntary context switches: 1323
Swaps: 0
File system inputs: 0
File system outputs: 459136
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
$ /usr/bin/time -v bash -c "seq 1 10000000 > tmp.bf.txt && sort --parallel 4 tmp.bf.txt > bf.txt"
Command being timed: "bash -c seq 1 10000000 > tmp.bf.txt && sort --parallel 4 tmp.bf.txt > bf.txt"
User time (seconds): 43.03
System time (seconds): 0.85
Percent of CPU this job got: 175%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:24.97
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1018004
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 2445
Voluntary context switches: 299
Involuntary context switches: 4387
Swaps: 0
File system inputs: 0
File system outputs: 308160
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
You have more system calls if you use the pipe.
seq 1000000 | strace sort --parallel=56 2>&1 >/dev/null | grep read | wc -l
2059
Without the pipe the file is mapped into memory.
seq 1000000 > input
strace sort --parallel=56 input 2>&1 >/dev/null | grep read | wc -l
33
Kernel calls (system calls) are in most cases the bottleneck. That is the reason why sendfile was invented.
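A quicker way to compare the two cases is strace's per-syscall summary (-c), which prints call counts for everything instead of grepping for read:
seq 1000000 | strace -c sort --parallel=56 > /dev/null
strace -c sort --parallel=56 input > /dev/null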

How to get the tasks taking the most RAM in Linux

With the command free -g, I am able to get the total used and free RAM in Linux. But I want to understand which tasks or processes are taking the most memory, so that I can free up RAM.
total used free shared buffers cached
Mem: 125 121 4 0 6 94
-/+ buffers/cache: 20 105
Swap: 31 0 31
Go for the top command, then press Shift+F and press a for PID information.
Also check:
ps -eo pmem,vsz,pid
See man ps for the pmem, vsz and pid fields.
You can use the command below to find running processes sorted by memory use:
ps -eo pmem,pcpu,rss,vsize,args | sort -k 1 -nr | less
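Alternatively, ps can sort by memory use itself:
ps -eo pid,pmem,rss,vsize,args --sort=-rss | head -n 15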

Unknown memory utilization in Ubuntu14.04 Trusty

I'm running Ubuntu Trusty 14.04 on a new machine with 8GB of RAM, and it seems to be locking up periodically, with nothing in the syslog file. I've installed Nagios and have been watching the graphs, and it looks like memory goes from 7% to 72% in a span of just 10 minutes. Only node processes are running on the server. In top all processes show very normal memory consumption, and even after stopping the node processes, memory remains at the same utilization.
free agrees, claiming I'm using more than 5.7G of memory:
free -h
total used free shared buffers cached
Mem: 7.8G 6.5G 1.3G 2.2M 233M 612M
-/+ buffers/cache: 5.7G 2.1G
Swap: 2.0G 0B 2.0G
This other formula for totaling the memory, however, tells a very different story:
# ps -e -orss=,args= | sort -b -k1,1n | awk '{total = total + $1}END{print total}'
503612
If the processes only total 500 MiB, where's the rest of the memory going?
I've got a solution for this, so I just wanted to update the same:
echo 2 > /proc/sys/vm/drop_caches
This resolved my issue, so I have added the same to cron to run every 5 minutes on each of my Ubuntu servers.
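Since echo 2 > /proc/sys/vm/drop_caches frees reclaimable slab objects (dentries and inodes), a quick way to confirm that slab is where the memory is going, before dropping caches, is:
grep -E 'Slab|SReclaimable|SUnreclaim' /proc/meminfo
sudo slabtop -o | head -n 15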
