Why is using a pipe for sort (Linux command) slow?


I have a large text file (~8 GB) that I need to run some simple filtering on and then sort all the rows. I am on a 28-core machine with an SSD and 128 GB of RAM. I have tried:
Method 1
awk '...' myBigFile | sort --parallel=56 > myBigFile.sorted
Method 2
awk '...' myBigFile > myBigFile.tmp
sort --parallel 56 myBigFile.tmp > myBigFile.sorted
Surprisingly, method 1 takes 11.5 min, while method 2 takes less than 2 min in total (0.75 + 1 for the two steps). Why is sorting so slow when piped? Is it not parallelized?
EDIT
awk and myBigFile are not important: the experiment is repeatable by simply using seq 1 10000000 | sort --parallel 56 (thanks to @Sergei Kurenkov), and I also observed a six-fold speed improvement with the un-piped version on my machine.

When reading from a pipe, sort assumes that the input is small, and for small inputs parallelism isn't helpful. To get sort to use parallelism you need to tell it to allocate a large main-memory buffer using -S. In this case the data file is about 8 GB, so you can use -S8G. However, at least on your system with 128 GB of main memory, method 2 may still be faster.
This is because sort in method 2 can tell from the size of the file that it is huge, and it can seek in the file (neither of which is possible for a pipe). Further, since you have so much memory compared to these file sizes, the data for myBigFile.tmp need not be written to disk before awk exits, and sort will be able to read the file from the cache rather than from disk. So the principal difference between method 1 and method 2 (on a machine like yours with lots of memory) is that sort in method 2 knows the file is huge and can easily divide up the work (possibly using seek, but I haven't looked at the implementation), whereas in method 1 sort has to discover that the data is huge, and it cannot use any parallelism in reading the input since it can't seek in the pipe.
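A minimal sketch of the -S suggestion, assuming GNU sort (-S and --parallel are GNU options). The seq input is only a small stand-in for awk '...' myBigFile, 64M stands in for -S8G, and the row count and temp file are assumptions chosen so that it runs quickly:

```shell
# Small stand-in for: awk '...' myBigFile | sort -S8G --parallel=56 > myBigFile.sorted
n=200000                 # assumed row count; scale up to reproduce the timings
sorted=$(mktemp)         # assumed output location

# -S requests the main-memory buffer up front instead of letting sort
# pick a size for an input whose length it cannot see through a pipe.
seq "$n" -1 1 | sort -n -S 64M --parallel=4 > "$sorted"

# sort -c exits non-zero if the file is not in order
sort -n -c "$sorted" && echo "sorted $n lines"
```

Whether this closes the whole gap to method 2 depends on the machine, so it is worth timing both variants on the real data.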

I think sort does not use threads when reading from a pipe.
I used this command for your first case. It shows that sort uses only 1 CPU even though it is told to use 4; atop also shows that there is only one thread in sort:
/usr/bin/time -v bash -c "seq 1 1000000 | sort --parallel 4 > bf.txt"
I used this command for your second case. It shows that sort uses 2 CPUs; atop also shows that there are four threads in sort:
/usr/bin/time -v bash -c "seq 1 1000000 > tmp.bf.txt && sort --parallel 4 tmp.bf.txt > bf.txt"
In your first scenario sort is an I/O-bound task: it makes lots of read syscalls on stdin. In your second scenario sort uses mmap syscalls to read the file, which avoids being I/O bound.
Below are results for the first and second scenarios:
$ /usr/bin/time -v bash -c "seq 1 10000000 | sort --parallel 4 > bf.txt"
Command being timed: "bash -c seq 1 10000000 | sort --parallel 4 > bf.txt"
User time (seconds): 35.85
System time (seconds): 0.84
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:37.43
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 9320
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 2899
Voluntary context switches: 1920
Involuntary context switches: 1323
Swaps: 0
File system inputs: 0
File system outputs: 459136
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
$ /usr/bin/time -v bash -c "seq 1 10000000 > tmp.bf.txt && sort --parallel 4 tmp.bf.txt > bf.txt"
Command being timed: "bash -c seq 1 10000000 > tmp.bf.txt && sort --parallel 4 tmp.bf.txt > bf.txt"
User time (seconds): 43.03
System time (seconds): 0.85
Percent of CPU this job got: 175%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:24.97
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1018004
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 2445
Voluntary context switches: 299
Involuntary context switches: 4387
Swaps: 0
File system inputs: 0
File system outputs: 308160
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
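To repeat the comparison without GNU time, both variants can be timed in one script. This is only a sketch: now_ms is a hypothetical helper that relies on GNU date's %N, the input size is reduced so it runs quickly, and the temp files are arbitrary.

```shell
n=200000   # assumed size; the measurements above used 10000000
tmp=$(mktemp); piped=$(mktemp); unpiped=$(mktemp)

# Milliseconds since the epoch (GNU date: %N is nanoseconds)
now_ms() { echo $(( $(date +%s%N) / 1000000 )); }

t0=$(now_ms)
seq 1 "$n" | sort --parallel=4 > "$piped"                      # first scenario
t1=$(now_ms)
seq 1 "$n" > "$tmp" && sort --parallel=4 "$tmp" > "$unpiped"   # second scenario
t2=$(now_ms)

echo "piped: $((t1 - t0)) ms, via temp file: $((t2 - t1)) ms"
# Both variants must produce the same result for the timing to be comparable
cmp -s "$piped" "$unpiped" && echo "outputs identical"
```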

You get more system calls if you use the pipe.
seq 1000000 | strace sort --parallel=56 2>&1 >/dev/null | grep read | wc -l
2059
Without the pipe the file is mapped into memory.
seq 1000000 > input
strace sort --parallel=56 input 2>&1 >/dev/null | grep read | wc -l
33
Kernel calls are in most cases the bottleneck. That is the reason why sendfile was invented.

Related

Is there an equivalent for time([some command]) for checking peak memory usage of a bash command?

I want to figure out how much memory a specific command uses but I'm not sure how to check for the peak memory of the command. Is there anything like the time([command]) usage but for memory?
Basically, I'm going to have to run an interactive queue using SLURM, then test a command for a program I need to use for a single sample, see how much memory was used, then submit a bunch of jobs using that info.

Yes: time (GNU Time) is a program that monitors a command and reports its maximum resident set size. It is not to be confused with the time Bash keyword, which only shows real/user/sys times. On my Arch Linux system it is a separate package that you have to install with pacman -S time.
$ command time -v echo 1
1
Command being timed: "echo 1"
User time (seconds): 0.00
System time (seconds): 0.00
Percent of CPU this job got: 0%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.00
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1968
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 90
Voluntary context switches: 1
Involuntary context switches: 1
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Note:
$ type time
time is a shell keyword
$ time -V
bash: -V: command not found
real 0m0.002s
user 0m0.000s
sys 0m0.002s
$ command time -V
time (GNU Time) 1.9
$ /bin/time -V
time (GNU Time) 1.9
$ /usr/bin/time -V
time (GNU Time) 1.9
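If only the peak figure is needed, for example to size a batch job, GNU time's -f option can print it on its own. A minimal sketch, assuming GNU time is installed; peak_rss_kb is a hypothetical helper name, and sleep 0.1 is a placeholder for the real command:

```shell
# Print the peak resident set size, in kB, of the given command.
# Assumes GNU time; %M is its format specifier for maximum RSS.
peak_rss_kb() {
    report=$(mktemp)
    # "command" bypasses the shell's time keyword; -o keeps the report
    # separate from the command's own stderr.
    command time -f '%M' -o "$report" "$@" > /dev/null
    # On a non-zero exit GNU time prepends a status line, so take the last line
    tail -n 1 "$report"
    rm -f "$report"
}

peak_rss_kb sleep 0.1
```

The helper discards the command's stdout; drop the redirection if the output is needed.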

Why KernelStack > ThreadCount*16k

Every thread has a kernel stack with a size of 16k, so I tried to count the number of threads in the system with
[root@docker31 ~]# ps -eT | wc -l
714
and got KernelStack from /proc/meminfo with
[root@docker31 ~]# cat /proc/meminfo | grep KernelStack
KernelStack: 12640 kB
If each thread has a 16k kernel stack,
the total kernel stack size should be 714*16k = 11424k,
but the KernelStack value from /proc/meminfo is 12640k, which is 1216k (76*16k) more than the thread count predicts.
What is the extra 1216k? Is it the per-CPU interrupt stack?
I searched the source code of 3.10.0-975.el7 and found that KernelStack in /proc/meminfo is accounted in the do_fork->copy_process->dup_task_struct->account_kernel_stack operation, so I think it should match the thread count,
but in fact they are not equal. Why?
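The arithmetic in the question can be reproduced with a few lines of shell. This is only a sketch: kstack_surplus is a hypothetical helper, and the 16 kB default is the per-thread stack size assumed in the question.

```shell
# Compare the reported KernelStack value with threads * stack size.
# Arguments: thread_count reported_kb [stack_kb]
kstack_surplus() {
    threads=$1
    reported_kb=$2
    stack_kb=${3:-16}    # assumed per-thread kernel stack size in kB
    expected_kb=$((threads * stack_kb))
    surplus_kb=$((reported_kb - expected_kb))
    echo "expected=${expected_kb}kB reported=${reported_kb}kB surplus=${surplus_kb}kB ($((surplus_kb / stack_kb)) stacks)"
}

# Values from the question
kstack_surplus 714 12640
# prints: expected=11424kB reported=12640kB surplus=1216kB (76 stacks)
```

Note that ps -eT | wc -l also counts the header line that ps prints, so the thread count it yields is one too high; ps -eT --no-headers | wc -l avoids that when feeding live values into the helper.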

.NET Core application on Linux slow to crash the first time

I'm running a .NET Core 2.0 console EXE on Ubuntu 16.04. The application terminates with an unhandled exception - which can be anything, e.g. this is enough:
static void Main(string[] args)
{
throw new ApplicationException("Test .NET core exception.");
}
The first time this happens on a given machine there is noticeable delay (2 seconds or so) between the time it prints
Unhandled Exception: System.ApplicationException: Test .NET core exception.
and the time it prints
Aborted (core dumped)
and the process dies. During this time the process uses all of one CPU. (No core file is actually created.)
If I run the application again there is no such delay any more. What is it doing during that first crash and how do I prevent this delay to let it crash quickly? The delay is only a second for a trivial application that just throws an exception, but can be 20-30 seconds for one that uses a lot of RAM.
Running with /usr/bin/time -v dotnet CoreConsoleApp1.dll the first time:
Unhandled Exception: System.ApplicationException: Test .NET core exception.
at CoreConsoleApp1.Program.Main(String[] args) in CoreConsoleApp1\Program.cs:line 20
Command terminated by signal 6
Command being timed: "dotnet CoreConsoleApp1.dll"
User time (seconds): 0.04
System time (seconds): 0.02
Percent of CPU this job got: 1%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:05.92
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 32388
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 193
Minor (reclaiming a frame) page faults: 2708
Voluntary context switches: 1358
Involuntary context switches: 29
Swaps: 0
File system inputs: 57736
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
The second time:
Unhandled Exception: System.ApplicationException: Test .NET core exception.
at CoreConsoleApp1.Program.Main(String[] args) in CoreConsoleApp1\Program.cs:line 20
Command terminated by signal 6
Command being timed: "dotnet CoreConsoleApp1.dll"
User time (seconds): 0.03
System time (seconds): 0.00
Percent of CPU this job got: 26%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.16
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 29416
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 2413
Voluntary context switches: 26
Involuntary context switches: 33
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
The project was built in Visual Studio 2017 on Windows.
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.4 LTS
Release: 16.04
Codename: xenial
$ uname -a
Linux evgeny-linux 4.4.0-1052-aws #61-Ubuntu SMP Mon Feb 12 23:05:58 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
$ cat /proc/sys/kernel/core_pattern
|/usr/share/apport/apport %p %s %c %d %P

Why is the system CPU time (% sy) high?

I am running a script that loads big files. I ran the same script on a single-core openSUSE server and on a quad-core PC. As expected, it is much faster on my PC than on the server. But the script slows the server down and makes it impossible to do anything else.
My script is
for 100 iterations
Load saved data (about 10 MB)
time myscript (in PC)
real 0m52.564s
user 0m51.768s
sys 0m0.524s
time myscript (in server)
real 32m32.810s
user 4m37.677s
sys 12m51.524s
I wonder why "sys" is so high when I run the code on the server. I used the top command to check the memory and CPU usage.
It seems there is still free memory, so swapping is not the reason. % sy is very high, which is probably the reason for the server's slowness, but I don't know what is causing it. The process using the highest percentage of CPU (99%) is "myscript". %wa is zero in the screenshot, but sometimes it gets very high (50%).
When the script is running, the load average is greater than 1, but I have never seen it reach 2.
I also checked my disc:
strt:~ # hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 16480 MB in 2.00 seconds = 8247.94 MB/sec
Timing buffered disk reads: 20 MB in 3.44 seconds = 5.81 MB/sec
john@strt:~> df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 245G 102G 131G 44% /
udev 4.0G 152K 4.0G 1% /dev
tmpfs 4.0G 76K 4.0G 1% /dev/shm
I have checked these things, but I am still not sure what the real problem on my server is or how to fix it. Can anyone identify a probable reason for the slowness? What could be the solution?
Or is there anything else I should check?
Thanks!

You're getting high sys activity because loading the data involves system calls that run in the kernel. It might be possible to resolve the slowness without upgrading the server: you can modify the scheduling priority. See the man pages for nice and renice, and especially:
Niceness values range from -20 (the highest priority, lowest niceness) to 19 (the lowest priority, highest niceness).
$ ps -lp 941
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
4 S 0 941 1 0 70 -10 - 1713 poll_s ? 00:00:00 sshd
$ nice -n 19 ./test.sh
My niceness value is 19
$ renice -n 10 -p 941
941 (process ID) old priority -10, new priority 10
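The contents of test.sh are not shown above. As a hypothetical stand-in, the one-liner below relies on the fact that nice, when run without arguments, prints the niceness it is currently running with:

```shell
# Hypothetical stand-in for test.sh: the inner "nice" (no arguments)
# prints the niceness inherited from the outer "nice -n 19".
nice -n 19 sh -c 'echo "My niceness value is $(nice)"'
```

For the script in the question, the equivalent would be to start it as nice -n 19 myscript, so that other processes on the single-core server are preferred by the scheduler.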

How to measure CPU usage

I would like to log CPU usage at a frequency of 1 second.
One possible way to do it is via vmstat 1 command.
The problem is that the time between each output is not always exactly one second, especially on a busy server. I would like to be able to output the timestamp along with the CPU usage every second. What would be a simple way to accomplish this, without installing special tools?

There are many ways to do that. Besides top, another way is to use the sar utility. Something like
sar -u 1 10
will report the CPU utilization 10 times at 1-second intervals. At the end it prints averages for each of sys, user, iowait, and idle.
Another utility is mpstat, which gives you similar output to sar.
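If neither tool is installed, a rough alternative on Linux is to sample the aggregate cpu line of /proc/stat twice and compute the busy share of the difference. This is a different technique from sar, sketched under the assumption of the usual /proc/stat field order (user, nice, system, idle, iowait, irq, softirq, steal); cpu_sample and cpu_usage are hypothetical helper names.

```shell
# Print "total idle" in clock ticks from the first (aggregate) line of /proc/stat.
cpu_sample() {
    read -r _ user nice system idle iowait irq softirq steal _ < /proc/stat
    echo "$((user + nice + system + idle + iowait + irq + softirq + steal)) $((idle + iowait))"
}

# Print a timestamp and the busy percentage over one interval (default 1 s).
cpu_usage() {
    interval=${1:-1}
    set -- $(cpu_sample); total1=$1; idle1=$2
    sleep "$interval"
    set -- $(cpu_sample); total2=$1; idle2=$2
    dtotal=$((total2 - total1))
    didle=$((idle2 - idle1))
    busy=0
    # Guard against a zero delta, e.g. if the counters did not advance
    if [ "$dtotal" -gt 0 ]; then
        busy=$((100 * (dtotal - didle) / dtotal))
    fi
    echo "$(date +%Y-%m-%dT%H:%M:%S%z) cpu_busy=${busy}%"
}

cpu_usage 1
```

Running it in a loop, for example while :; do cpu_usage 1; done >> /tmp/cpu.log, stamps each sample with the time it was actually taken, so an occasional late sample on a busy server is still labelled correctly.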

Use the well-known UNIX tool top that is normally available on Linux systems:
top -b -d 1 > /tmp/top.log
The first line of each output block from top contains a timestamp.
I see no command line option to limit the number of rows that top displays.
Section 5a. SYSTEM Configuration File and 5b. PERSONAL Configuration File of the top man page describes pressing W when running top in interactive mode to create a $HOME/.toprc configuration file.
I did this, then edited my .toprc file and changed all maxtasks values so that they are maxtasks=4. Then top only displays 4 rows of output.
For completeness, the alternative way to do this using pipes is:
top -b -d 1 | awk '/load average/ {n=10} {if (n-- > 0) {print}}' > /tmp/top.log

You might want to try htop and atop. htop is beautifully interactive while atop gathers information and can report CPU usage even for terminated processes.

I found a neat way to get the timestamp information to be displayed along with the output of vmstat.
Sample command:
vmstat -n 1 3 | while read line; do echo "$(date --iso-8601=seconds) $line"; done
Output:
2013-09-13T14:01:31-0700 procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
2013-09-13T14:01:31-0700 r b swpd free buff cache si so bi bo in cs us sy id wa
2013-09-13T14:01:31-0700 1 1 4197640 29952 124584 12477708 12 5 449 147 2 0 7 4 82 7
2013-09-13T14:01:32-0700 3 0 4197780 28232 124504 12480324 392 180 15984 180 1792 1301 31 15 38 16
2013-09-13T14:01:33-0700 0 1 4197656 30464 124504 12477492 344 0 2008 0 1892 1929 32 14 43 10
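The loop can also be wrapped in a small reusable filter. A minimal sketch: with_timestamp is a hypothetical name, IFS= read -r is used so that leading spaces and backslashes in each line are kept as they are, and the date format string is a portable spelling of the timestamp shown above.

```shell
# Prefix every line read from stdin with the current timestamp.
with_timestamp() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date +%Y-%m-%dT%H:%M:%S%z)" "$line"
    done
}

# Demo with fixed input; in practice: vmstat -n 1 | with_timestamp
printf ' r  b   swpd\n 1  1 4197640\n' | with_timestamp
```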

To monitor disk usage, CPU, and load, I created a small Bash script that writes the values to a log file every 10 seconds.
This log file is processed by Logstash, Kibana, and Riemann.
#!/usr/bin/env bash
LOGPATH="/var/log/systemstatus.log"
# Define a timestamp function
timestamp() {
    date +"%Y-%m-%dT%T.%N"
}
# Take one set of samples every 10 seconds
while sleep 10; do
    # server load
    echo -n "$(timestamp) linux::systemstatus::load " >> "$LOGPATH"
    cat /proc/loadavg >> "$LOGPATH"
    # CPU usage (third line of top's summary area)
    echo -n "$(timestamp) linux::systemstatus::cpu " >> "$LOGPATH"
    top -bn 1 | sed -n 3p >> "$LOGPATH"
    # disk usage
    echo -n "$(timestamp) linux::systemstatus::storage " >> "$LOGPATH"
    df --total | grep total | sed "s/total//g" | sed 's/^ *//' >> "$LOGPATH"
done
