Different pipe size between Ubuntu and Debian - linux

I was testing file transfers with OpenBSD netcat and noticed that transferring the same file takes a bit longer on Ubuntu than on Debian. Using strace, I found that the data is transferred in 64k blocks on Ubuntu:
mgamal@ubuntu:~$ strace cat test | nc -vvvv 10.10.172.11 8888
...
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
On Debian, on the other hand, the blocks are 128k:
mgamal@ubuntu:~$ strace cat test | nc -vvvv 10.10.172.11 8888
....
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
I wrote the following piece of code on Debian to check the pipe size:
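#define _GNU_SOURCE /* needed for F_GETPIPE_SZ */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

int main(void)
{
    int pipefd[2];
    int size;

    /* Note: pipe() below overwrites both entries, so the sizes printed
       are those of a newly created pipe, not of stdin/stdout. */
    pipefd[0] = STDIN_FILENO;
    pipefd[1] = STDOUT_FILENO;
    pipe(pipefd);

    size = fcntl(pipefd[0], F_GETPIPE_SZ);
    printf("%d\n", size);
    size = fcntl(pipefd[1], F_GETPIPE_SZ);
    printf("%d\n", size);
    return 0;
}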
Running it still reports 64k:
mgamal@debian:~$ ./test
65536
65536
I also tried something other than netcat to check, and I still see 128k transfers:
root@debian:~# strace cat foo | less
...
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
I've checked the source packages for netcat, the kernel, and glibc to see whether the pipe size is set to 128k anywhere, or whether there are any calls to fcntl() that change the pipe size, but could find no trace.
Why is the pipe size reported as 64k, while the actual size appears to be 128k?

The sizes in your strace output are cat's I/O buffer size, not the capacity of the pipe; your test program correctly reports the pipe capacity as 64k. GNU cat is in the coreutils package. It does a stat or fstat on its input and output and looks at st_blksize, the optimal block size for filesystem I/O. It then takes the max of that number and a hardwired number and uses that as the buffer size for input and output. This is done in io_blksize.
Ubuntu 14 comes with coreutils 8.21. The minimum blocksize in that version is 64KiB.
Debian 8 comes with coreutils 8.23. The minimum blocksize in that version is 128KiB.

Related

Why does this strace on a pipeline not finish

I have a directory with a single file, one.txt. If I run ls | cat, it works fine. However, if I try to strace both sides of this pipeline, I do see the output of the command as well as strace, but the process doesn't finish.
strace ls 2> >(stdbuf -o 0 sed 's/^/command1:/') | strace cat 2> >(stdbuf -o 0 sed 's/^/command2:/')
The output I get is:
command2:execve("/usr/bin/cat", ["cat"], [/* 50 vars */]) = 0
command2:brk(0) = 0x1938000
command2:mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f87e5a93000
command2:access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
<snip>
command2:open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
command2:fstat(3, {st_mode=S_IFREG|0644, st_size=106070960, ...}) = 0
command2:mmap(NULL, 106070960, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f87def8a000
command2:close(3) = 0
command2:fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
command2:fstat(0, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
command2:fadvise64(0, 0, 0, POSIX_FADV_SEQUENTIAL) = -1 ESPIPE (Illegal seek)
command2:read(0, "command1:execve(\"/usr/bin/ls\", ["..., 65536) = 4985
command1:execve("/usr/bin/ls", ["ls"], [/* 50 vars */]) = 0
command1:brk(0) = 0x1190000
command1:mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fae869c3000
command1:access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
<snip>
command1:close(3) = 0
command1:fstat(1, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
command2:write(1, "command1:close(3) "..., 115) = 115
command2:read(0, "command1:mmap(NULL, 4096, PROT_R"..., 65536) = 160
command1:mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fae869c2000
one.txt
command1:write(1, "one.txt\n", 8) = 8
command2:write(1, "command1:mmap(NULL, 4096, PROT_R"..., 160) = 160
command2:read(0, "command1:close(1) "..., 65536) = 159
command1:close(1) = 0
command1:munmap(0x7fae869c2000, 4096) = 0
command1:close(2) = 0
command2:write(1, "command1:close(1) "..., 159) = 159
command2:read(0, "command1:exit_group(0) "..., 65536) = 53
command1:exit_group(0) = ?
command2:write(1, "command1:exit_group(0) "..., 53) = 53
command2:read(0, "command1:+++ exited with 0 +++\n", 65536) = 31
command1:+++ exited with 0 +++
command2:write(1, "command1:+++ exited with 0 +++\n", 31) = 31
and it hangs from then on. ps reveals that both commands in the pipeline (ls and cat here) are running.
I am on RHEL7 running Bash version 4.2.46.
I put a strace on your strace:
strace bash -c 'strace true 2> >(cat > /dev/null)'
It hangs on a wait4, indicating that it's stuck waiting on children. ps f confirms this:
24740 pts/19 Ss 0:00 /bin/bash
24752 pts/19 S+ 0:00 \_ strace true
24753 pts/19 S+ 0:00 \_ /bin/bash
24755 pts/19 S+ 0:00 \_ cat
Based on this, my working theory is that this is a deadlock, because:
strace waits on all of its children, even the ones it didn't spawn directly.
Bash spawns the process substitution as a child of the command it is attached to (here, strace). Since the process substitution reads from that command's stderr, it in turn waits for its parent to exit.
This suggests at least two workarounds, both of which appear to work:
strace -D ls 2> >(nl)
{ strace ls; true; } 2> >(nl)
-D, to quote the man page, "[runs the] tracer process as a detached grandchild, not as parent of the tracee". The second one forces bash to do another fork to run strace, by adding another command to run afterwards.
In both cases, the extra forks mean that the process substitution doesn't end up as strace's child, avoiding the issue.

Interprocess communication via Pipes

Processes on Linux can communicate with each other through a special file called a pipe.
One process writes to the pipe and another process reads from it.
My questions are:
Are these write and read operations performed in parallel during the communication? And if not,
what happens when one of the processes enters the SLEEP state during the communication? Does it first perform its write so that the second process can read, or does it go to sleep directly without performing any write or read operation?
The sending process can write until the pipe buffer is full (64k on Linux since 2.6.11). After that, write(2) will block.
The receiving process will block until data is available to read(2).
For a more detailed look into pipe buffering, look at https://unix.stackexchange.com/a/11954.
For example, this program
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

int
main(int argc, char *argv[])
{
    int pipefd[2];
    pid_t cpid;
    char wbuf[32768];
    char buf[16384];

    /* Initialize writer buffer with 012...89 sequence */
    for (size_t i = 0; i < sizeof(wbuf); i++)
        wbuf[i] = '0' + i % 10;

    if (pipe(pipefd) == -1) {
        perror("pipe");
        exit(EXIT_FAILURE);
    }

    cpid = fork();
    if (cpid == -1) {
        perror("fork");
        exit(EXIT_FAILURE);
    }

    if (cpid == 0) {            /* Child reads from pipe */
        close(pipefd[1]);       /* Close unused write end */
        while (read(pipefd[0], buf, sizeof(buf)) > 0)
            ;                   /* Discard the data until EOF */
        close(pipefd[0]);
        _exit(EXIT_SUCCESS);
    } else {                    /* Parent writes sequence to pipe */
        close(pipefd[0]);       /* Close unused read end */
        for (int i = 0; i < 5; i++)
            write(pipefd[1], wbuf, sizeof(wbuf));
        close(pipefd[1]);       /* Reader will see EOF */
        wait(NULL);             /* Wait for child */
        exit(EXIT_SUCCESS);
    }
}
will produce the following sequence when run with gcc pipes.c && strace -e trace=open,close,read,write,pipe,clone -f ./a.out:
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
close(3) = 0
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320\3\2\0\0\0\0\0"..., 832) = 832
close(3) = 0
pipe([3, 4]) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f32117489d0) = 21114
close(3) = 0
write(4, "01234567890123456789012345678901"..., 32768) = 32768
write(4, "01234567890123456789012345678901"..., 32768) = 32768
write(4, "01234567890123456789012345678901"..., 32768strace: Process 21114 attached
<unfinished ...>
[pid 21114] close(4) = 0
[pid 21114] read(3, "01234567890123456789012345678901"..., 16384) = 16384
[pid 21114] read(3, <unfinished ...>
[pid 21113] <... write resumed> ) = 32768
[pid 21114] <... read resumed> "45678901234567890123456789012345"..., 16384) = 16384
[pid 21113] write(4, "01234567890123456789012345678901"..., 32768 <unfinished ...>
[pid 21114] read(3, "01234567890123456789012345678901"..., 16384) = 16384
[pid 21114] read(3, <unfinished ...>
[pid 21113] <... write resumed> ) = 32768
[pid 21114] <... read resumed> "45678901234567890123456789012345"..., 16384) = 16384
[pid 21113] write(4, "01234567890123456789012345678901"..., 32768 <unfinished ...>
[pid 21114] read(3, "01234567890123456789012345678901"..., 16384) = 16384
[pid 21114] read(3, <unfinished ...>
[pid 21113] <... write resumed> ) = 32768
[pid 21114] <... read resumed> "45678901234567890123456789012345"..., 16384) = 16384
[pid 21113] close(4) = 0
[pid 21114] read(3, "01234567890123456789012345678901"..., 16384) = 16384
[pid 21114] read(3, "45678901234567890123456789012345"..., 16384) = 16384
[pid 21114] read(3, "01234567890123456789012345678901"..., 16384) = 16384
[pid 21114] read(3, "45678901234567890123456789012345"..., 16384) = 16384
[pid 21114] read(3, "", 16384) = 0
[pid 21114] close(3) = 0
[pid 21114] +++ exited with 0 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=21114, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
+++ exited with 0 +++
You'll notice that the reads and writes are interleaved and that the writing and reading processes will block a few times as either the pipe is full or not enough data is available for reading.

Measure total time spent by a process on IO

When run under the time command, one of my programs gives the following output:
real 1m33.523s
user 0m15.156s
sys 0m1.312s
Here the real time and the user+sys time differ a lot. This is most likely due to time spent in I/O wait or I/O calls. I want to measure the total time the program spends in I/O wait or I/O calls. Is there any way to do that?
I tried using iotop. However, it does not report the total time the program spends performing I/O.
Yes: strace can provide per-system-call statistics (strace -c).
Example 1
I want to measure time spent on I/O while accessing stackoverflow.com:
$ time curl stackoverflow.com >/dev/null 2>&1
curl stackoverflow.com > /dev/null 2>&1 0.00s user 0.01s system 2% cpu 0.392 total
OK, 2% CPU and 0.01 s in system. Let's find out:
$ strace -c curl stackoverflow.com >/dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  240k    0  240k    0     0   127k      0 --:--:--  0:00:01 --:--:--  130k
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 54.12    0.005497          11       506           write
 18.16    0.001845          43        43           fstat
 11.95    0.001214          30        41           poll
  5.75    0.000584          32        18           recvfrom
  3.40    0.000345           3       101           mmap
  2.51    0.000255           4        62           mprotect
  1.98    0.000201           4        50           close
  1.84    0.000187          31         6           getsockname
  0.29    0.000029           1        42         1 open
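It is especially useful to compare these results with the results measured when running curl without arguments.
In any case, strace shows that curl spends most of its system-call time in write, fstat and poll.
Another example
The first approach seems to show incorrect results for sleep, probably because strace -c by default summarises system (kernel CPU) time rather than wall-clock time; newer strace versions accept -w together with -c to summarise wall-clock time instead. If you are not satisfied with the first approach, you can print the time spent in each individual syscall (strace -T), then process that output to compute the total time per syscall.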
$ strace 2>&1 -T curl stackoverflow.com >/dev/null | head -n 20
execve("/usr/bin/curl", ["curl", "stackoverflow.com"], [/* 62 vars */]) = 0 <0.000219>
brk(0) = 0x186e000 <0.000175>
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc04c9e6000 <0.000166>
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) <0.000238>
open("/etc/ld.so.cache", O_RDONLY) = 3 <0.000144>
fstat(3, {st_mode=S_IFREG|0644, st_size=96498, ...}) = 0 <0.000175>
mmap(NULL, 96498, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fc04c9ce000 <0.000164>
close(3) = 0 <0.000160>
open("/usr/lib64/libcurl.so.4", O_RDONLY) = 3 <0.000047>
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240\333\300\">\0\0\0"..., 832) = 832 <0.000160>
fstat(3, {st_mode=S_IFREG|0755, st_size=346008, ...}) = 0 <0.000216>
mmap(0x3e22c00000, 2438600, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3e22c00000 <0.000189>
mprotect(0x3e22c51000, 2097152, PROT_NONE) = 0 <0.000032>
mmap(0x3e22e51000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x51000) = 0x3e22e51000 <0.000119>
close(3) = 0 <0.000110>
open("/lib64/libidn.so.11", O_RDONLY) = 3 <0.000257>
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0/#U1\0\0\0"..., 832) = 832 <0.000051>
fstat(3, {st_mode=S_IFREG|0755, st_size=209088, ...}) = 0 <0.000041>
mmap(0x3155400000, 2301736, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3155400000 <0.000037>
mprotect(0x3155432000, 2093056, PROT_NONE) = 0 <0.000037>

Implicit system calls in UNIX commands

I've been studying UNIX and system calls and I came across a tricky low-level question. The question asks which system calls are made for this command:
grep word1 word2 > file.txt
I did some research but could not find many resources on the underlying UNIX calls. However, it seems to me that the answer would be open (to open file.txt and obtain a file descriptor for it), then dup2 (to make the STDOUT of grep refer to the descriptor returned by open), then write to write the STDOUT of grep (which is now the file descriptor of file.txt), and finally close(), to close the file descriptor of file.txt. However, I have no idea whether I am right or on the correct path. Can anyone with experience in UNIX enlighten me on this topic?
You are on the right track. This command is very helpful for tracing the system calls made by any program:
strace
On my PC it shows the following output (without the stream redirection, which the shell sets up before grep starts and which therefore would not appear in grep's own trace):
$ strace grep abc ss.txt
execve("/bin/grep", ["grep", "abc", "ss.txt"], [/* 237 vars */]) = 0
brk(0) = 0x13de000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1785694000
close(3) = 0
ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
stat("ss.txt", {st_mode=S_IFREG|0644, st_size=13, ...}) = 0
open("ss.txt", O_RDONLY) = 3
ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffa0e4f370) = -1 ENOTTY (Inappropriate ioctl for device)
read(3, "abc\n123\n321\n\n", 32768) = 13
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f178568c000
write(1, "abc\n", 4abc
) = 4
read(3, "", 32768) = 0
close(3) = 0
close(1) = 0
munmap(0x7f178568c000, 4096) = 0
close(2) = 0
exit_group(0) = ?

"Invalid argument" setting key "net.core.somaxconn"

I tried tuning Linux kernel parameters. After editing /etc/sysctl.conf and running sysctl -p,
it shows this error:
"Invalid argument" setting key "net.core.somaxconn"
Linux distribution: Ubuntu 12.04.4 LTS, x86_64, 3.2.0-60-generic
$ cat /etc/sysctl.conf
net.ipv4.conf.eth0.arp_notify = 1
vm.swappiness = 0
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_wmem = 4096 16384 4194304
net.core.wmem_default = 8388608
net.core.rmem_default = 8388608
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.somaxconn = 262144
net.core.netdev_max_backlog = 262144
fs.file-max = 1048576
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.tcp_max_syn_backlog = 409600
net.ipv4.ip_local_port_range = 1024 65000
net.ipv4.tcp_max_orphans = 262144
Can I increase the net.core.somaxconn to 262144?
I ran into the same issue when I tried to fine-tune my nginx. It is caused by a patch that was applied to the Ubuntu kernel. The patch description explains:
The sk_max_ack_backlog field of the sock structure is defined as unsigned short. Therefore, the backlog argument in inet_listen() shouldn't exceed USHRT_MAX. The backlog argument in the listen() syscall is truncated to the somaxconn value. So, the somaxconn value shouldn't exceed 65535 (USHRT_MAX).
In short, for net.core.somaxconn to be accepted, you should not give it a value greater than 65535:
net.core.somaxconn = 65535
Unfortunately, we have to live with this limit unless you are willing to patch your kernel yourself:
https://lists.ubuntu.com/archives/kernel-team/2013-October/033041.html
