OpenMPI cannot fully utilize 10 GE

I tried to exchange data between two machines connected with 10GE. The data size is large enough (8 GB) that I expected network utilization near the maximum, but I observed completely different behavior.
To check the throughput I used two different programs, nethogs and nload. Both show that network utilization is much lower than expected. Moreover, the results are unpredictable: sometimes the incoming and outgoing channels are utilized simultaneously, but sometimes transmission and reception are separated, as if the channel were half-duplex. Sample output of nload:
Device enp1s0f0 [192.168.0.11] (1/1):
Incoming:
Curr: 0.00 GBit/s
Avg: 2.08 GBit/s
Min: 0.00 GBit/s
Max: 6.32 GBit/s
Ttl: 57535.38 GByte
Outgoing:
Curr: 0.00 GBit/s
Avg: 2.09 GBit/s
Min: 0.00 GBit/s
Max: 6.74 GBit/s
Ttl: 57934.64 GByte
The code I use is here:
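#include <complex>
#include <cstdint>
#include <cstdlib>
#include <vector>

#include <boost/mpi.hpp>

int main(int argc, char** argv) {
    boost::mpi::environment env{};
    boost::mpi::communicator world{};
    boost::mpi::request reqs[2];

    // Exponent of the element count; can be overridden on the command line.
    int k = 10;
    if (argc > 1)
        k = std::atoi(argv[1]);
    uint64_t n = (1ul << k);

    // Send buffer filled with this rank's number, plus a receive buffer.
    std::vector<std::complex<double>> sv(n, world.rank());
    std::vector<std::complex<double>> rv(n);

    // With two ranks, each rank exchanges data with the other one.
    int dest = world.rank() == 0 ? 1 : 0;
    int src = dest;

    world.barrier();
    // Post the non-blocking receive and send, then wait for both.
    reqs[0] = world.irecv(src, 0, rv.data(), n);
    reqs[1] = world.isend(dest, 0, sv.data(), n);
    boost::mpi::wait_all(reqs, reqs + 2);
    return 0;
}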
And here is the command I use to run on cluster:
mpirun --mca btl_tcp_if_include 192.168.0.0/24 --hostfile ./host_file -n 2 --bind-to core /path/to/shared/folder/mpi_exp 29
29 here means that 2^(29 + 4) = 8 GBytes will be sent
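The size claim can be checked with a few lines of arithmetic. This is only a sketch: it assumes that std::complex<double> occupies 16 bytes (two 8-byte doubles) and that the link carries an ideal 10 Gbit/s with no protocol overhead.

```python
# Worked arithmetic for the run with k = 29.
k = 29
elements = 1 << k                 # n = 2^29 complex numbers per buffer
bytes_per_element = 16            # assumption: std::complex<double> = 2 x 8 bytes

total_bytes = elements * bytes_per_element
total_gib = total_bytes / 2**30

# Idealised transfer time per direction at 10 Gbit/s (assumption: no overhead).
ideal_seconds = total_bytes * 8 / 10e9

print(total_bytes, total_gib, round(ideal_seconds, 2))
```

Each rank sends one such buffer and receives another, so on a full-duplex link both directions should ideally finish in roughly the same few seconds.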
What I have done:
Proved that there is no hardware problem by successful saturation of the channel with netcat.
Checked with tcpdump that the size of TCP packets during the communication is unstable and rarely reaches the maximum size (in the netcat case it is stable).
Checked with strace that socket operations are correct.
Checked TCP parameters in sysctl - they are ok.
Could you please advise me why OpenMPI doesn't work as expected?
EDIT (14.08.2018):
Finally I was able to continue to dig into this problem. Below is the output of OSU bandwidth benchmark (it was run without any mca options):
# OSU MPI Bandwidth Test v5.3
# Size Bandwidth (MB/s)
1 0.50
2 0.98
4 1.91
8 3.82
16 6.92
32 10.32
64 22.03
128 43.95
256 94.74
512 163.96
1024 264.90
2048 400.01
4096 533.47
8192 640.02
16384 705.02
32768 632.03
65536 667.29
131072 842.00
262144 743.82
524288 654.09
1048576 775.50
2097152 759.44
4194304 774.81
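To compare these figures with the 10 Gbit/s line rate, the bandwidth column can be converted to Gbit/s. The sketch below assumes the benchmark's "MB" means 10^6 bytes; if it means 2^20 bytes, the results are about 5% higher.

```python
def mb_per_s_to_gbit_per_s(mb_per_s):
    """Convert MB/s (assumed to be 10^6 bytes per second) to Gbit/s."""
    return mb_per_s * 1e6 * 8 / 1e9

LINE_RATE_GBIT = 10.0

# Two rows taken from the table above: the best result and the largest message.
for size, bandwidth in ((131072, 842.00), (4194304, 774.81)):
    gbit = mb_per_s_to_gbit_per_s(bandwidth)
    print(size, round(gbit, 2), "Gbit/s,", round(100 * gbit / LINE_RATE_GBIT), "% of line rate")
```

Under that assumption, even the best row stays below 7 Gbit/s, which is in the same range as the maxima reported by nload.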
Actually, I think this poor performance occurs because the transfer is CPU-bound. Each MPI process is single-threaded by default and is simply not able to saturate the 10GE channel.
I know it is possible to communicate from several threads by enabling multithreading when building OpenMPI, but that approach increases complexity at the application level.
So is it possible to have multithreaded sending/receiving inside OpenMPI, at the level responsible for point-to-point data transfer?


Linux SLUB: Unable to allocate memory on node

We very frequently get the message below in /var/log/messages
kernel: SLUB: Unable to allocate memory on node -1 (gfp=0x8020)
In some cases followed by an allocation table
kernel: cache: sigqueue(12019:454c4ebd186d964699132181ad7367c669700f7d8991c47d4bc053ed101675bc), object size: 160, buffer size: 160, default order: 0, min order: 0
kernel: node 0: slabs: 57, objs: 23313, free: 0
kernel: node 1: slabs: 35, objs: 14315, free: 0
OK, free is 0, but how can this be tuned?
Setup information:
OS - Centos7.3
Kernel - 3.10.0-327.36.3.el7.x86_64
Docker - 1.12.6
Kubernetes - 1.5.5
We have a private cloud powered by Kubernetes with 10 nodes. It was working fine until last month, and now we get these alerts very frequently on every node; the number of pods/containers has also increased in the last few days.
We have enough memory and CPU available on each node.
Any fine-tuning advice for these alerts would be very helpful.
Additional information:
sysctl.conf options
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_max_syn_backlog = 4096
net.core.somaxconn = 1024
net.ipv4.tcp_syncookies = 1
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 65535
net.core.wmem_default = 65535
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.ip_local_port_range = 1024 65535
vm.max_map_count = 262144
vm.swappiness=10
vm.vfs_cache_pressure=100
Please look at this: https://pingcap.com/blog/try-to-fix-two-linux-kernel-bugs-while-testing-tidb-operator-in-k8s/. It's a kernel bug.
The problem seems to be with the kernel. First of all, check whether swap is properly allocated using free -m and mkswap -c; if it is not, set it up. If swap is fine, then you might need to update the kernel.

Use linux perf utility to report counters every second like vmstat

perf is a command-line utility in Linux for accessing hardware performance-monitoring counters; it works through the perf_events kernel subsystem.
perf itself has basically two modes: perf record/perf top to record a sampling profile (a sample is taken, for example, every 100000th CPU clock cycle or executed instruction), and perf stat to report the total count of cycles/executed instructions for the application (or for the whole system).
Is there a mode of perf that prints a system-wide or per-CPU summary of total counts every second (or every 3, 5, 10 seconds), as vmstat and the sysstat-family tools do (iostat, mpstat, sar -n DEV... as listed in http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html)? For example, with the cycles and instructions counters I would get the mean IPC for every second for the system (or for every CPU).
Is there any non-perf tool (in https://perf.wiki.kernel.org/index.php/Tutorial or http://www.brendangregg.com/perf.html) which can get such statistics from the perf_events kernel subsystem? What about a system-wide per-process IPC calculation with a resolution of seconds?
perf stat has an "interval-print" option, -I N, which prints the counters every N milliseconds (N>=10): http://man7.org/linux/man-pages/man1/perf-stat.1.html
-I msecs, --interval-print msecs
Print count deltas every N milliseconds (minimum: 10ms) The
overhead percentage could be high in some cases, for instance
with small, sub 100ms intervals. Use with caution. example: perf
stat -I 1000 -e cycles -a sleep 5
For best results it is usually a good idea to use it with interval
mode like -I 1000, as the bottleneck of workloads can change often.
Results can also be exported in machine-readable form; with -I, the first field is the timestamp:
With -x, perf stat is able to output a not-quite-CSV format output ... optional usec time stamp in fractions of second (with -I xxx)
The equivalent of the periodic printing of vmstat and the sysstat-family tools (iostat, mpstat, etc.) is perf stat -I 1000 (every second), for example system-wide (add -A to separate the per-CPU counters):
perf stat -a -I 1000
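As an illustration of the per-second IPC asked about in the question, the machine-readable output could be post-processed roughly as follows. This is a sketch under assumptions: it expects lines shaped like the output of perf stat -a -x, -I 1000 -e cycles,instructions (timestamp, value, unit, event name, ...), the sample text is made up, and the helper name ipc_per_interval is hypothetical. With -A an extra CPU column appears and the field indexes shift.

```python
import csv
import io

def ipc_per_interval(csv_text):
    """Return {timestamp: instructions / cycles} from `perf stat -x, -I` style text."""
    counts = {}
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip comment lines and rows too short to hold an event name.
        if len(row) < 4 or row[0].lstrip().startswith("#"):
            continue
        timestamp, value, event = row[0].strip(), row[1], row[3]
        try:
            counts.setdefault(timestamp, {})[event] = float(value)
        except ValueError:
            # perf prints text such as "<not counted>" instead of a number.
            continue
    ipc = {}
    for timestamp, events in counts.items():
        cycles = events.get("cycles")
        instructions = events.get("instructions")
        if cycles and instructions is not None:
            ipc[timestamp] = instructions / cycles
    return ipc

# Made-up sample in the assumed field order; real values will differ.
sample = """\
1.000123456,2000000,,cycles,8000000000,100.00,,
1.000123456,3000000,,instructions,8000000000,100.00,1.50,insn per cycle
2.000234567,4000000,,cycles,8000000000,100.00,,
2.000234567,2000000,,instructions,8000000000,100.00,0.50,insn per cycle
"""
result = ipc_per_interval(sample)
print(result)
```

Event names may carry modifiers (for example a suffix after a colon) depending on how the events were requested, so the exact keys should be checked against real output before relying on this.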
The option is implemented in builtin-stat.c http://lxr.free-electrons.com/source/tools/perf/builtin-stat.c?v=4.8, in the __run_perf_stat function:
static int __run_perf_stat(int argc, const char **argv)
{
    int interval = stat_config.interval;
For perf stat -I 1000 with a program argument (forks=1), for example perf stat -I 1000 sleep 10, there is an interval loop (ts is the millisecond interval converted to struct timespec):
enable_counters();

if (interval) {
    while (!waitpid(child_pid, &status, WNOHANG)) {
        nanosleep(&ts, NULL);
        process_interval();
    }
}
...
disable_counters();
For the system-wide variant of hardware performance counter monitoring (forks=0) there is another interval loop:
enable_counters();
while (!done) {
    nanosleep(&ts, NULL);
    if (interval)
        process_interval();
}
...
disable_counters();
process_interval() (http://lxr.free-electrons.com/source/tools/perf/builtin-stat.c?v=4.8#L347), from the same file, calls read_counters(), which loops over the event list and invokes read_counter(), which loops over all known threads and all CPUs and calls the actual reading function:
for (thread = 0; thread < nthreads; thread++) {
    for (cpu = 0; cpu < ncpus; cpu++) {
        ...
        count = perf_counts(counter->counts, cpu, thread);
        if (perf_evsel__read(counter, cpu, thread, count))
            return -1;
perf_evsel__read performs the actual counter read while the program is still running:
int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
                     struct perf_counts_values *count)
{
    memset(count, 0, sizeof(*count));

    if (FD(evsel, cpu, thread) < 0)
        return -EINVAL;

    if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) < 0)
        return -errno;

    return 0;
}

How to get disk read/write bytes per second from /proc programmatically on Linux?

Purpose: I want to get the same information that the iostat command provides.
I already know that /proc/diskstats and /sys/block/sdX/stat contain the counters "sectors read" and "sectors written". So if I want read/write bytes per second, is the following formula right?
Read/write bytes per second:
(sectors read/written (now) - sectors read/written (last)) * 512 bytes / time interval
Read/write operations per second:
(read/write IOs (now) + read/write merges (now) - read/write IOs (last) - read/write merges (last)) / time interval
So if I have a timer that makes the software read those two files every second and then apply the formulas above, will I get the correct answer?
TLDR: a sector is 512 bytes (octets; each byte is 8 bits; every bit is either 0 or 1, but not a superposition of them).
"The standard sector size of 512 bytes for magnetic disks was established ....[dubious – discuss] " (c) wiki https://en.wikipedia.org/wiki/Disk_sector
How to check the sector size used for I/O statistics (in /proc) on Linux:
Check how the iostat tool works (it shows kilobytes per second when started as iostat 1); it is part of the sysstat package:
https://github.com/sysstat/sysstat/blob/master/iostat.c
* Read stats from /proc/diskstats.
void read_diskstats_stat(int curr)
...
/* major minor name rio rmerge rsect ruse wio wmerge wsect wuse running use aveq */
i = sscanf(line, "%u %u %s %lu %lu %lu %lu %lu %lu %lu %u %u %u %u",
&major, &minor, dev_name,
&rd_ios, &rd_merges_or_rd_sec, &rd_sec_or_wr_ios, &rd_ticks_or_wr_sec,
&wr_ios, &wr_merges, &wr_sec, &wr_ticks, &ios_pgr, &tot_ticks, &rq_ticks);
if (i == 14) {
....
sdev.rd_sectors = rd_sec_or_wr_ios;
....
sdev.wr_sectors = wr_sec;
....
* #fctr Conversion factor.
...
if (DISPLAY_KILOBYTES(flags)) {
printf(" kB_read/s kB_wrtn/s kB_read kB_wrtn\n");
*fctr = 2;
}
...
/* rrq/s wrq/s r/s w/s rsec wsec rqsz qusz await r_await w_await svctm %util */
... 4 columns skipped
cprintf_f(4, 8, 2,
S_VALUE(ioj->rd_sectors, ioi->rd_sectors, itv) / fctr,
S_VALUE(ioj->wr_sectors, ioi->wr_sectors, itv) / fctr,
So iostat reads the sector count and divides by two to get kilobytes per second (1 sector read is 0.5 kB read, 2 sectors read is 1 kB read, and so on). We can conclude that a sector is always 512 bytes here. Is the same stated in the documentation?
internet search for "/proc/diskstats" ->
https://www.kernel.org/doc/Documentation/ABI/testing/procfs-diskstats ->
https://www.kernel.org/doc/Documentation/iostats.txt "I/O statistics fields" by ricklind from usa's ibm
Field 3 -- # of sectors read
This is the total number of sectors read successfully.
Field 7 -- # of sectors written
This is the total number of sectors written successfully.
No info about sector size here (why?). Is the source code the best documentation (it may be)? The writer of /proc/diskstats is in the kernel sources, in file block/genhd.c, function diskstats_show:
http://lxr.free-electrons.com/source/block/genhd.c?v=4.4#L1149
seq_printf(seqf, "%4d %7d %s %lu %lu %lu "
           "%u %lu %lu %lu %u %u %u %u\n",
...
           part_stat_read(hd, sectors[READ]),
...
           part_stat_read(hd, sectors[WRITE]),
The sectors field is defined in struct disk_stats at http://lxr.free-electrons.com/source/include/linux/genhd.h?v=4.4#L82
struct disk_stats {
    unsigned long sectors[2]; /* READs and WRITEs */
It is read with part_stat_read and written with __part_stat_add
http://lxr.free-electrons.com/source/include/linux/genhd.h?v=4.4#L307
The sectors counter is incremented in blk_account_io_completion at http://lxr.free-electrons.com/source/block/blk-core.c?v=4.4#L2264
void blk_account_io_completion(struct request *req, unsigned int bytes)
{
    if (blk_do_io_stat(req)) {
        const int rw = rq_data_dir(req);
        struct hd_struct *part;
        int cpu;

        cpu = part_stat_lock();
        part = req->part;
        part_stat_add(cpu, part, sectors[rw], bytes >> 9);
        part_stat_unlock();
    }
}
It uses a hard-coded "bytes >> 9" to compute the sector count from the request size in bytes (why round down?); for a human, rather than a compiler without floating point, it is the same as bytes / 512 rounded down.
There is also the blk_rq_sectors function (unused here...) to get the sector count from a request, which does the same >> 9 conversion from bytes to sectors:
http://lxr.free-electrons.com/source/include/linux/blkdev.h?v=4.4#L853
static inline unsigned int blk_rq_bytes(const struct request *rq)
{
    return rq->__data_len;
}
static inline unsigned int blk_rq_sectors(const struct request *rq)
{
    return blk_rq_bytes(rq) >> 9;
}
Authors of the FS/VFS subsystem in Linux say, in reply to https://lkml.org/lkml/2015/8/17/234 "Why is SECTOR_SIZE = 512 inside kernel ?" (2015):
#define SECTOR_SHIFT 9
Message https://lkml.org/lkml/2015/8/17/269 by Theodore Ts'o:
It's cast in stone. There are too many places all over the kernel,
especially in a huge number of file systems, which assume that the
sector size is 512 bytes. So above the block layer, the sector size
is always going to be 512.
This is actually better for user space programs using
/proc/diskstats, since they don't need to know whether a particular
underlying hardware is using 512, 4k, (or if the HDD manufacturers
fantasies become true 32k or 64k) sector sizes.
For similar reason, st_blocks in struct size is always in units of 512
bytes. We don't want to force userspace to have to figure out whether
the underlying file system is using 1k, 2k, or 4k. For that reason
the units of st_blocks is always going to be 512 bytes, and this is
hard-coded in the POSIX standard.
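Putting the question's formula and the 512-byte unit together, a sampling loop could be built around something like the sketch below. It is illustrative only: the function names are hypothetical, the two snapshots are made-up lines in the /proc/diskstats layout quoted above (the sixth field is sectors read, the tenth is sectors written), and counter wrap-around is not handled.

```python
SECTOR_SHIFT = 9  # /proc/diskstats counts 512-byte units regardless of the hardware

def parse_diskstats(text):
    """Map device name -> (sectors_read, sectors_written)."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip short or malformed lines
        # fields: major minor name rio rmerge rsect ruse wio wmerge wsect ...
        stats[fields[2]] = (int(fields[5]), int(fields[9]))
    return stats

def bytes_per_second(before, after, interval_s):
    """Return {device: (read_bytes_per_s, write_bytes_per_s)} between two snapshots."""
    rates = {}
    for dev, (rd1, wr1) in after.items():
        if dev not in before:
            continue
        rd0, wr0 = before[dev]
        rates[dev] = (((rd1 - rd0) << SECTOR_SHIFT) / interval_s,
                      ((wr1 - wr0) << SECTOR_SHIFT) / interval_s)
    return rates

# Made-up snapshots taken 2 seconds apart.
snap1 = "   8       0 sda 1000 10 20000 500 2000 20 40000 900 0 700 1400"
snap2 = "   8       0 sda 1100 12 24096 520 2050 25 42048 950 0 760 1470"
rates = bytes_per_second(parse_diskstats(snap1), parse_diskstats(snap2), 2.0)
print(rates)
```

In a real program the two snapshots would come from reading /proc/diskstats twice, with the elapsed time measured between the reads rather than assumed.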

Can the logical erase block size of an MTD device be increased?

The minimum erase block size for jffs2 (mtd-utils version 1.5.0, mkfs.jffs2 revision 1.60) seems to be 8KiB:
Erase size 0x1000 too small. Increasing to 8KiB minimum
However I am running Linux 3.10 with an at25df321a,
m25p80 spi32766.0: at25df321a (4096 Kbytes),
and the erase block size is only 4KiB:
mtd5
Name: spi32766.0
Type: nor
Eraseblock size: 4096 bytes, 4.0 KiB
Amount of eraseblocks: 1024 (4194304 bytes, 4.0 MiB)
Minimum input/output unit size: 1 byte
Sub-page size: 1 byte
Character device major/minor: 90:10
Bad blocks are allowed: false
Device is writable: true
Is there a way to make the mtd system treat multiple erase blocks as one? Maybe some ioctl or module parameter?
If I flash a jffs2 image with larger erase block size, I get lots of kernel error messages, missing files and sometimes panic.
Workaround
I found that flash_erase --jffs2 results in a working filesystem in spite of the 4KiB erase block size. So I hacked the mkfs.jffs2.c file and the resulting image seems to work fine. I'll give it some testing.
diff -rupN orig/mkfs.jffs2.c new/mkfs.jffs2.c
--- orig/mkfs.jffs2.c 2014-10-20 15:43:31.751696500 +0200
+++ new/mkfs.jffs2.c 2014-10-20 15:43:12.623431400 +0200
@@ -1659,11 +1659,11 @@ int main(int argc, char **argv)
}
erase_block_size *= units;
- /* If it's less than 8KiB, they're not allowed */
- if (erase_block_size < 0x2000) {
- fprintf(stderr, "Erase size 0x%x too small. Increasing to 8KiB minimum\n",
+ /* If it's less than 4KiB, they're not allowed */
+ if (erase_block_size < 0x1000) {
+ fprintf(stderr, "Erase size 0x%x too small. Increasing to 4KiB minimum\n",
erase_block_size);
- erase_block_size = 0x2000;
+ erase_block_size = 0x1000;
}
break;
}
http://lists.infradead.org/pipermail/linux-mtd/2010-September/031876.html
JFFS2 should be able to fit at least one node to eraseblock. The
maximum node size is 4KiB+few bytes. This is why the minimum
eraseblocks size is 8KiB.
But in practice, even 8KiB is bad because you end up wasting a
lot of space at the end of eraseblocks.
You should join several eraseblocks into one virtual eraseblock of 64 or
128 KiB and use it - this will be more optimal.
Some drivers have already implemented this. I know about the
MTD_SPI_NOR_USE_4K_SECTORS
Linux configuration option. It has to be set to "n" to enable large erase sectors of size 0x00010000.
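The effect of switching from 4 KiB to 64 KiB erase sectors on this 4 MiB part can be worked out directly. The sketch uses only numbers quoted in this thread (the flash size, the 0x2000 minimum from the mkfs.jffs2 message and the 0x10000 sector size); the function and constant names are hypothetical.

```python
FLASH_SIZE = 4 * 1024 * 1024   # at25df321a: 4096 KiB
JFFS2_MIN_ERASE = 0x2000       # 8 KiB minimum enforced by the unpatched mkfs.jffs2

def eraseblock_layout(erase_size):
    """Return the eraseblock count and whether mkfs.jffs2 accepts the size as-is."""
    return {
        "blocks": FLASH_SIZE // erase_size,
        "accepted_by_mkfs": erase_size >= JFFS2_MIN_ERASE,
    }

small = eraseblock_layout(0x1000)    # 4 KiB sectors, as reported for mtd5
large = eraseblock_layout(0x10000)   # 64 KiB sectors
print(small, large)
```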

RES != CODE + DATA in the output of the top command. Why?

What 'man top' says is: RES = CODE + DATA
q: RES -- Resident size (kb)
The non-swapped physical memory a task has used.
RES = CODE + DATA.
r: CODE -- Code size (kb)
The amount of physical memory devoted to executable code, also known as the 'text resident set' size or TRS.
s: DATA -- Data+Stack size (kb)
The amount of physical memory devoted to other than executable code, also known as the 'data resident set' size or DRS.
But when I run 'top -p 4258', I get the following:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ CODE DATA COMMAND
258 root 16 0 3160 1796 1328 S 0.0 0.3 0:00.10 476 416 bash
1796 != 476 + 416
Why?
PS:
Linux distribution:
linux-iguu:~ # lsb_release -a
LSB Version: core-2.0-noarch:core-3.0-noarch:core-2.0-ia32:core-3.0-ia32:desktop-3.1-ia32:desktop-3.1-noarch:graphics-2.0-ia32:graphics-2.0-noarch:graphics-3.1-ia32:graphics-3.1-noarch
Distributor ID: SUSE LINUX
Description: SUSE Linux Enterprise Server 9 (i586)
Release: 9
Codename: n/a
kernel version:
linux-iguu:~ # uname -a
Linux linux-iguu 2.6.16.60-0.21-default #1 Tue May 6 12:41:02 UTC 2008 i686 i686 i386 GNU/Linux
I'll explain this with the help of an example of what happens when a program allocates and uses memory. Specifically, this program:
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>

int main(void){
    int *data, size, count, i;

    /* sizeof yields a size_t, so print it with %zu rather than %d */
    printf( "fyi: your ints are %zu bytes large\n", sizeof(int) );

    printf( "Enter number of ints to malloc: " );
    scanf( "%d", &size );
    data = malloc( sizeof(int) * size );
    if( !data ){
        perror( "failed to malloc" );
        exit( EXIT_FAILURE );
    }

    printf( "Enter number of ints to initialize: " );
    scanf( "%d", &count );
    for( i = 0; i < count; i++ ){
        data[i] = 1337;
    }

    printf( "I'm going to hang out here until you hit <enter>" );
    /* the first loop eats the newline left behind by scanf; the second waits for <enter> */
    while( getchar() != '\n' );
    while( getchar() != '\n' );

    exit( EXIT_SUCCESS );
}
This is a simple program that asks you how many integers to allocate, allocates them, asks how many of those integers to initialize, and then initializes them. For a run where I allocate 1250000 integers and initialize 500000 of them:
$ ./a.out
fyi: your ints are 4 bytes large
Enter number of ints to malloc: 1250000
Enter number of ints to initialize: 500000
Top reports the following information:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ SWAP CODE DATA COMMAND
<program start>
11129 xxxxxxx 16 0 3628 408 336 S 0 0.0 0:00.00 3220 4 124 a.out
<allocate 1250000 ints>
11129 xxxxxxx 16 0 8512 476 392 S 0 0.0 0:00.00 8036 4 5008 a.out
<initialize 500000 ints>
11129 xxxxxxx 15 0 8512 2432 396 S 0 0.0 0:00.00 6080 4 5008 a.out
The relevant information is:
                           DATA  CODE   RES  VIRT
before allocation:          124     4   408  3628
after 5MB allocation:      5008     4   476  8512
after 2MB initialization:  5008     4  2432  8512
After I malloc'd 5MB of data, both VIRT and DATA increased by ~5MB, but RES did not. RES did increase after I touched 2MB of the integers I allocated, but DATA and VIRT stayed the same.
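The deltas in the table line up with the sizes requested in the run above. A quick check, assuming top reports these columns in KiB and using the 4-byte int size the program printed:

```python
INT_SIZE = 4  # bytes, as printed by the program

allocated_kib = 1250000 * INT_SIZE / 1024   # size of the malloc'd region
touched_kib = 500000 * INT_SIZE / 1024      # part that was actually written

# Values copied from the top output above.
before = {"DATA": 124, "RES": 408, "VIRT": 3628}
after_malloc = {"DATA": 5008, "RES": 476, "VIRT": 8512}
after_init = {"DATA": 5008, "RES": 2432, "VIRT": 8512}

data_growth = after_malloc["DATA"] - before["DATA"]
virt_growth = after_malloc["VIRT"] - before["VIRT"]
res_growth_on_malloc = after_malloc["RES"] - before["RES"]
res_growth_on_touch = after_init["RES"] - after_malloc["RES"]

print(allocated_kib, data_growth, virt_growth)   # allocation shows up in DATA and VIRT
print(touched_kib, res_growth_on_touch)          # only touched pages show up in RES
print(res_growth_on_malloc)                      # malloc alone barely changes RES
```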
VIRT is the total amount of virtual memory used by the process, including what is shared and what is over-committed. DATA is the amount of virtual memory used that isn't shared and that isn't code-text. I.e., it is the virtual stack and heap of the process. RES is not virtual: it is a measurement of how much memory the process is actually using at that specific time.
So in your case, the large inequality CODE+DATA < RES is likely the shared libraries included by the process. In my example (and yours), SHR+CODE+DATA is a closer approximation to RES.
Hope this helps.
There's a lot of hand-waving and voodoo associated with top and ps. There are many articles (rants?) online about the discrepancies. E.g., this and this.
This explanation resolved some of my questions. Thanks!
Meanwhile, I will try to add something from my own understanding of Linux memory management. If I have misunderstood anything, please correct me!
1) Modern OS process concepts are based on virtual memory. The virtual memory system includes RAM + swap, so I think most of the memory concepts related to processes refer to virtual memory, unless noted otherwise.
2) Any virtual address (page) allocated to a process is in one of the following states:
a) allocated, but not mapped to any physical memory (something like COW)
b) allocated, already mapped to physical memory
c) allocated, already mapped to swapped memory
3) The fields in the output of the top command:
a) VIRT -- all virtual memory that the process has the right to access, no matter whether it is already mapped to physical memory or swapped memory, or has no mapping at all.
b) RES -- the virtual addresses already mapped to physical addresses and still in RAM.
c) SWAP -- the virtual addresses already mapped to physical addresses but swapped out to swap space.
d) SHR -- the shared memory available to a process (VM?)
e) CODE + DATA -- CODE could be in state 2.b/2.c, and DATA could be in any of the three states 2.a/2.b/2.c; 3.b and 3.c together also have a field name, "USED".
4) So the calculation may look like:
a) VIRT(VM) = RES(VM in memory) + SWAP(VM in swap) + VM unmapped(DATA, SHR?).
b) USED = RES + SWAP
c) SWAP = CODE(vm in swap) + DATA(vm in swap) + SHR(vm in swap?)
d) RES = CODE(vm in memory) + DATA(vm in memory) + SHR(vm in memory?)
At least the DATA segment still has a "DATA(VM unmapped)" part, as can be observed from the malloc example above. That is a little different from the top manpage, which says "DATA: The amount of physical memory devoted to other than executable code, also known as the Data Resident Set size or DRS". Thanks again.
So the sum (CODE + DATA + SHR) is usually larger than RES, because at least DATA(vm unmapped) is actually counted in "DATA", unlike what the manpage claims.
Regards,
