Linux SLUB: Unable to allocate memory on node - linux

We very frequently see the message below in /var/log/messages:
kernel: SLUB: Unable to allocate memory on node -1 (gfp=0x8020)
In some cases it is followed by an allocation table:
kernel: cache: sigqueue(12019:454c4ebd186d964699132181ad7367c669700f7d8991c47d4bc053ed101675bc), object size: 160, buffer size: 160, default order: 0, min order: 0
kernel: node 0: slabs: 57, objs: 23313, free: 0
kernel: node 1: slabs: 35, objs: 14315, free: 0
OK, free is 0, but how can this be tuned?
Our setup is as follows:
OS - CentOS 7.3
Kernel - 3.10.0-327.36.3.el7.x86_64
Docker - 1.12.6
Kubernetes - 1.5.5
We have a private cloud powered by Kubernetes with 10 nodes. It was working fine until last month, but now we get these alerts very frequently on every node; the number of pods/containers has also increased in the last few days.
We have enough memory and CPU available on each node.
Any advice on tuning to get rid of these alerts would be very helpful.
Additional information:
sysctl.conf options
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_max_syn_backlog = 4096
net.core.somaxconn = 1024
net.ipv4.tcp_syncookies = 1
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 65535
net.core.wmem_default = 65535
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.ip_local_port_range = 1024 65535
vm.max_map_count = 262144
vm.swappiness=10
vm.vfs_cache_pressure=100
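
For anyone triaging reports like the one above, here is a minimal Python sketch that totals the per-node lines of a SLUB allocation-failure report. The helper name summarize_slub_report and the regular expression are my own assumptions, written against the two "node N:" lines quoted in the question; this is not a kernel or distribution tool.

```python
import re

# Hypothetical helper, matched against the "node N: slabs: ..." lines quoted above.
NODE_LINE = re.compile(r"node (\d+): slabs: (\d+), objs: (\d+), free: (\d+)")


def summarize_slub_report(lines):
    """Return {node_id: (slabs, objs, free)} for every 'node N:' line found."""
    summary = {}
    for line in lines:
        match = NODE_LINE.search(line)
        if match:
            node, slabs, objs, free = map(int, match.groups())
            summary[node] = (slabs, objs, free)
    return summary


report = [
    "kernel: node 0: slabs: 57, objs: 23313, free: 0",
    "kernel: node 1: slabs: 35, objs: 14315, free: 0",
]
summary = summarize_slub_report(report)
total_objs = sum(objs for _, objs, _ in summary.values())
total_free = sum(free for _, _, free in summary.values())
print(f"objects allocated: {total_objs}, free: {total_free}")
```

Running it over a whole day of /var/log/messages would show whether it is always the same cache (sigqueue here) that runs out of free objects.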

Please look at this: https://pingcap.com/blog/try-to-fix-two-linux-kernel-bugs-while-testing-tidb-operator-in-k8s/. It's a kernel bug.

The problem seems to be with the kernel. First of all, check whether swap is properly allocated using free -m and mkswap -c; if it is not, set it up. If swap is fine, you may need to update the kernel.

Related

u-boot gives Error 22 for ubi partition, but mounts ok in linux

I have a Buildroot system which mounts UBI fine in Linux, but in U-Boot I get error 22.
When booting Linux, dmesg shows:
ubi0: scanning is finished
ubi0: attached mtd2 (name "rootfs", size 32 MiB)
ubi0: PEB size: 131072 bytes (128 KiB), LEB size: 126976 bytes
ubi0: min./max. I/O unit sizes: 2048/2048, sub-page size 2048
ubi0: VID header offset: 2048 (aligned 2048), data offset: 4096
ubi0: good PEBs: 256, bad PEBs: 0, corrupted PEBs: 0
ubi0: user volume: 1, internal volumes: 1, max. volumes count: 128
ubi0: max/mean erase counter: 2/0, WL threshold: 4096, image sequence number: 894512245
ubi0: available PEBs: 0, total reserved PEBs: 256, PEBs reserved for bad PEB handling: 40
ubi0: background thread "ubi_bgt0d" started, PID 1103
--
UBIFS (ubi0:0): UBIFS: mounted UBI device 0, volume 0, name "rootfs", R/O mode
UBIFS (ubi0:0): LEB size: 126976 bytes (124 KiB), min./max. I/O unit sizes: 2048 bytes/2048 bytes
UBIFS (ubi0:0): FS size: 25649152 bytes (24 MiB, 202 LEBs), journal size 4444160 bytes (4 MiB, 35 LEBs)
UBIFS (ubi0:0): reserved for root: 0 bytes (0 KiB)
UBIFS (ubi0:0): media format: w4/r0 (latest is w4/r0), UUID 29B5D4CF-8B0B-465A-8D03-F3A464E6250E, small LPT model
UBIFS (ubi0:0): full atime support is enabled.
VFS: Mounted root (ubifs filesystem) readonly on device 0:13.
In U-Boot, mtd returns:
device nand0 <nand0>, # parts = 4
#: name size offset mask_flags
0: u-boot 0x00200000 0x00000000 0
1: kernel 0x01e00000 0x00200000 0
2: rootfs 0x02000000 0x02000000 0
3: user 0x0c000000 0x04000000 0
active partition: nand0,0 - (u-boot) 0x00200000 @ 0x00000000
defaults:
mtdids : nand0=nand0
mtdparts: mtdparts=nand0:0x200000@0x0(u-boot),0x1e00000@0x200000(kernel),0x2000000@0x2000000(rootfs),-(user)
but when I try to attach:
=> ubi part rootfs
ubi0: attaching mtd1
UBI init error 22
This is an embedded system running older versions: U-Boot 2016.11 and a Linux/arm 4.4.289 kernel.
I suppose some parameter is wrong somewhere. Can somebody advise me where to look?
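
To compare the U-Boot layout against what Linux reports, the mtdparts string can be expanded into an explicit table. This is only a sketch, assuming the usual size@offset(name) syntax with "-" meaning "rest of the device"; parse_mtdparts is a hypothetical helper, not a U-Boot function.

```python
import re

# One partition entry: "<size>[@<offset>](<name>)", where size may be "-".
PART = re.compile(r"(-|0x[0-9a-fA-F]+)(?:@(0x[0-9a-fA-F]+))?\((.+)\)")


def parse_mtdparts(spec):
    """Return a list of (index, name, size, offset); size is None for '-'."""
    _, _, parts = spec.partition(":")
    table = []
    next_offset = 0
    for index, part in enumerate(parts.split(",")):
        match = PART.fullmatch(part)
        if match is None:
            raise ValueError(f"unrecognized partition entry: {part!r}")
        size_text, offset_text, name = match.groups()
        # A missing offset means "directly after the previous partition".
        offset = int(offset_text, 16) if offset_text else next_offset
        size = None if size_text == "-" else int(size_text, 16)
        table.append((index, name, size, offset))
        if size is not None:
            next_offset = offset + size
    return table


spec = ("mtdparts=nand0:0x200000@0x0(u-boot),0x1e00000@0x200000(kernel),"
        "0x2000000@0x2000000(rootfs),-(user)")
for index, name, size, offset in parse_mtdparts(spec):
    size_text = "rest" if size is None else hex(size)
    print(f"{index}: {name:8s} size={size_text:>10s} offset={offset:#010x}")
```

With the string from the question, rootfs comes out as entry 2 at offset 0x2000000 with size 0x2000000 (32 MiB), which matches the size Linux reports for mtd2.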

ubiformat in barebox giving timeout

I have a custom iMX 6UL board with Barebox (partially) functional. The on-board Semper s25hs512t flash is detected (after adding the necessary device ID in drivers/mtd/spi-nor/spi-nor.c).
The problem: my board has no Ethernet or removable SD card. I need to write the bootloader to the s25hs512t flash, so I need to format the flash accordingly and copy the files onto it.
My dtsi has:
&qspi {
    pinctrl-names = "default";
    pinctrl-0 = <&pinctrl_qspi>;
    status = "okay";

    flash0: s25hs512t@0 {
        #address-cells = <1>;
        #size-cells = <1>;
        compatible = "spansion,s25hs512t", "jedec,spi-nor";
        spi-max-frequency = <40000000>;
        spi-rx-bus-width = <4>;
        spi-tx-bus-width = <4>;
        reg = <0>;
        spi-mode = <0>;
        m25p,fast-read;
        status = "okay";

        partition@0 {
            label = "barebox";
            reg = <0x00000000 0x00100000>;
        };
        partition@1 {
            label = "barebox-env";
            reg = <0x00100000 0x00040000>;
        };
        partition@2 {
            label = "barebox-of";
            reg = <0x00140000 0x00040000>;
        };
        partition@3 {
            label = "kernel";
            reg = <0x00180000 0x00800000>;
        };
        partition@4 {
            label = "root";
            reg = <0x00980000 0x03640000>;
        };
    };
};
On boot, Barebox detects the flash:
Board: Freescale i.MX6 UltraLite Caisteal Board
detected i.MX6 UltraLite revision 1.0
i.MX6 UltraLite unique ID: 241e09d4e317402a
m25p80 s25hs512t@00: s25hs512t (65536 Kbytes). <=====
imx-esdhc 2194000.mmc@2194000.of: registered as mmc1
rng_self_test: RNG software self-test passed
caam 2140000.crypto@2140000.of: Instantiated RNG4 SH0
caam 2140000.crypto@2140000.of: Instantiated RNG4 SH1
malloc space: 0x8eefcf80 -> 0x9ddf9eff (size 239 MiB)
barebox-environment chosen:environment.of: probe failed: No such file or directory
devinfo shows
`-- 21e0000.spi@21e0000.of
`-- s25hs512t@00
`-- m25p0
`-- 0x00000000-0x03ffffff ( 64 MiB): /dev/m25p0
`-- m25p0.barebox
`-- 0x00000000-0x000fffff ( 1 MiB): /dev/m25p0.barebox
`-- m25p0.barebox-env
`-- 0x00000000-0x0003ffff ( 256 KiB): /dev/m25p0.barebox-env
`-- m25p0.barebox-of
`-- 0x00000000-0x0003ffff ( 256 KiB): /dev/m25p0.barebox-of
`-- m25p0.kernel
`-- 0x00000000-0x007fffff ( 8 MiB): /dev/m25p0.kernel
`-- m25p0.root
`-- 0x00000000-0x0363ffff ( 54.3 MiB): /dev/m25p0.root
but when I run ubiformat, I oddly get this:
barebox@Freescale i.MX6 UltraLite Caisteal Board:/ ubiformat /dev/m25p0.barebox -y
ubiformat: m25p0.barebox (nor), size 1048576 bytes (1 MiB), 4 eraseblocks of 262144 bytes (256 KiB), min. I/O size 1 bytes
libscan: scanning eraseblock 3 -- 100 % complete
ubiformat: 1 eraseblocks are supposedly empty
ubiformat: warning!: 3 of 4 eraseblocks contain non-ubifs data
ubiformat: warning!: only 0 of 4 eraseblocks have valid erase counter
ubiformat: erase counter 0 will be used for all eraseblocks
ubiformat: note, arbitrary erase counter value may be specified using -e option
ubiformat: use erase counter 0 for all eraseblocks
ubiformat: formatting eraseblock 3 -- 100 % complete
ERROR: m25p80 s25hs512t@00: flash operation timed out
ERROR: m25p0.barebox: error -110 while writing 262144 bytes to PEB 0:0, written 0 bytes
libubigen: error!: cannot write 262144 bytes
ubiformat: error!: cannot write layout volume
ubiformat: Operation not permitted
Is there any way forward from this?
PS: Update
Thanks to help from @TrentP, I am now focusing only on formatting the larger partitions so that I can write the kernel and root partitions, but I have not been able to mount the UBI volume. I get the following error (read-only file system):
barebox@Freescale i.MX6 UltraLite Caisteal Board:/ erase /dev/m25p0.kernel
barebox@Freescale i.MX6 UltraLite Caisteal Board:/ ubiattach /dev/m25p0.kernel
NOTICE: ubi0: scanning is finished
NOTICE: ubi0: empty MTD device detected
NOTICE: ubi0: registering /dev/m25p0.kernel.ubi
NOTICE: ubi0: attached mtd0 (name "m25p0.kernel", size 8 MiB) to ubi0
NOTICE: ubi0: PEB size: 262144 bytes (256 KiB), LEB size: 262016 bytes
NOTICE: ubi0: min./max. I/O unit sizes: 1/256, sub-page size 1
NOTICE: ubi0: VID header offset: 64 (aligned 64), data offset: 128
NOTICE: ubi0: good PEBs: 32, bad PEBs: 0, corrupted PEBs: 0
NOTICE: ubi0: user volume: 0, internal volumes: 1, max. volumes count: 128
NOTICE: ubi0: max/mean erase counter: 1/0, WL threshold: 65536, image sequence number: 1700878141
NOTICE: ubi0: available PEBs: 28, total reserved PEBs: 4, PEBs reserved for bad PEB handling: 0
barebox@Freescale i.MX6 UltraLite Caisteal Board:/ ubimkvol /dev/m25p0.kernel.ubi kernel 0
NOTICE: ubi0: registering kernel as /dev/m25p0.kernel.ubi.kernel
barebox@Freescale i.MX6 UltraLite Caisteal Board:/ mount -t ubifs /dev/m25p0.kernel.ubi.kernel /mnt/kernel/
ERROR: UBIFS error (ubi0:0): 9de5a2d5: can't format empty UBI volume: read-only mount
ERROR: ubifs ubifs0: probe failed: Read-only file system
mount: Invalid argument
If I use ubiformat I get this:
barebox@Freescale i.MX6 UltraLite Caisteal Board:/ ubiformat /dev/m25p0.kernel -y
ubiformat: m25p0.kernel (nor), size 8388608 bytes (8 MiB), 32 eraseblocks of 262144 bytes (256 KiB), min. I/O size 1 bytes
libscan: scanning eraseblock 31 -- 100 % complete
ubiformat: warning!: 32 of 32 eraseblocks contain non-ubifs data
ubiformat: warning!: only 0 of 32 eraseblocks have valid erase counter
ubiformat: erase counter 0 will be used for all eraseblocks
ubiformat: note, arbitrary erase counter value may be specified using -e option
ubiformat: use erase counter 0 for all eraseblocks
ubiformat: formatting eraseblock 31 -- 100 % complete
barebox@Freescale i.MX6 UltraLite Caisteal Board:/ ubiattach /dev/m25p0.kernel
NOTICE: ubi0: scanning is finished
ERROR: ubi0 error: ubi_read_volume_table: the layout volume was not found
ERROR: ubi0 error: ubi_attach_mtd_dev: failed to attach mtd0, error -22
failed to attach: Invalid argument
devinfo
Parent: m25p0.kernel
Parameters:
available_pebs: 0 (type: uint32)
bad_peb_count: 0 (type: uint32)
good_peb_count: 32 (type: uint32)
leb_size: 262016 (type: uint32)
max_erase_counter: 2 (type: uint32)
mean_erase_counter: 0 (type: uint32)
min_io_size: 1 (type: uint32)
peb_size: 262144 (type: uint32)
reserved_pebs: 32 (type: uint32) <=== why all PEBs are reserved?
sub_page_size: 1 (type: uint32)
vid_header_offset: 64 (type: uint32)
Any suggestions on what I am doing wrong? I know it's something ridiculously simple, just unknown to me.
You aren't supposed to use ubiformat on the barebox partition. It's too small. That's why it fails.
UBI is a Linux layer for putting UBI filesystems into NAND or NOR flash. The iMX6UL CPU boot ROM does not understand UBI. It can't boot something in a UBI formatted partition. It's for the root filesystem in the root partition.
Read section 8 of the iMX6UL reference manual, especially §8.6 about QuadSPI booting. This will tell you what you must put into flash to make it bootable.
Also look at the barebox_update command, which can be used to flash the bootloader from Barebox. The board needs to support it and I don't know about your board. The code is in various imx6_bbu_* functions. I'm not sure if qspi is supported, as I only see eMMC/SD,eMMC boot, NAND, and I2C/SPI. The qspi interface isn't the same as a serial EEPROM on one of the eCSPI controllers (again, see RM §8!). But perhaps it would work with an appropriate header already on the image.
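
As a side note on the numbers in the attach logs quoted in the question, here is a small Python sketch of the UBI geometry arithmetic. It assumes the LEB size is the PEB size minus the data offset, and that good PEBs = available PEBs + reserved PEBs; both relations hold for every log shown above. The reading that a volume created with size 0 claims all available PEBs (which would explain reserved_pebs: 32) is an assumption on my part, not something the logs state.

```python
def leb_size(peb_size, data_offset):
    """Usable bytes per eraseblock once the EC and VID headers are skipped."""
    return peb_size - data_offset


def pebs_consistent(good, available, reserved):
    """Every good PEB is either still available or reserved for something."""
    return good == available + reserved


# NAND device from the U-Boot question: PEB 128 KiB, data offset 4096.
print(leb_size(131072, 4096))   # 126976, as in the Linux attach log

# NOR partition from the Barebox question: PEB 256 KiB, data offset 128.
print(leb_size(262144, 128))    # 262016, as in the ubiattach output

# Right after ubiattach on the erased partition: 28 available + 4 reserved.
print(pebs_consistent(32, 28, 4))

# In the later devinfo output: 0 available, 32 reserved. Assumption: the
# volume created with size 0 took the 28 PEBs that were still available.
print(pebs_consistent(32, 0, 32))
```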

OpenMPI cannot fully utilize 10 GE

I tried to perform data exchange between two machines connected with 10GE. The size of data is large enough (8 GB) to expect network utilization near the maximum. But surprisingly I observed absolutely different behavior.
To check the throughput I have used two different programs - nethogs and nload, both of them show that network utilization is much lower than expected. Moreover the results are unpredictable - sometimes in and out channels are utilized simultaneously, but sometimes transmission and reception are separated as if there is a half-duplex channel. Sample output of nload:
Device enp1s0f0 [192.168.0.11] (1/1):
======================================================================================================================
Incoming:
|||||||||||||||||||
.###################
####################|
##################### Curr: 0.00 GBit/s
##################### Avg: 2.08 GBit/s
.##################### Min: 0.00 GBit/s
####################### Max: 6.32 GBit/s
####################### Ttl: 57535.38 GByte
Outgoing:
||||||||||||||||||
##################
|##################
###################|
#################### Curr: 0.00 GBit/s
#################### Avg: 2.09 GBit/s
.#################### Min: 0.00 GBit/s
#####################. Max: 6.74 GBit/s
###################### Ttl: 57934.64 GByte
The code I use is here:
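#include <boost/mpi.hpp>
#include <complex>
#include <cstdint>
#include <cstdlib>
#include <vector>

int main(int argc, char** argv) {
    boost::mpi::environment env{};
    boost::mpi::communicator world{};
    boost::mpi::request reqs[2];
    int k = 10;  // log2 of the number of elements to exchange
    if (argc > 1)
        k = std::atoi(argv[1]);
    uint64_t n = (1ul << k);
    std::vector<std::complex<double>> sv(n, world.rank());
    std::vector<std::complex<double>> rv(n);
    // With two ranks, each rank exchanges data with the other one.
    int dest = world.rank() == 0 ? 1 : 0;
    int src = dest;
    world.barrier();
    // Post the non-blocking receive and send, then wait for both.
    reqs[0] = world.irecv(src, 0, rv.data(), n);
    reqs[1] = world.isend(dest, 0, sv.data(), n);
    boost::mpi::wait_all(reqs, reqs + 2);
    return 0;
}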
And here is the command I use to run on cluster:
mpirun --mca btl_tcp_if_include 192.168.0.0/24 --hostfile ./host_file -n 2 --bind-to core /path/to/shared/folder/mpi_exp 29
Here 29 means that 2^29 complex<double> elements of 16 bytes each, i.e. 2^(29 + 4) bytes = 8 GiB, will be sent in each direction.
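
For reference, the message size and the best-case transfer time on a 10 Gbit/s link can be worked out as below. This is a rough sketch that ignores TCP/IP and MPI protocol overhead; the function names are made up for illustration.

```python
ELEMENT_BYTES = 16  # std::complex<double> holds two 8-byte doubles


def message_bytes(k):
    """Bytes sent by one rank when the program is started with argument k."""
    return (1 << k) * ELEMENT_BYTES


def ideal_seconds(num_bytes, link_gbit_per_s=10.0):
    """Lower bound on transfer time for one direction, with no overhead."""
    return num_bytes * 8 / (link_gbit_per_s * 1e9)


size = message_bytes(29)
print(f"{size} bytes = {size / 2**30:.0f} GiB per direction")
print(f"ideal time at 10 Gbit/s: {ideal_seconds(size):.2f} s")
```

Since each rank sends and receives at the same time, a full-duplex link should ideally finish both directions in roughly that same time, a bit under 7 seconds.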
What I have done:
Proved that there is no hardware problem by successful saturation of the channel with netcat.
Checked with tcpdump that the size of TCP packets during the communication is unstable and rarely reaches the maximum size (in the netcat case it is stable).
Checked with strace that socket operations are correct.
Checked TCP parameters in sysctl - they are ok.
Could you please advise me why OpenMPI doesn't work as expected?
EDIT (14.08.2018):
Finally I was able to continue to dig into this problem. Below is the output of OSU bandwidth benchmark (it was run without any mca options):
# OSU MPI Bandwidth Test v5.3
# Size Bandwidth (MB/s)
1 0.50
2 0.98
4 1.91
8 3.82
16 6.92
32 10.32
64 22.03
128 43.95
256 94.74
512 163.96
1024 264.90
2048 400.01
4096 533.47
8192 640.02
16384 705.02
32768 632.03
65536 667.29
131072 842.00
262144 743.82
524288 654.09
1048576 775.50
2097152 759.44
4194304 774.81
I think this poor performance is because the transfer is CPU-bound: each MPI process is single-threaded by default, and it is simply not able to saturate a 10GE link.
I know it is possible to communicate from several threads by enabling multithreading when building OpenMPI, but such an approach will increase complexity at the application level.
So is it possible to have multithreaded sending/receiving inside OpenMPI, at the level responsible for point-to-point data transfer?
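
To put the OSU figures on the same scale as the link speed, they can be converted from MB/s to Gbit/s. The sketch below assumes the benchmark's MB means 10^6 bytes; if it were 2^20 bytes the results would be about 5% higher.

```python
def mb_per_s_to_gbit_per_s(mb_per_s):
    """Convert MB/s (assumed 10^6 bytes per MB) to Gbit/s (10^9 bits)."""
    return mb_per_s * 1e6 * 8 / 1e9


peak = mb_per_s_to_gbit_per_s(842.00)     # best row of the table (131072 bytes)
largest = mb_per_s_to_gbit_per_s(774.81)  # last row of the table (4194304 bytes)
print(f"peak: {peak:.2f} Gbit/s, largest message: {largest:.2f} Gbit/s")
print(f"10 Gbit/s would need {10e9 / 8 / 1e6:.0f} MB/s")
```

Under that assumption the benchmark peaks at about 6.7 Gbit/s, which is in line with the 6.32 to 6.74 GBit/s maxima that nload reports.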

How to Get Free Swap Memory for Matrix Computation in Linux Matlab?

Situation: estimate whether you can compute a big matrix with your RAM and swap in Matlab on Linux.
I need the sum of Mem and Swap, i.e. the corresponding values reported by free -m under the heading total:
total used free shared buff/cache available
Mem: 7925 3114 3646 308 1164 4220
Swap: 28610 32 28578
Free RAM in Matlab can be obtained with:
% http://stackoverflow.com/a/12350678/54964
[r,w] = unix('free | grep Mem');
stats = str2double(regexp(w, '[0-9]*', 'match'));
memsize = stats(1)/1e6;
freeRamMem = (stats(3)+stats(end))/1e6;
Free Swap memory in Matlab: ...
Relation between Memory requirement and Matrix size of Matlab: ...
Testing Suever's 2nd iteration
Suever's command gives me 29.2 GB, which corresponds to free's output, so it is correct.
$ free
total used free shared buff/cache available
Mem: 8115460 4445520 1956672 350692 1713268 3024604
Swap: 29297656 33028 29264628
System: Linux Ubuntu 16.04 64 bit
Linux kernel: 4.6
Linux kernel options: wl, zswap
Matlab: 2016a
Hardware: Macbook Air 2013-mid
Ram: 8 GB
Swap: 28 GB on SSD (set up like in the thread How to Allocate More Space to Swap and Increase its Size Greater than Ram?)
SSD: 128 GB
You can just make a slight modification to the code that you've posted to get the swap amount.
function freeMem = freeMemory(type)
    [r, w] = unix(['free | grep ', type]);
    stats = str2double(regexp(w, '[0-9]*', 'match'));
    memsize = stats(1)/1e6;
    if numel(stats) > 3
        freeMem = (stats(3)+stats(end))/1e6;
    else
        freeMem = stats(3)/1e6;
    end
end
totalFree = freeMemory('Mem') + freeMemory('Swap')
To figure out how much memory a matrix takes up, use the size of the datatype and multiply by the number of elements as a first approximation.
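
As a worked example of that first approximation, here is a short sketch in Python (used only to show the arithmetic). The element sizes are the standard ones for these numeric types; the helper names and the 29.2 GB budget taken from the question are illustrative, and temporaries created during the computation are ignored.

```python
# Bytes per element for a few common numeric types.
BYTES_PER_ELEMENT = {"double": 8, "single": 4, "int32": 4, "uint8": 1}


def matrix_bytes(rows, cols, dtype="double", is_complex=False):
    """First approximation: element size times number of elements."""
    size = rows * cols * BYTES_PER_ELEMENT[dtype]
    # A complex matrix stores a real and an imaginary part per element.
    return 2 * size if is_complex else size


def fits(rows, cols, free_gb, dtype="double"):
    """True if the matrix alone fits into free_gb gigabytes (10^9 bytes)."""
    return matrix_bytes(rows, cols, dtype) / 1e9 <= free_gb


print(matrix_bytes(50000, 50000) / 1e9)   # GB needed for a 50000 x 50000 double matrix
print(fits(50000, 50000, free_gb=29.2))
print(fits(70000, 70000, free_gb=29.2))
```

A 50000 x 50000 double matrix needs about 20 GB and would fit into the 29.2 GB reported above, while a 70000 x 70000 one (about 39 GB) would not.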

PID hash table entries: 4096 (order: 12, 32768 bytes)?

Reading the boot sequence of a Red Hat 5.8 server, I saw this, and it is very unclear to me. Maybe I am wrong, but as far as I know the Linux kernel buddy allocator uses a power-of-two mechanism to allocate and deallocate system memory.
From the boot messages:
PID hash table entries: 4096 (order: 12, 32768 bytes)
Console: colour VGA+ 80x25
Dentry cache hash table entries: 33554432 (order: 16, 268435456 bytes)
Inode-cache hash table entries: 16777216 (order: 15, 134217728 bytes)
Treating the order as a power of two (number of 4096-byte pages):
python -c 'import math ; print int(math.pow(2,12))*4096'
16777216
python -c 'import math ; print int(math.pow(2,16))*4096'
268435456
python -c 'import math ; print int(math.pow(2,15))*4096'
134217728
So my question is: why isn't the first line, "PID hash table entries", 16777216 bytes?
PID hash table entries are allocated as 2^N struct hlist_heads, which on a 64-bit system are 8 bytes each: 2^12*8 = 32768.
The inode and dentry caches are allocated as 2^N pages, usually 4096 bytes each: 2^15*4096 = 134217728.
This info is available in the source, kernel/pid.c and fs/inode.c respectively.
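
The arithmetic can be checked against the three boot messages with a few lines of Python. This sketch assumes 8-byte hlist_head entries and 4096-byte pages, as stated above; the function names are just for illustration.

```python
HLIST_HEAD_BYTES = 8   # one pointer on a 64-bit system
PAGE_BYTES = 4096


def bytes_from_entries(entries):
    """Table size when counted as a number of hlist_head entries."""
    return entries * HLIST_HEAD_BYTES


def bytes_from_page_order(order):
    """Table size when the order is a power-of-two number of pages."""
    return (1 << order) * PAGE_BYTES


# PID hash: "order: 12" is the log2 of the entry count, not a page order.
print(bytes_from_entries(1 << 12))      # 32768

# Dentry and inode caches: entries * 8 and 2^order pages give the same size.
print(bytes_from_entries(33554432), bytes_from_page_order(16))
print(bytes_from_entries(16777216), bytes_from_page_order(15))
```

So all three lines are consistent once each table is seen as entries times 8 bytes; only the meaning of the printed "order" differs for the PID hash.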

Resources