Ubuntu 16 gives "fork: retry: Resource temporarily unavailable", Ubuntu 20 doesn't - python-3.x

I have two machines with similar hardware. One runs Ubuntu 16, the other Ubuntu 20.
I'm running a Python program that is meant to open 30K TCP connections to an endpoint. The Ubuntu 20 machine was able to do the job well just by running these two commands before executing the program:
#ulimit -n 1000000
#ulimit -u 1000000
However, after creating about 12K connections, the Ubuntu 16 machine gives this:
-su: fork: retry: No child processes
-su: fork: retry: Resource temporarily unavailable
-su: fork: retry: Resource temporarily unavailable
-su: fork: retry: Resource temporarily unavailable
Any idea what may be causing Ubuntu 16 to behave like that while Ubuntu 20 seems fine?
Note: I have tried a few things from different posts, but none of them worked.
Thanks in advance.

I think the maximum number of processes overall is lower on Ubuntu 16.04 than on 20.04.
For example, on Ubuntu 16.04 /proc/sys/kernel/pid_max is 32768, while on 18.04 it's 131072.
I think you might have enough other processes (and threads, which also consume PIDs) to hit this limit, so at the very least it's worth checking pid_max.
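A quick way to check (assuming the standard procps and sysctl tools; exact numbers will differ on your machines) is to compare the limit with the number of PIDs currently in use, and raise it if needed:
$ cat /proc/sys/kernel/pid_max
$ ps -eLf | wc -l
$ sudo sysctl -w kernel.pid_max=131072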
Also, it would be better to write your test program to open many connections from a single process/thread using async code, since that avoids creating a process or thread per connection and keeps you well under these limits; see the sketch below.
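As a minimal sketch of that approach (the endpoint address, port, and hold time are placeholders; this assumes Python 3.7+ and only the standard-library asyncio):

import asyncio

HOST, PORT = "192.0.2.10", 8080      # placeholder endpoint, adjust as needed
N_CONNECTIONS = 30_000

async def open_and_hold(host, port):
    # Open one TCP connection and keep it alive for a while.
    reader, writer = await asyncio.open_connection(host, port)
    try:
        await asyncio.sleep(3600)    # hold the connection open
    finally:
        writer.close()
        await writer.wait_closed()

async def main():
    # All 30K connections live in a single process, so no forking is involved;
    # only the open-files limit (ulimit -n) still needs to be raised.
    tasks = [asyncio.create_task(open_and_hold(HOST, PORT))
             for _ in range(N_CONNECTIONS)]
    await asyncio.gather(*tasks, return_exceptions=True)

asyncio.run(main())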
PS: Ubuntu 16.04 is end-of-life (unless you're paying for the extension), so you might want to ensure you upgrade.
PS: Ubuntu versions need the second number block to identify the release correctly (e.g. 16.04 vs 16.10), so it's worth stating the full version.

OK, the solution here was to stay away from Ubuntu 16.

Related

Slurm and Munge "Invalid Credential"

I'm installing Slurm for the first time. I've installed the 19.05.1-2 tarball and used the configurator to make a very simple two-node cluster. The control node is sdc; the compute nodes (running slurmd) are sdc and sdc1. Both were rebuilt with Ubuntu 18.04.
I can start the controller and the compute node sdc, and also successfully submit jobs with srun. That's great. However, when I start slurmd on the second node, SDC1, I get:
slurmd: error: Unable to register: Zero Bytes were transmitted or received
That quickly led me to my munge configuration. The munge log on the controller (sdc) shows "Invalid credential" every second. I triple-checked that munge.key is identical on both hosts. I verified that ntp is running too.
So, by hand, I ran munge -s foobar | unmunge on SDC1, and of course that worked locally. Then I saved the munged text from SDC1 to a file on SDC and tried to unmunge it there. That gave me the "Invalid credential" error again.
Because of this I uninstalled and reinstalled munge on both systems, distributed the key, and repeated that test with the same result.
I guess I'm missing something simple. I don't know what else to do to properly install munge.
It was a UID/GID mismatch between the nodes. Of course, it's mentioned in the installation guide.
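A quick way to check for that (assuming the usual munge and slurm system users exist on both nodes) is to compare their numeric IDs on each host; the UIDs and GIDs must match:
[SDC]$ id munge && id slurm
[SDC1]$ id munge && id slurm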
Did you remember to restart the munge daemon after copying the munge.key to /etc/munge? I got the same error after doing the following:
1: install slurm:
$ apt install -y slurm-client
2: copy slurm.conf
(perhaps create /etc/slurm-llnl beforehand):
$ cp slurm.conf /etc/slurm-llnl
3: copy munge key to client
(munge.key copied earlier from the Slurm server/slurmctld host)
$ cp munge.key /etc/munge
and then I got all the invalid credential errors and problems reported here and in other reports, including the 'Zero Bytes' error on the client side:
[CLIENT]$ sinfo
slurm_load_partitions: Zero Bytes were transmitted or received
with corresponding entries in the Slurm SERVER/slurmctld logs like:
[SERVER]$ tail /var/log/munge/munged.log
2022-12-30 22:57:23 +0100 Notice: Running on ..
2022-12-30 23:01:11 +0100 Info: Invalid credential ...
and
[SERVER]$ tail /var/log/slurm-llnl/slurmctld.log
[2022-12-30T23:01:11.440] error: Munge decode failed: Invalid credential
[2022-12-30T23:01:11.440] ENCODED: Thu Jan 01 01:00:00 1970
[2022-12-30T23:01:11.440] DECODED: Thu Jan 01 01:00:00 1970
[2022-12-30T23:01:11.440] error: slurm_unpack_received_msg: REQUEST_PARTITION_INFO has authentication error: Invalid authentication credential
[2022-12-30T23:01:11.440] error: slurm_unpack_received_msg: Protocol authentication error
All of this is fixed by rebooting the client, as suggested by others here, or, slightly less intrusively, by just restarting the client munge daemon:
[CLIENT]$ sudo systemctl restart munge.service
After that, munge on the client / unmunge on the server works, and it also fixes my main problem of getting the client to see the Slurm server at all. Before the restart, sinfo failed with the dreaded 'Zero Bytes' error:
[CLIENT]$ sinfo
slurm_load_partitions: Zero Bytes were transmitted or received
with server log entries
[SERVER]$ tail /var/log/slurm-llnl/slurmctld.log
...
[2022-12-30T23:17:14.017] error: slurm_unpack_received_msg: Invalid Protocol Version 9472 from uid=-1 at XX.XX.XX.XX:44150
[2022-12-30T23:17:14.017] error: slurm_unpack_received_msg: Incompatible versions of client and server code
[2022-12-30T23:17:14.027] error: slurm_receive_msg [XX.XX.XX.XX:44150]: Unspecified error
And, after the munge restart, voilà:
[CLIENT] $ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
LocalQ* up infinite 1 idle XXX
For these examples: the SERVER is Ubuntu 20.04 and the CLIENTS are Ubuntu 20.04 (and 22.04, which seems to be incompatible with the SERVER's Slurm version, according to the log).

cgroup limit reached - no space left on device

We have two servers running Ubuntu 14.04 with Docker. Every other month, when starting or building a container, we get this message:
container_linux.go:247: starting container process caused "process_linux.go:258: applying cgroup configuration for process caused
\"mkdir /sys/fs/cgroup/memory/docker/cf657a58a1382e62976b4d339946f07e8a40f22f18b52822f884834f78830806: no space left on device\""
The disks still have lots of space, but cat /proc/cgroups gives this (num_cgroups keeps increasing):
#subsys_name hierarchy num_cgroups enabled
cpuset 1 65805 1
cpu 2 65807 1
cpuacct 3 65803 1
blkio 4 65803 1
memory 5 65535 1
devices 6 65805 1
freezer 7 65803 1
net_cls 8 65803 1
perf_event 9 65803 1
net_prio 10 65803 1
hugetlb 11 65803 1
Restarting the server has always helped so far, but we don't want to restart a server every few months.
So I started some research and found a directory under the /sys/fs/cgroup/*/user path.
/sys/fs/cgroup/systemd/user/998.user itself holds 65662 subdirectories, all named something like 36309.session (the number keeps increasing).
Is there a way to see which process is creating those cgroups?
I thought it was process 998, but that doesn't even exist.
I ran into this same problem with AWS Batch. I have no solution, but I found this discussion: https://github.com/moby/moby/issues/29638. It seems that the problem is some kind of leak in the kernel and/or Docker.
I encountered the same issue. You probably have a lot of dangling images/containers, which is causing Docker's cgroups to run out of space. Check it with:
docker images -a
docker ps -a
You need to clean them up. One solution is to remove all images/containers/etc. that are not being used at the moment:
docker system prune -a
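If removing everything unused in one go is too aggressive for your environment, the narrower prune commands (available in Docker 1.13+) only remove stopped containers and dangling images:
docker container prune
docker image prune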

Missing "kernel: Firewall" messages

Where are my iptables "Blocked" log messages? I wonder if this is an OpenVZ issue or something from the scripted install. Note, I'm highly technical, but not a server admin. Could the OpenVZ host be blocking and logging outside of my VPS?
I have two newly installed machines running text-mode CentOS 7 x64, with yum packages up to date and with iptables/CSF.
Also, I ensured machine #2 has all the packages that are on machine #1, though #2 has some extras.
OpenVZ VPS (installed with their image of CentOS 7 x64)
VMware VM (installed with official CentOS 7 x64 minimal mode)
I performed my extra installs/configs exactly the same on both machines, and I have these lines in /etc/csf/csf.conf
TESTING = "0"
TCP_IN = "22,80,443"
UDP_IN = ""
On the VM, I'm getting these /var/log/messages when I nmap scan it:
Apr 12 17:25:23 mach kernel: Firewall: *UDP_IN Blocked* IN=ens192 OUT= ...
Apr 12 17:25:55 mach kernel: Firewall: *TCP_IN Blocked* IN=ens192 OUT= ...
On the VPS, I'm NOT getting any Firewall messages in /var/log/messages when I nmap scan it... but I think it is properly blocking traffic.
How do I even proceed/diagnose this?

How do I access a USB drive on an OSX host from inside a docker container?

I have an application that I eventually want to run on a cloud computing service (e.g., AWS or Google Cloud), packaged inside a docker image. The application needs to run in the cloud because it's designed to process large data files, but before I actually deploy, I'd like to test it first on a local laptop, using a single large data file that I've stored (for test and development purposes) on an external USB drive.
My development machine is an OSX laptop, and I'm using a recent version of docker:
stachyra> uname -a
Darwin Andrews-MacBook-Pro-76.local 14.5.0 Darwin Kernel Version 14.5.0: Tue Sep 1 21:23:09 PDT 2015; root:xnu-2782.50.1~1/RELEASE_X86_64 x86_64
stachyra> docker --version
Docker version 1.10.2, build c3959b1
OSX has mounted my external USB drive, device /dev/disk2s2, as /Volumes/MGR DATA:
stachyra> df
Filesystem 512-blocks Used Available Capacity iused ifree %iused Mounted on
/dev/disk1 974770480 435721376 538537104 45% 54529170 67317138 45% /
devfs 375 375 0 100% 650 0 100% /dev
map -hosts 0 0 0 100% 0 0 100% /net
map auto_home 0 0 0 100% 0 0 100% /home
/dev/disk2s2 3906291632 3869523640 36767992 100% 483690453 4595999 99% /Volumes/MGR DATA
/dev/disk3s1 196608 193160 3448 99% 24143 431 98% /Volumes/VirtualBox
stachyra> diskutil list
/dev/disk0
#: TYPE NAME SIZE IDENTIFIER
0: GUID_partition_scheme *500.3 GB disk0
1: EFI EFI 209.7 MB disk0s1
2: Apple_CoreStorage 499.4 GB disk0s2
3: Apple_Boot Recovery HD 650.0 MB disk0s3
/dev/disk1
#: TYPE NAME SIZE IDENTIFIER
0: Apple_HFS Macintosh HD *499.1 GB disk1
Logical Volume on disk0s2
DB70B91A-3B57-4C82-A758-C4BDEA4160FD
Unlocked Encrypted
/dev/disk2
#: TYPE NAME SIZE IDENTIFIER
0: GUID_partition_scheme *2.0 TB disk2
1: EFI EFI 209.7 MB disk2s1
2: Apple_HFS MGR DATA 2.0 TB disk2s2
/dev/disk3
#: TYPE NAME SIZE IDENTIFIER
0: GUID_partition_scheme *100.7 MB disk3
1: Apple_HFS VirtualBox 100.7 MB disk3s1
It should also be noted that the drive has several directories and data visible inside it, at least when viewed directly through OSX:
stachyra> ls -l /Volumes/MGR\ DATA
total 0
drwxr-xr-x 6 stachyra staff 204 Apr 14 2015 1000genomes
drwxr-xr-x 5 stachyra staff 170 Oct 12 17:41 GIAB
drwxr-xr-x 4 stachyra staff 136 Apr 28 2015 genome_browser_tracks
drwxr-xr-x 24 stachyra staff 816 Oct 6 14:00 mitty
I have tried to follow the advice from this question, which describes how to mount a USB drive in docker when docker is running within a linux host. But my local laptop is OSX, not linux, so it doesn't seem to work.
Explicitly, when attempting to follow the advice of the accepted answer, I obtain the following result:
stachyra> docker run -i -t --privileged -v /dev/disk2s2:/dev/foo ubuntu bash
root@8da7b492a707:/# uname -a
Linux 8da7b492a707 4.1.18-boot2docker #1 SMP Sat Feb 20 08:24:27 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
root@8da7b492a707:/# ls -l /dev/foo
total 0
root@8da7b492a707:/#
Based upon the response, one can see that docker does indeed launch a linux container correctly, and it also creates a volume /dev/foo inside the container as requested, but the actual contents of the USB drive are not accessible via that location: the ls -l command claims there are no files or directories there.
I also tried the second method described in an alternate response to the same question, and that fails even worse:
stachyra> docker run -i -t --device=/dev/disk2s2 ubuntu bash
docker: Error response from daemon: error gathering device information while adding custom device "/dev/disk2s2": not a device node.
stachyra>
I have found another discussion thread on Stack Overflow which suggests that raw USB access is handled quite differently in OSX than in Linux, which I suspect is the reason why both of the above attempts at USB access fail.
But, what should I actually do about it? That is to say, what is the correct sequence of actions or commands to allow docker to access a USB device mounted on an OSX host, rather than linux?
I was finally able to access my USB drive from /var/media inside my container by using the machine-diskutil.sh script mentioned in warmoverflow's comment, like so:
machine-diskutil.sh mount my-machine-name /Volumes/my-usb-drive
and then starting the container like so:
docker run -v /Volumes/my-usb-drive:/var/media -it my/image:latest bash
Because I had tried to add /Volumes/my-usb-drive as a shared folder manually in VirtualBox, I first got this error:
Error: The shared folder /Volumes/Seagate already exists on the
docker machine, please unmount it first.
So I removed it manually and re-ran the machine-diskutil.sh mount command without any problems. Great stuff!
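(For reference, removing the stale shared folder can presumably also be done from the command line; the VM name and folder name below are placeholders, so check them first with VBoxManage showvminfo:)
VBoxManage showvminfo my-machine-name | grep -A 5 'Shared folders'
VBoxManage sharedfolder remove my-machine-name --name <shared-folder-name>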
As per @pgayvallet's comment on GitHub:
As the daemon runs inside a VM in Docker Desktop, it is not possible to actually share a mac host device with the container inside the VM, and this will most definitely never be possible.

Telnet inside chroot environment

I have set up a chroot jail inside a folder using debootstrap. Inside this jail, I installed telnetd. But when I try to log in from a remote host, the connection is closed just after login.
administrator@ubuntu:/$ telnet 192.168.1.100
Trying 192.168.1.100...
Connected to 192.168.1.100.
Escape character is '^]'.
Ubuntu 12.04 LTS
dchub login: trail
Password:
Last login: Mon Sep 9 09:51:47 UTC 2013 from 192.168.1.200 on pts/3
Welcome to Ubuntu 12.04 LTS (GNU/Linux 3.9.9-1-ARCH x86_64)
* Documentation: https://help.ubuntu.com/
Cannot execute /bin/bash: Resource temporarily unavailable
Connection closed by foreign host.
administrator@ubuntu:/$
I have already mounted /proc and /dev/pts.
I finally figured out what the problem was.
My host system has zsh as its default shell, and I was using it to enter the chroot jail and start the telnet server; the jail itself has bash as its default shell. When I used bash to enter the chroot jail and start the telnet server instead, it worked!
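In other words, something like this (the jail path is illustrative) is what ended up working, whereas entering via the host's zsh did not:
$ sudo chroot /path/to/jail /bin/bash
and then start telnetd from that bash shell inside the jail.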
This error message is still shown to me on each login, but everything else works fine:
-bash: fork: retry: No child processes
-bash: fork: retry: No child processes
-bash: fork: retry: No child processes
-bash: fork: retry: No child processes
-bash: fork: Resource temporarily unavailable
