The Linux (CentOS 7.9) kernel has reported a bug. Is it harmful?

My runtime environment is CentOS 7.9 (kernel version 5.16.11) in a VirtualBox virtual machine, allocated 1 GB of memory and 8 CPU cores.
[root@dev236 ~]# uname -a
Linux dev236 5.16.11-1.el7.elrepo.x86_64 #1 SMP PREEMPT Tue Feb 22 10:22:37 EST 2022 x86_64 x86_64 x86_64 GNU/Linux
I ran a computation-intensive program that used 8 threads to continuously use the CPU.
After some time, the operating system issues a bug alert, like this:
[root@dev236 src]# ./server --p:command-threads-count=8
[31274.179023] rcu: INFO: rcu_preempt self-detected stall on CPU
[31274.179039] watchdog: BUG: soft lockup - CPU#3 stuck for 210s! [server:1356]
[31274.179042] watchdog: BUG: soft lockup - CPU#1 stuck for 210s! [server:1350]
[31274.179070] watchdog: BUG: soft lockup - CPU#7 stuck for 210s! [server:1355]
[31274.179214] rcu: 0-...!: (1 GPs behind) idle=52f/1/0x4000000000000000 softirq=10073864/10073865 fqs=0
Message from syslogd@dev236 at Jan 25 18:59:49 ...
kernel:watchdog: BUG: soft lockup - CPU#3 stuck for 210s! [server:1356]
Message from syslogd@dev236 at Jan 25 18:59:49 ...
kernel:watchdog: BUG: soft lockup - CPU#1 stuck for 210s! [server:1350]
Message from syslogd@dev236 at Jan 25 18:59:49 ...
kernel:watchdog: BUG: soft lockup - CPU#7 stuck for 210s! [server:1355]
^C
[root@dev236 src]#
Then I looked at the program's log: the log file was still being appended to, which indicated that my test program was still running.
Can I ignore this bug warning?
Or do I have to do something, for example:
    Reduce the computational intensity of the program?
    Give the CPU a break once in a while (see the sketch after this list)?
    Reduce the number of threads the program starts?
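To make the second option concrete: by "give the CPU a break" I mean something like the sketch below. This is only an illustration, not my actual server code; the work function is a hypothetical stand-in, and the batch size and sleep length would need tuning.

#include <sched.h>
#include <time.h>

/* Hypothetical stand-in for one unit of the real computation. */
static void do_work_chunk(void)
{
    volatile double x = 0.0;
    for (int i = 0; i < 1000; i++)
        x += i * 0.5;
}

int main(void)
{
    const struct timespec pause = { .tv_sec = 0, .tv_nsec = 1000000 }; /* 1 ms */

    for (unsigned long i = 1; ; i++) {
        do_work_chunk();
        if (i % 10000 == 0) {
            sched_yield();           /* let other runnable tasks in */
            nanosleep(&pause, NULL); /* short breather for this vCPU */
        }
    }
}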
Thank you all

Related

memcheck-amd64- killed by OOM

I'm using Valgrind to track down a segmentation fault in my code, but when the run reaches the segmentation-fault point my process is killed.
Searching the /var/log/syslog file, I can see that the process memcheck-amd64- (Valgrind?) has been killed:
Sep 7 12:48:34 fabiano-HP kernel: [10154.654505] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-c2.scope,task=memcheck-amd64-,pid=3688,uid=1000
Sep 7 12:48:34 fabiano-HP kernel: [10154.654560] Out of memory: Killed process 3688 (memcheck-amd64-) total-vm:11539708kB, anon-rss:6503332kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:12952kB oom_score_adj:0
Sep 7 12:46:26 fabiano-HP org.freedesktop.thumbnails.Cache1[3661]: message repeated 3 times: [ libpng error: Read Error]
Sep 7 12:48:34 fabiano-HP systemd[1]: session-c2.scope: A process of this unit has been killed by the OOM killer.
Now, the problem is that Valgrind doesn't write the output file, so I can't understand what's going on... how can I avoid this? I mean, what's happening?
EDIT:
I'm running Valgrind with this command valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --log-file=valgrind-out.txt ./mainCPU -ellpack
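One possible mitigation (an assumption on my part, not something the logs confirm will help) is to cap the address space so that a huge allocation fails with ENOMEM instead of the whole process being SIGKILLed by the OOM killer; SIGKILL cannot be caught, which is why Valgrind never gets a chance to write valgrind-out.txt. Below is a minimal C sketch of the mechanism; the 8 GiB limit is only an example value, and the same cap can be applied to Valgrind itself with ulimit -v in the launching shell.

#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

int main(void)
{
    /* Example value only: cap the virtual address space at 8 GiB. */
    struct rlimit rl = { .rlim_cur = 8UL << 30, .rlim_max = 8UL << 30 };
    if (setrlimit(RLIMIT_AS, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }

    /* With the cap in place, an oversized allocation fails gracefully
       instead of growing until the OOM killer steps in. */
    void *p = malloc(16UL << 30);  /* 16 GiB request */
    printf("malloc(16 GiB) %s\n", p ? "succeeded" : "failed");
    free(p);
    return 0;
}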

Which process sends SIGKILL and terminates all SSH connections on/to my Namecheap Server?

I've been trying to troubleshoot this problem for some days now.
A couple of minutes after starting an SSH connection to my Namecheap server (from Mac, Windows, or cPanel's "Terminal"), it crashes and gives the following error message:
Error: The connection to the server ended in failure at {TIME} PM. (SIGKILL)
and:
Exit Code: 137
I've tried to create some kind of log of SIGKILL signals, but it seems none can be made on a Namecheap server:
auditctl doesn't exist,
we can't get SystemTap because no package managers are available.
Additional details:
uname -a: Linux [-n] 2.6.32-954.3.5.lve1.4.78.el6.x86_64 #1 SMP Thu Mar 26 08:20:27 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
I measured the time between each crash: around 6 minutes.
I don't have very good knowledge of Linux servers and may not have included all the needed information, so please ask for any specifics!

Role of TLB in Linux VM

Does Linux make use of the TLB when running as a VM?
I wrote a C program to find the 'average access time per page' and found that, no matter how many pages are accessed frequently, the average access time per page is the same.
For example,
With frequent access of 32 pages (10000000 times), it's ~10 nanoseconds.
With frequent access of 64 pages (10000000 times), it's ~10 nanoseconds.
With frequent access of 124 pages (10000000 times), it's ~10 nanoseconds.
With frequent access of 256 pages (10000000 times), it's ~10 nanoseconds.
With frequent access of 512 pages (10000000 times), it's ~10 nanoseconds.
and so on.
Does the above test indicate that Linux cannot use the TLB when running as a virtual machine, and always walks the page table to translate every 'virtual page number' to a 'physical page frame' for a process?
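For reference, the measurement loop is roughly like the sketch below. This is a simplified reconstruction rather than the exact program: it pre-faults the pages, then touches one byte in each page in round-robin order and reports the average time per access.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define PAGE_SIZE  4096
#define ITERATIONS 10000000L

/* Touch npages pages over and over; return the average time per access in ns. */
static double avg_access_ns(int npages)
{
    char *buf = malloc((size_t)npages * PAGE_SIZE);
    struct timespec t0, t1;
    volatile char sink = 0;

    if (buf == NULL)
        return 0.0;

    /* Fault every page in first so page-fault cost is not measured. */
    for (int p = 0; p < npages; p++)
        buf[(size_t)p * PAGE_SIZE] = 1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERATIONS; i++)
        sink += buf[(size_t)(i % npages) * PAGE_SIZE];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    free(buf);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / ITERATIONS;
}

int main(void)
{
    int counts[] = { 32, 64, 124, 256, 512 };
    for (size_t i = 0; i < sizeof counts / sizeof counts[0]; i++)
        printf("%d pages: ~%.1f ns per access\n", counts[i], avg_access_ns(counts[i]));
    return 0;
}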
My machine details are:
$ uname -mpi
x86_64 x86_64 x86_64
$ getconf PAGE_SIZE
4096
$ uname -a
Linux VirtualBox 5.8.0-53-generic #60~20.04.1-Ubuntu SMP Thu May 6 09:52:46 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Thanks,

lsyncd - OVERFLOW in event queue - Solution is to tune fs.inotify.max_queued_events

lsyncd is a fantastic alternative to NFS or NAS for replicating files among your Linux hosts. I have found the daemon works well with large Linux filesystems (many files, small to large sizes, xfs, ext4, luks) but requires some sysctl tuning as your filesystem grows.
This "question" is a note to myself so I can always find the answer via searching on stack overflow. Hope it helps you!
Github Project: https://github.com/axkibe/lsyncd
Exception in /var/log/lsyncd.log:
Thu Jun 18 17:48:52 2020 Normal: --- OVERFLOW in event queue ---
Thu Jun 18 17:48:52 2020 Normal: --- HUP signal, resetting ---
Thu Jun 18 17:48:52 2020 Normal: waiting for 1 more child processes.
Solution when you see "OVERFLOW in event queue" in lsyncd.log
From other knowledge bases I had learned to tune max_user_watches, but it was only by also tuning max_queued_events that I corrected the "OVERFLOW in event queue" exception.
The temporary solution worked without needing to restart my lsyncd process.
I picked the number 1000000 as an arbitrarily large number. The default Ubuntu 18 value is 16384.
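For background, when the queue grows past fs.inotify.max_queued_events the kernel drops events and delivers a single event flagged IN_Q_OVERFLOW, which lsyncd surfaces as the OVERFLOW message before resetting (the HUP lines above). The sketch below is plain inotify in C, not lsyncd itself, and only illustrates the condition the tuning avoids.

#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <directory>\n", argv[0]);
        return 1;
    }

    int fd = inotify_init();
    if (fd < 0 || inotify_add_watch(fd, argv[1], IN_CREATE | IN_MODIFY | IN_DELETE) < 0) {
        perror("inotify");
        return 1;
    }

    /* Buffer aligned for struct inotify_event as required by read(2). */
    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    for (;;) {
        ssize_t len = read(fd, buf, sizeof buf);
        if (len <= 0)
            break;
        for (char *p = buf; p < buf + len; ) {
            struct inotify_event *ev = (struct inotify_event *)p;
            if (ev->mask & IN_Q_OVERFLOW)
                printf("queue overflow: the kernel dropped events\n");
            p += sizeof *ev + ev->len;
        }
    }
    return 0;
}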
Temporary Solution
Check your current tuning values:
$ sysctl fs.inotify.max_queued_events
fs.inotify.max_queued_events = 16384
$ sysctl fs.inotify.max_user_watches
fs.inotify.max_user_watches = 8192
Update both max_user_watches and max_queued_events via shell
sudo sysctl fs.inotify.max_user_watches=1000000
sudo sysctl fs.inotify.max_queued_events=1000000
Permanent Solution, Persists after reboot
Update both max_user_watches and max_queued_events in /etc/sysctl.conf
fs.inotify.max_user_watches=1000000
fs.inotify.max_queued_events=1000000
Lsyncd.conf Basic Configuration
/etc/lsyncd/lsyncd.conf
settings {
    logfile = "/var/log/lsyncd.log",
    pidfile = "/var/run/lsyncd/lsyncd.pid",
    insist = true
}
sync {
    default.rsyncssh,
    source      = "/var/application/data",
    host        = "node2",
    excludeFrom = "/etc/lsyncd/exclude",
    targetdir   = "/var/application/data",
    rsync = {
        archive    = true,
        compress   = false,
        whole_file = true
    },
    ssh = {
        port = 22
    }
}
System Details
Linux service1staging 5.0.0-36-generic #39~18.04.1-Ubuntu SMP Tue Nov 12 11:09:50 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Ubuntu 18.04.4 LTS
lsyncd --version
Version: 2.1.6

When I run my JavaFX application on Linux (Fedora), it crashes!

I have made a JavaFX application which runs fine on Windows and Mac OS, but when I run it on Linux (Fedora) the application crashes the whole system with the following log.
1) What is the reason for the crash on Linux?
2) What may be the possible solution to this crash?
A fatal error has been detected by the Java Runtime Environment:
SIGSEGV (0xb) at pc=0x00840e58, pid=2114, tid=2694839152
JRE version: Java(TM) SE Runtime Environment (7.0_51-b13) (build 1.7.0_51-b13)
Java VM: Java HotSpot(TM) Client VM (24.51-b03 mixed mode linux-x86)
Problematic frame: C [libc.so.6+0x2fe58] exit+0x38
Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
If you would like to submit a bug report, please visit: http://bugreport.sun.com/bugreport/crash.jsp
The crash happened outside the Java Virtual Machine in native code. See problematic frame for where to report the bug.
--------------- T H R E A D ---------------
Current thread (0xa0a8d800): JavaThread "JNativeHook Native Hook" [_thread_in_native, id=2306, stack(0xa01ff000,0xa0a00000)]
--------------- S Y S T E M ---------------
OS: Fedora release 14 (Laughlin)
uname: Linux 2.6.35.6-45.fc14.i686 #1 SMP Mon Oct 18 23:56:17 UTC 2010 i686
libc: glibc 2.12.90 NPTL 2.12.90
rlimit: STACK 8192k, CORE 0k, NPROC 1024, NOFILE 1024, AS infinity
load average: 20.56 6.52 4.06
/proc/meminfo: MemTotal: 1013996 kB, MemFree: 112652 kB, Buffers: 4224 kB, Cached: 140000 kB
Memory: 4k page, physical 1013996k(112652k free), swap 1535996k(665220k free)
vm_info: Java HotSpot(TM) Client VM (24.51-b03) for linux-x86 JRE (1.7.0_51-b13), built on Dec 18 2013 18:49:34 by "java_re" with gcc 4.3.0 20080428 (Red Hat 4.3.0-8)
time: Mon Feb 10 16:29:44 2014
elapsed time: 15804 seconds
I am not including the whole log because it is too long to post. Please suggest a possible solution based on this exception log.
Please file a bug at https://github.com/kwhat/jnativehook with the entire crash log. Chances are the issue has already been fixed in the 1.2 trunk.
