Illegal instruction (core dumped) after package upgrade on CentOS 8 - linux

My problems with CentOS 8 began when I ran the "yum update -y" command. After that I started receiving Illegal instruction (core dumped) messages whenever I run yum or any Python scripts.
During the upgrade I got errors like these:
Upgrading : NetworkManager-libnm-1:1.30.0-0.5.el8.x86_64 18/168
Running scriptlet: NetworkManager-libnm-1:1.30.0-0.5.el8.x86_64 18/168
Running scriptlet: NetworkManager-1:1.30.0-0.5.el8.x86_64 19/168
Upgrading : NetworkManager-1:1.30.0-0.5.el8.x86_64 19/168
Running scriptlet: NetworkManager-1:1.30.0-0.5.el8.x86_64 19/168
/var/tmp/rpm-tmp.1WBtNY: line 4: 4581 Illegal instruction (core dumped) firewall-cmd --reload --quiet
Upgrading : elfutils-libs-0.182-3.el8.x86_64 20/168
Upgrading : elfutils-debuginfod-client-0.182-3.el8.x86_64 21/168
Upgrading : python3-syspurpose-1.28.9-1.el8.x86_64 22/168
Running scriptlet: selinux-policy-targeted-3.14.3-59.el8.noarch 58/168
Upgrading : selinux-policy-targeted-3.14.3-59.el8.noarch 58/168
Running scriptlet: selinux-policy-targeted-3.14.3-59.el8.noarch 58/168
Upgrading : subscription-manager-1.28.9-1.el8.x86_64 59/168
Running scriptlet: subscription-manager-1.28.9-1.el8.x86_64 59/168
chmod: cannot access '/etc/pki/entitlement/*.pem': No such file or directory
Upgrading : lvm2-8:2.03.11-0.4.20201222gitb84a992.el8.x86_64 60/168
Running scriptlet: lvm2-8:2.03.11-0.4.20201222gitb84a992.el8.x86_64 60/168
Upgrading : kpartx-0.8.4-7.el8.x86_64 61/168
Upgrading : xfsprogs-5.0.0-7.el8.x86_64 62/168
Cleanup : elfutils-default-yama-scope-0.182-2.el8.noarch 154/168
Cleanup : tzdata-java-2020d-1.el8.noarch 155/168
Cleanup : samba-common-4.13.2-5.el8.noarch 156/168
Running scriptlet: NetworkManager-1:1.30.0-0.4.el8.x86_64 157/168
Cleanup : NetworkManager-1:1.30.0-0.4.el8.x86_64 157/168
Running scriptlet: NetworkManager-1:1.30.0-0.4.el8.x86_64 157/168
/var/tmp/rpm-tmp.M7HXpR: line 4: 9248 Illegal instruction (core dumped) firewall-cmd --reload --quiet
Cleanup : NetworkManager-libnm-1:1.30.0-0.4.el8.x86_64 158/168
Running scriptlet: NetworkManager-libnm-1:1.30.0-0.4.el8.x86_64 158/168
Cleanup : perl-libs-4:5.26.3-417.el8.x86_64 159/168
Cleanup : python3-libs-3.6.8-33.el8.x86_64 164/168
Cleanup : glibc-2.28-141.el8.x86_64 165/168
Cleanup : glibc-langpack-en-2.28-141.el8.x86_64 166/168
Cleanup : glibc-common-2.28-141.el8.x86_64 167/168
Cleanup : tzdata-2020d-1.el8.noarch 168/168
Running scriptlet: libwbclient-4.13.3-1.el8.x86_64 168/168
Running scriptlet: tuned-2.15.0-1.el8.noarch 168/168
Running scriptlet: kmod-kvdo-6.2.4.26-76.el8.x86_64 168/168
sort: fflush failed: 'standard output': Broken pipe
sort: write error
gzip: stdout: Broken pipe
gzip: stdout: Broken pipe
sort: write failed: 'standard output': Broken pipe
sort: write error
sort: fflush failed: 'standard output': Broken pipe
sort: write error
gzip: stdout: Broken pipe
gzip: stdout: Broken pipe
sort: write failed: 'standard output': Broken pipe
sort: write error
Running scriptlet: tzdata-2020d-1.el8.noarch 168/168
[/usr/lib/tmpfiles.d/pesign.conf:1] Line references path below legacy directory /var/run/, updating /var/run/pesign → /run/pesign; please update the tmpfiles.d/ drop-in file accordingly.
Running scriptlet: glibc-common-2.28-145.el8.x86_64 168/168
Verifying : python3-pexpect-4.3.1-3.el8.noarch 1/168
Verifying : python3-ptyprocess-0.5.2-4.el8.noarch 2/168
Verifying : abattis-cantarell-fonts-0.0.25-6.el8.noarch 3/168
Now the problems are with yum and with importing some Python packages, but curl is OK.
[root@ddns-ard-2 ~]#
[root@ddns-ard-2 ~]# yum --version
Illegal instruction (core dumped)
[root@ddns-ard-2 ~]# yum upgrade
Illegal instruction (core dumped)
[root@ddns-ard-2 ~]# python3
Python 3.6.8 (default, Dec 22 2020, 19:04:08)
[GCC 8.4.1 20200928 (Red Hat 8.4.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> import pymongo
Illegal instruction (core dumped)
[root@ddns-ard-2 ~]# rpm -q curl
curl-7.61.1-17.el8.x86_64
[root@ddns-ard-2 ~]# curl https://www.google.com | more
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0<!doctype html>
Messages from my /var/log/messages
Jan 12 02:18:02 ddns-ard-2 kernel: traps: yum[3731] trap invalid opcode ip:7efdef6dd59f sp:7ffef19166c0 error:0 in libm-2.28.so[7efdef655000+181000]
Jan 12 02:18:02 ddns-ard-2 systemd[1]: Started Process Core Dump (PID 3732/UID 0).
Jan 12 02:18:02 ddns-ard-2 systemd-coredump[3733]: Resource limits disable core dumping for process 3731 (yum).
Jan 12 02:18:02 ddns-ard-2 systemd-coredump[3733]: Process 3731 (yum) of user 0 dumped core.
Jan 12 02:18:02 ddns-ard-2 systemd[1]: systemd-coredump@278-3732-0.service: Succeeded.
A workaround I found on the internet did not help:
[root@ddns-ard-2 ~]# export NSS_DISABLE_HW_AES=1
[root@ddns-ard-2 ~]# yum --version
Illegal instruction (core dumped)
[root@ddns-ard-2 ~]# export NSS_DISABLE_HW_GCM=1
[root@ddns-ard-2 ~]# yum --version
Illegal instruction (core dumped)
[root@ddns-ard-2 ~]#
[root@ddns-ard-2 ~]# dnf
Illegal instruction (core dumped)
[root@ddns-ard-2 ~]#
Any ideas how to resolve this problem?
I have 5 virtual CPUs with identical cores; here is the information about one of them:
[root@ddns-ard-2 ~]# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
stepping : 2
microcode : 0x38
cpu MHz : 2294.686
cache size : 46080 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 15
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc cpuid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx hypervisor lahf_lm epb pti cqm_llc cqm_occup_llc dtherm ida arat pln pts
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 4589.37
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

This happens if the hypervisor masks XSAVE support, but does not mask the FMA feature bit in CPUID. It's a bit of a weird hypervisor configuration, which is why it was not caught in testing. Selecting a different CPU model in the hypervisor should work around this (something that supports AVX2 should do it). As a side effect, you will likely get better performance from your VMs.
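A quick way to check for the mismatch from inside the guest (a sketch based on the flags shown above; if fma is reported but avx2 is not, you have the combination that trips up the libm routine selection):
[root@ddns-ard-2 ~]# grep -m1 -ow -e fma -e avx2 /proc/cpuinfo
fma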
This was reported through the appropriate channels, and updates have been released:
glibc: FMA4 math routines are not selected properly after bug 1817513
RHSA-2021:1585: glibc security, bug fix, and enhancement update

Related

Unable to start more than a certain number of Docker containers (Ubuntu vServer)

I am currently desperately looking for help. I am trying to get my server up and running using docker containers but it seems I have hit a wall.
I have rented a vServer (4 Cores, 8 GB RAM) and I am trying to get everything up and running.
A brief overview of my idea:
Traefik as reverse proxy to distribute my applications
Crowdsec for security and traefik middleware
As Applications: Gitea, Nextcloud and a basic Apache/PHP Web-Server
Watchtower managing updates
I installed everything using Docker Compose images, and it works, but with one major flaw:
I am unable to start every container.
At first I thought everything worked, until I realised that the last container I started forced another container to exit with code 137 (128 + 9, i.e. killed by SIGKILL, which usually points at the OOM killer).
To avoid that I added resource limits to my docker-compose.yml files:
deploy:
  resources:
    limits:
      cpus: '0.001'
      memory: 50M
    reservations:
      cpus: '0.0001'
      memory: 20M
I populated the values as seemed fit for each container. After that, no container was shut down again, but now I am unable to get the last container up. (It's not any specific container, just the last one I try to start.)
I get one of the following three error messages:
$ docker compose up -d
[+] Running 3/4
⠿ Network nextcloud_default Created 3.5s
⠿ Container nextcloud-redis Started 1.3s
⠿ Container nextcloud-db Starting 2.1s
⠿ Container nextcloud-app Created 0.2s
Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 2, stdout: , stderr: runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
goroutine 0 [idle]:
runtime: unknown pc 0x7f6605cd0a7c
stack: frame={sp:0x7ffd4137b1a0, fp:0x0} stack=[0x7ffd40b7c810,0x7ffd4137b850)
0x00007ffd4137b0a0: 0x000055de6cee2154 <runtime.funcspdelta+0x0000000000000034> ...
$ docker compose up -d
...
Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: failed to write "380000": write /sys/fs/cgroup/cpu,cpuacct/docker/dfbfa79b7a987b1248a5498fbd6fe4438a68cb0e147a3285a8638a845193f4bf/cpu.cfs_quota_us: invalid argument: unknown
Or with sudo:
$ sudo docker compose up -d
...
Error response from daemon: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/moby/dfbfa79b7a987b1248a5498fbd6fe4438a68cb0e147a3285a8638a845193f4bf/log.json: no such file or directory): runc did not terminate successfully: exit status 2: unknown
I did absolutely nothing in between these commands, and I got different error messages. Huh?
I think it might be some resource issue, but inspecting the server with htop or my containers with docker stats shows the absolute opposite. CPU idles at 1-5% and 800M of 8GB RAM are used.
After some research I found this, but my understanding of the Linux OS is not deep enough for me to say with certainty that it has something to do with my issue. Furthermore, I don't get how I would set systemd.unified_cgroup_hierarchy=0 (my untested sketch is below).
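From what I've read, a parameter like that would go on the kernel command line via GRUB; here is my untested sketch of what that would look like on Ubuntu:
$ sudo nano /etc/default/grub
# append the parameter to the existing GRUB_CMDLINE_LINUX_DEFAULT line:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash systemd.unified_cgroup_hierarchy=0"
$ sudo update-grub
$ sudo reboot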
I tried to increase vm.min_free_kbytes and took a look at
$ ulimit
unlimited
and obviously searched the internet. But sadly, I didn't find any exact matching issues. (Or I didn't use the correct buzzwords as my understanding of Linux is not that deep.)
Any idea would be much appreciated!
Thanks!
P.S.: Here are some version details of my server:
$ docker --version
Docker version 20.10.23, build 7155243
$ uname -r
5.2.0
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
CPU family: 6
Model: 63
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 2
BogoMIPS: 4994.44
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu cpuid_faulting pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb invpcid_single intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d
Virtualization features:
Virtualization: VT-x
Hypervisor vendor: Parallels
Virtualization type: container
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 1 MiB (4 instances)
L3: 30 MiB (1 instance)
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.1 LTS
Release: 22.04
Codename: jammy
OK, I have found the issue.
My hosting provider has a process limit on their VMs. I asked them to increase it, and now my issue is gone.
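For anyone hitting the same pthread_create failed: Resource temporarily unavailable error, these checks would have pointed me at the cap sooner (a sketch; in my case the limit was enforced by the provider, outside the VM):
$ ulimit -u                          # per-user process/thread limit
$ ps -eLf | wc -l                    # threads currently running
$ cat /proc/sys/kernel/threads-max   # system-wide thread cap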

Why do I have no scaling_governor?

I am running v8's benchmark program, and I run the following command
./tools/cpu.sh fast
It prints out
Setting CPU frequency governor to "ondemand"
./tools/cpu.sh: line 13: /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor: no such file or directory
And I run
# ls /sys/devices/system/cpu/cpu0
cache crash_notes crash_notes_size microcode node0 power subsystem topology uevent
And I find there is no "cpufreq" entry.
After some searching, I found that I should install cpufrequtils, so I ran
yum install cpufrequtils
After that, it still does not work, so I wonder what is wrong here.
My system is
# lsb_release -a
LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.2 (Final)
Release: 7.2
Codename: Final
And my cpuinfo is
cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 94
model name : Intel(R) Xeon(R) Gold 61xx CPU
stepping : 3
microcode : 0x1
cpu MHz : 2499.998
cache size : 4096 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap xsaveopt xsavec xgetbv1 arat
bogomips : 4999.99
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
It depends on your kernel configuration whether the governor interface is exposed at all, and which governors are available. I don't know any specifics about CentOS. On Debian/Ubuntu the governors should be available by default (last I checked, they were).
Maybe https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/power_management_guide/cpufreq_governors helps?
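To see quickly what (if anything) your kernel exposes, something like this should do (a sketch; note that on a VM the cpufreq directory is often missing entirely because no scaling driver binds to virtual CPUs, and your cpuinfo does show the hypervisor flag):
$ ls /sys/devices/system/cpu/cpu0/cpufreq/ 2>/dev/null || echo "no scaling driver bound"
$ cpupower frequency-info   # cpupower is in the kernel-tools package on CentOS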

how to enable PMU in KVM guest

I am running KVM/QEMU on my Lenovo X1 laptop.
The guest OS is Ubuntu 15.04 x86_64.
Now I want to run the 'perf' command in the guest OS, but I see the following in dmesg in the guest:
...
[ 0.055442] smpboot: CPU0: Intel Xeon E3-12xx v2 (Ivy Bridge) (fam: 06, model: 3a, stepping: 09)
[ 0.056000] Performance Events: unsupported p6 CPU model 58 no PMU driver, software events only.
[ 0.057602] x86: Booting SMP configuration:
[ 0.058686] .... node #0, CPUs: #1
[ 0.008000] kvm-clock: cpu 1, msr 0:1ffd6041, secondary cpu clock
...
So the perf command cannot use hardware PMU events in the guest OS.
How can I enable the hardware PMU from my host for the Ubuntu guest?
Thanks,
-Tao
The page https://github.com/mozilla/rr/wiki/Building-And-Installing gives some hints on how to enable the guest PMU:
Qemu: On QEMU command line use
-cpu host
Libvirt/KVM: Specify CPU passthrough in domain XML definition:
<cpu mode='host-passthrough'/>
The same advice appears in https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_tuning_and_optimization_guide/sect-virtualization_tuning_optimization_guide-monitoring_tools-vpmu
I edited the <cpu mode='host-passthrough'/> line into the /etc/libvirt/qemu/my_vm_name.xml file in place of the existing <cpu>...</cpu> block.
(In virt-manager, use "host-passthrough" in the CPU "Model:" field - http://blog.wikichoon.com/2016/01/using-cpu-host-passthrough-with-virt.html)
Now the PMU works: I tested it with perf stat echo inside the VM; "arch_perfmon" shows up in /proc/cpuinfo, and the PMU is reported as enabled in dmesg | grep PMU.
The -cpu host option of QEMU was indeed used, according to /var/log/libvirt/qemu/vm_name.log:
/usr/bin/kvm-spice ... -machine ...,accel=kvm,... -cpu host ...
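One caveat: libvirt may overwrite XML files edited directly under /etc/libvirt/qemu, so the safer workflow is virsh edit (a sketch, assuming the domain is named my_vm_name):
$ virsh edit my_vm_name
# in the editor, replace the existing <cpu>...</cpu> block with:
#   <cpu mode='host-passthrough'/>
$ virsh shutdown my_vm_name && virsh start my_vm_name   # restart the domain to apply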

Different CPU cache sizes reported by /sys/devices/ and dmidecode

I'm trying to get the size of the different cache levels in my system.
I tried two techniques.
a) Using information from /sys/devices. Here is the output:
$ cat /sys/devices/system/cpu/cpu0/cache/index1/size
32K
$ cat /sys/devices/system/cpu/cpu0/cache/index2/size
256K
$ cat /sys/devices/system/cpu/cpu0/cache/index3/size
8192K
b) Using information from dmidecode
$ sudo dmidecode -t cache
Cache Information
Socket Designation: CPU Internal L1
Configuration: Enabled, Not Socketed, Level 1
Operational Mode: Write Through
Location: Internal
Installed Size: 256 KB
Maximum Size: 256 KB
< .... >
Cache Information
Socket Designation: CPU Internal L2
Configuration: Enabled, Not Socketed, Level 2
Operational Mode: Write Through
Location: Internal
Installed Size: 1024 KB
Maximum Size: 1024 KB
< .... >
Cache Information
Socket Designation: CPU Internal L3
Configuration: Enabled, Not Socketed, Level 3
Operational Mode: Write Back
Location: Internal
Installed Size: 8192 KB
Maximum Size: 8192 KB
< .... >
The sizes reported for the L1 and L2 caches differ between the two methods. Any ideas as to a) why there is this discrepancy, and b) which method gives the correct values?
Other related information:
$ uname -a
Linux 3.0.0-14-generic #23somerville3-Ubuntu SMP Mon Dec 12 09:20:18 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 58
model name : Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
stepping : 9
cpu MHz : 2400.000
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 x2apic popcnt aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips : 6784.23
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
< ... >
A few things:
You have a quad-core CPU.
The index<n> name in /sys/devices/system/cpu/cpu<n>/cache does not correspond to L1/L2/L3, etc. There is a .../index<n>/level file that will tell you the level of the cache.
Your L1 cache is split into two caches (likely index0 and index1), one for data and one for instructions (see .../index<n>/type), per core. 4 cores * 2 halves * 32K matches the 256K that dmidecode reports.
The L2 cache is split per core. 4 cores * 256K (from index2) = 1024K, which matches dmidecode's L2 number.
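You can confirm all of this straight from sysfs; a small sketch that dumps the level, type, size, and sharing of each cache index:
$ for d in /sys/devices/system/cpu/cpu0/cache/index*; do
>   echo "$d: L$(cat $d/level) $(cat $d/type) $(cat $d/size) shared_with_cpus=$(cat $d/shared_cpu_list)"
> done
On a chip like this, the L1/L2 entries should list only one core's hardware threads in shared_cpu_list, while the L3 entry lists all of them.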

How to configure Linux/CPU for running large-scale software (NUMA)

I am doing performance analysis on Linux for large-scale programs that are memory-driven (tens of gigabytes of memory).
I am wondering whether it is possible to configure Linux/hardware to be better suited to running this kind of large program, but I am not familiar with this area.
Does anybody have pointers on how to configure:
the memory allocation strategy of the OS
the CPU cache configuration
anything else...
Any comment is appreciated.
This is the typical CPU model (4 Opteron processors, each dual-core):
processor : 3
vendor_id : AuthenticAMD
cpu family : 15
model : 65
model name : Dual-Core AMD Opteron(tm) Processor 2218
stepping : 2
cpu MHz : 2600.000
cache size : 1024 KB
physical id : 1
siblings : 2
core id : 1
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips : 5200.09
TLB size : 1088 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc
Useful for investigating memory / caching on a multi-socket system:
hwloc's lstopo (example):
lstopo
numactl / libnuma (but only if it really is a NUMA system)
numactl --hardware
numactl --show
sysfs, procfs:
sudo grep . /sys/devices/system/cpu/cpu*/cpufreq/*
grep . /sys/devices/system/cpu/cpu*/topology/physical_package_id
sudo grep . /proc/irq/*/smp_affinity # compare w/ /proc/interrupts
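Once the topology is clear, numactl is also the simplest lever for the OS memory-allocation side; for example (a sketch, with ./your_app as a placeholder for your binary):
$ numactl --cpunodebind=0 --membind=0 ./your_app   # keep CPU and memory on node 0
$ numactl --interleave=all ./your_app              # spread pages across all nodes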
