hwloc + lstopo Failing to Generate Topology on Dual-CPU Machine for Open-MPI

I've been attempting to set up a dual-CPU workstation (Dell Precision 7820) to run local parallel jobs using OpenMPI 2.1.1-8 (as preinstalled on Ubuntu 18.04); however, it fails to run with the following error:
mpirun: pci-common.c:125: hwloc_pci_compare_busids: Assertion `0' failed.
Examining the source code of pci-common.c, you can find a comment before the assert(0) line stating that execution should never normally reach that point, and that the assertion aborts both debug and non-debug builds. Attempting to generate a system topology map via lstopo (a program within hwloc) also fails with a similar error.
I locally compiled a newer release of hwloc (2.0.4, versus the preinstalled 1.11.9-1) and found that lstopo would only generate a topology map when hwloc was compiled against libpciaccess-dev rather than the standard libpciaccess0 that comes preinstalled. The configure summaries from building hwloc with the two pciaccess libraries show the following:
Probe / display I/O devices: PCI(linux) LinuxIO GL
Probe / display I/O devices: PCI(pciaccess+linux) LinuxIO GL
with the former compiled against libpciaccess0 and the latter against libpciaccess-dev. Again, only the latter is capable of generating a system topology map, and I'm under the impression OpenMPI needs this information to properly scatter jobs on the system. I'm currently unsure how to enforce these version changes on the current OpenMPI package, or whether things need to be compiled entirely from source. Is there potentially a simpler way to approach this problem?
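For reference, a minimal sketch of the rebuild described above (the download URL and directory name are assumptions based on hwloc's release layout; adjust for your version):
sudo apt install libpciaccess-dev
wget https://download.open-mpi.org/release/hwloc/v2.0/hwloc-2.0.4.tar.gz
tar -xf hwloc-2.0.4.tar.gz
cd hwloc-2.0.4
./configure    # the summary should now report PCI(pciaccess+linux)
make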

The problem was solved through trial and error. First, purge the openmpi installation from the system (if installed via apt):
sudo apt purge openmpi-bin
sudo apt purge openmpi-common
Then, download hwloc 1.11.13 (ultrastable) from https://www.open-mpi.org/software/hwloc/v1.11/ and extract it to a local directory. Enter the hwloc directory and run:
./configure
make
sudo make install
After this is completed, install libhwloc5 then openmpi from apt:
sudo apt-get install libhwloc5
sudo apt-get install openmpi-bin
sudo apt-get install openmpi-common
Open-MPI should now run as intended: you should be able to generate the system topology by running 'lstopo' and verify MPI is working by running 'mpirun' without errors.
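A quick sanity check (the -np value and the hostname test program are illustrative; any small program works):
lstopo
mpirun -np 4 hostname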
Hope this helps anyone who has a similar issue in the future!

Related

Any success installing VMware Workstation on virgin Rocky Linux 8.5?

Using a virgin (but updated) version of Rocky Linux 8.5, I am trying to install VMware Workstation 16.2.1 (and other versions), but I get compile errors during the first attempt to run, when vmmon and vmnet are being built.
All the proper, current headers from kernel-devel and kernel-headers are installed.
I tried upgrading to the 5.16.4 kernel from kernel.org, with all associated headers, and get basically the same errors.
"Unable to install all modules." i.e., vmmon and vmnet
Posts I have found while searching the net seem to indicate that there was a "back-port" of an upstream fix to Rocky that has affected the ability to build the loadable kernel modules necessary to run VMware, but I cannot confirm this is actually the problem that I am experiencing.
So I simply ask these questions: Can anyone (today) install VMware Workstation 16.2.1 (or any version) on a fresh install of Rocky Linux 8.5?
If so, would you please point me at your installation instructions, because I am unable to build the "vmmon" and "vmnet" modules today (2022-01-04) that would allow me to actually run virtual machines with VMware? (The kernel modules fail to compile and build.)
(And after 15 years of using Stack Overflow, I do not have the reputation to create a "rocky-linux" question tag...)
See https://unix.stackexchange.com/questions/689436/the-vmmon-and-vmnet-vmware-workstation-kernel-modules-fail-to-build-on-rocky-lin
mbubecek's instructions work for a variety of releases; if you follow them, the modules should compile perfectly and run without issue.
I have successfully used these methods at least half a dozen times with Rocky 8.5 and 8.6, with VMware Workstation 16.1 up to version 16.2.1.
NOTE: This error is NOT specific to Rocky Linux. It also happens on some versions of RHEL 8 and CentOS 8.x. I would also expect this "fix" to work on all other Linux distributions that are RHEL 8-derived.
I've been having difficulty with the same issue, and a colleague pointed me to check my kernel. This is our "official" resolution; see if the below works for you.
This is due to differences between the kernel and the source code of the VMware modules; see here for more information. You can get the corrected kernel module sources and build them by executing the following commands:
wget https://github.com/mkubecek/vmware-host-modules/archive/workstation-16.1.0.tar.gz
tar -xf workstation-16.1.0.tar.gz
cd vmware-host-modules-workstation-16.1.0/
make
sudo make install
If you get the error,
crosspage.c:53:16: fatal error: linux/frame.h: No such file or directory
The error is described here. The solution is to remove (i.e. comment out) the offending include in crosspage.c. After doing the sudo make install, it is a very good idea to restart your host.
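For reference, a one-liner that performs the comment-out step (the path inside the extracted sources is an assumption and may differ between releases):
sed -i 's|^#include <linux/frame.h>|/* #include <linux/frame.h> */|' vmmon-only/common/crosspage.c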
You may need to manually insert the modules into the kernel the first time after running make install. The kernel modules (vmmon.ko and vmnet.ko) will be found at /lib/modules/$(uname -r)/misc. The following set of commands will do this:
cd /lib/modules/$(uname -r)/misc
sudo insmod vmmon.ko
sudo insmod vmnet.ko
The modules should load automatically after a restart/reboot.
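If they do not, refreshing the module index and loading via modprobe usually helps (a suggestion based on standard module tooling, not part of the original answer):
sudo depmod -a
sudo modprobe vmmon
sudo modprobe vmnet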
If you update VMware to a different version (say 16.2.1), you may need to do this again; just change the versions in the above commands. If you hit the update button on the splash screen and failed to notice the version you were updating to, you can run 'vmware -v' at a command prompt to get the version you updated to.

Install4j: Unix Installer causes bin/unpack200: not found

We use a setup created with install4j (we still use 5.0.11). On a new local Unix machine (Linux version 3.8.13-44.1.4.el6uek.x86_64) this setup failed; the log shows:
Unpacking JRE ...
micsetup.sh: 210: micsetup.sh: bin/unpack200: not found
Preparing JRE ...
Error unpacking jar files. The architecture or bitness (32/64) of the bundled JVM might not match your machine.
Searching for this error I found this:
the program tries to run the file /bin/unpack200, which does not exist. However, the file /usr/bin/unpack200 does exist. This is because the file lives in different places depending on the architecture of the machine: if it is 32-bit, it is in one place; if it is 64-bit, it is in the other. I am having this problem because the file was built to run on a 32-bit architecture but I am using a 64-bit machine. Therefore, to fix this problem one must install the 32-bit libraries.
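You can confirm the bitness mismatch by inspecting the bundled binary (the path is illustrative; point file at wherever the installer unpacks the JRE):
file ./jre/bin/unpack200
# "ELF 32-bit LSB executable, Intel 80386 ..." indicates a 32-bit JRE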
After running
sudo dpkg --add-architecture i386
sudo apt-get update
sudo apt-get install libc6:i386 libncurses5:i386 libstdc++6:i386
our setup works.
My question: is there a way to configure the "Unix installer" in install4j to build the setup so that it works on 64-bit Linux systems like the one mentioned, without installing additional libraries on the system? I think not all of our customers would allow this.
Thanks in advance!
Frank
No, there is no such functionality in install4j. Bundling JREs on Linux is generally problematic.
One strategy would be to offer installers with a 32-bit JRE and installers with a 64-bit JRE.

Linux environments: make the machine slow

This may seem weird, but is there a way to make the machine slow (Linux/Unix flavours, preferably RHEL)?
I need to control the speed of the machine to make sure the code works on very slow systems and to identify the right breaking point (in terms of time).
One way I can do it is to run some heavy background process.
Any other smarter way?
Thanks
How to produce high CPU, memory, or I/O load to stress test a Linux server
Install some prerequisites
On CentOS/RHEL
yum install gcc gcc-c++ autoconf automake
On Debian, Ubuntu
sudo su -
apt-get update
apt-get install build-essential
Download the latest tarball and run configure, make, make install
wget http://pkgs.fedoraproject.org/repo/pkgs/stress/stress-1.0.4.tar.gz/a607afa695a511765b40993a64c6e2f4/stress-1.0.4.tar.gz
tar zxvf stress-1.0.4.tar.gz
cd stress-1.0.4
./configure
make
make install
The binary gets installed under /usr/local/bin
To start stress, run stress followed by the -c flag for CPU load, -m for memory stress, -i for I/O and -d for HDD. For example, to stress the CPU execute:
stress -c 5
Execution of the command above will hog all available CPU power, spinning up five workers and creating a load roughly five times what a single core can happily handle.
Similarly, to stress some memory you can execute the following (note that -m takes a worker count, not a size; the per-worker allocation is set with --vm-bytes):
stress -m 2 --vm-bytes 512M
You could also use an emulator, maybe a suitably configured QEMU. If you have the source code of your application, you might cross-compile it for e.g. ARM and emulate that binary (added benefit: you know that your application can work on ARM), as sketched below.
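A minimal sketch of that approach, assuming a Debian/Ubuntu-style toolchain (package names, and the hypothetical myapp program, may differ on your system):
sudo apt-get install gcc-arm-linux-gnueabihf qemu-user
arm-linux-gnueabihf-gcc -o myapp myapp.c    # cross-compile for 32-bit ARM
qemu-arm -L /usr/arm-linux-gnueabihf ./myapp    # run under user-mode emulation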
And of course you might use Fabrice Bellard's JavaScript PC emulator in your browser; it is a slow "PC".
You could also buy some cheap, slow Linux-capable hardware (e.g. a Raspberry Pi), or run your program on some old netbook.
What you seem to be looking for is known in production as a stress test.
A good start: https://en.wikipedia.org/wiki/Stress_testing_%28software%29
https://wiki.archlinux.org/index.php/Stress_Test
Using the stress command line, for example:
# yum install stress
# stress --cpu 16 --io 8

Error setting up ddd to debug bash scripts

I am comparatively new to Linux. I am running 64-bit Fedora on my PC. I am having difficulty setting up ddd with bashdb. I am able to install it using yum, but when I run it for bashdb, the ddd environment comes up and then keeps working indefinitely, unless I kill it manually.
I used Google to find out what the problem was and learned that many people have the same problem when using Linux package installers. The packaged version has bugs, so I have to compile the latest source and install it manually. So I downloaded the source and ran ./configure, which produced the following error and exited:
configure: error: Cannot find termcap compatible library
I searched again and found out I need the termcap library on my PC; see:
https://lists.gnu.org/archive/html/bug-ddd/2013-01/msg00004.html
http://www.cplusplus.com/forum/unices/58299/
I used yum to install ncurses but found out it was already installed. I used locate to find the path of ncurses and passed it to configure using the following commands:
sudo ./configure --with-termlib-libraries=/lib/libncurses.so.5
sudo ./configure --with-termlib-libraries=/lib/libncurses.so.5.9
Still, I am having the same error.
It is very frustrating because I have tried almost everything I found on the internet. Maybe there is a minor point that I am overlooking due to my inexperience. My main concern is to be able to debug the complex bash scripts that I am going to develop in the near future. I am not very comfortable with command-line debugging, i.e. without an interface. Any tips/advice that can get me going with debugging, maybe with some other application, are also welcome.
I installed the ncurses development package to get past this problem:
sudo yum install ncurses-devel*
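After the development headers are installed, re-running the standard build described in the question should get past the termcap check (a sketch of the usual autotools flow):
./configure
make
sudo make install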

OpenCV keeps "uninstalling" itself (Linux)

Really annoying issue here. On Linux Mint OS. Every so often, I'll get this error when running OpenCV code:
HIGHGUI ERROR: V4L/V4L2: VIDIOC_S_CROP
OpenCV Error: Unspecified error (The function is not implemented. Rebuild the library with Windows, GTK+ 2.x or Carbon support. If you are on Ubuntu or Debian, install libgtk2.0-dev and pkg-config, then re-run cmake or configure script) in cvNamedWindow, file /home/ravi/Desktop/opencv/OpenCV-2.1.0/src/highgui/window.cpp, line 180
terminate called after throwing an instance of 'cv::Exception'
what(): /home/ravi/Desktop/opencv/OpenCV-2.1.0/src/highgui/window.cpp:180: error: (-2) The function is not implemented. Rebuild the library with Windows, GTK+ 2.x or Carbon support. If you are on Ubuntu or Debian, install libgtk2.0-dev and pkg-config, then re-run cmake or configure script in function cvNamedWindow
The way to fix this, I've found, is to do the following:
cd OpenCV/
cd build/
cmake ..
make
sudo make install
sudo ldconfig
<restart computer>
Then I'll come back, start running my OpenCV code again, and it'll be fine. But then a few hours later, or possibly after turning the computer off and on, I'll be back to the same stupid error!
Does anyone have any idea what's going on here and how I can prevent this? It's frustrating as hell.
It sounds like a general critical error in the program code. Is there a specific task being performed when the error occurs? You might want to use strace to capture the program's behaviour as it runs, or enable application memory dumps for the user running the process. These could be passed to the developer for debugging and inspection.
I believe the problem was solved by paying attention to where my USB camera was actually located in /dev/. Giving a faulty path to the video-source functions causes this type of error; restarting my computer occasionally shifted which /dev/video# my device was attached to.
Please do ls /dev/vid* to find out if you're using the right video source!
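For example (v4l2-ctl comes from the v4l-utils package; the device names below are illustrative):
ls /dev/video*
# /dev/video0  /dev/video1
v4l2-ctl --list-devices    # maps each /dev/video node to a physical camera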
