MPI hello_world to test infiniband - linux

I have a virtual machine with a passthrough InfiniBand NIC. I am testing InfiniBand functionality using a hello world program. I am new to this area, so I need help understanding the following errors.
I installed Open MPI on Ubuntu using apt-get:
spatel@ib-1:~$ mpirun -V
mpirun (Open MPI) 4.0.3
InfiniBand NIC:
spatel@ib-1:~$ lspci -nn | grep -i mell
00:05.0 Infiniband controller [0207]: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function] [15b3:101c]
My hello world program:
spatel@ib-1:~$ mpirun -np 2 ./mpi_hello_world
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:
Local host: ib-1
Device name: mlx5_0
Device vendor ID: 0x02c9
Device vendor part ID: 4124
Default device parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: ib-1
Local adapter: mlx5_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: ib-1
Local device: mlx5_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.
Local host: ib-1
Location: mtl_ofi_component.c:629
Error: Unspecified error (256)
--------------------------------------------------------------------------
Hello world from processor ib-1, rank 0 out of 2 processors
Hello world from processor ib-1, rank 1 out of 2 processors
[ib-1:65704] 1 more process has sent help message help-mpi-btl-openib.txt / no device params found
[ib-1:65704] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[ib-1:65704] 1 more process has sent help message help-mpi-btl-openib.txt / ib port not selected
[ib-1:65704] 1 more process has sent help message help-mpi-btl-openib.txt / error in device init
[ib-1:65704] 1 more process has sent help message help-mtl-ofi.txt / OFI call fail
It prints a bunch of warnings and errors, so I am not sure what to make of it. Does it use the IB interface to run this job?
UPDATE
As suggested by @Gilles Gouaillardet in a comment, I compiled Open MPI with UCX, and now I see the following output from the hello_world program:
spatel@ib-1:~$ /home/spatel/ompi/bin/mpirun -np 2 ./hello_world_ucx --mca opal_common_ucx_opal_mem_hooks 1
--------------------------------------------------------------------------
PMIx was unable to find a usable compression library
on the system. We will therefore be unable to compress
large data streams. This may result in longer-than-normal
startup times and larger memory footprints. We will
continue, but strongly recommend installing zlib or
a comparable compression library for better user experience.
You can suppress this warning by adding "pcompress_base_silence_warning=1"
to your PMIx MCA default parameter file, or by adding
"PMIX_MCA_pcompress_base_silence_warning=1" to your environment.
--------------------------------------------------------------------------
Hello world from processor ib-1, rank 0 out of 2 processors
Hello world from processor ib-1, rank 1 out of 2 processors
Now, to test my InfiniBand network, I created a second, similar VM (ib-2) with an InfiniBand NIC to see hello_world use RDMA for communication.
/home/spatel/ompi/bin/mpirun --host ib-1,ib-2 -np 2 ./hello_world_ucx --mca opal_common_ucx_opal_mem_hooks 1
At the same time I ran tcpdump on the ibs5 interface, which is my InfiniBand NIC, but I saw no activity, and I noticed that the MPI messages still use the traditional NIC eth0 for communication. How do I make sure it uses only InfiniBand for MPI? (I don't have any IP address configured on the IB NIC.)
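One way to approach this, sketched under assumptions rather than taken from an answer: mpirun treats everything after the program name as arguments to the program, so the --mca option in the command above is probably never seen by mpirun. The sketch below only prints a command line that places the options first, selects the UCX PML with --mca pml ucx, and restricts UCX to mlx5_0:1 (the adapter and port named in the warnings) through the UCX_NET_DEVICES variable. The helper name build_ucx_mpirun is hypothetical. Note also that RDMA traffic bypasses the kernel network stack, so tcpdump on ibs5 would likely stay silent even when InfiniBand is in use.

```shell
# Sketch: build (and only print) an mpirun command line that pins Open MPI to UCX.
# Assumptions: device mlx5_0, port 1, as reported in the warnings above.
# Arguments: $1 = mpirun binary, $2 = host list, $3 = process count, $4 = program.
build_ucx_mpirun() {
    # mpirun options must precede the program name; anything after the
    # program name is handed to the program itself.
    printf '%s --host %s -np %s --mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 %s\n' \
        "$1" "$2" "$3" "$4"
}

build_ucx_mpirun /home/spatel/ompi/bin/mpirun ib-1,ib-2 2 ./hello_world_ucx
```

With the UCX PML requested explicitly, the job should abort instead of quietly falling back to TCP if UCX cannot be used, which makes the check more meaningful. Running ucx_info -d on each host should list the devices UCX can see.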

Running OpenOCD fails with jtagRocketConfig

This is what I get when I try to connect Software RTL simulation and OpenOCD:
xPack OpenOCD, x86_64 Open On-Chip Debugger 0.10.0+dev-00068-ge1e63ef30 (2020-03-16-05:57)
Licensed under GNU GPL v2
For bug reports, read
http://openocd.org/doc/doxygen/bugs.html
Info : only one transport option; autoselect 'jtag'
Info : Initializing remote_bitbang driver
Info : Connecting to localhost:38000
Info : remote_bitbang driver initialized
Info : This adapter doesn't support configurable speed
Error: fflush: Broken pipe
Error: read: count=-1, error=Broken pipe
Error: Trying to use configured scan chain anyway...
Error: fflush: Broken pipe
Error: fflush: Broken pipe
Warn : Bypassing JTAG setup events due to errors
Error: fflush: Broken pipe
Error: fflush: Broken pipe
Error: failed jtag scan: -4
Error: Unsupported DTM version: 12
Info : Listening on port 3333 for gdb connections
Error: Target not examined yet
Error: Unsupported DTM version: 12
Error: fflush: Broken pipe
Error: failed: -4
I followed the instructions in 8.2.2.1, Creating a DTM+JTAG Config.
The following is the content of the OpenOCD config file:
interface remote_bitbang
remote_bitbang_host localhost
remote_bitbang_port 38000
set _CHIPNAME riscv
jtag newtap $_CHIPNAME cpu -irlen 5
set _TARGETNAME $_CHIPNAME.cpu
target create $_TARGETNAME riscv -chain-position $_TARGETNAME
gdb_report_data_abort enable
init
halt
And this is the command I use to run the simulation:
./simulator-chipyard-jtagRocketConfig +jtag_rbb_enable=1 --rbb-port=38000 <TESTNAME>
I managed to run OpenOCD.
This is the current content of OpenOCD config file:
adapter_khz 10000
interface remote_bitbang
remote_bitbang_host $::env(REMOTE_BITBANG_HOST)
remote_bitbang_port $::env(REMOTE_BITBANG_PORT)
set _CHIPNAME riscv
jtag newtap $_CHIPNAME cpu -irlen 5 -expected-id 0x10e31913
set _TARGETNAME $_CHIPNAME.cpu
target create $_TARGETNAME riscv -chain-position $_TARGETNAME
$_TARGETNAME configure -work-area-phys $::env(WORK_AREA) -work-area-size 8096 -work-area-backup 1
gdb_report_data_abort enable
gdb_report_register_access_error enable
# Expose an unimplemented CSR so we can test non-existent register access
# behavior.
riscv expose_csrs 2288
riscv expose_custom 1,12345-12348
init
set challenge [riscv authdata_read]
riscv authdata_write [expr $challenge + 1]
halt
And this is how I run OpenOCD:
REMOTE_BITBANG_HOST=localhost REMOTE_BITBANG_PORT=38000 WORK_AREA=0x1212340000 openocd --command 'gdb_port 0' --command 'tcl_port disabled' --command 'telnet_port disabled' -f openocd.cfg
I also installed Verilator (v4.028).

Will setting the Linux locked-memory limit to unlimited have an adverse effect?

I am running an MPI job on a Linux server. I got this error:
--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to
allocate some locked memory. This typically can indicate that the
memlock limits are set too low. For most HPC installations, the
memlock limits should be set to "unlimited". The failure occured
here:
Local host: yw0431
OMPI source: ../../../../../ompi/mca/btl/openib/btl_openib_component.c:1216
Function: ompi_free_list_init_ex_new()
Device: mlx4_0
Memlock limit: 65536
You may need to consult with your system administrator to get this
problem fixed. This FAQ entry on the Open MPI web site may also be
helpful:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: yw0431
Local device: mlx4_0
--------------------------------------------------------------------------
[yw0431:20193] 11 more processes have sent help message help-mpi-btl-openib.txt / init-fail-no-mem
[yw0431:20193] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[yw0431:20193] 11 more processes have sent help message help-mpi-btl-openib.txt / error in device init
forrtl: error (78): process killed (SIGTERM)
This means that the locked-memory limit on my Linux server is only 65536 bytes (64 KiB), but my job needs more. I think 2 GB should be enough.
I found a solution that removes the limit:
ulimit -l unlimited
But I am worried that this will crash the system or cause other problems.
So, is it safe to set "ulimit -l unlimited"?
If you set the limit to unlimited and your process starts using memory exhaustively, the OOM killer will kill your job to keep the system stable. I would set the limit to 80 to 90% of RAM instead of unlimited.
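Before changing anything, it can help to see what the limits currently are. A minimal sketch, assuming the "Memlock limit" in the error message is reported in bytes while ulimit -l reports KiB. The function names are hypothetical, and the limits.conf lines in the trailing comment are the commonly used form, to be adapted by whoever administers the machine.

```shell
# Convert a byte count (such as the value in the error message) to KiB.
bytes_to_kib() {
    echo $(( $1 / 1024 ))
}

# Print the soft and hard locked-memory limits of the current shell.
# ulimit -l reports KiB, or the word "unlimited".
show_memlock() {
    printf 'soft=%s hard=%s\n' "$(ulimit -S -l)" "$(ulimit -H -l)"
}

bytes_to_kib 65536
show_memlock

# Assumed persistent setting (needs root and a fresh login session),
# placed in /etc/security/limits.conf:
#   * soft memlock unlimited
#   * hard memlock unlimited
```

A specific value in KiB can be used in place of "unlimited" if a cap, as suggested above, is preferred.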

Raspberry Pi: IR LED works, but irsend does not transmit any IR code

I installed the current lirc package (0.9.0~pre1-1.2) on Raspbian jessie (no PIXEL, everything updated and upgraded) and connected the following to the (lirc default) GPIO ports:
to gpio port 17 - an IR LED via transistor etc
to gpio port 18 - an IR receiver module
The receiver part works perfectly:
mode2 command receiving raw data from transmitter
the IR code recognition of previously recorded keys works
However, the IR LED works only while lirc is not involved:
a shell script can switch the IR LED on and off with no problem
The only thing that doesn't work:
irsend does not make the IR transmitter emit anything, however no error message is shown
So the hardware, especially the IR LED is definitely working, while lirc cannot make the LED emit any configured IR code.
Please note that this seems to be a duplicate of
stackoverflow: irsend is not giving errors, but does not send signal on Raspbian
Unfortunately it is not. The "solution" provided there was placing the data for /etc/modules into the file /etc/modules-load.d/lirc_rpi.conf. I tried that as well, but it makes no difference.
Any help is greatly appreciated!
Configuration data follows - if any other data is required, I'd be happy to add it! TIA!
System and lirc Configuration
Extract from: /boot/config.txt
dtoverlay=lirc-rpi,gpio_in_pin=18,gpio_out_pin=17,debug=on
Extract of: /etc/modules
lirc_dev
lirc_rpi gpio_in_pin=18 gpio_out_pin=17
(Not sure if that is necessary at all; it makes no difference if this is not configured. Any hint appreciated.)
All active entries in: /etc/lirc/hardware.conf
LIRCD_ARGS="--uinput"
DRIVER="default"
DEVICE="/dev/lirc0"
MODULES="lirc_rpi"
LIRCD_CONF=""
LIRCMD_CONF=""
Some system output
1) The driver is loaded. Output, right after boot, of: dmesg | grep lirc
lirc_dev: IR Remote Control driver registered, major 245
lirc_rpi: module is from the staging directory, the quality is unknown, you have been warned.
lirc_rpi: to_irq 178
lirc_rpi: auto-detected active low receiver on GPIO pin 18
lirc_rpi lirc_rpi: lirc_dev: driver lirc_rpi registered at minor = 0
lirc_rpi: driver registered!
input: lircd as /devices/virtual/input/input0
lirc_rpi: Interrupt 178 obtained
2) the service is started and running, output of: systemctl status lirc
● lirc.service - LSB: Starts LIRC daemon.
Loaded: loaded (/etc/init.d/lirc)
Active: active (running) since Mo 2017-06-12 20:04:03 CEST; 2h 58min ago
Process: 377 ExecStart=/etc/init.d/lirc start (code=exited, status=0/SUCCESS)
CGroup: /system.slice/lirc.service
└─437 /usr/sbin/lircd --driver=default --device=/dev/lirc0 --uinput
3) the modules are loaded, output of: lsmod | grep Module;lsmod | grep lirc
Module Size Used by
lirc_rpi 8453 3
lirc_dev 10211 1 lirc_rpi
rc_core 23776 1 lirc_dev
I followed the troubleshooting steps in the (outdated) manual at http://aron.ws/projects/lirc_rpi/
to get some more information.
Output of: cat /sys/kernel/debug/gpio
gpiochip0: GPIOs 0-53, parent: platform/20200000.gpio, pinctrl-bcm2835:
gpio-35 ( |? ) in hi
gpio-47 ( |? ) out lo
I have seen that output also in this case:
raspberrypi.stackexchange: LIRC won't transmit (irsend: hardware does not support sending)
That user is as puzzled by this output as I am. Can somebody please explain why gpio-35 and gpio-47 are listed here? Shouldn't it be gpio-17 and gpio-18?
Output of: cat /proc/interrupts | grep lirc
178: 875 pinctrl-bcm2835 18 Edge lirc_rpi
This matches the dmesg output on having obtained interrupt 178
The only other dmesg output related to lircd, whatever the action, is the following, repeated (most likely because the debug option is set):
lirc_rpi: SET_SEND_CARRIER
lirc_rpi: in init_timing_params, freq=38000 pulse=13157, space=13158
lirc_rpi: SET_SEND_DUTY_CYCLE
lirc_rpi: in init_timing_params, freq=38000 pulse=13157, space=13158
After some time I resumed testing in order to build a test copy of the circuit, and the problem occurred again. Now, after several more months of testing, after asking lots of people for help (no one could), and even after buying and building a cheap mini USB oscilloscope kit to examine the hardware further, I finally found the solution.
Long story short: everything in the configuration was correct, and all of the attached hardware was fine. The problem was the testing script - see my remark above:
"a shell script can switch the IR LED on and off with no problem"
Since I did not include the script in the description above, nobody but me could have found the solution.
The script uses the pseudo files in /sys/class/gpio, see an example here:
raspberry-projects.com: IO pin control from the command line
At the end of the script, a command writes to /sys/class/gpio/unexport for cleanup purposes, and this step seems to reset the GPIO port so that it always ends up configured for input. As a result, LIRC is no longer able to control this GPIO port: it seems to configure the port for output only during system boot, and afterwards always expects the port to be in that state.
I tracked the problem down to this point using the gpio utility from the wiringpi package (install with sudo apt-get install wiringpi), executing gpio readall and checking for differences.
On the occasion when everything suddenly worked again, I may simply have forgotten to run my test script before testing LIRC, which I otherwise always did.
Luckily, the problem with the port configuration can easily be fixed without rebooting the system. Again I use the gpio utility to set the port back to output, where in the example below
the default output port 17 for LIRC is used and
the parameter -g lets the utility use the ordinary GPIO port numbering and not that very different one of the wiringpi package and library
Here is the command. After running it as the last step of my test script, LIRC can properly send IR codes again:
gpio -g mode 17 out
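The fix can be folded into the test script itself. The following is only a sketch of such a script, not the one actually used: the function name blink_and_restore and the GPIO_ROOT and PIN variables are hypothetical, the sysfs steps assume the usual export, direction, value, unexport sequence, and the final gpio call is the command shown above.

```shell
# Hypothetical test script: blink the IR LED through sysfs, then hand the
# pin back to LIRC in output mode.
GPIO_ROOT=${GPIO_ROOT:-/sys/class/gpio}   # overridable so the logic can be dry-run
PIN=${PIN:-17}                            # LIRC default output pin (BCM numbering)

blink_and_restore() {
    echo "$PIN" > "$GPIO_ROOT/export"
    echo out > "$GPIO_ROOT/gpio$PIN/direction"
    echo 1 > "$GPIO_ROOT/gpio$PIN/value"      # LED on
    sleep 1
    echo 0 > "$GPIO_ROOT/gpio$PIN/value"      # LED off
    echo "$PIN" > "$GPIO_ROOT/unexport"       # cleanup; leaves the pin as an input

    # Set the pin back to output so lirc_rpi can drive it again.
    if command -v gpio >/dev/null 2>&1; then
        gpio -g mode "$PIN" out
    else
        echo "gpio utility not found; install wiringpi, then run: gpio -g mode $PIN out" >&2
    fi
}
```

Calling blink_and_restore as root on the Pi runs the whole sequence, so the pin is never left in input mode when irsend is tried afterwards.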

Failure running Overtone and SuperCollider

I can't get Overtone to work with the SuperCollider server. I'm following the getting started guide at https://github.com/overtone/overtone/wiki/Getting-Started. I have the JACK audio server running through qjackctl, and I started SuperCollider with scsynth -u 8888, which produced the following output:
Found 12 LADSPA plugins
JackDriver: client name is 'SuperCollider'
SC_AudioDriver: sample rate = 48000.000000, driver's block size = 1024
SuperCollider 3 server ready.
Zeroconf: registered service 'SuperCollider'
Then, in the Clojure REPL, I connect to the SC server:
(connect-external-server 8888)
When I then run (definst foo [] (saw 220))
I get the following error:
CompilerException java.util.concurrent.TimeoutException: deref! timeout
error. Dereference took longer than 5000 ms whilst blocking until the
following node has completed loading: #<synth-group[loading]: Inst foo
Container 41>, compiling:(form-init1483192646581877285.clj:131:7)
and scsynth outputs FAILURE IN SERVER /g_new Group 31 not found
Also, if I try (demo (sin-osc)) I get the error FAILURE IN SERVER /s_new Group 7 not found
However, if I run the following using sclang:
s.boot;
{ SinOsc.ar(440, 0, 0.2) }.play;
it does produce a sound.
I'm running Manjaro Linux using the Linux 4.9.27 real time Manjaro kernel
and an HDA Intel PCH sound card.

Why does i2cdetect always show UU for my RTC in embedded Linux?

I'd like to read from my RTC in C code rather than with the "hwclock" shell command.
However, when I use i2cdetect, it shows 0x68 (my RTC's slave address) with the status "UU", which means "Probing was skipped, because this address is currently in use by a driver". And when I tried i2cget, it gave "Could not set address to 0x68: Device or resource busy".
So I wonder whether something in my Linux kernel is forcing reads from my RTC all the time, or whether there is some other reason.
Thanks
I am assuming that you are using a DS1307 RTC or one of its variants (because of the 0x68 slave address). Check whether its driver is loaded:
$ lsmod | grep rtc
If you see an entry for rtc_ds1307 (like this -> rtc_ds1307 17394 0) in the output of the above command, then this driver is probably holding that address.
If the driver is loaded, unload it using
$ rmmod rtc-ds1307
EDIT:
(In light of the OP's feedback:) Please do the following.
1) cat /sys/bus/i2c/devices/3-0068/modalias. This will give you the alias of the kernel driver that is keeping this device busy. Copy the name after the colon (:).
The OP's output of the command tells us that it is ds1337.
2) Check whether ds1337 is an alias for a driver, using
grep ds1337 /lib/modules/`uname -r`/modules.alias
Hopefully you will get the following output
alias i2c:ds1337 rtc_ds1307
This confirms our presumption that rtc_ds1307 is in fact the driver holding the I2C address 0x68.
3) Use rmmod rtc_ds1307 to unload the driver.
Note: This will only work if the driver is a loadable kernel module; otherwise you will see the following error:
ERROR: Module rtc_ds1307 does not exist in /proc/modules
In that case you will have to recompile the kernel with that driver disabled or built as a module.
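Steps 1 and 2 can be combined into two small helpers. This is a sketch only: the function names i2c_alias_name and module_for_alias are hypothetical, and the file paths are passed in as arguments so the lookup can also be tried on sample files away from the target.

```shell
# Print the part of a modalias file after the colon, e.g. "ds1337" for "i2c:ds1337".
# $1 = modalias file
i2c_alias_name() {
    sed 's/^[^:]*://' "$1"
}

# Print the module that a given I2C alias maps to.
# $1 = alias name (e.g. ds1337), $2 = modules.alias file
module_for_alias() {
    awk -v a="i2c:$1" '$1 == "alias" && $2 == a { print $3 }' "$2"
}

# Example on the target, using the paths from the steps above:
#   name=$(i2c_alias_name /sys/bus/i2c/devices/3-0068/modalias)
#   module_for_alias "$name" "/lib/modules/$(uname -r)/modules.alias"
```

If the second helper prints nothing, the alias has no entry in modules.alias, which may mean the driver is built into the kernel rather than available as a module.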
0x68 is being used by some driver.
Disable that driver in the kernel source and recompile the kernel.
